###### Top

# Mapping Tools to Authors The Right Way - I
### Tanya's Authored Tool List

##### Author: Praveen Saxena
##### Email: saxep01@gmail.com
##### Create Date: 10/27/2021
##### Purpose: Find the reason for the discrepancy between Tanya's tools listed on Salesforce and on nanoHUB
##### Reference: https://trello.com/c/Rx9pOPik

![Tanya's Tool Authorships on Salesforce](static/tanya-salesforce-tool-authorships.png "Tanya's Tool Authorships on Salesforce")

![Tanya's Tool Authorships on nanoHUB.org](static/tanya-nanohub-tool-authorships.jpg "Tanya's Tool Authorships on nanoHUB.org")

## 1. Preliminaries
[Scroll to top](#Top)

In [1]:
import pandas as pd
import os
import time
import datetime
from pathlib import Path
from IPython.display import display, Markdown

In [2]:
from nanoHUB.application import Application
from nanoHUB.repositories import CachedRepository, PandasRepository, ContactsRepository, ToolsRepository
from nanoHUB.pandas import get_rows_by_keyvalue, display_number_of_rows

application = Application.get_instance()
nanohub_db = application.new_db_engine('nanohub')

cache_folder = Path(os.getenv('APP_DIR'), '.cache/tool_authorships/tanya')

nanoHUB - Serving Students, Researchers & Instructors


## 2. Variables  
[Scroll to top](#Top)

In [3]:
tanya_uid = 29294

## 3. Analysis of current code 
#### Current Query - Obtain tool info with authors
[Scroll to top](#Top)

In [4]:
display(Markdown("#### Let's grab information for all tools along with their author ids"))

naive_query = '''
SELECT * from jos_tool_authors;
'''
naive_df = pd.read_sql_query(naive_query, nanohub_db)

display(naive_df.head())
display(naive_df.info())

display(Markdown("**A total of %d entries pulled in for tools**" % len(naive_df.index)))

#### Let's grab information for all tools along with their author ids

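| | toolname | revision | uid | ordering | version_id | name | organization |
|---|---|---|---|---|---|---|---|
| 0 | adept | 16 | 4645 | 1 | 2 | | |
| 1 | adept | 16 | 3013 | 2 | 2 | | |
| 2 | greentherm | 31 | 12642 | 1 | 4 | | |
| 3 | greentherm | 31 | 15601 | 2 | 4 | | |
| 4 | greentherm | 31 | 21031 | 3 | 4 | | |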


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11849 entries, 0 to 11848
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   toolname      11849 non-null  object
 1   revision      11849 non-null  int64 
 2   uid           11849 non-null  int64 
 3   ordering      11849 non-null  int64 
 4   version_id    11849 non-null  int64 
 5   name          8520 non-null   object
 6   organization  8519 non-null   object
dtypes: int64(4), object(3)
memory usage: 648.1+ KB


None

**A total of 11849 entries pulled in for tools**

## 4. Problem 
#### To identify the problem, use Tanya's Results from Current Code
[Scroll to top](#Top)

In [5]:
display(get_rows_by_keyvalue(naive_df, 'uid', tanya_uid))

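| | toolname | revision | uid | ordering | version_id | name | organization |
|---|---|---|---|---|---|---|---|
| 5838 | mosfetsat | 66 | 29294 | 2 | 2755 | Tanya Faltens | Purdue University |
| 8210 | mif | 6 | 29294 | 2 | 3561 | Tanya Faltens | Purdue University |
| 8732 | crystal_viewer | 394 | 29294 | 4 | 3714 | Tanya Faltens | Purdue University |
| 8921 | crystal_viewer | 395 | 29294 | 4 | 3775 | Tanya Faltens | Purdue University |
| 9777 | tinwhis | 14 | 29294 | 2 | 4284 | Tanya Faltens | |
| 10899 | engdata | 4 | 29294 | 1 | 4954 | Tanya Faltens | |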
10899,engdata,4,29294,1,4954,Tanya Faltens,


Some of Tanya's entries in the result above are not published tools, so they need to be filtered out.

However, the current code reads only from the table **_nanohub.jos_tool_authors_**, which does not contain enough information to count a user's tools correctly. We need more information about each tool, such as:

1. Is the tool a version of an existing tool?
2. Is the tool in development or is it a published final product in use by other users?
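
The version problem is already visible in Tanya's rows above: `crystal_viewer` appears once per revision. A minimal sketch, using a hand-copied subset of those rows rather than a live query, of how counting rows differs from counting distinct tools:

```python
import pandas as pd

# Hand-copied from Tanya's rows in jos_tool_authors shown above
# (only the columns needed for the illustration).
rows = pd.DataFrame(
    {
        "toolname": ["mosfetsat", "mif", "crystal_viewer", "crystal_viewer", "tinwhis", "engdata"],
        "revision": [66, 6, 394, 395, 14, 4],
        "uid": [29294] * 6,
    }
)

row_count = len(rows)                        # one row per (tool, revision, author)
distinct_tools = rows["toolname"].nunique()  # collapses revisions of the same tool

print(row_count, distinct_tools)
```

Deduplicating by `toolname` addresses only the first question. It says nothing about whether a tool is published, which this table cannot answer on its own.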

## 5. Potential Solution
#### New Query - Obtain tool info including publication status and authors
[Scroll to top](#Top)

In [6]:
sql_string = '''
SELECT DISTINCT 
       tool.toolname AS toolname, tool.title AS title,
       author.authorid, author.name,
       res.published, res.type
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
;
'''

potential_repo = CachedRepository(
    PandasRepository(sql_string, nanohub_db, 'resources_data'),
    cache_folder
)
potential_df = potential_repo.get_all()

display(potential_df.head())
display(potential_df.info())

display_number_of_rows(potential_df)

| | toolname | title | authorid | name | published | type |
|---|---|---|---|---|---|---|
| 0 | hydrolab | Hydrophobicity Lab | 4713.0 | Eric Darve | 1 | 7 |
| 1 | hydrolab | Hydrophobicity Lab | 12486.0 | Artit Wangperawong | 1 | 7 |
| 2 | hydrolab | Hydrophobicity Lab | 12590.0 | Kazutora Hayashida | 1 | 7 |
| 3 | huckel | Huckel-IV | | | 0 | 7 |
| 4 | nanomos | NanoMOS | -39.0 | | 1 | 7 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8154 entries, 0 to 8153
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   toolname   2918 non-null   object 
 1   title      2918 non-null   object 
 2   authorid   7936 non-null   float64
 3   name       7936 non-null   object 
 4   published  8154 non-null   int64  
 5   type       8154 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 382.3+ KB


None

**A total of 8154 entries pulled in**

#### Validation - Tanya's Results from Potential Solution 

In [7]:
tanya_df = get_rows_by_keyvalue(potential_df, 'authorid', tanya_uid)

display(tanya_df)
display_number_of_rows(tanya_df)

| | toolname | title | authorid | name | published | type |
|---|---|---|---|---|---|---|
| 2275 | kicad | KiCad EDA | 29294.0 | Tanya Faltens | 2 | 7 |
| 2329 | | | 29294.0 | Tanya Faltens | 1 | 1 |
| 3153 | | | 29294.0 | Tanya Faltens | 1 | 39 |
| 3666 | githubpublic | Git Hub Public Tool | 29294.0 | Tanya Faltens | 2 | 7 |
| 3667 | githubprivate2 | GitHub Private Tool Jupyter NB second try with... | 29294.0 | Tanya Faltens | 2 | 7 |
| 3947 | gumby | gumby | 29294.0 | Tanya Faltens | 4 | 7 |
| 3966 | mosfetsat | MOSFET Simulation | 29294.0 | Tanya Faltens | 1 | 7 |
| 4296 | | | 29294.0 | Tanya Faltens | 1 | 10 |
| 4478 | tanyatest | tanya test tool | 29294.0 | Tanya Faltens | 2 | 7 |
| 4553 | mif | MIF generator for OOMMF | 29294.0 | Tanya Faltens | 1 | 7 |
4553,mif,MIF generator for OOMMF,29294.0,Tanya Faltens,1,7


**A total of 29 entries pulled in**

**NOTES:**

1. This is more data than expected for Tanya. Some entries in the potential solution are clearly not tools; the _type_ column appears to identify what kind of resource each entry is.
2. All entries contain an identifier for whether they are published or not.


Let's filter the results from the potential solution by:
1. Publication status
2. Type
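
Before changing the SQL, the same two conditions can be previewed in pandas. This is a sketch on the ten rows displayed for Tanya above (the full result has 29), and it assumes that `published == 1` means "published" and `type == 7` means "tool":

```python
import pandas as pd

# Hand-copied from the ten rows displayed for Tanya above.
tanya_rows = pd.DataFrame(
    {
        "toolname": [
            "kicad", None, None, "githubpublic", "githubprivate2",
            "gumby", "mosfetsat", None, "tanyatest", "mif",
        ],
        "published": [2, 1, 1, 2, 2, 4, 1, 1, 2, 1],
        "type": [7, 1, 39, 7, 7, 7, 7, 10, 7, 7],
    }
)

# Assumption: published == 1 marks a published resource and type == 7 marks a tool.
is_published_tool = (tanya_rows["published"] == 1) & (tanya_rows["type"] == 7)
published_tools = tanya_rows.loc[is_published_tool, "toolname"].tolist()

print(published_tools)
```

Filtering in SQL is still preferable for the real query, since it avoids pulling thousands of unneeded rows from the database.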

#### Potential Solution - Filtered Results

#### Query

In [8]:
display(Markdown("**Filtered Results**"))

sql_string = '''
SELECT DISTINCT 
       tool.toolname AS toolname, tool.title AS title,
       author.authorid, author.name,
       res.published, res.type
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

**Filtered Results**

#### Results

In [9]:
filtered_repo = CachedRepository(
    PandasRepository(sql_string, nanohub_db, 'resources_data_filtered'),
    cache_folder
)
filtered_df = filtered_repo.get_all()

display(filtered_df.head())
display(filtered_df.info())

display_number_of_rows(filtered_df)

| | toolname | title | authorid | name | published | type |
|---|---|---|---|---|---|---|
| 0 | hydrolab | Hydrophobicity Lab | 4713.0 | Eric Darve | 1 | 7 |
| 1 | hydrolab | Hydrophobicity Lab | 12486.0 | Artit Wangperawong | 1 | 7 |
| 2 | hydrolab | Hydrophobicity Lab | 12590.0 | Kazutora Hayashida | 1 | 7 |
| 3 | nanomos | NanoMOS | -39.0 | | 1 | 7 |
| 4 | nanomos | NanoMOS | 4323.0 | | 1 | 7 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1878 entries, 0 to 1877
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   toolname   1401 non-null   object 
 1   title      1401 non-null   object 
 2   authorid   1874 non-null   float64
 3   name       1874 non-null   object 
 4   published  1878 non-null   int64  
 5   type       1878 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 88.2+ KB


None

**A total of 1878 entries pulled in**

Okay, much better. We have now pulled in only published tools. Let's check for Tanya next.

#### Final Results

In [10]:
tanya_df = get_rows_by_keyvalue(filtered_df, 'authorid', tanya_uid)

display(tanya_df)
display_number_of_rows(tanya_df)

| | toolname | title | authorid | name | published | type |
|---|---|---|---|---|---|---|
| 730 | mosfetsat | MOSFET Simulation | 29294.0 | Tanya Faltens | 1 | 7 |
| 840 | mif | MIF generator for OOMMF | 29294.0 | Tanya Faltens | 1 | 7 |


**A total of 2 entries pulled in**

## 6. Conclusion
#### Great Success! 

We have matched Tanya's tool authorships from nanoHUB.org with [results](#Results) from our new [solution](#Query). 

![Tanya's Tool Authorships on nanoHUB.org](static/tanya-nanohub-tool-authorships.jpg "Tanya's Tool Authorships on nanoHUB.org")
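
To make this comparison repeatable for other authors, the match could be checked in code rather than by eye. A minimal sketch follows; `compare_tool_sets` is a hypothetical helper, and `profile_tools` is assumed to be transcribed by hand from the nanoHUB.org profile page:

```python
def compare_tool_sets(expected, actual):
    """Return (missing, unexpected) tool names. Hypothetical helper, not part of nanoHUB."""
    expected_set, actual_set = set(expected), set(actual)
    missing = sorted(expected_set - actual_set)      # on the profile, absent from the query
    unexpected = sorted(actual_set - expected_set)   # in the query, absent from the profile
    return missing, unexpected


# Tool names from the final filtered result for Tanya.
query_tools = ["mosfetsat", "mif"]
# Assumption: transcribed by hand from the nanoHUB.org profile page.
profile_tools = ["mif", "mosfetsat"]

missing, unexpected = compare_tool_sets(profile_tools, query_tools)
print(missing, unexpected)
```

Two empty lists mean the query and the profile agree; any name in either list points to the entry that needs investigating.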


Next up: Analysis for Stephen Goodnick.

Side Note:

In the course of this analysis, I had to pull data from db2 multiple times. This specific query took a long time and, for all practical purposes, returned the same results each time.

This pattern comes up frequently and becomes repetitive, wasting analyst/developer time and putting more pressure on db2. So I created a cacheable repository that stores the results locally and reuses them automatically. By default, it refreshes the results from db2 or Salesforce once the stored copy is an hour old. That interval can be changed to 1 second, 1 minute, 1 day, or any other value as desired.

Cacheable Repository:
1. Works for db2 and Salesforce.
2. Has a configurable time limit for caching data.
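
The idea behind time-limited caching can be sketched as follows. This is not the `CachedRepository` implementation used in this notebook; `TTLCachedLoader`, its parameters, and the pickle file format are all assumptions made for illustration:

```python
import tempfile
import time
from pathlib import Path
from typing import Callable

import pandas as pd


class TTLCachedLoader:
    """Hypothetical stand-in for a cacheable repository; not the nanoHUB class."""

    def __init__(self, loader: Callable[[], pd.DataFrame], cache_file: Path, ttl_seconds: float = 3600):
        self.loader = loader              # function that runs the expensive query
        self.cache_file = Path(cache_file)
        self.ttl_seconds = ttl_seconds    # how long a stored result stays valid

    def _is_fresh(self) -> bool:
        # A cache file counts as fresh while its age is below the time limit.
        if not self.cache_file.exists():
            return False
        age = time.time() - self.cache_file.stat().st_mtime
        return age < self.ttl_seconds

    def get_all(self) -> pd.DataFrame:
        if self._is_fresh():
            return pd.read_pickle(self.cache_file)
        df = self.loader()
        self.cache_file.parent.mkdir(parents=True, exist_ok=True)
        df.to_pickle(self.cache_file)
        return df


# Usage: the fake query stands in for a slow database call.
calls = {"count": 0}


def fake_query() -> pd.DataFrame:
    calls["count"] += 1
    return pd.DataFrame({"toolname": ["mosfetsat", "mif"]})


with tempfile.TemporaryDirectory() as tmp:
    repo = TTLCachedLoader(fake_query, Path(tmp) / "tools.pkl", ttl_seconds=3600)
    first = repo.get_all()   # runs the query and writes the cache file
    second = repo.get_all()  # served from the cache file

print(calls["count"])
```

Using the file's modification time as the timestamp keeps the sketch free of extra metadata; a real implementation might also key the cache on the query text so that a changed query is never served stale results.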

[Click Here for Reference](#Potential-Solution)

[Scroll to top](#Top)