###### Top

# Mapping Tools to Authors The Right Way - II
### Validation of Proposed New Solution from Part I Using Stephen's Authored Tool List

##### Author: Praveen Saxena
##### Email: saxep01@gmail.com
##### Create Date: 10/28/2021
##### Purpose: Validate the new solution using Stephen's listed tools on Salesforce vs. nanoHUB, or find another solution.
##### Reference: https://trello.com/c/Rx9pOPik

![Stephen's Tool Authorships on Salesforce](static/stephen-salesforce-tool-authorships.jpg "Stephen's Tool Authorships on Salesforce")

![Stephen's Tool Authorships on nanoHUB.org](static/stephen-nanohub-tool-authorships.jpg "Stephen's Tool Authorships on nanoHUB.org")

## 1. Preliminaries
[Scroll to top](#Top)

In [1]:
import pandas as pd
import os
import time
import datetime
from pathlib import Path
from IPython.display import display, Markdown

In [2]:
from nanoHUB.application import Application
from nanoHUB.repositories import CachedRepository, PandasRepository, ContactsRepository, ToolsRepository
from nanoHUB.pandas import get_rows_by_keyvalue, display_number_of_rows

application = Application.get_instance()
nanohub_db = application.new_db_engine('nanohub')

cache_folder = Path(os.getenv('APP_DIR'), '.cache/tool_authorships/stephen')

nanoHUB - Serving Students, Researchers & Instructors


## 2. Variables  
[Scroll to top](#Top)

In [3]:
stephen_uid = 29476
tanya_uid = 29294

## 3. Analysis
[Scroll to top](#Top)

The existing code returns results for Stephen that exactly match what nanoHUB.org shows for him. Let's use the solution proposed in Part I to pull tool data for Stephen and see how well it works.

In [4]:
display(Markdown("#### Grab tool data using proposed solution from part I"))

proposed_sql_string = '''
SELECT DISTINCT 
       tool.toolname AS toolname, tool.title AS title,
       author.authorid, author.name,
       res.published, res.type
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

proposed_repo = CachedRepository(
    PandasRepository(proposed_sql_string, nanohub_db, 'resources_data_filtered'),
    cache_folder
)
proposed_df = proposed_repo.get_all()

display(proposed_df.head())
display(proposed_df.info())

display_number_of_rows(proposed_df)

#### Grab tool data using proposed solution from part I

| Unnamed: 0 | toolname | title | authorid | name | published | type |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | hydrolab | Hydrophobicity Lab | 4713.0 | Eric Darve | 1 | 7 |
| 1 | hydrolab | Hydrophobicity Lab | 12486.0 | Artit Wangperawong | 1 | 7 |
| 2 | hydrolab | Hydrophobicity Lab | 12590.0 | Kazutora Hayashida | 1 | 7 |
| 3 | nanomos | NanoMOS | -39.0 |  | 1 | 7 |
| 4 | nanomos | NanoMOS | 4323.0 |  | 1 | 7 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1878 entries, 0 to 1877
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   toolname   1401 non-null   object 
 1   title      1401 non-null   object 
 2   authorid   1874 non-null   float64
 3   name       1874 non-null   object 
 4   published  1878 non-null   int64  
 5   type       1878 non-null   int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 88.2+ KB


None

**A total of 1878 entries pulled in**

In [5]:
display(Markdown("#### Let's filter the tool authorship list for Stephen:"))

display(get_rows_by_keyvalue(proposed_df, 'authorid', stephen_uid))

#### Let's filter the tool authorship list for Stephen:

| Unnamed: 0 | toolname | title | authorid | name | published | type |
| --- | --- | --- | --- | --- | --- | --- |
| 1462 |  |  | 29476.0 | Stephen M. Goodnick | 1 | 7 |


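Ouch! We only got one result for Stephen, and it has neither a _toolname_ nor a _title_. The `info()` output above hints at a wider problem: only 1401 of the 1878 rows have a non-null _toolname_, so 477 rows came back without a tool match.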

Is it possible that versions are playing a role in this? Let's look.

In [6]:
display(Markdown("#### Grab tool data using proposed solution from part I"))

modified_sql_string = '''
SELECT DISTINCT 
       tool.toolname, tool.title,
       author.authorid, author.name,
       res.published, res.type,
       version.instance
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
    ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
    ON LOWER(tool.title) = LOWER(res.title) 
LEFT JOIN nanohub.jos_tool_version version
    ON version.instance = tool.toolname
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versioned'),
    cache_folder
)
modified_df = modified_repo.get_all()

display(modified_df.head())
display(modified_df.info())

display_number_of_rows(modified_df)

#### Grab tool data using proposed solution from part I

| Unnamed: 0 | toolname | title | authorid | name | published | type | instance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | hydrolab | Hydrophobicity Lab | 4713.0 | Eric Darve | 1 | 7 | hydrolab |
| 1 | hydrolab | Hydrophobicity Lab | 12486.0 | Artit Wangperawong | 1 | 7 | hydrolab |
| 2 | hydrolab | Hydrophobicity Lab | 12590.0 | Kazutora Hayashida | 1 | 7 | hydrolab |
| 3 | nanomos | NanoMOS | -39.0 |  | 1 | 7 | nanomos |
| 4 | nanomos | NanoMOS | 4323.0 |  | 1 | 7 | nanomos |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1878 entries, 0 to 1877
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   toolname   1401 non-null   object 
 1   title      1401 non-null   object 
 2   authorid   1874 non-null   float64
 3   name       1874 non-null   object 
 4   published  1878 non-null   int64  
 5   type       1878 non-null   int64  
 6   instance   252 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 102.8+ KB


None

**A total of 1878 entries pulled in**

In [7]:
display(Markdown("#### And now, let's filter the tool authorship list for Stephen:"))

display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

#### And now, let's filter the tool authorship list for Stephen:

| Unnamed: 0 | toolname | title | authorid | name | published | type | instance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1462 |  |  | 29476.0 | Stephen M. Goodnick | 1 | 7 |  |


**Nope, no luck.**

Sanity Check - let's reconfirm the results for Tanya:

In [8]:
display(Markdown("#### Filtering the tool authorship list for Tanya:"))

display(get_rows_by_keyvalue(modified_df, 'authorid', tanya_uid))

#### Filtering the tool authorship list for Tanya:

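| Unnamed: 0 | toolname | title | authorid | name | published | type | instance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 730 | mosfetsat | MOSFET Simulation | 29294.0 | Tanya Faltens | 1 | 7 |  |
| 840 | mif | MIF generator for OOMMF | 29294.0 | Tanya Faltens | 1 | 7 |  |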


All's good for Tanya here.
Let's see which tools come up for Stephen without `DISTINCT`.

In [9]:
display(Markdown("#### Grab tool data for Stephen"))

modified_sql_string = '''
SELECT 
       tool.toolname, tool.title,
       author.authorid, author.name,
       res.published, res.type
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versioned_wo_distinct'),
    cache_folder
)
modified_df = modified_repo.get_all()


display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

#### Grab tool data for Stephen

| Unnamed: 0 | toolname | title | authorid | name | published | type |
| --- | --- | --- | --- | --- | --- | --- |
| 1483 |  |  | 29476.0 | Stephen M. Goodnick | 1 | 7 |
| 1514 |  |  | 29476.0 | Stephen M. Goodnick | 1 | 7 |


How can toolnames be _None_?
Let's get Stephen's toolnames directly from _nanohub.jos_resources_:

In [10]:
display(Markdown("#### Grab tool data for Stephen"))

modified_sql_string = '''
SELECT 
       author.authorid, author.name,
       res.title, res.published, res.type
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versioned_wo_distinct_w_direct'),
    cache_folder
)
modified_df = modified_repo.get_all()


display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

#### Grab tool data for Stephen

| Unnamed: 0 | authorid | name | title | published | type |
| --- | --- | --- | --- | --- | --- |
| 323 | 29476.0 | Stephen M. Goodnick | Bulk Monte Carlo Lab | 1 | 7 |
| 387 | 29476.0 | Stephen M. Goodnick | ACUTE | 1 | 7 |


And now from _nanohub.jos_tool_:

In [11]:
display(Markdown("#### Grab tool data for Stephen"))

modified_sql_string = '''
SELECT 
    tool.id, tool.toolname, tool.title, tool.revision, tool.version,
    authors.uid
FROM nanohub.jos_tool tool
LEFT JOIN nanohub.jos_tool_authors authors 
    ON authors.toolname  = tool.toolname
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'jos_tools_w_author_w_revision'),
    cache_folder
)
modified_df = modified_repo.get_all()


display(get_rows_by_keyvalue(modified_df, 'uid', stephen_uid))

#### Grab tool data for Stephen

| Unnamed: 0 | id | toolname | title | revision | version | uid |
| --- | --- | --- | --- | --- | --- | --- |
| 4120 | 213 | bulkmc | Full Tool NameBulk Monte Carlo Tool | 15.0 | 1.0 | 29476.0 |
| 4121 | 213 | bulkmc | Full Tool NameBulk Monte Carlo Tool | 15.0 | 1.0 | 29476.0 |
| 4125 | 213 | bulkmc | Full Tool NameBulk Monte Carlo Tool | 15.0 | 1.0 | 29476.0 |
| 4129 | 213 | bulkmc | Full Tool NameBulk Monte Carlo Tool | 15.0 | 1.0 | 29476.0 |
| 4844 | 259 | acute | Advanced Computational Electronics | 5.0 | 1.0 | 29476.0 |
| 4845 | 259 | acute | Advanced Computational Electronics | 5.0 | 1.0 | 29476.0 |
| 4849 | 259 | acute | Advanced Computational Electronics | 5.0 | 1.0 | 29476.0 |


1. So there are obvious duplicates here. No problem, we can deduplicate later.
2. We do have the two tools - bulkmc and acute. 
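
The deduplication mentioned in point 1 could look something like the sketch below. It is only an illustration: the toy frame copies a few columns from the output above, and the choice of `toolname` and `uid` as the key that defines a unique authorship is an assumption, not something the original analysis settles.

```python
import pandas as pd

# Toy rows mirroring the duplicated output above (values copied from that table).
rows = pd.DataFrame(
    {
        "id": [213, 213, 213, 213, 259, 259, 259],
        "toolname": ["bulkmc"] * 4 + ["acute"] * 3,
        "uid": [29476.0] * 7,
    }
)

# Assumption: one authorship is identified by the (toolname, uid) pair.
unique_authorships = (
    rows.drop_duplicates(subset=["toolname", "uid"])
        .reset_index(drop=True)
)
print(unique_authorships)
```

`drop_duplicates` keeps the first occurrence of each key by default, so the two surviving rows are one for _bulkmc_ and one for _acute_.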

Note:
_bulkmc's_ _title_ from _nanohub.jos_resources_ is **Bulk Monte Carlo Lab**.  
However, its title from _nanohub.jos_tool_ is **Full Tool NameBulk Monte Carlo Tool**.

Let's find out how many such tools with different titles exist.

In [12]:
display(Markdown("#### Mismatched titles in two tables"))

count_sql_string = '''
SELECT COUNT(*) AS `number of titles in jos_tool but not in jos_resources`
FROM nanohub.jos_tool tools
WHERE 
    LOWER(tools.title) NOT IN (SELECT LOWER(title) FROM jos_resources)
;
'''

count_repo = CachedRepository(
    PandasRepository(count_sql_string, nanohub_db, 'count_test_3'),
    cache_folder
)
count_df = count_repo.get_all()

display(count_df)

count_sql_string = '''
SELECT COUNT(*) AS `number of titles in jos_resources but not in jos_tool`
FROM nanohub.jos_resources res
WHERE 
    LOWER(res.title) NOT IN (SELECT LOWER(title) FROM jos_tool)
;
'''

count_repo = CachedRepository(
    PandasRepository(count_sql_string, nanohub_db, 'count_test_4'),
    cache_folder
)
count_df = count_repo.get_all()

display(count_df)

#### Mismatched titles in two tables

| Unnamed: 0 | number of titles in jos_tool but not in jos_resources |
| --- | --- |
| 0 | 389 |


| Unnamed: 0 | number of titles in jos_resources but not in jos_tool |
| --- | --- |
| 0 | 31175 |


So clearly, we cannot join on titles alone, as the proposed solution did. Note that the second query has no `res.type` filter, so its count covers every resource type, not just tools.
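
The same title check can be reproduced on a small scale in pandas. The sketch below is a toy illustration only: the two frames are hand-built stand-ins for _nanohub.jos_tool_ and _nanohub.jos_resources_, with titles copied from the outputs above, and the variable names are made up for this example.

```python
import pandas as pd

# Toy stand-ins for the two tables; titles copied from the outputs above.
tools = pd.DataFrame(
    {
        "toolname": ["bulkmc", "acute", "hydrolab"],
        "title": [
            "Full Tool NameBulk Monte Carlo Tool",
            "Advanced Computational Electronics",
            "Hydrophobicity Lab",
        ],
    }
)
resources = pd.DataFrame(
    {"title": ["Bulk Monte Carlo Lab", "ACUTE", "Hydrophobicity Lab"]}
)

# Normalize case the same way the SQL does with LOWER().
tools["title_key"] = tools["title"].str.lower()
resources["title_key"] = resources["title"].str.lower()

# indicator=True adds a _merge column saying which side each key was found on.
merged = tools.merge(
    resources[["title_key"]].drop_duplicates(),
    on="title_key",
    how="left",
    indicator=True,
)

# Tools whose title has no counterpart among the resource titles.
unmatched_tools = merged.loc[merged["_merge"] == "left_only", "toolname"].tolist()
print(unmatched_tools)
```

Here only _hydrolab_ finds a partner by title; _bulkmc_ and _acute_ are left unmatched, which is the situation Stephen's tools are in.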

Let's try using tool versions:

In [13]:
display(Markdown("#### Add versions"))

modified_sql_string = '''
SELECT 
       tool.toolname, tool.title AS title_from_jos_tools,
       author.authorid, author.name,
       res.title AS title_from_jos_resources, res.published, res.type,
       version.instance
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
LEFT JOIN nanohub.jos_tool_version version
  ON LOWER(tool.toolname) = LOWER(version.instance)
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versions-7'),
    cache_folder
)
modified_df = modified_repo.get_all()


display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

#### Add versions

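| Unnamed: 0 | toolname | title_from_jos_tools | authorid | name | title_from_jos_resources | published | type | instance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1483 |  |  | 29476.0 | Stephen M. Goodnick | Bulk Monte Carlo Lab | 1 | 7 |  |
| 1514 |  |  | 29476.0 | Stephen M. Goodnick | ACUTE | 1 | 7 |  |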


Nope, that didn't work: _instance_ is still empty. Adding versions did not help identify the _toolname_ from _nanohub.jos_tool_.

The LEFT JOIN on title finds no matching row in _nanohub.jos_tool_ because the two tables hold different titles for the same tool. Hence the empty _title_from_jos_tools_ and _toolname_. And since the version join keys on _toolname_, _instance_ comes back empty too.
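
To see the mechanism in isolation, here is a minimal sketch using an in-memory SQLite database in place of the real MySQL tables. The table layouts are cut down to two columns each and the resource ids are invented for the example; only the titles and toolnames come from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down stand-ins for the real tables; ids are made up for this example.
conn.executescript(
    """
    CREATE TABLE jos_resources (id INTEGER, title TEXT);
    CREATE TABLE jos_tool (toolname TEXT, title TEXT);
    INSERT INTO jos_resources VALUES
        (1, 'Bulk Monte Carlo Lab'),
        (2, 'Hydrophobicity Lab');
    INSERT INTO jos_tool VALUES
        ('bulkmc', 'Full Tool NameBulk Monte Carlo Tool'),
        ('hydrolab', 'Hydrophobicity Lab');
    """
)

# Same join condition as the notebook queries: match on lowercased title.
title_join_rows = conn.execute(
    """
    SELECT res.title, tool.toolname
    FROM jos_resources res
    LEFT JOIN jos_tool tool
        ON LOWER(tool.title) = LOWER(res.title)
    ORDER BY res.id
    """
).fetchall()

print(title_join_rows)
# [('Bulk Monte Carlo Lab', None), ('Hydrophobicity Lab', 'hydrolab')]
```

The LEFT JOIN keeps the resource row either way, but when no tool title matches, every column from the tool side is NULL, which pandas then shows as an empty cell.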

When looking at _nanohub.jos_resources_, I noticed a column _alias_. Let's try using that:

In [14]:
display(Markdown("#### Add alias"))

modified_sql_string = '''
SELECT 
       tool.toolname, tool.title AS title_from_jos_tools,
       author.authorid, author.name,
       res.title AS title_from_jos_resources, res.published, res.type, res.alias,
       version.instance
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
  ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
  ON LOWER(tool.title) = LOWER(res.title) 
LEFT JOIN nanohub.jos_tool_version version
  ON LOWER(tool.toolname) = LOWER(version.instance)
WHERE
    res.title != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versions-10'),
    cache_folder
)
modified_df = modified_repo.get_all()


display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

#### Add alias

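| Unnamed: 0 | toolname | title_from_jos_tools | authorid | name | title_from_jos_resources | published | type | alias | instance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1483 |  |  | 29476.0 | Stephen M. Goodnick | Bulk Monte Carlo Lab | 1 | 7 | bulkmc |  |
| 1514 |  |  | 29476.0 | Stephen M. Goodnick | ACUTE | 1 | 7 | acute |  |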


Aha! So _alias_ matches the _toolname_ for at least Stephen's tools. 
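
As a quick illustration of why this observation matters, the sketch below joins on _alias_ instead of _title_, again using an in-memory SQLite database as a stand-in for the real tables. The two-column table layouts are simplifications; the titles, aliases, and toolnames are copied from the outputs above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for the real tables, filled with Stephen's two tools.
conn.executescript(
    """
    CREATE TABLE jos_resources (title TEXT, alias TEXT);
    CREATE TABLE jos_tool (toolname TEXT, title TEXT);
    INSERT INTO jos_resources VALUES
        ('Bulk Monte Carlo Lab', 'bulkmc'),
        ('ACUTE', 'acute');
    INSERT INTO jos_tool VALUES
        ('bulkmc', 'Full Tool NameBulk Monte Carlo Tool'),
        ('acute', 'Advanced Computational Electronics');
    """
)

# Join on alias = toolname rather than on title.
alias_join_rows = conn.execute(
    """
    SELECT res.alias, tool.toolname, tool.title
    FROM jos_resources res
    LEFT JOIN jos_tool tool
        ON LOWER(tool.toolname) = LOWER(res.alias)
    ORDER BY res.alias
    """
).fetchall()

for row in alias_join_rows:
    print(row)
```

Both resources now pick up a _toolname_ even though neither title matches.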

How many other such tools are there? Let's find out.

In [15]:
display(Markdown("#### Count where _nanohub.jos_resources.alias_ = _nanohub.jos_tool.toolname_"))

modified_sql_string = '''
SELECT 
       tool.toolname, tool.title AS title_from_jos_tools,
       author.authorid, author.name,
       res.title AS title_from_jos_resources, res.published, res.type, res.alias,
       version.instance
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
    ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
    ON
        LOWER(tool.title) != LOWER(res.title) AND
        LOWER(tool.toolname) = LOWER(res.alias) 
LEFT JOIN nanohub.jos_tool_version version
    ON LOWER(tool.toolname) = LOWER(version.instance)
WHERE
    res.title != '' AND
    res.alias != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versions-12'),
    cache_folder
)
modified_df = modified_repo.get_all()


display_number_of_rows(modified_df)

#### Count where _nanohub.jos_resources.alias_ = _nanohub.jos_tool.toolname_

**A total of 2231 entries pulled in**

## 3. Validation for Stephen

Let's use the above query one more time to match Stephen's results from nanoHUB.org:

#### Solution

In [16]:
modified_sql_string = '''
SELECT 
       tool.toolname, tool.title AS title_from_jos_tools,
       author.authorid, author.name,
       res.title AS title_from_jos_resources, res.published, res.type, res.alias,
       version.instance
FROM nanohub.jos_resources res
LEFT JOIN nanohub.jos_author_assoc author
    ON author.subid  = res.id
LEFT JOIN nanohub.jos_tool tool
    ON
        LOWER(tool.title) != LOWER(res.title) AND
        LOWER(tool.toolname) = LOWER(res.alias) 
LEFT JOIN nanohub.jos_tool_version version
    ON LOWER(tool.toolname) = LOWER(version.instance)
WHERE
    res.title != '' AND
    res.alias != '' AND
    res.published = 1 AND 
    res.type = '7' AND 
    res.access IN ('0','3','1') AND 
    res.standalone = '1'
;
'''

#### Results

In [17]:
modified_repo = CachedRepository(
    PandasRepository(modified_sql_string, nanohub_db, 'tool_authorship__w_versions-final'),
    cache_folder
)
modified_df = modified_repo.get_all()
display_number_of_rows(modified_df)


display(get_rows_by_keyvalue(modified_df, 'authorid', stephen_uid))

**A total of 2231 entries pulled in**

| Unnamed: 0 | toolname | title_from_jos_tools | authorid | name | title_from_jos_resources | published | type | alias | instance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 86 | bulkmc | Full Tool NameBulk Monte Carlo Tool | 29476.0 | Stephen M. Goodnick | Bulk Monte Carlo Lab | 1 | 7 | bulkmc |  |
| 133 | acute | Advanced Computational Electronics | 29476.0 | Stephen M. Goodnick | ACUTE | 1 | 7 | acute |  |


## 4. Conclusion

We have matched Stephen's tool authorships from nanoHUB.org with [results](#Results) from yet another [solution](#Solution), different from the one in Part I.

I believe the solutions from Part I and Part II, when combined, will work for the vast majority of the tool data. There is still a lot more to do, including handling potential edge cases and running further validations.

![Stephen's Tool Authorships on nanoHUB.org](static/stephen-nanohub-tool-authorships.jpg "Stephen's Tool Authorships on nanoHUB.org")

Next up: 
1. Validations
2. Combining the SQL queries
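
For item 2, one possible shape for a combined join is sketched below. This is a guess at a direction, not a validated solution: it runs against an in-memory SQLite database rather than the real MySQL tables, the table layouts are simplified, and the empty alias for _Hydrophobicity Lab_ is a hypothetical value chosen to exercise the title fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-ins for the real tables.
# Row 1: titles differ, alias matches the toolname (as seen for bulkmc).
# Row 2: hypothetical empty alias, so only the title can match.
conn.executescript(
    """
    CREATE TABLE jos_resources (title TEXT, alias TEXT);
    CREATE TABLE jos_tool (toolname TEXT, title TEXT);
    INSERT INTO jos_resources VALUES
        ('Bulk Monte Carlo Lab', 'bulkmc'),
        ('Hydrophobicity Lab', '');
    INSERT INTO jos_tool VALUES
        ('bulkmc', 'Full Tool NameBulk Monte Carlo Tool'),
        ('hydrolab', 'Hydrophobicity Lab');
    """
)

# Accept a tool row when either the alias (Part II) or the title (Part I) matches.
combined_rows = conn.execute(
    """
    SELECT DISTINCT res.title, tool.toolname
    FROM jos_resources res
    LEFT JOIN jos_tool tool
        ON LOWER(tool.toolname) = LOWER(res.alias)
        OR LOWER(tool.title) = LOWER(res.title)
    ORDER BY tool.toolname
    """
).fetchall()

print(combined_rows)
# [('Bulk Monte Carlo Lab', 'bulkmc'), ('Hydrophobicity Lab', 'hydrolab')]
```

An `OR` in the join condition can match one resource to two different tools if the alias and the title point at different rows, so this would need the same kind of validation against nanoHUB.org as the queries above.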

[Scroll to top](#Top)