At this point, all the data mentioned so far should be uploaded. Let us try another example, this time exploring different types of connection data. William Pao worked on a journal article titled 'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib' while affiliated with the organization Memorial Sloan Kettering Cancer Center. The article has the keywords gefitinib and erlotinib (only gefitinib is uploaded in this example) and cites another journal article titled 'An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy'. The data comes from OpenAlex.

As before, we upload each entity first and then the connections.

In [1]:
import sys
import os
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)
from SQLConnect import connect_and_query
from SQLConnect import insert_query_dict

To upload an entity, say a person, we first build the query. We map column names to the desired values in a dictionary and pass it to the insert_query_dict function from SQLConnect, which returns the corresponding INSERT query.

In [2]:
person = {
    'origin_database': 'OpenAlex Demo',
    'email': None,
    'phone': None,
    'name': 'William Pao',
    'first_name': 'William',
    'middle_name': None,
    'last_name': 'Pao',
    'nih_id': None
}
person_query = [insert_query_dict('People', person)]
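
To make the dictionary-to-query step concrete, here is a minimal sketch of how such a helper could work. It is not the SQLConnect implementation: build_insert_query and sql_literal are hypothetical names, and the quoting rules are an assumption chosen to resemble the output shown in this notebook.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (illustrative assumption)."""
    if value is None:
        return 'NULL'
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes and double quotes, then wrap the text in double quotes.
    escaped = str(value).replace('\\', '\\\\').replace('"', '\\"')
    return f'"{escaped}"'


def build_insert_query(table, record):
    """Build an INSERT statement from a dict of column -> value (hypothetical helper)."""
    columns = ', '.join(record)
    values = ', '.join(sql_literal(v) for v in record.values())
    return f'INSERT INTO {table} ({columns}) VALUES ({values});'


example = build_insert_query(
    'Bioentity', {'origin_database': 'OpenAlex Demo', 'name': 'Gefitinib'}
)
print(example)
```

Building SQL by string concatenation is convenient for a demo with trusted data. For untrusted input, parameterized queries are the safer choice.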

The same applies to bioentities, organizations, and works. In this example the works are the two journal articles rather than projects or patents.

In [3]:
bio = {
    'origin_database': 'OpenAlex Demo',
    'name': 'Gefitinib',
}
bio_query = [insert_query_dict('Bioentity', bio)]

In [4]:
org = {
    'origin_database': 'OpenAlex Demo',
    'name': 'Memorial Sloan Kettering Cancer Center',
    'funding': None
}
org_query = [insert_query_dict('Org', org)]

In [5]:
work = [
    {
        'origin_database': 'OpenAlex Demo',
        'title': 'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib',
        'start_date': '2004-09-07',
        'end_date': None,
        'type': 'Journal Article',
        'pmid': 15329413
    },
    {
        'origin_database': 'OpenAlex Demo',
        'title': 'An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy',
        'start_date': None,
        'end_date': None,
        'type': 'Journal Article',
        'pmid': None
    }
]
work_query = [insert_query_dict('Work', rec) for rec in work]

Next, we use these queries to insert the data into the database with connect_and_query from SQLConnect. We pass it the list of queries, a matching list of query types, and the database name.

In [6]:
queries = person_query + org_query + bio_query + work_query
queries

['INSERT INTO People (origin_database, email, phone, name, first_name, middle_name, last_name, nih_id) VALUES ("OpenAlex Demo", NULL, NULL, "William Pao", "William", NULL, "Pao", NULL);',
 'INSERT INTO Org (origin_database, name, funding) VALUES ("OpenAlex Demo", "Memorial Sloan Kettering Cancer Center", NULL);',
 'INSERT INTO Bioentity (origin_database, name) VALUES ("OpenAlex Demo", "Gefitinib");',
 'INSERT INTO Work (origin_database, title, start_date, end_date, type, pmid) VALUES ("OpenAlex Demo", \'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib\', "2004-09-07", NULL, "Journal Article", 15329413);',
 'INSERT INTO Work (origin_database, title, start_date, end_date, type, pmid) VALUES ("OpenAlex Demo", "An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy", NULL, NULL, "Journal Article", NULL);']

In [7]:
connect_and_query(queries, ['INSERT' for _ in queries], 'UnmergedV1')

Connection to database established
MySQL connection is closed


[]

Once the inserts finish, each record has its own id in its table. Next we upload the connections between these entities, which the database represents using ids. We therefore need to look up the new ids first by querying the database.

In [8]:
get_id_person = 'SELECT people_id, name FROM People WHERE origin_database = "OpenAlex Demo"'
get_id_bio = 'SELECT bio_id, name FROM Bioentity WHERE origin_database = "OpenAlex Demo"'
get_id_org = 'SELECT org_id, name FROM Org WHERE origin_database = "OpenAlex Demo"'
get_id_work = 'SELECT work_id, title FROM Work WHERE origin_database = "OpenAlex Demo"'
# The results come back in this order: people, bioentities, works, organizations
get_id_queries = [get_id_person, get_id_bio, get_id_work, get_id_org]

In [9]:
ids = connect_and_query(get_id_queries, ['SELECT' for _ in get_id_queries], 'UnmergedV1')
ids

Connection to database established
MySQL connection is closed


[[(9065, 'William Pao')],
 [(2710, 'Gefitinib')],
 [(12588,
   'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'),
  (12589,
   'An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy')],
 [(3140, 'Memorial Sloan Kettering Cancer Center')]]
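
The insert-then-look-up pattern above can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, with a simplified People table whose schema is an assumption, not the real UnmergedV1 schema. It shows that the database assigns the id on insert and that we read it back with a SELECT filtered on origin_database.

```python
import sqlite3

# In-memory database used only as a stand-in for the MySQL database.
conn = sqlite3.connect(':memory:')

# Simplified, assumed schema: the id column is assigned by the database.
conn.execute(
    'CREATE TABLE People ('
    'people_id INTEGER PRIMARY KEY AUTOINCREMENT, '
    'origin_database TEXT, '
    'name TEXT)'
)

# Insert without specifying people_id.
conn.execute(
    'INSERT INTO People (origin_database, name) VALUES (?, ?)',
    ('OpenAlex Demo', 'William Pao'),
)
conn.commit()

# Read the generated id back, filtering on the origin tag as in the cells above.
rows = conn.execute(
    'SELECT people_id, name FROM People WHERE origin_database = ?',
    ('OpenAlex Demo',),
).fetchall()
conn.close()

print(rows)  # [(1, 'William Pao')]
```

The id is 1 here because the table starts empty. In the real database the ids continue from the rows already present, which is why they have to be queried rather than guessed.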

With the ids retrieved, it is useful to store them in dictionaries that map each name or title to its id.

In [10]:
people_to_id = {}
for rec in ids[0]:
    people_to_id[rec[1]] = rec[0]

In [11]:
bio_to_id = {}
for rec in ids[1]:
    bio_to_id[rec[1]] = rec[0]

In [12]:
work_to_id = {}
for rec in ids[2]:
    work_to_id[rec[1]] = rec[0]
work_to_id

{'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib': 12588,
 'An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy': 12589}

In [13]:
org_to_id = {}
for rec in ids[3]:
    org_to_id[rec[1]] = rec[0]
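
The four loops above follow the same pattern, so they can be condensed into one small function. This is a sketch: rows_to_lookup is a hypothetical helper name, and the ids list below simply repeats the SELECT output shown earlier so the example runs on its own.

```python
# Result rows copied from the SELECT output above: (id, name or title) pairs,
# in the order people, bioentities, works, organizations.
ids = [
    [(9065, 'William Pao')],
    [(2710, 'Gefitinib')],
    [(12588, 'EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'),
     (12589, 'An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy')],
    [(3140, 'Memorial Sloan Kettering Cancer Center')],
]


def rows_to_lookup(rows):
    """Map the second field of each row (name or title) to its id."""
    return {name: row_id for row_id, name in rows}


people_to_id, bio_to_id, work_to_id, org_to_id = (
    rows_to_lookup(rows) for rows in ids
)
print(people_to_id)  # {'William Pao': 9065}
```

If two rows share the same name or title, the later row overwrites the earlier one in the dictionary, so this lookup is only reliable when names are unique within the selected rows.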

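Finally, we repeat the same procedure for the connections: build a dictionary for each connection record, convert it to an INSERT query, and run the queries. Note that the citation is recorded twice in WorkRelation, once for each ordering of the two work ids.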

In [14]:
people_org = {
    'people_id': people_to_id['William Pao'],
    'org_id': org_to_id['Memorial Sloan Kettering Cancer Center'],
    'year': None
}

work_people = {
    'people_id': people_to_id['William Pao'],
    'work_id': work_to_id['EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib']
}

work_org = {
    'work_id': work_to_id['EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'],
    'org_id': org_to_id['Memorial Sloan Kettering Cancer Center']
}

keyword = {
    'work_id': work_to_id['EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'],
    'bio_id': bio_to_id['Gefitinib']
}

work_relation = [
    {
        'work_id1': work_to_id['EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'],
        'work_id2': work_to_id['An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy'],
        'relation': '1 cited 2'
    },
    {
        'work_id2': work_to_id['EGF receptor gene mutations are common in lung cancers from "never smokers" and are associated with sensitivity of tumors to gefitinib and erlotinib'],
        'work_id1': work_to_id['An orally active inhibitor of epidermal growth factor signaling with potential for cancer therapy'],
        'relation': '2 cited 1'
    }
]
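
The work_relation list above stores one citation as two records, one per ordering of the two ids. A small helper can generate both records from a single (citing, cited) pair, which avoids repeating the long titles. This is a sketch: citation_records is a hypothetical helper, and the ids are the ones returned by the SELECT queries earlier.

```python
def citation_records(citing_id, cited_id):
    """Return both WorkRelation records for one citation (hypothetical helper)."""
    return [
        # The citing work is listed first.
        {'work_id1': citing_id, 'work_id2': cited_id, 'relation': '1 cited 2'},
        # The same citation with the cited work listed first.
        {'work_id1': cited_id, 'work_id2': citing_id, 'relation': '2 cited 1'},
    ]


# Ids taken from the SELECT output shown earlier.
work_relation = citation_records(citing_id=12588, cited_id=12589)
for record in work_relation:
    print(record)
```

The key order in the second record differs from the cell above. Judging by the generated queries in this notebook, that only changes the column order in the INSERT statement, not the stored values.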

In [15]:
queries = (
    [insert_query_dict('PeopleOrg', people_org)]
    + [insert_query_dict('WorkPeople', work_people)]
    + [insert_query_dict('WorkOrg', work_org)]
    + [insert_query_dict('Keyword', keyword)]
    + [insert_query_dict('WorkRelation', rec) for rec in work_relation]
)
queries

['INSERT INTO PeopleOrg (people_id, org_id, year) VALUES (9065, 3140, NULL);',
 'INSERT INTO WorkPeople (people_id, work_id) VALUES (9065, 12588);',
 'INSERT INTO WorkOrg (work_id, org_id) VALUES (12588, 3140);',
 'INSERT INTO Keyword (work_id, bio_id) VALUES (12588, 2710);',
 'INSERT INTO WorkRelation (work_id1, work_id2, relation) VALUES (12588, 12589, "1 cited 2");',
 'INSERT INTO WorkRelation (work_id2, work_id1, relation) VALUES (12588, 12589, "2 cited 1");']

In [16]:
connect_and_query(queries, ['INSERT' for _ in queries], 'UnmergedV1')

Connection to database established
MySQL connection is closed


[]