# GI50
The NCI60 data gives insight into the effect of a chemical on cancer.
Cancer is not one disease but a collection of diseases: almost any cell type in your body can start growing uncontrolled.
For this reason NCI60 contains multiple cell lines (column CELL_NAME).

This notebook uses the GI50 data of the NCI60 project.
GI50: the concentration needed to inhibit growth by 50%. The column "AVERAGE" is the average concentration needed.
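The AVERAGE values shown in the table below are negative (for example -7.1391) and CONCENTRATION_UNIT is M, which suggests they are stored as log10 of a molar concentration. Under that assumption, a minimal sketch for turning them back into micromolar concentrations could look like this. The helper name `log_molar_to_micromolar` is made up for illustration.

```python
import numpy as np


def log_molar_to_micromolar(log_gi50):
    """Convert log10(GI50 in mol/L) to GI50 in micromol/L.

    Assumption: the input is a base-10 logarithm of a molar concentration.
    """
    molar = np.power(10.0, np.asarray(log_gi50, dtype=float))
    return molar * 1e6  # 1 mol/L = 1e6 micromol/L


# Illustrative inputs: roughly 0.1, 1 and 100 micromolar
print(log_molar_to_micromolar([-7.0, -6.0, -4.0]))
```

On this scale a lower (more negative) value means that less of the chemical is needed to halve growth.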

In [10]:
import pandas as pd
from os.path import join as path_join
import numpy as np
import json

In [11]:
gi50 = pd.read_csv(path_join("data", "GI50.csv"))
print(gi50.shape)
gi50.head()

(4585048, 14)


Unnamed: 0,RELEASE_DATE,EXPID,PREFIX,NSC,CONCENTRATION_UNIT,LOG_HI_CONCENTRATION,PANEL_NUMBER,CELL_NUMBER,PANEL_NAME,CELL_NAME,PANEL_CODE,COUNT,AVERAGE,STDDEV
0,20210223,0001MD02,S,123127,M,-4.6021,1,1,Non-Small Cell Lung Cancer,NCI-H23,LNS,1,-7.1391,0.0
1,20210223,0001MD02,S,123127,M,-4.6021,10,14,Melanoma,M14,MEL,1,-7.052,0.0
2,20210223,0001MD02,S,123127,M,-4.6021,12,5,CNS Cancer,SNB-75,CNS,1,-7.138,0.0
3,20210223,0001MD02,S,123127,M,-4.6021,4,2,Colon Cancer,HCC-2998,COL,1,-6.9426,0.0
4,20210223,0001MD02,S,123127,M,-4.6021,5,5,Breast Cancer,MDA-MB-231/ATCC,BRE,1,-6.4485,0.0


In [2]:
import json

from py2neo import Graph

with open("config.json") as f:
    config = json.load(f)

neo4j_url = config.get("neo4jUrl", "bolt://localhost:7687")
user = config.get("user", "neo4j")
pswd = config.get("pswd", "password")
graph = Graph(neo4j_url, auth=(user, pswd))

In [12]:
with open("cellline_nci60_to_chembl.json", "r") as fp:
    cell_lines_json = json.load(fp)

cell_lines = [i for i in cell_lines_json if cell_lines_json[i]]

Create a pivot table in which every value is the average GI50 of one combination of cell line (row) and chemical (column).

In [13]:
# Keep only the rows whose cell line is present in the database
# (cell_lines holds the names with a non-empty entry in the mapping file)
correlation = pd.pivot_table(gi50.loc[gi50["CELL_NAME"].isin(cell_lines)], values='AVERAGE', index=['CELL_NAME'],
                    columns=['NSC'], aggfunc="mean")
correlation.head()

NSC,1,17,26,89,112,171,185,186,196,197,...,836824,836941,836942,837081,837082,837396,837397,837398,837892,837893
CELL_NAME
786-0,-4.716867,-5.557233,-5.4265,-4.70005,-6.4911,-4.0,-7.4287,-4.609967,-5.1022,-5.5461,...,-5.4324,-5.3468,-4.0,-4.8286,-5.1042,-6.6969,-5.5047,-6.5669,-4.301,-5.6072
A-172/H.Fine,,,,,,,,-4.2305,,,...,,,,,,,,,,
A-204,,,,,,,,,,,...,,,,,,,,,,
A431,,,,,,,,,,,...,,,,,,,,,,
A498,-4.3635,-4.664067,,-4.144533,-6.1375,-4.0,,-5.45858,-4.9329,-4.8278,...,-5.3284,-4.0,-4.0,-4.8123,-4.8738,-5.75485,-5.4407,-6.79295,-4.301,-5.6273
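
To see what the pivot does with repeated measurements and missing combinations, here is a small sketch on made-up numbers (the values are illustrative and not taken from the NCI60 file). Two rows for the same cell line and NSC are averaged into one entry, and a combination that was never measured becomes NaN.

```python
import pandas as pd

# Made-up long-format data with the same column names as GI50.csv
toy = pd.DataFrame(
    {
        "CELL_NAME": ["A498", "A498", "786-0", "786-0"],
        "NSC": [1, 1, 1, 17],
        "AVERAGE": [-4.0, -5.0, -4.5, -6.0],
    }
)

# Rows become cell lines, columns become chemicals, duplicates are averaged
toy_pivot = pd.pivot_table(
    toy, values="AVERAGE", index="CELL_NAME", columns="NSC", aggfunc="mean"
)
print(toy_pivot)
```

Here A498 has two measurements for NSC 1, so its entry is their mean, while A498 with NSC 17 stays empty. This is why the real table above contains so many NaN entries.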


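The difference in how two chemicals act on the same cell line gives us an idea of how similar those two chemicals are. If we average the absolute differences over all cell lines for which both chemicals have a value, we get a difference score: 0 means the two chemicals give identical GI50 values on every shared cell line, and the score has no upper bound, so larger values mean less similar responses. The code cell below computes this score for NSC 19893 and NSC 713724, which are two columns of the pivot table.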

In [7]:
all_diffs_pd = np.abs(correlation[19893] - correlation[713724])
print(f"Avg diff {all_diffs_pd.loc[all_diffs_pd.notna()].mean()}, num difs {sum(all_diffs_pd.notna())}")
all_diffs_pd = all_diffs_pd.loc[all_diffs_pd.notna()].to_dict()

Avg diff 0.5365136610125857, num difs 59
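
The same calculation can be wrapped in a small helper so it can be reused for any pair of pivot-table columns. This is a sketch on made-up numbers. The function name `mean_abs_diff` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def mean_abs_diff(pivot, col_a, col_b):
    """Mean absolute difference between two columns.

    Only rows where both columns have a value are used.
    Returns (score, number of shared rows).
    """
    diffs = (pivot[col_a] - pivot[col_b]).abs().dropna()
    return diffs.mean(), len(diffs)


# Made-up pivot: rows are cell lines, columns are chemicals
toy = pd.DataFrame(
    {
        "chem_x": [-5.0, -6.0, np.nan, -4.0],
        "chem_y": [-5.5, -6.0, -7.0, np.nan],
    },
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)

score, n_shared = mean_abs_diff(toy, "chem_x", "chem_y")
print(f"Avg diff {score}, num difs {n_shared}")  # Avg diff 0.25, num difs 2
```

With the real data the call would be `mean_abs_diff(correlation, 19893, 713724)`. The number of shared rows is worth reporting next to the score, because a score based on a handful of rows is much less reliable.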


We can do the same in the graph database. However, the database is not held fully in memory, so we first need to retrieve the measurements for the relevant cell lines.

In [8]:
response = graph.run(
    """
    MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT cell, avg(toFloat(m.value)) as values1, org_chem, gi50

    CALL {
        WITH cell, gi50, values1
        MATCH (chem:Synonym)
        WHERE chem.pubChemSynId IN ["1d75798754df81c782e805287ff7ef83"] //, "ecb8cd2425f4430ce611e50153f09dcc"]
        MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem)
        MATCH (c)-[m2:MEASURES]->(gi50)
        RETURN DISTINCT cell as cell2, abs(toFloat(m2.value) - values1) as distance,  chem
    }

    RETURN DISTINCT chem.name, cell2.label as cell_name, avg(distance) as dist, count(distance)
    ORDER BY cell_name
    """
).data()

all_diffs_neo = {}
for r in response:
    all_diffs_neo[r["cell_name"]] = r["dist"]
print(all_diffs_neo)

{'786-0': 0.07140702732849302, 'A498': 0.4297192862045325, 'A549': 1.5849812256267448, 'ACHN': 0.4607676683361186, 'BT-549': 0.3980160843770584, 'CAKI-1': 0.7842997108154997, 'CCRF-CEM': 0.17629994022713547, 'COLO 205': 0.5759622546270435, 'DU-145': 0.4109427840909019, 'EKVX ': 0.5392973776223808, 'HCC 2998': 1.0807086896551752, 'HCT-116': 0.8050908890115922, 'HCT-15': 1.1062023730684443, 'HL-60(TB)': 0.39653701765860916, 'HOP-62': 0.4174759237875314, 'HOP-92': 0.19887397355601966, 'HT-29': 0.9304206349206288, 'Hs-578T': 0.34562432432433, 'IGROV-1': 0.8508854774156802, 'K562': 0.2886813805104502, 'KM12': 0.35321072818232757, 'LOX IMVI ': 0.8849687536571071, 'M14': 0.006811386696730537, 'MCF7': 0.9765160388127683, 'MDA-MB-231': 1.1169995480225974, 'MDA-MB-435': 0.4913412057522084, 'MDA-N': 0.492866342412448, 'MOLT-4': 0.43246514546491355, 'Malme-3M': 0.30963925294887495, 'NCI-H226': 0.6340676885346457, 'NCI-H23': 0.8848983164983046, 'NCI-H322M': 0.5771718475073238, 'NCI-H460': 1.4743202

At this point the database only holds GI50 data, but we still write the query as if more kinds of measurements exist. We look for all cell lines that have a condition with both chemical A and chemical B. Each condition must also have a GI50 measurement.
Next we average the measurements and calculate the distance just like above.
The values differ slightly from the pandas result because not all chemicals could be loaded into the database.
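
To put the two results side by side, the cell line names have to be matched first, because the CSV and the database spell some of them differently (for example `A549/ATCC` versus `A549`, `HCC-2998` versus `HCC 2998`, or `EKVX` versus `EKVX ` with a trailing space). The sketch below uses a simple, hypothetical normalisation rule. The mapping file loaded earlier is probably the more reliable source, and the numbers in the example are made up.

```python
import re


def normalize_cell_name(name):
    """Hypothetical rule: drop a '/suffix', keep letters and digits, upper-case."""
    base = name.split("/")[0]
    return re.sub(r"[^A-Za-z0-9]", "", base).upper()


def compare_scores(pandas_scores, neo4j_scores):
    """Return {normalised name: (pandas value, neo4j value, difference)}.

    Only cell lines present in both dictionaries are reported.
    """
    left = {normalize_cell_name(k): v for k, v in pandas_scores.items()}
    right = {normalize_cell_name(k): v for k, v in neo4j_scores.items()}
    return {
        name: (left[name], right[name], right[name] - left[name])
        for name in sorted(left.keys() & right.keys())
    }


# Made-up values, only to show the name matching
example_pd = {"A549/ATCC": 0.50, "HCC-2998": 0.25, "K-562": 1.0}
example_neo = {"A549": 0.75, "HCC 2998": 0.25, "EKVX ": 0.1}
print(compare_scores(example_pd, example_neo))
```

With the notebook's own variables the call would be `compare_scores(all_diffs_pd, all_diffs_neo)`. A rule this loose could in principle merge two distinct cell lines into one key, so the output deserves a quick check by eye.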

In [10]:
response = graph.run(
    """
    MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
    MATCH (c)-[m:MEASURES]->(gi50)

    WITH DISTINCT cell, avg(toFloat(m.value)) as value1, count(m) as num_values, org_chem
    RETURN cell.label as cell_name, value1, num_values, org_chem.name as chem_name
    ORDER BY num_values DESC
    """
).data()
for r in response:
    print(r)

{'cell_name': 'HCT-15', 'value1': -5.243602373068445, 'num_values': 1812, 'chem_name': 'nsc19893'}
{'cell_name': 'HCT-116', 'value1': -5.386090889011593, 'num_values': 1811, 'chem_name': 'nsc19893'}
{'cell_name': 'MDA-MB-435', 'value1': -5.0008412057522085, 'num_values': 1808, 'chem_name': 'nsc19893'}
{'cell_name': 'OVCAR-5', 'value1': -3.767840788013319, 'num_values': 1802, 'chem_name': 'nsc19893'}
{'cell_name': 'U-251', 'value1': -4.355170205669812, 'num_values': 1799, 'chem_name': 'nsc19893'}
{'cell_name': 'KM12', 'value1': -5.044110728182328, 'num_values': 1799, 'chem_name': 'nsc19893'}
{'cell_name': 'ACHN', 'value1': -5.016267668336119, 'num_values': 1797, 'chem_name': 'nsc19893'}
{'cell_name': 'SW-620', 'value1': -4.562712966054529, 'num_values': 1797, 'chem_name': 'nsc19893'}
{'cell_name': 'A549', 'value1': -5.680181225626745, 'num_values': 1795, 'chem_name': 'nsc19893'}
{'cell_name': '786-0', 'value1': -4.915207027328493, 'num_values': 1793, 'chem_name': 'nsc19893'}
{'cell_name

In [5]:
response = graph.run(
    """
    MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT cell, avg(toFloat(m.value)) as values1, org_chem, gi50

    CALL {
        WITH cell, gi50, values1
        MATCH (chem:Synonym)
        WHERE chem.pubChemSynId IN ["69a8b92463c0467140242a249aa58b2b"] //, "ecb8cd2425f4430ce611e50153f09dcc"]
        MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem)
        MATCH (c)-[m2:MEASURES]->(gi50)
        RETURN DISTINCT cell as cell2, abs(toFloat(m2.value) - values1) as distance,  chem
    }

    RETURN DISTINCT chem.name, cell2.label as cell_name, avg(distance), count(distance)
    ORDER BY cell_name
    """
).data()
for r in response:
    print(r)

{'chem.name': 'nsc-361605', 'cell_name': '786-0', 'avg(distance)': 0.2453070273284932, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'A498', 'avg(distance)': 0.9720192862045325, 'count(distance)': 1}
{'chem.name': 'nsc-361605', 'cell_name': 'A549', 'avg(distance)': 0.320981225626745, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'ACHN', 'avg(distance)': 0.08511766833611878, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'CAKI-1', 'avg(distance)': 0.7124997108154991, 'count(distance)': 1}
{'chem.name': 'nsc-361605', 'cell_name': 'CCRF-CEM', 'avg(distance)': 0.2640000597728651, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'COLO 205', 'avg(distance)': 0.49361225462704317, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'EKVX ', 'avg(distance)': 1.5574973776223806, 'count(distance)': 2}
{'chem.name': 'nsc-361605', 'cell_name': 'HCC 2998', 'avg(distance)': 0.3087913103448252, 'count(distance)': 2}
{'chem.name

In [7]:
response = graph.run(
    """
MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
MATCH (gi50:Measurement {name: "GI50"})

MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
MATCH (c)-[m:MEASURES]->(gi50)
WITH DISTINCT cell, avg(toFloat(m.value)) as values1, org_chem, gi50

CALL {
    WITH cell, gi50, values1
    MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem:Synonym)
    MATCH (c)-[m2:MEASURES]->(gi50)
    RETURN DISTINCT cell as cell2, abs(toFloat(m2.value) - values1) as distance,  chem
}

RETURN DISTINCT chem.name, avg(distance) as avg_dist, count(distance) as num_cell ORDER BY avg_dist limit 5
    """
).data()
for r in response:
    print(r)

{'chem.name': 'nsc-361605', 'avg_dist': 0.34530195673664926, 'num_cell': 91}
{'chem.name': 'nsc-684405', 'avg_dist': 0.3656233743457906, 'num_cell': 111}
{'chem.name': 'nsc628537', 'avg_dist': 0.3781293960020346, 'num_cell': 46}
{'chem.name': 'nsc-785594', 'avg_dist': 0.3838617706291631, 'num_cell': 103}
{'chem.name': 'nsc-613493', 'avg_dist': 0.39945509628924536, 'num_cell': 91}
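
The per-cell-line distance queries in this notebook differ mainly in the `pubChemSynId` values they hard-code. A possible refactoring, sketched below, passes those identifiers as Cypher parameters through the second argument of py2neo's `Graph.run`. The function name `chemical_distances`, the parameter names `$org_id` and `$other_ids`, and the result aliases are assumptions for illustration, and the query has not been run against the author's database.

```python
# Same structure as the per-cell-line distance query, with the identifiers
# replaced by the (hypothetical) parameters $org_id and $other_ids.
DISTANCE_QUERY = """
MATCH (org_chem:Synonym {pubChemSynId: $org_id})
MATCH (gi50:Measurement {name: "GI50"})

MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
MATCH (c)-[m:MEASURES]->(gi50)
WITH DISTINCT cell, avg(toFloat(m.value)) AS values1, org_chem, gi50

CALL {
    WITH cell, gi50, values1
    MATCH (chem:Synonym)
    WHERE chem.pubChemSynId IN $other_ids
    MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem)
    MATCH (c)-[m2:MEASURES]->(gi50)
    RETURN DISTINCT cell AS cell2, abs(toFloat(m2.value) - values1) AS distance, chem
}

RETURN DISTINCT chem.name AS chem_name, cell2.label AS cell_name,
       avg(distance) AS dist, count(distance) AS num_values
ORDER BY cell_name
"""


def chemical_distances(graph, org_id, other_ids):
    """Run the distance query for one reference chemical against other_ids.

    `graph` is expected to behave like a py2neo Graph: run() returns a
    cursor whose data() method gives a list of dictionaries.
    """
    params = {"org_id": org_id, "other_ids": list(other_ids)}
    return graph.run(DISTANCE_QUERY, params).data()
```

A call such as `chemical_distances(graph, "176dde90cc9dd83eed129de11b203b03", ["69a8b92463c0467140242a249aa58b2b"])` would then stand in for the hard-coded version, and aliased column names like `dist` avoid keys such as `avg(distance)` in the result dictionaries.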


In [46]:
response = graph.run(
    """
    MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
    MATCH (the_cell:CellLine {label: "HCT-15"})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
    WHERE cell <> the_cell
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT cell, avg(toFloat(m.value)) as values1, org_chem, gi50
    CALL {
        WITH cell, gi50, values1
        MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem:Synonym)
        MATCH (c)-[m2:MEASURES]->(gi50)
        RETURN DISTINCT cell as cell2, abs(avg(toFloat(m2.value)) - values1) as distance, chem
    }

    WITH DISTINCT chem, avg(distance) as avg_dist, count(distance) as num_cell 
    RETURN chem.name, avg_dist, num_cell, chem.pubChemSynId ORDER BY avg_dist limit 25
    """
).data()
for r in response:
    print(r)

{'chem.name': 'nsc19893', 'avg_dist': 3.351961022292937e-15, 'num_cell': 73, 'chem.pubChemSynId': '176dde90cc9dd83eed129de11b203b03'}
{'chem.name': 'nsc-684405', 'avg_dist': 0.30413082478114195, 'num_cell': 59, 'chem.pubChemSynId': 'd40f551fe1c9b0833b916e89b01e386d'}
{'chem.name': 'nsc-361605', 'avg_dist': 0.31986023473182124, 'num_cell': 48, 'chem.pubChemSynId': '69a8b92463c0467140242a249aa58b2b'}
{'chem.name': 'nsc-628083', 'avg_dist': 0.33290434655834006, 'num_cell': 59, 'chem.pubChemSynId': '21e870c91aa63e4e31350aba9c225547'}
{'chem.name': 'nsc-618093', 'avg_dist': 0.34773573490256054, 'num_cell': 59, 'chem.pubChemSynId': '08de420218365f71db7c240b9b130523'}
{'chem.name': 'nsc-613493', 'avg_dist': 0.35006572711758455, 'num_cell': 49, 'chem.pubChemSynId': 'd20214197486f740ddd9212b5d4cb8e3'}
{'chem.name': 'nsc628537', 'avg_dist': 0.3772233298450033, 'num_cell': 45, 'chem.pubChemSynId': 'f38f562a8df970d0fa2419a0cc7d229b'}
{'chem.name': 'nsc-785594', 'avg_dist': 0.3796167720061503, 'num

In [49]:
all_diffs_pd = np.abs(correlation[19893] - correlation[628083])
print(f"Avg diff {all_diffs_pd.loc[all_diffs_pd.notna()].mean()}, num difs {sum(all_diffs_pd.notna())}")
print(all_diffs_pd.loc[all_diffs_pd.notna()])

Avg diff 0.3354599681922921, num difs 60
CELL_NAME
786-0              0.166140
A498               0.242053
A549/ATCC          0.544961
ACHN               0.028868
BT-549             0.022316
CAKI-1             1.179450
CCRF-CEM           0.399325
COLO 205           0.356602
DU-145             0.474643
EKVX               0.520972
HCC-2998           0.044509
HCT-116            0.348371
HCT-15             0.486082
HL-60(TB)          0.323787
HOP-62             0.017504
HOP-92             0.198874
HS 578T            0.505258
HT29               0.217601
IGROV1             0.379185
K-562              0.405421
KM12               0.336771
LOX IMVI           0.500219
M14                0.179045
MALME-3M           0.226139
MCF7               0.533391
MDA-MB-231/ATCC    0.833133
MDA-MB-435         0.286708
MDA-N              0.092500
MOLT-4             0.437065
NCI-H226           0.515718
NCI-H23            0.267118
NCI-H322M          0.321122
NCI-H460           0.532160
NCI-H522           0.3746

In [45]:
response = graph.run(
    """
    MATCH (org_chem:Synonym {pubChemSynId: "176dde90cc9dd83eed129de11b203b03"})
    MATCH (gi50:Measurement {name: "GI50"})

    MATCH (cell:CellLine)<-[:USES]-(c:Condition)-[:USES]->(org_chem)
    MATCH (c)-[m:MEASURES]->(gi50)
    WITH DISTINCT cell, avg(toFloat(m.value)) as values1, org_chem, gi50


    MATCH (chem:Synonym)
    WHERE chem.pubChemSynId = "69a8b92463c0467140242a249aa58b2b"
    MATCH (cell)<-[:USES]-(c:Condition)-[:USES]->(chem)
    MATCH (c)-[m2:MEASURES]->(gi50)
    WITH DISTINCT cell as cell2, avg(toFloat(m2.value)) as values2, chem, values1, count(cell) as count_cell2

    RETURN cell2.label as cell_name, avg(abs(values1 - values2)) as distance, count(values2), count_cell2
    ORDER BY cell_name
    """
).data()
for r in response:
    print(r)

{'cell_name': '786-0', 'distance': 0.24530702732849274, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'A498', 'distance': 0.9720192862045325, 'count(values2)': 1, 'count_cell2': 1}
{'cell_name': 'A549', 'distance': 0.32098122562674547, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'ACHN', 'distance': 0.08511766833611834, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'CAKI-1', 'distance': 0.7124997108154991, 'count(values2)': 1, 'count_cell2': 1}
{'cell_name': 'CCRF-CEM', 'distance': 0.2640000597728651, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'COLO 205', 'distance': 0.4936122546270436, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'EKVX ', 'distance': 1.5574973776223802, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'HCC 2998', 'distance': 0.30879131034482477, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'HCT-116', 'distance': 0.1491408890115924, 'count(values2)': 1, 'count_cell2': 2}
{'cell_name': 'HCT-15', 'distance': 0.218652373