# Issue Understanding
This project asks which of two methods predicts the sensitivity of a cancer cell line to a given chemical more accurately, and why. The first method encodes each chemical as a numerical fingerprint and fits traditional models that map numeric inputs to a numeric output. The second uses Chemception embeddings to represent the chemical as an image and trains Convolutional Neural Networks to predict the sensitivity.

# Data Understanding

## Data Acquisition
We get our data from the **Genomics of Drug Sensitivity in Cancer** website, specifically the GDSC2 fitted dose response dataset, which was uploaded on 27th October 2023.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data = pd.read_excel('https://cog.sanger.ac.uk/cancerrxgene/GDSC_release8.5/GDSC2_fitted_dose_response_27Oct23.xlsx')

In [4]:
print(data.shape)
data.sample(10)

(242036, 19)


Unnamed: 0,DATASET,NLME_RESULT_ID,NLME_CURVE_ID,COSMIC_ID,CELL_LINE_NAME,SANGER_MODEL_ID,TCGA_DESC,DRUG_ID,DRUG_NAME,PUTATIVE_TARGET,PATHWAY_NAME,COMPANY_ID,WEBRELEASE,MIN_CONC,MAX_CONC,LN_IC50,AUC,RMSE,Z_SCORE
161885,GDSC2,343,16075091,909753,SW626,SIDM01168,OV,1832,729189,,Unclassified,1043,Y,0.003002,3.0,5.150031,0.965343,0.05319,1.553802
86105,GDSC2,343,16044577,908128,Mewo,SIDM00545,SKCM,1510,Linsitinib,IGF1R,IGF1R signaling,1046,Y,0.010005,10.0,6.100115,0.981631,0.053355,1.619989
177443,GDSC2,343,16000827,905983,U251,SIDM00111,GBM,1908,Ulixertinib,"ERK1, ERK2",ERK MAPK signaling,1046,Y,0.010005,10.0,2.691323,0.882759,0.054549,-0.077317
151671,GDSC2,343,16002097,905990,OVCAR-4,SIDM00092,OV,1815,Dacarbazine,CP11A,Other,1043,Y,0.020011,20.0,5.657917,0.968731,0.05092,0.067637
230887,GDSC2,343,16115804,946382,CAMA-1,SIDM00920,BRCA,2169,AZD6482,PI3Kbeta,PI3K/MTOR signaling,1046,Y,0.009766,10.0,2.008676,0.85729,0.094661,-1.002167
153020,GDSC2,343,15974287,753546,CP66-MEL,SIDM00190,SKCM,1817,Romidepsin,"HDAC1, HDAC2, HDAC3, HDAC8",Chromatin histone acetylation,1043,Y,1e-05,0.01,-5.682401,0.754705,0.162492,-0.441917
53711,GDSC2,343,16058232,908474,NCI-H1703,SIDM00740,UNCLASSIFIED,1086,BI-2536,"PLK1, PLK2, PLK3",Cell cycle,1046,Y,0.001001,1.0,-4.177791,0.409409,0.141154,-1.241928
156527,GDSC2,343,15962677,713869,VMRC-LCD,SIDM00320,LUAD,1825,Podophyllotoxin bromide,,Unclassified,1043,Y,0.001001,1.0,-0.556964,0.807228,0.100518,-0.071161
158238,GDSC2,343,16052425,908445,SNU-5,SIDM01144,STAD,1827,Dihydrorotenone,,Unclassified,1043,Y,0.001001,1.0,-0.169527,0.868773,0.066019,-0.60619
113224,GDSC2,343,15986796,753621,TE-1,SIDM00369,ESCA,1621,PCI-34051,"HDAC8, HDAC6, HDAC1",Chromatin histone acetylation,1046,Y,0.010005,10.0,6.274197,0.981404,0.034122,1.467765


## Columns Understanding
The dataset description provided by the **Genomics of Drug Sensitivity in Cancer** project, linked below, is outdated: it was last updated on 21st September 2017 and so does not cover many of the columns in the current release.

- https://cog.sanger.ac.uk/cancerrxgene/GDSC_release8.5/GDSC_Fitted_Data_Description.pdf

In [5]:
data.columns

Index(['DATASET', 'NLME_RESULT_ID', 'NLME_CURVE_ID', 'COSMIC_ID',
       'CELL_LINE_NAME', 'SANGER_MODEL_ID', 'TCGA_DESC', 'DRUG_ID',
       'DRUG_NAME', 'PUTATIVE_TARGET', 'PATHWAY_NAME', 'COMPANY_ID',
       'WEBRELEASE', 'MIN_CONC', 'MAX_CONC', 'LN_IC50', 'AUC', 'RMSE',
       'Z_SCORE'],
      dtype='object')

In [6]:
for col in data.columns[:7]:
    print(col, data[col].nunique())
    print(col, data[col].unique()[:10])
    print()

DATASET 1
DATASET ['GDSC2']

NLME_RESULT_ID 1
NLME_RESULT_ID [343]

NLME_CURVE_ID 242036
NLME_CURVE_ID [15946310 15946548 15946830 15947087 15947369 15947651 15947932 15948212
 15948491 15948772]

COSMIC_ID 969
COSMIC_ID [683667 684052 684057 684059 684062 684072 687448 687452 687455 687457]

CELL_LINE_NAME 969
CELL_LINE_NAME ['PFSK-1' 'A673' 'ES5' 'ES7' 'EW-11' 'SK-ES-1' 'COLO-829' '5637' 'RT4'
 'SW780']

SANGER_MODEL_ID 969
SANGER_MODEL_ID ['SIDM01132' 'SIDM00848' 'SIDM00263' 'SIDM00269' 'SIDM00203' 'SIDM01111'
 'SIDM00909' 'SIDM00807' 'SIDM01085' 'SIDM01160']

TCGA_DESC 32
TCGA_DESC ['MB' 'UNCLASSIFIED' 'SKCM' 'BLCA' 'CESC' 'GBM' 'LUAD' 'LUSC' 'SCLC'
 'MESO']



In [22]:
for col in data.columns[7:14]:
    print(col, data[col].nunique())
    print(col, data[col].unique()[:10])
    print()

DRUG_ID 295
DRUG_ID [1003 1004 1005 1006 1007 1008 1009 1010 1011 1012]

DRUG_NAME 286
DRUG_NAME ['Camptothecin' 'Vinblastine' 'Cisplatin' 'Cytarabine' 'Docetaxel'
 'Methotrexate' 'Tretinoin' 'Gefitinib' 'Navitoclax' 'Vorinostat']

PUTATIVE_TARGET 185
PUTATIVE_TARGET ['TOP1' 'Microtubule destabiliser' 'DNA crosslinker' 'Antimetabolite'
 'Microtubule stabiliser' 'Retinoic acid' 'EGFR' 'BCL2, BCL-XL, BCL-W'
 'HDAC inhibitor Class I, IIa, IIb, IV' 'ABL']

PATHWAY_NAME 24
PATHWAY_NAME ['DNA replication' 'Mitosis' 'Other' 'EGFR signaling'
 'Apoptosis regulation' 'Chromatin histone acetylation' 'ABL signaling'
 'ERK MAPK signaling' 'PI3K/MTOR signaling' 'Genome integrity']

COMPANY_ID 17
COMPANY_ID [1046 1001 1025 1005 1039 1043 1018 1033 1049 1050]

WEBRELEASE 1
WEBRELEASE ['Y']

MIN_CONC 39
MIN_CONC [1.000e-04 9.800e-05 4.002e-03 3.906e-03 5.859e-03 6.003e-03 2.001e-03
 1.300e-05 1.200e-05 1.001e-03]



In [23]:
for col in data.columns[14:]:
    print(col, data[col].nunique())
    print(col, data[col].unique()[:10])
    print()

MAX_CONC 27
MAX_CONC [ 0.1     6.      4.      8.      2.      0.0125  1.     10.      5.
  0.2   ]

LN_IC50 237097
LN_IC50 [-1.463887 -4.869455 -3.360586 -5.04494  -3.741991 -5.142961 -1.235034
 -2.632632 -2.963191 -1.449138]

AUC 142587
AUC [0.93022  0.61497  0.791072 0.59266  0.734047 0.582439 0.867348 0.834067
 0.821438 0.90505 ]

RMSE 118662
RMSE [0.089052 0.111351 0.142855 0.135539 0.128059 0.137581 0.09347  0.076169
 0.094466 0.074109]

Z_SCORE 233614
Z_SCORE [ 0.433123 -1.4211   -0.599569 -1.516647 -0.807232 -1.570016  0.557727
 -0.203221 -0.3832    0.441154]



At first glance, we can start eliminating some features:
* `DATASET`: The whole column holds the single value `GDSC2`, so it is not a useful feature for our model.
* `NLME_RESULT_ID`: The whole column holds the single value `343`, so it is not a useful feature for our model.
* `NLME_CURVE_ID`: I could not find any documentation for this column. A Google search on 27th June 2024 returned 2 results, both pointing to this same dataset with no context on the column. Every row also has its own value (242036 distinct values across 242036 rows), so it behaves like a row identifier, and I have excluded it from the training data. On further research, the name may refer to the R package `nlme`, whose manual is linked below.
  - https://cran.r-project.org/web/packages/nlme/nlme.pdf

  - ![NLME_CURVE_ID](Images/NLME_CURVE_ID.png)

* `COSMIC_ID`: I was advised to include this column in order to retrieve gene expression vectors and use them for model training. I spent several days studying the COSMIC dataset (Catalogue Of Somatic Mutations In Cancer) and how to join it to the GDSC2 dataset, working through research papers, official COSMIC educational videos, and the related material listed below. I could not find the given COSMIC IDs on the official COSMIC website (as of 27th June 2024), so I concluded that using this feature is not feasible within the scope of this project.
  - Exploring somatic mutations in cancer with COSMIC database and analysis tools (https://youtu.be/bvY7wt9djG4)
  - COSMIC Database (https://cancer.sanger.ac.uk/cosmic)
  - ![COSMIC_ID](Images/COSMIC_ID_683667.png)
  - ![COSMIC_ID](Images/COSMIC_ID_717431.png)
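
The single-value columns in the list above can also be found programmatically instead of by eye. The following is a sketch under stated assumptions: `constant_columns` is a hypothetical helper, and the toy frame only mimics a few columns of `data`. The outputs above show that `WEBRELEASE` also holds a single value (`'Y'`), so it is included in the toy data.

```python
import pandas as pd

def constant_columns(df: pd.DataFrame) -> list:
    """Names of columns with at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy stand-in for `data`; values copied from the outputs above.
toy = pd.DataFrame({
    "DATASET": ["GDSC2", "GDSC2", "GDSC2"],
    "NLME_RESULT_ID": [343, 343, 343],
    "WEBRELEASE": ["Y", "Y", "Y"],
    "LN_IC50": [-1.463887, -4.869455, -3.360586],
})
to_drop = constant_columns(toy)
reduced = toy.drop(columns=to_drop)
print(to_drop)
```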



In [27]:
data[['COSMIC_ID', 'CELL_LINE_NAME', 'SANGER_MODEL_ID']].drop_duplicates()

Unnamed: 0,COSMIC_ID,CELL_LINE_NAME,SANGER_MODEL_ID
0,683667,PFSK-1,SIDM01132
1,684052,A673,SIDM00848
2,684057,ES5,SIDM00263
3,684059,ES7,SIDM00269
4,684062,EW-11,SIDM00203
...,...,...,...
964,1660035,SNU-61,SIDM00194
965,1660036,SNU-81,SIDM00193
966,1674021,SNU-C5,SIDM00498
967,1789883,DiFi,SIDM00049
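
`COSMIC_ID`, `CELL_LINE_NAME` and `SANGER_MODEL_ID` each report 969 distinct values, which suggests, but does not by itself prove, that they identify the same cell lines one-to-one. A minimal sketch of an explicit check follows; `is_one_to_one` is a hypothetical helper and the toy rows are taken from the output above.

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, left: str, right: str) -> bool:
    """True when each `left` value pairs with exactly one `right` value and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)

# Toy stand-in for `data`, with one repeated cell line.
toy = pd.DataFrame({
    "COSMIC_ID": [683667, 684052, 683667],
    "CELL_LINE_NAME": ["PFSK-1", "A673", "PFSK-1"],
    "SANGER_MODEL_ID": ["SIDM01132", "SIDM00848", "SIDM01132"],
})
ok = is_one_to_one(toy, "COSMIC_ID", "CELL_LINE_NAME")
print(ok)
```

If the check holds for each pair of columns on the real frame, any one of the three can stand in for the others as the cell line key.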


In [76]:
# Keep only the identifying columns and the target (LN_IC50).
# TCGA_DESC, PUTATIVE_TARGET and PATHWAY_NAME are left out for now.
key_cols = ['CELL_LINE_NAME', 'DRUG_NAME', 'MIN_CONC', 'MAX_CONC']

print(data.shape)
data2 = data[key_cols + ['LN_IC50']].drop_duplicates()
print(data2.shape)

# Without LN_IC50, keys that were measured more than once collapse to one row.
data3 = data2[key_cols].drop_duplicates()
print(data3.shape)
print("Data Loss :", round((data2.shape[0]-data3.shape[0])*100/data2.shape[0], 2), "%")

(242036, 19)
(242036, 5)
(239995, 4)
Data Loss : 0.84 %
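
The shapes above mean that 242036 - 239995 = 2041 rows (about 0.84 %) share their cell line, drug and concentration range with another row while reporting a different `LN_IC50`. As a sketch, such repeated measurements can be listed and, if desired, collapsed. Averaging is only one possible resolution and is an assumption here, not a choice made in this notebook; the second A673 value in the toy data is made up for illustration.

```python
import pandas as pd

KEY_COLS = ["CELL_LINE_NAME", "DRUG_NAME", "MIN_CONC", "MAX_CONC"]

# Toy stand-in for `data2`; the last LN_IC50 value is invented.
toy = pd.DataFrame({
    "CELL_LINE_NAME": ["PFSK-1", "A673", "A673"],
    "DRUG_NAME": ["Camptothecin", "Camptothecin", "Camptothecin"],
    "MIN_CONC": [0.0001, 0.0001, 0.0001],
    "MAX_CONC": [0.1, 0.1, 0.1],
    "LN_IC50": [-1.463887, -4.869455, -4.5],
})

# Rows whose key occurs more than once, i.e. repeated measurements.
conflicts = toy[toy.duplicated(subset=KEY_COLS, keep=False)]

# Hypothetical resolution: one row per key, holding the mean LN_IC50.
resolved = toy.groupby(KEY_COLS, as_index=False)["LN_IC50"].mean()

# Same formula as the "Data Loss" line above, applied to the toy frame.
loss_pct = round((len(toy) - len(resolved)) * 100 / len(toy), 2)
print(len(conflicts), len(resolved), loss_pct)
```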


In [77]:
data2

Unnamed: 0,CELL_LINE_NAME,DRUG_NAME,MIN_CONC,MAX_CONC,LN_IC50
0,PFSK-1,Camptothecin,0.000100,0.1,-1.463887
1,A673,Camptothecin,0.000100,0.1,-4.869455
2,ES5,Camptothecin,0.000100,0.1,-3.360586
3,ES7,Camptothecin,0.000100,0.1,-5.044940
4,EW-11,Camptothecin,0.000100,0.1,-3.741991
...,...,...,...,...,...
242031,SNU-175,N-acetyl cysteine,2.001054,2000.0,10.127082
242032,SNU-407,N-acetyl cysteine,2.001054,2000.0,8.576377
242033,SNU-61,N-acetyl cysteine,2.001054,2000.0,10.519636
242034,SNU-C5,N-acetyl cysteine,2.001054,2000.0,10.694579


In [81]:
data2.to_csv('Features.csv', index = False)

In [80]:
data2.CELL_LINE_NAME.nunique()

969

In [61]:
242036-239995

2041

In [48]:
data.head(5)

Unnamed: 0,DATASET,NLME_RESULT_ID,NLME_CURVE_ID,COSMIC_ID,CELL_LINE_NAME,SANGER_MODEL_ID,TCGA_DESC,DRUG_ID,DRUG_NAME,PUTATIVE_TARGET,PATHWAY_NAME,COMPANY_ID,WEBRELEASE,MIN_CONC,MAX_CONC,LN_IC50,AUC,RMSE,Z_SCORE
0,GDSC2,343,15946310,683667,PFSK-1,SIDM01132,MB,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.0001,0.1,-1.463887,0.93022,0.089052,0.433123
1,GDSC2,343,15946548,684052,A673,SIDM00848,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.0001,0.1,-4.869455,0.61497,0.111351,-1.4211
2,GDSC2,343,15946830,684057,ES5,SIDM00263,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.0001,0.1,-3.360586,0.791072,0.142855,-0.599569
3,GDSC2,343,15947087,684059,ES7,SIDM00269,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.0001,0.1,-5.04494,0.59266,0.135539,-1.516647
4,GDSC2,343,15947369,684062,EW-11,SIDM00203,UNCLASSIFIED,1003,Camptothecin,TOP1,DNA replication,1046,Y,0.0001,0.1,-3.741991,0.734047,0.128059,-0.807232


In [54]:
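# data2 as built above holds only 5 columns, so sample the wider view from data.
data[['CELL_LINE_NAME', 'TCGA_DESC', 'DRUG_NAME', 'PUTATIVE_TARGET', 'PATHWAY_NAME', 'MIN_CONC', 'MAX_CONC', 'LN_IC50']].sample(5)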

Unnamed: 0,CELL_LINE_NAME,TCGA_DESC,DRUG_NAME,PUTATIVE_TARGET,PATHWAY_NAME,MIN_CONC,MAX_CONC,LN_IC50
190345,EW-3,UNCLASSIFIED,GDC0810,"ESR1, ESR2",Hormone-related,0.010005,10.0,4.797885
165515,TGW,NB,Sinularin,,Unclassified,0.003002,3.0,2.465452
26937,TK10,KIRC,NU7441,DNAPK,Genome integrity,0.010005,10.0,5.156818
202751,KNS-62,LUSC,BPD-00008900,,Other,0.010005,10.0,5.26866
52694,LP-1,MM,Sorafenib,"PDGFR, KIT, VEGFR, RAF","Other, kinases",0.010005,10.0,1.356614


In [58]:
# PUTATIVE_TARGET is not a column of data2 as built above, so count on data (same rows).
data.PUTATIVE_TARGET.value_counts(dropna = False).reset_index()

Unnamed: 0,PUTATIVE_TARGET,count
0,,27155
1,"PARP1, PARP2",4714
2,"MEK1, MEK2",4547
3,TOP1,4325
4,EGFR,3836
...,...,...
181,Induces reactive oxygen species,225
182,"RSK, AURKB, PIM1, PIM3",225
183,EGLN1,225
184,"TBK1, PDK1 (PDPK1), IKK, AURKB, AURKC",225


In [59]:
# PATHWAY_NAME is not a column of data2 as built above, so count on data (same rows).
data.PATHWAY_NAME.value_counts(dropna = False).reset_index()

Unnamed: 0,PATHWAY_NAME,count
0,Unclassified,24979
1,PI3K/MTOR signaling,22724
2,Other,21402
3,DNA replication,17650
4,"Other, kinases",17277
5,ERK MAPK signaling,13350
6,Genome integrity,12221
7,Cell cycle,11620
8,Apoptosis regulation,10828
9,Chromatin histone methylation,10612


In [43]:
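# Unique (DRUG_ID, DRUG_NAME) pairs
data34 = data[['DRUG_ID', 'DRUG_NAME']].drop_duplicates()
# Number of DRUG_IDs recorded for each DRUG_NAME
data35 = data34.groupby(['DRUG_NAME'])['DRUG_ID'].count().reset_index()
# Keep only the names that map to more than one DRUG_ID
data36 = data34.merge(data35[data35.DRUG_ID>1]['DRUG_NAME'], how = 'right')
data36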

Unnamed: 0,DRUG_ID,DRUG_NAME
0,1803,Acetalax
1,1804,Acetalax
2,1811,Dactinomycin
3,1911,Dactinomycin
4,1007,Docetaxel
5,1819,Docetaxel
6,1200,Fulvestrant
7,1816,Fulvestrant
8,1627,GSK343
9,2037,GSK343
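
The dataset has 295 distinct `DRUG_ID` values but only 286 distinct `DRUG_NAME` values, so some names must map to several IDs, as the table above shows. The same lookup can be sketched in a single step with a group-wise `nunique`. `names_with_multiple_ids` is a hypothetical helper and the toy frame is illustrative.

```python
import pandas as pd

def names_with_multiple_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Return (DRUG_ID, DRUG_NAME) pairs for names that map to more than one DRUG_ID."""
    pairs = df[["DRUG_ID", "DRUG_NAME"]].drop_duplicates()
    # For every pair, the number of distinct IDs seen for its name.
    ids_per_name = pairs.groupby("DRUG_NAME")["DRUG_ID"].transform("nunique")
    return (
        pairs[ids_per_name > 1]
        .sort_values(["DRUG_NAME", "DRUG_ID"])
        .reset_index(drop=True)
    )

# Toy stand-in for `data`; IDs and names are taken from the outputs above.
toy = pd.DataFrame({
    "DRUG_ID": [1003, 1007, 1819, 1007],
    "DRUG_NAME": ["Camptothecin", "Docetaxel", "Docetaxel", "Docetaxel"],
})
dupes = names_with_multiple_ids(toy)
print(dupes)
```

Whether the repeated names should be merged or kept apart by `DRUG_ID` depends on how the screens behind each ID differ, which this sketch does not address.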
