# Spearman correlation calculation for binding scores and TPM levels - single binding dataset

We are given binding data for an unknown target, measured on the following cell lines of binding panel 1 from Takahashi et al. 1988 (cf. Table S2). Each entry gives the cell line name followed by its binding score:

* BT-20 500
* AN3 CA 1000
* C-33 A 1000
* Hep 3B2.1-7 2000
* IGROV-1 2500
* SK-HEP-1 2500
* SK-LU-1 2600
* SK-CO-1 3000
* MAHLAVU 3700
* PLC/PRF/5 4000
* SK-UT-1 4000
* Calu-3 4000
* A-427 4000
* WiDr 4700
* SW 403 5000
* JEG-3 5000
* FOCUS 7000
* Caov-3 7000
* SK-MEL-5 8000
* LS 180 10500
* Chang Liver 12000
* Hep G2 12500
* Caco2 13000
* A-498 15000
* HeLa 20000
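
The list above can be turned into a name-to-score mapping and compared with the TPM column names before any correlation is computed. The following is only a sketch: `parse_binding_lines` is a hypothetical helper, and just a few of the entries and column names from this notebook are reproduced.

```python
def parse_binding_lines(lines):
    """Parse 'name score' lines; cell line names may contain spaces."""
    scores = {}
    for line in lines:
        text = line.lstrip("* ").strip()  # drop the bullet marker
        if not text:
            continue
        # the score is the last whitespace-separated token
        name, value = text.rsplit(maxsplit=1)
        scores[name] = float(value)
    return scores


# a few entries from the panel 1 list (not the full panel)
panel_lines = [
    "* BT-20 500",
    "* Hep 3B2.1-7 2000",
    "* MAHLAVU 3700",
    "* WiDr 4700",
    "* SW 403 5000",
    "* HeLa 20000",
]
# the matching subset of column names used for the TPM table
tpm_columns = ["BT-20", "Hep 3B2.1-7", "HT-29 WiDr", "SW403", "HeLa"]

binding = parse_binding_lines(panel_lines)
# names from the binding list with no identically spelled TPM column
unmatched = [name for name in binding if name not in tpm_columns]
print(unmatched)
```

Exact string matching flags spelling variants (here `WiDr` versus `HT-29 WiDr`, and `SW 403` versus `SW403`) as well as names that have no TPM column at all, so the unmatched names still need to be reviewed by hand.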

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# the column names (cell lines) for all data, spelled as in the TPM table
column_names = ['BT-20', 'AN3 CA', 'C-33 A', 'Hep 3B2.1-7', 'IGROV-1', 'SK-HEP-1', 'SK-LU-1', 'SK-CO-1', 'PLC/PRF/5', 'SK-UT-1', 'Calu-3', 'A-427', 'HT-29 WiDr', 'SW403', 'JEG-3', 'Caov-3', 'SK-MEL-5', 'LS 180', 'Hep G2', 'A-498', 'HeLa']

# create a pandas table with a single row (index 0) holding the target binding values
target = pd.DataFrame(columns=column_names)
target.loc[0] = [500, 1000, 1000, 2000, 2500, 2500, 2600, 3000, 4000, 4000, 4000, 4000, 4700, 5000, 5000, 7000, 8000, 10500, 12500, 15000, 20000]


target

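|  | BT-20 | AN3 CA | C-33 A | Hep 3B2.1-7 | IGROV-1 | SK-HEP-1 | SK-LU-1 | SK-CO-1 | PLC/PRF/5 | SK-UT-1 | ... | A-427 | HT-29 WiDr | SW403 | JEG-3 | Caov-3 | SK-MEL-5 | LS 180 | Hep G2 | A-498 | HeLa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 500 | 1000 | 1000 | 2000 | 2500 | 2500 | 2600 | 3000 | 4000 | 4000 | ... | 4000 | 4700 | 5000 | 5000 | 7000 | 8000 | 10500 | 12500 | 15000 | 20000 |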


We are also given RNA-seq data (TPM) for the same cell lines.

The file TPM-AVG125filtered.csv was edited in Excel so that its columns match the cell lines of the binding panel under consideration, in the same order, and then saved as TPM-AVGfil125-X.csv (X = number of the binding panel). This file is loaded in the next step.
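
The manual Excel step can also be done in pandas. The following is a minimal sketch, not the procedure used here: `align_tpm_to_panel` is a hypothetical helper, and the small table stands in for TPM-AVG125filtered.csv, whose layout is assumed to match the table loaded in the next step (a `Gene ID` index, a `Gene Name` column, and one column per cell line).

```python
import pandas as pd


def align_tpm_to_panel(tpm, panel_columns, name_column="Gene Name"):
    """Keep only the panel's cell lines, in panel order, after the gene name."""
    missing = [c for c in panel_columns if c not in tpm.columns]
    if missing:
        # fail loudly instead of silently correlating misaligned columns
        raise KeyError(f"cell lines missing from TPM table: {missing}")
    return tpm[[name_column] + list(panel_columns)]


# toy stand-in for the unordered TPM file (values from the SLC3A2 and DPM1 rows)
tpm = pd.DataFrame(
    {
        "Gene Name": ["SLC3A2", "DPM1"],
        "C-33 A": [150.5, 250.5],
        "BT-20": [132.5, 243.5],
        "AN3 CA": [169.0, 76.0],
    },
    index=pd.Index(["ENSG00000168003", "ENSG00000000419"], name="Gene ID"),
)
panel = ["BT-20", "AN3 CA", "C-33 A"]  # order of the binding panel

aligned = align_tpm_to_panel(tpm, panel)
print(list(aligned.columns))
```

Raising an error for a missing cell line is a design choice of this sketch: the correlation step pairs binding scores and TPM values by position, so a dropped or misspelled column would otherwise shift every later value.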

In [3]:
data = pd.read_csv('TPM-AVGfil125-1.csv', index_col=0)
data

| Gene ID | Gene Name | BT-20 | AN3 CA | C-33 A | Hep 3B2.1-7 | IGROV-1 | SK-HEP-1 | SK-LU-1 | SK-CO-1 | PLC/PRF/5 | ... | A-427 | HT-29 WiDr | SW403 | JEG-3 | Caov-3 | SK-MEL-5 | LS 180 | Hep G2 | A-498 | HeLa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000168003 | SLC3A2 | 132.50 | 169.0 | 150.5 | 133.0 | 178.0 | 142.0 | 68.0 | 153.0 | 199.0 | ... | 138.00 | 165.250 | 88.666667 | 319.0 | 248.0 | 299.00 | 235.666667 | 164.5 | 420.50 | 577.0 |
| ENSG00000000419 | DPM1 | 243.50 | 76.0 | 250.5 | 146.0 | 168.0 | 125.0 | 41.0 | 125.0 | 183.5 | ... | 159.50 | 212.250 | 199.333333 | 137.0 | 276.0 | 151.00 | 150.333333 | 251.0 | 181.50 | 136.5 |
| ENSG00000001036 | FUCA2 | 77.00 | 156.5 | 10.5 | 241.0 | 190.5 | 88.0 | 36.0 | 172.0 | 319.0 | ... | 86.00 | 116.000 | 136.666667 | 85.0 | 220.0 | 90.00 | 150.000000 | 267.0 | 124.00 | 212.5 |
| ENSG00000002586 | CD99 | 70.50 | 331.0 | 94.0 | 45.0 | 129.5 | 140.0 | 20.0 | 47.0 | 263.0 | ... | 87.00 | 2.750 | 167.000000 | 87.0 | 139.0 | 144.50 | 42.000000 | 190.5 | 118.00 | 159.0 |
| ENSG00000002834 | LASP1 | 354.00 | 132.5 | 76.0 | 242.0 | 281.0 | 320.0 | 99.0 | 109.5 | 159.0 | ... | 161.00 | 357.000 | 154.000000 | 155.0 | 116.5 | 42.00 | 164.000000 | 216.0 | 292.50 | 155.0 |
| ENSG00000004059 | ARF5 | 254.00 | 399.0 | 303.0 | 169.0 | 164.5 | 195.0 | 30.0 | 308.0 | 187.5 | ... | 200.50 | 204.000 | 134.666667 | 182.0 | 180.5 | 143.00 | 159.666667 | 121.0 | 235.50 | 169.5 |
| ENSG00000004142 | POLDIP2 | 282.50 | 244.5 | 291.5 | 260.0 | 149.0 | 228.0 | 82.0 | 153.5 | 223.0 | ... | 147.00 | 220.500 | 153.333333 | 117.0 | 100.0 | 177.50 | 233.333333 | 195.5 | 297.50 | 274.0 |
| ENSG00000004478 | FKBP4 | 258.50 | 190.5 | 248.5 | 127.0 | 205.0 | 75.0 | 45.0 | 70.0 | 148.5 | ... | 360.50 | 265.250 | 45.000000 | 234.0 | 330.5 | 101.50 | 175.000000 | 97.0 | 89.00 | 254.5 |
| ENSG00000004779 | NDUFAB1 | 102.50 | 146.0 | 232.0 | 98.0 | 87.0 | 63.0 | 22.0 | 102.5 | 87.0 | ... | 162.00 | 171.250 | 163.000000 | 95.0 | 67.5 | 99.50 | 176.666667 | 224.0 | 121.50 | 180.5 |
| ENSG00000005022 | SLC25A5 | 2812.00 | 864.0 | 802.5 | 443.0 | 886.0 | 986.0 | 203.0 | 1351.0 | 837.0 | ... | 931.50 | 1353.250 | 864.333333 | 982.0 | 1693.5 | 964.00 | 1100.333333 | 764.0 | 746.50 | 1040.0 |


## Target Elucidation

### Correlation Analysis

Compute the Spearman (rank) correlation between the target binding scores and the TPM values of each gene across the cell lines, then rank the genes by that correlation.
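
To make the measure concrete: Spearman's coefficient is the Pearson correlation of the ranks, with tied values (the binding scores contain several) sharing the mean of their rank positions. The pure-Python sketch below is for illustration only and uses hypothetical helper names; the notebook itself relies on `scipy.stats.spearmanr`, which also returns a p-value. The example data are the first five binding scores and the SLC3A2 TPM values for the same cell lines.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


binding = [500, 1000, 1000, 2000, 2500]     # BT-20 ... IGROV-1
slc3a2 = [132.5, 169.0, 150.5, 133.0, 178.0]  # TPM for the same cell lines

print(average_ranks(binding))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(spearman(binding, slc3a2))
```

Because only the ranks enter the calculation, the result is unchanged by any monotonic rescaling of the binding scores or the TPM values.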

In [4]:
import scipy.stats

# binding scores in panel order; the TPM columns must follow the same order
target_values = target.iloc[0].tolist()

# lists holding the correlation and p-value of every gene with the target
correlations = []
pvalues = []

# compute the Spearman correlation gene by gene
for index, row in data.iterrows():
    # row.iloc[0] is the gene name; the remaining entries are the TPM values
    rho, pvalue = scipy.stats.spearmanr(target_values, row.iloc[1:].tolist())
    correlations.append(rho)
    pvalues.append(pvalue)

# add correlation and p-value as columns to the dataframe
data['spearman with target'] = correlations
data['spearman pvalue'] = pvalues

# sort the genes from highest to lowest correlation
data.sort_values('spearman with target', ascending=False, inplace=True)

In [5]:
target

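|  | BT-20 | AN3 CA | C-33 A | Hep 3B2.1-7 | IGROV-1 | SK-HEP-1 | SK-LU-1 | SK-CO-1 | PLC/PRF/5 | SK-UT-1 | ... | A-427 | HT-29 WiDr | SW403 | JEG-3 | Caov-3 | SK-MEL-5 | LS 180 | Hep G2 | A-498 | HeLa |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 500 | 1000 | 1000 | 2000 | 2500 | 2500 | 2600 | 3000 | 4000 | 4000 | ... | 4000 | 4700 | 5000 | 5000 | 7000 | 8000 | 10500 | 12500 | 15000 | 20000 |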


In [6]:
data

| Gene ID | Gene Name | BT-20 | AN3 CA | C-33 A | Hep 3B2.1-7 | IGROV-1 | SK-HEP-1 | SK-LU-1 | SK-CO-1 | PLC/PRF/5 | ... | SW403 | JEG-3 | Caov-3 | SK-MEL-5 | LS 180 | Hep G2 | A-498 | HeLa | spearman with target | spearman pvalue |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ENSG00000206503 | HLA-A | 46.0 | 2.50 | 31.5 | 279.0 | 46.50 | 187.0 | 182.0 | 226.0 | 310.0 | ... | 871.666667 | 5.0 | 379.00 | 334.0 | 272.666667 | 309.5 | 429.0 | 692.50 | 0.639067 | 0.001817 |
| ENSG00000167996 | FTH1 | 952.0 | 994.00 | 473.0 | 943.0 | 1422.50 | 346.0 | 482.0 | 950.5 | 504.5 | ... | 1462.333333 | 2100.0 | 3364.50 | 1218.5 | 3195.333333 | 1119.5 | 1344.0 | 3984.50 | 0.632546 | 0.002091 |
| ENSG00000115307 | AUP1 | 129.5 | 148.00 | 160.5 | 203.0 | 144.50 | 139.0 | 62.0 | 155.0 | 248.5 | ... | 139.666667 | 205.0 | 153.50 | 248.5 | 253.666667 | 196.0 | 257.0 | 212.00 | 0.614812 | 0.003018 |
| ENSG00000103257 | SLC7A5 | 233.5 | 117.50 | 405.0 | 239.0 | 432.00 | 260.0 | 14.0 | 87.0 | 536.5 | ... | 73.666667 | 620.0 | 535.00 | 508.5 | 430.000000 | 490.5 | 1050.0 | 1826.00 | 0.612982 | 0.003131 |
| ENSG00000123562 | MORF4L2 | 302.5 | 237.00 | 293.0 | 259.0 | 414.00 | 435.0 | 100.0 | 296.0 | 347.5 | ... | 219.000000 | 742.0 | 475.00 | 523.0 | 377.666667 | 398.0 | 536.0 | 615.00 | 0.610898 | 0.003263 |
| ENSG00000168003 | SLC3A2 | 132.5 | 169.00 | 150.5 | 133.0 | 178.00 | 142.0 | 68.0 | 153.0 | 199.0 | ... | 88.666667 | 319.0 | 248.00 | 299.0 | 235.666667 | 164.5 | 420.5 | 577.00 | 0.607113 | 0.003516 |
| ENSG00000161011 | SQSTM1 | 128.5 | 79.00 | 42.0 | 141.0 | 75.50 | 88.0 | 97.0 | 89.5 | 280.0 | ... | 144.000000 | 447.0 | 341.50 | 184.0 | 144.333333 | 85.0 | 330.5 | 229.50 | 0.586898 | 0.005160 |
| ENSG00000266412 | NCOA4 | 136.5 | 97.50 | 83.5 | 189.0 | 144.50 | 91.0 | 25.0 | 69.0 | 95.5 | ... | 148.666667 | 340.0 | 278.00 | 92.5 | 160.000000 | 436.0 | 171.5 | 189.50 | 0.555597 | 0.008925 |
| ENSG00000120708 | TGFBI | 9.0 | 0.55 | 1.5 | 273.0 | 0.70 | 5.0 | 187.0 | 323.5 | 72.5 | ... | 298.333333 | 0.2 | 77.00 | 229.5 | 68.666667 | 383.5 | 956.5 | 379.50 | 0.534077 | 0.012636 |
| ENSG00000168209 | DDIT4 | 22.0 | 244.00 | 37.5 | 75.0 | 206.00 | 79.0 | 21.0 | 56.0 | 89.0 | ... | 98.333333 | 415.0 | 112.00 | 116.5 | 378.333333 | 78.0 | 327.5 | 310.50 | 0.530817 | 0.013294 |


In [7]:
data.to_csv('spearmanAVG-1.csv')

Repeat this protocol for each of the other binding datasets, using the corresponding binding scores from Table S2.
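
When repeating the analysis for several panels, it may help to wrap the ranking step in a function. The sketch below is one possible shape under stated assumptions, not the code used above: `rank_genes_by_spearman` is a hypothetical name, it matches cell lines by column label rather than by position, and it computes the coefficient as a Pearson correlation of ranks in pandas, so it yields no p-values (unlike `scipy.stats.spearmanr`). The toy input uses the first five cell lines of panel 1 only.

```python
import pandas as pd


def rank_genes_by_spearman(tpm, binding):
    """Spearman correlation of each gene's TPM profile with the binding scores.

    tpm: genes x cell lines, numeric columns only.
    binding: Series of binding scores indexed by cell line name.
    """
    shared = [c for c in binding.index if c in tpm.columns]
    tpm_ranks = tpm[shared].rank(axis=1)     # rank each gene across cell lines
    binding_ranks = binding[shared].rank()   # tied scores share an average rank
    # Pearson correlation of ranks, aligned on cell line labels
    rho = tpm_ranks.T.corrwith(binding_ranks)
    return rho.sort_values(ascending=False)


# toy data: binding scores and TPM values for five cell lines of panel 1
binding = pd.Series(
    [500, 1000, 1000, 2000, 2500],
    index=["BT-20", "AN3 CA", "C-33 A", "Hep 3B2.1-7", "IGROV-1"],
)
tpm = pd.DataFrame(
    [[132.5, 169.0, 150.5, 133.0, 178.0],   # SLC3A2
     [243.5, 76.0, 250.5, 146.0, 168.0]],   # DPM1
    index=["ENSG00000168003", "ENSG00000000419"],
    columns=binding.index,
)

ranking = rank_genes_by_spearman(tpm, binding)
print(ranking)
```

Matching on labels removes the need to keep the TPM columns in panel order, but it only works if the cell line names are spelled identically in both inputs.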