# Preprocess raw SARS-CoV-2 data into a format accepted by the pipeline

This notebook is provided as a case study for our pipeline. For a second case study with more omics data and samples, see the `src/MSV000085703` directory and associated files. Full details of the study can be found in the original publication:

*Bojkova, D., Klann, K., Koch, B. et al. Proteomics of SARS-CoV-2-infected host cells reveals therapy targets. Nature 583, 469–472 (2020).* [https://doi.org/10.1038/s41586-020-2332-7](https://doi.org/10.1038/s41586-020-2332-7)

The authors provided a set of Excel spreadsheets containing the multi-omics data.
- [Supplementary table 1: Translatome](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2332-7/MediaObjects/41586_2020_2332_MOESM2_ESM.xlsx)
- [Supplementary table 2: Proteome](https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-020-2332-7/MediaObjects/41586_2020_2332_MOESM3_ESM.xlsx)

This source data is also included in this repository.

Each table was saved directly from the spreadsheet as a tab-separated file, `data/proteome.txt` and `data/translatome.txt` respectively. No changes were made to the content.

## Summary of the original study

The authors investigated the proteome (global protein levels) and translatome (proteins at time of translation) of a human cell line infected with SARS-CoV-2. Protein levels were measured at multiple time points. As a case study, we apply our pipeline to integrate the two omics datasets from this experiment.

## Summary of our analysis
We integrated proteomics and translatomics data for 24 samples. Eight classes were included: `covid states` vs `non-covid states` at four time points (2 h, 6 h, 10 h and 24 h). Classes are balanced (three samples each), but the experiment contains repeated measurements. This Jupyter notebook describes the steps taken to download and parse the input data and metadata, resulting in matrices of continuous values suitable for input into our pipeline.

In [1]:
import re
import pandas as pd

In [2]:
proteome_infile = "../../data/case_study_1/proteome.txt"
translatome_infile = "../../data/case_study_1/translatome.txt"
proteome_mapfile = "../../data/case_study_1/proteome_mapfile.txt"
translatome_mapfile = "../../data/case_study_1/translatome_mapfile.txt"
proteome_outfile = "../../data/case_study_1/diablo_proteome.txt"
translatome_outfile = "../../data/case_study_1/diablo_translatome.txt"
classes_outfile = "../../data/case_study_1/classes_diablo.txt"

In [3]:
proteome = pd.read_csv(proteome_infile, sep="\t")
# Work on an explicit copy so the edits below do not act on a slice of `proteome`
proteome_map = proteome[["UniProt Accession", "Gene Symbol"]].copy()
proteome_map.columns = ["key", "val"]
# Join multi-entry fields with "_" and tag each feature ID with its omics layer
proteome_map["key"] = proteome_map["key"].replace('(;)', '_', regex=True)
proteome_map["key"] = proteome_map["key"].replace('($)', '_prot', regex=True)
proteome_map["val"] = proteome_map["val"].replace('(;)', '_', regex=True)
proteome_map["val"] = proteome_map["val"].replace('(_ )', '_', regex=True)
# Features without a gene symbol fall back to their suffixed accession
proteome_map["val"] = proteome_map["val"].fillna(proteome_map["key"])
proteome_map["val"] = proteome_map["val"].replace('(_prot)', '__FEATUREID', regex=True)
proteome_map.to_csv(proteome_mapfile, sep="\t")

translatome = pd.read_csv(translatome_infile, sep="\t")
# Same steps for the translatome, using the "_tran" suffix
translatome_map = translatome[["Accession", "Gene Symbol01"]].copy()
translatome_map.columns = ["key", "val"]
translatome_map["key"] = translatome_map["key"].replace('(;)', '_', regex=True)
translatome_map["key"] = translatome_map["key"].replace('($)', '_tran', regex=True)
translatome_map["val"] = translatome_map["val"].replace('(;)', '_', regex=True)
translatome_map["val"] = translatome_map["val"].replace('(_ )', '_', regex=True)
translatome_map["val"] = translatome_map["val"].fillna(translatome_map["key"])
translatome_map["val"] = translatome_map["val"].replace('(_tran)', '__FEATUREID', regex=True)
translatome_map.to_csv(translatome_mapfile, sep="\t")
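To make the effect of the mapping step concrete, the sketch below applies the same replacements to a small made-up table. The helper name `build_feature_map` and the toy accessions and gene symbols are assumptions for illustration only; they are not part of the pipeline or the study data.

```python
import pandas as pd


def build_feature_map(df, key_col, val_col, suffix):
    """Map suffixed feature IDs (key) to gene symbols (val).

    Hypothetical helper that mirrors the mapping cell; not part of the pipeline.
    """
    fmap = df[[key_col, val_col]].copy()
    fmap.columns = ["key", "val"]
    # Join multi-accession entries with "_" and tag the omics layer
    fmap["key"] = fmap["key"].str.replace(";", "_", regex=False) + suffix
    # Same separator clean-up for gene symbols ("A; B" becomes "A_B")
    fmap["val"] = (
        fmap["val"]
        .str.replace(";", "_", regex=False)
        .str.replace("_ ", "_", regex=False)
    )
    # Features without a gene symbol fall back to their feature ID
    fmap["val"] = fmap["val"].fillna(fmap["key"])
    fmap["val"] = fmap["val"].str.replace(suffix, "__FEATUREID", regex=False)
    return fmap


# Toy input (made-up rows, not taken from the study)
toy = pd.DataFrame({
    "UniProt Accession": ["P11111", "P22222;P33333", "P44444"],
    "Gene Symbol": ["GENE1", "GENE2; GENE3", None],
})
feature_map = build_feature_map(toy, "UniProt Accession", "Gene Symbol", "_prot")
print(feature_map)
```

In this toy example the third row has no gene symbol, so its `val` becomes the accession followed by `__FEATUREID`, which keeps unnamed features distinguishable downstream.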


In [4]:
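# Re-read the raw table
proteome = pd.read_csv(proteome_infile, sep="\t")
# Class label = sample column name with everything from the first "_" removed;
# columns 2 to 25 hold the 24 samples
proteome_classes = proteome.rename(columns=lambda x: re.sub('_.*','',x)).columns[2:26]
# Normalise column names: spaces to underscores, no trailing underscore
proteome.rename(columns=lambda x: re.sub(' ', '_', x), inplace=True)
proteome.rename(columns=lambda x: re.sub('_$', '', x), inplace=True)
proteome.set_index("UniProt_Accession", inplace=True)
proteome.drop("Gene_Symbol", axis=1, inplace=True)
# Keep the 24 sample columns and transpose to samples x features
proteome_diablo = proteome.iloc[:,:24].T
# Feature IDs get the "_prot" suffix; ";" separators become "_"
proteome_diablo.rename(columns=lambda x: re.sub('$', '_prot', x), inplace=True)
proteome_diablo.rename(columns=lambda x: re.sub(';', '_', x), inplace=True)
proteome_diablo.to_csv(proteome_outfile, sep="\t")
proteome_diablo
# proteome_classes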

UniProt_Accession,A0A0B4J1V1_prot,A0A0B4J2D5_prot,A0A0B4J2F0_prot,A0AV96_prot,A0AVT1_prot,A0FGR8_prot,A0MZ66_prot,A0PK00_prot,A1A4S6_prot,A1L0T0_prot,...,Q9Y6Q5_prot,Q9Y6Q5_Q9BXS5_prot,Q9Y6U3_prot,Q9Y6W3_prot,Q9Y6W5_prot,Q9Y6X2_prot,Q9Y6X4_prot,Q9Y6X9_prot,Q9Y6Y0_prot,Q9Y6Y8_prot
Control_2h,409.991,1546.01,110.127,1176.22,2658.05,889.603,2191.25,24.924,3.2521,927.355,...,867.596,422.453,3978.33,141.15,592.035,142.047,112.916,74.0196,179.183,1835.32
Control_2h_2,406.229,1498.82,116.383,1198.7,2570.76,893.532,2233.28,35.4787,2.25422,897.917,...,893.206,380.552,4083.03,161.916,604.945,131.028,102.811,65.2655,165.204,1770.03
Control_2h_3,412.542,1626.61,113.648,1142.53,2554.98,877.359,2268.62,18.5729,3.17267,959.21,...,885.869,387.17,3882.64,181.52,603.875,130.878,101.021,66.7268,178.048,1749.95
Control_6h_1,408.409,1537.52,104.409,1283.18,2561.47,884.238,2154.78,25.916,4.53747,979.197,...,925.88,417.599,4295.3,126.278,580.255,131.789,108.017,66.9427,192.706,1746.25
Control_6h_2,401.301,1555.54,117.992,1282.27,2526.84,904.276,2201.63,30.1342,2.32585,936.295,...,980.953,387.199,4615.22,166.055,556.247,136.786,104.947,57.5663,177.898,1793.99
Control_6h_3,429.944,1526.2,108.311,1217.57,2552.54,841.393,2234.64,28.7902,2.71079,986.955,...,947.785,390.638,4398.76,178.196,588.794,136.397,105.287,64.2034,194.625,1776.81
Control_10h_1,424.944,1559.43,105.915,1062.1,2487.21,893.053,2104.56,30.654,3.33258,941.908,...,887.223,377.702,4413.6,135.566,616.378,142.495,114.762,69.5718,235.139,1766.62
Control_10h_2,402.828,1435.43,120.005,1131.36,2476.17,939.916,2128.17,36.9144,3.83369,903.491,...,900.762,361.531,4165.39,173.111,563.98,127.242,105.494,72.347,214.714,1728.66
Control_10h_3,434.535,1526.15,107.946,1055.98,2572.75,942.454,2154.49,28.2327,3.05585,971.528,...,924.008,385.88,4472.85,198.223,621.812,127.116,105.91,69.3974,239.484,1764.05
Control_24h_1,411.504,1547.78,102.939,1036.11,2462.46,903.064,2237.39,29.1679,4.03879,958.763,...,854.21,376.822,4053.43,155.441,588.873,137.55,97.9428,85.6926,217.324,1802.76
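The reshaping in this cell can be sketched on a toy table as follows. The sample names, accessions and values are made up, and `n_samples` is a hypothetical variable standing in for the 24 sample columns of the real data.

```python
import pandas as pd

# Toy table in the layout of the spreadsheet: one row per protein,
# one column per sample (made-up values, hypothetical sample names)
toy = pd.DataFrame({
    "UniProt Accession": ["P11111", "P22222;P33333"],
    "Gene Symbol": ["GENE1", "GENE2"],
    "Control 2h_1": [1.0, 2.0],
    "Control 2h_2": [3.0, 4.0],
    "Extra column": ["x", "y"],
})

n_samples = 2  # 24 in the real data

matrix = (
    toy.rename(columns=lambda c: c.replace(" ", "_"))
    .set_index("UniProt_Accession")
    .drop(columns="Gene_Symbol")
    .iloc[:, :n_samples]  # keep only the sample columns
    .T                    # rows become samples, columns become features
)
# Suffix feature IDs with the omics layer and replace ";" separators
matrix.columns = [c.replace(";", "_") + "_prot" for c in matrix.columns]
print(matrix)
```

The result has one row per sample and one column per feature, which is the orientation written to the output file by the cell above.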


In [5]:
# Re-read the raw table
translatome = pd.read_csv(translatome_infile, sep="\t")
# Class label = sample column name with everything from the first "_" removed;
# columns 3 to 26 hold the 24 samples
translatome_classes = translatome.rename(columns=lambda x: re.sub('_.*','',x)).columns[3:27]
translatome.rename(columns=lambda x: re.sub(' ', '_', x), inplace=True)
translatome.rename(columns=lambda x: re.sub('_$', '', x), inplace=True)
translatome.set_index("Accession", inplace=True)
translatome.drop(["Gene_Symbol01", "Species_Names01"], axis=1, inplace=True)
# Keep the 24 sample columns and transpose to samples x features
translatome_diablo = translatome.iloc[:,:24].T
# Replace Excel error markers (and None) with 0
translatome_diablo.replace(["#DIV/0!", "#NUM!", None], 0, inplace=True)
# Non-regex replace: only cells equal to a single space are affected
translatome_diablo.replace(" ", "_", inplace=True)
# Feature IDs get the "_tran" suffix; ";" separators become "_"
translatome_diablo.rename(columns=lambda x: re.sub('$', '_tran', x), inplace=True)
translatome_diablo.rename(columns=lambda x: re.sub(';', '_', x), inplace=True)
translatome_diablo.to_csv(translatome_outfile, sep="\t")
translatome_diablo
# translatome_classes

Accession,P02771_tran,P07148_tran,P09327_tran,P05783_tran,Q9P2E9_tran,P09525_tran,P05787_tran,Q9H3R2_tran,Q12864_tran,P17931_tran,...,Q9Y639_tran,Q9Y673_tran,Q9Y6G3_tran,Q9Y6G9_tran,Q9Y6K0_tran,O14950_tran,P12532_tran,P61204_tran,Q5H9L2_tran,Q71DI3_tran
Control_2h,981.258045,1305.00923,839.773511,604.780284,686.321978,658.811319,485.252738,496.088121,530.905417,520.866102,...,0.800976,1.370368,0.0,0.0,1.762056,51.267299,0.0,5.589031,9.958733,59.898596
Control_2h_2,1183.175939,1280.049817,1031.083693,727.433897,636.582512,602.019906,387.935703,485.892872,430.009577,450.650737,...,0.0,0.0,2.634634,0.0,11.550408,0.0,0.0,0.0,2.481829,100.111166
Control_2h_3,1440.326216,1210.800481,972.479005,784.113305,668.096982,617.299111,579.114302,480.958477,501.978988,472.651453,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Control_6h_1,1205.249514,1154.277967,860.300797,1264.895045,828.087137,710.42688,921.490995,413.705462,618.433205,837.127094,...,3.800407,4.801895,0.337346,0.0,0.0,67.073532,0.0,9.448131,5.387901,70.957065
Control_6h_2,1338.177985,1100.749505,988.012894,1318.04764,794.704113,666.990167,604.649219,371.143724,499.019105,707.69153,...,0.0,5.918495,0.178997,0.0,10.419589,0.0,0.0,0.0,4.485249,89.114337
Control_6h_3,1286.491071,923.31264,962.336146,1405.874444,671.987378,627.434837,859.758949,332.583583,547.755795,712.222383,...,0.0,4.508309,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Control_10h_1,925.301366,773.390082,753.439585,1297.249528,631.514053,657.741191,833.482195,213.913454,557.37562,656.33339,...,2.491082,8.62582,4.802533,0.0,0.0,74.015411,0.0,16.060397,8.74226,47.370639
Control_10h_2,887.729721,751.006613,913.424345,1524.470574,603.273333,667.828527,672.749364,180.680285,446.103672,645.539652,...,0.0,0.0,3.881046,0.0,13.248459,0.0,0.0,0.0,9.727332,88.8583
Control_10h_3,647.200761,305.276062,701.528247,1584.577142,533.51009,547.151209,894.516941,124.340499,461.499413,612.025079,...,0.0,0.0,2.063037,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Control_24h_1,844.290854,938.899082,802.704246,472.917845,594.33488,1346.827328,418.958816,245.548965,433.230956,285.713772,...,0.157091,5.203989,4.189494,0.0,0.921596,50.631322,0.0,6.790811,6.986196,199.794381
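Spreadsheet exports can carry Excel error markers such as `#DIV/0!` and `#NUM!`, which the cell above sets to 0. As an alternative sketch (not what this notebook does), one could coerce every column to numeric and then fill the gaps. The toy values below are made up, and note that this approach would also zero any other non-numeric string.

```python
import pandas as pd

# Toy column values as they might arrive from a spreadsheet export
# (made-up numbers; "#DIV/0!" and "#NUM!" are Excel error markers)
raw = pd.DataFrame({
    "P11111_tran": ["1.5", "#DIV/0!", "2.0"],
    "P22222_tran": ["#NUM!", "0.25", None],
})

# Coerce every column to float: anything non-numeric becomes NaN,
# then NaN is replaced by 0
clean = raw.apply(pd.to_numeric, errors="coerce").fillna(0)
print(clean.dtypes)
print(clean)
```

A side effect of this variant is that every column ends up with a floating-point dtype, which makes it easy to confirm that the matrix holds only continuous values.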


In [6]:
proteome_classes == translatome_classes

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])
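The element-wise comparison above prints one boolean per sample. A more compact check, sketched here on hypothetical labels, is to compare the two label sets in a single call and fail loudly if they differ.

```python
import pandas as pd

# Hypothetical class labels for two omics blocks
labels_a = pd.Index(["Control 2h", "Control 2h", "Control 6h"])
labels_b = pd.Index(["Control 2h", "Control 2h", "Control 6h"])

# Index.equals is True only if both hold the same labels in the same order
same_classes = labels_a.equals(labels_b)
assert same_classes, "class labels differ between omics blocks"
```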

In [7]:
# proteome_classes and translatome_classes are identical, use either
classes = pd.DataFrame(proteome_classes)
# Replace spaces with underscores so labels follow the same style as the sample names
classes[0] = classes[0].str.replace(' ', '_')
classes.to_csv(classes_outfile, sep="\t")
classes

Unnamed: 0,0
0,Control_2h
1,Control_2h
2,Control_2h
3,Control_6h
4,Control_6h
5,Control_6h
6,Control_10h
7,Control_10h
8,Control_10h
9,Control_24h
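Since the pipeline expects balanced classes, it can help to count samples per class after deriving the labels. The sketch below does this for a few hypothetical sample names; the raw naming pattern (a space inside the name and a `_1`, `_2`, `_3` replicate suffix) is an assumption based on the outputs shown in this notebook.

```python
import pandas as pd

# Hypothetical sample names (assumed pattern, not read from the data files)
samples = ["Control 2h_1", "Control 2h_2", "Control 2h_3",
           "Control 6h_1", "Control 6h_2", "Control 6h_3"]

# Drop the replicate suffix, then replace spaces with underscores
labels = (
    pd.Series(samples)
    .str.replace(r"_.*", "", regex=True)
    .str.replace(" ", "_", regex=False)
)

# Number of samples per class; equal counts mean the design is balanced
counts = labels.value_counts()
print(counts)
```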
