# Data Pipeline

This notebook consists of three sections. In [Data Exploration](#explore), we read the data and give an overview of each modality. In [Comparisons](#compare), we investigate the features that are shared across modalities. Finally, in [Pseudo Bulking](#aggregate), we aggregate the single-cell modalities to obtain pseudobulks.

The [Comparisons](#compare) and [Pseudo Bulking](#aggregate) sections are independent of each other. Both, however, rely on variables created in [Data Exploration](#explore): the first cell of each subsection there loads a modality and stores it in a variable, so those cells must be run before any later code in this notebook.

In [1]:
import os
import scanpy as sp
import anndata as ad
import pandas as pd 

In [2]:
# Path to the data
cwd = os.getcwd()
data_path = os.path.abspath(os.path.join(cwd, "../data/data"))
print(data_path)

/Users/shakiba/Desktop/integraph/data/data


In [3]:
def read(file_name, dtype, sheet=None):
    """Load <data_path>/<file_name>.<dtype> as an AnnData object."""
    path = os.path.join(data_path, file_name + "." + dtype)
    if dtype == "csv":
        data = sp.read_csv(path, delimiter=",")
    elif dtype == "tsv":
        data = sp.read_csv(path, delimiter="\t", dtype=str)
    elif dtype == "h5ad":
        data = sp.read_h5ad(path)
    elif dtype == "xlsx":
        data = ad.read_excel(path, sheet, dtype="str")
    elif dtype == "txt":
        data = sp.read_text(path, delimiter="\t")
    else:
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unsupported file type: {dtype!r}")
    return data

## Data Exploration <a id='explore'></a>

This part of the notebook contains code for 1) separating ADT and scRNA, 2) extracting the bulkRNA feature names, and 3) generating an overview of the data contained in each modality.
Every output file is saved in the current folder.
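
The separation in step 1 relies on a naming convention: ADT features carry the prefix `AB_`, and every other feature is treated as RNA. A minimal sketch of that split on a toy list of names follows. The helpers `split_by_prefix` and `strip_prefix` and the example names are illustrative assumptions, not part of the notebook's own code.

```python
def split_by_prefix(names, prefix="AB_"):
    """Partition feature names into (prefixed, other), preserving order."""
    prefixed, other = [], []
    for name in names:
        (prefixed if name.startswith(prefix) else other).append(name)
    return prefixed, other


def strip_prefix(name, prefix="AB_"):
    """Drop a leading prefix only; later occurrences are left untouched."""
    return name[len(prefix):] if name.startswith(prefix) else name


toy_names = ["AB_CD80", "ISG15", "AB_CD86", "NOC2L"]  # hypothetical example
adt_toy, rna_toy = split_by_prefix(toy_names)
adt_toy_clean = [strip_prefix(n) for n in adt_toy]
print(adt_toy_clean, rna_toy)
```

Slicing off the prefix by length, rather than splitting on it, keeps a name intact even if `AB_` happens to appear again later in the string.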

### ADT and SC RNA

In [4]:
ADT_and_SC_RNA = read("COMBAT-CITESeq-DATA", "h5ad")
adt_names = []
scRNA_names = []

# Determine ADT feature names and RNA feature names
for name in ADT_and_SC_RNA.var_names: 
    if name.startswith("AB_"): # ADT data start with AB_ 
        adt_names.append(name)
    else:
        scRNA_names.append(name)

# Extract ADT and scRNA 
ADT = ADT_and_SC_RNA[:, adt_names]
scRNA = ADT_and_SC_RNA[:, scRNA_names]

# Remove the prefix AB_ from ADT features (slice it off so a later "AB_" in a name is kept)
adt_names = [name[len("AB_"):] for name in adt_names]

print("ADT: ", ADT, "\n\n")
print("single cell RNA: ", scRNA, "\n\n")

ADT:  View of AnnData object with n_obs × n_vars = 836148 × 192
    obs: 'Annotation_cluster_id', 'Annotation_cluster_name', 'Annotation_minor_subset', 'Annotation_major_subset', 'Annotation_cell_type', 'GEX_region', 'QC_ngenes', 'QC_total_UMI', 'QC_pct_mitochondrial', 'QC_scrub_doublet_scores', 'TCR_chain_composition', 'TCR_clone_ID', 'TCR_clone_count', 'TCR_clone_proportion', 'TCR_contains_unproductive', 'TCR_doublet', 'TCR_chain_TRA', 'TCR_v_gene_TRA', 'TCR_d_gene_TRA', 'TCR_j_gene_TRA', 'TCR_c_gene_TRA', 'TCR_productive_TRA', 'TCR_cdr3_TRA', 'TCR_umis_TRA', 'TCR_chain_TRA2', 'TCR_v_gene_TRA2', 'TCR_d_gene_TRA2', 'TCR_j_gene_TRA2', 'TCR_c_gene_TRA2', 'TCR_productive_TRA2', 'TCR_cdr3_TRA2', 'TCR_umis_TRA2', 'TCR_chain_TRB', 'TCR_v_gene_TRB', 'TCR_d_gene_TRB', 'TCR_j_gene_TRB', 'TCR_c_gene_TRB', 'TCR_productive_TRB', 'TCR_chain_TRB2', 'TCR_v_gene_TRB2', 'TCR_d_gene_TRB2', 'TCR_j_gene_TRB2', 'TCR_c_gene_TRB2', 'TCR_productive_TRB2', 'TCR_cdr3_TRB2', 'TCR_umis_TRB2', 'BCR_umis_HC', 'BCR

#### ADT

In [66]:
adt_dense = ADT.X.toarray()  # densify once instead of once per statistic

summary_ADT = {"feature" : adt_names,
           "min" : adt_dense.min(axis = 0),
           "max" : adt_dense.max(axis = 0),
           "mean" : adt_dense.mean(axis = 0),
           "var" : adt_dense.var(axis = 0)}

df_adt= pd.DataFrame(summary_ADT)

print("[", adt_dense.min(), ",", adt_dense.max(), "]") # Overall range of proteins

[ -67.67877 , 460.98032 ]
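
The same four statistics (min, max, mean, variance per feature) are computed for several modalities in this notebook. As a sketch of how that could be factored into one helper, the function below takes a dense 2-D array with observations as rows and features as columns. The name `summarize_features` and the toy values are assumptions for illustration; a sparse matrix would first have to be converted with `.toarray()`.

```python
import numpy as np
import pandas as pd


def summarize_features(X, feature_names):
    """Return per-feature min, max, mean and variance for a 2-D array.

    Rows are observations and columns are features, as in AnnData.X.
    """
    X = np.asarray(X, dtype=float)
    return pd.DataFrame({
        "feature": list(feature_names),
        "min": X.min(axis=0),
        "max": X.max(axis=0),
        "mean": X.mean(axis=0),
        "var": X.var(axis=0),  # population variance (ddof=0), NumPy's default
    })


toy_X = np.array([[1.0, -2.0], [3.0, 4.0]])  # hypothetical: 2 cells x 2 features
toy_summary = summarize_features(toy_X, ["CD80", "CD86"])
print(toy_summary)
```

Note that the bulk RNA matrix in this notebook stores features as rows, so it would need to be transposed before being passed to a helper like this.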


In [9]:
# Content overview
ADT.to_df().head()

Unnamed: 0,AB_CD80,AB_CD86,AB_CD274_B7_H1_PD_L1,AB_CD273_B7_DC_PD_L2,AB_CD275_B7_H2_ICOSL,AB_humanCD11b,AB_CD252_OX40L,AB_CD137L_4_1BBLigand,AB_CD155_PVR,AB_CD112_Nectin_2,...,AB_CD101_BB27,AB_CD360_IL_21R,AB_CD88_C5aR,AB_HLA_F,AB_NLRP2,AB_Podocalyxin,AB_CD224,AB_c_Met,AB_CD258_LIGHT,AB_DR3_TRAMP
AAACCTGAGAAAGTGG-1-gPlexA1,1.98787,1.921781,2.613414,0.456505,1.482558,1.789405,1.206598,2.821688,1.885517,1.059094,...,0.840967,2.21781,1.220086,0.12453,1.719992,1.028112,1.729305,1.463706,1.785078,1.601881
AAACCTGAGCGGATCA-1-gPlexA1,-0.539351,0.442409,2.392834,1.047547,0.131874,1.147668,0.541517,1.990631,2.284331,1.762416,...,3.184458,2.73134,1.006678,-0.142191,1.20244,1.168217,3.29596,0.679976,2.942873,2.06682
AAACCTGAGGACATTA-1-gPlexA1,0.993282,1.441381,0.310766,-0.556409,1.708025,0.19521,0.37591,2.114272,-0.618993,-0.049037,...,0.937668,0.563752,0.735085,0.130218,-0.380042,0.24376,0.863445,0.484824,0.713824,1.31677
AAACCTGAGGCGACAT-1-gPlexA1,0.838407,2.64194,0.344012,0.189955,1.021477,2.683172,1.320957,0.759884,1.35568,3.498369,...,0.947138,-0.744993,1.814052,1.67303,1.307825,1.210711,2.648582,0.606611,1.100374,-0.663722
AAACCTGAGGGAACGG-1-gPlexA1,1.172756,14.549344,-0.884014,1.349209,1.489393,4.734301,1.522198,0.871919,5.137852,4.039271,...,1.49347,2.687017,5.53505,1.429137,0.657836,2.042859,6.36786,1.316507,1.871877,1.279506


In [None]:
# Save ADT information
ADT.write_h5ad("adt.h5ad")

with open ('adt_feature_names.txt', 'w') as file:  
    for name in adt_names:
        file.write(name + "\n") 
        
df_adt.to_excel("summary_ADT.xlsx") 
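
The pattern of writing one feature name per line recurs for every modality and every shared-feature list. One possible way to avoid repeating the loop is a small helper such as the hypothetical `write_lines` below; it is demonstrated on a temporary directory so that it does not touch the notebook's output files.

```python
import tempfile
from pathlib import Path


def write_lines(path, names):
    """Write one name per line and return how many names were written."""
    names = [str(name) for name in names]
    Path(path).write_text("".join(name + "\n" for name in names), encoding="utf-8")
    return len(names)


# Demonstration on a throwaway file with hypothetical feature names
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "toy_feature_names.txt"
    n_written = write_lines(target, ["CD80", "CD86"])
    toy_content = target.read_text(encoding="utf-8")

print(n_written, repr(toy_content))
```

With such a helper, the save step above would reduce to a call like `write_lines("adt_feature_names.txt", adt_names)`.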

#### scRNA

In [8]:
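import numpy as np

def to_1d(x):
    # Sparse results expose .toarray(); np.matrix and ndarray results do not
    x = x.toarray() if hasattr(x, "toarray") else x
    return np.asarray(x).ravel()

summary_scRNA = {"feature" : scRNA_names,
           "min" : to_1d(scRNA.X.min(axis = 0)),
           "max" : to_1d(scRNA.X.max(axis = 0))}

df_scRNA= pd.DataFrame(summary_scRNA)

print("[", scRNA.X.min(), ",", scRNA.X.max(), "]") # Overall range of expression values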

In [None]:
# Data overview
scRNA.to_df().head()

Unnamed: 0,OR4F5,OR4F29,OR4F16,SAMD11,NOC2L,KLHL17,PLEKHN1,PERM1,HES4,ISG15,...,AC007325.2,BX072566.1,AL354822.1,AC023491.2,AC004556.3,AC233755.2,AC233755.1,AC240274.1,AC213203.4,AC213203.1
AAACCTGAGAAAGTGG-1-gPlexA1,0.0,0.0,0.0,0.0,1.553033,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCTGAGCGGATCA-1-gPlexA1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.426129,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCTGAGGACATTA-1-gPlexA1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCTGAGGCGACAT-1-gPlexA1,0.0,0.0,0.0,0.0,1.800563,0.0,0.0,0.0,0.0,2.407496,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
AAACCTGAGGGAACGG-1-gPlexA1,0.0,0.0,0.0,0.0,1.445163,0.0,0.0,0.0,0.0,3.293625,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Save scRNA information
scRNA.write_h5ad("scRNA.h5ad")

with open ('scRNA_feature_names.txt', 'w') as file:  
    for name in scRNA_names:
        file.write(name + "\n")  
        
df_scRNA.to_excel("summary_scRNA.xlsx") 

### bulk RNA

In [10]:
bulk_RNA = read("Logcpm_143_23063", "txt")
print("bulk RNA: ", bulk_RNA, "\n\n") # Note that here features are rows

bulk_RNA2 = read("module.gene.membership", "tsv") 
print("bulk RNA: ", bulk_RNA2, "\n\n") # Maps gene_ids to gene_names

bulkRNA_joined = bulk_RNA.to_df().join(bulk_RNA2.to_df().set_index("gene_id")) # Join the two data frames based on "gene_id"
bulkRNA_joined

bulk RNA:  AnnData object with n_obs × n_vars = 23063 × 143 


bulk RNA:  AnnData object with n_obs × n_vars = 23063 × 4 




  utils.warn_names_duplicates("obs")


Unnamed: 0,S00016-Ja001T-TRGa,S00020-Ja003T-TRGa,S00024-Ja003T-TRGa,S00027-Ja003T-TRGa,S00028-Ja001T-TRGa,S00030-Ja003T-TRGa,S00033-Ja001T-TRGa,S00033-Ja003T-TRGa,S00034-Ja005T-TRGa,S00033-Ja005T-TRGa,...,S00094-Ja005T-TRGa,S00095-Ja005T-TRGa,S00096-Ja005T-TRGa,S00097-Ja003T-TRGa,S00099-Ja005T-TRGa,S00104-Ja003T-TRGa,S00106-Ja003T-TRGa,gene_name,membership,p.value
ENSG00000000003,0.358735,0.420317,0.212498,0.000000,0.420037,0.000000,0.000000,0.044182,0.037569,0.266312,...,0.240967,0.269932,0.189985,0.285038,0.352373,0.581883,0.000000,TSPAN6,0.634714068658144,1.70946329554292e-17
ENSG00000000419,3.933231,4.017549,4.351443,4.102646,4.466349,3.650819,4.111353,4.330951,4.395805,4.293557,...,3.874222,3.827185,4.275609,4.241249,4.194892,4.251307,3.777344,DPM1,0.862024859927373,1.90516376444767e-43
ENSG00000000457,4.658654,4.322857,4.697171,4.816981,4.586892,4.620245,4.640644,4.646106,4.715597,4.771581,...,4.410680,4.509940,5.482311,4.626495,4.505745,4.417440,4.359805,SCYL3,0.689133234326797,1.82706543764349e-21
ENSG00000000460,2.788302,3.509472,3.071359,2.453118,3.245321,3.029654,2.769487,2.762198,2.738179,2.773700,...,2.794582,3.022211,3.098520,2.858558,3.158803,2.590063,2.596908,C1orf112,0.43943150588816,4.00471354671356e-08
ENSG00000000938,9.294971,9.078311,9.836819,10.287786,9.612483,10.320160,10.118597,10.629440,9.767962,10.012075,...,10.088363,9.523195,10.219047,9.810799,9.909992,9.223608,9.199473,FGR,0.935942743827098,8.78762646249344e-66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ENSG00000288597,0.312633,0.339987,0.074365,0.000000,0.047952,0.182959,0.223352,0.000000,0.109893,0.112489,...,0.165031,0.300381,0.189985,0.218882,0.096484,0.000000,0.195488,AC234782.4,0.2517017045725,0.00242557060235678
ENSG00000288598,0.425238,1.851654,0.338553,0.448594,1.598257,0.225291,0.223352,0.246822,0.707687,0.216838,...,0.710390,0.269932,0.611473,0.926814,1.166692,1.295676,1.499990,AL354833.1,0.420543151899831,1.70069109830835e-07
ENSG00000288600,0.028802,0.034439,0.037662,0.047057,0.000000,0.000000,0.000000,0.000000,0.074184,0.112489,...,0.165031,0.036635,0.000000,0.076710,0.096484,0.050308,0.081401,AL354833.2,0.591823204966647,7.03551034285073e-15
ENSG00000288602,0.358735,0.100945,0.308056,0.301105,0.182927,0.306400,0.504476,0.612279,0.178764,0.491040,...,0.347866,0.470701,0.707862,0.317011,0.215845,0.050308,0.334798,C8orf44-SGK3,0.563582793344071,2.33295900431294e-13


TypeError: Cannot index by location index with a non-integer key

In [71]:
summary_bulkRNA = {"feature" : bulkRNA_joined["gene_name"],
           "min" : bulk_RNA.X.min(axis = 1),
           "max" : bulk_RNA.X.max(axis = 1),
           "mean" : bulk_RNA.X.mean(axis = 1),
           "var" : bulk_RNA.X.var(axis = 1)}

df_bulkRNA= pd.DataFrame(summary_bulkRNA)

print("[", bulk_RNA.X.min(),",", bulk_RNA.X.max(), "]") # Overall range of expression values

[ 0.0 , 15.981462 ]


In [16]:
bulk_RNA.to_df().head()

Unnamed: 0,S00016-Ja001T-TRGa,S00020-Ja003T-TRGa,S00024-Ja003T-TRGa,S00027-Ja003T-TRGa,S00028-Ja001T-TRGa,S00030-Ja003T-TRGa,S00033-Ja001T-TRGa,S00033-Ja003T-TRGa,S00034-Ja005T-TRGa,S00033-Ja005T-TRGa,...,S00081-Ja001T-TRGa,S00081-Ja005T-TRGa,S00082-Ja001T-TRGa,S00094-Ja005T-TRGa,S00095-Ja005T-TRGa,S00096-Ja005T-TRGa,S00097-Ja003T-TRGa,S00099-Ja005T-TRGa,S00104-Ja003T-TRGa,S00106-Ja003T-TRGa
ENSG00000000003,0.358735,0.420317,0.212498,0.0,0.420037,0.0,0.0,0.044182,0.037569,0.266312,...,0.236062,0.181513,0.382617,0.240967,0.269932,0.189985,0.285038,0.352373,0.581883,0.0
ENSG00000000419,3.933231,4.017549,4.351443,4.102646,4.466349,3.650819,4.111353,4.330951,4.395805,4.293557,...,4.074801,4.310133,4.190965,3.874222,3.827185,4.275609,4.241249,4.194892,4.251307,3.777344
ENSG00000000457,4.658654,4.322857,4.697171,4.816981,4.586892,4.620245,4.640644,4.646106,4.715597,4.771581,...,4.734925,4.968107,4.671478,4.41068,4.50994,5.482311,4.626495,4.505745,4.41744,4.359805
ENSG00000000460,2.788302,3.509472,3.071359,2.453118,3.245321,3.029654,2.769487,2.762198,2.738179,2.7737,...,3.121597,2.885727,3.201638,2.794582,3.022211,3.09852,2.858558,3.158803,2.590063,2.596908
ENSG00000000938,9.294971,9.078311,9.836819,10.287786,9.612483,10.32016,10.118597,10.62944,9.767962,10.012075,...,9.839436,9.990548,9.713717,10.088363,9.523195,10.219047,9.810799,9.909992,9.223608,9.199473


In [36]:
with open ('bulkRNA_feature_names.txt', 'w') as file: # Save gene names
    for name in bulkRNA_joined["gene_name"]:
        file.write(name + "\n")  

df_bulkRNA.to_excel("summary_bulkRNA.xlsx") # Save summary



### Luminex

In [17]:
luminex = read("Oxford data_output_combined", "xlsx", "All data")
print("Luminex: ", luminex, "\n\n")

Luminex:  AnnData object with n_obs × n_vars = 349 × 56 




In [74]:
# The first five columns (severity, sex, age, BMI, dexamethasone) are sample metadata
luminex_values = luminex.X[:,5:].astype(float)

summary_luminex = {"feature" : luminex.var_names[5:],
           "min" : luminex_values.min(axis = 0),
           "max" : luminex_values.max(axis = 0),
           "mean" : luminex_values.mean(axis = 0),
           "var" : luminex_values.var(axis = 0)}

df_luminex= pd.DataFrame(summary_luminex)

# Overall range of proteins
print("[", luminex_values.min(),",",luminex_values.max(), "]")

[ 0.0 , 12019000.0 ]


In [18]:
luminex.to_df().head()

Unnamed: 0,severity,sex,age,BMI,dexamethasone,CCL18/PARC (BR33) (33) low,Lactoferrin (BR36) (36) high,Lipocalin-2/NGAL (BR21) (21) high,Myeloperoxidase/MPO (BR53) (53) high,CCL2/JE/MCP-1 (BR25) (25) high,...,IFN-alpha (BR63) (63) high,IL-2 (BR43) (43) high,IL-5 (BR53) (53) high,IL-8/CXCL8 (BR48) (48) high,IL-12 p70 (BR56) (56) high,IL-15 (BR52) (52) high,IL-23 (BR76) (76) high,IL-33 (BR14) (14) high,Oncostatin M/OSM (BR30) (30) high,TREM-1 (BR65) (65) high
S00029-Ja005E-PMCdb,COVID-critical,F,46,21.4,False,42080.95,15359.95,528930.72,101595.14,185.96,...,2.74,0.0,5.71,11.53,15.05,10.5,151.94,1.44,0.0,327.96
S00029-Ja001E-PMCdb,COVID-critical,F,46,21.4,False,47991.24,96194.78,467186.02,39638.1,1966.97,...,0.0,9.42,0.0,14.35,0.0,19.25,0.0,0.0,0.0,728.84
S00052-Ja005E-PMCdb,COVID-critical,F,41,30.0,False,76407.98,45607.18,51840.97,88627.44,174.56,...,0.0,0.0,0.0,3.74,0.0,0.38,0.0,2.38,0.0,269.0
S00109-Ja005E-PMCdb,COVID-critical,F,52,,False,68297.35,64517.42,27631.27,77299.95,2148.9,...,1.08,0.0,0.0,63.57,96.28,19.37,524.06,0.0,0.0,246.34
S00099-Ja005E-PMCdb,COVID-critical,F,52,35.0,False,171822.5,181611.29,245606.35,166649.32,1009.95,...,0.0,0.0,5.96,8.35,0.0,8.98,0.0,0.0,0.0,955.97


In [23]:
with open ('luminex_feature_names.txt', 'w') as file:  
    for name in luminex.var_names:
        file.write(name + "\n") 

df_luminex.to_excel("summary_Luminex.xlsx") # Save summary

### cyTOF

In [5]:
cytof = read("cytof_full", "h5ad")
print("cyTOF: ", cytof, "\n\n")



cyTOF:  AnnData object with n_obs × n_vars = 7118158 × 48
    obs: 'sample_id', 'condition', 'patient_id', 'batch', 'cellID', 'COMBAT_ID_Time', 'CyTOF_priority', 'major_cell_type', 'fine_cluster_id'
    var: 'channel_name', 'marker_name', 'marker_class'
    uns: 'SOM_codes', 'X_name', 'cluster_codes', 'cofactor', 'experiment_info'
    obsm: 'TSNE', 'UMAP' 




In [77]:
summary_cytof = {"feature" : cytof.var_names,
           "min" : cytof.X.min(axis = 0),
           "max" : cytof.X.max(axis = 0),
           "mean" : cytof.X.mean(axis = 0),
           "var" : cytof.X.var(axis = 0)
           }

df_cytof= pd.DataFrame(summary_cytof)

# Overall range of the proteins
print("[", cytof.X.min(),",",cytof.X.max(), "]")

[ -6.436475 , 24.303553 ]


In [78]:
cytof.to_df().head()

Unnamed: 0,CD16,CD19,CD3,IgG,CD4,HLA_DR,CTLA4,Siglec_8,CD28,Ki_67,...,KLGR1,FOXP3,CD38,CD45,CD123,CD25,CD141,CLA,CX3CR1,Event_length
0,0.002863,0.00712,3.169593,0.778884,0.000112,0.877128,0.482226,0.0,0.489538,1.481506,...,2.4122,0.00615,0.005076,5.388708,0.032491,0.922501,0.963545,2.933704,0.333416,2.32709
1,0.001721,2.329424,0.00177,1.906131,0.001495,4.516001,0.492553,0.0,0.296466,0.024739,...,0.0,1.024434,2.667199,5.177439,0.022996,0.000264,0.001856,0.826805,0.484731,2.363303
2,0.001721,0.007947,0.23164,0.302123,2.17078,3.362985,0.226497,0.437236,0.078629,0.686172,...,0.265121,0.301607,2.895148,4.42983,0.031306,0.727448,0.631563,3.159231,0.020285,2.191561
3,0.001721,0.130171,0.00177,1.14377,0.001495,4.477472,1.895761,0.010462,0.078629,0.024739,...,0.016187,0.003332,3.320227,4.470901,0.01359,0.233912,0.001856,3.245132,0.020285,2.14249
4,1.01013,0.007947,0.00177,1.599139,1.396846,3.425147,0.527159,0.0,0.472719,0.024739,...,0.012915,1.370077,3.129075,4.14228,0.000452,1.384267,0.756194,3.142244,0.667578,2.405398


In [11]:
with open ('cyTOF_feature_names.txt', 'w') as file:  
    for name in cytof.var_names:
        file.write(name + "\n") 

df_cytof.to_excel("summary_cytof.xlsx") # Save summary


### FACS

In [6]:
facs = read("facs_full", "h5ad")
print("FACS: ", facs, "\n\n")

FACS:  AnnData object with n_obs × n_vars = 131920 × 12
    obs: 'fcs_file', 'sample_id', 'condition', 'patient_id', 'cluster_id'
    var: 'channel_name', 'marker_name', 'marker_class', 'used_for_clustering'
    uns: 'SOM_codes', 'X_name', 'cluster_codes', 'experiment_info'
    obsm: 'TSNE', 'UMAP'
    layers: 'exprs' 




  utils.warn_names_duplicates("obs")


In [80]:
summary_facs = {"feature" : facs.var_names,
           "min" : facs.X.min(axis = 0),
           "max" : facs.X.max(axis = 0),
           "mean" : facs.X.mean(axis = 0),
           "var" : facs.X.var(axis = 0)}

df_facs= pd.DataFrame(summary_facs)

# Overall range of proteins
print("[", facs.X.min(),",",facs.X.max(), "]")

[ -6608.2285 , 178222.64 ]


In [21]:
facs.to_df().head()

Unnamed: 0,CXCR3,CCR4,CD45RA,HLA-DR,CD25,CD38,CD127,PD1,CCR6,ICOS,CD27,CCR7
memory,22.444963,8083.472168,-67.636757,48.388863,128.51062,317.221191,153.553589,-477.221954,2538.507568,1307.797119,462.951263,260.562439
memory,46.32246,6292.529297,262.460266,341.741608,349.631439,638.302551,0.721993,1040.776855,2116.770264,6056.441895,8753.99707,1611.8302
memory,138.343292,1583.56958,535.068176,-177.763977,54.735806,396.080292,1005.266602,-21.077803,-146.690323,818.59491,5212.695312,688.128906
memory,99.121498,390.649139,-14.418692,83.374794,63.772312,260.273926,2112.503662,888.982605,1462.157104,672.073364,2724.397949,687.44043
memory,53.704018,3739.389404,14.945232,13.85416,561.055847,-484.798462,175.297806,-1016.438843,4490.885742,-101.385414,7555.630859,981.396484


In [10]:
with open ('facs_feature_names.txt', 'w') as file:  
    for name in facs.var_names:
        file.write(name + "\n") 

df_facs.to_excel("summary_facs.xlsx") # Save summary

## Comparisons <a id='compare'></a>

The aim of this section is to find the proteins and genes that are shared across modalities.
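
Finding shared features amounts to intersecting sets of feature names, once for every combination of modalities. Writing each combination by hand is error-prone, so the sketch below generates all of them with `itertools.combinations`. The function `shared_features` and the toy feature sets are assumptions for illustration and do not reflect the actual data.

```python
from itertools import combinations


def shared_features(feature_sets, min_size=2):
    """Map each combination of modality names to the features they all share."""
    names = sorted(feature_sets)
    result = {}
    for k in range(min_size, len(names) + 1):
        for combo in combinations(names, k):
            result[combo] = set.intersection(*(set(feature_sets[m]) for m in combo))
    return result


toy_sets = {  # hypothetical feature names, not taken from the data
    "adt": {"CD3", "CD4", "CD19"},
    "cytof": {"CD3", "CD4", "CD16"},
    "facs": {"CD4", "CD25"},
}
toy_shared = shared_features(toy_sets)
for combo, feats in toy_shared.items():
    print("_".join(combo), sorted(feats))
```

Because every key is a tuple of modality names, the key can also be used to build the output file name, which keeps the file name and its content in sync.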

In [85]:
all_features = set(facs.var_names).union(set(cytof.var_names), set(luminex.var_names), set(bulkRNA_joined["gene_name"]), set(adt_names), set(scRNA_names)) 

with open ('all_features.txt', 'w') as file:  
    for name in all_features:
        file.write(name + "\n") 

### Shared Protein Names

In [30]:
# compare bulkRNA and ADT
shared_features_adt_bulkRNA = set(adt_names).intersection(set(bulkRNA_joined["gene_name"]))

# compare Luminex, bulkRNA and ADT
shared_features_adt_luminex = set(adt_names).intersection(set(luminex.var_names))
shared_features_bulkRNA_luminex = set(luminex.var_names).intersection(set(bulkRNA_joined["gene_name"]))

# compare cyTOF bulkRNA and ADT
shared_features_adt_cytof = set(cytof.var_names).intersection(set(adt_names)) 
shared_features_bulkRNA_cytof = set(cytof.var_names).intersection(set(bulkRNA_joined["gene_name"]))

# compare FACS, bulkRNA and ADT, cyTOF
shared_features_adt_facs = set(facs.var_names).intersection(set(adt_names)) 
shared_features_bulkRNA_facs = set(facs.var_names).intersection(set(bulkRNA_joined["gene_name"]))
shared_features_cytof_facs = set(facs.var_names).intersection(set(cytof.var_names)) 
shared_features_adt_bulkRNA_facs = shared_features_adt_facs.intersection(set(bulkRNA_joined["gene_name"]))
shared_features_adt_cytof_facs = shared_features_adt_facs.intersection(set(cytof.var_names)) 
shared_features_bulkRNA_cytof_facs = shared_features_bulkRNA_facs.intersection(set(cytof.var_names)) 
shared_features_adt_bulkRNA_cytof_facs = shared_features_adt_bulkRNA_facs.intersection(set(cytof.var_names)) 

# compare FACS, bulkRNA, ADT, cyTOF and scRNA
shared_features_adt_scRNA = set(scRNA_names).intersection(set(adt_names)) 
shared_features_bulkRNA_scRNA = set(scRNA_names).intersection(set(bulkRNA_joined["gene_name"]))
shared_features_cytof_scRNA = set(scRNA_names).intersection(set(cytof.var_names)) 
shared_features_facs_scRNA = set(scRNA_names).intersection(set(facs.var_names))
shared_features_adt_bulkRNA_scRNA = shared_features_adt_scRNA.intersection(set(bulkRNA_joined["gene_name"]))
shared_features_adt_cytof_scRNA = shared_features_adt_scRNA.intersection(set(cytof.var_names)) 
shared_features_adt_facs_scRNA = shared_features_adt_scRNA.intersection(set(facs.var_names)) 
shared_features_bulkRNA_cytof_scRNA = shared_features_bulkRNA_scRNA.intersection(set(cytof.var_names)) 
shared_features_bulkRNA_facs_scRNA = shared_features_bulkRNA_scRNA.intersection(set(facs.var_names))
shared_features_adt_bulkRNA_cytof_scRNA = shared_features_adt_bulkRNA_scRNA.intersection(set(cytof.var_names)) 
shared_features_adt_bulkRNA_facs_scRNA = shared_features_adt_bulkRNA_scRNA.intersection(set(facs.var_names)) 
shared_features_adt_cytof_facs_scRNA = shared_features_adt_cytof_scRNA.intersection(set(facs.var_names)) 
shared_features_bulkRNA_facs_scRNA_cytof = shared_features_bulkRNA_facs_scRNA.intersection(set(cytof.var_names)) 

In [35]:
with open ('shared_ADT_bulkRNA.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA:
        file.write(name + "\n")  


with open ('shared_ADT_Luminex.txt', 'w') as file:  
    for name in shared_features_adt_luminex:
        file.write(name + "\n")  
with open ('shared_bulkRNA_Luminex.txt', 'w') as file:  
    for name in shared_features_bulkRNA_luminex:
        file.write(name + "\n")  


with open ('shared_ADT_cyTOF.txt', 'w') as file:  
    for name in shared_features_adt_cytof:
        file.write(name + "\n")  
with open ('shared_bulkRNA_cyTOF.txt', 'w') as file:  
    for name in shared_features_bulkRNA_cytof:
        file.write(name + "\n")  


with open ('shared_ADT_FACS.txt', 'w') as file:  
    for name in shared_features_adt_facs:
        file.write(name + "\n")  
with open ('shared_bulkRNA_FACS.txt', 'w') as file:  
    for name in shared_features_bulkRNA_facs:
        file.write(name + "\n")  
with open ('shared_cyTOF_FACS.txt', 'w') as file:  
    for name in shared_features_cytof_facs:
        file.write(name + "\n")  
with open ('shared_ADT_bulkRNA_FACS.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA_facs:
        file.write(name + "\n")  
with open ('shared_ADT_cytof_FACS.txt', 'w') as file:  
    for name in shared_features_adt_cytof_facs:
        file.write(name + "\n")  
with open ('shared_bulkRNA_cytof_FACS.txt', 'w') as file:  
    for name in shared_features_bulkRNA_cytof_facs:
        file.write(name + "\n") 
with open ('shared_ADT_bulkRNA_cytof_FACS.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA_cytof_facs:
        file.write(name + "\n")       


with open ('shared_ADT_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_scRNA:
        file.write(name + "\n")  
with open ('shared_bulkRNA_scRNA.txt', 'w') as file:  
    for name in shared_features_bulkRNA_scRNA:
        file.write(name + "\n")  
with open ('shared_cyTOF_scRNA.txt', 'w') as file:  
    for name in shared_features_cytof_scRNA:
        file.write(name + "\n")  
with open ('shared_FACS_scRNA.txt', 'w') as file:  
    for name in shared_features_facs_scRNA:
        file.write(name + "\n")  
with open ('shared_ADT_bulkRNA_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA_scRNA:
        file.write(name + "\n")  
with open ('shared_ADT_cytof_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_cytof_scRNA:
        file.write(name + "\n")  
with open ('shared_ADT_FACS_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_facs_scRNA:
        file.write(name + "\n")  
with open ('shared_bulkRNA_cytof_scRNA.txt', 'w') as file:  
    for name in shared_features_bulkRNA_cytof_scRNA:
        file.write(name + "\n") 
with open ('shared_bulkRNA_FACS_scRNA.txt', 'w') as file:  
    for name in shared_features_bulkRNA_facs_scRNA:
        file.write(name + "\n") 
with open ('shared_ADT_bulkRNA_cytof_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA_cytof_scRNA:
        file.write(name + "\n")       
with open ('shared_ADT_bulkRNA_FACS_scRNA.txt', 'w') as file:  
    for name in shared_features_adt_bulkRNA_facs_scRNA:
        file.write(name + "\n")    
with open ('shared_bulkRNA_cytof_FACS_scRNA.txt', 'w') as file:  
    for name in shared_features_bulkRNA_facs_scRNA_cytof:
        file.write(name + "\n")     

### Shared Protein Statistics

In [84]:
stats_bulkRNA_scRNA = df_bulkRNA.set_index("feature").join(df_scRNA.set_index("feature"), lsuffix="_bulk", rsuffix="_sc", how="inner")
stats_cytof_scRNA = df_cytof.set_index("feature").join(df_scRNA.set_index("feature"), lsuffix="_cytof", rsuffix="_scRNA", how = "inner")
stats_adt_scRNA = df_adt.set_index("feature").join(df_scRNA.set_index("feature"), lsuffix="_adt", rsuffix="_scRNA", how = "inner") 
stats_facs_scRNA = df_facs.set_index("feature").join(df_scRNA.set_index("feature"), lsuffix="_facs", rsuffix="_scRNA", how = "inner") 

stats_bulkRNA_cytof = df_bulkRNA.set_index("feature").join(df_cytof.set_index("feature"), lsuffix="_bulk", rsuffix="_cytof", how="inner")
stats_adt_cytof = df_adt.set_index("feature").join(df_cytof.set_index("feature"), lsuffix="_adt", rsuffix="_cytof", how = "inner") 
stats_cytof_facs = df_facs.set_index("feature").join(df_cytof.set_index("feature"), lsuffix="_facs", rsuffix="_cytof", how = "inner") 

stats_adt_bulkRNA = df_adt.set_index("feature").join(df_bulkRNA.set_index("feature"), lsuffix="_adt", rsuffix="_bulkRNA", how = "inner") 
stats_bulkRNA_facs = df_facs.set_index("feature").join(df_bulkRNA.set_index("feature"), lsuffix="_facs", rsuffix="_bulkRNA", how = "inner") 

stats_adt_facs = df_facs.set_index("feature").join(df_adt.set_index("feature"), lsuffix="_facs", rsuffix="_adt", how = "inner") 
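
Each of the joins above keeps only the features present in both summaries (`how="inner"`) and uses `lsuffix`/`rsuffix` to tell apart statistics columns that have the same name on both sides. A small sketch on made-up summaries shows the resulting shape; the frames `toy_bulk` and `toy_sc` and their values are hypothetical.

```python
import pandas as pd

# Hypothetical per-feature summaries for two modalities
toy_bulk = pd.DataFrame({"feature": ["CD4", "CD19", "FGR"],
                         "min": [0.1, 0.0, 9.1],
                         "max": [4.2, 3.3, 10.6]})
toy_sc = pd.DataFrame({"feature": ["CD4", "ISG15"],
                       "min": [0.0, 0.0],
                       "max": [2.5, 3.3]})

# Inner join on the feature name; suffixes disambiguate the overlapping columns
toy_stats = toy_bulk.set_index("feature").join(
    toy_sc.set_index("feature"), lsuffix="_bulk", rsuffix="_sc", how="inner"
)
print(toy_stats)
```

Only columns whose names collide receive a suffix, so a statistic that exists in just one of the two summaries keeps its plain name in the joined table.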

In [86]:
stats_bulkRNA_scRNA.to_excel("stats_bulkRNA_scRNA.xlsx") 
stats_cytof_scRNA.to_excel("stats_cytof_scRNA.xlsx") 
stats_adt_scRNA.to_excel("stats_adt_scRNA.xlsx") 
stats_facs_scRNA.to_excel("stats_facs_scRNA.xlsx") 

stats_bulkRNA_cytof.to_excel("stats_bulkRNA_cytof.xlsx") 
stats_adt_cytof.to_excel("stats_adt_cytof.xlsx") 
stats_cytof_facs.to_excel("stats_cytof_facs.xlsx") 

stats_adt_bulkRNA.to_excel("stats_adt_bulkRNA.xlsx") 
stats_bulkRNA_facs.to_excel("stats_bulkRNA_facs.xlsx") 

stats_adt_facs.to_excel("stats_adt_facs.xlsx") 

## Pseudo Bulking <a id='aggregate'></a>
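
Pseudobulking collapses the per-cell measurements into one row per group, here by summing the values of all cells that share a label such as a sample ID. The following is a minimal sketch on toy data, assuming a cells-by-features table and one label per cell; all names and values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: four cells, two markers, two samples
toy_cells = pd.DataFrame(
    {"CD4": [1.0, 2.0, 3.0, 4.0], "CD25": [0.5, 0.5, 1.0, 2.0]},
    index=["cell1", "cell2", "cell3", "cell4"],
)
toy_sample_id = pd.Series(["s1", "s1", "s2", "s2"],
                          index=toy_cells.index, name="sample_id")

# Passing the labels as a NumPy array groups rows by position, so the
# result does not depend on how the row names of the two objects align.
toy_pseudobulk = toy_cells.groupby(toy_sample_id.to_numpy()).sum()
print(toy_pseudobulk)
```

Summation is one possible aggregation; replacing `.sum()` with `.mean()` would give per-group averages instead, which are less sensitive to differing cell counts per group.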

In [67]:
facs.to_df().groupby(facs.obs.sample_id).sum().to_excel("pseudo_bulk_facs.xlsx")
cytof.to_df().groupby(by = [cytof.obs.major_cell_type, cytof.obs.patient_id]).sum().to_excel("pseudo_bulk_cytof.xlsx")


#scRNA.obs
#ADT

Unnamed: 0,Annotation_cluster_id,Annotation_cluster_name,Annotation_minor_subset,Annotation_major_subset,Annotation_cell_type,GEX_region,QC_ngenes,QC_total_UMI,QC_pct_mitochondrial,QC_scrub_doublet_scores,...,Requiredvasoactive,Respiratorysupport,SARSCoV2PCR,Outcome,TimeSinceOnset,Ethnicity,Tissue,DiseaseClassification,Pool_ID,Channel_ID
AAACCTGAGAAAGTGG-1-gPlexA1,20120.0,NK.CD16hi.1,NK.CD16hi,NK,NK,B: TEM/prolif. T/NK cells,1159,2684,1.862891,0.031883,...,1.0,1.0,1,2.0,12.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexA,A1
AAACCTGAGCGGATCA-1-gPlexA1,20011.0,CD8.TEMRA.1,CD8.TEMRA,CD8,T,B: TEM/prolif. T/NK cells,1348,3162,1.138520,0.041541,...,0.0,4.0,1,5.0,12.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexA,A1
AAACCTGAGGACATTA-1-gPlexA1,,,,,,D: B/Plasma cells,937,2579,0.891819,0.003108,...,,4.0,1,6.0,17.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexA,A1
AAACCTGAGGCGACAT-1-gPlexA1,30400.0,ncMono,ncMono,ncMono,MNP,C: Monocytes/cDC,788,1979,4.194037,0.068193,...,1.0,1.0,1,2.0,14.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexA,A1
AAACCTGAGGGAACGG-1-gPlexA1,30100.0,cMono.LGALS2.AHNAK,cMono,cMono,MNP,C: Monocytes/cDC,1344,3084,3.728923,0.036749,...,0.0,3.0,1,4.0,6.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexA,A1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TTTGTCAGTGGCAAAC-1-gPlexK7,30000.0,cMono.S100A8/9/12hi.HMGB2,cMono,cMono,MNP,C: Monocytes/cDC,1316,3722,3.519613,0.068154,...,,4.0,0,6.0,,Unknown,Blood;UBERON:0000178,,gPlexK,K7
TTTGTCAGTTACCGAT-1-gPlexK7,20211.0,CD8.TEM,CD8.TEM,CD8,T,B: TEM/prolif. T/NK cells,1157,3318,1.476793,0.494080,...,1.0,1.0,0,2.0,11.0,Unknown,Blood;UBERON:0000178,Influenza;MONDO:0005812,gPlexK,K7
TTTGTCATCCTCTAGC-1-gPlexK7,21311.0,CD8.TEMRA.mitohi.2,CD8.mitohi,CD8,T,B: TEM/prolif. T/NK cells,502,627,7.017544,0.019165,...,,4.0,1,6.0,7.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexK,K7
TTTGTCATCGAGGTAG-1-gPlexK7,30300.0,cMono.LGALS2.PSME2.IFITM3hi,cMono,cMono,MNP,C: Monocytes/cDC,805,1612,2.233251,0.050830,...,0.0,5.0,1,1.0,3.0,Unknown,Blood;UBERON:0000178,COVID-19;MONDO:0100096,gPlexK,K7
