## Aim: Integrate gene expression data from three online data portals: Stemformatics, ENCODE and Haemosphere. 

In previous notebooks, we combined the gene expression data of selected samples within each individual data portal. Here, we integrate the three pre-combined expression datasets into a single, larger dataset of mouse blood.


In [6]:
import pandas as pd
import atlas
import handler

In [2]:
# Load data 
expression_s4m    = pd.read_csv('../data/interim/mouse_integrate/expression_s4m.tsv', sep='\t', index_col=0)
expression_encode = pd.read_csv('../data/interim/mouse_integrate/expression_encode.tsv', sep='\t', index_col=0)
expression_haemosphere = pd.read_csv('../data/interim/mouse_integrate/expression_Haemosphere.tsv', sep='\t', index_col=0)

samples_s4m = pd.read_csv('../data/interim/mouse_integrate/samples_s4m.tsv', sep='\t', index_col=0)
samples_encode = pd.read_csv('../data/interim/mouse_integrate/samples_encode.tsv', sep='\t', index_col=0)
samples_haemosphere = pd.read_csv('../data/interim/mouse_integrate/samples_Haemosphere.tsv', sep='\t', index_col=0)

In [3]:
print(expression_s4m.shape, expression_encode.shape, expression_haemosphere.shape)

(17947, 240) (22457, 90) (14462, 794)


### Merge expression tables

In [7]:
dfs = [expression_encode, expression_haemosphere, expression_s4m]
common_genes = handler.find_common_genes(dfs)
mouse_atlas = atlas.rankTransform(handler.merge_columns(dfs, common_genes))

print(mouse_atlas.shape)
mouse_atlas.head()

(14030, 1124)


Unnamed: 0,ENCFF547CXK,ENCFF774DIF,ENCFF155RUP,ENCFF702TOW,ENCFF963EBQ,ENCFF649QOI,ENCFF871RIM,ENCFF415ZGH,ENCFF924EMV,ENCFF063SJF,...,GSM854333,GSM854334,GSM1023629,GSM1023630,GSM1023631,GSM1023632,GSM1023633,GSM1023634,GSM1023635,GSM1023636
ENSMUSG00000000001,0.941197,0.931219,0.710371,0.715752,0.948967,0.945189,0.901033,0.923414,0.844369,0.792979,...,0.872345,0.863507,0.949964,0.953314,0.943193,0.943763,0.939914,0.944191,0.958874,0.956664
ENSMUSG00000000003,0.135139,0.134462,0.096793,0.108482,0.128867,0.128795,0.168817,0.182145,0.183215,0.178974,...,0.001212,0.001639,0.00556,0.002423,0.002566,0.005916,0.005346,0.004775,0.002637,0.002851
ENSMUSG00000000028,0.469601,0.464006,0.74583,0.784212,0.854205,0.862473,0.811796,0.803742,0.882929,0.910406,...,0.764647,0.773628,0.783963,0.853742,0.813828,0.901069,0.892801,0.882252,0.865788,0.708909
ENSMUSG00000000037,0.135139,0.134462,0.4268,0.444583,0.413008,0.424555,0.430827,0.746151,0.552637,0.427869,...,0.031504,0.060014,0.320349,0.478118,0.387242,0.317819,0.404918,0.347256,0.209907,0.360513
ENSMUSG00000000049,0.135139,0.134462,0.096793,0.108482,0.128867,0.128795,0.168817,0.182145,0.183215,0.178974,...,0.072202,0.056165,0.090948,0.073557,0.097862,0.070421,0.074911,0.077049,0.148753,0.109408
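
The `handler` and `atlas` modules are local to this project and their source is not shown here. The sketch below is only one plausible way to implement the three steps used above (find shared genes, merge columns, rank-transform) in plain pandas. The function bodies, the name `rank_transform`, and the choice of average-rank ties scaled by the number of genes are assumptions for illustration, not the project's actual implementation.

```python
import pandas as pd

def find_common_genes(dfs):
    """Return the sorted gene IDs (row index) shared by every table."""
    common = set(dfs[0].index)
    for df in dfs[1:]:
        common &= set(df.index)
    return sorted(common)

def merge_columns(dfs, genes):
    """Keep only the shared genes and place all samples side by side."""
    return pd.concat([df.loc[genes] for df in dfs], axis=1)

def rank_transform(df):
    """Rank genes within each sample (column) and scale by the gene count.

    Assumption: tied values share their average rank.
    """
    return df.rank(axis=0, method='average') / df.shape[0]

# Toy expression tables (hypothetical gene and sample names)
a = pd.DataFrame({'s1': [5.0, 1.0, 3.0]}, index=['g1', 'g2', 'g3'])
b = pd.DataFrame({'s2': [0.0, 0.0, 9.0]}, index=['g2', 'g3', 'g4'])

genes = find_common_genes([a, b])
toy_atlas = rank_transform(merge_columns([a, b], genes))
print(toy_atlas)
```

Ranking within each sample puts microarray and RNA-seq values on a common 0 to 1 scale, which is why the transform is applied only after the tables are restricted to the shared genes.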


### Standardise sample metadata 

Because the sample metadata of the different datasets may record different variables, or use different names for similar attributes, merging them requires identifying a minimal list of attributes needed to describe the samples and unifying the names of these attributes. The essential attributes are:

- celltype
- cell_lineage
- description
- markers
- dataset_name
- platform

In addition, each attribute needs manual reformatting so that its values are recorded in a uniform format; e.g. 'NK' and 'natural killer' should not both appear in the same table.
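
As a rough illustration of this standardisation, the sketch below renames a column, restricts the table to the essential attributes, and collapses synonymous cell type labels. The toy table, `column_map`, and `label_map` are hypothetical; the real mappings have to be curated by hand for each portal.

```python
import pandas as pd

# Hypothetical metadata table standing in for one portal's samples
samples = pd.DataFrame(
    {
        'final_cell_type': ['NK', 'natural killer', 'pDC'],
        'dataset_name': ['A', 'B', 'B'],
        'platform': ['Microarray', 'RNA-seq', 'RNA-seq'],
    },
    index=['x1', 'x2', 'x3'],
)

# Assumed mappings: source column -> unified attribute, synonym -> unified label
column_map = {'final_cell_type': 'celltype'}
label_map = {'natural killer': 'NK'}
essential = ['celltype', 'cell_lineage', 'description',
             'markers', 'dataset_name', 'platform']

# Rename, then keep exactly the essential attributes;
# attributes missing from this portal are filled with NaN
standardised = samples.rename(columns=column_map).reindex(columns=essential)

# Unify the labels used for the same cell type
standardised['celltype'] = standardised['celltype'].replace(label_map)
print(standardised)
```

Using `reindex` rather than plain column selection means every portal's table ends up with the same columns in the same order, so the tables can later be concatenated row-wise even when a portal lacks an attribute.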

In [8]:
samples_s4m.head()

chip_id,ds_id,dataset_name,replicate_group_id,organism,sample_type,generic_sample_type,final_cell_type,parental_cell_type,sex,labelling,description,platform
GSM98876,6658,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2 (GPL1261 and...
GSM98877,6658,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2 (GPL1261 and...
GSM98878,6658,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2 (GPL1261 and...
GSM98882,6658,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC young animal,LT-HSC young animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2 (GPL1261 and...
GSM98879,6658,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2 (GPL1261 and...


In [11]:
set(samples_s4m.final_cell_type)

{'Activated B lymphocytes',
 'Activated B lymphocytes expressing OSKM',
 'BLSP',
 'Bone marrow derived macrophage',
 'C57BL6 BMM',
 'Gut Trm CD8 T cell',
 'LT-HSC Aged animal',
 'LT-HSC young  animal',
 'Lung Trm CD8 T cell',
 'M2-polarised bone marrow derived macrophage (+IL4)',
 'MGL',
 'MPP',
 'Macrophage Recovery from tolerance',
 'Naïve',
 'Naïve spleen CD8 T cell',
 'Skin Trm CD8 T cell',
 'Tcm CD8 T cell',
 'Tem CD8 T cell',
 'Untreated macrophage',
 'WT plus TDM',
 'WT plus vehicle',
 'Whole BM',
 'dendritic cell',
 'hematopoietic cell',
 'hematopoietic stem cell',
 'macrophage',
 'monocyte',
 'neutrophil',
 'pDC',
 'pre pro B cell'}

In [9]:
set(samples_s4m.sample_type)

{'Activated B lymphocytes',
 'Activated B lymphocytes expressing OSKM',
 'Aged mouse HSC 12 hr',
 'Aged mouse HSC 24 hr',
 'Aged mouse HSC 3 hr',
 'Aged mouse HSC 6 hr',
 'Aged mouse preproB',
 'BLSP',
 'BM CD115+ Ly-6C+ mo',
 'BM mac',
 'BM macrophage',
 'BM mo',
 'BM-HSC',
 'BM-derived macrophage day 0',
 'BM-derived macrophage day 12',
 'BM-derived macrophage day 3',
 'BM-derived macrophage day 6',
 'BMN',
 'Bone marrow derived macrophage',
 'C57BL6 BMM',
 'CD11bhi CD115+  monocyte',
 'CD11c+CD8+ splenic DC',
 'CD11c+CD8- splenic DC',
 'CD45hiCD11c+brain DC',
 'Cx3cr1-GFP+ CTM',
 'Cx3cr1-GFP_hi CTM',
 'DC Gr1hi',
 'DC Gr1lo',
 'FL HSC Hmga2+/+',
 'FL HSC Hmga2-/-',
 'Gut Trm CD8 T cell',
 'HSC 12 hr mixed expansion culture',
 'HSC 24 hr mixed expansion culture',
 'HSC 3 hr mixed expansion culture',
 'HSC 6 hr mixed expansion culture',
 'HSC sorted for Lin-c-Kit+Sca1+CD34−Flk2−',
 'LN mac',
 'LRC',
 'LT-HSC Aged animal',
 'LT-HSC young  animal',
 'Lung Trm CD8 T cell',
 'M2-polarised

In [None]:
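# Start from the columns that already match the essential attribute names
samples_s4m_std = samples_s4m[['dataset_name', 'description', 'platform']]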