## Aim: Integrate gene expression data from three online data portals: Stemformatics, ENCODE and Haemosphere. 

In previous notebooks, we combined the gene expression data of selected samples within each individual data portal. Here, we integrate the three pre-combined expression datasets into a single, larger dataset of mouse blood.


In [1]:
import pandas as pd
import atlas
import handler
import numpy as np



### Merge expression tables

In [2]:
# Load data 
expression_s4m    = pd.read_csv('../data/interim/mouse_integrate/expression_s4m.tsv', sep='\t', index_col=0)
expression_encode = pd.read_csv('../data/interim/mouse_integrate/expression_encode.tsv', sep='\t', index_col=0)
expression_haemosphere = pd.read_csv('../data/interim/mouse_integrate/expression_Haemosphere.tsv', sep='\t', index_col=0)

In [3]:
print(expression_s4m.shape, expression_encode.shape, expression_haemosphere.shape)

(18483, 217) (22457, 90) (14462, 786)


In [4]:
dfs = [expression_encode, expression_haemosphere, expression_s4m]
common_genes = handler.find_common_genes(dfs)
mouse_atlas_expression = atlas.rankTransform(handler.merge_columns(dfs, common_genes))

print(mouse_atlas_expression.shape)
mouse_atlas_expression.head()

(14222, 1093)


Unnamed: 0,ENCFF547CXK,ENCFF774DIF,ENCFF155RUP,ENCFF702TOW,ENCFF963EBQ,ENCFF649QOI,ENCFF871RIM,ENCFF415ZGH,ENCFF924EMV,ENCFF063SJF,...,GSM854333,GSM854334,GSM1023629,GSM1023630,GSM1023631,GSM1023632,GSM1023633,GSM1023634,GSM1023635,GSM1023636
ENSMUSG00000000001,0.941569,0.931585,0.711538,0.716847,0.949093,0.945296,0.901315,0.923745,0.844572,0.793735,...,0.872521,0.863803,0.949937,0.953312,0.943257,0.943679,0.939812,0.944101,0.959007,0.956757
ENSMUSG00000000003,0.135951,0.135319,0.097384,0.109127,0.129764,0.129623,0.169702,0.182956,0.183835,0.179862,...,0.001195,0.001617,0.005484,0.002391,0.002531,0.005836,0.005414,0.004781,0.002602,0.002883
ENSMUSG00000000028,0.470855,0.465265,0.746941,0.785297,0.854627,0.863135,0.81272,0.804212,0.883315,0.910561,...,0.764942,0.774012,0.784419,0.854099,0.814231,0.901209,0.893123,0.882365,0.866053,0.709605
ENSMUSG00000000037,0.135951,0.135319,0.428069,0.446105,0.414674,0.426136,0.432675,0.747187,0.55386,0.4293,...,0.031852,0.06047,0.321579,0.478976,0.38792,0.318802,0.40585,0.348123,0.2108,0.361482
ENSMUSG00000000049,0.135951,0.135319,0.097384,0.109127,0.129764,0.129623,0.169702,0.182956,0.183835,0.179862,...,0.072775,0.056673,0.091759,0.074251,0.09865,0.070876,0.075657,0.077978,0.149557,0.110533
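The helpers `handler.find_common_genes`, `handler.merge_columns` and `atlas.rankTransform` are project modules whose code is not shown in this notebook. The following is only a minimal sketch of what such a pipeline could look like, assuming that the common genes are the intersection of the row indices, that the merge is a column-wise concatenation, and that the rank transform replaces each value by its within-sample rank divided by the number of genes. The `*_sketch` function names and the toy data are hypothetical and are not the project's API.

```python
import pandas as pd

def find_common_genes_sketch(dfs):
    """Return the sorted gene IDs present in the index of every table (assumed behaviour)."""
    common = set(dfs[0].index)
    for df in dfs[1:]:
        common &= set(df.index)
    return sorted(common)

def merge_columns_sketch(dfs, genes):
    """Restrict each table to the shared genes and join the samples column-wise."""
    return pd.concat([df.loc[genes] for df in dfs], axis=1)

def rank_transform_sketch(df):
    """Replace each value by its within-sample rank, scaled by the number of genes."""
    return df.rank(axis=0, method='average') / df.shape[0]

# Toy tables (hypothetical): only g1 and g2 occur in both
a = pd.DataFrame({'s1': [5.0, 1.0, 3.0]}, index=['g1', 'g2', 'g3'])
b = pd.DataFrame({'s2': [0.0, 9.0, 2.0]}, index=['g2', 'g1', 'g4'])

genes = find_common_genes_sketch([a, b])
ranked = rank_transform_sketch(merge_columns_sketch([a, b], genes))
print(ranked)
```

A rank-based scale of this kind makes samples from different platforms comparable column by column, because each value depends only on the ordering of genes within its own sample.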


### Standardise sample metadata 

We encountered several difficulties while integrating the metadata associated with the collected datasets.

1. Different datasets record different pieces of information about each sample, e.g. the 'tissue' attribute is recorded in some sample metadata but not all.
2. The same piece of information can be recorded in different ways, e.g. cell type is recorded in a single 'celltype' column in the Haemosphere metadata, whereas in the s4m metadata four columns carry information related to the cell type of samples.
3. The format of the contents is inconsistent, e.g. macrophage is represented as 'BM mac', 'BM macrophage', 'BM-derived macrophage day 0' and 'Bone marrow derived macrophage' within the same column.
4. Different types of information may be mixed under the same attribute, e.g. the 'replicate_group_id' column of the s4m metadata may contain experiment information, cell type and sort marker information about samples.

To address these issues, we will:
1. Determine an essential list of attributes to describe samples and unify the naming of these attributes. The essential attributes are:

       Cell Type; Cell Lineage; Description; Dataset Name; Platform 

2. Reannotate the metadata so that the content of each attribute is consistent.
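The reannotation in this project was done manually. As an illustration of how inconsistent labels like the macrophage variants listed above could be collapsed to one controlled term, here is a small sketch using a lookup table. The dictionary `CELLTYPE_MAP`, the target term 'macrophage' and the toy sample IDs are assumptions made for this example, not the annotation actually used.

```python
import pandas as pd

# Hypothetical lookup from free-text labels to a single controlled term
CELLTYPE_MAP = {
    'BM mac': 'macrophage',
    'BM macrophage': 'macrophage',
    'BM-derived macrophage day 0': 'macrophage',
    'Bone marrow derived macrophage': 'macrophage',
}

# Toy column of raw labels (sample IDs are made up)
raw = pd.Series(
    ['BM mac', 'Bone marrow derived macrophage', 'Fob'],
    index=['sample_1', 'sample_2', 'sample_3'],
)

# Labels without an entry in the lookup become NaN
celltype_anno = raw.map(CELLTYPE_MAP)

# List the raw labels that still need a manual decision
unmapped = raw[celltype_anno.isna()].unique().tolist()
print(celltype_anno)
print('labels still to annotate:', unmapped)
```

Reporting the unmapped labels, rather than silently passing them through, keeps the manual curation step visible.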

In [5]:
# Load the reannotated sample metadata.
# For each data collection (Stemformatics, ENCODE and Haemosphere), two manually annotated columns,
# 'cell_lineage_anno' and 'celltype_anno', were added with a consistent content format.
samples_s4m = pd.read_csv('../data/interim/reannotated/samples_s4m_anno.tsv', sep='\t', index_col=0)
samples_encode = pd.read_csv('../data/interim/reannotated/samples_encode_anno.tsv', sep='\t', index_col=0)
samples_haemosphere = pd.read_csv('../data/interim/reannotated/samples_Haemosphere_anno.tsv', sep='\t', index_col=0)

Select the relevant columns from each metadata table and rename them to a consistent format.

In [6]:
samples_s4m[:3]

chip_id,ds_id,dataset_name,replicate_group_id,organism,sample_type,generic_sample_type,final_cell_type,parental_cell_type,sex,labelling,description,platform,cell_lineage_anno,celltype_anno
GSM98876,6658.0,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2),HSC_HPC,hematopoietic stem cell
GSM98877,6658.0,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2),HSC_HPC,hematopoietic stem cell
GSM98878,6658.0,Rossi,Lin- c-kit+ Sca1+Flk2-CD34- KLSflk2-CD34- old...,Mus musculus,LT-HSC Aged animal,LT-HSC Aged animal,,,,,Loss of immune function and an increased incid...,Microarray (Affymetrix Mouse430_2),HSC_HPC,hematopoietic stem cell


In [7]:
samples_s4m = samples_s4m[['celltype_anno', 'cell_lineage_anno', 'description', 'platform', 'dataset_name']]
samples_s4m.columns = ['Cell Type', 'Cell Lineage', 'Description', 'Platform', 'Dataset Name']

In [8]:
samples_haemosphere[:3]

Unnamed: 0,celltype,cell_lineage,description,markers,platform,dataset_name,cell_lineage_anno,celltype_anno
GSE60927.SRR1561641,Fob,B Cell Lineage,Follicular B cells,FSC-lo B220:+ CD21:+ CD23:+,RNAseq (Illumina HiSeq),Haemopedia-Mouse-RNAseq,B-cell,B cell
GSE60927.SRR1561642,Fob,B Cell Lineage,Follicular B cells,FSC-lo B220:+ CD21:+ CD23:+,RNAseq (Illumina HiSeq),Haemopedia-Mouse-RNAseq,B-cell,B cell
GSE60927.SRR1561645,MZB,B Cell Lineage,Marginal zone B cells,FSC-lo B220:+ CD21:-hi CD23:-,RNAseq (Illumina HiSeq),Haemopedia-Mouse-RNAseq,B-cell,B cell


In [9]:
samples_haemosphere = samples_haemosphere[['celltype_anno', 'cell_lineage_anno', 'description', 'platform', 'dataset_name']]
samples_haemosphere.columns = ['Cell Type', 'Cell Lineage', 'Description', 'Platform', 'Dataset Name']

In [10]:
samples_encode[:3]

Files,Assay name,Biosample summary,Biosample term name,Description,Lab,Project,Biosample accession,Biological replicate,Technical replicate,Platform,cell_lineage_anno,celltype_anno
ENCFF547CXK,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,RNAseq (Illumina NextSeq),granulocyte,neutrophil
ENCFF774DIF,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,RNAseq (Illumina NextSeq),granulocyte,neutrophil
ENCFF155RUP,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"ENCBS190OUH,ENCBS176ENC",21,1,RNAseq (Illumina HiSeq),HSC_HPC,megakaryocyte erythroid progenitor


In [11]:
samples_encode = samples_encode[['celltype_anno', 'cell_lineage_anno', 'Description', 'Platform', 'Project']]
samples_encode.columns = ['Cell Type', 'Cell Lineage', 'Description', 'Platform', 'Dataset Name']
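The same select-and-rename pattern is repeated for all three tables, so it could be wrapped in one helper. The sketch below is an assumption about how that might look. `standardise_metadata`, `STANDARD_COLUMNS` and the toy row are hypothetical names and data introduced here. The helper also fails early if an expected source column is absent.

```python
import pandas as pd

STANDARD_COLUMNS = ['Cell Type', 'Cell Lineage', 'Description', 'Platform', 'Dataset Name']

def standardise_metadata(samples, source_columns):
    """Select source_columns (given in standard order) and rename them to STANDARD_COLUMNS."""
    missing = [c for c in source_columns if c not in samples.columns]
    if missing:
        raise KeyError(f'missing metadata columns: {missing}')
    standardised = samples[source_columns].copy()
    standardised.columns = STANDARD_COLUMNS
    return standardised

# Toy ENCODE-like table with one made-up sample and one extra column
toy = pd.DataFrame(
    {
        'celltype_anno': ['neutrophil'],
        'cell_lineage_anno': ['granulocyte'],
        'Description': ['toy description'],
        'Platform': ['RNAseq (Illumina NextSeq)'],
        'Project': ['ENCODE'],
        'Lab': ['toy lab'],
    },
    index=['toy_sample'],
)

toy_standard = standardise_metadata(
    toy, ['celltype_anno', 'cell_lineage_anno', 'Description', 'Platform', 'Project']
)
print(toy_standard)
```

Working on a copy avoids modifying a slice of the original table in place.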

### Merge the three metadata tables 

In [12]:
mouse_atlas_samples = pd.concat([samples_encode, samples_haemosphere, samples_s4m])
print(mouse_atlas_samples.shape)
mouse_atlas_samples.head()

(1102, 5)


Unnamed: 0,Cell Type,Cell Lineage,Description,Platform,Dataset Name
ENCFF547CXK,neutrophil,granulocyte,PSU mouse neutrophil total RNA scriptseqv2 RNA...,RNAseq (Illumina NextSeq),ENCODE
ENCFF774DIF,neutrophil,granulocyte,PSU mouse neutrophil total RNA scriptseqv2 RNA...,RNAseq (Illumina NextSeq),ENCODE
ENCFF155RUP,megakaryocyte erythroid progenitor,HSC_HPC,RNA-seq on mouse megakaryocyte-erythroid proge...,RNAseq (Illumina HiSeq),ENCODE
ENCFF702TOW,megakaryocyte erythroid progenitor,HSC_HPC,RNA-seq on mouse megakaryocyte-erythroid proge...,RNAseq (Illumina HiSeq),ENCODE
ENCFF963EBQ,megakaryocyte,megakaryocyte,PSU mouse Megakaryocyte 1 ng RNA-seq,RNAseq (Illumina HiSeq),ENCODE


In [13]:
# Remove duplicated samples that appear in more than one data collection (and metadata rows with a missing sample ID)
mouse_atlas_samples = mouse_atlas_samples[~mouse_atlas_samples.index.duplicated() & mouse_atlas_samples.index.notna()]
mouse_atlas_expression = mouse_atlas_expression.loc[:,~mouse_atlas_expression.columns.duplicated()]
print(mouse_atlas_samples.shape, mouse_atlas_expression.shape)

(1074, 5) (14222, 1074)
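Both tables end up with 1074 samples, but equal counts alone do not guarantee that the two tables hold the same sample IDs in the same order (before deduplication the metadata had 1102 rows and the expression table 1093 columns). The sketch below shows, on made-up toy data, one way to check the overlap and align both tables to the shared IDs. This is an optional extra check, not a step taken in this notebook.

```python
import pandas as pd

# Toy tables (hypothetical IDs): 's2' is duplicated, 's3' and 's4' are unmatched
samples = pd.DataFrame(
    {'Cell Type': ['a', 'b', 'b', 'c']},
    index=['s1', 's2', 's2', 's3'],
)
expression = pd.DataFrame(
    [[0.1, 0.2, 0.2, 0.3]],
    index=['g1'],
    columns=['s1', 's2', 's2', 's4'],
)

# Same deduplication as in the cell above: keep the first occurrence of each ID
samples = samples[~samples.index.duplicated() & samples.index.notna()]
expression = expression.loc[:, ~expression.columns.duplicated()]

# Compare the two sets of sample IDs
shared = samples.index.intersection(expression.columns)
only_in_samples = samples.index.difference(expression.columns)
only_in_expression = expression.columns.difference(samples.index)
print(len(shared), list(only_in_samples), list(only_in_expression))

# Align both tables to the shared IDs, in the same order
samples = samples.loc[shared]
expression = expression.loc[:, shared]
```

After this alignment, row `i` of the sample table describes column `i` of the expression table.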


In [14]:
# Save the integrated expression table and sample table
mouse_atlas_expression.to_csv('../data/interim/mouse_atlas/mouse_atlas_expression.tsv', sep='\t')
mouse_atlas_samples.to_csv('../data/interim/mouse_atlas/mouse_atlas_samples.tsv', sep='\t')