## Aim: Collect and combine relevant mouse data from the ENCODE data portal 

Link: https://www.encodeproject.org. 

From the web portal, we used a number of filters to retrieve the transcriptome data of healthy mouse blood cells. 

The filters we used are as follows: 

- Assay type: Transcription (+)
- Organism: Mus musculus (+)
- Available file types: tsv (+)
- Biosample classification: cell line (-)
- Cell: leukocyte (+), hematopoietic cell (+), T cell (+), myeloid cell (+), monocyte (+), stem cell (+), progenitor cell (+)
- Audit category: low replicate concordance (-)

where (+) marks a positive (include) filter and (-) marks a negative (exclude) filter.

### Samples

After the filtering, 90 samples from 38 experiments were selected. 

These selected samples came from two projects:

- ENCODE
- GGR

We downloaded the transcriptome data along with the associated metadata. 

Note that the transcriptome data were downloaded as 90 individual data files, each corresponding to a single sample and named with a unique file accession. 

We will need to integrate the separate data files into a combined dataset. 

### Processing steps 

- Metadata handling
- Merge data of ENCODE project 
- Merge data of GGR project
- Combine ENCODE and GGR data 

In [43]:
import pandas as pd
import numpy as np
import atlas



In [75]:
# Load the metadata
meta_exp       = pd.read_csv('../data/raw/encode/experiment_metadata.tsv', sep='\t', index_col=0, skiprows=[0])
meta_exp

Accession,Assay name,Biosample summary,Biosample term name,Description,Lab,Project,Files,Biosample accession,Biological replicate,Technical replicate
ENCSR074WOD,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"/files/ENCFF407JTK/,/files/ENCFF860EGH/,/files...","ENCBS562NSP,ENCBS622TZA",12,1
ENCSR000CIF,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"/files/ENCFF001MKW/,/files/ENCFF001MKX/,/files...","ENCBS190OUH,ENCBS176ENC",21,1
ENCSR340NCF,RNA-seq,C57BL/6 megakaryocyte male adult (5-6 weeks),megakaryocyte,PSU mouse Megakaryocyte 1 ng RNA-seq,"Ross Hardison, PennState",ENCODE,"/files/ENCFF993QUE/,/files/ENCFF102BWR/,/files...","ENCBS593JCW,ENCBS576MDX",12,1
ENCSR661TLW,RNA-seq,C57BL/6 erythroblast male adult (5-6 weeks),erythroblast,PSU mouse Erythroblast 1 ng RNA-seq,"Ross Hardison, PennState",ENCODE,"/files/ENCFF248DYS/,/files/ENCFF198RBJ/,/files...","ENCBS666WMI,ENCBS129OMN",21,1
ENCSR558PXY,RNA-seq,C57BL/6 erythroid progenitor cell male adult (...,erythroid progenitor cell,PSU mouse erythroid progenitor cell (CFU-Ery) ...,"Ross Hardison, PennState",ENCODE,"/files/ENCFF266YBS/,/files/ENCFF520FYT/,/files...","ENCBS512FEW,ENCBS949PKO",21,1
ENCSR277DPB,RNA-seq,C57BL/6 granulocyte monocyte progenitor cell m...,granulocyte monocyte progenitor cell,PSU mouse GMP 100ng rRNA-depleted RNA-seq via ...,"Ross Hardison, PennState",ENCODE,"/files/ENCFF836JVX/,/files/ENCFF845MSX/,/files...","ENCBS127MVP,ENCBS581UFA",12,1
ENCSR133SAI,RNA-seq,C57BL/6 common myeloid progenitor male adult (...,common myeloid progenitor,PSU mouse CMP 100ng rRNA-depleted RNA-seq via ...,"Ross Hardison, PennState",ENCODE,"/files/ENCFF048FGD/,/files/ENCFF898RXC/,/files...","ENCBS680UYK,ENCBS041EEZ",21,1
ENCSR757VAW,polyA plus RNA-seq,inflammation-experienced regulatory T-cells ma...,inflammation-experienced regulatory T-cells,,"Christina Leslie, MSKCC",GGR,"/files/ENCFF770YVS/,/files/ENCFF611XSR/,/files...",ENCBS578IJB,1,12
ENCSR582PLJ,polyA plus RNA-seq,strain Foxp3GFP-DTRCD4CreERT2R26tdTomato activ...,activated regulatory T-cells,,"Christina Leslie, MSKCC",GGR,"/files/ENCFF962CDM/,/files/ENCFF398WXT/,/files...",ENCBS138ASO,1,123
ENCSR306JBO,polyA plus RNA-seq,strain Foxp3CreERT2 x Rosa26YFP activated regu...,activated regulatory T-cells,,"Christina Leslie, MSKCC",GGR,"/files/ENCFF570EMY/,/files/ENCFF788ZCR/,/files...",ENCBS562IGY,1,12


Note that each entry of the metadata corresponds to an experiment, which gives rise to several kinds of data besides the gene quantification files we are interested in (e.g. read files in fastq format). The index of the metadata is the experiment accession, which does not directly link to the file accessions of our downloaded data. Thus, we need to transform the original metadata so that its index consists of file accessions and irrelevant files are excluded. 

In terminal: 

`ls -d ENCFF* | cut -f 1 -d '.' > file_accessions.txt` to record all the selected file accessions into a file called 'file_accessions.txt'. 
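
For readers who prefer to stay inside the notebook, the shell command above can be approximated in Python. This is a minimal sketch, not part of the original workflow: the helper name `list_file_accessions` and the file names in the demonstration are made up, and the demonstration runs on a temporary directory rather than on the real data folder.

```python
from pathlib import Path
import tempfile

def list_file_accessions(directory):
    """Return the sorted accessions (file name up to the first '.') of ENCFF* entries."""
    return sorted(p.name.split('.')[0] for p in Path(directory).glob('ENCFF*'))

# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for fname in ['ENCFF000AAA.tsv', 'ENCFF000BBB.tsv', 'notes.txt']:
        (Path(tmp) / fname).touch()
    accessions = list_file_accessions(tmp)

print(accessions)
```

Writing the resulting list to `file_accessions.txt`, one accession per line, would give the same kind of file that the shell command produces.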

__Convert the index of the metadata from experiment ID to file accession__

In [76]:
meta_exp.Files = meta_exp.Files.str.split("[\\s*,]+") # convert the File field from comma separated strings to lists of strings

# load the accessions listed in hiseq2000_files.tsv (used below to flag the experiments run on Illumina HiSeq)
with open('../data/raw/encode/hiseq2000_files.tsv', 'r') as f:
    hiseq2000_files = f.read().splitlines()[2:]
    
# add platform variable to the metadata 

meta_exp['Platform'] = 'Illumina NextSeq'
meta_exp.loc[meta_exp.index.isin(hiseq2000_files), ['Platform']] = 'Illumina HiSeq'

# load the accessions of the selected gene quantification files.
with open('../data/raw/encode/file_accessions.txt', 'r') as f:
    selected_files = f.read().splitlines()

lst_col = 'Files'

meta_file = pd.DataFrame({
      col:np.repeat(meta_exp[col].values, meta_exp[lst_col].str.len())
      for col in meta_exp.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(meta_exp[lst_col].values)})[meta_exp.columns]
meta_file.Files = meta_file.Files.replace(regex={r'^/files/': '', r'/$':''}) # remove prefix and suffix of the Files field
meta_file = meta_file.loc[meta_file.Files.isin(selected_files)].set_index('Files')

meta_file

Files,Assay name,Biosample summary,Biosample term name,Description,Lab,Project,Biosample accession,Biological replicate,Technical replicate,Platform
ENCFF547CXK,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,Illumina NextSeq
ENCFF774DIF,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,Illumina NextSeq
ENCFF155RUP,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"ENCBS190OUH,ENCBS176ENC",21,1,Illumina HiSeq
ENCFF702TOW,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"ENCBS190OUH,ENCBS176ENC",21,1,Illumina HiSeq
ENCFF963EBQ,RNA-seq,C57BL/6 megakaryocyte male adult (5-6 weeks),megakaryocyte,PSU mouse Megakaryocyte 1 ng RNA-seq,"Ross Hardison, PennState",ENCODE,"ENCBS593JCW,ENCBS576MDX",12,1,Illumina HiSeq
...,...,...,...,...,...,...,...,...,...,...
ENCFF022JXA,polyA plus RNA-seq,strain Foxp3GFP-DTRCD4CreERT2R26tdTomato regul...,regulatory T cell,,"Christina Leslie, MSKCC",GGR,ENCBS175VON,1,21,Illumina HiSeq
ENCFF029ZUB,polyA plus RNA-seq,strain Foxp3GFP-DTRCD4CreERT2R26tdTomato regul...,regulatory T cell,,"Christina Leslie, MSKCC",GGR,ENCBS175VON,1,21,Illumina HiSeq
ENCFF238CEP,polyA plus RNA-seq,inflammation-experienced regulatory T-cells (1...,inflammation-experienced regulatory T-cells,,"Christina Leslie, MSKCC",GGR,ENCBS550RYN,1,231,Illumina HiSeq
ENCFF625TWI,polyA plus RNA-seq,inflammation-experienced regulatory T-cells (1...,inflammation-experienced regulatory T-cells,,"Christina Leslie, MSKCC",GGR,ENCBS550RYN,1,231,Illumina HiSeq
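
The reshaping from one row per experiment to one row per file can also be sketched with `DataFrame.explode` (available in pandas 0.25 and later). The example below is an illustration on toy data, not the code that produced the table above: the accessions are made up, and unlike the original cell it keeps the experiment accession as a column.

```python
import pandas as pd

# Toy experiment-level metadata; all accessions are made up for illustration
toy_exp = pd.DataFrame(
    {'Project': ['ENCODE', 'GGR'],
     'Files': ['/files/ENCFF000AAA/,/files/ENCFF000BBB/', '/files/ENCFF000CCC/']},
    index=pd.Index(['ENCSR000XXX', 'ENCSR000YYY'], name='Accession'),
)
toy_selected = ['ENCFF000AAA', 'ENCFF000CCC']  # stands in for file_accessions.txt

# Split the comma-separated string into a list, then emit one row per list element
toy_file = (
    toy_exp.assign(Files=toy_exp['Files'].str.split(','))
    .explode('Files')
    .reset_index()
)

# Keep only the accession part of '/files/<accession>/'
toy_file['Files'] = toy_file['Files'].str.extract(r'/files/([^/]+)/', expand=False)

# Keep the selected files and index the metadata by file accession
toy_file = toy_file[toy_file['Files'].isin(toy_selected)].set_index('Files')
print(toy_file)
```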


In [45]:
# Example ENCODE sample
pd.read_csv('../data/raw/Encode/ENCFF040EGE.tsv', sep='\t', index_col=0).head()

gene_id,transcript_id(s),length,effective_length,expected_count,TPM,FPKM,posterior_mean_count,posterior_standard_deviation_of_count,pme_TPM,pme_FPKM,TPM_ci_lower_bound,TPM_ci_upper_bound,FPKM_ci_lower_bound,FPKM_ci_upper_bound
10000,10000,72.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10001,10001,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10002,10002,73.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10003,10003,75.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10004,10004,78.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0



In [37]:
# Example GGR sample
pd.read_csv('../data/raw/Encode/ENCFF022JXA.tsv', sep='\t', index_col=0, names=['ensembl','count']).head()

ensembl,count
ENSMUSG00000000001.4,1478
ENSMUSG00000000003.11,0
ENSMUSG00000000028.10,132
ENSMUSG00000000037.12,0
ENSMUSG00000000049.7,2


We can observe that the gene quantification files generated by the GGR and ENCODE projects have different formats. Thus, we need to process the data from the two projects separately. 
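
To make the format difference concrete, the sketch below reads either layout from its first line. It assumes only the two layouts shown above exist (a header row containing `TPM`, or a headerless gene/count table). The function name `read_quantification` and the toy rows are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

def read_quantification(handle):
    """Return (one-column DataFrame indexed by gene ID, kind of value).

    Assumes two layouts only: a tab-separated table whose header contains 'TPM',
    or a headerless two-column table of gene ID and count.
    """
    first_line = handle.readline()
    handle.seek(0)  # rewind so pandas reads from the start
    if 'TPM' in first_line.strip().split('\t'):
        df = pd.read_csv(handle, sep='\t', index_col=0)[['TPM']]
        df.columns = ['value']
        return df, 'TPM'
    df = pd.read_csv(handle, sep='\t', names=['gene_id', 'value'], index_col=0)
    return df, 'count'

# Toy inputs mimicking the two layouts (values are illustrative)
encode_like = io.StringIO(
    "gene_id\ttranscript_id(s)\tlength\tTPM\n"
    "ENSMUSG00000000001.4\tENSMUST00000000001.4\t3262.0\t14.35\n"
)
ggr_like = io.StringIO("ENSMUSG00000000001.4\t1478\nENSMUSG00000000003.11\t0\n")

encode_df, encode_kind = read_quantification(encode_like)
ggr_df, ggr_kind = read_quantification(ggr_like)
print(encode_kind, ggr_kind)
```

Returning the kind of value alongside the table keeps track of the fact that the two layouts carry different quantities.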

### Merge expression data derived from the ENCODE project (41 samples)

In [34]:
# Retrieve accessions of files generated from the ENCODE project
encode_files = meta_file[meta_file.Project=='ENCODE'].index

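def valid_rows(df):
    # keep only the rows indexed by Ensembl mouse gene IDs
    df = df[df.index.str.startswith("ENSMUSG")].copy()
    # remove the suffix of the gene ID which represents the gene version;
    # regex=True is required because recent pandas versions treat the pattern literally by default
    df.index = df.index.str.replace(r'\.\d+', '', regex=True)
    return df

def encode_file_processor(filenames):
    merged = pd.DataFrame()
    for name in filenames:
        df = pd.read_csv('../data/raw/Encode/'+name+'.tsv', sep='\t', index_col=0)
        df = valid_rows(df)[['TPM']]  # keep the TPM column only
        df.columns = [name]           # label the column with the file accession
        if merged.empty:
            merged = df
        else:
            # inner join keeps only the genes present in every file
            merged = pd.merge(merged, df, how='inner', left_index=True, right_index=True)
    return merged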

In [47]:
# Merge expression data 
expression_encode = encode_file_processor(encode_files)
expression_encode.head()  

gene_id,ENCFF547CXK,ENCFF774DIF,ENCFF155RUP,ENCFF702TOW,ENCFF963EBQ,ENCFF649QOI,ENCFF871RIM,ENCFF415ZGH,ENCFF924EMV,ENCFF063SJF,...,ENCFF342WUL,ENCFF858JHF,ENCFF253ENQ,ENCFF326XCA,ENCFF940NAH,ENCFF241IWV,ENCFF514MKL,ENCFF058IDN,ENCFF247FEJ,ENCFF064MKY
ENSMUSG00000000001,14.35,11.79,17.16,16.7,66.49,60.33,2.24,3.3,1.09,0.83,...,4.53,3.33,50.47,40.67,33.58,24.65,38.23,43.97,22.69,34.99
ENSMUSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ENSMUSG00000000028,0.42,0.36,22.15,28.23,27.89,28.73,1.1,1.13,1.52,2.25,...,3.34,2.26,35.12,25.66,14.97,14.21,17.65,23.63,7.86,12.23
ENSMUSG00000000031,0.0,0.0,0.32,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.81,0.65,0.0,0.0,0.64,0.19
ENSMUSG00000000037,0.0,0.0,1.4,1.59,0.79,1.03,0.06,0.8,0.14,0.04,...,0.45,0.04,0.86,0.83,1.9,1.02,0.49,0.28,0.91,1.02
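
The pairwise inner merges used above can also be expressed as a single `pd.concat` call with `join='inner'`. The following is a sketch on toy data under that assumption, not the code that produced the table. The sample names and the helper `strip_version` are made up, and `pd.concat` raises an error if stripping the version leaves duplicate gene IDs within a sample.

```python
import pandas as pd

def strip_version(index):
    """Drop the trailing '.<digits>' version suffix from Ensembl gene IDs."""
    return index.str.replace(r'\.\d+$', '', regex=True)

# Toy per-sample tables; sample names are made up
sample_a = pd.DataFrame(
    {'TPM': [14.35, 0.0, 0.42]},
    index=['ENSMUSG00000000001.4', 'ENSMUSG00000000003.11', 'ENSMUSG00000000028.10'],
)
sample_b = pd.DataFrame(
    {'TPM': [11.79, 0.36]},
    index=['ENSMUSG00000000001.4', 'ENSMUSG00000000028.10'],
)

columns = []
for name, df in {'sampleA': sample_a, 'sampleB': sample_b}.items():
    col = df['TPM'].rename(name)          # one named column per sample
    col.index = strip_version(col.index)  # unversioned gene IDs
    columns.append(col)

# Column-wise concatenation; join='inner' keeps genes present in every sample
merged_toy = pd.concat(columns, axis=1, join='inner')
print(merged_toy)
```

Building the list of columns first and concatenating once avoids repeatedly copying a growing table inside the loop.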


### Merge expression data derived from the GGR project (49 samples)

In [49]:
# Retrieve accessions of files generated from the GGR project
GGR_files = meta_file[meta_file.Project=='GGR'].index

def GGR_file_processor(filenames):
    merged = pd.DataFrame()
    for name in filenames:
        # GGR files have no header row: the first column is the gene ID, the second the count
        df = pd.read_csv('../data/raw/encode/'+name+'.tsv', sep='\t', names=["gene_id", "count"], index_col=0)
        df = valid_rows(df)
        df.columns = [name]  # label the column with the file accession
        if merged.empty:
            merged = df
        else:
            merged = pd.merge(merged, df, how='inner', left_index=True, right_index=True)
    return merged

# Merge the count data of GGR samples 
expression_GGR = GGR_file_processor(GGR_files)
expression_GGR.head()

gene_id,ENCFF670NKV,ENCFF966QMM,ENCFF256LKJ,ENCFF340BRF,ENCFF403QVJ,ENCFF519ATI,ENCFF895QBL,ENCFF545HUC,ENCFF682RDX,ENCFF093KAL,...,ENCFF455YEF,ENCFF024UXA,ENCFF410OSM,ENCFF806WBG,ENCFF462IWW,ENCFF022JXA,ENCFF029ZUB,ENCFF238CEP,ENCFF625TWI,ENCFF580TBQ
ENSMUSG00000000001,1488,1434,1429,1483,1448,750,652,653,621,577,...,1636,1425,1488,1348,1032,1478,1401,1614,1538,1665
ENSMUSG00000000003,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENSMUSG00000000028,99,82,212,253,166,187,191,61,52,174,...,388,473,153,107,164,132,106,113,33,129
ENSMUSG00000000037,0,0,7,6,4,0,5,0,0,2,...,0,2,7,3,3,0,7,11,16,5
ENSMUSG00000000049,0,0,1,1,0,0,0,0,0,0,...,0,1,0,0,0,2,0,1,0,0


### Combine the two projects into a merged dataset

In [79]:
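# inner join on gene ID; note that the ENCODE columns hold TPM values while the GGR columns hold counts
merged_dataset = pd.merge(expression_encode, expression_GGR, how='inner', left_index=True, right_index=True)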
merged_dataset.head()

gene_id,ENCFF547CXK,ENCFF774DIF,ENCFF155RUP,ENCFF702TOW,ENCFF963EBQ,ENCFF649QOI,ENCFF871RIM,ENCFF415ZGH,ENCFF924EMV,ENCFF063SJF,...,ENCFF455YEF,ENCFF024UXA,ENCFF410OSM,ENCFF806WBG,ENCFF462IWW,ENCFF022JXA,ENCFF029ZUB,ENCFF238CEP,ENCFF625TWI,ENCFF580TBQ
ENSMUSG00000000001,14.35,11.79,17.16,16.7,66.49,60.33,2.24,3.3,1.09,0.83,...,1636,1425,1488,1348,1032,1478,1401,1614,1538,1665
ENSMUSG00000000003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
ENSMUSG00000000028,0.42,0.36,22.15,28.23,27.89,28.73,1.1,1.13,1.52,2.25,...,388,473,153,107,164,132,106,113,33,129
ENSMUSG00000000037,0.0,0.0,1.4,1.59,0.79,1.03,0.06,0.8,0.14,0.04,...,0,2,7,3,3,0,7,11,16,5
ENSMUSG00000000049,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,1,0,0,0,2,0,1,0,0


In [80]:
meta_file.head()

Files,Assay name,Biosample summary,Biosample term name,Description,Lab,Project,Biosample accession,Biological replicate,Technical replicate,Platform
ENCFF547CXK,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,Illumina NextSeq
ENCFF774DIF,RNA-seq,C57BL/6 neutrophil adult (5-6 weeks),neutrophil,PSU mouse neutrophil total RNA scriptseqv2 RNA...,"Ross Hardison, PennState",ENCODE,"ENCBS562NSP,ENCBS622TZA",12,1,Illumina NextSeq
ENCFF155RUP,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"ENCBS190OUH,ENCBS176ENC",21,1,Illumina HiSeq
ENCFF702TOW,polyA plus RNA-seq,C57BL/6 megakaryocyte-erythroid progenitor cel...,megakaryocyte-erythroid progenitor cell,RNA-seq on mouse megakaryocyte-erythroid proge...,"Ross Hardison, PennState",ENCODE,"ENCBS190OUH,ENCBS176ENC",21,1,Illumina HiSeq
ENCFF963EBQ,RNA-seq,C57BL/6 megakaryocyte male adult (5-6 weeks),megakaryocyte,PSU mouse Megakaryocyte 1 ng RNA-seq,"Ross Hardison, PennState",ENCODE,"ENCBS593JCW,ENCBS576MDX",12,1,Illumina HiSeq
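
Before saving, it may be worth checking that every column of the merged expression table has a metadata row and vice versa. The helper below is a hypothetical sketch (the name `check_alignment` and the toy accessions are not from the original notebook), shown on toy data.

```python
import pandas as pd

def check_alignment(expression, metadata):
    """Return (samples missing from metadata, samples missing from expression)."""
    expr_samples = set(expression.columns)
    meta_samples = set(metadata.index)
    return sorted(expr_samples - meta_samples), sorted(meta_samples - expr_samples)

# Toy tables with made-up accessions: one sample is missing on each side
toy_expr = pd.DataFrame(
    {'ENCFF000AAA': [1.0], 'ENCFF000BBB': [2.0]},
    index=['ENSMUSG00000000001'],
)
toy_meta = pd.DataFrame(
    {'Project': ['ENCODE', 'GGR']},
    index=['ENCFF000AAA', 'ENCFF000CCC'],
)

missing_meta, missing_expr = check_alignment(toy_expr, toy_meta)
print(missing_meta, missing_expr)
```

With the real tables, the call would be `check_alignment(merged_dataset, meta_file)`, and two empty lists would indicate that samples and metadata line up.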


In [81]:
# Save the combined dataset
merged_dataset.to_csv('../data/interim/mouse_integrate/expression_encode.tsv', sep='\t')
meta_file.to_csv('../data/interim/mouse_integrate/samples_encode.tsv', sep='\t')