# Introduction
* **작성자**: 박소희
* **본 주피터 코드 목적**: GSE111016 RNA-seq의 DEG 분석을 위함
* **데이터셋**: GSE111016
* **데이터셋 설명**:
    * Experiment type: Expression profiling by high throughput sequencing
    * Platform: Illumina HiSeq 2500 (Homo sapiens)
    * Experiment molecule: total RNA
* **샘플 설명**:
    * 평균 70.2세 (control): 20명
    * 평균 72.7세 (sarcopenic): 20명
* **참고 논문**: https://www.nature.com/articles/s41467-019-13694-1  
* **데이터셋 링크**: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111016  

# Load library

In [1]:
import pandas as pd

import pydeseq2
from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
from pydeseq2.utils import load_example_data

import pickle
import os
os.chdir('/Users/soheepark/03-GEO근감소/Data/gse111016/')

# Data collection 
* DEG 분석을 위해 **Count table**과 샘플의 그룹 정보가 포함된 **Metadata**가 필요합니다.
* 아래에서 **Count table**과 **Metadata table**을 불러와 형태를 확인합니다.

먼저 **Count table을 확인**해봅니다.

In [2]:
# Count table 로드
counts = pd.read_csv('../2_GSE111016/GSE111016_allSamplesCounts_htseqcov1_sss_forGEO.csv', index_col=0)
print(counts.shape)
counts.head()

(65217, 40)


Unnamed: 0,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6,Sample 7,Sample 8,Sample 9,Sample 10,...,Sample 31,Sample 32,Sample 33,Sample 34,Sample 35,Sample 36,Sample 37,Sample 38,Sample 39,Sample 40
ENSG00000000003,124,145,61,110,122,120,145,118,108,125,...,114,119,79,67,134,114,106,128,149,122
ENSG00000000005,7,6,10,16,124,9,37,3,6,45,...,70,3,6,35,5,16,8,6,27,18
ENSG00000000419,536,839,415,690,825,852,774,838,629,708,...,482,739,639,462,780,551,576,788,742,656
ENSG00000000457,402,538,244,486,538,549,629,525,430,453,...,389,405,431,292,494,443,395,551,483,438
ENSG00000000460,141,119,73,169,96,104,145,115,94,131,...,118,104,145,85,131,115,141,121,125,120


다음으로 **Metadata를 확인**해봅니다.

In [3]:
# Metadata 로드
meta = pd.read_csv('../2_GSE111016/GSE111016_series_matrix.txt', sep='\t', skiprows=37, index_col=0)
meta.head()

Unnamed: 0_level_0,Sample 1 [sss],Sample 2 [sss],Sample 3 [sss],Sample 4 [sss],Sample 5 [sss],Sample 6 [sss],Sample 7 [sss],Sample 8 [sss],Sample 9 [sss],Sample 10 [sss],...,Sample 31 [sss],Sample 32 [sss],Sample 33 [sss],Sample 34 [sss],Sample 35 [sss],Sample 36 [sss],Sample 37 [sss],Sample 38 [sss],Sample 39 [sss],Sample 40 [sss]
!Sample_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!Sample_geo_accession,GSM3020405,GSM3020406,GSM3020407,GSM3020408,GSM3020409,GSM3020410,GSM3020411,GSM3020412,GSM3020413,GSM3020414,...,GSM3020435,GSM3020436,GSM3020437,GSM3020438,GSM3020439,GSM3020440,GSM3020441,GSM3020442,GSM3020443,GSM3020444
!Sample_status,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,...,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019,Public on Nov 14 2019
!Sample_submission_date,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,...,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018,Feb 22 2018
!Sample_last_update_date,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,...,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019,Nov 14 2019
!Sample_type,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA,...,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA,SRA


`!Sample_characteristics_ch1` 행에 **환자 그룹의 특징이 저장**되어 있습니다.

In [4]:
meta.loc['!Sample_characteristics_ch1']

Unnamed: 0_level_0,Sample 1 [sss],Sample 2 [sss],Sample 3 [sss],Sample 4 [sss],Sample 5 [sss],Sample 6 [sss],Sample 7 [sss],Sample 8 [sss],Sample 9 [sss],Sample 10 [sss],...,Sample 31 [sss],Sample 32 [sss],Sample 33 [sss],Sample 34 [sss],Sample 35 [sss],Sample 36 [sss],Sample 37 [sss],Sample 38 [sss],Sample 39 [sss],Sample 40 [sss]
!Sample_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!Sample_characteristics_ch1,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,...,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent
!Sample_characteristics_ch1,sarcopenia status: no,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: no,...,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes
!Sample_characteristics_ch1,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,...,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male
!Sample_characteristics_ch1,age (yr): 69,age (yr): 69,age (yr): 69,age (yr): 73,age (yr): 71,age (yr): 65,age (yr): 78,age (yr): 66,age (yr): 71,age (yr): 67,...,age (yr): 78,age (yr): 79,age (yr): 74,age (yr): 69,age (yr): 69,age (yr): 74,age (yr): 78,age (yr): 65,age (yr): 68,age (yr): 66
!Sample_characteristics_ch1,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,...,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle


`!Sample_characteristics_ch1` 행에서 **그룹 정보가 들어간 행**을 사용할겁니다.  
다음에서 개수를 확인해봅니다.

In [5]:
# 그룹 특징 추출 및 개수 파악
meta.iloc[9].value_counts()

sarcopenia status: no     20
sarcopenia status: yes    20
Name: !Sample_characteristics_ch1, dtype: int64

In [6]:
meta.iloc[9].head()

Sample 1 [sss]     sarcopenia status: no
Sample 2 [sss]     sarcopenia status: no
Sample 3 [sss]    sarcopenia status: yes
Sample 4 [sss]    sarcopenia status: yes
Sample 5 [sss]     sarcopenia status: no
Name: !Sample_characteristics_ch1, dtype: object

In [7]:
# 필요한 행만 추출
val1 = pd.Series(meta.columns).apply(lambda x:x.split(' [')[0])
val2 = meta.iloc[9].apply(lambda x:x.split(': ')[-1])

# no, yes -> oh(old health), os(old sarcopenia)
my_dict = {'no':'OH', 'yes':'OS'}
val2 = val2.map(my_dict)

metadata = dict(zip(val1, val2))
metadata = pd.DataFrame.from_dict(metadata, orient='index')
metadata.rename(columns = {metadata.columns[0]:'Condition'}, inplace=True)
metadata['Sample'] = metadata.index
metadata['Temp'] = metadata['Sample'].apply(lambda x:x.split(' ')[-1])
metadata.head()

Unnamed: 0,Condition,Sample,Temp
Sample 1,OH,Sample 1,1
Sample 2,OH,Sample 2,2
Sample 3,OS,Sample 3,3
Sample 4,OS,Sample 4,4
Sample 5,OH,Sample 5,5


In [9]:
metadata[metadata['Sample'].isin(['Sample 3', 'Sample 16', 'Sample 17', 'Sample 24', 'Sample 27', 'Sample 28', 'Sample 31'])]

Unnamed: 0,Condition,Sample,Temp
Sample 3,OS,Sample 3,3
Sample 16,OS,Sample 16,16
Sample 17,OH,Sample 17,17
Sample 24,OH,Sample 24,24
Sample 27,OH,Sample 27,27
Sample 28,OS,Sample 28,28
Sample 31,OS,Sample 31,31


# Extract metadata for EDA

In [9]:
meta.loc['!Sample_characteristics_ch1']

Unnamed: 0_level_0,Sample 1 [sss],Sample 2 [sss],Sample 3 [sss],Sample 4 [sss],Sample 5 [sss],Sample 6 [sss],Sample 7 [sss],Sample 8 [sss],Sample 9 [sss],Sample 10 [sss],...,Sample 31 [sss],Sample 32 [sss],Sample 33 [sss],Sample 34 [sss],Sample 35 [sss],Sample 36 [sss],Sample 37 [sss],Sample 38 [sss],Sample 39 [sss],Sample 40 [sss]
!Sample_title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
!Sample_characteristics_ch1,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,...,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent,population: Chinese descent
!Sample_characteristics_ch1,sarcopenia status: no,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: no,...,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes,sarcopenia status: no,sarcopenia status: yes,sarcopenia status: yes
!Sample_characteristics_ch1,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,...,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male,Sex: male
!Sample_characteristics_ch1,age (yr): 69,age (yr): 69,age (yr): 69,age (yr): 73,age (yr): 71,age (yr): 65,age (yr): 78,age (yr): 66,age (yr): 71,age (yr): 67,...,age (yr): 78,age (yr): 79,age (yr): 74,age (yr): 69,age (yr): 69,age (yr): 74,age (yr): 78,age (yr): 65,age (yr): 68,age (yr): 66
!Sample_characteristics_ch1,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,...,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle,tissue: vastus lateralis muscle


In [10]:
eda = {}
for i in range(0,5):
    key = meta.loc['!Sample_characteristics_ch1'].iloc[i][0].split(': ')[0]
    value = meta.loc['!Sample_characteristics_ch1'].iloc[i]
    value = value.apply(lambda x:x.split(': ')[-1])
    eda[key] = value

eda = pd.DataFrame(eda)

In [11]:
eda.head()

Unnamed: 0,population,sarcopenia status,Sex,age (yr),tissue
Sample 1 [sss],Chinese descent,no,male,69,vastus lateralis muscle
Sample 2 [sss],Chinese descent,no,male,69,vastus lateralis muscle
Sample 3 [sss],Chinese descent,yes,male,69,vastus lateralis muscle
Sample 4 [sss],Chinese descent,yes,male,73,vastus lateralis muscle
Sample 5 [sss],Chinese descent,no,male,71,vastus lateralis muscle


In [12]:
# eda.to_csv('./GSE111016_eda.csv')

# Preprocessing
다음은 유전자를 **Ensembl**에서 **GeneSymbol**로 바꿔주는 과정입니다.  
GSE111016에서 유전자는 **Ensembl**로 저장되어 있으며 GSE167186과 병합을 위해 다음 과정이 필요합니다.

## Ensembl to Symbol (by supplementary data 2)

In [13]:
supple = pd.read_csv('/Users/soheepark/03-GEO근감소/Data/GSE111016_SupplementaryDataset2.csv', index_col=0)
supple[:3]

Unnamed: 0,chr,gene_source,start,end,strand,gene_version,gene_name,gene_biotype,symbol.org.Hs.eg,title.org.Hs.eg,Amean,coef_sarc,modt_sarc,pval_sarc,adjp_sarc
ENSG00000214973,1,havana,27200834,27201473,-,3,CHCHD3P3,processed_pseudogene,,,2.246,-0.323,-5.154,7e-06,0.0796
ENSG00000130312,19,ensembl_havana,17292609,17306843,+,5,MRPL34,protein_coding,MRPL34,mitochondrial ribosomal protein L34,4.219,-0.373,-4.914,1.5e-05,0.0796
ENSG00000250479,22,ensembl_havana,23765834,23768443,-,7,CHCHD10,protein_coding,,,6.602,-0.337,-4.843,1.8e-05,0.0796


In [14]:
# counts 인덱스에 맞는 Supple Genesymbol을 매핑

counts['GeneSymbol'] = counts.index.map(supple['gene_name'])
print(f'Count of rows: {counts.shape[0]}')
counts.head()

Count of rows: 65217


Unnamed: 0,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6,Sample 7,Sample 8,Sample 9,Sample 10,...,Sample 32,Sample 33,Sample 34,Sample 35,Sample 36,Sample 37,Sample 38,Sample 39,Sample 40,GeneSymbol
ENSG00000000003,124,145,61,110,122,120,145,118,108,125,...,119,79,67,134,114,106,128,149,122,TSPAN6
ENSG00000000005,7,6,10,16,124,9,37,3,6,45,...,3,6,35,5,16,8,6,27,18,TNMD
ENSG00000000419,536,839,415,690,825,852,774,838,629,708,...,739,639,462,780,551,576,788,742,656,DPM1
ENSG00000000457,402,538,244,486,538,549,629,525,430,453,...,405,431,292,494,443,395,551,483,438,SCYL3
ENSG00000000460,141,119,73,169,96,104,145,115,94,131,...,104,145,85,131,115,141,121,125,120,C1orf112


In [15]:
# Gene Symbol이 없는 EnsembleID는 삭제
print(f"Count of NA GeneSymbol {counts['GeneSymbol'].isna().sum()}")
counts.dropna(subset=['GeneSymbol'], inplace=True)
print(f'Count of rows: {counts.shape[0]}')

Count of NA GeneSymbol 48338
Count of rows: 16879


## Ensembl to Symbol (by supplementary dataset 2)

In [24]:
supple = pd.read_csv('/Users/soheepark/03-GEO근감소/Data/GSE111016_SupplementaryDataset2.csv', index_col=0)
supple.head()

Unnamed: 0,chr,gene_source,start,end,strand,gene_version,gene_name,gene_biotype,symbol.org.Hs.eg,title.org.Hs.eg,Amean,coef_sarc,modt_sarc,pval_sarc,adjp_sarc
ENSG00000214973,1,havana,27200834,27201473,-,3,CHCHD3P3,processed_pseudogene,,,2.246,-0.323,-5.154,7e-06,0.0796
ENSG00000130312,19,ensembl_havana,17292609,17306843,+,5,MRPL34,protein_coding,MRPL34,mitochondrial ribosomal protein L34,4.219,-0.373,-4.914,1.5e-05,0.0796
ENSG00000250479,22,ensembl_havana,23765834,23768443,-,7,CHCHD10,protein_coding,,,6.602,-0.337,-4.843,1.8e-05,0.0796
ENSG00000156411,14,ensembl_havana,103912288,103928269,-,8,C14orf2,protein_coding,C14orf2,chromosome 14 open reading frame 2,5.394,-0.302,-4.765,2.4e-05,0.0796
ENSG00000146147,6,ensembl_havana,53929982,54266280,+,13,MLIP,protein_coding,MLIP,muscular LMNA interacting protein,6.833,0.277,4.714,2.8e-05,0.0796


In [25]:
# counts 인덱스에 맞는 supple의 gene_name을 매핑

counts['GeneSymbol'] = counts.index.map(supple.gene_name)
print(f'Count of rows: {counts.shape[0]}')

Count of rows: 65217


In [32]:
# Gene Symbol이 na인 경우 삭제
print(f"NA counts of GeneSymbol: {counts['GeneSymbol'].isna().sum()}")
counts.dropna(subset=['GeneSymbol'], inplace=True)
print(f'Count of rows: {counts.shape[0]}')

NA counts of GeneSymbol: 48338
Count of rows: 16879


In [None]:
# 'GeneSymbol' 컬럼을 인덱스로 설정하고 'EnsembleID' 컬럼 삭제

counts.set_index('GeneSymbol', inplace=True)
print(f'The shape of counts table: {counts.shape}')

In [None]:
# 전체 샘플에서 발현량이 없는 유전자를 필터링합니다.
print(f'필터링 전 테이블: {counts.shape}')
counts = counts.loc[counts.sum(axis=1)>0]
print(f'필터링 후 테이블: {counts.shape}')

## Ensembl to Symbol (by pybiomart)
* https://pypi.org/project/pybiomart/

In [60]:
# # pip install pybiomart
# # http://www.ensembl.org/biomart
# from pybiomart import Server

# server = Server(host='http://www.ensembl.org')

# dataset = (server.marts['ENSEMBL_MART_ENSEMBL']
#                  .datasets['hsapiens_gene_ensembl'])

# dataset.query(attributes=['ensembl_gene_id', 'external_gene_name'],
#               filters={'chromosome_name': ['1','2']}).head()

Unnamed: 0,Gene stable ID,Gene name
0,ENSG00000290825,DDX11L2
1,ENSG00000223972,DDX11L1
2,ENSG00000227232,WASH7P
3,ENSG00000278267,MIR6859-1
4,ENSG00000243485,MIR1302-2HG


In [61]:
# from pybiomart import Dataset

# dataset = Dataset(name='hsapiens_gene_ensembl',
#                   host='http://www.ensembl.org')

# dataset.query(attributes=['ensembl_gene_id', 'external_gene_name'],
#               filters={'chromosome_name': ['1','2']}).head()

Unnamed: 0,Gene stable ID,Gene name
0,ENSG00000290825,DDX11L2
1,ENSG00000223972,DDX11L1
2,ENSG00000227232,WASH7P
3,ENSG00000278267,MIR6859-1
4,ENSG00000243485,MIR1302-2HG


## Ensembl to Symbol (by biomart)

In [12]:
# pip install biomart
# Original reference: https://github.com/sebriois/biomart
# Reference blog: https://autobencoder.com/2021-10-03-gene-conversion/
# Reference 2: https://bioconductor.riken.jp/packages/3.4/bioc/vignettes/biomaRt/inst/doc/biomaRt.html

import biomart

def get_ensembl_mappings():                                   
    # 서버 연결 세팅
    server = biomart.BiomartServer('http://www.ensembl.org/biomart')         
    mart = server.datasets['hsapiens_gene_ensembl']                            

    # 원하는 타입의 어노테이션을 작성
    attributes = ['ensembl_gene_id', 'hgnc_symbol'] # ensembl_transcript_id

    # 매핑
    response = mart.search({'attributes': attributes})                          
    data = response.raw.data.decode('ascii')                                    

    # 딕셔너리에 저장
    ensembl_to_genesymbol = {}                                                  
    for line in data.splitlines():                                              
        line = line.split('\t')                                                 

        ensembl_gene = line[0]
        gene_symbol = line[1]

        ensembl_to_genesymbol[ensembl_gene] = ensembl_gene
        ensembl_to_genesymbol[ensembl_gene] = gene_symbol

        
    return ensembl_to_genesymbol

In [13]:
# ensemble genesymbol 매핑 데이터프레임 생성

map_dict = get_ensembl_mappings()
map_dict

{'ENSG00000210049': 'MT-TF',
 'ENSG00000211459': 'MT-RNR1',
 'ENSG00000210077': 'MT-TV',
 'ENSG00000210082': 'MT-RNR2',
 'ENSG00000209082': 'MT-TL1',
 'ENSG00000198888': 'MT-ND1',
 'ENSG00000210100': 'MT-TI',
 'ENSG00000210107': 'MT-TQ',
 'ENSG00000210112': 'MT-TM',
 'ENSG00000198763': 'MT-ND2',
 'ENSG00000210117': 'MT-TW',
 'ENSG00000210127': 'MT-TA',
 'ENSG00000210135': 'MT-TN',
 'ENSG00000210140': 'MT-TC',
 'ENSG00000210144': 'MT-TY',
 'ENSG00000198804': 'MT-CO1',
 'ENSG00000210151': 'MT-TS1',
 'ENSG00000210154': 'MT-TD',
 'ENSG00000198712': 'MT-CO2',
 'ENSG00000210156': 'MT-TK',
 'ENSG00000228253': 'MT-ATP8',
 'ENSG00000198899': 'MT-ATP6',
 'ENSG00000198938': 'MT-CO3',
 'ENSG00000210164': 'MT-TG',
 'ENSG00000198840': 'MT-ND3',
 'ENSG00000210174': 'MT-TR',
 'ENSG00000212907': 'MT-ND4L',
 'ENSG00000198886': 'MT-ND4',
 'ENSG00000210176': 'MT-TH',
 'ENSG00000210184': 'MT-TS2',
 'ENSG00000210191': 'MT-TL2',
 'ENSG00000198786': 'MT-ND5',
 'ENSG00000198695': 'MT-ND6',
 'ENSG00000210194': 

In [14]:
# counts 인덱스에 맞는 map_dict의 Genesymbol을 매핑

counts['GeneSymbol'] = counts.index.map(map_dict)
print(f'Count of rows: {counts.shape[0]}')

Count of rows: 65217


In [15]:
# Gene Symbol이 없는 EnsembleID는 삭제

counts.dropna(subset=['GeneSymbol'], inplace=True)
print(f'Count of rows: {counts.shape[0]}')

Count of rows: 60466


In [16]:
# Gene Symbol열에 공백인 행 발견함
counts = counts[counts['GeneSymbol']!='']
print(f'Count of rows: {counts.shape[0]}')

Count of rows: 43369


In [17]:
# 'GeneSymbol' 컬럼을 인덱스로 설정하고 'EnsembleID' 컬럼 삭제

counts.set_index('GeneSymbol', inplace=True)
print(f'The shape of counts table: {counts.shape}')

The shape of counts table: (43369, 40)


## 이슈 (해결)
매핑하면서 Ensembl ID에서 공동으로 매핑되는 GeneSymbol이 존재함.  
-> Sum 값 활용

In [18]:
my_list = counts.columns.tolist()

common_items = set()

for item in my_list:
    if my_list.count(item) > 1:  # 리스트에서 해당 아이템이 두 번 이상 나타나면
        common_items.add(item)   # 공통 아이템 세트에 추가

In [19]:
counts[common_items]

TSPAN6
TNMD
DPM1
SCYL3
FIRRM
...
LINC01144
RPL23AP64
MUC20-OT1
SLC16A1
SLC45A2


In [20]:
print(f'원 데이터: {counts.shape}')
counts = counts.groupby(counts.index).sum()
print(f'공동 유전자 sum: {counts.shape}')
counts.head()

원 데이터: (43369, 40)
공동 유전자 sum: (40063, 40)


Unnamed: 0_level_0,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6,Sample 7,Sample 8,Sample 9,Sample 10,...,Sample 31,Sample 32,Sample 33,Sample 34,Sample 35,Sample 36,Sample 37,Sample 38,Sample 39,Sample 40
GeneSymbol,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A1BG,0,7,0,1,5,3,3,3,0,4,...,7,5,1,1,4,2,2,3,2,2
A1BG-AS1,14,13,3,17,18,21,7,12,12,26,...,16,13,10,2,13,16,12,17,24,9
A1CF,1,1,2,4,0,0,3,2,0,1,...,0,0,0,0,1,3,1,1,2,1
A2M,7145,9004,3427,6806,6684,8837,7772,7157,5560,6838,...,6439,7099,5109,3326,5538,8194,6718,8848,7610,5369
A2M-AS1,131,111,65,168,129,160,204,154,159,136,...,112,138,134,101,135,163,141,148,133,93


In [21]:
counts = counts.transpose()

In [25]:
# raw 저장
# metadata.to_csv('./GSE111016_metadata.csv', sep='\t')
# counts.to_csv('./GSE111016_counts.csv', sep='\t')

## Ensembl to Symbol (by sambomics)

In [57]:
# # sanbomics.tools의 id_map 모듈을 활용하여 매핑할 수 있습니다. 
# mapper = id_map(species = 'human', key = 'ensembl', target = 'symbol')

In [58]:
# # counts 데이터프레임의 Index를 gene symbol로 바꿔줍니다.
# counts['Symbol'] = counts.index.map(mapper.mapper)
# counts.head()

Unnamed: 0,Sample 1,Sample 2,Sample 3,Sample 4,Sample 5,Sample 6,Sample 7,Sample 8,Sample 9,Sample 10,...,Sample 32,Sample 33,Sample 34,Sample 35,Sample 36,Sample 37,Sample 38,Sample 39,Sample 40,Symbol
ENSG00000000003,124,145,61,110,122,120,145,118,108,125,...,119,79,67,134,114,106,128,149,122,TSPAN6
ENSG00000000005,7,6,10,16,124,9,37,3,6,45,...,3,6,35,5,16,8,6,27,18,TNMD
ENSG00000000419,536,839,415,690,825,852,774,838,629,708,...,739,639,462,780,551,576,788,742,656,DPM1
ENSG00000000457,402,538,244,486,538,549,629,525,430,453,...,405,431,292,494,443,395,551,483,438,SCYL3
ENSG00000000460,141,119,73,169,96,104,145,115,94,131,...,104,145,85,131,115,141,121,125,120,C1orf112


In [59]:
# # 몇 개의 유전자가 매핑되었는지 확인해봅니다.
# print(f'The shape of counts Dataframe: {counts.shape}')
# counts['Symbol'].isnull().value_counts()

The shape of counts Dataframe: (65217, 41)


False    51332
True     13885
Name: Symbol, dtype: int64

## Filtering genes

In [29]:
counts = counts.T

In [30]:
# 전체 샘플에서 발현량이 없는 유전자를 필터링합니다.
print(f'필터링 전 테이블: {counts.shape}')
counts = counts[counts.sum(axis=1)>0]
print(f'필터링 후 테이블: {counts.shape}')

필터링 전 테이블: (40063, 40)
필터링 후 테이블: (36678, 40)


In [31]:
# 논문 상 필터링 기준
counts = counts[counts.mean(axis=1)>20]
print(f'논문기반 필터링 후 테이블: {counts.shape}')

논문기반 필터링 후 테이블: (15357, 40)


In [32]:
counts = counts.transpose()
counts.head()

GeneSymbol,A2M,A2M-AS1,A2ML1,A4GALT,AAAS,AACS,AADAT,AAGAB,AAK1,AAMDC,...,ZUP1,ZW10,ZWILCH,ZXDA,ZXDB,ZXDC,ZYG11B,ZYX,ZZEF1,ZZZ3
Sample 1,7145,131,5,186,593,172,103,724,1792,1607,...,118,682,96,346,394,1877,15055,573,3175,1338
Sample 2,9004,111,16,177,592,253,93,912,2333,2208,...,194,753,101,418,413,1866,17429,634,3038,1454
Sample 3,3427,65,160,85,330,154,44,434,1488,1042,...,66,329,61,187,196,777,7547,233,1429,642
Sample 4,6806,168,6,161,582,149,91,880,1963,1623,...,144,796,105,331,422,1828,11336,581,2653,1643
Sample 5,6684,129,8,307,559,248,91,917,2257,1558,...,181,665,133,380,407,1762,16293,527,3087,1734


In [33]:
# counts.to_csv('./GSE111016_counts_filtered.csv', sep='\t')

# DEG analysis

In [2]:
# raw 읽어오기
metadata = pd.read_csv('./GSE111016_metadata.csv', sep='\t', index_col=[0])
counts = pd.read_csv('./GSE111016_counts_filtered.csv', sep='\t', index_col=[0], low_memory=False)

In [8]:
def my_deseq(metadata, counts, name):
    # 1. Build DeseqDataSet
    dds = DeseqDataSet(counts=counts,
                       metadata=metadata,
                       design_factors="Condition",
                       refit_cooks=True)
    
    # 2. Run DEseq2
    dds.deseq2()
    
    # 3. Deseq2Stat # 1:'os', 2:'oh'
    group1 = dds.obs.Condition.unique()[1] # 1: 'os'
    group2 = dds.obs.Condition.unique()[0] # 0: 'oh'
    print(f"comparison between {group1} and {group2}")
    
    stat_res = DeseqStats(dds, contrast=('Condition', group1, group2))
    stat_res.summary()
    res = stat_res.results_df

    # 4. Save dds and res as pickle and csv, respectively
    with open(f"./{name.split('_')[1]}/results_{name}_dds.pkl", 'wb') as f:
        pickle.dump(dds, f)
    res.to_csv(f"./{name.split('_')[1]}/results_{name}.csv")

In [9]:
my_deseq(metadata, counts, 'GSE111016_osoh_filtered_refit')

Fitting size factors...
... done in 0.03 seconds.

Fitting dispersions...
... done in 5.10 seconds.

Fitting dispersion trend curve...
... done in 0.57 seconds.

Fitting MAP dispersions...
... done in 6.01 seconds.

Fitting LFCs...
... done in 2.33 seconds.

Replacing 82 outlier genes.

Fitting dispersions...
... done in 0.05 seconds.

Fitting MAP dispersions...
... done in 0.06 seconds.

Fitting LFCs...
... done in 0.05 seconds.

Running Wald tests...


comparison between OS and OH


... done in 0.98 seconds.



Log2 fold change & Wald test p-value: Condition OS vs OH
             baseMean  log2FoldChange     lfcSE      stat    pvalue      padj
A2M       6659.963434       -0.188394  0.083747 -2.249562  0.024477  0.214547
A2M-AS1    123.823629       -0.001157  0.084789 -0.013645  0.989113       NaN
A2ML1       11.030964        0.395696  0.426042  0.928772  0.353007       NaN
A4GALT     206.737365        0.161064  0.106277  1.515517  0.129642       NaN
AAAS       562.242181        0.032603  0.043561  0.748452  0.454188  0.743050
...               ...             ...       ...       ...       ...       ...
ZXDC      1664.379073        0.114611  0.047863  2.394569  0.016640  0.188466
ZYG11B   13204.512796       -0.136301  0.060581 -2.249909  0.024455  0.214547
ZYX        596.301319        0.147968  0.098065  1.508877  0.131330  0.448803
ZZEF1     2829.319141        0.093573  0.040800  2.293480  0.021820  0.207952
ZZZ3      1333.699151        0.016783  0.043503  0.385785  0.699656  0.879382

[15357

# GSEA
* (코멘트) gsea html 파일이 나옴! gmt, tms, gct 파일 input
* GO에 대한 유전자 score에 따른 정보 나옴

In [None]:
res.head()
res['Symbol'] = res.index

In [None]:
ranking = res[['Symbol','stat']].dropna().sort_values('stat', ascending=False)
ranking = ranking.drop_duplicates('Symbol')
ranking

In [None]:
manual_set = {'things':['COL19A1', 'H19', 'FGF7', 'NHLH2', 'MT2A']}

In [None]:
del ranking['Symbol']

In [None]:
pre_res = gp.prerank(rnk = ranking,
                     gene_sets = ['GO_Biological_Process_2021', manual_set],
                     seed = 6, permutation_num = 100)

In [None]:
out = []

for term in list(pre_res.results):
    out.append([term,
               pre_res.results[term]['fdr'],
               pre_res.results[term]['es'],
               pre_res.results[term]['nes']])

out_df = pd.DataFrame(out, columns = ['Term','fdr', 'es', 'nes']).sort_values('fdr').reset_index(drop = True)
out_df

In [None]:
print(out_df.sort_values('nes').iloc[0].Term)
print(out_df.sort_values('nes').iloc[1].Term)

In [None]:
my_term = out_df.sort_values('nes').iloc[0].Term
my_term

In [None]:
gseaplot(rank_metric=pre_res.ranking, term=my_term, **pre_res.results[my_term])