This notebook details the parsing of mouse phenotype ontology gene sets from Mouse Genome Informatics (MGI).

In [1]:
import pandas as pd

The relevant files to curate the mouse gene set were downloaded from MGI.

http://www.informatics.jax.org/downloads/reports/index.html

Two files are required:
1. MGI_PhenoGenoMP.rpt
2. HMD_HumanPhenotype.rpt

In [2]:
mouse_file = "../data/gene_sets_June2022/MGI_PhenoGenoMP.rpt"
human_file = "../data/gene_sets_June2022/HMD_HumanPhenotype.rpt"

## Parsing of mouse file

We first parse the MGI_PhenoGenoMP.rpt file.

In [3]:
mouse = pd.read_csv(mouse_file,header=None,sep='\t')
mouse.columns = ["Allelic_Composition", "AlleleSymbol",
                 "Genetic_Background","Mammalian_Phenotype_ID",
                 "PubMed_ID","MGI_Marker_Accession_ID"]
mouse.head()

Unnamed: 0,Allelic_Composition,AlleleSymbol,Genetic_Background,Mammalian_Phenotype_ID,PubMed_ID,MGI_Marker_Accession_ID
0,Rb1<tm1Tyj>/Rb1<tm1Tyj>,Rb1<tm1Tyj>,involves: 129S2/SvPas,MP:0000600,12529408,MGI:97874
1,Rb1<tm1Tyj>/Rb1<tm1Tyj>,Rb1<tm1Tyj>,involves: 129S2/SvPas,MP:0001716,16449662,MGI:97874
2,Rb1<tm1Tyj>/Rb1<tm1Tyj>,Rb1<tm1Tyj>,involves: 129S2/SvPas,MP:0001698,16449662,MGI:97874
3,Rb1<tm1Tyj>/Rb1<tm1Tyj>,Rb1<tm1Tyj>,involves: 129S2/SvPas,MP:0001092,16449662,MGI:97874
4,Rb1<tm1Tyj>/Rb1<tm1Tyj>,Rb1<tm1Tyj>,involves: 129S2/SvPas,MP:0000961,16449662,MGI:97874


In [4]:
mouse.tail()

Unnamed: 0,Allelic_Composition,AlleleSymbol,Genetic_Background,Mammalian_Phenotype_ID,PubMed_ID,MGI_Marker_Accession_ID
358233,"Rr27<tm1Kpfe>/Rr27<+>,Tg(Tpbpa-cre,-EGFP)5Jcc/0","Tg(Tpbpa-cre,-EGFP)5Jcc|Rr27<tm1Kpfe>|Rr27<+>",involves: 129S1/Sv * 129X1/SvJ,MP:0005033,34982814,MGI:5287871|MGI:5489785
358234,"Rr27<tm1Kpfe>/Rr27<+>,Tg(Tpbpa-cre,-EGFP)5Jcc/0","Tg(Tpbpa-cre,-EGFP)5Jcc|Rr27<tm1Kpfe>|Rr27<+>",involves: 129S1/Sv * 129X1/SvJ,MP:0004256,34982814,MGI:5287871|MGI:5489785
358235,"Rr27<tm1Kpfe>/Rr27<+>,Tg(Tpbpa-cre,-EGFP)5Jcc/0","Tg(Tpbpa-cre,-EGFP)5Jcc|Rr27<tm1Kpfe>|Rr27<+>",involves: 129S1/Sv * 129X1/SvJ,MP:0008957,34982814,MGI:5287871|MGI:5489785
358236,"Rr27<tm1Kpfe>/Rr27<+>,Tg(Tpbpa-cre,-EGFP)5Jcc/0","Tg(Tpbpa-cre,-EGFP)5Jcc|Rr27<tm1Kpfe>|Rr27<+>",involves: 129S1/Sv * 129X1/SvJ,MP:0008959,34982814,MGI:5287871|MGI:5489785
358237,"Rr27<tm1Kpfe>/Rr27<+>,Tg(Tpbpa-cre,-EGFP)5Jcc/0","Tg(Tpbpa-cre,-EGFP)5Jcc|Rr27<tm1Kpfe>|Rr27<+>",involves: 129S1/Sv * 129X1/SvJ,MP:0011521,34982814,MGI:5287871|MGI:5489785


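We notice that the "MGI_Marker_Accession_ID" column can hold more than one marker ID in a single row, separated by "|" (see the last rows above). We split these values so that each row holds exactly one marker ID.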

In [5]:
# some rows list several MGI marker IDs separated by "|"
# keep only the ID column, split on "|", then explode to one ID per row
mouse = mouse[["MGI_Marker_Accession_ID"]].copy()
mouse["MGI_Marker_Accession_ID"] = mouse["MGI_Marker_Accession_ID"].str.split("|")
mouse = mouse.explode("MGI_Marker_Accession_ID")
mouse

Unnamed: 0,MGI_Marker_Accession_ID
0,MGI:97874
1,MGI:97874
2,MGI:97874
3,MGI:97874
4,MGI:97874
...,...
358235,MGI:5489785
358236,MGI:5287871
358236,MGI:5489785
358237,MGI:5287871
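
The split-and-explode step can be illustrated on a two-row toy frame. This is a minimal sketch: the frame `toy` is made up for illustration and only reuses marker IDs shown in the outputs above.

```python
import pandas as pd

# Toy frame (illustrative only): one row with a single marker ID,
# one row with two IDs joined by "|", as in the MGI report.
toy = pd.DataFrame(
    {"MGI_Marker_Accession_ID": ["MGI:97874", "MGI:5287871|MGI:5489785"]}
)

# Turn each cell into a list of IDs, then give every ID its own row.
toy["MGI_Marker_Accession_ID"] = toy["MGI_Marker_Accession_ID"].str.split("|")
exploded = toy.explode("MGI_Marker_Accession_ID")

print(exploded)
```

Note that `explode` repeats the original index for rows that came from the same cell, which is why index values such as 358236 appear twice in the output above.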


The unique list of MGI IDs is written to a file and submitted to the MGI batch query tool.

http://www.informatics.jax.org/batch

In the batch query, the Mammalian Phenotype (MP) output option was selected in addition to all gene attributes.

In [6]:
# the same marker ID appears in many rows, so keep one copy of each
mouse = mouse.drop_duplicates("MGI_Marker_Accession_ID")
mouse["MGI_Marker_Accession_ID"].to_csv("../data/gene_sets_June2022/MGI_ID_unique_27June_2022.txt",
                                        index=False, header=False)
len(mouse)

22428

## Parsing of batch report file

In [7]:
mgi_batch_report = pd.read_csv("../data/gene_sets_June2022/MGIBatchReport_27June_2022.txt",sep="\t")
mgi_batch_report.head()


Unnamed: 0,Input,Input Type,MGI Gene/Marker ID,Symbol,Name,Feature Type,Chr,Strand,Start,End,Ensembl ID,Entrez Gene ID,MP ID,Term
0,MGI:97874,MGI,MGI:97874,Rb1,RB transcriptional corepressor 1,protein coding gene,14,-,73430298.0,73563446.0,ENSMUSG00000022105,19645.0,MP:0000042,abnormal organ of Corti morphology
1,MGI:97874,MGI,MGI:97874,Rb1,RB transcriptional corepressor 1,protein coding gene,14,-,73430298.0,73563446.0,ENSMUSG00000022105,19645.0,MP:0000160,kyphosis
2,MGI:97874,MGI,MGI:97874,Rb1,RB transcriptional corepressor 1,protein coding gene,14,-,73430298.0,73563446.0,ENSMUSG00000022105,19645.0,MP:0000163,abnormal cartilage morphology
3,MGI:97874,MGI,MGI:97874,Rb1,RB transcriptional corepressor 1,protein coding gene,14,-,73430298.0,73563446.0,ENSMUSG00000022105,19645.0,MP:0000245,abnormal erythropoiesis
4,MGI:97874,MGI,MGI:97874,Rb1,RB transcriptional corepressor 1,protein coding gene,14,-,73430298.0,73563446.0,ENSMUSG00000022105,19645.0,MP:0000350,abnormal cell proliferation


In [8]:
# select relevant columns
mgi_batch_report_parsed = mgi_batch_report[["MP ID","Term", "MGI Gene/Marker ID"]].copy()

# replace spaces with underscores
mgi_batch_report_parsed["Term"] = mgi_batch_report_parsed["Term"].str.replace(" ","_")

# genesets are MP ID + Term
mgi_batch_report_parsed["gene_set"] = mgi_batch_report_parsed["MP ID"]+"_"+mgi_batch_report_parsed["Term"]
mgi_batch_report_parsed = mgi_batch_report_parsed.sort_values("gene_set")
mgi_batch_report_parsed.columns = ["mp","term","mgi", "gene_set"]
mgi_batch_report_parsed["database"] = "MGI"
mgi_batch_report_parsed = mgi_batch_report_parsed[["database","gene_set", "mgi", "mp"]]

# remove any NAs
mgi_batch_report_parsed = mgi_batch_report_parsed.dropna()
print(len(mgi_batch_report_parsed))
mgi_batch_report_parsed.head()

253312


Unnamed: 0,database,gene_set,mgi,mp
183745,MGI,MP:0000003_abnormal_adipose_tissue_morphology,MGI:2177469,MP:0000003
41139,MGI,MP:0000003_abnormal_adipose_tissue_morphology,MGI:895149,MP:0000003
31672,MGI,MP:0000003_abnormal_adipose_tissue_morphology,MGI:109583,MP:0000003
159705,MGI,MP:0000003_abnormal_adipose_tissue_morphology,MGI:3612271,MP:0000003
48307,MGI,MP:0000003_abnormal_adipose_tissue_morphology,MGI:96820,MP:0000003
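
To make the labelling step concrete, here is a small sketch on toy rows. The first two rows reuse MP terms from the batch report shown earlier. The third row, including its marker ID, is invented to show how a marker without an MP annotation ends up with a missing label and is removed by `dropna()`.

```python
import pandas as pd

# Toy batch-report rows; the last row is hypothetical (made-up ID, no MP term).
toy = pd.DataFrame({
    "MP ID": ["MP:0000160", "MP:0000163", None],
    "Term": ["kyphosis", "abnormal cartilage morphology", None],
    "MGI Gene/Marker ID": ["MGI:97874", "MGI:97874", "MGI:0000000"],
})

# Gene set label = MP ID + "_" + term with spaces replaced by underscores.
toy["gene_set"] = toy["MP ID"] + "_" + toy["Term"].str.replace(" ", "_", regex=False)

# Rows with a missing MP ID or term have a missing label and are dropped.
labelled = toy.dropna()
print(labelled["gene_set"].tolist())
```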


In [9]:
print(len(mgi_batch_report_parsed["mp"].unique()))
print(len(mgi_batch_report_parsed["mgi"].unique()))

10554
20614


## Parsing of human file

Now that we have our gene sets, we parse the human file so that we can merge on the MGI ID and identify the associated human gene for each MP term.

In [10]:
human = pd.read_csv(human_file,
                    sep="\t", header=None,
                    names = ["gene_symbol", "EntrezID", 
                             "mouse_gene_symbol", "mgi", 
                             "mp_id_high_level", "blank"]
                   )
human = human.drop(["blank", "mp_id_high_level"], axis=1)
human.head()

Unnamed: 0,gene_symbol,EntrezID,mouse_gene_symbol,mgi
0,A1BG,1,A1bg,MGI:2152878
1,A1CF,29974,A1cf,MGI:1917115
2,A2M,2,A2m,MGI:2449119
3,A3GALT2,127550,A3galt2,MGI:2685279
4,A4GALT,53947,A4galt,MGI:3512453


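We now merge the two dataframes on the MGI marker ID. Because `pd.merge` performs an inner join by default, only mouse markers with a human ortholog in the HMD file are kept.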

In [11]:
human_mouse = pd.merge(human, mgi_batch_report_parsed, on="mgi")
# copy the column subset so that adding a column does not act on a view
human_mouse = human_mouse[["gene_symbol","EntrezID","gene_set"]].copy()
human_mouse["database"] = "MGI"
human_mouse.head()

Unnamed: 0,gene_symbol,EntrezID,gene_set,database
0,A1CF,29974,MP:0000352_decreased_cell_proliferation,MGI
1,A1CF,29974,MP:0001552_increased_circulating_triglyceride_...,MGI
2,A1CF,29974,MP:0002139_abnormal_hepatobiliary_system_physi...,MGI
3,A1CF,29974,MP:0002727_decreased_circulating_insulin_level,MGI
4,A1CF,29974,MP:0002795_dilated_cardiomyopathy,MGI
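
One way to see what the inner join discards is to repeat the merge as an outer join with `indicator=True`. This is a sketch on toy data, not part of the original analysis: the frames `human_toy` and `sets_toy` are invented, and the toy human table deliberately omits one marker so that every merge outcome appears.

```python
import pandas as pd

# Toy human-to-mouse table and toy gene-set table (illustrative only).
human_toy = pd.DataFrame({
    "gene_symbol": ["A1CF", "A2M"],
    "mgi": ["MGI:1917115", "MGI:2449119"],
})
sets_toy = pd.DataFrame({
    "mgi": ["MGI:1917115", "MGI:97874"],
    "gene_set": ["MP:0000352_decreased_cell_proliferation", "MP:0000160_kyphosis"],
})

# An outer join keeps unmatched rows; "_merge" records where each row came from.
audit = pd.merge(human_toy, sets_toy, on="mgi", how="outer", indicator=True)
counts = audit["_merge"].value_counts().to_dict()
print(counts)
```

Rows marked `both` are the ones an inner join would keep. `left_only` rows are human genes without a gene set, and `right_only` rows are mouse markers with no entry in the human table.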


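Finally, we load the Entrez-to-Ensembl mapping file so that each human gene can be reported with its Ensembl gene ID.

In [12]: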
mapping_df = pd.read_csv("../data/entrez_ensembl_mapping_v26_16Jun_2022.txt", sep="\t")
mapping_df.head()

Unnamed: 0,EntrezID,ensembl_gene_id,gene_symbol
0,1,ENSG00000121410,A1BG
1,2,ENSG00000175899,A2M
2,3,ENSG00000256069,A2MP1
3,9,ENSG00000171428,NAT1
4,10,ENSG00000156006,NAT2


In [13]:
final_df = pd.merge(human_mouse, mapping_df, 
                    on=["EntrezID", "gene_symbol"])
final_df = final_df.sort_values(["gene_set"])
final_df = final_df[["database", "gene_set", "ensembl_gene_id","EntrezID"]]
final_df.to_csv("../data/gene_sets_June2022/MGI_genesets.txt",
                sep="\t", header=True, index=False
               )
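
A downstream user will typically want the long table as a mapping from gene set to member genes. The sketch below assumes the four-column layout written above and uses a small in-memory sample instead of the real file. The gene-to-set assignments in the sample are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for MGI_genesets.txt (rows are illustrative only).
toy_tsv = (
    "database\tgene_set\tensembl_gene_id\tEntrezID\n"
    "MGI\tMP:0000160_kyphosis\tENSG00000121410\t1\n"
    "MGI\tMP:0000160_kyphosis\tENSG00000175899\t2\n"
    "MGI\tMP:0000352_decreased_cell_proliferation\tENSG00000175899\t2\n"
)
gene_sets_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# Collect the Ensembl IDs of each gene set into a list.
gene_sets = (
    gene_sets_df.groupby("gene_set")["ensembl_gene_id"].agg(list).to_dict()
)
print(gene_sets)
```

To run this on the real output, replace `io.StringIO(toy_tsv)` with the path to `MGI_genesets.txt`.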