# Answers to Questions

### Luxuan Wang's Answer

1. What was your biggest challenge in this project?
* The biggest challenge was handling the nested loops that compute the overlapping genes between every pair of pathways. Because the dataset is large, each run took a long time, so validating the code and confirming its correctness was slow. This made debugging difficult: even a small change to the code meant a long wait to verify the results.
2. What did you learn while working on this project?
* I learned how to manipulate datasets with pandas, especially splitting and mapping data across multiple columns. The nested loop I wrote to identify overlapping genes also showed me how much code efficiency matters when working with large datasets: efficient, scalable code directly reduces running time in data-intensive projects.
3. If you had more time on the project, what other question(s) would you like to answer?
* I would like to answer the question: How do overlapping genes between biological pathways correlate with their functional relationships and disease associations?


In [1]:
import pandas as pd

# Preparation

In [2]:
# Load the three tab-separated files; none of them has a header row.
df_pathway = pd.read_csv("pathway.txt", sep="\t", header=None)
df_pathway.columns = ["PATHWAY_ID", "PATHWAY_NAME"]
df_gene = pd.read_csv("gene.txt", sep="\t", header=None)
df_gene.columns = ["GENE_ID", "TYPE", "TYPE_DESCRIPTION", "GENE_INFO"]
# Link table: one row per (gene, pathway) membership.
df_gene_pathway = pd.read_csv("gen_pathway.txt", sep="\t", header=None)
df_gene_pathway.columns = ["GENE_ID", "PATHWAY_ID"]

In [3]:
df_gene_filter=df_gene.drop(columns=["TYPE","TYPE_DESCRIPTION"])

In [4]:
df_gene_filter_split=df_gene_filter["GENE_INFO"].str.split(';',expand=True)
df_gene_filter_split_new = pd.concat([df_gene_filter.drop(columns=["GENE_INFO"]), df_gene_filter_split], axis=1)
df_gene_filter_split_new.columns=["GENE_ID","GENE_SYMBOL","GENE_NAME"]
df_gene_filter_split_new

| | GENE_ID | GENE_SYMBOL | GENE_NAME |
|---|---|---|---|
| 0 | hsa:102466751 | MIR6859-1, hsa-mir-6859-1 | microRNA 6859-1 |
| 1 | hsa:100302278 | MIR1302-2, MIRN1302-2, hsa-mir-1302-2 | microRNA 1302-2 |
| 2 | hsa:79501 | OR4F5 | olfactory receptor family 4 subfamily F member 5 |
| 3 | hsa:102465909 | MIR6859-2, hsa-mir-6859-2 | microRNA 6859-2 |
| 4 | hsa:112268260 | uncharacterized LOC112268260 | |
| ... | ... | ... | ... |
| 24678 | hsa:124909318 | testis-specific Y-encoded protein 3-like | |
| 24679 | hsa:124909320 | testis-specific Y-encoded protein 3-like | |
| 24680 | hsa:124909329 | testis-specific Y-encoded protein 4-like | |
| 24681 | hsa:124909330 | testis-specific Y-encoded protein 1-like | |
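The split on `;` above relies on every `GENE_INFO` value containing at most one semicolon; otherwise `expand=True` would return more than two columns and the column assignment would fail. A minimal sketch of a more defensive split, using hypothetical toy values (`toy_gene` and its contents are made up for illustration and are not taken from `gene.txt`):

```python
import pandas as pd

# Toy GENE_INFO values (hypothetical); the last one contains a second ';'.
toy_gene = pd.DataFrame({
    "GENE_ID": ["hsa:1", "hsa:2", "hsa:3"],
    "GENE_INFO": [
        "SYM1, ALIAS1; example gene one",
        "no symbol for this gene",
        "SYM3; example gene three; extra note",
    ],
})

# n=1 splits on the first ';' only, so at most two columns come back.
parts = toy_gene["GENE_INFO"].str.split(";", n=1, expand=True)
parts.columns = ["GENE_SYMBOL", "GENE_NAME"]
# Remove the space that follows the ';' separator.
parts["GENE_NAME"] = parts["GENE_NAME"].str.strip()

toy_split = pd.concat([toy_gene[["GENE_ID"]], parts], axis=1)
print(toy_split)
```

Rows without a `;` keep their whole text in `GENE_SYMBOL` and get a missing `GENE_NAME`, which matches the pattern visible in the output above (for example `hsa:112268260`).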


# Merge

In [5]:
merge_pathway=df_gene_pathway.merge(df_pathway,how="left", on="PATHWAY_ID")
merge_pathway_gene=merge_pathway.merge(df_gene_filter_split_new, how="left", on="GENE_ID")
merge_pathway_gene

| | GENE_ID | PATHWAY_ID | PATHWAY_NAME | GENE_SYMBOL | GENE_NAME |
|---|---|---|---|---|---|
| 0 | hsa:10327 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | AKR1A1, ALDR1, ALR, ARM, DD3, HEL-S-6 | aldo-keto reductase family 1 member A1 |
| 1 | hsa:124 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | ADH1A, ADH1 | alcohol dehydrogenase 1A (class I), alpha pol... |
| 2 | hsa:125 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | ADH1B, ADH2, HEL-S-117 | alcohol dehydrogenase 1B (class I), beta poly... |
| 3 | hsa:126 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | ADH1C, ADH3 | alcohol dehydrogenase 1C (class I), gamma pol... |
| 4 | hsa:127 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | ADH4, ADH-2, HEL-S-4 | alcohol dehydrogenase 4 (class II), pi polype... |
| ... | ... | ... | ... | ... | ... |
| 37456 | hsa:91860 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | CALML4, NY-BR-20 | calmodulin like 4 |
| 37457 | hsa:92 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | ACVR2A, ACTRII, ACVR2 | activin A receptor type 2A |
| 37458 | hsa:93 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | ACVR2B, ACTRIIB, ActR-IIB, HTX4 | activin A receptor type 2B |
| 37459 | hsa:9446 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | GSTO1, GSTO_1-1, GSTTLp28, HEL-S-21, P28, SPG-R | glutathione S-transferase omega 1 |
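Because both merges are left joins, a `GENE_ID` that is missing from the gene table would leave `GENE_SYMBOL` empty, and any later `.split` on that value would raise an error. One way to check for this is sketched below on hypothetical toy frames (`toy_links` and `toy_genes` are made-up stand-ins for the link table and the gene table), using the `validate` and `indicator` options of `DataFrame.merge`:

```python
import pandas as pd

# Hypothetical stand-ins for the link table and the gene table.
toy_links = pd.DataFrame({
    "GENE_ID": ["hsa:1", "hsa:2", "hsa:9"],
    "PATHWAY_ID": ["path:A", "path:A", "path:B"],
})
toy_genes = pd.DataFrame({
    "GENE_ID": ["hsa:1", "hsa:2"],
    "GENE_SYMBOL": ["SYM1", "SYM2"],
})

# validate raises if GENE_ID is duplicated on the right side;
# indicator adds a _merge column that says where each row came from.
checked = toy_links.merge(
    toy_genes, how="left", on="GENE_ID",
    validate="many_to_one", indicator=True,
)

# Rows that found no partner in the gene table.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["GENE_ID", "PATHWAY_ID"]])
```

If `unmatched` is empty on the real data, every gene in the link table has annotation and the later symbol-based steps can assume non-missing strings.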


# Overlapping

In [6]:
# Compare every pair of rows in merge_pathway_gene and record shared gene symbols.
# Each row is one gene-pathway record, so this is a row-by-row comparison and the
# same pair of pathways can appear in many output rows.
n_rows = merge_pathway_gene.shape[0]
overlap_all = []
for i in range(n_rows - 1):
    # Row i is fixed for the whole inner loop, so look it up once.
    PATHWAY_ID1 = merge_pathway_gene.loc[i, "PATHWAY_ID"]
    PATHWAY_NAME1 = merge_pathway_gene.loc[i, "PATHWAY_NAME"]
    symbols1 = set(merge_pathway_gene.loc[i, "GENE_SYMBOL"].split(', '))
    for x in range(i + 1, n_rows):
        PATHWAY_ID2 = merge_pathway_gene.loc[x, "PATHWAY_ID"]
        PATHWAY_NAME2 = merge_pathway_gene.loc[x, "PATHWAY_NAME"]
        symbols2 = set(merge_pathway_gene.loc[x, "GENE_SYMBOL"].split(', '))
        overlap_list = list(symbols1 & symbols2)
        if overlap_list:
            overlap_list_str = '; '.join(overlap_list)
            overlap_all.append([PATHWAY_ID1, PATHWAY_NAME1, PATHWAY_ID2, PATHWAY_NAME2, len(overlap_list), overlap_list_str])
df_overlap_all = pd.DataFrame(overlap_all)
df_overlap_all.columns = ["PATHWAY_ID1", "PATHWAY_NAME1", "PATHWAY_ID2", "PATHWAY_NAME2", "NUMBER_OF_OVERLAPPING_GENES", "LIST_OF_OVERLAPPING_GENES"]
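The loop above walks over pairs of rows, so its cost grows with the square of the number of gene-pathway records. A possible alternative, sketched here as an assumption rather than as the method used in this notebook, is to collect the genes of each pathway into a set first and then intersect the sets once per pathway pair. The toy frame `toy` and its IDs are hypothetical, and the sketch matches genes by `GENE_ID` instead of by symbol or alias, which is a design choice that would change the counts:

```python
from itertools import combinations

import pandas as pd

# Hypothetical gene-pathway memberships (not taken from the real files).
toy = pd.DataFrame({
    "GENE_ID": ["hsa:1", "hsa:2", "hsa:3", "hsa:2", "hsa:3", "hsa:4", "hsa:5"],
    "PATHWAY_ID": ["path:A", "path:A", "path:A",
                   "path:B", "path:B", "path:B", "path:C"],
})

# One set of gene IDs per pathway.
genes_by_pathway = toy.groupby("PATHWAY_ID")["GENE_ID"].apply(set)

rows = []
# combinations() yields each unordered pathway pair exactly once.
for (p1, g1), (p2, g2) in combinations(genes_by_pathway.items(), 2):
    shared = sorted(g1 & g2)
    if shared:
        rows.append([p1, p2, len(shared), "; ".join(shared)])

toy_overlap = pd.DataFrame(
    rows,
    columns=["PATHWAY_ID1", "PATHWAY_ID2",
             "NUMBER_OF_OVERLAPPING_GENES", "LIST_OF_OVERLAPPING_GENES"],
)
print(toy_overlap)
```

With this layout each pathway pair yields at most one output row, and the number of set intersections depends on the number of pathways rather than on the number of rows.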


In [8]:
df_overlap_all

| | PATHWAY_ID1 | PATHWAY_NAME1 | PATHWAY_ID2 | PATHWAY_NAME2 | NUMBER_OF_OVERLAPPING_GENES | LIST_OF_OVERLAPPING_GENES |
|---|---|---|---|---|---|---|
| 0 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | path:hsa00040 | Pentose and glucuronate interconversions - Hom... | 6 | DD3; ALDR1; ALR; AKR1A1; ARM; HEL-S-6 |
| 1 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | path:hsa00040 | Pentose and glucuronate interconversions - Hom... | 1 | ALDR1 |
| 2 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | path:hsa00051 | Fructose and mannose metabolism - Homo sapiens... | 1 | ALDR1 |
| 3 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | path:hsa00052 | Galactose metabolism - Homo sapiens (human) | 1 | ALDR1 |
| 4 | path:hsa00010 | Glycolysis / Gluconeogenesis - Homo sapiens (h... | path:hsa00053 | Ascorbate and aldarate metabolism - Homo sapie... | 6 | DD3; ALDR1; ALR; AKR1A1; ARM; HEL-S-6 |
| ... | ... | ... | ... | ... | ... | ... |
| 526876 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | 6 | caM; CAM3; CAMC; PHKD; CALML2; CAMIII |
| 526877 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | 3 | CAM2; PHKD; CAMB |
| 526878 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | 3 | PHKD; CALM; CAM1 |
| 526879 | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | path:hsa05418 | Fluid shear stress and atherosclerosis - Homo ... | 1 | CAV |


# Save the results

In [9]:
# Drop rows where both sides are the same pathway, then sort by overlap size.
c1 = df_overlap_all["PATHWAY_ID1"] != df_overlap_all["PATHWAY_ID2"]
df_overlap_all_final = df_overlap_all[c1].sort_values(by="NUMBER_OF_OVERLAPPING_GENES", ascending=False)
df_overlap_all_final.to_csv("KEGG_crosstalk.csv", index=False)