# Answers for Questions

### Luxuan Wang's Answer

1. What was your biggest challenge in this project?
* The biggest challenge in this project was handling the nested loops that compute the overlapping genes between every pair of pathways. Because the dataset is large, validating the code and confirming its correctness took a significant amount of time. This made debugging difficult, since even a small change in the code took a long time to verify.
2. What did you learn while working on this project?
* While working on this project, I learned how to manipulate datasets using pandas, especially splitting and mapping data across multiple columns. The nested loop I wrote to identify overlapping genes also made me realize the importance of code efficiency when working with large datasets. Writing efficient, scalable code matters in data-intensive projects because it directly reduces running time.
3. If you had more time on the project, what other question(s) would you like to answer?
* I would like to answer the question: How do overlapping genes between biological pathways correlate with their functional relationships and disease associations?


### Ileanexis Madera Cuevas's Answer

1. What was your biggest challenge in this project?
* The biggest challenge in this project was handling the dataset for overlapping genes between all pairs of pathways because of the high computational cost associated with comparing every possible pair of rows. 
2. What did you learn while working on this project?
* I learned the importance of computational methods for managing large datasets efficiently. These methods take time and more complex code to develop, but they make analyzing large datasets faster. I also learned the importance of preprocessing data to enable faster computations.
3. If you had more time on the project, what other question(s) would you like to answer?
* What biological functions or pathways are most commonly shared between different pathways? Answering this could identify critical genes that might be involved in multiple biological processes.


### Samantha Wheeler's Answer

1. What was your biggest challenge in this project?
My biggest challenge on this project was figuring out how to subset datasets with pandas, as well as how to get my data into a form that could be graphed. 

2. What did you learn while working on this project?
I learned how to make Venn diagrams with matplotlib and, in a broader sense, how to take code that I was given by other people and interpret it so that I could work on it. I also learned how to use Git and GitHub while working with a group of people rather than just using them to keep my own files under version control.

3. If you had more time on the project, what other question(s) would you like to answer?
If I had more time on this project, I would like to look deeper into the list of top genes with most pathway overlap and see what conclusions can be drawn from which genes are overlapping with which other genes.

### Jacob Horn's Answer

1. What was your biggest challenge in this project?
The biggest challenge in this project was reformatting the data in order to answer the question that was posed. I went through several iterations of reformatting the data before it was compatible with what I wanted to do with it.
2. What did you learn while working on this project?
I was surprised by the amount of overlap between the different pathways in the KEGG database. Many genes appeared on 100+ different pathways.
3. If you had more time on the project, what other question(s) would you like to answer?
If I had more time, I would like to make a web of all the genes and pathways to visualize the connections between genes on the KEGG database. It would be interesting to visualize this as a way of displaying the complex nature of gene interactions and pathways. 


In [None]:
import pandas as pd 
import csv
from collections import Counter
import requests
from matplotlib_venn import venn3
from matplotlib_venn import venn2

# Preparation

In [None]:
df_pathway=pd.read_csv("pathway.txt", sep="\t", header=None)
df_pathway.columns=["PATHWAY_ID","PATHWAY_NAME"]
df_gene=pd.read_csv("gene.txt", sep="\t", header=None)
df_gene.columns=["GENE_ID", "TYPE", "TYPE_DESCRIPTION" ,"GENE_INFO"]
df_gene_pathway=pd.read_csv("gen_pathway.txt", sep="\t", header=None)
df_gene_pathway.columns=["GENE_ID" , "PATHWAY_ID"]

In [None]:
df_gene_filter=df_gene.drop(columns=["TYPE","TYPE_DESCRIPTION"])

In [None]:
df_gene_filter_split=df_gene_filter["GENE_INFO"].str.split(';',expand=True)
df_gene_filter_split_new = pd.concat([df_gene_filter.drop(columns=["GENE_INFO"]), df_gene_filter_split], axis=1)
df_gene_filter_split_new.columns=["GENE_ID","GENE_SYMBOL","GENE_NAME"]
df_gene_filter_split_new

# Merge

In [None]:
merge_pathway=df_gene_pathway.merge(df_pathway,how="left", on="PATHWAY_ID")
merge_pathway_gene=merge_pathway.merge(df_gene_filter_split_new, how="left", on="GENE_ID")
merge_pathway_gene
merge_pathway_gene.to_csv('gene_pathway_gene_symbols.csv', index=False)

# Overlapping

In [None]:
overlap_all=list()
for i in range(merge_pathway_gene.shape[0]-1):
    for x in range(i+1,merge_pathway_gene.shape[0]):
        PATHWAY_ID1=merge_pathway_gene.loc[i,"PATHWAY_ID"]
        PATHWAY_NAME1=merge_pathway_gene.loc[i,"PATHWAY_NAME"]
        PATHWAY_ID2=merge_pathway_gene.loc[x,"PATHWAY_ID"]
        PATHWAY_NAME2=merge_pathway_gene.loc[x,"PATHWAY_NAME"]
        overlap_list=list(set(merge_pathway_gene.loc[i,"GENE_SYMBOL"].split(', ')) & set(merge_pathway_gene.loc[x, "GENE_SYMBOL"].split(', ')))
        if overlap_list:
            overlap_list_str='; '.join(overlap_list)
            overlap_all.append([ PATHWAY_ID1,PATHWAY_NAME1, PATHWAY_ID2,PATHWAY_NAME2,len(overlap_list),overlap_list_str])
df_overlap_all=pd.DataFrame(overlap_all)
df_overlap_all.columns=["PATHWAY_ID1", "PATHWAY_NAME1", "PATHWAY_ID2", "PATHWAY_NAME2", "NUMBER_OF_OVERLAPPING_GENES", "LIST_OF_OVERLAPPING_GENES"]


In [None]:
df_overlap_all
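
The pairwise comparison above can also be expressed with one gene set per pathway. The following is a minimal sketch of that idea, not the method used in this notebook. It assumes the input has one entry per gene-pathway pair; `pathway_overlaps` and the toy IDs are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def pathway_overlaps(gene_pathway_pairs):
    """Return (pathway_a, pathway_b, shared_genes) for every pathway pair sharing a gene."""
    # Collect the set of genes that belongs to each pathway.
    genes_by_pathway = defaultdict(set)
    for gene, pathway in gene_pathway_pairs:
        genes_by_pathway[pathway].add(gene)

    overlaps = []
    # combinations() yields each unordered pair once, so no pathway is paired with itself.
    for path_a, path_b in combinations(sorted(genes_by_pathway), 2):
        shared = genes_by_pathway[path_a] & genes_by_pathway[path_b]
        if shared:
            overlaps.append((path_a, path_b, sorted(shared)))
    # Largest overlaps first, mirroring the descending sort used for KEGG_crosstalk.csv.
    overlaps.sort(key=lambda item: len(item[2]), reverse=True)
    return overlaps


# Toy input with made-up IDs. With the notebook's data, the pairs could come from
# zip(merge_pathway_gene["GENE_ID"], merge_pathway_gene["PATHWAY_ID"]) (an assumption).
toy_pairs = [
    ("geneA", "path1"), ("geneB", "path1"),
    ("geneB", "path2"), ("geneC", "path2"),
    ("geneA", "path3"), ("geneB", "path3"),
]
toy_overlaps = pathway_overlaps(toy_pairs)
```

Grouping first means the loop runs over pairs of pathways rather than pairs of rows, which may reduce the running time noted in the answers above.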

# Save the results

In [None]:
c1=df_overlap_all["PATHWAY_ID1"] != df_overlap_all["PATHWAY_ID2"]
df_overlap_all_final=df_overlap_all[c1].sort_values(by="NUMBER_OF_OVERLAPPING_GENES", ascending=False)
df_overlap_all_final.to_csv("KEGG_crosstalk.csv", index=False)

In [None]:
gene_counter = Counter()
print(gene_counter)

In [None]:
gene_counter = Counter()
filename = 'gene_pathway_gene_symbols.csv'

# Read each row of the csv and take the first column (GENE_ID). Every row is one
# gene-pathway pair, so counting gene IDs gives the number of pathways each gene appears on.
with open(filename, 'r') as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    for row in reader:
        items = row[0].split()
        gene_counter.update(items)

print(f"Counted {len(gene_counter)} distinct gene IDs")
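
The same per-gene count could likely be obtained directly in pandas. This is a hedged sketch on a toy table, assuming the CSV has `GENE_ID` and `PATHWAY_ID` columns as written earlier in the notebook; the rows and the names `toy_df`, `pathways_per_gene`, and `top_three_toy` are made up for illustration.

```python
import io

import pandas as pd

# Toy stand-in for gene_pathway_gene_symbols.csv (made-up rows, same column names).
toy_csv = io.StringIO(
    "GENE_ID,PATHWAY_ID,PATHWAY_NAME\n"
    "hsa:1,path:A,Pathway A\n"
    "hsa:1,path:B,Pathway B\n"
    "hsa:1,path:C,Pathway C\n"
    "hsa:2,path:A,Pathway A\n"
    "hsa:2,path:B,Pathway B\n"
    "hsa:3,path:C,Pathway C\n"
)
toy_df = pd.read_csv(toy_csv)

# Count distinct pathways per gene, then sort so the most widely shared genes come first.
pathways_per_gene = (
    toy_df.groupby("GENE_ID")["PATHWAY_ID"].nunique().sort_values(ascending=False)
)

# The index of the sorted Series holds the gene IDs in rank order.
top_three_toy = pathways_per_gene.head(3).index.tolist()
```

Using `nunique` rather than a plain row count guards against duplicate gene-pathway rows, if any exist in the input.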

In [None]:
sorted_gene_counter = Counter(dict(gene_counter.most_common()))

In [None]:
# Save genes to a txt file.

with open('Gene_Counts.txt', 'w') as file:
    for gene, count in sorted_gene_counter.items():
        file.write(f"{gene}: {count}\n")

In [None]:
# read in dataframe of gene pathways and gene symbols
gene_df = pd.read_csv('gene_pathway_gene_symbols.csv')
gene_df.head()

In [None]:
sorted_gene_counter2 = dict(gene_counter.most_common())
top_three = list(sorted_gene_counter2.keys())[:3]

In [None]:
# make a dataframe of all rows that contain one of the top three 
top3_df = gene_df[gene_df['GENE_ID'].isin(top_three)]
top3_df.head()

### 5. Retrieve a set of the pathways the top 3 genes appear on.

In [None]:
# create list of all pathways that the top 3 genes appear on 
set(top3_df['PATHWAY_NAME'].unique().tolist())

### 6. Compute and display a Venn diagram for number of overlapping pathways for the top 3 genes.

In [None]:
# get top three genes from the list

sorted_gene_counter2 = dict(gene_counter.most_common())
top_three = list(sorted_gene_counter2.keys())[:3]
top_three

In [None]:
# function that returns a set of pathways associated with a particular gene_id
def pathway_setter(df, gene_id):
    # the parameter is named gene_id to avoid shadowing the built-in id()
    paths = set(df.loc[df['GENE_ID'] == gene_id, 'PATHWAY_NAME'])
    return paths

In [None]:
# get the sets 
hsa_5595 = pathway_setter(gene_df, 'hsa:5595')

In [None]:
hsa_5594 = pathway_setter(gene_df, 'hsa:5594')

In [None]:
hsa_5290 = pathway_setter(gene_df, 'hsa:5290')

In [None]:
# build the sets from top_three so that each circle matches its label
venn3([pathway_setter(gene_df, gene) for gene in top_three], set_labels = top_three)

# Features

### Samantha Wheeler

### Small GTPase pathway interactions in cancer
[Small GTPases](https://en.wikipedia.org/wiki/Small_GTPase) are vital cellular signaling molecules. There are over 100 small GTPases across five subfamilies. The [Ras](https://en.wikipedia.org/wiki/Ras_GTPase) family is the best known and best-studied in the context of cancer proliferation; however, dysregulation of other small GTPases is known to be involved in carcinogenesis, as well as other diseases. Small GTPase signaling pathways are highly interconnected and understanding the crosstalk between them is a highly biologically relevant question. This feature allows the comparison and visualization of KEGG pathways between some of the most commonly studied small GTPases. 

In [None]:
# dictionary of most common small GTPases associated with disease pathogenesis in general and cancer in particular
# for further reading see https://www.mdpi.com/2072-6694/13/7/1500
gtpase_dict = {'kras': 'hsa:4893',
               'rhoa': 'hsa:387',
               'arf1': 'hsa:375',
               'rab1a': 'hsa:5861',
               'ran': 'hsa:5901'
              }

In [None]:
# read in dataframe of gene symbols with pathways
gene_df = pd.read_csv("gene_pathway_gene_symbols.csv")

### Part one: create dictionary of small GTPases and their associated kegg pathways.
This function takes the small GTPase dictionary and the dataframe read from the gene_pathway_gene_symbols csv and uses them to create a dictionary of pathways. The function is generalizable and could also be used to generate dictionaries of other disease genes and their associated KEGG pathways.

In [None]:
# function that returns a dictionary mapping each key in fdict to the set of
# pathways associated with its gene ID
def pathway_setter_deluxe(fdict, df):
    pathway_dict = {}
    for name, gene_id in fdict.items():
        # use the df argument rather than the global gene_df
        pathway_dict[name] = set(df.loc[df['GENE_ID'] == gene_id, 'PATHWAY_NAME'])
    return pathway_dict


In [None]:
gtpase_pathways = pathway_setter_deluxe(gtpase_dict, gene_df)
gtpase_pathways

### Comparison and visualization of KEGG pathways
This function compares the KEGG pathways of two proteins, draws a Venn diagram of how many pathways they have in common, and returns the set of shared pathways.

In [None]:
def pathway_intersections(fdict, prot1, prot2):
    protein1 = fdict[prot1]
    protein2 = fdict[prot2]
    intersect = protein1.intersection(protein2)
    labels = [prot1, prot2]
    venn2([protein1, protein2], set_labels = labels)
    return intersect

In [None]:
# some examples of the function in action
pathway_intersections(gtpase_pathways, 'rhoa', 'kras')

In [None]:
pathway_intersections(gtpase_pathways, 'ran', 'rhoa')

In [None]:
pathway_intersections(gtpase_pathways, 'rab1a', 'arf1')

In [None]:
pathway_intersections(gtpase_pathways, 'kras', 'rab1a')
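
The examples above compare one pair at a time. As a rough sketch, the shared-pathway counts for every pair could be tabulated in one pass. `shared_pathway_counts` is a hypothetical helper, not part of the notebook, and the pathway names below are made up.

```python
from itertools import combinations


def shared_pathway_counts(pathway_dict):
    """Map each unordered pair of keys to the number of pathways the two sets share."""
    return {
        (a, b): len(pathway_dict[a] & pathway_dict[b])
        for a, b in combinations(sorted(pathway_dict), 2)
    }


# Toy input with made-up pathway names; in the notebook this would be gtpase_pathways.
toy_pathways = {
    "kras": {"Pathway A", "Pathway B", "Pathway C"},
    "rhoa": {"Pathway B", "Pathway C"},
    "ran": {"Pathway D"},
}
toy_shared = shared_pathway_counts(toy_pathways)
```

Sorting the keys before pairing keeps the tuple order stable, so a given pair always appears under the same key.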

### Jacob Horn

Does the number of pathways per gene taper off dramatically among the top genes? For example, do the first ten genes appear on many pathways (100+) while the rest appear on only a few (<10)? Or is there a gradual taper that starts in the 100+ range and slowly declines to genes that appear on only a few pathways? I will make several plots to visualize how the pathway count decreases from one gene to the next. First I will look at the top 50 genes:

In [None]:

# initialize
import pandas as pd
import matplotlib.pyplot as plt

# reading in data
# adjust separator and parse logic
data = pd.read_csv('correct_Gene_Counts.txt', sep=':', header=None, names=['Gene', 'Dummy', 'Count'])

# combine first two parts gene name, make 'Dummy' string
data['Gene'] = data['Gene'] + ':' + data['Dummy'].astype(str)
data = data[['Gene', 'Count']]

# make 'Count' numeric
data['Count'] = pd.to_numeric(data['Count'], errors='coerce')

# sorting counts, descending order
data_sorted = data.sort_values(by='Count', ascending=False)

# selecting top 50 genes
top_50 = data_sorted.head(50)

# plot bar graph
plt.figure(figsize=(15, 8))
plt.bar(top_50['Gene'], top_50['Count'], color='skyblue')
plt.xticks(rotation=90, fontsize=8)
plt.xlabel('Gene')
plt.ylabel('Count of Pathways Appeared On')
plt.title('Top 50 Genes by Count')
plt.tight_layout()

# show plot
plt.show()


This appears to be a very gradual decline in the number of pathways each gene appears on. Let's examine what this looks like for the top 200 genes.

In [None]:
# initialize
import pandas as pd
import matplotlib.pyplot as plt

# reading in data
# adjust separator and parse logic
data = pd.read_csv('correct_Gene_Counts.txt', sep=':', header=None, names=['Gene', 'Dummy', 'Count'])

# combine first two parts of the gene name, make 'Dummy' string
data['Gene'] = data['Gene'] + ':' + data['Dummy'].astype(str)
data = data[['Gene', 'Count']]

# make 'Count' numeric
data['Count'] = pd.to_numeric(data['Count'], errors='coerce')

# sorting counts in descending order
data_sorted = data.sort_values(by='Count', ascending=False)

# selecting top 200 genes, instead of 50 in this case.
top_200 = data_sorted.head(200)

# plot bar graph
plt.figure(figsize=(20, 10))
plt.bar(top_200['Gene'], top_200['Count'], color='skyblue')
plt.xticks(rotation=90, fontsize=6)  # Adjust font size to fit more labels
plt.xlabel('Gene')
plt.ylabel('Count of Pathways Appeared On')
plt.title('Top 200 Genes by Count')
plt.tight_layout()

# show plot
plt.show()


Again, this appears to be a very slow taper. 

### Final Thoughts:

The number of pathways per gene decreases slowly among the top genes. Even genes at rank ~200 are on 20+ pathways. There is no sharp dropoff; instead, the number of pathways the genes appear on decreases gradually.
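
One way to put a number on the taper, alongside the plots, is to read off the count at a few ranks. This is a minimal sketch with made-up counts; `counts_at_ranks` is a hypothetical helper, and real values would come from the `Count` column loaded above.

```python
def counts_at_ranks(counts, ranks):
    """Return {rank: count} for the requested 1-based ranks, largest count first."""
    ordered = sorted(counts, reverse=True)
    # Ranks beyond the end of the list are skipped rather than raising an error.
    return {rank: ordered[rank - 1] for rank in ranks if rank <= len(ordered)}


# Made-up counts for illustration only.
toy_counts = [120, 110, 100, 95, 90, 60, 40, 30, 25, 20]
snapshot = counts_at_ranks(toy_counts, ranks=[1, 5, 10, 50])

# Ratio of the rank-10 count to the rank-1 count: values near 1 indicate a slow taper.
taper_ratio = snapshot[10] / snapshot[1]
```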



### Luxuan Wang

My feature addresses the following question:  
What is the distribution of gene counts across KEGG pathways?  
Analyzing the number of genes associated with each pathway helps us understand the overall structure and complexity of the KEGG pathways and identify the core pathways that may play important roles in biological systems.

In [None]:
import matplotlib.pyplot as plt

### Count the number of genes associated with each pathway

By defining the count_genes_per_pathway function, we can group genes associated with the same pathway and calculate the number of genes within each pathway.

In [None]:
gene_pathway_df = pd.read_csv("./gene_pathway_gene_symbols.csv")

In [None]:
def count_genes_per_pathway(gene_pathway_df):
    pathway_counts = gene_pathway_df.groupby('PATHWAY_NAME')['GENE_ID'].count()
    pathway_counts = pathway_counts.sort_values(ascending=False).reset_index()
    pathway_counts.columns = ['Pathway', 'Gene_Count']
    return pathway_counts

gene_counts_df = count_genes_per_pathway(gene_pathway_df)

In [None]:
gene_counts_df

### Plot the distribution of gene counts across pathways

Here, we visualize the distribution of gene counts across pathways by generating a histogram.

In [None]:
def plot_gene_count_distribution(gene_counts_df):
    plt.figure(figsize=(10, 8))
    plt.hist(gene_counts_df['Gene_Count'], bins=40, color='green',edgecolor='black')
    plt.xlabel('Number of Genes per Pathway')
    plt.ylabel('Frequency')
    plt.title('Distribution of Gene Counts Across KEGG Pathways')
    plt.xticks(range(0, int(plt.gca().get_xlim()[1]) + 1, 100))
    plt.yticks(range(0, int(plt.gca().get_ylim()[1]) + 1, 10))
    plt.show()

# Plot the distribution of gene counts
plot_gene_count_distribution(gene_counts_df)

### Analyze the distribution pattern of gene counts across pathways

Finally, we provide a numerical analysis of the distribution, calculating the maximum, minimum, median, and mean values to quantitatively characterize the pattern of gene counts across pathways.

In [None]:
def analyze_distribution(gene_counts_df):
    mean_count = gene_counts_df['Gene_Count'].mean()
    median_count = gene_counts_df['Gene_Count'].median()
    max_count = gene_counts_df['Gene_Count'].max()
    min_count = gene_counts_df['Gene_Count'].min()
    return {
        "Mean": mean_count,
        "Median": median_count,
        "Max": max_count,
        "Min": min_count
    }

distribution_summary = analyze_distribution(gene_counts_df)
for key, value in distribution_summary.items():
    print(f"{key}: {value}")

### Results Interpretation

From the figure, we can observe a long-tail pattern in the distribution of gene counts across KEGG pathways. Most pathways have relatively few genes, with the median pathway containing 78 genes. However, a small subset of pathways, such as the largest one with 1563 genes, stands out as outliers. These pathways are likely involved in highly complex and integrative biological processes, such as major metabolic or signaling networks. As a next step, we should investigate these gene-rich pathways to gain insight into their biological roles and regulatory mechanisms.
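
A simple numerical check for a long tail is that the mean sits well above the median. The sketch below illustrates this with made-up values rather than the KEGG counts; the dictionary keys are illustrative names only.

```python
import pandas as pd

# Made-up gene counts per pathway, chosen only to show a right-skewed shape.
toy_gene_counts = pd.Series([10, 20, 30, 40, 1000], name="Gene_Count")

summary = {
    "mean": toy_gene_counts.mean(),
    "median": toy_gene_counts.median(),
    "p90": toy_gene_counts.quantile(0.9),  # 90th percentile
}

# In a long-tailed distribution a few large values pull the mean above the median.
is_right_skewed = summary["mean"] > summary["median"]
```

Applied to `gene_counts_df['Gene_Count']`, the same comparison would give a quick numeric complement to the histogram.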

### Ileanexis Madera Cuevas

These heatmaps visualize the number of overlapping genes between pairs of pathways (PATHWAY_ID1 versus PATHWAY_ID2), first for the top 500 rows of the crosstalk table and then for the full dataset. Darker colors indicate a higher number of overlapping genes. This visualization is important because it helps us identify pathways that share a significant number of genes, thereby highlighting potential biological interactions and functional relationships between these pathways.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Load the data
df_overlap_all = pd.read_csv("KEGG_crosstalk.csv")
print(df_overlap_all.columns)

In [None]:
# Select the top 500 rows
df_top_500 = df_overlap_all.head(500)
print(df_top_500.head())

heatmap_data = df_top_500.pivot_table(index='PATHWAY_ID1', columns='PATHWAY_ID2', values='NUMBER_OF_OVERLAPPING_GENES', fill_value=0)

plt.figure(figsize=(14, 10))
sns.heatmap(heatmap_data, cmap="YlGnBu")
plt.title('Heatmap of Overlapping Genes between Pathways (Top 500 Rows)')
plt.show()

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

heatmap_data = df_overlap_all.pivot_table(index='PATHWAY_ID1', columns='PATHWAY_ID2', values='NUMBER_OF_OVERLAPPING_GENES', fill_value=0)

plt.figure(figsize=(14, 10))
sns.heatmap(heatmap_data, cmap="YlGnBu")
plt.title('Heatmap of Overlapping Genes between Pathways (All Data)')
plt.show()
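
If each pathway pair is stored only once in the crosstalk table (an assumption about how `KEGG_crosstalk.csv` was built), the pivoted matrix is filled on one side only. The sketch below shows one way to mirror the pairs so the matrix becomes symmetric. It uses a toy table with made-up IDs, and `mirrored`, `both_directions`, and `symmetric_matrix` are illustrative names.

```python
import pandas as pd

# Toy crosstalk table with made-up IDs and the same column names as KEGG_crosstalk.csv.
toy_pairs = pd.DataFrame({
    "PATHWAY_ID1": ["p1", "p1", "p2"],
    "PATHWAY_ID2": ["p2", "p3", "p3"],
    "NUMBER_OF_OVERLAPPING_GENES": [5, 2, 7],
})

# Swap the two ID columns to get the reverse direction of every pair.
mirrored = toy_pairs.rename(
    columns={"PATHWAY_ID1": "PATHWAY_ID2", "PATHWAY_ID2": "PATHWAY_ID1"}
)

# concat aligns on column names, so the swapped columns land in the right place.
both_directions = pd.concat([toy_pairs, mirrored], ignore_index=True)

symmetric_matrix = both_directions.pivot_table(
    index="PATHWAY_ID1",
    columns="PATHWAY_ID2",
    values="NUMBER_OF_OVERLAPPING_GENES",
    fill_value=0,
)
```

Passing `symmetric_matrix` to `sns.heatmap` would then show each overlap on both sides of the diagonal.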