# Gene Expression Data Analysis

## Learning Objectives

- Use pandas to read and process gene expression data. 
- Generate fold change values and p values for gene expression data.
- Create a volcano plot to visualize gene expression changes.
- Identify a list of the top differentially expressed genes.

## Data Source

You are working with a real gene expression dataset from a study of e-cigarette vaping.
The data is sampled from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/) dataset [GSE176492](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE176492).

The data is stored in a TSV (tab-separated values) file called `ecig_vs_control.tsv`, which you should upload to your Jupyter Notebook environment.

I have preprocessed it from the data in GEO so that the first column is the gene name, and the subsequent columns are the expression values for each sample. 

Each sample is labeled with `_evaper` for e-cigarette vaping samples and `_control` for control samples.

## Getting Started

You will work with pandas, scipy, numpy, and matplotlib for this analysis.
You may use ChatGPT to help you with any of the code you need to write.

I will also provide hints throughout the notebook to help you write effective queries and accomplish the tasks. 


# Task 1: Load the Data

You need to load the gene expression data from the TSV file into a pandas DataFrame.

Do this with:

```python
import pandas as pd

# note the tab separator
df = pd.read_csv("ecig_vs_control.tsv", sep="\t")
```

Feel free to explore the data frame using `df.head()` and `df.info()` to understand its structure.
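
Because each sample's group is encoded in its column name, one way to check which samples belong to which group is to split the column names by suffix. The sketch below uses a small made-up DataFrame (the gene names, sample names, and values are placeholders, not taken from the dataset) and assumes the gene column is named `GENE`. With the real file, the same list comprehensions should work on `df.columns`.

```python
import pandas as pd

# Toy stand-in for the real table; all names and values here are made up
toy = pd.DataFrame({
    "GENE": ["geneA", "geneB"],
    "s1_control": [10.0, 5.0],
    "s2_control": [12.0, 6.0],
    "s3_evaper": [20.0, 5.5],
    "s4_evaper": [22.0, 4.5],
})

# Split the sample columns into the two groups using the name suffix
control_cols = [c for c in toy.columns if c.endswith("_control")]
evaper_cols = [c for c in toy.columns if c.endswith("_evaper")]

print("control samples:", control_cols)
print("vaping samples:", evaper_cols)
```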

# Task 2: Calculate Fold Change and P-values

This is a bit more complex. You want to make a new table that has, for each gene: 
- The fold change between the e-cigarette vaping samples and the control samples.
- The p-value for the difference in expression between the two groups.
- The log2 fold change for easier visualization.

In this case, I suggest you work with ChatGPT with a query like: 

"I have a pandas dataframe that has gene expression data. 
The columns are GENE, and then the expression levels for each sample in the two groups.
Each sample has a unique identifier followed by _control or _evaper.
Show me code to make a new dataframe with the mean expression level for each gene in both groups, the fold change and log2 fold change between the groups, and a t-test p-value for each gene.
Please use scipy for the t-test and numpy for the log2 fold change calculation.
When running the t-test, make sure that you coerce the data to float type to avoid errors, and use the values attribute to get the actual data from the DataFrame."
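
For comparison with whatever ChatGPT returns, here is a minimal sketch of the calculation described in the query, run on a small made-up table rather than the real file. The toy gene names, sample names, and values are placeholders. The names `df_means`, `log2_fc`, and `p_value` are chosen to match the code later in this notebook, and the other column names are assumptions. It uses scipy's default Student's t-test; whether that or Welch's test (`equal_var=False`) is more appropriate depends on your data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy stand-in for the real table; all names and values here are made up
toy = pd.DataFrame({
    "GENE": ["geneA", "geneB", "geneC"],
    "s1_control": [10.0, 5.0, 8.0],
    "s2_control": [12.0, 6.0, 8.5],
    "s3_control": [11.0, 5.5, 7.5],
    "s4_evaper": [20.0, 5.2, 4.0],
    "s5_evaper": [24.0, 5.8, 4.5],
    "s6_evaper": [22.0, 5.5, 3.5],
})

control_cols = [c for c in toy.columns if c.endswith("_control")]
evaper_cols = [c for c in toy.columns if c.endswith("_evaper")]

# Coerce to float and pull out plain NumPy arrays (rows = genes, columns = samples)
control = toy[control_cols].astype(float).values
evaper = toy[evaper_cols].astype(float).values

# One t-test per gene: axis=1 compares the two groups row by row
t_stat, p_value = stats.ttest_ind(evaper, control, axis=1)

df_means = pd.DataFrame({
    "GENE": toy["GENE"],
    "mean_control": control.mean(axis=1),
    "mean_evaper": evaper.mean(axis=1),
})
# Fold change is vaping relative to control, so values above 1 mean higher expression in vaping samples
df_means["fold_change"] = df_means["mean_evaper"] / df_means["mean_control"]
df_means["log2_fc"] = np.log2(df_means["fold_change"])
df_means["p_value"] = p_value

print(df_means)
```

In real data, a gene with a mean of zero in either group produces an infinite or undefined log2 fold change, so you may need to filter those genes out or handle them separately.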

# Task 3: Visualize with a Volcano Plot

To visualize the results, you can create a volcano plot. This plot will show the log2 fold change on the x-axis and the -log10 p-value on the y-axis.
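
The -log10 transform on the y-axis is what places the most significant genes at the top of the plot: the smaller the p-value, the larger its -log10 value. A quick sketch with a few illustrative p-values (chosen for this example, not taken from the dataset):

```python
import numpy as np

# Illustrative p-values, from least to most significant
p_values = np.array([0.05, 0.001, 1e-6])

# Smaller p-values map to larger y-axis values
neg_log10 = -np.log10(p_values)

for p, y in zip(p_values, neg_log10):
    print(f"p = {p:g}  ->  -log10(p) = {y:.2f}")
```

These come out to roughly 1.3, 3, and 6, so a p-value threshold of 0.001 corresponds to a horizontal line at y = 3.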

To focus on the most significant genes, the plot marks a p-value threshold with a horizontal line. The appropriate significance threshold depends on your specific analysis (0.05 is a common default), but for this exercise, we will use the stricter value of 0.001.

I will show you how to create a volcano plot using matplotlib, but you can also ask ChatGPT, or tinker with the code to customize it further.

```python
import matplotlib.pyplot as plt
import numpy as np

# Create a volcano plot
plt.figure(figsize=(10, 6))
# make our plot - adjust the variable names as needed
plt.scatter(df_means['log2_fc'], -np.log10(df_means['p_value']), alpha=0.5)
plt.title('Volcano Plot of Gene Expression Changes')
plt.xlabel('Log2 Fold Change')
plt.ylabel('-Log10 P-value')
# add a line for the p-value threshold
plt.axhline(y=-np.log10(0.001), color='red', linestyle='--', label='p=0.001 threshold')
plt.axvline(x=0, color='black', linestyle='--')
plt.legend()
# there will be outliers in the FC calculation, so limit the x-axis to log2 fold changes between -5 and 5
plt.xlim(-5, 5)
plt.show()
```

# Task 4: Find the top differentially expressed genes

Finally, we will identify the top differentially expressed genes based on the p-value and log2 fold change.

You can use the following code to filter the DataFrame and sort it by p-value and log2 fold change:

```python
# Filter for significant genes: p < 0.001 and at least a 2-fold change in either direction (|log2 FC| > 1)
significant_genes = df_means[(df_means['p_value'] < 0.001) & (df_means['log2_fc'].abs() > 1)]
# Sort by p-value and log2 fold change
top_genes = significant_genes.sort_values(by=['p_value', 'log2_fc'], ascending=[True, False])
# Display the top genes
top_genes
```

# Reflection

In this introduction to gene expression data, you analyzed a real dataset, calculated fold changes and p-values, visualized the results with a volcano plot, and identified the top differentially expressed genes.

## Please answer the following questions: 

1. Why did we use log2 fold change instead of raw fold change?

2. Pick one of the top differentially expressed genes and do a brief web search to find out what it does in the body. Do you think its change in expression is plausible in the context of e-cigarette vaping?

