### General instructions.

In this interactive tutorial, you can run each one of the cells by either clicking the ‘play’ button or by pressing ‘Shift + Enter’. You can make changes to the code as well.

# Graphical Forms of Data Charts: Dataset 1

## Filter and Fire dataset

Read and observe the Filter and Fire dataset in Python:

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Python needs to import certain packages that provide the functions used in this tutorial. 
We import each package under a short alias: for example, pandas is imported as 'pd', so wherever we call it we write 'pd' for convenience.

In [None]:
df = pd.read_csv('FilterandFireData.csv')
display(df)  # displays what the data looks like; 'print(df.head())' shows only the first few rows

Since the data file is in the same folder as this notebook, we can refer to it by its file name alone. 
If it were in another folder, the path to that folder would have to be added.
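
As a minimal sketch of reading a file from another folder, the example below builds a path with `pathlib` and passes it to `read_csv`. The folder name `data`, the file name `example.csv`, and the values in it are made up for illustration; a temporary directory stands in for a real project folder so that the cell runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical layout: the CSV sits in a subfolder called 'data'.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    csv_path = data_dir / "example.csv"

    # Write a tiny made-up table so there is something to read back.
    pd.DataFrame({"digit": [0, 1], "Accuracy.FF": [99.0, 98.5]}).to_csv(
        csv_path, index=False
    )

    # The folder is part of the path handed to read_csv.
    example = pd.read_csv(csv_path)

print(example)
```

With a real project, the same idea would look like `pd.read_csv(Path('some_folder') / 'FilterandFireData.csv')`, where `some_folder` is a placeholder for wherever the file actually lives.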

## Barplot
We can now start by observing the Baseline Accuracy displayed by neurons
for the detection of each handwritten digit by making a Bar plot.

In [None]:
mean_accuracies = df.groupby('digit')['Accuracy.FF'].mean()
plt.figure(figsize=(10, 6)) # for changing the figure size
mean_accuracies.plot(kind='bar', color= 'orange')
plt.title('Mean Accuracy for Each Digit (0-9)')
plt.xlabel('Digits')
plt.ylabel('Mean Accuracy.FF')
plt.ylim(85,100) 
plt.grid(axis='y')
plt.show()

The *groupby* function takes the given category and groups the data accordingly. Here, we first group the rows by digit, compute the mean accuracy for each digit, and then plot these means as a bar graph with a specific color.

As you can see, this is the simplest form of a plot. You compute the values for a desired category, save them in a variable, and plot that variable as a graph. 
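
To see what the grouping step does on its own, here is a small sketch with a made-up table (the values are invented and are not taken from the Filter and Fire data). It uses the same column names as above and prints one mean per digit.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "digit": [0, 0, 1, 1, 1],
    "Accuracy.FF": [98.0, 96.0, 90.0, 93.0, 96.0],
})

# Group the rows by digit, then average the accuracy within each group.
toy_means = toy.groupby("digit")["Accuracy.FF"].mean()
print(toy_means)
```

The result is a pandas Series indexed by digit, which is why it can be passed straight to `.plot(kind='bar')`.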

* Exercise: Change the color of the bar plot.

## Box plot

In [None]:
plt.figure(figsize=(10,6))
#df.boxplot(column='Accuracy.FF', by='digit', grid = False)
sns.boxplot(x='digit', y='Accuracy.FF', data=df, palette='pastel')
plt.title('Boxplot of Accuracy for Each Digit (0-9)')
plt.suptitle('')  # Suppress the default title to avoid duplication
plt.xlabel('Digits')
plt.ylabel('Accuracy.FF')
plt.ylim(85, 100)  # Set the y-axis range
plt.grid(axis='y')
plt.grid(axis='x')
plt.show()

The *seaborn* package (imported as *sns*) provides additional visualization tools such as color palettes. This boxplot uses a color palette; however, the colors do not add any information to the graph, so the palette can be removed. Seaborn also needs certain arguments in order to produce this plot. 

* Exercise: 
i) What would you do to remove the color palette?

ii) Can you find the variables sns would not work without? What happens if you remove them?

## Violin plot

In [None]:
plt.figure(figsize=(10,6))
sns.violinplot(x='digit', y='Accuracy.FF', data=df, inner="quartile", palette='pastel')
plt.title('Violin Plot of Accuracy for Each Digit (0-9)')
plt.xlabel('Digits')
plt.ylabel('Accuracy')
plt.show()

## Histogram
We will now plot a histogram, but only for the values that were trained with the **digit 9**.

In [None]:
digit_9 = df[df['digit'] == 9] #subset digit 9 data
#display(digit_9)
plt.figure(figsize=(10, 6))
plt.hist(digit_9['Accuracy.FF'], bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Accuracy')
plt.xlabel('Accuracy')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()

In histograms, bins define the intervals into which the range of the data is divided; each bin is drawn as one bar. The number of bins changes the level of detail of the visualization. 
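
The binning itself can be inspected without drawing anything. The sketch below uses `numpy.histogram` on a few made-up values (not the tutorial data) to show how the bin edges and the counts per bin depend on the `bins` argument.

```python
import numpy as np

# Made-up accuracy values for illustration only.
values = np.array([90.0, 91.0, 92.0, 93.0, 94.0, 95.0, 96.0, 97.0])

# Same data, two different numbers of bins.
counts_2, edges_2 = np.histogram(values, bins=2)
counts_4, edges_4 = np.histogram(values, bins=4)

print("2 bins:", counts_2, "edges:", edges_2)
print("4 bins:", counts_4, "edges:", edges_4)
```

There is always one more edge than there are bins, because each bin needs a left and a right boundary.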

* Exercise: What happens if you change the bins?

# Graphical Forms of Data Charts: Dataset 2

## Brain region-specific Gene Expression

Read and observe the Brain region-specific Gene Expression data in Python:

In [None]:
genes = pd.read_csv('ExpressionData.txt', sep='\t', index_col=0)
display(genes)

The pandas package is used to read the data files. The *read_csv* function in pandas reads delimited text files in general, not just files with a .csv extension; in this case it reads a .txt file. The argument *sep='\t'* tells pandas that the columns in the file are separated by tabs. *index_col=0* uses the first column as the row index, so that the body of the table contains only the numeric expression values. This matters because the heatmap function requires a table of numbers as input, and a column of text labels inside the table would not fit that requirement. 
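
The effect of *sep* and *index_col* can be checked on a tiny tab-separated text. In this sketch the gene and sample names (`GeneA`, `S1`, and so on) and the values are invented, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# A made-up tab-separated table: first column holds the row labels.
raw = "gene\tS1\tS2\nGeneA\t1.5\t-0.2\nGeneB\t0.3\t0.8\n"

# sep='\t' splits on tabs; index_col=0 turns the first column into the index.
toy_genes = pd.read_csv(io.StringIO(raw), sep="\t", index_col=0)

print(toy_genes)
print(toy_genes.dtypes)  # only numeric columns remain in the table body
```

Leaving out `index_col=0` would keep the labels as an ordinary text column inside the table.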

## Heatmap

In [None]:
plt.figure(figsize=(12, 8)) #figsize 25,30 
sns.heatmap(genes, cmap='viridis', annot=False, cbar=True) #,yticklabels=True ,linewidths=0.5
plt.title('Heatmap of Gene Expression Data')
plt.xlabel('Samples')
plt.ylabel('Genes')
plt.show()

The order of the code lines matters. Here, the first line creates the figure, with the given size, onto which the heatmap is drawn. You could also run this without the *plt.figure* line, but it would change the display.

* Exercise: i) What is compromised if you run without the first line?    
ii) How could you incorporate maximum gene names on the display?

## Scatterplot

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(x='NAc1', y='NAc4', data=genes)
plt.title('Scatter Plot of Gene Expression')
plt.grid(True)

plt.show()

We have plotted the same brain region in two different samples (Nucleus Accumbens for rat1 and rat4). We can see that the two are highly correlated with each other.
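
A visual impression of correlation can be backed up with a number. As a sketch, the example below computes the Pearson correlation coefficient with the pandas `Series.corr` method on made-up columns; the column names `A`, `B`, and `C` and their values are invented for illustration.

```python
import pandas as pd

# Made-up values: B rises exactly with A, C does not follow A closely.
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [4.0, 1.0, 3.0, 2.0],
})

# Series.corr returns the Pearson correlation coefficient by default.
r_ab = toy["A"].corr(toy["B"])
r_ac = toy["A"].corr(toy["C"])

print("A vs B:", round(r_ab, 3))
print("A vs C:", round(r_ac, 3))
```

Applied to the tutorial data, the analogous call would be `genes['NAc1'].corr(genes['NAc4'])`; values close to 1 correspond to points lying near a rising straight line.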

* Exercise: Plot the NAc expression data for any sample against DG of the same sample.
What is the dispersion like? Is the expression between the two brain regions correlated with each other? Is it what you expected?

## Line plot

In [None]:
genes_T = genes.T #Transpose the data
plt.figure(figsize=(12, 8))
#for gene in genes_T.columns:
    #plt.plot(genes_T.index, genes_T[gene], label=gene)
genes_T.plot(legend=False, alpha=0.5)
plt.title('Line Plot of Gene Expression Data')
plt.xlabel('Samples')
plt.ylabel('Gene Expression')
#plt.legend(loc='upper right', bbox_to_anchor=(1.25, 1))
#plt.grid(True)
plt.show()

* Exercise: i) Try the code without transposing the data and see the difference. Can you explain why it does or does not make a difference?
          ii) Uncomment the other lines of code. What do you observe?


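Note: you can also use the *melt* function to convert the data frame to a long format for seaborn. This reshapes the data frame so that each row holds a single value, a layout that seaborn's line plot can work with. 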
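
The reshaping done by *melt* is easiest to follow on a very small table. In this sketch the gene names, sample names, and values are made up, and the new column names `sample` and `expression` are arbitrary choices for illustration.

```python
import pandas as pd

# Made-up wide table: one row per gene, one column per sample.
toy_wide = pd.DataFrame(
    {"S1": [1.0, 3.0], "S2": [2.0, 4.0]},
    index=["GeneA", "GeneB"],
)
toy_wide.index.name = "gene"

# Move the index into a column, then stack the sample columns into rows.
toy_long = toy_wide.reset_index().melt(
    id_vars="gene", var_name="sample", value_name="expression"
)

print(toy_long)
```

A long table like this could then be passed to seaborn, for example as `sns.lineplot(x='sample', y='expression', hue='gene', data=toy_long)`.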

### Advanced exercises.
If you'd like an extra challenge, we suggest that you download the original datasets. You can then try to replicate the plots from the research papers. 
* Filter and Fire original Dataset: https://www.kaggle.com/datasets/selfishgene/fiter-and-fire-paper
* Brain region-specific expression data original Dataset (Fig1d.Region_sepcific_expressed_Gene_cpm_Zscore.txt file): https://figshare.com/projects/Methamphetamine-induced_region-specific_transcriptomic_and_epigenetic_changes_in_the_brain_of_male_rats/177378