# BIOS470/570 Lecture 8

## Last time we covered:
* ### Missing data, duplicated data, and string operations
* ### Merging multiple data sets with pandas

## Today we will cover:
* ### Clustering data 
* ### seaborn plotting package
* ### Gene ontology

### Import the usual packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### We will need three new packages today: seaborn, scipy, and gget. 
* #### seaborn is a plotting package built on top of matplotlib for statistical plotting. It makes complex plots with high-level functions and is built to work with pandas dataframes. 
* #### scipy is a library for scientific computing. It contains a clustering subpackage (scipy.cluster) that we will use today. 
* #### gget is a new package for querying a variety of biological databases directly from Python code. We will use it for gene ontology.
### The install commands for these are:
* #### conda install scipy
* #### conda install seaborn
* #### conda install -c bioconda gget

In the last command, the -c flag specifies the channel. The bioconda channel contains many bio-specific Python packages. 

In [None]:
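import seaborn as sns
# Import the clustering submodules explicitly; a bare "import scipy" does not load them in older scipy versions
import scipy.cluster.hierarchy
import scipy.cluster.vq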

In [None]:
data_human = pd.read_excel('data/GSE137492_SupplementaryTable1.xlsx')
data_frog = pd.read_csv('data/xen_uic_hik_stage8_13_30min.tsv',delimiter='\t')
data_frog

### Let's process our data as we did before. Make the gene name the index and then drop the non-numeric columns.

In [None]:
# for human data: drop rows with missing values, index by gene name, and set aside the gene IDs
data_human.dropna(inplace=True)
data_human.index = data_human.loc[:,"genes"]
data_human.drop(["genes"],axis = "columns", inplace = True)
ensIds = data_human.loc[:,"geneIds"]
data_human.drop("geneIds",axis = "columns", inplace = True)

# for frog data: index by gene name
data_frog.index = data_frog.loc[:,"Gene"]
data_frog.drop("Gene",axis = "columns", inplace = True)

### When dealing with large datasets, it is often useful to reduce them to the variables of interest to make the size of the dataset smaller and the processing faster. It can also focus attention on relevant features of the dataset. 

### One of the simplest things to do is to remove genes with low or no expression in all conditions:


In [None]:
data = data_frog
expressed = data.max(axis = 1) > 1
data = data.loc[expressed]
data

### It can also be helpful to remove genes with very high expression, as these tend to dominate downstream analysis but may not be interesting:

In [None]:
notTooHigh = data.max(axis = 1) < 1e3
data = data.loc[notTooHigh]
data

### A second commonly used step is to restrict attention to the variable genes, that is, genes whose expression changes between conditions in the dataset. Genes that are expressed at approximately the same level in all conditions are probably not of interest for the conditions being studied. One metric for this is the ratio of the standard deviation to the mean, known as the coefficient of variation. 

### Here, we implement a cutoff in this ratio:

In [None]:
variable = data.std(axis = 1)/data.mean(axis = 1) > 1.5
data = data.loc[variable]
data
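
### To see what this cutoff does, here is a minimal sketch on a made-up table. The gene names, condition names, and values are hypothetical and chosen only for illustration; the cutoff of 1.5 matches the one used above.

```python
import pandas as pd

# Hypothetical toy expression table: rows are genes, columns are conditions
toy = pd.DataFrame(
    {
        "cond_A": [10.0, 0.0, 5.0],
        "cond_B": [10.0, 0.0, 50.0],
        "cond_C": [10.0, 100.0, 5.0],
    },
    index=["flat_gene", "switch_gene", "peak_gene"],
)

# Coefficient of variation per gene: standard deviation divided by the mean
cv = toy.std(axis=1) / toy.mean(axis=1)
print(cv)

# Keep only the genes whose coefficient of variation exceeds the cutoff
toy_variable = toy.loc[cv > 1.5]
print(list(toy_variable.index))
```

### A gene that is constant across conditions has a ratio of 0 and is removed. Only genes whose spread is large relative to their average expression pass the filter.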

### Now we will perform a hierarchical clustering of the data and visualize it using a heatmap with the tree overlaid. 

### We can call the seaborn clustermap function to do this. Behind the scenes, it uses scipy's hierarchical clustering functions to do the clustering. As before, we look at log2(data+1) for better visualization.

In [None]:
cg = sns.clustermap(np.log2(data+1))


### That could be more informative. Too much of the data falls in the dark part of the colormap. We can use the vmax parameter to set the top of the colormap (there is also an analogous vmin parameter). 

In [None]:
cg = sns.clustermap(np.log2(data+1), vmax = 5)

### Notice that seaborn has handled all the labelling for us based on the index and columns of the data frame. The gene labels on the y axis are just a subset of all the data labels, as there are way too many to fit.

### We can also change the colormap to change the colors for the visualization. Any matplotlib colormap name works, and seaborn adds a few of its own:

In [None]:
cg = sns.clustermap(np.log2(data+1),vmax = 5, cmap = "Blues")


In [None]:
cg = sns.clustermap(np.log2(data+1),vmax = 5, cmap = "Spectral")

### There are lots of options for color in seaborn. See [here](https://seaborn.pydata.org/tutorial/color_palettes.html) for an in-depth discussion of color palettes.
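
### The clustering step behind clustermap can also be run directly with scipy. The following is a minimal sketch on a made-up matrix, not a reproduction of seaborn's internals. It assumes average linkage with Euclidean distance, which are the documented defaults of clustermap.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

# Hypothetical toy matrix: 4 "genes" (rows) measured in 3 conditions (columns)
toy = np.array([
    [0.0, 0.1, 0.0],
    [5.0, 5.1, 5.0],
    [0.1, 0.0, 0.1],
    [5.1, 5.0, 5.2],
])

# Agglomerative clustering of the rows
Z = linkage(toy, method="average", metric="euclidean")

# Each row of Z records one merge: the two clusters joined, the distance
# between them, and the number of original rows in the new cluster
print(Z)

# Order of the rows along the leaves of the tree; a clustered heatmap
# reorders its rows in this way
print(leaves_list(Z))
```

### The linkage matrix Z is the same kind of object that clustermap stores in cg.dendrogram_row.linkage, so it can be passed to other functions in scipy.cluster.hierarchy.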

### The relplot function can also work like the matplotlib function scatter and takes care of labeling for you

In [None]:
sns.relplot(x = data_human.loc["ISL1"], y = data_human.loc["NANOG"], hue = data_human.loc["GATA3"], palette = "rocket");

### The scipy function fcluster can be used to split the hierarchical clustering into discrete clusters. With criterion = "maxclust", the following call splits the tree into at most 5 clusters. Another choice, criterion = "distance", cuts the tree so that no two observations in the same cluster have a cophenetic distance greater than the given threshold. 

In [None]:
clusts = scipy.cluster.hierarchy.fcluster(cg.dendrogram_row.linkage,5,criterion="maxclust")
clusts
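
### The difference between the two criteria is easiest to see on a small example. This is a minimal sketch on made-up one-dimensional data; the values and thresholds are hypothetical and chosen only so that the groups are obvious.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three well-separated pairs of points on a line
toy = np.array([[0.0], [0.2], [5.0], [5.3], [10.0], [10.1]])
Z = linkage(toy, method="average")

# criterion="maxclust": the second argument is the maximum number of clusters
labels_max = fcluster(Z, 2, criterion="maxclust")

# criterion="distance": the second argument is a distance threshold at which
# the tree is cut
labels_dist = fcluster(Z, 1.0, criterion="distance")

print(labels_max)
print(labels_dist)
```

### With "maxclust" the number of clusters is fixed in advance and two of the pairs are forced together. With "distance" the number of clusters follows from the threshold, and here each pair becomes its own cluster.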

In [None]:
ax = sns.relplot(data = np.log2(data+1), x = "hiK_13", y = "UIC_4", hue = clusts, palette = "colorblind")
len(clusts)

### This didn't work very well on this data; hierarchical clustering is often not a good method for making discrete clusters. An alternative is k-means clustering, available through the vq.kmeans2 function. Its second argument is the number of clusters, and it returns the cluster centers and a cluster label for each row. 

### It is recommended to "whiten" the data first. The name refers to white noise. In scipy, whiten divides each column by its standard deviation so that every column has unit variance; it does not subtract the mean:


In [None]:
centroids, clusts = scipy.cluster.vq.kmeans2(scipy.cluster.vq.whiten(np.log2(data+1)),6,minit = 'random')
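
### To check what whiten and kmeans2 return, here is a minimal sketch on randomly generated data. The means, spreads, and number of clusters are hypothetical, and minit = "points" is used here only to keep the toy example simple.

```python
import numpy as np
from scipy.cluster.vq import whiten, kmeans2

rng = np.random.default_rng(0)

# Hypothetical toy data: two columns with very different means and spreads
toy = np.column_stack([
    rng.normal(loc=100.0, scale=20.0, size=200),
    rng.normal(loc=1.0, scale=0.1, size=200),
])

white = whiten(toy)

# After whitening, every column has a standard deviation of 1 ...
print(white.std(axis=0))
# ... but the column means are rescaled, not shifted to zero
print(white.mean(axis=0))

# kmeans2 returns the cluster centers and one integer label per row
centroids, labels = kmeans2(white, 2, minit="points")
print(centroids.shape, labels.shape)
```

### Without whitening, the column with the larger spread would dominate the Euclidean distances that k-means uses to assign rows to clusters.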

In [None]:
ax = sns.relplot(data = np.log2(data+1), x = "hiK_13", y = "UIC_4", hue = clusts, palette = "colorblind")
len(clusts)

### This looks substantially better. Remember that this is a high-dimensional dataset and we are only visualizing two of the dimensions. 

### Clustering makes groups of genes, but what do we do with these? Gene ontology analysis searches for annotated sets of genes that are enriched within these lists. The gget tool allows you to query the Enrichr database for this.

### Let's start with a simple example: all the genes with BMP in their name.

In [None]:
import gget
BMP_list = list(data_human.index[data_human.index.str.contains("BMP")])
enrich_out = gget.enrichr(BMP_list,database="ontology")
enrich_out

### This output is a pandas dataframe, so you can extract its contents programmatically. Note the p values: a small p value means that an overlap this large between your gene list and the gene set would be unlikely to arise by random chance.

In [None]:
enrich_out = gget.enrichr(BMP_list,database="pathway")
enrich_out
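
### To build intuition for enrichment p values, here is a minimal sketch of an over-representation test using the hypergeometric distribution. All the counts are hypothetical, and this illustrates the general idea rather than the exact statistic reported by Enrichr.

```python
from scipy.stats import hypergeom

# Hypothetical numbers, chosen only for illustration
background = 20000   # genes in the background set
in_term = 100        # background genes annotated with the term
list_size = 10       # genes in our query list
overlap = 5          # genes in our list that are annotated with the term

# Probability of finding `overlap` or more annotated genes when drawing
# `list_size` genes at random, without replacement, from the background
p_value = hypergeom.sf(overlap - 1, background, in_term, list_size)
print(p_value)
```

### By chance we would expect far fewer than one annotated gene in a list of 10, so finding 5 gives a very small p value.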