# BIOS470/570 Lecture 9

## Last time we covered:
* ### Clustering data 
* ### seaborn plotting package
* ### Gene ontology with gget

## Today we will cover:
* ### Dimensionality reduction with PCA. 

#### We will use the PCA class from the scikit-learn library, which has many functions for machine learning. 
#### Install command: conda install scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import decomposition, preprocessing

### Data that goes into PCA is often scaled so that the variance is not dominated by the variables with the largest numerical values. Let's look at a simple example of this scaling. First, let's make a dataframe where the scales of x and y are very different:

In [None]:
# Make a dataframe with random data in columns labeled x and y:
rand_data = pd.DataFrame({'x':np.random.random(100),'y':100*np.random.random(100)})
# use seaborn to scatterplot:
sns.scatterplot(rand_data, x = 'x', y = 'y');

### The following applies standard scaling from the scikit-learn library. Standard scaling subtracts from each variable its mean and then divides by its standard deviation, i.e. $x_{scaled} = (x - \langle x \rangle )/ \mathrm{std}(x)$
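
### As a quick check of this formula, the sketch below standardizes one column by hand with NumPy and compares the result to StandardScaler. The data and the names raw, by_hand and by_sklearn are made up for illustration. Note that StandardScaler divides by the population standard deviation (ddof=0), which is also the NumPy default.

```python
import numpy as np
from sklearn import preprocessing

rng = np.random.default_rng(0)    # seeded so the example is reproducible
raw = 100 * rng.random((100, 1))  # one column of made-up data on a large scale

# Standardize by hand: subtract the mean, then divide by the standard deviation
by_hand = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Standardize with scikit-learn
by_sklearn = preprocessing.StandardScaler().fit_transform(raw)

# Check that the two results agree to within floating point error
print(np.allclose(by_hand, by_sklearn))
```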

### This example also shows how many of the functions in scikit-learn work. You define the model and then fit it, which actually runs the algorithm. The fit modifies the model object in place. Then, running the transform command returns the transformed data. 


In [None]:
scaler = preprocessing.StandardScaler() #define the model
scaler.fit(rand_data) #fit it to the data (computes the mean and standard deviation of each column)

#Now get the transformed data. 
# By default sklearn outputs this as a numpy array, but we can put it back into a dataframe, with the same column names:
new_data = pd.DataFrame(scaler.transform(rand_data),columns=rand_data.columns)
sns.scatterplot(new_data,x = 'x',y = 'y');
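
### To see what "fit modifies the model object in place" means, the sketch below inspects the attributes that fit stores on a StandardScaler. It uses its own made-up dataframe called toy, so it does not depend on the cells above. In scikit-learn, attributes whose names end in an underscore are the ones filled in by fit.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

rng = np.random.default_rng(1)
toy = pd.DataFrame({'x': rng.random(100), 'y': 100 * rng.random(100)})

scaler = preprocessing.StandardScaler()
scaler.fit(toy)  # learns the mean and standard deviation of each column

print(scaler.mean_)   # per-column means stored by fit
print(scaler.scale_)  # per-column standard deviations stored by fit

scaled = scaler.transform(toy)  # returns a new NumPy array; toy itself is unchanged
print(scaled.mean(axis=0), scaled.std(axis=0))
```

### After the transform each column should have a mean of about 0 and a standard deviation of about 1, whatever its original scale was.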

### Now let's look at a dataset with a correlation between x and y. We will define a dataframe with x and y, where the value of y is just x plus some random noise:

In [None]:
# define some random numbers
x = np.random.random(100)
# a data frame where x are the random numbers and y is x with added noise. 
rand_data = pd.DataFrame({'x':x,'y':x+0.5*np.random.random(100)})
#make a scatterplot of the dataframe
sns.scatterplot(rand_data, x = 'x', y = 'y');

### Let's apply the same standard scaler to the data. It will look basically the same, but the means will have shifted and the scale will have changed. Note that the information on the relative magnitudes or dispersions of the variables has been lost.

In [None]:
scaler = preprocessing.StandardScaler() #make the scaler object
scaler.fit(rand_data) #fit it to the data
new_data = pd.DataFrame(scaler.transform(rand_data), columns=rand_data.columns) #get the transformed data
sns.scatterplot(new_data, x = 'x',y = 'y'); #plot it
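
### The sketch below makes the point about lost information concrete, using a made-up dataframe built the same way as above. Before scaling the two columns have different spreads; after scaling both have a standard deviation of 1, so that difference is gone. The correlation between x and y, on the other hand, is unchanged by the scaling.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

rng = np.random.default_rng(2)
x = rng.random(100)
toy = pd.DataFrame({'x': x, 'y': x + 0.5 * rng.random(100)})

scaled = pd.DataFrame(preprocessing.StandardScaler().fit_transform(toy),
                      columns=toy.columns)

# Spread of each column before and after scaling
print(toy.std(ddof=0).to_numpy(), scaled.std(ddof=0).to_numpy())

# Correlation between x and y before and after scaling
corr_before = toy['x'].corr(toy['y'])
corr_after = scaled['x'].corr(scaled['y'])
print(corr_before, corr_after)
```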

### Now let's apply PCA to this dataset. First, we make the PCA model object with decomposition.PCA, then we run fit to actually do the PCA. This will add some attributes to the model. First, let's look at the explained variance ratio:

In [None]:
pca = decomposition.PCA(n_components=2) #specify the model with its parameters, for example, the number of components to return
pca.fit(new_data) #actually run the pca
pca.explained_variance_ratio_ #how much variance does each principal component contain?

### There are two components in the PCA because that is how many we specified. That is also the maximum number, because the original data was 2-dimensional. About 95% of the variance in the data is explained by the first component. 
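
### Where does the explained variance ratio come from? The sketch below, on made-up data built like the example above, recomputes it from the eigenvalues of the covariance matrix of the scaled data. For two standardized variables with correlation rho, the two ratios work out to (1 + |rho|)/2 and (1 - |rho|)/2. Here rho is around 0.9, which is why the first component holds about 95% of the variance. The variable names are illustrative only.

```python
import numpy as np
from sklearn import decomposition, preprocessing

rng = np.random.default_rng(3)
x = rng.random(100)
toy = np.column_stack([x, x + 0.5 * rng.random(100)])
scaled = preprocessing.StandardScaler().fit_transform(toy)

pca = decomposition.PCA(n_components=2).fit(scaled)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(scaled, rowvar=False))[::-1]
ratio_by_hand = eigenvalues / eigenvalues.sum()

# The same ratios written in terms of the correlation between the two columns
rho = np.corrcoef(scaled, rowvar=False)[0, 1]
ratio_from_rho = np.array([1 + abs(rho), 1 - abs(rho)]) / 2

print(pca.explained_variance_ratio_)
print(ratio_by_hand)
print(ratio_from_rho)
```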

### Now let's get the transformed data. We will put it in a new dataframe with the columns labeled PC1 and PC2:

In [None]:
transformed_data = pd.DataFrame(pca.transform(new_data), columns=['PC1','PC2'])
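
### What transform computes can be written out by hand. The sketch below, on made-up data, assumes the whiten option is left at its default of False. In that case the transform subtracts the mean stored in pca.mean_ and then takes the dot product of each data point with each row of pca.components_.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(4)
x = rng.normal(size=100)
toy = np.column_stack([x, x + 0.3 * rng.normal(size=100)])

pca = decomposition.PCA(n_components=2).fit(toy)

# Project by hand: center the data, then take dot products with each component
by_hand = (toy - pca.mean_) @ pca.components_.T

# Compare with the library's transform
print(np.allclose(by_hand, pca.transform(toy)))
```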

### Now let's plot some of the outputs. We can look at two different things: the principal components in the coordinates of the original data (given by pca.components_) and the transformed data (given by pca.transform):

In [None]:
#axes[0].scatter(new_data[:,0],new_data[:,1])
fig = plt.figure(figsize = (12,5))
axes = fig.subplots(1,2)
sns.scatterplot(new_data, x = 'x',y='y',ax = axes[0]) #plot the original data
#add vectors for the principal components
axes[0].arrow(0,0, pca.components_[0,0], pca.components_[0,1], head_width = 0.1, color = 'r', label = 'pc1')
axes[0].arrow(0,0, pca.components_[1,0], pca.components_[1,1], head_width = 0.1, color = 'r')
#this sets the x and y scales to be equal so if the vectors are at right angles, we will see it that way. 
axes[0].set_aspect('equal')
axes[0].set_title('Original data with principal components indicated')

#Now plot the transformed data. The coordinates are no longer the original x and y but the values of the principal components for each data point
sns.scatterplot(transformed_data, x = 'PC1', y = 'PC2', ax = axes[1])
axes[1].set_title('Transformed data')
axes[1].set_aspect('equal')
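
### The two properties visible in these plots can also be checked numerically. The sketch below, on made-up data, checks that the rows of pca.components_ are unit vectors at right angles to each other, and that the transformed coordinates are uncorrelated, with variances equal to pca.explained_variance_.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(5)
x = rng.normal(size=200)
toy = np.column_stack([x, x + 0.3 * rng.normal(size=200)])

pca = decomposition.PCA(n_components=2).fit(toy)
scores = pca.transform(toy)

# Dot products between the component vectors: should be the identity matrix
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))

# Covariance of the PC coordinates: should be diagonal,
# with the explained variances on the diagonal
score_cov = np.cov(scores, rowvar=False)
print(np.round(score_cov, 6))
print(pca.explained_variance_)
```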

### Now let's see how this looks on a real dataset. We will look at RNA-seq data from frog development:

In [None]:
data = pd.read_csv('data/xen_uic_hik_stage8_13_30min.tsv',delimiter='\t')

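### Apply the scaler to the data as we did before. Note that PCA centers each variable itself but does not rescale it, so the scaling has to be done as a separate step. The argument whiten=True, e.g. PCA(n_components = 4, whiten = True), does something different: it rescales the output principal components to unit variance rather than standardizing the input variables. 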

### We also need to transpose the data. We want to treat each sample as a data point and replace the large space of genes for each sample with a smaller set of principal components. To do so, each row must be one sample and each column must be one gene:
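
### Here is a toy illustration of this transposition. The gene names, sample names and values are hypothetical, and the layout (a first column of gene names followed by one column per sample) is an assumption based on how the data is indexed in this lecture.

```python
import pandas as pd

# Hypothetical miniature table: one row per gene, one column per sample
toy = pd.DataFrame({
    'gene': ['geneA', 'geneB', 'geneC', 'geneD'],
    'sample1': [5.0, 0.0, 12.0, 3.0],
    'sample2': [6.0, 1.0, 10.0, 2.0],
    'sample3': [9.0, 4.0, 2.0, 8.0],
})

# Drop the gene-name column and transpose: rows become samples, columns become genes
samples_by_genes = toy.iloc[:, 1:].transpose().to_numpy()

print(toy.shape)               # genes x (name column + samples)
print(samples_by_genes.shape)  # samples x genes
print(samples_by_genes[0])     # all gene values for the first sample
```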

In [None]:
data_numeric = data.iloc[:,1:].transpose().to_numpy() #get only the numeric data
scaler = preprocessing.StandardScaler()
scaler.fit(data_numeric)
data_scaled = scaler.transform(data_numeric) #get the scaled data
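
### If you would rather not keep track of the scaler and the PCA separately, scikit-learn can chain them into a single object with make_pipeline. This is only a sketch on a made-up matrix (not the frog data), and the names toy, two_step and one_step are illustrative. It checks that the chained version gives the same result as doing the two steps by hand.

```python
import numpy as np
from sklearn import decomposition, preprocessing
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(6)
# Made-up matrix: 8 samples (rows) by 50 features (columns) on very different scales
toy = rng.normal(size=(8, 50)) * rng.uniform(1, 100, size=50)

# Two explicit steps: scale, then PCA
scaled = preprocessing.StandardScaler().fit_transform(toy)
two_step = decomposition.PCA(n_components=3).fit_transform(scaled)

# The same two steps chained into a single object
pipeline = make_pipeline(preprocessing.StandardScaler(),
                         decomposition.PCA(n_components=3))
one_step = pipeline.fit_transform(toy)

print(np.allclose(two_step, one_step))
```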

### Here, fit_transform both does the fitting and returns the transformed data:

In [None]:
pca = decomposition.PCA(n_components=10)
transformed_data = pca.fit_transform(data_scaled)
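
### As a sanity check on fit_transform, the sketch below uses made-up data to confirm that calling fit and then transform gives the same coordinates as the single fit_transform call.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(7)
toy = rng.normal(size=(12, 30))  # made-up data: 12 samples, 30 features

# Two calls: fit, then transform
pca_a = decomposition.PCA(n_components=5)
pca_a.fit(toy)
two_calls = pca_a.transform(toy)

# One call that does both
pca_b = decomposition.PCA(n_components=5)
one_call = pca_b.fit_transform(toy)

print(np.allclose(two_calls, one_call))
print(one_call.shape)  # one row per sample, one column per component
```

### Note that n_components cannot be larger than the smaller of the number of samples and the number of features.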

### Let's put this in a dataframe with column labels for the principal components, and include a column which labels the samples:

In [None]:
pc_labels = ['PC' + str(ii) for ii in range(1,11)] #column labels PC1 through PC10
    
transformed_data = pd.DataFrame(transformed_data,columns=pc_labels)
transformed_data["sample_labels"] = data.columns[1:]

### Look at the fraction of variance which is explained by each principal component:

In [None]:
pca.explained_variance_ratio_
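
### A common way to read these numbers is to add them up cumulatively, which shows how much of the variance the first k components capture together. The sketch below does this on made-up data that is built to have three strong directions of variation plus a little noise; it is not the frog dataset.

```python
import numpy as np
from sklearn import decomposition

rng = np.random.default_rng(8)
# Made-up data: 20 samples, 40 features, driven by 3 hidden factors plus noise
latent = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 40))
toy = latent + 0.1 * rng.normal(size=(20, 40))

pca = decomposition.PCA(n_components=10).fit(toy)

ratios = pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)  # variance captured by the first k components together

for k, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f'PC{k}: {r:.3f} (cumulative {c:.3f})')
```

### In this made-up example the cumulative sum should be close to 1 after the first three components, reflecting the three hidden factors used to build the data.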

### We will use this function, borrowed from [here](https://stackoverflow.com/questions/46027653/adding-labels-in-x-y-scatter-plot-with-seaborn), to make plots of principal components with the samples labeled. It just uses the seaborn scatterplot together with the .text method to add text to each point:

In [None]:
def scatter_text(x, y, text_column, data):
    """Scatter plot of columns x and y of data, with the text in text_column next to each point
       Based on this answer: https://stackoverflow.com/a/54789170/2641825"""
    # Create the scatter plot
    p1 = sns.scatterplot(x = x,y =  y, data=data, s = 100, legend=False)
    # Add text beside each point (.iloc selects by position, so any dataframe index works)
    for line in range(0,data.shape[0]):
         p1.text(data[x].iloc[line]+0.01, data[y].iloc[line], 
                 data[text_column].iloc[line], horizontalalignment='left', size=18, color='black')


### Make plots of PC1 vs PC2 and PC1 vs PC3

In [None]:
fig = plt.figure(figsize=(20,10))
ax1 = fig.add_subplot(1,2,1)
scatter_text('PC1','PC2',"sample_labels",transformed_data)
ax1.set_xlabel('PC1',fontsize=24)
ax1.set_ylabel('PC2',fontsize=24)

ax1 = fig.add_subplot(1,2,2)
scatter_text('PC1','PC3',"sample_labels",transformed_data)
ax1.set_xlabel('PC1',fontsize=24)
ax1.set_ylabel('PC3',fontsize=24)


### The samples labeled UIC_... are untreated samples at different times, while those labeled hiK_... are treated samples. Note how they take different routes through the principal component space before joining again.

### Now let's look at PC2 vs PC3

In [None]:
fig = plt.figure(figsize=(10,10))
ax1 = fig.add_subplot()
scatter_text('PC2','PC3',"sample_labels",transformed_data)
ax1.set_xlabel('PC2',fontsize=24)
ax1.set_ylabel('PC3',fontsize=24);
