# BIOS470/570 Lecture 6

## Last time we covered:
* ### Writing functions in python
* ### pandas data series

## Today we will cover:
* ### Introduction to gene expression measurements
* ### pandas data frames

### First import the packages we need with the standard conventions

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Pandas is very good at reading tabular data. In the data folder, I have two files with RNA-seq data downloaded directly from the supplementary information of the relevant papers. One is an Excel file and one is tab-separated (tsv). Both of these can be read with pandas without further modification.

In [None]:
human_data = pd.read_excel('data/GSE137492_SupplementaryTable1.xlsx')
human_data

In [None]:
frog_data = pd.read_csv('data/xen_uic_hik_stage8_13_30min.tsv',delimiter='\t')
frog_data
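
### To try the tab-separated reader without the course data files, here is a minimal sketch that reads a small made-up table from a string. The gene names, column names and values are invented for illustration and are not taken from the frog file.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
tsv_text = "gene\tstage8\tstage9\nsox2\t1.5\t2.0\nvegt\t0.0\t3.25\n"

# sep="\t" tells read_csv to split on tabs; delimiter="\t" is an alias for it
toy_frog = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(toy_frog.shape)          # (2, 3): two rows, three columns
print(list(toy_frog.columns))  # column names come from the first line
```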

### These objects are pandas dataframes. Similar to a series, they have an index that names the rows. They also have another index object called "columns" that names the columns:

In [None]:
human_data.index #this is just an integer index

In [None]:
human_data.columns #this is an index object with the column names taken from the file

### Dataframes can be indexed either by integer position (as for python lists and numpy arrays) or by label in the index. To make this explicit there are two accessors: .loc (for index labels) and .iloc (for integer positions)

In [None]:
human_data.loc[:,"media_1"]

In [None]:
human_data.iloc[:,2]
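
### The difference between .loc and .iloc is easier to see when the index is not just the integers. This is a small sketch on a made-up dataframe (the gene labels geneA, geneB, geneC and the values are invented):

```python
import pandas as pd

toy = pd.DataFrame(
    {"media_1": [1.0, 2.0, 3.0], "media_2": [4.0, 5.0, 6.0]},
    index=["geneA", "geneB", "geneC"],
)

by_label = toy.loc["geneB", "media_2"]  # row label, column label
by_position = toy.iloc[1, 1]            # row position, column position

# Label slices include the end point, position slices do not
label_slice = toy.loc["geneA":"geneB"]  # two rows
position_slice = toy.iloc[0:1]          # one row
print(by_label, by_position, len(label_slice), len(position_slice))
```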

### Note that pandas infers the data type of each column separately, so different columns can have different types. The data type for the above is float while for the gene ids it is not a numeric type:

In [None]:
human_data.loc[:,"geneIds"]
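
### To check the types of all columns at once, the dtypes attribute can be used. A minimal sketch on a made-up dataframe (the ids and values are invented; text columns are reported as object or as a string type depending on the pandas version):

```python
import pandas as pd

toy = pd.DataFrame(
    {"geneIds": ["id_1", "id_2"], "media_1": [0.5, 12.0], "count": [3, 7]}
)

print(toy.dtypes)  # one data type per column

# Keep only the numeric columns (floats and integers)
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```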

### In this case, for the rows, these are equivalent since the index is just the integers

In [None]:
human_data.loc[0] #first row

In [None]:
human_data.iloc[0]

### This isn't convenient as we'd like to be able to access the data by gene name. Let's drop the rows without gene names and then make the index the gene name. As there are NaNs in the gene name column, we can do this with the dropna method, which drops every row that has a NaN in any column and returns a new dataframe (human_data itself is not changed):

In [None]:
human_data.dropna()
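
### Since dropna looks at every column by default, it can remove more rows than intended. A small sketch on made-up data showing the subset argument, which restricts the check to the listed columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"genes": ["NANOG", np.nan, "CDX2"], "media_1": [1.0, 2.0, np.nan]}
)

dropped_any = toy.dropna()                    # drops rows with a NaN in any column
dropped_genes = toy.dropna(subset=["genes"])  # only checks the "genes" column

print(len(toy), len(dropped_any), len(dropped_genes))  # 3 1 2
```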

### More generally, we can use indexing. The .loc method can also use a boolean series, which enables us to do this. Note that the index for each row has not changed so the first row no longer has index 0

In [None]:
human_data = human_data.loc[~human_data["genes"].isna()]
human_data
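
### The same boolean indexing on a made-up dataframe, to show what happens to the index. The reset_index step is optional and is only shown as one way to get back a 0, 1, 2, ... index:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"genes": ["NANOG", np.nan, "CDX2"], "media_1": [1.0, 2.0, 3.0]}
)

mask = toy["genes"].notna()  # same as ~toy["genes"].isna()
kept = toy.loc[mask]         # keeps the rows where mask is True
print(list(kept.index))      # [0, 2]: the original row labels are kept

renumbered = kept.reset_index(drop=True)
print(list(renumbered.index))  # [0, 1]
```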

### Now set the index

In [None]:
human_data.index = human_data["genes"]
human_data

### Use the drop method to drop the "genes" column as it is now the index

In [None]:
human_data = human_data.drop("genes", axis=1) #drop the genes column as we are using it for the index
human_data
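
### The last two steps (set the index, then drop the column) can also be done in one call with set_index, which by default removes the column it uses. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame(
    {"genes": ["NANOG", "CDX2"], "geneIds": ["id_1", "id_2"], "media_1": [1.0, 2.0]}
)

# Returns a new dataframe indexed by gene name; "genes" is no longer a column
indexed = toy.set_index("genes")
print(list(indexed.columns))            # ['geneIds', 'media_1']
print(indexed.loc["CDX2", "media_1"])   # 2.0
```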

### Now we can get data by gene name:

In [None]:
human_data.loc["NANOG"]

### Let's see a scatter plot of gene expression for two genes, color coded by a third. Note that we need to skip the geneIds column as this doesn't have numeric data.

In [None]:
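fs = 32  # font size for the axis labels
# each row holds one gene; .iloc[1:] skips the first entry (geneIds), which is not numeric
nanog = human_data.loc["NANOG"].iloc[1:]
cdx2 = human_data.loc["CDX2"].iloc[1:]
isl1 = human_data.loc["ISL1"].iloc[1:]
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot()
ax.scatter(nanog, cdx2, c=isl1)  # one point per sample, colored by ISL1 expression
ax.set_ylabel("CDX2", fontsize=fs)
ax.set_xlabel("NANOG", fontsize=fs);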
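
### A row taken from a dataframe with both text and numeric columns has the generic object data type, even after the text entry is skipped. If a numeric series is needed, one option is to convert it explicitly with astype. A sketch on made-up data (the ids and values are invented):

```python
import pandas as pd

toy = pd.DataFrame(
    {"geneIds": ["id_1", "id_2"], "media_1": [1.0, 2.0], "media_2": [3.0, 4.0]},
    index=["NANOG", "CDX2"],
)

row = toy.loc["NANOG"]                # text and numbers together: dtype is object
values = row.iloc[1:].astype(float)   # skip geneIds, then convert to float
print(row.dtype, values.dtype)
```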

### This suggests that NANOG and CDX2 expression are mutually exclusive and ISL1 is expressed together with CDX2

In [None]:
frog_data.iloc[4000:]

### Other ways to make dataframes:
* ### Dictionary of equal length lists. Dictionary keys are columns. Rows get a default index

In [None]:
frame1 = pd.DataFrame({'NANOG':[3, 6, 98, 1],'CDX2':[76, 64, 2, 88]})
frame1

* ### Dictionary of series. Recall we made some series out of our gene expression data before. Let's use a couple to make a dataframe. Again, column names are dictionary keys. Row index is taken from the index of the series:

In [None]:
nanog

In [None]:
frame1 = pd.DataFrame({'NANOG':nanog,'CDX2':cdx2})
frame1
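
### Because the row index is taken from the series, series with different indexes are aligned by label and any missing entries are filled with NaN. A small sketch with two made-up series that only share one label:

```python
import pandas as pd

s1 = pd.Series([3.0, 6.0], index=["media_1", "media_2"])
s2 = pd.Series([76.0, 64.0], index=["media_2", "media_3"])

# Rows are the union of the two indexes; values are matched by label, not position
aligned = pd.DataFrame({"NANOG": s1, "CDX2": s2})
print(aligned)
```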

### This is the transpose of what we had before: here the genes are columns, not rows. We can transpose this if we want:

In [None]:
frame1T = frame1.transpose()
frame1T

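### As with numpy, slicing, transposing etc. does not always make a copy, so changing the new variable may also change the original. Whether it does depends on the pandas version and the data types involved (with copy-on-write enabled, the original is never changed), so check the output below and call .copy() when you need an independent dataframe.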

In [None]:
frame1T.iloc[0,0] = 1
frame1
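
### To be sure the original is left alone whatever the pandas version, an explicit copy can be made. A sketch on a made-up dataframe:

```python
import pandas as pd

frame = pd.DataFrame({"NANOG": [3.0, 6.0], "CDX2": [76.0, 64.0]})

frame_t = frame.transpose().copy()  # independent copy of the transposed data
frame_t.iloc[0, 0] = 1.0            # only changes the copy

print(frame.loc[0, "NANOG"])  # 3.0: the original is unchanged
```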

### You can also convert a numpy array into a dataframe. Both rows and columns will get a default index:

In [None]:
pd.DataFrame(np.random.random((3,3)))

### You can set the index and column names when creating the dataframe:

In [None]:
frame2 = pd.DataFrame(np.random.random((3,3)), index = ["NANOG","CDX2","ISL1"], columns=["Condition 1", "Condition 2", "Condition 3"])
frame2

### Numpy functions will also work on dataframes in an element-wise fashion just like on ndarrays:

In [None]:
np.sqrt(frame2)
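
### Another element-wise function that is often applied to expression values is a log transform. This sketch uses made-up numbers, and adding 1 before taking the log (so that zeros stay finite) is just one common convention, assumed here for illustration. The result is still a dataframe with the same row and column labels:

```python
import numpy as np
import pandas as pd

counts = pd.DataFrame(
    [[0.0, 3.0], [7.0, 15.0]],
    index=["NANOG", "CDX2"],
    columns=["Condition 1", "Condition 2"],
)

logged = np.log2(counts + 1)  # element-wise, labels are preserved
print(logged)
```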

### Many useful data operations can be performed on dataframes, such as sorting by the index or by values and taking means along either axis:

In [None]:
frame2.sort_index()

In [None]:
frame2.sort_index(ascending=False, axis = 'columns') #axis = 1 also works

In [None]:
frame2.sort_values("Condition 1")

In [None]:
frame2.sort_values("NANOG", axis = 1)
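
### Since frame2 holds random numbers, the sorted order changes on every run. Here is the same pair of sort_values calls on a made-up dataframe with fixed values so the result can be checked by eye:

```python
import pandas as pd

toy = pd.DataFrame(
    [[0.9, 0.1, 0.5], [0.2, 0.8, 0.4], [0.6, 0.3, 0.7]],
    index=["NANOG", "CDX2", "ISL1"],
    columns=["Condition 1", "Condition 2", "Condition 3"],
)

# Reorder the rows by their value in the "Condition 1" column
rows_sorted = toy.sort_values("Condition 1")
print(list(rows_sorted.index))    # ['CDX2', 'ISL1', 'NANOG']

# Reorder the columns by their value in the "NANOG" row
cols_sorted = toy.sort_values("NANOG", axis=1)
print(list(cols_sorted.columns))  # ['Condition 2', 'Condition 3', 'Condition 1']
```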

In [None]:
frame2.mean()

In [None]:
frame2.mean(axis = 1)
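
### The axis argument says which direction is collapsed: the default gives one mean per column, while axis = 1 gives one mean per row. A sketch with made-up numbers that are easy to check:

```python
import pandas as pd

toy = pd.DataFrame(
    [[1.0, 3.0], [5.0, 7.0]],
    index=["NANOG", "CDX2"],
    columns=["Condition 1", "Condition 2"],
)

col_means = toy.mean()        # one value per column: 3.0 and 5.0
row_means = toy.mean(axis=1)  # one value per row: 2.0 and 6.0
print(col_means)
print(row_means)
```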