# Downloading data from GEO


## Reading list

- [What the FPKM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/) - Explains the difference between the TPM/FPKM/RPKM units
- [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient) - a measure of linear correlation

## Intro

The Gene Expression Omnibus (GEO) is a website funded by the NIH to store the expression data associated with papers. Many journals require you to submit your data to GEO before you can publish.

Search [GEO](http://www.ncbi.nlm.nih.gov/geo) for the accession ID from [Shalek + Satija 2013](http://www.ncbi.nlm.nih.gov/pubmed/23685454). **Download the "Series Matrix" to your laptop** and **copy the link for the `GSE41265_allGenesTPM.txt.gz` file**. All the "Series" files contain the same information in different formats. The Matrix one is the easiest to understand.

Open the "Series Matrix" in Excel (or equivalent) on your laptop, and look at the format and what's described.

In [None]:
! wget [link to GSE41265_allGenesTPM.txt.gz file]

We'll be using five additional libraries in Python:

1. [`numpy`](http://www.numpy.org/) - (pronounced "num-pie") which is the basis for most scientific packages. It's basically a nice-looking Python interface to C code. It's very fast.
2. [`pandas`](http://pandas.pydata.org) - This is the "DataFrames in Python" (like R's nice dataframes). They're a super convenient form that's based on `numpy`, so they're fast. And you can do convenient things like calculate mean and variance very easily.
3. [`matplotlib`](http://matplotlib.org/) - This is the base plotting library in Python.
4. [`scipy`](http://www.scipy.org/) - (pronounced "sigh-pie") Contains scientific routines such as statistical tests. We import its t-test of independent samples below.
5. [`seaborn`](http://web.stanford.edu/~mwaskom/software/seaborn/index.html) - Statistical plotting library. To be completely honest, R's plotting and graphics capabilities are much better than Python's. However, Python is a really nice language to learn and use, it's very memory efficient, can be parallelized well, and has a very robust machine learning library, `scikit-learn`, which has a very nice and consistent interface. So this is Python's answer to `ggplot2` (a very popular R library for plotting), to try and make plotting in Python nicer looking and to make statistical plots easier to do.

In [None]:
# We're doing "import superlongname as abbrev" for our laziness - this way we don't have to type out the whole thing each time.

# Numerical python library (pronounced "num-pie")
import numpy as np

# Dataframes in Python
import pandas as pd

# Python plotting library
import matplotlib.pyplot as plt

# T-test of independent samples
from scipy.stats import ttest_ind

# Statistical plotting library we'll use
import seaborn as sns

# This is necessary to show the plotted figures inside the notebook -- "inline" with the notebook cells
%matplotlib inline

# Read the data table
geo_expression = pd.read_table('GSE41265_allGenesTPM.txt.gz', 
                               
                               # Sets the first (Python starts counting from 0 not 1) column as the row names
                               index_col=0, 
                               
                               # Tells pandas to decompress the gzipped file
                               compression='gzip')
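
To see what `index_col=0` does without downloading anything, here is a minimal sketch on a tiny made-up table. The gene names, sample names, and values are invented for illustration and are not the GEO data; `io.StringIO` just lets pandas read a string as if it were a file.

```python
import io

import pandas as pd

# Toy tab-separated table (made-up values, NOT the GEO data)
toy_text = (
    "GENE\tS1\tS2\tP1\n"
    "geneA\t0.0\t3.5\t10.2\n"
    "geneB\t7.1\t0.0\t4.4\n"
)

# index_col=0 turns the first column into the row names (the index)
toy = pd.read_table(io.StringIO(toy_text), index_col=0)

print(toy.index.tolist())    # row names taken from the first column
print(toy.columns.tolist())  # the remaining columns are the samples
print(toy.shape)             # (number of rows, number of columns)
```

Without `index_col=0`, the gene names would be an ordinary column and the rows would be labeled 0, 1, 2, ... instead.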

Let's look at the top of the dataframe by using `head()`. By default, this shows the first 5 rows.

In [None]:
geo_expression.head()

To specify a certain number of rows, put a number between the parentheses.

In [None]:
geo_expression.head(8)

### Exercise 1: using `.head()`

Show the first 17 rows of `geo_expression`

In [None]:
# YOUR CODE HERE

In [None]:
assert _.index.tolist() == ['XKR4', 'AB338584', 'B3GAT2', 'NPL', 'T2', 'T', 'PDE10A', '1700010I14RIK', 
                            '6530411M01RIK', 'PABPC6', 'AK019626', 'AK020722', 'QK', 'B930003M22RIK',
                            'RGS8', 'PACRG', 'AK038428']

Let's get a sense of this data by plotting the distributions using `boxplot` from seaborn. To save the output, we'll need to get access to the current figure, and save this to a variable using `plt.gcf()`. And then we'll save this figure with `fig.savefig("filename.pdf")`. You can use other extensions (e.g. "`.png`", "`.tiff`") and it'll automatically save in that format.

In [None]:
sns.boxplot(geo_expression)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('geo_expression_boxplot.pdf')
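
As a small sketch of the "extension picks the format" behavior, the snippet below saves one toy figure as both PDF and PNG. The data, the file name `toy_boxplot`, and the temporary directory are all made up for illustration. The `matplotlib.use('Agg')` line is only there so the sketch also runs as a plain script without a display; in the notebook you would skip it and keep `%matplotlib inline`.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside the notebook
import matplotlib.pyplot as plt

# Toy figure with two made-up distributions
fig, ax = plt.subplots()
ax.boxplot([[1, 2, 3, 4], [2, 3, 5, 8]])

# Save the same figure twice; the file extension decides the output format
out_dir = tempfile.mkdtemp()
for ext in ('pdf', 'png'):
    fig.savefig(os.path.join(out_dir, 'toy_boxplot.' + ext))

saved = sorted(os.listdir(out_dir))
print(saved)
plt.close(fig)
```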

Oh right, we have expression data and the scales are enormous... notice the 140,000 maximum. Let's add 1 to all values and take the log2 of the data. We add one because log(0) is undefined; with the +1, a TPM of 0 maps to $\log_2(1) = 0$, so all our logged values start from zero too. This "$\log_2(TPM + 1)$" is a very common transformation of expression data that makes it easier to analyze.

In [None]:
expression_logged = np.log2(geo_expression+1)
expression_logged.head()
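
To make the transformation concrete, here is a minimal sketch on a toy table with made-up TPM values chosen so the results are round numbers. The names `toy_tpm` and `toy_logged` are just for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up TPM values (each value + 1 is a power of two)
toy_tpm = pd.DataFrame(
    {'S1': [0.0, 1.0, 3.0], 'S2': [7.0, 0.0, 1023.0]},
    index=['geneA', 'geneB', 'geneC'],
)

# numpy functions work element-wise on a whole dataframe
toy_logged = np.log2(toy_tpm + 1)
print(toy_logged)
```

A TPM of 0 becomes 0 and a TPM of 1023 becomes 10, so a range spanning three orders of magnitude is squeezed onto a scale from 0 to 10.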

In [None]:
sns.boxplot(expression_logged)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_logged_boxplot.pdf')

### Exercise 2: Interpreting distributions
Now that these are more or less on the same scale ...

Q: What do you notice about the pooled samples (P1, P2, P3) that is different from the single cells?

In [None]:
# YOUR CODE HERE

## Filtering expression data

Seems like a lot of genes are near zero, which means we need to filter our genes.

We can ask which genes have log2 expression values less than 10 (weird example I know - stay with me). This creates a dataframe of `boolean` values of True/False.

In [None]:
expression_logged < 10

What's nice about booleans is that False is 0 and True is 1, so we can sum to get the number of "Trues." This is a simple, clever way that we can filter on a count for the data. We **could** use this boolean dataframe to filter our original dataframe, but then we lose information. For all values that are **not** less than 10 (where the boolean is `False`), it puts in a "not a number" - "NaN."

In [None]:
expression_at_most_10 = expression_logged[expression_logged < 10]
expression_at_most_10
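
Here is the same masking idea as a minimal sketch on a toy table, so it's easy to see exactly which cells turn into NaN. All names and values are made up for illustration.

```python
import pandas as pd

# Made-up logged expression values
toy = pd.DataFrame(
    {'S1': [2.0, 12.0, 0.0], 'S2': [11.0, 5.0, 9.0]},
    index=['geneA', 'geneB', 'geneC'],
)

mask = toy < 10      # boolean dataframe: True where the value is below 10
masked = toy[mask]   # keeps values where mask is True, NaN everywhere else

n_kept = mask.sum().sum()                # True counts as 1, so this counts the Trues
n_missing = masked.isnull().sum().sum()  # total number of NaNs

print(masked)
print(n_kept, n_missing)
```

The two values that are 10 or more (12.0 and 11.0) are the ones replaced by NaN, while the shape of the table stays the same.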

### Exercise 3: Crude filtering on expression data

Create a dataframe called "`expression_greater_than_5`" which contains only values that are greater than 5 from `expression_logged`.

In [None]:
# YOUR CODE HERE

In [None]:
# This `assert` tests for the total number of "NaN"s (nulls) in the dataframe by getting a boolean matrix from
# `isnull()` and then summing twice to get the total
assert expression_greater_than_5.isnull().sum().sum() == 539146


The crude filtering above is okay, but we're smarter than that. We want to use the filtering in the paper: 

> *... discarded genes that were not appreciably expressed (transcripts per million (TPM) > 1) in at least three individual cells, retaining 6,313 genes for further analysis.*

We want to do THAT, but first we need a couple more concepts. The first one is summing booleans.

## A smarter way to filter

Remember that booleans are really 0s (`False`) and 1s (`True`)? This turns out to be VERY convenient and we can use this concept in clever ways.

We can use `.sum()` on a boolean matrix to get the number of genes with expression greater than 10 for each sample:

In [None]:
(expression_logged > 10).sum()

`pandas` is column-oriented and by default, it will give you a sum for each column. But **we** want a sum for each row. How do we do that?


We can sum the boolean matrix we created with "`expression_logged > 10`" along `axis=1` (along the samples) to get **for each gene, how many samples have expression greater than 10**. In `pandas`, this column is called a "`Series`" because it has only one dimension - its length. Internally, `pandas` stores dataframes as a bunch of columns - specifically, these `Series`.

This turns out to be not that many.

In [None]:
(expression_logged > 10).sum(axis=1)
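
The difference between the two sums is easiest to see on a toy table. This is a minimal sketch with made-up names and values; note what each result is indexed by.

```python
import pandas as pd

# Made-up logged expression values: 3 genes (rows) x 3 samples (columns)
toy = pd.DataFrame(
    {'S1': [12.0, 0.0, 11.0], 'S2': [11.0, 3.0, 2.0], 'P1': [15.0, 1.0, 10.5]},
    index=['geneA', 'geneB', 'geneC'],
)

is_high = toy > 10

per_sample = is_high.sum()      # default: one count per column (sample)
per_gene = is_high.sum(axis=1)  # axis=1: one count per row (gene)

print(per_sample)
print(per_gene)
```

`per_sample` is labeled by sample names and `per_gene` by gene names. Only the second one lines up with the rows of the table, which is what matters when it is later used to pick out genes.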

Now we can apply ANOTHER filter and find genes that are "present" (expression greater than 10) in at least 5 samples. We'll save this as the variable `genes_of_interest`. Notice that the cell below doesn't show `genes_of_interest` but rather the list at the bottom. This is because what you see under a code cell is the output of the last thing you called. The "hash mark"/"number sign" "`#`" is called a **comment character**, and Python ignores the rest of the line after it.

### Exercise 4: Commenting and uncommenting

To see `genes_of_interest`, "uncomment" the line by removing the hash sign, and comment out the list `[1, 2, 3]`.

In [None]:
genes_of_interest = (expression_logged > 10).sum(axis=1) >= 5
# genes_of_interest
[1, 2, 3]

In [None]:
assert isinstance(_, pd.Series)

## Getting only rows that you want (aka subsetting)

Now we have some genes that we want to use - how do you pick just those? This can also be called "subsetting" and in `pandas` has the technical name [indexing](http://pandas.pydata.org/pandas-docs/stable/indexing.html).

In `pandas`, to get the rows (genes) you want using their names (gene symbols) or a boolean `Series`, you use `.loc[rows_you_want]`. Check it out below.

In [None]:
expression_filtered = expression_logged.loc[genes_of_interest]
print(expression_filtered.shape)  # shows (nrows, ncols) - like in Manhattan you do the Street then the Avenue
expression_filtered.head()

Wow, our matrix is very small - 197 genes! We probably don't want to filter THAT much... I'd say a range of 5,000-15,000 genes after filtering is a good ballpark. Not too big so it's impossible to work with but not too small that you can't do any statistics.

We'll get closer to the expression data created by the paper. Remember that they filtered on genes that had expression greater than 1 in at least 3 *single cells*. For now, we'll filter for expression greater than 1 in at least 3 *samples* (all samples, pooled included) - we'll get to the single cells in a bit.
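
Before the exercise, here is the count-then-subset pattern as a minimal sketch on a toy table. The names, values, and the "at least 2 samples" cutoff are made up for illustration and deliberately differ from the exercise.

```python
import numpy as np
import pandas as pd

# Made-up logged expression values
toy_logged = pd.DataFrame(
    {'S1': [0.0, 2.0, 1.5], 'S2': [0.5, 3.0, 0.0], 'S3': [0.0, 4.0, 2.5]},
    index=['geneA', 'geneB', 'geneC'],
)

# For each gene (row): is it above 1 in at least 2 samples?
keep = (toy_logged > 1).sum(axis=1) >= 2

# A boolean Series indexed by gene picks out the matching rows
toy_filtered = toy_logged.loc[keep]
print(toy_filtered.shape)

# The cutoff of 1 means the same thing before and after logging:
# log2(TPM + 1) > 1 exactly when TPM + 1 > 2, i.e. when TPM > 1
toy_tpm = np.array([0.5, 1.0, 1.5, 3.0])
same_cutoff = ((np.log2(toy_tpm + 1) > 1) == (toy_tpm > 1)).all()
print(same_cutoff)
```

That equivalence is why filtering the logged data at "greater than 1" matches the paper's TPM > 1 criterion.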

### Exercise 5: Filtering on the presence of genes

Create a dataframe called `expression_filtered_by_all_samples` that consists only of genes that have expression greater than 1 in at least 3 samples.

#### Hint for `IndexingError: Unalignable boolean Series key provided`

If you're getting this error, double-check your `.sum()` command. Did you remember to specify that you want to get the "number present" for each **gene** (row)? Remember that `.sum()` by default gives you the sum over columns. How do you get the sum over rows?

In [None]:
# YOUR CODE HERE
print(expression_filtered_by_all_samples.shape)
expression_filtered_by_all_samples.head()

In [None]:
assert expression_filtered_by_all_samples.shape == (9943, 21)

Just for fun, let's see how the distributions in our expression matrix have changed. As before, we save the figure with `fig.savefig`.

In [None]:
sns.boxplot(expression_filtered_by_all_samples)

# gcf = Get current figure
fig = plt.gcf()
fig.savefig('expression_filtered_by_all_samples_boxplot.pdf')

## Getting only the columns you want

In the next exercise, we'll get just the single cells.

But first, we're going to pull out just the pooled samples - which are conveniently labeled as "P#". We'll do this using a [list comprehension](http://www.pythonforbeginners.com/basics/list-comprehensions-in-python), which means we'll create a new list based on the items in `expression_logged.columns` and whether or not they start with the letter `'P'`.

In [None]:
pooled_ids = [x for x in expression_logged.columns if x.startswith('P')]
pooled_ids

We'll access the columns we want using this bracket notation (note that this only works for columns, not rows).

In [None]:
pooled = expression_logged[pooled_ids]
pooled.head()

We could do the same thing using `.loc` but we would need to put a colon "`:`" in the "rows" section (first place) to show that we want "all rows."

In [None]:
expression_logged.loc[:, pooled_ids].head()
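
As a minimal sketch of both steps together, the toy table below uses made-up column labels (`ctrl_*` and `treat_*`, not the GEO sample names) to show that the bracket and `.loc` forms give the same result.

```python
import pandas as pd

# Made-up table with two kinds of column labels
toy = pd.DataFrame(
    {'ctrl_1': [1, 2], 'ctrl_2': [3, 4], 'treat_1': [5, 6]},
    index=['geneA', 'geneB'],
)

# List comprehension: keep only the column names that pass the test
ctrl_ids = [name for name in toy.columns if name.startswith('ctrl')]

by_brackets = toy[ctrl_ids]    # bracket notation selects columns
by_loc = toy.loc[:, ctrl_ids]  # ":" means "all rows", then the columns

print(ctrl_ids)
print(by_brackets.equals(by_loc))
```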

### Exercise 6: Make a dataframe of only single samples

Use a list comprehension to make a list called `single_ids` that consists only of single cells, and use that list to subset `expression_logged` and create a dataframe called `singles`. (Hint - how are the single-cell ids different from the pooled ids?)

In [None]:
# YOUR CODE HERE
print(singles.shape)
singles.head()

In [None]:
assert singles.shape == (27723, 18)

## Using two different dataframes for filtering

### Exercise 7: Filter the full dataframe using the singles dataframe

Now we'll actually do the filtering done by the paper. Using the `singles` dataframe you just created, get the genes that have expression greater than 1 in at least 3 single cells, and use that to filter `expression_logged`. Call this dataframe `expression_filtered_by_singles`.

In [None]:
# YOUR CODE HERE
print(expression_filtered_by_singles.shape)
expression_filtered_by_singles.head()

In [None]:
assert expression_filtered_by_singles.shape == (6312, 21)

Let's make a boxplot again to see how the data has changed.

In [None]:
sns.boxplot(expression_filtered_by_singles)

fig = plt.gcf()
fig.savefig('expression_filtered_by_singles_boxplot.pdf')

This is much nicer because now we don't have so many zeros and each sample has a reasonable dynamic range.

## Why did this filtering even matter?

You may be wondering, we did all this work to remove some zeros..... so the FPKM what? Let's take a look at how this affects the relationships between samples using `sns.jointplot` from seaborn, which draws a scatterplot of two samples with each sample's distribution along the margins. Older versions of seaborn also annotate the plot with the [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient), a measure of linear correlation; if your version doesn't, you can compute it with pandas' `.corr()`.

Let's first do this on the unlogged data.

In [None]:
# Keyword arguments work across seaborn versions (newer ones reject positional x, y)
sns.jointplot(x='S1', y='S2', data=geo_expression)

Pretty funky looking huh? That's why we logged it :)

Now let's try this on the logged data.

In [None]:
sns.jointplot(x=expression_logged['S1'], y=expression_logged['S2'])

Hmm, our Pearson correlation increased from 0.62 to 0.64. Why could that be?

Let's look at this same plot using the filtered data.

In [None]:
sns.jointplot(x='S1', y='S2', data=expression_filtered_by_singles)

And now our correlation went DOWN!? Why would that be? 
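
If you'd like to check a correlation value yourself rather than read it off a plot, here is a minimal sketch using made-up numbers (not the GEO data). pandas' `Series.corr` computes the Pearson correlation by default, and `np.corrcoef` gives the same number.

```python
import numpy as np
import pandas as pd

# Made-up values for two samples that are almost perfectly linearly related
toy = pd.DataFrame({'S1': [1.0, 2.0, 3.0, 4.0], 'S2': [2.0, 4.0, 6.0, 8.5]})

r_pandas = toy['S1'].corr(toy['S2'])               # Pearson is the default method
r_numpy = np.corrcoef(toy['S1'], toy['S2'])[0, 1]  # off-diagonal entry of the 2x2 matrix

print(round(r_pandas, 3))
```

The same call on two columns of any of the dataframes above gives the value to compare between the unlogged, logged, and filtered data.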

### Exercise 8: Discuss changes in correlation

Take 2-5 sentences to explain why the correlation changed between the different datasets.

In [None]:
# YOUR CODE HERE