# Intermediate Python @ Calico

Course instructors: [Tamas Nagy](mailto:tnagy@calicolabs.com) and [Taylor Cavazos](mailto:tcavazos@calicolabs.com)


#### Description

This course is designed to get you more comfortable with *performing common data science techniques in the Python programming language*. Python is a very expressive, powerful, and popular language, especially in the data science field. You are very likely to find libraries, tutorials, and documentation for routine data science tasks in Python. The rich ecosystem will let you hit the ground running and quickly translate raw data into interpretable tables, statistics, and graphs.

During this course, we will teach you how to perform data exploration in Python with a focus on tabular data and images. We will be using standard Python libraries that are stable, well-documented, and tested so the skills you acquire here should be helpful for many years to come. We will teach you how to tidy up your data, continuously visualize your data, and then use various statistical techniques to extract meaning from datasets.

#### Objectives

After finishing this course, students will be able to:

1. Comfortably interact with tabular data in Python
2. Recognize problems with datasets and correct errors 
3. Visualize datasets to derive new insight
4. Understand and leverage common statistical approaches
5. Extract quantitative data from images and movies

#### Outline

**Lesson 1:** Basic data manipulation and plotting  
**Lesson 2:** Pre-processing, descriptive statistics, and dimensionality reduction  
**Lesson 3:** Linear regression and interpreting statistical significance  
**Lesson 4:** Clustering data and heatmaps  
**Lesson 5:** Image analysis and feature extraction  
**Lesson 6:** Temporal image data and tracking  

**Optional final project:** Send us a description of a dataset you want to analyze, what question you want to answer with it, and what you would like to accomplish by the end of the course. This will help us help you!  


## General approach for data exploration

The basic approach we will be taking is modeled on Hadley Wickham's [R for Data Science](https://r4ds.had.co.nz/explore-intro.html). The same principles apply whether the language is R or Python. The following figure from that book illustrates the general procedure that data scientists use when working with a dataset:

![](https://d33wubrfki0l68.cloudfront.net/795c039ba2520455d833b4034befc8cf360a70ba/558a5/diagrams/data-science-explore.png)



## What is relational data?

A lot of the data that we interact with is *relational* or *labeled* data where each datapoint has several associated attributes. Think of data you might interact with in Excel. A lot of scientific data can be represented in this way and this allows us to leverage 

## Import `Pandas` and `NumPy`

`Pandas` and `NumPy` are two powerful Python libraries that have many convenience classes and functions. The name `Pandas` is derived from "panel data", i.e. tabular/relational data, and `NumPy` provides many common numerical functions (e.g. random number generators, linear algebra, basic statistics, etc.).

In [None]:
import pandas as pd
import numpy as np

The `pandas` [documentation](http://pandas.pydata.org/pandas-docs/stable/user_guide/) is very thorough and I recommend giving it a read. You can also access it by running `help(function)` or `?function` in `IPython`.

## Load the `iris` dataset

This is a classic dataset that is often used in introductory data science classes. 

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

The `read_csv` function has many useful parameters that you can adjust if your dataset has nonstandard features:

In [None]:
?pd.read_csv

In general, the `help(func)` or `?func` tools are very useful for finding out which parameters a function `func` accepts and in what order they need to be passed.
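As a rough sketch of what a few of these parameters do, the cell below parses a small in-memory table instead of a real file. The file contents, the column names, and the `"missing"` marker are made up for illustration; `sep`, `comment`, `na_values`, and `index_col` are existing `read_csv` parameters.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated file contents; io.StringIO stands in
# for a file on disk so the example runs without any external data.
raw = io.StringIO(
    "# toy file\n"
    "sample;length;width\n"
    "s1;5.1;3.5\n"
    "s2;missing;3.0\n"
    "s3;4.7;3.2\n"
)

toy = pd.read_csv(
    raw,
    sep=";",                # column delimiter (the default is ",")
    comment="#",            # ignore anything after a "#" on a line
    na_values=["missing"],  # extra strings that should be read as NaN
    index_col="sample",     # use this column as the row index
)
print(toy)
print(toy.dtypes)
```

Because the `"missing"` entry is converted to NaN, the `length` column can still be read as a numerical column rather than as text.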

## Viewing the data

Our dataset is larger than what we can conveniently show on a computer screen (which is common!) so `pandas` provides some convenience functions to get a feel for the dataset:

In [None]:
# we have 150 rows!
iris.shape

In [None]:
iris.head()

In [None]:
iris.tail()

In [None]:
iris.describe()

<h3 style="color:red">Exercise: Describe everything</h3>

The default behavior of `describe()` is to only print out statistics on numerical columns. Change this so that statistics are reported for all columns.

## Selecting data

In this section, we'll explore how to select specific subsets of data

In [None]:
# make a small example dataset to show indexing behavior
df1 = pd.DataFrame(
    {
        "id": list(range(3, 11)), 
        "char": ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    }
)
df1

Each column can have a different datatype, e.g. `int64`, `float64`, `object`, etc.

In [None]:
df1.dtypes
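The sketch below shows, on a made-up table, how datatypes are inferred per column and how a column can be converted with `astype`. The column names are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Toy table with one integer, one floating point, and one text column
toy = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "ratio": [0.5, 1.5, 2.5],
        "label": ["x", "y", "z"],
    }
)
print(toy.dtypes)  # each column gets its own inferred dtype

# Convert the integer column to floating point; astype returns a new
# Series, so we assign it back to the column
toy["count"] = toy["count"].astype("float64")
print(toy.dtypes)
```

How text columns are reported depends on the `pandas` version (traditionally `object`), so it is worth checking `dtypes` after loading any new dataset.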

DataFrames have two indices for the rows and columns called `index` and `columns`, respectively

In [None]:
df1.index

In [None]:
df1.columns

You can get all the values in a column by passing its name:

In [None]:
df1["id"]
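A related detail, sketched here on a small made-up table: passing a single name returns a one-dimensional `Series`, while passing a list of names returns a `DataFrame`, even if the list holds only one name.

```python
import pandas as pd

# Toy table for illustration only
toy = pd.DataFrame({"id": [3, 4, 5], "char": ["a", "b", "c"]})

one_col = toy["id"]              # a single name gives a Series
two_cols = toy[["id", "char"]]   # a list of names gives a DataFrame
still_df = toy[["id"]]           # a one-element list still gives a DataFrame

print(type(one_col).__name__, type(two_cols).__name__, type(still_df).__name__)
```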

### Setting values

We can also set values. For example, here we are adding a new column called `vals` and passing an array of correct length of boolean values

In [None]:
df1["vals"] = [True, False, False, True, False, True, True, False] # new column with boolean values

<h3 style="color:red">Exercise: More setting</h3>

Set column `id` to the numbers between 4 and 11, inclusive. Print out the dataframe. What do you see? Are DataFrames mutable?

### Indexing (`loc` vs `iloc`)

We can also do more advanced indexing using the `loc` and `iloc` indexers. The best way to learn the differences between them is an example:

In [None]:
df1.iloc[0:3, 1]

In [None]:
df1.loc[0:3, "char"]

In [None]:
df1.index = [5, 6, 7, 0, 1, 2, 3, 4]

In [None]:
df1.iloc[0:3, 1]

In [None]:
df1.loc[0:3, "char"]
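To spell out what the cells above demonstrate, here is a minimal sketch on a separate toy table (the data are made up). `iloc` selects by integer position and excludes the end of a slice, like ordinary Python slicing, whereas `loc` selects by index label and includes the end label.

```python
import pandas as pd

# Toy table whose index labels are deliberately not in positional order
toy = pd.DataFrame({"char": ["a", "b", "c", "d"]}, index=[2, 3, 0, 1])

# Position-based: rows at positions 0 and 1 (end position 2 is excluded)
by_position = toy.iloc[0:2, 0].tolist()

# Label-based: rows from label 0 through label 1 (end label is included)
by_label = toy.loc[0:1, "char"].tolist()

print(by_position)  # ['a', 'b']
print(by_label)     # ['c', 'd']
```

When the index is the default `0, 1, 2, ...`, positions and labels coincide, which is why the two indexers only differ by the inclusive end in that case.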

<h3 style="color:red">Exercise: Switch to alphanumeric index and test behavior</h3>

Change the index of `df2` to the values in `genes` and print out the `id`s of all genes between `NANOG` and `ACTB`, inclusive. Use both `iloc` and `loc`

In [None]:
df2 = df1.copy()

genes = ["OCT4", "SOX2", "AQP4", "NANOG", "ATG8", "GAPDH", "ACTB", "MYH2"]

#### `loc`

#### `iloc`

<h3 style="color:red">Exercise: set subset of values</h3>

Set the `id` column of `df1` to the values 10 through 14 for the 4th through 8th rows, inclusive.

### Boolean indexing

We can also provide a boolean index to only select a subset of rows (or columns)

In [None]:
iris["sepal_length"] > 5 # indices where the sepal length is longer than 5

In [None]:
iris[iris["sepal_length"] > 5].head()

<h3 style="color:red">Exercise: Boolean indexing</h3>

Select all rows with `petal_length` greater than 2

<h3 style="color:red">Exercise: Complex boolean indexing</h3>

Select the rows with sepal lengths less than or equal to 6 **and** petal lengths greater than 2 and report the number of flowers matching this combo

Hint: the `&` operator can be used to combine conditions that **both** need to be true. E.g.

```
    df[(cond) & (cond)]
```
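As a small sketch on made-up data (the column names `height` and `weight` are hypothetical), the same pattern extends to `|` for "or" and `~` for "not". The parentheses around each condition matter because `&` and `|` bind more tightly than comparisons such as `>`.

```python
import pandas as pd

# Toy table for illustration only
toy = pd.DataFrame(
    {"height": [1.0, 2.5, 3.0, 4.5], "weight": [10, 20, 30, 40]}
)

# Both conditions must hold
both = toy[(toy["height"] > 2) & (toy["weight"] < 40)]

# At least one condition must hold
either = toy[(toy["height"] > 4) | (toy["weight"] < 15)]

# Negate a combined condition with ~
neither = toy[~((toy["height"] > 4) | (toy["weight"] < 15))]

print(len(both), len(either), len(neither))
```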

### Missing data

`pandas` typically represents missing numerical data using `np.nan`. Generally you'll want to either drop rows that contain NaNs or replace the NaNs with actual values. To illustrate, let's blank out some values in a copy of the dataset:

In [None]:
sub_iris = iris.copy()
sub_iris.loc[sub_iris["petal_length"] < 1.4, "petal_length"] = np.nan
sub_iris.head()

In [None]:
sub_iris.describe() # NaNs are ignored when computing the statistics!

`notna()` and `isna()` will allow you to select non-NaN or NaN values, respectively, by returning a boolean mask

In [None]:
sub_iris["petal_length"].notna().head()

In [None]:
sub_iris["petal_length"].isna().head()
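Because the mask is boolean, it can be summed to count missing values or used directly for boolean indexing. A minimal sketch on a made-up `Series`:

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries
toy = pd.Series([1.0, np.nan, 3.0, np.nan])

mask = toy.isna()           # True where a value is missing
n_missing = mask.sum()      # True counts as 1, so this counts the NaNs
present = toy[toy.notna()]  # keep only the non-missing values

print(n_missing)
print(present)
```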

### `any` and `all`

These two functions do exactly what they sound like. `any` returns `True` if any values along an axis are true. `all` requires all to be true to return true, otherwise it returns false.

`all`:
```
    df[0, :] = df[0, 1] & df[0, 2] & ... 
    ...
```

`any`:
```
    df[0, :] = df[0, 1] | df[0, 2] | ... 
    ...
```

In [None]:
sub_iris.isna().any(axis=1).head()

In [None]:
sub_iris.isna().all(axis=1).head()
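The effect of the axis argument is easier to see on a tiny table. This is a sketch on made-up data: `axis=1` reduces across the columns and gives one value per row, while `axis=0` reduces down the rows and gives one value per column.

```python
import numpy as np
import pandas as pd

# Toy table: row 0 is complete, row 1 is partly missing, row 2 is all missing
toy = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})
missing = toy.isna()

row_any = missing.any(axis=1)  # is at least one value missing in the row?
row_all = missing.all(axis=1)  # are all values missing in the row?
col_any = missing.any(axis=0)  # does the column contain any missing value?

print(row_any.tolist())  # [False, True, True]
print(row_all.tolist())  # [False, False, True]
print(col_any.tolist())  # [True, True]
```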

<h3 style="color:red">Exercise: Set all NaNs to -1</h3>

Set all NaNs in `sub_iris` to -1 and get the standard deviations for each column:

### `dropna` and `fillna` to remove or replace missing data

In [None]:
sub_iris.dropna().head()

In [None]:
sub_iris.fillna(-1).head()
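Both methods return a new object and leave the original untouched. The sketch below, on made-up data, also shows two common variations: restricting `dropna` to certain columns with `subset`, and filling each column with its own mean by passing a `Series` to `fillna`.

```python
import numpy as np
import pandas as pd

# Toy table with one missing value per column
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

# Drop only the rows where column "a" is missing
dropped = toy.dropna(subset=["a"])

# Replace each NaN with the mean of its own column
# (the mean itself is computed while skipping NaNs)
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Whether dropping or filling is appropriate depends on the analysis; filling with a placeholder such as -1 or the mean changes summary statistics like the standard deviation.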

## Introduction to plotting

The primary plotting package in the Python data science ecosystem is `matplotlib`. It is an incredibly powerful and large library that has a long history and, as such, there are lots of tutorials available online that show out-of-date or non-standard approaches. To the best of our ability, we have followed best practices while plotting and also attempt to be consistent with all of our plots. 

### Figure vs axes vs axis

`matplotlib`'s terminology can get a little confusing, but we find the following plot to be useful to remember the distinction between `figure`, `axes`, and `axis`:

![](https://matplotlib.org/1.5.1/_images/fig_map.png)

Let's first import `matplotlib`'s pyplot module

In [None]:
import matplotlib.pyplot as plt
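As a minimal sketch of the figure/axes/axis distinction (the plotted numbers are arbitrary), `plt.subplots` returns one `Figure` together with one or more `Axes`, and each `Axes` owns an `xaxis` and a `yaxis` object:

```python
import matplotlib.pyplot as plt

# One Figure (the whole canvas) holding two Axes (individual plots)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))
left, right = axes  # with ncols=2, axes is an array of two Axes

left.plot([0, 1, 2], [0, 1, 4])
left.set_title("left Axes")

# Each Axes has an x Axis and a y Axis; this line has the same
# effect as left.set_xlabel("x")
left.xaxis.set_label_text("x")

right.set_ylim(0, 10)  # limits are set per Axes
fig.suptitle("One Figure, two Axes")  # the title of the whole Figure
```

Working with the `fig` and `axes` objects explicitly, rather than relying on the implicit "current" plot, tends to be easier to follow once a figure has more than one panel.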

### Basic scatter plot using `pyplot`

In [None]:
plt.figure(figsize=(8, 4))
plt.scatter(x=iris.sepal_length, y=iris.sepal_width, c=iris.petal_width, cmap='magma')
plt.colorbar(label="petal_width");
plt.suptitle("Awesome plot") # make sure to have a helpful title and labels
plt.xlabel("Sepal Length") 
plt.ylabel("Sepal Width")
plt.show()

### Basic scatter using `pandas`

This is still `matplotlib` in the background, but it has a different API:

In [None]:
iris.plot.scatter(x="sepal_length", y="sepal_width", c="petal_width", figsize=(8, 4), cmap="magma")

### Histograms and density plots

Sometimes we want to quickly identify the distribution shape. Histograms and kernel density estimates allow us to do just that

In [None]:
iris.hist(bins=25);

Looks like petal sizes have a bimodal distribution!

In [None]:
fig, axes = plt.subplots()
iris["petal_length"].plot.hist(ax=axes, bins=25, density=True)
iris["petal_length"].plot.kde(ax=axes)
axes.set_xlabel("Petal Length");
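The `density=True` argument is what makes the histogram comparable to the kernel density estimate: bar heights are rescaled so that the total bar area is 1. Here is a sketch of that rescaling using `np.histogram` on synthetic bimodal values (randomly generated for illustration, not the iris measurements):

```python
import numpy as np

# Synthetic bimodal sample: two normal populations of different size
rng = np.random.default_rng(0)
values = np.concatenate(
    [rng.normal(1.5, 0.2, size=50), rng.normal(5.0, 0.8, size=100)]
)

# Raw counts per bin
counts, edges = np.histogram(values, bins=25)

# Same bins, but heights rescaled to a probability density
density, _ = np.histogram(values, bins=25, density=True)

# Area under the density histogram: sum of height * bin width
area = (density * np.diff(edges)).sum()
print(counts.sum(), area)
```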

<h3 style="color:red">Exercise: Selecting a certain population for plotting</h3>

Select only the larger population and plot the new kernel density estimate

# RNAseq dataset

There are two datasets here: the first is the sample sheet (doe, design of experiment) and the second is the abundance file. In the abundance table, the first two columns are the Ensembl gene ID and the gene symbol, and the remaining columns are samples, which correspond to the sample column in the doe file. The values are ln(TPM).

The RNAseq data come from liver cells grown in a plate and either sampled at baseline (day 0) or split into 3 groups and sampled at day 5: BSA (control), FFA_LPS (treatment 1), or TMC (treatment 2). Both treatments are designed to induce the integrated stress response pathway in the liver cells.

In [None]:
df3 = pd.read_csv("inSphero.abundance.table_edit190410.csv")

In [None]:
df3.head()

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, sharey=True)
axes.scatter(df3["baseline_1"], df3["baseline_2"], c=df3["baseline_3"])
axes.set_xlabel("baseline_1")
axes.set_ylabel("baseline_2");

In [None]:
from scipy import stats

In [None]:
stats.pearsonr(df3["baseline_1"], df3["baseline_2"])

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=1, sharey=True)
axes.scatter(df3["baseline_1"], df3["DE.FFA.LPS.day5_1"])
axes.set_xlabel("baseline_1")
axes.set_ylabel("DE.FFA.LPS.day5_1");

In [None]:
stats.pearsonr(df3["baseline_1"], df3["DE.FFA.LPS.day5_1"])
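`stats.pearsonr` returns two values: the correlation coefficient and a p-value for the null hypothesis of no correlation. The sketch below uses synthetic data (randomly generated here, not the RNAseq table) to show how to unpack them and checks the coefficient against `np.corrcoef`.

```python
import numpy as np
from scipy import stats

# Synthetic, strongly correlated data for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Unpack the correlation coefficient and the p-value
r, p = stats.pearsonr(x, y)

# The same coefficient from NumPy's correlation matrix
r_np = np.corrcoef(x, y)[0, 1]

print(f"r = {r:.3f}, p = {p:.3g}, numpy r = {r_np:.3f}")
```

With thousands of genes, p-values for sample-to-sample correlations tend to be extremely small, so the coefficient itself is usually the more informative number when comparing replicates and treatments.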

## Extra work: Grouping

- **Splitting** the data into groups based on some criteria.
- **Applying** a function to each group independently.
- **Combining** the results into a data structure.

In [None]:
grouped = iris.groupby("species")
grouped

`grouped` is a dict-like object that you can iterate over

In [None]:
for name, group in grouped:
    print(name, ": ", group.shape)

In [None]:
grouped.get_group("setosa").head()
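The split, apply, and combine steps are usually chained in a single expression. A minimal sketch on a made-up table (the values are invented so the group means are easy to verify by hand):

```python
import pandas as pd

# Toy table: two groups with known means (a: 2.0, b: 4.0)
toy = pd.DataFrame(
    {
        "species": ["a", "a", "b", "b", "b"],
        "length": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# Split by species, apply mean to each group, combine into a Series
means = toy.groupby("species")["length"].mean()

# Several summary functions at once give a DataFrame, one row per group
summary = toy.groupby("species")["length"].agg(["mean", "count"])

print(means)
print(summary)
```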

<h3 style="color:red">Exercise: plot groups</h3>

Plot each flower type in a different color and save the figure