# CNJCx Week 5: Practical Python

Tyler Benster
(tbenst@stanford.edu)

## Outline
### Motivation and background
### Hands-on coding

### Motivation and background
- Goals
- Anti-goals
- Extra details
- Tidy Data
- Today's Python Packages

## Goals for today
- "Day in the life" of a Pythonista
- Whirlwind tour of foundational packages for Data Scientists in Python
- Exposure to opinionated best practices for formatting data and code
- understand the "why" of each code block
- know which library to use for particular analyses

## Anti-goals for today
- comprehend the "how" of each line of code
- know which function to use for particular analyses
- understand the math behind shown analyses
- feel that the class is going at a comfortable pace
- understand how this presentation was made in a Jupyter notebook with RISE/reveal.js

## Extra details for eager or advanced listeners
- <details>
    <summary><a><strong>IYI</strong></a>: If You're Interested; click me! (no seriously please do :)</summary>
    Optional content will be prefaced by IYI. This is not essential for understanding the presentation, and if you are at all feeling lost or confused, now is a great time to ignore what I'm saying and ask questions in the chat. IYI is inspired by David Foster Wallace's Infinite Jest.
</details>
- Bonus: quick peek at modern deep learning in PyTorch

## Easy visualization with Tidy Data
![tidy data](https://r4ds.had.co.nz/images/tidy-1.png)

See Hadley Wickham's [publication](https://www.jstatsoft.org/article/view/v059i10) for more details and motivation.
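
To make the idea concrete, here is a minimal sketch of reshaping a small "wide" table into tidy form with `pandas.melt`. The table, its column names (`subject`, `day1`, `day2`, `score`) and its values are made up for illustration.

```python
import pandas as pd

# A hypothetical "wide" table: one row per subject, one column per day
wide = pd.DataFrame({
    "subject": ["a", "b"],
    "day1": [1.0, 2.0],
    "day2": [3.0, 4.0],
})

# Tidy form: each row is one observation, each column is one variable
tidy = pd.melt(wide, id_vars=["subject"],
               var_name="day", value_name="score")
print(tidy)
```

The tidy table has one row per (subject, day) pair, which is the shape that Grammar of Graphics libraries expect.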

### Hands-on coding
- Data visualization: how to make some basic plots (matplotlib, Altair)
- (5 minute break)
- Advanced data analysis: interrogate the data and visualize (scipy.stats, sklearn)
- how to read in common data formats (images, MAT v6/v7, HDF5, csv)
- data munging: what data structures and patterns to use for optimal efficiency (numpy, pytorch tensor, pandas, tidy data)

## First-up: matplotlib
matplotlib is the most popular plotting library in Python, and is a Swiss Army knife that can do virtually anything. It's also the most manual and the most difficult to use.

Let's load some example data first

In [None]:
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

# load data
iris = datasets.load_iris()
iris.keys()

In [None]:
for key, value in iris.items():
    if key not in ['data', 'target']:
        print(f"=========\n{key}: {value}")

Let's create a basic scatter plot using the procedural (scripting) interface

In [None]:
plt.scatter(iris.data[:,0], iris.data[:,2])
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[2])

Now, we create subplots with coloring and a legend using the alternative object-oriented interface

In [None]:
fig, axes = plt.subplots(1,2)
axes[0].scatter(iris.data[:,0], iris.data[:,1], c=iris.target)
axes[0].set_xlabel(iris.feature_names[0])
axes[0].set_ylabel(iris.feature_names[1])
scatter1 = axes[1].scatter(iris.data[:,2], iris.data[:,3], c=iris.target)
axes[1].set_xlabel(iris.feature_names[2])
axes[1].set_ylabel(iris.feature_names[3])
axes[1].legend(scatter1.legend_elements()[0],
               iris.target_names, title="Species")

Uh oh, that looks terrible. Here's a quick fix:

In [None]:
fig.tight_layout()
fig

Better, but the legend location is still problematic.

**IYI**: This can be fixed using low-level arguments like `bbox_to_anchor`, see [here](https://stackoverflow.com/questions/4700614/how-to-put-the-legend-out-of-the-plot)
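
**IYI**: a minimal sketch of one such fix, assuming we are happy to place the legend just outside the right edge of the axes. The random points, class labels and legend names below are made up; only `loc` and `bbox_to_anchor` matter here.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(2, 30))      # made-up points
labels = np.repeat([0, 1, 2], 10)    # made-up class labels

fig, ax = plt.subplots()
scatter = ax.scatter(x, y, c=labels)
handles, _ = scatter.legend_elements()
# Pin the legend's upper-left corner slightly right of the axes (axes coordinates)
legend = ax.legend(handles, ["a", "b", "c"], title="Class",
                   loc="upper left", bbox_to_anchor=(1.02, 1))
fig.tight_layout()  # make room for the legend inside the figure
```

Here `loc` says which corner of the legend box is pinned, and `bbox_to_anchor` says where that corner goes.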

## Surely there's a better way??
Introducing the "Grammar of Graphics"! Other Python GoG packages include Seaborn and HoloViews. We use Altair because it is built on the cross-language Vega-Lite, so what you learn today can also be done in Julia or even used for interactive web charts!

![grammar of graphics](https://miro.medium.com/max/2000/1*mcLnnVdHNg-ikDbHJfHDNA.png)

**IYI** conceptual guide [here](https://towardsdatascience.com/a-comprehensive-guide-to-the-grammar-of-graphics-for-effective-visualization-of-multi-dimensional-1f92b4ed4149)

## Introducing pandas: convenient tables in Python

First, let's install a python package with example datasets

In [None]:
!pip install vega_datasets

Next we load an example DataFrame

In [None]:
from vega_datasets import data
import altair as alt, pandas as pd

cars_df = data.cars()
print(f"object type: {type(cars_df)}")

DataFrames have some convenient methods to help us inspect them

In [None]:
cars_df.head()

In [None]:
cars_df.columns

In [None]:
cars_df.tail()

Some of these methods can be chained:

In [None]:
cars_df.Name.tail()

Here we select a single value

In [None]:
cars_df["Name"][402]
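
**IYI**: the chained form above works for reading a value, but pandas also offers explicit indexers. Here is a small sketch on a made-up three-row table (not the cars data) showing label-based `.loc` and position-based `.iloc`.

```python
import pandas as pd

# Hypothetical table whose row labels do not start at 0
df = pd.DataFrame({"Name": ["ford", "vw", "chevy"]},
                  index=[400, 401, 402])

by_label = df.loc[402, "Name"]    # row *label* 402, column "Name"
by_position = df.iloc[2, 0]       # third row, first column, by position
print(by_label, by_position)
```

Both lookups return the same cell; `.loc` is usually the safer choice when the index labels carry meaning.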

Let's take a look at the type of each Series (column)

In [None]:
cars_df.dtypes

Let's see the various Origins

In [None]:
cars_df.Origin.unique()

We can easily do `where` queries

In [None]:
cars_df[cars_df.Origin=='USA'].head()

Or chain multiple requirements

In [None]:
from datetime import datetime
idxs = np.all([cars_df.Origin=='USA',
              cars_df.Horsepower>200,
              cars_df.Year<=datetime(1972,1,1)],
             axis=0)
cars_df[idxs]
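
The same kind of query can also be written with the element-wise `&` operator on boolean Series, which is a common pandas idiom. This is a sketch on a made-up miniature of the cars table, so the rows and values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the cars table
df = pd.DataFrame({
    "Origin": ["USA", "USA", "Japan", "Europe"],
    "Horsepower": [220, 150, 95, 210],
    "Year": pd.to_datetime(["1970-01-01", "1971-01-01",
                            "1972-01-01", "1973-01-01"]),
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >.
mask = ((df.Origin == "USA")
        & (df.Horsepower > 200)
        & (df.Year <= pd.Timestamp(1972, 1, 1)))
selected = df[mask]
print(selected)
```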

## Plotting Tidy Data with Altair
Since our data is Tidy, we can use the Grammar of Graphics to make plots!

In [None]:
line = alt.Chart(cars_df).mark_line().encode(
    x='Year',
    y='mean(Miles_per_Gallon)'
)
# https://altair-viz.github.io/user_guide/generated/core/altair.ErrorBandDef.html#altair.ErrorBandDef
band = alt.Chart(cars_df).mark_errorband(extent='ci').encode(
    x='Year',
    y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
)

band + line

The power of this approach becomes especially apparent with complex plots that would require a lot of work for each axis with matplotlib

In [None]:
line = alt.Chart(cars_df).mark_line().encode(
    x='Year',
    y=alt.Y('mean(Miles_per_Gallon)', title="average MPG"),
    color='Cylinders:O' # we specify that the data is Ordinal, meaning ordered
).properties(
    width=180,
    height=180
).facet(
    facet='Origin:N', # data is Nominal, meaning categorical
    columns=3
)
line

### Exercise 1: make a scatter plot of Horsepower vs Acceleration, colored by Origin
Instead of `mark_line`, use `mark_point`

In [None]:
# your code here...feel free to refer to cells above!


Let's quickly revisit the Iris dataset and show off our new skills!

In [None]:
iris_df = data.iris()

alt.Chart(iris_df).mark_circle().encode(
    alt.X('sepalLength', scale=alt.Scale(zero=False)),
    alt.Y('sepalWidth', scale=alt.Scale(zero=False, padding=1)),
    color='species',
    size='petalWidth'
)

Finally, **IYI**, here's a more advanced figure: an interactive scatter and violin plot using `selection`, `transform_filter`, and `transform_density`

In [None]:
brush = alt.selection(type='interval', resolve='global')
scatter = alt.Chart(cars_df).mark_point().encode(
    x=alt.X('Horsepower'),
    y=alt.Y('Acceleration'),
    color=alt.condition(brush, 'Origin', alt.ColorValue('gray'))
)

violin = alt.Chart(cars_df).transform_filter(
    brush
).transform_density(
    'Miles_per_Gallon',
    as_=['Miles_per_Gallon', 'density'],
    extent=[5, 50],
    groupby=['Origin']
).mark_area(orient='horizontal').encode(
    y='Miles_per_Gallon:Q',
    color='Origin:N',
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        axis=alt.Axis(labels=False, values=[0],grid=False, ticks=True),
    ),
    column=alt.Column(
        'Origin:N',
        header=alt.Header(
            titleOrient='bottom',
            labelOrient='bottom',
            labelPadding=0,
        ),
    )
).properties(
    width=100
)


plot = (scatter | violin).add_selection(
# scatter.add_selection(
    brush
).configure_facet(
     spacing=0
).configure_view(
    stroke=None
)

In [None]:
# try drawing a box on the scatter plot!
plot

For more, check out this example gallery of beautiful plots with shockingly few lines of code: https://altair-viz.github.io/gallery/index.html

## (5 minute break)

**IYI** A poem while we wait

In [None]:
import this

# WIP below here

## Data munging

In [None]:
# Read in trial-summed response of retinal ganglion cells
# to a 0.5s flash of light
rgcs_df = pd.read_csv("rgc_light_response.csv")

In [None]:
# Each numbered column is a 1 ms time bin holding the number of
# action potentials summed across `ntrials` trials.
# i, j index the 2D electrode array.
# unit_num identifies purported individual neurons recorded from each electrode.
rgcs_df.head()

In [None]:
rgcs_tidy = pd.melt(rgcs_df, id_vars=['retina', 'id', 'ntrials'],
        var_name="time_bin",
        value_name="spike_count",
        value_vars=list(map(str, np.arange(3500))))
# 1 ms time bins, so 1000 bins per second
time_bin = 1000
rgcs_tidy["time"] = (pd.to_numeric(rgcs_tidy.time_bin) + 1) / time_bin
rgcs_tidy["firing_rate"] = rgcs_tidy.spike_count / rgcs_tidy.ntrials * time_bin
rgcs_tidy.drop(columns=["spike_count", "ntrials", "time_bin"], inplace=True)
rgcs_tidy

In [None]:
rgcs_tidy.dtypes

First, we create an nSamples x nFeatures matrix

In [None]:
rgc_mat = np.array(rgcs_tidy.pivot(index='id', columns='time', values='firing_rate'))

In [None]:
rgc_mat.shape
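
As a sanity check on what `pivot` does, here is a toy sketch with a made-up tidy table of two cells and three time points. The ids and firing rates are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tidy table: 2 cells x 3 time points
tidy = pd.DataFrame({
    "id": ["c1", "c1", "c1", "c2", "c2", "c2"],
    "time": [0.001, 0.002, 0.003] * 2,
    "firing_rate": [0.0, 5.0, 10.0, 1.0, 2.0, 3.0],
})

# One row per id, one column per time point
wide = tidy.pivot(index="id", columns="time", values="firing_rate")
mat = wide.to_numpy()
print(mat.shape)  # (2, 3)
```

One thing to keep in mind: the pivoted rows come back ordered by the sorted `id` values, which may differ from the row order of the original table.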

In [None]:
ylim

In [None]:
from matplotlib import patches
from typing import Tuple
time = np.arange(rgc_mat.shape[1])/1000 # convert to seconds
fig, ax = plt.subplots()
ax.plot(time, rgc_mat.mean(0))
ylim = ax.get_ylim()
# We make a small helper function to follow DRY: Don't Repeat Yourself

def make_rect(start:float, duration:float, ylim:Tuple[float, float]):
    # Rectangle takes the lower-left corner (x, y), then width and height
    return patches.Rectangle((start, ylim[0]), duration, ylim[1] - ylim[0],
                             facecolor='black', alpha=0.1)

# Create Rectangle patches shading 0-1 s and 1.5-3.5 s
rect1 = make_rect(0,1, ylim)
rect2 = make_rect(1.5,2, ylim)

# Add the patch to the Axes
ax.add_patch(rect1)
ax.add_patch(rect2)
ax.set_xlim(0,3.5)

In [None]:
signal.convolve?

In [None]:
from scipy import signal
# estimate firing rate using gaussian smoothing
sigma = 6
bandwidth = 0.05 # sec
bin_width = 0.001
transformed_sigma = bandwidth/bin_width
window = signal.windows.gaussian(int(2*sigma*transformed_sigma), std=transformed_sigma)[None]

# instantaneous firing rate (acausal)
ifr = signal.convolve(rgc_mat, window,mode="same")/(transformed_sigma*np.sqrt(2*np.pi))
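
Why divide by `transformed_sigma*np.sqrt(2*np.pi)`? The Gaussian window has a peak of 1, so this factor rescales it to sum to roughly 1, which keeps the smoothed trace in the same units as the input. Below is a NumPy-only sketch of that reasoning on a made-up constant 20 Hz trace; it builds the window by hand rather than calling SciPy.

```python
import numpy as np

std_bins = 50.0                    # kernel standard deviation, in 1 ms bins
half_width = int(6 * std_bins)     # keep +/- 6 standard deviations

# Gaussian with peak 1, evaluated at integer bin offsets
offsets = np.arange(-half_width, half_width + 1)
window = np.exp(-0.5 * (offsets / std_bins) ** 2)
# Dividing by std * sqrt(2*pi) makes the kernel sum to ~1
kernel = window / (std_bins * np.sqrt(2 * np.pi))

# A made-up constant 20 Hz rate should stay ~20 Hz away from the edges
rate = np.full(2000, 20.0)
smoothed = np.convolve(rate, kernel, mode="same")
print(kernel.sum(), smoothed[1000])
```

Near the edges of the trace the kernel hangs off the end of the data, so values there are underestimated.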

In [None]:
def plot_rgc_trace(ax, trace, time=time,
                   light_on=1, light_off=1.5):
    ax.plot(time,trace)
    ylim = ax.get_ylim()
    rect1 = make_rect(0, light_on, ylim)
    rect2 = make_rect(light_off, 3.5 - light_off, ylim)
    ax.add_patch(rect1)
    ax.add_patch(rect2)
    ax.set_xlim(0,3.5)
    ax.set_xlabel("time (s)")
    ax.set_ylabel("firing rate (Hz)")

fig, axes = plt.subplots(5,2, figsize=(8,8))
plot_rgc_trace(axes[0,0], rgc_mat[200])
axes[0,0].set_title("Trial-average firing rate")
axes[0,1].set_title("Instantaneous firing rate")
plot_rgc_trace(axes[0,1], ifr[200])

for i,c in zip(range(1,5),[500,1500,2000,2500]):
    plot_rgc_trace(axes[i,0], rgc_mat[c])
    plot_rgc_trace(axes[i,1], ifr[c])

In [None]:
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
pca = PCA(n_components=5)
projected_data = pca.fit_transform(ifr)

In [None]:
plt.scatter(projected_data[:,0], projected_data[:,1])
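
**IYI**: a NumPy sketch of what a PCA projection computes, assuming mean-centered data and a singular value decomposition. This is not scikit-learn's implementation, and the random matrix below merely stands in for `ifr`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))      # made-up nSamples x nFeatures matrix

Xc = X - X.mean(axis=0)             # PCA operates on mean-centered features
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 5
# Coordinates of each sample along the top principal directions
projected = Xc @ Vt[:n_components].T
# Fraction of total variance captured by each kept component
explained = (S ** 2 / np.sum(S ** 2))[:n_components]
print(projected.shape, explained.sum())
```

The rows of `Vt` are the principal directions, ordered from most to least variance explained.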

In [None]:
tsne = TSNE(n_components=2)
# This is slow, so we only fit on every 10th cell for demonstration purposes
tsne_data = tsne.fit_transform(ifr[::10])

In [None]:
plt.scatter(tsne_data[:,0], tsne_data[:,1])

In [None]:
from sklearn.cluster import OPTICS
optics = OPTICS(xi=0.05,min_samples=25)
optics.fit(projected_data)

# one bin per cluster label; noise (label -1) is left out of the histogram
plt.hist(optics.labels_, bins=np.arange(optics.labels_.max()+2))
plt.title("Count by cluster label")
print(f"fraction unclustered: {sum(optics.labels_==-1)/len(projected_data):.3f}")
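
Per-label counts can also be computed directly with `np.unique`, which avoids having to think about histogram bin edges. The label array below is made up in the style of OPTICS output, where -1 marks unclustered (noise) points.

```python
import numpy as np

# Made-up labels: -1 means "noise", 0..2 are cluster ids
labels = np.array([-1, 0, 0, 1, 2, 2, 2, -1])

values, counts = np.unique(labels, return_counts=True)
for v, n in zip(values, counts):
    name = "noise" if v == -1 else f"cluster {v}"
    print(f"{name}: {n}")

# The mean of a boolean array is the fraction of True entries
fraction_unclustered = np.mean(labels == -1)
print(f"fraction unclustered: {fraction_unclustered:.3f}")
```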

In [None]:
from matplotlib import cm

In [None]:
num_clusters = optics.labels_.max()+1 # 0-index
unit_interval_class = optics.labels_ / num_clusters

In [None]:
colors = [cm.tab20(f) if f>=0 else cm.colors.to_rgba("gray")
          for f in unit_interval_class[::10]]
plt.scatter(tsne_data[:,0], tsne_data[:,1],
            color=colors)

In [None]:
rgcs_df

In [None]:
pd.melt?

In [None]:
rgcs_with_cluster = rgcs_df.copy()
rgcs_with_cluster["cluster"] = optics.labels_
# filter to include only clustered cells
rgcs_with_cluster = rgcs_with_cluster[rgcs_with_cluster.cluster!=-1]
tidy_data = pd.melt(rgcs_with_cluster, id_vars=['retina', 'id', 'ntrials', "cluster"],
        var_name="time_bin",
        value_name="spike_count",
        value_vars=list(map(str, np.arange(35))))
# 100ms time bins
tidy_data["time"] = pd.to_numeric(tidy_data.time_bin) / 10
tidy_data["firing_rate"] = tidy_data.spike_count / tidy_data.ntrials * 10
tidy_data.drop(columns=["spike_count", "ntrials", "time_bin"], inplace=True)
tidy_data

In [None]:
## Altair

In [None]:
alt.Chart(tidy_data).mark_line().encode(
    x = "time",
    y = "mean(firing_rate)"
)

In [None]:
# from hd
from hdbscan import HDBSCAN

In [None]:
pip install vega_datasets

In [None]:
import hdbscan

In [None]:
hdbscan
optics.labels_.max()

In [None]:
## archive

In [None]:
cols = "retina,id,i,j,unit_num,ntrials".split(",") + list(map(str,np.arange(3500)))

In [None]:
csv = pd.read_csv("/home/tyler/Dropbox/Science/manuscripts/2019_acuity_paper/acuity_paper/code/integrity_units_1ms.csv",
                 index_col=False,
                 names=cols)[1:]

In [None]:
csv = csv[np.logical_not(csv.retina.str.contains("BENAQ"))].drop_duplicates(['id'])

In [None]:
csv.to_csv("rgc_light_response.csv", index=False)

## Solutions

In [None]:
cars_df

In [None]:
line = alt.Chart(cars_df).mark_point().encode(
    x=alt.X('Horsepower', bin=True),
    y=alt.Y('Acceleration', bin=True),
    size="count()"
    
)
line