In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# What is pandas?

> pandas is a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

In [None]:
import pandas as pd

In [None]:
df = sns.load_dataset("iris")
print(df.columns)
display(df)

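We can access specific columns in two ways:
* `df['column_name']`: safe, and can be used with a variable that holds the column name
* `df.column_name`: convenient, but it only works for names that are valid Python identifiers and that do not clash with an existing DataFrame attribute or method (such as `size`)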

In [None]:
print(df.species)
print(df['species'])
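To make the difference between the two access styles concrete, here is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration, not part of the iris data). It shows bracket access with a variable, and a column name that clashes with the `DataFrame.size` attribute.

```python
import pandas as pd

# Hypothetical toy data; the column "size" clashes with DataFrame.size
toy = pd.DataFrame({"size": [2, 3, 4], "tip": [1.0, 2.5, 3.0]})

# Bracket access works with a variable holding the column name
column = "tip"
print(toy[column].tolist())

# Bracket access returns the column ...
print(toy["size"].tolist())

# ... while attribute access returns the number of cells in the frame
print(toy.size)
```

This is why the bracket form is the safer habit whenever column names come from data you do not control.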

Typical statistical functions work on dataframes:

In [None]:
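# numeric_only=True skips the text column 'species'; recent pandas versions raise an error without it
print(df.mean(numeric_only=True))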
print(df.max())

Or, the data can be summarized with `describe`:

In [None]:
df.describe()

## Grouping

Data can be grouped based on columns:

In [None]:
df = sns.load_dataset("fmri")
print(df.columns)
display(df.describe())

Grouping is done based on one or more columns and a function that is applied to each group (this function can be customized).

In [None]:
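# numeric_only=True skips the text columns (subject, event, region)
display(df.groupby(['timepoint']).mean(numeric_only=True))
display(df.groupby(['timepoint']).std(numeric_only=True))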
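The remark that the applied function can be customized could look like the sketch below. It uses a small made-up table rather than the fmri data, and `value_range` is a hypothetical helper written for this example. `agg` accepts both built-in names such as `'mean'` and your own functions.

```python
import pandas as pd

# Hypothetical toy data standing in for a real data set
toy = pd.DataFrame({
    "timepoint": [0, 0, 1, 1],
    "signal": [0.1, 0.3, 0.2, 0.8],
})

def value_range(values):
    """Custom aggregation: difference between the largest and smallest value."""
    return values.max() - values.min()

# Apply a built-in and a custom function to each group in one call
summary = toy.groupby("timepoint")["signal"].agg(["mean", value_range])
print(summary)
```

The resulting columns are named after the functions, so the custom one shows up as `value_range`.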

Grouping can be done for multiple columns:

In [None]:
display(df.groupby(['timepoint','region']).mean(numeric_only=True))

## Adding data

In [None]:
df = sns.load_dataset("tips")
print(df.columns)

In [None]:
df['fraction_tip'] = df.tip/df.total_bill
print(df.columns)
print(df.fraction_tip)

## Finding data

By index:

In [None]:
df[df.index==0]

By condition:

In [None]:
df[df.tip>7]

In [None]:
df[(df['size']==2)&(df.sex=='Female')&(df.tip<1.5)]
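A condition such as the one above is just a boolean Series, so it can be stored, combined, and inverted. The sketch below illustrates this on a small made-up table (the `toy` frame is an assumption for illustration). Note that each comparison needs its own parentheses, because `&` binds more tightly than `==` and `<`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tips data set
toy = pd.DataFrame({
    "size": [2, 2, 4, 2],
    "sex": ["Female", "Male", "Female", "Female"],
    "tip": [1.0, 1.2, 3.0, 2.0],
})

# Build the mask once; & means "and", | means "or", ~ means "not"
mask = (toy["size"] == 2) & (toy["sex"] == "Female") & (toy["tip"] < 1.5)

selected = toy[mask]       # rows that satisfy all three conditions
not_selected = toy[~mask]  # all remaining rows
print(selected)
print(not_selected)
```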

## EXERCISE

In [None]:
df = sns.load_dataset("tips")
print(df.columns)

Explore this data set:
* Get the mean bill organized by sex, day of the week and the number of guests.
* Get the mean fractional tip for women organized by smoker status and time of day.
* Get the row with the maximum tip-to-bill ratio.

# Plotting with pandas

There are several methods for plotting data from pandas dataframes:

* Extract data and use matplotlib: most work, but most powerful
* Use pandas plotting functions: quite limited
* Use seaborn: best option in most cases
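As a rough comparison of the three routes, the sketch below draws the same scatter plot three times on a small made-up table (the `toy` frame and its column names are assumptions for illustration). Each plot is placed in its own axes so the results can be compared side by side.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 1.0, 4.0, 3.0]})

fig, axes = plt.subplots(1, 3, figsize=(10, 3))

# 1. Extract the columns and call matplotlib directly
axes[0].scatter(toy["x"], toy["y"])
axes[0].set(title="matplotlib", xlabel="x", ylabel="y")

# 2. Use the plotting functions attached to the DataFrame
toy.plot.scatter(x="x", y="y", ax=axes[1])
axes[1].set(title="pandas")

# 3. Hand the DataFrame and the column names to seaborn
sns.scatterplot(data=toy, x="x", y="y", ax=axes[2])
axes[2].set(title="seaborn")

fig.tight_layout()
```

With matplotlib the axis labels have to be set by hand, while pandas and seaborn take them from the column names.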

# Plotting with pandas and seaborn

Typical function call: `sns.plot_type(data=df, x=..., y=..., **kwargs)`, where `x` and `y` are column names of `df`.

## Scatter plot

In [None]:
df = sns.load_dataset("titanic")
print(df.columns)

In [None]:
sns.scatterplot(data=df,x='fare',y='age',hue='sex',style='survived')

## EXERCISE

Play around with the columns used for `x`, `y`, `hue`, and `style`, and evaluate whether a scatterplot is always the best plot.

## Stripplot

In [None]:
sns.stripplot(data=df,y='fare',x='survived',hue='sex',jitter=False)

In [None]:
sns.stripplot(data=df,y='fare',x='survived',hue='sex')

In [None]:
sns.stripplot(data=df,y='fare',x='survived',hue='sex',dodge=True)

## EXERCISE

Look at the plots from the previous exercise where scatterplots were not practical. See if recreating them with a stripplot helps.

## Visualizing the distribution

There is a lot of overlap in the stripplot, and this gets worse with more data. Using a swarmplot improves this slightly, but swarmplots become very slow with many data points.

In [None]:
sns.swarmplot(data=df,y='fare',x='survived',hue='sex',dodge=True)

A bar plot may be more appropriate. By default, the error bars depict the 95% confidence interval.

In [None]:
sns.barplot(data=df,y='fare',x='survived',hue='sex',dodge=True)

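You can specify the size of the confidence interval, use the standard deviation (`'sd'`), or choose not to plot error bars (`None`). Note that in seaborn 0.12 and later the `ci` argument is deprecated in favor of `errorbar` (for example `errorbar='sd'`).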

In [None]:
sns.barplot(data=df,y='fare',x='survived',hue='sex',dodge=True,ci='sd')
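To see which numbers such a bar plot summarizes, you can compute them directly with pandas. The sketch below uses a small made-up table (the `toy` frame is an assumption, not the titanic data): the bar height corresponds to the group mean, and an `'sd'` error bar spans one standard deviation around it. pandas' `std` is the sample standard deviation; the exact convention seaborn uses may differ between versions, so treat the numbers as an approximation of what is drawn.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the titanic data set
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "fare": [7.0, 8.0, 9.0, 30.0, 50.0, 70.0],
})

# Mean (bar height) and standard deviation (error bar half-length) per group
summary = toy.groupby("survived")["fare"].agg(["mean", "std"])
print(summary)
```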

## EXERCISE

Bar charts hide most of the distribution; a *boxplot* or a *violinplot* may be a better option. Try `sns.boxplot` and `sns.violinplot`.

## Customization

What I like so much about seaborn is that it adds great functionality, while you still have all the customization options of matplotlib.

### Within seaborn functionality

Seaborn's functions have their own arguments and also accept arguments that are passed on to matplotlib. The syntax for these is the same as in matplotlib.

In [None]:
sns.stripplot(data=df,y='fare',x='survived',hue='class',dodge=True)

`size`, `edgecolor`, and `linewidth` are arguments of `stripplot`, and `marker` is a matplotlib argument that is recognized by `stripplot`.

In [None]:
sns.stripplot(data=df,y='fare',x='survived',hue='class',dodge=True,size=8,edgecolor='k',linewidth=.4,marker='v')

In [None]:
class_order = ['Third','Second','First']
sns.stripplot(data=df,y='fare',x='survived',hue='class',dodge=True,hue_order=class_order)

In [None]:
class_colors = {'First':'b','Second':'g','Third':'r'}
sns.stripplot(data=df,y='fare',x='survived',hue='class',dodge=True,palette=class_colors)

### EXERCISE

Make a boxplot with:
* blue for first class, green for second class, and red for third class
* from left to right: third, second, and first class
* make the outliers bigger
* increase the line width of the boxplot

## Mixing with matplotlib

A seaborn plot can be placed explicitly in a specific axes object:

In [None]:
ax = plt.gca()
sns.stripplot(data=df,y='fare',x='survived',hue='class',dodge=True,ax=ax)
ax.set(title='Titanic')
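The same `ax` argument also works with axes created by `plt.subplots`, which is how several seaborn plots end up in one figure. Below is a minimal sketch on a small made-up table (the `toy` frame and its column names are assumptions for illustration).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "value": [1.0, 2.0, 2.5, 3.5, 4.0, 5.0],
    "other": [0.5, 1.5, 1.0, 2.0, 3.0, 2.5],
})

# One figure with two axes next to each other
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Each seaborn call draws into the axes it is given
sns.scatterplot(data=toy, x="other", y="value", hue="group", ax=ax_left)
sns.stripplot(data=toy, x="group", y="value", ax=ax_right)

# From here on, the usual matplotlib methods apply
ax_left.set(title="scatterplot")
ax_right.set(title="stripplot")
fig.tight_layout()
```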

### EXERCISE

Combine what you've learned so far to create a subplot with:
* stripplot
* barplot
* boxplot

You can choose whatever you like to put in those plots.

## Line plots

In [None]:
df = sns.load_dataset("fmri")
print(df.columns)
sns.lineplot(data=df,x='timepoint',y='signal',ci='sd')

Of course, we want to separate the data based on event and region:

In [None]:
sns.lineplot(data=df,x='timepoint',y='signal',ci='sd',hue='region',style='event')

Line style can be set in multiple ways:
* list of matplotlib styles, e.g.: `['-','--',':']`
* dictionary using dashes: `{'stim':[1,1],'cue':[4,1]}`

In [None]:
sns.lineplot(data=df,x='timepoint',y='signal',ci='sd',hue='region',style='event',dashes={'stim':[1,1],'cue':[4,1]})

### EXERCISE

Set the colors such that the line for the parietal region is black and the line for the frontal region is blue.

### EXERCISE

I don't like this plot: the lines cross too much. Therefore, I'd like you to separate it into two plots, one with the mean signal for the parietal region and one for the frontal region, both with lines for stim and cue. You can achieve this by following these steps:
1. select the rows of the data frame for the parietal or frontal region
2. create a line plot for each selected subset
3. use `subplot` to create one figure with two plots