<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /></a><div align="center">This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</div>

# Introduction to Pandas

[Pandas](https://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Preamble

In [None]:
import numpy as np
import pandas as pd

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

## Load data

We are going to use public data from the [Wordbank project](http://wordbank.stanford.edu/) for this notebook.

We use Pandas' `read_csv()` function to read a CSV file:

In [None]:
data = pd.read_csv("data.csv.gz")

In [None]:
type(data)

The return object from `read_csv()` is a `DataFrame` object, which displays nicely in the notebook as tabular data:

In [None]:
data[:5]  # display the first 5 rows

In [None]:
data['word']  # get the contents of the `word` column

Informally, a Pandas `DataFrame` is a table where each column is a `pandas.Series` object (which closely resembles a NumPy array).

Indeed, we can build a `DataFrame` from a set of NumPy arrays:

In [None]:
# 1. create NumPy arrays

x = np.linspace(-1, +1, 20)
sin_x = np.sin(x)
cos_x = np.cos(x)

# 2. create DataFrame with the given columns

trig = pd.DataFrame({'x': x, 'sin': sin_x, 'cos': cos_x})

# 3. show a few sample rows

trig[:5]  # alternatively: `trig.head()`
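To make the relation between columns, `Series` and NumPy arrays concrete, here is a minimal sketch. The table `toy` and its values are made up for illustration; `.to_numpy()` is the `Series` method that returns the values as a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, built the same way as `trig` above
x = np.linspace(-1, +1, 5)
toy = pd.DataFrame({'x': x, 'sin': np.sin(x)})

col = toy['sin']        # a single column is a pandas Series
arr = col.to_numpy()    # the same values as a plain NumPy array

print(type(col).__name__)   # Series
print(type(arr).__name__)   # ndarray
```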

## Table manipulations

A `DataFrame` object has a `.shape` attribute, like a 2D NumPy array:

In [None]:
data.shape

The above shows that our table has 234350 rows across 10 columns.  

**Note: *row index comes first!*** (This will be important when accessing data with numerical indices below.)
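As a small sketch of what "row index comes first" means in practice, consider the made-up table below (`toy` and its contents are illustrative only). Both `.shape` and positional access with `.iloc` put the row before the column.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 table filled with the numbers 0..11
toy = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

n_rows, n_cols = toy.shape   # rows first, then columns
value = toy.iloc[2, 1]       # row 2, column 1 (both counted from 0)

print(n_rows, n_cols)        # 4 3
print(value)                 # 7
```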

Column names can be retrieved from the `.columns` attribute:

In [None]:
data.columns

Although `data.columns` is an `Index` object, it works like a normal Python list for many intents and purposes. For instance, we can loop over the column names with a `for` loop, or we can retrieve the name of the 4th column with this code:

In [None]:
data.columns[3]
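The list-like behaviour of an `Index` can be sketched as follows. The table `toy` is a made-up stand-in for `data`. One difference from a list is that an `Index` cannot be modified item by item.

```python
import pandas as pd

# Hypothetical toy table standing in for `data`
toy = pd.DataFrame({'word': ['dog', 'cat'], 'age': [16, 18], 'produces': [1, 0]})

# loop over column names, as with a list
for position, name in enumerate(toy.columns):
    print(position, name)

names = list(toy.columns)         # convert to a real Python list
has_age = 'age' in toy.columns    # membership test works too

# unlike a list, an Index is immutable: item assignment raises TypeError
try:
    toy.columns[0] = 'x'
    is_immutable = False
except TypeError:
    is_immutable = True
```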

### Removing columns

In order to do some data cleaning (and save memory), we drop the columns that are not used in the forthcoming analysis. A column may be deleted from the `DataFrame` with the statement:

        del data['name']

In [None]:
# save set of column names
before = set(data.columns)

In [None]:
# remove the two columns that we do not need
del data['Unnamed: 0']
del data['num_item_id']

In [None]:
# is there any change in the column set?
after = set(data.columns)
print(before - after)

It is also easy to see that the two unwanted columns are now gone:

In [None]:
data.head()  # Pandas convenience for `data[0:5]`
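As an alternative to `del`, Pandas also offers the `DataFrame.drop()` method, which returns a new table and leaves the original untouched. A minimal sketch on a made-up table (`toy` and `trimmed` are illustrative names):

```python
import pandas as pd

# Hypothetical toy table with one unwanted column
toy = pd.DataFrame({'Unnamed: 0': [0, 1], 'word': ['dog', 'cat'], 'age': [16, 18]})

# `drop` returns a *new* DataFrame without the listed columns
trimmed = toy.drop(columns=['Unnamed: 0'])

print(list(trimmed.columns))   # ['word', 'age']
print(list(toy.columns))       # ['Unnamed: 0', 'word', 'age']
```

Which of the two to use is mostly a matter of taste: `del` modifies the table in place, while `drop` keeps the original around.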

### Selecting a subset of the columns

Columns may be selected by name using the standard `[]`-lookup:

In [None]:
# print the first 5 rows of column 'produces'
data['produces'].head()

However, note that selecting *one* column only returns a `Series`, not a `DataFrame`:

In [None]:
type(data['produces'])

It is possible to select several columns and **copy** them into a new `DataFrame`; one must, however, use `[[ ... ]]` (i.e., *double* the square brackets):

In [None]:
data2 = data[['age', 'language', 'sex']]

In [None]:
data2[:5]

It looks as if we've picked constant columns. To check, a `DataFrame`'s `.describe()` method provides a quick statistical summary of the data (by default, only for *numeric* columns):

In [None]:
data.describe()
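To also summarize non-numeric columns, `.describe()` accepts an `include` argument. The sketch below uses a made-up table (`toy` and its values are illustrative only). With `include='all'`, text columns get `count`, `unique`, `top` and `freq` entries.

```python
import pandas as pd

# Hypothetical toy table with one numeric and one text column
toy = pd.DataFrame({'age': [16, 18, 18, 24],
                    'sex': ['Female', 'Male', 'Female', 'Female']})

numeric_summary = toy.describe()             # default: numeric columns only
full_summary = toy.describe(include='all')   # every column

print(list(numeric_summary.columns))   # ['age']
print(full_summary.loc['top', 'sex'])  # Female
```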

The difference between `[...]` and `[[...]]` is that the former:

1. only allows selecting *one* column,
2. returns a `Series`, not a `DataFrame`,
3. **does not make a copy of the data!**
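The first two points can be checked directly. Here is a minimal sketch on a made-up table (the name `toy` and its values are illustrative):

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'age': [16, 18],
                    'language': ['English', 'Italian'],
                    'sex': ['Female', 'Male']})

one = toy['age']            # single brackets: a Series
one_df = toy[['age']]       # double brackets: a DataFrame with one column
two = toy[['age', 'sex']]   # double brackets: several columns at once

print(type(one).__name__, one.shape)        # Series (2,)
print(type(one_df).__name__, one_df.shape)  # DataFrame (2, 1)
print(type(two).__name__, two.shape)        # DataFrame (2, 2)
```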

### Modifying data

A Pandas `Series` object is pretty similar to a NumPy array, in that arithmetic and logical operations are performed element-wise.

In particular, the result of a comparison like the following is a `Series` object with logical values:

In [None]:
data['sex'] == 'Female'

We can now pass such a selector array to a `DataFrame`'s `[]` operator to select the rows at the indices where the selector is `True`:

In [None]:
# select items for which this expression is true
select = (data['sex'] == 'Female')

# copy matching items into new DataFrame
data_only_females = data[select]

# show value sample
data_only_females.head()

Selectors can also be used to *modify* a `DataFrame` *in-place*, but the syntax is a bit different: one must assign the new value to the `.loc` attribute of the `DataFrame`, indexed with *two* selectors. The first one is for rows (it could also be a row range *lo:hi*), the second one is for columns (it could be a column name).

In [None]:
# change `'Female'` to `1` in the `'sex'` column
data.loc[select, 'sex'] = 1

# show changes in the 'sex' column
data.head()
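The interplay between selection and in-place modification can be sketched on a made-up table (`toy`, `mask` and `subset` are illustrative names). Selecting rows with a boolean selector copies them, so a later `.loc` assignment on the original table does not change the subset taken earlier.

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'age': [16, 18, 24], 'produces': [1, 1, 1]})

mask = (toy['age'] >= 18)   # boolean Series: False, True, True
subset = toy[mask]          # copy of the matching rows

# in-place modification: rows chosen by `mask`, column 'produces'
toy.loc[mask, 'produces'] = 0

print(toy['produces'].tolist())     # [1, 0, 0]
print(subset['produces'].tolist())  # [1, 1]
```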

In [None]:
data_only_females.count()

#### Exercise 8.A

Count the number of rows where column `'sex'` has the value `'Male'`.

In [None]:
# your code here

#### Exercise 8.B 

In table `data`, replace all occurrences of the string `'Male'` in column `'sex'` with the number `0`.

Can you compute how many female subjects were tested?

New columns can be added by simply assigning data to them:

In [None]:
data['new_useless_column'] = np.linspace(1,100000,len(data))

In [None]:
data.head()

Let's get rid of the new useless column, to have clean data for the following.

In [None]:
del data['new_useless_column']

## Grouping data and aggregate computations

Let us tackle the following problem: *compute how many words are uttered by children of a given age.*

Since each subject has many table entries (one per uttered word), we need to:

* first, group rows by subject (`data_id`) **and** age;
* then, compute the number of words per subject;
* finally, aggregate further on age alone and sum.

The `.groupby()` method of a `DataFrame` creates an intermediate object that is *like* a table with aggregate rows:

In [None]:
data.columns

In [None]:
dg = data.groupby(['age', 'data_id'])

In [None]:
dg[:3]  # this is expected to fail
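Slicing fails because a groupby object is not itself a table. As a minimal sketch on a made-up table (all values below are invented for illustration), we can still ask how many groups there are and how many rows each one holds:

```python
import pandas as pd

# Hypothetical toy table: one row per (subject, word) pair
toy = pd.DataFrame({'age':      [16, 16, 18, 18, 18],
                    'data_id':  [1,  1,  2,  2,  3],
                    'produces': [1,  0,  1,  1,  0]})

groups = toy.groupby(['age', 'data_id'])

n_groups = groups.ngroups   # number of distinct (age, data_id) pairs
sizes = groups.size()       # number of rows in each group

print(n_groups)             # 3
print(sizes.loc[(18, 2)])   # 2
```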

A "groupby" object cannot be sliced like a table; to get data out of it, we typically call a method that applies a summarization function to each row group. Any function that can operate on `Series` objects or NumPy arrays can be used for aggregation, as can the name of a built-in aggregation such as `'sum'`. Here we pass `.agg()` a dictionary mapping column names to the function to apply to that column; columns that are not named in that dictionary are discarded.

In [None]:
data2 = dg.agg({'produces': 'sum'})  # sum the 'produces' column within each group

# show it
data2.head()  # or ...[0:5]

Note anything strange in the table output above?

The new table has a *composite* index `[age, data_id]`; in order to perform further aggregation on `age` alone, we must *reset* the indices using the method `.reset_index()`. Note that `.reset_index()` returns a *new* `DataFrame` and does not modify the one it's called on in-place.

In [None]:
data3 = data2.reset_index()

In [None]:
data3.head()
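The effect of aggregating and then resetting the index can be seen on a small made-up table (`toy`, `summed` and `flat` are illustrative names, and the values are invented):

```python
import pandas as pd

# Hypothetical toy table: one row per (subject, word) pair
toy = pd.DataFrame({'age':      [16, 16, 18, 18, 18],
                    'data_id':  [1,  1,  2,  2,  3],
                    'produces': [1,  0,  1,  1,  0]})

# aggregate: the grouping keys become a composite (multi-level) index
summed = toy.groupby(['age', 'data_id']).agg({'produces': 'sum'})
print(list(summed.index.names))   # ['age', 'data_id']

# reset_index() turns the index levels back into ordinary columns
flat = summed.reset_index()
print(list(flat.columns))         # ['age', 'data_id', 'produces']
print(flat['produces'].tolist())  # [1, 2, 0]
```

After the reset, `age` is a regular column again and can be used as a grouping key in a second `.groupby()` call.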

#### Exercise 8.C

Define a `DataFrame` object `data4` by aggregating over age and summing over the `'produces'` column.

In [None]:
# 1. aggregate

# 2. sum over 'produces' column

# 3. show


Again we must reset the index; this time we shall do it in-place (= modify `data4`):

In [None]:
data4.reset_index(inplace=True)

In [None]:
data.groupby('word')['produces'].sum()

#### Exercise 8.D

Make a bar plot of the number of words produced by age.

(*Hint:* Seaborn provides an easy-to-use `.barplot` function)

In [None]:
# Initialize the matplotlib figure
fig, ax = plt.subplots(1, figsize=(10, 7))

# Plot the number of produced words per age
sns.barplot(x="age", y="produces", data=data4, label="Nr. of distinct animal names")

# Add a legend and informative axis label
#ax.legend(ncol=2, loc="lower right", frameon=True)
#ax.set(xlim=(0, 24), ylabel="", xlabel="Produced animal names per age group")
#sns.despine(left=True, bottom=True)

#### Exercise 8.E

What's wrong with the above data?  How can you modify the procedure to fix it?

## A bit of statistics

One prerequisite for drawing sensible conclusions from this data is that the groups being compared are matched, e.g., that male and female subjects have similar ages. To check this, we can compute the distribution of ages of male and female subjects and compare them using a $t$-test.

In [None]:
dg = data.groupby(['data_id', 'sex'])

In [None]:
data5 = dg.agg({'age': 'mean'})  # average age per (subject, sex) group

In [None]:
data5[:3]

In [None]:
data5.reset_index(inplace=True)

In [None]:
data5['age'].describe()

In [None]:
data5['sex'].describe()

In [None]:
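# 'Female' was recoded to 1 above; 'Male' is still a string unless you
# recoded it to 0 in Exercise 8.B (in that case, compare with `== 0`)
females = (data5['sex'] == 1)
males = (data5['sex'] == 'Male')
ages = data5['age']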

In [None]:
ages_f = ages[females]

ages_f.describe()

In [None]:
ages_m = ages[males]

ages_m.describe()

For simple statistical tests we can use the [`scipy.stats`](http://docs.scipy.org/doc/scipy/reference/stats.html#module-scipy.stats) module of [`scipy`](http://docs.scipy.org/doc/):

In [None]:
from scipy import stats

The function to perform a 2-sample $t$-test is [`scipy.stats.ttest_ind()`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind):

In [None]:
stats.ttest_ind(ages_m, ages_f)
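To get a feeling for what `ttest_ind()` returns, here is a minimal sketch on synthetic data. The two samples are random numbers drawn from the same normal distribution, not Wordbank data, and the names `group_a` and `group_b` are illustrative. The `equal_var=False` variant (Welch's test) does not assume equal variances in the two groups.

```python
import numpy as np
from scipy import stats

# Synthetic samples, both drawn from the same normal distribution
rng = np.random.default_rng(0)
group_a = rng.normal(loc=20.0, scale=4.0, size=200)
group_b = rng.normal(loc=20.0, scale=4.0, size=200)

# standard two-sample t-test (assumes equal variances)
result = stats.ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)

# Welch's t-test: no equal-variance assumption
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
print(welch.statistic, welch.pvalue)
```

A large p-value means the test found no evidence that the two group means differ; it does not prove that they are equal.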

Strictly speaking, the $t$-test should only be applied to (approximately) normally distributed data, so at least a visual check won't do any harm! Seaborn provides a very convenient function for it:

In [None]:
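sns.histplot(ages_m, kde=True)  # histogram with a kernel density estimate

In [None]:
sns.histplot(ages_f, kde=True)  # histogram with a kernel density estimate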
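A visual check can be complemented with a numerical one. The sketch below uses synthetic data (random normal numbers, not the Wordbank ages) with two `scipy.stats` functions: `shapiro()` runs the Shapiro-Wilk normality test, and `probplot()` computes the points of a normal Q-Q plot together with the correlation `r` of the fitted line. Whether these are the right checks for a given dataset is a judgment call.

```python
import numpy as np
from scipy import stats

# Synthetic sample drawn from a normal distribution
rng = np.random.default_rng(0)
sample = rng.normal(loc=20.0, scale=4.0, size=200)

# Shapiro-Wilk test: a small p-value suggests non-normal data
stat, pvalue = stats.shapiro(sample)
print(stat, pvalue)

# Q-Q plot data: theoretical quantiles vs. ordered sample values;
# `r` close to 1 means the points lie close to a straight line
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist='norm')
print(r)
```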