<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /></a><div align="center">This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.</div>

# Introduction to Pandas

[Pandas](https://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

## Preamble

In [32]:
import numpy as np
import pandas as pd

In [33]:
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns

## Load data

We are going to use public data from the [Wordbank project](http://wordbank.stanford.edu/) for this notebook.

We use Pandas' `read_csv()` function to read a CSV file:

In [57]:
data = pd.read_csv("data.csv", low_memory=False)

In [53]:
type(data)

pandas.core.frame.DataFrame

The return object from `read_csv()` is a `DataFrame` object, which displays nicely in the notebook as tabular data:

In [55]:
data[:5]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,,data_id,num_item_id,age,language,sex,birth_order,ethnicity,produces,word
1,0.0,51699,13,27,English,Female,Fourth,Hispanic,False,alligator
2,1.0,51699,14,27,English,Female,Fourth,Hispanic,True,animal
3,2.0,51699,15,27,English,Female,Fourth,Hispanic,False,ant
4,3.0,51699,16,27,English,Female,Fourth,Hispanic,True,bear
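To experiment with `read_csv()` without the Wordbank file at hand, here is a minimal sketch that parses a few rows (copied from the table above) out of an in-memory string. The use of `io.StringIO` and the names `csv_text` and `sample` are assumptions made for illustration; they are not part of the original notebook.

```python
import io

import pandas as pd

# A few rows copied from the table shown above, written as CSV text
csv_text = """data_id,num_item_id,age,language,sex,birth_order,ethnicity,produces,word
51699,13,27,English,Female,Fourth,Hispanic,False,alligator
51699,14,27,English,Female,Fourth,Hispanic,True,animal
51699,15,27,English,Female,Fourth,Hispanic,False,ant
"""

# read_csv() accepts any file-like object, not only a file name
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))
print(sample.shape)
print(sample['produces'].dtype)
```

Here the first line of the text is taken as the header row, and the `True`/`False` strings in column `produces` are parsed as booleans.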


In [None]:
data['sex']

Informally, a Pandas `DataFrame` is a table where each column is a `pandas.Series` object (which closely resembles a NumPy array).

Indeed, we can build a `DataFrame` from a set of NumPy arrays:

In [39]:
# 1. create NumPy arrays

x = np.linspace(-1, +1, 20)
sin_x = np.sin(x)
cos_x = np.cos(x)

# 2. create DataFrame with the given columns

trig = pd.DataFrame({'x': x, 'sin': sin_x, 'cos': cos_x})

# 3. show the first few rows

trig[:5]

Unnamed: 0,cos,sin,x
0,0.540302,-0.841471,-1.0
1,0.625724,-0.780044,-0.894737
2,0.704219,-0.709983,-0.789474
3,0.774918,-0.632061,-0.684211
4,0.837039,-0.547143,-0.578947
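The relationship between `DataFrame` columns, `Series` objects and NumPy arrays can be checked directly. The following is a small sketch that rebuilds the `trig` table from above; the names `col`, `values` and `identity` are illustrative only.

```python
import numpy as np
import pandas as pd

x = np.linspace(-1, +1, 20)
trig = pd.DataFrame({'x': x, 'sin': np.sin(x), 'cos': np.cos(x)})

# a single column is a Series...
col = trig['sin']
print(type(col))

# ...which can be converted to a plain NumPy array
values = col.to_numpy()
print(type(values))

# arithmetic on Series is element-wise, as with NumPy arrays
identity = trig['sin']**2 + trig['cos']**2
print(np.allclose(identity, 1.0))
```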


## Table manipulations

A `DataFrame` object has a `.shape` attribute, like a 2D NumPy array:

In [40]:
data.shape

(234350, 10)

The above shows that our table has 234350 rows and 10 columns.  (**Note:** the row index comes first; this matters when accessing data by numerical indices.)
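As a small illustration of the "row first" convention, the sketch below builds a toy table (the names `table` and `value` are made up for this example) and reads one cell by position with `.iloc`:

```python
import numpy as np
import pandas as pd

# toy table with 4 rows and 3 columns
table = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

n_rows, n_cols = table.shape
print(n_rows, n_cols)

# positional access: the row index comes first, then the column index
value = table.iloc[3, 0]
print(value)
```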

The column names can be retrieved from the `.columns` attribute:

In [45]:
data.columns

Index(['Unnamed: 0', 'data_id', 'num_item_id', 'age', 'language', 'sex',
       'birth_order', 'ethnicity', 'produces', 'word'],
      dtype='object')

### Removing columns

In order to do some data cleaning (and save memory), we drop the columns that are not used in the forthcoming analysis. A column may be deleted from the `DataFrame` with the statement:

        del data['name']
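A possible alternative to `del` is the `DataFrame.drop()` method, which returns a *new* table and leaves the original untouched. The sketch below shows both on a toy table; the column names are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'keep': [1, 2], 'unused_a': [3, 4], 'unused_b': [5, 6]})
before = set(toy.columns)

# `del` removes the column in place
del toy['unused_a']

# `.drop(columns=...)` returns a new DataFrame; `toy` is not modified
trimmed = toy.drop(columns=['unused_b'])

# which columns have disappeared overall?
print(before - set(trimmed.columns))
```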

In [None]:
# save set of column names
before = set(data.columns)

In [None]:
# remove the two columns we do not need
del data['Unnamed: 0']
del data['num_item_id']

In [86]:
data.head()

Unnamed: 0.1,Unnamed: 0,data_id,num_item_id,age,language,sex,birth_order,ethnicity,produces,word,newcolums
0,0,51699,13,27,English,0,Fourth,Hispanic,False,alligator,1.0
1,1,51699,14,27,English,0,Fourth,Hispanic,True,animal,1.42671
2,2,51699,15,27,English,0,Fourth,Hispanic,False,ant,1.853419
3,3,51699,16,27,English,0,Fourth,Hispanic,True,bear,2.280129
4,4,51699,17,27,English,0,Fourth,Hispanic,False,bee,2.706839


In [None]:
# is there any change in the column set?
after = set(data.columns)
print(before - after)

It is also easy to see that the two unwanted columns are now gone:

In [59]:
data[1:3]

Unnamed: 0.1,Unnamed: 0,data_id,num_item_id,age,language,sex,birth_order,ethnicity,produces,word
1,1,51699,14,27,English,Female,Fourth,Hispanic,True,animal
2,2,51699,15,27,English,Female,Fourth,Hispanic,False,ant


In [101]:
int(False)

0

### Selecting a subset of the columns

Columns may be selected by name using the standard `[]`-lookup:

In [63]:
# print the first 2 rows of columns 'produces' and 'sex'
data[['produces', 'sex']][:2]

Unnamed: 0,produces,sex
0,False,Female
1,True,Female


However, note that selecting only *one* column returns a `Series`, not a `DataFrame`:

In [64]:
type(data['produces'])

pandas.core.series.Series
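The difference between single and double square brackets can be checked on a toy table. This is only a sketch; the table `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'age': [16, 27],
                    'sex': ['Female', 'Male'],
                    'produces': [True, False]})

one = toy['age']            # single brackets, one name: a Series
also_one = toy[['age']]     # double brackets, one name: a 1-column DataFrame
two = toy[['age', 'sex']]   # double brackets, two names: a 2-column DataFrame

print(type(one).__name__, type(also_one).__name__, type(two).__name__)
```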

It is possible to select *multiple* columns at once, in which case the returned object is a `DataFrame` again.  One must however use `[[ ... ]]` (i.e., *double* the square brackets):

In [23]:
data2 = data[['age', 'language', 'sex']]

In [None]:
data2[:5]

It looks as if we've picked constant columns?  A `DataFrame`'s `.describe()` method provides a quick statistical summary of the data (by default, only for *numeric* columns):

In [None]:
data.describe()
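To see which columns `.describe()` covers, one can try it on a small mixed-type table. The sketch below uses invented values; the `include='all'` argument asks Pandas to summarize non-numeric columns as well.

```python
import pandas as pd

toy = pd.DataFrame({'age': [16, 27, 30],
                    'language': ['English', 'English', 'Spanish']})

# default: only the numeric column 'age' is summarized
numeric_summary = toy.describe()
print(list(numeric_summary.columns))

# include='all': string columns get count/unique/top/freq entries too
full_summary = toy.describe(include='all')
print(list(full_summary.columns))
```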

### Modifying data

A Pandas `Series` object is pretty similar to a NumPy array, in that arithmetic and logical operations are performed element-wise.

In particular, the result of a comparison like the following is a `Series` object with logical values:

In [None]:
data['sex'] == 'Female'

We can now pass such a boolean selector to a `Series`'s `[]` operator to select (or modify) the entries at the row indices where the selector is `True`:

In [76]:
data['sex'][data['sex'] == 0].count()

83979
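For *modifying* values, a commonly recommended form is `.loc[selector, column] = value`, since the chained form `data[column][selector] = value` can trigger warnings or fail to update the original table, depending on the Pandas version. Below is a minimal sketch on a toy table; the names and values are assumptions for illustration only.

```python
import pandas as pd

toy = pd.DataFrame({'word': ['ant', 'bear', 'bee', 'cat'],
                    'age': [16, 27, 16, 30]})

# boolean selector: True where the condition holds
mask = toy['age'] > 20

# True counts as 1, so summing the selector counts the matching rows
n_selected = int(mask.sum())
print(n_selected)

# modify column 'age' only at the selected rows
toy.loc[mask, 'age'] = 20
print(toy['age'].tolist())
```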

In [78]:
data['newcolums'] = np.linspace(1, 100000, len(data))

In [None]:
data.head()

### Exercise 1.

In table `data`, replace all occurrences of the string `'Male'` in column `'sex'` with the number `0`.

Can you compute how many female subjects were tested?

## Grouping data and aggregate computations

Let us tackle the following problem: *compute how many words are uttered by children of a given age.*

Since each subject has many table entries (one per word), we first need to aggregate rows based on subject (`data_id`) **and** age, then compute the number of words per subject.  At that point, we can further aggregate on age alone and sum.

The `.groupby()` method of a `DataFrame` creates an intermediate object that is *like* a table with aggregate rows:

In [82]:
data.columns

Index(['Unnamed: 0', 'data_id', 'num_item_id', 'age', 'language', 'sex',
       'birth_order', 'ethnicity', 'produces', 'word', 'newcolums'],
      dtype='object')

In [None]:
dg = data.groupby(['age', 'data_id'])
dg.indices

In [91]:
dg[:3]  # this is expected to fail

TypeError: unhashable type: 'slice'

The methods most commonly called on a "groupby" object are those that apply a summarization function to the row groups.  Any function that can operate on `Series` objects or NumPy arrays can be used for aggregation.  Here we pass `.agg()` a dictionary mapping each column name to the function to apply to that column; columns that are not named in that dictionary are discarded.

In [97]:
dg.agg({'produces': np.sum})[:8]

Unnamed: 0_level_0,Unnamed: 1_level_0,produces
age,data_id,Unnamed: 2_level_1
16,51716,2.0
16,51733,3.0
16,51740,3.0
16,51834,8.0
16,51835,1.0
16,51844,2.0
16,51847,0.0
16,51940,5.0
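The effect of the dictionary passed to `.agg()` can be seen on a toy table. In the sketch below (invented data), each named column gets its own function and the unnamed column `word` is dropped from the result. The string names `'sum'` and `'max'` are used here as an alternative to passing functions such as `np.sum`.

```python
import pandas as pd

toy = pd.DataFrame({
    'data_id': [1, 1, 2, 2],
    'age': [16, 16, 17, 17],
    'word': ['ant', 'bear', 'ant', 'bear'],
    'produces': [True, False, True, True],
})

# one aggregation function per column; 'word' is not named, hence discarded
summary = toy.groupby('data_id').agg({'produces': 'sum', 'age': 'max'})

print(list(summary.columns))
print(summary['produces'].tolist())
```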


In [None]:
dg.indices

In [None]:
data2 = dg.agg({'produces': np.sum})
data2.index

In [107]:
data2[:9]

Unnamed: 0_level_0,Unnamed: 1_level_0,produces
age,data_id,Unnamed: 2_level_1
16,51716,2.0
16,51733,3.0
16,51740,3.0
16,51834,8.0
16,51835,1.0
16,51844,2.0
16,51847,0.0
16,51940,5.0
16,51956,5.0


Note anything strange in the table output above?

The new table has a *composite* index `[age, data_id]`; in order to perform further aggregation on `age` alone, we must *reset* the indices using method `.reset_index()`.  Note that `.reset_index()` returns a *new* `DataFrame` and does not modify the one it's called on in-place.

In [98]:
data3 = data2.reset_index()

In [99]:
data3[:3]

Unnamed: 0,age,data_id,produces
0,16,51716,2.0
1,16,51733,3.0
2,16,51740,3.0
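The composite index and the effect of `.reset_index()` can be inspected on a toy table. This is a sketch with invented data; the names `per_subject` and `flat` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [16, 16, 16, 17],
    'data_id': [1, 1, 2, 3],
    'produces': [True, False, True, True],
})

per_subject = toy.groupby(['age', 'data_id']).agg({'produces': 'sum'})

# the grouping keys have become a two-level index, not columns
print(list(per_subject.index.names))
print(list(per_subject.columns))

# reset_index() returns a new table where the keys are ordinary columns again
flat = per_subject.reset_index()
print(list(flat.columns))

# the table it was called on is unchanged
print(list(per_subject.columns))
```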


### Exercise 2.

Define a `DataFrame` object `data4` by aggregating over age and summing over the `'produces'` column.

In [None]:
# aggregate

# sum over 'produces' column

# show


Again we must reset the index; this time we shall do it in-place (= modify `data4`):

In [31]:
data4.reset_index(inplace=True)

In [None]:
data.groupby('word')['produces'].sum()

Seaborn provides an easy-to-use `barplot()` function:

In [None]:
# Initialize the matplotlib figure
fig, ax = plt.subplots(1, figsize=(10, 7))

# Plot the number of produced words per age
sns.barplot(x="age", y="produces", data=data4, label="Nr. of distinct animal names")

# Add a legend and informative axis label
ax.legend(ncol=2, loc="lower right", frameon=True)
ax.set(xlim=(0, 24), ylabel="", xlabel="Produced animal names per age group")
sns.despine(left=True, bottom=True)

### Exercise 4.

What's wrong with the above data?  How can you modify the procedure to fix it?

## A bit of statistics

One prerequisite for drawing sensible insights from this data is that it is, e.g., age-matched between male and female subjects.  To check this, we can compute the distribution of ages of male and female subjects, and compare them using a $t$-test.

In [None]:
dg = data.groupby(['data_id', 'sex'])

In [None]:
data5 = dg.agg({'age': np.mean})

In [None]:
data5[:3]

In [None]:
data5.reset_index(inplace=True)

In [None]:
data5['age'].describe()

In [None]:
data5['sex'].describe()

In [None]:
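# NOTE: this assumes column 'sex' still holds the original string labels;
# if you recoded it in Exercise 1, compare against your numeric codes instead
females = (data5['sex'] == 'Female')
males = (data5['sex'] == 'Male')
ages = data5['age']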

In [None]:
ages_f = ages[females]

ages_f.describe()

In [None]:
ages_m = ages[males]

ages_m.describe()

For simple statistical tests we can use the [`scipy.stats`](http://docs.scipy.org/doc/scipy/reference/stats.html#module-scipy.stats) module of [`scipy`](http://docs.scipy.org/doc/):

In [None]:
from scipy import stats

The function to perform a 2-sample $t$-test is [`scipy.stats.ttest_ind()`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html#scipy.stats.ttest_ind):

In [None]:
stats.ttest_ind(ages_m, ages_f)
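For intuition about what the test computes, the sketch below evaluates the classical pooled-variance two-sample $t$ statistic by hand with NumPy. This is an illustration under the assumption of equal variances (the default setting of `ttest_ind()`), not a replacement for the SciPy function: it does not compute a $p$-value. The helper name `pooled_t_statistic` and the sample values are made up for this example.

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Two-sample t statistic, assuming equal variances in both groups."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    # unbiased sample variances (ddof=1)
    var_a = a.var(ddof=1)
    var_b = b.var(ddof=1)
    # pooled variance: weighted average of the two sample variances
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1.0 / na + 1.0 / nb))

# invented toy samples, for illustration only
t = pooled_t_statistic([20, 22, 24], [21, 23, 25])
print(t)
```

A statistic close to zero, as for two groups with nearly equal means, gives no evidence of a difference in mean age.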

Strictly speaking, the $t$-test should only be applied to (approximately) normally-distributed data, so at least a visual check won't do harm!  Seaborn provides a very convenient function for it:

In [None]:
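# `distplot` is deprecated in recent Seaborn releases; `histplot` with `kde=True` draws a comparable plot
sns.histplot(ages_m, kde=True)

In [None]:
sns.histplot(ages_f, kde=True)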