# Resources:

["Working with Text Data"](http://pandas.pydata.org/pandas-docs/stable/text.html)

# Pivot Tables

The ``GroupBy`` operation lets us aggregate rows into representative subsets.

The pivot table takes column-oriented data as input and groups the entries into a two-dimensional summary table.

Pivot tables are really popular with Excel users. I promise they're still useful despite this fact.

The difference between a pivot table and a ``GroupBy`` is that a pivot table splits the data along both rows and columns, whereas a ``GroupBy`` splits it along rows only.

We'll use the dataset of passengers on the *Titanic*, a classic machine learning dataset.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
tt = sns.load_dataset('titanic')
tt.head()

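|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|----------|--------|-----|-----|-------|-------|------|----------|-------|-----|------------|------|-------------|-------|-------|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |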


### Grouping by hand

In [2]:
tt.groupby('sex')[['survived']].mean()

| sex    | survived |
|--------|----------|
| female | 0.742038 |
| male   | 0.188908 |


About 74% of the women on board survived, compared with only about 19% of the men.

This is useful, but we might like to go one step deeper and look at survival by both sex and, say, class.

Using the vocabulary of ``GroupBy``, we might proceed using something like this:

- *split* by class and sex

- *apply* the average aggregate **on** survival

- *combine* the resulting groups

In [3]:
sex_class = (tt.groupby(['sex', 'class'])
               .agg({
                   "survived": "mean"
               })
)
sex_class

| sex    | class  | survived |
|--------|--------|----------|
| female | First  | 0.968085 |
| female | Second | 0.921053 |
| female | Third  | 0.5      |
| male   | First  | 0.368852 |
| male   | Second | 0.157407 |
| male   | Third  | 0.135447 |


Now, we could **unstack** this multidimensional result into a 2d table:

In [4]:
sex_class.unstack()

| sex \ class | survived: First | survived: Second | survived: Third |
|-------------|-----------------|------------------|-----------------|
| female      | 0.968085        | 0.921053         | 0.5             |
| male        | 0.368852        | 0.157407         | 0.135447        |


This two-dimensional ``GroupBy`` result is a pivot table.

## Pivot Table Syntax

The ``pivot_table`` method produces the same result as the `groupby`-then-`unstack` approach in a single call:

In [5]:
tt.pivot_table('survived', index='sex', columns='class')

| sex \ class | First    | Second   | Third    |
|-------------|----------|----------|----------|
| female      | 0.968085 | 0.921053 | 0.5      |
| male        | 0.368852 | 0.157407 | 0.135447 |


As you might expect from the "women and children first" rule of a 1910s ocean crossing, first-class women almost certainly survived (Rose), while only about one in seven third-class men did (Jack).
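
To check the equivalence without loading the Titanic data, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are invented for illustration and are not real passenger records. It builds the same table twice, once with `groupby` plus `unstack` and once with `pivot_table`, and spells out `aggfunc`, which defaults to the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real Titanic records
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "class": ["First", "Third", "Third", "First", "Third", "Third", "First", "First"],
    "survived": [1, 1, 0, 0, 0, 1, 1, 1],
})

# By hand: split on two keys, average, then move 'class' into the columns
by_hand = toy.groupby(["sex", "class"])["survived"].mean().unstack()

# Same table in one call; aggfunc="mean" is the default, written out here
pivoted = toy.pivot_table("survived", index="sex", columns="class", aggfunc="mean")

print(pivoted)
# Compare the numeric contents of the two tables
print(np.allclose(by_hand.to_numpy(), pivoted.to_numpy()))
```

Passing a different `aggfunc`, such as `"sum"` or `"count"`, changes the summary statistic without changing the shape of the table.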

# String Data

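Nearly all of Python's built-in string methods are mirrored by a vectorized Pandas string method, available through the ``str`` accessor, that is applied to every string in a `pd.Series` and returns a new object rather than modifying the original.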

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

There are also some **regex**-based methods:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Search each element for the first match of the pattern, returning the matched groups as strings. |
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``; splits from the right on a literal string rather than a regexp |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [7]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64
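
One practical benefit of the vectorized methods is that they pass missing values through instead of failing. The sketch below uses a hypothetical `names` series (not part of the example data) with one missing entry; a plain list comprehension calling `s.lower()` on each element would raise an `AttributeError` on the `None`.

```python
import pandas as pd

# Hypothetical series with a missing entry
names = pd.Series(["Graham", None, "ERIC"])

lowered = names.str.lower()   # the missing entry stays missing
lengths = names.str.len()     # the length of a missing entry is also missing

print(lowered)
print(lengths)
```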

In [8]:
# Regex extract
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [9]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
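
The remaining regex-based methods from the table work the same way. This is a short sketch on the same `monte` series; the patterns are only illustrative choices.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): boolean mask, True where the pattern matches somewhere
has_terry = monte.str.contains(r'^Terry')

# count(): number of non-overlapping matches in each element
vowel_counts = monte.str.count(r'[aeiou]')

# findall(): full names that start and end with a consonant
consonant_names = monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

# replace(): regex=True is passed explicitly because the default
# has changed between pandas versions
initials = monte.str.replace(r'^(\w)\w*', r'\1.', regex=True)

print(has_terry)
print(vowel_counts)
print(consonant_names)
print(initials)
```

Note again the different return types: booleans, integers, lists, and strings.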

# Using `str` methods on a Series of Python lists

Because indexing and ``len()`` on the `str` accessor work on any sequence, not just strings, you can apply some `series.str.XXX` methods to a series of Python lists in a vectorized fashion.

In [10]:
(monte.str.split() # Split first name, last name
      .str.len() # Len of each list result (2)
)

0    2
1    2
2    2
3    2
4    2
5    2
dtype: int64

In [11]:
(monte.str.split() # Split first name, last name
      .str[0:1] # Slice each list down to its first item (still a list)
)

0     [Graham]
1       [John]
2      [Terry]
3       [Eric]
4      [Terry]
5    [Michael]
dtype: object
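
Slicing keeps each result wrapped in a list. If you want the element itself, integer indexing works too. A small sketch, using the same `monte` series:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

parts = monte.str.split()        # series of [first, last] lists

first_names = parts.str[0]       # the element itself, not a one-item list
last_names = parts.str.get(-1)   # get() is the method form of integer indexing

print(first_names)
print(last_names)
```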

#### Indicator variables

Another method that requires a bit of extra explanation is the ``get_dummies()`` method.
This is useful when your data has a column containing some sort of coded indicator.
For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [12]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

|   | name           | info  |
|---|----------------|-------|
| 0 | Graham Chapman | B\|C\|D |
| 1 | John Cleese    | B\|D   |
| 2 | Terry Gilliam  | A\|C   |
| 3 | Eric Idle      | B\|D   |
| 4 | Terry Jones    | B\|C   |
| 5 | Michael Palin  | B\|C\|D |


The ``get_dummies()`` routine lets you quickly split out these indicator variables into a ``DataFrame``:

In [13]:
full_monte['info'].str.get_dummies('|')

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 | 1 |
| 2 | 1 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 1 |
| 4 | 0 | 1 | 1 | 0 |
| 5 | 0 | 1 | 1 | 1 |
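
Once the codes are split into indicator columns, they can be used like any other numeric columns. The sketch below shows a few possible follow-ups; the variable names `code_counts`, `likes_cheese`, and `wide` are only illustrative.

```python
import pandas as pd

full_monte = pd.DataFrame({
    'name': ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
             'Eric Idle', 'Terry Jones', 'Michael Palin'],
    'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'],
})

dummies = full_monte['info'].str.get_dummies('|')

# How many people carry each code
code_counts = dummies.sum()

# Use an indicator column as a filter: who likes cheese (code C)?
likes_cheese = full_monte.loc[dummies['C'] == 1, 'name']

# Attach the indicator columns to the original frame (aligned on the index)
wide = full_monte.join(dummies)

print(code_counts)
print(likes_cheese)
print(wide)
```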
