# Groupby operations

Some imports:

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

## Concat DataFrames

In [None]:
# Python program to concatenate 
# dataframes using Pandas
  
# Creating first dataframe 
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 
                    'B': ['B0', 'B1', 'B2', 'B3'], 
                    'C': ['C0', 'C1', 'C2', 'C3'], 
                    'D': ['D0', 'D1', 'D2', 'D3']}, 
                    index = [0, 1, 2, 3]) 
  
# Creating second dataframe 
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 
                    'B': ['B4', 'B5', 'B6', 'B7'], 
                    'C': ['C4', 'C5', 'C6', 'C7'], 
                    'D': ['D4', 'D5', 'D6', 'D7']}, 
                    index = [4, 5, 6, 7]) 
  
# Creating third dataframe 
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'], 
                    'B': ['B8', 'B9', 'B10', 'B11'], 
                    'C': ['C8', 'C9', 'C10', 'C11'], 
                    'D': ['D8', 'D9', 'D10', 'D11']}, 
                    index = [8, 9, 10, 11]) 


#### Vertical Concatenation:

<img align="left" width=50% src="img/concat1.PNG">

In [None]:
# Concatenating the dataframes 
pd.concat([df1, df2, df3], axis = 0) 

In [None]:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'],
                    'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
   

In [None]:
df4

In [None]:
df1

<img align="left" width=50% src="img/concat2.PNG">

In [None]:
result = pd.concat([df1, df4], axis=1)

In [None]:
result
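
With `axis=1`, `pd.concat` aligns rows on the index labels rather than on row position. The following is a minimal sketch of that behaviour, using two small illustrative frames (`left` and `right` are made-up names, not part of the data above) and the `join` parameter of `pd.concat`:

```python
import pandas as pd

# Two small illustrative frames whose indexes only partly overlap.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
right = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=[1, 2, 3])

# Default join='outer': keep the union of the index labels,
# filling positions without a match with NaN.
outer = pd.concat([left, right], axis=1)

# join='inner': keep only the index labels present in both frames.
inner = pd.concat([left, right], axis=1, join='inner')

print(outer)
print(inner)
```

This is why `result` above contains missing values: `df1` and `df4` share only the index labels 2 and 3.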

<div class="alert alert-success">
    <b>EXERCISE</b>: Concat df1 and df3 dataframes horizontally
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations09.py
pd.concat([df1, df3], axis = 1)

## Merge DataFrames

### Have a look at 'Movies' database from IMDB

We are using sample data for this exercise. The full data can be downloaded here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8). Put them in the `data/` folder.

`cast` dataset: different roles played by actors/actresses in films

- title: title of the film
- name: name of the actor/actress
- type: actor/actress
- n: the order of the role (n=1: leading role)

In [None]:
cast = pd.read_csv('data/cast_sample.csv')

In [None]:
cast

In [None]:
titles = pd.read_csv('data/titles_sample.csv')

In [None]:
titles.head()

We want to add the `year` column from the `titles` dataframe to the `cast` dataframe. Merging the two dataframes on their shared `title` column does this:

In [None]:
cast = cast.merge(titles, on='title')

In [None]:
cast.head()
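
By default, `merge` performs an inner join, so rows without a match in the other table are dropped. The sketch below shows the difference from a left join on two tiny made-up tables (`cast_demo` and `titles_demo` are hypothetical stand-ins, not the sample files):

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
cast_demo = pd.DataFrame({
    'title': ['Film X', 'Film X', 'Film Y', 'Film Z'],
    'name': ['Actor 1', 'Actor 2', 'Actor 1', 'Actor 3'],
})
titles_demo = pd.DataFrame({
    'title': ['Film X', 'Film Y'],
    'year': [1999, 2005],
})

# Default how='inner': 'Film Z' has no entry in titles_demo and is dropped.
inner = cast_demo.merge(titles_demo, on='title')

# how='left': every cast row is kept; 'Film Z' gets NaN for 'year'.
left = cast_demo.merge(titles_demo, on='title', how='left')

print(inner)
print(left)
```

Comparing the number of rows before and after a merge is a quick way to check whether any rows were lost or duplicated.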

## Recap: the groupby operation (split-apply-combine)

The "group by" concept: we want to **apply the same function to subsets of a dataframe, where some key determines how the dataframe is split into subsets**.

This operation is also referred to as the "split-apply-combine" operation, involving the following steps:

* **Splitting** the data into groups based on some criteria
* **Applying** a function to each group independently
* **Combining** the results into a data structure

<img src="img/splitApplyCombine.png">

This is similar to SQL's `GROUP BY`.

The example of the image in pandas syntax:

In [None]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

In [None]:
df.groupby('key')['data'].sum() 
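
To make the three steps explicit, here is a sketch that performs the split, apply and combine steps by hand on the same small dataframe and compares the result with the one-liner. The names `pieces`, `manual` and `shortcut` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Split: iterating over a groupby yields (key, sub-dataframe) pairs.
# Apply: sum the 'data' column of each sub-dataframe.
pieces = {key: group['data'].sum() for key, group in df.groupby('key')}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(pieces, name='data')

# The groupby one-liner gives the same values.
shortcut = df.groupby('key')['data'].sum()

print(manual)
print(shortcut)
```

In practice the one-liner is preferable; the loop is only there to show what happens conceptually.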

## And now applying this on some real data

We are using sample data for this exercise. The full data can be downloaded here: [`titles.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKajNMa1pfSzN6Q3M) and [`cast.csv`](https://drive.google.com/open?id=0B3G70MlBnCgKal9UYTJSR2ZhSW8). Put them in the `data/` folder.

`cast` dataset: different roles played by actors/actresses in films

- title: title of the film
- name: name of the actor/actress
- type: actor/actress
- n: the order of the role (n=1: leading role)

In [None]:
cast = pd.read_csv('data/cast_sample.csv')

In [None]:
cast.head()

In [None]:
titles = pd.read_csv('data/titles_sample.csv')

In [None]:
titles.head(2)

In [None]:
cast = cast.merge(titles)

<div class="alert alert-success">
    <b>QUESTION</b>: Using groupby(), plot the number of films that have been released each decade in the history of cinema.
</div>

In [None]:
titles['decade'] = titles['year'] // 10 * 10

In [None]:
titles.head()

In [None]:
titles.groupby('decade').size().plot(kind='bar')
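
The expression `year // 10 * 10` uses floor division to drop the last digit of the year and then scales back up, which maps every year to the first year of its decade. A small sketch with made-up years (the `years` Series is illustrative, not taken from the data):

```python
import pandas as pd

# A few illustrative years.
years = pd.Series([1959, 1960, 1969, 2004])

# Floor-divide by 10 (1959 -> 195), then multiply by 10 (195 -> 1950).
decades = years // 10 * 10

# A Series can be passed directly as the grouping key,
# so no extra column is needed.
counts = years.groupby(decades).size()

print(decades.tolist())
print(counts)
```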

<div class="alert alert-success">
    <b>QUESTION</b>: Use groupby() to plot the number of "Never Too Late" films made each decade.
</div>

In [None]:
nevertoolate = titles[titles['title'] == 'Never Too Late']
nevertoolate.groupby(nevertoolate['year'] // 10 * 10).size().plot(kind='bar')

<div class="alert alert-success">
    <b>QUESTION</b>: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations10.py
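# `cast` has no 'decade' column (it was added to `titles` only), so derive the decade from 'year'
cast1950 = cast[(cast['year'] // 10 * 10 == 1950) & (cast['n'] == 1)]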
cast1950.groupby(['year', 'type']).size()

<div class="alert alert-success">
    <b>QUESTION</b>: List the 10 actors/actresses with the most leading roles (n=1) since 1990.
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations11.py
cast1990 = cast[(cast['year'] >= 1990) & (cast['n'] == 1)]
cast1990.groupby('name').size().nlargest(10)

<div class="alert alert-success">
    <b>EXERCISE</b>: List, in order by year, each of the films in which Mohanlal has played more than 1 role.
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations13.py


<div class="alert alert-success">
    <b>EXERCISE</b>: List each of the characters that Frank Oz has portrayed at least twice.
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations15.py


## Transforms

Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with `transform`:

In [None]:
df

In [None]:
df.groupby('key').transform('mean')
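
The difference from an aggregation is the shape of the result: an aggregation returns one value per group, while `transform` returns one value per original row. The sketch below illustrates this on the same small dataframe; the `centered` column is a made-up example of a typical use:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# Aggregation: one value per group (3 rows, indexed by key).
group_means = df.groupby('key')['data'].mean()

# Transformation: one value per original row (9 rows, same index as df).
row_means = df.groupby('key')['data'].transform('mean')

# Because the result is aligned with df, it can be used in arithmetic
# directly, for example to center each value on its group's mean.
df['centered'] = df['data'] - row_means

print(group_means)
print(df)
```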

<div class="alert alert-success">
    <b>QUESTION</b>: Add a column to the `cast` dataframe that indicates the number of roles for the film.
</div>

In [None]:
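# The highest billing position `n` within a film is used as an approximation of its number of roles
cast['n_total'] = cast.groupby('title')['n'].transform('max')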
cast.head()

## String manipulations

Python strings have a lot of useful methods available to manipulate or check the content of the string:

In [None]:
s = 'Kohli'

In [None]:
s.startswith('K')

In pandas, those methods (together with some additional methods) are also available for string Series through the `.str` accessor:

In [None]:
s = pd.Series(['Kohli', 'Rohit', 'Rahul'])

In [None]:
s

In [None]:
s.str.startswith('R')

For an overview of all string methods, see: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
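
As a brief illustration, the sketch below applies a few other `.str` methods to the same Series. The variable names are only illustrative:

```python
import pandas as pd

s = pd.Series(['Kohli', 'Rohit', 'Rahul'])

# Element-wise versions of the plain Python string methods.
upper = s.str.upper()
lengths = s.str.len()

# Methods that return booleans produce a mask,
# which can be used to filter the Series.
with_oh = s[s.str.contains('oh')]

print(upper.tolist())
print(lengths.tolist())
print(with_oh.tolist())
```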

<div class="alert alert-success">
    <b>QUESTION</b>: Not all films based on 'Hamlet' have exactly the title 'Hamlet'. Give an overview of the titles that contain 'Hamlet':
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations29.py
hamlets = titles[titles['title'].str.contains('Hamlet')]
hamlets['title'].value_counts()

## Value counts

A useful shortcut to calculate the number of occurrences of each value is `value_counts` (this is roughly equivalent to `df.groupby(key).size()`, with the result sorted by count).

For example, what are the most frequently occurring movie titles?

In [None]:
titles.title.value_counts()
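
To make the relationship with `groupby` concrete, the sketch below computes the same counts both ways on a tiny made-up table (`titles_demo` is hypothetical). The only difference is that `value_counts` already sorts the result in descending order:

```python
import pandas as pd

# Hypothetical miniature titles table.
titles_demo = pd.DataFrame(
    {'title': ['Hamlet', 'Macbeth', 'Hamlet', 'Othello', 'Hamlet', 'Macbeth']})

# Shortcut: counts per distinct value, most frequent first.
counts = titles_demo['title'].value_counts()

# The same counts via groupby, sorted explicitly.
via_groupby = titles_demo.groupby('title').size().sort_values(ascending=False)

print(counts)
print(via_groupby)
```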

<div class="alert alert-success">
    <b>EXERCISE</b>: Which years saw the most films released?
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations34.py


<div class="alert alert-success">
    <b>EXERCISE</b>: Plot the number of released films over time
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations35.py


<div class="alert alert-success">
    <b>EXERCISE</b>: What are the 11 most common character names in movie history?
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations37.py


<div class="alert alert-success">
    <b>EXERCISE</b>: Which actors or actresses appeared in the most movies in the year 2010?
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations38.py


<div class="alert alert-success">
    <b>EXERCISE</b>: Plot how many roles Brad Pitt has played in each year of his career.
</div>

In [None]:
# %load snippets/04b - Advanced groupby operations39.py
