# Handling data with Python

**Author:** Ties de Kok ([Personal Website](http://www.tiesdekok.com))  
**Last updated:** 18 May 2018  
**Python version:** Python 3.6  
**License:** MIT License  

**Note:** Some features (like the ToC) will only work if you run it locally, use Binder, or use nbviewer by clicking this link: 
https://nbviewer.jupyter.org/github/TiesdeKok/LearnPythonforResearch/blob/master/2_handling_data.ipynb

# *Introduction*

Getting your data ready for analysis (i.e. "data wrangling") is usually the most time-consuming part of a project. For data wrangling tasks I recommend `Pandas` and `Numpy`.

What is `Pandas`?

> Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use  data structures and data analysis tools for the Python programming language.

In other words, Pandas introduces a data structure (i.e. `dataframe`) that represents data as a table with columns and rows. Combining Python with `Pandas` yields a very powerful toolkit with which you can process any type of data. 

## Format of this notebook

The `Pandas` library is massive and it is continuously expanding in functionality.  
It is, therefore, impossible to be comprehensive and cover everything in just one notebook.

The goal of this notebook is to cover the basic functionality that I expect you to encounter for an average project. 

I have based parts of this notebook on a PyCon 2015 tutorial/talk by Brandon Rhodes. If you want to know more, I highly recommend watching his talk and checking the accompanying GitHub page:
  
https://www.youtube.com/watch?v=5JnMutdy6Fw  
https://github.com/brandon-rhodes/pycon-pandas-tutorial

# *Table of Contents* <a id='toc'></a>

* [Import pandas](#import-pandas)   
* [Create a dataframe](#create-dataframe)   
* [Manipulate dataframe](#manipulate-dataframe)   
* [Rename columns](#rename-columns)   
* [View a dataframe using qgrid](#qgrid)   
* [View (parts) of a dataframe using Pandas](#view-dataframe)   
* [Dealing with datatypes](#datatypes)   
* [Handling missing values](#missing-values)   
* [Work with data in the dataframe](#work-with-data)   
* [Combining dataframes](#combining-dataframes)   
* [Group-by operations](#groupby)   
* [Reshaping and Pivot Tables](#reshaping-pivot)   
* [Dealing with dates](#dates)   

## <span style="text-decoration: underline;">Import Pandas</span><a id='import-pandas'></a> [(to top)](#toc)

In [1]:
import pandas as pd
import numpy as np

*Note:* it is usually a good idea to also import `numpy` when you use `pandas`, as their functionality is quite intertwined.  

For convenience we also import `join` to easily create paths:

In [2]:
import os
from os.path import join

### Parameters

Path to our data

In [3]:
data_path = join(os.getcwd(), 'example_data')

##  <span style="text-decoration: underline;">Create a dataframe</span><a id='create-dataframe'></a> [(to top)](#toc)

We can create a dataframe in many ways. Below are a couple of situations:

### 1) Load file from drive into Pandas

For details on opening files such as Excel, CSV, Stata, SAS, HDF see the `1_opening_files` notebook.

In [4]:
df_auto = pd.read_csv(join(data_path, 'auto_df.csv'), sep=';', index_col='Unnamed: 0')

### 2) Create new dataframe and pass data to it

We can pass many different types of data to the `pd.DataFrame()` constructor.

In [5]:
d = {'col1': [1,2,3,4], 'col2': [5,6,7,8]}
df = pd.DataFrame(data=d)
df

Unnamed: 0,col1,col2
0,1,5
1,2,6
2,3,7
3,4,8


In [6]:
d = [(1, 2 ,3 ,4), (5, 6, 7, 8)]
df = pd.DataFrame(data=d)
df

Unnamed: 0,0,1,2,3
0,1,2,3,4
1,5,6,7,8


### 3) Create dataframe from a dictionary

We can also directly convert a dictionary to a dataframe:

In [7]:
d = {'row1': [1,2,3,4], 'row2': [5,6,7,8]}
df = pd.DataFrame.from_dict(d, orient='index')
df

Unnamed: 0,0,1,2,3
row1,1,2,3,4
row2,5,6,7,8


## <span style="text-decoration: underline;">Manipulate dataframe</span><a id='manipulate-dataframe'></a> [(to top)](#toc)

### Add column

In [8]:
df['col5'] = [10, 10]
df

Unnamed: 0,0,1,2,3,col5
row1,1,2,3,4,10
row2,5,6,7,8,10


### Add row

In [9]:
df.loc['row3'] = [11, 12, 13, 14, 15]
df

Unnamed: 0,0,1,2,3,col5
row1,1,2,3,4,10
row2,5,6,7,8,10
row3,11,12,13,14,15


### Transpose the dataframe

In [10]:
df.T

Unnamed: 0,row1,row2,row3
0,1,5,11
1,2,6,12
2,3,7,13
3,4,8,14
col5,10,10,15


### Remove column

In [11]:
df = df.drop('col5', axis=1)
df

Unnamed: 0,0,1,2,3
row1,1,2,3,4
row2,5,6,7,8
row3,11,12,13,14


### Remove row

In [12]:
df = df.drop('row1', axis=0)
df

Unnamed: 0,0,1,2,3
row2,5,6,7,8
row3,11,12,13,14


### Set index

In [13]:
df

Unnamed: 0,0,1,2,3
row2,5,6,7,8
row3,11,12,13,14


In [14]:
df.set_index(0)

0,1,2,3
5,6,7,8
11,12,13,14


*Note:* `Pandas` also allows a multi-index. These can be very powerful. 

In [15]:
df.set_index(0, append=True)

Unnamed: 0,0,1,2,3
row2,5,6,7,8
row3,11,12,13,14
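As a minimal sketch of why a multi-index can be useful, the example below builds a small made-up dataframe (the names `df_multi`, `country`, `year`, and `value` are illustrative and not part of the example data) and selects rows by one or both index levels:

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df_multi = pd.DataFrame({
    'country': ['NL', 'NL', 'US', 'US'],
    'year': [2017, 2018, 2017, 2018],
    'value': [1, 2, 3, 4],
}).set_index(['country', 'year'])

# Select all rows for one value of the first index level
nl_rows = df_multi.loc['NL']

# Select a single row by specifying both levels as a tuple
nl_2018 = df_multi.loc[('NL', 2018)]

# Select on the second level only using .xs()
rows_2017 = df_multi.xs(2017, level='year')
```

Selecting on the first level works directly with `.loc`, while `.xs()` is one convenient way to select on a deeper level.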


### Reset index

We can convert the index to a regular column using `reset_index()`

In [16]:
df.reset_index()

Unnamed: 0,index,0,1,2,3
0,row2,5,6,7,8
1,row3,11,12,13,14


## <span style="text-decoration: underline;">Rename columns</span><a id='rename-columns'></a> [(to top)](#toc)

We can either manipulate `df.columns` directly or use `df.rename()`

In [17]:
df.columns = ['col1', 'col2', 'col3', 'col4']
df

Unnamed: 0,col1,col2,col3,col4
row2,5,6,7,8
row3,11,12,13,14


In [18]:
df.rename(columns={'col1' : 'column1', 'col2' : 'column2'})

Unnamed: 0,column1,column2,col3,col4
row2,5,6,7,8
row3,11,12,13,14


**Note:** The above returns a renamed copy; it does not modify `df` in place!  
We need to either assign the result or use the `inplace=True` argument:

In [19]:
df = df.rename(columns={'col1' : 'column1', 'col2' : 'column2'})
#or
df.rename(columns={'col1' : 'column1', 'col2' : 'column2'}, inplace=True)

## <span style="text-decoration: underline;">View (parts) of a dataframe using `Pandas`</span><a id='view-dataframe'></a> [(to top)](#toc)

It can take some getting used to, but navigating your way around a dataframe is a very helpful skill. The ability to sub-select parts of a dataframe is important for inspection purposes, analysis, exporting, and much more.

### View entire dataframe

*Note:* Pandas will only show the top and bottom parts if the dataframe is large.

In [20]:
df_auto

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
3,Buick Century,4816,20,3.0,4.5,16,3250,196,40,196,2.93,Domestic
4,Buick Electra,7827,15,4.0,4.0,20,4080,222,43,350,2.41,Domestic
5,Buick LeSabre,5788,18,3.0,4.0,21,3670,218,43,231,2.73,Domestic
6,Buick Opel,4453,26,,3.0,10,2230,170,34,304,2.87,Domestic
7,Buick Regal,5189,20,3.0,2.0,16,3280,200,42,196,2.93,Domestic
8,Buick Riviera,10372,16,3.0,3.5,17,3880,207,43,231,2.93,Domestic
9,Buick Skylark,4082,19,3.0,3.5,13,3400,200,42,231,3.08,Domestic


### Get top or bottom of dataframe

In [21]:
df_auto.head(3)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
0,AMC Concord,4099,22,3.0,2.5,11,2930,186,40,121,3.58,Domestic
1,AMC Pacer,4749,17,3.0,3.0,11,3350,173,40,258,2.53,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic


In [22]:
df_auto.tail(3)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
71,VW Rabbit,4697,25,4.0,3.0,15,1930,155,35,89,3.78,Foreign
72,VW Scirocco,6850,25,4.0,2.0,16,1990,156,36,97,3.78,Foreign
73,Volvo 260,11995,17,5.0,2.5,14,3170,193,37,163,2.98,Foreign


### Get X random rows

In [23]:
X = 5
df_auto.sample(X)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
71,VW Rabbit,4697,25,4.0,3.0,15,1930,155,35,89,3.78,Foreign
56,Datsun 210,4589,35,5.0,2.0,8,2020,165,32,85,3.7,Foreign
7,Buick Regal,5189,20,3.0,2.0,16,3280,200,42,196,2.93,Domestic
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
37,Olds Delta 88,4890,18,4.0,4.0,20,3690,218,42,231,2.73,Domestic


### Select column(s) based on name

*Note:* the below returns a pandas `Series` object, which is different from a pandas `DataFrame` object!   
You can tell by the way that it looks when shown.

In [24]:
df_auto['make'].head(3)

0    AMC Concord
1      AMC Pacer
2     AMC Spirit
Name: make, dtype: object

If the column name is a valid Python name (e.g. it has no whitespace) and does not clash with an existing dataframe attribute or method, you can also use a dot followed by the column name:

In [25]:
df_auto.make.head(3)

0    AMC Concord
1      AMC Pacer
2     AMC Spirit
Name: make, dtype: object
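To illustrate when the dot notation breaks down, here is a small sketch with a made-up dataframe (`df_demo` and its column names are hypothetical and chosen only to trigger the two problem cases):

```python
import pandas as pd

# Hypothetical columns, chosen to show where dot notation breaks down
df_demo = pd.DataFrame({'unit price': [1.5, 2.0], 'count': [3, 4]})

# A column name with a space can only be selected with brackets
prices = df_demo['unit price']

# 'count' is also the name of a DataFrame method, so the dot notation
# returns that method instead of the column
dot_result = df_demo.count
bracket_result = df_demo['count']

print(callable(dot_result))                    # is it a method?
print(isinstance(bracket_result, pd.Series))   # is it the column?
```

Because of cases like these, the bracket notation is the safer default when column names are not under your control.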

**If you want multiple columns you need to use double brackets:**

In [26]:
df_auto[['make', 'price', 'mpg']].head(10)

Unnamed: 0,make,price,mpg
0,AMC Concord,4099,22
1,AMC Pacer,4749,17
2,AMC Spirit,3799,22
3,Buick Century,4816,20
4,Buick Electra,7827,15
5,Buick LeSabre,5788,18
6,Buick Opel,4453,26
7,Buick Regal,5189,20
8,Buick Riviera,10372,16
9,Buick Skylark,4082,19


### Select row based on index value

In [27]:
df = df_auto[['make', 'price', 'mpg', 'trunk', 'headroom']].set_index('make')

In [28]:
df.loc['Buick Riviera']

price       10372.0
mpg            16.0
trunk          17.0
headroom        3.5
Name: Buick Riviera, dtype: float64

*Note:* notice the appearance: this returned a `pandas.Series` object, not a `pandas.DataFrame` object.

### Select row based on index position

In [29]:
df.iloc[2:5]

make,price,mpg,trunk,headroom
AMC Spirit,3799,22,12,3.0
Buick Century,4816,20,16,4.5
Buick Electra,7827,15,20,4.0


**You can also include columns based on their column (!) index position:**

In [30]:
df.iloc[2:5, 1:3]

make,mpg,trunk
AMC Spirit,22,12
Buick Century,20,16
Buick Electra,15,20


*Note:* In the example above the first `2:5` selects the 3rd through 5th rows (positions 2, 3, and 4), the second `1:3` selects the 2nd and 3rd column.
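The difference between label-based selection (`.loc`) and position-based selection (`.iloc`) is easy to mix up. The sketch below uses a small made-up dataframe (`df_demo` is hypothetical) to show that an `.iloc` slice excludes its end position while a `.loc` slice includes its end label:

```python
import pandas as pd

# Hypothetical toy data indexed by make
df_demo = pd.DataFrame(
    {'price': [4099, 4749, 3799, 4816], 'mpg': [22, 17, 22, 20]},
    index=['AMC Concord', 'AMC Pacer', 'AMC Spirit', 'Buick Century'],
)

# .iloc slices by position and excludes the end position (like a Python list)
by_position = df_demo.iloc[1:3]

# .loc slices by label and includes the end label
by_label = df_demo.loc['AMC Pacer':'AMC Spirit']

# Both selections contain the same two rows
print(by_position.equals(by_label))
```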

### Select based on condition

In many cases you want to filter rows based on a condition. You can do this in Pandas by putting the condition inside square brackets.  

It is worth explaining the intuition behind this method as a lot of people find it confusing:  

1. You request Pandas to filter a dataframe by putting a condition between square brackets: df[ `condition` ] 
2. The `condition` is a sequence of `True` or `False` values for each row (so the length of the `condition` always has to match the number of rows in the dataframe!)
3. In Pandas you can generate a `True` or `False` value for each row by simply writing a boolean expression on the whole column. 
4. Pandas will then only show those rows where the value is `True`

In more practical terms:

`df_auto['price'] < 3800` will evaluate each row of `df_auto['price']` and return, for that row, whether the condition is `True` or `False`:

```
0     False
1     False
2      True
3     False
4     False
5     False
```

By putting that condition in square brackets `df_auto[ df_auto['price'] < 3800 ]` pandas will first generate a sequence of `True` / `False` values and then only display the rows for which the value is `True`.

In [31]:
df_auto[ df_auto['price'] < 3800 ]

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
2,AMC Spirit,3799,22,,3.0,12,2640,168,35,121,3.08,Domestic
13,Chev. Chevette,3299,29,3.0,2.5,9,2110,163,34,231,2.93,Domestic
17,Chev. Monza,3667,24,2.0,2.0,7,2750,179,40,151,2.73,Domestic
33,Merc. Zephyr,3291,20,3.0,3.5,17,2830,195,43,140,3.08,Domestic
65,Subaru,3798,35,5.0,2.5,11,2050,164,36,97,3.81,Foreign
67,Toyota Corolla,3748,31,5.0,3.0,9,2200,165,35,97,3.21,Foreign


We can also combine multiple conditions by chaining the boolean expressions, as long as each condition is wrapped in parentheses.   

* For an **AND** statement use: `&`
* For an **OR** statement use: `|`

In [32]:
df_auto[(df_auto['price'] < 3800) & (df_auto['foreign'] == 'Foreign')]

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
65,Subaru,3798,35,5.0,2.5,11,2050,164,36,97,3.81,Foreign
67,Toyota Corolla,3748,31,5.0,3.0,9,2200,165,35,97,3.21,Foreign
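The **OR** case works the same way. The sketch below uses a small made-up dataframe (`df_demo` is hypothetical and only mirrors a few columns of `df_auto`); it also shows the `~` operator for negating a condition, which is not covered above but follows the same pattern:

```python
import pandas as pd

# Hypothetical toy data mirroring a few columns of df_auto
df_demo = pd.DataFrame({
    'make': ['AMC Spirit', 'Subaru', 'Volvo 260', 'Buick Regal'],
    'price': [3799, 3798, 11995, 5189],
    'foreign': ['Domestic', 'Foreign', 'Foreign', 'Domestic'],
})

# OR: cheap cars or foreign cars. Each condition needs its own parentheses,
# because | and & bind more tightly than comparisons like < and ==
cheap_or_foreign = df_demo[(df_demo['price'] < 3800) | (df_demo['foreign'] == 'Foreign')]

# NOT: invert a condition with ~
not_foreign = df_demo[~(df_demo['foreign'] == 'Foreign')]
```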


**Note:** all the above return new dataframes that are discarded if we do not assign them.  
If we want to keep the result as a separate dataframe we have to assign it like so:

In [33]:
df_auto_small = df_auto[(df_auto.price < 3800) & (df_auto.foreign == 'Foreign')]
df_auto_small

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
65,Subaru,3798,35,5.0,2.5,11,2050,164,36,97,3.81,Foreign
67,Toyota Corolla,3748,31,5.0,3.0,9,2200,165,35,97,3.21,Foreign


### Sort dataframe

In [34]:
df_auto.sort_values(by=['headroom', 'trunk'], inplace=True)
df_auto.head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
55,Datsun 200,6229,23,4.0,1.5,6,2370,170,35,119,3.89,Foreign
47,Pont. Firebird,4934,18,1.0,1.5,7,3470,198,42,231,3.08,Domestic
44,Plym. Sapporo,6486,26,,1.5,8,2520,182,38,119,3.54,Domestic
23,Ford Fiesta,4389,28,4.0,1.5,9,1800,147,33,98,3.15,Domestic
17,Chev. Monza,3667,24,2.0,2.0,7,2750,179,40,151,2.73,Domestic
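Sorting direction can be set per column. As a minimal sketch on made-up data (`df_demo` is hypothetical), the `ascending` argument accepts one boolean per column in `by`:

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    'make': ['A', 'B', 'C', 'D'],
    'headroom': [2.0, 1.5, 2.0, 1.5],
    'trunk': [7, 6, 9, 8],
})

# Sort by headroom (ascending) and, within equal headroom, by trunk (descending).
# Without inplace=True a new sorted dataframe is returned and df_demo is unchanged.
df_sorted = df_demo.sort_values(by=['headroom', 'trunk'], ascending=[True, False])
```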


## <span style="text-decoration: underline;">View a dataframe using `qgrid`</span><a id='qgrid'></a> [(to top)](#toc)

The primary workflow for inspecting parts of the data is to just create a new temporary dataframe with the data that you want to see. This is fairly quick once you get used to it, but in the beginning it can feel cumbersome. 

A neat workaround for beginners is to use a package called `qgrid` to quickly inspect your data:  

This is the GitHub page for `qgrid`: https://github.com/quantopian/qgrid  
You can install it by running the following two commands in your command line:  
1. `pip install qgrid`   
2. `jupyter nbextension enable --py --sys-prefix qgrid`

Using it is simple:

**First make sure you import the `show_grid` function**

In [35]:
from qgrid import show_grid

**You can inspect a Dataframe as follows:**

In [36]:
show_grid(df_auto)

**Several things to note about qgrid:** 
- If you save the notebook with these `qgrids` it is going to increase the file-size dramatically, it essentially saves the data with the notebook. Try to avoid this, use it only for inspection.
- Opening very big dataframes using `show_grid()` is not a good idea.  
- These `qgrids` will only display locally, not on GitHub. Therefore, if you see this on GitHub, you will not see the actual `qgrid`.
- There are a bunch of options you can use with `show_grid()`. However, I strongly discourage modifying a dataframe using the qgrid toolbars. 

## <span style="text-decoration: underline;">Dealing with datatypes</span><a id='datatypes'></a> [(to top)](#toc)

It is important to pay attention to the datatypes contained in a column. A lot of the errors that you will encounter relate to wrong datatypes (e.g. because of data errors).

### Show current datatypes:

In [37]:
df_auto.dtypes

make             object
price             int64
mpg               int64
rep78           float64
headroom        float64
trunk             int64
weight            int64
length            int64
turn              int64
displacement      int64
gear_ratio      float64
foreign          object
dtype: object

### Convert datatypes

We can convert the datatype of a column in two ways:  

1. Loop over the values and convert them individually
2. Use the built-in Pandas functions to convert the column in one go

*1) Convert values individually*

In [38]:
df_auto['length'].apply(lambda x: str(x)).dtypes

dtype('O')

Note: `'O'` stands for 'object'

In [39]:
df_auto['length'].apply(lambda x: int(x)).dtypes

dtype('int64')

*2) Convert column directly*

If you want to convert a column to `string` I recommend using `.astype(str)`:

In [40]:
df_auto['length'].astype(str).dtypes

dtype('O')

If you want to convert a column to `numeric` I recommend using `pd.to_numeric()`:

In [41]:
pd.to_numeric(df_auto['length']).dtypes

dtype('int64')
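Real data often contains values that cannot be converted. As a sketch on a made-up column (`raw` is hypothetical), the `errors='coerce'` argument of `pd.to_numeric()` turns such values into missing values instead of raising an error:

```python
import pandas as pd

# Hypothetical column where one value was entered incorrectly
raw = pd.Series(['186', '173', 'n/a', '196'])

# errors='coerce' turns values that cannot be parsed into NaN
# instead of raising an error
clean = pd.to_numeric(raw, errors='coerce')

print(clean.dtype)            # float64, because NaN is a float
print(clean.isnull().sum())   # number of values that could not be converted
```

Counting the resulting missing values is a quick way to check how many entries failed to convert.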

The section `dealing with dates` will discuss how to convert a column with `dates`.

## <span style="text-decoration: underline;">Handling missing values</span><a id='missing-values'></a> [(to top)](#toc)

Dealing with missing values is easy in Pandas, as long as you are careful in defining them as `np.nan` (and **not** a string value like 'np.nan')

http://pandas.pydata.org/pandas-docs/stable/missing_data.html

### Add some missing values

*Note:* We define a missing value as `np.nan` so we can consistently select them!

In [42]:
df_auto.loc['UvT_Car'] = [np.nan for x in range(0,len(df_auto.columns))]
df_auto.loc['UvT_Bike'] = [np.nan for x in range(0,len(df_auto.columns))]

In [43]:
df_auto.loc[['UvT_Car', 'UvT_Bike']]

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
UvT_Car,,,,,,,,,,,,
UvT_Bike,,,,,,,,,,,,


### Select missing or non-missing values

Always use `pd.isnull()` or `pd.notnull()` as it is most reliable.  
`df_auto.make == np.nan` will **not** work, because `np.nan` is never equal to anything (not even to itself).

In [44]:
df_auto[pd.isnull(df_auto.make)]

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
UvT_Car,,,,,,,,,,,,
UvT_Bike,,,,,,,,,,,,


In [45]:
df_auto[pd.notnull(df_auto.make)].head()

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
55,Datsun 200,6229.0,23.0,4.0,1.5,6.0,2370.0,170.0,35.0,119.0,3.89,Foreign
47,Pont. Firebird,4934.0,18.0,1.0,1.5,7.0,3470.0,198.0,42.0,231.0,3.08,Domestic
44,Plym. Sapporo,6486.0,26.0,,1.5,8.0,2520.0,182.0,38.0,119.0,3.54,Domestic
23,Ford Fiesta,4389.0,28.0,4.0,1.5,9.0,1800.0,147.0,33.0,98.0,3.15,Domestic
17,Chev. Monza,3667.0,24.0,2.0,2.0,7.0,2750.0,179.0,40.0,151.0,2.73,Domestic


### Fill missing values

To fill missing values with something we can use `.fillna()`

In [46]:
df = df_auto.fillna('Missing')
df.loc[['UvT_Car', 'UvT_Bike']]

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign
UvT_Car,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing
UvT_Bike,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing,Missing
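Filling every column with the same string turns numeric columns into text columns. A possible alternative, sketched here on made-up data (`df_demo` and the chosen fill values are assumptions for illustration), is to pass a dictionary so that each column gets its own fill value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one fully missing row
df_demo = pd.DataFrame({
    'make': ['AMC Concord', np.nan],
    'price': [4099.0, np.nan],
})

# Pass a dictionary to fill each column with its own value,
# which keeps numeric columns numeric
df_filled = df_demo.fillna({'make': 'Unknown', 'price': df_demo['price'].mean()})
```

Whether a column mean is an appropriate fill value depends on your analysis; it is used here only to show the mechanics.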


### Drop rows with missing values

To drop missing values we can use `.dropna()`

In [47]:
df_auto['make'].tail(3)

45          Plym. Volare
UvT_Car              NaN
UvT_Bike             NaN
Name: make, dtype: object

In [48]:
df = df_auto.dropna(axis=0)
df['make'].tail(3)

36       Olds Cutlass
22    Dodge St. Regis
45       Plym. Volare
Name: make, dtype: object
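Note that `.dropna()` by default drops a row if *any* column is missing, which may remove more rows than intended. The sketch below, on made-up data (`df_demo` is hypothetical), shows the `subset` and `how` arguments that control this:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    'make': ['AMC Spirit', 'Buick Opel', np.nan],
    'rep78': [np.nan, 3.0, np.nan],
})

# Default: drop every row that has a missing value in any column
dropped_any = df_demo.dropna()

# subset: only drop rows where 'make' is missing
dropped_make = df_demo.dropna(subset=['make'])

# how='all': only drop rows where every column is missing
dropped_all = df_demo.dropna(how='all')
```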

## <span style="text-decoration: underline;">Work with data in the dataframe</span><a id='work-with-data'></a> [(to top)](#toc)

### Combine columns (and output it to a new column)

*Remember:* You can select a column using:
1. `df_auto['price']`
2. `df_auto.price` --> but this one only works if there are no spaces in the column name

In [49]:
df_auto['price_trunk_ratio'] = df_auto.price / df_auto.trunk
df_auto[['price', 'trunk', 'price_trunk_ratio']].head()

Unnamed: 0,price,trunk,price_trunk_ratio
55,6229.0,6.0,1038.166667
47,4934.0,7.0,704.857143
44,6486.0,8.0,810.75
23,4389.0,9.0,487.666667
17,3667.0,7.0,523.857143


### Generate a new column by iterating over the dataframe per row

There are multiple ways to iterate over rows.  
They mainly differ in their trade-off between ease-of-use, readability, and performance.  

I will show the three main possibilities.

For the sake of demonstration, let's say our goal is to achieve the following:    
> If the car is a foreign brand, multiply the price by 1.5

**Option 1: use `.apply()` with `lambda`**

*Note:* `lambda` is a so-called anonymous function.

In [50]:
logic = lambda x: x.price*1.5 if x.foreign == 'Foreign' else x.price
df_auto['new_price'] = df_auto.apply(logic, axis=1)
df_auto[['make', 'price', 'foreign', 'new_price']].head()

Unnamed: 0,make,price,foreign,new_price
55,Datsun 200,6229.0,Foreign,9343.5
47,Pont. Firebird,4934.0,Domestic,4934.0
44,Plym. Sapporo,6486.0,Domestic,6486.0
23,Ford Fiesta,4389.0,Domestic,4389.0
17,Chev. Monza,3667.0,Domestic,3667.0


**Option 2: use `.apply()` with a function**

In the example above we use an anonymous `lambda` function.  
For more complex processing it is possible to use a defined function and call it in `.apply()`  

**Personal note:** This is often my preferred method as it is the most flexible and a lot easier to read.

In [51]:
def new_price_function(x):
    if x.foreign == 'Foreign':
        return x.price * 1.5
    else:
        return x.price

In [52]:
df_auto['new_price'] = df_auto.apply(new_price_function, axis=1)
df_auto[['make', 'price', 'foreign', 'new_price']].head()

Unnamed: 0,make,price,foreign,new_price
55,Datsun 200,6229.0,Foreign,9343.5
47,Pont. Firebird,4934.0,Domestic,4934.0
44,Plym. Sapporo,6486.0,Domestic,6486.0
23,Ford Fiesta,4389.0,Domestic,4389.0
17,Chev. Monza,3667.0,Domestic,3667.0


*Note:* make sure to include the `axis=1` argument, this tells Pandas to pass each row (and not each column) to the function.

**Option 3: use a list comprehension:**

In [53]:
df_auto['new_price'] = [p*1.5 if f == 'Foreign' else p for p, f in zip(df_auto.price, df_auto.foreign)]
df_auto[['price', 'foreign', 'new_price']].sample(5, random_state=1)

Unnamed: 0,price,foreign,new_price
58,8129.0,Foreign,12193.5
49,4723.0,Domestic,4723.0
41,4647.0,Domestic,4647.0
36,4733.0,Domestic,4733.0
40,10371.0,Domestic,10371.0


*Note:* `random_state=1` makes sure that we get the same random sample every time we run it
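For a simple if/else rule like this one, a vectorized alternative is also possible. This is not one of the three options above, just a sketch using `np.where` on made-up data (`df_demo` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring two columns of df_auto
df_demo = pd.DataFrame({
    'price': [6229.0, 4934.0, 8129.0],
    'foreign': ['Foreign', 'Domestic', 'Foreign'],
})

# np.where(condition, value_if_true, value_if_false) works on whole columns
df_demo['new_price'] = np.where(df_demo['foreign'] == 'Foreign',
                                df_demo['price'] * 1.5,
                                df_demo['price'])
```

This avoids iterating over rows entirely, which tends to be faster on large dataframes, but it is less flexible than `.apply()` with a function.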

## <span style="text-decoration: underline;">Combining dataframes</span><a id='combining-dataframes'></a> [(to top)](#toc)

You can combine dataframes in three ways:

1. Merge
2. Join
3. Append

I will demonstrate that using the following two datasets:

In [54]:
df_auto_p1 = df_auto[['make', 'price', 'mpg']]
df_auto_p2 = df_auto[['make', 'headroom', 'trunk']]

In [55]:
df_auto_p1.head(3)

Unnamed: 0,make,price,mpg
55,Datsun 200,6229.0,23.0
47,Pont. Firebird,4934.0,18.0
44,Plym. Sapporo,6486.0,26.0


In [56]:
df_auto_p2.head(3)

Unnamed: 0,make,headroom,trunk
55,Datsun 200,1.5,6.0
47,Pont. Firebird,1.5,7.0
44,Plym. Sapporo,1.5,8.0


### 1) Merge datasets

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html  
The `.merge()` function is one of my personal favorites, it is really easy to use.  

In [57]:
merged_auto = pd.merge(df_auto_p1, df_auto_p2, how='left', on='make')
merged_auto.head(3)

Unnamed: 0,make,price,mpg,headroom,trunk
0,Datsun 200,6229.0,23.0,1.5,6.0
1,Pont. Firebird,4934.0,18.0,1.5,7.0
2,Plym. Sapporo,6486.0,26.0,1.5,8.0
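The `how` argument determines which rows survive the merge. The sketch below uses two small made-up dataframes (`left` and `right` are hypothetical) whose `make` values only partly overlap, and uses `indicator=True` to show where each row came from:

```python
import pandas as pd

# Hypothetical toy data with partly overlapping keys
left = pd.DataFrame({'make': ['AMC Pacer', 'Subaru', 'Volvo 260'],
                     'price': [4749, 3798, 11995]})
right = pd.DataFrame({'make': ['AMC Pacer', 'Subaru', 'VW Rabbit'],
                      'trunk': [11, 11, 15]})

# how='left' keeps all rows of the left dataframe (missing matches become NaN)
left_merge = pd.merge(left, right, how='left', on='make')

# how='inner' keeps only rows that appear in both dataframes
inner_merge = pd.merge(left, right, how='inner', on='make')

# how='outer' keeps everything; indicator=True adds a '_merge' column
# showing where each row came from
outer_merge = pd.merge(left, right, how='outer', on='make', indicator=True)
```

Inspecting the `_merge` column after an outer merge is a handy sanity check before settling on a `left` or `inner` merge.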


### 2) Join datasets on index

Both dataframes need to have the same column set as the index

In [58]:
df_auto_p1 = df_auto_p1.set_index('make')
df_auto_p2 = df_auto_p2.set_index('make')

In [59]:
joined_auto = df_auto_p1.join(df_auto_p2)
joined_auto.reset_index().head(3)

Unnamed: 0,make,price,mpg,headroom,trunk
0,AMC Concord,4099.0,22.0,2.5,11.0
1,AMC Pacer,4749.0,17.0,3.0,11.0
2,AMC Spirit,3799.0,22.0,3.0,12.0


### 3) Append data to the dataframe

See http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-objects  

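*Note:* There is also a shortcut function called `.append()`, but it was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions use `pd.concat()` instead.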

In [60]:
df_auto_i1 = df_auto.iloc[0:3]
df_auto_i2 = df_auto.iloc[3:6]

In [61]:
df_auto_i1

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign,price_trunk_ratio,new_price
55,Datsun 200,6229.0,23.0,4.0,1.5,6.0,2370.0,170.0,35.0,119.0,3.89,Foreign,1038.166667,9343.5
47,Pont. Firebird,4934.0,18.0,1.0,1.5,7.0,3470.0,198.0,42.0,231.0,3.08,Domestic,704.857143,4934.0
44,Plym. Sapporo,6486.0,26.0,,1.5,8.0,2520.0,182.0,38.0,119.0,3.54,Domestic,810.75,6486.0


In [62]:
df_auto_i2

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign,price_trunk_ratio,new_price
23,Ford Fiesta,4389.0,28.0,4.0,1.5,9.0,1800.0,147.0,33.0,98.0,3.15,Domestic,487.666667,4389.0
17,Chev. Monza,3667.0,24.0,2.0,2.0,7.0,2750.0,179.0,40.0,151.0,2.73,Domestic,523.857143,3667.0
51,Pont. Sunbird,4172.0,24.0,2.0,2.0,7.0,2690.0,179.0,41.0,151.0,2.73,Domestic,596.0,4172.0


Using the higher level function `concat()`:

In [63]:
pd.concat([df_auto_i1, df_auto_i2])

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign,price_trunk_ratio,new_price
55,Datsun 200,6229.0,23.0,4.0,1.5,6.0,2370.0,170.0,35.0,119.0,3.89,Foreign,1038.166667,9343.5
47,Pont. Firebird,4934.0,18.0,1.0,1.5,7.0,3470.0,198.0,42.0,231.0,3.08,Domestic,704.857143,4934.0
44,Plym. Sapporo,6486.0,26.0,,1.5,8.0,2520.0,182.0,38.0,119.0,3.54,Domestic,810.75,6486.0
23,Ford Fiesta,4389.0,28.0,4.0,1.5,9.0,1800.0,147.0,33.0,98.0,3.15,Domestic,487.666667,4389.0
17,Chev. Monza,3667.0,24.0,2.0,2.0,7.0,2750.0,179.0,40.0,151.0,2.73,Domestic,523.857143,3667.0
51,Pont. Sunbird,4172.0,24.0,2.0,2.0,7.0,2690.0,179.0,41.0,151.0,2.73,Domestic,596.0,4172.0


Using the shortcut function `append()`:

In [64]:
df_auto_i1.append(df_auto_i2)

Unnamed: 0,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign,price_trunk_ratio,new_price
55,Datsun 200,6229.0,23.0,4.0,1.5,6.0,2370.0,170.0,35.0,119.0,3.89,Foreign,1038.166667,9343.5
47,Pont. Firebird,4934.0,18.0,1.0,1.5,7.0,3470.0,198.0,42.0,231.0,3.08,Domestic,704.857143,4934.0
44,Plym. Sapporo,6486.0,26.0,,1.5,8.0,2520.0,182.0,38.0,119.0,3.54,Domestic,810.75,6486.0
23,Ford Fiesta,4389.0,28.0,4.0,1.5,9.0,1800.0,147.0,33.0,98.0,3.15,Domestic,487.666667,4389.0
17,Chev. Monza,3667.0,24.0,2.0,2.0,7.0,2750.0,179.0,40.0,151.0,2.73,Domestic,523.857143,3667.0
51,Pont. Sunbird,4172.0,24.0,2.0,2.0,7.0,2690.0,179.0,41.0,151.0,2.73,Domestic,596.0,4172.0
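When stacking dataframes the original index labels are kept, which can lead to duplicate or unordered labels. A minimal sketch on made-up data (`part_1` and `part_2` are hypothetical) shows how `ignore_index=True` creates a fresh index instead:

```python
import pandas as pd

# Hypothetical toy data with non-consecutive index labels
part_1 = pd.DataFrame({'make': ['Datsun 200', 'Pont. Firebird'],
                       'price': [6229, 4934]}, index=[55, 47])
part_2 = pd.DataFrame({'make': ['Ford Fiesta'],
                       'price': [4389]}, index=[23])

# Keeps the original index labels (55, 47, 23)
stacked = pd.concat([part_1, part_2])

# ignore_index=True replaces them with a fresh 0..n-1 index
stacked_reindexed = pd.concat([part_1, part_2], ignore_index=True)
```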


## <span style="text-decoration: underline;">Group-by operations</span><a id='groupby'></a> [(to top)](#toc)

Often you want to perform an operation within a group. In Pandas you achieve this by using `.groupby()`.

Pandas `.groupby()` is a process involving one or more of the following steps (paraphrasing from the docs):  
1. **Splitting** the data into groups based on some criteria
2. **Applying** a function to each group independently
3. **Combining** the results into a data structure

For the full documentation see: http://pandas.pydata.org/pandas-docs/stable/groupby.html
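To make the three steps concrete, the sketch below performs them by hand on made-up data (`df_demo` is hypothetical) and compares the result with a single `.groupby()` call. This is only meant to illustrate the idea, not how Pandas implements it internally:

```python
import pandas as pd

# Hypothetical toy data
df_demo = pd.DataFrame({
    'foreign': ['Domestic', 'Foreign', 'Domestic', 'Foreign'],
    'price': [4000, 6000, 5000, 8000],
})

# 1. Split: one sub-dataframe per unique value of 'foreign'
groups = {name: df_demo[df_demo['foreign'] == name]
          for name in df_demo['foreign'].unique()}

# 2. Apply: compute the mean price within each group
means = {name: group['price'].mean() for name, group in groups.items()}

# 3. Combine: collect the results into a single Series
manual_result = pd.Series(means).sort_index()

# The same result in one step with .groupby()
groupby_result = df_demo.groupby('foreign')['price'].mean()
```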

### Split the dataframe by creating a group object:

Step 1 is to create a `group` object that specifies the groups that we want to create.

In [65]:
col_list = ['price', 'mpg', 'headroom', 'trunk', 'weight', 'length']
grouped = df_auto[col_list + ['foreign']].groupby(['foreign'])

After creating a `group` object we can apply operations to it

### Applying example 1) Compute mean summary statistic

In [66]:
grouped.mean()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,6072.423077,19.826923,3.153846,14.75,3317.115385,196.134615
Foreign,6384.681818,24.772727,2.613636,11.409091,2315.909091,168.545455


### Applying example 2) Retrieve particular group:

In [67]:
grouped.get_group('Domestic').head()

Unnamed: 0,price,mpg,headroom,trunk,weight,length,foreign
47,4934.0,18.0,1.5,7.0,3470.0,198.0,Domestic
44,6486.0,26.0,1.5,8.0,2520.0,182.0,Domestic
23,4389.0,28.0,1.5,9.0,1800.0,147.0,Domestic
17,3667.0,24.0,2.0,7.0,2750.0,179.0,Domestic
51,4172.0,24.0,2.0,7.0,2690.0,179.0,Domestic


### Applying example 3) Iterate over the groups in the `group` object

By iterating over each group you get a lot of flexibility as you can do anything you want with each group.   

It is worth noting that each group is a dataframe object.

In [68]:
for name, group in grouped:
    print(name)
    print(group.head())

Domestic
     price   mpg  headroom  trunk  weight  length   foreign
47  4934.0  18.0       1.5    7.0  3470.0   198.0  Domestic
44  6486.0  26.0       1.5    8.0  2520.0   182.0  Domestic
23  4389.0  28.0       1.5    9.0  1800.0   147.0  Domestic
17  3667.0  24.0       2.0    7.0  2750.0   179.0  Domestic
51  4172.0  24.0       2.0    7.0  2690.0   179.0  Domestic
Foreign
     price   mpg  headroom  trunk  weight  length  foreign
55  6229.0  23.0       1.5    6.0  2370.0   170.0  Foreign
56  4589.0  35.0       2.0    8.0  2020.0   165.0  Foreign
68  5719.0  18.0       2.0   11.0  2670.0   175.0  Foreign
72  6850.0  25.0       2.0   16.0  1990.0   156.0  Foreign
61  4499.0  28.0       2.5    5.0  1760.0   149.0  Foreign


#### It is also possible to use the `.apply()` function on `group` objects:   

Using a `lambda df: ....` with `.apply()` is a nice way to iterate over subsets of the data.

For example, let's say we want to get the cheapest car within each "trunk" size category:

In [69]:
df_auto.groupby('trunk').apply(lambda df: df.sort_values('price').iloc[0]).head()

trunk,make,price,mpg,rep78,headroom,trunk,weight,length,turn,displacement,gear_ratio,foreign,price_trunk_ratio,new_price
5.0,Honda Civic,4499.0,28.0,4.0,2.5,5.0,1760.0,149.0,34.0,91.0,3.3,Foreign,899.8,6748.5
6.0,Datsun 200,6229.0,23.0,4.0,1.5,6.0,2370.0,170.0,35.0,119.0,3.89,Foreign,1038.166667,9343.5
7.0,Chev. Monza,3667.0,24.0,2.0,2.0,7.0,2750.0,179.0,40.0,151.0,2.73,Domestic,523.857143,3667.0
8.0,Dodge Colt,3984.0,30.0,5.0,2.0,8.0,2120.0,163.0,35.0,98.0,3.54,Domestic,498.0,3984.0
9.0,Chev. Chevette,3299.0,29.0,3.0,2.5,9.0,2110.0,163.0,34.0,231.0,2.93,Domestic,366.555556,3299.0


### Aggregate groupby object to new dataframe

If you want to aggregate each group to one row in the new dataframe you have many options, below a couple of examples:

### 1) `grouped.sum()` and `grouped.mean()`


In [70]:
grouped.sum()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,315766.0,1031.0,164.0,767.0,172490.0,10199.0
Foreign,140463.0,545.0,57.5,251.0,50950.0,3708.0


In [71]:
grouped.mean()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,6072.423077,19.826923,3.153846,14.75,3317.115385,196.134615
Foreign,6384.681818,24.772727,2.613636,11.409091,2315.909091,168.545455


### 2) `grouped.count()` and `grouped.size()`

In [72]:
grouped.count()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,52,52,52,52,52,52
Foreign,22,22,22,22,22,22


In [73]:
grouped.size()

foreign
Domestic    52
Foreign     22
dtype: int64

### 3) `grouped.first()` and `grouped.last()`

In [74]:
grouped.first()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,4934.0,18.0,1.5,7.0,3470.0,198.0
Foreign,6229.0,23.0,1.5,6.0,2370.0,170.0


In [75]:
grouped.last()

foreign,price,mpg,headroom,trunk,weight,length
Domestic,4060.0,18.0,5.0,16.0,3330.0,201.0
Foreign,12990.0,14.0,3.5,14.0,3420.0,192.0


### 4) `grouped.agg()` to specify which operations to perform for each column

In [76]:
grouped.agg({'price' : 'first', 'mpg' : ['mean', 'median'], 'trunk' : ['mean', (lambda x: 100 * np.mean(x))]})

Unnamed: 0_level_0,price,mpg,mpg,trunk,trunk
Unnamed: 0_level_1,first,mean,median,mean,<lambda>
foreign,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Domestic,4934.0,19.826923,19.0,14.75,1475.0
Foreign,6229.0,24.772727,24.5,11.409091,1140.909091
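
The dictionary syntax above produces a multi-level column index and an uninformative `<lambda>` label. If you are on Pandas 0.25 or newer (an assumption, as this notebook was written against an older version), "named aggregation" lets you choose the output column names directly. A minimal sketch with made-up data and hypothetical column names:

```python
import pandas as pd

# Made-up data for illustration (not df_auto)
df_demo = pd.DataFrame({
    "foreign": ["Domestic", "Domestic", "Foreign", "Foreign"],
    "price": [4000.0, 5000.0, 6000.0, 8000.0],
    "mpg": [20, 22, 25, 35],
    "trunk": [10, 20, 8, 12],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df_demo.groupby("foreign").agg(
    first_price=("price", "first"),
    mean_mpg=("mpg", "mean"),
    trunk_pct=("trunk", lambda x: 100 * x.mean()),
)
print(summary)
```

The result has a flat column index with the names `first_price`, `mean_mpg` and `trunk_pct`.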


### And a lot of other operations!

There are many more things you can do with Pandas `.groupby`, too many to show here.  
Feel free to check out the comprehensive documentation:  
https://pandas.pydata.org/pandas-docs/stable/groupby.html

## <span style="text-decoration: underline;">Reshaping and Pivot Tables</span><a id='reshaping-pivot'></a> [(to top)](#toc)

Pandas includes a variety of tools that allow you to reshape your DataFrame.  
These tools are very powerful but can be a bit confusing to use. 

### Create some sample data:

In [77]:
tuples = [('bar', 'one',   1, 2),
          ('bar', 'two',   3, 4),
          ('bar', 'three', 5, 6),
          ('baz', 'one',   1, 2),
          ('baz', 'two',   3, 4),
          ('baz', 'three', 5, 6),
          ('foo', 'one',   1, 2),
          ('foo', 'two',   3, 4),
          ('foo', 'three', 5, 6)
         ]
df = pd.DataFrame(tuples)
df.columns = ['first', 'second', 'A', 'B']

In [78]:
df

Unnamed: 0,first,second,A,B
0,bar,one,1,2
1,bar,two,3,4
2,bar,three,5,6
3,baz,one,1,2
4,baz,two,3,4
5,baz,three,5,6
6,foo,one,1,2
7,foo,two,3,4
8,foo,three,5,6


### Example 1: Create a pivot table

Using the `pivot()` function:  
http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-pivoting-dataframe-objects

In [79]:
df.pivot(index='first', columns='second', values='A')

second,one,three,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,1,5,3
baz,1,5,3
foo,1,5,3


Using the `pd.pivot_table()` function:  
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

In [80]:
pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')

Unnamed: 0_level_0,A,A,A,B,B,B
second,one,three,two,one,three,two
first,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,1,5,3,2,6,4
baz,1,5,3,2,6,4
foo,1,5,3,2,6,4
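
The practical difference between the two shows up when an index/column combination occurs more than once: `pivot()` refuses to reshape, whereas `pd.pivot_table()` aggregates the duplicates (using the mean by default). A minimal sketch with made-up data in which the combination `('bar', 'one')` appears twice:

```python
import pandas as pd

# Made-up data: ('bar', 'one') occurs twice
df_dup = pd.DataFrame({
    "first": ["bar", "bar", "bar"],
    "second": ["one", "one", "two"],
    "A": [1, 3, 5],
})

# pivot() cannot decide which value to keep and raises a ValueError
try:
    df_dup.pivot(index="first", columns="second", values="A")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates, here explicitly with the mean
table = pd.pivot_table(df_dup, values="A", index="first", columns="second", aggfunc="mean")
print(pivot_failed)
print(table)
```

In this sketch the value for `('bar', 'one')` in `table` is the mean of 1 and 3.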


*Note 1:* the above illustrates that a Pandas dataframe essentially has two indexes: the usual 'row index' but also a 'column index'.  
*Note 2:* pandas also has an "unpivot" function called `pandas.melt` (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html)
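
To give an idea of what "unpivot" means, here is a minimal sketch of `pd.melt` that turns a small, made-up wide table back into a long one. The dataframe and the chosen column names are only for illustration:

```python
import pandas as pd

# Made-up wide table: one column per value of 'second'
wide = pd.DataFrame({
    "first": ["bar", "baz"],
    "one": [1, 1],
    "two": [3, 3],
})

# Keep 'first' as identifier, move the other column names into 'second'
long = pd.melt(wide, id_vars="first", var_name="second", value_name="A")
print(long)
```

The result has one row per combination of `first` and `second`, with the values in column `A`.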

### Example 2: Stack and Unstack

`stack` and `unstack` are higher-level operators that reshape a dataframe based on a multi-level index.

From the documentation:  
>stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index with a new inner-most level of row labels.  
unstack: inverse operation from stack: “pivot” a level of the (possibly hierarchical) row index to the column axis, producing a reshaped DataFrame with a new inner-most level of column labels.

http://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking

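In other words:  
**Stack** --> move a level of the column labels "down" into the row index (the dataframe gets longer)  
**Unstack** --> move a level of the row index "up" into the column labels (the dataframe gets wider)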

### Stack

In [81]:
pivot_df = pd.pivot_table(df, values=['A', 'B'], index='first', columns='second')
pivot_df

Unnamed: 0_level_0,A,A,A,B,B,B
second,one,three,two,one,three,two
first,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,1,5,3,2,6,4
baz,1,5,3,2,6,4
foo,1,5,3,2,6,4


In [82]:
pivot_df.stack(level=['second'])

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1,2
bar,three,5,6
bar,two,3,4
baz,one,1,2
baz,three,5,6
baz,two,3,4
foo,one,1,2
foo,three,5,6
foo,two,3,4


*Note:* We could also just use `pivot_df.stack()`, as by default it stacks the last (innermost) level of the column index.

### Unstack

In [83]:
df.set_index(['first', 'second'], inplace=True)

In [84]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,1,2
bar,two,3,4
bar,three,5,6
baz,one,1,2
baz,two,3,4
baz,three,5,6
foo,one,1,2
foo,two,3,4
foo,three,5,6


In [85]:
df.unstack(level=['first'])

Unnamed: 0_level_0,A,A,A,B,B,B
first,bar,baz,foo,bar,baz,foo
second,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
one,1,1,1,2,2,2
three,5,5,5,6,6,6
two,3,3,3,4,4,4


In [86]:
df.unstack(level=['second'])

Unnamed: 0_level_0,A,A,A,B,B,B
second,one,three,two,one,three,two
first,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
bar,1,5,3,2,6,4
baz,1,5,3,2,6,4
foo,1,5,3,2,6,4
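
Because `unstack` is the inverse of `stack`, unstacking a level and stacking it again should give back the original data (possibly in a different row order). A minimal sketch that checks this on a small made-up dataframe:

```python
import pandas as pd

# Made-up data with a two-level row index
df_demo = pd.DataFrame(
    [("bar", "one", 1, 2), ("bar", "two", 3, 4),
     ("baz", "one", 5, 6), ("baz", "two", 7, 8)],
    columns=["first", "second", "A", "B"],
).set_index(["first", "second"])

wide = df_demo.unstack(level="second")   # 'second' moves to the columns
long_again = wide.stack(level="second")  # and back to the row index

# Sort both sides so that row order does not influence the comparison
round_trip_ok = long_again.sort_index().equals(df_demo.sort_index())
print(round_trip_ok)
```

Note that this clean round trip relies on every `first`/`second` combination being present; with missing combinations `unstack` introduces `NaN` values.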


## <span style="text-decoration: underline;">Dealing with dates</span><a id='dates'></a> [(to top)](#toc)

Pandas has a lot of built-in functionality to deal with time series data:  
http://pandas.pydata.org/pandas-docs/stable/timeseries.html

A nice overview from the documentation:

| Class           | Remarks                        | How to create:                               |
|-----------------|--------------------------------|----------------------------------------------|
| `Timestamp`     | Represents a single time stamp | `to_datetime`, `Timestamp`                   |
| `DatetimeIndex` | Index of `Timestamp`           | `to_datetime`, `date_range`, `DatetimeIndex` |
| `Period`        | Represents a single time span  | `Period`                                     |
| `PeriodIndex`   | Index of `Period`              | `period_range`, `PeriodIndex`                |
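
To make the table a bit more concrete, here is a minimal sketch that creates one object of each class. The dates are arbitrary examples:

```python
import pandas as pd

# A single point in time
ts = pd.Timestamp("2011-02-01")

# An index of timestamps: three consecutive days
dt_index = pd.date_range("2011-01-01", periods=3, freq="D")

# A single time span: the whole month of February 2011
period = pd.Period("2011-02", freq="M")

# An index of periods: three consecutive months
period_index = pd.period_range("2011-01", periods=3, freq="M")

print(type(ts).__name__, type(dt_index).__name__)
print(type(period).__name__, type(period_index).__name__)
print(period.start_time, period.end_time)
```

The key difference is that a `Timestamp` is a point in time, whereas a `Period` has a start and an end.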

### Create a range of dates

In [87]:
date_index = pd.date_range('1/1/2011', periods=len(df_auto.index), freq='D')
date_index[0:5]

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
               '2011-01-05'],
              dtype='datetime64[ns]', freq='D')

For the sake of illustration, let's add these dates to our `df_auto`:

In [88]:
df_ad = df_auto.copy()[['make', 'price']]
df_ad['date'] = date_index

In [89]:
df_ad.head()

Unnamed: 0,make,price,date
55,Datsun 200,6229.0,2011-01-01
47,Pont. Firebird,4934.0,2011-01-02
44,Plym. Sapporo,6486.0,2011-01-03
23,Ford Fiesta,4389.0,2011-01-04
17,Chev. Monza,3667.0,2011-01-05


In [90]:
df_ad.dtypes

make             object
price           float64
date     datetime64[ns]
dtype: object

**Converting a `str` column to a `date` column**

In many cases the dates in imported data are not recognized as dates and end up in a `str` (i.e. `object`) column instead. 

Let's 'sabotage' our date column and convert it to strings:


In [91]:
df_ad['date'] = df_ad['date'].astype(str)
df_ad['date'].dtypes

dtype('O')

We cannot perform any `datetime` operations on the column now because it has the wrong datatype!  

Luckily, we can fix it like this:

In [92]:
pd.to_datetime(df_ad['date']).dtypes

dtype('<M8[ns]')

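Or, converting element by element (slower on large columns) and storing the result back in the column: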

In [93]:
df_ad['date'] = df_ad['date'].apply(lambda x: pd.to_datetime(x))
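
Real-world date columns often contain a few values that cannot be parsed. A minimal sketch, using a made-up series, of how `errors='coerce'` turns such values into `NaT` (the datetime equivalent of `NaN`) instead of raising an error:

```python
import pandas as pd

# Made-up raw data with one unparseable entry
raw = pd.Series(["2011-01-01", "2011-01-02", "not a date"])

# Unparseable values become NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of values that could not be parsed
```

Counting the `NaT` values afterwards is a quick way to check how many entries failed to convert.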

### Select observations that fall within a certain range

In [94]:
pd.Timestamp('2011-02-01')

Timestamp('2011-02-01 00:00:00')

In [95]:
pd.to_datetime('01-02-2011', format='%d-%m-%Y')

Timestamp('2011-02-01 00:00:00')

*Note:* it is usually a good idea to explicitly include the format to avoid unexpected behavior, such as the day and month being swapped.
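
A minimal sketch of the kind of unexpected behavior the note refers to: without a format, Pandas reads an ambiguous string such as `'01-02-2011'` month-first.

```python
import pandas as pd

ambiguous = "01-02-2011"

# Default parsing assumes month-first: January 2nd
default_parse = pd.to_datetime(ambiguous)

# An explicit format removes the ambiguity: February 1st
explicit_parse = pd.to_datetime(ambiguous, format="%d-%m-%Y")

print(default_parse)   # 2011-01-02 00:00:00
print(explicit_parse)  # 2011-02-01 00:00:00
```

Both calls succeed without an error, which is exactly why this type of mistake is easy to miss.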

In [96]:
df_ad[df_ad.date > pd.to_datetime('07-03-2011', format='%d-%m-%Y')]

Unnamed: 0,make,price,date
37,Olds Delta 88,4890.0,2011-03-08
46,Pont. Catalina,5798.0,2011-03-09
5,Buick LeSabre,5788.0,2011-03-10
38,Olds Omega,4181.0,2011-03-11
3,Buick Century,4816.0,2011-03-12
36,Olds Cutlass,4733.0,2011-03-13
22,Dodge St. Regis,6342.0,2011-03-14
45,Plym. Volare,4060.0,2011-03-15
UvT_Car,,,2011-03-16
UvT_Bike,,,2011-03-17


We can also use Pandas `.isin()` with a `date_range` object to select all observations that fall on one of the dates in that range:

In [97]:
df_ad[df_ad.date.isin(pd.date_range('2/20/2011', '3/11/2011', freq='D'))]

Unnamed: 0,make,price,date
11,Cad. Eldorado,14500.0,2011-02-20
29,Merc. Cougar,5379.0,2011-02-21
8,Buick Riviera,10372.0,2011-02-22
15,Chev. Malibu,4504.0,2011-02-23
33,Merc. Zephyr,3291.0,2011-02-24
40,Olds Toronado,10371.0,2011-02-25
49,Pont. Le Mans,4723.0,2011-02-26
25,Linc. Continental,11497.0,2011-02-27
30,Merc. Marquis,6165.0,2011-02-28
20,Dodge Diplomat,4010.0,2011-03-01
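
An alternative that does not require building the full list of dates is `Series.between()`, which by default includes both end points. A minimal sketch with a small made-up dataframe (not `df_ad`):

```python
import pandas as pd

# Made-up data: five consecutive days
df_demo = pd.DataFrame({
    "date": pd.date_range("2011-02-18", periods=5, freq="D"),
    "price": [1, 2, 3, 4, 5],
})

# Boolean mask that is True for dates from Feb 19 up to and including Feb 21
mask = df_demo["date"].between(pd.Timestamp("2011-02-19"), pd.Timestamp("2011-02-21"))
selected = df_demo[mask]
print(selected)
```

Unlike `.isin()` with a daily `date_range`, this also matches timestamps that include a time of day, as long as they fall between the two end points.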


### Select components of the dates

You can extract components such as the `day`, `month`, or `year`.

See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#time-date-components

In [98]:
df_ad['day'] = df_ad['date'].apply(lambda x: x.day)
df_ad.head()

Unnamed: 0,make,price,date,day
55,Datsun 200,6229.0,2011-01-01,1
47,Pont. Firebird,4934.0,2011-01-02,2
44,Plym. Sapporo,6486.0,2011-01-03,3
23,Ford Fiesta,4389.0,2011-01-04,4
17,Chev. Monza,3667.0,2011-01-05,5
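
The same components are also available through the `.dt` accessor, which works on the whole column at once and avoids the `lambda`. A minimal sketch with two made-up dates:

```python
import pandas as pd

# Made-up dates for illustration
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-02-15"]))

components = pd.DataFrame({
    "day": dates.dt.day,
    "month": dates.dt.month,
    "year": dates.dt.year,
    "weekday": dates.dt.day_name(),
})
print(components)
```

Note that `.dt` only works on columns with a datetime datatype, so a `str` column needs to be converted first.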


### Manipulate (offset) the date

See: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#dateoffset-objects  

You can even take things like business days, business hours, and holidays into account!

In [99]:
df_ad['new_date'] = df_ad.date.apply(lambda x: x + pd.DateOffset(years=1))
df_ad.head()

Unnamed: 0,make,price,date,day,new_date
55,Datsun 200,6229.0,2011-01-01,1,2012-01-01
47,Pont. Firebird,4934.0,2011-01-02,2,2012-01-02
44,Plym. Sapporo,6486.0,2011-01-03,3,2012-01-03
23,Ford Fiesta,4389.0,2011-01-04,4,2012-01-04
17,Chev. Monza,3667.0,2011-01-05,5,2012-01-05
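
An offset can also be added to a whole column at once, without `.apply()`. The sketch below uses two made-up dates to show this, together with a business day offset (`pd.offsets.BDay`) that skips the weekend:

```python
import pandas as pd

# Made-up dates: a Saturday and a leap day
dates = pd.Series(pd.to_datetime(["2011-01-01", "2012-02-29"]))

# Add one calendar year to every date in the column
shifted = dates + pd.DateOffset(years=1)

# Add one business day to a single timestamp: Saturday -> next Monday
next_bday = pd.Timestamp("2011-01-01") + pd.offsets.BDay(1)

print(shifted)
print(next_bday)  # 2011-01-03 00:00:00
```

Note how the leap day ends up on February 28th of the next year, as February 29th does not exist in 2013.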
