This self-study notebook is inspired by [this post](https://machinelearningmastery.com/resample-interpolate-time-series-data-python/), which uses the [shampoo sales dataset](https://datamarket.com/data/set/22r0/sales-of-shampoo-over-a-three-year-period#!ds=22r0&display=line).
### Resampling

Resampling involves changing the frequency of your time series observations.

Two types of resampling are:

    Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds.
    Downsampling: Where you decrease the frequency of the samples, such as from days to months.

In both cases, data must be invented.

In the case of upsampling, care may be needed in determining how the fine-grained observations are calculated using interpolation. In the case of downsampling, care may be needed in selecting the summary statistics used to calculate the new aggregated values.

There are perhaps two main reasons why you may be interested in resampling your time series data:

    Problem Framing: Resampling may be required if your data is not available at the same frequency at which you want to make predictions.
    Feature Engineering: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

There is a lot of overlap between these two cases.

For example, you may have daily data and want to make monthly predictions. You could use the daily data directly, or you could downsample it to monthly data and develop your model on that.

A feature engineering perspective may use observations and summaries of observations from both time scales and more in developing a model.
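As a rough sketch of that idea, the snippet below pairs daily observations with a monthly summary of the same series. The data is synthetic and the names `days`, `daily`, `monthly_mean`, and `features` are made up for illustration. It assumes the month-start alias "MS", so that each monthly mean can be forward-filled onto the days of its month.

```python
import numpy as np
import pandas as pd

# Synthetic daily observations for January and February 2020 (illustrative only)
days = pd.date_range('2020-01-01', '2020-02-29', freq='D')
daily = pd.Series(np.arange(len(days), dtype=float), index=days)

# Downsample to one mean per month, labelled at the month start
monthly_mean = daily.resample('MS').mean()

# Align the monthly summary back onto the daily index so both scales sit side by side
features = pd.DataFrame({
    'daily_value': daily,
    'monthly_mean': monthly_mean.reindex(days, method='ffill'),
})
print(features.head())
```

Each row then carries the raw daily value together with the mean of its month, which a supervised model could use as two separate inputs.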

Let’s make resampling more concrete by looking at a real dataset and some examples.

In [5]:
import pandas as pd
import matplotlib.pyplot as plt

# The date column stores values such as "1-Jan"; prefixing "190" gives a
# parseable year and abbreviated month name ("1901-Jan").
df = pd.read_csv('C:/PreprocessUtility/datasets/shampoo-sales.csv', header=0, index_col=0)
df.index = pd.to_datetime('190' + df.index.astype(str), format='%Y-%b')
# Turn the single-column frame into a Series
series = df.squeeze('columns')
print(series.head())
series.plot()
plt.show()

**Upsample Shampoo Sales**

The observations in the Shampoo Sales dataset are monthly.

Imagine we wanted daily sales information. We would have to upsample the frequency from monthly to daily and use an interpolation scheme to fill in the new daily frequency.

The Pandas library provides a function called resample() on the Series and DataFrame objects. It can be used to group records when downsampling and to make space for new observations when upsampling.

We can use this function to transform our monthly dataset into a daily dataset by calling resample() and specifying the calendar-day frequency alias “D”.

The resample() function creates the new rows and fills them with NaN values, while the sales volumes on the first of January and February from the original data are kept.
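The NaN rows produced by upsampling can be seen in a minimal sketch. The values below are made up and are not the actual shampoo sales figures, and `monthly` and `upsampled` are illustrative names. On recent pandas versions resample() returns a Resampler object, so the sketch calls asfreq() to materialise the daily series.

```python
import pandas as pd

# Hypothetical monthly observations on the first of each month (not the real data)
monthly = pd.Series(
    [100.0, 131.0, 159.0],
    index=pd.date_range('1901-01-01', periods=3, freq='MS'),
)

# Upsample to calendar-day frequency; new rows have no value yet
upsampled = monthly.resample('D').asfreq()

# The first 32 rows cover all of January plus the first of February
print(upsampled.head(32))
print('missing values:', upsampled.isna().sum())
```

Only the three original dates hold values; every other day is NaN until an interpolation scheme fills it in.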

Next, we can interpolate the missing values at this new frequency.

The Pandas Series object provides an interpolate() function to interpolate missing values, with a selection of simple and more complex interpolation methods. You may have domain knowledge to help choose how values are to be interpolated.

A good starting point is linear interpolation. This draws a straight line between the available data points, in this case on the first of each month, and fills in values at the chosen frequency from this line.

In [None]:
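from pandas import read_csv, to_datetime

# Assumes the date column stores values such as "1-01" (year digit, month number);
# use format='%Y-%b' instead if the file stores month names such as "1-Jan".
df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = to_datetime('190' + df.index.astype(str), format='%Y-%m')
series = df.squeeze('columns')
# Upsample to daily frequency; the new rows are NaN until interpolated
upsampled = series.resample('D').asfreq()
# Fill the gaps along straight lines between the monthly observations
interpolated = upsampled.interpolate(method='linear')
print(interpolated.head(32))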

In [None]:
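from pandas import read_csv, to_datetime
from matplotlib import pyplot

# Assumes the date column stores values such as "1-01" (year digit, month number);
# use format='%Y-%b' instead if the file stores month names such as "1-Jan".
df = read_csv('shampoo-sales.csv', header=0, index_col=0)
df.index = to_datetime('190' + df.index.astype(str), format='%Y-%m')
series = df.squeeze('columns')
upsampled = series.resample('D').asfreq()
# Second-order spline interpolation instead of straight lines (requires SciPy)
interpolated = upsampled.interpolate(method='spline', order=2)
print(interpolated.head(32))
interpolated.plot()
pyplot.show()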

**Downsample Shampoo Sales**

The sales data is monthly, but perhaps we would prefer the data to be quarterly.

The year can be divided into 4 business quarters of 3 months apiece.

Instead of creating new rows between existing observations, the resample() function in Pandas will group all observations by the new frequency.

We could use an alias like “3M” to create groups of 3 months, but this could cause trouble if our observations did not start in January, April, July, or October. Pandas does have a quarter-aware alias of “Q” that we can use for this purpose.

We must now decide how to create a new quarterly value from each group of 3 records. A good starting point is to calculate the average monthly sales numbers for the quarter. For this, we can use the mean() function.
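A minimal sketch of this downsampling step follows. The monthly values are made up rather than taken from the shampoo dataset, and `monthly` and `quarterly_mean` are illustrative names. It uses the quarter-start alias "QS", which is an assumption made so the sketch runs across pandas versions. The "Q" alias mentioned above labels each group by quarter end instead and is spelled "QE" from pandas 2.2 onward.

```python
import pandas as pd

# Hypothetical monthly observations for one year (values 1.0 to 12.0)
monthly = pd.Series(
    [float(v) for v in range(1, 13)],
    index=pd.date_range('1901-01-01', periods=12, freq='MS'),
)

# Group the months into calendar quarters and average each group of 3 records
quarterly_mean = monthly.resample('QS').mean()
print(quarterly_mean)
```

Swapping mean() for another summary such as sum() or max() changes what each quarterly value represents, so the choice should follow from the question being asked of the data.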

### Reindex and reset_index    
See What's in a name datacamp project

In [6]:
# create data frame
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({
          'http_status': [200,200,404,404,301],
          'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)
df

           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00


Reindex conforms the dataframe to a new index. Any label in the new index with no corresponding row in the original is assigned **NaN** unless a **fill_value** is given.

In [7]:
new_index= ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10','Chrome']
df.reindex(new_index)

               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02


We can fill in the missing values by passing a value to the keyword **fill_value**. Because the index is **not monotonically increasing** or **decreasing**, we cannot use the **method** keyword to fill the NaN values. The date-indexed example further down shows a case where it does apply.

In [8]:
df.reindex(new_index, fill_value=0)

               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02


In [9]:
df.reindex(new_index, fill_value='missing')

              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
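The restriction on the method keyword can be checked directly. The sketch below rebuilds the browser dataframe from the cells above and tries a forward fill during the reindex. The flag `raised` is an illustrative name. The exact wording of the error message may differ between pandas versions, so the sketch only relies on the exception type.

```python
import pandas as pd

index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]}, index=index)
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']

# The browser names are in no particular order, so the index is not monotonic
print(df.index.is_monotonic_increasing, df.index.is_monotonic_decreasing)

# Asking reindex to fill by propagation is therefore rejected
try:
    df.reindex(new_index, method='ffill')
    raised = False
except ValueError as exc:
    raised = True
    print('reindex failed:', exc)
```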


The **method** keyword can be used when the index is monotonic, for example a dataframe with a **monotonically increasing index** such as a sequence of dates.

In [12]:
import numpy as np
date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
df2

            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0


Suppose we decide to expand the dataframe to cover a wider date range.

In [13]:
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
df2.reindex(date_index2)

            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN


The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

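For example, to propagate the next valid value backward to **fill the NaN values**, pass **bfill** as an argument to the method keyword. Filling during a reindex compares only the original and new indexes, so the NaN already present at 2010-01-03 is left untouched, and 2010-01-07 stays NaN because no later observation exists.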

In [14]:
df2.reindex(date_index2, method='bfill')

            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN
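For contrast, here is a sketch of the opposite direction using the same data. With **ffill** the dates before the first observation have nothing to carry forward, while the trailing date picks up the last price. The names `forward` and `forward_all` are illustrative. The second step, calling ffill() after the reindex, is shown only as one possible way to also fill the gap that was already in the original data.

```python
import numpy as np
import pandas as pd

date_index = pd.date_range('1/1/2010', periods=6, freq='D')
df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]}, index=date_index)
date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')

# Forward fill during the reindex: only labels missing from the original index are filled
forward = df2.reindex(date_index2, method='ffill')
print(forward)

# Reindex first, then forward fill the values: the original gap on 2010-01-03 is filled too
forward_all = df2.reindex(date_index2).ffill()
print(forward_all)
```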


### Reset_index

In [16]:
df = pd.DataFrame([('bird',    389.0),
                   ('bird',     24.0),
                    ('mammal',   80.5),
                   ('mammal', np.nan)],
                   index=['falcon', 'parrot', 'lion', 'monkey'],
                   columns=('class', 'max_speed'))
df

         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN


When we reset the index, the old index is added as a column, and a new sequential index is used:

In [17]:
df.reset_index()

    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN


We can use the drop parameter to avoid the old index being added as a column:

In [18]:
df.reset_index(drop=True)

    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN


In [20]:
index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
                                    ('bird', 'parrot'),
                                    ('mammal', 'lion'),
                                    ('mammal', 'monkey')],
                                   names=['class', 'name'])
columns = pd.MultiIndex.from_tuples([('speed', 'max'), ('species', 'type')])
df = pd.DataFrame([(389.0, 'fly'),
                    ( 24.0, 'fly'),
                    ( 80.5, 'run'),
                    (np.nan, 'jump')],
                   index=index,
                   columns=columns)
df

               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump


If the index has multiple levels, we can reset a subset of them:

In [21]:
df.reset_index(level='class')

         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump


If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:

In [22]:
df.reset_index(level='class', col_level=1)

                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump


When the index is inserted under another level, we can specify under which one with the parameter col_fill:

In [23]:
df.reset_index(level='class', col_level=1, col_fill='species')

       species  speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump


If we specify a nonexistent level for col_fill, it is created:

In [24]:
df.reset_index(level='class', col_level=1, col_fill='genus')

         genus  speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
