In this short video we're going to talk about a really common task in applied data science, which is the
resampling of time series data. This is often done when you have, say, intermittent data which you want to make
more regular, or when you have really fine-grained data and you want to see the more general trends in it.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Resampling and Frequency Conversion

In [None]:
# Resampling refers to the process of converting a time series from one frequency to another. Converting from
# a higher frequency to a lower frequency is called downsampling and the other way around is called
# upsampling. There is another type of resampling, which is neither upsampling nor downsampling, such as
# changing a weekly Wednesday value into a weekly Sunday value.

# Resampling is useful and commonly used in manipulating time related data. Now let's see how it is done in
# Pandas.
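
The three kinds of conversion described above can be sketched on a tiny series. This is a minimal illustration with made-up values (0 through 13), not part of the lecture data; the variable names are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Fourteen daily observations (values 0..13), starting on Monday 2018-01-01.
daily = pd.Series(np.arange(14.0),
                  index=pd.date_range("2018-01-01", periods=14, freq="D"))

# Downsampling: daily -> weekly needs an aggregation (here, the sum of each week).
weekly = daily.resample("W").sum()

# Upsampling: weekly -> daily creates new rows, which stay NaN until they are filled.
back_to_daily = weekly.resample("D").asfreq()

# Neither: weekly values anchored on Wednesday are re-anchored to the following Sunday.
wed = pd.Series([1.0, 2.0],
                index=pd.date_range("2018-01-03", periods=2, freq="W-WED"))
sun = wed.resample("W-SUN").last()
```

The number of rows shrinks when downsampling, grows when upsampling, and stays the same when only the anchor day changes.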

In [None]:
# First, let's create a DatetimeIndex using the date_range() function. We can set the start or the end, and
# specify the frequency and the number of periods. Here we set the start to Jan 1st 2018 and the end to
# May 31st 2018, and use the dates as the index of a Series of random numbers
dates = pd.date_range(start='1/1/2018', end='05/31/2018')
ts = pd.Series(np.random.randn(len(dates)), index=dates)
ts.head()
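
As a small sketch of the parameter combinations mentioned above, the same index can be built from a start and an end, from a start and a number of periods, or from an end and a number of periods. The period count of 151 is simply the number of days from Jan 1 to May 31, 2018.

```python
import pandas as pd

# Start and end: every calendar day in the range (daily is the default frequency).
by_bounds = pd.date_range(start="2018-01-01", end="2018-05-31")

# Start plus a number of periods at a given frequency.
by_start = pd.date_range(start="2018-01-01", periods=151, freq="D")

# End plus a number of periods, counted backward from the end date.
by_end = pd.date_range(end="2018-05-31", periods=151, freq="D")
```

ISO-style strings such as "2018-01-01" avoid any ambiguity between month-first and day-first date formats.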

In [None]:
# Now let's try downsampling. Two important things need to be considered when downsampling. First, which side
# of each interval is closed. Let's say we want to convert daily frequency to weekly frequency. We need to
# chop up the data into one-week intervals. Each interval is half-open: it includes one of its endpoints but
# not the other, so a data point belongs to exactly one interval and the union of the intervals makes up the
# whole time frame.

# Second, we need to decide how the new aggregated bins should be labeled: either with the start of the
# interval or with the end of the interval.

# Let's look at an example using the Series we just created. Here we want to convert daily to weekly. We can
# use the resample() function, which has parameters for specifying the new frequency and which side of each
# interval is closed. After that, we also have to decide what kind of aggregate function we want to apply to
# each interval.

# Here, we specify the frequency as weekly, which is W, the closed side as right, and the aggregate function
# as mean
ts.resample('W', closed='right').mean().head()
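
To make the effect of `closed` and `label` concrete, here is a minimal sketch on a deterministic series (values 0 through 13, an assumption made so the sums are easy to check by hand). With weekly bins that end on Sunday, changing the closed side moves each Sunday observation into a different bin.

```python
import numpy as np
import pandas as pd

# Values 0..13 on Monday 2018-01-01 through Sunday 2018-01-14.
s = pd.Series(np.arange(14.0),
              index=pd.date_range("2018-01-01", periods=14, freq="D"))

# closed='right': each bin is (previous Sunday, this Sunday], labeled by its right edge.
right = s.resample("W", closed="right", label="right").sum()

# closed='left': each bin is [this Sunday, next Sunday), labeled by its left edge,
# so every Sunday observation now opens a new bin instead of closing the old one.
left = s.resample("W", closed="left", label="left").sum()
```

Both results add up to the same total; only the assignment of boundary points and the bin labels differ.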

In [None]:
# Let's just take a look under the hood, what is this object returned by resample()?
type(ts.resample('W', closed='right'))

In [None]:
# This object allows us to resample pretty much however we want, through use of the agg() function, but it
# also holds many of the common functions we might use, such as mean(). For instance, if we just wanted to
# count all of the data values that were being resampled, we could use len() and write our own lambda
ts.resample('W', closed='right').agg(lambda x: len(x)).head()
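
For counting, the resampler also offers built-in methods, so a lambda is not strictly needed. The sketch below, on a small made-up series with one missing value, shows the difference between `size()` and `count()` and how `agg()` accepts a list of function names.

```python
import numpy as np
import pandas as pd

# Ten daily values starting Monday 2018-01-01, with one missing observation.
s = pd.Series(np.arange(10.0),
              index=pd.date_range("2018-01-01", periods=10, freq="D"))
s.iloc[2] = np.nan

weekly = s.resample("W")

# size() counts every row that falls in a bin; count() skips NaN values.
sizes = weekly.size()
counts = weekly.count()

# agg() with a list of names returns a DataFrame with one column per function.
summary = weekly.agg(["mean", "min", "max"])
```

A lambda such as `lambda x: len(x)` behaves like `size()` here, since `len` counts missing values too.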

In [None]:
# If we pay attention to the bottom of the output, where it says the frequency is "W-SUN", it means weekly on
# Sunday. If we want to anchor on another day, for instance Wednesday, we can use "W-WED"

# After converting the frequency, we can also adjust the labels. Older versions of pandas had a loffset
# parameter for this, but it was deprecated in pandas 1.1 and removed in 2.0, so here we shift the index
# directly. Let's change the daily frequency to monthly and move each label one month backward.
# (From pandas 2.2 on, the month-end alias is spelled 'ME' rather than 'M'.)
monthly = ts.resample('M', closed='right', label='right').mean()
monthly.index = monthly.index - pd.offsets.MonthEnd(1)
monthly
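
Here is a minimal sketch of anchoring weekly bins on Wednesday and then shifting the resulting labels with plain index arithmetic, which does not rely on any resample argument. The data (values 0 through 13) and the one-day shift are assumptions chosen to keep the output easy to verify.

```python
import numpy as np
import pandas as pd

# Values 0..13 on Monday 2018-01-01 through Sunday 2018-01-14.
s = pd.Series(np.arange(14.0),
              index=pd.date_range("2018-01-01", periods=14, freq="D"))

# Weekly bins that end on Wednesday instead of the default Sunday.
wed = s.resample("W-WED").sum()

# Shift the labels after the fact: each label moves back one day, to Tuesday.
shifted = wed.copy()
shifted.index = shifted.index - pd.Timedelta(days=1)
```

Shifting the index changes only the labels; the values in each bin stay the same.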

In [None]:
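# Another popular and useful approach to aggregation is to compute four values for each bucket: the first,
# maximum, minimum, and last values. By using the ohlc() function (open, high, low, close), we get a
# DataFrame indexed at the new frequency with four columns holding those values for each time period.
# The labels are shifted one month backward on the index, since loffset is not available in pandas 2.0+
monthly_ohlc = ts.resample('M', closed='right', label='right').ohlc()
monthly_ohlc.index = monthly_ohlc.index - pd.offsets.MonthEnd(1)
monthly_ohlc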
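
The four columns are easier to read on data where they can be checked by eye. The sketch below uses ten made-up daily prices and weekly bins, and compares `ohlc()` with the equivalent list of named aggregations.

```python
import pandas as pd

# Ten made-up daily prices starting Monday 2018-01-01.
prices = pd.Series([5.0, 9.0, 2.0, 7.0, 4.0, 8.0, 6.0, 3.0, 1.0, 4.0],
                   index=pd.date_range("2018-01-01", periods=10, freq="D"))

# open = first value, high = maximum, low = minimum, close = last value of each week.
bars = prices.resample("W").ohlc()

# The same numbers from generic aggregations, with different column names.
manual = prices.resample("W").agg(["first", "max", "min", "last"])
```

The first week (Jan 1 to Jan 7) opens at 5, peaks at 9, bottoms at 2, and closes at 6.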

In [None]:
# This is a pretty common financial data routine, as you might guess from the names of the columns, but you
# can write your own functions and pass them to agg() or apply() as you see fit.

In [None]:
# Now let's talk about upsampling. Since we are converting lower frequency to higher frequency, there is no
# need to aggregate. We use the asfreq() function to convert to the higher frequency without any aggregation

# Let's create a DataFrame with two weekly indices and four columns. First the indices
dates = pd.date_range(start='1/1/2018', periods=2, freq='W')
# now let's fill in the DataFrame
df = pd.DataFrame(np.random.randn(2,4), index=dates, 
                  columns=['col1','col2','col3','col4'])
df.head()

In [None]:
# Now we upsample from weekly frequency to daily frequency: we call the resample() function with the
# frequency set to "D", followed by the asfreq() function
df_daily = df.resample('D').asfreq()
df_daily.head()

In [None]:
# As you notice, there are NaN values in some cells because we are upsampling and we do not have data for the
# new rows. If you want to fill the NaN values, which is a simple form of interpolation, you can use ffill(),
# which we have talked about and which fills forward, or bfill(), which fills backward. You can also use the
# fillna() or reindex() methods.

# In our DataFrame it makes sense to do forward filling, so let's try it
df.resample('D').ffill()

In [None]:
# We can also choose to fill only a certain number of periods by using the limit parameter of the ffill()
# function. For instance, here we limit the fill to three observations after each known value
df.resample('D').ffill(limit=3)
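
The filling options can be compared side by side on a two-point series. This is a sketch with made-up values (0 and 7, one week apart) so that each result is easy to predict; linear interpolation via `Series.interpolate()` is added here as a further option beyond the fills discussed above.

```python
import pandas as pd

# Two weekly observations on Sunday 2018-01-07 and Sunday 2018-01-14.
weekly = pd.Series([0.0, 7.0],
                   index=pd.date_range("2018-01-07", periods=2, freq="W"))

upsampler = weekly.resample("D")

forward = upsampler.ffill()          # repeat the last known value
backward = upsampler.bfill()         # use the next known value
limited = upsampler.ffill(limit=3)   # fill at most three rows after each observation

# Linear interpolation draws a straight line between the known points.
linear = upsampler.asfreq().interpolate()
```

Forward filling never uses information from the future, which is why it is often the safer choice for time series.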

## Moving Window Functions

In [None]:
# An important group of time series manipulation techniques are statistics evaluated over a sliding window or
# with exponentially decaying weights. These are very useful for smoothing noisy or gappy data. Note that
# these kinds of functions automatically exclude missing data.

# Now let's look at examples on the stock market. We are going to look at Apple and Microsoft's daily stock
# price from 2012 to 2018
apple = pd.read_csv("datasets/AAPL.csv")
ms = pd.read_csv("datasets/MSFT.csv")

apple.head()

In [None]:
# As we see here, we have different kinds of pricing. For the analysis we are going to do, we will use close
# price. Let's combine Apple and Microsoft's daily prices together into one dataframe.
df = pd.DataFrame({'AAPL': apple['Close'],'MSFT':ms['Close']})
df.head()
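
The combined frame above is indexed by row number. If the files carry a date column, parsing it into a DatetimeIndex puts real dates on the plot axis and aligns the two stocks by date. The sketch below assumes the column is named `Date`, which the text does not confirm, and uses made-up in-memory rows in place of the real files.

```python
import io
import pandas as pd

# Made-up rows standing in for the real files; the 'Date' column name is an assumption.
aapl_csv = io.StringIO("Date,Close\n2018-01-02,10.0\n2018-01-03,11.0\n2018-01-04,12.0\n")
msft_csv = io.StringIO("Date,Close\n2018-01-02,20.0\n2018-01-03,21.0\n2018-01-04,22.0\n")

# parse_dates converts the column to timestamps; index_col makes it the row index.
apple_by_date = pd.read_csv(aapl_csv, parse_dates=["Date"], index_col="Date")
ms_by_date = pd.read_csv(msft_csv, parse_dates=["Date"], index_col="Date")

# Building the frame from two date-indexed Series aligns the rows on the dates.
prices = pd.DataFrame({"AAPL": apple_by_date["Close"], "MSFT": ms_by_date["Close"]})
```

With a DatetimeIndex in place, `prices.resample(...)` and time-based rolling windows become available as well.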

In [None]:
# Now let's plot the prices over the years
df.plot()
plt.show()

In [None]:
# Now we are going to learn the rolling operator, which behaves similarly to resample and groupby. It can be
# called on a Series or a DataFrame along with a window, which is the number of periods to cover. The number
# we pass to the rolling() function is the size of the sliding window we group by. For example, if we pass
# 100, each group is a sliding window of 100 rows, which is 100 trading days for this daily data

# Here, let's do a 100-day rolling window where we average the values and plot the result.
df.rolling(100).mean().plot()
plt.show()
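
A rolling window produces NaN until it is full, which is why the smoothed plot starts later than the raw one. The sketch below, on six made-up values, shows this along with the `min_periods` parameter and a time-based window; the time-based form requires a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Values 1..6 on six consecutive days.
s = pd.Series(np.arange(1.0, 7.0),
              index=pd.date_range("2018-01-01", periods=6, freq="D"))

# A 3-row window: the first two results are NaN because the window is not yet full.
strict = s.rolling(3).mean()

# min_periods=1 emits a value as soon as one observation is available.
partial = s.rolling(3, min_periods=1).mean()

# With a DatetimeIndex the window can also be a time span, here three days.
by_time = s.rolling("3D").mean()
```

On evenly spaced daily data the three-day window covers the same rows as the three-row window, but on irregular data the two can differ.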

In [None]:
# You can see that this not only smoothed the data, but we also lost the big drop at the end of the time
# period for Apple because of the size of our window. Try playing with a few window values yourself and get a
# sense for how that might change the insights you derive from the data.
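
The exponentially decaying weights mentioned at the start of this section are available through `ewm()`. The following is a minimal sketch on four made-up values; the choice of `span=3` and `adjust=False` is an assumption made so the recursion can be checked by hand.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# span=3 corresponds to a smoothing factor alpha = 2 / (span + 1) = 0.5.
# With adjust=False the result follows y[t] = (1 - alpha) * y[t-1] + alpha * x[t],
# starting from y[0] = x[0].
smoothed = s.ewm(span=3, adjust=False).mean()
```

Unlike a fixed rolling window, the exponentially weighted mean gives recent observations more weight, so it tends to react faster to a sudden move such as the drop at the end of the Apple series.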

Now we've just touched on the very basics of manipulating time series data in Python. These techniques will be
useful for conducting further time series analysis and for more advanced data visualization of time-related
data, as well as for feature engineering from time series sources.