# Time Series Analysis

In this first part of the TSA series, we'll look at tools that are useful for analyzing time-series data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/co2-mm/co2_mm_mlo.csv')
df['date'] = pd.to_datetime({'year':df['year'], 'month':df['month'], 'day':1}, errors='coerce')
df.head()

In [None]:
df.set_index('date', inplace=True)
df.index.freq = 'MS'
df.head()

In [None]:
df[['interpolated']].plot(figsize=(12,6));

The plot above illustrates some of the fundamentals of time series analysis:
* **trend** - The series follows a clearly non-linear upward trend
* **seasonality** - Within each year, there is a repeating pattern of rises and falls
* **noise** - There are also random, non-systematic fluctuations in the data
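
To make the three components concrete, here is a minimal sketch on synthetic data (not the CO2 dataset). The coefficients, the 12-month cycle, and the names `trend`, `seasonal`, `noise`, and `trend_est` are assumptions chosen for illustration. A centered 12-month rolling mean averages out the yearly cycle and leaves a rough estimate of the trend.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series; every number below is an illustrative assumption
rng = np.random.default_rng(0)
idx = pd.date_range('2000-01-01', periods=120, freq='MS')
t = np.arange(len(idx))

trend = 0.002 * t**2 + 0.1 * t               # non-linear upward trend
seasonal = 3.0 * np.sin(2 * np.pi * t / 12)  # pattern that repeats every 12 months
noise = rng.normal(scale=0.5, size=len(idx)) # random fluctuations

series = pd.Series(trend + seasonal + noise, index=idx)

# Averaging over a full 12-month window cancels the seasonal cycle,
# so what remains is approximately the trend
trend_est = series.rolling(window=12, center=True).mean()

print(trend_est.dropna().head())
```

Subtracting `trend_est` from `series` would leave the seasonal pattern plus noise, which is the basic idea behind classical decomposition.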

## endog & exog
The data seen in a time series is described as either <u><em>endogenous</em></u>, that is, caused by factors within the system, or <u><em>exogenous</em></u>, caused by factors outside the system.

# 1) Tools
Let's look at tools for representing dates and times in Python.

In [None]:
import numpy as np

ls = ['2016-03-15', '2017-05-24', '2018-08-09']

np.array(ls, dtype='datetime64')

The output shows `dtype='datetime64[D]'`: NumPy inferred day-level precision from the input strings.

In [None]:
print(np.array(ls, dtype='datetime64[Y]'))
print(np.array(ls, dtype='datetime64[h]'))

We can also use `np.arange` to create a date range.

In [None]:
print(np.arange('2018-06-01', '2018-06-23', 7, dtype='datetime64[D]'))
print(np.arange('1968', '1976', dtype='datetime64[Y]'))

We can do the same in Pandas, where we can specify the precision level as well as the *time zone*. We can list all time zone codes available in Python with the `pytz` module.

In [None]:
import pytz
pytz.all_timezones[:10]

In [None]:
print(pd.date_range('1/1/2018', periods=7, freq='D'))
print(pd.date_range('1 Jan, 2018', '7 Jan, 2018', freq='D'))
print(pd.date_range('1 Jan, 2018', '7 Jan, 2018', freq='D', tz='Asia/Bangkok'))

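***Note:*** `pd.date_range()` needs exactly three of `start`, `end`, `periods`, and `freq`. When only two of `start`, `end`, and `periods` are given, `freq` defaults to `'D'` (daily).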
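
A small sketch of how `freq` interacts with the other `pd.date_range()` arguments. The dates and the names `daily`, `monthly`, and `spaced` are arbitrary and chosen only for illustration.

```python
import pandas as pd

# start and periods given, freq omitted: pandas falls back to daily steps
daily = pd.date_range('2018-01-01', periods=3)

# Explicit month-start frequency
monthly = pd.date_range('2018-01-01', periods=3, freq='MS')

# start, end and periods given, freq omitted: evenly spaced timestamps
spaced = pd.date_range('2018-01-01', '2018-01-03', periods=5)

print(daily)
print(monthly)
print(spaced)
```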

Pandas also has `pd.to_datetime`, which parses date strings much like Python's `datetime` class does, including `strptime`-style format codes.

In [None]:
pd.to_datetime(['2x1x2019'], format='%dx%mx%Y', errors='raise')

The `errors` parameter accepts `'ignore'`, `'raise'`, or `'coerce'` (default `'raise'`):
- If `raise`, then invalid parsing will raise an exception.
- If `coerce`, then invalid parsing will be set as NaT.
- If `ignore`, then invalid parsing will return the input.
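
As a rough illustration of the difference between `coerce` and `raise`, the sketch below parses one valid and one invalid string. The list `raw` and the flag `raised` are made up for this example.

```python
import pandas as pd

raw = ['2x1x2019', 'not a date']  # hypothetical input: second entry is invalid
fmt = '%dx%mx%Y'

# coerce: the invalid entry becomes NaT instead of stopping the parse
coerced = pd.to_datetime(raw, format=fmt, errors='coerce')
print(coerced)

# raise: the same input triggers an exception
try:
    pd.to_datetime(raw, format=fmt, errors='raise')
    raised = False
except ValueError as exc:
    raised = True
    print('errors="raise" ->', type(exc).__name__)
```

`coerce` is handy when cleaning messy columns, since the bad rows can be found afterwards with `.isna()`.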

In [None]:
# Each string uses a different format; pandas 2.0+ needs format='mixed' for this
pd.to_datetime(['Jan 01, 2018', '1/2/18', '03-Jan-2018', None], format='mixed')

# 2) Time Series Operations

## 1) Resampling
A common operation with time series data is resampling based on the time series index. When calling `.resample()` you first need to pass in a **rule** parameter, then you need to call some sort of aggregation function.

The **rule** parameter describes the frequency with which to apply the aggregation function (daily, monthly, yearly, etc.)<br>
It is passed in using an "offset alias" - refer to the table below.

<table style="display: inline-block">
    <caption style="text-align: center"><strong>TIME SERIES OFFSET ALIASES</strong></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>B</td><td>business day frequency</td></tr>
<tr><td>C</td><td>custom business day frequency (experimental)</td></tr>
<tr><td>D</td><td>calendar day frequency</td></tr>
<tr><td>W</td><td>weekly frequency</td></tr>
<tr><td>M</td><td>month end frequency</td></tr>
<tr><td>SM</td><td>semi-month end frequency (15th and end of month)</td></tr>
<tr><td>BM</td><td>business month end frequency</td></tr>
<tr><td>CBM</td><td>custom business month end frequency</td></tr>
<tr><td>MS</td><td>month start frequency</td></tr>
<tr><td>SMS</td><td>semi-month start frequency (1st and 15th)</td></tr>
<tr><td>BMS</td><td>business month start frequency</td></tr>
<tr><td>CBMS</td><td>custom business month start frequency</td></tr>
<tr><td>Q</td><td>quarter end frequency</td></tr>
<tr><td></td><td><font color=white>intentionally left blank</font></td></tr></table>

<table style="display: inline-block; margin-left: 40px">
<caption style="text-align: center"></caption>
<tr><th>ALIAS</th><th>DESCRIPTION</th></tr>
<tr><td>BQ</td><td>business quarter end frequency</td></tr>
<tr><td>QS</td><td>quarter start frequency</td></tr>
<tr><td>BQS</td><td>business quarter start frequency</td></tr>
<tr><td>A</td><td>year end frequency</td></tr>
<tr><td>BA</td><td>business year end frequency</td></tr>
<tr><td>AS</td><td>year start frequency</td></tr>
<tr><td>BAS</td><td>business year start frequency</td></tr>
<tr><td>BH</td><td>business hour frequency</td></tr>
<tr><td>H</td><td>hourly frequency</td></tr>
<tr><td>T, min</td><td>minutely frequency</td></tr>
<tr><td>S</td><td>secondly frequency</td></tr>
<tr><td>L, ms</td><td>milliseconds</td></tr>
<tr><td>U, us</td><td>microseconds</td></tr>
<tr><td>N</td><td>nanoseconds</td></tr></table>
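
To see how the choice of alias changes the bins, here is a minimal sketch that resamples the same daily series with two aliases from the table. The series `s` and the result names are assumptions for illustration; alias spellings can differ between pandas versions, so only `'W'` and `'MS'` are used here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: 14 days starting Wednesday 2020-07-01
idx = pd.date_range('2020-07-01', periods=14, freq='D')
s = pd.Series(np.arange(14), index=idx)

# 'W': weekly bins, each labelled by the Sunday that ends it
weekly_sum = s.resample('W').sum()

# 'MS': monthly bins, each labelled by the first day of the month
monthly_mean = s.resample('MS').mean()

print(weekly_sum)
print(monthly_mean)
```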

In [None]:
dates = np.arange('2020-07-01', '2020-08-01', dtype = 'datetime64[D]')
idx = pd.DatetimeIndex(dates)

df = pd.DataFrame({'A':1, 'B':np.arange(len(dates))}, index=idx)

df.head(10)

In [None]:
df.resample('3D').sum().head(10)

We can also pass our own function. It receives the data that falls into each resampling bin (here, one column at a time as a Series), not individual rows.

In [None]:
def last(x):
    # x holds one column's values for a single 3-day bin; take the last one by position
    return x.iloc[-1]

df.resample('3D').apply(last).head(10)

## 2) Shifting
`.shift()` moves the data up or down along the date index by a given number of rows, without regard for time periods (months & years). The index itself stays in place.

In [None]:
df.shift(1).head() # The first row becomes NaN and the last row's values are dropped

In [None]:
df.shift(-1).tail() # The last row becomes NaN and the first row's values are dropped

### Shifting date index based on Time Series Frequency Code
We can choose to shift ***index values*** up or down *without realigning the data* by passing in a freq argument.
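
The difference between the two kinds of shift is easiest to see side by side. This is a small sketch on a made-up three-day series; the names `s`, `by_rows`, and `by_freq` are illustrative only.

```python
import pandas as pd

idx = pd.date_range('2020-07-01', periods=3, freq='D')
s = pd.Series([10, 20, 30], index=idx)  # hypothetical values

# Row-based shift: values move down one row, the index stays the same
by_rows = s.shift(1)

# Frequency-based shift: the index moves forward one day, values stay attached
by_freq = s.shift(1, freq='D')

print(by_rows)
print(by_freq)
```

The row-based shift introduces a `NaN` and drops a value, while the frequency-based shift keeps every value and only relabels the dates.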

In [None]:
df.head()

In [None]:
df.shift(periods=2, freq='D').head()

## 3) Rolling
The idea is to divide the data into "windows" of time, and then calculate an aggregate function for each window. In this way we can obtain a simple moving average that is able to smooth the data.
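
Before turning to real data, here is a minimal sketch of the windowing idea on a tiny made-up series. It compares `.rolling()` with means computed by hand; the names `s`, `rolled`, and `manual` are assumptions for this example.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical values

# Each output is the mean of the current value and the two before it;
# the first two positions have no full window, so they are NaN
rolled = s.rolling(window=3).mean()

# The same windows, sliced by hand
manual = [s.iloc[i - 2:i + 1].mean() for i in range(2, len(s))]

print(rolled)
print(manual)
```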

In [None]:
df = pd.read_csv('/kaggle/input/starbucks/starbucks.csv', index_col='Date', parse_dates=True)
df.head(3)

In [None]:
df['Close'].plot(figsize=(12,6)).autoscale(axis='x', tight=True);

In [None]:
print('Mean first 3 closes',df.iloc[:3]['Close'].mean())
print('Mean first 3 volumes',df.iloc[:3]['Volume'].mean())
print()
print('Mean next 3 closes',df.iloc[1:4]['Close'].mean())
print('Mean next 3 volumes',df.iloc[1:4]['Volume'].mean())

In [None]:
df.rolling(window=3).mean().head(4)

In [None]:
df['Close'].plot(figsize=(12,6), label='original').autoscale(axis='x', tight=True);
df['Close'].rolling(window=7).mean().plot(label='window=7')
df['Close'].rolling(window=60).mean().plot(label='window=60')

plt.legend();

## 4) Expanding
An expanding window takes into account everything from the start of the time series up to each point in time (not just a moving window). For example, instead of the average over the last 7 days, we would consider the average over all prior data.
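
As a quick sanity check of this definition, the sketch below compares `.expanding().mean()` with a running mean computed by hand on a made-up series. The names `s`, `expanding_mean`, `manual`, and `delayed` are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0])  # hypothetical values

# Mean of everything seen so far at each position
expanding_mean = s.expanding().mean()

# Same thing by hand: cumulative sum divided by the number of points so far
manual = s.cumsum() / np.arange(1, len(s) + 1)

# min_periods delays the first output until enough points are available
delayed = s.expanding(min_periods=3).mean()

print(expanding_mean)
print(delayed)
```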

In [None]:
df['Close'].plot(figsize=(12,5), label='').autoscale(axis='x',tight=True)
df['Close'].expanding(min_periods=30).mean().plot(figsize=(12,5), label='')
plt.axhline(df['Close'].mean(), color='k', label='Global avg.')

plt.legend();