# Time Series Overview
<br>
A time series is a sequence of data points ordered in time. <br>
When working with time series data, the aim is usually to estimate how the sequence of observations will continue into the future.<br>
Depending on the source that generates the data, it can be collected at various frequencies. <br>

The major assumption in time series forecasting is that past patterns will continue into the future. <br>
Examples
- monthly sales data
- quarterly/annual revenue
- hourly weather and wind speed data
- IoT sensors and smart devices
- energy forecasting
<br>

## Key Concepts
1. Trends
2. Seasonality
3. Cyclic patterns
4. Stationarity

### 1. Trends
A trend is a long-term increase or decrease in the data. <br>
A trend need not be linear. <br>
It can also change direction over time, for example from increasing to decreasing. <br>
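
As a rough illustration of a non-linear trend, the sketch below builds a synthetic monthly series (hypothetical data, not the dataset used later) and smooths it with a centered 12-month rolling mean. The window length of 12 is an assumption chosen to match monthly data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series (hypothetical data): a quadratic trend plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=72, freq="MS")
trend = 0.02 * np.arange(72) ** 2            # non-linear on purpose
series = pd.Series(trend + rng.normal(0, 2, 72), index=idx)

# A centered 12-month rolling mean averages out the noise and exposes the trend.
trend_estimate = series.rolling(window=12, center=True).mean()

print(trend_estimate.dropna().head())
```

The first and last few values of `trend_estimate` are `NaN` because a full 12-month window is not available at the edges.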

### 2. Seasonality
Seasonality occurs when a time series is affected by a regularly occurring pattern. <br>
- sales may pick up at every month end or weekend
- higher sales during holiday seasons

Seasonality always has a fixed, known period, whereas cyclic behaviour is driven by external events such as a business or economic cycle and has no fixed period. For example, commodity prices may crash because of a change in the underlying demand for the commodity. <br><br>
Additive seasonality - the magnitude of the seasonal fluctuation is almost constant. The trend can be positive or negative, but the size of the fluctuation around the trend stays the same. <br>
Multiplicative seasonality - the magnitude of the seasonal fluctuation increases or decreases with the level of the series.
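
The difference between the two kinds of seasonality can be sketched with synthetic data. Everything below (the linear trend, the 12-step period, the helper `yearly_swing`) is a hypothetical setup for illustration only.

```python
import numpy as np

t = np.arange(48)                        # 4 "years" of 12 steps each
trend = 100 + 2.0 * t
seasonal = np.sin(2 * np.pi * t / 12)    # pattern repeats every 12 steps

additive = trend + 10 * seasonal                 # fixed-size seasonal swing
multiplicative = trend * (1 + 0.1 * seasonal)    # swing proportional to the level

def yearly_swing(y):
    """Peak-to-trough range of the detrended values within each 12-step year."""
    detrended = (y - trend).reshape(-1, 12)
    return detrended.max(axis=1) - detrended.min(axis=1)

print(yearly_swing(additive))         # stays about the same every year
print(yearly_swing(multiplicative))   # grows as the trend level grows
```

Taking the logarithm of a multiplicative series turns the product into a sum, which is one common way to handle it with additive methods.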

### 3. Stationary
Data can be called stationary when its mean and standard deviation are constant over time and there is no seasonality or trend. <br>
The first two conditions, constant mean and constant standard deviation, are very important. <br>
If the third condition is not met, we can make the data stationary using techniques such as differencing.<br>
White noise - a series with no autocorrelation at all <br>

Typically, when a time series model fits well, the series of residual errors is stationary and looks like white noise.
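
A minimal sketch of these ideas, using simulated data rather than real model residuals: white noise has a lag-1 autocorrelation close to 0, a random walk (the running sum of white noise) is non-stationary with a lag-1 autocorrelation close to 1, and first-order differencing brings the random walk back to white noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
white_noise = pd.Series(rng.normal(0, 1, 2000))
random_walk = white_noise.cumsum()       # non-stationary: its variance grows over time

# Lag-1 autocorrelation of each series.
print(round(white_noise.autocorr(lag=1), 3))
print(round(random_walk.autocorr(lag=1), 3))

# First-order differencing removes the stochastic trend of the random walk.
differenced = random_walk.diff().dropna()
print(round(differenced.autocorr(lag=1), 3))
```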

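## Things to Remember
- Time series can be irregularly spaced
- Missing data cannot simply be filled with a global mean or median; it needs to be imputed in a way that respects the time order
- Seasonality and cyclic patterns are different and should not be confused
- Time series work is mostly data analysis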
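
To make the point about missing data concrete, the sketch below compares a global-mean fill with two order-aware alternatives on a tiny hypothetical hourly series. Which method is appropriate for a real dataset is a judgment call; this only shows how the results differ.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings with gaps.
idx = pd.date_range("2020-01-01", periods=8, freq=pd.Timedelta(hours=1))
values = pd.Series([10.0, 12.0, np.nan, np.nan, 18.0, 20.0, np.nan, 24.0], index=idx)

global_mean_fill = values.fillna(values.mean())   # ignores the time order
time_fill = values.interpolate(method="time")     # linear in time between neighbours
forward_fill = values.ffill()                     # carries the last observation forward

print(pd.DataFrame({
    "raw": values,
    "global_mean": global_mean_fill,
    "time_interp": time_fill,
    "ffill": forward_fill,
}))
```

The global mean puts the same value into every gap regardless of what happened just before or after it, while the other two methods use the neighbouring observations.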

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os

from datetime import datetime
from download import download

mpl.rcParams['figure.figsize'] = (8,6)
mpl.rcParams['axes.grid'] = False

# Time Series Data and Exploratory Analysis
- statistical techniques
- exploratory data analysis

The goal is to find patterns in the data and understand its structure.

Pattern - seasonality, trend, and whether the series is stationary or non-stationary. <br>
Many models make assumptions about linearity or stationarity, so this type of analysis is important. <br>
<br>
Structure - choosing the right imputation technique. We cannot drop rows because the order of events is very important, so missing events have to be filled using a suitable substitution technique. <br>

- univariate analysis
- multivariate analysis
- autocorrelation plots 
- lag plots, etc
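
As a hedged sketch of what lag plots and autocorrelation plots measure, the code below computes the underlying numbers on a synthetic hourly series with a 24-hour cycle (hypothetical data). The plotting equivalents in pandas are `pandas.plotting.lag_plot` and `pandas.plotting.autocorrelation_plot`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = np.arange(24 * 30)               # 30 days of hourly steps
series = pd.Series(np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.3, hours.size))

# A lag plot is a scatter of y(t) against y(t - k); its correlation summarises it.
lagged = pd.DataFrame({"y_t": series, "y_t_minus_24": series.shift(24)}).dropna()
print(lagged.corr().loc["y_t", "y_t_minus_24"])

# An autocorrelation plot shows the same correlation for many lags at once.
acf = pd.Series({lag: series.autocorr(lag=lag) for lag in range(1, 49)})
print(acf.loc[[12, 24, 36, 48]])
```

With a daily cycle, the correlation is strongly positive at lags of 24 and 48 hours and strongly negative at 12 and 36 hours.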

## Beijing Air Quality Dataset

The zip archive contains air quality data from multiple areas in Beijing. I chose to work with the data from the Dingling station, for no particular reason. <br>
The data is at hourly granularity. The target column is `PM2.5`.

In [3]:
path = download('https://archive.ics.uci.edu/ml/machine-learning-databases/00501/PRSA2017_Data_20130301-20170228.zip','./data/',kind='zip')

Creating data folder...
Downloading data from https://archive.ics.uci.edu/ml/machine-learning-databases/00501/PRSA2017_Data_20130301-20170228.zip (7.6 MB)

file_sizes: 100%|██████████████████████████| 7.96M/7.96M [01:35<00:00, 83.5kB/s]
Extracting zip file...
Successfully downloaded / unzipped to ./data/


In [4]:
df = pd.read_csv('./data/PRSA_Data_20130301-20170228/PRSA_Data_Dingling_20130301-20170228.csv',encoding='ISO-8859-1')

In [6]:
df.head()

Unnamed: 0,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,1,2013,3,1,0,4.0,4.0,3.0,,200.0,82.0,-2.3,1020.8,-19.7,0.0,E,0.5,Dingling
1,2,2013,3,1,1,7.0,7.0,3.0,,200.0,80.0,-2.5,1021.3,-19.0,0.0,ENE,0.7,Dingling
2,3,2013,3,1,2,5.0,5.0,3.0,2.0,200.0,79.0,-3.0,1021.3,-19.9,0.0,ENE,0.2,Dingling
3,4,2013,3,1,3,6.0,6.0,3.0,,200.0,79.0,-3.6,1021.8,-19.1,0.0,NNE,1.0,Dingling
4,5,2013,3,1,4,5.0,5.0,3.0,,200.0,81.0,-3.5,1022.3,-19.4,0.0,N,2.1,Dingling


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35064 entries, 0 to 35063
Data columns (total 18 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   No       35064 non-null  int64  
 1   year     35064 non-null  int64  
 2   month    35064 non-null  int64  
 3   day      35064 non-null  int64  
 4   hour     35064 non-null  int64  
 5   PM2.5    34285 non-null  float64
 6   PM10     34408 non-null  float64
 7   SO2      34334 non-null  float64
 8   NO2      33830 non-null  float64
 9   CO       33052 non-null  float64
 10  O3       33850 non-null  float64
 11  TEMP     35011 non-null  float64
 12  PRES     35014 non-null  float64
 13  DEWP     35011 non-null  float64
 14  RAIN     35013 non-null  float64
 15  wd       34924 non-null  object 
 16  WSPM     35021 non-null  float64
 17  station  35064 non-null  object 
dtypes: float64(11), int64(5), object(2)
memory usage: 4.8+ MB


In [10]:
def convert_to_date(s):
    # s is the space-joined 'year month day hour' string, e.g. '2013 3 1 0'
    return datetime.strptime(s,'%Y %m %d %H')

In [15]:
aq_df = pd.read_csv('./data/PRSA_Data_20130301-20170228/PRSA_Data_Dingling_20130301-20170228.csv',parse_dates = [['year','month','day','hour']],date_parser=convert_to_date,keep_date_col=True)
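
The `date_parser` argument and nested lists in `parse_dates` are deprecated in recent pandas releases, so the call above may warn or fail depending on the installed version. A possible alternative, sketched here on a small stand-in frame (values copied from the `df.head()` output above), is to read the file normally and assemble the timestamp with `pd.to_datetime`.

```python
import pandas as pd

# Stand-in for the first rows of the Dingling file.
raw = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [3, 3, 3],
    "day": [1, 1, 1],
    "hour": [0, 1, 2],
    "PM2.5": [4.0, 7.0, 5.0],
})

# pd.to_datetime builds timestamps from columns named year/month/day/hour.
raw["year_month_day_hour"] = pd.to_datetime(raw[["year", "month", "day", "hour"]])

print(raw.dtypes)
print(raw["year_month_day_hour"].tolist())
```

With this approach the `year`, `month`, `day` and `hour` columns keep their integer dtype, so no later conversion with `pd.to_numeric` is needed.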

In [16]:
aq_df.head()

Unnamed: 0,year_month_day_hour,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
0,2013-03-01 00:00:00,1,2013,3,1,0,4.0,4.0,3.0,,200.0,82.0,-2.3,1020.8,-19.7,0.0,E,0.5,Dingling
1,2013-03-01 01:00:00,2,2013,3,1,1,7.0,7.0,3.0,,200.0,80.0,-2.5,1021.3,-19.0,0.0,ENE,0.7,Dingling
2,2013-03-01 02:00:00,3,2013,3,1,2,5.0,5.0,3.0,2.0,200.0,79.0,-3.0,1021.3,-19.9,0.0,ENE,0.2,Dingling
3,2013-03-01 03:00:00,4,2013,3,1,3,6.0,6.0,3.0,,200.0,79.0,-3.6,1021.8,-19.1,0.0,NNE,1.0,Dingling
4,2013-03-01 04:00:00,5,2013,3,1,4,5.0,5.0,3.0,,200.0,81.0,-3.5,1022.3,-19.4,0.0,N,2.1,Dingling


In [17]:
aq_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35064 entries, 0 to 35063
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   year_month_day_hour  35064 non-null  datetime64[ns]
 1   No                   35064 non-null  int64         
 2   year                 35064 non-null  object        
 3   month                35064 non-null  object        
 4   day                  35064 non-null  object        
 5   hour                 35064 non-null  object        
 6   PM2.5                34285 non-null  float64       
 7   PM10                 34408 non-null  float64       
 8   SO2                  34334 non-null  float64       
 9   NO2                  33830 non-null  float64       
 10  CO                   33052 non-null  float64       
 11  O3                   33850 non-null  float64       
 12  TEMP                 35011 non-null  float64       
 13  PRES                 35014 non-

In [18]:
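# keep_date_col=True left the original date columns as strings (object dtype);
# convert month back to a numeric type
aq_df['month'] = pd.to_numeric(aq_df['month'])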

In [19]:
print("Rows : ",aq_df.shape[0])
print("Columns : ",aq_df.shape[1])
print("\nFeatures : \n",aq_df.columns.tolist())
print("\nMissing values : \n",aq_df.isnull().any())
print("\nUnique values : \n",aq_df.nunique())

Rows :  35064
Columns :  19

Features : 
 ['year_month_day_hour', 'No', 'year', 'month', 'day', 'hour', 'PM2.5', 'PM10', 'SO2', 'NO2', 'CO', 'O3', 'TEMP', 'PRES', 'DEWP', 'RAIN', 'wd', 'WSPM', 'station']

Missing values : 
 year_month_day_hour    False
No                     False
year                   False
month                  False
day                    False
hour                   False
PM2.5                   True
PM10                    True
SO2                     True
NO2                     True
CO                      True
O3                      True
TEMP                    True
PRES                    True
DEWP                    True
RAIN                    True
wd                      True
WSPM                    True
station                False
dtype: bool

Unique values : 
 year_month_day_hour    35064
No                     35064
year                       5
month                     12
day                       31
hour                      24
PM2.5               

In [20]:
aq_df.describe()

Unnamed: 0,No,month,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,WSPM
count,35064.0,35064.0,34285.0,34408.0,34334.0,33830.0,33052.0,33850.0,35011.0,35014.0,35011.0,35013.0,35021.0
mean,17532.5,6.52293,65.989497,83.739723,11.74965,27.585467,904.896073,68.548371,13.686111,1007.760278,1.505495,0.060366,1.853836
std,10122.249256,3.448752,72.267723,79.541685,15.519259,26.383882,903.30622,53.764424,11.365313,10.225664,13.822099,0.752899,1.309808
min,1.0,1.0,3.0,2.0,0.2856,1.0265,100.0,0.2142,-16.6,982.4,-35.1,0.0,0.0
25%,8766.75,4.0,14.0,26.0,2.0,9.0,300.0,31.0,3.4,999.3,-10.2,0.0,1.0
50%,17532.5,7.0,41.0,60.0,5.0,19.0,600.0,61.0,14.7,1007.4,1.8,0.0,1.5
75%,26298.25,10.0,93.0,117.0,15.0,38.0,1200.0,90.0,23.3,1016.0,14.2,0.0,2.3
max,35064.0,12.0,881.0,905.0,156.0,205.0,10000.0,500.0,41.4,1036.5,27.2,52.1,10.0


In [21]:
aq_df_non_indexed = aq_df.copy()

In [23]:
aq_df = aq_df.set_index('year_month_day_hour')

In [24]:
aq_df.index

DatetimeIndex(['2013-03-01 00:00:00', '2013-03-01 01:00:00',
               '2013-03-01 02:00:00', '2013-03-01 03:00:00',
               '2013-03-01 04:00:00', '2013-03-01 05:00:00',
               '2013-03-01 06:00:00', '2013-03-01 07:00:00',
               '2013-03-01 08:00:00', '2013-03-01 09:00:00',
               ...
               '2017-02-28 14:00:00', '2017-02-28 15:00:00',
               '2017-02-28 16:00:00', '2017-02-28 17:00:00',
               '2017-02-28 18:00:00', '2017-02-28 19:00:00',
               '2017-02-28 20:00:00', '2017-02-28 21:00:00',
               '2017-02-28 22:00:00', '2017-02-28 23:00:00'],
              dtype='datetime64[ns]', name='year_month_day_hour', length=35064, freq=None)
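
`freq=None` only means that pandas has not attached a frequency to the index; it does not say whether any hours are missing. The row count here (35064 = 1461 days × 24 hours) suggests the hourly grid is complete. One way to check a series like this is sketched below on a hypothetical index with one hour removed.

```python
import pandas as pd

# Hypothetical hourly index with one timestamp removed to mimic a gap.
full = pd.date_range("2013-03-01", periods=6, freq=pd.Timedelta(hours=1))
idx = full.delete(3)

# Count the distinct step sizes between consecutive timestamps.
gaps = idx.to_series().diff().value_counts()
print(gaps)

# Reindexing on the full hourly range makes the gap explicit as a missing row.
s = pd.Series(range(len(idx)), index=idx, dtype="float64")
regular = s.reindex(full)
print(regular.isna().sum())
```

If every step is exactly one hour, the index is regular and a frequency can safely be assigned.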

In [25]:
aq_df.head()

year_month_day_hour,No,year,month,day,hour,PM2.5,PM10,SO2,NO2,CO,O3,TEMP,PRES,DEWP,RAIN,wd,WSPM,station
2013-03-01 00:00:00,1,2013,3,1,0,4.0,4.0,3.0,,200.0,82.0,-2.3,1020.8,-19.7,0.0,E,0.5,Dingling
2013-03-01 01:00:00,2,2013,3,1,1,7.0,7.0,3.0,,200.0,80.0,-2.5,1021.3,-19.0,0.0,ENE,0.7,Dingling
2013-03-01 02:00:00,3,2013,3,1,2,5.0,5.0,3.0,2.0,200.0,79.0,-3.0,1021.3,-19.9,0.0,ENE,0.2,Dingling
2013-03-01 03:00:00,4,2013,3,1,3,6.0,6.0,3.0,,200.0,79.0,-3.6,1021.8,-19.1,0.0,NNE,1.0,Dingling
2013-03-01 04:00:00,5,2013,3,1,4,5.0,5.0,3.0,,200.0,81.0,-3.5,1022.3,-19.4,0.0,N,2.1,Dingling
