# Evaluating a model on time series data

In this workspace, you will load a time series dataset related to air quality. You will try out different ways of splitting this data into a training set (for model fitting) and a test set (for model evaluation on "new data"), each time observing the effect on model performance.

> In the notebook, specify random_state = 8 in the cell where it is indicated.

| Name | Type | Description |
| --- | --- | --- |
| `Xtr_one_shuf` | 2d numpy array | Training data (features) for single shuffled split. |
| `ytr_one_shuf` | 1d numpy array | Training data (target) for single shuffled split. |
| `Xts_one_shuf` | 2d numpy array | Test data (features) for single shuffled split. |
| `yts_one_shuf` | 1d numpy array | Test data (target) for single shuffled split. |
| `yts_one_shuf_pred` | 1d numpy array | Test data (target) predictions for single shuffled split. |
| `r2_one_shuf` | float | R2 score for test data for single shuffled split. |
| `Xtr_one_order` | 2d numpy array | Training data (features) for single ordered split. |
| `ytr_one_order` | 1d numpy array | Training data (target) for single ordered split. |
| `Xts_one_order` | 2d numpy array | Test data (features) for single ordered split. |
| `yts_one_order` | 1d numpy array | Test data (target) for single ordered split. |
| `yts_one_order_pred` | 1d numpy array | Test data (target) predictions for single ordered split. |
| `r2_one_order` | float | R2 score for test data for single ordered split. |
| `r2_kf_shuffle` | 1d numpy array | Test R2 score of each fold, shuffled split. |
| `r2_kf_shuffle_mean` | float | Mean R2 score across folds, shuffled split. |
| `r2_ts` | 1d numpy array | Test R2 score of each fold, time series split. |
| `r2_ts_mean` | float | Mean R2 score across folds, time series split. |

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, TimeSeriesSplit, KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

In this notebook, we will load a dataset of sensor readings and reference concentrations of various compounds in the air. The data was collected over a period of around one year (2004-2005). The columns in the dataset include:

1. Date: the date of the observation
2. Time: the time of the observation
3. CO(GT): the ground truth of the carbon monoxide level
4. PT08.S1(CO): the level of carbon monoxide observed by the sensor
5. NMHC(GT): the ground truth of the non-methane hydrocarbon level
6. C6H6(GT): the ground truth of the benzene level
7. PT08.S2(NMHC): the level of non-methane hydrocarbons observed by the sensor
8. NOx(GT): the ground truth of the nitrogen oxides level
9. PT08.S3(NOx): the level of nitrogen oxides observed by the sensor
10. NO2(GT): the ground truth of the nitrogen dioxide level
11. PT08.S4(NO2): the level of nitrogen dioxide observed by the sensor
12. PT08.S5(O3): the level of ozone observed by the sensor
13. T: the temperature
14. RH: the relative humidity
15. AH: the absolute humidity


This dataset may be used for several regression tasks. In this notebook, we will use linear regression to predict the NO2 (nitrogen dioxide) level, given the values in the weather-related columns (`T`, `RH`, `AH`), the hour of day, and the day of the week.

First, we read in the data.  Note the following data processing steps that are included in the cell below:

* This file includes some numeric values that use the comma `,` in place of a period `.` to denote decimals. `pandas` has a `decimal` argument to support this variation. 
* Some columns and rows are entirely empty; we will drop these. We will also drop rows that have a `-200` in the target variable, since according to the data dictionary, this value indicates a missing measurement.
* We will convert the `Date` and `Time` columns to a single `DateTime`, and we will make sure the data is then sorted by this `DateTime`.
* We will add an hour-of-day feature. Since the hour is cyclical, we will encode it as two columns, `HourSin` and `HourCos`, so that 23:00 is as close to 00:00 as it is to 22:00. (This is a common approach for cyclical features.)
* We will also add a `Weekday` feature.
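
To illustrate the cyclical encoding described in the list above, here is a minimal sketch. The helper names `encode_hour` and `encoded_distance` are hypothetical and are not part of the notebook; they only mirror the `sin`/`cos` formula applied to the hour of day.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour of day (0-23) to a point on the unit circle (hypothetical helper)."""
    angle = hour * (2.0 * np.pi / 24)
    return np.array([np.sin(angle), np.cos(angle)])

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) encoding."""
    return np.linalg.norm(encode_hour(h1) - encode_hour(h2))

d_23_00 = encoded_distance(23, 0)   # 23:00 vs. 00:00
d_23_22 = encoded_distance(23, 22)  # 23:00 vs. 22:00

# In the encoded space, the two distances are the same
print(d_23_00, d_23_22)
# With the raw hour values, 23:00 would look far from 00:00
print(abs(23 - 0), abs(23 - 22))
```

With the raw hour as a single feature, a linear model would treat 23:00 and 00:00 as the two most distant hours of the day; the two-column encoding removes that artificial jump at midnight.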

In [2]:
df = pd.read_csv('AirQualityUCI.csv', sep=";", decimal=",")

# drop columns and rows that are entirely empty
df.dropna(how="all", axis=1, inplace=True)
df.dropna(how="all", axis=0, inplace=True)
# in this data, a -200 value indicates a missing value - drop rows where the target is missing
# (.copy() avoids a pandas SettingWithCopyWarning when new columns are added below)
df = df[df["NO2(GT)"] != -200].copy()

# create DateTime out of Date and Time
df["DateTime"] = pd.to_datetime(df['Date'] + ' ' + df['Time'], format="%d/%m/%Y %H.%M.%S")
# set the DateTime column as the index
df = df.set_index("DateTime")
# drop the Date and Time columns
df = df.drop(["Date", "Time"], axis=1)
# sort by DateTime
df = df.sort_index()

# add Hour feature
df['HourSin'] = np.sin(df.index.hour*(2.*np.pi/24))
df['HourCos'] = np.cos(df.index.hour*(2.*np.pi/24))
# add Weekday feature
df['Weekday'] = df.index.weekday

In [3]:
df.head()

| DateTime | CO(GT) | PT08.S1(CO) | NMHC(GT) | C6H6(GT) | PT08.S2(NMHC) | NOx(GT) | PT08.S3(NOx) | NO2(GT) | PT08.S4(NO2) | PT08.S5(O3) | T | RH | AH | HourSin | HourCos | Weekday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2004-03-10 18:00:00 | 2.6 | 1360.0 | 150.0 | 11.9 | 1046.0 | 166.0 | 1056.0 | 113.0 | 1692.0 | 1268.0 | 13.6 | 48.9 | 0.7578 | -1.0 | -1.83697e-16 | 2 |
| 2004-03-10 19:00:00 | 2.0 | 1292.0 | 112.0 | 9.4 | 955.0 | 103.0 | 1174.0 | 92.0 | 1559.0 | 972.0 | 13.3 | 47.7 | 0.7255 | -0.965926 | 0.258819 | 2 |
| 2004-03-10 20:00:00 | 2.2 | 1402.0 | 88.0 | 9.0 | 939.0 | 131.0 | 1140.0 | 114.0 | 1555.0 | 1074.0 | 11.9 | 54.0 | 0.7502 | -0.866025 | 0.5 | 2 |
| 2004-03-10 21:00:00 | 2.2 | 1376.0 | 80.0 | 9.2 | 948.0 | 172.0 | 1092.0 | 122.0 | 1584.0 | 1203.0 | 11.0 | 60.0 | 0.7867 | -0.707107 | 0.7071068 | 2 |
| 2004-03-10 22:00:00 | 1.6 | 1272.0 | 51.0 | 6.5 | 836.0 | 131.0 | 1205.0 | 116.0 | 1490.0 | 1110.0 | 11.2 | 59.6 | 0.7888 | -0.5 | 0.8660254 | 2 |


Next, we will prepare the feature and target variables. We will train a linear regression model to predict the reference (ground truth) nitrogen dioxide level. This data is stored in the `NO2(GT)` column, so it will be your target variable. We also specify the columns that we will use as features.

In [4]:
X = df[["T", "RH", "AH", "HourSin", "HourCos", "Weekday"]].values
y = df["NO2(GT)"].values

Now, we will evaluate several ways to divide the data into a training set (for fitting the parameters of the linear regression model) and a test set (for evaluating the model).

### Single split - random shuffle

First, we'll try a single split, and we will shuffle the data when distributing it into training and test sets.

In the following cell, set the `random_state` variable to the value given on the question page.

In [5]:
random_state = 8

Next, use the `sklearn` implementation of `train_test_split` to split `X` and `y`, and save the results in: `Xtr_one_shuf`, `Xts_one_shuf`, `ytr_one_shuf`, `yts_one_shuf`.

Use 1/5 of the data for the test set, and 4/5 for the training set. Also, use the random state in the variable you just defined, so that your result will match the graders'.



In [6]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)
Xtr_one_shuf, Xts_one_shuf, ytr_one_shuf, yts_one_shuf = train_test_split(X, y, test_size=1/5, random_state=random_state)

In the following cell, train a linear regression model on the shuffled training data, then evaluate the R2 score of the model on the shuffled test data. Save this R2 value in `r2_one_shuf`. Also, save the model predictions on the test data in `yts_one_shuf_pred`.

In [7]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)
model = LinearRegression()
model.fit(Xtr_one_shuf, ytr_one_shuf)
yts_one_shuf_pred = model.predict(Xts_one_shuf)
r2_one_shuf = r2_score(yts_one_shuf, yts_one_shuf_pred)


In [8]:
print(r2_one_shuf)

0.3147067763912289


### Single split - sorted data, no shuffle

Next, we'll try a single split again, but this time we will specify the value of the `shuffle` argument to `train_test_split` in order to *not* shuffle the data when distributing it into training and test sets. Since the data is sorted by `DateTime`, this means that the earlier values will be in the training set, and the last values will be in the test set.

Again, use 1/5 of the data for the test set, and 4/5 for the training set. 

Save the results in: `Xtr_one_order`, `Xts_one_order`, `ytr_one_order`, `yts_one_order`.
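
As a quick illustration of what `shuffle=False` does, the sketch below splits a small toy array instead of the air quality data. The names `X_toy_order` and `y_toy_order` are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: ten samples whose values equal their position in time
X_toy_order = np.arange(10).reshape(-1, 1)
y_toy_order = np.arange(10)

# Without shuffling, the split is a simple cut: first 4/5 train, last 1/5 test
Xtr_toy, Xts_toy, ytr_toy, yts_toy = train_test_split(
    X_toy_order, y_toy_order, test_size=1/5, shuffle=False
)

print(ytr_toy)  # the eight earliest samples
print(yts_toy)  # the two latest samples
```

Because the rows keep their original order, every test sample comes later in time than every training sample, which is the situation a model faces when it is used to predict future values.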


In [9]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)
Xtr_one_order, Xts_one_order, ytr_one_order, yts_one_order = train_test_split(X, y, test_size=1/5, shuffle=False)

In the following cell, train a linear regression model on the ordered training data, then evaluate the R2 score of the model on the ordered test data. Save this R2 value in `r2_one_order`. Also, save the model predictions on the test set in `yts_one_order_pred`. 

In [10]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)
model = LinearRegression()
model.fit(Xtr_one_order, ytr_one_order)
yts_one_order_pred = model.predict(Xts_one_order)
r2_one_order = r2_score(yts_one_order, yts_one_order_pred)


In [11]:
print(r2_one_order)

-0.03077375532439519
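
The score above is negative, which may look surprising. The sketch below, on toy values rather than the air quality data, recomputes R2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The helper `r2_manual` is hypothetical and is only compared against `r2_score` for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_mean_pred = np.full(4, y_true_toy.mean())   # always predict the mean
y_bad_pred = np.array([4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean

r2_mean = r2_manual(y_true_toy, y_mean_pred)
r2_bad = r2_manual(y_true_toy, y_bad_pred)
print(r2_mean, r2_score(y_true_toy, y_mean_pred))
print(r2_bad, r2_score(y_true_toy, y_bad_pred))
```

Under this definition, a score below zero means the predictions on the test set had a larger squared error than a constant prediction equal to the mean of the test targets.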


### Multiple splits - random shuffle

You might be concerned that your model training and validation use only a single split of the data, and that this split may not be representative. To address this concern, we can use K-fold cross validation: not for model selection, but just to observe the results of this splitting and training process across different splits.

We will use the `sklearn` library's `KFold` implementation, with K=5 (five splits of `X` and `y`).

In the following cell, define a `KFold` with 5 splits, the random state set by the variable you defined earlier, and with the data shuffled. Then, iterate over the folds and in each iteration:

* train a linear regression model on the training data for this fold
* compute the R2 score of the model on the test data for this fold, and save the result in the appropriate element of `r2_kf_shuffle`

Finally, compute the mean R2 score across all folds and save the result in `r2_kf_shuffle_mean`.
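
Before running this on the real data, it may help to see which indices a shuffled `KFold` produces. This is a minimal sketch on ten toy samples; `X_toy_kf`, `kf_toy`, and `test_folds` are hypothetical names used only here.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, indexed 0..9 in time order
X_toy_kf = np.arange(10).reshape(-1, 1)

kf_toy = KFold(n_splits=5, shuffle=True, random_state=8)

test_folds = []
for fold, (idx_tr, idx_ts) in enumerate(kf_toy.split(X_toy_kf)):
    # Each fold holds out two samples, drawn from anywhere in the sequence
    print(fold, "train:", idx_tr, "test:", idx_ts)
    test_folds.append(idx_ts)

# Every sample is used for testing exactly once across the five folds
all_test = np.sort(np.concatenate(test_folds))
print(all_test)
```

Note that with shuffling, the held-out samples in a fold are generally surrounded in time by training samples, just as in the single shuffled split.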

In [12]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)

# prepare an array for holding the results
n_fold = 5
r2_kf_shuffle = np.zeros(shape=(n_fold,))

# Define a KFold CV with shuffle
kf = KFold(n_splits=n_fold, shuffle=True, random_state=random_state)
           
for i, idx in enumerate(kf.split(X)):
    idx_tr, idx_ts = idx
    X_train_kfold = X[idx_tr]
    y_train_kfold = y[idx_tr]
    X_test_kfold = X[idx_ts]
    y_test_kfold = y[idx_ts]

    model = LinearRegression()
    model.fit(X_train_kfold, y_train_kfold)

    y_pred_kfold = model.predict(X_test_kfold)
    r2_kf_shuffle[i] = r2_score(y_test_kfold, y_pred_kfold)
    
r2_kf_shuffle_mean = np.mean(r2_kf_shuffle)


In [13]:
print(r2_kf_shuffle)
print(r2_kf_shuffle_mean)

[0.31470678 0.30545552 0.32868634 0.34387973 0.31956869]
0.322459411332363


### Multiple splits - time series

Finally, we'll repeat the multi-split evaluation, but using the `sklearn` library's [`TimeSeriesSplit`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html). Here is the description from the module documentation:

> Provides train/test indices to split time series data samples that are observed at fixed time intervals, in train/test sets. In each split, test indices must be higher than before, and thus shuffling in cross validator is inappropriate.
> 
> This cross-validation object is a variation of KFold. In the kth split, it returns first k folds as train set and the (k+1)th fold as test set.
> 
> Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them.


In the following cell, define a `TimeSeriesSplit` with 5 splits. Then, iterate over the splits and in each iteration:

* train a linear regression model on the training data for this fold
* compute the R2 score of the model on the test data for this fold, and save the result in the appropriate element of `r2_ts`

Finally, compute the mean R2 score across all folds and save the result in `r2_ts_mean`.
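
To make the documentation excerpt concrete, the sketch below prints the indices that `TimeSeriesSplit` generates for twelve toy samples. The names `X_demo`, `ts_demo`, and `splits_demo` are hypothetical and are not part of the graded cell.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: twelve samples, indexed 0..11 in time order
X_demo = np.arange(12).reshape(-1, 1)

ts_demo = TimeSeriesSplit(n_splits=5)
splits_demo = [(tr.tolist(), te.tolist()) for tr, te in ts_demo.split(X_demo)]

for k, (tr, te) in enumerate(splits_demo):
    # The training set grows with each split; the test set always comes after it
    print(k, "train:", tr, "test:", te)
```

In every split, all training indices precede all test indices, and each training set contains the previous one. The earliest splits therefore train on much less data than the later ones.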

In [14]:
#grade (enter your code in this cell - DO NOT DELETE THIS LINE)

# prepare an array for holding the results
n_fold = 5
r2_ts = np.zeros(shape=(n_fold,))

# Define a TimeSeriesSplit 
ts = TimeSeriesSplit(n_splits=n_fold)

for i, idx in enumerate(ts.split(X)):
    idx_tr, idx_ts = idx
    X_train = X[idx_tr]
    y_train = y[idx_tr]
    X_test = X[idx_ts]
    y_test = y[idx_ts]

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    r2_ts[i] = r2_score(y_test, y_pred)

r2_ts_mean = np.mean(r2_ts)

In [15]:
print(r2_ts)
print(r2_ts_mean)

[ 0.22619877  0.11744013  0.1668679  -0.19125143  0.0063742 ]
0.06512591500308551
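
As an aside, the per-split loop can be written more compactly with `cross_val_score`, which accepts a splitter such as `TimeSeriesSplit` through its `cv` argument. This is a sketch on synthetic data (`X_syn` and `y_syn` are made-up stand-ins, not the air quality data), so the scores it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic regression data, used only to compare the two approaches
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(120, 3))
y_syn = X_syn @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

cv = TimeSeriesSplit(n_splits=5)

# Compact version: one R2 score per split
scores_cv = cross_val_score(LinearRegression(), X_syn, y_syn, cv=cv, scoring="r2")

# Explicit loop, in the style used in this notebook
scores_loop = np.array([
    r2_score(
        y_syn[idx_ts],
        LinearRegression().fit(X_syn[idx_tr], y_syn[idx_tr]).predict(X_syn[idx_ts]),
    )
    for idx_tr, idx_ts in cv.split(X_syn)
])

print(scores_cv)
print(scores_loop)
```

The explicit loop remains useful when you also want to keep the fitted models or the predictions for each split.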
