# Develop ML Models with scikit-learn
In this tutorial, we create several machine learning models using the scikit-learn library and apply them to a supervised regression problem.

### 1. Jupyter Environment
Let's start by getting familiar with the Jupyter environment and a few simple tricks.

In [None]:
# Use this cell for some simple commands.
# Press ctrl+enter to execute a cell
# Use shift+enter to execute a cell and move on to the next cell
a = 1
print(a)

### 2. Import necessary packages
Now that we are familiar with the Jupyter environment, let's continue by importing some necessary packages.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 3. Download a sample dataset
Next, we will download a sample data set. The data set we will be using is the "Appliances Energy Prediction Dataset".

More information about this data set is available here:
https://archive.ics.uci.edu/ml/datasets/Appliances+energy+prediction

Attribute Information:

date, time in year-month-day hour:minute:second format<br>
Appliances, energy use in Wh<br>
lights, energy use of light fixtures in the house in Wh<br>
T1, Temperature in kitchen area, in Celsius<br>
RH_1, Humidity in kitchen area, in %<br>
T2, Temperature in living room area, in Celsius<br>
RH_2, Humidity in living room area, in %<br>
T3, Temperature in laundry room area, in Celsius<br>
RH_3, Humidity in laundry room area, in %<br>
T4, Temperature in office room, in Celsius<br>
RH_4, Humidity in office room, in %<br>
T5, Temperature in bathroom, in Celsius<br>
RH_5, Humidity in bathroom, in %<br>
T6, Temperature outside the building (north side), in Celsius<br>
RH_6, Humidity outside the building (north side), in %<br>
T7, Temperature in ironing room, in Celsius<br>
RH_7, Humidity in ironing room, in %<br>
T8, Temperature in teenager room 2, in Celsius<br>
RH_8, Humidity in teenager room 2, in %<br>
T9, Temperature in parents room, in Celsius<br>
RH_9, Humidity in parents room, in %<br>
T_out, Temperature outside (from Chievres weather station), in Celsius<br>
Press_mm_hg, Pressure (from Chievres weather station), in mm Hg<br>
RH_out, Humidity outside (from Chievres weather station), in %<br>
Windspeed, Wind speed (from Chievres weather station), in m/s<br>
Visibility (from Chievres weather station), in km<br>
Tdewpoint, Dew point (from Chievres weather station), in °C<br>
rv1, Random variable 1, nondimensional<br>
rv2, Random variable 2, nondimensional<br>

In [None]:
# Let's download a sample dataset as a pandas dataframe
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00374/energydata_complete.csv")

# Print a few rows of the data, complete the following line:
df...

In [None]:
# How many samples do we have in this data set? Complete the following line
print("Total number of samples: ", df...)

# Let's visualize some of the data
n_samples = 1000
feature_name = "T9"
target_name = "T2"

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()
ax1.plot(df[feature_name].values[:n_samples], 'b-')
ax2.plot(df[target_name].values[:n_samples], 'g-')
ax1.set_xlabel('Samples')
ax1.set_ylabel(feature_name, color='b')
ax2.set_ylabel(target_name, color='g')
plt.show()

### 4. Create input and output
Next, we extract the inputs and outputs from the dataframe. We will use the living room temperature (`T2`) as the target. To speed up training, we will use only a subset of the available features.
We will also exclude some of the other indoor temperatures, as they may be highly correlated with the living room temperature.

In [None]:
features_to_use = ["lights", # energy use of light fixtures in the house in Wh
                   "T4", # Temperature in office room, in Celsius
                   "T6", # Temperature outside the building (north side), in Celsius
                   "T7", # Temperature in ironing room, in Celsius
                   "T8", # Temperature in teenager room 2, in Celsius
                   "T9", # Temperature in parents room, in Celsius
                   "T_out", # Temperature outside (from Chievres weather station), in Celsius
                   "Press_mm_hg", # (from Chievres weather station), in mm Hg
                   "RH_out", # Humidity outside (from Chievres weather station), in %
                   "Windspeed", # Windspeed (from Chievres weather station), in m/s
                   "Visibility", # Visibility (from Chievres weather station), in km
                   "Tdewpoint" # Dew point (from Chievres weather station), in °C
                  ]
# Grab a portion of the data to make training and testing faster
samples = 6000
data = df[features_to_use].values[:samples, :]
target = df[target_name].values[:samples]
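To see why some temperature columns are left out, it can help to check how strongly a candidate feature correlates with the target. The snippet below is a minimal sketch on synthetic data: `demo_df` and its values are made up for illustration and only mimic the column names of the real data set.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataframe (hypothetical values, not the UCI data)
rng = np.random.default_rng(0)
base = rng.normal(20.0, 2.0, size=500)
demo_df = pd.DataFrame({
    "T2": base,                                    # pretend target
    "T1": base + rng.normal(0.0, 0.2, size=500),   # nearly a copy of the target
    "Windspeed": rng.normal(4.0, 1.5, size=500),   # unrelated to the target
})

# Pearson correlation of every other column with the target column
corr_with_target = demo_df.corr()["T2"].drop("T2")
print(corr_with_target.sort_values(ascending=False))
```

A feature that is almost a copy of the target makes the regression problem trivial, which is one reason to drop it. The same `corr()` call could be applied to the real `df` to inspect the actual correlations.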

### 5. Split the data into train, validation, test
To train a model and evaluate its performance, we divide the data into training, validation, and test sets.

We will use the training and validation sets to design the architecture, train the model, and optimize the hyperparameters. We then use the test set, which the model has never seen, to report the final error.

In [None]:
# Import Scikit-learn data splitting functions, complete the following line
from sklearn.model_selection import ...

# Determine train test splits
test_ratio = 0.2

# Split the data into training and testing
x_trn, x_tst, y_trn, y_tst = train_test_split(data, target, test_size=test_ratio, shuffle=True, random_state=0)

# Split the training data into training and validation
x_trn, x_vld, y_trn, y_vld = train_test_split(x_trn, y_trn, test_size=test_ratio, shuffle=True, random_state=0)

# Print how many samples we have in each set, complete the following lines
print("Number of samples in the training set: ", ...)
print("Number of samples in the validation set: ", ...)
print("Number of samples in the test set: ", ...)
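Because the second split is applied to what remains after the first one, the validation set ends up being about 16% of the data (20% of the remaining 80%), not 20%. The sketch below illustrates this on synthetic arrays; all names ending in `_demo` are hypothetical and are not part of the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic arrays with the same number of rows as the subset used in this tutorial
n_total = 6000
x_demo = np.arange(n_total * 2, dtype=float).reshape(n_total, 2)
y_demo = np.arange(n_total, dtype=float)

test_ratio = 0.2

# First split: hold out the test set
x_rest, x_tst_demo, y_rest, y_tst_demo = train_test_split(
    x_demo, y_demo, test_size=test_ratio, shuffle=True, random_state=0)

# Second split: carve the validation set out of the remaining data
x_trn_demo, x_vld_demo, y_trn_demo, y_vld_demo = train_test_split(
    x_rest, y_rest, test_size=test_ratio, shuffle=True, random_state=0)

for name, part in [("train", x_trn_demo), ("validation", x_vld_demo), ("test", x_tst_demo)]:
    print(f"{name}: {len(part)} samples ({len(part) / n_total:.0%} of the data)")
```

Fixing `random_state` makes the split reproducible, so the reported errors do not change between runs.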

### 6. Normalize the Data
Most of the time, we should "prepare" our data and make it ready for model development. The preparation might include dealing with missing data, normalization, etc.

Here, we will normalize the data. Can you explain why we need to normalize the data?

In [None]:
# Normalize the data, complete the following lines
mean = x_trn...
std = x_trn...
x_trn = ...
x_vld = ...
x_tst = ...
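One common choice is z-score normalization: subtract the per-feature mean and divide by the per-feature standard deviation, both computed on the training set only. The following is a sketch on synthetic data, assuming two made-up features on very different scales; the `_demo` names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic features on very different scales (pressure-like and ratio-like)
x_trn_demo = np.column_stack([rng.normal(750.0, 5.0, 200), rng.normal(0.5, 0.1, 200)])
x_new_demo = np.column_stack([rng.normal(750.0, 5.0, 50), rng.normal(0.5, 0.1, 50)])

# Statistics are computed per feature (axis=0) from the training data only
mean_demo = x_trn_demo.mean(axis=0)
std_demo = x_trn_demo.std(axis=0)

x_trn_scaled = (x_trn_demo - mean_demo) / std_demo
x_new_scaled = (x_new_demo - mean_demo) / std_demo  # reuse the training statistics

# After scaling, each training feature has mean ~0 and standard deviation ~1
print(x_trn_scaled.mean(axis=0).round(3), x_trn_scaled.std(axis=0).round(3))
```

Reusing the training statistics for the validation and test sets keeps information about unseen data from leaking into the model.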

### 7. Train ML models
Now that we have prepared the data and split it into train, validation, and test sets, we can train ML models.

Several different models are available in scikit-learn. We will start with a simple linear regression model, and we will look at other regressors as well.

In [None]:
# Import the model from scikit-learn package.
from sklearn.linear_model import LinearRegression

# Create an instance of the model, complete the following line
reg = ...

# Train the model, complete the following line
reg.fit(..., ...)

# Calculate the training error and print, complete the following lines
y_trn_prd = reg.predict(x_trn)
trn_error = np.mean(np.abs(... - ...))
print(...)

# Calculate the validation error, complete the following lines
y_vld_prd = reg.predict(...)
vld_error = np.mean(np.abs(... - ...))
print(...)
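The error used in this cell, the mean of the absolute differences between targets and predictions, is the mean absolute error (MAE). As a sketch on synthetic data (the `_demo` names and the coefficients are made up), the manual NumPy expression can be checked against scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic linear problem with a little noise (hypothetical data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(0.0, 0.1, size=300)

# Fit a linear model and predict on the same data
demo_reg = LinearRegression().fit(x_demo, y_demo)
y_demo_prd = demo_reg.predict(x_demo)

# MAE computed by hand and with scikit-learn's helper
mae_manual = np.mean(np.abs(y_demo - y_demo_prd))
mae_sklearn = mean_absolute_error(y_demo, y_demo_prd)
print(mae_manual, mae_sklearn)
```

MAE is expressed in the same unit as the target, so for this tutorial's target it can be read directly as an average error in degrees Celsius.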

### 8. Testing the Model
Once we have trained the model and finalized the parameters, we can see how it performs on our test set.

In [None]:
# Once we have decided on the parameters, we can print the test error
y_tst_prd = reg.predict(x_tst)
tst_error = np.mean(np.abs(y_tst - y_tst_prd))
print(tst_error)

In [None]:
# Make predictions on the full data subset (normalized with the training statistics) and compare them with the target
target_prd = reg.predict((data-mean)/std)
samples_to_plot = 1000
plt.figure(figsize=(10, 4))
plt.plot(target[:samples_to_plot])
plt.plot(target_prd[:samples_to_plot], "--")
plt.show()

In [None]:
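# We can also take a look at feature importance, which shows how much each feature contributes to the final
# prediction. Tree-based regressors expose `feature_importances_`; linear models such as LinearRegression
# expose `coef_` instead (comparable across features here because the inputs are normalized).
importance = reg.feature_importances_ if hasattr(reg, "feature_importances_") else reg.coef_
feature_imp_df = pd.DataFrame(data={"Name": features_to_use, "Importance": importance})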
feature_imp_df

### What to try next
If you would like to explore further, here are some things to try:
- Try the model without normalization and see if it affects the result
- Read about and try other types of regressors in scikit-learn (SVR, GradientBoostingRegressor, ExtraTreesRegressor, etc.)
- Read about and try neural-network-based approaches (ANN, RNN, CNN)
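
As a starting point for trying other regressors, the sketch below fits several scikit-learn models on a synthetic nonlinear problem and compares their validation MAE. The data, the `candidates` dictionary, and the hyperparameters are illustrative assumptions rather than tuned choices; on the real data you would use the normalized `x_trn` and `x_vld` instead.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic nonlinear regression problem (hypothetical data)
rng = np.random.default_rng(0)
x_demo = rng.uniform(-2.0, 2.0, size=(400, 3))
y_demo = np.sin(x_demo[:, 0]) + 0.5 * x_demo[:, 1] ** 2 + rng.normal(0.0, 0.05, size=400)
x_a, x_b, y_a, y_b = train_test_split(x_demo, y_demo, test_size=0.2, random_state=0)

# All scikit-learn regressors share the same fit/predict interface
candidates = {
    "LinearRegression": LinearRegression(),
    "SVR": SVR(),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

vld_errors = {}
for name, model in candidates.items():
    model.fit(x_a, y_a)
    vld_errors[name] = np.mean(np.abs(y_b - model.predict(x_b)))
    print(f"{name}: validation MAE = {vld_errors[name]:.3f}")
```

Because every regressor follows the same `fit` and `predict` interface, swapping models in the training cell only requires changing the line that creates `reg`.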
