# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modeling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [2]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Libraries for data preparation and model building
import scipy as sp
import statsmodels as sm
import sklearn.model_selection as skl
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

# Setting global constants to ensure notebook results are reproducible
PARAMETER_CONSTANT = 42

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [3]:
train_data = pd.read_csv('df_train.csv') # load the train data
test_data = pd.read_csv('df_test.csv')   # load the test data

In [4]:
train_data.columns # overview of train_data columns

Index(['Unnamed: 0', 'time', 'Madrid_wind_speed', 'Valencia_wind_deg',
       'Bilbao_rain_1h', 'Valencia_wind_speed', 'Seville_humidity',
       'Madrid_humidity', 'Bilbao_clouds_all', 'Bilbao_wind_speed',
       'Seville_clouds_all', 'Bilbao_wind_deg', 'Barcelona_wind_speed',
       'Barcelona_wind_deg', 'Madrid_clouds_all', 'Seville_wind_speed',
       'Barcelona_rain_1h', 'Seville_pressure', 'Seville_rain_1h',
       'Bilbao_snow_3h', 'Barcelona_pressure', 'Seville_rain_3h',
       'Madrid_rain_1h', 'Barcelona_rain_3h', 'Valencia_snow_3h',
       'Madrid_weather_id', 'Barcelona_weather_id', 'Bilbao_pressure',
       'Seville_weather_id', 'Valencia_pressure', 'Seville_temp_max',
       'Madrid_pressure', 'Valencia_temp_max', 'Valencia_temp',
       'Bilbao_weather_id', 'Seville_temp', 'Valencia_humidity',
       'Valencia_temp_min', 'Barcelona_temp_max', 'Madrid_temp_max',
       'Barcelona_temp', 'Bilbao_temp_min', 'Bilbao_temp',
       'Barcelona_temp_min', 'Bilbao_temp_max', 'Sev

In [5]:
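# Preview train_data: accessing `.info` without parentheses shows the bound
# method, whose repr includes the DataFrame itself; call train_data.info()
# for the column dtype summary
train_data.info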

<bound method DataFrame.info of       Unnamed: 0                 time  Madrid_wind_speed Valencia_wind_deg  \
0              0  2015-01-01 03:00:00           0.666667           level_5   
1              1  2015-01-01 06:00:00           0.333333          level_10   
2              2  2015-01-01 09:00:00           1.000000           level_9   
3              3  2015-01-01 12:00:00           1.000000           level_8   
4              4  2015-01-01 15:00:00           1.000000           level_7   
...          ...                  ...                ...               ...   
8758        8758  2017-12-31 09:00:00           1.000000           level_6   
8759        8759  2017-12-31 12:00:00           5.000000           level_6   
8760        8760  2017-12-31 15:00:00           6.333333           level_9   
8761        8761  2017-12-31 18:00:00           7.333333           level_8   
8762        8762  2017-12-31 21:00:00           4.333333           level_9   

      Bilbao_rain_1h  Valencia_

In [8]:
#checking for missing values

train_data.isnull().T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,8753,8754,8755,8756,8757,8758,8759,8760,8761,8762
Unnamed: 0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
time,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Madrid_wind_speed,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Valencia_wind_deg,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Bilbao_rain_1h,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Valencia_wind_speed,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Seville_humidity,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Madrid_humidity,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Bilbao_clouds_all,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Bilbao_wind_speed,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [9]:
train_data.isnull().sum()

Unnamed: 0                 0
time                       0
Madrid_wind_speed          0
Valencia_wind_deg          0
Bilbao_rain_1h             0
Valencia_wind_speed        0
Seville_humidity           0
Madrid_humidity            0
Bilbao_clouds_all          0
Bilbao_wind_speed          0
Seville_clouds_all         0
Bilbao_wind_deg            0
Barcelona_wind_speed       0
Barcelona_wind_deg         0
Madrid_clouds_all          0
Seville_wind_speed         0
Barcelona_rain_1h          0
Seville_pressure           0
Seville_rain_1h            0
Bilbao_snow_3h             0
Barcelona_pressure         0
Seville_rain_3h            0
Madrid_rain_1h             0
Barcelona_rain_3h          0
Valencia_snow_3h           0
Madrid_weather_id          0
Barcelona_weather_id       0
Bilbao_pressure            0
Seville_weather_id         0
Valencia_pressure       2068
Seville_temp_max           0
Madrid_pressure            0
Valencia_temp_max          0
Valencia_temp              0
Bilbao_weather

In [10]:
#checking non-missing values
train_data.notnull()

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,...,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
1,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
3,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8758,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
8759,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
8760,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True
8761,True,True,True,True,True,True,True,True,True,True,...,True,True,True,True,True,True,True,True,True,True


In [11]:
# the shape of train_data  

train_data.shape

(8763, 49)

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [12]:
# Preview the first three rows of the train data
train_data.head(3)

Unnamed: 0.1,Unnamed: 0,time,Madrid_wind_speed,Valencia_wind_deg,Bilbao_rain_1h,Valencia_wind_speed,Seville_humidity,Madrid_humidity,Bilbao_clouds_all,Bilbao_wind_speed,...,Madrid_temp_max,Barcelona_temp,Bilbao_temp_min,Bilbao_temp,Barcelona_temp_min,Bilbao_temp_max,Seville_temp_min,Madrid_temp,Madrid_temp_min,load_shortfall_3h
0,0,2015-01-01 03:00:00,0.666667,level_5,0.0,0.666667,74.333333,64.0,0.0,1.0,...,265.938,281.013,269.338615,269.338615,281.013,269.338615,274.254667,265.938,265.938,6715.666667
1,1,2015-01-01 06:00:00,0.333333,level_10,0.0,1.666667,78.333333,64.666667,0.0,1.0,...,266.386667,280.561667,270.376,270.376,280.561667,270.376,274.945,266.386667,266.386667,4171.666667
2,2,2015-01-01 09:00:00,1.0,level_9,0.0,1.0,71.333333,64.333333,0.0,1.0,...,272.708667,281.583667,275.027229,275.027229,281.583667,275.027229,278.792,272.708667,272.708667,4274.666667


In [13]:
# View the first ten rows of the train data transposed.
# `.T` (transpose) swaps rows and columns, so each feature is shown as a row.
train_data.head(10).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
Unnamed: 0,0,1,2,3,4,5,6,7,8,9
time,2015-01-01 03:00:00,2015-01-01 06:00:00,2015-01-01 09:00:00,2015-01-01 12:00:00,2015-01-01 15:00:00,2015-01-01 18:00:00,2015-01-01 21:00:00,2015-01-02 00:00:00,2015-01-02 03:00:00,2015-01-02 06:00:00
Madrid_wind_speed,0.666667,0.333333,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
Valencia_wind_deg,level_5,level_10,level_9,level_8,level_7,level_7,level_8,level_9,level_9,level_9
Bilbao_rain_1h,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Valencia_wind_speed,0.666667,1.666667,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.333333
Seville_humidity,74.333333,78.333333,71.333333,65.333333,59.0,69.666667,83.666667,83.666667,86.0,87.0
Madrid_humidity,64.0,64.666667,64.333333,56.333333,57.0,67.333333,63.333333,64.0,63.333333,63.666667
Bilbao_clouds_all,0.0,0.0,0.0,0.0,2.0,12.333333,16.333333,8.666667,5.333333,15.333333
Bilbao_wind_speed,1.0,1.0,1.0,1.0,0.333333,0.666667,1.0,1.333333,1.0,1.0


In [14]:
#checking the shape of the data
train_data.shape

(8763, 49)

In [None]:
# look at data statistics

In [None]:
# plot relevant feature interactions

In [None]:
# evaluate correlation
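
One way to approach the correlation step is to rank features by the absolute value of their correlation with the target. The sketch below uses a synthetic frame rather than the real data; `feature_a`, `feature_b` and `target` are hypothetical names, and in this notebook the target would be `load_shortfall_3h`.

```python
import numpy as np
import pandas as pd

# Synthetic data: feature_a drives the target, feature_b is unrelated noise.
rng = np.random.default_rng(42)
n = 200
feature_a = rng.normal(size=n)
feature_b = rng.normal(size=n)
toy = pd.DataFrame({
    "feature_a": feature_a,
    "feature_b": feature_b,
    "target": 3.0 * feature_a + rng.normal(scale=0.5, size=n),
})

# Correlate every numeric column with the target, drop the target's
# self-correlation, and sort by absolute strength.
corr_with_target = (
    toy.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top, which a plain descending sort would push to the bottom.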

In [15]:
# summary statistics for the numeric features

train_data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Unnamed: 0,8763.0,4381.0,2529.804538,0.0,2190.5,4381.0,6571.5,8762.0
Madrid_wind_speed,8763.0,2.425729,1.850371,0.0,1.0,2.0,3.333333,13.0
Bilbao_rain_1h,8763.0,0.135753,0.374901,0.0,0.0,0.0,0.1,3.0
Valencia_wind_speed,8763.0,2.586272,2.41119,0.0,1.0,1.666667,3.666667,52.0
Seville_humidity,8763.0,62.658793,22.621226,8.333333,44.333333,65.666667,82.0,100.0
Madrid_humidity,8763.0,57.414717,24.335396,6.333333,36.333333,58.0,78.666667,100.0
Bilbao_clouds_all,8763.0,43.469132,32.551044,0.0,10.0,45.0,75.0,100.0
Bilbao_wind_speed,8763.0,1.850356,1.695888,0.0,0.666667,1.0,2.666667,12.66667
Seville_clouds_all,8763.0,13.714748,24.272482,0.0,0.0,0.0,20.0,97.33333
Bilbao_wind_deg,8763.0,158.957511,102.056299,0.0,73.333333,147.0,234.0,359.3333


In [16]:
#check for skewness

train_data.skew(numeric_only=True)

Unnamed: 0               0.000000
Madrid_wind_speed        1.441144
Bilbao_rain_1h           5.222802
Valencia_wind_speed      3.499637
Seville_humidity        -0.310175
Madrid_humidity         -0.057378
Bilbao_clouds_all       -0.053085
Bilbao_wind_speed        1.716914
Seville_clouds_all       1.814452
Bilbao_wind_deg          0.226927
Barcelona_wind_speed     1.057331
Barcelona_wind_deg      -0.180001
Madrid_clouds_all        1.246745
Seville_wind_speed       1.151006
Barcelona_rain_1h        8.726988
Seville_rain_1h          8.067341
Bilbao_snow_3h          26.177568
Barcelona_pressure      57.979664
Seville_rain_3h         19.342574
Madrid_rain_1h           7.074308
Barcelona_rain_3h       12.696605
Valencia_snow_3h        63.298084
Madrid_weather_id       -3.107722
Barcelona_weather_id    -2.584011
Bilbao_pressure         -0.999642
Seville_weather_id      -3.275574
Valencia_pressure       -1.705162
Seville_temp_max        -0.033931
Madrid_pressure         -1.850768
Valencia_temp_
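
Several of the rain and snow features above have skewness far above 1. A common remedy for non-negative, right-skewed features is a `log1p` transform; whether it helps the models here is an assumption to test. A minimal sketch on synthetic log-normal data (not drawn from the dataset; `rain_like` is a hypothetical column name):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, non-negative values standing in for a rain feature.
rng = np.random.default_rng(42)
toy = pd.DataFrame({"rain_like": rng.lognormal(mean=0.0, sigma=1.0, size=1000)})

# Skewness before the transform.
skew_before = toy["rain_like"].skew()

# log1p computes log(1 + x), which is defined at x = 0 and compresses large values.
toy["rain_like_log"] = np.log1p(toy["rain_like"])
skew_after = toy["rain_like_log"].skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```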

In [None]:
#find numerical and categorical values
train_data.info()

In [None]:
# plotting distributions of all the features in train
train_data.hist(bins=50, figsize=(30,30), color = 'tab:orange')
plt.show()

In [None]:
# correlation heatmap of the numeric features
train_corr = train_data.corr(numeric_only=True)
plt.figure(figsize=(40, 42))

# Mask the top half of the matrix as it contains redundant info
mask = np.zeros_like(train_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

sns.heatmap(train_corr,annot=True, vmin=-1, vmax=1, cmap='autumn', linewidth 

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# remove missing values/ features
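
The missing-value count in Section 2 shows 2068 nulls in `Valencia_pressure`. One possible treatment is to fill them with the column median. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset), and the choice of median over mean or mode is an assumption worth validating.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data; the values are made up for illustration.
toy = pd.DataFrame({
    "Valencia_pressure": [1002.0, np.nan, 1010.0, np.nan, 1015.0],
    "Madrid_pressure": [971.0, 972.0, 980.0, 985.0, 990.0],
})

# Median of the observed values only (NaN is skipped by default).
median_pressure = toy["Valencia_pressure"].median()

# Fill the gaps in a new frame so the raw data stays untouched.
toy_clean = toy.assign(
    Valencia_pressure=toy["Valencia_pressure"].fillna(median_pressure)
)

print(toy_clean.isnull().sum())
```

The same fill value should be reused on the test set, so that both sets are transformed consistently.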

In [None]:
# create new features
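
Because the `time` column is stored as text, one plausible set of new features is its calendar components. The sketch below parses timestamps in the same format as the data and derives a few such columns; the new column names (`year`, `month`, `day_of_week`, `hour`) are hypothetical.

```python
import pandas as pd

# Toy frame with timestamps in the same string format as the `time` column.
toy = pd.DataFrame({
    "time": ["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2017-12-31 21:00:00"]
})

# Parse the strings into datetimes so the .dt accessor becomes available.
toy["time"] = pd.to_datetime(toy["time"])

# Derive calendar features from the parsed timestamps.
toy["year"] = toy["time"].dt.year
toy["month"] = toy["time"].dt.month
toy["day_of_week"] = toy["time"].dt.dayofweek  # Monday=0, Sunday=6
toy["hour"] = toy["time"].dt.hour

print(toy)
```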

In [None]:
# engineer existing features
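
The previews above show `Valencia_wind_deg` stored as strings such as `level_5`, which most regressors cannot use directly. A minimal sketch of one option, extracting the numeric level, is shown below on a toy column; treating the levels as ordered integers is an assumption.

```python
import pandas as pd

# Toy column with the same "level_N" format seen in Valencia_wind_deg.
toy = pd.DataFrame({"Valencia_wind_deg": ["level_5", "level_10", "level_9"]})

# Pull out the digits after "level_" and cast them to integers.
toy["Valencia_wind_deg"] = (
    toy["Valencia_wind_deg"].str.extract(r"(\d+)", expand=False).astype(int)
)

print(toy["Valencia_wind_deg"].tolist())
```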

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three-hour load shortfall. |

---

In [None]:
# split data

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models

In [None]:
# Separate the feature variables (x) and the target variable (y) in the train dataset
y = df_clean_train[:len(df)][['load_shortfall_3h']]
x = df_clean_train[:len(df)].drop('load_shortfall_3h', axis=1)

In [None]:
# Splitting the train dataset into train and validation sets
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=42)

# x_train: Independent variables for training
# x_val: Independent variables for validation
# y_train: Target variable for training
# y_val: Target variable for validation
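
A single 80/20 split yields only one estimate of validation error. K-fold cross-validation, for which `KFold` is already imported in Section 1, averages over several splits. The sketch below runs it on synthetic data from `make_regression`, not on the notebook's data; since the real rows are time-ordered, shuffling the folds is an assumption to revisit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem standing in for the cleaned training data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Five shuffled folds with a fixed seed for reproducibility.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn reports negated MSE so that higher is always better.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error"
)

# Convert each fold's score back to an RMSE.
rmse_per_fold = np.sqrt(-neg_mse)
print("RMSE per fold:", rmse_per_fold)
print("Mean RMSE:", rmse_per_fold.mean())
```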

# Creating different models

In [None]:
#create a linear regression model
lr = LinearRegression()

In [None]:
# Create a Ridge regression model. The alpha parameter sets the regularization
# strength: higher values give stronger regularization.
ridge = Ridge(alpha=1.0)
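
To make the effect of `alpha` concrete, the sketch below fits Ridge models with increasing `alpha` on synthetic data and records the size of the coefficient vector. The data and the chosen `alpha` values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem; not the notebook's data.
X, y = make_regression(n_samples=100, n_features=5, noise=5.0, random_state=42)

# Fit one model per alpha and store the Euclidean norm of its coefficients.
alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))

for alpha, norm in zip(alphas, coef_norms):
    print(f"alpha={alpha:>8}: coefficient norm = {norm:.3f}")
```

The norm shrinks towards zero as `alpha` grows, which is the penalty trading some fit on the training data for smaller, more stable coefficients.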

In [None]:
# Create a Random Forest model with a specific random state
rf = RandomForestRegressor(n_estimators=100, random_state=42)

In [None]:
# Create a Decision Tree model
dt = DecisionTreeRegressor(max_depth=5, min_samples_split=2)

In [None]:
#Fit the above created models

In [None]:
# Fit the linear regression model to the training data
lr.fit(x_train, y_train)

In [None]:
# Fit the ridge model to the training data
ridge.fit(x_train, y_train)

In [None]:
# Fit the random forest model to the training data
rf.fit(x_train, y_train.values.ravel())

In [None]:
# Fit the decision tree model to the training data
dt.fit(x_train, y_train)

In [None]:
# Make predictions on the validation set

In [None]:
#Linear regression
y_pred_val_lr = lr.predict(x_val)

In [None]:
#Ridge regression
y_pred_val_rd = ridge.predict(x_val)

In [None]:
#Random Forest
y_pred_val_rf = rf.predict(x_val)

In [None]:
#Decision Tree
y_pred_val_dt = dt.predict(x_val)

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance
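
One convenient way to compare the models is to collect RMSE and R² for each into a single table. The sketch below does this for the same four model types on synthetic data, so the numbers it prints say nothing about the real dataset; `results` and the other names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the cleaned training set.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Decision Tree": DecisionTreeRegressor(max_depth=5, random_state=42),
}

# Fit each model and score it on the held-out validation split.
rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_va)
    rows.append({
        "model": name,
        "rmse": np.sqrt(mean_squared_error(y_va, pred)),
        "r2": r2_score(y_va, pred),
    })

# Lowest RMSE first.
results = pd.DataFrame(rows).set_index("model").sort_values("rmse")
print(results)
```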

In [None]:
# Choose best model and motivate why it is the best choice

In [None]:
# Check each model's performance using RMSE and R-squared (R^2)

In [None]:
# MODEL PERFORMANCE OF EACH MODEL ON THE VALIDATION SET

In [None]:
def rmse(y_val, y_predict):
    return np.sqrt(mean_squared_error(y_val, y_predict))

In [None]:
# Calculate the RMSE of each model on the validation set
print("Validation Set RMSE Linear Regression (LR):", rmse(y_val, y_pred_val_lr))
print("Validation Set RMSE Ridge Regression (RD):", rmse(y_val, y_pred_val_rd))
print("Validation Set RMSE Random Forest (RF):", rmse(y_val, y_pred_val_rf))
print("Validation Set RMSE Decision Tree (DT):", rmse(y_val, y_pred_val_dt))

In [None]:
r2_score(y_val, y_pred_val_lr)

In [None]:
r2_score(y_val, y_pred_val_rd)

In [None]:
r2_score(y_val, y_pred_val_rf)

In [None]:
r2_score(y_val, y_pred_val_dt)

In [None]:
x_test = df_clean_test

In [None]:
preds = rf.predict(x_test)

In [None]:
x_train = df_clean_train[:len(df)].drop('load_shortfall_3h', axis=1)
x_test = df_clean_train[len(df):].drop('load_shortfall_3h', axis=1)

In [None]:
daf = pd.DataFrame(preds, columns=['load_shortfall_3h'])
daf.head()

In [None]:
output = pd.DataFrame({'time': df_test['time']})
submission = output.join(daf)
submission.to_csv('submission.csv', index=False)

In [None]:
submission

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
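
A random forest averages the predictions of many decision trees, and its `feature_importances_` attribute summarises how much each feature contributed to the trees' splits. The sketch below illustrates this on synthetic data with hypothetical feature names; it is not a result from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features: one strong driver, one weak driver, one pure noise column.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "strong_signal": rng.normal(size=300),
    "weak_signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 5.0 * X["strong_signal"] + 0.5 * X["weak_signal"] + rng.normal(scale=0.1, size=300)

# Fit a forest and read the impurity-based importances (they sum to 1).
rf_toy = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
importances = pd.Series(
    rf_toy.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

For a non-technical audience, the ranked list can be read as "which inputs the model leaned on most", with the caveat that importance does not imply causation.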

## 8. Saving the Model

In [None]:
import pickle

model_save_path = "rf_model.pkl"
with open(model_save_path,'wb') as file:
    pickle.dump(rf,file)

In [None]:
model_load_path = "rf_model.pkl"  # must match the path the model was saved to
with open(model_load_path,'rb') as file:
    unpickled_model = pickle.load(file)
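
A quick way to check that a saved model survives the round trip is to compare predictions before and after unpickling. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a small synthetic model, so no file is written; `toy_model` and `restored` are hypothetical names.

```python
import pickle

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic model standing in for the trained random forest.
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=42)
toy_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X, y)

# Serialise to bytes and restore, mirroring pickle.dump / pickle.load on a file.
blob = pickle.dumps(toy_model)
restored = pickle.loads(blob)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(toy_model.predict(X), restored.predict(X))
print("Predictions match after round trip:", same_predictions)
```

Pickled scikit-learn models are generally only safe to load with the same library version that saved them, and pickle files from untrusted sources should not be loaded at all.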