# Regression Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We {**TEAM ZF1**}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Spain Electricity Shortfall Challenge

The government of Spain is considering an expansion of its renewable energy resource infrastructure investments. As such, they require information on the trends and patterns of the country's renewable sources and fossil fuel energy generation. Your company has been awarded the contract to:

- 1. analyse the supplied data;
- 2. identify potential errors in the data and clean the existing data set;
- 3. determine if additional features can be added to enrich the data set;
- 4. build a model that is capable of forecasting the three hourly demand shortfalls;
- 5. evaluate the accuracy of the best machine learning model;
- 6. determine what features were most important in the model’s prediction decision, and
- 7. explain the inner working of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model the shortfall between the energy generated by means of fossil fuels and various renewable sources - for the country of Spain. The daily shortfall, which will be referred to as the target variable, will be modelled as a function of various city-specific weather features such as `pressure`, `wind speed`, `humidity`, etc. As with all data science projects, the provided features are rarely adequate predictors of the target variable. As such, you are required to perform feature engineering to ensure that you will be able to accurately model Spain's three hourly shortfalls.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Libraries for data preparation and model building
import statsmodels.graphics.api as sga
import statsmodels.formula.api as sfa
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

# print multiple outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Setting global constants to ensure notebook results are reproducible
# PARAMETER_CONSTANT = ###
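
The placeholder above can be filled with a single seed that is passed to every random step. The following is a minimal sketch; the name `RANDOM_STATE` and the value 42 are assumptions, not something the notebook defines.

```python
import numpy as np

# Hypothetical constant name and value; the notebook leaves PARAMETER_CONSTANT unset.
RANDOM_STATE = 42

# Two generators seeded with the same constant produce identical draws,
# which is what makes a re-run of the notebook reproducible.
rng_a = np.random.default_rng(RANDOM_STATE)
rng_b = np.random.default_rng(RANDOM_STATE)

sample_a = rng_a.normal(size=5)
sample_b = rng_b.normal(size=5)

same_draws = bool(np.array_equal(sample_a, sample_b))
print(same_draws)
```

The same constant could then be passed as `random_state=RANDOM_STATE` to any scikit-learn estimator or splitter that accepts it.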

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [None]:
df1 = pd.read_csv("df_train.csv")
print(f"There are {df1.shape[0]} rows and {df1.shape[1]} columns")
df1.head(2)
print('', end="\n\n")

# Remove unnecessary column(s)

df_train = df1.drop(labels="Unnamed: 0", axis=1)
print(f"There are {df_train.shape[0]} rows and {df_train.shape[1]} columns")
df_train.head(10).T
print('', end="\n\n")
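
As an alternative to dropping `Unnamed: 0` after loading, the unnamed first column can be read as the index. This sketch uses a small in-memory CSV with made-up values, since the real `df_train.csv` is not reproduced here.

```python
import io

import pandas as pd

# Made-up rows in the assumed layout of df_train.csv: the first header cell is empty,
# which pandas would otherwise name "Unnamed: 0".
csv_text = (
    ",time,Madrid_temp,load_shortfall_3h\n"
    "0,2015-01-01 03:00:00,265.9,6715.7\n"
    "1,2015-01-01 06:00:00,266.8,4171.7\n"
)

# index_col=0 turns the unnamed column into the row index instead of a data column.
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy.shape)
print(list(toy.columns))
```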

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


### 3.1. Data statistics

In [None]:
# Function to describe variable (including mode and median)

def describe(df):
    d = {0:[df.mean(), df.median(), df.mode()[0]]}
    dat = pd.DataFrame(data=d).rename(index={0: "Mean", 1: "Median", 2: "Mode"})
    return pd.concat([df.describe(), dat])

# Data comprehension

print(f"There are {df_train.isnull().sum().sum()} null values")
df_train.isnull().sum()
print('', end="\n\n")

print("Description of Valencia_pressure")
describe(df_train["Valencia_pressure"])

# Deal with null containing column(s)

df_train_clean = df_train.copy()
df_train_clean["Valencia_pressure"] = df_train_clean["Valencia_pressure"].fillna(df_train_clean["Valencia_pressure"].mode()[0])
print('', end="\n\n")

print(f"There are {df_train_clean.isnull().sum().sum()} null values after substituting with the mode")
df_train_clean.isnull().sum()
print('', end="\n\n")

print("Description of cleaned Valencia_pressure")
describe(df_train_clean["Valencia_pressure"])
print('', end="\n\n")

# Access column dtypes

df_train_clean.info()
print('', end="\n\n")

# Convert object dtypes to int and datetime

df_train_clean["Valencia_wind_deg"] = df_train_clean["Valencia_wind_deg"].str.extract(r"(\d+)", expand=False).astype(int)
df_train_clean["Seville_pressure"] = df_train_clean["Seville_pressure"].str.extract(r"(\d+)", expand=False).astype(int)
df_train_clean["time"] = pd.to_datetime(df_train_clean["time"])
print("Time, Valencia_wind_deg, Seville_pressure columns have been respectively converted to:")
print(df_train_clean["time"].dtypes)
print(df_train_clean["Valencia_wind_deg"].dtypes)
print(df_train_clean["Seville_pressure"].dtypes)
print('', end="\n\n")

# extract features from date

df_train_clean["time_year"] = df_train_clean["time"].dt.year.astype(int)
df_train_clean["time_month"] = df_train_clean["time"].dt.month.astype(int)
df_train_clean["time_day"] = df_train_clean["time"].dt.day.astype(int)
df_train_clean["time_hour"] = df_train_clean["time"].dt.hour.astype(int)
df_train_clean["time_weekday"] = df_train_clean["time"].dt.weekday.astype(int) # Monday is 0 and Sunday is 6
df_train_clean["time_weeknumber"] = df_train_clean["time"].dt.isocalendar().week.astype(int) # .dt.week was removed in pandas 2.0

# Sort columns and drop noise ("time")

df_train_clean_sort = df_train_clean[sorted(df_train_clean)]
df_train_clean_sort = df_train_clean_sort.drop(labels="time", axis=1)
df_train_clean_sort.info()
print('', end="\n\n")

# Univariable non-Graphical Analysis

print(f"Univariable non-Graphical Analysis")
print(f"There are {df_train_clean_sort.shape[0]} rows and {df_train_clean_sort.shape[1]} columns")
df_train_clean_sort.describe().T
print("There are 5 cities: Barcelona, Bilbao, Madrid, Seville, Valencia")
print("There are 5 variables recurring across all cities: pressure, temp, temp_max, temp_min, wind_speed")

In [None]:
# check columns containing negative values

df_train_clean_sort.columns[(df_train_clean_sort < 0).any()].tolist()
df_train_clean_sort[(df_train_clean_sort[df_train_clean_sort.columns] < 0).any(axis=1)][['load_shortfall_3h']].T
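
The cleaning steps in this section (mode imputation, extracting the numeric part of label-like strings, and deriving calendar features) can be sketched on a toy frame. The string format `level_5` and all values below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data; the "level_<n>" format and the numbers are assumed, not taken from df_train.csv.
toy = pd.DataFrame({
    "time": ["2015-01-01 03:00:00", "2015-01-01 06:00:00", "2015-01-05 09:00:00"],
    "Valencia_wind_deg": ["level_5", "level_10", "level_5"],
    "Valencia_pressure": [1002.0, np.nan, 1002.0],
})

# 1. Fill missing values with the most frequent value (the mode).
toy["Valencia_pressure"] = toy["Valencia_pressure"].fillna(toy["Valencia_pressure"].mode()[0])

# 2. Keep only the digits of the label and cast to int.
#    expand=False returns a Series rather than a one-column DataFrame.
toy["Valencia_wind_deg"] = toy["Valencia_wind_deg"].str.extract(r"(\d+)", expand=False).astype(int)

# 3. Parse the timestamp and derive calendar features.
toy["time"] = pd.to_datetime(toy["time"])
toy["time_hour"] = toy["time"].dt.hour
toy["time_weekday"] = toy["time"].dt.weekday  # Monday is 0 and Sunday is 6
toy["time_weeknumber"] = toy["time"].dt.isocalendar().week.astype(int)

print(toy)
```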

### 3.2. Plot relevant feature interactions (correlation and linearity)

In [None]:
# Prepare to observe interactions

y_0 = df_train_clean_sort[["load_shortfall_3h"]]
x_0 = df_train_clean_sort.drop(labels="load_shortfall_3h", axis=1)

y_0.head(2)
x_0.head(2)

#### 3.2.0. Investigate all variables

In [None]:
# Function to check for linearity
# Due to the number of visuals created, this function takes some time to run

def scatter_plot(predictor, response, plotrow=1, plotcolumn=1, figsize=(4,3)):
    # Note: plotcolumn sets the number of grid rows and plotrow the number of grid columns
    # squeeze=False keeps axs a 2-D array, so ravel() also works for a 1x1 grid
    fig, axs = plt.subplots(plotcolumn, plotrow, figsize=figsize, squeeze=False)
    fig.subplots_adjust(hspace=0.5, wspace=0.2)
    axs = axs.ravel()

    for index, column in enumerate(predictor.columns):
        axs[index].set_title("{} vs. Y".format(column))
        axs[index].scatter(x=predictor[column], y=response, color="blue", edgecolor="white")

    fig.tight_layout(pad=1)

In [None]:
# Check for linearity between all predictor variables and y_0

scatter_plot(x_0, y_0, plotrow=4, plotcolumn=13, figsize=(16,39))

In [None]:
# Correlations between predictor variables and response variable

df_xy = x_0.join(y_0)

# Function to get correlation coefficients and p-values of each x to y

def p_values(df, y="load_shortfall_3h", dec_place=6, p_value_threshold=0.1):
    corrs = df.corr()[y]
    dict_cp = {}

    column_titles = [col for col in corrs.index if col!=y]
    for col in column_titles:
        p_val = round(pearsonr(df[col], df[y])[1],dec_place)
        dict_cp[col] = {'Correlation_Coefficient':corrs[col],
                        'P_Value':p_val}

    df_cp = pd.DataFrame(dict_cp).T
    df_cp_sorted = df_cp.sort_values('P_Value')
    return df_cp_sorted[df_cp_sorted['P_Value']<p_value_threshold]

# Correlation and p-value of x and y

p_values(df_xy, dec_place=6, p_value_threshold=0.1)
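
To illustrate what the correlation coefficient and p-value returned by `pearsonr` express, the sketch below compares a predictor that drives the response with one that is unrelated to it. The data is synthetic and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data: y depends on x_related but not on x_noise.
rng = np.random.default_rng(0)
x_related = rng.normal(size=200)
x_noise = rng.normal(size=200)
y = 3.0 * x_related + rng.normal(scale=0.5, size=200)

# pearsonr returns the correlation coefficient and the two-sided p-value.
r_related, p_related = pearsonr(x_related, y)
r_noise, p_noise = pearsonr(x_noise, y)

print(f"related: r={r_related:.3f}, p={p_related:.3g}")
print(f"noise:   r={r_noise:.3f}, p={p_noise:.3g}")
```

A small p-value only indicates that a linear association is unlikely to be zero; it says nothing about how useful the predictor is once other predictors are in the model.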

In [None]:
# Function to show predictor correlation heatmap and list columns with high correlation

def corr_heatmap(corr, diag_len, corr_threshold):
    mask = np.triu(np.ones_like(corr, dtype=bool))
    with plt.rc_context():
        plt.rc("figure", figsize=(diag_len, diag_len))
        predictor_corrs_fig = sns.heatmap(corr, mask=mask)

    r, c = np.where(np.abs(corr) > corr_threshold)
    off_diagonal = np.where(r != c)
    corr_list = [row for row in corr.iloc[r[off_diagonal], c[off_diagonal]].index]
    return corr_list

# Show all predictors correlation heatmap and list columns with high correlation

predictor_corrs = x_0.corr()
corr_heatmap(predictor_corrs, diag_len=15, corr_threshold=0.9)
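
When the list of highly correlated columns gets long, it may be easier to read as explicit pairs. A possible variant, using a hypothetical helper `high_corr_pairs` and synthetic data, reports each pair once together with its absolute correlation.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.9):
    """Return (column_a, column_b, |corr|) for each pair above the threshold."""
    corr = df.corr().abs()
    names = corr.columns
    # k=1 keeps only the strict upper triangle, so each pair appears once
    # and the diagonal (always 1.0) is ignored.
    rows, cols = np.where(np.triu(corr.to_numpy() > threshold, k=1))
    return [(names[i], names[j], round(float(corr.iat[i, j]), 3)) for i, j in zip(rows, cols)]


# Synthetic example: temp_max is temp plus a little noise, wind_speed is independent.
rng = np.random.default_rng(1)
temp = rng.normal(290, 5, size=300)
toy = pd.DataFrame({
    "temp": temp,
    "temp_max": temp + rng.normal(0, 0.5, size=300),
    "wind_speed": rng.normal(3, 1, size=300),
})

pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```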

#### 3.2.1. Investigate temperature

In [None]:
# Function to draw time series (ts) plot

def ts_plot(df, x, y, title="", xlabel='Time', ylabel='Value', dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

# Function to show y_0 time series (ts) plot
    
y_0_ts = df_train_clean[["time"]].join(other=y_0["load_shortfall_3h"])

def y_0_plot(df=y_0_ts, x=y_0_ts.time, y=y_0_ts.load_shortfall_3h, title='tri-hourly load shortfall from 2015 to 2017', xlabel='Time', ylabel="Load shortfall", dpi=100):
    plt.figure(figsize=(16,5), dpi=dpi)
    plt.plot(x, y, color='tab:red')
    plt.gca().set(title=title, xlabel=xlabel, ylabel=ylabel)
    plt.show()

In [None]:
x_temp = x_0.filter(regex="temp$", axis=1)
x_temp["Mean_temp"] = x_temp.mean(axis=1)
x_temp.head(2)
x_temp.describe()
scatter_plot(x_temp, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

In [None]:
# Investigate temp_max variable

x_temp_max = x_0.filter(regex="max$", axis=1)
x_temp_max["Mean_temp_max"] = x_temp_max.mean(axis=1)
x_temp_max.head(2)
x_temp_max.describe()
scatter_plot(x_temp_max, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

In [None]:
# Investigate temp_min variable

x_temp_min = x_0.filter(regex="min$", axis=1)
x_temp_min["Mean_temp_min"] = x_temp_min.mean(axis=1)
x_temp_min.head(2)
x_temp_min.describe()
scatter_plot(x_temp_min, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

In [None]:
# Investigate mean of all temp variables

x_temp_mean = x_temp[["Mean_temp"]].join(other = [x_temp_max["Mean_temp_max"], x_temp_min["Mean_temp_min"]])
xy_temp_mean = x_temp_mean.join(y_0)
xy_temp_mean.head(2)

# Correlation and p-value of temperature predictor variables and y

p_values(xy_temp_mean, y="load_shortfall_3h", dec_place=6, p_value_threshold=0.1)

# Correlation heatmap of temperature predictor variables

temp_predictor_corr = x_temp_mean.corr()
corr_heatmap(temp_predictor_corr, diag_len=3, corr_threshold=0.9)

print('', end="\n")
print("High correlation observed between all three variables; using all of them would add redundant information")
print("Keep the one with the strongest correlation (and lowest p-value) with y_0: 'Mean_temp_min'")
print('', end="\n")

In [None]:
# Temp_min time series plot compared with y_0

temp_min_ts = df_train_clean[["time"]].join(other=x_temp_min["Mean_temp_min"])
temp_min_ts.head(2)
ts_plot(temp_min_ts, x=temp_min_ts.time, y=temp_min_ts.Mean_temp_min, title='tri-hourly minimum temperature from 2015 to 2017', ylabel="Minimum temperature")
y_0_plot()
print("Seasonality observed for temperature")

#### 3.2.2. Investigate wind speed

In [None]:
x_wind_speed = x_0.filter(regex="speed$", axis=1)
x_wind_speed["Mean_wind_speed"] = x_wind_speed.mean(axis=1)
x_wind_speed.head(2)
x_wind_speed.describe()
scatter_plot(x_wind_speed, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

In [None]:
# Wind speed time series plot compared with y_0

wind_speed_ts = df_train_clean[["time"]].join(other=x_wind_speed["Mean_wind_speed"])
wind_speed_ts.head(2)
ts_plot(wind_speed_ts, x=wind_speed_ts.time, y=wind_speed_ts.Mean_wind_speed, title='tri-hourly wind speed from 2015 to 2017', ylabel="Wind speed")
y_0_plot()
print("Stationary time series")

#### 3.2.3. Investigate wind degree

In [None]:
x_wind_deg = x_0.filter(regex="deg$", axis=1)
x_wind_deg["Mean_wind_deg"] = x_wind_deg.mean(axis=1)
x_wind_deg.head(2)
x_wind_deg.describe()
scatter_plot(x_wind_deg, y_0, plotrow=4, plotcolumn=1, figsize=(12,3))

print("Valencia_wind_deg is observed to have an extreme value range and seems to be a categorical variable")
print('', end="\n")

In [None]:
# Adjust for extreme values of wind_deg across cities

x_wind_deg1 = x_0.filter(regex=r'(Barcelona_wind_deg|Bilbao_wind_deg)', axis=1)
x_wind_deg1["Mean_wind_deg1"] = x_wind_deg1.mean(axis=1)
x_wind_deg1.head(2)
x_wind_deg1.describe()
scatter_plot(x_wind_deg1, y_0, plotrow=3, plotcolumn=1, figsize=(12,3))

In [None]:
# wind_deg time series plot compared with y_0

wind_deg_ts = df_train_clean[["time"]].join(other=x_wind_deg1["Mean_wind_deg1"])
wind_deg_ts.head(2)
ts_plot(wind_deg_ts, x=wind_deg_ts.time, y=wind_deg_ts.Mean_wind_deg1, title='tri-hourly wind degree (direction) from 2015 to 2017', ylabel="Wind degree")
y_0_plot()
print("Partial seasonality observed for wind degree (direction)")

#### 3.2.4. Investigate pressure

In [None]:
x_pressure = x_0.filter(regex="pressure$", axis=1)
x_pressure["Mean_pressure"] = x_pressure.mean(axis=1)
x_pressure.head(2)
x_pressure.describe()
scatter_plot(x_pressure, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

print("Barcelona_pressure contributed heavily to the mean across cities due to its extreme value range")
print("Seville_pressure is also observed to have an extreme value range and seems to be a categorical variable")
print('', end="\n")

In [None]:
# Adjust for extreme values of pressure across cities

x_pressure1 = x_0.filter(regex=r'(Bilbao_pressure|Madrid_pressure|Valencia_pressure)', axis=1)
x_pressure1["Mean_pressure1"] = x_pressure1.mean(axis=1)
x_pressure1.head(2)
x_pressure1.describe()
scatter_plot(x_pressure1, y_0, plotrow=2, plotcolumn=2, figsize=(8,6))

In [None]:
# Pressure time series plot compared with y_0

pressure_ts = df_train_clean[["time"]].join(other=x_pressure1["Mean_pressure1"])
pressure_ts.head(2)
ts_plot(pressure_ts, x=pressure_ts.time, y=pressure_ts.Mean_pressure1, title='tri-hourly pressure from 2015 to 2017', ylabel="Pressure")
y_0_plot()
print("Nonconstant variance observed")

#### 3.2.5. Investigate rain amount

In [None]:
x_rain = x_0.filter(regex="rain", axis=1)
x_rain["Mean_rain"] = x_rain.mean(axis=1)
x_rain.head(2)
x_rain.describe()
scatter_plot(x_rain, y_0, plotrow=4, plotcolumn=2, figsize=(16,6))
print("Because our time series data is sampled at 3-hour intervals, only the rain variables measured over the same interval will be considered")
print('', end="\n")

In [None]:
# Adjust for conformity with data time interval

x_rain1 = x_0.filter(regex="rain_3h", axis=1)
x_rain1["Mean_rain_3h"] = x_rain1.mean(axis=1)
x_rain1.head(2)
x_rain1.describe()
scatter_plot(x_rain1, y_0, plotrow=3, plotcolumn=1, figsize=(12,3))

In [None]:
# Rain_3h time series plot compared with y_0

rain_3h_ts = df_train_clean[["time"]].join(other=x_rain1["Mean_rain_3h"])
rain_3h_ts.head(2)
ts_plot(rain_3h_ts, x=rain_3h_ts.time, y=rain_3h_ts.Mean_rain_3h, title='tri-hourly rain from 2015 to 2017', ylabel="Rain amount")
y_0_plot()
print("Rain amount observed as white noise (white noise is completely random data with a mean of 0)")

#### 3.2.6. Investigate humidity

In [None]:
x_humidity = x_0.filter(regex="humidity", axis=1)
x_humidity["Mean_humidity"] = x_humidity.mean(axis=1)
x_humidity.head(2)
x_humidity.describe()
scatter_plot(x_humidity, y_0, plotrow=4, plotcolumn=1, figsize=(16,3))

In [None]:
# Humidity time series plot compared with y_0

humidity_ts = df_train_clean[["time"]].join(other=x_humidity["Mean_humidity"])
humidity_ts.head(2)
ts_plot(humidity_ts, x=humidity_ts.time, y=humidity_ts.Mean_humidity, title='tri-hourly humidity from 2015 to 2017', ylabel="Humidity")
y_0_plot()
print("Seasonality observed for humidity")

#### 3.2.7. Investigate level of cloud coverage

In [None]:
x_clouds_all = x_0.filter(regex="clouds", axis=1)
x_clouds_all["Mean_clouds_all"] = x_clouds_all.mean(axis=1)
x_clouds_all.head(2)
x_clouds_all.describe()
scatter_plot(x_clouds_all, y_0, plotrow=4, plotcolumn=1, figsize=(16,3))

In [None]:
# Cloud coverage time series plot compared with y_0

clouds_all_ts = df_train_clean[["time"]].join(other=x_clouds_all["Mean_clouds_all"])
clouds_all_ts.head(2)
ts_plot(clouds_all_ts, x=clouds_all_ts.time, y=clouds_all_ts.Mean_clouds_all, title='tri-hourly cloud coverage from 2015 to 2017', ylabel="Cloud coverage")
y_0_plot()
print("Partial seasonality observed for level of cloud coverage")

#### 3.2.8. Investigate snow amount

In [None]:
x_snow_3h = x_0.filter(regex="snow", axis=1)
x_snow_3h["Mean_snow_3h"] = x_snow_3h.mean(axis=1)
x_snow_3h.head(2)
x_snow_3h.describe()
scatter_plot(x_snow_3h, y_0, plotrow=3, plotcolumn=1, figsize=(12,3))

In [None]:
# Snow_3h time series plot compared with y_0

snow_3h_ts = df_train_clean[["time"]].join(other=x_snow_3h["Mean_snow_3h"])
snow_3h_ts.head(2)
ts_plot(snow_3h_ts, x=snow_3h_ts.time, y=snow_3h_ts.Mean_snow_3h, title='tri-hourly snow amount from 2015 to 2017', ylabel="Snow amount")
y_0_plot()
print("Snow amount observed as white noise (white noise is completely random data with a mean of 0)")

#### 3.2.9. Investigate weather id (weather condition)

In [None]:
x_weather_id = x_0.filter(regex="weather", axis=1)
x_weather_id["Mean_weather_id"] = x_weather_id.mean(axis=1)
x_weather_id.head(2)
x_weather_id.describe()
scatter_plot(x_weather_id, y_0, plotrow=3, plotcolumn=2, figsize=(12,6))

In [None]:
# Weather id time series plot compared with y_0

weather_id_ts = df_train_clean[["time"]].join(other=x_weather_id["Mean_weather_id"])
weather_id_ts.head(2)
ts_plot(weather_id_ts, x=weather_id_ts.time, y=weather_id_ts.Mean_weather_id, title='tri-hourly weather condition from 2015 to 2017', ylabel="Weather condition")
y_0_plot()
print("Stationary time series")

### 3.3. Variable Selection based on observed Correlation and Significance

In [None]:
# Mean of variables across cities

print("Mean of variables across cities")
x_mean = x_temp[["Mean_temp"]].join(other = [x_temp_max["Mean_temp_max"], x_temp_min["Mean_temp_min"], x_wind_speed["Mean_wind_speed"],
                                             x_wind_deg["Mean_wind_deg"], x_pressure["Mean_pressure"], x_rain["Mean_rain"], x_humidity["Mean_humidity"],
                                             x_clouds_all["Mean_clouds_all"], x_snow_3h["Mean_snow_3h"], x_weather_id["Mean_weather_id"]])
x_mean.head(2)
x_mean.shape
x_mean.describe().T

# Mean of variables (as adjusted during investigation) across cities

print('', end="\n\n")
print("Variables (as adjusted during investigation)")
x_mean1 = x_temp_min[["Mean_temp_min"]].join(other = [x_wind_speed["Mean_wind_speed"], x_wind_deg1["Mean_wind_deg1"], x_pressure1["Mean_pressure1"],
                                                      x_rain1["Mean_rain_3h"], x_humidity["Mean_humidity"], x_clouds_all["Mean_clouds_all"],
                                                      x_snow_3h["Mean_snow_3h"], x_weather_id["Mean_weather_id"]])
x_mean1.head(2)
x_mean1.shape
x_mean1.describe().T

In [None]:
# Correlation and p-value of x and y

x_mean_y = x_mean.join(y_0)
p_values(x_mean_y, dec_place=6, p_value_threshold=0.1)

x_mean1_y = x_mean1.join(y_0)
p_values(x_mean1_y, dec_place=6, p_value_threshold=0.05)

In [None]:
# Show correlation heatmap and list columns with high correlation for x_mean

mean_corrs = x_mean.corr()
corr_heatmap(mean_corrs, diag_len=5, corr_threshold=0.9)

In [None]:
# Show correlation heatmap and list columns with high correlation x_mean1

mean1_corrs = x_mean1.corr()
corr_heatmap(mean1_corrs, diag_len=4, corr_threshold=0.73)
print("correlations between selected variables minimal")

### 3.5. Investigate OLS fit summary using the various model dataframe so far

In [None]:
# Function to fit model

def fit_model(df, y=y_0):
    df_fit = df.copy()
    y_name = ''.join([col for col in y.columns])
    X_name = [col for col in df_fit.columns]

    # Build OLS formula string " y ~ X "

    formula_str = y_name+" ~ "+" + ".join(X_name)

    model = sfa.ols(formula=formula_str, data=df_fit.join(y))
    fitted = model.fit()
    print(fitted.summary())
    
# Fit model of all x variables

fit_model(x_0)

In [None]:
# Fit model using variable means across cities

x_mean.shape
fit_model(x_mean)

In [None]:
# Fit model using selected variable means across cities

x_mean1.shape
fit_model(x_mean1)

In [None]:
# Fit model of selected variable means with observed seasonality over time

x_mean2 = x_mean1[["Mean_temp_min", "Mean_humidity", "Mean_clouds_all", "Mean_wind_deg1"]].copy()
x_mean2.head(2)

fit_model(x_mean2)

### 3.6. Final file so far

In [None]:
x_for_use = x_mean1.copy()

x_for_use.head(2)
x_for_use.shape

### 3.4. Variable Selection by Variance Thresholds

In [None]:
# check
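
This subsection is still a placeholder. One way variance-threshold selection could be carried out is sketched below with scikit-learn's `VarianceThreshold`. The toy values and the threshold of 0.01 are assumptions; the columns are rescaled to [0, 1] first so that variances are comparable across units.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler

# Toy data with made-up values; Mean_snow_3h is constant and carries no information.
toy = pd.DataFrame({
    "Mean_temp_min": [280.0, 285.0, 290.0, 295.0, 300.0],
    "Mean_snow_3h": [0.0, 0.0, 0.0, 0.0, 0.0],
    "Mean_humidity": [40.0, 55.0, 60.0, 75.0, 90.0],
})

# Rescale each column to [0, 1] so one threshold applies to all of them.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Drop every column whose variance is not above the (illustrative) threshold.
selector = VarianceThreshold(threshold=0.01)
selector.fit(scaled)

kept_columns = scaled.columns[selector.get_support()].tolist()
print(kept_columns)
```

Variance thresholds ignore the response variable, so they complement rather than replace the correlation-based selection in section 3.3.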

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [None]:
# remove missing values/ features

In [None]:
# create new features

In [None]:
# engineer existing features

In [None]:
# to standardize the relevant dfs
# predictors df= x_d1
# dependent variable df= y_d1

from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()

x_d1= x_0.copy() 
y_d1= y_0

In [None]:
# to create a scaled version of the predictors based on z score value
x_d1_scaled= scaler.fit_transform(x_d1)

In [None]:
type(x_d1_scaled)

In [None]:
y_d1.to_csv('y_train.csv',float_format='%.2f')

In [None]:
# to convert the scaled predictor variables into a dataframe

x_std= pd.DataFrame(x_d1_scaled, columns=x_d1.columns)
x_std.head()

In [None]:
# save the standardised predictors (x_std must be created first)
x_std.to_csv('x_train.csv',float_format='%.2f')

In [None]:
# Ridge regularisation
# Regularisation penalises large coefficients, so it works best after the predictors are scaled
# Import dependencies

from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge


In [None]:
# Split the data into train and test, being sure to use the standardised predictors
X_train, X_test, y_train, y_test = train_test_split(x_std,
                                                    y_d1,
                                                    test_size=0.2,
                                                    shuffle=False)
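
The cells above scale the full predictor set before splitting. A possible refinement, shown here as a sketch on synthetic data rather than as what this notebook does, is to split chronologically first and fit the scaler on the training rows only, so that no statistics from the hold-out period leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the predictors and response; values are made up.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Mean_temp_min": rng.normal(285, 6, size=100),
    "Mean_humidity": rng.normal(65, 15, size=100),
})
y = pd.Series(rng.normal(10000, 5000, size=100), name="load_shortfall_3h")

# shuffle=False keeps the time order: the last 20% of rows become the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Learn mean and standard deviation from the training rows only,
# then apply the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_std = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_std = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_std.mean().round(6).tolist())
print(X_test_std.shape)
```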

In [None]:
ridge= Ridge()

In [None]:
x_std.shape

In [None]:
y_d1.head()

In [None]:
ridge.fit(X_train, y_train)

In [None]:
ridge.coef_.shape

In [None]:
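# y_train is a one-column DataFrame, so intercept_ has shape (1,) and coef_ has shape (1, n_features)
b0= float(np.ravel(ridge.intercept_)[0])

In [None]:
coeff= pd.DataFrame(ridge.coef_.ravel(), index=X_train.columns, columns=['Coefficient'])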

In [None]:
coeff
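
The shape of `coef_` depends on whether the target is passed as a one-column DataFrame or as a Series. The sketch below, on synthetic data with hypothetical feature names, shows both cases and one way to build a coefficient table that works for either.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic data: the target depends on f1 and f2 but not on f3.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y_frame = pd.DataFrame({"target": 2.0 * X["f1"] - 1.0 * X["f2"] + rng.normal(scale=0.1, size=50)})

# A 2-D target gives coef_ of shape (1, n_features); a 1-D target gives (n_features,).
model_2d = Ridge(alpha=1.0).fit(X, y_frame)
model_1d = Ridge(alpha=1.0).fit(X, y_frame["target"])
print(model_2d.coef_.shape, model_1d.coef_.shape)

# ravel() flattens either shape to one value per feature.
coef_table = pd.DataFrame({"Coefficient": model_2d.coef_.ravel()}, index=X.columns)
print(coef_table)
```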

In [None]:
a=np.array([x_std.columns])

In [None]:
a.shape

In [None]:
X_train.columns.shape

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more regression models that are able to accurately predict the three hour load shortfall. |

---

In [None]:
# split data
# creating a random forest model for the data





In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models
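
The modelling cells above are still empty. As a rough sketch of the random forest mentioned in the first cell, the example below trains a `RandomForestRegressor` on synthetic data and compares its RMSE with a naive baseline that always predicts the training mean. The data, the hyperparameters and the seed are assumptions, not results from this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic predictors and response; the relationship below is made up.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "Mean_temp_min": rng.normal(285, 6, size=n),
    "Mean_humidity": rng.uniform(20, 100, size=n),
})
y = 50.0 * (X["Mean_temp_min"] - 285) - 20.0 * X["Mean_humidity"] + rng.normal(scale=100, size=n)

# Chronological split, as in the Ridge cells earlier in the notebook.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# RMSE of the forest versus a baseline that predicts the training mean everywhere.
rmse_forest = float(np.sqrt(mean_squared_error(y_test, forest.predict(X_test))))
baseline_pred = np.full(len(y_test), y_train.mean())
rmse_baseline = float(np.sqrt(mean_squared_error(y_test, baseline_pred)))

print(f"forest RMSE: {rmse_forest:.1f}, baseline RMSE: {rmse_baseline:.1f}")
```

Tree-based models do not need standardised inputs, so the unscaled predictors can be used directly.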

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance
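
One way to organise the comparison is to fit each candidate on the same training split and collect hold-out metrics in a single table. The sketch below uses synthetic data and three illustrative candidates; the metric values it prints say nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a linear signal; feature names are hypothetical.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
y = 3.0 * X["f1"] - 2.0 * X["f2"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit every model on the same split and record hold-out RMSE and R^2.
rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rows[name] = {
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": float(r2_score(y_test, pred)),
    }

results = pd.DataFrame(rows).T.sort_values("RMSE")
print(results)
```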

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic
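
To support the explanation with evidence on which features matter most, permutation importance is one model-agnostic option: each feature is shuffled in turn and the drop in hold-out score is recorded. The sketch below uses synthetic data in which `Mean_temp_min` is built to dominate; the feature names and coefficients are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: strong effect of Mean_temp_min, weak effect of Mean_humidity,
# and no effect of noise_feature.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Mean_temp_min": rng.normal(size=n),
    "Mean_humidity": rng.normal(size=n),
    "noise_feature": rng.normal(size=n),
})
y = 5.0 * X["Mean_temp_min"] + 1.0 * X["Mean_humidity"] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
model = LinearRegression().fit(X_train, y_train)

# Shuffle each column 10 times on the hold-out set and average the drop in R^2.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(importances)
```

For a non-technical audience, the ranking can be read as "how much worse the predictions get when this measurement is scrambled".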