Import the following libraries and modules:

- `pickle`
- `warnings`
- `numpy` as `np`
- `pandas` as `pd`
- `matplotlib.pyplot` as `plt`
- from `sklearn.linear_model` import `LinearRegression`
- from `sklearn.tree` import `DecisionTreeRegressor`
- from `sklearn.ensemble` import `RandomForestRegressor`, `AdaBoostRegressor`, `BaggingRegressor`, `GradientBoostingRegressor`
- from `sklearn.model_selection` import `cross_val_score` and `train_test_split`
- from `sklearn.metrics` import `r2_score` and `mean_absolute_percentage_error`
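
The list above can be written as the following import cell. This is one possible layout, assuming a recent scikit-learn release in which `mean_absolute_percentage_error` is available.

```python
import pickle
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestRegressor,
    AdaBoostRegressor,
    BaggingRegressor,
    GradientBoostingRegressor,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import r2_score, mean_absolute_percentage_error
```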


In [None]:
# To suppress the warnings in the notebook
warnings.filterwarnings("ignore")

##### Step 0:

- Read the CSV file `pgm_consumption.csv` and store it in the variable `df`.
- Set `header=0`.
- Display the first 5 rows.
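
As a rough illustration of these three bullets, the sketch below reads a small in-memory CSV instead of the real file. The three rows and their values are made up for the demonstration; only the column names `datetime` and `consumption` come from this notebook.

```python
import io

import pandas as pd

# Hypothetical stand-in for the contents of pgm_consumption.csv
sample_csv = io.StringIO(
    "datetime,consumption\n"
    "2010-01-01 00:00:00,1200.5\n"
    "2010-01-01 01:00:00,1150.0\n"
    "2010-01-01 02:00:00,1100.25\n"
)

# header=0 tells pandas that the first row holds the column names
df = pd.read_csv(sample_csv, header=0)

# head() returns the first 5 rows by default (here only 3 exist)
print(df.head())
```

With the real file, the path `"pgm_consumption.csv"` would take the place of `sample_csv`.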

The `datetime` column must be parsed as timestamps and set as the index. Simply execute the cell below.

In [None]:
df["datetime"] = pd.to_datetime(df["datetime"], format="%Y-%m-%d %H:%M:%S")
df = df.set_index("datetime")
df = df.sort_index()
df.head()

##### Step 1:
- Check the data health.
    - Are there any missing values? See [hint](https://stackoverflow.com/questions/26266362/how-do-i-count-the-nan-values-in-a-column-in-pandas-dataframe)
    - There are no outliers, so we do not have to check for them.

- Plot the first week of the dataframe. See [Pandas Dataframe head](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html)
- What pattern (seasonality) do you observe in the graph? Write your answer in a markdown cell below.
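
A minimal sketch of the health check and the first-week selection, run on a synthetic hourly series rather than the real data. The values, the date range, and the injected missing value are assumptions made for the demonstration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series standing in for the real dataframe
index = pd.date_range("2010-01-01", periods=24 * 14, freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
df = pd.DataFrame({"consumption": rng.normal(1000, 50, size=len(index))}, index=index)
df.iloc[5, 0] = np.nan  # inject one missing value for the demonstration

# Count missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# With hourly data, the first week is the first 24 * 7 rows
first_week = df.head(24 * 7)
# In a notebook, first_week.plot() would draw this week
```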

##### Step 2:

- We need to transform this univariate dataset into a multivariate dataset by adding lagged copies of `consumption` as new columns.
- If you do not understand what is happening in the following cell, please ask.

In [None]:
number_of_hours_in_a_day = 24

list_shifting_days = [1, 7, 365, 365+1, 365+7]

for shifting_days in list_shifting_days:
    df[f"consumption_{shifting_days}_day"] = df["consumption"].shift(number_of_hours_in_a_day * shifting_days)

df = df.dropna()
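
To see what `shift` and `dropna` do in the cell above, here is a toy example with six made-up readings and a lag of 2 rows. The column name `consumption_lag_2` is hypothetical.

```python
import pandas as pd

# Toy series of 6 hourly readings (hypothetical values)
toy = pd.DataFrame({"consumption": [10, 20, 30, 40, 50, 60]})

# shift(2) moves every value two rows down, so the first two rows
# have no lagged value and are filled with NaN
toy["consumption_lag_2"] = toy["consumption"].shift(2)
print(toy)

# dropna() removes the rows that lack a complete set of lagged features
toy = toy.dropna()
print(toy)
```

The same logic explains why the real dataframe loses its earliest rows: the longest lag is `365+7` days, so rows without that much history are dropped.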

##### Step 3:

- Split the data into input variables `X` and target variable `y`.
- Here we would like to predict the `consumption` based on the generated columns.
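
One way to separate inputs and target is sketched below on a three-row dataframe with made-up numbers; the lagged column names follow the pattern generated in Step 2.

```python
import pandas as pd

# Hypothetical miniature of the dataframe produced in Step 2
df = pd.DataFrame({
    "consumption": [100.0, 110.0, 120.0],
    "consumption_1_day": [95.0, 100.0, 110.0],
    "consumption_7_day": [90.0, 98.0, 105.0],
})

# Inputs: every column except the target
X = df.drop(columns=["consumption"])
# Target: the column we want to predict
y = df["consumption"]
```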

In [None]:
dict_regressors = {"DecisionTreeRegressor": DecisionTreeRegressor(),
                   "RandomForestRegressor": RandomForestRegressor(),
                   "AdaBoostRegressor": AdaBoostRegressor(),
                   "BaggingRegressor": BaggingRegressor(),
                   "GradientBoostingRegressor": GradientBoostingRegressor(),
                   "LinearRegressor": LinearRegression()}

Before moving forward, let us select the algorithm that gives the best model relative to the others. We can do it using `cross_val_score`.

- Iterate over the dictionary `dict_regressors` and compute `cross_val_score` for each regressor. See [Scikit-Learn cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)
    - Pass `cv=5` to `cross_val_score`. The dataset is then split into 5 chunks (folds), and each chunk is used once as the test subset while the other 4 form the training subset. Store the result of `cross_val_score` in a variable `score`.
    - Display the average score, rounded to 2 decimal places, next to the corresponding key of `dict_regressors`. See [Python Round](https://www.w3schools.com/python/ref_func_round.asp) and [Numpy mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html)
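
The loop described above might look like the sketch below. It runs on synthetic data with only two regressors to stay fast; the data, the dictionary `regressors`, and the dictionary `mean_scores` are assumptions for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data (hypothetical): 200 samples, 3 features, near-linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

regressors = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "LinearRegressor": LinearRegression(),
}

mean_scores = {}
for name, regressor in regressors.items():
    # For regressors the default score is R2; cv=5 yields five scores
    score = cross_val_score(regressor, X, y, cv=5)
    mean_scores[name] = round(float(np.mean(score)), 2)
    print(name, mean_scores[name])
```

On this synthetic target the linear model is expected to score highest because the data were generated from a linear rule; the ranking on the consumption data may differ.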

Choose the regressor that seems best suited to this problem. Justify your choice in a markdown cell below.

##### Step 4:

We need to split the whole dataset into a training subset and a test subset.
- To do so, look at the [train_test_split documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) and figure out how to use the function `train_test_split` to split the dataset into subsets.
- Pass `test_size=0.20` to `train_test_split`. It means that 20% of the data is kept for testing and the rest for training.
- Pass `random_state=8` to `train_test_split`.
- Your output variables should be `X_train`, `X_test`, `y_train`, `y_test`.
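
A small sketch of the call, using ten made-up rows so the resulting sizes are easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical inputs and target with 10 rows
X = pd.DataFrame({"consumption_1_day": np.arange(10, dtype=float)})
y = pd.Series(np.arange(10, dtype=float) * 2, name="consumption")

# 20% of the rows go to the test subset; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=8
)
print(X_train.shape, X_test.shape)
```

Note that `train_test_split` shuffles the rows by default, so the test subset is not a contiguous block of time.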

##### Step 5:

- Instantiate your chosen regressor and store it in a variable `model`.
- Fit it on the seen data, i.e. `X_train` and `y_train`.

See the official documentation of your chosen regressor. To find it, search for `scikit-learn <your_chosen_regressor>` on Google.

##### Step 6:

To verify that the model works well on the training data (also known as seen data), first predict using `X_train`.
- Save the result in a variable `y_predict_train`

##### Step 7:

- The result `y_predict_train` should be compared with the measured values `y_train` to calculate the error (or score) using an indicator.
- The coefficient of determination, also called the R2 score, is used to evaluate the performance of a regression model. It is the proportion of the variation in the target variable that is predictable from the input variable(s). A score close to 1 is good. A score close to 0 means the model does no better than always predicting the mean, and the score can even be negative for a model that does worse than that.

You have to measure the `r2_score`. See [Scikit-Learn r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html).
- Remember to pass `y_true=y_train` and `y_pred=y_predict_train` to `r2_score`.
- You should store the result in a variable `r2_train`

Similarly, you have to measure the `mean_absolute_percentage_error`. See [Scikit-Learn mean_absolute_percentage_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html)

- You should store the result in a variable `mape_train`
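
Steps 5 to 7 can be sketched end to end as follows. `LinearRegression` is used here only as a placeholder for whichever regressor you chose, and the training data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_percentage_error

# Synthetic training data (hypothetical): 100 samples, 2 features
rng = np.random.default_rng(8)
X_train = rng.uniform(1, 10, size=(100, 2))
y_train = 100 + 3 * X_train[:, 0] + 2 * X_train[:, 1] + rng.normal(scale=0.5, size=100)

# Step 5: instantiate and fit (placeholder regressor)
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: predict on the seen data
y_predict_train = model.predict(X_train)

# Step 7: compare predictions with the measured values
r2_train = r2_score(y_true=y_train, y_pred=y_predict_train)
mape_train = mean_absolute_percentage_error(y_true=y_train, y_pred=y_predict_train)
print(round(r2_train, 3), round(mape_train, 4))
```

scikit-learn returns the MAPE as a fraction rather than a percentage, so a value of 0.05 corresponds to 5%.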

##### Step 8-a:

- Predict on the unseen data, i.e. the `X_test` subset, using the `model` and save the result as `y_predict_test`

##### Step 8-b:

- Calculate the R2 score of `y_test` and `y_predict_test`
- You should store the result in a variable `r2_test`

- Calculate the mean absolute percentage error of `y_test` and `y_predict_test`
- You should store the result in a variable `mape_test`

##### Step 9:

- Compare `r2_train` and `r2_test`. What do you think about the model? Is it best, good, bad or worst? Write your answer in a markdown cell below.

- Compare `mape_train` and `mape_test`. What do you think about the model? Is it best, good, bad or worst? Write your answer in a markdown cell below.

- Looking at both indicators and the comparisons you made just above, what is your conclusion about the model? Is it marketable? Write your answer in a markdown cell below.

Run the cell below to visualize the predicted versus the recorded values. What do you think about how `predicted_values` overlays `measured_values`? Write your answer in a markdown cell below.

In [None]:
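# Move the datetime index into a column so rows align by position
true_value = y_test.reset_index()
df_test = pd.DataFrame()
df_test["measured_values"] = true_value["consumption"]
df_test["predicted_values"] = y_predict_test
# Plot the first 48 rows of the test subset
df_test.head(24 * 2).plot()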

##### Step 10:

Save the model with filename `model_abc.pkl`, where `abc` should be replaced by your chosen regressor name. You can use it later.
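
A possible way to save and reload a fitted model with `pickle` is sketched below. The tiny `LinearRegression` model, the temporary directory, and the filename `model_LinearRegression.pkl` are assumptions for the demonstration; use your own regressor name.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model fitted on three made-up points (y = 2x + 1)
model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])
)

# Hypothetical location; in the notebook a plain filename is enough
path = os.path.join(tempfile.gettempdir(), "model_LinearRegression.pkl")

# Write the model in binary mode
with open(path, "wb") as f:
    pickle.dump(model, f)

# Read it back in binary mode
with open(path, "rb") as f:
    loaded_model = pickle.load(f)

prediction = loaded_model.predict(np.array([[3.0]]))
print(prediction)
```

Pickled models are best reloaded with the same scikit-learn version that created them.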

##### Real World Use-Case:

Now that you have trained the model on `X_train` and evaluated it yourself on `X_test`, let us test it on a real-world use case.

- Read the CSV file `pgm_consumption_rwi.csv` and store it in the variable `X_rwi`.
- Set `header=0`.
- Display the first 5 rows.

In [None]:
X_rwi["datetime"] = pd.to_datetime(X_rwi["datetime"], format="%Y-%m-%d %H:%M:%S")
X_rwi = X_rwi.set_index("datetime")
X_rwi = X_rwi.sort_index()
X_rwi.head()

Filter the data between **2012-07-09 00:00:00** and **2012-07-09 23:59:00**
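
Because `X_rwi` has a sorted `datetime` index, one option is label-based slicing with `.loc`. The sketch below uses a made-up three-day hourly frame and a hypothetical column; `X_day` is an illustrative name.

```python
import pandas as pd

# Hypothetical hourly frame covering 2012-07-08 to 2012-07-10
index = pd.date_range("2012-07-08", periods=72, freq=pd.Timedelta(hours=1))
X_rwi = pd.DataFrame({"consumption_1_day": range(72)}, index=index)

# Slicing a sorted DatetimeIndex by label includes both endpoints
X_day = X_rwi.loc["2012-07-09 00:00:00":"2012-07-09 23:59:00"]
print(X_day.index.min(), X_day.index.max(), len(X_day))
```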

- Check the data health.
    - Are there any missing values?

Load the saved model `model_abc.pkl`, where `abc` should be replaced by your chosen regressor name, and store it in the variable `loaded_model`.

Repeat Step 8-a, but predict on `X_rwi` instead of `X_test` and store the results in a variable `y_predict_rwi`. Also display `y_predict_rwi` or, better, visualize it using `matplotlib`.

By doing so, you made a day-ahead prediction of energy consumption for **2012-07-10** for the DSO (named PGM here). Congratulations! The exercise ends here.