<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Writing Custom Transformers

In the next few exercises, you will learn to take your code closer to `production` level. This means getting out of the `Jupyter notebook` and into proper class files. The benefit is that you will be able to use your transformers directly in your `pipelines`, and also that (at least in theory) you will be able to deploy your machine learning model to the cloud, both for **training** and for **prediction**.

Being comfortable with going from `Jupyter notebook` to `Python class` is an important skill to master: it might be the difference between getting a job as a data scientist at interview, or not. It will also be expected in the assessed coursework.

For this exercise, you can use this `Jupyter notebook` to test your code and try out ideas. You have already written most, if not all, of the code in your previous exercises. You will find several `.py` files in this folder: use `VSCode` (or the IDE of your choice) to edit them. The skeleton of each function is already written for you; you simply need to complete the code and test your functions.

The goal is to write one `transformer` class that inherits from `TransformerMixin` and `BaseEstimator`, and implements three main methods:

* `__init__(self)`, which is called when the transformer is created
* `fit(self, X, y=None)`, which is where the transformer `learns` the different statistics of the data and then returns `self` (not all transformers need to learn anything at `fit` time, and yours might not)
* `transform(self, X, y=None)`, which takes the `X` features, transforms them, and returns the transformed values (and only those)
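To make the three methods concrete, here is a minimal sketch of a custom transformer. The class `MeanCenterer`, its `with_mean` parameter, and the toy data are hypothetical and only illustrate the structure; they are not part of the exercise files.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical example: subtracts the column means learned at fit time."""

    def __init__(self, with_mean=True):
        # Only store hyperparameters here, without modifying them
        self.with_mean = with_mean

    def fit(self, X, y=None):
        # Learn statistics from the data passed to fit (the training set)
        self.means_ = pd.DataFrame(X).mean()
        return self  # returning self allows fit(...).transform(...) chaining

    def transform(self, X, y=None):
        # Work on a copy so the caller's data is left untouched
        X = pd.DataFrame(X).copy()
        if self.with_mean:
            X = X - self.means_
        return X


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
centered = MeanCenterer().fit_transform(toy)
print(centered)
```

Note that `fit_transform` is not written by hand: it is provided by `TransformerMixin` as soon as `fit` and `transform` exist.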

# Reading the data

I have already split the data for you into a `train_set` and a `test_set` in the cell below. Note that because the data is a `Time Series`, I have split it so that the whole `train_set` lies between 2003 and 2011, whereas the `test_set` covers the years 2012 and 2013.

Once you have executed the cell below, look at the `train_set` to see what the features we are going to work with look like.

In [None]:
# Run this cell first in order to be able to see changes in your "custom_transformers.py" file without needing to restart your kernel
%load_ext autoreload
%autoreload 2

In [None]:
from nbta.utils import download_data
download_data(id='1ftc-lXujVjX9he3xpX9uC6-9MKxURt50')

In [None]:
import pandas as pd
import numpy as np

train_set = pd.read_csv('raw_data/train_temperatures.csv')
test_set = pd.read_csv('raw_data/test_temperatures.csv')

X_train = train_set.drop(columns='AverageTemperature')
X_test = test_set.drop(columns='AverageTemperature')

y_train = train_set.AverageTemperature
y_test = test_set.AverageTemperature

# Encoding the `month`

It makes intuitive sense that adding information about the time of the year will make our model more predictive. The months come as `string` values, but they are also often referred to as 1, 2, 3, ... 12, so we could use an `OrdinalEncoder` to convert the months into integers that represent their position.

Let's try to do this. Create a simple pipeline that you can call `simple_pipe` and that includes the following:
* A `SimpleImputer()` and a `StandardScaler()` for the 'Latitude', 'Longitude', and 'TempPreviousMonth'
* An `OrdinalEncoder()` for the `month` column
* A `LinearRegression()` model

Fit this pipeline, and calculate the `root mean squared error` on the `X_test`. Save this value into a variable called `ordinal_score` (this is the name the test cell below expects).
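As a rough guide to the overall shape, the sketch below builds a similar pipeline on made-up toy data. The column names `size` and `grade`, the values, and the variable names are all hypothetical; the real exercise uses the columns listed above. One detail worth knowing: by default `OrdinalEncoder` orders categories alphabetically, so an explicit `categories` list is needed if the integers should follow a meaningful order.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy data standing in for the real train/test sets (values are made up)
toy_X = pd.DataFrame({
    "size": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "grade": ["low", "mid", "high", "low", "mid", "high"],
})
toy_y = pd.Series([1.0, 2.5, 4.0, 3.5, 5.5, 7.0])

preprocessor = ColumnTransformer([
    # Numerical columns: impute missing values, then scale
    ("num", make_pipeline(SimpleImputer(), StandardScaler()), ["size"]),
    # Explicit category order; without it the order would be alphabetical
    ("cat", OrdinalEncoder(categories=[["low", "mid", "high"]]), ["grade"]),
])

toy_pipe = Pipeline([("prep", preprocessor), ("model", LinearRegression())])
toy_pipe.fit(toy_X, toy_y)

# RMSE is the square root of the mean squared error
toy_predictions = toy_pipe.predict(toy_X)
toy_rmse = np.sqrt(mean_squared_error(toy_y, toy_predictions))
print(toy_rmse)
```

Here the score is computed on the same toy rows used for fitting, purely to keep the sketch short; in the exercise the score is computed on the test set.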

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('ordinal_encoding',
                         score = ordinal_score
)

result.write()
print(result.check())

# Preserving the cyclical nature of `month`

Intuitively, we should realise that we have lost an important dimension here. We simply converted the strings into numbers: `Jan`:`1`, `Feb`:`2`, etc. The issue with this encoding is that it does not reflect the nature of seasonality: **November and December are 1 unit apart** (12-11), whereas **December and January are 11 units apart** (12-1). This is a common problem with data that is cyclical in nature: you will have this issue when encoding dates and times, seasons, or even angular information.

Luckily, there is an elegant solution to this problem. Have a look at <a href="https://towardsdatascience.com/cyclical-features-encoding-its-about-time-ce23581845ca">this excellent blog post to understand the concept better</a>, and then come back here.

If you have read the post, you should now understand that we can encode cyclical data using two features: the `sin` and the `cos` of the angle (in `radians`) of the cycle. Given that a full circle is equal to $2\pi$ radians, we can divide it into 12 equal segments and multiply one segment by the month number. In other words, $January=1\times\frac{2\pi}{12}$, $February=2\times\frac{2\pi}{12}$, ..., $December = 12\times\frac{2\pi}{12}$. Then, we simply create two new features (`sin_month` and `cos_month`) to encode this angular representation of the month.
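A quick numerical check of this idea, written as a small sketch (the helper `month_distance` is a hypothetical name used only for this illustration): once the months are placed on the circle, December and January end up exactly as close as November and December, while January and July sit on opposite sides.

```python
import numpy as np

months = np.arange(1, 13)           # 1 = January ... 12 = December
angles = months * 2 * np.pi / 12    # angle of each month in radians
sin_month = np.sin(angles)
cos_month = np.cos(angles)


def month_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    i, j = m1 - 1, m2 - 1
    return np.hypot(sin_month[i] - sin_month[j], cos_month[i] - cos_month[j])


dec_jan = month_distance(12, 1)
nov_dec = month_distance(11, 12)
jan_jul = month_distance(1, 7)
print(dec_jan, nov_dec, jan_jul)
```

Both features are needed: with `sin` alone, two different months can share the same value, and the pair (`sin`, `cos`) is what identifies a unique position on the circle.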

Go ahead and implement the `encode_month(month)` function. It should do the following:
1. You pass the entire `month` series as the input of the function
2. Use a dictionary to convert the strings into numbers between 1 and 12, as explained above.
3. Calculate the correct angle for each month as per the solution suggested above (I suggest using a `lambda` function).
4. Return a `DataFrame` that contains two columns: `sin_month` and `cos_month`, i.e. respectively the `sin` and `cos` of the month converted to an angle in radians.

Then, create `X_train_prep` and `X_test_prep` dataframes that are copies of `X_train` and `X_test` (we want to preserve the original values). You can then use this function to create two new features `["sin_month", "cos_month"]` in your `X_train_prep` and `X_test_prep`.
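The same recipe applies to any cyclical feature. As an illustration only, here is a sketch for a 7-day week rather than the 12-month year; the function `encode_weekday`, the column names, and the toy data are hypothetical, and your `encode_month` will need its own dictionary and cycle length.

```python
import numpy as np
import pandas as pd


def encode_weekday(weekday):
    """Hypothetical analogue of `encode_month` for a 7-day cycle."""
    day_to_number = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4,
                     "Fri": 5, "Sat": 6, "Sun": 7}
    # Step 1: strings -> position in the cycle
    numbers = weekday.map(day_to_number)
    # Step 2: position -> angle in radians
    angles = numbers.apply(lambda n: n * 2 * np.pi / 7)
    # Step 3: angle -> two features, keeping the original index
    return pd.DataFrame(
        {"sin_weekday": np.sin(angles), "cos_weekday": np.cos(angles)},
        index=weekday.index,
    )


toy = pd.DataFrame({"weekday": ["Mon", "Sun", "Thu"], "value": [1, 2, 3]})

# pd.concat builds a new DataFrame, so `toy` itself is left unchanged
toy_prep = pd.concat([toy, encode_weekday(toy["weekday"])], axis=1)
print(toy_prep)
```

Keeping the input's index on the returned `DataFrame` matters here: `pd.concat(..., axis=1)` aligns rows on the index, so a mismatched index would produce missing values.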

Finally, create a new pipeline that will have the following elements:

* a `SimpleImputer(strategy='most_frequent')` for 'sin_month' and 'cos_month'
* a `SimpleImputer()` and a `StandardScaler()` for 'Latitude', 'Longitude', and 'TempPreviousMonth'
* a `LinearRegression()` model

Fit this pipeline on `X_train_prep` and `y_train`, and save the `root mean squared error` on `X_test_prep` as `encoded_score`.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('month_encoding',
                         test_set = X_test_prep,
                         score = encoded_score
)

result.write()
print(result.check())

# `MonthEncoder`

What we did in the last section worked well! But it is a pain to apply it manually. So here you will implement the transformation of the months from `string` format (`Jan`, `Feb`, etc.) into `sin` and `cos` values as a `Transformer class`. Make sure that the class returns a `pd.DataFrame` with only 2 features (`sin_month` and `cos_month`). Note that this is only necessary to be able to see the transformed values nicely, and also to pass the test in this notebook. Strictly speaking, you can return a `np.ndarray` as most transformers do (but returning a `DataFrame` is very little extra effort and is well worth it).

Then, create a pipeline that will contain the following:

* a `SimpleImputer(strategy='most_frequent')` and `MonthEncoder` for 'month'
* a `SimpleImputer()` and a `StandardScaler()` for 'Latitude', 'Longitude', and 'TempPreviousMonth'
* a `LinearRegression()` model

Call this pipeline `final_pipe` in your notebook, and fit it with your `X_train` and `y_train`. Make sure you obtain the same score as before.
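One practical point when chaining an imputer with a custom transformer: `SimpleImputer` returns a 2D `np.ndarray` by default, so the transformer that follows it does not receive a `Series` or `DataFrame`. The sketch below shows one way to handle this, using a hypothetical `WeekdayEncoder` on a 7-day cycle and made-up data rather than the `MonthEncoder` you are asked to write.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class WeekdayEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: weekday strings -> sin/cos of a 7-day cycle."""

    DAYS = {"Mon": 1, "Tue": 2, "Wed": 3, "Thu": 4, "Fri": 5, "Sat": 6, "Sun": 7}

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X, y=None):
        # X may be a DataFrame or, after SimpleImputer, a 2D NumPy array,
        # so flatten it to a single 1D column of strings first
        days = pd.Series(np.asarray(X, dtype=object).ravel())
        angles = days.map(self.DAYS).astype(float) * 2 * np.pi / 7
        return pd.DataFrame(
            {"sin_weekday": np.sin(angles), "cos_weekday": np.cos(angles)}
        )


toy = pd.DataFrame({"weekday": ["Mon", np.nan, "Sun", "Mon"]})

# The missing value is filled with the most frequent weekday before encoding
weekday_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"), WeekdayEncoder())
encoded = weekday_pipe.fit_transform(toy)
print(encoded)
```

This sketch assumes the transformer is applied to a single column; a pipeline like this one can then be used as one branch of a `ColumnTransformer`.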

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('transformer',
                         pipe = final_pipe
)

result.write()
print(result.check())

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.