<img src="https://drive.google.com/uc?id=1-cL5eOpEsbuIEkvwW2KnpXC12-PAbamr" style="Width:1000px">

# Preprocessor Pipeline
### Back to Golden Plains Roadside Biodiversity

As our first exercise we will return to the  <a href='https://data.gov.au/data/dataset/golden-plains-roadside-biodiversity'>Golden Plains Shire (Australia) dataset</a>.<br>
![plain](https://upload.wikimedia.org/wikipedia/commons/thumb/6/6b/Mount_Conner%2C_August_2003.jpg/375px-Mount_Conner%2C_August_2003.jpg)
<br>

🎯 As last time, the exercise draws on the data preparation and feature selection techniques you have learnt: we will again try to predict the `RCACScore` via linear regression. This time, however, we will do it using `Pipeline`s, and we don't care much about the final score.

The goal is to demonstrate how pipelines work and how they make our code cleaner and less repetitive.

👇 Load the data into this notebook as a pandas dataframe named `df`, and display its first 5 rows.

In [None]:
from nbta.utils import download_data
download_data(id='1SUxBmDZF6fsu3ndrgrNI7aI2B2BEr59S')

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Remove Duplicates and Hold Out

Removing duplicates still **needs to be done before training a pipeline**. Why? Because if duplicates remain when you train-test-split (the hold-out method), the same row can end up in both the train and the test set, which is a form of data leakage. Do the following:
1. Remove the duplicates
2. Create the target `y` vector (`RCACScore`) and feature `X` matrix
3. Using a 30% split for the test set, create the `X_train`, `X_test`, `y_train`, `y_test`
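
The three steps above can be sketched on a small toy dataframe. This is only an illustration, not the exercise solution: apart from `RCACScore`, the column names and all values are made up, and `random_state=42` is an arbitrary choice to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; "width" and "weeds" are hypothetical columns
toy = pd.DataFrame({
    "width": [1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "weeds": ["low", "high", "high", "low", "low", "high",
              "low", "high", "low", "high", "low"],
    "RCACScore": [10, 12, 12, 14, 15, 18, 19, 22, 23, 26, 27],
})

# 1. Remove exact duplicate rows (rows 1 and 2 are identical here)
toy = toy.drop_duplicates()

# 2. Separate the target from the features
toy_y = toy["RCACScore"]
toy_X = toy.drop(columns="RCACScore")

# 3. Hold out 30% of the rows for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=42
)

print(toy_X_train.shape, toy_X_test.shape)
```

Dropping duplicates first guarantees that no row of the test set also appears in the training set.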

In [None]:
# The code below ensures that all transformers return pandas dataframe - makes life easier!

from sklearn import set_config
set_config(transform_output="pandas")

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Creating our numeric pipeline

Let's start by taking care of the numeric values, and do this in a pipeline. The steps we want to implement are the following:
1. Impute values with a `SimpleImputer`, with the default strategy
2. Scale the values using a `StandardScaler`

Create a pipeline that does just that! Call it `num_pipe`. Then, fit it to the **numeric columns of your** `X_train`.
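
As a rough sketch of the idea on toy data (the dataframe, its column names and the `toy_` variable names are all made up for illustration), an imputer followed by a scaler can be chained and fitted on the numeric columns only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training data with one missing numeric value; columns are hypothetical
toy_train = pd.DataFrame({
    "width": [1.0, np.nan, 3.0, 5.0],
    "length": [10.0, 20.0, 30.0, 40.0],
    "weeds": ["low", "high", "low", "high"],
})

# Keep only the numeric columns
toy_num_cols = toy_train.select_dtypes(include="number").columns

# Each step is a (name, transformer) pair; steps run in order
toy_num_pipe = Pipeline([
    ("imputer", SimpleImputer()),   # default strategy is "mean"
    ("scaler", StandardScaler()),
])

toy_num_pipe.fit(toy_train[toy_num_cols])

# The imputer learnt one fill value per numeric column
print(toy_num_pipe.named_steps["imputer"].statistics_)
```

Fitting the pipeline fits every step in turn, so the scaler learns its mean and variance from the already imputed values.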

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# Applying your pipeline

Create two new dataframes named `X_train_prep` and `X_test_prep`,  containing respectively the values of `X_train` and `X_test` transformed by your pipeline.

**Tip:** By default, the `transform` method of a pipeline returns a NumPy array. Because we called `set_config(transform_output="pandas")` above, it returns a dataframe here instead. Without that setting, you would need to build the dataframe yourself from the returned NumPy array and the columns of the original `X_train` and `X_test` dataframes.
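
A minimal sketch of this step on toy data follows (all names and values are hypothetical). Wrapping the result in `pd.DataFrame` works whether `transform` returns a NumPy array or a dataframe, so the sketch does not depend on the output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric train and test sets; columns are hypothetical
toy_train = pd.DataFrame({
    "width": [1.0, np.nan, 3.0, 5.0],
    "length": [10.0, 20.0, 30.0, 40.0],
})
toy_test = pd.DataFrame({
    "width": [np.nan, 3.0],
    "length": [25.0, 25.0],
})

# Fit on the training data only
toy_pipe = make_pipeline(SimpleImputer(), StandardScaler()).fit(toy_train)

# Transform both sets with the statistics learnt from the training data
toy_train_prep = pd.DataFrame(
    toy_pipe.transform(toy_train),
    columns=toy_train.columns,
    index=toy_train.index,
)
toy_test_prep = pd.DataFrame(
    toy_pipe.transform(toy_test),
    columns=toy_test.columns,
    index=toy_test.index,
)

print(toy_test_prep)
```

The test set is never passed to `fit`: its missing values are filled with the training mean and scaled with the training statistics.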

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

<details>
    <summary>💡 Conclusions</summary>
<br>
ℹ️ Hopefully you can see with this simple example that <code>Pipeline</code>s make your life much easier! <br> 
    Once you have fitted your pipeline to the <code>X_train</code>, you can <code>transform</code> the <code>X_train</code>, <code>X_test</code>, or indeed any new data you need to predict!
    
</details>

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('num_pipe',
                         X_test = X_test_prep
)

result.write()
print(result.check())

# Categorical Data Pipeline

Let's now build a new `pipeline` and name it `cat_pipe` for **Categorical Pipeline**. We will deal with the categorical data by doing the following:
- Apply a `SimpleImputer()` with the `most_frequent` strategy
- Apply a `OneHotEncoder()` with `handle_unknown="ignore"` to the categorical data. Because pandas output is enabled above, the encoder also needs `sparse_output=False` (pandas output does not support sparse data)

Then, transform the `X_train` and `X_test` categorical data into `X_train_cat_prep` and `X_test_cat_prep`, in the same way as you transformed the numerical data above.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('cat_pipe',
                         X_test = X_test_cat_prep)
result.write()
print(result.check())

# Building a single processing Pipeline

OK, the process above was already saving us a lot of time. But now, we are going to take this to the next level by creating a `ColumnTransformer` that groups both pipelines together! Name your `ColumnTransformer` `preproc_pipe` (**preprocessing pipeline**). If in doubt, consult the documentation!

Once your pipeline is created, you can `fit` it on the `X_train` and then `transform` the `X_train` and the `X_test` into `X_train_preproc` and `X_test_preproc`: you will see how easy it makes your code.

In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

### ☑️ Test your code

In [None]:
from nbresult import ChallengeResult

result = ChallengeResult('preproc_pipe',
                         X_test = X_test_preproc
)

result.write()
print(result.check())

# Adding models into pipelines

The true power of the `Pipeline` class is that it can also handle full `sklearn` models! So, all we need to do now is create a new `Pipeline`, let's call it `linear_model`, containing the `preproc_pipe` and a `LinearRegression` model. Then, you can fit it directly to the `X_train` and `y_train`. Once it is fitted, you can predict `y_pred` from the `X_test`, and save the `root mean squared error` of this model into a new variable, `score`.



In [None]:
# ADD YOUR CODE HERE -- You can create new markdown and code cells
                    
                    
                    

# 🏁 Finished!

Well done! <span style="color:teal">**Push your exercise to GitHub**</span>, and move on to the next one.