# Model building lab

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Assignment goals:
+ practice creating custom transformers and estimators
+ practice performing grid search
+ build custom pipelines
+ solidify git knowledge and practice working together effectively
+ make a first attempt at productionizing an ML model

For this lab we will be using the [car sales dataset](https://www.kaggle.com/orgesleka/used-cars-database) from Kaggle. I know, everybody is sick of predicting car prices by now, but cars happen to have a good mix of numerical and categorical features, and that's just what we need for this exercise.

**Remember**, the main objective of this assignment is not to train the most accurate model, but to practice building custom blocks of estimators and transformers.

### Reading in data

Start by downloading the dataset and doing a quick exploration.
- which features are categorical and which are numerical?
- are there lots of missing values?
- which column contains the dependent variable?
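
A quick first look could be sketched as follows. `quick_overview` is a hypothetical helper name, and the toy frame only stands in for the downloaded data (its column names are assumed to match the Kaggle file), so treat this as a starting point rather than the intended answer.

```python
import pandas as pd


def quick_overview(df):
    """Summarise dtype, share of missing values and cardinality per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_share": df.isna().mean(),
        "n_unique": df.nunique(),
    })


# Toy stand-in for the real data; replace it with the frame you loaded.
toy = pd.DataFrame({
    "brand": ["bmw", "audi", None],
    "powerPS": [150, 120, 90],
    "price": [5000, 3000, 1200],
})

overview = quick_overview(toy)
print(overview)
```

Text-like columns with few unique values are usually the categorical ones, while numeric dtypes point to numerical features.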

For the sake of simplicity we will assume that each row is a separate listing, so we can treat the index as our id variable. We will only focus on private sellers (`seller == 'privat'`) with `offerType` equal to "Angebot". To keep the scope manageable we will only use the following features:
- brand
- gearbox
- powerPS
- kilometer
- vehicle age (needs to be constructed based on year of registration and date of ad placement)
- fuelType
- model (optional)
- notRepairedDamage

> **Exercise 1**
>
> Load the dataset into a pandas DataFrame, keep only the specified rows, add an age column, select the specified columns, and create a train/test split.


> **Answer 1**
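
One possible sketch is shown below. The column names `dateCreated`, `yearOfRegistration` and `price` are assumptions about the Kaggle file, `prepare_listings` is a hypothetical helper, and the age is computed in whole years as a simplification. A toy frame replaces the real CSV so the block runs on its own.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["brand", "gearbox", "powerPS", "kilometer", "age",
            "fuelType", "model", "notRepairedDamage"]
TARGET = "price"  # assumed name of the dependent variable


def prepare_listings(raw):
    """Filter rows, add the vehicle age and return (features, target)."""
    mask = (raw["seller"] == "privat") & (raw["offerType"] == "Angebot")
    df = raw.loc[mask].copy()
    # Age in whole years: year the ad was created minus registration year.
    created_year = pd.to_datetime(df["dateCreated"]).dt.year
    df["age"] = created_year - df["yearOfRegistration"]
    return df[FEATURES], df[TARGET]


# Toy stand-in for the downloaded CSV (five listings, one commercial seller).
toy = pd.DataFrame({
    "seller": ["privat", "privat", "gewerblich", "privat", "privat"],
    "offerType": ["Angebot"] * 5,
    "price": [4500, 1200, 9900, 7800, 650],
    "brand": ["audi", "opel", "bmw", "volkswagen", "ford"],
    "gearbox": ["manuell", "manuell", "automatik", None, "manuell"],
    "powerPS": [140, 75, 190, 105, 60],
    "kilometer": [125000, 150000, 90000, 100000, 150000],
    "yearOfRegistration": [2008, 2001, 2012, 2010, 1998],
    "dateCreated": ["2016-03-24 00:00:00"] * 5,
    "fuelType": ["diesel", "benzin", "diesel", "benzin", "benzin"],
    "model": ["a4", "corsa", "3er", "golf", "fiesta"],
    "notRepairedDamage": ["nein", None, "nein", "nein", "ja"],
})

X, y = prepare_listings(toy)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
```

With the real data you would pass the frame returned by `pd.read_csv` instead of `toy`; fixing `random_state` keeps the split reproducible between teammates.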

### Categorical transformer

We have a number of categorical columns that we need to convert to numerical values in order to use them in our machine learning model. 

> **Exercise 2**
>
> Create a transformer that one-hot-encodes categorical columns from a `DataFrame`. The transformer should also be able to transform new, unseen data. Consider what you want to do with missing values.

> **Answer 2**
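
A minimal sketch of such a transformer, assuming missing values get their own `"missing"` category and categories unseen during `fit` are encoded as all zeros. The class name `DataFrameOneHotEncoder` and the `missing_label` parameter are made up for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameOneHotEncoder(BaseEstimator, TransformerMixin):
    """One-hot-encode every column of a DataFrame and return a DataFrame."""

    def __init__(self, missing_label="missing"):
        self.missing_label = missing_label

    def _clean(self, X):
        # Treat missing values as a category of their own.
        df = pd.DataFrame(X).astype(object).fillna(self.missing_label)
        return df.astype(str)

    def fit(self, X, y=None):
        # Remember the dummy columns seen during training.
        self.columns_ = pd.get_dummies(self._clean(X)).columns.tolist()
        return self

    def transform(self, X):
        dummies = pd.get_dummies(self._clean(X))
        # Drop unseen categories and add training columns that are absent.
        aligned = dummies.reindex(columns=self.columns_, fill_value=0)
        return aligned.astype(float)


train = pd.DataFrame({
    "gearbox": ["manuell", "automatik", None],
    "fuelType": ["benzin", "diesel", "benzin"],
})
new = pd.DataFrame({
    "gearbox": ["manuell", "halbautomatik"],
    "fuelType": ["lpg", None],
})

encoder = DataFrameOneHotEncoder().fit(train)
encoded_new = encoder.transform(new)
```

Aligning to the columns learned in `fit` is what lets the transformer handle new data: the output always has the same width, whatever categories show up later.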

### Numerical transformers

We also need to preprocess the numerical columns. Luckily for us, this particular data extract has no missing values in the numerical columns. But if we want to cover our bases and make sure the pipeline also works on unseen data, it is a good idea to think about an imputation strategy. The strategy itself is something we might want to include in our hyperparameter search.
Depending on our future model choice we may also need to scale our data. A good first option is the `StandardScaler` from `sklearn`.

> **Exercise 3**
>
> Create a mini pipeline for numerical columns that implements a `SimpleImputer` with the "mean" imputation strategy followed by a `StandardScaler`.
>
> If you have time and are feeling exceptionally empowered, implement your own StandardScaler transformer that returns a pandas DataFrame (the scikit-learn `StandardScaler` returns a numpy array).

> **Answer 3**
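
The mini pipeline and the optional custom scaler might look roughly like this. `DataFrameStandardScaler` is a hypothetical name, and the sketch assumes the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Mean imputation followed by standardisation; returns a numpy array.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])


class DataFrameStandardScaler(BaseEstimator, TransformerMixin):
    """Standardise columns and keep the result as a DataFrame."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        self.mean_ = X.mean()
        # Constant columns get a scale of 1 to avoid division by zero.
        self.scale_ = X.std(ddof=0).replace(0, 1.0)
        return self

    def transform(self, X):
        return (pd.DataFrame(X) - self.mean_) / self.scale_


toy = pd.DataFrame({
    "powerPS": [60.0, 105.0, np.nan, 190.0],
    "kilometer": [150000.0, 100000.0, 125000.0, 90000.0],
})

scaled_array = numeric_pipeline.fit_transform(toy)
scaled_frame = DataFrameStandardScaler().fit_transform(toy.fillna(toy.mean()))
```

Naming the steps (`"imputer"`, `"scaler"`) matters later: those names become part of the parameter keys used in a grid search.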

### Regression model

We now have preprocessed data that can be passed to a machine learning algorithm. `sklearn` comes with a varied suite of regression and classification models you can use. For this particular exercise, however, you'll need to implement your own estimator.

> **Exercise 4**
>
> Implement the simplest version of a Ridge regression that takes a single hyperparameter `alpha`. Don't go overboard with implementing your own matrix inverses; use `numpy`'s `linalg` module for this.

> **Answer 4**
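
A minimal sketch of such an estimator, using the closed-form solution `(X'X + alpha * I)^-1 X'y` on centred data so that the intercept is not penalised. Leaving the intercept unpenalised is a design assumption made here, not something the exercise prescribes, and `np.linalg.solve` is used in place of an explicit inverse.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class RidgeRegression(BaseEstimator, RegressorMixin):
    """Closed-form ridge regression with an unpenalised intercept."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        x_mean, y_mean = X.mean(axis=0), y.mean()
        Xc, yc = X - x_mean, y - y_mean
        # Solve (Xc'Xc + alpha * I) w = Xc'yc instead of inverting the matrix.
        gram = Xc.T @ Xc + self.alpha * np.eye(X.shape[1])
        self.coef_ = np.linalg.solve(gram, Xc.T @ yc)
        self.intercept_ = y_mean - x_mean @ self.coef_
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_ + self.intercept_


# Synthetic data with known coefficients, only to exercise the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

model = RidgeRegression(alpha=1.0).fit(X, y)
predictions = model.predict(X)
```

Inheriting from `BaseEstimator` and `RegressorMixin` gives the class `get_params`, `set_params` and `score` for free, which is what a grid search needs. Note that with `alpha=0` and perfectly collinear columns (for example a full set of one-hot columns) the linear system becomes singular and `solve` raises an error.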

### GridSearch

> **Exercise 5**
>
> Create a model pipeline that consists of a `ColumnTransformer` for preprocessing the data and your `RidgeRegression` for fitting the model.
>
> Perform a grid search over your whole pipeline and visualize the results.
> Right now your pipeline does not have a whole lot of parameters to search over: you have `alpha` in the RidgeRegression and you can also play with the imputation strategy. If you have time and energy left, you can try adding other preprocessing steps (polynomial features, different scalers, ...?) or a different ML model.

> **Answer 5**
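
The overall shape of the search could be sketched as below. To keep the block self-contained it swaps in scikit-learn's own `Ridge` and `OneHotEncoder` for the custom estimator and transformer from the earlier exercises, and it runs on synthetic data; the step names and `plot_scores` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["powerPS", "kilometer", "age"]
CATEGORICAL = ["brand", "gearbox"]

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
n = 90
X = pd.DataFrame({
    "powerPS": rng.integers(50, 250, n).astype(float),
    "kilometer": rng.choice([50000.0, 100000.0, 150000.0], n),
    "age": rng.integers(1, 25, n).astype(float),
    "brand": rng.choice(["audi", "bmw", "opel"], n),
    "gearbox": rng.choice(["manuell", "automatik"], n),
})
X.loc[::10, "powerPS"] = np.nan  # give the imputer something to do
y = 40 * X["powerPS"].fillna(120) - 300 * X["age"] + rng.normal(0, 500, n)

numeric = Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])
pipeline = Pipeline([("preprocess", preprocess), ("model", Ridge())])

# Keys follow the pattern <step>__<substep>__<parameter>.
param_grid = {
    "preprocess__num__imputer__strategy": ["mean", "median"],
    "model__alpha": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")
search.fit(X, y)

# One row per alpha, one column per imputation strategy.
scores = pd.DataFrame(search.cv_results_).pivot(
    index="param_model__alpha",
    columns="param_preprocess__num__imputer__strategy",
    values="mean_test_score",
)


def plot_scores(scores):
    """Line plot of the mean CV score against alpha, one line per strategy."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    scores = scores.copy()
    scores.index = scores.index.astype(float)
    ax = scores.plot(marker="o", logx=True)
    ax.set_xlabel("alpha")
    ax.set_ylabel("mean CV score (negative MAE)")
    return ax
```

Calling `plot_scores(scores)` in the notebook draws the comparison. Replacing `Ridge()` with your own `RidgeRegression()` should work unchanged as long as it exposes `alpha` through `get_params`.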

### Python package

Now that you have all the components of your model-building pipeline, you can put them in a Python package.

Make sure that your package:
- can be run from the command line
- can run when the data is stored in a different location
- can be run with a varying parameter grid
- stores a fitted and serialized model pipeline at a location you specify
- has a minimal test suite
- is well documented
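
As a rough starting point, the command-line entry point could look like the sketch below. The flag names (`--data`, `--output`, `--param-grid`, `--target`) are assumptions, the parameter grid is passed as a JSON string, and a small stand-in pipeline with scikit-learn's `Ridge` takes the place of your own components.

```python
import argparse
import json
from pathlib import Path

import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Fit and store a model pipeline.")
    parser.add_argument("--data", type=Path, required=True,
                        help="path to the CSV file with the listings")
    parser.add_argument("--output", type=Path, required=True,
                        help="where to store the serialized pipeline")
    parser.add_argument("--param-grid", type=json.loads,
                        default={"model__alpha": [0.1, 1.0, 10.0]},
                        help="parameter grid as a JSON string")
    parser.add_argument("--target", default="price",
                        help="name of the target column")
    return parser.parse_args(argv)


def build_pipeline():
    # Stand-in pipeline; swap in your ColumnTransformer and RidgeRegression.
    return Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("model", Ridge()),
    ])


def main(argv=None):
    args = parse_args(argv)
    df = pd.read_csv(args.data)
    # The stand-in pipeline only handles numeric columns.
    X = df.drop(columns=[args.target]).select_dtypes("number")
    y = df[args.target]
    search = GridSearchCV(build_pipeline(), args.param_grid, cv=3)
    search.fit(X, y)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(search.best_estimator_, args.output)
    return search

# In the package's entry point (for example __main__.py) you would call main().
```

Because `main` accepts an explicit argument list, a minimal test can call it with a temporary CSV and check that the output file exists and can be loaded again with `joblib.load`.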