<a href="https://colab.research.google.com/github/zerotodeeplearning/ztdl-masterclasses/blob/master/notebooks/Real_World_ML_Car_Prices_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Learn with us: www.zerotodeeplearning.com

Copyright © 2021: Zero to Deep Learning ® Catalit LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Real World ML Car Prices Regression

This is a long exercise with a complex dataset. It is intended to approximate a real world case where data is not clean and you need to compare several approaches and make decisions.

## Exercise 1: Get the data

Original Dataset from: https://www.kaggle.com/austinreese/craigslist-carstrucks-data

Mirrored for convenience at https://archive.org/download/craigslist-carstrucks-data/craigslist-carstrucks-data.zip


Use your knowledge of shell commands to download and unzip the dataset. (Hint: to pass a command to the shell use `!`)

In [None]:
#@title Update and load libraries
!pip install -U -q pandas_profiling missingno optuna

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas_profiling as pp
import missingno as msno

## Exercise 2: Load the dataset

- Load the dataset into a Pandas DataFrame
- Explore it using the `.head()` and `.info()` methods

## Exercise 3: Missing data & Categorical data

- Use the `msno.matrix(...)` command to visualize missing data
- Create a new DataFrame called `df_stats` with the following properties:
  - the index should contain the names of the columns in `df`
  - it should have 3 columns:
    - `perc_missing`: the percentage of missing data in that column
    - `n_uniques`: the number of unique values in that column (cardinality)
    - `dtype`: the data type of that column

The result should look like:


||perc_missing|n_uniques|dtype|
|:-:|-:|-:|-:|
|**state**|0.000000|51|object|
|**region**|0.000000|403|object|
|**...**|...|...|...|
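
One possible way to build such a table is sketched below on a toy DataFrame. The toy data is an assumption made for illustration, and so is the choice to express `perc_missing` on a 0-100 scale rather than as a fraction.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df` (assumption); the exercise uses the Craigslist dataset.
df = pd.DataFrame({
    "state": ["ca", "ny", "ca", "tx"],
    "odometer": [12000.0, np.nan, 54000.0, np.nan],
    "manufacturer": ["ford", None, "bmw", "ford"],
})

# Each Series below is indexed by the column names of `df`,
# so they align into one row per original column.
df_stats = pd.DataFrame({
    "perc_missing": df.isna().mean() * 100,  # share of missing values, in percent
    "n_uniques": df.nunique(),               # distinct non-missing values
    "dtype": df.dtypes,
})

print(df_stats.sort_values("perc_missing"))
```

Sorting by `perc_missing` makes it easier to spot which columns fall into which missing-data regime.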




## Exercise 4: Data Cleaning

Based on the results of the previous exercise, observe that there are a few cases:
- columns with no missing data
- columns with few missing data (< 2%)
- columns with lots of missing data that can be dropped (> 20%)
- columns with a notable amount of missing data that are likely important features (`manufacturer`, 5%, and `odometer`, 17%)


Create a new dataset called `dfclean`, starting from a copy of `df`, and apply the following procedures:
- drop columns with more than 20% missing data for now
- impute missing `odometer` with 0 and add an indicator column `odometer_is_null` that is true when `odometer` has been imputed
- impute missing `manufacturer` with `unknown`
- impute missing `model` with `unknown`
- impute missing `description` with an empty string `''`

- drop rows with missing data from the remaining data
- drop columns with unique identifiers such as: `id`, `url` and `region_url`
- drop `image_url`. We could use images in a second stage, but let's ignore it for now

Once you're done, check `dfclean.info()`. It should read:

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 421257 entries, 0 to 435848
Data columns (total 14 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   region            421257 non-null  object 
 1   price             421257 non-null  int64  
 2   year              421257 non-null  float64
 3   manufacturer      421257 non-null  object 
 4   model             421257 non-null  object 
 5   fuel              421257 non-null  object 
 6   odometer          421257 non-null  float64
 7   title_status      421257 non-null  object 
 8   transmission      421257 non-null  object 
 9   description       421257 non-null  object 
 10  state             421257 non-null  object 
 11  lat               421257 non-null  float64
 12  long              421257 non-null  float64
 13  odometer_is_null  421257 non-null  bool   
dtypes: bool(1), float64(4), int64(1), object(8)
memory usage: 45.4+ MB
``` 
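
The cleaning steps above could be sketched as follows. This is a minimal illustration on a toy DataFrame (an assumption) that carries only a few of the real columns, so the `model`, `description`, `url`, `region_url` and `image_url` steps are omitted; they would follow the same pattern.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df` (assumption) with a handful of the real column names.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "price": [5000, 12000, 0, 8000, 15000, 700],
    "year": [2010.0, 2015.0, np.nan, 2012.0, 2018.0, 2005.0],
    "manufacturer": ["ford", None, "bmw", "audi", "ford", "kia"],
    "odometer": [90000.0, np.nan, 30000.0, 60000.0, 20000.0, 150000.0],
    "size": [None, None, None, "compact", None, None],  # mostly missing
})

dfclean = df.copy()

# Drop columns with more than 20% missing data.
perc_missing = dfclean.isna().mean()
dfclean = dfclean.drop(columns=perc_missing[perc_missing > 0.20].index)

# Record which odometer values were missing, then impute them with 0.
dfclean["odometer_is_null"] = dfclean["odometer"].isna()
dfclean["odometer"] = dfclean["odometer"].fillna(0)

# Impute missing manufacturers with a placeholder category.
dfclean["manufacturer"] = dfclean["manufacturer"].fillna("unknown")

# Drop rows that still contain missing values, then the identifier column.
dfclean = dfclean.dropna().drop(columns=["id"])

dfclean.info()
```

Creating the indicator column before calling `fillna` matters: once the zeros are in place, imputed and genuine values can no longer be told apart.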

## Exercise 5: Pandas Profile Report

Use the `pp.ProfileReport(...)` function to generate a report about `dfclean` and read it. What other things can you say about the dataset?

## Exercise 6: Price data exploration

The `price` column is going to be the target of our regression models. Let's explore it first.

- Create a new variable called `y = dfclean['price']`
- Sort it and inspect the largest and smallest values. Do you notice anything strange?
- Plot the sorted price variable and see how it changes. Use a Log scale for better insights.
- Use a cumulative histogram to identify reasonable thresholds for minimum and maximum prices.

You should observe that there are 3 price regimes:
- roughly 10% of the cars have very low or zero price
- most of the remaining cars are between \$100 and \$100,000
- negligible fraction with price over \$100,000


Further dig into the cars with low price by visualizing some rows with price lower than \$100. You will notice that:

- Some of those rows are generic ads for a car dealer, not a specific vehicle
- Some of those rows contain scraping errors

=> remove rows with price < \$100 from `dfclean`
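
As a rough sketch of the threshold analysis, the share of listings at or below a few candidate prices can be computed directly, which is the numeric equivalent of reading a cumulative histogram. The synthetic prices below are an assumption built to mimic the three regimes described above; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `dfclean` (assumption): 10% zero prices, a clipped
# log-normal bulk, and a handful of implausibly high prices.
rng = np.random.default_rng(0)
bulk = np.clip(rng.lognormal(mean=9.0, sigma=0.6, size=890), 200, 90_000)
prices = np.concatenate([np.zeros(100), bulk, np.full(10, 1_000_000.0)])
dfclean = pd.DataFrame({"price": prices})

y = dfclean["price"]

# Empirical cumulative distribution at a few candidate thresholds.
for threshold in [0, 100, 1_000, 10_000, 100_000]:
    share = (y <= threshold).mean()
    print(f"price <= {threshold:>7,}: {share:.1%}")

# Keep only listings priced at $100 or more.
dfclean = dfclean[dfclean["price"] >= 100]
```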

## Exercise 7: Numerical features

Let's explore the numerical features, their ranges and their distributions.

- create a variable `num_columns` that contains the names of the numerical columns. You can find them with the `.select_dtypes` method
- use the `.describe` method to see what the minimum and maximum values are for each of them.
- Plot a histogram for each of the numerical features
- If the feature is extremely skewed try plotting the histogram of `np.log10(feature + 1)` for better visualization.
- BONUS: display all the plots in the same figure using `plt.subplot`

## Exercise 8: Categorical features

- Create a new variable called `cat_columns` that contains the features with categorical data (not numbers)
- Similarly to how you did for `df`, calculate the cardinality of each column, i.e. the number of distinct values.

Observe that there are 2 types of categorical columns:
- low cardinality (<= 450) => we will 1-hot-encode these
- high cardinality (> 30000) => we will use hashing, vectorization or embeddings to treat these.
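
A minimal sketch of the cardinality split, assuming a toy `dfclean` and a deliberately tiny threshold (`max_onehot = 2`) so that the example produces both groups; on the real data the threshold would sit somewhere between the two regimes listed above.

```python
import pandas as pd

# Toy stand-in for `dfclean` (assumption).
dfclean = pd.DataFrame({
    "year": [2010.0, 2015.0, 2012.0, 2018.0],
    "odometer_is_null": [False, False, True, False],
    "fuel": ["gas", "gas", "diesel", "gas"],
    "description": ["clean title", "one owner", "runs great", "like new"],
})

# Categorical columns: everything that is neither numeric nor boolean.
cat_columns = dfclean.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Cardinality = number of distinct values per categorical column.
cardinality = dfclean[cat_columns].nunique().sort_values()
print(cardinality)

max_onehot = 2  # toy threshold, chosen only for this example
low_card = cardinality[cardinality <= max_onehot].index.tolist()
high_card = cardinality[cardinality > max_onehot].index.tolist()
```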

## Exercise 9: Map

Display a map with a sample of 10000 cars.
- Use the `px.scatter_mapbox` function
- Use `df.sample` to extract a sample
- Use the `lat` and `long` columns

You will observe that cars are mostly in the US, with very few outliers

## Exercise 10: Outliers

Let's further investigate outliers.

- Display a scatter plot of `price` VS `year` and notice that there are some cars with extremely high prices.
- Look at some of the rows where price > \$100,000

Notice that:
- in some cases the decimal point before the cents is probably missing (adding two digits to the price)
- in some cases it's clearly nonsense (111111, 121212)
- in some cases it may be a luxury or old car
- in some cases it may be a mistake

In total, this affects fewer than 500 cars.

=> drop all cars with price > \$100,000 for now

- Draw a new scatter plot of `price` VS `year` to see if you notice any trend.
- BONUS: use the `sns.boxplot` function for fancier plots
- BONUS: repeat the plots using `np.log10(price)` VS `year`

## Exercise 11: Odometer inspection

- Display a scatter plot of `price` VS `log10(odometer)` and of `log10(price)` VS `log10(odometer)` to see if there's any correlation
- Use a `sns.barplot` to display the average odometer for each car `manufacturer`
- Notice that there's a distinct group of cars with odometer < 500
- Inspect some rows for that group. Can you guess what happened there?
- Decide what to do with those rows. Will you impute them? Will you ignore them? Will you drop them?

Most likely, odometer values below 500 are expressed in thousands of miles instead of miles. We will leave them as is for now and consider imputing them later.

## Exercise 12: Naive Machine Learning

Let's build our first machine learning model. It will most likely be very bad, but it will allow us to close the loop, defining a metric and a baseline.

### Part 1: Baseline

- Create a new variable called `y` that contains the `price` and drop `price` from `dfclean`
- Create a baseline model that always predicts the average price
- Score the model with:
  - Mean Squared Error
  - Mean Absolute Error
  - Mean Absolute Percentage Error (MAPE): use this function:
  ```python
  def mape(y_true, y_pred): 
      mask = (y_true != 0)
      return (np.fabs(y_true - y_pred)/y_true)[mask].mean() 
  ```
  - R2 Score
  - Plot of `y_predicted` VS `y_true`
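
The numeric scores for the baseline could be computed along these lines. The four prices are made up for illustration; `mape` is copied from the exercise text, and the other metrics come from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def mape(y_true, y_pred):
    mask = (y_true != 0)
    return (np.fabs(y_true - y_pred) / y_true)[mask].mean()


# Made-up prices (assumption), standing in for the real `y`.
y = np.array([1000.0, 5000.0, 12000.0, 30000.0])

# Baseline: always predict the average price.
y_pred = np.full_like(y, y.mean())

scores = {
    "mse": mean_squared_error(y, y_pred),
    "mae": mean_absolute_error(y, y_pred),
    "mape": mape(y, y_pred),
    "r2": r2_score(y, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A model that always predicts the mean of the data it is scored on has an R2 of 0 by construction, which makes it a convenient reference point for later models.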

### Part 2: Linear Regression

- Use the `LinearRegression` class from Scikit Learn
- Build a simple model that uses only `year` and `odometer` as features
- Evaluate the model with all the scores used for the baseline
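
A minimal sketch of the two-feature linear model, fitted on synthetic data. The data-generating formula below is an assumption chosen only so that price rises with `year` and falls with `odometer`; it says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic features and prices (assumption).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "year": rng.integers(1995, 2021, size=n).astype(float),
    "odometer": rng.uniform(0, 250_000, size=n),
})
noise = rng.normal(0, 1500, size=n)
y = 5000 + 500 * (X["year"] - 1990) - 0.03 * X["odometer"] + noise

features = ["year", "odometer"]
model = LinearRegression().fit(X[features], y)
y_pred = model.predict(X[features])

print("coefficients:", dict(zip(features, model.coef_)))
print("r2:", r2_score(y, y_pred))
print("mae:", mean_absolute_error(y, y_pred))
```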

## Exercise 13: Ideas

Now that you've created your first model, make a list of ideas that you could try in order to improve it. These ideas could involve:
- data manipulation
- feature engineering
- model selection
- tooling and infrastructure

Generate at least 10-15 ideas.

## Exercise 14: Assessing ideas

Bucket your ideas into 3 groups:
- EASY. These should be straightforward to code if you know the API and their execution should not take more than a few minutes.
- MEDIUM. These could take a little longer to code and may take a bit more to execute. The whole experiment should be achievable within a few hours.
- HARD. These are good ideas that are time consuming, either because the implementation is not straightforward, or because decisions are involved (e.g. how to impute missing data or how to better deal with outliers), or because their evaluation will take a long time.

- Make a plan of your next steps that involves doing all the easy ideas and possibly some of the medium ideas


## Exercise 15: First and easiest idea

My first and easiest idea is to predict the Log10 of prices and perform a Train / Val / Test split. I will also create a small sample X_sample to allow for quick experimentation.

My sample sizes will be:

```
Train set size: 351475 | 80.6% of original df
Validation set size: 20000 | 4.59% of original df
Test set size: 20000 | 4.59% of original df

Small Training sample size: 35147 | 10.0% of training set
```

You can go ahead and implement your first and easiest idea or follow along and implement this one.
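
One way to produce such a split is to call `train_test_split` twice with absolute set sizes, as sketched below. The synthetic data and the reduced sizes (100 rows each for validation and test instead of 20000) are assumptions made to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and prices (assumption).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "year": rng.integers(1990, 2021, size=n),
    "odometer": rng.uniform(0, 250_000, size=n),
})
price = pd.Series(rng.uniform(100, 100_000, size=n), name="price")

# Predict the Log10 of the price instead of the raw price.
y = np.log10(price)

n_val = n_test = 100  # the exercise uses 20000 each
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, random_state=0)

# Small training sample (10% of the training set) for quick experiments.
X_sample = X_train.sample(frac=0.1, random_state=0)
y_sample = y_train.loc[X_sample.index]

for name, part in [("Train", X_train), ("Validation", X_val), ("Test", X_test)]:
    print(f"{name} set size: {len(part)} | {len(part) / n:.1%} of original df")
```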

## Exercise 16: Next idea

Let's build some tools that will facilitate assessing other ideas. In particular build a `train_val_model` function with the signature:

```python
def train_val_model(model, model_name, X_train, y_train, X_val, y_val):
    ....
    ....
    return pd.DataFrame(results,
                        columns=[model_name],
                        index=['model',
                               'r2_score_train', 'r2_score_val',
                               'mape_train', 'mape_val',
                               'mse_train', 'mse_val',
                               'mae_train', 'mae_val',
                               'train_time', 'pred_time'])
```

Feel free to implement your next easiest idea or follow along and implement this one.
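
A possible body for `train_val_model` is sketched below. It keeps the signature and the result index from the snippet above; the timing approach, the synthetic data and the order in which predictions are made are assumptions.

```python
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def mape(y_true, y_pred):
    mask = (y_true != 0)
    return (np.fabs(y_true - y_pred) / y_true)[mask].mean()


def train_val_model(model, model_name, X_train, y_train, X_val, y_val):
    # Time the fit.
    t0 = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    # Time the prediction on the validation set.
    t0 = time.perf_counter()
    pred_val = model.predict(X_val)
    pred_time = time.perf_counter() - t0

    pred_train = model.predict(X_train)

    # Same order as the index labels below.
    results = [model,
               r2_score(y_train, pred_train), r2_score(y_val, pred_val),
               mape(y_train, pred_train), mape(y_val, pred_val),
               mean_squared_error(y_train, pred_train),
               mean_squared_error(y_val, pred_val),
               mean_absolute_error(y_train, pred_train),
               mean_absolute_error(y_val, pred_val),
               train_time, pred_time]

    return pd.DataFrame(results,
                        columns=[model_name],
                        index=['model',
                               'r2_score_train', 'r2_score_val',
                               'mape_train', 'mape_val',
                               'mse_train', 'mse_val',
                               'mae_train', 'mae_val',
                               'train_time', 'pred_time'])


# Synthetic log-price-like target (assumption), just to exercise the function.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 4.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.01, size=200)

results = train_val_model(LinearRegression(), "linear",
                          X[:150], y[:150], X[150:], y[150:])
print(results)
```

Because each call returns a single-column DataFrame, results from several models can be combined side by side with `pd.concat(..., axis=1)`.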

## Exercise 17: Next idea

Let's 1-hot encode the categorical columns with more than 2 and fewer than 500 distinct values.

- You can use `ColumnTransformer` and `OneHotEncoder` transformers or `pd.get_dummies`.
- You can also calculate the following 3 additional features at this stage:
  - the car age as 2020 - year
  - the Log10 of odometer
  - the `odometer_is_null` indicator column

Feel free to implement your next easiest idea or follow along and implement this one.

If you decide to implement this idea, check the shape of your `X_train` dataset. After 1-hot encoding it should be:

`X_train.shape: (351475, 514)`
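
The `ColumnTransformer` route could look roughly like this. The toy data is an assumption, as is the `+ 1` inside the logarithm (used here because imputed odometers are 0) and the `handle_unknown="ignore"` setting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the cleaned features (assumption).
X = pd.DataFrame({
    "year": [2010.0, 2015.0, 2012.0, 2018.0],
    "odometer": [90000.0, 0.0, 60000.0, 20000.0],
    "odometer_is_null": [False, True, False, False],
    "fuel": ["gas", "gas", "diesel", "electric"],
    "transmission": ["automatic", "manual", "automatic", "other"],
})

# Engineered numerical features.
X_feat = X.copy()
X_feat["age"] = 2020 - X_feat["year"]
X_feat["log_odometer"] = np.log10(X_feat["odometer"] + 1)  # +1 avoids log10(0)
X_feat["odometer_is_null"] = X_feat["odometer_is_null"].astype(int)

onehot_columns = ["fuel", "transmission"]
numeric_columns = ["age", "log_odometer", "odometer_is_null"]

# One-hot encode the categorical columns and pass the numeric ones through.
# Columns not listed here (year, odometer) are dropped by default.
ct = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), onehot_columns),
    ("num", "passthrough", numeric_columns),
])

X_train = ct.fit_transform(X_feat)
print("X_train.shape:", X_train.shape)
```

Fitting the transformer on the training set only and reusing it with `ct.transform(...)` on the validation and test sets keeps the encoded columns consistent across splits.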

## Exercise 18: Next idea

New baseline.

Use a `DummyRegressor` to create a new baseline model and score it.
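
A minimal sketch with `DummyRegressor`, using made-up Log10 prices as an assumption. The features are ignored by this model, so a column of zeros is enough.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up Log10 prices (assumption); features are irrelevant to a dummy model.
X_train, y_train = np.zeros((4, 1)), np.array([3.0, 3.5, 4.0, 4.5])
X_val, y_val = np.zeros((2, 1)), np.array([3.2, 4.1])

# Always predicts the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
y_pred = baseline.predict(X_val)

print("prediction:", y_pred)
print("r2_score_val:", r2_score(y_val, y_pred))
print("mae_val:", mean_absolute_error(y_val, y_pred))
```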

## Exercise 19: Next idea

Let's iterate on different models and see which one has the best performance. Feel free to implement this one or your next best idea.

To iterate on models:
- Load all the necessary model classes from `sklearn`, `xgboost`, `lightgbm`
- Make a list of model instances
- Iterate on the model instances:
  - train the models on `X_train` or on `X_sample` for convenience
  - evaluate the models on `X_val`
  - accumulate your results in a DataFrame called `df_results` so that they are easy to compare
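
The loop could be sketched as follows. To stay self-contained, this example uses only `sklearn` models and synthetic data, both of which are assumptions; `xgboost` and `lightgbm` regressors expose the same `fit` / `predict` interface and could be added to the dictionary in the same way.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption) standing in for X_sample / X_val.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(600, 3))
y = 3.0 + X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(0, 0.05, size=600)
X_sample, y_sample = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

# One column per model, one row per metric.
scores = {}
for name, model in models.items():
    model.fit(X_sample, y_sample)
    y_pred = model.predict(X_val)
    scores[name] = {
        "r2_score_val": r2_score(y_val, y_pred),
        "mae_val": mean_absolute_error(y_val, y_pred),
    }

df_results = pd.DataFrame(scores)
print(df_results.T.sort_values("r2_score_val", ascending=False))
```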

## Exercise 20: Next idea

Best model tuning. Let's look for the best combination of hyperparameters in the best model. The best performance was achieved with an `LGBMRegressor` with default parameters. Let's look for the best hyperparameter combination using the following tricks:

- Let's leverage the fact that LightGBM can understand categorical columns. We will create a new set of features where categorical columns are encoded with `OrdinalEncoder` instead of `OneHotEncoder`.
- We will create a `lightgbm.Dataset` with these features and train the model setting the `categorical_feature=` parameter
- Let's further optimize the model using the `Optuna` package. We can follow the [example code](https://github.com/optuna/optuna/blob/master/examples/lightgbm_tuner_simple.py).

Feel free to continue with your ideas or try to implement this one.
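
The ordinal-encoding step on its own could be sketched like this; the LightGBM and Optuna parts are left out here. The toy data and the choice to map categories unseen during fitting to `-1` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical features (assumption).
X_train = pd.DataFrame({
    "fuel": ["gas", "diesel", "gas"],
    "manufacturer": ["ford", "bmw", "audi"],
})
X_val = pd.DataFrame({
    "fuel": ["gas", "hybrid"],  # "hybrid" never appears in X_train
    "manufacturer": ["bmw", "ford"],
})
cat_columns = ["fuel", "manufacturer"]

# Map each category to an integer code; unseen categories become -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

X_train_ord = pd.DataFrame(encoder.fit_transform(X_train[cat_columns]),
                           columns=cat_columns, index=X_train.index)
X_val_ord = pd.DataFrame(encoder.transform(X_val[cat_columns]),
                         columns=cat_columns, index=X_val.index)

print(X_train_ord)
print(X_val_ord)
```

The names in `cat_columns` are the ones that would then be passed through the `categorical_feature=` parameter mentioned above.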

## Conclusion and Next steps

- Is your model good enough?
- Can you deploy it?
- What other things should you consider?
- What are your next steps?