# Goals:
1. Imputing & Encoding data where needed.
2. Splitting data into train & test sets (20% test size).
3. Filling & Transforming sets (separately).
4. Using `Pipeline()` class to perform transformations.

In [1]:
# Standard Imports:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Reading CSV into a DF:
car_sales_raw = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")
car_sales_raw.head()

     Make Colour  Odometer (KM)  Doors    Price
0   Honda  White        35431.0    4.0  15323.0
1     BMW   Blue       192714.0    5.0  19943.0
2   Honda  White        84714.0    4.0  28343.0
3  Toyota  White       154365.0    4.0  13434.0
4  Nissan   Blue       181577.0    3.0  14043.0


Missing Values count:

In [3]:
car_sales_raw.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [4]:
# Dropping rows with NA values in the target column.
# dropna() returns a new DataFrame, so car_sales_raw is left untouched:
car_sales = car_sales_raw.dropna(subset=["Price"])
car_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64
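
As a minimal sketch of what `dropna(subset=...)` does, the snippet below uses a small made-up frame (`toy` is hypothetical, not the car sales data). Only rows with a missing target are removed; missing values in the other columns stay in place for the imputers to handle later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    "Make": ["Honda", None, "BMW", "Toyota"],
    "Doors": [4.0, 4.0, np.nan, 5.0],
    "Price": [15000.0, np.nan, 20000.0, np.nan],
})

# Drop only the rows whose target ("Price") is missing.
# dropna() returns a new DataFrame; `toy` itself is unchanged.
toy_clean = toy.dropna(subset=["Price"])

print(len(toy), len(toy_clean))
print(toy_clean.isna().sum())
```

The missing `Doors` value in the remaining rows is still reported by `isna().sum()`, which is the same pattern seen in the output above.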

Splitting Data into X & Y:

In [5]:
X = car_sales.drop("Price", axis=1)
Y = car_sales["Price"]

# Steps:

1. Define categorical, door and numeric features.
2. Build transformer Pipeline()s for imputing missing data and encoding data.
3. Combine our transformer Pipeline()s with ColumnTransformer().
4. Build a Pipeline() to preprocess and model our data with the ColumnTransformer() and RandomForestRegressor().
5. Split the data into train and test using train_test_split().
6. Fit the preprocessing and modelling Pipeline() on the training data.
7. Score the preprocessing and modelling Pipeline() on the test data.

## Steps 1 & 2:
- 'Make' & 'Colour' are strings, so they should be one-hot encoded (missing values are first filled with the placeholder 'missing').
- Missing 'Doors' values should be imputed with 4, since it's the most common number of doors.
- Missing 'Odometer (KM)' values are imputed with the mean of the Odometer values.

In [6]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer # this will help us fill missing values
from sklearn.preprocessing import OneHotEncoder # this will help us turn our categorical variables into numbers

# Define categorical columns
categorical_features = ["Make", "Colour"]
# Create categorical transformer (imputes missing values, then OneHot encodes them)
categorical_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
  ('onehot', OneHotEncoder(handle_unknown='ignore'))                                         
])
# Define door feature
door_feature = ["Doors"]
# Create door transformer (fills all door missing values with 4)
door_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='constant', fill_value=4)),
])

# Define numeric features
numeric_features = ["Odometer (KM)"]
# Create a transformer for filling all missing numeric values with the mean
numeric_transformer = Pipeline(steps=[
  ('imputer', SimpleImputer(strategy='mean'))  
])
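
To see what each imputation strategy produces, here is a rough sketch on a three-row made-up frame (`toy` and its values are assumptions for illustration). It applies the same `SimpleImputer` settings as the transformers above, one column at a time, and then one-hot encodes the filled 'Make' column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame; dtype=object keeps the string column in the
# form SimpleImputer expects when np.nan marks the missing entry
toy = pd.DataFrame({
    "Make": pd.Series(["Honda", np.nan, "BMW"], dtype=object),
    "Doors": [4.0, np.nan, 5.0],
    "Odometer (KM)": [10000.0, np.nan, 30000.0],
})

# Categorical: fill with the placeholder string "missing"
make_filled = SimpleImputer(strategy="constant", fill_value="missing").fit_transform(toy[["Make"]])
# Doors: fill with the constant 4
doors_filled = SimpleImputer(strategy="constant", fill_value=4).fit_transform(toy[["Doors"]])
# Odometer: fill with the column mean, here (10000 + 30000) / 2
odo_filled = SimpleImputer(strategy="mean").fit_transform(toy[["Odometer (KM)"]])

# One-hot encode the filled categorical column: one output column per category
onehot = OneHotEncoder(handle_unknown="ignore").fit_transform(make_filled).toarray()

print(make_filled.ravel())
print(doors_filled.ravel())
print(odo_filled.ravel())
print(onehot.shape)
```

Note that "missing" becomes its own category, so the encoded 'Make' column has three output columns here rather than two.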

## Step 3: Combining transformer pipelines with ColumnTransformer

In [7]:
from sklearn.compose import ColumnTransformer

# Create a column transformer which combines all of the other transformers 
# into one step
preprocessor = ColumnTransformer(
    transformers=[
      # (name, transformer_to_use, columns_to_transform)
      ('categorical', categorical_transformer, categorical_features),
      ('door', door_transformer, door_feature),
      ('numerical', numeric_transformer, numeric_features)
])
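
The sketch below shows, on a made-up four-row frame, how a `ColumnTransformer` built like the one above concatenates the outputs of its transformers side by side, in the order they are listed. The names `toy` and `toy_preprocessor` are hypothetical, and `sparse_threshold=0` is an assumption added here only to force a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    "Make": pd.Series(["Honda", "BMW", np.nan, "Honda"], dtype=object),
    "Colour": pd.Series(["White", "Blue", "White", np.nan], dtype=object),
    "Odometer (KM)": [1000.0, 3000.0, np.nan, 2000.0],
    "Doors": [4.0, np.nan, 3.0, 4.0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Make", "Colour"]),
        ("door", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
        ("numerical", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
    ],
    sparse_threshold=0,  # assumption: always return a dense array for display
)

out = toy_preprocessor.fit_transform(toy)

# 3 'Make' categories + 3 'Colour' categories + 'Doors' + 'Odometer (KM)'
print(out.shape)
print(out[:, -2])  # imputed 'Doors' column
print(out[:, -1])  # imputed 'Odometer (KM)' column
```

The output column order follows the `transformers` list, not the column order of the input frame, which is why 'Doors' comes before 'Odometer (KM)' in the result.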

## Step 4: Building a preprocessing and modelling Pipeline with a RandomForestRegressor estimator:

In [8]:
from sklearn.ensemble import RandomForestRegressor
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("regressor",RandomForestRegressor())
                         ])

## Step 5: Splitting data into Train & Test sets:

In [9]:
from sklearn.model_selection import train_test_split
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
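
As a small illustration of the 80/20 split, the sketch below runs `train_test_split()` on made-up arrays (`X_toy` and `y_toy` are hypothetical). It uses the `random_state` parameter rather than `np.random.seed()`, which is one alternative way to make the split repeatable; it is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.2 keeps 20% of the rows (2 of 10) for testing
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```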


## Step 6: Fitting model on training set
- When `fit()` is called on a `Pipeline`, each transformer step's `fit_transform()` method is called in order, and its output is passed to the next step.
- The final step (the estimator) then has its `fit()` method called on the transformed data.
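
The call order can be checked with a small experiment. The sketch below defines two hypothetical classes, `LoggingTransformer` and `LoggingRegressor`, which do nothing useful except record which of their methods the `Pipeline` invokes. They are illustrative stand-ins, not part of scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin
from sklearn.pipeline import Pipeline

calls = []  # records which methods the Pipeline invokes, in order


class LoggingTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical identity transformer that logs how it is called."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        calls.append("transformer.transform")
        return X

    def fit_transform(self, X, y=None, **fit_params):
        calls.append("transformer.fit_transform")
        self.fit(X, y)
        return X  # identity transform, returned directly


class LoggingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical regressor that always predicts the training mean."""

    def fit(self, X, y):
        calls.append("regressor.fit")
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        calls.append("regressor.predict")
        return np.full(len(X), self.mean_)


X_toy = np.arange(6, dtype=float).reshape(-1, 1)
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

pipe = Pipeline(steps=[("transformer", LoggingTransformer()),
                       ("regressor", LoggingRegressor())])

pipe.fit(X_toy, y_toy)
fit_calls = list(calls)

calls.clear()
pipe.predict(X_toy)
predict_calls = list(calls)

print(fit_calls)      # methods called during fit
print(predict_calls)  # methods called during predict
```

At prediction time the transformer's `transform()` is used instead of `fit_transform()`, so statistics learned from the training set (such as the Odometer mean) are reused on the test set rather than recomputed.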

In [10]:
model.fit(X_train,Y_train)

## Step 7: Scoring model on test set

In [11]:
model.score(X_test, Y_test)

0.22188417408787875
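
For a regressor such as `RandomForestRegressor`, `score()` returns the coefficient of determination (R^2), so the value above means the model explains roughly 22% of the variance in 'Price' on the test set. The sketch below computes R^2 by hand on made-up numbers (`y_true` and `y_pred` are hypothetical) and compares it with `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)         # 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 20.0
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))
```

A score of 1.0 would mean perfect predictions, while 0.0 is what a model that always predicts the mean of the targets would get.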