# DAB200 -- Lab 5

In this lab, you will gain further experience dealing with missing data, practice converting non-numeric features in a data set to numeric, and explore ways to increase model performance by improving the data set.

**Target**: `Comb Unadj FE - Conventional Fuel`

**Data set**: make sure you use the data assigned to your group!

| Groups | Data set |
| :-: | :-: |
| 1-5 | vehicles_2014.csv |
| 6-10 | vehicles_2015.csv |
| 11-15 | vehicles_2016.csv |
| 16-20 | vehicles_2017.csv |
| 21-26 | vehicles_2018.csv |
| 27-32 | vehicles_2019.csv |

**Important Notes:**
- **Only provide FINAL code in each Part of the lab**
    - I only want to see final, well-organized code in your submission
- Only need to use **random forest** models
    - All random forest models should include 150 decision trees
- Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
     - **If you use `train_test_split()` or calculate MSE or MAE you will have marks deducted!**
- Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
- Don't make assumptions!
- Information about the data can be found at [this website](https://www.fueleconomy.gov/feg/download.shtml)
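
The out-of-bag requirement above can be sketched as follows. This is a minimal sketch, assuming that "average of 10 runs" means each run uses a different random seed; the helper name `mean_oob_score` and the synthetic demo data are hypothetical, and Section 5.2 of the course notes may prescribe a different procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def mean_oob_score(X, y, n_runs=10, n_trees=150):
    """Fit n_runs random forests with different seeds and average their OOB R^2."""
    scores = []
    for seed in range(n_runs):
        rf = RandomForestRegressor(
            n_estimators=n_trees, oob_score=True, n_jobs=-1, random_state=seed
        )
        rf.fit(X, y)                   # no train/test split: OOB rows act as validation data
        scores.append(rf.oob_score_)   # R^2 computed on the out-of-bag samples
    return float(np.mean(scores)), scores

# Synthetic demo data (hypothetical, not the vehicles data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

mean_score, all_scores = mean_oob_score(X_demo, y_demo, n_runs=3, n_trees=50)
print(mean_score, all_scores)
```

Varying the seed per run is what makes the average informative: with a single fixed `random_state`, every run rebuilds the same forest and reports the same score.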



**A few tips**
 - The function `sniff_modified` (from lecture notebooks) will probably come in handy
 - If you use `dropna`, it may be worthwhile exploring the `thresh` parameter
 - If you use `.info()`, you might need to set the `verbose` parameter
 - List comprehensions (or other programmatic methods) might be helpful in trying to select columns: `[f for f in df.columns if 'some_text' in f]`
 - And you don't have to get all the columns of interest in one go, because `['a', 'b'] + ['c', 'd']` evaluates to `['a', 'b', 'c', 'd']`
 - If a feature has only 1 unique element how long would the list `df[col].unique()` be?
 - To help in viewing the data, you may find these commands helpful (see [here](https://thispointer.com/python-pandas-how-to-display-full-dataframe-i-e-print-all-rows-columns-without-truncation/) for explanation):
    - `pd.set_option('display.max_rows', None)`
    - `pd.set_option('display.max_columns', None)`
    - `pd.set_option('display.width', None)`
    - `pd.set_option('display.max_colwidth', None)`
 - start simple; then build up complexity, but only if needed

### Part 0

Please provide the following information:
 - Group Number 
 - Group Members

     

### Part 1: Initial data clean-up

In this part, you will follow the steps below to do an initial clean-up of the data.

##### Step 1: Remove target related features

The data contains some features that are essentially equivalent to the target. If we leave them in, they will leak information about the target to the model and the model performance will be erroneously high. Columns with a date are also not helpful here. In this step, we will remove any feature that contains:
 - 'smog'
 - 'co2'
 - 'conventional fuel' (except for the actual target!)
 - 'date'
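
One way to apply the keyword list above in a single pass is sketched below. It assumes a case-insensitive match on the column name is what the step intends; the helper name `drop_target_related` and the demo columns marked hypothetical are illustrative only.

```python
import pandas as pd

TARGET = "Comb Unadj FE - Conventional Fuel"
KEYWORDS = ["smog", "co2", "conventional fuel", "date"]

def drop_target_related(df, keywords=KEYWORDS, target=TARGET):
    """Return a copy of df without columns whose lower-cased name contains a keyword."""
    to_drop = [
        col for col in df.columns
        if col != target and any(k in col.lower() for k in keywords)
    ]
    return df.drop(columns=to_drop)

demo = pd.DataFrame({
    "City CO2 Rounded Adjusted": [365, 742],
    "Smog Rating": [5, 1],                          # hypothetical column name
    "Release Date": ["2014-01-01", "2014-02-01"],   # hypothetical column name
    "Annual Fuel1 Cost - Conventional Fuel": [2050, 4050],
    "Eng Displ": [1.8, 6.0],
    TARGET: [34.4702, 18.1901],
})

cleaned = drop_target_related(demo)
print(list(cleaned.columns))
```

Lower-casing the column name before matching avoids missing a column because of capitalization, and excluding the target inside the comprehension means it never has to be added back afterwards.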


In [54]:
#Code added on 11-04-2021 - Importing the packages
import pandas as pd
import numpy as np

In [55]:
#Code added on 11-04-2021 - Reading the veh15.csv file
Veh15 = pd.read_csv('vehicles_2015.csv')
Veh15.head(5).T

Unnamed: 0,0,1,2,3,4
Model Year,2015,2015,2015,2015,2015
Mfr Name,FCA Italy,aston martin,aston martin,aston martin,aston martin
Division,Alfa Romeo,Aston Martin Lagonda Ltd,Aston Martin Lagonda Ltd,Aston Martin Lagonda Ltd,Aston Martin Lagonda Ltd
Carline,4C,V12 Vantage S,V8 Vantage,V8 Vantage,V8 Vantage S
Verify Mfr Cd,FTG,ASX,ASX,ASX,ASX
...,...,...,...,...,...
120V Charge time at 120 Volts (hours),,,,,
PHEV Total Driving Range (rounded to nearest 10 miles)DISTANCE,,,,,
City PHEV Composite MPGe,,,,,
Hwy PHEV Composite MPGe,,,,,


In [56]:
#Code added on 11-04-2021 - Normalizing all the missing/null values in the string and objects.
from pandas.api.types import is_string_dtype, is_object_dtype
def df_normalize_strings(df):
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()
            df[col] = df[col].fillna(np.nan)
            df[col] = df[col].replace('mod', np.nan)
            df[col] = df[col].replace('not applicable.', 'not applicable')
            df[col] = df[col].replace('??', np.nan)

df_normalize_strings(Veh15)
Veh15['Car/Truck Category - Cash for Clunkers Bill.'] = Veh15['Car/Truck Category - Cash for Clunkers Bill.'].replace('1', np.nan)

In [57]:
# #Code added on 11-04-2021 - converting numeric variables from object to float
Veh15['MFR Calculated Gas Guzzler MPG '] = Veh15['MFR Calculated Gas Guzzler MPG '].astype(float)
Veh15['FE Rating (1-10 rating on Label)'] = Veh15['FE Rating (1-10 rating on Label)'].astype(float)
Veh15['GHG Rating (1-10 rating on Label)'] = Veh15['GHG Rating (1-10 rating on Label)'].astype(float)
Veh15['Comb Unadj FE - Conventional Fuel'] = Veh15['Comb Unadj FE - Conventional Fuel'].astype(float)

In [58]:
#Code added on 11-04-2021 - creating a function called sniff_modified
def sniff_modified(df):
    with pd.option_context("display.max_colwidth", 20):
        info = pd.DataFrame()
        info['data type'] = df.dtypes
        info['percent missing'] = df.isnull().sum()*100/len(df)
        info['No. unique'] = df.apply(lambda x: len(x.unique()))
        info['unique values'] = df.apply(lambda x: x.unique())
        return info.sort_values('data type')
sniff_modified(Veh15)

Unnamed: 0,data type,percent missing,No. unique,unique values
Model Year,int64,0.000000,1,[2015]
EPA Calculated Annual Fuel Cost - Conventional Fuel ----- Annual fuel cost error. Please revise Verify.,int64,0.000000,51,"[2050, 4050, 3550, 3800, 3350, 2200, 2850, 570..."
Annual Fuel1 Cost - Conventional Fuel,int64,0.000000,51,"[2050, 4050, 3550, 3800, 3350, 2200, 2850, 570..."
Intake Valves Per Cyl,int64,0.000000,2,"[2, 1]"
City CO2 Rounded Adjusted,int64,0.000000,401,"[365, 742, 655, 681, 656, 772, 686, 753, 395, ..."
...,...,...,...,...
Battery Type Desc,object,96.322942,3,"[nan, lithium ion, nimh]"
Range2 - Alt Fuel Model Typ Driving Range - Alternative Fuel,object,92.086331,46,"[nan, 287, 304, 262, 285, 273, 300, 238, 290, ..."
Fuel2 Usage - Alternative Fuel,object,92.086331,3,"[nan, e, cng]"
Stop/Start System (Engine Management System) Description,object,0.000000,2,"[no, yes]"


In [59]:
#Code added on 11-04-2021 - Creating the variable Veh15_primary
Veh15_primary = Veh15.copy()

In [60]:
#Code added on 11-04-2021 - removing all features with Smog
for col in Veh15_primary.columns.tolist():
    if 'Smog' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [61]:
#Code added on 11-04-2021 - removing all features with CO2
for col in Veh15_primary.columns.tolist():
    if 'CO2' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [62]:
#Code added on 11-04-2021 - removing all features with Conventional Fuel
for col in Veh15_primary.columns.tolist():
    if 'Conventional Fuel' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)
#Code added on 11-04-2021 - Adding back the target variable
Veh15_primary['Comb Unadj FE - Conventional Fuel'] = Veh15['Comb Unadj FE - Conventional Fuel']

In [63]:
#Code added on 11-04-2021 - removing all features with date
for col in Veh15_primary.columns.tolist():
    if 'Date' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [64]:
#Code added on 11-04-2021 - removing all features with Rating
for col in Veh15_primary.columns.tolist():
    if 'Rating' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [65]:
#Code added on 11-04-2021 - removing all features with rating
for col in Veh15_primary.columns.tolist():
    if 'rating' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [66]:
#Code added on 11-04-2021 - removing all features with MPG
for col in Veh15_primary.columns.tolist():
    if 'MPG' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [67]:
#Code added on 11-04-2021 - removing all features with GHG
for col in Veh15_primary.columns.tolist():
    if 'GHG' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [68]:
#Code added on 11-04-2021 - removing all features with Alternative Fuel
for col in Veh15_primary.columns.tolist():
    if 'Alternative Fuel' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [69]:
#Code added on 11-04-2021 - removing all features with fuel costs
for col in Veh15_primary.columns.tolist():
    if 'fuel costs' in col:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [70]:
#Code added on 11-04-2021 - running Veh15_primary through sniff_modified
sniff_modified(Veh15_primary)

Unnamed: 0,data type,percent missing,No. unique,unique values
Model Year,int64,0.000000,1,[2015]
Carline Class,int64,0.000000,23,"[1, 2, 3, 4, 5, 6, 7, 8, 30, 10, 11, 12, 13, 1..."
EPA FE Label Dataset ID,int64,0.000000,1251,"[16850, 16459, 16455, 16452, 16456, 16453, 164..."
Intake Valves Per Cyl,int64,0.000000,2,"[2, 1]"
# Gears,int64,0.000000,7,"[6, 7, 8, 5, 1, 9, 4]"
...,...,...,...,...
"Regen Braking Type, If Other",object,99.840128,3,"[nan, brake pedal triggered regenerative, brak..."
Descriptor - Model Type (40 Char or less),object,36.370903,20,"[sidi; , nan, sidi; z06; , sidi; with stop-sta..."
Driver Cntrl Regen Braking?,object,96.482814,3,"[nan, n, y]"
Var Valve Lift?,object,0.000000,2,"[n, y]"


##### Step 2: Remove features with $\ge$90% missing values

The data contains some features that have a lot of missing values. Remove any feature that has $\ge$90% missing values. There is nothing magical about this number. We are simply picking a starting threshold. In a real-world scenario we would probably need to revisit and justify this threshold, but for now we will use it to get started.
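
The threshold rule can be sketched with the per-column missing fraction. The helper name `drop_mostly_missing` and the demo frame are hypothetical; the 0.90 cut-off is the one stated above.

```python
import numpy as np
import pandas as pd

def drop_mostly_missing(df, max_missing=0.90):
    """Drop columns whose fraction of missing values is >= max_missing."""
    missing_fraction = df.isnull().mean()   # share of NaN per column
    keep = missing_fraction[missing_fraction < max_missing].index
    return df[keep]

demo = pd.DataFrame({
    "a": [1.0] * 10,                   # 0% missing  -> kept
    "b": [np.nan] * 9 + [1.0],         # 90% missing -> dropped
    "c": [np.nan] * 8 + [1.0, 2.0],    # 80% missing -> kept
})

result = drop_mostly_missing(demo)
print(list(result.columns))
```

Keeping the threshold as a parameter makes it easy to revisit later, as the step description suggests.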

In [71]:
#Code added on 11-04-2021 - removing all features with >= 90% missing values
for col in Veh15_primary.columns.tolist():
    if Veh15_primary[col].isnull().sum()*100/len(Veh15_primary[col]) >= 90:
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [72]:
#Code added on 11-04-2021 - running Veh15_primary through sniff_modified
sniff_modified(Veh15_primary)

Unnamed: 0,data type,percent missing,No. unique,unique values
Model Year,int64,0.0,1,[2015]
EPA FE Label Dataset ID,int64,0.0,1251,"[16850, 16459, 16455, 16452, 16456, 16453, 164..."
Carline Class,int64,0.0,23,"[1, 2, 3, 4, 5, 6, 7, 8, 30, 10, 11, 12, 13, 1..."
Exhaust Valves Per Cyl,int64,0.0,2,"[2, 1]"
# Gears,int64,0.0,7,"[6, 7, 8, 5, 1, 9, 4]"
# Cyl,int64,0.0,8,"[4, 12, 8, 10, 6, 16, 3, 5]"
Intake Valves Per Cyl,int64,0.0,2,"[2, 1]"
Index (Model Type Index),int64,0.0,510,"[264, 8, 4, 1, 5, 2, 6, 3, 27, 29, 35, 33, 26,..."
Eng Displ,float64,0.0,48,"[1.8, 6.0, 4.7, 4.2, 5.2, 2.0, 4.0, 3.0, 8.0, ..."
4Dr Lugg Vol,float64,63.469225,25,"[nan, 14.0, 12.0, 10.0, 11.0, 13.0, 6.0, 16.0,..."


##### Step 3: Convert all string/object features to lower case

If any categorical feature contains both 'Yes' and 'yes', we want our model to treat these as the same. To do that, we need to convert all string/object type features to lower case. (*Hint*: If you use a function for this, make sure it returns a data frame.) 
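
A function-based version of this step might look like the sketch below. The name `lowercase_strings` and the demo values are hypothetical; the point is that the function works on a copy and returns the data frame.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def lowercase_strings(df):
    """Return a copy of df with every string/object column converted to lower case."""
    out = df.copy()
    for col in out.columns:
        if is_string_dtype(out[col]) or is_object_dtype(out[col]):
            out[col] = out[col].str.lower()   # missing values stay missing
    return out

demo = pd.DataFrame({
    "Var Valve Lift?": ["Y", "y", "N", np.nan],
    "# Cyl": [4, 6, 8, 4],
})

lowered = lowercase_strings(demo)
print(lowered["Var Valve Lift?"].dropna().unique())
```

Returning a new frame rather than modifying the argument leaves the original data available for the comparison in Step 5.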

In [73]:
#Code added on 11-04-2021 - converting all strings/object features into lower case
from pandas.api.types import is_string_dtype, is_object_dtype
for col in Veh15_primary.columns:
    if is_string_dtype(Veh15_primary[col]) or is_object_dtype(Veh15_primary[col]):
            Veh15_primary[col] = Veh15_primary[col].str.lower()

In [74]:
#Code added on 11-04-2021 - running Veh15_primary through sniff_modified
sniff_modified(Veh15_primary)

Unnamed: 0,data type,percent missing,No. unique,unique values
Model Year,int64,0.0,1,[2015]
EPA FE Label Dataset ID,int64,0.0,1251,"[16850, 16459, 16455, 16452, 16456, 16453, 164..."
Carline Class,int64,0.0,23,"[1, 2, 3, 4, 5, 6, 7, 8, 30, 10, 11, 12, 13, 1..."
Exhaust Valves Per Cyl,int64,0.0,2,"[2, 1]"
# Gears,int64,0.0,7,"[6, 7, 8, 5, 1, 9, 4]"
# Cyl,int64,0.0,8,"[4, 12, 8, 10, 6, 16, 3, 5]"
Intake Valves Per Cyl,int64,0.0,2,"[2, 1]"
Index (Model Type Index),int64,0.0,510,"[264, 8, 4, 1, 5, 2, 6, 3, 27, 29, 35, 33, 26,..."
Eng Displ,float64,0.0,48,"[1.8, 6.0, 4.7, 4.2, 5.2, 2.0, 4.0, 3.0, 8.0, ..."
4Dr Lugg Vol,float64,63.469225,25,"[nan, 14.0, 12.0, 10.0, 11.0, 13.0, 6.0, 16.0,..."


##### Step 4: Remove any feature with only 1 unique value

If any feature contains only a single value, then our model will not be able to use this feature to help it predict our target, since there will be no pattern to discover. These features can be removed from the data. 
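
A compact way to express this check is sketched below with `DataFrame.nunique`. The helper name `drop_constant_columns` and the demo frame are hypothetical. Note the assumption: missing values are counted as a distinct value here (`dropna=False`), which matches `len(df[col].unique())`; ignoring them instead is a different, equally defensible choice.

```python
import numpy as np
import pandas as pd

def drop_constant_columns(df):
    """Keep only the columns that hold more than one distinct value (NaN counts as a value)."""
    n_unique = df.nunique(dropna=False)
    return df.loc[:, n_unique > 1]

demo = pd.DataFrame({
    "Model Year": [2015, 2015, 2015],   # a single value -> dropped
    "# Cyl": [4, 6, 8],
    "Flag": ["y", np.nan, "y"],         # 'y' and NaN -> two values -> kept
})

result = drop_constant_columns(demo)
print(list(result.columns))
```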

In [75]:
#Code added on 11-04-2021 - removing all features with only 1 unique value
for col in Veh15_primary.columns.tolist():
    if len(Veh15_primary[col].dropna().unique()) == 1 :
        Veh15_primary = Veh15_primary.drop(col, axis = 1)

In [76]:
#Code added on 11-04-2021 - running Veh15_primary through sniff_modified
sniff_modified(Veh15_primary)

Unnamed: 0,data type,percent missing,No. unique,unique values
EPA FE Label Dataset ID,int64,0.0,1251,"[16850, 16459, 16455, 16452, 16456, 16453, 164..."
Carline Class,int64,0.0,23,"[1, 2, 3, 4, 5, 6, 7, 8, 30, 10, 11, 12, 13, 1..."
Exhaust Valves Per Cyl,int64,0.0,2,"[2, 1]"
Index (Model Type Index),int64,0.0,510,"[264, 8, 4, 1, 5, 2, 6, 3, 27, 29, 35, 33, 26,..."
Intake Valves Per Cyl,int64,0.0,2,"[2, 1]"
# Cyl,int64,0.0,8,"[4, 12, 8, 10, 6, 16, 3, 5]"
# Gears,int64,0.0,7,"[6, 7, 8, 5, 1, 9, 4]"
4Dr Lugg Vol,float64,63.469225,25,"[nan, 14.0, 12.0, 10.0, 11.0, 13.0, 6.0, 16.0,..."
4Dr Pass Vol,float64,63.469225,37,"[nan, 83.0, 86.0, 85.0, 81.0, 89.0, 93.0, 91.0..."
2Dr Lugg Vol,float64,84.572342,13,"[nan, 7.0, 4.0, 5.0, 10.0, 6.0, 12.0, 13.0, 9...."


##### Step 5: Compare data sets

Compare some basic characteristics between our original data set and the one after our initial clean up. Fill in the table below:

| Characteristic | Original data set | After initial clean-up |
| :- | :- | :- |  
|  # rows  |  1251   |  1251 | 
|  # columns  |  162   | 48  | 
|  # numeric features |  91   | 14  | 
|  # non-numeric features |  71   | 34  | 
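
The four characteristics in the table can be collected in one call, as in this sketch. The helper name `summarize` and the demo frame are hypothetical; boolean columns, if any, would be counted as non-numeric under this definition.

```python
import pandas as pd

def summarize(df):
    """Return the row, column, numeric and non-numeric feature counts of df."""
    n_numeric = df.select_dtypes(include="number").shape[1]
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "numeric": n_numeric,
        "non_numeric": df.shape[1] - n_numeric,
    }

demo = pd.DataFrame({
    "# Cyl": [4, 6, 8],
    "Eng Displ": [1.8, 3.0, 5.0],
    "Transmission": ["auto", "manual", "auto"],
})

summary = summarize(demo)
print(summary)
```

Calling the same function on the original and the cleaned data frame gives both columns of the table from identical logic.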

In [77]:
#Code added on 11-04-2021 - the shape of Veh15
Veh15.shape

(1251, 162)

In [78]:
#Code added on 11-04-2021 - number of numeric columns in Veh15
len(Veh15.select_dtypes(exclude=['object']).columns)

91

In [79]:
#Code added on 11-04-2021 - number of non-numeric columns in Veh15
len(Veh15.select_dtypes(include=['object']).columns)

71

In [80]:
#Code added on 11-04-2021 - the shape of Veh15_primary
Veh15_primary.shape

(1251, 48)

In [81]:
#Code added on 11-04-2021 - number of numeric columns in Veh15_primary
len(Veh15_primary.select_dtypes(exclude=['object']).columns)

14

In [82]:
#Code added on 11-04-2021 - number of non-numeric columns in Veh15_primary
len(Veh15_primary.select_dtypes(include=['object']).columns)

34

### Part 2 - Create and evaluate an initial model

In this part you should: 
 - use the cleaned-up version of the data from **Part 1**
 - isolate all numeric features from the data set 
 - fill in any missing values with 0
 - create and evaluate a baseline model 

In [83]:
#Code added on 11-04-2021 - creating a variable called Veh15_num
Veh15_num = Veh15_primary.select_dtypes(exclude=['object'])
Veh15_num

Unnamed: 0,Index (Model Type Index),Eng Displ,# Cyl,# Gears,Max Ethanol % - Gasoline,2Dr Pass Vol,2Dr Lugg Vol,4Dr Pass Vol,4Dr Lugg Vol,Intake Valves Per Cyl,Exhaust Valves Per Cyl,Carline Class,EPA FE Label Dataset ID,Comb Unadj FE - Conventional Fuel
0,264,1.8,4,6,10.0,,,,,2,2,1,16850,34.4702
1,8,6.0,12,7,10.0,,,,,2,2,1,16459,18.1901
2,4,4.7,8,7,10.0,,,,,2,2,1,16455,20.5806
3,1,4.7,8,6,10.0,,,,,2,2,1,16452,19.4895
4,5,4.7,8,7,10.0,,,,,2,2,1,16456,20.5806
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1246,106,3.0,6,8,15.0,,,,,2,2,33,16838,28.1631
1247,2,5.0,8,6,10.0,98.0,13.0,,,2,2,5,18768,20.4985
1248,1,5.0,8,6,10.0,98.0,13.0,,,2,2,5,18819,19.5452
1249,205,5.5,8,7,10.0,,,112.0,12.0,2,2,6,15722,22.9319


In [84]:
#Code added on 11-04-2021 - Creating feature vector
x = Veh15_num.drop('Comb Unadj FE - Conventional Fuel', axis = 1)
x = x.fillna(0)
x.head(10)

Unnamed: 0,Index (Model Type Index),Eng Displ,# Cyl,# Gears,Max Ethanol % - Gasoline,2Dr Pass Vol,2Dr Lugg Vol,4Dr Pass Vol,4Dr Lugg Vol,Intake Valves Per Cyl,Exhaust Valves Per Cyl,Carline Class,EPA FE Label Dataset ID
0,264,1.8,4,6,10.0,0.0,0.0,0.0,0.0,2,2,1,16850
1,8,6.0,12,7,10.0,0.0,0.0,0.0,0.0,2,2,1,16459
2,4,4.7,8,7,10.0,0.0,0.0,0.0,0.0,2,2,1,16455
3,1,4.7,8,6,10.0,0.0,0.0,0.0,0.0,2,2,1,16452
4,5,4.7,8,7,10.0,0.0,0.0,0.0,0.0,2,2,1,16456
5,2,4.7,8,6,10.0,0.0,0.0,0.0,0.0,2,2,1,16453
6,6,4.7,8,7,10.0,0.0,0.0,0.0,0.0,2,2,1,16457
7,3,4.7,8,6,10.0,0.0,0.0,0.0,0.0,2,2,1,16454
8,27,4.2,8,7,15.0,0.0,0.0,0.0,0.0,2,2,1,15493
9,29,4.2,8,6,15.0,0.0,0.0,0.0,0.0,2,2,1,15497


In [85]:
#Code added on 11-04-2021 - Creating Target vector 
y = Veh15_num['Comb Unadj FE - Conventional Fuel']
y = y.fillna(0)
y.head(10)

0    34.4702
1    18.1901
2    20.5806
3    19.4895
4    20.5806
5    19.4895
6    20.5806
7    19.4895
8    20.5904
9    17.0265
Name: Comb Unadj FE - Conventional Fuel, dtype: float64

In [86]:
#Code added on 11-04-2021 - Importing the random forest regressor (the only model used in this lab)
from sklearn.ensemble import RandomForestRegressor

In [87]:
#Code added on 11-04-2021 - Creating a random forest regressor
rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True,random_state=241)
#Code added on 11-04-2021 - Fitting our model on the full data set (the out-of-bag samples are used for evaluation)
rf.fit(x, y)
#Code added on 11-04-2021 - calculating the out-of-bag score
rf.oob_score_

0.8860847019206981

In [88]:
#Code added on 11-04-2021 - Creating an empty list to store the out-of-bag score of each run
oob_scores = []
#Code added on 05-04-2021 - Fitting the baseline model 10 times and recording the out-of-bag score of each run
#Note: random_state is fixed at 241, so every run builds the same forest and returns the same score
for i in range(10):
    rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True, random_state=241)
    rf.fit(x, y)
    oob_scores.append(rf.oob_score_)

In [89]:
#Code added on 11-04-2021 - Printing all the values for out-of-bag score
print("Out-of-bag scores: \n", oob_scores)

Out-of-bag scores: 
 [0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981, 0.8860847019206981]


In [90]:
#Code added on 11-04-2021 - Calculating the mean out-of-bag score for the baseline model (numeric features only)
print("Mean Oob score:", np.mean(oob_scores))

Mean Oob score: 0.8860847019206981


### Part 3 - Convert all features to numeric and handle missing values

In this part you should: 
 - only use ordinal encoding 
 - convert **all** non-numeric features to numeric 
 - handle any missing values in any feature
 - assume all missing data has already been normalized appropriately
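
The steps above can be sketched with pandas category codes, one possible form of ordinal encoding. The helper names `fill_numeric_median` and `ordinal_encode` and the demo values are hypothetical, and filling numeric gaps with the median is an assumption rather than a requirement of the lab.

```python
import numpy as np
import pandas as pd

def fill_numeric_median(df):
    """Return a copy with missing numeric values replaced by the column median."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def ordinal_encode(df):
    """Return a copy with every non-numeric column replaced by integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # codes run 0..k-1 and missing values are -1; adding 1 maps missing to 0
            out[col] = out[col].astype("category").cat.codes + 1
    return out

demo = pd.DataFrame({
    "Transmission": ["auto", "manual", np.nan, "auto"],
    "4Dr Pass Vol": [90.0, np.nan, 100.0, 95.0],
})

encoded = ordinal_encode(fill_numeric_median(demo))
print(encoded)
```

Numeric gaps are filled before encoding because, once every column is numeric, encoded categories would otherwise be treated like measurements when taking the median.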

In [91]:
#Code added on 11-04-2021 - Installing packages
!pip install category_encoders
import category_encoders as ce



In [92]:
#Code added on 11-04-2021 - creating a variable called Veh15_clean
Veh15_clean = Veh15_primary.copy()

In [93]:
#Code added on 11-04-2021 - Importing the dtype helpers
from pandas.api.types import is_string_dtype, is_object_dtype
#Code added on 11-04-2021 - Creating a function called df_string_to_cat
def df_string_to_cat(df):
    for col in df.columns:
        # object columns that contain NaN are not always reported as string dtype, so check both
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].astype('category').cat.as_ordered()
#Code added on 11-04-2021 - Creating a function called df_cat_to_catcode
def df_cat_to_catcode(df):
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            # codes start at 0 and missing values are -1, so adding 1 maps missing values to 0
            df[col] = df[col].cat.codes + 1

In [94]:
#Code added on 11-04-2021 - running Veh15_clean through the functions df_string_to_cat and df_cat_to_catcode
df_string_to_cat(Veh15_clean)
df_cat_to_catcode(Veh15_clean)

In [95]:
#Code added on 11-04-2021 - running Veh15_clean through the functions sniff_modified
sniff_modified(Veh15_clean)

Unnamed: 0,data type,percent missing,No. unique,unique values
Mfr Name,int8,0.0,26,"[3, 1, 25, 2, 7, 4, 5, 8, 10, 13, 14, 15, 18, ..."
Gas Guzzler Exempt Desc (Where Truck = 1975 NHTSA truck definition),int8,0.0,2,"[1, 2]"
Var Valve Lift Desc,int8,0.0,25,"[0, 5, 12, 17, 14, 20, 18, 23, 15, 16, 10, 11,..."
Camless Valvetrain (Y or N),int8,0.0,2,"[1, 2]"
Fuel Metering Sys Desc,int8,0.0,5,"[5, 3, 4, 1, 2]"
Descriptor - Model Type (40 Char or less),int8,0.0,20,"[8, 0, 17, 16, 7, 14, 4, 19, 10, 15, 18, 9, 3,..."
Carline Class Desc,int8,0.0,23,"[21, 5, 20, 1, 3, 2, 8, 4, 9, 6, 7, 16, 17, 22..."
Gas Guzzler Exempt (Where Truck = 1975 NHTSA truck definition),int8,0.0,2,"[1, 2]"
Calc Approach Desc,int8,0.0,3,"[3, 1, 2]"
Label Recalc?,int8,0.0,3,"[1, 0, 2]"


In [96]:
#Code added on 11-04-2021 - Creating a function called fix_missing_num
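def fix_missing_num(df, colname):
    # assign the result back; calling fillna(inplace=True) on a selected column may not update df in recent pandas
    df[colname] = df[colname].fillna(df[colname].median())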

In [98]:
#Code added on 11-04-2021 - running numeric variables with null values through the fix_missing_num function
fix_missing_num(Veh15_clean, '2Dr Pass Vol')
fix_missing_num(Veh15_clean, '4Dr Pass Vol')
fix_missing_num(Veh15_clean, '4Dr Lugg Vol')
fix_missing_num(Veh15_clean, '2Dr Lugg Vol')
fix_missing_num(Veh15_clean, 'Max Ethanol % - Gasoline')
fix_missing_num(Veh15_clean, 'Comb Unadj FE - Conventional Fuel')

In [99]:
#Code added on 11-04-2021 - running Veh15_clean through the functions sniff_modified
sniff_modified(Veh15_clean)

Unnamed: 0,data type,percent missing,No. unique,unique values
Mfr Name,int8,0.0,26,"[3, 1, 25, 2, 7, 4, 5, 8, 10, 13, 14, 15, 18, ..."
Gas Guzzler Exempt Desc (Where Truck = 1975 NHTSA truck definition),int8,0.0,2,"[1, 2]"
Var Valve Lift Desc,int8,0.0,25,"[0, 5, 12, 17, 14, 20, 18, 23, 15, 16, 10, 11,..."
Camless Valvetrain (Y or N),int8,0.0,2,"[1, 2]"
Fuel Metering Sys Desc,int8,0.0,5,"[5, 3, 4, 1, 2]"
Descriptor - Model Type (40 Char or less),int8,0.0,20,"[8, 0, 17, 16, 7, 14, 4, 19, 10, 15, 18, 9, 3,..."
Carline Class Desc,int8,0.0,23,"[21, 5, 20, 1, 3, 2, 8, 4, 9, 6, 7, 16, 17, 22..."
Gas Guzzler Exempt (Where Truck = 1975 NHTSA truck definition),int8,0.0,2,"[1, 2]"
Calc Approach Desc,int8,0.0,3,"[3, 1, 2]"
Label Recalc?,int8,0.0,3,"[1, 0, 2]"


### Part 4 - Create and evaluate a new baseline

In this part you should:
 - create and evaluate a model using all the features, that is, after converting everything to numeric and handling missing values

In [100]:
#Code added on 11-04-2021 - Creating feature vector
x1 = Veh15_clean.drop('Comb Unadj FE - Conventional Fuel',axis = 1)
x1.head(10)

Unnamed: 0,Mfr Name,Division,Carline,Verify Mfr Cd,Index (Model Type Index),Eng Displ,# Cyl,Transmission,Air Aspir Method,Air Aspiration Method Desc,...,Var Valve Timing Desc,Var Valve Lift?,Var Valve Lift Desc,Fuel Metering Sys Cd,Fuel Metering Sys Desc,Camless Valvetrain (Y or N),Oil Viscosity,Stop/Start System (Engine Management System) Code,Stop/Start System (Engine Management System) Description,Model Type Desc (MFR entered)
0,3,2,42,7,264,1.8,4,10,2,3,...,7,1,0,3,5,1,21,1,1,0
1,1,3,695,1,8,6.0,12,11,0,1,...,26,1,0,5,3,1,10,1,1,123
2,1,3,701,1,4,4.7,8,11,0,1,...,21,1,0,5,3,1,13,1,1,0
3,1,3,701,1,1,4.7,8,24,0,1,...,21,1,0,5,3,1,13,1,1,0
4,1,3,702,1,5,4.7,8,11,0,1,...,21,1,0,5,3,1,13,1,1,0
5,1,3,702,1,2,4.7,8,24,0,1,...,21,1,0,5,3,1,13,1,1,0
6,1,3,704,1,6,4.7,8,11,0,1,...,21,1,0,5,3,1,13,1,1,0
7,1,3,704,1,3,4.7,8,24,0,1,...,21,1,0,5,3,1,13,1,1,0
8,25,4,553,25,27,4.2,8,8,0,1,...,10,1,0,3,5,1,32,1,1,0
9,25,4,553,25,29,4.2,8,24,0,1,...,10,1,0,3,5,1,32,1,1,0


In [101]:
#Code added on 11-04-2021 - Creating target vector
y1 = Veh15_clean['Comb Unadj FE - Conventional Fuel']
y1.head(10)

0    34.4702
1    18.1901
2    20.5806
3    19.4895
4    20.5806
5    19.4895
6    20.5806
7    19.4895
8    20.5904
9    17.0265
Name: Comb Unadj FE - Conventional Fuel, dtype: float64

In [102]:
#Code added on 11-04-2021 - Creating a random forest regressor
rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True,random_state=241)
#Code added on 11-04-2021 - Fitting our model on the full data set (the out-of-bag samples are used for evaluation)
rf.fit(x1, y1)
#Code added on 11-04-2021 - calculating the out-of-bag score
rf.oob_score_

0.9458924384814638

In [103]:
#Code added on 11-04-2021 - Creating an empty list to store the out-of-bag score of each run
oob_scores1 = []
#Code added on 05-04-2021 - Fitting the model on all features 10 times and recording the out-of-bag score of each run
#Note: random_state is fixed at 241, so every run builds the same forest and returns the same score
for i in range(10):
    rf = RandomForestRegressor(n_estimators=150, n_jobs=-1, oob_score=True, random_state=241)
    rf.fit(x1, y1)
    oob_scores1.append(rf.oob_score_)

In [104]:
#Code added on 11-04-2021 - Printing all the values for out-of-bag score
print("Out-of-bag scores: \n", oob_scores1)

Out-of-bag scores: 
 [0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638, 0.9458924384814638]


In [105]:
#Code added on 05-04-2021 - Calculating the mean out-of-bag score for the model using all features
print("Mean Oob score:", np.mean(oob_scores1))

Mean Oob score: 0.9458924384814636


In [106]:
#Code added on 05-04-2021 - difference in mean oob score between Part 4 and Part 2
difference = np.mean(oob_scores1) - np.mean(oob_scores)
difference

0.059807736560765545

In [107]:
#Code added on 05-04-2021 - finding increase in the explained variance
print(f"{round(100 * ((1 - np.mean(oob_scores)) - (1 - np.mean(oob_scores1))) / np.mean(oob_scores))}%")

7%


**Question** Did the performance of the model improve compared to the results of **Part 2**?

**Enter your answer here:**
Yes. The mean oob score rose from 0.8861 in Part 2 to 0.9459 in Part 4, a difference of 0.0598 (about 5.98 percentage points). Relative to the Part 2 score, this is an increase in explained variance of about 7%.
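
The arithmetic behind this comparison can be sketched as follows, using the mean oob scores printed in Part 2 and Part 4. The variable names are hypothetical, and "relative increase" is taken here to mean the change divided by the Part 2 score.

```python
# Mean out-of-bag scores reported in Part 2 and Part 4
baseline_oob = 0.8860847019206981
all_features_oob = 0.9458924384814636

# Absolute change in R^2, and the same change expressed in percentage points
absolute_diff = all_features_oob - baseline_oob
percentage_points = 100 * absolute_diff

# Change relative to the Part 2 score, in percent
relative_increase = 100 * absolute_diff / baseline_oob

print(f"absolute difference: {absolute_diff:.4f}")
print(f"percentage points:   {percentage_points:.2f}")
print(f"relative increase:   {relative_increase:.2f}%")
```

The two percentages answer different questions: the first is how far the score moved on the 0 to 1 scale, the second is how large that move is compared with where the score started.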

### Part 5 - How high can you go?

How high can you get the oob score above the new baseline of **Part 4**? See how much improvement you can squeeze out of the data. 

**For this part, do NOT do any hyper-parameter tuning.**

Some things to try to get started:
 - are some features not important so can be dropped without impacting performance?
 - do other encodings work better than ordinal for some features?  
 - any feature engineering that will help? 
 - any external data that could be included? 
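
For the first suggestion, one possible starting point is to rank features by the forest's built-in importances and then compare oob scores with and without the low-ranked ones. The sketch below uses synthetic data; the column names and the helper `oob_for` are hypothetical, and impurity-based importances are only one of several ways to judge a feature.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: one informative feature and two pure-noise features
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 5 * X["signal"] + rng.normal(scale=0.1, size=300)

def oob_for(features, seed=0):
    """Fit a 150-tree forest on the given features; return its OOB score and the model."""
    rf = RandomForestRegressor(
        n_estimators=150, oob_score=True, n_jobs=-1, random_state=seed
    )
    rf.fit(X[features], y)
    return rf.oob_score_, rf

full_score, rf_full = oob_for(list(X.columns))
importances = pd.Series(
    rf_full.feature_importances_, index=X.columns
).sort_values(ascending=False)

# Refit without the two lowest-ranked features and compare
reduced_score, _ = oob_for(list(importances.index[:1]))
print(importances)
print(full_score, reduced_score)
```

If the reduced score is about the same as the full score, the dropped features were probably not contributing. No hyper-parameters are changed between the two fits, so the comparison stays within the rules of this part.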

### Part 6 - Create and evaluate a final model

In this part you should:
 - create and evaluate a model using only the features that give the best results after the exploration done in Part 5 

### Part 7 - How did you do?

**Question** What is the percent difference between the oob score of your best model and the baseline calculated in **Part 4**?

**Enter your answer here:**

**Question** What changes did you make to the data set of **Part 4** to get to the final data set used in **Part 6** and how much did each change increase the oob score that you calculated in **Part 4**? 

**Enter your answer here:**

| Change made | Change in oob score | 
| :- | :-: |  
|  example 1: create new feature 'xyz'  | +0.011    | 
|  example 2: one-hot encoded feature 'xyz'  | +0.003    | 