## DAB200 -- Graded Lab 3

In this lab, you will gain some experience in dealing with missing data and further practice converting non-numeric features in a dataset to numeric.

**Target**: to predict `Comb Unadj FE - Conventional Fuel`

**Data set**: make sure you use the data assigned to your group!

| Groups | Data set |
| :-: | :-: |
| 1-3 | veh14_missing.csv |
| 4-5 | veh15_missing.csv |
| 6-8 | veh16_missing.csv |
| 9-11 | veh17_missing.csv |
| 12-14 | veh18_missing.csv |
| 15-17 | veh19_missing.csv |

**Important Notes:**
- Use [Chapter 7](https://mlbook.explained.ai/bulldozer-intro.html) of the textbook as a **guide**:
     - you only need to use **random forest** models;
- Code submitted for this lab should be:
     - error free
         - to make sure this is the case, before submitting, close all Jupyter notebooks, exit Anaconda, reload the lab notebook and execute all cells
     - final code
         - this means that I don't want to see every piece of code you try as you work through this lab but only the final code; only the code that fulfills the objective
- Use the **out-of-bag score** to evaluate models
     - Read Section 5.2 carefully so that you use this method properly
     - The oob score that you provide should be the average of 10 runs
- Don't make assumptions!
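
As a rough illustration of the out-of-bag evaluation mentioned in the notes above, the sketch below averages `oob_score_` over 10 random forests fitted on all rows. The synthetic data from `make_regression`, the helper name `mean_oob_score`, and the choice to vary `random_state` per run are assumptions made for illustration; they are not taken from the lab or the textbook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def mean_oob_score(X, y, n_runs=10, n_estimators=150):
    """Average the out-of-bag R^2 over several forests (hypothetical helper)."""
    scores = []
    for seed in range(n_runs):
        # A different seed per run gives each forest different bootstrap samples.
        rf = RandomForestRegressor(
            n_estimators=n_estimators, oob_score=True, random_state=seed
        )
        rf.fit(X, y)  # no train/test split: the OOB rows act as the validation set
        scores.append(rf.oob_score_)
    return float(np.mean(scores)), scores


# Synthetic stand-in for the vehicle data (assumption, not the lab data set).
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
avg_oob, oob_scores = mean_oob_score(X_demo, y_demo, n_runs=10, n_estimators=150)
print(f"Average OOB score: {avg_oob:.4f}")
```

Because each tree is scored only on the rows left out of its bootstrap sample, no separate hold-out set is needed in this sketch.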


### Part 0

Please provide the following information:
 - Group Number: 11
 - Group Members:
     - Ruturajsinh Solanki - 0827884
     - Crish Chhotai - 0826416
     - Isha Dhaduk - 0827577

### Part 1 - Create and evaluate an initial model

In this part you should: 
 - use Section 7.3 of the textbook as a guide, except:
     - use all of the data; and
     - use 150 decision trees in your random forest models
 - read in the data
 - isolate all numeric features from original data set
 - fill in any missing values with 0
 - create and evaluate a baseline model 
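
The data-preparation steps in the list above can be sketched roughly as follows. The tiny in-memory frame, its column names, and the helper `numeric_baseline_frame` are made up for illustration; the lab itself reads the assigned CSV file instead.

```python
import numpy as np
import pandas as pd


def numeric_baseline_frame(df, target):
    """Keep numeric columns, fill missing values with 0, split off the target."""
    numeric = df.select_dtypes(include=np.number).fillna(0)
    X = numeric.drop(columns=[target])
    y = numeric[target]
    return X, y


# Hypothetical miniature data set (not the lab data).
demo = pd.DataFrame({
    "displ": [1.5, np.nan, 3.0],
    "cyl": [4, 6, 8],
    "aspiration": ["turbocharged", None, "supercharged"],
    "fuel_econ": [48.8, 34.2, 24.2],
})
X_base, y_base = numeric_baseline_frame(demo, target="fuel_econ")
print(X_base)
```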

#### Code (10 marks)

In [212]:
import pandas as pd      
import numpy as np
from pandas.api.types import is_string_dtype, is_object_dtype, is_categorical_dtype
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

In [213]:
# Read the CSV file
df = pd.read_csv('veh17_missing.csv')

# Replace the empty spaces with _ in variable names
df.columns = df.columns.str.replace(' ', '_')

# Select only the numeric columns from the DataFrame
df_numeric = df.select_dtypes(include=np.number)

# Fill the missing values in the numeric columns with 0
df_numeric = df_numeric.fillna(0)

# Create the feature matrix X_num by dropping the target column 'Comb Unadj FE - Conventional Fuel'
X_num = df_numeric.drop('Comb_Unadj_FE_-_Conventional_Fuel', axis=1)

# Create the target variable y_num with the column 'Comb Unadj FE - Conventional Fuel'
y_num = df_numeric['Comb_Unadj_FE_-_Conventional_Fuel']


In [214]:
oob_init = []

for i in range(10):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_num, y_num, test_size=0.2)

    # Create a random forest regressor with 150 trees and OOB scoring enabled
    Rf_num = RandomForestRegressor(n_estimators=150, random_state=69, oob_score = True)

    # Train the model on the training data
    Rf_num.fit(X_train.values, y_train.values)

    # Evaluating the model's oob_score
    oob_init.append(Rf_num.oob_score_)

# Print the array of oob scores
print("OOB Scores:",oob_init)

# find and print the mean of all 10 oob scores.
avg_oob_init = np.mean(oob_init)
print(f"Average baseline oob score: {avg_oob_init:.4f}")

OOB Scores: [0.7167969968568588, 0.7356051600409088, 0.7036702716418345, 0.7293657359357641, 0.7050220747343944, 0.7106354016020844, 0.7317259193699852, 0.7321208416518361, 0.7198126778512526, 0.6985374055757213]
Average baseline oob score: 0.7183


### Part 2 - Normalize missing values

In this part you should: 
 - use Section 7.4 of the textbook as a guide
 - convert **all** representations of missing data to a **single** representation
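
One possible way to collapse several missing-value spellings into `NaN` is sketched below. The sentinel strings and the helper name `normalize_missing` are assumptions for illustration; the actual markers have to be found by inspecting the assigned data set.

```python
import numpy as np
import pandas as pd

# Hypothetical sentinel strings; inspect the real data to find the actual ones.
SENTINELS = {"not specified", "*****", "n/a", ""}


def normalize_missing(df, sentinels=SENTINELS):
    """Return a copy in which sentinel strings in text columns become NaN."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # numeric columns already use NaN for missing values
        # Lower-case and strip only real strings; leave NaN/None untouched.
        cleaned = out[col].map(lambda v: v.strip().lower() if isinstance(v, str) else v)
        out[col] = cleaned.where(~cleaned.isin(sentinels), np.nan)
    return out


demo = pd.DataFrame({
    "valve_lift": ["Y", "   ", "N"],
    "fuel_sys": ["Not Specified", "direct injection", None],
})
clean = normalize_missing(demo)
print(clean.isna().sum())
```

Stripping whitespace before the comparison means that a cell holding any number of spaces is treated the same as an empty string.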
 
#### Code (15 marks)

In [215]:
# Print all columns in our dataframe with their unique values.
for column in df.columns:
    unique_values = df[column].unique()
    print(f"Column: '{column}' \nunique values: {unique_values}\n")

Column: 'Eng_Displ' 
unique values: [1.5 2.  nan 1.4 5.3 3.  3.5 2.7 6.6 4.  4.7 6.3 2.4 5.7 1.8 4.4 1.6 3.6
 6.2 2.5 3.3 6.  5.2 5.6 2.3 5.  3.7 3.8 6.7 5.5 1.2 4.3 6.5 6.4 2.9 3.2
 4.8 0.9 3.9 2.8 1.  4.6 8.4 6.8]

Column: '#_Cyl' 
unique values: [ 4  8  6  3 12 10  5]

Column: 'Comb_Unadj_FE_-_Conventional_Fuel' 
unique values: [48.803  34.1949 36.5536 31.9863 33.9441 29.1624 46.4327 33.3995 24.2077
 24.1704 30.4629 30.684  36.2056 21.9503 39.5463 43.2552 22.7455 38.322
 30.8026 49.3802 41.1333 26.7724 37.5834 30.0963 20.8555 29.3468 32.6673
 17.8786 40.8486 34.8787 26.1317 26.5472 35.5618 39.6651 22.5368 33.2211
 25.7972 16.7215 34.0328 46.1513 18.6267 40.9592 20.0593 36.1902 29.2469
 24.098  34.4702 22.1406 36.703  20.6769 31.6705 64.2204 29.1809 24.911
 22.9079 33.6611 27.8536 40.064  20.657  38.8998 29.8923 25.7879 28.9412
 20.296  22.5528 20.4316 15.6543 26.9715 21.3996 30.3467 23.9936 21.3281
 28.9779 26.988  35.3177 30.1291 28.4345 28.9743 43.0468 27.1795 40.4381
 29.7405 35.

In [216]:
# Function to normalize every categorical column: lower-case the strings and
# convert every representation of missing data to NaN
def df_normalize_strings(df):
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()
            df[col] = df[col].replace(['not specified', '   ', '*****'], np.nan)

# Applying the df_normalize_strings() function to our dataset.
df_normalize_strings(df)

#### Question (5 marks)

Note here all the different ways missing data was represented in the data.   

**Enter your answer here:**

Missing data was represented in the following ways:
- In the column 'Var_Valve_Lift?', missing data was represented as empty spaces `'   '`.
- In the column 'Fuel_Metering_Sys_Desc', missing data was represented as 'not specified'.
- In the column 'Air_Aspiration_Method_Desc', missing data was represented as '*****'.

### Part 3 - Categorical features

In this part you should: 
 - use Section 7.5.1 as a guide
 - only use ordinal encoding 
 - convert **all** non-numeric features to numeric 
 - handle any missing values
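
A small sketch of ordinal encoding with pandas categoricals follows. It relies on the fact that `cat.codes` is -1 for missing values, so adding 1 maps missing entries to 0 and real categories to 1, 2, and so on. The demo frame and the helper `ordinal_encode` are hypothetical.

```python
import numpy as np
import pandas as pd


def ordinal_encode(df):
    """Return a copy with every non-numeric column replaced by integer codes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue
        cat = out[col].astype("category").cat.as_ordered()
        # Missing values have code -1, so +1 reserves 0 for "missing".
        out[col] = cat.cat.codes + 1
    return out


demo = pd.DataFrame({
    "aspiration": ["turbocharged", np.nan, "naturally aspirated", "turbocharged"],
    "cyl": [4, 6, 8, 4],
})
encoded = ordinal_encode(demo)
print(encoded)
```

Categories are ordered alphabetically here, so the integer order carries no physical meaning; a random forest can still split on such codes.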
 
#### Code (25 marks)

In [217]:
# Print every categorical column with its unique values.
for column in df.select_dtypes(include='object').columns.tolist():
    unique_values = df[column].unique()
    print(f"Column: '{column}' \nunique values: {unique_values}\n")

Column: 'Var_Valve_Lift?' 
unique values: ['n' 'y' nan]

Column: 'Fuel_Metering_Sys_Desc' 
unique values: [nan 'spark ignition direct injection'
 'multipoint/sequential fuel injection'
 'spark ignition direct & ported injection'
 'common rail direct diesel injection'
 'direct diesel injection (non-common rail)']

Column: 'Stop/Start_System_(Engine_Management_System)__Description' 
unique values: ['no' 'yes']

Column: 'Air_Aspiration_Method_Desc' 
unique values: [nan 'turbocharged' 'naturally aspirated' 'supercharged'
 'turbocharged+supercharged']

Column: 'Label_Recalc?' 
unique values: ['n' nan 'y']



In [218]:
# Function to convert every object (string) column to an ordered categorical column
def df_string_to_cat(df):
    for col in df.columns:
        # is_object_dtype also catches string columns that contain NaN
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].astype('category').cat.as_ordered()

# Function to replace the category values with integer codes
# (missing values have code -1, so adding 1 maps them to 0)
def df_cat_to_catcode(df):
    for col in df.columns:
        if is_categorical_dtype(df[col]):
            df[col] = df[col].cat.codes + 1

In [219]:
# Applying the df_string_to_cat() function to our dataset.
df_string_to_cat(df)

# Applying the df_cat_to_catcode() function to our dataset.
df_cat_to_catcode(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1244 entries, 0 to 1243
Data columns (total 12 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   Eng_Displ                                                  874 non-null    float64
 1   #_Cyl                                                      1244 non-null   int64  
 2   Comb_Unadj_FE_-_Conventional_Fuel                          1244 non-null   float64
 3   #_Gears                                                    1244 non-null   int64  
 4   Max_Ethanol_%_-_Gasoline                                   1223 non-null   float64
 5   Intake_Valves_Per_Cyl                                      1244 non-null   int64  
 6   Exhaust_Valves_Per_Cyl                                     1244 non-null   int64  
 7   Var_Valve_Lift?                                            1244 non-null   int8   
 8   Fuel_Met

### Part 4 - Numeric features

In this part you should: 
 - use Section 7.5.2 as a guide
 - handle any missing values
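
One common way to handle missing numeric values is to record where they were and then fill them with the column median. The sketch below shows this under stated assumptions: the demo frame and the helper name `fill_with_median` are hypothetical and not taken from the lab data.

```python
import numpy as np
import pandas as pd


def fill_with_median(df, colname):
    """Add a `<colname>_na` flag, then fill missing values with the column median."""
    out = df.copy()
    out[colname + "_na"] = out[colname].isna()
    out[colname] = out[colname].fillna(out[colname].median())
    return out


demo = pd.DataFrame({"displ": [1.5, np.nan, 3.0, 2.0]})
filled = fill_with_median(demo, "displ")
print(filled)
```

The indicator column lets the model learn whether "was missing" is itself predictive, which the filled value alone would hide.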
 
#### Code (30 marks)

In [220]:
# Print every numeric column with its unique values.
for column in df.select_dtypes(include='number').columns.tolist():
    unique_values = df[column].unique()
    print(f"Column: '{column}' \nunique values: {unique_values}\n")

Column: 'Eng_Displ' 
unique values: [1.5 2.  nan 1.4 5.3 3.  3.5 2.7 6.6 4.  4.7 6.3 2.4 5.7 1.8 4.4 1.6 3.6
 6.2 2.5 3.3 6.  5.2 5.6 2.3 5.  3.7 3.8 6.7 5.5 1.2 4.3 6.5 6.4 2.9 3.2
 4.8 0.9 3.9 2.8 1.  4.6 8.4 6.8]

Column: '#_Cyl' 
unique values: [ 4  8  6  3 12 10  5]

Column: 'Comb_Unadj_FE_-_Conventional_Fuel' 
unique values: [48.803  34.1949 36.5536 31.9863 33.9441 29.1624 46.4327 33.3995 24.2077
 24.1704 30.4629 30.684  36.2056 21.9503 39.5463 43.2552 22.7455 38.322
 30.8026 49.3802 41.1333 26.7724 37.5834 30.0963 20.8555 29.3468 32.6673
 17.8786 40.8486 34.8787 26.1317 26.5472 35.5618 39.6651 22.5368 33.2211
 25.7972 16.7215 34.0328 46.1513 18.6267 40.9592 20.0593 36.1902 29.2469
 24.098  34.4702 22.1406 36.703  20.6769 31.6705 64.2204 29.1809 24.911
 22.9079 33.6611 27.8536 40.064  20.657  38.8998 29.8923 25.7879 28.9412
 20.296  22.5528 20.4316 15.6543 26.9715 21.3996 30.3467 23.9936 21.3281
 28.9779 26.988  35.3177 30.1291 28.4345 28.9743 43.0468 27.1795 40.4381
 29.7405 35.

In [221]:
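# Function to add a '<colname>_na' indicator column and then fill the
# missing values of the numeric column with its median.
def fix_missing_num(df, colname):
    df[colname+'_na'] = pd.isnull(df[colname])
    # Assign the result back instead of using inplace=True on a column selection
    df[colname] = df[colname].fillna(df[colname].median())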

In [222]:
# Applying the fix_missing_num() function to the Eng_Displ and Max_Ethanol_%_-_Gasoline columns.
fix_missing_num(df, 'Eng_Displ')
fix_missing_num(df, 'Max_Ethanol_%_-_Gasoline')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1244 entries, 0 to 1243
Data columns (total 14 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   Eng_Displ                                                  1244 non-null   float64
 1   #_Cyl                                                      1244 non-null   int64  
 2   Comb_Unadj_FE_-_Conventional_Fuel                          1244 non-null   float64
 3   #_Gears                                                    1244 non-null   int64  
 4   Max_Ethanol_%_-_Gasoline                                   1244 non-null   float64
 5   Intake_Valves_Per_Cyl                                      1244 non-null   int64  
 6   Exhaust_Valves_Per_Cyl                                     1244 non-null   int64  
 7   Var_Valve_Lift?                                            1244 non-null   int8   
 8   Fuel_Met

### Part 5 - Create and evaluate a final model

In this part you should:
 - create and evaluate a model using all the features after processing them in Parts 2, 3, and 4 above 

#### Code (10 marks)

In [223]:
# Create the feature matrix X by dropping the target column 'Comb Unadj FE - Conventional Fuel'
X = df.drop('Comb_Unadj_FE_-_Conventional_Fuel', axis=1)

# Create the target variable y with the column 'Comb Unadj FE - Conventional Fuel'
y = df['Comb_Unadj_FE_-_Conventional_Fuel']

In [224]:
oob_final = []

for i in range(10):
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Create a random forest regressor with 150 trees and OOB scoring enabled
    Rf_final = RandomForestRegressor(n_estimators=150, random_state=69, oob_score = True)

    # Train the model on the training data
    Rf_final.fit(X_train.values, y_train.values)

    # Evaluating the model's oob_score
    oob_final.append(Rf_final.oob_score_)
       
# Print the array of oob scores
print("OOB Scores:",oob_final)

# Find and print the mean of all 10 oob scores.
avg_oob_final = np.mean(oob_final)
print(f"Average Final oob score: {avg_oob_final:.4f}")

OOB Scores: [0.8304498133663564, 0.8299984487073835, 0.8387313326635327, 0.8199221873998542, 0.8362868435328735, 0.8415738264822089, 0.8204422752795706, 0.8085113569810818, 0.8285493189003319, 0.8157733648755099]
Average Final oob score: 0.8270


#### Questions (5 marks)

Provide answers to the following:
 1. calculate the percent difference between the results of Part 1 and Part 5 (make sure you are using the correct formula for percent difference)
 2. based on the percent difference, state whether or not the results of Part 5 are an improvement over the results of Part 1
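
For reference, a minimal sketch of the percent difference calculation is given below, assuming the symmetric definition (difference divided by the mean of the two values). Percent change relative to the first value is shown only for contrast. Both function names and the input scores are made up for illustration.

```python
def percent_difference(a, b):
    """Symmetric percent difference: (b - a) relative to the mean of a and b."""
    return (b - a) / ((a + b) / 2) * 100


def percent_change(old, new):
    """Percent change relative to the starting value, shown for contrast."""
    return (new - old) / old * 100


# Made-up scores for illustration only.
print(f"Percent difference: {percent_difference(0.70, 0.80):.2f}%")
print(f"Percent change:     {percent_change(0.70, 0.80):.2f}%")
```

A positive percent difference means the second score is higher than the first.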

In [225]:
# Results of Part 1 (average baseline OOB score computed above)
oob_part1 = avg_oob_init

# Results of Part 5 (average final OOB score computed above)
oob_part5 = avg_oob_final

# Calculate percent difference for the out-of-bag score
oob_percent_diff = ((oob_part5 - oob_part1) / ((oob_part1 + oob_part5) / 2)) * 100

# Print the percent difference
print("Percent Difference in Out of bag Score: {:.2f}%".format(oob_percent_diff))

Percent Difference in Out of bag Score: 14.07%


**Enter your answers here:**
##### Answer 1:
- Percent Difference = ((oob_part5 - oob_part1 ) / ((oob_part1 + oob_part5) / 2)) * 100
- With the averages reported above (0.7183 for Part 1 and 0.8270 for Part 5), the percent difference in out-of-bag scores is 14.07%.

##### Answer 2:
- Based on the percent difference, the results of Part 5 are an <b>improvement</b> over the results of Part 1.
- The percent difference is positive (14.07%), which suggests that the final model has better overall predictive capability than the baseline.