The average starting salary of MBA graduates in the US is provided in the file. Use the variables below to predict the average starting salary.

1) Train a decision tree by dropping the rows with missing values<br>
2) Impute the missing values in each column using KNN imputer, and then train a model<br>
3) Compare the scores of the two models above. Use the 5-fold CV score in both cases, with the same model hyperparameters (depth, etc.)

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [2]:
# Load data
salary = pd.read_excel('C:/Users/Karthik.Iyer/Downloads/AccelerateAI/Tree-Based-Models-main/08_MBA_Starting_Salary.xlsx', sheet_name='MBA Data')
salary.head()

Unnamed: 0,Fulltime Business Week Ranking,School Name,State,Type,Enrollment,Avg GMAT,"Resident Tuition, Fees",Pct International,Pct Female,Pct Asian American,Pct Minority,Pct with job offers,Avg starting base salary
0,1,University of Chicago,Illinois,Private,1144,713.0,97165.0,35.0,35.0,16.0,7.0,92.0,107091.0
1,2,Harvard University,Massachusetts,Private,1801,720.0,101660.0,33.0,38.0,,,94.0,124378.0
2,3,Northwestern University,Illinois,Private,1200,711.0,93918.0,34.0,36.0,25.0,13.0,95.0,108064.0
3,4,University of Pennsylvania,Pennsylvania,Private,1651,714.0,104410.0,44.0,36.0,7.8,9.0,89.0,112186.0
4,5,University of Michigan,Michigan,Public,898,706.0,80879.0,27.0,34.0,21.0,13.0,89.0,103608.0


In [3]:
# Check shape
salary.shape

(70, 13)

In [4]:
# Check info
salary.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Fulltime Business Week Ranking  70 non-null     object 
 1   School Name                     70 non-null     object 
 2   State                           70 non-null     object 
 3   Type                            70 non-null     object 
 4   Enrollment                      70 non-null     int64  
 5   Avg GMAT                        67 non-null     float64
 6   Resident Tuition, Fees          68 non-null     float64
 7   Pct International               68 non-null     float64
 8   Pct Female                      68 non-null     float64
 9   Pct Asian American              63 non-null     float64
 10  Pct Minority                    66 non-null     float64
 11  Pct with job offers             67 non-null     float64
 12  Avg starting base salary        67 non-null     float64

In [5]:
# Check missing values
salary.isna().sum()

Fulltime Business Week Ranking    0
School Name                       0
State                             0
Type                              0
Enrollment                        0
Avg GMAT                          3
Resident Tuition, Fees            2
Pct International                 2
Pct Female                        2
Pct Asian American                7
Pct Minority                      4
Pct with job offers               3
Avg starting base salary          3
dtype: int64

In [6]:
# Let's drop Fulltime Business Week Ranking and School Name
salary.drop(['Fulltime Business Week Ranking', 'School Name'], axis=1, inplace=True)

In [7]:
# Let's drop the rows with missing values
salary_no_mv = salary.dropna()

In [8]:
# Check missing values again
salary_no_mv.isna().sum()

State                       0
Type                        0
Enrollment                  0
Avg GMAT                    0
Resident Tuition, Fees      0
Pct International           0
Pct Female                  0
Pct Asian American          0
Pct Minority                0
Pct with job offers         0
Avg starting base salary    0
dtype: int64

In [9]:
# Check shape
salary_no_mv.shape

(60, 11)

In [10]:
# Convert State and Type into dummies
salary_no_mv_onehot = pd.get_dummies(salary_no_mv, columns=['State','Type'])
salary_no_mv_onehot.sample(5)

Unnamed: 0,Enrollment,Avg GMAT,"Resident Tuition, Fees",Pct International,Pct Female,Pct Asian American,Pct Minority,Pct with job offers,Avg starting base salary,State_Arizona,State_California,State_Connecticut,State_Florida,State_Georgia,State_Illinois,State_Indiana,State_Iowa,State_Louisiana,State_Maryland,State_Massachusetts,State_Michigan,State_Minnesota,State_Missouri,State_New Hampshire,State_New Jersey,State_New York,State_North Carolina,State_Ohio,State_Pennsylvania,State_Tennessee,State_Texas,State_Virginia,State_Washington,State_Washington D.C.,State_Wisconsin,Type_Private,Type_Public
69,149,629.0,70600.0,15.0,33.0,8.0,8.0,82.0,79392.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
40,204,675.0,55629.0,40.0,37.0,18.0,1.0,62.0,81500.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
55,160,643.0,45061.0,27.0,38.0,19.0,9.0,62.0,83092.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0
28,156,681.0,17816.0,19.0,27.0,7.0,5.0,98.0,93403.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
31,387,630.0,72184.0,41.0,37.0,14.0,11.0,62.0,92296.0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [11]:
salary_no_mv_onehot.shape

(60, 37)

In [12]:
# Let's fit the decision tree on the data with missing rows dropped
X = salary_no_mv_onehot.drop('Avg starting base salary', axis=1)
y = salary_no_mv_onehot['Avg starting base salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

params = {'max_depth' : [2,3,5,7,9,11],
          'min_samples_split' : [5,10,15,20,25,30,35,40],
          'min_samples_leaf' : [5,10,15,20,25,30,35,40]}

clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

clf_gs.fit(X_train, y_train)

clf_gs.best_params_, clf_gs.best_score_

({'max_depth': 2, 'min_samples_leaf': 10, 'min_samples_split': 25},
 0.13635884252885297)

In [13]:
clf_gs.score(X_test, y_test)

0.4014671557353948

**With the rows containing missing values dropped, the best model reaches an R² of only about 0.40 on the 12-row test set (mean 5-fold CV R² of about 0.14)**
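
The task statement asks for a 5-fold CV score with fixed hyperparameters, which `cross_val_score` can report directly without a grid search. The following is a minimal sketch on synthetic stand-in data: `X_demo`, `y_demo`, and the hyperparameter values are illustrative assumptions, and the real `X` and `y` from the cell above would take their place.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the (X, y) built above; only the shape mimics the notebook.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 8))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=60)

# Fixed, illustrative hyperparameters (an assumption, not the tuned values).
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=42)

# Shuffled 5-fold split with a fixed seed so the same folds can be reused later.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(tree, X_demo, y_demo, cv=cv, scoring="r2")

print(cv_scores.mean(), cv_scores.std())
```

Reusing the same `tree` settings and the same `cv` object for the imputed data would keep the two scores directly comparable.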

In [14]:
# Let's check the missing values in the original data
salary.isna().sum()

State                       0
Type                        0
Enrollment                  0
Avg GMAT                    3
Resident Tuition, Fees      2
Pct International           2
Pct Female                  2
Pct Asian American          7
Pct Minority                4
Pct with job offers         3
Avg starting base salary    3
dtype: int64

In [15]:
# Let's impute missing values using KNN imputer
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2)

# Convert State and Type into dummies
salary_onehot = pd.get_dummies(salary, columns=['State','Type'])

var_list = ['Avg GMAT','Resident Tuition, Fees','Pct International','Pct Female',
            'Pct Asian American','Pct Minority','Pct with job offers','Avg starting base salary']

salary_onehot_imputed = imputer.fit_transform(salary_onehot[var_list])
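
To make the imputation step concrete: with its default settings, `KNNImputer` fills a missing entry with the mean of that column over the `n_neighbors` nearest rows, where distance is measured only on the columns both rows have observed. A toy sketch (the array `toy` is made up for illustration and is unrelated to the MBA data):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Row 2 is missing its second value.
toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, np.nan],
    [10.0, 20.0],
])

toy_imputer = KNNImputer(n_neighbors=2)
filled = toy_imputer.fit_transform(toy)

# On the first column, the two rows closest to 3.0 are 2.0 and 1.0,
# so the gap is filled with the mean of their second values: (4.0 + 2.0) / 2.
print(filled[2, 1])  # 3.0
```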

In [16]:
salary_onehot.sample(4)

Unnamed: 0,Enrollment,Avg GMAT,"Resident Tuition, Fees",Pct International,Pct Female,Pct Asian American,Pct Minority,Pct with job offers,Avg starting base salary,State_Arizona,State_California,State_Colorado,State_Connecticut,State_Florida,State_Georgia,State_Illinois,State_Indiana,State_Iowa,State_Louisiana,State_Maryland,State_Massachusetts,State_Michigan,State_Minnesota,State_Missouri,State_New Hampshire,State_New Jersey,State_New York,State_North Carolina,State_Ohio,State_Pennsylvania,State_Tennessee,State_Texas,State_Utah,State_Virginia,State_Washington,State_Washington D.C.,State_Wisconsin,Type_Private,Type_Public
66,199,631.0,68626.0,32.0,36.0,19.0,21.0,49.0,62567.0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
27,294,681.0,83172.0,34.0,37.0,7.0,12.0,81.0,90775.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
41,119,627.0,21188.0,38.0,25.0,5.0,9.0,57.0,95120.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
2,1200,711.0,93918.0,34.0,36.0,25.0,13.0,95.0,108064.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [17]:
salary_onehot_imputed

array([[7.13000000e+02, 9.71650000e+04, 3.50000000e+01, 3.50000000e+01,
        1.60000000e+01, 7.00000000e+00, 9.20000000e+01, 1.07091000e+05],
       [7.20000000e+02, 1.01660000e+05, 3.30000000e+01, 3.80000000e+01,
        7.90000000e+00, 1.10000000e+01, 9.40000000e+01, 1.24378000e+05],
       [7.11000000e+02, 9.39180000e+04, 3.40000000e+01, 3.60000000e+01,
        2.50000000e+01, 1.30000000e+01, 9.50000000e+01, 1.08064000e+05],
       [7.14000000e+02, 1.04410000e+05, 4.40000000e+01, 3.60000000e+01,
        7.80000000e+00, 9.00000000e+00, 8.90000000e+01, 1.12186000e+05],
       [7.06000000e+02, 8.08790000e+04, 2.70000000e+01, 3.40000000e+01,
        2.10000000e+01, 1.30000000e+01, 8.90000000e+01, 1.03608000e+05],
       [7.26000000e+02, 9.78420000e+04, 4.30000000e+01, 3.60000000e+01,
        7.90000000e+00, 1.10000000e+01, 9.40000000e+01, 1.21171000e+05],
       [7.12000000e+02, 9.41040000e+04, 3.30000000e+01, 3.20000000e+01,
        1.20000000e+01, 1.30000000e+01, 8.70000000e+01, 1.

In [18]:
# Check missing values
salary_onehot_imputed = pd.DataFrame(salary_onehot_imputed, columns=salary_onehot[var_list].columns)
salary_onehot_imputed.isna().sum()

Avg GMAT                    0
Resident Tuition, Fees      0
Pct International           0
Pct Female                  0
Pct Asian American          0
Pct Minority                0
Pct with job offers         0
Avg starting base salary    0
dtype: int64

In [19]:
# Merge with the rest of the dataframe
salary_df = pd.concat([salary_onehot['Enrollment'], salary_onehot_imputed, salary_onehot.drop(['Enrollment'] + var_list, axis=1)], axis=1)
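
`pd.concat(..., axis=1)` aligns rows by index label. The merge above works because both frames carry the default `RangeIndex`, but the same code applied to a frame with a different index (for example one that has had rows dropped) would misalign. A small sketch with made-up frames (`original`, `imputed_values` are hypothetical):

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"x": [1.0, np.nan, 3.0]}, index=[10, 11, 12])
imputed_values = np.array([[1.0], [2.0], [3.0]])  # stand-in for an imputer's output

# A fresh DataFrame gets the index 0, 1, 2, which does not match 10, 11, 12.
misaligned = pd.DataFrame(imputed_values, columns=["x_imputed"])
# Passing the original index keeps the rows lined up.
aligned = pd.DataFrame(imputed_values, columns=["x_imputed"], index=original.index)

print(pd.concat([original, misaligned], axis=1).shape)  # (6, 2)
print(pd.concat([original, aligned], axis=1).shape)     # (3, 2)
```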

In [20]:
# Let's check if there are any missing values
salary_df.isna().sum().sum()

0

In [21]:
# Let's fit the decision tree on the KNN-imputed data (no rows dropped)
X = salary_df.drop('Avg starting base salary', axis=1)
y = salary_df['Avg starting base salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

params = {'max_depth' : [2,3,5,7,9,11],
          'min_samples_split' : [5,10,15,20,25,30,35,40],
          'min_samples_leaf' : [5,10,15,20,25,30,35,40]}

clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

clf_gs.fit(X_train, y_train)

clf_gs.best_params_, clf_gs.best_score_

({'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 5},
 0.14760061591587892)

In [22]:
clf_gs.score(X_test, y_test)

0.7675282259192849

**With the same hyperparameter grid, KNN imputation raises the test R² from about 0.40 to about 0.77. Note that the selected hyperparameters differ (max_depth 5 vs 2) and that the mean 5-fold CV R² changes far less (about 0.15 vs 0.14)**
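
One thing to keep in mind with this comparison: the imputer in cell 15 is fit on all 70 rows, including the target column, before the train/test split, so information from held-out rows can reach the model. A minimal sketch of one alternative, assuming synthetic stand-in data (`demo`, `X_demo`, `y_demo` and the hyperparameter values are hypothetical, not taken from the MBA file): drop only the rows whose target is missing, and place `KNNImputer` inside a `Pipeline` so that each CV fold fits the imputer on its own training part.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the MBA file).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(70, 5)), columns=list("abcde"))
demo["target"] = 2.0 * demo["a"] - demo["b"] + rng.normal(scale=0.3, size=70)

# Inject missing values into two features and into the target.
demo.iloc[::7, 1] = np.nan
demo.iloc[3::11, 3] = np.nan
demo.iloc[5::20, 5] = np.nan

# Rows without a target cannot be scored, so drop only those.
demo = demo.dropna(subset=["target"])
X_demo = demo.drop(columns="target")
y_demo = demo["target"]

# The imputer is refit on the training part of every fold.
pipe = Pipeline([
    ("imputer", KNNImputer(n_neighbors=2)),
    ("tree", DecisionTreeRegressor(max_depth=3, min_samples_leaf=5, random_state=42)),
])
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")

print(scores.mean())
```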

In [23]:
# What if we increase n_neighbors to 3 in the KNN imputation?
# Let's impute missing values using KNN imputer
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)

# Convert State and Type into dummies
salary_onehot = pd.get_dummies(salary, columns=['State','Type'])

var_list = ['Avg GMAT','Resident Tuition, Fees','Pct International','Pct Female',
            'Pct Asian American','Pct Minority','Pct with job offers','Avg starting base salary']

salary_onehot_imputed = imputer.fit_transform(salary_onehot[var_list])

In [24]:
# Convert into dataframe
salary_onehot_imputed = pd.DataFrame(salary_onehot_imputed, columns=salary_onehot[var_list].columns)

In [25]:
# Merge with the rest of the dataframe
salary_df = pd.concat([salary_onehot['Enrollment'], salary_onehot_imputed, salary_onehot.drop(['Enrollment'] + var_list, axis=1)], axis=1)

In [26]:
# Let's fit the decision tree on the data imputed with n_neighbors=3
X = salary_df.drop('Avg starting base salary', axis=1)
y = salary_df['Avg starting base salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

params = {'max_depth' : [2,3,5,7,9,11],
          'min_samples_split' : [5,10,15,20,25,30,35,40],
          'min_samples_leaf' : [5,10,15,20,25,30,35,40]}

clf_gs = GridSearchCV(DecisionTreeRegressor(), cv=5, param_grid=params)

clf_gs.fit(X_train, y_train)

clf_gs.best_params_, clf_gs.best_score_

({'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 5},
 0.15447590711518622)

In [27]:
clf_gs.score(X_test, y_test)

0.714707872544619

**With n_neighbors=3 the test R² drops to about 0.71 (from about 0.77), although the mean 5-fold CV R² is marginally higher (0.154 vs 0.148). So let's keep n_neighbors=2**
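
Instead of rerunning the cells for each value, `n_neighbors` can be treated as one more hyperparameter and searched together with the tree settings. The sketch below assumes a `Pipeline` with steps named `imputer` and `tree` (names chosen here for illustration) and uses synthetic stand-in data, not the MBA file.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in features and target with some missing feature values.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(70, 6))
y_demo = X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.3, size=70)
X_demo[::6, 1] = np.nan
X_demo[2::9, 4] = np.nan

pipe = Pipeline([
    ("imputer", KNNImputer()),
    ("tree", DecisionTreeRegressor(random_state=42)),
])

# "<step name>__<parameter>" routes each value to the right pipeline step.
param_grid = {
    "imputer__n_neighbors": [2, 3, 5],
    "tree__max_depth": [2, 3, 5],
    "tree__min_samples_leaf": [5, 10],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Here the choice of `n_neighbors` rests on the cross-validated score rather than on a single small test split.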