# Assignment 2 
## DECISION TREE

DSCI 6601: Practical Machine Learning

Student: Sahil Khan

Date: __-Oct-2024
_____________________________________________________________________________________________________________________________________________________________________________________________

### INTRODUCTION

For this assignment, we chose a publicly available dataset from Kaggle that contains at least five features. The goal is to build a Decision Tree classifier, which suits the categorical target variable in our dataset. We will evaluate the model's performance and analyze how the features contribute to the predictions. We complete the tasks outlined below and include both our Python code and the dataset in our submission.

_____________________________________________________________________________________________________________________________________________________________________________________________

### IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix


import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

## TASK 1: Dataset Overview 
- We will provide a basic overview of the dataset, including details such as feature names, target variable, presence of missing values, and identification of categorical features.

In [2]:
mbaData = pd.read_csv('MBA.csv')
mbaData.head()

Unnamed: 0,application_id,gender,international,gpa,major,race,gmat,work_exp,work_industry,admission
0,1,Female,False,3.3,Business,Asian,620.0,3.0,Financial Services,Admit
1,2,Male,False,3.28,Humanities,Black,680.0,5.0,Investment Management,
2,3,Female,True,3.3,Business,,710.0,5.0,Technology,Admit
3,4,Male,False,3.47,STEM,Black,690.0,6.0,Technology,
4,5,Male,False,3.35,STEM,Hispanic,590.0,5.0,Consulting,


In [3]:
mbaData.shape   # Count of Rows and Columns

(6194, 10)

In [4]:
mbaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            4352 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       1000 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


- FEATURE NAMES AND TARGET VARIABLES

In [5]:
print("FEATURES NAMES:", mbaData.columns.tolist())  # All column names; the last one is the target
print("TARGET VARIABLE:", mbaData.columns[-1].upper())  # Target variable (last column)

FEATURES NAMES: ['application_id', 'gender', 'international', 'gpa', 'major', 'race', 'gmat', 'work_exp', 'work_industry', 'admission']
TARGET VARIABLE: ADMISSION


- PRESENCE OF MISSING VALUES

In [6]:
mbaData.isnull().sum()

application_id       0
gender               0
international        0
gpa                  0
major                0
race              1842
gmat                 0
work_exp             0
work_industry        0
admission         5194
dtype: int64

- CATEGORICAL COLUMNS

In [7]:
categorical_columns = [col for col in mbaData.columns if mbaData[col].dtype == 'object']
categorical_columns

['gender', 'major', 'race', 'work_industry', 'admission']
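
The separate checks above (dtypes, missing values, categorical columns) can also be collected into a single overview table. The following is a minimal sketch, not part of the original notebook: `summarize_columns` is a hypothetical helper name, and the toy frame uses made-up values in place of `MBA.csv`.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, missing share, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(2),
        "n_unique": df.nunique(dropna=True),
    })

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "gpa": [3.3, 3.28, 3.3, 3.47],
    "race": ["Asian", "Black", None, "Black"],
    "admission": ["Admit", None, "Admit", None],
})

overview = summarize_columns(toy)
print(overview)
```

Calling `summarize_columns(mbaData)` on the real frame would give the same kind of table for all ten columns.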

____________________________________________________________________________________________________________________________________________________________________________________________

## TASK 2: Data Preprocessing
We will describe how we would preprocess the dataset before building a Decision Tree model. We will state steps such as handling missing values, encoding categorical features, and scaling features if necessary.

#### Drop unnecessary columns

In [8]:
# 'application_id' is only a row identifier and carries no predictive signal, so we drop it
mbaData = mbaData.drop('application_id', axis=1)

#### Handling Missing Values

In [9]:
mbaData.isnull().sum()

gender              0
international       0
gpa                 0
major               0
race             1842
gmat                0
work_exp            0
work_industry       0
admission        5194
dtype: int64

We can clearly see that the race and admission columns have missing values.

In [10]:
# Handling Missing Values in 'RACE' column
mbaData["race"]=mbaData['race'].fillna(mbaData['race'].mode()[0])

Since RACE is a categorical feature, a common way to handle its missing values is to fill them with the mode (the most frequently occurring value), for the following reasons:
- Keeps every imputed value within the set of categories already present in the data.
- Retains more data points for analysis by avoiding row deletion.
- Uses a representative value rather than an arbitrary one.

One caveat: about 30% of the rows (1,842 of 6,194) are missing RACE, so mode imputation noticeably increases the share of the most frequent category.
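
The effect of mode imputation on category shares can be checked directly. The sketch below is an illustration only: it uses a made-up toy column rather than the real RACE data, and compares the shares of each category before and after filling.

```python
import pandas as pd

# Toy column (made-up values): 'White' is the mode, 3 of 8 entries are missing
race = pd.Series(["White", "White", "Asian", "Black", "White", None, None, None])

before = race.value_counts(normalize=True)   # shares among observed values only
filled = race.fillna(race.mode()[0])         # same call pattern as in cell In [10]
after = filled.value_counts(normalize=True)  # shares after imputation

comparison = pd.DataFrame({"before": before, "after": after}).round(3)
print(comparison)
```

In this toy case the mode's share rises from 0.6 to 0.75 while the other categories shrink, which is the kind of shift worth checking on the real column.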

##### For the admission column, even though it is categorical and could be imputed with the mode, we choose not to do so because it is the target variable. Assigning the most frequent label to rows without a recorded decision could introduce bias into the dataset.

In [11]:
mbaData['admission'].value_counts(dropna=False)

NaN         5194
Admit        900
Waitlist     100
Name: admission, dtype: int64

##### Given that 5,194 rows have a null admission value, we cannot afford to drop them, as this would result in a significant loss of data. Instead, we encode the column as follows: 'Admit' as 1, 'Waitlist' as 2, and null values as 0, so that rows without a recorded decision form their own class.

In [12]:
mbaData['admission'] = mbaData['admission'].replace({'Admit': 1, 'Waitlist': 2})
mbaData['admission'] = mbaData['admission'].fillna(0)
mbaData['admission'].value_counts()

0.0    5194
1.0     900
2.0     100
Name: admission, dtype: int64
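
With these counts, class 0 makes up roughly 83.9% of the rows (5,194 of 6,194), class 1 about 14.5%, and class 2 about 1.6%, so the target is strongly imbalanced. The sketch below shows one way to encode the target with an explicit label map and inspect the class shares; it uses a made-up toy column, and `label_map` is a name introduced here for illustration.

```python
import pandas as pd

# Toy target column (made-up values) with the same three states as 'admission'
admission = pd.Series(["Admit", None, "Waitlist", None, None, "Admit"])

# Hypothetical label map; 0 is reserved for rows with no recorded decision
label_map = {"Admit": 1, "Waitlist": 2}
encoded = admission.map(label_map).fillna(0).astype(int)

# Share of each class, ordered by class label
class_share = encoded.value_counts(normalize=True).sort_index()
print(encoded.tolist())
print(class_share)
```

Casting to `int` keeps the labels as 0, 1, 2 rather than 0.0, 1.0, 2.0, which reads more naturally for a classification target.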

In [13]:
# Checking for null values after data preprocessing
mbaData.isnull().sum()

gender           0
international    0
gpa              0
major            0
race             0
gmat             0
work_exp         0
work_industry    0
admission        0
dtype: int64

In [14]:
mbaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   gender         6194 non-null   object 
 1   international  6194 non-null   bool   
 2   gpa            6194 non-null   float64
 3   major          6194 non-null   object 
 4   race           6194 non-null   object 
 5   gmat           6194 non-null   float64
 6   work_exp       6194 non-null   float64
 7   work_industry  6194 non-null   object 
 8   admission      6194 non-null   float64
dtypes: bool(1), float64(4), object(4)
memory usage: 393.3+ KB


In [15]:
# Converting bool column to int to simplify the dataset for model building
mbaData['international'] = mbaData['international'].astype(int)

In [16]:
mbaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   gender         6194 non-null   object 
 1   international  6194 non-null   int32  
 2   gpa            6194 non-null   float64
 3   major          6194 non-null   object 
 4   race           6194 non-null   object 
 5   gmat           6194 non-null   float64
 6   work_exp       6194 non-null   float64
 7   work_industry  6194 non-null   object 
 8   admission      6194 non-null   float64
dtypes: float64(4), int32(1), object(4)
memory usage: 411.4+ KB


- Applying One-Hot encoding to the categorical features

In [17]:
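# Apply one-hot encoding to the categorical feature columns (the target 'admission' is excluded).
# Note: list.remove() returns None, so the filtered list is built first and then passed in.
feature_categoricals = [col for col in categorical_columns if col != 'admission']
mbaData_encoded = pd.get_dummies(mbaData, columns=feature_categoricals)
mbaData_encoded.head()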

Unnamed: 0,international,gpa,gmat,work_exp,admission,gender_Female,gender_Male,major_Business,major_Humanities,major_STEM,race_Asian,race_Black,race_Hispanic,race_Other,race_White,work_industry_CPG,work_industry_Consulting,work_industry_Energy,work_industry_Financial Services,work_industry_Health Care,work_industry_Investment Banking,work_industry_Investment Management,work_industry_Media/Entertainment,work_industry_Nonprofit/Gov,work_industry_Other,work_industry_PE/VC,work_industry_Real Estate,work_industry_Retail,work_industry_Technology
0,0,3.3,620.0,3.0,1.0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,3.28,680.0,5.0,0.0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,1,3.3,710.0,5.0,1.0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,3.47,690.0,6.0,0.0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,3.35,590.0,5.0,0.0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


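- Decision trees split on feature thresholds, so they do not require feature scaling; we scale anyway for consistency, given that the majority of the columns already lie in the range of 0 to 1. The following features are rescaled to that range: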
1. GPA
2. GMAT
3. Work Experience
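
Min-max scaling maps each value x to (x - min) / (max - min), computed per column. As a small check of that formula, the sketch below applies it by hand to a made-up column of scores and compares the result with `MinMaxScaler`; the numbers are illustrative and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up scores, one feature column
scores = np.array([[500.0], [600.0], [650.0], [700.0]])

# Manual min-max scaling: x' = (x - min) / (max - min)
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler().fit_transform(scores)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

In a stricter workflow the scaler would be fitted on the training split only and then applied to the test split, so that test-set minima and maxima do not influence preprocessing.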

In [18]:
scaler = MinMaxScaler()

mbaData_encoded[['gpa', 'gmat', 'work_exp']] = scaler.fit_transform(mbaData_encoded[['gpa', 'gmat', 'work_exp']])
mbaData_encoded

Unnamed: 0,international,gpa,gmat,work_exp,admission,gender_Female,gender_Male,major_Business,major_Humanities,major_STEM,race_Asian,race_Black,race_Hispanic,race_Other,race_White,work_industry_CPG,work_industry_Consulting,work_industry_Energy,work_industry_Financial Services,work_industry_Health Care,work_industry_Investment Banking,work_industry_Investment Management,work_industry_Media/Entertainment,work_industry_Nonprofit/Gov,work_industry_Other,work_industry_PE/VC,work_industry_Real Estate,work_industry_Retail,work_industry_Technology
0,0,0.580357,0.238095,0.250,1.0,1,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
1,0,0.562500,0.523810,0.500,0.0,0,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,1,0.580357,0.666667,0.500,1.0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1
3,0,0.732143,0.571429,0.625,0.0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
4,0,0.625000,0.095238,0.500,0.0,0,1,0,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6189,0,0.750000,0.333333,0.500,0.0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
6190,0,0.473214,0.476190,0.375,0.0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
6191,1,0.508929,0.523810,0.500,1.0,1,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
6192,1,0.633929,0.095238,0.500,0.0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0


____________________________________________________________________________________________________________________________________________________________________________________________

## TASK 3: Model Implementation
- Implement a Decision Tree classifier using Scikit-learn.
- We will explain how we would split the data into training and testing sets, train the model, and evaluate it using appropriate metrics.

In [21]:
X = mbaData_encoded.drop('admission', axis=1)
y = mbaData_encoded['admission']

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
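
Because the target classes are very unevenly sized, one option worth considering is a stratified split, which keeps the class proportions roughly equal in the training and test sets. The sketch below is an assumption-labeled illustration on synthetic data (the `X_demo` and `y_demo` arrays are made up); it is not the split used for the results reported in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels (made up): 80% class 0, 15% class 1, 5% class 2
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([0] * 160 + [1] * 30 + [2] * 10)

# stratify=y_demo asks for the same class mix in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Class shares in each partition
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share, test_share)
```

For the real data this would correspond to passing `stratify=y` to the existing `train_test_split` call.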

In [25]:
# Decision Tree Classifier (no random_state is set, so results can vary slightly between runs)
dtClassifier = DecisionTreeClassifier()
dtClassifier.fit(X_train, y_train)

In [26]:
# Predicting the Test set results
y_pred = dtClassifier.predict(X_test)

In [27]:
# Checking the accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy of Decision Tree classifier on test set:", accuracy)


Accuracy of Decision Tree classifier on test set: 0.7654653039268424


In [28]:
# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

Confusion Matrix:
[[1313  189   23]
 [ 179  110   10]
 [  24   11    0]]
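
Accuracy alone can hide how the model behaves on the smaller classes. The sketch below derives per-class precision and recall, plus a majority-class baseline, directly from the confusion matrix printed above (rows are true classes, columns are predicted classes, in the order 0, 1, 2). It only re-uses the reported numbers and does not retrain anything.

```python
import numpy as np

# Confusion matrix as reported above (rows = true class, columns = predicted class)
cm = np.array([[1313, 189, 23],
               [179, 110, 10],
               [24, 11, 0]])

accuracy = np.trace(cm) / cm.sum()
recall = np.diag(cm) / cm.sum(axis=1)      # correct predictions per true class
precision = np.diag(cm) / cm.sum(axis=0)   # correct predictions per predicted class

# Accuracy of always predicting the largest true class
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print("recall per class:   ", recall.round(3))
print("precision per class:", precision.round(3))
```

On these numbers the accuracy is about 0.765, while always predicting class 0 would score about 0.820. Recall is roughly 0.861 for class 0, 0.368 for class 1, and 0 for class 2, so the tree identifies very few of the minority-class cases.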
