# Assignment 2 
## DECISION TREE

DSCI 6601: Practical Machine Learning

Student: Sahil Khan

Date: __-Oct-2024
_____________________________________________________________________________________________________________________________________________________________________________________________

### INTRODUCTION

For this assignment, we chose a publicly available dataset from Kaggle that contains at least five features. The goal is to build a Decision Tree Classifier, which suits the categorical target of our dataset. We will evaluate the model’s performance and analyze how the features contribute to the predictions. We will complete the tasks outlined below and include both our Python code and the dataset in our submission.

_____________________________________________________________________________________________________________________________________________________________________________________________

### IMPORTING LIBRARIES

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings("ignore")

## TASK 1: Dataset Overview 
- We will provide a basic overview of the dataset, including details such as feature names, target variable, presence of missing values, and identification of categorical features.

In [2]:
mbaData = pd.read_csv('MBA.csv')
mbaData.head()

|   | application_id | gender | international | gpa | major | race | gmat | work_exp | work_industry | admission |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Female | False | 3.3 | Business | Asian | 620.0 | 3.0 | Financial Services | Admit |
| 1 | 2 | Male | False | 3.28 | Humanities | Black | 680.0 | 5.0 | Investment Management | NaN |
| 2 | 3 | Female | True | 3.3 | Business | NaN | 710.0 | 5.0 | Technology | Admit |
| 3 | 4 | Male | False | 3.47 | STEM | Black | 690.0 | 6.0 | Technology | NaN |
| 4 | 5 | Male | False | 3.35 | STEM | Hispanic | 590.0 | 5.0 | Consulting | NaN |


In [3]:
mbaData.shape   # Count of Rows and Columns

(6194, 10)

In [4]:
mbaData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6194 entries, 0 to 6193
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   application_id  6194 non-null   int64  
 1   gender          6194 non-null   object 
 2   international   6194 non-null   bool   
 3   gpa             6194 non-null   float64
 4   major           6194 non-null   object 
 5   race            4352 non-null   object 
 6   gmat            6194 non-null   float64
 7   work_exp        6194 non-null   float64
 8   work_industry   6194 non-null   object 
 9   admission       1000 non-null   object 
dtypes: bool(1), float64(3), int64(1), object(5)
memory usage: 441.7+ KB


- FEATURE NAMES AND TARGET VARIABLE

In [5]:
print("FEATURE NAMES:", mbaData.columns.tolist()[:-1])  # Every column except the last one (the target)
print("TARGET VARIABLE:", mbaData.columns.tolist()[-1].upper())  # The last column is the target

FEATURE NAMES: ['application_id', 'gender', 'international', 'gpa', 'major', 'race', 'gmat', 'work_exp', 'work_industry']
TARGET VARIABLE: ADMISSION
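
A minimal sketch of how the feature/target split might look when it is time to train the model. The toy frame `demo` stands in for `mbaData`, and dropping `application_id` is our assumption (it appears to be only a row identifier), not a step taken elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for mbaData (illustrative values only)
demo = pd.DataFrame({
    "application_id": [1, 2, 3],
    "gpa": [3.30, 3.28, 3.47],
    "gmat": [620.0, 680.0, 690.0],
    "admission": ["Admit", None, None],
})

target_col = "admission"
# Assumption: the identifier carries no predictive signal, so it is excluded
feature_cols = [c for c in demo.columns if c not in (target_col, "application_id")]

X = demo[feature_cols]  # feature matrix
y = demo[target_col]    # target vector
print(feature_cols)
```

Keeping the column names in `feature_cols` makes it easy to reuse the same list later when inspecting feature importances.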


- PRESENCE OF MISSING VALUES

In [6]:
mbaData.isnull().sum()

application_id       0
gender               0
international        0
gpa                  0
major                0
race              1842
gmat                 0
work_exp             0
work_industry        0
admission         5194
dtype: int64

- CATEGORICAL COLUMNS

In [7]:
[col for col in mbaData.columns if mbaData[col].dtype == 'object']

['gender', 'major', 'race', 'work_industry', 'admission']
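
As a side note, the dtype check above only finds text columns. A boolean column such as `international` is also categorical in nature but is not listed. The sketch below, on a made-up frame called `demo`, shows one way to pick up both kinds with `select_dtypes`. It is an alternative we are suggesting, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for a few mbaData columns (illustrative values only)
demo = pd.DataFrame({
    "gender": ["Female", "Male"],
    "international": [False, True],
    "gpa": [3.30, 3.28],
    "major": ["Business", "STEM"],
})

# Text-like columns: everything that is neither numeric nor boolean
categorical_cols = demo.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Boolean columns are two-valued categories as well
boolean_cols = demo.select_dtypes(include="bool").columns.tolist()

print(categorical_cols)
print(boolean_cols)
```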

____________________________________________________________________________________________________________________________________________________________________________________________

## TASK 2: Data Preprocessing
This section describes how we preprocess the dataset before building a Decision Tree model, covering steps such as handling missing values, encoding categorical features, and scaling features if necessary.
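
For the encoding step, one possible approach is one-hot encoding with `pd.get_dummies`, sketched below on a made-up frame (`demo` and its values are hypothetical). Scaling is usually unnecessary for decision trees, because each split compares a single feature against a threshold and rescaling does not change the ordering of values.

```python
import pandas as pd

# Toy stand-in for a few mbaData columns (illustrative values only)
demo = pd.DataFrame({
    "gpa": [3.30, 3.28, 3.47],
    "major": ["Business", "Humanities", "STEM"],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(demo, columns=["major"])
print(encoded.columns.tolist())
```

Each category becomes its own indicator column, so the tree never has to assume an ordering between categories.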

Handling Missing Values

In [8]:
mbaData.isnull().sum()

application_id       0
gender               0
international        0
gpa                  0
major                0
race              1842
gmat                 0
work_exp             0
work_industry        0
admission         5194
dtype: int64

We can clearly see that the race and admission columns have missing values.
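
The same conclusion can be reached programmatically. The sketch below uses a small made-up frame (`demo` is hypothetical) to list only the columns that contain missing values, together with the fraction of rows affected.

```python
import numpy as np
import pandas as pd

# Toy stand-in for mbaData (illustrative values only)
demo = pd.DataFrame({
    "gpa": [3.30, 3.28, 3.30, 3.47],
    "race": ["Asian", "Black", np.nan, "Black"],
    "admission": ["Admit", np.nan, "Admit", np.nan],
})

missing_counts = demo.isnull().sum()
# Keep only the columns with at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
# Fraction of rows missing in each column
missing_share = demo.isnull().mean()

print(cols_with_missing)
print(missing_share.round(2))
```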

In [9]:
# Handling Missing Values in 'RACE' column
mbaData['race'] = mbaData['race'].fillna(mbaData['race'].mode()[0])

Since the RACE column is a categorical feature, a common way to handle its missing values is to fill them with the mode() [the most frequently occurring value], for the following reasons:
- Keeps every imputed value within the set of categories already present in the data.
- Retains more data points for analysis by avoiding row deletion.
- Ensures imputed values are representative rather than arbitrary.

One caveat: about 30% of the rows (1,842 of 6,194) are missing RACE, so this imputation noticeably increases the share of the most frequent category.
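
To see what mode imputation does to the category shares, here is a small sketch on made-up values (the series `race` below is hypothetical, not the real column):

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing entries
race = pd.Series(["Asian", "Black", "Black", np.nan, np.nan, "Hispanic"], name="race")

# Category shares among the non-missing values
before = race.value_counts(normalize=True)

# Fill missing entries with the most frequent category
most_common = race.mode()[0]
filled = race.fillna(most_common)

# Category shares after imputation
after = filled.value_counts(normalize=True)

print(before)
print(after)
```

Every filled entry lands in the most frequent category, so its share grows while the others shrink. This is worth keeping in mind when a large part of a column is missing.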

##### For the admission column, even though it is categorical, we do not impute the missing values with the mode, because it serves as the target variable. Labelling every missing entry with the most common class could introduce bias into the dataset.

In [14]:
mbaData['admission'].value_counts(dropna=False)

NaN         5194
Admit        900
Waitlist     100
Name: admission, dtype: int64

##### Because 5,194 of the 6,194 rows (about 84%) have a null admission value, we cannot afford to drop them, as this would result in a significant loss of data. Instead, we encode the column as follows: 'Admit' as 1, 'Waitlist' as 2, and null values as 0, so that the missing entries form a class of their own.

In [15]:
mbaData['admission'] = mbaData['admission'].replace({'Admit': 1, 'Waitlist': 2})
mbaData['admission'] = mbaData['admission'].fillna(0)
mbaData['admission'].value_counts()

0.0    5194
1.0     900
2.0     100
Name: admission, dtype: int64
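
The output above shows the labels stored as floats (0.0, 1.0, 2.0), because the column still contained NaN when the replacement ran. If integer class labels are preferred, a variant along the following lines could be used. This is a sketch on a made-up series (`admission` and `label_map` below are hypothetical names), not the code used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up target column with missing entries
admission = pd.Series(["Admit", np.nan, "Waitlist", np.nan, "Admit"], name="admission")

# Same encoding as described above: Admit -> 1, Waitlist -> 2, missing -> 0
label_map = {"Admit": 1, "Waitlist": 2}
encoded = admission.map(label_map).fillna(0).astype(int)

print(encoded.tolist())
```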

In [17]:
# Checking for null values after data preprocessing
mbaData.isnull().sum()

application_id    0
gender            0
international     0
gpa               0
major             0
race              0
gmat              0
work_exp          0
work_industry     0
admission         0
dtype: int64