Decision tree classifier on the cleaned admissions dataset

Import statements

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

Importing the dataset

In [2]:
df = pd.read_csv("./CleanedData.csv", index_col=0)
columns = df.columns
print(df.head())

   GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  Research  \
0      337.0        118.0                  4  4.5   4.5  9.65         1   
1      324.0        107.0                  4  4.0   4.5  8.87         1   
2      316.0        104.0                  3  3.0   3.5  8.00         1   
3      322.0        110.0                  3  3.5   2.5  8.67         1   
4      314.0        103.0                  2  2.0   3.0  8.21         0   

   SES Percentage  Asian  african american  latinx  white  Chance of Admit  
0              12      1                 0       0      0             0.92  
1              11      0                 0       1      0             0.76  
2              78      0                 0       1      0             0.72  
3              77      0                 0       0      1             0.80  
4               1      0                 1       0      0             0.65  


Separating the data into features (x) and target (y)

In [3]:
x = df[columns[0:-1]]
y = df["Chance of Admit"]

A decision tree classifier needs discrete class labels, but Chance of Admit is a continuous value, so I need to set a
threshold for what chance of admit means the student will be admitted.

In [4]:
median_value = y.median()
print(median_value)

0.73


Since the median chance of admission is 0.73 (73%), I will use that as the threshold. Any student with a chance of
admission >= 0.73 will be labeled 1 (admitted), and any student with a chance < 0.73 will be labeled 0.

In [5]:
# Label each student 1.0 if the chance of admission meets the threshold, else 0.0.
# Comparing the whole Series at once builds a new Series, so nothing is written
# back into a slice of df and pandas raises no SettingWithCopyWarning.
y = (y >= median_value).astype(float)
print(y)

0      1.0
1      1.0
2      0.0
3      1.0
4      0.0
      ... 
351    1.0
352    1.0
353    1.0
354    0.0
355    1.0
Name: Chance of Admit, Length: 356, dtype: float64

To use the y values as class labels, I will convert them from floats to integers

In [6]:
y = y.convert_dtypes()

print(y)

0      1
1      1
2      0
3      1
4      0
      ..
351    1
352    1
353    1
354    0
355    1
Name: Chance of Admit, Length: 356, dtype: Int64
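
Note that the dtype printed above is Int64 with a capital I: convert_dtypes() returns pandas' nullable extension integer type rather than NumPy's int64. Some scikit-learn versions do not accept labels stored in that extension type, so an explicit cast is a safer way to get plain integer class labels. A minimal sketch on a small hypothetical Series (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical 0/1 labels stored as floats, like y after thresholding.
labels = pd.Series([1.0, 1.0, 0.0, 1.0, 0.0], name="Chance of Admit")

# convert_dtypes() picks pandas' nullable extension type.
nullable_labels = labels.convert_dtypes()
print(nullable_labels.dtype)

# An explicit cast yields a plain NumPy-backed integer Series.
plain_labels = labels.astype("int64")
print(plain_labels.dtype)
```

The first print shows the nullable type and the second the NumPy type; only the second is guaranteed to be treated as ordinary integer labels by estimators.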


Splitting the data into training and testing sets

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, random_state=1)
print(x_train, y_train)

     GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  Research  \
23       300.0         97.0                  2  3.0   3.0  8.10         1   
320      338.0        115.0                  5  4.5   5.0  9.23         1   
213      328.0        110.0                  4  4.0   2.5  9.02         1   
51       307.0        101.0                  3  4.0   3.0  8.20         0   
92       301.0        107.0                  3  3.5   3.5  8.34         1   
..         ...          ...                ...  ...   ...   ...       ...   
203      326.0        111.0                  5  4.5   4.0  9.23         1   
255      300.0        102.0                  2  1.5   2.0  7.87         0   
72       299.0         97.0                  3  5.0   3.5  7.66         0   
235      308.0        108.0                  4  4.5   5.0  8.34         0   
37       329.0        114.0                  5  4.0   5.0  9.30         1   

     SES Percentage  Asian  african american  latinx  white  
23           
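
With the defaults used above, train_test_split holds out 25% of the rows for testing. Because the labels come from a median split, the two classes are roughly balanced, but a random split can still shift the proportions a little between the two sets. A hedged sketch of passing stratify to keep the class ratio the same in both sets, using small synthetic arrays rather than the admissions data (all variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 40 rows, 3 features, balanced 0/1 labels.
rng = np.random.default_rng(1)
features = rng.normal(size=(40, 3))
labels = np.array([0, 1] * 20)

# stratify=labels keeps the class proportions equal in both splits.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=1
)

print(f_train.shape, f_test.shape)
print(l_train.mean(), l_test.mean())
```

Whether stratifying matters for this dataset is a judgment call; with near-balanced classes the effect is small.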

Using grid search to find the best hyperparameters

In [8]:
criteria = ["gini", "entropy"]
max_depths = list(range(3, 11))
min_samples_leaves = list(range(1, 15))

model_scores = []

# Fit one tree per hyperparameter combination and record its test accuracy.
for c in criteria:
    for depth in max_depths:
        for leaf_size in min_samples_leaves:
            model = DecisionTreeClassifier(criterion=c, max_depth=depth, min_samples_leaf=leaf_size)
            model.fit(x_train, y_train)
            score = model.score(x_test, y_test)
            model_scores.append(score)

            print(f"Criteria: {c}\nMax Depth: {depth}\nMin Leaf Size: {leaf_size}\nScore: {score}")
            print()


ValueError: Unknown label type: 'unknown'
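
The ValueError above is consistent with the labels still being in pandas' nullable Int64 dtype, which some scikit-learn versions report as an 'unknown' label type. Casting the labels to NumPy int64 before fitting is a likely fix, though that is an assumption about the environment used here. Separately, the loop above scores every candidate on the test set, so the best score it finds is an optimistic estimate. The sketch below shows one alternative, not the method used in this notebook: scikit-learn's GridSearchCV searches the same 2 x 8 x 14 = 224 combinations using cross-validation on the training data only. The synthetic data from make_classification, the variable names, and the choice of cv=5 are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the admissions features and labels.
features, labels = make_classification(n_samples=300, n_features=8, random_state=1)
# Plain NumPy integers, not the nullable Int64 extension dtype.
labels = pd.Series(labels).astype("int64")

f_train, f_test, l_train, l_test = train_test_split(features, labels, random_state=1)

# Same grid as the manual loop.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(3, 11)),
    "min_samples_leaf": list(range(1, 15)),
}

# 5-fold cross-validation on the training set picks the hyperparameters.
search = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=5)
search.fit(f_train, l_train)

print(search.best_params_)
print("cross-validated accuracy:", search.best_score_)
# The test set is touched only once, after the search is finished.
print("held-out test accuracy:", search.score(f_test, l_test))
```

Keeping the test set out of the search means the final accuracy is measured on rows that played no part in choosing the hyperparameters.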