# Based on our previous analysis, we found that Supervised Learning made more sense than Unsupervised Learning. After trying several supervised models, we narrowed the choice down to two that we wanted to explore further: Decision Trees and Random Forests. This file explores which of the two we want to choose in the end.

In [6]:
# First we load the file and we begin manipulating it
import pandas as pd
data_path = '/Users/nathanyap/Desktop/DataMining_Project/project/Nathan Findings/TOEFL_IELTS_Combined.csv'
df_admitsFYI = pd.read_csv(data_path)

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# We only want to work with 5 inputs to test which model would be better
X = df_admitsFYI[['GPA', 'GRE Total', 'TOEFL/IELTS', 'Work Exp', 'Papers']]
y = df_admitsFYI['Status']

# Split the data: 70% for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the decision tree and the random forest
dt_model = DecisionTreeClassifier(random_state=42)
rf_model = RandomForestClassifier(random_state=42)

# doing the fit
dt_model.fit(X_train, y_train)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
dt_predictions = dt_model.predict(X_test)
rf_predictions = rf_model.predict(X_test)

# Compute accuracy, precision and recall for each model
dt_accuracy = accuracy_score(y_test, dt_predictions)
dt_precision = precision_score(y_test, dt_predictions)
dt_recall = recall_score(y_test, dt_predictions)

rf_accuracy = accuracy_score(y_test, rf_predictions)
rf_precision = precision_score(y_test, rf_predictions)
rf_recall = recall_score(y_test, rf_predictions)


print(f"Decision Tree Accuracy: {dt_accuracy:.2%}")
print(f"Decision Tree Precision: {dt_precision:.2%}")
print(f"Decision Tree Recall: {dt_recall:.2%}")
print()
print(f"Random Forest Accuracy: {rf_accuracy:.2%}")
print(f"Random Forest Precision: {rf_precision:.2%}")
print(f"Random Forest Recall: {rf_recall:.2%}")



# (dt_accuracy, dt_precision, dt_recall), (rf_accuracy, rf_precision, rf_recall)

Decision Tree Accuracy: 79.03%
Decision Tree Precision: 86.37%
Decision Tree Recall: 81.10%

Random Forest Accuracy: 80.07%
Random Forest Precision: 83.89%
Random Forest Recall: 86.48%


## Before we dive into the results, it is worth noting that we prioritize precision above the other metrics. In this context, precision answers the question: of all the students we predicted would be admitted, how many were actually admitted? High precision therefore means few false positives. The second most important metric is recall, which tells us: of all the students who were actually admitted, how many did we predict correctly? Even though the decision tree's accuracy (79.03% vs. 80.07%) and recall (81.10% vs. 86.48%) are lower than the random forest's, we feel we should still go with it because of its higher precision (86.37% vs. 83.89%). We want to minimize false positives because we do not want students to become overly hopeful when using our model. Students need to realize that there are many factors our model cannot capture, for example the strength of a letter of recommendation or how relevant their work experience is.
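
To make the link between precision, recall and false positives concrete, here is a minimal sketch that derives both metrics from a confusion matrix. The labels in `y_true` and `y_pred` are hypothetical toy values (1 = admitted, 0 = rejected) and are not taken from our dataset; the sketch also assumes that `Status` is coded as 0/1 in the same way.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels (1 = admitted, 0 = rejected); NOT from our dataset.
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 1, 0, 0, 1, 0, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Precision: of the students predicted as admitted, the share actually admitted.
precision = tp / (tp + fp)
# Recall: of the students actually admitted, the share we predicted as admitted.
recall = tp / (tp + fn)

print(f"False positives: {fp}, false negatives: {fn}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")

# The hand-computed values should agree with scikit-learn's own functions.
assert abs(precision - precision_score(y_true, y_pred)) < 1e-9
assert abs(recall - recall_score(y_true, y_pred)) < 1e-9
```

Because false positives appear only in the denominator of precision, a model that wrongly tells fewer students they will be admitted scores higher on precision, which is the behaviour we prioritize here.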