# Final Project
## Authors:
- Taylor Tucker
- Virginia Weston
- Tina Jin
- Jeffrey Bradley

## Code for decision trees and random forest.

Import statements

In [34]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Importing the dataset

In [35]:
df = pd.read_csv("./cleaned_data.csv")

In [36]:
df.head()

Unnamed: 0.1,Unnamed: 0,Number of Bachelor's Degrees,Percent Financial Aid,Average Amount of Aid,Retention Rate,Enrollment,Percent Women,Percent In State,Percent Out of State,Percent Foreign,...,Graduation Rate,Percent Awarded,Total Staff,Instructional Staff,SA Staff,Librarian Staff,Percent Books,Percent Digital,Percent Admitted,Total Price
0,0,208.0,100.0,32400.0,79.0,996,99.0,59.0,36.0,4.0,...,69.0,66.0,357.0,105.0,56.0,62.0,41,12,70.0,55625.0
1,1,310.0,100.0,40855.0,75.0,1533,54.0,66.0,32.0,1.0,...,64.0,61.0,435.0,132.0,21.0,27.0,37,54,68.0,59470.0
2,2,398.0,100.0,39796.0,68.0,1912,60.0,53.0,46.0,1.0,...,51.0,48.0,355.0,123.0,17.0,21.0,28,13,62.0,60636.0
3,3,382.0,100.0,38689.0,82.0,1771,56.0,50.0,45.0,4.0,...,74.0,70.0,426.0,160.0,41.0,50.0,27,46,64.0,63180.0
4,4,61.0,97.0,10055.0,37.0,698,45.0,64.0,34.0,0.0,...,31.0,10.0,115.0,41.0,4.0,7.0,20,76,64.0,23170.0


A classifier cannot predict a continuous target variable. Therefore, for the classifier models, I need to bin the target
into classes. I will do this with classes that each span a $10,000 interval. This will look like
0-10,000, 10,000-20,000, 20,000-30,000, etc.

In [37]:
print(max(df["Total Price"]))
print(min(df["Total Price"]))

76947.0
16700.0


We can see from above that the maximum price of a school is 76,947 and the minimum is 16,700. Therefore, I will set the class
boundaries to start at 10,000-20,000 and end at 70,000-80,000. Each class includes its lower bound and excludes its upper bound.

In [38]:
classified_prices = []
for i in range(len(df["Total Price"])):
    if 10000 <= df["Total Price"].iloc[i] < 20000:
        classified_prices.append("10,000-20,000")
    elif 20000 <= df["Total Price"].iloc[i] < 30000:
        classified_prices.append("20,000-30,000")
    elif 30000 <= df["Total Price"].iloc[i] < 40000:
        classified_prices.append("30,000-40,000")
    elif 40000 <= df["Total Price"].iloc[i] < 50000:
        classified_prices.append("40,000-50,000")
    elif 50000 <= df["Total Price"].iloc[i] < 60000:
        classified_prices.append("50,000-60,000")
    elif 60000 <= df["Total Price"].iloc[i] < 70000:
        classified_prices.append("60,000-70,000")
    elif 70000 <= df["Total Price"].iloc[i] < 80000:
        classified_prices.append("70,000-80,000")

if len(classified_prices) == len(df["Total Price"]):
    classified_target = pd.DataFrame(classified_prices, columns=["Total Price"])
    print(classified_target.head())
else:
    print("Error in classifying")

     Total Price
0  50,000-60,000
1  50,000-60,000
2  60,000-70,000
3  60,000-70,000
4  20,000-30,000
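
As a cross-check on the loop above, the same binning can be sketched with `pd.cut`. This is only an illustration: the `prices` series is made up (its first five values mirror the `Total Price` column shown by `df.head()`), and `edges` and `labels` are names introduced here.

```python
import pandas as pd

# Hypothetical prices; the first five mirror the "Total Price" values from df.head().
prices = pd.Series(
    [55625.0, 59470.0, 60636.0, 63180.0, 23170.0, 16700.0, 76947.0],
    name="Total Price",
)

# Bin edges every $10,000 from 10,000 to 80,000.
edges = list(range(10000, 80001, 10000))
labels = [f"{lo:,}-{hi:,}" for lo, hi in zip(edges[:-1], edges[1:])]

# right=False gives half-open bins [lo, hi), matching the `lo <= price < hi` checks.
binned = pd.cut(prices, bins=edges, labels=labels, right=False)

# A value outside every bin becomes NaN, so a gap is easy to detect.
print(binned.isna().sum())
print(binned.astype(str).tolist())
```

Unlike the `if`/`elif` chain, a price outside all bins shows up as a missing value rather than shortening the result, so the length check is replaced by a check for NaN.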


Now, a quick cleanup of the discrete dataset

In [39]:
df_discreet = df.drop(["Total Price"], axis=1)
df_discreet = pd.concat((df_discreet, classified_target), axis=1)
df_discreet.drop(["Unnamed: 0"], axis=1, inplace=True)
df_discreet.head()
df_discreet.to_csv("./cleaned_data_discreet.csv")
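
A note on the `Unnamed: 0` columns seen in `df.head()`: columns like these typically appear when a DataFrame is written with its index and then read back. The sketch below shows the round trip on a tiny made-up frame (`frame` and its values are hypothetical). It uses an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Tiny hypothetical frame standing in for the real dataset.
frame = pd.DataFrame(
    {"Enrollment": [996, 1533], "Total Price": ["50,000-60,000", "50,000-60,000"]}
)

# Default to_csv() writes the row index as an unnamed first column...
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
# ...while index=False leaves it out.
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

If the leftover index columns are not dropped, they end up among the features whenever x is taken as "every column but the last", so it may be worth checking the column list before modeling.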

Creating x and y for the DataFrames

In [40]:
x_d = df_discreet.iloc[:, :-1]
y_d = df_discreet.iloc[:, -1]

Train test split 70-30

In [41]:
x_train, x_test, y_train, y_test = train_test_split(x_d, y_d, shuffle=True, test_size=0.3, random_state=1)

Creating Pipelines for both

In [42]:
pl_dt = make_pipeline(StandardScaler(), MinMaxScaler(), DecisionTreeClassifier(), verbose=True)
pl_rf = make_pipeline(StandardScaler(), MinMaxScaler(), RandomForestClassifier(), verbose=True)
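
A side check on the scaler chain, offered as an observation rather than a change to the pipelines above: min-max scaling is unaffected by an earlier shift and positive rescaling of each column, so chaining `StandardScaler` into `MinMaxScaler` should give the same values as `MinMaxScaler` alone. The sketch below checks this on synthetic data (the matrix `X` is made up).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: 40 rows, 3 columns.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=20.0, size=(40, 3))

# Scale with both scalers in sequence, then with MinMaxScaler only.
both = make_pipeline(StandardScaler(), MinMaxScaler()).fit_transform(X)
minmax_only = MinMaxScaler().fit_transform(X)

# Compare the two results up to floating-point tolerance.
print(np.allclose(both, minmax_only))
```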

Establishing parameters for Grid Search

In [43]:
criteria = ["gini", "entropy"]
dt_max_depth = [2, 3, 4, 5, 6, 7, 8]
rf_max_depth = [i for i in range(100, 600, 100)]
n_ests = [i for i in range(50, 551, 50)]

grid_dt = {'decisiontreeclassifier__criterion': criteria,
            'decisiontreeclassifier__max_depth': dt_max_depth}

grid_rf = {'randomforestclassifier__criterion': criteria,
           'randomforestclassifier__max_depth': rf_max_depth,
           'randomforestclassifier__n_estimators': n_ests}
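
To see how large each search is before running it, the grids can be expanded with `ParameterGrid`. This sketch repeats the grid definitions from the cell above and multiplies by the 10 folds used in the grid search; `n_dt`, `n_rf`, and `cv` are names introduced here.

```python
from sklearn.model_selection import ParameterGrid

criteria = ["gini", "entropy"]
dt_max_depth = [2, 3, 4, 5, 6, 7, 8]
rf_max_depth = list(range(100, 600, 100))  # 100, 200, ..., 500
n_ests = list(range(50, 551, 50))          # 50, 100, ..., 550

grid_dt = {'decisiontreeclassifier__criterion': criteria,
           'decisiontreeclassifier__max_depth': dt_max_depth}

grid_rf = {'randomforestclassifier__criterion': criteria,
           'randomforestclassifier__max_depth': rf_max_depth,
           'randomforestclassifier__n_estimators': n_ests}

cv = 10  # number of folds passed to GridSearchCV
n_dt = len(ParameterGrid(grid_dt))
n_rf = len(ParameterGrid(grid_rf))

print(n_dt, "candidates,", n_dt * cv, "fits")
print(n_rf, "candidates,", n_rf * cv, "fits")
```

The counts (14 and 110 candidates, 140 and 1100 fits) agree with the "Fitting 10 folds for each of ..." lines in the grid search output.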


Creating grid search objects, using 'accuracy' as the score. The plain 'f1' and 'recall' scorers assume a binary target;
with a multiclass target they need an averaged variant such as 'f1_macro' or 'recall_macro'. For simplicity, we use 'accuracy'.

In [44]:
gs_dt = GridSearchCV(estimator=pl_dt, param_grid=grid_dt, scoring='accuracy', refit=True, cv=10, n_jobs=-1, verbose=True)
gs_rf = GridSearchCV(estimator=pl_rf, param_grid=grid_rf, scoring='accuracy', refit=True, cv=10, n_jobs=-1, verbose=True)
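
For reference, here is a minimal sketch of how a multiclass F1 score can be computed next to accuracy. The label lists are made up for illustration. `average="macro"` takes the unweighted mean of the per-class F1 scores, and the matching `GridSearchCV` scoring string is `'f1_macro'`.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted price brackets.
y_true = ["20,000-30,000", "50,000-60,000", "50,000-60,000", "60,000-70,000"]
y_pred = ["20,000-30,000", "50,000-60,000", "60,000-70,000", "60,000-70,000"]

acc = accuracy_score(y_true, y_pred)

# Macro averaging: compute F1 per class, then take the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")

print("accuracy:", acc)
print("macro F1:", macro_f1)
```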

Fitting the grid searches to the training data

In [45]:
gs_dt.fit(x_train, y_train)
print("DecisionTreeClassifier:")
# best_score_ is the mean cross-validated accuracy of the best candidate
print("Best CV Score:", gs_dt.best_score_)
print("Best Parameters:", gs_dt.best_params_)
print("Best Testing Score:", gs_dt.score(x_test, y_test))
print()

gs_rf.fit(x_train, y_train)
print("RandomForestClassifier:")
print("Best CV Score:", gs_rf.best_score_)
print("Best Parameters:", gs_rf.best_params_)
print("Best Testing Score:", gs_rf.score(x_test, y_test))

Fitting 10 folds for each of 14 candidates, totalling 140 fits
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline]  (step 3 of 3) Processing decisiontreeclassifier, total=   0.0s
DecisionTreeClassifier:
Best CV Score: 0.5275000000000001
Best Parameters: {'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__max_depth': 5}
Best Testing Score: 0.6176470588235294

Fitting 10 folds for each of 110 candidates, totalling 1100 fits
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline]  (step 3 of 3) Processing randomforestclassifier, total=   0.3s
RandomForestClassifier:
Best CV Score: 0.6116666666666666
Best Parameters: {'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 200, 'randomforestclassifier__n_estimators': 200}
Best Testing Sco

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.4s
[Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed:    2.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 224 tasks      | elapsed:   24.8s
[Parallel(n_jobs=-1)]: Done 474 tasks      | elapsed:   54.5s
[Parallel(n_jobs=-1)]: Done 824 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:  2.2min finished


We can see that the accuracy is really not great. I hypothesize that part of the reason the testing accuracy is only 63%
is confounding variables. Therefore, I will rerun the code above, but restrict the x values to the best
features described in the scaling_selection and README files.

In [46]:
df_discreet = pd.read_csv("./cleaned_data_discreet.csv")

Creating x and y for the DataFrames

In [47]:
x_d = df_discreet[["Average Amount of Aid", "Percent Financial Aid", "Percent Awarded", "Total Staff", "Graduation Rate",
                    "Percent Admitted", "Number of Bachelor's Degrees"]]
y_d = df_discreet.iloc[:, -1]
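
The feature selection itself lives in the scaling_selection file and is not shown here. As a rough idea of what a random-forest-based ranking can look like, the sketch below fits a forest on synthetic data and sorts `feature_importances_`. The data, the `feature_i` column names, and the choice to keep three features are all assumptions made for illustration, not the project's actual procedure.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real feature matrix and target.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical column names

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(forest.feature_importances_, index=names).sort_values(ascending=False)
top3 = importances.head(3).index.tolist()

print(importances)
print(top3)
```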

In [None]:
x_d

In [None]:
y_d

Train test split 70-30

In [48]:
x_train, x_test, y_train, y_test = train_test_split(x_d, y_d, shuffle=True, test_size=0.3, random_state=1)

Creating Pipelines for both

In [49]:
pl_dt = make_pipeline(StandardScaler(), MinMaxScaler(), DecisionTreeClassifier(), verbose=True)
pl_rf = make_pipeline(StandardScaler(), MinMaxScaler(), RandomForestClassifier(), verbose=True)

Establishing parameters for Grid Search

In [50]:
criteria = ["gini", "entropy"]
dt_max_depth = [2, 3, 4, 5, 6, 7, 8]
rf_max_depth = [i for i in range(100, 600, 100)]
n_ests = [i for i in range(50, 551, 50)]

grid_dt = {'decisiontreeclassifier__criterion': criteria,
            'decisiontreeclassifier__max_depth': dt_max_depth}

grid_rf = {'randomforestclassifier__criterion': criteria,
           'randomforestclassifier__max_depth': rf_max_depth,
           'randomforestclassifier__n_estimators': n_ests}


Creating grid search objects, using 'accuracy' as the score. The plain 'f1' and 'recall' scorers assume a binary target;
with a multiclass target they need an averaged variant such as 'f1_macro' or 'recall_macro'. For simplicity, we use 'accuracy'.

In [51]:
gs_dt = GridSearchCV(estimator=pl_dt, param_grid=grid_dt, scoring='accuracy', refit=True, cv=10, n_jobs=-1, verbose=True)
gs_rf = GridSearchCV(estimator=pl_rf, param_grid=grid_rf, scoring='accuracy', refit=True, cv=10, n_jobs=-1, verbose=True)

Fitting the grid searches to the training data

In [52]:
gs_dt.fit(x_train, y_train)
print("DecisionTreeClassifier:")
# best_score_ is the mean cross-validated accuracy of the best candidate
print("Best CV Score:", gs_dt.best_score_)
print("Best Parameters:", gs_dt.best_params_)
print("Best Testing Score:", gs_dt.score(x_test, y_test))
print()

gs_rf.fit(x_train, y_train)
print("RandomForestClassifier:")
print("Best CV Score:", gs_rf.best_score_)
print("Best Parameters:", gs_rf.best_params_)
print("Best Testing Score:", gs_rf.score(x_test, y_test))

Fitting 10 folds for each of 14 candidates, totalling 140 fits
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline]  (step 3 of 3) Processing decisiontreeclassifier, total=   0.0s
DecisionTreeClassifier:
Best CV Score: 0.5475
Best Parameters: {'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': 8}
Best Testing Score: 0.5441176470588235

Fitting 10 folds for each of 110 candidates, totalling 1100 fits
[Pipeline] .... (step 1 of 3) Processing standardscaler, total=   0.0s
[Pipeline] ...... (step 2 of 3) Processing minmaxscaler, total=   0.0s
[Pipeline]  (step 3 of 3) Processing randomforestclassifier, total=   0.7s
RandomForestClassifier:
Best CV Score: 0.6066666666666667
Best Parameters: {'randomforestclassifier__criterion': 'entropy', 'randomforestclassifier__max_depth': 300, 'randomforestclassifier__n_estimators': 550}
Best Testing Score: 0.573529

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:    0.5s
[Parallel(n_jobs=-1)]: Done 140 out of 140 | elapsed:    0.6s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.0s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   18.3s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   51.4s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 1100 out of 1100 | elapsed:  2.2min finished


Switching from the entire dataset to only the best features selected by Random Forest feature selection reduced the testing
accuracy by about 6-7 percentage points (decision tree: 61.8% to 54.4%; random forest: 63.2% to 57.4%). Therefore, we would
like to use the original dataset.

With 7 classes, the probability of getting a correct classification by uniform random guessing is about 14%. The best testing
accuracy we have gotten is 63.2%. While this is much better than guessing which pricing bracket a school falls into, it is
still not great. We are beginning to believe that our dataset might be problematic in our quest to
classify or predict college prices. However, this could be an interesting conclusion in that we might be able to say, with
certainty, that the price of college is more arbitrary than we may have previously thought.
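
The chance figure above can be reproduced with a line of arithmetic. It can also be compared against a slightly stricter baseline: always predicting the most common bracket. The label counts below are invented for illustration and are not the real class distribution.

```python
from collections import Counter

# Uniform guessing over the 7 price brackets.
n_classes = 7
uniform_chance = 1 / n_classes

# Hypothetical labels: 4, 3, 2 and 1 schools in four of the brackets.
labels = (
    ["50,000-60,000"] * 4
    + ["60,000-70,000"] * 3
    + ["20,000-30,000"] * 2
    + ["70,000-80,000"] * 1
)

# Accuracy of always predicting the most frequent bracket.
majority_baseline = max(Counter(labels).values()) / len(labels)

print(f"uniform chance: {uniform_chance:.3f}")
print(f"majority-class baseline: {majority_baseline:.3f}")
```

When the brackets are unevenly populated, the majority-class baseline is higher than 1/7, so it is the more demanding number to compare a model's accuracy against.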
