1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.

Parameter Correlation with Precision Correlation with Recall
- n_estimators - the number of trees in the forest (default is 100).
- max_depth - the maximum depth each tree is allowed to grow to (default is None). With None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- min_samples_split - the minimum number of samples a node must contain before it can be split (default=2). This tells the tree not to split at a node unless there are at least min_samples_split samples. A float is read as a fraction of the training samples instead of a count.
- min_samples_leaf - The minimum number of samples required to be at a leaf node (default=1). A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression. Increasing this number can cause underfitting.
- min_weight_fraction_leaf - The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided. (default=0.0)
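- max_leaf_nodes - the maximum number of leaf nodes per tree (default is None, meaning no limit). Trees are grown in best-first fashion, where the best nodes are those with the largest relative reduction in impurity.
- min_impurity_decrease - a node will be split only if the split induces a decrease of the impurity greater than or equal to this value (default=0.0).
- min_impurity_split - no longer exists. It was deprecated in favor of min_impurity_decrease and has been removed from current scikit-learn releases.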
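
As a quick cross-check of the defaults quoted in the list above, the values can be read directly from an untrained estimator. This is a minimal sketch, assuming a reasonably recent scikit-learn install; the `names` list and the `defaults` variable are just illustrative.

```python
from sklearn.ensemble import RandomForestClassifier

# Parameter names taken from the list above; min_impurity_split is left out
# because it is not available in current releases.
names = [
    "n_estimators", "max_depth", "min_samples_split", "min_samples_leaf",
    "min_weight_fraction_leaf", "max_leaf_nodes", "min_impurity_decrease",
]

# get_params() returns the constructor arguments of the estimator as a dict.
defaults = RandomForestClassifier().get_params()
for name in names:
    print(f"{name:26s} default = {defaults[name]!r}")
```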

Parameter | Adjustment made | Recall impact | Precision impact
:---|:---|:---|:---
n_estimators | used values 100, 200, 500, and 800. Precision and recall pretty much moved together, with 500 having the best results early on in the homework. | Recall was the lowest at 100, jumped up .04 at 200, stayed the same at 500, and decreased by .01 at 800. | Precision was the lowest at 100 as well, jumped up .02 at 200, another .01 at 500, and decreased by .01 at 800.
max_depth | used a number of values (2, 5, 10, 20, 50, 100, 200, 250, and 500). Precision and recall moved in opposite directions for the smallest max_depth values, and neither changed from 20 – 500. | Recall was lowest at 2, increased .13 at 5, increased .03 at 10 and reached its best value at 20 (increasing another .04). | Precision started the highest at 2, decreased by .07 at 5, increased .03 at 10 and increased another .01 at 20.
min_samples_split | used values 100, 50, 10, 5, 2, 0.2 and 0.5 since this one could be an int or a float. Precision and recall moved in opposite directions at the "ends" of my range of numbers. | Recall was lowest at 0.5 and highest at 2. Recall increased by .04 between 100 and 50, by .02 between 50 and 10, by .03 between 10 and 5 and by .01 between 5 and 2. Between 2 and 0.2, it decreased by .11 and significantly dropped between 0.2 and 0.5 (by .28). | Precision dropped by .02 between 100 and 50, then increased by .01 at each step from 50 to 10, 10 to 5, 5 to 2, and 2 to 0.2. It had its largest jump between 0.2 and 0.5 (.22), which is also where recall dropped the most.
min_samples_leaf | used values 50, 30, 20, 5, 2, and 1. Precision and recall moved in the opposite direction for the larger numbers (50, 30, 20) and in the same direction for the smaller numbers (5, 2, 1). The best value for both ended up being the default of 1. | Recall increased at each step in my series (.42 at 50 up to .58 at 1). | Precision dropped from .72 at 50 to .67 at 20 before heading back up, ending at .71 at 1.
min_weight_fraction_leaf |used a lot of small numbers here – 0.05, 0.04, 0.03, 0.02, 0.01, 0.005, 0.001, 0.003 and 0.0. Precision and recall didn't move together or opposite in any pattern on this one. |The values of 0.001 and 0.0 (default) had the same results.|The values of 0.001 and 0.0 (default) had the same results.
max_leaf_nodes | used values 2, 5, 10, 20, 100, 250, 500, and None (default). Precision and recall moved in the opposite direction for 2, 5, and 10 and moved together or had no net change for the rest of the numbers. | Since both moved up by .01 between 500 and None, None was the final choice. | Since both moved up by .01 between 500 and None, None was the final choice.
min_impurity_decrease | used values 0.03, 0.02, 0.01, 0.1, 0.001, 0.003 and 0.0. First off, 0.1 resulted in both being 0 and triggered a warning, because the model predicted no positive cases at that setting. For the first 3 values, precision and recall moved in opposite directions. | Recall increased by .11 from 0.01 to 0.001, decreased by .03 between 0.001 and 0.003 and increased by .04 between 0.003 and 0.0. | Precision stayed the same from 0.01 to 0.001, decreased by .03 between 0.001 and 0.003 and increased by .03 between 0.003 and 0.0.
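
The step-to-step changes in the table above were read off printed output by hand. A small helper can compute them instead. The sketch below is only an illustration: it runs on synthetic data from `make_classification` (an assumption, not the diabetes dataset), and `sweep`, `X_demo`, `y_demo` and the `d_precision`/`d_recall` column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the diabetes dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.65, 0.35], random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

def sweep(param, values, **fixed):
    """Fit one forest per value of `param` and return the scores as a DataFrame."""
    rows = []
    for v in values:
        rf = RandomForestClassifier(random_state=42, **fixed, **{param: v})
        rf.fit(Xtr, ytr)
        pred = rf.predict(Xte)
        rows.append({
            param: v,
            "precision": round(precision_score(yte, pred, zero_division=0), 2),
            "recall": round(recall_score(yte, pred, zero_division=0), 2),
        })
    df = pd.DataFrame(rows)
    # Change relative to the previous value in the sweep.
    df["d_precision"] = df["precision"].diff().round(2)
    df["d_recall"] = df["recall"].diff().round(2)
    return df

result = sweep("max_depth", [2, 5, 10, 20], n_estimators=50)
print(result)
```

Rows where `d_precision` and `d_recall` have opposite signs are the "moved in opposite directions" cases described in the table.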


In [76]:
import pandas as pd
from sklearn import tree
# plot_confusion_matrix is not used below and was removed in scikit-learn 1.2, so it is not imported
from sklearn.metrics import classification_report, precision_score, recall_score

diabetes_df = pd.read_csv("../../in_class/in_class_assignments/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42, stratify=y)

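#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
# NOTE: this refits the scaler on the test set. sc.transform(X_test) would reuse the
# training statistics instead. The outputs below were produced with fit_transform as written.
X_test=sc.fit_transform(X_test)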
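
For comparison, the more common pattern is to fit the scaler on the training rows only and reuse its statistics on the test rows. The sketch below shows that pattern on made-up data; `X_demo`, `y_demo` and the other names are illustrative, and it is not claimed to reproduce the numbers in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # made-up features
y_demo = rng.integers(0, 2, size=200)                     # made-up labels

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

scaler = StandardScaler()
Xtr_s = scaler.fit_transform(Xtr)   # learn mean/std from the training rows only
Xte_s = scaler.transform(Xte)       # reuse those statistics on the test rows

# Training columns are centered exactly; test columns are only approximately centered.
print("train column means:", Xtr_s.mean(axis=0).round(3))
print("test column means: ", Xte_s.mean(axis=0).round(3))
```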

In [4]:
from sklearn.ensemble import RandomForestClassifier
#estimator = model
rf = RandomForestClassifier(n_estimators=200,random_state=42)

rf = rf.fit(X_train, y_train)
rf.score(X_test, y_test)

0.7662337662337663

In [5]:
predictions = rf.predict(X_test)
print(classification_report(y_test,predictions))

              precision    recall  f1-score   support

           0       0.79      0.87      0.83       150
           1       0.70      0.58      0.64        81

    accuracy                           0.77       231
   macro avg       0.75      0.72      0.73       231
weighted avg       0.76      0.77      0.76       231
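
To make the class 1 row of the report concrete, the sketch below recomputes precision, recall and accuracy from confusion-matrix counts. The four counts are inferred from the rounded report and the accuracy printed above (an assumption, not notebook output): they are the integers consistent with supports of 150 and 81.

```python
# Counts inferred from the report above (an assumption, not notebook output):
# support is 150 for class 0 and 81 for class 1.
tn, fp = 130, 20   # actual class 0: correctly / incorrectly classified
fn, tp = 34, 47    # actual class 1: missed / caught

precision_1 = tp / (tp + fp)                 # of the predicted positives, how many are real
recall_1 = tp / (tp + fn)                    # of the real positives, how many were found
accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all rows classified correctly

print("precision (class 1):", round(precision_1, 2))
print("recall (class 1):   ", round(recall_1, 2))
print("accuracy:           ", accuracy)
```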



In [77]:
# writing a function and printing just recall and precision scores for each added parameter

def n_est(n):
    rf = RandomForestClassifier(n_estimators=n,random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for estimate ', n, 'is', precision_score(y_test,predictions).round(2))
    print('recall for estimate ', n, 'is', recall_score(y_test,predictions).round(2))
    
n_est(100)
n_est(200)
n_est(500)
n_est(800)

precision for estimate  100 is 0.68
recall for estimate  100 is 0.54
precision for estimate  200 is 0.7
recall for estimate  200 is 0.58
precision for estimate  500 is 0.71
recall for estimate  500 is 0.58
precision for estimate  800 is 0.7
recall for estimate  800 is 0.57


In [173]:
# max_depth function
def max_d(d):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=d,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for max_depth ', d, 'is', precision_score(y_test,predictions).round(2))
    print('recall for max_depth ', d, 'is', recall_score(y_test,predictions).round(2))
    
max_d(2)
max_d(5)
max_d(10)
max_d(20)
max_d(50)
max_d(100)
max_d(200)
max_d(250)
max_d(500)

precision for max_depth  2 is 0.74
recall for max_depth  2 is 0.38
precision for max_depth  5 is 0.67
recall for max_depth  5 is 0.51
precision for max_depth  10 is 0.7
recall for max_depth  10 is 0.54
precision for max_depth  20 is 0.71
recall for max_depth  20 is 0.58
precision for max_depth  50 is 0.71
recall for max_depth  50 is 0.58
precision for max_depth  100 is 0.71
recall for max_depth  100 is 0.58
precision for max_depth  200 is 0.71
recall for max_depth  200 is 0.58
precision for max_depth  250 is 0.71
recall for max_depth  250 is 0.58
precision for max_depth  500 is 0.71
recall for max_depth  500 is 0.58


In [165]:
#performed the same with max_depth of 20, 50, 100, 200, 250, and 500. Will stay with 20
def min_sam(s):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=s,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_samples_split ', s, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_samples_split ', s, 'is', recall_score(y_test,predictions).round(2))
    
min_sam(100)
min_sam(50)
min_sam(10)
min_sam(5)
min_sam(2)
min_sam(0.2)
min_sam(0.5)


precision for min_samples_split  100 is 0.7
recall for min_samples_split  100 is 0.48
precision for min_samples_split  50 is 0.68
recall for min_samples_split  50 is 0.52
precision for min_samples_split  10 is 0.69
recall for min_samples_split  10 is 0.54
precision for min_samples_split  5 is 0.7
recall for min_samples_split  5 is 0.57
precision for min_samples_split  2 is 0.71
recall for min_samples_split  2 is 0.58
precision for min_samples_split  0.2 is 0.72
recall for min_samples_split  0.2 is 0.47
precision for min_samples_split  0.5 is 0.94
recall for min_samples_split  0.5 is 0.19


In [264]:
#min_samples_leaf 
def min_sam_leaf(l):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=l,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_samples_leaf ', l, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_samples_leaf ', l, 'is', recall_score(y_test,predictions).round(2))
    
min_sam_leaf(50)
min_sam_leaf(30)
min_sam_leaf(20)
min_sam_leaf(5)
min_sam_leaf(2)
min_sam_leaf(1)


precision for min_samples_leaf  50 is 0.72
recall for min_samples_leaf  50 is 0.42
precision for min_samples_leaf  30 is 0.69
recall for min_samples_leaf  30 is 0.46
precision for min_samples_leaf  20 is 0.67
recall for min_samples_leaf  20 is 0.47
precision for min_samples_leaf  5 is 0.69
recall for min_samples_leaf  5 is 0.54
precision for min_samples_leaf  2 is 0.69
recall for min_samples_leaf  2 is 0.56
precision for min_samples_leaf  1 is 0.71
recall for min_samples_leaf  1 is 0.58


In [177]:
# min_weight_fraction_leaf - default is 0.0
def min_weight(w):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=w,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_weight_fraction_leaf ', w, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_weight_fraction_leaf ', w, 'is', recall_score(y_test,predictions).round(2))
    
min_weight(0.05)
min_weight(0.04)
min_weight(0.03)
min_weight(0.02)
min_weight(0.01)
min_weight(0.005)
min_weight(0.001)
min_weight(0.003)
min_weight(0.0)


precision for min_weight_fraction_leaf  0.05 is 0.67
recall for min_weight_fraction_leaf  0.05 is 0.47
precision for min_weight_fraction_leaf  0.04 is 0.66
recall for min_weight_fraction_leaf  0.04 is 0.48
precision for min_weight_fraction_leaf  0.03 is 0.67
recall for min_weight_fraction_leaf  0.03 is 0.49
precision for min_weight_fraction_leaf  0.02 is 0.68
recall for min_weight_fraction_leaf  0.02 is 0.52
precision for min_weight_fraction_leaf  0.01 is 0.69
recall for min_weight_fraction_leaf  0.01 is 0.56
precision for min_weight_fraction_leaf  0.005 is 0.7
recall for min_weight_fraction_leaf  0.005 is 0.54
precision for min_weight_fraction_leaf  0.001 is 0.71
recall for min_weight_fraction_leaf  0.001 is 0.58
precision for min_weight_fraction_leaf  0.003 is 0.7
recall for min_weight_fraction_leaf  0.003 is 0.54
precision for min_weight_fraction_leaf  0.0 is 0.71
recall for min_weight_fraction_leaf  0.0 is 0.58


In [183]:
# max_leaf_nodes
def max_nodes(m):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=m,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for max_leaf_nodes ', m, 'is', precision_score(y_test,predictions).round(2))
    print('recall for max_leaf_nodes ', m, 'is', recall_score(y_test,predictions).round(2))

max_nodes(2)
max_nodes(5)
max_nodes(10)
max_nodes(20)
max_nodes(100)
max_nodes(250)
max_nodes(500)
max_nodes(None)

precision for max_leaf_nodes  2 is 0.92
recall for max_leaf_nodes  2 is 0.14
precision for max_leaf_nodes  5 is 0.71
recall for max_leaf_nodes  5 is 0.44
precision for max_leaf_nodes  10 is 0.66
recall for max_leaf_nodes  10 is 0.48
precision for max_leaf_nodes  20 is 0.68
recall for max_leaf_nodes  20 is 0.52
precision for max_leaf_nodes  100 is 0.69
recall for max_leaf_nodes  100 is 0.57
precision for max_leaf_nodes  250 is 0.7
recall for max_leaf_nodes  250 is 0.57
precision for max_leaf_nodes  500 is 0.7
recall for max_leaf_nodes  500 is 0.57
precision for max_leaf_nodes  None is 0.71
recall for max_leaf_nodes  None is 0.58


In [182]:
# looks like none is the best for max_leaf_nodes, which is the default. min_impurity_decrease, default 0.0
def min_imp(i):
    rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=i,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_impurity_decrease ', i, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_impurity_decrease ', i, 'is', recall_score(y_test,predictions).round(2))
    
min_imp(0.03)
min_imp(0.02)
min_imp(0.01)
min_imp(0.1)
min_imp(0.001)
min_imp(0.003)
min_imp(0.0)

precision for min_impurity_decrease  0.03 is 0.81
recall for min_impurity_decrease  0.03 is 0.27
precision for min_impurity_decrease  0.02 is 0.75
recall for min_impurity_decrease  0.02 is 0.37
precision for min_impurity_decrease  0.01 is 0.71
recall for min_impurity_decrease  0.01 is 0.46


  _warn_prf(average, modifier, msg_start, len(result))


precision for min_impurity_decrease  0.1 is 0.0
recall for min_impurity_decrease  0.1 is 0.0
precision for min_impurity_decrease  0.001 is 0.71
recall for min_impurity_decrease  0.001 is 0.57
precision for min_impurity_decrease  0.003 is 0.68
recall for min_impurity_decrease  0.003 is 0.54
precision for min_impurity_decrease  0.0 is 0.71
recall for min_impurity_decrease  0.0 is 0.58
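
The warning and the 0.0 scores at min_impurity_decrease=0.1 are what the metrics report when a model never predicts the positive class, so precision has nothing to divide by. A minimal sketch with made-up labels, using the `zero_division` argument of the scikit-learn metric functions to return 0 without the warning:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 0]   # a model that never predicts the positive class

# zero_division=0 returns 0.0 for the undefined precision instead of warning.
p = precision_score(y_true, y_pred, zero_division=0)
r = recall_score(y_true, y_pred, zero_division=0)

print("precision:", p, "recall:", r)
```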


In [186]:
## everything I've chosen for the best precision/recall combo (.71/.58)
rf = RandomForestClassifier(n_estimators=500,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.71
recall is 0.58


In [233]:
## adjusting again
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.71
recall is 0.59


In [236]:
# running some of the earlier functions again with the additional values added later on since some of the earlier ones
# had multiple values with the same precision and recall
def n_est(n):
    rf = RandomForestClassifier(n_estimators=n,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for estimate ', n, 'is', precision_score(y_test,predictions).round(2))
    print('recall for estimate ', n, 'is', recall_score(y_test,predictions).round(2))
    
n_est(100)
n_est(200)
n_est(250)
n_est(500)
n_est(800)

precision for estimate  100 is 0.68
recall for estimate  100 is 0.54
precision for estimate  200 is 0.71
recall for estimate  200 is 0.59
precision for estimate  250 is 0.7
recall for estimate  250 is 0.57
precision for estimate  500 is 0.71
recall for estimate  500 is 0.58
precision for estimate  800 is 0.7
recall for estimate  800 is 0.57


In [234]:
# max_depth function
def max_d(d):
    rf = RandomForestClassifier(n_estimators=200,
                                max_depth=d,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for max_depth ', d, 'is', precision_score(y_test,predictions).round(2))
    print('recall for max_depth ', d, 'is', recall_score(y_test,predictions).round(2))
    
max_d(2)
max_d(5)
max_d(10)
max_d(20)
max_d(50)
max_d(100)
max_d(200)
max_d(250)
max_d(500)

precision for max_depth  2 is 0.74
recall for max_depth  2 is 0.38
precision for max_depth  5 is 0.69
recall for max_depth  5 is 0.52
precision for max_depth  10 is 0.68
recall for max_depth  10 is 0.54
precision for max_depth  20 is 0.71
recall for max_depth  20 is 0.59
precision for max_depth  50 is 0.7
recall for max_depth  50 is 0.58
precision for max_depth  100 is 0.7
recall for max_depth  100 is 0.58
precision for max_depth  200 is 0.7
recall for max_depth  200 is 0.58
precision for max_depth  250 is 0.7
recall for max_depth  250 is 0.58
precision for max_depth  500 is 0.7
recall for max_depth  500 is 0.58


In [237]:
def min_sam(s):
    rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=s,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print('precision for min_samples_split ', s, 'is', precision_score(y_test,predictions).round(2))
    print('recall for min_samples_split ', s, 'is', recall_score(y_test,predictions).round(2))
    
min_sam(100)
min_sam(50)
min_sam(10)
min_sam(5)
min_sam(2)
min_sam(0.2)
min_sam(0.5)

precision for min_samples_split  100 is 0.7
recall for min_samples_split  100 is 0.48
precision for min_samples_split  50 is 0.67
recall for min_samples_split  50 is 0.52
precision for min_samples_split  10 is 0.68
recall for min_samples_split  10 is 0.56
precision for min_samples_split  5 is 0.71
recall for min_samples_split  5 is 0.57
precision for min_samples_split  2 is 0.71
recall for min_samples_split  2 is 0.59
precision for min_samples_split  0.2 is 0.7
recall for min_samples_split  0.2 is 0.47
precision for min_samples_split  0.5 is 0.86
recall for min_samples_split  0.5 is 0.22


In [268]:
# got recall to increase by .01 by changing the n_estimators from 500 to 200
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.71
recall is 0.59


2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.

bootstrap : bool, default=True indicates whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.

In [253]:
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                bootstrap=False,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.71
recall is 0.59


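Using bootstrap=False had no impact on the results (precision and recall were unchanged to two decimals). With bootstrap=False every tree is built on the full training set, so the trees differ mainly through the random subset of features considered at each split. With the default bootstrap=True and max_samples unset, each tree is built on a sample the same size as the training set but drawn with replacement, so it is not quite the whole dataset either. My guess is that the feature randomness alone keeps the trees diverse enough here that the two settings score the same. To check how much the amount of data per tree matters, I added max_samples=500 with bootstrap=True and then False. Per the documentation, max_samples only applies when bootstrap=True, which would explain why the bootstrap=False run matches the result above (.71/.59) while the bootstrap=True run dropped slightly (.70/.57). It appears that giving each tree more of the dataset gives better recall and precision.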
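
For context on what a bootstrap sample contains, the sketch below draws row indices with replacement and measures how many distinct rows land in each sample. It uses only NumPy and does not call scikit-learn. `n_train = 537` is an assumed training-set size (a 231-row test set at test_size=0.3 implies roughly this many rows).

```python
import numpy as np

n_train = 537          # assumed training-set size (see note above)
rng = np.random.default_rng(42)

fractions = []
for _ in range(200):
    # Draw n_train row indices with replacement, as a bootstrap sample does.
    idx = rng.integers(0, n_train, size=n_train)
    fractions.append(np.unique(idx).size / n_train)

mean_unique = float(np.mean(fractions))
theory = 1 - (1 - 1 / n_train) ** n_train   # expected share of distinct rows

print("average share of distinct rows per bootstrap sample:", round(mean_unique, 3))
print("theoretical value 1 - (1 - 1/n)^n:", round(theory, 3))
```

Roughly a third of the rows are left out of any single bootstrap sample, which is one reason bootstrap=True and bootstrap=False are not the same thing even when max_samples is unset.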

In [258]:
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                max_samples=500,
                                bootstrap=True,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.7
recall is 0.57


In [259]:
rf = RandomForestClassifier(n_estimators=200,
                                max_depth=20,
                                min_samples_split=2,
                                min_samples_leaf=1,
                                min_weight_fraction_leaf=0.001,
                                max_leaf_nodes=None,
                                min_impurity_decrease=0.0,
                                max_samples=500,
                                bootstrap=False,
                                random_state=42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print('precision is', precision_score(y_test,predictions).round(2))
print('recall is', recall_score(y_test,predictions).round(2))

precision is 0.71
recall is 0.59
