## 1. Write simple (straightforward) definitions for the following parameters for RandomForestClassifier
(https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and indicate how they correlate with the precision and recall for the basic diabetes model we built in class. You will need to rerun the model multiple times to do so.

 - estimators - number of trees.Ideally, this should be increased until no further improvement is seen in the model            
 - max_depth   - this is depth of tree( longest path between the root node and the leaf node:). This parameter can be used to pruning            
 - min_samples_split   - The minimum number of samples required to split an internal node. It can be int or float     
 - min_samples_leaf     - The minimum number of samples required to split leaf node in left and right branches. This can be int or float.   
 - min_weight_fraction_leaf - If provided fraction of sum of total weights of samples at leaf node else equal weight
 - max_leaf_nodes   - Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.        
 - min_impurity_decrease    - Decides split of a node if there is decrease of the impurity greater than or equal to this value.
 - min_impurity_split - Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold, otherwise it is a leaf.

| Parameter                | Correlation with Precision       | Correlation with Recall         | Highest value at     |
| ------------------------ | ---------------------------------| --------------------------------|--------------------  |
| estimators               | Increased and then went down     | Increased and then went down    | n_estimators= 100    |
| max_depth                | Increased, descrese and increase | Increased, descrese and increase| Max_Depth=  11       |
| min_samples_split        | Increased and then decreased     | Increased and then decreased    | min_samples_split= 4|
| min_samples_leaf         | Increased and decreased pattern  | Increased and decreased pattern | min_samples_leaf=3 |
| min_weight_fraction_leaf | Decrease                         | Decrease                        |min_weight_fraction_leaf=0.2|
| max_leaf_nodes           | remains same                     | Remains same                    |max_leaf_nodes= None
| min_impurity_decrease    | Decreased                        | Decreased                       |min_impurity_decrease=  0.0
| min_impurity_split       | increase,descrese,increase stays decreased|increase,descrese,increase stays decreased                        |min_impurity_split=  0.25

In [1]:
import pandas as pd
from sklearn import tree
from sklearn.metrics import confusion_matrix, classification_report, plot_confusion_matrix
import numpy as np
import warnings
warnings.filterwarnings('ignore')

diabetes_df = pd.read_csv("../week_13/diabetes.csv")
diabetes_df.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)

#Standardize
sc= StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.fit_transform(X_test)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

#Estimators
for value in [10,100,200,300,400,500,1000]:
    rf = RandomForestClassifier(n_estimators=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("n_estimators=" , value, precision_recall_fscore_support(y_test, predictions, average='weighted'))



n_estimators= 10 (0.7251114954546856, 0.7337662337662337, 0.7236747438378277, None)
n_estimators= 100 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
n_estimators= 200 (0.7610055001359349, 0.7662337662337663, 0.7613786213786213, None)
n_estimators= 300 (0.7545792843068642, 0.7597402597402597, 0.7554767899150163, None)
n_estimators= 400 (0.7422917218835586, 0.7467532467532467, 0.7436948559373376, None)
n_estimators= 500 (0.7483459936290124, 0.7532467532467533, 0.7495827986975903, None)
n_estimators= 1000 (0.7422917218835586, 0.7467532467532467, 0.7436948559373376, None)


In [4]:
#max_depth
for value in  range(1,20,2):
    rf = RandomForestClassifier(max_depth=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("Max_Depth= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))

Max_Depth=  1 (0.762012987012987, 0.7012987012987013, 0.6255522141792634, None)
Max_Depth=  3 (0.7325288141985057, 0.7402597402597403, 0.7272936403371186, None)
Max_Depth=  5 (0.7182282003710575, 0.7272727272727273, 0.7179459691252145, None)
Max_Depth=  7 (0.7263910401525081, 0.7337662337662337, 0.7273970049089666, None)
Max_Depth=  9 (0.7680081627847654, 0.7727272727272727, 0.7641125862030235, None)
Max_Depth=  11 (0.774640818119079, 0.7792207792207793, 0.7746353646353646, None)
Max_Depth=  13 (0.7610055001359349, 0.7662337662337663, 0.7613786213786213, None)
Max_Depth=  15 (0.7610055001359349, 0.7662337662337663, 0.7613786213786213, None)
Max_Depth=  17 (0.7401405933516025, 0.7467532467532467, 0.7406947119865779, None)
Max_Depth=  19 (0.7401405933516025, 0.7467532467532467, 0.7406947119865779, None)


In [5]:
#min_samples_split
for value in  range(2,10,1):
    rf = RandomForestClassifier(min_samples_split=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("min_samples_split= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))

min_samples_split=  2 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
min_samples_split=  3 (0.7496603396603396, 0.7532467532467533, 0.7509206479794716, None)
min_samples_split=  4 (0.7894305694305694, 0.7922077922077922, 0.7902489667195549, None)
min_samples_split=  5 (0.7617771379563832, 0.7662337662337663, 0.7627626513977172, None)
min_samples_split=  6 (0.7610055001359349, 0.7662337662337663, 0.7613786213786213, None)
min_samples_split=  7 (0.7744982290436836, 0.7792207792207793, 0.7732131813764467, None)
min_samples_split=  8 (0.7538901465506971, 0.7597402597402597, 0.7539924190641895, None)
min_samples_split=  9 (0.7483459936290124, 0.7532467532467533, 0.7495827986975903, None)


In [6]:
#min_samples_leaf
for value in  range(1,20,2):
    rf = RandomForestClassifier(min_samples_leaf=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("min_samples_leaf= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))

min_samples_leaf=  1 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
min_samples_leaf=  3 (0.7538901465506971, 0.7597402597402597, 0.7539924190641895, None)
min_samples_leaf=  5 (0.7395849488872746, 0.7467532467532467, 0.7389951134515556, None)
min_samples_leaf=  7 (0.7465213358070502, 0.7532467532467533, 0.7448082577799559, None)
min_samples_leaf=  9 (0.7182282003710575, 0.7272727272727273, 0.7179459691252145, None)
min_samples_leaf=  11 (0.7251114954546856, 0.7337662337662337, 0.7236747438378277, None)
min_samples_leaf=  13 (0.7394103845647122, 0.7467532467532467, 0.7371540246262261, None)
min_samples_leaf=  15 (0.7322453861927547, 0.7402597402597403, 0.7294135572123244, None)
min_samples_leaf=  17 (0.7250141163184641, 0.7337662337662337, 0.7215829931508851, None)
min_samples_leaf=  19 (0.7323747680890539, 0.7402597402597403, 0.7313771134525852, None)


In [7]:
#min_weight_fraction_leaf
for value in np.arange(0,0.6,0.1):
    rf = RandomForestClassifier(min_weight_fraction_leaf=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("min_weight_fraction_leaf= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))

min_weight_fraction_leaf=  0.0 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
min_weight_fraction_leaf=  0.1 (0.7250141163184641, 0.7337662337662337, 0.7215829931508851, None)
min_weight_fraction_leaf=  0.2 (0.7502546473134708, 0.7532467532467533, 0.7364226682408501, None)
min_weight_fraction_leaf=  0.30000000000000004 (0.7382920110192837, 0.7272727272727273, 0.6886652367596107, None)
min_weight_fraction_leaf=  0.4 (0.4216562658121099, 0.6493506493506493, 0.5112997238981491, None)
min_weight_fraction_leaf=  0.5 (0.4216562658121099, 0.6493506493506493, 0.5112997238981491, None)


In [8]:
#max_leaf_nodes
for value in np.arange(100,10001,1000):
    rf = RandomForestClassifier(max_leaf_nodes=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("max_leaf_nodes= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))
    
rf = RandomForestClassifier(max_leaf_nodes=None,random_state =42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print("max_leaf_nodes= None", precision_recall_fscore_support(y_test, predictions, average='weighted'))

max_leaf_nodes=  100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  1100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  2100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  3100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  4100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  5100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  6100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  7100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  8100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes=  9100 (0.7337348641696468, 0.7402597402597403, 0.7348651348651349, None)
max_leaf_nodes= None (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)


In [9]:

#min_impurity_decrease
for value in np.arange(0.0,0.07,0.01):
    rf = RandomForestClassifier(min_impurity_decrease=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("min_impurity_decrease= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))
    

min_impurity_decrease=  0.0 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
min_impurity_decrease=  0.01 (0.7253257253257254, 0.7337662337662337, 0.7193267561931156, None)
min_impurity_decrease=  0.02 (0.709478021978022, 0.7142857142857143, 0.6818295739348371, None)
min_impurity_decrease=  0.03 (0.7385888964836334, 0.7207792207792207, 0.6746357693603393, None)
min_impurity_decrease=  0.04 (0.740351198097677, 0.7012987012987013, 0.6321777396157561, None)
min_impurity_decrease=  0.05 (0.4216562658121099, 0.6493506493506493, 0.5112997238981491, None)
min_impurity_decrease=  0.06 (0.4216562658121099, 0.6493506493506493, 0.5112997238981491, None)
min_impurity_decrease=  0.07 (0.4216562658121099, 0.6493506493506493, 0.5112997238981491, None)


In [10]:
#rf = RandomForestClassifier(min_impurity_split=5, random_state =42)
#min_impurity_split
for value in np.arange(0.0,0.7,0.05):
    rf = RandomForestClassifier(min_impurity_split=value,random_state =42)
    rf = rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    print("min_impurity_split= ", value, precision_recall_fscore_support(y_test, predictions, average='weighted'))

min_impurity_split=  0.0 (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
min_impurity_split=  0.05 (0.7535895907988931, 0.7597402597402597, 0.7523799794283987, None)
min_impurity_split=  0.1 (0.7465213358070502, 0.7532467532467533, 0.7448082577799559, None)
min_impurity_split=  0.15000000000000002 (0.7606257378984652, 0.7662337662337663, 0.7598727802809436, None)
min_impurity_split=  0.2 (0.7542891890717976, 0.7597402597402597, 0.7487456279654328, None)
min_impurity_split=  0.25 (0.7680081627847654, 0.7727272727272727, 0.7641125862030235, None)
min_impurity_split=  0.30000000000000004 (0.74851419766674, 0.7532467532467533, 0.7387584892172047, None)
min_impurity_split=  0.35000000000000003 (0.7415693550147332, 0.7467532467532467, 0.7307068796987222, None)
min_impurity_split=  0.4 (0.7219941348973608, 0.7272727272727273, 0.702922077922078, None)
min_impurity_split=  0.45 (0.7319044591771865, 0.6948051948051948, 0.6208589764145319, None)
min_impurity_split=  0.5 (0.42165

In [15]:
rf = RandomForestClassifier(min_samples_split=4, random_state =42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.83      0.86      0.84       100
           1       0.72      0.67      0.69        54

    accuracy                           0.79       154
   macro avg       0.77      0.76      0.77       154
weighted avg       0.79      0.79      0.79       154



In [16]:
print(rf.feature_importances_, X.columns)

[0.08245741 0.29732347 0.0840247  0.06510676 0.06805083 0.1647503
 0.12339451 0.114892  ] Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age'],
      dtype='object')


## 2. How does setting bootstrap=False influence the model performance? Note: the default is bootstrap=True. Explain why your results might be so.
If bootsrap=True then bootstrap(random samples with replacement) samples are used when building trees. If False, the whole dataset is used to build each tree.
Changing from True to False, values of precision, recall, f1 scores went down a little. Timewise False setting was little behind than True setting. But in general in larger dataset, what was discussed in the class, if whole dataset is considered it takes long time and consumes lot of memory.


In [13]:
from time import perf_counter
t1_true_start = perf_counter()
rf = RandomForestClassifier(bootstrap=True, random_state =42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print("bootstrap= True", precision_recall_fscore_support(y_test, predictions, average='weighted'))
t1_true_stop = perf_counter()
print("Elapsed time:", t1_true_stop- t1_true_start)


bootstrap= True (0.7473701821527908, 0.7532467532467533, 0.748121878121878, None)
Elapsed time: 0.292067099999997


In [14]:
t1_False_start = perf_counter()
rf = RandomForestClassifier(bootstrap=False, random_state =42)
rf = rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
print("bootstrap= False", precision_recall_fscore_support(y_test, predictions, average='weighted'))
t1_False_stop = perf_counter()
print("Elapsed time:", t1_False_stop- t1_False_start)

bootstrap= False (0.7422917218835586, 0.7467532467532467, 0.7436948559373376, None)
Elapsed time: 0.26468339999999557
