### Name of the classifier: Random Forests

According to IBM, Random forest "is a commonly-used ML algorithm, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems."
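As a rough illustration of "combines the output of multiple decision trees", the sketch below fits a small forest on synthetic data (an assumption, not the a9a dataset used later) and checks that scikit-learn's forest probabilities are the average of the per-tree probabilities. The names `X_demo`, `y_demo`, and `forest` are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the dataset used in this notebook).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# An odd number of trees avoids tied votes in this two-class example.
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Each fitted tree is exposed through `estimators_`; average their class probabilities.
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
averaged = tree_probas.mean(axis=0)

# The forest's own probabilities should match the per-tree average.
forest_matches_average = np.allclose(averaged, forest.predict_proba(X_demo))

# The predicted class is the one with the highest averaged probability.
manual_labels = forest.classes_[averaged.argmax(axis=1)]
labels_match = np.array_equal(manual_labels, forest.predict(X_demo))
print(forest_matches_average, labels_match)
```

For regression forests the same idea applies, with the trees' numeric predictions averaged instead of their class probabilities.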

#### 1. Import Packages 

In [1]:
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from itertools import product
import warnings
import time 
warnings.filterwarnings('ignore')



#### 2. Load Data

In [2]:

X_test, y_test = load_svmlight_file("a9a.t")
X_train, y_train = load_svmlight_file("a9a.txt")
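One thing worth checking after loading two svmlight files separately is that both matrices have the same number of columns, because the loader infers the width from the largest feature index it sees in each file. Whether that matters for the a9a files here is not established by this notebook. The sketch below uses tiny in-memory toy data (hypothetical) to show the effect and how the `n_features` argument aligns the shapes.

```python
import io
import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Toy data (hypothetical): the last column of the second matrix is all zeros,
# so that feature index never appears in its sparse svmlight text.
X_a = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 4.0]])
X_b = np.array([[5.0, 6.0, 0.0], [7.0, 0.0, 0.0]])
y_toy = np.array([1, -1])

def to_svmlight_bytes(X, y):
    """Serialize a dense matrix to svmlight text held in memory."""
    buffer = io.BytesIO()
    dump_svmlight_file(X, y, buffer, zero_based=False)
    return buffer.getvalue()

train_bytes = to_svmlight_bytes(X_a, y_toy)
test_bytes = to_svmlight_bytes(X_b, y_toy)

# Width inferred independently for each file.
X_a_loaded, _ = load_svmlight_file(io.BytesIO(train_bytes), zero_based=False)
X_b_inferred, _ = load_svmlight_file(io.BytesIO(test_bytes), zero_based=False)

# Width forced to match the first matrix.
X_b_aligned, _ = load_svmlight_file(
    io.BytesIO(test_bytes), n_features=X_a_loaded.shape[1], zero_based=False
)
print(X_a_loaded.shape, X_b_inferred.shape, X_b_aligned.shape)
```

If `X_train.shape[1]` and `X_test.shape[1]` ever differ, passing `n_features=X_train.shape[1]` when loading the test file is one way to reconcile them.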


NOTE: For each learning algorithm, you will need to set various hyperparameters (e.g., for Random Forest: the number of estimators, the minimum impurity decrease, etc.).


#### 3. Fit the model on the training data, using the default hyperparameter values for simplicity



In [34]:
model = RandomForestClassifier()
model.fit(X_train, y_train)
print("Default values of the hyperparameters: \n",
      model.get_params())



Default values of the hyperparameters: 
 {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


#### 4. Make predictions for the test data

In [36]:
y_prediction = model.predict(X_test)
# predict() returns class labels, so rounding leaves integer-valued labels unchanged
predictions = [round(value) for value in y_prediction]

#### 5. Evaluate predictions

In [37]:
accuracy = accuracy_score(y_test, predictions)
print("\nAccuracy: %.2f%%" % (accuracy * 100))



Accuracy: 83.29%
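To make the accuracy figure easier to interpret, the sketch below computes accuracy by hand on made-up labels and compares it with the score obtained by always predicting the most frequent class. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, coded as -1 / +1.
y_true_demo = np.array([-1, -1, -1, -1, -1, -1, 1, 1, 1, -1])
y_pred_demo = np.array([-1, -1, -1, 1, -1, -1, 1, -1, 1, -1])

# Accuracy is the fraction of predictions that equal the true label.
manual_accuracy = np.mean(y_true_demo == y_pred_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Baseline: always predict the most frequent class.
values, counts = np.unique(y_true_demo, return_counts=True)
majority_baseline = counts.max() / counts.sum()

print(manual_accuracy, sklearn_accuracy, majority_baseline)
```

On imbalanced data, a model is only useful to the extent that it beats this majority-class baseline, so it is worth reporting both numbers together.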


NOTES:

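Required: the list of hyperparameters tuned in training with a brief description of each, their default values, and the final hyperparameter settings used to get the best result. Parameters to be tuned for Random Forest:

1. `n_estimators`: the number of trees in the forest (default 100).
2. `bootstrap`: whether each tree is built on a bootstrap sample; if False, every tree uses the whole training set (default True).
3. `max_depth`: the maximum depth of each tree; None lets the trees grow until the leaves are pure or too small to split (default None).
4. `min_impurity_decrease`: a node is split only if the split reduces impurity by at least this amount (default 0.0).
5. `min_samples_leaf`: the minimum number of samples required at a leaf node (default 1).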


#### 6. Tuning hyperparameters

Tune the hyperparameters by setting a list of candidate values for each one. Every combination of these values is evaluated, so the more values are listed, the longer the search takes to compute.


In [38]:

n_estimators = [50, 100, 200]
bootstrap = [True, False]
max_depth = [None, 500, 1000]
min_impurity_decrease = [0.0, .05, 0.1]
min_samples_leaf = [1, 2, 10, 100]



In [39]:
hyperparameters = []
for n_estimate, boot, depth, min_impurity, min_samples in product(n_estimators, bootstrap, max_depth, min_impurity_decrease, min_samples_leaf):
    hyperparameters.append(
        [n_estimate, boot, depth, min_impurity, min_samples])



NOTES:

Number of hyperparameter combinations: the lists above yield 3 × 2 × 3 × 3 × 4 = 216 combinations.

In [41]:

# The number of combinations is the length of the list
count = len(hyperparameters)
print(count)


216
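The count can also be predicted before building the list, since it is the product of the lengths of the candidate lists. The sketch below is a self-contained check; the dictionary name `grid` is an assumption introduced for this example, while the candidate values are the ones listed above.

```python
from itertools import product
from math import prod

# Candidate values from the tuning cell, gathered in one dictionary (name is hypothetical).
grid = {
    "n_estimators": [50, 100, 200],
    "bootstrap": [True, False],
    "max_depth": [None, 500, 1000],
    "min_impurity_decrease": [0.0, 0.05, 0.1],
    "min_samples_leaf": [1, 2, 10, 100],
}

# Product of the list lengths: 3 * 2 * 3 * 3 * 4
expected_count = prod(len(values) for values in grid.values())

# Build every combination as a dict keyed by hyperparameter name.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(expected_count, len(combinations))  # 216 216
```

Storing each combination as a dictionary keeps the hyperparameter names attached to the values, which avoids positional indexing later on.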


In [15]:
time.ctime()

'Fri Apr  8 01:47:05 2022'

Started the code at 09:46 PM and ended at 11:05 PM.

The run takes about 1 hour and 20 minutes on macOS.

#### 7. Time it took:

It took a little over an hour to run all 216 hyperparameter combinations on macOS.
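Rather than reading wall-clock stamps from `time.ctime()` and subtracting by hand, the elapsed time can be measured directly. The sketch below is one possible approach using `time.perf_counter`; the helper names `timed` and `format_elapsed` are hypothetical and not part of the original notebook.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def format_elapsed(seconds):
    """Format a duration in seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:d}h {minutes:02d}m {secs:02d}s"

# Stand-in workload; in the notebook this would wrap the grid-search loop.
result, elapsed = timed(sum, range(1_000_000))
print(result, format_elapsed(elapsed))
```

For reference, `format_elapsed(79 * 60)` returns `1h 19m 00s`, which is the gap between 09:46 PM and 11:05 PM.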

In [42]:
best_accuracy = 0
best_model = None
count = 0
kfold = KFold()  # 5 unshuffled folds by default; reused for every combination

for n_est, bs, max_d, min_i, min_s in hyperparameters:
    parameters = {'n_estimators': n_est, 'bootstrap': bs, 'max_depth': max_d,
                  'min_impurity_decrease': min_i, 'min_samples_leaf': min_s}
    model = RandomForestClassifier(**parameters)

    # Mean cross-validation accuracy on the training set, in percent (takes long)
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=kfold)
    accuracy = cross_val_scores.mean() * 100

    # Progress counter
    count += 1
    print(count)

    # Keep the best-scoring combination seen so far
    if best_accuracy < accuracy:
        best_accuracy = accuracy
        best_model = parameters

time.ctime()



1
2
3
...
214
215
216


'Fri Apr  8 23:05:00 2022'

In [43]:
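# Report best_accuracy (the highest mean CV accuracy). The original run printed
# `accuracy`, the score of the last combination; the output below is from that run.
print("Best accuracy that we can get: ", best_accuracy)
print("\nThe best model parameters: ", best_model)
time.ctime()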

Best accuracy that we can get:  75.91904595647111

The best model parameters:  {'n_estimators': 200, 'bootstrap': True, 'max_depth': 1000, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 2}


'Fri Apr  8 23:05:36 2022'
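As an alternative to the hand-written loop, scikit-learn's `GridSearchCV` performs the same kind of exhaustive search and keeps track of the best score and parameters itself. The sketch below is not the method used in this notebook; it runs on synthetic data with a deliberately small grid (both assumptions) so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption: not the a9a dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

# A deliberately small grid so the example runs in seconds.
small_grid = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [1, 2, 10],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=KFold(n_splits=5),
    scoring="accuracy",
    n_jobs=1,  # set to -1 to evaluate combinations on all CPU cores
)
search.fit(X_demo, y_demo)

# best_score_ is the highest mean cross-validation accuracy over the grid.
print(search.best_params_, round(search.best_score_ * 100, 2))
```

Because `best_score_` and `best_params_` are maintained by the search object, there is no separate variable to mix up when reporting the result, and `n_jobs=-1` can shorten long runs considerably on multi-core machines.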

#### 8. Analysis with new hyperparameter:

After several trials, I noticed that min_samples_leaf should be 2: with min_samples_leaf set to 2 and the other values held fixed, the model reached the highest accuracy.
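A one-at-a-time comparison like the one described can be sketched as follows: vary `min_samples_leaf` alone, hold everything else fixed, and compare mean cross-validation accuracy. The data here is synthetic and the forest is kept small (both assumptions), so the winning value in this sketch says nothing about the a9a result.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption: not the a9a dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=0)

leaf_scores = {}
for leaf in [1, 2, 10, 100]:
    # Only min_samples_leaf changes; random_state is fixed so runs are comparable.
    candidate = RandomForestClassifier(
        n_estimators=20, min_samples_leaf=leaf, random_state=0
    )
    scores = cross_val_score(candidate, X_demo, y_demo, cv=KFold(n_splits=5))
    leaf_scores[leaf] = scores.mean()

# Candidate with the highest mean cross-validation accuracy.
best_leaf = max(leaf_scores, key=leaf_scores.get)
print(leaf_scores, best_leaf)
```

Fixing `random_state` matters for this kind of comparison: without it, differences of a fraction of a percent between settings can come from the forest's own randomness rather than from the hyperparameter.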

In [44]:
## TRIAL 
new_model = RandomForestClassifier(n_estimators=300, bootstrap=True,
                                   max_depth=None, min_impurity_decrease=0.0, min_samples_leaf=2)
new_model.fit(X_train, y_train)
print("New values of the hyperparameters: \n",
      new_model.get_params())


New values of the hyperparameters: 
 {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 300, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [45]:
kfold = KFold()
cross_val_scores = cross_val_score(new_model, X_train, y_train, cv=kfold)
accuracy = cross_val_scores.mean() * 100



In [46]:
print("New Model Stats")
print("\nAccuracy: \n", accuracy)

print("Cross Validation Training Error Rate:\n ",
      1-cross_val_scores.mean())

print("Test Error Rate: \n",
      1-new_model.score(X_test, y_test))

New Model Stats

Accuracy: 
 84.5397949140464
Cross Validation Training Error Rate:
  0.15460205085953604
Test Error Rate: 
 0.15226337448559668


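**Test accuracy with the default hyperparameters (step 5):** 83.29%

**Accuracy printed after the grid search:** 75.92% (cross-validation accuracy of the last combination evaluated, not of the best one)

**Cross-validation accuracy with the new hyperparameters:** 84.54%

| Hyperparameter | Grid Search Best | New Value |
| :- | -: | :-: |
| *n_estimators* | 200 | 300 |
| *bootstrap* | True | True |
| *max_depth* | 1000 | None |
| *min_impurity_decrease* | 0.0 | 0.0 |
| *min_samples_leaf* | 2 | 2 |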



**Cross Validation Training Error Rate:** 0.15460205085953604

**Test Error Rate:** 0.15226337448559668



