# Parameter sweeping

This notebook covers parameter sweeping for a random forest classifier.

A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. By default (bootstrap=True), each sub-sample has the same size as the original input sample but is drawn with replacement.
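To make the sampling step concrete, here is a minimal sketch of one bootstrap draw using only the standard library. The helper name `bootstrap_indices` and the dataset size of 1600 rows are assumptions for illustration, not part of scikit-learn or of this notebook's data.

```python
import random

def bootstrap_indices(n_samples, seed=0):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    rng = random.Random(seed)
    return [rng.randrange(n_samples) for _ in range(n_samples)]

n = 1600  # hypothetical dataset size
idx = bootstrap_indices(n)

# The sample is as large as the original data, but some rows repeat
# and others are left out entirely.
unique_fraction = len(set(idx)) / n
print(len(idx), round(unique_fraction, 3))
```

For large samples the fraction of distinct rows in a bootstrap draw approaches 1 - 1/e, roughly 63%, so each tree leaves out about a third of the rows.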

Parameters of a random forest either increase the predictive power of the model or make the model easier to train.

## 1. Parameters that improve the model's predictions

There are three main parameters that can be tuned to improve the predictive power of the model:
### 1.1 max_features
The number of features to consider when looking for the best split.  
There are several ways to set it. Common choices include:<br>
"sqrt" (called "auto" in older scikit-learn releases): max_features=sqrt(n_features)<br>
None: max_features=n_features<br>
float: max_features is a fraction, and int(max_features * n_features) features are considered at each split<br>
Increasing max_features often improves performance because each node has more candidate features to choose from. It also makes the trees more similar to one another, which can reduce the benefit of averaging, and it slows training.
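The options above can be turned into a feature count with a few lines of Python. This is a rough sketch of the rules as described here, not scikit-learn's own code; the helper `n_features_considered` and the choice of 20 features are assumptions for illustration.

```python
import math

def n_features_considered(max_features, n_features):
    """Number of candidate features per split (hypothetical helper)."""
    if max_features is None:
        return n_features  # consider every feature
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if isinstance(max_features, float):
        return max(1, int(max_features * n_features))  # fraction of the features
    return int(max_features)  # a plain integer is used as given

for option in ("sqrt", None, 0.5, 3):
    print(option, n_features_considered(option, 20))
```

With 20 features, "sqrt" yields 4 candidates per split, None yields all 20, and 0.5 yields 10.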
### 1.2 n_estimators
The number of trees in the forest. The default was 10 in older scikit-learn releases and is 100 from version 0.22 onward.<br>
More trees generally give better and more stable predictions, with diminishing returns, but make training and prediction slower. Choose as large a value as your compute budget allows.
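A simplified model shows why more trees tend to help. Assume, purely for illustration, that every tree is right independently with the same probability and that the forest takes a majority vote. Real trees are correlated and scikit-learn averages class probabilities, so the gain in practice is smaller. The function `majority_vote_accuracy` and the value 0.6 are assumptions.

```python
from math import comb

def majority_vote_accuracy(n_trees, p_correct):
    """Probability that more than half of n_trees independent voters are right.

    n_trees should be odd so that ties cannot occur.
    """
    needed = n_trees // 2 + 1
    return sum(
        comb(n_trees, k) * p_correct**k * (1 - p_correct) ** (n_trees - k)
        for k in range(needed, n_trees + 1)
    )

for n_trees in (1, 11, 51, 101):
    print(n_trees, round(majority_vote_accuracy(n_trees, 0.6), 3))
```

Under these assumptions the accuracy of the vote rises with the number of trees and flattens out, which matches the diminishing returns usually seen when sweeping n_estimators.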
### 1.3 min_samples_leaf
The minimum number of samples required at a leaf node.
A smaller leaf size makes the model more prone to capturing noise in the training data.
Try several leaf sizes to find the best one for your use case.
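One way to picture the constraint: a split is only allowed if both children keep at least the minimum number of samples. The sketch below lists the allowed left-child sizes for a node; `valid_split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
def valid_split_sizes(n_samples, min_samples_leaf):
    """Left-child sizes for which both children keep at least min_samples_leaf samples."""
    return [
        left
        for left in range(1, n_samples)
        if left >= min_samples_leaf and n_samples - left >= min_samples_leaf
    ]

print(valid_split_sizes(10, 1))  # every split of a 10-sample node is allowed
print(valid_split_sizes(10, 3))  # only the more balanced splits remain
print(valid_split_sizes(10, 6))  # no split is possible, so the node stays a leaf
```

Larger values stop the tree from carving out tiny leaves around individual noisy points, at the cost of coarser decision regions.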

## 2. Parameters that make the model easier to train

A few parameters affect training speed or make training easier to manage. The key ones are:

### 2.1 n_jobs
The number of jobs to run in parallel for both fit and predict.
A value of -1 uses all available processors, whereas a value of 1 uses only one.
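Negative values count back from the number of processors, following the convention documented by joblib (the library scikit-learn uses for parallelism): n_cpus + 1 + n_jobs workers are used. The helper `resolve_n_jobs` below is a hypothetical illustration of that rule, not a library function.

```python
import os

def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate an n_jobs setting into a worker count (illustrative only)."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs

print(resolve_n_jobs(-1, n_cpus=8))  # all 8 processors
print(resolve_n_jobs(-2, n_cpus=8))  # all but one
print(resolve_n_jobs(1, n_cpus=8))   # a single processor
```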

### 2.2 random_state

This parameter makes a result reproducible. A fixed value of random_state always produces the same results given the same parameters and training data.
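A quick way to check this is to fit the same forest twice with the same seed and compare the outputs. The sketch below uses a small synthetic dataset from `make_classification`. The data, the 20 trees and the helper `fit_predict` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the dataset used later in this notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

def fit_predict(seed):
    """Fit a small forest with the given seed and return its class probabilities."""
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    return clf.fit(X_demo, y_demo).predict_proba(X_demo)

same_seed_equal = np.array_equal(fit_predict(50), fit_predict(50))
print("same seed, identical probabilities:", same_seed_equal)
print("different seed, identical probabilities:", np.array_equal(fit_predict(50), fit_predict(51)))
```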

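### 2.3 oob_score

Whether to use out-of-bag samples to estimate the generalization accuracy. Each tree is trained on a bootstrap sample, so the samples it did not see can serve as a built-in validation set.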
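As a minimal sketch, the out-of-bag estimate can be read from the fitted model's `oob_score_` attribute once `oob_score=True` is set. The synthetic data and the parameter values below are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the dataset used later in this notebook
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True requires bootstrap sampling, which is the default
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_demo, y_demo)

print("OOB accuracy estimate:", clf.oob_score_)
```

This gives a validation-style estimate from a single fit, which can be a cheap first check before running a full cross-validation.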

## 3. Parameter sweeping for a dataset

In [34]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import numpy as np

In [35]:
pd.set_option('display.max_columns', None)  # display all the columns
raw_data = pd.read_csv('a_20s_1600her_0.4__maf_0.2_EDM-2_01.txt', sep="\t")  # read in the dataset

In [36]:
# Use the last column as the label and all other columns as features
y = raw_data.iloc[:, -1].values
X = raw_data.iloc[:, :-1].values

In [37]:
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
# This function runs 10-fold cross-validation. Each fold's predictions are stored in the matching slots of y_pred.
# At the end, it returns y_pred, which holds an out-of-fold prediction for every sample.
def run_cv(X, y, clf_class, **kwargs):
    # Construct a KFold object: 10 folds, shuffling the data before splitting
    kf = KFold(n_splits=10, shuffle=True)
    y_pred = y.copy()
    clf = clf_class(**kwargs)
    # Iterate through folds
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]

        clf.fit(X_train, y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred
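The manual fold loop in run_cv has a ready-made counterpart in scikit-learn: `cross_val_predict` returns one out-of-fold prediction per sample. The sketch below runs it on synthetic stand-in data. The data, the fold seed and the 20 trees are assumptions, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data, not the dataset loaded above
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Fixing random_state on the splitter keeps the folds identical between runs
cv = KFold(n_splits=10, shuffle=True, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=50)

y_pred_demo = cross_val_predict(clf, X_demo, y_demo, cv=cv)
print("out-of-fold accuracy:", (y_pred_demo == y_demo).mean())
```

Passing a seeded splitter also means every parameter combination in a sweep is scored on the same folds, which makes the comparison between settings a little fairer.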

In [38]:
# This function calculates accuracy
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)  # NumPy interprets True and False as 1. and 0.

In [56]:
n_estimators_options = list(range(1, 100, 10))
sample_leaf_options = list(range(1, 20, 5))
results = []

In [57]:
for leaf_size in sample_leaf_options:
    for n_estimators_size in n_estimators_options:
        RF_CV_result = run_cv(X, y, RandomForestClassifier, min_samples_leaf=leaf_size, n_estimators=n_estimators_size, random_state=50, n_jobs=-1)
        # Record the current min_samples_leaf, n_estimators and accuracy (kept as a float so results compare numerically)
        results.append((leaf_size, n_estimators_size, accuracy(y, RF_CV_result)))

In [52]:
# Find result with highest accuracy
def find_best(lst):
    return max(lst, key=lambda x: x[2])
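Because find_best compares the third element of each tuple, that element should be a number rather than a string: strings are ordered character by character, which can rank a tiny value written in scientific notation above a larger one. The sketch below repeats find_best so that it runs on its own. Apart from 0.77625, the accuracy values are made up for illustration.

```python
def find_best(lst):
    return max(lst, key=lambda x: x[2])

# Hypothetical results with accuracies stored as strings
as_strings = [(1, 1, "0.5"), (6, 71, "0.77625"), (11, 91, "9e-05")]
# The same results with accuracies converted to floats
as_floats = [(leaf, n, float(acc)) for leaf, n, acc in as_strings]

print(find_best(as_strings))  # picks the "9e-05" row because '9' sorts after '0'
print(find_best(as_floats))   # picks the row with the numerically largest accuracy
```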

In [58]:
best_leaf, best_n_estimators, best_accuracy = find_best(results)
print('After parameter sweeping, we find:')
print('when min_samples_leaf =', best_leaf, ', n_estimators =', best_n_estimators, ', accuracy =', best_accuracy)

After parameter sweeping, we find:
when min_samples_leaf = 6 , n_estimators = 71 , accuracy = 0.77625
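The nested loops above can also be expressed with scikit-learn's `GridSearchCV`, which handles the fold splitting, scoring and bookkeeping. The sketch below uses synthetic stand-in data and a deliberately small grid so that it runs quickly. Both are assumptions, and its scores are not comparable to the result printed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the dataset loaded above
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# A small illustrative grid over the same two parameters
param_grid = {"min_samples_leaf": [1, 6, 11], "n_estimators": [1, 11, 21]}

search = GridSearchCV(
    RandomForestClassifier(random_state=50),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

Note that `best_score_` is the mean of the per-fold accuracies, whereas the loop above pools all out-of-fold predictions before computing a single accuracy. The two numbers are usually close but not identical.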


### Comments
Parameter sweeping can improve model performance. Before the sweep, the accuracy of the random forest model in task 2 was 0.701875. After tuning two parameters, min_samples_leaf and n_estimators, the accuracy increased to 0.77625. For a given dataset, parameter sweeping is a practical way to improve predictive power.

Reference: https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/