# Random Forest Classifier
This notebook reviews the basic concepts you need in order to
successfully create a random forest model. We will use the
```RandomForestClassifier``` from **scikit-learn**.

We start by importing the functions we need.

In [1]:
#Import the classifier
from sklearn.ensemble import RandomForestClassifier

#Import functions to create and split the data
from mlb_misc_functions import create_clf_table_1, prep_model_data

## Data Creation and Prep

In the following block of code we create a modeling table (**model_data**) using the ```create_clf_table_1``` function. This is a dummy table that resembles the format of real data. The table has 7 columns: the first is a unique row identifier, the next 5 are features, and the last is the target. In this case the target takes the values 0 and 1. See below:

In [2]:
#Call the function that creates the table. 10,000 rows
model_data = create_clf_table_1(10000)
model_data.head()

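| | row_id | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | target |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | -4 | -7 | 7 | -3 | 0 |
| 1 | 1 | 9 | 0 | 2 | -4 | 8 | 1 |
| 2 | 2 | 6 | -7 | 0 | 7 | 6 | 1 |
| 3 | 3 | 1 | -9 | -8 | -6 | 8 | 0 |
| 4 | 4 | -5 | -4 | -10 | -7 | 5 | 0 |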


The next step is to split the data into a training set and a testing set, and to put it in a format that our classifier can read. In the block of code below we do that using a wrapper function called ```prep_model_data```. This function takes a pandas table and two lists: one with the name of the column used as the target, and a second one with the names of the features. Note that ```prep_model_data``` is a custom wrapper function that we import from the file **mlb_misc_functions**. We will talk about how to split a dataset into training and testing sets in another notebook (see [here](https://github.com/sebaszb/DataScienceSimple/blob/master/machine_learning_basics/training_testing_split.ipynb)).

In [3]:
#target and features names
target = ["target"]
features = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]

#Prep the data for training
train_x, train_y, test_x, test_y = prep_model_data(model_data, target, features)
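The source of ```prep_model_data``` is not shown in this notebook, so the following is only a minimal sketch of what such a wrapper might look like. The function name ```prep_model_data_sketch```, the ```test_size``` and ```random_state``` parameters, and the use of scikit-learn's ```train_test_split``` are all assumptions made for illustration; the real helper may behave differently.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def prep_model_data_sketch(table, target, features, test_size=0.2, random_state=0):
    """Hypothetical stand-in for prep_model_data; the real helper may differ."""
    # Feature columns become a 2-D array, one row per observation
    x = table[features].to_numpy()
    # target is a one-element list, so we take its single column name
    y = table[target[0]].to_numpy()
    train_x, test_x, train_y, test_y = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )
    # Return in the same order used in this notebook
    return train_x, train_y, test_x, test_y


# Tiny made-up table, only to exercise the sketch
demo = pd.DataFrame({
    "row_id": range(10),
    "feature_1": range(10),
    "feature_2": range(10, 20),
    "target": [0, 1] * 5,
})
train_x, train_y, test_x, test_y = prep_model_data_sketch(
    demo, ["target"], ["feature_1", "feature_2"]
)
print(train_x.shape, test_x.shape)
```

With 10 rows and ```test_size=0.2```, 8 rows go to the training set and 2 to the testing set.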

Before we continue, let's take a look at the first 5 elements of **train_x** and **train_y**. **train_x** is a two-dimensional array where each row contains the feature values of one observation. On the other hand, **train_y** is a one-dimensional array containing the targets (0 or 1). See below:

In [4]:
#Print 5 first elements
print(train_x[:5])
print(train_y[:5])

[[ -4  -1   7   0 -10]
 [ -1  -2  -6  -5  -7]
 [ -7   4   8   9  -6]
 [ -9   8   2   3   8]
 [ -8  -5  -2  -5   2]]
[0 0 1 1 0]


## Training
Now we create the **Random Forest Classifier** and train it on the training set. This is done in the following block of code.

In [5]:
#Create the classifier
rf_clf = RandomForestClassifier(n_estimators=10, max_depth=5)

#Train the classifier
rf_clf.fit(train_x, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

It is important to notice that we are initializing our random forest classifier with ```n_estimators=10``` and ```max_depth=5```. **n_estimators** defines the number of trees in the forest (in the classifier), while **max_depth** defines the maximum depth of each tree. As you change the hyperparameter values, the performance of your classifier will change as well.
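To see these two hyperparameters at work, the sketch below fits a forest and then inspects the individual trees through the ```estimators_``` attribute. The data comes from scikit-learn's ```make_classification```, which is an assumption used here only so the example runs on its own; it is not the table created earlier in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption, used only to make the example self-contained)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)
clf.fit(X, y)

# estimators_ holds the fitted decision trees of the forest
n_trees = len(clf.estimators_)
tree_depths = [tree.get_depth() for tree in clf.estimators_]

print(n_trees)           # number of trees, set by n_estimators
print(max(tree_depths))  # never larger than max_depth
```

Changing ```n_estimators``` changes the length of ```estimators_```, and no tree grows deeper than ```max_depth```.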

There are several other hyperparameters that can be specified, and they can be tuned in order to obtain optimal performance. We will talk about how to tune them in a different notebook. For now, you can read about all the other hyperparameters [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

## Predictions and Probabilities

Once you have your Random Forest classifier trained, you can give it new data and either obtain predictions on the target (0s or 1s) or probabilities (numbers between 0 and 1).

The probabilities are obtained using the method ```predict_proba```, while the predictions are obtained using the method ```predict```. In the following block of code we obtain the predictions and probabilities for the testing set, and we print the first 5 elements of each array.

In [6]:
predictions = rf_clf.predict(test_x)
probabilities = rf_clf.predict_proba(test_x)

In [7]:
predictions[:5]

array([0, 0, 1, 0, 0])

In [8]:
probabilities[:5]

array([[0.6269234 , 0.3730766 ],
       [0.94629614, 0.05370386],
       [0.42571039, 0.57428961],
       [0.61957773, 0.38042227],
       [0.95849242, 0.04150758]])

While ```predictions``` is a simple array where each element takes the value 0 or 1, ```probabilities``` is an array where each element is itself an array containing 2 elements. The first element is the probability of belonging to class 0, and the second element is the probability of belonging to class 1. The two numbers always sum to 1.
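The column order of ```predict_proba``` follows the classifier's ```classes_``` attribute, which can be checked directly. The sketch below is a small self-contained check on synthetic data from ```make_classification``` (an assumption, not the notebook's table); it confirms the shape of the output and that each row sums to 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data (an assumption, used only for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:5])

# Column i of proba is the probability of class clf.classes_[i]
print(clf.classes_)
print(proba.shape)

# Each row is a probability distribution over the two classes
row_sums = proba.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```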

It is important to mention that the probabilities carry more information than the predictions. In fact, the predictions can be constructed from the probabilities. To do so, you only need to choose a threshold: if the probability for class 1 is larger than the chosen threshold, the prediction is 1; otherwise it is 0.
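The thresholding rule just described can be sketched in a few lines. The array below copies the five probability rows printed earlier in this notebook; the helper name ```predict_with_threshold``` and the threshold values 0.5 and 0.3 are assumptions chosen for illustration.

```python
import numpy as np

# First five rows of the probabilities printed above
probabilities = np.array([
    [0.6269234, 0.3730766],
    [0.94629614, 0.05370386],
    [0.42571039, 0.57428961],
    [0.61957773, 0.38042227],
    [0.95849242, 0.04150758],
])


def predict_with_threshold(probabilities, threshold=0.5):
    """Return 1 where the class-1 probability exceeds the threshold, else 0."""
    return (probabilities[:, 1] > threshold).astype(int)


default_preds = predict_with_threshold(probabilities)
lenient_preds = predict_with_threshold(probabilities, threshold=0.3)

print(default_preds)  # [0 0 1 0 0]
print(lenient_preds)  # [1 0 1 1 0]
```

With a threshold of 0.5 the result matches the predictions shown earlier for these five rows, while a lower threshold of 0.3 labels more rows as class 1.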

## Final words
We have covered the basic concepts needed to create and train a random forest classifier, and how to obtain predictions and probabilities. Now it is your turn to start coding! Try using metrics such as precision and recall to evaluate the performance of your model as you change the hyperparameters.
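As a possible starting point for that evaluation, the sketch below computes precision and recall with scikit-learn's ```precision_score``` and ```recall_score```. The two label lists are made up for illustration; in practice you would pass ```test_y``` and the output of ```rf_clf.predict(test_x)```.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels, only to illustrate the two metrics
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

# Precision: of the rows predicted as 1, the fraction that are truly 1
precision = precision_score(y_true, y_pred)
# Recall: of the rows that are truly 1, the fraction predicted as 1
recall = recall_score(y_true, y_pred)

print(precision)  # 4 true positives / 5 predicted positives
print(recall)     # 4 true positives / 5 actual positives
```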
