# Random Forest Classifier
This notebook reviews the basic concepts you need in order to
successfully create a random forest model. We will use the
```RandomForestClassifier``` from **sklearn**.

We start by importing the classifier and the helper functions we need.

In [1]:
#Import the RandomForestClassifier class
from sklearn.ensemble import RandomForestClassifier

#Import the function that creates the data we will split
from mlb_misc_functions import create_clf_table_1

#Numpy
import numpy as np

Now, we will create our modeling table (```model_data```). This is a dummy table that resembles the format of real data. The table has 7 columns: 1 unique identifier (```row_id```), 5 features, and the target as the last column. In this case the target takes the values 0 and 1.

In [2]:
#Call the function that creates the table. 10,000 rows
model_data = create_clf_table_1(10000)
model_data.head()

Unnamed: 0,row_id,feature_1,feature_2,feature_3,feature_4,feature_5,target
0,0,1,2,0,-8,-4,0
1,1,-6,-3,8,7,-7,0
2,2,-8,-9,-1,9,-3,0
3,3,2,6,2,5,6,1
4,4,8,-6,0,-1,8,1
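
If you do not have ```create_clf_table_1``` available, a table with the same layout can be sketched as follows. This is only a hypothetical stand-in: the name ```make_dummy_clf_table```, the feature range, and the rule that generates the target are assumptions made for illustration, not the logic of the original helper.

```python
import numpy as np
import pandas as pd

def make_dummy_clf_table(n_rows, seed=0):
    """Hypothetical stand-in for create_clf_table_1 (not the original helper)."""
    rng = np.random.default_rng(seed)

    # Five integer features. The range [-9, 9] is a guess based on the sample rows.
    values = rng.integers(-9, 10, size=(n_rows, 5))
    feature_names = [f"feature_{i}" for i in range(1, 6)]
    table = pd.DataFrame(values, columns=feature_names)

    # Unique identifier as the first column
    table.insert(0, "row_id", np.arange(n_rows))

    # Assumed rule, chosen arbitrarily: target is 1 when a noisy sum of two features is positive
    noise = rng.normal(0, 3, size=n_rows)
    table["target"] = ((table["feature_1"] + table["feature_5"] + noise) > 0).astype(int)
    return table

dummy_data = make_dummy_clf_table(1000)
print(dummy_data.head())
```

The resulting table has the same 7 columns as ```model_data```, so the rest of the notebook should run on it unchanged, although the metrics will differ.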


Now, we will split the table into training and testing sets. For more info about how to split your data into training and testing sets, click [here](https://github.com/sebaszb/DataScienceSimple/blob/master/machine_learning_basics/training_testing_split.ipynb).

In [3]:
#Import the train_test_split function
from sklearn.model_selection import train_test_split

#split the data into training and testing
model_train, model_test = train_test_split(model_data, test_size=0.20, random_state=11)

Now we need to prepare the data for the **Random Forest Classifier**. This is done in the following block of code.

In [111]:
#Lists with the name of the features (columns), and the name of the target
features = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
target = ["target"]

#Create arrays with the "x values" (features) and the "y values" (targets) for the train data
train_x = model_train[features].values
train_y = np.ravel(model_train[target].values)

#Create arrays with the "x values" (features) and the "y values" (targets) for the test data
test_x = model_test[features].values
test_y = np.ravel(model_test[target].values)
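
The call to ```np.ravel``` in the cell above is there because ```target``` is a list: selecting columns with a list returns a two-dimensional array with a single column, while the classifier expects a one-dimensional array of labels. A minimal sketch of that reshaping, using a small made-up array instead of the real table:

```python
import numpy as np

# Made-up labels with shape (4, 1), the same layout that model_train[target].values produces
column = np.array([[0], [0], [1], [1]])

# np.ravel flattens the single column into a one-dimensional array of shape (4,)
flat = np.ravel(column)

print(column.shape, flat.shape)
```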

In [108]:
from sklearn.ensemble import RandomForestClassifier

In [109]:
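#Create the classifier: 10 trees, entropy as the split criterion, maximum depth of 5
rf_clf = RandomForestClassifier(n_estimators=10, criterion="entropy", max_depth=5)
#Train the classifier with the training data
rf_clf.fit(train_x, train_y)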

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=5, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [105]:
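#Predicted class (0 or 1) for each row of the test data
predictions = rf_clf.predict(test_x)
#Predicted probability of each class for each row of the test data
probabilities = rf_clf.predict_proba(test_x)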
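
To see how the two outputs relate, the sketch below trains a small forest on synthetic data and rebuilds the predicted classes from the probabilities. The data comes from ```make_classification``` and the fixed ```random_state``` values are assumptions added only to make the example self-contained and reproducible; they are not part of the notebook's own model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 5 features and a binary target (illustration only)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

clf = RandomForestClassifier(n_estimators=10, criterion="entropy",
                             max_depth=5, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)   # shape (500, 2): one column per class, rows sum to 1
preds = clf.predict(X)         # shape (500,): one class label per row

# The predicted class is the one with the highest averaged probability
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba[:3])
print(preds[:3], from_proba[:3])
```

The columns of ```proba``` follow the order of ```clf.classes_```, so with targets 0 and 1 the second column is the probability of class 1.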

In [106]:
#Import the metrics used to evaluate the predictions on the test data
from sklearn.metrics import precision_score, recall_score, confusion_matrix
print("Precision: ", precision_score(test_y, predictions))
print("Recall:    ", recall_score(test_y, predictions))
print("############")
print("Confusion matrix")
print(confusion_matrix(test_y, predictions))

Precision:  0.8293413173652695
Recall:     0.7446236559139785
############
Confusion matrix
[[1142  114]
 [ 190  554]]
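
The precision and recall above can be recovered by hand from the confusion matrix. In sklearn's layout the rows are the true classes and the columns are the predicted classes, so the matrix reads ```[[TN, FP], [FN, TP]]```. A short sketch of the arithmetic, with the counts copied from the output above (the accuracy line is an extra added for illustration):

```python
# Counts copied from the confusion matrix printed above
tn, fp = 1142, 114   # true class 0: predicted 0, predicted 1
fn, tp = 190, 554    # true class 1: predicted 0, predicted 1

# Precision: of the rows predicted as 1, the share that really are 1
precision = tp / (tp + fp)

# Recall: of the rows that really are 1, the share predicted as 1
recall = tp / (tp + fn)

# Accuracy: share of all test rows predicted correctly
accuracy = (tp + tn) / (tn + fp + fn + tp)

print("Precision: ", precision)
print("Recall:    ", recall)
print("Accuracy:  ", accuracy)
```

The four counts add up to 2,000 rows, which matches the 20% test split of the 10,000-row table.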
