# Random Forest Classifier

A random forest is a meta-estimator that fits a number of decision tree classifiers on bootstrap sub-samples of the dataset and averages their predictions to improve predictive accuracy and control overfitting. By default each sub-sample is the same size as the original training set, but its rows are drawn with replacement, so some rows repeat and others are left out. In recent scikit-learn versions the sub-sample size can be changed with the max_samples parameter.
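
To make "drawn with replacement" concrete, here is a minimal sketch of a single bootstrap sample using only the standard library. The names (n_samples, bootstrap_indices) and the seed are illustrative assumptions, not scikit-learn internals; the forest does something equivalent for each tree.

```python
import random

random.seed(0)  # arbitrary seed, only so the sketch is repeatable

n_samples = 1000
# Draw n_samples row indices *with replacement* from the original rows.
bootstrap_indices = [random.randrange(n_samples) for _ in range(n_samples)]

# Same size as the original data, but some rows repeat and others are left out.
unique_fraction = len(set(bootstrap_indices)) / n_samples
print(len(bootstrap_indices), round(unique_fraction, 3))
```

On average roughly 63% of the original rows (1 - 1/e) appear in any one bootstrap sample, which is why the individual trees end up different from one another.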

### Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

### Read the dataset

In [3]:
# Read in our dataset
df = pd.read_table('SMSSpamCollection',
                   sep='\t', 
                   header=None, 
                   names=['label', 'sms_message'])

# Encode the response as integers: ham -> 0, spam -> 1
df['label'] = df.label.map({'ham':0, 'spam':1})

# Split our dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(df['sms_message'], 
                                                    df['label'], 
                                                    random_state=1)

# Instantiate the CountVectorizer
count_vector = CountVectorizer()

# Fit the training data and then return the matrix
training_data = count_vector.fit_transform(X_train)

# Transform testing data and return the matrix. Note we are not fitting the testing data to the CountVectorizer()
testing_data = count_vector.transform(X_test)
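
The fit/transform split above matters because the vocabulary should come from the training data only. The sketch below shows the effect on a few made-up messages (the strings and variable names are hypothetical, not taken from the SMS dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy messages, not taken from the SMS dataset.
train_msgs = ["win a free prize now", "are we still on for lunch"]
test_msgs = ["free lunch prize tomorrow"]

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_msgs)  # learns the vocabulary
test_matrix = vectorizer.transform(test_msgs)        # reuses that vocabulary

# Both matrices share the same columns; "tomorrow" was never seen during
# fitting, so it is dropped from the test representation.
print(train_matrix.shape, test_matrix.shape)
print("tomorrow" in vectorizer.vocabulary_)
```

Calling fit_transform on the test messages instead would build a different vocabulary, and the resulting columns would no longer line up with what the model was trained on.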

### Import Model

In [4]:
from sklearn.ensemble import RandomForestClassifier


### Instantiate a Random Forest Classifier
Instantiate a random forest classifier with 200 trees (n_estimators), leaving every other parameter at its default value.

In [8]:
randomforestModel = RandomForestClassifier(n_estimators = 200)
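
As a quick check on what "everything else as default" means, the estimator's settings can be inspected with get_params. This is a small sketch; the variable names are illustrative, and only a few of the parameters are printed.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
params = model.get_params()

print(params["n_estimators"])  # number of trees requested
print(params["bootstrap"])     # whether each tree is fit on a bootstrap sample
print(params["random_state"])  # None unless set explicitly
```

Because random_state is left unset here, refitting the model can give slightly different scores from run to run; passing a fixed integer makes the fit repeatable.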

### Fit the training data to the model

In [9]:
randomforestModel = randomforestModel.fit(training_data, y_train)

### Predict on the testing data

In [10]:
preds = randomforestModel.predict(testing_data)
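
The prediction step is where the averaging happens. The sketch below, on synthetic data rather than the SMS dataset (sizes and seeds are arbitrary assumptions), averages the per-tree class probabilities by hand and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the sizes and seeds are arbitrary assumptions.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each fitted tree.
tree_average = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# The forest's probabilities should match that average, and predict()
# should return the class with the highest averaged probability.
print(np.allclose(tree_average, forest.predict_proba(X)))
manual_preds = forest.classes_[np.argmax(tree_average, axis=1)]
print(np.array_equal(manual_preds, forest.predict(X)))
```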

### Print the scores the model achieved

In [11]:
print('Accuracy score: ', format(accuracy_score(y_test, preds)))
print('Precision score: ', format(precision_score(y_test, preds)))
print('Recall score: ', format(recall_score(y_test, preds)))
print('F1 score: ', format(f1_score(y_test, preds)))

Accuracy score:  0.9798994974874372
Precision score:  1.0
Recall score:  0.8486486486486486
F1 score:  0.9181286549707602
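
The four scores are tied together through the confusion-matrix counts. The counts in this sketch are inferred from the scores above, not printed by the notebook: a precision of 1.0 implies no false positives, and the recall and accuracy are consistent with 157 of 185 spam messages caught in a test set of 1,393 messages. Treat them as an assumption.

```python
# Confusion-matrix counts inferred from the reported scores (an assumption;
# the notebook itself never prints them).
tp, fp, fn, tn = 157, 0, 28, 1208

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

Read this way, the model never flagged a legitimate message as spam but missed roughly 15% of the spam, which is what pulls the F1 score below the accuracy.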
