# Building Machine Learning Classifiers: Building a basic Random Forest model

### Ensemble Method  
  
Technique that creates multiple models and then combines them to produce better results than any single model could on its own.  
  
![ensemble_method.png](ensemble_method.png)  
  
The idea is to combine many weak models into a single strong model. This leverages the aggregate opinion of many over the isolated opinion of one. Random Forest is one example of an ensemble learning method.
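
To make the "aggregate opinion of many" concrete, here is a minimal simulation that is not part of the original notebook. Each hypothetical weak model is right 60% of the time, independently of the others, and the ensemble answers with the majority vote. The 60% accuracy, the independence assumption, and all function names are assumptions made for illustration; errors of real models are correlated, so the gain is typically smaller.

```python
import random

def weak_prediction(truth, accuracy, rng):
    """Return the true label with probability `accuracy`, else the wrong one."""
    return truth if rng.random() < accuracy else 1 - truth

def majority_vote(votes):
    """Return the label (0 or 1) chosen by more than half of the votes."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def simulate(n_models, accuracy=0.6, n_trials=5000, seed=0):
    """Estimate the accuracy of a majority vote over `n_models` weak models."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        truth = rng.randint(0, 1)
        votes = [weak_prediction(truth, accuracy, rng) for _ in range(n_models)]
        correct += majority_vote(votes) == truth
    return correct / n_trials

single = simulate(n_models=1)      # one weak model on its own
ensemble = simulate(n_models=101)  # odd count, so the vote never ties
print(f"single model: {single:.3f}, ensemble of 101: {ensemble:.3f}")
```

Under these assumptions the ensemble is right far more often than any single member, which is the effect a random forest tries to exploit with decision trees.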

## Random Forest  
  
Ensemble learning method that constructs a collection of decision trees and then aggregates the predictions of each tree to determine the final prediction.  
(For classification, this amounts to a simple vote among the trees.)  

- Can be used for classification or regression
- Fairly robust to outliers; whether missing values are handled depends on the implementation
- Accepts various types of inputs (continuous, ordinal, etc.)
- Less likely to overfit than a single decision tree
- Outputs feature importance (score for each feature)      

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

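# Note: read_csv treats the first line of the file as a header by default;
# if the file has no header row, pass header=None so that line is kept as data.
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']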

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

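def clean_text(text):
    # Drop punctuation character by character and lowercase what remains
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning)
    tokens = re.split(r'\W+', text)
    # Remove stopwords, then stem each remaining token
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text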

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

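X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
# we're creating a data frame X_features that does not include the label
# newer scikit-learn releases reject mixed str/int column names, so make them all strings
X_features.columns = X_features.columns.astype(str)
X_features.head()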

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,8094,8095,8096,8097,8098,8099,8100,8101,8102,8103
0,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,135,4.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
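
As a sanity check on the two hand-built features, the sketch below reapplies the same formulas to a made-up message. The sample string and the `safe_count_punct` name are illustrative assumptions; the guard for messages with no non-space characters is an addition, not part of the original `count_punct`.

```python
import string

def safe_count_punct(text):
    """Percentage of non-space characters that are punctuation.

    Same formula as count_punct, but returns 0.0 instead of raising
    ZeroDivisionError when the text is empty or contains only spaces.
    """
    non_space = len(text) - text.count(" ")
    if non_space == 0:
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / non_space, 3) * 100

sample = "Hello, world!!"                    # made-up message
body_len = len(sample) - sample.count(" ")   # same formula as the body_len column
print(body_len)                              # 13 non-space characters
print(safe_count_punct(sample))              # 3 of the 13 characters are punctuation
print(safe_count_punct("   "))               # guarded case: no non-space characters
```

Without the guard, a message made up only of spaces would divide by zero; whether such rows exist in the SMS data is not shown here.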


### Explore RandomForestClassifier Attributes & Hyperparameters

In [2]:
from sklearn.ensemble import RandomForestClassifier

In [3]:
print(dir(RandomForestClassifier))

['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_feature_names', '_check_n_features', '_compute_oob_predictions', '_estimator_type', '_get_oob_predictions', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_oob_score_and_attributes', '_validate_X_predict', '_validate_data', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'n_features_', 'predict', 'predict_log_proba', 'predict_proba', 'score',

In [9]:
import inspect
print(inspect.signature(RandomForestClassifier))

(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)


- `feature_importances_` outputs the importance of each feature to the model (one score per feature)
- `fit` allows us to fit our actual model and store that fitted model as an object
- we can use the `predict` method of that fitted model object to make predictions on our test set
  
The `RandomForestClassifier()` object has a number of default hyperparameters, for example:  
- `max_depth=None` - each decision tree is grown until its leaves are pure or contain fewer than `min_samples_split` samples
- `n_estimators=100` - how many decision trees will be built within our random forest (100 decision trees of unlimited depth, per the signature printed above)
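
The attributes and hyperparameters listed above can be tried out on a small example. The sketch below uses a synthetic data set from `make_classification` as a stand-in for `X_features` and `data['label']`; the data, the split sizes, and the values `n_estimators=50` and `max_depth=20` are arbitrary assumptions for illustration, not settings from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X_features / data['label'] (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Override two of the defaults discussed above
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1,
                            random_state=42)
rf_model = rf.fit(X_train, y_train)   # fit returns the fitted estimator
y_pred = rf_model.predict(X_test)     # one predicted label per test row

# One importance score per input column; the scores sum to 1
ranked = sorted(zip(rf_model.feature_importances_, range(X.shape[1])),
                reverse=True)
for importance, column in ranked[:3]:
    print(f"feature {column}: {importance:.3f}")
```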

### Explore RandomForestClassifier through Cross-Validation

In [5]:
from sklearn.model_selection import KFold, cross_val_score

- `KFold` facilitates splitting the full data set into subsets (it assigns each observation in our original dataset to a certain subset)
- `cross_val_score` trains and evaluates the model on each split and returns the score for each one
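
To see which observations `KFold` assigns to which subset, the following sketch splits ten toy observations into five folds. The toy array `X_toy` is an assumption for illustration; only `KFold(n_splits=5)` mirrors the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)   # ten toy observations (assumption)
k_fold = KFold(n_splits=5)             # no shuffling by default

folds = []
for train_idx, test_idx in k_fold.split(X_toy):
    # Each observation lands in the test subset of exactly one fold
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Because shuffling is off by default, the folds are consecutive blocks of rows. If a data set happens to be sorted by label, passing `shuffle=True` together with a `random_state` may be worth considering.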

In [6]:
rf = RandomForestClassifier(n_jobs=-1)
# `n_jobs=-1` allows this to run faster by building the individual decision trees in parallel
k_fold = KFold(n_splits=5)
# we're doing fivefold cross-validation, splitting our data set into 5 subsets
cross_val_score(rf, X_features, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)
# `rf` - model we're using, 
# `X_features` - input features,
# `data['label']` - label taken from original dataset
# `k_fold` - how we're splitting the original dataset (which observations belong to which subsets)
# `scoring` - scoring metric we want to use
# `n_jobs=-1` - each of our 5 folds can be run independently of one another (in parallel)

array([0.97576302, 0.98025135, 0.97574124, 0.96675651, 0.97124888])

In the first iteration, the model was trained on four of the five subsets, evaluated on the held-out fifth, and correctly predicted 97.6% of its samples.  
In the second iteration, it was trained on a different training set, evaluated on a different held-out subset, and correctly predicted 98% of the samples.  
And so on for the remaining three folds.
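
The five fold scores are often condensed into a single estimate. A minimal sketch, assuming the mean and the sample standard deviation are the summary of interest (the notebook itself stops at the raw array):

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above
scores = [0.97576302, 0.98025135, 0.97574124, 0.96675651, 0.97124888]

mean_accuracy = mean(scores)
spread = stdev(scores)   # sample standard deviation across the five folds
print(f"accuracy: {mean_accuracy:.3f} +/- {spread:.3f}")
```

A small spread relative to the mean suggests the model performs consistently across the different splits.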
