# Machine Learning
Adapted from "NLP with Python for Machine Learning Essential Training" by Derek Jedamski. 
___
## Notes
In this notebook, we will discuss machine learning as a whole.

Specifically, we'll be talking about the following:
> [Overview](#Overview:-Machine-Learning)
>
> [System Evaluation](#System-Evaluation)
>
> [Random Forests](#Random-Forests)
>
> [Gradient Boosting](#Gradient-Boosting)

We finish the notebook off with a [review](#Review) of everything discussed. 

___

> ## Overview: Machine Learning
> ___
> **Machine Learning** refers to algorithms that use data to make predictions. 
>
> Two of the main types of machine learning are:
> - **Supervised Learning**: Learning a mapping function from labeled training data to make predictions on unseen data.
> 
> - **Unsupervised Learning**: Discovering hidden patterns or structures in unlabeled data.
> 
> We will be focusing on supervised learning for now. 

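To make the distinction concrete, here is a minimal sketch on made-up message lengths. The data, the threshold rule, and the helpers `predict` and `two_means` are all illustrative assumptions, not part of the course material. The supervised half uses the labels to learn a decision threshold; the unsupervised half ignores the labels and simply splits the same numbers into two groups.

```python
# Toy data (made up for illustration): message lengths and their labels.
lengths = [20, 25, 30, 120, 130, 150]
labels = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam']


def mean(values):
    return sum(values) / len(values)


# Supervised: use the labels to learn a threshold halfway between class means.
ham_mean = mean([x for x, y in zip(lengths, labels) if y == 'ham'])
spam_mean = mean([x for x, y in zip(lengths, labels) if y == 'spam'])
threshold = (ham_mean + spam_mean) / 2


def predict(length):
    """Predict a label for an unseen message length."""
    return 'spam' if length > threshold else 'ham'


# Unsupervised: no labels, just group the values around two centers.
def two_means(values, iterations=10):
    """Assign each value to group 0 or 1 with a tiny 1-D 2-means loop."""
    low, high = min(values), max(values)
    groups = []
    for _ in range(iterations):
        groups = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        low = mean([v for v, g in zip(values, groups) if g == 0])
        high = mean([v for v, g in zip(values, groups) if g == 1])
    return groups


print(predict(40), predict(200))
print(two_means(lengths))
```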
___

> ## System Evaluation
> ___
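> **k-Fold Cross Validation** splits the full dataset into k subsets (folds). The model is trained and evaluated k times, each time holding out a different fold as the test set and training on the remaining k-1 folds.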
> 
> **Evaluation metrics** quantify the performance of our predictive model. 
>
> Here are a few common evaluation metrics we will use:
> - $\text{Accuracy} = \frac{\text{# predicted correctly}}{\text{ total # of observations}}$
> - $\text{Precision} = \frac{\text{# predicted as class A that are actually class A}}{\text{ total # predicted as class A}}$
> - $\text{Recall} = \frac{\text{# predicted as class A that are actually class A}}{\text{ total # that are actually class A}}$
>
> `Note`: If we want to limit false positives, we will optimize the model for precision. If we want to limit false negatives, we will optimize the model for recall.

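The three formulas above can be worked through on a tiny hand-made example. The labels below and the helper `evaluate` are assumptions for illustration only; class A is taken to be `'spam'`.

```python
def evaluate(y_true, y_pred, positive='spam'):
    """Return (accuracy, precision, recall) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    # Predicted as the positive class and actually the positive class.
    true_pos = sum(1 for t, p in pairs if p == positive and t == positive)
    # Predicted as the positive class but actually another class.
    false_pos = sum(1 for t, p in pairs if p == positive and t != positive)
    # Actually the positive class but predicted as another class.
    false_neg = sum(1 for t, p in pairs if p != positive and t == positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, precision, recall


# Made-up labels: 3 spam and 5 ham messages, with one miss and one false alarm.
y_true = ['spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham']
y_pred = ['spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam']

accuracy, precision, recall = evaluate(y_true, y_pred)
print(accuracy, precision, recall)  # 6/8, 2/3, 2/3
```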
___

> ## Random Forests
> 
> **Random forests** are collections of decision trees whose predictions are aggregated to obtain a final prediction. 
> 
> They use **ensemble learning**, a process by which multiple models are created and combined to make a more informed prediction (compared to a single model). 
>
> As with any other model, we can evaluate random forests with **k-fold cross validation** or a **train-test split**. We can also tune them with **grid-search**, an exhaustive search over every parameter combination in a given grid to find the best-performing parameters. 
>
> Let's explore random forests a bit. 

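The aggregation step can be sketched as a simple majority vote. The three hand-written "trees" below and their thresholds are hypothetical stand-ins; a real random forest learns each tree from a random sample of the training data rather than using fixed rules like these.

```python
from collections import Counter


# Hypothetical one-rule "trees"; each looks at a different feature.
def tree_length(message):
    return 'spam' if len(message) > 50 else 'ham'


def tree_punctuation(message):
    return 'spam' if message.count('!') >= 2 else 'ham'


def tree_keyword(message):
    return 'spam' if 'free' in message.lower() else 'ham'


def forest_predict(trees, message):
    """Collect one vote per tree and return the most common label."""
    votes = Counter(tree(message) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_length, tree_punctuation, tree_keyword]

# Two of the three trees vote spam, so the ensemble predicts spam.
print(forest_predict(trees, "Call me!! It's free tonight"))
# All three trees vote ham.
print(forest_predict(trees, "See you at lunch"))
```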
First, we'll read in our data and clean it up. 

In [1]:
import nltk
import re
import string
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# The stopword list needs a one-time download: nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv('SMSSpamCollection.tsv', sep='\t', header=None)
data.columns = ['label','text']

In [7]:
def count_punctuation(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")),3)*100

data['length'] = data['text'].apply(lambda x: len(x) - x.count(" "))
data['punctuation'] = data['text'].apply(lambda x: count_punctuation(x))

In [8]:
def clean_text(text):
    cleaned_text = ''.join([char for char in text if char not in string.punctuation])
    tokenized_text = re.split(r'\W+', cleaned_text)  # raw string avoids an invalid-escape warning
    stemmed_tokens = [ps.stem(word) for word in tokenized_text if word not in stopwords]
    return stemmed_tokens

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf_counts = tfidf_vect.fit_transform(data['text'])

X_features = pd.concat([data['length'], data['punctuation'], pd.DataFrame(X_tfidf_counts.toarray())], axis=1)
X_features.head()

Unnamed: 0,length,punctuation,0,1,2,3,4,5,6,7,...,8181,8182,8183,8184,8185,8186,8187,8188,8189,8190
0,160,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now, let's import our `RandomForestClassifier` and explore it through 10-fold cross-validation.

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestClassifier(n_jobs=-1) # allows for trees to run in parallel
k_fold = KFold(n_splits=10)
cross_val_score(rf, X_features, data['label'], cv=k_fold, scoring='accuracy', n_jobs=-1)

array([0.98384201, 0.96947935, 0.97666068, 0.98743268, 0.97307002,
       0.98025135, 0.97127469, 0.96947935, 0.9676259 , 0.98201439])

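As a rough sketch of what the cross-validation call does, the hypothetical helper `k_fold_indices` below splits row indices into consecutive folds, which is intended to mirror `KFold` when shuffling is off. The ten accuracy values are copied from the output above and averaged into a single summary number.

```python
from statistics import mean


def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Each sample lands in exactly one test fold.
folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train, test in folds:
    print('train:', train, 'test:', test)

# Accuracy per fold, copied from the cross_val_score output above.
scores = [0.98384201, 0.96947935, 0.97666068, 0.98743268, 0.97307002,
          0.98025135, 0.97127469, 0.96947935, 0.9676259, 0.98201439]
mean_accuracy = mean(scores)
print(round(mean_accuracy, 3))
```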
Let's explore our `RandomForestClassifier` using a holdout test set. 

In [26]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
rf2 = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)
rf2_model = rf2.fit(X_train, y_train)

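Our fitted `RandomForestClassifier` exposes a `feature_importances_` attribute, which scores how much each feature contributed to the forest's decisions. Let's look at the ten most important features. 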

In [28]:
sorted(zip(rf2_model.feature_importances_, X_train.columns), reverse=True)[0:10]

[(0.04727789288599296, 'length'),
 (0.039905742377050364, 2048),
 (0.02900918223533704, 1819),
 (0.0245632355485361, 7090),
 (0.023629581186565912, 4838),
 (0.02316410225353268, 3159),
 (0.021391812063942965, 295),
 (0.020058904539315043, 7422),
 (0.019354545728288715, 2188),
 (0.017632349642892228, 7864)]

The above results tell us that the length of the text is the single most important feature in determining labels. The numbered features are columns of the TF-IDF matrix. 

Let's go ahead and make predictions using our model (since we've already fit the model to our training data). 

In [34]:
y_pred = rf2_model.predict(X_test)
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Precision: {}'.format(round(precision,3)))
print('Recall: {}'.format(round(recall,3)))
print('Accuracy: {}'.format(round((y_pred==y_test).sum() / len(y_pred),3)))

Precision: 1.0
Recall: 0.57
Accuracy: 0.942


The above results tell us that:
- 100% of the messages marked as spam were actually spam (precision)
- 57% of all spam messages were correctly marked as spam (recall)
- 94.2% of all messages were identified correctly, whether spam or ham (accuracy)

To finish up, let's perform a grid-search for our `RandomForestClassifier`. 

In [37]:
def train_rf(num_estimators, depth_num):
    rf = RandomForestClassifier(n_estimators=num_estimators, max_depth=depth_num, n_jobs=-1)
    rf_model = rf.fit(X_train, y_train)
    y_pred = rf_model.predict(X_test)
    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')
    print('Est: {} / Depth: {} --- Precision: {} / Recall: {} / Accuracy: {}'.format(
        num_estimators, depth_num, round(precision,3), round(recall,3), 
        round((y_pred==y_test).sum() / len(y_pred),3)))

for num_estimators in [10, 50, 100]:
    for depth_num in [10, 20, 30, None]:
        train_rf(num_estimators, depth_num)

Est: 10 / Depth: 10 --- Precision: 1.0 / Recall: 0.146 / Accuracy: 0.884
Est: 10 / Depth: 20 --- Precision: 1.0 / Recall: 0.57 / Accuracy: 0.942
Est: 10 / Depth: 30 --- Precision: 1.0 / Recall: 0.742 / Accuracy: 0.965
Est: 10 / Depth: None --- Precision: 0.991 / Recall: 0.755 / Accuracy: 0.966
Est: 50 / Depth: 10 --- Precision: 1.0 / Recall: 0.238 / Accuracy: 0.897
Est: 50 / Depth: 20 --- Precision: 1.0 / Recall: 0.55 / Accuracy: 0.939
Est: 50 / Depth: 30 --- Precision: 1.0 / Recall: 0.675 / Accuracy: 0.956
Est: 50 / Depth: None --- Precision: 1.0 / Recall: 0.801 / Accuracy: 0.973
Est: 100 / Depth: 10 --- Precision: 1.0 / Recall: 0.258 / Accuracy: 0.899
Est: 100 / Depth: 20 --- Precision: 1.0 / Recall: 0.536 / Accuracy: 0.937
Est: 100 / Depth: 30 --- Precision: 1.0 / Recall: 0.682 / Accuracy: 0.957
Est: 100 / Depth: None --- Precision: 1.0 / Recall: 0.815 / Accuracy: 0.975


The above results tell us that:
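- As depth increases, recall increases drastically with almost no change in precision
- As the number of estimators increases, recall changes only modestly and not always upward: it rises at depth 10 and with unlimited depth, but dips slightly at depths 20 and 30

Therefore, the best random forest model for this problem would be one with a very high (or unlimited) depth. In this run, 100 estimators with `max_depth=None` gave the best recall (0.815) and accuracy (0.975). 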
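
The selection step of a grid search can be sketched without retraining anything. The snippet below enumerates the same grid with `itertools.product` and picks the winning combination from the scores printed above; the `results` dictionary is transcribed from that output, and ranking by recall is an assumption made for this example. scikit-learn also provides `GridSearchCV`, which combines this kind of search with cross-validation.

```python
from itertools import product

# The grid searched above.
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [10, 20, 30, None],
}

# (n_estimators, max_depth) -> (precision, recall, accuracy),
# transcribed from the printed grid-search output.
results = {
    (10, 10): (1.0, 0.146, 0.884),
    (10, 20): (1.0, 0.57, 0.942),
    (10, 30): (1.0, 0.742, 0.965),
    (10, None): (0.991, 0.755, 0.966),
    (50, 10): (1.0, 0.238, 0.897),
    (50, 20): (1.0, 0.55, 0.939),
    (50, 30): (1.0, 0.675, 0.956),
    (50, None): (1.0, 0.801, 0.973),
    (100, 10): (1.0, 0.258, 0.899),
    (100, 20): (1.0, 0.536, 0.937),
    (100, 30): (1.0, 0.682, 0.957),
    (100, None): (1.0, 0.815, 0.975),
}

# Every combination in the grid: 3 estimator counts x 4 depths.
combos = list(product(param_grid['n_estimators'], param_grid['max_depth']))

# Rank by recall, since precision is nearly constant across the grid.
best_by_recall = max(combos, key=lambda combo: results[combo][1])
best_by_accuracy = max(combos, key=lambda combo: results[combo][2])

print('Best by recall:', best_by_recall, results[best_by_recall])
print('Best by accuracy:', best_by_accuracy, results[best_by_accuracy])
```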