# In this second part of the NLP series, a Random Forest model is applied to the data to classify messages as spam or non-spam

### Building Machine Learning Classifiers: Building a basic Random Forest model

Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

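def clean_text(text):
    # Drop punctuation and lowercase, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Raw string so that \W reaches the regex engine unchanged
    tokens = re.split(r'\W+', text)
    # Remove stopwords and stem the remaining tokens
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text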

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,8094,8095,8096,8097,8098,8099,8100,8101,8102,8103
0,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,135,4.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
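
The concatenated frame mixes string column names (`body_len`, `punct%`) with the integer names that `pd.DataFrame(X_tfidf.toarray())` assigns by default. To my knowledge, recent scikit-learn releases reject a DataFrame whose column names are not all strings when fitting, so converting the names may be needed depending on the installed version. A minimal sketch on a toy frame (the values and the names `meta`, `tfidf_part` and `toy` are illustrative only):

```python
import pandas as pd

# Toy stand-in for X_features: two named columns plus integer-named TF-IDF columns
meta = pd.DataFrame({'body_len': [128, 49], 'punct%': [4.7, 4.1]})
tfidf_part = pd.DataFrame([[0.0, 0.3], [0.5, 0.0]])  # columns default to 0, 1
toy = pd.concat([meta, tfidf_part], axis=1)

# Convert every column name to a string so the names share one type
toy.columns = toy.columns.astype(str)
print(list(toy.columns))
```

In this notebook the equivalent line would be `X_features.columns = X_features.columns.astype(str)`, after which feature names appear as strings such as `'1803'` instead of integers.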


--------------------------------------------------------------------------------------
### Random Forest classifier evaluated on a holdout set

In [2]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [3]:
# Hold out 20% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
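
Because `train_test_split` shuffles the rows randomly, the metrics reported in this notebook will differ from run to run. Below is a hedged sketch of a reproducible, class-balanced split on toy data. The `random_state=42` and `stratify` arguments are my additions, not settings used in the original run, and the toy arrays stand in for `X_features` and `data['label']`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the notebook's data: 100 rows, 15% labelled "spam"
X = np.arange(100).reshape(-1, 1)
y = np.array(['spam'] * 15 + ['ham'] * 85)

# random_state fixes the shuffle; stratify keeps the spam share equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print('spam share in train:', (y_tr == 'spam').mean())
print('spam share in test: ', (y_te == 'spam').mean())
```

Stratifying matters for an imbalanced label like spam, since an unlucky random split can leave the test set with noticeably more or less spam than the training set.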

In [4]:
from sklearn.ensemble import RandomForestClassifier

# 50 trees, each limited to depth 20; n_jobs=-1 uses all available CPU cores
rf = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)

rf_model = rf.fit(x_train, y_train)  # fit() returns the fitted estimator

In [5]:
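# Sort on the importance value only, so tied importances never compare int and str column names
sorted(zip(rf_model.feature_importances_, x_train.columns), key=lambda pair: pair[0], reverse=True)[0:10]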


[(0.061342193118805566, 'body_len'),
 (0.04430376899285169, 1803),
 (0.028392492108139555, 2031),
 (0.020566587759555239, 7461),
 (0.020562640305861891, 7350),
 (0.020444607517757004, 7027),
 (0.018310760130820742, 3134),
 (0.017597806511971164, 4796),
 (0.015907402896787479, 6285),
 (0.015797520692157275, 5078)]

#### `body_len` is the most important feature, which agrees with the earlier feature evaluation

------------------

### Prediction phase with the fitted model. Here I am predicting on x_test

In [6]:
y_pred=rf_model.predict(x_test)

# Performance metrics
# pos_label is the class we are interested in predicting; here that is spam

precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

In [7]:
print('Precision:{} / Recall:{} / Accuracy:{}'.format(
    round(precision * 100, 3),
    round(recall * 100, 3),
    round((y_pred == y_test).sum() / len(y_pred) * 100, 3)))

      


Precision:100.0 / Recall:60.274 / Accuracy:94.794


### Summary:

* `body_len` is the most important feature, which agrees with the earlier feature evaluation.
* Precision is 100%: every message the model flagged as spam in the test set really was spam, so no legitimate message would land in the spam folder.
* Accuracy is 94.8%: that share of all test messages, spam and ham together, was labelled correctly.
* Recall is only 60.3%: the model caught about 60% of the spam in the test set, and the remaining 40% or so would have reached the inbox.
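
To make the three metrics concrete, the sketch below recomputes them from confusion-matrix counts. The counts are back-calculated assumptions: they are one set of values consistent with the reported numbers for a test set of 1,114 messages, not figures printed by the notebook.

```python
# Assumed counts, chosen to be consistent with the reported metrics
tp = 88    # spam correctly flagged as spam
fn = 58    # spam that was labelled ham (missed)
fp = 0     # ham wrongly flagged as spam
tn = 968   # ham correctly labelled ham

precision = tp / (tp + fp)                    # of everything flagged, how much was spam
recall = tp / (tp + fn)                       # of all spam, how much was flagged
accuracy = (tp + tn) / (tp + fn + fp + tn)    # share of all messages labelled correctly

print('Precision:{} / Recall:{} / Accuracy:{}'.format(
    round(precision * 100, 3), round(recall * 100, 3), round(accuracy * 100, 3)))
# prints: Precision:100.0 / Recall:60.274 / Accuracy:94.794
```

Accuracy stays high even though 58 spam messages are missed, because ham makes up most of the test set. That is why recall is the more informative number for this problem.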

-------------
### Grid-search

Why grid search?

In the last section, we fit a single model with a single set of hyperparameter settings and generated a single set of evaluation metrics.

We can try to improve the model by changing the hyperparameter settings, such as the max depth or the number of estimators.

Could we capture more spam by altering the hyperparameter settings? That is where grid search comes in. Grid search means defining a grid of hyperparameter settings and then fitting a model with each combination of those settings.

In our case, that means choosing a range for the number of estimators and a range for the max depth. Grid search then fits and evaluates a model for every combination to see which one produces the best model.
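
scikit-learn also ships `GridSearchCV`, which automates this kind of search. The sketch below is an alternative to the manual loop used in this notebook, not what the notebook runs: it scores each combination by cross-validation rather than on a single holdout set, and it uses synthetic data from `make_classification` as a stand-in for the SMS features. The grid values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data (label 1 plays the role of spam)
X, y = make_classification(n_samples=300, n_features=20,
                           weights=[0.85, 0.15], random_state=0)

# Every combination of these values is tried: 2 x 2 = 4 candidate models
param_grid = {'n_estimators': [10, 50], 'max_depth': [10, None]}

# Score by recall of the positive class, using 3-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring='recall', cv=3)
search.fit(X, y)

for params, mean_recall in zip(search.cv_results_['params'],
                               search.cv_results_['mean_test_score']):
    print(params, round(mean_recall, 3))
print('Best:', search.best_params_)
```

Scoring on recall reflects the question asked here, which is whether more spam can be caught. A different `scoring` value would rank the combinations by a different metric.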

In [8]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [9]:
# Draw a fresh 80/20 split for the grid search
x_train, x_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

In [10]:
def train_RF(n_est, depth):
    # Fit a forest with the given settings and report its metrics on the holdout set
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=depth, n_jobs=-1)
    rf_model = rf.fit(x_train, y_train)
    y_pred = rf_model.predict(x_test)

    precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

    print("Est:{} / Depth:{} -------- Precision: {}% / Recall:{}% / Accuracy: {}%".format(
        n_est, depth, round(precision * 100, 3), round(recall * 100, 3),
        round((y_pred == y_test).sum() / len(y_pred) * 100, 3)))

In [11]:
# Try every combination of tree count and depth; None means no depth limit
for n_est in [10, 50, 100]:
    for depth in [10, 20, 30, None]:
        train_RF(n_est, depth)

Est:10 / Depth:10 -------- Precision: 100.0% / Recall:20.144% / Accuracy: 90.036%
Est:10 / Depth:20 -------- Precision: 100.0% / Recall:48.201% / Accuracy: 93.537%
Est:10 / Depth:30 -------- Precision: 99.01% / Recall:71.942% / Accuracy: 96.409%
Est:10 / Depth:None -------- Precision: 99.091% / Recall:78.417% / Accuracy: 97.217%
Est:50 / Depth:10 -------- Precision: 100.0% / Recall:25.899% / Accuracy: 90.754%
Est:50 / Depth:20 -------- Precision: 100.0% / Recall:53.237% / Accuracy: 94.165%
Est:50 / Depth:30 -------- Precision: 100.0% / Recall:69.065% / Accuracy: 96.14%
Est:50 / Depth:None -------- Precision: 100.0% / Recall:80.576% / Accuracy: 97.576%
Est:100 / Depth:10 -------- Precision: 100.0% / Recall:28.058% / Accuracy: 91.023%
Est:100 / Depth:20 -------- Precision: 100.0% / Recall:54.676% / Accuracy: 94.345%
Est:100 / Depth:30 -------- Precision: 100.0% / Recall:68.345% / Accuracy: 96.05%
Est:100 / Depth:None -------- Precision: 100.0% / Recall:80.576% / Accuracy: 97.576%


# Conclusion

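* Recall rises steadily with max depth: from roughly 20-28% at depth 10 to about 68-72% at depth 30, and to about 78-81% with no depth limit, while precision stays at or near 100%. Max depth is therefore the setting that matters most here; the number of estimators changes the results far less.
* `body_len` is the most important feature, which agrees with the earlier feature evaluation.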