<h1 align="center">Natural Language Processing (NLP) Project, Part 5</h1>
<h2 align="center">Machine Learning Classifiers - Gradient Boosting model with grid-search</h2>
<h3 align="center">Rositsa Chankova</h3>

### Gradient Boosting model with grid-search
Gradient boosting is an ensemble method that iteratively combines weak learners into a strong learner by focusing on the mistakes of prior iterations. It starts with a shallow decision tree and evaluates the predictions of that first tree. In the next iteration it fits a new tree that concentrates on the observations the model got wrong so far (in practice, on the residual errors of the current predictions). It repeats this over and over, focusing on the examples it doesn’t yet handle well, until the error is as small as it can get or the requested number of trees has been built.

![Gradient boosting vs. random forest](Grad_vs_RF.png)
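
As a rough illustration of the loop described above, the following sketch boosts one-split "stumps" on a tiny made-up regression problem with squared loss. It is not scikit-learn's implementation: `fit_stump`, `boost` and the toy data are assumptions made for this example. Each round fits a stump to the residuals of the current predictions and adds a shrunken copy of it, which is where the learning rate comes in.

```python
# Minimal gradient boosting sketch (squared loss, one-feature stumps).
# All names and data here are illustrative, not part of the project code.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the max so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lmean, rmean)
    return best[1:]

def boost(xs, ys, n_estimators, learning_rate):
    """Return the training mean squared error after each boosting round."""
    pred = [sum(ys) / len(ys)] * len(ys)  # start from the mean prediction
    errors = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, pred)]  # current mistakes
        t, lval, rval = fit_stump(xs, residuals)
        # Add a shrunken correction; a smaller learning_rate means smaller steps.
        pred = [p + learning_rate * (lval if x <= t else rval)
                for x, p in zip(xs, pred)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys))
    return errors

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

slow = boost(xs, ys, n_estimators=50, learning_rate=0.01)
fast = boost(xs, ys, n_estimators=50, learning_rate=0.1)
print("final MSE, lr=0.01:", round(slow[-1], 3))
print("final MSE, lr=0.1: ", round(fast[-1], 3))
```

With the same number of rounds, the smaller learning rate leaves more error on the training data, which matches the pattern seen later when a learning rate of 0.01 is combined with only 50 estimators.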

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

def clean_text(text):
    text = "".join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids the invalid-escape warning for \W
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

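| | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 8094 | 8095 | 8096 | 8097 | 8098 | 8099 | 8100 | 8101 | 8102 | 8103 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 4.7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 49 | 4.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 62 | 3.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 28 | 7.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 135 | 4.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |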
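
To make the two hand-built columns concrete, here is a small standalone sketch of how `body_len` and `punct%` come out for a single message. The sample text and the name `punct_percent` are made up for the example; unlike `count_punct` above, this variant adds a guard for messages with no non-space characters and rounds after multiplying by 100.

```python
import string

def punct_percent(text):
    """Percentage of non-space characters that are punctuation."""
    non_space = len(text) - text.count(" ")
    if non_space == 0:  # guard added in this sketch: empty or all-space text
        return 0.0
    count = sum(1 for char in text if char in string.punctuation)
    return round(100 * count / non_space, 1)

sample = "Free entry!! Call now."                # hypothetical message
body_len = len(sample) - sample.count(" ")       # length without spaces
print("body_len:", body_len)
print("punct%:", punct_percent(sample))
```

Rounding after the multiplication avoids floating-point tails such as `4.699999999999999` that can appear when a rounded fraction is multiplied by 100.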


### Explore GradientBoostingClassifier Attributes & Hyperparameters

In [2]:
#Import the package for the classifier:
from sklearn.ensemble import GradientBoostingClassifier

In [3]:
print(dir(GradientBoostingClassifier))
print(GradientBoostingClassifier())

['_SUPPORTED_LOSS', '__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_feature_names', '_check_initialized', '_check_n_features', '_check_params', '_clear_state', '_compute_partial_dependence_recursion', '_estimator_type', '_fit_stage', '_fit_stages', '_get_param_names', '_get_tags', '_init_state', '_is_initialized', '_make_estimator', '_more_tags', '_raw_predict', '_raw_predict_init', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_resize_state', '_staged_raw_predict', '_validate_data', '_validate_estimator', '_validate_y', '_warn_ma

**The default max_depth is 3. Gradient boosting uses many very shallow trees, and the trees can’t be built in parallel because each iteration builds on the prior iteration.**
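
The `dir()` listing is long, so a shorter way to check the defaults mentioned here is `get_params()`. The sketch below prints the three hyperparameters tuned later in this notebook; the values shown are whatever the installed scikit-learn version uses as defaults.

```python
from sklearn.ensemble import GradientBoostingClassifier

# get_params() returns the constructor arguments as a plain dictionary.
params = GradientBoostingClassifier().get_params()

for name in ("n_estimators", "learning_rate", "max_depth"):
    print(name, "=", params[name])

# Unlike RandomForestClassifier, there is no n_jobs argument to set.
print("has n_jobs:", "n_jobs" in params)
```

The absence of an `n_jobs` parameter is consistent with the point above: the trees are built one after another, so there is nothing to spread across cores within a single model.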

### Build our own Grid-search

In [4]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

#### Split the data:

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)
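
The split above draws a different random sample on every run, so the numbers in the following cells will vary slightly between runs. As an optional variation (not what this notebook does), `random_state` makes the split repeatable and `stratify` keeps the spam share the same in both parts. The toy arrays below are an assumption standing in for `X_features` and `data['label']`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data (assumption): 100 rows, 13% labelled spam.
X = np.arange(200).reshape(100, 2)
y = np.array(["spam"] * 13 + ["ham"] * 87)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print("train spam share:", round((y_tr == "spam").mean(), 3))
print("test spam share: ", round((y_te == "spam").mean(), 3))
```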

#### Function to perform gradient boosting:

In [6]:
def train_GB(est, max_depth, lr):
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    gb_model = gb.fit(X_train, y_train)
    y_pred = gb_model.predict(X_test)
    precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
    print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        est, max_depth, lr, round(precision, 3), round(recall, 3), 
        round((y_pred==y_test).sum()/len(y_pred), 3)))

In [7]:
import warnings
# Silence all warnings, including the undefined-precision warning raised when a model predicts no spam
warnings.filterwarnings('ignore')

#### Using grid search to train the model:

In [8]:
for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.01, 0.1, 1]:
            train_GB(n_est, max_depth, lr)

Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.869
Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 0.944 / Recall: 0.699 / Accuracy: 0.955
Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.885 / Recall: 0.795 / Accuracy: 0.96
Est: 50 / Depth: 7 / LR: 0.01 ---- Precision: 0.75 / Recall: 0.021 / Accuracy: 0.871
Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.896 / Recall: 0.767 / Accuracy: 0.958
Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.863 / Recall: 0.822 / Accuracy: 0.96
Est: 50 / Depth: 11 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868
Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.872 / Recall: 0.795 / Accuracy: 0.958
Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.891 / Recall: 0.842 / Accuracy: 0.966
Est: 50 / Depth: 15 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.868
Est: 50 / Depth: 15 / LR: 0.1 ---- Precision: 0.873 / Recall: 0.801 / Accuracy: 0.959
Est: 50 / Depth: 15 / LR: 1 ---- Precision: 0.902 / Recall: 0.822 / Accuracy:

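**The worst models all use a learning rate of 0.01: with only 50 estimators, three of the four runs catch no spam at all (precision and recall of 0.0), so their accuracy of about 0.87 mostly reflects the share of ham messages in the test set. Learning rates of 0.1 and 1 both perform far better. A learning rate of 0.1 is the value carried into the searches below, although in the 50-estimator rows shown here a learning rate of 1 scores slightly higher on accuracy and recall.**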
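
Rows such as `Est: 50 / Depth: 3 / LR: 0.01` show high accuracy next to a precision and recall of 0.0. The following sketch works through that arithmetic with plain Python on made-up labels (13 spam, 87 ham is an assumption, not the real test split). `binary_metrics` is a hypothetical helper, not the scikit-learn function used in `train_GB`.

```python
def binary_metrics(y_true, y_pred, pos_label="spam"):
    """Return (precision, recall, accuracy) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    # When nothing is predicted as spam, precision is undefined; report 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return precision, recall, accuracy

y_true = ["spam"] * 13 + ["ham"] * 87   # made-up, imbalanced labels
all_ham = ["ham"] * 100                 # a model that never predicts spam

print(binary_metrics(y_true, all_ham))
```

A model that labels everything as ham is right on every ham message, so accuracy alone looks respectable on imbalanced data. This is why the grid reports precision and recall as well.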

### Evaluate Gradient-boosting Model Performance

#### Using both TF-IDF and CountVectorizer:

In [9]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
X_tfidf_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['body_text'])
X_count_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_count.toarray())], axis=1)

X_count_feat.head()

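| | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 8094 | 8095 | 8096 | 8097 | 8098 | 8099 | 8100 | 8101 | 8102 | 8103 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 4.7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 49 | 4.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 62 | 3.2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 28 | 7.1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 135 | 4.4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |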


### Exploring parameter settings using GridSearchCV

In [10]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

**The classifier is created with no arguments, and the settings to explore are defined in a parameter grid. The parameter grid is a dictionary whose keys are the hyper-parameter names and whose values are lists of the settings to be explored.**
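
To see what such a grid implies in terms of work, the sketch below expands a dictionary of lists into the individual settings with `itertools.product`. This is only an illustration of the idea, not how `GridSearchCV` is implemented internally; the grid values are the ones used for the TF-IDF search in this section.

```python
from itertools import product

param = {
    'n_estimators': [100, 150],
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1],
}

# One dictionary per combination of settings.
names = list(param)
combos = [dict(zip(names, values)) for values in product(*param.values())]

cv = 5  # number of cross-validation folds
for combo in combos:
    print(combo)
print(len(combos), "settings x", cv, "folds =", len(combos) * cv, "model fits")
```

Each extra value in any list multiplies the number of fits, which matters here because a single gradient boosting fit on these features takes minutes.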

#### TF-IDF:

In [11]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [100, 150], 
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = clf.fit(X_tfidf_feat, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]



| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 189.47984 | 0.666128 | 0.426417 | 0.117031 | 0.1 | 7 | 150 | {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... | 0.966786 | 0.981149 | 0.971249 | 0.966757 | 0.967655 | 0.970719 | 0.00547 | 1 |
| 5 | 245.593789 | 22.084825 | 0.209827 | 0.013874 | 0.1 | 15 | 150 | {'learning_rate': 0.1, 'max_depth': 15, 'n_est... | 0.966786 | 0.974865 | 0.969452 | 0.968553 | 0.969452 | 0.969822 | 0.002703 | 2 |
| 3 | 268.231131 | 1.093614 | 0.333991 | 0.073561 | 0.1 | 11 | 150 | {'learning_rate': 0.1, 'max_depth': 11, 'n_est... | 0.964093 | 0.978456 | 0.967655 | 0.971249 | 0.966757 | 0.969642 | 0.004966 | 3 |
| 0 | 128.081472 | 0.420875 | 0.375533 | 0.028019 | 0.1 | 7 | 100 | {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... | 0.964093 | 0.979354 | 0.969452 | 0.965858 | 0.967655 | 0.969282 | 0.005343 | 4 |
| 4 | 221.186754 | 11.501587 | 0.313378 | 0.045683 | 0.1 | 15 | 100 | {'learning_rate': 0.1, 'max_depth': 15, 'n_est... | 0.964991 | 0.975763 | 0.969452 | 0.966757 | 0.966757 | 0.968744 | 0.003788 | 5 |




#### CountVectorizer:

**Setting n_jobs=-1 means that the models for the different cross-validation subsets and parameter settings are trained in parallel. It does not mean that each individual model is trained in parallel: gradient boosting builds its trees one after another, because each iteration depends on the prior iteration.**

In [12]:
gb = GradientBoostingClassifier()
param = {
    'n_estimators': [50, 100, 150], 
    'max_depth': [7, 11, 15],
    'learning_rate': [0.1]
}

clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)
cv_fit = clf.fit(X_count_feat, data['label'])
pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]





| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 251.099859 | 11.217493 | 0.20716 | 0.027042 | 0.1 | 15 | 150 | {'learning_rate': 0.1, 'max_depth': 15, 'n_est... | 0.964093 | 0.977558 | 0.967655 | 0.969452 | 0.971249 | 0.970002 | 0.004458 | 1 |
| 5 | 273.647401 | 1.019809 | 0.302834 | 0.011659 | 0.1 | 11 | 150 | {'learning_rate': 0.1, 'max_depth': 11, 'n_est... | 0.966786 | 0.977558 | 0.967655 | 0.96496 | 0.97035 | 0.969462 | 0.004405 | 2 |
| 2 | 184.636719 | 2.748395 | 0.342022 | 0.028607 | 0.1 | 7 | 150 | {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... | 0.962298 | 0.979354 | 0.97035 | 0.96496 | 0.967655 | 0.968923 | 0.005866 | 3 |
| 4 | 179.667654 | 2.137129 | 0.360055 | 0.053417 | 0.1 | 11 | 100 | {'learning_rate': 0.1, 'max_depth': 11, 'n_est... | 0.962298 | 0.977558 | 0.968553 | 0.962264 | 0.967655 | 0.967666 | 0.005597 | 4 |
| 1 | 126.321779 | 1.176044 | 0.327917 | 0.02425 | 0.1 | 7 | 100 | {'learning_rate': 0.1, 'max_depth': 7, 'n_esti... | 0.963196 | 0.977558 | 0.967655 | 0.962264 | 0.966757 | 0.967486 | 0.005435 | 5 |
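
Both searches rank the settings by `GridSearchCV`'s default score for classifiers, which is accuracy. If recall on the spam class matters more, a custom scorer can be passed through the `scoring` argument. The sketch below shows one way to do that on a small synthetic dataset. The data, the reduced grid and the name `spam_recall` are assumptions chosen so that the example runs in seconds; they are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the SMS features (assumption).
X, y_num = make_classification(n_samples=200, n_features=10,
                               weights=[0.85, 0.15], random_state=0)
y = np.where(y_num == 1, "spam", "ham")

param = {"n_estimators": [10, 20], "max_depth": [2, 3], "learning_rate": [0.1]}

# Labels are strings, so the positive class has to be named explicitly.
spam_recall = make_scorer(recall_score, pos_label="spam")

clf = GridSearchCV(GradientBoostingClassifier(random_state=0),
                   param, cv=3, scoring=spam_recall)
clf.fit(X, y)

# best_params_ and best_score_ summarise the top row of cv_results_.
print(clf.best_params_)
print(round(clf.best_score_, 3))
```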




### Final evaluation of models

In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], data['label'], test_size=0.2)

In [16]:
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(X_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(X_train['body_text'])
tfidf_test = tfidf_vect_fit.transform(X_test['body_text'])

X_train_vect = pd.concat([X_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_test_vect = pd.concat([X_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

X_train_vect.head()

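| | body_len | punct% | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | ... | 7103 | 7104 | 7105 | 7106 | 7107 | 7108 | 7109 | 7110 | 7111 | 7112 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 116 | 16.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 72 | 2.8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 101 | 6.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 67 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 269 | 2.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |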
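
The feature columns now stop at 7112 instead of 8103 because the vectorizer is fitted on the training messages only, so its vocabulary no longer includes words that appear only in the test messages. The sketch below shows the same fit-on-train, transform-on-test pattern on three made-up messages. It uses `CountVectorizer` with its default tokenizer rather than `clean_text`, to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "see you at lunch"]   # made-up messages
test_docs = ["free lunch tomorrow"]                   # "tomorrow" is unseen

# Learn the vocabulary from the training messages only.
vect = CountVectorizer().fit(train_docs)

train_counts = vect.transform(train_docs)
test_counts = vect.transform(test_docs)   # unseen words are dropped

print(sorted(vect.vocabulary_))
print(train_counts.shape, test_counts.shape)
print("known words counted in the test message:", test_counts.sum())
```

Train and test matrices end up with the same columns, and nothing about the test messages influences the features the model is trained on.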


In [17]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [18]:
rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

start = time.time()
rf_model = rf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = rf_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 1.993 / Predict time: 0.146 ---- Precision: 1.0 / Recall: 0.842 / Accuracy: 0.978


In [19]:
gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

start = time.time()
gb_model = gb.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

start = time.time()
y_pred = gb_model.predict(X_test_vect)
end = time.time()
pred_time = (end - start)

precision, recall, fscore, train_support = score(y_test, y_pred, pos_label='spam', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time, 3), round(pred_time, 3), round(precision, 3), round(recall, 3), round((y_pred==y_test).sum()/len(y_pred), 3)))

Fit time: 140.517 / Predict time: 0.143 ---- Precision: 0.942 / Recall: 0.816 / Accuracy: 0.967
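
The two cells above repeat the same start/stop timing pattern four times. One possible tidy-up, shown here as a sketch rather than as part of the original notebook, is a small helper that returns both the result and the elapsed time. The name `timed` is hypothetical, and `time.perf_counter` is used instead of `time.time` because it is intended for measuring intervals.

```python
import time

def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook this could be rf.fit or gb_model.predict.
total, seconds = timed(sum, range(1_000_000))
print(total, round(seconds, 3))
```

With this helper, a fit would read `gb_model, fit_time = timed(gb.fit, X_train_vect, y_train)` and a prediction `y_pred, pred_time = timed(gb_model.predict, X_test_vect)`.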


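**This project tested two algorithms, random forest and gradient boosting, on two vectorization methods, TF-IDF and count vectorization, across a variety of hyper-parameter settings.
The best gradient boosting models are the ones with 150 estimators. The max_depth setting matters less: depths of 7, 11 and 15 land within about 0.001 of each other in mean test score, so a max_depth of around 11 is a reasonable choice.
Random forest is much faster to fit than gradient boosting (about 2 seconds versus about 141 seconds here), while prediction times are similar. On this test split random forest also scores higher on precision, recall and accuracy.**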
