#### NLP with Python
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) concerned with the interaction between computers and human language. Its aim is to enable computers to understand, interpret, manipulate, and generate human language in ways that are meaningful and useful.

In practice, this means developing and applying computational algorithms and models to process and analyze natural language data.

#### Building Machine Learning Classifiers

**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model.

**Cross-validation:** Divide a dataset into k subsets and repeat the holdout method k times where a different subset is used as the holdout set in each iteration.
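The k-fold idea can be sketched in a few lines of plain Python. This is only an illustration of how the holdout subset rotates; the helper name `k_fold_indices` is hypothetical, and it uses contiguous, unshuffled folds, which is a simplifying assumption rather than a description of what scikit-learn does internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, holdout_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, holdout
        start += size

# Ten samples, five folds: each sample lands in the holdout set exactly once.
folds = list(k_fold_indices(10, 5))
for i, (train, holdout) in enumerate(folds):
    print(f"fold {i}: holdout={holdout}, train size={len(train)}")
```

Each model is fit on the `train` indices and scored on the `holdout` indices, and the k scores are then averaged.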

#### Reading data and visualizing data

In [1]:
# The Natural Language Toolkit (NLTK) is an open source Python library for Natural Language Processing. 
import nltk
import pandas as pd
import re

# Suppress FutureWarning and DeprecationWarning messages to keep the output readable.
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

# Loading a TSV file into a DataFrame and assigning new column names to the DataFrame's columns.
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']
data

| | label | body_text |
|---|---|---|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 1 | ham | Nah I don't think he goes to usf, he lives aro... |
| 2 | ham | Even my brother is not like to speak with me. ... |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |
| 4 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| ... | ... | ... |
| 5562 | spam | This is the 2nd time we have tried 2 contact u... |
| 5563 | ham | Will ü b going to esplanade fr home? |
| 5564 | ham | Pity, * was in mood for that. So...any other s... |
| 5565 | ham | The guy did some bitching but I acted like i'd... |


#### Cleaning data

In [2]:
# A collection of string constants.
import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation, rounded to one decimal place.
    # Rounding after scaling avoids floating-point residue such as 7.1000000000000005.
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / (len(text) - text.count(" ")) * 100, 1)

# 'body_len': number of non-space characters in each message.
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
# 'punct%': percentage of those characters that are punctuation.
data['punct%'] = data['body_text'].apply(count_punct)

data

| | label | body_text | body_len | punct% |
|---|---|---|---|---|
| 0 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 128 | 4.7 |
| 1 | ham | Nah I don't think he goes to usf, he lives aro... | 49 | 4.1 |
| 2 | ham | Even my brother is not like to speak with me. ... | 62 | 3.2 |
| 3 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! | 28 | 7.1 |
| 4 | ham | As per your request 'Melle Melle (Oru Minnamin... | 135 | 4.4 |
| ... | ... | ... | ... | ... |
| 5562 | spam | This is the 2nd time we have tried 2 contact u... | 131 | 6.1 |
| 5563 | ham | Will ü b going to esplanade fr home? | 29 | 3.4 |
| 5564 | ham | Pity, * was in mood for that. So...any other s... | 48 | 14.6 |
| 5565 | ham | The guy did some bitching but I acted like i'd... | 100 | 1.0 |
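As a sanity check, the two engineered features can be reproduced by hand for one message from the table. The helper is restated here, rounding the percentage to one decimal place, so that the block runs on its own.

```python
import string

def count_punct(text):
    """Percentage of non-space characters that are punctuation."""
    non_space = len(text) - text.count(" ")
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / non_space * 100, 1)

message = "I HAVE A DATE ON SUNDAY WITH WILL!!"
body_len = len(message) - message.count(" ")  # 35 characters minus 7 spaces
punct_pct = count_punct(message)              # 2 punctuation marks out of 28 characters
print(body_len, punct_pct)
```

The result agrees with row 3 of the table: 28 non-space characters, of which about 7.1% are punctuation.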


#### Explore the dataset

In [3]:
# What is the shape of the dataset?

print("Input data has {} rows and {} columns".format(len(data), len(data.columns)))

Input data has 5567 rows and 4 columns


In [4]:
# How many spam/ham are there?

print("Out of {} rows, {} are spam, {} are ham".format(len(data),
                                                       len(data[data['label']=='spam']),
                                                       len(data[data['label']=='ham'])))

Out of 5567 rows, 746 are spam, 4821 are ham


In [5]:
# How much missing data is there?

print("Number of null in label: {}".format(data['label'].isnull().sum()))
print("Number of null in text: {}".format(data['body_text'].isnull().sum()))

Number of null in label: 0
Number of null in text: 0


#### Remove punctuation and stopwords 

#### Stem text

In [6]:
# Example
import nltk
from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
ps = PorterStemmer()

# Example words to be stemmed
words = ['running', 'jumps', 'jumping', 'ran', 'easily', 'eased']

# Stem each word using the PorterStemmer
stemmed_words = [ps.stem(word) for word in words]

# Print the stemmed words
for word, stemmed_word in zip(words, stemmed_words):
    print(f'{word} -> {stemmed_word}')

running -> run
jumps -> jump
jumping -> jump
ran -> ran
easily -> easili
eased -> eas


In [7]:
import string

# Retrieve NLTK's list of English stopwords; it is used below to filter stopwords out during preprocessing.
# (This needs the stopwords corpus, e.g. a one-time nltk.download('stopwords').)
stopwords = nltk.corpus.stopwords.words('english')

# Stemming reduces words to their base or root form.
ps = nltk.PorterStemmer()

def clean_text(text):
    # Lowercase the text and drop punctuation characters.
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Split on runs of non-word characters (the raw string avoids an invalid escape sequence).
    tokens = re.split(r'\W+', text)
    # Drop stopwords and stem the remaining tokens.
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text
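
The first two steps of `clean_text` can be traced without NLTK. The sketch below leaves out stemming and uses a tiny, hypothetical stopword set purely for illustration (the notebook uses NLTK's full English list); `strip_and_tokenize` is a made-up name.

```python
import re
import string

# Hypothetical stand-in for NLTK's English stopword list.
STOPWORDS = {"i", "have", "a", "on", "with", "will"}

def strip_and_tokenize(text):
    # Lowercase and remove punctuation, character by character.
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters, then drop stopwords.
    tokens = re.split(r"\W+", no_punct)
    return [tok for tok in tokens if tok not in STOPWORDS]

tokens = strip_and_tokenize("I HAVE A DATE ON SUNDAY WITH WILL!!")
print(tokens)    # ['date', 'sunday']

trailing = strip_and_tokenize("Call now! ")
print(trailing)  # ['call', 'now', '']
```

The second call shows a side effect worth knowing about: when the text ends (or starts) with whitespace, `re.split` returns an empty string, which then survives the stopword filter. Adding a truthiness check such as `if tok and tok not in STOPWORDS` would be one way to drop it.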

#### GridSearchCV

In machine learning, hyperparameter tuning is the process of finding the best combination of hyperparameter values for a given machine learning algorithm. Hyperparameters are parameters that are not learned from data but are set before the learning process. Examples of hyperparameters include the learning rate, regularization strength, or the number of estimators in an algorithm.

Grid search is a common technique for hyperparameter tuning, where you define a grid of hyperparameter values and systematically search through all possible combinations to find the optimal set of hyperparameters. This is where the GridSearchCV class comes into play.

The GridSearchCV class in scikit-learn provides an implementation of grid search combined with cross-validation. It performs an exhaustive search over the specified hyperparameter grid and evaluates each combination using cross-validation. This allows you to find the best hyperparameters by optimizing a performance metric (e.g., accuracy, F1 score, or mean squared error) on the training data.
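
To get a feel for the cost of an exhaustive search, the combinations can be enumerated with `itertools.product`. The grid values below are the ones this notebook uses for the random forest; the snippet only counts combinations and fits, it does not train anything.

```python
from itertools import product

param_grid = {
    'n_estimators': [10, 150, 300],
    'max_depth': [30, 60, 90, None],
}

# Every combination of one value per hyperparameter.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

n_folds = 5
total_fits = len(combinations) * n_folds

print(len(combinations), "combinations")
print(total_fits, "cross-validation fits")
```

With 3 x 4 = 12 combinations and 5 folds, the search fits 60 models during cross-validation, which is why adding values to a grid quickly becomes expensive.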

In [8]:
# The train_test_split function is a utility provided by scikit-learn that automates the process of splitting a dataset into training and testing subsets.
# It takes the input data and corresponding labels (if applicable) and randomly divides them into two or more sets based on a specified ratio or size.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

#### Split into train/test

In [9]:
# Features are the variables that act as input to the system. Prediction models use these features to make predictions.
# Labels are the target output, i.e. what you are trying to predict. Given features as input, the model produces labels as output.

labels = data['label']

# 60% for training set, 40% for test set 
x_train, x_test, y_train, y_test = train_test_split(data[['body_text', 'body_len', 'punct%']], labels, test_size=0.4, random_state=42)

# Split the held-out 40% in half: 20% of the data for the validation set and 20% for the test set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=42)


# Print the fraction of the full dataset that ends up in each split.
# (Separate names are used so the 'data' DataFrame is not overwritten.)
splits = [y_train, y_test, y_val]
split_labels = ["Train: ", "Test: ", "Validation: "]

for split_label, split in zip(split_labels, splits):
    print(split_label, round(len(split) / len(labels), 2))

Train:  0.6
Test:  0.2
Validation:  0.2
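
The arithmetic behind the 60/20/20 split can be checked with a plain-Python stand-in. This is a sketch of the two-stage idea only, not of how `train_test_split` is implemented; `three_way_split` and `holdout_fraction` are hypothetical names.

```python
import random

def three_way_split(items, holdout_fraction=0.4, seed=42):
    """Shuffle, hold out a fraction, then halve the holdout into validation and test."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    half = len(holdout) // 2
    return train, holdout[:half], holdout[half:]

rows = list(range(5567))  # stand-in for the 5567 messages
train, val, test = three_way_split(rows)
for name, part in [("Train", train), ("Validation", val), ("Test", test)]:
    print(name, len(part), round(len(part) / len(rows), 2))
```

Because only about 13% of the messages are spam (746 of 5567), it may also be worth passing `stratify=` to `train_test_split` so that each split keeps a similar spam ratio; the notebook does not do this.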


#### TfidfVectorizer

In natural language processing (NLP) and text mining, feature extraction is an essential step in preparing text data for machine learning algorithms. The TfidfVectorizer class in scikit-learn is a useful tool for converting text documents into numerical feature vectors based on the TF-IDF (Term Frequency-Inverse Document Frequency) representation.

TF-IDF is a numerical statistic that reflects the importance of a term in a document within a collection or corpus of documents. It considers both the frequency of a term within a document (TF) and its rarity across the entire corpus (IDF). TF-IDF is commonly used as a weighting scheme to represent text data numerically while capturing the relative importance of terms.

The TfidfVectorizer class in scikit-learn combines the functionalities of CountVectorizer (which converts text into a matrix of token counts) and TfidfTransformer (which applies the TF-IDF transformation) into a single step. It tokenizes input text, builds a vocabulary of known terms, and transforms text documents into TF-IDF feature vectors.
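
To make the weighting concrete, here is a small hand-rolled TF-IDF computation. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector); the three toy documents are invented and already tokenized.

```python
import math
from collections import Counter

# Toy corpus of already-cleaned token lists (invented for illustration).
docs = [["free", "entri", "win"], ["date", "sunday"], ["free", "date"]]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(doc):
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(docs[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

In the first document, "free" receives a lower weight than "entri" or "win" because it also appears in another document, which is the rarity effect described above.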

#### Vectorize text by using TfidfVectorizer

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
tfidf_vect_fit = tfidf_vect.fit(x_train['body_text'])

tfidf_train = tfidf_vect_fit.transform(x_train['body_text'])
tfidf_val = tfidf_vect_fit.transform(x_val['body_text'])
tfidf_test = tfidf_vect_fit.transform(x_test['body_text'])

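X_train_vect = pd.concat([x_train[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_train.toarray())], axis=1)
X_val_vect = pd.concat([x_val[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_val.toarray())], axis=1)
X_test_vect = pd.concat([x_test[['body_len', 'punct%']].reset_index(drop=True), 
           pd.DataFrame(tfidf_test.toarray())], axis=1)

# Recent scikit-learn versions reject DataFrames whose column names mix strings and integers,
# so convert the integer TF-IDF column names to strings.
for frame in (X_train_vect, X_val_vect, X_test_vect):
    frame.columns = frame.columns.astype(str)

X_train_vect.head()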

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,6129,6130,6131,6132,6133,6134,6135,6136,6137,6138
0,38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,26,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,55,9.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,26,26.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,68,11.8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Training models on the train set

In [11]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.metrics import precision_recall_fscore_support as score
import time

In [12]:
def print_results(results):
    # Print the best parameters found by the grid search (the best_params_ attribute).
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    # Mean and standard deviation of the cross-validated test score for each combination.
    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']

    # Print each combination's mean score with a +/- two-standard-deviation band.
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [13]:
# rf = RandomForestClassifier(n_estimators=150, max_depth=None, n_jobs=-1)

# start = time.time()
# rf_model = rf.fit(X_train_vect, y_train)
# end = time.time()
# fit_time = (end - start)

In [14]:
# Find the best hyperparameter combination for the random forest classifier using a grid search with cross-validation.
# It trains and evaluates the model on the training data while systematically exploring different hyperparameter settings.
# The resulting gs_fit object provides information about the best hyperparameters and the performance of the model.

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
        'max_depth': [30, 60, 90, None]}
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)

start = time.time()
gs_fit = gs.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]



| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 10.155732 | 0.967798 | 0.230177 | 0.035848 | 60.0 | 150 | {'max_depth': 60, 'n_estimators': 150} | 0.968563 | 0.977545 | 0.97006 | 0.964072 | 0.97006 | 0.97006 | 0.004339 | 1 |
| 7 | 12.565842 | 0.743059 | 0.268149 | 0.025818 | 90.0 | 150 | {'max_depth': 90, 'n_estimators': 150} | 0.974551 | 0.973054 | 0.967066 | 0.961078 | 0.974551 | 0.97006 | 0.005272 | 1 |
| 11 | 23.609831 | 0.78378 | 0.30047 | 0.065713 | | 300 | {'max_depth': None, 'n_estimators': 300} | 0.974551 | 0.971557 | 0.967066 | 0.964072 | 0.968563 | 0.969162 | 0.003618 | 3 |
| 8 | 27.021516 | 1.634266 | 0.427342 | 0.034982 | 90.0 | 300 | {'max_depth': 90, 'n_estimators': 300} | 0.973054 | 0.973054 | 0.965569 | 0.961078 | 0.968563 | 0.968263 | 0.00458 | 4 |
| 5 | 19.608606 | 1.462154 | 0.330776 | 0.051313 | 60.0 | 300 | {'max_depth': 60, 'n_estimators': 300} | 0.973054 | 0.968563 | 0.965569 | 0.962575 | 0.97006 | 0.967964 | 0.003618 | 5 |


In [15]:
gs.best_estimator_

In [16]:
import joblib

# Save the best estimator found by the grid search to a file named "RF_model.pkl" using the joblib library.
joblib.dump(gs.best_estimator_, 'RF_model.pkl')

['RF_model.pkl']

In [17]:
# gb = GradientBoostingClassifier(n_estimators=150, max_depth=11)

# start = time.time()
# gb_model = gb.fit(X_train_vect, y_train)
# end = time.time()
# fit_time = (end - start)

In [18]:
# Perform a grid search to find the best hyperparameter combination for GradientBoostingClassifier using cross-validation.
# It trains and evaluates the model on the training data while systematically exploring different hyperparameter settings. 

gb = GradientBoostingClassifier()
param = {
    'n_estimators': [150], 
    'max_depth': [11],
    'learning_rate': [0.1]
}
clf = GridSearchCV(gb, param, cv=5, n_jobs=-1)

start = time.time()
cv_fit = clf.fit(X_train_vect, y_train)
end = time.time()
fit_time = (end - start)

pd.DataFrame(cv_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_learning_rate | param_max_depth | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 199.016323 | 1.651037 | 0.18527 | 0.040451 | 0.1 | 11 | 150 | {'learning_rate': 0.1, 'max_depth': 11, 'n_est... | 0.965569 | 0.958084 | 0.956587 | 0.958084 | 0.971557 | 0.961976 | 0.005728 | 1 |


In [19]:
clf.best_estimator_

In [20]:
# Saving the best estimator (model) obtained from scikit-learn's grid search or cross-validation to a file named "GB_model.pkl" using the joblib library.
joblib.dump(clf.best_estimator_, 'GB_model.pkl')

['GB_model.pkl']

In [21]:
models = {}

for mdl in ['RF', 'GB']:
    models[mdl] = joblib.load('{}_model.pkl'.format(mdl))

In [22]:
models

{'RF': RandomForestClassifier(max_depth=60, n_estimators=150),
 'GB': GradientBoostingClassifier(max_depth=11, n_estimators=150)}

#### Evaluate models on the validation set

In [23]:
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score
# Import the module rather than the function so the earlier time.time() calls keep working if cells are re-run.
import time

def evaluate_model(name, model, features, labels):
    # Time only the prediction step to report latency.
    start = time.time()
    pred = model.predict(features)
    end = time.time()
    accuracy = round(accuracy_score(labels, pred), 3)
    # average='weighted' averages the per-class scores, weighted by the number of true samples in each class.
    precision = round(precision_score(labels, pred, average='weighted'), 3)
    recall = round(recall_score(labels, pred, average='weighted'), 3)
    print('{} -- Accuracy: {} / Precision: {} / Recall: {} / Latency: {}ms'.format(name,
                                                                                   accuracy,
                                                                                   precision,
                                                                                   recall,
                                                                                   round((end - start)*1000, 1)))

In [24]:
for name, mdl in models.items():
    evaluate_model(name, mdl, X_val_vect, y_val)

RF -- Accuracy: 0.978 / Precision: 0.979 / Recall: 0.978 / Latency: 267.4ms
GB -- Accuracy: 0.967 / Precision: 0.966 / Recall: 0.967 / Latency: 170.9ms
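
In the output above, recall equals accuracy for both models. That is expected: recall averaged over classes with support weights is algebraically the same quantity as accuracy. The sketch below checks this on a small invented example in plain Python; `per_class_recall` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return ({label: recall}, {label: support}) for each true label."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: hits[label] / support[label] for label in support}
    return recalls, support

# Invented labels: 8 ham, 2 spam; one spam message is misclassified as ham.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 8 + ["ham", "spam"]

recalls, support = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
weighted_recall = sum(recalls[c] * support[c] for c in support) / len(y_true)

print("accuracy:", accuracy)
print("weighted recall:", weighted_recall)
print("spam recall:", recalls["spam"])
```

Here the overall figure is 0.9 while only half of the spam is caught. For a spam filter, the spam-class scores are often the more informative ones; with scikit-learn they can be requested through `pos_label='spam'` and the default `average='binary'` in `precision_score` and `recall_score`.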


#### Evaluate the best model on the test set

In [25]:
evaluate_model('Random Forest', models['RF'], X_test_vect, y_test)

Random Forest -- Accuracy: 0.971 / Precision: 0.972 / Recall: 0.971 / Latency: 195.9ms
