**Executive Summary**:

1. The SMS spam detector achieves **98%** accuracy on the held-out test set (**100%** precision, **84%** recall, **91%** F1 score).
2. The text is cleaned before being passed to a RandomForest model. Preprocessing consists of lowercasing, tokenizing, removing stopwords, lemmatizing, and finally converting to TF-IDF vectors. Other models were also tried: SupportVectorClassifier, XGBoost, LightGBM, Logistic Regression, and KNearestNeighborClassifier. PR-AUC is used for model selection and hyperparameter tuning.
3. The most common words in the spam and ham classes are visualized, and topics are modelled with Latent Dirichlet Allocation (LDA).

In [None]:
import numpy as np 
import pandas as pd
import re
import os

import nltk
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('punkt')  # needed by nltk.word_tokenize (newer NLTK releases use 'punkt_tab')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from wordcloud import WordCloud
from gensim.models import LdaMulticore
import gensim.corpora as corpora
from pprint import pprint

# !pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Load data

In [None]:
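# The Kaggle spam.csv is typically not valid UTF-8; latin-1 avoids a UnicodeDecodeError
df = pd.read_csv('/kaggle/input/sms-spam-collection-dataset/spam.csv', encoding='latin-1')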

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df = df.rename(columns={'v1':'tgt', 'v2':'text'})[['tgt','text']]

# EDA

In [None]:
df.tgt.value_counts()

The classes are imbalanced, though not severely.
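
To see why accuracy alone can mislead on imbalanced labels, here is a minimal sketch with made-up counts (the 87/13 split is an assumption for illustration, not the dataset's actual distribution). A classifier that always predicts the majority class already scores high accuracy while catching no spam.

```python
from collections import Counter

# Hypothetical label counts for illustration only (not the real value_counts output)
labels = ["ham"] * 87 + ["spam"] * 13

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

# That same model flags no message as spam, so it has zero true positives
true_positives = 0
spam_recall = true_positives / counts["spam"]

print(majority_class, baseline_accuracy, spam_recall)
```

This is the motivation for looking at precision, recall, and PR-AUC rather than accuracy alone.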

In [None]:
df[df.tgt=='ham'].sample(10)

In [None]:
df[df.tgt=='spam'].sample(10)

# Preprocess data

Preprocess and clean the data before feeding it to the model. The process involves:
1. Extract only alphabetical characters and convert them to lower case
2. Tokenize the text
3. Remove English stopwords plus common SMS short forms (u, ur, r, n)
4. Lemmatize each word
5. Convert the text into TF-IDF vectors
6. Encode the target as spam-->1, ham-->0
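
As a rough illustration of the cleaning, tokenizing, and stopword steps, the sketch below runs them on one made-up message using only the standard library. The `preprocess` helper, the sample message, and the tiny `STOP_WORDS` set are assumptions for this example; the notebook itself uses NLTK's tokenizer and full English stopword list, and lemmatization is left out here.

```python
import re

# Tiny illustrative stopword set (hypothetical; NLTK's English list is much longer)
STOP_WORDS = {"you", "have", "a", "to", "now", "u", "ur", "r", "n"}

def preprocess(text):
    """Keep letters only, lowercase, split on whitespace, drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower()
    tokens = letters_only.split()
    return [token for token in tokens if token not in STOP_WORDS]

message = "WINNER!! You have won a 1000 prize. Call 09061701461 to claim now"
print(preprocess(message))
# ['winner', 'won', 'prize', 'call', 'claim']
```

Digits and punctuation disappear in the first step, so phone numbers and prize amounts do not survive as features.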

In [None]:
def clean_data(text):
    """
    Extract alphabetical characters and convert them into lower case.
    """
    out = re.sub('[^a-zA-Z]', ' ', text) 
    out = out.lower() 
    out = out.split()
    out = ' '.join(out)
    return out

def tokenize_word(text):
    """
    Convert sentence into list of tokens
    """
    return nltk.word_tokenize(text)

def remove_stopwords(text):
    """
    Remove English stopwords from text
    """
    stop_words = set(stopwords.words("english")+['u','ur','r','n']) 
    filtered_text = [word for word in text if word not in stop_words]
    return filtered_text

def lemmatize_word(text):
    """
    Convert each token into its base form, treating every token as a verb (pos='v')
    """
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in text]
    return lemmas

def get_processed_tokens(text):
    """
    Apply the four preprocessing steps above in order.
    """
    text = clean_data(text)
    text = tokenize_word(text)
    text = remove_stopwords(text)
    text = lemmatize_word(text)
    return text

In [None]:
df['processed_text'] = df['text'].apply(get_processed_tokens)

In [None]:
# Create corpus of text for TF-IDF: one space-joined string per message
corpus = [' '.join(tokens) for tokens in df["processed_text"]]

In [None]:
# Vectorize the corpus
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
X.shape
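
For intuition about the numbers in `X`, the sketch below computes TF-IDF by hand on three made-up documents, using the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization that `TfidfVectorizer` applies by default. The `tfidf` helper and the toy documents are assumptions for illustration; unlike the real vectorizer, this version splits on whitespace only and does not drop single-character tokens.

```python
import math
from collections import Counter

# Toy documents (hypothetical), already cleaned and space-joined like `corpus`
docs = ["free prize call", "call later", "free free prize"]

def tfidf(docs):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({token for doc in tokenized for token in doc})
    # Document frequency: number of documents containing each term
    df = Counter(token for doc in tokenized for token in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # L2-normalize the row; fall back to 1.0 for an all-zero row
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document, "later" gets a larger weight than "call" because it appears in fewer documents.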

In [None]:
# Process label
df['label'] = 0
df.loc[df.tgt=='spam', 'label']=1
df.label.value_counts()

In [None]:
y = df.label.values

In [None]:
y.shape

# Find common keywords/topics

This section is separate from the spam detection task. It models the topics in the SMS messages.

In [None]:
# Create a WordCloud object
wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color='steelblue')

In [None]:
# All corpus
long_string = ' '.join(corpus)
wordcloud.generate(long_string)
wordcloud.to_image()

In [None]:
# Spam corpus
spam_corpus= []
for i in df[df.tgt=='spam']["processed_text"]:
    msg = ' '.join([row for row in i])
    spam_corpus.append(msg)
    
long_string = ' '.join(spam_corpus)
wordcloud.generate(long_string)
wordcloud.to_image()

In [None]:
# Ham corpus
ham_corpus= []
for i in df[df.tgt=='ham']["processed_text"]:
    msg = ' '.join([row for row in i])
    ham_corpus.append(msg)

long_string = ' '.join(ham_corpus)
wordcloud.generate(long_string)
wordcloud.to_image()

In [None]:
id2word = corpora.Dictionary(df["processed_text"])
texts = df["processed_text"]
tdf = [id2word.doc2bow(text) for text in texts] #Term Document Frequency

In [None]:
num_topics = 8
lda_model = LdaMulticore(corpus=tdf,id2word=id2word,num_topics=num_topics, random_state=10)
# Print the keywords for each of the num_topics topics
pprint(lda_model.print_topics())
doc_lda = lda_model[tdf]

In [None]:
# Visualize the topics
pyLDAvis.enable_notebook()
LDAvis_prepared = pyLDAvis.gensim_models.prepare(lda_model, tdf, id2word)
LDAvis_prepared

Based on the model above, most ham topics revolve around deciding when and where to go or meet, affirmation (ok), getting help, expressing love or missing someone, etc.
For the spam class, the topics involve requesting some action from the recipient (call, text, reply) or claiming a prize, free stuff, or cash. Topic 2 in the visualization above sits particularly far from the other topics on the 2-D principal-component plot and appears to be dominated by spam keywords.

# Model training

We hold out 30% of the dataset for evaluating the final model's performance.
Since the dataset is quite small, we test several models and pick the algorithm that performs best under cross-validation on the training split.

We first compare the algorithms using their default hyperparameters. The best algorithm is then tuned further.
We optimize PR-AUC (the average_precision score) because the classes are imbalanced.
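
As a rough illustration of the metric, the sketch below computes average precision by hand: rank the messages by predicted score and average the precision observed at each true spam. The `average_precision` helper and the toy labels and scores are assumptions for this example, and the shortcut assumes all scores are distinct; the notebook relies on scikit-learn's built-in scorer instead.

```python
def average_precision(y_true, scores):
    """Mean of precision-at-rank over the positive examples (distinct scores assumed)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` predictions
    return total / n_positive

# Toy data (hypothetical): 2 spam messages out of 6
y_true = [1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

ap = average_precision(y_true, scores)
positive_rate = sum(y_true) / len(y_true)
print(round(ap, 3), round(positive_rate, 3))
# 0.833 0.333
```

A model that ranks at random scores roughly the positive rate, so on imbalanced data this metric leaves more room to separate good models than accuracy does.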


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

In [None]:
rfc = RandomForestClassifier(random_state=42)
svc = SVC(random_state=42)
xgb = XGBClassifier(random_state=42)
lgbm = LGBMClassifier(random_state=42)
lr = LogisticRegression(random_state=42)
knc = KNeighborsClassifier()

models=[rfc,svc,xgb,lgbm,lr,knc]

In [None]:
rows = []
metrics = ['f1','precision','recall','average_precision','roc_auc']
for model in models:
    scores = cross_validate(model,X_train,y_train,scoring=metrics,n_jobs=-1,cv=3)
    
    # Average each metric over the 3 cross-validation folds
    tmp = {'model': type(model).__name__}
    for test_metric in ['test_'+metric for metric in metrics]:
        tmp[test_metric] = np.mean(scores[test_metric])
    
    rows.append(tmp)

# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
model_perf = pd.DataFrame(rows)
model_perf.sort_values('test_average_precision', ascending=False)

Based on the table above, the best-performing algorithm is **RandomForestClassifier**. It is selected for its **extremely high precision** (almost perfect) and **good recall**. In this case, I favor a model with extremely high precision in order to minimize false positives, i.e. ham messages misclassified as spam. If the goal were instead to maximize recall, the alternative would be **LGBMClassifier**, which has the highest F1 score.

Tree-based algorithms performed better in this case, possibly because the TF-IDF feature matrix is sparse.

In [None]:
params = {
    'n_estimators': [25,50,100,200,400,800],
    'min_samples_split':[2,3,4]
}

search = GridSearchCV(rfc, params, scoring='average_precision', n_jobs=-1, cv=5, refit=True)
# This process might take a while
search_result = search.fit(X_train, y_train)
print(search_result.best_params_)
print(search_result.best_score_)
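
A quick way to estimate why the search takes a while is to count the model fits it implies. The sketch below assumes the same grid and `cv=5` as above; `cv_folds`, `combinations`, and `n_fits` are names introduced only for this example.

```python
from itertools import product

params = {
    'n_estimators': [25, 50, 100, 200, 400, 800],
    'min_samples_split': [2, 3, 4],
}
cv_folds = 5

# Every combination of hyperparameter values is tried once per fold
combinations = list(product(*params.values()))
n_fits = len(combinations) * cv_folds + 1  # +1 for the final refit (refit=True)

print(len(combinations), n_fits)
# 18 91
```

The largest forests dominate the runtime, so trimming the upper end of `n_estimators` is the cheapest way to shorten the search.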

In [None]:
best_model = search_result.best_estimator_

# best_model = RandomForestClassifier(random_state=42, min_samples_split=3, n_estimators=800)
# best_model.fit(X_train,y_train)

# Evaluate model on test set

In [None]:
y_test_pred = best_model.predict(X_test)

In [None]:
print('Accuracy score: %.2f' %(accuracy_score(y_test, y_test_pred)))
p,r,f,_ = precision_recall_fscore_support(y_test, y_test_pred, average='binary')
print("Precision= %.2f, Recall = %.2f, F1-score = %.2f" % (p,r,f))

On the test set, the model has perfect precision (100%) and 84% recall.
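
To show how these figures relate to the F1 score, here is a minimal sketch that derives precision, recall, and F1 from confusion counts. The counts are hypothetical, chosen only so that the ratios match the reported 100% precision and 84% recall; they are not the notebook's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical counts: 84 spam caught, 16 spam missed, no ham flagged as spam
p, r, f = precision_recall_f1(tp=84, fp=0, fn=16)
print("Precision= %.2f, Recall = %.2f, F1-score = %.2f" % (p, r, f))
# Precision= 1.00, Recall = 0.84, F1-score = 0.91
```

Because F1 is the harmonic mean of the two, it sits closer to the lower value (recall) than a plain average would.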
