For this project, we'll use messages already labeled as spam or not spam to generate features with NLP, then train a supervised machine learning model to classify a text message as spam or not spam.

#### Data Pre-processing

In [1]:
# Load the message dataset
import numpy as np
import pandas as pd

df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [2]:
# Look at distribution of labels in the dataset
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

4825 out of 5572 messages, or 86.6%, are not spam. This means that any text classification model we create has to exceed 86.6% accuracy to be better than a majority class classifier, which always predicts ham.
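
The baseline above can be checked with a short sketch. It assumes only the label counts reported by `value_counts()`. The placeholder feature column and the variable names are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Rebuild the label counts reported above (4825 ham, 747 spam)
labels = np.array(['ham'] * 4825 + ['spam'] * 747)

# The dummy model ignores its features, so a placeholder column is enough
placeholder_X = np.zeros((len(labels), 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(placeholder_X, labels)

baseline_accuracy = accuracy_score(labels, baseline.predict(placeholder_X))
print(round(baseline_accuracy, 3))
```

Any real model should be compared against this number rather than against 50%, because the classes are heavily imbalanced.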

Let's split the data into train/test sets.

In [7]:
from sklearn.model_selection import train_test_split

X = df['message'] 
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
print(X_train.shape)
print(X_test.shape)

(3733,)
(1839,)


Now we'll use Scikit-learn's TfidfVectorizer to build a vocabulary of words and transform documents into feature vectors. It performs text preprocessing (lowercasing by default) and tokenizing. It can also filter out stop words, but only when the `stop_words` parameter is set, which is not the case here. It calculates Term Frequency times Inverse Document Frequency (tf-idf), which down-weights words that occur in many documents, and converts the data into feature vectors.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train)
X_train_tfidf.shape

(3733, 7082)
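
To make the tf-idf weighting more concrete, here is a minimal sketch on three made-up messages (hypothetical, not taken from the SMS dataset). It uses the same default `TfidfVectorizer` settings as the cell above. The `toy_*` names are assumptions introduced for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages, not taken from the SMS dataset
toy_messages = [
    "free prize call now",
    "call me later",
    "are you free later",
]

toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_messages)

# One row per message, one column per distinct token
print(sorted(toy_vectorizer.vocabulary_))
print(toy_tfidf.shape)

# In the first message, "prize" occurs in one document and "call" in two,
# so "prize" should receive the larger tf-idf weight
first_row = toy_tfidf[0].toarray().ravel()
prize_weight = first_row[toy_vectorizer.vocabulary_['prize']]
call_weight = first_row[toy_vectorizer.vocabulary_['call']]
print(prize_weight > call_weight)
```

Common words such as "you" and "are" stay in the vocabulary here, since stop-word filtering is off by default.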

#### Train an SVM Model

In [10]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [12]:
# Build a pipeline to apply the same steps on test data while making predictions
from sklearn.pipeline import Pipeline

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [13]:
# Use pipeline to make predictions on test set
predictions = text_clf.predict(X_test)

In [14]:
# Look at the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [15]:
# Look at the classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

   micro avg       0.99      0.99      0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [16]:
# Calculate the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037
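
The reported metrics can be cross-checked directly against the confusion matrix. The sketch below assumes scikit-learn's usual layout: rows are true labels, columns are predicted labels, and labels are sorted as ham, spam. That layout is consistent with the support counts in the classification report. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = true labels, columns = predictions,
# in the order (ham, spam)
cm = np.array([[1586, 7],
               [12, 234]])

true_ham, false_spam = cm[0]   # ham messages: kept as ham / flagged as spam
false_ham, true_spam = cm[1]   # spam messages: missed / caught

spam_precision = true_spam / (true_spam + false_spam)
spam_recall = true_spam / (true_spam + false_ham)
accuracy = (true_ham + true_spam) / cm.sum()

print(f"spam precision: {spam_precision:.2f}")
print(f"spam recall:    {spam_recall:.2f}")
print(f"accuracy:       {accuracy:.4f}")
```

In other words, 7 legitimate messages were flagged as spam and 12 spam messages slipped through.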


As the performance metrics show, we're able to build a quite accurate model (about 99% accuracy versus the 86.6% majority class baseline) by applying Natural Language Processing to the message text. Performance may be improved further through hyperparameter tuning or other classification techniques.
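
As one possible starting point for hyperparameter tuning, the sketch below runs a small grid search over the same pipeline. It uses a hypothetical toy dataset so that it is self-contained. In the notebook, `X_train` and `y_train` would take the place of `toy_X` and `toy_y`. The grid values are assumptions for illustration, not tuned recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X_train / y_train
toy_X = [
    "win a free prize now", "free entry to win cash", "claim your free reward",
    "urgent call to claim prize", "cash prize waiting call now", "free tickets text win",
    "see you at lunch", "are you home yet", "call me when you arrive",
    "thanks for the help", "running late see you soon", "ok talk later",
]
toy_y = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC())])

# Illustrative grid: step name + double underscore + parameter name
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(pipeline, param_grid, cv=3, scoring='accuracy')
search.fit(toy_X, toy_y)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, which keeps the validation folds out of the vocabulary and idf statistics.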