<a href="https://colab.research.google.com/github/isegura/TextClassification/blob/master/SpamDetection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification: spam detection



This notebook is about text classification. Spam detection is a binary text classification problem because there are only two classes: 'spam' and 'ham' (non-spam).

The dataset can be downloaded from https://archive.ics.uci.edu/ml/datasets/sms+spam+collection

## Loading data

First, we need to mount the Google Drive folder. To do this, please run the following cell and follow the instructions.



In [0]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


The file **SMSSpamCollection** contains the dataset. Each line represents a message. Each line starts with the label, followed by a tab and then the message text. 

To load the file, you can read it line by line and extract the label and the text with the **split** method:

In [0]:
sst_home='drive/My Drive/Colab Notebooks/'
#modify this path 
path=sst_home+'TESI/3-TextClasification/data/SMSSpamCollection'

messages=[]
labels=[]
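with open(path, "r") as f:  #the file is closed automatically at the end of the block
  for line in f:
    #each line has the form: label<TAB>message
    data=line.rstrip('\n').split('\t', 1)
    labels.append(data[0])
    messages.append(data[1])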

print(len(messages),len(labels))

5574 5574
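
The parsing step can be tried without the dataset. The following is a minimal sketch that applies the same split to two in-memory lines in the label, tab, message format described above. The sample lines and the names `sample_lines`, `sample_labels` and `sample_messages` are made up for illustration. Passing `1` as the second argument of `split` keeps any further tab inside the message text.

```python
# Two hypothetical lines in the same format as SMSSpamCollection: label<TAB>message
sample_lines = [
    "ham\tSee you at lunch\n",
    "spam\tWin a prize\tnow\n",  # this message contains a tab itself
]

sample_labels = []
sample_messages = []
for line in sample_lines:
    # drop the trailing newline, then split only on the first tab
    label, message = line.rstrip("\n").split("\t", 1)
    sample_labels.append(label)
    sample_messages.append(message)

print(sample_labels)
print(sample_messages)
```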


However, several Python libraries make it easy to load files whose lines contain fields separated by tabs or commas.

One of the most popular libraries for loading tabular data is **Pandas**. The following cell shows how you can load the data using it:

In [0]:
import pandas
import csv

messages = pandas.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"])

#shows the first messages
print(messages.head())

  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


## Representing texts

Each message in the dataset must be represented by a set of features. 
The most common approach for representing texts in text classification is the bag-of-words model. In this model, each text is represented by a vector with one element per word of the vocabulary extracted from the dataset. Each element of the vector is the frequency of the corresponding word in the text.
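
As an illustration of the model, the following sketch builds bag-of-words vectors by hand for a made-up corpus of three short texts. The corpus, the whitespace tokenization and the helper `to_vector` are assumptions for this example only; the notebook itself relies on sklearn for this step.

```python
from collections import Counter

# Hypothetical toy corpus, already lowercased and split on whitespace
texts = ["free prize free entry", "see you at lunch", "free lunch"]
tokenized = [text.split() for text in texts]

# The vocabulary is the sorted set of unique words in the corpus
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def to_vector(tokens):
    # one position per vocabulary word, holding its frequency in the text
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Every vector has the length of the vocabulary, and most of its positions are 0 because a text only uses a few of the words.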
 
### Cleaning
 
In order to reduce the lexical variability, we use stems instead of tokens. Moreover, we remove stop words, words that start with a digit or a special character, and words with a length lower than 3. 

We use the library **nltk** to obtain the stop words and the stems of the tokens. 

The **cleanText** function takes a text as input, removes the stop words, stems the remaining tokens, and discards the stems with a length lower than 3 or that do not start with a letter. The list of stems is returned. Note that the regular expression is applied with `match`, which only checks the beginning of each token, so a token such as 'crazy..' is kept.

In [0]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
stopwords_en = stopwords.words("english")

from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
import re


stemmer = PorterStemmer()  #created once and reused for every token

def cleanText(text):
    text=str(text).lower()
    
    #(1) tokenize the text
    tokens=word_tokenize(text)

    #(2) remove the stop words
    tokens = [word for word in tokens if word not in stopwords_en]

    #(3) obtain the stems
    tokens = [stemmer.stem(word) for word in tokens]

    #(4) finally, keep the stems with len >= 3 that start with a letter.
    #    p.match only anchors at the start of the token, so 'crazy..' is kept;
    #    p.fullmatch would also reject tokens containing numbers, punctuation, etc.
    min_length = 3
    p = re.compile('[a-zA-Z]+')
    filtered_tokens=[]
    for token in tokens:
        if len(token)>=min_length and p.match(token):
            filtered_tokens.append(token)
            
    return filtered_tokens

  
  



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Let us apply this function to the first texts of the dataset:

In [0]:

#messages.message.head().apply(cleanText)
aux=messages.message.head()
for text in aux:
  print(text)
  print(cleanText(text))
  print()

Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
['jurong', 'point', 'crazy..', 'avail', 'bugi', 'great', 'world', 'buffet', 'cine', 'got', 'amor', 'wat']

Ok lar... Joking wif u oni...
['lar', 'joke', 'wif', 'oni']

Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
['free', 'entri', 'wkli', 'comp', 'win', 'cup', 'final', 'tkt', 'may', 'text', 'receiv', 'entri', 'question', 'std', 'txt', 'rate', 'appli']

U dun say so early hor... U c already then say...
['dun', 'say', 'earli', 'hor', 'alreadi', 'say']

Nah I don't think he goes to usf, he lives around here though
['nah', "n't", 'think', 'goe', 'usf', 'live', 'around', 'though']
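
The outputs above still contain tokens such as 'crazy..' and "n't". This happens because `match` only tests the beginning of a token. The sketch below compares `match` with `fullmatch` on a few tokens; 'win4cash' is a made-up token added for illustration. Switching the function to `fullmatch` is only a suggestion: it would change the vocabulary and therefore the figures reported in the rest of the notebook.

```python
import re

p = re.compile('[a-zA-Z]+')

# Tokens taken from the outputs above, plus one made-up token with a digit
tokens = ['crazy..', "n't", 'point', 'win4cash']

# match() only requires the pattern to match at the start of the token
kept_with_match = [t for t in tokens if p.match(t)]
# fullmatch() requires the whole token to consist of letters
kept_with_fullmatch = [t for t in tokens if p.fullmatch(t)]

print(kept_with_match)
print(kept_with_fullmatch)
```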



### Vectorization

The **CountVectorizer** class of the sklearn library (https://scikit-learn.org/) allows us to obtain the vocabulary of the dataset (the set of unique words). We can see that the size of the vocabulary is 6,928 unique words (stems).


In [0]:
from sklearn.feature_extraction.text import CountVectorizer
bow = CountVectorizer(analyzer=cleanText).fit(messages['message'])
#print(bow.vocabulary_)
print(len(bow.vocabulary_))

#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
feature_names = bow.get_feature_names_out()

#we show the words with indexes 0, 253, 1291, 1813 and 2375 in the vocabulary
for index in [0, 253, 1291, 1813, 2375]:
    print(feature_names[index])

6928
a-ffection
anymor
cri
enough
gon


Now, we can transform any message to a vector based on this vocabulary. To do this, we will use the method **transform**:



In [0]:
message=messages['message'][10]
print(message)
vector = bow.transform([message])
print(vector.shape)
print(vector)



I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
(1, 6928)
  (0, 253)	1
  (0, 1291)	1
  (0, 1813)	1
  (0, 2375)	1
  (0, 2686)	1
  (0, 3868)	1
  (0, 5479)	1
  (0, 5694)	1
  (0, 5840)	1
  (0, 6070)	1
  (0, 6110)	1
  (0, 6513)	1
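
Each pair in the output above is a (row, column) position followed by the count stored there, where the column is the index of a word in the vocabulary. The following sketch, on a made-up corpus and with the default analyzer instead of `cleanText`, shows one way to read such a vector back as words. The names `toy_bow`, `index_to_word` and `decoded` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; the default analyzer is used instead of cleanText
corpus = ["free prize now", "see you now", "free free entry"]
toy_bow = CountVectorizer().fit(corpus)

# Invert the word -> column mapping so that columns can be read as words
index_to_word = {index: word for word, index in toy_bow.vocabulary_.items()}

vector = toy_bow.transform(["free entry free"])
# vector is a 1 x V sparse matrix: collect its non-zero columns and their counts
decoded = {index_to_word[int(col)]: int(vector[0, col]) for col in vector.nonzero()[1]}
print(decoded)
```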


Now, we can apply the model to the whole collection in order to transform its messages into vectors. 

The property **nnz** gives us the number of non-zero elements in the matrix. You can see that only 0.11% of the elements in the matrix are different from 0 (the value that the cell prints under the label 'sparsity' is this fraction of non-zero elements). Therefore, the matrix is very sparse. 

In [0]:
matrix_bow = bow.transform(messages['message'])
print('sparse matrix shape:', matrix_bow.shape)
print('total elements:',matrix_bow.shape[0] * matrix_bow.shape[1])
print( 'number of non-zeros:', matrix_bow.nnz)
print( 'sparsity: %.2f%%' % (100.0 * matrix_bow.nnz / (matrix_bow.shape[0] * matrix_bow.shape[1])))

sparse matrix shape: (5574, 6928)
total elements: 38616672
number of non-zeros: 40774
sparsity: 0.11%
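
The arithmetic behind these figures can be checked with plain Python. The sketch below only reuses the numbers printed above and separates the density (fraction of non-zero elements) from the sparsity (fraction of zero elements).

```python
# Figures copied from the output above
n_messages, vocabulary_size = 5574, 6928
non_zeros = 40774

total_elements = n_messages * vocabulary_size
density = non_zeros / total_elements   # fraction of non-zero elements
sparsity = 1.0 - density               # fraction of zero elements

print('total elements:', total_elements)
print('density: %.2f%%' % (100.0 * density))
print('sparsity: %.2f%%' % (100.0 * sparsity))
print('average non-zeros per message: %.1f' % (non_zeros / n_messages))
```

On average, a message activates only a handful of the 6,928 columns, which is why a sparse matrix format is used.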


#### Dealing with the sparsity problem: TF-IDF model.

More than 99% of the elements in the matrix are 0, and the remaining ones are raw counts that give the same importance to every word. The TF-IDF model does not change which elements are 0. Instead of using only the frequency of each word in a text, it also takes into account how many texts of the whole collection contain the word. In this way, words that are very common in many texts get a low weight (actually, they are not discriminative words).

TF-IDF for a term can be obtained with the following equation:

$\displaystyle \mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D) =
\frac{f(t,d)}{\max\{f(w,d) : w \in d\}} \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|}$


where t is the word, d is the document and D the collection of texts.
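
To make the equation concrete, here is a minimal sketch that evaluates it by hand on a made-up collection of four tokenized documents. It assumes that f(t,d) is the number of times t occurs in d, that the logarithm is natural, and that t occurs in at least one document. The function names `tf`, `idf` and `tfidf` are chosen for this example.

```python
import math

# Hypothetical toy collection D of already tokenized documents
D = [
    ["free", "prize", "free"],
    ["see", "you", "soon"],
    ["free", "lunch", "soon"],
    ["call", "free", "later"],
]

def tf(t, d):
    # frequency of t in d, divided by the frequency of the most common word in d
    return d.count(t) / max(d.count(w) for w in d)

def idf(t, D):
    # log of (number of documents / number of documents that contain t)
    return math.log(len(D) / sum(1 for d in D if t in d))

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("free", D[0], D))   # "free" appears in 3 of the 4 documents
print(tfidf("prize", D[0], D))  # "prize" appears in 1 of the 4 documents
```

Although "free" is the most frequent word of the first document, "prize" gets the higher weight because it is rarer in the collection.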


Fortunately, the sklearn library does this task for us. You can use the **TfidfTransformer** class, which takes a matrix created with the bow approach and transforms it into a new matrix based on TF-IDF. Note that sklearn uses a slightly different, smoothed variant of the equation above. Moreover, L2 normalization is applied. A vector is normalized by dividing each of its elements by the norm of the vector: 

$\hat{v} = \frac{v}{norm(v)}$

where norm is usually the L2 norm  (http://mathworld.wolfram.com/L2-Norm.html)

<img src='https://wikimedia.org/api/rest_v1/media/math/render/svg/348cef86ef91aa2d9a7151031a4fb80578090c4d'/>
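
The normalization step can be sketched in a few lines of plain Python. The vector `v` is a made-up example and `l2_normalize` is a hypothetical helper, not part of sklearn.

```python
import math

def l2_normalize(v):
    # divide every element by the L2 norm (square root of the sum of squares)
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

v = [3.0, 4.0, 0.0]          # hypothetical tf-idf weights of one message
v_hat = l2_normalize(v)
print(v_hat)
# the normalized vector has norm 1, up to floating-point rounding
print(math.sqrt(sum(x * x for x in v_hat)))
```

After normalization, long and short messages produce vectors of the same length, so their weights are comparable.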


The following cell shows how to fit a tf-idf model on the bow matrix and then transform that matrix into its tf-idf equivalent:

In [0]:
from sklearn.feature_extraction.text import TfidfTransformer

#first, we train the transformer to tf-idf model
tfidf_transformer = TfidfTransformer().fit(matrix_bow)
print(tfidf_transformer.idf_[bow.vocabulary_['alreadi']])

#now, we transform the bow matrix to a new matrix based on tf-idf
tfidf_vectors = tfidf_transformer.transform(matrix_bow)
print(tfidf_vectors.shape)

5.137411226596179
(5574, 6928)
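
The idf value printed above can be related to a document frequency. The sketch below assumes the default settings of TfidfTransformer (`smooth_idf=True`), for which the sklearn documentation gives idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents that contain t. It inverts that formula for the stem 'alreadi'; the helper `smoothed_idf` is hypothetical.

```python
import math

# Values copied from the outputs above
n_documents = 5574                 # rows of the bow matrix
idf_alreadi = 5.137411226596179    # tfidf_transformer.idf_ for 'alreadi'

def smoothed_idf(df, n):
    # idf as documented for TfidfTransformer with smooth_idf=True
    return math.log((1 + n) / (1 + df)) + 1

# Invert the formula to recover the document frequency of 'alreadi'
df_alreadi = (1 + n_documents) / math.exp(idf_alreadi - 1) - 1
print(round(df_alreadi))
print(smoothed_idf(round(df_alreadi), n_documents))
```

The second line printed should reproduce the idf value reported by the transformer if the assumption about the defaults holds.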


## Training 

Now that we know how to represent texts using the tf-idf model, we can train a classifier. In this tutorial, we will use the **Naive Bayes** classifier, which is implemented in the sklearn library. 

As a first step, we must divide the dataset into training and test sets with an 80-20 ratio. To do this, we use the function **train_test_split** of the package **model_selection**:

In [0]:
from sklearn.model_selection import cross_val_score, train_test_split 

msg_train, msg_test, label_train, label_test = train_test_split(messages['message'], messages['label'], test_size=0.2)
print(len(msg_train), len(msg_test), len(msg_train) + len(msg_test))

4459 1115 5574
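
The split above is not seeded, so the messages that end up in each set change between runs. As a sketch on made-up data, the following shows the `random_state` and `stratify` parameters of `train_test_split`; the value 42 is arbitrary. Adding them to the notebook is only a suggestion and would change the figures reported in the following cells.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 15 ham and 5 spam messages
toy_messages = ['message %d' % i for i in range(20)]
toy_labels = ['ham'] * 15 + ['spam'] * 5

# random_state fixes the shuffle; stratify keeps the ham/spam ratio in both sets
m_train, m_test, l_train, l_test = train_test_split(
    toy_messages, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels)

print(len(m_train), len(m_test))
print(list(l_test).count('spam'), 'spam messages in the test set')
```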


Now, we are going to train a tf-idf model using the training dataset. Then, we will apply this tf-idf model to transform the training and test sets. Remember that first you need to obtain the bow model.


In [0]:
bow = CountVectorizer(analyzer=cleanText).fit(msg_train)
#transform the training set to bow model
bow_train = bow.transform(msg_train)
#transform the test set to bow model
bow_test=bow.transform(msg_test)


#learn the tf-idf model from the training bow
tfidf_transformer = TfidfTransformer().fit(bow_train)
#transform the training set to tf-idf model
tfidf_train = tfidf_transformer.transform(bow_train)
#transform the test set to tf-idf model
tfidf_test = tfidf_transformer.transform(bow_test)

print('matrices obtained')

matrices obtained


Now we can train our classifier. You need to create an instance of the **MultinomialNB** class, which implements the multinomial **Naive Bayes** classifier.

In [0]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
%time nb.fit(tfidf_train, label_train)

CPU times: user 19.2 ms, sys: 32 µs, total: 19.2 ms
Wall time: 25.8 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Evaluation  

Once the model has been trained, you can apply it on the test set in order to obtain the predictions. 

We can compare these predictions to the gold labels of the test set (label_test) in order to know the performance of the classifier. 

The metrics used to evaluate text classification systems are: 

- **Recall** refers to the ability of a model to find all the relevant cases within a dataset. Its equation is: $recall = \frac{TP}{TP + FN}$

- **Precision** refers to the ability of a model to identify only the relevant data points. Its equation is: $precision = \frac{TP}{TP + FP}$

Here TP, FP and FN are the numbers of true positives, false positives and false negatives.

In some cases, we might want to maximize either recall or precision at the expense of the other metric. However, in many cases we want to find an optimal blend of precision and recall. To do this, we can combine the two metrics using the **F1 score**, which is the harmonic mean of precision and recall:
$F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$


Sklearn provides several functions which help us to measure the performance of the classifier. For example, the function **confusion_matrix** returns the confusion matrix. 

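In binary classification, the count of true negatives is $C_{0,0}$, false negatives is $C_{1,0}$, true positives is $C_{1,1}$ and false positives is $C_{0,1}$. The labels are sorted alphabetically, so here 'ham' is the negative class (index 0) and 'spam' is the positive class (index 1).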



In [0]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

predictions = nb.predict(tfidf_test)

print( '(row=expected, col=predicted)')
print( 'confusion matrix\n', confusion_matrix(label_test, predictions))

tn, fp, fn, tp = confusion_matrix(label_test, predictions).ravel()
print('tn, fp, fn, tp:',tn,fp,fn,tp)



(row=expected, col=predicted)
confusion matrix
 [[974   1]
 [ 47  93]]
tn, fp, fn, tp: 974 1 47 93


Using the confusion matrix, we could calculate the scores for precision, recall and F1. However, **sklearn** already provides functions to obtain these scores:

In [0]:
print('accuracy', accuracy_score(label_test, predictions))
print(classification_report(label_test, predictions))

accuracy 0.95695067264574
              precision    recall  f1-score   support

         ham       0.95      1.00      0.98       975
        spam       0.99      0.66      0.79       140

    accuracy                           0.96      1115
   macro avg       0.97      0.83      0.89      1115
weighted avg       0.96      0.96      0.95      1115
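
The 'spam' row of the report and the accuracy can be reproduced by hand from the confusion matrix shown earlier. The sketch below only uses the four counts printed by that cell and the equations given at the beginning of this section.

```python
# Counts copied from the confusion matrix above ('spam' is the positive class)
tn, fp, fn, tp = 974, 1, 47, 93

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('precision: %.2f' % precision)
print('recall: %.2f' % recall)
print('f1: %.2f' % f1)
print('accuracy: %.4f' % accuracy)
```

The low recall comes from the 47 spam messages, out of 140, that the classifier labelled as ham, while only 1 ham message was labelled as spam.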



## Creating a pipeline

Sklearn allows us to define a sequence of processes that will be executed one after the other. 

Thus, we can create a pipeline where the first process creates the bow model, the second creates the tf-idf model, and the last one applies a classifier (for example, Naive Bayes).

The following cell shows how to create a pipeline that joins the processes created in the previous cells:

In [0]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=cleanText)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])

pipeline.fit(msg_train,label_train)
predictions = pipeline.predict(msg_test)

print( classification_report(label_test, predictions))




              precision    recall  f1-score   support

         ham       0.95      1.00      0.98       975
        spam       0.99      0.66      0.79       140

    accuracy                           0.96      1115
   macro avg       0.97      0.83      0.89      1115
weighted avg       0.96      0.96      0.95      1115
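
A fitted pipeline can also classify new raw messages directly, because it runs the three steps on whatever strings it receives. The following is a minimal sketch on made-up training data that uses the default analyzer of CountVectorizer instead of `cleanText`, so that it runs without nltk. The names `toy_train`, `toy_labels` and `toy_pipeline` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data; the default analyzer replaces cleanText
toy_train = ["see you at lunch", "are you coming home tonight", "call me when you arrive",
             "win a free prize now", "free entry win cash", "claim your free prize"]
toy_labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

toy_pipeline = Pipeline([
    ('bow', CountVectorizer()),        # strings to token counts
    ('tfidf', TfidfTransformer()),     # counts to TF-IDF weights
    ('classifier', MultinomialNB()),   # Naive Bayes on the TF-IDF vectors
])
toy_pipeline.fit(toy_train, toy_labels)

# The fitted pipeline accepts raw strings and runs all three steps on them
new_messages = ["free cash prize", "see you tonight"]
toy_predictions = toy_pipeline.predict(new_messages)
print(list(zip(new_messages, toy_predictions)))
```

Keeping the three steps in one object also ensures that the bow and tf-idf models are fitted only on the data passed to `fit`.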

