# Text feature extraction using TF-IDF

In this section I'll use sklearn to build the training and testing data. Then I'll build a classification model and use TF-IDF features to improve overall model accuracy.



## Load data and split into **X** (messages) and **y** (labels) datasets.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read the tsv file into a dataframe object
# Press tab to check you are in the correct folder location and to browse
# to the tsv file
# The sep argument indicates that this file is separated by tabs
dataframe = pd.read_csv("../Jupyter notebook files/SMSSpamCollection/SMSSpamCollection.tsv", sep="\t")

In [3]:
dataframe.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# Check for missing values
dataframe.isnull().sum()

label      0
message    0
dtype: int64

Let's see how many **spam** and **ham** we have in the dataframe.

In [5]:
dataframe["label"].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

We can see that there are 5572 messages, and 4825 are labelled as **ham**.

Therefore 4825 / 5572 = 86.6% of messages are **ham**. If we simply labelled every text message as **ham**, we would be correct 86.6% of the time with this dataset. So our text classifier needs to perform better than 86.6% to beat this naive baseline.
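
To make this baseline concrete, the short sketch below recomputes it from the label counts printed above. The variable names are illustrative and not part of the original notebook.

```python
# Label counts copied from the value_counts() output above
label_counts = {"ham": 4825, "spam": 747}

total_messages = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of a "classifier" that always predicts the most common label
baseline_accuracy = label_counts[majority_label] / total_messages

print(total_messages)               # 5572
print(majority_label)               # ham
print(round(baseline_accuracy, 3))  # 0.866
```

Any model worth keeping should score clearly above this figure on held-out data.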

First we split the data into **text message data**, which we call **X**, and **label data**, which we call **y**. Pay attention to the capitalisation convention for both datasets: **X** holds the feature data (conventionally a matrix) and is therefore represented with a capital letter, while **y** holds one label per message and is written in lower case.

In [6]:
# Following convention, X contains message data (large matrix)
# and y contains label data
X = dataframe["message"]
y = dataframe["label"]

In [7]:
# Contains index and message text
X.head()

0    Go until jurong point, crazy.. Available only ...
1                        Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object

In [8]:
# Contains index and message label
y.head()

0     ham
1     ham
2    spam
3     ham
4     ham
Name: label, dtype: object

## Split the data into train & test sets:

Next we perform the training and testing split of the data, just as we did previously.

In [9]:
from sklearn.model_selection import train_test_split

# test_size is the proportion of the data held out for testing (30% here)
# random_state seeds the shuffle so that the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
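
As a small illustration of what `test_size` and `random_state` do, the sketch below splits ten made-up messages twice with the same seed. The toy data and variable names are assumptions for demonstration only.

```python
from sklearn.model_selection import train_test_split

# Ten toy "messages" and labels, purely illustrative
toy_X = [f"message {i}" for i in range(10)]
toy_y = ["ham"] * 8 + ["spam"] * 2

# Two splits with identical settings
first = train_test_split(toy_X, toy_y, test_size=0.3, random_state=1)
second = train_test_split(toy_X, toy_y, test_size=0.3, random_state=1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = first

print(len(toy_X_train), len(toy_X_test))  # 7 3
print(first[1] == second[1])              # True: same seed, same test set
```

With an imbalanced dataset like this one, `train_test_split` also accepts a `stratify` argument that keeps the ham/spam ratio similar in both splits; the notebook does not use it, so treat that as an optional refinement.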

## Scikit-learn's CountVectorizer

The count vectorizer builds a dictionary of features and transforms documents to feature vectors.

Text preprocessing, tokenising and the ability to filter out stopwords are all included in the count vectorizer in Scikit-learn. See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for more detail.

Stop words such as **and**, **the**, **him**, are uninformative in representing the content of a text and are removed to avoid them being construed as signal for prediction. See https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction for more detail.
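
The sketch below shows the dictionary a count vectorizer builds on two made-up sentences. Note that `CountVectorizer()` keeps stop words by default (`stop_words=None`); filtering has to be requested, for example with `stop_words="english"`. The toy sentences and variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_messages = ["The cat sat on the mat", "The dog sat"]

# Default settings: lowercase the text, keep tokens of 2+ characters,
# keep stop words
default_vectorizer = CountVectorizer()
default_counts = default_vectorizer.fit_transform(toy_messages)
default_vocab = sorted(default_vectorizer.vocabulary_)
print(default_vocab)             # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(default_counts.toarray())  # one row per message, one column per word

# Same text with the built-in English stop word list applied
filtered_vectorizer = CountVectorizer(stop_words="english")
filtered_vectorizer.fit(toy_messages)
filtered_vocab = sorted(filtered_vectorizer.vocabulary_)
print(filtered_vocab)            # ['cat', 'dog', 'mat', 'sat']
```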

In [10]:
# Import count vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [11]:
# Create an instance of count vectorizer
count_vectorizer = CountVectorizer()

Now I need to pass the **X_train** data into a count vectorizer and then transform the message data. Remember that X_train contains training message text data.

There are 2 ways to do this.

In [12]:
# Fit count vectorizer to message data
# This step builds vocabulary and counts number of words in text
#count_vectorizer.fit(X_train)

In [13]:
# Transform original text message to a vector
#X_train_count = count_vectorizer.transform(X_train)

There's an easier way to complete this process: the `fit_transform` method of the `CountVectorizer` class performs the `fit` and the `transform` steps in one call.

In [14]:
# Alternative is to use the fit_transform function
# which performs the fit and then transforms X_train
# into a numerical vector and stores in X_train_counts
# No need to do this in 2 separate functions (above)
X_train_counts = count_vectorizer.fit_transform(X_train)

The output of the fit transform is a compressed sparse matrix. A sparse matrix is one in which most of the elements are zero. See https://en.wikipedia.org/wiki/Sparse_matrix for more information.
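
A quick check on three made-up messages suggests that both routes give the same sparse matrix. The data and names below are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_messages = ["win a free prize now", "see you at lunch", "free lunch now"]

# Route 1: fit the vocabulary, then transform
two_step = CountVectorizer()
two_step.fit(toy_messages)
counts_two_step = two_step.transform(toy_messages)

# Route 2: fit and transform in one call
one_step = CountVectorizer()
counts_one_step = one_step.fit_transform(toy_messages)

same_result = np.array_equal(counts_two_step.toarray(),
                             counts_one_step.toarray())
print(same_result)            # True
print(counts_one_step.shape)  # (3, 8): 3 messages, 8 unique words
print(counts_one_step.nnz)    # 11 stored (non-zero) counts
```

The single-letter word "a" is missing from the vocabulary because the default tokenizer only keeps tokens of two or more characters.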

> Let's examine the dimensions of the sparse matrix. It contains 3900 rows of message data. Remember that previously we split the message data 70/30 for training and testing. 70% of the original message texts (5572) = 3900.

> The sparse matrix contains 7155 columns or features. Each feature corresponds to one unique word found in the training text messages.

In [15]:
# The matrix contains 3900 rows of text. These are 70% of the original
# text messages (5572 rows)
X_train_counts

<3900x7155 sparse matrix of type '<class 'numpy.int64'>'
	with 51338 stored elements in Compressed Sparse Row format>

In [16]:
# Same as size of no of rows in X_train
X_train.shape

(3900,)
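
To get a feel for how sparse this matrix is, the arithmetic below uses only the numbers reported in the output above. Variable names are illustrative.

```python
n_rows, n_cols = 3900, 7155  # shape reported for X_train_counts
stored_elements = 51338      # non-zero entries reported for X_train_counts

total_cells = n_rows * n_cols
density = stored_elements / total_cells
words_per_message = stored_elements / n_rows

print(total_cells)                  # 27904500
print(round(density * 100, 2))      # 0.18 (percent of cells that are non-zero)
print(round(words_per_message, 1))  # 13.2 distinct words per message on average
```

Storing only the non-zero counts is what makes a matrix of this size practical to keep in memory.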

## Transform count vectorization to frequencies with Tf-idf

While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might have the same topic content.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **term frequency** or **tf**.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **Term Frequency Inverse Document Frequency** or **tf–idf**.
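
As a worked sketch of this weighting, the block below computes tf–idf by hand on a small made-up count matrix and checks it against `TfidfTransformer`. It assumes scikit-learn's default settings, which use a smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then scale each row to unit Euclidean (L2) length rather than dividing by the document's word count. The matrix and variable names are illustrative only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Rows are documents, columns are words (made-up counts)
counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 2, 0],
                   [1, 0, 2]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)             # documents containing each word

# Smoothed inverse document frequency (scikit-learn default)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit L2 length
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
matches_sklearn = np.allclose(manual_tfidf, sklearn_tfidf)
print(matches_sklearn)  # True
```

The first word appears in every document, so it receives the smallest idf weight; the rarer words are weighted up relative to it.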

Both **tf** and **tf–idf** can be computed using scikit-learn's [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

In [17]:
# Load the tfidfTransformer
from sklearn.feature_extraction.text import TfidfTransformer

# Create an instance of TfidfTransformer
tfidf_transformer = TfidfTransformer()

# Perform a tf-idf fit transform on the X_train_counts
# sparse matrix. Put the result into X_train_transform
X_train_transform = tfidf_transformer.fit_transform(X_train_counts)

# Shape is the same as original count vectorizer
# although it now contains word term frequencies multiplied by the
# inverse document frequency
X_train_transform.shape

(3900, 7155)

scikit-learn provides a class that combines the work of `CountVectorizer` and `TfidfTransformer` into one process. It is called the `TfidfVectorizer`. This replaces the two steps of **count vectorizing** followed by the **tf–idf transformation** that I did above.

Here's how we can complete this work in one process.

In [18]:
# Load the vectorizer library
from sklearn.feature_extraction.text import TfidfVectorizer

In [19]:
# Create an instance of the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

In [20]:
# Complete the vectorizing and fit transform on the original X_train
# dataset
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
# Examine shape of the dataset
X_train_tfidf.shape

(3900, 7155)
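
The shapes match, and the values should too. The sketch below checks this on a few made-up messages, assuming default settings for all three classes; names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

toy_messages = ["win a free prize now", "see you at lunch", "free lunch now"]

# Two-step route: raw counts first, then tf-idf weighting
toy_counts = CountVectorizer().fit_transform(toy_messages)
two_step_tfidf = TfidfTransformer().fit_transform(toy_counts)

# One-step route
one_step_tfidf = TfidfVectorizer().fit_transform(toy_messages)

routes_agree = np.allclose(two_step_tfidf.toarray(), one_step_tfidf.toarray())
print(routes_agree)  # True
```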

In a previous lecture I compared accuracy of several text classifiers. I found that the **Support Vector Classifier (SVC)** performed best out of these. 

Now I'm going to apply an SVC called `LinearSVC`. See https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC for more information.

`LinearSVC` is similar to `SVC` with a linear kernel, but it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples. It accepts sparse matrices directly and works well with large datasets.
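
A minimal sketch of the idea on four made-up messages: the classifier is fitted directly on the sparse tf–idf matrix and learns one weight per word. The toy data, the `random_state=0` setting and the variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

toy_messages = ["win a free prize now",
                "free cash prize, text now",
                "see you at lunch",
                "are you coming home for lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_messages)  # sparse matrix

# No conversion to a dense array is needed before fitting
toy_classifier = LinearSVC(random_state=0)
toy_classifier.fit(toy_features, toy_labels)

print(toy_classifier.classes_)     # ['ham' 'spam']
print(toy_classifier.coef_.shape)  # (1, number of words in the vocabulary)
```

For a two-class problem, positive weights in `coef_` push a message towards the second class in `classes_` (here **spam**) and negative weights towards the first.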

Let's set up `LinearSVC` and compare its output.

In [21]:
from sklearn.svm import LinearSVC

In [22]:
# Contents of X_test
X_test.head()

1078                         Yep, by the pretty sculpture
4028        Yes, princess. Are you going to make me moan?
958                            Welp apparently he retired
4642                                              Havent.
4674    I forgot 2 ask ü all smth.. There's a card on ...
Name: message, dtype: object

Now I'll create an instance of the Linear SVC classifier and fit it to the tf–idf-transformed **X_train** message data (**X_train_tfidf**) along with the **y_train** label data.

In [23]:
# Create an instance of the LinearSVC classifier
classifier = LinearSVC()

# X : {array-like, sparse matrix}
# y : array-like, shape = [n_samples], 
# Target vector relative to X
classifier.fit(X_train_tfidf, y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

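Next we need to prepare the testing data before it can be used for predictions from the SVC model. **Note** that I don't perform a `fit_transform` on the **X_test** data, just a `transform`: the vocabulary and idf weights must come from the training data only, so that the test features line up with the columns the model was trained on.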
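
The sketch below illustrates why this matters, using made-up training and test messages. Transforming with the fitted vectorizer keeps the training columns, while refitting on the test data would produce a different, incompatible feature space. All names here are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["free prize now", "see you at lunch"]
toy_test = ["free lunch tomorrow"]

fitted_vectorizer = TfidfVectorizer()
train_features = fitted_vectorizer.fit_transform(toy_train)

# Correct: reuse the training vocabulary; the unseen word "tomorrow" is ignored
test_features = fitted_vectorizer.transform(toy_test)
print(train_features.shape)  # (2, 7)
print(test_features.shape)   # (1, 7)
print(test_features.nnz)     # 2 ("free" and "lunch")

# Incorrect: refitting on the test data builds a new 3-word vocabulary
refit_features = TfidfVectorizer().fit_transform(toy_test)
print(refit_features.shape)  # (1, 3)
```

A classifier trained on 7 columns cannot score a matrix with 3 columns, and even if the widths happened to match, the columns would refer to different words.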

In [24]:
# Transform original test message data to a vector using the
# vectorizer already fitted on X_train (transform only, no fit)
X_test_transform = tfidf_vectorizer.transform(X_test)

# Predict message type from Linear SVC classifier
predictions = classifier.predict(X_test_transform)

In [25]:
# Predictions contains the predicted label data from inputted message test data
predictions.shape

(1672,)

In [26]:
# Show a confusion matrix of results
from sklearn import metrics
print(metrics.confusion_matrix(y_test, predictions))

[[1437    5]
 [  20  210]]


In [27]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1442
        spam       0.98      0.91      0.94       230

   micro avg       0.99      0.99      0.99      1672
   macro avg       0.98      0.95      0.97      1672
weighted avg       0.98      0.99      0.98      1672
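
The **spam** row of the report can be reproduced by hand from the confusion matrix, which helps when reading these tables. In scikit-learn's confusion matrix, rows are the true labels and columns are the predicted labels, here in the order ham, spam. The variable names below are illustrative.

```python
# Confusion matrix copied from the output above
confusion = [[1437, 5],
             [20, 210]]

true_ham, ham_as_spam = confusion[0]   # true ham: kept / wrongly flagged
spam_as_ham, true_spam = confusion[1]  # true spam: missed / caught

spam_precision = true_spam / (true_spam + ham_as_spam)
spam_recall = true_spam / (true_spam + spam_as_ham)
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)
accuracy = (true_ham + true_spam) / sum(sum(row) for row in confusion)

print(round(spam_precision, 2))  # 0.98
print(round(spam_recall, 2))     # 0.91
print(round(spam_f1, 2))         # 0.94
print(round(accuracy, 4))        # 0.985
```

So the model rarely flags genuine messages as spam (5 of 1442), but it lets 20 of the 230 spam messages through.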



Now I'm going to test the new system with some typical HAM and SPAM text messages.

In [28]:
# Typical ham text message
sample_text_message = ["I'm going to go to work soon"]
transformed_text = tfidf_vectorizer.transform(sample_text_message)

model_output = classifier.predict(transformed_text)
print(model_output)

['ham']


In [29]:
# Typical spam text message
sample_text_message = ["You can win a holiday! Text 23455 to take this offer up."]
transformed_text = tfidf_vectorizer.transform(sample_text_message)

model_output = classifier.predict(transformed_text)
print(model_output)

['spam']


Now I'm going to create a **function** to simplify this process.

In [30]:
def predict_message_type(text_message):
    # The vectorizer expects a list of documents, so the single
    # text_message string is wrapped in square brackets
    transformed_message = tfidf_vectorizer.transform([text_message])

    # Predict the label (ham or spam) from the transformed message
    model_output = classifier.predict(transformed_message)
    return model_output

In [31]:
# Submit a text message to the function and see if the model can accurately detect whether it is ham or spam.
# Here I'm testing whether the model can predict this as SPAM
predict_message_type("You can win a holiday! Text 23455 to take this offer up.")

array(['spam'], dtype=object)

In [32]:
# Typical spam message
predict_message_type("Your invoice is attached to this text. Click this link to download it.")


array(['spam'], dtype=object)

In [33]:
predict_message_type("Hi there. Hows things with you today? Are you heading out for some food?")

array(['ham'], dtype=object)

## Creating a scikit-learn Pipeline

So far I've had to vectorize the training data, then build a Linear SVC classifier, fit the model with training data, and then transform the testing data before it can be used for predictions from the trained model.

I then tried to speed up the process by writing a function to handle some of the repetitive actions required to test the model.

Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier. This means we can create a `pipeline` that chains several steps (transformers followed by a final estimator) and can be fitted and called just like a single classifier.

Let's use a pipeline to complete the processes I just did on the Linear SVC classifier. Note that these steps will produce the same results as the steps I completed above.

When `text_classifier.fit(X_train, y_train)` is called, the pipeline takes in **X_train** data, performs the `TfidfVectorizer` fit and transform on it, and then fits the `LinearSVC` classifier on the result. The fitted model is then stored in memory.

When `predictions = text_classifier.predict(X_test)` is called, **X_test** is first transformed with `tfidf_vect` and then the `LinearSVC_classifier` is used to predict from the transformed data. All steps are performed within the `text_classifier` pipeline.
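
To illustrate that a pipeline is only a convenience wrapper around the manual steps, the sketch below runs both routes on made-up messages and compares the predictions. The toy data and the fixed `random_state=0` are assumptions added so that the two classifiers are fitted identically.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

toy_train = ["win a free prize now",
             "free cash prize, text now",
             "see you at lunch",
             "are you coming home for lunch"]
toy_labels = ["spam", "spam", "ham", "ham"]
toy_test = ["free prize for you", "lunch at home"]

# Manual route: vectorize, fit, transform, predict
manual_vectorizer = TfidfVectorizer()
manual_classifier = LinearSVC(random_state=0)
manual_classifier.fit(manual_vectorizer.fit_transform(toy_train), toy_labels)
manual_predictions = manual_classifier.predict(
    manual_vectorizer.transform(toy_test))

# Pipeline route: the same two steps behind one fit/predict interface
toy_pipeline = Pipeline([
    ("tfidf_vect", TfidfVectorizer()),
    ("LinearSVC_classifier", LinearSVC(random_state=0)),
])
toy_pipeline.fit(toy_train, toy_labels)
pipeline_predictions = toy_pipeline.predict(toy_test)

predictions_match = np.array_equal(manual_predictions, pipeline_predictions)
print(predictions_match)  # True

# Individual fitted steps stay accessible by name
fitted_vectorizer = toy_pipeline.named_steps["tfidf_vect"]
print(len(fitted_vectorizer.vocabulary_))
```

Because the pipeline only ever calls `transform` on data passed to `predict`, it also removes the risk of accidentally refitting the vectorizer on test data.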

In [34]:
from sklearn.pipeline import Pipeline

# If these classes weren't already imported, we would
# also need the following imports
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

# Each step is a (name, estimator) pair; the last step is the classifier
text_classifier = Pipeline([
    ('tfidf_vect', TfidfVectorizer()),
    ('LinearSVC_classifier', LinearSVC()),
])

# Feed the training data through the pipeline
text_classifier.fit(X_train, y_train) 

Pipeline(memory=None,
     steps=[('tfidf_vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...ax_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0))])

In [35]:
# Form a prediction set
predictions = text_classifier.predict(X_test)

In [36]:
# Report the confusion matrix
print(metrics.confusion_matrix(y_test,predictions))

[[1437    5]
 [  20  210]]


In [37]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1442
        spam       0.98      0.91      0.94       230

   micro avg       0.99      0.99      0.99      1672
   macro avg       0.98      0.95      0.97      1672
weighted avg       0.98      0.99      0.98      1672



In [38]:
print(metrics.accuracy_score(y_test,predictions))

0.9850478468899522


This means that we can predict whether a text message is **spam** or **ham** correctly 98.5% of the time. This is a large improvement over the earlier predictions and is well above the roughly 86% we would score by labelling every message as **ham**.

We can test the model by feeding in some example ham and spam messages. Remember **ham** is a genuine text message whereas **spam** is not.

In [39]:
# Typical ham text message
text_classifier.predict(["I'm going to go to work soon"])

array(['ham'], dtype=object)

In [40]:
text_classifier.predict(["This is my phone number. I hope you can call before 4pm."])

array(['ham'], dtype=object)

In [41]:
text_classifier.predict(["You can win a holiday! Text 23455 to take this offer up."])

array(['spam'], dtype=object)

In [42]:
text_classifier.predict(["Your invoice is attached to this text. Click this link to download it."])

array(['spam'], dtype=object)