## **Spam Classifier**

Interested in a different text classification task? This example goes beyond the base sentiment classifier. Classifying arbitrary text as spam or ham (legitimate) is useful for judging the validity of a piece of text, such as SMS text messages, emails, posts, or comments.

This notebook does the following:
1. Loads and cleans the SMS text data (from [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection))
2. Builds the spam classifier model
3. Evaluates the model on held-out test data
4. Converts the model to CoreML format and saves the model to Skafos, pushing it to your application

In [1]:
import requests, zipfile, io

import turicreate as tc
from skafossdk import *

In [2]:
ska = Skafos() # initialize Skafos

Turi Create was already installed. It's been saved to the tc variable.
2019-01-08 18:54:08,139 - skafossdk.data_engine - INFO - Connecting to DataEngine
2019-01-08 18:54:08,203 - skafossdk.data_engine - INFO - DataEngine Connection Opened


### 1. **Load the data**
The data loaded below consists of SMS text messages, each labeled "spam" or "ham". The functions below download the dataset from the UCI ML Repository and read it into memory. The data is then split into training and testing datasets.

In [None]:
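def load_spam_dataset():
    """Download the SMS Spam Collection and return its raw lines."""
    spam_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    _request_and_unzip(spam_url, 'spam/')
    # Each line holds a label and a message separated by a single tab
    with open("datasets/spam/SMSSpamCollection", "r", encoding="utf-8") as infile:
        d = infile.readlines()
    return d

def _request_and_unzip(url, folder):
    """Fetch a zip archive and extract it under datasets/<folder>."""
    r = requests.get(url)
    # Fail loudly on a bad response instead of silently skipping the extract
    r.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(f'datasets/{folder}')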


In [3]:
# Load spam text data and inspect
spam_data = load_spam_dataset()
print(spam_data[:4], flush=True)

['ham\tGo until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n', 'ham\tOk lar... Joking wif u oni...\n', "spam\tFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n", 'ham\tU dun say so early hor... U c already then say...\n']


In [4]:
# Split text data from its target variable ("ham", "spam")
spam_labels = [line.split('\t')[0] for line in spam_data]
spam_text = [line.split('\t')[1].replace('\n', '') for line in spam_data]
spam_df = tc.SFrame({'label': spam_labels, 'text': spam_text})

In [5]:
spam_df.head(5)

label,text
ham,"Go until jurong point, crazy.. Available onl ..."
ham,Ok lar... Joking wif u oni... ...
spam,Free entry in 2 a wkly comp to win FA Cup final ...
ham,U dun say so early hor... U c already then say... ...
ham,"Nah I don't think he goes to usf, he lives around ..."
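The label and text split above assumes that every line contains exactly one tab. A slightly more defensive variant, sketched below in plain Python, splits on the first tab only and skips lines that do not match the expected format. The helper name `parse_sms_line` is hypothetical and is not part of this notebook or of Turi Create. The first sample line is taken from the output above; the other two are made up for illustration.

```python
def parse_sms_line(line):
    """Split one raw line into (label, text), or return None if malformed.

    parse_sms_line is a hypothetical helper used only for illustration.
    """
    # Split on the first tab only, so a tab inside the message is preserved
    label, sep, text = line.rstrip('\n').partition('\t')
    if not sep or label not in ('ham', 'spam'):
        return None
    return label, text


sample_lines = [
    'ham\tOk lar... Joking wif u oni...\n',
    'spam\tFree entry\tin 2 a wkly comp\n',  # made-up message containing a tab
    'not a labeled line\n',                   # made-up malformed line, no tab
]
parsed = [p for p in map(parse_sms_line, sample_lines) if p is not None]
print(parsed)
```

The resulting labels and texts could then be passed to `tc.SFrame` in the same way as `spam_labels` and `spam_text`.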


In [6]:
# Make a train-test split
train_data, test_data = spam_df.random_split(0.8)
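Because the split is random, the evaluation numbers in this notebook will vary from run to run unless a seed is fixed. The plain-Python sketch below illustrates the idea of a seeded, per-row random split. The function `seeded_split` is hypothetical and only illustrates the concept; it is not how Turi Create implements `random_split`.

```python
import random


def seeded_split(rows, fraction, seed):
    """Assign each row to train with probability `fraction`, reproducibly."""
    rng = random.Random(seed)  # private generator; global random state untouched
    train, test = [], []
    for row in rows:
        (train if rng.random() < fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = seeded_split(rows, 0.8, seed=42)
train_b, test_b = seeded_split(rows, 0.8, seed=42)

# Same seed gives the same split; sizes are only approximately 80/20
print(train_a == train_b, len(train_a), len(test_a))
```

In the notebook itself, the same effect should be available by passing the optional `seed` argument, for example `spam_df.random_split(0.8, seed=42)`.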

### 2. **Build the model**
We pass the data to the `tc.text_classifier.create` function and specify a few arguments needed to properly run the model. To understand more about this specific function, check out the [Turi Create Documentation](https://apple.github.io/turicreate/docs/userguide/text_classifier/).

In [7]:
# Train the spam filter classification model; this takes approximately 5-10 seconds on a CPU.
spam_model = tc.text_classifier.create(train_data, target='label', features=['text'])

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



### 3. **Evaluate the model**
We will evaluate the performance of our trained model on the test set. The test set consists of data that our model has never seen before. It will provide a decent estimate of how the model will perform "in the wild".


In [8]:
# Rows where target_label equals predicted_label are correct predictions; their counts should far exceed those of the mismatched rows
predictions = spam_model.predict(test_data)
tc.evaluation.confusion_matrix(test_data['label'], predictions)

target_label,predicted_label,count
spam,spam,130
ham,ham,991
ham,spam,5
spam,ham,20
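The confusion matrix counts can be turned into per-class metrics. The sketch below hard-codes the four counts shown above and treats "spam" as the positive class; that choice of positive class is an assumption made here for illustration.

```python
# Counts copied from the confusion matrix above (spam is the positive class)
tp = 130  # spam predicted as spam
tn = 991  # ham predicted as ham
fp = 5    # ham predicted as spam
fn = 20   # spam predicted as ham

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of messages flagged as spam, the share that are spam
recall = tp / (tp + fn)     # of actual spam messages, the share that were caught

print(f'accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}')
```

For this run, spam recall comes out at roughly 87%, noticeably lower than the overall accuracy of about 98%, because most of the errors are spam messages that slipped through as ham.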


In [9]:
# Model testing accuracy
accuracy = tc.evaluation.accuracy(test_data['label'], predictions)
print(f'Spam filter model has a testing accuracy of {accuracy*100} % !', flush=True)

Spam filter model has a testing accuracy of 97.81849912739965 % !
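Accuracy on its own can flatter a classifier when one class dominates, as ham does here. As a rough sanity check, the sketch below compares the model against a trivial baseline that labels every message "ham", using the confusion matrix counts from this notebook's test run. The baseline is an illustrative device, not something the notebook trains.

```python
# Test-set class totals derived from the confusion matrix counts in this notebook
ham_total = 991 + 5     # ham predicted as ham + ham predicted as spam
spam_total = 130 + 20   # spam predicted as spam + spam predicted as ham

# A trivial classifier that labels every message "ham" gets all ham right
baseline_accuracy = ham_total / (ham_total + spam_total)
model_accuracy = (991 + 130) / (ham_total + spam_total)

print(f'baseline: {baseline_accuracy:.3f}, model: {model_accuracy:.3f}')
```

The always-ham baseline already reaches about 87% on this test set, so the model's accuracy is better read as an improvement over that figure than against 0%.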


#### **Let's put our trained model to the test even more!**

In [10]:
# generate some sample text data
sample_text = ['WINNER! You have been selected for a CASH prize!', 'hey how are you?',
               'Want to be a millionaire?', 'What is the weather like today?']

sample_predictions = spam_model.predict(tc.SFrame({'text': sample_text}))

# investigate the results
for t, p in zip(sample_text, sample_predictions):
    print(t, '----', p)

WINNER! You have been selected for a CASH prize! ---- spam
hey how are you? ---- ham
Want to be a millionaire? ---- ham
What is the weather like today? ---- ham
