# Text Classification: Spam or Ham
This notebook trains a model that classifies user text as "spam" (unwanted) or "ham" (legitimate).

Below we do the following:
1. Set up the training environment.
2. Load and clean the SMS text data (from [UCI ML Repository](https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection)).
3. Build and evaluate the spam classifier model.
4. Convert the model to CoreML and upload it to Skafos.

## Environment Setup
All we need to do is install the turicreate and skafos libraries to get started. This example **doesn't** use a GPU for training.

In [0]:
# Install turicreate and skafos
!pip install turicreate==5.4
!pip install skafos

## Data Preparation and Model Training
The data loaded below consists of SMS text messages, each labeled "spam" or "ham". The raw lines are first parsed into labels and text, then split into training and testing datasets.

In [0]:
# Import libraries
import requests, zipfile, io

import turicreate as tc

In [0]:
# Functions to load spam dataset
def load_spam_dataset():
    spam_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    _request_and_unzip(spam_url, 'spam/')
    with open("datasets/spam/SMSSpamCollection", "r", encoding="utf-8") as infile:
        d = infile.readlines()
    return d

def _request_and_unzip(url, folder):
    r = requests.get(url)
    # Fail loudly on a bad response instead of silently skipping extraction
    r.raise_for_status()
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(f'datasets/{folder}')
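
The download helper above unzips the response body directly from memory rather than saving the archive to disk first. The following is a minimal sketch of that pattern, assuming a small archive built locally so that it runs without a network call. The names `unzip_bytes` and `build_demo_zip` are hypothetical and not part of the notebook.

```python
import io
import os
import tempfile
import zipfile


def unzip_bytes(payload, dest_dir):
    """Extract a zip archive held in memory and return its member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()


def build_demo_zip():
    """Build a tiny in-memory archive (a stand-in for the downloaded bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("SMSSpamCollection", "ham\tsee you at 5\nspam\tWIN a prize now\n")
    return buffer.getvalue()


# Extract into a temporary directory and confirm the file landed there
with tempfile.TemporaryDirectory() as tmp:
    names = unzip_bytes(build_demo_zip(), tmp)
    extracted = os.path.exists(os.path.join(tmp, names[0]))

print(names, extracted)
```

Wrapping the bytes in `io.BytesIO` gives `zipfile.ZipFile` the file-like object it expects, which is why no temporary `.zip` file is needed.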


In [0]:
# Fetch data and take a look
spam_data = load_spam_dataset()
print(spam_data[:4], flush=True)

In [0]:
# Split text data from its target variable ("ham", "spam")
spam_labels = [line.split('\t')[0] for line in spam_data]
spam_text = [line.split('\t')[1].replace('\n', '') for line in spam_data]
spam_df = tc.SFrame({'label': spam_labels, 'text': spam_text})
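
The list comprehensions above assume every line contains exactly one tab. As a hedged alternative, the sketch below splits on the first tab only and skips lines that do not look like `label<TAB>text`. The helper name `parse_labeled_lines` and the sample lines are hypothetical and used only for illustration.

```python
def parse_labeled_lines(lines):
    """Split 'label<TAB>text' lines into parallel label and text lists."""
    labels, texts = [], []
    for line in lines:
        # partition splits on the first tab only, so tabs inside a message are kept
        label, sep, text = line.rstrip("\n").partition("\t")
        if not sep or label not in ("ham", "spam"):
            continue  # skip blank or malformed lines instead of raising IndexError
        labels.append(label)
        texts.append(text)
    return labels, texts


# Made-up lines: two valid rows, one blank line, one line without a tab
sample_lines = ["ham\tsee you at 5\n", "spam\tWIN a prize\tnow\n", "\n", "no tab here\n"]
labels, texts = parse_labeled_lines(sample_lines)
print(labels)
print(texts)
```

The two returned lists could be passed to `tc.SFrame` in the same way as `spam_labels` and `spam_text`.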

In [0]:
# What does our resulting dataframe look like?
spam_df.head(5)

In [0]:
# Make an 80/20 train-test split (fixed seed so the split is reproducible)
train_data, test_data = spam_df.random_split(0.8, seed=42)
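
To illustrate what an 80/20 split does, here is a minimal pure-Python sketch that shuffles rows with a fixed seed and cuts the list at the requested fraction. This is not how Turi Create implements `random_split`; the function `split_rows` and the sample rows are assumptions made for the example.

```python
import random


def split_rows(rows, fraction=0.8, seed=42):
    """Shuffle a copy of rows with a fixed seed and cut it at the given fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


# Ten made-up rows: eight go to training, two to testing
rows = [("ham", f"message {i}") for i in range(10)]
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed means the same rows end up in the test set on every run, which keeps accuracy numbers comparable between experiments.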

In [0]:
# Train the spam filter classification model. This takes approximately 5-10 seconds on a CPU.
spam_model = tc.text_classifier.create(
    train_data,
    target='label',
    features=['text'],
    drop_stop_words=True,
    word_count_threshold=2
)

## Model Evaluation

In [0]:
# Confusion matrix: rows where the target and predicted labels match (correct predictions) should have the highest counts
predictions = spam_model.predict(test_data)
tc.evaluation.confusion_matrix(test_data['label'], predictions)

In [0]:
# Model testing accuracy
accuracy = tc.evaluation.accuracy(test_data['label'], predictions)
print(f'Spam filter model has a testing accuracy of {accuracy * 100:.2f}%', flush=True)
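
Accuracy alone can look good when one class dominates the data, so it can help to also look at precision and recall for the "spam" class. The sketch below derives all three from pairs of actual and predicted labels using only the standard library. The function `spam_metrics` and the label lists are made up for illustration and are not output from the model above.

```python
from collections import Counter


def spam_metrics(actual, predicted, positive="spam"):
    """Return accuracy, precision, and recall, treating `positive` as the class of interest."""
    counts = Counter(zip(actual, predicted))
    tp = sum(n for (a, p), n in counts.items() if a == positive and p == positive)
    fp = sum(n for (a, p), n in counts.items() if a != positive and p == positive)
    fn = sum(n for (a, p), n in counts.items() if a == positive and p != positive)
    correct = sum(n for (a, p), n in counts.items() if a == p)
    total = sum(counts.values())
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels: one ham flagged as spam, one spam missed
actual = ["ham", "ham", "ham", "spam", "spam", "ham"]
predicted = ["ham", "ham", "spam", "spam", "ham", "ham"]
metrics = spam_metrics(actual, predicted)
print(metrics)
```

In the notebook, `list(test_data['label'])` and `list(predictions)` could be passed in place of the made-up lists.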

In [0]:
# Generate some sample text data
sample_text = ['WINNER! You have been selected for a CASH prize!', 'hey how are you?',
               'Do you want to be a millionaire? You can for free.0020', 'What is the weather like today?']

sample_predictions = spam_model.predict(tc.SFrame({'text': sample_text}))

# Investigate the results
for t, p in zip(sample_text, sample_predictions):
    print(t, '----', p)

## Model Export and Skafos Upload
Convert the model to CoreML format so that it can run on an iOS device, then deliver it to your apps with **[Skafos](https://skafos.ai)**.

- If you don't already have an account, sign up for one **[here](https://dashboard.skafos.ai)**.
- Once you've signed up, grab an API token from your account settings.

In [0]:
# Specify the CoreML model name
model_name = 'TextClassifier'
coreml_model_name = model_name + '.mlmodel'

# Export the trained model to CoreML format
res = spam_model.export_coreml(coreml_model_name)

In [0]:
import skafos
from skafos import models
import os

# Set your API Token first for repeated use
os.environ["SKAFOS_API_TOKEN"] = "<YOUR-SKAFOS-API-TOKEN>"

# You can retrieve this info with skafos.summary()
org_name = "<YOUR-SKAFOS-ORG-NAME>"    # Example: "mike-gmail-com-467h2"
app_name = "<YOUR-SKAFOS-APP-NAME>"    # Example: "Text-App"
model_name = "<YOUR-MODEL-NAME>"       # Example: "TextClassifierModel"

# Upload model version to Skafos
model_upload_result = models.upload_version(
    files=coreml_model_name,  # the .mlmodel file exported above
    org_name=org_name,
    app_name=app_name,
    model_name=model_name
)
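
Before calling the upload, it can be useful to confirm that the API token placeholder has been replaced and that the exported model file exists. The following is a minimal sketch of such a check using only the standard library. The helper `check_upload_ready` is hypothetical and not part of the skafos package, and the placeholder test (a value starting with `<`) is an assumption based on the template strings used above.

```python
import os


def check_upload_ready(model_path, token_var="SKAFOS_API_TOKEN"):
    """Return a list of problems that would likely make an upload fail."""
    problems = []
    token = os.environ.get(token_var, "")
    # Assumption: an unedited template value still starts with "<"
    if not token or token.startswith("<"):
        problems.append(f"{token_var} is unset or still a placeholder")
    if not os.path.isfile(model_path):
        problems.append(f"model file not found: {model_path}")
    return problems


# Demo with the unedited placeholder token and a file that does not exist
os.environ["SKAFOS_API_TOKEN"] = "<YOUR-SKAFOS-API-TOKEN>"
problems = check_upload_ready("does-not-exist.mlmodel")
for problem in problems:
    print(problem)
```

An empty list means both checks passed and the upload call can proceed.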