In [None]:
# Building a Machine Learning model to detect spam in SMS
> Building a machine learning model to predict whether an SMS message is spam or not

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/chart-preview.png

In this notebook, we'll show how to build a simple machine learning model to predict whether an SMS message is spam or not.

The notebook was built to go along with my talk in May 2020 for [Vonage Developer Day](https://www.vonage.com/about-us/vonage-stories/vonage-developer-day/)

youtube: https://www.youtube.com/watch?v=5d4_HpMLXf4&t=1s

We'll be using the scikit-learn library to train a model on a set of messages that are labeled as spam or non-spam (aka ham).

After our model is trained, we'll deploy it to an AWS Lambda function whose input is a message and whose output is the prediction (spam or ham).

Before we build a model, we'll need some data. So we'll use the [SMS Spam Collection DataSet](http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

This dataset contains over 5,000 messages, each labeled spam or ham.
In the following cell, we'll download the dataset.

In [None]:
!wget --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
!unzip /content/smsspamcollection.zip

--2020-07-22 00:49:18--  https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip
Resolving archive.ics.uci.edu (archive.ics.uci.edu)... 128.195.10.252
Connecting to archive.ics.uci.edu (archive.ics.uci.edu)|128.195.10.252|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 203415 (199K) [application/x-httpd-php]
Saving to: ‘smsspamcollection.zip’


2020-07-22 00:49:19 (509 KB/s) - ‘smsspamcollection.zip’ saved [203415/203415]

Archive:  /content/smsspamcollection.zip
  inflating: SMSSpamCollection       
  inflating: readme                  


Once we have downloaded the dataset, we'll load it into a pandas DataFrame and view the first 5 rows.

In [None]:
import pandas as pd
df = pd.read_csv("/content/SMSSpamCollection", sep='\t', header=None, names=['label', 'message'])
df.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


Next, we need to understand the data before building a model.

We'll start by checking how many messages are labeled spam and how many are labeled ham.

In [None]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

From the cell above, we see that 4825 messages are valid (ham) messages, and only 747 messages are labeled as spam.
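
The imbalance matters when we judge the model later. As a quick sketch using only the counts printed above, the share of spam and the accuracy of a trivial "always predict ham" baseline can be worked out as:

```python
# Counts taken from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

spam_share = spam_count / total
# A classifier that always answers "ham" is right on every ham message
baseline_accuracy = ham_count / total

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"always-ham baseline accuracy: {baseline_accuracy:.1%}")
```

Any model we train should beat that baseline of roughly 87% accuracy to be useful.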

Let's now view some messages that are ham and some that are spam.

In [None]:
spam = df[df["label"] == "spam"]
spam.head()

Unnamed: 0,label,message
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
5,spam,FreeMsg Hey there darling it's been 3 week's n...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...
11,spam,"SIX chances to win CASH! From 100 to 20,000 po..."


In [None]:
ham = df[df["label"] == "ham"]
ham.head()

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
6,ham,Even my brother is not like to speak with me. ...


After looking at a few messages of each kind, we can see that the spam messages do look spammy: they push prizes, free offers, and chances to win.

# Preprocessing

The next step is to get the dataset ready to build a model. A machine learning model can only deal with numbers, so we'll have to convert our text into numbers using `TfidfVectorizer`.

`TfidfVectorizer` converts a collection of raw documents to a matrix of [term frequency-inverse document frequency](http://www.tfidf.com/) features, also known as TF-IDF.

In our case, each message is a document. For each term in a message, we compute the term frequency (the number of times the term appears in the message divided by the total number of terms in the message) and multiply it by the inverse document frequency (the logarithm of the total number of messages divided by the number of messages that contain the term).

![](https://miro.medium.com/max/1066/1*eIDZG3Ot5DP8SKXAvBVALQ.png)
[source](https://towardsdatascience.com/spam-or-ham-introduction-to-natural-language-processing-part-2-a0093185aebd)
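
To make the formula concrete, here is a minimal sketch of the textbook calculation in plain Python. The three toy messages and the helper name `tf_idf` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term and normalizes each row by default, so its numbers will differ from this sketch.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages, not taken from the dataset)
docs = ["free entry to win cash", "on my way home", "win a free prize"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many messages does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)            # term count / terms in this message
    idf = math.log(n_docs / doc_freq[term])    # log(total messages / messages with term)
    return tf * idf

# Scores for every term in the first message
scores = {term: tf_idf(term, tokenized[0]) for term in tokenized[0]}
for term, score in scores.items():
    print(f"{term:>6}: {score:.4f}")
```

Terms that appear in only one message (such as "entry") score higher than terms shared across messages (such as "free"), which is the behaviour TF-IDF is designed to produce.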

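The output is a matrix of TF-IDF scores. In the illustration below, the rows are the terms and the columns are the documents. (scikit-learn returns the transpose, with one row per document and one column per term.)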
![](https://miro.medium.com/max/1400/1*n4s0LZS1Qi46pF3aaYzE0A.png)

[This notebook by Mike Bernico](https://github.com/mbernico/CS570/blob/master/module_1/TFIDF.ipynb) goes into more detail on TF-IDF and how to calculate it without using sklearn.

First, we'll split the dataset into a train and test set. We'll use 80% of the data for training the model and hold out the remaining 20% for testing it.


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['message'], df['label'], test_size = 0.2, random_state = 1)
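
As a sanity check on the split sizes, the arithmetic can be sketched as follows. This assumes scikit-learn rounds the test fraction up to a whole number of rows and gives the remainder to the training set, which matches the 1115 test rows and 4457 training columns that show up in the outputs later in this notebook.

```python
import math

n_samples = 4825 + 747   # ham + spam counts from value_counts()
test_size = 0.2

n_test = math.ceil(test_size * n_samples)   # test fraction rounded up
n_train = n_samples - n_test                # everything else is training data

print(n_train, n_test)
```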

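Once we have split our data, we can use the `TfidfVectorizer`. It returns a sparse matrix (a matrix with mostly 0's). We fit the vectorizer on the training messages only and reuse the fitted vocabulary to transform the test messages.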

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)

After we fit the `TfidfVectorizer` to the messages, let's print the matrix as a pandas DataFrame to understand what `TfidfVectorizer` is doing.

In [None]:
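# On scikit-learn >= 1.0, use vectorizer.get_feature_names_out() instead
feature_names = vectorizer.get_feature_names()
# Transpose so that terms are rows and training messages are columns
tfid_df = pd.DataFrame(X_train.T.todense(), index=feature_names)
print(tfid_df[1200:1205])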

           0     1     2         3     4     ...  4452  4453  4454  4455  4456
backdoor    0.0   0.0   0.0  0.000000   0.0  ...   0.0   0.0   0.0   0.0   0.0
backwards   0.0   0.0   0.0  0.000000   0.0  ...   0.0   0.0   0.0   0.0   0.0
bad         0.0   0.0   0.0  0.193352   0.0  ...   0.0   0.0   0.0   0.0   0.0
badass      0.0   0.0   0.0  0.000000   0.0  ...   0.0   0.0   0.0   0.0   0.0
badly       0.0   0.0   0.0  0.000000   0.0  ...   0.0   0.0   0.0   0.0   0.0

[5 rows x 4457 columns]


In the table above, the rows are the words in our dataset and the columns are the message indexes. We've only printed a few rows from the middle of the DataFrame to give a better sense of the data.


Next, we'll train a model using Gaussian Naive Bayes in scikit-learn. It's a good starting algorithm for text classification. We'll then print the accuracy of the model on the test set, along with the confusion matrix.

## Model Training

To train our model, we'll use a Naive Bayes algorithm.

Naive Bayes is built on Bayes' theorem, which for a single word is:
\\[ P(s|w) = \frac{P(w|s) \times P(s)}{P(w|s) \times P(s) + P(w|h) \times P(h)} \\]

**P(s|w)**  - The probability(**P**) that a message is spam(**s**) given(**|**) that it contains a word(**w**)

**=**

**P(w|s)**   - probability(**P**) that a word(**w**) appears in spam messages(**s**)

*

**P(s)** - Overall probability(**P**) that ANY message is spam(**s**)

**/**

**P(w|s)** - probability(**P**) that a word(**w**) exists in spam messages(**s**)

*

**P(s)** - Overall probability(**P**) that ANY message is spam(**s**)

**+**

**P(w|h)** - probability(**P**) the word(**w**) appears in non-spam(**h**) messages

*

**P(h)** - Overall probability(**P**) that any message is not-spam(**h**)
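
To see the formula at work, here is a small worked example in Python. `P(s)` and `P(h)` come from the label counts above; the two word likelihoods are invented numbers for a hypothetical word such as "free", not values measured from the dataset.

```python
# Priors from the label counts in the dataset
p_s = 747 / (4825 + 747)   # P(s): any message is spam
p_h = 1 - p_s              # P(h): any message is ham

# Hypothetical likelihoods for one word (made up for illustration)
p_w_given_s = 0.30         # P(w|s): word appears in spam messages
p_w_given_h = 0.01         # P(w|h): word appears in ham messages

# Bayes' theorem: P(s|w)
p_s_given_w = (p_w_given_s * p_s) / (p_w_given_s * p_s + p_w_given_h * p_h)
print(f"P(s|w) = {p_s_given_w:.2f}")
```

Even though only about 13% of messages are spam, seeing a word that is 30 times more common in spam pushes the probability of spam above 80%.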





In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

clf = GaussianNB()
clf.fit(X_train.toarray(),y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [None]:
y_true, y_pred = y_test, clf.predict(X_test.toarray())
accuracy_score(y_true, y_pred)

0.8986547085201794

In [None]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         ham       0.99      0.89      0.94       968
        spam       0.57      0.93      0.71       147

    accuracy                           0.90      1115
   macro avg       0.78      0.91      0.82      1115
weighted avg       0.93      0.90      0.91      1115



In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=['ham', 'spam']), 
    index=['ham', 'spam'], 
    columns=['ham', 'spam']
)
print(cmtx)

      ham  spam
ham   866   102
spam   11   136


## Grid Search

In [None]:
from sklearn.model_selection import GridSearchCV
parameters = {"var_smoothing":[1e-9, 1e-5, 1e-1]}
gs_clf = GridSearchCV(
        GaussianNB(), parameters)
gs_clf.fit(X_train.toarray(),y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=GaussianNB(priors=None, var_smoothing=1e-09),
             iid='deprecated', n_jobs=None,
             param_grid={'var_smoothing': [1e-09, 1e-05, 0.1]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [None]:
gs_clf.best_params_

{'var_smoothing': 0.1}

In [None]:
y_true, y_pred = y_test, gs_clf.predict(X_test.toarray())
accuracy_score(y_true, y_pred)

0.9650224215246637

In [None]:
cmtx = pd.DataFrame(
    confusion_matrix(y_true, y_pred, labels=['ham', 'spam']), 
    index=['ham', 'spam'], 
    columns=['ham', 'spam']
)
print(cmtx)

      ham  spam
ham   932    36
spam    3   144


In [None]:
print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

         ham       1.00      0.96      0.98       968
        spam       0.80      0.98      0.88       147

    accuracy                           0.97      1115
   macro avg       0.90      0.97      0.93      1115
weighted avg       0.97      0.97      0.97      1115



From our tuned model, we get about 96.5% accuracy on the test set, which is pretty good.

The confusion matrix shows how many messages were classified correctly. Rows are the true labels and columns are the predicted labels: 932 ham messages were correctly classified as ham and 144 spam messages were correctly classified as spam, while 36 ham messages were flagged as spam and 3 spam messages slipped through as ham.
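
The numbers in the classification report can be recovered from the confusion matrix by hand. The sketch below uses the four counts from the grid-searched model's matrix above and treats spam as the positive class.

```python
# Counts from the confusion matrix of the tuned model (rows = true, columns = predicted)
ham_as_ham, ham_as_spam = 932, 36
spam_as_ham, spam_as_spam = 3, 144

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Treat spam as the positive class
precision = spam_as_spam / (spam_as_spam + ham_as_spam)  # of predicted spam, how much was spam
recall = spam_as_spam / (spam_as_spam + spam_as_ham)     # of true spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The 0.80 precision means roughly one in five messages flagged as spam is actually ham, which is worth keeping in mind for a spam filter.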

Next, let's test our model with some example messages.

## Inference

In [None]:
message = vectorizer.transform(["i'm on my way home"])
message = message.toarray()
gs_clf.predict(message)

array(['ham'], dtype='<U4')

In [None]:
message = vectorizer.transform(["this offer is to good to be true"])
message = message.toarray()
gs_clf.predict(message)

array(['spam'], dtype='<U4')

The final step is to save the model and the TF-IDF vectorizer. We will use these when classifying incoming messages in our lambda function.

In [None]:
import joblib
joblib.dump(gs_clf, "model.pkl")
joblib.dump(vectorizer, "vectorizer.pkl")

['vectorizer.pkl']

# Lambda

Once our model is trained, we'll put it in a production environment.

For this example, we'll create a lambda function to host our model.

The lambda function will be attached to an API Gateway, which gives us an endpoint for making predictions.

Deploying a scikit-learn model to lambda isn't as easy as you would think. You can't just import your libraries, especially scikit-learn, and expect them to work.

Here's what we'll need to do in order to deploy our model:
* Spin up an EC2 instance
* SSH into the instance and install our dependencies
* Copy the lambda function code from this [repo](https://github.com/tbass134/SMS-Spam-Classifier-lambda)
* Run a bash script that:
  * zips the code, including the packages
  * uploads the zip to S3
  * points the lambda function to the S3 file

## Create an EC2 instance
If you have an AWS account:
* Go to EC2 on the console and click `Launch Instance`.
* Select the first available AMI (Amazon Linux 2 AMI).
* Select the t2.micro instance, then click `Review and Launch`.
* Click the Next button.
* Under IAM Role, click Create New Role.
* Create a new role with the following policies, then name your role and click Create Role:
  * AmazonS3FullAccess
  * AWSLambdaFullAccess

These permissions will be needed when uploading our code to your S3 bucket and pointing the lambda function to the zip file that we'll create later.

* Create a new private key pair and click `Launch Instance`.
* Note: in order to use the key, you have to run `chmod 400` on the key file after downloading it to your local machine.


After the instance spins up, you'll need to connect to it via SSH:
* Find the newly created instance on EC2 and click `Connect`
* On your local machine, open a terminal and run the command from the Example. It will look something like:
```bash
ssh -i "{PATH TO KEY}" {user_name}@ec2-{instance_ip}.compute-1.amazonaws.com
```

## Install packages
Before installing packages, you will need to install Python and pip. You can follow the steps [here](https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/eb-cli3-install-linux.html).
These will most likely be:
```bash
sudo yum install python37
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user
```
Verify pip is installed using:
```bash
pip --version
```
You will also need to install git:
```bash
sudo yum install git -y
```
When connected to the instance, clone the repo:
```bash
git clone https://github.com/tbass134/SMS-Spam-Classifier-lambda
```
This repo contains everything we need to make predictions. This includes the pickle files for the model and vectorizer, as well as the lambda function that makes predictions and returns its response.

`cd` into the `SMS-Spam-Classifier-lambda/lambda` folder.
* Next, you will need to install the scikit-learn library (its package name on PyPI is `scikit-learn`; the old `sklearn` alias is deprecated).
* On your instance, type:
`pip install -t . scikit-learn`

This installs the library and its dependencies into the current folder, so they get zipped up alongside the lambda code.


Next, if you want to use your own trained model, it will need to be uploaded to your EC2 instance.
If you're using Google Colab, open the Files tab, right-click the saved model file (`model.pkl` above; rename it to `my_model.pkl` to match the commands below) and `vectorizer.pkl`, and click Download.
Note that the sample repo already contains a trained model, so this is optional.

To upload your trained model, you have a few options:
* Fork the repo, add your model files, and check out your fork on the EC2 instance.
* Use `scp` to copy the files from your local machine to the instance. To upload the vectorizer file we saved:
  ```bash
  scp -i {PATH_TO_KEY} vectorizer.pkl ec2-user@{INSTANCE_NAME}:
  ```

  and we'll do the same for the model:
  ```bash
  scp -i {PATH_TO_KEY} my_model.pkl ec2-user@{INSTANCE_NAME}:
  ```

* The other option is to upload the files to S3 and have your lambda function load them from there using Boto:
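```Python
  from io import BytesIO
  import boto3
  import joblib

  s3 = boto3.resource("s3")
  MODEL_BUCKET = "{YOUR_S3_BUCKET}"  # placeholder: the bucket holding the pickles

  def load_s3_file(key):
      obj = s3.Object(MODEL_BUCKET, key)
      body = obj.get()['Body'].read()  # download the object into memory
      return joblib.load(BytesIO(body))

  model = load_s3_file("{PATH_TO_S3_MODEL}")
  vectorizer = load_s3_file("{PATH_TO_S3_VECTORIZER}")
```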


## Create lambda function
* On the AWS console, navigate to https://console.aws.amazon.com/lambda
* Click on the Create function button
* Make sure `Author from scratch` is selected
* Name your function
* Set the runtime to Python 3.7
* Under Execution Role, create a new role with basic permissions
* Click `Create Function`

## Create S3 bucket
In order to push our code to a lambda function, we first need to zip up the code and libraries and copy the zip to an S3 bucket.
From there, our lambda function will load the zip file from this bucket.
* On the AWS console under `Services`, Search for `S3`
* Click `Create Bucket`
* Name your bucket, and click Create Bucket at the bottom of the page.


## Upload to lambda
Next, we'll run the `publish.sh` script in the root of the repo, which does the following:
* zips up the packages, including our Python code, model and transformer
* uploads the zip to an S3 bucket
* points our lambda function to this bucket

When calling this script, we need to pass in 3 arguments:
* The name of the zip file. We can call it `zip.zip` for now
* The name of the S3 bucket that we will upload the zip to
* The name of the lambda function
```bash
bash publish.sh {ZIP_FILE_NAME} {S3_BUCKET} {LAMBDA_FUNCTION_NAME}
```
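
The contents of `publish.sh` are not shown here, so the following is only a sketch of its first step (zipping a folder), written in Python with the standard library. The function name `zip_directory` is hypothetical; uploading to S3 and updating the lambda function would still be done by the script or the AWS tooling.

```python
import zipfile
from pathlib import Path

def zip_directory(src_dir, zip_path):
    """Zip every file under src_dir, storing paths relative to src_dir.

    Returns the number of files written to the archive.
    """
    src = Path(src_dir)
    target = Path(zip_path).resolve()
    count = 0
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src.rglob("*")):
            # Skip directories and the archive itself if it lives inside src_dir
            if path.is_file() and path.resolve() != target:
                zf.write(path, path.relative_to(src).as_posix())
                count += 1
    return count
```

Storing paths relative to the source folder keeps the handler file and the installed packages at the top level of the archive rather than nested under the folder name.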

If everything is successful, your lambda function will be deployed. 
If you see errors, make sure your EC2 instance has an IAM role with both S3 and Lambda permissions.
See this [guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html) for more info.


## Add HTTP endpoint
The final piece is to add an API Gateway.
On the Configuration tab of the lambda function:
* click `Add Trigger`
* Click on the select a trigger box and select `API Gateway`
* Click on `Create an API`
* Set API Type to `REST API`
* Set Security to `OPEN` (make sure to secure when deploying for production)
* At the bottom, click `Add`

For details, see this [documentation](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-lambda.html#api-as-lambda-proxy-create-api-resources).

We can now test the endpoint by using curl and making a call to our endpoint.
Under `API Gateway` section in lambda, click on oi

In the lambda function, we are looking for the `message` GET parameter. When we make our request, we'll pass a query parameter called `message`. This will contain the string we want to make a prediction on.
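
The actual handler lives in the repo linked above; the following is only a minimal sketch of what such a handler could look like, assuming the API Gateway Lambda proxy integration (which passes query parameters in `event["queryStringParameters"]`). The `predict_label` function is a hypothetical stand-in for loading the pickled vectorizer and model and calling `predict`.

```python
import json

def predict_label(message):
    # Hypothetical stand-in for:
    #   model.predict(vectorizer.transform([message]).toarray())[0]
    return "spam" if "offer" in message.lower() else "ham"

def lambda_handler(event, context):
    # With the proxy integration, query parameters arrive as a dict (or None if absent)
    params = event.get("queryStringParameters") or {}
    message = params.get("message")
    if not message:
        return {"statusCode": 400,
                "body": json.dumps({"error": "missing 'message' parameter"})}
    return {"statusCode": 200,
            "body": json.dumps({"prediction": predict_label(message)})}

# Local smoke test with a hand-built event
response = lambda_handler(
    {"queryStringParameters": {"message": "im on my way home"}}, None
)
print(response["body"])
```

Returning a 400 when the parameter is missing keeps a malformed request from reaching the model at all.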

In [None]:
ham_message = "im on my way home".replace(" ", "%20")
ham_message
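
Replacing spaces by hand works for these simple messages, but real SMS text can contain characters such as `&` or `%` that also need escaping. A sketch using the standard library's `urllib.parse.quote` (the `base_url` here is a placeholder, not the real endpoint):

```python
from urllib.parse import quote

base_url = "https://example.com/default/spam-detection"  # placeholder endpoint

def build_url(message):
    # safe="" makes quote() escape every reserved character, including "/" and "&"
    return f"{base_url}?message={quote(message, safe='')}"

print(build_url("im on my way home"))
print(build_url("free entry & 50% off"))
```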

In [None]:
%%bash -s "$ham_message"
curl --location --request GET "https://e18fmcospk.execute-api.us-east-1.amazonaws.com/default/spam-detection?message=$1"

{"prediction": "ham"}



In [None]:
spam_message = "this offer is to good to be true".replace(" ", "%20")
spam_message

In [None]:
%%bash -s "$spam_message"
curl --location --request GET "https://e18fmcospk.execute-api.us-east-1.amazonaws.com/default/spam-detection?message=$1"

{"prediction": "spam"}



# Google Cloud Functions

For non-Amazon users, we can use Google Cloud Functions to deploy our model for use in our Vonage SMS API app.
![](https://pbs.twimg.com/media/EdYKMFDUwAAwj0T?format=png&name=large)
Code is [here](https://gist.github.com/tbass134/7985c0adf44c938d6e683c18dabac8f9)

# Create Vonage SMS Application

The final step is to build a Vonage SMS Application.
Have a look at this blog post on how to build one yourself.
Our application will receive an SMS:
https://developer.nexmo.com/messaging/sms/code-snippets/receiving-an-sms

and will send an SMS back to the user with the model's prediction:
https://developer.nexmo.com/messaging/sms/code-snippets/send-an-sms

<img src="https://i.ibb.co/8mxfBKW/IMG-9-BA66209-F969-1.png" alt="drawing" width="300" text-align="center"/>


To work through this example, you will need to:
* Log in / sign up to [Vonage SMS API](https://dashboard.nexmo.com/sign-up)
* Rent a phone number
* Assign a publicly accessible URL via [ngrok](https://www.nexmo.com/blog/2017/07/04/local-development-nexmo-ngrok-tunnel-dr) to that phone number

We'll also build a simple Flask app that makes a request to our API Gateway:
```bash
git clone https://github.com/tbass134/SMS-Spam-Classifier-lambda.git
cd SMS-Spam-Classifier-lambda/app
```

Next we'll create a virtual environment and install the requirements using pip
```bash
virtualenv venv --python=python3
source venv/bin/activate
pip install -r requirements.txt
```

Next, create a `.env` file with the following:
```bash
NEXMO_API_KEY={YOUR_NEXMO_API_KEY}
NEXMO_API_SECRET={YOUR_NEXMO_API_SECRET}
NEXMO_NUMBER={YOUR_NEXMO_NUMBER}
API_GATEWAY_URL={FULL_API_GATEWAY}
```
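
How the Flask app reads these values is up to the code in the repo. As a hedged sketch, a startup check that all four settings are present could look like this (`parse_env` and `REQUIRED` are hypothetical names, and the sample values are fake):

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values

REQUIRED = ["NEXMO_API_KEY", "NEXMO_API_SECRET", "NEXMO_NUMBER", "API_GATEWAY_URL"]

# Fake sample contents with one setting deliberately left out
sample = "NEXMO_API_KEY=abc\nNEXMO_API_SECRET=def\nNEXMO_NUMBER=15551234567\n"
config = parse_env(sample)
missing = [name for name in REQUIRED if name not in config]
print(missing)
```

Failing fast on a missing setting is easier to debug than a request that silently goes to the wrong URL.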

Finally, you can run the application:
```bash
python app.py
```
This will spin up a web server listening on port 3000.


# Fin