# Build a spam filter service

It is time to build your own API service. In the following lecture we are going to build a spam filter similar to <a href="https://www.oopspam.com/" target="_blank">oopspam</a>.

For this purpose, we are going to train our own model to detect spam and then create an API using <a href="https://flask.palletsprojects.com/en/1.1.x/" target="_blank">Flask</a> to serve it to end users.

## Creating our spam filter

This chapter is split into two steps. The first one is ☝️ training our model and saving it for later. Then we ✌️ build our API around the fitted model, and the job is done!

As you will see, we leave out a lot of steps for brevity. For example, you can take your time and explore the dataset further if you want. We concentrate on the second step, building the API, which is the focus of this course.

### ☝️ Training

To create our spam filter we are using a simple dataset called `spam_dataset.csv`. You will find it in the resources. It consists of two columns: `email` and `label`. The `label` column contains only `0` for non-spam emails and `1` for spam emails. Texts in `email` look like this:

- Non-spam:

```
john p looney wrote the only way you can resolve this to my knowledge is to download the original libvorbis rpm and the new one remove the old one then do rpm uvh libvorbis rpm then assumes that you want both versions installed at the same time and does so why you can t do this after you have one library already installed is beyond me does using the oldpackage flag help your pain or is your pain caused by obsoletes flags cheers waider waider url yes it is very personal of me irish linux users group ilug url url for un subscription information list maintainer listmaster url 
```

- Spam:

```
important information the new domain names are finally available to the general public at discount prices now you can register one of the exciting new biz or info domain names as well as the original com and net names for just number number these brand new domain extensions were recently approved by icann and have the same rights as the original com and net domain names the biggest benefit is of course that the biz and info domain names are currently more available i e it will be much easier to register an attractive and easy to remember domain name for the same price visit url today for more info register your domain name today for just number number at url registration fees include full access to an easy to use control panel to manage your domain name in the future sincerely domain administrator affordable domains to remove your email address from further promotional mailings from this company click here url enumber numberffronumber numberbzkfnumberlignnumber numberdbtenumberzhwolnumber
```

It is a binary classification problem. We are going to train a logistic regression model to predict whether a text is spam or not.
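
Before a logistic regression can learn anything, each text has to become a row of numbers. The training script in this chapter relies on `CountVectorizer` for that. Here is a minimal sketch of what it produces, using two made-up texts that are not taken from the dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already preprocessed texts (illustration only)
texts = ["register your domain name today", "linux users group list"]

vectorizer = CountVectorizer()
# Learn the vocabulary and turn each text into a row of word counts
counts = vectorizer.fit_transform(texts)

# One row per text, one column per distinct word
print(counts.shape)
# Mapping from each word to its column index
print(vectorizer.vocabulary_)
# Dense view of the counts, readable for such a small example
print(counts.toarray())
```

Each email becomes a vector of word counts, and the logistic regression learns one weight per word.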

In a new `train.py` file, we are going to put the following code:

```python
import os
import joblib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

DATASET_PATH = "datas/spam_dataset.csv"
MODELS_FOLDER = "models"

# Load CSV file
df = pd.read_csv(DATASET_PATH)
# Get X and y
X = df["email"]
y = df["label"]
# Split dataset into train set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Build a pipeline with our CountVectorizer and our LogisticRegression model
classifier = Pipeline([("vectorizer", CountVectorizer()), ("classifier", LogisticRegression(solver="liblinear"))])
# Fit our classifier
classifier.fit(X_train, y_train)
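# Compute accuracy on test set (accuracy_score expects y_true first, then y_pred)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("Accuracy: ", accuracy)
# Make sure the models folder exists, then save our model with joblib
os.makedirs(MODELS_FOLDER, exist_ok=True)
joblib.dump(classifier, os.path.join(MODELS_FOLDER, "spam_classifier.joblib"))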
```

You should not be too lost with what you see. The accuracy is 0.992. Not too bad. 🤓

The most important thing to notice is the last line:

```python
joblib.dump(classifier, os.path.join(MODELS_FOLDER, "spam_classifier.joblib"))
```

What does it do? It exports our `classifier` _sklearn pipeline object_ into a file so we can use it later on with our API! We saved it in `models/spam_classifier.joblib`. You can read more about this in the <a href="https://scikit-learn.org/stable/modules/model_persistence.html" target="_blank">sklearn documentation</a>.

> It doesn't matter which extension you put at the end of the file by the way. 😉
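
To see the round trip in isolation, here is a small sketch that saves a pipeline and loads it back with `joblib.load`. The training texts, the `toy_classifier` name and the temporary folder are assumptions made for illustration; the real model is the one trained on the dataset.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only here to have something to save
texts = [
    "register your domain name today for a discount price",
    "click here for a special offer on domain names",
    "the linux users group list maintainer wrote about rpm",
    "download the original rpm and remove the old library",
]
labels = [1, 1, 0, 0]

toy_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(solver="liblinear")),
])
toy_classifier.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_classifier.joblib")
    # Write the fitted pipeline to disk
    joblib.dump(toy_classifier, model_path)
    # Read it back: the result is a fitted pipeline, ready to predict
    restored = joblib.load(model_path)

# The restored pipeline gives the same predictions as the original one
print(restored.predict(texts))
print(toy_classifier.predict(texts))
```

Because the whole pipeline is saved, the vectorizer's vocabulary travels with the model, so the loaded object accepts raw text directly.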

Thus ends the training part!

### ✌️ Building the API

Your folder should look something like this:

```
.
├── app.py                       ← You are going to create this one
├── datas
│   └── spam_dataset.csv
├── models
│   └── spam_classifier.joblib
├── requirements.txt             ← Maybe you don't have this one 😙
└── train.py
```

If you haven't created the `app.py` file yet, do it now. Let's think about what we are going to code: we want to create an endpoint `/spam` which accepts the POST method with a JSON body containing the mandatory key `email`, pointing to the already preprocessed mail content. We feed this `email` data into our loaded model and return the output in JSON format.

Here is one possible implementation:

```python
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load the model once at startup instead of on every request
classifier = joblib.load("models/spam_classifier.joblib")


@app.route("/spam", methods=["POST"])
def index():
    # Get the JSON as a dictionary (None if the body is not valid JSON)
    req = request.get_json(silent=True)
    # Check that we got a JSON object with the mandatory key
    if isinstance(req, dict) and "email" in req:
        # Predict
        prediction = classifier.predict([req["email"]])
        # Return the result as JSON but first we need to transform the
        # result so as to be serializable by jsonify()
        prediction = str(prediction[0])
        return jsonify({"predict": prediction}), 200
    # Bad request: tell the client what went wrong
    return jsonify({"msg": "Error: not a JSON or no email key in your request"}), 400


if __name__ == "__main__":
    app.run(debug=True)
```

We can test our API using `requests` (do not forget to start the Flask server first):

```python
import requests

response_non_spam = requests.post("http://127.0.0.1:5000/spam", json={"email": "url url date not supplied neil sandman gaiman has won his lawsuit against todd spawn mcfarlane vindicated in his assertion that mcfarlane breached his contracts stole his characters and used his name mcfarlane looked down somberly as the verdict was read as the judge polled the individual jury members he looked at their faces link number discuss number _thanks gnat number _ number url number url number url"})
response_non_spam.json()
```

```
{'predict': '0'}
```

The API returns `0`, meaning the text we sent is not spam!

```python
response_spam = requests.post("http://127.0.0.1:5000/spam", json={"email": "hi zzzz url today hyperlink hyperlink ________________________________________________________________________________________ if you would not like to get more spacial offers from us please hyperlink click here and you request will be honored immediately ________________________________________________________________________________________ egfbehkrtejpgtuyveahpeibbraqstvnwa"})
response_spam.json()
```

```
{'predict': '1'}
```

On the contrary, this email seems to be spam.
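
If you would rather not start a server for every check, Flask also ships a test client that sends requests to the app in-process. The sketch below is self-contained, so it uses a stand-in `/spam` route where a hard-coded rule replaces the trained model; that rule and the `spam` function name are assumptions made for illustration only.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/spam", methods=["POST"])
def spam():
    # Stand-in for the real endpoint: a hard-coded rule replaces the model
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict) or "email" not in payload:
        return jsonify({"msg": "Error: not a JSON or no email key in your request"}), 400
    prediction = "1" if "click here" in payload["email"] else "0"
    return jsonify({"predict": prediction}), 200


# The test client talks to the app directly, no running server needed
client = app.test_client()

response = client.post("/spam", json={"email": "please click here url"})
print(response.status_code, response.get_json())

response_missing_key = client.post("/spam", json={"text": "hello"})
print(response_missing_key.status_code, response_missing_key.get_json())
```

In your own project you would import the real `app` object from `app.py` instead of defining a stand-in route.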

Of course, this API is very basic. We could improve it in multiple ways: return probabilities with the response, allow predicting batches of emails, give better error messages, and so on. You can play with the code and try to improve it yourself; we stop here for brevity.
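
As a starting point for two of these ideas, here is a rough sketch of a helper that scores several emails at once and returns the spam probability along with the label. The `predict_batch` function, the `spam_probability` key and the toy training data are all hypothetical; only `predict_proba` and `classes_` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def predict_batch(classifier, emails):
    """Hypothetical helper: label and spam probability for each email."""
    # One row per email, one column per class (ordered as classifier.classes_)
    probabilities = classifier.predict_proba(emails)
    # Column holding the probability of the spam label (1)
    spam_column = list(classifier.classes_).index(1)
    # The predicted label is the class with the highest probability
    labels = classifier.classes_[probabilities.argmax(axis=1)]
    return [
        {"predict": str(label), "spam_probability": float(row[spam_column])}
        for label, row in zip(labels, probabilities)
    ]


# Tiny made-up training set so the sketch runs without the dataset
texts = [
    "register your domain name today for a discount price",
    "click here for a special offer on domain names",
    "the linux users group list maintainer wrote about rpm",
    "download the original rpm and remove the old library",
]
labels = [1, 1, 0, 0]

toy_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(solver="liblinear")),
])
toy_classifier.fit(texts, labels)

results = predict_batch(toy_classifier, ["special offer click here", "rpm library list"])
print(results)
```

In the endpoint, the list returned by such a helper could be passed to `jsonify()` as is, since it only contains strings and floats.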

## Wrapping up

Finally, in this walkthrough we wrapped our model into an API and sent requests to it. There are some important steps to remember:

- train and save your model (with `joblib` as we did, or `pickle`),
- create an endpoint, load your model and the request data, and check that everything is valid before going further,
- make a prediction,
- return the answer in JSON response.
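
For the `pickle` alternative mentioned in the first step, a minimal sketch could look like this. It serializes to bytes in memory to stay short, and the training texts are made up for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up training set, only here to have a fitted model to serialize
texts = [
    "click here for a special offer on domain names",
    "register your domain name today for a discount price",
    "the linux users group list maintainer wrote about rpm",
    "download the original rpm and remove the old library",
]
labels = [1, 1, 0, 0]

model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(solver="liblinear")),
])
model.fit(texts, labels)

# Serialize the fitted pipeline to bytes and restore it
# (pickle.dump / pickle.load do the same with a file opened in binary mode)
model_bytes = pickle.dumps(model)
restored_model = pickle.loads(model_bytes)

print(restored_model.predict(texts))
```

Whichever library you choose, only load files you trust: unpickling can execute arbitrary code.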

And voilà! 👏