# Fake News Detector <a id='top'></a>

## Using SageMaker

_Using Amazon SageMaker for Training and Deployment_

---

Now that we have explored our data in the previous [notebook](https://github.com/gtraskas/fake-news-detector/blob/master/data-exploration.ipynb), we are ready to use SageMaker to construct a complete ML classifier from end to end. Our goal is a simple web page where a user can enter the text of a news article. The web page then sends the text to our deployed model, which predicts whether it is fake or true.

## General Outline

1. [Import Libraries](#import)
2. [Read in the Data](#read)
3. [Prepare and Process the Data](#prepare)
    1. [Split Data to Train/Test](#split)
    2. [Clean and Tokenize Text](#clean)
4. [Extract Features](#extract)
5. [Build and Fit a Naive Bayes Model](#build)
    1. [Evaluate Model](#evaluate)
6. [Upload the Data to Amazon's S3](#upload)
7. [Data Preprocessing for BlazingText](#blazingtext)
8. [Deploy the Model](#deploy)
    1. [Test the Model](#test)
    2. [Clean Up](#clean-up)
9. [Put the Model to Work](#work)
    1. [Set up a Lambda Function](#lambda)
    2. [Set up API Gateway](#api)
10. [Deploy our Web App](#web)
    1. [Delete the endpoint](#delete)
    2. [Optional: Clean up](#opt-clean)

**Note:** We will not be testing the model in its own step. We will do it by deploying the model and then using the deployed model by sending the test data to it. One of the reasons for doing this is so that we can make sure that our deployed model is working correctly before moving forward.

## Import Libraries <a id='import'></a>

In the next cells we will import the required libraries.

In [3]:
# Import libraries.
import os
import pickle
from tqdm import tqdm

import pandas as pd
import numpy as np

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix
from sklearn.metrics import accuracy_score

import json
import string
import re
from bs4 import BeautifulSoup

from helpers import read_data, process_text, prepare_data, extract_features

# import sagemaker
# from sagemaker import get_execution_role
# from sagemaker.amazon.amazon_estimator import get_image_uri
# from sagemaker.predictor import csv_serializer

# Set global variables
RANDOM_STATE = 5
DIR = 'data/'

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LENOVO\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


ImportError: cannot import name 'joblib' from 'sklearn.externals' (C:\Users\LENOVO\AppData\Roaming\Python\Python38\site-packages\sklearn\externals\__init__.py)

## Read in the Data <a id='read'></a>

First, we will read the data using the `read_data` function from the `helpers.py` file. This function reads the downloaded data into `pandas` dataframes, creates a `label` column to indicate whether each news item is fake or true, concatenates the two datasets, shuffles the rows, and returns the combined dataframe.

In [None]:
# Read data.
df = read_data(DIR, RANDOM_STATE)

# Show first rows.
df.head()

Let's inspect what other functions we have added in the `helpers.py` file.

In [None]:
!pygmentize helpers.py

## Prepare and Process the Data <a id='prepare'></a>

Now, we will do some data processing. To begin with, we will join the `title` and `text` columns into a single input column and remove the `subject` and `date` columns (and the now redundant `title`), since the data exploration found no useful features in them.

In [None]:
# Join title and text into one column.
df['text'] = df.title + " " + df.text

# Remove useless columns.
df.drop(columns=['subject', 'date', 'title'], inplace=True)

# Show the first rows.
display(df.head())

# Show an example of text.
df.text[3]

### Split Data to Train/Test <a id='split'></a>

Then, we will split the dataset into a training set and a testing set, using 80% of the data for training and the remaining 20% for testing.

In [None]:
# Split data to train and test datasets.
train_X, test_X, train_y, test_y = train_test_split(df.text, df.label, test_size=0.2,
                                                    random_state=RANDOM_STATE)

print("Fake and True News (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

Now that our training and testing sets are prepared, we should do a quick check and look at an example of the data our model will be trained on. This is generally a good idea, as it lets us see how each of the later processing steps affects the texts and also confirms that the data has been loaded correctly.

In [None]:
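# Use positional indexing: after the shuffled split, index label 3 may not be in the training set.
print(train_X.iloc[3])
print(train_y.iloc[3])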

### Clean and Tokenize Text <a id='clean'></a>

The steps in processing the text are:

- Read a text file as a string of raw text.
- Lower case all words, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
- Normalize numbers, replacing them with the text "number".
- Remove non-words, remove punctuation, and trim all white spaces (tabs, newlines, spaces) to a single space character.
- Tokenize the raw text string into a list of words where each entry is a word. Tokenization is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters like punctuation marks are discarded. The tokens become the input for another process like parsing and text mining. Then the words will be ready to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or vectorization).
- Use lemmatization or stemming to consolidate closely related words. For example, "discount", "discounts", "discounted" and "discounting" are all replaced with "discount". Sometimes the stemmer strips off additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".
- Remove stop words. Stop words are used so frequently that for many tasks (but not all) they don't carry much information. Examples are "any", "all", "what", etc. NLTK has a built-in corpus of English stop words that can be loaded and used.
- Apply additional text preparation steps, such as normalizing links and emails: All https and http links will be replaced with the text "link" and all emails will be replaced with the text "email".
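
The steps above can be sketched with the standard library alone. This is only an illustration, not the `process_text` function from `helpers.py`: the function name `normalize_text` and the tiny stop-word set are assumptions made for the example, and stemming is left out because it needs NLTK.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"a", "an", "the", "is", "are", "at", "of", "to", "and", "or", "any", "all", "what"}

def normalize_text(text):
    """Lower-case, normalize links/emails/numbers, strip punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"http\S+", "link", text)    # http and https links -> "link"
    text = re.sub(r"\S+@\S+", "email", text)   # email addresses -> "email"
    text = re.sub(r"\d+", "number", text)      # runs of digits -> "number"
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()                       # split() also collapses repeated whitespace
    return [tok for tok in tokens if tok not in STOPWORDS]

sample = "Visit https://example.com or mail Bob@example.org: 3 SHOCKING facts!"
print(normalize_text(sample))
# ['visit', 'link', 'mail', 'email', 'number', 'shocking', 'facts']
```

Links and emails are replaced before punctuation is removed, since stripping punctuation first would destroy the `://` and `@` markers that the patterns rely on.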

The `process_text` method defined in `helpers.py` uses the `BeautifulSoup` and `re` modules to normalize the text and the `nltk` package to stem it. As a check to ensure we know how everything is working, we will apply `process_text` to one example in the training set.

In [None]:
# Apply process_text to an example.
process_text(train_X.iloc[3])

Everything looks as expected. The `prepare_data` method in `helpers.py` applies this processing to the whole dataset and caches the results, because this step can take a long time (about 30 minutes on a MacBook Pro with a 2.2 GHz 6-core Intel Core i7). This way, if we are unable to complete the notebook in the current session, we can come back without needing to process the data a second time.
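
The caching idea can be sketched as follows. This is a minimal illustration of the pattern, not the actual `prepare_data` implementation; the helper name `cached`, the stand-in `slow_preprocess` function, and the cache file name are all assumptions.

```python
import os
import pickle
import tempfile

def cached(cache_path, compute):
    """Return the pickled result stored at cache_path, computing and saving it on a miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def slow_preprocess():
    """Stand-in for the expensive text processing; records every real computation."""
    calls.append(1)
    return [["fake", "news"], ["true", "stori"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessed_data.pkl")  # hypothetical cache file name
    first = cached(path, slow_preprocess)    # computes and writes the cache
    second = cached(path, slow_preprocess)   # read back from the cache file

print(first == second, len(calls))  # True 1
```

One consequence of this design is that a stale cache is reused silently, so the cache file should be deleted whenever the processing logic changes.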

In [None]:
# Register tqdm with pandas so that `progress_apply` shows a progress bar.
tqdm.pandas()

# Ensure directory exists.
os.makedirs(DIR, exist_ok=True)

# Preprocess data.
train_X, test_X, train_y, test_y = prepare_data(train_X, test_X, train_y, test_y, cache_dir=DIR)


## Extract Features <a id='extract'></a>

Before training machine learning algorithms, preprocessed text needs to be transformed into numerical data. This process is called feature extraction or vectorization. There are three popular feature extraction algorithms:

- **Bag-of-Words**

This algorithm is made up of two parts. First, it creates a dictionary from the entire corpus. Then, it transforms each text in the corpus into a vector of word occurrences. This is called a "bag" because it disregards the order of the words within the text and focuses on content. A Bag of Words can be obtained using Sklearn's `CountVectorizer`.

- **N-grams**

An N-gram is a sequence of N consecutive words treated as a single feature, as opposed to the single-word features of the Bag of Words. The idea is to extract contextual information and enrich the data. For example, the word "good" is positive on its own but becomes negative when preceded by "not", so "not good" is an informative bigram. N-grams can also be extracted with Sklearn's `CountVectorizer` by setting its `ngram_range` parameter.

- **Term Frequency - Inverse Document Frequency (Tfidf)**

Rather than counting occurrences, the TfIdf vectorizer computes an importance value for each word based on its frequency in the text and its rarity across the entire corpus. That value is the product of the TF and the IDF.

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

The steps of the Tfidf algorithm described above are automated in Sklearn's `TfidfVectorizer` class.
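
As a worked example of the two formulas, the sketch below computes TF and IDF by hand on a toy corpus of already-stemmed tokens (the corpus is made up for illustration). Note that `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each vector, so its values will differ from these textbook ones.

```python
import math
from collections import Counter

# Toy corpus of tokenized, stemmed texts (invented for this example).
docs = [
    ["senat", "vote", "tax", "bill"],
    ["shock", "video", "senat"],
    ["tax", "cut", "tax", "plan"],
]

def tf(term, doc):
    """Share of the document's tokens that are this term."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """Natural log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(round(tfidf("tax", docs[2], docs), 4))    # 0.2027  (tf = 2/4, idf = ln(3/2))
print(round(tfidf("shock", docs[1], docs), 4))  # 0.3662  (tf = 1/3, idf = ln(3/1))
```

"shock" scores higher than "tax" even though it occurs less often in its text, because it appears in only one of the three documents. The plain formula divides by zero for a term that appears in no document, which is one reason library implementations smooth the IDF.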

**Note:** We could use and test all the aforementioned algorithms using Sklearn's `Pipeline` package, which combines all the steps and simplifies our work, but this is out of the scope of this project. So, we will use the Bag-of-Words model, but enabling `ngram_range=(1, 3)`, so as to use more features (unigrams, bigrams, and trigrams).
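
To illustrate what `ngram_range=(1, 3)` means, here is a small pure-Python sketch that lists every unigram, bigram, and trigram of a token list. The function name `ngrams` is an assumption for the example; `CountVectorizer` does the equivalent internally while also building the vocabulary.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["not", "good", "news"]
counts = Counter(ngrams(tokens))
print(sorted(counts))
# ['good', 'good news', 'news', 'not', 'not good', 'not good news']
```

Three tokens already yield six features, which is why the vocabulary is capped (at 5000 entries in the next cell) when higher-order n-grams are enabled.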

Later on, when we construct an endpoint that processes a submitted text, we will need the `vocabulary` that we create here. As such, we will save it to a file now for future use.
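
The reason the vocabulary has to be kept is that any new text must be mapped onto exactly the same columns the model was trained on. A minimal sketch of that mapping, with a toy three-term vocabulary and a hypothetical `vectorize` helper, might look like this:

```python
def vectorize(tokens, vocabulary):
    """Count how often each vocabulary term occurs in tokens; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for tok in tokens:
        idx = vocabulary.get(tok)
        if idx is not None:
            vector[idx] += 1
    return vector

vocabulary = {"senat": 0, "tax": 1, "vote": 2}  # toy term -> column index mapping
print(vectorize(["tax", "cut", "tax", "vote"], vocabulary))  # [0, 2, 1]
```

The token "cut" is dropped because it has no column, just as words outside the 5000-term vocabulary are dropped in the real feature matrix.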

In [None]:
# Extract Bag of Words features for both training and test datasets.
train_X, test_X, vocabulary = extract_features(train_X, test_X, 5000, cache_dir=DIR)

We can check that things are working as intended by examining one example text.

In [None]:
# Use this cell to examine one of the processed texts to make sure everything is working as intended.
print(train_X[5])
print(len(train_X[5]))

In [None]:
# Print some n-grams from the dictionary.
for key in sorted(vocabulary, key=vocabulary.get, reverse=True)[:20]:
    print(key, ':', vocabulary[key])

Everything looks reasonable and the length of a text in the training set is 5000, which is the vocabulary size after extracting the features.

## Build and Fit a Naive Bayes Model <a id='build'></a>

There are two options for building a model: use SageMaker's built-in algorithms (like XGBoost) or custom ones, or train the model on this instance using Sklearn's models. For the first option, we need to upload the data to S3 and then create the model, which comprises three objects:

 - Model Artifacts,
 - Training Code, and
 - Inference Code,

each of which interacts with the others.

In [None]:
# Build and fit the model.
model = MultinomialNB()
model.fit(train_X, train_y)

### Evaluate Model <a id='evaluate'></a>

Use various metrics from Sklearn to evaluate our model on the test dataset. 

In [None]:
# Make and save the predictions.
predictions = model.predict(test_X)

print(confusion_matrix(test_y, predictions))
print(classification_report(test_y, predictions))
plot_confusion_matrix(model, test_X, test_y, cmap='Blues')
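
To make the numbers in the confusion matrix and classification report concrete, the sketch below derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The labels are made-up toy values, and treating class `1` as the positive class is an assumption; it does not say which class the notebook encodes as fake.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```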

## Upload the Data to Amazon's S3 <a id='upload'></a>

Amazon's S3 service allows us to store files that can be accessed by both the built-in training models, such as the XGBoost model, and custom models. To do this, we need to split the training dataset into two parts: the data we will train the model with and a validation set. Then, we have to write those datasets to files and upload the files to S3. In addition, we have to write the test set input to a file and upload that file to S3. This is so that we can use SageMaker's Batch Transform functionality to test our model once we've fit it.

It is important to note the format of the data that we are saving as we will need to know it when we write the training code. In our case, each row of the dataset has the form `label`, `text`, where `text` is a sequence of `5000` integers representing the words in the news text.

The documentation for the algorithms in SageMaker requires that the saved datasets should contain no headers or index and that for the training and validation data, the **label should occur first** for each sample.

## Data Preprocessing for BlazingText <a id='blazingtext'></a>

We need to preprocess the training data into space separated tokenized text format which can be consumed by the `BlazingText` algorithm. The class label(s) should be prefixed with `__label__` and it should be present in the same line along with the original sentence.
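
The target line format can be sketched as below. The helper name `to_blazingtext_line` and the sample tokens are assumptions for illustration, and the label values `1` and `0` are placeholders rather than a statement of which class means fake.

```python
def to_blazingtext_line(label, tokens):
    """Format one sample as '__label__<label> token token ...'."""
    return "__label__{} {}".format(label, " ".join(tokens))

samples = [
    (1, ["senat", "vote", "tax", "bill"]),
    (0, ["shock", "video", "emerg"]),
]
lines = [to_blazingtext_line(label, tokens) for label, tokens in samples]
print("\n".join(lines))
# __label__1 senat vote tax bill
# __label__0 shock video emerg
```

Writing such lines to a file directly keeps them free of quote characters; a CSV writer that uses a space as separator will by default wrap any text field containing spaces in double quotes.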

In [None]:
# Split data to train, validation, and test datasets.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=RANDOM_STATE)
df_train, df_valid = train_test_split(df_train, test_size=0.2, random_state=RANDOM_STATE)

# Put label in first column.
df_train = df_train[['label', 'text']]
df_valid = df_valid[['label', 'text']]
df_test = df_test[['label', 'text']]

# Add __label__ to class as prefix.
df_train.label = '__label__' + df_train.label.astype('str')
df_valid.label = '__label__' + df_valid.label.astype('str')
df_test.label = '__label__' + df_test.label.astype('str')

# Clean and normalize text.
df_train.text = df_train.text.progress_apply(process_text)
df_valid.text = df_valid.text.progress_apply(process_text)
df_test.text = df_test.text.progress_apply(process_text)

# Show dfs.
display(df_train.head())
display(df_valid.head())
display(df_test.head())

In [None]:
# Save to csv.
df_train.to_csv(os.path.join(DIR, 'news.train'), sep=' ', header=False, index=False)
df_valid.to_csv(os.path.join(DIR, 'news.valid'), sep=' ', header=False, index=False)
df_test.to_csv(os.path.join(DIR, 'news.test'), sep=' ', header=False, index=False)

Next, we need to upload the data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [None]:
# Store the current SageMaker session.
session = sagemaker.Session()

# Store the bucket.
bucket = session.default_bucket()

# S3 prefix (which folder will we use).
prefix = 'fake-news-bt'

# Upload the processed test, train and validation files,
# which are contained in data directory to S3 using session.upload_data().
test_location = session.upload_data(os.path.join(DIR, 'news.test'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(DIR, 'news.valid'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(DIR, 'news.train'), key_prefix=prefix)

In [None]:
# Our current execution role is required when creating the model as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [None]:
# We need to retrieve the location of the container, which is provided by Amazon for using BlazingText.
# As a matter of convenience, the training and inference code both use the same container.
container = get_image_uri(session.boto_region_name, 'blazingtext', 'latest')

In [None]:
# First we create a SageMaker estimator object for our model.
bt_model = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                         role, # What is our current IAM Role
                                         train_instance_count=1, # How many compute instances
                                         train_instance_type='ml.c4.4xlarge', # What kind of compute instances
                                         train_volume_size = 30,
                                         train_max_run = 360000,
                                         input_mode= 'File',
                                         output_path='s3://{}/{}/output'.format(bucket, prefix),
                                         sagemaker_session=session)

# And then set the algorithm specific parameters.
bt_model.set_hyperparameters(mode="supervised",
                             epochs=10,
                             min_count=2,
                             learning_rate=0.05,
                             vector_dim=10,
                             early_stopping=True,
                             patience=4,
                             min_epochs=5,
                             word_ngrams=3)

In [None]:
s3_input_train = sagemaker.s3_input(s3_data=train_location, distribution='FullyReplicated',
                                    content_type='text/plain')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, distribution='FullyReplicated',
                                         content_type='text/plain')

In [None]:
bt_model.fit({'train': s3_input_train, 'validation': s3_input_validation})

## Deploy the Model <a id='deploy'></a>

Once we construct and fit our model, SageMaker stores the resulting model artifacts and we can use those to deploy an endpoint (inference code). Deploying an endpoint is a lot like training the model with a few important differences. The first is that a deployed model doesn't change the model artifacts, so as you send it various testing instances the model won't change. Another difference is that since we aren't performing a fixed computation, as we were in the training step or while performing a batch transform, the compute instance that gets started stays running until we tell it to stop. This is important to note as if we forget and leave it running we will be charged the entire time.

In [None]:
predictor = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')

### Test the Model <a id='test'></a>

Now that we have deployed our endpoint, we can send the testing data to it and get back the inference results. Earlier we evaluated the Naive Bayes model locally; here we test the BlazingText model through the newly deployed endpoint, both to make sure that it works properly and to get a feel for how the endpoint works.

In [None]:
test_X = pd.read_csv(os.path.join(DIR, 'news.test'), header=None)

In [None]:
# Create a function to define the batches.
# From: https://stackoverflow.com/questions/8290397/how-to-split-an-iterable-in-constant-size-chunks
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

In [None]:
# Create batches of 512 inputs for prediction and add them to a list.
predictions = []
for x in batch(test_X.iloc[:,0].str[12:-1].tolist(), 512):
    payload = {"instances" : x}
    prediction_batch = predictor.predict(json.dumps(payload))
    prediction_batch = [int(prediction.get("label")[0][9:]) for prediction in json.loads(prediction_batch)]
    predictions.append(prediction_batch)

In [None]:
# Flatten list.
predictions = sum(predictions, [])
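
The batching, request, and response handling in the cells above can be tried offline with a stand-in for the endpoint. In this sketch `fake_predict` is a made-up substitute for `predictor.predict` (it is not a real model), and the response shape, a list of objects with `label` and `prob` lists, is assumed from the parsing code above.

```python
import json
from itertools import chain

def batch(items, n=1):
    """Yield consecutive chunks of at most n items."""
    for start in range(0, len(items), n):
        yield items[start:start + n]

def parse_labels(response_body):
    """Turn a BlazingText-style JSON response into a list of integer class labels."""
    prefix = "__label__"
    return [int(item["label"][0][len(prefix):]) for item in json.loads(response_body)]

def fake_predict(request_body):
    """Stand-in for predictor.predict: labels a text 1 if it mentions 'shock', else 0."""
    texts = json.loads(request_body)["instances"]
    return json.dumps([
        {"label": ["__label__1" if "shock" in text else "__label__0"], "prob": [0.9]}
        for text in texts
    ])

texts = ["shock video", "senat vote", "tax plan", "shock claim", "budget bill"]
per_batch = [parse_labels(fake_predict(json.dumps({"instances": chunk})))
             for chunk in batch(texts, 2)]
predictions = list(chain.from_iterable(per_batch))  # flatten the per-batch lists
print(predictions)  # [1, 0, 0, 1, 0]
```

`chain.from_iterable` flattens in linear time, whereas `sum(lists, [])` copies the growing list on every addition; for a test set of this size either approach is fine.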

Lastly, we check again to see what the accuracy of our model is.

In [None]:
accuracy_score(test_y, predictions)

The model has very good accuracy on the unseen test set.

### Clean Up <a id='clean-up'></a>

Now that we've determined that deploying our model works as expected, we are going to shut it down. Remember that the longer the endpoint is left running, the greater the cost and since we have a bit more work to do before we are able to use our endpoint with our simple web app, we should shut everything down.

In [None]:
# Delete the endpoint now only if you do not plan to use it later with the Lambda function
# and API to make predictions through the simple web app.
predictor.delete_endpoint()

## Put the Model to Work <a id='work'></a>

This project's goal is to have our model deployed and then access it using a very simple web app. The intent is for this web app to take some user submitted data (a news text), send it off to our endpoint (the model) and then display the result as fake or true.

However, there is a small catch. Currently the only way we can access the endpoint to send it data is using the SageMaker API. We can, if we wish, expose the actual URL that our model's endpoint is receiving data from, however, if we just send it data ourselves we will not get anything in return. This is because the endpoint created by SageMaker requires the entity accessing it have the correct permissions. So, we would need to somehow authenticate our web app with AWS.

Having a website that authenticates to AWS seems a bit beyond the scope of this project, so we will opt for an alternative approach. Namely, we will create a new endpoint which does not require authentication and which acts as a proxy for the SageMaker endpoint.

As an additional constraint, we will try to avoid doing any data processing in the web app itself. Remember that when we tested our deployed model we started with a news text, cleaned it by normalizing links, numbers, and emails and by removing punctuation, and then sent the resulting string to our model. All of this needs to be done to our user input as well. Fortunately, we can do all of this data processing in the backend, using Amazon's Lambda service.

### Set up a Lambda Function <a id='lambda'></a>

The first thing we are going to do is set up a Lambda function. This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the text) to the SageMaker endpoint we've created and then return the result.

#### Part A: Create an IAM Role for the Lambda function

Since we want the Lambda function to call a SageMaker endpoint, we need to make sure that it has permission to do so. To do this, we will construct a role that we can later give the Lambda function.

Using the AWS Console, we navigate to the **IAM** page and click on **Roles**. Then, we click on **Create role**. We make sure that the **AWS service** is the type of trusted entity selected and choose **Lambda** as the service that will use this role and then click **Next: Permissions**.

In the search box we type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, we click on **Next: Review**.

Lastly, we give this role a name. We make sure we use a name that we will remember later on, for example `LambdaSageMakerRole`. Then, we click on **Create role**.

#### Part B: Create a Lambda function

Now it is time to actually create the Lambda function. In order to process the user provided input and send it to our endpoint we need to gather two pieces of information:

 - The name of the endpoint, and
 - the vocabulary object.

We will copy these pieces of information to our Lambda function after we create it.

To start, using the AWS Console, navigate to the AWS Lambda page and click on **Create a function**. When you get to the next page, make sure that **Author from scratch** is selected. Now, we name our Lambda function, using a name that we will remember later on, for example `fake_news_func`. We make sure that the latest **Python** runtime is selected and then choose the role that we created in the previous part. Then, we click on **Create Function**.

On the next page we will see some information about the Lambda function we've just created. If we scroll down we should see an editor in which we can write the code that will be executed when our Lambda function is triggered. Collecting the code we wrote above to process text and adding it to the provided example `lambda_handler` we arrive at the following.

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

# And we need the following libraries to do some of the data processing.
import re
import string
import json

# Create a function to process text.
def process_text(text):
    # Normalize links replacing them with the str 'link'.
    text = re.sub(r'http\S+', 'link', text)

    # Normalize numbers replacing them with the str 'number'.
    text = re.sub(r'\d+', 'number', text)

    # Normalize emails replacing them with the str 'email'.
    text = re.sub(r'\S+@\S+', 'email', text)

    # Remove punctuation.
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Remove leading and trailing whitespace.
    text = text.strip()

    # Convert all letters to lower case.
    text = text.lower()

    # Split text into words and rejoin them with single spaces.
    words = text.split()

    return ' '.join(words)

# Create the SageMaker runtime client once, outside the handler.
runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    data = event['body']
    sentence = process_text(data)

    try:
        # BlazingText expects a list of sentences under the 'instances' key.
        payload = {"instances": [sentence]}

        response = runtime.invoke_endpoint(EndpointName='blazingtext-2020-06-15-07-32-18-142',
                                           ContentType='application/json',
                                           Body=json.dumps(payload))

        result = json.loads(response['Body'].read().decode())

        # Strip the '__label__' prefix (9 characters) from each predicted label.
        labels = [label[9:] for label in result[0]['label']]

        print("DATA", data)
        print("SENTENCE", sentence)
        return {
            'statusCode': 200,
            'headers': {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'},
            'body': str(labels[0])
        }
    except Exception as e:
        print(e)
        return {'statusCode': 400,
                'body': json.dumps({'error_message': 'Unable to generate tag.'})}
```

Once we have copied and pasted the code above into the Lambda code editor, we replace the value passed as `EndpointName` with the name of the endpoint that we deployed earlier. We can determine the name of the endpoint using the code cell below.

In [None]:
predictor.endpoint

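In addition, if the deployed model consumed Bag-of-Words vectors, we would need to copy the vocabulary dictionary into the code at the beginning of the `lambda_handler` method. The BlazingText endpoint used here takes the cleaned text directly, so the cell below, which prints the vocabulary dict in a way that is easy to copy and paste, is left commented out.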

In [None]:
# print(str(vocabulary))

Once we have added the endpoint name to the Lambda function, we click on **Save**. Our Lambda function is now up and running. Next we need to create a way for our web app to execute the Lambda function.

### Set up API Gateway <a id='api'></a>

Now that our Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

Using the AWS Console, we navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, we make sure that **New API** is selected and give the new API a name, for example, `fake_news_web_app`. Then, we click on **Create API**.

Now we have created an API, however it doesn't currently do anything. What we want it to do is to trigger the Lambda function that we created earlier.

We select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created. We open its dropdown menu, select **POST**, and then click on the check mark beside it.

For the integration point, we make sure that **Lambda Function** is selected and we click on the **Use Lambda Proxy integration**. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object as it will also not be processed by API Gateway.
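
To make the proxy contract concrete, the sketch below shows the shape of the data on both sides: the request body arrives as a string under `event['body']`, and the handler must return a dict with `statusCode`, `headers`, and a string `body`. The `classify` function is a made-up stand-in for the SageMaker call, and the event contains only the two fields the example needs.

```python
import json

def classify(text):
    """Hypothetical stand-in for the SageMaker endpoint call; not a real model."""
    return "1" if "shock" in text.lower() else "0"

def handler(event, context):
    """Minimal Lambda proxy handler: read the raw body, return a full response object."""
    text = event.get("body") or ""
    if not text.strip():
        return {
            "statusCode": 400,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"error_message": "Empty request body."}),
        }
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
        "body": classify(text),
    }

# A trimmed-down proxy event; real events carry many more fields.
event = {"httpMethod": "POST", "body": "Shock video emerges"}
response = handler(event, None)
print(response["statusCode"], response["body"])  # 200 1
```

The `Access-Control-Allow-Origin` header matters here because, with proxy integration, API Gateway passes the headers through unchanged, so the function itself has to allow the browser's cross-origin request.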

We type the name of the Lambda function we created earlier into the **Lambda Function** text entry box and then click on **Save**. We click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function we created.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. We will need to create a new deployment stage and name it anything we like, for example `prod`.

We have now successfully set up a public API to access our SageMaker model. We make sure to copy or write down the URL provided to invoke our newly created public API, as this will be needed in the next step. This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.
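
Before wiring up the web page, the API can also be exercised from Python. The sketch below only builds the POST request; the URL is a placeholder that must be replaced with the real **Invoke URL**, and the `build_request` helper is an assumption for the example. Sending it requires the API and the SageMaker endpoint to be running, so that part is left commented out.

```python
import urllib.request

# Placeholder only: replace with the Invoke URL shown in the API Gateway console.
API_URL = "https://example.execute-api.us-east-1.amazonaws.com/prod"

def build_request(text, url=API_URL):
    """Build a POST request that carries the raw news text as its body."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )

request = build_request("Senate votes on new tax bill")
print(request.get_method(), request.full_url)

# Sending the request needs a live API and a running endpoint:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```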

## Deploy our Web App <a id='web'></a>

Now that we have a publicly available API, we can start using it in a web app. For our purposes, we have provided a simple static HTML file, which can make use of the public API we created earlier.

In the `website` folder there should be a file called `index.html`. We edit this file in a text editor of our choice:

- There is a line which contains **\*\*REPLACE WITH PUBLIC API URL\*\***. We replace this string with the URL that we wrote down in the last step and then save the file.

Now, if we open `index.html` on our local computer, the browser loads the page directly from disk and we can use the provided site to interact with our SageMaker model.

If we'd like to go further, we can host this HTML file anywhere we'd like, for example using GitHub or hosting a static site on Amazon's S3. Once we have done this we can share the link with anyone we'd like and have them play with it too.

> **Important Note** In order for the web app to communicate with the SageMaker endpoint, the endpoint has to actually be deployed and running. This means that we are paying for it. Make sure that the endpoint is running when we want to use the web app but that we shut it down when we don't need it, otherwise we will end up with a surprisingly large AWS bill.

### Delete the endpoint <a id='delete'></a>

Remember to always shut down our endpoint if we are no longer using it. We are charged for the length of time that the endpoint is running so if we forget and leave it on we could end up with an unexpectedly large bill.

In [None]:
predictor.delete_endpoint()

### Optional: Clean up <a id='opt-clean'></a>

The default notebook instance on SageMaker doesn't have a lot of excess disk space available. As we continue to complete and execute notebooks we will eventually fill up this disk space, leading to errors, which can be difficult to diagnose. Once we are completely finished using a notebook it is a good idea to remove the files that we created along the way. Of course, we can do this from the terminal or from the notebook hub if we would like. The cell below contains some commands to clean up the created files from within the notebook.

In [None]:
# First we will remove all of the files contained in the data directory.
!rm $DIR/*

# Then we delete the directory itself.
!rmdir $DIR
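
The same clean-up can be done in plain Python, which may be convenient where the `rm` and `rmdir` shell commands are not available (the output earlier in this notebook shows Windows paths). This is a minimal sketch: `remove_data_dir` is a hypothetical helper, and it is demonstrated on a temporary directory rather than on the real `data/` folder.

```python
import os
import shutil
import tempfile

def remove_data_dir(path):
    """Delete a directory and everything inside it, if it exists."""
    if os.path.isdir(path):
        shutil.rmtree(path)

# Demonstrate on a throwaway directory instead of the real data folder.
demo_root = tempfile.mkdtemp()
demo_dir = os.path.join(demo_root, "data")
os.makedirs(demo_dir)
with open(os.path.join(demo_dir, "news.train"), "w") as f:
    f.write("__label__1 senat vote tax bill\n")

remove_data_dir(demo_dir)
removed = not os.path.exists(demo_dir)
print(removed)  # True

shutil.rmtree(demo_root)  # remove the temporary parent directory as well
```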