# Toxic comment classification



## Outline


1. Upload the processed data to S3.
2. Train a chosen model.
3. Test the trained model (typically using a batch transform job).
4. Deploy the trained model.
5. Use the deployed model.



### Uploading the training data


Next, we need to upload the training data to the SageMaker default S3 bucket so that we can provide access to it while training our model.

In [1]:
import os
data_dir = './data_to_s3'  # The folder where the data iterators are stored
if not os.path.exists(data_dir): # Make sure that the folder exists
    os.makedirs(data_dir)
    print('Created: ', data_dir)

In [2]:
import sagemaker

sagemaker_session = sagemaker.Session()

bucket = sagemaker_session.default_bucket()
prefix = 'toxic/data'

role = sagemaker.get_execution_role()

In [3]:
input_data = sagemaker_session.upload_data(path=data_dir, bucket=bucket, key_prefix=prefix)
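For a directory upload, `upload_data` is documented to return an S3 URI built from the bucket name and the key prefix, and that URI is what we later pass to `fit`. The helper below is hypothetical (it is not part of the SageMaker SDK) and the bucket name is a made-up placeholder. It only mirrors that URI shape, so the value stored in `input_data` is easier to recognise.

```python
def expected_s3_uri(bucket, key_prefix):
    """Return the URI shape that a directory upload is documented to produce.

    Hypothetical helper for illustration only; not a SageMaker SDK function.
    """
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


# Placeholder bucket name; the real one comes from sagemaker_session.default_bucket()
print(expected_s3_uri("sagemaker-us-east-1-123456789012", "toxic/data"))
```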

## Training the model

In [7]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="train_nlp.py",
                    source_dir="train",
                    role=role,
                    framework_version='0.4.0',
                    train_instance_count=1,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={
                        'epochs': 4,
                        # 'hidden_dim': 100,
                    })

In [8]:
estimator.fit({'training': input_data})

2019-11-20 00:16:52 Starting - Starting the training job...
2019-11-20 00:16:53 Starting - Launching requested ML instances...
2019-11-20 00:17:46 Starting - Preparing the instances for training.........
2019-11-20 00:19:22 Downloading - Downloading input data
2019-11-20 00:19:22 Training - Downloading the training image.....
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-11-20 00:19:58,622 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2019-11-20 00:19:58,647 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2019-11-20 00:19:58,650 sagemaker_pytorch_container.training INFO     Invoking user training script.
2019-11-20 00:19:58,896 sagemaker-containers INFO     Module train_nlp does not provide a setup.py.
Generating setup.py
2019-11-20 00:19:58,896 sagemaker-containers INFO

    Epoch: 0 Epoch Time:  0 m  31 s
    Train Loss:  0.08138268396802871
    Val. Loss:   0.05049808020333922
    Epoch: 1 Epoch Time:  0 m  29 s
    Train Loss:  0.05385401775917693
    Val. Loss:   0.04723942037376146
    Epoch: 2 Epoch Time:  0 m  29 s
    Train Loss:  0.048769688584016715
    Val. Loss:   0.046314925664947146
    Epoch: 3 Epoch Time:  0 m  29 s
    Train Loss:  0.045355694645720264
    Val. Loss:   0.04657920889024224
2019-11-20 00:22:21,836 sagemaker-containers INFO     Reporting training SUCCESS

2019-11-20 00:22:25 Uploading - Uploading generated training model

2019-11-20 00:22:30 Completed - Training job completed
Training seconds: 212
Billable seconds: 212
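
The loss lines in this log can be compared programmatically. The sketch below copies the twelve loss lines from the log into a string and picks the epoch with the lowest validation loss. The regular expression and the `parse_losses` helper are illustrative assumptions and are not part of the training script.

```python
import re

# Loss lines copied from the training log above (colour codes removed)
LOG = """
Epoch: 0 Epoch Time:  0 m  31 s
Train Loss:  0.08138268396802871
Val. Loss:   0.05049808020333922
Epoch: 1 Epoch Time:  0 m  29 s
Train Loss:  0.05385401775917693
Val. Loss:   0.04723942037376146
Epoch: 2 Epoch Time:  0 m  29 s
Train Loss:  0.048769688584016715
Val. Loss:   0.046314925664947146
Epoch: 3 Epoch Time:  0 m  29 s
Train Loss:  0.045355694645720264
Val. Loss:   0.04657920889024224
"""

PATTERN = re.compile(
    r"Epoch:\s*(\d+).*?Train Loss:\s*([\d.]+).*?Val\. Loss:\s*([\d.]+)",
    re.DOTALL,
)


def parse_losses(log_text):
    """Return a list of (epoch, train_loss, val_loss) tuples found in the log."""
    return [(int(e), float(t), float(v)) for e, t, v in PATTERN.findall(log_text)]


history = parse_losses(LOG)
best_epoch, _, best_val = min(history, key=lambda row: row[2])
print("lowest validation loss at epoch", best_epoch, "->", best_val)
```

The training loss keeps falling through epoch 3 while the validation loss rises slightly after epoch 2. With only four epochs this is weak evidence, but it may suggest that training longer would not help much.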


In [9]:
estimator

<sagemaker.pytorch.estimator.PyTorch at 0x7f37588482e8>

## Step 5: Testing the model

As mentioned at the top of this notebook, we will be testing this model by first deploying it and then sending the testing data to the deployed endpoint. We will do this so that we can make sure that the deployed model is working correctly.

## Deploy the model for testing

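When the built-in inference code runs, it must import the `model_fn()` function from the training script (`train_nlp.py`, the entry point given to the estimator). This is why the training code is wrapped in a main guard (i.e., `if __name__ == '__main__':`): importing the file then makes `model_fn()` available without starting a training run.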
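
To see the effect of the main guard in isolation, the sketch below executes one small script twice: once under a module name, as an import would, and once as `__main__`, as a direct run would. The script text is a stand-in written for this illustration and is not the real training script.

```python
import types

# A stand-in for a training script; the real training code is not shown here.
SCRIPT = '''
events = []

def model_fn(model_dir):
    # The real function would rebuild the network and load its weights.
    return {"loaded_from": model_dir}

if __name__ == '__main__':
    events.append("training started")
'''


def load_as(module_name):
    """Execute SCRIPT as if it were a module with the given __name__."""
    module = types.ModuleType(module_name)
    exec(SCRIPT, module.__dict__)
    return module


imported = load_as("train_nlp")   # what an import by the serving code looks like
executed = load_as("__main__")    # what running the script directly looks like

print(imported.events)   # []
print(executed.events)   # ['training started']
print(imported.model_fn("/opt/ml/model"))
```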



In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
predictor

## Step 6 - Try on validation set

In [80]:
from train.utils import Data_iterator
iterator_val = Data_iterator('val',data_dir= data_dir)

In [81]:
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
from sklearn.metrics import roc_auc_score

criterion = nn.BCEWithLogitsLoss()



def evaluate(iterator, criterion):
    """Score every batch through the deployed endpoint.

    Returns the mean loss per batch, the stacked true labels and the
    stacked sigmoid probabilities.
    """
    epoch_loss = 0
    preds_list = []
    labels_list = []

    with torch.no_grad():
        iterations = 0
        for batch in iterator:
            iterations += 1

            batch_X, batch_y = batch

            # The endpoint returns raw scores (logits) for the batch
            predictions = torch.tensor(predictor.predict(batch_X)).squeeze(1)

            loss = criterion(predictions, batch_y)
            epoch_loss += loss.item()

            # Convert logits to probabilities for the metrics computed later
            preds_list.append(torch.sigmoid(predictions).numpy())
            labels_list.append(batch_y.numpy())

    return epoch_loss / iterations, np.vstack(labels_list), np.vstack(preds_list)

In [82]:
_loss, _true_labels, _predicted_labels = evaluate(iterator_val, criterion)

In [83]:
_true_labels.shape

(15957, 6)

In [84]:
roc_auc_score(_true_labels, _predicted_labels)

0.9811254843677686

In [96]:
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
import pandas as pd

toxic_labels = ['toxic','severe_toxic',
               'obscene','threat','insult',
               'identity_hate']

In [111]:

roc_auc_scores= []
recall_scores=[]
precision_scores=[]
accuracy_scores=[]
f1_scores=[]

thre = 0.5
for i,j in enumerate(toxic_labels):
    roc_auc_scores.append(roc_auc_score(_true_labels[:,i], _predicted_labels[:,i]))
    recall_scores.append(recall_score(_true_labels[:,i], _predicted_labels[:,i]>=thre))
    accuracy_scores.append(accuracy_score(_true_labels[:,i], _predicted_labels[:,i]>=thre))
    precision_scores.append(precision_score(_true_labels[:,i], _predicted_labels[:,i]>=thre))
    f1_scores.append(f1_score(_true_labels[:,i], _predicted_labels[:,i]>=thre))
    
    

In [112]:
pd.DataFrame(
{'Label': toxic_labels,
 'accuracy': accuracy_scores,
 'recall': recall_scores,
 'precision': precision_scores,
 'f1': f1_scores,
 'roc_auc': roc_auc_scores})

|   | Label | accuracy | recall | precision | f1 | roc_auc |
|---|---|---|---|---|---|---|
| 0 | toxic | 0.962086 | 0.685771 | 0.890505 | 0.774842 | 0.971232 |
| 1 | severe_toxic | 0.990788 | 0.175182 | 0.413793 | 0.246154 | 0.987748 |
| 2 | obscene | 0.982014 | 0.765432 | 0.864714 | 0.81205 | 0.983912 |
| 3 | threat | 0.997431 | 0.097561 | 0.5 | 0.163265 | 0.982795 |
| 4 | insult | 0.975246 | 0.598465 | 0.852459 | 0.703231 | 0.982194 |
| 5 | identity_hate | 0.993357 | 0.234375 | 0.789474 | 0.361446 | 0.978871 |
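
The per-label numbers can be cross-checked with a little arithmetic. The sketch below copies precision, recall, F1 and ROC AUC from the results above, recomputes F1 as the harmonic mean of precision and recall, and averages the six per-label AUC values. It assumes that the earlier multilabel `roc_auc_score` call used its default macro averaging, in which case the mean should land close to the 0.9811 reported there.

```python
# (precision, recall, f1, roc_auc) per label, copied from the results above
ROWS = {
    "toxic":         (0.890505, 0.685771, 0.774842, 0.971232),
    "severe_toxic":  (0.413793, 0.175182, 0.246154, 0.987748),
    "obscene":       (0.864714, 0.765432, 0.812050, 0.983912),
    "threat":        (0.500000, 0.097561, 0.163265, 0.982795),
    "insult":        (0.852459, 0.598465, 0.703231, 0.982194),
    "identity_hate": (0.789474, 0.234375, 0.361446, 0.978871),
}


def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Each reported F1 should match the value recomputed from its row
for label, (precision, recall, f1, _) in ROWS.items():
    assert abs(f1_from(precision, recall) - f1) < 1e-4, label

# Unweighted (macro) mean of the per-label AUC values
macro_auc = sum(row[3] for row in ROWS.values()) / len(ROWS)
print(round(macro_auc, 4))
```

Note how accuracy stays above 0.96 for every label even where recall is below 0.25: the rare labels are mostly negative, so accuracy alone says little about how many toxic comments are actually caught at the 0.5 threshold.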


In [29]:
_predicted_labels[:,0].shape

(15957,)

Toxic labels stats 

In [17]:
estimator.delete_endpoint()

## Step 7 - Try on test set

In [13]:
from train.utils import Test_iterator

In [14]:
iterator_test = Test_iterator('test',data_dir= data_dir)

In [16]:
import numpy as np
import torch 

myPreds=[]
with torch.no_grad():

    for batch_X in iterator_test:    
        predictions = predictor.predict(batch_X)#.squeeze(1)         
        myPreds+=[torch.sigmoid(torch.tensor(predictions)).detach().numpy()]
myPreds = np.vstack(myPreds)

In [17]:
len(myPreds)

153164

In [18]:
import pandas as pd

testDF = pd.read_csv("./data/test.csv")
for i, col in enumerate(["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]):
    testDF[col] = myPreds[:, i]

In [19]:
testDF.drop("comment_text", axis=1).to_csv("./for_kaggle/submission_aws.csv", index=False)

In [None]:
#from sklearn.metrics import accuracy_score
#accuracy_score(test_y, predictions)

### (TODO) More testing



In [6]:
test_review = 'The simplest pleasures in life are the best, and this film is one of them. Combining a rather basic storyline of love and adventure this movie transcends the usual weekend fair with wit and unmitigated charm.'


 - Removed any html tags and stemmed the input
 - Encoded the review as a sequence of integers using `word_dict`
 
In order to process the review we will need to repeat these two steps.
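
The encoding step can be sketched on its own. The vocabulary below is a hypothetical toy dictionary made up for this illustration (the real `word_dict` is loaded from `word_dict.pkl`); the only assumption carried over from the notebook is that unknown words map to the `'<unk>'` entry.

```python
# Hypothetical toy vocabulary; the real word_dict is loaded from word_dict.pkl
toy_word_dict = {"<unk>": 0, "you": 1, "are": 2, "great": 3}


def encode(words, word_dict):
    """Map each word to its integer id, falling back to the '<unk>' id."""
    unk_id = word_dict["<unk>"]
    return [word_dict.get(w, unk_id) for w in words]


print(encode(["you", "are", "fantastic"], toy_word_dict))  # [1, 2, 0]
```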



In [21]:
import pickle
word_dict_path = os.path.join(data_dir, 'word_dict.pkl')

with open(word_dict_path, "rb") as f:
    word_dict = pickle.load(f)

In [22]:
import re

rep_numbers = re.compile(r'\d+')  # Runs of digits
rep_special_chars = re.compile(r"[^\w']|_")  # Special characters, except apostrophes

def apostrophes(text):
    # Split into words, keeping contractions such as "n't" and "'s" as separate tokens
    return re.findall(r"\w+(?=n't)|n't|\w+(?=')|'\w+|\w+",
               text, re.IGNORECASE | re.DOTALL)

def text_to_words(review):
    text = rep_special_chars.sub(' ', review)  # Replace special characters (except apostrophes) with spaces
    text = rep_numbers.sub('n', text)  # Replace every number with the placeholder 'n'
    words = text.lower()
    words = apostrophes(words)[:120]  # Split the string into words, keeping at most 120
    return words

In [23]:
def predict_toxicity(word_dict, text):
   
    words = text_to_words(text)
    
    words=[word_dict[w] if w in word_dict else word_dict['<unk>'] for w in words]
    
    tensor = torch.LongTensor(words).unsqueeze(1)
    
    tensor = torch.Tensor(predictor.predict(tensor))
    
    prediction = torch.sigmoid(tensor)
    
    return prediction

In [25]:
predict_toxicity(word_dict, "retarded faggot black piece of shit quiet or I will kill you fucking stupid looser")

tensor([[1.0000, 0.7514, 0.9990, 0.7632, 0.9973, 0.7706]])

### Delete the endpoint

Of course, just like in the XGBoost notebook, once we've deployed an endpoint it continues to run until we tell it to shut down. Since we are done using our endpoint for now, we can delete it.

In [26]:
estimator.delete_endpoint()

## Step 6 (again) - Deploy the model for the web app

Now that we know that our model is working, it's time to create some custom inference code so that we can send the model a comment which has not been processed and have it determine which toxicity labels apply to it.

As we saw above, by default the estimator which we created, when deployed, will use the entry script and directory which we provided when creating the model. However, since we now wish to accept a string as input and our model expects a processed comment, we need to write some custom inference code.

We will store the code that we write in the `serve` directory. Provided in this directory are the `model_nlp.py` file that we used to construct our model, a `utils_nlp.py` file which contains the `tokenize` pre-processing function, and `predict_nlp.py`, the file which will contain our custom inference code. Note also that `requirements.txt` is present, which tells SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, you are expected to provide four functions which the SageMaker inference container will use.
 - `model_fn`: This function is the same function that we used in the training script and it tells SageMaker how to load our model.
 - `input_fn`: This function receives the raw serialized input that has been sent to the model's endpoint and its job is to de-serialize and make the input available for the inference code.
 - `output_fn`: This function takes the output of the inference code and its job is to serialize this output and return it to the caller of the model's endpoint.
 - `predict_fn`: The heart of the inference script, this is where the actual prediction is done and is the function which you will need to complete.

For the simple website that we are constructing during this project, the `input_fn` and `output_fn` methods are relatively straightforward. We only require being able to accept a string as input and we expect to return a single value as output. You might imagine though that in a more complex application the input or output may be image data or some other binary data which would require some effort to serialize.
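
To make the division of labour between the four functions concrete, here is a minimal, self-contained sketch. It is not the contents of the inference script in the `serve` directory: the model is a dummy callable with made-up scores, and the 0.5 threshold and JSON output are assumptions chosen for illustration. Only the function names, their roles, and the label names come from this notebook.

```python
import json

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def model_fn(model_dir):
    """Load the model. Stand-in: a dummy callable instead of a real network."""
    def dummy_model(text):
        # Made-up scores for illustration; a real model would run inference here.
        return [0.9 if "!" in text else 0.1] * len(LABELS)
    return dummy_model


def input_fn(serialized_input_data, content_type):
    """De-serialize the request body into a Python string."""
    if content_type != "text/plain":
        raise ValueError("Unsupported content type: " + content_type)
    if isinstance(serialized_input_data, (bytes, bytearray)):
        return serialized_input_data.decode("utf-8")
    return serialized_input_data


def predict_fn(input_data, model):
    """Run the model and keep the labels whose score reaches 0.5."""
    scores = model(input_data)
    return [label for label, score in zip(LABELS, scores) if score >= 0.5]


def output_fn(prediction_output, accept):
    """Serialize the prediction for the caller."""
    return json.dumps(prediction_output)


# Chain the four functions in the order a serving container would call them.
model = model_fn("/opt/ml/model")
request = input_fn(b"hello!", "text/plain")
print(output_fn(predict_fn(request, model), "application/json"))
```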

### Writing inference code

Before writing our custom inference code, we will begin by taking a look at the code which has been provided.

In [27]:
!pygmentize serve/predict_nlp.py

import argparse
import json
import os
import pickle
import sys
import sagemaker_containers
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data



from utils_nlp import tokenize

from model_nlp import CNN

[...]

As mentioned earlier, the `model_fn` method is the same as the one provided in the training code and the `input_fn` and `output_fn` methods are very simple, so your task will be to complete the `predict_fn` method. Make sure that you save the completed file as `predict_nlp.py` in the `serve` directory.

**TODO**: Complete the `predict_fn()` method in the `serve/predict_nlp.py` file.

### Deploying the model

Now that the custom inference code has been written, we will create and deploy our model. To begin with, we need to construct a new PyTorchModel object which points to the model artifacts created during training and also points to the inference code that we wish to use. Then we can call the deploy method to launch the deployment container.

**NOTE**: The default behaviour for a deployed PyTorch model is to assume that any input passed to the predictor is a `numpy` array. In our case we want to send a string, so we need to construct a simple wrapper around the `RealTimePredictor` class to accommodate simple strings. In a more complicated situation you may want to provide a serialization object, for example if you wanted to send image data.

In [11]:
from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(model_data=estimator.model_data,
                     role = role,
                     framework_version='0.4.0',
                     entry_point='predict_nlp.py',
                     source_dir='serve',
                     predictor_cls=StringPredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------------------------------------------------------------------------------!

In [12]:
prueba = predictor.predict("retarded faggot black piece of shit quiet or I will kill you fucking stupid looser")

In [13]:
print(prueba.decode('utf-8'))



Your text has been classified as:
toxic
severe_toxic
obscene
threat
insult
identity_hate



### TODO: Cross-check on validation set.

## Step 7 (again): Use the model for the web app

> **TODO:** This entire section and the next contain tasks for you to complete, mostly using the AWS console.

So far we have been accessing our model endpoint by constructing a predictor object which uses the endpoint and then just using the predictor object to perform inference. What if we wanted to create a web app which accessed our model? The way things are set up currently makes that not possible since in order to access a SageMaker endpoint the app would first have to authenticate with AWS using an IAM role which included access to SageMaker endpoints. However, there is an easier way! We just need to use some additional AWS services.

<img src="Web App Diagram.svg">

The diagram above gives an overview of how the various services will work together. On the far right is the model which we trained above and which is deployed using SageMaker. On the far left is our web app that collects a user's comment, sends it off and expects the predicted toxicity labels in return.

In the middle is where some of the magic happens. We will construct a Lambda function, which you can think of as a straightforward Python function that can be executed whenever a specified event occurs. We will give this function permission to send and receive data from a SageMaker endpoint.

Lastly, the method we will use to execute the Lambda function is a new endpoint that we will create using API Gateway. This endpoint will be a url that listens for data to be sent to it. Once it gets some data it will pass that data on to the Lambda function and then return whatever the Lambda function returns. Essentially it will act as an interface that lets our web app communicate with the Lambda function.

### Setting up a Lambda function

The first thing we are going to do is set up a Lambda function. This Lambda function will be executed whenever our public API has data sent to it. When it is executed it will receive the data, perform any sort of processing that is required, send the data (the comment) to the SageMaker endpoint we've created and then return the result.

#### Part A: Create an IAM Role for the Lambda function

Since we want the Lambda function to call a SageMaker endpoint, we need to make sure that it has permission to do so. To do this, we will construct a role that we can later give the Lambda function.

Using the AWS Console, navigate to the **IAM** page and click on **Roles**. Then, click on **Create role**. Make sure that the **AWS service** is the type of trusted entity selected and choose **Lambda** as the service that will use this role, then click **Next: Permissions**.

In the search box type `sagemaker` and select the check box next to the **AmazonSageMakerFullAccess** policy. Then, click on **Next: Review**.

Lastly, give this role a name. Make sure you use a name that you will remember later on, for example `LambdaSageMakerRole`. Then, click on **Create role**.

#### Part B: Create a Lambda function

Now it is time to actually create the Lambda function.

Using the AWS Console, navigate to the AWS Lambda page and click on **Create a function**. When you get to the next page, make sure that **Author from scratch** is selected. Now, name your Lambda function, using a name that you will remember later on, for example `sentiment_analysis_func`. Make sure that the **Python 3.6** runtime is selected and then choose the role that you created in the previous part. Then, click on **Create Function**.

On the next page you will see some information about the Lambda function you've just created. If you scroll down you should see an editor in which you can write the code that will be executed when your Lambda function is triggered. In our example, we will use the code below. 

```python
# We need to use the low-level library to interact with SageMaker since the SageMaker API
# is not available natively through Lambda.
import boto3

def lambda_handler(event, context):

    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName = '**ENDPOINT NAME HERE**',    # The name of the endpoint we created
                                       ContentType = 'text/plain',                 # The data format that is expected
                                       Body = event['body'])                       # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')

    return {
        'statusCode' : 200,
        'headers' : { 'Content-Type' : 'text/plain', 'Access-Control-Allow-Origin' : '*' },
        'body' : result
    }
```

Once you have copied and pasted the code above into the Lambda code editor, replace the `**ENDPOINT NAME HERE**` portion with the name of the endpoint that we deployed earlier. You can determine the name of the endpoint using the code cell below.

In [14]:
predictor.endpoint

'sagemaker-pytorch-2019-11-20-00-35-22-821'

Once you have added the endpoint name to the Lambda function, click on **Save**. Your Lambda function is now up and running. Next we need to create a way for our web app to execute the Lambda function.

### Setting up API Gateway

Now that our Lambda function is set up, it is time to create a new API using API Gateway that will trigger the Lambda function we have just created.

Using the AWS Console, navigate to **Amazon API Gateway** and then click on **Get started**.

On the next page, make sure that **New API** is selected and give the new API a name, for example, `sentiment_analysis_api`. Then, click on **Create API**.

Now we have created an API, however it doesn't currently do anything. What we want it to do is to trigger the Lambda function that we created earlier.

Select the **Actions** dropdown menu and click **Create Method**. A new blank method will be created; select its dropdown menu and select **POST**, then click on the check mark beside it.

For the integration point, make sure that **Lambda Function** is selected and check the **Use Lambda Proxy integration** option. This option makes sure that the data that is sent to the API is then sent directly to the Lambda function with no processing. It also means that the return value must be a proper response object as it will also not be processed by API Gateway.
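
Because proxy integration passes the request through untouched and expects a complete response object back, it can help to exercise the handler logic offline first. The sketch below wraps the same logic as the Lambda function shown earlier around an injected client, so a fake can stand in for the SageMaker runtime. `FakeRuntime`, `make_handler` and the endpoint name are hypothetical names introduced only for this illustration.

```python
import io


class FakeRuntime:
    """Hypothetical stand-in for the boto3 'sagemaker-runtime' client."""

    def invoke_endpoint(self, EndpointName, ContentType, Body):
        # Echo the request back so the handler can be exercised offline.
        return {"Body": io.BytesIO(("echo: " + Body).encode("utf-8"))}


def make_handler(runtime, endpoint_name):
    """Build a Lambda-style handler around the given runtime client."""
    def lambda_handler(event, context):
        response = runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/plain",
            Body=event["body"],
        )
        result = response["Body"].read().decode("utf-8")
        # With proxy integration, this dict is passed back to the caller as-is.
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "text/plain", "Access-Control-Allow-Origin": "*"},
            "body": result,
        }
    return lambda_handler


handler = make_handler(FakeRuntime(), "my-endpoint")  # placeholder endpoint name
reply = handler({"body": "some comment"}, None)
print(reply["statusCode"], reply["body"])
```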

Type the name of the Lambda function you created earlier into the **Lambda Function** text entry box and then click on **Save**. Click on **OK** in the pop-up box that then appears, giving permission to API Gateway to invoke the Lambda function you created.

The last step in creating the API Gateway is to select the **Actions** dropdown and click on **Deploy API**. You will need to create a new Deployment stage and name it anything you like, for example `prod`.

You have now successfully set up a public API to access your SageMaker model. Make sure to copy or write down the URL provided to invoke your newly created public API as this will be needed in the next step. This URL can be found at the top of the page, highlighted in blue next to the text **Invoke URL**.
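
Before wiring up the web page, the public API can also be called from Python. The sketch below only builds the POST request, with the raw comment as a `text/plain` body to match what the Lambda function forwards to the endpoint. The URL is a placeholder, and the network call is left commented out because it needs a deployed endpoint.

```python
import urllib.request

# Placeholder; replace with the Invoke URL shown in the API Gateway console.
API_URL = "https://example.invalid/prod"


def build_request(text, url=API_URL):
    """Build (but do not send) a POST request carrying the raw comment text."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_request("some comment to classify")
print(request.get_method(), request.full_url)

# To actually call the API (requires a deployed endpoint and network access):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```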

## Step 4: Deploying our web app

Now that we have a publicly available API, we can start using it in a web app. For our purposes, we have provided a simple static HTML file which can make use of the public API you created earlier.

In the `website` folder there should be a file called `index.html`. Download the file to your computer and open it in a text editor of your choice. There should be a line which contains **\*\*REPLACE WITH PUBLIC API URL\*\***. Replace this string with the URL that you wrote down in the last step and then save the file.

Now, if you open `index.html` on your local computer, your browser will load the page directly from disk and you can use the provided site to interact with your SageMaker model.

If you'd like to go further, you can host this HTML file anywhere you'd like, for example using GitHub or hosting a static site on Amazon S3. Once you have done this you can share the link with anyone you'd like and have them play with it too!

> **Important Note** In order for the web app to communicate with the SageMaker endpoint, the endpoint has to actually be deployed and running. This means that you are paying for it. Make sure that the endpoint is running when you want to use the web app but that you shut it down when you don't need it, otherwise you will end up with a surprisingly large AWS bill.

**TODO:** Make sure that you include the edited `index.html` file in your project submission.

### Delete the endpoint

Remember to always shut down your endpoint if you are no longer using it. You are charged for the length of time that the endpoint is running so if you forget and leave it on you could end up with an unexpectedly large bill.

In [108]:
predictor.delete_endpoint()