# Sentiment Analysis with Apache MXNet and Gluon

This tutorial shows how to train and test a Sentiment Analysis (Text Classification) model on Amazon SageMaker using Apache MXNet and the Gluon API.

## Create needed directories

We will create a `data` directory for the ready-to-use dataset.

In [None]:
%%bash

mkdir -p data

## Download Standardized dataset

As a prerequisite, we should have run the [preprocessing notebook](../PreprocessNotebook.ipynb).

Then, we will create variables holding the paths to the needed directory and dataset files.

Finally, we will also download the dataset from S3.

In [None]:
data_directory = 'data'
test_dataset_file_name = '../data/preprocessed/sentiments-preprocessed-lemmatizing-test-10percent-dataset.csv'
train_dataset_file_name = '../data/preprocessed/sentiments-preprocessed-lemmatizing-train-10percent-dataset.csv'
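
Because these files are produced by the preprocessing notebook, it can help to fail early when they are missing. The following is a minimal sketch of such a check; `find_missing_files` is a hypothetical helper name introduced here for illustration, and the paths are copied from the cell above.

```python
from pathlib import Path

def find_missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [str(p) for p in paths if not Path(p).is_file()]

# Paths copied from the cell above.
expected_files = [
    '../data/preprocessed/sentiments-preprocessed-lemmatizing-test-10percent-dataset.csv',
    '../data/preprocessed/sentiments-preprocessed-lemmatizing-train-10percent-dataset.csv',
]

missing = find_missing_files(expected_files)
if missing:
    print('Run the preprocessing notebook first; missing files:', missing)
```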

## Convert dataset

We need a dataset in which each line consists of space-separated tokens: the first token is the label (1 for positive,
0 for negative), and the remaining tokens are the message.

In [None]:
import csv

def convert_into_space_separated(dataset_file_name, destination_file_name):
    """Write each CSV row as '<label> <message>' into the data directory."""
    with open(dataset_file_name, newline='') as csv_file, \
         open(f'{data_directory}/{destination_file_name}', mode='w') as output_file:
        for message in csv.reader(csv_file):
            # Column 0 holds the label, column 1 holds the message text.
            output_file.write(f'{message[0]} {message[1]}\n')

convert_into_space_separated(train_dataset_file_name, 'train')
convert_into_space_separated(test_dataset_file_name, 'test')
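
A quick sanity check on the converted files can catch problems before anything is uploaded. The sketch below counts how many lines start with each label; `count_labels` is a hypothetical helper, and it assumes the conversion cell above has already written `data/train` and `data/test`.

```python
import os
from collections import Counter

def count_labels(path):
    """Count how many non-blank lines start with each label token."""
    counts = Counter()
    with open(path, encoding='utf-8') as converted_file:
        for line in converted_file:
            if not line.strip():
                continue  # skip blank lines
            # Everything before the first space is the label.
            label, _, _message = line.partition(' ')
            counts[label.strip()] += 1
    return counts

# Only report on the splits that exist in the local data directory.
for split in ('train', 'test'):
    path = os.path.join('data', split)
    if os.path.exists(path):
        print(split, dict(count_labels(path)))
```

If a label other than `0` or `1` shows up in the counts, the source CSV probably contains a header row or a malformed line.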

## Upload the data

We use the `sagemaker.s3.S3Uploader` to upload our datasets to an Amazon S3 location. The return value `inputs`
identifies the location -- we use this later when we start the training job.

In [None]:
import sagemaker
from sagemaker import s3

sess = sagemaker.Session()
bucket = sess.default_bucket()
inputs = s3.S3Uploader.upload('data', 's3://{}/mxnet-gluon-sentiment-example/data'.format(bucket))

## Implement the training function

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, but you can also access useful properties of the training environment through various environment variables. In addition, hyperparameters are passed to the script as command-line arguments. For more about writing an MXNet training script for use with SageMaker, see [the SageMaker documentation](https://sagemaker.readthedocs.io/en/stable/using_mxnet.html#prepare-an-mxnet-training-script).
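
As a rough illustration of that pattern, the sketch below parses the hyperparameters used later in this notebook and reads two SageMaker environment variables, `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`. It is not the content of `mxnet_sentiment.py`: the defaults are copied from the estimator configuration in this notebook, the `--training-channel` argument name is made up for this example, and the local fallback folders are assumptions.

```python
import argparse
import os

def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided locations."""
    parser = argparse.ArgumentParser()

    # Hyperparameters: each one arrives as a `--name value` argument.
    parser.add_argument('--batch-size', type=int, default=8)
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--learning-rate', type=float, default=0.01)
    parser.add_argument('--embedding-size', type=int, default=50)
    parser.add_argument('--log-interval', type=int, default=1000)

    # Environment: fall back to local folders when running outside SageMaker.
    parser.add_argument('--model-dir',
                        default=os.environ.get('SM_MODEL_DIR', 'model'))
    parser.add_argument('--training-channel',  # hypothetical argument name
                        default=os.environ.get('SM_CHANNEL_TRAINING', 'data'))

    # Ignore any extra arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    return args

args = parse_args(['--epochs', '5'])
print(args.epochs, args.batch_size)  # prints "5 8"
```

Using `parse_known_args` keeps the sketch from failing if the platform passes additional arguments that are not declared here.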

The script here is a simplified implementation of ["Bag of Tricks for Efficient Text Classification"](https://arxiv.org/abs/1607.01759), as implemented by Facebook's [FastText](https://github.com/facebookresearch/fastText/) for text classification. The model maps each word to a vector and averages the vectors of all the words in a sentence to form a hidden representation of the sentence, which is fed to a softmax classification layer. For more details, please refer to [the paper](https://arxiv.org/abs/1607.01759).
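
To make the forward pass concrete, here is a small pure-Python sketch of "average the word vectors, then apply a softmax layer". It is only an illustration of the idea, not the Gluon model in `mxnet_sentiment.py`; the function names and the toy parameters are invented for this example and are not trained values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(tokens, embeddings, weights, biases):
    """Average the token vectors, apply one linear layer, then softmax.

    embeddings: dict mapping token -> vector (list of floats)
    weights:    one weight vector per class
    biases:     one bias per class
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if vectors:
        # Element-wise mean over all known token vectors.
        hidden = [sum(column) / len(vectors) for column in zip(*vectors)]
    else:
        # No known tokens: fall back to a zero vector.
        hidden = [0.0] * len(weights[0])
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)

# Toy, hand-picked parameters (not trained values), 2-dimensional embeddings.
toy_embeddings = {'good': [1.0, 0.0], 'bad': [0.0, 1.0], 'feel': [0.5, 0.5]}
toy_weights = [[0.0, 2.0],   # class 0 (negative)
               [2.0, 0.0]]   # class 1 (positive)
toy_biases = [0.0, 0.0]

probabilities = classify(['i', 'feel', 'good'], toy_embeddings, toy_weights, toy_biases)
print(probabilities)  # [P(negative), P(positive)]
```

In this toy setup, unknown tokens such as `i` are simply skipped; a real implementation would normally map them to a dedicated unknown-word index instead.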

At the end of every epoch, our script also checks the validation accuracy, and checkpoints the best model so far, along with the optimizer state, in the folder `/opt/ml/checkpoints`. (If the folder `/opt/ml/checkpoints` does not exist, this checkpointing step is skipped.)
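
The checkpointing rule described above can be sketched as follows. This is an assumption-laden outline rather than the script's actual code: `maybe_checkpoint` and `save_fn` are hypothetical names, and `save_fn` stands in for whatever writes the model parameters and optimizer state (with Gluon, for instance, `net.save_parameters(...)` and `trainer.save_states(...)`).

```python
import os

CHECKPOINT_DIR = '/opt/ml/checkpoints'

def maybe_checkpoint(validation_accuracy, best_accuracy, save_fn,
                     checkpoint_dir=CHECKPOINT_DIR):
    """Save on improvement if the folder exists; return the best accuracy so far.

    `save_fn(checkpoint_dir)` is a hypothetical callback that writes the
    model parameters and the optimizer state.
    """
    if validation_accuracy <= best_accuracy:
        return best_accuracy  # no improvement, nothing to save
    if os.path.isdir(checkpoint_dir):
        save_fn(checkpoint_dir)  # folder exists, so persist the new best model
    return validation_accuracy

# Demo with a list standing in for real saving; the current directory exists,
# so every improvement triggers a "save".
saved = []
best = 0.0
for accuracy in (0.70, 0.68, 0.75):
    best = maybe_checkpoint(accuracy, best, saved.append, checkpoint_dir=os.getcwd())
print(best, len(saved))  # prints "0.75 2"
```

One design choice in this sketch: the best accuracy is updated even when the folder is missing, so only the saving step is skipped.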

In [None]:
!pygmentize 'mxnet_sentiment.py'

## Run a SageMaker training job

The `MXNet` class allows us to run our training function on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we run our training job on a single `ml.m5.large` instance.

In [None]:
from sagemaker import get_execution_role
from sagemaker.mxnet import MXNet

m = MXNet('mxnet_sentiment.py',
          role=get_execution_role(),
          instance_count=1,
          instance_type='ml.m5.large',
          framework_version='1.8.0',
          py_version='py37',
          distribution={'parameter_server': {'enabled': True}},
          hyperparameters={'batch-size': 8,
                           'epochs': 2,
                           'learning-rate': 0.01,
                           'embedding-size': 50, 
                           'log-interval': 1000},
          metric_definitions=[
                   {'Name': 'training:accuracy', 'Regex': r'\[Epoch ([0-9]*)\] Training: accuracy=([0-9]\.[0-9]*)'},
                   {'Name': 'validation:accuracy', 'Regex': r'\[Epoch ([0-9]*)\] Validation: accuracy=([0-9]\.[0-9]*)'},
                   {'Name': 'validation:f1', 'Regex': r'\[Epoch ([0-9]*)\] Validation: f1=([0-9]\.[0-9]*)'},
                   {'Name': 'validation:precision', 'Regex': r'\[Epoch ([0-9]*)\] Validation: Precision=([0-9]\.[0-9]*)'},
                   {'Name': 'validation:recall', 'Regex': r'\[Epoch ([0-9]*)\] Validation: Recall=([0-9]\.[0-9]*)'}
          ])
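
The metric regular expressions in the estimator above can be tried locally with Python's `re` module before launching a job. The sketch below mirrors the `validation:accuracy` pattern, written as a raw string with the decimal point escaped. The sample log line is hypothetical: it is shaped to match the pattern and is not real training output.

```python
import re

# Mirrors the 'validation:accuracy' pattern: group 1 is the epoch number,
# group 2 is the metric value.
pattern = re.compile(r'\[Epoch ([0-9]*)\] Validation: accuracy=([0-9]\.[0-9]*)')

# Hypothetical log line, shaped to match the pattern; not real output.
sample_line = '[Epoch 1] Validation: accuracy=0.8125'

match = pattern.search(sample_line)
if match:
    epoch, accuracy = int(match.group(1)), float(match.group(2))
    print(epoch, accuracy)  # prints "1 0.8125"
```

If the training script logs its metrics in a different format, the patterns need to be adjusted to match.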

After we've constructed our `MXNet` estimator, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.

In [None]:
m.fit(inputs)
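
When `fit` receives a single S3 location, the data is exposed to the script as the `training` channel, whose local path is available in the `SM_CHANNEL_TRAINING` environment variable. The following sketch shows one way a script could read the `train` and `test` files from that folder. `load_split` is a hypothetical helper, and the fallback to the local `data` folder is an assumption so that the snippet also runs outside SageMaker.

```python
import os

def load_split(data_dir, split):
    """Read `<data_dir>/<split>` and return (labels, token_lists)."""
    labels, token_lists = [], []
    with open(os.path.join(data_dir, split), encoding='utf-8') as split_file:
        for line in split_file:
            tokens = line.split()
            if not tokens:
                continue  # skip blank lines
            labels.append(tokens[0])        # label token kept as text ('1' or '0')
            token_lists.append(tokens[1:])  # remaining tokens are the message
    return labels, token_lists

# SM_CHANNEL_TRAINING is set inside the training container; fall back to the
# local `data` folder elsewhere.
data_dir = os.environ.get('SM_CHANNEL_TRAINING', 'data')
if os.path.isfile(os.path.join(data_dir, 'train')):
    train_labels, train_tokens = load_split(data_dir, 'train')
    print(len(train_labels), 'training examples')
```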

Checking the logs, we can see the accuracy our model reaches on the test set with the above hyperparameters.

After training, we use our `MXNet` object to build and deploy an `MXNetPredictor` object. This creates a SageMaker Endpoint that we can use to perform inference. 

In [None]:
predictor = m.deploy(initial_instance_count=1, instance_type='ml.t2.medium')

With our predictor, we can perform inference on a JSON-encoded string array. 

The predictor runs inference on our input data and returns an array containing, for each inference, the original message, the predicted sentiment (1 for positive and 0 for negative), and the prediction probability.

In [None]:
data = ["i feel good",
        "i feel bad",
        "More bike riding, but this time with a body guard and guide",
        "Turtles are better than my Mac - lasts longer and moves faster",
        "just got back from another camp get together! yay school got moved to next week! OMG ROADTRIP. hahaha more plans for me then!",
        "Work is terrible  I need a break.",
        "I didn't win but I didn't lose though. Drinking time!",
        "is really wobbly again after a really good morning. I hate this",
        "Kinda wish I never went back to bed this morning.. hmm... still feel good to have no real commitments today",
        "Leaving home to come back to Ontario and not too happy about it!  I miss my family",
        "I'm busily hunting for a summer job and a place to move into in August ",
        "I'm wondering if my girlfriend does crap on purpose to make me feel insignificant. I really hate that."
       ]

response = predictor.predict(data)
response
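
To make the response easier to read, each entry can be turned into one formatted line. This sketch assumes every entry is a three-item sequence in the order described above (message, sentiment, probability); `format_predictions` is a hypothetical helper, and the example entries are placeholders with made-up values, not real model output.

```python
def format_predictions(response):
    """Turn [message, sentiment, probability] entries into readable lines.

    Assumes each entry is a three-item sequence in that order.
    """
    names = {0: 'negative', 1: 'positive'}
    lines = []
    for message, sentiment, probability in response:
        label = names.get(int(sentiment), 'unknown')
        lines.append(f'{label:8s} {float(probability):.2f}  {message}')
    return lines

# Placeholder entries with made-up values, only to show the assumed shape.
example_response = [
    ['i feel good', 1, 0.5],
    ['i feel bad', 0, 0.5],
]
for line in format_predictions(example_response):
    print(line)
```

In the notebook, the same function could be called as `format_predictions(response)` once the endpoint has returned.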

## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()