# Lab 3 - Train with a Built-in algorithm on Amazon SageMaker

This notebook demonstrates the use of SageMaker BlazingText for supervised text classification: binary or multi-class, with single or multiple labels. BlazingText can train a model on more than a billion words in a couple of minutes using a multi-core CPU or a GPU, while achieving performance on par with state-of-the-art deep learning text classification algorithms. BlazingText extends the fastText text classifier to leverage GPU acceleration using custom CUDA kernels.

Let’s start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the Notebook Instance, training, and hosting. If you don’t specify a bucket, the SageMaker SDK creates a default bucket in the same region, following a pre-defined naming convention.

- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the get_execution_role method from the SageMaker Python SDK.

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import pandas as pd

sess = sagemaker.Session()

role = get_execution_role()
# This is the role that SageMaker uses to access AWS resources (S3, CloudWatch) on your behalf
print(role)

bucket = sess.default_bucket()  # Replace with your own bucket name if needed
print(bucket)
prefix = "blazingtext/supervised"  # Replace with the prefix under which you want to store the data if needed

In [None]:
%store -r lab3path
!aws s3 cp {lab3path} ./

## Split the prepared input dataset 

In [None]:
from numpy.random import RandomState
import pandas as pd

df = pd.read_json('./sagemaker_input.json', lines=True)
rng = RandomState()

# For a 70-30 split
train = df.sample(frac=0.7, random_state=rng)
test = df.loc[~df.index.isin(train.index)]
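
The split logic above (sample 70% of the rows, then keep every row whose index was not sampled as the test set) can be illustrated without pandas. The following is a minimal sketch rather than the notebook's own code: `split_indices` is a hypothetical helper, and the seed value is an assumption added so that the split is repeatable between runs.

```python
import random


def split_indices(n_rows, train_frac=0.7, seed=42):
    """Return (train, test) index lists that are disjoint and cover range(n_rows)."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split repeatable
    n_train = round(n_rows * train_frac)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    # Every index that was not sampled into the training set goes to the test set.
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

With pandas, the same repeatability can be obtained by passing an integer to `random_state` in `df.sample`.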

If the next cell emits a warning saying `Try using .loc[row_indexer,col_indexer] = value instead`, you can safely ignore it.

In [None]:
train.label = train.label.apply(lambda x: '__label__'+str(x))
test.label = test.label.apply(lambda x: '__label__'+str(x))

In [None]:
train.to_csv('train.csv',header=False,index=False, sep=' ')
test.to_csv('test.csv',header=False,index=False, sep=' ')
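
For supervised mode, BlazingText expects a plain-text file with one example per line: the label prefixed with `__label__`, followed by a space and the tokenized text. The sketch below builds such lines with plain string formatting; the helper name `to_blazingtext_line` and the sample records are hypothetical. Formatting the lines by hand sidesteps CSV quoting: a CSV writer with a space delimiter wraps any field that contains a space in double quotes, so it is worth inspecting the first lines of `train.csv` before training.

```python
def to_blazingtext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # Collapse newlines and repeated whitespace so each example stays on one line.
    cleaned = " ".join(str(text).split())
    return f"__label__{label} {cleaned}"


# Hypothetical records; the notebook reads the real ones from sagemaker_input.json.
records = [
    (1, "green bonds issued for a solar project"),
    (0, "general obligation\nbonds"),
]

lines = [to_blazingtext_line(label, text) for label, text in records]
print("\n".join(lines))
```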

### Upload data to S3

In [None]:
trainpath = sagemaker.session.Session().upload_data(path='train.csv',key_prefix='blazingtextdata')
testpath = sagemaker.session.Session().upload_data(path='test.csv',key_prefix='blazingtextdata')
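
`upload_data` returns the S3 URI of the uploaded object, and the training channels later consume that URI. As a rough sketch of how the URI is composed when a single file is uploaded, assuming the SDK's usual `s3://bucket/key_prefix/filename` layout (the helper `expected_s3_uri` and the bucket name are hypothetical; in the notebook the SDK builds the URI itself):

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the s3:// URI for a single file uploaded under key_prefix."""
    # Only the file name, not the local directory, ends up in the S3 key.
    filename = posixpath.basename(local_path.replace("\\", "/"))
    return f"s3://{bucket}/{key_prefix}/{filename}"


print(expected_s3_uri("my-example-bucket", "blazingtextdata", "train.csv"))
```

Printing `trainpath` and `testpath` in the notebook is a quick way to confirm where the files actually landed.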

## Initialize a BlazingText "Estimator"

Here, we will point to a "built-in" container in SageMaker. As you will see in Lab 4, another way to train models is using your own scripts in estimators.

In [None]:
# Use the session's region so the image matches the region where training runs
container = sagemaker.image_uris.retrieve("blazingtext", sess.boto_region_name)
container

Now, let’s define the SageMaker Estimator with resource configurations and hyperparameters to train a text classifier on the dataset prepared above, using “supervised” mode on an ml.c4.4xlarge instance.

Refer to BlazingText Hyperparameters in the Amazon SageMaker documentation for the complete list of hyperparameters.

In [None]:
bt_model = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type="ml.c4.4xlarge",
    volume_size=30,
    max_run=360000,
    hyperparameters={
        "mode": "supervised",
        "epochs": 100,
        "min_count": 2,
        "learning_rate": 0.1,
        "early_stopping": True,
        "patience": 10,
        "min_epochs": 25,
        "word_ngrams": 4,
        "vector_dim":500
    },
)
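
The `early_stopping`, `patience`, and `min_epochs` hyperparameters work together: validation accuracy is consulted only once `min_epochs` has been reached, and training stops when accuracy has failed to improve for `patience` consecutive epochs. The following is a simplified sketch of that logic under those assumptions, not BlazingText's actual implementation; `stopping_epoch` and the accuracy values are hypothetical.

```python
def stopping_epoch(val_accuracies, min_epochs, patience):
    """Return the 1-based epoch at which training stops.

    If the stopping criterion is never met, the last epoch is returned.
    """
    best = float("-inf")
    stale = 0  # consecutive evaluated epochs without improvement
    for epoch, acc in enumerate(val_accuracies, start=1):
        if epoch < min_epochs:
            continue  # early stopping is not evaluated before min_epochs
        if acc > best:
            best = acc
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_accuracies)


# Accuracy plateaus at 0.7 from epoch 3 onwards.
print(stopping_epoch([0.5, 0.6, 0.7, 0.7, 0.7, 0.7], min_epochs=2, patience=2))
```

With the values used in the Estimator above (`epochs=100`, `min_epochs=25`, `patience=10`), this simplified model would never stop before epoch 35.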

Now that the hyperparameters are set up, let us prepare the handshake between our data channels and the algorithm. To do this, we create sagemaker.inputs.TrainingInput objects from our data channels. These objects are then put in a simple dictionary, which the algorithm consumes.

In [None]:
train_data = sagemaker.inputs.TrainingInput(
    trainpath,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
validation_data = sagemaker.inputs.TrainingInput(
    testpath,
    distribution="FullyReplicated",
    content_type="text/plain",
    s3_data_type="S3Prefix",
)
data_channels = {"train": train_data, "validation": validation_data}

We now have an Estimator object with its hyperparameters set, and data channels linked to the algorithm. The only remaining step is to train, which the following command does.

Training involves a few steps. First, the instance that we requested when creating the Estimator is provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data download take some time, depending on the size of the data, so it might be a few minutes before training logs start to appear. Once the job has executed min_epochs, the logs also print the accuracy on the validation data for every epoch. This metric is a proxy for the quality of the model.

Once the job has finished, a “Job complete” message is printed. The trained model is stored in S3 under the Estimator’s output_path; because no output_path was passed above, the SDK falls back to the session’s default bucket.

In [None]:
bt_model.fit(inputs=data_channels, logs=True)

> Note: As you can see, this is not a very good model, but the aim of this lab is to introduce some key SageMaker concepts. For better results, try the Hugging Face Estimator on SageMaker, or hyperparameter tuning!

## Hosting / Inference

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model. Note that we don’t have to host on the same type of instance that we used to train. Because endpoints typically stay up and running for a long time, it’s advisable to choose a cheaper instance for inference.

> Note: the following deployment step can take up to 5 minutes!

In [None]:
from sagemaker.serializers import JSONSerializer

text_classifier = bt_model.deploy(
    initial_instance_count=1, instance_type="ml.m4.xlarge", serializer=JSONSerializer()
)

In [None]:
import nltk
nltk.download('punkt')
# You can pass in an array for predictions
sentences = [
    "Supplement dated January 28 2019 the Supplement to Official Statement dated January 8 2019 the Official Statement relating to AMERICAN MUNICIPAL POWER INC SOLAR ELECTRICITY PREPAYMENT PROJECT REVENUE BONDS 55195000 SERIES 2019A GREEN BONDS The  Official  Statement  delivered  in  connection  with  the  proposed  issuance  by  American  Municipal Power Inc of the abovereferenced bonds contained a typographical error in the table on page  B12 of Appendix B relating to Large Participant Wadsworth Ohio The power sales revenue for 2017  was  32891000  instead  of  33891000  as  reported  in  the  Official  Statement  which  when  corrected  results in corresponding decreases to Wadsworths Total Revenue and Net Revenue Available for Debt  Service The table as corrected appears on the following page"
]

# using the same nltk tokenizer that we used during data preparation for training
tokenized_sentences = [" ".join(nltk.word_tokenize(sent)) for sent in sentences]

By default, the model will return only one prediction, the one with the highest probability. For retrieving the top k predictions, you can set k in the configuration as shown in the comment below:

In [None]:
# payload = {"instances": tokenized_sentences, "configuration": {"k": 2}}

payload = {"instances": tokenized_sentences}

response = text_classifier.predict(payload)

predictions = json.loads(response)
print(json.dumps(predictions, indent=2))
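
For each input instance, the endpoint returns a JSON object with parallel `label` and `prob` lists. A minimal sketch for turning that into plain labels, assuming this response shape; the helper `parse_predictions` and the sample response are hypothetical, and the probabilities are made up for illustration:

```python
import json

LABEL_PREFIX = "__label__"


def parse_predictions(response_body):
    """Return, per instance, a list of (label, probability) pairs without the prefix."""
    predictions = json.loads(response_body)
    parsed = []
    for item in predictions:
        pairs = [
            (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label, prob)
            for label, prob in zip(item["label"], item["prob"])
        ]
        parsed.append(pairs)
    return parsed


# Made-up response for one instance requested with k=2.
sample = b'[{"label": ["__label__1", "__label__0"], "prob": [0.9, 0.1]}]'
print(parse_predictions(sample))
```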

### Finally, delete the endpoint

In [None]:
text_classifier.delete_endpoint()