## Introduction

Text classification can be used to solve use cases such as sentiment analysis, spam detection, and hashtag prediction. This notebook demonstrates how to use Amazon Comprehend custom classification to classify text.

## Setup

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. The bucket should be in the same region as the notebook instance and the Amazon Comprehend jobs. If you don't specify a bucket, the SageMaker SDK will create a default bucket following a pre-defined naming convention in the same region.
- The IAM role ARN used to give SageMaker access to your data. It can be fetched using the **get_execution_role** method from the SageMaker Python SDK.
- Grant the CreatePolicy, CreateRole, AttachRolePolicy, and PassRole permissions to the SageMaker execution role that was created along with the notebook instance:
    1. Create a new policy with the CreatePolicy, CreateRole, AttachRolePolicy, and PassRole permissions.
    2. Go to Services - IAM - Roles and find the newly created SageMaker execution role by searching for the role name or by creation date.
    3. Attach the policy created in step 1 to the SageMaker execution role.
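
As a rough illustration of step 1, the policy document could look like the sketch below. The `iam:` action names correspond to the four permissions listed above; the wildcard `Resource` and the variable name `execution_role_policy` are assumptions made for brevity, and you would normally scope the resource down to the roles and policies this notebook creates.

```python
import json

# Hypothetical policy for the SageMaker execution role (step 1 above).
# "Resource": "*" is an assumption for brevity; narrow it for real use.
execution_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:CreateRole",
                "iam:AttachRolePolicy",
                "iam:PassRole",
            ],
            "Resource": "*",
        }
    ],
}

# JSON text that can be pasted into the IAM console's policy editor
print(json.dumps(execution_role_policy, indent=4))
```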
       

***Note: This role should have AmazonComprehendFullAccess, so it can create and run custom classification jobs***

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3
import time
import pytz
import linecache
import random
from datetime import datetime

sess = sagemaker.Session()
role = get_execution_role()

# This is the role that SageMaker uses to access AWS resources (S3, CloudWatch, Comprehend) on your behalf.
# Note: This role should have AmazonComprehendFullAccess so that it can create and run custom classification jobs.
print(role)
bucket = '<<INSERT BUCKET HERE>>'  # Customize to your bucket; for this workshop the Comprehend policy grants access to buckets with "comprehend" in the name
prefix = 'dbpedia/'  # Replace with the prefix under which you want to store the data, if needed
region = 'us-east-1'
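
Because the workshop's Comprehend policy only covers buckets with `comprehend` in the name, it can help to check the bucket name before going further. The sketch below is a simplified local check using `fnmatch`, mirroring the wildcard patterns in the Comprehend policy created later in this notebook. The helper name `bucket_matches_policy` and the example bucket names are hypothetical, and real IAM evaluation matches the full resource ARN rather than just the bucket name.

```python
from fnmatch import fnmatchcase

# Name patterns mirroring the workshop policy: "*Comprehend*" and "*comprehend*"
POLICY_PATTERNS = ["*Comprehend*", "*comprehend*"]

def bucket_matches_policy(bucket_name):
    """Return True if the bucket name matches one of the case-sensitive patterns."""
    return any(fnmatchcase(bucket_name, pattern) for pattern in POLICY_PATTERNS)

print(bucket_matches_policy("my-comprehend-workshop"))  # True
print(bucket_matches_policy("my-workshop-data"))        # False
```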

## Data Preparation

Now we'll download a dataset from the web on which we want to train the text classification model. For multi-class training, Amazon Comprehend custom classification expects a CSV file in which each line holds one document, with the class label in the first column and the document text in the second.

In this example, let us train the text classification model on the [DBPedia Ontology Dataset](https://wiki.dbpedia.org/services-resources/dbpedia-data-set-2014#2) as done by [Zhang et al](https://arxiv.org/pdf/1509.01626.pdf). The DBpedia ontology dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. It has 560,000 training samples and 70,000 testing samples. The fields used from this dataset are the title and abstract of each Wikipedia article.

In [None]:
!wget https://github.com/saurabh3949/Text-Classification-Datasets/raw/master/dbpedia_csv.tar.gz

In [None]:
!tar -xzvf dbpedia_csv.tar.gz

Let us inspect the dataset and the classes to understand how the data and the labels are provided.

In [None]:
!head dbpedia_csv/train.csv -n 3

As can be seen from the above output, the CSV has 3 fields: label index, title, and abstract. Let us first create a mapping from label index to label name and then preprocess the dataset for ingestion by Amazon Comprehend.
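
Titles and abstracts in this kind of CSV can themselves contain commas, in which case the field is wrapped in double quotes. A minimal sketch of reading such a row with Python's `csv` module follows; the sample row is made up for illustration and is not taken from the dataset.

```python
import csv
import io

# Made-up row in the same three-field layout: label index, title, abstract
sample = '1,"Acme, Inc."," Acme, Inc. is a fictional company used in examples."\n'

for label_index, title, abstract in csv.reader(io.StringIO(sample)):
    print(label_index)       # 1
    print(title)             # Acme, Inc.
    print(abstract.strip())  # Acme, Inc. is a fictional company used in examples.

# A naive split on commas would cut the quoted title in two
print(sample.split(',')[1])  # "Acme
```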

Next we will print the labels file (`classes.txt`) to see all possible labels followed by creating an index to label mapping.

In [None]:
!cat dbpedia_csv/classes.txt

The following code creates the mapping from integer indices to class label which will later be used to retrieve the actual class name during inference. 

In [None]:
index_to_label = {} 
with open("dbpedia_csv/classes.txt") as f:
    for i,label in enumerate(f.readlines()):
        index_to_label[str(i+1)] = label.strip()
print(index_to_label)

## Data Preprocessing
We need to preprocess the training data into the **two-column CSV** format (`label,document`) that Amazon Comprehend can consume. As mentioned previously, the label index in each row is replaced by the class name from classes.txt.

In [None]:
def transform_instance(row):
    # Map the label index (a string such as '1') to its class name
    return index_to_label[row]

The `transform_instance` function is applied to the label index of each row as the input files are read sequentially in the cells below.

In [None]:
import csv

def preprocess(input_file, output_file, testfile=1):
    """Write 'label,document' rows for training. Pass testfile=0 to keep only the first 199 rows."""
    with open(input_file, 'r', newline='') as csvinfile, \
         open(output_file, 'w', newline='') as csvoutfile:
        # csv.reader copes with quoted titles and abstracts that contain commas,
        # which a plain split(',') would break apart
        csv_writer = csv.writer(csvoutfile, lineterminator='\n')
        for count, row in enumerate(csv.reader(csvinfile), start=1):
            if testfile == 0 and count == 200:
                break
            category, title, document = row
            csv_writer.writerow([transform_instance(category), document])

def preprocesstest(input_file, output_file, testfile=1):
    """Write one document per line, without labels, for the classification job."""
    with open(input_file, 'r', newline='') as csvinfile, \
         open(output_file, 'w') as csvoutfile:
        for count, row in enumerate(csv.reader(csvinfile), start=1):
            if testfile == 0 and count == 200:
                break
            category, title, document = row
            # Keep each document on a single line
            csvoutfile.write(document.replace('\n', ' ') + '\n')

In [None]:
%%time

# Preparing the training dataset

preprocess('dbpedia_csv/train.csv', 'dbpedia.train')
        
# Preparing the test dataset        
preprocesstest('dbpedia_csv/test.csv', 'dbpedia.test')

In [None]:
!head dbpedia.train -n 3

In [None]:
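def upload_to_s3(channel, file):
    s3 = boto3.resource('s3')
    key = channel + '/' + file
    # Use a context manager so the file handle is closed after the upload
    with open(file, "rb") as data:
        s3.Bucket(bucket).put_object(Key=key, Body=data)

# Build the keys from the prefix set above so they match the S3 locations used later
s3_train_key = prefix + "train"
s3_test_key = prefix + "test"

upload_to_s3(s3_train_key, 'dbpedia.train')
upload_to_s3(s3_test_key, 'dbpedia.test')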

The data preprocessing cell might take a minute to run. Once preprocessing is complete, the cell above uploads the two files to S3, under the bucket and prefix set earlier, so that Amazon Comprehend can use them for training and inference.

Next, we need to set up an output location in S3 where the model artifacts will be written. These artifacts are the output of the classifier's training job.

In [None]:
s3_output_location = 's3://{}/{}output'.format(bucket, prefix)
s3_train_location = 's3://{}/{}train'.format(bucket, prefix)+'/'+'dbpedia.train'
s3_test_location = 's3://{}/{}test'.format(bucket, prefix)+'/'+'dbpedia.test'
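
For reference, an object's S3 URI is simply `s3://`, the bucket name, and the object key joined together. The sketch below uses a hypothetical helper `s3_uri` and a placeholder bucket name to show how the key built in `upload_to_s3` (channel, then `/`, then file name) lines up with the training location above.

```python
def s3_uri(bucket_name, key):
    """Join a bucket name and an object key into an s3:// URI."""
    return 's3://{}/{}'.format(bucket_name, key.lstrip('/'))

example_bucket = 'my-comprehend-bucket'  # placeholder, not a real bucket
example_prefix = 'dbpedia/'

# Same key layout as upload_to_s3: channel + '/' + file
train_key = example_prefix + 'train' + '/' + 'dbpedia.train'

print(s3_uri(example_bucket, train_key))
# s3://my-comprehend-bucket/dbpedia/train/dbpedia.train
```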

## Training Comprehend for custom classification

Create a policy for the Comprehend service role.

In [None]:
iam = boto3.client("iam")
policy_name = "Comprehendpolicy"
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "comprehend:CreateDocumentClassifier",
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::*Comprehend*",
                "arn:aws:s3:::*comprehend*"
            ]
        }
    ]
}
#print(role)
#response = iam.list_attached_role_policies(RoleName = "AmazonSageMaker-ExecutionRole-20190723T185609")
#print(response)

create_policy_response = iam.create_policy(
    PolicyName = policy_name,
    PolicyDocument = json.dumps(policy_document),
    Description='Comprehend Policy'
)
PolicyArn=create_policy_response["Policy"]["Arn"]
print(PolicyArn)

In [None]:
role_name = "ComprehendRole"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "comprehend.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}
create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document),
    Description='Amazon Comprehend service role for classifier.'
)

iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = create_policy_response["Policy"]["Arn"]
)

time.sleep(30)  # wait 30 seconds to allow the IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

We will create a custom classification training job using the Boto3 SDK.

In [None]:
# Instantiate Boto3 SDK:
client = boto3.client('comprehend', region_name=(region))
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
# Create a document classifier
create_response = client.create_document_classifier(
    InputDataConfig={
        'S3Uri': (s3_train_location)
    },OutputDataConfig={
        'S3Uri': (s3_output_location)
    },
    DataAccessRoleArn=(role_arn),
    DocumentClassifierName='dbpedia-classifier'+(timestamp),
    LanguageCode='en'
)
print(create_response)

In [None]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_classifier_response = client.describe_document_classifier(
        DocumentClassifierArn=create_response['DocumentClassifierArn'])
    status = describe_classifier_response["DocumentClassifierProperties"]["Status"]
    now = datetime.now(pytz.utc)
    elapsed = now - describe_classifier_response["DocumentClassifierProperties"]["SubmitTime"]
    print("DocumentClassifierProperties: {}   (elapsed = {})".format(status, elapsed))
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(15)
    
documentclassifier = describe_classifier_response["DocumentClassifierProperties"]["DocumentClassifierArn"]
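
The polling pattern above can be factored into a small reusable helper. The following is a minimal sketch, not part of the Comprehend API: `wait_for_status`, its parameters, and the fake status sequence are all hypothetical, and `get_status` stands in for any function that returns the current status string.

```python
import time

def wait_for_status(get_status, terminal_states, timeout_seconds=3 * 60 * 60,
                    poll_seconds=15, sleep=time.sleep):
    """Call get_status() until it returns a terminal state or the timeout expires.

    Returns the last status seen, which may be non-terminal after a timeout.
    """
    deadline = time.time() + timeout_seconds
    status = get_status()
    while status not in terminal_states and time.time() < deadline:
        sleep(poll_seconds)
        status = get_status()
    return status

# Usage with a fake status source standing in for a describe_* call
fake_statuses = iter(["SUBMITTED", "TRAINING", "TRAINING", "TRAINED"])
final = wait_for_status(lambda: next(fake_statuses),
                        terminal_states={"TRAINED", "IN_ERROR"},
                        poll_seconds=0)
print(final)  # TRAINED
```

With the notebook's `client`, `get_status` could wrap `describe_document_classifier` and read the `Status` field, as the loop above does.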

Retrieve metrics of the Comprehend custom classifier

In [None]:
evaluation_metrics=describe_classifier_response["DocumentClassifierProperties"]["ClassifierMetadata"]
print(evaluation_metrics)
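
The `ClassifierMetadata` structure nests the headline scores under an `EvaluationMetrics` key. The sketch below reads them from a hand-written dictionary shaped like that part of the response. The field names reflect my understanding of the `describe_document_classifier` response and should be checked against your own output; the numbers are invented for illustration and are not results from this notebook.

```python
# Invented values in the shape of ClassifierMetadata; not real results
example_metadata = {
    "NumberOfLabels": 14,
    "NumberOfTrainedDocuments": 1000,
    "NumberOfTestDocuments": 100,
    "EvaluationMetrics": {
        "Accuracy": 0.95,
        "Precision": 0.94,
        "Recall": 0.93,
        "F1Score": 0.935,
    },
}

metrics = example_metadata["EvaluationMetrics"]
for name in ("Accuracy", "Precision", "Recall", "F1Score"):
    # .get() avoids a KeyError if a metric is absent from the response
    print("{:<10} {:.3f}".format(name, metrics.get(name, float("nan"))))
```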

Let's plot the confusion matrix from the output of the classifier.

In [None]:
output_confusion_matrix_location = describe_classifier_response["DocumentClassifierProperties"]["OutputDataConfig"]
outputconfusionmatrixs3=output_confusion_matrix_location["S3Uri"]
print(outputconfusionmatrixs3)
s3copyconfusionmatrixlocation = (outputconfusionmatrixs3)
!aws s3 cp $s3copyconfusionmatrixlocation ./confusionmatrix.tar.gz
!tar -xzvf confusionmatrix.tar.gz

In [None]:
import numpy as np
def plot_confusion_matrix(cm,
                          target_names,
                          title='Confusion matrix',
                          cmap=None,
                          normalize=True):
    """
    given a sklearn confusion matrix (cm), make a nice plot

    Arguments
    ---------
    cm:           confusion matrix from sklearn.metrics.confusion_matrix

    target_names: given classification classes such as [0, 1, 2]
                  the class names, for example: ['high', 'medium', 'low']

    title:        the text to display at the top of the matrix

    cmap:         the gradient of the values displayed from matplotlib.pyplot.cm
                  see http://matplotlib.org/examples/color/colormaps_reference.html
                  plt.get_cmap('jet') or plt.cm.Blues

    normalize:    If False, plot the raw numbers
                  If True, plot the proportions

    Usage
    -----
    plot_confusion_matrix(cm           = cm,                  # confusion matrix created by
                                                              # sklearn.metrics.confusion_matrix
                          normalize    = True,                # show proportions
                          target_names = y_labels_vals,       # list of names of the classes
                          title        = best_estimator_name) # title of graph

    Citation
    ---------
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html

    """
    import matplotlib.pyplot as plt
    import numpy as np
    import itertools

    accuracy = np.trace(cm) / float(np.sum(cm))
    misclass = 1 - accuracy

    if cmap is None:
        cmap = plt.get_cmap('Blues')

    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()

    if target_names is not None:
        tick_marks = np.arange(len(target_names))
        plt.xticks(tick_marks, target_names, rotation=45)
        plt.yticks(tick_marks, target_names)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]


    thresh = cm.max() / 1.5 if normalize else cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        if normalize:
            plt.text(j, i, "{:0.4f}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
        else:
            plt.text(j, i, "{:,}".format(cm[i, j]),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")


    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label\naccuracy={:0.4f}; misclass={:0.4f}'.format(accuracy, misclass))
    plt.show()

In [None]:
def load_cm(path):
    with open(path) as f:
        data = json.load(f)["confusion_matrix"]
    #print(data)
    return np.asarray(data)

def load_labels(path):
    with open(path) as f:
        data = json.load(f)["labels"]
    #print(data)
    return np.asarray(data)




In [None]:
cm_path = './output/confusion_matrix.json'
plot_confusion_matrix(cm=load_cm(cm_path),
                      normalize=False,
                      target_names=load_labels(cm_path),
                      title="Confusion Matrix")
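
Beyond the plot, per-class precision and recall can be read off the same matrix. This is a minimal sketch in plain Python with a made-up 3x3 matrix and labels; it assumes rows are true labels and columns are predicted labels, matching the axis labels used by `plot_confusion_matrix`, which is worth verifying against your own `confusion_matrix.json`.

```python
def per_class_scores(cm, labels):
    """Return {label: (precision, recall)} for a square confusion matrix.

    Assumes cm[i][j] counts documents with true class i predicted as class j.
    """
    scores = {}
    for k, label in enumerate(labels):
        true_positive = cm[k][k]
        predicted_k = sum(row[k] for row in cm)  # column sum
        actual_k = sum(cm[k])                    # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        scores[label] = (precision, recall)
    return scores

# Made-up example, not output from the classifier
example_cm = [[8, 1, 1],
              [2, 7, 1],
              [0, 2, 8]]
example_labels = ["Company", "Artist", "Athlete"]

for label, (precision, recall) in per_class_scores(example_cm, example_labels).items():
    print("{:<8} precision={:.2f} recall={:.2f}".format(label, precision, recall))

# Overall accuracy is the diagonal total over the grand total
accuracy = sum(example_cm[i][i] for i in range(3)) / sum(map(sum, example_cm))
print("accuracy={:.4f}".format(accuracy))
```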

##  Inference
Once the training is done, we can create a job to classify documents with the Amazon Comprehend custom classifier. We will run this against our test data.

In [None]:
start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': (s3_test_location),
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': (s3_output_location)
    },
    DataAccessRoleArn=(role_arn),
    DocumentClassifierArn=(documentclassifier),
    JobName='dbpedia-classification Job'+(timestamp)
)

print("Start response: {}\n".format(start_response))

Check the status of the job

In [None]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_classification_response = client.describe_document_classification_job(JobId=start_response['JobId'])
    status = describe_classification_response["DocumentClassificationJobProperties"]["JobStatus"]
    now = datetime.now(pytz.utc)
    elapsed = now - describe_classification_response["DocumentClassificationJobProperties"]["SubmitTime"]
    print("DocumentClassificationJobProperties: {}   (elapsed = {})".format(status, elapsed))
    
    if status == "COMPLETED" or status == "FAILED":
        break
        
    time.sleep(15)
output_location = describe_classification_response["DocumentClassificationJobProperties"]["OutputDataConfig"]
outputs3=output_location["S3Uri"]

Once the classification job has run, let's download and view the results.

In [None]:
s3copylocation = (outputs3)
!aws s3 cp $s3copylocation .
!tar -xzvf output.tar.gz

We will pick a random entry from the dbpedia.test data and its corresponding result from the Amazon Comprehend analysis job.

In [None]:
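# Count the predictions, closing the file when done
with open('predictions.jsonl') as f:
    num_lines = sum(1 for line in f)
random_result = random.randint(1, num_lines)  # linecache line numbers start at 1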

predictionresult = json.loads(linecache.getline('predictions.jsonl', random_result))
testdata = linecache.getline('dbpedia.test', random_result)

print(predictionresult["Classes"])
print(testdata)
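
To read the prediction more directly, the highest-scoring class can be pulled out of each result. The sketch below parses a hand-written line; it assumes each entry in `Classes` carries `Name` and `Score` fields and that results include a zero-based `Line` field, so treat those field names as assumptions to check against your own `predictions.jsonl`. The scores shown are invented.

```python
import json

# Hand-written stand-in for one line of predictions.jsonl (values invented)
example_line = json.dumps({
    "File": "dbpedia.test",
    "Line": "0",
    "Classes": [
        {"Name": "Company", "Score": 0.91},
        {"Name": "Artist", "Score": 0.06},
        {"Name": "Athlete", "Score": 0.03},
    ],
})

def top_class(prediction_json):
    """Return (line_number, class_name, score) for the highest-scoring class."""
    result = json.loads(prediction_json)
    best = max(result["Classes"], key=lambda c: c["Score"])
    return int(result["Line"]), best["Name"], best["Score"]

print(top_class(example_line))  # (0, 'Company', 0.91)
```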