# Amazon Comprehend Custom Classification - Lab

This notebook serves as a template for the overall process of taking a text dataset, integrating it into [Amazon Comprehend Custom Classification](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html), and using the resulting custom model to classify text.

## Overview

1. [Introduction to Amazon Comprehend Custom Classification](#Introduction)
1. [Obtaining Your Data](#data)
1. [Pre-processing data](#preprocess)
1. [Building Custom Classification model](#build)
1. [Evaluate Custom Classification model](#evaluate)
1. [Cleanup](#cleanup)


## Introduction to Amazon Comprehend Custom Classification <a class="anchor" id="Introduction"/>

If you are not familiar with Amazon Comprehend Custom Classification, you can learn more about it on these pages:

* [Product Page](https://aws.amazon.com/comprehend/)
* [Product Docs](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification.html)


## Bring Your Own Data <a class="anchor" id="data"/>

We will be using multi-class mode in Amazon Comprehend Custom Classification. Multi-class mode assigns a single class to each document, so the individual classes must be mutually exclusive. This part is important: if our classes overlap, we should expect the model to learn and predict those same overlapping classes, and accuracy may suffer.
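
One way to spot overlapping classes before training is to look for identical texts that carry more than one label. The sketch below is only an illustration: the helper `find_overlapping_texts` and the sample rows are hypothetical and not part of the lab dataset.

```python
from collections import defaultdict

def find_overlapping_texts(rows):
    """Return {text: sorted classes} for texts labeled with more than one class."""
    classes_by_text = defaultdict(set)
    for label, text in rows:
        # Normalize lightly so case or spacing differences do not hide duplicates
        classes_by_text[" ".join(text.lower().split())].add(label)
    return {text: sorted(labels)
            for text, labels in classes_by_text.items() if len(labels) > 1}

# Hypothetical sample rows in (class, text) form
sample_rows = [
    ("BILLING", "Invoice total is wrong"),
    ("ACCOUNT", "invoice total is  wrong"),
    ("ACCOUNT", "Cannot reset my password"),
]
overlaps = find_overlapping_texts(sample_rows)
print(overlaps)
```

Exact duplicates are only the most obvious form of overlap, so an empty result does not prove that the classes are mutually exclusive.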

We are going to upload a custom dataset. The dataset must be a .csv file with one class and one document per line. For example:
```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```
If we don't have the file in the above format, we will convert it to that format.
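
As a minimal sketch of such a conversion, the snippet below writes label-first rows with Python's `csv` module. The record keys `topic` and `body` and the helper `to_comprehend_rows` are placeholders chosen for illustration; they are not columns or functions from the lab dataset.

```python
import csv
import io

def to_comprehend_rows(records, class_key, text_key):
    """Yield (CLASS, text) pairs with the label first and line breaks flattened."""
    for record in records:
        # Collapse whitespace so each document stays on a single line
        text = " ".join(str(record[text_key]).split())
        yield str(record[class_key]), text

# Hypothetical records; the key names are placeholders
records = [
    {"topic": "BILLING", "body": "Invoice total is wrong,\nplease check"},
    {"topic": "ACCOUNT", "body": "Cannot reset my password"},
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # no header row is written
writer.writerows(to_comprehend_rows(records, "topic", "body"))
csv_text = buffer.getvalue()
print(csv_text)
```

The writer quotes a field only when it contains a comma or a quote character, so plain texts are written unchanged.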

To begin, complete the following:

1. Run the cell below to create a directory for the data files.
1. Upload the file manually to the `nlp_data` folder (the cells below expect it to be named `raw_data.csv`).

In [None]:
!mkdir -p nlp_data


With the data uploaded, we will now import the Pandas library as well as a few other data science tools in order to inspect the information.

In [None]:
import boto3
from time import sleep
import os
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import secrets
import string
import datetime 
import random

In [None]:
# run this only once
! pip install tqdm

In [None]:
from tqdm import tqdm
tqdm.pandas()

Let's load the data into a dataframe and look at what we uploaded. Examine the number of columns that are present, and look at a few samples to see the content of the data.

In [None]:
raw_data = pd.read_csv('nlp_data/raw_data.csv')
raw_data.head()

In [None]:
raw_data['CATEGORY_NAME'] = raw_data['CATEGORY_NAME'].astype(str)
raw_data.groupby('CATEGORY_NAME')['CASE_SUBJECT_FULL'].count()
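
Classes with very few examples are hard for any classifier to learn, so it can help to flag them while inspecting the counts. The sketch below assumes a hypothetical helper `classes_below_minimum`; the threshold of 10 is a placeholder, and the current Amazon Comprehend documentation should be checked for the actual minimum.

```python
from collections import Counter

def classes_below_minimum(labels, min_docs):
    """Return {class: count} for classes with fewer than min_docs examples."""
    counts = Counter(labels)
    return {label: count for label, count in counts.items() if count < min_docs}

# Hypothetical labels standing in for raw_data['CATEGORY_NAME']
sample_labels = ["BILLING"] * 12 + ["ACCOUNT"] * 3 + ["SHIPPING"] * 40
sparse = classes_below_minimum(sample_labels, min_docs=10)  # 10 is a placeholder threshold
print(sparse)
```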

The data must be converted to the format required by Amazon Comprehend Custom Classification:

```
CLASS,Text of document 1
CLASS,Text of document 2
CLASS,Text of document 3
```
To do this, we first identify which column holds the class and which columns hold the text content we would like to train on. We can then create a new dataframe with only the selected columns.


In [None]:
selected_columns = ['CATEGORY_NAME', 'CASE_SUBJECT_FULL', 'CASE_DESCRIPTION_FULL']

In [None]:
# Select the columns we are interested in
selected_data = raw_data[selected_columns]
selected_data = selected_data[selected_data['CATEGORY_NAME']!='Not Known']
selected_data.shape

In [None]:
#selected_data.groupby('CATEGORY_NAME')['CASE_SUBJECT_FULL'].count()

Because we want to measure the model's accuracy against known labels, we hold out 10% of the dataset. We will later run inference on this held-out set and generate performance metrics to assess the model. We stratify the split on `CATEGORY_NAME` so that both sets keep the same class proportions.

In [None]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(selected_data, test_size=0.1, random_state=0, 
                               stratify=selected_data[['CATEGORY_NAME']])

train_data_df = train_data.copy()
test_data_df = test_data.copy()
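
A quick way to confirm that stratification worked is to compare the class shares of the two splits. The sketch below uses only the standard library; `class_proportions`, `max_proportion_gap`, and the toy labels are hypothetical stand-ins for `train_data['CATEGORY_NAME']` and `test_data['CATEGORY_NAME']`.

```python
from collections import Counter

def class_proportions(labels):
    """Map each class to its share of the given labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def max_proportion_gap(train_labels, test_labels):
    """Largest absolute difference in class share between the two splits."""
    train = class_proportions(train_labels)
    test = class_proportions(test_labels)
    return max(abs(train.get(c, 0.0) - test.get(c, 0.0))
               for c in set(train) | set(test))

# Hypothetical labels for a 90/10 split of two classes
train_labels = ["A"] * 90 + ["B"] * 9
test_labels = ["A"] * 10 + ["B"] * 1
gap = max_proportion_gap(train_labels, test_labels)
print(round(gap, 4))
```

A gap close to zero suggests the split preserved the class balance; a large gap usually means the split was not stratified on the intended column.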

## Pre-processing data<a class="anchor" id="preprocess"/> 


For training, the file format must conform with the [following](https://docs.aws.amazon.com/comprehend/latest/dg/how-document-classification-training.html):

- File must contain one label and one text per line – 2 columns
- No header
- UTF-8 encoded, with each line ended by a newline (“\n”).

Labels “must be uppercase, can be multitoken, have whitespace, consist of multiple words connect by underscores or hyphens or may even contain a comma in it, as long as it is correctly escaped.”

Here are the proposed labels:

| Index | Original | For training |
| --- | --- | --- |
| 1 | Company | COMPANY |
| 2 | EducationalInstitution | EDUCATIONALINSTITUTION |
| 3 | Artist | ARTIST |
| 4 | Athlete | ATHLETE |
| 5 | OfficeHolder | OFFICEHOLDER |
| 6 | MeanOfTransportation | MEANOFTRANSPORTATION |
| 7 | Building | BUILDING |
| 8 | NaturalPlace | NATURALPLACE |
| 9 | Village | VILLAGE |
| 10 | Animal | ANIMAL |
| 11 | Plant | PLANT |
| 12 | Album | ALBUM |
| 13 | Film | FILM |
| 14 | WrittenWork | WRITTENWORK |

For inference (when you want your custom model to determine which label corresponds to a given text), the file format must conform with the following:

- File must contain one text per line
- No header
- UTF-8 encoded, with each line ended by a newline (“\n”).

In [None]:
labels_dict = {'1':'COMPANY',
               '2':'EDUCATIONALINSTITUTION',
               '3':'ARTIST',
               '4':'ATHLETE',
               '5':'OFFICEHOLDER',
               '6':'MEANOFTRANSPORTATION',
               '7':'BUILDING',
               '8':'NATURALPLACE',
               '9':'VILLAGE',
               '10':'ANIMAL',
               '11':'PLANT',
               '12':'ALBUM',
               '13':'FILM',
               '14':'WRITTENWORK'
               }
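
The same mapping can be derived from the original names in the table instead of being typed out by hand. This is only a sketch; `original_labels` and `derived_labels` are names introduced here for illustration.

```python
original_labels = [
    "Company", "EducationalInstitution", "Artist", "Athlete", "OfficeHolder",
    "MeanOfTransportation", "Building", "NaturalPlace", "Village", "Animal",
    "Plant", "Album", "Film", "WrittenWork",
]

# Keys are strings because the notebook casts CATEGORY_NAME to str before mapping
derived_labels = {str(index): name.upper()
                  for index, name in enumerate(original_labels, start=1)}
print(derived_labels["6"])
```

Deriving the dictionary keeps the index and the upper-case label in step if the list of classes changes.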

In [None]:
import re

def remove_between_square_brackets(text):
    # Raw string avoids invalid-escape warnings for \[ and \]
    return re.sub(r'\[[^]]*\]', '', text)

def denoise_text(text):
    text = remove_between_square_brackets(text)
    return text

def preprocess_text(document):
    document = denoise_text(document)
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(document))

    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the text
    # (the caret must stay unescaped to anchor the match at the start)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    return document

def process_data(df):
    df['CATEGORY_NAME'] = df['CATEGORY_NAME'].apply(labels_dict.get)

    df['document'] = df[df.columns[1:]].progress_apply(
        lambda x: ' '.join(x.dropna().astype(str)),
        axis=1
    )

    df.drop(['CASE_SUBJECT_FULL' ,'CASE_DESCRIPTION_FULL'], axis=1, inplace=True)

    df.columns = ['class', 'text']
    
    df['text'] = df['text'].progress_apply(preprocess_text)
    
    return df

In [None]:
train_data_df = process_data(train_data_df)
test_data_df = process_data(test_data_df)
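
To see what this kind of cleaning does to a single document, the sketch below restates the main steps in one condensed function. `clean` and the sample sentence are hypothetical; this is not the notebook's `preprocess_text`, only an illustration of the same regular-expression steps.

```python
import re

def clean(text):
    text = re.sub(r'\[[^]]*\]', '', text)        # drop [bracketed] notes
    text = re.sub(r'\W', ' ', text)              # punctuation becomes spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop isolated single letters
    text = re.sub(r'\s+', ' ', text)             # collapse repeated whitespace
    return text.strip()

sample = "Printer [model X] won't print: error #42 on page 3!"
cleaned = clean(sample)
print(cleaned)  # Printer won print error 42 on page 3
```

Note the side effect on contractions: "won't" is reduced to "won" because the apostrophe becomes a space and the stray "t" is then removed as a single character.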

At this point we have all the data needed to build the 2 required files.

### Building The Target Train and Test Files

With all of the above spelled out, the next thing to do is to build 2 distinct files:

1. `comprehend-train.csv` - A CSV file containing 2 columns without header, first column class, second column text.
1. `comprehend-test.csv` - A CSV file containing 1 column of text without header.

In [None]:
DSTTRAINFILE='nlp_data/comprehend-train.csv'
DSTVALIDATIONFILE='nlp_data/comprehend-test.csv'

train_data_df.to_csv(path_or_buf=DSTTRAINFILE,
                  header=False,
                  index=False,
                  escapechar='\\',
                  doublequote=False,
                  quotechar='"')

# The test file holds only the text column, so drop the class before writing
validation_data_df = test_data_df.copy()
validation_data_df.drop(['class'], axis=1, inplace=True)
validation_data_df.to_csv(path_or_buf=DSTVALIDATIONFILE,
                          header=False,
                          index=False,
                          escapechar='\\',
                          doublequote=False,
                          quotechar='"')

## Getting Started With Amazon Comprehend
Now that all of the required data exists, we can start working with the Comprehend custom classifier.

The custom classifier workload is built in two steps:

1. Training the custom model – no particular machine learning or deep learning knowledge is necessary
1. Classifying new data

Let's follow the steps below to train the custom model:

1. Create a bucket that will host training data
1. Create a bucket that will host training data artifacts and production results. This can be the same bucket as the one above
1. Configure an IAM role allowing Comprehend to [access newly created buckets](https://docs.aws.amazon.com/comprehend/latest/dg/access-control-managing-permissions.html#auth-role-permissions)
1. Prepare data for training
1. Upload training data in the S3 bucket
1. Launch a “Train Classifier” job from the console: “Amazon Comprehend” > “Custom Classification” > “Train Classifier” (this notebook calls the `create_document_classifier` API instead)
1. Prepare data for classification (one text per line, no header, same format as training data). Some more details [here](https://docs.aws.amazon.com/comprehend/latest/dg/how-class-run.html)


Now, using the metadata stored on this SageMaker notebook instance, determine the region we are operating in. If you are using a Jupyter notebook outside of SageMaker, simply define `region` as the string that indicates the region you would like to use for Comprehend and S3.

In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)
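
Outside of SageMaker the metadata file does not exist, so a fallback can be useful. The sketch below assumes two hypothetical helpers, `region_from_arn` and `resolve_region`; the example ARN and the `us-east-1` default are placeholders, and `AWS_DEFAULT_REGION` is the environment variable the AWS SDKs read.

```python
import os

def region_from_arn(arn):
    """Return the region field (fourth colon-separated part) of an ARN."""
    parts = arn.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    return parts[3]

def resolve_region(arn=None, default="us-east-1"):
    """Prefer the ARN's region, then AWS_DEFAULT_REGION, then a fallback default."""
    if arn:
        return region_from_arn(arn)
    return os.environ.get("AWS_DEFAULT_REGION", default)

# Hypothetical ARN for illustration only
example_arn = "arn:aws:sagemaker:eu-west-1:123456789012:notebook-instance/demo"
print(region_from_arn(example_arn))
```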

Configure your AWS APIs

In [None]:
session = boto3.Session(region_name=region) 
comprehend = session.client(service_name='comprehend')

Let's create an S3 bucket that will host the training data and test data.

In [None]:
# Create a uniquely named bucket in the region we are operating in
print(region)
s3 = boto3.client('s3', region_name=region)
prefix = 'ComprehendDBPediaClassification'
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-comprehend-dbpedia-classification-{}".format(''.join(
    secrets.choice(string.ascii_lowercase + string.digits) for i in range(8)))
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

### Uploading the data

In [None]:
boto3.Session().resource('s3').Bucket(bucket_name).Object(prefix+'/'+DSTTRAINFILE).upload_file(DSTTRAINFILE)
boto3.Session().resource('s3').Bucket(bucket_name).Object(prefix+'/'+DSTVALIDATIONFILE).upload_file(DSTVALIDATIONFILE)

### Configure an IAM role

In order to authorize Amazon Comprehend to perform bucket reads and writes during the training or during the inference, we must grant Amazon Comprehend access to the Amazon S3 bucket that we created.

We are going to create a data access role in our account to trust the Amazon Comprehend service principal.


In [None]:
iam = boto3.client("iam")

role_name = "ComprehendBucketAccessRole-{}".format(''.join(
    secrets.choice(string.ascii_lowercase + string.digits) for i in range(8)))
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "comprehend.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

policy_arn = "arn:aws:iam::aws:policy/ComprehendFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)
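
The managed `AmazonS3FullAccess` policy grants access to every bucket in the account. As a narrower alternative, a policy document can be limited to the single bucket created earlier. The sketch below only builds that document; `build_bucket_policy` and the bucket name `example-bucket` are hypothetical, and the chosen actions are an assumption about what training and inference jobs need.

```python
import json

def build_bucket_policy(bucket_name):
    """Policy document limiting access to one bucket instead of all of S3."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # list the bucket's contents
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {   # read input data and write job output
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }

policy = build_bucket_policy("example-bucket")  # placeholder bucket name
print(json.dumps(policy, indent=2))
```

Such a document could be attached as an inline policy with `iam.put_role_policy(RoleName=role_name, PolicyName=..., PolicyDocument=json.dumps(policy))`, where the policy name is up to you.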

## Building Custom Classification model <a class="anchor" id="build"/>

Launch the classifier training:

In [None]:
s3_train_data = 's3://{}/{}/{}'.format(bucket_name, prefix, DSTTRAINFILE)
s3_output_job = 's3://{}/{}/{}'.format(bucket_name, prefix, 'output/train_job')
print('training data location: ',s3_train_data, "output location:", s3_output_job)

In [None]:
# strftime("%s") is a platform-specific extension; time.time() is portable
id = str(int(time.time()))
training_job = comprehend.create_document_classifier(
    DocumentClassifierName='DBPedia-Ontology-Custom-Classifier-'+ id,
    DataAccessRoleArn=role_arn,
    InputDataConfig={
        'S3Uri': s3_train_data
    },
    OutputDataConfig={
        'S3Uri': s3_output_job
    },
    LanguageCode='en'
)

In [None]:
jobArn = training_job['DocumentClassifierArn']

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_custom_classifier = comprehend.describe_document_classifier(
        DocumentClassifierArn = jobArn
    )
    status = describe_custom_classifier["DocumentClassifierProperties"]["Status"]
    print("Custom classifier: {}".format(status))
    
    if status == "TRAINED" or status == "IN_ERROR":
        break
        
    time.sleep(60)
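
The same polling pattern appears again later for the inference job, so it can be factored into a small helper. The sketch below is an assumption-laden illustration: `wait_for_status` is a hypothetical function, and the canned statuses stand in for real `describe_document_classifier` calls so the example runs without AWS access.

```python
import time

def wait_for_status(fetch_status, terminal_states, timeout_s, interval_s,
                    sleep=time.sleep):
    """Call fetch_status() until it returns a terminal state or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in terminal_states:
            return status
        if time.monotonic() + interval_s > deadline:
            raise TimeoutError(f"still {status!r} after {timeout_s} seconds")
        sleep(interval_s)

# Stand-in for the describe call: returns canned statuses in order
canned = iter(["SUBMITTED", "TRAINING", "TRAINED"])
final = wait_for_status(lambda: next(canned), {"TRAINED", "IN_ERROR"},
                        timeout_s=60, interval_s=0)
print(final)
```

Unlike the loop above, which simply stops after three hours, this helper raises `TimeoutError`, so a job that never finishes cannot be mistaken for a trained model.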

## Trained model confusion matrix

When a custom classifier model is trained, Amazon Comprehend creates a confusion matrix that provides metrics on how well the model performed in training. This enables you to assess how well the classifier will perform when run. The matrix compares the labels predicted by the model with the actual labels, and it is created using 10 to 20 percent of the submitted documents, which are held back to test the trained model.

In [None]:
#Retrieve the S3URI from the model output and create jobkey variable.
job_output = describe_custom_classifier["DocumentClassifierProperties"]["OutputDataConfig"]["S3Uri"]
path_prefix = 's3://{}/'.format(bucket_name)
job_key = os.path.relpath(job_output, path_prefix)
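
`os.path.relpath` works here because the URI happens to behave like a POSIX path. A more explicit alternative is to split the URI as a string. The helper `split_s3_uri` and the sample URI below are hypothetical and shown only as a sketch.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key

# Hypothetical output location for illustration
bucket, key = split_s3_uri("s3://example-bucket/prefix/output/train_job/output.tar.gz")
print(bucket, key)
```

This version does not depend on the current working directory or on the operating system's path rules.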

In [None]:
#Download the model metrics
boto3.Session().resource('s3').Bucket(bucket_name).download_file(job_key, './output.tar.gz')

In [None]:
!ls -ltr

In [None]:
#Unpack the gzip file
!tar xvzf ./output.tar.gz
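
If shell commands are not available, the archive can also be read from Python with the standard `tarfile` module. The sketch below is an assumption: `read_json_member` is a hypothetical helper, and the small demo archive is created on the fly so the example runs without the real `output.tar.gz`.

```python
import io
import json
import os
import tarfile
import tempfile

def read_json_member(archive_path, suffix):
    """Return the parsed JSON of the first member whose name ends with suffix."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.endswith(suffix):
                with archive.extractfile(member) as handle:
                    return json.load(handle)
    raise FileNotFoundError(f"no member ending with {suffix!r} in {archive_path}")

# Build a tiny demo archive that mimics the layout used in this notebook
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tar.gz")
payload = json.dumps({"labels": ["A", "B"],
                      "confusion_matrix": [[1, 0], [0, 1]]}).encode("utf-8")
with tarfile.open(demo_path, "w:gz") as archive:
    info = tarfile.TarInfo("output/confusion_matrix.json")
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

demo_cm = read_json_member(demo_path, "confusion_matrix.json")
print(demo_cm["labels"])
```

Reading the member directly avoids leaving extracted files in the working directory.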

In [None]:
import json

with open('output/confusion_matrix.json') as f:
    comprehend_cm = json.load(f)

cm_array = comprehend_cm['confusion_matrix']


def plot_confusion_matrix(cm_array, labels):
    # Use the class names for both the row and column labels
    df_cm = pd.DataFrame(cm_array, index=list(labels), columns=list(labels))

    #sn.set(font_scale=1.4) # for label size
    plt.figure(figsize = (15,13))
    sn.heatmap(df_cm, annot=True) # write each cell's value on the heatmap

    plt.show()

plot_confusion_matrix(cm_array, labels = comprehend_cm['labels'])

In [None]:
from sklearn.metrics import confusion_matrix
import numpy as np

cm = np.array(comprehend_cm['confusion_matrix'])

cols = ['label','precision', 'recall','f1_score','type']
models_report = pd.DataFrame(columns = cols)

def precision(label, confusion_matrix):
    col = confusion_matrix[:, label]
    return confusion_matrix[label, label] / col.sum()
    
def recall(label, confusion_matrix):
    row = confusion_matrix[label, :]
    return confusion_matrix[label, label] / row.sum()

def precision_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_precisions = 0
    for label in range(rows):
        sum_of_precisions += precision(label, confusion_matrix)
    return sum_of_precisions / rows

def recall_macro_average(confusion_matrix):
    rows, columns = confusion_matrix.shape
    sum_of_recalls = 0
    for label in range(columns):
        sum_of_recalls += recall(label, confusion_matrix)
    return sum_of_recalls / columns

def f1_score(precision, recall):
    return (2 * (precision * recall) / (precision + recall))

def accuracy(confusion_matrix):
    diagonal_sum = confusion_matrix.trace()
    sum_of_all_elements = confusion_matrix.sum()
    return diagonal_sum / sum_of_all_elements 

def display_confusion_matrix(cm, labels, matrix_type, models_report):
    rows = []
    for label in range(len(labels)):
        p = precision(label, cm)
        r = recall(label, cm)
        f1 = f1_score(p, r)
        # Take the name from `labels` so each row matches its matrix index
        # (labels_dict has string keys, so labels_dict.get(label+1) returned None)
        rows.append({'label': labels[label],
                     'precision': p,
                     'recall': r,
                     'f1_score': f1,
                     'type': matrix_type})
    # DataFrame.append was removed in pandas 2.0, so build the new rows in one step
    new_rows = pd.DataFrame(rows, columns=models_report.columns)
    models_report = new_rows if models_report.empty else pd.concat(
        [models_report, new_rows], ignore_index=True)

    p_total = precision_macro_average(cm)
    print(f"precision total: {p_total:2.4f}")

    r_total = recall_macro_average(cm)
    print(f"recall total: {r_total:2.4f}")



    a_total = accuracy(cm)
    print(f"accuracy total: {a_total:2.4f}")

    f1_total = f1_score(p_total, r_total)
    print(f"f1 total: {f1_total:2.4f}")
    
    return models_report

training_model_report = display_confusion_matrix(cm, comprehend_cm['labels'], 'training_matrix', models_report)
training_model_report.sort_values(by=['f1_score'], inplace=True, ascending=False)
print(training_model_report.to_string(index=False))
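
As a worked example of these metrics, the sketch below computes precision, recall, and F1 on a toy 2x2 matrix using plain Python lists. It assumes rows hold actual labels and columns hold predicted labels, which matches how `precision` and `recall` are defined above. The helper `per_class_metrics` and the toy matrix are hypothetical.

```python
def per_class_metrics(matrix, labels):
    """Precision, recall and F1 per label; rows are actual, columns are predicted."""
    report = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = (precision, recall, f1)
    return report

toy_matrix = [[8, 2],   # actual A: 8 predicted as A, 2 predicted as B
              [1, 9]]   # actual B: 1 predicted as A, 9 predicted as B
toy_report = per_class_metrics(toy_matrix, ["A", "B"])
for label, (p, r, f1) in toy_report.items():
    print(f"{label} {p:.4f} {r:.4f} {f1:.4f}")
```

For class A this gives a precision of 8/9 and a recall of 8/10. Unlike the functions above, the sketch returns 0.0 instead of dividing by zero when a class is never predicted.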

## Evaluate Custom Classification model <a class="anchor" id="evaluate"/>

We will run a custom classification job to evaluate the model on the held-out test data.

In [None]:
model_arn = describe_custom_classifier["DocumentClassifierProperties"]["DocumentClassifierArn"]
print(model_arn)

In [None]:
s3_test_data = 's3://{}/{}/{}'.format(bucket_name, prefix, DSTVALIDATIONFILE)
print(s3_test_data)

In [None]:
# strftime("%s") is a platform-specific extension; time.time() is portable
id = str(int(time.time()))

start_response = comprehend.start_document_classification_job(
    JobName = 'DBPedia-Ontology-Custom-Classifier-Inference'+ id,
    InputDataConfig={
        'S3Uri': s3_test_data,
        'InputFormat': 'ONE_DOC_PER_LINE'
    },
    OutputDataConfig={
        'S3Uri': s3_output_job
    },
    DataAccessRoleArn=role_arn,
    DocumentClassifierArn=model_arn
)

print("Start response: {}\n".format(start_response))

# Check the status of the job
describe_response = comprehend.describe_document_classification_job(JobId=start_response['JobId'])
print("Describe response: {}\n".format(describe_response))

# List all classification jobs in account
list_response = comprehend.list_document_classification_jobs()
print("List response: {}\n".format(list_response))

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_response = comprehend.describe_document_classification_job(JobId=start_response['JobId'])
    status = describe_response["DocumentClassificationJobProperties"]["JobStatus"]
    print("Custom classifier job status : {}".format(status))
    
    if status == "COMPLETED" or status == "FAILED" or status == "STOP_REQUESTED" or status== "STOPPED":
        break
        
    time.sleep(30)

In [None]:
inference_s3uri = describe_response["DocumentClassificationJobProperties"]["OutputDataConfig"]["S3Uri"]
path_prefix = 's3://{}/'.format(bucket_name)
inference_job_key = os.path.relpath(inference_s3uri, path_prefix)
boto3.Session().resource('s3').Bucket(bucket_name).download_file(inference_job_key, './inference_output.tar.gz')

In [None]:
#Unpack the gzip file
!tar xvzf ./inference_output.tar.gz

In [None]:
def load_jsonl(input_path) -> list:
    """
    Read list of objects from a JSON lines file.
    """
    data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                data.append(json.loads(line))
    print('Loaded {} records from {}'.format(len(data), input_path))
    return data

inference_data = load_jsonl('predictions.jsonl')

In [None]:
test_data_df.shape

In [None]:
inferred_class = []
for line in inference_data:
    predicted_class = sorted(line['Classes'], key=lambda x: x['Score'], reverse=True)[0]['Name']
    inferred_class.append(predicted_class)
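
The selection step can be illustrated on a single record. In the sketch below, the `Classes`, `Name`, and `Score` keys come from the code above, while the `File` and `Line` fields and all values are assumptions made up for the example. `top_class` is a hypothetical helper that uses `max` instead of sorting the whole list.

```python
import json

def top_class(record):
    """Return (name, score) of the highest-scoring entry in a prediction record."""
    best = max(record["Classes"], key=lambda entry: entry["Score"])
    return best["Name"], best["Score"]

# Hypothetical line shaped like the records read from predictions.jsonl
line = ('{"File": "comprehend-test.csv", "Line": "0", "Classes": ['
        '{"Name": "FILM", "Score": 0.91}, {"Name": "ALBUM", "Score": 0.06}, '
        '{"Name": "ARTIST", "Score": 0.03}]}')
name, score = top_class(json.loads(line))
print(name, score)
```

Keeping the score alongside the name makes it possible to review low-confidence predictions separately.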
    

In [None]:
test_data_df["predicted_class"] = inferred_class
test_data_df.head()

Let's generate the confusion matrix and other evaluation metrics for the inferred results.

In [None]:
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))

In [None]:
from sklearn.metrics import confusion_matrix

y_true = test_data_df['class']
y_pred = test_data_df['predicted_class']
labels = comprehend_cm['labels']
cm_inference = confusion_matrix(y_true, y_pred,labels=labels)
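
To make the layout of the resulting matrix concrete, the sketch below builds one by hand from toy labels: rows follow the actual class and columns follow the predicted class, both in the order given by `labels`. `build_confusion_matrix` and the toy data are hypothetical and only illustrate the counting; they are not a replacement for scikit-learn.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    index = {label: position for position, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        # Pairs that involve a label outside `labels` are skipped in this sketch
        if actual in index and predicted in index:
            matrix[index[actual]][index[predicted]] += 1
    return matrix

toy_true = ["FILM", "FILM", "ALBUM", "ALBUM", "ALBUM"]
toy_pred = ["FILM", "ALBUM", "ALBUM", "ALBUM", "FILM"]
toy_cm = build_confusion_matrix(toy_true, toy_pred, ["ALBUM", "FILM"])
print(toy_cm)  # [[2, 1], [1, 1]]
```

Passing the label order from the training matrix, as the cell above does, keeps the two matrices directly comparable.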

In [None]:
plot_confusion_matrix(cm_inference, labels = labels)

In [None]:
inference_model_report = display_confusion_matrix(cm_inference, labels, 'inference_matrix', models_report)

inference_model_report.sort_values(by=['f1_score'], inplace=True, ascending=False)
print(inference_model_report.to_string(index=False))

In [None]:
%store bucket_name
%store region
%store jobArn
%store role_arn

## Cleanup <a class="anchor" id="cleanup"/>
Run the [cleanup notebook](./Cleanup.ipynb) to clean up all the resources.