# Amazon Comprehend - Custom Text Classification

This lab is based on the blog post found [here](https://aws.amazon.com/blogs/machine-learning/building-a-custom-classifier-using-amazon-comprehend/). It uses data that has been pre-parsed and split, hosted in a public S3 bucket. You will need to update this notebook with your own S3 output location and IAM user policy.

Furthermore, we've reduced the number of documents to speed up the classification training time.

Please email awsaaron@amazon.com for questions

# Data


Typically you would copy data from S3 onto the SageMaker notebook instance. In this example, however, we are not using SageMaker for custom model training; we are using an AI service, Amazon Comprehend, which reads its training data from S3. To use it, we'll need our data in S3.


In [None]:
import os
import boto3
import sagemaker
import pandas as pd

region = boto3.Session().region_name
bucket_name = '<enter bucket name here>'
prefix = 'NLP.Classification'
os.environ["AWS_REGION"] = region
role = '<enter_role_here>'

print(region)
print(bucket_name)

In [None]:
training_data = 's3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-train.csv'
testing_data = 's3://aws-ml-blog/artifacts/comprehend-custom-classification/comprehend-test.csv'

Copy the data from the public bucket to your local notebook instance.

In [None]:
!aws s3 cp {training_data} .
!aws s3 cp {testing_data} .

In [None]:
# Read the local copy downloaded above; the file has no header row
df = pd.read_csv('comprehend-train.csv', header=None, names=['class', 'text'])
df

In [None]:
df['class'].unique()

In [None]:
df['class'].value_counts()

Due to the size of the dataset, let's downsample it for the purposes of this lab

In [None]:
a = df.sample(1000)

In [None]:
a['class'].value_counts()

Save the downsampled dataset to a CSV file

In [None]:
import csv
a['text'] = '"' + a['text'] + '"'
a.to_csv('limited_dataset.csv',header=None,index=None,quoting=csv.QUOTE_NONE)

In [None]:
limited_dataset_path = 's3://'+bucket_name+'/'+prefix+'/limited_dataset.csv'

In [None]:
!aws s3 cp limited_dataset.csv {limited_dataset_path}
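
Before starting a training job, it can help to sanity-check the file that was just uploaded. The sketch below is one way to do that, assuming the header-less two-column `label,text` layout used above. The helper name `check_training_csv` and the threshold `MIN_DOCS_PER_CLASS = 10` are assumptions made for illustration; check the current Comprehend limits for the real minimum number of documents per class.

```python
import csv
import io
from collections import Counter

# Assumed threshold for illustration; confirm against the Comprehend documentation.
MIN_DOCS_PER_CLASS = 10


def check_training_csv(lines, min_docs_per_class=MIN_DOCS_PER_CLASS):
    """Return (label_counts, problems) for a header-less 'label,text' CSV."""
    label_counts = Counter()
    problems = []
    for row_number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2:
            problems.append(f"row {row_number}: expected 2 columns, found {len(row)}")
            continue
        label, text = row
        if not label.strip() or not text.strip():
            problems.append(f"row {row_number}: empty label or text")
            continue
        label_counts[label] += 1
    # Flag classes that have too few examples to train on
    for label, count in sorted(label_counts.items()):
        if count < min_docs_per_class:
            problems.append(f"class {label!r}: only {count} documents")
    return label_counts, problems


# In-memory sample with made-up labels; the third row is deliberately malformed.
sample = io.StringIO('SPORTS,"The team won, again"\nHEALTH,"Flu season starts"\nSPORTS\n')
counts, problems = check_training_csv(sample, min_docs_per_class=1)
print(dict(counts))
print(problems)
```

For the lab file, pass `open('limited_dataset.csv', newline='')` instead of the in-memory sample.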

## Create Classifier
Create a custom document classifier by supplying its name, the S3 location of the training data, the data access role ARN, and the language. An output S3 location can also be supplied; this example omits it.

In [None]:
import boto3

# Instantiate the Comprehend client
client = boto3.client('comprehend', region_name='us-east-1')
classifier_name = '<enter name here>'

# Create a document classifier from the downsampled training file in S3
create_response = client.create_document_classifier(
    DocumentClassifierName=classifier_name,
    DataAccessRoleArn=role,
    InputDataConfig={
        'S3Uri': limited_dataset_path,
    },
    LanguageCode='en',
)

print(create_response)

Now let's check the status of the custom classifier. You can run the following cells multiple times to check the status if needed.

In [None]:
describe_response = client.describe_document_classifier(
    DocumentClassifierArn=create_response['DocumentClassifierArn'])
print("Describe response: \n",describe_response)
print()

# List all classifiers in account
list_response = client.list_document_classifiers()
print("List response: \n", list_response)

In [None]:
describe_response['DocumentClassifierProperties']['DocumentClassifierArn']
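
Instead of re-running the status cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch, not part of the original lab: the helper name `wait_for_classifier`, the polling interval, and the choice of which statuses count as terminal are assumptions. It only relies on `describe_document_classifier`, the same call used above, and takes the client as a parameter.

```python
import time

# Statuses treated as "finished" in this sketch (an assumption; adjust as needed).
TERMINAL_STATUSES = {"TRAINED", "TRAINED_WITH_WARNING", "IN_ERROR", "STOPPED"}


def wait_for_classifier(client, classifier_arn, poll_seconds=60,
                        max_polls=120, sleep=time.sleep):
    """Poll describe_document_classifier until training reaches a terminal status."""
    for _ in range(max_polls):
        response = client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn)
        status = response["DocumentClassifierProperties"]["Status"]
        print("Classifier status:", status)
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"classifier not finished after {max_polls} polls")
```

In this notebook it could be called as `wait_for_classifier(client, create_response['DocumentClassifierArn'])`. The `sleep` parameter exists only so the wait can be replaced in tests.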

# Predictions!

Once the custom classification model is trained, you can use it for batch or real-time predictions.

Create an endpoint for real-time model prediction.

In [None]:
# Create an endpoint
response = client.create_endpoint(
    EndpointName='my-custom-classification-endpoint2',
    ModelArn=describe_response['DocumentClassifierProperties']['DocumentClassifierArn'],
    DesiredInferenceUnits=1,
)

In [None]:
print(response)

In [1]:
response['EndpointArn']

In [None]:
txt = 'After my most recent doctors appointment, I came down with the flu'

In [None]:
# real-time
real_time_response = client.classify_document(
    Text=txt,
    EndpointArn=response['EndpointArn']
)
print(real_time_response['Classes'])
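
The `Classes` value printed above is a list of entries with a `Name` and a `Score`. A short helper can pick out the most likely label. This is a sketch under assumptions: `top_class` is a hypothetical helper name, and the example labels and scores are made up rather than taken from a real response.

```python
def top_class(classes):
    """Return (name, score) of the highest-scoring entry in a Classes list."""
    if not classes:
        raise ValueError("no classes returned")
    best = max(classes, key=lambda item: item["Score"])
    return best["Name"], best["Score"]


# Made-up data shaped like real_time_response['Classes']
example_classes = [
    {"Name": "HEALTH", "Score": 0.91},
    {"Name": "SPORTS", "Score": 0.06},
    {"Name": "POLITICS", "Score": 0.03},
]
name, score = top_class(example_classes)
print(f"{name} ({score:.2f})")
```

With a live endpoint, the call would be `top_class(real_time_response['Classes'])`.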

Next, let's try a batch async prediction

In [None]:
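# batch
# Output location in your own bucket (the 'output' suffix is an arbitrary choice)
s3_output_bucket = 's3://' + bucket_name + '/' + prefix + '/output/'

start_response = client.start_document_classification_job(
    InputDataConfig={
        'S3Uri': testing_data,
        # Assumes the test CSV holds one document per line
        'InputFormat': 'ONE_DOC_PER_LINE',
    },
    OutputDataConfig={
        'S3Uri': s3_output_bucket
    },
    DataAccessRoleArn=role,
    DocumentClassifierArn=describe_response['DocumentClassifierProperties']['DocumentClassifierArn']
)

print("Start response:\n", start_response)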


In [None]:
# Check the status of the job (a separate name avoids overwriting the classifier's describe_response)
job_describe_response = client.describe_document_classification_job(JobId=start_response['JobId'])
print("Describe response:\n", job_describe_response)

# List all classification jobs in account
job_list_response = client.list_document_classification_jobs()
print("List response:\n", job_list_response)
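
Once the batch job completes, its results land under the output S3 location. The sketch below summarizes such results, assuming they have been downloaded and extracted to JSON Lines text in which each record carries a `Classes` list of `Name`/`Score` entries. That record layout, the helper name `tally_predictions`, and the sample records are assumptions for illustration, so verify them against your actual job output.

```python
import json
from collections import Counter


def tally_predictions(jsonl_lines):
    """Count how often each class is the top-scoring prediction."""
    tally = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        classes = record.get("Classes", [])
        if classes:
            best = max(classes, key=lambda item: item["Score"])
            tally[best["Name"]] += 1
    return tally


# Made-up records in the assumed layout
sample_lines = [
    '{"Line": "0", "Classes": [{"Name": "HEALTH", "Score": 0.8}, {"Name": "SPORTS", "Score": 0.2}]}',
    '{"Line": "1", "Classes": [{"Name": "SPORTS", "Score": 0.7}, {"Name": "HEALTH", "Score": 0.3}]}',
    '{"Line": "2", "Classes": [{"Name": "HEALTH", "Score": 0.6}, {"Name": "SPORTS", "Score": 0.4}]}',
]
summary = tally_predictions(sample_lines)
print(summary.most_common())
```

Comparing this tally with the label distribution of the test set gives a quick first impression of whether the classifier is skewed toward a few classes.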