## Multi-label Content Moderation with HuggingFace and PyTorch on Amazon SageMaker

In [None]:
# make sure the Amazon SageMaker SDK is updated
!pip install "sagemaker" --upgrade

In [36]:
# import a few libraries that will be needed
import sagemaker
from sagemaker.huggingface import HuggingFace
import boto3
import pandas as pd
import os, time, tarfile

In [42]:
# gets role for executing training job and set a few variables
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = "hf-content-mod"
role = sagemaker.get_execution_role()
region = boto3.Session().region_name

The HF's Content Moderation Classification Dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains near equal distributed test samples. The total number of training samples is 75 for training and demo purposes only. 

In [45]:
# download and extract our custom dataset
!wget -nc https://escalona-robocar.s3.amazonaws.com/content_moderation_csv.tgz
tf = tarfile.open('content_moderation_csv.tgz')
tf.extractall()
!rm -fr content_moderation_csv.tgz

--2022-12-06 01:22:59--  https://escalona-robocar.s3.amazonaws.com/content_moderation_csv.tgz
Resolving escalona-robocar.s3.amazonaws.com (escalona-robocar.s3.amazonaws.com)... 52.216.21.83, 52.217.97.12, 52.216.111.35, ...
Connecting to escalona-robocar.s3.amazonaws.com (escalona-robocar.s3.amazonaws.com)|52.216.21.83|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10258 (10K) [application/gzip]
Saving to: ‘content_moderation_csv.tgz’


2022-12-06 01:22:59 (182 MB/s) - ‘content_moderation_csv.tgz’ saved [10258/10258]



In [46]:
# read training data and add a header
train = pd.read_csv('./content_moderation_csv/train.csv')
train.columns = ['label', 'title', 'description']

# read testing data and add a header
test = pd.read_csv('./content_moderation_csv/test.csv')
test.columns = ['label', 'title', 'description']

# write the files with header
train.to_csv("content_moderation_csv/hf-train.csv", index=False)
test.to_csv("content_moderation_csv/hf-test.csv", index=False)

In [47]:
# take a look at the training data
train

Unnamed: 0,label,title,description
0,1,Scorched Earth,Ridiculous behavior that cannot be tolerated a...
1,1,Read a book,Wow... your worldview is apparently informed b...
2,2,spreading Joy,Lets make some fact based statements. Pleae sp...
3,2,Healthy debate,"How about introducing Vukovich to you, Bob1946..."
4,4,Low IQ,The president makes himself an easy target bec...
5,2,Thanks for your response,That is point I'm trying to make. We seem to b...
6,1,Sister of man who died in Vancouver police cus...,My flight was subcontracted to another carrier...
7,1,Removal,this is *&^%ing outrageous. The prosecutor sho...
8,3,Night Vision Goggles,You better not send me on a witch hunt or else...
9,4,The stupid,The profoundly stupid have spoken.


In [48]:
# upload training and testing data to Amazon S3
inputs_train = sagemaker_session.upload_data("content_moderation_csv/hf-train.csv", bucket=bucket, key_prefix='{}/train'.format(prefix))
inputs_test = sagemaker_session.upload_data("content_moderation_csv/hf-test.csv", bucket=bucket, key_prefix='{}/test'.format(prefix))
print(inputs_train)
print(inputs_test)

s3://sagemaker-us-east-1-175748383800/hf-content-mod/train/hf-train.csv
s3://sagemaker-us-east-1-175748383800/hf-content-mod/test/hf-test.csv


In [49]:
# keep in mind the classes used in this dataset
classes = pd.read_csv('./content_moderation_csv/classes.txt', header=None)
classes.columns = ['label']
classes

Unnamed: 0,label
0,Toxic
1,Benign
2,Threat
3,Insult


----

## BERT large uncased
https://huggingface.co/bert-large-uncased
#### Fine-tuning

In [50]:
hyperparameters = {
	'model_name_or_path':'bert-large-uncased',
	'output_dir':'/opt/ml/model',
    'train_file':'/opt/ml/input/data/train/hf-train.csv',
    'validation_file':'/opt/ml/input/data/test/hf-test.csv',
    'do_train':True,
    'do_eval':True,
    'num_train_epochs': 1,
    'save_total_limit': 1,
	# add your remaining hyperparameters
	# more info here https://github.com/huggingface/transformers/tree/v4.10.0/examples/pytorch/text-classification
}

In [51]:
# git configuration to download our fine-tuning script
git_config = {'repo': 'https://github.com/huggingface/transformers.git','branch': 'v4.6.1'}

In [52]:
# creates Hugging Face estimator
huggingface_estimator_bert = HuggingFace(
	entry_point='run_glue.py', # note we are pointing to the processing script in HF repo
	source_dir='./examples/pytorch/text-classification',
	instance_type='ml.p3.16xlarge',
	instance_count=1, #note that this training uses just a single EC2 instance
	role=role,
	git_config=git_config,
	transformers_version='4.6.1',
	pytorch_version='1.7.1',
	py_version='py36',
	hyperparameters = hyperparameters,
    disable_profiler=True
)

In [None]:
training_path='s3://{}/{}/train'.format(bucket, prefix)
testing_path='s3://{}/{}/test'.format(bucket, prefix)
# starting the train job
huggingface_estimator_bert.fit({"train": training_path, "test": testing_path}, wait=False)

In [55]:
# check the status of the training job
client = boto3.client("sagemaker")
describe_response = client.describe_training_job(TrainingJobName=huggingface_estimator_bert.latest_training_job.name)

print ('Time - JobStatus - SecondaryStatus')
print('------------------------------')
print (time.strftime("%H:%M", time.localtime()), '-', describe_response['TrainingJobStatus'] + " - " + describe_response['SecondaryStatus'])

# uncomment this for monitoring the job status...
#job_run_status = describe_response['TrainingJobStatus']
#while job_run_status not in ('Failed', 'Completed', 'Stopped'):
#    describe_response = client.describe_training_job(TrainingJobName=huggingface_estimator_bert.latest_training_job.name)
#    job_run_status = describe_response['TrainingJobStatus']
#    print (time.strftime("%H:%M", time.localtime()), '-', describe_response['TrainingJobStatus'] + " - " + describe_response['SecondaryStatus'])
#    sleep(30)

Time - JobStatus - SecondaryStatus
------------------------------
01:35 - Completed - Completed


**Important:** Make sure the training job is completed before running the "Inference" section below.

You can verify this by running the previous cell and getting JobStatus = "Completed".

#### Inference

In [56]:
from sagemaker.huggingface.model import HuggingFaceModel

# create Hugging Face Model Class
huggingface_model = sagemaker.huggingface.HuggingFaceModel(
env={ 'HF_TASK':'text-classification' },
model_data=huggingface_estimator_bert.model_data,
role=role,
transformers_version="4.6.1",
pytorch_version="1.7.1",
py_version='py36',
)

In [57]:
# create SageMaker Endpoint with the HF model
predictor = huggingface_model.deploy(
initial_instance_count=1,
instance_type="ml.g4dn.xlarge"
)

-------!

In [67]:
# example request (you always need to define "inputs"). You can try with your own content text here...
data = {
   #"inputs": "Hugging Face now has a Deep Reinforcement Learning course!"
   "inputs": "You are not the sharpest tool in the shed are you?"
}

response = predictor.predict(data)
print(response, classes['label'][int(response[0]['label'][-1:])])

[{'label': 'LABEL_3', 'score': 0.4378846287727356}] Insult


In [62]:
# let us run a quick performance test
sum_BERT=0
for i in range(1, 1000):
    a_time = float(time.time())
    result_BERT = predictor.predict(data)
    b_time = float(time.time())
    sum_BERT = sum_BERT + (b_time - a_time)
    #print(b_time - a_time)
avg_BERT = sum_BERT/1000
print('BERT average inference time: {:.3f}'.format(avg_BERT), 'secs,')

BERT average inference time: 0.038 secs,


-----

#### Clean-up

In [70]:
# uncomment the below for cleaning-up your content moderation endpoint
predictor.delete_endpoint()


#### Exteding your next training job to use a distributed training

In [68]:
#The Hugging Face Trainer supports SageMaker’s data parallelism library. If your training script uses the Trainer API, 
#you only need to define the distribution parameter in the Hugging Face Estimator. Below is the configuration for running 
#training on smdistributed Data Parallel
distribution = {'smdistributed':{'dataparallel':{ 'enabled': True }}}

# creates the next Hugging Face estimator using a distributed setup
#When you are ready to scale the number of instances, you can do this with SageMaker Python SDK estimator function by setting your instance_count.
huggingface_estimator_bert = HuggingFace(
	entry_point='run_glue.py', # note we are pointing to the processing script in HF repo
	source_dir='./examples/pytorch/text-classification',
	instance_type='ml.p3.16xlarge',
	instance_count=2, #Instead of the eight GPUs on a single p3.16xlarge, you now have 16 GPUs across two identical instances.
	role=role,
	git_config=git_config,
	transformers_version='4.6.1',
	pytorch_version='1.7.1',
	py_version='py36',
	hyperparameters = hyperparameters,
    disable_profiler=True
)


In [None]:
training_path='s3://{}/{}/train'.format(bucket, prefix)
testing_path='s3://{}/{}/test'.format(bucket, prefix)
# starting the train job
huggingface_estimator_bert.fit({"train": training_path, "test": testing_path}, wait=False)

In [71]:
# check the status of the training job
client = boto3.client("sagemaker")
describe_response = client.describe_training_job(TrainingJobName=huggingface_estimator_bert.latest_training_job.name)

print ('Time - JobStatus - SecondaryStatus')
print('------------------------------')
print (time.strftime("%H:%M", time.localtime()), '-', describe_response['TrainingJobStatus'] + " - " + describe_response['SecondaryStatus'])

# uncomment this for monitoring the job status...
#job_run_status = describe_response['TrainingJobStatus']
#while job_run_status not in ('Failed', 'Completed', 'Stopped'):
#    describe_response = client.describe_training_job(TrainingJobName=huggingface_estimator_bert.latest_training_job.name)
#    job_run_status = describe_response['TrainingJobStatus']
#    print (time.strftime("%H:%M", time.localtime()), '-', describe_response['TrainingJobStatus'] + " - " + describe_response['SecondaryStatus'])
#    sleep(30)

Time - JobStatus - SecondaryStatus
------------------------------
02:47 - Completed - Completed
