# Auto-Generating Tags for Content using Amazon SageMaker BlazingText with fastText

Multi-label text classification is one of the fundamental tasks in Natural Language Processing (NLP) applications. In this demo, we will build a fastText model that predicts the tags of programming questions and then deploy the model on SageMaker Hosting Services.

## Set up the Kaggle CLI and download the dataset
In this demo, we will use the [10% of Stack Overflow Q&A dataset](https://www.kaggle.com/stackoverflow/stacksample). The dataset includes:
- Questions contains the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
- Answers contains the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
- Tags contains the tags on each of these questions.

We only need Questions and Tags to train our model.
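
As a rough sketch of how the two tables fit together, the snippet below groups tags by question Id using only the standard library. The inline rows are the first rows of Tags.csv; `group_tags` and `tags_by_question` are illustrative names, not part of the dataset or the Kaggle tooling.

```python
import csv
import io
from collections import defaultdict

# First rows of Tags.csv (Id,Tag), inlined so the sketch is self-contained.
TAGS_CSV = """Id,Tag
80,flex
80,actionscript-3
80,air
90,svn
90,tortoisesvn
90,branch
90,branching-and-merging
120,sql
120,asp.net
"""

def group_tags(csv_text):
    """Map each question Id to the list of its tags, in file order."""
    tags_by_question = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        tags_by_question[row["Id"]].append(row["Tag"])
    return dict(tags_by_question)

tags_by_question = group_tags(TAGS_CSV)
print(tags_by_question["80"])
```

Each question can then be paired with its tag list by looking up its Id, which is the shape the training data needs for multi-label classification.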

In [140]:
!mkdir /home/ec2-user/.kaggle

In [141]:
!mv .kaggle/kaggle.json /home/ec2-user/.kaggle/

In [142]:
!chmod 600 /home/ec2-user/.kaggle/kaggle.json

In [143]:
!kaggle datasets download stackoverflow/stacksample

In [144]:
!unzip -l stacksample.zip

In [145]:
!unzip -j stacksample.zip Questions.csv -d stacksample/

In [146]:
!unzip -j stacksample.zip Tags.csv -d stacksample/

In [147]:
!head stacksample/Questions.csv -n 2
!head stacksample/Tags.csv

Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
80,26,2008-08-01T13:57:07Z,NA,26,SQLStatement.execute() - multiple queries in one statement,"<p>I've written a database generation script in <a href=""http://en.wikipedia.org/wiki/SQL"">SQL</a> and want to execute it in my <a href=""http://en.wikipedia.org/wiki/Adobe_Integrated_Runtime"">Adobe AIR</a> application:</p>
Id,Tag
80,flex
80,actionscript-3
80,air
90,svn
90,tortoisesvn
90,branch
90,branching-and-merging
120,sql
120,asp.net


## Pre-process and Clean Data

In [564]:
import string
import re

def clean_text(text):
    """Strip HTML, links, mentions/hashtags and punctuation from a string."""
    if not isinstance(text, str):
        return text

    def cleanhtml(raw_html):
        # Drop any remaining HTML tags but keep the text between them.
        return re.sub(r'<[^>]+>', '', raw_html)

    def replace_link(match):
        # Drop anchors whose text is a bare URL; otherwise keep the anchor text.
        return '' if re.match(r'[a-z]+://', match.group(1)) else match.group(1)

    def removeContractions(raw_text):
        CONTRACTIONS = {"mayn't":"may not", "may've":"may have","isn't":"is not","wasn't":"was not","'ll":" will","'have": "have"}
        raw_text = raw_text.replace("’","'")
        words = raw_text.split()
        reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words]
        return " ".join(reformed)

    # Tag-aware substitutions must run before cleanhtml(), which removes the tags they match.
    text = re.sub(r'<pre><code>.*?</code></pre>', '', text, flags=re.DOTALL)
    text = re.sub(r'<a[^>]+>(.*?)</a>', replace_link, text)
    text = cleanhtml(text)
    #text = removeContractions(text)
    # Remove @mentions and #hashtags
    text = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", text).split())
    # Remove punctuation
    text = ' '.join(re.sub(r"[.,(){}\[\]`'!?:;=-]", " ", text).split())
    #text = text.translate(str.maketrans('','',string.punctuation))
    #text = text.lower()
    return text

In [556]:
import numpy as np 
import pandas as pd

QuestionsFile = "stacksample/Questions.csv"
chunksize = 20000

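chunks = []
# on_bad_lines='skip' needs pandas >= 1.3; older versions use error_bad_lines=False instead.
# dtype=str keeps ids as strings in every chunk (the header row is read as data here).
for chunk in pd.read_csv(QuestionsFile, names=['id', 'title', 'body'], encoding='ISO-8859-1', header=None, usecols=[0, 5, 6], dtype=str, on_bad_lines='skip', chunksize=chunksize):
    chunks.append(chunk)
# DataFrame.append returns a new frame rather than modifying in place,
# so collect the chunks and concatenate them once.
df = pd.concat(chunks, ignore_index=True)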

df.head()

In [557]:
df = df[1:]
df.head()

In [560]:
df = df.sample(frac=.5, replace=False)
df.head()

In [123]:
TagsFile = "stacksample/Tags.csv"
chunksize = 20000
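tag_chunks = []
# dtype=str keeps ids as strings in every chunk (the header row is read as data here).
for chunk in pd.read_csv(TagsFile, names=['id', 'tag'], header=None, dtype=str, chunksize=chunksize):
    tag_chunks.append(chunk)
# DataFrame.append returns a new frame rather than modifying in place,
# so collect the chunks and concatenate them once.
df_tags = pd.concat(tag_chunks, ignore_index=True)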

df_tags.head()

In [124]:
df_tags = df_tags[1:]
df_tags.head()

In [568]:
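questions = df.values
# Group tags by question id once, instead of scanning every tag row per question.
tags_by_id = df_tags.groupby('id')['tag'].apply(list).to_dict()
all_rows = []

for row in questions:
    title = clean_text(row[1])
    tag_ids = tags_by_id.get(row[0], [])
    # Skip questions with no tags or a missing (non-string) title.
    if isinstance(title, str) and len(tag_ids) > 0:
        all_rows.append({"title": title, "tags": tag_ids})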

In [570]:
import csv
import multiprocessing
from multiprocessing import Pool

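def preprocess(rows, output_file):
    """Convert rows to fastText lines in parallel and write them to output_file."""
    # transform_instance is defined in a later cell; run that cell before calling preprocess.
    with Pool(processes=multiprocessing.cpu_count()) as pool:
        transformed_rows = pool.map(transform_instance, rows)
    with open(output_file, "w") as txt_file:
        for line in transformed_rows:
            txt_file.write(" ".join(line) + "\n")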

The input file must be formatted so that each line contains a single sentence preceded by its label(s), each prefixed with \_\_label__, e.g. \_\_label__database \_\_label__oracle How to edit sessions parameters on Oracle 10g XE
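
A minimal sketch of that line format, assuming whitespace splitting as a stand-in for a real tokenizer; `to_fasttext_line` is a hypothetical helper name used only for illustration.

```python
def to_fasttext_line(tags, tokens):
    """Build one training line: labels first, then the tokenized sentence."""
    labels = ["__label__" + str(tag) for tag in tags if tag]
    return " ".join(labels + list(tokens))

line = to_fasttext_line(
    ["database", "oracle"],
    "How to edit sessions parameters on Oracle 10g XE".split(),
)
print(line)
```

Every label carries its own prefix, so a question with several tags simply gets several prefixed tokens at the start of its line.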

In [571]:
import nltk
nltk.download('punkt')

def transform_instance(row):
    cur_row = []
    label = ["__label__" + str(tag) for tag in row["tags"] if tag]
    label = " ".join(map(str, label))
    cur_row.append(str(label))
    cur_row.extend(nltk.word_tokenize(row["title"]))
    return cur_row

In [572]:
preprocess(all_rows[:1200], 'stackoverflow.train')    
preprocess(all_rows[1200:], 'stackoverflow.validation')

## Install fastText

In [None]:
!wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
!unzip v0.2.0.zip
!cd fastText-0.2.0 && make

## Train the model

In [604]:
!cd fastText-0.2.0 && ./fasttext supervised -input "../stackoverflow.train" -output stack_model -lr 0.5 -epoch 25 -minCount 5 -wordNgrams 2 -loss ova
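
The `-loss ova` option selects fastText's one-vs-all loss, which trains an independent binary classifier per label so that several tags can score highly for the same question. The sketch below contrasts that with softmax using made-up raw scores; the tag names and numbers are purely illustrative, and this is not fastText's internal code.

```python
import math

def softmax(scores):
    """Normalize scores into probabilities that sum to 1 across labels."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Independent per-label probability, as in a one-vs-all setup."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw scores for one question (not produced by a real model).
scores = {"jquery": 2.0, "javascript": 1.5, "svn": -3.0}

softmax_probs = dict(zip(scores, softmax(list(scores.values()))))
ova_probs = {tag: sigmoid(s) for tag, s in scores.items()}

print(softmax_probs)
print(ova_probs)
```

With softmax the labels compete for a single unit of probability, while the one-vs-all scores can each exceed 0.5, which is the behaviour a multi-label tagger needs.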

We can evaluate the classifier on the validation set, which reports precision and recall at 1, by running the command:

In [605]:
!cd fastText-0.2.0 && ./fasttext test stack_model.bin "../stackoverflow.validation"
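
The `fasttext test` command reports precision and recall at k (k = 1 by default). As a rough illustration of what those two numbers mean for multi-label data, here is a sketch under the assumption that predictions arrive as ranked tag lists; `precision_recall_at_k` is a hypothetical helper and the predicted tags are invented for the example.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """Micro-averaged precision@k and recall@k over all examples."""
    correct = n_predicted = n_gold = 0
    for gold_tags, ranked in zip(gold, predicted):
        top_k = ranked[:k]
        correct += len(set(top_k) & set(gold_tags))
        n_predicted += len(top_k)
        n_gold += len(gold_tags)
    return correct / n_predicted, correct / n_gold

# Gold tags for two questions, and invented ranked predictions for them.
gold = [["flex", "actionscript-3", "air"], ["svn", "tortoisesvn"]]
predicted = [["flex", "air", "sql"], ["sql", "svn", "branch"]]

p1, r1 = precision_recall_at_k(gold, predicted, k=1)
p3, r3 = precision_recall_at_k(gold, predicted, k=3)
print(p1, r1, p3, r3)
```

Recall at 1 is capped well below 1.0 whenever questions carry several tags, so a low value there does not by itself mean the model is poor.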

In [None]:
import sagemaker
from sagemaker import get_execution_role
import json
import boto3

sess = sagemaker.Session()

role = get_execution_role()
bucket = sess.default_bucket() 
prefix = 'blazingtext/stackoverflow' 

s3_output_location = 's3://{}/{}/output'.format(bucket, prefix)

In [606]:
!tar -czvf model.tar.gz fastText-0.2.0/stack_model.bin
model_location = sess.upload_data("model.tar.gz", bucket=bucket, key_prefix=prefix)
print(model_location)

## Hosting

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This will allow us to make predictions (or inference) from the model.

In [None]:
region_name = boto3.Session().region_name
container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

In [607]:
%%time
stackoverflow = sagemaker.Model(model_data=model_location, image=container, role=role, sagemaker_session=sess)
stackoverflow.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
predictor = sagemaker.RealTimePredictor(endpoint=stackoverflow.endpoint_name, 
                                   sagemaker_session=sess,
                                   serializer=json.dumps,
                                   deserializer=sagemaker.predictor.json_deserializer)

## Inference
Now that the trained model is deployed at an endpoint that is up-and-running, we can use this endpoint for inference. 

In [689]:
sentence = "How can I refresh a page with jQuery"

payload = {"instances" : [sentence],"configuration": {"k":3}}
predictions = predictor.predict(payload)

In [691]:
prefix = "__label__"
for output in predictions:
    for label, prob in zip(output["label"], output["prob"]):
        # Strip the "__label__" prefix to recover the tag name.
        label_title = label[len(prefix):]
        print(f"{label_title}, {float(prob)}")
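
Because each response entry pairs a list of labels with a list of probabilities, it is easy to keep only the confident tags. The sketch below assumes that response shape and runs on a mock response with invented probabilities, so it needs no live endpoint; `top_tags` and `min_prob` are hypothetical names.

```python
def top_tags(response, min_prob=0.0):
    """Return (tag, probability) pairs per input, dropping low-probability tags."""
    prefix = "__label__"
    results = []
    for output in response:
        pairs = []
        for label, prob in zip(output["label"], output["prob"]):
            # Strip the "__label__" prefix if present to recover the tag name.
            tag = label[len(prefix):] if label.startswith(prefix) else label
            if float(prob) >= min_prob:
                pairs.append((tag, float(prob)))
        results.append(pairs)
    return results

# Mock response in the same shape as the endpoint output; values are invented.
mock_response = [
    {
        "label": ["__label__jquery", "__label__javascript", "__label__html"],
        "prob": [0.9, 0.6, 0.1],
    }
]

print(top_tags(mock_response, min_prob=0.5))
```

A threshold like this is one way to decide how many of the k returned tags to attach to a piece of content; the right cut-off depends on the model and should be chosen on validation data.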