# AWS Marketplace Product Usage Demonstration - 7Park Data Stopword Algorithm

## Introduction

The performance of many NLP algorithms improves greatly if “stopwords” are removed first. These are words that are frequent in the dataset but don’t help to discriminate between documents. Information retrieval, in particular, can be quite sensitive to stopwords. Note that not all frequent words are low-information.
 
Most NLP applications that remove stopwords do so using a list of classical words that have been collected over the years. While this suffices for some applications, it can fall short for many domains, which should have their own list of “low-information” words. For example, does the word “natural” help in distinguishing between sentences in Darwin’s The Origin of Species (included in the package)? Given that the book is about the natural world, probably not. However, classical stopword lists will not contain “natural”. This program picks out “natural” as a stopword, along with “life”, “nature”, “different”, etc.
 
It does not, however, pick out “selection”; reading through the text, it is apparent that Darwin’s use of “selection” is not as low-information as his use of “natural”. For example, he speaks of “accumulative selection”, “the closest selection”, and so on. So “natural” is picked out as a stopword, while “selection” is not, even though “selection” is more frequent than “natural”.
 
The algorithm is based on a 2005 paper by Lo et al., but has several proprietary improvements (e.g. it is deterministic, unlike theirs, which relies on sampling, and you only need to set one parameter). Darwin’s book is provided courtesy of Project Gutenberg.
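
To make the idea of domain-specific low-information words concrete, here is a minimal sketch that ranks words by document frequency, that is, the fraction of sentences containing each word. This is not the 7Park algorithm (which is proprietary) and not the method of Lo et al.; it is a naive baseline. The function name `document_frequency` and the toy sentences are invented for illustration.

```python
from collections import Counter
import re


def document_frequency(sentences):
    """Return {word: fraction of sentences containing the word}."""
    counts = Counter()
    for sentence in sentences:
        # Count each word once per sentence, case-insensitively.
        counts.update(set(re.findall(r"[a-z']+", sentence.lower())))
    total = len(sentences)
    return {word: n / total for word, n in counts.items()}


# Toy corpus, invented for this example (not taken from the book).
sentences = [
    "natural selection acts on natural variation",
    "the natural world shows slow change",
    "accumulative selection differs from natural change",
    "species vary in the natural state",
]

df = document_frequency(sentences)
# Highest document frequency first; ties broken alphabetically.
ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
print(ranked[:3])
```

In this toy corpus "natural" appears in every sentence, so it ranks first even though no classical stopword list contains it. A pure frequency ranking like this cannot tell a frequent but informative word from a frequent low-information one, which is the distinction the introduction draws between "selection" and "natural".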

## Pre-requisites

This sample notebook requires a subscription to the following listing from AWS Marketplace:

**[Stopword Algorithm](https://aws.amazon.com/marketplace/pp/prodview-64zsbbhzwijeo)**
    
If your AWS account is not subscribed to this listing, follow these steps:

1. Open the listing from AWS Marketplace
1. Read the **Highlights** section and then **product overview** section of the listing.
1. View **usage information** and then **additional resources.**
1. Note the supported instance types.
1. Next, click on **Continue to subscribe.**
1. Review **End user license agreement, support terms**, as well as **pricing information.**
1. Click the **Accept Offer** button if your organization agrees with the EULA, pricing information, and support terms.

**Notes:**

If the **Continue to configuration** button is active, your account already has a subscription to this listing.
Once you click the **Continue to configuration** button and choose a region, a Product ARN appears. This is the ARN that you need to specify when creating a deployable model. For this notebook, however, the algorithm ARN is already specified in the **src/algorithm_arns.py** file, so you do not need to specify it explicitly.

## Stopword Algorithm Usage with SageMaker Estimator
First, import the SageMaker package, get the execution role, and create a session.

In [2]:
import sagemaker


role = sagemaker.get_execution_role()
sess = sagemaker.Session()

Next, specify the hyperparameters of the Stopword algorithm.
#### Hyperparameters
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Type</strong></td>
        <td><strong>Default value</strong></td>
        <td><strong>Range</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>number_of_hits</td>
        <td>int</td>
        <td>100</td>
        <td>1-100</td>
        <td>The total number of output stopwords</td>
    </tr>
    <tr>
        <td>strategy</td>
        <td>Categorical</td>
        <td>"medium"</td>
        <td>'conservative' or 'medium' or 'aggressive'</td>
        <td>The strategy used to calculate the stopwords</td>
    </tr>
    <tr>
        <td>blacklist</td>
        <td>FreeText</td>
        <td></td>
        <td></td>
        <td>A list of strings that must not be returned as stopwords, separated by ';'. Example: 'inc;co;llc'</td>
    </tr>
</table>

Example of hyperparameters dictionary:

In [3]:
stopword_params = {
  "number_of_hits": 50,
  "strategy": "medium",
  "blacklist": "what;it;my"
}
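
Because a training job that fails on a bad hyperparameter still costs time, it can help to check the dictionary locally first. The sketch below validates values against the ranges and defaults listed in the table above. The helper name `validate_stopword_params` is hypothetical and not part of the SageMaker SDK or the algorithm package.

```python
ALLOWED_STRATEGIES = {"conservative", "medium", "aggressive"}


def validate_stopword_params(params):
    """Raise ValueError if params fall outside the documented ranges.

    Returns the blacklist parsed into a list of words.
    """
    hits = params.get("number_of_hits", 100)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(hits, bool) or not isinstance(hits, int) or not 1 <= hits <= 100:
        raise ValueError("number_of_hits must be an int between 1 and 100")

    strategy = params.get("strategy", "medium")
    if strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"strategy must be one of {sorted(ALLOWED_STRATEGIES)}")

    # The blacklist is a single ';'-separated string; split it into words.
    blacklist = params.get("blacklist", "")
    return [word for word in blacklist.split(";") if word]


checked = validate_stopword_params(
    {"number_of_hits": 50, "strategy": "medium", "blacklist": "what;it;my"}
)
print(checked)
```

The check only mirrors the table; the algorithm container may apply further rules of its own.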

Then create a SageMaker Estimator instance with the following parameters:
<table style="border: 1px solid black;">
    <tr>
        <td><strong>Parameter name</strong></td>
        <td><strong>Description</strong></td>
    </tr>
    <tr>
        <td>algorithm_arn</td>
        <td>Algorithm ARN used for training</td>
    </tr>
    <tr>
        <td>role</td>
        <td>An AWS IAM role. The SageMaker training jobs and APIs that create SageMaker endpoints use this role to access training data and models</td>
    </tr>
    <tr>
        <td>base_job_name</td>
        <td>Prefix for training job name when the fit() method launches</td>
    </tr>
    <tr>
        <td>train_instance_count</td>
        <td>Number of Amazon EC2 instances to use for training. Must be 1, because this is not a distributed version of the algorithm</td>
    </tr>
    <tr>
        <td>train_instance_type</td>
        <td>Type of EC2 instance to use for training. See the available types on the algorithm's AWS Marketplace page</td>
    </tr>
    <tr>
        <td>input_mode</td>
        <td>The input mode that the algorithm supports. May be "File" or "Pipe"</td>
    </tr>
    <tr>
        <td>output_path</td>
        <td>S3 location for saving the training result (model artifacts and output files)</td>
    </tr>
    <tr>
        <td>sagemaker_session</td>
        <td>Session object which manages interactions with Amazon SageMaker APIs and any other AWS services needed</td>
    </tr>
    <tr>
        <td>hyperparameters</td>
        <td>Dictionary containing the hyperparameters to initialize this estimator with</td>
    </tr>
</table>
Full SageMaker Estimator documentation: https://sagemaker.readthedocs.io/en/stable/estimators.html

Full SageMaker Algorithm Estimator documentation: https://sagemaker.readthedocs.io/en/stable/algorithm.html


In [4]:
from src.algorithm_arns import AlgorithmArnProvider


stopword_arn = AlgorithmArnProvider.get_algorithm_arn(sess.boto_region_name) # Get the algorithm_arn

stopword_algorithm = sagemaker.algorithm.AlgorithmEstimator(
    algorithm_arn=stopword_arn,
    role=role,
    base_job_name="stopword-algorithm",
    train_instance_count=1,
    train_instance_type='ml.m5.xlarge',
    input_mode="File",
    output_path="s3://<bucket-name>/<output-path>",
    sagemaker_session=sess,
    hyperparameters=stopword_params
)

### Training stage
During training, the Stopword algorithm consumes input data from an S3 location.
This container supports only single-column .tsv ("tab-separated values") files.
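
As a rough illustration of the input format, the sketch below writes a list of documents to a single-column .tsv file that could then be uploaded to S3. It assumes one document per line, UTF-8 encoding, and no header row; the reference file in data/train_data_example is the authority on the exact layout. The helper name `write_single_column_tsv` and the sample documents are hypothetical.

```python
import os
import tempfile


def write_single_column_tsv(documents, path):
    """Write one document per line so the file has exactly one column."""
    with open(path, "w", encoding="utf-8") as handle:
        for text in documents:
            # Collapse tabs and newlines inside a document; either one would
            # introduce an extra column or an extra row.
            handle.write(" ".join(text.split()) + "\n")


documents = ["Natural selection\tacts slowly.", "Life is\nbeautiful."]
path = os.path.join(tempfile.mkdtemp(), "train.tsv")
write_single_column_tsv(documents, path)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```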

In [None]:
stopword_algorithm.fit({"train": "s3://<bucket-name>/<training-data-path>"}) # Example of training data in data/train_data_example

### Inspect the stopword result
After training, you get a model file that contains the resulting stopword list.

In [None]:
# download the result model
! aws s3 cp s3://<bucket-name>/<output-path>/model.tar.gz .

In [9]:
import pandas as pd
import tarfile

with tarfile.open("model.tar.gz", "r:*") as tar:
    stopword_path = tar.getnames()[0]
    stopword_df = pd.read_csv(tar.extractfile(stopword_path), names=['stopword', 'score'], sep="\t")

stopword_df

Unnamed: 0,stopword,score
0,different,0.009377
1,life,0.01143
2,between,0.012078
3,natural,0.012219
4,plants,0.01276
5,its,0.017336
6,most,0.017826
7,may,0.025719
8,will,0.025883
9,each,0.02645
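
The cell above assumes that the first name in the archive is the stopword file. If the archive happens to list a directory entry first, `extractfile` returns `None`. The following sketch shows one way to select the first regular file instead; it uses the standard `csv` module rather than pandas and builds a small stand-in archive so that it runs without S3 access. The member name `stopwords.tsv` and the helper `read_stopwords` are assumptions for illustration, not the actual contents of `model.tar.gz`.

```python
import csv
import io
import tarfile


def read_stopwords(archive):
    """Return (stopword, score) pairs from the first regular file in a tar archive."""
    with tarfile.open(fileobj=archive, mode="r:*") as tar:
        # Skip directory entries, which extractfile() cannot read.
        member = next(m for m in tar.getmembers() if m.isfile())
        text = tar.extractfile(member).read().decode("utf-8")
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return [(word, float(score)) for word, score in rows]


# Build a tiny stand-in archive in memory (two rows from the sample output).
payload = b"different\t0.009377\nlife\t0.01143\n"
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="stopwords.tsv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buffer.seek(0)

stopwords = read_stopwords(buffer)
print(stopwords)
```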


### Real-time cleaning
First, deploy a SageMaker endpoint that consumes data.

In [10]:
import json

predictor = stopword_algorithm.deploy(
    initial_instance_count=1, 
    instance_type="ml.m5.xlarge",
    serializer = json.dumps,
    content_type = "application/jsonlines",
    accept = "application/jsonlines",
    deserializer = sagemaker.predictor.json_deserializer
)


Next, pass a dictionary with 'data' as the key to the endpoint to get predictions.

In this example we are passing a sentence.

In [12]:
predict_data = {'data': "life is beautiful"}
predict_result = predictor.predict(
    predict_data,
    {'ContentType': 'application/jsonlines', 'Accept': 'application/jsonlines'}
)

print(predict_result)

{'source': {'data': 'life is beautiful'}, 'stopwords': ['life']}
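
The response pairs the original input with the stopwords found in it, so cleaning the text is a small post-processing step on the client side. The sketch below removes the reported stopwords from the source sentence. It assumes whitespace tokenization and case-insensitive matching, which may differ from how the endpoint tokenizes internally; `strip_stopwords` is a hypothetical helper.

```python
def strip_stopwords(response):
    """Return the source text with the reported stopwords removed."""
    text = response["source"]["data"]
    stopwords = {word.lower() for word in response["stopwords"]}
    # Assumption: tokens are separated by whitespace.
    kept = [token for token in text.split() if token.lower() not in stopwords]
    return " ".join(kept)


# Response shape copied from the example output above.
response = {"source": {"data": "life is beautiful"}, "stopwords": ["life"]}
cleaned = strip_stopwords(response)
print(cleaned)
```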


Don't forget to delete the endpoint when you no longer need it.

In [13]:
sess.delete_endpoint(predictor.endpoint)

### Batch transform job
If you don't need real-time prediction, you can use a batch transform job. It uses the saved model, computes predictions once, and saves them to the specified or auto-generated output path.

More about transform jobs: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html

Transformer API: https://sagemaker.readthedocs.io/en/latest/transformer.html

In [None]:
transformer = stopword_algorithm.transformer(
    instance_count=1, 
    instance_type='ml.m5.xlarge', 
    accept='application/jsonlines',
    output_path='s3://<bucket-name>/<output-path>'
)
transformer.transform(
    data="s3://<bucket-name>/<input-data-path>/<input-file-name>", # Example of transform data in data/transform_data_example
    content_type='application/jsonlines',
    job_name="stopword-transformer"
)
transformer.wait()
print(transformer.output_path)
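
The transform job above declares `application/jsonlines` as its content type, which means one JSON object per line. The sketch below shows how such a file could be produced and how JSON Lines output could be parsed. It assumes that batch records use the same `{'data': ...}` shape as the real-time request; the file in data/transform_data_example is the authority on the actual format. The helper names `to_jsonlines` and `parse_jsonlines` are hypothetical.

```python
import json


def to_jsonlines(sentences):
    """Serialize sentences as JSON Lines, one {'data': ...} object per line."""
    return "\n".join(json.dumps({"data": sentence}) for sentence in sentences) + "\n"


def parse_jsonlines(text):
    """Parse JSON Lines text into a list of objects, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


payload = to_jsonlines(["life is beautiful", "natural selection"])
print(payload, end="")

# Round-trip check: parsing the payload gives back the original records.
records = parse_jsonlines(payload)
```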

### Cleanup

In [15]:
predictor.delete_model()

Finally, if the AWS Marketplace subscription was created just for an experiment and you would like to unsubscribe, follow the steps below. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace:**

1. Navigate to the **Machine Learning** tab on your [Software subscriptions page](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=lbr_tab_ml).
1. Locate the listing that you need to cancel, and click **Cancel Subscription**.