# Amazon Machine Learning Demonstration


https://aws.amazon.com/pt/machine-learning/
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
With Amazon Machine Learning you can train three different types of models, using the following algorithms:
 - Binary Logistic Regression
 - Multinomial Logistic Regression
 - Linear Regression
 
We will use Multinomial Logistic Regression to create a model for predicting the category of a product, given its short description.
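
As a quick reference, each algorithm above corresponds to one `MLModelType` value accepted by `create_ml_model` in Boto3. The lookup below is only an illustrative sketch; the names `MODEL_TYPES` and `model_type_for` are hypothetical helpers, not part of the service.

```python
# Amazon ML model types (MLModelType values) and the algorithm trained for each.
MODEL_TYPES = {
    "BINARY": "Binary Logistic Regression",
    "MULTICLASS": "Multinomial Logistic Regression",
    "REGRESSION": "Linear Regression",
}

def model_type_for(algorithm):
    """Return the MLModelType value for a given algorithm name (hypothetical helper)."""
    for model_type, name in MODEL_TYPES.items():
        if name == algorithm:
            return model_type
    raise ValueError("Unknown algorithm: {}".format(algorithm))

# This notebook predicts one category out of many, so it uses MULTICLASS.
print(model_type_for("Multinomial Logistic Regression"))
```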

Python Boto3 reference:
http://boto3.readthedocs.io/en/latest/reference/services/machinelearning.html

## Goal: to create a model to predict a given product category

Model:
 - Input: product short description
 - Output: category
 - *predict_category(product_name) -> category*
 

In [1]:
%matplotlib inline

import boto3
import numpy as np
import pandas as pd
import sagemaker
import IPython.display as disp
import json

from time import gmtime, strftime
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing
from IPython.display import Markdown
from notebook import notebookapp

In [2]:
# Get the current SageMaker session
sagemaker_session = sagemaker.Session()

role = sagemaker.get_execution_role()

In [3]:
s3_bucket = sagemaker_session.default_bucket()
client = boto3.client('machinelearning', region_name='us-east-1')
s3_client = boto3.client('s3')
s3 = boto3.client('s3')
base_dir='/tmp/aml'

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-715445047862


In [None]:
bucket_arn = "arn:aws:s3:::%s/*" % s3_bucket
policy_statement = {
    "Sid": "AddPerm",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": bucket_arn
}

In [None]:
current_policy = None
try:
    current_policy = json.loads(s3_client.get_bucket_policy(Bucket=s3_bucket)['Policy'])
    policy_found = False
    for st in current_policy['Statement']:
        if st["Action"] == "s3:GetObject" and st["Resource"] == bucket_arn:
            policy_found = True
            break

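    if not policy_found:
        # Append our statement and write the updated policy back to the bucket
        current_policy['Statement'].append( policy_statement )
        s3_client.put_bucket_policy(Bucket=s3_bucket, Policy=json.dumps(current_policy))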
except Exception as e:
    print("There is no current policy. Adding one...")
    s3_client.put_bucket_policy(
        Bucket=s3_bucket,
        Policy=json.dumps(
            {
                "Version": "2012-10-17",
                "Statement": [policy_statement]
            }
        )
    )

# Data Scientist moment
## Preparing the dataset

In [None]:
!mkdir -p $base_dir
!curl -s https://workshopml.spock.cloud/datasets/products/aml_data.tar.gz | tar -xz -C $base_dir

In [None]:
data = pd.read_csv(base_dir + '/sample.csv', sep=',', encoding='utf-8')
print( len(data) )
data.iloc[[517, 163, 14, 826, 692]]

### So, we need to remove accents, transform everything to lower case and remove stopwords

In [None]:
# translation table for removing accents
accents = str.maketrans("áàãâéêíóôõúüçÁÀÃÂÉÊÍÓÔÕÚÜÇ", "aaaaeeiooouucAAAAEEIOOOUUC")

# load the stopwords and strip their accents
with open("stopwords.txt", "r") as file:
    stopwords = [line.strip().translate(accents) for line in file]

In [None]:
# build a word tokenizer from the vectorizer; stop words are filtered out in remove_stop_words below
# (the bigrams themselves are computed later, by the model recipe)
word_vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', stop_words=stopwords, token_pattern='[a-zA-Z]+')
tokenizer = word_vectorizer.build_tokenizer()

def remove_stop_words(text):
    return " ".join( list(filter( lambda x: x not in stopwords, tokenizer(text) )) )

In [None]:
data['product_name_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['product_name']))
data['main_category_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['main_category']))
data['subcategory_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['sub_category']))
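
To see what these steps do to a single string, here is a minimal self-contained sketch of the same normalization (lower case, accent removal, tokenization, stopword filtering). The stopword set and the sample product name are made up for illustration; the real list comes from `stopwords.txt`.

```python
import re

# Same accent translation table used in the cells above
ACCENTS = str.maketrans("áàãâéêíóôõúüçÁÀÃÂÉÊÍÓÔÕÚÜÇ", "aaaaeeiooouucAAAAEEIOOOUUC")

# Hypothetical stopwords, for illustration only
SAMPLE_STOPWORDS = {"de", "para", "com"}

def normalize(text, stopwords=SAMPLE_STOPWORDS):
    """Lower-case, strip accents, keep alphabetic tokens and drop stopwords."""
    text = text.lower().translate(ACCENTS)
    tokens = re.findall(r"[a-zA-Z]+", text)
    return " ".join(t for t in tokens if t not in stopwords)

print(normalize("Câmera Digital para Viagem com Estojo"))
```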

In [None]:
data.iloc[[26, 163, 14, 826, 692]]

## Let's remove the unnecessary columns

In [None]:
data_final = data[ [ 'product_name_tokens', 'main_category_tokens', 'subcategory_tokens' ]]
data_final = data_final.rename(columns={
    "product_name_tokens": "product_name", 
    "main_category_tokens": "category",
    "subcategory_tokens": "sub_category", 
})
data_final.head()

# OK. We have finished preparing our 'sample' dataset.
## Now, let's continue with the dataset that was already cleaned.
## In real life, you should apply all these transformations to your final dataset.

In [None]:
disp.Image(base_dir + '/workflow_processo.png')

### Now, let's execute the steps above using Amazon Machine Learning.

In [None]:
# First, let's upload our dataset to S3
s3.upload_file( base_dir + '/dataset.csv', s3_bucket, 'workshop/AML/dataset.csv' )

In [None]:
# take a quick look at the dataset before continuing
pd.read_csv(base_dir + '/dataset.csv', sep=',', encoding='utf-8').head()

## Now, let's create the DataSources
### Before that, we need to split the dataset into 70% training and 30% test

In [None]:
strategy_train = open( 'split_strategy_training.json', 'r').read()
strategy_test = open( 'split_strategy_test.json', 'r').read()
print( "Training: {}\nTest: {}".format( strategy_train, strategy_test ) )
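
The contents of the two split strategy files are not shown here. As a rough sketch, a 70/30 split is usually expressed through the `DataRearrangement` JSON with a `splitting` block; the percentages below are assumed to match the files used in this notebook, and the variable names are illustrative.

```python
import json

# Assumed content of the split strategies: the training datasource takes the
# first 70% of the rows and the test datasource takes the remaining 30%.
train_split = json.dumps({"splitting": {"percentBegin": 0, "percentEnd": 70}})
test_split = json.dumps({"splitting": {"percentBegin": 70, "percentEnd": 100}})

print("Training: {}\nTest: {}".format(train_split, test_split))
```

Both datasources point at the same S3 file; only the rearrangement string differs, so the two ranges must not overlap.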

### How does AML know the file format (CSV)? By using the schema below...

In [None]:
categorias_schema = open('category_schema.json', 'r').read()
print( "Dataset data format: {}\n".format( categorias_schema) )
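
The schema file itself is not reproduced in this notebook. The sketch below shows the general shape of an Amazon ML data schema for a CSV file; the attribute names and types are assumptions based on the columns prepared earlier, and may differ from the actual `category_schema.json`.

```python
import json

# Assumed schema: a free-text input column and a categorical target column.
example_schema = {
    "version": "1.0",
    "targetAttributeName": "category",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "product_name", "attributeType": "TEXT"},
        {"attributeName": "category", "attributeType": "CATEGORICAL"},
    ],
}

# The API expects the schema as a JSON string, as in the cell above.
example_schema_json = json.dumps(example_schema, indent=2)
print(example_schema_json)
```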

### Creating the DataSources (train and test) for the Category Model

In [None]:
train_datasource_name = 'CategoriasTrain' + '_' + strftime("%Y%m%d_%H%M%S", gmtime())
test_datasource_name = 'CategoriasTest' + '_' + strftime("%Y%m%d_%H%M%S", gmtime())

print(train_datasource_name, test_datasource_name)

resp = client.create_data_source_from_s3(
    DataSourceId=train_datasource_name,
    DataSourceName=train_datasource_name,
    DataSpec={
        'DataLocationS3': 's3://%s/workshop/AML/dataset.csv' % s3_bucket,
        'DataSchema': categorias_schema,
        'DataRearrangement': strategy_train
    },
    ComputeStatistics=True
)

resp = client.create_data_source_from_s3(
    DataSourceId=test_datasource_name,
    DataSourceName=test_datasource_name,
    DataSpec={
        'DataLocationS3': 's3://%s/workshop/AML/dataset.csv' % s3_bucket,
        'DataSchema': categorias_schema,
        'DataRearrangement': strategy_test
    },
    ComputeStatistics=True
)

waiter = client.get_waiter('data_source_available')
waiter.wait(FilterVariable='Name', EQ=train_datasource_name)
waiter.wait(FilterVariable='Name', EQ=test_datasource_name)
print( "Datasources created successfully!" )

## Creating/training the Category model

This is the model recipe. It contains the last transformations applied to your dataset before training starts. Note the function ngram(product_name, 2): it creates bigrams from the input text, so the model takes as input a term frequency table extracted from the bigrams of product_name.

In [None]:
cat_recipe = open('category_recipe.json', 'r').read()
print(cat_recipe)
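
To make the n-gram idea concrete, here is a small plain-Python approximation of what a window size of 2 produces: the single words plus every pair of adjacent words, counted into a term frequency table. This is only a sketch; joining words with an underscore and the function names `ngrams` and `term_frequencies` are assumptions for illustration, not the service's implementation.

```python
from collections import Counter

def ngrams(text, window=2):
    """Return all n-grams of size 1..window, built from adjacent words."""
    words = text.split()
    grams = []
    for size in range(1, window + 1):
        for i in range(len(words) - size + 1):
            grams.append("_".join(words[i:i + size]))
    return grams

def term_frequencies(text, window=2):
    """Count how often each n-gram occurs in the text."""
    return Counter(ngrams(text, window))

print(ngrams("camera digital sony"))
```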

Reference: http://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html
## Training starts as soon as you execute the command below

In [None]:
model_name = 'ProdutoCategorias' + '_' + strftime("%Y%m%d_%H%M%S", gmtime())
print(model_name)
resp = client.create_ml_model(
    MLModelId=model_name,
    MLModelName=model_name,
    MLModelType='MULTICLASS',
    Parameters={
        'sgd.maxPasses': '30',
        'sgd.shuffleType': 'auto',
        'sgd.l2RegularizationAmount': '1e-6'
    },
    TrainingDataSourceId=train_datasource_name,
    Recipe=cat_recipe
)
waiter = client.get_waiter('ml_model_available')
waiter.wait(FilterVariable='Name', EQ=model_name)
print( "Model created successfully!" )

In [None]:
eval_name = 'ProdutoCategoriasEval' + '_' + strftime("%Y%m%d_%H%M%S", gmtime())
# this takes around 4 minutes
resp = client.create_evaluation(
    EvaluationId=eval_name,
    EvaluationName=eval_name,
    MLModelId=model_name,
    EvaluationDataSourceId=test_datasource_name
)
waiter = client.get_waiter('evaluation_available')
waiter.wait(FilterVariable='Name', EQ=eval_name)
print( "Model evaluated successfully!" )

#### It will take a few more minutes. Check the service console if you wish.

## Checking the model score...

In [None]:
score = client.get_evaluation( EvaluationId=eval_name )
print("Category score: {}".format( score['PerformanceMetrics']['Properties']['MulticlassAvgFScore'] ) )
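
The reported metric is an average F1 score across classes. As a rough illustration of how a macro-averaged F1 is computed, the sketch below calculates it by hand; the labels are made up, and the service may differ in details such as which classes it averages over.

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the plain mean."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, for illustration only
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

A score close to 1 means the model is both precise and complete for every category, not only for the most frequent ones.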

# Predicting new Categories with the trained model

In [None]:
try:
    client.create_realtime_endpoint(
        MLModelId=model_name
    )
    print('Please, wait a few seconds while the endpoint is being created. Get some coffee...')
except Exception as e:
    print(e)
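
Instead of guessing when the endpoint is ready, one option is to poll the model until its endpoint status becomes `READY`. The helper below is a minimal sketch: `wait_for_endpoint` and its parameters are hypothetical, and it relies on the `EndpointInfo` block returned by `get_ml_model`.

```python
import time

def wait_for_endpoint(ml_client, model_id, delay=5, max_attempts=60):
    """Poll get_ml_model until the realtime endpoint is READY; return its URL."""
    for _ in range(max_attempts):
        info = ml_client.get_ml_model(MLModelId=model_id).get("EndpointInfo", {})
        status = info.get("EndpointStatus")
        if status == "READY":
            return info.get("EndpointUrl")
        if status == "FAILED":
            raise RuntimeError("Endpoint creation failed for model " + model_id)
        time.sleep(delay)
    raise TimeoutError("Endpoint for model {} was not ready in time".format(model_id))

# Usage in this notebook (requires the client and model created above):
# endpoint_url = wait_for_endpoint(client, model_name)
```

The returned URL could then be passed as `PredictEndpoint` rather than hard-coding the regional address.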

In [None]:
def predict_category( product_name ):
    response = client.predict(
        MLModelId=model_name,
        Record={
            'product_name': product_name
        },
        PredictEndpoint='https://realtime.machinelearning.us-east-1.amazonaws.com'
    )
    return response['Prediction']['predictedLabel']

In [None]:
testes = pd.read_csv(base_dir + '/testes.csv', sep=',', encoding='utf-8')
testes.head()

In [None]:
result = None
try:
    testes['predicted_category'] = testes['product_name'].apply(predict_category)
    result = testes
except Exception as e:
    print( "Your realtime endpoint is not ready yet... Please, wait for a few seconds more and try again.")
result

# Cleaning Up

In [None]:
client.delete_realtime_endpoint(MLModelId=model_name)
print("Endpoint deleted")
client.delete_ml_model(MLModelId=model_name)
print("Model deleted")
client.delete_evaluation(EvaluationId=eval_name)
print("Evaluation deleted")
client.delete_data_source(DataSourceId=test_datasource_name)
print("Datasource deleted")
client.delete_data_source(DataSourceId=train_datasource_name)
print("Datasource deleted")

# Well Done!