# Amazon Machine Learning Demonstration


https://aws.amazon.com/pt/machine-learning/
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
With Amazon Machine Learning you can train three different types of models, using the following algorithms:
 - Binary Logistic Regression
 - Multinomial Logistic Regression
 - Linear Regression
 
We will use Multinomial Logistic Regression to create a model that predicts the category of a product, given its short description.

Python Boto3 reference:
http://boto3.readthedocs.io/en/latest/reference/services/machinelearning.html

## Goal: create a model that predicts the category of a given product

Model:
 - Input: product short description
 - Output: category
 - *predict_category(product_name) -> category*
 

In [None]:
%matplotlib inline

import boto3
import numpy as np
import pandas as pd
import sagemaker
import IPython.display as disp

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn import preprocessing
from IPython.display import Markdown
from notebook import notebookapp

In [None]:
# Get the current SageMaker session
sagemaker_session = sagemaker.Session()

role = sagemaker.get_execution_role()

## Before running this tutorial, please add the following policy to your bucket
```javascript
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AddPerm",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<YOUR_S3_BUCKET_NAME_HERE>/*"
        }
    ]
}
```
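
The policy can also be attached programmatically. The following is a minimal sketch, assuming an S3 client like the `s3 = boto3.client('s3')` object this notebook creates; `build_read_policy` and `attach_read_policy` are hypothetical helper names, not part of boto3. Keep in mind that `"Principal": "*"` makes every object in the bucket publicly readable.

```python
import json


def build_read_policy(bucket_name):
    """Return the bucket policy shown above as a JSON string."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddPerm",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": "arn:aws:s3:::{}/*".format(bucket_name),
            }
        ],
    }
    return json.dumps(policy)


def attach_read_policy(s3_client, bucket_name):
    """Attach the policy to the bucket (hypothetical helper)."""
    # put_bucket_policy expects the policy document as a JSON string
    s3_client.put_bucket_policy(
        Bucket=bucket_name,
        Policy=build_read_policy(bucket_name),
    )
```

With the notebook's own objects, the call would be `attach_read_policy(s3, s3_bucket)`.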

In [None]:
s3_bucket = sagemaker_session.default_bucket()
client = boto3.client('machinelearning')
s3 = boto3.client('s3')
base_dir='/tmp/aml'

# Data Scientist moment
## Preparing the dataset

In [None]:
!mkdir -p $base_dir
!curl -s https://workshopml.spock.cloud/datasets/products/aml_data.tar.gz | tar -xz -C $base_dir

In [None]:
data = pd.read_csv(base_dir + '/sample.csv', sep=',', encoding='utf-8')
print( len(data) )
data.iloc[[517, 163, 14, 826, 692]]

### So, we need to remove accents, convert everything to lowercase, and remove stop words

In [None]:
# translation table for removing accents
accents = str.maketrans("áàãâéêíóôõúüçÁÀÃÂÉÊÍÓÔÕÚÜÇ", "aaaaeeiooouucAAAAEEIOOOUUC")

# load the stop words and strip their accents
with open("stopwords.txt", "r") as file:
    stopwords = [line.strip().translate(accents) for line in file]
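
The hand-written table above only covers the accented characters listed in it. As a hedged alternative, Unicode decomposition from the standard library handles any accented Latin letter. `strip_accents` is a hypothetical helper name, and this sketch is not what the notebook itself uses.

```python
import unicodedata


def strip_accents(text):
    """Remove accents by decomposing characters and dropping combining marks."""
    # NFD splits "ã" into "a" followed by a combining tilde
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Ação rápida"))
```

The translation table remains a reasonable choice when the set of accented characters is known in advance, as it is for Portuguese text.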

In [None]:
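# the vectorizer is used only for its tokenizer, which splits the text into alphabetic tokens;
# stop words are filtered in remove_stop_words and the bigrams (ngram(2)) are computed later by the model recipe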
word_vectorizer = TfidfVectorizer(ngram_range=(1,2), analyzer='word', stop_words=stopwords, token_pattern='[a-zA-Z]+')
tokenizer = word_vectorizer.build_tokenizer()

def remove_stop_words(text):
    return " ".join( list(filter( lambda x: x not in stopwords, tokenizer(text) )) )
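
To make the effect of this step concrete, here is a self-contained sketch of the same normalization using only the standard library. It assumes a tiny made-up stop word list instead of `stopwords.txt`, and uses `re.findall` with the same `[a-zA-Z]+` pattern in place of the scikit-learn tokenizer. `normalize`, `ACCENTS` and `SAMPLE_STOPWORDS` are hypothetical names.

```python
import re

ACCENTS = str.maketrans("áàãâéêíóôõúüçÁÀÃÂÉÊÍÓÔÕÚÜÇ", "aaaaeeiooouucAAAAEEIOOOUUC")

# made-up sample; the real list lives in stopwords.txt
SAMPLE_STOPWORDS = {"para", "de", "com"}


def normalize(text, stopwords=SAMPLE_STOPWORDS):
    """Lowercase, strip accents, tokenize on letters and drop stop words."""
    text = text.lower().translate(ACCENTS)
    tokens = re.findall(r"[a-zA-Z]+", text)
    return " ".join(token for token in tokens if token not in stopwords)


print(normalize("Capa para Celular Galáxy"))
```

Because the pattern matches letters only, digits and punctuation are discarded: `"Cabo USB 2m"` becomes `"cabo usb m"`.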

In [None]:
data['product_name_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['product_name']))
data['main_category_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['main_category']))
data['subcategory_tokens'] = list(map(lambda x: remove_stop_words( x.lower().translate(accents) ), data['sub_category']))

In [None]:
data.iloc[[26, 163, 14, 826, 692]]

## Let's remove the unnecessary columns

In [None]:
data_final = data[ [ 'product_name_tokens', 'main_category_tokens', 'subcategory_tokens' ]]
data_final = data_final.rename(columns={
    "product_name_tokens": "product_name", 
    "main_category_tokens": "category",
    "subcategory_tokens": "sub_category", 
})
data_final.head()

# OK, we have finished preparing our 'sample' dataset.
## Now, let's continue with the dataset that was already cleaned.
## In real life, you should apply all these transformations to your final dataset.

In [None]:
disp.Image(base_dir + '/workflow_processo.png')

### Now, let's execute the steps above using Amazon Machine Learning.

In [None]:
# First, let's upload our dataset to S3
s3.upload_file( base_dir + '/dataset.csv', s3_bucket, 'workshop/AML/dataset.csv' )

In [None]:
# take a quick look at the dataset before continuing
pd.read_csv(base_dir + '/dataset.csv', sep=',', encoding='utf-8').head()

## Now, let's create the DataSources
### Before that, we need to split the data into 70% training and 30% test

In [None]:
with open('split_strategy_training.json', 'r') as f:
    strategy_train = f.read()
with open('split_strategy_test.json', 'r') as f:
    strategy_test = f.read()
print( "Training: {}\nTest: {}".format( strategy_train, strategy_test ) )
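
The two JSON files are not reproduced in this notebook. As an assumption about their contents, a 70/30 split in Amazon ML's `DataRearrangement` format could look like the sketch below; the actual values in `split_strategy_training.json` and `split_strategy_test.json` may differ (for example, they may add a `strategy` entry). `split_strategy` is a hypothetical helper name.

```python
import json


def split_strategy(percent_begin, percent_end):
    """Build a DataRearrangement string selecting a percentage range of the data."""
    return json.dumps({
        "splitting": {
            "percentBegin": percent_begin,
            "percentEnd": percent_end,
        }
    })


example_train = split_strategy(0, 70)    # first 70% of the records
example_test = split_strategy(70, 100)   # remaining 30%

print("Training: {}\nTest: {}".format(example_train, example_test))
```

Both DataSources point at the same S3 file; only the rearrangement string decides which slice each one reads.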

### How does AML know the file format (CSV)? By using the schema below...

In [None]:
with open('category_schema.json', 'r') as f:
    categorias_schema = f.read()
print( "Dataset data format: {}\n".format( categorias_schema) )
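
For readers without access to `category_schema.json`, the sketch below shows the general shape of an Amazon ML data schema. The attribute names and types are assumptions based on the columns used in this notebook (`product_name` as text input, `category` as the target); the real file may list more attributes.

```python
import json

# Assumed layout; the actual category_schema.json may differ
example_schema = {
    "version": "1.0",
    "targetAttributeName": "category",
    "dataFormat": "CSV",
    "dataFileContainsHeader": True,
    "attributes": [
        {"attributeName": "product_name", "attributeType": "TEXT"},
        {"attributeName": "category", "attributeType": "CATEGORICAL"},
    ],
}

example_schema_json = json.dumps(example_schema, indent=2)
print(example_schema_json)
```

The `dataFormat` entry is what tells the service to parse the file as CSV, and a categorical target is what makes the problem a multiclass one.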

### Creating the DataSources (train and test) for the Category Model

In [None]:
resp = client.create_data_source_from_s3(
    DataSourceId='ProdCategoriasTrain',
    DataSourceName='Products and their categories dataset (train)',
    DataSpec={
        'DataLocationS3': 's3://%s/workshop/AML/dataset.csv' % s3_bucket,
        'DataSchema': categorias_schema,
        'DataRearrangement': strategy_train
    },
    ComputeStatistics=True
)
resp = client.create_data_source_from_s3(
    DataSourceId='ProdCategoriasTest',
    DataSourceName='Products and their categories dataset (test)',
    DataSpec={
        'DataLocationS3': 's3://%s/workshop/AML/dataset.csv' % s3_bucket,
        'DataSchema': categorias_schema,
        'DataRearrangement': strategy_test
    },
    ComputeStatistics=True
)

## Creating/training the Category model

This is the model recipe. It contains the last transformations applied to your dataset before training starts. Note the function ngram(product_name, 2): it creates bigrams from the input text, so the model takes as input a term frequency table extracted from the bigrams of product_name.
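
To illustrate what such a term frequency table contains, here is a small pure-Python sketch. It assumes that a window size of 2 yields both single words and adjacent word pairs, which is how the Amazon ML documentation describes the n-gram transformation. It is an illustration only, not the service's implementation, and `ngrams_up_to` and `term_frequencies` are hypothetical names.

```python
from collections import Counter


def ngrams_up_to(tokens, n):
    """Return all contiguous word sequences of length 1..n, shortest first."""
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams


def term_frequencies(text, n=2):
    """Count how often each n-gram occurs in a whitespace-separated text."""
    return Counter(ngrams_up_to(text.split(), n))


print(term_frequencies("capa celular samsung"))
```

Each distinct n-gram becomes one input feature, which is why cleaning the text beforehand keeps the feature space smaller.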

In [None]:
with open('category_recipe.json', 'r') as f:
    cat_recipe = f.read()
print(cat_recipe)

Reference: http://docs.aws.amazon.com/machine-learning/latest/dg/data-transformations-reference.html
## The training will start as soon as you execute the command below

In [None]:
resp = client.create_ml_model(
    MLModelId='ProdutoCategorias',
    MLModelName='Products and their categories model',
    MLModelType='MULTICLASS',
    Parameters={
        'sgd.maxPasses': '30',
        'sgd.shuffleType': 'auto',
        'sgd.l2RegularizationAmount': '1e-6'
    },
    TrainingDataSourceId='ProdCategoriasTrain',
    Recipe=cat_recipe
)

### You must wait for the training to finish before trying to evaluate the model.
### You can use the time to check the rest of the code or do something more interesting.
### Come back after 8 to 10 minutes and continue executing this code.
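
Instead of watching the clock, the wait can be automated by polling the model status. The sketch below is a generic helper under stated assumptions: `wait_for_status` is a hypothetical name, the status strings are the ones Amazon ML reports for its resources, and the polling interval and timeout are arbitrary defaults.

```python
import time

# Statuses after which the resource will not change any more
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "DELETED"}


def wait_for_status(get_status, poll_seconds=30, timeout_seconds=1800, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal status."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("still {} after {} seconds".format(status, waited))
        sleep(poll_seconds)
        waited += poll_seconds
```

With the notebook's client, the model status could be polled with `wait_for_status(lambda: client.get_ml_model(MLModelId='ProdutoCategorias')['Status'])`, and the same helper works for DataSources and evaluations through `get_data_source` and `get_evaluation`.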

In [None]:
# this will take around 4 minutes
client.create_evaluation(
    EvaluationId='ProdutoCategorias',
    EvaluationName='Test of the ProdutoCategorias model',
    MLModelId='ProdutoCategorias',
    EvaluationDataSourceId='ProdCategoriasTest'
)

#### It will take a few more minutes; check the service console if you wish

## Checking the model score...

In [None]:
score = client.get_evaluation( EvaluationId='ProdutoCategorias' )
print("Category score: {}".format( score['PerformanceMetrics']['Properties']['MulticlassAvgFScore'] ) )
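
As far as the Amazon ML documentation describes it, `MulticlassAvgFScore` is a macro-averaged F1 score: F1 is computed per class and the results are averaged with equal weight. The following sketch shows that calculation on a toy example; `macro_f1` is a hypothetical helper and the labels are made up, so it illustrates the metric rather than reproducing the service's evaluation.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class the same weight."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


truth = ["phone", "phone", "tv", "tv"]
predicted = ["phone", "tv", "tv", "tv"]
print(macro_f1(truth, predicted))
```

A score close to 1 means the model does well on every category, including the rare ones, since each class contributes equally to the average.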

# Predicting new Categories with the trained model

In [None]:
try:
    client.create_realtime_endpoint(
        MLModelId='ProdutoCategorias'
    )
    print('Please, wait a few minutes while the endpoint is being created. Get some coffee...')
except Exception as e:
    print(e)

In [None]:
def predict_category( product_name ):
    response = client.predict(
        MLModelId='ProdutoCategorias',
        Record={
            'product_name': product_name
        },
        PredictEndpoint='https://realtime.machinelearning.us-east-1.amazonaws.com'
    )
    return response['Prediction']['predictedLabel']
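
The `PredictEndpoint` above is hard-coded for `us-east-1`. A hedged alternative is to read the URL from the model itself, since `get_ml_model` returns an `EndpointInfo` block once a real-time endpoint exists. `get_realtime_endpoint_url` is a hypothetical helper name; it takes the Amazon ML client as an argument so the sketch stays self-contained.

```python
def get_realtime_endpoint_url(ml_client, model_id):
    """Return the real-time endpoint URL of a model, or raise if it is not ready."""
    model = ml_client.get_ml_model(MLModelId=model_id)
    info = model.get("EndpointInfo", {})
    status = info.get("EndpointStatus", "NONE")
    if status != "READY":
        raise RuntimeError("endpoint for {} is not ready (status: {})".format(model_id, status))
    return info["EndpointUrl"]
```

The returned value could then be passed as `PredictEndpoint` in `client.predict`, for example `get_realtime_endpoint_url(client, 'ProdutoCategorias')`, which also guards against predicting before the endpoint has finished being created.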

In [None]:
testes = pd.read_csv(base_dir + '/testes.csv', sep=',', encoding='utf-8')
testes.head()

In [None]:
testes['predicted_category'] = testes['product_name'].apply(predict_category)
testes

# Well Done!