# TF-IDF Training and Prediction with SageMaker Scikit-learn
This tutorial shows how to use [Scikit-learn](https://scikit-learn.org/stable/) with SageMaker through the pre-built Scikit-learn container. Scikit-learn is a popular Python machine learning framework that includes algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing.

The [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module makes it easy to use existing scikit-learn code. We show this by training a model on the '20 Newsgroups' dataset and generating a set of predictions. For more information about the Scikit-learn container, see the [sagemaker-scikit-learn-container](https://github.com/aws/sagemaker-scikit-learn-container) repository and the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) repository.

For more on Scikit-learn, please visit the Scikit-learn website: <http://scikit-learn.org/stable/>.

### Table of contents
* [Upload the data for training](#upload_data)
* [Pre-processing](#pre-processing)
* [Create a Scikit-learn Training Script](#create_sklearn_script)
* [Create the SageMaker Scikit Estimator](#create_sklearn_estimator)
* [Train the SKLearn Estimator on the 20 Newsgroups data](#train_sklearn)
* [Evaluate the Trained Model](#evaluate)
* [Using the trained model to make inference requests](#inference)
 * [Deploy the model](#deploy)
 * [Choose some data and use it for a prediction](#prediction_request)
 * [Endpoint cleanup](#endpoint_cleanup)
* [Batch Transform](#batch_transform)
 * [Prepare Input Data](#prepare_input_data)
 * [Run Transform Job](#run_transform_job)
 * [Check Output Data](#check_output_data)

**Note: this example requires SageMaker Python SDK v2.**

In [None]:
# Upgrade the SDK first so that the version printed below is the one in use
!pip install -qU 'sagemaker>=2.5'

import sagemaker

print(sagemaker.__version__)

First, let's create our SageMaker session and role, and define an S3 prefix to use for the notebook example.

In [None]:
# S3 prefix
prefix = "scikit-tfidf"

import pandas as pd
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Upload the data for training <a class="anchor" id="upload_data"></a>

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we use the 20 Newsgroups dataset, which Scikit-learn can fetch through `fetch_20newsgroups`. We load the dataset, write it to a local CSV file, and then upload that file to S3.

In [None]:
# https://scikit-learn.org/stable/datasets/real_world.html#the-20-newsgroups-text-dataset
# https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/20_newsgroups.tar.gz


import numpy as np
import os
from sklearn import datasets

news = datasets.fetch_20newsgroups(subset='all')

print("Number of articles: " + str(len(news.data)))
print("Number of different categories: " + str(len(news.target_names)))


In [None]:
# create a dataframe
df = pd.DataFrame([news.target, news.data]).T


In [None]:
# Create directory and write csv
os.makedirs("./data", exist_ok=True)
df.to_csv('./data/articles.csv', index=False, header=False)


Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [None]:
WORK_DIRECTORY = "data"
input_data = sagemaker_session.upload_data(
    WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
)

input_data
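
To make the value of `input_data` easier to read, here is a small sketch of the URI shape we expect back when a directory is uploaded. The default bucket is normally named `sagemaker-<region>-<account id>`; the region and account id below are placeholders, and `expected_upload_uri` is a hypothetical helper, not an SDK function.

```python
def expected_upload_uri(bucket, key_prefix, filename=None):
    """Compose the URI shape s3://<bucket>/<key_prefix>[/<filename>]."""
    uri = "s3://{}/{}".format(bucket, key_prefix.strip("/"))
    return "{}/{}".format(uri, filename) if filename else uri


# Placeholder region and account id, for illustration only
bucket = "sagemaker-{}-{}".format("us-east-1", "123456789012")
print(expected_upload_uri(bucket, "scikit-tfidf/data"))
```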

## Data Pre-processing <a class="anchor" id="pre-processing"></a>
With Amazon SageMaker Processing, you can run processing jobs for data processing steps in your machine learning pipeline.  
In this example, we will use the SageMaker built-in Scikit-learn container to run our pre-processing script.  
Processing jobs accept data from Amazon S3 as input and store data into Amazon S3 as output.


In [None]:
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1", 
    role=role, 
    instance_type="ml.m5.xlarge", 
    instance_count=1
)

sklearn_processor.run(
    code="code/preprocessing.py",
    inputs=[ProcessingInput(input_name="rawdata", source=input_data, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
    ],
    arguments=["--train-test-split-ratio", "0.2"],
)

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description["ProcessingOutputConfig"]
for output in output_config["Outputs"]:
    if output["OutputName"] == "train":
        preprocessed_training_data = output["S3Output"]["S3Uri"]
    if output["OutputName"] == "test":
        preprocessed_test_data = output["S3Output"]["S3Uri"]
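
The pre-processing script itself is not reproduced in this notebook. Given the tutorial's title, it presumably converts each article into a TF-IDF vector. As a rough illustration of that weighting only, the sketch below computes TF-IDF in plain Python using the smoothed formula that scikit-learn's `TfidfVectorizer` applies by default, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation. The whitespace tokenizer and the function name `tfidf` are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


vectors = tfidf(["the cat sat", "the dog sat", "the bird flew"])
```

In the first document, "cat" (which appears in one document) receives a higher weight than "sat" (two documents), which in turn outweighs "the" (all three).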

## Create a Scikit-learn Training Script <a class="anchor" id="create_sklearn_script"></a>
SageMaker can run a scikit-learn script using the `SKLearn` estimator. When the script executes on SageMaker, a number of helpful environment variables expose properties of the training environment, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.
* `SM_OUTPUT_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the `SKLearn` estimator's `fit()` method, the following environment variables will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel
* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance.


Because the Scikit-learn container imports your training script, you should always put your training code in a main guard (`if __name__ == '__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.

For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers.
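
As a minimal sketch of the pattern described above (this is not the notebook's actual `code/train.py`), a training script might read its hyperparameters and directories as follows. The flag names mirror the hyperparameters used later in this notebook; the default values are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line flags
    parser.add_argument("--n-estimators", type=int, default=100)
    parser.add_argument("--min-samples-leaf", type=int, default=1)
    # Directories come from environment variables set inside the container
    parser.add_argument(
        "--model-dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model")
    )
    parser.add_argument(
        "--train", default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
    )
    # parse_known_args ignores any extra arguments the container may pass
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_args()
    # A real script would load data from args.train, fit the model,
    # and save it under args.model_dir.
    print("training data:", args.train, "model dir:", args.model_dir)
```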

## Create SageMaker Scikit Estimator <a class="anchor" id="create_sklearn_estimator"></a>

To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: The ARN of the IAM role that SageMaker assumes to run the training job.
* __instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary passed to the training script as hyperparameters.

To see the code for the SKLearn Estimator, see here: https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn

In [None]:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point="code/train.py",
    framework_version="0.23-1",
    instance_type="ml.m5.xlarge",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={"min-samples-leaf": 2, "n-estimators": 500},
)

## Train SKLearn Estimator on 20-Newsgroups data <a class="anchor" id="train_sklearn"></a>
Training is very simple: just call `fit` on the Estimator. This starts a SageMaker Training job that downloads the data for us, invokes our scikit-learn code (in the provided script file), and saves any model artifacts that the script creates.

In [None]:
preprocessed_training_data

In [None]:
%%time
sklearn.fit({"train": preprocessed_training_data})
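
To connect this call with the training script, the sketch below models, in a simplified way, what the script ends up seeing: each channel name passed to `fit` becomes an `SM_CHANNEL_<NAME>` variable pointing at `/opt/ml/input/data/<name>`, and each hyperparameter becomes a `--key value` pair. `container_view` is a hypothetical helper for illustration, not the container's own code, and the S3 URI is a placeholder.

```python
def container_view(channels, hyperparameters):
    """Map fit() channels and estimator hyperparameters to what the script sees."""
    env = {}
    for name in channels:
        env["SM_CHANNEL_" + name.upper()] = "/opt/ml/input/data/" + name
    argv = []
    for key, value in hyperparameters.items():
        argv += ["--" + key, str(value)]
    return env, argv


env, argv = container_view(
    {"train": "s3://example-bucket/train"},  # placeholder URI
    {"min-samples-leaf": 2, "n-estimators": 500},
)
print(env)
print(argv)
```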

## Evaluate Model <a class="anchor" id="evaluate"></a>

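To evaluate the trained model, we first build the S3 URI of the model artifact from the training job description. We then run a second Processing job that executes `code/evaluation.py` against the model and the held-out test data.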

In [None]:
training_job_description = sklearn.jobs[-1].describe()

model_data_s3_uri = "{}{}/{}".format(
    training_job_description["OutputDataConfig"]["S3OutputPath"],
    training_job_description["TrainingJobName"],
    "output/model.tar.gz",
)
print(training_job_description["TrainingJobName"])
print(model_data_s3_uri)
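
The format string above relies on `S3OutputPath` ending in a slash. A slightly more defensive sketch, using a hypothetical helper named `model_artifact_uri`, normalises the path first. After `fit`, the estimator's `model_data` attribute should hold the same URI and may be the simpler choice.

```python
def model_artifact_uri(output_path, job_name):
    """Join output path and job name without doubling or dropping a slash."""
    return "{}/{}/output/model.tar.gz".format(output_path.rstrip("/"), job_name)


# Placeholder bucket and job name, for illustration only
print(model_artifact_uri("s3://example-bucket/", "example-job"))
```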

In [None]:
sklearn_processor = SKLearnProcessor(
    framework_version='0.23-1',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1
)

sklearn_processor.run(
    code="code/evaluation.py",
    inputs=[
        ProcessingInput(source=model_data_s3_uri, destination="/opt/ml/processing/model"),
        # ProcessingInput(source=preprocessed_training_data, destination="/opt/ml/processing/train"),
        ProcessingInput(source=preprocessed_test_data, destination="/opt/ml/processing/test"),
    ],
    outputs=[ProcessingOutput(output_name="evaluation", source="/opt/ml/processing/evaluation")],
)
evaluation_job_description = sklearn_processor.jobs[-1].describe()
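
The evaluation script is not shown here either. One detail worth noting is that the Processing job receives the model as a `model.tar.gz` archive under `/opt/ml/processing/model`, so the script has to unpack it before loading the model. A minimal sketch of that step, assuming the archive keeps its default name (`extract_model` is a hypothetical helper):

```python
import os
import tarfile


def extract_model(model_dir="/opt/ml/processing/model"):
    """Unpack model.tar.gz in place and return the names of its members."""
    archive = os.path.join(model_dir, "model.tar.gz")
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=model_dir)
    return names
```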

## Using the trained model to make inference requests <a class="anchor" id="inference"></a>

### Deploy the model <a class="anchor" id="deploy"></a>

Deploying the model to SageMaker hosting requires only a call to `deploy` on a model object. The model is given an inference script that instantiates the fitted model (from 'model.tar.gz').

In [None]:
from sagemaker.sklearn.model import SKLearnModel

sklearn_model = SKLearnModel(
    model_data=model_data_s3_uri,
    role=role,
    entry_point="code/inference.py",
    framework_version="0.23-1"
)

predictor = sklearn_model.deploy(
    instance_type="ml.c5.xlarge",
    initial_instance_count=1
)
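
The Scikit-learn container calls a `model_fn(model_dir)` function from the entry point to load the model. The notebook's `code/inference.py` is not reproduced here, so the following is only a sketch of what such a function might look like. It assumes the training script saved the model with `joblib` under the file name `model.joblib`, which is a hypothetical name.

```python
import os

import joblib


def model_fn(model_dir):
    """Load the fitted model from the directory where model.tar.gz was unpacked."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```

The container also supports optional `input_fn`, `predict_fn`, and `output_fn` hooks for customising how requests are parsed and responses are encoded.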


### Choose some data and use it for a prediction <a class="anchor" id="prediction_request"></a>

To make some predictions, we'll take rows from the test data produced by the pre-processing job and send them to the endpoint. This is a small spot check that shows how the mechanism works, not a full evaluation.

In [None]:
from scipy import sparse

test_feature_vectors = os.path.join(preprocessed_test_data, "feature_vectors.npz")
!aws s3 cp $test_feature_vectors ./data/
X_test_vectors = sparse.load_npz("data/feature_vectors.npz")

test_labels = os.path.join(preprocessed_test_data, "labels.csv")
y_test = pd.read_csv(test_labels, header=None)


In [None]:
preprocessed_test_data

Prediction is as easy as calling `predict` on the predictor we got back from `deploy`, passing the data we want predictions for. The endpoint returns a numerical representation of the predicted class; in the original dataset these are the newsgroup names, but in this example the labels are numerical. We can compare the prediction against the original label that we parsed.

In [None]:
import random

# get one random row from our test data
randomlist = random.sample(range(0, X_test_vectors.shape[0]), 1)

#X_subset = X_test_vectors.tocsr()[0:20,].todense()
X_subset = X_test_vectors.tocsr()[randomlist,].todense()
y_test = np.array(y_test).flatten()
y_subset = y_test[randomlist]


In [None]:
# send the dense feature row to the endpoint
predictions = predictor.predict(X_subset)

print('prediction:', predictions)
print('     label:', y_subset)
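
As far as we know, the Scikit-learn predictor serializes NumPy input to the `application/x-npy` format by default and decodes the response the same way. The sketch below reproduces that round trip locally with `np.save` and `np.load`, which can help when debugging payloads. The helper names are hypothetical.

```python
import io

import numpy as np


def to_npy_bytes(array):
    """Encode an array in the .npy binary format."""
    buffer = io.BytesIO()
    np.save(buffer, np.asarray(array))
    return buffer.getvalue()


def from_npy_bytes(payload):
    """Decode .npy bytes back into an array."""
    return np.load(io.BytesIO(payload))


row = np.array([[0.0, 0.5, 0.0, 0.25]])
payload = to_npy_bytes(row)
restored = from_npy_bytes(payload)
```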


### Endpoint cleanup <a class="anchor" id="endpoint_cleanup"></a>

When you're done with the endpoint, you'll want to clean it up.

In [None]:
predictor.delete_endpoint()

## Batch Transform <a class="anchor" id="batch_transform"></a>
We can also use the trained model for asynchronous batch inference on S3 data using SageMaker Batch Transform.

In [None]:
# Create a Transformer from the SKLearnModel defined above
transformer = sklearn_model.transformer(instance_count=1, instance_type="ml.c5.xlarge")

### Prepare Input Data <a class="anchor" id="prepare_input_data"></a>
We take 20 random rows from the test data, keeping the features (X) and the labels (y) in separate arrays, and then upload the features to a location in S3.

In [None]:
# get 20 random rows of data from our test data
randomlist = random.sample(range(0, X_test_vectors.shape[0]), 20)

#X_subset = X_test_vectors.tocsr()[0:20,].todense()
X_subset = X_test_vectors.tocsr()[randomlist,].todense()
y_test = np.array(y_test).flatten()
y_subset = y_test[randomlist]


In [None]:
# Upload input data from local filesystem to S3
np.savetxt('X_sample_data', X_subset, fmt='%s', delimiter=',')
batch_input_s3 = sagemaker_session.upload_data("X_sample_data", key_prefix=prefix + "/batch_input")
batch_input_s3

### Run Transform Job <a class="anchor" id="run_transform_job"></a>
Using the Transformer, run a transform job on the S3 input data.

In [None]:
# Start a transform job and wait for it to finish
transformer.transform(batch_input_s3, content_type="text/csv")
print("Waiting for transform job: " + transformer.latest_transform_job.job_name)
transformer.wait()

### Check Output Data  <a class="anchor" id="check_output_data"></a>
After the transform job has completed, download the output data from S3. For each file "f" in the input data, we have a corresponding file "f.out" containing the predicted labels from each input row. We can compare the predicted labels to the true labels saved earlier.

In [None]:
# S3 location where the transform job wrote its output
batch_output = transformer.output_path

In [None]:
predictions_file = 'X_sample_data.out'
!aws s3 cp --recursive $batch_output/ ./data/

In [None]:
with open('./data/'+predictions_file) as file:
    lines = file.readlines()
    
predictions = np.fromstring(lines[0][1:-1], sep=',')
predictions = predictions.astype(int)

print('predictions:', predictions)
print('     labels:', y_subset)
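
The slicing above implies that the output file holds a single bracketed, comma-separated list such as `[3, 7, 1]`. Under that assumption, the line is also valid JSON, so a slightly more robust sketch can parse it with the standard library. `parse_predictions` is a hypothetical helper.

```python
import json


def parse_predictions(line):
    """Parse one bracketed list of predicted labels into Python ints."""
    return [int(value) for value in json.loads(line)]


print(parse_predictions("[3, 7, 1]\n"))
```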

In [None]:
from sklearn.metrics import f1_score

f1 = f1_score(y_subset, predictions, average='macro')
print('f1 = %s' % (f1))
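
To make `average='macro'` concrete, here is a small plain-Python sketch that computes the per-class F1 score and then takes the unweighted mean over all classes seen in either list. It is meant as an illustration of the definition, not a replacement for `f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


print(macro_f1([0, 0, 1, 2], [0, 1, 1, 2]))
```

In this toy example the per-class scores are 2/3, 2/3, and 1, so the macro average is 7/9, or about 0.78. With only 20 rows spread over 20 newsgroups, a single wrong prediction shifts the macro score noticeably, so the value above is best read as a smoke test.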