### Load the required packages

In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import label_binarize
from sklearn.pipeline import Pipeline 

from sklearn.model_selection import train_test_split

import sagemaker
from sagemaker import get_execution_role

# 1. Create and start a SageMaker Instance
An Amazon SageMaker notebook instance is an ML compute instance running the Jupyter Notebook App. An instance is a virtual machine for which we can choose properties such as processors, GPU, and RAM, based on the project requirements. To create a notebook instance, use the SageMaker console.

## Initiate a SageMaker Session
A SageMaker session object represents the session that we are currently operating within. It manages interactions with the Amazon SageMaker APIs and any other AWS services needed. The class provides convenient methods for manipulating entities and resources that Amazon SageMaker uses, such as training jobs, endpoints, and input datasets in S3. We will discuss them in detail later.

In [6]:
session = sagemaker.Session()

### Get execution role
Get the notebook instance's execution role, which is the IAM role that we created for our SageMaker notebook instance.

In [None]:
role = get_execution_role()

### Set S3 bucket and folders
Then extract the default bucket assigned to this session using the session method (sagemaker.Session().default_bucket()), or provide an existing bucket name. Let us also choose a folder name (prefix) in the S3 bucket under which to store all the data and models.

In [7]:
s3_bucket = sagemaker.Session().default_bucket()
s3_prefix = 'spam-data' #prefix used for data stored within the bucket
s3_path = 's3://{}/{}/'.format(s3_bucket, s3_prefix)

# 2. Prepare the data
We will begin by uploading data to SageMaker. This can be done in two ways: upload it to a local directory or to S3.

### Upload data to S3
To upload the data to S3, create an S3 bucket. Follow this blog to create a bucket, then upload the data to it and note down the bucket name and the file name of the data.

In [10]:
data = pd.read_csv('SMSSpamCollection.txt', sep="\t", header=None, names = ['labels', 'messages'])

In [11]:
data.head()

|   | labels | messages |
|---|--------|----------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


### Preprocess & split dataset into training and testing
Next, convert the textual data into numerical values. We will use sklearn to preprocess the data.
- Import all the requisite packages from the sklearn library.
- Build a TF-IDF pipeline to preprocess the data.
- Convert the label column to numerical values (0 for ham and 1 for spam).

In [12]:
tf_idf = Pipeline([('cv',CountVectorizer()), ('tfidf_transformer',TfidfTransformer(smooth_idf=True,use_idf=True))])
tf_idf_vector  = pd.DataFrame(tf_idf.fit_transform(data['messages']).todense())
data['labels'] = label_binarize(data['labels'], classes=['ham', 'spam'])
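To make the TF-IDF step more concrete, here is a minimal pure-Python sketch of the weighting that a CountVectorizer followed by TfidfTransformer(smooth_idf=True) produces: raw term counts multiplied by ln((1 + n) / (1 + df)) + 1, then each row scaled to unit L2 length. The function name tfidf_rows and the three toy messages are assumptions for illustration only; the notebook itself relies on sklearn.

```python
import math
import re
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) of smoothed, L2-normalised TF-IDF weights."""
    # Tokenise roughly like CountVectorizer's default: lowercase, words of 2+ characters.
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocabulary}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocabulary]
        # Scale the row to unit L2 length (guarding against empty documents).
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

# Toy messages, made up for this example.
vocabulary, rows = tfidf_rows(["free entry to win", "ok see you", "win free cash"])
print(vocabulary)
print([round(w, 3) for w in rows[0]])
```

In the first toy message, "entry" appears in only one document while "free" appears in two, so "entry" receives the larger weight.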

In [34]:
X_train, X_test, y_train, y_test = train_test_split(tf_idf_vector, data['labels'], test_size=0.3, random_state=2020)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=2020)
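The two consecutive 70/30 splits leave roughly 49% of the rows for training, 21% for validation, and 30% for testing. The sketch below works through the arithmetic, assuming that train_test_split rounds a fractional test share up and gives the remainder to training. The total of 5572 rows is an assumption (the row count is not stated in this notebook), and split_sizes is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # the test share is rounded up
    return n_samples - n_test, n_test

n_total = 5572  # assumed row count; not stated in this notebook
n_train_full, n_test = split_sizes(n_total, 0.3)   # first split: train vs. test
n_train, n_val = split_sizes(n_train_full, 0.3)    # second split: train vs. validation
print(n_train, n_val, n_test)
```

Under that assumption the training set has 2730 rows, which agrees with the 2730x8713 matrix reported in the training log further down.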

### Upload train and test data to S3
To upload the data to S3, first save it to a folder in the local directory.
- Create folder: The code below creates the folder if it does not already exist.
- Save to local directory: Write the files without a header or index, and with the label in the first column of the training and validation files, since that is the format required by the AWS training code.

In [16]:
import os

# Create the local folder if it does not already exist.
data_dir = 'data'
os.makedirs(data_dir, exist_ok=True)

# Write CSVs without header or index; the label is the first column of the train and validation files.
X_test.to_csv(os.path.join(data_dir, 'test_data.csv'), header=False, index=False)
pd.concat([y_train, X_train], axis=1).to_csv(os.path.join(data_dir, 'train_data.csv'), header=False, index=False)
pd.concat([y_val, X_val], axis=1).to_csv(os.path.join(data_dir, 'val_data.csv'), header=False, index=False)
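As a small illustration of the file layout written above, the following sketch serialises two made-up rows in the same shape: no header, no index, and the label as the first value on each line. The helper to_training_csv and the sample values are hypothetical and use only the standard library.

```python
import csv
import io

def to_training_csv(labels, feature_rows):
    """Serialise rows as 'label,feature_1,...,feature_n' with no header or index."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for label, features in zip(labels, feature_rows):
        writer.writerow([label, *features])
    return buffer.getvalue()

# Two made-up rows: label first, then the feature values.
sample = to_training_csv([0, 1], [[0.0, 0.5, 0.0], [0.7, 0.0, 0.3]])
print(sample)
```

The test file follows the same layout but without the leading label column, because the labels are what the model is asked to predict.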

In [17]:
test_path = session.upload_data(os.path.join(data_dir, 'test_data.csv'), key_prefix = s3_prefix)
train_path = session.upload_data(os.path.join(data_dir, 'train_data.csv'), key_prefix = s3_prefix)
val_path = session.upload_data(os.path.join(data_dir, 'val_data.csv'), key_prefix = s3_prefix)

In [None]:
# Free memory now that the data is saved; y_test is kept to score the predictions later.
X_train = X_test = y_train = X_val = y_val = None

# 3. Get the XGBoost algorithm image

Estimators are a high-level interface that handles end-to-end Amazon SageMaker training and deployment tasks.
An Estimator object requires three main inputs:
1. sagemaker_session: We will use the session object that we created in the first section.
2. role: We will use the execution role object that we created in the first section.
3. image_name: The URI of the container image for the algorithm in the region that we are running in. We obtain it with get_image_uri, which requires two inputs: the region name, which can be read from the session (session.boto_region_name) or from boto3 (boto3.Session().region_name), and the name of the algorithm, in our case XGBoost.

In [18]:
from sagemaker.amazon.amazon_estimator import get_image_uri
container = get_image_uri(session.boto_region_name, 'xgboost')
# container = get_image_uri(boto3.Session().region_name, 'xgboost')

'get_image_uri' method will be deprecated in favor of 'ImageURIProvider' class in SageMaker Python SDK v2.
There is a more up to date SageMaker XGBoost image. To use the newer image, please set 'repo_version'='1.0-1'. For example:
	get_image_uri(region, 'xgboost', '1.0-1').


### Construct the estimator object
To construct an estimator object, we will need to provide an s3 path for the estimator object to save the model. Let us append the name of the model output folder (model_output) to the s3 path (s3_path) that we created in the first section to build a path to save model outputs (artifacts).

In [19]:
output_path = s3_path + 'model_output'
output_path 

's3://sagemaker-eu-west-2-629866591278/spam-data/model_output'
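Plain string concatenation works here because s3_path already ends with a slash. A slightly more defensive way to build and take apart such paths is sketched below; join_s3_uri, split_s3_uri, and the bucket name example-bucket are hypothetical and not part of the SageMaker SDK.

```python
def join_s3_uri(bucket, *parts):
    """Join a bucket and key segments into an s3:// URI without duplicate slashes."""
    segments = [part.strip("/") for part in parts if part.strip("/")]
    return "s3://" + "/".join([bucket, *segments])

def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: " + uri)
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

# 'example-bucket' is a placeholder bucket name.
uri = join_s3_uri("example-bucket", "spam-data/", "model_output")
print(uri)
print(split_s3_uri(uri))
```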

### Define hyperparameters
We will build a dictionary of the hyperparameters that we would like to set, then pass this dictionary to the estimator using its set_hyperparameters method. For a detailed understanding of parameters in XGBoost models, refer to this link.

In [20]:
hyperparameters = {
    "max_depth": 5,
    "eta": 0.2,
    "gamma": 2,
    "min_child_weight": 5,
    "subsample": 0.8,
    "objective": "binary:logistic",
    "early_stopping_rounds": 25,
    "num_round": 150,
}

We now have the objects and output path required to create an estimator object. For a detailed understanding of other parameters in an estimator object, please refer to this link. We will only be using a few high-level parameters in this project.

In [21]:
## get estimator
classifier = sagemaker.estimator.Estimator(
            container,
            role,
            train_instance_count=1,
            train_instance_type='ml.m4.xlarge',
            output_path=output_path,
            sagemaker_session=session)
## set hyperparameters
classifier.set_hyperparameters(**hyperparameters)

Parameter image_name will be renamed to image_uri in SageMaker Python SDK v2.
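The deprecation warnings in this notebook come from SageMaker Python SDK v1. As a rough guide for porting the Estimator call to SDK v2, the sketch below renames the affected keyword arguments. The mapping lists only the renames relevant to this notebook, and to_v2_kwargs, the placeholder image URI, and example-bucket are hypothetical names used for illustration.

```python
# SDK v1 keyword -> SDK v2 keyword for sagemaker.estimator.Estimator.
V1_TO_V2_ESTIMATOR_KWARGS = {
    "image_name": "image_uri",
    "train_instance_count": "instance_count",
    "train_instance_type": "instance_type",
}

def to_v2_kwargs(v1_kwargs):
    """Return a copy of v1_kwargs with renamed keys; other keys pass through unchanged."""
    return {V1_TO_V2_ESTIMATOR_KWARGS.get(key, key): value for key, value in v1_kwargs.items()}

v2_kwargs = to_v2_kwargs({
    "image_name": "<container image URI>",  # placeholder, not a real URI
    "train_instance_count": 1,
    "train_instance_type": "ml.m4.xlarge",
    "output_path": "s3://example-bucket/spam-data/model_output",
})
print(v2_kwargs)
```

In SDK v2, get_image_uri is likewise replaced by sagemaker.image_uris.retrieve and s3_input by sagemaker.inputs.TrainingInput.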


# 4. Fit the model
With the estimator object set up, SageMaker can now fit the model. We specify the location of the training and validation data by wrapping each S3 URI in a sagemaker.s3_input object.

In [22]:
s3_train = sagemaker.s3_input(s3_data=train_path, content_type='csv')
s3_val = sagemaker.s3_input(s3_data=val_path, content_type='csv')
classifier.fit({
               'train':s3_train,
               'validation':s3_val,
               })

's3_input' class will be renamed to 'TrainingInput' in SageMaker Python SDK v2.


2020-09-16 21:02:40 Starting - Starting the training job...
2020-09-16 21:02:46 Starting - Launching requested ML instances......
2020-09-16 21:03:52 Starting - Preparing the instances for training......
2020-09-16 21:04:43 Downloading - Downloading input data...
2020-09-16 21:05:35 Training - Training image download completed. Training in progress..
Arguments: train
[2020-09-16:21:05:35:INFO] Running standalone xgboost training.
[2020-09-16:21:05:35:INFO] File size need to be processed in the node: 130.41mb. Available memory size in the node: 8489.63mb
[2020-09-16:21:05:35:INFO] Determined delimiter of CSV input is ','
[21:05:35] S3DistributionType set as FullyReplicated
[21:05:36] 2730x8713 matrix with 23786490 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2020-09-16:21:05:36:INFO] Determined delimiter of CSV input is ','
[21:05:36] S3DistributionType set as FullyReplicated

# 5. Test the Model
### Batch Transform
To test the model that we created, we will use SageMaker's Batch Transform functionality, which will split the test data into batches, send it to the model, and merge the results. 

### Transformer object
To start with, build a transformer object from the model that we trained.

In [24]:
classifier_transformer = classifier.transformer(instance_count=1, instance_type='ml.m4.xlarge')

Parameter image will be renamed to image_uri in SageMaker Python SDK v2.


### Batch transform job
SageMaker will begin a batch transform job that applies our trained model to the test data stored in S3. We need to provide the data location, the content type (so that the data is serialized correctly), and the split type (to split the data into batches). SageMaker runs the batch transform job in the background. To follow its progress and block until it finishes, use the wait method of the transformer object.

In [25]:
classifier_transformer.transform(test_path, content_type='text/csv', split_type='Line')

In [26]:
classifier_transformer.wait()
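To give an intuition for what splitting by line and batching means, here is a conceptual sketch that groups newline-delimited records into mini-batches under a payload limit. This is not SageMaker's implementation; make_mini_batches, the 16-byte limit, and the sample payload are assumptions chosen to keep the example small.

```python
def make_mini_batches(payload, max_payload_bytes):
    """Group newline-delimited records into batches no larger than max_payload_bytes."""
    batches, current, current_size = [], [], 0
    for record in payload.splitlines():
        record_size = len(record.encode("utf-8")) + 1  # +1 for the newline
        # Start a new batch when adding this record would exceed the limit.
        if current and current_size + record_size > max_payload_bytes:
            batches.append("\n".join(current) + "\n")
            current, current_size = [], 0
        current.append(record)
        current_size += record_size
    if current:
        batches.append("\n".join(current) + "\n")
    return batches

# Three made-up CSV records of 8 bytes each (including the newline).
payload = "0.1,0.2\n0.3,0.4\n0.5,0.6\n"
batches = make_mini_batches(payload, max_payload_bytes=16)
print(batches)
```

Concatenating the batches gives back the original payload, which is why the per-batch results can be merged into a single output file.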

.............................
2020-09-16T21:10:58.600:[sagemaker logs]: MaxConcurrentTransforms=4, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
Arguments: serve
[2020-09-16 21:10:58 +0000] [1] [INFO] Starting gunicorn 19.7.1
Arguments: serve
[2020-09-16 21:10:58 +0000] [1] [INFO] Starting gunicorn 19.7.1
[2020-09-16 21:10:58 +0000] [1] [INFO] Listening at: http://0.0.0.0:8080 (1)
[2020-09-16 21:10:58 +0000] [1] [INFO] Using worker: gevent
[2020-09-16 21:10:58 +0000] [37] [INFO] Booting worker with pid: 37
[2020-09-16 21:10:58 +0000] [38] [INFO] Booting worker with pid: 38
[2020-09-16 21:10:58 +0000] [39] [INFO] Booting worker with pid: 39
[2020-09-16 21:10:58 +0000] [40] [INFO] Booting worker with pid: 40
[2020-09-16:21:10:58:INFO] Model loaded successfully for worker : 39
[2020-09-16:21:10:58:INFO] Model loaded successfully for worker : 37
[2020-09-16:21:10:58:INFO]

In [41]:
!aws s3 cp --recursive $classifier_transformer.output_path $data_dir
predictions = pd.read_csv(os.path.join(data_dir, 'test_data.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

download: s3://sagemaker-eu-west-2-629866591278/xgboost-2020-09-16-21-06-22-378/test_data.csv.out to data/test_data.csv.out


In [42]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

0.9712918660287081
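With the binary:logistic objective the model outputs a probability per message, and the notebook turns it into a label with round. Note that Python's round sends a value of exactly 0.5 to 0, so an explicit threshold can make the cut-off clearer. The sketch below shows that alternative together with a hand-written accuracy check; to_labels, accuracy, and the sample probabilities are hypothetical and only mirror what accuracy_score computes.

```python
def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels; values at or above threshold become 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up probabilities and true labels for illustration.
probabilities = [0.02, 0.91, 0.5, 0.35]
labels = to_labels(probabilities)
print(labels)
print(accuracy([0, 1, 1, 1], labels))
```

Lowering the threshold would flag more messages as spam at the cost of more false positives, which is a choice that accuracy alone does not capture.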