# Sentiment Analysis with XGBoost

In this notebook we build a sentiment analysis model using the XGBoost algorithm as provided by [Amazon's SageMaker](https://aws.amazon.com/sagemaker/).



## Imports

In [1]:
import os
import glob
import re
import pickle
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
from sklearn.metrics import accuracy_score





[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. Download the Data

To get the data we first create a directory using `mkdir`.
The [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/) dataset comes with a download link: [http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz).
We download the dataset and save it in that directory using `!wget -O`; for more information see [GNU download options](https://www.gnu.org/software/wget/manual/html_node/Download-Options.html#Download-Options).

We then extract the archive with `tar -zxf`, where the flags mean:
- `z` the archive is gzip-compressed
- `x` extract
- `f` read from the given file.

The `-C` option tells `tar` to extract into the given directory.

The code block below summarizes these three steps:

In [2]:
# mkdir = make new directory
%mkdir ../data
!wget -O ../data/aclImdb_v1.tar.gz http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -zxf ../data/aclImdb_v1.tar.gz -C ../data

mkdir: cannot create directory ‘../data’: File exists
--2019-08-11 16:33:33--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘../data/aclImdb_v1.tar.gz’


2019-08-11 16:33:35 (42.3 MB/s) - ‘../data/aclImdb_v1.tar.gz’ saved [84125825/84125825]
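
The extraction step can also be done from Python with the standard library. The following is a minimal sketch, not part of the original notebook: the helper name `extract_tar_gz` and the tiny throwaway archive are assumptions made for illustration only.

```python
import os
import tarfile
import tempfile


def extract_tar_gz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir,
    similar to `tar -zxf archive -C dest_dir`."""
    os.makedirs(dest_dir, exist_ok=True)  # no error if the directory already exists
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)


# Demonstration on a tiny throwaway archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "review.txt")
    with open(src, "w") as f:
        f.write("a great film")

    archive_path = os.path.join(tmp, "sample.tar.gz")
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(src, arcname="sample/review.txt")

    out_dir = os.path.join(tmp, "data")
    extract_tar_gz(archive_path, out_dir)
    with open(os.path.join(out_dir, "sample", "review.txt")) as f:
        extracted_text = f.read()

print(extracted_text)
```

Using `os.makedirs(..., exist_ok=True)` avoids the "File exists" message seen above when the cell is re-run. As with the `tar` command, only archives from a trusted source should be extracted this way.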



# 2. Prepare the Data

We first load the raw reviews into memory. The helper function `read_imdb_data` reads the raw files into data and labels. The `prepare_imdb_data` function then merges the positive and negative reviews and shuffles them; the split between training and testing comes from the dataset's own `train` and `test` directories.

In [3]:
def read_imdb_data(data_dir='../data/aclImdb'):
    #store data and labels in empty dict
    data = {}
    labels = {}
    
    for data_type in ['train', 'test']:
        data[data_type] = {}
        labels[data_type] = {}
        
        for sentiment in ['pos', 'neg']:
            data[data_type][sentiment] = []
            labels[data_type][sentiment] = []
            
            path = os.path.join(data_dir, data_type, sentiment, '*.txt')
            files = glob.glob(path)
            
            for f in files:
                with open(f) as review:
                    data[data_type][sentiment].append(review.read())
                    # Here we represent a positive review by '1' and a negative review by '0'
                    labels[data_type][sentiment].append(1 if sentiment == 'pos' else 0)
                    
            assert len(data[data_type][sentiment]) == len(labels[data_type][sentiment]), \
                    "{}/{} data size does not match labels size".format(data_type, sentiment)
                
    return data, labels

In [4]:
#get the data and labels
data, labels = read_imdb_data()
print("IMDB reviews: train = {} pos / {} neg, test = {} pos / {} neg".format(
            len(data['train']['pos']), len(data['train']['neg']),
            len(data['test']['pos']), len(data['test']['neg'])))

IMDB reviews: train = 12500 pos / 12500 neg, test = 12500 pos / 12500 neg


In [5]:
def prepare_imdb_data(data, labels):
    """Prepare training and test sets from IMDb movie reviews."""
    
    #Combine positive and negative reviews and labels
    data_train = data['train']['pos'] + data['train']['neg']
    data_test = data['test']['pos'] + data['test']['neg']
    labels_train = labels['train']['pos'] + labels['train']['neg']
    labels_test = labels['test']['pos'] + labels['test']['neg']
    
    #Shuffle reviews and corresponding labels within training and test sets
    data_train, labels_train = shuffle(data_train, labels_train)
    data_test, labels_test = shuffle(data_test, labels_test)
    
    # Return unified training data, test data, training labels, test labels
    return data_train, data_test, labels_train, labels_test

In [6]:
#split between train and test
train_X, test_X, train_y, test_y = prepare_imdb_data(data, labels)
print("IMDb reviews (combined): train = {}, test = {}".format(len(train_X), len(test_X)))

IMDb reviews (combined): train = 25000, test = 25000
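
`sklearn.utils.shuffle` shuffles the reviews and their labels consistently, so each review keeps its label. The idea can be sketched in plain Python as below. This is an illustration only: `shuffle_together` and the toy reviews are hypothetical, and the `seed` argument is an assumption (the notebook itself does not fix a seed, so its shuffle differs from run to run).

```python
import random


def shuffle_together(items, labels, seed=None):
    """Shuffle two equally long lists with the same permutation."""
    paired = list(zip(items, labels))
    random.Random(seed).shuffle(paired)  # shuffle the (item, label) pairs in place
    shuffled_items, shuffled_labels = zip(*paired)
    return list(shuffled_items), list(shuffled_labels)


reviews = ["loved it", "hated it", "great cast", "dull plot"]
labels = [1, 0, 1, 0]
original_pairs = set(zip(reviews, labels))

shuffled_reviews, shuffled_labels = shuffle_together(reviews, labels, seed=0)

# Every review is still attached to its original label.
pairs_preserved = set(zip(shuffled_reviews, shuffled_labels)) == original_pairs
print(pairs_preserved)
```

For a reproducible shuffle with scikit-learn itself, `shuffle` accepts a `random_state` argument.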


In [7]:
#see sample review
train_X[100]

"I had no idea what this movie was until I read about it in the L.A. Weekly. I generally agree with the reviews in the LA Weekly and decided to get a ticket for this film. the film stars molly parker (from my favorite television show Deadwood) and Lukas haas -- who I suspect we will be seeing more of in the very near future. The film is funny, heartwarming, features great acting, and beautiful photography. i don't know if the film has distribution, but I hope it does - or will - soon. this is destined to be a real indie gem. it even has music by my favorite band the silver jews! the only disappointment was that molly parker wasn't there at the screening. even without her there... this was hands down the best film i saw at the festival."

# 3. Process the Data

We now want to process the raw data into a format that an ML algorithm can read. We remove all HTML formatting, normalize the text (lower-casing, stop-word removal and stemming), and then build a bag-of-words representation.

In [8]:
#instantiate the stemmer and load the stop-word list once, not once per word
stemmer = PorterStemmer()
english_stopwords = set(stopwords.words("english"))

In [9]:
def review_to_words(review):
    """
    Convert a raw review into a list of stemmed words:
    strip HTML, lower-case, drop non-alphanumerics and stop words, then stem.
    """
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Lower-case; replace non-alphanumerics with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in english_stopwords] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem with the shared Porter stemmer
    
    return words
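
To see what these cleaning steps do to a single review, here is a dependency-free sketch. It is not the notebook's function: `clean_review` uses a crude regular expression as a stand-in for BeautifulSoup, a tiny hypothetical stop-word set instead of NLTK's English list, and it omits stemming.

```python
import re

# Hypothetical, tiny stop-word set for illustration; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "was", "and", "it", "a", "i"}


def clean_review(review):
    """Strip tags, lower-case, drop non-alphanumerics and stop words."""
    text = re.sub(r"<[^>]+>", " ", review)             # crude tag removal
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())  # keep letters and digits only
    return [w for w in text.split() if w not in TOY_STOPWORDS]


tokens = clean_review("The acting was GREAT!<br />I loved it, and the music.")
print(tokens)  # ['acting', 'great', 'loved', 'music']
```

In the real `review_to_words`, the Porter stemmer would additionally reduce each remaining word to its stem.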

In [10]:
# where to store cache files
cache_dir = os.path.join("../cache", "sentiment_analysis")  
# ensure cache directory exists
os.makedirs(cache_dir, exist_ok=True) 

In [11]:
def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    """Convert each review to words; read from cache if available."""

    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Preprocess training and test data to obtain words for each review
        #words_train = list(map(review_to_words, data_train))
        #words_test = list(map(review_to_words, data_test))
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        # Write to cache file for future runs
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [12]:
# Preprocess data
train_X, test_X, train_y, test_y = preprocess_data(train_X, test_X, train_y, test_y)

Wrote preprocessed data to cache file: preprocessed_data.pkl


We transform each review into a [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) feature representation. To avoid data leakage between the training and testing datasets, we build the vocabulary from the training data only and then apply that same vocabulary to the test data.

In [13]:
def extract_BoW_features(words_train, words_test, vocabulary_size=5000,
                         cache_dir=cache_dir, cache_file="bow_features.pkl"):
    """Extract Bag-of-Words for a given set of documents, already preprocessed into words."""
    
    # If cache_file is not None, try to read from it first
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = joblib.load(f)
            print("Read features from cache file:", cache_file)
        except Exception:
            pass  # unable to read from cache, but that's okay
    
    # If cache is missing, then do the heavy lifting
    if cache_data is None:
        # Fit a vectorizer to training documents and use it to transform them
        # NOTE: Training documents have already been preprocessed and tokenized into words;
        #       pass in dummy functions to skip those steps, e.g. preprocessor=lambda x: x
        vectorizer = CountVectorizer(max_features=vocabulary_size,
                preprocessor=lambda x: x, tokenizer=lambda x: x)  # already preprocessed
        features_train = vectorizer.fit_transform(words_train).toarray()

        # Apply the same vectorizer to transform the test documents (ignore unknown words)
        features_test = vectorizer.transform(words_test).toarray()
        
        # NOTE: .toarray() converts the sparse matrix into a dense NumPy array,
        #       which is easier to work with but uses more memory
        
        # Write to cache file for future runs (store vocabulary as well)
        if cache_file is not None:
            vocabulary = vectorizer.vocabulary_
            cache_data = dict(features_train=features_train, features_test=features_test,
                             vocabulary=vocabulary)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                joblib.dump(cache_data, f)
            print("Wrote features to cache file:", cache_file)
    else:
        # Unpack data loaded from cache file
        features_train, features_test, vocabulary = (cache_data['features_train'],
                cache_data['features_test'], cache_data['vocabulary'])
    
    # Return both the extracted features as well as the vocabulary
    return features_train, features_test, vocabulary

In [14]:
# Extract Bag of Words features for both training and test datasets
train_X, test_X, vocabulary = extract_BoW_features(train_X, test_X)

Wrote features to cache file: bow_features.pkl
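
The idea behind `extract_BoW_features` can be illustrated without scikit-learn. The sketch below is a simplified stand-in for `CountVectorizer`, with hypothetical helper names and toy documents: the vocabulary is fitted on the training documents only, and words that appear only in the test documents are ignored.

```python
from collections import Counter


def fit_vocabulary(tokenized_docs, vocabulary_size):
    """Keep the most frequent words of the training documents."""
    counts = Counter(word for doc in tokenized_docs for word in doc)
    most_common = [word for word, _ in counts.most_common(vocabulary_size)]
    # Map each kept word to a column index (alphabetical order).
    return {word: index for index, word in enumerate(sorted(most_common))}


def transform(tokenized_docs, vocabulary):
    """Count vocabulary words per document; unknown words are skipped."""
    rows = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for word in doc:
            index = vocabulary.get(word)
            if index is not None:
                row[index] += 1
        rows.append(row)
    return rows


train_docs = [["great", "film", "great", "cast"],
              ["dull", "film", "great"],
              ["film", "dull"]]
test_docs = [["great", "plot"], ["dull", "dull", "film"]]

vocab = fit_vocabulary(train_docs, vocabulary_size=3)
train_features = transform(train_docs, vocab)
test_features = transform(test_docs, vocab)

print(vocab)          # {'dull': 0, 'film': 1, 'great': 2}
print(test_features)  # [[0, 0, 1], [2, 1, 0]]
```

The word "plot" occurs only in the test documents, so it contributes nothing to the features; this is what keeps information about the test set out of the vocabulary.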


# 4. Classify with XGBoost

## 4.1 Write the Dataset

The XGBoost classifier requires that the dataset be written to a file and stored using Amazon S3. We split the training dataset into two parts: training and validation.

We write those datasets to a file and upload the files to S3. Furthermore we will write the test set input to a file and also upload it to S3. This is so that we can use SageMaker's [Batch Transform](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html) functionality to test the model once fitting is done.

In [15]:
#  Split the train_X and train_y arrays into the DataFrames val_X, train_X and val_y, train_y. Make sure that
#  val_X and val_y contain 10 000 entries while train_X and train_y contain the remaining 15 000 entries.


val_X = pd.DataFrame(train_X[:10000])
train_X = pd.DataFrame(train_X[10000:])

val_y = pd.DataFrame(train_y[:10000])
train_y = pd.DataFrame(train_y[10000:])

The [documentation for the XGBoost algorithm in SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html) requires that the saved datasets should contain no headers or index and that for the training and validation data, the label should occur first for each sample:

<i>For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record. For CSV inference, the algorithm assumes that CSV input does not have the label column.</i>

In [16]:
# Make sure that the local directory in which we'd like to store the training and validation csv files exists.
data_dir = '../data/xgboost'
if not os.path.exists(data_dir):
    os.makedirs(data_dir)

In [17]:
# Save the test data to test.csv in the data_dir directory.
# Note: we do not save the associated ground truth labels;
# instead we will use them later to compare with our model output.

pd.DataFrame(test_X).to_csv(os.path.join(data_dir, 'test.csv'), header=False, index=False)

# Save the training and validation data to train.csv and validation.csv in the data_dir directory.
# Make sure that the files you create are in the correct format.

pd.concat([val_y, val_X], axis=1).to_csv(os.path.join(data_dir, 'validation.csv'), header=False, index=False)
pd.concat([train_y, train_X], axis=1).to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
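
The file layout described in the quoted documentation (no header, no index, label in the first column for training and validation, no label column for inference) can be shown on a toy example. This is a minimal sketch using only the standard library; `write_xgboost_csv` and the toy values are hypothetical and are not part of the notebook.

```python
import csv
import io


def write_xgboost_csv(stream, features, labels=None):
    """Write rows without a header; put the label first when labels are given."""
    writer = csv.writer(stream, lineterminator="\n")
    for i, row in enumerate(features):
        prefix = [labels[i]] if labels is not None else []
        writer.writerow(prefix + list(row))


features = [[0, 2, 1], [1, 0, 0]]
labels = [1, 0]

train_buffer = io.StringIO()
write_xgboost_csv(train_buffer, features, labels)  # training/validation layout

test_buffer = io.StringIO()
write_xgboost_csv(test_buffer, features)           # inference layout, no label

train_csv = train_buffer.getvalue()
test_csv = test_buffer.getvalue()
print(train_csv)
print(test_csv)
```

The first column of `train_csv` holds the labels, while `test_csv` contains only the feature columns.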

In [18]:
# Save memory: we can set test_X, train_X, val_X, train_y and val_y to None.

test_X = train_X = val_X = train_y = val_y = None

## 4.2 Uploading Training / Validation files to S3

Amazon's S3 service allows us to store files that can be accessed both by built-in training models, such as the XGBoost model we will be using, and by custom models.

There are two functionalities we can use with SageMaker:
- Low level 
- High level

The low level approach requires knowing each of the objects involved in the SageMaker environment. The high level approach lets AWS make certain choices on the user's behalf. The low level approach benefits from allowing a lot of flexibility, whereas the high level approach can make development quicker. In this notebook we use the high level approach.

For this part we will draw heavily on the __[SageMaker API documentation](http://sagemaker.readthedocs.io/en/latest/)__ and the __[SageMaker Developer Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/)__.

The [upload_data](https://sagemaker.readthedocs.io/en/latest/session.html?highlight=upload_data#sagemaker.session.Session.upload_data) method uploads a local file or directory to S3. It is a member of the object representing our current SageMaker session. This method uploads the data to the default bucket, which AWS creates for us if it doesn't exist already, under the path described by the `key_prefix` argument. If we navigate to the S3 console, we should find our files there.

In [19]:
# Store the current SageMaker session
session = sagemaker.Session() 

# S3 prefix (which folder will we use)
prefix = 'sentiment-xgboost'

# Upload the test.csv, train.csv and validation.csv files 
# which are contained in data_dir to S3 using session.upload_data().

test_location = session.upload_data(os.path.join(data_dir, 'test.csv'), key_prefix=prefix)
val_location = session.upload_data(os.path.join(data_dir, 'validation.csv'), key_prefix=prefix)
train_location = session.upload_data(os.path.join(data_dir, 'train.csv'), key_prefix=prefix)
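
Each call returns the S3 location of the uploaded file. As a rough sketch of what to expect, assuming a single file is stored as `s3://<bucket>/<key_prefix>/<file name>`, the location can be composed as below. The helper `expected_s3_uri` and the bucket name are hypothetical; printing `train_location` in the notebook is the reliable way to check the actual value.

```python
import os


def expected_s3_uri(bucket, key_prefix, local_path):
    """Compose the S3 URI assumed for a single uploaded file."""
    return "s3://{}/{}/{}".format(bucket, key_prefix, os.path.basename(local_path))


# "my-example-bucket" is a made-up name, not the session's default bucket.
example_uri = expected_s3_uri("my-example-bucket", "sentiment-xgboost",
                              "../data/xgboost/train.csv")
print(example_uri)  # s3://my-example-bucket/sentiment-xgboost/train.csv
```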

## 4.3 Creating the XGBoost model

A model on SageMaker consists of three components:
- Model Artifacts
- Training Code (Container)
- Inference Code (Container)

The __Model Artifacts__ are the actual model itself. For this case the artifacts are the trees created during training.

The __Training Code__ and the __Inference Code__ are used to manipulate the model artifacts. The training code uses the provided training data to create the model artifacts, and the inference code uses the model artifacts to make predictions on new data.

SageMaker runs the training and inference codes by making use of [docker containers](https://sagemaker-workshop.com/custom/containers.html#the-dockerfile), a way to package code and ensure that dependencies are not an issue.

In [20]:
# Our current execution role is required when creating the model as the training
# and inference code will need to access the model artifacts.
role = get_execution_role()

In [21]:
# We need to retrieve the location of the container which is provided by Amazon for using XGBoost.
# For convenience, the training and inference code both use the same container.

container = get_image_uri(session.boto_region_name, 'xgboost')

In [22]:
# Create a SageMaker estimator using the container location determined in the previous cell.
# It is recommended that we use a single training instance of type ml.m4.xlarge. It is also
# recommended that we use 's3://{}/{}/output'.format(session.default_bucket(), prefix) as the
# output path.


xgb = sagemaker.estimator.Estimator(container, # The location of the container we wish to use
                                    role,                                    # What is our current IAM Role
                                    train_instance_count=1,                  # How many compute instances
                                    train_instance_type='ml.m4.xlarge',      # What kind of compute instances
                                    output_path='s3://{}/{}/output'.format(session.default_bucket(), prefix),
                                    sagemaker_session=session)

# Set the XGBoost hyperparameters in the xgb object.
# We have a binary label so we should be using the 'binary:logistic' objective.

xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        early_stopping_rounds=10,
                        num_round=500)
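
The `early_stopping_rounds=10` setting means that training stops once the validation error has not improved for that many rounds, even if `num_round` has not been reached. The following sketch illustrates that rule on made-up numbers; it is a simplified illustration with hypothetical names, not XGBoost's implementation.

```python
def early_stopping_round(validation_errors, patience):
    """Return (round where training stops, round with the best error)."""
    best_error = float("inf")
    best_round = -1
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_round = error, round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: stop here.
            return round_index, best_round
    return len(validation_errors) - 1, best_round


# Made-up validation errors; the best value is reached at round 2.
errors = [0.20, 0.18, 0.17, 0.175, 0.172, 0.171, 0.16]
stopped_at, best_at = early_stopping_round(errors, patience=3)
print(stopped_at, best_at)  # 5 2
```

In this toy run the lower error at round 6 is never seen, which is the trade-off of early stopping: it saves training time at the risk of stopping shortly before a later improvement.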

## 4.4 Fit the XGBoost model

In [23]:
# Describe the S3 locations of the training and validation datasets as inputs, then start the training job
s3_input_train = sagemaker.s3_input(s3_data=train_location, content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data=val_location, content_type='csv')

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation})

2019-08-11 17:07:44 Starting - Starting the training job...
2019-08-11 17:08:00 Starting - Launching requested ML instances......
2019-08-11 17:09:01 Starting - Preparing the instances for training......
2019-08-11 17:10:02 Downloading - Downloading input data...
2019-08-11 17:10:34 Training - Downloading the training image..
Arguments: train
[2019-08-11:17:10:53:INFO] Running standalone xgboost training.
[2019-08-11:17:10:53:INFO] File size need to be processed in the node: 238.47mb. Available memory size in the node: 8604.03mb
[2019-08-11:17:10:53:INFO] Determined delimiter of CSV input is ','
[17:10:53] S3DistributionType set as FullyReplicated
[17:10:55] 15000x5000 matrix with 75000000 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2019-08-11:17:10:55:INFO] Determined delimiter of CSV input is ','
[17:10:55] S3DistributionType set as FullyReplicated
[17:10:56] 10000x500

[17:11:51] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 30 extra nodes, 8 pruned nodes, max_depth=5
[40] train-error:0.152333 validation-error:0.1756
[17:11:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 28 extra nodes, 12 pruned nodes, max_depth=5
[41] train-error:0.151133 validation-error:0.1756
[17:11:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5
[42] train-error:0.15 validation-error:0.1747
[17:11:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 14 pruned nodes, max_depth=5
[43] train-error:0.1484 validation-error:0.1742
[17:11:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 20 extra nodes, 12 pruned nodes, max_depth=5
[44] train-error:0.146533 validation-error:0.1715
[17:11:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots,

[17:12:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5
[88] train-error:0.117267 validation-error:0.1482
[17:12:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 22 extra nodes, 10 pruned nodes, max_depth=5
[89] train-error:0.117333 validation-error:0.1483
[17:12:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 4 pruned nodes, max_depth=5
[90] train-error:0.117267 validation-error:0.148
[17:12:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 6 pruned nodes, max_depth=5
[91] train-error:0.1162 validation-error:0.1482
[17:12:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5
[92] train-error:0.115067 validation-error:0.1486
[17:12:58] src/tree/updater_prune.cc:74: tree pruning end, 1 roots

[17:13:52] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 8 pruned nodes, max_depth=5
[136] train-error:0.094933 validation-error:0.139
[17:13:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 16 extra nodes, 16 pruned nodes, max_depth=5
[137] train-error:0.094133 validation-error:0.1396
[17:13:55] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 6 pruned nodes, max_depth=5
[138] train-error:0.093733 validation-error:0.1388
[17:13:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 24 extra nodes, 8 pruned nodes, max_depth=5
[139] train-error:0.093667 validation-error:0.1389
[17:13:58] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 18 extra nodes, 10 pruned nodes, max_depth=5
[140] train-error:0.093467 validation-error:0.1391
[17:13:59] src/tree/updater_prune.cc:74: tree pruning end, 

[17:14:53] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 10 extra nodes, 10 pruned nodes, max_depth=5
[184] train-error:0.082867 validation-error:0.1332
[17:14:54] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5
[185] train-error:0.0828 validation-error:0.1336
[17:14:56] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 0 pruned nodes, max_depth=5
[186] train-error:0.082333 validation-error:0.1337
[17:14:57] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 10 pruned nodes, max_depth=5
[187] train-error:0.081933 validation-error:0.1334
[17:14:58] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 12 extra nodes, 8 pruned nodes, max_depth=5
[188] train-error:0.081867 validation-error:0.1338
[17:14:59] src/tree/updater_prune.cc:74: tree pruning end, 1

## 4.5 Test the Model

We use batch transform to perform inference on a large dataset in a way that is not real-time. This allows us to see how well our model performs.

Batch transform suits this task because we don't need the model's results immediately; instead, we can perform inference on a large number of samples at once, here the entire testing set.

In [24]:
# Create a transformer object from the trained model.
xgb_transformer = xgb.transformer(instance_count = 1,
                                  instance_type = 'ml.m4.xlarge')

To perform the transform job we need to specify the type of data we are sending so that it is serialized correctly in the background. Here we are providing the model with CSV data, so we specify _text/csv_.

In addition, if the data is too large to process all at once, we need to specify how the data file should be split. Since this is a CSV file in which each line is a single entry, we tell SageMaker to split the input on each line.

In [25]:
# Start the transform job
xgb_transformer.transform(test_location,
                          content_type='text/csv', #specify the content type
                          split_type='Line') #split type of the test data
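
Why line boundaries matter can be shown with a small sketch: when a large CSV payload has to be sent in pieces, cutting only at line ends guarantees that no record is split in half. The code below is a conceptual illustration under assumed names (`split_into_batches`, `max_batch_bytes`); it does not describe how SageMaker implements batch transform.

```python
def split_into_batches(payload, max_batch_bytes):
    """Group whole lines into batches that stay within max_batch_bytes."""
    batches, current, current_size = [], [], 0
    for line in payload.splitlines():
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and current_size + line_size > max_batch_bytes:
            batches.append("\n".join(current))     # close the current batch
            current, current_size = [], 0
        current.append(line)
        current_size += line_size
    if current:
        batches.append("\n".join(current))
    return batches


payload = "0,2,1\n1,0,0\n3,1,0\n"
batches = split_into_batches(payload, max_batch_bytes=12)
print(batches)  # ['0,2,1\n1,0,0', '3,1,0']
```

Each batch contains only complete rows, so every row can be scored independently and the outputs can be written back in the same order.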

With the code above, the transform is running in the background. We call the `wait()` method to wait until the transform job is done and receive some feedback.

In [31]:
xgb_transformer.wait()

!


The transform job has finished and the estimated sentiment of each review has been saved on S3. To work on this file locally, we copy it to `data_dir`.

A convenient way to do this inside Jupyter is described in the [AWS CLI command reference](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html#examples).

In [32]:
!aws s3 cp --recursive $xgb_transformer.output_path $data_dir

Completed 256.0 KiB/370.7 KiB (1.9 MiB/s) with 1 file(s) remaining
Completed 370.7 KiB/370.7 KiB (2.7 MiB/s) with 1 file(s) remaining
download: s3://sagemaker-us-west-2-013747046745/xgboost-2019-08-11-17-16-30-179/test.csv.out to ../data/xgboost/test.csv.out


Finally, we can read the output from the model.

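With the `binary:logistic` objective the model outputs a score between 0 and 1 for each review, so we need to convert the output into something more usable for our purposes. We round each score so that the sentiment is `1` for _positive_ and `0` for _negative_.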

In [33]:
predictions = pd.read_csv(os.path.join(data_dir, 'test.csv.out'), header=None)
predictions = [round(num) for num in predictions.squeeze().values]

In [34]:
#evaluate model
accuracy_score(test_y, predictions)

0.86528
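
The rounding and the accuracy computation can be checked by hand on a few made-up scores. This sketch uses plain Python in place of `accuracy_score`; the helper names and the numbers are hypothetical and are not results from the model.

```python
def to_labels(scores, threshold=0.5):
    """Map each score to 1 (positive) if it exceeds the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


scores = [0.91, 0.12, 0.55, 0.40]  # made-up model outputs
truth = [1, 0, 0, 0]               # made-up ground truth

predicted = to_labels(scores)
acc = accuracy(truth, predicted)
print(predicted, acc)  # [1, 0, 1, 0] 0.75
```

The strict `>` comparison matches Python's built-in `round`, which rounds a score of exactly 0.5 down to 0.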

# 5. Cleaning Up

As we perform operations on larger and larger data, keeping track of how much memory we use becomes essential. We might run out of memory while performing operations and/or incur unnecessary costs.


The default notebook instance on SageMaker might not have a lot of excess disk space. As we repeat exercises similar to this one, we might eventually fill up the allotted disk space, leading to errors which can be difficult to diagnose.

Once we are done with a notebook, it is good practice to remove the files we created along the way. We can do this from the terminal or from the notebook hub.

The code block below runs such commands from within the Jupyter notebook.

In [35]:
# First we will remove all of the files contained in the data_dir directory
!rm $data_dir/*

# And then we delete the directory itself
!rmdir $data_dir

# Similarly we will remove the files in the cache_dir directory and the directory itself
!rm $cache_dir/*
!rmdir $cache_dir

rm: cannot remove ‘../cache/sentiment_analysis/*’: No such file or directory
rmdir: failed to remove ‘../cache/sentiment_analysis’: No such file or directory
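
The messages above appear when a directory has already been removed. A Python alternative that stays silent in that case is `shutil.rmtree` with `ignore_errors=True`. The sketch below demonstrates it on a temporary directory; the helper `remove_dir` is hypothetical, and in the notebook it would be called with `data_dir` and `cache_dir`.

```python
import os
import shutil
import tempfile


def remove_dir(path):
    """Delete a directory and its contents; do nothing if it is already gone."""
    shutil.rmtree(path, ignore_errors=True)


# Create a scratch directory with one file to stand in for data_dir.
scratch = tempfile.mkdtemp()
with open(os.path.join(scratch, "train.csv"), "w") as f:
    f.write("1,0,2\n")

remove_dir(scratch)
remove_dir(scratch)  # the second call is a harmless no-op

still_exists = os.path.exists(scratch)
print(still_exists)  # False
```

Because `rmtree` deletes everything below the given path, it is worth double-checking the path before running it.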
