# Amazon Reviews - Scaling the Chosen Methodology / Technique

After the initial exploration and analysis stage of the data science pipeline, the next step is to scale up the data and the methods being used, to determine whether the results match those of the initial sample experiments.

If we're driven by a particular scientific enquiry rather than an application-driven data science workflow, we should also be trying to test our original hypothesis. Taking the original work performed during experimentation, we would look at our initial sample experiment and determine the confidence of our chosen model for predicting a specific label (let's say we're predicting the product_category). We would then try to establish whether applying this method at scale lets us reject or retain our H0.

While this is not strictly hypothesis testing (rejecting or failing to reject a null hypothesis), in many scenarios it is a suitable way to validate our assumptions when moving from sample to scale.
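One informal way to compare the sample run with the scaled run is to put a confidence interval around each accuracy figure and check whether they overlap. The sketch below uses a normal-approximation interval; the function names and the counts are hypothetical and are not taken from the experiments in this notebook.

```python
import math


def accuracy_confidence_interval(n_correct, n_total, z=1.96):
    """Normal-approximation interval for accuracy (z=1.96 gives roughly 95%)."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    acc = n_correct / n_total
    half_width = z * math.sqrt(acc * (1 - acc) / n_total)
    # Clamp to the valid accuracy range [0, 1]
    return max(0.0, acc - half_width), min(1.0, acc + half_width)


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts, for illustration only
sample_ci = accuracy_confidence_interval(850, 1000)       # sample experiment
scaled_ci = accuracy_confidence_interval(84000, 100000)   # scaled experiment

print('Sample interval:', sample_ci)
print('Scaled interval:', scaled_ci)
print('Overlap:', intervals_overlap(sample_ci, scaled_ci))
```

Overlapping intervals are only a rough consistency check, not a formal test, which matches the informal validation described above.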

### Imports

The following imports are required in order to run different statistical tests and modelling techniques.

In [60]:
# NOTE: Uncomment the following lines on first run of the notebook. 
# !conda install -y -c conda-forge fastparquet scikit-learn arrow-cpp parquet-cpp pyarrow numpy
# !pip install --upgrade mxnet gluonnlp swifter dask cufflinks
# !pip install -q torch==1.4.0
# !pip install -q transformers
# !pip install s3-concat


In [62]:
import boto3
import sagemaker
from s3_concat import S3Concat
import sys
import os
import re
import numpy as np
import pandas as pd
import subprocess
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
import gzip
from io import BytesIO
import zipfile
import random
import json
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.metrics import classification_report
import nltk
from fastparquet import write
from fastparquet import ParquetFile
import s3fs
import pyarrow.parquet as pq
import pickle
import glob
import ast 
import csv
import itertools
import dask.dataframe as dd
from dask.multiprocessing import get
import multiprocessing
import datetime

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from collections import OrderedDict

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn import metrics

### Configs and Global Vars

Throughout the notebook we're going to store all our global variables inside a dictionary (although all variables inside a notebook are global unless they are defined inside a function).

In [14]:
configs = {
    'aws_region' :  'us-east-1',
    'bucket_name': 'demos-amazon-reviews',
    'prefix' : 'preprocessed_reviews_csvs', #only use this if you want to have your files in a folder 
    'index_key' : 'review_date_str',
    'file_extension' :'.csv',
    'wordvecdata': 'wordvec-full-data',
    'models_dir': 'models',
    'label_column':'product_category',
    'text_column': 'review_body_processed',
    'class_labels_pickle_filename':'class_labels.pkl'
}

# initialize empty
global_vars = {}

### Environment Setup

Setting up the environment involves ensuring the correct session and IAM role are configured. We also need to ensure the correct region and bucket are available.

In [15]:
def setup_env(configs, global_vars):
    sess = sagemaker.Session()
    role = get_execution_role()
    AWS_REGION = configs['aws_region']
    s3 = boto3.resource('s3')
    s3_bucket = s3.Bucket(configs['bucket_name'])

    if s3_bucket.creation_date is None:
        # create S3 bucket because it does not exist yet
        print('Creating S3 bucket {}.'.format(configs['bucket_name']))
        resp = s3.create_bucket(
            ACL='private',
            Bucket=configs['bucket_name']
        )
    else:
        print('Bucket already exists')
        
    global_vars['role'] = role
    global_vars['sess'] = sess
    global_vars['s3'] = s3
    global_vars['s3_bucket'] = s3_bucket
    
    return global_vars

global_vars = setup_env(configs, global_vars)

Bucket already exists


### Create Data Manifest

At this step, we need to create an index of all the files we're going to use for this experiment and model building. We don't want to download all of the data at once, or we'll cause a lot of I/O activity on the Notebook Instance. 

Instead, we first create a path index of where the files live on S3. From there, we can sample the data to see what it looks like, compute some basic statistics to get a better handle on how we should build a model, and then move to using all the data to build a robust model!

In [4]:
def create_dataset_manifest(configs, global_vars):
    interval_printer_idx = 100
    idx = 0
    conn = global_vars['s3_bucket']
    file_format = configs['file_extension']
    index_key = configs['index_key']+'='
    s3_prefix = configs['prefix']+'/'
    manifest = []    
    for file in conn.objects.filter(Prefix=s3_prefix):
        path = file.key
#         print(file)
        if (file_format in path):
#             print(path)
            relative_path = path.replace(configs['prefix'],'')
            date = relative_path.split('/')[1].replace(index_key,'')

            man = {'idx':idx, 'path':relative_path, 'path_with_prefix':path, 'date':date}
            manifest.append(man)  
            idx += 1
            if (idx % interval_printer_idx) == 0:
                print('Processed {} files'.format(idx))
    print('Training Dataset Size {}'.format(len(manifest)))
    return manifest
            
manifest = create_dataset_manifest(configs, global_vars)   
    

Processed 100 files
Processed 200 files
Training Dataset Size 241
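To make the manifest structure and the sampling step concrete, here is a minimal sketch. It applies the same path parsing as `create_dataset_manifest` to a handful of made-up S3 keys and then draws a random sample of entries. The key layout, the `manifest_entry` helper, and the sample size are assumptions for illustration; the real keys in the bucket may look different.

```python
import random


def manifest_entry(idx, key, prefix, index_key):
    """Mirror the parsing in create_dataset_manifest for a single S3 key."""
    relative_path = key.replace(prefix, '')
    # The partition folder is the first path segment after the prefix
    date = relative_path.split('/')[1].replace(index_key + '=', '')
    return {'idx': idx, 'path': relative_path, 'path_with_prefix': key, 'date': date}


# Hypothetical keys, assuming a '<prefix>/<index_key>=<value>/<file>.csv' layout
keys = [
    'preprocessed_reviews_csvs/review_date_str={}-01/part-0.csv'.format(year)
    for year in range(1996, 2006)
]
toy_manifest = [
    manifest_entry(i, key, 'preprocessed_reviews_csvs', 'review_date_str')
    for i, key in enumerate(keys)
]

# Sample a few entries to inspect before committing to the full dataset
rng = random.Random(0)  # fixed seed so the sample is reproducible
sample = rng.sample(toy_manifest, 3)
for entry in sample:
    print(entry['date'], entry['path_with_prefix'])
```

Sampling manifest entries rather than rows keeps the I/O limited to a few files while still giving a look at data from different partitions.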


In [6]:
#sanity check that we have the right amount of data for a given file!
utils.count_s3_obj_lines(configs, global_vars, manifest[240])

5259983
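The `utils` module used in the sanity check is not shown in this notebook. As a rough idea of what such a helper might do, the sketch below counts newline characters in a binary stream chunk by chunk, so the whole file never has to sit in memory. `count_lines` is a hypothetical stand-in, not the actual implementation of `utils.count_s3_obj_lines`.

```python
from io import BytesIO


def count_lines(stream, chunk_size=1024 * 1024):
    """Count newline characters in a binary file-like object, reading it in chunks."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += chunk.count(b'\n')
    return total


# In-memory stand-in for a downloaded object
print(count_lines(BytesIO(b'a,b\n1,2\n3,4\n')))  # 3 newline-terminated lines
```

The same function should work on any object exposing `read(size)`, such as the streaming body returned when fetching an S3 object with boto3, although that has not been verified against this bucket.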

## Transform and Upload Data to S3

In [144]:
def prep_data(df):
    '''
    Drop any labels/categories that represent less than 1% of the total rows
    in the dataset, as these cause problems when training the model.
    '''
    df_len = df.shape[0]
    pct_min = 0.01
    min_product_category_row_count = df_len * pct_min # around 1% of the dataset; imbalanced data will skew our modelling
    df = df.groupby('product_category').filter(lambda x : len(x)>min_product_category_row_count)
    return df



def prep_data_for_supervised_blazing_text_augmented(df, configs, labs, train_file_output_name, test_file_output_name, val_file_output_name):
    '''
        Prepare the input dataframe for use with AWS BlazingText in supervised mode.
        Transform the processed review text into an augmented manifest structure
        (one JSON object per line, with 'source' and 'label' keys), split it into
        train, test and validation sets, and save each set to a tmp file (locally).
        Return the updated label dictionary, which contains the mapping of label to idx.
    '''
    

    text_col = configs['text_column']
    label_col = configs['label_column']

    labels = df[label_col].tolist()
    #and tokenized words
    tmp = df[text_col]
    xs = []
    for entry in tmp:
        res = str(entry).strip('][').split(', ') 
        res = ' '.join(res)
        xs.append(res)
        
    #split the data into test and train for supervised mode
    X_train, X_test, y_train, y_test = train_test_split(
        xs, labels, test_size=0.2, random_state = 0)
    
    #then split the held-out 20% into test (about 16% of all rows) and val (about 4% of all rows)
    X_test, X_val, y_test, y_val = train_test_split(
        X_test, y_test, test_size=0.2, random_state = 0)
    
    train_prepped = []
    #train
    for i in range(0, len(X_train)):
        src = str(X_train[i])
        if len(src)>10:
            
            label = str(y_train[i])
            if label in labs:
                lab_idx = labs[label]
            else:
                lab_idx = len(labs)
                labs[label] = lab_idx
                
            row = {'source':src,'label':lab_idx } 
            train_prepped.append(row)
    
    test_prepped = []
    #test
    for i in range(0, len(X_test)):
        src = str(X_test[i])
        if len(src)>10:
            
            label = str(y_test[i])
            if label in labs:
                lab_idx = labs[label]
            else:
                lab_idx = len(labs)
                labs[label] = lab_idx
            
            row = {'source':src,'label':lab_idx } 
            test_prepped.append(row)
            
    val_prepped = []
    #validate
    for i in range(0, len(X_val)):
        src = str(X_val[i])
        if len(src)>10:
            
            label = str(y_val[i])
            if label in labs:
                lab_idx = labs[label]
            else:
                lab_idx = len(labs)
                labs[label] = lab_idx
            
            row = {'source':src,'label':lab_idx } 
            val_prepped.append(row)
            
    
    with open(train_file_output_name, 'w') as outfile:
        for row in train_prepped:
            outfile.write(json.dumps(row)+'\n')

    with open(test_file_output_name, 'w') as outfile:
        for row in test_prepped:
            outfile.write(json.dumps(row)+'\n')
    
    with open(val_file_output_name, 'w') as outfile:
        for row in val_prepped:
            outfile.write(json.dumps(row)+'\n')
            
     
    return labs
        
        
def upload_corpus_to_s3(configs, global_vars, train_file , test_file, val_file):
    
    '''
    Upload Training, Test, and Validation datasets to S3 bucket
    '''
    
    train_prefix = 'train'
    test_prefix = 'test'
    val_prefix = 'validate'
    s3_bucket = global_vars['s3_bucket']
   
   
    data_file_s3 = '{}/{}/{}'.format(configs['wordvecdata'], train_prefix, train_file)
    s3_bucket.upload_file(train_file, data_file_s3)   

    data_file_s3 = '{}/{}/{}'.format(configs['wordvecdata'], test_prefix, test_file)
    s3_bucket.upload_file(test_file, data_file_s3) 
    
    data_file_s3 = '{}/{}/{}'.format(configs['wordvecdata'], val_prefix, val_file)
    s3_bucket.upload_file(val_file, data_file_s3) 
    
    s3_train_data = 's3://{}/{}/{}'.format(configs['bucket_name'], configs['wordvecdata'], train_prefix)
    s3_test_data = 's3://{}/{}/{}'.format(configs['bucket_name'], configs['wordvecdata'], test_prefix)
    s3_val_data = 's3://{}/{}/{}'.format(configs['bucket_name'], configs['wordvecdata'], val_prefix)

    s3_output_location = 's3://{}/{}/output'.format(configs['bucket_name'], configs['wordvecdata'])
    
    configs['s3_w2v_train_data'] = s3_train_data
    configs['s3_w2v_test_data'] = s3_test_data
    configs['s3_w2v_validate_data'] = s3_val_data
    configs['s3_w2v_output_location'] = s3_output_location

    print('S3 Training Data Path {}'.format(s3_train_data))
    print('S3 Test Data Path {}'.format(s3_test_data))
    print('S3 Validate Data Path {}'.format(s3_val_data))

    print('S3 output Data Path {}'.format(s3_output_location))

    return configs

def remove_local_file(filename):
    
    os.remove(filename)
    
def download_transform_upload(configs, global_vars, manifest):
        
    # As we're dealing with a large dataset, we process one manifest file at a time
    # instead of loading everything into memory
    labels = {}
    partNum = 0
    for entry in manifest:
        full_path = 's3://'+configs['bucket_name']+'/'+entry['path_with_prefix']
        df = pd.read_csv(full_path, header=0, error_bad_lines=False, escapechar="\\")
        print('Dataset Rows {}, Columns {}'.format(df.shape[0], df.shape[1]))
        df = prep_data(df)
        
        train_file = 'amazonreviews_part_{}.train'.format(partNum)
        test_file = 'amazonreviews_part_{}.test'.format(partNum)
        val_file = 'amazonreviews_part_{}.validate'.format(partNum)

        try:
            labels = prep_data_for_supervised_blazing_text_augmented(df, configs,labels, train_file, test_file, val_file)
            #upload new train file
            configs = upload_corpus_to_s3(configs, global_vars, train_file , test_file, val_file)         
            #delete local file
            remove_local_file(train_file)
            remove_local_file(test_file)
            remove_local_file(val_file)
            #increment part_number for filename
            partNum += 1
            print(labels)
        except Exception as e:
            print(e)
            print('Could not process File {}'.format(full_path))
            
    global_vars['labels'] = labels
    return global_vars
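One detail in `prep_data_for_supervised_blazing_text_augmented` is worth checking against the real data: how the token column is turned back into plain text. The sketch below assumes the column holds stringified Python lists such as `"['great', 'book']"`, which is an assumption about the preprocessing output and not something shown in this notebook. Under that assumption, the `strip('][').split(', ')` approach leaves the quote characters in the text, while `ast.literal_eval` recovers the bare tokens.

```python
import ast

# Hypothetical cell value, assuming tokens were saved as a stringified Python list
raw = "['great', 'book', 'loved', 'it']"

# Approach used in the function above: strip brackets, split on ', ', re-join
stripped = ' '.join(str(raw).strip('][').split(', '))

# Alternative: parse the literal back into a real list first
parsed = ' '.join(ast.literal_eval(raw))

print(repr(stripped))  # quotes are still attached to each token
print(repr(parsed))    # bare tokens
```

If the column instead stores unquoted tokens, for example `[great, book]`, the original string-based approach is the appropriate one and `ast.literal_eval` would raise an error.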

In [None]:
global_vars = download_transform_upload(configs, global_vars, manifest)

Dataset Rows 2, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0}
Dataset Rows 23, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0}
Dataset Rows 19, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0}
Dataset Rows 27, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Book

b'Skipping line 1434: expected 18 fields, saw 20\n'


Dataset Rows 5214, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2}
Dataset Rows 6766, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2}
Dataset Rows 5199, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2}
Dataset Rows 7537, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/t

  if self.run_code(code, result):


Dataset Rows 35460, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4}
Dataset Rows 44151, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 79807, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 83417, Columns 18

S3 Training Da

b'Skipping line 3060: expected 18 fields, saw 75\n'


Dataset Rows 90017, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}


b'Skipping line 79224: expected 18 fields, saw 23\n'


Dataset Rows 81109, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 65477, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}


  if self.run_code(code, result):


Dataset Rows 71941, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}


b'Skipping line 10724: expected 18 fields, saw 19\n'
b'Skipping line 46605: expected 18 fields, saw 37\n'


Dataset Rows 64475, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 66688, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 70538, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 62081, Columns 

b'Skipping line 32153: expected 18 fields, saw 21\n'


Dataset Rows 72207, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}


b'Skipping line 4994: expected 18 fields, saw 20\n'


Dataset Rows 77716, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 82284, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 81643, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5}
Dataset Rows 63106, Columns 

b'Skipping line 19642: expected 18 fields, saw 19\n'


Dataset Rows 70447, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9}
Dataset Rows 79025, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9}
Dataset Rows 75563, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/o

b'Skipping line 8011: expected 18 fields, saw 26\n'


Dataset Rows 71710, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9}
Dataset Rows 66602, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9}
Dataset Rows 78862, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/o

b'Skipping line 45262: expected 18 fields, saw 23\n'


Dataset Rows 75373, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 7114: expected 18 fields, saw 26\n'


Dataset Rows 81213, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 78506, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 76810, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-re

b'Skipping line 79951: expected 18 fields, saw 22\n'


Dataset Rows 96123, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 4799: expected 18 fields, saw 30\n'


Dataset Rows 101887, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 91068, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 93147, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-r

b'Skipping line 85199: expected 18 fields, saw 22\n'


Dataset Rows 94735, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 91228, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 76805, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-re

b'Skipping line 16937: expected 18 fields, saw 19\n'
b'Skipping line 67400: expected 18 fields, saw 21\n'


Dataset Rows 80508, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 88194, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 95781, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-re

b'Skipping line 47572: expected 18 fields, saw 28\n'


Dataset Rows 78793, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 44777: expected 18 fields, saw 26\n'


Dataset Rows 90743, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}
Dataset Rows 83098, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 29367: expected 18 fields, saw 26\n'


Dataset Rows 84577, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 51287: expected 18 fields, saw 26\n'


Dataset Rows 85711, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12}


b'Skipping line 5216: expected 18 fields, saw 29\n'


Dataset Rows 103145, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13}
Dataset Rows 132469, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13}


b'Skipping line 22930: expected 18 fields, saw 20\n'


Dataset Rows 132218, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13}


b'Skipping line 292: expected 18 fields, saw 22\n'


Dataset Rows 118777, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13}
Dataset Rows 96786, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13}
Dataset Rows 106540, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 T

b'Skipping line 79321: expected 18 fields, saw 20\n'


Dataset Rows 152398, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14}
Dataset Rows 56503, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14}
Dataset Rows 95385, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordv

b'Skipping line 71876: expected 18 fields, saw 19\n'


Dataset Rows 123681, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15}
Dataset Rows 122991, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16}
Dataset R

b'Skipping line 29877: expected 18 fields, saw 25\n'


Dataset Rows 137420, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22}
Dataset Rows 175268, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 

b'Skipping line 150471: expected 18 fields, saw 31\n'


Dataset Rows 154999, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23}
Dataset Rows 154494, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen':

b'Skipping line 109600: expected 18 fields, saw 20\n'


Dataset Rows 183445, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24}


b'Skipping line 127580: expected 18 fields, saw 21\n'


Dataset Rows 173230, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24}


b'Skipping line 170299: expected 18 fields, saw 33\n'


Dataset Rows 181281, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24}


b'Skipping line 129424: expected 18 fields, saw 21\n'


Dataset Rows 178596, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24}
Dataset Rows 180954, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Ca

b'Skipping line 6303: expected 18 fields, saw 50\n'


Dataset Rows 187649, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26}
Dataset Rows 190914, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video

b'Skipping line 127305: expected 18 fields, saw 19\n'


Dataset Rows 186665, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26}
Dataset Rows 231229, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video

b'Skipping line 160003: expected 18 fields, saw 46\n'


Dataset Rows 226426, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27}


b'Skipping line 101492: expected 18 fields, saw 27\n'


Dataset Rows 232363, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27}
Dataset Rows 232692, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Vi

b'Skipping line 246444: expected 18 fields, saw 20\n'


Dataset Rows 265092, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28}


b'Skipping line 224730: expected 18 fields, saw 34\n'


Dataset Rows 247477, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28}
Dataset Rows 247074, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Vi

b'Skipping line 165406: expected 18 fields, saw 20\n'


Dataset Rows 330471, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}
Dataset Rows 296907, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/outpu

b'Skipping line 259826: expected 18 fields, saw 19\n'


Dataset Rows 264590, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}


b'Skipping line 444: expected 18 fields, saw 19\n'
b'Skipping line 265664: expected 18 fields, saw 19\n'


Dataset Rows 268183, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}
Dataset Rows 263428, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/outpu

b'Skipping line 53053: expected 18 fields, saw 25\n'


Dataset Rows 313183, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}


b'Skipping line 258037: expected 18 fields, saw 49\n'


Dataset Rows 338142, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}


b'Skipping line 62844: expected 18 fields, saw 20\n'


Dataset Rows 339632, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29}
Dataset Rows 355057, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/outpu

b'Skipping line 365118: expected 18 fields, saw 19\n'


Dataset Rows 435037, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30}


b'Skipping line 116476: expected 18 fields, saw 20\n'
b'Skipping line 183741: expected 18 fields, saw 21\n'
b'Skipping line 243671: expected 18 fields, saw 19\n'


Dataset Rows 515924, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30}


b'Skipping line 78688: expected 18 fields, saw 21\n'
b'Skipping line 104343: expected 18 fields, saw 22\n'
b'Skipping line 194984: expected 18 fields, saw 24\n'
b'Skipping line 255976: expected 18 fields, saw 22\n'
b'Skipping line 283054: expected 18 fields, saw 22\n'


Dataset Rows 407372, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30}


b'Skipping line 6946: expected 18 fields, saw 23\n'
b'Skipping line 345065: expected 18 fields, saw 26\n'
b'Skipping line 412831: expected 18 fields, saw 22\n'
b'Skipping line 431015: expected 18 fields, saw 20\n'


Dataset Rows 448239, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 267725: expected 18 fields, saw 19\n'
b'Skipping line 429699: expected 18 fields, saw 21\n'


Dataset Rows 430575, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}
Dataset Rows 429266, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amaz

b'Skipping line 146896: expected 18 fields, saw 19\n'


Dataset Rows 452754, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 43114: expected 18 fields, saw 26\n'
b'Skipping line 229635: expected 18 fields, saw 21\n'


Dataset Rows 478487, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 154403: expected 18 fields, saw 21\n'
b'Skipping line 452308: expected 18 fields, saw 20\n'


Dataset Rows 509161, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 53188: expected 18 fields, saw 19\n'
b'Skipping line 262789: expected 18 fields, saw 23\n'
b'Skipping line 388306: expected 18 fields, saw 19\n'
b'Skipping line 470556: expected 18 fields, saw 45\n'


Dataset Rows 533598, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 544948: expected 18 fields, saw 19\n'


Dataset Rows 545043, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31}


b'Skipping line 16264: expected 18 fields, saw 26\n'
b'Skipping line 1800362: expected 18 fields, saw 25\n'


S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
{'Books': 0, 'Video': 1, 'Music': 2, 'Video_DVD': 3, 'Toys': 4, 'Video_Games': 5, 'Office_Products': 6, 'PC': 7, 'Camera': 8, 'Kitchen': 9, 'Electronics': 10, 'Software': 11, 'Baby': 12, 'Wireless': 13, 'Home': 14, 'Health_&_Personal_Care': 15, 'Grocery': 16, 'Beauty': 17, 'Sports': 18, 'Home_Entertainment': 19, 'Apparel': 20, 'Shoes': 21, 'Tools': 22, 'Lawn_and_Garden': 23, 'Pet_Products': 24, 'Outdoors': 25, 'Digital_Music_Purchase': 26, 'Digital_Ebook_Purchase': 27, 'Home_Improvement': 28, 'Automotive': 29, 'Jewelry': 30, 'Mobile_Apps': 31, 'Digital_Video_Download': 32}
S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 782600: expected 18 fields, saw 20\n'


In [None]:
# BlazingText in Pipe mode only supports one augmented file each for train and test,
# so concatenate the per-partition files into a single file per split.
def concat_augmented_files(configs, global_vars):
    
    #output filename
    concatenated_file_train = 'amazon_augmented_train.json'
    concatenated_file_test = 'amazon_augmented_test.json'
    concatenated_file_val = 'amazon_augmented_validate.json'

    
    #where all our files sit
    train_prefix = 'train'
    test_prefix = 'test'
    val_prefix = 'validate'
    
    s3_train_path = '{}/{}/'.format(configs['wordvecdata'], train_prefix)
    s3_test_path = '{}/{}/'.format(configs['wordvecdata'], test_prefix)
    s3_val_path = '{}/{}/'.format(configs['wordvecdata'], val_prefix)

    
    s3_concat_file_path_train = '{}/{}/{}'.format(configs['wordvecdata'], train_prefix, concatenated_file_train)
    s3_concat_file_path_test = '{}/{}/{}'.format(configs['wordvecdata'], test_prefix, concatenated_file_test)  
    s3_concat_file_path_val = '{}/{}/{}'.format(configs['wordvecdata'], val_prefix, concatenated_file_val)

    print(s3_concat_file_path_train)
    print(s3_concat_file_path_test)
    print(s3_concat_file_path_val)


    min_file_size = None

    #train file
    job_train = S3Concat(configs['bucket_name'], 
                         s3_concat_file_path_train, 
                         min_file_size,
                         content_type='application/json',
                         session=boto3.session.Session()
                        )
    
    job_train.add_files(s3_train_path)
    job_train.concat(small_parts_threads=32)

    
    #test file
    job_test = S3Concat(configs['bucket_name'], 
                         s3_concat_file_path_test, 
                         min_file_size,
                         content_type='application/json',
                         session=boto3.session.Session()
                        )
    
    job_test.add_files(s3_test_path)
    job_test.concat(small_parts_threads=32)
    
    
    #val file
    job_val = S3Concat(configs['bucket_name'], 
                         s3_concat_file_path_val, 
                         min_file_size,
                         content_type='application/json',
                         session=boto3.session.Session()
                        )
    
    job_val.add_files(s3_val_path)
    job_val.concat(small_parts_threads=32)
    
    
    configs['s3_w2v_train_file'] = s3_concat_file_path_train
    configs['s3_w2v_test_file'] = s3_concat_file_path_test
    # Fixed: the variable defined above is s3_concat_file_path_val, not ..._validate
    configs['s3_w2v_validate_file'] = s3_concat_file_path_val

    return configs

configs = concat_augmented_files(configs, global_vars)


### Save the Label Mapping

Our model is trained on numerical labels that stand in for the product_category values (e.g. Books), so we need to store the mapping (Label:idx) in order to recover the category names at inference time.

In [153]:
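import pickle

def save_labels_lookup(labels, filename = 'class_labels.pkl'):
    
    # Persist the Label:idx mapping; the context manager closes the file once it is written.
    with open(filename, "wb") as f:
        pickle.dump(labels, f)
    
save_labels_lookup(global_vars['labels'])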

## Model / Analysis Experimentation (Local Mode)

The purpose of this section is to run some experiments with different modelling techniques.

We're first going to perform some local experiments on the 1% sample of data to see which methods provide valuable insights for both customers (e.g. Amazon Customer) and operations (e.g. Amazon).

We want to look at different types of insight, from understanding how customer reviews have changed over time to whether the type of review, and the category of product it relates to, can be predicted.

Let's start by getting our data into a shape which we can use for analysis and modelling.

### Prep Data for Modelling Purposes

We're going to develop some dataframes which represent our Xs and Ys (features and labels).

Let's create some feature/label datasets which are shaped around the following labels:

- year_product-category
- product-category_star_rating

The features for this model use only the text of the reviews.





### Word Embeddings Using BlazingText (Supervised)

BlazingText expects a single preprocessed text file with space-separated tokens. Each line of the file should contain a single sentence and the corresponding label(s) prefixed by `__label__`.
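
As an illustration of that line format, the following minimal sketch builds one training line from a review and its label index. The helper name `to_blazingtext_line` and the lowercase/whitespace tokenisation are assumptions made for illustration, not the preprocessing used elsewhere in this notebook.

```python
def to_blazingtext_line(text, label):
    """Return one '__label__<label> token token ...' line (hypothetical helper)."""
    # Assumption: lowercasing plus a whitespace split stands in for real tokenisation.
    tokens = text.lower().split()
    return '__label__{} {}'.format(label, ' '.join(tokens))


line = to_blazingtext_line('Great  book, loved it', 4)
print(line)
```

Splitting on whitespace also collapses repeated spaces, so every token ends up separated by exactly one space.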

As we're now using the complete dataset, we'll need to use the Augmented dataset structure and `Pipe` mode in order to stream the data, rather than loading it all into memory in one go.

Augmented Data Structure

```json
{"source":"string", "label":"string"}
{"source":"string", "label":"string"}
```

Note that each record is a single JSON object on its own line.
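
A minimal sketch of producing records in this shape is shown below. The helper name `to_manifest_line` and the sample reviews are made up for illustration. Storing the label as a string follows the structure shown above.

```python
import json


def to_manifest_line(text, label):
    """Serialise one record as a single-line JSON object (hypothetical helper)."""
    # json.dumps emits double-quoted strings and escapes embedded newlines,
    # so each record stays on exactly one line.
    return json.dumps({'source': text, 'label': str(label)})


# Made-up sample records: (review text, label index)
records = [('great book\nwould read again', 4), ('battery died quickly', 11)]
manifest = '\n'.join(to_manifest_line(text, label) for text, label in records)
print(manifest)
```

Using `json.dumps` rather than string formatting avoids the single-quote problem, since JSON parsers only accept double-quoted keys and values.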

**Picking Hyperparameters**

As we're now working with a much larger dataset, we need to be conscious of the hyperparameters we choose, as these can have a serious impact on how well our model performs.

As we're looking at using Word2Vec in a supervised mode (i.e. with labelled data), we have the option of using several additional parameters on top of the default set listed in the [documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/blazingtext.html).

Some of the key hyperparameters which need to be considered are:

- Vector Dimension: The larger the vector, the more information is encoded; however, this requires a significant amount of resources. A vector size above 300 tends to yield diminishing returns. Further reading can be found in the paper [GloVe: Global Vectors for Word Representation](https://www.aclweb.org/anthology/D14-1162.pdf)
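
To make the resource cost of the vector dimension more concrete, the sketch below estimates the size of a dense embedding table as vocabulary size × dimension × bytes per value. The vocabulary size of 1,000,000 and the 4-byte floats are assumptions chosen for illustration, not measurements from this dataset, and the estimate ignores n-gram buckets and any training-time overhead.

```python
def embedding_table_bytes(vocab_size, vector_dim, bytes_per_value=4):
    """Rough size in bytes of a dense vocab_size x vector_dim embedding table."""
    return vocab_size * vector_dim * bytes_per_value


# Assumed vocabulary of one million tokens, purely for illustration.
for dim in (100, 300, 600):
    gib = embedding_table_bytes(1_000_000, dim) / 1024 ** 3
    print('vector_dim={:>3}: ~{:.2f} GiB'.format(dim, gib))
```

The cost grows linearly with the dimension, so doubling `vector_dim` doubles the table size.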

In [187]:
def configure_estimator(configs, global_vars):
    
    region_name = configs['aws_region'] 
    sess = global_vars['sess']
    container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
    print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

    bt_model = sagemaker.estimator.Estimator(container,
                                         global_vars['role'], 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c5.18xlarge',
                                         train_volume_size = 150,
                                         train_max_run = 360000,
                                         input_mode= 'Pipe',
                                         output_path=configs['s3_w2v_output_location'],
                                         sagemaker_session=sess)
    
    bt_model.set_hyperparameters(mode="supervised",
                                 epochs=20,
                                 min_count=2,
                                 learning_rate=0.05,
                                 vector_dim=300,
                                 early_stopping=True,
                                 patience=4,
                                 min_epochs=10,
                                 word_ngrams=4,
                                 # BlazingText ignores subwords in supervised mode (cbow/skipgram only).
                                 subwords=True)

    global_vars['bt_model'] = bt_model
    
    return global_vars

global_vars = configure_estimator(configs, global_vars)

Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


In [188]:
def configure_data_channels(configs, global_vars):
    

    s3train_manifest = 's3://{}/{}'.format(configs['bucket_name'],configs['s3_w2v_train_file'])
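    # The test split feeds the 'validation' channel during training; the validate
    # split is held back for the evaluation step later in the notebook.
    s3validation_manifest = 's3://{}/{}'.format(configs['bucket_name'],configs['s3_w2v_test_file'])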
    
    attribute_names = ["source","label"]

    
    train_data = sagemaker.session.s3_input(s3train_manifest, 
                                            distribution='FullyReplicated', 
                                            content_type='application/jsonlines', 
                                            s3_data_type='AugmentedManifestFile',
                                            attribute_names=attribute_names,
                                            record_wrapping='RecordIO' 
                                           )
    
    validation_data = sagemaker.session.s3_input(s3validation_manifest, 
                                                 distribution='FullyReplicated', 
                                                 content_type='application/jsonlines', 
                                                 s3_data_type='AugmentedManifestFile',
                                                 attribute_names=attribute_names,
                                                 record_wrapping='RecordIO'
                                                )
    
    data_channels = {'train': train_data, 'validation': validation_data}
    
    global_vars['data_channels'] = data_channels

    return global_vars

global_vars = configure_data_channels(configs, global_vars)
                                        

In [None]:
def fit_model(configs, global_vars):
    
    bt_model = global_vars['bt_model']
    data_channels = global_vars['data_channels']
    bt_model.fit(inputs=data_channels, logs=True)
    
fit_model(configs, global_vars)

2020-05-04 02:07:45 Starting - Starting the training job...
2020-05-04 02:07:46 Starting - Launching requested ML instances......
2020-05-04 02:08:56 Starting - Preparing the instances for training...
2020-05-04 02:09:38 Downloading - Downloading input data..................................................................
2020-05-04 02:20:53 Training - Training image download completed. Training in progress.
Arguments: train
[05/04/2020 02:20:54 INFO 140077378238272] nvidia-smi took: 0.0252711772919 secs to identify 0 gpus
[05/04/2020 02:20:54 INFO 140077378238272] Running single machine CPU BlazingText training using supervised mode.
[05/04/2020 02:20:54 INFO 140077378238272] Switching off subword embedding mode as it is only supported by cbow and skipgram.
Read 10M words
Read 20M words
...
Read 3200M words
Read 3210M words
...

In [154]:
def host_model(global_vars):
    
    bt_model = global_vars['bt_model']
    text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
    global_vars['w2v_classifier'] = text_classifier
    
    return global_vars

global_vars = host_model(global_vars)

--------------!

### Load the Labels

If we haven't already loaded the labels (for example, when opening the notebook for the first time after a kernel restart), we'll need to load the pickle file containing the label mapping.

In [None]:
import pickle

def load_class_label_mapping(configs, global_vars):

    filename = configs['class_labels_pickle_filename']
    # The context manager closes the file once the mapping has been read.
    with open(filename, "rb") as f:
        global_vars['labels'] = pickle.load(f)
    print('Labels Loaded \n{}'.format(global_vars['labels']))

    return global_vars
    
global_vars = load_class_label_mapping(configs, global_vars)

### Evaluate the Model

Let's evaluate our model to determine how well we're able to predict the different classes. For this we're going to use the validation sample, which was not part of the training or test data.

In [185]:
def evaluate_test_data_against_model(global_vars, configs):
    
    text_col = 'source'
    label_col = 'label'

    
    # First download the validation dataset; lines=True reads one JSON object per line.
    full_path = 's3://'+configs['bucket_name']+'/'+ configs['s3_w2v_validate_file']
    df = pd.read_json(full_path, lines=True)
    
    print('Dataset Rows {}, Columns {}'.format(df.shape[0], df.shape[1]))
    
    y_val = df[label_col].tolist()
    x_val = df[text_col].tolist()
    
    print('Total Eval Data {}'.format(len(x_val)))

    # We need to run inference in batches due to the size of the data;
    # each batch holds up to 10,000 sentences.
    batch_size = 10000
    batches = len(x_val) // batch_size
    
    print('Batches {}'.format(batches))
    
    predictions_batches = []
    labels_inv = {y:x for x,y in global_vars['labels'].items()}
    y_hat = []

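    # Log progress roughly ten times; max(1, ...) avoids a zero modulus on small datasets.
    log_every = max(1, batches // 10)

    for i in range(0, batches+1):
        lower = batch_size * i
        upper = min(batch_size * (i+1), len(x_val))
        if lower >= upper:
            # len(x_val) is an exact multiple of batch_size, so there is no partial final batch.
            break
        if i % log_every == 0:
            print('Batch {} : {}'.format(lower,upper))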
                
        instances_batch = x_val[lower:upper]
        
        payload = {"instances":instances_batch,
                  "configuration": {"k": 1}}

        text_classifier =  global_vars['w2v_classifier']

        response = text_classifier.predict(json.dumps(payload))

        predictions = json.loads(response)
        predictions_batches.append(predictions)
    
        for pred in predictions:
            try:
                idx = int(str(pred['label'][0]).replace('__label__',''))
                y_hat.append(labels_inv[idx])
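            # Fall back to UNKNOWN only for malformed predictions or unmapped label indices.
            except (KeyError, IndexError, TypeError, ValueError):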
                y_hat.append('UNKNOWN')
    
    print('Total Predictions {}'.format(len(y_hat)))
#     print(json.dumps(predictions, indent=2))
#     print(list(zip(y_hat, y_test)))
    return y_hat, y_val
              
y_hat, y_val = evaluate_test_data_against_model(global_vars, configs)

b'Skipping line 392761: expected 18 fields, saw 19\n'
b'Skipping line 597522: expected 18 fields, saw 19\n'
b'Skipping line 790320: expected 18 fields, saw 20\n'
b'Skipping line 1102921: expected 18 fields, saw 19\n'
b'Skipping line 1368570: expected 18 fields, saw 19\n'
b'Skipping line 1461203: expected 18 fields, saw 22\n'
b'Skipping line 1729183: expected 18 fields, saw 21\n'
b'Skipping line 1981788: expected 18 fields, saw 20\n'
b'Skipping line 2113686: expected 18 fields, saw 19\n'
b'Skipping line 2900715: expected 18 fields, saw 33\n'


Dataset Rows 2905588, Columns 18
Total Eval Data 700038
Batches 70
Batch 0 : 10000
Batch 10000 : 20000
Batch 20000 : 30000
Batch 30000 : 40000
Batch 40000 : 50000
Batch 50000 : 60000
Batch 60000 : 70000
Batch 70000 : 80000
Batch 80000 : 90000
Batch 90000 : 100000
Batch 100000 : 110000
Batch 110000 : 120000
Batch 120000 : 130000
Batch 130000 : 140000
Batch 140000 : 150000
Batch 150000 : 160000
Batch 160000 : 170000
Batch 170000 : 180000
Batch 180000 : 190000
Batch 190000 : 200000
Batch 200000 : 210000
Batch 210000 : 220000
Batch 220000 : 230000
Batch 230000 : 240000
Batch 240000 : 250000
Batch 250000 : 260000
Batch 260000 : 270000
Batch 270000 : 280000
Batch 280000 : 290000
Batch 290000 : 300000
Batch 300000 : 310000
Batch 310000 : 320000
Batch 320000 : 330000
Batch 330000 : 340000
Batch 340000 : 350000
Batch 350000 : 360000
Batch 360000 : 370000
Batch 370000 : 380000
Batch 380000 : 390000
Batch 390000 : 400000
Batch 400000 : 410000
Batch 410000 : 420000
Batch 420000 : 430000
...
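
The batching and label-parsing steps in the cell above can be isolated into two small helpers, which makes them easier to check without a live endpoint. This is a minimal sketch: `iter_batches` and `parse_label` are hypothetical names, the label mapping is made up, and the response shape (a list of objects with a `label` list such as `["__label__1"]`) is assumed from the parsing code above.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items; never yields an empty batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def parse_label(prediction, labels_inv, unknown='UNKNOWN'):
    """Map one prediction to a class name, or `unknown` if it cannot be parsed."""
    try:
        idx = int(str(prediction['label'][0]).replace('__label__', ''))
        return labels_inv[idx]
    except (KeyError, IndexError, TypeError, ValueError):
        return unknown


labels_inv = {0: 'Apparel', 1: 'Books'}  # made-up idx:Label mapping for illustration
fake_response = [{'label': ['__label__1']}, {'label': ['__label__9']}, {}]

print([len(batch) for batch in iter_batches(list(range(25)), 10)])
print([parse_label(pred, labels_inv) for pred in fake_response])
```

Stepping through the data with `range(0, len(items), batch_size)` handles the final partial batch without a special case, and it sends nothing extra when the data length is an exact multiple of the batch size.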

In [186]:
def evaluate_model_predictions(y_pred, y_true):

    print(classification_report(y_true, y_pred))

evaluate_model_predictions(y_hat, y_val)



                        precision    recall  f1-score   support

               Apparel       0.61      0.77      0.68     32633
            Automotive       0.50      0.62      0.55     14710
                  Baby       0.62      0.45      0.52      8231
                Beauty       0.66      0.72      0.69     22681
                 Books       0.35      0.94      0.51     62408
                Camera       0.75      0.56      0.65      8177
Digital_Ebook_Purchase       0.81      0.02      0.03     98295
Digital_Music_Purchase       0.52      0.01      0.02      9069
Digital_Video_Download       0.75      0.26      0.39     19688
           Electronics       0.61      0.50      0.55     15246
               Grocery       0.70      0.74      0.72     12058
Health_&_Personal_Care       0.59      0.49      0.53     23883
                  Home       0.53      0.54      0.54     34185
    Home_Entertainment       0.00      0.00      0.00         0
      Home_Improvement       0.56      

**Notes**: We're seeing mixed results across classes, which is expected given that the number of data points available varies considerably by product category. For example, Books reaches a recall of 0.94 with a precision of only 0.35, while Digital_Ebook_Purchase has a precision of 0.81 but a recall of just 0.02.
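
One way to dig into results like these is to check, for each true class, which class it is most often predicted as. The sketch below uses only the standard library. The `top_confusions` name and the toy labels are illustrative; in this notebook it could be called with `y_val` and `y_hat`.

```python
from collections import Counter, defaultdict


def top_confusions(y_true, y_pred):
    """For each true class, return (support, most common predicted class, its share)."""
    by_true = defaultdict(Counter)
    for true_label, pred_label in zip(y_true, y_pred):
        by_true[true_label][pred_label] += 1

    summary = {}
    for true_label, counts in by_true.items():
        support = sum(counts.values())
        pred_label, n = counts.most_common(1)[0]
        summary[true_label] = (support, pred_label, n / support)
    return summary


# Toy labels, made up for illustration only.
y_true = ['Books', 'Books', 'Ebook', 'Ebook', 'Ebook', 'Baby']
y_pred = ['Books', 'Books', 'Books', 'Books', 'Ebook', 'Baby']

for cls, (support, pred, share) in sorted(top_confusions(y_true, y_pred).items()):
    print('{:<6} support={} most often predicted as {} ({:.0%})'.format(cls, support, pred, share))
```

A class whose most common prediction is a different class is a candidate for closer inspection, whether the cause is class imbalance or genuinely overlapping review text.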


## Findings from Scaling Word2Vec on Amazon Reviews

In this notebook we have explored the use of Word2Vec on the full Amazon Reviews dataset.

Let's look at how the predictions have changed relative to the 1% sample experiment conducted in the previous stage of this data science experiment.

...

TO ADD!

