# Amazon Reviews - Scaling the Chosen Methodology / Technique

After the initial exploration and analysis stage of the data science pipeline, the next step is to scale up the data and methods being used, to determine whether the results still reflect those of the initial sample experiments.

If we're driven by a particular scientific enquiry rather than an application-driven data science workflow, we should also be trying to test our original hypothesis. Taking the original work performed during experimentation, we would look at our initial sample experiment and determine the confidence of our chosen model for predicting a specific label (let's say we're predicting the product_category). We would then try to establish whether the application of this method supports our H0.

While this is not strictly hypothesis testing (proving or disproving the null hypothesis), in many scenarios it tends to be a suitable way to validate our assumptions when moving from sample to scale.
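
One way to make the sample-to-scale check concrete is to put a confidence interval around the accuracy measured on the sample, and then see whether the accuracy measured at scale falls inside it. The sketch below uses a normal-approximation interval. The function name and all of the accuracy figures are hypothetical and are not results from this notebook.

```python
import math

def accuracy_confidence_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation (Wald) interval for a classification accuracy.

    z=1.96 corresponds to roughly 95% confidence.
    """
    std_err = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    lower = max(0.0, accuracy - z * std_err)
    upper = min(1.0, accuracy + z * std_err)
    return lower, upper

# Hypothetical numbers, for illustration only
sample_accuracy = 0.80   # accuracy on the initial sample experiment
sample_size = 10000      # number of test rows in that sample
scaled_accuracy = 0.795  # accuracy after training on the full dataset

lower, upper = accuracy_confidence_interval(sample_accuracy, sample_size)
consistent = lower <= scaled_accuracy <= upper
print('Interval: [{:.4f}, {:.4f}], consistent at scale: {}'.format(lower, upper, consistent))
```

If the scaled accuracy falls well outside the interval, that is a signal to revisit the sampling or the model before trusting the sample results.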

### Imports

The following imports are required in order to run different statistical tests and modelling techniques.

In [60]:
# NOTE: Uncomment the following lines on first run of the notebook. 
# !conda install -y -c conda-forge fastparquet scikit-learn arrow-cpp parquet-cpp pyarrow numpy
# !pip install --upgrade mxnet gluonnlp swifter dask cufflinks
# !pip install -q torch==1.4.0
# !pip install -q transformers
# !pip install s3-concat


In [62]:
import boto3
import sagemaker
from s3_concat import S3Concat
import sys
import os
import re
import numpy as np
import pandas as pd
import subprocess
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri
import gzip
from io import BytesIO
import zipfile
import random
import json
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.metrics import classification_report
import nltk
from fastparquet import write
from fastparquet import ParquetFile
import s3fs
import pyarrow.parquet as pq
import pickle
import glob
import ast 
import csv
import itertools
import dask.dataframe as dd
from dask.multiprocessing import get
import multiprocessing
import datetime

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from collections import OrderedDict

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn import metrics


### Configs and Global Vars

Throughout the notebook we store our settings in a configs dictionary and any shared runtime objects in a global_vars dictionary (strictly speaking, every variable in a notebook that is not defined inside a function is already global).

In [14]:
configs = {
    'aws_region' :  'us-east-1',
    'bucket_name': 'demos-amazon-reviews',
    'prefix' : 'preprocessed_reviews_csvs', #only use this if you want to have your files in a folder 
    'index_key' : 'review_date_str',
    'file_extension' :'.csv',
    'wordvecdata': 'wordvec-full-data',
    'models_dir': 'models',
}

global_vars = {}

### Environment Setup

Setting up the environment involves ensuring the correct session and IAM role are configured. We also need to ensure the correct region and bucket are available.

In [15]:
def setup_env(configs, global_vars):
    
    sess = sagemaker.Session()
    
    role = get_execution_role()

    AWS_REGION = configs['aws_region']
    s3 = boto3.resource('s3')

    s3_bucket = s3.Bucket(configs['bucket_name'])

    if s3_bucket.creation_date is None:
        # create the S3 bucket because it does not exist yet
        print('Creating S3 bucket {}.'.format(configs['bucket_name']))
        resp = s3.create_bucket(
            ACL='private',
            Bucket=configs['bucket_name']
        )
    else:
        print('Bucket already exists')
        
    global_vars['role'] = role
    global_vars['sess'] = sess
    global_vars['s3'] = s3
    global_vars['s3_bucket'] = s3_bucket
    
    return global_vars

global_vars = setup_env(configs, global_vars)

Bucket already exists


### Create Data Manifest

At this step, we need to create an index of all the files we're going to use for this experiment and for model building. We don't want to download all of the data at once, as that would cause a lot of I/O activity on the notebook instance.

Instead, we first create a path index of where the files live on S3. From there, we can sample the data to see what it looks like, compute some basic statistics on the sample to get a better handle on how we should build a model, and then move on to using all the data to build a robust model.
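
The path index relies on the date being encoded in each object key. As a small illustration, the sketch below parses a single key in the same way as the manifest function in the next cell. The helper name and the example key are hypothetical, and the key assumes a layout of prefix/review_date_str=<date>/<file>.csv.

```python
def parse_manifest_entry(key, prefix, index_key, idx=0):
    """Turn one S3 object key into a manifest entry (path, full key, date)."""
    relative_path = key.replace(prefix, '')
    # relative_path looks like '/review_date_str=<date>/<file>', so the
    # partition folder is the second element after splitting on '/'
    date = relative_path.split('/')[1].replace(index_key + '=', '')
    return {'idx': idx, 'path': relative_path, 'path_with_prefix': key, 'date': date}

# Hypothetical key, for illustration only
example_key = 'preprocessed_reviews_csvs/review_date_str=2015-08-31/part-00000.csv'
entry = parse_manifest_entry(example_key, 'preprocessed_reviews_csvs', 'review_date_str')
print(entry['date'])
```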

In [4]:
def create_dataset_manifest(configs, global_vars):
    
    interval_printer_idx = 100
    idx = 0
    conn = global_vars['s3_bucket']
    file_format = configs['file_extension']
    index_key = configs['index_key']+'='
    s3_prefix = configs['prefix']+'/'
    manifest = []    
    for file in conn.objects.filter(Prefix=s3_prefix):
        path = file.key
#         print(file)
        if (file_format in path):
#             print(path)
            relative_path = path.replace(configs['prefix'],'')
            date = relative_path.split('/')[1].replace(index_key,'')

            man = {'idx':idx, 'path':relative_path, 'path_with_prefix':path, 'date':date}
            manifest.append(man)  
            idx += 1
            if (idx % interval_printer_idx) == 0:
                print('Processed {} files'.format(idx))
    print('Training Dataset Size {}'.format(len(manifest)))
    return manifest
            
manifest = create_dataset_manifest(configs, global_vars)   
    

Processed 100 files
Processed 200 files
Training Dataset Size 241


In [6]:
def count_lines(configs, global_vars, entry):
        
    s3 = boto3.client('s3')

    resp = s3.select_object_content(
        Bucket=configs['bucket_name'],
        Key=entry['path_with_prefix'],
        ExpressionType='SQL',
        Expression="SELECT count(*) FROM s3object s",
        InputSerialization = {'CSV':
                              {"FileHeaderInfo": "Use", 
                               "AllowQuotedRecordDelimiter": True,
                               "QuoteEscapeCharacter":"\\",
                              }, 
                              'CompressionType': 'NONE'},
        OutputSerialization = {'CSV':{}},
    )
    
    for event in resp['Payload']:
        if 'Records' in event:
            records = event['Records']['Payload'].decode('utf-8')
#             print('Rows:',records)
            return(int(records))

#sanity check that we have the right amount of data for a given file!
count_lines(configs, global_vars, manifest[240])

5259983

## Transform and Upload Data to S3
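
The preparation functions in this section flatten each tokenised review into a space-separated string and, in the augmented variant, write one JSON object per line with source and label keys. The sketch below walks through that transformation on two made-up rows using only the standard library. It assumes review_body_processed holds a stringified Python list of tokens. Under that assumption the strip/split approach leaves the quote characters attached to each token, so ast.literal_eval is shown as one possible alternative rather than as the notebook's own method.

```python
import ast
import json

def tokens_via_strip(entry):
    """Mirror the notebook's approach: strip the brackets, split on ', '."""
    return ' '.join(str(entry).strip('][').split(', '))

def tokens_via_literal_eval(entry):
    """Alternative (an assumption, not the notebook's method): parse the list literal."""
    return ' '.join(ast.literal_eval(entry))

# Made-up rows, for illustration only
rows = [
    {'product_category': 'Books', 'review_body_processed': "['great', 'read', 'loved', 'it']"},
    {'product_category': 'Toys', 'review_body_processed': "['fun', 'toy', 'for', 'kids']"},
]

for row in rows:
    # the strip/split version keeps the quotes around every token
    print(tokens_via_strip(row['review_body_processed']))
    source = tokens_via_literal_eval(row['review_body_processed'])
    # one JSON object per line, using the same 'source' and 'label' keys
    json_line = json.dumps({'source': source, 'label': row['product_category']})
    print(json_line)
```

Whether the quotes matter depends on how the column was written during preprocessing, so it is worth inspecting one line of a generated train file before launching a training job.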

In [37]:
def prep_data(df):
    
    df_len = df.shape[0]
    pct_min = 0.01
    # keep only categories that make up more than 1% of the rows;
    # very rare categories leave the data imbalanced and skew the model
    min_product_category_row_count = df_len * pct_min
    df = df.groupby('product_category').filter(lambda x : len(x)>min_product_category_row_count)
    return df


def prep_data_for_supervised_blazing_text_csv(df, train_file_output_name, test_file_output_name):
    
    
    label_prefix = "__label__"    
    labels = (label_prefix + df['product_category']).tolist()
    #and tokenized words
    tmp = df['review_body_processed']
    xs = []
    for entry in tmp:
        res = str(entry).strip('][').split(', ') 
        res = ' '.join(res)
        xs.append(res)
    
   
    #split the data into test and train for supervised mode
    X_train, X_test, y_train, y_test = train_test_split(
        xs, labels, random_state = 0)
    
    
    train_prepped = []
    #train
    for i in range(0, len(X_train)):
        
        row = str(y_train[i]) + " " + str(X_train[i])
        train_prepped.append([row])
#     print('Train Processed Data: {}'.format(train_prepped[0]))
    
    test_prepped = []
    #test
    for i in range(0, len(X_test)):
        row = str(y_test[i]) + " " + str(X_test[i])
        test_prepped.append([row])
    print('')
#     print('Test Processed Data: {}'.format(test_prepped[0]))
    
    with open(train_file_output_name, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', 
                                lineterminator='\n',  
                                escapechar=' ', 
                                quoting=csv.QUOTE_NONE)
        csv_writer.writerows(train_prepped)

    with open(test_file_output_name, 'w') as csvoutfile:
        csv_writer = csv.writer(csvoutfile, delimiter=' ', 
                                lineterminator='\n',  
                                escapechar=' ', 
                                quoting=csv.QUOTE_NONE)        
        csv_writer.writerows(test_prepped)
        
    return True

def prep_data_for_supervised_blazing_text_augmented(df, train_file_output_name, test_file_output_name):
    
    
    label_prefix = "__label__"    
    labels = df['product_category'].tolist()
    #and tokenized words
    tmp = df['review_body_processed']
    xs = []
    for entry in tmp:
        res = str(entry).strip('][').split(', ') 
        res = ' '.join(res)
        xs.append(res)
        
    #split the data into test and train for supervised mode
    X_train, X_test, y_train, y_test = train_test_split(
        xs, labels, random_state = 0)
    
    
    train_prepped = []
    #train
    for i in range(0, len(X_train)):
        src = str(X_train[i])
        if len(src)>10:
            row = {'source':src,'label':str(y_train[i])}
            train_prepped.append(row)
#     print('Train Processed Data: {}'.format(train_prepped[0]))
    
    test_prepped = []
    #test
    for i in range(0, len(X_test)):
        src = str(X_test[i])
        if len(src)>10:
            row = {'source':src,'label':str(y_test[i])}
    #         row = str(y_test[i]) + " " + str(X_test[i])
            test_prepped.append(row)
    print('')
#     print('Test Processed Data: {}'.format(test_prepped[0]))
    
    with open(train_file_output_name, 'w') as outfile:
        for row in train_prepped:
            outfile.write(json.dumps(row)+'\n')
#         json.dump(train_prepped, outfile, indent="")

    with open(test_file_output_name, 'w') as outfile:
        for row in test_prepped:
            outfile.write(json.dumps(row)+'\n')
    return True
        
        
def upload_corpus_to_s3(configs, global_vars, train_file , test_file):
    
    
    train_prefix = 'train'
    test_prefix = 'test'
    s3_bucket = global_vars['s3_bucket']
   
    data_file_s3 = '{}/{}/{}'.format(configs['wordvecdata'], train_prefix, train_file)
    s3_bucket.upload_file(train_file, data_file_s3)   

    data_file_s3 = '{}/{}/{}'.format(configs['wordvecdata'], test_prefix, test_file)
    s3_bucket.upload_file(test_file, data_file_s3) 
    
    s3_train_data = 's3://{}/{}/{}'.format(configs['bucket_name'], configs['wordvecdata'], train_prefix)
    s3_test_data = 's3://{}/{}/{}'.format(configs['bucket_name'], configs['wordvecdata'], test_prefix)
    s3_output_location = 's3://{}/{}/output'.format(configs['bucket_name'], configs['wordvecdata'])
    
    configs['s3_w2v_train_data'] = s3_train_data
    configs['s3_w2v_test_data'] = s3_test_data
    configs['s3_w2v_output_location'] = s3_output_location

    print('S3 Training Data Path {}'.format(s3_train_data))
    print('S3 Test Data Path {}'.format(s3_test_data))
    print('S3 output Data Path {}'.format(s3_output_location))

    return configs

def remove_local_file(filename):
    
    os.remove(filename)
    
def download_transform_upload(configs, global_vars, manifest):
        
    #As we're dealing with a large dataset, we process one manifest entry (file) at a time
    #and delete the local copies after each upload
    
    partNum = 0
    for entry in manifest:
        full_path = 's3://'+configs['bucket_name']+'/'+entry['path_with_prefix']
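        # error_bad_lines=False skips malformed rows (reported as 'Skipping line ...');
        # it was deprecated in pandas 1.3 in favour of the on_bad_lines parameter
        df = pd.read_csv(full_path, header=0, error_bad_lines=False, escapechar="\\")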
        print('Dataset Rows {}, Columns {}'.format(df.shape[0], df.shape[1]))
        df = prep_data(df)
        
        train_file = 'amazonreviews_part_{}.train'.format(partNum)
        test_file = 'amazonreviews_part_{}.test'.format(partNum)
        
        if prep_data_for_supervised_blazing_text_augmented(df, train_file, test_file):
            #upload new train file
            configs = upload_corpus_to_s3(configs, global_vars, train_file , test_file)         
            #delete local file
            remove_local_file(train_file)
            remove_local_file(test_file)
            partNum += 1
        else:
            print('Could not process File {}'.format(full_path))
            
        
    

In [None]:
download_transform_upload(configs, global_vars, manifest)

Dataset Rows 2, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 23, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 19, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 27, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 45, Columns 18

S3 Training Data

b'Skipping line 1434: expected 18 fields, saw 20\n'


Dataset Rows 5214, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 6766, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 5199, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 7537, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 8277, Columns 18

S3 Tr



Dataset Rows 35460, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 44151, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 79807, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 83417, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 72891, Columns 18



b'Skipping line 3060: expected 18 fields, saw 75\n'


Dataset Rows 90017, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 79224: expected 18 fields, saw 23\n'


Dataset Rows 81109, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 65477, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output




Dataset Rows 71941, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 10724: expected 18 fields, saw 19\n'
b'Skipping line 46605: expected 18 fields, saw 37\n'


Dataset Rows 64475, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 66688, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 70538, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 62081, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 67995, Columns 18



b'Skipping line 32153: expected 18 fields, saw 21\n'


Dataset Rows 72207, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 4994: expected 18 fields, saw 20\n'


Dataset Rows 77716, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 82284, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 81643, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 63106, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 71604, Columns 18



b'Skipping line 19642: expected 18 fields, saw 19\n'


Dataset Rows 70447, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 79025, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 75563, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 71320, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 70862, Columns 18



b'Skipping line 8011: expected 18 fields, saw 26\n'


Dataset Rows 71710, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 66602, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 78862, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 77524, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 88210, Columns 18



b'Skipping line 45262: expected 18 fields, saw 23\n'


Dataset Rows 75373, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 7114: expected 18 fields, saw 26\n'


Dataset Rows 81213, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 78506, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 76810, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 86750, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 85934, Columns 18



b'Skipping line 79951: expected 18 fields, saw 22\n'


Dataset Rows 96123, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 4799: expected 18 fields, saw 30\n'


Dataset Rows 101887, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 91068, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 93147, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 92508, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 88535, Columns 18


b'Skipping line 85199: expected 18 fields, saw 22\n'


Dataset Rows 94735, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 91228, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 76805, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 74265, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 78701, Columns 18



b'Skipping line 16937: expected 18 fields, saw 19\n'
b'Skipping line 67400: expected 18 fields, saw 21\n'


Dataset Rows 80508, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 88194, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 95781, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 47572: expected 18 fields, saw 28\n'


Dataset Rows 78793, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 44777: expected 18 fields, saw 26\n'


Dataset Rows 90743, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 83098, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 29367: expected 18 fields, saw 26\n'


Dataset Rows 84577, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 51287: expected 18 fields, saw 26\n'


Dataset Rows 85711, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 5216: expected 18 fields, saw 29\n'


Dataset Rows 103145, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 132469, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 22930: expected 18 fields, saw 20\n'


Dataset Rows 132218, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 292: expected 18 fields, saw 22\n'


Dataset Rows 118777, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 96786, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 106540, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 119913, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 124285, Columns 

b'Skipping line 79321: expected 18 fields, saw 20\n'


Dataset Rows 152398, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 56503, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 95385, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 103895, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 71876: expected 18 fields, saw 19\n'


Dataset Rows 123681, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 122991, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 75515, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 77120, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 133674, Columns 1

b'Skipping line 29877: expected 18 fields, saw 25\n'


Dataset Rows 137420, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 175268, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 115010, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 184708, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 137037, Columns

b'Skipping line 150471: expected 18 fields, saw 31\n'


Dataset Rows 154999, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 154494, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 193485, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 227001, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 109600: expected 18 fields, saw 20\n'


Dataset Rows 183445, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 127580: expected 18 fields, saw 21\n'


Dataset Rows 173230, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 170299: expected 18 fields, saw 33\n'


Dataset Rows 181281, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 129424: expected 18 fields, saw 21\n'


Dataset Rows 178596, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 180954, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 189676, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 6303: expected 18 fields, saw 50\n'


Dataset Rows 187649, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 190914, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 191232, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 127305: expected 18 fields, saw 19\n'


Dataset Rows 186665, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 231229, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 300814, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 229109, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 228743, Columns

b'Skipping line 160003: expected 18 fields, saw 46\n'


Dataset Rows 226426, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 101492: expected 18 fields, saw 27\n'


Dataset Rows 232363, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 232692, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 202024, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 220256, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 246444: expected 18 fields, saw 20\n'


Dataset Rows 265092, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 224730: expected 18 fields, saw 34\n'


Dataset Rows 247477, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 247074, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 295955, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 315542, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 165406: expected 18 fields, saw 20\n'


Dataset Rows 330471, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 296907, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 259826: expected 18 fields, saw 19\n'


Dataset Rows 264590, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 444: expected 18 fields, saw 19\n'
b'Skipping line 265664: expected 18 fields, saw 19\n'


Dataset Rows 268183, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 263428, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 308800, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 53053: expected 18 fields, saw 25\n'


Dataset Rows 313183, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 258037: expected 18 fields, saw 49\n'


Dataset Rows 338142, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 62844: expected 18 fields, saw 20\n'


Dataset Rows 339632, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 355057, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 365118: expected 18 fields, saw 19\n'


Dataset Rows 435037, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 116476: expected 18 fields, saw 20\n'
b'Skipping line 183741: expected 18 fields, saw 21\n'
b'Skipping line 243671: expected 18 fields, saw 19\n'


Dataset Rows 515924, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 78688: expected 18 fields, saw 21\n'
b'Skipping line 104343: expected 18 fields, saw 22\n'
b'Skipping line 194984: expected 18 fields, saw 24\n'
b'Skipping line 255976: expected 18 fields, saw 22\n'
b'Skipping line 283054: expected 18 fields, saw 22\n'


Dataset Rows 407372, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 6946: expected 18 fields, saw 23\n'
b'Skipping line 345065: expected 18 fields, saw 26\n'
b'Skipping line 412831: expected 18 fields, saw 22\n'
b'Skipping line 431015: expected 18 fields, saw 20\n'


Dataset Rows 448239, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 267725: expected 18 fields, saw 19\n'
b'Skipping line 429699: expected 18 fields, saw 21\n'


Dataset Rows 430575, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 429266, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 146896: expected 18 fields, saw 19\n'


Dataset Rows 452754, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 43114: expected 18 fields, saw 26\n'
b'Skipping line 229635: expected 18 fields, saw 21\n'


Dataset Rows 478487, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 154403: expected 18 fields, saw 21\n'
b'Skipping line 452308: expected 18 fields, saw 20\n'


Dataset Rows 509161, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 53188: expected 18 fields, saw 19\n'
b'Skipping line 262789: expected 18 fields, saw 23\n'
b'Skipping line 388306: expected 18 fields, saw 19\n'
b'Skipping line 470556: expected 18 fields, saw 45\n'


Dataset Rows 533598, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 544948: expected 18 fields, saw 19\n'


Dataset Rows 545043, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 16264: expected 18 fields, saw 26\n'
b'Skipping line 172230: expected 18 fields, saw 22\n'
b'Skipping line 332015: expected 18 fields, saw 21\n'
b'Skipping line 371153: expected 18 fields, saw 19\n'
b'Skipping line 423263: expected 18 fields, saw 22\n'
b'Skipping line 485818: expected 18 fields, saw 44\n'


Dataset Rows 568545, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 179135: expected 18 fields, saw 19\n'
b'Skipping line 212258: expected 18 fields, saw 19\n'
b'Skipping line 646708: expected 18 fields, saw 19\n'
b'Skipping line 655386: expected 18 fields, saw 21\n'


Dataset Rows 743941, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 105653: expected 18 fields, saw 23\n'
b'Skipping line 849719: expected 18 fields, saw 20\n'


Dataset Rows 859212, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 666460: expected 18 fields, saw 20\n'


Dataset Rows 671965, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 12362: expected 18 fields, saw 19\n'
b'Skipping line 486586: expected 18 fields, saw 35\n'


Dataset Rows 725497, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 250802: expected 18 fields, saw 28\n'
b'Skipping line 372473: expected 18 fields, saw 21\n'
b'Skipping line 427570: expected 18 fields, saw 26\n'


Dataset Rows 670426, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output
Dataset Rows 685392, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 166263: expected 18 fields, saw 20\n'
b'Skipping line 643436: expected 18 fields, saw 22\n'
b'Skipping line 684144: expected 18 fields, saw 19\n'


Dataset Rows 699200, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 207449: expected 18 fields, saw 20\n'
b'Skipping line 439715: expected 18 fields, saw 27\n'


Dataset Rows 757924, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 10196: expected 18 fields, saw 40\n'
b'Skipping line 179501: expected 18 fields, saw 20\n'


Dataset Rows 772142, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 169739: expected 18 fields, saw 19\n'
b'Skipping line 325163: expected 18 fields, saw 25\n'
b'Skipping line 426128: expected 18 fields, saw 24\n'
b'Skipping line 473867: expected 18 fields, saw 20\n'
b'Skipping line 830764: expected 18 fields, saw 20\n'
b'Skipping line 901447: expected 18 fields, saw 22\n'


Dataset Rows 915242, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 276669: expected 18 fields, saw 20\nSkipping line 288877: expected 18 fields, saw 19\n'
b'Skipping line 295408: expected 18 fields, saw 23\n'
b'Skipping line 846905: expected 18 fields, saw 22\n'


Dataset Rows 1025940, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 40451: expected 18 fields, saw 21\n'
b'Skipping line 519374: expected 18 fields, saw 21\n'
b'Skipping line 774344: expected 18 fields, saw 20\n'
b'Skipping line 799837: expected 18 fields, saw 44\n'
b'Skipping line 1195376: expected 18 fields, saw 30\n'


Dataset Rows 1304875, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 1323002: expected 18 fields, saw 19\n'
b'Skipping line 1504371: expected 18 fields, saw 26\n'
b'Skipping line 2172111: expected 18 fields, saw 24\n'


Dataset Rows 2388762, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 107662: expected 18 fields, saw 19\n'
b'Skipping line 182450: expected 18 fields, saw 19\n'
b'Skipping line 242822: expected 18 fields, saw 21\n'
b'Skipping line 553668: expected 18 fields, saw 21\n'
b'Skipping line 1107339: expected 18 fields, saw 19\n'
b'Skipping line 1394760: expected 18 fields, saw 19\n'
b'Skipping line 1564840: expected 18 fields, saw 19\n'
b'Skipping line 1800362: expected 18 fields, saw 25\n'
b'Skipping line 1942145: expected 18 fields, saw 20\n'


Dataset Rows 2826699, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 547502: expected 18 fields, saw 20\n'
b'Skipping line 641957: expected 18 fields, saw 19\n'
b'Skipping line 1181218: expected 18 fields, saw 32\n'
b'Skipping line 1368433: expected 18 fields, saw 20\n'
b'Skipping line 1715896: expected 18 fields, saw 39\n'
b'Skipping line 1950271: expected 18 fields, saw 20\n'
b'Skipping line 2112904: expected 18 fields, saw 21\n'


Dataset Rows 2227541, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 339715: expected 18 fields, saw 20\n'
b'Skipping line 374722: expected 18 fields, saw 21\n'
b'Skipping line 909974: expected 18 fields, saw 25\n'
b'Skipping line 1255391: expected 18 fields, saw 21\n'
b'Skipping line 1470915: expected 18 fields, saw 19\n'
b'Skipping line 1849468: expected 18 fields, saw 21\n'
b'Skipping line 2003151: expected 18 fields, saw 30\n'


Dataset Rows 2329497, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 595037: expected 18 fields, saw 21\n'
b'Skipping line 1106871: expected 18 fields, saw 22\n'
b'Skipping line 1467716: expected 18 fields, saw 22\nSkipping line 1470303: expected 18 fields, saw 21\n'
b'Skipping line 1648052: expected 18 fields, saw 27\n'
b'Skipping line 1688088: expected 18 fields, saw 20\n'
b'Skipping line 507022: expected 18 fields, saw 20\n'
b'Skipping line 603880: expected 18 fields, saw 19\n'
b'Skipping line 665048: expected 18 fields, saw 21\n'
b'Skipping line 997271: expected 18 fields, saw 19\n'
b'Skipping line 1191284: expected 18 fields, saw 19\n'
b'Skipping line 1544984: expected 18 fields, saw 24\n'
b'Skipping line 1578878: expected 18 fields, saw 20\n'


Dataset Rows 2303639, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 346786: expected 18 fields, saw 24\n'
b'Skipping line 376777: expected 18 fields, saw 20\n'
b'Skipping line 437764: expected 18 fields, saw 19\n'
b'Skipping line 601927: expected 18 fields, saw 23\n'
b'Skipping line 1017710: expected 18 fields, saw 23\n'
b'Skipping line 1350411: expected 18 fields, saw 20\n'
b'Skipping line 1723666: expected 18 fields, saw 20\n'
b'Skipping line 1943655: expected 18 fields, saw 19\n'


Dataset Rows 2147647, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 392761: expected 18 fields, saw 19\n'
b'Skipping line 597522: expected 18 fields, saw 19\n'
b'Skipping line 790320: expected 18 fields, saw 20\n'
b'Skipping line 1102921: expected 18 fields, saw 19\n'
b'Skipping line 1368570: expected 18 fields, saw 19\n'
b'Skipping line 1461203: expected 18 fields, saw 22\n'
b'Skipping line 1729183: expected 18 fields, saw 21\n'
b'Skipping line 1981788: expected 18 fields, saw 20\n'
b'Skipping line 2113686: expected 18 fields, saw 19\n'
b'Skipping line 2900715: expected 18 fields, saw 33\n'


Dataset Rows 2905588, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 11196: expected 18 fields, saw 22\n'
b'Skipping line 374306: expected 18 fields, saw 21\nSkipping line 385562: expected 18 fields, saw 24\n'
b'Skipping line 1237776: expected 18 fields, saw 19\n'
b'Skipping line 2111518: expected 18 fields, saw 21\n'
b'Skipping line 2385917: expected 18 fields, saw 19\n'
b'Skipping line 3008194: expected 18 fields, saw 20\n'
b'Skipping line 3095815: expected 18 fields, saw 19\n'
b'Skipping line 3195837: expected 18 fields, saw 21\n'


Dataset Rows 3587244, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 31529: expected 18 fields, saw 19\n'
b'Skipping line 37231: expected 18 fields, saw 20\n'
b'Skipping line 235882: expected 18 fields, saw 19\n'
b'Skipping line 314897: expected 18 fields, saw 20\n'
b'Skipping line 544557: expected 18 fields, saw 20\n'
b'Skipping line 1017556: expected 18 fields, saw 36\n'
b'Skipping line 1084260: expected 18 fields, saw 20\n'


Dataset Rows 2829855, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 243150: expected 18 fields, saw 32\n'
b'Skipping line 1468142: expected 18 fields, saw 22\n'
b'Skipping line 1867001: expected 18 fields, saw 19\n'
b'Skipping line 1953176: expected 18 fields, saw 23\n'


Dataset Rows 3022435, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 272799: expected 18 fields, saw 25\n'
b'Skipping line 344949: expected 18 fields, saw 20\n'
b'Skipping line 460948: expected 18 fields, saw 19\nSkipping line 474877: expected 18 fields, saw 28\n'
b'Skipping line 1078278: expected 18 fields, saw 19\n'
b'Skipping line 1493872: expected 18 fields, saw 31\n'
b'Skipping line 1599672: expected 18 fields, saw 20\n'
b'Skipping line 2284529: expected 18 fields, saw 22\n'
b'Skipping line 2592035: expected 18 fields, saw 22\n'


Dataset Rows 2682400, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 3530: expected 18 fields, saw 23\n'
b'Skipping line 406300: expected 18 fields, saw 19\n'
b'Skipping line 469217: expected 18 fields, saw 20\n'
b'Skipping line 1188257: expected 18 fields, saw 23\n'


Dataset Rows 2638420, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 839478: expected 18 fields, saw 19\n'
b'Skipping line 1072193: expected 18 fields, saw 21\n'
b'Skipping line 1470670: expected 18 fields, saw 19\n'
b'Skipping line 2020565: expected 18 fields, saw 21\n'
b'Skipping line 2113841: expected 18 fields, saw 19\nSkipping line 2115355: expected 18 fields, saw 23\n'


Dataset Rows 2708013, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 453892: expected 18 fields, saw 19\n'
b'Skipping line 973358: expected 18 fields, saw 27\n'
b'Skipping line 1318559: expected 18 fields, saw 22\n'
b'Skipping line 1393960: expected 18 fields, saw 21\n'
b'Skipping line 1856130: expected 18 fields, saw 19\n'
b'Skipping line 2109455: expected 18 fields, saw 28\n'
b'Skipping line 2248178: expected 18 fields, saw 22\n'
b'Skipping line 2390685: expected 18 fields, saw 20\nSkipping line 2390691: expected 18 fields, saw 24\n'
b'Skipping line 2799889: expected 18 fields, saw 19\n'
b'Skipping line 2980239: expected 18 fields, saw 21\n'
b'Skipping line 3301518: expected 18 fields, saw 20\n'
b'Skipping line 3845988: expected 18 fields, saw 21\n'


Dataset Rows 4064186, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 6397: expected 18 fields, saw 25\nSkipping line 10732: expected 18 fields, saw 19\n'
b'Skipping line 100070: expected 18 fields, saw 36\n'
b'Skipping line 490471: expected 18 fields, saw 26\n'
b'Skipping line 533342: expected 18 fields, saw 26\n'
b'Skipping line 716501: expected 18 fields, saw 21\n'
b'Skipping line 1090740: expected 18 fields, saw 19\n'
b'Skipping line 1281270: expected 18 fields, saw 19\n'
b'Skipping line 1789827: expected 18 fields, saw 21\n'
b'Skipping line 1822817: expected 18 fields, saw 20\n'
b'Skipping line 2099836: expected 18 fields, saw 19\n'
b'Skipping line 2421976: expected 18 fields, saw 23\n'
b'Skipping line 2861821: expected 18 fields, saw 26\n'
b'Skipping line 3687915: expected 18 fields, saw 27\n'
b'Skipping line 4023985: expected 18 fields, saw 23\n'


Dataset Rows 4146029, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 1309528: expected 18 fields, saw 21\n'
b'Skipping line 1501414: expected 18 fields, saw 20\n'
b'Skipping line 2159454: expected 18 fields, saw 24\n'
b'Skipping line 2625075: expected 18 fields, saw 19\nSkipping line 2628095: expected 18 fields, saw 24\n'
b'Skipping line 3017005: expected 18 fields, saw 23\n'


Dataset Rows 3956179, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 129716: expected 18 fields, saw 19\n'
b'Skipping line 335169: expected 18 fields, saw 21\n'
b'Skipping line 754967: expected 18 fields, saw 25\n'
b'Skipping line 1321485: expected 18 fields, saw 23\n'
b'Skipping line 1715179: expected 18 fields, saw 21\nSkipping line 1722715: expected 18 fields, saw 23\n'
b'Skipping line 3044482: expected 18 fields, saw 20\n'
b'Skipping line 3381768: expected 18 fields, saw 22\n'
b'Skipping line 3409741: expected 18 fields, saw 28\nSkipping line 3434558: expected 18 fields, saw 21\n'
b'Skipping line 3941093: expected 18 fields, saw 29\n'


Dataset Rows 4224033, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 213964: expected 18 fields, saw 20\n'
b'Skipping line 1057558: expected 18 fields, saw 20\n'
b'Skipping line 1435053: expected 18 fields, saw 30\n'
b'Skipping line 3199147: expected 18 fields, saw 27\n'


Dataset Rows 4139016, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 3019488: expected 18 fields, saw 21\n'
b'Skipping line 3798245: expected 18 fields, saw 21\n'
b'Skipping line 4444169: expected 18 fields, saw 22\n'
b'Skipping line 4581898: expected 18 fields, saw 24\n'
b'Skipping line 4728182: expected 18 fields, saw 19\n'
b'Skipping line 5203522: expected 18 fields, saw 19\n'
b'Skipping line 5270395: expected 18 fields, saw 20\n'
b'Skipping line 5326682: expected 18 fields, saw 23\n'


Dataset Rows 5371561, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 551412: expected 18 fields, saw 19\n'
b'Skipping line 790934: expected 18 fields, saw 20\n'
b'Skipping line 1195394: expected 18 fields, saw 22\n'
b'Skipping line 1443641: expected 18 fields, saw 29\n'
b'Skipping line 3011004: expected 18 fields, saw 19\n'
b'Skipping line 3825037: expected 18 fields, saw 19\n'
b'Skipping line 4594412: expected 18 fields, saw 24\n'
b'Skipping line 4940090: expected 18 fields, saw 22\n'
b'Skipping line 5249010: expected 18 fields, saw 25\n'


Dataset Rows 5498159, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 380807: expected 18 fields, saw 28\n'
b'Skipping line 560494: expected 18 fields, saw 19\n'
b'Skipping line 697659: expected 18 fields, saw 33\n'
b'Skipping line 1273621: expected 18 fields, saw 21\n'
b'Skipping line 1444690: expected 18 fields, saw 26\n'
b'Skipping line 1874542: expected 18 fields, saw 19\n'
b'Skipping line 2630979: expected 18 fields, saw 19\n'
b'Skipping line 3307929: expected 18 fields, saw 21\n'
b'Skipping line 3498867: expected 18 fields, saw 28\n'
b'Skipping line 3668380: expected 18 fields, saw 22\n'
b'Skipping line 3854370: expected 18 fields, saw 21\n'
b'Skipping line 3981179: expected 18 fields, saw 21\n'
b'Skipping line 4643530: expected 18 fields, saw 20\n'


Dataset Rows 5076247, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 163084: expected 18 fields, saw 23\n'
b'Skipping line 226951: expected 18 fields, saw 21\n'
b'Skipping line 1296093: expected 18 fields, saw 23\n'
b'Skipping line 2527761: expected 18 fields, saw 19\n'
b'Skipping line 3913350: expected 18 fields, saw 19\n'


Dataset Rows 5529430, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 4361: expected 18 fields, saw 19\n'
b'Skipping line 403825: expected 18 fields, saw 20\n'
b'Skipping line 1750832: expected 18 fields, saw 20\n'
b'Skipping line 2150291: expected 18 fields, saw 20\n'
b'Skipping line 2833078: expected 18 fields, saw 20\n'
b'Skipping line 3127176: expected 18 fields, saw 23\n'
b'Skipping line 4170957: expected 18 fields, saw 19\nSkipping line 4181051: expected 18 fields, saw 24\n'


Dataset Rows 4827214, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 10707: expected 18 fields, saw 19\n'
b'Skipping line 483511: expected 18 fields, saw 19\n'
b'Skipping line 568213: expected 18 fields, saw 19\n'
b'Skipping line 1199638: expected 18 fields, saw 19\n'
b'Skipping line 1606838: expected 18 fields, saw 19\n'
b'Skipping line 1909690: expected 18 fields, saw 19\n'
b'Skipping line 2009220: expected 18 fields, saw 24\n'
b'Skipping line 2045041: expected 18 fields, saw 20\n'
b'Skipping line 3004554: expected 18 fields, saw 19\n'
b'Skipping line 3759389: expected 18 fields, saw 23\n'
b'Skipping line 4318231: expected 18 fields, saw 32\n'
b'Skipping line 4479895: expected 18 fields, saw 19\n'


Dataset Rows 4739672, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 791124: expected 18 fields, saw 19\nSkipping line 815704: expected 18 fields, saw 21\n'
b'Skipping line 930360: expected 18 fields, saw 25\n'
b'Skipping line 1896268: expected 18 fields, saw 20\n'
b'Skipping line 2138116: expected 18 fields, saw 19\n'
b'Skipping line 2361543: expected 18 fields, saw 20\n'
b'Skipping line 3252623: expected 18 fields, saw 19\n'
b'Skipping line 3773000: expected 18 fields, saw 24\n'
b'Skipping line 4018364: expected 18 fields, saw 24\n'
b'Skipping line 4673395: expected 18 fields, saw 19\n'
b'Skipping line 4706059: expected 18 fields, saw 20\n'


Dataset Rows 4774088, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 782600: expected 18 fields, saw 20\n'
b'Skipping line 2904351: expected 18 fields, saw 19\n'
b'Skipping line 3172892: expected 18 fields, saw 42\n'
b'Skipping line 3224820: expected 18 fields, saw 51\n'
b'Skipping line 3501485: expected 18 fields, saw 21\n'
b'Skipping line 3528793: expected 18 fields, saw 19\n'
b'Skipping line 4126729: expected 18 fields, saw 19\n'
b'Skipping line 4201875: expected 18 fields, saw 20\n'
b'Skipping line 4788045: expected 18 fields, saw 20\n'


Dataset Rows 5144760, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


b'Skipping line 2354843: expected 18 fields, saw 20\n'
b'Skipping line 4076974: expected 18 fields, saw 20\n'
b'Skipping line 4685772: expected 18 fields, saw 23\n'
b'Skipping line 4883245: expected 18 fields, saw 27\n'


Dataset Rows 5259979, Columns 18

S3 Training Data Path s3://demos-amazon-reviews/wordvec-full-data/train
S3 Test Data Path s3://demos-amazon-reviews/wordvec-full-data/test
S3 output Data Path s3://demos-amazon-reviews/wordvec-full-data/output


In [67]:
#as blazingtext pipe only supports one augmented file for train and test, let's concat them all
def concat_augmented_files(configs, global_vars):
    
    #output filename
    concatenated_file_train = 'amazon_augmented_train.json'
    concatenated_file_test = 'amazon_augmented_test.json'
    
    #where all our files sit
    train_prefix = 'train'
    test_prefix = 'test'
    s3_train_path = '{}/{}/'.format(configs['wordvecdata'], train_prefix)
    s3_test_path = '{}/{}/'.format(configs['wordvecdata'], test_prefix)
    
    
    s3_concat_file_path_train = '{}/{}'.format(configs['s3_w2v_train_data'], concatenated_file_train)
    s3_concat_file_path_test = '{}/{}'.format(configs['s3_w2v_test_data'], concatenated_file_test)
    
    
    print(s3_train_path)
    print(s3_test_path)

    min_file_size = None

    #train file
    job_train = S3Concat(configs['bucket_name'], 
                         s3_concat_file_path_train, 
                         min_file_size,
                         content_type='application/json',
                         session=boto3.session.Session()
                        )
    
    job_train.add_files(s3_train_path)
    job_train.concat(small_parts_threads=32)

    
    #test file
    job_test = S3Concat(configs['bucket_name'], 
                         s3_concat_file_path_test, 
                         min_file_size,
                         content_type='application/json',
                         session=boto3.session.Session()
                        )
    
    job_test.add_files(s3_test_path)
    job_test.concat(small_parts_threads=32)
    
    
    configs['s3_w2v_train_file'] = s3_concat_file_path_train
    
    configs['s3_w2v_test_file'] = s3_concat_file_path_test
    
    return configs

configs = concat_augmented_files(configs, global_vars)


wordvec-full-data/train/
wordvec-full-data/test/


In [95]:
### Validate Data
def validate_jsonlines(filename):
    
    lines_cnt = 0
    with open(filename, 'r') as jfile:
        for line in jfile:
            try:
                line_loaded = json.loads(line)
#                 print(line_loaded)
                lines_cnt += 1
#                 break
                if (lines_cnt % 1000000) == 0:
                    print('read {} lines'.format(lines_cnt))
        
            except Exception as e:
                print(e)
                print('error in line {}'.format(line))
                
    print('Total parsed lines {}'.format(lines_cnt))
    
validate_jsonlines('amazon_augmented_train.json')  

read 1000000 lines
read 2000000 lines
read 3000000 lines
read 4000000 lines
read 5000000 lines
read 6000000 lines
read 7000000 lines
read 8000000 lines
read 9000000 lines
read 10000000 lines
read 11000000 lines
read 12000000 lines
read 13000000 lines
read 14000000 lines
read 15000000 lines
read 16000000 lines
read 17000000 lines
read 18000000 lines
read 19000000 lines
read 20000000 lines
read 21000000 lines
read 22000000 lines
read 23000000 lines
read 24000000 lines
read 25000000 lines
read 26000000 lines
read 27000000 lines
read 28000000 lines
read 29000000 lines
read 30000000 lines
read 31000000 lines
read 32000000 lines
read 33000000 lines
read 34000000 lines
read 35000000 lines
read 36000000 lines
read 37000000 lines
read 38000000 lines
read 39000000 lines
read 40000000 lines
read 41000000 lines
read 42000000 lines
read 43000000 lines
read 44000000 lines
read 45000000 lines
read 46000000 lines
read 47000000 lines
read 48000000 lines
read 49000000 lines
read 50000000 lines
read 5100

## Model / Analysis Experimentation (Local Mode)

The purpose of this section is to experiment with different modelling techniques.

We first run some local experiments on the 1% sample of the data to see which methods provide valuable insights for both customers (e.g. Amazon customers) and operations (e.g. Amazon).

We want to look at different types of insight: how customer reviews have changed over time, and whether the type of review and the category of product it relates to can be predicted.

Let's start by getting our data into a shape we can use for analysis and modelling.

### Prep Data for Modelling Purposes

We're going to develop some dataframes which represent our Xs and Ys (features and labels).

Let's create some feature/label datasets which are shaped around the following labels:

- year_product-category
- product-category_star_rating

The features for this model will use only the text of the reviews.
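
As a rough illustration of the two composite labels listed above, the sketch below joins fields of a single review record into label strings. The record keys `year`, `product_category` and `star_rating`, and the helper name `build_labels`, are assumptions made for this example and are not taken from the pipeline itself.

```python
def build_labels(record):
    """Build the two composite labels from one review record (hypothetical keys)."""
    # Replace spaces so each label stays a single whitespace-free token.
    category = str(record["product_category"]).replace(" ", "-")
    return {
        "year_product-category": "{}_{}".format(record["year"], category),
        "product-category_star_rating": "{}_{}".format(category, record["star_rating"]),
    }


example = {"year": 2015, "product_category": "Video Games", "star_rating": 5}
labels = build_labels(example)
print(labels)
```

Keeping each label free of whitespace matters later on, because the text format used by BlazingText separates tokens by spaces.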





### Word Embeddings Using BlazingText (Supervised)
BlazingText expects a single preprocessed text file with space-separated tokens. Each line of the file should contain a single sentence and the corresponding label(s), prefixed by `__label__`.

As we're now using the complete dataset, we'll need to use the augmented dataset structure and `Pipe` mode so that the data is streamed, rather than loaded into memory in one go.

Augmented Data Structure

```json
{"source": "string", "label": "string"}
{"source": "string", "label": "string"}
```

Note that each line is a single, self-contained JSON entry.
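
JSON requires double-quoted keys and strings, whereas `str()` of a Python dict produces single quotes, which is not valid JSON. The sketch below shows one way to serialise records into this one-entry-per-line structure with `json.dumps`. The helper name `to_json_line` and the sample records are hypothetical.

```python
import json


def to_json_line(source, label):
    """Serialise one record as a single JSON Lines entry."""
    # json.dumps emits double-quoted keys/values and escapes embedded
    # quotes and newlines, so each record stays on exactly one line.
    return json.dumps({"source": source, "label": label})


records = [
    ("great game, my kids love it", "Video_Games"),
    ('he said "wow"\nand bought two', "Toys"),
]
lines = [to_json_line(text, label) for text, label in records]
print("\n".join(lines))
```

Each element of `lines` can be written to the output file followed by a single newline character.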

In [78]:
def configure_estimator(configs, global_vars):
    
    region_name = configs['aws_region'] 
    sess = global_vars['sess']
    container = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "blazingtext", "latest")
    print('Using SageMaker BlazingText container: {} ({})'.format(container, region_name))

    bt_model = sagemaker.estimator.Estimator(container,
                                         global_vars['role'], 
                                         train_instance_count=1, 
                                         train_instance_type='ml.c5.18xlarge',
                                         train_volume_size = 150,
                                         train_max_run = 360000,
                                         input_mode= 'Pipe',
                                         output_path=configs['s3_w2v_output_location'],
                                         sagemaker_session=sess)
    
    bt_model.set_hyperparameters(mode="supervised",
                                 epochs=20,
                                 min_count=2,
                                 learning_rate=0.05,
                                 vector_dim=10,
                                 early_stopping=True,
                                 patience=4,
                                 min_epochs=10,
                                 word_ngrams=4)
    

    
   
    global_vars['bt_model'] = bt_model
    
    return global_vars

global_vars = configure_estimator(configs, global_vars)

Using SageMaker BlazingText container: 811284229777.dkr.ecr.us-east-1.amazonaws.com/blazingtext:latest (us-east-1)


In [82]:
def configure_data_channels(configs, global_vars):
    

    s3train_manifest = configs['s3_w2v_train_file'] 
    s3validation_manifest = configs['s3_w2v_test_file'] 
    
    attribute_names = ["source","label"]

    
    train_data = sagemaker.session.s3_input(s3train_manifest, 
                                            distribution='FullyReplicated', 
                                            content_type='application/jsonlines', 
                                            s3_data_type='AugmentedManifestFile',
                                            attribute_names=attribute_names,
                                            record_wrapping='RecordIO' 
                                           )
    
    validation_data = sagemaker.session.s3_input(s3validation_manifest, 
                                                 distribution='FullyReplicated', 
                                                 content_type='application/jsonlines', 
                                                 s3_data_type='AugmentedManifestFile',
                                                 attribute_names=attribute_names,
                                                 record_wrapping='RecordIO'
                                                )
    
    data_channels = {'train': train_data, 'validation': validation_data}
    
    global_vars['data_channels'] = data_channels

    return global_vars

global_vars = configure_data_channels(configs, global_vars)
                                        

In [83]:
def fit_model(configs, global_vars):
    
    bt_model = global_vars['bt_model']
    data_channels = global_vars['data_channels']
    bt_model.fit(inputs=data_channels, logs=True)
    
    
fit_model(configs, global_vars)

2020-04-29 06:58:01 Starting - Starting the training job...
2020-04-29 06:58:03 Starting - Launching requested ML instances......
2020-04-29 06:59:11 Starting - Preparing the instances for training......
2020-04-29 07:00:08 Downloading - Downloading input data......................................................................................................
2020-04-29 07:17:39 Training - Downloading the training image..Arguments: train
[04/29/2020 07:17:55 INFO 139910984623936] nvidia-smi took: 0.0252721309662 secs to identify 0 gpus
[04/29/2020 07:17:55 INFO 139910984623936] Running single machine CPU BlazingText training using supervised mode.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Customer Error: Invalid JSON string found. Please ensure data is in the expected JSON formats

2020-04-29 07:18:11 Uploading - Uploading generated training model
2020-04-29 07:18:11 Failed - Training job failed
[04/29/2020 

UnexpectedStatusException: Error for Training job blazingtext-2020-04-29-06-58-01-582: Failed. Reason: ClientError: Training did not complete successfully! Please check the logs for errors.
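
The job above fails with "Invalid JSON string found". A check based on `json.loads` alone accepts any valid JSON value, so one way to narrow the problem down is a stricter per-line check. The sketch below assumes that every record must be a JSON object with string-valued `source` and `label` attributes (the names passed as `attribute_names`); the exact rules BlazingText applies are not stated in this notebook, and `find_bad_lines` is a hypothetical helper.

```python
import json

REQUIRED = ("source", "label")


def find_bad_lines(lines, required=REQUIRED):
    """Return (line_number, reason) for every line that fails the check."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except ValueError as err:  # json.JSONDecodeError subclasses ValueError
            problems.append((number, "not valid JSON: {}".format(err)))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for key in required:
            if not isinstance(record.get(key), str):
                problems.append((number, "missing or non-string '{}'".format(key)))
    return problems


sample = [
    '{"source": "nice", "label": "Books"}',
    "{'source': 'nice', 'label': 'Books'}",  # single quotes: invalid JSON
    '{"source": "nice"}',  # label missing
    '{"source": "a", "label": "b"}{"source": "c", "label": "d"}',  # two records fused
]
for number, reason in find_bad_lines(sample):
    print(number, reason)
```

Because the function only iterates over its argument, an open file handle can be passed in place of the list, which avoids loading the whole file into memory.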

In [None]:
def host_model(global_vars):
    
    bt_model = global_vars['bt_model']
    text_classifier = bt_model.deploy(initial_instance_count = 1,instance_type = 'ml.m4.xlarge')
    global_vars['w2v_classifier'] = text_classifier
    
    return global_vars

global_vars = host_model(global_vars)

In [None]:
import csv  # csv.reader is used below


def evaluate_test_data_against_model(global_vars, test_file):
    
    train_data = []
    instances = []
    with open(test_file, 'r', newline= '\n') as csvinputfile:
        data = csv.reader(csvinputfile)
        for row in data:
#             print(row)
            label = row[0].split(' ')[0]
            text = row[0].partition(' ')[2].replace('  ',' ').strip()
            tmp = {'label':label, "text":text}
            train_data.append(tmp)
#             print(tmp)
#             break
            #same order as the csv rows
            instances.append(text)
            
    print('Total Instances {}. Total Train Data {}'.format(len(instances), len(train_data)))

#     print(instances[0], train_data[0]['label'])
    
    # we need to do some batch inferencing due to the size of the data:
    
    # each batch holds up to 10,000 sentences
    batch_size = 10000
    batches = len(instances) // batch_size
    
    print('Batches {}'.format(batches))
    
    predictions_batches = []
    
    for i in range(0, batches+1):
        lower = batch_size * i
        upper = batch_size * (i+1)
        if i == batches:
            upper = len(instances)
        print('Batch {} : {}'.format(lower,upper))
            
        instances_batch = instances[lower:upper]
        
        payload = {"instances":instances_batch,
                  "configuration": {"k": 1}}

        text_classifier =  global_vars['w2v_classifier']


        response = text_classifier.predict(json.dumps(payload))

        predictions = json.loads(response)
        predictions_batches.append(predictions)
        
#     print(json.dumps(predictions, indent=2))
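    # count predictions across all batches, not just the last one
    total_predictions = sum(len(batch) for batch in predictions_batches)
    print('Total Predictions {}'.format(total_predictions))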
        
    return predictions_batches, train_data
            
              
predictions_batches, train_data = evaluate_test_data_against_model(global_vars, test_file)
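
The batching loop above derives the number of batches by integer division and then special-cases the final batch. A possible alternative, sketched here with a hypothetical `iter_batches` helper, is to step through the list by `batch_size`, which avoids an empty trailing batch when the number of instances is an exact multiple of the batch size.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Stepping the start index by batch_size never yields an empty batch,
    # even when len(items) is an exact multiple of batch_size.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


sentences = ["review {}".format(i) for i in range(25)]
batches = list(iter_batches(sentences, 10))
print([len(batch) for batch in batches])
```

Each yielded batch could then be wrapped in the same `{"instances": ..., "configuration": {"k": 1}}` payload used above.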

In [None]:
def evaluate_model_predictions(predictions_batches, train_data):
    
    preds = []
    for batch in predictions_batches:
        for pred in batch:
            lab = pred['label'][0]
            prob = pred['prob'][0]
            tmp = {'pred_label':lab, 'pred_prob':prob}
            preds.append(tmp)
            
    print('Total Preds {}'.format(len(preds)))
    
    for i in range(0,len(train_data)):
        data = train_data[i]
        true_label = data['label']
        preds[i]['true_label'] = true_label
        
    print('Example Data: \n\t {}'.format(preds[1]))
    
    y_true = []
    y_pred = []
    for pred in preds:
        y_true.append(pred['true_label'].replace('__label__',''))
        y_pred.append(pred['pred_label'].replace('__label__',''))
        
    print(classification_report(y_true, y_pred))

        
evaluate_model_predictions(predictions_batches, train_data)

**Notes**: Using the Word2Vec word embedding approach, we're seeing similar results to the TF-IDF/SVC implementation for predicting product category. However, computing the SVC took nearly 100 times longer than the Word2Vec approach, and this is only for a sample of 1% of the total data.

## Bidirectional Encoder Representations from Transformers (BERT)

https://huggingface.co/transformers/model_doc/bert.html#overview

In [11]:
!pip install --upgrade pip
!pip install tensorflow

Collecting pip
  Using cached https://files.pythonhosted.org/packages/54/0c/d01aa759fdc501a58f431eb594a17495f15b88da142ce14b5845662c13f3/pip-20.0.2-py2.py3-none-any.whl
Installing collected packages: pip
  Found existing installation: pip 10.0.1
    Uninstalling pip-10.0.1:
      Successfully uninstalled pip-10.0.1
Successfully installed pip-20.0.2
Collecting tensorflow
  Downloading tensorflow-2.1.0-cp36-cp36m-manylinux2010_x86_64.whl (421.8 MB)
Processing /home/ec2-user/.cache/pip/wheels/7c/06/54/bc84598ba1daf8f970247f550b175aaaee85f68b4b0c5ab2c6/termcolor-1.1.0-cp36-none-any.whl
Collecting six>=1.12.0
  Downloading six-1.14.0-py2.py3-none-any.whl (10 kB)
Collecting tensorboard<2.2.0,>=2.1.0
  Downloading tensorboard-2.1.1-py3-none-any.whl (3.8 MB)
Collecting gast==0.2.2
  Downloading gast-0.2.2.tar.gz (10 kB)
Processing /home/ec2-user/.cache/pip/wheels/b1/c2/ed/d62208260edbd3fa7156545c00ef966f45f2063d0a84f8208a/wrapt-1.12.1-cp36-cp36m-linux_x86_64.whl
Collecting google-pasta>=0.1.6
  Using cached google_pasta-0.2.0-py3-none-any.whl (57 kB)
Collecting grpcio>=1.8.6
  Downloading grpcio-1.28.1-cp36-cp36m-manylinux2010_x86_64.whl (2.8 MB)
Collecting tensorflow-estimator<2.2.0,>=2.1.0rc0
  Downloading tensorflow_estimato

In [8]:
# datetime and train_test_split are used by later cells
import datetime
from collections import namedtuple
from functools import partial
from multiprocessing import Pool
import multiprocessing as mp
from typing import List, Tuple

import tensorflow as tf
import torch.multiprocessing as torchmp
from sklearn.model_selection import train_test_split
from transformers import (
    BertConfig,
    BertModel,
    BertTokenizer,
    DistilBertModel,
    DistilBertTokenizer,
    TFBertForSequenceClassification,
)



In [9]:
def set_bert_configs(configs, batch_size, epochs):
    configs['bert_model_type'] = 'bert-base-uncased'
    # use the arguments rather than hard-coded values
    configs['bert_epochs'] = epochs
    configs['bert_batch_size'] = batch_size
    configs['model_path'] = 'model/bert_tf_model_uncased_amazon_reviews_fine_tuned'
    return configs

configs = set_bert_configs(configs, batch_size = 1024, epochs = 3)

### View Training Of Model

In [79]:
def describe_model(configs):
    
    # look up the model name without overwriting configs
    TO_FINETUNE = configs['bert_model_type']
    config = BertConfig.from_pretrained(TO_FINETUNE)
    model = TFBertForSequenceClassification.from_pretrained(TO_FINETUNE, config=config) 
    model.summary()

describe_model(configs)
    

Model: "tf_bert_for_sequence_classification_12"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_493 (Dropout)        multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  23070     
Total params: 109,505,310
Trainable params: 109,505,310
Non-trainable params: 0
_________________________________________________________________


In [10]:
def create_label_mapping(df):
    
    num_labels = df['product_category'].nunique()
    labels_unique = df['product_category'].unique()
    print('Number of Classes {}'.format(num_labels))
    
    labels = df['product_category'].tolist()
    #convert to index for reversals
    labels_unique_name_idx = { labels_unique[i] :  i for i in range(0, len(labels_unique) ) }
    labels_unique_idx_name = { i : labels_unique[i] for i in range(0, len(labels_unique) ) }
#     print(labels_unique_name_idx)
    
    labels_idx_col = []
    for lab in labels:
        if lab in labels_unique_name_idx:
            labels_idx_col.append(labels_unique_name_idx[lab])
    
    df['label'] = labels_idx_col
    
#     display(df)

    return df, labels_unique_idx_name, num_labels

sampled_data, labels_mapping, num_labels = create_label_mapping(sampled_data)

Number of Classes 30


In [11]:
def create_text_column(df,text_col):
    
    col_to_add = 'processed_text'
    tmp = df[text_col]
    xs = []
    for entry in tmp:
        res = str(entry).strip('][').split(', ') 
        res = ' '.join(res)
        xs.append(res)
        
    df[col_to_add] = xs
#     display(df)
    return df

sampled_data = create_text_column(sampled_data, 'review_body_processed')
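
If `review_body_processed` holds stringified Python lists such as `"['great', 'product']"` (an assumption, since the column's exact format is not shown here), stripping the brackets and splitting on `', '` leaves the quote characters attached to each token. The sketch below parses the entry with `ast.literal_eval` instead; `tokens_to_text` is a hypothetical helper, not part of the notebook.

```python
import ast


def tokens_to_text(entry):
    """Join a token list, or its string representation, into one string."""
    if isinstance(entry, (list, tuple)):
        tokens = entry
    else:
        try:
            tokens = ast.literal_eval(str(entry))
        except (ValueError, SyntaxError):
            # Not a Python literal: treat the entry as plain text.
            return str(entry).strip()
        if not isinstance(tokens, (list, tuple)):
            return str(entry).strip()
    return " ".join(str(token) for token in tokens)


print(tokens_to_text("['great', 'product']"))
print(tokens_to_text(["fast", "delivery"]))
print(tokens_to_text("plain text review"))
```

Whether this matters depends on how the tokenised reviews were stored upstream, so it is worth inspecting a few rows of `processed_text` before training.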

In [12]:
def create_train_test_data(df):
    
    xs = []
    ys = []
    
    for row in df.itertuples(index=False):
        xs.append(row.processed_text)
        ys.append(row.label)

    X_train, X_test, y_train, y_test = train_test_split(
        xs, ys, test_size=0.20, random_state = 0)
    
    print("Train Dataframe Length {},  Test Dataframe Length {}".format(len(X_train), len(X_test)))
    
    train_df = pd.DataFrame(list(zip(X_train, y_train)), 
               columns =['processed_text', 'label']) 
 
    test_df = pd.DataFrame(list(zip(X_test, y_test)), 
               columns =['processed_text', 'label']) 
    
    return train_df, test_df 
    
    
    
train_df, test_df = create_train_test_data(sampled_data)

Train Dataframe Length 1150147,  Test Dataframe Length 287537


In [18]:
def build_tf_tuples(df):
    
    InputExample = namedtuple('InputExample', ['text', 'category_index'])
    
    data = []
    for row in df.itertuples(index=False):
        data.append(InputExample(text=row.processed_text, category_index=row.label))
        
    return data

# tf_tuples =  build_tf_tuples(train_df)

In [19]:
def convert_examples_to_tf_dataset(examples: List[Tuple[str, int]],tokenizer, max_length=512,):
    """
    Loads data into a tf.data.Dataset for finetuning a given model.
    Args:
        examples: List of tuples representing the examples to be fed
        tokenizer: Instance of a tokenizer that will tokenize the examples
        max_length: Maximum sequence length in tokens; longer inputs are truncated
    Returns:
        a ``tf.data.Dataset`` containing the condensed features of the provided sentences
    """
    features = [] # -> will hold InputFeatures to be converted later
    InputFeatures = namedtuple('InputFeatures', ['input_ids', 'attention_mask', 'token_type_ids', 'label'])

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            pad_to_max_length=True, # pads to the right by default
        )

        # input ids = token indices in the tokenizer's internal dict
        # token_type_ids = binary mask identifying different sequences in the model
        # attention_mask = binary mask indicating the positions of padded tokens so the model does not attend to them

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.category_index
            )
        )
            
    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )



In [20]:
def generate_tensors(configs, tuples, num_labels):
    
    BATCH_SIZE = 16
    num_examples = len(tuples)
    TO_FINETUNE = configs['bert_model_type']
    config = BertConfig.from_pretrained(TO_FINETUNE, num_labels=num_labels)
    tokenizer = BertTokenizer.from_pretrained(TO_FINETUNE)
    
    # Make the CPU do all data pre-processing steps, not the GPU
    with tf.device('/cpu:0'):
        train_data = convert_examples_to_tf_dataset(tuples, tokenizer)
        
        train_data = train_data.shuffle(buffer_size=num_examples, reshuffle_each_iteration=True) \
                               .batch(BATCH_SIZE) \
                               .repeat(-1)
        
    return train_data

# tf_train_data = generate_tensors(configs, tf_tuples, num_labels)

In [None]:
def train_model_with_new_layer_classes(configs, global_vars, train_data, tuples, num_labels):
    
    EPOCHS = configs['bert_epochs']
    BATCH_SIZE = configs['bert_batch_size']
    TO_FINETUNE = configs['bert_model_type']
    num_examples = len(tuples)

    config = BertConfig.from_pretrained(TO_FINETUNE, num_labels=num_labels)
    model = TFBertForSequenceClassification.from_pretrained(TO_FINETUNE, config=config)
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # track accuracy; SparseCategoricalCrossentropy would report the loss under the name 'accuracy'
    metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
    
    log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
    
    
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=[metric])

    train_steps = num_examples // BATCH_SIZE

    model.fit(train_data, 
              epochs=EPOCHS,
              steps_per_epoch=train_steps,
#               verbose=2,
              callbacks=[tensorboard_callback]
             )
    
    global_vars['bert_tf_model'] = model
    global_vars['tensorboard_callback'] = tensorboard_callback
    configs['log_dir'] = log_dir
    
    print('Finished fitting data to pre-trained model')
    
    return global_vars, configs

global_vars, configs = train_model_with_new_layer_classes(configs, global_vars, tf_train_data,tf_tuples,num_labels)

Train for 1403 steps
Epoch 1/3
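
Because the training dataset is repeated indefinitely with `.repeat(-1)`, Keras relies on `steps_per_epoch` to know where an epoch ends, so the batch size used to batch the dataset should match the one used to compute `train_steps`. The following is a minimal sketch of that arithmetic, not part of the training code; the helper names and the example count of 22,448 are assumptions chosen only to reproduce the 1403 steps shown in the log above with a batch size of 16.

```python
def steps_per_epoch(num_examples, batch_size):
    """Number of full batches in one pass over the data (floor division)."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    return num_examples // batch_size


def examples_seen_per_epoch(num_examples, steps_batch_size, dataset_batch_size):
    """Examples consumed per epoch when steps and batching use different sizes."""
    return steps_per_epoch(num_examples, steps_batch_size) * dataset_batch_size


# Hypothetical example count (the real count is not shown in this section).
num_examples = 22448

# Same batch size in both places: every example is seen once per epoch.
matched = examples_seen_per_epoch(num_examples, 16, 16)

# Steps computed with 32 but dataset batched with 16: only half the data is seen.
mismatched = examples_seen_per_epoch(num_examples, 32, 16)

print(steps_per_epoch(num_examples, 16), matched, mismatched)
```

If the two batch sizes drift apart, an "epoch" silently covers more or less than one pass over the data, which makes per-epoch metrics hard to compare.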

### Save Model Weights

In [163]:
def save_model_weights(model, path):
    
    model.save_weights(path)
    
# os.mkdir('model/')
save_model_weights(global_vars['bert_tf_model'], 'model/bert_tf_model_uncased_amazon_reviews_fine_tuned.ckpt')

In [177]:
def save_model(model, path):
    
    model.save_pretrained(save_directory=path)
    
    
# model = global_vars['bert_tf_model']

# Create the target directory (no error if it already exists)
os.makedirs(configs['model_path'], exist_ok=True)
save_model(global_vars['bert_tf_model'], configs['model_path'])


### Save Model File

In [130]:
import tarfile

def tar_model(model_path):
    print('Compressing model to Tar File')
    with tarfile.open('model/model.tar.gz', mode='w:gz') as archive:
        archive.add(model_path, recursive=True)

tar_model(configs['model_path'] )

Compressing model to Tar File
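
Before uploading the archive it can be worth checking which paths were actually stored in it, since `archive.add` keeps the directory path it was given unless an `arcname` is supplied. The sketch below is illustrative only: it builds a throwaway directory standing in for `configs['model_path']`, with placeholder file names, and lists the members of the resulting archive. The helper `list_tar_members` is a hypothetical name, not part of the notebook.

```python
import os
import tarfile
import tempfile


def list_tar_members(tar_path):
    """Return the sorted member names stored in a gzip-compressed tar archive."""
    with tarfile.open(tar_path, mode='r:gz') as archive:
        return sorted(archive.getnames())


# Throwaway directory standing in for configs['model_path'] (assumption).
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, 'bert_model')
    os.mkdir(model_dir)

    # Placeholder files; the real directory holds whatever save_pretrained wrote.
    for name in ('config.json', 'tf_model.h5'):
        with open(os.path.join(model_dir, name), 'w') as fh:
            fh.write('placeholder')

    tar_path = os.path.join(tmp, 'model.tar.gz')
    with tarfile.open(tar_path, mode='w:gz') as archive:
        # arcname controls the path stored inside the archive.
        archive.add(model_dir, arcname='bert_model', recursive=True)

    members = list_tar_members(tar_path)

print(members)
```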


### Load Model

Only run this cell if you're loading the previously fine-tuned model.

In [24]:
def load_model(global_vars, model_path):
    
    model = TFBertForSequenceClassification.from_pretrained(model_path)
    
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-05, epsilon=1e-08)
    loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    # Track accuracy (not cross-entropy) under the 'accuracy' metric name
    metric = tf.keras.metrics.SparseCategoricalAccuracy(name='accuracy')
    model.compile(optimizer=optimizer,
                  loss=loss,
                  metrics=[metric])
    
    model.summary()

    global_vars['bert_tf_model'] = model
    return global_vars
    
global_vars =  load_model(global_vars, configs['model_path'] )

Model: "tf_bert_for_sequence_classification_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_151 (Dropout)        multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  23070     
Total params: 109,505,310
Trainable params: 109,505,310
Non-trainable params: 0
_________________________________________________________________


### Evaluate Model

In [36]:
def evaluate_model(configs, model, df, labels_mapping):
    
    TO_FINETUNE = configs['bert_model_type']
    config = BertConfig.from_pretrained(TO_FINETUNE)
    tokenizer = BertTokenizer.from_pretrained(TO_FINETUNE)
    TEST_STEPS = 100
    y_preds = []
    y = []
    inputs = []
    num_labels = len(labels_mapping)
    tf_test_tuples =  build_tf_tuples(df)
    tf_test_data = generate_tensors(configs, tf_test_tuples, num_labels)

    callbacks = []

    log_dir = "logs/eval/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
    callbacks.append(tensorboard_callback)

    test_history = model.evaluate(tf_test_data,
                                  steps=TEST_STEPS,
                                  callbacks=callbacks,
                                  verbose=3)

    return test_history


test_history = evaluate_model(configs, global_vars['bert_tf_model'], test_df, labels_mapping )
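
Note that `evaluate_model` caps evaluation at `TEST_STEPS = 100` batches, so the reported loss and metric describe a shuffled subset rather than the whole test set. As a rough sketch of how much data that covers, assuming the batch size of 16 set in `generate_tensors` and the 287,537-row test set reported in the inference log later in this notebook (the helper name is hypothetical):

```python
def evaluation_coverage(num_rows, steps, batch_size):
    """Return (examples evaluated, fraction of the dataset) for a step-limited evaluate()."""
    # The dataset repeats, so cap at num_rows to keep the fraction at or below 1.
    evaluated = min(steps * batch_size, num_rows)
    return evaluated, evaluated / num_rows


evaluated, fraction = evaluation_coverage(num_rows=287537, steps=100, batch_size=16)
print('{} examples, {:.2%} of the test set'.format(evaluated, fraction))
```

Under these assumptions only 1,600 examples (roughly 0.56% of the test set) contribute to `test_history`, so the figure is best read as a quick sanity check.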

In [64]:
# ## Attempt at parallelizing TF; need to use Ray to do this...

# def encode_predict(x):
#     data =  tokenizer.encode_plus(x, return_tensors="tf", max_length=512)
# #     print(data)
#     outputs = model(data)
#     classification_scores = outputs[0]
#     lab = labels_mapping[np.argmax(classification_scores)]
#     return lab

# def predict_process(df, q, model, tokenizer):
#     print('Running Process')    
# #     TO_FINETUNE = configs['bert_model_type']
# #     config = BertConfig.from_pretrained(TO_FINETUNE)
# #     tokenizer = BertTokenizer.from_pretrained(TO_FINETUNE)
#     y, y_preds = [], []
#     cnt = 0
#     print('Rows To compute {}'.format(df.shape[0]))
# #     df['lab_pred']= df['processed_text'].apply(encode_predict)
        
#     for idx,row in df.iterrows():
#         test_data = tokenizer.encode_plus(row['processed_text'], return_tensors="tf", max_length=512)
#         print(idx)
#         outputs = model(test_data)
#         classification_scores = outputs[0]
#         df['pred_label'] = labels_mapping[np.argmax(classification_scores)]

#         if cnt % 1 == 0:
#             print('Step  {} of {}'.format(cnt, sample_size))
#         cnt += 1
    
#     q.put(df)


# def evaluate_data_for_confusion_matrix_parallel(configs, model, df, labels_mapping):

#     TO_FINETUNE = configs['bert_model_type']
#     tokenizer = BertTokenizer.from_pretrained(TO_FINETUNE)
    
    
#     df = df.sample(100)
#     print('Starting Multicore inferencing Dataset Size {}'.format(df.shape[0]))
#     cores = mp.cpu_count()
#     df_split = np.array_split(df, 1)
# #     print(len(df_split))
#     jobs = []
    
#     q = mp.Queue()
#     processes = []
#     rets = []
#     for i in range(0,1):
#         print(i)
#         p = mp.Process(target=predict_process, args=(df_split[i], q, model, tokenizer))
#         processes.append(p)
#         p.start()
    
#     for p in processes:
#         ret = q.get() # will block
#         rets.append(ret)
        
#     for p in processes:
#         p.join()
        
# #     pool = Pool(cores)
# #     pool_results = pool.map(predict_process, df_split)
# #     pool.close()
# #     pool.join()
    
#     parts = pd.concat(rets, axis=0)
#     print('Finished Multicore inferencing Dataset Size {}'.format(parts.shape[0]))

#     return parts


# #run this with the test data
# test_pred_df = evaluate_data_for_confusion_matrix(configs, 
#                                                   global_vars['bert_tf_model'], 
#                                                   test_df, 

#                                                   labels_mapping )

In [None]:
from transformers import TextClassificationPipeline

def evaluate_data_for_confusion_matrix(configs, model, df, labels_mapping):
    
    
    
    TO_FINETUNE = configs['bert_model_type']
    config = BertConfig.from_pretrained(TO_FINETUNE)
    tokenizer = BertTokenizer.from_pretrained(TO_FINETUNE)
    
    inference_pipeline = TextClassificationPipeline(model=model, 
                                                tokenizer=tokenizer,
                                                framework='tf',
                                                device=-1) # -1 is CPU, >= 0 is GPU
    
    y, y_preds = [], []
    cnt = 0
    for idx,row in df.iterrows():
#         print(idx)
#         test_data = tokenizer.encode_plus(row['processed_text'], return_tensors="tf", max_length=512)
#         outputs = model(test_data)
#         print(inference_pipeline(row['processed_text']))
        preds = inference_pipeline(row['processed_text'])
        pred_lab = int(preds[0]['label'].replace('LABEL_',''))
#         classification_scores = 
        y_preds.append(labels_mapping[pred_lab])
        y.append(labels_mapping[row['label']])
#         print(y_preds)
#         print(y)
        cnt += 1
        if cnt % 5 == 0:
            
            total_predictions = len(y)
            correct_predictions = 0
            for i in range(0,len(y_preds)):
                if y_preds[i] == y[i]:
                    correct_predictions += 1
            classification_accuracy = correct_predictions / total_predictions * 100.0
            print('Step  {} of {}'.format(cnt, df.shape[0]))
            print('Eval Classification Accuracy: {}'.format(classification_accuracy))
                
            
    return y, y_preds

# Run this with the test data; the function returns two parallel lists
# (true labels, predicted labels), not a DataFrame.
y_true, y_pred = evaluate_data_for_confusion_matrix(configs,
                                                    global_vars['bert_tf_model'],
                                                    test_df,
                                                    labels_mapping)

Step  5 of 287537
Eval Classification Accuracy: 60.0
Step  10 of 287537
Eval Classification Accuracy: 60.0
Step  15 of 287537
Eval Classification Accuracy: 46.666666666666664
Step  20 of 287537
Eval Classification Accuracy: 50.0
Step  25 of 287537
Eval Classification Accuracy: 52.0
Step  30 of 287537
Eval Classification Accuracy: 53.333333333333336
Step  35 of 287537
Eval Classification Accuracy: 54.285714285714285
Step  40 of 287537
Eval Classification Accuracy: 55.00000000000001
Step  45 of 287537
Eval Classification Accuracy: 55.55555555555556
Step  50 of 287537
Eval Classification Accuracy: 54.0
Step  55 of 287537
Eval Classification Accuracy: 58.18181818181818
Step  60 of 287537
Eval Classification Accuracy: 60.0
Step  65 of 287537
Eval Classification Accuracy: 61.53846153846154
Step  70 of 287537
Eval Classification Accuracy: 58.57142857142858
Step  75 of 287537
Eval Classification Accuracy: 56.00000000000001
Step  80 of 287537
Eval Classification Accuracy: 56.25
Step  85 of 2875
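
`evaluate_data_for_confusion_matrix` returns two parallel lists (true labels and predicted labels). A confusion matrix is then just a tally of `(true, predicted)` pairs. The sketch below uses only the standard library so that it is self-contained; `sklearn.metrics` (already imported at the top of the notebook) offers equivalent tooling. The helper names and the example labels are hypothetical and are not results from this notebook.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count (true_label, predicted_label) pairs from two parallel lists."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Hypothetical labels, for illustration only.
y = ['Books', 'Books', 'Toys', 'Music', 'Toys']
y_preds = ['Books', 'Toys', 'Toys', 'Music', 'Books']

counts = confusion_counts(y, y_preds)
acc = accuracy(y, y_preds)

for (true_label, pred_label), n in sorted(counts.items()):
    print('{:>6} -> {:<6} {}'.format(true_label, pred_label, n))
print('Accuracy: {}'.format(acc))
```

Off-diagonal pairs such as `('Books', 'Toys')` show which categories the model confuses, which a single accuracy figure hides.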

### Upload BERT Model to S3

In [142]:
def upload_bert_to_s3(configs, global_vars, model_tar):

    models_dir = configs['models_dir']
    prefix = 'bert'
    model_s3_filename = 'model.tar.gz'

    # Bucket handle stored earlier in the notebook
    s3_bucket = global_vars['s3_bucket']

    # Object key inside the bucket, e.g. <models_dir>/bert/model.tar.gz
    model_file_s3 = '{}/{}/{}'.format(models_dir, prefix, model_s3_filename)
    s3_bucket.upload_file(model_tar, model_file_s3)

    # Full S3 URI, built from the same key that was uploaded
    s3_bert_model = 's3://{}/{}'.format(configs['bucket_name'], model_file_s3)
    configs['s3_bert_model'] = s3_bert_model

    print('S3 BERT Data Path {}'.format(s3_bert_model))

    return configs

model_tar = 'model/model.tar.gz'
configs = upload_bert_to_s3(configs, global_vars, model_tar)     

S3 BERT Data Path s3://demos-amazon-reviews/models/bert/model.tar.gz
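
The upload step needs two strings that must stay in sync: the object key passed to `upload_file` and the `s3://` URI stored in `configs`. One way to keep them consistent is to derive both from the same parts. This is a minimal sketch with a hypothetical helper name; the bucket and directory values are taken from the output above.

```python
def s3_model_location(bucket_name, models_dir, prefix, filename='model.tar.gz'):
    """Return (object key, s3:// URI) built from the same path components."""
    # Strip stray slashes so 'models/' and 'models' give the same key.
    key = '/'.join(part.strip('/') for part in (models_dir, prefix, filename))
    uri = 's3://{}/{}'.format(bucket_name, key)
    return key, uri


key, uri = s3_model_location('demos-amazon-reviews', 'models', 'bert')
print(key)
print(uri)
```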


## Findings from TF-IDF, Word2Vec, BERT

In this notebook we explored a sample of the dataset and then applied a variety of text analysis methods to understand how natural language processing techniques can be used to classify and predict the product category of an Amazon review.

What we've learnt is that, as we increase the complexity of the method (TF-IDF, Word Embeddings via BlazingText, and BERT, with BERT being the most complex), training and inferencing become slower; however, we do see an increase in classification score as a result.

With our 1% sample, which has been taken from a fair distribution of the total 140 million reviews, multi-class classification performs as follows:

- TF-IDF - 65%
- Word Embeddings - 75%
- BERT - 82%

Let's consider the computational time taken to train and then perform inferencing (local model):

- TF-IDF - Training: 10 minutes. Inferencing: under 1 ms (approximately)
- Word Embeddings - Trained using SageMaker distributed mode, so not directly comparable
- BERT - Training: 18 hours. Inferencing: ~15 seconds (per sample)
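
To put the BERT inference figure in context, the sketch below estimates how long it would take to score the 287,537-row test set one sample at a time. It assumes a constant ~15 seconds per sample and strictly sequential processing, both simplifications; the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sequential_inference_days(num_samples, seconds_per_sample):
    """Wall-clock days needed to score samples one after another."""
    return num_samples * seconds_per_sample / SECONDS_PER_DAY


# 287,537 test rows at ~15 s each (figures taken from this notebook).
bert_days = sequential_inference_days(287537, 15)
print('BERT, sequential: {:.1f} days'.format(bert_days))
```

Under these assumptions a full pass over the test set would take roughly 50 days, which is why batching or parallel inference matters before BERT could be applied to the larger corpus.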

Conclusion:

- BERT represents the state-of-the-art technique for performing NLP tasks such as sequence classification; however, due to the complexity of the model, training and inferencing are costly with respect to time and computational resources. Whilst this was performed in local mode, and we should be able to reduce the training time by scaling up, the per-sample inferencing time will be similar to that of local mode, because inferencing on a single sample is not a distributed task.
- TF-IDF provided adequate results; however, due to the two-part approach to TF-IDF (first unsupervised training to compute term frequency and document frequency scores, then using these scores to train a classifier), the amount of heavy lifting is substantial. A significant amount of data preparation and tuning was required in order to achieve adequate results. Also, TF-IDF cannot capture the semantics of the words, which makes it difficult to use the outputs for other tasks. Furthermore, as we had to train a classifier on the back of the TF-IDF scores, it would be significantly slower (and require extensive computational resources) if we were to move to the full dataset.
- Word Embeddings: Word Embeddings provided the best of both worlds. They allow the semantics of the tokens to persist within the training cycles (we can set the size of the n-grams we wish to keep), and they provide acceptable results (> 75%) for classifying reviews based on their product_category.

Going forward, we'll take the Word Embeddings Model and apply it to the larger corpus of reviews.


