# Data Preparation Notebook

In this notebook, you will execute code to:

1. Download raw data from a DynamoDB table.
2. Review the data that will be used to create a machine learning (ML) model.
3. Split the data into training and testing sets.
4. Perform negative sampling.
5. Calculate statistics needed to train the Neural Collaborative Filtering (NCF) model.
6. Upload the training and testing data to an S3 bucket.

## 1. Retrieve the raw data from the DynamoDB ratings table

In [None]:
#################
## Code Cell 1 ##
#################
import boto3
import botocore
from boto3.dynamodb.conditions import Attr

def do_table_scan():
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('ratings')
    response = table.scan()
    result = response['Items']
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        result.extend(response['Items'])

    return result

response = do_table_scan()
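A single `scan` call returns at most one page of results, which is why the function above keeps calling `scan` until `LastEvaluatedKey` disappears from the response. The sketch below isolates that pagination loop so it can be tried without AWS access. `scan_all` and `FakeTable` are hypothetical names introduced only for this illustration; `FakeTable` is an in-memory stand-in, not part of boto3.

```python
def scan_all(table, **scan_kwargs):
    """Collect every item from a paginated scan.

    `table` is any object with a boto3-style `scan(**kwargs)` method.
    """
    items = []
    while True:
        page = table.scan(**scan_kwargs)
        items.extend(page.get('Items', []))
        last_key = page.get('LastEvaluatedKey')
        if last_key is None:
            # No more pages to read.
            return items
        # Resume the scan where the previous page stopped.
        scan_kwargs['ExclusiveStartKey'] = last_key


class FakeTable:
    """Hypothetical in-memory stand-in for a DynamoDB table (demo only)."""

    def __init__(self, rows, page_size):
        self.rows = rows
        self.page_size = page_size

    def scan(self, ExclusiveStartKey=None, **_):
        start = 0 if ExclusiveStartKey is None else ExclusiveStartKey['pos']
        end = start + self.page_size
        page = {'Items': self.rows[start:end]}
        if end < len(self.rows):
            page['LastEvaluatedKey'] = {'pos': end}
        return page


rows = [{'CustomerId': i, 'ProductId': i * 10} for i in range(7)]
scanned = scan_all(FakeTable(rows, page_size=3))
print(len(scanned))
```

With a real table, the same helper could be called as `scan_all(dynamodb.Table('ratings'))`.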

For this model, we use raw data that has been adapted from a publicly available source for the purposes of this lab. To learn more about the origin of this data set, right-click README.md and choose Open with Markdown preview.
The raw data contains the following 5 columns:
- CustomerId
- ProductId
- rating
- ProductName
- timestamp
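One detail worth checking before building the DataFrame: the boto3 resource interface returns DynamoDB Number attributes as `decimal.Decimal` objects. Whether the ratings table stores ids, ratings, and timestamps as numbers is an assumption here, and the sample item below is made up for illustration. If it does, a minimal sketch of converting the items to native Python numbers could look like this (`to_native` and `normalize_items` are hypothetical helper names):

```python
from decimal import Decimal


def to_native(value):
    """Convert a Decimal to int or float; leave other values unchanged."""
    if isinstance(value, Decimal):
        if value == value.to_integral_value():
            return int(value)
        return float(value)
    return value


def normalize_items(items):
    """Apply to_native to every attribute of every scanned item."""
    return [{key: to_native(val) for key, val in item.items()} for item in items]


# Made-up item with the same five columns as the raw data.
sample_items = [
    {
        'CustomerId': Decimal('1'),
        'ProductId': Decimal('42'),
        'rating': Decimal('4.5'),
        'ProductName': 'example product',
        'timestamp': Decimal('1577836800'),
    },
]
clean_items = normalize_items(sample_items)
print(clean_items[0])
```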

## 2. Read the data into a pandas DataFrame, then split it into training and testing sets

In [None]:
#################
## Code Cell 2 ##
#################

# required libraries
import os
import boto3
import sagemaker
import numpy as np
import pandas as pd
import pickle

# Read the response object into a Pandas DataFrame
df = pd.DataFrame(response)

# let's see what the data looks like:
display(df)

# ProductName is not needed for training, so save the ProductId-to-ProductName
# mapping locally and join it back onto the prediction results later.
# DataFrame.to_pickle serializes the frame to a byte stream on disk.
df_products = df[['ProductId','ProductName']]
df_unique_products = df_products.drop_duplicates(subset=['ProductId'])
df_unique_products.to_pickle('products.pkl')

# Find the smallest number of distinct products rated by any single customer.
# This is the upper bound on how many ratings we can hold out per customer.
df.groupby('CustomerId').ProductId.nunique().min()

#### Note: Since the "least active" customer has 20 ratings, we hold out 10 ratings per customer (50% of that minimum) for the testing set.
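The hold-out size can also be derived in code instead of being read off by eye. The following is a minimal sketch on a toy DataFrame; the data and the names `toy`, `test_fraction`, and `sampling_num` are illustrative assumptions, not values from the lab data set.

```python
import pandas as pd

# Toy ratings: customer 1 rated 4 products, customer 2 rated 6.
toy = pd.DataFrame({
    'CustomerId': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
    'ProductId': [10, 11, 12, 13, 10, 11, 12, 13, 14, 15],
})

# Smallest number of distinct products rated by any customer.
min_ratings = toy.groupby('CustomerId').ProductId.nunique().min()

# Hold out a fixed fraction of that minimum for every customer.
test_fraction = 0.5
sampling_num = int(min_ratings * test_fraction)
print(min_ratings, sampling_num)
```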

In [None]:
#################
## Code Cell 3 ##
#################

# The code below defines a function, used in the next code cell, that splits the
# DataFrame into training and testing data sets.

def train_test_split(df, sampling_num):
    """ perform training/testing split
    
    @param df: dataframe
    @param sampling_num: number of ratings to sample for each customer
    
    @return df_train: training data
    @return df_test: testing data
    
    """
    # sort each customer's ratings from newest to oldest
    df = df.sort_values(['CustomerId', 'timestamp'], ascending=[True, False])
    
    # perform deep copy on the dataframe to avoid modification on the original dataframe
    df_train = df.copy(deep=True)
    df_test = df.copy(deep=True)
    
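    # get test set: the sampling_num most recent ratings of every customer
    # (drop=True avoids carrying the old index along as an extra column)
    df_test = df_test.groupby(['CustomerId']).head(sampling_num).reset_index(drop=True)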
    
    # get train set
    df_train = df_train.merge(
        df_test[['CustomerId', 'ProductId']].assign(remove=1),
        how='left'
    ).query('remove != 1').drop(columns='remove').reset_index(drop=True)
    
    # sanity check to make sure we're not duplicating/losing data
    assert len(df) == len(df_train) + len(df_test)
    
    return df_train, df_test

In [None]:
#################
## Code Cell 4 ##
#################

# This code snippet calls the above function and passes two parameters: the dataframe and the sampling size.

df_train, df_test = train_test_split(df, 10)
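To see what this kind of split produces, the sketch below applies the same idea (hold out each customer's most recent ratings) to a toy DataFrame. It uses `groupby(...).cumcount()` to rank ratings by recency rather than the merge-based approach in Code Cell 3; `split_latest` and the toy data are hypothetical and exist only for this illustration.

```python
import pandas as pd


def split_latest(df, n_test):
    """Hold out each customer's n_test most recent ratings."""
    ordered = df.sort_values(
        ['CustomerId', 'timestamp'], ascending=[True, False]
    ).reset_index(drop=True)
    # Position of each rating within its customer, newest first (0, 1, 2, ...).
    recency_rank = ordered.groupby('CustomerId').cumcount()
    is_test = recency_rank < n_test
    train = ordered[~is_test].reset_index(drop=True)
    test = ordered[is_test].reset_index(drop=True)
    return train, test


toy = pd.DataFrame({
    'CustomerId': [1, 1, 1, 2, 2, 2],
    'ProductId': [10, 11, 12, 10, 13, 14],
    'timestamp': [100, 300, 200, 50, 60, 70],
})
toy_train, toy_test = split_latest(toy, n_test=1)
print(toy_test[['CustomerId', 'ProductId']].values.tolist())
```

Each customer's newest rating lands in the testing set, and every other rating stays in the training set, so no row is duplicated or lost.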

## 3. Perform negative sampling

We treat every rating in the dataset as a positive label. The dataset therefore contains no negative examples, and a model cannot be trained on positive labels alone. To provide negative examples, for each rating we randomly sample `n` products from that customer's list of unseen products.
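As a point of comparison, here is a minimal pure-Python sketch of negative sampling on toy data. Note that it is a variant, not the notebook's method: the function in the next cell draws `n_neg` negatives per positive pair using a retry loop, whereas this sketch draws up to `n_neg` distinct unseen products per customer from a set difference, which cannot loop forever when a customer has rated every product. The name `sample_negatives`, the `seed` parameter, and the toy data are assumptions made for this illustration.

```python
import random


def sample_negatives(positives, all_items, n_neg, seed=0):
    """Draw up to n_neg unseen products per customer, labelled 0."""
    rng = random.Random(seed)

    # Group the products each customer has already rated.
    seen = {}
    for customer, item in positives:
        seen.setdefault(customer, set()).add(item)

    negatives = []
    for customer, rated in seen.items():
        # Products this customer has never rated.
        unseen = sorted(set(all_items) - rated)
        k = min(n_neg, len(unseen))
        for item in rng.sample(unseen, k):
            negatives.append((customer, item, 0))
    return negatives


toy_positives = [(1, 10), (1, 11), (2, 10)]
toy_items = [10, 11, 12, 13]
toy_negatives = sample_negatives(toy_positives, toy_items, n_neg=2)
print(len(toy_negatives))
```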

In [None]:
#################
## Code Cell 5 ##
#################

def negative_sampling(customer_ids, product_ids, items, n_neg):
    """This function creates n_neg negative labels for every positive label
    
    @param customer_ids: list of customer ids
    @param product_ids: list of product ids
    @param items: unique list of product ids
    @param n_neg: number of negative labels to sample
    
    @return df_neg: negative sample dataframe
    
    """
    
    neg = []
    ui_pairs = zip(customer_ids, product_ids)
    records = set(ui_pairs)
    
    # for every positive (customer, product) pair
    for (u, i) in records:
        # generate n_neg negative labels
        for _ in range(n_neg):
            # randomly sample a product
            j = np.random.choice(items)
            # if the customer has already rated that product, resample
            while (u, j) in records:
                j = np.random.choice(items)
            neg.append([u, j, 0])
    # convert to a pandas DataFrame for concatenation later
    df_neg = pd.DataFrame(neg, columns=['CustomerId', 'ProductId', 'rating'])
    
    return df_neg

In [None]:
#################
## Code Cell 6 ##
#################

# create the negative samples for the training set and report how many were created
neg_train = negative_sampling(
    customer_ids=df_train.CustomerId.values, 
    product_ids=df_train.ProductId.values,
    items=df.ProductId.unique(),
    n_neg=5
)

print(f'created {neg_train.shape[0]:,} negative samples')

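# label every observed rating as a positive example (rating=1)
df_train = df_train[['CustomerId', 'ProductId']].assign(rating=1)
df_test = df_test[['CustomerId', 'ProductId']].assign(rating=1)

# combine positive and negative examples into the final training set
df_train = pd.concat([df_train, neg_train], ignore_index=True)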


## 4. Calculate statistics for our understanding and model training

In [None]:
#################
## Code Cell 7 ##
#################

def get_unique_count(df):
    """calculate unique customer and product counts"""
    return df.CustomerId.nunique(), df.ProductId.nunique()

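# number of unique customers and products in the whole dataset
# (printed explicitly, because only the last expression in a cell is displayed)
print('whole dataset (customers, products):', get_unique_count(df))

print('training set (customers, products):', get_unique_count(df_train))
print('testing set (customers, products):', get_unique_count(df_test))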

Next, we compute the number of unique customers and unique products in the training set and store both values for use during model training.
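Alongside the unique counts, one statistic that can help with understanding the data is the density of the customer-product matrix, that is, the share of all possible customer-product pairs that actually occur. The notebook does not compute it; the following is a minimal sketch on made-up pairs, and `interaction_stats` is a hypothetical helper name.

```python
def interaction_stats(pairs):
    """Return (n_customers, n_products, density) for (customer, product) pairs."""
    customers = {customer for customer, _ in pairs}
    products = {product for _, product in pairs}
    n_observed = len(set(pairs))
    # Share of all possible customer-product pairs that were observed.
    density = n_observed / (len(customers) * len(products))
    return len(customers), len(products), density


toy_pairs = [(1, 10), (1, 11), (2, 10), (3, 12)]
n_toy_customers, n_toy_products, toy_density = interaction_stats(toy_pairs)
print(n_toy_customers, n_toy_products, round(toy_density, 3))
```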

In [None]:
##################
## Code Cell 8  ##
##################

# number of unique customers and number of unique products in the training set
n_customer, n_product = get_unique_count(df_train)

print("number of unique customers ", n_customer)
print("number of unique products ", n_product)

# save the variable for the model training notebook
# -----
# read about `store` magic here: 
# https://ipython.readthedocs.io/en/stable/config/extensions/storemagic.html

%store n_customer
%store n_product


## 5. Preprocess the data and upload it to the S3 bucket

In [None]:
##################
## Code Cell 9  ##
##################

# get current session region
session = boto3.session.Session()
region = session.region_name
print(f'currently in {region}')

# use the default sagemaker s3 bucket to store processed data
# here we figure out what that default bucket name is 
sagemaker_session = sagemaker.Session()
bucket_name = sagemaker_session.default_bucket()
print(bucket_name)  # bucket name format: "sagemaker-{region}-{aws_account_id}"

# Save data locally
dest = 'data/'
train_path = os.path.join(dest, 'train.npy')
test_path = os.path.join(dest, 'test.npy')

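# create the directory if needed; unlike `!mkdir`, this does not fail when the cell is re-run
os.makedirs(dest, exist_ok=True)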
np.save(train_path, df_train.values)
np.save(test_path, df_test.values)

# upload to S3 bucket (see the bucket name above)
sagemaker_session.upload_data(train_path, key_prefix='data')
sagemaker_session.upload_data(test_path, key_prefix='data')
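Before relying on the uploaded files, it can be worth confirming that a `.npy` file round-trips cleanly. The sketch below saves and reloads a small made-up array with the same column layout as the training data (CustomerId, ProductId, rating), using a temporary directory so that nothing is left behind. It assumes the columns are plain integers; if the DataFrame holds Python objects such as `Decimal`, the saved array has dtype `object` and `np.load` needs `allow_pickle=True` to read it.

```python
import os
import tempfile

import numpy as np

# Made-up rows in the order CustomerId, ProductId, rating.
toy_train = np.array([[1, 10, 1], [1, 12, 0], [2, 11, 1]], dtype=np.int64)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'train.npy')
    np.save(path, toy_train)
    # Reload the file to confirm that shape, dtype, and values survive.
    restored = np.load(path)

print(restored.shape, restored.dtype)
```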


## 6. Data preparation completed. Proceed with the next step of the lab.