# Validating and Importing User-Item-Interaction Data <a class="anchor" id="top"></a>

In this notebook, you will choose a dataset and prepare it for use with Amazon Personalize.

1. [Introduction](#intro)
1. [Choose a dataset or data source](#source)
1. [Prepare your data](#prepare)
1. [Create dataset groups and the interactions dataset](#group_dataset)
1. [Configure an S3 bucket and an IAM role](#bucket_role)
1. [Import the interactions data](#import)

## Introduction <a class="anchor" id="intro"></a>

For the most part, the algorithms in Amazon Personalize (called recipes) look to solve different tasks, explained here:

1. **User Personalization** - The newest recipe. It supports all HRNN workflows and user personalization needs, and it is the one we use here.
1. **HRNN & HRNN-Metadata** - Recommends items based on previous user interactions with items.
1. **HRNN-Coldstart** - Recommends new items for which interaction data is not yet available.
1. **Personalized-Ranking** - Takes a collection of items and orders them in probable order of interest for a particular user using an HRNN-like approach.
1. **SIMS (Similar Items)** - Given one item, recommends other items also interacted with by users (think items in similar baskets rather than items necessarily similar to each other).
1. **Popularity-Count** - Recommends the most popular items. It is returned by default when HRNN or HRNN-Metadata do not have an answer (because there are not enough interactions).

No matter the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

We also support event types and event values defined by:

1. **Event Type** - Categorical label to define a type of event (browse, purchased, rated, etc).
1. **Event Value** - A value corresponding to the event type that occurred. Generally speaking, we look for normalized values between 0 and 1 over the event types. For example, if there are three phases to complete a transaction (clicked, added-to-cart, and purchased), then there would be an event_value for each phase of 0.33, 0.66, and 1.0 respectively.

The event type and event value fields are additional fields which can be used to filter the data used for training the personalization model. In this particular exercise we will add an event type to each interaction but will not use an event value (more information on how to use the eventValue with eventValueThreshold is in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/recording-events.html)).
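
As a minimal sketch of the normalization described above, the helper below spreads an ordered list of funnel phases evenly over the range (0, 1]. The function name `evenly_spaced_event_values` is hypothetical and not part of Amazon Personalize. Note that rounding gives 0.67 for the second phase, where the text above truncates to 0.66.

```python
def evenly_spaced_event_values(phases):
    """Map ordered funnel phases to values in (0, 1]; the last phase gets 1.0."""
    n = len(phases)
    return {phase: round((i + 1) / n, 2) for i, phase in enumerate(phases)}


# The three-phase funnel used as the example in the text.
values = evenly_spaced_event_values(["clicked", "added-to-cart", "purchased"])
print(values)
```

Any monotonically increasing mapping would serve the same purpose; even spacing is just the simplest choice.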

## Choose a dataset or data source <a class="anchor" id="source"></a>
[Back to top](#top)

As we mentioned, the user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

1. Video-on-demand applications
1. E-commerce platforms
1. Social media aggregators / platforms

There are a few guidelines for scoping a problem suitable for Personalize. We recommend the values below as a starting point, although the [official limits](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) lie a little lower.

* Authenticated users
* At least 50 unique users
* At least 100 unique items
* At least 2 dozen interactions for each user 

Most of the time this is easily attainable, and if you are low in one category, you can often make up for it by having a larger number in another category.
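
The guidelines above can be checked mechanically before any data is uploaded. The sketch below is one way to do it in plain Python; the function name `check_guidelines` and the synthetic sample are illustrative assumptions, and the thresholds are the starting-point values from the list above, not the official service limits.

```python
from collections import Counter


def check_guidelines(interactions, min_users=50, min_items=100, min_per_user=24):
    """Check (user_id, item_id) pairs against the starting-point guidelines."""
    interactions = list(interactions)
    per_user = Counter(user for user, _ in interactions)
    items = {item for _, item in interactions}
    return {
        "enough_users": len(per_user) >= min_users,
        "enough_items": len(items) >= min_items,
        # Users with fewer than two dozen interactions.
        "users_below_minimum": sorted(
            user for user, count in per_user.items() if count < min_per_user
        ),
    }


# Synthetic sample: 60 users, 30 interactions each, 148 distinct items.
sample = [(u, i) for u in range(60) for i in range(2 * u, 2 * u + 30)]
report = check_guidelines(sample)
print(report)
```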

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through all of that. 

To begin, we are going to use the latest MovieLens dataset, which has over 25 million interactions and a rich collection of metadata for items. There is also a smaller version of this dataset, which can be used to shorten training times while still incorporating the same capabilities as the full dataset. Set `USE_FULL_MOVIELENS` to `True` to use the full dataset.

In [7]:
USE_FULL_MOVIELENS = False

First, you will download the dataset from the Movielens website and unzip it in a new folder using the code below.

In [8]:
data_dir = "poc_data"
!mkdir $data_dir

if not USE_FULL_MOVIELENS:
    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
    !cd $data_dir && unzip ml-latest-small.zip
    dataset_dir = data_dir + "/ml-latest-small/"
else:
    !cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-25m.zip
    !cd $data_dir && unzip ml-25m.zip
    dataset_dir = data_dir + "/ml-25m/"

--2021-08-24 11:45:16--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2021-08-24 11:45:17 (1.39 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


Take a look at the data files you have downloaded.

In [9]:
!ls $dataset_dir

links.csv  movies.csv  ratings.csv  README.txt	tags.csv


At present not much is known except that we have a few CSVs and a readme. Next we will output the readme to learn more!

In [10]:
!pygmentize $dataset_dir/README.txt

Summary

This dataset (ml-latest-small) describes 5-star rating and free-text tagging activity from [MovieLens](http://movielens.org), a movie recommendation service. It contains 100836 ratings and 3683 tag applications across 9742 movies. These data were created by 610 users between March 29, 1996 and September 24, 2018. This dataset was generated on September 26, 2018.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in the files `links.csv`, `movies.csv`, `ratings.csv` and `tags.csv`. More details about the contents and use of all these files follows.

This is a *development* dataset. As such, it may change over time and is not an appropriate dataset for shared research results. See available *benchmark* datasets if that is your intent.

This and other GroupLens data sets are publicly available for down

From the README, we see there is a file `ratings.csv` that should work as a proxy for our interactions data. After all, rating a film is definitely a form of interacting with it. The dataset also has some genre information as well as some movie genome data. In this POC we will focus on the interactions and the genre data.


## Prepare your data <a class="anchor" id="prepare"></a>
[Back to top](#top)

The next thing to be done is to load the data and confirm the data is in a good state, then save it to a CSV where it is ready to be used with Amazon Personalize.

To get started, import a collection of Python libraries commonly used in data science.

In [11]:
import time
from time import sleep
import json
from datetime import datetime
import boto3
import pandas as pd

Next, open the data file and take a look at the first rows.

In [12]:
original_data = pd.read_csv(dataset_dir + '/ratings.csv')
original_data.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [13]:
original_data.shape

(100836, 4)

In [14]:
original_data.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


This shows that we have a good range of values for `userId` and `movieId`. Next, it is always a good idea to confirm the data format.

In [15]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [16]:
original_data.isnull().any()

userId       False
movieId      False
rating       False
timestamp    False
dtype: bool

From this, you can see that there are 100,836 entries in the small dataset (25,000,095 in the full dataset), with 4 columns. Every column is stored in int64 format, with the exception of `rating`, which is float64.

The int64 format is clearly suitable for `userId` and `movieId`. However, we need to dive deeper to understand the timestamps in the data. To use Amazon Personalize, you need to save timestamps in [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) format.

Currently, the timestamp values are not human-readable. As a quick sanity check, let's pick an arbitrary timestamp and convert it to a human-readable format.

In [17]:
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

964982681.0
2000-07-30 18:44:41


This date makes sense as a timestamp, so we can continue formatting the rest of the data. Remember, the data we need is user-item-interaction data, which is `userId`, `movieId`, and `timestamp` in this case. Our dataset has an additional column, `rating`, which can be dropped from the dataset after we have leveraged it to focus on positive interactions.
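
The timestamp check above relies on `datetime.utcfromtimestamp`, which is deprecated as of Python 3.12. A minimal alternative sketch, assuming a timezone-aware conversion is acceptable, is shown here; the helper name `epoch_to_utc_string` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc_string(epoch_seconds):
    """Format Unix epoch seconds as a UTC date string."""
    moment = datetime.fromtimestamp(int(epoch_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S")


# Same value that the notebook printed for row 50.
print(epoch_to_utc_string(964982681.0))
```

This prints the same date as the notebook cell, `2000-07-30 18:44:41`.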

Since this is a dataset of explicit-feedback movie ratings, each interaction carries a rating from 0.5 to 5 (see the `describe()` output above). We want to include only movies that were "liked" by the users, and simulate the data that would be gathered by a VOD platform. To do that, we create two `EVENT_TYPE` values, "click" and "watch". Interactions rated 1 or lower are filtered out, interactions rated above 1 are recorded as "click", and interactions rated above 3 are recorded as both "click" and "watch".

Note that this is only to produce the events we are modeling. For a real dataset, you would model directly on implicit feedback such as clicks and watches, and/or explicit feedback such as ratings and likes.
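
The thresholding rule can be sketched without pandas, which makes it easy to unit-test on a handful of rows. This is an illustration only: the function name `ratings_to_events` and its parameters are assumptions, and the thresholds mirror the `rating > 1` and `rating > 3` filters used in the cells that follow.

```python
def ratings_to_events(rows, click_above=1.0, watch_above=3.0):
    """Turn (user_id, movie_id, rating, timestamp) rows into event tuples."""
    events = []
    for user_id, movie_id, rating, timestamp in rows:
        if rating > click_above:
            events.append((user_id, movie_id, timestamp, "click"))
        if rating > watch_above:
            events.append((user_id, movie_id, timestamp, "watch"))
    # Sort by timestamp, as the notebook does for the combined dataframe.
    return sorted(events, key=lambda event: event[2])


# Illustrative rows: one well-liked, one disliked, one lukewarm rating.
rows = [
    (1, 1, 4.0, 964982703),
    (1, 3, 1.0, 964981247),
    (1, 6, 2.5, 964982224),
]
events = ratings_to_events(rows)
for event in events:
    print(event)
```

A rating above 3 therefore yields two rows for the same user and item, which is why the combined dataset is larger than the original ratings file.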

In [18]:
watched_df = original_data.copy()
watched_df = watched_df[watched_df['rating'] > 3]
watched_df = watched_df[['userId', 'movieId', 'timestamp']]
watched_df['EVENT_TYPE']='watch'
watched_df.head()

Unnamed: 0,userId,movieId,timestamp,EVENT_TYPE
0,1,1,964982703,watch
1,1,3,964981247,watch
2,1,6,964982224,watch
3,1,47,964983815,watch
4,1,50,964982931,watch


In [19]:
clicked_df = original_data.copy()
clicked_df = clicked_df[clicked_df['rating'] > 1]
clicked_df = clicked_df[['userId', 'movieId', 'timestamp']]
clicked_df['EVENT_TYPE']='click'
clicked_df.head()

Unnamed: 0,userId,movieId,timestamp,EVENT_TYPE
0,1,1,964982703,click
1,1,3,964981247,click
2,1,6,964982224,click
3,1,47,964983815,click
4,1,50,964982931,click


In [20]:
interactions_df = clicked_df.copy()
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
interactions_df = pd.concat([interactions_df, watched_df])
interactions_df.sort_values("timestamp", axis=0, ascending=True,
                            inplace=True, na_position='last')

In [21]:
interactions_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 158371 entries, 66679 to 81092
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   userId      158371 non-null  int64 
 1   movieId     158371 non-null  int64 
 2   timestamp   158371 non-null  int64 
 3   EVENT_TYPE  158371 non-null  object
dtypes: int64(3), object(1)
memory usage: 6.0+ MB


Let's look at what the new dataset looks like.

In [22]:
interactions_df.describe()

Unnamed: 0,userId,movieId,timestamp
count,158371.0,158371.0,158371.0
mean,323.940804,19869.485146,1211627000.0
std,182.195975,35809.058185,213737600.0
min,1.0,1.0,828124600.0
25%,175.0,1203.0,1031072000.0
50%,325.0,3033.0,1193289000.0
75%,475.0,8528.0,1435998000.0
max,610.0,193609.0,1537799000.0


After manipulating the data, always confirm the data format has not changed.

In [23]:
interactions_df.dtypes

userId         int64
movieId        int64
timestamp      int64
EVENT_TYPE    object
dtype: object

Amazon Personalize has default column names for users, items, and timestamp. These default column names are `USER_ID`, `ITEM_ID`, and `TIMESTAMP`. So the final modification to the dataset is to replace the existing column headers with the default headers.

In [24]:
interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'}, inplace = True) 

That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [25]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
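
Before uploading, it can be worth reading the saved file back and confirming that the required headers are present and the timestamps are whole numbers. The sketch below uses only the standard library and an in-memory sample; the helper name `validate_interactions_csv` is hypothetical, and in the notebook you would pass an open handle to the saved `interactions.csv` instead.

```python
import csv
import io

REQUIRED_COLUMNS = {"USER_ID", "ITEM_ID", "TIMESTAMP"}


def validate_interactions_csv(handle):
    """Return the row count, raising ValueError on a bad header or timestamp."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError("missing columns: {}".format(sorted(missing)))
    rows = 0
    for row in reader:
        int(row["TIMESTAMP"])  # raises ValueError if not a whole number
        rows += 1
    return rows


# In-memory stand-in for the saved file.
sample_csv = io.StringIO(
    "USER_ID,ITEM_ID,TIMESTAMP,EVENT_TYPE\n"
    "1,1,964982703,click\n"
    "1,1,964982703,watch\n"
)
row_count = validate_interactions_csv(sample_csv)
print(row_count)
```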

## Create dataset groups and the interactions dataset <a class="anchor" id="group_dataset"></a>
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

Before we create the dataset group and the dataset for our interaction data, let's validate that your environment can communicate successfully with Amazon Personalize.

In [53]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')
print("We can communicate with Personalize!")

### Create the dataset group

The following cell will create a new dataset group with the name `personalize-poc-movielens`.

In [54]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-poc-movielens"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:832194813872:dataset-group/personalize-poc-movielens",
  "ResponseMetadata": {
    "RequestId": "e84f8796-daaf-4a79-84ee-2a0477ee29fa",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:19:13 GMT",
      "x-amzn-requestid": "e84f8796-daaf-4a79-84ee-2a0477ee29fa",
      "content-length": "104",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset group, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every minute, up to a maximum of 3 hours.

In [55]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
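
The same polling pattern recurs for every long-running Personalize resource, so it can be factored into a small helper. The sketch below is an assumption about how one might structure it, not part of boto3; `wait_for_status` is a hypothetical name, and a fake status source stands in for the describe call so the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done=("ACTIVE", "CREATE FAILED"),
                    interval=60, timeout=3 * 60 * 60):
    """Call get_status() until it returns a value in `done` or `timeout` passes."""
    deadline = time.time() + timeout
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status in done:
            return status
        if time.time() >= deadline:
            raise TimeoutError("still {} after {} seconds".format(status, timeout))
        time.sleep(interval)


# Stand-in for a describe call so the sketch runs without AWS access.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), interval=0)
```

With the real client, `get_status` could be `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`. Unlike the loop above, this helper raises on timeout instead of falling through silently.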


Now that you have a dataset group, you can create a dataset for the interaction data.

### Create the dataset

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for interactions data, which requires the `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields. The schema below also declares the `EVENT_TYPE` field added during data preparation. These must be defined in the same order in the schema as they appear in the dataset.
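
A quick consistency check between the CSV header and the schema can catch a misspelled or forgotten field before the import job runs. The following is a minimal sketch; `columns_missing_from_schema` is a hypothetical helper, and the example schema simply mirrors the interactions fields used in this notebook.

```python
import json


def columns_missing_from_schema(schema, csv_columns):
    """Return the CSV columns that have no field of the same name in the schema."""
    schema_fields = {field["name"] for field in schema["fields"]}
    return [column for column in csv_columns if column not in schema_fields]


example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "EVENT_TYPE", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

missing = columns_missing_from_schema(
    example_schema, ["USER_ID", "ITEM_ID", "TIMESTAMP", "EVENT_TYPE"]
)
print(missing)

# The schema is passed to the service as a JSON string.
schema_json = json.dumps(example_schema)
```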

In [56]:
interactions_schema = schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-poc-movielens-interactions",
    schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:832194813872:schema/personalize-poc-movielens-interactions",
  "ResponseMetadata": {
    "RequestId": "42390b3d-9138-48ed-9c02-df3f6159dfd1",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:22:14 GMT",
      "x-amzn-requestid": "42390b3d-9138-48ed-9c02-df3f6159dfd1",
      "content-length": "104",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, it just defines the schema for the data. The data will be loaded a few steps later.

In [57]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-poc-movielens-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = interaction_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:832194813872:dataset/personalize-poc-movielens/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "446a75b5-5233-4580-a042-4254033887cd",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:22:16 GMT",
      "x-amzn-requestid": "446a75b5-5233-4580-a042-4254033887cd",
      "content-length": "106",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## Configure an S3 bucket and an IAM role <a class="anchor" id="bucket_role"></a>
[Back to top](#top)

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook. However, Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing that bucket. Let's set all of that up.

Use the metadata stored on the instance underlying this Amazon SageMaker notebook, to determine the region it is operating in. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string below. The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far.

In [58]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

us-east-1


Amazon S3 bucket names are globally unique. To create a unique bucket name, the code below joins your AWS account number, the region, and the string `personalizepocvod`. It then creates a bucket with this name in the region discovered in the previous cell.

In [59]:
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + region + "-" + "personalizepocvod"
print(bucket_name)
if region == "us-east-1":
    s3.create_bucket(Bucket=bucket_name)
else:
    s3.create_bucket(
        Bucket=bucket_name,
        CreateBucketConfiguration={'LocationConstraint': region}
        )

832194813872-us-east-1-personalizepocvod
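
The bucket name is assembled from three parts, so it can help to validate it against the general S3 naming rules (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starting and ending with a letter or digit) before calling `create_bucket`. The sketch below is illustrative: `make_bucket_name` is a hypothetical helper, the account number is a placeholder, and the regular expression covers only the basic rules listed here.

```python
import re

# Basic S3 naming rules only: length 3-63, allowed characters, first/last character.
BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def make_bucket_name(account_id, region, suffix="personalizepocvod"):
    """Join the parts the way the notebook does and validate the result."""
    name = "{}-{}-{}".format(account_id, region, suffix).lower()
    if not BUCKET_NAME_RE.match(name):
        raise ValueError("invalid S3 bucket name: {}".format(name))
    return name


# Placeholder account number, not a real account.
print(make_bucket_name("123456789012", "us-east-1"))
```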


### Upload data to S3

Now that your Amazon S3 bucket has been created, upload the CSV file of our user-item-interaction data. 

In [60]:
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename

### Set the S3 bucket policy
Amazon Personalize needs to be able to read the contents of your S3 bucket. So add a bucket policy which allows that.

In [61]:
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': '9460125C440E624C',
  'HostId': 'FVHJyrZZHSoU1Y8NRkktdPVFUYT1tDRwlh7w4gO08bmhoLzAdhzVvjy0+3G5mNLhtsOlaL6i6Cg=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'FVHJyrZZHSoU1Y8NRkktdPVFUYT1tDRwlh7w4gO08bmhoLzAdhzVvjy0+3G5mNLhtsOlaL6i6Cg=',
   'x-amz-request-id': '9460125C440E624C',
   'date': 'Mon, 01 Feb 2021 19:22:28 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
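
The bucket policy above grants `s3:*Object`, which includes write and delete actions. If Amazon Personalize only needs to read the import file, a narrower policy may be enough. The sketch below is an assumption about what such a read-only policy could look like, not a statement of the minimum the service requires; `read_only_bucket_policy` is a hypothetical helper and the bucket name is a placeholder.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build a bucket policy that lets Personalize list and read objects only."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


# Placeholder bucket name for illustration.
policy_json = json.dumps(read_only_bucket_policy("example-personalize-bucket"), indent=2)
print(policy_json)
```

The resulting string can be passed to `s3.put_bucket_policy` in the same way as the policy above.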

### Create an IAM role

Amazon Personalize needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. Let's create an IAM role and attach the required policies to it. The code below attaches very permissive policies; please use more restrictive policies for any production application.

In [62]:
iam = boto3.client("iam")

role_name = "PersonalizeRolePOC"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::832194813872:role/PersonalizeRolePOC

## Import the interactions data <a class="anchor" id="import"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the data from the S3 bucket into the Amazon Personalize dataset. 
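
If you re-run the import cell, a fixed job name is likely to collide with the earlier job. One way around this, sketched below, is to append a UTC timestamp to the name. The helper `unique_job_name` is hypothetical, and the 63-character cap is an assumption about the service's name length limit that you should confirm in the API reference.

```python
from datetime import datetime, timezone


def unique_job_name(prefix, now=None, max_length=63):
    """Append a UTC timestamp to `prefix`, trimming the prefix to fit max_length."""
    now = now or datetime.now(timezone.utc)
    suffix = now.strftime("%Y%m%d-%H%M%S")
    # Reserve room for the suffix and the joining hyphen.
    return "{}-{}".format(prefix[: max_length - len(suffix) - 1], suffix)


# Fixed timestamp so the output is reproducible.
example_name = unique_job_name(
    "personalize-poc-import",
    now=datetime(2021, 2, 1, 19, 23, 29, tzinfo=timezone.utc),
)
print(example_name)
```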

In [63]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-import1",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:832194813872:dataset-import-job/personalize-poc-import1",
  "ResponseMetadata": {
    "RequestId": "96e73fe9-05d4-4ac8-a285-a3ad895ce3a4",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:23:29 GMT",
      "x-amzn-requestid": "96e73fe9-05d4-4ac8-a285-a3ad895ce3a4",
      "content-length": "111",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every minute, up to a maximum of 6 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes.

In [64]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 145 ms, sys: 0 ns, total: 145 ms
Wall time: 20min 1s


When the dataset import is active, you are ready to start building models with SIMS, Personalized-Ranking, and User Personalization. This process will continue in other notebooks.

# Validating and Importing Item Metadata <a class="anchor" id="top"></a>

Importing item metadata will allow you to work with filters, and it also supports the `User Personalization` algorithm.


## Prepare your Item metadata <a class="anchor" id="prepare"></a>
[Back to top](#top)

Next we load the data and confirm the data is in a good state, then save it to a CSV in S3 where it is ready to be used with Amazon Personalize.

Using the Python libraries imported earlier in this notebook, open the data file and take a look at the first rows.

In [3]:
original_data = pd.read_csv(dataset_dir + '/movies.csv')
original_data.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
original_data.describe()

Unnamed: 0,movieId
count,9742.0
mean,42200.353623
std,52160.494854
min,1.0
25%,3248.25
50%,7300.0
75%,76232.0
max,193609.0


This does not really tell us much about the dataset, so we will explore a bit more and look at the raw information. We can see that genres often appear in groups. That is fine for us as Personalize supports this structure.
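
To get a feel for the grouped genre values, the pipe-separated strings can be split and counted. This is a small standard-library sketch using the five rows shown in the `head()` output above; `genre_counts` is a hypothetical helper.

```python
from collections import Counter


def genre_counts(genre_strings):
    """Count individual genres across pipe-separated genre strings."""
    counts = Counter()
    for value in genre_strings:
        counts.update(value.split("|"))
    return counts


# The genre values from the first five rows of movies.csv shown above.
sample_genres = [
    "Adventure|Animation|Children|Comedy|Fantasy",
    "Adventure|Children|Fantasy",
    "Comedy|Romance",
    "Comedy|Drama|Romance",
    "Comedy",
]
counts = genre_counts(sample_genres)
print(counts.most_common(3))
```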

In [5]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


From this, you can see that there are 9,742 entries in the small dataset (over 62,000 in the full dataset), with 3 columns.

Let's look for potential data issues. First we will check for null values.

In [6]:
original_data.isnull().sum()

movieId    0
title      0
genres     0
dtype: int64

Looks good, we currently have no null values.

This is a pretty small dataset of just the movieId, title, and the list of genres that are applicable to each entry. However, there is additional data available in the MovieLens dataset. For instance, the title includes the year of the movie's release. Let's make that another column of metadata.

In [7]:
# Raw string avoids invalid-escape warnings for \( and \) in newer Python versions
original_data['year'] = original_data['title'].str.extract(r'.*\((.*)\).*', expand=False)
original_data.head(5)

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


Let's check again for null values, now that we have added a new field.

In [8]:
original_data.isnull().sum()

movieId     0
title       0
genres      0
year       12
dtype: int64

It looks like we have introduced some null values. This is likely due to something in the original data. If we had time, we could investigate the titles that resulted in the null values. However, for this workshop we will drop the titles with null values.
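
One way to investigate such titles is to apply a stricter pattern that accepts only a four-digit year in the final parentheses and returns `None` otherwise. The sketch below uses the standard `re` module rather than pandas. `extract_year` is a hypothetical helper, and apart from "Toy Story (1995)" the titles are made up for illustration.

```python
import re

# A four-digit year in parentheses at the end of the title (trailing spaces allowed).
YEAR_RE = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the release year as an int, or None if the title has no year."""
    match = YEAR_RE.search(title)
    return int(match.group(1)) if match else None


titles = [
    "Toy Story (1995)",
    "Example Film (Alternate Title) (2001) ",  # made-up title
    "Example Film Without Year",               # made-up title
]
for title in titles:
    print(repr(title), "->", extract_year(title))
```

Titles for which this returns `None` are the ones to inspect by hand.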

In [10]:
original_data = original_data.dropna(axis=0)

Let's validate that we resolved the data issue.

In [11]:
original_data.isnull().sum()

movieId    0
title      0
genres     0
year       0
dtype: int64

From an item metadata perspective, we only want to include information that is relevant to training a model and/or filtering results, so we will drop the title column, and keep the genre information.

In [12]:
itemmetadata_df = original_data.copy()
itemmetadata_df = itemmetadata_df[['movieId', 'genres', 'year']]
itemmetadata_df.head()

Unnamed: 0,movieId,genres,year
0,1,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Adventure|Children|Fantasy,1995
2,3,Comedy|Romance,1995
3,4,Comedy|Drama|Romance,1995
4,5,Comedy,1995


We will add a new column to hold a creation timestamp. If you don’t provide the CREATION_TIMESTAMP for an item, the model infers this information from the interaction dataset and uses the timestamp of the item’s earliest interaction as its corresponding release date. If an item doesn’t have an interaction, its release date is set as the timestamp of the latest interaction in the training set and it is considered a new item. For the current dataset we will set the CREATION_TIMESTAMP to 0.
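
This notebook uses 0 for every item. As an alternative, and purely as an illustrative assumption, the extracted release year could be turned into an approximate creation timestamp by taking midnight UTC on January 1 of that year. The helper name `year_to_creation_timestamp` is hypothetical.

```python
from datetime import datetime, timezone


def year_to_creation_timestamp(year):
    """Return Unix epoch seconds for January 1 (UTC) of the given year."""
    return int(datetime(int(year), 1, 1, tzinfo=timezone.utc).timestamp())


# The year column holds strings such as "1995", so strings are accepted too.
print(year_to_creation_timestamp("1995"))
```

Whether an approximate date helps depends on the recipe and on how much the release date matters for your items.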

In [13]:
itemmetadata_df['CREATION_TIMESTAMP'] = 0

After manipulating the data, always confirm that the data format has not changed.

In [14]:
itemmetadata_df.dtypes

movieId                int64
genres                object
year                  object
CREATION_TIMESTAMP     int64
dtype: object
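
Note that `year` is reported as `object` (strings), while the schema defined later declares `YEAR` as `int`. Because the CSV file stores text either way, the import may work without a cast, but making the type explicit can surface bad values early. A minimal sketch on a hypothetical dataframe:

```python
import pandas as pd

# Hypothetical column of year strings, as produced by text extraction.
df = pd.DataFrame({"year": ["1995", "2017", "2018"]})

# errors="raise" fails loudly if any value is not a valid number.
df["year"] = pd.to_numeric(df["year"], errors="raise").astype("int64")
print(df.dtypes)
```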

Amazon Personalize has a default column for `ITEM_ID` that will map to our `movieId`. We will flesh out more information by specifying `GENRE` and `YEAR` as well.

In [15]:
itemmetadata_df.rename(columns = {'genres':'GENRE', 'movieId':'ITEM_ID', 'year':'YEAR'}, inplace = True) 

In [16]:
itemmetadata_df

Unnamed: 0,ITEM_ID,GENRE,YEAR,CREATION_TIMESTAMP
0,1,Adventure|Animation|Children|Comedy|Fantasy,1995,0
1,2,Adventure|Children|Fantasy,1995,0
2,3,Comedy|Romance,1995,0
3,4,Comedy|Drama|Romance,1995,0
4,5,Comedy,1995,0
...,...,...,...,...
9737,193581,Action|Animation|Comedy|Fantasy,2017,0
9738,193583,Animation|Comedy|Fantasy,2017,0
9739,193585,Drama,2017,0
9740,193587,Action|Animation,2018,0


That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [17]:
itemmetadata_filename = "item-meta.csv"
itemmetadata_df.to_csv((data_dir+"/"+itemmetadata_filename), index=False, float_format='%.0f')

### Create the dataset

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for the item metadata that defines the `ITEM_ID`, `GENRE`, `YEAR`, and `CREATION_TIMESTAMP` fields. These must be defined in the same order in the schema as they appear in the dataset.

In [19]:
itemmetadata_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "GENRE",
            "type": "string",
            "categorical": True
        },{
            "name": "YEAR",
            "type": "int",
        },
        {
            "name": "CREATION_TIMESTAMP",
            "type": "long",
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-poc-movielens-item",
    schema = json.dumps(itemmetadata_schema)
)

itemmetadataschema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:832194813872:schema/personalize-poc-movielens-item",
  "ResponseMetadata": {
    "RequestId": "9fc2d9d9-22be-46d5-8269-cd959ff4ad8a",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:48:20 GMT",
      "x-amzn-requestid": "9fc2d9d9-22be-46d5-8269-cd959ff4ad8a",
      "content-length": "96",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
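
Since the schema fields are expected to appear in the same order as the CSV columns, a quick local check before uploading can catch mismatches. The helper `header_matches_schema` below is a hypothetical name, and the inline CSV text and trimmed schema are stand-ins for the notebook's file and schema.

```python
import csv
import io

def header_matches_schema(csv_text, schema):
    """Return True if the CSV header equals the schema field names, in order."""
    expected = [field["name"] for field in schema["fields"]]
    header = next(csv.reader(io.StringIO(csv_text)))
    return header == expected

# Trimmed stand-in for the item metadata schema (field names only matter here).
schema = {
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "GENRE", "type": "string"},
        {"name": "YEAR", "type": "int"},
        {"name": "CREATION_TIMESTAMP", "type": "long"},
    ]
}

good_csv = "ITEM_ID,GENRE,YEAR,CREATION_TIMESTAMP\n1,Comedy,1995,0\n"
bad_csv = "ITEM_ID,YEAR,GENRE,CREATION_TIMESTAMP\n1,1995,Comedy,0\n"

print(header_matches_schema(good_csv, schema))
print(header_matches_schema(bad_csv, schema))
```

With the notebook's variables, the same check could be run by reading the saved `item-meta.csv` file and passing `itemmetadata_schema`.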


With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it only registers the dataset together with the schema that describes what the data looks like. We will upload the data a few steps later.

In [20]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-poc-movielens-items",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = itemmetadataschema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:832194813872:dataset/personalize-poc-movielens/ITEMS",
  "ResponseMetadata": {
    "RequestId": "c7a4b711-d16c-45d5-9fe9-a10de3f456a1",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:48:23 GMT",
      "x-amzn-requestid": "c7a4b711-d16c-45d5-9fe9-a10de3f456a1",
      "content-length": "99",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Upload data to S3

We upload the CSV file of our item metadata to the S3 bucket we created previously.

In [21]:
itemmetadata_file_path = data_dir + "/" + itemmetadata_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(itemmetadata_filename).upload_file(itemmetadata_file_path)
# S3 location of the item metadata file (named for the data it points to)
itemmetadata_s3DataPath = "s3://"+bucket_name+"/"+itemmetadata_filename

## Import the item metadata <a class="anchor" id="import"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information. Now you will run an import job that loads the data from the S3 bucket into the Amazon Personalize dataset.

In [22]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-item-import1",
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, itemmetadata_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:832194813872:dataset-import-job/personalize-poc-item-import1",
  "ResponseMetadata": {
    "RequestId": "272cc72f-55c5-4f4c-8be6-7105637b074a",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Mon, 01 Feb 2021 19:48:27 GMT",
      "x-amzn-requestid": "272cc72f-55c5-4f4c-8be6-7105637b074a",
      "content-length": "116",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. The cell checks the status of the import job every minute, for up to 6 hours, and stops early if the job reports CREATE FAILED.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes.

In [23]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 116 ms, sys: 17.1 ms, total: 133 ms
Wall time: 18min 1s
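
The same polling pattern appears for each import job, so it can be factored into a small helper. The function `wait_for_status` below is a hypothetical sketch, not part of boto3 or Amazon Personalize; it takes any callable that returns a status string, which lets it run here against a fake status source without AWS access.

```python
import time

def wait_for_status(get_status, done=("ACTIVE", "CREATE FAILED"),
                    interval_s=60, timeout_s=6 * 60 * 60):
    """Poll get_status() until it returns a terminal status or time runs out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        status = get_status()
        print("Status: {}".format(status))
        if status in done:
            return status
        time.sleep(interval_s)
    raise TimeoutError("No terminal status within {} seconds".format(timeout_s))

# Fake status source so the sketch runs locally; interval_s=0 avoids waiting.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(fake_statuses), interval_s=0)
```

In the notebook, the callable could wrap the existing call, for example `lambda: personalize.describe_dataset_import_job(datasetImportJobArn=dataset_import_job_arn)["datasetImportJob"]["status"]`. Returning the final status lets the caller distinguish ACTIVE from CREATE FAILED instead of only breaking out of the loop.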


With both imports now complete, you can enable filtering for your recommendations as well as support `User Personalization`. Before moving on, run the cell below to store a few values for use in the next notebooks. After that cell completes, open notebook `02_Training_Layer.ipynb` to continue.

In [66]:
%store USE_FULL_MOVIELENS
%store dataset_dir
%store interactions_dataset_arn
%store dataset_group_arn
%store bucket_name
%store role_arn
%store role_name
%store data_dir
%store region
%store interaction_schema_arn
%store items_dataset_arn
%store itemmetadataschema_arn

Stored 'USE_FULL_MOVIELENS' (bool)
Stored 'dataset_dir' (str)
Stored 'interactions_dataset_arn' (str)
Stored 'dataset_group_arn' (str)
Stored 'bucket_name' (str)
Stored 'role_arn' (str)
Stored 'role_name' (str)
Stored 'data_dir' (str)
Stored 'region' (str)
Stored 'interaction_schema_arn' (str)
