# Validating and Importing User-Item-Interaction Data <a class="anchor" id="top"></a>

In this notebook, you will choose a dataset and prepare it for use with Amazon Personalize.

1. [Introduction](#intro)
1. [Choose a dataset or data source](#source)
1. [Prepare your data](#prepare)
1. [Create dataset groups and the interactions dataset](#group_dataset)
1. [Configure an S3 bucket and an IAM role](#bucket_role)
1. [Import the interactions data](#import)

## Introduction <a class="anchor" id="intro"></a>

The algorithms in Amazon Personalize (called recipes) each solve a different task:

1. **User Personalization** - The newest recipe, which supports all HRNN workflows and user personalization needs. It is the one we use here.
1. **Personalized-Ranking** - Takes a collection of items and orders them by probable interest for a particular user, using an HRNN-like approach.
1. **SIMS (Similar Items)** - Given one item, recommends other items that the same users also interacted with (think items in similar baskets rather than items that necessarily resemble each other).
1. **Popularity-Count** - Recommends the most popular items. This is returned by default when HRNN or HRNN-Metadata do not have an answer because there are not enough interactions.

No matter the use case, the algorithms all share a base of learning on user-item-interaction data which is defined by 3 core attributes:

1. **UserID** - The user who interacted
1. **ItemID** - The item the user interacted with
1. **Timestamp** - The time at which the interaction occurred

We also support event types and event values defined by:

1. **Event Type** - Categorical label to define a type of event (browse, purchased, rated, etc).
1. **Event Value** - A value corresponding to the event type that occurred. Generally speaking, we look for normalized values between 0 and 1 over the event types. For example, if there are three phases to complete a transaction (clicked, added-to-cart, and purchased), then there would be an event_value for each phase of 0.33, 0.66, and 1.0 respectively.

The event type and event value fields are additional fields which can be used to filter the data used for training the personalization model. In this exercise we do not use an event type or event value. For more information on using `eventValue` with `eventValueThreshold`, see the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/recording-events.html).
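
As an illustration of the event value idea described above, the sketch below attaches an evenly spaced value to each step of a three-step funnel. The event names, the `EVENT_TYPE` and `EVENT_VALUE` column names, and the sample rows are assumptions made for this example; the dataset used in this notebook has no event type.

```python
import pandas as pd

# Hypothetical funnel, ordered from weakest to strongest signal.
funnel = ["clicked", "added-to-cart", "purchased"]

# Evenly spaced values in (0, 1]: one third, two thirds and 1.0.
event_values = {name: (i + 1) / len(funnel) for i, name in enumerate(funnel)}

# Made-up interactions for illustration only.
events = pd.DataFrame({
    "USER_ID": [1, 1, 2],
    "ITEM_ID": [10, 10, 11],
    "EVENT_TYPE": ["clicked", "purchased", "added-to-cart"],
})

# Look up the value for each row's event type and round to two decimals.
events["EVENT_VALUE"] = events["EVENT_TYPE"].map(event_values).round(2)
print(events)
```

Spacing the values evenly is only one possible choice; any scale that orders the events by strength of intent would serve the same purpose.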

## Choose a dataset or data source <a class="anchor" id="source"></a>
[Back to top](#top)

As we mentioned, the user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

1. Video-on-demand applications
1. E-commerce platforms
1. Social media aggregators / platforms

There are a few guidelines for scoping a problem suitable for Personalize. We recommend the values below as a starting point, although the [official limits](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) lie a little lower.

* Authenticated users
* At least 50 unique users
* At least 100 unique items
* At least 2 dozen interactions for each user 

Most of the time this is easily attainable, and if you are low in one category, you can often make up for it by having a larger number in another category.
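
The guidelines above can be checked against an interactions table with a few lines of pandas. This is a minimal sketch: the function name `check_guidelines`, the default column names, and the tiny sample table are assumptions for illustration, and the thresholds are simply the starting-point values listed above (2 dozen taken as 24).

```python
import pandas as pd

def check_guidelines(df, user_col="USER_ID", item_col="ITEM_ID",
                     min_users=50, min_items=100, min_per_user=24):
    """Summarize an interactions table against the starting-point guidelines."""
    per_user = df.groupby(user_col).size()  # interactions per user
    n_users = int(df[user_col].nunique())
    n_items = int(df[item_col].nunique())
    return {
        "unique_users": n_users,
        "unique_items": n_items,
        "median_interactions_per_user": float(per_user.median()),
        "enough_users": n_users >= min_users,
        "enough_items": n_items >= min_items,
        # Fraction of users that reach the per-user interaction guideline.
        "share_of_users_with_enough_history": float((per_user >= min_per_user).mean()),
    }

# Tiny made-up table: 2 users, 3 items.
sample = pd.DataFrame({"USER_ID": [1, 1, 1, 2], "ITEM_ID": [7, 8, 9, 7]})
report = check_guidelines(sample)
print(report)
```

Reporting the share of users with enough history, rather than a single pass/fail flag, reflects the point that a shortfall in one category can be offset by another.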

Generally speaking your data will not arrive in a perfect form for Personalize, and will take some modification to be structured correctly. This notebook guides you through all of that. 


Download the raw data from the designated S3 bucket.

In [1]:
import os

raw_data_dir = "data/raw"
!mkdir -p $raw_data_dir

s3bucket = 'rp-personalize'
s3raw_datafile = 'data/raw/recommend_csv.tgz'
s3basename = os.path.basename(s3raw_datafile)

!(cd $raw_data_dir && aws s3 cp s3://$s3bucket/$s3raw_datafile .)
!(cd $raw_data_dir && tar xzf $s3basename)
!(cd $raw_data_dir && rm -f $s3basename)


download: s3://rp-personalize/data/raw/recommend_csv.tgz to ./recommend_csv.tgz
tar: ._posts.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: posts.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: ._profiles.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: profiles.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: ._reads.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: reads.csv: Cannot change ownership to uid 504, gid 50: Operation not permitted
tar: Exiting with failure status due to previous errors


Take a look at the data files you have downloaded.

In [2]:
!ls -l $raw_data_dir

total 439132
-rw-r--r-- 1 root root 308226061 Oct  5 12:48 posts.csv
-rw-r--r-- 1 root root  23004727 Oct  5 12:46 profiles.csv
-rw-r--r-- 1 root root  76607759 Oct 13 14:29 rallypoint.tgz
-rw-r--r-- 1 root root  41822301 Oct  5 12:46 reads.csv


At present not much is known except that we have a few CSV files. 

## Prepare your data <a class="anchor" id="prepare"></a>
[Back to top](#top)

The next thing to be done is to load the data and confirm the data is in a good state, then save it to a CSV where it is ready to be used with Amazon Personalize.

To get started, import a collection of Python libraries commonly used in data science.

In [3]:
import time
from time import sleep
import json
from datetime import datetime
import boto3
import pandas as pd
import numpy as np
import sagemaker

In [4]:
sagemaker.get_execution_role()
region = boto3.Session().region_name


### Be sure to put the ARN of the Personalize role that you created here

In [5]:
role_arn = "arn:aws:iam::662559257807:role/amPersonalizeDataAccessRole"

Next, open the data file and take a look at the first rows.

In [6]:
df_interactions = pd.read_csv(f'{raw_data_dir}/reads.csv')
df_interactions.head(5)

Unnamed: 0,logged_at,profile_id,post_id
0,2021-03-15 00:00:00+00,909996,6823552
1,2021-03-15 00:00:04+00,1716231,6803240
2,2021-03-15 00:00:07+00,1423473,162892
3,2021-03-15 00:00:07+00,1620867,6823347
4,2021-03-15 00:00:23+00,625376,162892


In [7]:
df_interactions.shape

(940850, 3)

In [8]:
df_interactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940850 entries, 0 to 940849
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   logged_at   940850 non-null  object
 1   profile_id  940850 non-null  int64 
 2   post_id     940850 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 21.5+ MB


This shows that the file has 940,850 rows and 3 columns: `profile_id` and `post_id` are stored as int64, and `logged_at` is stored as a string (`object`). There are no null values.

The int64 format is clearly suitable for `profile_id` and `post_id`, which will become the user and item IDs. The timestamps need more work. To use Amazon Personalize, you need to save timestamps in [Unix Epoch](https://en.wikipedia.org/wiki/Unix_time) format.

Currently, the `logged_at` values are human-readable strings such as `2021-03-15 00:00:00+00`. First parse them into pandas datetimes, then drop the original column.

In [9]:
df_interactions['TIMESTAMP'] =  pd.to_datetime(df_interactions['logged_at'])

In [10]:
df_interactions.drop('logged_at', axis=1, inplace=True)

Amazon Personalize has default column names for users, items, and timestamp. These default column names are `USER_ID`, `ITEM_ID`, and `TIMESTAMP`. The next cells convert the timestamp to an integer and replace the existing column headers with the default headers.

In [11]:
# astype('int64') yields nanoseconds since the epoch; Amazon Personalize expects seconds.
df_interactions['TIMESTAMP'] = df_interactions['TIMESTAMP'].astype('int64') // 10**9

In [12]:
df_interactions.rename(columns = {'profile_id': 'USER_ID', 'post_id': 'ITEM_ID'}, inplace = True)

In [13]:
df_interactions.head()

Unnamed: 0,USER_ID,ITEM_ID,TIMESTAMP
0,909996,6823552,1615766400
1,1716231,6803240,1615766404
2,1423473,162892,1615766407
3,1620867,6823347,1615766407
4,625376,162892,1615766423


In [14]:
df_interactions.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 940850 entries, 0 to 940849
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   USER_ID    940850 non-null  int64
 1   ITEM_ID    940850 non-null  int64
 2   TIMESTAMP  940850 non-null  int64
dtypes: int64(3)
memory usage: 21.5 MB
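
Amazon Personalize expects timestamps as Unix epoch seconds, so a quick sanity check is to convert one value to epoch seconds and back to a readable date. The sketch below is standalone and does not touch `df_interactions`; it uses the first `logged_at` value shown earlier, written with a full `+00:00` offset.

```python
import pandas as pd

# First logged_at value from the data, with the UTC offset written out in full.
raw = pd.Series(["2021-03-15 00:00:00+00:00"])

parsed = pd.to_datetime(raw, utc=True)

# Whole seconds elapsed since 1970-01-01 00:00:00 UTC.
epoch_start = pd.Timestamp("1970-01-01", tz="UTC")
epoch_seconds = (parsed - epoch_start) // pd.Timedelta(seconds=1)

# Convert back to confirm that the round trip gives the original instant.
round_trip = pd.to_datetime(epoch_seconds, unit="s", utc=True)
print(epoch_seconds.iloc[0], round_trip.iloc[0])
```

Dividing a time difference by one second makes the unit explicit, which avoids depending on the internal resolution pandas uses for datetimes.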


That's it! At this point the data is ready to go, and we just need to save it as a CSV file.

In [15]:
data_dir = './data/'
interactions_filename = "interactions.csv"
df_interactions.to_csv((data_dir+interactions_filename), index=False)

## Create dataset groups and the interactions dataset <a class="anchor" id="group_dataset"></a>
[Back to top](#top)

The highest level of isolation and abstraction with Amazon Personalize is a *dataset group*. Information stored within one of these dataset groups has no impact on any other dataset group or models created from one - they are completely isolated. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset groups can house the following types of information:

* User-item-interactions
* Event streams (real-time interactions)
* User metadata
* Item metadata

Before we create the dataset group and the dataset for our interaction data, let's validate that your environment can communicate successfully with Amazon Personalize.

In [16]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')


### Create the dataset group

The following cell creates a new dataset group with the name `rp-personalize-dsg`.

In [17]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "rp-personalize-dsg"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:662559257807:dataset-group/rp-personalize-dsg",
  "ResponseMetadata": {
    "RequestId": "6c4ad8c5-de61-46fc-b55e-acce767ec196",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 17:16:48 GMT",
      "x-amzn-requestid": "6c4ad8c5-de61-46fc-b55e-acce767ec196",
      "content-length": "97",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset group, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the dataset group every 5 seconds, up to a maximum of 3 hours.

In [18]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(5)

DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE


Now that you have a dataset group, you can create a dataset for the interaction data.

### Create the dataset

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for interactions data, which requires the `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields. These must be defined in the same order in the schema as they appear in the dataset.

In [19]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}


create_schema_response = personalize.create_schema(
    name = "interactions-schema",
    schema = json.dumps(interactions_schema)
)

interaction_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:662559257807:schema/interactions-schema",
  "ResponseMetadata": {
    "RequestId": "3885c445-f424-486f-90b8-ed242f84515b",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 17:17:09 GMT",
      "x-amzn-requestid": "3885c445-f424-486f-90b8-ed242f84515b",
      "content-length": "85",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet; it only defines the schema for the data. The data will be loaded a few steps later.

In [20]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "rp-interactions-dataset",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = interaction_schema_arn
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:662559257807:dataset/rp-personalize-dsg/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "74ddde67-5665-4d8e-84bb-906496c0a3e7",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 17:17:09 GMT",
      "x-amzn-requestid": "74ddde67-5665-4d8e-84bb-906496c0a3e7",
      "content-length": "99",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## Configure an S3 bucket and an IAM  role <a class="anchor" id="bucket_role"></a>
[Back to top](#top)

So far, we have downloaded, manipulated, and saved the data onto the Amazon EBS volume attached to the instance running this Jupyter notebook. However, Amazon Personalize needs an S3 bucket to act as the source of your data, as well as an IAM role for accessing that bucket. Let's set all of that up.

The Amazon S3 bucket needs to be in the same region as the Amazon Personalize resources we have been creating so far. The region was already read from the notebook's session into `region` earlier. If you are using a Jupyter notebook outside of Amazon SageMaker, simply define the region as a string instead.

Amazon S3 bucket names are globally unique. The cell below does not create a bucket; it points at an existing bucket and a key prefix under which the prepared files will be stored. Replace both values with your own.

In [21]:
s3 = boto3.client('s3')
s3bucket = 'am-tmp2'
s3prefix = 'rallypoint'

print(s3bucket)
print(s3prefix)

am-tmp2
rallypoint
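
The import job reads the file as the Amazon Personalize service, so besides the IAM role the bucket usually needs a policy that lets `personalize.amazonaws.com` read it. The sketch below only builds the policy document. The helper name and the statement ID are arbitrary labels chosen for this example, the bucket name is the one from the cell above, and attaching the policy is left as a commented-out call because it changes the bucket's permissions.

```python
import json

def personalize_bucket_policy(bucket_name):
    """Build a bucket policy that lets Amazon Personalize list and read the bucket."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # ListBucket applies to the bucket, GetObject to the objects in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }

policy = personalize_bucket_policy("am-tmp2")
print(json.dumps(policy, indent=2))

# To attach it (requires permission to modify the bucket):
# s3.put_bucket_policy(Bucket="am-tmp2", Policy=json.dumps(policy))
```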


### Upload data to S3

Now that the S3 bucket and prefix are set, upload the CSV file of our user-item-interaction data.

In [22]:
# Copy the prepared interactions CSV to S3 using the AWS CLI.

src_filename = f'{data_dir}{interactions_filename}'
interactions_s3filepath = f's3://{s3bucket}/{s3prefix}/{interactions_filename}'

!aws s3 cp $src_filename $interactions_s3filepath

upload: data/interactions.csv to s3://am-tmp2/rallypoint/interactions.csv


## Import the Interactions data <a class="anchor" id="import"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information, so now you will execute an import job that will load the data from the S3 bucket into the Amazon Personalize dataset. 

In [24]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "interactions-import",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": interactions_s3filepath
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:662559257807:dataset-import-job/interactions-import",
  "ResponseMetadata": {
    "RequestId": "0f4855c4-bb19-450e-b670-e9c7edd929da",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:06:56 GMT",
      "x-amzn-requestid": "0f4855c4-bb19-450e-b670-e9c7edd929da",
      "content-length": "107",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every 5 seconds, up to a maximum of 6 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes.

In [25]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(5)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob:

When the dataset import is active, you are ready to start building models with SIMS, Personalized-Ranking, and User Personalization. This process will continue in other notebooks.

# Validating and Importing Item Metadata <a class="anchor" id="top"></a>

Importing item metadata allows you to work with filters and also supports the `User Personalization` algorithm.


## Prepare your Item metadata <a class="anchor" id="prepare"></a>
[Back to top](#top)

Next we load the data and confirm that it is in a good state, then save it to a CSV in S3 where it is ready to be used with Amazon Personalize. The libraries imported earlier are reused here.

Open the data file and take a look at the first rows.

In [26]:
df_items = pd.read_csv(f'{raw_data_dir}/posts.csv')
df_items.head(5)

Unnamed: 0,post_id,type,ancestry,title,body,active,last_activity_at,profile_id,votes_count,created_at,...,comments_count,r_and_c_count,short_group_url,sponsored_post,root_type,best_of_rp,best_of_rp_setter_id,engagement_locked,command_post_type,qrc_groups
0,6621543,Comment,6084286/6099168,,**redacted contact** - and where do you get ...,1,,1580444,1,2021-01-01 00:04:27,...,0,0,,0,Question,0,,0,,"Retirement,Leadership,Values,Promotions"
1,6621544,Comment,6084286/6619121,,"col trinh, most people in the chain of command...",1,,1459261,0,2021-01-01 00:04:33,...,0,0,,0,Question,0,,0,,"Retirement,Leadership,Values,Promotions"
2,6621546,Response,6620500,,great accomplishments in difficult times. wat...,1,,1652327,2,2021-01-01 00:05:29,...,0,0,,0,SharedLink,0,,0,,"Awards,Physics,Theoretical Physics,Science"
3,6621547,Comment,6084286/6098802,,**redacted contact** - i certainly didn't vo...,1,,181760,0,2021-01-01 00:05:34,...,0,0,,0,Question,0,,0,,"Retirement,Leadership,Values,Promotions"
4,6621548,Comment,1157569/6592815,,tanana alaska january #### operation jack fros...,1,,1631106,2,2021-01-01 00:05:51,...,0,0,,0,Question,0,,0,,"Friends,Memories,Photography"


This does not really tell us much about the dataset, so we will explore a bit more and look at the raw information. We can see that the tags in `qrc_groups` often appear in comma-separated groups. That is fine for us; later cells turn the most common tags into separate columns.

In [27]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387721 entries, 0 to 387720
Data columns (total 42 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   post_id                 387721 non-null  int64  
 1   type                    387721 non-null  object 
 2   ancestry                368364 non-null  object 
 3   title                   3029 non-null    object 
 4   body                    380390 non-null  object 
 5   active                  387721 non-null  int64  
 6   last_activity_at        19357 non-null   object 
 7   profile_id              387721 non-null  int64  
 8   votes_count             387721 non-null  int64  
 9   created_at              387721 non-null  object 
 10  updated_at              387721 non-null  object 
 11  up_votes                387721 non-null  int64  
 12  down_votes              387721 non-null  int64  
 13  popularity              19357 non-null   float64
 14  net_votes           

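From this, you can see that there are a total of 387,721 entries in the dataset, with 42 columns. Several columns, such as `title` and `last_activity_at`, are mostly null.

Most of these columns are not needed as item metadata, so keep only the post ID, the net votes, the tags, and the creation time.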

In [28]:
# .copy() avoids pandas warnings when columns are added to this subset later
df_items = df_items[['post_id', 'net_votes', 'qrc_groups', 'created_at']].copy()
df_items.head(5)

Unnamed: 0,post_id,net_votes,qrc_groups,created_at
0,6621543,-1,"Retirement,Leadership,Values,Promotions",2021-01-01 00:04:27
1,6621544,0,"Retirement,Leadership,Values,Promotions",2021-01-01 00:04:33
2,6621546,2,"Awards,Physics,Theoretical Physics,Science",2021-01-01 00:05:29
3,6621547,0,"Retirement,Leadership,Values,Promotions",2021-01-01 00:05:34
4,6621548,2,"Friends,Memories,Photography",2021-01-01 00:05:51


The reduced dataset has just the post ID, the net votes, the comma-separated list of tags, and the creation time. The tag list is not directly usable as a single column, so the next cell fills missing tag lists with an empty string, counts how often each tag occurs, and adds a 0/1 indicator column for each of the ten most frequent tags.

In [29]:
df_items['qrc_groups'] = df_items['qrc_groups'].fillna('')
lists = df_items['qrc_groups'].tolist()

# Count how often each tag occurs
tags = {}
for li in lists:
    vals = li.split(',')
    for val in vals:
        if len(val):
            tags[val] = tags.get(val, 0) + 1

# Ten most frequent tags as (count, tag) pairs
topten = sorted([(value,key) for (key,value) in tags.items()], reverse=True)[0:10]
print(topten)

# One 0/1 indicator column per top tag (plain substring match on the tag list)
for tag in topten:
    tag_name = tag[1]
    df_items[tag_name] = df_items['qrc_groups'].str.contains(tag_name, regex=False).astype('int')

df_items.drop('qrc_groups', axis=1, inplace=True)

[(51928, 'Humor'), (28410, 'Quotes'), (26751, 'Motivation'), (26039, 'Inspiration'), (15086, 'American History'), (13600, 'Military History'), (11888, 'Leadership'), (11522, 'Donald Trump'), (11329, 'World History'), (11275, 'Joe Biden')]


In [30]:
df_items.head(5)

Unnamed: 0,post_id,net_votes,created_at,Humor,Quotes,Motivation,Inspiration,American History,Military History,Leadership,Donald Trump,World History,Joe Biden
0,6621543,-1,2021-01-01 00:04:27,0,0,0,0,0,0,1,0,0,0
1,6621544,0,2021-01-01 00:04:33,0,0,0,0,0,0,1,0,0,0
2,6621546,2,2021-01-01 00:05:29,0,0,0,0,0,0,0,0,0,0
3,6621547,0,2021-01-01 00:05:34,0,0,0,0,0,0,1,0,0,0
4,6621548,2,2021-01-01 00:05:51,0,0,0,0,0,0,0,0,0,0
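
The indicator columns above are built with `str.contains`, which matches substrings, so a tag such as `Humor` would also match a longer tag that contains it. As an alternative, here is a minimal sketch that counts tags with `collections.Counter` and tests exact membership. The four sample rows and the tag `Military Humor` are made up for illustration.

```python
from collections import Counter
import pandas as pd

# Made-up tag lists in the same comma-separated layout as qrc_groups.
posts = pd.DataFrame({
    "post_id": [1, 2, 3, 4],
    "qrc_groups": ["Humor,Quotes", "Military Humor", "", "Humor"],
})

# Split each cell into a list of tags, dropping empty strings.
tag_lists = posts["qrc_groups"].fillna("").apply(
    lambda cell: [t for t in cell.split(",") if t]
)

# Count every tag occurrence and keep the most frequent ones.
counts = Counter(tag for tags in tag_lists for tag in tags)
top_tags = [tag for tag, _ in counts.most_common(2)]

# Exact membership test, so "Military Humor" does not count as "Humor".
for tag in top_tags:
    posts[tag] = tag_lists.apply(lambda tags, t=tag: int(t in tags))

print(counts.most_common(2))
print(posts.drop(columns="qrc_groups"))
```

Whether substring or exact matching is the better fit depends on how the tags were assigned, so it is worth checking a few rows by hand before choosing.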


In [31]:
df_items.rename(columns = {'post_id': 'ITEM_ID', 'created_at': 'TIMESTAMP'}, inplace = True)
df_items.head(5)

Unnamed: 0,ITEM_ID,net_votes,TIMESTAMP,Humor,Quotes,Motivation,Inspiration,American History,Military History,Leadership,Donald Trump,World History,Joe Biden
0,6621543,-1,2021-01-01 00:04:27,0,0,0,0,0,0,1,0,0,0
1,6621544,0,2021-01-01 00:04:33,0,0,0,0,0,0,1,0,0,0
2,6621546,2,2021-01-01 00:05:29,0,0,0,0,0,0,0,0,0,0
3,6621547,0,2021-01-01 00:05:34,0,0,0,0,0,0,1,0,0,0
4,6621548,2,2021-01-01 00:05:51,0,0,0,0,0,0,0,0,0,0


In [32]:
# column names must not contain spaces
df_items.rename(columns = {'Donald Trump': 'Donald_Trump', 'Joe Biden': 'Joe_Biden', 'World History': 'World_History', 'American History': 'American_History', 'Military History': 'Military_History'}, inplace = True)
df_items.head(5)

Unnamed: 0,ITEM_ID,net_votes,TIMESTAMP,Humor,Quotes,Motivation,Inspiration,American_History,Military_History,Leadership,Donald_Trump,World_History,Joe_Biden
0,6621543,-1,2021-01-01 00:04:27,0,0,0,0,0,0,1,0,0,0
1,6621544,0,2021-01-01 00:04:33,0,0,0,0,0,0,1,0,0,0
2,6621546,2,2021-01-01 00:05:29,0,0,0,0,0,0,0,0,0,0
3,6621547,0,2021-01-01 00:05:34,0,0,0,0,0,0,1,0,0,0
4,6621548,2,2021-01-01 00:05:51,0,0,0,0,0,0,0,0,0,0


In [33]:
df_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387721 entries, 0 to 387720
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   ITEM_ID           387721 non-null  int64 
 1   net_votes         387721 non-null  int64 
 2   TIMESTAMP         387721 non-null  object
 3   Humor             387721 non-null  int64 
 4   Quotes            387721 non-null  int64 
 5   Motivation        387721 non-null  int64 
 6   Inspiration       387721 non-null  int64 
 7   American_History  387721 non-null  int64 
 8   Military_History  387721 non-null  int64 
 9   Leadership        387721 non-null  int64 
 10  Donald_Trump      387721 non-null  int64 
 11  World_History     387721 non-null  int64 
 12  Joe_Biden         387721 non-null  int64 
dtypes: int64(12), object(1)
memory usage: 38.5+ MB


In [34]:
df_items['TIMESTAMP'] = pd.to_datetime(df_items['TIMESTAMP'])
# astype('int64') yields nanoseconds since the epoch; divide to get seconds.
df_items['TIMESTAMP'] = df_items['TIMESTAMP'].astype('int64') // 10**9
df_items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 387721 entries, 0 to 387720
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype
---  ------            --------------   -----
 0   ITEM_ID           387721 non-null  int64
 1   net_votes         387721 non-null  int64
 2   TIMESTAMP         387721 non-null  int64
 3   Humor             387721 non-null  int64
 4   Quotes            387721 non-null  int64
 5   Motivation        387721 non-null  int64
 6   Inspiration       387721 non-null  int64
 7   American_History  387721 non-null  int64
 8   Military_History  387721 non-null  int64
 9   Leadership        387721 non-null  int64
 10  Donald_Trump      387721 non-null  int64
 11  World_History     387721 non-null  int64
 12  Joe_Biden         387721 non-null  int64
dtypes: int64(13)
memory usage: 38.5 MB


In [35]:
items_filename = "item-meta.csv"
# data_dir already ends with a slash
df_items.to_csv((data_dir+items_filename), index=False)

### Create the dataset

First, define a schema to tell Amazon Personalize what type of dataset you are uploading. There are several reserved and mandatory keywords required in the schema, based on the type of dataset. More detailed information can be found in the [documentation](https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html).

Here, you will create a schema for item metadata, and we define the `ITEM_ID`, `net_votes`, and `TIMESTAMP` fields plus one categorical field for each of the ten tag columns. These must be defined in the same order in the schema as they appear in the dataset.

In [36]:
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "net_votes",
            "type": "long"
        },
        {
            "name": "Humor",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Quotes",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Motivation",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Inspiration",
            "type": "int",
            "categorical": True
        },
        {
            "name": "American_History",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Military_History",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Leadership",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Donald_Trump",
            "type": "int",
            "categorical": True
        },
        {
            "name": "World_History",
            "type": "int",
            "categorical": True
        },
        {
            "name": "Joe_Biden",
            "type": "int",
            "categorical": True
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "items-schema",
    schema = json.dumps(items_schema)
)

items_metadataschema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:662559257807:schema/items-schema",
  "ResponseMetadata": {
    "RequestId": "049188f5-5f65-4c18-bf02-29ef32636593",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:11:04 GMT",
      "x-amzn-requestid": "049188f5-5f65-4c18-bf02-29ef32636593",
      "content-length": "78",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
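
Since the schema fields are expected in the same order as the CSV columns, it can help to compare the two before uploading. Note that `TIMESTAMP` is third in the prepared data shown earlier but last in the schema above. The sketch below is a minimal illustration with a reduced field list and one made-up row; the helper name `columns_in_schema_order` is an invention for this example.

```python
import pandas as pd

def columns_in_schema_order(df, schema):
    """Return df with its columns arranged in the order of the schema fields."""
    field_names = [field["name"] for field in schema["fields"]]
    missing = [name for name in field_names if name not in df.columns]
    extra = [name for name in df.columns if name not in field_names]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    return df[field_names]

# Reduced stand-in for the items schema: only the field names matter here.
schema = {"fields": [{"name": n} for n in ["ITEM_ID", "net_votes", "Humor", "TIMESTAMP"]]}

# One made-up row with TIMESTAMP in third position, as in the prepared data.
df = pd.DataFrame([[6621543, -1, 1609459467, 0]],
                  columns=["ITEM_ID", "net_votes", "TIMESTAMP", "Humor"])

ordered = columns_in_schema_order(df, schema)
print(list(ordered.columns))
```

In the notebook, the same helper could be applied to `df_items` with `items_schema` before the CSV is written.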


With a schema created, you can create a dataset within the dataset group. Note that this does not load the data yet, but creates a schema of what the data looks like. We will upload the data a few steps later.

In [37]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
    name = "items-metadata-dataset",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = items_metadataschema_arn
)

items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:662559257807:dataset/rp-personalize-dsg/ITEMS",
  "ResponseMetadata": {
    "RequestId": "522e5b1d-2191-4867-a9c4-293ae52e206e",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:11:10 GMT",
      "x-amzn-requestid": "522e5b1d-2191-4867-a9c4-293ae52e206e",
      "content-length": "92",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### Upload data to S3

We upload the CSV file of our item metadata to the S3 bucket we created previously.

In [38]:
src_filename = f'{data_dir}{items_filename}'
items_s3filepath = f's3://{s3bucket}/{s3prefix}/{items_filename}'

!aws s3 cp $src_filename $items_s3filepath

upload: data/item-meta.csv to s3://am-tmp2/rallypoint/item-meta.csv
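
The upload target above is assembled with an f-string, which would produce a doubled slash if `s3prefix` ended in `/`. The sketch below shows one way to build and take apart such a URI defensively. The helper names `build_s3_uri` and `split_s3_uri` are hypothetical and not part of the notebook; the bucket and prefix values are copied from the upload output above.

```python
def build_s3_uri(bucket, prefix, filename):
    """Join bucket, prefix and filename into an s3:// URI without doubled slashes."""
    parts = [bucket.strip("/")]
    parts += [p for p in prefix.split("/") if p]   # drop empty segments
    parts.append(filename.lstrip("/"))
    return "s3://" + "/".join(parts)


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError("not an S3 URI: {}".format(uri))
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


# Values taken from the upload output above; the trailing slash is deliberate.
uri = build_s3_uri("am-tmp2", "rallypoint/", "item-meta.csv")
bucket, key = split_s3_uri(uri)
```

Splitting the URI back into bucket and key is handy if you later replace the `aws s3 cp` shell call with an SDK call that takes the two values separately.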


## Import the Items metadata <a class="anchor" id="import"></a>
[Back to top](#top)

Earlier you created the dataset group and dataset to house your information. Now you will execute an import job that loads the data from the S3 bucket into the Amazon Personalize dataset.

In [40]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "rp-item-metadata-import",
    datasetArn = items_dataset_arn,
    dataSource = {
        "dataLocation": items_s3filepath
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:662559257807:dataset-import-job/rp-item-metadata-import",
  "ResponseMetadata": {
    "RequestId": "16b96226-afd9-41cf-8c46-33a052915e44",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:12:14 GMT",
      "x-amzn-requestid": "16b96226-afd9-41cf-8c46-33a052915e44",
      "content-length": "111",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset, the import job must be active. Execute the cell below and wait for it to show the ACTIVE status. It checks the status of the import job every 30 seconds, up to a maximum of 6 hours.

Importing the data can take some time, depending on the size of the dataset. In this workshop, the data import job should take around 15 minutes.

In [41]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(30)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 43.3 ms, sys: 4.18 ms, total: 47.5 ms
Wall time: 4min
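
The same polling loop is used again for the users import, so it could be factored into a small helper. The following is only a sketch: `wait_for_status` and its parameters are hypothetical names, and the status source is a stand-in iterator so that the example runs without AWS access.

```python
import time


def wait_for_status(get_status, done_states=("ACTIVE", "CREATE FAILED"),
                    poll_seconds=30, timeout_seconds=6 * 60 * 60,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns one of done_states.

    Returns the terminal status; raises TimeoutError if none is
    reached within timeout_seconds.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        print("DatasetImportJob: {}".format(status))
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError("last status: {}".format(status))
        sleep(poll_seconds)


# Stand-in for the describe_dataset_import_job call, so the sketch runs offline.
_fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(_fake_statuses),
                               sleep=lambda seconds: None)
```

In the notebook, `get_status` would wrap the call shown above, returning `personalize.describe_dataset_import_job(datasetImportJobArn=dataset_import_job_arn)["datasetImportJob"]["status"]`. Unlike the original loop, this version raises on timeout instead of silently falling through.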


# Validating and Importing User Metadata <a class="anchor" id="top"></a>

Importing user metadata allows you to work with filters and also supports the `User Personalization` algorithm.


In [42]:
df_users = pd.read_csv(f'{raw_data_dir}/profiles.csv')
df_users.head(5)




| | profile_id | marital_status | age | gender | office_level | rank | member_type | civilian_title | created_at | branch_component | ... | suspended_by | member_admin_stars | academic_availability | goal_type | job_score | education_score | financial_score | member_unit_name | member_unit_link | profile_groups |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 604 | Married | 44.0 | Male | 18A: Special Forces Officer | LTC | Servicemember | Board Member | 2012-08-10 13:42:36 | Reserve | ... | -- | 4 | 0000-00-00 00:00:00 | old_friends,employment_transition | 0.0 | 0.014041 | 0.014739 | DIU, OSD | https://www.rallypoint.com/units/diu-defense-i... | Travel,Military History,Hiking,Formula 1,Talen... |
| 1 | 605 | Married | | Male | 13A: Field Artillery Officer | CPT | Veteran | Co-Founder | 2012-08-10 14:33:07 | Active | ... | -- | 4 | 0000-00-00 00:00:00 | new_people,veteran_topics | 0.0 | 0.013838 | 0.023 | Fort Knox WTBN, NRMC (WTC), WTC, MEDCOM | https://www.rallypoint.com/units/fort-knox-wtb... | Education,Military Career,Promotions,Mentorshi... |
| 2 | 607 | Married | 39.0 | Male | 63AX: Acquisition Manager | Capt | Veteran | Vice President of Account Management | 2012-08-10 18:11:56 | Active | ... | -- | 0 | 0000-00-00 00:00:00 | civilian_career | 0.430835 | 0.037638 | 0.032786 | SDTD, SMC, AFSPC | https://www.rallypoint.com/units/sdtd-space-de... | RallyPoint,Transition,Networking,Civilian Care... |
| 3 | 610 | Married | 40.0 | Male | 25A: Signal Officer | LTC | Servicemember | | 2012-08-10 21:16:49 | Active | ... | -- | 0 | 0000-00-00 00:00:00 | military_career,military_topics | 0.0 | 0.022479 | 0.006713 | J6 | https://www.rallypoint.com/units/j6-j6-command... | Hiking,BBQ,Snorkeling |
| 4 | 619 | Married | 38.0 | Male | 64PX: Contracting | Lt Col | Servicemember | Small Business Liaison/Procurement Analyst | 2012-08-12 18:15:54 | Reserve | ... | -- | 0 | 0000-00-00 00:00:00 | new_people,military_career | 0.002546 | 0.012981 | 0.0 | DCMA, ASD ACQ, USD AT&L, OSD | https://www.rallypoint.com/units/dcma-defense-... | Networking,Family,Mentorship,Firearms and Guns... |


In [43]:
# Keep only the columns we need; .copy() avoids SettingWithCopyWarning on later assignments
df_users = df_users[['profile_id', 'marital_status', 'age', 'gender', 'rank', 'created_at']].copy()
df_users.head(5)

| | profile_id | marital_status | age | gender | rank | created_at |
|---|---|---|---|---|---|---|
| 0 | 604 | Married | 44.0 | Male | LTC | 2012-08-10 13:42:36 |
| 1 | 605 | Married | | Male | CPT | 2012-08-10 14:33:07 |
| 2 | 607 | Married | 39.0 | Male | Capt | 2012-08-10 18:11:56 |
| 3 | 610 | Married | 40.0 | Male | LTC | 2012-08-10 21:16:49 |
| 4 | 619 | Married | 38.0 | Male | Lt Col | 2012-08-12 18:15:54 |


In [44]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60395 entries, 0 to 60394
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profile_id      60395 non-null  int64  
 1   marital_status  11008 non-null  object 
 2   age             12806 non-null  float64
 3   gender          60395 non-null  object 
 4   rank            56527 non-null  object 
 5   created_at      60395 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 2.8+ MB


In [45]:
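# age, marital_status and rank contain missing values; fill them before export.
# Assigning the result back (instead of inplace=True on a column) also works under pandas copy-on-write.
df_users['age'] = df_users['age'].fillna(df_users['age'].mean())
df_users['marital_status'] = df_users['marital_status'].fillna('unknown')
df_users['rank'] = df_users['rank'].fillna('unknown')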
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60395 entries, 0 to 60394
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profile_id      60395 non-null  int64  
 1   marital_status  60395 non-null  object 
 2   age             60395 non-null  float64
 3   gender          60395 non-null  object 
 4   rank            60395 non-null  object 
 5   created_at      60395 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 2.8+ MB
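
To make the effect of mean imputation concrete, here is a small pandas-free sketch using the five ages shown in the preview table. The function name `impute_mean` is hypothetical. Note that in the full dataset only 12806 of 60395 ages are present, so roughly four in five users end up with the same imputed age, which is worth keeping in mind when interpreting the `age` feature.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# First five ages from the preview table; None marks the missing one.
ages = [44.0, None, 39.0, 40.0, 38.0]
imputed = impute_mean(ages)
missing_share = ages.count(None) / len(ages)   # fraction of values that were filled
```

Because `age` is later cast to `int`, a fractional mean is truncated at that step.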


In [46]:
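df_users.rename(columns = {'profile_id': 'USER_ID', 'created_at': 'TIMESTAMP'}, inplace=True)
# Convert to Unix epoch seconds; a plain astype('int64') on datetime64[ns] would yield nanoseconds
df_users['TIMESTAMP'] = pd.to_datetime(df_users['TIMESTAMP'])
df_users['TIMESTAMP'] = (df_users['TIMESTAMP'] - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
df_users['age'] = df_users['age'].astype('int')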

df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60395 entries, 0 to 60394
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   USER_ID         60395 non-null  int64 
 1   marital_status  60395 non-null  object
 2   age             60395 non-null  int64 
 3   gender          60395 non-null  object
 4   rank            60395 non-null  object
 5   TIMESTAMP       60395 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 2.8+ MB
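
If the `TIMESTAMP` column is meant to hold Unix epoch seconds (the format Amazon Personalize uses for interaction timestamps), the unit of the integer matters: casting a nanosecond-resolution pandas datetime column to `int64` yields nanoseconds, a value 10^9 times larger. The sketch below illustrates the difference with the standard library only. It assumes the `created_at` strings are in UTC, which the source data does not state, and `to_epoch_seconds` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_epoch_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string and return Unix epoch seconds (assumes UTC)."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


# created_at of the first profile in the preview table
seconds = to_epoch_seconds("2012-08-10 13:42:36")
# The value a datetime64[ns] -> int64 cast would produce for the same instant
nanoseconds = seconds * 10**9

# Round trip back to a readable string as a sanity check
round_trip = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
```

A quick check after conversion is to confirm that the values have about 10 digits (seconds) rather than 19 (nanoseconds).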


In [47]:
users_filename = "user-meta.csv"
# Build the path the same way as the upload cell so both refer to the same file
df_users.to_csv(f'{data_dir}{users_filename}', index=False)

In [48]:
users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "marital_status",
            "type": "string",
            "categorical": True
        },
        {
            "name": "age",
            "type": "int",
            "categorical": True
        },
        {
            "name": "gender",
            "type": "string",
            "categorical": True
        },
        {
            "name": "rank",
            "type": "string",
            "categorical": True
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "users-schema",
    schema = json.dumps(users_schema)
)

users_metadataschema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:662559257807:schema/users-schema",
  "ResponseMetadata": {
    "RequestId": "337f97f2-6dc5-4f0d-9a41-a03f34ac6a1c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:22:43 GMT",
      "x-amzn-requestid": "337f97f2-6dc5-4f0d-9a41-a03f34ac6a1c",
      "content-length": "78",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
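
Before starting an import, it can help to confirm that the CSV header and the schema agree on field names, since a mismatch may only surface once the import job fails. The following is a minimal sketch using the standard library. `check_header` is a hypothetical helper, the schema is trimmed to field names only, and the sample row is illustrative rather than taken from `user-meta.csv`.

```python
import csv
import io


def check_header(csv_text, schema):
    """Return (missing, extra): schema fields absent from the CSV header,
    and header columns the schema does not declare."""
    header = next(csv.reader(io.StringIO(csv_text)))
    declared = [field["name"] for field in schema["fields"]]
    missing = [name for name in declared if name not in header]
    extra = [name for name in header if name not in declared]
    return missing, extra


# Field names copied from users_schema above; types are omitted for brevity.
schema = {"fields": [{"name": n} for n in
                     ("USER_ID", "TIMESTAMP", "marital_status", "age", "gender", "rank")]}

# Illustrative two-line CSV in the column order of df_users.
sample = ("USER_ID,marital_status,age,gender,rank,TIMESTAMP\n"
          "604,Married,44,Male,LTC,1344606156\n")
missing, extra = check_header(sample, schema)
```

Both lists come back empty when every declared field has a matching column. The check compares names only and does not validate types or values.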


In [49]:
dataset_type = "USERS"
create_dataset_response = personalize.create_dataset(
    name = "users-metadata-dataset",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = users_metadataschema_arn
)

users_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:662559257807:dataset/rp-personalize-dsg/USERS",
  "ResponseMetadata": {
    "RequestId": "ce305747-9735-45a7-a80a-4c1939199848",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:22:44 GMT",
      "x-amzn-requestid": "ce305747-9735-45a7-a80a-4c1939199848",
      "content-length": "92",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [50]:
src_filename = f'{data_dir}{users_filename}'
users_s3filepath = f's3://{s3bucket}/{s3prefix}/{users_filename}'

!aws s3 cp $src_filename $users_s3filepath

upload: data/user-meta.csv to s3://am-tmp2/rallypoint/user-meta.csv


In [54]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "rp-users-metadata-import",
    datasetArn = users_dataset_arn,
    dataSource = {
        "dataLocation": users_s3filepath
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:662559257807:dataset-import-job/rp-users-metadata-import",
  "ResponseMetadata": {
    "RequestId": "e78ab2f5-9fe8-4a58-9dba-09a1636a655c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Wed, 13 Oct 2021 18:23:13 GMT",
      "x-amzn-requestid": "e78ab2f5-9fe8-4a58-9dba-09a1636a655c",
      "content-length": "112",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [55]:
%%time

max_time = time.time() + 6*60*60 # 6 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(30)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
CPU times: user 37.6 ms, sys: 11.5 ms, total: 49.1 ms
Wall time: 4min


With all three imports (Interactions, Users, Items) now complete, you can enable filtering for your recommendations.  
Before moving on, run the cell below to store a few values for use in the next notebooks.  
After that cell completes, open the notebook `training.ipynb` to continue.

In [56]:
%store dataset_group_arn
%store s3bucket
%store s3prefix
%store role_arn
%store data_dir
%store region
%store interactions_dataset_arn
%store interaction_schema_arn
%store items_dataset_arn
%store items_metadataschema_arn
%store users_dataset_arn
%store users_metadataschema_arn

Stored 'dataset_group_arn' (str)
Stored 's3bucket' (str)
Stored 's3prefix' (str)
Stored 'role_arn' (str)
Stored 'data_dir' (str)
Stored 'region' (str)
Stored 'interactions_dataset_arn' (str)
Stored 'interaction_schema_arn' (str)
Stored 'items_dataset_arn' (str)
Stored 'items_metadataschema_arn' (str)
Stored 'users_dataset_arn' (str)
Stored 'users_metadataschema_arn' (str)
