# 1. Validating and Importing User-Item-Interaction Data


For the most part, each algorithm in Amazon Personalize addresses a different task, as explained here:

1. HRNN & HRNN-Metadata - Personalization
1. HRNN Coldstart - Personalization that promotes new content
1. Personalized-Ranking - Takes a collection of items and then orders them in probable order of interest using an HRNN-like approach.
1. SIMS (Similar Items) - Given one item, which other items do users also interact with.
1. Popularity-Count - Which items are most popular. If HRNN or HRNN-Metadata does not have an answer for the user you query, this is what is returned by default.


No matter the use case, the algorithms all learn from user-item-interaction data, which is defined by three core attributes:

1. UserID - User who interacted
1. ItemID - Item the user interacted with
1. Timestamp - When the interaction occurred

We also support event types and event values defined by:

1. Event Type - Categorical label of an event (browse, purchased, rated, etc.).
1. Event Value - A numeric value corresponding to the event type that occurred. Generally speaking, we look to normalize the values between 0 and 1 across the event types. So if there are three phases to complete a transaction (clicked, added-to-cart, and purchased), there would be an event_value for each phase of 0.33, 0.66, and 1.0 respectively.

In this exercise, we will ignore event_type and event_value. They can come in handy later but are skipped for the initial POC.
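
As a rough illustration of the event-value idea above, the sketch below assigns evenly spaced values to an ordered list of event types. The helper name `event_values` and the three-step funnel are assumptions made for this example; they are not part of the Personalize API, and rounding to two decimals gives 0.67 rather than 0.66 for the middle step.

```python
def event_values(ordered_event_types):
    """Map each event type to an evenly spaced value in (0, 1].

    The list is assumed to be ordered from weakest to strongest signal,
    so the last event type always maps to 1.0.
    """
    count = len(ordered_event_types)
    return {
        event_type: round((position + 1) / count, 2)
        for position, event_type in enumerate(ordered_event_types)
    }


# Hypothetical purchase funnel, weakest signal first.
funnel = ["clicked", "added-to-cart", "purchased"]
values = event_values(funnel)
print(values)
```

Even spacing is only one possible choice; if one event is a much stronger signal than the others, hand-picked weights may fit better.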

----

## Choosing a Dataset or Data Source

As we mentioned, the user-item-interaction data is key for getting started with the service. This means we need to look for use cases that generate that kind of data. A few common examples are:

1. Video-on-Demand applications
1. E-Commerce platforms
1. Social-Media aggregators/platforms

If the problem is correctly sized for Personalize, the data should meet the minimum recommendations below:

* Authenticated users
* At least 50 users
* At least 100 items
* At least 2 dozen interactions for each. 

Most of the time, these minimums are easily attainable, and if you are low in one category, you can often make up for it with a larger number in another.
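
One way to compare a dataset against the recommendations above is to count users, items, and interactions directly. The sketch below is a minimal example that assumes the interactions are already in a Pandas DataFrame with `USER_ID` and `ITEM_ID` columns; the function name `summarize_interactions` and the sample frame are made up for illustration.

```python
import pandas as pd


def summarize_interactions(df, user_col="USER_ID", item_col="ITEM_ID"):
    """Return basic counts to compare against the recommended minimums."""
    per_user = df.groupby(user_col).size()
    return {
        "users": int(df[user_col].nunique()),
        "items": int(df[item_col].nunique()),
        "interactions": int(len(df)),
        "median_interactions_per_user": float(per_user.median()),
    }


# Tiny made-up frame, only to show the shape of the output.
sample = pd.DataFrame({
    "USER_ID": [1, 1, 2, 3, 3, 3],
    "ITEM_ID": [10, 11, 10, 12, 13, 10],
})
summary = summarize_interactions(sample)
print(summary)
```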

Your data will not arrive in a perfect form for this application and will take some modifications to get structured correctly. This notebook looks to guide you through all of that. 

To begin with, we are going to use the Last.FM dataset found [here](https://grouplens.org/datasets/hetrec-2011/). This data fits our guidelines with a large number of users, items, and interactions. 

Next, you will use the cells below to create a folder for the example data as well as download the dataset for analysis.

In [None]:
data_dir = "poc_data"
!rm -rf $data_dir
!mkdir -p $data_dir
!cd $data_dir && wget http://files.grouplens.org/datasets/hetrec2011/hetrec2011-lastfm-2k.zip
!cd $data_dir && unzip hetrec2011-lastfm-2k.zip

At present, not much is known about the data. Opening the readme will tell us about the overall structure of this data. This is a step you probably can skip with custom data unless the data source is coming from an external team. 

Note that the readme does not seem to be encoded as UTF-8, and Jupyter Lab/Notebooks do not render text documents that are not UTF-8. It is recommended that you open a terminal and `cat` the file to see its contents.
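
If you would rather stay inside the notebook, a file that is not valid UTF-8 can also be read from Python by trying a fallback encoding. The sketch below is an assumption-laden example: the helper `read_text_lenient` is hypothetical, and Latin-1 is only a guess at the readme's real encoding (it accepts any byte sequence, so it never fails, but some characters may be displayed incorrectly).

```python
import os
import tempfile
from pathlib import Path


def read_text_lenient(path, encodings=("utf-8", "latin-1")):
    """Decode a file with the first encoding in `encodings` that works."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: decode as UTF-8 and mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


# Demo on a throwaway file holding one Latin-1 byte (0xE9, "é").
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"caf\xe9")
demo_text = read_text_lenient(handle.name)
os.remove(handle.name)
print(demo_text)
```

To use it on the dataset, pass the path of the readme inside `data_dir` to `read_text_lenient` and print the result.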


Performing that yielded some interesting stats about the data:

```
---------------
Data statistics
---------------

    1892 users
   17632 artists
      
   12717 bi-directional user friend relations, i.e. 25434 (user_i, user_j) pairs
         avg. 13.443 friend relations per user
         
   92834 user-listened artist relations, i.e. tuples [user, artist, listeningCount]
         avg. 49.067 artists most listened by each user
         avg. 5.265 users who listened each artist
            
   11946 tags  
   
  186479 tag assignments (tas), i.e. tuples [user, tag, artist]
         avg. 98.562 tag per user
         avg. 14.891 tag per artist
         avg. 18.930 distinct tags used by each user
         avg. 8.764 distinct tags used for each artist

-----
```

We are focusing on the users, the artists, and the listening relations, of which there are 1892, 17632, and 92834 respectively. That is a considerable volume of data for getting started.

The focus of this notebook is, again, the interactions, so we should look for data that supports them.


```
-----
Files
-----
            
   * artists.dat
   
        This file contains information about music artists listened and tagged by users.
   
   * tags.dat
   
        This file contains the set of tags available in the dataset.

   * user_artists.dat
   
        This file contains the artists listened by each user.
        
        It also provides a listening count for each [user, artist] pair.

   * user_taggedartists.dat - user_taggedartists-timestamps.dat
   
        These files contain the tag assignments of artists provided by each particular user.
        
        They also contain the timestamps when the tag assignments were done.
   
   * user_friends.dat
   
        These files contain the friend relations between users in the database.
     
-----------
Data format
-----------

   The data is formatted one entry per line as follows (tab separated, "\t"):

   * artists.dat
   
        id \t name \t url \t pictureURL

        Example:
        707     Metallica       http://www.last.fm/music/Metallica      http://userserve-ak.last.fm/serve/252/7560709.jpg

   * tags.dat
 
        tagID \t tagValue
        1       metal
 
   * user_artists.dat
   
        userID \t artistID \t weight
        2       51      13883
   
   * user_taggedartists.dat
  
        userID \t artistID \t tagID \t day \t month \t year
        2       52      13      1       4       2009  
  
   * user_taggedartists-timestamps.dat

        userID \t artistID \t tagID \t timestamp
        2       52      13      1238536800000

   * user_friends.dat

        userID \t friendID
        2       275

```

So right off the bat, we see a problem: although there is data on users interacting with artists, it is stored only as a weight, with no actual timestamp. 

It does look like there is a timestamp for when a user tagged an artist, so what if we assume that tagging is a positive indicator and use that data to get started? It seems like a reliable approach for the POC, so that is what we are going to do now. 

The schema for the `user_taggedartists-timestamps.dat` is:

| userID | artistID | tagID | timestamp     |
|--------|----------|-------|---------------|
| 2      | 52       | 13    | 1238536800000 |


That looks pretty good for our base. Only the `tagID` needs to be removed. 

----

## Preparing Your Data

The next step is to read the data with Pandas, confirm that it is in a good state, and save it to a CSV that is ready to be used with Amazon Personalize.

Import the Pandas library as well as a few other data science tools in order to inspect the information.

In [None]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
from datetime import datetime
import uuid

Next, open the file with Pandas and take a look at the contents:

In [None]:
original_data = pd.read_csv(data_dir + '/user_taggedartists-timestamps.dat')
original_data.head(5)

Well, that did not work so well. It looks like the tab delimiter needs to be specified, so here is attempt 2:

In [None]:
original_data = pd.read_csv(data_dir + '/user_taggedartists-timestamps.dat', delimiter='\t')
original_data.head(5)

The data looks really good here, but let's get some extra insights on it.

In [None]:
original_data.info()

In [None]:
original_data.describe()

There is clearly a range of values for all of the columns, which is excellent. The last thing to be mindful of is that the timestamp should be in Unix Epoch format. You can learn more about the format [here](https://en.wikipedia.org/wiki/Unix_time).

Let us grab an arbitrary row, convert its timestamp to a DateTime, and confirm that it feels like a reasonable value for the historical data.

In [None]:
# Uncomment the next lines to show an error

arb_time_stamp = original_data.iloc[50]['timestamp']
#print(arb_time_stamp)
#print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))


For this particular value, it rendered a year of 41,132... a bit into the future for us, so somehow, we parsed it incorrectly. Attempt number 2...

JavaScript records time in milliseconds and this is a collection of data from a web application, so divide by 1000 first and see what is returned:

In [None]:
arb_time_stamp = arb_time_stamp/1000
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))

Feb 2009 feels completely reasonable, so now move forward by transforming each row in the data frame in the same way.

In [None]:
original_data.head(5)
original_data.timestamp = original_data.timestamp / 1000
original_data.head(5)

Now to see if the timestamps hold correctly:

In [None]:
arb_time_stamp = original_data.iloc[50]['timestamp']
print(arb_time_stamp)
print(datetime.utcfromtimestamp(arb_time_stamp).strftime('%Y-%m-%d %H:%M:%S'))
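
The millisecond-versus-second check above can also be written as a small helper. This is only a sketch: the name `to_epoch_seconds` and the cut-off constant are assumptions for this example, and the heuristic would misread millisecond values from before late April 1970. It uses integer division so the result stays a whole number, and a timezone-aware conversion for display.

```python
from datetime import datetime, timezone

# Values at or above this are assumed to be milliseconds. Read as seconds,
# this cut-off falls in the year 2286.
SECONDS_CUTOFF = 10_000_000_000


def to_epoch_seconds(value):
    """Return a Unix timestamp in whole seconds, guessing the input unit."""
    value = int(value)
    return value // 1000 if value >= SECONDS_CUTOFF else value


# Sample value taken from the readme's user_taggedartists-timestamps.dat example.
seconds = to_epoch_seconds(1238536800000)
readable = datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
print(seconds, readable)
```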

This looks exactly like what we are after. Now we can drop the `tagID` column. First, make a copy of the `df`.

In [None]:
interactions_df = original_data.copy()
interactions_df = interactions_df[['userID', 'artistID', 'timestamp']]
interactions_df.head()

In [None]:
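# Assign the result back; astype returns a new frame and does not modify in place
interactions_df = interactions_df.astype({'timestamp': 'int64'})
interactions_df.dtypes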

In [None]:
interactions_df.head()

Personalize has default column names for users, items, and timestamp, so now we will rename the columns in our data set to match.

In [None]:
interactions_df.rename(columns={
    'userID':'USER_ID', 
    'artistID':'ITEM_ID', 
    'timestamp':'TIMESTAMP'
}, inplace = True) 

At this point the data is ready to go; we just need to save it as a CSV.

In [None]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')
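
Before uploading, it can be worth re-reading the saved file to confirm that it has the expected header and whole-number timestamps. The sketch below uses only the standard library; the helper `check_interactions_csv` and the demo rows are hypothetical, and the checks are a convenience rather than anything Personalize requires.

```python
import csv
import os
import tempfile

REQUIRED_COLUMNS = ("USER_ID", "ITEM_ID", "TIMESTAMP")


def check_interactions_csv(path, required=REQUIRED_COLUMNS):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        header = reader.fieldnames or []
        missing = [name for name in required if name not in header]
        if missing:
            return ["missing columns: " + ", ".join(missing)]
        # Data starts on line 2 because line 1 is the header.
        for line_number, row in enumerate(reader, start=2):
            for name in required:
                if not row[name]:
                    problems.append("line {}: empty {}".format(line_number, name))
            timestamp = row["TIMESTAMP"] or ""
            if timestamp and not timestamp.isdigit():
                problems.append(
                    "line {}: TIMESTAMP {!r} is not a whole number".format(line_number, timestamp)
                )
    return problems


# Demo on a throwaway file: the second data row still has a float timestamp.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv", newline="") as handle:
    handle.write("USER_ID,ITEM_ID,TIMESTAMP\n2,52,1238536800\n2,53,1238536800.5\n")
demo_problems = check_interactions_csv(handle.name)
os.remove(handle.name)
print(demo_problems)
```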

----

## Creating Dataset Groups and the Interactions Dataset

The highest level of isolation and abstraction with Amazon Personalize is a Dataset Group. Information stored within one of these has no impact on any other dataset group or models created from one. This allows you to run many experiments and is part of how we keep your models private and fully trained only on your data. 

Before importing the data prepared earlier, there needs to be a dataset group and a dataset added to it that handles the interactions.

Dataset Groups can house the following types of information:

* User-Item-Interactions
* Event Streams ( Real-time Interactions )
* User Metadata
* Item Metadata

The cells below will create the dataset group and the dataset for interactions.



Now validate that your environment can communicate successfully with Amazon Personalize. The lines below do just that.

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Create the Dataset Group

In [None]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-poc-lastfm-"+str(uuid.uuid4())
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

#### Wait for Dataset Group to Have ACTIVE Status

Before we can use the Dataset Group in any of the steps below, it must be active. Execute the cell below and wait for it to show active.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
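
The same wait-until-active pattern is needed again for the import job later on, so it could be factored into one helper. The sketch below is a possible shape for it, not the notebook's own code: `wait_for_status` is a hypothetical name, and the demo feeds it canned statuses with a no-op sleep so that no AWS call is made.

```python
import time


def wait_for_status(get_status, done=("ACTIVE", "CREATE FAILED"),
                    interval=60, timeout=3 * 60 * 60, sleep=time.sleep):
    """Call get_status() until it returns a value in `done` or time runs out."""
    deadline = time.time() + timeout
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status in done:
            return status
        if time.time() >= deadline:
            raise TimeoutError("gave up waiting; last status was {}".format(status))
        sleep(interval)


# Demo with canned statuses and a no-op sleep.
canned = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_for_status(lambda: next(canned), sleep=lambda seconds: None)
print(final_status)
```

In this notebook, `get_status` could be a lambda that calls `personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)` and returns `["datasetGroup"]["status"]` from the response.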

### Create the Dataset

First define a schema for the interactions.

The schema uses the [Avro](https://avro.apache.org/docs/current/) format and its data types.

In [None]:
interactions_schema = schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "lastfm-interactions-"+str(uuid.uuid4()),
    schema = json.dumps(interactions_schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))
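
Since the schema field names must line up with the CSV header, a quick comparison can catch a mismatch before the import job is started. The sketch below is illustrative only: `missing_columns` is a hypothetical helper, and `demo_schema` simply repeats the schema defined above so that the block runs on its own.

```python
def missing_columns(schema, csv_header):
    """Return schema field names that do not appear in the CSV header."""
    return [field["name"] for field in schema["fields"]
            if field["name"] not in csv_header]


demo_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# Header as it looked before the rename step, then after it.
before_rename = missing_columns(demo_schema, ["userID", "artistID", "timestamp"])
after_rename = missing_columns(demo_schema, ["USER_ID", "ITEM_ID", "TIMESTAMP"])
print(before_rename, after_rename)
```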

Now create a dataset with that schema.

The following dataset types are available:
- Interactions
- Items
- Users

> Read [more](https://docs.aws.amazon.com/personalize/latest/dg/API_CreateDataset.html)

In [None]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-poc-lastfm-ints",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

In [None]:
interactions_dataset_arn = dataset_arn

----

## Configuring S3 and IAM 

Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing it. The code below will set all that up.

Now use the metadata stored on this SageMaker Notebook instance to determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker, set `region` to the string naming the AWS region you would like to use for Personalize and S3.

In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)
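
For notebooks that may run outside SageMaker, the region lookup could fall back to an environment variable when the metadata file is absent. The sketch below is one possible approach under stated assumptions: `detect_region` is a hypothetical helper, the `us-east-1` default is an arbitrary choice for this example, and the ARN in the demo is made up. `AWS_DEFAULT_REGION` is the standard variable read by the AWS CLI and SDKs.

```python
import json
import os
import tempfile

SAGEMAKER_METADATA = "/opt/ml/metadata/resource-metadata.json"


def detect_region(metadata_path=SAGEMAKER_METADATA, env=None, default="us-east-1"):
    """Read the region from SageMaker metadata, else from the environment."""
    env = os.environ if env is None else env
    if os.path.exists(metadata_path):
        with open(metadata_path) as notebook_info:
            resource_arn = json.load(notebook_info)["ResourceArn"]
        # The region is the fourth colon-separated field of an ARN.
        return resource_arn.split(":")[3]
    return env.get("AWS_DEFAULT_REGION", default)


# Demo with a throwaway metadata file and a made-up ARN.
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".json") as handle:
    json.dump(
        {"ResourceArn": "arn:aws:sagemaker:us-west-2:123456789012:notebook-instance/demo"},
        handle,
    )
from_file = detect_region(handle.name)
os.remove(handle.name)
from_env = detect_region("/no/such/file.json", env={"AWS_DEFAULT_REGION": "eu-west-1"})
print(from_file, from_env)
```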

In [None]:
session = boto3.Session(region_name=region)

In [None]:
print(region)
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "personalizepoc" + str(uuid.uuid4())
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)
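
The bucket name built above is 62 characters long (a 12-digit account ID, the 14-character `personalizepoc`, and a 36-character UUID), just under the 63-character limit for S3 bucket names, so any extra prefix would push it over. The sketch below is a partial check of the basic naming rules; `looks_like_valid_bucket_name` is a hypothetical helper, the account ID is made up, and rules such as the ban on IP-address-style names are not covered.

```python
import re
import uuid

# Partial check: 3-63 characters, lowercase letters, digits, dots and hyphens,
# starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]")


def looks_like_valid_bucket_name(name):
    """Return True if the name passes the basic length and character rules."""
    return BUCKET_NAME_PATTERN.fullmatch(name) is not None


demo_account_id = "123456789012"  # made-up 12-digit account ID
demo_name = demo_account_id + "personalizepoc" + str(uuid.uuid4())
print(len(demo_name), looks_like_valid_bucket_name(demo_name))
```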

#### Attach Policy to S3 Bucket
Amazon Personalize needs to be able to read the contents of the S3 bucket that you created earlier. The lines below attach a bucket policy that grants that access.

In [None]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))
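
The policy above uses `s3:*Object`, which also allows writing and deleting objects. If Personalize only needs to read the import file, as is the case in this notebook, a narrower policy may be enough. The sketch below is a possible variant, assuming read-only access is sufficient; `read_only_bucket_policy` is a hypothetical helper and the bucket name in the demo is a placeholder.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build a bucket policy that lets Personalize list and read objects only."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


# Placeholder bucket name, only to show the generated document.
demo_policy = read_only_bucket_policy("example-personalize-bucket")
print(json.dumps(demo_policy, indent=2))
```

The result can be passed to `s3.put_bucket_policy` in the same way as the policy above, using `json.dumps` for the `Policy` argument.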

### Create Personalize Role
Amazon Personalize also needs the ability to assume roles in AWS in order to have the permissions to execute certain tasks. The lines below grant that.

In [None]:
iam = boto3.client("iam")

role_name = "PersonalizeRolePOC"+str(uuid.uuid4())
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

#### Upload to S3

Before Personalize can import the data, it needs to be in S3.

In [None]:
# Upload Interactions File
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename

----

## Importing the Interactions Data

Earlier you created the DatasetGroup and Dataset to house your information. Now you will execute an import job that loads the data from S3 into Amazon Personalize for use in building your model.

#### Create Dataset Import Job

In [None]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-poc-import1"+str(uuid.uuid4())[:5],
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

#### Wait for Dataset Import Job to Have ACTIVE Status
It can take a while before the import job completes. Please wait until you see that it is active below.

In [None]:
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
    else:    
        sleep(60)
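
The loop above stops on `CREATE FAILED` without saying why the job failed. The describe response carries a `failureReason` field for failed jobs, so a small helper can surface it. The sketch below is illustrative: `summarize_import_job` is a hypothetical name and the sample responses are made up rather than real service output.

```python
def summarize_import_job(response):
    """Turn a describe_dataset_import_job response into a one-line summary."""
    job = response["datasetImportJob"]
    status = job["status"]
    if status == "CREATE FAILED":
        reason = job.get("failureReason", "no reason reported")
        return "Import failed: {}".format(reason)
    return "Import status: {}".format(status)


# Made-up responses, shaped like the fields used in the loop above.
failed_response = {
    "datasetImportJob": {
        "status": "CREATE FAILED",
        "failureReason": "example failure text",
    }
}
active_response = {"datasetImportJob": {"status": "ACTIVE"}}

failed_summary = summarize_import_job(failed_response)
active_summary = summarize_import_job(active_response)
print(failed_summary)
print(active_summary)
```

In this notebook, calling `summarize_import_job(describe_dataset_import_job_response)` after the loop would print the reason whenever the import did not become active.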

Now that the dataset import is active, you are ready to start building models with SIMS, Personalized-Ranking, Popularity-Count, and HRNN. Work will continue in other notebooks. Before moving on, run the cell below to store a few values for use in the next notebooks.

In [None]:
%store interactions_dataset_arn
%store dataset_group_arn
%store bucket_name
%store role_arn
%store role_name
%store data_dir