# Amazon Personalize Batch Job
> Train and create batch recommendations using Amazon Personalize. Expected completion time is 1.5-2 hours

- toc: true
- badges: true
- comments: true
- categories: [amazonpersonalize, batch, movie, hrnn]
- image: 

**Amazon Personalize Workflow**

The general workflow for training, deploying, and getting recommendations from a campaign is as follows:

1. Prepare data

2. Create related datasets and a dataset group.

3. Get training data.

    - Import historical data to the dataset group.

    - Record user events to the dataset group.

4. Create a solution version (trained model) using a recipe.

5. Evaluate the solution version using metrics.

6. Create a campaign (deploy the solution version).

7. Provide recommendations for users by running Batch Recommendation.

In this lab, we will step through this workflow, with some additional steps to set up your IAM permissions and the S3 bucket that serves as the data source for your dataset and as the output location for the batch recommendations.

**Note:** This lab will not cover the deployment of a real-time personalize campaign.

## Prepare Data

### Get dataset
In this lab, we will use the [MovieLens dataset](http://grouplens.org/datasets/movielens/) to train a model and make movie recommendations.

MovieLens provides several datasets. For better model accuracy, it is recommended to train the Personalize model on a large dataset; the tradeoff is a longer training time. To minimise the time required to complete this lab, we will sacrifice some accuracy and use the small dataset.

In [None]:
data_dir = "movie_data"
!mkdir $data_dir
!cd $data_dir && wget http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
!cd $data_dir && unzip ml-latest-small.zip

--2020-04-16 06:45:48--  http://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 978202 (955K) [application/zip]
Saving to: ‘ml-latest-small.zip’


2020-04-16 06:45:48 (5.70 MB/s) - ‘ml-latest-small.zip’ saved [978202/978202]

Archive:  ml-latest-small.zip
   creating: ml-latest-small/
  inflating: ml-latest-small/links.csv  
  inflating: ml-latest-small/tags.csv  
  inflating: ml-latest-small/ratings.csv  
  inflating: ml-latest-small/README.txt  
  inflating: ml-latest-small/movies.csv  


### Prepare data

In [None]:
import time
from time import sleep
import json
from datetime import datetime

import boto3
import pandas as pd
import uuid

Load the dataset and preview the data.

In [None]:
original_data = pd.read_csv(data_dir + '/ml-latest-small/ratings.csv')
original_data.head(10)

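| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| 5 | 1 | 70 | 3.0 | 964982400 |
| 6 | 1 | 101 | 5.0 | 964980868 |
| 7 | 1 | 110 | 4.0 | 964982176 |
| 8 | 1 | 151 | 5.0 | 964984041 |
| 9 | 1 | 157 | 5.0 | 964984100 |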


In this lab, we will use the movie rating dataset and keep only ratings greater than or equal to 4, because we only want to recommend movies that have been positively rated.

In [None]:
interactions_df = original_data.copy()

# Only want ratings greater or equal to 4, filter out ratings less than 4
interactions_df = interactions_df[interactions_df['rating'] >= 4.0]

interactions_df.head(10)

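| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
| 6 | 1 | 101 | 5.0 | 964980868 |
| 7 | 1 | 110 | 4.0 | 964982176 |
| 8 | 1 | 151 | 5.0 | 964984041 |
| 9 | 1 | 157 | 5.0 | 964984100 |
| 10 | 1 | 163 | 5.0 | 964983650 |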


The next step is to map the dataset to the Personalize schema by selecting the required columns and renaming them.

For more information about the schema, refer to the following URL: https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html

In [None]:
# Keep only the columns required by the interactions schema
interactions_df = interactions_df[['userId', 'movieId', 'timestamp']]
# Rename the columns to the names Personalize expects
interactions_df = interactions_df.rename(columns = {'userId':'USER_ID', 'movieId':'ITEM_ID', 
                              'timestamp':'TIMESTAMP'})

Finally, we save the dataset to a CSV file, which we will later upload to S3 for Personalize to use.

In [None]:
interactions_filename = "interactions.csv"
interactions_df.to_csv((data_dir+"/"+interactions_filename), index=False, float_format='%.0f')

## Create related datasets and a dataset group.

In this section, we will set up the Amazon Personalize dataset group and load the interactions dataset into Amazon Personalize, where it will be used for training.

Amazon Personalize requires data, stored in Amazon Personalize datasets, in order to train a model.

There are two ways to provide the training data. You can import historical data from an Amazon S3 bucket, and you can record event data as it is created.

A dataset group contains related datasets. Three types of historical datasets are created by the customer (users, items, and interactions), and one type is created by Amazon Personalize for live-event interactions. A dataset group can contain only one of each kind of dataset.

You can create dataset groups to serve different purposes. For example, you might have an application that provides recommendations for purchasing shoes and another that provides recommendations for places to visit in Europe. In Amazon Personalize, each application would have its own dataset group.

Historical data must be provided in a CSV file. Each dataset type has a unique schema that specifies the contents of the CSV file.

There is a [minimum amount of data](https://docs.aws.amazon.com/personalize/latest/dg/limits.html) that is necessary to train a model. Using existing data allows you to immediately start training a model. If you rely on recorded data as it is created, and there is no historical data, it can take a while before training can begin.
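
As a rough pre-flight check before importing, the sketch below counts interactions per user and compares them with minimum thresholds. The threshold values (1,000 interactions, 25 users with at least 2 interactions each) are assumptions based on the limits page at the time of writing, and `check_interactions` and `check_interactions_csv` are hypothetical helper names; confirm the current numbers in the linked documentation.

```python
import csv
from collections import Counter

# Assumed thresholds; verify against the current Amazon Personalize limits page.
MIN_INTERACTIONS = 1000
MIN_USERS = 25
MIN_INTERACTIONS_PER_USER = 2


def check_interactions(rows):
    """Check an iterable of dicts that each have a USER_ID key.

    Returns (ok, summary), where ok is True when the assumed minimums are met.
    """
    per_user = Counter(row["USER_ID"] for row in rows)
    total = sum(per_user.values())
    qualifying_users = sum(
        1 for count in per_user.values() if count >= MIN_INTERACTIONS_PER_USER
    )
    ok = total >= MIN_INTERACTIONS and qualifying_users >= MIN_USERS
    return ok, {"interactions": total, "qualifying_users": qualifying_users}


def check_interactions_csv(path):
    """Run the same check on a CSV file that has a USER_ID column."""
    with open(path, newline="") as csv_file:
        return check_interactions(csv.DictReader(csv_file))
```

In this lab it could be called as `check_interactions_csv(data_dir + "/" + interactions_filename)` once the interactions file has been written.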

In [None]:
# Configure the SDK to Personalize:
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

### Create the personalize dataset group
We start by creating the Personalize dataset group named "**personalize-devlab-movies-dataset-group**", which will store the interactions (ratings) dataset we prepared earlier.

In [None]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "personalize-devlab-movies-dataset-group"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-2:931343413930:dataset-group/personalize-devlab-movies-dataset-group",
  "ResponseMetadata": {
    "RequestId": "0374647e-baae-470f-87b1-0664f3d225d5",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 06:46:00 GMT",
      "x-amzn-requestid": "0374647e-baae-470f-87b1-0664f3d225d5",
      "content-length": "118",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### CHECKPOINT #1 - Wait for dataset group creation to complete
The dataset group will take some time to be created. **Execute the following cell and wait for it to show "ACTIVE" before proceeding to the next step.**

In [None]:
current_time = datetime.now()
print("Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

current_time = datetime.now()
print("Completed on: ", current_time.strftime("%I:%M:%S %p"))

Started on:  06:46:03 AM
DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
Completed on:  06:47:03 AM
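
The same polling pattern recurs at every checkpoint in this lab, so it can be factored into one helper. The following is a minimal sketch, not part of the original notebook: `wait_until_done` is a hypothetical name, and the `get_status` callable is assumed to wrap whichever `describe_*` call applies.

```python
import time


def wait_until_done(get_status, poll_seconds=60, timeout_seconds=3 * 60 * 60,
                    sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE or CREATE FAILED.

    Raises TimeoutError if neither status is reached within timeout_seconds.
    The sleep argument is injectable so the helper can be tested quickly.
    """
    deadline = time.time() + timeout_seconds
    while True:
        status = get_status()
        print("Status: {}".format(status))
        if status in ("ACTIVE", "CREATE FAILED"):
            return status
        if time.time() >= deadline:
            raise TimeoutError(
                "Still {} after {} seconds".format(status, timeout_seconds)
            )
        sleep(poll_seconds)


# Example usage with the dataset group from this lab (requires AWS access):
# status = wait_until_done(
#     lambda: personalize.describe_dataset_group(
#         datasetGroupArn=dataset_group_arn
#     )["datasetGroup"]["status"]
# )
```

Returning the final status lets the caller decide what to do on `CREATE FAILED` instead of silently moving on to the next step.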


### Create the dataset
Once the dataset group has been created, the next step is to define the interactions schema, which we will name "**personalize-devlab-movies-interactions-schema**".

In [None]:
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = "personalize-devlab-movies-interactions-schema",
    schema = json.dumps(interactions_schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-2:931343413930:schema/personalize-devlab-movies-interactions-schema",
  "ResponseMetadata": {
    "RequestId": "82e560cc-a51a-4061-9b1e-38c607985d04",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 06:47:20 GMT",
      "x-amzn-requestid": "82e560cc-a51a-4061-9b1e-38c607985d04",
      "content-length": "111",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
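
Because the import job relies on the CSV header matching the schema field names, it can be worth checking the header before uploading. The sketch below is illustrative only: `missing_schema_columns` is a hypothetical helper, and it checks only that each field name is present, on the assumption that column order is not significant.

```python
import csv
import io


def missing_schema_columns(csv_text, schema):
    """Return the schema field names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [field["name"] for field in schema["fields"]
            if field["name"] not in header]


# A schema fragment with the same fields as the interactions schema in this lab.
example_schema = {
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ]
}

# Header after renaming: nothing is missing.
print(missing_schema_columns("USER_ID,ITEM_ID,TIMESTAMP\n1,1,964982703\n", example_schema))
# Header before renaming: all three schema fields are missing.
print(missing_schema_columns("userId,movieId,timestamp\n1,1,964982703\n", example_schema))
```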


Once the schema has been created, we create the interactions dataset using that schema and name it "personalize-devlab-movies-interactions-dataset".

In [None]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "personalize-devlab-movies-interactions-dataset",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-2:931343413930:dataset/personalize-devlab-movies-dataset-group/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "8d113327-296b-45fd-8f40-1b97a195f699",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 06:47:21 GMT",
      "x-amzn-requestid": "8d113327-296b-45fd-8f40-1b97a195f699",
      "content-length": "120",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [None]:
# Record the interaction dataset arn to be used later
interactions_dataset_arn = dataset_arn

## Configuring S3 and IAM 


Amazon Personalize will need an S3 bucket to act as the source of your data, as well as IAM roles for accessing it. The code below will set all that up.

Using the metadata stored on this SageMaker notebook instance, determine the region we are operating in. If you are using a Jupyter notebook outside of SageMaker, simply define `region` as the string for the region you would like to use for Personalize and S3.


In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
    
session = boto3.Session(region_name=region)

print(region)
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "personalizedevlab"
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

us-east-2
931343413930personalizedevlab
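
Outside SageMaker the metadata file does not exist, so the cell above fails when it tries to open it. One possible fallback is sketched here under stated assumptions: `region_from_arn` and `detect_region` are hypothetical helpers, and falling back to the `AWS_DEFAULT_REGION` environment variable is a choice made for this sketch, not something the original lab does.

```python
import json
import os


def region_from_arn(arn):
    """Extract the region from an ARN of the form
    arn:partition:service:region:account-id:resource."""
    parts = arn.split(":")
    if len(parts) < 6 or not parts[3]:
        raise ValueError("No region found in ARN: {}".format(arn))
    return parts[3]


def detect_region(metadata_path="/opt/ml/metadata/resource-metadata.json",
                  default=None):
    """Read the region from SageMaker notebook metadata when available.

    Falls back to AWS_DEFAULT_REGION, then to default, when the file is
    missing or does not contain a usable ARN.
    """
    try:
        with open(metadata_path) as notebook_info:
            return region_from_arn(json.load(notebook_info)["ResourceArn"])
    except (OSError, KeyError, ValueError):
        return os.environ.get("AWS_DEFAULT_REGION", default)
```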


### Attach Policy to S3 Bucket
Amazon Personalize needs to be able to read the contents of the S3 bucket you created earlier. The bucket policy below grants that access.

In [None]:
s3 = boto3.client("s3")

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:*Object",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket_name),
                "arn:aws:s3:::{}/*".format(bucket_name)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket_name, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': '039C39BD0ADB3BFE',
  'HostId': 'C6i18rymISNGlhn+y6WXKM/YIw/lVDWD6082jKgk3ggJq43St4JOKljBFBSRd07X+1h6jNRczCA=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'C6i18rymISNGlhn+y6WXKM/YIw/lVDWD6082jKgk3ggJq43St4JOKljBFBSRd07X+1h6jNRczCA=',
   'x-amz-request-id': '039C39BD0ADB3BFE',
   'date': 'Thu, 16 Apr 2020 06:48:30 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}
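
The `s3:*Object` wildcard above also covers actions such as `s3:DeleteObject`. If you prefer a narrower grant, the sketch below builds a policy limited to reading the input, writing the batch output, and listing the bucket. This action list is an assumption about what the lab needs and `build_bucket_policy` is a hypothetical helper; check the Personalize documentation for the permissions your workflow requires.

```python
import json


def build_bucket_policy(bucket_name):
    """Build a bucket policy that lets Personalize read, write, and list only."""
    return {
        "Version": "2012-10-17",
        "Id": "PersonalizeS3BucketAccessPolicy",
        "Statement": [
            {
                "Sid": "PersonalizeS3BucketAccessPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "personalize.amazonaws.com"},
                # Assumed minimal set: read input, write batch output, list bucket.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::{}".format(bucket_name),
                    "arn:aws:s3:::{}/*".format(bucket_name),
                ],
            }
        ],
    }


# Render the policy for a placeholder bucket name.
print(json.dumps(build_bucket_policy("example-personalize-bucket"), indent=2))
```

The resulting dictionary can be passed to `s3.put_bucket_policy` with `json.dumps`, in the same way as the policy defined in the cell above.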

### Create Personalize Role
Amazon Personalize also needs to be able to assume an IAM role that has the permissions to execute certain tasks. The lines below create that role and attach the required policies.

In [None]:
iam = boto3.client("iam")

role_name = "PersonalizeRoleDevLab"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::931343413930:role/PersonalizeRoleDevLab
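
Re-running the role creation cell fails once the role exists, because `create_role` raises an error for a duplicate role name. A minimal sketch of a get-or-create wrapper follows; `get_or_create_role_arn` is a hypothetical helper that expects the `iam` client created above to be passed in.

```python
import json


def get_or_create_role_arn(iam, role_name, assume_role_policy_document):
    """Create the role, or look it up if it already exists, and return its ARN."""
    try:
        response = iam.create_role(
            RoleName=role_name,
            AssumeRolePolicyDocument=json.dumps(assume_role_policy_document),
        )
    except iam.exceptions.EntityAlreadyExistsException:
        # The role is left over from an earlier run; reuse it.
        response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]


# Example usage with the names from this lab (requires AWS access):
# role_arn = get_or_create_role_arn(iam, role_name, assume_role_policy_document)
```

Note that this sketch does not verify that an existing role has the expected trust policy or attached policies.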


## Upload dataset to S3

Before Personalize can import the data, it needs to be in S3.

In [None]:
# Upload Interactions File
interactions_file_path = data_dir + "/" + interactions_filename
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_file_path)
interactions_s3DataPath = "s3://"+bucket_name+"/"+interactions_filename

## Importing the Interactions Data

Earlier you created the dataset group and dataset to house your information. Now you will run an import job that loads the data from S3 into Amazon Personalize for use in building your model.

#### Create Dataset Import Job

In [None]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "personalize-devlab-import1",
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-2:931343413930:dataset-import-job/personalize-devlab-import1",
  "ResponseMetadata": {
    "RequestId": "185c9da7-fa80-4af1-b6ca-ed7dafc83c17",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 06:59:06 GMT",
      "x-amzn-requestid": "185c9da7-fa80-4af1-b6ca-ed7dafc83c17",
      "content-length": "114",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### CHECKPOINT #2 - Wait for Dataset Import Job to Have ACTIVE Status

It can take a while before the import job completes. **Execute the following cell and wait for it to show "ACTIVE" before proceeding to the next step.**

In [None]:
current_time = datetime.now()
print("Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_import_job_response = personalize.describe_dataset_import_job(
        datasetImportJobArn = dataset_import_job_arn
    )
    status = describe_dataset_import_job_response["datasetImportJob"]['status']
    print("DatasetImportJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

current_time = datetime.now()
print("Completed on: ", current_time.strftime("%I:%M:%S %p"))

Started on:  06:59:09 AM
DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE
Completed on:  07:11:09 AM


## Create solution

In this section, we will define a solution using the HRNN recipe to generate user personalization recommendations. Several other recipes are available, such as HRNN-Metadata and HRNN-Coldstart. More information about the additional recipes can be found here:
https://docs.aws.amazon.com/personalize/latest/dg/working-with-predefined-recipes.html
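
To see which recipe ARNs are available in your region, the recipes can also be listed through the API. This is a minimal sketch that assumes the `personalize` client created earlier is passed in; `list_recipe_arns` is a hypothetical helper that follows `nextToken` pagination on `list_recipes`.

```python
def list_recipe_arns(personalize):
    """Collect the ARNs of all recipes, following pagination tokens."""
    arns = []
    kwargs = {}
    while True:
        page = personalize.list_recipes(**kwargs)
        arns.extend(recipe["recipeArn"] for recipe in page.get("recipes", []))
        token = page.get("nextToken")
        if not token:
            return arns
        kwargs["nextToken"] = token


# Example usage (requires AWS access):
# for arn in list_recipe_arns(personalize):
#     print(arn)
```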

In [None]:
HRNN_recipe_arn = "arn:aws:personalize:::recipe/aws-hrnn"

In [None]:
hrnn_create_solution_response = personalize.create_solution(
    name = "personalize-devlab-hrnn",
    datasetGroupArn = dataset_group_arn,
    recipeArn = HRNN_recipe_arn
)

hrnn_solution_arn = hrnn_create_solution_response['solutionArn']
print(json.dumps(hrnn_create_solution_response, indent=2))

{
  "solutionArn": "arn:aws:personalize:us-east-2:931343413930:solution/personalize-devlab-hrnn",
  "ResponseMetadata": {
    "RequestId": "ab61bc26-fe23-471e-8ebc-9929535840c8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 07:11:17 GMT",
      "x-amzn-requestid": "ab61bc26-fe23-471e-8ebc-9929535840c8",
      "content-length": "93",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


## Create the solution version

In this section we will train a solution version using the dataset that we loaded and using the HRNN recipe.

In [None]:
hrnn_create_solution_version_response = personalize.create_solution_version(
    solutionArn = hrnn_solution_arn
)

In [None]:
hrnn_solution_version_arn = hrnn_create_solution_version_response['solutionVersionArn']
print(json.dumps(hrnn_create_solution_version_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-2:931343413930:solution/personalize-devlab-hrnn/960341f6",
  "ResponseMetadata": {
    "RequestId": "007a14cd-6ac0-4248-97a0-5e666c8a03e8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 07:11:19 GMT",
      "x-amzn-requestid": "007a14cd-6ac0-4248-97a0-5e666c8a03e8",
      "content-length": "109",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


### CHECKPOINT #3 - Wait for solution version creation to be completed
Training the solution version will take some time. **Execute the following cell and wait for it to show "ACTIVE" before proceeding to the next step.**

#### Viewing Solution Creation Status

You can also view the status updates in the console. To do so,

* In another browser tab you should already have the AWS Console up from opening this notebook instance. 
* Switch to that tab and search at the top for the service `Personalize`, then go to that service page. 
* Click `View dataset groups`.
* Click the name of your dataset group, most likely something with DevLab in the name.
* Click `Solutions and recipes`.
* You will now see a list of all of the solutions you created above. Click any one of them. 
* Note in `Solution versions` the job that is in progress. Once it is `Active`, your solution is ready to be reviewed. It is also capable of being deployed.

In [None]:
current_time = datetime.now()
print("Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_solution_version_response = personalize.describe_solution_version(
        solutionVersionArn = hrnn_solution_version_arn
    )
    status = describe_solution_version_response["solutionVersion"]['status']
    print("SolutionVersion Status: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
current_time = datetime.now()
print("Completed on: ", current_time.strftime("%I:%M:%S %p"))

Started on:  07:11:21 AM
SolutionVersion Status: CREATE PENDING
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN_PROGRESS
SolutionVersion Status: CREATE IN

## Evaluate solution metrics

In this section we will run the function to get the solution metrics. More information about the Personalize solution metrics can be found here: https://docs.aws.amazon.com/personalize/latest/dg/working-with-training-metrics.html

In [None]:
hrnn_solution_metrics_response = personalize.get_solution_metrics(
    solutionVersionArn = hrnn_solution_version_arn
)

print(json.dumps(hrnn_solution_metrics_response, indent=2))

{
  "solutionVersionArn": "arn:aws:personalize:us-east-2:931343413930:solution/personalize-devlab-hrnn/960341f6",
  "metrics": {
    "coverage": 0.0149,
    "mean_reciprocal_rank_at_25": 0.0241,
    "normalized_discounted_cumulative_gain_at_10": 0.0267,
    "normalized_discounted_cumulative_gain_at_25": 0.0544,
    "normalized_discounted_cumulative_gain_at_5": 0.0174,
    "precision_at_10": 0.0091,
    "precision_at_25": 0.0095,
    "precision_at_5": 0.0109
  },
  "ResponseMetadata": {
    "RequestId": "85884607-49b5-44a8-9459-83a61b954ade",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Thu, 16 Apr 2020 07:45:56 GMT",
      "x-amzn-requestid": "85884607-49b5-44a8-9459-83a61b954ade",
      "content-length": "407",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}
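
To make metric names such as `precision_at_5` and `mean_reciprocal_rank_at_25` more concrete, the sketch below shows how precision at K and reciprocal rank are commonly defined for a single user. Personalize computes its metrics internally on held-out data, so the function names and toy data here are illustrative assumptions rather than the service's implementation.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k


def reciprocal_rank(recommended, relevant, k):
    """1 / rank of the first relevant item within the top k, or 0.0 if none."""
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Toy example: five recommendations, of which the user actually liked two.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "e"}
print(precision_at_k(recommended, relevant, 5))
print(reciprocal_rank(recommended, relevant, 5))
```

Averaging the reciprocal rank over all evaluated users gives a mean reciprocal rank. Values close to zero, as in the output above, indicate that relevant items rarely appear near the top of the list, which is consistent with training on the small dataset.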


## Batch Recommendation
In this section, we will pick a random sample of users and generate batch recommendations for them.

First, we will load the movie dataset so that we can display the recommended movies by title.

In [None]:
# Create a dataframe for the items by reading in the correct source CSV.
items_df = pd.read_csv(data_dir + '/ml-latest-small/movies.csv', index_col=0)
# Render some sample data
items_df.head(5)

Unnamed: 0_level_0,title,genres
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
4,Waiting to Exhale (1995),Comedy|Drama|Romance
5,Father of the Bride Part II (1995),Comedy


In [None]:
# Create a function to get movie by id
def get_movie_by_id(movie_id, movie_df=items_df):
    try:
        movie = movie_df.loc[int(movie_id)]
        return movie['title'] + " - " + movie['genres']
    except (KeyError, ValueError, TypeError):
        # Unknown or non-numeric id; str() keeps this safe for non-string ids
        return "Error obtaining movie " + str(movie_id)

In [None]:
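# Get the user list (userId becomes the index)
users_df = pd.read_csv(data_dir + '/ml-latest-small/ratings.csv', index_col=0)

# Sample 3 distinct users; sampling rating rows directly could pick the same user twice
batch_users = pd.Series(users_df.index.unique()).sample(3).tolist()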

# Write the file to disk
json_input_filename = "json_input.json"
with open(data_dir + "/" + json_input_filename, 'w') as json_input:
    for user_id in batch_users:
        json_input.write('{"userId": "' + str(user_id) + '"}\n')

In [None]:
# Showcase the input file:
!cat $data_dir"/"$json_input_filename

{"userId": "302"}
{"userId": "393"}
{"userId": "57"}
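
The input file is written above by string concatenation, which works for numeric IDs but would produce invalid JSON if an ID contained a quote or backslash. As a hedged alternative, the lines can be built with `json.dumps`; `batch_input_lines` is a hypothetical helper name.

```python
import json


def batch_input_lines(user_ids):
    """Return one JSON object per user, in the shape used by the input file above."""
    return [json.dumps({"userId": str(user_id)}) for user_id in user_ids]


# Print the lines for the three users sampled in this run of the lab.
for line in batch_input_lines([302, 393, 57]):
    print(line)
```

Writing `"\n".join(batch_input_lines(batch_users)) + "\n"` to the input file gives the same layout as the cell above.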


Upload the list of users that we want batch recommendations for to S3.

In [None]:
# Upload files to S3
boto3.Session().resource('s3').Bucket(bucket_name).Object(json_input_filename).upload_file(data_dir+"/"+json_input_filename)
s3_input_path = "s3://" + bucket_name + "/" + json_input_filename
print(s3_input_path)

s3://931343413930personalizedevlab/json_input.json


In the next cell, we define the S3 output path where the batch recommendation results will be stored.

In [None]:
# Define the output path
s3_output_path = "s3://" + bucket_name + "/"
print(s3_output_path)

s3://931343413930personalizedevlab/


Run the batch inference process

In [None]:
current_time = datetime.now()
batchInferenceJobArn = personalize.create_batch_inference_job (
    solutionVersionArn = hrnn_solution_version_arn,
    jobName = "Personalize-devlab-Batch-Inference-Job-HRNN"+current_time.strftime("%I%M%S"),
    roleArn = role_arn,
    jobInput = 
     {"s3DataSource": {"path": s3_input_path}},
    jobOutput = 
     {"s3DataDestination":{"path": s3_output_path}}
)
batchInferenceJobArn = batchInferenceJobArn['batchInferenceJobArn']

### CHECKPOINT #4 - Wait for batch recommendation job to complete

It can take a while before the batch recommendation job completes. **Execute the following cell and wait for it to show "ACTIVE" before proceeding to the next step.**

In [None]:
current_time = datetime.now()
print("Import Started on: ", current_time.strftime("%I:%M:%S %p"))

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_inference_job_response = personalize.describe_batch_inference_job(
        batchInferenceJobArn = batchInferenceJobArn
    )
    status = describe_dataset_inference_job_response["batchInferenceJob"]['status']
    print("DatasetInferenceJob: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)
    
current_time = datetime.now()
print("Import Completed on: ", current_time.strftime("%I:%M:%S %p"))

Import Started on:  07:49:42 AM
DatasetInferenceJob: CREATE PENDING
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInferenceJob: CREATE IN_PROGRESS
DatasetInfer

**Download and visualize batch recommendations**

Once the batch recommendation job has completed, we download and visualize its results.

The results of the batch job are stored in the S3 output location specified earlier, with one JSON object per line, similar to the following:

```
{"input":{"userId":"448"},"output":{"recommendedItems":["5810","53322","2003","6957","92535","8917","3105","6873","1249","26133","2657","4865","2420","1345","4621","34437","2010","4128","2076","1203","52973","4246","2871","8641","162"],"scores":[0.0031413,0.0022093,0.0021377,0.0020497,0.001922,0.0018058,0.0017834,0.0017671,0.0017457,0.0016255,0.0015854,0.001539,0.0014838,0.0014573,0.001374,0.001372,0.0013563,0.0013385,0.0013196,0.0013065,0.0012714,0.0012507,0.001228,0.0012243,0.0012083]},"error":null}
{"input":{"userId":"409"},"output":{"recommendedItems":["2571","50","527","296","1196","111","110","1258","2858","1214","6874","1265","648","750","912","588","608","2329","858","2762","1291","541","1387","260","1200"],"scores":[0.0151337,0.0129099,0.0094722,0.0081178,0.0070135,0.0061934,0.0059672,0.0049986,0.0048773,0.0048134,0.0046837,0.0044422,0.0044251,0.0042917,0.0042376,0.0042309,0.0039795,0.0038287,0.0037793,0.0036532,0.0036527,0.0035218,0.0035178,0.0034306,0.0034158]},"error":null}
{"input":{"userId":"288"},"output":{"recommendedItems":["5989","49272","6377","4963","4995","68157","4886","48394","8368","80463","54001","8961","91658","8950","5418","5445","109374","8360","1136","81834","5618","1265","48780","8949","4720"],"scores":[0.0177912,0.0114374,0.0107343,0.0098669,0.009329,0.0067548,0.0057784,0.0057611,0.0056522,0.0053493,0.0050671,0.0050291,0.0047089,0.0046453,0.0045933,0.0045281,0.0041894,0.0040642,0.0040296,0.003935,0.0037718,0.0036896,0.0036616,0.0036362,0.0036019]},"error":null}

```
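
Each output line pairs the recommended item IDs with their scores by position. The sketch below parses one such line into a user ID and a list of `(item_id, score)` tuples. `parse_batch_output_line` is a hypothetical helper, the sample line is a shortened version of the first record above, and raising on a non-null `error` field is an assumption about how you might want to handle failed records.

```python
import json


def parse_batch_output_line(line):
    """Parse one output record into (user_id, [(item_id, score), ...])."""
    record = json.loads(line)
    if record.get("error"):
        raise ValueError("Batch record failed: {}".format(record["error"]))
    output = record["output"]
    pairs = list(zip(output["recommendedItems"], output["scores"]))
    return record["input"]["userId"], pairs


# Shortened version of the first record shown above.
sample_line = (
    '{"input":{"userId":"448"},'
    '"output":{"recommendedItems":["5810","53322"],'
    '"scores":[0.0031413,0.0022093]},"error":null}'
)
user_id, recommendations = parse_batch_output_line(sample_line)
print(user_id, recommendations)
```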

A typical use case is to use the batch recommendation output to generate personalized recommendation emails that can be sent through **[Amazon Pinpoint](https://aws.amazon.com/pinpoint/)** or your favourite email marketing service.

In [None]:
s3 = boto3.client('s3')
export_name = json_input_filename + ".out"
s3.download_file(bucket_name, export_name, data_dir+"/"+export_name)

# Update DF rendering
pd.set_option('display.max_rows', 30)
# Start from scratch so the cell can be re-run safely
bulk_recommendations_df = None
with open(data_dir+"/"+export_name) as json_file:
    # Each line of the output file is one JSON record
    for raw_line in json_file:
        if not raw_line.strip():
            continue  # skip blank lines
        line = json.loads(raw_line)
        # Extract the user ID
        col_header = "User: " + line['input']['userId']
        # Create a list of all the recommended movies
        recommendation_list = []
        # Add all the entries
        for item in line['output']['recommendedItems']:
            movie = get_movie_by_id(item)
            recommendation_list.append(movie)
        new_rec_DF = pd.DataFrame(recommendation_list, columns = [col_header])
        if bulk_recommendations_df is None:
            bulk_recommendations_df = new_rec_DF
        else:
            bulk_recommendations_df = bulk_recommendations_df.join(new_rec_DF)
bulk_recommendations_df

Unnamed: 0,User: 302,User: 393,User: 57
0,Braveheart (1995) - Action|Drama|War,Star Trek: First Contact (1996) - Action|Adven...,Pulp Fiction (1994) - Comedy|Crime|Drama|Thriller
1,Raiders of the Lost Ark (Indiana Jones and the...,Harry Potter and the Sorcerer's Stone (a.k.a. ...,Star Trek: First Contact (1996) - Action|Adven...
2,"Lord of the Rings: The Fellowship of the Ring,...",Pulp Fiction (1994) - Comedy|Crime|Drama|Thriller,Harry Potter and the Sorcerer's Stone (a.k.a. ...
3,Seven (a.k.a. Se7en) (1995) - Mystery|Thriller,Heat (1995) - Action|Crime|Thriller,Heat (1995) - Action|Crime|Thriller
4,Die Hard: With a Vengeance (1995) - Action|Cri...,2001: A Space Odyssey (1968) - Adventure|Drama...,X-Men (2000) - Action|Adventure|Sci-Fi
5,"Shawshank Redemption, The (1994) - Crime|Drama",Shrek (2001) - Adventure|Animation|Children|Co...,2001: A Space Odyssey (1968) - Adventure|Drama...
6,Terminator 2: Judgment Day (1991) - Action|Sci-Fi,Avatar (2009) - Action|Adventure|Sci-Fi|IMAX,Forrest Gump (1994) - Comedy|Drama|Romance|War
7,Jurassic Park (1993) - Action|Adventure|Sci-Fi...,"Fifth Element, The (1997) - Action|Adventure|C...",Avatar (2009) - Action|Adventure|Sci-Fi|IMAX
8,Star Wars: Episode IV - A New Hope (1977) - Ac...,X-Men (2000) - Action|Adventure|Sci-Fi,"Fifth Element, The (1997) - Action|Adventure|C..."
9,Schindler's List (1993) - Drama|War,Dead Poets Society (1989) - Drama,Dead Poets Society (1989) - Drama
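
As a rough illustration of the email use case mentioned above, the sketch below turns one line of the batch output into a plain-text message body. The `render_email_body` function, the `title_lookup` argument, and the sample titles are hypothetical and not part of the lab; in this notebook the lookup role would be played by `get_movie_by_id`. It also assumes that a failed record carries a non-null `error` field.

```python
import json

def render_email_body(output_line, title_lookup, top_n=5):
    """Build a plain-text email body from one batch output record.

    title_lookup is a hypothetical callable mapping an item ID to a display title.
    """
    record = json.loads(output_line)
    if record.get("error"):
        # Assumption: a non-null "error" means no usable recommendations
        return None
    user_id = record["input"]["userId"]
    item_ids = record["output"]["recommendedItems"][:top_n]
    lines = [f"Hi user {user_id}, here are your top {len(item_ids)} picks:"]
    for rank, item_id in enumerate(item_ids, start=1):
        lines.append(f"{rank}. {title_lookup(item_id)}")
    return "\n".join(lines)

# Example with made-up titles; real titles would come from the MovieLens movies file
sample_line = '{"input":{"userId":"288"},"output":{"recommendedItems":["5989","49272","6377"],"scores":[0.0177912,0.0114374,0.0107343]},"error":null}'
sample_titles = {"5989": "Title A", "49272": "Title B", "6377": "Title C"}
email_body = render_email_body(
    sample_line, lambda item_id: sample_titles.get(item_id, item_id), top_n=2
)
print(email_body)
```

The resulting string could then be passed to whichever email service you use; the sending step itself is outside the scope of this lab.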


Congratulations, you have now completed the lab. You can either continue to the challenge section to see if you can improve the model by using a larger dataset or proceed to the cleanup section to delete the resources from this lab.

## Challenge
Before wrapping up the lab, see if you can improve the recommendation accuracy by using the large dataset from [MovieLens](https://grouplens.org/datasets/movielens/).

You can reuse the same dataset group. Note that you do not need to redefine the dataset schema.
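
One possible starting point for the challenge, assuming the larger interactions file has already been prepared in the same CSV format and uploaded to S3, is to start a new import job against the existing dataset. The helper name `start_interactions_import` and its arguments are hypothetical; it only wraps the boto3 `create_dataset_import_job` call.

```python
def start_interactions_import(personalize_client, dataset_arn, s3_uri, role_arn, job_name):
    """Start an import job that loads a CSV from S3 into an existing dataset."""
    response = personalize_client.create_dataset_import_job(
        jobName=job_name,          # hypothetical name, must be unique per job
        datasetArn=dataset_arn,    # the dataset that already uses the schema
        dataSource={"dataLocation": s3_uri},
        roleArn=role_arn,          # role that Personalize assumes to read the bucket
    )
    return response["datasetImportJobArn"]
```

A call might look like `start_interactions_import(personalize, dataset_arn, "s3://" + bucket_name + "/<your-file>.csv", role_arn, "large-dataset-import")`, where `role_arn` and the file name are placeholders for your own values. Once the import job is active, a new solution version has to be trained before the batch job reflects the larger dataset.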

## Cleanup

**IMPORTANT**
Once you're done with the lab, the final step is to clean up your environment by decommissioning the resources we created for this devlab. Run the cells below in the order shown, because later deletions depend on earlier ones having completed.

**1. Delete the solution**

Delete the HRNN solution we created for the Personalize dataset group.

In [None]:
personalize.delete_solution(
    solutionArn = hrnn_solution_arn
)

**2. Delete the dataset**

Delete the dataset created for the Personalize dataset group.

In [None]:
personalize.delete_dataset(
    datasetArn = dataset_arn
)
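
Deletion in Personalize is asynchronous, so a delete call returns before the resource is actually gone, and deleting a schema fails while a dataset still references it. The sketch below shows one way to wait: it polls a describe call until the service reports that the resource no longer exists. The helper `wait_until_deleted` and its parameters are hypothetical and not part of boto3.

```python
import time

def wait_until_deleted(describe_call, poll_seconds=10, max_attempts=60, sleep=time.sleep):
    """Poll a describe call until it raises ResourceNotFoundException.

    describe_call is a zero-argument callable, for example
    lambda: personalize.describe_dataset(datasetArn=dataset_arn).
    Returns True once the resource is gone, False if the attempts run out.
    """
    for _ in range(max_attempts):
        try:
            describe_call()
        except Exception as exc:
            # boto3 client errors carry the service error code in exc.response
            error = getattr(exc, "response", {}).get("Error", {})
            if error.get("Code") == "ResourceNotFoundException":
                return True
            raise  # any other failure is unexpected, so surface it
        sleep(poll_seconds)
    return False
```

For example, `wait_until_deleted(lambda: personalize.describe_dataset(datasetArn=dataset_arn))` could be run before moving on to the schema. The same pattern applies to the solution and the dataset group with their respective describe calls.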

Run the following cell to list the remaining datasets and verify that the dataset has been deleted. If a dataset that is no longer required is still listed, run the subsequent cell with its ARN.

In [None]:
paginator = personalize.get_paginator('list_datasets')
for paginate_result in paginator.paginate():
    for dataset in paginate_result["datasets"]:
        print(dataset["datasetArn"])

In [None]:
# Replace the ARN and run the following cell to delete any additional datasets
personalize.delete_dataset(
    datasetArn = "INSERT ARN HERE"
)

**3. Delete the schema**

Delete the Personalize schema used for the datasets.

In [None]:
personalize.delete_schema(
    schemaArn = schema_arn
)

Run the following cell to list the remaining schemas and verify that the schema has been deleted. If a schema that is no longer required is still listed, run the subsequent cell with its ARN.

In [None]:
paginator = personalize.get_paginator('list_schemas')
for paginate_result in paginator.paginate():
    for schema in paginate_result["schemas"]:
        print(schema["schemaArn"])

In [None]:
# Replace the ARN and run the following cell to delete any additional schemas
personalize.delete_schema(
    schemaArn = "INSERT ARN HERE"
)

**4. Delete the dataset group**

Delete the Personalize dataset group.

In [None]:
personalize.delete_dataset_group(
    datasetGroupArn = dataset_group_arn
)

**5. Detach the policies from the Personalize devlab role**

In [None]:
iam.detach_role_policy(
    RoleName = "PersonalizeRoleDevLab",
    PolicyArn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
)

In [None]:
iam.detach_role_policy(
    RoleName = "PersonalizeRoleDevLab",
    PolicyArn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
)

Check that all policies have been detached from the role. If any remain, run the subsequent cell to detach them.

In [None]:
# List all policies attached to the Personalize devlab role
iam.list_attached_role_policies(
    RoleName = "PersonalizeRoleDevLab"
)

In [None]:
# Detach policy from role
iam.detach_role_policy(
    RoleName = "PersonalizeRoleDevLab",
    PolicyArn = "INSERT ARN HERE"
)
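
As an alternative to copying ARNs by hand, the list and detach steps above can be combined. The sketch below is a minimal version of that idea; `detach_all_managed_policies` is a hypothetical helper, and it only handles managed policies attached to the role, not inline policies.

```python
def detach_all_managed_policies(iam_client, role_name):
    """Detach every managed policy attached to the role and return their ARNs."""
    paginator = iam_client.get_paginator("list_attached_role_policies")
    # Collect the ARNs first so the list is not modified while paging through it
    policy_arns = [
        policy["PolicyArn"]
        for page in paginator.paginate(RoleName=role_name)
        for policy in page["AttachedPolicies"]
    ]
    for policy_arn in policy_arns:
        iam_client.detach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return policy_arns
```

In this notebook it could be called as `detach_all_managed_policies(iam, "PersonalizeRoleDevLab")`. The role can only be deleted once no policies remain attached to it.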

**6. Delete the IAM role**

In [None]:
iam.delete_role(
    RoleName = "PersonalizeRoleDevLab"
)