# Recommender system with AWS Personalize

This notebook builds a recommender system with AWS Personalize using the SDK for Python (boto3). The data is the same long-tail B2B retail set as in the "Association Rules Mining" ML project, but this time I don't reduce it to the approx. 3,000 most popular items; I upload the full set.

[Documentation](https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html) for AWS Personalize.

I learned the hard way:
- In Europe, AWS Personalize is (at the time of writing) only available in the Ireland region (eu-west-1). This matters when configuring the AWS CLI.


**Data Sources:**

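- `data/raw/sales_total.csv`: sales transactions (source of the interactions)
- `data/raw/artikel_agg_2018.csv`: aggregated item data
- `data/raw/customers_agg_2018.csv`: aggregated customer data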

**Data Output:**

- `data/interim/interactions.csv`: prepared interactions (USER_ID, ITEM_ID, TIMESTAMP, EVENT_VALUE)

**Changes**

- 2019-07-18: Start project




---

## Import libraries, load data

In [1]:
# Import libraries, get personalize boto3 client
import numpy as np
import pandas as pd
import json
import time

import boto3
personalize = boto3.client('personalize')
personalize_runtime = boto3.client('personalize-runtime')

# Display settings
from IPython.display import display
pd.options.display.max_columns = 100

In [6]:
# Load data
interactions_raw = pd.read_csv('data/raw/sales_total.csv', parse_dates=['Fakturadatum'])
items_raw = pd.read_csv('data/raw/artikel_agg_2018.csv')
users_raw = pd.read_csv('data/raw/customers_agg_2018.csv')

## Prepare and upload training data to S3 bucket

See the documentation for details. We prepare three datasets:
- users
- items
- interactions

Each dataset is shaped so that its columns match the corresponding schema defined below.

In [7]:
"""Prepare interaction data"""

# Subset data for 2018 only; .copy() gives an independent frame and avoids SettingWithCopyWarning
interactions_18_full = interactions_raw.loc[interactions_raw['Fakturadatum'].dt.year == 2018]
interactions_18_part = interactions_18_full[['Kunde', 'Artikel', 'Fakturadatum', 'Nettowert']].copy()

# Kick out all articles whose code is not purely numeric, plus rows with missing values
is_numeric = pd.to_numeric(interactions_18_part['Artikel'], errors='coerce').notna()
interactions_18 = interactions_18_part.loc[is_numeric].dropna(how='any')

# Kick out special customers
interactions = interactions_18.loc[interactions_18['Kunde'] > 700000]


In [9]:
# Rename Columns
interactions = interactions.rename(columns={'Kunde': 'USER_ID', 
                                            'Artikel': 'ITEM_ID',
                                            'Fakturadatum': 'TIMESTAMP',
                                            'Nettowert': 'EVENT_VALUE',
                                           })

# Check results
assert interactions.isnull().sum().sum() == 0
print(interactions.shape)
display(interactions.head(2))

(1402641, 4)


|         | USER_ID | ITEM_ID | TIMESTAMP  | EVENT_VALUE |
|---------|---------|---------|------------|-------------|
| 1388625 | 8488019 | 5171607 | 2018-01-03 | 77.3        |
| 1388626 | 8488019 | 5171101 | 2018-01-03 | 32.0        |
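
The interactions schema further down declares `TIMESTAMP` as `long`, and Personalize expects Unix epoch seconds in that column, whereas the output above still shows calendar dates. A minimal sketch of the conversion, assuming the invoice dates are meant as UTC; the helper name `to_epoch_seconds` is hypothetical and not part of the original notebook:

```python
import calendar
from datetime import datetime


def to_epoch_seconds(moment: datetime) -> int:
    """Return Unix epoch seconds, treating naive datetimes as UTC (an assumption)."""
    return calendar.timegm(moment.utctimetuple())


# First invoice date shown in the output above
print(to_epoch_seconds(datetime(2018, 1, 3)))
```

Since pandas `Timestamp` objects subclass `datetime`, something like `interactions['TIMESTAMP'].map(to_epoch_seconds)` before the CSV export should do the job, though that was not run as part of this notebook.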


In [10]:
# Save to CSV
interactions.to_csv("data/interim/interactions.csv", index=False)

In [5]:
"""Prepare User data"""

users = users_raw[['Unnamed: 0', 'Branche']]
users = users.rename(columns={'Unnamed: 0': 'USER_ID', 
                              'Branche': 'BRANCHE',
                             })

# Check results
print(users.shape)
display(users.head(2))

(18625, 2)


|   | USER_ID | BRANCHE |
|---|---------|---------|
| 0 | 8107232 | 15.0    |
| 1 | 8155006 | 10.0    |


In [6]:
"""Prepare Item data"""

items = items_raw[['id', 'name']]
items = items.rename(columns={'id': 'ITEM_ID', 
                              'name': 'NAME',
                             })

# Check results
print(items.shape)
display(items.head(2))

(73682, 2)


|   | ITEM_ID | NAME                                     |
|---|---------|------------------------------------------|
| 0 | 351     | Ankörn-Schablone MultiBlue für Montagepl |
| 1 | 2809    | Seitenrolle ø 30mm, max. 40kg Bauhöhe 31 |


### Upload data to S3 bucket

In [11]:
# Retrieve the list of existing buckets (optional)
s3 = boto3.client('s3')
response = s3.list_buckets()
for bucket in response['Buckets']:
    print(bucket['Name'])

rbuerki-01-personalize


In [12]:
"""Specify a Bucket and Data Output Location"""

bucket = "rbuerki-01-personalize"       # name of my S3 bucket
filename = "interactions.csv"  # name to save the dataset under

In [13]:
"""Attach policy to s3 bucket"""

policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': 'F4F48A5D4E9737DA',
  'HostId': 'CgqnkJGg345jNwIg1WYTUYXiqYO5lWbW7IBkuplSHts9c7HwpGjMAYwjZd5ED4PQo3qJWtY+wTQ=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'CgqnkJGg345jNwIg1WYTUYXiqYO5lWbW7IBkuplSHts9c7HwpGjMAYwjZd5ED4PQo3qJWtY+wTQ=',
   'x-amz-request-id': 'F4F48A5D4E9737DA',
   'date': 'Fri, 19 Jul 2019 04:38:04 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 1}}

In [17]:
"""Upload data"""

# boto3.Session().resource('s3').Bucket(bucket).Object(filename).upload_file("data/interim/{}".format(filename))
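
The upload call above is commented out. A minimal sketch of the same step as a reusable function, assuming the files sit in `data/interim/` and a boto3 S3 client is passed in; the function name `upload_datasets` and the `local_dir` parameter are hypothetical:

```python
import os


def upload_datasets(s3_client, bucket, filenames, local_dir="data/interim"):
    """Upload each file to the bucket, using the filename as the object key."""
    uploaded = []
    for name in filenames:
        local_path = os.path.join(local_dir, name)
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client call
        s3_client.upload_file(local_path, bucket, name)
        uploaded.append("s3://{}/{}".format(bucket, name))
    return uploaded


# Usage with the objects defined above (not executed here):
# upload_datasets(s3, bucket, [filename])
```

Returning the `s3://` locations is a convenience, since the later import job needs the data location in that form.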

## Prepare Data Structure

### Create Schemas

Schemas in Amazon Personalize are defined in the Avro format; see [Apache Avro](https://avro.apache.org/docs/current/) for details. You are free to choose the field order, but the schema must list the fields in the same order as the column headers of the data file to be imported.
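
The ordering requirement is easy to check before importing anything. A minimal sketch that compares a CSV header with the field names of a schema dict; the function name `header_matches_schema` is hypothetical, and how strictly Personalize enforces the order was not verified here:

```python
import csv
import io


def header_matches_schema(csv_text, schema):
    """Compare the first CSV row with the schema's field names, in order."""
    header = next(csv.reader(io.StringIO(csv_text)))
    expected = [field["name"] for field in schema["fields"]]
    return header == expected


demo_schema = {"fields": [{"name": "USER_ID"}, {"name": "ITEM_ID"}]}
print(header_matches_schema("USER_ID,ITEM_ID\n1,2\n", demo_schema))  # same order
print(header_matches_schema("ITEM_ID,USER_ID\n2,1\n", demo_schema))  # swapped columns
```

For a file on disk, passing the first line of the CSV as `csv_text` is enough.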

In [42]:
interactions_schema = {"type": "record", 
                       "name": "Interactions",
                       "namespace": "com.amazonaws.personalize.schema",
                       "fields": [
                       {
                           "name": "USER_ID",
                           "type": "string"
                       },
                       {
                           "name": "ITEM_ID",
                           "type": "string"
                       },
                       {
                           "name": "TIMESTAMP",
                           "type": "long"
                       },
                       {
                           "name": "EVENT_VALUE",
                           "type": "float"  # net values are decimals such as 77.3
                       }
                       ],
                       "version": "1.0"
                      }

In [43]:
create_schema_response = personalize.create_schema(
    name = "interactions-schema",
    schema = json.dumps(interactions_schema)
)

interactions_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:eu-west-1:873674308518:schema/interactions-schema",
  "ResponseMetadata": {
    "RequestId": "b4e20448-c293-4855-890b-1ffdf61d3a0c",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 19 Jul 2019 05:34:45 GMT",
      "x-amzn-requestid": "b4e20448-c293-4855-890b-1ffdf61d3a0c",
      "content-length": "85",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [34]:
users_schema = {"type": "record", 
                "name": "Users",
                "namespace": "com.amazonaws.personalize.schema",
                "fields": [
                {
                    "name": "USER_ID",
                    "type": "string"
                },
                {
                    "name": "BRANCHE",
                    "type": "string",
                    "categorical": True
                }
                          ],
                          "version": "1.0"
               }

In [35]:
create_schema_response = personalize.create_schema(
    name = "users-schema",
    schema = json.dumps(users_schema)
)

users_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:eu-west-1:873674308518:schema/users-schema",
  "ResponseMetadata": {
    "RequestId": "b210c93a-7dec-4141-a753-9cbb91d412b3",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 19 Jul 2019 05:21:05 GMT",
      "x-amzn-requestid": "b210c93a-7dec-4141-a753-9cbb91d412b3",
      "content-length": "78",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [24]:
items_schema = {"type": "record", 
                "name": "Items",
                "namespace": "com.amazonaws.personalize.schema",
                "fields": [
                {
                    "name": "ITEM_ID",
                    "type": "string"
                },
                {
                    "name": "NAME",
                    "type": "string",
                    "categorical": True
                }
                          ],
                          "version": "1.0"
               }

In [25]:
create_schema_response = personalize.create_schema(
    name = "items-schema",
    schema = json.dumps(items_schema)
)

items_schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:eu-west-1:873674308518:schema/items-schema",
  "ResponseMetadata": {
    "RequestId": "4dd789fb-662a-411b-98f2-5f23b7308322",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 19 Jul 2019 05:08:42 GMT",
      "x-amzn-requestid": "4dd789fb-662a-411b-98f2-5f23b7308322",
      "content-length": "78",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [39]:
# personalize.delete_schema(schemaArn="arn:aws:personalize:eu-west-1:873674308518:schema/interactions-schema")

{'ResponseMetadata': {'RequestId': '366d9a2f-2785-4499-b444-7f6f53ea9de4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Fri, 19 Jul 2019 05:34:02 GMT',
   'x-amzn-requestid': '366d9a2f-2785-4499-b444-7f6f53ea9de4',
   'content-length': '0',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
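
Creating a schema under a name that already exists raises an error, which is presumably why the delete call above was needed. A minimal sketch for looking up an existing schema ARN by name instead, assuming a boto3 Personalize client is passed in; the function name `find_schema_arn` is hypothetical:

```python
def find_schema_arn(personalize_client, schema_name):
    """Return the ARN of the schema with the given name, or None if absent."""
    kwargs = {}
    while True:
        page = personalize_client.list_schemas(**kwargs)
        for schema in page.get("schemas", []):
            if schema["name"] == schema_name:
                return schema["schemaArn"]
        token = page.get("nextToken")
        if not token:
            return None
        # Follow pagination until the name is found or the pages run out
        kwargs = {"nextToken": token}


# Usage with the client defined above (not executed here):
# interactions_schema_arn = find_schema_arn(personalize, "interactions-schema")
```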

### Create (and wait for) Dataset Group

In [26]:
create_dataset_group_response = personalize.create_dataset_group(
    name = "recommender-test-dataset-group"
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:eu-west-1:873674308518:dataset-group/recommender-test-dataset-group",
  "ResponseMetadata": {
    "RequestId": "abf1ea9b-d51d-458f-85b4-862c5fb76992",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 19 Jul 2019 05:10:44 GMT",
      "x-amzn-requestid": "abf1ea9b-d51d-458f-85b4-862c5fb76992",
      "content-length": "109",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [27]:
"""Wait for Dataset Group to have ACTIVE status"""

max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(60)

DatasetGroup: CREATE PENDING
DatasetGroup: ACTIVE
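
The same polling pattern is needed again for later long-running steps, so it may be worth wrapping. A minimal sketch, assuming the status strings `ACTIVE` and `CREATE FAILED` used in the loop above; the function name `wait_until_done` and its parameters are hypothetical:

```python
import time


def wait_until_done(get_status, poll_seconds=60, timeout_seconds=3 * 60 * 60,
                    sleep=time.sleep):
    """Poll get_status() until it returns ACTIVE or CREATE FAILED, or time runs out."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_status()
        if status in ("ACTIVE", "CREATE FAILED"):
            return status
        sleep(poll_seconds)
    raise TimeoutError("resource did not reach a final status in time")


# Demo with canned statuses and a no-op sleep instead of a real AWS call
statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
print(wait_until_done(lambda: next(statuses), sleep=lambda seconds: None))
```

With the real client, `get_status` could be a lambda that calls `personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)` and reads `["datasetGroup"]["status"]`, as in the cell above.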


### Create Datasets

In [44]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    name = "recommender-test-interactions",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = interactions_schema_arn
)

dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:eu-west-1:873674308518:dataset/recommender-test-dataset-group/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "ad3789ca-c175-4478-81c0-6953f74900a8",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "content-type": "application/x-amz-json-1.1",
      "date": "Fri, 19 Jul 2019 05:34:50 GMT",
      "x-amzn-requestid": "ad3789ca-c175-4478-81c0-6953f74900a8",
      "content-length": "111",
      "connection": "keep-alive"
    },
    "RetryAttempts": 0
  }
}


In [None]:
dataset_type = "USERS"
create_dataset_response = personalize.create_dataset(
    name = "recommender-test-users",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = users_schema_arn  # ARN returned when creating users-schema
)

# Separate variable so the interactions dataset ARN is not overwritten
users_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

In [None]:
dataset_type = "ITEMS"
create_dataset_response = personalize.create_dataset(
    name = "recommender-test-items",
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = items_schema_arn  # ARN returned when creating items-schema
)

# Separate variable so the interactions dataset ARN is not overwritten
items_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))
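
The three dataset cells differ only in type, name and schema ARN, so they can be collapsed into one loop. A minimal sketch, assuming a boto3 Personalize client is passed in; the function name `create_datasets` and the `prefix` parameter are hypothetical:

```python
def create_datasets(personalize_client, dataset_group_arn, schema_arns,
                    prefix="recommender-test"):
    """Create one dataset per type and return a dict mapping type to dataset ARN."""
    dataset_arns = {}
    for dataset_type, schema_arn in schema_arns.items():
        response = personalize_client.create_dataset(
            name="{}-{}".format(prefix, dataset_type.lower()),
            datasetType=dataset_type,
            datasetGroupArn=dataset_group_arn,
            schemaArn=schema_arn,
        )
        dataset_arns[dataset_type] = response["datasetArn"]
    return dataset_arns


# Usage with the ARNs defined above (not executed here):
# create_datasets(personalize, dataset_group_arn, {
#     "INTERACTIONS": interactions_schema_arn,
#     "USERS": users_schema_arn,
#     "ITEMS": items_schema_arn,
# })
```

Keeping the ARNs in a dict keyed by dataset type avoids reusing one `dataset_arn` variable for all three datasets.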

## Prepare, create, and wait for Dataset Import Job