# Amazon Personalize User Segmentation - Airline Ticket marketing campaigns

Amazon Personalize offers two recipes that segment your users based on their interest in different product categories, brands and more. 
1. Item affinity recipe `aws-item-affinity` identifies users based on their interest in the individual items in your catalog. 
1. Item attribute affinity recipe `aws-item-attribute` identifies users based on the attributes of items in your catalog such as airline, promotion, season, cities etc. This allows you to better engage users with your marketing campaigns and improve retention through targeted messaging.

This notebook demonstrates how to use the `aws-item-affinity` and `aws-item-attribute` recipes to create user segments based on their preferences for airline products in a sample dataset. We use one dataset group that contains user-item interaction data, item metadata, and user metadata. We train a solution with each of the two recipes on these datasets and then create user segments in batch.


This notebook guides you through the deployment of the following architecture. 


We will deploy the following resources:
1. S3 bucket used to store the training files, plus our inference input and output files
1. A dataset group
1. Three datasets - Interactions, Items, Users
1. Two solutions and solution versions configured with each of our new User Segmentation recipes
1. Two batch segment jobs

Once the batch segment jobs finish, we will analyze their results.

## Install and import libraries


In [None]:
!pip install numpy==1.25.1


In [2]:
import pandas as pd
import json
import numpy as np
from datetime import datetime
import boto3
import time
from time import sleep
from lxml import html
import seaborn as sns
import matplotlib.pyplot as plt
import sys
from tqdm import tqdm
import datetime as dt



### Create the Personalize and S3 Boto3 clients

In [14]:
# let's validate that your environment can communicate successfully with Amazon Personalize.

personalize = boto3.client(service_name='personalize')
personalize_runtime = boto3.client(service_name='personalize-runtime')
personalize_events = boto3.client(service_name='personalize-events')

s3 =boto3.client('s3')

## Upload data to S3

In [7]:
import boto3

# Create a boto3 session
session = boto3.Session()

# Get the current AWS region
region = session.region_name

# Print the current region
print("Current AWS region:", region)


Current AWS region: us-east-1


### Create S3 bucket

In [8]:
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
suffix = str(np.random.uniform())[4:9]
bucket_name = "personalize-user-segment" + suffix
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

personalize-user-segment90441
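
The suffix above is sliced out of the decimal expansion of a random float, which can occasionally produce fewer than five characters or a non-digit character (for example when the float is printed in scientific notation). A minimal alternative sketch, using only the standard library; `make_suffix` is a hypothetical helper and not part of the original notebook:

```python
import secrets

def make_suffix(length=5):
    """Return a string of `length` random decimal digits (hypothetical helper)."""
    return "".join(str(secrets.randbelow(10)) for _ in range(length))

# Build a bucket name the same way the notebook does, with the new suffix.
suffix_example = make_suffix()
bucket_name_example = "personalize-user-segment" + suffix_example
print(bucket_name_example)
```

The result always has exactly `length` digits, so the bucket, dataset group, and schema names that reuse the suffix keep a predictable shape.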


### Upload datasets into S3

In [42]:
# interactions dataset
interactions_filename = 'df_interactions.csv'
boto3.Session().resource('s3').Bucket(bucket_name).Object(interactions_filename).upload_file(interactions_filename)

In [11]:
# item dataset
item_metadata_file = 'df_item_deduplicated.csv'
boto3.Session().resource('s3').Bucket(bucket_name).Object(item_metadata_file).upload_file(item_metadata_file)

In [12]:
# user dataset
user_metadata_file = 'df_users_deduplicated.csv'
boto3.Session().resource('s3').Bucket(bucket_name).Object(user_metadata_file).upload_file(user_metadata_file)


### Helper functions
The following helper functions will be used later in the notebook.

In [4]:
def print_s3_file_content(bucket, key, limit=None):
    obj = s3.get_object(Bucket=bucket, Key=key)

    i = 0
    for line in obj['Body'].read().decode("utf-8").split("\n"):
        print(line)
        i+=1
        if limit is not None and i >= limit:  # stop after exactly `limit` lines
            break

max_time = time.time() + 3 * 60 * 60 # 3 hours

def wait_for_dataset_group_job(dataset_group_arn):
    max_time = time.time() + 3 * 60 * 60
    while time.time() < max_time:
        describe_dataset_group_response = personalize.describe_dataset_group(
            datasetGroupArn = dataset_group_arn
        )
        status = describe_dataset_group_response["datasetGroup"]["status"]
        print("DatasetGroup: {}".format(status))

        if status == "ACTIVE" or status == "CREATE FAILED":
            break

        time.sleep(60)
        
def wait_for_dataset_import_job(dataset_import_job_arn):
    max_time = time.time() + 3 * 60 * 60
    while time.time() < max_time:
        describe_dataset_import_job_response = personalize.describe_dataset_import_job(
            datasetImportJobArn = dataset_import_job_arn
        )
        status = describe_dataset_import_job_response["datasetImportJob"]['status']
        print("DatasetImportJob: {}".format(status))

        if status == "ACTIVE" or status == "CREATE FAILED":
            break

        time.sleep(120)
            
def wait_for_solution_version_job(solution_version_arn):
    max_time = time.time() + 3 * 60 * 60
    while time.time() < max_time:
        describe_solution_version_response = personalize.describe_solution_version(
            solutionVersionArn = solution_version_arn
        )
        status = describe_solution_version_response["solutionVersion"]["status"]
        print("SolutionVersion: {}".format(status))

        start = describe_solution_version_response["solutionVersion"]["creationDateTime"]
        end = describe_solution_version_response["solutionVersion"]["lastUpdatedDateTime"]
        if status == "ACTIVE":
            print("Time took: {}".format(end - start))
            break
        if status == "CREATE FAILED":
            print("Time took: {}".format(end - start))
            print("Job Failed: {}".format(describe_solution_version_response["solutionVersion"]["failureReason"]))
            break

        time.sleep(180)
        
def wait_for_batch_segment_job(batch_segment_job_arn):
    max_time = time.time() + 3 * 60 * 60
    while time.time() < max_time:
        describe_job_response = personalize.describe_batch_segment_job(
            batchSegmentJobArn = batch_segment_job_arn
        )
        status = describe_job_response["batchSegmentJob"]["status"]
        print("Batch Segment Job: {}".format(status))

        start = describe_job_response["batchSegmentJob"]["creationDateTime"]
        end = describe_job_response["batchSegmentJob"]["lastUpdatedDateTime"]
        if status == "ACTIVE":
            print("Time took: {}".format(end - start))
            break
        if status == "CREATE FAILED":
            print("Time took: {}".format(end - start))
            print("Job Failed: {}".format(describe_job_response["batchSegmentJob"]["failureReason"]))
            break

        time.sleep(180)
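
The four wait helpers above share one polling pattern: describe the resource, print its status, stop on `ACTIVE` or `CREATE FAILED`, otherwise sleep and retry until a three-hour deadline. As a sketch of how that pattern could be factored out, the hypothetical `wait_until_done` below takes any zero-argument callable that returns the current status string. The `sleep` and `clock` parameters are assumptions added here so the loop can be exercised without waiting.

```python
import time

def wait_until_done(describe, label, poll_seconds=60,
                    timeout_seconds=3 * 60 * 60,
                    sleep=time.sleep, clock=time.time):
    """Poll `describe()` until a terminal status is seen or the timeout passes.

    Returns the last status observed (None if the loop never polled).
    """
    terminal = {"ACTIVE", "CREATE FAILED"}
    deadline = clock() + timeout_seconds
    status = None
    while clock() < deadline:
        status = describe()
        print("{}: {}".format(label, status))
        if status in terminal:
            break
        sleep(poll_seconds)
    return status

# Dry run with a fake status sequence instead of a real Personalize call.
fake_statuses = iter(["CREATE PENDING", "CREATE IN_PROGRESS", "ACTIVE"])
final_status = wait_until_done(lambda: next(fake_statuses), "Demo",
                               sleep=lambda seconds: None)
```

With the real client, the callable could be something like `lambda: personalize.describe_dataset_group(datasetGroupArn=dataset_group_arn)["datasetGroup"]["status"]`.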
        
        

# Item affinity recipe

The Item affinity recipe will recommend users that are likely to engage with a given item.
### Notes:
* Jump to step 2, Create Solution, if you already have a dataset group.
* Jump to step 3, Create Batch Segment Job, if you already have a trained solution version and want to query without filters.
* Jump to step 4, Filter, if you already have a trained solution version and want to test with filters.

## 1. Create dataset group, datasets and upload datasets

### Create a Dataset Group
The following cell creates a new dataset group named `airlines-dataset-group-` followed by the random suffix.

In [17]:
dataset_group_name = "airlines-dataset-group-" + suffix

create_dataset_group_response = personalize.create_dataset_group(
    name = dataset_group_name
)

dataset_group_arn = create_dataset_group_response['datasetGroupArn']
print(json.dumps(create_dataset_group_response, indent=2))

{
  "datasetGroupArn": "arn:aws:personalize:us-east-1:696784033931:dataset-group/airlines-dataset-group-90441",
  "ResponseMetadata": {
    "RequestId": "d5cbc46d-1265-4689-8a44-40c1ddb42520",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:07:24 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "107",
      "connection": "keep-alive",
      "x-amzn-requestid": "d5cbc46d-1265-4689-8a44-40c1ddb42520"
    },
    "RetryAttempts": 0
  }
}


Before we can use the dataset group, it must be active. This can take a minute or two. Execute the cell below and wait for it to show the ACTIVE status.

In [66]:
wait_for_dataset_group_job(dataset_group_arn)

DatasetGroup: ACTIVE


In [18]:
status = None
max_time = time.time() + 3*60*60 # 3 hours
while time.time() < max_time:
    describe_dataset_group_response = personalize.describe_dataset_group(
        datasetGroupArn = dataset_group_arn
    )
    status = describe_dataset_group_response["datasetGroup"]["status"]
    print("DatasetGroup: {}".format(status))
    
    if status == "ACTIVE" or status == "CREATE FAILED":
        break
        
    time.sleep(20)

DatasetGroup: ACTIVE


### Interaction schema

In [15]:
schema_name="airlines-interaction-schema-"+suffix

schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name":"CABIN_TYPE",
            "type": "string",
            "categorical": True
        },
        {
          "name": "EVENT_TYPE",
          "type": "string"
        },
        {
          "name": "EVENT_VALUE",
          "type": "float"
        }
    ],
    "version": "1.0"
}

create_schema_response = personalize.create_schema(
    name = schema_name,
    schema = json.dumps(schema)
)

schema_arn = create_schema_response['schemaArn']
print(json.dumps(create_schema_response, indent=2))



{
  "schemaArn": "arn:aws:personalize:us-east-1:696784033931:schema/airlines-interaction-schema-90441",
  "ResponseMetadata": {
    "RequestId": "9edb02fc-0e0d-4c05-ae4f-4c2db7a2e312",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:01:03 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "99",
      "connection": "keep-alive",
      "x-amzn-requestid": "9edb02fc-0e0d-4c05-ae4f-4c2db7a2e312"
    },
    "RetryAttempts": 0
  }
}
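
An Interactions schema for a custom dataset group needs at least the `USER_ID`, `ITEM_ID`, and `TIMESTAMP` fields. As a quick local check before calling `create_schema`, the sketch below lists any of those fields that are absent from a schema dictionary. `missing_required_fields` and the trimmed `example_schema` are hypothetical, written only for illustration.

```python
REQUIRED_INTERACTION_FIELDS = {"USER_ID", "ITEM_ID", "TIMESTAMP"}

def missing_required_fields(schema, required=REQUIRED_INTERACTION_FIELDS):
    """Return the required field names that the schema does not define, sorted."""
    names = {field["name"] for field in schema["fields"]}
    return sorted(required - names)

# A trimmed copy of the interactions schema used in this notebook.
example_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "USER_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
        {"name": "EVENT_TYPE", "type": "string"},
    ],
    "version": "1.0",
}

print(missing_required_fields(example_schema))
```

An empty list means the three required fields are present; it does not validate types or the optional fields.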


### Interaction dataset

In [19]:
dataset_type = "INTERACTIONS"
create_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = schema_arn,
    name = "airlines-dataset-interactions-" + suffix
)

interactions_dataset_arn = create_dataset_response['datasetArn']
print(json.dumps(create_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/INTERACTIONS",
  "ResponseMetadata": {
    "RequestId": "bbd94f3c-784e-4dab-8b71-ddec242a0792",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:09:39 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "109",
      "connection": "keep-alive",
      "x-amzn-requestid": "bbd94f3c-784e-4dab-8b71-ddec242a0792"
    },
    "RetryAttempts": 0
  }
}


### Users schema

In [20]:
user_metadata_schema_name="airlines-users-schema-"+suffix

user_metadata_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "memberClass",
            "type": "string",
            "categorical": True
        }
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(
    name = user_metadata_schema_name,
    schema = json.dumps(user_metadata_schema)
)

metadata_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))

{
  "schemaArn": "arn:aws:personalize:us-east-1:696784033931:schema/airlines-users-schema-90441",
  "ResponseMetadata": {
    "RequestId": "927ba64b-943a-4593-9fe2-87d23254a182",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:15:17 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "93",
      "connection": "keep-alive",
      "x-amzn-requestid": "927ba64b-943a-4593-9fe2-87d23254a182"
    },
    "RetryAttempts": 0
  }
}


### Users Dataset

In [21]:
dataset_type = "USERS"
create_metadata_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = metadata_schema_arn,  # ARN of the Users schema created in the previous cell
    name = "airlines-metadata-dataset-users-" + suffix
)

user_metadata_dataset_arn = create_metadata_dataset_response['datasetArn']
print(json.dumps(create_metadata_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/USERS",
  "ResponseMetadata": {
    "RequestId": "f5b15644-a679-4d1e-a1cd-316fa3235c26",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:18:10 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "102",
      "connection": "keep-alive",
      "x-amzn-requestid": "f5b15644-a679-4d1e-a1cd-316fa3235c26"
    },
    "RetryAttempts": 0
  }
}


### Items schema

In [22]:
item_metadata_schema_name="airlines-item-schema-"+suffix

# Define the updated schema for items based on your dataframe columns
items_schema = {
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "DSTCity",
            "type": ["null", "string"],
            "categorical": True
        },
        {
            "name": "SRCCity",
            "type": ["null", "string"],
            "categorical": True
        },
        {
            "name": "Airline",
            "type": ["null", "string"],
            "categorical": True
        },
        {
            "name": "DurationDays",
            "type": "int"
        },
        {
            "name": "Season",
            "type": ["null", "string"],
            "categorical": True
        },
        {
            "name": "numberOfSearchByUser",
            "type": "int"
        },
        {
            "name": "Promotion",
            "type": ["null", "string"],
            "categorical": True
        },
        {
            "name": "DynamicPrice",
            "type": "int"
        },
        {
            "name": "DiscountForMember",
            "type": "float"
        },
        {
            "name": "Expired",
            "type": ["null", "string"],
            "categorical": True
        }
    ],
    "version": "1.0"
}

create_metadata_schema_response = personalize.create_schema(
    name = item_metadata_schema_name,
    schema = json.dumps(items_schema)
)

metadata_schema_arn = create_metadata_schema_response['schemaArn']
print(json.dumps(create_metadata_schema_response, indent=2))


{
  "schemaArn": "arn:aws:personalize:us-east-1:696784033931:schema/airlines-item-schema-90441",
  "ResponseMetadata": {
    "RequestId": "dea88384-2a93-49be-81a1-078e8c1914d3",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:47:40 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "92",
      "connection": "keep-alive",
      "x-amzn-requestid": "dea88384-2a93-49be-81a1-078e8c1914d3"
    },
    "RetryAttempts": 0
  }
}


### Items dataset

In [23]:
dataset_type = "ITEMS"
create_metadata_dataset_response = personalize.create_dataset(
    datasetType = dataset_type,
    datasetGroupArn = dataset_group_arn,
    schemaArn = metadata_schema_arn,
    name = "airlines-metadata-dataset-items-" + suffix
)

metadata_dataset_arn = create_metadata_dataset_response['datasetArn']
print(json.dumps(create_metadata_dataset_response, indent=2))

{
  "datasetArn": "arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/ITEMS",
  "ResponseMetadata": {
    "RequestId": "67b75a8f-6c68-4744-a027-58539145fa06",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 07:49:44 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "102",
      "connection": "keep-alive",
      "x-amzn-requestid": "67b75a8f-6c68-4744-a027-58539145fa06"
    },
    "RetryAttempts": 0
  }
}
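
Before importing a CSV file, it can help to confirm that its header row lines up with the schema fields. The sketch below compares the two and reports names that appear on only one side. `compare_header_to_schema`, the sample CSV text, and its `Extra` column are hypothetical and only illustrate the check; how Amazon Personalize treats mismatched columns is not covered here.

```python
import csv
import io

def compare_header_to_schema(csv_text, schema):
    """Return (missing, extra): schema fields absent from the CSV header,
    and header columns that the schema does not define."""
    header = next(csv.reader(io.StringIO(csv_text)))
    fields = [field["name"] for field in schema["fields"]]
    missing = [name for name in fields if name not in header]
    extra = [name for name in header if name not in fields]
    return missing, extra

# Hypothetical sample: one schema field is missing and one column is unexpected.
sample_csv = "ITEM_ID,DSTCity,Extra\n1,Hong Kong,x\n"
sample_schema = {
    "fields": [
        {"name": "ITEM_ID", "type": "string"},
        {"name": "DSTCity", "type": ["null", "string"]},
        {"name": "Airline", "type": ["null", "string"]},
    ]
}

missing, extra = compare_header_to_schema(sample_csv, sample_schema)
print("missing:", missing)
print("extra:", extra)
```

For a local file, only the first line needs to be read and passed in as `csv_text`.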


## Configure an S3 bucket and an IAM role

### Set the S3 bucket policy
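Amazon Personalize needs to read the contents of your S3 bucket, so attach a bucket policy that allows it. The policy below also grants `s3:PutObject`, which lets the batch segment jobs write their output to the same bucket.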


In [24]:
s3 = boto3.client("s3")
bucket = bucket_name
policy = {
    "Version": "2012-10-17",
    "Id": "PersonalizeS3BucketAccessPolicy",
    "Statement": [
        {
            "Sid": "PersonalizeS3BucketAccessPolicy",
            "Effect": "Allow",
            "Principal": {
                "Service": "personalize.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::{}".format(bucket),
                "arn:aws:s3:::{}/*".format(bucket)
            ]
        }
    ]
}
# Attach the policy to the bucket (skip this call if the policy is already attached)
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))

{'ResponseMetadata': {'RequestId': '18RERZAWXQ21J4TF',
  'HostId': 'CNAZY0j8dGRhx2n8V6FPigVQRRn5EG1PRm0CfMVmboDY3vdk1LU1U4god8JYYyDsdK0SeNq5GwiocIHtfLGrgPSjnojN0YBnFsDyiFeM4sE=',
  'HTTPStatusCode': 204,
  'HTTPHeaders': {'x-amz-id-2': 'CNAZY0j8dGRhx2n8V6FPigVQRRn5EG1PRm0CfMVmboDY3vdk1LU1U4god8JYYyDsdK0SeNq5GwiocIHtfLGrgPSjnojN0YBnFsDyiFeM4sE=',
   'x-amz-request-id': '18RERZAWXQ21J4TF',
   'date': 'Mon, 18 Sep 2023 08:44:29 GMT',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

### Create an IAM role with AmazonPersonalizeFullAccess and AmazonS3FullAccess

In [26]:
iam = boto3.client("iam")
role_name = "PersonalizeRoleDemo"+account_id
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "personalize.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonPersonalizeFullAccess provides access to any S3 bucket with a name that includes "personalize" or "Personalize" 
# if you would like to use a bucket with a different name, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = "arn:aws:iam::aws:policy/service-role/AmazonPersonalizeFullAccess"
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

arn:aws:iam::696784033931:role/PersonalizeRoleDemo696784033931
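
The comments in the cell above suggest attaching a narrower policy when you do not want broad S3 access. As a sketch of what such a policy document might look like, the hypothetical `bucket_scoped_policy` below limits the role to listing one bucket and reading and writing its objects. The action list is an assumption chosen to mirror the bucket policy used earlier, not an official minimum.

```python
import json

def bucket_scoped_policy(bucket):
    """Build an IAM policy document limited to a single bucket (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::{}".format(bucket)],
            },
            {
                # Reading and writing apply to the objects inside it.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(bucket)],
            },
        ],
    }

policy_json = json.dumps(bucket_scoped_policy("personalize-user-segment90441"), indent=2)
print(policy_json)
```

The resulting JSON string could be attached as an inline policy with `iam.put_role_policy(RoleName=role_name, PolicyName=..., PolicyDocument=policy_json)` in place of `AmazonS3FullAccess`; the policy name is yours to choose.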


### Import the interactions data

In [44]:
print (bucket_name)
print (interactions_filename)
print (interactions_dataset_arn)
print (role_arn)

personalize-user-segment90441
df_interactions.csv
arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/INTERACTIONS
s3://personalize-user-segment90441/df_interactions.csv
arn:aws:iam::696784033931:role/PersonalizeRoleDemo696784033931


In [45]:
create_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "airlines-dataset-import-job-"+suffix,
    datasetArn = interactions_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, interactions_filename)
    },
    roleArn = role_arn
)

dataset_import_job_arn = create_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:696784033931:dataset-import-job/airlines-dataset-import-job-90441",
  "ResponseMetadata": {
    "RequestId": "4a060cda-6564-4a68-9bfc-82b5a9ae0c86",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 09:24:10 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "121",
      "connection": "keep-alive",
      "x-amzn-requestid": "4a060cda-6564-4a68-9bfc-82b5a9ae0c86"
    },
    "RetryAttempts": 0
  }
}


In [58]:
wait_for_dataset_import_job(dataset_import_job_arn)

DatasetImportJob: ACTIVE


### Import the items data

In [None]:
print (bucket_name)
print (item_metadata_file)
print (metadata_dataset_arn)
print (role_arn)

personalize-user-segment90441
df_item_deduplicated.csv
arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/ITEMS
arn:aws:iam::696784033931:role/PersonalizeRoleDemo696784033931


In [52]:
create_metadata_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "airlines-items-metadata-dataset-import-job-"+suffix,
    datasetArn = metadata_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, item_metadata_file)
    },
    roleArn = role_arn
)

metadata_dataset_import_job_arn = create_metadata_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_metadata_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:696784033931:dataset-import-job/airlines-items-metadata-dataset-import-job-90441",
  "ResponseMetadata": {
    "RequestId": "3e1cc3bd-545e-449b-9804-6091e60053ef",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 11:58:42 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "136",
      "connection": "keep-alive",
      "x-amzn-requestid": "3e1cc3bd-545e-449b-9804-6091e60053ef"
    },
    "RetryAttempts": 0
  }
}


In [57]:
wait_for_dataset_import_job(metadata_dataset_import_job_arn)

DatasetImportJob: ACTIVE


### Import the user data

In [62]:
print (bucket_name)
print (user_metadata_file)
print (user_metadata_dataset_arn)
print (role_arn)

personalize-user-segment90441
df_users_deduplicated.csv
arn:aws:personalize:us-east-1:696784033931:dataset/airlines-dataset-group-90441/USERS
arn:aws:iam::696784033931:role/PersonalizeRoleDemo696784033931


In [64]:
create_user_metadata_dataset_import_job_response = personalize.create_dataset_import_job(
    jobName = "airlines-user-metadata-dataset-import-job-"+suffix,
    datasetArn = user_metadata_dataset_arn,
    dataSource = {
        "dataLocation": "s3://{}/{}".format(bucket_name, user_metadata_file)
    },
    roleArn = role_arn
)

user_metadata_dataset_import_job_arn = create_user_metadata_dataset_import_job_response['datasetImportJobArn']
print(json.dumps(create_user_metadata_dataset_import_job_response, indent=2))

{
  "datasetImportJobArn": "arn:aws:personalize:us-east-1:696784033931:dataset-import-job/airlines-user-metadata-dataset-import-job-90441",
  "ResponseMetadata": {
    "RequestId": "f99062a7-b640-4a71-aea3-be4673ec7184",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 18 Sep 2023 13:22:28 GMT",
      "content-type": "application/x-amz-json-1.1",
      "content-length": "135",
      "connection": "keep-alive",
      "x-amzn-requestid": "f99062a7-b640-4a71-aea3-be4673ec7184"
    },
    "RetryAttempts": 0
  }
}


In [65]:
wait_for_dataset_import_job(user_metadata_dataset_import_job_arn)

DatasetImportJob: CREATE PENDING
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: CREATE IN_PROGRESS
DatasetImportJob: ACTIVE


## Prepare marketing promotion item

In [76]:
import pandas as pd

# Specify the file path
file_path = "df_item_deduplicated.csv"

# Load the CSV file into a DataFrame
df_item_deduplicated = pd.read_csv(file_path)

df_item_deduplicated.head()

Unnamed: 0,ITEM_ID,DSTCity,SRCCity,Airline,DurationDays,Season,numberOfSearchByUser,Promotion,DynamicPrice,DiscountForMember,Expired
0,-2996180203100007228,Hong Kong,Manila,LeopardSpot Airlines,16,April,18444,No,3287,0.0,No
1,-198685942195013072,London,Bangkok,HawkGlide Express,17,October,3812,No,9493,0.0,No
2,-1023099686133718933,Beijing,Kuala Lumpur,KoalaHug Express,8,November,16503,Yes,2713,0.25,No
3,35400463909927393,London,Jakarta,ButterflyWing Express,10,June,12086,No,9118,0.0,No
4,8566475821201526637,Shanghai,Singapore,PeacockPlume Airways,3,October,14729,No,4775,0.0,No


In [77]:
# Select promoted October flights to Hong Kong
result_df = df_item_deduplicated[
    (df_item_deduplicated['Promotion'] == 'Yes')
    & (df_item_deduplicated['Season'] == 'October')
    & (df_item_deduplicated['DSTCity'] == 'Hong Kong')
]

# Display the resulting DataFrame
result_df.head()

Unnamed: 0,ITEM_ID,DSTCity,SRCCity,Airline,DurationDays,Season,numberOfSearchByUser,Promotion,DynamicPrice,DiscountForMember,Expired
83,-2750327774238244386,Hong Kong,Singapore,PandaPaw Express,18,October,14572,Yes,6158,0.5,No
124,7498680365518056067,Hong Kong,Jakarta,KangarooKick Express,9,October,3341,Yes,9946,0.4,No
194,-4293041568465629878,Hong Kong,Kuala Lumpur,TigerPounce Express,10,October,7634,Yes,9171,0.5,No


In [78]:
# Pick one of the matching promotion items at random
random_row = result_df.sample(n=1)

# Save the randomly selected row to a JSON file
random_row.to_json('promotion_item_metadata.json', orient='records')

# Display the randomly selected row
random_row.head()

Unnamed: 0,ITEM_ID,DSTCity,SRCCity,Airline,DurationDays,Season,numberOfSearchByUser,Promotion,DynamicPrice,DiscountForMember,Expired
194,-4293041568465629878,Hong Kong,Kuala Lumpur,TigerPounce Express,10,October,7634,Yes,9171,0.5,No


### Prepare the item metadata for LangChain prompting (test-metadata.json)

In [79]:
import json

# Read the JSON file
with open('promotion_item_metadata.json', 'r') as file:
    data = json.load(file)

# Extract the first element (dictionary) from the list
json_data = data[0]

# Save the extracted data to a new JSON file
with open('test-metadata.json', 'w') as output_file:
    json.dump(json_data, output_file)

print(f'Data saved to "test-metadata.json"')

Data saved to "test-metadata.json"


### Prepare the batch segment job input file (item-affinity-query.json)

In [82]:
# Store itemId as a string so the JSON input file matches {"itemId": "..."}
df_item_affinity_query = pd.DataFrame({"itemId": [str(random_row['ITEM_ID'].values[0])]})

In [83]:
# Display the DataFrame
df_item_affinity_query.head()


Unnamed: 0,itemId
0,-4293041568465629878


In [84]:
df_item_affinity_query.to_json('item-affinity-query.json', orient='records')


In [85]:
import json

# Read the JSON file
with open('item-affinity-query.json', 'r') as file:
    data = json.load(file)

# Extract the first element (dictionary) from the list
json_data = data[0]

# Save the extracted data to a new JSON file
with open('item-affinity-query.json', 'w') as output_file:
    json.dump(json_data, output_file)

print(f'Data saved to "item-affinity-query.json"')

Data saved to "item-affinity-query.json"
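
The batch segment input is a JSON Lines file: one `{"itemId": "..."}` object per line. The cells above produce a single line by writing a JSON array and then unwrapping it. As a sketch for querying several items at once, the hypothetical `to_json_lines` below builds the file content directly; the two item IDs are taken from the promotion items listed earlier.

```python
import json

def to_json_lines(item_ids):
    """Return JSON Lines text with one {"itemId": "<id>"} object per line."""
    return "\n".join(json.dumps({"itemId": str(item_id)}) for item_id in item_ids) + "\n"

# IDs are converted to strings, so integer IDs read from a CSV are fine.
query_text = to_json_lines([-4293041568465629878, 7498680365518056067])
print(query_text, end="")
```

Writing `query_text` to `item-affinity-query.json` would give a file with two queries instead of one.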


## 2. Create Solution

### 2.1 Select item-affinity recipe

In [68]:
item_user_recipe = 'arn:aws:personalize:::recipe/aws-item-affinity'

### 2.2 Create solution

In [69]:
print (dataset_group_arn)

arn:aws:personalize:us-east-1:696784033931:dataset-group/airlines-dataset-group-90441


In [70]:
create_solution_response = personalize.create_solution(
    name = "item-affinity-airline-meta-demo",
    datasetGroupArn = dataset_group_arn,
    recipeArn = item_user_recipe,
)
solution_arn = create_solution_response['solutionArn']

In [71]:
personalize.describe_solution(solutionArn = solution_arn)

{'solution': {'name': 'item-affinity-airline-meta-demo',
  'solutionArn': 'arn:aws:personalize:us-east-1:696784033931:solution/item-affinity-airline-meta-demo',
  'performHPO': False,
  'performAutoML': False,
  'recipeArn': 'arn:aws:personalize:::recipe/aws-item-affinity',
  'datasetGroupArn': 'arn:aws:personalize:us-east-1:696784033931:dataset-group/airlines-dataset-group-90441',
  'status': 'ACTIVE',
  'creationDateTime': datetime.datetime(2023, 9, 18, 15, 16, 49, 574000, tzinfo=tzlocal()),
  'lastUpdatedDateTime': datetime.datetime(2023, 9, 18, 15, 16, 49, 574000, tzinfo=tzlocal())},
 'ResponseMetadata': {'RequestId': '3f34fd10-b036-47ad-9e43-217f41094017',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 18 Sep 2023 15:17:01 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '456',
   'connection': 'keep-alive',
   'x-amzn-requestid': '3f34fd10-b036-47ad-9e43-217f41094017'},
  'RetryAttempts': 0}}

### 2.3 Create Solution Version

In [72]:
create_solution_version_response = personalize.create_solution_version(
    solutionArn = solution_arn
)
solution_version_arn = create_solution_version_response['solutionVersionArn']
print(solution_version_arn)

arn:aws:personalize:us-east-1:696784033931:solution/item-affinity-airline-meta-demo/5d132205


#### Wait for Solution Version to Have ACTIVE Status

Note: model training takes about 33 minutes. While you wait, you can run the companion Bedrock SDK notebook to create the promotion content.

In [73]:
wait_for_solution_version_job(solution_version_arn)

SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: CREATE IN_PROGRESS
SolutionVersion: ACTIVE
Time took: 0:33:31.972000


### 2.4 Get Metrics
Note: these are Amazon Personalize's offline metrics, which are used to compare results across solution versions. They should not be confused with the metrics that we will derive from the test dataset built when preprocessing the data.

In [75]:
get_solution_metrics_response = personalize.get_solution_metrics(solutionVersionArn=solution_version_arn)
print(get_solution_metrics_response['metrics'])

{'coverage': 0.1435, 'hits_at_1_percent': 0.4326, 'recall_at_1_percent': 0.0687}


## 3. Create Batch Segment Job


### 3.1 Prepare the input query data (item-affinity-query.json)

In [86]:
# example json lines in the input file:
# {"itemId": "1"}
# {"itemId": "2"}
# {"itemId": "3"}
# ...
batch_file_name = 'item-affinity-query.json'

# upload the file into S3
boto3.Session().resource('s3').Bucket(bucket_name).Object(batch_file_name).upload_file(batch_file_name)

batch_input_path = "s3://"+bucket_name+"/"+batch_file_name
batch_output_path = "s3://"+bucket_name+"/output/"
print_s3_file_content(bucket=bucket_name, key=batch_file_name,limit=3)
# these are the file contents

{"itemId": "-4293041568465629878"}


### 3.2 Create Batch Segment Job

In [87]:
import datetime

# Define the prefix
prefix = "item-affinity-query-query-"

# Get the current timestamp in the desired format
current_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

# Combine the prefix and current timestamp to create the job name
job_name = f"{prefix}{current_time}"




In [88]:

create_batch_segment_response = personalize.create_batch_segment_job(
    jobName = job_name,
    solutionVersionArn = solution_version_arn,
    numResults = 5,
    jobInput =  {
        "s3DataSource": {
            "path": batch_input_path
        }
    },
    jobOutput = {
        "s3DataDestination": {
            "path": batch_output_path
        }
    },
    roleArn = role_arn 
    )

batch_segment_job_arn = create_batch_segment_response['batchSegmentJobArn']
print(batch_segment_job_arn)

arn:aws:personalize:us-east-1:696784033931:batch-segment-job/item-affinity-query-query-20230918171159


Note: the batch segment job takes about 7 minutes to run.

In [89]:
wait_for_batch_segment_job(batch_segment_job_arn)

Batch Segment Job: CREATE IN_PROGRESS
Batch Segment Job: CREATE IN_PROGRESS
Batch Segment Job: CREATE IN_PROGRESS
Batch Segment Job: ACTIVE
Time took: 0:07:26.078000


### Download the result from S3

In [91]:
print(bucket_name)
print(batch_output_path)

personalize-user-segment90441
s3://personalize-user-segment90441/output/


In [92]:

object_key = 'output/item-affinity-query.json.out'

local_file_name = 'item-affinity-query-results.json'

# Download the file from S3 to the current directory
s3.download_file(bucket_name, object_key, local_file_name)
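
Once the results file is downloaded, it can be read line by line. The sketch below assumes each output line is a JSON object with the query under `input` and the segment under `output.usersList`; check this against your own results file, since the layout is an assumption here. `parse_segment_output` is a hypothetical helper and the user IDs in `sample_output` are made up.

```python
import json

def parse_segment_output(text):
    """Map each queried itemId to its list of user IDs.

    Assumes one JSON object per line with "input" and "output" keys;
    a line with no output yields an empty list.
    """
    segments = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        item_id = record["input"]["itemId"]
        segments[item_id] = (record.get("output") or {}).get("usersList", [])
    return segments

# Made-up sample line in the assumed layout.
sample_output = (
    '{"input": {"itemId": "-4293041568465629878"}, '
    '"output": {"usersList": ["u1", "u2"]}}\n'
)
print(parse_segment_output(sample_output))
```

With the real file, the text can come from `open(local_file_name).read()`.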