# Shaped Demo

Welcome to your Shaped demo!

This notebook walks through using Shaped to upload or connect example data for a fashion e-commerce use case and to create a custom model that personalises results for users. You can follow the same workflow with your own data, whether it comes from a database connector or a set of static files. For questions about connecting other data sources or about model configuration options, see our documentation at https://docs.shaped.ai/docs/overview/welcome/

## Setup

Install Shaped from the command line using `pip`.

To use Shaped for this demo, you'll need a trial account, which lets you upload data and create your first model. To get started, visit the Shaped website at https://www.shaped.ai/ and click Get Started. Once you've created an account, you will be provisioned an `API_KEY` to use.
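Since the API key is a secret, it is worth reading it from the environment and never printing it in full. The following is a minimal sketch of that idea; the helper names `load_api_key` and `mask` are hypothetical (they are not part of the Shaped client), and the environment variable name matches the one this notebook reads further down.

```python
import os


def load_api_key(var_name: str = "SHAPED_READ_AND_WRITE_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


def mask(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safe to log."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

Logging `mask(load_api_key())` instead of the raw value keeps the key out of saved notebook output.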

In [None]:
# ! pip install shaped

In [None]:
# API_KEY = "****************"

In [1]:
import pandas as pd
import yaml
import json

In [2]:
import shaped
import os
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["SHAPED_READ_AND_WRITE_KEY"]

! shaped init --api-key {API_KEY} --env prod

Initializing with config: {'api_key': '****************', 'env': 'prod'}


## Data

Let's start by uploading some static data to our Shaped account to build our first model. Shaped supports a range of database connectors as well as manual uploads and inserts of data files. We will use sample H&M customer click data (see https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations). For documentation on connector datasets, see our docs (for example, the Postgres connector at https://docs.shaped.ai/docs/integrations/postgresql):

In [3]:
hm_events_demo = """
name: hm_events_demo
schema_type: CUSTOM
column_schema:
  article_id: String
  created_at: DateTime
  customer_id: String
"""

with open("hm_events_demo.yaml", "w") as file:
    yaml.dump(yaml.safe_load(hm_events_demo), file)

hm_items_demo = """
name: hm_items_demo
schema_type: CUSTOM
column_schema:
  article_id: String
  department_name: String
  detail_desc: String
  graphical_appearance_name: String
  image_url: String
  index_name: String
  prod_name: String
  product_type_name: String
  section_name: String
"""

with open("hm_items_demo.yaml", "w") as file:
    yaml.dump(yaml.safe_load(hm_items_demo), file)

hm_users_demo = """
name: hm_users_demo
schema_type: CUSTOM
column_schema:
  FN: String
  Active: String
  age: String
  club_member_status: String
  customer_id: String
  fashion_news_frequency: String
  latest_interaction_at: DateTime
  postal_code: String
"""

with open("hm_users_demo.yaml", "w") as file:
    yaml.dump(yaml.safe_load(hm_users_demo), file)
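Before creating the datasets, it can help to confirm that each CSV header actually contains the columns declared in its schema. The sketch below uses only the standard library; `missing_columns` is a hypothetical helper, and the in-memory CSV stands in for a real file such as `h_and_m_events.csv`.

```python
import csv
import io


def missing_columns(schema_columns, csv_file):
    """Return schema columns that are absent from the CSV header row."""
    header = next(csv.reader(csv_file), [])
    return sorted(set(schema_columns) - set(header))


# Illustrative in-memory CSV using the events column names from above.
sample = io.StringIO("created_at,customer_id,article_id\n2020-09-22,abc,898573003\n")
events_columns = ["article_id", "created_at", "customer_id"]
print(missing_columns(events_columns, sample))  # prints []
```

For a file on disk, pass an open file handle (`open(path, newline="")`) in place of the `StringIO` object.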

Create the Shaped datasets from the `.yaml` files above:

In [4]:
! shaped create-dataset --file hm_events_demo.yaml
! shaped create-dataset --file hm_items_demo.yaml
! shaped create-dataset --file hm_users_demo.yaml

{
  "column_schema": {
    "article_id": "String",
    "created_at": "DateTime",
    "customer_id": "String"
  },
  "name": "hm_events_demo",
  "schema_type": "CUSTOM"
}
message: Dataset with name 'hm_events_demo' was successfully scheduled for creation

{
  "column_schema": {
    "article_id": "String",
    "department_name": "String",
    "detail_desc": "String",
    "graphical_appearance_name": "String",
    "image_url": "String",
    "index_name": "String",
    "prod_name": "String",
    "product_type_name": "String",
    "section_name": "String"
  },
  "name": "hm_items_demo",
  "schema_type": "CUSTOM"
}
message: Dataset with name 'hm_items_demo' was successfully scheduled for creation

{
  "column_schema": {
    "Active": "String",
    "FN": "String",
    "age": "String",
    "club_member_status": "String",
    "customer_id": "String",
    "fashion_news_frequency": "String",
    "latest_interaction_at": "DateTime",
    "postal_code": "String"
  },
  "name": "hm_users_demo",
  "sc

Once the datasets have been created, insert the relevant `csv` data into the corresponding Shaped datasets:

In [6]:
! shaped dataset-insert --dataset-name hm_events_demo --type csv --file h_and_m_events.csv
! shaped dataset-insert --dataset-name hm_items_demo --type csv --file h_and_m_articles.csv
! shaped dataset-insert --dataset-name hm_users_demo --type csv --file h_and_m_customers.csv

1000000 Records [11:59, 1389.52 Records/s]
105542 Records [02:11, 805.08 Records/s]
1371980 Records [21:04, 1084.91 Records/s]


In [7]:
! shaped list-datasets

datasets:
- dataset_name: hm_events_demo
  dataset_uri: https://api.shaped.ai/v1/datasets/hm_events_demo
  created_at: 2024-12-13T19:26:29 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: hm_items_demo
  dataset_uri: https://api.shaped.ai/v1/datasets/hm_items_demo
  created_at: 2024-12-13T19:26:32 UTC
  schema_type: CUSTOM
  status: ACTIVE
- dataset_name: hm_users_demo
  dataset_uri: https://api.shaped.ai/v1/datasets/hm_users_demo
  created_at: 2024-12-13T19:26:34 UTC
  schema_type: CUSTOM
  status: ACTIVE



### Events

Let's now look at the tables we just uploaded to Shaped, starting with the `events` table:

In [8]:
hm_events = pd.read_csv('h_and_m_events.csv')

hm_events.head()

Unnamed: 0,created_at,customer_id,article_id
0,2020-09-22,fffef3b6b73545df065b521e19f64bf6fe93bfd450ab20...,898573003
1,2020-09-22,54e8ebd39543b5a4d69c3e7d79977558d2a606e6540ba0...,573085043
2,2020-09-22,544094a3ab237bf18d7bda9c2265218de4320ce795775e...,839332001
3,2020-09-22,544094a3ab237bf18d7bda9c2265218de4320ce795775e...,865938002
4,2020-09-22,544094a3ab237bf18d7bda9c2265218de4320ce795775e...,770315017


We can now write the events query for this dataset. Setting the target `label` to 1 for every row optimises the model for clicks:

In [6]:
events_query = """
SELECT 
  customer_id AS user_id,
  article_id AS item_id,
  created_at,
  1 AS label, 
  'click' AS event_value
FROM 
    hm_events
"""

###  Items

In [9]:
hm_items = pd.read_csv('h_and_m_articles.csv')

hm_items.head()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc,image_url
0,108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.,https://h-and-m-images.s3.us-east-2.amazonaws....
1,108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.,https://h-and-m-images.s3.us-east-2.amazonaws....
2,108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.,https://h-and-m-images.s3.us-east-2.amazonaws....
3,110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde...",https://h-and-m-images.s3.us-east-2.amazonaws....
4,110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde...",https://h-and-m-images.s3.us-east-2.amazonaws....


To build the Shaped items schema, we need to include the unique identifier as `item_id`, plus any additional features we want to explicitly train on:

In [10]:
items_query = """
SELECT
    article_id as item_id,
    prod_name as product_name,
    product_type_name as product_type,
    graphical_appearance_name as graphical_appearance,
    colour_group_name as color,
    index_name as category,
    section_name as section,
    department_name as department,
    detail_desc as description,
    image_url as image
FROM 
    hm_items
"""

### Users

In [11]:
hm_users = pd.read_csv('h_and_m_customers.csv')

hm_users.head()

Unnamed: 0,customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,latest_interaction_at
0,00000dbacae5abe5e23885899a1fa44253a17956c6d1c3...,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...,2020-09-05
1,0000423b00ade91418cceaf3b26c6af3dd342b51fd051e...,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...,2020-07-08
2,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...,2020-09-15
3,00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2...,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...,2019-06-09
4,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...,2020-08-12


When surfacing recommendations, we usually want to train a Shaped model only on users above a certain interaction count. With very few interactions (e.g. 1 or 2 clicks), it is hard to infer true behavioural intent; this is the cold-start problem. Shaped addresses it at rank time by letting you choose which baseline content features and retrieval strategies are used, as demonstrated later.

For now, let's select only users who have 5 or more interactions (a threshold commonly used in academic benchmarks):

In [12]:
users_query = """
WITH user_interaction_counts AS (
    SELECT 
        customer_id,
        COUNT(*) as interaction_count
    FROM 
        hm_events
    WHERE created_at >= '2020-06-01'
    GROUP BY 1
)

SELECT  
    u.customer_id as user_id,
    u.postal_code as zip_code,
    u.age
FROM 
    hm_users u
LEFT JOIN user_interaction_counts uic 
ON u.customer_id = uic.customer_id
WHERE uic.interaction_count >= 5
"""
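To make the filter concrete, here is the same logic as a small pure-Python sketch: count each user's events on or after the cutoff date and keep users with at least 5. The function name `active_user_ids` and the toy rows are illustrative only; the SQL above is what Shaped actually runs.

```python
from collections import Counter


def active_user_ids(events, min_interactions=5, since="2020-06-01"):
    """Return ids of users with at least `min_interactions` events on or after `since`.

    Dates are compared as strings, which is valid for ISO-8601 (YYYY-MM-DD) values.
    """
    counts = Counter(
        e["customer_id"] for e in events if e["created_at"] >= since
    )
    return {user for user, n in counts.items() if n >= min_interactions}


events = (
    [{"customer_id": "u1", "created_at": "2020-09-22"}] * 5
    + [{"customer_id": "u2", "created_at": "2020-09-22"}] * 4
    + [{"customer_id": "u3", "created_at": "2020-01-01"}] * 9
)
print(active_user_ids(events))  # prints {'u1'}
```

Here `u2` falls one interaction short, and `u3` is excluded because all of its events predate the cutoff.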

## Model Creation

We're now ready to create the model definition to upload to Shaped. Let's start with a basic model `.yaml` file that references our `fetch` queries and Shaped datasets:

In [None]:
shaped_model = """
model:
    name: h_and_m_product_recs_demo
connectors:
  - id: hm_events
    name: hm_events_demo
    type: Dataset
  - id: hm_items
    name: hm_items_demo
    type: Dataset
  - id: hm_users
    name: hm_users_demo
    type: Dataset
fetch:
  events: |
    SELECT
      article_id as item_id,
      customer_id as user_id,
      created_at,
      1 as label,
      'clicked' as event_value
    FROM hm_events
  users: |
    WITH user_interaction_counts AS (
      SELECT 
        customer_id,
        COUNT(*) as interaction_count
      FROM hm_events
      WHERE created_at >= '2020-06-01'
      GROUP BY 1
    )
    SELECT
      u.customer_id as user_id,
      u.postal_code as zip_code,
      u.age
    FROM hm_users u
    LEFT JOIN user_interaction_counts uic ON u.customer_id = uic.customer_id
    WHERE uic.interaction_count >= 5
  items: |
    SELECT
      article_id as item_id,
      prod_name as product_name,
      product_type_name as product_type,
      graphical_appearance_name as graphical_appearance,
      colour_group_name as color,
      index_name as category,
      section_name as section,
      department_name as department,
      detail_desc as description,
      image_url as image
    FROM hm_items
"""

Now that the basic model structure has been written, we can tweak some model config options to customise how the model is created. Shaped allows you to use a variety of models through the `scoring_policy` and `embedding_policy` interfaces. Refer to our policy configuration docs (see https://docs.shaped.ai/docs/model_creation_guides/policy-configuration) for the full suite of options. For this model we can set:

- `lightgbm` as the `scoring_policy`. This model handles categorical feature importance well, which allows Shaped to understand what types of products customers are clicking on.

- `item-content-similarity` as the `embedding_policy`. Shaped will determine the embedding space for the items based on the similarity of each product's associated text content, categorical feature contexts and image embeddings. We can also set `language_model_name` to a pre-trained model of our choice. Let's use a `FashionCLIP` model for this.

We've also added some flags to the model file to improve our results:

- `use_gpu: true` enables GPU training of image embeddings.
- `interaction_dedupe_strategy: max` keeps only the most recent interaction each user has with a given item and ignores all others.
- `remove_no_activity_users_and_items: true` tells the model to train only on users and items that have at least 1 interaction.
- `text_encoding: true` lets Shaped compute text embeddings for any `Text` features.
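As an illustration of the deduplication behaviour described in the list above (one interaction kept per user and item, namely the most recent), here is a short pure-Python sketch. It shows the idea under that description only; it is not Shaped's implementation, and `dedupe_latest` is a hypothetical name.

```python
def dedupe_latest(interactions):
    """Keep one interaction per (user_id, item_id): the one with the latest created_at."""
    latest = {}
    for row in interactions:
        key = (row["user_id"], row["item_id"])
        # Replace the stored row only when this one is strictly newer.
        if key not in latest or row["created_at"] > latest[key]["created_at"]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"user_id": "u1", "item_id": "i1", "created_at": "2020-09-01"},
    {"user_id": "u1", "item_id": "i1", "created_at": "2020-09-22"},
    {"user_id": "u1", "item_id": "i2", "created_at": "2020-09-10"},
]
print(len(dedupe_latest(rows)))  # prints 2
```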


The `schema_override` section is the interface for manually assigning a feature type to each user or item feature. By default, Shaped auto-selects the most suitable feature type, but there may be scenarios where you want to override this. The possible feature types are:
 
- `Text`
- `Category`
- `TextCategory` (models the feature as both)
- `Numerical`
- `Binary`
- `Sequence[Text]`
- `Sequence[Category]`
- `Sequence[TextCategory]`
- `Vector`
- `Url`
- `Image`
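A typo in a feature type is easy to miss in a long YAML file, so a quick local check against the list above can save a failed model creation. This is a sketch assuming the `schema_override` layout used in this notebook (an entity name mapping to a `features` list of `name`/`type` pairs); `invalid_features` is a hypothetical helper, not part of the Shaped client.

```python
ALLOWED_FEATURE_TYPES = {
    "Text", "Category", "TextCategory", "Numerical", "Binary",
    "Sequence[Text]", "Sequence[Category]", "Sequence[TextCategory]",
    "Vector", "Url", "Image",
}


def invalid_features(schema_override):
    """Return (entity, feature_name, type) for each feature whose type is not in the list."""
    problems = []
    for entity, spec in schema_override.items():
        for feature in spec.get("features", []):
            if feature["type"] not in ALLOWED_FEATURE_TYPES:
                problems.append((entity, feature["name"], feature["type"]))
    return problems


override = {
    "user": {"features": [{"name": "age", "type": "Numerical"}]},
    "item": {"features": [{"name": "image", "type": "Picture"}]},
}
print(invalid_features(override))  # prints [('item', 'image', 'Picture')]
```

Only entries under `features` are checked, so a label type declared elsewhere in the override is left alone.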

In [14]:
shaped_model = """
model:
  use_gpu: true
  interaction_dedupe_strategy: max
  remove_no_activity_users_and_items: true
  name: h_and_m_product_recs_demo
  policy_configs:
    embedding_policy:
      pool_fn: max
      policy_type: item-content-similarity
      distance_fn: dot
    scoring_policy:
      policy_type: auto-tune
      policies:
        - lightgbm
  language_model_name: patrickjohncyh/fashion-clip
  inference_config:
    text_semantic_search_weight: 0.7
    retriever_k_override:
      popular: 300
      knn: 300
    retrieval_k: 600
  text_encoding: true
  schema_override:
    user:
      id: user_id
      features:
        - name: zip_code
          type: Category
        - name: age
          type: Numerical
    item:
      id: item_id
      features:
        - name: product_name
          type: TextCategory
        - name: product_type
          type: TextCategory
        - name: graphical_appearance
          type: TextCategory
        - name: category
          type: TextCategory
        - name: section
          type: TextCategory
        - name: department
          type: TextCategory
        - name: description
          type: TextCategory
        - name: image
          type: Image
    interaction:
      label:
        name: label
        type: BinaryLabel
      created_at: created_at
      features:
        - name: event_value
          type: Category
connectors:
  - id: hm_events
    name: hm_events
    type: Dataset
  - id: hm_items
    name: hm_items
    type: Dataset
  - id: hm_users
    name: hm_users
    type: Dataset
fetch:
  events: |
    SELECT
      article_id as item_id,
      customer_id as user_id,
      created_at,
      1 as label,
      'clicked' as event_value
    FROM hm_events
  users: |
    WITH user_interaction_counts AS (
      SELECT 
        customer_id,
        COUNT(*) as interaction_count
      FROM hm_events
      WHERE created_at >= '2020-06-01'
      GROUP BY 1
    )
    SELECT
      u.customer_id as user_id,
      u.postal_code as zip_code,
      u.age
    FROM hm_users u
    LEFT JOIN user_interaction_counts uic ON u.customer_id = uic.customer_id
    WHERE uic.interaction_count >= 5
  items: |
    SELECT
      article_id as item_id,
      prod_name as product_name,
      product_type_name as product_type,
      graphical_appearance_name as graphical_appearance,
      index_name as category,
      section_name as section,
      department_name as department,
      detail_desc as description,
      image_url as image
    FROM hm_items
"""

with open("h_and_m_product_recs_demo.yaml", "w") as file:
    yaml.dump(yaml.safe_load(shaped_model), file)

In [16]:
! shaped create-model --file h_and_m_product_recs_demo.yaml

{
  "connectors": [
    {
      "id": "hm_events",
      "name": "hm_events",
      "type": "Dataset"
    },
    {
      "id": "hm_items",
      "name": "hm_items",
      "type": "Dataset"
    },
    {
      "id": "hm_users",
      "name": "hm_users",
      "type": "Dataset"
    }
  ],
  "fetch": {
    "events": "SELECT\n  article_id as item_id,\n  customer_id as user_id,\n  created_at,\n  1 as label,\n  'clicked' as event_value\nFROM hm_events\n",
    "items": "SELECT\n  article_id as item_id,\n  prod_name as product_name,\n  product_type_name as product_type,\n  graphical_appearance_name as graphical_appearance,\n  index_name as category,\n  section_name as section,\n  department_name as department,\n  detail_desc as description,\n  image_url as image\nFROM hm_items\n",
    "users": "WITH user_interaction_counts AS (\n  SELECT \n    customer_id,\n    COUNT(*) as interaction_count\n  FROM hm_events\n  WHERE created_at >= '2020-06-01'\n  GROUP BY 1\n)\nSELECT\n  u.customer_id as user_i

Congratulations! You've now successfully created your first Shaped model. To track its progress, visit your dashboard to see its status. When you create a model with Shaped, it goes through the following states:

1. Scheduling
2. Fetching (loading the data)
3. Tuning (finding optimal hyperparameters to train on)
4. Training (training on the aforementioned hyperparameters)
5. Deploying (deploying the model endpoint to be utilized)
6. Active
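If you prefer to wait for the model in code rather than watch the dashboard, the states above can drive a simple polling loop. In this sketch `get_status` is a hypothetical callable that you supply (for example, a wrapper around whatever status lookup you use); the uppercase state strings are an assumption based on the list above, not a documented API contract.

```python
import time

# State names follow the list above; the exact strings returned are an assumption.
MODEL_STATES = ["SCHEDULING", "FETCHING", "TUNING", "TRAINING", "DEPLOYING", "ACTIVE"]


def wait_until_active(get_status, poll_seconds=60.0, max_polls=120, sleep=time.sleep):
    """Poll `get_status()` until it reports ACTIVE and return the distinct states seen."""
    seen = []
    for _ in range(max_polls):
        status = str(get_status()).upper()
        if status not in MODEL_STATES:
            raise ValueError(f"Unexpected model status: {status}")
        # Record each state once, in the order it was first observed.
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "ACTIVE":
            return seen
        sleep(poll_seconds)
    raise TimeoutError("Model did not become ACTIVE in time.")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.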

Once your model is active, we can test some results using the Shaped Python client. The dashboard also has an interactive area that serves the same purpose. For documentation on rank-time configuration, see our docs at https://docs.shaped.ai/docs/model_creation_guides/rank-time-configuration and https://docs.shaped.ai/docs/model_inference_guides/similar-ranking/.

## Ranking

We can use the model's `rank` endpoint to return a list of relevant items for a given user.

In [None]:
# Define client
client = shaped.Client(api_key=API_KEY)

# Make a request to the shaped API
response = client.rank(
    model_name="h_and_m_product_recs_demo",
    user_id="a8f8851c9ca25414947073f42a377d522c89aeb3ce5769b2c2dbaa5bc7e7c2ee",
    return_metadata=True,
    config=shaped.InferenceConfig(
        exploration_factor=0.1,
        diversity_factor=0.1,
        retrieval_k=600,
        retrieval_k_override=shaped.RetrieverTopKOverride(
            knn=300,
            chronological=0,
            toplist=0,
            trending=300,
            random=0,
            cold_start=0,
        ),
        limit=5,
    ),
)

print(response)
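In the request above, the per-retriever counts (300 from `knn` plus 300 from `trending`) add up to `retrieval_k=600`. Whether Shaped requires these to match is not stated here, so treat the following as a sanity check under the assumption that the counts are meant to partition the retrieval budget; `unallocated_candidates` is a hypothetical helper.

```python
def unallocated_candidates(retrieval_k, per_retriever):
    """Return retrieval_k minus the sum of the per-retriever counts."""
    if any(count < 0 for count in per_retriever.values()):
        raise ValueError("Retriever counts must be non-negative.")
    return retrieval_k - sum(per_retriever.values())


overrides = {"knn": 300, "chronological": 0, "toplist": 0,
             "trending": 300, "random": 0, "cold_start": 0}
print(unallocated_candidates(600, overrides))  # prints 0
```

A positive result means part of the budget is not assigned to any retriever; a negative one means the overrides ask for more candidates than `retrieval_k`.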

Similarly, we can use the `similar_items` endpoint to return a list of items similar to a candidate `item_id`:

In [None]:
api_response = client.similar_items(
    model_name="h_and_m_product_recs_demo", 
    item_id="898573003",  # an H&M article_id taken from the events sample above
    return_metadata=True,
    config=shaped.InferenceConfig(
        exploration_factor=0.1,
        diversity_factor=0.1,
        retrieval_k=600,
        retrieval_k_override=shaped.RetrieverTopKOverride(
            knn=300,
            chronological=0,
            toplist=0,
            trending=300,
            random=0,
            cold_start=0,
        ),
        limit=5,
    ),
)

print(api_response)