In [2]:
from gensim.models import Word2Vec
import pandas as pd
import numpy as np
import urllib.request
## image embeddings
from PIL import Image
import requests
from io import BytesIO
import gcsfs
import pickle
import json  # used below when saving the category lists

## Superlinked setup

In this notebook we walk through all the steps required to set up a Superlinked system.
We use a small subset of the products data to run all definitions end to end.
This is the first step in the "Superlinked development cycle", which lets you experiment with index and query definitions.
The next step is a local server deployment together with a vector database (VDB), so you can interact with your full data and make adjustments.

### Prerequisites

Before running this notebook make sure:

1. You have the Superlinked notebooks environment set up.
2. You followed the item2vec tutorial and trained a model on your data (this step is optional, as you can still use our data for this).

### Read Data

We read the products data, a 1,000-row sample of the events data, and the saved (pickled) item2vec model.
Refer to the README.md file under `item2vec/` for more information about training the item2vec model.

In [3]:
products = pd.read_json('https://storage.googleapis.com/superlinked-recipes/ecommerce-recsys/data/products.json', orient='records', lines=True)
events = pd.read_json('https://storage.googleapis.com/superlinked-recipes/ecommerce-recsys/data/events.json', orient='records', lines=True, nrows=1000)
item2vec_model = Word2Vec.load("https://storage.googleapis.com/superlinked-recipes/ecommerce-recsys/models/w2v.pickle")

## Data overview

### Products

In [4]:
products.head()

Unnamed: 0,id,is_active,product_image,description,topic,brand,product_type,popularity,item_w2v,price
0,1383239,1,https://storage.googleapis.com/superlinked-rec...,"Made from soft, durable and highly insulating ...",male_clothing,regatta,sweatshirts_&_fleeces,0.048134,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",26
1,14807127,1,https://storage.googleapis.com/superlinked-rec...,Brand: Parks London Collection: Vintage Aromat...,unisex_home,parks_london,candles_&_home_fragrance,0.490409,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",38
2,14807133,1,https://storage.googleapis.com/superlinked-rec...,Brand: Parks London Collection: Vintage Aromat...,unisex_home,parks_london,candles_&_home_fragrance,0.490409,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",38
3,14808099,1,https://storage.googleapis.com/superlinked-rec...,Easy to clean Fingerprint proof Ten year guara...,unisex_home,brabantia,kitchen_storage,0.272473,"[0.217785418, 0.0114847049, -0.0036433504, -0....",64
4,14809572,1,https://storage.googleapis.com/superlinked-rec...,Brand: Parks London Collection: Parks Exclusiv...,unisex_home,parks_london,candles_&_home_fragrance,0.490409,"[0.2159118056, 0.0520496592, 0.0123813841, -0....",30


For our data we have:
* id - the product id
* product_image - the URL of the product image
* description - an HTML-like string with the product description
* is_active - a field indicating whether the item is in stock (note that this is a dynamic field)
* topic - the high-level category of the product
* product_type - the type of the product
* brand - the brand of the product
* color - a binned representation of the product color (this information is already encoded in the image)
* price - the item price
* item_w2v - pre-extracted item2vec embeddings
* popularity - a popularity coefficient (for example, average purchases per day or week). Make sure to normalize it across categories and types: some product categories sell faster than others, and you would not want the score to reflect that bias.
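
The per-category normalization mentioned for `popularity` can be sketched as follows. This is a minimal illustration on made-up toy data, not the way the published `popularity` column was computed: the `raw_popularity` column and the choice of dividing by the per-type maximum are assumptions made for the example.

```python
import pandas as pd

# Toy data standing in for the products dataframe (all values are made up).
toy_products = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "product_type": ["rugs", "rugs", "dresses", "dresses"],
    # Hypothetical raw signal, e.g. average purchases per week.
    "raw_popularity": [2.0, 4.0, 50.0, 100.0],
})

# Divide each item by the maximum within its own product_type, so that
# fast-selling types do not dominate slow-selling ones.
group_max = toy_products.groupby("product_type")["raw_popularity"].transform("max")
toy_products["popularity"] = toy_products["raw_popularity"] / group_max

print(toy_products[["id", "product_type", "popularity"]])
```

After this step the best-selling rug and the best-selling dress both score 1.0, even though their raw sales differ by more than an order of magnitude.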

### Events

The events dataframe schema matches the way we will define event effects in Superlinked.
This is also the same events file we prepared for the item2vec training process.

In [5]:
events.head()

Unnamed: 0,user,session_id,product,event_type,created_at
0,9999211,1724940658471,18938567,product_viewed,2024-08-29 14:27:10.442
1,2375656,1721633128877,18904223,product_viewed,2024-07-22 07:31:34.284
2,2391885,1718520365363,18888360,product_viewed,2024-06-16 06:46:54.527
3,2097335,1724517034717,18881606,product_viewed,2024-08-24 16:34:12.575
4,2363497,1724590923651,18890987,product_viewed,2024-08-25 13:03:09.432
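
To make the link between this schema and item2vec training more concrete, here is a rough sketch of how events could be turned into per-session product sequences, which is the usual input shape for Word2Vec-style training. It runs on made-up toy events; the grouping by `session_id` and the ordering by `created_at` are assumptions for illustration, and the actual item2vec tutorial may prepare its sequences differently.

```python
import pandas as pd

# Toy events with the same columns as the events dataframe (values are made up).
toy_events = pd.DataFrame({
    "user": [1, 1, 2, 1],
    "session_id": [10, 10, 20, 10],
    "product": [101, 102, 103, 104],
    "event_type": ["product_viewed"] * 4,
    "created_at": pd.to_datetime([
        "2024-08-29 14:27:10", "2024-08-29 14:25:00",
        "2024-07-22 07:31:34", "2024-08-29 14:30:00",
    ]),
})

# Order events in time, then collect the products of each session as a list of
# string tokens (string ids match the str(id) lookups used later in this notebook).
sessions = (
    toy_events.sort_values("created_at")
    .groupby("session_id")["product"]
    .apply(lambda s: [str(p) for p in s])
    .tolist()
)

print(sessions)
```

Each inner list plays the role of a "sentence" and each product id the role of a "word".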


## Data preparation overview

In this section we show how to extract the item2vec embedding for each item from your trained model.

For simplicity, we use a small sample of our data in this section.

In [12]:
products_sample = products.sample(100).reset_index(drop=True)

For this sample, let's drop the current item2vec embeddings.

In [13]:
products_sample.drop(columns=['item_w2v'], inplace=True)

In [14]:
products_sample.head()

Unnamed: 0,id,is_active,product_image,description,topic,brand,product_type,popularity,price
0,19059370,1,https://storage.googleapis.com/superlinked-rec...,MidiShort SleevesV-neckComposition: 100% Polye...,female_clothing,joseph_ribkoff,dresses,0.034502,199
1,19108201,1,https://storage.googleapis.com/superlinked-rec...,A-LineZip FasteningMini LengthComposition: 51%...,female_clothing,hobbs_london,skirts,0.108556,110
2,19126209,1,https://storage.googleapis.com/superlinked-rec...,Kuba rugs draw their inspiration from Zairean ...,unisex_home,louis_de_poortere,rugs,0.076656,1190
3,19105206,1,https://storage.googleapis.com/superlinked-rec...,"For exceptional durability, this bespoke duvet...",unisex_home,surrey_down,duvets_&_pillows,0.080633,681
4,18916655,1,https://storage.googleapis.com/superlinked-rec...,Brand: Miller Harris Product Type: Lumiere Dor...,unisex_beauty,miller_harris,bath_&_shower,0.072742,26


### Get item2vec embeddings

In [15]:
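def normalize_vec(vec):
    # Scale to unit length; leave all-zero vectors unchanged to avoid dividing by zero.
    norm = np.linalg.norm(vec)
    if norm == 0:
        return vec
    return vec / norm

def get_item2vec_vector(row):
    # Keys in the model vocabulary are product ids stored as strings.
    try:
        item_emb = item2vec_model.wv[str(row['id'])]
        return normalize_vec(item_emb).tolist()
    except KeyError:
        # The item was not seen during item2vec training: fall back to a zero vector.
        return np.zeros(item2vec_model.wv.vector_size).tolist()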

products_sample['item_w2v'] = products_sample.apply(get_item2vec_vector, axis=1)

In [16]:
products_sample.head()

Unnamed: 0,id,is_active,product_image,description,topic,brand,product_type,popularity,price,item_w2v
0,19059370,1,https://storage.googleapis.com/superlinked-rec...,MidiShort SleevesV-neckComposition: 100% Polye...,female_clothing,joseph_ribkoff,dresses,0.034502,199,"[0.12845230102539062, 0.027607589960098267, 0...."
1,19108201,1,https://storage.googleapis.com/superlinked-rec...,A-LineZip FasteningMini LengthComposition: 51%...,female_clothing,hobbs_london,skirts,0.108556,110,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,19126209,1,https://storage.googleapis.com/superlinked-rec...,Kuba rugs draw their inspiration from Zairean ...,unisex_home,louis_de_poortere,rugs,0.076656,1190,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,19105206,1,https://storage.googleapis.com/superlinked-rec...,"For exceptional durability, this bespoke duvet...",unisex_home,surrey_down,duvets_&_pillows,0.080633,681,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,18916655,1,https://storage.googleapis.com/superlinked-rec...,Brand: Miller Harris Product Type: Lumiere Dor...,unisex_beauty,miller_harris,bath_&_shower,0.072742,26,"[0.20885172486305237, 0.05882904678583145, 0.0..."
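
Several rows above carry an all-zero `item_w2v`, which is what the fallback produces for items the model has never seen. A quick way to quantify this is to measure how many items received a non-zero embedding. The sketch below uses a tiny made-up dataframe with 2-dimensional vectors so it can run on its own; on the real data the same three lines at the end would be applied to `products_sample`.

```python
import numpy as np
import pandas as pd

def normalize_vec(vec):
    # Same idea as in the cell above: unit length, zero vectors untouched.
    norm = np.linalg.norm(vec)
    return vec if norm == 0 else vec / norm

# Toy stand-in for products_sample (ids and vectors are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "item_w2v": [
        normalize_vec(np.array([3.0, 4.0])).tolist(),
        np.zeros(2).tolist(),  # item missing from the model vocabulary
        normalize_vec(np.array([0.0, 2.0])).tolist(),
    ],
})

norms = toy["item_w2v"].apply(np.linalg.norm)
missing = norms == 0             # True where the zero-vector fallback was used
coverage = 1 - missing.mean()    # share of items with a real embedding

print(f"items without an embedding: {missing.sum()}, coverage: {coverage:.2f}")
```

A low coverage value suggests that the events used for training touch only a small part of the catalogue, which is worth knowing before relying on this embedding in the index.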


### Save data

Saving the fields relevant for our use case

In [57]:
# keep_fields = [
#     'id',
#     'is_active',
#     'product_image',
#     'description',
#     'topic',
#     'brand',
#     'product_type',
#     'popularity',
#     'item_w2v',
# ]
# ## save as jsonlines
# # products_sample[keep_fields].to_json('../data/source/preprocessed_products_sample.jsonl', orient='records', lines=True)

Next we save the categories we will need for the following step: setting up the Superlinked schema and index.

To use categorical variables in a Superlinked index, all possible values of each category space have to be predefined.
We therefore save the categories found in our data as JSON lists, which can be loaded again when building the index.

In [35]:
with open('../data/categories/brands.json', 'w') as f:
    json.dump(products['brand'].unique().tolist(), f)

with open('../data/categories/topics.json', 'w') as f:
    json.dump(products['topic'].unique().tolist(), f)

with open('../data/categories/product_types.json', 'w') as f:
    json.dump(products['product_type'].unique().tolist(), f)
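
For completeness, here is a small round-trip sketch showing how such a category file can be written and read back when the index is built. It uses a made-up toy dataframe and a temporary directory instead of `../data/categories/`, so it runs on its own; only the standard `json` module is involved, and no Superlinked API is assumed.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the products dataframe (brand names taken from the preview above).
toy_products = pd.DataFrame(
    {"brand": ["regatta", "parks_london", "parks_london", "brabantia"]}
)

# Hypothetical output location; the notebook itself writes to ../data/categories/.
out_dir = Path(tempfile.mkdtemp()) / "categories"
out_dir.mkdir(parents=True, exist_ok=True)  # make sure the directory exists

brands_path = out_dir / "brands.json"
with open(brands_path, "w") as f:
    json.dump(toy_products["brand"].unique().tolist(), f)

# Later, when defining the index, load the list of allowed values back.
with open(brands_path) as f:
    brands = json.load(f)

print(brands)
```

Note that `unique()` keeps values in order of first appearance, so the saved list is stable for a given input file.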