# Ingest H&M data into Shaped

This example shows how to prepare the H&M dataset ([link to Kaggle](https://www.kaggle.com/competitions/h-and-m-personalized-fashion-recommendations/overview)) and upload it to a Shaped table.

# 1. Data preparation

## 1.1 Set up a virtual environment and install dependencies

Create the venv with Python 3.11 to ensure compatibility with the Shaped CLI:

```bash
python3.11 -m venv .venv
source .venv/bin/activate
```
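To confirm that the notebook kernel is actually running inside the 3.11 environment, a quick check like the following can help. This is a minimal sketch: the `REQUIRED` constant and the `matches_required` helper are illustrative names, and the 3.11 target is taken from the venv command above.

```python
import sys

# Assumption: 3.11 is the target version, taken from the venv command above.
REQUIRED = (3, 11)


def matches_required(version_info, required=REQUIRED):
    """Return True when the (major, minor) pair equals the required one."""
    return tuple(version_info[:2]) == required


if not matches_required(sys.version_info):
    current = f"{sys.version_info.major}.{sys.version_info.minor}"
    print(f"Warning: running Python {current}, expected {REQUIRED[0]}.{REQUIRED[1]}")
```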

In [11]:
%pip install -qU shaped pandas


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## 1.2 Prepare datasets (items)

This example involves three datasets:
- `articles.csv`: The catalog items, which will be candidates for our retrieval engine
- `customers.csv`: Information about each user
- `transactions_train.csv`: List of customer interactions and transactions; we'll use this to train our engine on behavioural data

Before uploading to Shaped, we have to ensure: 
1. Column names are only alphanumeric with underscores (no hyphens or special characters)
2. Dates are in epoch time or ISO 8601 format
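The first requirement can be checked programmatically before any renaming. The sketch below is one possible approach, not part of the Shaped tooling: `find_invalid_columns` and `sanitize_column_name` are hypothetical helpers, and the example column names are made up for illustration.

```python
import re

# Anything other than letters, digits, or underscores counts as invalid.
_INVALID = re.compile(r"[^0-9A-Za-z_]+")


def find_invalid_columns(columns):
    """Return the column names that contain disallowed characters."""
    return [name for name in columns if _INVALID.search(name)]


def sanitize_column_name(name: str) -> str:
    """Replace each run of disallowed characters with a single underscore."""
    return _INVALID.sub("_", name.strip()).strip("_")


# Illustrative column names only; they are not taken from the H&M files.
example_columns = ["customer-id", "FN", "postal code", "t_dat"]
invalid = find_invalid_columns(example_columns)
print(invalid)
print({name: sanitize_column_name(name) for name in invalid})
```

With a DataFrame, the same helper can be passed directly to `rename`, for example `df.rename(columns=sanitize_column_name)`.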

In [12]:
import pandas as pd

data_dir = "data/raw"
articles_file = f"{data_dir}/articles.csv"
customers_file = f"{data_dir}/customers.csv"
transactions_file = f"{data_dir}/transactions_train.csv"

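try:
    # Read article_id as a string so leading zeros are preserved
    articles = pd.read_csv(articles_file, dtype={'article_id': str})
    customers = pd.read_csv(customers_file)
    # Use the same dtype here so transaction item IDs match the catalog IDs
    transactions = pd.read_csv(transactions_file, dtype={'article_id': str})
    print('Dataframes loaded successfully')
except Exception as exc:
    print(f'Error loading dataframes - {exc}')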


Dataframes loaded successfully


In [13]:
print('#'*20 + ' Summary of data ' + '#'*20 + '\n')
print('#'*20 + ' ARTICLES DF ' + '#'*20)
print(articles.dtypes)
print('\n'+'#'*20 + ' CUSTOMERS DF ' + '#'*20)
print(customers.dtypes)
print('\n'+'#'*20 + ' TRANSACTIONS DF ' + '#'*20)
print(transactions.dtypes)

#################### Summary of data ####################

#################### ARTICLES DF ####################
article_id                      object
product_code                     int64
prod_name                       object
product_type_no                  int64
product_type_name               object
product_group_name              object
graphical_appearance_no          int64
graphical_appearance_name       object
colour_group_code                int64
colour_group_name               object
perceived_colour_value_id        int64
perceived_colour_value_name     object
perceived_colour_master_id       int64
perceived_colour_master_name    object
department_no                    int64
department_name                 object
index_code                      object
index_name                      object
index_group_no                   int64
index_group_name                object
section_no                       int64
section_name                    object
garment_group_no             

In [15]:
# articles - rename "article_id" to "item_id" and add an image_url column
articles = articles.rename(columns={'article_id': 'item_id'})
# Example URL: https://h-and-m-images.s3.us-east-2.amazonaws.com/010/0108775051.jpg
image_base_url = "https://h-and-m-images.s3.us-east-2.amazonaws.com/"
item_ids = articles['item_id'].astype(str)
articles['image_url'] = image_base_url + item_ids.str[:3] + "/" + item_ids + ".jpg"
print(articles['item_id'].head().values)
print(articles['image_url'].head().values)

# customers needs "FN" to be renamed "subscribed_to_fn"
# Active should be lowercase
# customer_id should be user_id
customers = customers.rename(columns={'FN': 'subscribed_to_fn', 'Active': 'active', 'customer_id': 'user_id'})

# transactions needs t_dat converted to an epoch timestamp (in seconds)
transactions['created_at'] = (pd.to_datetime(transactions['t_dat']) - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
transactions = transactions.rename(columns={'customer_id': 'user_id', 'article_id': 'item_id'})

print('#'*20 + ' Data cleaning steps completed ' + '#'*20)

['0108775015' '0108775044' '0108775051' '0110065001' '0110065002']
['https://h-and-m-images.s3.us-east-2.amazonaws.com/010/0108775015.jpg'
 'https://h-and-m-images.s3.us-east-2.amazonaws.com/010/0108775044.jpg'
 'https://h-and-m-images.s3.us-east-2.amazonaws.com/010/0108775051.jpg'
 'https://h-and-m-images.s3.us-east-2.amazonaws.com/011/0110065001.jpg'
 'https://h-and-m-images.s3.us-east-2.amazonaws.com/011/0110065002.jpg']
#################### Data cleaning steps completed ####################
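To sanity-check the `created_at` values, it can help to convert a single date by hand and compare. The sketch below uses only the standard library and assumes that `t_dat` holds `YYYY-MM-DD` strings interpreted as midnight UTC; the helper names are illustrative.

```python
from datetime import datetime, timezone


def date_to_epoch_seconds(date_str: str) -> int:
    """Convert a YYYY-MM-DD string to epoch seconds, treating it as midnight UTC."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def epoch_seconds_to_date(epoch: int) -> str:
    """Convert epoch seconds back to a YYYY-MM-DD string in UTC."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d")


epoch = date_to_epoch_seconds("2018-09-20")
print(epoch)                         # 1537401600
print(epoch_seconds_to_date(epoch))  # 2018-09-20
```

A ten-digit value indicates seconds; a thirteen-digit value would indicate milliseconds.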


## 1.3 Export dataframes as JSONL files

Our datasets are now structured correctly, so the next step is to export them as JSONL files, ready to upload to Shaped:

In [16]:
import os

# Make sure the output directory exists before writing to it
os.makedirs('data/processed', exist_ok=True)
print("Exporting dataframes to JSONL files in 'data/processed/' directory...")
try:
    customers.to_json('data/processed/customers.jsonl', orient='records', lines=True)
    print("Customers df exported...")
    articles.to_json('data/processed/articles.jsonl', orient='records', lines=True)
    print("Articles df exported...")
except Exception as exc:
    print(f'An error occurred: {exc}')


Exporting dataframes to JSONL files in 'data/processed/' directory...
Customers df exported...
Articles df exported...


In [6]:
transactions.to_json('data/processed/transactions.jsonl', orient='records', lines=True)
print("Transactions df exported...")

Transactions df exported...
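Before uploading, it may be worth peeking at the first few records of each export to confirm that IDs are still strings and timestamps look right. The following is a minimal sketch using only the standard library; `peek_jsonl` is a hypothetical helper, and the demo record is illustrative rather than real output.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=3):
    """Return the first n lines of a JSONL file as dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n) if line.strip()]


# Demo on a throwaway file so the sketch runs without the real exports.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"user_id": "u1", "item_id": "0108775015", "created_at": 1537401600}\n',
        encoding="utf-8",
    )
    records = peek_jsonl(sample)
    print(records)
```

Against the real files, the call would look like `peek_jsonl('data/processed/transactions.jsonl')`.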


## 1.4 Upload data to Shaped

Use the CLI to upload each dataset to Shaped:

```bash
shaped create-dataset-from-uri --name hm_articles --type jsonl --path data/processed/articles.jsonl
shaped create-dataset-from-uri --name hm_customers --type jsonl --path data/processed/customers.jsonl
shaped create-dataset-from-uri --name hm_transactions --type jsonl --path data/processed/transactions.jsonl
```
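If you would rather stay inside the notebook, the same three commands can be driven from Python. This is a sketch only: it reuses exactly the command and flags shown above, while `build_upload_command`, `upload_all`, and the `dry_run` parameter are illustrative names. By default it only prints the commands; pass `dry_run=False` to execute them, which assumes the `shaped` CLI is installed and configured.

```python
import shlex
import subprocess

# Dataset names and paths copied from the CLI commands above.
DATASETS = {
    "hm_articles": "data/processed/articles.jsonl",
    "hm_customers": "data/processed/customers.jsonl",
    "hm_transactions": "data/processed/transactions.jsonl",
}


def build_upload_command(name, path):
    """Build the argument list for one upload, mirroring the CLI call above."""
    return ["shaped", "create-dataset-from-uri", "--name", name, "--type", "jsonl", "--path", path]


def upload_all(datasets=DATASETS, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for name, path in datasets.items():
        command = build_upload_command(name, path)
        if dry_run:
            print(shlex.join(command))
        else:
            # check=True raises CalledProcessError if the CLI exits with an error
            subprocess.run(command, check=True)


upload_all()
```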