This notebook shows how you could split the Fashion outfits file into train and test datasets to experiment with your model.

**Disclaimer**: The process used here is simpler than the one we used to generate the test files for the leaderboard, but you can expect some of the tricks used here (like treating products as a set, or generating candidates from the entire product_id list) to be used on the leaderboard files as well.

This code takes about 40 minutes to run on an ordinary notebook.

In [7]:
import datetime
print(datetime.datetime.now())

2022-05-05 22:46:41.368069


In [8]:
import pandas as pd

full_outfits = pd.read_parquet("../data/manual_outfits.parquet")
full_outfits.head()

Unnamed: 0,products,outfit_id
0,"[15360881, 15379678, 15781925, 16204075, 16260...",0
1,"[13893589, 13893721, 15426616, 16035469, 17173...",1
2,"[13508028, 14161732, 16160567, 17484491, 17503...",2
3,"[16127776, 16756133, 17040752, 18203427, 18205...",3
4,"[14480467, 15487690, 17257765]",4


To generate more examples from a particular outfit, one option is to shuffle its products; remember that outfits are sets of products, so the order carries no meaning. This step is here just to illustrate that possibility.

In [9]:
from numpy.random import permutation

full_outfits["products_shuffled"] = full_outfits.apply(lambda row: permutation(row["products"]).tolist(), axis=1)
full_outfits.head()

Unnamed: 0,products,outfit_id,products_shuffled
0,"[15360881, 15379678, 15781925, 16204075, 16260...",0,"[15360881, 16204075, 16260894, 15379678, 15781..."
1,"[13893589, 13893721, 15426616, 16035469, 17173...",1,"[16035469, 13893721, 18218977, 13893589, 17173..."
2,"[13508028, 14161732, 16160567, 17484491, 17503...",2,"[13508028, 17484491, 17503108, 16160567, 14161..."
3,"[16127776, 16756133, 17040752, 18203427, 18205...",3,"[18203427, 16127776, 17040752, 16756133, 18205..."
4,"[14480467, 15487690, 17257765]",4,"[17257765, 15487690, 14480467]"
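
Because order carries no meaning, the same outfit can yield several distinct orderings, not just one. The following is a minimal sketch of that idea using only the standard library; the helper name `shuffled_copies` and its `n_copies` and `seed` parameters are made up for illustration and are not part of the notebook.

```python
import random

def shuffled_copies(products, n_copies=3, seed=0):
    """Return up to n_copies distinct orderings of the same outfit."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    seen = set()
    copies = []
    # Cap the attempts: a small outfit has only a few distinct orderings.
    for _ in range(n_copies * 10):
        order = tuple(rng.sample(products, len(products)))
        if order not in seen:
            seen.add(order)
            copies.append(list(order))
        if len(copies) == n_copies:
            break
    return copies

outfit = [14480467, 15487690, 17257765]  # outfit_id 4 from the table above
copies = shuffled_copies(outfit)
for c in copies:
    print(c)
```

Every copy contains exactly the same products, so each one can later be split into a different incomplete outfit and missing product.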


To generate a training example for your model, you could split an outfit into incomplete_outfit and missing_product.

In [10]:
full_outfits["incomplete_outfit"] = full_outfits.apply(lambda row: row["products_shuffled"][:-1], axis=1)
full_outfits["missing_product"] = full_outfits.apply(lambda row: row["products_shuffled"][-1], axis=1)
full_outfits.head()

Unnamed: 0,products,outfit_id,products_shuffled,incomplete_outfit,missing_product
0,"[15360881, 15379678, 15781925, 16204075, 16260...",0,"[15360881, 16204075, 16260894, 15379678, 15781...","[15360881, 16204075, 16260894, 15379678]",15781925
1,"[13893589, 13893721, 15426616, 16035469, 17173...",1,"[16035469, 13893721, 18218977, 13893589, 17173...","[16035469, 13893721, 18218977, 13893589, 17173...",15426616
2,"[13508028, 14161732, 16160567, 17484491, 17503...",2,"[13508028, 17484491, 17503108, 16160567, 14161...","[13508028, 17484491, 17503108, 16160567]",14161732
3,"[16127776, 16756133, 17040752, 18203427, 18205...",3,"[18203427, 16127776, 17040752, 16756133, 18205...","[18203427, 16127776, 17040752, 16756133]",18205465
4,"[14480467, 15487690, 17257765]",4,"[17257765, 15487690, 14480467]","[17257765, 15487690]",14480467
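
The cell above holds out only the last product of the shuffled list. As a possible alternative, each product can be held out in turn, which gives one example per product in the outfit. This is a sketch under that assumption; `leave_one_out` is a hypothetical helper, not something the notebook defines.

```python
def leave_one_out(products):
    """Build one (incomplete_outfit, missing_product) pair per product."""
    examples = []
    for i, missing in enumerate(products):
        # Everything except position i stays in the incomplete outfit.
        incomplete = products[:i] + products[i + 1:]
        examples.append((incomplete, missing))
    return examples

examples = leave_one_out([14480467, 15487690, 17257765])
for incomplete, missing in examples:
    print(incomplete, missing)
```

An outfit with three products yields three examples this way. Whether the extra examples help depends on the model, so treat this as something to experiment with.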


In [11]:
items_metadata = pd.read_parquet("../data/products.parquet")
items_metadata = items_metadata["product_id"]
items_metadata.head()

0    17073270
1    17674562
2    17678603
3    17179699
4    15907453
Name: product_id, dtype: int32

One way to generate candidates is to sample them from all the product_id values in the dataset.

In [12]:
from random import randint

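def candidates(row, minc=8, maxc=40):
    # Pick how many random products to draw (between minc and maxc, inclusive).
    n = randint(minc, maxc)
    print(n)  # debug output
    # Draw n product ids at random and drop any repeated ids.
    c = items_metadata.sample(n).unique().tolist()
    print(c)  # debug output
    # Make sure the correct answer is among the candidates.
    c.append(row["missing_product"])
    # Deduplicate once (the missing product may already have been sampled).
    c = list(set(c))
    print("Last:", c)  # debug output
    return c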

full_outfits["candidates"] = full_outfits.apply(lambda row: candidates(row), axis=1)
full_outfits.head()

29
[16794417, 16750366, 17744789, 17236381, 18188018, 17073834, 17041770, 16364771, 17180340, 16869094, 17220172, 16941694, 15935241, 17719856, 17869859, 17511760, 17754487, 17858392, 16463066, 16961862, 17306057, 18164886, 17137396, 17058488, 18239428, 18043210, 16704690, 18265637, 17236737]
Last: [17236737, 15935241, 17744789, 18164886, 17236381, 16750366, 17869859, 18265637, 15781925, 17073834, 17719856, 16794417, 16704690, 17180340, 17058488, 18239428, 16961862, 17306057, 18043210, 17220172, 17511760, 17858392, 16463066, 16364771, 16869094, 17041770, 18188018, 17137396, 17754487, 16941694]
21
[15905989, 16843052, 17267149, 17743794, 17864845, 16948874, 17514998, 18031040, 17869174, 17781361, 16501048, 13522819, 18048663, 17369239, 16936183, 18025704, 17787608, 16459457, 16858870, 16418386, 16358659]
Last: [13522819, 16358659, 16948874, 17864845, 18048663, 17369239, 16843052, 17743794, 16501048, 15426616, 18031040, 16459457, 15905989, 17267149, 16418386, 17787608, 18025704, 17781361

KeyboardInterrupt: 
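
A lighter variant of the same idea is to sample from a plain Python list and skip the per-row printing. The sketch below assumes the product ids are unique and fit in a list; `sample_candidates`, its `rng` parameter, and the ids in the example are made up for illustration.

```python
import random

def sample_candidates(missing_product, product_ids, minc=8, maxc=40, rng=random):
    """Draw between minc and maxc random ids, then add the correct answer."""
    n = rng.randint(minc, maxc)            # inclusive on both ends
    negatives = rng.sample(product_ids, n)  # sampling without replacement
    # The set union avoids a duplicate if the answer was sampled by chance.
    return list(set(negatives) | {missing_product})

product_ids = list(range(1000, 1100))  # made-up ids standing in for product_id
rng = random.Random(42)                # seeded for reproducibility
cands = sample_candidates(1005, product_ids, rng=rng)
print(len(cands), 1005 in cands)
```

In the notebook, the list could be built once with `items_metadata.unique().tolist()` and passed in for every row, assuming `items_metadata` is still the `product_id` Series loaded above.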

An example of a split with 80% of the outfits for training and 20% for testing:

In [None]:
train = full_outfits.sample(frac=0.8)
train.head()

In [None]:
test = full_outfits[~full_outfits["outfit_id"].isin(train["outfit_id"])]
test.head()
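
The split above changes on every run because `sample` is not seeded. If you want the same train and test rows each time, one option is to pass `random_state`. The sketch below uses a small made-up DataFrame in place of `full_outfits`; the seed value 42 is arbitrary.

```python
import pandas as pd

# Toy stand-in for full_outfits: ten outfits with made-up product ids.
outfits = pd.DataFrame({
    "outfit_id": range(10),
    "missing_product": range(100, 110),
})

# A fixed random_state makes the 80% sample identical across runs.
train = outfits.sample(frac=0.8, random_state=42)
# Dropping the sampled index labels leaves the remaining 20% as the test set.
test = outfits.drop(train.index)

print(len(train), len(test))
```

Dropping by index relies on the index labels being unique, which holds for a freshly loaded DataFrame with the default index.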

In [None]:
test_input = test[["outfit_id", "incomplete_outfit", "candidates"]]
test_input.head()

In [None]:
test_output = test[["outfit_id", "missing_product"]]
test_output.head()

In [None]:
import time

unique_name = int(time.time())
train.to_parquet(f"../data/manual_outfits_train_{unique_name}.parquet")
test_input.to_parquet(f"../data/manual_outfits_testinput_{unique_name}.parquet")
test_output.to_csv(f"../data/manual_outfits_testoutput_{unique_name}.csv", index=False)
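
Once the files are written, a model's predictions can be compared with the held-out products. This notebook does not describe the leaderboard metric, so plain accuracy is used here purely as an assumption; the `accuracy` helper and the predicted ids are made up for illustration.

```python
def accuracy(predictions, expected):
    """Fraction of outfits whose predicted product equals the held-out one."""
    hits = sum(1 for oid, prod in expected.items() if predictions.get(oid) == prod)
    return hits / len(expected)

# outfit_id -> missing_product, as stored in the test output file.
expected = {3: 18205465, 4: 14480467}
# Hypothetical model output: outfit 3 is right, outfit 4 is wrong.
predictions = {3: 18205465, 4: 99999999}

print(accuracy(predictions, expected))  # prints 0.5
```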

In [None]:
print(datetime.datetime.now())