# Example Dataset Preparation

This notebook demonstrates how to build an example dataset from the WikiArt metadata. It encodes **styles**, **artists**, and **timeframes** as one-hot vectors and groups all entries into the training, validation, and test splits given by the metadata's `mode` column. Each entry is a list of the form `[image, style, artist, timeframe]`, and the result is saved as a pickled Python dictionary (`wikiart_full.pkl`) keyed by `'train'`, `'val'`, and `'test'`.

- **Input**: `wikiart_full.csv` (metadata file from [ArtSAGENet](https://github.com/thefth/ArtSAGENet))  
- **Output**: `../wikiart_full.pkl`, which contains one-hot encoded label vectors and image paths.  
- **Purpose**: Provide a ready-to-load dataset structure for model training examples.

In [None]:
import numpy as np
import pandas as pd
import pickle

from sklearn.preprocessing import LabelBinarizer
from tqdm.notebook import tqdm

In [None]:
# Load dataset
df = pd.read_csv('https://raw.githubusercontent.com/thefth/ArtSAGENet/main/Dataset/wikiart_full.csv')

# Display basic information about the dataset
df.info()

In [None]:
# One-hot encode categorical attributes
styles = LabelBinarizer().fit_transform(df['style_classification']).astype(float)
artists = LabelBinarizer().fit_transform(df['artist_attribution']).astype(float)
timeframes = LabelBinarizer().fit_transform(df['timeframe_estimation']).astype(float)
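
To make the encoding step concrete, here is a minimal sketch on made-up style labels (the strings below are illustrative and not taken from the CSV). It shows that `LabelBinarizer` orders its output columns by the sorted class names stored in `classes_`, and that a column with exactly two distinct values yields a single 0/1 column rather than two.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels for illustration only; the real values come from df['style_classification'].
toy_styles = ['Baroque', 'Cubism', 'Baroque', 'Impressionism']

binarizer = LabelBinarizer()
toy_encoded = binarizer.fit_transform(toy_styles).astype(float)

# Columns follow the sorted class names.
print(binarizer.classes_)
print(toy_encoded)

# Recover the label of each row from its one-hot vector.
decoded = binarizer.classes_[np.argmax(toy_encoded, axis=1)]
print(decoded)

# With exactly two classes, LabelBinarizer returns one column instead of two.
binary_encoded = LabelBinarizer().fit_transform(['train', 'test', 'train'])
print(binary_encoded.shape)
```

Keeping a reference to the fitted binarizer, as in this sketch, is one way to retain the column-to-label mapping for decoding predictions later; the cell above discards it by calling `LabelBinarizer()` inline.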

In [None]:
# Initialize dictionary for train/val/test splits
dataset = {'train': [], 'val': [], 'test': []}

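# Iterate through all rows and populate the dataset dictionary.
# enumerate() supplies the row position, which is how the NumPy label arrays
# are indexed; the DataFrame index label only matches it for a default RangeIndex.
for pos, (_, row) in enumerate(tqdm(df.iterrows(), total=df.shape[0])):
    dataset[row['mode']].append([row['image'], styles[pos], artists[pos], timeframes[pos]])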
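
A quick sanity check after populating the splits is to compare the number of entries per split against the counts in the `mode` column. The sketch below does this on a toy DataFrame; `toy_df`, `toy_labels`, and `toy_dataset` are hypothetical stand-ins, and it assumes every split appears at least once in `mode`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the metadata; the real frame has many more rows and columns.
toy_df = pd.DataFrame({
    'image': ['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg'],
    'mode': ['train', 'val', 'train', 'test'],
})
toy_labels = np.eye(4)  # placeholder label vectors, one per row

toy_dataset = {'train': [], 'val': [], 'test': []}
for pos, (_, row) in enumerate(toy_df.iterrows()):
    toy_dataset[row['mode']].append([row['image'], toy_labels[pos]])

# Each split should hold exactly as many entries as the 'mode' column assigns to it.
expected = toy_df['mode'].value_counts().to_dict()
actual = {split: len(entries) for split, entries in toy_dataset.items()}
assert actual == expected, (actual, expected)
print(actual)
```

The same comparison should apply to the real `df` and `dataset` objects, provided the `mode` column contains only the three split names.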

In [None]:
# Save the complete dataset as a pickle file for quick loading in model training scripts
with open('../wikiart_full.pkl', 'wb') as file:
    pickle.dump(dataset, file)
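
For reference, a training script could read the file back roughly as follows. This is a sketch under assumptions: it round-trips a tiny hand-made dictionary through a temporary file rather than touching `../wikiart_full.pkl`, and the names `toy_dataset`, `train_styles`, and so on are illustrative.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy dataset with the same layout as above: [image, style, artist, timeframe] per entry.
toy_dataset = {
    'train': [
        ['a.jpg', np.array([1., 0.]), np.array([0., 1., 0.]), np.array([1., 0.])],
        ['b.jpg', np.array([0., 1.]), np.array([1., 0., 0.]), np.array([0., 1.])],
    ],
    'val': [],
    'test': [],
}

# Write and read back through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.pkl')
    with open(path, 'wb') as file:
        pickle.dump(toy_dataset, file)
    with open(path, 'rb') as file:
        loaded = pickle.load(file)

# Transpose the list of entries into parallel sequences, then stack labels into matrices.
images, train_styles, train_artists, train_timeframes = zip(*loaded['train'])
style_matrix = np.stack(train_styles)
artist_matrix = np.stack(train_artists)
print(len(images), style_matrix.shape, artist_matrix.shape)
```

Stacking the per-entry vectors gives one `(num_samples, num_classes)` matrix per attribute, which is usually the more convenient shape for batching.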