In [1]:
import os
import hopsworks
import polars as pl

In [2]:
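# Read the API key from a local file so it is never hard-coded in the notebook.
with open('data/hopsworks-api-key.txt', 'r') as file:
    os.environ["HOPSWORKS_API_KEY"] = file.read().rstrip()

# hopsworks.login() picks the key up from the HOPSWORKS_API_KEY environment variable.
project = hopsworks.login()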

2025-01-05 16:56:27,676 INFO: Initializing external client
2025-01-05 16:56:27,677 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-05 16:56:29,413 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1159324


## Get data

### Quick data exploration
Observations taken from https://graildient-descent.streamlit.app/eda, which is based on 10,000 samples.
The same analysis could be repeated on the larger dataset.

#### Numerical/quantitative features
Target variable (sold_price):
- Most sold items are priced between $35 and $135 (consider plotting bins).
- The distribution is far from normal, so outliers need attention; a log transformation may help.
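
A minimal sketch of the log transformation idea, using only the standard library. The prices below are made up for illustration and are not taken from the dataset; `skewness` is a helper defined here, not part of the project's pipeline.

```python
import math
import statistics

# Hypothetical sold prices in USD (illustrative only): most values sit in the
# 35-135 range, with a few expensive outliers forming a long right tail.
sold_prices = [35, 40, 55, 60, 75, 90, 110, 135, 400, 1200]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# log1p(x) = log(1 + x) compresses the right tail and is defined at x = 0.
log_prices = [math.log1p(p) for p in sold_prices]

print(f"skewness before: {skewness(sold_prices):.2f}")
print(f"skewness after:  {skewness(log_prices):.2f}")

# A model trained on the log target predicts in log space;
# expm1 maps those predictions back to dollars.
restored = [math.expm1(v) for v in log_prices]
```

If the transformed target is used for training, evaluation metrics should be computed after mapping predictions back with `expm1`, so that errors stay interpretable in dollars.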

Number of photos:
- Another numerical feature that could be added.
- Price increases with the number of photos up to 13; beyond that the trend is inconsistent.

#### Categorical/qualitative features
- high-cardinality: designer, color, subcategory, size
- low-cardinality: category, condition

Target encoding is a better fit for the high-cardinality features, whereas the low-cardinality features could be one-hot encoded.
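
The two encodings could be sketched as follows. This is a plain-Python illustration, not the project's `utils.feature_engineering` code: the rows, the `target_encode` and `one_hot` helpers, and the `smoothing` parameter are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows (illustrative values, not taken from the dataset).
rows = [
    {"designer": "Designer A", "category": "tops", "sold_price": 100},
    {"designer": "Designer A", "category": "outerwear", "sold_price": 200},
    {"designer": "Designer B", "category": "tops", "sold_price": 50},
]


def target_encode(rows, column, target, smoothing=1.0):
    """Smoothed mean target encoding.

    Each level is mapped to a blend of its own mean target and the global
    mean; rare levels are pulled more strongly towards the global mean.
    """
    global_mean = sum(r[target] for r in rows) / len(rows)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in rows:
        sums[r[column]] += r[target]
        counts[r[column]] += 1
    encoding = {
        level: (sums[level] + smoothing * global_mean) / (counts[level] + smoothing)
        for level in counts
    }
    return encoding, global_mean


def one_hot(rows, column):
    """Return one 0/1 indicator vector per row, plus the sorted level names."""
    levels = sorted({r[column] for r in rows})
    vectors = [[int(r[column] == level) for level in levels] for r in rows]
    return vectors, levels


designer_encoding, global_mean = target_encode(rows, "designer", "sold_price")
category_vectors, category_levels = one_hot(rows, "category")

# Levels unseen at fit time fall back to the global mean.
unseen = designer_encoding.get("Designer C", global_mean)
```

To avoid target leakage, the encoding should be fitted on the training split only and then applied to validation and test data.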

- designer is a strong predictor
- we skip department and focus on men's clothing only, since it is better represented in the data
- sold prices vary considerably across subcategories, so subcategory is probably a good indicator
- it could be interesting to use embeddings for color instead of a categorical encoding
- condition is well suited to ordinal encoding: the better the condition, the higher the average sold price
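
A minimal sketch of ordinal encoding for condition. The level names in `CONDITION_ORDER` are assumed for illustration and should be checked against the values that actually occur in the data; `encode_condition` is a hypothetical helper.

```python
# Assumed condition levels, ordered from worst to best. These names are
# placeholders; replace them with the values present in the dataset.
CONDITION_ORDER = ["is_worn", "is_used", "is_gently_used", "is_new"]

# Map each level to its rank so that a better condition gets a larger number.
CONDITION_RANK = {level: rank for rank, level in enumerate(CONDITION_ORDER)}


def encode_condition(value):
    """Return the ordinal rank of a condition, or None if it is unknown."""
    return CONDITION_RANK.get(value)


conditions = ["is_new", "is_used", "is_not_specified"]
encoded = [encode_condition(c) for c in conditions]
```

Unknown or unspecified conditions are kept as `None` rather than forced into the ordering, so they can be imputed or flagged separately.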

##### Text
- title/description - should we do any pre-processing?
- we could look at sentiment and similar text analysis approaches
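
One possible answer to the pre-processing question, sketched with the standard library. The specific steps (dropping URLs, pulling out hashtags, collapsing whitespace, lowercasing) are assumptions about what might help before embedding; `preprocess` is a hypothetical helper and not part of the existing pipeline.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#(\w+)")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Return (cleaned_text, hashtags) for a raw title or description."""
    # Remove URLs first so that URL fragments are not mistaken for hashtags.
    without_urls = URL_RE.sub(" ", text)
    # Collect hashtags separately, then remove them from the running text.
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(without_urls)]
    without_tags = HASHTAG_RE.sub(" ", without_urls)
    # Collapse runs of whitespace (including newlines) and lowercase.
    cleaned = WHITESPACE_RE.sub(" ", without_tags).strip().lower()
    return cleaned, hashtags


cleaned, hashtags = preprocess(
    "Vintage  Nike Tee\n#streetwear #Y2K see https://example.com/x"
)
```

Whether lowercasing helps depends on the embedding model: many sentence-embedding models handle raw casing well, so this step is worth validating rather than assuming.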

#### Images
- could be interesting to add embedding representation of title image

## Feature processing

In [3]:
from utils.feature_engineering import pipeline
df = pipeline(no_of_hits=300)
df.shape

embedding designer names
embedding descriptions
embedding titles
embedding hashtags
shape: (291, 13)
┌──────────┬────────────┬────────────┬────────────┬───┬────────┬───────────┬───────────┬───────────┐
│ id       ┆ sold_at    ┆ designer_n ┆ descriptio ┆ … ┆ color  ┆ followern ┆ sold_pric ┆ embedded_ │
│ ---      ┆ ---        ┆ ames       ┆ n          ┆   ┆ ---    ┆ o         ┆ e         ┆ hashtags  │
│ i64      ┆ datetime[μ ┆ ---        ┆ ---        ┆   ┆ str    ┆ ---       ┆ ---       ┆ ---       │
│          ┆ s]         ┆ list[f32]  ┆ list[f32]  ┆   ┆        ┆ i64       ┆ i64       ┆ list[f32] │
╞══════════╪════════════╪════════════╪════════════╪═══╪════════╪═══════════╪═══════════╪═══════════╡
│ 69836167 ┆ 2025-01-05 ┆ [0.015452, ┆ [-0.002503 ┆ … ┆ black  ┆ 31        ┆ 200       ┆ [0.048453 │
│          ┆ 13:04:15.6 ┆ 0.032922,  ┆ ,          ┆   ┆        ┆           ┆           ┆ ,         │
│          ┆ 08         ┆ … -0.0129… ┆ 0.020896,  ┆   ┆        ┆           ┆           ┆ -0



(291, 13)

## Save data

In [4]:
fs = project.get_feature_store() 

In [5]:
grailed_items_fg = fs.get_or_create_feature_group(
    name='draft_grailed_items',
    description='Sold Grailed items',
    version=2,
    primary_key=['id'],
    event_time="sold_at",
    # expectation_suite=aq_expectation_suite
)

In [6]:
grailed_items_fg.insert(df)

%4|1736092700.741|FAIL|rdkafka#consumer-2| [thrd:ssl://51.161.80.189:9093/bootstrap]: ssl://51.161.80.189:9093/0: Connection setup timed out in state CONNECT (after 30281ms in state CONNECT)
Uploading Dataframe: 100.00% |████████████████████████████████████████████████| Rows 291/291 | Elapsed Time: 00:02 | Remaining Time: 00:00


Launching job: draft_grailed_items_2_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1159324/jobs/named/draft_grailed_items_2_offline_fg_materialization/executions


(Job('draft_grailed_items_2_offline_fg_materialization', 'SPARK'), None)

In [None]:
# TODO: Update feature description
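
A possible way to address this TODO, as a sketch. It assumes the feature group object exposes `update_feature_description(feature_name, description)`, as Hopsworks feature groups do. The column names come from the DataFrame preview above, the description texts are placeholders, and `apply_descriptions` is a hypothetical helper.

```python
# Placeholder descriptions; adjust the wording to match the actual features.
FEATURE_DESCRIPTIONS = {
    "id": "Grailed listing identifier (primary key)",
    "sold_at": "Timestamp at which the item was sold (event time)",
    "color": "Color of the listed item",
    "sold_price": "Final sold price in USD (prediction target)",
}


def apply_descriptions(feature_group, descriptions):
    """Set a description on each listed feature of the feature group."""
    for feature_name, description in descriptions.items():
        feature_group.update_feature_description(feature_name, description)


# In the notebook this would be called as:
# apply_descriptions(grailed_items_fg, FEATURE_DESCRIPTIONS)
```

Keeping the descriptions in one dictionary makes them easy to review and to reapply when a new feature group version is created.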