# V4.1 golden dataset

- Version
    - The major version 4 stands for "evaluation data for V4 semrel teacher and student models"
    - The minor version .1 marks the first release; more annotation data may be added later, producing 4.2, 4.3, ...
- Consists of 
    - V3 golden dataset
    - Purchased pairs which V3 teacher predicted as partial, internally annotated in July 2025
        - [Excel sheet uploaded to BQ](https://github.com/etsy/etsy-llm/blob/4b2f6b38dd27910d3fcb693df18be116d517d918/projects/semrel_annotation/notebooks/prep_v3_evaluation_data.ipynb)
        - [how it was sampled](https://console.cloud.google.com/bigquery?ws=!1m7!1m6!12m5!1m3!1setsy-bigquery-adhoc-prod!2sus-central1!3s405d7a9a-be79-4b6c-8d44-4f6826428965!2e1)
- [Schema and info](https://docs.google.com/spreadsheets/d/1pniqtO8wJM9ZZSGw_bbN2sDS9kpo2WNcbytqFmFtRYw/edit?usp=sharing)

## Create partial purchase dataset in BQ UI

- [SQL script](https://github.com/yqngzh/yzhang-adhoc-analysis/blob/master/2025H2_exact_match/create_v4_1_eval_data.sql)
- output table `etsy-search-ml-dev.yzhang.exact_matches_hydrated_data_v4_1`

## Combine V3 data with partial purchases

In [1]:
import pandas as pd
import numpy as np
import hashlib

from google.cloud import bigquery

In [2]:
v3_df = pd.read_excel(
    "gs://training-dev-search-data-jtzn/semantic_relevance/datasets/v3_eval_golden_standard_labels/gsl_eval_v0_output.xlsx"
)

In [3]:
client = bigquery.Client(project="etsy-search-ml-dev")
query_job = client.query("select * from `etsy-search-ml-dev.yzhang.exact_matches_hydrated_data_v4_1`")
rows = query_job.result()
v4_raw_df = rows.to_dataframe()

In [4]:
# clean up V3 to match V4
v3_df["gt_label"] = v3_df["etsy_round_label"]

v3_df["product_type_in_query"] = None
v3_df["product_type_in_listing_if_mismatch"] = None
v3_df["subsitute_complementary"] = None
v3_df["descriptor_mismatch"] = None

In [5]:
# make row_id in V3
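def generate_hash(row):
    # Concatenate the request UUID, query, and listing ID, then SHA-256 the UTF-8 bytes.
    # The concatenation is always a str, so no type check is needed before encoding.
    data = row["etsyUUID"] + row["query"] + str(row["listingId"])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()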


v3_df["row_id"] = v3_df.apply(generate_hash, axis=1)
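
As a quick sanity check on the `row_id` scheme, the sketch below re-declares the same hashing logic and applies it to made-up rows (plain dicts with hypothetical values, not real data). It illustrates that the ID is deterministic and changes with the listing, and also that plain concatenation without a separator leaves field boundaries ambiguous.

```python
import hashlib


def generate_hash(row):
    # Same logic as the cell above: SHA-256 over etsyUUID + query + listingId.
    data = row["etsyUUID"] + row["query"] + str(row["listingId"])
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


# Hypothetical rows for illustration only.
row_a = {"etsyUUID": "uuid-1", "query": "blue mug", "listingId": 123}
row_b = {"etsyUUID": "uuid-1", "query": "blue mug", "listingId": 124}

id_a = generate_hash(row_a)
id_a_again = generate_hash(dict(row_a))
id_b = generate_hash(row_b)

# No separator between fields: both rows below concatenate to "uuid-1mug123".
shifted_1 = {"etsyUUID": "uuid-1", "query": "mug1", "listingId": 23}
shifted_2 = {"etsyUUID": "uuid-1", "query": "mug", "listingId": 123}
collide = generate_hash(shifted_1) == generate_hash(shifted_2)

print(len(id_a), id_a == id_a_again, id_a != id_b, collide)
```

Such boundary collisions are probably rare with real UUIDs and queries, and the uniqueness check on `row_id` later in this notebook would surface one if it occurred.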

In [6]:
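# clean up V4: drop unlabeled and "Not Sure" rows, then normalize label names
# .copy() avoids pandas' SettingWithCopyWarning on the assignment below
v4_raw_df = v4_raw_df[v4_raw_df.gt_label.notna() & (v4_raw_df.gt_label != "Not Sure")].copy()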
v4_raw_df["gt_label"] = v4_raw_df.gt_label.map({
    "Fully Relevant": "relevant", 
    "Partially Relevant": "partial",
    "Irrelevant": "not_relevant"
})
v4_raw_df.gt_label.value_counts()

relevant        265
partial         128
not_relevant     10
Name: gt_label, dtype: int64
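
One thing to keep in mind with `Series.map` and a dict is that any label missing from the dict silently becomes `NaN`. The sketch below, on a toy frame with hypothetical values (`LABEL_MAP`, `toy`, and `unmapped` are illustrative names, not part of the notebook), shows one way to list unexpected labels before mapping.

```python
import pandas as pd

LABEL_MAP = {
    "Fully Relevant": "relevant",
    "Partially Relevant": "partial",
    "Irrelevant": "not_relevant",
}

# Hypothetical annotations, including a missing label and a miscased one.
toy = pd.DataFrame(
    {
        "gt_label": [
            "Fully Relevant",
            "Not Sure",
            None,
            "Partially Relevant",
            "Irrelevant",
            "fully relevant",
        ]
    }
)

# Same filter as the notebook cell: drop nulls and "Not Sure".
kept = toy[toy["gt_label"].notna() & (toy["gt_label"] != "Not Sure")].copy()

# Labels that the mapping does not cover would turn into NaN after .map().
unmapped = sorted(set(kept["gt_label"]) - set(LABEL_MAP))
kept["gt_label"] = kept["gt_label"].map(LABEL_MAP)

print(unmapped)
print(kept["gt_label"].isna().sum())
```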

In [7]:
desired_columns = [
    # request
    "row_id", "etsyUUID", "platform", "userCountry", "userLanguage",
    # query
    "query", "queryEn", "seg_queryBin", "seg_qisClass", "seg_queryTaxoFullPath", "seg_queryTaxoTop", 
    "tangibleItem", "fandom", "motif", "style", "material", "color", "technique", "size", "occasion", 
    "customization", "age", "price", "quantity", "recipient", "queryEntities", "queryRewrites", "queryIsGift",
    # listing
    "listingId", "listingCountry", "shop_primaryLanguage", "listingTitle", "listingTitleEn",
    "listingTaxo", "listingTags", "listingAttributes", "listingShopName", 
    "listingDescription", "listingDescriptionEn", "listingDescNgrams",
    "listingImageUrls", "listingHeroImageCaption", "listingVariations", "listingReviews",
    # annotation
    "anno_data_source", "gt_label", 
    "product_type_in_query",
    "product_type_in_listing_if_mismatch",
    "subsitute_complementary",
    "descriptor_mismatch"
]

In [8]:
v3_df = v3_df[desired_columns]
v4_raw_df = v4_raw_df[desired_columns]
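
Selecting with a list of column names raises a `KeyError` if any name is absent, so it can help to list the missing names first. A minimal sketch, assuming a hypothetical helper `missing_columns` and a toy frame (neither is part of the notebook):

```python
import pandas as pd


def missing_columns(df, columns):
    """Return the requested column names that are absent from df, in order."""
    return [col for col in columns if col not in df.columns]


# Hypothetical frame and column list for illustration only.
toy_df = pd.DataFrame(
    {"row_id": ["a"], "query": ["blue mug"], "gt_label": ["partial"]}
)
wanted = ["row_id", "query", "listingId", "gt_label"]

missing = missing_columns(toy_df, wanted)
print(missing)
```

Running the same check against both frames with `desired_columns` before the selection would point at a schema mismatch directly.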

In [9]:
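# ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeated labels
v4_1 = pd.concat([v3_df, v4_raw_df], ignore_index=True)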

In [10]:
v4_1.to_json(
    "gs://training-dev-search-data-jtzn/semantic_relevance/datasets/v4_eval_golden_data/eval_data_v4-1.jsonl", 
    orient="records", lines=True
)
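
To illustrate the JSON Lines format written above, the sketch below round-trips a toy frame through an in-memory string rather than GCS. The column values are hypothetical; only `to_json(orient="records", lines=True)` mirrors the notebook call.

```python
import io

import pandas as pd

# Hypothetical rows; descriptor_mismatch is all None, as it is for V3 rows.
toy = pd.DataFrame(
    {
        "row_id": ["row-a", "row-b"],
        "listingId": [123, 456],
        "gt_label": ["relevant", "partial"],
        "descriptor_mismatch": [None, None],
    }
)

# With no path given, to_json returns the JSON Lines text instead of writing a file.
jsonl_text = toy.to_json(orient="records", lines=True)

# Each line is one JSON object; lines=True parses them back into rows.
restored = pd.read_json(io.StringIO(jsonl_text), lines=True)

print(restored.shape == toy.shape)
print(list(restored["row_id"]) == list(toy["row_id"]))
```

Null-only columns are written as `null` and read back as missing values, so the column count survives the round trip.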

In [11]:
v4_1.shape

(1498, 50)

In [12]:
# row_id should be unique: the unique count must equal the row count
print(v4_1.shape)
print(len(v4_1.row_id.unique()))

(1498, 50)
1498
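
If the two numbers ever disagree, the offending rows can be pulled out for inspection. A minimal sketch on a toy frame with hypothetical values, using `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Hypothetical frame in which row_id "a" appears twice.
toy = pd.DataFrame(
    {
        "row_id": ["a", "b", "a", "c"],
        "anno_data_source": ["intl-fr", "intl-de", "partial_purchases", "intl-it"],
    }
)

is_unique = toy["row_id"].is_unique

# keep=False marks every occurrence of a repeated row_id, not only the later ones.
dupes = toy[toy["row_id"].duplicated(keep=False)]

print(is_unique)
print(dupes)
```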


In [13]:
v4_1.gt_label.value_counts()

partial         748
relevant        626
not_relevant    124
Name: gt_label, dtype: int64

In [14]:
v4_1.anno_data_source.value_counts()

partial_purchases           403
us_v2-direct_specified      227
us_v2-direct_unspecified    208
intl-fr                     118
us_v2-broad                 109
intl-es                     102
intl-de                      97
intl-it                      67
intl-nl                      61
us_v2-gift                   59
intl-pt                      22
intl-ja                      10
intl-pl                       9
intl-ru                       6
Name: anno_data_source, dtype: int64
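
The two breakdowns above can also be viewed jointly, as label counts per data source. The sketch below uses `pd.crosstab` on a toy frame with hypothetical rows; applying the same call to `v4_1` is an assumption about how one might extend the notebook, not something it currently does.

```python
import pandas as pd

# Hypothetical rows for illustration only.
toy = pd.DataFrame(
    {
        "anno_data_source": [
            "partial_purchases",
            "partial_purchases",
            "intl-fr",
            "intl-fr",
            "intl-de",
        ],
        "gt_label": ["relevant", "partial", "partial", "not_relevant", "relevant"],
    }
)

# margins=True appends an "All" row and column holding the totals.
table = pd.crosstab(toy["anno_data_source"], toy["gt_label"], margins=True)
print(table)
```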