# UK Biobank Trait–Value Tokenisation Script

This script converts UK Biobank raw phenotype fields into **token IDs** according to the unified trait–value vocabulary used in **ukbFound**.
It is designed to reproduce the end-to-end preprocessing workflow for the demo dataset provided in `ukbFound/data/`.

## Purpose

* Load the curated `ukb_traits.csv` vocabulary.
* Map raw UKB fields (continuous, multi-choice, multi-select, categorical) into standardised token IDs.
* Expand multi-instance and multi-array fields following UKB conventions.
* Produce a tokenised participant-by-trait matrix suitable for downstream modelling.
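
The instance/array expansion can be illustrated with a minimal sketch. It assumes the `<field_id>-<instance>.<array_index>` column naming that the script itself builds; the helper name `expand_columns` is hypothetical and does not appear in the script.

```python
def expand_columns(field_id, instance, array_min, array_max):
    """Return raw column names of the form '<field_id>-<instance>.<array_index>'."""
    return [
        f"{field_id}-{instance}.{array_idx}"
        for array_idx in range(array_min, array_max + 1)  # array_max is inclusive
    ]


# Field 20002, instance 0, array indices 0..2
cols = expand_columns("20002", 0, 0, 2)
print(cols)  # ['20002-0.0', '20002-0.1', '20002-0.2']
```

If any of the expanded columns is absent from the raw table, the script records the `(field_id, instance)` pair as missing rather than failing.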

## Input files

Located in `ukbFound/data/`:

* `ukb_traits.csv` — unified trait–value vocabulary with field type, value levels and token IDs.
* `demo_UKB_data.csv` — synthetic demo phenotype table (raw UKB-style columns).
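
As a rough illustration of how the vocabulary is consumed, the sketch below reads a tiny in-memory stand-in for `ukb_traits.csv` with `dtype=str` and looks up a token ID. The column names follow those used by the script; the three rows and their token IDs are made up for the example.

```python
import io

import pandas as pd

# Made-up vocabulary rows; the real file has many more columns and rows.
csv_text = (
    "field_id,value_type,value,token_id\n"
    "31,11,0,10\n"
    "31,11,1,11\n"
    "21001,31,Q1,20\n"
)

# dtype=str keeps every cell as text, so "31" is never parsed as 31 or 31.0.
vocab = pd.read_csv(io.StringIO(csv_text), dtype=str)

# A raw value that arrives as a float is normalised to the same string form.
raw_value = 1.0
key = str(int(raw_value))

match = vocab[(vocab["field_id"] == "31") & (vocab["value"] == key)]
token_id = float(match["token_id"].iat[0])
print(token_id)  # 11.0
```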

## Output files

Generated in the same directory:

* `demo_UKB_tokens.csv` — tokenised participant matrix (one row per participant).
* `missing_field_id.csv` — list of (field_id, instance) not present in the input file.
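
The token file is written chunk by chunk. A minimal sketch of that pattern follows, using an in-memory CSV, a temporary output path, and a made-up value-to-token mapping in place of the real tokenisation step.

```python
import io
import os
import tempfile

import pandas as pd

raw_csv = "eid,31-0.0\n1,0\n2,1\n3,1\n4,0\n5,1\n"
out_path = os.path.join(tempfile.mkdtemp(), "tokens_sketch.csv")

# Hypothetical mapping standing in for the vocabulary lookup.
value_to_token = {"0": 100.0, "1": 101.0}

reader = pd.read_csv(io.StringIO(raw_csv), chunksize=2, dtype=str)
for chunk_index, chunk in enumerate(reader):
    out = pd.DataFrame(
        {"eid": chunk["eid"], "token": chunk["31-0.0"].map(value_to_token)}
    )
    first = chunk_index == 0
    # The first chunk creates the file with a header; later chunks append rows only.
    out.to_csv(out_path, mode="w" if first else "a", header=first, index=False)

result = pd.read_csv(out_path)
print(result.shape)  # (5, 2)
```

Appending without a header only lines up if every chunk produces the same columns in the same order, which is worth checking when output columns depend on the data in each chunk.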

## Key processing steps

1. **Load trait vocabulary** (`ukb_traits.csv`) with every column read as a string (`dtype=str`), which avoids mixed-type inference.
2. **Select non-private traits** and enumerate their instance/array expansions.
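3. **Row-wise tokenisation** via `check_row()`:

   * Continuous: Q1–Q4 discretisation when the vocabulary defines quartile boundaries for the field; the mean of the non-missing array values is compared against them.
   * Multi-choice: direct value lookup.
   * Multi-select: fields 20001 and 20002 emit one “yes”/“not” token per vocabulary value; other fields emit one token per matched value, in enumerated columns.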
4. **Chunked streaming** over the raw CSV to support large datasets.
5. **Concatenate outputs** and write to disk incrementally.
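
The Q1–Q4 step can be sketched in isolation. The function below mirrors the comparison used in `check_row()` (first quartile whose upper boundary is not exceeded, falling back to Q4), but the name `quartile_token`, the boundaries and the token IDs are illustrative assumptions, not values from the vocabulary.

```python
import math


def quartile_token(value, boundaries, token_ids):
    """Map a numeric value to the token ID of the first quartile it fits in.

    boundaries: dict such as {"Q1": upper_bound, ...}
    token_ids:  dict such as {"Q1": token_id, ...}
    """
    if value is None or math.isnan(value):
        return math.nan
    for q in ("Q1", "Q2", "Q3", "Q4"):
        if q in boundaries and value <= boundaries[q]:
            return float(token_ids[q])
    # Values above every boundary fall back to Q4 when it is defined.
    return float(token_ids["Q4"]) if "Q4" in token_ids else math.nan


# Made-up boundaries and token IDs for a BMI-like field.
bounds = {"Q1": 22.0, "Q2": 25.0, "Q3": 28.0, "Q4": 60.0}
tokens = {"Q1": 1, "Q2": 2, "Q3": 3, "Q4": 4}

print(quartile_token(24.1, bounds, tokens))  # 2.0
print(quartile_token(75.0, bounds, tokens))  # 4.0
```

In the script the boundaries are read from the `meaning` column of the Q1 to Q4 vocabulary rows, and the input value is the mean across the field's array columns.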

## Notes

* This implementation is self-contained: beyond the Python standard library it depends only on `pandas` and `numpy`.
* All empty outputs are returned as `pd.Series(..., dtype="float64")` to ensure stable downstream behaviour.
* The demo dataset is small (n=100) and intended solely for illustrating the tokenisation pipeline.
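
The note on empty outputs can be shown with a short sketch. It builds an empty and a non-empty per-row result the way `check_row()` does, from an `OrderedDict` with an explicit dtype; the column name `sex_d31-0.0` is a hypothetical example of the `<trait>_d<column>` pattern.

```python
from collections import OrderedDict

import pandas as pd

# No tokens for this row: still a float64 Series, just with length 0.
empty = pd.Series(OrderedDict(), dtype="float64")

# One token for this row, stored as a float so NaN can mark missing tokens.
filled = pd.Series(OrderedDict([("sex_d31-0.0", 11.0)]), dtype="float64")

print(empty.dtype, len(empty))    # float64 0
print(filled.dtype, len(filled))  # float64 1
```

Keeping the dtype fixed means empty and non-empty rows produce results of the same type.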


In [3]:
# coding: utf-8
import pandas as pd
import numpy as np
from collections import OrderedDict
from pathlib import Path

# ========== 0. Path setup ==========
# Directory containing this script (e.g. ukbFound/examples/); in a notebook, fall back to the current working directory
try:
    HERE = Path(__file__).resolve().parent
except NameError:
    HERE = Path.cwd()

# Repository root, e.g. ukbFound/
ROOT = HERE.parent
# Data directory, e.g. ukbFound/data/
DATA_DIR = ROOT / "data"

print("HERE    :", HERE)
print("ROOT    :", ROOT)
print("DATA_DIR:", DATA_DIR)

# ========== 1. Load the trait vocabulary ==========
print("=== Step 1: Load trait vocabulary (ukb_traits.csv) ===")

# Key change: dtype=str avoids a mixed-type DtypeWarning
traits_df = pd.read_csv(
    DATA_DIR / "ukb_traits.csv",
    encoding="latin1",
    quotechar='"',
    dtype=str,
)

print(f"Loaded traits_df with {traits_df.shape[0]} rows and {traits_df.shape[1]} columns.")

file_path = DATA_DIR / "demo_UKB_data.csv"
output_file = DATA_DIR / "demo_UKB_tokens.csv"
missing_field_file = DATA_DIR / "missing_field_id.csv"
print(f"Input UKB file   : {file_path}")
print(f"Output token file: {output_file}")
print(f"Missing-field log: {missing_field_file}")

# ========== 1.1 Helper: normalise values to a consistent string form ==========
def convert_and_strip(value):
    if pd.isna(value):
        return np.nan
    # dtype=str already guarantees strings; handle float/int defensively anyway
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()

print("Converting selected columns in traits_df to string representation ...")

traits_df["field_id"] = traits_df["field_id"].map(convert_and_strip)
traits_df["value_type"] = traits_df["value_type"].map(convert_and_strip)
traits_df["value"] = traits_df["value"].map(convert_and_strip)
if "trait" in traits_df.columns:
    traits_df["trait"] = traits_df["trait"].map(convert_and_strip)
if "token" in traits_df.columns:
    traits_df["token"] = traits_df["token"].map(str)
if "meaning" in traits_df.columns:
    traits_df["meaning"] = traits_df["meaning"].map(str)

only_traits_df = traits_df[
    ["field_id", "value_type", "instance_min", "array_min", "array_max", "private"]
].drop_duplicates()
only_traits_df = only_traits_df[only_traits_df["private"] == "0"]  # compared as a string because dtype=str

print("\n=== Step 2: Summary of non-private traits used for tokenisation ===")
print(f"Unique non-private (field_id, instance_min, array_range) entries: {only_traits_df.shape[0]}")
print("Value_type distribution among non-private traits:")
print(only_traits_df["value_type"].value_counts().sort_index())
print()

# For fast column selection later, cast instance_min/array_min/array_max to int
only_traits_df = only_traits_df.copy()
only_traits_df["instance_min"] = only_traits_df["instance_min"].astype(int)
only_traits_df["array_min"] = only_traits_df["array_min"].astype(int)
only_traits_df["array_max"] = only_traits_df["array_max"].astype(int)

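# ========== 2. Per-row processing logic ==========
def check_row(apply_row, df_trait):
    """
    apply_row: one participant's row for a single field and instance
               (the columns are the field's expanded array indices)
    df_trait:  the subset of the trait vocabulary for the current field_id
    """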
    apply_row = apply_row.map(convert_and_strip)
    df = df_trait.copy()

    value_types = df["value_type"].unique()
    field_ids = df["field_id"].unique()

    # Validate before indexing so an empty or mixed vocabulary subset fails clearly
    if len(field_ids) != 1:
        raise KeyError(f"expected exactly one field_id, got: {field_ids}")
    field_id = field_ids[0]

    value_type = value_types[0]
    array_values = [v for v in apply_row.tolist() if not pd.isna(v)]

    out_dict = OrderedDict()

    # ========== 2.1 Multi-select fields ==========
    if value_type == "22":
        # Ordinary multi-select fields
        if field_id not in ["20002", "20001"]:
            if len(array_values) == 0:
                # Key change: set dtype explicitly to avoid a FutureWarning
                return pd.Series(out_dict, dtype="float64")

            matched_value_df = df[df["value"].isin(array_values)]
            count = 0
            for _, row in matched_value_df.iterrows():
                col = (
                    row["trait"]
                    + "_d"
                    + apply_row.index[0]
                    + "-"
                    + str(count)
                )
                out_dict[col] = float(row["token_id"]) if "token_id" in row and not pd.isna(row["token_id"]) else np.nan
                count += 1
        # Special encoding for 20002/20001: yes / not
        else:
            if len(apply_row.unique()) <= 1 and pd.isna(apply_row.unique()[0]):
                # Key change: set dtype explicitly
                return pd.Series(out_dict, dtype="float64")

            for _, row in df.iterrows():
                choice_value = row["value"]
                col = (
                    row["trait"]
                    + "_d"
                    + apply_row.index[0]
                    + "-"
                    + choice_value
                )

                selected_apply_row = apply_row[apply_row == choice_value]
                yes_mask = (df["value"] == row["value"]) & (
                    df["token"].str.contains('_"yes"')
                )
                not_mask = (df["value"] == row["value"]) & (
                    df["token"].str.contains('_"not"')
                )

                if selected_apply_row.empty:
                    if not df[not_mask].empty:
                        out_dict[col] = float(df[not_mask]["token_id"].iat[0])
                    else:
                        out_dict[col] = np.nan
                else:
                    if not df[yes_mask].empty:
                        out_dict[col] = float(df[yes_mask]["token_id"].iat[0])
                    else:
                        out_dict[col] = np.nan

    # ========== 2.2 Other value types (11, 21, 31) ==========
    else:
        col = df["trait"].iat[0] + "_d" + apply_row.index[0]
        if len(array_values) == 0:
            out_dict[col] = np.nan
        else:
            matched_value_df = df[df["value"].isin(array_values)]

            if matched_value_df.empty:
                # Q1–Q4 quantile encoding
                if "Q1" in df["value"].values:
                    try:
                        numeric_vals = pd.to_numeric(apply_row.dropna(), errors="coerce")
                        numeric_vals = numeric_vals.dropna()
                        if len(numeric_vals) == 0:
                            out_dict[col] = np.nan
                        else:
                            value = float(numeric_vals.mean())
                            for q in ["Q1", "Q2", "Q3", "Q4"]:
                                q_df = df[df["value"] == q]
                                if q_df.empty:
                                    continue
                                boundary = float(q_df["meaning"].iat[0])
                                if value <= boundary:
                                    out_dict[col] = float(q_df["token_id"].iat[0])
                                    break
                            else:
                                q4_df = df[df["value"] == "Q4"]
                                out_dict[col] = (
                                    float(q4_df["token_id"].iat[0]) if not q4_df.empty else np.nan
                                )
                    except Exception:
                        out_dict[col] = np.nan
                else:
                    out_dict[col] = np.nan
            else:
                out_dict[col] = float(matched_value_df["token_id"].iat[0])

    # Key change: always set dtype explicitly to avoid a FutureWarning
    return pd.Series(out_dict, dtype="float64")


# ========== 3. Main loop: read in chunks and write out ==========
print("=== Step 3: Tokenise UKB CSV into ukbFound input tokens ===")

missing_field = set()
total_rows = 0
n_chunks = 0

reader = pd.read_csv(
    file_path,
    chunksize=100000,
    encoding="latin1",
    low_memory=False,
    quotechar='"',
)

for chunk_index, chunk_df in enumerate(reader):
    n_chunks += 1
    n_rows_chunk = chunk_df.shape[0]
    total_rows += n_rows_chunk

    print(f"\n--- Processing chunk {chunk_index + 1} ---")
    print(f"  Rows in this chunk: {n_rows_chunk}")
    print(f"  Columns in raw chunk: {chunk_df.shape[1]}")

    out_chunk_df = pd.DataFrame({"eid": chunk_df["eid"]})

    for row_index, row in enumerate(only_traits_df.itertuples(), 1):
        instance_min = row.instance_min
        field_id = str(row.field_id)
        value_type = row.value_type

        col_names = [
            f"{field_id}-{instance_min}.{array_idx}"
            for array_idx in range(row.array_min, row.array_max + 1)
        ]

        try:
            selected_columns_chunk_df = chunk_df[col_names]
        except KeyError:
            missing_field.add(f"{field_id}-{instance_min}")
            continue

        if selected_columns_chunk_df.empty:
            continue

        token_trait_df = selected_columns_chunk_df.apply(
            check_row,
            axis=1,
            args=(traits_df[traits_df["field_id"] == field_id],),
        )
        out_chunk_df = pd.concat(
            [out_chunk_df, token_trait_df],
            axis=1,
            ignore_index=False,
        )

        if row_index % 100 == 0 or row_index == only_traits_df.shape[0]:
            print(
                f"  [chunk {chunk_index + 1}] processed traits "
                f"{row_index}/{only_traits_df.shape[0]}",
                end="\r",
            )

    print()
    print(f"  Finished traits for chunk {chunk_index + 1}.")
    print(f"  Output columns in this chunk (including eid): {out_chunk_df.shape[1]}")

    if chunk_index == 0:
        out_chunk_df.to_csv(output_file, mode="w", header=True, index=False)
    else:
        out_chunk_df.to_csv(output_file, mode="a", header=False, index=False)

    print(f"  Chunk {chunk_index + 1} written to disk.")

print("\n=== Step 4: Final summary ===")
print(f"Total chunks processed : {n_chunks}")
print(f"Total participants (rows) processed: {total_rows}")
print(f"Tokenised data written to: {output_file}")

if len(missing_field) > 0:
    missing_sorted = sorted(missing_field)
    pd.DataFrame({"missing_field_id": missing_sorted}).to_csv(
        missing_field_file, mode="w", header=True, index=False
    )
    print(f"Number of (field_id-instance) not found in input CSV: {len(missing_sorted)}")
    print("Example missing entries (up to 10):")
    for m in missing_sorted[:10]:
        print(f"  - {m}")
    print(f"Full list saved to: {missing_field_file}")
else:
    print("All requested (field_id, instance_min) were found in the input CSV.")


HERE    : /data/hongqy/dev/ukbFound/examples
ROOT    : /data/hongqy/dev/ukbFound
DATA_DIR: /data/hongqy/dev/ukbFound/data
=== Step 1: Load trait vocabulary (ukb_traits.csv) ===
Loaded traits_df with 44285 rows and 38 columns.
Input UKB file   : /data/hongqy/dev/ukbFound/data/demo_UKB_data.csv
Output token file: /data/hongqy/dev/ukbFound/data/demo_UKB_tokens.csv
Missing-field log: /data/hongqy/dev/ukbFound/data/missing_field_id.csv
Converting selected columns in traits_df to string representation ...

=== Step 2: Summary of non-private traits used for tokenisation ===
Unique non-private (field_id, instance_min, array_range) entries: 2257
Value_type distribution among non-private traits:
11     236
21     859
22      74
31    1088
Name: value_type, dtype: int64

=== Step 3: Tokenise UKB CSV into ukbFound input tokens ===

--- Processing chunk 1 ---
  Rows in this chunk: 99
  Columns in raw chunk: 18831
  [chunk 1] processed traits 2257/2257
  Finished traits for chunk 1.
  Output columns