<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/35887/logos/header.png?t=2022-05-09-22-33-02">

<h1><center>[1/3] AI4Code TensorFlow TPU with CodeBert - Data Preparation</center></h1>

This is the first part of my **AI4Code TensorFlow TPU with CodeBert** series:

* **[1/3] Data Preparation ← (you're here)**
* [2/3] [TPU Training][1] (~4 hours)
* [3/3] [GPU Inference][2] (~2 hours)

This is essentially a port of **[Khoi Nguyen's][3]** work [[1][4], [2][5]] from PyTorch to TensorFlow, with minor changes and updates for TPU support. The **[original][4]** PyTorch version takes up to 40 hours per epoch on a Kaggle GPU, whereas **[my version][1]** takes only 50 minutes per epoch on a Kaggle TPU, so it's lightning fast ⚡.

Outputs of this notebook are already saved to the dataset **[AI4Code CodeBert Tokens][6]** so feel free to skip this part unless you need to customize it!

### About Solution

- Input data: markdown + code context (512 tokens) + features
    - Markdown (up to 64 tokens)
    - Code context (all code cells if the notebook has at most 20, otherwise 20 sampled code cells; up to 23 tokens each)
    - Features: markdown cells to total cells ratio (appended to backbone outputs)
- Model and hyperparameters
    - CodeBert Base model
    - L1 loss (MAE)
    - AdamW optimizer
    - Learning rate schedule with warmup and linear decay
    - Total 5 epochs
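
As a quick sanity check on the input layout above, the token budget can be worked out by hand. The numbers mirror this notebook's settings (64 markdown tokens, up to 20 code cells of 23 tokens each, 512 tokens in total). The names `CODE_CELLS` and `CODE_MAX_LEN` are illustrative and are not defined in the notebook. The sketch also assumes, as the tokenization code in this notebook does, that the last token of every encoded code cell is dropped before concatenation.

```python
MD_MAX_LEN = 64       # markdown tokens, as in this notebook
TOTAL_MAX_LEN = 512   # full sequence length, as in this notebook
CODE_CELLS = 20       # illustrative name: maximum number of sampled code cells
CODE_MAX_LEN = 23     # illustrative name: tokens per encoded code cell

# Each code cell contributes one token less than its encoded length.
tokens_per_code_cell = CODE_MAX_LEN - 1
used = MD_MAX_LEN + CODE_CELLS * tokens_per_code_cell
tail_padding = TOTAL_MAX_LEN - used

print(used, tail_padding)  # 504 8
```

So a notebook with 20 or more code cells fills 504 positions, and the remaining 8 are padding. Notebooks with fewer code cells leave a longer padded tail.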

### Warning

This notebook processes all data only when run via **Save & Run All (Commit)**; in an interactive session it processes only the first 1,000 notebooks. This behaviour is bound to Kaggle environment variables. To process all data on Google Colab or your local machine, explicitly set the `LIMIT` hyperparameter to `None`.
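
One way such a switch could look is sketched below. It is an assumption, not necessarily how this notebook does it: it relies on Kaggle exposing a `KAGGLE_KERNEL_RUN_TYPE` environment variable whose value is `"Interactive"` in an interactive session, and `resolve_limit` is a hypothetical helper name.

```python
import os
from typing import Optional


def resolve_limit(interactive_limit: int = 1_000) -> Optional[int]:
    """Return a notebook limit for interactive Kaggle sessions, else None."""
    # Assumption: Kaggle sets KAGGLE_KERNEL_RUN_TYPE to "Interactive"
    # while editing and to another value for committed runs.
    run_type = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")
    return interactive_limit if run_type == "Interactive" else None


LIMIT = resolve_limit()
```

With this sketch, the variable is unset outside Kaggle, so `LIMIT` falls back to `None` and all data is processed.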

[1]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-training
[2]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-inference
[3]: https://www.kaggle.com/suicaokhoailang
[4]: https://github.com/suicao/ai4code-baseline/tree/main/code
[5]: https://www.kaggle.com/code/suicaokhoailang/stronger-baseline-with-code-cells
[6]: https://www.kaggle.com/datasets/nickuzmenkov/ai4code-codebert-tokens

# Setup

In [1]:
!mkdir 'raw' 'tfrec'

In [2]:
import glob
import json
import os
from typing import List

import numpy as np
import pandas as pd
import tensorflow as tf
import transformers
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle
from tqdm.notebook import tqdm

In [3]:
RANDOM_STATE = 42
MD_MAX_LEN = 64
TOTAL_MAX_LEN = 512
K_FOLDS = 5
FILES_PER_FOLD = 16
LIMIT = None
MODEL_NAME = "microsoft/codebert-base"
TOKENIZER = transformers.AutoTokenizer.from_pretrained(MODEL_NAME)
INPUT_PATH = "../input/AI4Code"

In [4]:
def read_notebook(path: str) -> pd.DataFrame:
    return (
        pd.read_json(path, dtype={"cell_type": "category", "source": "str"})
        .assign(id=os.path.basename(path).split(".")[0])
        .rename_axis("cell_id")
    )


def clean_code(cell: str) -> str:
    return str(cell).replace("\\n", "\n")


def sample_cells(cells: List[str], n: int) -> List[str]:
    cells = [clean_code(cell) for cell in cells]
    if n >= len(cells):
        return cells
    else:
        results = []
        step = len(cells) / n
        idx = 0
        while int(np.round(idx)) < len(cells):
            results.append(cells[int(np.round(idx))])
            idx += step
        if cells[-1] not in results:
            results[-1] = cells[-1]
        return results


def get_features(df: pd.DataFrame) -> dict:
    features = {}
    for i, sub_df in tqdm(df.groupby("id"), desc="Features"):
        features[i] = {}
        total_md = sub_df[sub_df.cell_type == "markdown"].shape[0]
        code_sub_df = sub_df[sub_df.cell_type == "code"]
        total_code = code_sub_df.shape[0]
        codes = sample_cells(code_sub_df.source.values, 20)
        features[i]["total_code"] = total_code
        features[i]["total_md"] = total_md
        features[i]["codes"] = codes
    return features


def tokenize(df: pd.DataFrame, fts: dict) -> dict:
    input_ids = np.zeros((len(df), TOTAL_MAX_LEN), dtype=np.int32)
    attention_mask = np.zeros((len(df), TOTAL_MAX_LEN), dtype=np.int32)
    features = np.zeros((len(df),), dtype=np.float32)
    labels = np.zeros((len(df),), dtype=np.float32)

    for i, row in tqdm(
        df.reset_index(drop=True).iterrows(), desc="Tokens", total=len(df)
    ):
        row_fts = fts[row.id]

        inputs = TOKENIZER.encode_plus(
            row.source,
            None,
            add_special_tokens=True,
            max_length=MD_MAX_LEN,
            padding="max_length",
            return_token_type_ids=True,
            truncation=True,
        )
        code_inputs = TOKENIZER.batch_encode_plus(
            [str(x) for x in row_fts["codes"]] or [""],
            add_special_tokens=True,
            max_length=23,
            padding="max_length",
            truncation=True,
        )

        ids = inputs["input_ids"]
        for x in code_inputs["input_ids"]:
            ids.extend(x[:-1])
        ids = ids[:TOTAL_MAX_LEN]
        if len(ids) != TOTAL_MAX_LEN:
            ids = ids + [
                TOKENIZER.pad_token_id,
            ] * (TOTAL_MAX_LEN - len(ids))

        mask = inputs["attention_mask"]
        for x in code_inputs["attention_mask"]:
            mask.extend(x[:-1])
        mask = mask[:TOTAL_MAX_LEN]
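        if len(mask) != TOTAL_MAX_LEN:
            # Pad the attention mask with 0 (not attended). Using pad_token_id
            # here would mark padding as real tokens, since it is 1 for CodeBert.
            mask = mask + [0] * (TOTAL_MAX_LEN - len(mask))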

        input_ids[i] = ids
        attention_mask[i] = mask
        features[i] = (
            row_fts["total_md"] / (row_fts["total_md"] + row_fts["total_code"]) or 1
        )
        labels[i] = row.pct_rank

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "features": features,
        "labels": labels,
    }


def get_ranks(base: List[str], derived: List[str]) -> List[int]:
    # Position of every cell from `derived` in the correct order `base`.
    return [base.index(d) for d in derived]


def _serialize_sample(
    input_ids: np.ndarray,
    attention_mask: np.ndarray,
    feature: np.float32,
    label: np.float32,
) -> bytes:
    feature = {
        "input_ids": tf.train.Feature(int64_list=tf.train.Int64List(value=input_ids)),
        "attention_mask": tf.train.Feature(
            int64_list=tf.train.Int64List(value=attention_mask)
        ),
        "feature": tf.train.Feature(float_list=tf.train.FloatList(value=[feature])),
        "label": tf.train.Feature(float_list=tf.train.FloatList(value=[label])),
    }
    sample = tf.train.Example(features=tf.train.Features(feature=feature))
    return sample.SerializeToString()


def serialize(
    input_ids: np.ndarray,
    attention_mask: np.ndarray,
    features: np.ndarray,
    labels: np.ndarray,
    path: str,
) -> None:
    with tf.io.TFRecordWriter(path) as writer:
        for args in zip(input_ids, attention_mask, features, labels):
            writer.write(_serialize_sample(*args))
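
To see what the even sampling in `sample_cells` does, here is a small standalone sketch of the same logic without the cleaning step. It uses the built-in `round` instead of `np.round` so that it runs without NumPy (both round halves to the nearest even number); `sample_evenly` is a hypothetical name used only for this illustration.

```python
from typing import List


def sample_evenly(cells: List[str], n: int) -> List[str]:
    """Pick about n evenly spaced items, always ending with the last one."""
    if n >= len(cells):
        return list(cells)
    results = []
    step = len(cells) / n
    idx = 0.0
    while int(round(idx)) < len(cells):
        results.append(cells[int(round(idx))])
        idx += step
    # Make sure the final cell of the notebook is always part of the context.
    if cells[-1] not in results:
        results[-1] = cells[-1]
    return results


cells = [f"cell_{i}" for i in range(10)]
print(sample_evenly(cells, 4))            # ['cell_0', 'cell_2', 'cell_5', 'cell_9']
print(sample_evenly(cells, 20) == cells)  # True
```

When the notebook has no more cells than requested, every cell is kept; otherwise the picks are spread across the whole notebook and the last pick is swapped for the final cell.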

# Collect Data

In [None]:
paths = glob.glob(os.path.join(INPUT_PATH, "train", "*.json"))
if LIMIT is not None:
    paths = paths[:LIMIT]
df = (
    pd.concat([read_notebook(x) for x in tqdm(paths, desc="Concat")])
    .set_index("id", append=True)
    .swaplevel()
    .sort_index(level="id", sort_remaining=False)
)

df_orders = (
    pd.read_csv(
        os.path.join(INPUT_PATH, "train_orders.csv"),
        index_col="id",
    )
    .squeeze("columns")  # read_csv(squeeze=True) was removed in pandas 2.0
    .str.split()
)
df_orders_ = df_orders.to_frame().join(
    df.reset_index("cell_id").groupby("id")["cell_id"].apply(list),
    how="right",
)

ranks = {}
for id_, cell_order, cell_id in df_orders_.itertuples():
    ranks[id_] = {"cell_id": cell_id, "rank": get_ranks(cell_order, cell_id)}
df_ranks = (
    pd.DataFrame.from_dict(ranks, orient="index")
    .rename_axis("id")
    .apply(pd.Series.explode)
    .set_index("cell_id", append=True)
)

df_ancestors = pd.read_csv(
    os.path.join(INPUT_PATH, "train_ancestors.csv"), index_col="id"
)
df = (
    df.reset_index()
    .merge(df_ranks, on=["id", "cell_id"])
    .merge(df_ancestors, on=["id"])
)

df["pct_rank"] = df["rank"] / df.groupby("id")["cell_id"].transform("count")
df = df.sort_values("pct_rank").reset_index(drop=True)

features = get_features(df)

df = df[df["cell_type"] == "markdown"]
df = df.drop(["rank", "parent_id", "cell_type"], axis=1).dropna()
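
The label computed above, `pct_rank`, is the position of a cell in the correct order divided by the number of cells in the notebook. A tiny pure-Python sketch with hypothetical cell IDs shows the idea; it repeats `get_ranks` so that it runs on its own.

```python
from typing import List


def get_ranks(base: List[str], derived: List[str]) -> List[int]:
    # Position of every cell from `derived` in the correct order `base`.
    return [base.index(d) for d in derived]


cell_order = ["a1", "b2", "c3", "d4"]  # correct order (hypothetical IDs)
cell_ids = ["a1", "c3", "d4", "b2"]    # order stored in the notebook file (hypothetical)

ranks = get_ranks(cell_order, cell_ids)
pct_ranks = [rank / len(cell_ids) for rank in ranks]

print(ranks)      # [0, 2, 3, 1]
print(pct_ranks)  # [0.0, 0.5, 0.75, 0.25]
```

Because the rank starts at 0, the resulting labels always fall in the range from 0 (inclusive) to 1 (exclusive).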

# Make Tokens & Save

In [None]:
df.to_csv("data.csv")
with open("features.json", "w") as file:
    json.dump(features, file)

In [None]:
df = shuffle(df, random_state=RANDOM_STATE)

for fold, (_, split) in enumerate(
    GroupKFold(K_FOLDS).split(df, groups=df["ancestor_id"])
):
    print("=" * 36, f"Fold {fold}", "=" * 36)

    # Tokenize first: the split below depends on data["features"].
    data = tokenize(df.iloc[split], features)

    np.savez_compressed(
        f"raw/{fold}.npz",
        input_ids=data["input_ids"],
        attention_mask=data["attention_mask"],
        features=data["features"],
        labels=data["labels"],
    )

    # Split samples by markdown ratio: <= 0.5 goes to tfrec/, > 0.5 to tfrec2/.
    low = data["features"] <= 0.5
    for fold_dir, mask in ((f"tfrec/{fold}", low), (f"tfrec2/{fold}", ~low)):
        os.makedirs(fold_dir, exist_ok=True)
        subset = {key: value[mask] for key, value in data.items()}

        for shard, index in tqdm(
            enumerate(np.array_split(np.arange(subset["labels"].shape[0]), FILES_PER_FOLD)),
            desc=f"Saving {fold_dir}",
            total=FILES_PER_FOLD,
        ):
            serialize(
                input_ids=subset["input_ids"][index],
                attention_mask=subset["attention_mask"][index],
                features=subset["features"][index],
                labels=subset["labels"][index],
                path=os.path.join(fold_dir, f"{shard:02d}-{len(index):06d}.tfrec"),
            )
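
Each shard is named `<shard index>-<sample count>.tfrec`, so the number of samples can be recovered from file names alone, without reading the records. The sketch below shows one way to do that; `count_samples` is a hypothetical helper and the paths are made up for illustration. Whether the training notebook uses the counts this way is an assumption.

```python
import os
import re
from typing import List


def count_samples(paths: List[str]) -> int:
    """Sum the sample counts encoded in shard file names."""
    total = 0
    for path in paths:
        match = re.fullmatch(r"(\d+)-(\d+)\.tfrec", os.path.basename(path))
        if match is None:
            raise ValueError(f"Unexpected shard name: {path}")
        total += int(match.group(2))
    return total


# Hypothetical shard paths following the naming scheme above.
paths = ["tfrec/0/00-001250.tfrec", "tfrec/0/01-001249.tfrec"]
print(count_samples(paths))  # 2499
```

A count like this is typically what a training loop needs to derive the number of steps per epoch.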

# Next Steps

Go to the results dataset **[AI4Code CodeBert Tokens][3]** or continue exploring:

* <span style="color:lightgray">[1/3] Data Preparation ← (you're here)</span>
* [2/3] [TPU Training][1] (~4 hours)
* [3/3] [GPU Inference][2] (~2 hours)


[1]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-training
[2]: https://www.kaggle.com/nickuzmenkov/ai4code-tf-tpu-codebert-inference
[3]: https://www.kaggle.com/datasets/nickuzmenkov/ai4code-codebert-tokens