## Setup
From a large dataset covering the complete update history of every app active on the Apple App Store between 2018 and 2021, we randomly sampled 4,000 update descriptions to form a pool for creating training and validation datasets. To ensure consistent analysis, we retained only update descriptions written in English. Each update description consists of a release note and the corresponding version number. On the App Store, release notes are displayed in a section titled “What’s New” and are limited to 4,000 characters. Since 2018, developers have been required to specify update details and avoid generic text.
To establish ground truth for the sample pool, a member of the research team and a research assistant coded the release notes according to a coding scheme that the research team developed through inductive coding. The scheme comprises seven content dimensions (classes) and was validated in interviews with software application developers (detailed in Section 4.1.2). A subset of n = 1,000 update descriptions was independently coded by both raters to assess inter-rater reliability (observed agreement = 0.91, κ = 0.87). Discrepancies were discussed and resolved so that coding remained consistent across the remaining sample pool.
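
For readers unfamiliar with the two reliability figures, the following minimal sketch shows how observed agreement and Cohen's κ can be computed for two raters. The function name `agreement_and_kappa` and the toy labels are illustrative assumptions; they are neither the study data nor the research team's code.

```python
from collections import Counter

def agreement_and_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters chose the same class
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: sum over classes of the product of both raters' marginal shares
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    # Degenerate case (both raters used one identical class throughout):
    # kappa is undefined; returning 1.0 here is a convention chosen for this sketch
    if p_e == 1.0:
        return p_o, 1.0

    return p_o, (p_o - p_e) / (1 - p_e)

# Hypothetical labels for ten release notes (classes 1-7)
labels_a = [1, 1, 2, 2, 3, 3, 6, 6, 6, 6]
labels_b = [1, 1, 2, 3, 3, 3, 6, 6, 6, 4]
p_o, kappa = agreement_and_kappa(labels_a, labels_b)
print(f"observed agreement = {p_o:.2f}, kappa = {kappa:.2f}")
```

κ is lower than the raw agreement because it discounts the matches that two raters would produce by chance given how often each of them uses each class.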

#### Imports
See `requirements.txt` for the full dependency versions.

In [None]:
import os
import json
import math
import pandas as pd
import glob

from collections import defaultdict
from sklearn.model_selection import train_test_split

#### Global Paths, Directories and Variables

In [None]:
# Project root: parent of the current working directory
DEMO_PATH = os.path.abspath("..")

# Data and LLM API directories
DATA_DIR = os.path.join(DEMO_PATH, 'training_validation_data')
LLM_API_FOLDER = os.path.join(DEMO_PATH, 'LLM_API')

# Reproducibility
RANDOM_STATE = 94032

In [None]:
# Helper function for class distribution
def print_dist(df_obj, name, label_col='update_classification'):
    """
    Print the count and percentage distribution of the classes in `label_col` for the given DataFrame.
    """
    dist = df_obj[label_col].value_counts().sort_index()
    pct  = (dist / len(df_obj) * 100).round(2)
    dist_df = pd.DataFrame({'count': dist, 'pct': pct})
    print(f"{name} distribution (n={len(df_obj)}):")
    print(dist_df)

## Data Pool

We load the manually labeled dataset from the CSV file located in `DATA_DIR`. The dataset includes release notes (`whats_new`), their IDs, dates, and update classifications.


In [None]:
# Load the manually labeled data pool used to build the training and validation splits
# for model fine-tuning and performance evaluation
df = pd.read_csv(
    os.path.join(DATA_DIR, "demo_app_updates_full_coded_4000.csv"),
    header=0,
    dtype={'id': str, 'whats_new': str, 'update_classification': int},
    parse_dates=['release_date'],
    low_memory=False
)

# Summary of original class distribution
print_dist(df, "Data Pool")

## Validation and Training Data Splits
We generate random subsamples from the labeled sample pool of n = 4,000 release notes: a validation set of n = 1,000 release notes that serves as a holdout, and training sets of n = 2,000, 1,000, 500, 250, and 100 release notes. Training subsamples are constructed in two variants: one with a class distribution similar to that of the overall pool (representative) and, where possible, one with an equally balanced class distribution.
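
Perfect balance is only possible while every class still has unused examples; once a rare class is exhausted, the remaining slots have to go to the other classes. The sketch below illustrates this with a capped round-robin allocation. The function `balanced_allocation` and the class counts are hypothetical and chosen only for illustration, and the sketch omits the minimum-per-class rule used later in this notebook.

```python
def balanced_allocation(available, size):
    """Distribute `size` slots as evenly as possible over classes,
    never exceeding the number of available examples per class."""
    if size > sum(available.values()):
        raise ValueError("Requested size exceeds the number of available examples.")

    alloc = {c: 0 for c in available}
    remaining = size
    # Hand out one slot per class per round, skipping exhausted classes
    while remaining > 0:
        for c in sorted(available):
            if remaining == 0:
                break
            if alloc[c] < available[c]:
                alloc[c] += 1
                remaining -= 1
    return alloc

# Hypothetical pool: class 3 is rare
pool_counts = {1: 40, 2: 25, 3: 5}

small = balanced_allocation(pool_counts, 12)   # every class can supply its share
large = balanced_allocation(pool_counts, 30)   # class 3 runs out after 5 examples
print(small)
print(large)
```

With the small target every class contributes the same number of examples, whereas with the large target the rare class is capped at its pool size and the other classes absorb the difference.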


In [None]:
# Define constants (here set based on data pool class distribution)
VAL_SIZE        = 1000  # Number of validation samples to hold out
MIN_PER_CLASS   = 3     # Minimum labeled texts per class when sampling training data
TRAINING_SIZES  = [2000, 1000, 500, 250, 100]  # Various training set sizes to generate

### Validation Set Creation

We create a validation set of size `VAL_SIZE`. The remaining data forms the training pool.

In [None]:
# Create validation and training-remaining pools via a stratified split,
# so that each class is represented proportionally in the validation set
df_train_remaining, df_val_final = train_test_split(
    df,
    test_size=VAL_SIZE,
    stratify=df['update_classification'],
    random_state=RANDOM_STATE
)
print_dist(df_val_final, "Validation set")
print_dist(df_train_remaining, "Training-remaining pool")

### Training Set Creation

We generate training sets of the sizes specified in `TRAINING_SIZES`, each in a representative (real-world) and a balanced (equal-distribution) variant.


#### Sampling Functions

We define helper functions to sample training subsets that reflect either the representative (real-world) class distribution or a balanced (equal) class distribution, with at least `MIN_PER_CLASS` samples per class.
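
Proportional quotas are rarely whole numbers, so they have to be rounded without changing the total. One common approach is the largest-remainder method: round every quota down, then give the leftover slots to the classes with the largest fractional parts. The sketch below applies it to hypothetical class counts; the helper `proportional_allocation` is illustrative only and leaves out the minimum-per-class clamp.

```python
import math

def proportional_allocation(counts, size):
    """Allocate `size` slots proportionally to `counts` using largest remainders."""
    total = sum(counts.values())

    # Ideal (fractional) quota per class and its rounded-down value
    ideal = {c: n * size / total for c, n in counts.items()}
    alloc = {c: math.floor(q) for c, q in ideal.items()}

    # Hand the leftover slots to the classes with the largest fractional parts
    leftover = size - sum(alloc.values())
    by_fraction = sorted(counts, key=lambda c: ideal[c] - alloc[c], reverse=True)
    for c in by_fraction[:leftover]:
        alloc[c] += 1
    return alloc

# Hypothetical pool of 1,000 labeled texts in four classes
pool_counts = {1: 620, 2: 95, 3: 240, 4: 45}
alloc = proportional_allocation(pool_counts, 250)
print(alloc)
```

Here the quotas are 155, 23.75, 60, and 11.25, so rounding down leaves one slot, which goes to class 2 because it has the largest fractional part.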


In [None]:
def sample_real_world_clamp(df_pool, size, min_per_class=MIN_PER_CLASS, random_state=RANDOM_STATE):
    """
    Sample `size` rows reflecting df_pool's class frequencies,
    ensuring at least `min_per_class` per class and exactly `size` total.
    """
    counts = df_pool['update_classification'].value_counts()
    total = counts.sum()

    # Ideal allocations based on original frequencies
    ideal = {c: counts[c] * size / total for c in counts.index}
    alloc = {c: math.floor(ideal[c]) for c in counts.index}
    frac = {c: ideal[c] - alloc[c] for c in counts.index}

    # Distribute remaining slots by largest fractional parts
    remaining = size - sum(alloc.values())
    for c in sorted(frac, key=frac.get, reverse=True)[:remaining]:
        alloc[c] += 1

    # Enforce minimum and clamp to pool availability
    for c in alloc:
        avail = (df_pool['update_classification'] == c).sum()
        alloc[c] = min(max(alloc[c], min_per_class), avail)

    # If size exceeded due to min_per_class, remove surplus
    surplus = sum(alloc.values()) - size
    # Reduce classes above min_per_class, starting with those with highest original share
    order = sorted(counts.index, key=lambda c: counts[c], reverse=True)
    while surplus > 0:
        reducible = [c for c in order if alloc[c] > min_per_class]
        if not reducible:
            break  # Nothing left to reduce without violating min_per_class
        for c in reducible:
            if surplus == 0:
                break
            alloc[c] -= 1
            surplus -= 1

    # Sample and shuffle
    dfs = [
        df_pool[df_pool['update_classification'] == c]
               .sample(n=alloc[c], random_state=random_state)
        for c in alloc
    ]
    return pd.concat(dfs).sample(frac=1, random_state=random_state).reset_index(drop=True)

def sample_equal_clamp(df_pool, size, min_per_class=MIN_PER_CLASS, random_state=RANDOM_STATE):
    """
    Sample `size` rows with equal class counts,
    ensuring at least `min_per_class` per class and exactly `size` total.
    """
    classes = df_pool['update_classification'].unique()
    avail = df_pool['update_classification'].value_counts()
    if size > len(df_pool):
        raise ValueError(f"Requested size {size} exceeds pool size {len(df_pool)}.")

    # Start from the minimum per class, clamped to pool availability
    alloc = {c: min(min_per_class, avail[c]) for c in classes}
    remaining = size - sum(alloc.values())

    # Distribute remaining slots round-robin until size reached
    idx = 0
    cls_list = list(classes)
    while remaining > 0:
        c = cls_list[idx % len(cls_list)]
        if alloc[c] < avail[c]:
            alloc[c] += 1
            remaining -= 1
        idx += 1

    # Sample and shuffle
    dfs = [
        df_pool[df_pool['update_classification'] == c]
               .sample(n=alloc[c], random_state=random_state)
        for c in alloc
    ]
    return pd.concat(dfs).sample(frac=1, random_state=random_state).reset_index(drop=True)

#### Generate Training Sets

In [None]:
# Initialize containers for training sets
d_training_real = {}
d_training_equal = {}

# For each specified training set size, create two pools (clamped to minimum per class)
for size in TRAINING_SIZES:
    df_real = sample_real_world_clamp(df_train_remaining, size, min_per_class=MIN_PER_CLASS, random_state=RANDOM_STATE)
    df_equal = sample_equal_clamp(df_train_remaining, size, min_per_class=MIN_PER_CLASS, random_state=RANDOM_STATE)

    d_training_real[size] = df_real
    d_training_equal[size] = df_equal

    print_dist(df_real, f"Real-world training (n={size})")
    print_dist(df_equal, f"Equal-dist training (n={size})")

### Save Splits

We save the validation and training splits to `DATA_DIR`.

In [None]:
# Save validation set
val_path = os.path.join(DATA_DIR, "demo_app_updates_validation_real_1000.csv")
df_val_final.to_csv(val_path, sep=",", index=False)

# Save representative training sets
for size, df_tr in d_training_real.items():
    path = os.path.join(DATA_DIR, f"demo_app_updates_train_real_{size}.csv")
    df_tr.to_csv(path, sep=",", index=False)

# Save balanced training sets
for size, df_tr in d_training_equal.items():
    path = os.path.join(DATA_DIR, f"demo_app_updates_train_equal_{size}.csv")
    df_tr.to_csv(path, sep=",", index=False)

## Data Preparation for Fine-Tuning
We prepare JSONL files for the LLM APIs (OpenAI, Mistral AI) using the prompt from Section 4.1.2 of the paper. Both APIs require fine-tuning inputs in `.jsonl` format, and the same files can be used with either API. Afterward, we validate each generated JSONL file to confirm that it matches the expected format.
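
As a quick illustration of the target format, each line of a `.jsonl` file is one standalone JSON object holding a `messages` list with a user prompt and the assistant's label. The sketch below builds one such line and parses it back; the helper `make_record`, the shortened prompt, and the example release note are hypothetical and not taken from the dataset.

```python
import json

def make_record(prompt, update_text, label):
    """Build one chat-style training example as a Python dict."""
    user_content = f"{prompt}\n\nApp update text: {update_text}"
    return {
        "messages": [
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": str(label)},
        ]
    }

# Hypothetical example: a generic release note labeled as class 6
record = make_record(
    "Determine the category (1-7) for the following app update text.",
    "Bug fixes and performance improvements.",
    6,
)

# json.dumps escapes the newlines inside the prompt, so the record stays on one line
line = json.dumps(record)
parsed = json.loads(line)
print(line)
```

Storing the label as a string matters because chat-style message content is text, so the integer class is converted with `str` before it is written.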

In [None]:
# Define the prompt
PROMPT_TEMPLATE = """ONLY provide the category number (1-7) in response. Determine the category for the following app update text.
If multiple categories seem applicable, always choose the lowest category number (1<2<3<4<5<6<7):

(1) Novel features - Introducing or enhancing significant functionalities that modify user experience
(2) Content extensions – Adding or expanding curated or user-facing content
(3) Platform and device support – Enabling compatibility with new OS versions, development kits, device types, or hardware features
(4) Specific fixes of bugs - Addressing distinct known issues
(5) Privacy & Security - Strengthening user data protection, permissions, or security protocols
(6) General Improvements – Tweaks addressing undisclosed bug fixes,  performance improvements, minor changes, or UI/UX polish
(7) Marketing & Branding – Changes purely about promotional or visual branding assets
"""

In [None]:
# Helper function to create jsonl files for datasets in DATA_DIR
def csv_to_jsonl(csv_path: str, jsonl_path: str, prompt_template: str) -> None:
    """
    Read a CSV, build prompts, and write out a JSONL file
    where each line is:
      {
        "messages": [
          {"role":"user",      "content": <prompt>},
          {"role":"assistant", "content": <label>}
        ]
      }
    """
    # Read CSV
    df_in = pd.read_csv(
        csv_path,
        dtype={'id': str, 'whats_new': str, 'update_classification': int},
        parse_dates=['release_date'],
        low_memory=False
    )

    # Build prompt column (missing release notes become empty strings instead of NaN)
    df_in['prompt'] = prompt_template + "\n\nApp update text: " + df_in['whats_new'].fillna('')

    # Assemble JSONL entries
    with open(jsonl_path, 'w', encoding='utf-8') as fout:
        for _, row in df_in.iterrows():
            record = {
                "messages": [
                    {"role": "user",      "content": row.prompt},
                    {"role": "assistant", "content": str(row.update_classification)}
                ]
            }
            fout.write(json.dumps(record) + '\n')


# Make sure the output folder exists, then loop over every training CSV in DATA_DIR
os.makedirs(LLM_API_FOLDER, exist_ok=True)
pattern = os.path.join(DATA_DIR, "demo_app_updates_train_*.csv")
for csv_path in sorted(glob.glob(pattern)):
    base = os.path.splitext(os.path.basename(csv_path))[0]
    jsonl_name = f"{base}.jsonl"
    jsonl_path = os.path.join(LLM_API_FOLDER, jsonl_name)

    print(f"Converting {os.path.basename(csv_path)} → {jsonl_name}")
    csv_to_jsonl(csv_path, jsonl_path, PROMPT_TEMPLATE)

print("All files processed.")

In [None]:
# Helper function to validate jsonl files for API usage
def check_jsonl_file(jsonl_path: str) -> None:
    """
    Load a JSONL file, print one example, and report any format errors.
    """
    # Load all non-empty lines into Python objects
    with open(jsonl_path, 'r', encoding='utf-8') as f:
        dataset = [json.loads(line) for line in f if line.strip()]

    # Print header and one example
    print(f"Checking {os.path.basename(jsonl_path)}")
    print("Num examples:", len(dataset))
    if dataset:
        print("Example[0] messages:")
        for msg in dataset[0].get("messages", []):
            print(f"  {msg}")

    # Initialize error counters
    format_errors = defaultdict(int)

    # Validate each example
    for ex in dataset:
        if not isinstance(ex, dict):
            format_errors["data_type"] += 1
            continue

        messages = ex.get("messages")
        if not isinstance(messages, list) or not messages:
            format_errors["missing_or_empty_messages_list"] += 1
            continue

        # Validate each message in the list
        for message in messages:
            # Required keys: role, content
            if "role" not in message or "content" not in message:
                format_errors["message_missing_key"] += 1

            # No unexpected keys
            if any(k not in ("role", "content", "name") for k in message):
                format_errors["message_unrecognized_key"] += 1

            # Role must be one of the allowed set
            role = message.get("role")
            if role not in ("system", "user", "assistant"):
                format_errors["unrecognized_role"] += 1

            # Content must be a nonempty string
            content = message.get("content")
            if not isinstance(content, str) or content.strip() == "":
                format_errors["missing_or_invalid_content"] += 1

        # Ensure there is at least one assistant response
        if not any(m.get("role") == "assistant" for m in messages):
            format_errors["missing_assistant_message"] += 1

    # Print summary of any errors found
    if format_errors:
        print("Found format errors:")
        for err, count in format_errors.items():
            print(f"  {err}: {count}")
    else:
        print("No errors found.")

# Loop over all .jsonl files in the directory (in sorted order) and run checks
for filename in sorted(os.listdir(LLM_API_FOLDER)):
    if filename.lower().endswith('.jsonl'):
        check_jsonl_file(os.path.join(LLM_API_FOLDER, filename))