## Setup
To replicate prior work, we incorporated the preprocessing steps, dictionaries and filters described in Kircher & Foerderer (2023) and Agarwal & Kapoor (2023), and we applied version logic similar to that used in Wen & Zhu (2019).

(1) Kircher & Foerderer (2023) classify an update as innovative (feature update) if the release note contains at least one of the keywords “new”, “added”, “upgrade”, or “major”.

(2) Agarwal & Kapoor (2023) classify an update as innovative (new functionality) if the release note exceeds 200 characters or contains at least one of the keywords “introduce”, “feature”, “support”, “performance”, “improve”, “enable”, “update”, “enhance”, “modify”, “optimize”, “fast”, “adjust”, or “multitask”. To account for differences in text representation and ensure consistent keyword matching, we applied several preprocessing steps: we converted each release note to lowercase, tokenized it into individual words, and reduced each word to its stem. The keywords in each dictionary were stemmed in the same way.

(3) In Wen & Zhu (2019), the baseline treats every update release as an innovative update. A version-number rule is applied in robustness checks, classifying an update as innovative (major update) if the <major> field in the app’s version number (first digit) changed. When replicating this approach, we additionally controlled for changes in the <minor> field in the app’s version number (second digit).
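Rule (2) and the preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `tokenize`, `is_innovative_ak23`, and the shortened `AK23_KEYWORDS` list are hypothetical names, the regex tokenizer stands in for a proper word tokenizer, and the stemmer is passed in as a callable (identity by default) so the sketch runs without extra dependencies.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical subset of the AK23 innovation keywords, for illustration only.
AK23_KEYWORDS = ["introduce", "feature", "support", "improve", "optimize"]
LENGTH_THRESHOLD = 200  # characters, as stated in rule (2)


def tokenize(note: str) -> List[str]:
    """Lowercase a note and split it into alphabetic tokens (regex stand-in)."""
    return re.findall(r"[a-z]+", note.lower())


def is_innovative_ak23(note: str,
                       keywords: Iterable[str] = AK23_KEYWORDS,
                       stem: Callable[[str], str] = lambda w: w) -> bool:
    """Return True if the note exceeds the length threshold or matches a keyword."""
    if len(note) > LENGTH_THRESHOLD:
        return True
    # Stem keywords and tokens with the same function before comparing them.
    stemmed_keywords = {stem(k) for k in keywords}
    tokens = {stem(t) for t in tokenize(note)}
    return bool(tokens & stemmed_keywords)


print(is_innovative_ak23("We now support dark mode."))  # True (keyword "support")
print(is_innovative_ak23("Bug fixes."))                 # False
print(is_innovative_ak23("x" * 201))                    # True (length rule)
print(is_innovative_ak23("Improved loading."))          # False without stemming
```

The last call shows why stemming matters: with the identity default, “improved” does not match “improve”. Passing a real stemmer, for example `PorterStemmer().stem` from `nltk` as used in this notebook, is expected to reduce both words to a common stem so that they match.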

#### Imports
See `requirements.txt` for full dependency versions.

In [None]:
import re
import os
import pandas as pd

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

#### Global Paths, Directories, Variables, and Classifier Instances

In [None]:
# Define Demo Study path (parent of the current working directory)
DEMO_PATH = os.path.abspath("..")

# Define relevant paths
VAL_PATH   = os.path.join(DEMO_PATH,'training_validation_data', 'demo_app_updates_validation_real_1000.csv')
OUTPUT_DIR = os.path.join(DEMO_PATH,'output_data')

# Relevant column names
COL_VERSION      = 'version_display'
COL_PREV_VERSION = 'previous_version'
COL_WHATS_NEW    = 'whats_new'

# Dictionaries from literature
DICT_SOURCES = {
    'KF23': {
        'innovation': ["new", "added", "upgrade", "major"],
        'maintenance': ["bug", "minor", "crash", "error"],
    },
    'AK23': {
        'innovation': [
            "new", "introduce", "feature", "add", "support", "performance",
            "improve", "upgrade", "enable", "update", "enhance", "modify",
            "optimize", "fast", "adjust", "multitask"
        ],
        'maintenance': [
            "bug", "fix", "issue", "crash", "problem", "error", "glitch"
        ],
    },
}

# Preprocess stemmer & keywords once
ps = PorterStemmer()
STEMMED_KEYWORDS = {
    name: {
        'innovation': set(ps.stem(w) for w in kws['innovation']),
        'maintenance': set(ps.stem(w) for w in kws['maintenance'])
    }
    for name, kws in DICT_SOURCES.items()
}

In [None]:
# Load dataframe
df = pd.read_csv(VAL_PATH)

## Functions
This section includes the main functions used for classifying software updates: dictionary-based classification following Kircher & Foerderer (2023) and Agarwal & Kapoor (2023), and version-based classification following Wen & Zhu (2019).

### Dictionary-based classification
We use keyword matching within the release notes (`whats_new`) to categorize updates as "innovation" or "maintenance."
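The two labels are not symmetric: in the function defined in this section, an update is flagged as maintenance only if it was not already flagged as innovation. A minimal sketch of that precedence, assuming already tokenized notes and using the hypothetical helper `label_update` with small, unstemmed keyword sets chosen for illustration:

```python
from typing import Set, Tuple


def label_update(tokens: Set[str],
                 innovation_kw: Set[str],
                 maintenance_kw: Set[str]) -> Tuple[int, int]:
    """Return (innovation, maintenance) flags; innovation takes precedence."""
    is_innovation = bool(tokens & innovation_kw)
    # Maintenance requires a maintenance keyword AND no innovation match.
    is_maintenance = (not is_innovation) and bool(tokens & maintenance_kw)
    return int(is_innovation), int(is_maintenance)


# Hypothetical keyword sets, for illustration only.
INNOVATION = {"new", "added"}
MAINTENANCE = {"bug", "crash"}

print(label_update({"new", "bug", "fixes"}, INNOVATION, MAINTENANCE))  # (1, 0)
print(label_update({"bug", "fixes"}, INNOVATION, MAINTENANCE))         # (0, 1)
print(label_update({"thanks"}, INNOVATION, MAINTENANCE))               # (0, 0)
```

A note that matches neither dictionary receives `0` for both flags, so the two columns are mutually exclusive but not exhaustive.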

In [None]:
def preprocess_text(text: str):
    """Tokenize and stem a string of text."""
    if pd.isna(text):
        return []
    return [ps.stem(tok) for tok in word_tokenize(text.lower())]

def classify_whats_new(df: pd.DataFrame, source: str) -> pd.DataFrame:
    """
    Classifies each row's 'what's new' text as 'innovation' or 'maintenance'
    for a given source, based on stemmed keyword matching and length rules
    """
    stemmed = STEMMED_KEYWORDS[source]
    df['_toks'] = df[COL_WHATS_NEW].apply(preprocess_text)

    def classify(row):
        """
        Classifies a single row as innovation/maintenance based on:
        - If 'AK23', marks as innovation if text is >200 chars.
        - Otherwise, marks as innovation/maintenance if any stemmed keyword found.
        """
        tokens = set(row['_toks'])
        # Innovation if: (AK23 and long text) OR (any innovation keyword match)
        is_innov = (
            (source == 'AK23' and isinstance(row[COL_WHATS_NEW], str)
             and len(row[COL_WHATS_NEW]) > 200)
            or bool(tokens & stemmed['innovation'])
        )
        # Maintenance if not innovation AND at least one maintenance keyword match
        is_maint = not is_innov and bool(tokens & stemmed['maintenance'])
        return pd.Series({
            f'{source}_innovation':  int(is_innov),
            f'{source}_maintenance': int(is_maint),
        })
    # Classify each row and collect results in a new DataFrame
    out = df.apply(classify, axis=1)

    # Remove the temporary token column
    df.drop(columns=['_toks'], inplace=True)

    return pd.concat([df, out], axis=1)

### Version-based classification
We analyze version numbers to identify whether a release is a major or minor update, based on differences between the current (`version_display`) and previous version (`previous_version`).
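The per-release logic can be sketched without pandas as follows. This is a simplified illustration under stated assumptions, not the notebook's vectorized implementation: `parse_version` and `version_flags` are hypothetical names, and a missing previous version is represented by `None`. When a previous version exists, the first two fields are compared; otherwise the sketch falls back on the shape of the current version number.

```python
import re
from typing import List, Optional, Tuple


def parse_version(version: Optional[str]) -> List[int]:
    """Keep digits and periods, then split into integers: 'v3.2.0' -> [3, 2, 0]."""
    if not version:
        return []
    cleaned = re.sub(r"[^0-9.]", "", version)
    return [int(part) for part in cleaned.split(".") if part.isdigit()]


def version_flags(current: str, previous: Optional[str]) -> Tuple[int, int]:
    """Return (first_digit, second_digit) for one release."""
    cur, prev = parse_version(current), parse_version(previous)
    if prev:
        # Previous version known: compare the first two fields, padding with 0.
        cur_major, cur_minor = (cur + [0, 0])[:2]
        prev_major, prev_minor = (prev + [0, 0])[:2]
        first = cur_major != prev_major
        second = (not first) and cur_minor != prev_minor
    else:
        # No previous version: a "full" release (only zeros after the first
        # field) counts as major; any other multi-field version as minor.
        first = len(cur) >= 1 and all(p == 0 for p in cur[1:])
        second = (not first) and len(cur) >= 2
    return int(first), int(second)


print(version_flags("3.0.0", "2.9.4"))  # (1, 0)
print(version_flags("2.10", "2.9.4"))   # (0, 1)
print(version_flags("2.9.5", "2.9.4"))  # (0, 0)
print(version_flags("2.0", None))       # (1, 0)
print(version_flags("2.3", None))       # (0, 1)
```

Fields are compared as integers rather than strings, so `2.10` is treated as a change from `2.9` in the second field and not as a lower version.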


In [None]:
def clean_version(v: str) -> str:
    """Clean a version string, keeping only digits and periods."""
    if pd.isna(v) or isinstance(v, float):
        return ''
    return re.sub(r'[^0-9.]', '', str(v))

def split_to_parts(s: str):
    """Split a version string like '3.2.0' into a list of integers [3,2,0]."""
    if not s:
        return []
    return [int(x) for x in s.split('.') if x.isdigit()]

def add_version_flags(df: pd.DataFrame,
                      version_col: str = 'version_display',
                      prev_col: str    = 'previous_version'
                     ) -> pd.DataFrame:
    """
    Add columns to indicate major and minor version bumps:
      - first_digit: 1 if there's a major bump (or full release with no prev)
      - second_digit: 1 if there's a minor bump (and no major bump)
    Logic depends on whether previous version info is present.
    """
    df = df.copy()

    # Clean version strings for both columns
    df['_v_clean']  = df[version_col].apply(clean_version)
    df['_ov_clean'] = df[prev_col].apply(clean_version)

    # Split cleaned version strings into integer parts
    df['_v_parts']  = df['_v_clean'].apply(split_to_parts)
    df['_ov_parts'] = df['_ov_clean'].apply(split_to_parts)

    # Extract major and minor version numbers, defaulting to 0 if missing
    df['_v_major'] = df['_v_parts'].str[0].fillna(0).astype(int)
    df['_v_minor'] = df['_v_parts'].str[1].fillna(0).astype(int)
    df['_ov_major'] = df['_ov_parts'].str[0].fillna(0).astype(int)
    df['_ov_minor'] = df['_ov_parts'].str[1].fillna(0).astype(int)

    # Determine which rows have a previous version
    has_prev = df['_ov_parts'].str.len() > 0
    no_prev  = ~has_prev

    # Compute version bumps when prev exists
    first_with  = has_prev & (df['_v_major'] != df['_ov_major'])
    second_with = has_prev & ~first_with & (df['_v_minor'] != df['_ov_minor'])

    # For rows with NO previous version:
    # first = "full" release: a single major field, or only zeros after it
    first_no = no_prev & df['_v_parts'].apply(
        lambda pts: len(pts) >= 1 and all(p == 0 for p in pts[1:])
    )
    # second = at least two segments (and no first bump)
    second_no = no_prev & ~first_no & df['_v_parts'].apply(lambda pts: len(pts) >= 2)

    # Assign final flag columns (first wins over second if both would be true)
    df['first_digit']  = (first_with  | first_no).astype(int)
    df['second_digit'] = (second_with | second_no).astype(int)

    # Cleanup
    df.drop(columns=[
        '_v_clean','_ov_clean','_v_parts','_ov_parts',
        '_v_major','_v_minor','_ov_major','_ov_minor'
    ], inplace=True)

    return df

## Execution
We apply both classification strategies to the validation dataset.

In [None]:
# Compute flags once
df = add_version_flags(df, version_col=COL_VERSION, prev_col=COL_PREV_VERSION)

# Loop over dictionaries
for src in DICT_SOURCES:
    df = classify_whats_new(df, src)

# Save the results to a CSV file (create the output directory if it is missing)
os.makedirs(OUTPUT_DIR, exist_ok=True)
df.to_csv(os.path.join(OUTPUT_DIR, 'validation_literature_classification.csv'), index=False)