This <b>AsteXT</b> tutorial walks through fine-tuning our own BERT language model to detect soft NER in Asian American short stories. It is used by the AsteXT team during the Fall 2025 semester.

<b>Objective</b>: This tutorial will guide you in fine-tuning a BERT uncased model to classify a word into 6 categories (5 soft NER categories and one for non-soft NER). You will learn about topics such as hyperparameters, data imbalance (and how to address it), splitting data into training and validation sets, how to validate a model, and more. In the end, you should have a working fine-tuned BERT model that can identify soft NER in an unseen literary text.

Before starting this tutorial, you should have these files ready:

- A JSON file with labeled soft NER, hard NER, and general words of all the stories that you've labeled. This JSON file should be generated from "trainingDataBuilder.py"
    - Note that, for this tutorial, we are using standard NER labels, not the BIO format.
- A JSON file with testing data from a short story that you have not labeled. This JSON file should be generated from "testingDataBuilder.py"

You also need these libraries installed:
`torch` (PyTorch), `numpy`, `scikit-learn`, `transformers`, `datasets`


We will be using a GPU to train this model. If your laptop does not have a GPU, that is okay; training will just take longer. Depending on your laptop's GPU and CPU, training on a GPU might be anywhere between 10 and 100 times faster than training on a CPU.

<b>Note</b> that, traditionally, machine learning requires test data that is also labeled, so that the fine-tuned model's results can be compared against it. In our case (given our data and time constraints), we do not need to label stories just to test the model: after the trained model labels our testing data, we can manually check the model's work a posteriori.

<b>Another note</b>: it is expected that you will be running this tutorial code and model training code multiple times using different hyperparameters (defined later on) or different metrics. People seldom get a good model result on their first try. Do note that, although there are a lot of guardrails that we can build into the training process to ensure higher quality, the Number One most influential factor in the quality of your model is not the metrics or parameters we will be using but the data that you have labeled yourself. The better your labels and the more you label, the better your model (most of the time).

When you go through this tutorial, make sure you read <b>all of my code comments throughout a code cell and written notes in Markdown cells in between code cells</b> (<--- again, important), as these will provide you with explanations about what is happening at each step as well as how you can interpret your model.

You do not need to modify every code cell. Code cells with my comment "Run without modifying" (or something along that line) at the top do not need to be modified; you can just run them. Some code cells have places where you need to make modifications. I have labeled those with `# TODO`. Note that one code cell might contain more than one `TODO`.

---
Heads-up:

As you go through the tutorial, there are a few variables that you need to modify yourself (such as model hyperparameters, data balance ratios, etc.). These do not have one correct value, as they are architectural choices that vary from person to person and from dataset to dataset (especially since each of us has labeled our own stories and is thus using different data). In other words, it is expected that you will be running this tutorial at least two times, testing out different parameters or values. To help you keep track of what values you've used, here's a list of all the values that you will be testing out in this tutorial (I think I have included everything in this list, but I might have missed one or two, so please read the code carefully):

In the function `balanceDatasetByCategory` (for increasing underrepresented data points (most likely soft NER) in our data to mitigate data imbalance):
- `targetRatio` (number, target ratio of negative to positive data, e.g. = 10 for roughly 10 non-soft tokens per soft NER token. I recommend keeping this ratio between 5-15. You are free to experiment outside of this range.)
- `boostRareCategories` (boolean, since not all soft NER categories have the same amount of data, should we give an extra boost to the soft NER categories that have especially little data? E.g. = True/False)
- `maxRepeatPerSentence` (integer, the maximum number of times that an underrepresented data point can be repeated, e.g. = 30. I recommend keeping this value between 10-40. You are free to experiment outside of this range.)
- `probabilistic` (boolean, whether we randomly select data points to upsample during data upsampling. I highly recommend setting this to True)
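
To make these parameters more concrete, here is a minimal sketch of capped, weighted sampling with replacement on made-up data. The sentence names, the toy counts, and the helper name `sampleWithCap` are hypothetical and only for illustration; the real logic lives in `balanceDatasetByCategory` further down and differs in its details.

```python
import random
from collections import Counter

def sampleWithCap(sentences, weights, numDraws, maxRepeatPerSentence, seed=42):
    """Draw sentences with replacement, skipping any sentence that already hit the cap."""
    rng = random.Random(seed)
    timesChosen = Counter()
    chosen = []
    for _ in range(numDraws):
        pick = rng.choices(sentences, weights=weights, k=1)[0]
        if timesChosen[pick] >= maxRepeatPerSentence:
            continue  # this sentence has been repeated enough already
        chosen.append(pick)
        timesChosen[pick] += 1
    return chosen, timesChosen

# Hypothetical toy data: three "Private" sentences and one rare "Natural" sentence
sentences = ["private_1", "private_2", "private_3", "natural_1"]
# Inverse-category-size weights: the rare category is sampled more often
weights = [3 / 3, 3 / 3, 3 / 3, 3 / 1]

negativeTokens = 200   # made-up count of non-soft tokens
targetRatio = 10
numDraws = negativeTokens // targetRatio  # 20 draws

chosen, timesChosen = sampleWithCap(sentences, weights, numDraws, maxRepeatPerSentence=5)
print(len(chosen), dict(timesChosen))
```

Draws that hit the cap are skipped rather than redrawn, which is one reason the final ratio can end up different from the `targetRatio` you asked for.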

In the function `calculateClassWeights` (for calculating data weights in training to mitigate data imbalance):
- `minCount` (integer, the minimum count assumed for each category when computing weights, which avoids dividing by zero for a category that never appears. I recommend between 1-5. You are free to experiment outside of this range.)
- `clipRatio` (float, the largest weight that an underrepresented category can receive. When we give extra weight to less represented data, we cannot increase the weight indefinitely, since that would cause errors, so this value caps it. I recommend between 5.0-15.0. You are free to experiment outside of this range)
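
As a rough illustration of how `minCount` and `clipRatio` interact, the sketch below computes inverse-frequency weights on made-up token counts. The helper name `clippedClassWeights` and the counts are hypothetical; the formula mirrors what `calculateClassWeights` does later in this tutorial, but treat it as a simplified sketch rather than the exact implementation.

```python
def clippedClassWeights(counts, minCount=1, clipRatio=10.0):
    """Inverse-frequency weight per class, capped at clipRatio."""
    total = sum(counts.values())
    numClasses = len(counts)
    weights = {}
    for label, count in counts.items():
        count = max(count, minCount)  # avoid dividing by zero for unseen classes
        rawWeight = total / (numClasses * count)
        weights[label] = min(rawWeight, clipRatio)  # never exceed the cap
    return weights

# Hypothetical token counts, for illustration only
counts = {"Non-Soft": 9000, "Soft-Private": 600, "Soft-Natural": 0}
weights = clippedClassWeights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.2f}")
```

In this toy case the abundant class gets a weight below 1, the rarer class gets a weight above 1, and the class with no examples would receive a huge weight if it were not capped at `clipRatio`.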

In the function `makeValidationSplitByCategory` (for creating validation and training data split from your labeled data):
- `validationProportion` (float, the proportion of labeled data that you want to set aside for validation. I set it to 0.05, which is a good place to start, but you are free to change it if you want to. Usually, this number should be below 25%, or 0.25. Since we are working with a small labeled dataset, I would say less than 15%, or 0.15)

Before the main model training iteration, we have these two hyperparameters:
- `trainingSteps` (integer, number of training steps the model should go through. The higher the number, the more training the model does, and more closely aligned with the training data the model will be. If the model gets too close, the model will overfit. I recommend between 300-1200, based on your performance)
- `learningRate` (float, learning rate during training. I recommend one of these: `3e-5, 1e-5, 5e-6, 3e-6, 1e-6, 5e-7`)

A hyperparameter is a value used in model training that results from an architectural choice. In other words, there is no one set of hyperparameters that suits everyone. It depends on the model you are training, your data, and many other factors. There are more than 2 hyperparameters in machine learning, but we will focus on these two.

You are expected to try out different `trainingSteps` and `learningRate` values, since you will likely need to train multiple times to find a reliable model. Each time you train, try a different value. I have a more detailed description of these two hyperparameters later in the code.
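
One possible way to keep track of the combinations you try is a small run log like the sketch below. The variable names `trainingStepOptions`, `learningRateOptions`, and `runLog`, as well as the `validationF1` field, are hypothetical and not part of the tutorial code; the candidate values are taken from the recommended ranges above.

```python
from itertools import product

trainingStepOptions = [300, 600, 1200]    # within the recommended 300-1200 range
learningRateOptions = [3e-5, 1e-5, 5e-6]  # taken from the recommended list

runLog = []
for trainingSteps, learningRate in product(trainingStepOptions, learningRateOptions):
    runLog.append({
        "trainingSteps": trainingSteps,
        "learningRate": learningRate,
        "validationF1": None,  # fill in by hand after each training run
    })

for run in runLog:
    print(run)
```

You do not have to train every combination; the point is only to write down which values you used so that you can compare runs afterwards.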

In [None]:
# First, we will download the libraries that we need:
!pip install torch numpy scikit-learn transformers datasets

In [None]:
# Then, we will import these libraries 
# Run without modifying
import json, torch, random
import torch.nn as nn
import numpy as np

from sklearn.metrics import f1_score, classification_report
from torch.utils.data import DataLoader
from collections import Counter, defaultdict
from typing import List, Dict
from transformers import AutoModelForTokenClassification, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForTokenClassification
from datasets import Dataset, DatasetDict


In [None]:
# Run without modifying: this is to determine whether you have GPU or CPU

def getTrainingDevice():
    '''This function checks which accelerator is available (NVIDIA GPU via "cuda", Apple GPU via "mps") and returns the matching torch.device. If neither is available, it returns the CPU device, and training will be slower.'''

    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

# Save our GPU or CPU preference here for later use
device = getTrainingDevice()
print(f"You are using: {device}")

If the above printed statement says either "You are using: mps" or "You are using: cuda", then you have a GPU available on your local computer. If it says "You are using: cpu", then your laptop doesn't have a usable GPU and you will be running training on the CPU.

<b>mps</b> (Metal Performance Shaders) is PyTorch's backend for Apple GPUs, and <b>cuda</b> is the backend for NVIDIA GPUs, which are common in Windows and Linux machines.

In [None]:
# TODO
# We will then look at the training data distribution for our own knowledge. As you remember, we have a lot more general words than either hard or soft NER. Examining our data is always important before doing any calculation so that we have an idea of what type of data and what type of data distribution/properties we are working with.

# TODO Add JSON file path to your combined story training data created from "trainingDataBuilder.py"
trainingDataJSONFilePath = "/Users/Jerry/Desktop/AsteXT/AsteXTCode/AsteXTCode2025-6/training.json"

with open(trainingDataJSONFilePath, "r", encoding="utf-8") as trainingFile:
    trainingDataDict = json.load(trainingFile)

storeCountOfDataLabels = {} # Dictionary to count how many labels we have

trainingDataSentences = trainingDataDict["sentences"]

for sentenceGroup in trainingDataSentences:
    for oneLabel in sentenceGroup["labels"]:
        if oneLabel not in storeCountOfDataLabels:
            storeCountOfDataLabels[oneLabel] = 0
        storeCountOfDataLabels[oneLabel] += 1

for label, count in sorted(storeCountOfDataLabels.items()):
    print(f"{label}: {count} data point(s)")

You should be seeing, printed out above, that we have a lot more "O" (general word) than any other labels. This means that we have a <b>data imbalance</b>. This is expected, since, in language, we have a lot more generic words than named entities, hard or soft. We need to mitigate this issue when fine-tuning our model so that we can give more weight to the soft NER that we've identified.

Traditionally, there are two ways to resolve data imbalance: resampling and reweighing. Resampling means increasing or decreasing the amount of training data so that we achieve a more balanced proportion. Reweighing means telling the model to give proportional weight to each training data point during the training process. We will use a combination of both. For resampling, we will use upsampling (increasing the number of underrepresented data points); there is also downsampling, which reduces the number of overrepresented data points. We will then use reweighing.
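
To see what reweighing does to the training signal, here is a small pure-Python sketch of a weighted average loss. It assumes the weighted-mean convention that PyTorch's `CrossEntropyLoss` uses when it is given class weights (each token's loss is multiplied by its class weight, then divided by the sum of the weights). The helper name `weightedMeanLoss` and all numbers are made up for illustration.

```python
def weightedMeanLoss(tokenLosses, tokenLabels, classWeights):
    """Average per-token losses, weighting each token by its class weight."""
    weightedSum = sum(classWeights[lbl] * loss for loss, lbl in zip(tokenLosses, tokenLabels))
    weightTotal = sum(classWeights[lbl] for lbl in tokenLabels)
    return weightedSum / weightTotal

# Hypothetical per-token losses: nine easy "O" tokens (label 0) and one badly missed Soft token (label 5)
tokenLabels = [0] * 9 + [5]
tokenLosses = [0.1] * 9 + [2.0]

unweighted = weightedMeanLoss(tokenLosses, tokenLabels, {0: 1.0, 5: 1.0})
reweighted = weightedMeanLoss(tokenLosses, tokenLabels, {0: 1.0, 5: 9.0})
print(f"unweighted: {unweighted:.2f}, reweighted: {reweighted:.2f}")
```

With equal weights, the one missed soft NER token is drowned out by the nine easy tokens (a loss of about 0.29). With a weight of 9 on the soft class, the same mistake dominates the average (about 1.05), so the model is pushed harder to fix it.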

The next task is to create a few functions to turn our data into machine-readable labels, i.e.: 0 for negative (hard NER or general words) and 1 through 5 for the five soft NER categories.

In [None]:
# Run without modifying: This is a label mapping and a function that we will later call on to add numerical IDs to our raw data

# Create label mapping
LabelMap = {
    "O": 0,  # General words
    "Hard": 0,  # Treating all hard NER as negative
    "Soft-Communal/Public": 1,
    "Soft-Extraterrestrial/Figurative": 2,
    "Soft-Institutional": 3,
    "Soft-Natural": 4,
    "Soft-Private": 5,
}

def labelToID(labelStr):
    '''
    Turning word labels into numerical IDs'''
    labelStr = str(labelStr).strip()
    
    # Check for exact matches first
    if labelStr in LabelMap:
        return LabelMap[labelStr]
    
    # Check if it's a Soft entity (handles variations)
    if labelStr.startswith("Soft"):
        if "Communal" in labelStr or "Public" in labelStr:
            return LabelMap["Soft-Communal/Public"]
        elif "Extraterrestrial" in labelStr or "Figurative" in labelStr:
            return LabelMap["Soft-Extraterrestrial/Figurative"]
        elif "Institutional" in labelStr:
            return LabelMap["Soft-Institutional"]
        elif "Natural" in labelStr:
            return LabelMap["Soft-Natural"]
        elif "Private" in labelStr:
            return LabelMap["Soft-Private"]
    
    # Check if it's Hard NER
    if labelStr.startswith("Hard"):
        return 0 
    
    if labelStr in {"O"}: # General word
        return 0
    
    # Edge case safety: if there is an unknown label, we treat it as 0. However, if you followed "trainingDataBuilder.py", this scenario should probably not happen
    print(f"Warning: Unknown label '{labelStr}', treating as O")
    return 0

In [None]:
# Run without modifying: These functions help us label our datasets in a machine-readable manner

# See if a word is soft NER
def isSoft(label):
    return str(label).startswith("Soft")

# Or if it is hard NER
def isHard(label):
    return str(label).startswith("Hard")

def makeTokenizeAndAlignFn(tokenizer, labelAllSubtokens=False):
    '''This function uses the tokenizer to turn our labels into tokenized numerical values for each word.

    Default parameters:
    "labelAllSubtokens" controls what happens when a word is broken down into multiple subtokens (e.g. "institutionalization" tokenized into two tokens: "institution" and "alization"): if True, every subtoken gets the word's label; if False, only the first subtoken does. This is set to False as default.
    '''
    def tokenizeAndAlignLabels(batch):
        tokenizedInputs = tokenizer(
            batch["tokens"],
            is_split_into_words=True,
            truncation=True
        )

        allLabels = []
        for i, labelsStr in enumerate(batch["labels"]):
            wordIds = tokenizedInputs.word_ids(batch_index=i)
            previousWordId = None
            labelIds = []
            
            for wordId in wordIds:
                if wordId is None:
                    # Special tokens get -100 (ignored in loss)
                    labelIds.append(-100)
                elif wordId != previousWordId:
                    # if a word is broken into subtokens, we use the first subtoken
                    label_id = labelToID(labelsStr[wordId])
                    labelIds.append(label_id)
                else:
                    # Continuation subword
                    if labelAllSubtokens:
                        label_id = labelToID(labelsStr[wordId])
                        labelIds.append(label_id)
                    else:
                        labelIds.append(-100)  # Ignore subwords
                
                previousWordId = wordId
            
            allLabels.append(labelIds)

        tokenizedInputs["labels"] = allLabels
        return tokenizedInputs

    return tokenizeAndAlignLabels
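
To see the alignment rule from the cell above in isolation, here is a minimal sketch that needs no tokenizer. The `wordIds` list is hand-written to imitate the shape of what a fast tokenizer's `word_ids()` returns, and the helper name `alignLabelsToSubtokens` and the example sentence are hypothetical.

```python
def alignLabelsToSubtokens(wordIds, wordLabelIds, labelAllSubtokens=False):
    """Map one label per word onto subword tokens; -100 marks positions the loss ignores."""
    aligned = []
    previousWordId = None
    for wordId in wordIds:
        if wordId is None:
            aligned.append(-100)  # special tokens such as [CLS] and [SEP]
        elif wordId != previousWordId:
            aligned.append(wordLabelIds[wordId])  # first subtoken of a word
        else:
            # continuation subtoken: label it only if labelAllSubtokens is True
            aligned.append(wordLabelIds[wordId] if labelAllSubtokens else -100)
        previousWordId = wordId
    return aligned

# Hypothetical sentence of three words where the third word splits into two subtokens
wordIds = [None, 0, 1, 2, 2, None]
wordLabelIds = [0, 0, 5]  # e.g. "O", "O", "Soft-Private"

firstOnly = alignLabelsToSubtokens(wordIds, wordLabelIds)
allSubtokens = alignLabelsToSubtokens(wordIds, wordLabelIds, labelAllSubtokens=True)
print(firstOnly)     # [-100, 0, 0, 5, -100, -100]
print(allSubtokens)  # [-100, 0, 0, 5, 5, -100]
```

With the default `labelAllSubtokens=False`, each word contributes exactly one position to the loss, no matter how many subtokens it was split into.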

Now, after creating a few helper functions that turn our data into formats recognizable by BERT, we will need to start manipulating our data so that they are not as imbalanced as what we've seen earlier on. We will first look at the ratio of data before upsampling:

In [None]:
# Run without modifying: This step shows the label distribution of our raw data before any oversampling

def viewDataRatioBeforeBalancing(records):
    ''' 
    Show distribution of each Soft category as raw data before resampling
    '''
    categoryCount = Counter()
    totalTokens = 0
    
    for record in records:
        for label in record["labels"]:
            totalTokens += 1
            labelString = str(label)
            
            if labelString.startswith("Soft-"):
                categoryCount[labelString] += 1
            elif labelString.startswith("Hard") or labelString in {"O", "0"}:
                categoryCount["Non-Soft"] += 1
    
    print("Label distribution")
    print(f"\nNon-Soft tokens: {categoryCount['Non-Soft']}")
    print(f"\nSoft categories:")
    
    totalSoftNER = 0
    for category, count in sorted(categoryCount.items()):
        if category.startswith("Soft-"):
            print(f"  {category}: {count} tokens")
            totalSoftNER += count
    
    print(f"\nTotal Soft tokens: {totalSoftNER}")
    if totalSoftNER:
        print(f"Ratio (Non-Soft:Soft): {categoryCount['Non-Soft']/totalSoftNER:.1f}:1")
    else:
        print("Warning: no Soft tokens found, so the ratio cannot be computed")

viewDataRatioBeforeBalancing(trainingDataSentences)

In [None]:
# Run this to balance our data and see the ratio after balancing

# TODO (at the end of this code cell): you need to input a ratio in the "targetRatio" parameter when doing the function call at the end of this cell block. There is no right or wrong answer, as this is an architectural choice. You can train the model several times with different ratios and compare the results.

def balanceDatasetByCategory(records, targetRatio=10.0, boostRareCategories=True, randomSeed=42, maxRepeatPerSentence=30, probabilistic=True):
    '''
    Balanced dataset builder with protections against extreme duplication.
    Parameters
      - records: list of sentence dicts
      - targetRatio: desired neg:pos token ratio
      - boostRareCategories: if True, Soft categories with fewer sentences are repeated more often. Only used when probabilistic=False; probabilistic mode always boosts smaller categories
      - randomSeed: seed for the random number generator, so that results are reproducible
      - maxRepeatPerSentence: hard cap for how many times a single sentence may be repeated. We want to avoid the case where we overly repeat one sentence
      - probabilistic: if True, use sampling-with-replacement probabilities instead of literal duplication. We want to set it to True

    This function returns a list of shuffled records
    '''
    random.seed(randomSeed)

    # Group sentences by the Soft category they contain
    sentenceByCategoryDict = {
        "Non-Soft": [],
        "Soft-Communal/Public": [],
        "Soft-Extraterrestrial/Figurative": [],
        "Soft-Institutional": [],
        "Soft-Natural": [],
        "Soft-Private": [],
        "Multiple-Soft": []
    }

    for record in records:
        softCategoriesInRecord = set()
        for label in record["labels"]:
            s = str(label)
            if s.startswith("Soft-"):
                softCategoriesInRecord.add(s)
        if not softCategoriesInRecord:
            sentenceByCategoryDict["Non-Soft"].append(record)
        elif len(softCategoriesInRecord) > 1:
            sentenceByCategoryDict["Multiple-Soft"].append(record)
        else:
            category = list(softCategoriesInRecord)[0]
            sentenceByCategoryDict[category].append(record)

    # Count Soft and non-Soft tokens
    allSoftNERSentences = []
    for cat, sents in sentenceByCategoryDict.items():
        if cat != "Non-Soft":
            allSoftNERSentences.extend(sents)

    softNERToken = sum(
        sum(1 for lbl in r["labels"] if str(lbl).startswith("Soft-"))
        for r in allSoftNERSentences
    )
    negativeDataToken = sum(
        sum(1 for lbl in r["labels"] if not str(lbl).startswith("Soft-"))
        for r in records
    )

    currentRatioOfData = negativeDataToken / softNERToken if softNERToken > 0 else float("inf")
    dataRepeatFactor = max(1, int(currentRatioOfData / targetRatio))

    # Build balancedDataRecords
    balancedDataRecords = list(sentenceByCategoryDict["Non-Soft"])  # start with all negatives

    if not probabilistic:
        # literal duplication but with caps
        if boostRareCategories:
            # compute category sizes
            categorySize = {
                cat: len(sentences)
                for cat, sentences in sentenceByCategoryDict.items()
                if cat != "Non-Soft" and sentences
            }
            maxDataSize = max(categorySize.values()) if categorySize else 1

            for category, sentences in sentenceByCategoryDict.items():
                if category == "Non-Soft" or not sentences:
                    continue
                size = len(sentences)
                dataBoostFactor = maxDataSize / size if size > 0 else 1.0
                repeatDataFactor = int(dataRepeatFactor * dataBoostFactor)
                # enforce per-sentence cap
                repeatDataFactor = min(repeatDataFactor, maxRepeatPerSentence)

                # replicate sentences but ensure not more than cap per individual sentence
                for s in sentences:
                    balancedDataRecords.extend([s] * repeatDataFactor)
        else:
            # uniform oversampling, but cap repeats per sentence
            repeatDataFactor = min(dataRepeatFactor, maxRepeatPerSentence)
            for s in allSoftNERSentences:
                balancedDataRecords.extend([s] * repeatDataFactor)
    else:
        # probabilistic upsampling: sample with replacement from soft sentences
        # compute per-category sampling probabilities proportional to desired boosts
        print("Probabilistic upsampling mode: sampling with replacement rather than full duplication.")
        desiredTotalSoftData = int(negativeDataToken / targetRatio) if targetRatio > 0 else len(allSoftNERSentences)
        if desiredTotalSoftData <= 0:
            desiredTotalSoftData = len(allSoftNERSentences)

        # Build a flat list of (sentence, weight) for sampling
        # Size of the largest Soft category, computed once and reused for every sentence
        largestCategorySize = max(1, max(len(sents) for cat, sents in sentenceByCategoryDict.items() if cat != "Non-Soft"))
        flat = []
        for cat, sents in sentenceByCategoryDict.items():
            if cat == "Non-Soft" or not sents:
                continue
            # per-sentence weight: boost small categories by inverse size
            cat_size = len(sents)
            for s in sents:
                # smaller categories get a higher sampling weight
                w = largestCategorySize / max(1, cat_size)
                flat.append((s, w))

        # normalize weights and sample
        sentenceList, weights = zip(*flat)
        weights = [float(w) for w in weights]
        totalWeights = sum(weights)
        probs = [w / totalWeights for w in weights]

        # Draw desiredTotalSoftData samples with replacement, but cap per individual sentence
        # (defaultdict is already imported at the top of the notebook)
        chosen_counts = defaultdict(int)
        for _ in range(desiredTotalSoftData):
            idx = random.choices(range(len(sentenceList)), probs)[0]
            sent = sentenceList[idx]
            if chosen_counts[id(sent)] >= maxRepeatPerSentence:
                continue  # skip adding more of this particular instance
            balancedDataRecords.append(sent)
            chosen_counts[id(sent)] += 1

    random.shuffle(balancedDataRecords)

    # final stat
    finalSoftNER = sum(
        sum(1 for lbl in r["labels"] if str(lbl).startswith("Soft-"))
        for r in balancedDataRecords
    )
    finalNegativeData = sum(
        sum(1 for lbl in r["labels"] if not str(lbl).startswith("Soft-"))
        for r in balancedDataRecords
    )

    print("\nOVERSAMPLING RESULTS:")
    print(f"  total sentences after balance: {len(balancedDataRecords)}")
    print(f"  Soft tokens: {finalSoftNER}, Non-Soft tokens: {finalNegativeData}")
    if finalSoftNER:
        print(f"  Final ratio (Non-Soft:Soft): {finalNegativeData/finalSoftNER:.1f}:1")
    else:
        print("Warning: no Soft tokens in balanced records")

    return balancedDataRecords

# TODO: determine a target ratio of negative:positive data. Note that it won't necessarily result in exactly the ratio that you set. For example, you might ask for a 5:1 ratio and the function might still give you 10:1, but that is still a great improvement over, say, a 50:1 ratio in the raw data
balancedRecords = balanceDatasetByCategory(trainingDataSentences, targetRatio=5.0, maxRepeatPerSentence=30, probabilistic=True)

In [None]:
# Run without modifying

# Calculate class weights and reweigh them as a way to further mitigate data imbalance.
# Reweighing means giving more weight to data points that are less represented

def calculateClassWeights(records, minCount=1, clipRatio=10.0): # clipRatio prevents providing an overly large number when reweighing data 
    counts = Counter()
    for r in records:
        for lbl in r["labels"]:
            counts[labelToID(lbl)] += 1
    total = sum(counts.values())
    numOfCategories = 6
    weights = []
    for cid in range(numOfCategories):
        cnt = max(counts.get(cid, 0), minCount)
        w = total / (numOfCategories * cnt)
        w = min(w, clipRatio)
        weights.append(w)
    return torch.tensor(weights, dtype=torch.float32)  # keep on CPU and get transferred over to GPU when needed

classWeights = calculateClassWeights(balancedRecords)


Now, we will split our labeled data into two parts:
- training data (sorry for naming it the same way)
- validation data

Validation data differs from testing data in that validation data is also labeled. We won't be using validation data in model training, though (otherwise the model would be "cheating").

The function below, `makeValidationSplitByCategory(...)`, will split our data into training and validation sets according to the proportion set in the parameter `validationProportion`. If `validationProportion=0.05`, 5% of our labeled data will be set aside as validation data. You can change this based on how much data you've labeled.

The good thing about the `makeValidationSplitByCategory` function is that it takes data imbalance into consideration when selecting data for validation.

In [None]:
# TODO at the end of this code cell: this function splits our labeled data into training and validation sets  
def makeValidationSplitByCategory(records, validationProportion=0.05, randomSeed=42, minHoldoutPerCategory=1):
    '''Create training and validation splits and ensure at least one sentence per Soft category
    when possible, and not removing the only instance from train.'''

    random.seed(randomSeed)
    # group sentences by the single soft category present, or 'Non-Soft' or 'Multiple-Soft'
    groups = defaultdict(list)
    for i, r in enumerate(records):
        softCats = {str(lbl) for lbl in r["labels"] if str(lbl).startswith("Soft-")}
        if not softCats:
            groups["Non-Soft"].append(i)
        elif len(softCats) > 1:
            groups["Multiple-Soft"].append(i)
        else:
            cat = list(softCats)[0]
            groups[cat].append(i)

    # Build initial validation indices by reserving 1 sentence for categories with 2 or more sentences
    valIndices = set()
    for cat, idxs in groups.items():
        if cat == "Non-Soft":
            continue
        if len(idxs) >= (minHoldoutPerCategory + 1):
            # choose up to minHoldoutPerCategory examples to hold out
            chosen = random.sample(idxs, min(minHoldoutPerCategory, len(idxs)))
            valIndices.update(chosen)

    # Now, ensure the validation part of the data is met by sampling randomly from remaining indices
    allIndices = set(range(len(records)))
    remaining = list(allIndices - valIndices)
    targetValSize = max(int(len(records) * validationProportion), len(valIndices))
    extraNeeded = max(0, targetValSize - len(valIndices))
    if extraNeeded > 0:
        extra = random.sample(remaining, min(extraNeeded, len(remaining)))
        valIndices.update(extra)

    trainIndices = sorted(list(allIndices - valIndices))
    valIndices = sorted(list(valIndices))

    print(f"Total records: {len(records)}: train: {len(trainIndices)}, validation: {len(valIndices)}")

    # show per-category counts
    def catCounts(idxList):
        c = Counter()
        for i in idxList:
            r = records[i]
            for lbl in r["labels"]:
                s = str(lbl)
                if s.startswith("Soft-"):
                    c[s] += 1
                elif s.startswith("Hard") or s in {"O", "0"}:
                    c["Non-Soft"] += 1
        return c

    print("Train token counts by category (sampled):", {k: v for k, v in catCounts(trainIndices).items()})
    print("Validation token counts by category (sampled):", {k: v for k, v in catCounts(valIndices).items()})

    trainRecords = [records[i] for i in trainIndices]
    validationRecords = [records[i] for i in valIndices]

    return trainRecords, validationRecords

# TODO: decide how much of your labeled data you want to set aside for validation. I put down 5% (or 0.05). You can start there and adjust as you see fit.
validationSetProportion = 0.05  # This means we are using 5% of data for validation. You can change this, but judging from the size of each of our labeled datasets, I wouldn't recommend using more than 10% for validation
trainingRecords, validationRecords = makeValidationSplitByCategory(balancedRecords, validationProportion=validationSetProportion, randomSeed=42)
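# Remember the variable "validationRecords". We will use it later, after we've trained the model, to do validation.
# Keep in mind that balancedRecords contains repeated copies of upsampled sentences, so a copy of the same sentence can land in both the training and validation sets, which can make validation scores look better than they really are.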

In [None]:
# Run without modifying: tokenize our training data for training

def prepareCategorySoftNerTrainOnly(records, modelName="bert-base-uncased", labelAllSubtokens=False, returnDatasetDict=False):
    '''
    model name is set to bert-base-uncased as default, since we are using this in our tutorial
    '''
    # Records: list of dicts with keys "tokens" and "labels" (as created from "trainingDataBuilder.py")
    for r in records:
        assert len(r["tokens"]) == len(r["labels"]), "number of tokens and number of labels must match" #this will throw an error if your training data JSON somehow has more words than labels. Should not happen if you followed the trainingDataBuilder.py

    ds = Dataset.from_list(records)
    tokenizer = AutoTokenizer.from_pretrained(modelName, use_fast=True)

    fn = makeTokenizeAndAlignFn(tokenizer, labelAllSubtokens=labelAllSubtokens) # call tokenizer wrapper function
    tokenizedTrain = ds.map(fn, batched=True)

    if returnDatasetDict:
        return DatasetDict(train=tokenizedTrain)
    return tokenizedTrain

tokenizedTrainingData = prepareCategorySoftNerTrainOnly(trainingRecords) # Note here that we are only using training data, not validation data, from the previous code cell

In [None]:
# Run without modifying: double-checking whether we are using GPU or CPU

# device is a torch.device, e.g. torch.device("mps") or torch.device("cuda")
print("device:", device, "device.type:", device.type)

if device.type == "cuda":
    batchSize = 32
elif device.type == "mps":
    batchSize = 16
else:
    batchSize = 8

print(f"Using batch size: {batchSize} because you are using device: {device}")

In [None]:
# Run without modifying

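# Here, we are loading BERT
# LabelMap maps both "O" and "Hard" to 0, so we build a one-to-one ID <-> name mapping for the model config
id2label = {labelId: labelName for labelName, labelId in LabelMap.items() if labelName != "Hard"}
label2id = {labelName: labelId for labelId, labelName in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=6, # because we are creating 6 categories (5 different soft NER and 1 negative data)
    id2label=id2label,
    label2id=label2id
)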

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataCollator = DataCollatorForTokenClassification(tokenizer=tokenizer)

In [None]:
# Run without modifying: to make sure that we have three columns:['labels', 'input_ids', 'attention_mask']
# After you run this code, you should see something like this:
'''Columns: ['labels', 'input_ids', 'attention_mask']
{'labels': [-100, 0, 0, 0, 0, 0, 0, 0, 0, 5, 5, -100, -100], 'input_ids': [101, 2021, 3904, 1997, 2068, 2064, 4339, 1999, 2256, 3320, 4155, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Remaining columns: ['labels', 'input_ids', 'attention_mask']
'''
#The numbers might not be exactly the same, but make sure that the last line being printed out is: Remaining columns: ['labels', 'input_ids', 'attention_mask']

print("Columns:", tokenizedTrainingData.column_names)
print(tokenizedTrainingData[0])

# Remove non-tensor columns, such as our charSpans label that shows where in a sentence a word is.
cols_to_keep = ["input_ids", "attention_mask", "labels"]
cols_to_remove = [c for c in tokenizedTrainingData.column_names if c not in cols_to_keep]
tokenizedTrainingData = tokenizedTrainingData.remove_columns(cols_to_remove)
print("Remaining columns:", tokenizedTrainingData.column_names)


In [None]:
# Run without modifying
# test before training, get one batch from DataLoader

trainingDataloader = DataLoader(
    tokenizedTrainingData,
    batch_size=batchSize,
    collate_fn=dataCollator,
    shuffle=False
)

batch = next(iter(trainingDataloader))
batch = {k: v.to(device) for k, v in batch.items()}
print("Batched keys:", batch.keys())
# when printing batch keys, you should see something like "dict_keys(['input_ids', 'attention_mask', 'labels'])"

Now we've finished preparing our data for training. Model training happens in multiple cycles, usually called "steps." In each step, the model processes one batch of data and updates its weights. The number of steps we use is a design choice on our side. Each step helps the model fit our data better, but we don't want too many steps, since that would cause overfitting (a scenario where the model learns the training data well but does not generalize to new text). A good practice is therefore to train the model several times with different numbers of steps and see which one gives the best result.
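
To get a feel for how many times a given number of steps walks through your data, here is a minimal sketch. It assumes one step consumes one batch and that the final, smaller batch is kept, as the `DataLoader` in this notebook does by default. The function name `steps_to_epochs` and the dataset size of 1,000 sentences are hypothetical and only for illustration.

```python
import math

def steps_to_epochs(training_steps: int, batch_size: int, num_examples: int) -> float:
    """Return how many passes over the data a given number of steps makes."""
    # One pass (epoch) needs this many batches; the last batch may be smaller
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return training_steps / steps_per_epoch

# Hypothetical numbers: 1,000 training sentences, batch size 16
for steps in (200, 400, 1200):
    epochs = steps_to_epochs(steps, batch_size=16, num_examples=1000)
    print(f"{steps} steps is about {epochs:.1f} passes over the data")
```

If the result is well above a handful of passes on a small dataset, overfitting becomes more likely, so it may be worth trying fewer steps.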

Before starting the full training cycle, we will run just one step to check that our code works. The code cell below might take a while to execute (from about 20 seconds to a few minutes, depending on whether you are using GPU or CPU).

In [None]:
# Run without modifying

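# This code cell replicates one training step up to the backward pass
# (it never calls optimizer.step(), so the model weights are not updated here)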
dl = DataLoader(tokenizedTrainingData, batch_size=batchSize, collate_fn=dataCollator)
batch = next(iter(dl))
batch = {k: v.to(device) for k,v in batch.items()}
model.to(device)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)
optimizer.zero_grad()
outputs = model(**batch)
loss = outputs.loss if hasattr(outputs,"loss") else torch.nn.functional.cross_entropy(
    outputs.logits.view(-1,6), batch["labels"].view(-1), ignore_index=-100)
print("Initial loss: ", loss.item())
loss.backward()

# Checks every computed gradient for NaN values; False means all gradients are valid
print("Any empty or none gradients?", any(torch.isnan(p.grad).any().item() for p in model.parameters() if p.grad is not None))

Intuitively, "loss" measures how far the model's predicted labels are from the true labels. At this point, we only want to confirm that one round of training produces a loss value at all, regardless of how large the number is. If the initial loss is something unreasonable, such as 0.00 (an untrained model cannot be perfect on its first training step) or "NaN", then an earlier part of your code probably has an issue.

You should also see "False" for "Any empty or none gradients?" This means that none of the gradients (the vectors that tell the model which direction to move its weights in next) contain invalid (NaN) values. Later on, during the main training steps, we will keep a close eye on whether all gradients remain valid at each step.
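
As a rough reference point for what a "reasonable" initial loss looks like, the sketch below computes unweighted cross-entropy by hand in plain Python. It illustrates the idea and is not the PyTorch implementation. The helper name `cross_entropy` is hypothetical. A classifier with no preference among 6 classes gives a loss of ln(6), about 1.79. A freshly initialized classification head is random rather than exactly uniform, so expect your initial loss to land only in that neighborhood.

```python
import math

def cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over token positions, skipping ignore_index labels."""
    losses = []
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # special tokens and padding do not contribute to the loss
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)

# A classifier with no preference assigns equal logits to all 6 classes
uniform_logits = [[0.0] * 6 for _ in range(4)]
labels = [-100, 0, 5, -100]
print(round(cross_entropy(uniform_logits, labels), 4))  # 1.7918, i.e. ln(6)
```

Note that the main training loop uses class weights, so its loss values are not directly comparable to this unweighted number.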

The code below is the main training loop.

Since each of us is using different data, we will use different training hyperparameters. There are many hyperparameters to use. The two that we will focus the most on are "trainingSteps" and "learningRate".

I've set default values for your training steps and learning rate here, 400 and 1e-6, respectively. Adjust them based on your training results.

Some suggested range of training steps: `200-1200`
Some suggested choices of learning rate: `3e-5, 1e-5, 5e-6, 3e-6, 1e-6, 5e-7`

Training steps, as mentioned before, is the number of steps the model takes during training.
Learning rate controls how large an adjustment the model makes to its weights at each step. In other words, how much the model learns at each step. A larger learning rate means that the model learns quickly, saving training time but risking overshooting and missing finer patterns in the training data. A smaller learning rate means that the model is more meticulous, but training will take longer.
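
To make the learning-rate trade-off concrete, here is a toy sketch that is not part of the AsteXT pipeline. It runs plain gradient descent on the one-parameter function f(w) = w², whereas this notebook trains BERT with AdamW, so the learning rates below are illustrative only and are not comparable to the suggested values such as 1e-6. The function name `gradient_descent` is hypothetical.

```python
def gradient_descent(learning_rate: float, steps: int, start: float = 1.0) -> float:
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        grad = 2 * w               # derivative of w**2
        w -= learning_rate * grad  # move against the gradient
    return w

# The best possible value is w = 0
for lr in (0.01, 0.4, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr, 20):.4f}")
```

With the small rate, w is still far from 0 after 20 steps (slow but steady). The middle rate gets very close to 0. The largest rate overshoots on every step, so w grows instead of shrinking. This is the same kind of instability that can show up as NaN values in real training.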

There are many other hyperparameters that one can use in machine learning. We will just focus on using these two for AsteXT. You are more than welcome to learn about others.

---
F1 score (ranging between 0 and 1) measures how well a model performs on a classification task. It is the harmonic mean of the precision and recall scores. Since we do not have robust validation data, the F1 score printed during training is computed on the current training batch. It is therefore noisy and will very likely overestimate the actual F1 score, but it is better than not having one.
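
To show what goes into that number, here is a plain-Python sketch of macro-averaged F1, the averaging mode the training loop requests from scikit-learn. It is meant as an illustration of the formula rather than a replacement for `f1_score`. The function name `macro_f1` and the six example labels are made up for this example.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: class 0 is "O", class 5 is one soft NER category
y_true = [0, 0, 0, 0, 5, 5]
y_pred = [0, 0, 0, 0, 0, 5]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7778
```

Here 5 of 6 tokens are correct (accuracy 0.83), yet macro F1 is lower because the one missed soft NER token halves that class's recall. Every class counts equally in the average, which is why macro F1 is a stricter measure than accuracy on imbalanced data.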

When you run the training code cell below, you will see lines like this printed out:
`Step 1/400  loss=1.8802  F1=0.048  any_nan_grad=False  any_nan_param=False`

You will get an update like this every 10 steps (or however many steps you set in the variable "printEvery"). "loss=" is the loss value, "F1=" is the rough F1 score at that step, and "any_nan_grad=" and "any_nan_param=" report whether any gradients or parameters are invalid (NaN) at that step. False means that none are invalid, and we should try to keep it that way. If one or both of them become "True" at some step, the loop stops early. Take note of the step number it reports and try decreasing the "trainingSteps" hyperparameter to a value below that number.

In your training:
- <b>Aim for a rough F1 score of at least `0.65`. The more the better.</b>
- <b>Aim for a loss value of around `0.4`. The lower the better.</b>


In [None]:
# TODO: Hyperparameters: try out multiple options for training steps and learning rate. 
# Suggested range of training steps: 200-1200
# Suggested choices of learning rate: 3e-5, 1e-5, 5e-6, 3e-6, 1e-6, 5e-7

trainingSteps = 400
learningRate = 1e-6
# --------

printEvery = 10 # This refers to how often the function prints something below to indicate training progress. You can leave it at 10 as default.

trainingDataLoader = DataLoader(tokenizedTrainingData, batch_size=batchSize, shuffle=True, collate_fn=dataCollator)
it = iter(trainingDataLoader)

model = model.to(device)

# To double-check again whether you are using GPU or CPU
print("Model device:", next(model.parameters()).device) 

model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=learningRate)

for step in range(1, trainingSteps + 1):
    try:
        batch = next(it)
    except StopIteration:
        it = iter(trainingDataLoader)
        batch = next(it)
    batch = {k: v.to(device) for k, v in batch.items()}

    optimizer.zero_grad()
    outputs = model(**batch)
    logits = outputs.logits.float()
    labels = batch["labels"]

    weight = classWeights.to(logits.device).to(logits.dtype)
    crossEntropyLossFunc = torch.nn.CrossEntropyLoss(weight=weight, ignore_index=-100) # using cross entropy loss
    loss = crossEntropyLossFunc(logits.view(-1, model.config.num_labels), labels.view(-1))
    loss.backward()

    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Compute quick F1 score
    preds = torch.argmax(logits, dim=-1)
    mask = labels != -100
    yTrue = labels[mask].cpu().numpy()
    yPred = preds[mask].cpu().numpy()
    # F1 across all classes (macro)
    batchF1 = f1_score(yTrue, yPred, average='macro', zero_division=0)

    anyNoneGradients = any((p.grad is not None) and torch.isnan(p.grad).any().item() for p in model.parameters())
    anyNoneParameters = any(torch.isnan(p).any().item() for p in model.parameters())

    if step % printEvery == 0 or step == 1:
        print(f"Step {step}/{trainingSteps}  loss={loss.item():.4f}  F1={batchF1:.3f}  any_nan_grad={anyNoneGradients}  any_nan_param={anyNoneParameters}")

    if anyNoneGradients or anyNoneParameters:
        print("Blank gradients or parameters detected, stopping early.")
        print(f"Stopped at step: {step}")
        break

print("Training complete.")

Now we need to save your model. Once stored locally, the trained model is represented as a folder. Please run the following code cell <b>right after</b> training the model. Do not run other code cells in between. These two steps must be done together!

In [None]:
# TODO: give your fine-tuned model a name. This cell saves the current model so you can reload it later.

nameOfFineTunedModel = "NameYourFineTunedModelHere"
fineTunedModelStoragePath = "./" + nameOfFineTunedModel

model.save_pretrained(fineTunedModelStoragePath)
tokenizer.save_pretrained(fineTunedModelStoragePath)
print("Saved model to", fineTunedModelStoragePath)

After running the above code cell, look in your working directory and check that it contains a folder with the name that you gave to your trained model.
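
If you would rather check from code than from your file browser, the sketch below looks for the files that `save_pretrained` typically writes. The helper name `missing_model_files` is hypothetical. The exact file names are an assumption based on common `transformers` behavior: the weights file is usually `model.safetensors` in recent versions and `pytorch_model.bin` in older ones.

```python
from pathlib import Path

def missing_model_files(model_dir: str) -> list:
    """Return the expected files that are absent from a saved model folder."""
    folder = Path(model_dir)
    if not folder.is_dir():
        return ["<folder not found>"]
    names = {p.name for p in folder.iterdir()}
    missing = []
    if "config.json" not in names:
        missing.append("config.json")
    # Either weights format is acceptable
    if not ({"model.safetensors", "pytorch_model.bin"} & names):
        missing.append("model.safetensors or pytorch_model.bin")
    if "tokenizer_config.json" not in names:
        missing.append("tokenizer_config.json")
    return missing

# Replace with the folder name you chose; an empty list means nothing is missing
print(missing_model_files("./NameYourFineTunedModelHere"))
```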

Now we will calculate our F1 score and study our model's performance on data that the model hasn't seen before, using the validation data that we set aside earlier with the function `makeValidationSplitByCategory`. Run the following code cell and observe the result printed below.

In [None]:
# Run without modifying: now we will calculate our F1 score and study our model's performance on data that the model hasn't seen. Watch out for printouts below this code cell

IDToLabelDict = {
    0: "O",
    1: "Soft-Communal/Public",
    2: "Soft-Extraterrestrial/Figurative",
    3: "Soft-Institutional",
    4: "Soft-Natural",
    5: "Soft-Private",
}

# Tokenize / align labels for both splits using your existing function
# Use the same tokenizer and labelAllSubtokens flag you used for training
fn = makeTokenizeAndAlignFn(tokenizer, labelAllSubtokens=False)  # or True, same as training
trainingDataForPostTrainingValidation = Dataset.from_list(trainingRecords)
validationDataForPostTrainingValidation = Dataset.from_list(validationRecords)

tokenizedTrain = trainingDataForPostTrainingValidation.map(fn, batched=True)
tokenizedVal = validationDataForPostTrainingValidation.map(fn, batched=True)

# Remove any extraneous string columns if present (collator expects input_ids, attention_mask, labels)
columnsNeeded = ["input_ids", "attention_mask", "labels"]
def keepColumns(ds):
    moveColumn = [column for column in ds.column_names if column not in columnsNeeded]
    if moveColumn:
        return ds.remove_columns(moveColumn)
    return ds

tokenizedTrain = keepColumns(tokenizedTrain)
tokenizedVal = keepColumns(tokenizedVal)

# Set format to torch (optional, but helpful if doing manual DataLoader)
tokenizedTrain.set_format(type="torch", columns=["input_ids","attention_mask","labels"])
tokenizedVal.set_format(type="torch", columns=["input_ids","attention_mask","labels"])

print("Tokenized train/val ready. Train size:", len(tokenizedTrain), "Val size:", len(tokenizedVal))
# keep tokenizedTrainingData variable name convention if your training pipeline expects it:
tokenizedTrainingData = tokenizedTrain
tokenizedValidationData = tokenizedVal


# validation that provides a yTrue and yPredict value
def runValidationPostTraining(model, tokenizedVal, batch_size=16):
    dl = DataLoader(tokenizedVal, batch_size=batch_size, collate_fn=dataCollator)
    model.eval()
    yTruePostTraining = []
    yPredictPostTraining = []
    for batch in dl:
        # move inputs to device
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            out = model(**batch)
            logits = out.logits.detach().cpu().numpy()   # shape (B, S, C)
        labels = batch["labels"].detach().cpu().numpy() # shape (B, S)

        preds_ids = np.argmax(logits, axis=2)  # (B, S)
        # iterate tokens and collect those where label != -100
        B, S = labels.shape
        for i in range(B):
            for j in range(S):
                lab = labels[i, j]
                if lab != -100:
                    yTruePostTraining.append(IDToLabelDict[int(lab)])
                    yPredictPostTraining.append(IDToLabelDict[int(preds_ids[i, j])])
    return yTruePostTraining, yPredictPostTraining

yTruePostTraining, yPredictPostTraining = runValidationPostTraining(model, tokenizedValidationData, batch_size=batchSize)
print(classification_report(yTruePostTraining, yPredictPostTraining, labels=list(IDToLabelDict.values()), zero_division=0))

Are you satisfied with the results? If not, you can retrain the model through adjusting hyperparameters, such as training steps, learning rate, resampling ratio, and others.

If you are satisfied, we can now load our trained model on our testing data (the JSON file created from "testingDataBuilder.py") and watch our fine-tuned model in action.

In [None]:
# Two tasks in this code cell
# TODO One: put the file path of your fine-tuned model in the variable "fineTunedModelPath"
fineTunedModelPath = "/Users/Jerry/Desktop/AsteXT/AsteXTCode/AsteXTCode2025-6/softNERModelFinalCategoryAware" 

print("Loading fine-tuned model...")
fineTunedTokenizer = AutoTokenizer.from_pretrained(fineTunedModelPath, use_fast=True)
fineTunedModel = AutoModelForTokenClassification.from_pretrained(fineTunedModelPath)

fineTunedModel.to(device)
print(f"Using device: {device}")
fineTunedModel.eval() 
print("Model loaded")

# TODO Two: add the file path of your testing data JSON. Make sure this JSON was created by "testingDataBuilder.py"
testingDataJsonFilePath = "/Users/Jerry/Desktop/AsteXT/AsteXTCode/AsteXTCode2025-6/testData1.json"
with open(testingDataJsonFilePath, "r") as testingDataFile:
    testingDataDict = json.load(testingDataFile)

listOfTestingSentences = [sentence for sentenceID, sentence in testingDataDict.items()]
print(f"You have loaded {len(listOfTestingSentences)} sentences to test.")

In [None]:
# Run without modifying: this code applies your trained model to your testing sentences to make predictions
from collections import Counter  # used by the debug printout below; harmless if already imported

def predictSoftNERCategories(text, model, tokenizer, device=None, debug=False):
    '''
    The "debug" parameter controls whether extra information is printed while running. It doesn't affect the model's predictions.
    '''
    if device is None:
        # Fall back to the device the model is already on
        device = next(model.parameters()).device
    elif isinstance(device, str):
        device = torch.device(device)

    model.to(device)
    model.eval()

    encoding = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        # max_length=max_length,
        return_offsets_mapping=True,
        return_attention_mask=True
    )

    # Prepare safe CPU copies for decoding
    offsetMap = encoding["offset_mapping"][0].cpu().tolist()
    inputIDForCPU = encoding["input_ids"][0].cpu().tolist()

    # Move the rest to device (but keep offsets on CPU)
    inputs = {k: v.to(device) for k, v in encoding.items() if k != "offset_mapping"}

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits

        
        logitCPU = logits.detach().cpu().float()
        if torch.isnan(logitCPU).any():
            # replace NaN values with very negative values so softmax is stable
            logitCPU = torch.nan_to_num(logitCPU, nan=-1e9, posinf=1e9, neginf=-1e9)

        probs = torch.nn.functional.softmax(logitCPU, dim=-1)
        confs, preds = probs.max(dim=-1)

    preds = preds[0].tolist()
    confs = confs[0].tolist()

    # Convert token IDs back to token strings
    tokens = tokenizer.convert_ids_to_tokens(inputIDForCPU, skip_special_tokens=False)

    results = []
    for token, predictionID, conf, (start, end) in zip(tokens, preds, confs, offsetMap):
        # Skip special tokens and padding via offsets (special tokens often have start==end)
        if start == end:
            continue
        word = text[start:end] if end > start else token
        labelName = IDToLabelDict.get(int(predictionID), "Unknown")

        # ensure finite confidence
        conf = float(conf) if (conf == conf and conf not in (float("inf"), float("-inf"))) else 0.0
        results.append((word, labelName, conf))

    if debug:
        print("Decoded counts:", Counter([lab for _, lab, _ in results]))

    return results

# ---
# Test run your fine-tuned model:
predictionsList = []
for sentence in listOfTestingSentences:
    predictions = predictSoftNERCategories(sentence, fineTunedModel, fineTunedTokenizer, device=device, debug=True)
    predictionsList.append(predictions)

print("\nPredictions:")
for predictions in predictionsList:
    for word, label, confidence in predictions:
        # To show only soft NER predictions, uncomment the next line and indent the print under it
        # if label != "O":
        print(f"  '{word}': {label} (confidence: {confidence:.2%})")

You can look at the printout to judge whether the predictions are accurate. If not, you can train the model again using different hyperparameters or add more labeled data. If you are happy with the results, then you have successfully fine-tuned a BERT model to identify soft Named Entities in Asian American short stories!
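
Reading every token can get tedious on a long story. As a convenience, here is a minimal sketch that keeps only the non-"O" predictions above a confidence threshold. The function name `keep_confident_entities`, the 0.5 threshold, and the sample values are hypothetical. The input is assumed to be a list of `(word, label, confidence)` tuples, the shape that `predictSoftNERCategories` returns.

```python
def keep_confident_entities(predictions, min_confidence=0.5):
    """Keep (word, label, confidence) tuples that are not "O" and meet the threshold."""
    return [
        (word, label, confidence)
        for word, label, confidence in predictions
        if label != "O" and confidence >= min_confidence
    ]

# Hypothetical output for one sentence
sample = [
    ("the", "O", 0.98),
    ("academy", "Soft-Institutional", 0.81),
    ("garden", "Soft-Natural", 0.34),
]
print(keep_confident_entities(sample))
```

Only the "academy" entry survives here. Low-confidence predictions are often the most informative ones to inspect by hand, so consider lowering the threshold when you are diagnosing the model rather than collecting results.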

---

The code below is for debugging purposes. You do not need to run it unless something unexpected happens.

In [None]:
# Do not run unless needed or if you are curious. 
'''This code cell is intended for you to diagnose your fine-tuned model in cases where something unexpected happens'''
# Diagnostic: why predictions are uniform 1/6
import os, torch, numpy as np
from collections import Counter
from transformers import AutoTokenizer, AutoModelForTokenClassification

# === Set paths / objects you used ===
# Update the model path to the path of your fine-tuned model:
model_path = "/Users/Jerry/Desktop/AsteXT/AsteXTCode/AsteXTCode2025-6/softNERModelCategoryVary_safe_checkpoint_pre200_0.4"  
print("MODEL PATH:", model_path)
print("Files in model dir:", sorted(os.listdir(model_path)))

# === Load model/tokenizer (no device move yet) ===
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForTokenClassification.from_pretrained(model_path)
print("model.config.num_labels:", model.config.num_labels)

# === 1) PARAM checks: NaNs / zeros / sums for a few representative params ===
def stats_for_params(m, max_show=12):
    total = 0
    any_nan = False
    zero_count = 0
    sample = []
    for i,(n,p) in enumerate(m.named_parameters()):
        total += p.numel()
        cpu = p.detach().cpu()
        nan = bool(torch.isnan(cpu).any())
        any_nan = any_nan or nan
        zeros = int((cpu == 0).sum().item())
        zero_count += zeros
        if i < max_show:
            sample.append((n, tuple(cpu.shape), float(cpu.abs().sum().item()), nan, zeros))
    return {"total": total, "any_nan": any_nan, "zero_count": zero_count, "sample": sample}

stats = stats_for_params(model)
print("\n=== PARAM STATS SUMMARY ===")
print(" total params:", stats["total"])
print(" any NaN in params?:", stats["any_nan"])
print(" total zeros:", stats["zero_count"])
print(" sample params (name,shape,sum_abs,hasNaN,zero_count):")
for r in stats["sample"]:
    print(" ", r)

# === 2) Classifier layer explicit check ===
print("\n=== CLASSIFIER PARAMS ===")
for n,p in model.named_parameters():
    if any(key in n.lower() for key in ("classifier","score","out","proj")):
        cpu = p.detach().cpu()
        print(n, p.shape, "sum_abs=", float(cpu.abs().sum().item()), "any_nan=", bool(torch.isnan(cpu).any()), "zeros=", int((cpu==0).sum().item()))

# === 3) If you still have balancedRecords in memory, print class token counts; otherwise, recompute counts from tokenized dataset.
try:
    balancedRecords  # noqa
    have_balanced = True
except NameError:
    have_balanced = False

print("\n=== DATA / LABEL DISTRIBUTION ===")
if have_balanced:
    print("Using balancedRecords variable (found in current session).")
    counts = Counter()
    total_tokens = 0
    for r in balancedRecords:
        for lbl in r["labels"]:
            total_tokens += 1
            counts[labelToID(lbl)] += 1
    for cid,cnt in sorted(counts.items()):
        print(f"  label {cid} ({IDToLabelDict.get(cid,'?')}): {cnt} tokens")
    print(" total tokens:", total_tokens)
else:
    # try to infer from tokenizedTrainingData if present
    try:
        tokenizedTrainingData  # noqa
        print("Using tokenizedTrainingData variable (found in current session).")
        # Inspect first few examples to ensure labels present and are ints
        print("Columns:", tokenizedTrainingData.column_names)
        for i in range(min(3, len(tokenizedTrainingData))):
            ex = tokenizedTrainingData[i]
            print(f"\nExample {i} keys and types:")
            for k,v in ex.items():
                t = type(v)
                if isinstance(v, (list, tuple)):
                    sample = v[:10]
                    print(f"  {k}: type={t}, len={len(v)}, sample={sample}")
                else:
                    print(f"  {k}: type={t}, sample={str(v)[:80]}")
        # aggregate label ids
        all_label_ids = []
        for i in range(min(2000, len(tokenizedTrainingData))):
            lab = tokenizedTrainingData[i]["labels"]
            # labels are lists of ints that may include -100; flatten them and count the entries that are not -100
            if isinstance(lab, list):
                all_label_ids.extend([x for x in lab if x != -100])
        print("Found label id distribution (sampled up to 2000 examples):", Counter(all_label_ids))
    except NameError:
        print("No balancedRecords or tokenizedTrainingData found in this session. Please run diagnostics with those available.")

# === 4) Inspect classWeights if in memory ===
try:
    classWeights  # noqa
    print("\nclassWeights (tensor):", classWeights)
    print(" classWeights dtype/device:", classWeights.dtype, classWeights.device)
    print(" classWeights values:", classWeights.tolist())
except NameError:
    print("\nclassWeights variable not found in this session.")

# === 5) Run a forward on a sample sentence and print logits (no softmax) and their stats ===
device = torch.device("cuda" if torch.cuda.is_available() else
                      "mps" if getattr(torch.backends,'mps',None) is not None and torch.backends.mps.is_available()
                      else "cpu")
print("\nRunning single-sample forward on device:", device)
model.to(device)
model.eval()

text = "The Academy again assures me they are searching for fresh engineers."
enc = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128, return_offsets_mapping=True)
offsets = enc["offset_mapping"][0].cpu().tolist()
input_ids_cpu = enc["input_ids"][0].cpu().tolist()
tokens = tokenizer.convert_ids_to_tokens(input_ids_cpu, skip_special_tokens=False)

inputs = {k: v.to(device) for k,v in enc.items() if k != "offset_mapping"}
with torch.no_grad():
    out = model(**inputs)
    logits = out.logits.detach().cpu().float()  # move to cpu float

print("LOGITS shape:", logits.shape)
print("Tokens+offsets:", [(i, tokens[i], offsets[i]) for i in range(min(8, len(tokens)))])
print("\nLogits first 8 positions:\n", logits[0,:8,:].numpy())
print("Per-position min max (first 8):")
arr = logits[0].numpy()
print(" mins:", np.min(arr[:8,:], axis=1))
print(" maxs:", np.max(arr[:8,:], axis=1))
print("Any NaNs in logits?:", np.isnan(arr).any())
print("Unique rows (rounded):", np.unique(np.round(arr,6), axis=0).shape)

# === 6) Compare classifier params to base model (are they unchanged?) ===
print("\nComparing classifier weights to fresh base model (bert-base-uncased)...")
base = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=model.config.num_labels)
def collect_classifier(m):
    d={}
    for n,p in m.named_parameters():
        if any(k in n.lower() for k in ("classifier","score","out","proj")) or n.endswith("classifier.weight") or n.endswith("classifier.bias"):
            d[n]=p.detach().cpu().numpy()
    return d
base_cls = collect_classifier(base)
saved_cls = collect_classifier(model)
print("Classifier param names in base:", list(base_cls.keys()))
print("Classifier param names in saved:", list(saved_cls.keys()))
for k in base_cls:
    if k in saved_cls:
        diff = np.linalg.norm(saved_cls[k].ravel() - base_cls[k].ravel())
        print(" param:", k, "L2 diff from base:", diff, " sum_abs_saved:", np.abs(saved_cls[k]).sum())
    else:
        print(" param:", k, "not found in saved model keys")

print("\n=== DIAGNOSTIC COMPLETE ===")


In [None]:
# Unused training code: you do not need to use the following code and functions. I'm keeping them here as backup options.


# class WeightedTrainer(Trainer):
#     def __init__(self, *args, classWeights=None, **kwargs):
#         super().__init__(*args, **kwargs)
#         self.classWeights = classWeights  # keep CPU tensor

#     def compute_loss(self, model, inputs, return_outputs=False):
#         labels = inputs.pop("labels")
#         outputs = model(**inputs)
#         logits = outputs.logits.float()  # ensure float32
#         weight = self.classWeights.to(logits.device).to(logits.dtype)
#         lossFct = nn.CrossEntropyLoss(weight=weight, ignore_index=-100)
#         loss = lossFct(logits.view(-1, model.config.num_labels), labels.view(-1))
#         return (loss, outputs) if return_outputs else loss

# trainingArgs = TrainingArguments(
#     output_dir="./softNERModel",
    
#     # Core hyperparameters
#     num_train_epochs=3,
#     learning_rate=1e-6,
#     max_grad_norm = 1.0,
#     per_device_train_batch_size=batchSize,
    
#     # Regularization
#     weight_decay=0.01,
#     warmup_ratio=0.1,
    
#     save_strategy="steps",
    
#     # Logging
#     save_steps=500,
#     logging_steps=100,
    
#     # Performance
#     fp16=False,  # Set False if you get errors
# )

# trainer = WeightedTrainer(
#     model=model,
#     args=trainingArgs,
#     train_dataset=tokenizedTrainingData,
#     data_collator=dataCollator,
#     classWeights=classWeights
# )