# Part 2 & 3: Hindi Chunking and Performance Analysis

This notebook implements shallow parsing (chunking) on a Hindi CoNLL-U dataset using DistilBERT. While the logic builds upon the token classification workflow established in the Hugging Face tutorial (Part 1), significant modifications were necessary to handle the custom dataset format and the specific requirements of the Hindi Chunking task.

**Dataset Credit**: The dataset was provided by the university server (Hindi HDTB-UD).

**Model Credit**: The fine-tuned Hindi model used in Experiment 3 was provided by `mirfan899` on Hugging Face.

### Overview of Changes from Part 1
- **Data Loading**: Unlike Part 1, which downloaded WNUT data from a URL using `requests`, this notebook reads local `.conllu` files. Custom parsing logic was written to extract `ChunkId` and `ChunkType` from the MISC column (the tenth field, zero-based index 9).
- **Label Schema**: The raw data does not provide IOB tags. Custom logic was implemented to convert raw chunk IDs (e.g., `NP`) into standard IOB format (`B-NP`, `I-NP`) by detecting boundary changes.
- **Task**: The task is shifted from Named Entity Recognition (NER) to Chunking.

### Experiments
Three models are compared to analyze the impact of language specificity:
1. `distilbert-base-multilingual-cased`: A general multilingual baseline.
2. `distilbert-base-uncased`: An English-only baseline (expected to perform poorly, serving as a negative control).
3. `mirfan899/hindi-distilbert-ner`: A model fine-tuned on Hindi, expected to offer the best performance.

Additionally, a **Joint Classification** task is implemented to predict both Chunk ID (NP, VP) and Chunk Type (Head, Child) simultaneously.

## Setup
Standard imports were used. `evaluate` and `transformers` libraries are reused from Part 1.

In [None]:
import os
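# Disable the MPS allocator's upper memory limit (0.0 removes the cap).
# This avoids allocator out-of-memory errors on Mac, at the risk of exhausting system memory.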
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"

import re
import numpy as np
import evaluate
from datasets import Dataset, DatasetDict, ClassLabel, Features, Sequence, Value
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer, DataCollatorForTokenClassification
import torch

## Data Loading and Parsing

### Custom Parsing Logic
In Part 1, the dataset could be split into sentences simply on empty lines. The CoNLL-U format used here is richer: each token line has ten tab-separated fields. A custom parser was written to read the last field, MISC (the tenth column, zero-based index 9), where the chunk annotations are stored as `key=value` pairs separated by `|`.

**Logic for Chunk Isolation**:
Raw Chunk IDs often contain numbers (e.g., `NP2`) to distinguish them from adjacent chunks of the same type (e.g., `NP1 NP2`). 
If we stripped these numbers immediately, `NP1` and `NP2` would merge into a single continuous `NP` chunk. Therefore, the **raw IDs are preserved** during this parsing stage to allow for accurate boundary detection (`B-` vs `I-`) in the Label Generation step later.
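
As a minimal sketch of the extraction step, the snippet below splits a single MISC value into key/value pairs and reads the two chunk fields while leaving the numeric suffix untouched. The helper name `parse_misc` and the sample string (including its `Translit` entry) are hypothetical and only modeled on the format described above.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC field into a dict of key/value pairs."""
    fields = {}
    if misc == "_":  # "_" marks an empty field in CoNLL-U
        return fields
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        if sep:  # skip items that have no "=" sign
            fields[key] = value
    return fields


# Hypothetical MISC value, modeled on the format described above
misc = "ChunkId=NP2|ChunkType=head|Translit=xyz"
fields = parse_misc(misc)

# Fall back to "O" when a token carries no chunk annotation
chunk_id = fields.get("ChunkId", "O")
chunk_type = fields.get("ChunkType", "O")
print(chunk_id, chunk_type)  # NP2 head
```

Using `partition("=")` rather than `split("=")` keeps any later `=` characters inside the value, which is a slightly more defensive choice than the parser strictly needs.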

In [None]:
def parse_conllu(filepath):
    sentences = []
    with open(filepath, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    tokens = []
    chunk_ids = []
    chunk_types = []

    for line in lines:
        line = line.strip()
        # Blank lines end a sentence; comment lines ('#') carry no tokens.
        # In both cases, flush any tokens collected so far and move on.
        if not line or line.startswith('#'):
            if tokens:
                sentences.append({'tokens': tokens, 'chunk_ids': chunk_ids, 'chunk_types': chunk_types})
                tokens = []
                chunk_ids = []
                chunk_types = []
            continue
        
        parts = line.split('\t')
        if len(parts) < 10:
            continue
            
        token = parts[1]
        misc = parts[9]
        
        # Extract ChunkId and ChunkType from column 9
        # Example format: ChunkId=NP|ChunkType=child|...
        c_id = 'O'
        c_type = 'O'
        
        misc_parts = misc.split('|')
        for mp in misc_parts:
            if mp.startswith('ChunkId='):
                # Critical: numbers are NOT stripped yet. 'NP2' is kept as 'NP2'.
                # maxsplit=1 keeps the full value even if it contains another '='.
                c_id = mp.split('=', 1)[1]
            elif mp.startswith('ChunkType='):
                c_type = mp.split('=', 1)[1]
        
        tokens.append(token)
        chunk_ids.append(c_id)
        chunk_types.append(c_type)

    # Append final sentence
    if tokens:
        sentences.append({'tokens': tokens, 'chunk_ids': chunk_ids, 'chunk_types': chunk_types})
        
    return sentences

data_dir = "data/a2"
train_data = parse_conllu(os.path.join(data_dir, "hi_hdtb-ud-train.conllu"))
dev_data = parse_conllu(os.path.join(data_dir, "hi_hdtb-ud-dev.conllu"))
test_data = parse_conllu(os.path.join(data_dir, "hi_hdtb-ud-test.conllu"))

print(f"Loaded {len(train_data)} training sentences.")
print(f"Loaded {len(dev_data)} validation sentences.")
print(f"Loaded {len(test_data)} test sentences.")


### Data Inspection
A quick check was performed to identify the distribution of 'O' (Outside) tags, giving insight into how sparse the chunks are.

In [None]:
def check_o_tags(sentences):
    total_tokens = 0
    o_tags = 0
    for sent in sentences:
        total_tokens += len(sent['chunk_ids'])
        o_tags += sent['chunk_ids'].count('O')
    
    print(f"Total Tokens: {total_tokens}")
    print(f"Tokens with 'O' ChunkID: {o_tags}")
    print(f"Percentage 'O': {o_tags/total_tokens*100:.2f}%")

print("--- Train Data ---")
check_o_tags(train_data)
print("Sample:", train_data[0])

## Label Generation (IOB + Joint Tags)

### Why Custom Logic?
Unlike the WNUT dataset in Part 1, which came with pre-defined NER tags (`B-corporation`, etc.), this dataset only provides raw IDs (`NP`, `VP`). To use standard token classification models, IOB (Inside-Outside-Beginning) tags must be generated programmatically.

**Algorithm Description**:
The function iterates through the tokens and compares the current `ChunkId` with the previous one.
1. **B-Tag (Beginning)**: Assigned if the current ID differs from the previous one (e.g., switching from `NP1` to `NP2`, or `O` to `NP`).
2. **I-Tag (Inside)**: Assigned if the current ID matches the previous one (e.g., `NP1` follows `NP1`).
3. **Number Stripping**: Once the prefix (B/I) is determined, the trailing numbers are removed (`NP2` -> `NP`) to create generalizable classes.
4. **Joint Classification (Bonus)**: The final label combines the IOB prefix, the stripped Chunk ID, and the Chunk Type, e.g., `B-NP-head`.
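
The steps above can be illustrated on a toy sequence. This is a simplified sketch rather than the notebook's own function: `to_iob` and the sample IDs are hypothetical, and only `O` is treated as Outside here. It contrasts stripping the numbers after boundary detection with stripping them first.

```python
import re

NUM_SUFFIX = re.compile(r"\d+$")


def to_iob(raw_ids, chunk_types=None):
    """Convert raw chunk IDs to IOB tags; numeric suffixes are stripped last."""
    tags = []
    for i, cid in enumerate(raw_ids):
        if cid == "O":
            tags.append("O")
            continue
        # A chunk begins at the first token or wherever the raw ID changes
        prefix = "B" if i == 0 or cid != raw_ids[i - 1] else "I"
        tag = f"{prefix}-{NUM_SUFFIX.sub('', cid)}"
        if chunk_types is not None:  # joint label, e.g. B-NP-head
            tag += f"-{chunk_types[i]}"
        tags.append(tag)
    return tags


raw = ["NP", "NP", "NP2", "NP2", "VP"]
stripped_first = [NUM_SUFFIX.sub("", cid) for cid in raw]

print(to_iob(raw))             # ['B-NP', 'I-NP', 'B-NP', 'I-NP', 'B-VP']
print(to_iob(stripped_first))  # ['B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-VP']
```

With the raw IDs, the boundary between `NP` and `NP2` yields a second `B-NP`. Once the numbers are stripped up front, the two noun chunks collapse into one.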

In [None]:
def generate_labels(sentences):
    processed = []
    all_labels = set()
    
    for sent in sentences:
        tokens = sent['tokens']
        c_ids = sent['chunk_ids']
        c_types = sent['chunk_types']
        
        ner_tags = []
        
        for i, (cid, ctype) in enumerate(zip(c_ids, c_types)):
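            # Handling punctuation/blanks as Outside. The check uses the raw ID,
            # so only an exact 'BLK' matches; numbered variants are not caught here.
            if cid in ['O', 'BLK', '']: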
                label = 'O'
                ner_tags.append(label)
                all_labels.add(label)
                continue
            
            # Determine Prefix (B vs I)
            # Start of a chunk if:
            # 1. It is the first token
            # 2. OR the current Raw ChunkID is different from the previous one
            if i == 0 or cid != c_ids[i-1]:
                prefix = 'B'
            else:
                prefix = 'I'
            
            # Strip number from ChunkId for the final label
            # e.g. NP2 -> NP
            clean_cid = re.sub(r'\d+$', '', cid)
            
            # Constructing the Joint Label
            label = f"{prefix}-{clean_cid}-{ctype}"
            
            ner_tags.append(label)
            all_labels.add(label)
            
        processed.append({'tokens': tokens, 'ner_tags': ner_tags})
        
    return processed, sorted(all_labels)

train_processed, label_list = generate_labels(train_data)
dev_processed, _ = generate_labels(dev_data)
test_processed, _ = generate_labels(test_data)

label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

print("Labels:", label_list)
print("Sample processed:", train_processed[0])

## Dataset Conversion
The processed data is converted into Hugging Face `datasets` objects (one `Dataset` per split, grouped in a `DatasetDict`), with each label string mapped to its integer ID. This standardizes the input for the Trainer API used in subsequent steps.

In [None]:
def convert_to_hf_dataset(processed_data, label2id):
    hf_data = {
        'tokens': [],
        'ner_tags': []
    }
    for item in processed_data:
        hf_data['tokens'].append(item['tokens'])
        tags_ids = [label2id[tag] for tag in item['ner_tags']]
        hf_data['ner_tags'].append(tags_ids)
    return Dataset.from_dict(hf_data)

hf_train = convert_to_hf_dataset(train_processed, label2id)
hf_dev = convert_to_hf_dataset(dev_processed, label2id)
hf_test = convert_to_hf_dataset(test_processed, label2id)

dataset = DatasetDict({
    'train': hf_train,
    'validation': hf_dev,
    'test': hf_test
})

## Evaluation Metrics
The `seqeval` metric is used to compute precision, recall, and F1 at the chunk (span) level, along with token-level accuracy. The `compute_metrics` function was reused directly from Part 1.
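
Before scoring, positions labeled `-100` have to be dropped and the remaining IDs mapped back to label strings. The following is a dependency-free sketch of that filtering step only. The helper `decode_for_scoring` and the toy label set are hypothetical, and the scoring itself is left to `seqeval`.

```python
IGNORE_INDEX = -100


def decode_for_scoring(pred_ids, gold_ids, labels):
    """Drop ignored positions and map the remaining IDs to label strings."""
    preds, golds = [], []
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        # Keep only positions whose gold label is a real class ID
        kept = [(p, g) for p, g in zip(pred_row, gold_row) if g != IGNORE_INDEX]
        preds.append([labels[p] for p, _ in kept])
        golds.append([labels[g] for _, g in kept])
    return preds, golds


labels = ["B-NP", "I-NP", "O"]          # toy label set
pred_ids = [[2, 0, 1, 1, 2]]            # one sentence, five subword positions
gold_ids = [[-100, 0, -100, 1, -100]]   # only positions 1 and 3 are scored

preds, golds = decode_for_scoring(pred_ids, gold_ids, labels)
print(preds)  # [['B-NP', 'I-NP']]
print(golds)  # [['B-NP', 'I-NP']]
```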

In [None]:
seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    
    # Convert IDs back to labels, filtering out -100 (special tokens and non-first subwords)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

## Tokenization Helper

### Reused Logic from Part 1
The `tokenize_and_align_labels` function (encapsulated here within `tokenize_dataset` for cleaner experimentation) is reused from the Hugging Face tutorial code in Part 1. 

**Purpose**: The tokenizers used here perform subword tokenization (WordPiece), so one word may be split into multiple tokens. The single label provided for the word (e.g., `B-NP`) must be aligned to these tokens. Following the tutorial strategy, the label is assigned to the **first subword**; special tokens and subsequent subwords are set to `-100`, the index that PyTorch's cross-entropy loss ignores by default.
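
To make the alignment rule concrete, here is a small sketch that works on a hand-written `word_ids` list instead of a real tokenizer output. The function name `align_to_first_subword` and the sample values are hypothetical. `None` stands for special tokens, as it does in the list returned by a fast tokenizer's `word_ids()`.

```python
def align_to_first_subword(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a continuation subword of the same word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Hypothetical 3-word sentence in which word 1 is split into three subwords
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [4, 7, 2]  # one label ID per word

print(align_to_first_subword(word_ids, word_labels))
# [-100, 4, 7, -100, -100, 2, -100]
```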

In [None]:
def tokenize_dataset(dataset, tokenizer_checkpoint):
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)
    
    def align_labels(examples):
        tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
        labels = []
        for i, label in enumerate(examples["ner_tags"]):
            word_ids = tokenized_inputs.word_ids(batch_index=i)
            previous_word_idx = None
            label_ids = []
            for word_idx in word_ids:
                if word_idx is None:
                    label_ids.append(-100)
                elif word_idx != previous_word_idx:
                    label_ids.append(label[word_idx])
                else:
                    label_ids.append(-100)
                previous_word_idx = word_idx
            labels.append(label_ids)
        tokenized_inputs["labels"] = labels
        return tokenized_inputs

    return dataset.map(align_labels, batched=True), tokenizer

## Experiment 1: Multilingual Baseline

DistilBERT Multilingual (`distilbert-base-multilingual-cased`) was selected as the baseline. It supports Hindi and provides a reasonable starting point for performance comparison.

In [None]:
CHECKPOINT_MULTI = "distilbert-base-multilingual-cased"

# 1. Prepare Data
tokenized_multi, tokenizer_multi = tokenize_dataset(dataset, CHECKPOINT_MULTI)
data_collator_multi = DataCollatorForTokenClassification(tokenizer=tokenizer_multi)

In [None]:
# 2. Initialize Model
model_multi = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT_MULTI, 
    num_labels=len(label_list), 
    id2label=id2label, 
    label2id=label2id
)

# Clear MPS cache (Mac usage)
if torch.backends.mps.is_available():
    torch.mps.empty_cache()

In [None]:
# 3. Train
args_multi = TrainingArguments(
    output_dir="results_multi_cased",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    logging_steps=50,
    group_by_length=True 
)

trainer_multi = Trainer(
    model=model_multi,
    args=args_multi,
    train_dataset=tokenized_multi["train"],
    eval_dataset=tokenized_multi["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator_multi,
    processing_class=tokenizer_multi,
)

trainer_multi.train()

In [None]:
# 4. Evaluate
results_multi_eval = trainer_multi.evaluate(tokenized_multi["test"])
print("Multilingual Model Results:", results_multi_eval)

## Experiment 2: English Baseline

DistilBERT (`distilbert-base-uncased`) was pretrained solely on English data. This experiment serves as a negative control: if an English-only model still performs well, the task likely depends more on structural cues such as punctuation than on knowledge of Hindi. We expect this model to perform poorly on Hindi text.

In [None]:
CHECKPOINT_EN = "distilbert-base-uncased"

# 1. Prepare Data
tokenized_en, tokenizer_en = tokenize_dataset(dataset, CHECKPOINT_EN)
data_collator_en = DataCollatorForTokenClassification(tokenizer=tokenizer_en)

In [None]:
# 2. Initialize Model
model_en = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT_EN, 
    num_labels=len(label_list), 
    id2label=id2label, 
    label2id=label2id
)

if torch.backends.mps.is_available():
    torch.mps.empty_cache()

In [None]:
# 3. Train
args_en = TrainingArguments(
    output_dir="results_en_uncased",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    logging_steps=50,
    group_by_length=True
)

trainer_en = Trainer(
    model=model_en,
    args=args_en,
    train_dataset=tokenized_en["train"],
    eval_dataset=tokenized_en["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator_en,
    processing_class=tokenizer_en,
)

trainer_en.train()

In [None]:
# 4. Evaluate
results_en_eval = trainer_en.evaluate(tokenized_en["test"])
print("English Model Results:", results_en_eval)

## Experiment 3: Fine-tuned Hindi Model

The model `mirfan899/hindi-distilbert-ner` was chosen because it has already been fine-tuned on Hindi data. It is expected to capture the linguistic features of Hindi better than the generic multilingual model. `ignore_mismatched_sizes=True` is used because the checkpoint's original NER classification head does not match the shape of our new 31-class chunking head; with this flag, the mismatched head weights are skipped and the new head is randomly initialized.

In [None]:
CHECKPOINT_HINDI = "mirfan899/hindi-distilbert-ner"

# 1. Prepare Data
tokenized_hindi, tokenizer_hindi = tokenize_dataset(dataset, CHECKPOINT_HINDI)
data_collator_hindi = DataCollatorForTokenClassification(tokenizer=tokenizer_hindi)

In [None]:
# 2. Initialize Model
# Added ignore_mismatched_sizes=True to handle the new label head
model_hindi = AutoModelForTokenClassification.from_pretrained(
    CHECKPOINT_HINDI, 
    num_labels=len(label_list), 
    id2label=id2label, 
    label2id=label2id,
    ignore_mismatched_sizes=True
)

if torch.backends.mps.is_available():
    torch.mps.empty_cache()

In [None]:
# 3. Train
args_hindi = TrainingArguments(
    output_dir="results_hindi_ft",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,
    logging_steps=50,
    group_by_length=True
)

trainer_hindi = Trainer(
    model=model_hindi,
    args=args_hindi,
    train_dataset=tokenized_hindi["train"],
    eval_dataset=tokenized_hindi["validation"],
    compute_metrics=compute_metrics,
    data_collator=data_collator_hindi,
    processing_class=tokenizer_hindi,
)

trainer_hindi.train()

In [None]:
# 4. Evaluate
results_hindi_eval = trainer_hindi.evaluate(tokenized_hindi["test"])
print("Hindi Model Results:", results_hindi_eval)

## Visualization

To better understand each model's behavior, predictions for randomly sampled test sentences are printed alongside the gold labels. This qualitative check helps verify that the model produces well-formed output (e.g., every `I-NP-*` tag continues a chunk opened by a `B-NP-*` tag).
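
The same sanity check can also be automated. The sketch below flags `I-` tags that do not continue a chunk with the same Chunk ID. It is a hypothetical helper, not part of the notebook, and it assumes labels of the form `PREFIX-ID` or `PREFIX-ID-TYPE` with no hyphen inside the Chunk ID.

```python
def find_orphan_inside_tags(tags):
    """Return indices of I- tags that do not continue a chunk with the same ID."""
    orphans = []
    open_chunk = None  # Chunk ID of the chunk currently open, if any
    for i, tag in enumerate(tags):
        if tag == "O":
            open_chunk = None
            continue
        parts = tag.split("-")
        prefix, chunk_id = parts[0], parts[1]  # any type suffix is ignored
        if prefix == "I" and chunk_id != open_chunk:
            orphans.append(i)
        open_chunk = chunk_id
    return orphans


# Hypothetical predicted sequence using the joint label scheme
predicted = ["B-NP-head", "I-NP-child", "O", "I-VP-head", "B-NP-head", "I-VP-child"]
print(find_orphan_inside_tags(predicted))  # [3, 5]
```

Running this over the `Predicted` lists printed by the visualization would give a rough count of ill-formed sequences per model.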

In [None]:
import random

def show_random_predictions(model, tokenizer, dataset, num_examples=10):
    print(f"\n--- Random Predictions for {model.name_or_path} ---")
    
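    # Switch to inference mode so dropout does not perturb the predictions
    model.eval()

    # Pick random indices (never more than the dataset contains)
    indices = random.sample(range(len(dataset)), min(num_examples, len(dataset)))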
    
    for idx in indices:
        example = dataset[idx]
        tokens = example['tokens']
        labels = example['ner_tags'] # These are IDs
        
        # Tokenize (Keep the BatchEncoding object to get word_ids)
        tokenized_input = tokenizer(tokens, truncation=True, is_split_into_words=True, return_tensors="pt")
        
        # Move inputs to device
        inputs = {k: v.to(model.device) for k,v in tokenized_input.items()}
        
        # Predict
        with torch.no_grad():
            outputs = model(**inputs)
        
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2)[0].cpu().numpy()
        
        # Align back to words
        # word_ids() works on the BatchEncoding object (tokenized_input), NOT the dict
        word_ids = tokenized_input.word_ids()
        aligned_preds = []
        aligned_labels = []
        seen_words = set()
        
        for i, word_id in enumerate(word_ids):
            if word_id is None: continue
            if word_id not in seen_words:
                # First token of the word -> keep prediction
                pred_label = id2label[predictions[i]]
                true_label = id2label[labels[word_id]] if labels[word_id] != -100 else "O"
                
                aligned_preds.append(pred_label)
                aligned_labels.append(true_label)
                seen_words.add(word_id)
        
        # Display
        print(f"\nSentence: {' '.join(tokens)}")
        print(f"Predicted: {aligned_preds}")
        print(f"True:      {aligned_labels}")

In [None]:
show_random_predictions(model_multi, tokenizer_multi, hf_test)

In [None]:
show_random_predictions(model_en, tokenizer_en, hf_test)

In [None]:
show_random_predictions(model_hindi, tokenizer_hindi, hf_test)

## Summary of Results
The performance of all three models is aggregated below.

In [None]:
import pandas as pd

summary = {
    "Model": ["Multilingual", "English (Uncased)", "Hindi (Fine-tuned)"],
    "Precision": [results_multi_eval['eval_precision'], results_en_eval['eval_precision'], results_hindi_eval['eval_precision']],
    "Recall": [results_multi_eval['eval_recall'], results_en_eval['eval_recall'], results_hindi_eval['eval_recall']],
    "F1": [results_multi_eval['eval_f1'], results_en_eval['eval_f1'], results_hindi_eval['eval_f1']],
    "Accuracy": [results_multi_eval['eval_accuracy'], results_en_eval['eval_accuracy'], results_hindi_eval['eval_accuracy']]
}

df = pd.DataFrame(summary)
print(df)