## PART 0: Prep

In this section, we mount Google Drive and load the evaluation dataset (`evaluationDataset_final.json`) from the shared directory.

We also print:
- The total number of records
- A sample of the first record to verify structure and formatting

In [1]:
from google.colab import drive
import json

drive.mount('/content/drive')
file_path = "/content/drive/MyDrive/x_data/final_data/evaluationDataset_final.json"

with open(file_path, "r", encoding="utf-8") as f:
    data = json.load(f)
print("="*15)
print("Number of eval records:", len(data))
print("First record example:", data[0])

Number of eval records: 1000
First record example: {'id': '517', 'id_num': 517, 'text': 'Yes. AI is going to replace bullshit jobs because AI can give bulshit answers just as convincingly as a human. A nutritionist.', 'date': '10-06-2024', 'popularity': 2, 'polarity': 'neutral', 'subjectivity': 'neutral', 'source': 'reddit', 'category': ['Technology & IT', 'Consumer Goods & Services', 'Non-Profit & Social Services', 'Human Resources & Talent Management', 'Legal & Compliance'], 'manual_labels': {'annotator1': {'polarity': 'positive', 'subjectivity': 'neutral', 'category': ['Legal & Compliance', 'Consumer Goods & Services', 'Technology & IT', 'Non-Profit & Social Services', 'Human Resources & Talent Management']}, 'annotator2': {'polarity': 'neutral', 'subjectivity': 'neutral', 'category': ['Legal & Compliance', 'Consumer Goods & Services', 'Technology & IT', 'Human Resources & Talent Management', 'Non-Profit & Social Services']}}, 'sarcasm_label': 'sarcastic', 'named_entities': [], 'con
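
Printing one record is a quick spot check. It may also be worth confirming that every record carries the field the extraction functions read later. The following is a minimal sketch, assuming each record is a dict and that `text` is the only required field; the helper name `find_malformed_records` and the toy records are hypothetical and not part of the notebook.

```python
def find_malformed_records(records, required_keys=("text",)):
    """Return indices of records that are not dicts or lack a required key."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or any(k not in rec for k in required_keys):
            bad.append(i)
    return bad


# Toy records standing in for the loaded `data` list (illustrative only)
sample = [
    {"id": "1", "text": "AI will change hiring."},
    {"id": "2"},        # missing "text"
    "not a record",     # wrong type
]
print(find_malformed_records(sample))
```

On the real dataset the call would be `find_malformed_records(data)`; an empty list means every record has a `text` field.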

## PART 1: Concept Extraction

In this section, we prepare the environment for **concept extraction**, which involves detecting keywords or phrases from text and assigning them to domain-specific categories such as `ai_tech`, `jobs_and_careers`, and others.

### 1.1: Install Required Libraries

We use:
- `spaCy` for NLP and phrase matching
- `transformers` for zero-shot classification (used later in advanced stages)
- `rapidfuzz` for fuzzy keyword matching
- `torch` as a backend for transformer models



In [2]:

##############################################
# CONCEPT EXTRACTION
##############################################
# Concept extraction here is domain-specific: it looks for keywords/phrases
# in the text and groups them under predefined "concepts" (categories).


In [3]:
# !python -m spacy download en_core_web_trf

Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
Collecting spacy-curated-transformers<1.0.0,>=0.2.2 (from en-core-web-trf==3.8.0)
  Downloading spacy_curated_transformers-0.3.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting curated-transformers<0.2.0,>=0.1.0 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_transformers-0.1.1-py2.py3-none-any.whl.metadata (965 bytes)
Collecting curated-tokenizers<0.1.0,>=0.0.9 (from spacy-curated-transformers<1.0.0,>=0.2.2->en-core-web-trf==3.8.0)
  Downloading curated_tokenizers-0.0.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Downloading spacy_curated_transformers-0.3.0-py2.py3-none-any.whl (236 kB)

In [2]:
!pip install spacy rapidfuzz transformers torch --quiet


In [3]:
import json
import re
import spacy
from spacy.matcher import PhraseMatcher
from rapidfuzz import fuzz
from transformers import pipeline

# PART 2: Define the Concept Dictionary
This section defines the final **concept dictionary** used for multi-method concept extraction.
The dictionary is a curated collection of **domain-relevant categories and associated keywords**, constructed based on the recurring themes in AI and employment discourse across Twitter, Reddit, and LinkedIn.

Each key in the dictionary corresponds to a concept category (e.g., `ai_tech`, `jobs_and_careers`, `mental_health`) and maps to a list of indicative keywords or phrases. These keywords are used by every version of the concept extraction pipeline (V1–V5.2), with increasingly sophisticated matching strategies.

The dictionary blends:
- **Formal terms** (e.g., “promotion”, “automation”, “regulation”)
- **Informal expressions** (e.g., “jobpocalypse”, “quiet quitting”)
- **Entity names** (e.g., “OpenAI”, “FAANG”, “ChatGPT”)

This provides a foundation for rule-based (V1–V3) and semantically enhanced (V4 onward) tagging of noisy user-generated text.


In [4]:
# ----------------------------------------------------------------------------------
# Concept Dictionary
# ----------------------------------------------------------------------------------

concept_dictionary = {
    "jobs_and_careers": [
        "job", "jobs", "career", "hire", "hiring", "position", "occupation", "employment",
        "recruitment", "headhunting", "resumé", "job search", "job application", "interview",
        "promotion", "layoff", "downsizing", "resignation", "fired", "retrenched", "workforce",
        "unemployment", "job board", "career change", "gig economy", "contract work",
        "freelance", "upskilling", "reskilling", "talent shortage", "skills gap",
        "career transition", "remote job", "flexible work", "hybrid work"
    ],

    "ai_tech": [
        "ai", "artificial intelligence", "ml", "machine learning", "deep learning",
        "neural network", "natural language processing", "NLP", "computer vision",
        "generative ai", "chatgpt", "gpt-4", "llm", "transformer model", "reinforcement learning",
        "automation", "autonomous system", "robot", "ai-powered", "ai-driven",
        "ai algorithm", "ai model", "ai tool", "ai system", "ai platform", "ai assistant",
        "predictive modeling", "intelligent agent", "intelligent automation", "digital worker",
        "AI ethics", "AI governance", "AI safety", "AI alignment", "AI regulation",
        "OpenAI", "Hugging Face", "LangChain", "Chatbot", "BERT", "GPT", "GAN",
        "AI job disruption", "AI replacing jobs", "AI impact", "AI revolution"
    ],

    "job_market_trends": [
        "job market", "labor market", "workforce trends", "employment trends",
        "market shift", "skills demand", "future of work", "career outlook",
        "talent acquisition", "workforce transformation", "job automation",
        "digital transformation", "economic uncertainty", "AI economy", "skills of the future",
        "AI in hiring", "job displacement", "employment shift", "workplace change"
    ],

    "finance": [
        "bank", "investment", "finance", "trading", "fintech", "income", "earnings",
        "financial security", "salary", "paycheck", "minimum wage", "wealth gap",
        "cost of living", "job loss compensation", "recession", "economic downturn",
        "furlough", "financial hardship", "severance", "retirement fund"
    ],

    "mental_health": [
        "stress", "burnout", "anxiety", "job stress", "mental health", "wellbeing",
        "work-life balance", "job insecurity", "career anxiety", "AI anxiety"
    ],

    "education_training": [
        "upskilling", "reskilling", "online course", "certification", "MOOC", "bootcamp",
        "lifelong learning", "career coaching", "learning path", "digital skills",
        "AI literacy", "tech training", "coursework", "workforce development"
    ],

    "automation_and_displacement": [
        "automation", "job automation", "automated process", "workflow automation",
        "displaced workers", "displacement", "automated job loss", "automated replacement",
        "robotic process automation", "RPA", "bots replacing humans", "AI takeover",
        "human redundancy", "task automation"
    ],

    "policy_and_governance": [
        "universal basic income", "UBI", "AI regulation", "labor law", "AI tax",
        "tech policy", "future legislation", "AI governance", "worker protection",
        "job guarantee", "government retraining", "economic policy", "union response",
        "digital rights", "job security policy"
    ],

    "recruitment_technology": [
        "AI recruitment", "AI hiring", "talent filter", "algorithmic hiring",
        "resume screening", "candidate ranking", "job matching platform", "talent intelligence",
        "hiring automation", "virtual interview", "digital HR", "ATS", "HR tech"
    ],

    "remote_and_gig_work": [
        "gig economy", "freelancing", "remote job", "remote-first", "digital nomad",
        "platform work", "side hustle", "Uberization", "creator economy", "independent worker",
        "contractor", "online work", "flex work", "microtask", "Upwork", "Fiverr"
    ],

    "public_sentiment_discourse": [
        "boomer", "doomer", "tech bro", "jobpocalypse", "decel", "quiet quitting",
        "great resignation", "job hopping", "layoff wave", "hustle culture", "AI hype",
        "future fear", "doomscrolling", "LinkedIn post", "upskilling frenzy", "prompt engineering"
    ],

    "tech_company_trends": [
        "OpenAI", "Google DeepMind", "Meta AI", "Anthropic", "Stability AI", "Amazon layoffs",
        "tech layoffs", "hiring freeze", "big tech", "FAANG", "tech exodus", "startup layoffs",
        "VC funding freeze", "AI startup", "early retirement", "LinkedIn hiring trends"
    ]
}
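
Several keywords appear under more than one category (for example, "upskilling", "gig economy", "automation", and "OpenAI"), so a single mention can trigger two categories at once. The sketch below lists such shared keywords. It runs on a small excerpt rather than the full dictionary, and the helper name `find_shared_keywords` is hypothetical.

```python
from collections import defaultdict


def find_shared_keywords(dictionary):
    """Map each lowercased keyword to the categories listing it; keep only shared ones."""
    owners = defaultdict(set)
    for category, keywords in dictionary.items():
        for kw in keywords:
            owners[kw.lower()].add(category)
    return {kw: sorted(cats) for kw, cats in owners.items() if len(cats) > 1}


# Small excerpt of the dictionary above, for illustration only
toy_dictionary = {
    "jobs_and_careers": ["job", "upskilling", "gig economy"],
    "education_training": ["upskilling", "bootcamp"],
    "remote_and_gig_work": ["gig economy", "Upwork"],
}
shared = find_shared_keywords(toy_dictionary)
print(shared)
```

Calling `find_shared_keywords(concept_dictionary)` on the full dictionary gives the complete overlap list. Whether the overlap is intended depends on how the gold labels treat such terms.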


# PART 3: Load spaCy NLP Model

To support tokenization, lemmatization, and phrase-based pattern matching in our concept extraction pipeline, we load a pre-trained **spaCy language model**.

We first attempt to load the **`en_core_web_trf` transformer-based model** for deeper contextual understanding (used in V3+). If it's not available in the Colab environment, we fall back to the lighter `en_core_web_sm` model.

This model is used for:
- Token parsing
- PhraseMatcher in V2–V4
- Masking and NER preparation for zero-shot classification (V4)

In [5]:
import spacy

# 1) Load an advanced spaCy model (fall back to en_core_web_sm if it is not installed)
try:
    nlp = spacy.load("en_core_web_trf")
    print("Using en_core_web_trf for advanced spaCy model")
except OSError:  # spacy.load raises OSError when the model package is missing
    nlp = spacy.load("en_core_web_sm")
    print("Falling back to en_core_web_sm.")

Using en_core_web_trf for advanced spaCy model


# PART 4: V1 – Keyword-Only Matching

This section implements the **baseline concept extraction method (V1)** described in Section 5.2.2 of the report.

### Method:
- For each category in the `concept_dictionary`, check if **any keyword** appears in the text as a **case-insensitive exact substring match**.
- Once a match is found, the category is added to the predicted concept list for that text.

### Advantages:
- Simple and fast
- Easy to interpret and debug

### Limitations:
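- High rate of **false positives**: short keywords such as “ai” or “ml” also match inside unrelated words (“said”, “html”)
- Matches **multi-word expressions** only when they appear verbatim, and misses inflected, misspelled, or **semantically similar phrases**
- Cannot disambiguate noisy or informal language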

In [6]:
def do_concept_extraction_v1(text: str):
    """
    V1: Simple exact substring matching (case-insensitive).
    Pros: Very fast and straightforward.
    Cons: High false positives (keywords also match inside longer words);
          no tolerance for spelling variants or paraphrases.
    """
    text_lower = text.lower()
    matched_categories = set()

    for cat, keywords in concept_dictionary.items():
        for kw in keywords:
            # exact substring match, ignoring case
            if kw.lower() in text_lower:
                matched_categories.add(cat)
                break  # once matched, no need to check other keywords for this category

    return list(matched_categories)
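

Because V1 tests for substrings, a short keyword fires whenever its letters occur inside a longer word. The sketch below contrasts that behaviour with a whole-word variant built on a regular expression. The variant is an illustration, not part of the V1 baseline, and the function names are hypothetical.

```python
import re


def substring_match(keyword, text):
    """V1-style check: case-insensitive substring test."""
    return keyword.lower() in text.lower()


def whole_word_match(keyword, text):
    """Stricter variant: the keyword must not be embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


text = "She said the html page was broken."
print(substring_match("ai", text), whole_word_match("ai", text))
print(whole_word_match("ai", "AI is everywhere"))
```

The lookarounds `(?<!\w)` and `(?!\w)` are used instead of `\b` so that keywords beginning or ending in punctuation are still handled sensibly.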


# PART 5: V2 – Keyword + Multi-Word PhraseMatcher + Fuzzy Matching

This section implements **V2**, an improved rule-based method that extends V1 with phrase matching and fuzzy matching.

### Method:
- Using **spaCy's `PhraseMatcher`** to detect multi-word expressions (e.g., “career change”, “job displacement”).
- Using **RapidFuzz fuzzy matching** for approximate single-word keyword matches (e.g., “reskilled” ≈ “reskilling”).

### Advantages:
- Improves **recall** on real-world informal inputs
- Handles slight spelling variations in single-word keywords and case differences in multi-word phrases

### Limitations:
- Still prone to false positives without semantic filtering
- May trigger categories based on loose associations

This version serves as the foundation for further enhancements in V3 and V4.

In [7]:
# ----------------------------------------------------------------------------------
# Build a PhraseMatcher for multi-word terms (built once, at module level)
# ----------------------------------------------------------------------------------
phrase_matcher_v2 = PhraseMatcher(nlp.vocab, attr="LOWER")

multiword_patterns_v2 = []
for cat, keywords in concept_dictionary.items():
    for kw in keywords:
        if len(kw.split()) > 1:
            multiword_patterns_v2.append((cat, nlp.make_doc(kw)))

for (cat, pattern_doc) in multiword_patterns_v2:
    phrase_matcher_v2.add(cat, [pattern_doc])


def do_concept_extraction_v2(text: str, fuzzy_threshold=80):
    """
    V2:
      - Use spaCy PhraseMatcher for multi-word concepts
      - Use fuzzy matching (rapidfuzz) for single-word concepts
    """
    doc = nlp(text)
    matched_categories = set()

    # 1) Match multi-word phrases
    matches = phrase_matcher_v2(doc)
    for match_id, start, end in matches:
        cat = nlp.vocab.strings[match_id]
        matched_categories.add(cat)

    # 2) Fuzzy-match single-word keywords against each word in the text
    text_words = re.findall(r"\w[\w-]*", text.lower())  # simplistic tokenization
    for cat, keywords in concept_dictionary.items():
        if cat in matched_categories:
            continue  # already matched via PhraseMatcher
        single_word_kws = [kw.lower() for kw in keywords if len(kw.split()) == 1]
        # any() short-circuits, so we move to the next category on the first hit
        if any(
            fuzz.partial_ratio(kw, tw) >= fuzzy_threshold
            for kw in single_word_kws
            for tw in text_words
        ):
            matched_categories.add(cat)

    return list(matched_categories)
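

One caveat with partial fuzzy matching: `fuzz.partial_ratio` scores the best-aligned substring, so a two-letter keyword such as "ai" reaches the maximum score against any word that contains it (for example, "said"). The sketch below shows one possible guard: exact matching for short keywords and fuzzy matching only for longer ones. It is not what V2 does. It uses `difflib` from the standard library as a stand-in scorer, and the `min_fuzzy_len` cut-off of 5 is an arbitrary assumption.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Whole-string similarity in [0, 100]; a stdlib stand-in for a fuzzy scorer."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def keyword_hits_word(keyword, word, fuzzy_threshold=80, min_fuzzy_len=5):
    """Exact match for short keywords, fuzzy match only for longer ones."""
    keyword, word = keyword.lower(), word.lower()
    if len(keyword) < min_fuzzy_len:
        return keyword == word
    return similarity(keyword, word) >= fuzzy_threshold


print(keyword_hits_word("ai", "said"))               # short keyword: exact match only
print(keyword_hits_word("ai", "AI"))
print(keyword_hits_word("automation", "automaton"))  # long keyword: fuzzy match allowed
```

The trade-off is that short keywords lose tolerance for typos, which seems acceptable given how little signal a two- or three-letter fuzzy match carries.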


# PART 6: V3 – Rule-Based + Semantic Filtering + Deduplication

This section implements **V3**, which refines V2 with semantic cleanup.

### Method Enhancements:
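- Applies **lemmatization** to the text tokens before fuzzy matching, so inflected forms are compared in normalized form (e.g., “reskilling” → “reskill”)
- Raises the default fuzzy threshold from 80 to 88 to cut spurious matches
- Eliminates **duplicate tags** per text (preventing overcounting)
- Leaves a hook for filtering **overly generic triggers** (e.g., broad terms like “work” or “field” that could match too many texts); in the code below this step is a commented placeholder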

### Advantages:
- Reduces **false positives** introduced in V2
- Produces more **specific and context-aware concept tags**
- Still lightweight and fully rule-based

This version offers improved **precision** while maintaining strong recall, and sets the stage for hybrid semantic approaches.

In [8]:
# ----------------------------------------------------------------------------------
# Build a PhraseMatcher for multi-word terms (V3)
# ----------------------------------------------------------------------------------
phrase_matcher_v3 = PhraseMatcher(nlp.vocab, attr="LOWER")
multiword_patterns_v3 = []
for cat, keywords in concept_dictionary.items():
    multiword_patterns_v3 += [
        (cat, nlp.make_doc(kw)) for kw in keywords if len(kw.split()) > 1
    ]

for (cat, pattern_doc) in multiword_patterns_v3:
    phrase_matcher_v3.add(cat, [pattern_doc])

def lemmatized_tokens(doc):
    """Return a list of lemmatized, lowercased tokens (alphabetic only)."""
    return [token.lemma_.lower() for token in doc if token.is_alpha]

def do_concept_extraction_v3(text: str, fuzzy_threshold=88):
    """
    V3:
      - Multi-word matching via PhraseMatcher
      - Fuzzy matching on single words
      - Lemmatization-based cleanup
      - Avoid overly generic triggers
      - Remove repeated concept tags per tweet
    """
    doc = nlp(text)
    matched_categories = set()

    # --- (1) PhraseMatcher (multi-word) ---
    matches = phrase_matcher_v3(doc)
    for match_id, start, end in matches:
        matched_categories.add(nlp.vocab.strings[match_id])

    # --- (2) Fuzzy-match single-word keywords against lemmatized tokens ---
    tokens = lemmatized_tokens(doc)

    for cat, keywords in concept_dictionary.items():
        if cat in matched_categories:
            continue  # already matched via PhraseMatcher
        single_word_kws = [kw.lower() for kw in keywords if len(kw.split()) == 1]
        # any() short-circuits on the first hit for this category
        if any(
            fuzz.partial_ratio(token, kw) >= fuzzy_threshold
            for kw in single_word_kws
            for token in tokens
        ):
            matched_categories.add(cat)

    # --- (3) Placeholder: filter overly generic triggers ---
    # Not implemented. If words like "work" cause too many false positives,
    # add a context check here before keeping the affected category, e.g.
    # for "jobs_and_careers" when "work" appears in text.lower().

    # No separate deduplication step is needed: matched_categories is a set.
    return list(matched_categories)
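

The generic-trigger filter described for V3 could take roughly the following shape. This is a sketch under two assumptions: the extraction step is changed to record which keywords fired for each category, and a hand-picked stop-list exists. The names `GENERIC_TRIGGERS` and `filter_generic` and the stop-list contents are hypothetical.

```python
GENERIC_TRIGGERS = {"work", "field", "position"}  # hypothetical stop-list


def filter_generic(category_triggers, generic=GENERIC_TRIGGERS):
    """Keep a category only if at least one of its triggers is not generic.

    category_triggers maps category -> set of keywords that fired for it.
    """
    return sorted(
        cat
        for cat, triggers in category_triggers.items()
        if any(t.lower() not in generic for t in triggers)
    )


# Made-up trigger record for one text
fired = {
    "jobs_and_careers": {"position"},     # only a generic trigger: dropped
    "ai_tech": {"chatgpt", "llm"},        # specific triggers: kept
    "finance": {"salary", "position"},    # mixed: kept because of "salary"
}
print(filter_generic(fired))
```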


# PART 7: V4 – Ensemble-Based Semantic Augmentation with Masking and Zero-Shot Classification

This section implements **V4**, a hybrid method that combines rule-based detection (from V3) with transformer-based zero-shot classification.

### Method:
1. **Rule-Based Detection**: Use V3 logic to get deterministic concept tags.
2. **Masking Phase**: Replace AI-specific terms (e.g., “GPT-4”, “ChatGPT”, “AI”) with `[MASK]` to reduce surface bias.
3. **Zero-Shot Classification**: Run the original and masked versions through a **zero-shot transformer model** (e.g., `facebook/bart-large-mnli`) using category names as labels. In the code below, the masked pass is commented out and reuses the original result.
4. **Ensemble Fusion**: Merge the predictions from all stages by set union.

### Advantages:
- Captures **semantic context** and **implicit meaning**
- Handles **abbreviations**, **paraphrases**, and informal phrasing in tweets
- Avoids over-weighting surface-level cues (e.g., always predicting `ai_tech` just because “AI” appears)

This is the most powerful and context-sensitive version, achieving the best **F1 score and Jaccard similarity** in our ablation study.
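
The masking step can also be sketched with a regular expression. The notebook's `mask_ai_terms` rebuilds the text by joining spaCy tokens with spaces, which changes spacing around punctuation ("AI," becomes "[MASK] ,"). The variant below leaves every other character untouched. It is an alternative illustration rather than the notebook's method: it uses a shortened term list, and it replaces a multi-word term with a single `[MASK]` instead of one per token.

```python
import re

AI_TERMS = ["AI", "artificial intelligence", "ChatGPT", "GPT", "LLM"]  # shortened list


def build_mask_pattern(terms):
    """Compile one case-insensitive pattern; longer terms go first so they win."""
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", flags=re.IGNORECASE)


def mask_terms(text, pattern, mask="[MASK]"):
    """Replace each matched term with the mask, leaving other characters as they are."""
    return pattern.sub(mask, text)


pattern = build_mask_pattern(AI_TERMS)
print(mask_terms("AI, and ChatGPT in particular, won't replace trainers.", pattern))
```

The lookarounds keep the "ai" inside "trainers" from being masked.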


In [9]:
# ----------------------------------------------------------------------------------
# Setup for V4
# ----------------------------------------------------------------------------------
from transformers import pipeline
from collections import Counter, defaultdict

In [10]:
# ----------------------------------------------------------------------------------
# V4
# ----------------------------------------------------------------------------------

# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")


# We'll use the same concept_dictionary categories as "candidate labels".
# If you want to restrict or rename them for zero-shot classification,
# you might want to pass more user-friendly labels to the pipeline.
candidate_labels_v4 = list(concept_dictionary.keys())

# Build a phrase matcher specifically for AI-related terms (to be masked)
# so we can see how classification changes without the "AI" hints.
ai_terms_list = [
    # expand or contract as needed
    "AI", "artificial intelligence", "machine learning", "deep learning", "chatgpt",
    "GPT", "LLM", "NLP", "OpenAI", "Hugging Face", "LangChain", "transformer model",
    "reinforcement learning", "AI-powered", "AI-driven", "AI platform", "BERT", "GAN"
]
phrase_matcher_ai = PhraseMatcher(nlp.vocab, attr="LOWER")
patterns_ai = [nlp.make_doc(term) for term in ai_terms_list]
phrase_matcher_ai.add("AI_Terms", patterns_ai)

def mask_ai_terms(text: str):
    """
    Replace AI terms with [MASK] to see how classification changes
    without explicit AI hints.
    """
    doc = nlp(text)
    # gather matches
    matches = phrase_matcher_ai(doc)
    result_tokens = []
    ai_indices = set()
    for match_id, start, end in matches:
        # mark these token indices to be masked
        for token_i in range(start, end):
            ai_indices.add(token_i)

    for i, token in enumerate(doc):
        if i in ai_indices:
            result_tokens.append("[MASK]")
        else:
            result_tokens.append(token.text)
    return " ".join(result_tokens)

def do_concept_extraction_v4(text: str):
    """
    V4:
      - (Optional) first do rule-based detection from V3
      - Then do zero-shot classification on original vs masked text
      - Combine results
    """
    # 1) Rule-based pass (reuse V3 for an initial set):
    initial_categories = set(do_concept_extraction_v3(text))

    # 2) Zero-shot classification:
    #    a) on original
    #    b) on masked
    masked_text = mask_ai_terms(text)
    # If nothing was masked, no AI terms were found and masked_text is just
    # the original tokens re-joined with spaces.

    original_result = classifier(text, candidate_labels_v4, multi_label=True)
    # NOTE: the masked pass is disabled here, so masked_text is unused and
    # masked_result simply reuses original_result. Uncomment to enable it.
    # masked_result = classifier(masked_text, candidate_labels_v4, multi_label=True)
    masked_result = original_result

    # Extract top categories from zero-shot classification:
    # (Here we just pick categories with score >= threshold for demonstration.)
    threshold = 0.9  # adjustable
    original_pred = {
        lbl for lbl, score in zip(original_result["labels"], original_result["scores"])
        if score >= threshold
    }
    masked_pred = {
        lbl for lbl, score in zip(masked_result["labels"], masked_result["scores"])
        if score >= threshold
    }

    # 3) Combine results (ensemble) – e.g., union:
    final_categories = initial_categories.union(original_pred).union(masked_pred)

    return list(final_categories)


Device set to use cpu


# PART 8: V5.1 – Multi-Model Ensemble via Majority Voting

This version enhances V4 by integrating predictions from **multiple zero-shot transformers** (e.g., BART, DeBERTa, XLM-R) and applies **majority voting** across models to improve robustness.

### Method:
- Run zero-shot classification on both the **original and masked** text.
- For each version, gather predictions from 3 different transformer models.
- Only retain categories that are predicted by **at least 2 out of 3 models**.
- Combine with rule-based tags (V3) for final output.

### Advantage:
- Balances precision and recall by requiring **cross-model agreement**.
- More resilient to **individual model biases** or inconsistencies.

In [11]:
# Load 3 different zero-shot classifier pipelines
models = {
    "bart": pipeline("zero-shot-classification", model="facebook/bart-large-mnli"),
    "deberta": pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v1"),
    "xlmr": pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")
}

candidate_labels_v4 = list(concept_dictionary.keys())
threshold = 0.9  # Adjustable threshold

def run_majority_vote_zero_shot(text):
    """
    V5.1: Run multiple zero-shot models and return predictions using majority voting.
    """
    label_votes = defaultdict(int)

    for name, clf in models.items():
        try:
            result = clf(text, candidate_labels_v4, multi_label=True)
            for lbl, score in zip(result["labels"], result["scores"]):
                if score >= threshold:
                    label_votes[lbl] += 1
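        except Exception as e:
            # Skip a failing model but report it: with fewer voters,
            # the 2-of-3 rule becomes stricter.
            print(f"[warn] zero-shot model '{name}' failed: {e}")
            continue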

    # Majority vote (at least 2 out of 3 models must agree)
    final_preds = {label for label, count in label_votes.items() if count >= 2}
    return final_preds

def do_concept_extraction_v5_1(text: str):
    """
    V5.1: Rule-based + AI masking + multi-model zero-shot classification using majority vote
    """
    rule_based_tags = set(do_concept_extraction_v3(text))
    masked_text = mask_ai_terms(text)

    original_preds = run_majority_vote_zero_shot(text)
    masked_preds = run_majority_vote_zero_shot(masked_text)

    final_categories = rule_based_tags.union(original_preds).union(masked_preds)
    return list(final_categories)


Device set to use cpu


Device set to use cpu


Some weights of the model checkpoint at joeddav/xlm-roberta-large-xnli were not used when initializing XLMRobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


# PART 9: V5.2 – Multi-Model Ensemble via Score Averaging

This version extends the ensemble by using **confidence score averaging** instead of hard voting to determine which categories are most relevant.

### Method:
- Collect prediction confidence scores for each category across multiple models.
- Compute the **average score** for each category.
- Retain categories with scores above a set threshold (e.g., 0.9).
- Combine predictions from **original and masked** text with rule-based tags.

### Advantage:
- Captures **soft agreement** between models, allowing categories with consistent but moderate confidence to be included.
- Achieves more **nuanced semantic coverage** while still mitigating overprediction.

In [12]:
def run_score_averaged_zero_shot(text):
    """
    V5.2: Ensemble via confidence score averaging.
    Run multiple zero-shot models and return predictions using score averaging.
    """
    scores_per_label = defaultdict(list)

    for name, clf in models.items():
        try:
            result = clf(text, candidate_labels_v4, multi_label=True)
            for lbl, score in zip(result["labels"], result["scores"]):
                scores_per_label[lbl].append(score)
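        except Exception as e:
            # Skip a failing model but report it; the average then covers fewer models
            print(f"[warn] zero-shot model '{name}' failed: {e}")
            continue
    # Compute the average score per label across the models that ran
    avg_scores = {lbl: sum(scores) / len(scores) for lbl, scores in scores_per_label.items()}
    # Keep labels whose average score reaches the fixed global threshold
    final_preds = {lbl for lbl, avg_score in avg_scores.items() if avg_score >= threshold}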
    return final_preds

def do_concept_extraction_v5_2(text: str):
    """
    V5.2: Rule-based + AI masking + multi-model zero-shot classification using score averaging
    """
    # Step 1: Rule-based
    rule_based_tags = set(do_concept_extraction_v3(text))
    masked_text = mask_ai_terms(text)

    # Step 2: Zero-shot on original and masked versions
    original_preds = run_score_averaged_zero_shot(text)
    masked_preds = run_score_averaged_zero_shot(masked_text)
    # Step 3: Combine all (ensemble of rule-based + original + masked)
    final_categories = rule_based_tags.union(original_preds).union(masked_preds)
    return list(final_categories)
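

Majority voting and score averaging can disagree on the same set of model scores. The toy comparison below uses made-up scores for a single label across three models. The function names are hypothetical, and the 0.9 threshold mirrors the one used above.

```python
THRESHOLD = 0.9


def majority_vote(scores, threshold=THRESHOLD, min_votes=2):
    """Keep the label if at least min_votes models score it at or above threshold."""
    return sum(s >= threshold for s in scores) >= min_votes


def score_average(scores, threshold=THRESHOLD):
    """Keep the label if the mean score across models reaches threshold."""
    return sum(scores) / len(scores) >= threshold


# Made-up per-model scores for one label (three models each)
cases = {
    "two_strong_one_weak": [0.95, 0.92, 0.60],   # mean is about 0.823
    "all_moderately_high": [0.89, 0.89, 0.93],   # mean is about 0.903
}
for name, scores in cases.items():
    print(name, majority_vote(scores), score_average(scores))
```

In the first case two models clear the threshold, so voting keeps the label while the low third score drags the average below 0.9. In the second case only one model clears the threshold, so voting drops the label while the average keeps it.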


# PART 10: Run Concept Extraction (V1–V5.2) on Dataset

This section runs all implemented concept extraction methods (V1 through V5.2) on the input dataset.

For each tweet or text entry:
- V1: Keyword-only matching
- V2: Phrase matching + fuzzy matching
- V3: V2 + semantic filtering & deduplication
- V4: Rule-based + zero-shot with masking (single model)
- V5.1: Rule-based + multi-model ensemble (majority vote)
- V5.2: Rule-based + multi-model ensemble (score averaging)

Each version’s results are saved into new JSON keys (`concepts_v1` to `concepts_v4`, `concepts_v5_1`, and `concepts_v5_2`) within the dataset.

This output is used in the evaluation and ablation study sections.


In [17]:
import os
import json

def run_concept_extraction_on_json(input_json_path, output_json_path):
    """
    Runs concept extraction (V1–V5.2) on each record in a JSON file.
    Saves results with extracted concepts under new keys.
    """
    print("RUNNING")

    # 1) Load input file
    with open(input_json_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # 2) Process each record
    for i, record in enumerate(data):
        text = record.get("text", "")

        # V1: Keyword only
        record["concepts_v1"] = do_concept_extraction_v1(text)

        # V2: PhraseMatcher + fuzzy matching
        record["concepts_v2"] = do_concept_extraction_v2(text)

        # V3: V2 + semantic filtering + deduplication
        record["concepts_v3"] = do_concept_extraction_v3(text)

        # V4: Rule-based + zero-shot + masking (single model)
        record["concepts_v4"] = do_concept_extraction_v4(text)

        # V5.1: Rule-based + multi-model (majority vote)
        record["concepts_v5_1"] = do_concept_extraction_v5_1(text)

        # V5.2: Rule-based + multi-model (score averaging)
        record["concepts_v5_2"] = do_concept_extraction_v5_2(text)

        # Report progress every 100 records (and on the last one)
        if (i + 1) % 100 == 0 or i + 1 == len(data):
            print(f"Processed {i+1}/{len(data)} records...")

    # 3) Save output
    with open(output_json_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

    print(f"\n✅ Done! Processed {len(data)} records.")
    print(f"📁 Output written to: {output_json_path}")


In [None]:
# Update these paths if needed
input_json_path = "/content/drive/MyDrive/x_data/final_data/evaluationDataset_final.json"
output_json_path = "/content/drive/MyDrive/x_data/final_data/evaluationDataset_final_output.json"

run_concept_extraction_on_json(input_json_path, output_json_path)
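

Running three large models on CPU over every record can take a long time, and the function above writes its output only at the very end. The sketch below shows one way to save progress periodically and to skip records that already have results. It is an optional variant, not the notebook's runner. The helper names and the toy extractor are hypothetical, and whether the atomic rename behaves well on a mounted Drive folder is an assumption worth checking.

```python
import json
import os
import tempfile


def save_json_atomic(records, path):
    """Write to a temp file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4, ensure_ascii=False)
    os.replace(tmp_path, path)


def process_with_checkpoints(records, extractors, path, every=100):
    """Apply each extractor to records lacking its key; save every `every` records."""
    for i, record in enumerate(records, start=1):
        text = record.get("text", "")
        for key, fn in extractors.items():
            if key not in record:  # skip work already done in a previous run
                record[key] = fn(text)
        if i % every == 0:
            save_json_atomic(records, path)
    save_json_atomic(records, path)
    return records


# Demo with a toy extractor standing in for the do_concept_extraction_* functions
def toy_extractor(text):
    return ["ai_tech"] if "ai" in text.lower().split() else []


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_output.json")
demo_records = [{"text": "AI is changing hiring"}, {"text": "Nice weather today"}]
process_with_checkpoints(demo_records, {"concepts_demo": toy_extractor}, demo_path, every=1)
print(demo_records)
```

To resume after an interruption, load the partially written output file and pass it back in as `records`; keys that are already present are left alone.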


# PART 11: Evaluate Concept Extraction Performance (vs Gemini-Gold Labels)

In this section, we evaluate the performance of all concept extraction versions (V1–V5.2) against the Gemini-generated gold labels (`concept_goldenlabel`).

### Metrics Used:
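- **Precision (Micro)**: Share of all predicted labels that are correct, pooled across every record and label.
- **Recall (Micro)**: Share of all gold labels that were predicted, pooled across every record and label.
- **F1-Score (Micro)**: Harmonic mean of micro precision and micro recall.
- **Jaccard Score (Samples)**: Intersection over union of the predicted and gold label sets, computed per record and then averaged.
- **Exact Match Accuracy** *(not very useful here)*: Fraction of records whose predicted set equals the gold set exactly. The code computes it but leaves it out of the reported results.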

Evaluation is conducted using `sklearn`’s multilabel binarization and metrics. This provides an objective basis for comparing rule-based vs semantic and ensemble-based concept extraction approaches.
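
To make the metric definitions concrete, here is a minimal pure-Python sketch that computes them by hand on three toy records. The label `education` and the toy gold/predicted sets are made up for illustration, and scoring an empty-vs-empty pair as 0.0 is a convention chosen for this sketch; the notebook itself relies on `sklearn` rather than these helpers.

```python
def micro_prf(gold, pred):
    """Micro precision/recall/F1: pool TP, FP, FN over all records and labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def samples_jaccard(gold, pred):
    """Per-record intersection over union, averaged over records."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Assumption of this sketch: two empty sets score 0.0
        scores.append(len(g & p) / len(union) if union else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical, not taken from the evaluation set)
gold = [{"ai_tech", "jobs_and_careers"}, {"ai_tech"}, {"education"}]
pred = [{"ai_tech"}, {"ai_tech", "education"}, {"education"}]

precision, recall, f1 = micro_prf(gold, pred)
jaccard = samples_jaccard(gold, pred)
exact_match = sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(precision, recall, f1)   # 3 TP, 1 FP, 1 FN
print(round(jaccard, 4))       # mean of 1/2, 1/2, 1
print(round(exact_match, 4))   # only the third record matches exactly
```

Only one of the three records is an exact match even though every record gets at least one label right, which is why exact-match accuracy is a harsh measure for multi-label tagging.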

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, jaccard_score
from sklearn.preprocessing import MultiLabelBinarizer
import json

def evaluate_concept_extraction(json_file_path, prediction_keys=None, gold_key="concept_goldenlabel"):
    """
    Evaluate multiple concept extraction versions against Gemini-generated gold labels.

    Args:
        json_file_path (str): Path to the JSON data file.
        prediction_keys (List[str] | None): Keys to evaluate (e.g., ["concepts_v1", ...]).
            Defaults to all six versions.
        gold_key (str): The key for gold labels.

    Returns:
        Dict mapping each version key to its metric scores.
    """
    # Avoid a mutable default argument; fall back to all versions
    if prediction_keys is None:
        prediction_keys = ["concepts_v1", "concepts_v2", "concepts_v3",
                           "concepts_v4", "concepts_v5_1", "concepts_v5_2"]
    with open(json_file_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    # Prepare evaluation results
    results = {}

    # Gold labels as sets, one per record
    gold_labels_all = [set(item.get(gold_key, [])) for item in data]

    # Fix the label space to the categories in concept_dictionary
    # (defined in the concept extraction setup)
    all_known_classes = sorted(concept_dictionary.keys())
    mlb = MultiLabelBinarizer(classes=all_known_classes)

    y_true_binary = mlb.fit_transform(gold_labels_all)

    # for i, item in enumerate(data):
    #   print(f"#{i} | Gold: {item.get('concept_goldenlabel')} | V1: {item.get('concepts_v1')} | | V2: {item.get('concepts_v2')} | V3: {item.get('concepts_v3')} | V4: {item.get('concepts_v4')} | V5: {item.get('concepts_v5_1')}| V6: {item.get('concepts_v5_2')}")


    for version in prediction_keys:
        predictions = [set(item.get(version, [])) for item in data]
        y_pred_binary = mlb.transform(predictions)

        # Micro-averaged metrics (TP/FP/FN pooled over all labels and records)
        precision_micro = precision_score(y_true_binary, y_pred_binary, average='micro', zero_division=0)
        recall_micro = recall_score(y_true_binary, y_pred_binary, average='micro', zero_division=0)
        f1_micro = f1_score(y_true_binary, y_pred_binary, average='micro', zero_division=0)

        # Macro
        precision_macro = precision_score(y_true_binary, y_pred_binary, average='macro', zero_division=0)
        recall_macro = recall_score(y_true_binary, y_pred_binary, average='macro', zero_division=0)
        f1_macro = f1_score(y_true_binary, y_pred_binary, average='macro', zero_division=0)

        jaccard = float(jaccard_score(y_true_binary, y_pred_binary, average="samples"))


        # Exact match accuracy: sets must match exactly
        exact_matches = sum(p == g for p, g in zip(predictions, gold_labels_all))
        accuracy = exact_matches / len(gold_labels_all)

        results[version] = {
            "Precision (micro)": round(precision_micro, 4),
            "Recall (micro)": round(recall_micro, 4),
            "F1-Score (micro)": round(f1_micro, 4),
            # Macro scores and exact-match accuracy are computed above but
            # left out of the report; uncomment to include them.
            # "Precision (macro)": round(precision_macro, 4),
            # "Recall (macro)": round(recall_macro, 4),
            # "F1-Score (macro)": round(f1_macro, 4),
            # "Accuracy (Exact Match)": round(accuracy, 4),
            "Jaccard Score": round(jaccard, 4),
            "Support (total tweets)": len(data)
        }

    return results


In [23]:
from pprint import pprint

# Path to your JSON file with predictions + gold labels
json_file_measures = "/content/drive/MyDrive/x_data/evaluationDataset_final_output_with_GroundTruth.json"

# Run evaluation
results = evaluate_concept_extraction(json_file_measures)

# Pretty-print the per-version results
pprint(results)

{'concepts_v1': {'F1-Score (micro)': 0.4848,
                 'Jaccard Score': 0.3262,
                 'Precision (micro)': 0.8889,
                 'Recall (micro)': 0.3334,
                 'Support (total tweets)': 1000},
 'concepts_v2': {'F1-Score (micro)': 0.6623,
                 'Jaccard Score': 0.5583,
                 'Precision (micro)': 0.561,
                 'Recall (micro)': 0.8083,
                 'Support (total tweets)': 1000},
 'concepts_v3': {'F1-Score (micro)': 0.7443,
                 'Jaccard Score': 0.6348,
                 'Precision (micro)': 0.657,
                 'Recall (micro)': 0.8583,
                 'Support (total tweets)': 1000},
 'concepts_v4': {'F1-Score (micro)': 0.7794,
                 'Jaccard Score': 0.665,
                 'Precision (micro)': 0.6865,
                 'Recall (micro)': 0.9015,
                 'Support (total tweets)': 1000},
 'concepts_v5_1': {'F1-Score (micro)': 0.8103,
                 'Jaccard Score': 0.7188,
          



### **Updated Evaluation Summary (vs Gemini-Gold Labels)**

| Version        | Precision (micro) | Recall (micro) | F1-Score (micro) | Jaccard (samples) | Notes |
|----------------|-------------------|----------------|------------------|-------------------|-------|
| **V1**         | 0.8889            | 0.3334         | 0.4848           | 0.3262            | Keyword-only matching: high precision, very low recall |
| **V2**         | 0.5610            | 0.8083         | 0.6623           | 0.5583            | PhraseMatcher + fuzzy matching: better coverage |
| **V3**         | 0.6570            | 0.8583         | 0.7443           | 0.6348            | V2 + semantic filtering: better balance between precision and recall |
| **V4**         | 0.6865            | 0.9015         | 0.7794           | 0.6650            | Rule-based + single zero-shot model |
| **V5.1**       | 0.7059            | 0.9510         | 0.8103           | 0.7188            | Rule-based + multi-model ensemble (majority vote) |
| **V5.2**       | 0.6925            | 0.9235         | 0.7849           | 0.6937            | Rule-based + multi-model ensemble (score averaging) |

---

### **How to Interpret These Metrics**

- **Precision**: Measures how many of the predicted concepts are correct (avoids false positives).
- **Recall**: Measures how many relevant concepts were retrieved (avoids false negatives).
- **F1-Score**: Harmonic mean of precision and recall — reflects balanced effectiveness.
- **Jaccard Score**: Measures set similarity (intersection over union) between predicted and true tags per tweet.
- **Support**: Number of records (tweets) evaluated (**1000**).
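
Because F1 is the harmonic mean of precision and recall, each row of the summary table can be sanity-checked by recomputing F1 from its own precision and recall. The sketch below does that with the figures copied from the table; the tolerance of 0.001 is an assumed allowance for the inputs being rounded to four decimals.

```python
# (precision, recall, reported F1) copied from the summary table
rows = {
    "V1": (0.8889, 0.3334, 0.4848),
    "V2": (0.5610, 0.8083, 0.6623),
    "V3": (0.6570, 0.8583, 0.7443),
    "V4": (0.6865, 0.9015, 0.7794),
    "V5.1": (0.7059, 0.9510, 0.8103),
    "V5.2": (0.6925, 0.9235, 0.7849),
}

TOLERANCE = 1e-3  # assumed allowance for 4-decimal rounding of the inputs


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Collect rows whose reported F1 differs from the recomputed value
mismatches = {}
for name, (p, r, reported) in rows.items():
    recomputed = f1_from(p, r)
    if abs(recomputed - reported) > TOLERANCE:
        mismatches[name] = round(recomputed, 4)

print(mismatches)
```

With the table's figures, this flags only V5.2 (about 0.7915 recomputed against 0.7849 listed), which suggests one of that row's values may be mistyped and is worth rechecking against the notebook output.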

---

### **Updated Key Insights**

- **V1 (Keyword-only)**:  
  - Precision is the highest of all versions (**0.89**), but recall is very low (**0.33**).
  - This version is too conservative: it captures only obvious matches and misses nuance.

- **V2 (Phrase + fuzzy matching)**:  
  - Recall jumps to **0.81**, at the cost of precision, which falls to **0.56**.
  - More aggressive tagging gives better overall **F1** (0.66 vs 0.48) and **Jaccard** (0.56 vs 0.33) than V1.

- **V3 (Semantic filtering)**:  
  - Semantic filtering raises **precision** to 0.657 (from 0.561), and recall also rises to 0.858.
  - Strong baseline with the best balance among the rule-based versions (V1–V3).

- **V4 (Zero-shot + masking)**:  
  - Enhances V3 with semantic understanding via transformers.  
  - Results in **F1 = 0.7794** and **Jaccard = 0.665**, up from 0.7443 and 0.6348 for V3.

- **V5.1 (Ensemble, majority vote)**:  
  - Best performer overall: **F1 = 0.8103**, **Recall = 0.951**, and **Jaccard = 0.719**.  
  - Combining several zero-shot models improves robustness across different phrasings.

- **V5.2 (Ensemble, score averaging)**:  
  - More **conservative** than V5.1 (recall 0.9235 vs 0.9510), but on the reported numbers this does not reduce false positives: precision is also slightly lower (0.6925 vs 0.7059).
  - Still ahead of V4 on every reported metric, so it remains a reasonable alternative to majority voting.
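
The difference between the two ensemble rules can be illustrated with a toy example. The sketch below is not the notebook's V5.1/V5.2 implementation: the model names, scores, and the 0.5 threshold are all hypothetical, chosen so that the two rules disagree.

```python
from statistics import mean

# Hypothetical per-model zero-shot scores for a single text
scores = {
    "model_a": {"ai_tech": 0.55, "jobs_and_careers": 0.95},
    "model_b": {"ai_tech": 0.60, "jobs_and_careers": 0.40},
    "model_c": {"ai_tech": 0.20, "jobs_and_careers": 0.45},
}
THRESHOLD = 0.5  # assumed decision threshold


def all_labels(scores):
    """Sorted union of labels scored by any model."""
    return sorted({label for model in scores.values() for label in model})


def majority_vote(scores, threshold):
    """Keep a label if more than half of the models score it >= threshold."""
    kept = []
    for label in all_labels(scores):
        votes = sum(m.get(label, 0.0) >= threshold for m in scores.values())
        if votes > len(scores) / 2:
            kept.append(label)
    return kept


def score_average(scores, threshold):
    """Keep a label if the mean score across models is >= threshold."""
    kept = []
    for label in all_labels(scores):
        avg = mean(m.get(label, 0.0) for m in scores.values())
        if avg >= threshold:
            kept.append(label)
    return kept


print(majority_vote(scores, THRESHOLD))   # ['ai_tech']
print(score_average(scores, THRESHOLD))   # ['jobs_and_careers']
```

Majority voting keeps `ai_tech` because two of three models clear the threshold, even though the mean (0.45) does not. Score averaging keeps `jobs_and_careers` because one very confident model lifts the mean to 0.60, even though only one model votes for it.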

---

### **Recommendation**

- For **best performance** (highest recall, F1, and Jaccard): **Use V5.1**.
- As **alternatives to majority voting**: **Use V5.2** (score averaging) or **V4** (single zero-shot model). Both trail V5.1 slightly on every reported metric, so the table does not show them producing fewer over-predictions.
- For **fast, rule-based tagging**: **Use V3**, which has the best F1 (0.7443) among the rule-based versions (V1–V3).
