## Topic Classification and Item Similarity Labeling Using LLMs

** Objectives **

- **Automated Labeling Tasks:** Use LLMs for topic classification and item similarity labeling.

- **Multi-LLM Evaluation:** Assess agreement and alignment with human expectations using multiple LLM-as-a-judge models.

- **Scalable Labeling:** Implement batch inference to efficiently process large-scale labeling tasks.

#### Prerequisites
- Python 3.8+ installed.

- Libraries: transformers, openai, scikit-learn, numpy, pandas.

- OpenAI API key for using GPT models.

#### Step 1: Load and Prepare Data

Load a dataset of news articles (e.g., df_combined with a summary_cleaned column containing article summaries).Ensure the dataset is clean and ready for processing.

In [None]:
import pandas as pd

# Load dataset
file_path = "./../Data/news-recommendation/news_summary.tsv"
item_df = pd.read_csv(file_path, sep="\t")
item_df = item_df.sample(frac=1, random_state=42).reset_index(drop=True)


In [None]:
print(item_df.shape)
print(set(item_df['category'].tolist()))
item_df.head()

#### Step 3: Topic Classification Labeling Using Prompts
- Create a list of predefined topics (e.g., health, technology, sports).
- Design Prompt for Topic Classification
- Classify all articles in the dataset and store the results.

In [56]:
import json
import re

import openai
import os
from openai import OpenAI
import yaml
import time

import logging
# Configure logging
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix
from openai import OpenAI
from sklearn.neighbors import NearestNeighbors
import time

# Read the YAML file
with open('./../../../Curify/curify_api.yaml', 'r') as yaml_file:
    data = yaml.safe_load(yaml_file)

# Access the API keys and other configuration data
openai_api_key = data.get('openai').get('api_key')
deepseek_api_key = data.get('deepseek').get('api_key')

def call_LLM_api(prompt, model="gpt-4o", max_tokens=5000, retries=3, backoff_factor=2, api_keys=None):
    """
    Send a prompt to either OpenAI (GPT-4o) or DeepSeek API and handle potential errors robustly.

    Parameters:
        prompt (str): The user input or task prompt to send to the model.
        model (str): The model to use ("gpt-4o" for OpenAI or "deepseek-chat" for DeepSeek).
        max_tokens (int): The maximum number of tokens in the response.
        retries (int): Number of retry attempts in case of transient errors.
        backoff_factor (int): Backoff time multiplier for retries.
        api_keys (dict): Dictionary containing API keys with keys "openai" and "deepseek".

    Returns:
        str: The model's response content if successful.
    """
    if model.startswith("gpt-"):
        api_key = openai_api_key
        base_url = "https://api.openai.com/v1"
    elif model.startswith("deepseek"):
        api_key = deepseek_api_key
        base_url = "https://api.deepseek.com"
    else:
        raise ValueError("Unsupported model specified.")
    
    client = OpenAI(api_key=api_key, base_url=base_url)
    
    for attempt in range(1, retries + 1):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=max_tokens,
            )        
            return response.choices[0].message.content.strip()
        
        except (openai.RateLimitError, openai.APIConnectionError) as e:
            logging.warning(f"Transient error: {e}. Attempt {attempt} of {retries}. Retrying...")
        except (openai.BadRequestError, openai.AuthenticationError) as e:
            logging.error(f"Unrecoverable error: {e}. Check your inputs or API key.")
            break
        except Exception as e:
            logging.error(f"Unexpected error: {e}. Attempt {attempt} of {retries}. Retrying...")
        
        # Exponential backoff before retrying
        if attempt < retries:
            time.sleep(backoff_factor * attempt)
    
    raise RuntimeError(f"Failed to fetch response from {model} API after {retries} attempts.")

# List of Tier 1 topics
TIER1_TOPICS = [
    "arts, culture, entertainment and media",
    "crime, law and justice",
    "disaster, accident and emergency incident",
    "economy, business and finance",
    "education",
    "environment",
    "health",
    "human interest",
    "labour",
    "lifestyle and leisure",
    "politics",
    "religion",
    "science and technology",
    "society",
    "sport",
    "conflict, war and peace",
    "weather"
]

TIER1_TOPICS = ['business', 'tech', 'entertainment', 'sport', 'politics']

def classify_tier1_topics(text, model="gpt-4o-mini", api_keys=None):
    """
    Calls an LLM (GPT-4o or DeepSeek) to classify the input text into relevant Tier 1 topics.
    
    Returns:
    {
        "top_topics": ["topic1", "topic2", "topic3"],  # Up to 3 topics
        "primary_topic": "top_topic"  # Most relevant topic
    }
    """
    prompt = f"""
    Given the following article summary, classify it into relevant Tier 1 topics from the list below.
    
    Topics: {TIER1_TOPICS}

    Return a JSON object with the format:
    {{
        "top_topics": ["topic1", "topic2", "topic3"],  # At most 3 topics
        "primary_topic": "top_topic"  # Most relevant topic
    }}

    Ensure the topics are chosen from the provided list.

    Article Summary: "{text}"
    """

    response = call_LLM_api(prompt, model=model, api_keys=api_keys)  # Call LLM API
     # Extract the JSON content using regex to remove extra characters
    match = re.search(r"\{.*\}", response, re.DOTALL)  
    if match:
        json_str = match.group(0)  # Extract the valid JSON
        try:
            data = json.loads(json_str)  # Parse JSON
            if isinstance(data, dict) and "top_topics" in data and "primary_topic" in data:
                return data  # Valid JSON
        except json.JSONDecodeError as e:
            print(f"JSON parsing error: {e}")
    
    # If parsing fails, return a fallback response
    return {"top_topics": [], "primary_topic": None}  

item_df[["top_topics_gpt", "primary_topic_gpt"]] = item_df["summary"].apply(
    lambda x: pd.Series(classify_tier1_topics(x, model="gpt-4o-mini"))
)

item_df[["top_topics_deepseek", "primary_topic_deepseek"]] = item_df["summary"].apply(
    lambda x: pd.Series(classify_tier1_topics(x, model="deepseek-chat"))
)

# Save the labeled DataFrame to CSV
item_df.to_csv("news_summary_labeled.csv", index=False)
print("Labeled data saved to 'news_summary_labeled.csv'")

2025-03-20 03:26:42,250 - DEBUG - Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'user', 'content': '\n    Given the following article summary, classify it into relevant Tier 1 topics from the list below.\n    \n    Topics: [\'business\', \'tech\', \'entertainment\', \'sport\', \'politics\']\n\n    Return a JSON object with the format:\n    {\n        "top_topics": ["topic1", "topic2", "topic3"],  # At most 3 topics\n        "primary_topic": "top_topic"  # Most relevant topic\n    }\n\n    Ensure the topics are chosen from the provided list.\n\n    Article Summary: "All areas saw a rise in annual house price inflation in November except for Northern Ireland and the West Midlands, where the rate was unchanged, the ODPM said.It said annual inflation rose between October and November because prices had fallen by 1.1% in the same period in 2003.In London, the area with the highest average house price at £262,825, annual in

KeyboardInterrupt: 

#### Step 4: Evaluate Multi-LLM Labeling Quality for Topic Classification
- Generate cross-tabulation of primary topics labeled with GPT-4o and deepseek-V3.
- Display cases where the models disagree with human assessment.


In [None]:
import pandas as pd

# Create a file to log results
output_file = "classification_evaluation.txt"
with open(output_file, "w") as f:

    # 1. Agreement between GPT and DeepSeek
    item_df["gpt_deepseek_agreement"] = item_df["primary_topic_gpt"] == item_df["primary_topic_deepseek"]
    agreement_rate = item_df["gpt_deepseek_agreement"].mean()

    f.write("=== GPT vs DeepSeek Agreement ===\n")
    f.write(f"Agreement Rate: {agreement_rate:.2%}\n\n")

    # 2. Accuracy of GPT and DeepSeek compared to true category
    gpt_accuracy = (item_df["primary_topic_gpt"] == item_df["category"]).mean()
    deepseek_accuracy = (item_df["primary_topic_deepseek"] == item_df["category"]).mean()

    f.write("=== Classification Accuracy ===\n")
    f.write(f"GPT Accuracy: {gpt_accuracy:.2%}\n")
    f.write(f"DeepSeek Accuracy: {deepseek_accuracy:.2%}\n\n")
    
    # 3. Confusion Matrix (Optional Detailed View)
    f.write("=== GPT Confusion Matrix ===\n")
    gpt_confmat = pd.crosstab(item_df["category"], item_df["primary_topic_gpt"])
    f.write(gpt_confmat.to_string())  # cleaner formatting
    f.write("\n\n")

    f.write("=== DeepSeek Confusion Matrix ===\n")
    deepseek_confmat = pd.crosstab(item_df["category"], item_df["primary_topic_deepseek"])
    f.write(deepseek_confmat.to_string())
    f.write("\n\n")

    # 4. Examples where GPT and DeepSeek disagree
    disagreement_df = item_df[item_df["primary_topic_gpt"] != item_df["primary_topic_deepseek"]]
    f.write("=== Example Disagreements Between GPT and DeepSeek ===\n")
    f.write(disagreement_df[["summary", "category", "primary_topic_gpt", "primary_topic_deepseek"]].head(20).to_string(index=False))
    f.write("\n\n")

    # 5. Save full disagreement cases to CSV (optional)
    disagreement_df.to_csv("disagreement_cases.csv", index=False)

print(f"✅ Evaluation completed. Results saved to: {output_file}")


#### Step 4: Item Similarity Labeling Using Similarity Matrix and LLM Prompts
- Generate embeddings for all articles to compute similarity.
- Use cosine similarity to compute pairwise similarities and filter pairs above a threshold.
- Use a prompt to ask the LLM to verify if two articles are similar.

In [65]:
from sklearn.neighbors import NearestNeighbors
import numpy as np
import json
import os

from openai import OpenAI

def clean_json_string(text):
    if text.startswith("```json"):
        text = text.replace("```json", "").replace("```", "").strip()
    return text

import torch
import numpy as np

def generate_embeddings(texts, model, tokenizer, batch_size=32, embedding_dim=384, device='cpu'):
    """
    Generate dense embeddings for a list of texts using a pre-trained transformer model.

    Args:
        texts (list of str): Input text corpus.
        model: HuggingFace model for embeddings (e.g., sentence-transformers or AutoModel).
        tokenizer: Tokenizer matching the model.
        batch_size (int): Batch size for batched inference.
        embedding_dim (int): Dimension of the embedding vectors.
        device (str): 'cpu' or 'cuda'.

    Returns:
        np.ndarray: Embedding matrix of shape (len(texts), embedding_dim)
    """
    model.to(device)
    model.eval()
    all_embeddings = []

    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            encoded_input = tokenizer(batch_texts, padding=True, truncation=True, return_tensors='pt').to(device)

            model_output = model(**encoded_input)
            if hasattr(model_output, 'last_hidden_state'):
                # Mean pooling over sequence
                embeddings = model_output.last_hidden_state.mean(dim=1)
            elif hasattr(model_output, 'pooler_output'):
                embeddings = model_output.pooler_output
            else:
                raise ValueError("Unsupported model output structure.")

            all_embeddings.append(embeddings.cpu())

    all_embeddings = torch.cat(all_embeddings, dim=0)
    return all_embeddings.numpy()

def find_approximate_neighbors(embeddings, k=20, threshold=0.7):
    """
    Use approximate nearest neighbor (ANN) search to find top-k similar pairs with similarity scores.
    
    Returns:
        List of tuples: [(i, j, similarity), ...]
    """
    nbrs = NearestNeighbors(n_neighbors=k, metric='cosine').fit(embeddings)
    distances, indices = nbrs.kneighbors(embeddings)
    similar_pairs = []

    for i, (dists, idxs) in enumerate(zip(distances, indices)):
        for j, dist in zip(idxs, dists):
            if i < j:  # avoid duplicates
                similarity = 1 - dist
                if similarity >= threshold:
                    similar_pairs.append((i, j, similarity))

    return similar_pairs

def call_openai_api_batch(similar_pairs, texts, batch_size=10, model="gpt-4o-mini"):
    """
    Calls OpenAI API in batches with similarity-aware pairs.
    
    Args:
        similar_pairs (list): List of (i, j, similarity).
        texts (list): List of article texts.
        batch_size (int): API batch size.
        model (str): Model name.
        openai_api_key (str): Your OpenAI API key.
    
    Returns:
        List of dicts: [{"pair": (text1, text2), "similarity": float, "label": 0/1, "reasoning": str}]
    """
    all_results = []
    client = OpenAI(api_key=openai_api_key)

    for i in range(0, len(similar_pairs), batch_size):
        batch = similar_pairs[i:i + batch_size]

        for idx1, idx2, sim in batch:
            text1 = texts[idx1]
            text2 = texts[idx2]

            prompt = f"""
You are a helpful assistant for text similarity analysis.
Are these two articles discussing the same topic?

Article 1: "{text1}"
Article 2: "{text2}"

Provide your reasoning and output in strict JSON format:
{{
  "reasoning": "Explain your decision briefly",
  "answer": "Yes" or "No"
}}"""

            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    max_tokens=300,
                    temperature=0.2
                )

                content = response.choices[0].message.content.strip()
                content = clean_json_string(content)

                try:
                    parsed = json.loads(content)
                    answer = parsed.get("answer", "").strip().lower()
                    label = 1 if answer == "yes" else 0
                    reasoning = parsed.get("reasoning", "")

                except Exception as parse_err:
                    print(f"⚠️ JSON parsing error for pair ({idx1}, {idx2}): {parse_err}")
                    label = 0
                    reasoning = "Failed to parse JSON response"

            except Exception as api_err:
                print(f"❌ API error for pair ({idx1}, {idx2}): {api_err}")
                label = 0
                reasoning = "API call failed"

            all_results.append({
                "pair": (text1, text2),
                "similarity": sim,
                "label": label,
                "reasoning": reasoning
            })

            print(f"\n✅ Pair (sim={sim:.3f}): \n🔸 Article 1: {text1[:100]}...\n🔸 Article 2: {text2[:100]}...\n➡ Label: {label}, Reason: {reasoning[:150]}")

    return all_results

def process_articles(texts, item_ids, k=20, threshold=0.7, batch_size=10):
    """
    Process articles and classify all ANN pairs using OpenAI API validation.
    
    Returns:
        similarity_matrix: Dict {(text_1, text_2): label}
        reasoning_dict: Dict {(text_1, text_2): reasoning_text}
    """
    if len(texts) != len(item_ids):
        raise ValueError("texts and item_ids must have the same length.")

    total_start_time = time.time()

    # Step 1: Embedding
    start_time = time.time()
    embeddings = generate_embeddings(texts, model, tokenizer)

    embedding_time = time.time() - start_time
    print(f"✅ Embedding generation time: {embedding_time:.2f}s")

    # Step 2: ANN search
    start_time = time.time()
    similar_pairs = find_approximate_neighbors(embeddings, k=k, threshold=threshold)
    ann_time = time.time() - start_time
    print(f"✅ ANN search time: {ann_time:.2f}s")

    # Step 3: OpenAI validation
    start_time = time.time()
    pair_results = call_openai_api_batch(similar_pairs, texts, batch_size=batch_size)
    api_time = time.time() - start_time
    print(f"✅ OpenAI API time: {api_time:.2f}s")

    total_time = time.time() - total_start_time

    # Summary
    print("\n=== ⏳ Timing Summary ===")
    print(f"🔹 Embedding: {embedding_time:.2f}s")
    print(f"🔹 ANN: {ann_time:.2f}s")
    print(f"🔹 OpenAI Validation: {api_time:.2f}s")
    print(f"🔹 Total: {total_time:.2f}s")
    print("========================")

    return len(similar_pairs), pair_results, total_time, api_time


#### Step 6: Evaluate Batch Inference Latency

In [69]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Assume final_1 is your list of results
# Example format:
# final_1 = [{"pair": (...), "similarity": 0.85, "label": 1, "reasoning": "..."}, ...]
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def plot_similarity_vs_label(results, output_file="similarity_vs_label.png"):
    """
    Plot cosine similarity vs. binary label and save to local file.
    
    Args:
        results (list of dict): List of {"pair": (text1, text2), "similarity": float, "label": int, ...}
        output_file (str): File path to save the plot.
    """
    # Convert to DataFrame
    df = pd.DataFrame(results)
    
    # Create the plot
    plt.figure(figsize=(7, 5))
    sns.stripplot(x='label', y='similarity', data=df, jitter=True, palette='Set1', alpha=0.7)

    # Formatting
    plt.title("Similarity vs Label")
    plt.xlabel("Label (0 = Not Similar, 1 = Similar)")
    plt.ylabel("Cosine Similarity Score")
    plt.grid(True, linestyle="--", alpha=0.3)
    plt.tight_layout()
    
    # Save to file
    plt.savefig(output_file, dpi=300)
    plt.close()
    print(f"Plot saved to: {output_file}")

from sklearn.metrics import roc_auc_score

def compute_auc_score(results):
    similarities = [r["similarity"] for r in results]
    labels = [r["label"] for r in results]
    auc = roc_auc_score(labels, similarities)
    return auc


In [71]:
# Example usage
num_sample = 500
num_k = 10
item_head = item_df.sample(frac=1, random_state=42).head(num_sample).reset_index(drop=True)
texts = item_head['summary'].tolist()
item_ids = item_head['item_id'].tolist()  # Assuming item_df has 'item_id' column

# Run article processing with timing by setting different batch size.
label_vol_1, final_1, total_time_1, api_time_1 = process_articles(texts, item_ids, k=num_k, batch_size=1)


auc_score = compute_auc_score(final_1)
print(f"AUC-ROC Score: {auc_score:.4f}")

# Serialize both results
results_combined = {
"Results 1": final_1,
    "Batch size 1": api_time_1,
    "Label vol 1": label_vol_1,
    "AUC": auc_score
}

plot_similarity_vs_label(final_1)

# Save to JSON file
with open("batch_latency.json", "w") as f:
    json.dump(results_combined, f, indent=2, default=str)

✅ Embedding generation time: 29.29s
✅ ANN search time: 0.03s

✅ Pair (sim=0.836): 
🔸 Article 1: O'Gara missed a penalty which would have put Ireland nine points clear, and the home crowd breathed ...
🔸 Article 2: Ireland got themselves on the scoreboard with an O'Gara penalty and by the 24th minute the visitors ...
➡ Label: 1, Reason: Both articles discuss matches involving the Irish rugby team during the Six Nations tournament. Article 1 focuses on Ireland's victory over England, h

✅ Pair (sim=0.811): 
🔸 Article 1: The biggest single project to be halted was the Xiluodi Dam project, designed to produce 12,600 MW o...
🔸 Article 2: The China Three Gorges Project Corp is refusing to obey a government order to stop construction of o...
➡ Label: 1, Reason: Both articles discuss the construction of dams in China, particularly focusing on the Three Gorges Dam and related projects. Article 1 highlights the 

✅ Pair (sim=0.870): 
🔸 Article 1: Deutsche Boerse bosses have held "constructive, pr

2025-03-20 03:54:51,963 - DEBUG - receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Thu, 20 Mar 2025 03:54:51 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'access-control-expose-headers', b'X-Request-ID'), (b'openai-organization', b'user-gbgqrr8mnjzeosnimrnxpaf7'), (b'openai-processing-ms', b'5276'), (b'openai-version', b'2020-10-01'), (b'x-ratelimit-limit-requests', b'5000'), (b'x-ratelimit-limit-tokens', b'2000000'), (b'x-ratelimit-remaining-requests', b'4999'), (b'x-ratelimit-remaining-tokens', b'1999452'), (b'x-ratelimit-reset-requests', b'12ms'), (b'x-ratelimit-reset-tokens', b'16ms'), (b'x-request-id', b'req_e38d713634a88883b01b241f9d4846b6'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains; preload'), (b'cf-cache-status', b'DYNAMIC'), (b'X-Content-Type-Options', b'nosniff'), (b'Server', b'cloudflare'), (b'CF-RAY', b'923245291f34ef61-IAD'), (b'Content-E