# **Problem Statement 1**  
### **Filtering the Noise: ML for Trustworthy Location Reviews**  
**Team 3Pandas** *(Tran Ha My, Diane Teo Min Xuan, Ng Yuen Ning)*  

---

## **Problem Statement**  
Design and implement an **ML-based system** to evaluate the **quality** and **relevancy** of Google location reviews. The system should:  

- **Gauge review quality:** Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.  
- **Assess relevancy:** Determine whether the content of a review is genuinely related to the location being reviewed.  
- **Enforce policies:** Automatically flag or filter out reviews that violate the following example policies:  
  - No advertisements or promotional content.  
  - No irrelevant content (e.g., reviews about unrelated topics).  
  - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).  
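
The three example policies can be read as simple rules over per-review labels. The sketch below is illustrative only: it assumes the label names `is_ad`, `is_relevant`, and `is_rant` used later in this notebook, and the helper `policy_violations` and its policy strings are hypothetical, not part of the original pipeline.

```python
from typing import Dict, List


def policy_violations(labels: Dict[str, bool]) -> List[str]:
    """Map predicted labels to the example policies they violate (hypothetical helper)."""
    violations = []
    if labels.get("is_ad", False):
        violations.append("advertisement")
    if not labels.get("is_relevant", True):
        violations.append("irrelevant content")
    if labels.get("is_rant", False):
        violations.append("rant")
    return violations


# Hypothetical prediction for a single review
example = {"is_ad": True, "is_relevant": True, "is_rant": False, "is_legit": False}
print(policy_violations(example))
```

A review with an empty violation list would pass all three example policies; how flagged reviews are then handled (hidden, queued for manual review, and so on) is left open here.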

---

## **Motivation & Impact**  
- **For Users:** Increases trust in location-based reviews, leading to better decision-making.  
- **For Businesses:** Ensures fair representation and reduces the impact of malicious or irrelevant reviews.  
- **For Platforms:** Automates moderation, reduces manual workload, and enhances platform credibility.  

---

## **Data Sources**  

| **Source Type**        | **Details** |
|-------------------------|-------------|
| **Public Datasets**    | - **Google Review Data:** Open datasets containing Google location reviews (e.g., [Google Local Reviews on Kaggle](https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews))<br>- **Google Local review data:** [UCSD Public Dataset](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/)<br>- **Alternative Sources:** Yelp, TripAdvisor, or other open review datasets for supplementary training. |
| **Student-Crawled Data** | - Students are encouraged to crawl additional reviews from Google Maps (in compliance with Google's terms of service).<br>- **Example:** [Scraping Google Reviews (YouTube)](https://www.youtube.com/watch?v=LYMdZ7W9bWQ) |


### Dependencies

In [3]:
!pip install -q transformers accelerate datasets peft torch tensorboard iterative-stratification scikit-learn plotly optuna
!pip install --upgrade --quiet nltk textblob

In [7]:
# ===== Standard Library =====
import os
import re
import gc
import shutil
import json
import zipfile
from pathlib import Path

# ===== Data Processing & Utilities (third-party) =====
import psutil
import yaml
import numpy as np
import pandas as pd
import nltk
from textblob import TextBlob
from datasets import Dataset
from tqdm.auto import tqdm  # progress bars for batched inference in Section 5

# ===== PyTorch & CUDA =====
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset, DataLoader
from torch.cuda import amp
from torch.cuda.amp import autocast, GradScaler
from torch.optim import AdamW

# ===== Transformers & NLP Models =====
from transformers import (
    AutoTokenizer,
    AutoModel,
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    pipeline,
    get_scheduler
)

# ===== Hugging Face PEFT (LoRA) =====
from peft import LoraConfig, get_peft_model

# ===== Evaluation Metrics =====
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    precision_recall_fscore_support,
    precision_recall_curve,  # used for per-label threshold tuning in Section 5
    average_precision_score
)

# ===== ML Utilities =====
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

### For Google Colab

In [5]:
from google.colab import drive, files  # files provides files.upload() and files.download() used below
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
uploaded = files.upload()

Saving all_reviews_with_labels_normalised.csv to all_reviews_with_labels_normalised (1).csv
Saving synthetic_combined.csv to synthetic_combined (1).csv


### For Local Machines

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model_dir = Path("yuenning_model")
print(f"Model directory: {model_dir.absolute()}")

if model_dir.exists():
    print("Files in yuenning_model folder:")
    for file in model_dir.iterdir():
        if file.is_file():
            print(f"  📄 {file.name} ({file.stat().st_size / 1024:.1f} KB)")
        else:
            print(f"  📁 {file.name}/")
else:
    print("❌ yuenning_model folder not found!")
    print("Current directory contents:")
    for item in Path('.').iterdir():
        print(f"  {item.name}{'/' if item.is_dir() else ''}")

Using device: cpu
Model directory: c:\Users\ningy\Desktop\NUS\Personal Projects\TikTok_TechJam_2025\3pandas\yuenning_model
Files in yuenning_model folder:
  📄 multilabel_thresholds.json (0.1 KB)
  📄 qwen_gemma_multilabel_final_3.zip (6699.7 KB)


### 1. Load Data

In [7]:
all_reviews = list(uploaded.keys())[0]
synthetic_combined = list(uploaded.keys())[1]

In [8]:
full_df = pd.read_csv(all_reviews)
full_df = full_df.dropna(subset=['rating']).reset_index(drop=True)

print(f"Loaded {all_reviews} with {len(full_df)} rows")
print(full_df.isnull().sum())

Loaded all_reviews_with_labels_normalised (1).csv with 11667 rows
review_text             0
rating                  0
has_photo               0
author_name             0
user_review_count       0
business_name           0
category                0
source                  0
review_id               0
comprehensive_review    0
is_ad                   0
is_relevant             0
is_rant                 0
is_legit                0
dtype: int64


In [9]:
synthetic_df = pd.read_csv(synthetic_combined)

def s(col):
    return synthetic_df[col].fillna("NA").astype(str).str.strip()

has_photo_str = np.where(synthetic_df["has_photo"].fillna(False), "yes", "no")
MAX_REVIEW_CHARS = 2000
review_text_clean = s("review_text").str.replace(r"\s+", " ", regex=True).str[:MAX_REVIEW_CHARS]

synthetic_df["comprehensive_review"] = (
    "[Business] " + s("business_name") +
    " | [Category] " + s("category") +
    " | [Rating] " + s("rating") +
    " | [Author] " + s("author_name") +
    " | [User Review Count] " + s("user_review_count") +
    " | [Has Photo] " + pd.Series(has_photo_str, index=synthetic_df.index) +
    " | [Source] " + s("source") +
    " | [Review] " + review_text_clean
).str.replace(r"\s+\|\s+\[Review\]\s+NA$", "", regex=True)

print(f"Loaded {synthetic_combined} with {len(synthetic_df)} rows")
print(synthetic_df.isnull().sum())

Loaded synthetic_combined (1).csv with 714 rows
review_text             0
rating                  0
has_photo               0
author_name             0
user_review_count       0
business_name           0
category                0
source                  0
review_id               0
is_ad                   0
is_rant                 0
is_legit                0
is_relevant             0
comprehensive_review    0
dtype: int64
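
For reference, the `comprehensive_review` template built above can be written for a single record. This is a sketch under assumptions: the helper `build_comprehensive_review` and the sample record are hypothetical, and only the field order, the `NA` fallback, the 2000-character cap, and the trailing-`NA` cleanup are taken from the cell above.

```python
import re

MAX_REVIEW_CHARS = 2000  # same cap as in the cell above


def build_comprehensive_review(rec: dict) -> str:
    """Scalar version of the column-wise template (hypothetical helper)."""
    def s(key: str) -> str:
        value = rec.get(key)
        return "NA" if value is None else str(value).strip()

    has_photo = "yes" if rec.get("has_photo") else "no"
    # Collapse whitespace runs, then cap the review length
    review = re.sub(r"\s+", " ", s("review_text"))[:MAX_REVIEW_CHARS]
    text = (
        f"[Business] {s('business_name')} | [Category] {s('category')}"
        f" | [Rating] {s('rating')} | [Author] {s('author_name')}"
        f" | [User Review Count] {s('user_review_count')}"
        f" | [Has Photo] {has_photo} | [Source] {s('source')}"
        f" | [Review] {review}"
    )
    # Drop the review field entirely when the text is missing
    return re.sub(r"\s+\|\s+\[Review\]\s+NA$", "", text)


# Hypothetical record, not taken from the dataset
record = {
    "business_name": "Example Cafe", "category": "Cafe", "rating": 4.0,
    "author_name": "A. Reviewer", "user_review_count": 3, "has_photo": False,
    "source": "google", "review_text": "Great   coffee,\nfriendly staff.",
}
print(build_comprehensive_review(record))
```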


In [10]:
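# .copy() makes this an independent DataFrame, so the columns added during
# preprocessing do not trigger pandas' SettingWithCopyWarning
to_clean_df = full_df.dropna(subset=['review_text', 'is_ad', 'is_relevant', 'is_rant', 'is_legit']).copy()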

to_clean_df.head()

Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,category,source,review_id,comprehensive_review,is_ad,is_relevant,is_rant,is_legit
0,Love the convenience of this neighborhood carw...,4.0,False,Doug Schmidt,1.0,"Auto Spa Speedy Wash - Harvester, MO",['Car wash'],google,1001,"[Business] Auto Spa Speedy Wash - Harvester, M...",False,True,False,True
1,"2 bathrooms (for a large 2 story building), 1 ...",2.0,False,Duf Duftopia,1.0,Kmart,"['Discount store', 'Appliance store', 'Baby st...",google,1002,[Business] Kmart | [Category] ['Discount store...,True,True,True,False
2,My favorite pizza shop hands down!,5.0,False,Andrew Phillips,1.0,Papa’s Pizza,"['Pizza restaurant', 'Chicken wings restaurant...",google,1003,[Business] Papa’s Pizza | [Category] ['Pizza r...,False,True,False,True
3,BOTCHED INSTRUMENT REPAIR IS COSTING US HUNDRE...,1.0,False,Julie Heiland,1.0,The Music Place,['Musical instrument store'],google,1004,[Business] The Music Place | [Category] ['Musi...,False,True,True,False
4,Very unprofessional!!!!!,1.0,False,Alan Khasanov,1.0,Park Motor Cars Inc,['Used car dealer'],google,1005,[Business] Park Motor Cars Inc | [Category] ['...,False,True,True,False


### 2. Pre-Process DataFrames

##### 2.1 Cleaning Functions

In [11]:
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text)
    text = normalize_whitespace(text)
    return text

##### 2.2 Compute Basic Signals

In [12]:
def compute_basic_signals(text):
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio
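
To see what the three signals capture, the snippet below restates the function so it runs on its own and applies it to one made-up review. The sample text is hypothetical and chosen to trigger all three signals.

```python
import re


def compute_basic_signals(text):
    # Same regexes as the cell above, repeated here so the example is standalone
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio


# Hypothetical promotional review
sample = "CALL NOW +65 9123 4567 or visit https://example.com/deal"
url_count, phone_count, caps_ratio = compute_basic_signals(sample)
print(url_count, phone_count, round(caps_ratio, 3))
```

Note that the phone pattern matches any run of digits, spaces, and hyphens that starts and ends with a digit and is at least nine characters long, so long order numbers or dates could also be counted.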

##### 2.3 Sentiment Analysis

In [13]:
def add_textblob_sentiment(df, text_col="review_text", positive_threshold=0.9, negative_threshold=-0.9):
    def get_sentiment(text):
        if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
            return 0.0, 0.0
        try:
            analysis = TextBlob(text)
            return analysis.sentiment.polarity, analysis.sentiment.subjectivity
        except Exception:
            return 0.0, 0.0

    sentiment_results = df[text_col].apply(get_sentiment)
    df["sentiment_polarity"], df["sentiment_subjectivity"] = zip(*sentiment_results)

    df["is_extreme_sentiment"] = df["sentiment_polarity"].apply(
        lambda x: 1 if x >= positive_threshold or x <= negative_threshold else 0
    )

    return df

##### 2.4 Apply to DataFrame

In [14]:
def preprocess_reviews(df):
    df["clean_text"] = df["review_text"].apply(clean_text)
    signals = df["clean_text"].apply(compute_basic_signals)
    df["url_count"], df["phone_count"], df["caps_ratio"] = zip(*signals)
    return df

cleaned_df = preprocess_reviews(to_clean_df)
print(cleaned_df.head())

cleaned_synthetic_df = preprocess_reviews(synthetic_df)
print(cleaned_synthetic_df.head())

                                         review_text  rating  has_photo  \
0  Love the convenience of this neighborhood carw...     4.0      False   
1  2 bathrooms (for a large 2 story building), 1 ...     2.0      False   
2                 My favorite pizza shop hands down!     5.0      False   
3  BOTCHED INSTRUMENT REPAIR IS COSTING US HUNDRE...     1.0      False   
4                           Very unprofessional!!!!!     1.0      False   

       author_name  user_review_count                         business_name  \
0     Doug Schmidt                1.0  Auto Spa Speedy Wash - Harvester, MO   
1     Duf Duftopia                1.0                                 Kmart   
2  Andrew Phillips                1.0                          Papa’s Pizza   
3    Julie Heiland                1.0                       The Music Place   
4    Alan Khasanov                1.0                   Park Motor Cars Inc   

                                            category  source  review_id  \

In [15]:
# # Save as JSON
# output_json_path = os.path.join(labeled_input_folder, "cleaned_df.json")
# cleaned_df.to_json(output_json_path, orient="records", lines=True, force_ascii=False)
# print(f"JSON file saved to: {output_json_path}")

# # Save as Parquet
# output_parquet_path = os.path.join(labeled_input_folder, "cleaned_df.parquet")
# cleaned_df.to_parquet(output_parquet_path, index=False)
# print(f"Parquet file saved to: {output_parquet_path}")

### 3. Train-Test Split with Multi-Label Stratification

In [78]:
meta_cols = ["clean_text", "url_count","phone_count","caps_ratio","rating","has_photo","user_review_count"]
label_cols = ["is_ad", "is_relevant", "is_rant", "is_legit"]

X = cleaned_df.drop(columns=label_cols)
y = cleaned_df[label_cols].values

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_val_idx, test_idx = next(mskf.split(X, y))

train_val_df = cleaned_df.iloc[train_val_idx].reset_index(drop=True)
test_df = cleaned_df.iloc[test_idx].reset_index(drop=True)

y_train_val = y[train_val_idx]
train_idx, val_idx = next(mskf.split(train_val_df.drop(columns=label_cols), y_train_val))

train_df_original = train_val_df.iloc[train_idx].reset_index(drop=True)
val_df = train_val_df.iloc[val_idx].reset_index(drop=True)

train_df = pd.concat([train_df_original, cleaned_synthetic_df], ignore_index=True).reset_index(drop=True)

print(f"Training set size: {train_df.shape} (includes synthetic data)")
print(f"Validation set size: {val_df.shape}")
print(f"Test set size: {test_df.shape}")

print(f"Synthetic data in validation set: {val_df['review_id'].isin(cleaned_synthetic_df['review_id']).any()}")
print(f"Synthetic data in test set: {test_df['review_id'].isin(cleaned_synthetic_df['review_id']).any()}")

Training set size: (8181, 18) (includes synthetic data)
Validation set size: (1867, 18)
Test set size: (2333, 18)
Synthetic data in validation set: False
Synthetic data in test set: False
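
Taking the first fold of a 5-fold splitter twice yields roughly a 64/16/20 train/validation/test split of the real data, with the synthetic rows added to training only. The arithmetic below reproduces the reported sizes; using `round` as a stand-in for the splitter's exact fold sizes is an assumption that happens to match the output above.

```python
n_total = 11667   # rows loaded from the labelled review file
n_synth = 714     # rows in the synthetic file
n_splits = 5

# First split: one fold of five is held out as the test set
n_test = round(n_total / n_splits)
n_train_val = n_total - n_test

# Second split: one fold of five of the remainder becomes validation
n_val = round(n_train_val / n_splits)
n_train_real = n_train_val - n_val

# Synthetic rows are appended to the training portion only
n_train = n_train_real + n_synth

print(n_train, n_val, n_test)
print(f"real-data fractions: train={n_train_real / n_total:.2f}, "
      f"val={n_val / n_total:.2f}, test={n_test / n_total:.2f}")
```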


### 4. Tokenisation

In [21]:
def simple_tokenize(text):
    text = str(text).lower()
    tokens = re.findall(r'\b[a-z]+\b', text)
    return tokens

train_df['tokens'] = train_df['clean_text'].apply(simple_tokenize)
test_df['tokens'] = test_df['clean_text'].apply(simple_tokenize)
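
The regex tokeniser keeps only runs of ASCII letters, so digits and punctuation are dropped and contractions are split. A small standalone illustration on a made-up sentence:

```python
import re


def simple_tokenize(text):
    # Same logic as above: lowercase, then keep alphabetic runs only
    text = str(text).lower()
    return re.findall(r'\b[a-z]+\b', text)


tokens = simple_tokenize("Don't miss 2-for-1 pizza!")
print(tokens)  # ['don', 't', 'miss', 'for', 'pizza']
```

Non-ASCII words, such as names with accents, are also split or dropped by this pattern, which may matter for multilingual reviews.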

In [22]:
print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

(8181, 19)
(1867, 18)
(2333, 19)


### Yuen Ning's model

### Previous Model

In [81]:
model_name = "Qwen/Qwen1.5-0.5B"  # or "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [80]:
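model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4, torch_dtype="auto")
# Caveat (worth verifying against the installed PEFT version): no task_type is set here, so PEFT
# wraps the model generically and trains only the LoRA adapter weights. The newly initialised
# `score` head then stays frozen and is not written by save_pretrained(). Passing
# task_type="SEQ_CLS" would train and save the head, but the results recorded in this
# notebook were produced with the configuration below.
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, config)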

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen1.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
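
As a rough guide to what `r=8, lora_alpha=32` means: LoRA adds two low-rank matrices per adapted weight and scales their product by `lora_alpha / r`. The sketch below works through the arithmetic for a single square projection. The width `d = 1024` is an illustrative assumption, not a value read from the model config, and `lora_extra_params` is a hypothetical helper.

```python
def lora_extra_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


r, lora_alpha = 8, 32
scaling = lora_alpha / r          # factor applied to the low-rank update B @ A

d = 1024                          # illustrative projection width (assumption)
full_params = d * d               # parameters in the frozen d x d weight
extra_params = lora_extra_params(d, d, r)

print(f"scaling = {scaling}")
print(f"LoRA adds {extra_params} trainable parameters next to {full_params} frozen ones "
      f"({extra_params / full_params:.2%})")
```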


In [79]:
training_args = TrainingArguments(
    output_dir="./qwen_gemma_multilabel",
    # evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    report_to="none",
    remove_unused_columns=False
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs > 0.5).astype(int)

    precision = precision_score(labels, preds, average='macro', zero_division=0)
    recall = recall_score(labels, preds, average='macro', zero_division=0)
    f1 = f1_score(labels, preds, average='macro', zero_division=0)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

In [83]:
label_cols = ["is_ad", "is_relevant", "is_rant", "is_legit"]

# Build a (num_rows, num_labels) array of labels from train_df
labels_array = np.array(train_df[label_cols])
pos_counts = labels_array.sum(axis=0)
neg_counts = len(labels_array) - pos_counts
pos_weight = torch.tensor(neg_counts / (pos_counts + 1e-8), dtype=torch.float)  # avoid divide by zero
print("Class weights (pos_weight):", pos_weight)

Class weights (pos_weight): tensor([13.4032,  0.0972,  9.0135,  0.2771])
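
Because `pos_weight` is the ratio of negatives to positives, each printed weight implies a positive rate of `1 / (1 + pos_weight)` in the training set. The sketch below recovers those rates from the rounded values above and shows, in plain Python, how the weight scales the loss on positive examples. The function `weighted_bce` is a hand-written illustration of the `BCEWithLogitsLoss` formula with `pos_weight`, not the PyTorch implementation.

```python
import math

# Rounded values copied from the output above
pos_weight = {"is_ad": 13.4032, "is_relevant": 0.0972, "is_rant": 9.0135, "is_legit": 0.2771}

# pos_weight = neg / pos  =>  pos / (pos + neg) = 1 / (1 + pos_weight)
prevalence = {label: 1.0 / (1.0 + w) for label, w in pos_weight.items()}
for label, p in prevalence.items():
    print(f"{label}: about {p:.1%} positive in the training set")


def weighted_bce(logit: float, target: float, w: float) -> float:
    """Binary cross-entropy on a logit, with positive examples scaled by w."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(w * target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# At logit 0 (p = 0.5), a missed positive ad costs about 13.4x a missed negative
print(weighted_bce(0.0, 1.0, pos_weight["is_ad"]), weighted_bce(0.0, 0.0, pos_weight["is_ad"]))
```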


In [86]:
def preprocess(batch):
    return tokenizer(batch["clean_text"], truncation=True, padding="max_length", max_length=512)

def prepare_dataset(df):
    # Keep only the text and label columns and wrap them in a Hugging Face Dataset
    df = df.reset_index(drop=True)
    return Dataset.from_pandas(df[["clean_text"] + label_cols])

# Drop row 7026 from train_df before creating the dataset
# (the notebook does not record why this row is excluded)
train_df_filtered = train_df.drop(index=7026).reset_index(drop=True)

train_dataset = prepare_dataset(train_df_filtered)
val_dataset = prepare_dataset(val_df)

train_dataset = train_dataset.map(preprocess, batched=True)
val_dataset = val_dataset.map(preprocess, batched=True)
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"] + label_cols)
val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"] + label_cols)


Map:   0%|          | 0/8180 [00:00<?, ? examples/s]

Map:   0%|          | 0/1867 [00:00<?, ? examples/s]

In [90]:
def data_collator(batch):
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    labels = torch.stack([
        torch.tensor([item["is_ad"], item["is_relevant"], item["is_rant"], item["is_legit"]], dtype=torch.float)
        for item in batch
    ])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

class MultiLabelTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        loss_fct = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight.to(logits.device))
        loss = loss_fct(logits, labels)
        return (loss, outputs) if return_outputs else loss

trainer = MultiLabelTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)



In [91]:
trainer.train()

Step,Training Loss
500,1.0403
1000,0.4951
1500,0.2845


TrainOutput(global_step=1536, training_loss=0.5968945575878024, metrics={'train_runtime': 1977.0227, 'train_samples_per_second': 12.413, 'train_steps_per_second': 0.777, 'total_flos': 2.330930486181888e+16, 'train_loss': 0.5968945575878024, 'epoch': 3.0})

In [92]:
model.save_pretrained("./qwen_gemma_multilabel_final_3")
tokenizer.save_pretrained("./qwen_gemma_multilabel_final_3")

('./qwen_gemma_multilabel_final_3/tokenizer_config.json',
 './qwen_gemma_multilabel_final_3/special_tokens_map.json',
 './qwen_gemma_multilabel_final_3/chat_template.jinja',
 './qwen_gemma_multilabel_final_3/vocab.json',
 './qwen_gemma_multilabel_final_3/merges.txt',
 './qwen_gemma_multilabel_final_3/added_tokens.json',
 './qwen_gemma_multilabel_final_3/tokenizer.json')

In [93]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.4419700801372528, 'eval_precision': 0.7407957440067051, 'eval_recall': 0.7859766918655908, 'eval_f1': 0.7593047443425293, 'eval_runtime': 63.8983, 'eval_samples_per_second': 29.218, 'eval_steps_per_second': 3.662, 'epoch': 3.0}


In [96]:
shutil.make_archive("qwen_gemma_multilabel_final_3", 'zip', "./qwen_gemma_multilabel_final_3")
files.download("qwen_gemma_multilabel_final_3.zip")


### 5. Threshold Tuning

In [98]:
def safe_get_predictions_proba(model, tokenizer, texts, device='cuda', batch_size=16):
    """Safe prediction function with proper device and type handling"""
    model.eval()
    model.to(device)
    all_probs = []

    with torch.no_grad():
        for i in tqdm(range(0, len(texts), batch_size), desc="Getting predictions"):
            batch_texts = texts[i:i+batch_size]

            # Tokenize
            inputs = tokenizer(
                batch_texts,
                truncation=True,
                padding=True,
                max_length=512,
                return_tensors='pt'
            )

            # Explicitly move to device
            inputs = {key: value.to(device) for key, value in inputs.items()}

            outputs = model(**inputs)
            probs = torch.sigmoid(outputs.logits)

            # Convert to float32 before moving to CPU for numpy compatibility
            probs_float32 = probs.float().cpu().numpy()
            all_probs.extend(probs_float32)

    return np.array(all_probs)


class MultilabelThresholdTuner:
    def __init__(self, model, tokenizer, label_cols, device='cuda'):
        self.model = model
        self.tokenizer = tokenizer
        self.label_cols = label_cols
        self.device = device
        self.best_thresholds = None
        self.best_metrics = None

    def get_predictions_proba(self, texts, batch_size=16):
        """Return sigmoid probabilities for texts via the module-level helper."""
        return safe_get_predictions_proba(
            self.model, self.tokenizer, texts, device=self.device, batch_size=batch_size
        )

    def get_true_labels_from_df(self, df):
        """Extract true labels from DataFrame"""
        true_labels = []
        for label_col in self.label_cols:
            true_labels.append(df[label_col].values)
        return np.array(true_labels).T

    def optimize_thresholds_per_class(self, y_true, y_probs):
        """Optimize thresholds for each class using F1 maximization"""
        n_classes = len(self.label_cols)
        optimal_thresholds = np.zeros(n_classes)

        for class_idx in range(n_classes):
            precision, recall, thresholds = precision_recall_curve(
                y_true[:, class_idx], y_probs[:, class_idx]
            )

            # Calculate F1 scores
            f1_scores = 2 * (precision * recall) / (precision + recall + 1e-8)

            # Find threshold with maximum F1 score
            if len(thresholds) > 0:
                best_idx = np.nanargmax(f1_scores[:len(thresholds)])
                optimal_thresholds[class_idx] = thresholds[best_idx]

                print(f"{self.label_cols[class_idx]}: Optimal threshold = {thresholds[best_idx]:.3f}, "
                      f"Max F1 = {f1_scores[best_idx]:.3f}")
            else:
                optimal_thresholds[class_idx] = 0.5
                print(f"{self.label_cols[class_idx]}: No thresholds found, using default 0.5")

        return optimal_thresholds

    def evaluate_thresholds(self, y_true, y_probs, thresholds):
        """Evaluate performance with given thresholds"""
        y_pred = (y_probs >= thresholds).astype(int)

        from sklearn.metrics import precision_score, recall_score, f1_score

        metrics = {
            'f1_weighted': f1_score(y_true, y_pred, average='weighted'),
            'f1_macro': f1_score(y_true, y_pred, average='macro'),
            'f1_micro': f1_score(y_true, y_pred, average='micro'),
            'precision_weighted': precision_score(y_true, y_pred, average='weighted'),
            'recall_weighted': recall_score(y_true, y_pred, average='weighted'),
        }

        # Class-wise metrics
        class_metrics = {}
        for i, label in enumerate(self.label_cols):
            class_metrics[f'{label}_f1'] = f1_score(y_true[:, i], y_pred[:, i], zero_division=0)
            class_metrics[f'{label}_precision'] = precision_score(y_true[:, i], y_pred[:, i], zero_division=0)
            class_metrics[f'{label}_recall'] = recall_score(y_true[:, i], y_pred[:, i], zero_division=0)

        metrics.update(class_metrics)
        return metrics, y_pred

    def tune_thresholds_from_df(self, df, text_column='clean_text'):
        """Tune thresholds using DataFrame"""
        print("Extracting texts and true labels...")
        texts = df[text_column].tolist()
        y_true = self.get_true_labels_from_df(df)

        print("Getting prediction probabilities...")
        y_probs = self.get_predictions_proba(texts)

        print(f"True labels shape: {y_true.shape}")
        print(f"Predicted probabilities shape: {y_probs.shape}")

        print("\nOptimizing thresholds...")
        self.best_thresholds = self.optimize_thresholds_per_class(y_true, y_probs)

        # Evaluate with optimized thresholds
        self.best_metrics, y_pred = self.evaluate_thresholds(y_true, y_probs, self.best_thresholds)

        # Compare with default threshold
        default_metrics, _ = self.evaluate_thresholds(y_true, y_probs, 0.5)

        print("\n" + "="*60)
        print("THRESHOLD TUNING RESULTS")
        print("="*60)

        print("\nOptimal thresholds:")
        for label, threshold in zip(self.label_cols, self.best_thresholds):
            print(f"  {label}: {threshold:.3f}")

        print("\nPerformance comparison:")
        print(f"{'Metric':<20} {'Default (0.5)':<12} {'Optimized':<12} {'Improvement':<12}")
        print("-" * 60)
        for metric in ['f1_weighted', 'f1_macro', 'f1_micro']:
            improvement = self.best_metrics[metric] - default_metrics[metric]
            print(f"{metric:<20} {default_metrics[metric]:<12.4f} {self.best_metrics[metric]:<12.4f} {improvement:+.4f}")

        print("\nClass-wise F1 scores:")
        for label in self.label_cols:
            default_f1 = default_metrics[f'{label}_f1']
            optimized_f1 = self.best_metrics[f'{label}_f1']
            improvement = optimized_f1 - default_f1
            print(f"  {label:<15} Default: {default_f1:.3f}, Optimized: {optimized_f1:.3f}, Δ: {improvement:+.3f}")

        return self.best_thresholds, self.best_metrics, y_pred
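
The idea behind per-label F1 maximisation can be shown on a toy example without any model. The sketch below sweeps every observed probability as a candidate threshold for one label. It is a simplified stand-in for the `precision_recall_curve` approach used in the class above, and the labels and probabilities are made up.

```python
def f1_at(y_true, y_prob, threshold):
    """F1 for one label when predicting positive iff prob >= threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_threshold(y_true, y_prob):
    """Try each distinct predicted probability as a threshold and keep the best F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: f1_at(y_true, y_prob, t))


# Toy data for a single label (hypothetical)
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.20, 0.30, 0.40, 0.70, 0.90]

t_best = best_threshold(y_true, y_prob)
print(f"default 0.5 -> F1 = {f1_at(y_true, y_prob, 0.5):.3f}")
print(f"tuned {t_best} -> F1 = {f1_at(y_true, y_prob, t_best):.3f}")
```

Here lowering the threshold recovers a positive scored at 0.30 at the cost of one false positive, which raises F1. Thresholds chosen and scored on the same split will look optimistic, so a held-out set gives a fairer estimate.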

In [100]:
texts_val = val_df['clean_text'].tolist()
y_true_val = val_df[label_cols].values  # shape: (num_samples, num_labels)
y_probs_val = safe_get_predictions_proba(model, tokenizer, texts_val, batch_size=16, device='cuda')
print("Validation predictions shape:", y_probs_val.shape)

Getting predictions: 100%|██████████| 117/117 [00:21<00:00,  5.51it/s]

Validation predictions shape: (1867, 4)





In [102]:
tuner = MultilabelThresholdTuner(model, tokenizer, label_cols, device='cuda')
tuner.best_thresholds = tuner.optimize_thresholds_per_class(y_true_val, y_probs_val)

is_ad: Optimal threshold = 0.627, Max F1 = 0.582
is_relevant: Optimal threshold = 0.001, Max F1 = 0.981
is_rant: Optimal threshold = 0.232, Max F1 = 0.618
is_legit: Optimal threshold = 0.080, Max F1 = 0.940


In [103]:
best_metrics, y_pred_optimized = tuner.evaluate_thresholds(y_true_val, y_probs_val, tuner.best_thresholds)
default_metrics, _ = tuner.evaluate_thresholds(y_true_val, y_probs_val, 0.5)

print("\nF1 Weighted Improvement over default 0.5 threshold:",
      best_metrics['f1_weighted'] - default_metrics['f1_weighted'])


F1 Weighted Improvement over default 0.5 threshold: 0.02246704466831384


In [104]:
for label in label_cols:
    default_f1 = default_metrics[f'{label}_f1']
    optimized_f1 = best_metrics[f'{label}_f1']
    print(f"{label}: Default F1 = {default_f1:.3f}, Optimized F1 = {optimized_f1:.3f}")

is_ad: Default F1 = 0.556, Optimized F1 = 0.582
is_relevant: Default F1 = 0.951, Optimized F1 = 0.981
is_rant: Default F1 = 0.605, Optimized F1 = 0.618
is_legit: Default F1 = 0.926, Optimized F1 = 0.940
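
For the two majority labels it helps to compare these scores with a trivial baseline. If a label is positive in a fraction `p` of rows, predicting positive for every row gives precision `p` and recall 1, so F1 = 2p / (1 + p). The sketch below uses prevalences implied by the training-set `pos_weight` values printed earlier. Treating them as representative of the validation set is an assumption, so the numbers are indicative only.

```python
def all_positive_f1(prevalence: float) -> float:
    """F1 of a classifier that always predicts positive: precision = p, recall = 1."""
    return 2 * prevalence / (1 + prevalence)


# Prevalence implied by the rounded training pos_weight values (neg / pos)
pos_weight = {"is_relevant": 0.0972, "is_legit": 0.2771}
baseline = {label: all_positive_f1(1 / (1 + w)) for label, w in pos_weight.items()}

for label, f1 in baseline.items():
    print(f"{label}: always-positive baseline F1 is about {f1:.3f}")
```

Under that assumption, part of the high F1 on `is_relevant` and `is_legit` reflects class balance rather than model skill. It is also consistent with the very low tuned threshold for `is_relevant`, which pushes nearly every review to positive.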


In [105]:
y_pred_final = (y_probs_val >= tuner.best_thresholds).astype(int)

In [109]:
thresholds_to_save = dict(zip(label_cols, tuner.best_thresholds))
with open("multilabel_thresholds.json", "w") as f:
    json.dump(thresholds_to_save, f)
files.download("multilabel_thresholds.json")

print("Optimized thresholds saved to multilabel_thresholds.json")


Optimized thresholds saved to multilabel_thresholds.json


### 6. Import Model

In [None]:
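# Load model & tokenizer.
# save_pretrained() on a PEFT model writes only the LoRA adapter, so rebuild the base
# classifier first and then attach the adapter. The path below is the directory saved
# in the training section (the original cell pointed to "./qwen_gemma_multilabel_final").
from peft import PeftModel

adapter_dir = "./qwen_gemma_multilabel_final_3"
tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
tokenizer.pad_token = tokenizer.eos_token
base_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen1.5-0.5B", num_labels=4, torch_dtype="auto"
)
base_model.config.pad_token_id = tokenizer.pad_token_id
model = PeftModel.from_pretrained(base_model, adapter_dir)

# Load thresholds
with open("multilabel_thresholds.json", "r") as f:
    thresholds = json.load(f)

# Safe prediction function
def predict_labels(texts, model, tokenizer, thresholds, batch_size=16, device=None):
    # Fall back to CPU when no GPU is available
    device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
    model.eval()
    model.to(device)
    all_probs = []

    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch_texts = texts[i:i+batch_size]
            inputs = tokenizer(batch_texts, truncation=True, padding=True, max_length=512, return_tensors='pt').to(device)
            logits = model(**inputs).logits
            # Cast to float32 first: NumPy cannot represent bfloat16 tensors
            probs = torch.sigmoid(logits).float().cpu().numpy()
            all_probs.extend(probs)

    all_probs = np.array(all_probs)
    # Relies on the JSON key order matching the model's label order
    y_pred = (all_probs >= np.array(list(thresholds.values()))).astype(int)
    return y_pred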
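
To make the final thresholding step concrete without loading a model, the sketch below applies per-label thresholds to a small probability matrix and returns named labels. The thresholds are the rounded values printed in Section 5. The probabilities and the helper `apply_thresholds` are hypothetical.

```python
label_cols = ["is_ad", "is_relevant", "is_rant", "is_legit"]

# Rounded thresholds from the tuning output in Section 5
thresholds = {"is_ad": 0.627, "is_relevant": 0.001, "is_rant": 0.232, "is_legit": 0.080}


def apply_thresholds(prob_rows, thresholds, label_cols):
    """Turn rows of per-label probabilities into {label: 0/1} dicts (hypothetical helper)."""
    return [
        {label: int(p >= thresholds[label]) for label, p in zip(label_cols, row)}
        for row in prob_rows
    ]


# Hypothetical model outputs for two reviews, ordered like label_cols
probs = [
    [0.70, 0.95, 0.10, 0.05],
    [0.10, 0.0005, 0.30, 0.50],
]
for prediction in apply_thresholds(probs, thresholds, label_cols):
    print(prediction)
```

Looking up each threshold by label name, rather than by position, avoids silently misaligned thresholds if the JSON key order ever differs from the model's label order.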