# **Problem Statement 1**  
### **Filtering the Noise: ML for Trustworthy Location Reviews**  
**Team 3Pandas** *(Tran Ha My, Diane Teo Min Xuan, Ng Yuen Ning)*  

---

## **Problem Statement**  
Design and implement an **ML-based system** to evaluate the **quality** and **relevancy** of Google location reviews. The system should:  

- **Gauge review quality:** Detect spam, advertisements, irrelevant content, and rants from users who have likely never visited the location.  
- **Assess relevancy:** Determine whether the content of a review is genuinely related to the location being reviewed.  
- **Enforce policies:** Automatically flag or filter out reviews that violate the following example policies:  
  - No advertisements or promotional content.  
  - No irrelevant content (e.g., reviews about unrelated topics).  
  - No rants or complaints from users who have not visited the place (can be inferred from content, metadata, or other signals).  

---

## **Motivation & Impact**  
- **For Users:** Increases trust in location-based reviews, leading to better decision-making.  
- **For Businesses:** Ensures fair representation and reduces the impact of malicious or irrelevant reviews.  
- **For Platforms:** Automates moderation, reduces manual workload, and enhances platform credibility.  

---

## **Data Sources**  

| **Data Sources**       | **Details** |
|-------------------------|-------------|
| **Public Datasets**    | - **Google Review Data:** Open datasets containing Google location reviews (e.g., [Google Local Reviews on Kaggle](https://www.kaggle.com/datasets/denizbilginn/google-maps-restaurant-reviews))<br>- **Google Local review data:** [UCSD Public Dataset](https://mcauleylab.ucsd.edu/public_datasets/gdrive/googlelocal/)<br>- **Alternative Sources:** Yelp, TripAdvisor, or other open review datasets for supplementary training. |
| **Student-Crawled Data** | - Students are encouraged to crawl additional reviews from Google Maps (in compliance with Google's terms of service).<br>- **Example:** [Scraping Google Reviews (YouTube)](https://www.youtube.com/watch?v=LYMdZ7W9bWQ) |


### Dependencies

In [1]:
!pip install iterative-stratification
! pip install tldextract
!pip install -q transformers accelerate datasets bitsandbytes peft trl torch tensorboard
!pip install -U "huggingface_hub[cli]"
!pip install -q huggingface_hub
!pip install -q einops
!pip install -U bitsandbytes

Collecting iterative-stratification
  Downloading iterative_stratification-0.1.9-py3-none-any.whl.metadata (1.3 kB)
Downloading iterative_stratification-0.1.9-py3-none-any.whl (8.5 kB)
Installing collected packages: iterative-stratification
Successfully installed iterative-stratification-0.1.9
Collecting tldextract
  Downloading tldextract-5.3.0-py3-none-any.whl.metadata (11 kB)
Collecting requests-file>=1.4 (from tldextract)
  Downloading requests_file-2.1.0-py2.py3-none-any.whl.metadata (1.7 kB)
Downloading tldextract-5.3.0-py3-none-any.whl (107 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m107.4/107.4 kB[0m [31m3.6 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading requests_file-2.1.0-py2.py3-none-any.whl (4.2 kB)
Installing collected packages: requests-file, tldextract
Successfully installed requests-file-2.1.0 tldextract-5.3.0
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m61.3/61.3 MB[0m [31m13.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━

In [110]:
import yaml
import os
import json

# ! pip install tldextract
import re
import tldextract

from transformers import pipeline
from tqdm import tqdm

# ! pip install tldextract
from textblob import TextBlob
import pandas as pd

import torch
from torch.cuda import amp
from torch.cuda.amp import autocast, GradScaler
from transformers import pipeline
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification
from transformers import AutoModelForCausalLM, AutoTokenizer

from sklearn.model_selection import train_test_split
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

import torch
from torch.utils.data import IterableDataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
import torch.nn as nn
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification
from torch.optim import AdamW
from sklearn.metrics import precision_recall_fscore_support, average_precision_score

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import gc
import psutil
from huggingface_hub import login

from getpass import getpass
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, PeftModel

from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import precision_score, recall_score, f1_score

import shutil
from google.colab import files

### 1. Load Data

In [3]:
from google.colab import files

uploaded = files.upload()

Saving all_reviews_with_labels_normalised.csv to all_reviews_with_labels_normalised.csv
Saving synthetic_combined.csv to synthetic_combined.csv


In [4]:
all_reviews = list(uploaded.keys())[0]
synthetic_combined = list(uploaded.keys())[1]

In [53]:
# with open("config.yaml", "r") as f:
#     config = yaml.safe_load(f)

# labeled_input_folder = config['labeled_input']
# synthetic_folder = config['synthetic_folder']

# full_df = pd.read_csv(f'{synthetic_folder}/all_reviews_with_labels_normalised.csv')
# full_df.isnull().sum()

full_df = pd.read_csv(all_reviews)
full_df = full_df.dropna(subset=['rating']).reset_index(drop=True)

print(f"Loaded {all_reviews} with {len(full_df)} rows")
print(full_df.isnull().sum())

Loaded all_reviews_with_labels_normalised.csv with 11667 rows
review_text             0
rating                  0
has_photo               0
author_name             0
user_review_count       0
business_name           0
category                0
source                  0
review_id               0
comprehensive_review    0
is_ad                   0
is_relevant             0
is_rant                 0
is_legit                0
dtype: int64


In [52]:
synthetic_df = pd.read_csv(synthetic_combined)

def s(col):
    return synthetic_df[col].fillna("NA").astype(str).str.strip()

has_photo_str = np.where(synthetic_df["has_photo"].fillna(False), "yes", "no")
MAX_REVIEW_CHARS = 2000
review_text_clean = s("review_text").str.replace(r"\s+", " ", regex=True).str[:MAX_REVIEW_CHARS]

synthetic_df["comprehensive_review"] = (
    "[Business] " + s("business_name") +
    " | [Category] " + s("category") +
    " | [Rating] " + s("rating") +
    " | [Author] " + s("author_name") +
    " | [User Review Count] " + s("user_review_count") +
    " | [Has Photo] " + pd.Series(has_photo_str, index=synthetic_df.index) +
    " | [Source] " + s("source") +
    " | [Review] " + review_text_clean
).str.replace(r"\s+\|\s+\[Review\]\s+NA$", "", regex=True)

print(f"Loaded {synthetic_combined} with {len(synthetic_df)} rows")
print(synthetic_df.isnull().sum())

Loaded synthetic_combined.csv with 714 rows
review_text             0
rating                  0
has_photo               0
author_name             0
user_review_count       0
business_name           0
category                0
source                  0
review_id               0
is_ad                   0
is_rant                 0
is_legit                0
is_relevant             0
comprehensive_review    0
dtype: int64


In [51]:
to_clean_df = full_df.dropna(subset=['review_text', 'is_ad', 'is_relevant', 'is_rant', 'is_legit'])

to_clean_df.head()

Unnamed: 0,review_text,rating,has_photo,author_name,user_review_count,business_name,category,source,review_id,comprehensive_review,is_ad,is_relevant,is_rant,is_legit
0,Love the convenience of this neighborhood carw...,4.0,False,Doug Schmidt,1.0,"Auto Spa Speedy Wash - Harvester, MO",['Car wash'],google,1001,"[Business] Auto Spa Speedy Wash - Harvester, M...",False,True,False,True
1,"2 bathrooms (for a large 2 story building), 1 ...",2.0,False,Duf Duftopia,1.0,Kmart,"['Discount store', 'Appliance store', 'Baby st...",google,1002,[Business] Kmart | [Category] ['Discount store...,True,True,True,False
2,My favorite pizza shop hands down!,5.0,False,Andrew Phillips,1.0,Papa’s Pizza,"['Pizza restaurant', 'Chicken wings restaurant...",google,1003,[Business] Papa’s Pizza | [Category] ['Pizza r...,False,True,False,True
3,BOTCHED INSTRUMENT REPAIR IS COSTING US HUNDRE...,1.0,False,Julie Heiland,1.0,The Music Place,['Musical instrument store'],google,1004,[Business] The Music Place | [Category] ['Musi...,False,True,True,False
4,Very unprofessional!!!!!,1.0,False,Alan Khasanov,1.0,Park Motor Cars Inc,['Used car dealer'],google,1005,[Business] Park Motor Cars Inc | [Category] ['...,False,True,True,False


### 2. Pre-Process Datafames

##### 2.1 Cleaning Functions

In [46]:
def normalize_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

def clean_text(text):
    if pd.isna(text):
        return ""
    text = str(text)
    text = normalize_whitespace(text)
    return text

##### 2.2 Compute Basic Signals

In [47]:
def compute_basic_signals(text):
    url_count = len(re.findall(r'https?://\S+', text))
    phone_count = len(re.findall(r'\+?\d[\d\s-]{7,}\d', text))
    caps_ratio = sum(1 for c in text if c.isupper()) / max(len(text), 1)
    return url_count, phone_count, caps_ratio

##### 2.3 Sentiment Analysis

In [48]:
def add_textblob_sentiment(df, text_col="review_text", positive_threshold=0.9, negative_threshold=-0.9):
    def get_sentiment(text):
        if pd.isna(text) or not isinstance(text, str) or text.strip() == "":
            return 0.0, 0.0
        try:
            analysis = TextBlob(text)
            return analysis.sentiment.polarity, analysis.sentiment.subjectivity
        except Exception:
            return 0.0, 0.0

    sentiment_results = df[text_col].apply(get_sentiment)
    df["sentiment_polarity"], df["sentiment_subjectivity"] = zip(*sentiment_results)

    df["is_extreme_sentiment"] = df["sentiment_polarity"].apply(
        lambda x: 1 if x >= positive_threshold or x <= negative_threshold else 0
    )

    return df

##### Apply to Dataframe

In [49]:
def preprocess_reviews(df):
    df["clean_text"] = df["review_text"].apply(clean_text)
    signals = df["clean_text"].apply(compute_basic_signals)
    df["url_count"], df["phone_count"], df["caps_ratio"] = zip(*signals)
    return df

cleaned_df = preprocess_reviews(to_clean_df)
print(cleaned_df.head())

cleaned_synthetic_df = preprocess_reviews(synthetic_df)
print(cleaned_synthetic_df.head())

                                         review_text  rating  has_photo  \
0  Love the convenience of this neighborhood carw...     4.0      False   
1  2 bathrooms (for a large 2 story building), 1 ...     2.0      False   
2                 My favorite pizza shop hands down!     5.0      False   
3  BOTCHED INSTRUMENT REPAIR IS COSTING US HUNDRE...     1.0      False   
4                           Very unprofessional!!!!!     1.0      False   

       author_name  user_review_count                         business_name  \
0     Doug Schmidt                1.0  Auto Spa Speedy Wash - Harvester, MO   
1     Duf Duftopia                1.0                                 Kmart   
2  Andrew Phillips                1.0                          Papa’s Pizza   
3    Julie Heiland                1.0                       The Music Place   
4    Alan Khasanov                1.0                   Park Motor Cars Inc   

                                            category  source  review_id  \

In [50]:
# # Save as JSON
# output_json_path = os.path.join(labeled_input_folder, "cleaned_df.json")
# cleaned_df.to_json(output_json_path, orient="records", lines=True, force_ascii=False)
# print(f"JSON file saved to: {output_json_path}")

# # Save as Parquet
# output_parquet_path = os.path.join(labeled_input_folder, "cleaned_df.parquet")
# cleaned_df.to_parquet(output_parquet_path, index=False)
# print(f"Parquet file saved to: {output_parquet_path}")

### 3. Train-Test Split with Multi-Label Stratification

In [69]:
meta_cols = ["clean_text", "url_count","phone_count","caps_ratio","rating","has_photo","user_review_count"]
label_cols = ["is_ad","is_relevant","is_rant"]

X = cleaned_df.drop(columns=label_cols)
y = cleaned_df[label_cols].values

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_val_idx, test_idx = next(mskf.split(X, y))

train_val_df = cleaned_df.iloc[train_val_idx].reset_index(drop=True)
test_df = cleaned_df.iloc[test_idx].reset_index(drop=True)

y_train_val = y[train_val_idx]
train_idx, val_idx = next(mskf.split(train_val_df.drop(columns=label_cols), y_train_val))

train_df_original = train_val_df.iloc[train_idx].reset_index(drop=True)
val_df = train_val_df.iloc[val_idx].reset_index(drop=True)

train_df = pd.concat([train_df_original, cleaned_synthetic_df], ignore_index=True).reset_index(drop=True)

print(f"Training set size: {train_df.shape} (includes synthetic data)")
print(f"Validation set size: {val_df.shape}")
print(f"Test set size: {test_df.shape}")

print(f"Synthetic data in validation set: {val_df['review_id'].isin(cleaned_synthetic_df['review_id']).any()}")
print(f"Synthetic data in test set: {test_df['review_id'].isin(cleaned_synthetic_df['review_id']).any()}")

Training set size: (8181, 18) (includes synthetic data)
Validation set size: (1866, 18)
Test set size: (2334, 18)
Synthetic data in validation set: False
Synthetic data in test set: False


### 4. Tokenisation

In [67]:
def simple_tokenize(text):
    text = str(text).lower()
    tokens = re.findall(r'\b[a-z]+\b', text)
    return tokens

train_df['tokens'] = train_df['clean_text'].apply(simple_tokenize)
test_df['tokens'] = test_df['clean_text'].apply(simple_tokenize)

In [68]:
print(train_df.shape)
print(val_df.shape)
print(test_df.shape)

(8343, 19)
(1907, 18)
(2384, 19)


### Yuen Ning's model

In [86]:
model_name = "Qwen/Qwen1.5-0.5B"  # or "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [93]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3, torch_dtype="auto")
config = LoraConfig(r=8, lora_alpha=32, lora_dropout=0.1)
model = get_peft_model(model, config)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen1.5-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [94]:
training_args = TrainingArguments(
    output_dir="./qwen_gemma_multilabel",
    # evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    report_to="none",
    remove_unused_columns=False # Add this line to ignore unused columns
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = torch.sigmoid(torch.tensor(logits)).numpy()
    preds = (probs > 0.5).astype(int)

    precision = precision_score(labels, preds, average='macro', zero_division=0)
    recall = recall_score(labels, preds, average='macro', zero_division=0)
    f1 = f1_score(labels, preds, average='macro', zero_division=0)

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

In [95]:
def preprocess(batch):
    return tokenizer(batch["clean_text"], truncation=True, padding="max_length", max_length=512)

def prepare_dataset(df):
    # Apply filtering to the DataFrame directly
    df = df.reset_index(drop=True)
    return Dataset.from_pandas(df[["clean_text"] + label_cols])

# Filter train_df before creating the dataset
train_df_filtered = train_df.drop(index=7026).reset_index(drop=True)

train_dataset = prepare_dataset(train_df_filtered)
val_dataset = prepare_dataset(val_df)

train_dataset = train_dataset.map(preprocess, batched=True)
val_dataset = val_dataset.map(preprocess, batched=True)
train_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "is_ad", "is_relevant", "is_rant"])
val_dataset.set_format(type="torch", columns=["input_ids", "attention_mask", "is_ad", "is_relevant", "is_rant"])


Map:   0%|          | 0/8180 [00:00<?, ? examples/s]

Map:   0%|          | 0/1866 [00:00<?, ? examples/s]

In [98]:
def data_collator(batch):
    input_ids = torch.stack([item["input_ids"] for item in batch])
    attention_mask = torch.stack([item["attention_mask"] for item in batch])
    labels = torch.stack([
        torch.tensor([item["is_ad"], item["is_relevant"], item["is_rant"]], dtype=torch.float)
        for item in batch
    ])
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset, # Use the Hugging Face Dataset
    eval_dataset=val_dataset,   # Use the Hugging Face Dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

  trainer = Trainer(


In [99]:
trainer.train()

Step,Training Loss
500,0.4471
1000,0.187
1500,0.1104


TrainOutput(global_step=1536, training_loss=0.24450827219213048, metrics={'train_runtime': 2028.9279, 'train_samples_per_second': 12.095, 'train_steps_per_second': 0.757, 'total_flos': 2.330922766565376e+16, 'train_loss': 0.24450827219213048, 'epoch': 3.0})

In [100]:
model.save_pretrained("./qwen_gemma_multilabel_final")
tokenizer.save_pretrained("./qwen_gemma_multilabel_final")

('./qwen_gemma_multilabel_final/tokenizer_config.json',
 './qwen_gemma_multilabel_final/special_tokens_map.json',
 './qwen_gemma_multilabel_final/chat_template.jinja',
 './qwen_gemma_multilabel_final/vocab.json',
 './qwen_gemma_multilabel_final/merges.txt',
 './qwen_gemma_multilabel_final/added_tokens.json',
 './qwen_gemma_multilabel_final/tokenizer.json')

In [103]:
eval_results = trainer.evaluate()
print(eval_results)

{'eval_loss': 0.13003966212272644, 'eval_precision': 0.7716463472962802, 'eval_recall': 0.6625803733996433, 'eval_f1': 0.7074768587565138, 'eval_runtime': 66.6749, 'eval_samples_per_second': 27.987, 'eval_steps_per_second': 3.51, 'epoch': 3.0}


In [111]:
shutil.make_archive("qwen_gemma_multilabel_final", 'zip', "./qwen_gemma_multilabel_final")
files.download("qwen_gemma_multilabel_final.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### more graveyard

In [None]:
class ReviewDataset(Dataset):
    def __init__(self, df, tokenizer, max_len=128):
        self.texts = df["clean_text"].tolist()
        self.meta = df[meta_cols].fillna(0).to_numpy(dtype=np.float32)
        self.labels = df[label_cols].to_numpy(dtype=np.float32)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Ensure idx is a single int
        if isinstance(idx, list):
            idx = idx[0]

        text = str(self.texts[idx])
        meta = self.meta[idx]
        labels = self.labels[idx]

        encoding = self.tokenizer(
            text,
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )

        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "meta": torch.tensor(meta, dtype=torch.float),
            "labels": torch.tensor(labels, dtype=torch.float)
        }

def collate_fn(batch):
    return {
        "input_ids": torch.stack([b["input_ids"] for b in batch]),
        "attention_mask": torch.stack([b["attention_mask"] for b in batch]),
        "meta": torch.stack([b["meta"] for b in batch]),
        "labels": torch.stack([b["labels"] for b in batch])
    }


In [None]:
model_name = "Qwen/Qwen2-0.5B"  # lightweight Qwen
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B", use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_cols))
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of Qwen2ForSequenceClassification were not initialized from the model checkpoint at Qwen/Qwen2-0.5B and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Qwen2ForSequenceClassification(
  (model): Qwen2Model(
    (embed_tokens): Embedding(151936, 896)
    (layers): ModuleList(
      (0-23): 24 x Qwen2DecoderLayer(
        (self_attn): Qwen2Attention(
          (q_proj): Linear(in_features=896, out_features=896, bias=True)
          (k_proj): Linear(in_features=896, out_features=128, bias=True)
          (v_proj): Linear(in_features=896, out_features=128, bias=True)
          (o_proj): Linear(in_features=896, out_features=896, bias=False)
        )
        (mlp): Qwen2MLP(
          (gate_proj): Linear(in_features=896, out_features=4864, bias=False)
          (up_proj): Linear(in_features=896, out_features=4864, bias=False)
          (down_proj): Linear(in_features=4864, out_features=896, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
        (post_attention_layernorm): Qwen2RMSNorm((896,), eps=1e-06)
      )
    )
    (norm): Qwen2RMSNorm((896,), eps=1e-06)
    (rotary_emb): Q

In [None]:
batch_size = 8
train_dataset = ReviewDataset(train_df, tokenizer)
test_dataset = ReviewDataset(test_df, tokenizer)

train_dataset = ReviewDataset(train_df, tokenizer, max_len=128)
val_dataset = ReviewDataset(val_df, tokenizer, max_len=128)

train_loader = DataLoader(
    train_dataset,
    batch_size=16,
    shuffle=True,
    pin_memory=True,
    num_workers=4
)
val_loader   = torch.utils.data.DataLoader(val_dataset, batch_size=32)
test_loader  = torch.utils.data.DataLoader(test_dataset, batch_size=32)

num_workers = 0



In [None]:
optimizer = AdamW(model.parameters(), lr=5e-5)
criterion = nn.BCEWithLogitsLoss()

In [None]:
if len(train_dataset) < batch_size:
    batch_size = len(train_dataset)

epochs = 3
scaler = GradScaler()  # For mixed precision

for epoch in range(epochs):
    model.train()
    total_loss = 0

    for batch in train_loader:
        input_ids = batch["input_ids"].to(device, non_blocking=True)
        attention_mask = batch["attention_mask"].to(device, non_blocking=True)
        labels = batch["labels"].to(device, non_blocking=True)

        optimizer.zero_grad()

        with autocast():  # Mixed precision forward + loss
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            loss = criterion(logits, labels)

        scaler.scale(loss).backward()       # Scaled backward
        scaler.step(optimizer)              # Step optimizer
        scaler.update()                     # Update scale

        total_loss += loss.item()

    print(f"Epoch {epoch+1} Train Loss: {total_loss/len(train_loader):.4f}")


  scaler = GradScaler()  # For mixed precision


IndexError: Caught IndexError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/fetch.py", line 50, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/datasets/arrow_dataset.py", line 2865, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
                  ~~~~~^^^
IndexError: index 3 is out of bounds for dimension 0 with size 3


code graveyard

In [None]:
# =======================
# CONFIG
# =======================
label_cols = ["is_ad", "is_relevant", "is_rant", "is_legit"]
meta_cols = ["url_count", "phone_count", "caps_ratio", "rating", "user_review_count"]  # adjust based on available
max_len = 128
batch_size = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
df = cleaned_df
df[label_cols] = df[label_cols].astype(float)

X = df.drop(columns=label_cols)
y = df[label_cols].values

mskf = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(mskf.split(X, y))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Then split train into train/val
train_idx2, val_idx = next(MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
                           .split(train_df.drop(columns=label_cols), train_df[label_cols].values))
val_df = train_df.iloc[val_idx]
train_df = train_df.iloc[train_idx2]


In [None]:
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_batch(df):
    encodings = tokenizer(
        df["clean_text"].tolist(),
        truncation=True,
        padding="max_length",
        max_length=max_len,
        return_tensors="pt"
    )
    meta = torch.tensor(df[meta_cols].fillna(0).values, dtype=torch.float32)
    labels = torch.tensor(df[label_cols].values, dtype=torch.float32)
    return encodings, meta, labels

In [None]:
class ReviewDataset(Dataset):
    def __init__(self, encodings, meta, labels):
        self.encodings = encodings
        self.meta = meta
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        return {
            "input_ids": self.encodings["input_ids"][idx],
            "attention_mask": self.encodings["attention_mask"][idx],
            "meta": self.meta[idx],
            "labels": self.labels[idx]
        }

train_enc, train_meta, train_labels = tokenize_batch(train_df)
val_enc, val_meta, val_labels = tokenize_batch(val_df)
test_enc, test_meta, test_labels = tokenize_batch(test_df)

train_dataset = ReviewDataset(train_enc, train_meta, train_labels)
val_dataset = ReviewDataset(val_enc, val_meta, val_labels)
test_dataset = ReviewDataset(test_enc, test_meta, test_labels)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)


In [None]:
class ReviewGuardModel(nn.Module):
    def __init__(self, backbone="prajjwal1/bert-tiny", meta_dim=len(meta_cols)):
        super().__init__()
        self.enc = AutoModel.from_pretrained(backbone)
        hid = self.enc.config.hidden_size
        self.meta_net = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU(), nn.Dropout(0.1), nn.Linear(64, 64), nn.ReLU())
        self.fuse = nn.Linear(hid + 64, hid)
        self.cls = nn.Linear(hid, len(label_cols))
    def forward(self, input_ids, attention_mask, meta, labels=None):
        x = self.enc(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:,0]
        m = self.meta_net(meta)
        z = torch.relu(self.fuse(torch.cat([x, m], dim=1)))
        logits = self.cls(z)
        loss = None
        if labels is not None:
            loss_f = nn.BCEWithLogitsLoss()
            loss = loss_f(logits, labels)
        return {"loss": loss, "logits": logits}

model = ReviewGuardModel().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)


In [None]:
epochs = 3
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta, labels=labels)
        loss = outputs["loss"]
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} Train Loss: {total_loss/len(train_loader):.4f}")


In [None]:
pos_weights = torch.tensor([29.5641, 0.0374, 12.0096, 0.1678], dtype=torch.float32).to(device)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weights)

In [None]:
model_name = "prajjwal1/bert-tiny"  # tiny BERT for CPU
max_len = 64
batch_size = 2
accum_steps = 4  # gradient accumulation
epochs = 3
meta_cols = ["url_count","phone_count","caps_ratio","user_review_count"]
label_cols = ["is_ad","is_relevant","is_rant","is_legit"]

device = torch.device("cpu")

tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, df, tokenizer, max_len=128):
        self.texts = df["clean_text"].tolist()
        self.meta = df[meta_cols].fillna(0).to_numpy(dtype=np.float32)
        self.labels = df[label_cols].to_numpy(dtype=np.float32)
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt"
        )
        return {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "meta": torch.tensor(self.meta[idx], dtype=torch.float),
            "labels": torch.tensor(self.labels[idx], dtype=torch.float)
        }

class ReviewGuardModel(nn.Module):
    def __init__(self, backbone=model_name, meta_dim=len(meta_cols)):
        super().__init__()
        self.enc = AutoModel.from_pretrained(backbone)
        hid = self.enc.config.hidden_size
        self.meta_net = nn.Sequential(
            nn.Linear(meta_dim, 64),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(64, 64),
            nn.ReLU()
        )
        self.fuse = nn.Linear(hid + 64, hid)
        self.cls = nn.Linear(hid, len(label_cols))

    def forward(self, input_ids, attention_mask, meta, labels=None):
        x = self.enc(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        m = self.meta_net(meta)
        z = torch.relu(self.fuse(torch.cat([x, m], dim=1)))
        logits = self.cls(z)
        loss = None
        if labels is not None:
            loss_f = nn.BCEWithLogitsLoss()
            loss = loss_f(logits, labels)
        return {"loss": loss, "logits": logits}


In [None]:
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny", use_fast=True)
train_dataset = ReviewDataset(train_df, tokenizer, max_len=128)
test_dataset = ReviewDataset(test_df, tokenizer, max_len=128)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16)


In [None]:
model = ReviewGuardModel().to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta)
        logits = outputs["logits"]
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    print(f"Epoch {epoch+1} Train Loss: {total_loss/len(train_loader):.4f}")


In [None]:
def tune_thresholds(model, val_loader):
    model.eval()
    all_logits, all_labels = [], []
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            meta = batch["meta"].to(device)
            labels = batch["labels"].cpu().numpy()

            logits = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta)["logits"].cpu().numpy()
            all_logits.append(logits)
            all_labels.append(labels)

    all_logits = np.vstack(all_logits)
    all_labels = np.vstack(all_labels)

    thresholds = []
    for i in range(all_labels.shape[1]):
        best_thresh = 0.5
        best_f1 = 0.0
        for t in np.arange(0.1, 0.9, 0.05):
            preds = (1 / (1 + np.exp(-all_logits[:, i])) >= t).astype(int)
            f1 = f1_score(all_labels[:, i], preds)
            if f1 > best_f1:
                best_f1 = f1
                best_thresh = t
        thresholds.append(best_thresh)
    return thresholds


In [None]:
val_dataset = ReviewDataset(val_df)  # val_df is your validation DataFrame
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

thresholds = tune_thresholds(model, val_loader)

model.eval()
all_preds, all_labels = [], []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].cpu().numpy()

        logits = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta)["logits"].cpu().numpy()
        preds = np.array([(1 / (1 + np.exp(-logits[:, i])) >= thresholds[i]).astype(int)
                          for i in range(logits.shape[1])]).T
        all_preds.append(preds)
        all_labels.append(labels)

all_preds = np.vstack(all_preds)
all_labels = np.vstack(all_labels)


In [None]:
for epoch in range(epochs):
    model.train()
    total_loss = 0
    optimizer.zero_grad()

    for i, batch in enumerate(train_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta, labels=labels)
        loss = outputs["loss"] / accum_steps
        loss.backward()

        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

        total_loss += loss.item() * accum_steps

    print(f"Epoch {epoch+1} Train Loss: {total_loss/len(train_loader):.4f}")

In [None]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        logits = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta)["logits"]
        probs = torch.sigmoid(logits).cpu().numpy()
        all_preds.append(probs)
        all_labels.append(labels.cpu().numpy())

all_preds = np.vstack(all_preds)
all_labels = np.vstack(all_labels)

# Metrics per label
metrics = {}
for i, name in enumerate(label_cols):
    ap = average_precision_score(all_labels[:, i], all_preds[:, i])
    prec, rec, f1, _ = precision_recall_fscore_support(all_labels[:, i], (all_preds[:, i]>=0.5).astype(int), zero_division=0)
    metrics[f"{name}_ap"] = ap
    metrics[f"{name}_prec"] = prec[0]
    metrics[f"{name}_rec"] = rec[0]
    metrics[f"{name}_f1"] = f1[0]

# Micro and macro F1
metrics["micro_f1"] = precision_recall_fscore_support(all_labels, (all_preds>=0.5).astype(int), average="micro", zero_division=0)[2]
metrics["macro_f1"] = precision_recall_fscore_support(all_labels, (all_preds>=0.5).astype(int), average="macro", zero_division=0)[2]

print("Evaluation metrics:", metrics)

In [None]:
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
max_len = 128

class ReviewDataset(IterableDataset):
    def __init__(self, df, tokenizer, max_len, meta_cols, label_cols):
        self.df = df
        self.tokenizer = tokenizer
        self.max_len = max_len
        self.meta_cols = meta_cols
        self.label_cols = label_cols

    def __iter__(self):
        for _, row in self.df.iterrows():
            enc = self.tokenizer(
                row["clean_text"],
                truncation=True,
                padding="max_length",
                max_length=self.max_len,
                return_tensors="pt"
            )
            meta = torch.tensor([row[c] for c in self.meta_cols], dtype=torch.float32)
            labels = torch.tensor([row[c] for c in self.label_cols], dtype=torch.float32)

            yield {
                "input_ids": enc["input_ids"].squeeze(0),
                "attention_mask": enc["attention_mask"].squeeze(0),
                "meta": meta,
                "labels": labels
            }

train_dataset = ReviewDataset(train_df, tokenizer, max_len, meta_cols, label_cols)
test_dataset = ReviewDataset(test_df, tokenizer, max_len, meta_cols, label_cols)

train_loader = DataLoader(train_dataset, batch_size=8)
test_loader = DataLoader(test_dataset, batch_size=8)


In [None]:
class ReviewGuardModel(nn.Module):
    def __init__(self, backbone="prajjwal1/bert-tiny", meta_dim=6):
        super().__init__()
        self.enc = AutoModel.from_pretrained(backbone)
        hid = self.enc.config.hidden_size
        self.meta_net = nn.Sequential(nn.Linear(meta_dim, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 32), nn.ReLU())
        self.fuse = nn.Linear(hid + 32, hid)
        self.cls = nn.Linear(hid, len(label_cols))

    def forward(self, input_ids, attention_mask, meta, labels=None):
        x = self.enc(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:,0]
        m = self.meta_net(meta)
        z = torch.relu(self.fuse(torch.cat([x, m], dim=1)))
        logits = self.cls(z)
        loss = None
        if labels is not None:
            loss = nn.BCEWithLogitsLoss()(logits, labels)
        return {"loss": loss, "logits": logits}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ReviewGuardModel(meta_dim=len(meta_cols)).to(device)
optimizer = AdamW(model.parameters(), lr=2e-5)


In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optional: freeze transformer backbone for first few epochs
freeze_backbone = True
if freeze_backbone:
    for param in model.enc.parameters():
        param.requires_grad = False

optimizer = AdamW(filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5)

epochs = 3
accum_steps = 4  # simulate larger batch by accumulating gradients
scaler = GradScaler()

for epoch in range(epochs):
    model.train()
    total_loss = 0
    batch_count = 0

    for i, batch in enumerate(train_loader):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        with autocast():  # mixed precision
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta, labels=labels)
            loss = outputs["loss"] / accum_steps

        scaler.scale(loss).backward()

        if (i + 1) % accum_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        total_loss += loss.item() * accum_steps
        batch_count += 1

    print(f"Epoch {epoch+1} Train Loss: {total_loss / batch_count:.4f}")

# Optional: unfreeze backbone for fine-tuning
if freeze_backbone:
    for param in model.enc.parameters():
        param.requires_grad = True


In [None]:
model.eval()
all_preds = []
all_labels = []

with torch.no_grad():
    for batch in test_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        meta = batch["meta"].to(device)
        labels = batch["labels"].to(device)

        logits = model(input_ids=input_ids, attention_mask=attention_mask, meta=meta)["logits"]
        probs = torch.sigmoid(logits).cpu().numpy()
        all_preds.append(probs)
        all_labels.append(labels.cpu().numpy())

import numpy as np
all_preds = np.vstack(all_preds)
all_labels = np.vstack(all_labels)
preds_bin = (all_preds >= 0.5).astype(int)

for i, label in enumerate(label_cols):
    f1 = f1_score(all_labels[:,i], preds_bin[:,i])
    prec = precision_score(all_labels[:,i], preds_bin[:,i])
    rec = recall_score(all_labels[:,i], preds_bin[:,i])
    print(f"{label}: F1={f1:.3f}, Precision={prec:.3f}, Recall={rec:.3f}")


In [None]:
torch.cuda.empty_cache()
gc.collect()

print(f"GPU available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Total memory: {torch.cuda.get_device_properties(0).total_memory/1024**3:.2f} GB")
    print(f"Memory allocated: {torch.cuda.memory_allocated()/1024**3:.2f} GB")
    print(f"Memory reserved: {torch.cuda.memory_reserved()/1024**3:.2f} GB")

print(f"RAM available: {psutil.virtual_memory().total/1024**3:.2f} GB")
print(f"RAM used: {psutil.virtual_memory().used/1024**3:.2f} GB")

GPU available: True
GPU: Tesla T4
Total memory: 14.74 GB
Memory allocated: 12.08 GB
Memory reserved: 12.10 GB
RAM available: 12.67 GB
RAM used: 9.86 GB


In [None]:
# Verify your data
print("Training data info:")
print(f"Shape: {train_df.shape}")
print("Columns:", train_df.columns.tolist())
print("\nLabel distribution:")
print(train_df['is_relevant'].value_counts())

# Check for missing values
print("\nMissing values in training data:")
print(train_df.isnull().sum())

# Check text length distribution
train_df['text_length'] = train_df['review_text'].apply(len)
print(f"\nText length stats:\n{train_df['text_length'].describe()}")

# Sample a few examples
print("\nSample reviews:")
for i, row in train_df.head(3).iterrows():
    print(f"Review {i+1}: {row['review_text'][:100]}...")
    print(f"Relevant: {row['is_relevant']}")
    print("-" * 50)

Training data info:
Shape: (8343, 19)
Columns: ['review_text', 'rating', 'has_photo', 'author_name', 'user_review_count', 'business_name', 'category', 'source', 'review_id', 'comprehensive_review', 'is_ad', 'is_relevant', 'is_rant', 'is_legit', 'clean_text', 'url_count', 'phone_count', 'caps_ratio', 'text_length']

Label distribution:
is_relevant
True     7615
False     728
Name: count, dtype: int64

Missing values in training data:
review_text               0
rating                  162
has_photo                 0
author_name               0
user_review_count       162
business_name             0
category                  0
source                    0
review_id                 0
comprehensive_review      0
is_ad                     0
is_relevant               0
is_rant                   0
is_legit                  0
clean_text                0
url_count                 0
phone_count               0
caps_ratio                0
text_length               0
dtype: int64

Text length stats

In [None]:
try:
    from google.colab import userdata
    token = userdata.get('HF_TOKEN')
    print("Token loaded from Colab secrets")
except:
    token = getpass('Enter your Hugging Face token: ')

# Login to Hugging Face
try:
    login(token=token)
    print("✓ Successfully logged in to Hugging Face Hub")
except Exception as e:
    print(f"✗ Login failed: {e}")

# Set environment variable
os.environ['HF_TOKEN'] = token

Enter your Hugging Face token: ··········


Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


✓ Successfully logged in to Hugging Face Hub


In [None]:
# Cell 2: Verify Model Loading
MODEL_NAME = "Qwen/Qwen3-8B"

if 'model' not in locals():
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME,
        device_map="auto",
        torch_dtype=torch.float16,
        trust_remote_code=True
    )

if 'tokenizer' not in locals():
    tokenizer = AutoTokenizer.from_pretrained(
        MODEL_NAME,
        trust_remote_code=True
    )
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

print("✓ Model and tokenizer ready!")
print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")

✓ Model and tokenizer ready!
Model device: cuda:0
Model dtype: torch.float16


In [None]:
MAX_LENGTH = 512  # Reduced sequence length
BATCH_SIZE = 1    # Small batch size for 8B model
GRADIENT_ACCUMULATION_STEPS = 8  # Increase to get effective batch size
LEARNING_RATE = 2e-5
NUM_EPOCHS = 2
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

print("Memory-optimized configuration:")
print(f"Batch size: {BATCH_SIZE}")
print(f"Gradient accumulation: {GRADIENT_ACCUMULATION_STEPS}")
print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")

Memory-optimized configuration:
Batch size: 1
Gradient accumulation: 8
Effective batch size: 8


In [None]:
def prepare_lm_dataset(df, tokenizer, max_length=MAX_LENGTH):
    """Prepare dataset for language model training"""
    texts = []

    for _, row in df.iterrows():
        text = f"### Review Analysis Task:\n"
        text += f"Review Text: {row['review_text']}\n"

        if 'rating' in row:
            text += f"Rating: {row['rating']}/5\n"

        meta_cols = ["url_count", "phone_count", "caps_ratio", "has_photo", "user_review_count"]
        for col in meta_cols:
            if col in row and pd.notna(row[col]):
                text += f"{col.replace('_', ' ').title()}: {row[col]}\n"

        text += "### End of Review ###\n"
        texts.append(text)

    tokenized = tokenizer(
        texts,
        truncation=True,
        padding=False,
        max_length=max_length,
        return_offsets_mapping=False
    )

    tokenized["labels"] = tokenized["input_ids"].copy()

    return tokenized

print("Preparing training data...")
train_tokenized = prepare_lm_dataset(train_df, tokenizer)
val_tokenized = prepare_lm_dataset(val_df, tokenizer)

train_hf_dataset = Dataset.from_dict(train_tokenized)
val_hf_dataset = Dataset.from_dict(val_tokenized)

print(f"Training samples: {len(train_hf_dataset)}")
print(f"Validation samples: {len(val_hf_dataset)}")

Preparing training data...
Training samples: 8343
Validation samples: 1907


In [None]:
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
if isinstance(model, PeftModel) or hasattr(model, "peft_config"):
    raise RuntimeError("Model already has PEFT adapters attached in this session. Restart runtime before proceeding.")
model = get_peft_model(model, lora_config)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable_params:,}")
print(f"Total parameters: {total_params:,}")
print(f"Percentage trainable: {100 * trainable_params / total_params:.2f}%")

RuntimeError: Model already has PEFT adapters attached in this session. Restart runtime before proceeding.

In [None]:
# Data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False  # Causal language modeling
)

# Training arguments optimized for T4
training_args = TrainingArguments(
    output_dir="./qwen3-finetuned",
    overwrite_output_dir=True,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=100,
    logging_steps=10,
    num_train_epochs=NUM_EPOCHS,
    learning_rate=LEARNING_RATE,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=True,  # Mixed precision training
    report_to="none",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    gradient_checkpointing=True,  # Memory optimization
    optim="adamw_torch",
    max_grad_norm=1.0,
)
training_args.parallelism_config = None

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_hf_dataset,
    eval_dataset=val_hf_dataset,
    data_collator=data_collator,
)

print("Trainer configured successfully!")

NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

In [None]:
MAX_LENGTH = 256

def prepare_lm_dataset(df, tokenizer, max_length=MAX_LENGTH):
    texts = []
    for _, row in df.iterrows():
        text = f"### Review Analysis Task:\nReview Text: {row['review_text']}\n"
        if 'rating' in row: text += f"Rating: {row['rating']}/5\n"
        if 'is_relevant' in row:
            relevance = "Relevant" if row['is_relevant'] else "Irrelevant"
            text += f"Relevance: {relevance}\n"
        text += "### End of Review ###\n"
        texts.append(text)

    tokenized = tokenizer(
        texts,
        truncation=True,
        padding=False,
        max_length=max_length
    )
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

train_tokenized = prepare_lm_dataset(train_df, tokenizer)
val_tokenized = prepare_lm_dataset(val_df, tokenizer)

train_hf_dataset = Dataset.from_dict(train_tokenized)
val_hf_dataset = Dataset.from_dict(val_tokenized)

print(f"Training samples: {len(train_hf_dataset)}, Validation samples: {len(val_hf_dataset)}")

In [None]:
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=200)))
])

pipeline.fit(X_train, y_train)
print("Validation score:", pipeline.score(X_val, y_val))

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Validation score: 0.8384897745149449


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
model_name = "Qwen/Qwen2-0.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def get_embeddings(texts, tokenizer, model, max_length=128, batch_size=16):
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch_texts = texts[i:i+batch_size]
        inputs = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True, max_length=max_length)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = model(**inputs)
            batch_emb = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
        all_embeddings.append(batch_emb)
    return np.vstack(all_embeddings)

# Get Qwen embeddings for clean_text
train_text_emb = get_embeddings(train_df["clean_text"].tolist(), tokenizer, model)
val_text_emb = get_embeddings(val_df["clean_text"].tolist(), tokenizer, model)
test_text_emb = get_embeddings(test_df["clean_text"].tolist(), tokenizer, model)

# Meta features
meta_cols = ["url_count","phone_count","caps_ratio","rating","has_photo","user_review_count"]
train_meta = train_df[meta_cols].values
val_meta = val_df[meta_cols].values
test_meta = test_df[meta_cols].values

# Combine embeddings + meta
X_train = np.hstack([train_text_emb, train_meta])
X_val = np.hstack([val_text_emb, val_meta])
X_test = np.hstack([test_text_emb, test_meta])

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

# Multi-label classifier
y_train = train_df[["is_ad","is_relevant","is_rant"]].values
y_val = val_df[["is_ad","is_relevant","is_rant"]].values

clf = MultiOutputClassifier(LogisticRegression(max_iter=200))
clf.fit(X_train, y_train)

print("Validation score:", clf.score(X_val, y_val))

ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

In [None]:
for obj in ["model", "tokenizer"]:
    if obj in globals():
        del globals()[obj]
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print(f"CUDA available: {torch.cuda.is_available()}")

CUDA available: True


In [None]:
MODEL_NAME = "Qwen/Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    trust_remote_code=True
).to("cuda")

print("Base model loaded:", model.device, model.dtype)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

In [None]:
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

if isinstance(model, PeftModel) or hasattr(model, "peft_config"):
    raise RuntimeError("Model already has PEFT adapters. Restart runtime if needed.")

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)

trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"Trainable params: {trainable_params:,} / {total_params:,} ({100 * trainable_params / total_params:.2f}%)")