In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Q1: Given this balanced dataset of product reviews, how would you build a sentiment analysis model to predict star ratings? What potential biases should you consider given that the dataset was artificially balanced?



### Step 1: Load and Inspect the Data
#### **Explanation:**
> In this step, we'll load the dataset and examine a sample to understand its shape and fields. Since the dataset is split into `train.csv` and `validation.csv`, we'll load both.  
>  
> **Goal:** Check for missing values and inspect sample records across different star ratings.


In [2]:
import pandas as pd

# Load data
DATA_DIR = "/content/drive/MyDrive/amazon review data/"
TRAIN_CSV = DATA_DIR + "train.csv"
VAL_CSV   = DATA_DIR + "validation.csv"
train = pd.read_csv(TRAIN_CSV)
valid = pd.read_csv(VAL_CSV)

# Quick look at sample records
display(train.head())
display(train['stars'].value_counts())

# Check for nulls
print(train.isnull().sum())

Unnamed: 0.1,Unnamed: 0,review_id,product_id,reviewer_id,stars,review_body,review_title,language,product_category
0,0,de_0203609,product_de_0865382,reviewer_de_0267719,1,Armband ist leider nach 1 Jahr kaputt gegangen,Leider nach 1 Jahr kaputt,de,sports
1,1,de_0559494,product_de_0678997,reviewer_de_0783625,1,In der Lieferung war nur Ein Akku!,EINS statt ZWEI Akkus!!!,de,home_improvement
2,2,de_0238777,product_de_0372235,reviewer_de_0911426,1,"Ein Stern, weil gar keine geht nicht. Es hande...",Achtung Abzocke,de,drugstore
3,3,de_0477884,product_de_0719501,reviewer_de_0836478,1,"Dachte, das wären einfach etwas festere Binden...",Zu viel des Guten,de,drugstore
4,4,de_0270868,product_de_0022613,reviewer_de_0736276,1,Meine Kinder haben kaum damit gespielt und nac...,Qualität sehr schlecht,de,toy


stars,count
1,240000
2,240000
3,240000
4,240000
5,240000


Unnamed: 0           0
review_id            0
product_id           0
reviewer_id          0
stars                0
review_body          0
review_title        43
language             0
product_category     0
dtype: int64


### Step 2: Data Preprocessing & Text Cleaning

#### **Explanation:**
> For sentiment analysis, preprocessing is crucial. We'll focus on:
> - Lowercasing the texts
> - Removing punctuation/special characters (if relevant for the language)
> - Potentially stemming/lemmatizing (language aware!)
>
> Since `review_body` is our main text field, we use it as input and prepend `review_title` for extra signal.

In [3]:
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    return text

train['text'] = train['review_title'].fillna('') + ' ' + train['review_body'].fillna('')
valid['text'] = valid['review_title'].fillna('') + ' ' + valid['review_body'].fillna('')

train['text'] = train['text'].apply(clean_text)
valid['text'] = valid['text'].apply(clean_text)
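
To see what this cleaning does to German text, here is a small self-contained check (the function is repeated so the snippet runs on its own; the sample string is stitched together from review titles shown earlier). Python's `\w` is Unicode-aware for `str` patterns, so umlauts survive, while punctuation such as `!` is removed, which also drops a possible sentiment cue.

```python
import re

def clean_text(text):
    # Same logic as above: lowercase, then drop everything
    # that is neither a word character nor whitespace
    text = str(text).lower()
    return re.sub(r'[^\w\s]', '', text)

sample = "EINS statt ZWEI Akkus!!! Qualität sehr schlecht."
cleaned = clean_text(sample)
print(cleaned)
```

If exclamation marks or emoticons turn out to matter for the rating, one option is to keep them as separate tokens instead of deleting them.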

### Step 3: Target Preprocessing

#### **Explanation:**
> We'll treat the prediction as a multiclass problem—predicting star ratings (1–5). This is a classic ordinal classification, but for simplicity, we'll use categorical classification as a baseline.


In [4]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
train['star_label'] = le.fit_transform(train['stars'])
valid['star_label'] = le.transform(valid['stars'])

### Step 4: Feature Engineering (Text Vectorization)

#### **Explanation:**
> To convert text to features, we’ll use TF-IDF, which is effective for bag-of-words (BoW) text classification tasks. For deep learning, consider tokenization and embeddings (like BERT), but for a first-pass model, TF-IDF + logistic regression is a fast and reasonable baseline.



In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1,2))
X_train = vectorizer.fit_transform(train['text'])
X_valid = vectorizer.transform(valid['text'])
y_train = train['star_label']
y_valid = valid['star_label']

### Step 5: Model Building

#### **Explanation:**
> We’ll fit a multinomial Logistic Regression classifier for interpretable, fast benchmarking. Other contenders: Random Forests, Support Vector Machines, or fine-tuned BERT for best results.



In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

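# Note: `multi_class` is deprecated in scikit-learn >= 1.5; with the default lbfgs solver
# a multinomial model is fitted there without passing it
clf = LogisticRegression(max_iter=1000, multi_class='multinomial')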
clf.fit(X_train, y_train)
pred_valid = clf.predict(X_valid)

print(classification_report(y_valid, pred_valid, target_names=le.classes_.astype(str)))
print(confusion_matrix(y_valid, pred_valid))



              precision    recall  f1-score   support

           1       0.67      0.49      0.57      6000
           2       0.48      0.33      0.39      6000
           3       0.28      0.61      0.39      6000
           4       0.53      0.33      0.40      6000
           5       0.67      0.52      0.59      6000

    accuracy                           0.46     30000
   macro avg       0.52      0.46      0.47     30000
weighted avg       0.52      0.46      0.47     30000

[[2939  892 2075   43   51]
 [ 965 1966 2812  165   92]
 [ 339 1019 3682  752  208]
 [  72  187 2557 1961 1223]
 [  52   69 1948  795 3136]]
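
Because star ratings are ordered, plain accuracy treats a 1-vs-2 confusion the same as a 1-vs-5 confusion. As a rough sketch, the confusion matrix printed above can be re-read with ordinal-aware metrics. The matrix is copied from that output; `within_one` and `mae` are names introduced here for illustration.

```python
import numpy as np

# Confusion matrix copied from the validation output above
# (rows = true stars 1-5, columns = predicted stars 1-5)
cm = np.array([
    [2939,  892, 2075,   43,   51],
    [ 965, 1966, 2812,  165,   92],
    [ 339, 1019, 3682,  752,  208],
    [  72,  187, 2557, 1961, 1223],
    [  52,   69, 1948,  795, 3136],
])

n = cm.sum()
true_idx, pred_idx = np.indices(cm.shape)
distance = np.abs(true_idx - pred_idx)        # how many stars off each cell is

accuracy = np.trace(cm) / n                   # exact-match accuracy
within_one = cm[distance <= 1].sum() / n      # off by at most one star
mae = (cm * distance).sum() / n               # mean absolute error in stars

print(f"accuracy={accuracy:.3f}  within_one={within_one:.3f}  mae={mae:.3f}")
```

Reporting the off-by-one rate next to accuracy makes it easier to judge whether a later model (ordinal regression, a transformer) actually reduces the large errors.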


In [7]:
print(train['stars'].value_counts())

stars
1    240000
2    240000
3    240000
4    240000
5    240000
Name: count, dtype: int64


Note: To speed up the remaining experiments, I subsample the training set to 2,000 reviews per class. The baseline above was trained on the full training set.


In [8]:
# 2000 random samples per class, shuffle within each class
train = train.groupby('stars', group_keys=False).apply(lambda x: x.sample(n=2000, random_state=42)).reset_index(drop=True)

# Check the result
print(train['stars'].value_counts())

stars
1    2000
2    2000
3    2000
4    2000
5    2000
Name: count, dtype: int64




### **Conclusion**

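- **Model-building:** TF-IDF + Logistic Regression is a solid baseline: it reaches 0.46 validation accuracy on five classes (chance is 0.20), and about two thirds of its errors are off by a single star. It can be improved with better feature engineering and more advanced NLP models.
- **Bias:** Because the classes were artificially balanced, the model learns a uniform prior over star ratings and the reported metrics assume that same balance. If production ratings are skewed, accuracy and calibration will differ, so evaluate on (or retrain with) the organic distribution before relying on the predictions.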
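
One common way to account for the balancing at inference time is to re-weight the predicted class probabilities by the ratio of the deployment prior to the (uniform) training prior. The sketch below assumes an estimate of the real rating distribution is available; the `production_prior` values are made up for illustration and are not taken from this dataset.

```python
import numpy as np

def adjust_for_prior(probs, train_prior, target_prior):
    """Re-weight class probabilities from the training prior to a target prior."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(target_prior, dtype=float) / np.asarray(train_prior, dtype=float)
    adjusted = probs * weights                            # broadcast over rows
    return adjusted / adjusted.sum(axis=1, keepdims=True)  # renormalise each row

train_prior = np.full(5, 0.2)                             # balanced training set
production_prior = np.array([0.10, 0.05, 0.10, 0.20, 0.55])  # hypothetical values

# With the fitted baseline this would be clf.predict_proba(X_valid); here a toy batch
probs = np.array([
    [0.20, 0.20, 0.20, 0.20, 0.20],   # model is undecided
    [0.05, 0.10, 0.40, 0.35, 0.10],   # model leans towards 3 stars
])
adjusted = adjust_for_prior(probs, train_prior, production_prior)
print(adjusted.round(3))
print(adjusted.argmax(axis=1) + 1)    # predicted stars after adjustment
```

An undecided prediction falls back to the assumed production prior, which is the intended behaviour; whether this helps in practice depends on how good the prior estimate is.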


## Q2: How would you leverage both review_body and review_title fields to improve classification accuracy? What architectures would you consider for combining these text features?



---



### 1. **Dual-Input Sentiment Classifier: Architecture Overview**

```
[review_title] --> [Embedding] --> [LSTM] --
                                         |-> [Concatenate] -> [Dense] -> [Output]
[review_body]  --> [Embedding] --> [LSTM] --
```


### 2. **Implementation**

### Step 1: (Minimal) Preprocessing and Tokenization

Let’s use a simple whitespace tokenizer and a hand-built vocab for the demo; you’ll likely want something production-grade in a real project (e.g. transformers, BPE tokenization, etc.).

**Assumed Variables:**
- `train`/`valid`: Pandas DataFrames from Q1 (`train` is the 2,000-per-class subsample).
- `train['review_title']`, `train['review_body']`: Raw strings; they are lowercased and split on whitespace at encoding time.
- `train['star_label']`: Integer 0-4 labels.

#### **A. Build Vocabulary and Sets**

In [9]:
import torch
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from collections import Counter
import numpy as np

# Build vocabulary (joint for title/body for efficiency)
def build_vocab(texts, min_freq=2, max_size=10000):
    words = Counter()
    for text in texts:
        words.update(str(text).lower().split())
    # Reserve 0 for PAD, 1 for UNK
    vocab = {"<PAD>":0, "<UNK>":1}
    idx = 2
    for word, freq in words.most_common(max_size):
        if freq < min_freq:
            break
        vocab[word] = idx
        idx += 1
    return vocab

vocab = build_vocab(pd.concat([train['review_title'], train['review_body']]))

def encode(text, vocab):
    return [vocab.get(w, 1) for w in str(text).lower().split()]

# Encode all texts
title_maxlen = 15
body_maxlen = 150

def pad_seq(seq, maxlen):
    # Truncate to maxlen, then right-pad with the PAD index (0)
    seq = seq[:maxlen]
    return seq + [0] * (maxlen - len(seq))

train_title = [pad_seq(encode(t, vocab), title_maxlen) for t in train['review_title']]
train_body = [pad_seq(encode(b, vocab), body_maxlen) for b in train['review_body']]
valid_title = [pad_seq(encode(t, vocab), title_maxlen) for t in valid['review_title']]
valid_body = [pad_seq(encode(b, vocab), body_maxlen) for b in valid['review_body']]

train_labels = train['star_label'].values
valid_labels = valid['star_label'].values

### Step 2: PyTorch Dataset and Dataloader


In [10]:
class ReviewDataset(Dataset):
    def __init__(self, titles, bodies, labels):
        self.titles = torch.tensor(np.array(titles), dtype=torch.long)
        self.bodies = torch.tensor(np.array(bodies), dtype=torch.long)
        self.labels = torch.tensor(np.array(labels), dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.titles[idx], self.bodies[idx], self.labels[idx]

train_dataset = ReviewDataset(train_title, train_body, train_labels)
valid_dataset = ReviewDataset(valid_title, valid_body, valid_labels)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=64)

### Step 3: **PyTorch Dual-Branch Model**


In [11]:
import torch.nn as nn
import torch.nn.functional as F

class DualTextSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, lstm_dim=64, n_classes=5):
        super().__init__()
        # Title branch
        self.title_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.title_lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        # Body branch
        self.body_emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.body_lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)
        # Classifier
        self.fc1 = nn.Linear(2*lstm_dim, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, title_seq, body_seq):
        # Title
        t_emb = self.title_emb(title_seq)          # [bs, tlen, emb_dim]
        _, (t_h, _) = self.title_lstm(t_emb)       # t_h: [1, bs, lstm_dim]
        t_vec = t_h.squeeze(0)                     # [bs, lstm_dim]
        # Body
        b_emb = self.body_emb(body_seq)            # [bs, blen, emb_dim]
        _, (b_h, _) = self.body_lstm(b_emb)
        b_vec = b_h.squeeze(0)
        # Concatenate title and body
        x = torch.cat([t_vec, b_vec], dim=1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

### Step 4: **Training Loop**

In [12]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = DualTextSentimentClassifier(vocab_size=len(vocab), n_classes=5).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):  # train for a few epochs
    model.train()
    for title_seq, body_seq, labels in train_loader:
        title_seq, body_seq, labels = title_seq.to(device), body_seq.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(title_seq, body_seq)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
    # Validation
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for title_seq, body_seq, labels in valid_loader:
            title_seq, body_seq, labels = title_seq.to(device), body_seq.to(device), labels.to(device)
            logits = model(title_seq, body_seq)
            preds = torch.argmax(logits, dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    print(f'Epoch {epoch+1} valid accuracy: {correct/total:.4f}')

Epoch 1 valid accuracy: 0.2001
Epoch 2 valid accuracy: 0.2584
Epoch 3 valid accuracy: 0.2933
Epoch 4 valid accuracy: 0.2972
Epoch 5 valid accuracy: 0.3122
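
The accuracy above stays well below the TF-IDF baseline. Part of the gap is likely the much smaller training set (10,000 rows after subsampling), but the model also reads the LSTM state after all trailing PAD tokens, so short titles and bodies are summarised mostly from padding. A possible fix, sketched here as an assumption rather than a tested improvement, is to pack the sequences so each LSTM stops at the last real token. `PackedLSTMEncoder` is a hypothetical drop-in for one Embedding + LSTM branch.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class PackedLSTMEncoder(nn.Module):
    """Embedding + LSTM branch that ignores right-padding (PAD index 0)."""
    def __init__(self, vocab_size, emb_dim=100, lstm_dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True)

    def forward(self, seq):
        # seq: [bs, maxlen] token ids, right-padded with 0
        lengths = (seq != 0).sum(dim=1).clamp(min=1)   # empty texts count as length 1
        packed = pack_padded_sequence(
            self.emb(seq), lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h, _) = self.lstm(packed)                  # h: [1, bs, lstm_dim]
        return h.squeeze(0)                            # state at the last real token

torch.manual_seed(0)
encoder = PackedLSTMEncoder(vocab_size=10, emb_dim=4, lstm_dim=3).eval()
padded = torch.tensor([[5, 6, 0, 0], [1, 2, 3, 4]])
unpadded = torch.tensor([[5, 6]])
with torch.no_grad():
    out_padded = encoder(padded)
    out_unpadded = encoder(unpadded)
print(torch.allclose(out_padded[0], out_unpadded[0], atol=1e-6))
```

The final check compares the encoding of a padded sequence with the encoding of the same tokens without padding; with packing the two should agree.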


### **Summary**

> **Dual-input PyTorch Model for Multi-field Review Classification**
>
> - Each field (`review_title`, `review_body`) is tokenized and fed through its own Embedding + LSTM branch.
> - Their representations are **concatenated** and passed through dense layers to predict the star rating.
> - This architecture lets the model weigh summary (title) and context (body) differently for final classification.
>
> *For even better results:*
> - Swap LSTM for transformer encoders.
> - Try pre-trained word embeddings or transformer embeddings (using Huggingface for BERT-based feature extraction).
> - Tune architecture size, regularization.
>
> **This pattern is widely used for multi-source text tasks in recommendation and sentiment systems.**


## Q3: If you needed to build a multilingual review classification system using this dataset, what approach would you take considering the language field? What challenges might you face?

### Multilingual Review Classification: Architecture & Challenges

#### 1. **Approach Overview**

**Strategy:**  
- Use the `language` field to **identify the review's language**.
- Use a **multilingual NLP model** (e.g., Multilingual BERT, XLM-RoBERTa) that can process multiple languages in one model.
- Alternatively, train **language-specific models** and route reviews accordingly.
- Optionally, combine language as an explicit feature.

---

#### 2. **What Models?**

- **Multilingual Pretrained Transformers**: Models like [XLM-RoBERTa](https://huggingface.co/xlm-roberta-base), [mBERT](https://huggingface.co/bert-base-multilingual-cased), or multilingual DistilBERT cover roughly 100 languages in a single model.
- **Language Routing** (Advanced): If you have enough data per language, you might train separate language-specific models and apply them based on the `language` column.
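
The routing idea can be sketched in a few lines. Everything here is a toy stand-in: the per-language "models" are keyword rules, `route_by_language` is a hypothetical helper, and `"xx"` is a placeholder for a language without a dedicated model.

```python
def route_by_language(reviews, models, fallback):
    """Send each review to the model registered for its language, else to the fallback."""
    predictions = []
    for review in reviews:
        model = models.get(review["language"], fallback)
        predictions.append(model(review["text"]))
    return predictions

# Toy stand-ins for real per-language classifiers (hypothetical)
models = {
    "de": lambda text: 1 if "schlecht" in text.lower() else 5,
    "en": lambda text: 1 if "bad" in text.lower() else 5,
}
multilingual_fallback = lambda text: 3   # e.g. a shared multilingual model

reviews = [
    {"language": "de", "text": "Qualität sehr schlecht"},
    {"language": "en", "text": "Works great"},
    {"language": "xx", "text": "some text"},   # no dedicated model -> fallback
]
preds = route_by_language(reviews, models, multilingual_fallback)
print(preds)
```

Keeping a multilingual fallback avoids failing on languages that were rare or absent at training time.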

---

#### 3. **Practical Example with Huggingface Transformers**

Let’s see a simple, robust workflow for a single classifier (multilingual transformer):

#### **(A) Preprocessing**

In [13]:
from transformers import AutoTokenizer
import pandas as pd

# Let's assume 'train' and 'valid' pandas DataFrames as before
# Each row has 'review_title', 'review_body', 'language', 'star_label'
def make_multilingual_input(row):
    text = (str(row['review_title']) if pd.notna(row['review_title']) else '') + ' ' + \
           (str(row['review_body']) if pd.notna(row['review_body']) else '')
    return text.strip()

train['full_text'] = train.apply(make_multilingual_input, axis=1)
valid['full_text'] = valid.apply(make_multilingual_input, axis=1)

#### **(B) Tokenization With Multilingual Model**

In [14]:
from transformers import AutoTokenizer

model_name = "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize function
def tokenize_batch(texts, max_length=128):
    return tokenizer(
        texts.tolist(),
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

# Example:
X_train = tokenize_batch(train['full_text'])
X_valid = tokenize_batch(valid['full_text'])
y_train = torch.tensor(train['star_label'].values)
y_valid = torch.tensor(valid['star_label'].values)


#### **(C) PyTorch Dataset**

In [15]:
import torch
from torch.utils.data import Dataset

class ReviewsDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

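    def __getitem__(self, idx):
        # Encodings and labels are already tensors (return_tensors='pt'),
        # so index them directly instead of re-wrapping with torch.tensor
        item = {k: v[idx] for k, v in self.encodings.items()}
        item['labels'] = self.labels[idx]
        return item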

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewsDataset(X_train, y_train)
valid_dataset = ReviewsDataset(X_valid, y_valid)

#### **(D) Model and Training Loop (Huggingface Trainer for Simplicity)**

In [16]:
from sklearn.metrics import accuracy_score
import numpy as np

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc}

In [17]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,1.2419,1.071006,0.536733
2,0.9729,1.042053,0.557867



TrainOutput(global_step=1250, training_loss=1.10736171875, metrics={'train_runtime': 1125.3169, 'train_samples_per_second': 17.773, 'train_steps_per_second': 1.111, 'total_flos': 1315590712320000.0, 'train_loss': 1.10736171875, 'epoch': 2.0})

In [18]:
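# Optional: expose the language as an explicit signal by prefixing each text with its language code
train['prepared_text'] = train['language'] + ": " + train['full_text']
valid['prepared_text'] = valid['language'] + ": " + valid['full_text']
# Then tokenize `prepared_text` in the same way as `full_text` above.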


### 6. **Summary**

**Approach:**
- Use a _multilingual Transformer model_ (like XLM-RoBERTa) to handle all reviews in their original language.
- Use the `language` column for slicing/monitoring and, if desired, as an explicit signal.
- Carefully validate performance per language to guard against lower recall/precision in under-represented languages.
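
Per-language validation can be as simple as slicing the predictions by the `language` column. A minimal sketch with toy arrays follows; `accuracy_by_language` is a hypothetical helper, and in the notebook the inputs would come from `valid['star_label']`, the argmax of the model's logits, and `valid['language']`.

```python
import numpy as np

def accuracy_by_language(y_true, y_pred, languages):
    """Return {language: accuracy} so weak languages are not hidden by the global average."""
    y_true, y_pred, languages = map(np.asarray, (y_true, y_pred, languages))
    return {
        str(lang): float((y_true[languages == lang] == y_pred[languages == lang]).mean())
        for lang in np.unique(languages)
    }

# Toy labels and predictions (illustrative only)
y_true = [0, 4, 2, 3, 1, 4]
y_pred = [0, 4, 1, 3, 1, 0]
langs = ["de", "de", "de", "en", "en", "fr"]

per_lang = accuracy_by_language(y_true, y_pred, langs)
print(per_lang)
```

The same slicing works for precision, recall, or the confusion matrix, and it is worth reporting the per-language sample count alongside each score.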


## Q4: How would you design a recommendation system using the product_id, reviewer_id, and stars fields? What additional features could you engineer from the text fields to improve recommendations?


### Designing a Recommendation System with product_id, reviewer_id, and stars

### **1. Problem Overview**

> For a **recommendation system**, we want to predict what rating (or preference) a user (reviewer_id) would give to a product (product_id).  
> We'll treat this as a **collaborative filtering** problem, with possible enhancement by **content features** from text fields.

---

### **2. Collaborative Filtering Approach: Matrix Factorization (MF)**

We start with classic collaborative filtering—learning latent user/item factors.

#### **A. Data Preparation**


In [19]:
import pandas as pd

# Assume train is your DataFrame
user2id = {u:i for i, u in enumerate(train['reviewer_id'].unique())}
item2id = {p:i for i, p in enumerate(train['product_id'].unique())}

train['user_idx'] = train['reviewer_id'].map(user2id)
train['item_idx'] = train['product_id'].map(item2id)

num_users = len(user2id)
num_items = len(item2id)

#### **B. PyTorch Matrix Factorization Model**

In [20]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class ReviewDataset(Dataset):
    def __init__(self, df):
        self.users = torch.tensor(df['user_idx'].values, dtype=torch.long)
        self.items = torch.tensor(df['item_idx'].values, dtype=torch.long)
        self.ratings = torch.tensor(df['stars'].values, dtype=torch.float)

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        return self.users[idx], self.items[idx], self.ratings[idx]

train_set = ReviewDataset(train)
train_loader = DataLoader(train_set, batch_size=512, shuffle=True)

class MatrixFactorization(nn.Module):
    def __init__(self, n_users, n_items, emb_dim=32):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        self.fc = nn.Linear(emb_dim, 1)

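    def forward(self, user_idx, item_idx):
        u = self.user_emb(user_idx)
        v = self.item_emb(item_idx)
        x = u * v                  # element-wise interaction of the latent factors
        out = self.fc(x)           # learned weighted sum (plain MF would use x.sum(dim=1))
        return out.squeeze(-1)     # [bs]; squeeze(-1) keeps the batch dim when bs == 1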

model = MatrixFactorization(num_users, num_items)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
criterion = nn.MSELoss()
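
For reference, the scoring function implemented by `MatrixFactorization` can be written out. The symbols are introduced here and are not part of the code: $p_u$ and $q_i$ are the user and item embeddings of dimension $d$ (`emb_dim`), and $w$, $b$ are the weight vector and bias of `fc`.

$$
\hat{r}_{ui} = w^\top (p_u \odot q_i) + b = \sum_{k=1}^{d} w_k \, p_{uk} \, q_{ik} + b
$$

Classic matrix factorization is the special case $w = \mathbf{1}$, $b = 0$, i.e. the plain dot product $p_u^\top q_i$. With `nn.MSELoss`, the objective minimised over a mini-batch $B$ of observed ratings $r_{ui}$ is

$$
\mathcal{L} = \frac{1}{|B|} \sum_{(u,i) \in B} \left( \hat{r}_{ui} - r_{ui} \right)^2
$$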

#### **C. Training Loop**

In [21]:
for epoch in range(5):
    model.train()
    for user_idx, item_idx, ratings in train_loader:
        optimizer.zero_grad()
        preds = model(user_idx, item_idx)
        loss = criterion(preds, ratings)
        loss.backward()
        optimizer.step()
    # Note: this reports the loss of the last mini-batch only, not the epoch average
    print(f"Epoch {epoch+1}, Loss: {loss.item():.4f}")

Epoch 1, Loss: 9.7271
Epoch 2, Loss: 7.8600
Epoch 3, Loss: 5.4425
Epoch 4, Loss: 3.7082
Epoch 5, Loss: 1.5256


### **3. Enhancing with Features from Text Fields (Review Title/Body)**

> **Text fields help with "cold-start" and inject semantic/product information**
>  
> We can use text as **additional item features** (content-based), or compute "reviewer profile" text embeddings.

#### **A. Item Content Embedding (Example with TF-IDF or Transformer)**

**TF-IDF Embedding for Products:**

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Combine review text for each product
product_text = train.groupby('product_id')['review_body'].apply(lambda x: ' '.join(x)).reset_index()

vectorizer = TfidfVectorizer(max_features=128)
tfidf_matrix = vectorizer.fit_transform(product_text['review_body'])  # (num_products, 128)

**With Transformers (for better results):**

In [23]:
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-multilingual-cased')
model = AutoModel.from_pretrained('distilbert-base-multilingual-cased')

def get_transformer_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=128)
    with torch.no_grad():
        out = model(**inputs).last_hidden_state.mean(dim=1)  # Mean-pooled embedding
    return out.cpu().numpy().flatten()

# Generate embeddings for each product
product_text['transformer_emb'] = product_text['review_body'].apply(get_transformer_embedding)
# Stack to a numpy array: product_emb_matrix = np.stack(product_text['transformer_emb'])


#### **B. Using Additional Features in the Model**

- **Concat TF-IDF/transformer features with learnable item embedding**.
- Or, use a neural net branch for item text features, and fuse with collaborative branch.

**Recommended architecture:**

```
[user_id] --> [Embedding] ------------+
                                      |
[item_id] --> [Embedding] ------------+--> [Concatenate] --> [FC] --> rating
                                      |
[item_id] --> [Item text features] ---+
```

---

#### **C. Example: MF + Product Text Content Fusion**


In [24]:
class HybridRecommender(nn.Module):
    def __init__(self, n_users, n_items, emb_dim, item_text_features):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, emb_dim)
        self.item_emb = nn.Embedding(n_items, emb_dim)
        # item_text_features: dense array [n_items, feat_dim]; row i must belong to item_idx i
        self.item_text_feat = nn.Embedding.from_pretrained(torch.tensor(item_text_features, dtype=torch.float), freeze=True)
        self.fc1 = nn.Linear(emb_dim * 2 + item_text_features.shape[1], 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, user_idx, item_idx):
        u = self.user_emb(user_idx)
        v = self.item_emb(item_idx)
        t = self.item_text_feat(item_idx)
        x = torch.cat([u, v, t], dim=1)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x.squeeze()
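
`HybridRecommender` looks up text features by `item_idx`, so row `i` of `item_text_features` has to describe the product mapped to index `i`. Note that `product_text` comes from a `groupby` (sorted by `product_id`) while `item2id` follows order of first appearance, so the two orders generally differ; the TF-IDF matrix is also sparse and would need `.toarray()` first. A minimal alignment sketch, with `align_item_features` as a hypothetical helper and toy data in place of the real matrices:

```python
import numpy as np

def align_item_features(features, row_product_ids, item2id):
    """Reorder feature rows so that row item2id[pid] holds the features of product pid."""
    features = np.asarray(features, dtype=np.float32)
    aligned = np.zeros((len(item2id), features.shape[1]), dtype=np.float32)
    for row, pid in zip(features, row_product_ids):
        aligned[item2id[pid]] = row
    return aligned

# Toy example: features stored in sorted product order, indices in order of appearance
item2id = {"product_b": 0, "product_a": 1}
row_product_ids = ["product_a", "product_b"]      # e.g. product_text['product_id']
features = np.array([[1.0, 0.0], [0.0, 2.0]])     # e.g. tfidf_matrix.toarray()

item_text_features = align_item_features(features, row_product_ids, item2id)
print(item_text_features)
```

The aligned array can then be passed as `item_text_features` when constructing the hybrid model.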


#### **4. Other Feature Engineering Ideas**

> Additional features you can engineer from review text fields:
> - **Sentiment score** (using a separate sentiment classifier)
> - **Topic model** representation (e.g. LDA topics of product reviews)
> - **Review length, exclamation count, positivity/negativity metrics**
> - **Aggregated reviewer behavior:** average review sentiment for each user
> - **Timestamp features** if available (recency)

---

#### **5. Challenges in Real-world**

> - **Cold start:** For new products/users with few or no ratings, content-based features help
> - **Scalability:** Large user and product sets make pure MF or deep models resource-intensive
> - **Sparsity:** Most user-item pairs are missing (not rated)
> - **Interpretability:** MF is less interpretable than content-based; text features help explain recommendations



#### **Conclusion**

> By designing a **hybrid recommendation system** that combines user and product embeddings (collaborative filtering) with **text-derived features** (from review_body and review_title)—such as TF-IDF/transformer embeddings or summary sentiment scores—you can achieve better recommendation quality, especially for new products or users.  
> This approach gives you both the power of collaborative signals and the explanatory/contextual power of product review texts!

## Q5: If you needed to detect fake or spam reviews in this dataset, what features and approaches would you use? How would you handle the challenge of limited labeled data for fake reviews?


### **Detecting Fake or Spam Reviews: Features and Approaches**

---

#### 1. **Feature Engineering for Fake/Spam Detection**

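> For detecting fake or spam reviews, a classic approach is to combine text-based, semantic, behavioral, and metadata features. Some of the features below (timestamps, account age, verified-purchase status) are not columns in this dataset and are listed for settings where they are available. Here are examples for each: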

**A. Reviewer Behavior Features:**
- **Review frequency:** How many reviews does this user write per day/week/month?
- **Reviewer-product overlap:** Do many reviewers leave reviews for the same set of products in a short period?
- **User uniqueness:** Does the reviewer only review one brand/a narrow set of products?
- **Account age, verified purchase status** (if available).

**B. Review Text Features:**
- **Review length:** Very short or excessively long reviews can be suspicious.
- **Use of generic phrases:** Repetitive expressions ("great product," "highly recommended").
- **Similarity to other reviews:** High cosine similarity among many reviews, especially for the same user or product, can be a sign of copy-pasted spam.
- **Excessive “!” or all-caps.**
- **Sentiment polarity/extremes, or mismatch:** Overly positive or negative reviews out of sync with the majority.
- **Unusual timing** or posting patterns.

**C. Review Metadata:**
- **Time of review:** Spikes in reviews in a short period may indicate organized spam.
- **Language:** Some languages or language misuse might signal machine translation or inauthenticity.

---

#### 2. **Machine Learning Approaches**

**A. Supervised Learning (if labeled data is available):**
- **Labels:** 0 = genuine, 1 = fake/spam.
- **Models:** Logistic Regression, Random Forest, XGBoost, Deep learning, or even transformer-based spam detectors with text+metadata features.

**B. Unsupervised or Semi-supervised Learning (for limited labeled data):**
- **Anomaly Detection:** Isolation Forest, One-Class SVM, clustering, or autoencoders to detect outlier users/reviews.
- **Self-training or PU learning:** Start with a few positive (fake) examples, and iteratively expand the set using model predictions.
- **Heuristics:** Use hard rules (length, similarity) for initial labeling, then refine with model predictions.

#### **Code: Extracting suspicious text similarity features**


In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Concatenate 'review_title' and 'review_body'
train['text_all'] = train['review_title'].fillna('') + ' ' + train['review_body'].fillna('')

# TF-IDF fit
vectorizer = TfidfVectorizer(max_features=1000)
tfidf = vectorizer.fit_transform(train['text_all'])

# For each review, find its highest cosine similarity with other reviews by same user
max_sim = []
for idx, row in train.iterrows():
    user_reviews = train[train['reviewer_id'] == row['reviewer_id']].index
    similarities = cosine_similarity(tfidf[idx], tfidf[user_reviews]).flatten()
    if len(similarities) > 1:
        # Exclude self
        max_sim.append(sorted(similarities)[-2])
    else:
        max_sim.append(0.0)
train['max_user_review_sim'] = max_sim
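
The loop above filters the whole DataFrame once per review, so its cost grows quadratically with the number of rows. A possible alternative, sketched with NumPy only, computes the similarities once per reviewer group. `max_within_group_similarity` is a hypothetical helper and expects a dense array such as `tfidf.toarray()`, which is feasible for the 10,000 x 1,000 matrix used here.

```python
import numpy as np

def max_within_group_similarity(X, groups):
    """For each row of X, the highest cosine similarity to another row of the same group."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)     # all-zero rows stay zero
    out = np.zeros(len(X))
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if len(idx) < 2:
            continue                              # single review: nothing to compare with
        sims = Xn[idx] @ Xn[idx].T                # cosine similarities inside the group
        np.fill_diagonal(sims, -np.inf)           # exclude self-similarity
        out[idx] = sims.max(axis=1)
    return out

# Toy vectors: reviewer "a" wrote two identical reviews and one unrelated review
X = [[1, 0], [1, 0], [0, 1], [1, 1]]
reviewers = ["a", "a", "a", "b"]
max_sim = max_within_group_similarity(X, reviewers)
print(max_sim)
```

Reviewers with a single review get 0.0, matching the behaviour of the loop above.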

#### Simple heuristic - short generic reviews

In [26]:
train['text_len'] = train['text_all'].str.len()
train['excess_exclaim'] = train['text_all'].str.count('!')

# flag suspiciously short and generic reviews
possible_fakes = train[(train['text_len'] < 25) & (train['excess_exclaim'] > 2)]

In [27]:
possible_fakes

(empty DataFrame: 0 rows matched the filter)


### 3. **Handling Limited Labeled Data for Fake Reviews**

> **Challenge:** High-quality fake review labels are rare and expensive to obtain.

**Solutions:**
1. **Unsupervised methods/anomaly detection:** Use models that can learn what “normal” reviews look like and flag outliers.
2. **Heuristic bootstrapping:** Write rules to label the most obviously fake/genuine reviews, then iteratively refine with a classifier (semi-supervised or PU learning).
3. **Data augmentation:** Use synthetic fake reviews (template-based, or by swapping product names in real reviews) for model training—carefully, to avoid bias.
4. **Active learning:** Periodically review most-uncertain or highest-scoring suspected fake reviews and manually annotate, to incrementally improve your model.
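
The active-learning step can be illustrated with plain uncertainty sampling: send the reviews whose predicted fake-probability is closest to 0.5 to a human annotator. This is a sketch under assumptions; the scores are invented and `select_for_annotation` is a hypothetical helper, not part of any library.

```python
import numpy as np

def select_for_annotation(fake_probs, k):
    """Return indices of the k reviews whose fake-probability is closest to 0.5."""
    fake_probs = np.asarray(fake_probs, dtype=float)
    distance_to_boundary = np.abs(fake_probs - 0.5)   # small = model is unsure
    return np.argsort(distance_to_boundary)[:k]

# Hypothetical scores from a current fake-review classifier
fake_probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.70]
to_label = select_for_annotation(fake_probs, k=2)
print(to_label)
```

After each annotation round the new labels are added to the training set and the classifier is refit, so the pool of uncertain reviews shifts over time.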

---

### 4. **Explanation**

> **Fake/Spam Review Detection:**
>
> 1. **Features:** We engineer indicators from user behavior (activity patterns, product overlap), review metadata (time, language, length), and review text (similarity, repetition, sentiment extremes).
> 2. **Approach:** If we have enough labeled data, we train a traditional classifier; lacking that, we turn to anomaly detection and semi-supervised strategies.
> 3. **Handling Limited Labels:** We combine
>     - Rule-based filtering for initial seed labeling,
>     - Anomaly/outlier detection to flag hard-to-catch fakes,
>     - Manual review in an active learning framework to continually improve detection.
>





