<a href="https://colab.research.google.com/github/vijaygwu/systemdesign/blob/main/MultiStage.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Below is a simple Python example (using pandas) that demonstrates how to label positive vs. negative samples from the synthetic **interactions** dataset based on a compound engagement metric.

Feel free to change the weights and threshold to best suit your use case.

```python
import pandas as pd

# 1. Load the 'interactions.csv' dataset
interactions = pd.read_csv("interactions.csv")

# 2. Define the tunable weights for each signal.
#    You can modify these values (w1..w5) based on analysis or A/B testing.
w1 = 1.0  # Weight for click
w2 = 0.5  # Weight for (dwell_time / expected_dwell)
w3 = 2.0  # Weight for share
w4 = 1.0  # Weight for comment
w5 = 1.0  # Weight for hide (subtracted)

# 3. Decide on an 'expected_dwell' for normalizing dwell_time.
#    This is just an example; you might compute it from actual data.
expected_dwell = 10.0

# 4. Compute a compound engagement score for each interaction row.
#    - A missing dwell_time (NaN) would turn the whole score into NaN,
#      so we treat it as 0; a zero dwell_time simply adds nothing.
#    - Negative feedback (hide=1) will reduce the score.
interactions["engagement_score"] = (
    w1 * interactions["click"] +
    w2 * (interactions["dwell_time"].fillna(0) / expected_dwell) +
    w3 * interactions["share"] +
    w4 * interactions["comment"] -
    w5 * interactions["hide"]
)

# 5. Choose a threshold above which we consider an interaction 'positive'.
#    For demonstration, we'll use threshold = 2.0, but you can tune it.
threshold = 2.0

# 6. Create a binary label: 1 for positive, 0 for negative.
#    - Alternatively, you could store string labels like "positive" / "negative".
interactions["label"] = (interactions["engagement_score"] > threshold).astype(int)

# 7. Now 'interactions' has an added 'label' column indicating positive/negative.
#    You can inspect, group, or merge it with user/content data as needed.
print(interactions.head(10))

# 8. (Optional) Save the updated DataFrame back to CSV
interactions.to_csv("interactions_labeled.csv", index=False)
```

### Explanation

1. **Load the data**: The script reads the synthetic `interactions.csv` file into a pandas DataFrame.
2. **Weights**: We define the contribution (importance) of each user action (`click`, `share`, etc.). You might calibrate these via offline experiments or hyperparameter searches.
3. **Expected dwell**: Normalizes `dwell_time` since a 10-second view might be “long” for some platforms but trivial on others. You can refine this by computing an average or median dwell time from real data.
4. **Compound engagement score**:
   \[
     \text{engagement\_score}
       = w_1 \times \text{click}
         + w_2 \times \Bigl(\frac{\text{dwell\_time}}{\text{expected\_dwell}}\Bigr)
         + w_3 \times \text{share}
         + w_4 \times \text{comment}
         - w_5 \times \text{hide}
   \]
   - A `click` (1) adds `w1` points.  
   - `dwell_time` is scaled down by `expected_dwell` then multiplied by `w2`.  
   - A `share` significantly boosts the score (here `2.0` points each).  
   - Each `comment` also contributes extra points.  
   - A `hide` event subtracts points (penalizing negative feedback).
5. **Threshold**: We pick `2.0` as a simple cutoff for whether an interaction is “highly engaged” (label=1) or not (label=0). In a production system, you might run A/B tests or user research to refine this threshold.
6. **Label**: We store the result in a new column, `label`.
7. **Further usage**: You can integrate these labeled interactions into a supervised learning pipeline, e.g., training a classification model or computing ranking metrics.
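
As a quick sanity check on the formula, the sketch below recomputes the score by hand for three rows whose signal values are copied from the notebook's printed output (rows 0, 2, and 3). The function name `engagement_score` and the upper-case constant names are illustrative and not part of the original code.

```python
# Weights, normalizer, and threshold as defined in the labeling script.
W_CLICK, W_DWELL, W_SHARE, W_COMMENT, W_HIDE = 1.0, 0.5, 2.0, 1.0, 1.0
EXPECTED_DWELL = 10.0
THRESHOLD = 2.0

def engagement_score(click, dwell_time, share, comment, hide):
    """Weighted sum of signals; hide is subtracted as negative feedback."""
    return (
        W_CLICK * click
        + W_DWELL * (dwell_time / EXPECTED_DWELL)
        + W_SHARE * share
        + W_COMMENT * comment
        - W_HIDE * hide
    )

rows = [
    # (click, dwell_time, share, comment, hide)
    (1, 12.4, 0, 1, 0),  # 1.0 + 0.5 * 1.24 + 1.0
    (1, 5.1, 0, 0, 0),   # 1.0 + 0.5 * 0.51
    (0, 0.0, 0, 0, 1),   # only the hide penalty applies
]

for row in rows:
    score = engagement_score(*row)
    label = int(score > THRESHOLD)
    print(round(score, 3), label)
```

Only the first row clears the threshold of `2.0`, which agrees with the `engagement_score` and `label` columns in the printed output.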

This sample code should help you **label positive/negative samples** from your synthetic dataset in a way that aligns with the multi-stage ranking system described in the original reference. Adjust weights, threshold, and normalization to fit your specific application.
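
One way to approach that tuning is to derive the normalizer from the data and watch how the share of positive labels moves with the threshold. The sketch below is only an illustration: the six rows are the first six rows of the printed interactions output, and the helper `positive_rate` and the choice of the median of non-zero dwell times are assumptions, not part of the original script.

```python
import pandas as pd

# First six rows of the printed interactions output, used as a tiny sample.
sample = pd.DataFrame({
    "click":      [1, 1, 1, 0, 0, 1],
    "dwell_time": [12.4, 15.0, 5.1, 0.0, 3.2, 7.0],
    "share":      [0, 1, 0, 0, 0, 1],
    "comment":    [1, 0, 0, 0, 0, 1],
    "hide":       [0, 0, 0, 1, 0, 0],
})

# Assumption: normalize by the median of the non-zero dwell times.
expected_dwell = sample.loc[sample["dwell_time"] > 0, "dwell_time"].median()

def positive_rate(df, threshold, expected_dwell, w=(1.0, 0.5, 2.0, 1.0, 1.0)):
    """Fraction of rows that would be labeled positive at this threshold."""
    score = (
        w[0] * df["click"]
        + w[1] * (df["dwell_time"].fillna(0) / expected_dwell)
        + w[2] * df["share"]
        + w[3] * df["comment"]
        - w[4] * df["hide"]
    )
    return float((score > threshold).mean())

for t in (1.0, 2.0, 3.0):
    rate = positive_rate(sample, t, expected_dwell)
    print(f"threshold={t}: positive rate={rate:.2f}")
```

A threshold that labels almost everything positive (or almost nothing) leaves a classifier little to learn, so a sweep like this is a cheap first check before training.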

Below is a high-level walkthrough (with example code) of how you might train a simple Artificial Neural Network (ANN) on this synthetic dataset. The goal is to predict a user’s engagement (positive vs. negative) with a piece of content. We will:

1. **Combine** the three CSVs (`users.csv`, `content.csv`, and `interactions.csv`).  
2. **Create** features for both users and content.  
3. **Join** them with the interaction records.  
4. **Derive** the engagement label (e.g., using the compound engagement metric or the binary label column).  
5. **Train** a simple feed-forward neural network (ANN) in Keras/PyTorch/etc.

Below is a minimal example in **Python** using **pandas** for data wrangling, **scikit-learn** for transformations, and **TensorFlow/Keras** for the neural network. This is only a starting template—feel free to expand or refine.

---

## 1. Set Up the Project

Make sure you have the following Python libraries installed:

```bash
pip install pandas scikit-learn tensorflow
```

(Or install PyTorch if you prefer. The core idea is the same.)

---

## 2. Load and Merge the Data

```python
import pandas as pd

# 1. Load each CSV into a DataFrame
users = pd.read_csv("users.csv")
content = pd.read_csv("content.csv")
interactions = pd.read_csv("interactions.csv")

# 2. Suppose we already have a 'label' column in interactions.csv
#    (or we can compute one using the code from before).
#    We'll assume "label" is binary: 1 => positive engagement, 0 => negative.

# 3. Merge user features onto interactions
#    - 'user_id' is the join key
df = interactions.merge(users, how="inner", on="user_id")

# 4. Merge content features onto the result
#    - 'content_id' is the join key
df = df.merge(content, how="inner", on="content_id")

# Now 'df' has columns from users, content, and interactions plus the 'label'.
print(df.head())
```

At this point, `df` contains a row for each user–content interaction, including user attributes (e.g., `device_type`, `region`, `topic_affinity_politics`, etc.), content attributes (e.g., `publisher`, `topic`, `quality_score`), as well as interaction signals (`click`, `dwell_time`, `share`, `hide`, etc.) and your final `label` (if you previously computed it or included it in `interactions.csv`).
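
An inner join silently drops interactions whose `user_id` or `content_id` has no match, so it can be worth checking how many rows survive. The sketch below uses small hypothetical frames (`interactions_demo`, `users_demo`) and relies on the `validate` and `indicator` options of `DataFrame.merge`; the same pattern would apply to the content join.

```python
import pandas as pd

# Hypothetical data: user 9 has interactions but no row in the users table.
interactions_demo = pd.DataFrame({
    "user_id": [1, 1, 2, 9],
    "content_id": [101, 108, 101, 104],
    "label": [1, 1, 0, 0],
})
users_demo = pd.DataFrame({
    "user_id": [1, 2, 3],
    "region": ["US", "EU", "US"],
})

# A left join with indicator=True keeps every interaction and records
# whether a matching user was found. validate="many_to_one" raises an
# error if user_id is duplicated in the users table.
checked = interactions_demo.merge(
    users_demo,
    how="left",
    on="user_id",
    validate="many_to_one",
    indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
print("interactions:", len(interactions_demo))
print("rows without a user match:", len(unmatched))
print(unmatched[["user_id", "content_id"]])
```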

---

## 3. Basic Feature Engineering

We need to convert categorical fields (e.g., `device_type`, `region`, `publisher`, `topic`) into numeric encodings or embeddings, because a neural network only accepts numeric inputs. We may also scale numerical fields (like dwell time, topic affinities, quality scores).

Below, we take a minimal approach using scikit-learn:  
- **Label Encoding** or **One-Hot Encoding** for categorical columns.  
- **Scaling** for numeric columns.

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np

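# Example: let's pick some columns to treat as numeric and others as categorical.
# Caution: 'dwell_time' also feeds the engagement score that defines 'label',
# so keeping it as a feature leaks label information. Drop it if the model
# should predict engagement before the interaction happens.
num_cols = [
    "topic_affinity_politics", "topic_affinity_tech", "topic_affinity_sports",
    "avg_session_length", "quality_score", "dwell_time"
]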

cat_cols = [
    "device_type", "region", "publisher", "topic"
]
# Note: You could also treat 'user_id' and 'content_id' as categorical
# and learn embeddings for them. For simplicity, we'll skip that here.

# 1. Separate our final label from the features
labels = df["label"].values  # 0 or 1
df_features = df[num_cols + cat_cols].copy()

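# 2. One-Hot encode categorical columns
#    (scikit-learn >= 1.2 names this argument `sparse_output`;
#    older releases call it `sparse`.)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
cat_encoded = ohe.fit_transform(df_features[cat_cols])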

# 3. Scale numeric columns
scaler = StandardScaler()
num_scaled = scaler.fit_transform(df_features[num_cols])

# 4. Concatenate numeric + one-hot-coded categorical arrays
X = np.hstack([num_scaled, cat_encoded])

# X is now a numeric matrix suitable for ANN input
print(X.shape, labels.shape)  # e.g., (N, num_features), (N,)
```

> **In practice**: You might do more advanced transformations (e.g., embeddings for user IDs, content IDs, or hashing for `blocklist`) or incorporate engineered cross-features. This snippet only shows the basics.
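
One refinement that stays within the basics is to fit the encoder and scaler on the training split only, so that validation rows do not influence the statistics. The sketch below assumes a small hypothetical frame `demo` with a subset of the columns and uses scikit-learn's `ColumnTransformer`; it is one possible arrangement, not the notebook's own pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows with two numeric and two categorical columns.
demo = pd.DataFrame({
    "quality_score": [0.9, 0.4, 0.7, 0.2, 0.8, 0.5],
    "avg_session_length": [12.0, 3.5, 8.0, 1.0, 20.0, 6.0],
    "device_type": ["mobile", "desktop", "mobile", "tablet", "desktop", "mobile"],
    "topic": ["tech", "sports", "tech", "politics", "sports", "tech"],
    "label": [1, 0, 1, 0, 1, 0],
})
num_demo = ["quality_score", "avg_session_length"]
cat_demo = ["device_type", "topic"]

# Split first, then fit the transformers on the training rows only.
train_df, val_df = train_test_split(demo, test_size=2, random_state=0)

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), num_demo),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_demo),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_train_demo = preprocess.fit_transform(train_df)
X_val_demo = preprocess.transform(val_df)  # unseen categories become all zeros
y_train_demo = train_df["label"].to_numpy()

print(X_train_demo.shape, X_val_demo.shape)
```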

---

## 4. Build a Simple Feed-Forward Neural Network

Below is an example using **Keras** (TensorFlow) to define and train a minimal multi-layer perceptron (MLP) on the feature matrix `X` and label vector `labels`.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# 1. Define a small MLP
model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),   # input layer: dimension = # of features
    layers.Dense(64, activation='relu'), # hidden layer
    layers.Dense(32, activation='relu'), # another hidden layer
    layers.Dense(1, activation='sigmoid') # output layer for binary classification
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# 2. Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# 3. Fit the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=10
)
```

This simple network tries to learn a mapping: \(\text{features} \to \text{probability of label=1}\). After training, you can examine metrics:

```python
import matplotlib.pyplot as plt

plt.plot(history.history["loss"], label="Train Loss")
plt.plot(history.history["val_loss"], label="Val Loss")
plt.legend()
plt.show()
```

---

## 5. Evaluate the Model

Besides looking at training/validation loss and accuracy, you might also compute:

- **Precision, Recall, F1**  
- **AUC (Area Under ROC Curve)**  
- **Calibration** (if your use case cares about well-calibrated probabilities)  

For example:

```python
from sklearn.metrics import classification_report, roc_auc_score

# Predict once, then derive both the hard labels and the AUC from the same probabilities.
y_val_probs = model.predict(X_val).ravel()  # raw probabilities
y_val_pred = (y_val_probs > 0.5).astype(int)
print(classification_report(y_val, y_val_pred, digits=4))

print("AUC:", roc_auc_score(y_val, y_val_probs))
```
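
For the calibration item in the list above, a rough check is to compare the mean predicted probability with the observed positive rate inside probability bins, alongside the Brier score. The sketch below is a minimal NumPy version; the helper names `reliability_table` and `brier_score` and the four example predictions are hypothetical.

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

def reliability_table(y_true, y_prob, n_bins=5):
    """Per-bin (low, high, count, mean predicted, observed positive rate)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((
                float(edges[b]),
                float(edges[b + 1]),
                int(mask.sum()),
                float(y_prob[mask].mean()),
                float(y_true[mask].mean()),
            ))
    return rows

# Hypothetical labels and predicted probabilities.
y_true_demo = [0, 0, 1, 1]
y_prob_demo = [0.10, 0.45, 0.35, 0.85]

brier = brier_score(y_true_demo, y_prob_demo)
table = reliability_table(y_true_demo, y_prob_demo)

print("Brier score:", round(brier, 4))
for low, high, count, mean_pred, pos_rate in table:
    print(f"[{low:.1f}, {high:.1f}): n={count}, predicted={mean_pred:.2f}, observed={pos_rate:.2f}")
```

In a well-calibrated model the predicted and observed columns track each other; with a validation set as small as the one here, the per-bin numbers will be noisy.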

---

## 6. Going Further

1. **Learnable Embeddings** for user/content IDs:  
   - Instead of one-hot-encoding user or content IDs, you can feed them into a learned embedding layer in Keras or PyTorch. This is common in recommender systems (similar to collaborative filtering neural approaches).  
2. **Sequence/Context Modeling**:  
   - If you want to capture the user’s session history, you could build a model that processes a sequence of user interactions (via an RNN or Transformer).  
3. **Multi-Task Learning**:  
   - If you have multiple engagement signals (click, share, dwell, etc.) you can define separate output heads and train them together.  

---

## Summary

- **Join** all data: user + content + interactions + (optionally) computed labels.  
- **Encode** categorical fields, scale numeric fields.  
- **Train** an ANN in Keras (or another framework) using the resulting numeric matrix.  
- **Adjust** your architecture, hyperparameters, and feature transformations as you iterate.  

This pipeline is just the starting point. In production-scale systems, you might rely on a multi-stage process (ANN for final ranking, GBDT for initial ranking, etc.), plus advanced feature engineering and real-time pipelines. However, the example above demonstrates the core idea for how to feed the synthetic dataset into an ANN-based classifier or ranking model.

In [1]:
import pandas as pd

# 1. Load the 'interactions.csv' dataset
interactions = pd.read_csv("/content/sample_data/Interaction.csv")

# 2. Define the tunable weights for each signal.
#    You can modify these values (w1..w5) based on analysis or A/B testing.
w1 = 1.0  # Weight for click
w2 = 0.5  # Weight for (dwell_time / expected_dwell)
w3 = 2.0  # Weight for share
w4 = 1.0  # Weight for comment
w5 = 1.0  # Weight for hide (subtracted)

# 3. Decide on an 'expected_dwell' for normalizing dwell_time.
#    This is just an example; you might compute it from actual data.
expected_dwell = 10.0

# 4. Compute a compound engagement score for each interaction row.
#    - If dwell_time is missing or zero, it won't add much to the score.
#    - Negative feedback (hide=1) will reduce the score.
interactions["engagement_score"] = (
    w1 * interactions["click"] +
    w2 * (interactions["dwell_time"] / expected_dwell) +
    w3 * interactions["share"] +
    w4 * interactions["comment"] -
    w5 * interactions["hide"]
)

# 5. Choose a threshold above which we consider an interaction 'positive'.
#    For demonstration, we'll use threshold = 2.0, but you can tune it.
threshold = 2.0

# 6. Create a binary label: 1 for positive, 0 for negative.
#    - Alternatively, you could store string labels like "positive" / "negative".
interactions["label"] = (interactions["engagement_score"] > threshold).astype(int)

# 7. Now 'interactions' has an added 'label' column indicating positive/negative.
#    You can inspect, group, or merge it with user/content data as needed.
print(interactions.head(10))

# 8. (Optional) Save the updated DataFrame back to CSV
interactions.to_csv("/content/sample_data/interactions_labeled.csv", index=False)


   user_id  content_id  click  dwell_time  share  comment  hide  \
0        1         101      1        12.4      0        1     0   
1        1         108      1        15.0      1        0     0   
2        2         101      1         5.1      0        0     0   
3        2         104      0         0.0      0        0     1   
4        3         136      0         3.2      0        0     0   
5        3         107      1         7.0      1        1     0   
6        4         110      1         9.5      0        2     0   
7        4         104      1        10.2      0        2     0   
8        5         103      1         9.4      1        0     0   
9        5         107      1         5.0      0        0     0   

        event_timestamp  engagement_score  label  
0  2025-04-01T09:00:00Z             2.620      1  
1  2025-04-01T10:45:00Z             3.750      1  
2  2025-04-01T09:10:00Z             1.255      0  
3  2025-04-01T13:05:00Z            -1.000      0  
4  2025

In [2]:
import pandas as pd

# 1. Load each CSV into a DataFrame
users = pd.read_csv("/content/sample_data/Users.csv")
content = pd.read_csv("/content/sample_data/Content.csv")
interactions = pd.read_csv("/content/sample_data/interactions_labeled.csv")

# 2. Suppose we already have a 'label' column in interactions.csv
#    (or we can compute one using the code from before).
#    We'll assume "label" is binary: 1 => positive engagement, 0 => negative.

# 3. Merge user features onto interactions
#    - 'user_id' is the join key
df = interactions.merge(users, how="inner", on="user_id")

# 4. Merge content features onto the result
#    - 'content_id' is the join key
df = df.merge(content, how="inner", on="content_id")

# Now 'df' has columns from users, content, and interactions plus the 'label'.
print(df.head())


   user_id  content_id  click  dwell_time  share  comment  hide  \
0        1         101      1        12.4      0        1     0   
1        1         108      1        15.0      1        0     0   
2        2         101      1         5.1      0        0     0   
3        2         104      0         0.0      0        0     1   
4        3         136      0         3.2      0        0     0   

        event_timestamp  engagement_score  label  ... topic_affinity_politics  \
0  2025-04-01T09:00:00Z             2.620      1  ...                    0.77   
1  2025-04-01T10:45:00Z             3.750      1  ...                    0.77   
2  2025-04-01T09:10:00Z             1.255      0  ...                    0.34   
3  2025-04-01T13:05:00Z            -1.000      0  ...                    0.34   
4  2025-04-02T14:50:00Z             0.160      0  ...                    0.02   

  topic_affinity_tech  topic_affinity_sports  avg_session_length    blocklist  \
0                0.12        

In [4]:
pip install pandas scikit-learn tensorflow




In [6]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import numpy as np

# Example: let's pick some columns to treat as numeric and others as categorical.
num_cols = [
    "topic_affinity_politics", "topic_affinity_tech", "topic_affinity_sports",
    "avg_session_length", "quality_score", "dwell_time"
]

cat_cols = [
    "device_type", "region", "publisher", "topic"
]
# Note: You could also treat 'user_id' and 'content_id' as categorical
# and learn embeddings for them. For simplicity, we'll skip that here.

# 1. Separate our final label from the features
labels = df["label"].values  # 0 or 1
df_features = df[num_cols + cat_cols].copy()

# 2. One-Hot encode categorical columns
ohe = OneHotEncoder(handle_unknown="ignore")
cat_encoded = ohe.fit_transform(df_features[cat_cols]).toarray()

# 3. Scale numeric columns
scaler = StandardScaler()
num_scaled = scaler.fit_transform(df_features[num_cols])

# 4. Concatenate numeric + one-hot-coded categorical arrays
X = np.hstack([num_scaled, cat_encoded])

# X is now a numeric matrix suitable for ANN input
print(X.shape, labels.shape)  # e.g., (N, num_features), (N,)


(106, 30) (106,)


In [7]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# 1. Define a small MLP
model = keras.Sequential([
    layers.Input(shape=(X.shape[1],)),   # input layer: dimension = # of features
    layers.Dense(64, activation='relu'), # hidden layer
    layers.Dense(32, activation='relu'), # another hidden layer
    layers.Dense(1, activation='sigmoid') # output layer for binary classification
])

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# 2. Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# 3. Fit the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    batch_size=32,
    epochs=10
)


Epoch 1/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 2s 148ms/step - accuracy: 0.4725 - loss: 0.6973 - val_accuracy: 0.4091 - val_loss: 0.7115
Epoch 2/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - accuracy: 0.5768 - loss: 0.6655 - val_accuracy: 0.5455 - val_loss: 0.6965
Epoch 3/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step - accuracy: 0.7241 - loss: 0.6294 - val_accuracy: 0.5455 - val_loss: 0.6839
Epoch 4/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.7186 - loss: 0.6163 - val_accuracy: 0.5909 - val_loss: 0.6725
Epoch 5/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step - accuracy: 0.7264 - loss: 0.6033 - val_accuracy: 0.5909 - val_loss: 0.6630
Epoch 6/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.7675 - loss: 0.5626 - val_accuracy: 0.5909 - val_loss: 0.6558
Epoch 7/10
3/3 ━━━━━━━━━━━━━━━━━

In [8]:
from sklearn.metrics import classification_report, roc_auc_score

y_val_pred = (model.predict(X_val) > 0.5).astype(int).ravel()
print(classification_report(y_val, y_val_pred, digits=4))

y_val_probs = model.predict(X_val).ravel()  # get raw probabilities
print("AUC:", roc_auc_score(y_val, y_val_probs))


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 229ms/step
              precision    recall  f1-score   support

           0     0.6000    0.8571    0.7059        14
           1     0.0000    0.0000    0.0000         8

    accuracy                         0.5455        22
   macro avg     0.3000    0.4286    0.3529        22
weighted avg     0.3818    0.5455    0.4492        22

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 227ms/step
AUC: 0.6607142857142857
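
To read the report above, it can help to rebuild it from raw counts. The confusion-matrix counts in the sketch below are not printed by the notebook; they are inferred from the reported supports and recalls (14 negatives with recall 0.8571, 8 positives with recall 0), so treat them as an assumption that happens to reproduce the table.

```python
# Inferred counts for the 22 validation rows (assumption, see note above).
tn, fp = 12, 2   # true label 0: predicted 0 / predicted 1
fn, tp = 8, 0    # true label 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a class is never predicted."""
    return numerator / denominator if denominator else 0.0

precision_0 = safe_div(tn, tn + fn)
recall_0 = safe_div(tn, tn + fp)
f1_0 = safe_div(2 * precision_0 * recall_0, precision_0 + recall_0)

precision_1 = safe_div(tp, tp + fp)
recall_1 = safe_div(tp, tp + fn)

accuracy = (tn + tp) / total
# Baseline: always predict the more frequent class.
majority_baseline = max(tn + fp, fn + tp) / total

print(f"class 0: precision={precision_0:.4f} recall={recall_0:.4f} f1={f1_0:.4f}")
print(f"class 1: precision={precision_1:.4f} recall={recall_1:.4f}")
print(f"accuracy={accuracy:.4f} majority baseline={majority_baseline:.4f}")
```

Under these counts the model never predicts a positive correctly, and its accuracy sits below the always-predict-0 baseline. With only 22 validation rows and 10 epochs of training, that is a sign the toy model has not learned much yet rather than a verdict on the approach.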


Below is a **conceptual walkthrough** (with **example code**) demonstrating how to build a *multistage ranking* pipeline using the synthetic dataset (users, content, interactions) you already have.

---

# Overview

A typical **multistage ranking** system looks like this:

1. **Candidate Generation (Stage 1)**  
   - Narrow down the huge pool of content (potentially thousands/millions) to a smaller set (e.g., a few hundred).  
   - Often done with simpler or specialized methods (approximate nearest-neighbor search over embeddings, collaborative filtering, etc.).

2. **Initial Ranking (Stage 2)**  
   - A moderately complex model (e.g., a Gradient Boosted Decision Tree, or GBDT) that scores and sorts these candidates.  
   - Fast and interpretable with moderate accuracy.

3. **Final / Deep Ranking (Stage 3)**  
   - A more sophisticated (often neural) model that re-ranks the top \(N\) from Stage 2.  
   - Captures complex interactions (e.g., deep user–content embeddings, sequential signals) but is more expensive.

Here, we’ll do a **toy example** with Python + scikit-learn + Keras to illustrate the concept using your synthetic dataset. Feel free to adapt or swap out parts (e.g., use PyTorch or LightGBM) to suit your environment.
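
Before filling in the individual stages, the overall funnel can be sketched in plain Python. Everything in this block is illustrative: `top_k`, `multistage_rank`, and the three toy scoring functions are hypothetical stand-ins for the topic match, GBDT, and neural network, chosen only to show how each stage narrows the list handed to the next.

```python
from typing import Callable, Iterable, List, Tuple

# A scorer maps (user_id, content_id) to a relevance score.
Scorer = Callable[[int, int], float]

def top_k(user_id: int, items: Iterable[int], scorer: Scorer, k: int) -> List[Tuple[int, float]]:
    """Score every item for the user and keep the k highest-scoring ones."""
    scored = [(cid, scorer(user_id, cid)) for cid in items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def multistage_rank(user_id, all_items, cheap, medium, expensive, k1=20, k2=5):
    """Stage 1 keeps k1 candidates, Stage 2 keeps k2, Stage 3 re-orders them."""
    stage1 = [cid for cid, _ in top_k(user_id, all_items, cheap, k1)]
    stage2 = [cid for cid, _ in top_k(user_id, stage1, medium, k2)]
    return top_k(user_id, stage2, expensive, k2)

# Toy scorers (hypothetical); real stages would call trained models.
def cheap_score(user_id, content_id):
    return -abs(content_id - 50)

def medium_score(user_id, content_id):
    return float(content_id % 7)

def expensive_score(user_id, content_id):
    return float(-content_id)

final_ranking = multistage_rank(
    1, range(100), cheap_score, medium_score, expensive_score, k1=20, k2=5
)
print(final_ranking)
```

The expensive scorer only ever sees `k2` items per user, which is the point of the design: cost per item can grow at each stage because the list shrinks.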

---

# 1. Setup and Data Loading

```python
import pandas as pd

# Load the CSVs
users = pd.read_csv("users.csv")          # 100 rows
content = pd.read_csv("content.csv")      # 100 rows
interactions = pd.read_csv("interactions.csv")  # each row is a user-content interaction

# We'll assume we already computed or added a 'label' in interactions
# via a compound engagement metric or a threshold approach.
# If not, see earlier instructions for how to generate 'label'.
```

---

# 2. Candidate Generation (Stage 1)

**Objective**: Quickly reduce the total number of content items to a smaller subset per user. In a real system, you might do:
- **Approximate Nearest Neighbor** on embeddings, or  
- **Collaborative Filtering** to find likely relevant items.

For this toy example, let’s do a **naïve “topic-match”** approach:  
1. Extract each user’s “topic affinity” (politics, tech, sports) from `users.csv`.  
2. Compare it to the content’s declared topic in `content.csv`, awarding a simple score.  
3. Select the top-K items for each user as “candidates.”

### 2.1 Build a Simple “Topic Score”

```python
import numpy as np

# Let's create a quick map: if content['topic'] == 'politics', we align with user topic_affinity_politics, etc.
# We'll do a function that given a user row and content row, returns a naive "match" score.
def naive_topic_score(user_row, content_row):
    topic = content_row["topic"]
    if topic == "politics":
        return user_row["topic_affinity_politics"]
    elif topic == "technology":
        return user_row["topic_affinity_tech"]
    elif topic == "sports":
        return user_row["topic_affinity_sports"]
    # ... in your real system you'd handle other topics (business, entertainment, etc.)
    # We'll just return 0 if it's not one of those three.
    else:
        return 0.0

# We also might want to filter out content if user is in a different region,
# or if content has region_restriction that doesn't match the user, etc.
# For simplicity, let's skip it or do a quick check.
```

### 2.2 Generate Candidates

Let’s say we want to find the **top 20** candidate items for each user.

```python
user_ids = users["user_id"].unique()
content_ids = content["content_id"].unique()

candidate_dict = {}  # key: user_id, value: list of top content_id

for uid in user_ids:
    # Grab user row
    user_row = users[users["user_id"] == uid].iloc[0]
    
    # For each piece of content, compute naive score
    scores = []
    for cid in content_ids:
        c_row = content[content["content_id"] == cid].iloc[0]
        score = naive_topic_score(user_row, c_row)
        scores.append((cid, score))
    
    # Sort by descending score, pick top 20
    scores.sort(key=lambda x: x[1], reverse=True)
    top_candidates = [cid for (cid, sc) in scores[:20]]
    candidate_dict[uid] = top_candidates

# candidate_dict[u] now holds the list of content IDs for that user
# In a real pipeline, you'd store these somewhere or proceed to Stage 2 next.
```

Note: For a real system, you’d do this with vector embeddings or CF to handle scale. This naive approach is fine for a small synthetic dataset demonstration.
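
As a step toward the embedding-based approach, the nested loops can be replaced by a single matrix product followed by a per-user top-K selection. The sketch below uses random, hypothetical user vectors and one-hot item topic vectors; with one-hot items the dot product equals the user's affinity for the item's topic, which mirrors the naive topic score. It is exact brute-force search, not an approximate nearest-neighbor index.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 4 users x 3 topic affinities, 10 items with one topic each.
user_vecs = rng.random((4, 3))
item_topics = rng.integers(0, 3, size=10)
item_vecs = np.eye(3)[item_topics]  # one-hot topic vector per item

def top_k_by_dot(user_vecs, item_vecs, k):
    """Return, for each user, the indices of the k items with the highest dot product."""
    scores = user_vecs @ item_vecs.T  # shape: (n_users, n_items)
    k = min(k, scores.shape[1])
    # argpartition selects the k best per row without sorting all items.
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    # Sort just those k by descending score.
    rows = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[rows, part], axis=1)
    return part[rows, order]

top_items = top_k_by_dot(user_vecs, item_vecs, k=3)
print(top_items)  # row u holds item indices for user u, best first
```

The returned values are positions in `item_vecs`; in the notebook's setting they would be mapped back to `content_id` values before Stage 2.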

---

# 3. Initial Ranking (Stage 2)

**Objective**: Score the *shortlisted candidates* with a moderately complex model (e.g., GBDT) that’s faster to run than a large neural net but more accurate than naive topic matching.

We’ll do the following:

1. Create training data from **interactions** for those user–content pairs that actually happened.  
2. Train a GBDT through a **scikit-learn**-compatible API (e.g., XGBoost or LightGBM) to predict the user’s engagement label.  
3. At inference time, for each user’s candidate list, we generate the same features used in training, run them through the GBDT, and rank the content by predicted engagement probability.

### 3.1 Build Training Data for GBDT

We want user + content features => label. Let’s do a small set of numeric features:

- **User**: `topic_affinity_politics`, `topic_affinity_tech`, `topic_affinity_sports`  
- **Content**: `quality_score`  
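- **Interaction**: signals such as `click` and `dwell_time` are best left out of the feature set. The label is derived from them, and they are not known until after the item has been shown, so a Stage 2 model trained offline on *past interactions* should rely on features that are available at serving time.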

```python
import xgboost as xgb   # or "import lightgbm as lgb"

# 1. Merge interactions with user and content data to get columns
df_merged = interactions.merge(users, on="user_id").merge(content, on="content_id")

# 2. We'll assume 'label' is in interactions. If you haven't computed it, do so first.
#    If 'label' doesn't exist, see earlier code for computing a compound engagement metric threshold.

# 3. We'll pick a few columns to be our features
feature_cols = [
    "topic_affinity_politics",
    "topic_affinity_tech",
    "topic_affinity_sports",
    "quality_score"
]
X_data = df_merged[feature_cols]
y_data = df_merged["label"]  # 0 or 1

# 4. Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

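# 5. Train a basic XGBoost model
#    (xgboost >= 1.6 takes early_stopping_rounds in the constructor;
#    passing it to fit() fails on xgboost >= 2.0, and use_label_encoder
#    is no longer needed.)
gbdt = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    eval_metric="logloss",
    early_stopping_rounds=5,
    random_state=42
)
gbdt.fit(X_train, y_train, eval_set=[(X_val, y_val)])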

# Let's check accuracy quickly
from sklearn.metrics import accuracy_score
y_val_pred = gbdt.predict(X_val)
acc = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", acc)
```

### 3.2 Use the GBDT to Score Candidates

Now that the GBDT is trained, we can *score* any user–content pair. In a multi-stage pipeline:

- We look at the user’s top 20 (from candidate generation).
- Build the same feature columns.
- Run `gbdt.predict_proba(...)` to get probability of engagement.

```python
def gbdt_score_candidates(uid, candidate_list):
    """
    Takes a user_id and a list of candidate content IDs.
    Returns a list of (content_id, predicted_score).
    """
    # find user row
    user_row = users[users["user_id"] == uid].iloc[0]
    
    results = []
    for cid in candidate_list:
        c_row = content[content["content_id"] == cid].iloc[0]
        
        # build same features
        row_feats = {
            "topic_affinity_politics": user_row["topic_affinity_politics"],
            "topic_affinity_tech": user_row["topic_affinity_tech"],
            "topic_affinity_sports": user_row["topic_affinity_sports"],
            "quality_score": c_row["quality_score"]
        }
        
        # convert to DataFrame
        single_df = pd.DataFrame([row_feats])
        
        # predict probability
        proba = gbdt.predict_proba(single_df)[0,1]  # class=1 probability
        results.append((cid, proba))
    
    # Sort by descending proba
    results.sort(key=lambda x: x[1], reverse=True)
    return results

# Example usage for user 1:
top_20 = candidate_dict[1]  # from Stage 1
stage2_ranked = gbdt_score_candidates(uid=1, candidate_list=top_20)
print("GBDT Stage2 Ranking for User 1:", stage2_ranked)
```

That’s the **Stage 2** ranking output: a list of `(content_id, predicted_score)`.
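
The loop above calls `predict_proba` once per candidate. A batched variant builds one feature matrix for all candidates and scores it in a single call. The sketch below is a minimal illustration under stated assumptions: `score_candidates_batched` and `QualityOnlyModel` are hypothetical names that do not appear in the notebook, and the stub model only stands in for the trained `gbdt` so the example runs without XGBoost.

```python
import numpy as np
import pandas as pd

FEATURE_COLS = [
    "topic_affinity_politics",
    "topic_affinity_tech",
    "topic_affinity_sports",
    "quality_score",
]

def score_candidates_batched(model, users, content, uid, candidate_list):
    """Score every candidate for one user with a single predict_proba call."""
    user_row = users.loc[users["user_id"] == uid].iloc[0]
    # One content row per candidate, in the same order as candidate_list.
    cand = content.set_index("content_id").loc[candidate_list]
    feats = pd.DataFrame(
        {
            # User-level scalars are broadcast to one value per candidate row.
            "topic_affinity_politics": user_row["topic_affinity_politics"],
            "topic_affinity_tech": user_row["topic_affinity_tech"],
            "topic_affinity_sports": user_row["topic_affinity_sports"],
            "quality_score": cand["quality_score"].to_numpy(),
        },
        columns=FEATURE_COLS,
    )
    proba = model.predict_proba(feats)[:, 1]  # probability of class 1
    return sorted(zip(candidate_list, proba.tolist()),
                  key=lambda x: x[1], reverse=True)

class QualityOnlyModel:
    """Hypothetical stand-in for the trained gbdt: P(class 1) = quality_score."""
    def predict_proba(self, X):
        p = X["quality_score"].to_numpy(dtype=float)
        return np.column_stack([1.0 - p, p])

users_demo = pd.DataFrame({
    "user_id": [1],
    "topic_affinity_politics": [0.2],
    "topic_affinity_tech": [0.7],
    "topic_affinity_sports": [0.1],
})
content_demo = pd.DataFrame({
    "content_id": [101, 102, 103],
    "quality_score": [0.3, 0.9, 0.6],
})
ranked_demo = score_candidates_batched(
    QualityOnlyModel(), users_demo, content_demo, uid=1,
    candidate_list=[101, 102, 103],
)
print(ranked_demo)
```

In the notebook itself, the call would presumably be `score_candidates_batched(gbdt, users, content, 1, top_20)`. The same pattern applies to any model that scores a whole feature matrix in one call.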

---

# 4. Final / Deep Ranking (Stage 3)

**Objective**: Re-rank the top \(M\) items (e.g., top 5 or top 10 from Stage 2) with a more computationally expensive model (like a neural network) that can capture deeper interactions.

### 4.1 Train a Neural Network on Historical Data

Similar to Stage 2, we can train a feed-forward (or any advanced) neural net. Let’s do a simple example in Keras:

```python
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# We'll reuse df_merged from above (the joined user+content+interaction dataset).
X_data_nn = df_merged[feature_cols].values
y_data_nn = df_merged["label"].values

# Train/val split
X_train_nn, X_val_nn, y_train_nn, y_val_nn = train_test_split(X_data_nn, y_data_nn, test_size=0.2, random_state=42)

# Build a small ANN
model = keras.Sequential([
    layers.Input(shape=(len(feature_cols),)),
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_nn, y_train_nn, validation_data=(X_val_nn, y_val_nn), epochs=10, batch_size=32)
```

### 4.2 Re-rank the Top Items from Stage 2

At inference time for user \(u\):

1. Take the **top \(M\) items** from Stage 2’s results (e.g., top 5).  
2. Build the same features, run them through the trained neural net’s `model.predict(...)`.  
3. Sort by the neural net’s predicted probability.

```python
def nn_score_candidates(uid, candidate_list):
    results = []
    user_row = users[users["user_id"] == uid].iloc[0]
    
    for cid in candidate_list:
        c_row = content[content["content_id"] == cid].iloc[0]
        row_feats = [
            user_row["topic_affinity_politics"],
            user_row["topic_affinity_tech"],
            user_row["topic_affinity_sports"],
            c_row["quality_score"],
        ]
        X_input = np.array([row_feats], dtype=float)
        prob = model.predict(X_input)[0,0]  # single numeric
        results.append((cid, prob))
    
    # sort descending
    results.sort(key=lambda x: x[1], reverse=True)
    return results

# Let's say from Stage 2 we took the top 5:
stage2_top5 = stage2_ranked[:5]  # [(cid, score), (cid, score), ...]
final_stage3 = nn_score_candidates(uid=1, candidate_list=[cid for cid,_ in stage2_top5])

print("Stage 3 Final Ranking for User 1:", final_stage3)
```

**Now** you have a 3-stage pipeline:
1. Candidate generation with naive topic matching (Stage 1).  
2. GBDT ranking on the top 20 (Stage 2).  
3. DNN re-ranking on the top 5 from Stage 2 (Stage 3).  
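
The three stages can be chained in one function. This is a minimal sketch, not the notebook's own code: `run_pipeline` and the three toy stage functions are hypothetical, and each stage is passed in as a plain callable so the example runs without the trained models.

```python
def run_pipeline(uid, generate, stage2_score, stage3_score,
                 n_candidates=20, m=5):
    """Chain candidate generation, Stage 2 ranking, and Stage 3 re-ranking."""
    candidates = generate(uid)[:n_candidates]      # Stage 1: cheap, broad
    stage2 = stage2_score(uid, candidates)         # Stage 2: sorted (cid, score)
    top_m = [cid for cid, _ in stage2[:m]]         # keep only the best M
    return stage3_score(uid, top_m)                # Stage 3: expensive, narrow

# Toy stand-ins for the real stages (assumptions for illustration only).
def toy_generate(uid):
    return list(range(100, 130))

def toy_stage2(uid, cids):
    return sorted(((c, (c % 7) / 7.0) for c in cids),
                  key=lambda x: x[1], reverse=True)

def toy_stage3(uid, cids):
    return sorted(((c, (c % 5) / 5.0) for c in cids),
                  key=lambda x: x[1], reverse=True)

final_demo = run_pipeline(1, toy_generate, toy_stage2, toy_stage3,
                          n_candidates=20, m=5)
print(final_demo)
```

With the notebook's objects, the equivalent call would be `run_pipeline(1, lambda u: candidate_dict[u], gbdt_score_candidates, nn_score_candidates)`.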

---

# 5. Observations & Next Steps

- This example is **toy-scale**: our dataset has only 100 users, 100 content items, and 100 interactions. Real systems might have millions of each.  
- **Candidate Generation** in production is often specialized: approximate nearest-neighbor search over embeddings or high-performance collaborative filtering.  
- **Stage 2** might use a library like LightGBM or XGBoost with many more features (e.g., dwell time, user–publisher affinity).  
- **Stage 3** can be a more advanced **deep architecture** (multi-task, embedding-based, sequence-based, or multi-head attention).  
- You might do **A/B Testing** at each stage to ensure each refinement truly improves final user engagement or retention.  
- **Real-time** integration includes caching, streaming feature updates, parallel inference, etc.

---

## Summary

1. **Candidate Generation** (Stage 1):  
   - Quickly narrow down potential items for each user.  
2. **Initial Ranking** (Stage 2):  
   - A moderately complex model (GBDT) that is fast at inference time yet still fairly accurate.  
3. **Final/Deep Ranking** (Stage 3):  
   - A more computationally expensive model (e.g., neural net) that re-ranks the top items from Stage 2.

With the sample code above, you can see how to **build & train** multiple models on the synthetic dataset, **score** user–content pairs with each model, and **chain** the results into a multistage ranking pipeline. A real system would refine the data pipeline, add richer features, handle much larger data, and measure performance with robust offline and online metrics. Even so, this demonstration illustrates the core ideas of how “multistage ranking” is done in practice.

In [10]:
import pandas as pd

# Load the CSVs
users = pd.read_csv("/content/sample_data/Users.csv")          # 100 rows
content = pd.read_csv("/content/sample_data/Content.csv")      # 100 rows
interactions = pd.read_csv("/content/sample_data/interactions_labeled.csv")  # 100 rows, each row is a user-content interaction

# We'll assume we already computed or added a 'label' in interactions
# via a compound engagement metric or a threshold approach.
# If not, see earlier instructions for how to generate 'label'.


In [11]:
import numpy as np

# Let's create a quick map: if content['topic'] == 'politics', we align with user topic_affinity_politics, etc.
# We'll do a function that given a user row and content row, returns a naive "match" score.
def naive_topic_score(user_row, content_row):
    topic = content_row["topic"]
    if topic == "politics":
        return user_row["topic_affinity_politics"]
    elif topic == "technology":
        return user_row["topic_affinity_tech"]
    elif topic == "sports":
        return user_row["topic_affinity_sports"]
    # ... in your real system you'd handle other topics (business, entertainment, etc.)
    # We'll just return 0 if it's not one of those three.
    else:
        return 0.0

# We also might want to filter out content if user is in a different region,
# or if content has region_restriction that doesn't match the user, etc.
# For simplicity, let's skip it or do a quick check.
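
The region check mentioned in the comments above could look roughly like the sketch below. The column names `region` (on users) and `region_restriction` (on content) are assumptions taken from the wording of those comments and may not match the actual CSV schema; `allowed_content` is a hypothetical helper.

```python
import pandas as pd

def allowed_content(user_row, content,
                    user_region_col="region",
                    restriction_col="region_restriction"):
    """Keep content that is unrestricted or restricted to the user's region."""
    restriction = content[restriction_col]
    unrestricted = restriction.isna()
    same_region = restriction == user_row[user_region_col]
    return content[unrestricted | same_region]

# Tiny demo with made-up rows.
user_demo = pd.Series({"user_id": 1, "region": "EU"})
content_demo = pd.DataFrame({
    "content_id": [101, 102, 103],
    "region_restriction": [None, "EU", "US"],
})
allowed_demo = allowed_content(user_demo, content_demo)
print(allowed_demo["content_id"].tolist())
```

Applying this filter before scoring keeps ineligible items out of the top-20 list, so later stages never spend inference time on them.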


In [12]:
user_ids = users["user_id"].unique()
content_ids = content["content_id"].unique()

candidate_dict = {}  # key: user_id, value: list of top content_id

for uid in user_ids:
    # Grab user row
    user_row = users[users["user_id"] == uid].iloc[0]

    # For each piece of content, compute naive score
    scores = []
    for cid in content_ids:
        c_row = content[content["content_id"] == cid].iloc[0]
        score = naive_topic_score(user_row, c_row)
        scores.append((cid, score))

    # Sort by descending score, pick top 20
    scores.sort(key=lambda x: x[1], reverse=True)
    top_candidates = [cid for (cid, sc) in scores[:20]]
    candidate_dict[uid] = top_candidates

# candidate_dict[u] now holds the list of content IDs for that user
# In a real pipeline, you'd store these somewhere or proceed to Stage 2 next.
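
The nested loop above filters the `content` DataFrame once per (user, content) pair. A vectorized sketch of the same naive topic matching is shown below. `generate_candidates` and `TOPIC_TO_AFFINITY_COL` are hypothetical names, and the demo frames are made up; the scoring rule itself mirrors `naive_topic_score`, including the 0.0 fallback for other topics.

```python
import numpy as np
import pandas as pd

TOPIC_TO_AFFINITY_COL = {
    "politics": "topic_affinity_politics",
    "technology": "topic_affinity_tech",
    "sports": "topic_affinity_sports",
}

def generate_candidates(users, content, top_k=20):
    """Return {user_id: [content_id, ...]} using the naive topic match."""
    content_ids = content["content_id"].to_numpy()
    topics = content["topic"].to_numpy()
    candidates = {}
    for user in users.itertuples(index=False):
        # Topics outside the mapping keep the default score of 0.0.
        scores = np.zeros(len(content), dtype=float)
        for topic, col in TOPIC_TO_AFFINITY_COL.items():
            scores[topics == topic] = getattr(user, col)
        # Stable descending sort, so ties keep their original content order.
        order = np.argsort(-scores, kind="stable")[:top_k]
        candidates[user.user_id] = content_ids[order].tolist()
    return candidates

users_demo = pd.DataFrame({
    "user_id": [1],
    "topic_affinity_politics": [0.1],
    "topic_affinity_tech": [0.9],
    "topic_affinity_sports": [0.5],
})
content_demo = pd.DataFrame({
    "content_id": [101, 102, 103, 104, 105],
    "topic": ["politics", "technology", "sports", "business", "technology"],
})
candidates_demo = generate_candidates(users_demo, content_demo, top_k=3)
print(candidates_demo)
```

This still loops over users, but the per-content work becomes a handful of array operations instead of a DataFrame filter per pair.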


In [14]:
import xgboost as xgb   # or "import lightgbm as lgb"

# 1. Merge interactions with user and content data to get columns
df_merged = interactions.merge(users, on="user_id").merge(content, on="content_id")

# 2. We'll assume 'label' is in interactions. If you haven't computed it, do so first.
#    If 'label' doesn't exist, see earlier code for computing a compound engagement metric threshold.

# 3. We'll pick a few columns to be our features
feature_cols = [
    "topic_affinity_politics",
    "topic_affinity_tech",
    "topic_affinity_sports",
    "quality_score"
]
X_data = df_merged[feature_cols]
y_data = df_merged["label"]  # 0 or 1

# 4. Train/validation split
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=42)

# 5. Train a basic XGBoost model
gbdt = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=3,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric="logloss",
    random_state=42,
    early_stopping_rounds=5  # constructor argument in recent XGBoost releases, no longer passed to fit()
)
gbdt.fit(X_train, y_train, eval_set=[(X_val, y_val)])

# Let's check accuracy quickly
from sklearn.metrics import accuracy_score
y_val_pred = gbdt.predict(X_val)
acc = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", acc)


[0]	validation_0-logloss:0.64370
[1]	validation_0-logloss:0.63996
[2]	validation_0-logloss:0.63499
[3]	validation_0-logloss:0.62238
[4]	validation_0-logloss:0.61711
[5]	validation_0-logloss:0.61348
[6]	validation_0-logloss:0.60472
[7]	validation_0-logloss:0.60463
[8]	validation_0-logloss:0.61131
[9]	validation_0-logloss:0.61919
[10]	validation_0-logloss:0.62538
[11]	validation_0-logloss:0.63238
Validation Accuracy: 0.6363636363636364


Parameters: { "use_label_encoder" } are not used.



In [15]:
def gbdt_score_candidates(uid, candidate_list):
    """
    Takes a user_id and a list of candidate content IDs.
    Returns a list of (content_id, predicted_score).
    """
    # find user row
    user_row = users[users["user_id"] == uid].iloc[0]

    results = []
    for cid in candidate_list:
        c_row = content[content["content_id"] == cid].iloc[0]

        # build same features
        row_feats = {
            "topic_affinity_politics": user_row["topic_affinity_politics"],
            "topic_affinity_tech": user_row["topic_affinity_tech"],
            "topic_affinity_sports": user_row["topic_affinity_sports"],
            "quality_score": c_row["quality_score"]
        }

        # convert to DataFrame
        single_df = pd.DataFrame([row_feats])

        # predict probability
        proba = gbdt.predict_proba(single_df)[0,1]  # class=1 probability
        results.append((cid, proba))

    # Sort by descending proba
    results.sort(key=lambda x: x[1], reverse=True)
    return results

# Example usage for user 1:
top_20 = candidate_dict[1]  # from Stage 1
stage2_ranked = gbdt_score_candidates(uid=1, candidate_list=top_20)
print("GBDT Stage2 Ranking for User 1:", stage2_ranked)


GBDT Stage2 Ranking for User 1: [(np.int64(101), np.float32(0.42559618)), (np.int64(104), np.float32(0.42559618)), (np.int64(106), np.float32(0.42559618)), (np.int64(113), np.float32(0.42559618)), (np.int64(120), np.float32(0.42559618)), (np.int64(123), np.float32(0.42559618)), (np.int64(127), np.float32(0.42559618)), (np.int64(130), np.float32(0.42559618)), (np.int64(132), np.float32(0.42559618)), (np.int64(139), np.float32(0.42559618)), (np.int64(146), np.float32(0.42559618)), (np.int64(148), np.float32(0.42559618)), (np.int64(158), np.float32(0.42559618)), (np.int64(160), np.float32(0.42559618)), (np.int64(163), np.float32(0.42559618)), (np.int64(170), np.float32(0.42559618)), (np.int64(178), np.float32(0.42559618)), (np.int64(183), np.float32(0.42559618)), (np.int64(188), np.float32(0.42559618)), (np.int64(193), np.float32(0.42559618))]
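
Every candidate in the output above receives the same score (0.42559618), so Stage 2 does not actually reorder anything for this user. A likely reason is that three of the four features are user-level and therefore identical across all of one user's candidates, leaving only `quality_score` to separate items. One possible remedy, which is an assumption and not part of the original notebook, is to add a user–content cross feature such as the Stage 1 topic-match score. The helper `add_topic_match` and the column name `topic_match` below are hypothetical.

```python
import numpy as np
import pandas as pd

TOPIC_TO_AFFINITY_COL = {
    "politics": "topic_affinity_politics",
    "technology": "topic_affinity_tech",
    "sports": "topic_affinity_sports",
}

def add_topic_match(df):
    """Return a copy of df with a 'topic_match' column: the user's affinity
    for the content's own topic, or 0.0 for topics outside the mapping."""
    out = df.copy()
    match = np.zeros(len(out), dtype=float)
    for topic, col in TOPIC_TO_AFFINITY_COL.items():
        mask = (out["topic"] == topic).to_numpy()
        match[mask] = out.loc[mask, col].to_numpy(dtype=float)
    out["topic_match"] = match
    return out

# Made-up merged rows for one user and three content items.
merged_demo = pd.DataFrame({
    "topic_affinity_politics": [0.2, 0.2, 0.2],
    "topic_affinity_tech": [0.7, 0.7, 0.7],
    "topic_affinity_sports": [0.1, 0.1, 0.1],
    "topic": ["politics", "technology", "business"],
    "quality_score": [0.5, 0.5, 0.5],
})
with_match = add_topic_match(merged_demo)
print(with_match["topic_match"].tolist())
```

To use it, the feature would need to be added to `df_merged` and `feature_cols` before retraining, and built the same way inside the scoring function.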


In [16]:
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# We'll reuse df_merged from above (the joined user+content+interaction dataset).
X_data_nn = df_merged[feature_cols].values
y_data_nn = df_merged["label"].values

# Train/val split
X_train_nn, X_val_nn, y_train_nn, y_val_nn = train_test_split(X_data_nn, y_data_nn, test_size=0.2, random_state=42)

# Build a small ANN
model = keras.Sequential([
    layers.Input(shape=(len(feature_cols),)),
    layers.Dense(16, activation='relu'),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_nn, y_train_nn, validation_data=(X_val_nn, y_val_nn), epochs=10, batch_size=32)


Epoch 1/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 4s 226ms/step - accuracy: 0.4567 - loss: 0.6939 - val_accuracy: 0.4091 - val_loss: 0.6948
Epoch 2/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 92ms/step - accuracy: 0.4509 - loss: 0.6930 - val_accuracy: 0.3636 - val_loss: 0.6935
Epoch 3/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 54ms/step - accuracy: 0.4743 - loss: 0.6916 - val_accuracy: 0.3636 - val_loss: 0.6924
Epoch 4/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 69ms/step - accuracy: 0.4940 - loss: 0.6911 - val_accuracy: 0.4091 - val_loss: 0.6915
Epoch 5/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 67ms/step - accuracy: 0.5000 - loss: 0.6902 - val_accuracy: 0.3636 - val_loss: 0.6906
Epoch 6/10
3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 36ms/step - accuracy: 0.5314 - loss: 0.6876 - val_accuracy: 0.3636 - val_loss: 0.6896
Epoch 7/10
3/3 ━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x7a968083db10>

In [18]:
def nn_score_candidates(uid, candidate_list):
    results = []
    user_row = users[users["user_id"] == uid].iloc[0]

    for cid in candidate_list:
        c_row = content[content["content_id"] == cid].iloc[0]
        row_feats = [
            user_row["topic_affinity_politics"],
            user_row["topic_affinity_tech"],
            user_row["topic_affinity_sports"],
            c_row["quality_score"],
        ]
        X_input = np.array([row_feats], dtype=float)
        prob = model.predict(X_input)[0,0]  # single numeric
        results.append((cid, prob))

    # sort descending
    results.sort(key=lambda x: x[1], reverse=True)
    return results

# Let's say from Stage 2 we took the top 5:
stage2_top5 = stage2_ranked[:5]  # [(cid, score), (cid, score), ...]
final_stage3 = nn_score_candidates(uid=1, candidate_list=[cid for cid,_ in stage2_top5])

print("Stage 3 Final Ranking for User 1:", final_stage3)


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 70ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
Stage 3 Final Ranking for User 1: [(np.int64(106), np.float32(0.36139703)), (np.int64(113), np.float32(0.3536266)), (np.int64(101), np.float32(0.328912)), (np.int64(104), np.float32(0.32609668)), (np.int64(120), np.float32(0.32516098))]
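
The `np.int64(...)` and `np.float32(...)` wrappers in the printed ranking come from NumPy's scalar repr (NumPy 2 and later print scalars this way). For a more readable final list, the values can be converted to plain Python types before printing. `to_plain` is a hypothetical helper, shown here on the first two entries of the output above.

```python
import numpy as np

def to_plain(ranked, ndigits=4):
    """Convert (np scalar id, np scalar score) pairs to plain (int, float)."""
    return [(int(cid), round(float(score), ndigits)) for cid, score in ranked]

ranked_np = [
    (np.int64(106), np.float32(0.36139703)),
    (np.int64(113), np.float32(0.3536266)),
]
plain_demo = to_plain(ranked_np)
print(plain_demo)  # [(106, 0.3614), (113, 0.3536)]
```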
