# Lab 3: Contextual Bandit-Based News Article Recommendation

**`Course`:** Reinforcement Learning Fundamentals  

**`Student Name`:**  Rishit Anand

**`Roll Number`:**  U20230024

**`GitHub Branch`:** rishit_U20230024  

# Imports and Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

from rlcmab_sampler import sampler


# Load Datasets

In [36]:
# Load datasets
news_df = pd.read_csv("data/news_articles.csv")
train_users = pd.read_csv("data/train_users.csv")
test_users = pd.read_csv("data/test_users.csv")

print(news_df.head())
print(train_users.head())

                                                link  \
0  https://www.huffpost.com/entry/covid-boosters-...   
1  https://www.huffpost.com/entry/american-airlin...   
2  https://www.huffpost.com/entry/funniest-tweets...   
3  https://www.huffpost.com/entry/funniest-parent...   
4  https://www.huffpost.com/entry/amy-cooper-lose...   

                                            headline   category  \
0  Over 4 Million Americans Roll Up Sleeves For O...  U.S. NEWS   
1  American Airlines Flyer Charged, Banned For Li...  U.S. NEWS   
2  23 Of The Funniest Tweets About Cats And Dogs ...     COMEDY   
3  The Funniest Tweets From Parents This Week (Se...  PARENTING   
4  Woman Who Called Cops On Black Bird-Watcher Lo...  U.S. NEWS   

                                   short_description               authors  \
0  Health experts said it is too early to predict...  Carla K. Johnson, AP   
1  He was subdued by passengers and crew when he ...        Mary Papenfuss   
2  "Until you have a dog y

## Data Preprocessing

In this section:
- Handle missing values
- Encode categorical features
- Prepare data for user classification

In [37]:
# drop user id and browser version
drop_cols = ['user_id', 'browser_version']
train_users = train_users.drop(columns=drop_cols)
test_users = test_users.drop(columns=drop_cols)

# strip leading/trailing whitespace from column names
train_users.columns = train_users.columns.str.strip()
test_users.columns = test_users.columns.str.strip()

# fill missing age values with the training-set mean
# (the same value is reused for the test set so no test statistics leak in)
age_mean = train_users['age'].mean()
train_users['age'] = train_users['age'].fillna(age_mean)
test_users['age'] = test_users['age'].fillna(age_mean)

# fill other potential missing values with 0
train_users = train_users.fillna(0)
test_users = test_users.fillna(0)

# encode categorical features
le_region = LabelEncoder()
all_regions = pd.concat([train_users['region_code'], test_users['region_code']], axis=0).astype(str)
le_region.fit(all_regions)

train_users['region_code'] = le_region.transform(train_users['region_code'].astype(str))
test_users['region_code'] = le_region.transform(test_users['region_code'].astype(str))

train_users['subscriber'] = train_users['subscriber'].astype(int)
test_users['subscriber'] = test_users['subscriber'].astype(int)

# encode target 
le_user = LabelEncoder()
train_users['label_encoded'] = le_user.fit_transform(train_users['label'])

print("Feature Columns:", list(train_users.drop(columns=['label', 'label_encoded']).columns))
print(f"Target Labels Mapping: {dict(zip(le_user.classes_, le_user.transform(le_user.classes_)))}")

# split to features and target
X = train_users.drop(columns=['label', 'label_encoded'])
y = train_users['label_encoded']

# split data into train and val
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


Feature Columns: ['age', 'income', 'clicks', 'purchase_amount', 'session_duration', 'content_variety', 'engagement_score', 'num_transactions', 'avg_monthly_spend', 'avg_cart_value', 'browsing_depth', 'revisit_rate', 'scroll_activity', 'time_on_site', 'interaction_count', 'preferred_price_range', 'discount_usage_rate', 'wishlist_size', 'product_views', 'repeat_purchase_gap (days)', 'churn_risk_score', 'loyalty_index', 'screen_brightness', 'battery_percentage', 'cart_abandonment_count', 'background_app_count', 'session_inactivity_duration', 'network_jitter', 'region_code', 'subscriber']
Target Labels Mapping: {'user_1': np.int64(0), 'user_2': np.int64(1), 'user_3': np.int64(2)}


## User Classification

Train a classifier to predict the user category (`user_1`, `user_2`, `user_3` in the data; written `User1`, `User2`, `User3` in the arm mapping),
which serves as the **context** for the contextual bandit.


In [39]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# training
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# prediction and accuracy
y_pred = clf.predict(X_val)
print("Classification Report:")
print(classification_report(y_val, y_pred, target_names=le_user.classes_))
accuracy = accuracy_score(y_val, y_pred)
print(f"Validation Accuracy: {accuracy:.4f}")

Classification Report:
              precision    recall  f1-score   support

      user_1       0.80      0.74      0.77       147
      user_2       0.87      0.89      0.88       141
      user_3       0.79      0.83      0.81       112

    accuracy                           0.82       400
   macro avg       0.82      0.82      0.82       400
weighted avg       0.82      0.82      0.82       400

Validation Accuracy: 0.8200


# Contextual Bandit

## Reward Sampler Initialization

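The sampler is initialized with an integer `i` taken from the student's roll number; here `i = 24`, matching the trailing digits of `U20230024`.
The reward for arm `j` is obtained with `reward_sampler.sample(j)`.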


In [40]:
reward_sampler = sampler(24)

## Arm Mapping

| Arm Index (j) | News Category | User Context |
|--------------|---------------|--------------|
| 0–3          | Entertainment, Education, Tech, Crime | User1 |
| 4–7          | Entertainment, Education, Tech, Crime | User2 |
| 8–11         | Entertainment, Education, Tech, Crime | User3 |
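
The table implies that the global arm index is `j = 4 * context + category`. The sketch below makes that explicit. It assumes contexts are indexed 0 to 2 in the order `User1`, `User2`, `User3` (consistent with the label encoding printed earlier) and that categories follow the table's column order; the names `CATEGORIES`, `arm_index`, and `arm_to_pair` are illustrative and not part of the lab code.

```python
# Category order follows the Arm Mapping table; all names here are illustrative.
CATEGORIES = ["Entertainment", "Education", "Tech", "Crime"]
N_CONTEXTS = 3  # User1, User2, User3


def arm_index(context: int, category: int) -> int:
    """Map (context, category) to the global arm index j in 0..11."""
    if not 0 <= context < N_CONTEXTS:
        raise ValueError(f"context must be in 0..{N_CONTEXTS - 1}")
    if not 0 <= category < len(CATEGORIES):
        raise ValueError(f"category must be in 0..{len(CATEGORIES) - 1}")
    return context * len(CATEGORIES) + category


def arm_to_pair(j: int):
    """Inverse mapping: global arm index -> (context, category name)."""
    if not 0 <= j < N_CONTEXTS * len(CATEGORIES):
        raise ValueError("arm index out of range")
    context, category = divmod(j, len(CATEGORIES))
    return context, CATEGORIES[category]
```

With this helper, a reward for a chosen category would be requested as `reward_sampler.sample(arm_index(context, category))`.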

## Epsilon-Greedy Strategy

This section implements the epsilon-greedy contextual bandit algorithm.
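
As a rough illustration of the idea, the sketch below keeps one table of running-mean reward estimates per context and, with probability $\epsilon$, picks a random category instead of the current best one. The class name `EpsilonGreedyBandit`, its parameters, and the default `epsilon=0.1` are assumptions made for this example, not the lab's reference implementation.

```python
import numpy as np


class EpsilonGreedyBandit:
    """Epsilon-greedy with a separate value table per context (illustrative)."""

    def __init__(self, n_contexts=3, n_actions=4, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)
        self.q = np.zeros((n_contexts, n_actions))       # running mean reward
        self.counts = np.zeros((n_contexts, n_actions))  # pulls per (context, action)

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_actions))
        return int(np.argmax(self.q[context]))

    def update(self, context, action, reward):
        # Incremental mean: Q <- Q + (r - Q) / n
        self.counts[context, action] += 1
        n = self.counts[context, action]
        self.q[context, action] += (reward - self.q[context, action]) / n
```

In the notebook, one step would look like `a = bandit.select(c)`, then `r = reward_sampler.sample(4 * c + a)`, then `bandit.update(c, a, r)`, where `c` is the predicted user category.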


## Upper Confidence Bound (UCB)

This section implements the UCB strategy for contextual bandits.
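
One common form of the rule (UCB1-style) picks, within each context, the action maximizing $Q(s,a) + c\sqrt{\ln t / N(s,a)}$, where $t$ is the number of pulls made so far in that context and $N(s,a)$ the number of times action $a$ was chosen there. The sketch below follows that form; the class name `UCBBandit` and the exploration constant `c=2.0` are assumptions for illustration, and the lab may intend a different constant.

```python
import numpy as np


class UCBBandit:
    """UCB1-style selection with separate statistics per context (illustrative)."""

    def __init__(self, n_contexts=3, n_actions=4, c=2.0):
        self.c = c
        self.q = np.zeros((n_contexts, n_actions))       # running mean reward
        self.counts = np.zeros((n_contexts, n_actions))  # pulls per (context, action)

    def select(self, context):
        counts = self.counts[context]
        # Pull every action once before applying the confidence bonus.
        untried = np.flatnonzero(counts == 0)
        if untried.size > 0:
            return int(untried[0])
        t = counts.sum()  # total pulls in this context
        bonus = self.c * np.sqrt(np.log(t) / counts)
        return int(np.argmax(self.q[context] + bonus))

    def update(self, context, action, reward):
        # Incremental mean: Q <- Q + (r - Q) / n
        self.counts[context, action] += 1
        n = self.counts[context, action]
        self.q[context, action] += (reward - self.q[context, action]) / n
```

Larger values of `c` widen the bonus and favor rarely tried categories; smaller values make the policy greedier.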

## SoftMax Strategy

This section implements the SoftMax strategy with temperature $\tau = 1$.
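
In the usual Boltzmann formulation, action $a$ is drawn in context $s$ with probability $P(a \mid s) = e^{Q(s,a)/\tau} / \sum_b e^{Q(s,b)/\tau}$. The sketch below is one way to write this; `softmax_probs` and `SoftmaxBandit` are hypothetical names, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the probabilities.

```python
import numpy as np


def softmax_probs(q_values, tau=1.0):
    """Boltzmann probabilities over actions for one context."""
    q = np.asarray(q_values, dtype=float)
    z = (q - q.max()) / tau  # shift by the max to avoid overflow in exp()
    expz = np.exp(z)
    return expz / expz.sum()


class SoftmaxBandit:
    """SoftMax action selection with per-context value estimates (illustrative)."""

    def __init__(self, n_contexts=3, n_actions=4, tau=1.0, seed=0):
        self.tau = tau
        self.n_actions = n_actions
        self.rng = np.random.default_rng(seed)
        self.q = np.zeros((n_contexts, n_actions))
        self.counts = np.zeros((n_contexts, n_actions))

    def select(self, context):
        # Sample an action in proportion to its Boltzmann probability.
        p = softmax_probs(self.q[context], self.tau)
        return int(self.rng.choice(self.n_actions, p=p))

    def update(self, context, action, reward):
        # Incremental mean: Q <- Q + (r - Q) / n
        self.counts[context, action] += 1
        n = self.counts[context, action]
        self.q[context, action] += (reward - self.q[context, action]) / n
```

With all estimates equal, every category is chosen with probability 0.25; as the estimates separate, the policy shifts probability toward the better categories without ever ruling one out.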


## Reinforcement Learning Simulation

We simulate the bandit algorithms for $T = 10{,}000$ steps and record rewards.

Note: adjust the value of $T$ as required.
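
A simulation loop along these lines could drive any of the strategies. This is a sketch under stated assumptions: contexts are drawn uniformly at random (in the lab they would come from the classifier's predictions), the policy is passed in as two callables `select(context)` and `update(context, action, reward)`, and `StandInSampler` is a hypothetical replacement for `rlcmab_sampler.sampler` with made-up reward means so that the example runs on its own.

```python
import numpy as np


class StandInSampler:
    """Hypothetical stand-in for rlcmab_sampler.sampler; the means are made up."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.means = np.linspace(0.0, 1.1, 12)  # one illustrative mean per arm

    def sample(self, j):
        return float(self.rng.normal(self.means[j], 1.0))


def simulate(select, update, reward_sampler, n_steps=10_000,
             n_contexts=3, n_actions=4, seed=0):
    """Run one policy for n_steps and return its running-average reward."""
    rng = np.random.default_rng(seed)
    rewards = np.empty(n_steps)
    for t in range(n_steps):
        context = int(rng.integers(n_contexts))  # assumed: uniform context arrivals
        action = select(context)
        # Global arm index j = n_actions * context + action, as in the Arm Mapping.
        reward = reward_sampler.sample(n_actions * context + action)
        update(context, action, reward)
        rewards[t] = reward
    return np.cumsum(rewards) / np.arange(1, n_steps + 1)


# Usage with a uniformly random policy as a baseline.
policy_rng = np.random.default_rng(1)
baseline = simulate(
    select=lambda c: int(policy_rng.integers(4)),
    update=lambda c, a, r: None,
    reward_sampler=StandInSampler(),
    n_steps=1_000,
)
```

In the notebook, `reward_sampler` would be the object created with `sampler(24)`, and `select` and `update` would be the bound methods of a bandit object.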


## Results and Analysis

This section presents:
- Average Reward vs Time
- Hyperparameter comparisons
- Observations and discussion
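
For the Average Reward vs Time plots, a small helper such as the one below may be convenient. It assumes each strategy or hyperparameter setting has already produced a 1-D array of running-average rewards; the function name `plot_average_reward` and the `curves` dictionary layout are illustrative choices, and no results are implied by this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_average_reward(curves, ax=None):
    """Plot running-average reward curves on one set of axes.

    curves: dict mapping a label (for example "epsilon=0.1") to a 1-D
    sequence of running-average rewards, one entry per time step.
    """
    if ax is None:
        _, ax = plt.subplots(figsize=(8, 4))
    for label, avg in curves.items():
        avg = np.asarray(avg, dtype=float)
        ax.plot(np.arange(1, avg.size + 1), avg, label=label)
    ax.set_xlabel("Time step $t$")
    ax.set_ylabel("Average reward")
    ax.set_title("Average Reward vs Time")
    ax.legend()
    return ax
```

Passing one entry per hyperparameter value (for example several values of $\epsilon$) puts the comparison curves on a single figure; call `plt.show()` afterwards when running outside a notebook.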


## Final Observations

- Comparison of Epsilon-Greedy, UCB, and SoftMax
- Effect of hyperparameters
- Strengths and limitations of each approach
