# Contextual Bandit: Offline Thompson Sampling

In the previous notebook, we established random and greedy baselines using replay-based offline evaluation.
These baselines provide reference points for decision quality under a fully offline setting.

In this notebook, we introduce an **offline contextual bandit approach using Thompson Sampling**.
Rather than simulating online exploration, we learn from historical data in a batch manner and evaluate decision quality under uncertainty using a conservative replay-based framework.

The objective is to assess whether an uncertainty-aware, learning-based policy can outperform fixed baselines when restricted to observable historical outcomes.



In [20]:
import os
import pandas as pd
import numpy as np
import random
from typing import Callable, Dict
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline


In [21]:
# Path configuration
DATA_DIR = "../data"
PROCESSED_PATH = os.path.join(DATA_DIR, "bank_processed_for_bandit.csv")

# Load processed dataset
df = pd.read_csv(PROCESSED_PATH)

# Basic checks
print("Dataset shape:", df.shape)
df.head()

Dataset shape: (45211, 18)


Unnamed: 0,row_id,age_group,job,marital,education,default,housing,loan,contact,month,balance_group,campaign,day_group,campaign_group,pdays_group,previous_group,poutcome,reward
0,0,pre_retirement,management,married,tertiary,no,yes,no,unknown,may,high_balance,1,day_1_7,1_10_contacts,never_contacted,0_10_previous,unknown,0
1,1,mid_career,technician,single,secondary,no,yes,no,unknown,may,low_balance,1,day_1_7,1_10_contacts,never_contacted,0_10_previous,unknown,0
2,2,young_adult,entrepreneur,married,secondary,no,yes,yes,unknown,may,low_balance,1,day_1_7,1_10_contacts,never_contacted,0_10_previous,unknown,0
3,3,mid_career,blue-collar,married,unknown,no,yes,no,unknown,may,high_balance,1,day_1_7,1_10_contacts,never_contacted,0_10_previous,unknown,0
4,4,young_adult,unknown,single,unknown,no,no,no,unknown,may,low_balance,1,day_1_7,1_10_contacts,never_contacted,0_10_previous,unknown,0


In [22]:
# Action space definition
ACTION_LOW_INTENSITY = 1
ACTION_HIGH_INTENSITY = 2

ACTION_SPACE = [
    ACTION_LOW_INTENSITY,
    ACTION_HIGH_INTENSITY
]

ACTION_NAMES = {
    ACTION_LOW_INTENSITY: "low_intensity_contact",
    ACTION_HIGH_INTENSITY: "high_intensity_contact"
}

print("Defined action space:")
for a in ACTION_SPACE:
    print(f"{a}: {ACTION_NAMES[a]}")

Defined action space:
1: low_intensity_contact
2: high_intensity_contact


In [23]:
def infer_historical_action_from_campaign(campaign: int) -> int:
    """
    Map historical campaign counts to contact intensity actions.
    
    - 1 contact   -> low intensity
    - 2+ contacts -> high intensity
    
    A no-contact action is not modeled because it is rarely observed
    in the historical data.
    """
    if campaign == 1:
        return ACTION_LOW_INTENSITY
    else:
        return ACTION_HIGH_INTENSITY

df['historical_action'] = df['campaign'].apply(infer_historical_action_from_campaign)
df['historical_action'].value_counts(normalize=True)

historical_action
2    0.611953
1    0.388047
Name: proportion, dtype: float64

In [24]:
def replay_evaluate(
    df,
    policy_fn: Callable,
    reward_col: str = "reward",
    historical_action_col: str = "historical_action"
) -> Dict[str, float]:
    """
    Replay-based offline evaluation.

    For each row:
    - policy selects an action based on the state
    - reward is counted only if policy_action == historical_action
    """

    total_reward = 0
    matched_steps = 0
    n_rows = len(df)

    for _, row in df.iterrows():
        policy_action = policy_fn(row)
        historical_action = row[historical_action_col]

        if policy_action == historical_action:
            total_reward += row[reward_col]
            matched_steps += 1

    match_rate = matched_steps / n_rows if n_rows > 0 else 0.0
    avg_reward_on_matched = (
        total_reward / matched_steps if matched_steps > 0 else 0.0
    )

    return {
        "n_rows": n_rows,
        "matched_steps": matched_steps,
        "match_rate": match_rate,
        "total_reward_on_matched": total_reward,
        "avg_reward_on_matched": avg_reward_on_matched
    }
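
To make the matching rule concrete, here is a minimal sketch on a hypothetical four-row log. The toy data and the `replay_vectorized` helper are illustrative assumptions, not part of the notebook's pipeline. It computes the same metrics as `replay_evaluate` with NumPy boolean masks instead of a row loop, which works whenever the policy's actions can be computed up front.

```python
import numpy as np

def replay_vectorized(policy_actions, historical_actions, rewards):
    """Replay evaluation with boolean masks (hypothetical helper)."""
    policy_actions = np.asarray(policy_actions)
    historical_actions = np.asarray(historical_actions)
    rewards = np.asarray(rewards, dtype=float)

    # A row contributes only when the policy agrees with the logged action.
    matched = policy_actions == historical_actions
    matched_steps = int(matched.sum())
    total_reward = float(rewards[matched].sum())
    n_rows = len(rewards)

    return {
        "n_rows": n_rows,
        "matched_steps": matched_steps,
        "match_rate": matched_steps / n_rows if n_rows > 0 else 0.0,
        "total_reward_on_matched": total_reward,
        "avg_reward_on_matched": (
            total_reward / matched_steps if matched_steps > 0 else 0.0
        ),
    }

# Toy log: the policy agrees with history on rows 0 and 2 only.
toy_metrics = replay_vectorized(
    policy_actions=[1, 1, 2, 2],
    historical_actions=[1, 2, 2, 1],
    rewards=[1, 1, 0, 1],
)
print(toy_metrics)
```

Rows 1 and 3 are discarded even though they carry a reward of 1, which is why the replay estimate is conservative: it only credits outcomes that were actually observed under the chosen action.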


In [25]:
rng = random.Random(42)

def random_policy(_row):
    return rng.choice(ACTION_SPACE)


metrics_random = replay_evaluate(df, random_policy)
metrics_random

{'n_rows': 45211,
 'matched_steps': 22489,
 'match_rate': 0.49742319347061553,
 'total_reward_on_matched': 2624,
 'avg_reward_on_matched': 0.11667926541864912}

## RL Model

### Policy: Offline Linear Thompson Sampling

We use **Offline Linear Thompson Sampling** to model the decision policy.

- For each action, the expected reward is modeled as a linear function of contextual features.
- A Bayesian posterior over the linear coefficients is estimated using historical data.
- At decision time, parameters are sampled from the posterior to account for uncertainty, and actions are selected accordingly.

Policy evaluation uses replay-based action matching: a reward is counted only when the action selected by the policy matches the action recorded in the historical data.
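
One way to write out the model described in the bullets above is given below. The notation is introduced here for illustration: $x_i$, $a_i$, and $r_i$ denote the context vector, logged action, and reward of row $i$, while $\lambda$ and $v$ correspond to the `lam` and `v` arguments used in the code cells of this notebook.

$$
\begin{aligned}
A_a &= \lambda I + \sum_{i:\, a_i = a} x_i x_i^\top, \qquad
b_a = \sum_{i:\, a_i = a} r_i\, x_i, \\
\mu_a &= A_a^{-1} b_a, \qquad
\tilde{\theta}_a \sim \mathcal{N}\!\left(\mu_a,\; v^2 A_a^{-1}\right), \\
\hat{a}(x) &= \arg\max_{a}\; x^\top \tilde{\theta}_a .
\end{aligned}
$$

Each action keeps its own ridge-regression statistics $(A_a, b_a)$, fitted only on the rows where that action was logged. A fresh $\tilde{\theta}_a$ is drawn for every decision, so actions with fewer supporting rows produce more variable scores.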


In [26]:
STATE_COLS = [
    "age_group",
    "job",
    "education",
    "balance_group",
    "pdays_group",
    "previous_group",
    "poutcome",
    "month",
]

In [27]:
STATE_COLS = [c for c in STATE_COLS if c in df.columns]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=False), STATE_COLS)
    ],
    remainder="drop"
)

X = preprocess.fit_transform(df[STATE_COLS])
r = df["reward"].to_numpy().astype(float)
a_hist = df["historical_action"].to_numpy().astype(int)

X.shape


(45211, 52)

In [28]:
def fit_offline_linear_ts(X, r, a_hist, actions, lam=1.0):
    d = X.shape[1]
    A = {a: lam * np.eye(d) for a in actions}
    b = {a: np.zeros(d) for a in actions}

    for i in range(len(X)):
        a = a_hist[i]
        x = X[i]
        reward = r[i]

        A[a] += np.outer(x, x)
        b[a] += x * reward

    return A, b
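
The row loop above accumulates per-action sums of outer products. As a sketch of an equivalent formulation, the same statistics can be obtained with one matrix product per action. The function name `fit_offline_linear_ts_vectorized` and the synthetic arrays are assumptions made for illustration.

```python
import numpy as np

def fit_offline_linear_ts_vectorized(X, r, a_hist, actions, lam=1.0):
    """Per-action ridge statistics via matrix products (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    r = np.asarray(r, dtype=float)
    a_hist = np.asarray(a_hist)
    d = X.shape[1]

    A, b = {}, {}
    for a in actions:
        mask = a_hist == a          # rows where action `a` was logged
        X_a = X[mask]
        # Sum of outer products x x^T equals X_a^T X_a.
        A[a] = lam * np.eye(d) + X_a.T @ X_a
        # Sum of reward-weighted contexts equals X_a^T r_a.
        b[a] = X_a.T @ r[mask]
    return A, b

# Small synthetic example with binary features and rewards.
rng_toy = np.random.default_rng(0)
X_toy = rng_toy.integers(0, 2, size=(50, 4)).astype(float)
r_toy = rng_toy.integers(0, 2, size=50).astype(float)
a_toy = rng_toy.choice([1, 2], size=50)

A_toy, b_toy = fit_offline_linear_ts_vectorized(X_toy, r_toy, a_toy, [1, 2])
print(A_toy[1].shape, b_toy[1].shape)
```

On a dataset of this size the loop is fast enough, so the vectorized form is mainly useful as a readable statement of what is being computed.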


In [29]:
class OfflineLinearThompsonSampling:
    def __init__(self, A, b, v=1.0, seed=42):
        self.A = A
        self.b = b
        self.v = float(v)
        self.rng = np.random.default_rng(seed)

        self.L = {}
        self.mu = {}

        for a in A:
            # A = L L^T
            L = np.linalg.cholesky(A[a])
            self.L[a] = L

            # Solve A mu = b via Cholesky: L y = b, then L^T mu = y
            y = np.linalg.solve(L, b[a])
            mu = np.linalg.solve(L.T, y)
            self.mu[a] = mu

    def choose_action(self, x):
        scores = {}
        d = x.shape[0]

        for a in self.A:
            z = self.rng.standard_normal(d)
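            # With A = L L^T, y = L^{-1} z has covariance (L^T L)^{-1}.
            # An exact N(0, A^{-1}) draw would solve against self.L[a].T instead.
            y = np.linalg.solve(self.L[a], z)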
            theta = self.mu[a] + self.v * y
            scores[a] = float(x @ theta)

        return max(scores, key=scores.get)
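
Sampling from $\mathcal{N}(0, A^{-1})$ with a Cholesky factor $A = L L^\top$ depends on which triangular system is solved. The sketch below, on a randomly generated positive definite matrix (an assumption for illustration, not the notebook's fitted `A`), compares the covariance implied by solving against `L` with the one implied by solving against `L.T`.

```python
import numpy as np

rng_check = np.random.default_rng(0)
M = rng_check.standard_normal((5, 5))
A_spd = M @ M.T + 5.0 * np.eye(5)   # symmetric positive definite test matrix
L_chol = np.linalg.cholesky(A_spd)  # lower triangular, A_spd = L_chol @ L_chol.T

# If z ~ N(0, I) and y = W z, then Cov(y) = W @ W.T.
W_lower = np.linalg.inv(L_chol)     # same map as np.linalg.solve(L_chol, z)
W_upper = np.linalg.inv(L_chol.T)   # same map as np.linalg.solve(L_chol.T, z)

cov_lower = W_lower @ W_lower.T     # equals inv(L^T L)
cov_upper = W_upper @ W_upper.T     # equals inv(L L^T) = inv(A)
A_inv = np.linalg.inv(A_spd)

print("solve(L.T, z) matches A^{-1}:", np.allclose(cov_upper, A_inv))
print("solve(L, z)   matches A^{-1}:", np.allclose(cov_lower, A_inv))
```

Both draws are zero-mean Gaussians, but only the solve against `L.T` reproduces the posterior covariance $A^{-1}$ in general. The two coincide when `L` is diagonal.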



In [30]:
def offline_ts_replay_evaluate(X, r, a_hist, policy):
    matched = 0
    total_reward = 0.0
    n = len(X)

    for i in range(n):
        a_hat = policy.choose_action(X[i])
        if a_hat == a_hist[i]:
            matched += 1
            total_reward += r[i]

    return {
        "n_rows": n,
        "matched_steps": matched,
        # Guard against an empty input, mirroring replay_evaluate.
        "match_rate": matched / n if n > 0 else 0.0,
        "total_reward_on_matched": total_reward,
        "avg_reward_on_matched": total_reward / matched if matched > 0 else 0.0
    }


In [35]:
ACTION_SPACE = [ACTION_LOW_INTENSITY, ACTION_HIGH_INTENSITY]

A, b = fit_offline_linear_ts(X, r, a_hist, ACTION_SPACE, lam=1.0)
offline_ts = OfflineLinearThompsonSampling(A, b, v=1.0, seed=42)

metrics_offline_ts = offline_ts_replay_evaluate(X, r, a_hist, offline_ts)
metrics_offline_ts


{'n_rows': 45211,
 'matched_steps': 22352,
 'match_rate': 0.4943929574661034,
 'total_reward_on_matched': 2645.0,
 'avg_reward_on_matched': 0.11833392984967789}
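
To put the two matched-reward rates side by side, the sketch below computes their difference and a rough standard error from the counts printed above. It assumes a binomial model and treats the two matched samples as independent, which is only approximate here because both are drawn from the same logged rows.

```python
import math

# Counts copied from the replay metrics reported in this notebook.
random_reward, random_matched = 2624, 22489
ts_reward, ts_matched = 2645, 22352

p_random = random_reward / random_matched
p_ts = ts_reward / ts_matched

def binomial_se(p, n):
    """Standard error of a proportion under a binomial model."""
    return math.sqrt(p * (1.0 - p) / n)

diff = p_ts - p_random
# Independence between the two samples is an assumption, not a given.
se_diff = math.sqrt(
    binomial_se(p_random, random_matched) ** 2
    + binomial_se(p_ts, ts_matched) ** 2
)
z = diff / se_diff

print(f"diff={diff:.5f}, se_diff={se_diff:.5f}, z={z:.2f}")
```

Under these assumptions the gap between the two policies is smaller than one standard error of the difference, so it should be read as indicative rather than conclusive.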

## Summary

In this notebook, we evaluated an offline Thompson Sampling policy with a replay-based offline framework and compared it against the random baseline; the greedy baseline was established in the previous notebook.
On matched observations, Thompson Sampling reaches an average reward of about 0.118 versus about 0.117 for random selection. This improvement is modest, and overall gains are constrained by the structure of the historical data.

Because actions in the dataset were assigned non-randomly, outcome differences reflect both policy behavior and underlying population differences. This historical selection bias limits what offline reinforcement learning can reliably conclude from this data and helps explain why increased model complexity does not necessarily yield better performance.
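
The effect of non-random action assignment on replay estimates can be seen in a small constructed example. Everything below is hypothetical: two customer segments whose conversion rate does not depend on the action at all, and a logging rule that mostly assigns action 1 to the low-conversion segment and action 2 to the high-conversion segment.

```python
import numpy as np

def make_segment(n_a1, rew_a1, n_a2, rew_a2):
    """Build (actions, rewards) with exact counts for one hypothetical segment."""
    seg_actions = np.array([1] * n_a1 + [2] * n_a2)
    seg_rewards = np.array(
        [1] * rew_a1 + [0] * (n_a1 - rew_a1)
        + [1] * rew_a2 + [0] * (n_a2 - rew_a2)
    )
    return seg_actions, seg_rewards

# Low segment: 10% conversion under either action, mostly logged with action 1.
a_low, r_low = make_segment(n_a1=90, rew_a1=9, n_a2=10, rew_a2=1)
# High segment: 50% conversion under either action, mostly logged with action 2.
a_high, r_high = make_segment(n_a1=10, rew_a1=5, n_a2=90, rew_a2=45)

logged_actions = np.concatenate([a_low, a_high])
logged_rewards = np.concatenate([r_low, r_high])

def replay_value(constant_action):
    """Average reward on rows where a constant policy matches the log."""
    matched = logged_actions == constant_action
    return logged_rewards[matched].mean()

# By construction, both constant policies have the same true value.
true_value = logged_rewards.mean()

print("replay, always action 1:", replay_value(1))
print("replay, always action 2:", replay_value(2))
print("true value of either policy:", true_value)
```

Replay scores the two constant policies very differently even though the action has no effect in this toy setup. The gap comes entirely from which segment each action was historically assigned to.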

These results underscore the importance of aligning reinforcement learning methods with the data generation process and motivate future work involving online experimentation, simulation-based evaluation, or causal inference techniques.
