# Starbucks Promotional Response Prediction Using Machine Learning

This notebook implements the full end-to-end pipeline described in the Capstone project proposal:

1. Dynamically read AWS SageMaker configuration (execution role, default S3 bucket)  
2. Upload raw Starbucks JSON data from the project folder to S3  
3. Load raw data from S3 into a SageMaker notebook  
4. Clean and engineer features for:
   - Customer demographics (`profile.json`)
   - Offer metadata (`portfolio.json`)
   - Event logs (`transcript.json`)
5. Construct offer instances and label *true* offer responses  
6. Perform exploratory data analysis (EDA) to understand response patterns  
7. Split the labeled dataset into train, validation, and test sets  
8. Save processed datasets to S3  
9. Train baseline and advanced ML models to predict offer responsiveness  
10. Evaluate models using ROC-AUC, precision, recall, and F1  
11. Save the best model artifact to S3  
12. Run example inferences and interpret the outcome for marketing decisions  

The dataset is simulated: it is designed to reflect realistic behavioral patterns and does not describe real individuals.


## 1. Setup: SageMaker Session, Execution Role, and S3 Bucket

We begin by initializing the SageMaker session and retrieving:

- The **execution role** used by this notebook  
- The **default SageMaker S3 bucket**, which we will use to store:
  - Raw data
  - Processed labeled data
  - Model artifacts

Using these programmatic values avoids hard-coding AWS configuration.

In [1]:
import boto3
import sagemaker
import pandas as pd
import numpy as np
import io
import json
import os

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score, confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", palette="deep")

# SageMaker session and role
session = sagemaker.Session()
role = sagemaker.get_execution_role()
print("Execution role:", role)

# Default S3 bucket
bucket = session.default_bucket()
print("Default bucket:", bucket)

# S3 prefixes for this project
raw_prefix = "starbucks/data"
model_prefix = "starbucks/model"   # model artifacts

s3 = boto3.client("s3")


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/sagemaker-user/.config/sagemaker/config.yaml
Execution role: arn:aws:iam::135432667076:role/service-role/AmazonSageMaker-ExecutionRole-20251117T145438
Default bucket: sagemaker-us-east-1-135432667076


## 2. Upload Raw Data to S3

The GitHub repository contains the raw files under `data/`:

- `portfolio.json`
- `profile.json`
- `transcript.json`

To make these files available to SageMaker jobs and future pipelines, we upload them to the default S3 bucket under the prefix:

`s3://<default-bucket>/starbucks/data/`


In [2]:
local_data_dir = "data"

for root, dirs, files in os.walk(local_data_dir):
    for file in files:
        if not file.endswith(".json"):
            continue
        local_file_path = os.path.join(root, file)
        s3_key = f"{raw_prefix}/{file}"
        s3.upload_file(local_file_path, bucket, s3_key)
        print(f"Uploaded {local_file_path} to s3://{bucket}/{s3_key}")


Uploaded data/portfolio.json to s3://sagemaker-us-east-1-135432667076/starbucks/data/portfolio.json
Uploaded data/profile.json to s3://sagemaker-us-east-1-135432667076/starbucks/data/profile.json
Uploaded data/transcript.json to s3://sagemaker-us-east-1-135432667076/starbucks/data/transcript.json
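
The upload loop keys every object by its bare file name, so two files with the same name in different subfolders of `data/` would overwrite each other in S3. The sketch below shows one way to keep the relative path in the key. It is only an illustration: `build_s3_keys` is a hypothetical helper that is not part of this notebook, it computes keys without calling S3, and the temporary folder stands in for the project's `data/` directory.

```python
from pathlib import Path
import tempfile


def build_s3_keys(local_dir: str, prefix: str, suffix: str = ".json") -> dict:
    """Map each matching local file to an S3 key that keeps its relative path."""
    base = Path(local_dir)
    keys = {}
    for path in sorted(base.rglob(f"*{suffix}")):
        if path.is_file():
            # as_posix() keeps forward slashes in the key on every platform
            relative = path.relative_to(base).as_posix()
            keys[str(path)] = f"{prefix}/{relative}"
    return keys


# Toy directory tree (made up) with a name clash across subfolders.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "archive").mkdir()
    (Path(tmp) / "profile.json").write_text("{}")
    (Path(tmp) / "archive" / "profile.json").write_text("{}")
    (Path(tmp) / "notes.txt").write_text("ignored")
    demo_keys = sorted(build_s3_keys(tmp, "starbucks/data").values())

print(demo_keys)
```

Each `(local path, key)` pair could then be passed to `s3.upload_file(local_path, bucket, key)` in the same way as the loop in this section.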


## 3. Load Raw Data from S3

We define a helper function to load JSON files directly from S3 into pandas DataFrames.
We then load:

- `portfolio.json` — promotional offer metadata  
- `profile.json`   — customer demographics  
- `transcript.json` — event logs  


In [4]:
def read_json_from_s3(key: str) -> pd.DataFrame:
    """Read a JSON file from S3 (under raw_prefix) into a DataFrame."""
    obj = s3.get_object(Bucket=bucket, Key=f"{raw_prefix}/{key}")
    return pd.read_json(io.BytesIO(obj["Body"].read()), orient="records", lines=True)

portfolio = read_json_from_s3("portfolio.json")
profile   = read_json_from_s3("profile.json")
transcript = read_json_from_s3("transcript.json")
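
The helper passes `lines=True`, which tells pandas to parse the body as JSON Lines: one JSON object per line rather than a single JSON array. The following sketch illustrates that behavior on in-memory bytes that stand in for the S3 object body. The two records are made up for the example and are not taken from the dataset.

```python
import io

import pandas as pd

# Two made-up records in JSON Lines form (one JSON object per line).
raw_bytes = (
    b'{"person": "p1", "event": "offer received", '
    b'"value": {"offer id": "o1"}, "time": 0}\n'
    b'{"person": "p1", "event": "transaction", '
    b'"value": {"amount": 4.5}, "time": 6}\n'
)

demo = pd.read_json(io.BytesIO(raw_bytes), orient="records", lines=True)

print(demo.dtypes)
# Nested objects are kept as Python dicts in an object column.
print(demo.loc[1, "value"])
```

Because the nested `value` objects survive as dictionaries, they still need to be flattened before they can be used as columns.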


In [5]:
portfolio.head()


Unnamed: 0,reward,channels,difficulty,duration,offer_type,id
0,10,"[email, mobile, social]",10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd
1,10,"[web, email, mobile, social]",10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0
2,0,"[web, email, mobile]",0,4,informational,3f207df678b143eea3cee63160fa8bed
3,5,"[web, email, mobile]",5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9
4,5,"[web, email]",20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7


In [6]:
profile.head()

Unnamed: 0,gender,age,id,became_member_on,income
0,,118,68be06ca386d4c31939f3a4f0e3dd783,20170212,
1,F,55,0610b486422d4921ae7d2bf64640c50b,20170715,112000.0
2,,118,38fe809add3b4fcf9315a9694bb96ff5,20180712,
3,F,75,78afa995795e4d85b5d9ceeca43f5fef,20170509,100000.0
4,,118,a03223e636434f42ac4c3df47e8bac43,20170804,


In [7]:
transcript.head()

Unnamed: 0,person,event,value,time
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,{'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'},0
1,a03223e636434f42ac4c3df47e8bac43,offer received,{'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'},0
2,e2127556f4f64592b11af22de27a7932,offer received,{'offer id': '2906b810c7d4411798c6938adc9daaa5'},0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,{'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'},0
4,68617ca6246f4fbc85e91a2a49552598,offer received,{'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'},0


In [8]:
portfolio.info()
profile.info()
transcript.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 612.0+ bytes
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
<class 'pandas.core.fra

## 4. Clean and Enrich Customer Profile Data

The customer `profile` table contains age, gender, income, and account creation date.  
We clean and enrich it by:

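- Replacing the placeholder age of 118 with a missing value (NaN); in the preview above, these rows also lack gender and income  
- Converting `became_member_on` from an integer in `YYYYMMDD` form to a proper datetime  
- Creating a `member_days` feature representing customer tenure, measured in days relative to the most recent join date in the dataset  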
- Renaming `id` to `person` to match the transcript data  


In [10]:
profile_df = profile.copy()

# Replace invalid age
profile_df["age"] = profile_df["age"].replace(118, np.nan)

# Convert membership date
profile_df["became_member_on"] = pd.to_datetime(
    profile_df["became_member_on"], format="%Y%m%d"
)

max_date = profile_df["became_member_on"].max()
profile_df["member_days"] = (max_date - profile_df["became_member_on"]).dt.days

# Standardize ID name
profile_df = profile_df.rename(columns={"id": "person"})

profile_df.head()


Unnamed: 0,gender,age,person,became_member_on,income,member_days
0,,,68be06ca386d4c31939f3a4f0e3dd783,2017-02-12,,529
1,F,55.0,0610b486422d4921ae7d2bf64640c50b,2017-07-15,112000.0,376
2,,,38fe809add3b4fcf9315a9694bb96ff5,2018-07-12,,14
3,F,75.0,78afa995795e4d85b5d9ceeca43f5fef,2017-05-09,100000.0,443
4,,,a03223e636434f42ac4c3df47e8bac43,2017-08-04,,356
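
The replacement of age 118 rests on the assumption that 118 marks an unknown age rather than a real one. The sketch below shows how that assumption could be checked, and how tenure is derived, on a toy frame that mirrors the first four rows of `profile.head()`. `sentinel_matches_missing` is a hypothetical helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the first rows of profile.head() (IDs omitted).
toy_profile = pd.DataFrame({
    "gender": [None, "F", None, "F"],
    "age": [118, 55, 118, 75],
    "became_member_on": [20170212, 20170715, 20180712, 20170509],
    "income": [np.nan, 112000.0, np.nan, 100000.0],
})


def sentinel_matches_missing(df: pd.DataFrame, sentinel: int = 118) -> bool:
    """True if the sentinel age occurs exactly where gender and income are missing."""
    is_sentinel = df["age"] == sentinel
    demographics_missing = df["gender"].isna() & df["income"].isna()
    return bool((is_sentinel == demographics_missing).all())


print(sentinel_matches_missing(toy_profile))

# Tenure in days, relative to the latest join date in this toy frame.
joined = pd.to_datetime(toy_profile["became_member_on"].astype(str), format="%Y%m%d")
member_days = (joined.max() - joined).dt.days
print(member_days.tolist())
```

Running the same check on the full `profile` table would indicate whether treating 118 as missing is safe for every row, not just the previewed ones.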


## 5. Clean and Enrich Offer Metadata

The `portfolio` table describes each offer: its type, difficulty (spend threshold), reward, duration, and channels.

We:

- Rename `id` to `offer_id`  
- One-hot encode the channels list into separate binary features  


In [11]:
portfolio_df = portfolio.copy()
portfolio_df = portfolio_df.rename(columns={"id": "offer_id"})

for ch in ["web", "email", "mobile", "social"]:
    portfolio_df[f"channel_{ch}"] = portfolio_df["channels"].apply(
        lambda x: 1 if ch in x else 0
    )

portfolio_df = portfolio_df.drop(columns=["channels"])
portfolio_df.head()


Unnamed: 0,reward,difficulty,duration,offer_type,offer_id,channel_web,channel_email,channel_mobile,channel_social
0,10,10,7,bogo,ae264e3637204a6fb9bb56bc8210ddfd,0,1,1,1
1,10,10,5,bogo,4d5c57ea9a6940dd891ad53e9dbe8da0,1,1,1,1
2,0,0,4,informational,3f207df678b143eea3cee63160fa8bed,1,1,1,0
3,5,5,7,bogo,9b98b8c7a33c4b65b9aebfe6a799e6d9,1,1,1,0
4,5,20,10,discount,0b1e1539f2cc45b7b9fa7c272da2e1d7,1,1,0,0
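
As an alternative to the explicit loop over channel names, the list column can be encoded with pandas string methods. This is a sketch on made-up offers, not the notebook's method. It differs from the loop in one respect: it only creates columns for channels that actually occur in the data, whereas the loop always creates the four named columns.

```python
import pandas as pd

# Made-up offers; only the shape of the channels column matters here.
toy_portfolio = pd.DataFrame({
    "offer_id": ["o1", "o2", "o3"],
    "channels": [
        ["email", "mobile", "social"],
        ["web", "email"],
        ["web", "email", "mobile", "social"],
    ],
})

# Join each list into "a|b|c", then split into 0/1 indicator columns.
channel_flags = (
    toy_portfolio["channels"]
    .str.join("|")
    .str.get_dummies(sep="|")
    .add_prefix("channel_")
)

encoded = pd.concat([toy_portfolio.drop(columns=["channels"]), channel_flags], axis=1)
print(encoded)
```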


## 6. Flatten Transcript Event Log

The `transcript` table is an event log containing:
- `offer received`
- `offer viewed`
- `offer completed`
- `transaction` events

The `value` column is a nested dictionary that may contain:
- `offer id` / `offer_id`
- `amount` (transaction amount)
- `reward` (offer reward credited)

We flatten this nested structure into explicit columns for `offer_id`, `amount`, and `reward`.


In [12]:
transcript_df = transcript.copy()

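def extract_value(row):
    """Copy offer ID, amount, and reward out of the nested `value` dict."""
    val = row["value"]
    # The offer ID key is spelled "offer id" or "offer_id" depending on the event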
    if "offer id" in val:
        row["offer_id"] = val["offer id"]
    elif "offer_id" in val:
        row["offer_id"] = val["offer_id"]
    else:
        row["offer_id"] = None

    row["amount"] = val.get("amount", None)
    row["reward"] = val.get("reward", None)
    return row

transcript_df = transcript_df.apply(extract_value, axis=1)
transcript_df = transcript_df.drop(columns=["value"])

transcript_df.head()


Unnamed: 0,person,event,time,offer_id,amount,reward
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,,
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4,,
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0,,
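
Row-wise `apply` calls a Python function once per event, which can be slow on a large log. The sketch below shows a column-wise alternative using `pd.json_normalize`, run on three made-up events. It is an illustration under the assumption that `value` holds flat dictionaries with the keys listed in this section. Missing entries come out as NaN rather than `None`.

```python
import pandas as pd

# Made-up events covering both spellings of the offer ID key.
toy_events = pd.DataFrame({
    "person": ["p1", "p1", "p1"],
    "event": ["offer received", "transaction", "offer completed"],
    "value": [
        {"offer id": "o1"},
        {"amount": 12.5},
        {"offer_id": "o1", "reward": 5},
    ],
    "time": [0, 6, 6],
})

# Expand the dicts into columns; reindex guarantees all four columns exist.
flat = pd.json_normalize(toy_events["value"].tolist())
flat.index = toy_events.index
flat = flat.reindex(columns=["offer id", "offer_id", "amount", "reward"])

flattened = toy_events.drop(columns=["value"]).assign(
    # Prefer "offer id", fall back to "offer_id" where the first is missing.
    offer_id=flat["offer id"].combine_first(flat["offer_id"]),
    amount=flat["amount"],
    reward=flat["reward"],
)
print(flattened)
```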


## 7. Build Offer Instances and Define Response Labels

We construct one offer instance per `offer received` event and label whether the customer *responded* to that offer.

A customer is considered to have **responded** if:
1. They **viewed** the offer within its validity window, and  
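2. They **completed** the offer (met the spend threshold) within the same window.

The check requires both events to fall inside the window; it does not require the view to come before the completion.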

This labeling logic distinguishes true offer influence from:
- Purchases that happen coincidentally within the time window  
- Completions without viewing (not influenced by the promotion)  
- Non-responsive behavior  


In [None]:
# Offers received
offers_received = transcript_df[transcript_df["event"] == "offer received"].copy()

# Drop the transcript-level reward column to avoid collision
if "reward" in offers_received.columns:
    offers_received = offers_received.drop(columns=["reward"])

# Join offer metadata (which has 'reward')
offers_received = offers_received.merge(
    portfolio_df,
    on="offer_id",
    how="left"
)

# Offer window in hours: `time` is in hours, `duration` is in days
offers_received["offer_start"] = offers_received["time"]
offers_received["offer_end"] = offers_received["time"] + offers_received["duration"] * 24


# Views and completions
views = transcript_df[transcript_df["event"] == "offer viewed"].copy()
completions = transcript_df[transcript_df["event"] == "offer completed"].copy()

def label_offer(row):
    """Return 1 if the offer was both viewed and completed inside its window, else 0."""
    person = row["person"]
    offer_id = row["offer_id"]
    start = row["offer_start"]
    end = row["offer_end"]

    v_mask = (
        (views["person"] == person)
        & (views["offer_id"] == offer_id)
        & (views["time"] >= start)
        & (views["time"] <= end)
    )
    c_mask = (
        (completions["person"] == person)
        & (completions["offer_id"] == offer_id)
        & (completions["time"] >= start)
        & (completions["time"] <= end)
    )

    has_view = views[v_mask].shape[0] > 0
    has_complete = completions[c_mask].shape[0] > 0

    return int(has_view and has_complete)

offers_received["responded"] = offers_received.apply(label_offer, axis=1)

In [22]:
offers_received.head()

Unnamed: 0,person,event,time,offer_id,amount,reward,difficulty,duration,offer_type,channel_web,channel_email,channel_mobile,channel_social,offer_start,offer_end,responded
0,78afa995795e4d85b5d9ceeca43f5fef,offer received,0,9b98b8c7a33c4b65b9aebfe6a799e6d9,,5,5,7,bogo,1,1,1,0,0,168,1
1,a03223e636434f42ac4c3df47e8bac43,offer received,0,0b1e1539f2cc45b7b9fa7c272da2e1d7,,5,20,10,discount,1,1,0,0,0,240,0
2,e2127556f4f64592b11af22de27a7932,offer received,0,2906b810c7d4411798c6938adc9daaa5,,2,10,7,discount,1,1,1,0,0,168,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,offer received,0,fafdcd668e3743c1bb461111dcafc2a4,,2,10,10,discount,1,1,1,1,0,240,0
4,68617ca6246f4fbc85e91a2a49552598,offer received,0,4d5c57ea9a6940dd891ad53e9dbe8da0,,10,10,5,bogo,1,1,1,1,0,120,0


In [23]:
offers_received["responded"].value_counts(normalize=True)

responded
0    0.634267
1    0.365733
Name: proportion, dtype: float64
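
The row-wise `label_offer` function filters the full `views` and `completions` tables once per received offer, so its cost grows with the product of the table sizes. The sketch below applies the same rule (a view and a completion inside the inclusive window) with merges instead. It is a possible alternative, not the notebook's implementation; `in_window_instances` and `label_responses` are hypothetical names, and the toy events are made up.

```python
import pandas as pd


def in_window_instances(received: pd.DataFrame, events: pd.DataFrame):
    """Return the instance IDs that have at least one event inside their window."""
    evt = events[["person", "offer_id", "time"]].rename(columns={"time": "event_time"})
    merged = received[
        ["instance", "person", "offer_id", "offer_start", "offer_end"]
    ].merge(evt, on=["person", "offer_id"], how="inner")
    # between() is inclusive on both ends, matching the >= / <= checks.
    ok = merged["event_time"].between(merged["offer_start"], merged["offer_end"])
    return merged.loc[ok, "instance"].unique()


def label_responses(received, views, completions) -> pd.DataFrame:
    """Add a 0/1 `responded` column: viewed and completed within the window."""
    out = received.reset_index(drop=True).copy()
    out["instance"] = out.index
    viewed = out["instance"].isin(in_window_instances(out, views))
    completed = out["instance"].isin(in_window_instances(out, completions))
    out["responded"] = (viewed & completed).astype(int)
    return out.drop(columns=["instance"])


# Made-up data: person A receives offer o1 twice, person B once.
toy_received = pd.DataFrame({
    "person": ["A", "B", "A"],
    "offer_id": ["o1", "o1", "o1"],
    "offer_start": [0, 0, 200],
    "offer_end": [168, 168, 368],
})
toy_views = pd.DataFrame({"person": ["A", "B"], "offer_id": ["o1", "o1"], "time": [6, 10]})
toy_completions = pd.DataFrame({"person": ["A"], "offer_id": ["o1"], "time": [30]})

labeled = label_responses(toy_received, toy_views, toy_completions)
print(labeled)
```

Before swapping this in, it would be worth confirming on the real data that it reproduces the `responded` column produced by `label_offer`.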

## 8. Merge Customer Features and Build Modeling Dataset

Merge the offer instances with customer demographics to create a unified modeling dataset.

Features include:
- Offer attributes: type, difficulty, reward, duration, channels  
- Customer attributes: age, gender, income, membership tenure  
- Target: `responded` (1 = viewed + completed, 0 = otherwise)  


In [25]:
offer_person_df = offers_received.merge(
    profile_df,
    on="person",
    how="left"
)

model_df = offer_person_df[[
    "person",
    "offer_id",
    "offer_type",
    "difficulty",
    "reward",
    "duration",
    "channel_web",
    "channel_email",
    "channel_mobile",
    "channel_social",
    "gender",
    "age",
    "income",
    "member_days",
    "responded"
]].copy()

model_df = model_df.dropna(subset=["responded"])

model_df["responded"].value_counts(normalize=True)


responded
0    0.634267
1    0.365733
Name: proportion, dtype: float64

In [26]:
model_df.head()

Unnamed: 0,person,offer_id,offer_type,difficulty,reward,duration,channel_web,channel_email,channel_mobile,channel_social,gender,age,income,member_days,responded
0,78afa995795e4d85b5d9ceeca43f5fef,9b98b8c7a33c4b65b9aebfe6a799e6d9,bogo,5,5,7,1,1,1,0,F,75.0,100000.0,443,1
1,a03223e636434f42ac4c3df47e8bac43,0b1e1539f2cc45b7b9fa7c272da2e1d7,discount,20,5,10,1,1,0,0,,,,356,0
2,e2127556f4f64592b11af22de27a7932,2906b810c7d4411798c6938adc9daaa5,discount,10,2,7,1,1,1,0,M,68.0,70000.0,91,0
3,8ec6ce2a7e7949b1bf142def7d0e0586,fafdcd668e3743c1bb461111dcafc2a4,discount,10,2,10,1,1,1,1,,,,304,0
4,68617ca6246f4fbc85e91a2a49552598,4d5c57ea9a6940dd891ad53e9dbe8da0,bogo,10,10,5,1,1,1,1,,,,297,0
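
A left join silently duplicates offer rows if a customer appears more than once in the profile table, and it leaves demographics empty for customers with no profile. The sketch below uses the `validate` and `indicator` options of `DataFrame.merge` to surface both cases. The frames are made up for illustration, and `p9` is a deliberately unmatched customer.

```python
import pandas as pd

# Made-up offer instances and profiles; "p9" has no profile row.
toy_offers = pd.DataFrame({
    "person": ["p1", "p1", "p2", "p9"],
    "offer_id": ["o1", "o2", "o1", "o2"],
    "responded": [1, 0, 0, 1],
})
toy_profiles = pd.DataFrame({
    "person": ["p1", "p2", "p3"],
    "age": [55.0, 68.0, 40.0],
})

merged = toy_offers.merge(
    toy_profiles,
    on="person",
    how="left",
    validate="many_to_one",  # raises MergeError if a person repeats in profiles
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```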


## 8A. Exploratory Data Analysis (EDA)

Before training machine learning models, it is critical to explore the labeled dataset to understand:

- Overall response rates  
- How responses differ by demographic features  
- How offer types influence behavior  
- Whether any customer groups respond negatively  
- Which segments consistently make purchases regardless of offers  

This section includes visual analysis and interpretation to help uncover patterns that the ML model will later formalize.

We will examine:
- Response rate distribution  
- Response by offer type  
- Response by demographic segments (gender, age, income)  
- Interaction patterns between offer type and income  
- Transaction and behavioral patterns that indicate “no-offer-needed” customers  
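
The segment comparisons listed above reduce to grouped means of the `responded` flag. The sketch below computes the response rate by offer type, and by offer type crossed with an income band, on a small made-up frame. The values and the income bin edges are assumptions chosen for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows with the same column names as model_df.
toy_model_df = pd.DataFrame({
    "offer_type": ["bogo", "bogo", "discount", "discount", "informational", "bogo"],
    "income": [40000.0, 95000.0, 60000.0, None, 72000.0, 110000.0],
    "responded": [0, 1, 1, 0, 0, 1],
})

# The mean of a 0/1 flag is the response rate of the group.
rate_by_type = toy_model_df.groupby("offer_type")["responded"].mean()

# Hypothetical income bands; rows with missing income fall out of the grouping.
income_band = pd.cut(
    toy_model_df["income"],
    bins=[0, 50000, 80000, 120000],
    labels=["low", "mid", "high"],
)
rate_by_type_income = (
    toy_model_df.assign(income_band=income_band)
    .groupby(["offer_type", "income_band"], observed=True)["responded"]
    .mean()
    .unstack("income_band")
)

print(rate_by_type)
print(rate_by_type_income)
```

The resulting tables can be passed to plotting functions such as `sns.heatmap` or `DataFrame.plot(kind="bar")` for the visual analysis.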
