<a href="https://colab.research.google.com/github/vijaygwu/advertising/blob/main/Attribution_LightGBM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explains the code **section by section**. The code performs the following main tasks:

1. **Generates synthetic multi-touch attribution data**,
2. **Transforms it into user-level features**,
3. **Trains a LightGBM model** to predict conversion, and
4. **Analyzes model results** (feature importance and channel-level attribution).

---

## Imports

```python
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
```

1. **NumPy (`numpy`)**: Provides support for large, multi-dimensional arrays and random number generation.  
2. **Pandas (`pandas`)**: Offers data manipulation and analysis tools, particularly with `DataFrame`s.  
3. **`datetime`, `timedelta`**: Classes from Python’s built-in `datetime` module for handling dates, times, and time spans.  
4. **LightGBM (`lightgbm`)**: A gradient boosting framework built on histogram-based decision trees, which keeps training fast and memory use low on large tabular datasets.  
5. **Scikit-learn** components:  
   - `train_test_split`: Splits arrays or DataFrames into random train and test subsets.  
   - `roc_auc_score`: Calculates the Area Under the Receiver Operating Characteristic Curve, a common metric for binary classification.

---

```python
# Optional: Fix random seeds for reproducibility
np.random.seed(42)
```
- **Sets the seed of NumPy's global random generator** so that every `np.random` call in the script produces the same sequence on each run, which makes experiments reproducible. `train_test_split` is seeded separately through its `random_state` argument.
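
A quick way to see what the seed does is to reseed and redraw. The sketch below assumes only NumPy; the helper name `draw_touchpoint_counts` is hypothetical and not part of the notebook.

```python
import numpy as np

def draw_touchpoint_counts(seed, n=5):
    """Hypothetical helper: reseed NumPy's global RNG, then draw n journey lengths."""
    np.random.seed(seed)
    # Same call the notebook uses for journey length: integers from 1 to 8
    return np.random.randint(1, 9, size=n)

first = draw_touchpoint_counts(42)
second = draw_touchpoint_counts(42)

# Reseeding with the same value replays the same sequence of draws
print((first == second).all())
```

Because the seed is global, any extra `np.random` call inserted earlier in the script shifts every later draw.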

---

## 1. Generate a Single Synthetic Journey

```python
def generate_synthetic_journey():
    """
    Generate a single synthetic customer journey
    """
    channels = ['search', 'social', 'email', 'display', 'organic']
    
    # Random number of touchpoints (1-8)
    num_touchpoints = np.random.randint(1, 9)
    
    journey = []
    timestamps = []
    base_time = datetime.now()
    
    for i in range(num_touchpoints):
        if i == 0:
            # First touch more likely to be search or social
            channel = np.random.choice(['search', 'social', 'organic'],
                                       p=[0.4, 0.4, 0.2])
        elif i == num_touchpoints - 1:
            # Last touch more likely to be email or search
            channel = np.random.choice(['email', 'search', 'social'],
                                       p=[0.4, 0.4, 0.2])
        else:
            # Mid-journey touches
            channel = np.random.choice(channels,
                                       p=[0.3, 0.2, 0.2, 0.2, 0.1])
        
        # Advance the clock by a random 1-71 hours (randint excludes the upper bound)
        time_delta = timedelta(hours=np.random.randint(1, 72))
        timestamp = base_time + time_delta
        base_time = timestamp
        
        journey.append(channel)
        timestamps.append(timestamp)
    
    # Generate conversion probability
    has_email = 'email' in journey
    has_search = 'search' in journey
    base_conv_prob = 0.3
    
    if has_email and has_search:
        conv_prob = base_conv_prob * 1.5
    elif has_email or has_search:
        conv_prob = base_conv_prob * 1.2
    else:
        conv_prob = base_conv_prob
    
    converted = np.random.random() < conv_prob
    
    return journey, timestamps, converted
```

**Function Purpose**: Creates a **single user’s journey** through various marketing channels.

1. **Define possible channels**: `channels = ['search', 'social', 'email', 'display', 'organic']`.
2. **Pick a random number of touchpoints** (1 to 8).
3. **Initialize**:
   - `journey` as an empty list (will store channels).
   - `timestamps` as an empty list (will store times for each channel interaction).
   - `base_time` as `datetime.now()` (a starting timestamp).
4. **Loop over each touchpoint** and pick the channel:
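   - For the **first touch** (`i == 0`), draw from search, social, and organic with probabilities 0.4, 0.4, and 0.2. A one-touch journey also uses this branch, since `i == 0` is checked first.
   - For the **last touch** (`i == num_touchpoints - 1`), draw from email, search, and social with probabilities 0.4, 0.4, and 0.2.
   - For **intermediate touches**, draw from all five channels with probabilities `[0.3, 0.2, 0.2, 0.2, 0.1]` (search, social, email, display, organic).
5. **Assign a random offset** (`time_delta`) of 1–71 hours from the current `base_time` (`np.random.randint(1, 72)` excludes its upper bound), update `base_time` to that new timestamp, and store it in `timestamps`.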
6. **Calculate conversion**:  
   - Base probability is 0.3.  
   - If the journey includes **both email and search**, multiply by 1.5 (probability 0.45).  
   - If it includes **exactly one** of email or search, multiply by 1.2 (probability 0.36).  
   - Draw a uniform random number in [0, 1) and mark the user as converted if it is below `conv_prob`: `converted = np.random.random() < conv_prob`.
7. **Return** the resulting journey (list of channels), timestamps, and whether they converted (boolean).
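
The conversion rule and the timestamp offset can be checked in isolation. The following is a minimal sketch: `conversion_probability` and `BASE_CONV_PROB` are hypothetical names that mirror the logic inside `generate_synthetic_journey`, and the example journeys are made up.

```python
import numpy as np

BASE_CONV_PROB = 0.3  # same base rate as generate_synthetic_journey

def conversion_probability(journey):
    """Hypothetical helper mirroring the uplift rule in generate_synthetic_journey."""
    has_email = 'email' in journey
    has_search = 'search' in journey
    if has_email and has_search:
        return BASE_CONV_PROB * 1.5   # both channels present
    if has_email or has_search:
        return BASE_CONV_PROB * 1.2   # exactly one of the two present
    return BASE_CONV_PROB             # neither present

both = conversion_probability(['search', 'display', 'email'])
one = conversion_probability(['social', 'email'])
neither = conversion_probability(['social', 'organic'])
print(round(both, 2), round(one, 2), round(neither, 2))

# np.random.randint excludes its upper bound, so randint(1, 72) yields 1..71
offsets = np.random.randint(1, 72, size=10_000)
print(offsets.min() >= 1, offsets.max() <= 71)
```

The three probabilities are close together, so the conversion signal in this synthetic data is weak by construction.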

---

## 2. Generate a Complete Dataset

```python
def generate_dataset(num_users=10000):
    """
    Generate a complete dataset of user journeys
    """
    data = []
    
    for user_id in range(num_users):
        journey, timestamps, converted = generate_synthetic_journey()
        
        for i in range(len(journey)):
            data.append({
                'user_id': user_id,
                'channel': journey[i],
                'timestamp': timestamps[i].strftime('%Y-%m-%d %H:%M:%S'),  # Convert to string
                'touch_point': i + 1,
                'journey_length': len(journey),
                'converted': converted
            })
    
    df = pd.DataFrame(data)
    # Convert timestamp back to datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df
```

**Function Purpose**: Uses the single-journey generator (`generate_synthetic_journey`) to create a **dataset** for multiple users.

1. `num_users=10000`: By default, generates journeys for 10,000 users.
2. Initialize `data` as a list to store row-by-row records.
3. For each user:
   - Call `generate_synthetic_journey()` to get their `journey`, `timestamps`, and `converted` status.
   - Loop through each touchpoint in that journey, appending a dictionary with:
     - `user_id`
     - `channel` (the channel at this touch)
     - `timestamp` (formatted as a string with one-second resolution)
     - `touch_point` (the order in the journey)
     - `journey_length` (total number of touchpoints for this user)
     - `converted` (did they convert or not)
4. Create a Pandas `DataFrame` (`df`) from `data`.
5. Convert the `timestamp` back to a `datetime` object using `pd.to_datetime`.
6. Return the resulting DataFrame, which has one row per touchpoint.
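
Two properties of this long format are worth checking before feature engineering: each user has exactly `journey_length` rows, and `converted` is constant within a user. The sketch below uses a small hand-built frame with two hypothetical users as a stand-in for the output of `generate_dataset`; the timestamp column is omitted for brevity.

```python
import pandas as pd

# Hand-built stand-in for generate_dataset output: two hypothetical users
rows = [
    {'user_id': 0, 'channel': 'search', 'touch_point': 1, 'journey_length': 2, 'converted': True},
    {'user_id': 0, 'channel': 'email', 'touch_point': 2, 'journey_length': 2, 'converted': True},
    {'user_id': 1, 'channel': 'social', 'touch_point': 1, 'journey_length': 1, 'converted': False},
]
df = pd.DataFrame(rows)

per_user = df.groupby('user_id').agg(
    n_rows=('channel', 'size'),                  # rows recorded for the user
    journey_length=('journey_length', 'first'),  # length stored on each row
    label_values=('converted', 'nunique'),       # distinct labels per user
)

rows_match = (per_user['n_rows'] == per_user['journey_length']).all()
labels_constant = (per_user['label_values'] == 1).all()
print(per_user)
print(rows_match, labels_constant)
```

The same two checks can be run on the real `raw_data` frame by replacing `df`.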

---

## 3. Create Features for the Model

```python
def create_features(df):
    """
    Transform raw journey data into features for LightGBM
    """
    grouped = df.groupby('user_id')

    # Initialize user-level feature DataFrame
    user_features = pd.DataFrame(index=grouped.groups.keys())
    
    # Basic journey features
    user_features['journey_length'] = grouped['journey_length'].first()
    
    # Calculate time duration in hours (timestamp max - timestamp min)
    user_features['total_time'] = grouped['timestamp'].agg(
        lambda x: (x.max() - x.min()).total_seconds() / 3600
    )
    
    # Channel-specific features
    channels = ['search', 'social', 'email', 'display', 'organic']
    
    # Collect channel sequences (lists) for each user once
    channel_data = grouped['channel'].agg(list)
    
    for channel in channels:
        # Count occurrences
        user_features[f'{channel}_count'] = channel_data.apply(lambda x: x.count(channel))
        
        # Calculate frequency
        user_features[f'{channel}_freq'] = (
            user_features[f'{channel}_count'] / user_features['journey_length']
        )
        
        # First touch
        user_features[f'{channel}_first'] = channel_data.apply(lambda x: 1 if x[0] == channel else 0)
        
        # Last touch
        user_features[f'{channel}_last'] = channel_data.apply(lambda x: 1 if x[-1] == channel else 0)
    
    # Add conversion target
    user_features['converted'] = grouped['converted'].first()
    
    return user_features
```

**Function Purpose**: Aggregates **touchpoint-level** data to **user-level** data and creates modeling features.

1. **Group by `user_id`**: `grouped = df.groupby('user_id')`.
2. **Initialize** an empty `DataFrame` called `user_features`, using the unique user IDs as its index.
3. **Journey Length**:
   - For each user, retrieve the first `journey_length` value from the group and store it in `user_features['journey_length']`.
4. **Total Time**:
   - Compute how long the journey took in hours: `(x.max() - x.min()).total_seconds() / 3600`.
5. **Channel-Specific Features**:
   - Create a list of all channels each user encountered: `channel_data = grouped['channel'].agg(list)`.
   - For each possible channel (search, social, email, display, organic):
     - `channel_count`: How many times that channel appears in the user’s journey.  
     - `channel_freq`: The above count divided by the total journey length for that user.  
     - `channel_first`: 1 if the first touch channel is the given channel, else 0.  
     - `channel_last`: 1 if the last touch channel is the given channel, else 0.
6. **Conversion Target**:
   - `converted` is just the first conversion flag in each user’s group (they are all the same for a given user).
7. **Return** the resulting `DataFrame` with one row per user and columns representing features and the conversion label.
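
To make the four per-channel features concrete, the sketch below computes them for one hypothetical four-touch journey using plain Python, following the same definitions as `create_features`.

```python
import pandas as pd

journey = ['search', 'social', 'email', 'search']  # hypothetical four-touch journey
channels = ['search', 'social', 'email', 'display', 'organic']

features = {}
for channel in channels:
    count = journey.count(channel)
    features[f'{channel}_count'] = count                        # occurrences
    features[f'{channel}_freq'] = count / len(journey)          # share of touches
    features[f'{channel}_first'] = int(journey[0] == channel)   # first-touch flag
    features[f'{channel}_last'] = int(journey[-1] == channel)   # last-touch flag

row = pd.Series(features)
print(row[['search_count', 'search_freq', 'search_first', 'search_last']])
```

For this journey, search appears twice, makes up half of the touches, and is both the first and last touch. The five `_freq` features always sum to 1 for a user, so they carry overlapping information with the `_count` features.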

---

## 4. Train a LightGBM Model with Callbacks

```python
def train_model(features, target):
    """
    Train LightGBM model with the synthetic data using callbacks
    for early stopping instead of the early_stopping_rounds parameter.
    """
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'max_depth': 5,
        'learning_rate': 0.1,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': -1
    }
    
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )
    
    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_test, label=y_test)
    
    # Use callbacks for early stopping
    model = lgb.train(
        params=params,
        train_set=train_data,
        num_boost_round=100,
        valid_sets=[train_data, valid_data],
        valid_names=['train', 'valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=10),
            lgb.log_evaluation(period=10)  # Set period=0 or any other value as needed
        ]
    )
    
    return model, X_test, y_test
```

**Function Purpose**: Trains a **LightGBM** model on the features created above.

1. **Parameters (`params`)**:  
   - `objective='binary'`: Binary classification.  
   - `metric='auc'`: Use AUC (Area Under the ROC Curve) as the metric.  
   - `boosting_type='gbdt'`: Use traditional Gradient Boosting Decision Trees.  
   - `num_leaves=31`, `max_depth=5`: Cap each tree at 31 leaves and a depth of 5 to control complexity.  
   - `learning_rate=0.1`: Step size shrinkage.  
   - `feature_fraction=0.9`, `bagging_fraction=0.8`, `bagging_freq=5`: Subsampling features/rows for regularization.  
   - `verbose=-1`: Suppress detailed logging.
2. **Split the Data**:  
   - Use `train_test_split` to create an 80/20 train/test split.
3. **Create LightGBM Datasets**:  
   - `train_data` = `lgb.Dataset(X_train, label=y_train)`.  
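   - `valid_data` = `lgb.Dataset(X_test, label=y_test)`. The held-out test split doubles as the validation set for early stopping, so the AUC later reported on the same split is slightly optimistic.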
4. **Train with Callbacks**:  
   - `lgb.early_stopping(stopping_rounds=10)` stops training if the validation metric does not improve for 10 consecutive rounds.  
   - `lgb.log_evaluation(period=10)` logs evaluation results every 10 rounds.
5. **Return** the trained `model` and the test split (`X_test`, `y_test`) for later evaluation.
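
`train_model` uses one held-out split for both early stopping and the final AUC. One possible alternative, which is not what the notebook does, is a three-way split that keeps a validation set for early stopping and reserves the test set for the final score. The sketch below shows only the splitting step; `X` and `y` are toy stand-ins for the real feature frame and labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the user-level features and labels
rng = np.random.default_rng(0)
X = pd.DataFrame({'journey_length': rng.integers(1, 9, size=1000)})
y = pd.Series(rng.random(1000) < 0.3, name='converted')

# First carve off 20% as the final test set
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Then take 25% of the remainder (20% of the total) for early stopping
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42
)
print(len(X_train), len(X_valid), len(X_test))
```

With this layout, `X_valid` would be passed to `lgb.Dataset` for `valid_sets`, and `X_test` would be touched only once, when computing the final AUC.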

---

## 5. Analyze Results

```python
def analyze_results(model, feature_names, X_test, y_test):
    """
    Compute feature importance, channel attribution shares, and test AUC
    """
    # Feature importance
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importance(importance_type='gain')
    }).sort_values('importance', ascending=False)
    
    # Channel attribution
    channels = ['search', 'social', 'email', 'display', 'organic']
    channel_importance = {}
    
    for channel in channels:
        # Sum importance of features that contain the channel name
        channel_feat_df = importance_df[importance_df['feature'].str.contains(channel)]
        channel_importance[channel] = channel_feat_df['importance'].sum()
    
    # Normalize to percentages
    total_importance = sum(channel_importance.values())
    if total_importance > 0:
        channel_importance = {
            k: (v / total_importance) * 100 for k, v in channel_importance.items()
        }
    else:
        channel_importance = {k: 0 for k in channels}
    
    # Model performance (AUC)
    y_pred = model.predict(X_test)
    auc_score = roc_auc_score(y_test, y_pred)
    
    return importance_df, channel_importance, auc_score
```

**Function Purpose**: Evaluates the **LightGBM model** and helps with **channel attribution**.

1. **Feature Importance**:  
   - `model.feature_importance(importance_type='gain')` returns how much each feature contributed to reducing loss (the total gain).  
   - Create a `DataFrame` mapping `feature` name to its `importance`.  
   - Sort by descending importance.
2. **Channel Attribution**:
   - We have multiple features for each channel (e.g., `search_count`, `search_freq`, `search_first`, `search_last`).  
   - The code filters the `importance_df` for rows whose `feature` contains the channel’s name (like “search”).  
   - Sums the importance for those rows.  
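   - Normalizes the sums so that the total across all channels is 100%. `journey_length` and `total_time` match no channel name, so their importance is left out of these shares.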
3. **Calculate AUC**:
   - Predict on the test set: `y_pred = model.predict(X_test)`.  
   - Compare predictions vs. true labels using `roc_auc_score(y_test, y_pred)`.
4. **Return**:
   - `importance_df` (all features and their importance scores),  
   - `channel_importance` (aggregated feature importances per channel),  
   - `auc_score` (model’s performance metric).
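
The aggregation step can be followed on a small table. In the sketch below the gain values are hypothetical, chosen only to make the arithmetic easy. It also swaps in `str.startswith(f'{ch}_')` as a stricter variant of the notebook's `str.contains(channel)`; both give the same result for the feature names used here.

```python
import pandas as pd

# Hypothetical gain values, not taken from a trained model
importance_df = pd.DataFrame({
    'feature': ['total_time', 'search_freq', 'search_last', 'email_count', 'social_first'],
    'importance': [50.0, 30.0, 10.0, 40.0, 20.0],
})
channels = ['search', 'social', 'email', 'display', 'organic']

# Sum the importance of every feature whose name starts with "<channel>_"
raw = {
    ch: importance_df.loc[
        importance_df['feature'].str.startswith(f'{ch}_'), 'importance'
    ].sum()
    for ch in channels
}

# Normalize to percentages of the channel-related total
total = sum(raw.values())
shares = {ch: 100 * v / total for ch, v in raw.items()}
print(shares)
```

Here `total_time` has the largest single gain but contributes nothing to any channel share, which is why the shares describe only the channel-related part of the model.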

---

## 6. Main Script

```python
if __name__ == "__main__":
    # Generate synthetic data
    print("Generating synthetic data...")
    raw_data = generate_dataset(num_users=10000)
    print(f"Generated {len(raw_data)} touchpoints")
    
    # Create features
    print("Creating features...")
    features_df = create_features(raw_data)
    print(f"Created features for {len(features_df)} users")
    
    # Split features and target
    X = features_df.drop('converted', axis=1)
    y = features_df['converted']
    
    # Train model
    print("Training model...")
    model, X_test, y_test = train_model(X, y)
    
    # Analyze results
    print("\nAnalyzing results...")
    importance, channel_importance, auc_score = analyze_results(
        model, X.columns, X_test, y_test
    )
    
    print("\nChannel Attribution Scores:")
    for channel, score in sorted(channel_importance.items(), key=lambda x: x[1], reverse=True):
        print(f"{channel}: {score:.2f}%")
    
    print(f"\nModel AUC Score: {auc_score:.3f}")
    
    print("\nTop 10 Most Important Features:")
    print(importance.head(10))
```

1. **`if __name__ == "__main__":`**: Runs the block only when the file is executed as a script, not when it is imported as a module.
2. **Generate Synthetic Data**:
   - `generate_dataset(num_users=10000)` creates a large DataFrame with each touchpoint.  
   - Print how many rows (touchpoints) were created.
3. **Create Features**:
   - `create_features(raw_data)` aggregates the touchpoint-level DataFrame to user-level features.  
   - Print how many users it created features for.
4. **Prepare Data for Model**:
   - `X` = all columns except `'converted'`.  
   - `y` = the `'converted'` column.
5. **Train Model**:
   - `train_model(X, y)` returns the trained `model`, plus the split test data.  
   - Print status updates.
6. **Analyze Results**:
   - Pass `model`, the columns of `X`, and the test splits to `analyze_results`.  
   - **Outputs**:
     - **Feature importance** table (`importance`).
     - **Channel-level scores** (`channel_importance`).
     - **AUC score** (`auc_score`).
7. **Print Results**:
   - **Channel Attribution Scores**: Sort channels by their total importance share.  
   - **AUC Score**: Evaluate the classification performance.  
   - **Top 10 Most Important Features**: Show which features were most important.
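
When reading the AUC, it helps to know how high it could be on this data. The generator assigns only three conversion probabilities (0.30, 0.36, 0.45), so even an oracle that scores each user with the true probability cannot separate converters well. The sketch below computes that oracle AUC analytically. The segment shares in `weights` are hypothetical, since the notebook does not report them, and `oracle_auc` is not part of the notebook.

```python
import numpy as np

def oracle_auc(weights, probs):
    """AUC obtained by scoring each segment with its true conversion probability.

    weights: population share of each segment (hypothetical values here)
    probs:   true conversion probability of each segment (assumed distinct)
    """
    weights = np.asarray(weights, dtype=float)
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)
    w, p = weights[order], probs[order]

    pos = w * p / np.sum(w * p)               # segment mix among converters
    neg = w * (1 - p) / np.sum(w * (1 - p))   # segment mix among non-converters

    # P(positive scored above negative) plus half the probability of a tie
    neg_below = np.cumsum(neg) - neg
    return float(np.sum(pos * (neg_below + 0.5 * neg)))

# Segments: neither channel, exactly one of email/search, both
auc = oracle_auc(weights=[0.2, 0.5, 0.3], probs=[0.30, 0.36, 0.45])
print(round(auc, 4))
```

Under these assumed shares the ceiling is only modestly above 0.5, so a low AUC on this synthetic data reflects the weak signal built into the generator rather than a modeling mistake.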

---

### Summary

- **Overall Flow**:  
  1. **Synthetic Data Generation** (user journeys) ->
  2. **Feature Engineering** (aggregate user-level data) ->
  3. **Train Model** (LightGBM) ->
  4. **Evaluate & Analyze** (feature importance, AUC, channel attribution).

- **Key Concepts**:
  - **Synthetic Data**: We randomly create journeys rather than using real marketing data.  
  - **Attribution**: Gain-based feature importance shows which channel features the model relies on most when predicting conversion. It is a model-based proxy for attribution, not a causal estimate.  
  - **LightGBM**: A fast, efficient boosting library to handle large datasets.  
  - **Callback-based Early Stopping**: We use `lgb.early_stopping(stopping_rounds=10)` as a callback because `lgb.train()` no longer accepts the older `early_stopping_rounds` argument as of LightGBM 4.0.  

You can run this entire script to generate data, train a model, and see which channels and features are most important for conversion in this synthetic scenario.

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Optional: Fix random seeds for reproducibility
np.random.seed(42)

def generate_synthetic_journey():
    """
    Generate a single synthetic customer journey
    """
    channels = ['search', 'social', 'email', 'display', 'organic']

    # Random number of touchpoints (1-8)
    num_touchpoints = np.random.randint(1, 9)

    journey = []
    timestamps = []
    base_time = datetime.now()

    for i in range(num_touchpoints):
        if i == 0:
            # First touch more likely to be search or social
            channel = np.random.choice(['search', 'social', 'organic'],
                                       p=[0.4, 0.4, 0.2])
        elif i == num_touchpoints - 1:
            # Last touch more likely to be email or search
            channel = np.random.choice(['email', 'search', 'social'],
                                       p=[0.4, 0.4, 0.2])
        else:
            # Mid-journey touches
            channel = np.random.choice(channels,
                                       p=[0.3, 0.2, 0.2, 0.2, 0.1])

        # Advance the clock by a random 1-71 hours (randint excludes the upper bound)
        time_delta = timedelta(hours=np.random.randint(1, 72))
        timestamp = base_time + time_delta
        base_time = timestamp

        journey.append(channel)
        timestamps.append(timestamp)

    # Generate conversion probability
    has_email = 'email' in journey
    has_search = 'search' in journey
    base_conv_prob = 0.3

    if has_email and has_search:
        conv_prob = base_conv_prob * 1.5
    elif has_email or has_search:
        conv_prob = base_conv_prob * 1.2
    else:
        conv_prob = base_conv_prob

    converted = np.random.random() < conv_prob

    return journey, timestamps, converted

def generate_dataset(num_users=10000):
    """
    Generate a complete dataset of user journeys
    """
    data = []

    for user_id in range(num_users):
        journey, timestamps, converted = generate_synthetic_journey()

        for i in range(len(journey)):
            data.append({
                'user_id': user_id,
                'channel': journey[i],
                'timestamp': timestamps[i].strftime('%Y-%m-%d %H:%M:%S'),  # Convert to string
                'touch_point': i + 1,
                'journey_length': len(journey),
                'converted': converted
            })

    df = pd.DataFrame(data)
    # Convert timestamp back to datetime
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    return df

def create_features(df):
    """
    Transform raw journey data into features for LightGBM
    """
    grouped = df.groupby('user_id')

    # Initialize user-level feature DataFrame
    user_features = pd.DataFrame(index=grouped.groups.keys())

    # Basic journey features
    user_features['journey_length'] = grouped['journey_length'].first()

    # Calculate time duration in hours (timestamp max - timestamp min)
    user_features['total_time'] = grouped['timestamp'].agg(
        lambda x: (x.max() - x.min()).total_seconds() / 3600
    )

    # Channel-specific features
    channels = ['search', 'social', 'email', 'display', 'organic']

    # Collect channel sequences (lists) for each user once
    channel_data = grouped['channel'].agg(list)

    for channel in channels:
        # Count occurrences
        user_features[f'{channel}_count'] = channel_data.apply(lambda x: x.count(channel))

        # Calculate frequency
        user_features[f'{channel}_freq'] = (
            user_features[f'{channel}_count'] / user_features['journey_length']
        )

        # First touch
        user_features[f'{channel}_first'] = channel_data.apply(lambda x: 1 if x[0] == channel else 0)

        # Last touch
        user_features[f'{channel}_last'] = channel_data.apply(lambda x: 1 if x[-1] == channel else 0)

    # Add conversion target
    user_features['converted'] = grouped['converted'].first()

    return user_features

def train_model(features, target):
    """
    Train LightGBM model with the synthetic data using callbacks
    for early stopping instead of the early_stopping_rounds parameter.
    """
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'max_depth': 5,
        'learning_rate': 0.1,
        'feature_fraction': 0.9,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'verbose': -1
    }

    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.2, random_state=42
    )

    train_data = lgb.Dataset(X_train, label=y_train)
    valid_data = lgb.Dataset(X_test, label=y_test)

    # Use callbacks for early stopping
    model = lgb.train(
        params=params,
        train_set=train_data,
        num_boost_round=100,
        valid_sets=[train_data, valid_data],
        valid_names=['train', 'valid'],
        callbacks=[
            lgb.early_stopping(stopping_rounds=10),
            lgb.log_evaluation(period=10)  # Set period=0 or any other value as needed
        ]
    )

    return model, X_test, y_test

def analyze_results(model, feature_names, X_test, y_test):
    """
    Compute feature importance, channel attribution shares, and test AUC
    """
    # Feature importance
    importance_df = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importance(importance_type='gain')
    }).sort_values('importance', ascending=False)

    # Channel attribution
    channels = ['search', 'social', 'email', 'display', 'organic']
    channel_importance = {}

    for channel in channels:
        # Sum importance of features that contain the channel name
        channel_feat_df = importance_df[importance_df['feature'].str.contains(channel)]
        channel_importance[channel] = channel_feat_df['importance'].sum()

    # Normalize to percentages
    total_importance = sum(channel_importance.values())
    if total_importance > 0:
        channel_importance = {
            k: (v / total_importance) * 100 for k, v in channel_importance.items()
        }
    else:
        channel_importance = {k: 0 for k in channels}

    # Model performance (AUC)
    y_pred = model.predict(X_test)
    auc_score = roc_auc_score(y_test, y_pred)

    return importance_df, channel_importance, auc_score

if __name__ == "__main__":
    # Generate synthetic data
    print("Generating synthetic data...")
    raw_data = generate_dataset(num_users=10000)
    print(f"Generated {len(raw_data)} touchpoints")

    # Create features
    print("Creating features...")
    features_df = create_features(raw_data)
    print(f"Created features for {len(features_df)} users")

    # Split features and target
    X = features_df.drop('converted', axis=1)
    y = features_df['converted']

    # Train model
    print("Training model...")
    model, X_test, y_test = train_model(X, y)

    # Analyze results
    print("\nAnalyzing results...")
    importance, channel_importance, auc_score = analyze_results(
        model, X.columns, X_test, y_test
    )

    print("\nChannel Attribution Scores:")
    for channel, score in sorted(channel_importance.items(), key=lambda x: x[1], reverse=True):
        print(f"{channel}: {score:.2f}%")

    print(f"\nModel AUC Score: {auc_score:.3f}")

    print("\nTop 10 Most Important Features:")
    print(importance.head(10))


Generating synthetic data...
Generated 45127 touchpoints
Creating features...
Created features for 10000 users
Training model...
Training until validation scores don't improve for 10 rounds
[10]	train's auc: 0.612341	valid's auc: 0.536423
[20]	train's auc: 0.62324	valid's auc: 0.544061
Early stopping, best iteration is:
[15]	train's auc: 0.617702	valid's auc: 0.551053

Analyzing results...

Channel Attribution Scores:
email: 34.88%
search: 32.39%
social: 13.76%
display: 12.00%
organic: 6.98%

Model AUC Score: 0.551

Top 10 Most Important Features:
           feature  importance
1       total_time  462.245380
3      search_freq  236.422332
10     email_count  200.526920
11      email_freq  112.401075
7      social_freq   72.516461
0   journey_length   65.500460
15    display_freq   57.750719
14   display_count   52.423113
19    organic_freq   51.304351
2     search_count   51.230919
