# Scenario 2: Website Analytics Debug

## My Prompt:
You are a web analytics expert. I need you to help debug and improve Python code that analyzes user engagement metrics from a session log DataFrame.

The DataFrame contains the following columns:
- `session_id` (identifier for a user session)
- `page_views` (number of pages viewed during a session)
- `duration` (duration of the session in seconds)
- `device_type` (e.g., mobile, desktop)

We are currently seeing errors such as:
- Bounce rates over 100%
- Negative session durations

Your task:
1. **Fix the broken code below** and make corrections:
   - Correct the bounce rate logic 
   - Ensure that average session time is calculated **per session** and correctly averaged
   - Correct `pages_per_session` logic

2. **Add data validation**:
   - Ensure duration is non-negative
   - Handle missing/null values

3. **Add visualizations**:
   - Distribution of session durations
   - Bounce rate by device type
   - Page views per session

4. Return two outputs:
   - A cleaned, corrected metrics dictionary
   - A DataFrame showing summary stats per device type

Use pandas, numpy, matplotlib, and seaborn.  
Write clean, commented code with docstrings.  
Do not assume the data is clean — validate it first.


In [1]:
# This is the broken code:
def analyze_user_engagement(logs_df):
    metrics = {
        'bounce_rate': logs_df.groupby('session_id')['page_views'].apply(
            lambda x: x == 1).mean(),
        'avg_session_time': logs_df.groupby('session_id')['duration'].sum(),
        'pages_per_session': logs_df.groupby('session_id')['page_views'].mean()
    }
    
    device_metrics = logs_df.groupby('device_type').agg({
        'session_id': 'count',
        'duration': 'mean',
        'page_views': 'sum'
    })
    
    return metrics, device_metrics
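
To see what the `bounce_rate` expression above measures in effect, the sketch below compares a row-level check with a session-level one. The sample log is made up for illustration, and the variable names are assumptions rather than part of the original code. It shows how a session split across several log rows is judged row by row instead of as a whole.

```python
import pandas as pd

# Hypothetical raw log: session "a" is split across two rows of 1 page view each
logs_df = pd.DataFrame({
    "session_id": ["a", "a", "b", "c"],
    "page_views": [1, 1, 1, 5],
})

# Row-level check: share of log *rows* with page_views == 1
row_level_rate = (logs_df["page_views"] == 1).mean() * 100

# Session-level check: total page views per session, then flag single-page sessions
views_per_session = logs_df.groupby("session_id")["page_views"].sum()
session_level_rate = (views_per_session == 1).mean() * 100

print(f"row-level: {row_level_rate:.1f}%")
print(f"session-level: {session_level_rate:.1f}%")
```

On this toy data the row-level check reports 75.0%, while the session-level check reports about 33.3%, because session `a` viewed two pages in total and is not a bounce.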

In [2]:
# Generated code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_user_engagement(logs_df):
    """
    Analyze user engagement from session logs.
    
    Parameters:
        logs_df (pd.DataFrame): A DataFrame with columns:
            - session_id
            - page_views
            - duration (in seconds)
            - device_type

    Returns:
        metrics (dict): Overall engagement metrics.
        device_summary_df (pd.DataFrame): Aggregated stats per device_type.
    """
    # ------------------------------
    # Step 1: Data Cleaning
    # ------------------------------
    logs_df = logs_df.copy()

    # Drop rows with missing values in key columns
    logs_df.dropna(subset=['session_id', 'page_views', 'duration', 'device_type'], inplace=True)

    # Ensure page_views and duration are numeric
    logs_df['page_views'] = pd.to_numeric(logs_df['page_views'], errors='coerce')
    logs_df['duration'] = pd.to_numeric(logs_df['duration'], errors='coerce')

    # Remove rows with negative or null durations
    logs_df = logs_df[logs_df['duration'] >= 0]

    # ------------------------------
    # Step 2: Session-level Aggregation
    # ------------------------------
    session_df = logs_df.groupby('session_id').agg({
        'page_views': 'sum',
        'duration': 'sum',
        'device_type': 'first'  # assumes device_type doesn't change mid-session
    }).reset_index()

    # Bounce rate: sessions with only 1 page view
    total_sessions = session_df.shape[0]
    bounce_sessions = session_df[session_df['page_views'] == 1].shape[0]
    bounce_rate = (bounce_sessions / total_sessions) * 100

    # Average session time and pages per session
    avg_session_time = session_df['duration'].mean()
    avg_pages_per_session = session_df['page_views'].mean()

    # ------------------------------
    # Step 3: Device-level Summary
    # ------------------------------
    device_summary_df = session_df.groupby('device_type').agg({
        'session_id': 'count',
        'duration': 'mean',
        'page_views': ['mean', 'sum'],
        'session_id': 'count'
    })
    device_summary_df.columns = ['session_count', 'avg_duration', 'avg_page_views', 'total_page_views']
    device_summary_df = device_summary_df.reset_index()

    # Bounce rate by device
    device_bounce = session_df.copy()
    device_bounce['is_bounce'] = device_bounce['page_views'] == 1
    bounce_by_device = device_bounce.groupby('device_type')['is_bounce'].mean().mul(100).reset_index()
    bounce_by_device.rename(columns={'is_bounce': 'bounce_rate'}, inplace=True)

    # Merge bounce rate into device summary
    device_summary_df = device_summary_df.merge(bounce_by_device, on='device_type')

    # ------------------------------
    # Step 4: Visualizations
    # ------------------------------
    sns.set(style="whitegrid")

    # Distribution of session durations
    plt.figure(figsize=(10, 5))
    sns.histplot(session_df['duration'], bins=30, kde=True)
    plt.title('Distribution of Session Durations')
    plt.xlabel('Duration (seconds)')
    plt.ylabel('Frequency')
    plt.show()

    # Bounce rate by device
    plt.figure(figsize=(8, 5))
    sns.barplot(data=bounce_by_device, x='device_type', y='bounce_rate')
    plt.title('Bounce Rate by Device Type')
    plt.ylabel('Bounce Rate (%)')
    plt.xlabel('Device Type')
    plt.ylim(0, 100)
    plt.show()

    # Pages per session distribution
    plt.figure(figsize=(10, 5))
    sns.histplot(session_df['page_views'], bins=30, kde=True)
    plt.title('Pages per Session Distribution')
    plt.xlabel('Page Views')
    plt.ylabel('Frequency')
    plt.show()

    # ------------------------------
    # Step 5: Output Metrics
    # ------------------------------
    metrics = {
        'bounce_rate (%)': round(bounce_rate, 2),
        'avg_session_time (sec)': round(avg_session_time, 2),
        'pages_per_session': round(avg_pages_per_session, 2)
    }

    return metrics, device_summary_df


## Follow-up Prompts:
- "Fix the bounce rate logic — it should be percentage of sessions with only 1 page view."
- "Ensure that session durations cannot be negative."
- "Can you group data at the session level before doing calculations?"
- "Add docstrings and comments to the functions for readability."
- "Create clear visualizations for session duration and page views."
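
The validation these prompts ask for (non-negative durations, handling of missing values) might be sketched as follows. `validate_logs` and the sample data are hypothetical and only illustrate the idea. The sketch coerces to numeric first, so values that fail conversion are dropped together with the original nulls.

```python
import pandas as pd

def validate_logs(logs_df):
    """Return a cleaned copy of logs_df (hypothetical helper, for illustration only)."""
    cleaned = logs_df.copy()
    # Coerce numeric columns; values that cannot be parsed become NaN
    for col in ("page_views", "duration"):
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    # Drop rows with missing values in any key column
    cleaned = cleaned.dropna(
        subset=["session_id", "page_views", "duration", "device_type"]
    )
    # Keep only non-negative durations
    cleaned = cleaned[cleaned["duration"] >= 0]
    return cleaned.reset_index(drop=True)

# Made-up raw data with one valid row and four problem rows
raw = pd.DataFrame({
    "session_id": ["a", "b", "c", "d", None],
    "page_views": [2, "three", 1, 4, 1],
    "duration": [30, 45, -10, None, 20],
    "device_type": ["mobile", "desktop", "mobile", "desktop", "mobile"],
})

cleaned = validate_logs(raw)
print(f"kept {len(cleaned)} of {len(raw)} rows")
print(cleaned)
```

Only session `a` survives here: the other rows fail on a non-numeric page count, a negative duration, a missing duration, and a missing session ID, respectively.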


In [3]:
# Final Solution:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_user_engagement(logs_df):
    """
    Analyze user engagement from session logs.
    
    Parameters:
        logs_df (pd.DataFrame): A DataFrame with columns:
            - session_id
            - page_views
            - duration (in seconds)
            - device_type

    Returns:
        metrics (dict): Overall engagement metrics.
        device_summary_df (pd.DataFrame): Aggregated stats per device_type.
    """
    # ------------------------------
    # Step 1: Data Cleaning
    # ------------------------------
    logs_df = logs_df.copy()

    # Coerce page_views and duration to numeric; unparseable values become NaN
    logs_df['page_views'] = pd.to_numeric(logs_df['page_views'], errors='coerce')
    logs_df['duration'] = pd.to_numeric(logs_df['duration'], errors='coerce')

    # Drop rows with missing values in key columns (including values coerced to NaN above)
    logs_df = logs_df.dropna(subset=['session_id', 'page_views', 'duration', 'device_type'])

    # Remove rows with negative durations
    logs_df = logs_df[logs_df['duration'] >= 0]

    # ------------------------------
    # Step 2: Session-level Aggregation
    # ------------------------------
    session_df = logs_df.groupby('session_id').agg({
        'page_views': 'sum',
        'duration': 'sum',
        'device_type': 'first'  # assumes device_type doesn't change mid-session
    }).reset_index()

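    # Bounce rate: percentage of sessions with exactly 1 page view
    total_sessions = session_df.shape[0]
    bounce_sessions = session_df[session_df['page_views'] == 1].shape[0]
    # Guard against division by zero if no sessions remain after cleaning
    bounce_rate = (bounce_sessions / total_sessions) * 100 if total_sessions else float('nan')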

    # Average session time and pages per session
    avg_session_time = session_df['duration'].mean()
    avg_pages_per_session = session_df['page_views'].mean()

    # ------------------------------
    # Step 3: Device-level Summary
    # ------------------------------
    device_summary_df = session_df.groupby('device_type').agg({
        'session_id': 'count',
        'duration': 'mean',
        'page_views': ['mean', 'sum']
    })
    device_summary_df.columns = ['session_count', 'avg_duration', 'avg_page_views', 'total_page_views']
    device_summary_df = device_summary_df.reset_index()

    # Bounce rate by device
    device_bounce = session_df.copy()
    device_bounce['is_bounce'] = device_bounce['page_views'] == 1
    bounce_by_device = device_bounce.groupby('device_type')['is_bounce'].mean().mul(100).reset_index()
    bounce_by_device.rename(columns={'is_bounce': 'bounce_rate'}, inplace=True)

    # Merge bounce rate into device summary
    device_summary_df = device_summary_df.merge(bounce_by_device, on='device_type')

    # ------------------------------
    # Step 4: Visualizations
    # ------------------------------
    sns.set(style="whitegrid")

    # Distribution of session durations
    plt.figure(figsize=(10, 5))
    sns.histplot(session_df['duration'], bins=30, kde=True)
    plt.title('Distribution of Session Durations')
    plt.xlabel('Duration (seconds)')
    plt.ylabel('Frequency')
    plt.show()

    # Bounce rate by device
    plt.figure(figsize=(8, 5))
    sns.barplot(data=bounce_by_device, x='device_type', y='bounce_rate')
    plt.title('Bounce Rate by Device Type')
    plt.ylabel('Bounce Rate (%)')
    plt.xlabel('Device Type')
    plt.ylim(0, 100)
    plt.show()

    # Pages per session distribution
    plt.figure(figsize=(10, 5))
    sns.histplot(session_df['page_views'], bins=30, kde=True)
    plt.title('Pages per Session Distribution')
    plt.xlabel('Page Views')
    plt.ylabel('Frequency')
    plt.show()

    # ------------------------------
    # Step 5: Output Metrics
    # ------------------------------
    metrics = {
        'bounce_rate (%)': round(bounce_rate, 2),
        'avg_session_time (sec)': round(avg_session_time, 2),
        'pages_per_session': round(avg_pages_per_session, 2)
    }

    return metrics, device_summary_df
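
The device-level summary in Step 3 applies a list of functions through a dict and then renames the resulting columns by position. One possible alternative, sketched here on a small made-up `session_df`, is pandas named aggregation, which assigns the output column names directly and lets the bounce rate be computed in the same pass. The sample values are assumptions for illustration and this is not the original solution's approach.

```python
import pandas as pd

# Hypothetical session-level data (one row per session), made up for illustration
session_df = pd.DataFrame({
    "session_id": ["s1", "s2", "s3", "s4"],
    "page_views": [1, 4, 1, 3],
    "duration": [10.0, 120.0, 5.0, 60.0],
    "device_type": ["mobile", "desktop", "mobile", "desktop"],
})

# Flag bounces first so the bounce rate can be aggregated with the other stats
session_df["is_bounce"] = session_df["page_views"] == 1

device_summary_df = (
    session_df.groupby("device_type")
    .agg(
        session_count=("session_id", "count"),
        avg_duration=("duration", "mean"),
        avg_page_views=("page_views", "mean"),
        total_page_views=("page_views", "sum"),
        bounce_rate=("is_bounce", "mean"),
    )
    .reset_index()
)

# Express the bounce rate as a percentage
device_summary_df["bounce_rate"] = device_summary_df["bounce_rate"] * 100

print(device_summary_df)
```

Because each output column is named where it is defined, adding or reordering aggregations cannot silently mislabel a column, which is a risk when assigning `columns` by position.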


## Scenario 2: Discussion Questions
1. **How did different prompts approach error identification?**

   The prompts differed in how effectively they surfaced errors. Initial prompts produced incorrect logic, such as comparing a Series directly to a value (`x == 1`) instead of checking the count of page views per session to identify bounces. Once I clarified the issues in my follow-up prompts, the AI restructured the logic more effectively. The bounce rate fix appeared only after I explicitly described what defined a "bounce." Similarly, the AI ignored the possibility of negative durations until I pointed out that session time should never be negative. This shows that prompting for edge cases is key to surfacing and correcting hidden logic errors.

2. **What validation methods were suggested?**

   After requesting validation, the AI introduced several effective methods. It used `pd.to_numeric` to ensure correct data types for `page_views` and `duration`, and removed rows with null or non-numeric values. It also filtered out rows with negative durations, which directly addressed the original issue of "impossible" results. These steps weren’t included in early versions of the code, which assumed all data was clean. This shows how important it is to explicitly prompt for robust data validation.

3. **How was time handling improved?**

   Time handling was significantly improved after restructuring the prompt. Instead of working with raw log rows, the AI aggregated data at the `session_id` level, summing durations per session before calculating the average. This allowed for accurate average session time metrics. It also ensured that only valid, non-negative durations were used in calculations. This improvement created cleaner, more reliable engagement metrics and directly solved the issue of negative session times from the original code.
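
The effect of aggregating at the `session_id` level before averaging can be seen in a small sketch. The log below is made up for illustration, with one session recorded across two rows, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical raw log: session "a" appears in two rows, session "b" in one
logs_df = pd.DataFrame({
    "session_id": ["a", "a", "b"],
    "duration": [30.0, 90.0, 60.0],
})

# Averaging raw rows treats every log row as if it were its own session
row_mean = logs_df["duration"].mean()

# Summing per session first, then averaging, gives the average session time
session_totals = logs_df.groupby("session_id")["duration"].sum()
avg_session_time = session_totals.mean()

print(f"mean over rows: {row_mean:.1f} s")
print(f"mean over sessions: {avg_session_time:.1f} s")
```

Here the row-level mean is 60.0 seconds, whereas the session-level mean is 90.0 seconds, since session `a` lasted 120 seconds in total.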


