# 🔬 Exploratory Data Analysis (EDA) and Causal Experiment Validation

In this section, we analyze the clean, aggregated data (`data/processed/data_processed.csv`). This analysis is crucial for ensuring the robustness and validity of our final conclusions.

The EDA serves two strategic purposes for this project:
1.  **Validate Causal Integrity:** We must confirm that the randomized assignment was successful and that the Control and Variant groups are balanced. This validation is the basis for our **Causal Inference**, ensuring any observed performance difference is truly due to the variant.
2.  **Justify Advanced Modeling:** We will inspect the distribution of our core metric, **Revenue**, to determine the appropriate statistical model. This step is critical to formally **justify the use of advanced Bayesian methods** over simple Frequentist tests.

We will proceed by first checking the sample balance and then analyzing the raw performance metrics and their underlying distributions.

In [14]:
# Importing libraries
import pandas as pd
import plotly.express as px

In [None]:
# 1. Loading the processed data
df = pd.read_csv('../data/processed/data_processed.csv')

# 2. Calculating the metrics
metrics = df.groupby('variant_name').agg(
    n_users=('user_id', 'count'),
    n_conversions=('converted', 'sum'),
    total_revenue=('total_revenue', 'sum'),
).reset_index()

metrics['cvr'] = metrics['n_conversions'] / metrics['n_users']
metrics['arpu'] = metrics['total_revenue'] / metrics['n_users']

print("--- 🎯 Aggregated metrics ---")
print(metrics.round(4))

--- 🎯 Aggregated metrics ---
  variant_name  n_users  n_conversions  total_revenue     cvr    arpu
0      control     2390             54         470.56  0.0226  0.1969
1      variant     2393             42         179.32  0.0176  0.0749


Looking at the aggregated metrics, we can conclude that:
- The two groups are almost the same size, differing by only three users (2,390 vs. 2,393). This is consistent with a successful randomized split, although equal group sizes alone do not prove that the groups are balanced on user characteristics.
- The **control** group outperforms the **variant** group on both metrics:
    1. **CVR (Conversion Rate)**: the *control* group has a CVR of 0.0226 (2.26%), while the *variant* group has a CVR of 0.0176 (1.76%)
    2. **ARPU (Average Revenue Per User)**: the *control* group has an ARPU of 0.1969, while the *variant* group has an ARPU of 0.0749
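The group-size balance noted above can be checked with a sample ratio mismatch (SRM) test. The sketch below is a chi-square goodness-of-fit test with one degree of freedom, written with the standard library only. It assumes the experiment targeted a 50/50 split, which the notebook does not state explicitly; `srm_chi_square` is a hypothetical helper name, not part of the project code.

```python
import math

def srm_chi_square(n_control, n_variant, expected_share_control=0.5):
    """Chi-square goodness-of-fit test for the observed group sizes.

    expected_share_control is an ASSUMPTION (50/50 design by default).
    Returns (chi2 statistic, p-value) for 1 degree of freedom.
    """
    total = n_control + n_variant
    expected_control = total * expected_share_control
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Group sizes taken from the aggregated metrics table.
chi2, p_value = srm_chi_square(2390, 2393)
print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}")
```

A large p-value means the observed split gives no evidence against the assumed allocation. It says nothing about balance on user attributes, which would need per-covariate checks.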

The raw CVR and ARPU values alone do not convey the size of the gap, so let us compute the relative difference (lift) of the variant over control for each metric.

In [23]:
# Calculating the percentage lift of the variant relative to control

control_metrics = metrics[metrics['variant_name'] == 'control']
variant_metrics = metrics[metrics['variant_name'] == 'variant']

lift_df = pd.DataFrame(columns=['metric', 'control', 'variant', 'Lift'])
lift_df['metric'] = ['CVR', 'ARPU']
lift_df['control'] = [control_metrics['cvr'].values[0], control_metrics['arpu'].values[0]]
lift_df['variant'] = [variant_metrics['cvr'].values[0], variant_metrics['arpu'].values[0]]
lift_df['Lift'] = (lift_df['variant'] - lift_df['control']) / lift_df['control'] * 100

lift_df

  metric   control   variant       Lift
0    CVR  0.022594  0.017551 -22.319729
1   ARPU  0.196887  0.074935 -61.939988


The lift makes the gap easier to read:
- The *variant* group's conversion rate is about 22% lower than the *control* group's.
- The *variant* group's ARPU is about 62% lower than the *control* group's, which is a very large difference.
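A relative difference depends on which group is used as the baseline, so "variant is X% lower than control" and "control is X% higher than variant" are not interchangeable. The short sketch below recomputes both directions from the counts and revenue totals in the aggregated metrics table; `relative_change` is a hypothetical helper introduced for illustration.

```python
def relative_change(value, baseline):
    """Percentage change of `value` relative to `baseline`."""
    return (value - baseline) / baseline * 100

# Figures taken from the aggregated metrics table.
control_cvr, variant_cvr = 54 / 2390, 42 / 2393
control_arpu, variant_arpu = 470.56 / 2390, 179.32 / 2393

for name, control, variant in [("CVR", control_cvr, variant_cvr),
                               ("ARPU", control_arpu, variant_arpu)]:
    variant_vs_control = relative_change(variant, control)  # the notebook's "Lift"
    control_vs_variant = relative_change(control, variant)  # reversed baseline
    print(f"{name}: variant vs control = {variant_vs_control:.1f}%, "
          f"control vs variant = {control_vs_variant:.1f}%")
```

The notebook's `Lift` column uses control as the baseline, so its values should be read as "variant relative to control".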

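While the raw lift shows a dramatic difference, we must check which statistical approach is appropriate before drawing conclusions. Simple Frequentist tests (like the t-test) assume an approximately normal sampling distribution of the mean, which requires either near-normal data or a sample large enough for the central limit theorem to apply. To check this, we inspect the revenue distribution of the converting users.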

In [29]:
# Filtering the DataFrame to include only the converting users (revenue > 0)
df_converted = df[df['total_revenue'] > 0] 
fig = px.histogram(
    df_converted,
    x="total_revenue",
    color="variant_name",
    marginal="box",
    title="Distribution of Revenue for Converting Users (Control vs. Variant)",
    opacity=0.6,
    nbins=50, 
    log_y=True,
    labels={
        'total_revenue': 'Total Revenue'
    } 
)
fig.show()

**Key Findings from the Revenue Distribution:**
1.  **Low Event Count:** We only observed **54 conversions in Control** and **42 in Variant**. This low sample size in the conversion metric increases the volatility of the estimates.
2.  **Extreme Skewness:** The histogram, plotted with a log scale on the Y-axis so that sparse high-revenue bins remain visible, shows that revenue is **highly non-normally distributed**, with a long right tail of high-value purchases.
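The skewness read off the histogram can also be expressed as a number. The sketch below computes the moment-based sample skewness and the mean-to-median ratio with the standard library. It runs on synthetic log-normal values that stand in for converter revenue; these are an assumption made for illustration and are not the project data.

```python
import random
import statistics

def sample_skewness(values):
    """Moment-based sample skewness g1 = m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic stand-in for converter revenue (NOT the notebook's data).
rng = random.Random(0)
synthetic_revenue = [rng.lognormvariate(1.0, 1.0) for _ in range(96)]

skew = sample_skewness(synthetic_revenue)
mean_to_median = (statistics.fmean(synthetic_revenue)
                  / statistics.median(synthetic_revenue))
print(f"skewness = {skew:.2f}, mean/median = {mean_to_median:.2f}")
```

A clearly positive skewness and a mean well above the median both indicate a long right tail. On the notebook's data, a comparable per-group figure could be obtained with `df_converted.groupby('variant_name')['total_revenue'].skew()`; pandas applies a bias correction, so the values differ slightly from the formula above.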

**Conclusion and Next Steps:**

The combination of low conversion counts and a highly skewed revenue distribution **undermines the assumptions behind simple Frequentist tests** such as the t-test, so their p-values and confidence intervals may be unreliable here.

Therefore, we conclude the EDA and proceed to **Bayesian Modeling**. Bayesian models let us choose likelihoods that match the shape of the data instead of assuming normality, and they express the uncertainty from small samples directly in the posterior. This allows us to:
* Model the Conversion Rate (CVR) with a **Beta-Binomial** model (a Beta prior on the rate combined with a Binomial likelihood).
* Model the Average Revenue Per User (ARPU) with a distribution suited to zero-heavy, skewed revenue (such as a **Zero-Inflated/Mixture** model or a **Gamma** distribution).
* Output the most valuable metric for stakeholders: the **Probability of B being Better (PBB)**, that is, the posterior probability that the variant (B) outperforms control, which directly quantifies the risk of choosing the wrong variant.
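As a preview of the Beta-Binomial approach for CVR, the sketch below estimates PBB by Monte Carlo sampling from each group's Beta posterior. It assumes a uniform Beta(1, 1) prior for both groups and uses the conversion counts from the aggregated metrics table. The prior, the number of draws, and the helper name `prob_variant_beats_control` are illustrative assumptions, not the project's final model.

```python
import random

def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=20_000, seed=42):
    """Monte Carlo estimate of P(variant CVR > control CVR).

    ASSUMPTION: uniform Beta(1, 1) prior, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_control = rng.betavariate(1 + conv_c, 1 + n_c - conv_c)
        p_variant = rng.betavariate(1 + conv_v, 1 + n_v - conv_v)
        wins += p_variant > p_control
    return wins / draws

# Conversion counts and group sizes from the aggregated metrics table.
pbb = prob_variant_beats_control(54, 2390, 42, 2393)
print(f"P(variant CVR > control CVR) = {pbb:.3f}")
```

The same sampling idea extends to ARPU once a revenue model is chosen, by comparing posterior draws of expected revenue per user instead of conversion rates.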