# Simpson's Paradox

Simpson's Paradox is a phenomenon in statistics where a trend that appears in several groups of data reverses or disappears when the groups are combined. In other words, a pattern that holds within every subgroup can vanish, or even point the opposite way, in the aggregated data. It typically arises when the subgroups differ in size across the categories being compared, so the aggregate mixes them with different weights.
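
One way to see why such a reversal is possible is to write each treatment's overall success rate as a weighted average of its subgroup rates. The symbols below ($s_{t,g}$, $n_{t,g}$, $r_{t,g}$, $w_{t,g}$) are notation introduced here for illustration; they are not part of the original example.

```latex
r_t \;=\; \frac{\sum_g s_{t,g}}{\sum_g n_{t,g}}
    \;=\; \sum_g w_{t,g}\, r_{t,g},
\qquad
w_{t,g} = \frac{n_{t,g}}{\sum_{g'} n_{t,g'}},
\qquad
r_{t,g} = \frac{s_{t,g}}{n_{t,g}}
```

Here $s_{t,g}$ and $n_{t,g}$ are the successes and trials for treatment $t$ in group $g$. Because the weights $w_{t,g}$ depend on the treatment, having $r_{B,g} > r_{A,g}$ in every group $g$ does not guarantee $r_B > r_A$: a treatment whose trials are concentrated in a low-success group can end up with the lower overall rate even though it wins within every group.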


In [1]:
# Illustrate Simpson's Paradox with a small synthetic example using pandas.

import pandas as pd

# Example data: two treatments (A and B) for a condition, split by gender.
# The group sizes are deliberately unbalanced: most Treatment A patients are
# male (the group with higher success rates) and most Treatment B patients
# are female.
data = pd.DataFrame({
    'Gender': ['Male', 'Male', 'Female', 'Female'],
    'Treatment': ['A', 'B', 'A', 'B'],
    'Successes': [72, 9, 2, 57],
    'Trials': [180, 10, 20, 190]
})

# Calculate success rates within each gender
data['Success Rate'] = data['Successes'] / data['Trials']
print("Success rates by gender and treatment:")
print(data[['Gender', 'Treatment', 'Success Rate']])

# Aggregate data by treatment (ignoring gender): sum the raw counts first,
# then recompute the rate from the pooled totals
agg = data.groupby('Treatment')[['Successes', 'Trials']].sum()
agg['Success Rate'] = agg['Successes'] / agg['Trials']
print("\nAggregated success rates by treatment (ignoring gender):")
print(agg[['Success Rate']])

# Notice that within each gender, Treatment B has a higher success rate,
# but when the data is aggregated, Treatment A appears better.
# This is Simpson's Paradox in action!


Success rates by gender and treatment:
   Gender Treatment  Success Rate
0    Male         A           0.4
1    Male         B           0.9
2  Female         A           0.1
3  Female         B           0.3

Aggregated success rates by treatment (ignoring gender):
           Success Rate
Treatment              
A                  0.37
B                  0.33
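
The condition described in the closing comments of the cell above (one treatment ahead in every subgroup, the other ahead overall) can also be checked programmatically. The following is a minimal sketch, not a definitive implementation: the helper name `simpson_reversal` is hypothetical, it assumes exactly two treatments that both appear in every subgroup, and the data are the kidney-stone treatment figures commonly quoted in discussions of the paradox rather than anything from the example above.

```python
import pandas as pd


def simpson_reversal(df, group_col, arm_col, successes_col, trials_col):
    """Return True if one arm wins in every subgroup but loses when pooled.

    Hypothetical helper for illustration. Assumes exactly two arms and that
    every subgroup contains both of them.
    """
    arms = sorted(df[arm_col].unique())
    if len(arms) != 2:
        raise ValueError("this sketch only handles exactly two arms")
    first, second = arms

    # Success rate per (subgroup, arm): one row per subgroup, one column per arm.
    totals = df.groupby([group_col, arm_col])[[successes_col, trials_col]].sum()
    rates = (totals[successes_col] / totals[trials_col]).unstack(arm_col)

    # Pooled success rate per arm, ignoring the subgroup column.
    pooled_totals = df.groupby(arm_col)[[successes_col, trials_col]].sum()
    pooled = pooled_totals[successes_col] / pooled_totals[trials_col]

    diff_within = rates[first] - rates[second]
    diff_pooled = pooled[first] - pooled[second]

    first_wins_everywhere = (diff_within > 0).all()
    second_wins_everywhere = (diff_within < 0).all()
    return bool(
        (first_wins_everywhere and diff_pooled < 0)
        or (second_wins_everywhere and diff_pooled > 0)
    )


# Commonly quoted kidney-stone figures (assumption: not from the cell above).
stones = pd.DataFrame({
    'Stone size': ['Small', 'Small', 'Large', 'Large'],
    'Treatment': ['A', 'B', 'A', 'B'],
    'Successes': [81, 234, 192, 55],
    'Trials': [87, 270, 263, 80],
})

# A wins for small stones (81/87 vs 234/270) and for large stones
# (192/263 vs 55/80), yet B wins overall (289/350 vs 273/350).
print(simpson_reversal(stones, 'Stone size', 'Treatment', 'Successes', 'Trials'))  # prints True
```

The helper compares pooled counts rather than averaging the subgroup rates, because an unweighted mean of rates would ignore the unequal group sizes that produce the reversal in the first place.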
