These statistical analyses support insurance pricing by **quantifying risk differences** (ANOVA/t-tests), **measuring cost impacts** (Cohen's d), and **validating pricing factors** (p-values). They enable data-driven premium adjustments, such as surcharges for high-risk groups (smokers, specific regions), while maintaining actuarial fairness. The result: **competitive yet profitable pricing** with transparent justification for customers and regulators.

## Key Steps

### Statistical Tests:

- ANOVA + Tukey HSD: Identify regional cost differences.

- T-test + Cohen’s d: Quantify smoker cost impact (practically significant!).

### Visualization:

- Use histograms for distributions, heatmaps for correlations.



# ANOVA for Regional Cost Analysis

## What is ANOVA?
**ANOVA (Analysis of Variance)** is a statistical method that compares means across three or more groups to determine if at least one group differs significantly from others.

### Key Hypotheses:
- **Null Hypothesis (H₀):** All group means are equal  
  *(Example: All regions have identical medical costs)*
- **Alternative Hypothesis (H₁):** At least one group mean differs  

---

## How to Use ANOVA for Regional Cost Analysis

### Step 1: Prepare the Data
Group medical charges by region:
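```python
# `region` holds string labels in the raw dataset
northeast = df[df['region'] == 'northeast']['charges']
northwest = df[df['region'] == 'northwest']['charges']
southeast = df[df['region'] == 'southeast']['charges']
southwest = df[df['region'] == 'southwest']['charges']
```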
### Step 2: Run One-Way ANOVA
```python
from scipy.stats import f_oneway
f_stat, p_value = f_oneway(northeast, northwest, southeast, southwest)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")
```
### Step 3: Interpret Results

| Metric       | Threshold | Conclusion                          |
|--------------|-----------|-------------------------------------|
| **p-value**  | < 0.05    | Significant regional differences (reject H₀) |
|              | ≥ 0.05    | No significant differences (fail to reject H₀) |

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from scipy.stats import ttest_ind
# Load data
df = pd.read_csv("C:/Users/yoooE/Desktop/insurance.csv")

# Check missing values and data types
print("Missing values:\n", df.isnull().sum())
print("\nData types:\n", df.dtypes)

# 1. Drop rows with any missing values
df = df.dropna()

# 2. Fill missing numerical values with mean or median
# df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

# 3. Fill missing categorical values with mode
# df['column_name'] = df['column_name'].fillna(df['column_name'].mode()[0])

# 4. Fill all missing values with a constant
# df = df.fillna(0)

# Handle outliers in 'charges' (top 1%): remove values above the 99th percentile
percentile_99 = df['charges'].quantile(0.99)

df_filted = df[df['charges'] <= percentile_99]
# print(df_filted)

# 3.1 ANOVA: Region Impact on Costs
southeast = df_filted[df_filted['region'] == 'southeast']['charges']
southwest = df_filted[df_filted['region'] == 'southwest']['charges']
northeast = df_filted[df_filted['region'] == 'northeast']['charges']
northwest = df_filted[df_filted['region'] == 'northwest']['charges']
f_stat, p_value = f_oneway(southeast, southwest, northeast, northwest)
print(f"ANOVA Results: F-statistic = {f_stat:.2f}, p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Significant regional differences exist (reject H0).")
else:
    print("No significant regional differences (fail to reject H0).")
```



```text
Missing values:
 age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Data types:
 age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
ANOVA Results: F-statistic = 2.28, p-value = 0.0774
No significant regional differences (fail to reject H0).
```


# Post-Hoc Analysis in ANOVA

## What is a Post-Hoc Test?
A **post-hoc test** is performed after finding significant results in ANOVA (p < 0.05) to identify exactly which groups differ.

### Key Properties:
- 🔍 **Purpose**: Pinpoint specific significant differences between groups
- ⚖️ **Controls**: Family-wise error rate (reduces false positives)
- 📊 **Types**: Tukey HSD (most common), Bonferroni, Scheffé

---

# Tukey HSD: The Post-Hoc Test for ANOVA

## 1. Terminology Clarification
- **Tukey HSD** = **Tukey's Honestly Significant Difference** test
- **Turkey** = A country (no relation to statistics)
- **Tukey** = John Tukey, the statistician who developed this method

## 2. What is Tukey HSD?
A **post-hoc test** used after finding significant results in ANOVA to:
- Identify exactly which group pairs are different
- Control for Type I errors (false positives) when making multiple comparisons

## 3. How It Relates to Post-Hoc Analysis
| Concept        | Relationship to Tukey HSD |
|----------------|--------------------------|
| **Post-hoc**   | General term for follow-up tests after ANOVA |
| **Tukey HSD**  | One specific (and most popular) post-hoc method |

## 4. Key Features
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Assumptions and conventions:
# - Conventionally run after a significant ANOVA result (p < 0.05)
# - Approximately normal values within each group
# - Similar variances across groups
```
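
The normality and equal-variance requirements listed above can be checked before relying on the test. The sketch below uses `scipy.stats.levene` and `scipy.stats.shapiro` on synthetic data; the group means, spread, and sample sizes are hypothetical placeholders, and these two tests are only one common way to screen the assumptions.

```python
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(0)
# Hypothetical synthetic values for three groups (not the insurance data)
groups = [rng.normal(loc=m, scale=2.0, size=40) for m in (10.0, 11.0, 12.0)]

# Levene's test: H0 is that all groups share the same variance
levene_stat, levene_p = levene(*groups)

# Shapiro-Wilk test per group: H0 is that the sample comes from a normal distribution
shapiro_p = [shapiro(g)[1] for g in groups]

print(f"Levene p-value: {levene_p:.3f}")
print("Shapiro-Wilk p-values:", [round(p, 3) for p in shapiro_p])
```

Small p-values in either test suggest the corresponding assumption is doubtful for that data. Right-skewed quantities such as medical charges often fail the normality check, so the results are worth reading alongside a histogram.
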
# Why Use Tukey HSD Instead of t-tests?

## The Multiple Comparisons Problem

When you have **3+ groups** (e.g., Region A, B, C, D), running individual t-tests between all pairs causes an **inflated Type I error rate** (false positives).

### Example with 4 Groups:
- **Number of pairs**: 6 (A-B, A-C, A-D, B-C, B-D, C-D)
- **Individual t-test error rate**: 5% per test
- **Overall error rate**: about 26% chance of ≥1 false positive  
  *(Calculated as `1 - (0.95)^6 ≈ 0.265`, assuming independent tests)*
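
The same arithmetic can be reproduced for any number of groups. This is a small sketch that assumes the pairwise tests are independent, which is a simplification; the function name `familywise_error_rate` is illustrative.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive)."""
    n_pairs = comb(n_groups, 2)  # every unordered pair of groups
    return n_pairs, 1 - (1 - alpha) ** n_pairs

n_pairs, fwer = familywise_error_rate(4)
print(n_pairs, round(fwer, 3))  # 6 0.265
```

With five groups the number of pairs rises to 10, so the uncorrected error rate climbs further, which is why a family-wise correction matters more as groups are added.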

## How Tukey HSD Solves This

| Feature               | t-tests               | Tukey HSD             |
|-----------------------|-----------------------|-----------------------|
| **Error Control**     | Per-test (5%)         | Family-wise (5%)      |
| **Adjustment**        | None                  | Corrects for multiple comparisons |
| **Power**             | Higher per-test       | Slightly lower        |
| **Best For**          | Comparing 2 groups    | Comparing all ANOVA group pairs |

### Key Advantage:
Tukey HSD maintains the **overall** Type I error rate at 5% across **all comparisons**, while uncorrected pairwise t-tests would let it grow to roughly 26% in the four-group example.

## Step-by-Step: Tukey HSD Implementation

### 1. Data Preparation
```python
import pandas as pd

# One-hot encode regions
df = pd.get_dummies(df, columns=['region'])  

# Convert to long format for Tukey
df_melt = df.melt(
    id_vars=['charges'],
    value_vars=['region_northeast','region_northwest',
               'region_southeast','region_southwest'],
    var_name='region', 
    value_name='is_region'
)
df_melt = df_melt[df_melt['is_region'] == 1]
```
### 2. Run Tukey Test
```python
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey = pairwise_tukeyhsd(
    endog=df_melt['charges'],  # Target variable (medical costs)
    groups=df_melt['region'],  # Grouping variable (regions)
    alpha=0.05                # Significance level
)
```
### 3. Interpret Results

#### Summary Output
```python
print(tukey.summary())
```

### Sample Output
```text
group1           group2         meandiff  p-adj   reject
---------------------------------------------------------
region_northeast region_southeast  1321.29  0.012    True
region_northeast region_northwest  -987.31  0.123   False
region_southeast region_southwest   588.98  0.045    True
```
## How to Read the Results

### meandiff
**Definition**: Difference in average costs between the two regions, reported by statsmodels as `group2` minus `group1`  

- **Positive value** (e.g., 1321.29): `group2` > `group1`  
- **Negative value** (e.g., -987.31): `group2` < `group1`  

### p-adj
**Definition**: Adjusted p-value accounting for multiple comparisons  

- **< 0.05**: Statistically significant difference (marked `True` in the reject column). If the two regions truly had equal mean costs, a gap this large would be expected less than 5% of the time.
- **≥ 0.05**: Not statistically significant (marked `False`). The observed gap is compatible with chance variation, so no conclusion about a difference can be drawn.

### Examples

#### region_northeast vs region_southeast
- Southeast costs are **$1,321 higher** than Northeast  
- **Significant** (p=0.012 < 0.05)  

#### region_northeast vs region_northwest
- Northwest costs are **$987 lower** than Northeast  
- **Not significant** (p=0.123 > 0.05)  
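
The direction of `meandiff` can be confirmed on a toy example. The sketch below assumes the current statsmodels behaviour, where groups are sorted by label and the result object exposes a `meandiffs` array; the values and labels are hypothetical.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical toy data: group means are a=10, b=12, c=15
values = np.array([9, 10, 11, 11, 12, 13, 14, 15, 16], dtype=float)
labels = np.array(["a"] * 3 + ["b"] * 3 + ["c"] * 3)

result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)

# Pairs follow sorted label order: (a, b), (a, c), (b, c)
pairs = [("a", "b"), ("a", "c"), ("b", "c")]
for (g1, g2), diff in zip(pairs, result.meandiffs):
    print(f"{g1} vs {g2}: meandiff = {diff:+.1f}")
```

Each printed difference is positive because the second group in every pair has the higher mean, matching the reading of the Northeast vs. Southeast example.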

```python
# 3.2 Tukey HSD Post-Hoc Test
# One-hot encode regions
df = pd.get_dummies(df_filted, columns=['region'])

# Convert to long format: one row per (charge, region) pair
df_melt = df.melt(
    id_vars=['charges'],
    value_vars=['region_southeast', 'region_southwest', 'region_northeast', 'region_northwest'],
    var_name='region',
    value_name='is_region'
)

# Keep only the row that matches each policyholder's actual region
df_melt = df_melt[df_melt['is_region'] == 1]

# Run Tukey test
tukey = pairwise_tukeyhsd(
    endog=df_melt['charges'],
    groups=df_melt['region'],
    alpha=0.05,
)

print(tukey.summary())
```

# T-Test: Smoker vs Non-Smoker Costs

## Step 1: Data Filtering Process

### Group Creation
The operation creates two comparison groups:

1. **Smoker Group**  
   - Contains medical charges for individuals who smoke  
   - Identified by: `smoker_yes` column value = `1`

2. **Non-Smoker Group**  
   - Contains medical charges for individuals who don't smoke  
   - Identified by: `smoker_yes` column value = `0`

### Technical Implementation
- Uses boolean filtering to select specific rows
- Extracts only the `charges` column values
- Produces two separate data series containing numerical cost values

### Resulting Data Structure
| Group       | Selection Criteria | Data Extracted |
|-------------|--------------------|----------------|
| Smokers     | `smoker_yes == 1`  | Medical charges |
| Non-Smokers | `smoker_yes == 0`  | Medical charges |

### Purpose
- Enables direct cost comparison between smokers and non-smokers
- Isolates the smoking status variable for analysis
- Prepares clean data for statistical testing

### Sample code
```python
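# Assumes `smoker` was one-hot encoded into `smoker_yes`;
# with the raw column, filter on df['smoker'] == 'yes' / 'no' instead
smoker = df[df['smoker_yes'] == 1]['charges']
non_smoker = df[df['smoker_yes'] == 0]['charges']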
```

## Step 3: Perform T-Test

### What It Does
Conducts an independent samples t-test comparing medical costs between:
- Smokers
- Non-smokers

### Key Parameters
- **Unequal variance assumption** (Welch's t-test):  
  Accounts for cases where the two groups have different variance in their medical costs

### Output Values
1. **T-statistic**  
   - Measures the size of the difference between groups relative to the variation in the data  
   - Higher absolute values indicate stronger evidence against the null hypothesis

2. **P-value**  
   - The probability of observing a difference at least this large if smokers and non-smokers truly had the same mean cost  
   - Used to determine statistical significance (typically p < 0.05)

### Interpretation Guide
| Value | Typical Meaning |
|-------|-----------------|
| Large t-statistic | Strong evidence of difference |
| Small p-value (< 0.05) | Statistically significant result |

### Sample code
```python
from scipy.stats import ttest_ind
t_stat, p_val = ttest_ind(smoker, non_smoker, equal_var=False)  # equal_var=False selects Welch's t-test
```
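
To show what the unequal-variance option changes, the sketch below computes Welch's t-statistic and the Welch-Satterthwaite degrees of freedom by hand. The function name `welch_t` and the toy values are assumptions for illustration; the p-value step is left to `ttest_ind`.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    # Squared standard error of each group mean (sample variance / n)
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation: no shared variance is assumed
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

# Hypothetical toy charges in thousands (not the insurance data)
smoker_toy = [30, 35, 40, 45]
non_smoker_toy = [8, 10, 12, 9, 11]
t_value, dof = welch_t(smoker_toy, non_smoker_toy)
print(f"t = {t_value:.2f}, degrees of freedom = {dof:.2f}")
```

When one group is much more variable than the other, the degrees of freedom fall toward the size of the noisier group, which makes the test more conservative than the pooled-variance version.
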
## Step 5: Effect Size Calculation (Cohen's d)

### What is Cohen's d?
Cohen's d is a standardized measure of effect size that quantifies the difference between two group means in terms of their combined variability. Unlike p-values, which measure statistical significance, Cohen's d measures practical significance by showing how substantial the observed difference actually is.

### Calculation Components
1. **Mean Difference**  
   The raw difference between average medical costs of smokers and non-smokers

2. **Pooled Standard Deviation**  
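   A combined measure of both groups' variability that serves as the "yardstick" for standardization. The textbook form weights each group's variance by its sample size, whereas the sample code in this section uses the simple average of the two variances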

3. **Final Calculation**  
   The mean difference divided by the pooled standard deviation, resulting in a unitless effect size metric

### Interpretation Guidelines
| Cohen's d Value | Effect Size | Practical Meaning |
|-----------------|------------|-------------------|
| 0.2 | Small | Visible but minor difference |
| 0.5 | Medium | Substantial noticeable difference |
| ≥ 0.8 | Large | Clinically important difference |

### Key Advantages
- Allows comparison across different studies
- Not affected by sample size (unlike p-values)
- Provides intuitive understanding of real-world impact

### Sample code
```python
mean_diff = smoker.mean() - non_smoker.mean()
pooled_std = np.sqrt((smoker.std()**2 + non_smoker.std()**2) / 2)
cohens_d = mean_diff / pooled_std
```
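
The sample code above averages the two variances with equal weight. When the groups differ in size, a sample-size-weighted pooled standard deviation is a common alternative. The sketch below is one such variant; the function name `cohens_d_pooled` and the toy inputs are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d_pooled(a, b):
    """Cohen's d using the sample-size-weighted pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * variance(a) + (n_b - 1) * variance(b)
    ) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Equal group sizes: identical to the simple-average formula
d_equal = cohens_d_pooled([2, 4, 6], [1, 3, 5])
print(f"d = {d_equal:.2f}")  # d = 0.50

# Unequal group sizes: the larger group dominates the pooled variance
d_unequal = cohens_d_pooled([2, 4, 6], [1, 3, 5, 7, 9, 11])
print(f"d = {d_unequal:.2f}")
```

With equal group sizes both formulas agree; they diverge as the size imbalance and the variance gap grow.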

```python
# 3.3 T-Test: Smoker vs Non-Smoker Costs
from scipy.stats import ttest_ind

smokers = df_filted[df_filted['smoker'] == 'yes']['charges']
non_smokers = df_filted[df_filted['smoker'] == 'no']['charges']

# Conduct Welch's t-test (does not assume equal variances)
t_stat, p_val = ttest_ind(smokers, non_smokers, equal_var=False)

print(f"T-test Results: t-statistic = {t_stat:.2f}, p-value = {p_val:.4f}")

# Cohen's d, using the simple average of the two group variances
mean_diff = smokers.mean() - non_smokers.mean()
pooled_standard = np.sqrt((smokers.std()**2 + non_smokers.std()**2)/2)
cohens_d = mean_diff/pooled_standard
print(f"Effect Size (Cohen's d): {cohens_d:.2f}")
```

# Optional Statistical Analyses for Deeper Insights

## 1. Correlation Analysis
- **Purpose:** Identify which variables (age, BMI, children, smoker status, etc.) are most strongly associated with medical charges.
- **How:** Compute the correlation matrix and focus on the `charges` column.
- **Interpretation:** Higher absolute correlation values indicate stronger relationships.
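
As a rough illustration of ranking drivers by absolute correlation, the sketch below computes Pearson's r by hand. The helper `pearson_r` and all values are hypothetical; in the notebook itself, `DataFrame.corr` does this work.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy records (not the insurance data)
charges_toy = [2000, 4100, 5900, 8000]
features_toy = {
    "age": [20, 30, 40, 50],
    "children": [1, 0, 2, 1],
}

# Rank features by the strength (absolute value) of their correlation with charges
ranked = sorted(
    features_toy,
    key=lambda name: abs(pearson_r(features_toy[name], charges_toy)),
    reverse=True,
)
print(ranked)  # ['age', 'children']
```

Sorting by absolute value keeps strongly negative relationships near the top, which a plain descending sort would push to the bottom.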

## 2. Logistic Regression
- **Purpose:** Predict the likelihood of a high-cost claim (e.g., charges above the median).
- **How:**  
    - Create a binary target: 1 if charges > median, else 0.
    - Fit a logistic regression model using predictors like age, BMI, children, and smoker status.
- **Interpretation:** Coefficients show how each factor affects the odds of a high-cost claim.
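
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to explain. The coefficient values below are hypothetical stand-ins for `logreg.coef_[0]`, not results from this dataset.

```python
from math import exp

def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase in the predictor."""
    return exp(coef)

# Hypothetical coefficients (placeholders, not fitted values)
example_coefs = {"age": 0.05, "bmi": 0.02, "smoker_yes": 3.0}
odds_ratios = {name: odds_ratio(coef) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f} per one-unit increase")
```

An odds ratio above 1 raises the odds of a high-cost claim and one below 1 lowers them, holding the other predictors fixed. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so its coefficients are shrunk toward zero and the ratios should be read as approximate.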

## 3. Chi-Square Test
- **Purpose:** Assess the association between categorical variables (e.g., region and high-cost claims).
- **How:**  
    - Build a contingency table (e.g., region vs. high-cost status).
    - Run a chi-square test to check for independence.
- **Interpretation:** A significant p-value (< 0.05) suggests a relationship between the variables.
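
The statistic behind the test compares observed counts with the counts expected under independence. Below is a hand-rolled sketch on a hypothetical 2x4 table (high-cost status by region); the function name `chi_square_stat` and the counts are invented for illustration.

```python
def chi_square_stat(table):
    """Return the Pearson chi-square statistic and degrees of freedom for a 2D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

# Hypothetical counts: rows = low/high cost, columns = four regions
toy_table = [[10, 20, 30, 40],
             [40, 30, 20, 10]]
chi2_value, dof_value = chi_square_stat(toy_table)
print(chi2_value, dof_value)  # 40.0 3
```

`scipy.stats.chi2_contingency` performs the same calculation and also returns the p-value; for 2x2 tables it applies Yates' continuity correction by default, so its statistic can differ from this uncorrected version.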

---

**Tip:** These analyses provide additional evidence for pricing decisions, risk segmentation, and targeted interventions. Use them to validate findings and uncover new patterns in your insurance data.

```python
from sklearn.linear_model import LogisticRegression
from scipy.stats import chi2_contingency

# 3.4 Additional Statistical Analyses for Insurance Strategy

# Only `region` was one-hot encoded earlier, so build the smoker indicator explicitly
df['smoker_yes'] = (df['smoker'] == 'yes').astype(int)

# 1. Correlation Analysis: Identify key drivers of medical costs
corr_matrix = df.corr(numeric_only=True)
print("Correlation with Charges:\n", corr_matrix['charges'].sort_values(ascending=False))

# 2. Logistic Regression: Predict likelihood of high-cost claims (e.g., charges above threshold)

# Create binary target: 1 if charges > median, else 0
df['high_cost'] = (df['charges'] > df['charges'].median()).astype(int)
features = ['age', 'bmi', 'children', 'smoker_yes']  # Example predictors
X = df[features]
y = df['high_cost']

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X, y)
print("Logistic Regression Coefficients:", dict(zip(features, logreg.coef_[0])))

# 3. Chi-Square Test: Association between categorical variables (e.g., region and high-cost claims)

# Recover a single region label per row from the one-hot columns
contingency = pd.crosstab(df['high_cost'], df[['region_northeast','region_northwest','region_southeast','region_southwest']].idxmax(axis=1))
chi2, p, dof, expected = chi2_contingency(contingency)
print(f"Chi-Square Test: chi2 = {chi2:.2f}, p-value = {p:.4f}")
```