## **1. Introduction to the Dataset & Business Context** 🏥📈

### About the Dataset and Business Case

**Dataset:** <font color="violet">**Medical Cost Dataset**</font> 💊💰  
We are analyzing healthcare insurance data, focusing on individual medical costs and related factors. By examining this data, we can understand:

- **Distribution of Demographics and Lifestyle Factors:**
  - **<font color="green">Age</font>**
  - **<font color="green">Sex</font>**
  - **<font color="green">BMI (Body Mass Index)</font>**
  - **<font color="green">Number of Children</font>**
  
- **Patterns in Health and Regional Data:**
  - **<font color="purple">Smoking Status</font>** (`yes` or `no`)
  - **<font color="purple">Region</font>** (`northeast`, `southeast`, `southwest`, `northwest`)
  
- **Impact on Medical Charges:**
  - **<font color="orange">Medical Charges</font>**: Individual medical costs billed by health insurance

In [None]:
!wget https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/102/634/original/medical_cost.zip

--2025-01-09 02:50:56--  https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/102/634/original/medical_cost.zip
Resolving d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)... 18.64.229.71, 18.64.229.172, 18.64.229.135, ...
Connecting to d2beiqkhq929f0.cloudfront.net (d2beiqkhq929f0.cloudfront.net)|18.64.229.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 16425 (16K) [application/zip]
Saving to: ‘medical_cost.zip’


2025-01-09 02:50:57 (12.5 MB/s) - ‘medical_cost.zip’ saved [16425/16425]



In [None]:
!unzip medical_cost.zip

Archive:  medical_cost.zip
  inflating: insurance.csv           


In [None]:
import pandas as pd

df = pd.read_csv('insurance.csv')
df.head()

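|   | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |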


In [None]:
# Summary statistics
df.describe()

|   | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |




---

## **2. Hypothesis Testing Fundamentals** 📊🔍

### 🎯 **What is Hypothesis Testing?**

Hypothesis Testing is a fundamental statistical method used to make decisions or inferences about a population based on sample data. It allows us to determine whether there is enough evidence to support a specific belief or claim about a parameter.

### 🏛️ **Real-World Analogy: The Courtroom**

Imagine a courtroom scenario to understand the basics of hypothesis testing:

- **<font color="magenta">Null Hypothesis (H₀)</font>**: The defendant is innocent.
- **<font color="magenta">Alternative Hypothesis (H₁)</font>**: The defendant is guilty.

The jury examines the evidence (sample data) to decide whether to reject the null hypothesis (innocence) in favor of the alternative hypothesis (guilt).

### 📌 **Key Terminologies**

1. **<font color="magenta">Null Hypothesis (H₀)</font>**:
   - A statement that there is no effect or no difference.
   - It serves as the default or starting assumption.
   - *Example*: In our Medical Cost dataset, H₀ could state that the average medical charges are equal to the industry average of \$13,000.

2. **<font color="magenta">Alternative Hypothesis (H₁)</font>**:
   - A statement that indicates the presence of an effect or a difference.
   - It represents what you aim to support.
   - *Example*: H₁ could state that the average medical charges are different from \$13,000.

3. **<font color="magenta">p-value</font>**:
   - The probability of obtaining test results at least as extreme as the observed results, assuming that H₀ is true.
   - A smaller p-value indicates stronger evidence against H₀.
   - *Example*: A p-value of 0.03 means that, if H₀ were true, there would be a 3% chance of observing a difference in medical charges at least as large as the one in our sample. It is not the probability that H₀ is true.

4. **<font color="magenta">Significance Level (α)</font>**:
   - A threshold set by the researcher (commonly 0.05) to decide whether to reject H₀.
   - If p-value ≤ α, reject H₀.
   - *Example*: With α = 0.05, a p-value of 0.03 leads to rejecting H₀.

5. **<font color="magenta">Test Statistic</font>**:
   - A standardized value calculated from sample data during a hypothesis test.
   - It helps determine the p-value.
   - *Example*: In a Z-Test, the test statistic measures how many standard errors the sample mean lies from the hypothesized population mean.
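
To make the link between a test statistic, the p-value, and α concrete, here is a minimal sketch. The Z value of 2.17 is an illustrative assumption, not a figure from the dataset; it is chosen because it gives a two-tailed p-value close to the 0.03 used in the examples above.

```python
from scipy import stats

alpha = 0.05   # significance level
z_stat = 2.17  # hypothetical test statistic, for illustration only

# Two-tailed p-value: probability of a |Z| at least this large when H0 is true.
# norm.sf(x) is the survival function, i.e. 1 - cdf(x).
p_value = 2 * stats.norm.sf(abs(z_stat))

decision = "Reject H0" if p_value <= alpha else "Fail to reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

Because the p-value falls below α, the decision rule rejects H₀ in this hypothetical case.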

### 🔍 **Types of Errors**

1. **<font color="blue">Type I Error (False Positive)</font>**:
   - Occurs when H₀ is true, but we incorrectly reject it.
   - *Example*: Convicting an innocent defendant.

2. **<font color="blue">Type II Error (False Negative)</font>**:
   - Occurs when H₀ is false, but we fail to reject it.
   - *Example*: Acquitting a guilty defendant.
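
A short simulation can show what a Type I error rate means in practice. The sketch below draws many samples from a population where H₀ is true by construction and counts how often a two-tailed Z-test at α = 0.05 rejects it. The population standard deviation, sample size, and number of trials are assumed values chosen only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
alpha = 0.05
mu_0 = 13000     # H0 is true by construction: samples are drawn with this mean
sigma = 12000    # assumed population standard deviation (illustrative)
n, n_trials = 200, 5000

# Draw all samples at once: one row per simulated study
samples = rng.normal(loc=mu_0, scale=sigma, size=(n_trials, n))
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(n)
z = (samples.mean(axis=1) - mu_0) / standard_errors
p_values = 2 * stats.norm.sf(np.abs(z))

# Every rejection here is a false positive, because H0 holds in every trial
type_i_rate = np.mean(p_values <= alpha)
print(f"Share of true H0 rejected: {type_i_rate:.3f}")
```

The rejection share should land close to α, which is exactly the Type I error risk we accept when choosing the significance level.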

### 💡 **Why Hypothesis Testing Matters in Business**

In the context of our **Medical Cost dataset**, hypothesis testing enables businesses to:

- **Make Informed Decisions**: Determine if changes in policies or demographics significantly impact medical costs.
- **Validate Assumptions**: Confirm whether observed differences in data reflect true population differences or are due to random variation.
- **Optimize Strategies**: Tailor insurance plans and pricing based on statistically significant factors affecting costs.

### 🧩 **Connecting to Our Dataset**

Let's relate these concepts to our **Medical Cost dataset**:

- **Scenario**: An insurance company believes that their average medical charges are aligned with the industry average of \$13,000. They want to test this belief.
  
  - **H₀**: The average medical charge is \$13,000.
  - **H₁**: The average medical charge is not \$13,000.

- By conducting a hypothesis test, the company can determine whether to maintain, adjust, or overhaul their pricing strategy based on statistical evidence.

### 🧮 **Simple Example with Our Dataset**

Suppose we want to test if the average BMI of individuals in our dataset is significantly different from the national average BMI of 30.

- **H₀**: The average BMI is 30.
- **H₁**: The average BMI is not 30.

By performing a hypothesis test, we can assess whether our sample provides enough evidence to conclude that the average BMI in our dataset deviates from the national average.
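
As a rough sketch of this BMI test, the calculation below works directly from the `bmi` column of the `df.describe()` output shown in Section 1, so it does not need the CSV file. The variable names are illustrative, and the national average of 30 is the hypothesized value from the example, not a sourced figure.

```python
import math
from scipy import stats

# Summary statistics for `bmi`, copied from the df.describe() output
bmi_mean, bmi_std, n = 30.663397, 6.098187, 1338
mu_0 = 30  # hypothesized average BMI from the example

standard_error = bmi_std / math.sqrt(n)
z_bmi_mean = (bmi_mean - mu_0) / standard_error
p_value_bmi_mean = 2 * stats.norm.sf(abs(z_bmi_mean))  # two-tailed

print(f"Z = {z_bmi_mean:.2f}, p-value = {p_value_bmi_mean:.2e}")
```

With these numbers the Z statistic is close to 4 and the p-value is far below 0.05, so the sample would lead us to reject H₀ even though the mean BMI is only about 0.66 above 30. With 1,338 observations, small differences can be statistically significant.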

---



## **3. One-Sample Hypothesis Test** 📏🔬

### 🎯 **Objective**
In this section, we'll learn how to perform a **One-Sample Hypothesis Test** using our **Medical Cost dataset**. This test helps us determine whether the average value of a specific variable in our dataset significantly differs from a known or hypothesized population value.

### 🏥 **Business Scenario**

**Scenario:**  
An insurance company claims that their average medical charges align with the industry average of **\$13,000**. The company wants to verify this claim to ensure their pricing strategy is competitive and sustainable.

**Question:**  
*Is the average medical charge in our dataset different from the industry average of \$13,000?*

### 📌 **Step-by-Step Hypothesis Setup**

1. **Define the Hypotheses:**
   - **<font color="magenta">Null Hypothesis (H₀)</font>:**  
     The average medical charge is equal to \$13,000.
     $
     H₀: \mu = 13,000
     $
   
   - **<font color="magenta">Alternative Hypothesis (H₁)</font>:**  
     The average medical charge is not equal to \$13,000.
     $
     H₁: \mu \neq 13,000
     $
   
2. **Choose the Significance Level (α):**  
   Commonly set at **0.05** (5%). This means there's a 5% risk of rejecting H₀ when it's actually true.

3. **Collect and Summarize the Data:**
   - **Sample Mean ($\bar{x}$)**: The average medical charge from our dataset.
   - **Sample Standard Deviation (s)**: Measures the variability of medical charges.
   - **Sample Size (n)**: Number of observations in the dataset.

4. **Calculate the Test Statistic:**  
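   The population standard deviation is unknown, so we estimate it with the sample standard deviation $s$. Strictly, this calls for a t-test, but with a sample this large the t-distribution is practically identical to the standard normal, so we'll use the **Z-Test**.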

   $
   Z = \frac{\bar{x} - \mu_0}{\frac{s}{\sqrt{n}}}
   $
   
   Where:
   - $\bar{x}$ = Sample mean
   - $\mu_0$ = Hypothesized population mean (\$13,000)
   - $s$ = Sample standard deviation
   - $n$ = Sample size

5. **Determine the p-value:**  
   The p-value indicates the probability of observing the sample results, or something more extreme, assuming H₀ is true.

6. **Make a Decision:**  
   - **If p-value ≤ α:** Reject H₀ (evidence suggests the average charge is different from \$13,000).
   - **If p-value > α:** Fail to reject H₀ (insufficient evidence to conclude a difference).

In [None]:
import numpy as np
from scipy import stats

# Calculate sample mean, standard deviation, and sample size
sample_mean = df['charges'].mean()
sample_std = df['charges'].std()
n = df.shape[0]

print(f"Sample Mean (𝑥̄): ${sample_mean:.2f}")
print(f"Sample Standard Deviation (s): ${sample_std:.2f}")
print(f"Sample Size (n): {n}")

Sample Mean (𝑥̄): $13270.42
Sample Standard Deviation (s): $12110.01
Sample Size (n): 1338


In [None]:
# Hypothesized population mean
mu_0 = 13000

# Calculate standard error
standard_error = sample_std / np.sqrt(n)

# Calculate Z statistic
z = (sample_mean - mu_0) / standard_error

print(f"Z-Statistic: {z:.4f}")

Z-Statistic: 0.8168


In [None]:
# Calculate p-value for two-tailed test
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"P-Value: {p_value:.4f}")

P-Value: 0.4140


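- **If p-value ≤ 0.05:**  
  *Reject H₀.*  
  There is significant evidence to suggest that the average medical charge differs from \$13,000.

- **If p-value > 0.05:**  
  *Fail to reject H₀.*  
  There is not enough evidence to conclude that the average medical charge is different from \$13,000.

**Conclusion:**  
The p-value of 0.4140 is greater than 0.05, so we fail to reject H₀. The sample mean of \$13,270.42 is not significantly different from the industry average of \$13,000.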
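
Because the test uses the sample standard deviation, a one-sample t-test is the textbook alternative to the Z-Test. The sketch below compares the two reference distributions using the summary values printed above; it is a sanity check under those rounded inputs, not a replacement for the notebook's calculation.

```python
import math
from scipy import stats

# Values printed by the cells above
sample_mean, sample_std, n = 13270.42, 12110.01, 1338
mu_0 = 13000

test_stat = (sample_mean - mu_0) / (sample_std / math.sqrt(n))

p_z = 2 * stats.norm.sf(abs(test_stat))         # normal reference distribution
p_t = 2 * stats.t.sf(abs(test_stat), df=n - 1)  # t distribution, n - 1 degrees of freedom

print(f"statistic = {test_stat:.4f}, Z p-value = {p_z:.4f}, t p-value = {p_t:.4f}")
```

The two p-values differ only in the later decimal places at this sample size, which is why the Z approximation is acceptable here. With the raw data loaded, `stats.ttest_1samp(df['charges'], popmean=13000)` runs the t-test in a single call.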

**Scenario:**  
A pharmaceutical company claims that their new drug reduces the average recovery time for patients to **7 days**. Researchers collect a sample of patients treated with the drug and perform a one-sample hypothesis test to verify this claim.

**Hypotheses:**
- **H₀:** The average recovery time is 7 days.
- **H₁:** The average recovery time is not 7 days.

**Outcome:**  
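If the test results in a p-value of 0.03, which is less than α = 0.05, researchers would reject H₀ and conclude that the average recovery time differs from the claimed 7 days. The two-sided test alone does not say whether recovery is faster or slower; the sample mean shows the direction.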

---

## **4. Confidence Intervals** 📏🔍

### 🎯 **Objective**
In this section, we'll explore **Confidence Intervals (CI)** and understand how they complement hypothesis testing. Confidence intervals provide a range of plausible values for a population parameter, offering more insight than a single point estimate.

### 📚 **What is a Confidence Interval?**
A **Confidence Interval** is a range of values, derived from sample data, that is likely to contain the true population parameter (e.g., mean) with a specified level of confidence (commonly 95%).

- **95% Confidence Level:**  
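  If we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true population mean. In practice this is summarized as being 95% confident that the interval contains the true mean.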
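
The meaning of the 95% confidence level can be checked with a small simulation. The sketch below repeatedly samples from a population with a known mean, builds a 95% interval from each sample, and reports how often the interval contains that mean. The population parameters, sample size, and number of trials are assumptions made only for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
true_mean, true_std = 13000, 12000   # assumed population values (illustrative)
n, n_trials = 200, 5000
z_critical = stats.norm.ppf(0.975)   # about 1.96 for a 95% interval

samples = rng.normal(true_mean, true_std, size=(n_trials, n))
means = samples.mean(axis=1)
margins = z_critical * samples.std(axis=1, ddof=1) / np.sqrt(n)

# An interval "covers" when the true mean falls between its bounds
covered = (means - margins <= true_mean) & (true_mean <= means + margins)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out near 0.95. Any single interval either contains the true mean or it does not; the 95% describes the long-run behavior of the procedure.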

### 🏥 **Business Relevance in Medical Costs**
Using our **Medical Cost dataset**, confidence intervals help insurance companies:

- **Estimate Average Costs:** Understand the range within which the true average medical charge lies.
- **Make Pricing Decisions:** Set premiums based on reliable estimates of average costs.
- **Assess Risk:** Determine the variability and uncertainty in medical charges to manage financial risk.

### 📌 **Connecting Confidence Intervals to Hypothesis Testing**
- **Hypothesis Testing:**  
  Tests a specific claim (e.g., average charge is \$13,000) and provides a decision to reject or fail to reject the null hypothesis.

- **Confidence Intervals:**  
  Provide a range of plausible values for the population mean and indicate whether a specific value (e.g., \$13,000) lies within this range.

**Key Insight:**  
If the hypothesized value (\$13,000) lies **outside** the 95% CI, it suggests that the average medical charge is significantly different from \$13,000, aligning with rejecting the null hypothesis in hypothesis testing.
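
The correspondence between the two-tailed test and the confidence interval can be sketched in a few lines. The helper `z_test_and_ci` is a hypothetical name introduced here, the sample statistics are the values printed in Section 3, and the hypothesized means of 12,000 and 14,000 are made-up values added for contrast.

```python
import math
from scipy import stats

sample_mean, sample_std, n = 13270.42, 12110.01, 1338  # from the summary cells
alpha = 0.05
standard_error = sample_std / math.sqrt(n)
z_critical = stats.norm.ppf(1 - alpha / 2)

def z_test_and_ci(mu_0):
    """Return (reject_h0, mu_0_inside_ci) for a two-tailed Z-test at level alpha."""
    z = (sample_mean - mu_0) / standard_error
    p_value = 2 * stats.norm.sf(abs(z))
    lower = sample_mean - z_critical * standard_error
    upper = sample_mean + z_critical * standard_error
    return p_value <= alpha, lower <= mu_0 <= upper

for mu_0 in (12000, 13000, 14000):  # 12000 and 14000 are hypothetical values
    reject, inside = z_test_and_ci(mu_0)
    print(f"mu_0 = {mu_0}: reject H0 = {reject}, inside 95% CI = {inside}")
```

For every hypothesized value, the test rejects H₀ exactly when that value falls outside the interval, so the two columns always disagree.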

### 🧮 **Calculating a 95% Confidence Interval for Mean Charges**

Let's calculate the 95% Confidence Interval for the average medical charges in our dataset.

In [None]:
# Sample statistics
sample_mean = df['charges'].mean()
sample_std = df['charges'].std()
n = df.shape[0]

# Display the statistics
print(f"Sample Mean (𝑥̄): ${sample_mean:.2f}")
print(f"Sample Standard Deviation (s): ${sample_std:.2f}")
print(f"Sample Size (n): {n}")

Sample Mean (𝑥̄): $13270.42
Sample Standard Deviation (s): $12110.01
Sample Size (n): 1338


In [None]:
# Define confidence level
confidence_level = 0.95

# Calculate the standard error
standard_error = sample_std / np.sqrt(n)

# Determine the critical value for 95% confidence
z_critical = stats.norm.ppf((1 + confidence_level) / 2)

# Calculate the margin of error
margin_of_error = z_critical * standard_error

# Determine the confidence interval
ci_lower = sample_mean - margin_of_error
ci_upper = sample_mean + margin_of_error

print(f"95% Confidence Interval for Mean Charges: (${ci_lower:.2f}, ${ci_upper:.2f})")

95% Confidence Interval for Mean Charges: ($12621.54, $13919.30)


- **Interpretation:**  
  We are 95% confident that the true average medical charge lies between \$12,621.54 and \$13,919.30.

- **Decision Based on Hypothesis Testing:**
  - **If \$13,000 is within the CI:**  
    *Fail to reject H₀.*  
    There is insufficient evidence to say the average charge differs from \$13,000.

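  - **If \$13,000 is outside the CI:**  
    *Reject H₀.*  
    There is sufficient evidence to conclude that the average charge is different from \$13,000.

  - **Result for our data:**  
    \$13,000 lies inside the interval (\$12,621.54, \$13,919.30), so we fail to reject H₀. This agrees with the Z-test p-value of 0.4140 from Section 3.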

**Scenario:**  
An insurance company wants to estimate the average medical charges for their policyholders to set appropriate premium rates.

**Action:**  
They calculate a 95% confidence interval for the mean charges.

**Outcome:**  
If the interval is (\$12,800, \$13,200), and their current pricing is based on \$13,000, they can be confident that their pricing strategy is aligned with the actual average charges.
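
An interval as narrow as the hypothetical (\$12,800, \$13,200) implies a margin of error of \$200, much tighter than the roughly \$649 margin computed above. As a rough sketch, the margin-of-error formula can be inverted to estimate how many observations that would take, assuming the standard deviation of charges stays near the sample value.

```python
import math
from scipy import stats

sample_std = 12110.01   # sample standard deviation of charges, from the cells above
target_margin = 200     # half-width of the hypothetical ($12,800, $13,200) interval
z_critical = stats.norm.ppf(0.975)

# Margin of error E = z * s / sqrt(n)  =>  n = (z * s / E)^2, rounded up
required_n = math.ceil((z_critical * sample_std / target_margin) ** 2)
print(f"Observations needed for a ±${target_margin} margin: {required_n}")
```

Under these assumptions the answer is on the order of 14,000 policyholders, roughly ten times the 1,338 rows in the dataset, because the margin of error shrinks only with the square root of the sample size.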

---

## **5. Applied Scenarios & Practice Problems** 🧩💡

### 🎯 **Objective**
To reinforce the concepts learned, we'll work through real-life scenarios using different columns from our **Medical Cost dataset**. These examples will integrate **Hypothesis Testing Fundamentals**, **One-Sample Tests**, and **Confidence Intervals**. By the end of this section, you'll be able to apply these concepts confidently to various data-driven questions.

### 📚 **Scenario 1: Evaluating BMI Influence on Medical Charges**

**Business Context:**  
A health organization wants to assess whether individuals with a BMI greater than 30 have average medical charges that differ from the \$13,000 industry average. This insight can help in designing targeted health programs.

#### 📝 **Question 2:**
*Do individuals with a BMI greater than 30 have average medical charges significantly different from \$13,000?*

#### 🔍 **Step-by-Step Solution**

1. **Define the Hypotheses:**
   - **<font color="magenta">Null Hypothesis (H₀)</font>:**  
     The average medical charge for individuals with BMI > 30 is \$13,000.
     $
     H₀: \mu = 13,000
     $
     
   - **<font color="magenta">Alternative Hypothesis (H₁)</font>:**  
     The average medical charge for individuals with BMI > 30 is not \$13,000.
     $
     H₁: \mu \neq 13,000
     $

2. **Choose the Significance Level (α):**  
   Set at **0.05** (5%).

3. **Filter the Dataset for Individuals with BMI > 30:**

In [None]:
# Filter the dataset for individuals with BMI > 30
df_bmi_over_30 = df[df['bmi'] > 30]

In [None]:
# 4. Calculate sample mean, standard deviation, and sample size
sample_mean_bmi = df_bmi_over_30['charges'].mean()
sample_std_bmi = df_bmi_over_30['charges'].std()
n_bmi = df_bmi_over_30.shape[0]

print(f"Sample Mean (𝑥̄): ${sample_mean_bmi:.2f}")
print(f"Sample Standard Deviation (s): ${sample_std_bmi:.2f}")
print(f"Sample Size (n): {n_bmi}")

Sample Mean (𝑥̄): $15560.93
Sample Standard Deviation (s): $14563.06
Sample Size (n): 705


In [None]:
# 5. Hypothesized population mean
mu_0 = 13000

# Calculate standard error
standard_error_bmi = sample_std_bmi / np.sqrt(n_bmi)

# Calculate Z statistic
z_bmi = (sample_mean_bmi - mu_0) / standard_error_bmi

print(f"Z-Statistic: {z_bmi:.4f}")

Z-Statistic: 4.6692


In [None]:
# 6. Calculate p-value for two-tailed test
p_value_bmi = 2 * (1 - stats.norm.cdf(abs(z_bmi)))
print(f"P-Value: {p_value_bmi:.4f}")

P-Value: 0.0000


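7. **Make a Decision:**
   - **If p-value ≤ 0.05:** Reject H₀.
   - **If p-value > 0.05:** Fail to reject H₀.
   - **Result:** With Z = 4.6692, the p-value is far below 0.05 (it prints as 0.0000 at four decimal places), so we reject H₀. Individuals with BMI > 30 have a mean charge of \$15,560.93, which is significantly different from \$13,000.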
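
A p-value printed as 0.0000 hides how small it really is. The sketch below recomputes the test from the subgroup values printed above, shows the p-value in scientific notation, and adds a 95% confidence interval for the subgroup mean. It reuses the notebook's variable names with rounded inputs, so the later decimals may differ slightly from a run on the raw data.

```python
import math
from scipy import stats

# Values printed by the cells above for the BMI > 30 subgroup
sample_mean_bmi, sample_std_bmi, n_bmi = 15560.93, 14563.06, 705
mu_0 = 13000

standard_error_bmi = sample_std_bmi / math.sqrt(n_bmi)
z_bmi = (sample_mean_bmi - mu_0) / standard_error_bmi

# norm.sf keeps precision in the far tail, where 1 - cdf loses digits
p_value_bmi = 2 * stats.norm.sf(abs(z_bmi))

# 95% confidence interval for the subgroup mean
z_critical = stats.norm.ppf(0.975)
ci_lower = sample_mean_bmi - z_critical * standard_error_bmi
ci_upper = sample_mean_bmi + z_critical * standard_error_bmi

print(f"Z = {z_bmi:.4f}, p-value = {p_value_bmi:.2e}")
print(f"95% CI: (${ci_lower:.2f}, ${ci_upper:.2f})")
```

The whole interval sits above \$13,000, which matches the rejection of H₀ and also shows the direction of the difference: charges in this subgroup are higher than the benchmark.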

### 📚 **Scenario 2: Analyzing the Effect of Smoking Status on Medical Charges**

**Business Context:**  
An insurance company suspects that smokers incur higher medical charges than non-smokers. They aim to quantify this difference to adjust their premium rates accordingly.

#### 📝 **Question 3:**
*Is the average medical charge for smokers significantly higher than for non-smokers?*

#### 🔍 **Step-by-Step Solution**

1. **Define the Hypotheses:**
   - **<font color="magenta">Null Hypothesis (H₀)</font>:**  
     The average medical charge for smokers is equal to that of non-smokers.
     $
     H₀: \mu_{\text{smokers}} = \mu_{\text{non-smokers}}
     $
     
   - **<font color="magenta">Alternative Hypothesis (H₁)</font>:**  
     The average medical charge for smokers is higher than that of non-smokers.
     $
     H₁: \mu_{\text{smokers}} > \mu_{\text{non-smokers}}
     $

2. **Choose the Significance Level (α):**  
   Set at **0.05** (5%).

3. **Separate the Dataset into Smokers and Non-Smokers:**

In [None]:
# Separate smokers and non-smokers
smokers = df[df['smoker'] == 'yes']['charges']
non_smokers = df[df['smoker'] == 'no']['charges']

In [None]:
# 4. Calculate means and standard deviations
mean_smokers = smokers.mean()
std_smokers = smokers.std()
n_smokers = smokers.shape[0]

mean_non_smokers = non_smokers.mean()
std_non_smokers = non_smokers.std()
n_non_smokers = non_smokers.shape[0]

print(f"Smokers - Mean: ${mean_smokers:.2f}, Std Dev: ${std_smokers:.2f}, Sample Size: {n_smokers}")
print(f"Non-Smokers - Mean: ${mean_non_smokers:.2f}, Std Dev: ${std_non_smokers:.2f}, Sample Size: {n_non_smokers}")

Smokers - Mean: $32050.23, Std Dev: $11541.55, Sample Size: 274
Non-Smokers - Mean: $8434.27, Std Dev: $5993.78, Sample Size: 1064


In [None]:
# 5. Calculate standard error
standard_error = np.sqrt((std_smokers**2) / n_smokers + (std_non_smokers**2) / n_non_smokers)

# Calculate Z statistic
z = (mean_smokers - mean_non_smokers) / standard_error

print(f"Z-Statistic: {z:.4f}")

Z-Statistic: 32.7519


In [None]:
# 6. Calculate p-value for one-tailed test
# norm.sf(z) equals 1 - norm.cdf(z) but stays accurate for very large z
p_value = stats.norm.sf(z)
print(f"P-Value: {p_value:.4f}")

P-Value: 0.0000


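7. **Make a Decision:**
   - **If p-value ≤ 0.05:** Reject H₀.
   - **If p-value > 0.05:** Fail to reject H₀.
   - **Result:** With Z = 32.7519, the p-value is effectively zero, so we reject H₀. Smokers' mean charge of \$32,050.23 is significantly higher than non-smokers' mean charge of \$8,434.27.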
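
As a cross-check on the hand-computed two-sample Z statistic, SciPy's `ttest_ind_from_stats` can run Welch's t-test from the group summaries alone. This is a different reference distribution (t instead of normal), swapped in here only for comparison; the inputs are the rounded values printed above.

```python
from scipy import stats

# Group summaries printed by the cells above
mean_smokers, std_smokers, n_smokers = 32050.23, 11541.55, 274
mean_non_smokers, std_non_smokers, n_non_smokers = 8434.27, 5993.78, 1064

# equal_var=False requests Welch's t-test, which does not assume equal variances
t_stat, p_two_sided = stats.ttest_ind_from_stats(
    mean_smokers, std_smokers, n_smokers,
    mean_non_smokers, std_non_smokers, n_non_smokers,
    equal_var=False,
)

# Convert to a one-tailed p-value for H1: smokers > non-smokers
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

print(f"t = {t_stat:.4f}, one-tailed p-value = {p_one_sided:.3e}")
```

Welch's statistic uses the same standard error as the Z statistic above, so the two values agree; only the reference distribution for the p-value changes, and the conclusion is the same.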

---