**About Dataset**

This dataset records personal attributes (age, sex, BMI, number of dependents, smoking status) and geographic region for insured individuals, together with their medical insurance charges. It can be used to study how these features influence insurance costs.

Age: The insured person's age.

Sex: Gender (male or female) of the insured.

BMI (Body Mass Index): A measure of body fat based on height and weight.

Children: The number of dependents covered.

Smoker: Whether the insured is a smoker (yes or no).

Region: The geographic area of coverage.

Charges: The medical insurance costs incurred by the insured person.

**Loading Dataset and importing Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats # For t-tests
from scipy.stats import chi2_contingency # For chi-square test
# Load the dataset
df = pd.read_csv(r"C:\Users\OWNER\Desktop\project for GDS 2025\insurance.csv")
# Drop the 'region' column, which the analysis below does not use
df = df.drop(columns='region')


**Descriptive Statistics**

In [None]:
# Display the first few rows of the DataFrame. By default, this will show the first 5 rows, but here we specify 10 rows.
df.head(10)

In [None]:
# We check for the presence of missing values in the columns and count them. This is a strategy for understanding data quality.
df.isna().sum()

In [None]:
# Summary statistics of the DataFrame give a quick overview of the numerical columns.
print("Summary Statistics:")
print(df.describe())



In [None]:
# For categorical columns, we can look at value counts
print("\nValue counts for 'sex':")
print(df['sex'].value_counts())
print("\nValue counts for 'smoker':")
print(df['smoker'].value_counts())
# 'region' was dropped when the data was loaded, so it has no value counts to show.

# Defining our categorical columns
categorical_cols = ['sex', 'smoker']

# Determining the number of subplots needed.
num_cols = len(categorical_cols)
# Create 1 row, num_cols columns of subplots with a specified figure size
fig, axes = plt.subplots(1, num_cols, figsize=(15, 5))

# Ensure axes is an array even if num_cols is 1, to simplify iteration
if num_cols == 1:
    axes = [axes]

# Loop through each categorical column and create a bar chart on its own subplot
for i, col in enumerate(categorical_cols):
    # Get the value counts for the current column
    value_counts = df[col].value_counts()

    # Plot a bar chart on the i-th subplot
    # ax=axes[i] links the plot to the specific subplot
    value_counts.plot(kind='bar', ax=axes[i], color='purple')

    # Set the title for the current subplot
    axes[i].set_title(f'Value Counts for "{col}"')

    # Set the y-axis label
    axes[i].set_ylabel('Count')

    # Keep x-axis labels horizontal (rotation=0); pandas bar charts rotate them 90 degrees by default
    axes[i].tick_params(axis='x', rotation=0)

# Adjust layout to prevent titles/labels from overlapping
plt.tight_layout()

# Display the plots
plt.show()

# Optionally, print the numerical value counts as well
print("\n--- Numerical Value Counts ---")
for col in categorical_cols:
    print(f"\nValue counts for '{col}':")
    print(df[col].value_counts())


Grouping by a variable in the DataFrame: the grouping column becomes the index of the result, and the summary metrics for charges are computed separately within each group.

In [None]:
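# Total, mean, median, standard deviation and variance of charges for each sex
df.groupby('sex').charges.agg(['sum', 'mean', 'median', 'std', 'var'])
# Uncomment one of the lines below to group by smoker status or by age instead
#df.groupby('smoker').charges.agg(['sum', 'mean', 'median', 'std', 'var'])
#df.groupby('age').charges.agg(['sum', 'mean', 'median', 'std', 'var'])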
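
The same pattern extends to more than one grouping key. The sketch below uses a small hypothetical table (`toy`, with made-up values, not rows from insurance.csv) to show how grouping by both sex and smoker status produces one summary row per combination.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female", "male"],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
    "charges": [1000.0, 2000.0, 9000.0, 12000.0, 3000.0, 10000.0],
})

# Group by two keys; the result is indexed by every (sex, smoker) pair present
summary = toy.groupby(["sex", "smoker"])["charges"].agg(["count", "mean", "median"])
print(summary)
```

Replacing `toy` with `df` should give the equivalent breakdown for the real data, assuming the column names match.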

**Probability Distributions**

In [None]:

print("\n Visualizing Distributions ")

plt.figure(figsize=(15, 5))

# Distribution of Age
plt.subplot(1, 3, 1)
sns.histplot(df['age'], kde=True, color='skyblue', bins=15)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency / Density')
plt.grid(axis='y', alpha=0.75)

# Distribution of BMI
plt.subplot(1, 3, 2)
sns.histplot(df['bmi'], kde=True, color='lightcoral', bins=15)
plt.title('Distribution of BMI')
plt.xlabel('BMI')
plt.ylabel('Frequency / Density')
plt.grid(axis='y', alpha=0.75)

# Distribution of Charges (often right-skewed)
plt.subplot(1, 3, 3)
sns.histplot(df['charges'], kde=True, color='lightgreen', bins=20)
plt.title('Distribution of Charges')
plt.xlabel('Charges')
plt.ylabel('Frequency / Density')
plt.grid(axis='y', alpha=0.75)

plt.tight_layout()
plt.show()


Underweight: Less than 18.5 kg/m²

Normal (Healthy) weight: 18.5 to 24.9 kg/m²

Overweight: 25.0 to 29.9 kg/m²

Obese (Class I): 30.0 to 34.9 kg/m²

Obese (Class II): 35.0 to 39.9 kg/m²

Obese (Class III): 40.0 kg/m² or greater (also known as Morbidly Obese)
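
These categories can be attached to each record with `pd.cut`. The following is a minimal sketch on hypothetical BMI values; the variable names (`bmi`, `bin_edges`, `bin_labels`, `bmi_category`) are illustrative, and the bins are assumed to be left-closed so that a value such as 24.95 counts as normal weight.

```python
import pandas as pd

# Hypothetical BMI values standing in for df['bmi']
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 34.9, 35.0, 39.9, 40.0, 53.1])

# Left-closed bins (assumption): [0, 18.5), [18.5, 25), [25, 30), [30, 35), [35, 40), [40, inf)
bin_edges = [0, 18.5, 25.0, 30.0, 35.0, 40.0, float("inf")]
bin_labels = [
    "Underweight",
    "Normal weight",
    "Overweight",
    "Obese (Class I)",
    "Obese (Class II)",
    "Obese (Class III)",
]

# right=False makes each bin include its lower edge and exclude its upper edge
bmi_category = pd.cut(bmi, bins=bin_edges, labels=bin_labels, right=False)

print(pd.DataFrame({"bmi": bmi, "bmi_category": bmi_category}))
print(bmi_category.value_counts(sort=False))
```

Applied to the notebook's data, the same call on `df['bmi']` would give a categorical column that can be used as a grouping key.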

**Confidence Intervals Demo**

In [None]:



# Confidence Interval for the Mean of 'Charges'
# We'll use a 95% confidence level.

sample_mean_charges = df['charges'].mean()
sample_std_charges = df['charges'].std()
sample_size_charges = len(df['charges'])

# Degrees of freedom for t-distribution (sample size - 1)
degrees_freedom = sample_size_charges - 1

# Calculate the standard error of the mean (SEM)
sem_charges = sample_std_charges / np.sqrt(sample_size_charges)

# Calculate the confidence interval using scipy.stats.t.interval
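# The first argument is the confidence level (0.95 for a 95% CI); it is passed positionally
# because its keyword name differs between SciPy versions (alpha in older releases, confidence in newer ones)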
confidence_level = 0.95
lower_bound, upper_bound = stats.t.interval(
    confidence_level,
    degrees_freedom,
    loc=sample_mean_charges,
    scale=sem_charges
)

print(f"\nSample Mean of Charges: {sample_mean_charges:,.2f}")
print(f"Sample Standard Deviation of Charges: {sample_std_charges:,.2f}")
print(f"Sample Size: {sample_size_charges}")
print(f"Standard Error of the Mean (SEM): {sem_charges:,.2f}")
print(f"95% Confidence Interval for Mean Charges: ({lower_bound:,.2f}, {upper_bound:,.2f})")
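
To make clear what `stats.t.interval` computes, the sketch below rebuilds the interval by hand as mean ± t_crit × SEM and compares it with the SciPy helper. It runs on a synthetic, right-skewed sample (an assumption made only so the example is self-contained), not on the insurance data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for df['charges'] (not real data)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=9.0, sigma=0.9, size=500)

n = sample.size
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 matches the default of pandas' .std()

# Two-sided interval: mean ± t_crit * SEM, with t_crit taken from the t-distribution
confidence_level = 0.95
t_crit = stats.t.ppf(1 - (1 - confidence_level) / 2, df=n - 1)
manual_lower = mean - t_crit * sem
manual_upper = mean + t_crit * sem

# The same interval from the helper used in the cell above
lib_lower, lib_upper = stats.t.interval(confidence_level, n - 1, loc=mean, scale=sem)

print(f"Manual: ({manual_lower:,.2f}, {manual_upper:,.2f})")
print(f"SciPy:  ({lib_lower:,.2f}, {lib_upper:,.2f})")
```

The two printed intervals should agree up to floating-point rounding.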

**Hypothesis Testing Demo**

In [None]:

#Independent Samples t-test
print("\nIndependent Samples t-test: Smoker vs. Non-Smoker Charges")

# Hypothesis: Do smokers incur significantly different medical charges than non-smokers?
# Null Hypothesis (H0): The mean charges for smokers are equal to the mean charges for non-smokers.
# Alternative Hypothesis (H1): The mean charges for smokers are different from the mean charges for non-smokers.

# Separate the data into two groups based on 'smoker' status
smoker_charges = df[df['smoker'] == 'yes']['charges']
non_smoker_charges = df[df['smoker'] == 'no']['charges']

print(f"Mean charges for smokers: ${smoker_charges.mean():,.2f} (n={len(smoker_charges)})")
print(f"Mean charges for non-smokers: ${non_smoker_charges.mean():,.2f} (n={len(non_smoker_charges)})")

# Perform the independent samples t-test
# `equal_var=False` is often recommended when sample sizes or standard deviations are very different (Welch's t-test)
t_stat, p_value = stats.ttest_ind(smoker_charges, non_smoker_charges, equal_var=False)

print(f"\nT-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3e}") # Using scientific notation for small p-values

# Decision rule (e.g., alpha = 0.05)
alpha = 0.05
if p_value < alpha:
    print(f"Since p-value ({p_value:.3e}) < alpha ({alpha}), we reject the null hypothesis.")
    print("Conclusion: There is a statistically significant difference in mean medical charges between smokers and non-smokers.")
else:
    print(f"Since p-value ({p_value:.3e}) >= alpha ({alpha}), we fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant difference in mean medical charges between smokers and non-smokers.")
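
For readers who want to see what `equal_var=False` does, here is a minimal sketch of Welch's t-statistic and its Welch-Satterthwaite degrees of freedom computed by hand and checked against `stats.ttest_ind`. The two groups are synthetic, and the helper `welch_t` is a hypothetical name introduced for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic groups with unequal sizes and variances (not the insurance data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=32000, scale=11000, size=60)
group_b = rng.normal(loc=8500, scale=6000, size=240)

def welch_t(x, y):
    """Return Welch's t-statistic and the Welch-Satterthwaite degrees of freedom."""
    vx = x.var(ddof=1) / x.size  # squared standard error of the mean of x
    vy = y.var(ddof=1) / y.size  # squared standard error of the mean of y
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    dof = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    return t, dof

t_manual, dof_manual = welch_t(group_a, group_b)
p_manual = 2 * stats.t.sf(abs(t_manual), dof_manual)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Manual: t={t_manual:.3f}, dof={dof_manual:.1f}, p={p_manual:.3e}")
print(f"SciPy:  t={t_scipy:.3f}, p={p_scipy:.3e}")
```

Unlike the pooled-variance test, this version does not assume both groups share a variance, which is why it suits groups of very different size and spread.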



In [None]:
# --- Chi-square Test for Independence ---
print("--- Chi-square Test for Independence: Sex and Smoker Status ---")

# Hypothesis: Is there a relationship (dependence) between a person's sex and whether they are a smoker?
# Null Hypothesis (H0): Sex and Smoker status are independent (no relationship).
# Alternative Hypothesis (H1): Sex and Smoker status are dependent (there is a relationship).

# Create a contingency table (cross-tabulation)
contingency_table = pd.crosstab(df['sex'], df['smoker'])
print("\nContingency Table (Sex vs. Smoker):")
print(contingency_table)

# Perform the Chi-square test
chi2, p_value_chi2, dof, expected = chi2_contingency(contingency_table)

print(f"\nChi-square statistic: {chi2:.3f}")
print(f"P-value: {p_value_chi2:.3f}")
print(f"Degrees of Freedom: {dof}")
# print("Expected Frequencies Table:")
# print(pd.DataFrame(expected, index=contingency_table.index, columns=contingency_table.columns))

# Decision rule (e.g., alpha = 0.05)
alpha = 0.05
if p_value_chi2 < alpha:
    print(f"Since p-value ({p_value_chi2:.3f}) < alpha ({alpha}), we reject the null hypothesis.")
    print("Conclusion: There is a statistically significant association between sex and smoker status.")
else:
    print(f"Since p-value ({p_value_chi2:.3f}) >= alpha ({alpha}), we fail to reject the null hypothesis.")
    print("Conclusion: There is no statistically significant association between sex and smoker status.")
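
The expected frequencies behind the chi-square statistic come from the table margins: each expected count is (row total × column total) / grand total. The sketch below works this out on a hypothetical 2×2 table with made-up counts. Note that `chi2_contingency` applies Yates' continuity correction by default when the table has one degree of freedom, so the sketch passes `correction=False` to make the hand calculation directly comparable.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table (made-up counts, not from insurance.csv)
observed = pd.DataFrame(
    {"no": [55, 50], "yes": [10, 15]},
    index=pd.Index(["female", "male"], name="sex"),
)
observed.columns.name = "smoker"

row_totals = observed.sum(axis=1).to_numpy()
col_totals = observed.sum(axis=0).to_numpy()
grand_total = observed.to_numpy().sum()

# Expected count under independence: row total * column total / grand total
expected_manual = np.outer(row_totals, col_totals) / grand_total

# Pearson chi-square statistic, without continuity correction
chi2_manual = ((observed.to_numpy() - expected_manual) ** 2 / expected_manual).sum()

chi2_lib, p_lib, dof_lib, expected_lib = chi2_contingency(observed, correction=False)

print(pd.DataFrame(expected_manual, index=observed.index, columns=observed.columns))
print(f"Manual chi-square: {chi2_manual:.4f}, SciPy chi-square: {chi2_lib:.4f}")
```

With the default `correction=True`, as in the cell above, the statistic for a 2×2 table will be somewhat smaller than the uncorrected value.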
