# Medical Insurance Charges Analysis

This project analyzes medical insurance cost data to identify the key drivers of insurance charges.
The dataset includes demographic and lifestyle factors such as age, BMI, smoking status, region, and number of children.

The main goal is to explore patterns in the data and extract insights that could support pricing strategies in the insurance industry.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")

In [None]:
# Absolute, machine-specific path: adjust it to wherever medicall.csv is stored
df = pd.read_csv("/Users/zohrehsamieekadkani/Desktop/books/python-files/Abgabe/medicall.csv")
df.head()

In [None]:
# Data overview: shape, column dtypes, and non-null counts
print("Dataset Shape:", df.shape)
df.info()

In [None]:
# Missing values per column (duplicates are counted in the next cell)
df.isnull().sum()


In [None]:
df.duplicated().sum()
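
If the duplicate count above is non-zero, one option is to drop the repeated rows before computing statistics. The following is a minimal sketch on a hypothetical toy frame (the column values are made up for illustration, not taken from the dataset). Whether dropping is appropriate depends on whether identical rows are true duplicates or distinct policyholders who happen to share the same attributes.

```python
import pandas as pd

# Hypothetical toy rows standing in for the insurance data;
# the last row repeats the first one exactly.
toy = pd.DataFrame({
    "age": [19, 33, 19],
    "smoker": ["yes", "no", "yes"],
    "charges": [16884.92, 4449.46, 16884.92],
})

# Count rows that are identical to an earlier row
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and rebuild the index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicates found: {n_duplicates}, rows kept: {len(deduped)}")
```

On the real data the same two calls would be applied to `df` instead of `toy`.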


In [None]:
# Summary Statistics
df.describe()

In [None]:
# Distribution of Charges
plt.figure(figsize=(10, 6))
sns.histplot(df["charges"], bins=30, kde=True)
plt.title("Distribution of Insurance Charges")
plt.xlabel("Charges ($)")
plt.ylabel("Count")
plt.show()

In [None]:
# Smoker vs Non-Smoker Charges
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x="smoker", y="charges")
plt.title("Insurance Charges by Smoking Status")
plt.xlabel("Smoker")
plt.ylabel("Charges ($)")
plt.show()

In [None]:
# Average Charges Comparison (Smoker vs Non-Smoker)
smoker_mean = df.groupby("smoker")["charges"].mean()
smoker_mean

In [None]:
difference_percent = ((smoker_mean["yes"] - smoker_mean["no"]) / smoker_mean["no"]) * 100
print(f"Smokers pay {difference_percent:.1f}% more on average.")
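
The percentage above is based on group means, which a few very large charges can pull upward. If the histogram of charges shows a long right tail, it may be worth reporting the same comparison on medians as well. The sketch below uses a hypothetical toy frame with made-up numbers; only the column names `smoker` and `charges` come from the notebook.

```python
import pandas as pd

# Hypothetical toy data: three smokers and three non-smokers
toy = pd.DataFrame({
    "smoker": ["yes", "yes", "yes", "no", "no", "no"],
    "charges": [20000.0, 30000.0, 40000.0, 5000.0, 10000.0, 30000.0],
})

# Mean and median charges per smoking status
stats = toy.groupby("smoker")["charges"].agg(["mean", "median"])

# Relative difference (smokers vs non-smokers) for each statistic, in percent
pct_more = (stats.loc["yes"] - stats.loc["no"]) / stats.loc["no"] * 100

print(stats)
print(pct_more.round(1))
```

If the mean-based and median-based percentages differ a lot, that suggests outliers are influencing the average.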

In [None]:
# Correlation of the numeric features with charges (smoker is categorical, so it is not included here)
numeric_cols = ["age", "bmi", "children", "charges"]
corr = df[numeric_cols].corr()["charges"].drop("charges").sort_values(ascending=False)
corr
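
The correlation above covers only numeric columns, so smoking status is left out. One way to put it on the same scale is to encode it as a 0/1 flag and include it in the Pearson correlation; with a binary variable this is the point-biserial correlation. The sketch below is illustrative only: the toy values are hypothetical and `smoker_flag` is a name introduced here, not a column in the dataset.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60, 25],
    "smoker": ["no", "no", "no", "yes", "yes", "yes"],
    "charges": [2000.0, 4000.0, 6000.0, 30000.0, 35000.0, 20000.0],
})

# Encode smoking status as 1 (smoker) / 0 (non-smoker); assumes "yes"/"no" labels
toy["smoker_flag"] = (toy["smoker"] == "yes").astype(int)

# Pearson correlation of each feature with charges, strongest first
corr_with_smoker = (
    toy[["age", "smoker_flag", "charges"]]
    .corr()["charges"]
    .drop("charges")
    .sort_values(ascending=False)
)

print(corr_with_smoker)
```

Correlation measures linear association only, so the ranking is a rough guide rather than proof of which factor matters most.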

In [None]:
# Average Charges by Region
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x="region", y="charges", estimator="mean")
plt.title("Average Charges by Region")
plt.xlabel("Region")
plt.ylabel("Average Charges ($)")
plt.xticks(rotation=45)
plt.show()
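
The bar plot shows the regional means visually; a small table makes the ranking and the group sizes explicit. This is a sketch on hypothetical toy values (the region labels and amounts are placeholders), using the same `region` and `charges` column names as the notebook.

```python
import pandas as pd

# Hypothetical toy data; labels and amounts are placeholders
toy = pd.DataFrame({
    "region": ["southeast", "southeast", "northwest", "northwest", "northeast"],
    "charges": [10000.0, 20000.0, 8000.0, 12000.0, 9000.0],
})

# Mean, median, and number of rows per region, highest mean first
region_summary = (
    toy.groupby("region")["charges"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)

print(region_summary)
```

Including `count` helps judge whether a high regional mean rests on many policyholders or only a few.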

In [None]:
# BMI Categories Analysis
df_analysis = df.copy()

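# Left-closed bins match the usual cut-offs: <18.5, 18.5 to <25, 25 to <30, >=30.
# (Edges of 24.9/29.9 with the default right-closed bins would put a BMI of
# 18.5 in "Underweight" and 24.95 in "Overweight".)
df_analysis["bmi_category"] = pd.cut(
    df_analysis["bmi"],
    bins=[0, 18.5, 25, 30, 100],
    right=False,
    labels=["Underweight", "Normal", "Overweight", "Obese"]
)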

plt.figure(figsize=(10, 6))
sns.boxplot(data=df_analysis, x="bmi_category", y="charges")
plt.title("Charges by BMI Category")
plt.xlabel("BMI Category")
plt.ylabel("Charges ($)")
plt.xticks(rotation=45)
plt.show()
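
How `pd.cut` treats values that fall exactly on a bin edge determines which category they land in. The sketch below checks boundary values on hypothetical toy data, assuming the common convention of under 18.5, 18.5 to under 25, 25 to under 30, and 30 or above, and then summarises charges per category.

```python
import pandas as pd

# Hypothetical toy data chosen to sit on or near the category boundaries
toy = pd.DataFrame({
    "bmi": [18.4, 18.5, 24.95, 25.0, 29.95, 30.0],
    "charges": [3000.0, 4000.0, 5000.0, 6000.0, 7000.0, 12000.0],
})

# right=False makes each bin include its lower edge and exclude its upper edge
toy["bmi_category"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, 100],
    right=False,
    labels=["Underweight", "Normal", "Overweight", "Obese"],
)

# Median charge and row count per category (only categories that occur)
bmi_summary = (
    toy.groupby("bmi_category", observed=True)["charges"]
    .agg(["median", "count"])
)

print(toy[["bmi", "bmi_category"]])
print(bmi_summary)
```

A table like `bmi_summary` complements the box plot by showing how many rows each category contains.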

In [None]:
# Charges vs Children
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x="children", y="charges")
plt.title("Charges Distribution by Number of Children")
plt.xlabel("Number of Children")
plt.ylabel("Charges ($)")
plt.show()


# Key Insights

- Smoking status is the strongest driver observed: smokers' average charges are far higher than non-smokers' (see the box plot and the percentage difference computed above).
- Age has a moderate positive correlation with charges.
- BMI is also positively correlated with charges.
- The Southeast region shows the highest average charges.
- The Obese BMI category tends to have higher charges than the other BMI categories.

# Conclusion

This analysis highlights smoking status as the strongest driver of insurance charges among the factors examined, with age and BMI also positively associated with charges.
These insights could help insurance companies design risk-based pricing models.