# U.S. Medical Insurance Costs

Ideas for analysis
- age distribution (everyone, childless people, people with n children)
- distribution of costs for people with 0, 1, 2, ... kids
- distribution of the regions people live in: do the costs vary between regions, and are the age distributions similar across regions?
- distribution of BMI vs age

In [None]:
import pandas as pd

# Load data into a DataFrame
medical_data = pd.read_csv('insurance.csv')

print(medical_data.head())  # Display first few rows

1. Age distributions

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Select the age column (a pandas Series)
ages = medical_data["age"]

# Calculate the mean and the median
average = np.mean(ages)
median = np.median(ages)

# Plot a histogram of the ages and mark the mean with a dashed line
plt.figure(figsize=(6, 5))
plt.hist(ages, bins=20, density=False, alpha=0.7, color='b', edgecolor='black')
plt.axvline(average, color='r', linestyle='dashed', linewidth=2, label=f'Mean: {average:.1f}')
plt.title("Age distribution (everyone)", fontsize=15)
plt.ylabel("Number of people", fontsize=15)
plt.xlabel("Age", fontsize=15)
plt.legend()  # Display the label attached to the mean line
plt.tight_layout()
plt.show()
print(f"Mean: {average:.1f}, Median: {median:.1f}")

The age distribution is mostly even, but young people are overrepresented.

Next, check how the number of children affects the age distribution. The largest number of kids in the dataset is 5, so distributions are plotted for 0 through 5 kids.

In [None]:
# Dividing data based on the number of children
childless = medical_data.loc[medical_data["children"] == 0]
one_child = medical_data.loc[medical_data["children"] == 1]
two_kids = medical_data.loc[medical_data["children"] == 2]
three_kids = medical_data.loc[medical_data["children"] == 3]
four_kids = medical_data.loc[medical_data["children"] == 4]
five_kids = medical_data.loc[medical_data["children"] == 5]

# Plotting them into histograms
datasets = [childless, one_child, two_kids, three_kids, four_kids, five_kids]
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
axes = axes.flatten()
    
for i, data in enumerate(datasets):
    ages = data["age"]
    mean_age = np.mean(ages)
        
    axes[i].hist(ages, bins=20, density=False, alpha=0.7, color='b', edgecolor='black', label=f"Number of kids: {i}")
    axes[i].axvline(mean_age, color='r', linestyle='dashed', linewidth=2, label=f'Mean: {mean_age:.1f}')
    
    if i % 3 == 0:  # Set y-label only for the first column
        axes[i].set_ylabel("Number of people", fontsize=15)        

    if i > 2:  # Set x-label only for the second row
        axes[i].set_xlabel("Age", fontsize=15)

    axes[i].legend()
    axes[i].grid()
plt.tight_layout()
plt.show()


Among childless people, young people are overrepresented and middle-aged people are underrepresented. For people who do have kids, the distributions are closer to a bell curve, but for 1 or 2 kids, young people are still overrepresented. Far fewer people have 4 or 5 kids, so those histograms are sparse and have gaps.
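The remark about the 4 and 5 kid groups being small can be checked directly by counting rows per group. The following is a minimal sketch: `group_sizes` is a hypothetical helper name, and the `toy` frame holds made-up rows rather than values from `insurance.csv`.

```python
import pandas as pd

def group_sizes(df: pd.DataFrame, column: str = "children") -> pd.Series:
    """Return the number of rows for each distinct value in `column`."""
    return df[column].value_counts().sort_index()

# Made-up rows for illustration only; these are not values from insurance.csv
toy = pd.DataFrame({"children": [0, 0, 0, 1, 1, 2, 5]})
sizes = group_sizes(toy)
print(sizes)
```

In the notebook, `group_sizes(medical_data)` would give the actual group sizes behind each histogram panel.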

Next, check whether the number of kids affects the medical charges.

In [None]:
# Plotting a scatterplot to check if there is any type of correlation
plt.scatter(medical_data["children"], medical_data["charges"], alpha=0.5)
plt.xlabel("Kids")
plt.ylabel("Charges")
plt.title("Kids vs Charges")
plt.show()

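The figure shows that the charges are spread over a very wide range, but the spread narrows for people with 4 or 5 kids. This suggests that people with many kids have lower medical charges, although those groups contain far fewer people, so the narrower spread may partly reflect the small sample size.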
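One way to check whether the narrower spread for larger families comes from the charges themselves or from the group sizes is to tabulate a few summary statistics per group. This is a sketch under assumptions: `charges_by_children` is a hypothetical helper, and the `toy` frame contains made-up numbers, not values from `insurance.csv`.

```python
import pandas as pd

def charges_by_children(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, median and standard deviation of charges per number of children."""
    return (
        df.groupby("children")["charges"]
        .agg(["count", "mean", "median", "std"])
        .sort_index()
    )

# Made-up rows for illustration only; these are not values from insurance.csv
toy = pd.DataFrame({
    "children": [0, 0, 0, 1, 1, 5],
    "charges": [1000.0, 3000.0, 20000.0, 2000.0, 4000.0, 5000.0],
})
summary = charges_by_children(toy)
print(summary)
```

Calling `charges_by_children(medical_data)` in the notebook puts the group size next to the median and spread, which makes it easier to judge how much weight the 4 and 5 kid groups can carry.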

2. Distributions of regions vs costs, regions vs ages

In [None]:
# First a list of all of the regions included
regions = medical_data["region"].unique().tolist()

# Calculating and plotting the cost distributions per region
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
axes = axes.flatten()
    
for i in range(len(regions)):
    # Select the rows belonging to the current region
    data = medical_data.loc[medical_data["region"] == regions[i]]
    costs = data["charges"]
    mean_charge = np.mean(costs)
        
    axes[i].hist(costs, bins=20, density=False, alpha=0.7, color='b', edgecolor='black', label=f"Region: {regions[i]}")
    axes[i].axvline(mean_charge, color='r', linestyle='dashed', linewidth=2, label=f'Mean: {mean_charge:.1f}')
    
    if i % 2 == 0:  # Set y-label only for the first column
        axes[i].set_ylabel("Number of people", fontsize=15)        

    if i > 1:  # Set x-label only for the second row
        axes[i].set_xlabel("Charge", fontsize=15)

    axes[i].legend()
    axes[i].grid()
plt.tight_layout()
plt.show()


For all regions, the charge distribution looks similar: most charges are at the lower end, but a few very high charges skew the distribution and pull the mean up.
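The effect of a few very high charges on the mean can be made concrete by comparing the mean with the median in each region. A minimal sketch follows: `region_charge_summary` is a hypothetical helper, and the region labels and charges in `toy` are made up for illustration, not taken from `insurance.csv`.

```python
import pandas as pd

def region_charge_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-region count, mean and median of charges, plus the mean-median gap."""
    summary = df.groupby("region")["charges"].agg(["count", "mean", "median"])
    # A large positive gap indicates a right-skewed distribution
    summary["mean_minus_median"] = summary["mean"] - summary["median"]
    return summary

# Made-up rows and region labels for illustration only
toy = pd.DataFrame({
    "region": ["north", "north", "north", "south", "south", "south"],
    "charges": [1000.0, 2000.0, 30000.0, 1500.0, 2500.0, 3500.0],
})
region_summary = region_charge_summary(toy)
print(region_summary)
```

In the toy data, the single large charge in the hypothetical "north" group lifts the mean far above the median, while the "south" group has no gap. Running `region_charge_summary(medical_data)` would show how large the gap is in each real region.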

Next, regions vs ages

In [None]:
# Calculating and plotting the age distributions per region
fig, axes = plt.subplots(2, 2, figsize=(8, 8))
axes = axes.flatten()
    
for i in range(len(regions)):
    # Select the rows belonging to the current region
    data = medical_data.loc[medical_data["region"] == regions[i]]
    ages = data["age"]
    mean_age = np.mean(ages)
        
    axes[i].hist(ages, bins=20, density=False, alpha=0.7, color='b', edgecolor='black', label=f"Region: {regions[i]}")
    axes[i].axvline(mean_age, color='r', linestyle='dashed', linewidth=2, label=f'Mean: {mean_age:.1f}')
    
    if i % 2 == 0:  # Set y-label only for the first column
        axes[i].set_ylabel("Number of people", fontsize=15)        

    if i > 1:  # Set x-label only for the second row
        axes[i].set_xlabel("Age", fontsize=15)

    axes[i].legend()
    axes[i].grid()
plt.tight_layout()
plt.show()


The age distributions look similar across the regions, so the populations sampled from each region are comparable in age.

3. BMI vs age

In [None]:
# Plotting a scatterplot to check if there is any type of correlation (plus trend line for linear fit)
x = medical_data["age"].to_numpy()
y = medical_data["bmi"].to_numpy()
# Creating a linear fit
slope, intercept = np.polyfit(x, y, 1)
# Values for the plot
trend_line = slope * x + intercept

plt.scatter(x, y, alpha=0.5)
plt.plot(x, trend_line, color="red", label="Trend Line")
plt.xlabel("Age")
plt.ylabel("BMI")
plt.title("Age vs BMI")
plt.legend()
plt.show()

The BMI values are widely spread at every age. Both the highest and the lowest BMI values belong to young people, so older people sit closer to the overall average. The linear fit in the figure has a slightly positive slope, which suggests a weak positive correlation between age and BMI.
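The strength of the relationship between age and BMI can also be expressed as a single number, the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below uses `np.corrcoef`; `pearson_r` is a hypothetical helper name, and the arrays are made-up values rather than data from `insurance.csv`.

```python
import numpy as np

def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
    return float(np.corrcoef(x, y)[0, 1])

# Made-up values for illustration only; not data from insurance.csv
toy_ages = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
toy_bmi = np.array([22.0, 30.0, 25.0, 33.0, 27.0])
r_toy = pearson_r(toy_ages, toy_bmi)
print(f"Pearson r: {r_toy:.2f}")
```

In the notebook, `pearson_r(x, y)` with the `x` and `y` arrays defined in the cell above would quantify the trend: a value close to 0 means the relationship is weak even if the fitted slope is positive.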