# U.S. Medical Insurance Costs

We have a CSV dataset in the file insurance.csv. The headers, all in lower case, are age, sex, bmi, children, smoker, region, and charges. We will use Python's csv library to import and wrangle this data. Have fun!

This CSV holds raw data, so the first job is to understand it and find meaning in it. Generate the standard summary statistics (mean, median, mode, etc.) where possible. Further ideas: a regression to identify the likelihood of smoking by sex, an insurance cost predictor, the correlation between smoking and BMI, the correlation between BMI and having kids, the average cost per region, and the likelihood of being a smoker by region.

In [27]:
#import insurance.csv and inspect contents. Use csv.DictReader to pull data into a dictionary
#pull data from key/value dictionary and put in lists to make it easier to work with the data

import csv

age_list = []
sex_list = []
bmi_list = []
children_list = []
smoker_status_list = []
region_list = []
insurance_cost_list = []
with open('insurance.csv', newline = '') as insurance_csv:
    insurance_dict = csv.DictReader(insurance_csv)
    for item in insurance_dict:
        age_list.append(int(item['age']))
        sex_list.append(item['sex'])
        bmi_list.append(float(item['bmi']))
        children_list.append(int(item['children']))
        smoker_status_list.append(item['smoker'])
        region_list.append(item['region'])
        insurance_cost_list.append(float(item['charges']))
#print(age_list)
#print(children_list)
#print(insurance_cost_list)

#Now we have our data in lists, we can use the data more easily in analytical functions
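
The introduction mentions the standard summary statistics and the average cost per region. A minimal sketch of both is below. It assumes small made-up sample lists in place of the real `age_list`, `region_list`, and `insurance_cost_list`, and the helper names `summarize` and `average_by_group` are hypothetical, not part of the original notebook.

```python
import statistics
from collections import defaultdict

# Made-up sample values standing in for the lists built from insurance.csv
sample_ages = [19, 18, 28, 33, 32, 19]
sample_regions = ['southwest', 'southeast', 'southeast',
                  'northwest', 'northwest', 'southwest']
sample_costs = [100.0, 200.0, 400.0, 300.0, 500.0, 300.0]

def summarize(values):
    """Return the mean, median, and mode of a list of numbers."""
    return {
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'mode': statistics.mode(values),
    }

def average_by_group(groups, values):
    """Average the values that share the same group label."""
    grouped = defaultdict(list)
    for group, value in zip(groups, values):
        grouped[group].append(value)
    return {group: statistics.mean(vals) for group, vals in grouped.items()}

age_summary = summarize(sample_ages)
cost_by_region = average_by_group(sample_regions, sample_costs)
print(age_summary)
print(cost_by_region)
```

With the real data, the same helpers could be called on `age_list` or on `region_list` together with `insurance_cost_list`.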





Next, we will build a function that shows the impact of smoking on insurance costs: given an age, it produces an insurance cost estimate for the customer as a smoker, and another as a non-smoker. To do this, we fit a line to the available data for each group, searching for the slope and intercept that best fit the data points. That slope and intercept then give y (insurance cost) for any corresponding x (age), separately for smokers and non-smokers.

In [48]:

def get_cost(m, b, age):
    cost = (m*age)+b
    return cost

def calculate_error_nonsmoker(m, b, point):
    age, ins_cost, smoker_status = point
    if smoker_status == 'no':
        cost_err = abs(get_cost(m, b, age) - ins_cost)
        return cost_err
    else:
        return 0

def calculate_error_smoker(m, b, point):
    age, ins_cost, smoker_status = point
    if smoker_status == 'yes':
        cost_err = abs(get_cost(m, b, age) - ins_cost)
        return cost_err
    else:
        return 0

def calculate_all_error_smoker(m, b, point_list):
    smoker_error_sum = 0
    for item in point_list:
        smoker_error_sum += calculate_error_smoker(m, b, item)
    return smoker_error_sum

def calculate_all_error_nonsmoker(m, b, point_list):
    nonsmoker_error_sum = 0
    for item in point_list:
        nonsmoker_error_sum += calculate_error_nonsmoker(m, b, item)
    return nonsmoker_error_sum
#After running the regression with very wide parameters (resource intensive),
# we can narrow the values down after getting initial results that show similar slopes
# for smokers and non smokers, but very different intercepts.
#possible_ms = [m * 1 for m in range(-10000, 10000)]
#possible_bs = [b * 1 for b in range(-200000, 200000, 1000)]
possible_ms_smoker = [m * 0.1 for m in range(2800, 3400)]
possible_ms_nonsmoker = [m * 0.1 for m in range(2000, 3000)]

possible_bs_smoker = [b * 1 for b in range(25000, 27000, 10)]
possible_bs_nonsmoker = [b * 1 for b in range(-4000, -2000, 10)]

datapoints = list(zip(age_list, insurance_cost_list, smoker_status_list))

#debugging
#print(calculate_all_error_nonsmoker(1000, 20000, datapoints))
#quick sanity check
#print(datapoints)
#print(possible_ms)
#print(possible_bs)




Now we find the slope and intercept for smokers (and a separate pair for non-smokers) that minimize the total absolute error between the real-life costs in our dataset and the costs predicted by our fitted line.

In [49]:
#for smokers

smallest_error_smoker = float('inf')
best_m_smoker = 0
best_b_smoker = 0

for m in possible_ms_smoker:
    for b in possible_bs_smoker:
        temp_error = calculate_all_error_smoker(m, b, datapoints)
        if temp_error < smallest_error_smoker:
            smallest_error_smoker = temp_error
            best_m_smoker = m
            best_b_smoker = b
print(best_m_smoker, best_b_smoker, smallest_error_smoker)

315.40000000000003 25600 2656210.14078


In [50]:
#for non-smokers
smallest_error_nonsmoker = float('inf')
best_m_nonsmoker = 0
best_b_nonsmoker = 0

for m in possible_ms_nonsmoker:
    for b in possible_bs_nonsmoker:
        temp_error = calculate_all_error_nonsmoker(m, b, datapoints)
        if temp_error < smallest_error_nonsmoker:
            smallest_error_nonsmoker = temp_error
            best_m_nonsmoker = m
            best_b_nonsmoker = b
print(best_m_nonsmoker, best_b_nonsmoker, smallest_error_nonsmoker)

267.7 -3470 2108429.6146190027
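
One way to gain confidence in a grid search like the ones above is to run it on points whose answer is known in advance. The sketch below is an illustration, not the notebook's own code: the names `total_abs_error` and `grid_search` are hypothetical, and the toy points are generated to lie exactly on the line y = 3x + 10, so the search should recover that slope and intercept with zero error.

```python
def total_abs_error(m, b, points):
    """Sum of absolute differences between predicted and actual y values."""
    return sum(abs((m * x + b) - y) for x, y in points)

def grid_search(possible_ms, possible_bs, points):
    """Try every (m, b) pair and keep the one with the smallest total error."""
    best_m, best_b, best_error = None, None, float('inf')
    for m in possible_ms:
        for b in possible_bs:
            error = total_abs_error(m, b, points)
            if error < best_error:
                best_m, best_b, best_error = m, b, error
    return best_m, best_b, best_error

# Toy points that lie exactly on y = 3x + 10 (made up for this check)
toy_points = [(x, 3 * x + 10) for x in range(20, 30)]

best_m, best_b, best_error = grid_search(range(0, 7), range(0, 21), toy_points)
print(best_m, best_b, best_error)  # prints: 3 10 0
```

Because the error here is a sum of absolute differences rather than squared differences, the fitted line may differ somewhat from what an ordinary least-squares regression would give on the same data.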


Now, we can take the information above and use it to create a function that generates an estimated cost for health insurance for an individual using data on age and smoker status.

In [54]:
# Now we build a function that takes in age and smoker status and returns an estimated insurance cost

def insurance_estimator(age, smoker_status):
    insurance_cost_estimate = 0
    if smoker_status == 'yes':
        insurance_cost_estimate = get_cost(best_m_smoker, best_b_smoker, age)
        print("The estimated insurance cost for a", age, "year old smoker is", insurance_cost_estimate, "dollars")
    elif smoker_status == 'no':
        insurance_cost_estimate = get_cost(best_m_nonsmoker, best_b_nonsmoker, age)
        print("The estimated insurance cost for a", age, "year old non-smoker is", insurance_cost_estimate, "dollars")
    else:
        print("Please enter valid smoker_status of 'yes' or 'no' and an integer age value")
    return insurance_cost_estimate

insurance_estimator(33, 'maybe')


Please enter valid smoker_status of 'yes' or 'no' and an integer age value


0
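
To see the impact of smoking directly, the two fitted lines can be evaluated side by side for the same age. This is a minimal sketch that assumes the slopes and intercepts printed by the grid searches above (the smoker slope rounded to 315.4); the constant names and the function `estimate_both` are hypothetical additions.

```python
# Slope and intercept pairs copied from the grid-search output above
SMOKER_FIT = (315.4, 25600)
NONSMOKER_FIT = (267.7, -3470)

def estimate_both(age):
    """Return (smoker cost, non-smoker cost, difference) for a given age."""
    smoker_cost = SMOKER_FIT[0] * age + SMOKER_FIT[1]
    nonsmoker_cost = NONSMOKER_FIT[0] * age + NONSMOKER_FIT[1]
    return smoker_cost, nonsmoker_cost, smoker_cost - nonsmoker_cost

smoker_cost, nonsmoker_cost, difference = estimate_both(33)
print("Smoker estimate:", round(smoker_cost, 2))
print("Non-smoker estimate:", round(nonsmoker_cost, 2))
print("Difference:", round(difference, 2))
```

For a 33 year old, these fitted lines give roughly 36,008 dollars as a smoker and roughly 5,364 dollars as a non-smoker, a gap of about 30,644 dollars.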