# U.S. Medical Insurance Costs

For this project, I will investigate a medical insurance costs dataset stored in a .csv file, using my Python skills.

## Extract data from csv

In [1]:
import csv
with open('insurance.csv') as insurance_file:
    insurance_data = csv.DictReader(insurance_file)
    age = []
    sex = []
    bmi = []
    children = []
    smoker = []
    region = []
    charges = []
    # Each iteration yields one row (one insured person) as a dict of strings.
    for row in insurance_data:
        age.append(row['age'])
        sex.append(row['sex'])
        bmi.append(row['bmi'])
        children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        charges.append(row['charges'])

The file given, "insurance.csv", contains the following data for each individual considered:

- age: integer number
- sex: male/female
- bmi: float number
- number of children: integer number
- smoker: yes/no
- region: northwest/southwest/northeast/southeast
- charges: float number
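
Because `csv.DictReader` yields every field as a string, the types listed above have to be applied explicitly before doing arithmetic. A minimal sketch, assuming a small in-memory sample in place of `insurance.csv` (the `SAMPLE_CSV` rows and the `parse_row` helper are made up for illustration and are not part of the notebook):

```python
import csv
import io

# Hypothetical two-row sample with the same header as insurance.csv.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
30,female,22.5,0,no,northwest,4000.5
52,male,31.2,2,yes,southeast,41000.25
"""

def parse_row(row):
    """Convert one DictReader row (all strings) to the documented types."""
    return {
        'age': int(row['age']),
        'sex': row['sex'],
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        'smoker': row['smoker'],
        'region': row['region'],
        'charges': float(row['charges']),
    }

records = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(records[0]['age'] + 1)  # arithmetic works because age is now an int
```

Converting once at load time would avoid the repeated `int(...)` and `float(...)` calls later on, at the cost of changing how the column lists are used in the rest of the notebook.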

In [2]:
total_population = len(age)
print(total_population)

1338


This confirms how many records the dataset contains: 1338.

## Let's evaluate the data in each column:)

#### Categorical variables:

In [3]:
def percantage(number):
    # Returns the share of all records as a fraction between 0 and 1, rounded to 3 decimals.
    return round(number / total_population, 3)

In [4]:
male_count = sex.count('male')
female_count = sex.count('female')

male_count_prc = percantage(male_count)
female_count_prc = percantage(female_count)

# In order not to create a separate variable, we can call the function inside format():
# print('Number of people with 5 children: {}. This is {:.1%} of the total number of insured people.'.format(fife_children_count, percantage(fife_children_count)))
# percantage() returns a fraction, so the '.1%' format spec converts it to a percentage.
print('Number of men: {male_count}. This is {male_count_prc:.1%} of all records.'.format(male_count=male_count, male_count_prc=male_count_prc))
print('Number of women: {female_count}. This is {female_count_prc:.1%} of all records.'.format(female_count=female_count, female_count_prc=female_count_prc))

Number of men: 676. This is 50.5% of all records.
Number of women: 662. This is 49.5% of all records.


The records are split almost evenly: 676 men and 662 women.

In [5]:
smoker_count = smoker.count('yes')
non_smoker_count = smoker.count('no')

smoker_count_prc = percantage(smoker_count)
non_smoker_count_prc = percantage(non_smoker_count)

# The '.1%' format spec turns the fraction from percantage() into a percentage.
print('Number of smokers: {smoker_count}. This is {smoker_count_prc:.1%} of all records.'.format(smoker_count=smoker_count, smoker_count_prc=smoker_count_prc))
print('Number of non-smokers: {non_smoker_count}. This is {non_smoker_count_prc:.1%} of all records.'.format(non_smoker_count=non_smoker_count, non_smoker_count_prc=non_smoker_count_prc))

Number of smokers: 274. This is 20.5% of all records.
Number of non-smokers: 1064. This is 79.5% of all records.


The number of non-smokers is much higher than the number of smokers. Smokers make up only about 1/5 of the total (274 of 1338 records).

In [6]:
southwest_count = region.count('southwest')
southeast_count = region.count('southeast')
northwest_count = region.count('northwest')
northeast_count = region.count('northeast')

southwest_count_prc = percantage(southwest_count)
southeast_count_prc = percantage(southeast_count)
northwest_count_prc = percantage(northwest_count)
northeast_count_prc = percantage(northeast_count)

# The '.1%' format spec turns the fraction from percantage() into a percentage.
print('Number of people from southwest region: {southwest_count}. This is {southwest_count_prc:.1%} of all records.'.format(southwest_count=southwest_count, southwest_count_prc=southwest_count_prc))
print('Number of people from southeast region: {southeast_count}. This is {southeast_count_prc:.1%} of all records.'.format(southeast_count=southeast_count, southeast_count_prc=southeast_count_prc))
print('Number of people from northwest region: {northwest_count}. This is {northwest_count_prc:.1%} of all records.'.format(northwest_count=northwest_count, northwest_count_prc=northwest_count_prc))
print('Number of people from northeast region: {northeast_count}. This is {northeast_count_prc:.1%} of all records.'.format(northeast_count=northeast_count, northeast_count_prc=northeast_count_prc))

Number of people from southwest region: 325. This is 24.3% of all records.
Number of people from southeast region: 364. This is 27.2% of all records.
Number of people from northwest region: 325. This is 24.3% of all records.
Number of people from northeast region: 324. This is 24.2% of all records.


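The regional distribution is nearly even; only the southeast region has noticeably more records (364, compared with 324 or 325 for each of the others).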
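
The counts and shares for a categorical column can also be produced in one pass rather than one variable per category. A minimal sketch using `collections.Counter`; the `shares` helper and the `sample_region` list are hypothetical stand-ins, not part of the notebook:

```python
from collections import Counter

def shares(values):
    """Return {category: (count, percent of all values)} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}

# Hypothetical sample; in the notebook this would be the `region` list.
sample_region = ['southeast', 'southwest', 'southeast', 'northeast']
for label, (n, pct) in shares(sample_region).items():
    print('{}: {} records, {}% of all records'.format(label, n, pct))
```

The same helper would work unchanged for the `sex` and `smoker` lists, since it does not need to know the category names in advance.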

#### Numerical variables:

In [7]:
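# Underweight = <18.5 (the check below also counts exactly 18.5 as underweight)
# Normal weight = 18.5–24.9
# Overweight = 25–29.9
# Obesity = BMI of 30 or greater
# Note: values strictly between 24.9 and 25, or between 29.9 and 30, match none of the checks below.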

def bmi_rating(bmi):
    
    by_bmi = {'Underweight': [],
             'Normal weight': [],
             'Overweight': [],
             'Obesity': []
             }
    
    for num in bmi:
        if float(num) <= 18.5:
            by_bmi['Underweight'].append(num)
        elif float(num) > 18.5 and float(num) <= 24.9:
            by_bmi['Normal weight'].append(num)
        elif float(num) >= 25 and float(num) <= 29.9:
            by_bmi['Overweight'].append(num)
        elif float(num) >= 30:
            by_bmi['Obesity'].append(num)
            
    return by_bmi
            

bmi_rating_dict = bmi_rating(bmi)
# print(bmi_rating_dict)

underweight_count = len(bmi_rating_dict['Underweight'])
normalweight_count = len(bmi_rating_dict['Normal weight'])
overweight_count = len(bmi_rating_dict['Overweight'])
obesity_count = len(bmi_rating_dict['Obesity'])

underweight_count_prc = percantage(underweight_count)
normalweight_count_prc = percantage(normalweight_count)
overweight_count_prc = percantage(overweight_count)
obesity_count_prc = percantage(obesity_count)

# The '.1%' format spec turns the fraction from percantage() into a percentage.
print('Number of people with underweight bmi: {underweight_count}. This is {underweight_count_prc:.1%} of all records.'.format(underweight_count=underweight_count, underweight_count_prc=underweight_count_prc))
print('Number of people with normal weight bmi: {normalweight_count}. This is {normalweight_count_prc:.1%} of all records.'.format(normalweight_count=normalweight_count, normalweight_count_prc=normalweight_count_prc))
print('Number of people with overweight bmi: {overweight_count}. This is {overweight_count_prc:.1%} of all records.'.format(overweight_count=overweight_count, overweight_count_prc=overweight_count_prc))
print('Number of people with obesity bmi: {obesity_count}. This is {obesity_count_prc:.1%} of all records.'.format(obesity_count=obesity_count, obesity_count_prc=obesity_count_prc))


Number of people with underweight bmi: 21. This is 1.6% of all records.
Number of people with normal weight bmi: 221. This is 16.5% of all records.
Number of people with overweight bmi: 377. This is 28.2% of all records.
Number of people with obesity bmi: 707. This is 52.8% of all records.


Body Mass Index (BMI) is a person's weight in kilograms divided by the square of their height in meters (with pounds and inches, the result is multiplied by 703). A high BMI can indicate high body fatness. BMI screens for weight categories that may lead to health problems, but it does not diagnose the body fatness or health of an individual.

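We see that people with obesity make up more than half of all records (707 of 1338), which points to an unhealthy population. The saddest thing is that people with normal weight make up less than 1/5 of all records. Note that the four groups add up to 1326, not 1338: the 12 records with a bmi strictly between 24.9 and 25 or between 29.9 and 30 match none of the conditions in `bmi_rating`.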
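
The conditions in `bmi_rating` leave small gaps between the groups (for example, a bmi of 24.95 matches none of them). A minimal sketch of a gap-free alternative with half-open ranges, together with the metric BMI formula; `bmi_from_metric` and `bmi_category` are hypothetical helpers, and the cut-offs follow the comment at the top of the cell (so exactly 18.5 counts as normal weight here):

```python
def bmi_from_metric(weight_kg, height_m):
    """BMI = weight in kilograms divided by the square of height in meters."""
    return weight_kg / height_m ** 2

def bmi_category(bmi):
    """Half-open ranges, so every value falls into exactly one group."""
    bmi = float(bmi)  # accepts the strings stored in the `bmi` list
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal weight'
    elif bmi < 30:
        return 'Overweight'
    return 'Obesity'

print(round(bmi_from_metric(70, 1.75), 1))
print(bmi_category('24.95'))
print(bmi_category('29.95'))
```

Applying this to the real data would change the group counts reported above, because the records that currently fall into the gaps would be assigned to a group.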

In [8]:
def children_rating(children):
    
    # One empty list per possible number of children (0 to 5).
    by_children = {count: [] for count in range(6)}
    
    for num in children:
        if int(num) in by_children:
            by_children[int(num)].append(num)
            
    return by_children


by_children_dict = children_rating(children)
# print(by_children_dict)

zero_children_count = len(by_children_dict[0])
one_children_count = len(by_children_dict[1])
two_children_count = len(by_children_dict[2])
three_children_count = len(by_children_dict[3])
four_children_count = len(by_children_dict[4])
fife_children_count = len(by_children_dict[5])

zero_children_count_prc = percantage(zero_children_count)
one_children_count_prc = percantage(one_children_count)
two_children_count_prc = percantage(two_children_count)
three_children_count_prc = percantage(three_children_count)
four_children_count_prc = percantage(four_children_count)
fife_children_count_prc = percantage(fife_children_count)

# The '.1%' format spec turns the fraction from percantage() into a percentage.
print('Number of people without children: {zero_children_count}. This is {zero_children_count_prc:.1%} of all records.'.format(zero_children_count=zero_children_count, zero_children_count_prc=zero_children_count_prc))
print('Number of people with 1 child: {one_children_count}. This is {one_children_count_prc:.1%} of all records.'.format(one_children_count=one_children_count, one_children_count_prc=one_children_count_prc))
print('Number of people with 2 children: {two_children_count}. This is {two_children_count_prc:.1%} of all records.'.format(two_children_count=two_children_count, two_children_count_prc=two_children_count_prc))
print('Number of people with 3 children: {three_children_count}. This is {three_children_count_prc:.1%} of all records.'.format(three_children_count=three_children_count, three_children_count_prc=three_children_count_prc))
print('Number of people with 4 children: {four_children_count}. This is {four_children_count_prc:.1%} of all records.'.format(four_children_count=four_children_count, four_children_count_prc=four_children_count_prc))
print('Number of people with 5 children: {fife_children_count}. This is {fife_children_count_prc:.1%} of all records.'.format(fife_children_count=fife_children_count, fife_children_count_prc=fife_children_count_prc))


Number of people without children: 574. This is 42.9% of all records.
Number of people with 1 child: 324. This is 24.2% of all records.
Number of people with 2 children: 240. This is 17.9% of all records.
Number of people with 3 children: 157. This is 11.7% of all records.
Number of people with 4 children: 25. This is 1.9% of all records.
Number of people with 5 children: 18. This is 1.3% of all records.


Almost half of the insured people have no children, about 1/4 have one child, and only about 15% (200 of 1338 records) have 3, 4 or 5 children.

In [9]:
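# 18 to 24 inclusive
# 25 to 34 inclusive
# 35 to 44 inclusive
# 45 to 54 inclusive
# 55 to 64 inclusive

def min_max_age(age):
    # Compare ages as integers and leave the original list order untouched,
    # so that it stays aligned with the other columns.
    age_sorted = sorted(int(a) for a in age)
    return age_sorted[0], age_sorted[-1]
# print(min_max_age(age))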

def age_rating(age):
    
    by_age = {'before 18': [],
              '18 to 24': [],
              '25 to 34': [],
              '35 to 44': [],
              '45 to 54': [],
              '55 to 64': [],
              'after 64': []
              }
    
    for num in age:
        if int(num) < 18:
            by_age['before 18'].append(num)
        elif int(num) >= 18 and int(num) <= 24:
            by_age['18 to 24'].append(num)
        elif int(num) >= 25 and int(num) <= 34:
            by_age['25 to 34'].append(num)
        elif int(num) >= 35 and int(num) <= 44:
            by_age['35 to 44'].append(num)
        elif int(num) >= 45 and int(num) <= 54:
            by_age['45 to 54'].append(num)
        elif int(num) >= 55 and int(num) <= 64:
            by_age['55 to 64'].append(num)
        elif int(num) > 64:
            by_age['after 64'].append(num)

    return by_age


by_age_dict = age_rating(age)
# print(by_age_dict)

age_18_to_24_count = len(by_age_dict['18 to 24'])
age_25_to_34_count = len(by_age_dict['25 to 34'])
age_35_to_44_count = len(by_age_dict['35 to 44'])
age_45_to_54_count = len(by_age_dict['45 to 54'])
age_55_to_64_count = len(by_age_dict['55 to 64'])

age_18_to_24_count_prc = percantage(age_18_to_24_count)
age_25_to_34_count_prc = percantage(age_25_to_34_count)
age_35_to_44_count_prc = percantage(age_35_to_44_count)
age_45_to_54_count_prc = percantage(age_45_to_54_count)
age_55_to_64_count_prc = percantage(age_55_to_64_count)

# The '.1%' format spec turns the fraction from percantage() into a percentage.
print('Number of people aged between 18 and 24: {age_18_to_24_count}. This is {age_18_to_24_count_prc:.1%} of all records.'.format(age_18_to_24_count=age_18_to_24_count, age_18_to_24_count_prc=age_18_to_24_count_prc))
print('Number of people aged between 25 and 34: {age_25_to_34_count}. This is {age_25_to_34_count_prc:.1%} of all records.'.format(age_25_to_34_count=age_25_to_34_count, age_25_to_34_count_prc=age_25_to_34_count_prc))
print('Number of people aged between 35 and 44: {age_35_to_44_count}. This is {age_35_to_44_count_prc:.1%} of all records.'.format(age_35_to_44_count=age_35_to_44_count, age_35_to_44_count_prc=age_35_to_44_count_prc))
print('Number of people aged between 45 and 54: {age_45_to_54_count}. This is {age_45_to_54_count_prc:.1%} of all records.'.format(age_45_to_54_count=age_45_to_54_count, age_45_to_54_count_prc=age_45_to_54_count_prc))
print('Number of people aged between 55 and 64: {age_55_to_64_count}. This is {age_55_to_64_count_prc:.1%} of all records.'.format(age_55_to_64_count=age_55_to_64_count, age_55_to_64_count_prc=age_55_to_64_count_prc))


Number of people aged between 18 and 24: 278. This is 20.8% of all records.
Number of people aged between 25 and 34: 271. This is 20.3% of all records.
Number of people aged between 35 and 44: 260. This is 19.4% of all records.
Number of people aged between 45 and 54: 287. This is 21.4% of all records.
Number of people aged between 55 and 64: 242. This is 18.1% of all records.


The number of people in each age category is approximately the same.
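
The chain of `elif` range checks in `age_rating` can also be written as a lookup against a sorted list of lower bounds. A minimal sketch using `bisect` from the standard library; `AGE_EDGES`, `AGE_LABELS` and `age_group` are hypothetical names introduced here, with the same group boundaries as above:

```python
from bisect import bisect_right

# Lower bound of each group after the first; labels match the keys used in age_rating.
AGE_EDGES = [18, 25, 35, 45, 55, 65]
AGE_LABELS = ['before 18', '18 to 24', '25 to 34', '35 to 44',
              '45 to 54', '55 to 64', 'after 64']

def age_group(age):
    """Return the label of the age group that `age` (int or numeric string) falls into."""
    return AGE_LABELS[bisect_right(AGE_EDGES, int(age))]

for sample_age in ['17', '18', '24', '25', '64', '65']:
    print(sample_age, '->', age_group(sample_age))
```

`bisect_right` counts how many edges are less than or equal to the age, which is exactly the position of the matching label, so changing the groups only requires editing the two lists.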

## And now let's analyze the impact of characteristics on the cost of insurance:)

**Main questions:**

- Who is more inclined to spend a higher amount of money on health insurance?
- What affects the costs of health insurance the most?
- How can those people save money on that cost?

Let's create functions to help us display the desired statistics.

In [10]:
num_charges = []

for i in charges:
    num_charges.append(float(i))
    
sorted_charges = sorted(num_charges)
print(sorted_charges[:3])
print(sorted_charges[-3:])

[1121.8739, 1131.5066, 1135.9407]
[60021.39897, 62592.87309, 63770.42801]


These are the three lowest and the three highest charges.

In [11]:
def create_dict(charges, age, sex, bmi, children, smoker, region):
    # Keyed by the charge string: records with an identical charge would overwrite each other.
    charges_key = {}
    for i in range(len(charges)):
        charges_key[charges[i]] = {
            'age': age[i],
            'sex': sex[i],
            'bmi': bmi[i],
            'children': children[i],
            'smoker': smoker[i],
            'region': region[i],
#             'charges': charges[i]
        }
        
    return charges_key

charges_key_dict = create_dict(charges, age, sex, bmi, children, smoker, region)
print(charges_key_dict['1121.8739'])
print(charges_key_dict['1131.5066'])
print(charges_key_dict['1135.9407'])

print(charges_key_dict['63770.42801'])
print(charges_key_dict['62592.87309'])
print(charges_key_dict['60021.39897'])



{'age': '18', 'sex': 'male', 'bmi': '23.21', 'children': '0', 'smoker': 'no', 'region': 'southeast'}
{'age': '18', 'sex': 'male', 'bmi': '30.14', 'children': '0', 'smoker': 'no', 'region': 'southeast'}
{'age': '18', 'sex': 'male', 'bmi': '33.33', 'children': '0', 'smoker': 'no', 'region': 'southeast'}
{'age': '54', 'sex': 'female', 'bmi': '47.41', 'children': '0', 'smoker': 'yes', 'region': 'southeast'}
{'age': '45', 'sex': 'male', 'bmi': '30.36', 'children': '0', 'smoker': 'yes', 'region': 'southeast'}
{'age': '52', 'sex': 'male', 'bmi': '34.485', 'children': '3', 'smoker': 'yes', 'region': 'northwest'}


Here we retrieved the full records for the three lowest and three highest charges, using each charge as the key into the dictionary.

The values are different, so we cannot determine a clear trend. 
**But we can make a few observations**:
- The lowest insurance charges belong to eighteen-year-old, non-smoking men without children living in the southeastern region. BMI is the only feature that distinguishes them, and the charges rise slightly as BMI rises.
- The highest insurance charges belong to smokers with obesity aged 45 or older.
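
Using the charge as a dictionary key works only while every charge is unique. A minimal sketch of an alternative that sorts row positions by charge and then reads the other columns at those positions; the `sample_*` lists and the `record` helper are hypothetical stand-ins for the notebook's column lists:

```python
# Hypothetical sample columns, all strings as csv.DictReader returns them.
sample_charges = ['2500.0', '1200.5', '2500.0', '40000.75']
sample_age = ['30', '18', '41', '52']
sample_smoker = ['no', 'no', 'no', 'yes']

# Sort row positions by numeric charge; rows with equal charges are both kept.
order = sorted(range(len(sample_charges)), key=lambda i: float(sample_charges[i]))

def record(i):
    """Collect the fields of row i into one dict."""
    return {'charges': sample_charges[i], 'age': sample_age[i], 'smoker': sample_smoker[i]}

lowest = [record(i) for i in order[:2]]    # two cheapest rows
highest = [record(i) for i in order[-1:]]  # most expensive row
print(lowest)
print(highest)
```

Because the lookup goes through the row position rather than the charge value, the two rows with a charge of 2500.0 in this sample stay distinct.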

In [12]:
def min_value(values):
    # Names chosen to avoid shadowing the built-ins list, min and max.
    lowest = float('inf')
    for item in values:
        if float(item) < lowest:
            lowest = float(item)
    return lowest

def max_value(values):
    highest = float('-inf')
    for item in values:
        if float(item) > highest:
            highest = float(item)
    return highest

print(min_value(charges))
print(max_value(charges))

1121.8739
63770.42801


As we already know, **the lowest** cost is **1121.8739** dollars and **the highest** cost is **63770.42801** dollars.

In [13]:
def average(values):
    # Values may be numeric strings; the result is rounded to one decimal place.
    total = 0
    for item in values:
        total += float(item)
    return round(total / len(values), 1)

average_charges = average(charges)
print(average_charges)

13270.4


The **average** insurance cost is **13270.4** dollars.

In [14]:
def mediana(sorted_charges):
    # With an even number of records this index is the lower of the two middle positions.
    return sorted_charges[int((((total_population / 2) - 1) + (total_population / 2)) / 2)]
mediana_charges = mediana(sorted_charges)
print(mediana_charges)

9377.9047


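The **median** insurance cost is about **9377.9047** dollars. With an even number of records (1338), `mediana` returns the lower of the two middle values rather than their mean, so this is a close approximation of the median.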
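
For an even number of values, the median is usually defined as the mean of the two middle values. A minimal sketch of that definition, checked against `statistics.median` from the standard library; the `median` function and the `sample` list are illustrative and not part of the notebook:

```python
import statistics

def median(values):
    """Middle value for an odd count, mean of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical charges; with four values the two middle ones are 9000.0 and 9400.0.
sample = [1200.0, 9000.0, 9400.0, 60000.0]
print(median(sample))             # mean of the two middle values
print(statistics.median(sample))  # the standard library gives the same result
```

In this sample the median (9200.0) is far below the mean, which is pulled up by the single large value; the same kind of gap separates the median and the average charge reported above.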

#### Let's find out the influence of various factors on the final cost!

In [15]:
def difference(x, y):
    return x - y

In [16]:
def average_cat_var(list1, list2, condition):
    list1_list2 = list(zip(list1, list2))
    values = []
    for item in list1_list2:
        if item[0] == condition:
            values.append(item[1])
    average_for_values = average(values)
    return average_for_values

In [17]:
women_average_charges = average_cat_var(sex, charges, 'female')
men_average_charges = average_cat_var(sex, charges, 'male')

print("The average insurance charges for women in the records is " + str(women_average_charges) + " dollars.")
print("The average insurance charges for men in the records is " + str(men_average_charges) + " dollars.")

The average insurance charges for women in the records is 12569.6 dollars.
The average insurance charges for men in the records is 13956.8 dollars.


In [18]:
print(difference(men_average_charges, women_average_charges))

1387.199999999999


Conclusion: **gender affects the insurance charges**. The average cost for women is 1387.20 dollars less than for men.

In [19]:
smokers_average_charges = average_cat_var(smoker, charges, 'yes')
non_smokers_average_charges = average_cat_var(smoker, charges, 'no')

print("The average insurance charges for smokers is " + str(smokers_average_charges) + " dollars.")
print("The average insurance charges for non-smokers is " + str(non_smokers_average_charges) + " dollars.")


The average insurance charges for smokers is 32050.2 dollars.
The average insurance charges for non-smokers is 8434.3 dollars.


In [20]:
print(difference(smokers_average_charges, non_smokers_average_charges))

23615.9


The average insurance cost for non-smokers is almost 4 times lower than for those who smoke!

Conclusion: **smoking has the biggest impact on the insurance charges**. The average cost for non-smokers is 23615.9 dollars less than for smokers.

In [21]:
southwest_average_charges = average_cat_var(region, charges, 'southwest')
southeast_average_charges = average_cat_var(region, charges, 'southeast')
northwest_average_charges = average_cat_var(region, charges, 'northwest')
northeast_average_charges = average_cat_var(region, charges, 'northeast')

print("The average insurance charges for southwest region is " + str(southwest_average_charges) + " dollars.")
print("The average insurance charges for southeast region is " + str(southeast_average_charges) + " dollars.")
print("The average insurance charges for northwest region is " + str(northwest_average_charges) + " dollars.")
print("The average insurance charges for northeast region is " + str(northeast_average_charges) + " dollars.")

The average insurance charges for southwest region is 12346.9 dollars.
The average insurance charges for southeast region is 14735.4 dollars.
The average insurance charges for northwest region is 12417.6 dollars.
The average insurance charges for northeast region is 13406.4 dollars.


The highest average charges are in the southeast region, and the two western regions have lower average charges than the two eastern ones. It's unlikely that the region itself affects the insurance charges, but the lifestyle and habits of the population in a particular region can have an impact.

Conclusion: **the region doesn't have a specific impact on the insurance charges**.
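
The per-category averages above call `average_cat_var` once for each category value. A minimal sketch of a single-pass alternative that returns the average for every category at once; `average_by_group` and the `sample_*` lists are hypothetical and stand in for lists such as `region` and `charges`:

```python
from collections import defaultdict

def average_by_group(labels, values):
    """Return {label: average of the values that carry that label}, rounded to 1 decimal."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, value in zip(labels, values):
        totals[label] += float(value)  # values may be numeric strings
        counts[label] += 1
    return {label: round(totals[label] / counts[label], 1) for label in totals}

# Hypothetical sample data.
sample_smoker = ['yes', 'no', 'no', 'yes']
sample_charges = ['30000', '8000', '9000', '34000']
print(average_by_group(sample_smoker, sample_charges))
```

This avoids hard-coding the category names, so a misspelled condition such as `'southwset'` cannot silently produce an empty group.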

In [22]:
def average_num_var(list1, list2, list3):
    list1_list2 = list(zip(list1, list2))
    values = []
    for item in list1_list2:
        if item[0] in list3:
            values.append(item[1])
    average_for_values = average(values)
    return average_for_values


In [23]:
print("The average insurance charges for people with underweight bmi is " + str(average_num_var(bmi, charges, bmi_rating_dict['Underweight'])) + " dollars.")
print("The average insurance charges for people with normal weight bmi is " + str(average_num_var(bmi, charges, bmi_rating_dict['Normal weight'])) + " dollars.")
print("The average insurance charges for people with overweight bmi is " + str(average_num_var(bmi, charges, bmi_rating_dict['Overweight'])) + " dollars.")
print("The average insurance charges for people with obesity bmi is " + str(average_num_var(bmi, charges, bmi_rating_dict['Obesity'])) + " dollars.")

The average insurance charges for people with underweight bmi is 8657.6 dollars.
The average insurance charges for people with normal weight bmi is 10404.9 dollars.
The average insurance charges for people with overweight bmi is 10994.0 dollars.
The average insurance charges for people with obesity bmi is 15552.3 dollars.


We see that the average charge rises with each successive bmi group, from underweight to obesity.

Conclusion: **the cost of insurance increases depending on the bmi** (the higher the bmi, the higher the cost).

In [24]:
print("The average insurance charges for people with age between 18 and 24 is " + str(average_num_var(age, charges, by_age_dict['18 to 24'])) + " dollars.")
print("The average insurance charges for people with age between 25 and 34 is " + str(average_num_var(age, charges, by_age_dict['25 to 34'])) + " dollars.")
print("The average insurance charges for people with age between 35 and 44 is " + str(average_num_var(age, charges, by_age_dict['35 to 44'])) + " dollars.")
print("The average insurance charges for people with age between 45 and 54 is " + str(average_num_var(age, charges, by_age_dict['45 to 54'])) + " dollars.")
print("The average insurance charges for people with age between 55 and 64 is " + str(average_num_var(age, charges, by_age_dict['55 to 64'])) + " dollars.")

The average insurance charges for people with age between 18 and 24 is 9011.3 dollars.
The average insurance charges for people with age between 25 and 34 is 10352.4 dollars.
The average insurance charges for people with age between 35 and 44 is 13134.2 dollars.
The average insurance charges for people with age between 45 and 54 is 15853.9 dollars.
The average insurance charges for people with age between 55 and 64 is 18513.3 dollars.


We see that the average charge rises with each successive age group, from the youngest to the oldest.

Conclusion: **the cost of insurance increases depending on the age** (the higher the age, the higher the cost).

In [25]:
print("The average insurance charges for people without children is " + str(average_num_var(children, charges, by_children_dict[0])) + " dollars.")
print("The average insurance charges for people with 1 child is " + str(average_num_var(children, charges, by_children_dict[1])) + " dollars.")
print("The average insurance charges for people with 2 children is " + str(average_num_var(children, charges, by_children_dict[2])) + " dollars.")
print("The average insurance charges for people with 3 children is " + str(average_num_var(children, charges, by_children_dict[3])) + " dollars.")
print("The average insurance charges for people with 4 children is " + str(average_num_var(children, charges, by_children_dict[4])) + " dollars.")
print("The average insurance charges for people with 5 children is " + str(average_num_var(children, charges, by_children_dict[5])) + " dollars.")

The average insurance charges for people without children is 12366.0 dollars.
The average insurance charges for people with 1 child is 12731.2 dollars.
The average insurance charges for people with 2 children is 15073.6 dollars.
The average insurance charges for people with 3 children is 15355.3 dollars.
The average insurance charges for people with 4 children is 13850.7 dollars.
The average insurance charges for people with 5 children is 8786.0 dollars.


At first the average charge rises with the number of children, but beyond 3 children it starts to fall. One possible explanation is that the people with 4 or 5 children in these records mostly don't smoke or have a normal body weight, so other factors can outweigh even a number of children as large as 5. These two groups are also small (25 and 18 records), so their averages are less reliable.

Conclusion: **the number of children is only slightly reflected in the insurance cost**.
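
Whether another factor such as smoking hides the effect of the number of children can be checked by averaging charges over pairs of characteristics instead of one at a time. A minimal sketch under that assumption; `average_by_pair` and the `sample_*` lists are hypothetical, and the numbers are made up rather than taken from the dataset:

```python
from collections import defaultdict

def average_by_pair(first, second, values):
    """Return {(first_label, second_label): average value}, rounded to 1 decimal."""
    groups = defaultdict(list)
    for a, b, value in zip(first, second, values):
        groups[(a, b)].append(float(value))
    return {key: round(sum(vals) / len(vals), 1) for key, vals in groups.items()}

# Hypothetical sample data: children count, smoker flag and charges per person.
sample_children = ['0', '0', '5', '5']
sample_smoker = ['yes', 'no', 'no', 'no']
sample_charges = ['30000', '8000', '9000', '7000']

result = average_by_pair(sample_children, sample_smoker, sample_charges)
for key in sorted(result):
    print(key, result[key])
```

Comparing groups with the same smoker status but a different number of children would show how much of the drop at 4 and 5 children remains once smoking is held fixed.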

### We can conclude that the best way to save money on insurance is to stop smoking and lead a healthy lifestyle to reduce bmi as much as possible.