# U.S. Medical Insurance Costs Project

The file `insurance.csv` contains data about patients in the USA and their insurance costs.
The purpose of this project is to analyze the dataset using only Python's built-in data structures, such as lists and dictionaries.

## Importing and looking over dataset

In [44]:
import csv
with open('insurance.csv') as insurance_data:
    data = insurance_data.read()

There are 7 columns that contain information about patients.

Some columns are numerical (*age, bmi, children, charges*) while others are categorical (*sex, smoker, region*).
The '*children*' column contains the number of children each patient has, and the '*smoker*' column records whether the person smokes: 'yes' means the patient smokes, 'no' means they do not. The most important column for this analysis is '*charges*', which tells us how much money (in USD) a person pays for their insurance.
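
The numerical/categorical split described above can be checked programmatically. The following is a minimal sketch, assuming a column counts as numerical when its value in the first row parses as a float; `is_number`, `split_columns`, and the two-row `SAMPLE` text are hypothetical stand-ins and not part of the original notebook.

```python
import csv
import io

# Hypothetical two-row sample laid out like insurance.csv (same header).
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
"""

def is_number(text):
    """Return True when the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def split_columns(rows):
    """Split column names into numerical and categorical, judging by the first row."""
    first = rows[0]
    numerical = [name for name, value in first.items() if is_number(value)]
    categorical = [name for name, value in first.items() if not is_number(value)]
    return numerical, categorical

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
numerical, categorical = split_columns(rows)
print(numerical)    # ['age', 'bmi', 'children', 'charges']
print(categorical)  # ['sex', 'smoker', 'region']
```

Judging by a single row is a simplification; a stricter check would test every row of the file.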
### Data Preparation
Using the **csv module**, I convert the CSV file into a list of dictionaries (one per row) to prepare the dataset for in-depth analysis.

In [45]:
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    ds = list(ds)

## Analyzing data


I start my analysis by calculating the average cost of insurance in `insurance.csv`.

In [31]:
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    charges = []  # a list keeps every row, including patients with identical charges
    for n in ds:
        charges.append(float(n['charges']))

# calculating the mean of insurance costs
total = 0  # named `total` so the built-in sum() is not shadowed
for charge in charges:
    total += charge
mean = total/len(charges)
round(mean, 2)

13270.42

*13270.42* USD is the average cost of insurance in this data set.
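
The averaging step can also be wrapped in a small reusable helper. This is a minimal sketch, assuming the rows are dictionaries like those produced by `csv.DictReader`; the name `column_mean` and the sample rows are hypothetical. It counts every row, so repeated values contribute once per occurrence, whereas storing values as dictionary keys would keep each distinct value only once.

```python
def column_mean(rows, column):
    """Average a numeric column over every row, repeated values included."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    if count == 0:
        raise ValueError("cannot average an empty dataset")
    return total / count

# Hypothetical rows shaped like csv.DictReader output; two rows share a charge.
sample_rows = [{'charges': '100.0'}, {'charges': '100.0'}, {'charges': '400.0'}]
print(column_mean(sample_rows, 'charges'))  # 200.0
```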
### Age


Now I will look at the *'age'* column. I want to know the average charge for each age and see who pays the most and who pays the least for their insurance.

In [32]:
#I start by creating a dictionary that maps each age to the list of insurance costs for that age.

with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_age = {}
    for n in ds:
        age = n['age']
        charge = n['charges']

        # one entry per patient (row)
        if age in data_by_age:
            data_by_age[age].append(charge)
        else:
            data_by_age[age] = [charge]

In [33]:
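#Let's check the mean of patient age in our dataset
#Each age is weighted by the number of patients of that age
patients = 0
total = 0
for age, values in data_by_age.items():
    patients += len(values)
    total += int(age) * len(values)
mean = total/patients
print(round(mean, 2))


39.21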


In [34]:
#Here I calculate the mean value of insurance costs for each age group

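def calc_mean(groups): #I may use this function later
    # `groups` maps a category to a list of charges; avoids shadowing the built-in dict
    rank_mean = {}
    for key,value in groups.items():
        n = 0
        for x in value:
            n += float(x)
        mean = n/(len(value))
        rank_mean[key] = mean
    # sort the (key, mean) pairs from the highest mean to the lowest
    sorted_means = sorted(rank_mean.items(),key=lambda x: x[1], reverse=True)
    return sorted_means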
        

calc_mean(data_by_age)

[('64', 23275.530837272727),
 ('61', 22024.45760869564),
 ('60', 21979.418507391292),
 ('63', 19884.998460869552),
 ('43', 19267.278653333342),
 ('62', 19163.856573478264),
 ('59', 18895.869531600016),
 ('54', 18758.546475357136),
 ('52', 18256.26971931035),
 ('37', 18019.91187720001),
 ('47', 17653.99959310345),
 ('57', 16447.185249999988),
 ('55', 16164.545488461548),
 ('53', 16020.930754999994),
 ('44', 15859.396587037038),
 ('51', 15682.255867241354),
 ('50', 15663.003300689621),
 ('56', 15025.515836538458),
 ('45', 14830.1998562069),
 ('48', 14632.500445172418),
 ('46', 14342.590638620672),
 ('58', 13878.928111600011),
 ('42', 13061.03866888889),
 ('30', 12719.110358148158),
 ('49', 12696.006264285717),
 ('23', 12419.820039642858),
 ('33', 12351.532987307695),
 ('36', 12204.47613799998),
 ('27', 12184.701721428562),
 ('39', 11778.242945200009),
 ('40', 11772.251309999981),
 ('34', 11613.528120769244),
 ('35', 11307.182031199996),
 ('24', 10648.01596214286),
 ('29', 10430.158727037

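As we can see, the oldest patients have the highest average insurance costs. In general, the higher the age, the greater the cost, although the ranking is not strictly ordered by age: for example, ages 43 and 37 rank above several older groups.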
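
The group-then-average pattern used for `data_by_age` can be written more compactly with `dict.setdefault`. The following is a minimal sketch under the assumption that rows are dictionaries as returned by `csv.DictReader`; `group_values`, `average`, and the sample rows are hypothetical names and data. Sorting by age rather than by mean makes the age trend easier to read.

```python
def group_values(rows, key_column, value_column):
    """Map each distinct key to the list of float values found in its rows."""
    groups = {}
    for row in rows:
        # setdefault creates the empty list the first time a key is seen
        groups.setdefault(row[key_column], []).append(float(row[value_column]))
    return groups

def average(values):
    """Arithmetic mean computed with an explicit loop."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)

# Hypothetical rows shaped like csv.DictReader output
sample_rows = [
    {'age': '64', 'charges': '300.0'},
    {'age': '18', 'charges': '100.0'},
    {'age': '64', 'charges': '500.0'},
]

groups = group_values(sample_rows, 'age', 'charges')
means_by_age = {int(age): average(values) for age, values in groups.items()}
for age in sorted(means_by_age):  # youngest to oldest
    print(age, means_by_age[age])
# 18 100.0
# 64 400.0
```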
### Sex
In this section I want to compare the insurance costs of men and women and check who pays more.



In [35]:
#Just like when analyzing age, I create a dictionary, but this time sex is the key and the list of charges is the value.
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_sex = {}
    for n in ds:
        sex = n['sex']
        charge = n['charges']

        # one entry per patient (row)
        if sex in data_by_sex:
            data_by_sex[sex].append(charge)
        else:
            data_by_sex[sex] = [charge]

In [36]:
calc_mean(data_by_sex) #I'm using the function that I created before

[('male', 13956.751177721893), ('female', 12569.578843835325)]

As we can see, men have a higher average insurance cost. This means that, on average, women pay less for their insurance than men.
### Region
In this step I will check how many patients come from each region.

In [37]:
#I'm creating a dictionary that will contain region as a key and the number of patients from that region as a value
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_region = {}
    for n in ds:
        region = n['region']  # count each patient (row) exactly once

        if region in data_by_region:
            data_by_region[region] += 1
        else:
            data_by_region[region] = 1
data_by_region

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
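
Raw counts are easier to compare as shares of the whole. Below is a minimal sketch of counting with `dict.get` and converting the counts to percentages; `count_values`, `shares`, and the sample rows are hypothetical and only illustrate the idea.

```python
def count_values(rows, column):
    """Count how many rows carry each distinct value of a column."""
    counts = {}
    for row in rows:
        value = row[column]
        counts[value] = counts.get(value, 0) + 1  # 0 is the default for unseen values
    return counts

def shares(counts):
    """Convert counts to percentages of the total, rounded to one decimal."""
    total = 0
    for n in counts.values():
        total += n
    return {key: round(100 * n / total, 1) for key, n in counts.items()}

# Hypothetical rows shaped like csv.DictReader output
sample_rows = [
    {'region': 'southwest'},
    {'region': 'southeast'},
    {'region': 'southeast'},
    {'region': 'northwest'},
]

counts = count_values(sample_rows, 'region')
print(counts)          # {'southwest': 1, 'southeast': 2, 'northwest': 1}
print(shares(counts))  # {'southwest': 25.0, 'southeast': 50.0, 'northwest': 25.0}
```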

### Smoker
Now I want to check how smoking affects the cost of insurance.

In [38]:
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_smoker = {}
    for n in ds:
        smoker = n['smoker']
        charge = n['charges']

        # one entry per patient (row)
        if smoker in data_by_smoker:
            data_by_smoker[smoker].append(charge)
        else:
            data_by_smoker[smoker] = [charge]

#I want to know the average insurance cost for smokers and non-smokers
calc_mean(data_by_smoker)


[('yes', 32050.23183153285), ('no', 8434.268297856199)]

Smoking has a very strong impact on insurance cost. People who smoke pay about 3.8 times as much for their insurance as people who do not smoke.
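
The 3.8 figure follows directly from the two means reported above. A short sketch of the arithmetic, with the values copied from the `calc_mean(data_by_smoker)` output; the variable names are hypothetical.

```python
# Average charges copied from the calc_mean(data_by_smoker) output above.
mean_smoker = 32050.23183153285
mean_non_smoker = 8434.268297856199

ratio = mean_smoker / mean_non_smoker
difference = mean_smoker - mean_non_smoker

print(round(ratio, 1))       # 3.8
print(round(difference, 2))  # 23615.96
```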
### BMI
I'd like to know the average BMI of the patients in this data set.

In [41]:
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    bmi_values = []  # one BMI per patient; a list keeps repeated values
    for n in ds:
        bmi_values.append(float(n['bmi']))

# calculating the mean BMI
total = 0  # named `total` so the built-in sum() is not shadowed
for bmi in bmi_values:
    total += bmi
mean = total/len(bmi_values)
round(mean, 2)


30.66