# U.S. Medical Insurance Costs Project

The file `insurance.csv` contains data about patients from the USA and their insurance costs.
The purpose of this project is to analyze the dataset using only Python's built-in features, such as lists and dictionaries.

## Importing and looking over dataset

In [4]:
import csv
with open('insurance.csv') as insurance_data:
    data = insurance_data.read()
    print(data)

age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
31,female,25.74,0,no,southeast,3756.6216
46,female,33.44,1,no,southeast,8240.5896
37,female,27.74,3,no,northwest,7281.5056
37,male,29.83,2,no,northeast,6406.4107
60,female,25.84,0,no,northwest,28923.13692
25,male,26.22,0,no,northeast,2721.3208
62,female,26.29,0,yes,southeast,27808.7251
23,male,34.4,0,no,southwest,1826.843
56,female,39.82,0,no,southeast,11090.7178
27,male,42.13,0,yes,southeast,39611.7577
19,male,24.6,1,no,southwest,1837.237
52,female,30.78,1,no,northeast,10797.3362
23,male,23.845,0,no,northeast,2395.17155
56,male,40.3,0,no,southwest,10602.385
30,male,35.3,0,yes,southwest,36837.467
60,female,36.005,0,no,northeast,13228.84695
30,female,32.4,1,no,southwest,4149.736
18,male,34.1,0,no,southeast,1137.011
34,female,31.92,1,yes,northeast,37701

There are 7 columns, each holding one piece of information about a patient.

Some columns are numerical (*age, bmi, children, charges*), while others are categorical (*sex, smoker, region*).
The '*children*' column contains the number of children, and the '*smoker*' column records whether the person smokes: 'yes' means the patient smokes, 'no' means they do not. The most important column for the analysis is '*charges*', which tells us how much money (in USD) a person pays for their insurance.

### Data Preparation
Using the **csv module**, I convert the CSV file into a list of dictionaries (one per patient) to prepare the dataset for in-depth analysis.

In [20]:
with open('insurance.csv') as insurance_data:
    # DictReader is an iterator, so it is converted to a list once and then reused
    ds = list(csv.DictReader(insurance_data))
    print(ds)

[{'age': '19', 'sex': 'female', 'bmi': '27.9', 'children': '0', 'smoker': 'yes', 'region': 'southwest', 'charges': '16884.924'}, {'age': '18', 'sex': 'male', 'bmi': '33.77', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '1725.5523'}, {'age': '28', 'sex': 'male', 'bmi': '33', 'children': '3', 'smoker': 'no', 'region': 'southeast', 'charges': '4449.462'}, {'age': '33', 'sex': 'male', 'bmi': '22.705', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '21984.47061'}, {'age': '32', 'sex': 'male', 'bmi': '28.88', 'children': '0', 'smoker': 'no', 'region': 'northwest', 'charges': '3866.8552'}, {'age': '31', 'sex': 'female', 'bmi': '25.74', 'children': '0', 'smoker': 'no', 'region': 'southeast', 'charges': '3756.6216'}, {'age': '46', 'sex': 'female', 'bmi': '33.44', 'children': '1', 'smoker': 'no', 'region': 'southeast', 'charges': '8240.5896'}, {'age': '37', 'sex': 'female', 'bmi': '27.74', 'children': '3', 'smoker': 'no', 'region': 'northwest', 'charges'
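
Note that `csv.DictReader` returns every value as a string, so the numeric columns need converting before any arithmetic. The following is a minimal sketch, assuming the column names shown above. `parse_row` and the inline `SAMPLE` (the first three rows of the file listing) are hypothetical names used only so that the block runs without the file on disk:

```python
import csv
import io

# First three data rows copied from the file listing above, so the sketch
# runs without insurance.csv being present.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def parse_row(row):
    """Return a copy of a DictReader row with numeric columns converted."""
    return {
        'age': int(row['age']),
        'sex': row['sex'],
        'bmi': float(row['bmi']),
        'children': int(row['children']),
        'smoker': row['smoker'] == 'yes',  # stored as a boolean (an assumption)
        'region': row['region'],
        'charges': float(row['charges']),
    }

patients = [parse_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(patients[0])
```

Converting once up front would remove the need for the `int(...)` and `float(...)` calls scattered through the later cells, although the rest of this notebook keeps working with the raw strings.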

## Analyzing data
### Age
I start my analysis by focusing on the *'age'* column. I want to know the average charges for each age and see who pays the most and who pays the least for their insurance.

In [6]:
# I start by creating a dictionary with age as the key and a list of insurance costs as the value.

with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_age = {}
    for row in ds:
        age = row['age']
        charge = row['charges']

        # Append each patient's charge exactly once
        if age not in data_by_age:
            data_by_age[age] = []
        data_by_age[age].append(charge)

In [18]:
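# Let's check the mean of the distinct ages in our dataset
# (each age counts once, regardless of how many patients share it)
length = len(data_by_age)
total = 0  # named 'total' so the built-in sum() is not shadowed
for key in data_by_age:
    total += int(key)
mean = total / length
print(mean)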

41.0
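
The value above averages the distinct ages that appear as keys, so every age counts once however many patients share it. The following is a minimal sketch of a patient-weighted mean, assuming `data_by_age` maps each age to the list of charges collected for it. The small dictionary below is a hypothetical excerpt built from rows of the file listing, not the full dataset:

```python
# Hypothetical excerpt: age -> list of charges, taken from rows of the listing.
data_by_age = {
    '19': ['16884.924', '1837.237'],
    '18': ['1725.5523', '1137.011'],
    '28': ['4449.462'],
}

total_age = 0
patient_count = 0
for age, charges in data_by_age.items():
    total_age += int(age) * len(charges)  # weight each age by its number of entries
    patient_count += len(charges)

weighted_mean_age = total_age / patient_count
unweighted_mean_age = sum(int(age) for age in data_by_age) / len(data_by_age)
print(weighted_mean_age, unweighted_mean_age)
```

The two numbers differ whenever some ages are more common than others, which is why the weighted version is the one that describes the typical patient.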


In [19]:
# Here I calculate the mean insurance cost for each age group

def calc_mean(groups):  # I may use this function later
    """Return (key, mean charge) pairs sorted from highest to lowest mean."""
    rank_mean = {}
    for key, values in groups.items():
        total = 0
        for x in values:
            total += float(x)
        rank_mean[key] = total / len(values)
    return sorted(rank_mean.items(), key=lambda item: item[1], reverse=True)


calc_mean(data_by_age)


[('64', 23275.530837272727),
 ('61', 22024.45760869564),
 ('60', 21979.418507391292),
 ('63', 19884.998460869552),
 ('43', 19267.278653333342),
 ('62', 19163.856573478264),
 ('59', 18895.869531600016),
 ('54', 18758.546475357136),
 ('52', 18256.26971931035),
 ('37', 18019.91187720001),
 ('47', 17653.99959310345),
 ('57', 16447.185249999988),
 ('55', 16164.545488461548),
 ('53', 16020.930754999994),
 ('44', 15859.396587037038),
 ('51', 15682.255867241354),
 ('50', 15663.003300689621),
 ('56', 15025.515836538458),
 ('45', 14830.1998562069),
 ('48', 14632.500445172418),
 ('46', 14342.590638620672),
 ('58', 13878.928111600011),
 ('42', 13061.03866888889),
 ('30', 12719.110358148158),
 ('49', 12696.006264285717),
 ('23', 12419.820039642858),
 ('33', 12351.532987307695),
 ('36', 12204.47613799998),
 ('27', 12184.701721428562),
 ('39', 11778.242945200009),
 ('40', 11772.251309999981),
 ('34', 11613.528120769244),
 ('35', 11307.182031199996),
 ('24', 10648.01596214286),
 ('29', 10430.158727037

As we can see, the oldest patients have the highest average insurance costs. In general, the higher the age, the greater the cost, although the ranking is not strictly monotonic: ages 43 and 37, for example, sit among the top ten.
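
With one entry per age the ranking is noisy, and grouping ages into bands can make the trend easier to read. The following is only a sketch, assuming ten-year bands are a reasonable choice. `age_band` is a hypothetical helper, and `rows` holds a few (age, charges) pairs copied from the file listing rather than the full dataset:

```python
def age_band(age, width=10):
    """Return a label such as '30-39' for the band that contains age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# A few (age, charges) pairs copied from the file listing.
rows = [
    (19, 16884.924), (18, 1725.5523), (28, 4449.462),
    (33, 21984.47061), (32, 3866.8552),
    (60, 28923.13692), (62, 27808.7251),
]

charges_by_band = {}
for age, charge in rows:
    charges_by_band.setdefault(age_band(age), []).append(charge)

mean_by_band = {
    band: sum(values) / len(values)
    for band, values in charges_by_band.items()
}
for band in sorted(mean_by_band):
    print(band, round(mean_by_band[band], 2))
```

On a handful of rows the band means say little, but on the full file the same loop would give one mean per decade instead of one per year of age.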

### Sex
In this case I want to compare the insurance costs of men and women and check who pays more.


In [8]:
# Just like for age, I create a dictionary, but this time sex is the key and the charges are the value.
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_sex = {}
    for row in ds:
        sex = row['sex']
        charge = row['charges']

        # Append each patient's charge exactly once
        if sex not in data_by_sex:
            data_by_sex[sex] = []
        data_by_sex[sex].append(charge)

In [9]:
calc_mean(data_by_sex)  # I'm using the function that I created before

[('male', 13956.751177721893), ('female', 12569.578843835325)]

As we can see, men have a higher average insurance cost: about 13,957 USD versus about 12,570 USD for women. On average, women pay roughly 1,387 USD less than men for their insurance.
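
The age and sex cells use the same grouping loop, so it could be factored into one function. The following is a minimal sketch: `group_charges` is a hypothetical helper that is not part of the original notebook, and the inline `SAMPLE` reuses rows from the file listing so the block runs without the file:

```python
import csv
import io

# Four data rows copied from the file listing.
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
31,female,25.74,0,no,southeast,3756.6216
"""

def group_charges(rows, column):
    """Map each distinct value of `column` to a list of float charges."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(float(row['charges']))
    return groups

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
by_sex = group_charges(rows, 'sex')
by_smoker = group_charges(rows, 'smoker')
print({key: len(values) for key, values in by_sex.items()})
```

Any categorical column can be passed as `column`, so the same helper would also cover `'smoker'` or `'region'` without another copy of the loop.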

### Region
In this step I will check how many patients come from each region.

In [23]:
with open('insurance.csv') as insurance_data:
    ds = csv.DictReader(insurance_data)
    data_by_region = {}
    for row in ds:
        region = row['region']

        # Count each patient once
        if region in data_by_region:
            data_by_region[region] += 1
        else:
            data_by_region[region] = 1
data_by_region

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
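
Raw counts are easier to compare as shares of the total. The following is a small sketch of that conversion. The `regions` list is the region column of the first eight rows in the file listing, used as a stand-in for the full dataset, and `region_shares` is a name introduced here for illustration:

```python
# Region column of the first eight rows in the file listing.
regions = [
    'southwest', 'southeast', 'southeast', 'northwest',
    'northwest', 'southeast', 'southeast', 'northwest',
]

region_counts = {}
for region in regions:
    region_counts[region] = region_counts.get(region, 0) + 1

total = sum(region_counts.values())
region_shares = {
    region: count / total
    for region, count in region_counts.items()
}
print(region_shares)
# → {'southwest': 0.125, 'southeast': 0.5, 'northwest': 0.375}
```

Using `dict.get` with a default of 0 replaces the `if`/`else` branch while keeping the count in a plain dictionary.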