# U.S. Medical Insurance Costs

In this project, a CSV file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within insurance.csv to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

In [1]:
#Import library CSV
import csv

To start, all necessary libraries must be imported. For this project the only library needed is the `csv` library in order to work with the **insurance.csv** data. There are other potential libraries that could help with this project; however, for this analysis, using just the `csv` library will suffice.

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
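
The checks listed above can also be done programmatically. The following is a minimal sketch, not part of the original notebook: `inspect_csv` is a hypothetical helper, and the three sample rows mirror the first records printed later in this notebook so the example runs without the real file.

```python
import csv
import io

# Hypothetical three-row sample in the same layout as insurance.csv;
# swap in open('insurance.csv', newline='') to inspect the real file.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
)

def inspect_csv(handle):
    """Return the column names and a per-column count of empty cells."""
    reader = csv.DictReader(handle)
    missing = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # A short row yields None; an empty cell yields ''
            if row[name] is None or row[name].strip() == "":
                missing[name] += 1
    return reader.fieldnames, missing

columns, missing_counts = inspect_csv(sample)
print(columns)
print(missing_counts)
```

A count of zero for every column would support the observation that no data is missing.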

In [2]:
#Create empty lists, one per column
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

There are no signs of missing data. To store this information, seven empty lists are created, one for each column of data from **insurance.csv**.


In [3]:
def load_data(lst, csv_info, col_name):
    #Open the CSV file and append every value of one column to lst
    with open(csv_info, newline='') as csv_data:
        data = csv.DictReader(csv_data)
        for row in data:
            lst.append(row[col_name])
    return lst

We load the data into the empty lists created above using a function that takes three parameters (*lst*, *csv_info*, *col_name*).
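
Calling the loader once per column reopens the file each time. As an alternative, here is a hedged sketch that reads the file a single time; `load_columns` is a hypothetical name, and the in-memory sample stands in for **insurance.csv**.

```python
import csv
import io

def load_columns(handle, col_names):
    """Read the file once and return {column name: list of string values}."""
    columns = {name: [] for name in col_names}
    for row in csv.DictReader(handle):
        for name in col_names:
            columns[name].append(row[name])
    return columns

# Hypothetical sample; with the real file, pass open('insurance.csv', newline='')
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,33.77,1,no,southeast,1725.5523\n"
    "28,male,33.0,3,no,southeast,4449.462\n"
)
cols = load_columns(sample, ['age', 'region', 'charges'])
print(cols['age'])
print(cols['region'])
```

The values are still strings at this point, just as in the seven lists used in this notebook.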

In [4]:
load_data(age, 'insurance.csv', 'age')
load_data(sex, 'insurance.csv', 'sex')
load_data(bmi, 'insurance.csv', 'bmi')
load_data(children, 'insurance.csv', 'children')
load_data(smoker, 'insurance.csv', 'smoker')
load_data(region, 'insurance.csv', 'region')
load_data(charges, 'insurance.csv', 'charges')

['16884.924',
 '1725.5523',
 '4449.462',
 '21984.47061',
 '3866.8552',
 '3756.6216',
 '8240.5896',
 '7281.5056',
 '6406.4107',
 '28923.13692',
 '2721.3208',
 '27808.7251',
 '1826.843',
 '11090.7178',
 '39611.7577',
 '1837.237',
 '10797.3362',
 '2395.17155',
 '10602.385',
 '36837.467',
 '13228.84695',
 '4149.736',
 '1137.011',
 '37701.8768',
 '6203.90175',
 '14001.1338',
 '14451.83515',
 '12268.63225',
 '2775.19215',
 '38711',
 '35585.576',
 '2198.18985',
 '4687.797',
 '13770.0979',
 '51194.55914',
 '1625.43375',
 '15612.19335',
 '2302.3',
 '39774.2763',
 '48173.361',
 '3046.062',
 '4949.7587',
 '6272.4772',
 '6313.759',
 '6079.6715',
 '20630.28351',
 '3393.35635',
 '3556.9223',
 '12629.8967',
 '38709.176',
 '2211.13075',
 '3579.8287',
 '23568.272',
 '37742.5757',
 '8059.6791',
 '47496.49445',
 '13607.36875',
 '34303.1672',
 '23244.7902',
 '5989.52365',
 '8606.2174',
 '4504.6624',
 '30166.61817',
 '4133.64165',
 '14711.7438',
 '1743.214',
 '14235.072',
 '6389.37785',
 '5920.1041',
 '176

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can be started. This is where one must plan out what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:
* create a dictionary that holds all patient information
* find the average age of the patients
* return the number of males vs. females counted in the dataset
* find the geographical location of the patients
* return the average yearly medical charges of the patients

In [5]:
#Construct all above list into dictionaries
def construct_data(age, sex, bmi, children, smoker, region, charges):
    new_data = dict()
    num_data = len(age)
    for i in range(num_data):
        new_data[i] = {'age': age[i], 
                    'sexes': sex[i], 
                    'bmi': float(bmi[i]), 
                    'children': int(children[i]), 
                    'stat_smoker': smoker[i], 
                    'region': region[i], 
                    'cost': float(charges[i])}
    return new_data

new_data = construct_data(age, sex, bmi, children, smoker, region, charges)
print(new_data)

{0: {'age': '19', 'sexes': 'female', 'bmi': 27.9, 'children': 0, 'stat_smoker': 'yes', 'region': 'southwest', 'cost': 16884.924}, 1: {'age': '18', 'sexes': 'male', 'bmi': 33.77, 'children': 1, 'stat_smoker': 'no', 'region': 'southeast', 'cost': 1725.5523}, 2: {'age': '28', 'sexes': 'male', 'bmi': 33.0, 'children': 3, 'stat_smoker': 'no', 'region': 'southeast', 'cost': 4449.462}, 3: {'age': '33', 'sexes': 'male', 'bmi': 22.705, 'children': 0, 'stat_smoker': 'no', 'region': 'northwest', 'cost': 21984.47061}, 4: {'age': '32', 'sexes': 'male', 'bmi': 28.88, 'children': 0, 'stat_smoker': 'no', 'region': 'northwest', 'cost': 3866.8552}, 5: {'age': '31', 'sexes': 'female', 'bmi': 25.74, 'children': 0, 'stat_smoker': 'no', 'region': 'southeast', 'cost': 3756.6216}, 6: {'age': '46', 'sexes': 'female', 'bmi': 33.44, 'children': 1, 'stat_smoker': 'no', 'region': 'southeast', 'cost': 8240.5896}, 7: {'age': '37', 'sexes': 'female', 'bmi': 27.74, 'children': 3, 'stat_smoker': 'no', 'region': 'northw

Next, we inspect a single record to confirm that each entry contains the patient information: `Age`, `Sex`, `BMI`, `Number of Children`, `Smoker Status`, `Region`, and `Yearly Medical Insurance Cost`.

In [6]:
print(new_data[1])

{'age': '18', 'sexes': 'male', 'bmi': 33.77, 'children': 1, 'stat_smoker': 'no', 'region': 'southeast', 'cost': 1725.5523}


Suppose we want to know the average age of all patients in this dataset. We build a function called `avg_patient` that takes one parameter, `new_data`.

In [7]:
#Find out the average age of the patients in the dataset
def avg_patient(new_data):
    total_age = 0
    len_data = len(new_data)
    for i in new_data:
        curr_age = new_data[i]['age']
        total_age += int(curr_age)
    return round(total_age / len_data, 2)

In [8]:
print(f"The average age of patient in dataset is: {avg_patient(new_data)}  years old")

The average age of patient in dataset is: 39.21  years old


Next, suppose we want to know how many regions appear in this dataset.

In [9]:
#Collect each distinct region in order of first appearance
region_data = []
for item in region:
    if item not in region_data:
        region_data.append(item)
print(f'There is {len(region_data)} region in dataset i.e. : {region_data}')

There is 4 region in dataset i.e. : ['southwest', 'southeast', 'northwest', 'northeast']
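
The same de-duplication can be written more compactly. This is a small sketch on a hypothetical sample list rather than the full `region` column; it relies on the fact that dictionaries keep insertion order in Python 3.7 and later.

```python
# Hypothetical sample of the region column, in file order
region_sample = ['southwest', 'southeast', 'southeast', 'northwest', 'southwest']

# dict.fromkeys keeps the first occurrence of each key, preserving order
unique_regions = list(dict.fromkeys(region_sample))
print(len(unique_regions), unique_regions)
```

A `set` would also remove duplicates, but it would not preserve the order in which the regions first appear.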


In [10]:
#Analyze where a majority of the individuals are from
def major_region(new_data):
    major_reg = {'southwest': 0, 'southeast': 0, 'northwest': 0, 'northeast': 0}
    for i in new_data:
        curr_region = new_data[i]['region']
        if curr_region == 'southwest':
            major_reg['southwest'] += 1
        elif curr_region == 'southeast':
            major_reg['southeast'] += 1
        elif curr_region == 'northwest':
            major_reg['northwest'] += 1
        elif curr_region == 'northeast':
            major_reg['northeast'] += 1
    return major_reg

Suppose we want to find where most of the individuals are from. The function `major_region` above takes one parameter, `new_data`, and counts the patients in each region.
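
For comparison, the tally can be sketched with `collections.Counter` from the standard library, which also returns the most frequent region together with its count. The `records` dictionary and the `count_regions` helper are hypothetical stand-ins for `new_data` and `major_region`.

```python
from collections import Counter

# Hypothetical records shaped like new_data (only the field used here)
records = {
    0: {'region': 'southwest'},
    1: {'region': 'southeast'},
    2: {'region': 'southeast'},
}

def count_regions(data):
    """Count how many records fall in each region."""
    return Counter(rec['region'] for rec in data.values())

region_counts = count_regions(records)
# most_common(1) returns a list holding the (region, count) pair with the highest count
top_region, top_count = region_counts.most_common(1)[0]
print(top_region, top_count)
```

Reading the name and the count from the same pair keeps the two values from drifting apart.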

In [11]:
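region_counts = major_region(new_data)
#Pick the region with the highest count, not the alphabetically last name
top_region = max(region_counts, key=region_counts.get)
area = top_region.upper()
num_area = region_counts[top_region]

print(f"We find a majority of the individuals are from : {area} i.e amount to {num_area} person")

We find a majority of the individuals are from : SOUTHEAST i.e amount to 364 person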


In [12]:
#Sum the total costs of smokers and of non-smokers
def diff_cost(new_data):
    total_smoker = 0
    total_non_smoker = 0
    for i in new_data:
        curr_stat = new_data[i]['stat_smoker']
        curr_cost = new_data[i]['cost']
        if curr_stat == 'yes':
            total_smoker += curr_cost
        else:
            total_non_smoker += curr_cost
    return round(total_smoker, 3), round(total_non_smoker, 3)

Next, we compare the total `cost` of smokers with that of non-smokers. The function `diff_cost` takes one parameter, `new_data`, and returns the summed charges for each group.

In [13]:
cost_smoker, cost_non_smoker = diff_cost(new_data)
#Use a separate name so the diff_cost function is not overwritten
cost_difference = abs(round(cost_smoker - cost_non_smoker, 3))
print(f'Cost for smoker: ${cost_smoker}\nCost for non-smoker: ${cost_non_smoker}')
print(f'The different costs between smokers and non-smokers around: ${cost_difference}')

Cost for smoker: $8781763.522
Cost for non-smoker: $8974061.469
The different costs between smokers and non-smokers around: $192297.947
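
The totals above depend on how many patients are in each group, so a per-patient average may be easier to compare. The following is a minimal sketch under that assumption; `avg_cost_by_smoker` is a hypothetical helper, and `records` reuses the first three patients printed earlier in place of the full `new_data`.

```python
# Hypothetical records shaped like new_data (only the fields used here)
records = {
    0: {'stat_smoker': 'yes', 'cost': 16884.924},
    1: {'stat_smoker': 'no', 'cost': 1725.5523},
    2: {'stat_smoker': 'no', 'cost': 4449.462},
}

def avg_cost_by_smoker(data):
    """Return the average cost per patient for smokers and non-smokers."""
    totals = {'yes': 0.0, 'no': 0.0}
    counts = {'yes': 0, 'no': 0}
    for rec in data.values():
        status = rec['stat_smoker']
        totals[status] += rec['cost']
        counts[status] += 1
    # Return None for a group with no patients to avoid dividing by zero
    return {s: (round(totals[s] / counts[s], 2) if counts[s] else None)
            for s in totals}

averages = avg_cost_by_smoker(records)
print(averages)
```

Run against `new_data`, the same function would give one average per group without being skewed by group size.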


In [14]:
def avg_age_patient_one_children(new_data):
    total_patient = 0
    num_patient = []
    for i in new_data:
        curr_patient = new_data[i]['children']
        curr_age = new_data[i]['age']
        if curr_patient == 1:
            total_patient += int(curr_age)
            num_patient.append(curr_age)
    return round(total_patient / len(num_patient))

The function above, `avg_age_patient_one_children`, computes the average age of patients who have exactly one child. It takes one parameter, `new_data`.

In [15]:
avg_age_patient_has_1_child = avg_age_patient_one_children(new_data)
print(f'The average age of patients who have exactly 1 child is {avg_age_patient_has_1_child} years old')

The average age of patients who have exactly 1 child is 39 years old


In [16]:
#find average cost for all patient
def avg_cost_patient(new_data):
    total_cost = 0
    len_data = len(new_data)
    for i in new_data:
        curr_cost = new_data[i]['cost']
        total_cost += curr_cost
    return round(total_cost / len_data, 2)

We want to know the average cost across all patients. The function `avg_cost_patient` above takes one parameter, `new_data`.

In [17]:
avg_cost = avg_cost_patient(new_data)
print(f'We find the average cost for all patient amount to ${avg_cost}')

We find the average cost for all patient amount to $13270.42


In [18]:
#Find total of male and female
def total_patient_by_gender(new_data):
    male = 0
    female = 0
    for i in new_data:
        curr_patient = new_data[i]['sexes']
        if curr_patient == 'male':
            male += 1
        elif curr_patient == 'female':
            female += 1
    return male, female

The function above returns the number of males and females counted in the dataset.

In [19]:
male, female = total_patient_by_gender(new_data)
print(f'Male : {male} person')
print(f'Female : {female} person')

Male : 676 person
Female : 662 person


In [20]:
#Find Average BMI for smoker, non smoker and all patient
def all_avg_bmi(new_data):
    total_bmi = 0
    num_data = len(new_data)
    for i in new_data:
        curr_patient= new_data[i]['bmi']
        total_bmi += curr_patient
    return round(total_bmi / num_data, 2)

def avg_bmi_smoker(new_data):
    total_bmi = 0
    patient_smoker = 0
    for i in new_data:
        curr_patient = new_data[i]['bmi']
        curr_stat_patient = new_data[i]['stat_smoker']
        if curr_stat_patient == 'yes':
            total_bmi += curr_patient
            patient_smoker += 1
    return round(total_bmi / patient_smoker, 2)

def avg_bmi_non_smoker(new_data):
    total_bmi = 0
    patient_non_smoker = 0
    for i in new_data:
        curr_patient = new_data[i]['bmi']
        curr_stat_patient = new_data[i]['stat_smoker']
        if curr_stat_patient == 'no':
            total_bmi += curr_patient
            patient_non_smoker += 1
    return round(total_bmi / patient_non_smoker, 2)

Last but not least, we want to know the average BMI for smokers, non-smokers, and all patients in the dataset.

In [21]:
avg_bmi_all_patient = all_avg_bmi(new_data)
avg_bmi_for_smoker = avg_bmi_smoker(new_data)
avg_bmi_for_non_smoker = avg_bmi_non_smoker(new_data)

print(f'Average BMI for all patient is {avg_bmi_all_patient}')
print(f'Average BMI for smoker patient is {avg_bmi_for_smoker}')
print(f'Average BMI for non smoker patient is {avg_bmi_for_non_smoker}')

Average BMI for all patient is 30.66
Average BMI for smoker patient is 30.71
Average BMI for non smoker patient is 30.65
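
The three BMI functions differ only in which patients they include. One way to express that, sketched here with a hypothetical helper `average_of` and a small sample instead of `new_data`, is a single function that accepts an optional filter.

```python
# Hypothetical records shaped like new_data (only the fields used here)
records = {
    0: {'bmi': 27.9, 'stat_smoker': 'yes'},
    1: {'bmi': 33.77, 'stat_smoker': 'no'},
    2: {'bmi': 33.0, 'stat_smoker': 'no'},
}

def average_of(data, field, condition=None):
    """Average a numeric field over the records that satisfy condition."""
    values = [rec[field] for rec in data.values()
              if condition is None or condition(rec)]
    if not values:
        return None  # no matching records, so no average to report
    return round(sum(values) / len(values), 2)

avg_all = average_of(records, 'bmi')
avg_smoker = average_of(records, 'bmi', lambda rec: rec['stat_smoker'] == 'yes')
avg_non_smoker = average_of(records, 'bmi', lambda rec: rec['stat_smoker'] == 'no')
print(avg_all, avg_smoker, avg_non_smoker)
```

The same helper could also cover the average age or cost, provided the field is stored as a number.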


## Conclusion

Based on this analysis of the dataset, we recommend that patients stop smoking to save on insurance costs and maintain a healthy lifestyle, starting with food and exercise.