# U.S. Medical Insurance Costs

This portfolio project analyses U.S. medical insurance costs. It uses Python fundamentals such as functions, loops, dictionaries, lists, and file handling to summarize the dataset.

The analysis follows these steps:

1. Importing Dataset
2. Inspecting Dataset
3. Preparing Dataset
4. Analysis
5. Summary


## 1. Importing Dataset
Let's import Python's csv module to read our dataset.

In [1]:
import csv

We will store each column of the dataset as a list. We therefore create empty lists first, then open the CSV file and read each row's values into them.


In [2]:
ages = []
sex = []
bmis = []
num_of_children = []
smoker = []
region = []
insurance_cost = []
with open('insurance.csv', newline='') as insurance_csv:
    insurance_reader = csv.DictReader(insurance_csv, delimiter=",")
    for row in insurance_reader:
        ages.append(row['age'])
        sex.append(row['sex'])
        bmis.append(row['bmi'])
        num_of_children.append(row['children'])
        smoker.append(row['smoker'])
        region.append(row['region'])
        insurance_cost.append(row['charges'])
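
As a quick check on the loading step, the same column-wise read can be sketched against a small in-memory CSV. The helper name `read_columns` is an assumption made for illustration and is not part of the notebook; the three sample rows reuse the first values printed in the inspection step.

```python
import csv
import io

# Three sample rows in the same layout as insurance.csv (illustrative only).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def read_columns(csv_file):
    """Hypothetical helper: read a CSV file object into a dict of column lists."""
    reader = csv.DictReader(csv_file)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns

columns = read_columns(io.StringIO(SAMPLE_CSV))
print(columns["age"])      # every value is still a string at this point
print(columns["charges"])
```

Keeping the columns in one dictionary avoids declaring seven separate lists, at the cost of looking each column up by name.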
       

## 2. Inspecting Dataset

We need to confirm that the data is stored in the respective lists and check the data type of the elements. We'll do that by checking the length of each list and looking at its first five and last five elements.


In [3]:
def dataset_inspection(values, x):
    # Takes a list and a positive integer x, and returns a dictionary containing
    # the length of the list, its first x elements, and its last x elements.
    # The key names assume x = 5, as in the calls below.
    my_dict = {}
    my_dict['length_of_list'] = len(values)
    my_dict['first_five_elements'] = values[:x]
    my_dict['last_five_elements'] = values[-x:]
    return my_dict

In [4]:
# Let's test our function on our lists
print(dataset_inspection(ages, 5))
print(dataset_inspection(sex, 5))
print(dataset_inspection(bmis, 5))
print(dataset_inspection(num_of_children, 5))
print(dataset_inspection(smoker, 5))
print(dataset_inspection(region, 5))
print(dataset_inspection(insurance_cost, 5))


{'length_of_list': 1338, 'first_five_elements': ['19', '18', '28', '33', '32'], 'last_five_elements': ['50', '18', '18', '21', '61']}
{'length_of_list': 1338, 'first_five_elements': ['female', 'male', 'male', 'male', 'male'], 'last_five_elements': ['male', 'female', 'female', 'female', 'female']}
{'length_of_list': 1338, 'first_five_elements': ['27.9', '33.77', '33', '22.705', '28.88'], 'last_five_elements': ['30.97', '31.92', '36.85', '25.8', '29.07']}
{'length_of_list': 1338, 'first_five_elements': ['0', '1', '3', '0', '0'], 'last_five_elements': ['3', '0', '0', '0', '0']}
{'length_of_list': 1338, 'first_five_elements': ['yes', 'no', 'no', 'no', 'no'], 'last_five_elements': ['no', 'no', 'no', 'no', 'yes']}
{'length_of_list': 1338, 'first_five_elements': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'], 'last_five_elements': ['northwest', 'northeast', 'southeast', 'southwest', 'northwest']}
{'length_of_list': 1338, 'first_five_elements': ['16884.924', '1725.5523', '44
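
The quotes around every value in the output suggest that each element is a string. A minimal sketch of checking this directly is shown below; `element_types` is a hypothetical helper, not part of the notebook, and the sample list reuses the first five ages printed above.

```python
from collections import Counter

def element_types(values):
    """Hypothetical helper: count how many elements of each type a list holds."""
    return Counter(type(value).__name__ for value in values)

sample_ages = ['19', '18', '28', '33', '32']   # first five ages as read from the CSV
print(element_types(sample_ages))
```

A list that mixes types would show more than one key in the result, which is a quick signal that a column needs cleaning.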

## 3. Preparing Dataset

Now that we have a feel for what our lists (the columns of the dataset) look like, we can prepare them for analysis.
Notice that our integers and floats are stored as strings.

#### Let's create functions that convert our strings to floats and integers
Each function takes a list of strings and returns a list of integers or floats.

In [5]:
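def string_to_integer(values):
    # Converts a list of numeric strings into a list of integers.
    new_list = []
    for item in values:
        new_list.append(int(item))
    return new_list

def string_to_float(values):
    # Converts a list of numeric strings into a list of floats.
    new_list = []
    for item in values:
        new_list.append(float(item))
    return new_list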
        

Let's convert the numeric lists from strings into integers or floats, as the case may be.

In [6]:
ages = string_to_integer(ages)
bmis = string_to_float(bmis)
num_of_children = string_to_integer(num_of_children)
insurance_cost = string_to_float(insurance_cost)

In [7]:
#confirm if the strings have been converted
print(ages[:10])
print(bmis[:10])
print(num_of_children[:10])
print(insurance_cost[:10])

[19, 18, 28, 33, 32, 31, 46, 37, 37, 60]
[27.9, 33.77, 33.0, 22.705, 28.88, 25.74, 33.44, 27.74, 29.83, 25.84]
[0, 1, 3, 0, 0, 0, 1, 3, 2, 0]
[16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552, 3756.6216, 8240.5896, 7281.5056, 6406.4107, 28923.13692]
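
The conversion functions above raise a `ValueError` as soon as one entry is not a valid number. The sketch below shows one way to collect such entries instead of stopping; `string_to_float_checked` and the sample input containing `'n/a'` are assumptions made for illustration, since the notebook's output shows no bad entries in this dataset.

```python
def string_to_float_checked(values):
    """Hypothetical variant: convert strings to floats and report entries that fail."""
    converted = []
    bad_entries = []
    for index, item in enumerate(values):
        try:
            converted.append(float(item))
        except ValueError:
            # Keep the position and raw text so the row can be inspected later.
            bad_entries.append((index, item))
    return converted, bad_entries

converted, bad_entries = string_to_float_checked(['27.9', '33.77', 'n/a', '33'])
print(converted)
print(bad_entries)
```

Note that dropping bad entries shortens the list, so a real cleanup would need to drop the same rows from every other column to keep them aligned.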


## 4. Analysis

### Average

Let's create a function that returns the average of the elements in a list. It applies to numerical values only.

In [8]:
def average(values):
    """A function that returns the average of a non-empty list containing numerical elements."""
    return sum(values)/len(values)



In [9]:
#Let's determine the average age of the dataset using the average function.
print("The average age of the dataset is {}".format(average(ages)))

The average age of the dataset is 39.20702541106129
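
As a cross-check, the hand-written `average` can be compared with the standard library's `statistics.mean`. The sketch below uses only the first ten ages printed earlier, so its result differs from the full-dataset average; the median is included as an extra illustration and is not part of the notebook's analysis.

```python
import statistics

sample_ages = [19, 18, 28, 33, 32, 31, 46, 37, 37, 60]  # first ten ages shown earlier

def average(values):
    # Copy of the notebook's function so this block runs on its own.
    return sum(values) / len(values)

print(average(sample_ages))
print(statistics.mean(sample_ages))
print(statistics.median(sample_ages))
```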


### Age Group

Let's categorize our ages into brackets to better understand the demographics:
18-24, 25-34, 35-44, 45-54, 55-64, and 65 and above.

In [10]:
def age_group(ages):
    # Takes a list of integer ages and returns a dictionary with the number of
    # entries in each age bracket specified above. Ages under 18 are not counted.
    my_dict = {"18-24": 0, "25-34": 0, "35-44": 0, "45-54": 0, "55-64": 0, "65 & above": 0}
    for item in ages:
        if item < 18:
            pass
        elif 18 <= item <= 24:
            my_dict["18-24"] += 1
        elif 25 <= item <= 34:
            my_dict["25-34"] += 1
        elif 35 <= item <= 44:
            my_dict["35-44"] += 1
        elif 45 <= item <= 54:
            my_dict["45-54"] += 1
        elif 55 <= item <= 64:
            my_dict["55-64"] += 1
        else:
            my_dict["65 & above"] += 1
    my_dict['total'] = sum(my_dict.values())
    return my_dict
    

In [11]:
age_group(ages)

{'18-24': 278,
 '25-34': 271,
 '35-44': 260,
 '45-54': 287,
 '55-64': 242,
 '65 & above': 0,
 'total': 1338}
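
The chain of comparisons in `age_group` can also be expressed as a lookup against the lower bound of each bracket. The sketch below is one possible alternative, assuming whole-number ages; `LOWER_BOUNDS`, `LABELS`, and `age_bracket` are hypothetical names introduced for illustration.

```python
from bisect import bisect_right

# Lower bound of each bracket, in ascending order, with the matching labels.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65 & above"]

def age_bracket(age):
    """Hypothetical helper: return the bracket label for one age, or None if under 18."""
    position = bisect_right(LOWER_BOUNDS, age)
    if position == 0:
        return None
    return LABELS[position - 1]

for age in [17, 18, 24, 25, 64, 65]:
    print(age, age_bracket(age))
```

Adding or moving a bracket then only requires editing the two lists rather than the branching logic.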

### Spread of Data

Let's examine the spread of the categorical variables in our dataset. The function below takes a list and returns a dictionary with a count of every possible entry, as well as the total count. 

In [12]:
def data_spread(values):
    # Counts how often each distinct entry appears, then adds the overall total.
    my_dict = {}
    for item in values:
        if item not in my_dict:
            my_dict[item] = 1
        else:
            my_dict[item] += 1
    my_dict['total'] = sum(my_dict.values())
    return my_dict

In [13]:
# Let's examine our regional spread
data_spread(region)

{'southwest': 325,
 'southeast': 364,
 'northwest': 325,
 'northeast': 324,
 'total': 1338}

In [14]:
# Let's examine the number of males and females in our dataset
data_spread(sex)

{'female': 662, 'male': 676, 'total': 1338}

In [15]:
# Let's examine the spread of smokers and non-smokers
data_spread(smoker)

{'yes': 274, 'no': 1064, 'total': 1338}
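
The counting done by `data_spread` is also available in the standard library as `collections.Counter`. The sketch below applies it to a ten-value sample (the first five and last five smoker values printed in the inspection step) and additionally derives each category's share; the share calculation is an illustrative extension, not part of the notebook.

```python
from collections import Counter

# First five and last five smoker values shown in the inspection output.
sample_smoker = ['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes']

counts = Counter(sample_smoker)
total = sum(counts.values())
# Fraction of the sample that falls in each category.
shares = {label: count / total for label, count in counts.items()}

print(counts)
print(shares)
```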

#### Number of Individuals with at least one child

In [16]:
def num_of_child(children):
    # Counts individuals with no children and individuals with at least one child.
    my_dict = {"no child": 0, "at least one child": 0}
    for item in children:
        if item == 0:
            my_dict["no child"] += 1
        else:
            my_dict["at least one child"] += 1
    return my_dict
            

In [17]:
num_of_child(num_of_children)

{'no child': 574, 'at least one child': 764}

### Group by Region

Here we compute the average medical cost, age, number of children, and BMI per region.
We will write a function that takes two arguments: the list of regions and a list of any numerical data. It returns a dictionary mapping each region to the average of the numerical data for that region.

In [18]:
def group_by_region(region, values):
    # Collect the values belonging to each region, then average each group.
    my_dict = {}
    southeast = []
    southwest = []
    northwest = []
    northeast = []
    for i in range(len(region)):
        if region[i] == 'southeast':
            southeast.append(values[i])
        elif region[i] == 'southwest':
            southwest.append(values[i])
        elif region[i] == 'northeast':
            northeast.append(values[i])
        elif region[i] == 'northwest':
            northwest.append(values[i])
    my_dict['southeast'] = average(southeast)
    my_dict['southwest'] = average(southwest)
    my_dict['northeast'] = average(northeast)
    my_dict['northwest'] = average(northwest)
    return my_dict
        

In [19]:
group_by_region(region, insurance_cost)

{'southeast': 14735.411437609895,
 'southwest': 12346.93737729231,
 'northeast': 13406.3845163858,
 'northwest': 12417.575373969228}

In [20]:
group_by_region(region, ages)

{'southeast': 38.93956043956044,
 'southwest': 39.45538461538462,
 'northeast': 39.26851851851852,
 'northwest': 39.19692307692308}

In [21]:
group_by_region(region, num_of_children)

{'southeast': 1.0494505494505495,
 'southwest': 1.1415384615384616,
 'northeast': 1.0462962962962963,
 'northwest': 1.1476923076923078}

In [22]:
group_by_region(region, bmis)

{'southeast': 33.35598901098903,
 'southwest': 30.59661538461538,
 'northeast': 29.17350308641976,
 'northwest': 29.199784615384626}

### Group by Smoker

As with the grouping by region, we want to determine the average of a numerical variable according to the smoker status of the individual.

In [23]:
def group_by_smoker(smoker, values, label):
    """Takes three arguments: a list of smoker data, a list of any other numerical variable in the dataset,
    and a string of your choice to describe the numerical variable.
    Returns the count of smokers and non-smokers and the average of the numerical data for each group."""
    key = "average_" + label
    smoker_list = []
    non_smoker_list = []
    # Split the numerical values by smoker status.
    for i in range(len(smoker)):
        if smoker[i] == 'yes':
            smoker_list.append(values[i])
        else:
            non_smoker_list.append(values[i])
    my_dict = {'smoker': {'count': len(smoker_list), key: average(smoker_list)},
               'non-smoker': {'count': len(non_smoker_list), key: average(non_smoker_list)}}
    return my_dict

In [24]:
# Let's test our function on the insurance cost. This gives the average insurance cost for smokers and non-smokers.
group_by_smoker(smoker,insurance_cost,"cost")

{'smoker': {'count': 274, 'average_cost': 32050.23183153285},
 'non-smoker': {'count': 1064, 'average_cost': 8434.268297856199}}

### Group by Age Group

In [25]:
def group_by_age_group(ages, values):
    # Collect the values belonging to each age bracket, then average each bracket.
    # Ages outside 18-64 are ignored; the dataset contains none.
    my_dict = {"18-24":[], "25-34":[], "35-44":[], "45-54":[],"55-64":[]}
    for i in range(len(values)):
        if 18 <= ages[i] <= 24:
            my_dict["18-24"].append(values[i])
        elif 25 <= ages[i] <= 34:
            my_dict["25-34"].append(values[i])
        elif 35 <= ages[i] <= 44:
            my_dict["35-44"].append(values[i])
        elif 45 <= ages[i] <= 54:
            my_dict["45-54"].append(values[i])
        elif 55 <= ages[i] <= 64:
            my_dict["55-64"].append(values[i])
    for key in my_dict:
        my_dict[key] = average(my_dict[key])
    return my_dict

In [26]:
group_by_age_group(ages, insurance_cost)

{'18-24': 9011.340317334529,
 '25-34': 10352.392525793359,
 '35-44': 13134.168692692307,
 '45-54': 15853.927878188151,
 '55-64': 18513.276226900805}

In [27]:
group_by_age_group(ages, num_of_children)

{'18-24': 0.60431654676259,
 '25-34': 1.2767527675276753,
 '35-44': 1.4923076923076923,
 '45-54': 1.386759581881533,
 '55-64': 0.6818181818181818}

In [28]:
group_by_age_group(ages, bmis)

{'18-24': 30.038920863309365,
 '25-34': 30.064132841328412,
 '35-44': 30.399711538461535,
 '45-54': 31.14670731707317,
 '55-64': 31.761962809917364}

### Group by Sex


In [29]:
def group_by_sex(sex, values):
    # Collect the values for each sex, then average each group.
    my_dict = {'female': [], 'male':[]}
    for i in range(len(values)):
        if sex[i] == 'female':
            my_dict['female'].append(values[i])
        else:
            my_dict['male'].append(values[i])
    for key in my_dict:
        my_dict[key] = average(my_dict[key])
    return my_dict

In [30]:
group_by_sex(sex, insurance_cost)

{'female': 12569.57884383534, 'male': 13956.751177721886}

In [31]:
group_by_sex(sex, bmis)

{'female': 30.377749244713023, 'male': 30.943128698224832}

In [32]:
group_by_sex(sex, num_of_children)

{'female': 1.0740181268882176, 'male': 1.1153846153846154}

In [33]:
group_by_sex(sex, ages)

{'female': 39.503021148036254, 'male': 38.917159763313606}
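
The group-by functions for region, smoker status, and sex all follow the same pattern: split a numerical list by a categorical list, then average each group. A minimal sketch of a single general function is shown below; `group_average` is a hypothetical name, and the sample data reuses the first five regions and charges printed earlier.

```python
def group_average(categories, values):
    """Hypothetical generalisation: average of `values` for each label in `categories`."""
    groups = {}
    for label, value in zip(categories, values):
        # Create the group's list on first sight of a label, then append to it.
        groups.setdefault(label, []).append(value)
    return {label: sum(items) / len(items) for label, items in groups.items()}

sample_region = ['southwest', 'southeast', 'southeast', 'northwest', 'northwest']
sample_cost = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552]
print(group_average(sample_region, sample_cost))
```

Because groups are created only for labels that actually occur, this version never averages an empty list. Grouping by age bracket would additionally need the ages mapped to bracket labels first.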

## 5. Summary

The average age in the dataset is about 39 years. The age groups are fairly evenly represented, with the fewest entries from individuals aged 55 to 64 (242) and the most from those aged 45 to 54 (287). The dataset is also balanced with respect to region and sex. However, the ratio of smokers to non-smokers is roughly 1:4 (274 to 1,064).

The average insurance cost in the southeast is **14,735.41 dollars**, the highest of the four regions and about **2,400 dollars** more than in the southwest, the region with the lowest average charge. The southeast also has the highest average BMI (**33.36**) and the lowest average age of the four regions.

A significant takeaway is the difference in average insurance cost between smokers and non-smokers. The average cost for a smoker is about **32,000 dollars**, nearly four times the cost for a non-smoker (about 8,400 dollars). *Note that the dataset contains only about a quarter as many smokers as non-smokers.* Furthermore, the average insurance cost increases with age, and on average men pay about 1,400 dollars more than women.
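
The comparisons in this summary can be recomputed directly from the averages printed in the analysis section. The sketch below copies those printed values by hand; the variable names are chosen here for illustration.

```python
# Averages copied from the group-by outputs in the analysis section.
southeast_cost = 14735.411437609895
southwest_cost = 12346.93737729231
smoker_cost = 32050.23183153285
non_smoker_cost = 8434.268297856199
male_cost = 13956.751177721886
female_cost = 12569.57884383534

region_gap = southeast_cost - southwest_cost   # southeast vs. southwest, in dollars
smoker_ratio = smoker_cost / non_smoker_cost   # how many times more a smoker pays
sex_gap = male_cost - female_cost              # male vs. female, in dollars

print(round(region_gap, 2))
print(round(smoker_ratio, 2))
print(round(sex_gap, 2))
```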



