# U.S. Medical Insurance Costs

This project uses Python to investigate a __CSV__ file of medical insurance costs. The goal is to analyse various attributes within __insurance.csv__ to learn more about the patient information in this file and gain insight into potential use cases for the dataset.

To start, all necessary libraries must be imported. This project only needs Python's built-in _csv_ module in order to work with the __insurance.csv__ data.

In [1]:
import csv

## Looking over my dataset

The next step is to look through __insurance.csv__ in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file.

- The number of columns and rows (7 columns; 1339 rows, including the header row)
- The names of the columns (age, sex, bmi, children, smoker, region, charges)
- Any noticeable missing data (none)
- Types of values (numerical vs categorical) (num, cat, num, num, cat, cat, num)
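
These checks can also be scripted rather than done by eye. Below is a minimal sketch, assuming a small made-up sample in place of the real file; `summarize_csv` is a hypothetical helper, not part of the project code.

```python
import csv
import io

# Hypothetical sample standing in for insurance.csv (all values are made up).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "40,female,25.0,2,no,northwest,5000.00\n"
    "52,male,31.5,0,yes,southeast,30000.00\n"
    "23,male,,1,no,southwest,2500.00\n"
)

def summarize_csv(csv_file):
    """Return the column names, the number of data rows and the empty-cell count per column."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames  # reading this consumes the header row
    row_count = 0
    missing = {name: 0 for name in columns}
    for row in reader:
        row_count += 1
        for name in columns:
            # DictReader fills short rows with None; empty cells arrive as ""
            if row[name] is None or row[name].strip() == "":
                missing[name] += 1
    return columns, row_count, missing

columns, row_count, missing = summarize_csv(sample_csv)
print(columns)
print(row_count)
print(missing)
```

With the real file, the same helper could be called on the object returned by `open('insurance.csv', newline='')`.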

## Scoping out my project

Now let's identify some goals based on the data available:

1. Find out the average age of the people in the dataset
2. Evaluate the proportion of each region in the dataset
3. Look at the difference in cost for smokers and non-smokers
4. Determine the average age of those who have at least one child in this dataset

## Import and save a dataset

In this section of code, I am using a context manager to read a __.csv__ file and save it to a global variable in the script.

In [2]:
insurance_dataset = []

with open('insurance.csv', newline='') as insurance_csv:
 insurance_file = csv.DictReader(insurance_csv)
 for row in insurance_file:
  insurance_dataset.append(row)

## Build out analysis functions or class methods

I now have everything I need to begin my analysis, having organized the information from __insurance.csv__ and having thought about what it is I would like to investigate.

Now is the time to build out how I perform these investigations using Python.

_FIRST FUNCTION_
store_parameter(dataset, name, variable_type = str):
- takes two mandatory parameters and one optional parameter:
  1. dataset: a list of dictionaries, one per row (created after opening the __.csv__ file)
  2. name: the name of the column whose values are to be collected
  3. variable_type (optional): the type (str, int or float) of the elements in the returned list
- returns a list of the values in the {name} column of __insurance.csv__.

In [3]:
def store_parameter(dataset, name, variable_type = str): #takes a list of row dictionaries and a column name as parameters
 parameter = []
 for row in dataset:
  if variable_type == int:
   parameter.append(int(row[name]))
  elif variable_type == float:
   parameter.append(float(row[name]))
  else:
   parameter.append(row[name])
 return parameter
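
Since `int`, `float` and `str` are all callables, the conversion can also be done by calling `variable_type` directly. Here is a minimal sketch of that idea; `store_parameter_compact` and the sample rows are hypothetical and only illustrate the approach.

```python
def store_parameter_compact(dataset, name, variable_type=str):
    """Hypothetical variant: convert each value by calling variable_type on it."""
    return [variable_type(row[name]) for row in dataset]

# Made-up rows in the shape csv.DictReader produces (every value is a string).
sample_rows = [
    {"age": "19", "bmi": "27.9", "smoker": "yes"},
    {"age": "33", "bmi": "22.7", "smoker": "no"},
]

ages = store_parameter_compact(sample_rows, "age", int)
bmis = store_parameter_compact(sample_rows, "bmi", float)
smokers = store_parameter_compact(sample_rows, "smoker")
print(ages, bmis, smokers)
```

One difference to keep in mind: this variant accepts any callable as `variable_type`, whereas the function above only converts for `int` and `float`.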

_SECOND FUNCTION_
calculate_average(lst):
- takes one parameter:
  1. lst: a list of values whose average is calculated
- returns the average if all values are int or float; otherwise prints a warning and returns None

In [4]:
def calculate_average(lst):
 total = 0
 for item in lst:
  if not isinstance(item, int) and not isinstance(item, float):
   print("There are non-integer/non-float values in this list. The function was interrupted and nothing has been returned")
   return
  total += item
 return total / len(lst)
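
As an alternative, the standard library's `statistics.mean` can do the arithmetic. The sketch below is one possible variant, assuming that an empty list should also yield `None`; `safe_average` is a hypothetical name.

```python
from statistics import StatisticsError, mean

def safe_average(values):
    """Return the mean of values, or None if the list is empty or holds non-numeric items."""
    if not all(isinstance(value, (int, float)) for value in values):
        return None
    try:
        return mean(values)
    except StatisticsError:  # raised by mean() when values is empty
        return None

print(safe_average([19, 33, 50]))
print(safe_average([]))
print(safe_average([19, "thirty"]))
```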

_THIRD FUNCTION_
proportion_of_regions(lst):
- takes one parameter:
  1. lst: a list of values, in this case region strings, to be evaluated
- prints the number of individuals from each region and the equivalent percentage
- returns a dictionary whose key:value pairs are 'region':number_of_individuals

In [5]:
def proportion_of_regions(lst):
 regions_dictionary = {}
 for item in lst:
  if item in regions_dictionary:
   regions_dictionary[item] += 1
  else:
   regions_dictionary[item] = 1
 print("\nThere are {number} regions in this dataset.".format(number = len(regions_dictionary)))
 for region, number in regions_dictionary.items():
  print("There are {number} individuals from {region}, which represents {percentage:.2f}%.".format(number = number, region = region, percentage = 100 * number/len(lst)))
 return regions_dictionary
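
The counting loop can also be expressed with `collections.Counter`. This is a minimal sketch under the assumption that returning the count and the percentage together is useful; `region_shares` and the sample list are hypothetical.

```python
from collections import Counter

def region_shares(regions):
    """Map each region to a (count, percentage of the list) pair."""
    counts = Counter(regions)
    total = len(regions)
    return {region: (count, 100 * count / total) for region, count in counts.items()}

# Made-up sample list of region strings.
sample_regions = ["southwest", "southeast", "southeast", "northwest"]
shares = region_shares(sample_regions)
for region, (count, percentage) in shares.items():
    print(f"{region}: {count} individuals ({percentage:.2f}%)")
```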

_FOURTH FUNCTION_
average_cost_smoker(smoker_lst, cost_lst):
- takes two parameters:
  1. smoker_lst: a list, which contains the individual's smoking status ("yes" or "no")
  2. cost_lst: a list, which contains the insurance charges for every individual (in dollars)
- prints:
  1. The number of smokers and non-smokers in the dataset
  2. Their average insurance charges and how the charges of smokers compare with those of non-smokers
- returns:
  1. the average insurance charges of smokers 
  2. the average insurance charges of non-smokers

In [6]:
def average_cost_smoker(smoker_lst, cost_lst):
 total_cost_smokers = 0
 total_number_smokers = 0
 total_cost_non_smokers = 0
 total_number_non_smokers = 0

 for smoker, charge in zip(smoker_lst, cost_lst):
  if smoker == 'yes':
   total_cost_smokers += charge
   total_number_smokers += 1
  elif smoker == 'no':
   total_cost_non_smokers += charge
   total_number_non_smokers += 1
  else:
   print("An input other than \"yes\" or \"no\" was entered")

 print('\n')
 print("""There are {x} smokers and {y} non-smokers in this dataset.\nThe average insurance cost\
 for smokers is ${cost_smokers:.2f} and that of non-smokers is ${cost_non_smokers:.2f}.""".format(x = total_number_smokers, y = total_number_non_smokers, cost_smokers = total_cost_smokers / total_number_smokers, cost_non_smokers = total_cost_non_smokers / total_number_non_smokers))
 print("On average, smokers pay {percentage:.2f}% of what non-smokers pay in this dataset.".format(percentage = 100 * total_cost_smokers * total_number_non_smokers / total_number_smokers / total_cost_non_smokers))

 return total_cost_smokers / total_number_smokers, total_cost_non_smokers / total_number_non_smokers
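
A ratio of two averages can be reported either as "X% of" the baseline or as "X% more than" the baseline, and the two figures differ by exactly 100 percentage points. The sketch below illustrates the distinction with the two averages recorded in this notebook's output; `percent_of` and `percent_more` are hypothetical helper names.

```python
def percent_of(value, baseline):
    """Express value as a percentage of baseline."""
    return 100 * value / baseline

def percent_more(value, baseline):
    """Express how much larger value is than baseline, as a percentage of baseline."""
    return 100 * (value - baseline) / baseline

# Averages taken from the recorded output of this notebook.
smoker_average = 32050.23
non_smoker_average = 8434.27

print(f"{percent_of(smoker_average, non_smoker_average):.2f}%")    # prints 380.00%
print(f"{percent_more(smoker_average, non_smoker_average):.2f}%")  # prints 280.00%
```

In other words, paying 380% of the non-smoker average is the same as paying 280% more than it.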

_FIFTH FUNCTION_
average_age_parents(list_age, list_children):
- takes two parameters:
  1. list_age: a list, which contains the age of all individuals
  2. list_children: a list, which contains the number of children (>= 0) each individual has
- returns:
  1. the number of parents in the dataset
  2. the average age of parents in the dataset

In [7]:
def average_age_parents(list_age, list_children):
 total_age_parents = 0
 total_number_parents = 0
 for children, age in zip(list_children, list_age):
  if children > 0:
   total_age_parents += age
   total_number_parents += 1
 return total_number_parents, total_age_parents / total_number_parents
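
If a dataset contained no parents at all, the division above would raise a `ZeroDivisionError`. Here is a minimal sketch of one way to guard against that, assuming `None` is an acceptable average in that case; `average_age_of_parents` is a hypothetical name.

```python
def average_age_of_parents(ages, children_counts):
    """Return (number of parents, their average age); the average is None if there are no parents."""
    parent_ages = [age for age, kids in zip(ages, children_counts) if kids > 0]
    if not parent_ages:
        return 0, None
    return len(parent_ages), sum(parent_ages) / len(parent_ages)

# Made-up sample values.
print(average_age_of_parents([30, 40, 50], [0, 1, 2]))
print(average_age_of_parents([30], [0]))
```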

_SIXTH FUNCTION_
average_cost_bmi(list_bmi, list_cost):
- takes two parameters:
  1. list_bmi: a list, which contains individuals' BMI
  2. list_cost: a list, which contains individuals' incurred insurance charges
- prints:
  1. The BMI ranges as defined by the WHO
  2. The number of individuals and the average insurance cost in each category
  3. How the average charges of overweight and obese individuals compare with those of healthy individuals
- returns:
  1. number of underweight individuals
  2. total insurance charges incurred by underweight individuals #might need changing given the function's name
  3. number of healthy individuals
  4. total insurance charges incurred by healthy individuals #might need changing given the function's name
  5. number of overweight individuals
  6. total insurance charges incurred by overweight individuals #might need changing given the function's name
  7. number of obese individuals
  8. total insurance charges incurred by obese individuals #might need changing given the function's name

In [8]:
def average_cost_bmi(list_bmi, list_cost): #underweight, healthy and overweight categories are defined according to WHO guidance
 total_cost_obese = 0
 number_obese = 0
 total_cost_overweight = 0
 number_overweight = 0
 total_cost_healthy = 0
 number_healthy = 0
 total_cost_underweight = 0
 number_underweight = 0

 for bmi, insurance in zip(list_bmi, list_cost):
  if bmi >= 30:
   total_cost_obese += insurance
   number_obese += 1
  elif bmi >= 25.0:
   total_cost_overweight += insurance
   number_overweight += 1
  elif bmi >= 18.5:
   total_cost_healthy += insurance
   number_healthy += 1
  else:
   total_cost_underweight += insurance
   number_underweight += 1

 print("""\nAccording to the WHO, individuals are:
 - \'underweight\' if their bmi is less than 18.5
 - \'healthy\' if their bmi is between 18.5 and 25.0
 - \'overweight\' if their bmi is greater than 25.0
 - \'obese\' if their bmi is greater than 30.0

In this dataset, {num1} individuals are underweight, {num2} healthy, {num3} overweight and {num4} are obese.
Here is the average insurance cost for each of these:
 - underweight: ${average1:.2f}
 - healthy: ${average2:.2f}
 - overweight: ${average3:.2f}
 - obese: ${average4:.2f}

On average, overweight individuals pay {percentage1:.2f}% of what healthy individuals pay.
On average, obese individuals pay {percentage2:.2f}% of what healthy individuals pay.""".format(num1 = number_underweight, num2 = number_healthy, num3 = number_overweight, num4 = number_obese, average1 = total_cost_underweight / number_underweight, average2 = total_cost_healthy / number_healthy, average3 = total_cost_overweight / number_overweight, average4 = total_cost_obese / number_obese, percentage1 = 100 * total_cost_overweight * number_healthy / total_cost_healthy / number_overweight, percentage2 = 100 * total_cost_obese * number_healthy / total_cost_healthy / number_obese))
 return number_underweight, total_cost_underweight, number_healthy, total_cost_healthy, number_overweight, total_cost_overweight, number_obese, total_cost_obese
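
Returning eight positional values makes the result easy to misread. One possible alternative, sketched below under the same BMI thresholds as the function above, is to return a dictionary keyed by category; `bmi_category` and `average_cost_by_category` are hypothetical names and the sample values are made up.

```python
def bmi_category(bmi):
    """Classify a BMI value using the thresholds from the function above."""
    if bmi >= 30:
        return "obese"
    if bmi >= 25:
        return "overweight"
    if bmi >= 18.5:
        return "healthy"
    return "underweight"

def average_cost_by_category(bmis, costs):
    """Map each category that occurs to a (count, average cost) pair."""
    totals = {}
    counts = {}
    for bmi, cost in zip(bmis, costs):
        category = bmi_category(bmi)
        totals[category] = totals.get(category, 0) + cost
        counts[category] = counts.get(category, 0) + 1
    return {category: (counts[category], totals[category] / counts[category]) for category in counts}

# Made-up sample values.
sample_bmis = [17.0, 22.0, 24.0, 27.5, 31.0]
sample_costs = [1000.0, 2000.0, 4000.0, 5000.0, 9000.0]
summary = average_cost_by_category(sample_bmis, sample_costs)
print(summary)
```

Because only categories that actually occur are included, no division by zero can happen when a category is empty.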

_MAIN BODY_

Let's load the lists of individual parameters in the dataset:

In [9]:
age = store_parameter(insurance_dataset, 'age', int)
sex = store_parameter(insurance_dataset, 'sex')
bmi = store_parameter(insurance_dataset, 'bmi', float)
children = store_parameter(insurance_dataset, 'children', int)
smoker = store_parameter(insurance_dataset, 'smoker')
region = store_parameter(insurance_dataset, 'region')
charges = store_parameter(insurance_dataset, 'charges', float)

Let's provide an overview of the dataset:

In [10]:
parameters_evaluated = [key for key, value in insurance_dataset[0].items()]

print("""There are {} individuals in this dataset. Their {}, {}, {}, number of {}, whether they are\
 {}s or not, {} of origin and incurred insurance {} are collected.""".format(len(age), parameters_evaluated[0], parameters_evaluated[1], parameters_evaluated[2], parameters_evaluated[3], parameters_evaluated[4], parameters_evaluated[5], parameters_evaluated[6]))
if len(parameters_evaluated) > 7: #this currently needs to be updated manually, as does the sentence above
 print("More parameters have been evaluated in this dataset since you last checked!")

There are 1338 individuals in this dataset. Their age, sex, bmi, number of children, whether they are smokers or not, region of origin and incurred insurance charges are collected.
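
The overview sentence currently has to be updated by hand whenever a column is added. A minimal sketch of a more generic alternative, assuming a plainer sentence is acceptable, builds the list of column names with `str.join`; `describe_columns` and the sample row are hypothetical.

```python
def describe_columns(rows):
    """Build an overview sentence from the keys of the first row."""
    columns = list(rows[0])
    if len(columns) > 1:
        listing = ", ".join(columns[:-1]) + " and " + columns[-1]
    else:
        listing = columns[0]
    return "There are {} individuals in this dataset. The columns collected are: {}.".format(len(rows), listing)

# Made-up row in the shape csv.DictReader produces.
sample_rows = [{"age": "19", "sex": "female", "charges": "100.0"}]
overview = describe_columns(sample_rows)
print(overview)
```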


Let's apply the functions created to the data available:

In [11]:
print("\nThe average {parameter} is {value:.2f} years old.".format(parameter = 'age', value = calculate_average(age)))
print("The average {parameter} is {value:.2f}.".format(parameter = 'bmi', value = calculate_average(bmi)))
print("Average insurance {parameter} amount to ${value:.2f}.".format(parameter = 'charges', value = calculate_average(charges)))
proportion_of_regions(region)
average_cost_smoker(smoker, charges)

print("\nThere are {number} parents in this dataset. On average, they are {age:.1f} years old.".format(number = average_age_parents(age, children)[0], age = average_age_parents(age, children)[1]))
average_cost_bmi(bmi, charges)


The average age is 39.21 years old.
The average bmi is 30.66.
Average insurance charges amount to $13270.42.

There are 4 regions in this dataset.
There are 325 individuals from southwest, which represents 24.29%.
There are 364 individuals from southeast, which represents 27.20%.
There are 325 individuals from northwest, which represents 24.29%.
There are 324 individuals from northeast, which represents 24.22%.


There are 274 smokers and 1064 non-smokers in this dataset.
The average insurance cost for smokers is $32050.23 and that of non-smokers is $8434.27.
On average, smokers pay 380.00% of what non-smokers pay in this dataset.

There are 764 parents in this dataset. On average, they are 39.8 years old.

According to the WHO, individuals are:
 - 'underweight' if their bmi is less than 18.5
 - 'healthy' if their bmi is between 18.5 and 25.0
 - 'overweight' if their bmi is greater than 25.0
 - 'obese' if their bmi is greater than 30.0

In this dataset, 20 individuals are underweight, 2

(20,
 177044.01170000003,
 225,
 2342100.9845200004,
 386,
 4241178.818049001,
 707,
 10995501.176489996)