# U.S. Medical Insurance Costs

In this project, we investigate a **CSV** file of medical insurance costs using Python fundamentals. The goal is to analyze the attributes in **insurance.csv**, learn more about the patients in the file, and gain insight into potential use cases for the dataset.

To start, we import Python's built-in **csv** module in order to work with the **insurance.csv** data.

In [1]:
# Importing modules
import csv

Examining the **insurance.csv** dataset, we can see that it has 7 columns:

- Patient's age
- Patient's sex
- Patient's BMI
- Patient's number of children
- Patient's smoking status
- Patient's regional location
- Patient's medical insurance cost

Since the data is currently stored in a CSV file, it is difficult to manipulate and analyze. For easier analysis, we will create empty lists for each of the columns below and then, using a function, store the datapoints of the CSV file in the appropriate lists. 

We should also verify that no information is missing from the dataset.
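
As a quick illustration of what the loading function will work with, the sketch below runs `csv.DictReader` over a small in-memory sample rather than the real file. The two sample rows are made up for illustration; only the header names (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, `charges`) are taken from the columns loaded in this notebook.

```python
import csv
import io

# Made-up sample rows (not taken from insurance.csv); only the header names
# match the columns loaded in this notebook.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,22.0,0,no,northeast,4000.00\n"
    "45,male,31.5,2,yes,southeast,25000.00\n"
)

rows = list(csv.DictReader(sample_csv))

# Each row is a dictionary keyed by the header names.
first_row = rows[0]
print(first_row["age"], first_row["region"])

# Every value is read as a string, so numeric columns need int() or float()
# before any arithmetic.
print(type(first_row["age"]).__name__)
sample_ages = [int(row["age"]) for row in rows]
print(sample_ages)
```

This is why the analysis methods later convert values with `int()` and `float()` before summing them.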


In [2]:
# Empty attributes
ages = []
sexes = []
bmis = []
num_children = []
smoker_statuses = []
regions = []
insurance_сosts = []

# Loading csv data with a function
def load_list_data(target_list, csv_file, column_name):
    # open csv file (newline="" is recommended by the csv module documentation)
    with open(csv_file, newline="") as csv_info:
        # read each row of the csv file as a dictionary keyed by column name
        csv_dict = csv.DictReader(csv_info)
        # loop through the data in each row of the csv
        for row in csv_dict:
            # add the value from the requested column to the list
            target_list.append(row[column_name])
    # return the list
    return target_list

# load each column of insurance.csv into its list
load_list_data(ages, "insurance.csv", "age")
load_list_data(sexes, "insurance.csv", "sex")
load_list_data(bmis, "insurance.csv", "bmi")
load_list_data(num_children, "insurance.csv", "children")
load_list_data(smoker_statuses, "insurance.csv", "smoker")
load_list_data(regions, "insurance.csv", "region")
load_list_data(insurance_сosts, "insurance.csv", "charges")

# Verifying that every column list has the same length, i.e. no column is missing entries
print(len(ages) == (len(sexes)) == (len(bmis)) == (len(num_children)) == (len(smoker_statuses)) == (len(regions)) == (len(insurance_сosts)))

True
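
Matching lengths only show that every list received one entry per row; an empty cell would still be appended as an empty string. The following is a minimal sketch of a stricter check, assuming blanks would appear as empty or whitespace-only strings. The helper name `count_missing` and the sample columns are hypothetical and not part of the original notebook.

```python
def count_missing(values):
    """Count entries that are None or contain only whitespace."""
    return sum(1 for value in values if value is None or str(value).strip() == "")

# Hypothetical sample columns, not the real dataset.
sample_columns = {
    "age": ["30", "45", ""],
    "smoker": ["no", "yes", "no"],
}

# Number of missing entries found in each column.
missing_per_column = {
    name: count_missing(values) for name, values in sample_columns.items()
}
print(missing_per_column)
```

The same helper could be applied to each of the lists loaded above; a count of zero for every list would support the claim that nothing is missing.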


Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can begin. We plan to answer the following questions:
1. What is the average age of the patients in the dataset?
2. What is the difference between male and female average insurance costs?
3. Where are a majority of the individuals from?
4. What is the difference between smokers' and non-smokers' costs?
5. How does having a child affect insurance costs?
6. How does BMI affect insurance costs?
7. What are the average yearly medical charges per patient?

To perform these analyses, a class called **PatientsInfo** has been built with the following methods:
- analyze_ages()
- analyze_sexes_costs()
- analyze_regions_costs()
- analyze_smokers_costs()
- analyze_num_children_costs()
- analyze_bmi_costs()
- average_charges()
- create_dictionary()

The class has been built out below. 
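
Several of these methods share one pattern: sort each cost into a bucket according to a label (sex, region, smoker status), then average each bucket. The sketch below shows that pattern on its own, independent of the class. The helper name `average_by_group` and the sample values are hypothetical and are not part of the notebook's code.

```python
from collections import defaultdict

def average_by_group(labels, costs):
    """Return {label: average cost}, rounded to two decimal places."""
    grouped = defaultdict(list)
    for label, cost in zip(labels, costs):
        # costs arrive as strings from the csv module, so convert first
        grouped[label].append(float(cost))
    return {
        label: round(sum(values) / len(values), 2)
        for label, values in grouped.items()
    }

# Hypothetical sample data in the same string form the csv module produces.
sample_smokers = ["yes", "no", "no", "yes"]
sample_costs = ["20000.00", "4000.00", "6000.00", "30000.00"]

averages = average_by_group(sample_smokers, sample_costs)
print(averages)
```

The class spells this logic out separately in each method, which is more verbose but keeps every step visible.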

In [3]:
# Building classes to analyze data
class PatientsInfo:
    # init method that takes in each list parameter
    def __init__(self, patients_ages, patients_sexes, patients_bmis, patients_num_children,
                 patients_smoker_statuses, patients_regions, patients_costs):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_costs = patients_costs
        
# 1. What is the average age of the patients in the dataset?
    def analyze_ages(self):
        # initialize total age at zero
        total_age = 0
        # iterate through all ages in the ages list
        for age in self.patients_ages:
            # sum of the total age
            total_age += int(age)
        # return total age divided by the length of the patient list
        return ("Average Patient Age: " + str(round(total_age/len(self.patients_ages), 1)) + " years")

# 2. What is the difference between male and female average insurance costs?
    def analyze_sexes_costs(self):
        # separate male and female insurance costs into empty lists
        female_costs = []
        male_costs = []
        # iterate through each sex in the sexes list
        for sex, cost in list(zip(self.patients_sexes, self.patients_costs)):
            # if female add to female variable
            if sex == "female":
                female_costs.append(float(cost))
            # if male add to male variable
            elif sex == "male":
                male_costs.append(float(cost))
        # find average cost for each sex rounded to two decimal places, and the difference between them
        male_average_cost = round(sum(male_costs) / len(male_costs), 2)
        female_average_cost = round(sum(female_costs) / len(female_costs), 2)
        cost_difference = round(abs(male_average_cost - female_average_cost), 2)
        # return the results
        if male_average_cost < female_average_cost:
            return "The average male insurance cost is ${} and the average female cost is ${}, so the male insurance is cheaper by ${} on average.".format(male_average_cost, female_average_cost, cost_difference)
        else:
            return "The average male insurance cost is ${} and the average female cost is ${}, so the female insurance is cheaper by ${} on average.".format(male_average_cost, female_average_cost, cost_difference)


# 3. Where are a majority of the individuals from?
    def analyze_regions_costs(self):
        # separate costs from each region into their own respective lists
        sw_costs = []
        se_costs = []
        nw_costs = []
        ne_costs = []
        region_costs = list(zip(self.patients_regions, self.patients_costs))
        for region, cost in region_costs:
            if region == "southwest":
                sw_costs.append(float(cost))
            elif region == "southeast":
                se_costs.append(float(cost))
            elif region == "northwest":
                nw_costs.append(float(cost))
            elif region == "northeast":
                ne_costs.append(float(cost))
        # find average cost for each list, rounded to two decimal places
        sw_avg_cost = round(sum(sw_costs) / len(sw_costs), 2)
        se_avg_cost = round(sum(se_costs) / len(se_costs), 2)
        nw_avg_cost = round(sum(nw_costs) / len(nw_costs), 2)
        ne_avg_cost = round(sum(ne_costs) / len(ne_costs), 2)
        # print the results, passing the averages in the same order as the labels
        print("The average insurance cost per region of the U.S. is as follows:\nNortheast: ${} \nSoutheast: ${} \nSouthwest: ${} \nNorthwest: ${}".format(ne_avg_cost, se_avg_cost, sw_avg_cost, nw_avg_cost))


# 4. What is the difference between smokers' and non-smokers' costs?
    def analyze_smokers_costs(self):
        # separating costs of smokers and non-smokers into their own respective lists
        smoker_costs = []
        non_smoker_costs = []
        # iterate through each smoker status and cost, adding the cost to the matching list
        for status, cost in list(zip(self.patients_smoker_statuses, self.patients_costs)):
            if status == "yes":
                smoker_costs.append(float(cost))
            elif status == "no":
                non_smoker_costs.append(float(cost))
        # find the average charges for smokers and non-smokers and the difference between them
        average_smoker_cost = round(sum(smoker_costs) / len(smoker_costs), 2)
        average_non_smoker_cost = round(sum(non_smoker_costs) / len(non_smoker_costs), 2)
        avg_diff = round(abs(average_smoker_cost - average_non_smoker_cost), 2)
        # report the average difference in charges between smokers and non-smokers
        print("The average smoker pays ${} more for insurance than the average non-smoker".format(avg_diff))

# 5. How does having a child affect insurance costs?
    def analyze_num_children_costs(self):
        # separating costs of parents and non-parents into their own respective lists
        parent_costs = []
        non_parent_costs = []
        # iterate through each number of children and cost, adding the cost to the matching list
        for num_children, cost in list(zip(self.patients_num_children, self.patients_costs)):
            if int(num_children) >= 1:
                parent_costs.append(float(cost))
            else:
                non_parent_costs.append(float(cost))
        # find the average charges for parents and non-parents and the difference between them
        average_parent_cost = round(sum(parent_costs) / len(parent_costs), 2)
        average_non_parent_cost = round(sum(non_parent_costs) / len(non_parent_costs), 2)
        # round the difference to avoid floating-point noise such as 1583.960000000001
        avg_diff = round(abs(average_parent_cost - average_non_parent_cost), 2)
        # report the average difference in charges between parents and non-parents
        print("The average parent pays ${} more for insurance than the average non-parent".format(avg_diff))
        
# 6. How does BMI affect insurance costs?
    def analyze_bmi_costs(self):
        # according to the CDC, a healthy adult BMI is between 18.5 and 24.9 for both male and female
        # using a different approach than above to obtain averages
        # creating empty sum and count values for healthy and poor BMI patients' insurance costs
        healthy_bmi_cost_sum = 0
        healthy_count = 0
        poor_bmi_cost_sum = 0
        poor_count = 0
        # adding insurance cost to the correct sum depending on BMI, and incrementing the count values
        for bmi, cost in list(zip(self.patients_bmis, self.patients_costs)):
            if 18.5 <= float(bmi) <= 24.9:
                healthy_bmi_cost_sum += float(cost)
                healthy_count += 1
            else:
                poor_bmi_cost_sum += float(cost)
                poor_count += 1
        # finding averages by dividing the cost sums by the count values, and rounding to two decimal places
        healthy_bmi_avg_cost = round(healthy_bmi_cost_sum/healthy_count, 2)
        poor_bmi_avg_cost = round(poor_bmi_cost_sum/poor_count, 2)
        avg_diff = round(poor_bmi_avg_cost - healthy_bmi_avg_cost, 2)
        # printing a string with our results
        print("According to the CDC, a healthy adult BMI is between 18.5 and 24.9 for both male and female. \nOn average, a patient with a poor BMI pays ${} more than a patient with a healthy BMI.".format(avg_diff))

# 7. What are the average yearly medical charges per patient?
    def average_charges(self):
        # initialize total_charges variable
        total_charges = 0
        # iterate through charges in the patients charges list
        # add each charge to total_charge
        for charge in self.patients_costs:
            total_charges += float(charge)
        # return the average charges rounded to the hundredths place
        return ("Average Yearly Medical Insurance Charges: ${}.".format(round(total_charges/len(self.patients_costs), 2)))

# Creating a dictionary with all patients information
    def create_dictionary(self):
        self.patients_dictionary = {}
        self.patients_dictionary["age"] = [int(age) for age in self.patients_ages]
        self.patients_dictionary["sex"] = self.patients_sexes
        self.patients_dictionary["bmi"] = self.patients_bmis
        self.patients_dictionary["children"] = self.patients_num_children
        self.patients_dictionary["smoker"] = self.patients_smoker_statuses
        self.patients_dictionary["regions"] = self.patients_regions
        self.patients_dictionary["charges"] = self.patients_costs
        return self.patients_dictionary
    


Now we need to create an instance of the PatientsInfo class, and call each method to see the results of our analysis.

In [4]:
# Creating an instance so each method can be used to see the results of the analysis.
patient_info = PatientsInfo(ages, sexes, bmis, num_children, smoker_statuses, regions, insurance_сosts)

In [5]:
patient_info.analyze_ages()

'Average Patient Age: 39.2 years'

In [6]:
patient_info.analyze_sexes_costs()

'The average male insurance cost is $13956.75 and the average female cost is $12569.58, so the female insurance is cheaper by $1387.17 on average.'

In [7]:
patient_info.analyze_regions_costs()

The average insurance cost per region of the U.S. is as follows:
Northeast: $13406.38 
Southeast: $14735.41 
Southwest: $12346.94 
Northwest: $12417.58
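
The averages above describe cost per region, but the question of where most patients are from is a question of counts. The following is a minimal sketch of how it could be answered with `collections.Counter`. The sample list is made up for illustration and stands in for the `regions` list loaded earlier.

```python
from collections import Counter

# Made-up sample; in the notebook this would be the loaded `regions` list.
sample_regions = [
    "southeast", "southwest", "southeast",
    "northeast", "northwest", "southeast",
]

# Count how many patients fall into each region.
region_counts = Counter(sample_regions)

# most_common(1) returns a list holding the single (region, count) pair
# with the highest count.
most_common_region, patient_count = region_counts.most_common(1)[0]
print(most_common_region, patient_count)
```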


In [8]:
patient_info.analyze_smokers_costs()

The average smoker pays $23615.96 more for insurance than the average non-smoker


In [9]:

patient_info.analyze_num_children_costs()

The average parent pays $1583.96 more for insurance than the average non-parent


In [10]:
patient_info.analyze_bmi_costs()

According to the CDC, a healthy adult BMI is between 18.5 and 24.9 for both male and female. 
On average, a patient with a poor BMI pays $3466.0 more than a patient with a healthy BMI.
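
The two-way split above groups underweight, overweight, and obese patients together. A finer breakdown is possible with the CDC's adult BMI categories: underweight below 18.5, healthy weight from 18.5 to under 25, overweight from 25 to under 30, and obesity at 30 and above. The sketch below is one way to do this. The function `bmi_category` and the sample values are hypothetical and not part of the class above.

```python
from collections import defaultdict

def bmi_category(bmi):
    """Map a BMI value (number or numeric string) to a CDC adult category."""
    bmi = float(bmi)
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "healthy weight"
    if bmi < 30:
        return "overweight"
    return "obesity"

# Made-up sample values in the string form the csv module produces.
sample_bmis = ["17.0", "22.0", "24.95", "27.5", "33.0", "35.0"]
sample_costs = ["3000", "4000", "5000", "8000", "12000", "16000"]

# Collect the costs that belong to each category.
costs_by_category = defaultdict(list)
for bmi, cost in zip(sample_bmis, sample_costs):
    costs_by_category[bmi_category(bmi)].append(float(cost))

# Average cost per category, rounded to two decimal places.
category_averages = {
    category: round(sum(costs) / len(costs), 2)
    for category, costs in costs_by_category.items()
}
print(category_averages)
```

Note that a BMI such as 24.95 counts as healthy weight here, whereas the 24.9 upper bound used in `analyze_bmi_costs()` places it in the poor group.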


In [11]:
patient_info.average_charges()

'Average Yearly Medical Insurance Charges: $13270.42.'