# U.S. Medical Insurance Costs

In this project, a **CSV** file with medical insurance costs will be investigated using Python fundamentals. The goal with this project will be to analyze various attributes within **insurance.csv** to learn more about the patient information in the file and gain insight into potential use cases for the dataset.

In [1]:
import csv
import json

In [2]:
file_name = "E:\\Semo4ka\\Python\\CodeCademy\\insurance.csv"


To start, all necessary libraries must be imported. The `csv` library is needed to work with the **insurance.csv** data, and the `json` library is used only to pretty-print results. Other libraries could help with this project; however, for this analysis, these two standard-library modules will suffice.

The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)
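
The three checks above can be sketched with the `csv` module alone. This is a minimal sketch, not the notebook's own code: `SAMPLE_CSV`, `is_number()` and `inspect_csv()` are hypothetical names, and the inline sample (including its deliberately blank last value) is made up to mimic the layout of **insurance.csv**.

```python
import csv
import io

# Hypothetical three-row sample in the same layout as insurance.csv;
# the last row has a deliberately blank "charges" value.
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,
"""


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def inspect_csv(csv_file):
    """Report column names, row count, missing values and a rough type per column."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames
    missing = {name: 0 for name in columns}
    numeric = {name: True for name in columns}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            value = (row[name] or "").strip()
            if value == "":
                missing[name] += 1        # blank cell counts as missing
            elif not is_number(value):
                numeric[name] = False     # any non-numeric value makes the column categorical
    kinds = {name: "numerical" if numeric[name] else "categorical" for name in columns}
    return columns, row_count, missing, kinds


columns, row_count, missing, kinds = inspect_csv(io.StringIO(SAMPLE_CSV))
print("Columns:", columns)
print("Rows:", row_count)
print("Missing values per column:", missing)
print("Column types:", kinds)
```

To run the same check on the real file, an open file handle for **insurance.csv** could be passed to `inspect_csv()` in place of the `io.StringIO` sample.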

In [3]:
def load_data(file_name):
    
    with open(file_name) as insurance_file:
        file_reader = csv.DictReader(insurance_file)
        user_info = []
        for row in file_reader:
            user_info.append(row)
    
    return user_info

print(json.dumps(load_data(file_name), indent = 2))

[
  {
    "age": "19",
    "sex": "female",
    "bmi": "27.9",
    "children": "0",
    "smoker": "yes",
    "region": "southwest",
    "charges": "16884.924"
  },
  {
    "age": "18",
    "sex": "male",
    "bmi": "33.77",
    "children": "1",
    "smoker": "no",
    "region": "southeast",
    "charges": "1725.5523"
  },
  {
    "age": "28",
    "sex": "male",
    "bmi": "33",
    "children": "3",
    "smoker": "no",
    "region": "southeast",
    "charges": "4449.462"
  },
  {
    "age": "33",
    "sex": "male",
    "bmi": "22.705",
    "children": "0",
    "smoker": "no",
    "region": "northwest",
    "charges": "21984.47061"
  },
  {
    "age": "32",
    "sex": "male",
    "bmi": "28.88",
    "children": "0",
    "smoker": "no",
    "region": "northwest",
    "charges": "3866.8552"
  },
  {
    "age": "31",
    "sex": "female",
    "bmi": "25.74",
    "children": "0",
    "smoker": "no",
    "region": "southeast",
    "charges": "3756.6216"
  },
  {
    "age": "46",
    "sex": "fe

**insurance.csv** contains the following columns:
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost


There are many aspects of the data that could be looked into. The following operations will be implemented:
* find average age of the patients
* return the number of males vs. females counted in the dataset
* find geographical location of the patients
* return the average yearly medical charges of the patients
* create a dictionary that contains all patient information

To perform these inspections, a class called `PatientsInfo` has been built out. It provides the following five methods, along with a few additional methods used later in the analysis:
* `analyze_ages()`
* `analyze_sexes()`
* `unique_regions()`
* `average_charges()`
* `create_dictionary()`

The class has been built out below. 

In [8]:
class Patient:
    def __init__(self, age, sex, bmi, children, smoker, region, charges):
        self.age = int(age) 
        self.sex = sex
        self.bmi = float(bmi)
        self.children = int(children)
        self.smoker = smoker
        self.region = region
        self.charges = float(charges)
        
        
class PatientsInfo:
    def __init__(self):
        self.records = []
        
        
    def analyze_ages(self):
        age_total = 0
        for item in self.records:
            age_total += int(item.age)
        age_average = age_total / len(self.records) 
        
        return age_average
    
    def age_with_kids(self):
        age_total_with_kids = 0
        count_with_kids = 0
        for item in self.records:
            if item.children > 0:
                age_total_with_kids += item.age
                count_with_kids += 1
        average_age_with_kids = age_total_with_kids / count_with_kids
               
        return average_age_with_kids
            
           
    def analyze_sexes(self):
        male_count = 0
        female_count = 0
        for item in self.records:
            if item.sex == "male":
                male_count += 1
            else:
                female_count += 1
                
        return male_count, female_count
    
    
    def unique_regions(self):
        regions_list = []
        for item in self.records:
            if item.region not in regions_list:
                regions_list.append(item.region)
        
        return regions_list
    
    
    def regions_count(self):
        regions_total = {item: 0 for item in self.unique_regions()}
        for item in self.records:
            # Every region is already a key, so increment it directly
            regions_total[item.region] += 1
        # Region with the highest patient count
        max_region = max(regions_total, key=regions_total.get)
        
        return regions_total, max_region
                  
            
    def average_charges(self):
        charges_total = 0
        for item in self.records:
            charges_total += float(item.charges)
        charges_average = charges_total / len(self.records) 
        
        return charges_average
    
    def create_dictionary(self):
        generations = {
            "gen_z": [10,25],
            "millenials": [26, 41],
            "gen_x": [42, 57],
            "boomers_2": [58, 67],
            "boomers_1": [68, 76],
            "post_war": [77, 94],
            "ww_2": [95, 100]
        }
        patients_dictionary = {key: [] for key in generations}
               
        for item in self.records:
            for k, v in generations.items():
                if item.age >= v[0] and item.age <= v[1]:
                    patients_dictionary[k].append({
                        "Patient Age": item.age,
                        "Patient Sex": item.sex,
                        "Patient BMI": item.bmi,
                        "Patient Number of Children": item.children,
                        "Patient Smoking Status": item.smoker,
                        "Patient U.S Geopraphical Region": item.region,
                        "Patient Yearly Medical Insurance Cost": item.charges
                    })
        
        return patients_dictionary
    
    def age_charges_correlation(self):
        patients_count = {}
        data = self.create_dictionary()
        age_charges_average = {}
        
        for k, v in data.items():
            patients_count[k] = len(v)
            # Skip empty generations to avoid dividing by zero
            if len(v) != 0:
                total_charges = sum(record["Patient Yearly Medical Insurance Cost"] for record in v)
                age_charges_average[k] = "{:.2f}".format(total_charges / len(v))
            
        return patients_count, age_charges_average
    
    
    def smokers_vs_nonsmokers(self):
        smoker_charge = 0
        smoker_count = 0
        nonsmoker_charge = 0
        nonsmoker_count = 0
        for item in self.records:
            if item.smoker == "yes":
                smoker_charge += item.charges
                smoker_count += 1
            else:
                nonsmoker_charge += item.charges
                nonsmoker_count += 1
        smoker_average_charge = smoker_charge / smoker_count
        nonsmoker_average_charge = nonsmoker_charge / nonsmoker_count
        difference_dol = smoker_average_charge - nonsmoker_average_charge
        difference_percent = difference_dol / nonsmoker_average_charge *100
        
        return smoker_average_charge, nonsmoker_average_charge, difference_dol, difference_percent

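The next step is to create an instance of the `PatientsInfo` class. To do that, `load_data()` is redefined so that it returns a `PatientsInfo` object whose `records` list holds one `Patient` per row of the CSV file. Each method can then be called on that object to see the results of the analysis.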

In [9]:
def load_data(file_name):
    
    with open(file_name) as insurance_file:
        file_reader = csv.DictReader(insurance_file)
        user_info = PatientsInfo()
        for row in file_reader:
            user_info.records.append(Patient(row["age"], row["sex"], row["bmi"], row["children"], row["smoker"], row["region"], row["charges"]))
    
    return user_info



In [10]:
print("Average Patient Age: {:.2f} years".format(load_data(file_name).analyze_ages())) 

Average Patient Age: 39.21 years


The average age of the patients in **insurance.csv** is about 39 years. This is important to check in order to ensure the data in **insurance.csv** is representative of a broader population. If the dataset is used to make inferences about other populations, the data must be abundant and broad enough for such use cases.

A further analysis would have to be done to make sure the range and standard deviation of the patient age group in **insurance.csv** is indicative of a random sampling of individuals. 
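
As a rough illustration of that further analysis, the range and standard deviation can be computed with the standard-library `statistics` module. This is only a sketch: `age_spread()` is a hypothetical helper, and the list holds just the first seven ages visible in the printed output above, not the whole dataset.

```python
import statistics

# First seven ages from the printed rows above (not the full dataset)
ages = [19, 18, 28, 33, 32, 31, 46]


def age_spread(values):
    """Return (minimum, maximum, mean, sample standard deviation)."""
    return min(values), max(values), statistics.mean(values), statistics.stdev(values)


age_min, age_max, age_mean, age_std = age_spread(ages)
print("Age range: {} to {} years".format(age_min, age_max))
print("Mean age: {:.2f} years, standard deviation: {:.2f} years".format(age_mean, age_std))
```

With the notebook's own objects, the list could be built as `[item.age for item in load_data(file_name).records]`.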

In [11]:
male_count, female_count = load_data(file_name).analyze_sexes()
print("Count for female: {}\nCount for male: {}".format(female_count, male_count))

Count for female: 662
Count for male: 676


The next step of the analysis is to check the balance of males vs. females in **insurance.csv**. With 662 females and 676 males, the split is close to even. As with age, it is important to check that this dataset is representative of a broader population of individuals. If a person were to use this dataset to create a classification model, it would be imperative to make sure that the attributes are balanced.

Quite often in the real world, data is not balanced; imbalance can bias statistical estimates and models toward the over-represented groups. This is something that will be explored further in future portfolio projects!
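
One simple way to quantify balance is to turn the counts into shares of the total. The sketch below assumes a hypothetical helper, `category_shares()`, and rebuilds the sex column from the two counts printed above rather than reading the file.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total as a fraction between 0 and 1."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Rebuilt from the counts printed above: 662 females and 676 males
sexes = ["female"] * 662 + ["male"] * 676
shares = category_shares(sexes)
for category, share in shares.items():
    print("{}: {:.1%}".format(category, share))
```

The same helper would work for any categorical attribute, for example the `smoker` or `region` values of the records.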

In [12]:
print("Analyzed U.S. regions : {}".format(load_data(file_name).unique_regions()))

Analyzed U.S. regions : ['southwest', 'southeast', 'northwest', 'northeast']


There are four unique geographical regions in this dataset, and it is important to note that all the patients come from the United States.

In [13]:
print("Average Yearly Medical Insurance Charges: {:.2f} dollars".format(load_data(file_name).average_charges()))

Average Yearly Medical Insurance Charges: 13270.42 dollars


The average yearly medical insurance charge per individual is about 13,270 US dollars. Some further analysis could be done to see which patient attributes contribute most strongly to low and/or high medical insurance charges. For example, one could check if patient age correlates with the amount of money they spend yearly.
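
One possible way to run that check is the Pearson correlation coefficient, which ranges from -1 to 1. The following is a minimal sketch using only the standard library; `pearson()` is a hypothetical helper, and the sample pairs are the first six rows printed earlier, far too few to draw any conclusion from.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Ages and charges of the first six rows printed earlier (illustration only)
sample_ages = [19, 18, 28, 33, 32, 31]
sample_charges = [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552, 3756.6216]
r = pearson(sample_ages, sample_charges)
print("Pearson r for the sample: {:.3f}".format(r))
```

On the full dataset, the inputs could be taken from the `age` and `charges` attributes of each record.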

In [14]:
print(json.dumps(load_data(file_name).create_dictionary(), indent=4))

{
    "gen_z": [
        {
            "Patient Age": 19,
            "Patient Sex": "female",
            "Patient BMI": 27.9,
            "Patient Number of Children": 0,
            "Patient Smoking Status": "yes",
            "Patient U.S Geopraphical Region": "southwest",
            "Patient Yearly Medical Insurance Cost": 16884.924
        },
        {
            "Patient Age": 18,
            "Patient Sex": "male",
            "Patient BMI": 33.77,
            "Patient Number of Children": 1,
            "Patient Smoking Status": "no",
            "Patient U.S Geopraphical Region": "southeast",
            "Patient Yearly Medical Insurance Cost": 1725.5523
        },
        {
            "Patient Age": 25,
            "Patient Sex": "male",
            "Patient BMI": 26.22,
            "Patient Number of Children": 0,
            "Patient Smoking Status": "no",
            "Patient U.S Geopraphical Region": "northeast",
            "Patient Yearly Medical Insurance Cost": 272

In [15]:
patients_count, age_charges_average = load_data(file_name).age_charges_correlation()
print("Patients count by generation:\n{}\nAverage Yearly Medical Insurance Cost by generation:\n{}".format(patients_count, age_charges_average))

Patients count by generation:
{'gen_z': 306, 'millenials': 422, 'gen_x': 446, 'boomers_2': 164, 'boomers_1': 0, 'post_war': 0, 'ww_2': 0}
Average Yearly Medical Insurance Cost by generation:
{'gen_z': '9087.02', 'millenials': '11004.36', 'gen_x': '15896.22', 'boomers_2': '19766.12'}


In [16]:
regions_total, max_region = load_data(file_name).regions_count()
print("Patients represented by region: {}\nMajority of the individuals are from: {}".format(regions_total, max_region))

Patients represented by region: {'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
Majority of the individuals are from: southeast


In [17]:
smoker_average_charge, nonsmoker_average_charge, difference_dol, difference_percent = load_data(file_name).smokers_vs_nonsmokers()

print("Smoking individuals spend on average {:.2f} dollars for their yearly medical insurance.\nAt the same time non-smokers spend on average {:.2f} dollars a year.\nThat means smokers pay {:.2f} dollars ({:.2f}%) more than non-smokers.".format(smoker_average_charge, nonsmoker_average_charge, difference_dol, difference_percent))

Smoking individuals spend on average 32050.23 dollars for their yearly medical insurance.
At the same time non-smokers spend on average 8434.27 dollars a year.
That means smokers pay 23615.96 dollars (280.00%) more than non-smokers.


In [18]:
print("Average age of patients with at least one child: {:.2f} years".format(load_data(file_name).age_with_kids()))

Average age of patients with at least one child: 39.78 years


All patient data is now neatly organized in a dictionary, grouped by generation. This is convenient for further analysis if a decision is made to continue investigating the attributes in **insurance.csv**.
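
As one example of such a follow-up, the grouped dictionary makes it easy to compare groups on another attribute. The sketch below computes the share of smokers per generation; `smoker_share_by_group()` is a hypothetical helper, and `sample` is a made-up, trimmed-down dictionary in the same shape as the output of `create_dictionary()`.

```python
def smoker_share_by_group(patients_dictionary):
    """Return the fraction of smokers in each non-empty group."""
    shares = {}
    for group, records in patients_dictionary.items():
        if not records:
            continue  # skip empty generations to avoid dividing by zero
        smokers = sum(1 for record in records if record["Patient Smoking Status"] == "yes")
        shares[group] = smokers / len(records)
    return shares


# Made-up sample shaped like the output of create_dictionary()
sample = {
    "gen_z": [
        {"Patient Age": 19, "Patient Smoking Status": "yes"},
        {"Patient Age": 18, "Patient Smoking Status": "no"},
    ],
    "millenials": [
        {"Patient Age": 28, "Patient Smoking Status": "no"},
    ],
    "boomers_1": [],
}

smoker_shares = smoker_share_by_group(sample)
print(smoker_shares)
```

With the notebook's own data, the argument would be `load_data(file_name).create_dictionary()`.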