# U.S. Medical Insurance Cost 
For this project, I investigated a ***medical insurance costs dataset*** stored in a .csv file, using the fundamental Python skills I have developed. The **insurance.csv** dataset contains patient information and insurance costs. The project aimed to analyze various attributes within **insurance.csv** to learn more about the patient data in the file and to gain insight into potential use cases for the dataset.

## Importing and Reviewing Dataset
To start, I imported the necessary libraries. For this project, the only library needed is Python's built-in ***csv*** module, which is used to work with the **insurance.csv** data.

In [1]:
import csv

The next step was to look through **insurance.csv** to get acquainted with the data. I checked the following aspects of the data file to plan out how to import the data into a Python file:
* The names of columns and rows
* Any noticeable missing data
* Types of values (numerical vs. categorical)

In [9]:
#print the whole CSV file 
#with open("insurance.csv") as insurance_file:
#    print(insurance_file.read())

#Review first 10 rows and header of the csv file
with open("insurance.csv") as insurance_file:
    file_reader = csv.reader(insurance_file, delimiter = ",")
    count = 0
    for row in file_reader:
        print(row)
        if count > 9:
            break
        count += 1

['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']
['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523']
['28', 'male', '33', '3', 'no', 'southeast', '4449.462']
['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061']
['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552']
['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216']
['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896']
['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056']
['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107']
['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692']


**insurance.csv** contains 7 columns and 1,339 rows (one header row plus 1,338 records). The 7 columns hold the following information:
* age: number (integer)
* sex: female, male
* bmi: number (float)
* children: number (integer)
* smoker: yes, no
* region: southwest, southeast, northwest, northeast
* charges: number (float)
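
The column types listed above can also be checked programmatically. The following is a minimal sketch, not part of the original notebook: `infer_type` is a hypothetical helper that tries `int()`, then `float()`, and otherwise treats the value as categorical. It uses the header and first record copied from the preview.

```python
def infer_type(value):
    """Classify a CSV string as 'integer', 'float', or 'categorical'."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "categorical"

# Header and first record copied from the preview of insurance.csv
header = ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
first_record = ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924']

column_types = {name: infer_type(value) for name, value in zip(header, first_record)}
for name, kind in column_types.items():
    print(name + ": " + kind)
```

A single record is not enough to be sure of a column's type. The third record in the preview has a bmi of '33', which this helper would classify as an integer, so a more careful check would look at every record in a column.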

Each row after the header represents one record. Fortunately, there are no signs of missing data.
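
Scanning the file by eye can miss an empty field, so the observation about missing data could be backed by a small check. This is a sketch under assumptions: `find_missing_values` is a hypothetical helper, and `SAMPLE_CSV` is made-up stand-in data with one deliberately empty bmi field, not the real file.

```python
import csv
import io

# Hypothetical stand-in for insurance.csv; the second record has an empty bmi
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,,1,no,southeast,1725.5523
"""

def find_missing_values(csv_file):
    """Return (record_number, column) pairs for every empty or absent field."""
    missing = []
    reader = csv.DictReader(csv_file)
    # start=1 so that record numbers count data records, not the header
    for record_number, row in enumerate(reader, start=1):
        for column, value in row.items():
            if column is None:
                continue  # extra fields beyond the header are not missing values
            if value is None or value.strip() == "":
                missing.append((record_number, column))
    return missing

print(find_missing_values(io.StringIO(SAMPLE_CSV)))
```

To run the same check on the real data, pass the file object from `open("insurance.csv", newline="")` instead of the `io.StringIO` sample. An empty list would support the observation that nothing is missing.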

## Scoping the Project

Now that I have observed the type of data that **insurance.csv** contains, I can start planning my analysis. First, I had to plan out what to investigate and how to perform the analysis. There are many aspects of the data that I could review. I decided to implement the following operations:
* Find the average age of the patients in the data set
* Look at the different costs between males vs. females
* Find the geographical location where a majority of the patients are from
* Find the average yearly medical charges of the patients

## Saving Dataset via Python Variables

In this section of code, I used a context manager to read the **.csv file** and saved each row as a dictionary in a global list variable in the script.

In [38]:
insurance_dataset = []

with open('insurance.csv', newline = '') as insurance_csv:
    insurance_file = csv.DictReader(insurance_csv)
    for row in insurance_file:
        insurance_dataset.append(row)
        
#print(insurance_dataset)
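
`csv.DictReader` returns every value as a string, which is why the analysis functions in this notebook call `int()` or `float()` before doing arithmetic. One alternative is to convert each record once while loading. The sketch below illustrates that idea. `convert_record` is a hypothetical helper, not something the notebook uses.

```python
def convert_record(record):
    """Return a copy of one DictReader row with its numeric fields converted."""
    return {
        "age": int(record["age"]),
        "sex": record["sex"],
        "bmi": float(record["bmi"]),
        "children": int(record["children"]),
        "smoker": record["smoker"],
        "region": record["region"],
        "charges": float(record["charges"]),
    }

# First record from the preview of insurance.csv, as DictReader would return it
raw = {"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
       "smoker": "yes", "region": "southwest", "charges": "16884.924"}
typed = convert_record(raw)
print(typed)
```

With this approach the loading loop would append `convert_record(row)` instead of `row`, and the analysis functions would no longer need their own conversions.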

## Building Analysis Functions

### Average Age of the Patients in the Dataset

In [121]:
def average_age(dataset):
    total_age = 0
    num_records = 0
    for record in dataset:
        total_age += int(record["age"])
        num_records += 1
    return round(total_age / num_records)
average = average_age(insurance_dataset)
print("The average age of the patients in this dataset is " + str(average) + ".")

The average age of the patients in this dataset is 39.


The average age of the patients in **insurance.csv** is about 39 years. This is worth checking because it gives a first indication of whether the data in **insurance.csv** could represent a broader population. In addition, if we use the dataset to make inferences about other populations, the data must be abundant and broad enough for such use cases.

Further analysis would be required to confirm that the range and standard deviation of patient ages in **insurance.csv** are representative of a random sample of individuals.
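
As a starting point for that further analysis, the range and standard deviation of the ages could be computed with Python's built-in `statistics` module. This is a minimal sketch: `age_spread` is a hypothetical helper, and the sample below holds only the ten ages visible in the preview, so its numbers say nothing about the full dataset.

```python
import statistics

def age_spread(dataset):
    """Return the minimum, maximum, and sample standard deviation of the ages."""
    ages = [int(record["age"]) for record in dataset]
    return {
        "min": min(ages),
        "max": max(ages),
        # statistics.stdev is the sample standard deviation; it needs at least two values
        "stdev": statistics.stdev(ages),
    }

# The ten ages from the preview of insurance.csv, in the same string form the csv module returns
sample = [{"age": age} for age in ["19", "18", "28", "33", "32", "31", "46", "37", "37", "60"]]
spread = age_spread(sample)
print("Ages range from " + str(spread["min"]) + " to " + str(spread["max"]) + ".")
print("Standard deviation: " + str(round(spread["stdev"], 2)))
```

Calling `age_spread(insurance_dataset)` would report the same figures for every record loaded earlier.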

### Different Costs Between Males vs. Females

In [120]:
def cost_difference_by_category(dataset, category):
    #Running sum of charges and number of records for each category value
    total_cost = {}
    num_records = {}
    for record in dataset:
        key = record[category]
        if key in total_cost:
            total_cost[key] = total_cost[key] + float(record["charges"])
            num_records[key] += 1
        else:
            total_cost[key] = float(record["charges"])
            num_records[key] = 1
    #Average charge per category value = summed charges / number of records
    averages = {}
    for key, count in num_records.items():
        averages[key] = total_cost[key] / count
    return averages
male_vs_female = cost_difference_by_category(insurance_dataset, "sex")
#print(male_vs_female)
print("The average insurance charge for a male is " + str(male_vs_female['male']) + " dollars, while the average insurance charges for a female is " + str(male_vs_female['female']) + " dollars. This is a " + str(male_vs_female['male'] - male_vs_female['female'])+ " dollar difference.")


The average insurance charge for a male is 13956.751177721886 dollars, while the average insurance charges for a female is 12569.57884383534 dollars. This is a 1387.1723338865468 dollar difference.


On average, males in **insurance.csv** are charged about 1,387 dollars more than females. Before reading much into that gap, it is vital to check the balance of males vs. females, because a balanced dataset is less likely to be biased and more likely to cover various populations. As with age, it is important to check that this dataset is representative of a broader population of individuals. In addition, if a person were to use this dataset to create a classification model, it would be imperative to ensure that the attributes are balanced.

Real-world data is often imbalanced, and this can lead to statistical problems when performing analysis. Again, this will have to be explored further in future portfolio projects!
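
A first look at balance could be as simple as computing each value's share of the records. The sketch below is an illustration, not the notebook's method: `category_shares` is a hypothetical helper built on `collections.Counter`, and the sample contains only the sex column of the ten preview rows.

```python
from collections import Counter

def category_shares(dataset, category):
    """Return each category value's share of the records, as a fraction of 1."""
    counts = Counter(record[category] for record in dataset)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Sex column of the ten records shown in the preview of insurance.csv
preview_sexes = ["female", "male", "male", "male", "male",
                 "female", "female", "female", "male", "female"]
sample = [{"sex": sex} for sex in preview_sexes]

shares = category_shares(sample, "sex")
for value, share in shares.items():
    print(value + ": " + str(round(share * 100, 1)) + "%")
```

The same call works for any categorical column, for example `category_shares(insurance_dataset, "smoker")`.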

### Geographical Location Where a Majority of the Patients Live

In [49]:
#This function counts the number of entries for each value of a category
def count_by_category(dataset, category):
    categories = {} 
    for record in dataset:
        key = record[category]
        if key in categories: 
            categories[key] = categories[key] + 1 
        else:
            categories[key] = 1 
    return categories

#This function returns the category value with the most records, along with its count
def max_category(category): 
    name = ""
    total = 0
    for key,value in category.items(): 
        if total < value:
            total = value
            name = key
    return name, total

#testing function
regions = count_by_category(insurance_dataset, "region")
print(regions)
region, total = max_category(regions)
print("The " + str(region) + " region is the largest with " + str(total) + " members.")

{'southwest': 325, 'southeast': 364, 'northwest': 325, 'northeast': 324}
The southeast region is the largest with 364 members.


This dataset has four unique geographical regions, and it is essential to note that all the patients come from the United States. The largest region is the southeast, with 364 of the 1,338 records (about 27%). That is the largest share, though not an outright majority.

The counts also show that this sample covers the different parts of the United States fairly evenly, since the other three regions each contribute a similar number of records (324 or 325).

### Average Yearly Medical Charges of the Patients

In [124]:
def average_charges(dataset):
    total_cost = 0
    num_patients = 0
    for record in dataset:
        total_cost += float(record["charges"])
        num_patients += 1
    return round(total_cost / num_patients)

charges = average_charges(insurance_dataset)
print("The average yearly medical charges for the patients in this dataset is " + str(charges) + " dollars.")

The average yearly medical charges for the patients in this dataset is 13270 dollars.


The average yearly medical insurance charge per individual is about 13,270 US dollars. Further analysis would be required to see which patient attributes contribute most strongly to low or high medical insurance charges. For example, one could check whether patient age correlates with yearly charges.
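
One way to start on the age question is the Pearson correlation coefficient, which ranges from -1 to 1. The following is a minimal sketch written with the standard library only. `pearson_correlation` and `age_charges_correlation` are hypothetical helpers, and the sample is limited to the ten preview rows, so the printed value is illustrative rather than a finding about the full dataset.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Raises ZeroDivisionError if either sequence is constant
    return covariance / (spread_x * spread_y)

def age_charges_correlation(dataset):
    """Correlation between age and charges for records in DictReader form."""
    ages = [int(record["age"]) for record in dataset]
    charges = [float(record["charges"]) for record in dataset]
    return pearson_correlation(ages, charges)

# Age and charges from the ten records shown in the preview of insurance.csv
preview = [("19", "16884.924"), ("18", "1725.5523"), ("28", "4449.462"),
           ("33", "21984.47061"), ("32", "3866.8552"), ("31", "3756.6216"),
           ("46", "8240.5896"), ("37", "7281.5056"), ("37", "6406.4107"),
           ("60", "28923.13692")]
sample = [{"age": age, "charges": charges} for age, charges in preview]

r = age_charges_correlation(sample)
print("Correlation between age and charges (preview rows only): " + str(round(r, 3)))
```

Running `age_charges_correlation(insurance_dataset)` would give the coefficient for all records. A value near 0 would suggest little linear relationship, while a value near 1 would suggest charges rise with age.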