# U.S. Medical Insurance Costs

Let us first analyze our dataset. 
Note: We will not be using Pandas, Matplotlib or NumPy in this project.

In [1]:
import csv
with open("insurance.csv","r",newline="") as file:
    insurance_data = list(csv.reader(file))

Let us view the Column Headers to know what data we have.

In [2]:
print("The information we have is: ",insurance_data[0])
print("The number of records we have is: ",len(insurance_data)-1)
insurance_data = insurance_data[1:]

The information we have is:  ['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges']
The number of records we have is:  1338


We have 7 columns: Age, Sex, BMI, Children, Smoker, Region & Charges.
There are a total of 1338 records.
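
Every value read by `csv.reader` is a string, which is why the cells below call `int()` and `float()` repeatedly. As an optional alternative, here is a minimal sketch that converts each row to typed values once. It assumes the seven column names shown above; `SAMPLE_CSV`, `parse_record` and `typed_records` are hypothetical names introduced only for this illustration, and the three rows are copied from the sample output in this notebook so the sketch runs without `insurance.csv`.

```python
import csv
import io

# Stand-in for insurance.csv: the header plus three rows taken from the
# sample output printed in this notebook (assumption: same column order).
SAMPLE_CSV = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33,3,no,southeast,4449.462
"""

def parse_record(row):
    """Convert one csv.DictReader row (all strings) into typed values."""
    return {
        "age": int(row["age"]),
        "sex": row["sex"],
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "smoker": row["smoker"] == "yes",   # store the yes/no flag as a bool
        "region": row["region"],
        "charges": float(row["charges"]),
    }

typed_records = [parse_record(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(typed_records[0])
```

The rest of this notebook keeps the original list-of-strings layout, so this is only a possible refinement, not the approach used below.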

# Goals

Based on the data we have been provided, there are a few insights we can gather. The main objectives of this project are to analyze the information and find the following:

1. Difference in average insurance costs between smokers & non-smokers.
2. Does location correlate with the insurance cost? (Example: A region could have a lot of smokers)
3. Identify the average age, cost & BMI of the male & female members.
4. Identify the average insurance costs for people with different numbers of children.

# Organizing the Data

Let us first get a better understanding of our demographic by splitting the data into separate lists based on their characteristics.

In [3]:
males_only = list()
for record in insurance_data:
    if record[1] == "male":
        males_only.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(males_only[i],"\n")
print("There are {} records consisting of an individual who is male.".format(len(males_only)))


Here are the first 5 samples of our data:

['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'] 

['28', 'male', '33', '3', 'no', 'southeast', '4449.462'] 

['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'] 

['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'] 

['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107'] 

There are 676 records consisting of an individual who is male.


In [4]:
females_only = list()
for record in insurance_data:
    if record[1] == "female":
        females_only.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(females_only[i],"\n")
print("There are {} records consisting of an individual who is female.".format(len(females_only)))

Here are the first 5 samples of our data:

['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'] 

['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216'] 

['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896'] 

['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056'] 

['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692'] 

There are 662 records consisting of an individual who is female.


In [5]:
smokers_only = list()
for record in insurance_data:
    if record[4] == "yes":
        smokers_only.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(smokers_only[i],"\n")
print("There are {} records consisting of an individual who is a smoker.".format(len(smokers_only)))

Here are the first 5 samples of our data:

['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'] 

['62', 'female', '26.29', '0', 'yes', 'southeast', '27808.7251'] 

['27', 'male', '42.13', '0', 'yes', 'southeast', '39611.7577'] 

['30', 'male', '35.3', '0', 'yes', 'southwest', '36837.467'] 

['34', 'female', '31.92', '1', 'yes', 'northeast', '37701.8768'] 

There are 274 records consisting of an individual who is a smoker.


In [6]:
non_smokers_only = list()
for record in insurance_data:
    if record[4] == "no":
        non_smokers_only.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(non_smokers_only[i],"\n")
print("There are {} records consisting of an individual who is not a smoker.".format(len(non_smokers_only)))

Here are the first 5 samples of our data:

['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'] 

['28', 'male', '33', '3', 'no', 'southeast', '4449.462'] 

['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'] 

['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'] 

['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216'] 

There are 1064 records consisting of an individual who is not a smoker.


In [7]:
region_count = dict()
for record in insurance_data:
    if record[5] not in region_count:
        region_count[record[5]] = 1
    else:
        region_count[record[5]]+=1
for key,value in region_count.items():
    print("The dataset has {} records from the region {}.\n".format(value,key))


The dataset has 325 records from the region southwest.

The dataset has 364 records from the region southeast.

The dataset has 325 records from the region northwest.

The dataset has 324 records from the region northeast.



In [8]:
have_children = list()
for record in insurance_data:
    if int(record[3]) >= 1:
        have_children.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(have_children[i],"\n")
print("There are {} records consisting of an individual who has at least one child.".format(len(have_children)))


Here are the first 5 samples of our data:

['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'] 

['28', 'male', '33', '3', 'no', 'southeast', '4449.462'] 

['46', 'female', '33.44', '1', 'no', 'southeast', '8240.5896'] 

['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056'] 

['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107'] 

There are 764 records consisting of an individual who has at least one child.


In [9]:
no_children = list()
for record in insurance_data:
    if int(record[3]) == 0:
        no_children.append(record)
print("Here are the first 5 samples of our data:\n")
for i in range(5):
    print(no_children[i],"\n")
print("There are {} records consisting of an individual who does not have a child".format(len(no_children)))

Here are the first 5 samples of our data:

['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'] 

['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'] 

['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'] 

['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216'] 

['60', 'female', '25.84', '0', 'no', 'northwest', '28923.13692'] 

There are 574 records consisting of an individual who does not have a child
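
The cells above repeat the same filtering loop for each characteristic. A minimal sketch of one way to generalize it is shown below, assuming the same list-of-strings record layout. `group_by_column` and `sample_records` are hypothetical names introduced for illustration, and the five records are copied from the sample output above.

```python
def group_by_column(records, column_index):
    """Return a dict mapping each distinct value in a column to its records."""
    groups = {}
    for record in records:
        # setdefault creates the empty list the first time a value is seen
        groups.setdefault(record[column_index], []).append(record)
    return groups

# Five records copied from the sample output in this notebook.
sample_records = [
    ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'],
    ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
    ['28', 'male', '33', '3', 'no', 'southeast', '4449.462'],
    ['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'],
    ['31', 'female', '25.74', '0', 'no', 'southeast', '3756.6216'],
]

by_sex = group_by_column(sample_records, 1)      # column 1 holds sex
by_smoker = group_by_column(sample_records, 4)   # column 4 holds smoker
for sex, records in by_sex.items():
    print("{}: {} records".format(sex, len(records)))
```

With the full dataset, `group_by_column(insurance_data, 5)` would give the per-region lists in a single call.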


# Analysis

Let us define a General Function which can be used to identify the Minimum, Average & Maximum Insurance Costs for each category.

In [10]:
def statistics(data,variable_name):
    total_cost = 0.0
    minimum_cost = float('inf')
    maximum_cost = -float('inf')
    for record in data:
        #The charges are stored as strings in the last column, so convert once per record:
        cost = float(record[-1])
        total_cost+= cost
        if cost < minimum_cost:
            minimum_cost = cost
        if cost > maximum_cost:
            maximum_cost = cost
    average_cost = total_cost/len(data)
    print("\n--- STATISTICS FOR {} ---".format(variable_name.upper()))
    print("The Total Number of Records are: {}".format(len(data)))
    print("The Minimum Insurance Cost is: ${}.".format(round(minimum_cost,2)))
    print("The Average Insurance Cost is: ${}.".format(round(average_cost,2)))
    print("The Maximum Insurance Cost is: ${}.".format(round(maximum_cost,2)))
    print("--- STATISTICS FOR {} ---\n".format(variable_name.upper()))
    return average_cost

### 1. Average Insurance Cost: Smoker Vs. Non-Smoker

In [11]:
#Finding Statistics for a Smoker:
smokers_average_cost = statistics(smokers_only,"smokers")
#Finding Statistics for a Non-Smoker:
non_smokers_average_cost = statistics(non_smokers_only,"non-smokers")

#Difference between the Average costs:
difference_smokers_cost = smokers_average_cost - non_smokers_average_cost
print("The difference between the average insurance cost of a smoker and the average insurance cost of a non smoker is: \n$",round(difference_smokers_cost,2))


--- STATISTICS FOR SMOKERS ---
The Total Number of Records are: 274
The Minimum Insurance Cost is: $12829.46.
The Average Insurance Cost is: $32050.23.
The Maximum Insurance Cost is: $63770.43.
--- STATISTICS FOR SMOKERS ---


--- STATISTICS FOR NON-SMOKERS ---
The Total Number of Records are: 1064
The Minimum Insurance Cost is: $1121.87.
The Average Insurance Cost is: $8434.27.
The Maximum Insurance Cost is: $36910.61.
--- STATISTICS FOR NON-SMOKERS ---

The difference between the average insurance cost of a smoker and the average insurance cost of a non smoker is: 
$ 23615.96


### 1. Results

From this statistic, we can see that individuals who smoke pay, on average, much more for their medical insurance. This may be a result of medical insurance issuers assuming that individuals who smoke are more likely to encounter medical issues than individuals who do not smoke. The difference is large: on average, a smoker pays approximately \\$23,600 more than a non-smoker.
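
The gap can be double-checked with a little arithmetic on the two averages printed above. The sketch below hard-codes those rounded averages (an assumption made so it runs on its own) and also expresses the gap as a ratio, which comes out at roughly 3.8 times.

```python
# Rounded averages copied from the printed output above.
smokers_average = 32050.23
non_smokers_average = 8434.27

# Absolute gap between the two averages.
difference = smokers_average - non_smokers_average

# Relative gap: how many times larger the smokers' average is.
ratio = smokers_average / non_smokers_average

print("Difference: ${:,.2f}".format(difference))
print("Smokers pay about {:.1f} times as much on average.".format(ratio))
```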

### 2. Average Insurance Cost based on Region

In [12]:
regions = list(region_count)
print("The four regions are: ", regions)

The four regions are:  ['southwest', 'southeast', 'northwest', 'northeast']


Let us create separate lists based on the region. I have used list comprehensions to keep it short.

In [13]:
southwest_data = [ record for record in insurance_data if record[5]=="southwest"]
southeast_data = [ record for record in insurance_data if record[5]=="southeast"]
northwest_data = [ record for record in insurance_data if record[5]=="northwest"]
northeast_data = [ record for record in insurance_data if record[5]=="northeast"]

In [14]:
#Finding Statistics for Southwest region:
average_southwest = statistics(southwest_data,"southwest")
#Finding Statistics for Southeast region:
average_southeast = statistics(southeast_data,"southeast")
#Finding Statistics for Northwest region:
average_northwest = statistics(northwest_data,"northwest")
#Finding Statistics for Northeast region:
average_northeast = statistics(northeast_data,"northeast")


--- STATISTICS FOR SOUTHWEST ---
The Total Number of Records are: 325
The Minimum Insurance Cost is: $1241.57.
The Average Insurance Cost is: $12346.94.
The Maximum Insurance Cost is: $52590.83.
--- STATISTICS FOR SOUTHWEST ---


--- STATISTICS FOR SOUTHEAST ---
The Total Number of Records are: 364
The Minimum Insurance Cost is: $1121.87.
The Average Insurance Cost is: $14735.41.
The Maximum Insurance Cost is: $63770.43.
--- STATISTICS FOR SOUTHEAST ---


--- STATISTICS FOR NORTHWEST ---
The Total Number of Records are: 325
The Minimum Insurance Cost is: $1621.34.
The Average Insurance Cost is: $12417.58.
The Maximum Insurance Cost is: $60021.4.
--- STATISTICS FOR NORTHWEST ---


--- STATISTICS FOR NORTHEAST ---
The Total Number of Records are: 324
The Minimum Insurance Cost is: $1694.8.
The Average Insurance Cost is: $13406.38.
The Maximum Insurance Cost is: $58571.07.
--- STATISTICS FOR NORTHEAST ---



In [15]:
print("The Maximum average cost among the 4 regions is for the region Southeast with its average being ${} and the Minimum average cost among the 4 regions is for the region Southwest with its average being ${}.".format(round(average_southeast,2),round(average_southwest,2)))

The Maximum average cost among the 4 regions is for the region Southeast with its average being $14735.41 and the Minimum average cost among the 4 regions is for the region Southwest with its average being $12346.94.


### 2. Results

From the statistics, we can see that the southeast has more records than any other region, and its average cost is also the highest. It is possible that the southeast region has more medical emergencies than the other regions, which would explain both the larger number of records and the higher average cost, but this is only a possibility, not a guarantee. We can conclude that the average insurance cost in all four regions falls within the range of \\$12,000 - \\$15,000.
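
The goals mention that a region could simply have a lot of smokers. One way to check that is to compute the share of smokers per region. The sketch below assumes the same record layout (smoker in column 4, region in column 5); `smoker_share_by_region` and `sample_records` are hypothetical names, and the six records are copied from the sample output in this notebook, so the shares printed here say nothing about the full dataset.

```python
def smoker_share_by_region(records):
    """Return a dict mapping each region to the fraction of its records that are smokers."""
    counts = {}  # region -> [number of smokers, total records]
    for record in records:
        smoker, region = record[4], record[5]
        tally = counts.setdefault(region, [0, 0])
        tally[1] += 1
        if smoker == "yes":
            tally[0] += 1
    return {region: smokers / total for region, (smokers, total) in counts.items()}

# Six records copied from the sample output in this notebook (not representative).
sample_records = [
    ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'],
    ['30', 'male', '35.3', '0', 'yes', 'southwest', '36837.467'],
    ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
    ['62', 'female', '26.29', '0', 'yes', 'southeast', '27808.7251'],
    ['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'],
    ['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107'],
]

shares = smoker_share_by_region(sample_records)
for region, share in shares.items():
    print("{}: {:.0%} smokers".format(region, share))
```

Running `smoker_share_by_region(insurance_data)` on the full dataset would show whether the region with the highest average cost also has the highest share of smokers.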

### 3. Average Age, Costs & BMI of Males & Females

In order to calculate the average age & average BMI, let us define a function to keep it simple and reusable.

In [16]:
def average_age_BMI(data,variable_name):
    total_age = 0
    total_BMI = 0
    for record in data:
        total_age+=float(record[0])
        total_BMI+=float(record[2])
    average_age = total_age/len(data)
    average_BMI = total_BMI/len(data)
    print("--- AGE & BMI INFO FOR {} ---".format(variable_name.upper()))
    print("The average age of {} is: {} years. ".format(variable_name,round(average_age,1)))
    print("The average BMI of {} is: {}.".format(variable_name,round(average_BMI,1)))
    print("--- AGE & BMI INFO FOR {} ---".format(variable_name.upper()))
    
#Finding Statistics for Males:
males_average_cost = statistics(males_only,"Males")
average_age_BMI(males_only,"Males")
#Finding Statistics for Females:
females_average_cost = statistics(females_only,"Females")
average_age_BMI(females_only,"Females")


--- STATISTICS FOR MALES ---
The Total Number of Records are: 676
The Minimum Insurance Cost is: $1121.87.
The Average Insurance Cost is: $13956.75.
The Maximum Insurance Cost is: $62592.87.
--- STATISTICS FOR MALES ---

--- AGE & BMI INFO FOR MALES ---
The average age of Males is: 38.9 years. 
The average BMI of Males is: 30.9.
--- AGE & BMI INFO FOR MALES ---

--- STATISTICS FOR FEMALES ---
The Total Number of Records are: 662
The Minimum Insurance Cost is: $1607.51.
The Average Insurance Cost is: $12569.58.
The Maximum Insurance Cost is: $63770.43.
--- STATISTICS FOR FEMALES ---

--- AGE & BMI INFO FOR FEMALES ---
The average age of Females is: 39.5 years. 
The average BMI of Females is: 30.4.
--- AGE & BMI INFO FOR FEMALES ---


We have already split our data by gender. Let us further split each category based on BMI:
1. Under weight (BMI <= 18.5)
2. Healthy weight (18.5 < BMI <= 25)
3. Over weight (25 < BMI <= 30)
4. Obese (30 < BMI)
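
The four ranges above can also be written as a single function, which keeps the thresholds in one place. This is a minimal sketch using exactly the boundaries listed above; `bmi_category` is a hypothetical helper that the cells below do not use.

```python
def bmi_category(bmi):
    """Map a BMI value to one of the four categories listed above."""
    if bmi <= 18.5:
        return "under-weight"
    if bmi <= 25:
        return "healthy-weight"
    if bmi <= 30:
        return "over-weight"
    return "obese"

# BMI values taken from the sample records shown earlier in this notebook.
for bmi_text in ['33.77', '22.705', '28.88']:
    print(bmi_text, "->", bmi_category(float(bmi_text)))
```

Because each check runs only if the previous one failed, the upper bound alone is enough to define every range.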

In [17]:
#Splitting males dataset based on BMI and finding the corresponding statistics:
uw_males = [record for record in males_only if float(record[2])<=18.5]
hw_males = [record for record in males_only if float(record[2])>18.5 and float(record[2])<=25]
ow_males = [record for record in males_only if float(record[2])>25 and float(record[2])<=30]
ob_males = [record for record in males_only if float(record[2])>30]

In [18]:
#Finding Statistics for Under-weight Males:
uw_average_cost = statistics(uw_males,"Under-Weight Males")
average_age_BMI(uw_males,"Under-Weight Males")
#Finding Statistics for Healthy-weight Males:
hw_average_cost = statistics(hw_males,"Healthy-Weight Males")
average_age_BMI(hw_males,"Healthy-Weight Males")
#Finding Statistics for Over-weight Males:
ow_average_cost = statistics(ow_males,"Over-Weight Males")
average_age_BMI(ow_males,"Over-Weight Males")
#Finding Statistics for Obese Males:
ob_average_cost = statistics(ob_males,"Obese Males")
average_age_BMI(ob_males,"Obese Males")


--- STATISTICS FOR UNDER-WEIGHT MALES ---
The Total Number of Records are: 8
The Minimum Insurance Cost is: $1621.34.
The Average Insurance Cost is: $5611.71.
The Maximum Insurance Cost is: $12829.46.
--- STATISTICS FOR UNDER-WEIGHT MALES ---

--- AGE & BMI INFO FOR UNDER-WEIGHT MALES ---
The average age of Under-Weight Males is: 29.2 years. 
The average BMI of Under-Weight Males is: 17.3.
--- AGE & BMI INFO FOR UNDER-WEIGHT MALES ---

--- STATISTICS FOR HEALTHY-WEIGHT MALES ---
The Total Number of Records are: 108
The Minimum Insurance Cost is: $1121.87.
The Average Insurance Cost is: $9868.02.
The Maximum Insurance Cost is: $35069.37.
--- STATISTICS FOR HEALTHY-WEIGHT MALES ---

--- AGE & BMI INFO FOR HEALTHY-WEIGHT MALES ---
The average age of Healthy-Weight Males is: 36.8 years. 
The average BMI of Healthy-Weight Males is: 22.6.
--- AGE & BMI INFO FOR HEALTHY-WEIGHT MALES ---

--- STATISTICS FOR OVER-WEIGHT MALES ---
The Total Number of Records are: 189
The Minimum Insurance Cost 

In [19]:
#Splitting females dataset based on BMI and finding the corresponding statistics:
uw_females = [record for record in females_only if float(record[2])<=18.5]
hw_females = [record for record in females_only if float(record[2])>18.5 and float(record[2])<=25]
ow_females = [record for record in females_only if float(record[2])>25 and float(record[2])<=30]
ob_females = [record for record in females_only if float(record[2])>30]

In [20]:
#Finding Statistics for Under-weight Females:
uwf_average_cost = statistics(uw_females,"Under-Weight Females")
average_age_BMI(uw_females,"Under-Weight Females")
#Finding Statistics for Healthy-weight Females:
hwf_average_cost = statistics(hw_females,"Healthy-Weight Females")
average_age_BMI(hw_females,"Healthy-Weight Females")
#Finding Statistics for Over-weight Females:
owf_average_cost = statistics(ow_females,"Over-Weight Females")
average_age_BMI(ow_females,"Over-Weight Females")
#Finding Statistics for Obese Females:
obf_average_cost = statistics(ob_females,"Obese Females")
average_age_BMI(ob_females,"Obese Females")


--- STATISTICS FOR UNDER-WEIGHT FEMALES ---
The Total Number of Records are: 13
The Minimum Insurance Cost is: $1727.79.
The Average Insurance Cost is: $10532.03.
The Maximum Insurance Cost is: $32734.19.
--- STATISTICS FOR UNDER-WEIGHT FEMALES ---

--- AGE & BMI INFO FOR UNDER-WEIGHT FEMALES ---
The average age of Under-Weight Females is: 34.3 years. 
The average BMI of Under-Weight Females is: 17.8.
--- AGE & BMI INFO FOR UNDER-WEIGHT FEMALES ---

--- STATISTICS FOR HEALTHY-WEIGHT FEMALES ---
The Total Number of Records are: 118
The Minimum Insurance Cost is: $1607.51.
The Average Insurance Cost is: $10954.77.
The Maximum Insurance Cost is: $27117.99.
--- STATISTICS FOR HEALTHY-WEIGHT FEMALES ---

--- AGE & BMI INFO FOR HEALTHY-WEIGHT FEMALES ---
The average age of Healthy-Weight Females is: 37.0 years. 
The average BMI of Healthy-Weight Females is: 22.7.
--- AGE & BMI INFO FOR HEALTHY-WEIGHT FEMALES ---

--- STATISTICS FOR OVER-WEIGHT FEMALES ---
The Total Number of Records are: 19

### 3. Results

From these statistics, we can observe that the medical costs for under-weight & healthy-weight people are much lower than those for obese people. We can also observe a relation between age and BMI: across these groups, the average age rises along with the BMI category. The average medical insurance costs for under-weight males and females are the lowest of all, even though these individuals are under-weight. For females, since the average costs for the under-weight, healthy-weight and over-weight groups are very close to each other, BMI may not be a factor that contributes to the variance in medical costs unless they are obese.
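
The relation between age and BMI is read here from group averages. A more direct check would be a correlation coefficient over the individual records. The sketch below is a plain-Python Pearson correlation, in keeping with the no-NumPy rule of this project. `pearson_correlation` is a hypothetical helper, and the five records are copied from the sample output above, so the value it prints is only a demonstration and not a finding.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation of two equal-length lists of numbers.

    Raises ZeroDivisionError if either list has no variation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Five records copied from the sample output in this notebook.
sample_records = [
    ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
    ['28', 'male', '33', '3', 'no', 'southeast', '4449.462'],
    ['33', 'male', '22.705', '0', 'no', 'northwest', '21984.47061'],
    ['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'],
    ['37', 'male', '29.83', '2', 'no', 'northeast', '6406.4107'],
]

ages = [float(record[0]) for record in sample_records]   # column 0 holds age
bmis = [float(record[2]) for record in sample_records]   # column 2 holds BMI
r = pearson_correlation(ages, bmis)
print("Correlation between age and BMI in the sample:", round(r, 2))
```

A value near 1 or -1 indicates a strong linear relation and a value near 0 indicates little or none. Passing the age and BMI columns of `insurance_data` would test the claim on the full dataset.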

### 4. Number of Children factor in Medical Insurance Costs

Let us split the data based on the number of children each individual has.

In [21]:
one_child = [record for record in insurance_data if int(record[3])==1]
two_child = [record for record in insurance_data if int(record[3])==2]
three_child = [record for record in insurance_data if int(record[3])==3]
more_child = [record for record in insurance_data if int(record[3])>3]


In [22]:
#Finding Statistics for No Children:
no_average_cost = statistics(no_children,"No Children")
average_age_BMI(no_children,"No Children")
#Finding Statistics for 1 Child:
one_child_average_cost = statistics(one_child,"One Child")
average_age_BMI(one_child,"One Child")
#Finding Statistics for 2 Children:
two_child_average_cost = statistics(two_child,"Two Children")
average_age_BMI(two_child,"Two Children")
#Finding Statistics for 3 Children:
three_child_average_cost = statistics(three_child,"Three Children")
average_age_BMI(three_child,"Three Children")
#Finding Statistics for more than 3 Children:
more_child_average_cost = statistics(more_child,"More than 3 Children")
average_age_BMI(more_child,"More than 3 Children")


--- STATISTICS FOR NO CHILDREN ---
The Total Number of Records are: 574
The Minimum Insurance Cost is: $1121.87.
The Average Insurance Cost is: $12365.98.
The Maximum Insurance Cost is: $63770.43.
--- STATISTICS FOR NO CHILDREN ---

--- AGE & BMI INFO FOR NO CHILDREN ---
The average age of No Children is: 38.4 years. 
The average BMI of No Children is: 30.6.
--- AGE & BMI INFO FOR NO CHILDREN ---

--- STATISTICS FOR ONE CHILD ---
The Total Number of Records are: 324
The Minimum Insurance Cost is: $1711.03.
The Average Insurance Cost is: $12731.17.
The Maximum Insurance Cost is: $58571.07.
--- STATISTICS FOR ONE CHILD ---

--- AGE & BMI INFO FOR ONE CHILD ---
The average age of One Child is: 39.5 years. 
The average BMI of One Child is: 30.6.
--- AGE & BMI INFO FOR ONE CHILD ---

--- STATISTICS FOR TWO CHILDREN ---
The Total Number of Records are: 240
The Minimum Insurance Cost is: $2304.0.
The Average Insurance Cost is: $15073.56.
The Maximum Insurance Cost is: $49577.66.
--- STATISTI

### 4. Results

Going in, I expected medical insurance costs to increase with the number of children, but that does not seem to be entirely the case here. The costs do increase from no children to three children, but for individuals with more than 3 children, the average cost is even lower than for those with no children! We cannot assume this holds in general, as that group is a very small portion of the sample, but it is still a finding that surprised me.
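
Since the caveat here is about sample size, it can help to report each group's record count next to its average. The sketch below does that and flags small groups. `average_cost_by_children` is a hypothetical helper, the cut-off of 30 records is an arbitrary assumption chosen only for illustration, and the five records are copied from the sample output in this notebook.

```python
def average_cost_by_children(records, min_group_size=30):
    """Return {children: (average cost, record count, is_small_group)}.

    min_group_size is an arbitrary illustrative cut-off, not a statistical rule.
    """
    totals = {}  # number of children -> [sum of charges, record count]
    for record in records:
        children = int(record[3])
        entry = totals.setdefault(children, [0.0, 0])
        entry[0] += float(record[-1])
        entry[1] += 1
    summary = {}
    for children in sorted(totals):
        total, count = totals[children]
        summary[children] = (total / count, count, count < min_group_size)
    return summary

# Five records copied from the sample output in this notebook.
sample_records = [
    ['19', 'female', '27.9', '0', 'yes', 'southwest', '16884.924'],
    ['32', 'male', '28.88', '0', 'no', 'northwest', '3866.8552'],
    ['18', 'male', '33.77', '1', 'no', 'southeast', '1725.5523'],
    ['28', 'male', '33', '3', 'no', 'southeast', '4449.462'],
    ['37', 'female', '27.74', '3', 'no', 'northwest', '7281.5056'],
]

summary = average_cost_by_children(sample_records)
for children, (average, count, is_small) in summary.items():
    note = " (small group, interpret with care)" if is_small else ""
    print("{} children: ${:.2f} over {} records{}".format(children, average, count, note))
```

On the full dataset, `average_cost_by_children(insurance_data)` would show how few records sit behind the average for the largest families.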

# Conclusion

Through observation and inspection, we were able to identify a lot of new information without even visualizing the data. As I am beginning my data analyst journey, this motivates me even more because we still have not used all the tools in our toolkit. We used only Python, without importing any libraries (other than csv). Imagine how much information we can derive once we master all our tools and work with huge real-world datasets.

I have tried my best, but I am a newbie, so if you feel I can improve in places, please let me know. I would love the feedback. Thank you! Have a lovely day!