# U.S. Medical Insurance Costs


## Scope of the project


### What do we want to analyse?
- What profile is considered the highest risk? (For each column, which group of people receives the highest charge?)
- How strongly are gender and insurance cost related?
- Which BMI range receives the lowest insurance costs on average?
- Which region has the highest insurance costs on average?
  
### What do we want to know about the data set?
- How much variation is there between the observations in the data set?
- How many observations are in the data set?
- Is the data set biased?

### Data
- insurance.csv: a dataset containing the age, sex, BMI, number of children, smoker status, region and insurance charge of 1338 individuals


In [158]:
# Import the csv library
import csv

number_of_individuals = 0
charges = []
total_charges = 0

# Open the CSV file and read each row as a dictionary using DictReader
with open('insurance.csv', newline='') as insurances:
    insurance_content = csv.DictReader(insurances)

    # Iterate through the rows, counting the individuals and collecting their charges
    for row in insurance_content:
        number_of_individuals += 1
        charges.append(float(row['charges']))

print(f'The total number of individuals in the sample is\n{number_of_individuals} individuals')

# Calculate the total amount of insurance charges
for amount in charges:
    total_charges += amount

print(f"The total of everyone's insurance charges is\n${round(total_charges, 2)}")

# Calculate the average insurance charge
average_insurance_cost = total_charges / number_of_individuals
print(f"The average of everyone's insurance charges is\n${round(average_insurance_cost, 2)}")

The total number of individuals in the sample is
1338 individuals
The total of everyone's insurance charges is
$17755824.99
The average of everyone's insurance charges is
$13270.42
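
The scope also asks how much the observations vary, which the total and the mean alone do not show. The following is a minimal sketch using the standard library `statistics` module. The `sample_charges` list is hypothetical and only stands in for the `charges` list built above; it is not data from insurance.csv.

```python
import statistics

# Hypothetical values for illustration only; in the notebook this would be
# [float(amount) for amount in charges]
sample_charges = [1000.0, 2000.0, 3000.0, 4000.0]

mean_charge = statistics.mean(sample_charges)
median_charge = statistics.median(sample_charges)
# Population standard deviation: how far charges sit from the mean on average
std_dev = statistics.pstdev(sample_charges)
# Range: distance between the smallest and largest charge
charge_range = max(sample_charges) - min(sample_charges)

print(f'Mean: {mean_charge}, Median: {median_charge}')
print(f'Standard deviation: {round(std_dev, 2)}, Range: {charge_range}')
```

A mean well above the median would suggest that a small number of very high charges pull the average up, which is worth checking before comparing groups.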


#### Next, we evaluate the highest-risk groups within each category available for classification.



In [11]:
#Classifying by 'Region' 

import csv

# Open the CSV file in read mode
with open('insurance.csv', 'r', newline='') as insurances:
    insurance_content = csv.DictReader(insurances)

    # Initialize counters and accumulators for each region
    southeast_count = 0
    southeast_charges = 0.0
    southwest_count = 0
    southwest_charges = 0.0
    northeast_count = 0
    northeast_charges = 0.0
    northwest_count = 0
    northwest_charges = 0.0
    
    # Iterate through each row in the CSV data
    for row in insurance_content:
        try:
            # Convert 'charges' to float for calculations
            charge = float(row['charges'])
        except ValueError:
            # Skip rows where 'charges' cannot be converted to float
            continue

        # Check region and update counts and charges accordingly
        region = row['region'].lower()
        if region == 'southeast':
            southeast_count += 1
            southeast_charges += round(charge,2)
        elif region == 'southwest':
            southwest_count += 1
            southwest_charges += round(charge,2)
        elif region == 'northeast':
            northeast_count += 1
            northeast_charges += round(charge,2)
        elif region == 'northwest':
            northwest_count += 1
            northwest_charges += round(charge,2)
            
# Print the results, rounding the totals to two decimal places
print("Southeast Count:", southeast_count, "Total Charges:", round(southeast_charges, 2))
print("Southwest Count:", southwest_count, "Total Charges:", round(southwest_charges, 2))
print("Northeast Count:", northeast_count, "Total Charges:", round(northeast_charges, 2))
print("Northwest Count:", northwest_count, "Total Charges:", round(northwest_charges, 2))

Southeast Count: 364 Total Charges: 5363689.78
Southwest Count: 325 Total Charges: 4012754.69
Northeast Count: 324 Total Charges: 4343668.61
Northwest Count: 325 Total Charges: 4035711.93
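
The cell above reports counts and totals per region, but the scope question asks about averages. As a minimal sketch, the averages can be derived from the printed figures. The names `region_totals`, `region_averages` and `highest_region` are assumptions introduced here, and the totals are copied from the output above, rounded to two decimal places.

```python
# Counts and total charges per region, copied from the output above
# (totals rounded to two decimal places)
region_totals = {
    'southeast': (364, 5363689.78),
    'southwest': (325, 4012754.69),
    'northeast': (324, 4343668.61),
    'northwest': (325, 4035711.93),
}

# Average charge per region = total charges / number of individuals
region_averages = {
    region: round(total / count, 2)
    for region, (count, total) in region_totals.items()
}

# Region whose average charge is the largest
highest_region = max(region_averages, key=region_averages.get)

for region, average in region_averages.items():
    print(f'{region.capitalize()} Average Charge: {average}')
print('Highest average:', highest_region)
```

On these figures the southeast has the highest average charge (roughly 14,735 per person) and the southwest the lowest (roughly 12,347).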


In [33]:
#Classifying by 'Smoker'

import csv 

with open('insurance.csv', 'r', newline='') as insurances:
    insurance_content = list(csv.DictReader(insurances))
    
    smoker_count = sum(1 for row in insurance_content if row['smoker'] == 'yes')
    smoker_total_charges = round(sum(float(row['charges']) for row in insurance_content if row['smoker'] == 'yes'),2)

    non_smoker_count = sum(1 for row in insurance_content if row['smoker'] == 'no')
    non_smoker_total_charges = round(sum(float(row['charges']) for row in insurance_content if row['smoker'] == 'no'),2)

print("Smoker Count:", smoker_count, "Smoker Charges:", smoker_total_charges, 'Smoker Average Charge:', round(smoker_total_charges/smoker_count, 2))
print("Non Smoker Count:", non_smoker_count, "Non Smoker Charges:", non_smoker_total_charges, 'Non Smoker Average Charge:', round(non_smoker_total_charges/non_smoker_count, 2))

Smoker Count: 274 Smoker Charges: 8781763.52 Smoker Average Charge: 32050.23
Non Smoker Count: 1064 Non Smoker Charges: 8974061.47 Non Smoker Average Charge: 8434.27
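
The smoker, gender and region cells all repeat the same count-and-total pattern. One possible way to generalise it is a small helper that groups rows by any column. `average_by` is a hypothetical helper, and the three sample rows are made up purely for illustration; they are not taken from insurance.csv.

```python
import csv
import io
from collections import defaultdict

def average_by(rows, column, value_column='charges'):
    """Return {group: (count, average)} for each distinct value in `column`."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in rows:
        counts[row[column]] += 1
        totals[row[column]] += float(row[value_column])
    return {
        group: (counts[group], round(totals[group] / counts[group], 2))
        for group in counts
    }

# Hypothetical sample rows, for illustration only
sample_csv = io.StringIO(
    'smoker,charges\n'
    'yes,300.0\n'
    'no,100.0\n'
    'no,200.0\n'
)
sample_rows = list(csv.DictReader(sample_csv))

smoker_averages = average_by(sample_rows, 'smoker')
print(smoker_averages)
```

With the real file, `sample_rows` would be replaced by the `insurance_content` list from the cell above, and the same call would work for `'sex'` or `'region'`.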


In [3]:
#Classifying by gender

import csv 

with open('insurance.csv', 'r', newline='') as insurances:
    insurance_content = list(csv.DictReader(insurances))
    
    female_count = sum(1 for row in insurance_content if row['sex'] == 'female')
    female_total_charges = round(sum(float(row['charges']) for row in insurance_content if row['sex'] == 'female'),2)

    male_count = sum(1 for row in insurance_content if row['sex'] == 'male')
    male_total_charges = round(sum(float(row['charges']) for row in insurance_content if row['sex'] == 'male'),2)

print("Female Count:", female_count, "Female Charges:", female_total_charges, 'Female Average Charges:', round(female_total_charges/female_count, 2))
print("Male Count:", male_count, "Male Charges:", male_total_charges, 'Male Average Charges:', round(male_total_charges/male_count, 2))

Female Count: 662 Female Charges: 8321061.19 Female Average Charges: 12569.58
Male Count: 676 Male Charges: 9434763.8 Male Average Charges: 13956.75
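
To put a number on the gender question from the scope, the gap between the two averages can be expressed in absolute and relative terms. This is only a sketch based on the printed averages. It compares group means and does not account for other factors, such as smoking status or age, that may differ between the two groups.

```python
# Average charges copied from the output above
female_average = 12569.58
male_average = 13956.75

# Absolute gap between the two group averages
difference = round(male_average - female_average, 2)
# Gap expressed as a percentage of the female average
relative_difference = round(difference / female_average * 100, 2)

print(f'Male average exceeds female average by ${difference} ({relative_difference}%)')
```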


In [21]:
#Classifying by age

import csv

with open('insurance.csv', 'r', newline='') as insurances:
    insurance_content = list(csv.DictReader(insurances))

    # For each age band, track the number of people and their total charges
    under_20_count = 0
    under_20_charges = 0
    twenty_29_count = 0
    twenty_29_charges = 0
    thirty_39_count = 0
    thirty_39_charges = 0
    forty_49_count = 0
    forty_49_charges = 0
    fifty_59_count = 0
    fifty_59_charges = 0
    sixty_plus_count = 0
    sixty_plus_charges = 0

    # Iterate through each row in the CSV data
    for row in insurance_content:
        try:
            # Convert 'charges' to float for calculations
            charge = float(row['charges'])
        except ValueError:
            # Skip rows where 'charges' cannot be converted to float
            continue

        # Check the age band and update its count and charges accordingly
        age = int(row['age'])
        if age < 20:
            under_20_charges += round(charge, 2)
            under_20_count += 1
        elif age < 30:
            twenty_29_charges += round(charge, 2)
            twenty_29_count += 1
        elif age < 40:
            thirty_39_charges += round(charge, 2)
            thirty_39_count += 1
        elif age < 50:
            forty_49_charges += round(charge, 2)
            forty_49_count += 1
        elif age < 60:
            fifty_59_charges += round(charge, 2)
            fifty_59_count += 1
        else:
            sixty_plus_charges += round(charge, 2)
            sixty_plus_count += 1

print(f'The average insurance cost for the {under_20_count} people aged under 20 is ${under_20_charges/under_20_count}')
print(f'The average insurance cost for the {twenty_29_count} people aged 20-29 is ${twenty_29_charges/twenty_29_count}')
print(f'The average insurance cost for the {thirty_39_count} people aged 30-39 is ${thirty_39_charges/thirty_39_count}')
print(f'The average insurance cost for the {forty_49_count} people aged 40-49 is ${forty_49_charges/forty_49_count}')
print(f'The average insurance cost for the {fifty_59_count} people aged 50-59 is ${fifty_59_charges/fifty_59_count}')
print(f'The average insurance cost for the {sixty_plus_count} people aged 60 and over is ${sixty_plus_charges/sixty_plus_count}')

The average insurance cost for the 137 people aged under 20 is $8407.349489051094
The average insurance cost for the 280 people aged 20-29 is $9561.751035714287
The average insurance cost for the 257 people aged 30-39 is $11738.784085603116
The average insurance cost for the 279 people aged 40-49 is $14399.203548387097
The average insurance cost for the 271 people aged 50-59 is $16495.232656826567
The average insurance cost for the 114 people aged 60 and over is $21248.021842105274
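
The if/elif chain above can also be expressed as a small function that maps an age to its band, which keeps the thresholds and labels in one place. `age_band` is a hypothetical helper name and is not part of the original notebook; the thresholds mirror the ones used in the cell above.

```python
def age_band(age):
    """Return a label for the ten-year band that an age falls into."""
    if age < 20:
        return 'under 20'
    if age >= 60:
        return '60 and over'
    # Round down to the start of the decade, e.g. 37 -> 30
    decade = (age // 10) * 10
    return f'{decade}-{decade + 9}'

for example_age in (19, 20, 37, 59, 64):
    print(example_age, '->', age_band(example_age))
```

A label returned this way could serve as a dictionary key for accumulating counts and charges, replacing the twelve separate variables.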


In [31]:
#Classifying by BMI (below 30 vs. above 30)

import csv

with open('insurance.csv', 'r', newline='') as insurances:
    insurance_content = list(csv.DictReader(insurances))

    # Note: rows with a BMI of exactly 30 match neither condition,
    # so the two counts printed below add up to 1336 rather than 1338
    low_bmi_count = sum(1 for row in insurance_content if float(row['bmi']) < 30)
    low_bmi_charge = round(sum(float(row['charges']) for row in insurance_content if float(row['bmi']) < 30),2)

    high_bmi_count = sum(1 for row in insurance_content if float(row['bmi']) > 30)
    high_bmi_charge = round(sum(float(row['charges']) for row in insurance_content if float(row['bmi']) > 30),2)

print("Low BMI:", low_bmi_count, "Low BMI charges:", low_bmi_charge, 'Low BMI Average:', round(low_bmi_charge/low_bmi_count, 2))
print("High BMI:", high_bmi_count, "High BMI charges:", high_bmi_charge, 'High BMI Average:', round(high_bmi_charge/high_bmi_count, 2))

Low BMI: 631 Low BMI charges: 6760323.81 Low BMI Average: 10713.67
High BMI: 705 High BMI charges: 10970453.06 High BMI Average: 15560.93
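
A single split at 30 gives only two groups, so it cannot say which BMI range has the lowest average cost. A finer banding may help. The sketch below uses the standard WHO adult categories (underweight below 18.5, normal from 18.5 to under 25, overweight from 25 to under 30, obese at 30 and above). `bmi_band` is a hypothetical helper and has not been run against insurance.csv here, so no results are claimed for it.

```python
def bmi_band(bmi):
    """Return the WHO adult BMI category for a BMI value."""
    if bmi < 18.5:
        return 'underweight'
    if bmi < 25:
        return 'normal'
    if bmi < 30:
        return 'overweight'
    # 30 and above, so a BMI of exactly 30 is included
    return 'obese'

for example_bmi in (17.0, 22.5, 29.9, 30.0, 35.2):
    print(example_bmi, '->', bmi_band(example_bmi))
```

Because the last band uses "30 and above", every row lands in exactly one category and the group counts would add up to the full sample.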
