# U.S. Medical Insurance Costs

### Plan

Explore how the `smoker` variable relates to each of the other variables in the medical insurance dataset, using simple comparisons of averages and percentages.
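The cells in this notebook compare averages and percentages between smokers and non-smokers rather than computing a correlation coefficient. If a single coefficient is wanted, one option is the Pearson correlation between a 0/1 smoker indicator and a numeric column (often called the point-biserial correlation). The sketch below is not part of the original analysis: `pearson_correlation` and `sample_records` are hypothetical names, and only the first sample record comes from the dataset; the other three are made up for illustration.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError('correlation is undefined for a constant column')
    return covariance / sqrt(variance_x * variance_y)

# Hypothetical sample: the first record is from the dataset, the rest are invented
sample_records = [
    {'smoker': 'yes', 'charges': 16884.924},
    {'smoker': 'no', 'charges': 2000.0},
    {'smoker': 'no', 'charges': 4000.0},
    {'smoker': 'yes', 'charges': 30000.0},
]

# Encode the binary variable as 1 (smoker) / 0 (non-smoker)
smoker_flags = [1 if record['smoker'] == 'yes' else 0 for record in sample_records]
charges = [record['charges'] for record in sample_records]

smoker_charges_correlation = pearson_correlation(smoker_flags, charges)
print('Correlation between smoker and charges: {:.2f}'.format(smoker_charges_correlation))
```

A value near +1 or -1 indicates a strong linear association and a value near 0 a weak one. For a categorical column with more than two values, such as `region`, this coefficient does not apply directly.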

### Loading dataset

In [5]:
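import csv

# Convert numeric strings to float; leave other values (e.g. 'female', 'southwest') as text
def to_number_if_possible(value):
    try:
        return float(value)
    except ValueError:
        return value
medical_insurance_data = []
with open('insurance.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        fixed_row = { key: to_number_if_possible(value) for key, value in row.items() }
        medical_insurance_data.append(fixed_row)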
dataset_size = len(medical_insurance_data)
print('Dataset size = {}'.format(dataset_size))
print('The records look like this one')
medical_insurance_data[:1]

Dataset size = 1338
The records look like this one


[{'age': 19.0,
  'sex': 'female',
  'bmi': 27.9,
  'children': 0.0,
  'smoker': 'yes',
  'region': 'southwest',
  'charges': 16884.924}]

### Common functions

In [18]:
# Get the average value of a given numerical column in the dataset
def get_average_for_column(column_name):
    total = 0
    for record in medical_insurance_data:
        total += record[column_name]
    return total / dataset_size

# For a binary (two values) column, return the percentage of records in the dataset that have the given value
def get_binary_value_percentage(column_name, value):
    count = 0
    for record in medical_insurance_data:
        if record[column_name] == value:
            count += 1
    return 100 * count / dataset_size

def is_smoker(medical_record):
    return medical_record['smoker'] == 'yes'

# Get the average value of a given numerical column in the dataset, separately for smokers and for non-smokers (returned in that order)
def get_field_average_for_smokers(numeric_field_name):
    smoker_values = []
    non_smoker_values = []
    for record in medical_insurance_data:
        numeric_value = record[numeric_field_name]
        if is_smoker(record):
            smoker_values.append(numeric_value)
        else:
            non_smoker_values.append(numeric_value)
    return sum(smoker_values) / len(smoker_values), sum(non_smoker_values) / len(non_smoker_values)

# For a binary (two values) column, return the percentage of smokers that have the given value
def get_binary_value_percentage_for_smokers(column_name, value):
    count_smokers = 0
    count_positives = 0
    for record in medical_insurance_data:
        if is_smoker(record):
            count_smokers += 1
            if record[column_name] == value:
                count_positives += 1
    return 100 * count_positives / count_smokers

# For a given categorical variable, return a dictionary with the percentage of people in the dataset for each value
# of the variable that actually occurs in the dataset.
# We can also pass a condition field name and a condition value. Then the function returns the distribution of the
# categorical variable among the records for which the condition field has the condition value, e.g. when record['smoker'] == 'yes'
def get_percentages_for_categorial_field(field_name, condition_field = None, condition_value = None):
    result = {}
    count = 0
    for record in medical_insurance_data:
        if condition_field is None or record[condition_field] == condition_value:
            count += 1
            value = record[field_name]
            if value in result:
                result[value] += 1
            else:
                result[value] = 1
    for value in result:
        result[value] *= 100 / count
    return result
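
For comparison, the same percentage calculation can be written more compactly with `collections.Counter` from the standard library. This is only a sketch: `percentages_by_value` and `sample_records` are hypothetical names, the records are passed in as an argument instead of being read from the global list, and the four sample records are made up.

```python
from collections import Counter

def percentages_by_value(records, field_name, condition_field=None, condition_value=None):
    """Percentage of records per value of field_name, optionally restricted
    to records where condition_field equals condition_value."""
    selected = [
        record for record in records
        if condition_field is None or record[condition_field] == condition_value
    ]
    counts = Counter(record[field_name] for record in selected)
    # If nothing was selected, counts is empty and an empty dict is returned
    return {value: 100 * count / len(selected) for value, count in counts.items()}

# Hypothetical records, only for illustration
sample_records = [
    {'region': 'southwest', 'smoker': 'yes'},
    {'region': 'southeast', 'smoker': 'no'},
    {'region': 'southeast', 'smoker': 'yes'},
    {'region': 'southeast', 'smoker': 'no'},
]

all_percentages = percentages_by_value(sample_records, 'region')
smoker_percentages = percentages_by_value(sample_records, 'region', 'smoker', 'yes')
print(all_percentages)
print(smoker_percentages)
```

Passing the records as a parameter makes the helper usable on any list of dictionaries, including a filtered subset of `medical_insurance_data`.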


### The smoker variable

In [7]:
'{:.2f}% of the people in the dataset are smokers'.format(get_binary_value_percentage('smoker', 'yes'))

'20.48% of the people in the dataset are smokers'

### Smokers and age

In [8]:
print('The average age in the dataset is {:.2f}'.format(get_average_for_column('age')))

average1, average2 = get_field_average_for_smokers('age')
'The average age in the dataset for smokers is {:.2f}, for non-smokers {:.2f}'.format(average1, average2)

The average age in the dataset is 39.21


'The average age in the dataset for smokers is 38.51, for non-smokers 39.39'

### Smokers and sex

In [9]:
print('{:.2f}% of the people in the dataset are females'.format(get_binary_value_percentage('sex', 'female')))

'{:.2f}% of the smokers in the dataset are females'.format(get_binary_value_percentage_for_smokers('sex', 'female'))

49.48% of the people in the dataset are females


'41.97% of the smokers in the dataset are females'

### Smokers and BMI

In [10]:
print('The average BMI in the dataset is {:.2f}'.format(get_average_for_column('bmi')))

average1, average2 = get_field_average_for_smokers('bmi')
'The average BMI in the dataset for smokers is {:.2f}, for non-smokers {:.2f}'.format(average1, average2)

The average BMI in the dataset is 30.66


'The average BMI in the dataset for smokers is 30.71, for non-smokers 30.65'

### Smokers and children

In [11]:
print('The average number of children for people in the dataset is {:.2f}'.format(get_average_for_column('children')))

average1, average2 = get_field_average_for_smokers('children')
'The average number of children for smokers in the dataset is {:.2f}, for non-smokers {:.2f}'.format(average1, average2)

The average number of children for people in the dataset is 1.09


'The average number of children for smokers in the dataset is 1.11, for non-smokers 1.09'

### Smokers and region

In [20]:
print('Percentages for regions for people in the dataset:')
print(get_percentages_for_categorial_field('region'))

print('Percentages for regions for smokers in the dataset:')
print(get_percentages_for_categorial_field('region', 'smoker', 'yes'))

Percentages for regions for people in the dataset:
{'southwest': 24.28998505231689, 'southeast': 27.20478325859492, 'northwest': 24.28998505231689, 'northeast': 24.2152466367713}
Percentages for regions for smokers in the dataset:
{'southwest': 21.16788321167883, 'southeast': 33.21167883211679, 'northeast': 24.452554744525546, 'northwest': 21.16788321167883}


### Smokers and charges

In [21]:
print('The average charges for people in the dataset is {:.2f}'.format(get_average_for_column('charges')))

average1, average2 = get_field_average_for_smokers('charges')
'The average charges in the dataset for smokers is {:.2f}, for non-smokers {:.2f}'.format(average1, average2)

The average charges for people in the dataset is 13270.42


'The average charges in the dataset for smokers is 32050.23, for non-smokers 8434.27'

### Summary

About 20.5% of the people in the dataset are smokers.

In the dataset, the average age is similar for smokers and non-smokers.

In the dataset, about 50% of the people are women.  
Among smokers in the dataset, only 42% are women.  
We may hypothesize that men are more likely to smoke than women.

In the dataset, the average BMI is similar for smokers and non-smokers.

In the dataset, the average number of children is similar for smokers and non-smokers.

In the dataset, 24% of the people are from the US southwest, 27% from the US southeast, 24% from the US northwest, and 24% from the US northeast.  
Among the smokers, 21% are from the US southwest, 33% from the US southeast, 21% from the US northwest, and 24% from the US northeast.  
We may hypothesize that smoking is more common in the US southeast than in the other US regions.
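
The percentages above describe how smokers are distributed across groups, while the hypotheses about sex and region concern how common smoking is within each group. One way to bridge the two is Bayes' rule: the share of smokers within a group equals the group's share among smokers, times the overall smoker share, divided by the group's overall share. The sketch below applies this to the percentages printed earlier in the notebook. `smoking_rate_within_group` is a hypothetical helper, and the male figures assume that `sex` takes only the values `female` and `male`.

```python
SMOKER_SHARE = 20.48  # percent of smokers in the dataset, from the output above

def smoking_rate_within_group(share_among_smokers, share_overall, smoker_share=SMOKER_SHARE):
    """Percent of a group that smokes, given the group's percent share
    among smokers and its percent share in the whole dataset."""
    return share_among_smokers * smoker_share / share_overall

# Region percentages copied from the notebook output
region_overall = {'southwest': 24.28998505231689, 'southeast': 27.20478325859492,
                  'northwest': 24.28998505231689, 'northeast': 24.2152466367713}
region_among_smokers = {'southwest': 21.16788321167883, 'southeast': 33.21167883211679,
                        'northeast': 24.452554744525546, 'northwest': 21.16788321167883}

region_rates = {
    region: smoking_rate_within_group(region_among_smokers[region], region_overall[region])
    for region in region_overall
}

# Sex percentages copied from the notebook output; male shares assume a binary column
female_rate = smoking_rate_within_group(41.97, 49.48)
male_rate = smoking_rate_within_group(100 - 41.97, 100 - 49.48)

for region, rate in region_rates.items():
    print('{:.1f}% of the people from the {} are smokers'.format(rate, region))
print('{:.1f}% of the women and {:.1f}% of the men are smokers'.format(female_rate, male_rate))
```

Because the inputs are rounded percentages, the results are approximate; counting smokers per group directly from `medical_insurance_data` would give exact figures.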

In the dataset, the average medical charges are about $13,270.  
For smokers, the average medical charges are about $32,050; for non-smokers, about $8,434.  
So the average medical charges for smokers are about 3.8 times those for non-smokers.
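
As a quick consistency check on these figures, the overall average should equal the average of the two group averages weighted by group size. The sketch below uses only numbers printed earlier in the notebook; the smoker count is not printed anywhere and is derived here from the rounded 20.48% share, which is an assumption of this sketch.

```python
dataset_size = 1338
smoker_share = 20.48          # percent, from the notebook output
avg_charges_smokers = 32050.23
avg_charges_non_smokers = 8434.27
avg_charges_overall = 13270.42

# Derived counts (assumption: rounding the percentage recovers the true count)
smoker_count = round(dataset_size * smoker_share / 100)
non_smoker_count = dataset_size - smoker_count

# Weighted average of the two group averages
recomputed_overall = (
    smoker_count * avg_charges_smokers + non_smoker_count * avg_charges_non_smokers
) / dataset_size

# How many times higher the smokers' average is
charges_ratio = avg_charges_smokers / avg_charges_non_smokers

print('Recomputed overall average: {:.2f}'.format(recomputed_overall))
print('Reported overall average:   {:.2f}'.format(avg_charges_overall))
print('Smokers / non-smokers ratio: {:.2f}'.format(charges_ratio))
```

The recomputed and reported averages agree to within rounding, which suggests the per-group averages and the smoker share are mutually consistent. It does not by itself show that the difference is statistically significant or caused by smoking.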
