# **U.S. Medical Insurance Cost**

## <u>Part 1: Data Reading and Preparation</u>

The `csv` module is imported to read the raw data from `insurance.csv` using `csv.DictReader()`. Every value is initially a string, so each row is typecast as follows before being appended to the `dataset` list:

<center>

|<center>Variable</center> | <center>Data Type</center> | <center>Variable Type</center> |
|-|-|-|
| `age` | int | Scale |
| `sex` | int (0 for 'male' and 1 for 'female') | Nominal |
| `bmi` | float | Ratio |
| `children` | int | Ordinal |
| `smoker` | int (0 for 'no' and 1 for 'yes') | Nominal |
| `region` | int (1 for 'northeast', 2 for 'northwest' , <br>3 for 'southeast', 4 for 'southwest') | Nominal |
| `charges` | float | Ratio |

</center>

The unique region names were identified by first extracting the `region` column of `insurance.csv` into a `regions.txt` file. The following command was then run in a terminal to list the unique names:
```
sort regions.txt | uniq
```
The numeric codes were assigned based on the unique region names returned by this command.
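The same check can also be done without leaving Python. The following is a minimal sketch, not the method used above; the helper name `unique_values` and the inline sample rows are made up for illustration and stand in for the real file.

```python
import csv
import io


def unique_values(csv_file, column):
    """Return the sorted distinct values found in one column of a CSV file object."""
    return sorted({row[column] for row in csv.DictReader(csv_file)})


# Made-up rows with the same header layout as insurance.csv
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "30,female,25.5,0,yes,southwest,1000.0\n"
    "41,male,31.2,1,no,southeast,2000.0\n"
    "52,male,28.0,3,no,southeast,3000.0\n"
)
regions = unique_values(sample, "region")
print(regions)
```

With the real data, a file opened via `open("insurance.csv", newline="")` would take the place of `sample`.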

In [58]:
import csv

dataset = []  # list of dicts, one typecast row per sample

with open("insurance.csv", "r") as file:
    for row in csv.DictReader(file):
        data = {}
        for header, value in row.items():
            if header == "age":
                data[header] = int(value)
            elif header == "sex":
                data[header] = 0 if value == "male" else 1
            elif header == "bmi":
                data[header] = float(value)
            elif header == "children":
                data[header] = int(value)
            elif header == "smoker":
                data[header] = 0 if value == "no" else 1
            elif header == "region":
                if value == "northeast":
                    data[header] = 1
                elif value == "northwest":
                    data[header] = 2
                elif value == "southeast":
                    data[header] = 3
                elif value == "southwest":
                    data[header] = 4
            elif header == "charges":
                data[header] = float(value)
        dataset.append(data)

print("Dataset contains", len(dataset), "samples")

Dataset contains 1338 samples
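The typecasting rules from the table can also be expressed as a lookup of converters instead of an `if`/`elif` chain. This is only a sketch under that assumption; the names `CONVERTERS` and `convert_row` and the sample row are hypothetical and do not appear in the notebook.

```python
# Category codes taken from the variable table above
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"no": 0, "yes": 1}
REGION_CODES = {"northeast": 1, "northwest": 2, "southeast": 3, "southwest": 4}

# One converter per column header
CONVERTERS = {
    "age": int,
    "sex": SEX_CODES.__getitem__,
    "bmi": float,
    "children": int,
    "smoker": SMOKER_CODES.__getitem__,
    "region": REGION_CODES.__getitem__,
    "charges": float,
}


def convert_row(row):
    """Typecast one DictReader row; an unknown category raises KeyError."""
    return {header: CONVERTERS[header](value) for header, value in row.items()}


# Made-up row in the string format that csv.DictReader yields
raw = {"age": "30", "sex": "female", "bmi": "25.5", "children": "2",
       "smoker": "no", "region": "northwest", "charges": "4000.5"}
converted = convert_row(raw)
print(converted)
```

Unlike the `else 1` branches in the cell above, this version fails loudly on an unexpected category such as a misspelled region, which may or may not be the desired behaviour.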


## <u>Part 2: Summary Statistics</u>

The data in the `dataset` list is split into individual lists, one per variable (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges`), so that summary statistics can be obtained from them.

### 1. Separating the ordinal, scale, and ratio variables

Each row in the dataset is iterated over, and its values are appended to the respective lists.

In [59]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

for data in dataset:

    age.append(data["age"])
    sex.append(data["sex"])
    bmi.append(data["bmi"])
    children.append(data["children"])
    smoker.append(data["smoker"])
    region.append(data["region"])
    charges.append(data["charges"])

### 2. Frequency Distribution

Frequency distributions are used to answer the following questions:

1. How many people belong to each age group?
2. How many people have a given number of children?
3. How many males and females are there?
4. How many smokers and non-smokers are there?
5. What is the regional distribution of people?

The following frequency distributions are computed to get an overview of the dataset:

*Grouped frequency distribution*
- age
- children

*Ungrouped frequency distribution*
- sex
- smoker
- region

#### **2.1 Grouped frequency distribution**

The class interval for grouped frequency is calculated as:

$${Class~Interval} = \frac{Upper~Bound-Lower~Bound}{Number~of~Classes}$$

Here, 'Upper Bound' is the highest value in the range and 'Lower Bound' is the lowest.
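As a worked example of the formula, the sketch below computes the interval for ages running from 18 to 64 (the bounds this notebook reports) split into 10 classes. The function name `class_interval` is an assumption made for illustration.

```python
def class_interval(values, num_classes):
    """(upper bound - lower bound) / number of classes, rounded to an integer."""
    return round((max(values) - min(values)) / num_classes)


# Only the bounds matter here: (64 - 18) / 10 = 4.6, which rounds to 5
interval = class_interval([18, 64], 10)
print(interval)
```

Note that Python's `round()` rounds exact halves to the nearest even integer, so a raw interval of 4.5 would give 4 rather than 5.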

The following code block computes the frequency distribution of *age*.

In [60]:
# Import for showing frequency distribution in a table
from IPython.display import display, Markdown

# Frequency distribution of age

output_text = "<center>\n\n<u>Frequency Distribution of Age</u>\n| <center>Age</center> | <center>Frequency</center> | <center>% of Respondents</center> |\n|-|-|-|\n"

highest_age, lowest_age = max(age), min(age)

print("Highest age is {} and lowest age is {}".format(highest_age, lowest_age))

# With bounds of 18 and 64, 10 classes are requested, giving an interval of round(4.6) = 5.
# Each class is inclusive on both ends, so the loop below yields 8 classes of 6 ages each.
num_classes = 10
class_interval = round((highest_age-lowest_age)/num_classes)

while lowest_age < highest_age:
    counter = 0
    for a in age:
        if lowest_age <= a <= lowest_age+class_interval:
            counter += 1
    output_text += "| <center>{}-{}</center> | <center>{}</center> | <center>{}</center> |\n".format(
        lowest_age, lowest_age+class_interval, counter, round(counter*100/len(dataset), 2))
    # Step past the inclusive upper bound to reach the next class
    lowest_age += class_interval+1

output_text += "\n</center>"

display(Markdown(output_text))

Highest age is 64 and lowest age is 18


<center>

<u>Frequency Distribution of Age</u>
| <center>Age</center> | <center>Frequency</center> | <center>% of Respondents</center> |
|-|-|-|
| <center>18-23</center> | <center>250</center> | <center>18.68</center> |
| <center>24-29</center> | <center>167</center> | <center>12.48</center> |
| <center>30-35</center> | <center>157</center> | <center>11.73</center> |
| <center>36-41</center> | <center>154</center> | <center>11.51</center> |
| <center>42-47</center> | <center>168</center> | <center>12.56</center> |
| <center>48-53</center> | <center>172</center> | <center>12.86</center> |
| <center>54-59</center> | <center>156</center> | <center>11.66</center> |
| <center>60-65</center> | <center>114</center> | <center>8.52</center> |

</center>

The following code block computes the frequency distribution of *children*.

In [61]:
# Frequency distribution of children

output_text = "<center>\n\n<u>Frequency Distribution of Children</u>\n| <center>Children</center> | <center>Frequency</center> | <center>% of Respondents</center> |\n|-|-|-|\n"

highest_children, lowest_children = max(children), min(children)

print("Highest number of children is {} and lowest number of children is {}".format(
    highest_children, lowest_children))

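# With bounds of 0 and 5, 5 classes are requested, giving an interval of round(1.0) = 1.
# Each class is inclusive on both ends, so the loop below yields 3 classes of 2 values each.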
num_classes = 5
class_interval = round((highest_children-lowest_children)/num_classes)

while lowest_children < highest_children:
    counter = 0
    for c in children:
        if lowest_children <= c <= lowest_children+class_interval:
            counter += 1
    output_text += "| <center>{}-{}</center> | <center>{}</center> | <center>{}</center> |\n".format(
        lowest_children, lowest_children+class_interval, counter, round(counter*100/len(dataset), 2))
    # Step past the inclusive upper bound to reach the next class
    lowest_children += class_interval+1

output_text += "\n</center>"

display(Markdown(output_text))

Highest number of children is 5 and lowest number of children is 0


<center>

<u>Frequency Distribution of Children</u>
| <center>Children</center> | <center>Frequency</center> | <center>% of Respondents</center> |
|-|-|-|
| <center>0-1</center> | <center>898</center> | <center>67.12</center> |
| <center>2-3</center> | <center>397</center> | <center>29.67</center> |
| <center>4-5</center> | <center>43</center> | <center>3.21</center> |

</center>
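The two cells above share the same counting loop, so it could be factored into one helper. The following is a sketch under that assumption; `grouped_frequency` is a hypothetical name, and the loop condition uses `<=` rather than `<` so that a class starting exactly at the maximum value would not be skipped. For the age and children data the classes come out the same either way.

```python
def grouped_frequency(values, num_classes):
    """Return (low, high, count, percent) tuples using inclusive class bounds."""
    low, high = min(values), max(values)
    interval = round((high - low) / num_classes)
    table = []
    while low <= high:
        count = sum(1 for v in values if low <= v <= low + interval)
        percent = round(count * 100 / len(values), 2)
        table.append((low, low + interval, count, percent))
        low += interval + 1  # next class starts just past the inclusive upper bound
    return table


# Made-up numbers of children for six respondents
table = grouped_frequency([0, 0, 1, 2, 3, 5], 5)
for row in table:
    print(row)
```

Each tuple maps directly onto one row of the Markdown tables built above.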

#### **2.2 Ungrouped frequency distribution**

The following code block computes the frequency distributions of sex, smoker, and region.

In [62]:
print("There are {} ({}%) male and {} ({}%) female among respondents.".format(
    sex.count(0), round(sex.count(0)*100/len(dataset), 2), sex.count(1), round(sex.count(1)*100/len(dataset), 2)))

print("{} ({}%) respondents are smokers and {} ({}%) are not.".format(
    smoker.count(1), round(smoker.count(1)*100/len(dataset), 2), smoker.count(0), round(smoker.count(0)*100/len(dataset), 2)))
print("Among the respondents, {} ({}%) are from the northeast, {} ({}%) from the northwest, {} ({}%) from the southeast, and {} ({}%) from the southwest region.".format(
    region.count(1), round(region.count(1)*100/len(dataset), 2), region.count(2), round(region.count(2)*100/len(dataset), 2), region.count(3), round(region.count(3)*100/len(dataset), 2), region.count(4), round(region.count(4)*100/len(dataset), 2)))

There are 676 (50.52%) male and 662 (49.48%) female among respondents.
274 (20.48%) respondents are smokers and 1064 (79.52%) are not.
Among the respondents, 324 (24.22%) are from the northeast, 325 (24.29%) from the northwest, 364 (27.2%) from the southeast, and 325 (24.29%) from the southwest region.
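Repeated `list.count()` calls scan the list once per category. A single pass with `collections.Counter` gives the same counts; the sketch below assumes a hypothetical helper `frequency_table` and uses made-up codes rather than the real columns.

```python
from collections import Counter


def frequency_table(codes, labels):
    """Map each label to (count, percentage) for a list of numeric codes."""
    counts = Counter(codes)
    total = len(codes)
    return {label: (counts[code], round(counts[code] * 100 / total, 2))
            for code, label in labels.items()}


# Made-up smoker codes: 0 for 'no', 1 for 'yes'
table = frequency_table([0, 0, 1, 0], {0: "non-smoker", 1: "smoker"})
print(table)
```

A `Counter` returns 0 for a code that never occurs, so a category with no respondents still appears in the result.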


### 3. Central Tendency and Dispersion

The central tendency and dispersion of `age`, `bmi`, `children`, and `charges` are calculated in the following section. The measures used are:

*Central tendency*

- Arithmetic Mean
- Median
- Quartiles

*Dispersion*

- Standard deviation
- Quartile deviation
- Coefficient of variation
- Coefficient of quartile deviation

#### 3.1 Arithmetic Mean

Arithmetic mean is calculated using the following formula:

$$Arithmetic~Mean = \frac{\displaystyle\sum_{i=1} ^{n} x_i}{n}$$

In [63]:
# Mean of age
mean_age = round(sum(age)/len(age), 2)

# Mean of bmi
mean_bmi = round(sum(bmi)/len(bmi), 2)

# Mean of children
mean_children = round(sum(children)/len(children), 2)

# Mean of charges
mean_charges = round(sum(charges)/len(charges), 2)

print("Mean age is {} years\nMean BMI is {}\nMean number of children is {}\nMean insurance charge is {} dollars".format(
    mean_age, mean_bmi, mean_children, mean_charges))

Mean age is 39.21 years
Mean BMI is 30.66
Mean number of children is 1.09
Mean insurance charge is 13270.42 dollars
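The formula can be cross-checked against the standard library. This sketch compares a hand-written mean with `statistics.fmean` on made-up ages; the function name `arithmetic_mean` is an assumption for illustration.

```python
from statistics import fmean


def arithmetic_mean(values):
    """Sum of the observations divided by their count."""
    return sum(values) / len(values)


sample = [18, 25, 40, 64]  # made-up ages
manual = arithmetic_mean(sample)
library = fmean(sample)
print(manual, library)
```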


#### 3.2 Median

Median is calculated as:

$$Median = \frac{\frac{n}{2}th~observation+(\frac{n}{2}+1)th~observation}{2}$$

**N.B.**: The median is calculated this way because the number of samples is even. For an odd number of samples, the median is the single middle observation, i.e. the $\frac{n+1}{2}$th.

In [64]:
# Median of age
age.sort()
median_age = (age[int(len(age)/2)]+age[int(len(age)/2+1)])/2

# Median of bmi
bmi.sort()
median_bmi = (bmi[int(len(bmi)/2)]+bmi[int(len(bmi)/2+1)])/2

# Median of children
children.sort()
median_children = (children[int(len(children)/2)] +
                   children[int(len(children)/2+1)])/2

# Median of charges
charges.sort()
median_charges = round((charges[int(len(charges)/2)] +
                  charges[int(len(charges)/2+1)])/2, 2)

print("Median age is {} years\nMedian BMI is {}\nMedian number of children is {}\nMedian charges is {} dollars".format(
    median_age, median_bmi, median_children, median_charges))

Median age is 39.0 years
Median BMI is 30.4
Median number of children is 1.0
Median charges is 9388.75 dollars
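A helper that covers both the even and the odd case could look like the sketch below. It is not the notebook's code: the name `median` is hypothetical, and it assumes zero-based list indexing, under which the $\frac{n}{2}$th observation sits at index `n // 2 - 1`. Its result can therefore differ slightly from a computation that reads indices `n/2` and `n/2 + 1`.

```python
def median(values):
    """Median of a list, for both even and odd sample sizes."""
    ordered = sorted(values)  # leaves the caller's list untouched
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # the single middle observation
    # Even n: the (n/2)th and (n/2 + 1)th observations sit at
    # zero-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


even_case = median([4, 1, 3, 2])
odd_case = median([5, 1, 3])
print(even_case, odd_case)
```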


#### 3.3 Quartiles

Quartiles are calculated as:

$$Quartile = \frac{(\frac{n}{4}\times i)th~observation+\{(\frac{n}{4}\times i)+1\}th~observation}{2}$$

**N.B.**: Here $i$ is the quartile number (1, 2, or 3). The quartiles are calculated this way because the number of samples is even; a different formula applies when the number of samples is odd.

In [65]:
# Quartiles of age
age_quartiles = []

print("Quartiles of age")
for i in range(1, 4):
    quartile = (age[int(len(age)*i/4)]+age[int(len(age)*i/4)+1])/2
    age_quartiles.append(quartile)
    print("Q{} = {}".format(i, quartile))

# Quartiles of bmi
bmi_quartiles = []

print("Quartiles of bmi")
for i in range(1, 4):
    quartile = round((bmi[int(len(bmi)*i/4)]+bmi[int(len(bmi)*i/4)+1])/2, 2)
    bmi_quartiles.append(quartile)
    print("Q{} = {}".format(i, quartile))

# Quartiles of children
children_quartiles = []

print("Quartiles of children")
for i in range(1, 4):
    quartile = round((children[int(len(children)*i/4)]+children[int(len(children)*i/4)+1])/2, 2)
    children_quartiles.append(quartile)
    print("Q{} = {}".format(i, quartile))

# Quartiles of charges
charges_quartiles = []

print("Quartiles of charges")
for i in range(1, 4):
    quartile = round((charges[int(len(charges)*i/4)]+charges[int(len(charges)*i/4)+1])/2, 2)
    charges_quartiles.append(quartile)
    print("Q{} = {}".format(i, quartile))

Quartiles of age
Q1 = 27.0
Q2 = 39.0
Q3 = 51.0
Quartiles of bmi
Q1 = 26.3
Q2 = 30.4
Q3 = 34.7
Quartiles of children
Q1 = 0.0
Q2 = 1.0
Q3 = 2.0
Quartiles of charges
Q1 = 4742.31
Q2 = 9388.75
Q3 = 16717.01
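For comparison, the standard library offers `statistics.quantiles` (Python 3.8+). The sketch below runs it on made-up observations. Its default `'exclusive'` method interpolates between neighbouring observations, so it should be treated as a cross-check that may not reproduce the values above exactly, not as the formula used in this notebook.

```python
from statistics import quantiles

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up observations
q1, q2, q3 = quantiles(sample, n=4)   # three cut points for four equal groups
print(q1, q2, q3)
```

Passing `method='inclusive'` instead treats the data as the whole population, which shifts the outer quartiles slightly inward.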


#### 3.4 Standard Deviation

The population standard deviation is calculated as:

$$\sigma = \sqrt{\frac{\sum (x_i-\bar{x})^2}{n}}$$

In [66]:
from math import sqrt

# Standard Deviation of age
std_dev_age = round(sqrt(sum([(x-mean_age)**2 for x in age])/len(age)), 2)

# Standard deviation of bmi
std_dev_bmi = round(sqrt(sum([(x-mean_bmi)**2 for x in bmi])/len(bmi)), 2)

# Standard deviation of children
std_dev_children = round(
    sqrt(sum([(x-mean_children)**2 for x in children])/len(children)), 2)

# Standard deviation of charges
std_dev_charges = round(
    sqrt(sum([(x-mean_charges)**2 for x in charges])/len(charges)), 2)

print("Standard deviation of age: {}\nStandard deviation of BMI: {}\nStandard deviation of children: {}\nStandard deviation of charges: {}".format(
    std_dev_age, std_dev_bmi, std_dev_children, std_dev_charges))

Standard deviation of age: 14.04
Standard deviation of BMI: 6.1
Standard deviation of children: 1.21
Standard deviation of charges: 12105.48
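Because the formula divides by $n$, it matches `statistics.pstdev` (population standard deviation) rather than `statistics.stdev`, which divides by $n-1$. A small sketch on made-up data, with the hypothetical helper name `population_std_dev`:

```python
from math import sqrt
from statistics import pstdev


def population_std_dev(values):
    """Square root of the mean squared deviation from the mean."""
    mean = sum(values) / len(values)
    return sqrt(sum((x - mean) ** 2 for x in values) / len(values))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations with mean 5
manual = population_std_dev(sample)
library = pstdev(sample)
print(manual, library)
```

Here the squared deviations sum to 32, so the variance is 4 and the standard deviation is 2.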


#### 3.5 Quartile Deviation

Quartile deviation is calculated as:

$$Q.D. = \frac{Q_3-Q_1}{2}$$

In [67]:
# Quartile deviation of age: (Q3 - Q1) / 2
qd_age = round((age_quartiles[2]-age_quartiles[0])/2, 2)

# Quartile deviation of bmi
qd_bmi = round((bmi_quartiles[2]-bmi_quartiles[0])/2, 2)

# Quartile deviation of children
qd_children = round((children_quartiles[2]-children_quartiles[0])/2, 2)

# Quartile deviation of charges
qd_charges = round((charges_quartiles[2]-charges_quartiles[0])/2, 2)

print("QD of age: {}\nQD of BMI: {}\nQD of children: {}\nQD of charges: {}".format(qd_age, qd_bmi, qd_children, qd_charges))

QD of age: 12.0
QD of BMI: 4.2
QD of children: 1.0
QD of charges: 5987.35
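As a worked instance of the formula $Q.D. = \frac{Q_3-Q_1}{2}$, the sketch below plugs in the age quartiles reported in section 3.3 ($Q_1 = 27.0$, $Q_3 = 51.0$). The function name `quartile_deviation` is an assumption for illustration.

```python
def quartile_deviation(q1, q3):
    """Half the interquartile range: (Q3 - Q1) / 2."""
    return (q3 - q1) / 2


# Age quartiles reported earlier: Q1 = 27.0, Q3 = 51.0
qd = quartile_deviation(27.0, 51.0)
print(qd)  # (51.0 - 27.0) / 2 = 12.0
```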


#### 3.6 Coefficient of variation

Coefficient of variation is calculated as:

$$C.V. = \frac{\sigma}{\bar{x}}\times 100$$

In [68]:
# Variation coefficient of age
cv_age = round(std_dev_age*100/mean_age, 2)

# Variation coefficient of bmi
cv_bmi = round(std_dev_bmi*100/mean_bmi, 2)

# Variation coefficient of children
cv_children = round(std_dev_children*100/mean_children,2)

# Variation coefficient of charges
cv_charges = round(std_dev_charges*100/mean_charges, 2)

print("CV of age: {}\nCV of BMI: {}\nCV of children: {}\nCV of charges: {}".format(
    cv_age, cv_bmi, cv_children, cv_charges))

CV of age: 35.81
CV of BMI: 19.9
CV of children: 111.01
CV of charges: 91.22
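The arithmetic can be checked by hand for age, using the standard deviation (14.04) and mean (39.21) reported above. The helper name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev * 100 / mean


# Age: sigma = 14.04, mean = 39.21 (values reported above)
cv = round(coefficient_of_variation(14.04, 39.21), 2)
print(cv)  # 1404 / 39.21 is about 35.807, which rounds to 35.81
```

Because the coefficient is unit-free, it allows the spread of `charges` (in dollars) to be compared with that of `age` (in years).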


#### 3.7 Coefficient of quartile deviation

Coefficient of quartile deviation is calculated as:

$$C.Q.D. = \frac{Q_3-Q_1}{Q_3+Q_1}\times 100$$

In [71]:
# CQD of age
cqd_age = (age_quartiles[2]-age_quartiles[0]) * \
    100/(age_quartiles[0]+age_quartiles[2])

# CQD of bmi
cqd_bmi = (bmi_quartiles[2]-bmi_quartiles[0]) * \
    100/(bmi_quartiles[0]+bmi_quartiles[2])

# CQD of children
cqd_children = (children_quartiles[2]-children_quartiles[0]) * \
    100/(children_quartiles[0]+children_quartiles[2])

# CQD of charges
cqd_charges = (charges_quartiles[2]-charges_quartiles[0]) * \
    100/(charges_quartiles[0]+charges_quartiles[2])

print("CQD of age: {}\nCQD of BMI: {}\nCQD of children: {}\nCQD of charges: {}".format(
    cqd_age, cqd_bmi, cqd_children, cqd_charges))

CQD of age: 30.76923076923077
CQD of BMI: 13.770491803278691
CQD of children: 100.0
CQD of charges: 55.80186138237371
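A short sketch of the same formula as a reusable function, checked against the age quartiles reported in section 3.3. The name `coefficient_of_quartile_deviation` is an assumption for illustration.

```python
def coefficient_of_quartile_deviation(q1, q3):
    """(Q3 - Q1) / (Q3 + Q1) as a percentage; undefined when Q1 + Q3 is 0."""
    return (q3 - q1) * 100 / (q3 + q1)


# Age quartiles reported earlier: Q1 = 27.0, Q3 = 51.0
cqd = round(coefficient_of_quartile_deviation(27.0, 51.0), 2)
print(cqd)  # 24 * 100 / 78 is about 30.769, which rounds to 30.77
```

When $Q_1$ is 0, as for `children`, the coefficient is always 100 regardless of $Q_3$, so it carries little information for that variable.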
