# **U.S. Medical Insurance Cost**

## <u>Part 1: Data Reading and Preparation</u>

The `csv` module is imported to read the raw data from the `insurance.csv` file with `csv.DictReader()`. Every value is initially a string, so each field is converted to the type shown below before the processed row is appended to the `dataset` list:

<center>

|<center>Variable</center> | <center>Data Type</center> | <center>Variable Type</center> |
|-|-|-|
| `age` | int | Scale |
| `sex` | int (0 for 'male' and 1 for 'female') | Nominal |
| `bmi` | float | Ratio |
| `children` | int | Ordinal |
| `smoker` | int (0 for 'no' and 1 for 'yes') | Nominal |
| `region` | int (1 for 'northeast', 2 for 'northwest', <br>3 for 'southeast', 4 for 'southwest') | Nominal |
| `charges` | float | Ratio |

</center>

The regions were identified by first extracting the `region` column of `insurance.csv` into a `regions.txt` file. The following command was then run in a terminal to list the unique region names:
```
sort regions.txt | uniq
```
This command listed the unique region names, to which the numeric codes in the table above were then assigned.
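
The unique region names can also be collected without leaving Python. The sketch below is an alternative to the shell command, not the method used in this notebook; the helper name `unique_values` and the inline sample rows are hypothetical and stand in for the real file.

```python
import csv
import io


def unique_values(csv_file, column):
    """Return the sorted unique values found in one column of a CSV file object."""
    return sorted({row[column] for row in csv.DictReader(csv_file)})


# Small inline sample standing in for insurance.csv (illustrative rows only)
sample = io.StringIO(
    "age,region\n"
    "19,southwest\n"
    "18,southeast\n"
    "28,southeast\n"
    "33,northwest\n"
)

sample_regions = unique_values(sample, "region")
print(sample_regions)
```

With the real data, the same helper could be called on the file object returned by `open("insurance.csv", newline="")`.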

In [20]:
import csv

dataset = []  # list of processed rows (one dict per sample)

with open("insurance.csv", "r") as file:
    for row in csv.DictReader(file):
        data = {}
        for header, value in row.items():
            if header == "age":
                data[header] = int(value)
            elif header == "sex":
                data[header] = 0 if value == "male" else 1
            elif header == "bmi":
                data[header] = float(value)
            elif header == "children":
                data[header] = int(value)
            elif header == "smoker":
                data[header] = 0 if value == "no" else 1
            elif header == "region":
                if value == "northeast":
                    data[header] = 1
                elif value == "northwest":
                    data[header] = 2
                elif value == "southeast":
                    data[header] = 3
                elif value == "southwest":
                    data[header] = 4
            elif header == "charges":
                data[header] = float(value)
        dataset.append(data)

print("Dataset contains", len(dataset), "samples")

Dataset contains 1338 samples
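
The branch-per-column conversion above can also be expressed as a lookup table of converters. This is a minimal sketch of that alternative, assuming the column names and numeric codes from the table in this part; `SEX_CODES`, `SMOKER_CODES`, `REGION_CODES`, `CONVERTERS`, and `convert_row` are hypothetical names that do not appear elsewhere in the notebook, and the example row is illustrative.

```python
# Numeric codes for the nominal columns, taken from the table in Part 1
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"no": 0, "yes": 1}
REGION_CODES = {"northeast": 1, "northwest": 2, "southeast": 3, "southwest": 4}

# One converter per column; each takes the raw string and returns the typed value
CONVERTERS = {
    "age": int,
    "sex": SEX_CODES.__getitem__,
    "bmi": float,
    "children": int,
    "smoker": SMOKER_CODES.__getitem__,
    "region": REGION_CODES.__getitem__,
    "charges": float,
}


def convert_row(row):
    """Convert one csv.DictReader row (all strings) into typed values."""
    return {header: CONVERTERS[header](value) for header, value in row.items()}


# Illustrative row in the same shape that csv.DictReader yields
example_row = {
    "age": "19", "sex": "female", "bmi": "27.9", "children": "0",
    "smoker": "yes", "region": "southwest", "charges": "16884.924",
}
converted = convert_row(example_row)
print(converted)
```

One behavioural difference from the cell above: an unexpected label such as a misspelled region raises `KeyError` here, whereas the `if`/`else` version silently codes any non-'male' or non-'no' value as 1.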


## <u>Part 2: Summary Statistics</u>

Data from the `dataset` list is split into one list per variable (`age`, `sex`, `bmi`, `children`, `smoker`, `region`, and `charges`) so that frequency distributions and summary statistics can be computed from them.

### 1. Separating the ordinal, scale, and ratio variables

Each row in the dataset is iterated over, and its values are appended to the respective lists.

In [21]:
age = []
sex = []
bmi = []
children = []
smoker = []
region = []
charges = []

for data in dataset:

    age.append(data["age"])
    sex.append(data["sex"])
    bmi.append(data["bmi"])
    children.append(data["children"])
    smoker.append(data["smoker"])
    region.append(data["region"])
    charges.append(data["charges"])

### 2. Frequency Distribution

Frequency distributions are used to answer the following questions:

1. How many people belong to certain age groups?
2. How many people have a given number of children?
3. How many males and females are there?
4. How many smokers and non-smokers are there?
5. What is the regional distribution of people?

The following frequency distributions are computed to get an overview of the dataset:

*Grouped frequency distribution*
- age
- children

*Ungrouped frequency distribution*
- sex
- smoker
- region

#### **2.1 Grouped frequency distribution**

The class interval for grouped frequency is calculated as:

$$\text{Class Interval} = \frac{\text{Upper Bound} - \text{Lower Bound}}{\text{Number of Classes}}$$

where 'Upper Bound' is the highest value in the data and 'Lower Bound' is the lowest.
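
As a worked instance of the formula, the sketch below computes the class interval and counts how many values fall into each class. The function names and the sample ages are assumptions for illustration. Unlike the notebook's own cells, which round the interval and use inclusive integer classes, this version keeps the unrounded interval and uses half-open classes, so it always produces exactly the requested number of classes. It assumes the values are not all identical (otherwise the interval would be zero).

```python
def class_interval(lower_bound, upper_bound, num_classes):
    """Apply the class-interval formula: (upper - lower) / number of classes."""
    return (upper_bound - lower_bound) / num_classes


def grouped_frequency(values, num_classes):
    """Count values in num_classes equal-width classes spanning min..max."""
    lower, upper = min(values), max(values)
    width = class_interval(lower, upper, num_classes)
    counts = [0] * num_classes
    for v in values:
        # The maximum value is placed in the last class instead of opening a new one
        index = min(int((v - lower) / width), num_classes - 1)
        counts[index] += 1
    return width, counts


# Illustrative ages only, not the notebook's data
sample_ages = [18, 19, 23, 30, 41, 41, 52, 64]
width, counts = grouped_frequency(sample_ages, 4)
print(width, counts)
```

For the bounds reported in this notebook, `class_interval(18, 64, 10)` evaluates to 4.6.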

The following code block computes the frequency distribution of *age*:

In [22]:
# Import for showing frequency distribution in a table
from IPython.display import display, Markdown

# Frequency distribution of age

output_text = "<center>\n\n<u>Frequency Distribution of Age</u>\n| <center>Age</center> | <center>Frequency</center> | <center>% of Respondents</center> |\n|-|-|-|\n"

highest_age, lowest_age = max(age), min(age)

print("Highest age is {} and lowest age is {}".format(highest_age, lowest_age))

# Bounds are 18 and 64; requesting 10 classes gives a rounded class interval of 5.
# Each printed class is inclusive at both ends (6 ages), so 8 classes are produced.
num_classes = 10
class_interval = round((highest_age-lowest_age)/num_classes)

while lowest_age < highest_age:
    counter = 0
    for a in age:
        if lowest_age <= a <= lowest_age+class_interval:
            counter += 1
    output_text += "| <center>{}-{}</center> | <center>{}</center> | <center>{}</center> |\n".format(
        lowest_age, lowest_age+class_interval, counter, round(counter*100/len(dataset), 2))
    # Add 1 to the class interval to move to the start of the next class
    lowest_age += class_interval+1

output_text += "\n</center>"

display(Markdown(output_text))

Highest age is 64 and lowest age is 18


<center>

<u>Frequency Distribution of Age</u>
| <center>Age</center> | <center>Frequency</center> | <center>% of Respondents</center> |
|-|-|-|
| <center>18-23</center> | <center>250</center> | <center>18.68</center> |
| <center>24-29</center> | <center>167</center> | <center>12.48</center> |
| <center>30-35</center> | <center>157</center> | <center>11.73</center> |
| <center>36-41</center> | <center>154</center> | <center>11.51</center> |
| <center>42-47</center> | <center>168</center> | <center>12.56</center> |
| <center>48-53</center> | <center>172</center> | <center>12.86</center> |
| <center>54-59</center> | <center>156</center> | <center>11.66</center> |
| <center>60-65</center> | <center>114</center> | <center>8.52</center> |

</center>

The following code block computes the frequency distribution of *children*:

In [23]:
# Frequency distribution of children

output_text = "<center>\n\n<u>Frequency Distribution of Children</u>\n| <center>Children</center> | <center>Frequency</center> | <center>% of Respondents</center> |\n|-|-|-|\n"

highest_children, lowest_children = max(children), min(children)

print("Highest number of children is {} and lowest number of children is {}".format(
    highest_children, lowest_children))

# Bounds are 0 and 5; requesting 5 classes gives a class interval of 1.
# Each printed class is inclusive at both ends (2 values), so 3 classes are produced.
num_classes = 5
class_interval = round((highest_children-lowest_children)/num_classes)

while lowest_children < highest_children:
    counter = 0
    for c in children:
        if lowest_children <= c <= lowest_children+class_interval:
            counter += 1
    output_text += "| <center>{}-{}</center> | <center>{}</center> | <center>{}</center> |\n".format(
        lowest_children, lowest_children+class_interval, counter, round(counter*100/len(dataset), 2))
    # Add 1 to the class interval to move to the start of the next class
    lowest_children += class_interval + 1

output_text += "\n</center>"

display(Markdown(output_text))

Highest number of children is 5 and lowest number of children is 0


<center>

<u>Frequency Distribution of Children</u>
| <center>Children</center> | <center>Frequency</center> | <center>% of Respondents</center> |
|-|-|-|
| <center>0-1</center> | <center>898</center> | <center>67.12</center> |
| <center>2-3</center> | <center>397</center> | <center>29.67</center> |
| <center>4-5</center> | <center>43</center> | <center>3.21</center> |

</center>

#### **2.2 Ungrouped frequency distribution**

The following code block computes the frequency distributions of sex, smoker, and region:

In [24]:
print("There are {} ({}%) male and {} ({}%) female among respondents.".format(
    sex.count(0), round(sex.count(0)*100/len(dataset), 2), sex.count(1), round(sex.count(1)*100/len(dataset), 2)))

print("{} ({}%) respondents are smokers and {} ({}%) are not.".format(
    smoker.count(1), round(smoker.count(1)*100/len(dataset), 2), smoker.count(0), round(smoker.count(0)*100/len(dataset), 2)))
print("Among the respondents, {} ({}%) number of people are from the northeast, {} ({}%) from northwest, {} ({}%) from southeast, {} ({}%) are from southwest region.".format(
    region.count(1), round(region.count(1)*100/len(dataset), 2), region.count(2), round(region.count(2)*100/len(dataset), 2), region.count(3), round(region.count(3)*100/len(dataset), 2), region.count(4), round(region.count(4)*100/len(dataset), 2)))

There are 676 (50.52%) male and 662 (49.48%) female among respondents.
274 (20.48%) respondents are smokers and 1064 (79.52%) are not.
Among the respondents, 324 (24.22%) number of people are from the northeast, 325 (24.29%) from northwest, 364 (27.2%) from southeast, 325 (24.29%) are from southwest region.
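
The repeated `list.count()` calls above can also be replaced by a single pass with `collections.Counter`. The following is a minimal sketch of that alternative; `frequency_table` is a hypothetical helper and the sample codes are illustrative, not the notebook's data.

```python
from collections import Counter


def frequency_table(values):
    """Return {value: (count, percentage)} for a list of category codes."""
    total = len(values)
    return {
        value: (count, round(count * 100 / total, 2))
        for value, count in sorted(Counter(values).items())
    }


# Illustrative codes only (0 = 'no', 1 = 'yes')
sample_smoker = [0, 0, 1, 0, 1, 0, 0, 0]
table = frequency_table(sample_smoker)
print(table)
```

Applied to the `sex`, `smoker`, or `region` lists, the same helper would return the count and percentage for every code present, without listing the codes by hand.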


### 3. Central Tendency and Measure of Dispersion

The central tendency and dispersion of `age`, `bmi`, `children`, and `charges` are calculated in the following section. The measures used are:

*Central tendency*

- Mean
- Median
- Mode

*Dispersion*

- Standard deviation
- Interquartile range
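
All five measures are available in Python's standard `statistics` module. The sketch below shows one possible way to compute them and is not necessarily how the notebook does it; `summarize` is a hypothetical helper and the sample values are illustrative. Two assumptions are worth noting: `statistics.stdev` is the sample standard deviation (`statistics.pstdev` gives the population version), and `statistics.quantiles` uses its default "exclusive" method, so the quartiles may differ slightly from those of other tools.

```python
import statistics


def summarize(values):
    """Return central tendency and dispersion measures for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": q3 - q1,                     # interquartile range
    }


# Illustrative values only, not the notebook's data
sample_children = [0, 1, 1, 2, 3, 0, 1, 5]
summary = summarize(sample_children)
print(summary)
```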