# **U.S. Medical Insurance Cost**

## <u>Part 1: Data Reading and Preparation</u>

The `csv` module is imported to read the raw data from the `insurance.csv` file with `csv.DictReader()`. Every value is initially a string, so each row is converted to the types listed below before it is appended to the `dataset` list:

<center>

|<center>Variable</center> | <center>Data Type</center> | <center>Variable Type</center> |
| -------- | --------- | ------------- |
| `age` | int | Scale |
| `sex` | int (0 for 'male' and 1 for 'female') | Nominal |
| `bmi` | float | Ratio |
| `children` | int | Ordinal |
| `smoker` | int (0 for 'no' and 1 for 'yes') | Nominal |
| `region` | int (1 for 'northeast', 2 for 'northwest', <br>3 for 'southeast', 4 for 'southwest') | Nominal |
| `charges` | float | Ratio |

</center>

The regions were identified by first extracting the `region` column of `insurance.csv` into a `regions.txt` file. The following command was then run in a terminal to list the unique region names:
```
sort regions.txt | uniq
```
The numeric codes were then assigned based on the unique region names printed by this command.
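
The same check can be done without leaving Python. The following is a minimal sketch, not part of the original workflow: the helper name `unique_values` and the in-memory sample are assumptions made for illustration, and the real file would be passed in via `open("insurance.csv")` instead.

```python
import csv
import io


def unique_values(csv_file, column):
    """Return the sorted distinct values found in one column of a CSV file object."""
    reader = csv.DictReader(csv_file)
    return sorted({row[column] for row in reader})


# Illustrative in-memory sample standing in for insurance.csv.
sample = io.StringIO(
    "age,region\n"
    "19,southwest\n"
    "18,southeast\n"
    "28,southeast\n"
    "33,northwest\n"
)
print(unique_values(sample, "region"))
```

Because `csv.DictReader` consumes the header row, the column name itself never appears among the values.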

In [93]:
import csv

dataset = []  # list of processed rows, one dict per sample

with open("insurance.csv", "r") as file:
    for row in csv.DictReader(file):
        data = {}
        for header, value in row.items():
            if header == "age":
                data[header] = int(value)
            elif header == "sex":
                data[header] = 0 if value == "male" else 1
            elif header == "bmi":
                data[header] = float(value)
            elif header == "children":
                data[header] = int(value)
            elif header == "smoker":
                data[header] = 0 if value == "no" else 1
            elif header == "region":
                if value == "northeast":
                    data[header] = 1
                elif value == "northwest":
                    data[header] = 2
                elif value == "southeast":
                    data[header] = 3
                elif value == "southwest":
                    data[header] = 4
            elif header == "charges":
                data[header] = float(value)
        dataset.append(data)

print("Dataset contains", len(dataset), "samples")

Dataset contains 1338 samples
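
The type-casting table can also be written as a mapping from column name to converter, which keeps the coding scheme in one place. This is a hedged sketch rather than the notebook's method: `SEX_CODES`, `SMOKER_CODES`, `REGION_CODES`, `CONVERTERS`, and `parse_row` are hypothetical names, and the sample row is illustrative.

```python
import csv
import io

# Hypothetical lookup tables mirroring the coding scheme in the table above.
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"no": 0, "yes": 1}
REGION_CODES = {"northeast": 1, "northwest": 2, "southeast": 3, "southwest": 4}

# One converter per column; dict.__getitem__ raises KeyError on an unexpected label.
CONVERTERS = {
    "age": int,
    "sex": SEX_CODES.__getitem__,
    "bmi": float,
    "children": int,
    "smoker": SMOKER_CODES.__getitem__,
    "region": REGION_CODES.__getitem__,
    "charges": float,
}


def parse_row(row):
    """Convert one csv.DictReader row of strings into typed values."""
    return {header: CONVERTERS[header](value) for header, value in row.items()}


# Illustrative one-row sample in the same column layout as insurance.csv.
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
)
rows = [parse_row(row) for row in csv.DictReader(sample)]
print(rows[0])
```

One behavioural difference is worth noting: the `if`/`elif` chain codes any `sex` value other than `"male"` as 1, whereas this sketch raises `KeyError` on a label it does not recognise.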


## <u>Part 2: Summary Statistics</u>

Data from the `dataset` list is further split into one list per variable. The `age`, `bmi`, `children`, and `charges` lists are used for summary statistics, while `sex`, `smoker`, and `regions` hold the nominal variables.

### 1. Separating the ordinal, scale, and ratio variables

Each row in the dataset is visited once, and each of its values is appended to the corresponding list.

In [94]:
age = []
sex = []
bmi = []
children = []
smoker = []
regions = []
charges = []

for data in dataset:

    age.append(data["age"])
    sex.append(data["sex"])
    bmi.append(data["bmi"])
    children.append(data["children"])
    smoker.append(data["smoker"])
    regions.append(data["region"])
    charges.append(data["charges"])
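
The same separation can be expressed with a small helper and a list comprehension. This is only a sketch: `column` and `toy_dataset` are hypothetical names, and the toy values are made up to show the shape of the data.

```python
# Toy rows in the same shape as the entries of `dataset` (values are illustrative).
toy_dataset = [
    {"age": 19, "children": 0, "charges": 100.0},
    {"age": 33, "children": 2, "charges": 250.5},
]


def column(rows, key):
    """Collect the values stored under `key` from every row, in row order."""
    return [row[key] for row in rows]


toy_age = column(toy_dataset, "age")
toy_children = column(toy_dataset, "children")
print(toy_age, toy_children)
```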

### 2. Obtaining the frequency distribution of data

Frequency distributions help answer the following questions:

1. How many people belong to each age group?
2. How many people have a given number of children?
3. How many males and females are there?
4. How many smokers and non-smokers are there?
5. How are people distributed across the regions?

The following frequency distributions are computed to get an overview of the dataset:

*Grouped frequency distribution*
- age
- children

*Ungrouped frequency distribution*
- sex
- smoker
- region

#### **2.1 Grouped frequency distribution**

The class interval for grouped frequency is calculated as:

$${Class~Interval} = \frac{Upper~Bound-Lower~Bound}{Number~of~Classes}$$

where 'Upper Bound' is the highest value in the range and 'Lower Bound' is the lowest.
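
As a worked example of the formula, the sketch below plugs in the bounds and class counts this notebook uses for *age* and *children*. The function name `class_interval_for` is an assumption made for illustration, and the rounding mirrors the `round()` call in the notebook's code.

```python
def class_interval_for(upper_bound, lower_bound, num_classes):
    """Class interval as defined above, rounded to the nearest integer."""
    return round((upper_bound - lower_bound) / num_classes)


# Ages range from 18 to 64 with 10 classes: (64 - 18) / 10 = 4.6, which rounds to 5.
age_interval = class_interval_for(64, 18, 10)
# Children range from 0 to 5 with 5 classes: (5 - 0) / 5 = 1.0, which rounds to 1.
children_interval = class_interval_for(5, 0, 5)
print(age_interval, children_interval)
```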

The following code block computes the frequency distribution of *age*:

In [95]:
# Frequency distribution of age

highest_age, lowest_age = max(age), min(age)

print("Highest age:", highest_age, "\nLowest age:", lowest_age)

# num_classes only sets the class interval: round((64 - 18) / 10) = 5.
# Each printed class spans class_interval + 1 ages, so 8 classes are printed.
num_classes = 10
class_interval = round((highest_age-lowest_age)/num_classes)
print("Class interval:", class_interval)

while lowest_age < highest_age:
    counter = 0
    for a in age:
        if lowest_age <= a <= lowest_age+class_interval:
            counter += 1
    print("There are {} people whose age is between {} and {}.".format(
        counter, lowest_age, lowest_age+class_interval)
    )
    # Add 1 to the class interval so the next class starts just above the current upper bound
    lowest_age += class_interval+1

Highest age: 64 
Lowest age: 18
Class interval: 5
There are 250 people whose age is between 18 and 23.
There are 167 people whose age is between 24 and 29.
There are 157 people whose age is between 30 and 35.
There are 154 people whose age is between 36 and 41.
There are 168 people whose age is between 42 and 47.
There are 172 people whose age is between 48 and 53.
There are 156 people whose age is between 54 and 59.
There are 114 people whose age is between 60 and 65.
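
Each class in the output above covers six integer ages (for example, 18 through 23). The grouping can be wrapped in a reusable function that takes that count directly. This is a sketch under stated assumptions: `grouped_frequency` and `toy_ages` are hypothetical names, the values are assumed to be integers, and the toy list is made up.

```python
def grouped_frequency(values, width):
    """Count integer values in consecutive classes [start, start + width - 1]."""
    start, highest = min(values), max(values)
    table = []
    while start <= highest:
        end = start + width - 1
        count = sum(1 for v in values if start <= v <= end)
        table.append((start, end, count))
        start = end + 1  # next class begins right after the current upper bound
    return table


# Illustrative toy data, not taken from insurance.csv.
toy_ages = [18, 19, 23, 24, 30, 35, 64]
for start, end, count in grouped_frequency(toy_ages, 6):
    print(f"{start}-{end}: {count}")
```

Because the classes never share a boundary value, the counts always add up to the number of values.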


The following code block computes the frequency distribution of *children*:

In [96]:
# Frequency distribution of children

highest_children, lowest_children = max(children), min(children)

print("Highest number of children:", highest_children,
      "\nLowest number of children:", lowest_children)

# num_classes only sets the class interval: round((5 - 0) / 5) = 1
num_classes = 5
class_interval = round((highest_children-lowest_children)/num_classes)
print("Class interval:", class_interval)

while lowest_children < highest_children:
    counter = 0
    for c in children:
        if lowest_children <= c <= lowest_children+class_interval:
            counter += 1
    print("{} people have at least {} and at most {} children.".format(
        counter, lowest_children, lowest_children+class_interval)
    )
    # Add 1 to the class interval so that consecutive classes do not overlap
    lowest_children += class_interval+1


Highest number of children: 5 
Lowest number of children: 0
Class interval: 1
898 people have at least 0 and at most 1 children.
397 people have at least 2 and at most 3 children.
43 people have at least 4 and at most 5 children.


#### **2.2 Ungrouped frequency distribution**

The following code block computes the frequency distributions of sex, smoker, and region:

In [97]:
print("There are {} ({}%) male and {} ({}%) female among respondents.".format(
    sex.count(0), round(sex.count(0)*100/len(dataset), 2), sex.count(1), round(sex.count(1)*100/len(dataset), 2)))

print("{} ({}%) respondents are smokers and {} ({}%) are not.".format(
    smoker.count(1), round(smoker.count(1)*100/len(dataset), 2), smoker.count(0), round(smoker.count(0)*100/len(dataset), 2)))
print("Among the respondents, {} ({}%) number of people are from the northeast, {} ({}%) from northwest, {} ({}%) from southeast, {} ({}%) are from southwest region.".format(
    regions.count(1), round(regions.count(1)*100/len(dataset), 2), regions.count(2), round(regions.count(2)*100/len(dataset), 2), regions.count(3), round(regions.count(3)*100/len(dataset), 2), regions.count(4), round(regions.count(4)*100/len(dataset), 2)))

There are 676 (50.52%) male and 662 (49.48%) female among respondents.
274 (20.48%) respondents are smokers and 1064 (79.52%) are not.
Among the respondents, 324 (24.22%) number of people are from the northeast, 325 (24.29%) from northwest, 364 (27.2%) from southeast, 325 (24.29%) are from southwest region.
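
The repeated `count()` and percentage expressions can be folded into one helper built on `collections.Counter`. This is a minimal sketch, not the notebook's code: `frequency_table` and `toy_smoker` are hypothetical names, and the toy list is illustrative.

```python
from collections import Counter


def frequency_table(values, labels):
    """Return {label: (count, percent)} for coded categorical values.

    `labels` maps each numeric code to a readable name.
    """
    counts = Counter(values)
    total = len(values)
    return {
        label: (counts[code], round(counts[code] * 100 / total, 2))
        for code, label in labels.items()
    }


# Illustrative toy data using the same 0/1 coding as the `smoker` list.
toy_smoker = [0, 0, 1, 0]
table = frequency_table(toy_smoker, {0: "non-smoker", 1: "smoker"})
print(table)
```

A `Counter` returns 0 for codes that never occur, so a category with no respondents still appears in the table.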
