# **U.S. Medical Insurance Cost**

## <u>Part 1: Data Reading and Preparation</u>

The `csv` module is used to read the raw data from `insurance.csv` with `csv.DictReader()`. Every value is initially a string, so each one is cast to an appropriate type before the processed row is appended to the `dataset` list:

<center>

| Variable | <center>Data Type</center> | Variable Type |
| -------- | --------- | ------------- |
| `age` | int | Scale |
| `sex` | int flag (0 for 'male', 1 for 'female') | Nominal |
| `bmi` | float | Ratio |
| `children` | int | Ordinal |
| `smoker` | int flag (0 for 'no', 1 for 'yes') | Nominal |
| `region` | int (1 for 'northeast', 2 for 'northwest', 3 for 'southeast', 4 for 'southwest') | Nominal |
| `charges` | float | Ratio |

</center>

To identify the distinct regions, the `region` column of `insurance.csv` was first extracted into a `regions.txt` file. The following command was then run in a terminal to list the unique region names:
```
sort regions.txt | uniq
```
The numeric codes were assigned based on the unique region names returned by this command.
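
The same check can also be done in Python, without the intermediate `regions.txt` file. The following is a minimal sketch: the helper name `unique_values` and the inline sample rows are hypothetical, and it assumes the CSV has a `region` header as described above.

```python
import csv
import io


def unique_values(file, column):
    """Return the sorted distinct values found in one CSV column."""
    return sorted({row[column] for row in csv.DictReader(file)})


# Made-up sample rows standing in for insurance.csv (illustration only).
sample = io.StringIO(
    "age,region\n"
    "19,southwest\n"
    "18,southeast\n"
    "28,southeast\n"
    "33,northwest\n"
)
regions = unique_values(sample, "region")
print(regions)
```

For the real data, an open file object for `insurance.csv` would be passed in place of `sample`.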

In [7]:
import csv

dataset = []  # list of processed rows (one dict per sample)

with open("insurance.csv", "r") as file:
    # Each row is a dict mapping a column header to its string value
    for row in csv.DictReader(file):
        data = {}
        for header, value in row.items():
            if header == "age":
                data[header] = int(value)
            elif header == "sex":
                data[header] = 0 if value == "male" else 1
            elif header == "bmi":
                data[header] = float(value)
            elif header == "children":
                data[header] = int(value)
            elif header == "smoker":
                data[header] = 0 if value == "no" else 1
            elif header == "region":
                if value == "northeast":
                    data[header] = 1
                elif value == "northwest":
                    data[header] = 2
                elif value == "southeast":
                    data[header] = 3
                elif value == "southwest":
                    data[header] = 4
            elif header == "charges":
                data[header] = float(value)

        dataset.append(data)

print("Dataset contains",len(dataset),"samples")

Dataset contains 1338 samples
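
The chain of `elif` branches above can also be written in a table-driven form, which additionally makes an unexpected category (for example a misspelled region) fail loudly instead of being silently skipped. This is only a sketch of an alternative, not the code used in this notebook; the names `CONVERTERS` and `convert_row` and the example row are hypothetical.

```python
# Lookup tables for the categorical columns, using the codes from the table above.
SEX_CODES = {"male": 0, "female": 1}
SMOKER_CODES = {"no": 0, "yes": 1}
REGION_CODES = {"northeast": 1, "northwest": 2, "southeast": 3, "southwest": 4}

# One converter per column; indexing a dict raises KeyError on unknown values.
CONVERTERS = {
    "age": int,
    "sex": SEX_CODES.__getitem__,
    "bmi": float,
    "children": int,
    "smoker": SMOKER_CODES.__getitem__,
    "region": REGION_CODES.__getitem__,
    "charges": float,
}


def convert_row(row):
    """Convert one csv.DictReader row (all strings) to typed values."""
    return {header: CONVERTERS[header](value) for header, value in row.items()}


# Made-up row for illustration only.
example = convert_row({
    "age": "40", "sex": "female", "bmi": "25.5", "children": "2",
    "smoker": "no", "region": "southeast", "charges": "1234.5",
})
print(example)
```

With this layout, supporting a new column or category means editing one dictionary rather than adding another branch.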


## <u>Part 2: Summary Statistics</u>

The data in the `dataset` list are further split into three individual lists named `age`, `bmi`, and `charges` so that summary statistics can be computed from them.

### Separating the scale and ratio variables

Each row in the dataset is iterated over, and its values are appended to the respective lists. The lists are then sorted in place.

In [6]:
age = []
bmi = []
charges = []

for data in dataset:

    age.append(data["age"])
    bmi.append(data["bmi"])
    charges.append(data["charges"])

age.sort()
bmi.sort()
charges.sort()

### Obtaining median of `age`, `bmi`, and `charges`