# Project Medical Insurance Costs

- Name: Stefanus Bernard Melkisedek
- Codecademy profile: https://www.codecademy.com/profiles/DatenMeister
- Email: stefanussipahutar@gmail.com


## Project Description

Project Medical Insurance Costs is a **portfolio project** from the **Codecademy** **Data Science Career Path**. This project is part of the Data Science Foundation path and uses Python to analyze data from a CSV file. The data describes the medical insurance costs of individual patients.

The data **insurance.csv** contains the following columns:

- age: Patient age in years.
- sex: Patient gender (female or male).
- bmi: Patient body mass index.
- children: Patient number of children covered by health insurance.
- smoker: Patient smoking status (yes or no).
- region: Patient U.S. Geographical Region: northeast, southeast, southwest, or northwest.
- charges: Patient Yearly Medical Insurance Cost.


## Project Goals

The goals of this project are:

- To analyze the data using Python by building out functions or class methods.
- To find out the average age of the patients in the dataset.
- To find out the number of each sex comprised in the dataset.
- To find out the geographical location of the patients in the dataset.
- To find out the average medical insurance costs of the patients in the dataset.
- To create a dictionary that contains all patient information.
- To estimate the medical insurance costs for a new patient with regression analysis.


In [None]:
# import necessary library
import csv

To start, all necessary libraries must be imported. For this project the only library needed is the `csv` library in order to work with the **insurance.csv** data. There are other potential libraries that could help with this project; however, for this analysis, using just the `csv` library will suffice.


The next step is to look through **insurance.csv** in order to get acquainted with the data. The following aspects of the data file will be checked in order to plan out how to import the data into a Python file:

- The names of columns and rows
- Any noticeable missing data
- Types of values (numerical vs. categorical)


In [None]:
# Create variables to hold each feature of insurance.csv
ages = []
sexes = []
bmis = []
num_of_children = []
smoker_status = []
regions = []
insurance_costs = []

<p align="center">
    <img src="./src/assets/dataset.png" alt="dataset" width="750px">
</p>

Based on the preview of the dataset in data-wrangler, there are no signs of missing data. To store this information, seven empty lists are created to hold each individual column of data from **insurance.csv**.


In [None]:
# helper function to load csv data
def load_list_data(lst, csv_file, column_name):
    # open csv file
    with open(csv_file) as csv_info:
        # read the data from the csv file
        csv_dict = csv.DictReader(csv_info)
        # loop through the data in each row of the csv
        for row in csv_dict:
            # add the data from each row to a list
            lst.append(row[column_name])
        # return the list
        return lst

The helper function above was created to keep the loading code short. Without this function, one would have to open **insurance.csv** and rewrite the `for` loop seven times; however, with this function, one can simply call `load_list_data()` once per column as shown below. Each call reopens and rereads the file, so the file is read seven times in total.


In [None]:
# load each column of insurance.csv into its list
load_list_data(ages, "./src/data/insurance.csv", "age")
load_list_data(sexes, "./src/data/insurance.csv", "sex")
load_list_data(bmis, "./src/data/insurance.csv", "bmi")
load_list_data(num_of_children, "./src/data/insurance.csv", "children")
load_list_data(smoker_status, "./src/data/insurance.csv", "smoker")
load_list_data(regions, "./src/data/insurance.csv", "region")
load_list_data(insurance_costs, "./src/data/insurance.csv", "charges")
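
The seven calls above each open and read the file again. As a possible alternative, the sketch below collects every column in a single pass. It is an illustration under stated assumptions rather than the project's method: `load_all_columns()` is a hypothetical name, and `SAMPLE` is a made-up stand-in for **insurance.csv**.

```python
import csv
import io

# Made-up rows in the same column layout as insurance.csv (illustration only)
SAMPLE = """age,sex,bmi,children,smoker,region,charges
30,female,25.0,1,no,northeast,4000.50
45,male,31.2,2,yes,southwest,12000.25
"""


def load_all_columns(csv_file):
    """Read an open CSV file once and return a dict of column name -> list."""
    reader = csv.DictReader(csv_file)
    # one empty list per column found in the header row
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            columns[name].append(row[name])
    return columns


data = load_all_columns(io.StringIO(SAMPLE))
sample_ages = data["age"]
sample_costs = data["charges"]
print(sample_ages)
print(sample_costs)
```

As with `load_list_data()`, every value is still a string at this point, so numeric columns need `int()` or `float()` before any arithmetic.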

Now that all the data from **insurance.csv** is neatly organized into labeled lists, the analysis can be started. This is where one must plan out what to investigate and how to perform the analysis. There are many aspects of the data that could be looked into. The following operations will be implemented:

- find average age of the patients
- return the number of males vs. females counted in the dataset
- find geographical location of the patients
- return the average yearly medical charges of the patients
- create a dictionary that contains all patient information

To perform these inspections, a class called `PatientsInfo` has been built out which contains five methods:

- `analyze_ages()`
- `analyze_sexes()`
- `unique_regions()`
- `average_charges()`
- `create_dictionary()`

The class has been built out below.


In [None]:
class PatientsInfo:
    # init method that takes in each list parameter
    def __init__(
        self,
        patients_ages,
        patients_sexes,
        patients_bmis,
        patients_num_children,
        patients_smoker_statuses,
        patients_regions,
        patients_charges,
    ):
        self.patients_ages = patients_ages
        self.patients_sexes = patients_sexes
        self.patients_bmis = patients_bmis
        self.patients_num_children = patients_num_children
        self.patients_smoker_statuses = patients_smoker_statuses
        self.patients_regions = patients_regions
        self.patients_charges = patients_charges

    # method that calculates the average ages of the patients in insurance.csv
    def analyze_ages(self):
        # initialize total age at zero
        total_age = 0
        # iterate through all ages in the ages list
        for age in self.patients_ages:
            # sum of the total age
            total_age += int(age)
        # return total age divided by the length of the patient list
        return (
            "Average Patient Age: "
            + str(round(total_age / len(self.patients_ages), 2))
            + " years"
        )

    # method that calculates the number of males and females in insurance.csv
    def analyze_sexes(self):
        # initialize number of males and females to zero
        females = 0
        males = 0
        # iterate through each sex in the sexes list
        for sex in self.patients_sexes:
            # if female add to female variable
            if sex == "female":
                females += 1
            # if male add to male variable
            elif sex == "male":
                males += 1
        # print out the number of each
        print("Count for female: ", females)
        print("Count for male: ", males)

    # method to find each unique region patients are from
    def unique_regions(self):
        # initialize empty list
        unique_regions = []
        # iterate through each region in regions list
        for region in self.patients_regions:
            # if the region is not already in the unique regions list
            # then add it to the unique regions list
            if region not in unique_regions:
                unique_regions.append(region)
        # return unique regions list
        return unique_regions

    # method to find average yearly medical charges for patients in insurance.csv
    def average_charges(self):
        # initialize total_charges variable
        total_charges = 0
        # iterate through charges in patients charges list
        # add each charge to total_charge
        for charge in self.patients_charges:
            total_charges += float(charge)
        # return the average charges rounded to the hundredths place
        return (
            "Average Yearly Medical Insurance Charges: "
            + str(round(total_charges / len(self.patients_charges), 2))
            + " dollars."
        )

    # method to create dictionary with all patients information
    def create_dictionary(self):
        self.patients_dictionary = {}
        self.patients_dictionary["age"] = [int(age) for age in self.patients_ages]
        self.patients_dictionary["sex"] = self.patients_sexes
        self.patients_dictionary["bmi"] = self.patients_bmis
        self.patients_dictionary["children"] = self.patients_num_children
        self.patients_dictionary["smoker"] = self.patients_smoker_statuses
        self.patients_dictionary["regions"] = self.patients_regions
        self.patients_dictionary["charges"] = self.patients_charges
        return self.patients_dictionary
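
The results of the methods above can be cross-checked with the standard library. The sketch below is one way to do so, assuming made-up sample lists in place of the real columns; the `sample_*` names are hypothetical, and the values are strings because that is how `csv.DictReader` returns them.

```python
from collections import Counter
from statistics import mean

# Made-up values, stored as strings the way csv.DictReader returns them
sample_ages = ["30", "45", "21"]
sample_sexes = ["female", "male", "female"]
sample_regions = ["northeast", "southwest", "northeast"]
sample_charges = ["4000.50", "12000.25", "2500.00"]

# average age, comparable to analyze_ages()
average_age = round(mean(int(age) for age in sample_ages), 2)

# count per sex, comparable to analyze_sexes()
sex_counts = Counter(sample_sexes)

# unique regions in first-seen order, comparable to unique_regions()
unique_regions = list(dict.fromkeys(sample_regions))

# average yearly charge, comparable to average_charges()
average_charge = round(mean(float(charge) for charge in sample_charges), 2)

print(average_age, dict(sex_counts), unique_regions, average_charge)
```

Unlike the class methods, these expressions return plain numbers and containers rather than formatted strings, which makes them easier to compare or reuse in later calculations.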

---

## Understanding `.gitattributes`

The `.gitattributes` file is a simple text file that gives attributes to pathnames. Each line in the `.gitattributes` file is a pattern followed by an attribute specification. It's placed in the root directory of a repository or in any subdirectory (the latter being useful for applying attributes to only a subset of the repository). The file is committed into the repository and versioned like any other file.

In our case, the `.gitattributes` file contains the following lines:

```properties
*.ipynb filter=nbstripout
*.zpln filter=nbstripout
*.ipynb diff=ipynb
* text=auto eol=lf
```

Here's what each line does:

- `*.ipynb filter=nbstripout` and `*.zpln filter=nbstripout`: These lines tell Git to use the `nbstripout` filter for files ending in `.ipynb` and `.zpln`. `nbstripout` is a tool that strips output from Jupyter notebooks. This is useful because it allows you to commit only the code changes in your notebooks, not the output. This can make your commits cleaner and easier to understand.

- `*.ipynb diff=ipynb`: This line tells Git to use the diff driver named `ipynb` when comparing changes in `.ipynb` files. The driver itself must be defined in the Git configuration; once it is, diffs of Jupyter notebooks become more meaningful because the driver understands the structure of notebook files.

- `* text=auto eol=lf`: This line tells Git to detect text files automatically and normalize them to LF (Unix-style) line endings. This can help prevent issues with line endings differing between Windows and Unix systems, making your code more consistent and easier to work with across different platforms.
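
To make the idea of stripping notebook output concrete, the sketch below clears the outputs and execution counts of code cells in a notebook represented as a Python dictionary. It is a conceptual illustration only, not `nbstripout`'s actual implementation: `strip_outputs()` is a hypothetical function, and the small `notebook` dictionary is made up for the example.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    # deep copy through a JSON round-trip so the input stays untouched
    stripped = json.loads(json.dumps(notebook))
    for cell in stripped.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped


# Made-up minimal notebook with one markdown cell and one executed code cell
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 3,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=2))
```

Only the source of each cell survives, which is why commits of stripped notebooks show code changes without output noise.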

In summary, the `.gitattributes` file in this repository is used to improve the handling of Jupyter notebooks and line endings in Git.
