# U.S. Medical Insurance Costs

In this project, I analyze data from a **csv** file called **insurance.csv** using Python fundamentals. The file contains medical insurance costs for patients, along with variables such as age, sex, and BMI that affect the cost. Analyzing these variables and learning more about the patients can reveal a variety of potential use cases for the data.

To read the data in **insurance.csv**, I first import Python's built-in `csv` library.

In [1]:
# Import csv library
import csv

Before writing any Python, some initial exploration helps me get acquainted with the data. To do this, I simply look through the csv file itself and take note of the following:
* the number of records, and the number of variables in each record (identifying which are explanatory and which is the response).
* whether there are any missing values.
* the type of each variable (numerical, categorical, etc.).
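
The same checks can also be scripted instead of done by eye. The following is a minimal sketch, assuming the file has a header row; the function name `summarize_rows` is my own, and the two-row sample is a stand-in for **insurance.csv** (its second row has a blank `bmi` that I made up to show how a missing value would be counted).

```python
import csv
import io

def summarize_rows(csv_file):
    """Return the record count, the column names, and the number of empty cells per column."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    record_count = 0
    for row in reader:
        record_count += 1
        for name in columns:
            value = row[name]
            # a short row yields None, a blank cell yields an empty string
            if value is None or value.strip() == "":
                missing[name] += 1
    return record_count, columns, missing

# Illustrative in-memory sample standing in for insurance.csv (blank bmi is invented)
sample = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16884.924\n"
    "18,male,,1,no,southeast,1725.5523\n"
)
record_count, columns, missing = summarize_rows(sample)
print(record_count)
print(columns)
print(missing)
```

To run it on the real file, the `io.StringIO` sample would be replaced with `open("insurance.csv", newline="")`.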

Investigating these aspects helps me think critically and plan ahead before importing the data from **insurance.csv** into a separate Python list for each variable.

With that in mind, I create seven empty lists in Python, one for each of the seven variables identified in the csv file. I will also create a list of serial numbers for the patients, which will be helpful when building a dictionary of the people later on.

In [2]:
# Create empty lists for the different variables, which will soon be populated using the info from insurance.csv
age = []
sex = []
bmi = []
num_of_children = []
smoker = []
region = []
insurance_cost = []

Having looked through the **insurance.csv** file, let me go into a little more depth about what I discovered. There were records for a total of 1338 individuals.
As mentioned earlier, there were a total of 7 variables, namely:
* *Age* - the age of the individual. A numerical variable with ages ranging from 18 to 64 years.
* *Sex* - a binary variable taking values of either "male" or "female" which states the sex of the person.
* *BMI* - the Body Mass Index of a person. A numerical variable with values ranging from 15.96 to 53.13.
* *Children* - the number of children an individual has. A discrete count taking whole-number values from 0 to 5, which can also be treated as a category.
* *Smoker* - a binary variable taking values of either "yes" or "no" which states whether a person is a smoker or not.
* *Region* - a categorical variable with 4 possible classes of "northeast", "northwest", "southeast", or "southwest" which states the region the person lives in.
* *Charges* - a numerical variable containing the yearly medical insurance cost of a person. The costs range from 1121.8739 dollars to 63770.42801 dollars per year.

*Charges* is the dependent variable, while all the other variables are explanatory. I also noted that there was no missing data, so no data imputation was required. The data from each column is ready to be imported into its respective empty list.
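
Ranges and category lists like the ones above can be checked in code once a column's values are in a list. This is a small sketch rather than part of the analysis itself: `numeric_range` and `categories` are hypothetical helper names, and the sample lists hold only the first five records of the dataset, so their ranges are narrower than the full-file ranges quoted above.

```python
def numeric_range(values):
    """Return (minimum, maximum) of a list of numeric strings."""
    numbers = [float(v) for v in values]
    return min(numbers), max(numbers)

def categories(values):
    """Return the distinct values of a categorical column, sorted alphabetically."""
    return sorted(set(values))

# First five records only, for illustration
sample_age = ["19", "18", "28", "33", "32"]
sample_region = ["southwest", "southeast", "southeast", "northwest", "northwest"]

print(numeric_range(sample_age))   # (18.0, 33.0)
print(categories(sample_region))   # ['northwest', 'southeast', 'southwest']
```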

Before importing, I create an additional list of serial numbers so that each individual can be uniquely numbered. The list is simply called `serial_number`.

In [3]:
serial_number = list(range(1, 1339))

Now we are ready to import data from **insurance.csv** into the other 7 empty lists.

In [4]:
with open("insurance.csv", newline="") as insurance_csv:   # opens csv file
    insurance_reader = csv.DictReader(insurance_csv)       # reads each row as a dictionary keyed by column name
    for row in insurance_reader:                           # loops through the rows of the csv in a single pass
        # add the value for each variable to its respective list
        age.append(row["age"])
        sex.append(row["sex"])
        bmi.append(row["bmi"])
        num_of_children.append(row["children"])
        smoker.append(row["smoker"])
        region.append(row["region"])
        insurance_cost.append(row["charges"])
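
Note that `csv.DictReader` returns every value as a string, which is why the methods later in this notebook call `int()` and `float()` before doing arithmetic. As an alternative, the conversion could be done once per row at load time. The sketch below assumes the column names used above; `to_typed_record` is a hypothetical helper of my own, and the row is the first record of the dataset written out by hand.

```python
def to_typed_record(row):
    """Convert one csv row (all strings) into a dictionary with numeric types."""
    return {
        "age": int(row["age"]),
        "sex": row["sex"],
        "bmi": float(row["bmi"]),
        "children": int(row["children"]),
        "smoker": row["smoker"],
        "region": row["region"],
        "charges": float(row["charges"]),
    }

# First record of the dataset, in the shape csv.DictReader would return it
raw_row = {"age": "19", "sex": "female", "bmi": "27.9", "children": "0",
           "smoker": "yes", "region": "southwest", "charges": "16884.924"}
typed_row = to_typed_record(raw_row)
print(typed_row)
```
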

The data is now loaded into neatly labelled lists and is ready for analysis purposes. Now I must plan out the scope of my analysis. While there are numerous possible areas of analysis, in this project I have decided to focus on the following areas:
* Creating a dictionary containing the info of all the people in the dataset
* Identifying the average age of a person
* Identifying the average yearly insurance cost of a person
* Identifying if there is a balanced representation of males vs. females in the dataset

To carry out these 4 analyses, we first create a class called `Person_info` with 4 methods, each of which achieves one of the goals stated above. The methods are:
* `compile_dictionary()`
* `analyze_ages()`
* `analyze_insurance_cost()`
* `analyze_gender()`

The class and its methods are defined below:

In [5]:
# Start the new class containing all the different methods for each of our 4 analyses
class Person_info:
    # specify the init method with all the list parameters. These will be used in the upcoming methods
    def __init__(self, serial_number, age, sex, bmi, num_of_children, smoker, region, insurance_cost):
        self.serial_number = serial_number
        self.age = age
        self.sex = sex
        self.bmi = bmi
        self.num_of_children = num_of_children
        self.smoker = smoker
        self.region = region
        self.insurance_cost = insurance_cost
    
    # method to compile all the individuals' information in a single dictionary
    def compile_dictionary(self):
        # create an empty dictionary
        self.personal_info = {}
        # iterate through the different lists for the data regarding the person based on their index number, denoted by "i" 
        for i in range(len(self.serial_number)):
            # dictionary keys are the unique serial numbers  
            # dictionary values are dictionaries themselves containing a key for each piece of data
            self.personal_info[self.serial_number[i]] = {"Record No": self.serial_number[i], "Age": self.age[i], 
                                                         "Gender": self.sex[i], "BMI": self.bmi[i], 
                                                         "No. of children": self.num_of_children[i], "Smoker": self.smoker[i], 
                                                         "Region": self.region[i], 
                                                         "Yearly Insurance Cost": self.insurance_cost[i]}
        # return the compiled, comprehensive dictionary
        return self.personal_info
    
    # method to identify the average ages of all the people in insurance.csv
    def analyze_ages(self):
        # initialize the total age at 0
        total_age = 0
        # iterate through the list of ages
        for age in self.age:
            # add each age to the running total (values were read from the csv as strings)
            total_age += int(age)
        # calculate the average once, after the loop: total age divided by the number of ages
        average_age = total_age / len(self.age)
        # return the average age of a person
        return "Average age: " + str(average_age) + " years"
    
    # method to identify the average insurance costs of all the people in insurance.csv
    def analyze_insurance_cost(self):
        # initialize the total cost at 0
        total_cost = 0
        # iterate through the list of costs
        for cost in self.insurance_cost:
            # add each cost to the running total (values were read from the csv as strings)
            total_cost += float(cost)
        # calculate the average once, after the loop:
        # average cost is the total cost divided by the number of costs in the list
        average_cost = total_cost / len(self.insurance_cost)
        # return the average insurance cost of a person
        return "Average yearly insurance cost: " + str(average_cost) + " dollars"
    
    # method to calculate the number of males and females in insurance.csv
    def analyze_gender(self):
        # initialize the number of males and females to 0
        males = 0
        females = 0
        # iterate through the list containing the sex of each person
        for gender in self.sex:
            # if the gender is female, add to the number of females, incrementally
            if gender == "female":
                females += 1
            # otherwise, if it is male, add to the number of males, incrementally
            elif gender == "male":
                males += 1
        # print out the number of people in each gender
        print("Number of females: ", females)
        print("Number of males: ", males)
    

Once all the methods have been defined within the class, it is time to run each one in turn, obtain the results of each analysis, and interpret the findings. For this purpose, I create an instance of the class called `person_info`. The arguments passed within the parentheses are the 8 lists we now have.

In [6]:
person_info = Person_info(serial_number, age, sex, bmi, num_of_children, smoker, region, insurance_cost)

In [7]:
person_info.compile_dictionary()

{1: {'Record No': 1,
  'Age': '19',
  'Gender': 'female',
  'BMI': '27.9',
  'No. of children': '0',
  'Smoker': 'yes',
  'Region': 'southwest',
  'Yearly Insurance Cost': '16884.924'},
 2: {'Record No': 2,
  'Age': '18',
  'Gender': 'male',
  'BMI': '33.77',
  'No. of children': '1',
  'Smoker': 'no',
  'Region': 'southeast',
  'Yearly Insurance Cost': '1725.5523'},
 3: {'Record No': 3,
  'Age': '28',
  'Gender': 'male',
  'BMI': '33',
  'No. of children': '3',
  'Smoker': 'no',
  'Region': 'southeast',
  'Yearly Insurance Cost': '4449.462'},
 4: {'Record No': 4,
  'Age': '33',
  'Gender': 'male',
  'BMI': '22.705',
  'No. of children': '0',
  'Smoker': 'no',
  'Region': 'northwest',
  'Yearly Insurance Cost': '21984.47061'},
 5: {'Record No': 5,
  'Age': '32',
  'Gender': 'male',
  'BMI': '28.88',
  'No. of children': '0',
  'Smoker': 'no',
  'Region': 'northwest',
  'Yearly Insurance Cost': '3866.8552'},
 6: {'Record No': 6,
  'Age': '31',
  'Gender': 'female',
  'BMI': '25.74',
  '

The `compile_dictionary()` method is run and the results are shown above. All the data is now stored compactly in a single dictionary keyed by serial number, with every value kept as a string exactly as it was read from the csv. This dictionary will serve as a valuable tool should I wish to investigate the other attributes in **insurance.csv** more deeply later on.
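
As an illustration of how such a dictionary could be used later, here is a minimal sketch of a lookup and a simple filter. The helper `records_where` is a hypothetical name of my own, and `sample_info` contains only the first three records (with a subset of their keys) in the same shape as the output above.

```python
def records_where(personal_info, key, value):
    """Return the serial numbers of all records whose `key` equals `value`."""
    return [serial for serial, record in personal_info.items()
            if record.get(key) == value]

# First three records, with a subset of keys, in the shape produced by compile_dictionary()
sample_info = {
    1: {"Record No": 1, "Age": "19", "Gender": "female", "Smoker": "yes", "Region": "southwest"},
    2: {"Record No": 2, "Age": "18", "Gender": "male", "Smoker": "no", "Region": "southeast"},
    3: {"Record No": 3, "Age": "28", "Gender": "male", "Smoker": "no", "Region": "southeast"},
}

print(sample_info[2]["Age"])                               # 18
print(records_where(sample_info, "Smoker", "yes"))         # [1]
print(records_where(sample_info, "Region", "southeast"))   # [2, 3]
```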

In [8]:
person_info.analyze_ages()

'Average age: 39.20702541106129 years'

The `analyze_ages()` method is run and the results show that the average age of a health insurance policy holder in the dataset is around 39 years. This is important to check because we want to know whether the data in **insurance.csv** is representative of a broader population. If we decide to use the dataset to make inferences about other populations, we must make sure that the data is abundant and broad enough for such use cases.

Further analysis of the range and standard deviation of the patient ages would be needed to judge whether **insurance.csv** looks like a random sample of individuals.
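
A first step toward that further analysis could look like the sketch below, which uses Python's built-in `statistics` module. The function name `describe_ages` is my own, and the sample holds only the ages of the first five records, so the printed numbers do not describe the full dataset.

```python
import statistics

def describe_ages(ages):
    """Return the minimum, maximum, mean, and sample standard deviation of a list of ages."""
    numbers = [int(a) for a in ages]   # ages were read from the csv as strings
    return {
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "stdev": statistics.stdev(numbers),   # sample standard deviation (n - 1 in the denominator)
    }

# Ages of the first five records only, for illustration
sample_ages = ["19", "18", "28", "33", "32"]
age_summary = describe_ages(sample_ages)
print(age_summary)
```

With the full data, the call would be `describe_ages(age)` using the `age` list loaded earlier.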

In [9]:
person_info.analyze_insurance_cost()

'Average yearly insurance cost: 13270.422265141257 dollars'

The `analyze_insurance_cost()` method is run and the results above show that the average yearly insurance cost of a policy holder in the dataset is around $13,270. This is a valuable piece of information as it can be used by a variety of stakeholders. For instance, it could feed into an evaluation of the profitability of the health insurance industry in the US. Alternatively, additional analysis of the relationships between costs and gender, costs and region, or costs and number of children would generate more specific findings with a clearly targeted scope.
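
One simple way to start on such a comparison is to average the costs within each category of another variable. The following is a sketch under my own assumptions: `average_by_group` is a hypothetical helper, and the sample uses only the first five records, so the resulting averages are illustrative and not findings about the dataset.

```python
def average_by_group(groups, costs):
    """Return the average cost for each distinct value in `groups`."""
    totals = {}
    counts = {}
    for group, cost in zip(groups, costs):
        totals[group] = totals.get(group, 0.0) + float(cost)   # costs are strings from the csv
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# First five records only, for illustration
sample_smoker = ["yes", "no", "no", "no", "no"]
sample_cost = ["16884.924", "1725.5523", "4449.462", "21984.47061", "3866.8552"]

cost_by_smoker = average_by_group(sample_smoker, sample_cost)
print(cost_by_smoker)
```

The same function works for any pairing of lists, for example `average_by_group(region, insurance_cost)` on the full data.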

In [10]:
person_info.analyze_gender()

Number of females:  662
Number of males:  676


The `analyze_gender()` method is run and the results show that, of the sample data, there were a total of **662 female policy holders** and a total of **676 male policy holders** of health insurance. This is a near-even split (about 49.5% female and 50.5% male), which hints at a good representation of the overall population as well as sound data collection. Such a balance in the classes is valuable should the dataset be used to build a classification model in the future.
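
The balance can also be expressed as proportions. The sketch below uses `collections.Counter` from the standard library; `category_shares` is a hypothetical helper name, and the sample list is rebuilt from the two counts reported above instead of being read from the file.

```python
from collections import Counter

def category_shares(values):
    """Return each distinct value's share of the total, as a fraction between 0 and 1."""
    counts = Counter(values)
    total = sum(counts.values())   # assumes a non-empty list
    return {category: count / total for category, count in counts.items()}

# Rebuilt from the counts printed by analyze_gender()
sample_sex = ["female"] * 662 + ["male"] * 676
shares = category_shares(sample_sex)
for category, share in shares.items():
    print(category, round(share * 100, 1), "%")
```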