# U.S. Medical Insurance Costs

In this project, a CSV file of medical insurance costs is investigated using Python fundamentals. The goal is to analyze several variables in the data to learn more about the patients in the file and to gain insight into potential use cases for the dataset. Below, I import the libraries used in the project.

In [64]:
import pandas as pd
import numpy as np

The code below reads the data in as a DataFrame. Exploring it with pandas shows the column names and confirms that there is no missing data: every column has 1338 non-null values, which matches the 1338 rows in the file.

In [54]:
df = pd.read_csv('insurance.csv')
 
print(df.columns)

print(len(df.index))
print(df.count())

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')
1338
age         1338
sex         1338
bmi         1338
children    1338
smoker      1338
region      1338
charges     1338
dtype: int64
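
As a complementary check, missing values can be counted directly instead of being inferred from `count()`. The sketch below uses a small made-up DataFrame (the `sample` name and its values are illustrative assumptions, not rows from `insurance.csv`); on the real data the equivalent call would be `df.isna().sum()`.

```python
import pandas as pd

# Hypothetical three-row sample with one deliberately missing value.
sample = pd.DataFrame({
    "age": [25, 40, 61],
    "smoker": ["yes", "no", None],
    "charges": [1500.0, 7200.5, 13000.25],
})

# isna() flags each missing cell; sum() then counts the flags per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

A column with no missing data shows a count of 0, so a result of all zeros would agree with the `count()` output above.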


Here are some more detailed descriptions of the columns. This information is from contextual knowledge of the dataset.
* Patient Age
* Patient Sex 
* Patient BMI
* Patient Number of Children
* Patient Smoking Status
* Patient U.S. Geographical Region
* Patient Yearly Medical Insurance Cost

### Project Goals

* Find out the average age of the patients in the dataset.
* Analyze where a majority of the individuals are from.
* Look at the different costs between smokers vs. non-smokers.
* Figure out what the average age is for someone who has at least one child in this dataset.
* Figure out the proportion of male vs female patients.
* Determine the max, min, mean, and Interquartile Range of the insurance cost variable.

Below is a function that returns the mean of any numerical variable in the dataset. The function takes a column name from the dataset, as a string, and returns the mean with a contextual string.

In [36]:
def mean_var(var):
    x = round(np.mean(df[var]), 3)
    return "The average of the " + str(var) + " variable is " + str(x)

In [37]:
print(mean_var('age'))

The average of the age variable is 39.207


Below is a function that returns the most common observation (the mode) for any variable in the dataset. The function takes a column name from the dataset, as a string, and returns the mode with a contextual string.

In [42]:
def mode_var(var):
    x = df[var].mode()
    return "The most common observation from the " + var + " variable is " + str(x.iloc[0])

# print(df.children.mode())

Using the function above, we can determine that the most common region among individuals in the dataset is the Southeast.

In [43]:
print(mode_var('region'))

The most common observation from the region variable is southeast


The code below confirms that there are 4 unique regions in the dataset: southwest, southeast, northwest, and northeast. It is important to note that all observations are in the United States.

In [52]:
print(df.region.unique())

['southwest' 'southeast' 'northwest' 'northeast']
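
The mode only names the largest group; it does not say whether that group is more than half of the patients. A minimal sketch of checking counts and shares per region follows. The `regions` Series is a made-up sample used purely for illustration; on the real data the same calls would be made on `df.region`.

```python
import pandas as pd

# Hypothetical sample of region labels (not taken from insurance.csv).
regions = pd.Series(
    ["southeast", "southwest", "southeast", "northwest", "northeast"],
    name="region",
)

# Absolute number of rows per region.
counts = regions.value_counts()

# Share of rows per region, as a fraction of all rows.
shares = regions.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

A share above 0.5 for the top region would indicate a true majority; anything lower means it is only the most common region.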


The function below takes a variable (column) name, as a string, as its argument. It then returns the average insurance charge for each category of that variable. For example, passing the 'smoker' variable to the function reveals that the average insurance charge is 32050.23 for smokers and 8434.27 for non-smokers in the dataset.

In [49]:
def cost_by_var(var):
    # Group by the chosen column and average only the charges column
    costs_grouped = df.groupby(var)['charges'].mean().round(2).reset_index()
    return costs_grouped

print(cost_by_var('smoker'))

  smoker   charges
0     no   8434.27
1    yes  32050.23
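
To put the two averages in perspective, the gap between them can be expressed as a difference and as a ratio. The short sketch below uses only the two values printed above; the variable names are illustrative.

```python
# Average charges copied from the cost_by_var('smoker') output above.
smoker_mean = 32050.23
non_smoker_mean = 8434.27

# Absolute gap between the two group averages.
difference = round(smoker_mean - non_smoker_mean, 2)

# How many times larger the smoker average is.
ratio = round(smoker_mean / non_smoker_mean, 2)

print(difference)  # 23615.96
print(ratio)       # 3.8
```

In this dataset, smokers' average charge is roughly 3.8 times that of non-smokers.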


The function below takes the DataFrame as its argument. It groups patients by the number of children they have and then averages the age within each group. The oldest group is patients with 3 children, while, surprisingly, the youngest group is patients with 5 children.

In [55]:
def avg_age_num_child(data):
    # Average age within each group of patients that share the same number of children
    age_groupedby_child = data.groupby('children')['age'].mean().round(1).reset_index()
    return age_groupedby_child

print(avg_age_num_child(df))

   children   age
0         0  38.4
1         1  39.5
2         2  39.4
3         3  41.6
4         4  39.0
5         5  35.6
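
A group average can be driven by only a handful of rows, so it may help to look at group sizes alongside the means before reading much into the youngest or oldest group. The sketch below is one possible way to do that, using a made-up `sample` DataFrame (an illustrative assumption, not data from `insurance.csv`).

```python
import pandas as pd

# Hypothetical sample: two patients with no children, one with 1, one with 5.
sample = pd.DataFrame({
    "children": [0, 0, 1, 5],
    "age": [30, 40, 50, 20],
})

# Mean age and number of patients for each value of children.
summary = sample.groupby("children")["age"].agg(["mean", "count"])
print(summary)
```

On the real data, the same `agg(["mean", "count"])` call on `df` would show how many patients stand behind each average.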


The function below calculates the average age of patients that have at least one child: 39.78.

In [65]:
def avg_age_child(data):
    child_df = data[data.children != 0]
    avg_age = child_df.age.mean().round(2)
    return "The average age of patients in the dataset with at least one child is " + str(avg_age)

print(avg_age_child(df))

The average age of patients in the dataset with at least one child is 39.78


The commented-out code below was used to check that filtering on `children != 0` builds a DataFrame containing only patients who have children.

In [69]:
# child_df = df[df.children != 0]
# print(child_df.head(10))

The function below shows the percentage of male and female patients in the dataset: 50.52% male and 49.48% female.

In [86]:
def male_female(data):
    # Share of each sex as a percentage of all rows in the DataFrame passed in
    counts = data.sex.value_counts()
    male_pct = ((counts['male'] / len(data)) * 100).round(2)
    female_pct = ((counts['female'] / len(data)) * 100).round(2)
    
    return "The male percentage is " + str(male_pct) + "% and the female percentage is " + str(female_pct) + "%"
    
print(male_female(df))

The male percentage is 50.52% and the female percentage is 49.48%
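
The same proportions can be obtained in one step with `value_counts(normalize=True)`, which avoids looking up each category by name. This is a sketch of an alternative, not the approach used above; the `sex_percentages` helper and the `sample` data are hypothetical.

```python
import pandas as pd

def sex_percentages(data: pd.DataFrame) -> pd.Series:
    """Return the percentage of rows for each value in the sex column."""
    return (data["sex"].value_counts(normalize=True) * 100).round(2)

# Hypothetical four-row sample (not taken from insurance.csv).
sample = pd.DataFrame({"sex": ["male", "female", "male", "male"]})

pct = sex_percentages(sample)
print(pct)
```

Because the result is a Series indexed by category, it would also work unchanged for other categorical columns if the column name were made a parameter.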


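The code below produces summary statistics for the charges column in the dataset. We can see that the mean of the charges is 13,270.42, the max is 63,770.43, and the min is 1,121.87. The 25th and 75th percentiles are 4,740.29 and 16,639.91, which gives an interquartile range of about 11,899.63.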

In [87]:
print(df.charges.describe())

count     1338.000000
mean     13270.422265
std      12110.011237
min       1121.873900
25%       4740.287150
50%       9382.033000
75%      16639.912515
max      63770.428010
Name: charges, dtype: float64
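
The interquartile range can also be computed in code rather than read off the summary table. Below is a minimal sketch: the `iqr` helper is a hypothetical name, and the quartile values are copied from the `describe()` output above.

```python
import pandas as pd

def iqr(series: pd.Series) -> float:
    """Return the interquartile range (75th minus 25th percentile)."""
    return series.quantile(0.75) - series.quantile(0.25)

# Quartiles copied from the describe() output for the charges column.
q1, q3 = 4740.287150, 16639.912515
charges_iqr = round(q3 - q1, 2)
print(charges_iqr)  # 11899.63

# The helper applied to a tiny made-up Series, for illustration only.
print(iqr(pd.Series([1, 2, 3, 4, 5])))  # 2.0
```

On the real data, `iqr(df.charges)` would be expected to match the value derived from the table, up to rounding.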
