# U.S. Medical Insurance Costs

The dataset is publicly available at https://www.kaggle.com/datasets/mirichoi0218/insurance

This is a Python analysis project by @tylershienlim, completed through Codecademy's Data Scientist: Analytics learning path.

It covers data analysis and cleaning, combined with a simple model to predict insurance costs for patients.


In [15]:
# dependencies (the csv module is not needed, pandas reads the file directly)
import pandas as pd

In [16]:
#read file
df = pd.read_csv('insurance.csv')

### Initial observations of the DataFrame

- no missing values in the dataset
- 4 different regions
- youngest person is age 18, oldest is age 64
- 3 columns of categorical values (sex, smoker, region)
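
The observations above can be checked with a few pandas calls. This is a minimal sketch: the five rows printed by `df.head()` in this notebook are typed in by hand as a stand-in (the name `sample` is only for illustration) so that it runs without `insurance.csv`. On the full file the same calls would be applied to `df`.

```python
import pandas as pd

# Stand-in data: the five rows printed by df.head() in this notebook
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'region': ['southwest', 'southeast', 'southeast', 'northwest', 'northwest'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Total number of missing cells across all columns
missing_total = int(sample.isna().sum().sum())

# Youngest and oldest age in the data
age_min, age_max = sample['age'].min(), sample['age'].max()

# Non-numeric columns are the categorical ones
categorical_cols = list(sample.select_dtypes(exclude='number').columns)

# Number of distinct values in each categorical column
distinct_counts = sample[categorical_cols].nunique()

print(missing_total, age_min, age_max, categorical_cols)
print(distinct_counts)
```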

In [17]:
df.head()

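|   | age | sex | bmi | children | smoker | region | charges |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |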


In [18]:
df.describe()

|   | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [19]:
df.region.unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [20]:
df.children.unique()

array([0, 1, 3, 2, 5, 4], dtype=int64)

### Further observations on the dataset

- there are 676 males and 662 females in this dataset (fairly balanced)
- male average insurance cost is 13956.75
- female average insurance cost is 12569.58
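
Counts and averages per group like the ones listed above can also be produced in one step with `groupby`, instead of filtering each group separately. A minimal sketch, assuming the five rows from `df.head()` as stand-in data (the names `sample` and `summary` are only for illustration), so the numbers it prints are not the full-dataset figures:

```python
import pandas as pd

# Stand-in data: sex and charges from the five rows printed by df.head()
sample = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'male', 'male'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Row count and mean charge for each sex in a single call
summary = sample.groupby('sex')['charges'].agg(['count', 'mean'])

print(summary)
```

On the full file, replacing `sample` with `df` should give the counts and means quoted in the list.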

In [21]:
male = df[df['sex'] == 'male']
female = df[df['sex'] == 'female']

In [22]:
male.describe()

|   | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 676.0 | 676.0 | 676.0 | 676.0 |
| mean | 38.91716 | 30.943129 | 1.115385 | 13956.751178 |
| std | 14.050141 | 6.140435 | 1.218986 | 12971.025915 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 26.0 | 26.41 | 0.0 | 4619.134 |
| 50% | 39.0 | 30.6875 | 1.0 | 9369.61575 |
| 75% | 51.0 | 34.9925 | 2.0 | 18989.59025 |
| max | 64.0 | 53.13 | 5.0 | 62592.87309 |


In [23]:
female.describe()

|   | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 662.0 | 662.0 | 662.0 | 662.0 |
| mean | 39.503021 | 30.377749 | 1.074018 | 12569.578844 |
| std | 14.054223 | 6.046023 | 1.192115 | 11128.703801 |
| min | 18.0 | 16.815 | 0.0 | 1607.5101 |
| 25% | 27.0 | 26.125 | 0.0 | 4885.1587 |
| 50% | 40.0 | 30.1075 | 1.0 | 9412.9625 |
| 75% | 51.75 | 34.31375 | 2.0 | 14454.691825 |
| max | 64.0 | 48.07 | 5.0 | 63770.42801 |


### Smokers vs non-smokers observations
- 274 smokers, 1064 non-smokers
- smokers' average insurance cost is 32050.23
- non-smokers' average insurance cost is 8434.27
- smokers pay about 3.8 times as much on average, which suggests a strong association between smoking and higher insurance cost (not yet tested formally)
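
The size of the gap can be worked out directly from the group counts and means reported by `describe()` in this notebook. The sketch below is plain arithmetic on those figures (variable names are only for illustration). It also checks that the count-weighted average of the two groups reproduces the overall mean charge.

```python
# Figures copied from the describe() outputs in this notebook
n_smoker, mean_smoker = 274, 32050.231832
n_nonsmoker, mean_nonsmoker = 1064, 8434.268298
overall_mean = 13270.422265

# How many times higher the smokers' average charge is
ratio = mean_smoker / mean_nonsmoker

# Absolute gap between the two group averages
gap = mean_smoker - mean_nonsmoker

# Count-weighted average of the two groups; should match the overall mean
weighted_mean = (n_smoker * mean_smoker + n_nonsmoker * mean_nonsmoker) / (
    n_smoker + n_nonsmoker
)

print(round(ratio, 2))          # 3.8
print(round(gap, 2))            # 23615.96
print(round(weighted_mean, 2))  # 13270.42
```

A ratio of means is only descriptive. It does not show that smoking alone drives the difference.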

In [24]:
smoker = df[df['smoker'] == 'yes']
nonsmoker = df[df['smoker'] == 'no']

In [26]:
smoker.describe()

|   | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 274.0 | 274.0 | 274.0 | 274.0 |
| mean | 38.514599 | 30.708449 | 1.113139 | 32050.231832 |
| std | 13.923186 | 6.318644 | 1.157066 | 11541.547176 |
| min | 18.0 | 17.195 | 0.0 | 12829.4551 |
| 25% | 27.0 | 26.08375 | 0.0 | 20826.244213 |
| 50% | 38.0 | 30.4475 | 1.0 | 34456.34845 |
| 75% | 49.0 | 35.2 | 2.0 | 41019.207275 |
| max | 64.0 | 52.58 | 5.0 | 63770.42801 |


In [27]:
nonsmoker.describe()

|   | age | bmi | children | charges |
| --- | --- | --- | --- | --- |
| count | 1064.0 | 1064.0 | 1064.0 | 1064.0 |
| mean | 39.385338 | 30.651795 | 1.090226 | 8434.268298 |
| std | 14.08341 | 6.043111 | 1.218136 | 5993.781819 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 26.75 | 26.315 | 0.0 | 3986.4387 |
| 50% | 40.0 | 30.3525 | 1.0 | 7345.4053 |
| 75% | 52.0 | 34.43 | 2.0 | 11362.88705 |
| max | 64.0 | 53.13 | 5.0 | 36910.60803 |


Replace all categorical values with numerical codes to make further analysis and the predictive model easier later on.

In [28]:
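# Encode the two binary categories as 0/1
df['sex'] = df['sex'].map({'female': 0, 'male': 1})
df['smoker'] = df['smoker'].map({'no': 0, 'yes': 1})

# Encode region as an integer label (the numbers carry no meaningful order)
df['region'] = df['region'].map({
    'southwest': 0,
    'southeast': 1,
    'northwest': 2,
    'northeast': 3
})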
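
Integer codes work well for the two binary columns, but for `region` they imply an ordering (0 < 1 < 2 < 3) that the regions do not have. One common alternative, shown here only as a sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`. The frame `regions` below is a hypothetical stand-in holding the four values returned by `df.region.unique()`.

```python
import pandas as pd

# Stand-in data: the four region labels found in the dataset
regions = pd.DataFrame({
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# One 0/1 indicator column per region instead of a single integer code
encoded = pd.get_dummies(regions, columns=['region'], dtype=int)

print(encoded)
```

Each row then has exactly one indicator set to 1, so no region is treated as "larger" than another.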

# Exploratory Analysis
- Relationships between variables
- Correlations between variables
- Etc.

Use data visualization to explore the correlations between variables.
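
Before plotting anything, a correlation matrix gives a quick numeric view of the same relationships. A minimal sketch, assuming the five rows from `df.head()` as stand-in data (the names `sample`, `corr` and `charges_corr` are only for illustration). With so few rows the values are not meaningful, and on the full file the call would be `df.corr()` after the encoding step.

```python
import pandas as pd

# Stand-in data: numeric columns plus smoker from the five df.head() rows
sample = pd.DataFrame({
    'age': [19, 18, 28, 33, 32],
    'bmi': [27.9, 33.77, 33.0, 22.705, 28.88],
    'children': [0, 1, 3, 0, 0],
    'smoker': ['yes', 'no', 'no', 'no', 'no'],
    'charges': [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Encode smoker as 0/1 so it can enter the correlation matrix
sample['smoker'] = sample['smoker'].map({'no': 0, 'yes': 1})

# Pairwise Pearson correlations between all columns
corr = sample.corr()

# Correlation of each variable with charges, strongest (by magnitude) first
charges_corr = corr['charges'].drop('charges').sort_values(key=abs, ascending=False)

print(charges_corr)
```

The resulting `corr` table is also the usual input for a heatmap in a plotting library.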

# Prediction on insurance cost
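
The notebook does not show its prediction model, so the following is only one possible starting point: an ordinary least squares fit of `charges` on `age` and `smoker`, done with NumPy. The choice of model, the two features, and the helper `predict` are all assumptions made for illustration. The data are the five rows from `df.head()`, far too few for a meaningful fit.

```python
import numpy as np

# Stand-in data: age, smoker (0/1) and charges from the five df.head() rows
age = np.array([19, 18, 28, 33, 32], dtype=float)
smoker = np.array([1, 0, 0, 0, 0], dtype=float)
charges = np.array([16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552])

# Design matrix: intercept column, then one column per feature
X = np.column_stack([np.ones_like(age), age, smoker])

# Ordinary least squares: coefficients that minimise the squared error
coef, *_ = np.linalg.lstsq(X, charges, rcond=None)

# Fitted charges and residuals on the same rows
fitted = X @ coef
residuals = charges - fitted


def predict(age_value, is_smoker):
    """Predict charges for one person (hypothetical helper)."""
    return coef[0] + coef[1] * age_value + coef[2] * float(is_smoker)


print(coef)
print(predict(40, True))
```

On the full dataset the same design matrix would be built from the encoded `df` columns, ideally with a held-out test split to judge the predictions.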