# Sam Jeffery

# Date Started: 1/21/2025

# Healthcare Data Analysis and Visualization Pipeline



Objective

This dataset contains 1,338 rows of insurance data. Each row gives the insurance charges for one insured person, along with six attributes: age, sex, BMI, number of children, smoker status, and region. The attributes are a mix of numeric and categorical variables. There are no missing or undefined values in the dataset.

https://www.kaggle.com/datasets/teertha/ushealthinsurancedataset

My last project used only Python. This time I will use:

    Python to load the data.
    MySQL to run queries.
    Tableau for visualizations.


Deliverables:

    Python scripts for data cleaning and preprocessing.
    SQL database with healthcare data and optimized schema.
    SQL queries showcasing insightful analyses.
    Tableau dashboard visualizing healthcare data insights.


In [3]:
# Basic imports, then load the dataset.

import numpy as np
import pandas as pd
import json
import datetime

df = pd.read_csv(r"insurance.csv")

df.shape

(1338, 7)

In [5]:
df.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female    27.9         0     yes  southwest    16884.924
1   18    male   33.77         1      no  southeast    1725.5523
2   28    male    33.0         3      no  southeast     4449.462
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male   28.88         0      no  northwest    3866.8552


In [17]:
df.info()

df.notna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   age       1338 non-null   int64   
 1   sex       1338 non-null   category
 2   bmi       1338 non-null   float64 
 3   children  1338 non-null   int64   
 4   smoker    1338 non-null   category
 5   region    1338 non-null   category
 6   charges   1338 non-null   float64 
dtypes: category(3), float64(2), int64(2)
memory usage: 46.3 KB


age         1338
sex         1338
bmi         1338
children    1338
smoker      1338
region      1338
charges     1338
dtype: int64

A quick look at the data confirms what the data dictionary says: there are no null values in any row or column, so nothing really needs to be cleaned.
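The "nothing to clean" conclusion can be backed up with a few extra counts beyond nulls. The sketch below is only an illustration and was not part of the original run: it rebuilds the five rows shown by df.head() as a stand-in for the full DataFrame, and the helper name check_clean is hypothetical.

```python
import pandas as pd

# Stand-in for the full dataset: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})


def check_clean(frame):
    """Hypothetical helper: return basic data-quality counts for a DataFrame."""
    numeric_cols = ["age", "bmi", "children", "charges"]
    return {
        # Total number of missing cells across all columns
        "missing_values": int(frame.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Numeric cells below zero, which would not make sense here
        "negative_numbers": int((frame[numeric_cols] < 0).sum().sum()),
    }


report = check_clean(sample)
print(report)
```

On the real data the call would be check_clean(df). Any non-zero count would mean a small cleaning step is needed after all.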

I will still be using Python, but I will connect to a SQL database and run all of my queries from VS Code to make the results easier to look at!

In [None]:
import mysql.connector
# Login details are kept in a separate config module
import login_files as lf

# Connect to the local MySQL server
mydb = mysql.connector.connect(
    host='localhost',
    user='root',
    password=lf.password
)

mycursor = mydb.cursor()

# Create the database; IF NOT EXISTS makes this cell safe to re-run
mycursor.execute("CREATE DATABASE IF NOT EXISTS claims")



Now that the DB is created, we need to think about the Data Model.

We have a total of six attributes, plus the charges. I am going to split them up into three tables. The first table will be Patient.

Patient Data will contain information on a patient, such as:

patient_id -- the primary key for a patient
age
sex
region_id

Then, we will store Health Details. This will be called HealthDetails.

health_id -- the primary key for the health records
patient_id -- the foreign key for patients
bmi
children
smoker
charges

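Finally, we will have the region table. Storing each region name once here, instead of repeating it on every patient row, reduces data redundancy.

region_id -- the primary key for a region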
region_name
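The three-table model above can be tried out before touching the MySQL server. The sketch below is an illustration under stated assumptions, not the schema this notebook ends up with: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, the table names (regions, patients, health_details) and column types are assumptions, and the data is just the five rows from df.head().

```python
import sqlite3

# In-memory SQLite database: a stand-in for the MySQL server, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

# regions is created first because patients references it.
conn.executescript("""
CREATE TABLE regions (
    region_id   INTEGER PRIMARY KEY,
    region_name TEXT NOT NULL UNIQUE
);
CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY,
    age        INTEGER,
    sex        TEXT,
    region_id  INTEGER REFERENCES regions(region_id)
);
CREATE TABLE health_details (
    health_id  INTEGER PRIMARY KEY,
    patient_id INTEGER REFERENCES patients(patient_id),
    bmi        REAL,
    children   INTEGER,
    smoker     TEXT,
    charges    REAL
);
""")

# The five rows shown by df.head(), in the original column order.
rows = [
    (19, "female", 27.9, 0, "yes", "southwest", 16884.924),
    (18, "male", 33.77, 1, "no", "southeast", 1725.5523),
    (28, "male", 33.0, 3, "no", "southeast", 4449.462),
    (33, "male", 22.705, 0, "no", "northwest", 21984.47061),
    (32, "male", 28.88, 0, "no", "northwest", 3866.8552),
]

for age, sex, bmi, children, smoker, region, charges in rows:
    # Store each region name once and reuse its id afterwards.
    conn.execute("INSERT OR IGNORE INTO regions (region_name) VALUES (?)", (region,))
    region_id = conn.execute(
        "SELECT region_id FROM regions WHERE region_name = ?", (region,)
    ).fetchone()[0]
    cur = conn.execute(
        "INSERT INTO patients (age, sex, region_id) VALUES (?, ?, ?)",
        (age, sex, region_id),
    )
    # cur.lastrowid is the patient_id that was just generated.
    conn.execute(
        "INSERT INTO health_details (patient_id, bmi, children, smoker, charges) "
        "VALUES (?, ?, ?, ?, ?)",
        (cur.lastrowid, bmi, children, smoker, charges),
    )
conn.commit()

# Join the tables back together to confirm the split lost nothing.
rebuilt = conn.execute("""
    SELECT p.age, p.sex, h.bmi, h.children, h.smoker, r.region_name, h.charges
    FROM patients AS p
    JOIN health_details AS h ON h.patient_id = p.patient_id
    JOIN regions AS r ON r.region_id = p.region_id
    ORDER BY p.patient_id
""").fetchall()
region_count = conn.execute("SELECT COUNT(*) FROM regions").fetchone()[0]
print(region_count, "regions for", len(rebuilt), "patients")
```

In MySQL the same idea would use INT AUTO_INCREMENT PRIMARY KEY and VARCHAR columns, and mysql.connector expects %s placeholders rather than ?.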

In [None]:
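# Creating the patients table.
# The regions table must already exist, because patients references it.

# Select the database first; the connection above was opened without one
mycursor.execute("USE claims")

mycursor.execute("CREATE TABLE patients (patient_id INT AUTO_INCREMENT PRIMARY KEY, \
                 age INT, \
                 sex VARCHAR(10), \
                 region_id INT, \
                 FOREIGN KEY (region_id) REFERENCES regions(region_id))")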
