Predicting the cost of treatment and insurance using Machine Learning

Thomas-George-T/Forecasting-Healthcare-Costs

Forecasting Healthcare Costs

Predicting the cost of treatment and insurance using regression by leveraging personal health data.

Motivation

This is my first notebook. I am trying to perform Exploratory Data Analysis (EDA) and linear regression on personal health data. Any feedback and constructive criticism are appreciated. The personal health data is hosted on Kaggle: https://www.kaggle.com/mirichoi0218/insurance

Table of contents

  1. Components
  2. Model Implementation
    1. Import Data
    2. Data Preprocessing
    3. Exploratory Data Analysis (EDA)
    4. Model Building
    5. Model Evaluation
  3. License

Components

  • Kaggle Dataset
  • Jupyter notebook
  • Python: numpy, pandas, matplotlib and scikit-learn (sklearn) packages

Model Implementation

1. Import Data

Once we import the data using read_csv, we use head() to preview the first few rows. We try to identify which columns are numerical and which are categorical.

sample-data
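As a minimal sketch of this step, the snippet below loads a few rows and inspects them. The rows are invented for illustration and only follow the column layout of the Kaggle file; the file name insurance.csv in the comment is an assumption.

```python
import io

import pandas as pd

# Invented rows that mimic the column layout of the Kaggle insurance data.
csv_text = """age,sex,bmi,children,smoker,region,charges
21,female,26.5,0,yes,southwest,17000.50
40,male,31.2,2,no,southeast,6500.75
35,female,29.0,1,no,northwest,5200.10
52,male,34.8,3,yes,northeast,42000.00
27,male,23.4,0,no,southeast,2900.25
"""

# With the real file this would be something like pd.read_csv("insurance.csv").
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # preview the first five rows
print(data.dtypes)   # text columns are categorical, int/float columns are numerical
```

Here sex, smoker and region come out as text columns, while age, bmi, children and charges are numeric.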

2. Data Preprocessing

We proceed to collect basic descriptive stats using describe(). We try to understand what the data looks like and what it is trying to tell us.

data.describe()

describe

We then split the data into numerical and categorical data.

preprocessing
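One possible way to do this split is with select_dtypes, as sketched below. The notebook may do it differently (for example by listing column names by hand), and the rows here are invented.

```python
import pandas as pd

# Invented rows in the same column layout as the insurance data.
data = pd.DataFrame({
    "age": [21, 40, 35],
    "sex": ["female", "male", "female"],
    "bmi": [26.5, 31.2, 29.0],
    "children": [0, 2, 1],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [17000.50, 6500.75, 5200.10],
})

# Numeric columns (int and float) versus everything else (text columns).
numerical = data.select_dtypes(include="number")
categorical = data.select_dtypes(exclude="number")

print(list(numerical.columns))
print(list(categorical.columns))
```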

We proceed to convert the categorical data into numerical data using one-hot encoding.

One-hot encoding is a technique where we replace a categorical column with binary indicator columns. The column is split into as many columns as it has distinct values, and each new column holds a '1' where the row has that value and a '0' otherwise.

We apply one-hot encoding using get_dummies().

one hot encoding
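To make the encoding concrete, here is a small sketch on invented rows. The dtype=int argument is only there so the output shows 1/0 instead of True/False; whether the notebook passes it is an assumption.

```python
import pandas as pd

# Invented rows with two categorical columns.
data = pd.DataFrame({
    "age": [21, 40, 35],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
    "charges": [17000.50, 6500.75, 5200.10],
})

# Each categorical column becomes one indicator column per distinct value,
# named <column>_<value>, for example smoker_yes or region_southeast.
encoded = pd.get_dummies(data, columns=["smoker", "region"], dtype=int)

print(encoded)
```

The first row is a smoker, so smoker_yes is 1 and smoker_no is 0 for it. Passing drop_first=True would drop one indicator per column, which is sometimes preferred for linear models because the full set of indicators is redundant.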

3. Exploratory Data Analysis (EDA)

We then try to find the correlation between features.

eda

We use a heat map to explore the trends.

heatmap

From this we can make the following observations:

  1. Strong correlation between charges and smoker_yes.
  2. Weak correlation between charges and age.
  3. Weak correlation between charges and bmi.
  4. Weak correlation between bmi and region_southeast.

Since the correlation coefficients for the weak correlations are less than 0.5, we treat them as insignificant and drop them.
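The correlation step and the 0.5 cutoff can be sketched roughly as follows. The data is synthetic and generated so that smoking dominates the charges; the coefficients used to generate it are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data: charges depend strongly on smoking, mildly on age, not on bmi.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
df["charges"] = 24000 * df["smoker_yes"] + 250 * df["age"] + rng.normal(0, 3000, size=n)

# Pairwise Pearson correlations; this matrix is what a heat map displays.
corr = df.corr()

# Keep only features whose absolute correlation with charges reaches 0.5.
charges_corr = corr["charges"].drop("charges")
strong = charges_corr[charges_corr.abs() >= 0.5]

print(charges_corr.round(2))
print(list(strong.index))
```

On this synthetic data only smoker_yes passes the cutoff, which mirrors the observation above.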

Next, we explore the trend between charges and smoker_yes, and use graphs to find the range of treatment charges across patients.

range of charges

From the graph, we can see that the minimum charge is around 1122, where a high number of patients are concentrated, and the maximum is 63770.
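The same range can also be read off numerically. The sketch below uses a handful of invented rows and summarizes charges overall and per smoker group; it complements the graph rather than reproducing it.

```python
import pandas as pd

# Invented rows: smoker_yes is the one-hot indicator, charges the target.
df = pd.DataFrame({
    "smoker_yes": [0, 0, 0, 1, 1],
    "charges": [1200.0, 4500.0, 9800.0, 21000.0, 47000.0],
})

# Overall range of charges.
overall = df["charges"].agg(["min", "max"])

# Range and median of charges for non-smokers (0) and smokers (1).
summary = df.groupby("smoker_yes")["charges"].agg(["min", "median", "max"])

print(overall)
print(summary)
```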

4. Model Building

Model building

We then begin to predict the patient charges using the other features. We build a linear regression model using the sklearn.linear_model package. We split the dataset into a training set and a test set, holding out 30% of the data for testing with test_size=0.3. The predictor variables are all columns except charges, and the target variable is charges. We fit the linear regression model on the training set using fit(); this step is called model fitting. We then check the score on both the training and test sets using score(), which for linear regression returns the coefficient of determination (R²). It comes out to about 79%, which is pretty decent, I would say.
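A minimal sketch of these steps is shown below. It runs on synthetic data rather than the Kaggle file, and the feature set, generating coefficients and random_state values are assumptions chosen only to make the example reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance data.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.normal(30, 6, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
df["charges"] = (
    250 * df["age"] + 300 * df["bmi"] + 24000 * df["smoker_yes"]
    + rng.normal(0, 5000, size=n)
)

# Predictors are every column except charges; charges is the target.
X = df.drop(columns="charges")
y = df["charges"]

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit on the training set only, then score both splits (score() returns R²).
model = LinearRegression()
model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"train R2: {train_score:.3f}, test R2: {test_score:.3f}")
```

Fitting only on the training rows keeps the test score an honest estimate of how the model behaves on unseen patients.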

5. Model Evaluation

To evaluate our linear regression model, we use R² and mean squared error.

model evaluation

On evaluating our model, it showed an R² score of about 80% on the test data.

From the figure, the R² and mean squared error values for the training and test data closely match. This suggests the model is not overfitting and is appropriate for predicting patient charges from their personal health data.
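The comparison of training and test metrics can be sketched as below, using r2_score and mean_squared_error from sklearn.metrics. As before, the data is synthetic and its generating coefficients are assumptions, so the numbers will not match the notebook's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded insurance data.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker_yes": rng.integers(0, 2, size=n),
})
y = 250 * X["age"] + 24000 * X["smoker_yes"] + rng.normal(0, 5000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# Compute R² and mean squared error separately for each split.
metrics = {}
for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(X_part)
    metrics[name] = {
        "r2": r2_score(y_part, pred),
        "mse": mean_squared_error(y_part, pred),
    }

for name, values in metrics.items():
    print(f"{name}: R2={values['r2']:.3f}, MSE={values['mse']:.0f}")
```

If the training R² were much higher than the test R², or the test error much larger, that gap would point to overfitting; similar values on both splits are the reassuring case described above.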

License

This project is under the MIT License. See License for more details.