In [12]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Medical Insurance Cost Prediction

### Source

This data comes from Kaggle: https://www.kaggle.com/datasets/rahulvyasm/medical-insurance-cost-prediction?resource=download \
Thanks to Mr. M Rahul Vyas for uploading the dataset to Kaggle. \
The data is released under the MIT license.

### Project Goals

1. Build a model that predicts medical insurance charges from the available features
2. Assess which features influence medical insurance costs the most
3. Evaluate how accurately machine learning models predict medical insurance charges
4. Determine ways in which machine learning models can improve the efficiency and profitability of medical insurance companies

In [5]:
insurance_raw = pd.read_csv(r"C:\Users\ritac\Downloads\medical_insurance\medical_insurance.csv")
insurance_raw

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
2767,47,female,45.320,1,no,southeast,8569.86180
2768,21,female,34.600,0,no,southwest,2020.17700
2769,19,male,26.030,1,yes,northwest,16450.89470
2770,23,male,18.715,0,no,northwest,21595.38229


### Dropping Features

Features are dropped if they are uninformative or if they wouldn't be available at prediction time. All of the given features would be available at prediction time, and all seem like useful factors for our model to consider. Thus we will not drop any features to begin with.

### Feature Engineering

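The 'sex', 'smoker', and 'region' features are categorical, and linear regression only accepts numeric inputs, so we need to convert them into quantitative features. We will use one-hot encoding for this: each category becomes its own 0/1 indicator column.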

In [17]:
insurance_features = insurance_raw.drop('charges', axis=1)

In [21]:
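# Columns that hold category labels rather than numbers
categorical_features = ['sex', 'smoker', 'region']

# One-hot encode the categorical columns; pass all other columns through unchanged
preproc = ColumnTransformer(transformers=[
    ('ohe-categorical', OneHotEncoder(), categorical_features)
], remainder='passthrough')

# Chain preprocessing and a baseline linear regression into one estimator
pl = Pipeline([
    ('preprocessing', preproc),
    ('rough_model', LinearRegression())
])

# Fit on all rows; 'charges' is the prediction target
pl.fit(insurance_features, insurance_raw['charges'])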

In [23]:
pl.predict(insurance_features.iloc[[0]])

array([25343.5])

In [24]:
pl.score(insurance_features, insurance_raw['charges'])

0.7509296564596399
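
The score above is the coefficient of determination (R²), and it was computed on the same rows the model was fit on, so it may overstate how well the model predicts unseen customers. A common alternative is to hold out a test set. The sketch below shows the idea on synthetic data: `demo` and its `charges` formula are fabricated stand-ins that only borrow the column names, and `demo_pl` is a hypothetical copy of the pipeline structure.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# ASSUMPTION: synthetic stand-in data; values and the target formula are made up
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "sex": rng.choice(["female", "male"], size=n),
    "bmi": rng.uniform(18, 45, size=n),
    "children": rng.integers(0, 5, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
    "region": rng.choice(["southwest", "southeast", "northwest"], size=n),
})
demo["charges"] = (
    250 * demo["age"]
    + 300 * demo["bmi"]
    + 20000 * (demo["smoker"] == "yes")
    + rng.normal(0, 3000, size=n)
)

X = demo.drop(columns="charges")
y = demo["charges"]

# Keep 25% of the rows aside; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

demo_pl = Pipeline([
    ("preprocessing", ColumnTransformer(
        [("ohe-categorical", OneHotEncoder(), ["sex", "smoker", "region"])],
        remainder="passthrough",
    )),
    ("rough_model", LinearRegression()),
])
demo_pl.fit(X_train, y_train)

# R^2 on the training rows versus the held-out rows
train_r2 = demo_pl.score(X_train, y_train)
test_r2 = demo_pl.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

Applied to the real data, the same split would go on `insurance_features` and `insurance_raw['charges']`, and the held-out R² is the number to report for the third project goal.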