### Medical Expenses

The dataset contains information about medical expenses and is already split into ```train.csv``` and ```test.csv```.

The file ```data_description.txt``` describes the columns.

**Purpose:** build models that predict **medical expenses** (the ```"charges"``` column).

Steps:

1. Preprocessing

2. Train models

3. Compare models based on regression metrics


# Load the data and modules

In [28]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
%matplotlib inline

In [29]:
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row labels
all_data = pd.concat([train_data, test_data])
print(all_data.shape, train_data.shape)
all_data.head(3)

(1338, 7) (1205, 7)


| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 34 | male | 42.9 | 1 | no | southwest | 4536.259 |
| 1 | 61 | female | 36.385 | 1 | yes | northeast | 48517.56315 |
| 2 | 60 | male | 25.74 | 0 | no | southeast | 12142.5786 |
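
Concatenating the two files lets the same preprocessing run on both, but the split point then has to be remembered as a row count. A minimal sketch of one alternative, assuming toy frames (`train_demo` and `test_demo` are made-up stand-ins, not the real files): `pd.concat` with `keys` labels each block so it can be recovered by name.

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (hypothetical values)
train_demo = pd.DataFrame({"age": [34, 61, 60], "smoker": ["no", "yes", "no"]})
test_demo = pd.DataFrame({"age": [25, 47], "smoker": ["yes", "no"]})

# keys= adds an outer index level naming the source of each row
combined = pd.concat([train_demo, test_demo], keys=["train", "test"])

# Shared preprocessing runs once on the combined frame
combined["smoker"] = (combined["smoker"] == "yes").astype(int)

# Recover each block by label instead of by a hard-coded row count
train_part = combined.loc["train"]
test_part = combined.loc["test"]
print(train_part.shape, test_part.shape)
```

Selecting by label keeps the split correct even if rows are dropped during cleaning, which a fixed positional slice would not.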


# 1. Preprocessing

In [30]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1338 entries, 0 to 132
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB


In [31]:
print(all_data.shape)
for col in all_data:
    print(col, len(all_data[col].unique()), all_data[col].dtype)

(1338, 7)
age 47 int64
sex 2 object
bmi 548 float64
children 6 int64
smoker 2 object
region 4 object
charges 1337 float64


In [32]:
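# numeric_only=True is required in pandas 2.x because sex, smoker and region are still strings here
train_data.corr(numeric_only=True)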

| | age | bmi | children | charges |
|---|---|---|---|---|
| age | 1.0 | 0.100281 | 0.05094 | 0.296395 |
| bmi | 0.100281 | 1.0 | 0.020396 | 0.204654 |
| children | 0.05094 | 0.020396 | 1.0 | 0.059493 |
| charges | 0.296395 | 0.204654 | 0.059493 | 1.0 |


In [35]:
# Replace each categorical column with integer codes (categories are numbered in sorted order)
encoder = LabelEncoder()
for col in ['sex', 'smoker', 'region']:
    all_data[col] = encoder.fit_transform(all_data[col])
all_data.head(3)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 34 | 1 | 42.9 | 1 | 0 | 3 | 4536.259 |
| 1 | 61 | 0 | 36.385 | 1 | 1 | 0 | 48517.56315 |
| 2 | 60 | 1 | 25.74 | 0 | 0 | 2 | 12142.5786 |
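
`LabelEncoder` replaces each category with an integer code, so `region` becomes a number from 0 to 3 and linear models read it as an ordered quantity. For the two-level columns (`sex`, `smoker`) this is harmless. One common alternative for `region` is one-hot encoding; the sketch below applies `pd.get_dummies` to the first three rows of the dataset only. It is an illustration, not the encoding behind the results reported in this notebook.

```python
import pandas as pd

# First three rows of the dataset, reduced to two columns for brevity
demo = pd.DataFrame({
    "region": ["southwest", "northeast", "southeast"],
    "charges": [4536.259, 48517.56315, 12142.5786],
})

# One 0/1 indicator column per region value; no artificial ordering is implied
encoded = pd.get_dummies(demo, columns=["region"], dtype=int)
print(encoded.columns.tolist())
```

Tree-based models are largely indifferent to this choice, while linear regression and SVR can be sensitive to it.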


In [43]:
# Split back into the original parts: the first 1205 rows came from train.csv
train_data = all_data.iloc[:1205]
test_data = all_data.iloc[1205:]

In [47]:
important = list(train_data.columns)
important.remove('charges')
important

['age', 'sex', 'bmi', 'children', 'smoker', 'region']

# 2. Train models

In [45]:
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import sklearn.metrics as metrics

In [48]:
ytrue = test_data['charges']
xtest = test_data[important]
ytrain = train_data['charges']
xtrain = train_data[important]

In [49]:
def r2alpha(xtrain, r2):
    # Adjusted R^2: 1 - (n - 1) / (n - k - 1) * (1 - R^2),
    # with n rows and k features taken from the training matrix
    n, k = xtrain.shape
    return round(1 - (n - 1) / (n - k - 1) * (1 - r2), 2)

def rmse(ytrue, ypred):
    # Root mean squared error, rounded to the nearest integer.
    # Avoids the `squared=False` argument, which recent scikit-learn releases deprecate.
    return round(float(np.sqrt(metrics.mean_squared_error(ytrue, ypred))))

def CompareRegressions(xtrain, ytrain, xtest, ytrue):
    # Multiple linear regression
    LinRegressor = LinearRegression()
    LinRegressor.fit(xtrain, ytrain)
    ypred = LinRegressor.predict(xtest)
    r2a = r2alpha(xtrain, metrics.r2_score(ytrue, ypred))
    print(f'Multiple Lin Regression RMSE: {rmse(ytrue, ypred)} R2alpha: {r2a}')
    # Multiple polynomial regression (degree 2); fit the expansion once, reuse it on the test set
    poly = PolynomialFeatures(2)
    xtrainPol2 = poly.fit_transform(xtrain)
    xtestPol2 = poly.transform(xtest)
    PolRegressor = LinearRegression()
    PolRegressor.fit(xtrainPol2, ytrain)
    ypred = PolRegressor.predict(xtestPol2)
    r2a = r2alpha(xtrain, metrics.r2_score(ytrue, ypred))
    print(f'Multiple Polynomial Regression RMSE: {rmse(ytrue, ypred)} R2alpha: {r2a}')
    # Decision tree
    DTregressor = DecisionTreeRegressor(random_state=0)
    DTregressor.fit(xtrain, ytrain)
    ypred = DTregressor.predict(xtest)
    r2a = r2alpha(xtrain, metrics.r2_score(ytrue, ypred))
    print(f'Decision Tree RMSE: {rmse(ytrue, ypred)} R2alpha: {r2a}')
    # Random forest (no random_state is set, so results vary slightly between runs)
    RFregressor = RandomForestRegressor()
    RFregressor.fit(xtrain, ytrain)
    ypred = RFregressor.predict(xtest)
    r2a = r2alpha(xtrain, metrics.r2_score(ytrue, ypred))
    print(f'Random Forest RMSE: {rmse(ytrue, ypred)} R2alpha: {r2a}')
    # SVR: standardize the features and the target before fitting
    scaler = StandardScaler()
    scaled_xtrain = scaler.fit_transform(xtrain)
    scaled_xtest = scaler.transform(xtest)

    yscaler = StandardScaler()
    # StandardScaler needs a 2-D array, SVR expects a 1-D target, hence ravel()
    scaled_ytrain = yscaler.fit_transform(ytrain.values.reshape(-1, 1)).ravel()

    regressor = SVR(kernel='rbf')
    regressor.fit(scaled_xtrain, scaled_ytrain)
    ypred = regressor.predict(scaled_xtest)
    # Map the predictions back to the original scale of charges
    ypred = yscaler.inverse_transform(ypred.reshape(-1, 1)).ravel()
    r2a = r2alpha(xtrain, metrics.r2_score(ytrue, ypred))
    print(f'SVR RMSE: {rmse(ytrue, ypred)} R2alpha: {r2a}')
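
The SVR branch scales the features and the target by hand and then inverts the target scaling on the predictions. A possible shortcut, sketched here on synthetic data (the arrays `X` and `y` are made up, not the medical dataset), wraps the same steps in scikit-learn's `make_pipeline` and `TransformedTargetRegressor`, so scaling and inverse scaling happen inside `fit` and `predict`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression data (assumption: stands in for xtrain / ytrain)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 10000 + 3000 * X[:, 0] + rng.normal(scale=100, size=200)

svr_model = TransformedTargetRegressor(
    # Features are standardized inside the pipeline before reaching SVR
    regressor=make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    # The target is standardized for fitting and un-scaled again on predict
    transformer=StandardScaler(),
)
svr_model.fit(X[:150], y[:150])
pred = svr_model.predict(X[150:])  # already in the original target units
print(pred.shape)
```

Keeping the scalers inside one estimator also means the same object could be passed to cross-validation helpers without leaking test-set statistics.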

# 3. Compare models based on regression metrics

In [50]:
CompareRegressions(xtrain, ytrain, xtest, ytrue)

Multiple Lin Regression RMSE: 5960 R2alpha: 0.78
Multiple Polynomial Regression RMSE: 4993 R2alpha: 0.85
Decision Tree RMSE: 5824 R2alpha: 0.79
Random Forest RMSE: 4543 R2alpha: 0.87


SVR RMSE: 4601 R2alpha: 0.87
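
`R2alpha` in the output is the adjusted R², computed as 1 - (n - 1)/(n - k - 1) * (1 - R²). The `r2alpha` helper takes n and k from `xtrain`, while R² itself is measured on the test set. The sketch below uses a made-up raw R² of 0.80 (not a value from this notebook) to show how the result shifts between the two sample sizes that appear here: 1205 training rows and 1338 - 1205 = 133 test rows.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n samples and k features, rounded to 2 decimals."""
    return round(1 - (n - 1) / (n - k - 1) * (1 - r2), 2)

# Same hypothetical raw R^2 and k = 6 features, two different sample sizes
print(adjusted_r2(0.80, n=1205, k=6))  # 0.8  (training-set size)
print(adjusted_r2(0.80, n=133, k=6))   # 0.79 (test-set size)
```

With the smaller n the penalty for the six features is larger, so scores reported with the training-set size are slightly more optimistic.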


##### Best result: Random Forest (RMSE 4543, R2alpha 0.87)