#### Initial setup

Let's first import the required libraries and read in the dataset.

In [7]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

df = pd.read_csv('./insurance.csv')
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


#### Data cleaning

Let's create the explanatory matrix X and the response vector y to be used in the model.

In [8]:
def clean_data(df):
    '''
    INPUT
    df - pandas dataframe 
    
    OUTPUT
    X - A matrix holding all of the variables we want to consider when predicting the response
    y - the corresponding response vector
    
    Cleans df using the following steps to produce X and y:
    1. Drop rows where the response (charges) is missing
    2. For quantitative columns, impute missing values with the column mean
    3. For categorical columns, create 0/1 dummy variables (dropping the first level)
    '''

    # Drop all the rows with no charges
    df = df.dropna(subset = ['charges'], axis = 0)

    # Create y as the charges column
    y = df.charges

    # Create X as all the columns that are not the charges column
    X = df.drop('charges', axis = 1)

    # For each numeric column in X, fill missing values with the column mean.
    num_cols = X.select_dtypes(include='number').columns
    for col in num_cols:
        # Assign the result back; inplace=True on a column selection is unreliable in recent pandas
        X[col] = X[col].fillna(X[col].mean())

    # Create dummy columns for all the categorical variables in X
    X = pd.get_dummies(X, drop_first=True)
    
    return X, y
    
#Use the function to create X and y
X, y = clean_data(df)

In [9]:
print(X.shape)
print(len(y))

(1338, 8)
1338
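
To make the three cleaning steps concrete, here is a minimal sketch on a tiny made-up frame. The `toy` data and the `toy_X` name are assumptions for illustration only and are not taken from insurance.csv.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not from insurance.csv)
toy = pd.DataFrame({
    'age': [20, np.nan, 40, 30],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'charges': [100.0, 200.0, np.nan, 400.0],
})

# Step 1: drop rows where the response is missing
toy = toy.dropna(subset=['charges'])

# Step 2: impute missing numeric values with the column mean
toy_X = toy.drop('charges', axis=1)
toy_X['age'] = toy_X['age'].fillna(toy_X['age'].mean())

# Step 3: dummy columns for categorical variables; drop_first=True
# drops the first level ('no'), leaving a single smoker_yes column
toy_X = pd.get_dummies(toy_X, drop_first=True)

print(toy_X)
```

Dropping the first level avoids a redundant column: with an intercept in the model, `smoker_no` would carry no information beyond `smoker_yes`.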


#### Create training and test sets

We use scikit-learn's train_test_split to split the data into training and test sets, reserving 30% of the data for testing. A fixed random state is used so that readers can reproduce the split and compare their results with mine.

In [11]:
#Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42) 

lm_model = LinearRegression() # Instantiate (the normalize argument was removed in scikit-learn 1.2)
lm_model.fit(X_train, y_train) #Fit
        
#Predict and score the model
y_test_preds = lm_model.predict(X_test) 
"The r-squared score for the model was {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test))

'The r-squared score for the model was 0.769611805436901 on 402 values.'
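
The r-squared score is unitless, and mean_squared_error is imported above but not used. As a sketch of how an error in the units of the response could be reported alongside it, the example below computes the root mean squared error (RMSE). It runs on synthetic data so that it is self-contained; `X_demo`, `y_demo`, and `demo_model` are hypothetical names, and the numbers it prints say nothing about the insurance model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same split settings as in the notebook: 30% held out, fixed random state
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

demo_model = LinearRegression().fit(Xtr, ytr)
preds = demo_model.predict(Xte)

# RMSE is the square root of the MSE, so it is in the same units as y
rmse = np.sqrt(mean_squared_error(yte, preds))
r2 = r2_score(yte, preds)
print("RMSE: {:.3f}, r-squared: {:.3f} on {} values".format(rmse, r2, len(yte)))
```

Applied to the insurance model, the same two lines with y_test and y_test_preds would give an error in the units of charges.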

#### Importance of parameters

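LinearRegression in scikit-learn fits ordinary least squares, with no penalty on the coefficients (a ridge, or L2, penalty is available separately through Ridge). The coefficients are reported in the original units of each variable, so each one is the expected change in charges for a one-unit change in that variable with the others held fixed. With that caveat, we can look at the size of the coefficients as a rough indication of the impact of each variable on the insurance cost: the larger the coefficient, the larger the expected impact per unit. Dummy variables such as smoker_yes change by a full unit between categories, while age and bmi span many units, so their coefficients are not directly comparable.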

In [13]:
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model 
    X_train - the training data, so the column names can be used

    OUTPUT:
    coefs_df - a dataframe holding each parameter name, its coefficient, and the coefficient's absolute value
    
    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the 
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['parameter'] = X_train.columns
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df

#Use the function
coef_df = coef_weights(lm_model.coef_, X_train)

#A quick look at the top results
coef_df

,parameter,coefs,abs_coefs
4,smoker_yes,23628.367222,23628.367222
6,region_southeast,-970.968839,970.968839
7,region_southwest,-926.322908,926.322908
5,region_northwest,-486.93461,486.93461
2,children,424.119128,424.119128
1,bmi,348.906915,348.906915
0,age,261.296924,261.296924
3,sex_male,104.811823,104.811823
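
Each coefficient in this table is a per-unit effect, so a variable measured on a wide scale (such as age in years) can have a small coefficient and still matter a lot. One common way to put coefficients on a comparable footing is to standardize the features before fitting. The sketch below illustrates the idea on synthetic data using StandardScaler, which is not part of the original notebook; the `demo` frame and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration: two features on very different scales
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    'big_scale': rng.normal(loc=0.0, scale=100.0, size=n),
    'small_scale': rng.normal(loc=0.0, scale=1.0, size=n),
})
# By construction both features contribute about equally to the variation in y
y_demo = 0.05 * demo['big_scale'] + 5.0 * demo['small_scale'] + rng.normal(scale=0.1, size=n)

# Raw fit: coefficients are per original unit, so they differ by about 100x
raw = LinearRegression().fit(demo, y_demo)

# Standardized fit: each feature has mean 0 and standard deviation 1,
# so coefficients are per standard deviation and can be compared directly
scaled = LinearRegression().fit(StandardScaler().fit_transform(demo), y_demo)

comparison = pd.DataFrame({
    'parameter': demo.columns,
    'raw_coef': raw.coef_,
    'standardized_coef': scaled.coef_,
})
print(comparison)
```

In the raw fit the two coefficients look very different, while the standardized fit shows the two features carrying similar weight. For 0/1 dummy variables a standardized coefficient is harder to interpret, so this is a complement to the table above rather than a replacement.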
