# Insurance Regression Analysis by TanishHP

### General Description
The dataset is retrieved from the Machine Learning Website by Professor Eric Suess at http://www.sci.csueastbay.edu/~esuess/stat6620/#week-6. It contains 7 columns: 6 features (3 numerical and 3 categorical) and a target column with the medical expenses incurred by each individual.  

### Goals

1. Divide the features into two groups, numerical and categorical, in order to clean and transform each appropriately. 
2. For categorical features:
    a. create a dataframe and clean missing data
    b. represent the data visually in an appropriate format
    c. drop irrelevant features, with the reasoning for each presented
    d. convert categorical variables to a binary format
3. For numerical features:
    a. create a dataframe and clean missing data
    b. represent the data visually in an appropriate format
    c. drop irrelevant features, with the reasoning for each presented
4. Create a linear model, fit it on the relevant features, and predict expenses. 
5. Use appropriate metrics to measure the effectiveness of the model. 


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/insurance-premium-prediction/insurance.csv')
df

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

##### Fortunately, the dataset has no null values to deal with. 

In [None]:
df.isnull().sum()

In [None]:
#There are 7 columns, 6 of which are features which can be subdivided into categorical_cols and numerical_cols
all_columns = list(df.columns)
print(all_columns)
categorical_cols = ['sex', 'smoker', 'region']
numerical_cols = ['age', 'children', 'bmi']
target_col = 'expenses'

In [None]:
# Create two dataframes: one to store numerical data and one to store categorical data.

# Categorical data has to be transformed into a numerical format before it can be
# used to train the linear model.

# pd.get_dummies transforms each categorical column into binary indicator columns.

categorical_X = pd.DataFrame()
for i in categorical_cols:
    dummies = pd.get_dummies(df[i])
    categorical_X = pd.concat([categorical_X, dummies], axis=1)
numerical_X = pd.DataFrame(df[numerical_cols])
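
As a minimal sketch of what `pd.get_dummies` produces, the snippet below encodes a tiny made-up frame (the `toy` rows are hypothetical, not taken from the dataset). It also uses the `prefix` and `dtype` arguments, which the cell above does not, so that each indicator column keeps the name of the column it came from.

```python
import pandas as pd

# Toy rows standing in for the insurance data (values are made up).
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# prefix=col keeps the source column in each dummy name (e.g. smoker_yes);
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.concat(
    [pd.get_dummies(toy[col], prefix=col, dtype=int) for col in ["sex", "smoker"]],
    axis=1,
)

print(encoded.columns.tolist())
print(encoded)
```

Prefixed names avoid collisions when two columns share a category label. Note that the rest of this notebook refers to the unprefixed names (`'female'`, `'male'`, and so on), so adopting prefixes would mean updating the later `drop` call as well. Passing `drop_first=True` is another option worth considering, since the `no` and `yes` indicators of a binary column always sum to 1 and therefore carry the same information.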

## Graphing Categorical Variables vs Expenses

In [None]:
cat_fig, axes=plt.subplots(1, 3, figsize=(15, 4))
sns.swarmplot(ax=axes[0], x='smoker', y='expenses', data=df)
sns.swarmplot(ax=axes[1], x='sex', y='expenses', data=df)
sns.swarmplot(ax=axes[2], x='region', y='expenses', data=df)
cat_fig.suptitle('Categorical Variables vs Expenses', fontsize=16)

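A good way to understand categorical data is to look at how the observations are distributed within each category, which is why I chose a swarm plot to visualise my data: it draws every individual as a point, so both the size of each group and the spread of its expenses are visible. 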

The first plot displayed shows that the smokers generally incur more expense than non-smokers. This makes sense as smoking is dangerous to health and makes people who practise it susceptible to more health problems. This feature definitely has a place in the multiple linear regression model that we will later train. 

The second plot compares expenses to the sex of an individual. There isn't much variation in expenses between the two groups, which is what we would expect if the sample is large and random enough. It is reasonable to drop this feature. 

The third plot compares expenses to the region an individual belongs to. This also shows hardly any noticeable variation, so it doesn't make sense to train our model with this data. 
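
The visual comparison above could be backed by a number. The sketch below, on made-up rows (the `toy` frame and the helper `mean_gap` are hypothetical names, not part of the notebook), measures how far apart the group means of `expenses` are for a given categorical column. A large gap suggests the feature matters; a small one supports dropping it.

```python
import pandas as pd

# Made-up rows; the column names mirror the notebook's df.
toy = pd.DataFrame({
    "smoker": ["yes", "no", "yes", "no"],
    "sex": ["male", "female", "female", "male"],
    "expenses": [30000.0, 5000.0, 34000.0, 7000.0],
})

def mean_gap(frame, col, target="expenses"):
    """Return the difference between the highest and lowest group mean of target."""
    means = frame.groupby(col)[target].mean()
    return float(means.max() - means.min())

smoker_gap = mean_gap(toy, "smoker")
sex_gap = mean_gap(toy, "sex")
print("smoker gap:", smoker_gap)
print("sex gap:", sex_gap)
```

On the real data, the same helper could be called as `mean_gap(df, 'region')` and so on. A gap in means is only a rough guide, since it ignores group sizes and within-group spread.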

## Graphing Numerical Variables vs Expenses

In [None]:
fig, axes=plt.subplots(1, 3, figsize=(15, 4))
sns.regplot(ax=axes[0], x="age", y="expenses", data=df, scatter_kws={'s':10})
sns.regplot(ax=axes[1], x="children", y="expenses", data=df, scatter_kws={'s':10})
sns.regplot(ax=axes[2], x="bmi", y="expenses", data=df, scatter_kws={'s':10})
fig.suptitle('Numerical Variables vs Expenses', fontsize=16)

A scatter plot, as shown above, is a good way to understand numerical data. Each plot also contains the line of best fit to give some perspective. 

The first plot shows that age is generally positively correlated with expenses. We will keep this feature. 

The second plot shows a very weak correlation (almost 0) between the number of children and expenses. We can drop this feature. 

The third plot compares BMI (body mass index) to expenses and shows some positive correlation. This makes sense, as a high BMI can indicate obesity and thus further medical issues. 

I have computed the correlations of these features with expenses below. 

### Correlation of Numerical Variables with Expenses

In [None]:
df[['age', 'children', 'bmi', 'expenses']].corr().iloc[:-1, -1].round(2)
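
`DataFrame.corr()` uses the Pearson correlation coefficient by default. To make that number less of a black box, here is a small sketch that computes it by hand in plain Python; the function name `pearson_r` and the sample values are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values where cost rises in a straight line with age.
ages = [20, 30, 40, 50]
costs = [2000.0, 4000.0, 6000.0, 8000.0]
r = pearson_r(ages, costs)
print(round(r, 2))
```

A value near 1 or -1 indicates a strong linear relationship, and a value near 0 a weak one, which is the basis for keeping `age` and `bmi` and dropping `children`.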

## Creating a Multiple Linear Regression Model and Metrics

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error,r2_score

In [None]:
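# As discussed above, drop the features judged irrelevant: the sex and region dummies
# and the children column. Only the smoker dummies ('no', 'yes') remain in categorical_X.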

categorical_X.drop(['female', 'male', 'northeast', 'northwest', 'southeast',
       'southwest'], axis=1, inplace=True)
numerical_X.drop(['children'], axis=1, inplace=True)

In [None]:
linearModel = LinearRegression()

In [None]:
X = pd.concat([categorical_X, numerical_X], axis=1)
Y = df[target_col]

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(X, Y, test_size=0.3, random_state=0)

In [None]:
linearModel.fit(train_X, train_Y)

In [None]:
pred_Y = linearModel.predict(test_X)

In [None]:
print("MEAN ABSOLUTE ERROR: ", mean_absolute_error(test_Y, pred_Y))
print("R2 SCORE: ", r2_score(test_Y, pred_Y))
print("MEAN SQUARED ERROR: ", mean_squared_error(test_Y, pred_Y))
print("ROOT MEAN SQUARED ERROR: ", (mean_squared_error(test_Y, pred_Y)**0.5))
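
The four metrics printed above can also be spelled out by hand, which may help when interpreting them. The following is a plain-Python sketch of the standard definitions, not scikit-learn's implementation; `regression_metrics` and the sample values are hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n      # average absolute miss
    mse = sum(e * e for e in errors) / n       # average squared miss
    rmse = math.sqrt(mse)                      # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                    # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)     # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Made-up actual and predicted values.
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same units as `expenses`, so they read as a typical size of the prediction error, while R² is the share of the variation in expenses that the model accounts for.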