This is my first Machine Learning algorithm! I am predicting the cost of insurance based on an insurance dataset.

In [None]:
import os
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import preprocessing

ins = pd.read_csv('/kaggle/input/insurance/insurance.csv')
print(ins.columns)
print(ins.head())
print(ins.shape)
ins.info()
print("Checking for number of Null or NaN values : ")
print(ins.isna().sum())
##No Null Values.

I import the dataset, which has 7 columns. head() shows the first 5 rows, and shape reports 1338 rows and 7 columns. isna().sum() confirms there are no null values, and info() shows the data type of each column.
In the first 5 rows, sex has male/female categories, smoker is yes or no, and region has 4 values. These categorical columns need one-hot encoding before they can be used in a regression.

In [None]:
ins = pd.get_dummies(ins)
#One-Hot Encoding, categories converted.

ins.columns = ['age', 'bmi', 'children', 'charges', 'female', 'male',
       'non-smoker', 'smoker', 'northeast', 'northwest',
       'southeast', 'southwest']
ins.head()

pd.get_dummies(ins) one-hot encodes the categorical columns of ins. I then rename the columns so they are easier to read. In the first 5 rows, sex became female and male, smoker became non-smoker and smoker, and region became northeast, northwest, southeast and southwest.
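
To make the encoding step concrete, here is a minimal sketch on a few made-up rows (the `toy` frame and its values are assumptions for illustration, not the insurance data). It shows the column names get_dummies generates, and that `drop_first=True` keeps a single 0/1 flag for a two-category column.

```python
import pandas as pd

# Toy rows that imitate the categorical columns of the insurance data (values are made up).
toy = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "yes"],
})

# Full one-hot encoding: one column per category, named <column>_<category>.
full = pd.get_dummies(toy, dtype=int)
print(full.columns.tolist())

# drop_first=True drops the first category of each column,
# leaving one 0/1 flag per two-category feature.
compact = pd.get_dummies(toy, drop_first=True, dtype=int)
print(compact.columns.tolist())

# The two smoker columns always sum to 1, so either one determines the other.
smoker_total = (full["smoker_no"] + full["smoker_yes"]).tolist()
print(smoker_total)
```

Keeping both columns of a pair, as this notebook does, is easier to read; dropping one avoids carrying the same information twice.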

In [None]:
corr = ins.corr().round(2)
corr.style.background_gradient(cmap='coolwarm')
 

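This is the correlation matrix of all features and the label (charges). I will use age, bmi, non-smoker and smoker, as these have the highest correlation with charges. Note that non-smoker and smoker are complements of each other, so their correlations with charges are equal in size and opposite in sign.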
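
Reading the strongest correlations off a coloured matrix can be error-prone, so one option is to sort them. The sketch below does that on a small made-up frame (the `toy` values are assumptions, not the insurance data); with the real data the same expression would be applied to `ins`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "age": [20, 30, 40, 50, 60],
    "bmi": [31.0, 24.0, 29.0, 30.0, 26.0],
    "smoker": [0, 1, 0, 1, 0],
    "charges": [2000.0, 21000.0, 6000.0, 27000.0, 11000.0],
})

# Correlation of every feature with the label, ranked by absolute value.
ranked = (
    toy.corr()["charges"]
    .drop("charges")              # the label correlates perfectly with itself
    .abs()
    .sort_values(ascending=False)
)
print(ranked.round(2))
```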

In [None]:
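#distplot is deprecated in recent seaborn versions; histplot with kde=True is the replacement
sns.histplot(ins['charges'], bins=200, kde=True)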
#charges not normally distributed, skewed to the right (long tail of high charges)

Charges are not normally distributed; they are skewed to the right, with most values low and a long tail of high charges.
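
One way to check the direction of a skew numerically is the sample skewness: a positive value indicates a long tail toward larger values. The sketch below uses synthetic log-normal numbers as a stand-in for charges (an assumption for illustration, not the real data) and also shows that a log transform reduces the skew.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic values with a long right tail, standing in for charges (not the real data).
fake_charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=2000))

skew_raw = fake_charges.skew()            # positive => tail toward larger values
skew_log = np.log1p(fake_charges).skew()  # log transform pulls the tail in

print("skewness of raw values:", round(skew_raw, 2))
print("skewness after log1p:  ", round(skew_log, 2))
```

On the real data, the same check would be `ins['charges'].skew()`.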

In [None]:
zoomin = sns.histplot(ins['charges'], bins=50, kde=True)

zoomin.set_xlim(0, 20000)
plt.show()
#Most of the patients are charged ~1250 to 15000

Most patients are charged in the range of roughly 1,250 to 15,000.

In [None]:

sns.pairplot(ins[['age', 'bmi','non-smoker','smoker','charges']])



The pairplot shows that charges generally increase with age and bmi. There are more non-smokers than smokers in this dataset.

In [None]:
ins['non-smoker'].sum()


In [None]:
ins['smoker'].sum()


In [None]:
ins.describe()

The mean age in the dataset is about 40, and the mean number of children is about 1. There are slightly fewer females than males. The number of people in each region is similar, with southeast the highest.

In [None]:
#separate X and Y target dataset.
Y = ins['charges']
X = ins[['age','bmi','non-smoker','smoker']]

#split dataset into train and test sets with 80/20 split
#(train_test_split was already imported in the first cell)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print("Shape of X_train: ", X_train.shape)
print("Shape of X_test: ", X_test.shape)
print("Shape of Y_train: ", Y_train.shape)
print("Shape of Y_test: ", Y_test.shape)

I chose charges as the target Y and age, bmi, non-smoker and smoker as the features X, based on the corr() values. The data is split into train and test sets with an 80/20 split. I check the shapes of the train and test sets one last time before applying linear regression.

In [None]:


#data standardisation (zero mean, unit variance); the scaler is fit on the training set only
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

#make linear regression model
ins_mod = linear_model.LinearRegression()
#train the model with X and Y training data sets
ins_mod.fit(X_train, Y_train)

#predict on Y for values in X test dataset
Y_pred = ins_mod.predict(X_test)

print('Coefficients: \n', ins_mod.coef_)
print("Mean squared error: %.2f" % mean_squared_error(Y_test, Y_pred))
print('R^2 score: %.2f' % r2_score(Y_test, Y_pred))

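The X train and test data are standardized with StandardScaler (fit on the training set, then applied to the test set), and a linear regression is fit on the training data. Y_pred holds the predicted values for the test set. The mean squared error is 34512843.88; because MSE is in squared units, its square root (about 5,875) is easier to compare with the charges themselves. The R2 score is 0.78, meaning the model explains about 78% of the variance in the test charges.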
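
The mean squared error is hard to read because it is in squared units. As a small worked example, the sketch below takes the square root of the MSE reported above to get the RMSE, and computes R2 by hand on a few made-up numbers (the `y_true` and `y_hat` lists are assumptions for illustration) to show what r2_score measures.

```python
import math

# MSE reported by the model above; RMSE is in the same units as charges.
mse = 34512843.88
rmse = math.sqrt(mse)
print("RMSE: %.2f" % rmse)

# R^2 by hand on toy numbers: 1 - (residual sum of squares / total sum of squares).
y_true = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 7.0, 9.5]

mean_y = sum(y_true) / len(y_true)
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))  # error left after the model
ss_tot = sum((t - mean_y) ** 2 for t in y_true)            # error of always predicting the mean
r2 = 1 - ss_res / ss_tot
print("R^2: %.4f" % r2)
```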

In [None]:
ins_mod.intercept_


In [None]:
plt.scatter(Y_test, Y_pred)
plt.show()

The scatter plot compares the actual charges (x-axis) with the predicted charges (y-axis). Points close to the diagonal are well predicted.

In [None]:
coefficients = ins_mod.coef_
bias = ins_mod.intercept_
#the coefficients apply to the standardized features, not the raw values
print("Y =", bias, "+", coefficients[0], "* AGE +", coefficients[1], "* BMI +", coefficients[2], "* NON-SMOKER +", coefficients[3], "* SMOKER")

In [None]:
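#Make a list of features to predict the charges.
newList = [[26, 27, 1, 0], [18, 30, 0, 1]]
#The model was fit on standardized features, so the new rows must go
#through the same fitted scaler before the coefficients are applied.
newScaled = sc.transform(pd.DataFrame(newList, columns=X.columns))
n = 1
for x in newScaled:
  predict_charges = bias + (coefficients[0] * x[0]) + (coefficients[1] * x[1]) + (coefficients[2] * x[2]) + (coefficients[3] * x[3])
  print("Predicted charges %d =  %d" % (n, predict_charges))
  n+=1
  print("\n")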


So this is the formula in the form of Y = theta*X + b. There are 4 features, so Y = b + theta1*X1 + theta2*X2 + theta3*X3 + theta4*X4, where X1 to X4 are the standardized values of age, bmi, non-smoker and smoker. ins_mod.coef_ gives the respective theta values, and ins_mod.intercept_ gives the bias b.
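
Because the model in this notebook was fit on features transformed by StandardScaler, the theta values apply to standardized inputs rather than raw ones. The sketch below illustrates this on synthetic data (the data, the three-feature layout and all variable names are assumptions for illustration): the manual formula matches predict() once the inputs are scaled, and the coefficients can also be converted back to raw-feature units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for age, bmi and a 0/1 smoker flag (not the insurance data).
X_raw = np.column_stack([
    rng.integers(18, 65, size=200),
    rng.normal(30, 6, size=200),
    rng.integers(0, 2, size=200),
]).astype(float)
y = 250 * X_raw[:, 0] + 300 * X_raw[:, 1] + 24000 * X_raw[:, 2] + rng.normal(0, 1000, size=200)

# Same pipeline as the notebook: standardize, then fit a linear regression.
scaler = StandardScaler().fit(X_raw)
model = LinearRegression().fit(scaler.transform(X_raw), y)

new_rows = np.array([[26, 27, 0], [18, 30, 1]], dtype=float)

# Manual formula: standardize each feature first, then apply b + sum(theta_i * z_i).
z = (new_rows - scaler.mean_) / scaler.scale_
manual = model.intercept_ + z @ model.coef_
via_predict = model.predict(scaler.transform(new_rows))

# Equivalent coefficients expressed per unit of the raw features.
raw_coef = model.coef_ / scaler.scale_
raw_intercept = model.intercept_ - np.sum(model.coef_ * scaler.mean_ / scaler.scale_)
manual_raw = raw_intercept + new_rows @ raw_coef

print(manual)
print(via_predict)
print(manual_raw)
```

All three printed arrays should agree, which is a quick sanity check that a hand-written formula uses the coefficients on the right scale.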

# THANK YOU!