
# **About Dataset**
**Context**

*Machine Learning with R* by Brett Lantz is a book that provides an introduction to machine learning using R. As far as I can tell, Packt Publishing does not make its datasets available online unless you buy the book and create a user account, which can be a problem if you are checking the book out from the library or borrowing it from a friend. All of these datasets are in the public domain; they simply needed some cleaning up and recoding to match the format in the book.

**Content Columns**

**age:** age of the primary beneficiary

**sex:** gender of the insurance contractor (female, male)

**bmi:** body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2); it indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9

**children:** number of children covered by health insurance / number of dependents

**smoker:** whether the beneficiary smokes (yes, no)

**region:** the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

**charges:** individual medical costs billed by health insurance
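
As a quick illustration of the bmi column, here is a minimal sketch of the formula. The function name `bmi` and the example person (70 kg, 1.75 m) are made up for illustration and are not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


# Hypothetical person, not a row from the dataset.
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))

# Check against the "ideal" range quoted in the column description.
print(18.5 <= example_bmi <= 24.9)
```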

In [1]:
!pip install opendatasets --quiet
!pip install pandas --quiet

In [2]:
import opendatasets as od
import pandas as pd
medical_charges_url = 'https://www.kaggle.com/datasets/mirichoi0218/insurance'
od.download(medical_charges_url)

Downloading insurance.zip to ./insurance
100%|██████████| 16.0k/16.0k [00:00<00:00, 11.2MB/s]

In [3]:
# Relative path: od.download() saved the file under ./insurance (see the output above)
medical_df = pd.read_csv('./insurance/insurance.csv')
medical_df

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 |


In [4]:
medical_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [5]:
medical_df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


In [6]:
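# numeric_only=True keeps pandas 2.0+ from raising on the text columns (sex, smoker, region)
medical_df.corr(numeric_only=True)

| | age | bmi | children | charges |
|---|---|---|---|---|
| age | 1.0 | 0.109272 | 0.042469 | 0.299008 |
| bmi | 0.109272 | 1.0 | 0.012759 | 0.198341 |
| children | 0.042469 | 0.012759 | 1.0 | 0.067998 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.0 |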


In [7]:
# Binary categories
smoker_codes = {'yes': 1, 'no': 0}
medical_df['smoker_code'] = medical_df.smoker.map(smoker_codes)

In [14]:
sex_codes = {'male':1, 'female':0}
medical_df['sex_code'] = medical_df.sex.map(sex_codes)

In [15]:
# numeric_only=True keeps pandas 2.0+ from raising on the text columns (sex, smoker, region)
medical_df.corr(numeric_only=True)

| | age | bmi | children | charges | smoker_code | northeast | northwest | southeast | southwest | sex_code |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.109272 | 0.042469 | 0.299008 | -0.025019 | 0.002475 | -0.000407 | -0.011642 | 0.010016 | -0.020856 |
| bmi | 0.109272 | 1.0 | 0.012759 | 0.198341 | 0.00375 | -0.138156 | -0.135996 | 0.270025 | -0.006205 | 0.046371 |
| children | 0.042469 | 0.012759 | 1.0 | 0.067998 | 0.007673 | -0.022808 | 0.024806 | -0.023066 | 0.021914 | 0.017163 |
| charges | 0.299008 | 0.198341 | 0.067998 | 1.0 | 0.787251 | 0.006349 | -0.039905 | 0.073982 | -0.04321 | 0.057292 |
| smoker_code | -0.025019 | 0.00375 | 0.007673 | 0.787251 | 1.0 | 0.002811 | -0.036945 | 0.068498 | -0.036945 | 0.076185 |
| northeast | 0.002475 | -0.138156 | -0.022808 | 0.006349 | 0.002811 | 1.0 | -0.320177 | -0.345561 | -0.320177 | -0.002425 |
| northwest | -0.000407 | -0.135996 | 0.024806 | -0.039905 | -0.036945 | -0.320177 | 1.0 | -0.346265 | -0.320829 | -0.011156 |
| southeast | -0.011642 | 0.270025 | -0.023066 | 0.073982 | 0.068498 | -0.345561 | -0.346265 | 1.0 | -0.346265 | 0.017117 |
| southwest | 0.010016 | -0.006205 | 0.021914 | -0.04321 | -0.036945 | -0.320177 | -0.320829 | -0.346265 | 1.0 | -0.004184 |
| sex_code | -0.020856 | 0.046371 | 0.017163 | 0.057292 | 0.076185 | -0.002425 | -0.011156 | 0.017117 | -0.004184 | 1.0 |
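
In the table above, `smoker_code` has by far the strongest correlation with `charges` (0.787251). For intuition about what that number measures, here is a small sketch of the Pearson correlation coefficient computed by hand. The helper `pearson` and the six toy rows are assumptions for illustration, not values from the dataset; pandas' `corr()` uses the Pearson method by default.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


# Hypothetical toy data: 1 = smoker, 0 = non-smoker, with made-up charges.
smoker = np.array([1, 0, 0, 1, 0, 1])
charges = np.array([30000.0, 4000.0, 6000.0, 25000.0, 3000.0, 40000.0])

r = pearson(smoker, charges)
print(r)

# Cross-check against NumPy's built-in correlation matrix.
print(np.corrcoef(smoker, charges)[0, 1])
```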


# **Using Scikit-Learn**

In [9]:
!pip install scikit-learn --quiet

In [10]:
# One-hot encoding: one binary column per region
from sklearn import preprocessing
encoder = preprocessing.OneHotEncoder()
encoder.fit(medical_df[['region']])
encoder.categories_

[array(['northeast', 'northwest', 'southeast', 'southwest'], dtype=object)]

In [11]:
one_hot = encoder.transform(medical_df[['region']]).toarray()
one_hot

array([[0., 0., 0., 1.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.]])
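
To make the encoder's output easier to read, here is a rough NumPy sketch of what one-hot encoding does, assuming the same category order that `encoder.categories_` reports above. The four toy rows are hypothetical; they are not the rows of the dataset.

```python
import numpy as np

# Category order as reported by encoder.categories_ (alphabetical).
categories = np.array(['northeast', 'northwest', 'southeast', 'southwest'])

# Hypothetical toy column of regions.
regions = np.array(['southwest', 'southeast', 'southeast', 'northwest'])

# Compare every row against every category: a row gets 1.0 in the column
# whose category matches it and 0.0 everywhere else.
one_hot_toy = (regions[:, None] == categories[None, :]).astype(float)
print(one_hot_toy)
```

Each row contains exactly one 1.0, and its position tells you the region: the last column is `southwest`, the third is `southeast`, and so on.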

In [16]:
medical_df[['northeast', 'northwest', 'southeast', 'southwest']] = one_hot
medical_df

| | age | sex | bmi | children | smoker | region | charges | smoker_code | northeast | northwest | southeast | southwest | sex_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1333 | 50 | male | 30.970 | 3 | no | northwest | 10600.54830 | 0 | 0.0 | 1.0 | 0.0 | 0.0 | 1 |
| 1334 | 18 | female | 31.920 | 0 | no | northeast | 2205.98080 | 0 | 1.0 | 0.0 | 0.0 | 0.0 | 0 |
| 1335 | 18 | female | 36.850 | 0 | no | southeast | 1629.83350 | 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 1336 | 21 | female | 25.800 | 0 | no | southwest | 2007.94500 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 |


In [18]:
import numpy as np
medical_df.select_dtypes(include=np.number).columns

Index(['age', 'bmi', 'children', 'charges', 'smoker_code', 'northeast',
       'northwest', 'southeast', 'southwest', 'sex_code'],
      dtype='object')

In [13]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [20]:
# RMSE
def rmse(targets, predictions):
  return np.sqrt(np.mean(np.square(targets-predictions)))
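
A tiny worked example may help show what the RMSE measures. The three target and prediction values below are invented for illustration; the function body is the same as the notebook's `rmse`.

```python
import numpy as np


def rmse(targets, predictions):
    """Root mean squared error between two equal-length arrays."""
    return np.sqrt(np.mean(np.square(targets - predictions)))


# Hypothetical values, not taken from the dataset.
toy_targets = np.array([100.0, 200.0, 300.0])
toy_predictions = np.array([110.0, 190.0, 330.0])

# Errors are -10, 10 and -30; their squares are 100, 100 and 900.
# The mean of the squares is 1100 / 3, and the RMSE is its square root.
toy_loss = rmse(toy_targets, toy_predictions)
print(toy_loss)
```

Because the errors are squared before averaging, the single error of 30 dominates the result, which is why the RMSE is sensitive to large misses.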

In [23]:
# Create inputs and targets
input_cols = ['age', 'bmi', 'children', 'smoker_code', 'northeast', 'northwest', 'southeast', 'southwest', 'sex_code']
inputs, targets = medical_df[input_cols], medical_df['charges']

# Create and Train Model
model = LinearRegression().fit(inputs, targets)

# Generate prediction
predictions = model.predict(inputs)

# compute loss to evaluate model
loss = rmse(targets, predictions)
print('Loss: ', loss)

Loss:  6041.6796511744515
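
One way to see what a fitted model learned is to pair each coefficient with its column name. The sketch below does this on synthetic, noise-free data (the weights 250 and 20000 and the intercept 1000 are invented for the example), so the recovered numbers say nothing about the real dataset. With the notebook's model, the same `zip` would presumably use `input_cols` and `model.coef_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for two of the notebook's columns.
feature_names = ['age', 'smoker_code']
toy_age = rng.integers(18, 65, size=200)
toy_smoker = rng.integers(0, 2, size=200)
X = np.column_stack([toy_age, toy_smoker]).astype(float)

# Made-up linear relationship without noise, so the fit can recover it exactly.
y = 250.0 * X[:, 0] + 20000.0 * X[:, 1] + 1000.0

toy_model = LinearRegression().fit(X, y)

# Map each feature name to its learned weight.
weights = dict(zip(feature_names, toy_model.coef_))
print(weights)
print('intercept:', toy_model.intercept_)
```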


In [22]:
# Create inputs and targets
input_cols = ['age', 'bmi', 'children']
inputs, targets = medical_df[input_cols], medical_df['charges']

# Create and Train Model
model = LinearRegression().fit(inputs, targets)

# Generate prediction
predictions = model.predict(inputs)

# compute loss to evaluate model
loss = rmse(targets, predictions)
print('Loss: ', loss)

Loss:  11355.317901125973


Scaling numeric features

In [29]:
from sklearn.preprocessing import StandardScaler
numeric_cols = ['age', 'bmi', 'children']
scaler = StandardScaler()
scaler.fit(medical_df[numeric_cols])

StandardScaler()

In [34]:
scaled_inputs = scaler.transform(medical_df[numeric_cols])
scaled_inputs

array([[-1.43876426, -0.45332   , -0.90861367],
       [-1.50996545,  0.5096211 , -0.07876719],
       [-0.79795355,  0.38330685,  1.58092576],
       ...,
       [-1.50996545,  1.0148781 , -0.90861367],
       [-1.29636188, -0.79781341, -0.90861367],
       [ 1.55168573, -0.26138796, -0.90861367]])
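
As a sanity check, the first scaled age can be reproduced by hand from the `describe()` output earlier in the notebook. The sketch assumes that `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1); the variable names are made up for this example.

```python
import numpy as np

# Summary statistics for `age`, copied from the describe() output.
n_rows = 1338
age_mean = 39.207025
age_sample_std = 14.04996  # ddof=1, as reported by describe()

# Convert the sample std to the population std (ddof=0).
age_population_std = age_sample_std * np.sqrt((n_rows - 1) / n_rows)

# The first row of the dataset has age 19.
scaled_age_19 = (19 - age_mean) / age_population_std
print(scaled_age_19)
```

The result should agree with the first entry of the scaled array (-1.43876426) to about four decimal places, given the rounding in the printed mean and standard deviation.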

Combine the scaled numeric features with the categorical data


In [33]:
cat_cols = ['smoker_code', 'northeast', 'northwest', 'southeast', 'southwest', 'sex_code']
categorical_data = medical_df[cat_cols].values

In [35]:
inputs = np.concatenate((scaled_inputs,categorical_data), axis=1)
targets = medical_df.charges

# Create and Train Model
model = LinearRegression().fit(inputs, targets)

# Generate prediction
predictions = model.predict(inputs)

# compute loss to evaluate model
loss = rmse(targets, predictions)
print('Loss: ', loss)

Loss:  6041.686321294424
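
The loss is nearly the same as before scaling. That is expected for ordinary least squares: standardizing the inputs changes the size of the coefficients but not the fitted predictions. Here is a minimal sketch on synthetic data (all numbers are invented and unrelated to the insurance dataset) that illustrates the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler


def rmse(targets, predictions):
    return np.sqrt(np.mean(np.square(targets - predictions)))


rng = np.random.default_rng(42)

# Two synthetic features on different scales, plus a noisy linear target.
X = rng.normal(loc=[40.0, 30.0], scale=[14.0, 6.0], size=(300, 2))
y = 250.0 * X[:, 0] + 300.0 * X[:, 1] + rng.normal(scale=1000.0, size=300)

# Fit once on the raw features and once on the standardized features.
raw_model = LinearRegression().fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled_model = LinearRegression().fit(X_scaled, y)

raw_loss = rmse(y, raw_model.predict(X))
scaled_loss = rmse(y, scaled_model.predict(X_scaled))
print(raw_loss, scaled_loss)

# Each scaled coefficient is the raw coefficient times that feature's std.
print(scaled_model.coef_, raw_model.coef_ * X.std(axis=0))
```

What scaling does buy is comparability: after standardization, each coefficient describes the effect of a one-standard-deviation change in its feature, so the weights can be compared with each other.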


Creating a test set 

In [39]:
from sklearn.model_selection import train_test_split
inputs_train, inputs_test, targets_train, targets_test = train_test_split(inputs, targets, test_size=0.1)

In [40]:
# Create and train the model
model = LinearRegression().fit(inputs_train, targets_train)

# Generate prediction
predictions_test = model.predict(inputs_test)

# compute loss to evaluate model
loss = rmse(targets_test, predictions_test)
print('Test Loss: ', loss)

Test Loss:  6127.780799313666
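
In the cells above, the scaler was fitted on all rows before the split, and the split has no fixed seed, so the test loss changes from run to run. One possible refinement, sketched here on synthetic data rather than the insurance dataset, is to split first with a `random_state` and let a pipeline fit the scaler on the training rows only. The data, the seed 42 and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic regression problem: 500 rows, 3 features, small noise.
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Fixing random_state makes the split, and so the test loss, reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# The pipeline fits the scaler on the training rows only, then the model.
pipeline = make_pipeline(StandardScaler(), LinearRegression())
pipeline.fit(X_train, y_train)

# At predict time the test rows are scaled with the training statistics.
test_predictions = pipeline.predict(X_test)
test_rmse = np.sqrt(np.mean(np.square(y_test - test_predictions)))
print('Test Loss: ', test_rmse)
```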
