# Understanding Regression Algorithms

### Scenario: Predicting patient charges based on age, sex, bmi, children, smoker and region

In [14]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

1. Dataset Loading:

We load the insurance dataset (insurance.csv, taken from Kaggle's open datasets). It has columns for age, sex, bmi, children, smoker, region, and charges.


In [15]:
# Load the dataset (from Kaggle's open datasets)

df = pd.read_csv("insurance.csv")
df.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


In [16]:
# Print a summary of the dataset (columns, non-null counts, dtypes)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [17]:
# Display the unique values in the region column
df['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [18]:
# Convert the object (string) columns to integer codes

df['sex'] = df['sex'].map({'female': 1, 'male': 0})  # female = 1, male = 0
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})  # yes = 1, no = 0
# southwest = 1, southeast = 2, northwest = 3, northeast = 4
df['region'] = df['region'].map({'southwest': 1, 'southeast': 2, 'northwest': 3, 'northeast': 4})
df.head(4)

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,1,27.9,0,1,1,16884.924
1,18,0,33.77,1,0,2,1725.5523
2,28,0,33.0,3,0,2,4449.462
3,33,0,22.705,0,0,3,21984.47061
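
The integer mapping above treats region as if southwest < southeast < northwest < northeast, and a linear model will read those codes as a numeric scale. One common alternative is one-hot encoding. The sketch below uses a small toy frame rather than the real data; the toy values and the choice of drop_first=True are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame reusing the region labels from the insurance data (values are illustrative only)
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# One indicator column per region; drop_first=True drops one level so the
# indicator columns are not perfectly collinear in a linear model
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding the model learns a separate offset for each region instead of a single slope across arbitrary codes.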


2. Data Preprocessing:

We split the dataset into features (age, sex, bmi, children, smoker, region) and the target variable (charges).
We then split the dataset into training and test sets to train the model on one part of the data and test its performance on unseen data.

In [19]:
# Separate features and target (the data has no missing values, so no cleaning is needed)
x = df[['age', 'sex', 'bmi', 'children', 'smoker', 'region']]
y = df['charges']


In [20]:
# Split the data into training and test sets (80% training, 20% testing)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
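
To see what the 80/20 split produces for a dataset of this size, the following sketch runs train_test_split on placeholder arrays with 1338 rows and 6 columns. The arrays X_demo and y_demo are hypothetical stand-ins, not the insurance data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the insurance features and target
n_rows = 1338
X_demo = np.arange(n_rows * 6).reshape(n_rows, 6)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)
print(np.array_equal(X_te, X_te2))
```

Fixing random_state is what makes the reported metrics repeatable from run to run.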

3. Model Training:

We use Linear Regression, which is part of Scikit-Learn, to train our model on the training data.

In [21]:
# Train a Linear Regression Model
model = LinearRegression()
model.fit(x_train, y_train)

LinearRegression()
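
After fitting, a LinearRegression model exposes the learned weights as coef_ and intercept_. The sketch below is a minimal illustration on synthetic data with known weights (3, -2 and an intercept of 5, all chosen arbitrarily), showing that the fit recovers them; it does not use the insurance data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data built from known (arbitrary) weights
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, plus a single intercept
print(demo_model.coef_, demo_model.intercept_)
```

For the notebook's model, pd.Series(model.coef_, index=x.columns) would pair each coefficient with its feature name.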

4. Model Evaluation:

We calculate the Mean Squared Error (MSE) and R-squared (R²) value to evaluate how well the model predicts charges.
The R-squared value indicates how much of the variance in the dependent variable (charges) is explained by the independent variables (age, sex, bmi, children, smoker, and region). An R² value close to 1 means a good fit.

In [22]:
# Predict charges for the test set
y_pred = model.predict(x_test)

In [23]:
# Calculate the Mean Squared Error and R-Squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-Squared Value: {r2}")


Mean Squared Error: 33635210.431178406
R-Squared Value: 0.7833463107364539
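
To make the two metrics concrete, the sketch below computes MSE, its square root (RMSE), and R² by hand with NumPy on a tiny made-up example and checks them against scikit-learn. The arrays y_true and y_hat are illustrative values, not results from the insurance model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Tiny made-up example: true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 9.5])

# MSE: mean of squared residuals; RMSE puts it back in the target's units
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R^2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, rmse_manual, r2_manual)
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because MSE is in squared units, its square root is easier to interpret: for the MSE reported above (about 33.6 million), the RMSE is roughly 5,800 in the units of charges.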


5. Prediction:

We use the trained model to predict charges for a 34-year-old male smoker from the northwest region with a BMI of 33 and 2 children. The predicted charge is displayed.

In [26]:
# Make a prediction for a new patient

# Values follow the training column order; passing a DataFrame with the same
# column names avoids scikit-learn's feature-name warning
new_patient = pd.DataFrame([[34, 0, 33, 2, 1, 3]], columns=x.columns)
predicted_charges = model.predict(new_patient)
print(f"Predicted charges for a 34-year-old male smoker with BMI 33 and 2 children from the northwest: ${predicted_charges[0]:,.2f}")

Predicted charges for a 34-year-old male smoker with BMI 33 and 2 children from the northwest: $32,082.02
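
A linear regression prediction is just the intercept plus the dot product of the coefficients with the feature values. The sketch below checks this on a model fitted to random synthetic data with six features; the weights and data are arbitrary assumptions, so the number it prints is unrelated to the insurance prediction above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Random synthetic data with six features (arbitrary weights plus a little noise)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 50, size=(100, 6))
true_weights = rng.normal(size=6)
y_demo = X_demo @ true_weights + rng.normal(0, 0.1, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Same feature layout as the notebook's new patient: one row, six values
new_row = np.array([[34, 0, 33, 2, 1, 3]], dtype=float)

# Manual prediction: intercept + sum(coefficient * feature value)
manual = demo_model.intercept_ + new_row @ demo_model.coef_
from_model = demo_model.predict(new_row)
print(manual[0], from_model[0])
```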
