Hi class! Sorry if we weren't able to do the session for this synchronously due to the General Education Exams. This notebook will walk you through the laboratory exercise. 

Familiarize yourself with using Jupyter Notebooks for now. You guys are CS students, I'm pretty sure you'll easily grasp it. 

# Your Task

Your lab exercise is to explore the different aspects of the Machine Learning Workflow and try to improve the performance of the model. You can adjust the following things in the entire workflow to improve performance:
- Add new features by computing ratios, correlations, etc. (feature engineering)
- Changing scaling technique (See MinMaxScaler in Scikit-Learn, etc.)
- Add Regularization (See Lasso Regression and Ridge Redgression in Scikit-Learn) 
- Other things you may think of

Afterwards, the expectation is to have a documentation that summarizes all your insights and experiments. Sample documentation can be seen in Blackboard. Once you complete your Lab Exercise, please submit it to Blackboard containing the documentation and the Jupyter Notebook.

Thank you!

# Working Codebase

For now, we'll install the necessary dependencies 

- **Numpy** is a numerical Python package that's often used for optimized numerical computations. It's very useful for intensive numerical computations on vectors, matrices, etc.
- **Pandas** is a data manipulation package in Python that loads data as a table akin to a SQL table and you can do numerous SQL-like operations on it such as aggregations, group by, etc.
- **Seaborn** is a Python visualization package
- **Matplotlib** is a Python visualization package
- **Scikit-learn** is the machine learning package in Python. It houses numerous functionalities aside from the algorithms that can be used for machine learning. It also contains preprocessing functions, evaluation metrics, etc.

In [None]:
%pip install pandas
%pip install numpy
%pip install seaborn
%pip install matplotlib
%pip install scikit-learn

In [1]:
### Importing all of pandas and numpy functionalities
import pandas as pd 
import numpy as np 

### Retrieving specifically the percentile function in numpy
from numpy import percentile

### Importing all of seaborn functionalities
import seaborn as sns

### Importing plotting capabilities of matplotlib
import matplotlib.pyplot as plt

### Importing preprocessing functionalities of scikit-learn
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

### Importing Linear Regression
from sklearn.linear_model import LinearRegression

### Importing evaluation metrics from scikit-learn
from sklearn import metrics

## Reading the Data
For reading the data, we use the `read_csv` function of pandas to read the Pandas Data Frame from a CSV File

In [642]:
df = pd.read_csv('insurance.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


## Exploratory Data Analysis
We'll try to inspect the data and look at its properties 

We're looking into the distributions of the data that we have right now. This gives us an idea if the variables or features that we have are good to use for our model

In [643]:
for i in df.columns : 
  if ( (df[i].dtypes == 'int64') | (df[i].dtypes == 'float64') ):
    sns.distplot(df[i])
    plt.title(i)
    plt.show()

Inspecting some summary statistics to see the consistency, averages, and ranges of the values that the data has

In [644]:
df.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


Looking into how many different values are there for categorical features

In [645]:
for i in df.columns:
  if df[i].dtypes == 'object':
    print(i)
    print(df[i].nunique())

sex
2
smoker
2
region
4


Checking for nulls in the data

In [646]:
df.isnull().sum().sum()

0

Looking at the boxplot to see the ranges of the data and observe outliers

In [647]:
for i in df.columns : 
  if (df[i].dtypes != 'object'):
    plt.title(i)
    sns.boxplot(df[i])
    plt.show()

## Preprocessing
This section deals with doing any necessary steps for preprocessing such as splitting the data, transforming the values, removing outliers, removing nulls, etc.

In this specific code block, we'll be removing the outliers we've seen in the Exploratory Data Analysis part, specifically the outlier charges

In [649]:
# calculate interquartile range
q25 = percentile(df['charges'], 25) 
q75 = percentile(df['charges'], 75)
iqr = q75 - q25
cutoff = 1.5 * iqr
lower = q25 - cutoff
upper = q75 + cutoff
median = np.median(df['charges'])
df['charges'] = np.where(df['charges'] > upper, median, df['charges'])
X = df.drop('charges', axis=1)
X = X.drop('children', axis=1)
y = df['charges']

In [650]:
X

Unnamed: 0,age,sex,bmi,smoker,region
0,19,female,27.900,yes,southwest
1,18,male,33.770,no,southeast
2,28,male,33.000,no,southeast
3,33,male,22.705,no,northwest
4,32,male,28.880,no,northwest
...,...,...,...,...,...
1333,50,male,30.970,no,northwest
1334,18,female,31.920,no,northeast
1335,18,female,36.850,no,southeast
1336,21,female,25.800,no,southwest


Splitting the data using scikit-learn's train-test split function

In [651]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

Converting the categorical features into numerical values that can be used as inputs to the model

In [652]:
ordinalbiner = ['sex', 'smoker']
for i in X_train[ordinalbiner].columns:
  label = LabelEncoder()
  X_train[i] = label.fit_transform(X_train[i])

for i in X_test[ordinalbiner].columns:
  label = LabelEncoder()
  X_test[i] = label.fit_transform(X_test[i])

X_train

Unnamed: 0,age,sex,bmi,smoker,region
560,46,0,19.950,0,northwest
1285,47,0,24.320,0,northeast
1142,52,0,24.860,0,southeast
969,39,0,34.320,0,southeast
486,54,0,21.470,0,northwest
...,...,...,...,...,...
1095,18,0,31.350,0,northeast
1130,39,0,23.870,0,southeast
1294,58,1,25.175,0,northeast
860,37,0,47.600,1,southwest


In [653]:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)

Scaling the data to help the model converge and also to keep values closer to one another

In [654]:
#from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
X_train = mms.fit_transform(X_train)
X_test = mms.fit_transform(X_test)

## Modelling

In [656]:
reg = LinearRegression().fit(X_train, y_train)
preds = reg.predict(X_test)

## Evaluation

In [658]:
mae = metrics.mean_absolute_error(y_test, preds)
mse = metrics.mean_squared_error(y_test, preds)
r2 = metrics.r2_score(y_test, preds)

print("The model performance for testing set")
print("--------------------------------------")
print('MAE is {}'.format(mae))
print('MSE is {}'.format(mse))
print('R2 score is {}'.format(r2))

The model performance for testing set
--------------------------------------
MAE is 3514.602680376034
MSE is 28700039.065507136
R2 score is 0.4286517310965453
