# Tip Prediction Using Polynomial Features
**Objective**:
- Engineer features with polynomial features
- Fit a multiple linear regression model on the `tips` dataset to predict the tip from the polynomial features
- Create an error distribution based on the training results
- Build a table of evaluation metrics

---

## Load package

In [202]:
## Import the necessary packages
import sklearn
import statsmodels
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

---

## Load dataset

In [203]:
## Load the tips dataset
df_tips = sns.load_dataset('tips')

In [204]:
## Preview the tips dataset
df_tips.head()

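| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |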


In [205]:
## Column information for df_tips
df_tips.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


---

## Splitting Dataset to Train and Test Set

In [206]:
## Import package
from sklearn.model_selection import train_test_split

In [207]:
# Define the feature and target columns
x = df_tips.drop(columns=['sex','smoker','day','time','tip'])
y = df_tips['tip']

In [208]:
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size=.80,random_state=42)

In [209]:
x_train.head()

| | total_bill | size |
|---|---|---|
| 228 | 13.28 | 2 |
| 208 | 24.27 | 2 |
| 96 | 27.28 | 2 |
| 167 | 31.71 | 4 |
| 84 | 15.98 | 2 |


In [210]:
y_train.head()

228    2.72
208    2.03
96     4.00
167    4.50
84     2.03
Name: tip, dtype: float64

In [211]:
x_test.head()

| | total_bill | size |
|---|---|---|
| 24 | 19.82 | 2 |
| 6 | 8.77 | 2 |
| 153 | 24.55 | 4 |
| 211 | 25.89 | 4 |
| 198 | 13.0 | 2 |


In [212]:
y_test.head()

24     3.18
6      2.00
153    2.00
211    5.16
198    2.00
Name: tip, dtype: float64
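
The split above can be sanity-checked by looking at the resulting sizes. The sketch below is a minimal illustration on a synthetic stand-in frame (the `demo` data and the `x_tr`/`x_te` names are assumptions, not part of the notebook); it only shares the row count of 244 with `df_tips`. With `train_size=.80`, scikit-learn should put 195 rows in the training set and the remaining 49 in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the tips dataset (244).
# The values are made up; only the shape matters for this check.
n_rows = 244
demo = pd.DataFrame({
    "total_bill": np.linspace(5.0, 50.0, n_rows),
    "size": np.arange(n_rows) % 6 + 1,
})
demo["tip"] = 0.15 * demo["total_bill"]

x_demo = demo[["total_bill", "size"]]
y_demo = demo["tip"]

# Same arguments as the notebook's split
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=0.80, random_state=42
)

print(len(x_tr), len(x_te))             # sizes of the two splits
print(x_tr.index.equals(y_tr.index))    # features and target stay aligned
```

Checking that the feature and target indexes match is a cheap way to confirm that rows were shuffled together rather than independently.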

---

## Feature Engineering - Polynomial Feature

In [213]:
## Import polynomial feature package
from sklearn.preprocessing import PolynomialFeatures

In [214]:
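# degree=2 with interaction_only=True keeps total_bill and size and adds only their product (no squared terms)
Poli=PolynomialFeatures(degree=2,include_bias=False, interaction_only=True)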

In [215]:
Poli=Poli.fit(x_train)

In [216]:
x_trainPoli=Poli.transform(x_train)

In [217]:
x_testPoli=Poli.transform(x_test)

In [218]:
df_xtrainPoli=pd.DataFrame(x_trainPoli)
df_xtrainPoli.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 13.28 | 2.0 | 26.56 |
| 1 | 24.27 | 2.0 | 48.54 |
| 2 | 27.28 | 2.0 | 54.56 |
| 3 | 31.71 | 4.0 | 126.84 |
| 4 | 15.98 | 2.0 | 31.96 |


In [219]:
df_xtestPoli=pd.DataFrame(x_testPoli)
df_xtestPoli.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 19.82 | 2.0 | 39.64 |
| 1 | 8.77 | 2.0 | 17.54 |
| 2 | 24.55 | 4.0 | 98.2 |
| 3 | 25.89 | 4.0 | 103.56 |
| 4 | 13.0 | 2.0 | 26.0 |


---

## Machine Learning Modelling using Multiple Linear Regression

In [220]:
from sklearn.linear_model import LinearRegression

In [221]:
# fit_intercept=False fits the model without an intercept term
Model = LinearRegression(fit_intercept=False)

In [222]:
Model.fit(df_xtrainPoli,y_train)

In [223]:
Model.score(df_xtrainPoli, y_train)

0.4442507925541529
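
`Model.score` returns the coefficient of determination (R-squared) for a regressor, which is why the same value appears again in the evaluation section. The sketch below illustrates this on synthetic data (the arrays `X` and `y` and both model names are assumptions for illustration). It also shows the effect of `fit_intercept=False`: the fitted `intercept_` is fixed at zero, and on the training data an intercept-free fit cannot score higher than the same model with an intercept.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with a non-zero true intercept (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 50.0, size=(100, 2))
y = 1.0 + 0.1 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0.0, 0.5, size=100)

no_intercept = LinearRegression(fit_intercept=False).fit(X, y)
with_intercept = LinearRegression(fit_intercept=True).fit(X, y)

# score() and r2_score() agree on the same data
same = np.isclose(no_intercept.score(X, y), r2_score(y, no_intercept.predict(X)))

print(same)
print(no_intercept.intercept_)
print(no_intercept.score(X, y), with_intercept.score(X, y))
```

Whether dropping the intercept is appropriate for tips is a modelling choice; comparing both variants on the training and test sets is one way to check it.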

---

## Evaluation Metrics

In [224]:
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error, median_absolute_error

In [225]:
pred_test = Model.predict(df_xtestPoli)

In [226]:
pred_train = Model.predict(df_xtrainPoli)
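
Once training predictions exist, the error distribution named in the objective can be built from the residuals. The sketch below is a minimal illustration using the five `y_train` values shown earlier and hypothetical predictions (`y_hat` is made up, not model output). It bins the residuals with `np.histogram`; in the notebook the same idea would apply to `y_train - pred_train`, for example with `plt.hist` or `sns.histplot`.

```python
import numpy as np

# First five training targets from y_train.head()
y_true = np.array([2.72, 2.03, 4.00, 4.50, 2.03])
# Hypothetical predictions, for illustration only
y_hat = np.array([2.50, 3.00, 3.50, 4.00, 2.50])

# Residual = actual minus predicted; positive means the model under-predicted
residuals = y_true - y_hat

# Bin the residuals to see how the errors are distributed
counts, edges = np.histogram(residuals, bins=4)

print(residuals)
print(counts)
print(edges)
```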

In [227]:
r2_train=r2_score(y_train,pred_train)
r2_train

0.4442507925541529

In [228]:
MAE_train = mean_absolute_error(y_train,pred_train)
MAE_train

0.7605982358224415

In [229]:
MSE_train = mean_squared_error(y_train,pred_train)
MSE_train

1.1291198023772115

In [230]:
RMSE_train = np.sqrt(MSE_train)
RMSE_train

1.0626004904841762

In [231]:
print("Evaluation Metrics for Training Data")
print('='*50)
print('R-squared =', r2_train.round(2))
print('MAE =', MAE_train.round(2))
print('MSE =', MSE_train.round(2))
print('RMSE =', RMSE_train.round(2))

Evaluation Metrics for Training Data
==================================================
R-squared = 0.44
MAE = 0.76
MSE = 1.13
RMSE = 1.06
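
For reference, the four metrics can be written out directly from their definitions. The sketch below computes them with NumPy and cross-checks against `sklearn.metrics`, using the five `y_train` values shown earlier and hypothetical predictions (`y_hat` is made up for illustration).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.72, 2.03, 4.00, 4.50, 2.03])
y_hat = np.array([2.50, 3.00, 3.50, 4.00, 2.50])  # hypothetical predictions

err = y_true - y_hat

mae = np.mean(np.abs(err))    # mean absolute error
mse = np.mean(err ** 2)       # mean squared error
rmse = np.sqrt(mse)           # root of the MSE, in the same units as the tip
# R-squared: 1 - residual sum of squares / total sum of squares
r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Cross-check against scikit-learn's implementations
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Because RMSE is derived from MSE, each split's RMSE has to be computed from that same split's MSE.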


In [232]:
r2_test=r2_score(y_test,pred_test)
r2_test

0.47803089236800167

In [233]:
MAE_test = mean_absolute_error(y_test,pred_test)
MAE_test

0.6619717561823287

In [234]:
MSE_test = mean_squared_error(y_test,pred_test)
MSE_test

0.6524464276576134

In [235]:
RMSE_test = np.sqrt(MSE_test)
RMSE_test

0.80774

In [236]:
print("Evaluation Metrics for Test Data")
print('='*50)
print('R-squared =', r2_test.round(2))
print('MAE =', MAE_test.round(2))
print('MSE =', MSE_test.round(2))
print('RMSE =', RMSE_test.round(2))

Evaluation Metrics for Test Data
==================================================
R-squared = 0.48
MAE = 0.66
MSE = 0.65
RMSE = 0.81


In [237]:
## EVALUATION METRICS COMPARISON
eva_matrix={
    'Multiple Linear Regression Training':[r2_train,MAE_train,MSE_train,RMSE_train],
    'Multiple Linear Regression Testing':[r2_test,MAE_test,MSE_test,RMSE_test],
}
summary=pd.DataFrame(eva_matrix,index=['R-Squared','MAE','MSE','RMSE']).round(2)
summary.T

| | R-Squared | MAE | MSE | RMSE |
|---|---|---|---|---|
| Multiple Linear Regression Training | 0.44 | 0.76 | 1.13 | 1.06 |
| Multiple Linear Regression Testing | 0.48 | 0.66 | 0.65 | 0.81 |
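
Computing the metrics through a single function is one way to keep training and test values from being mixed up when cells are copied. The sketch below assumes a hypothetical helper `evaluate` (not part of the notebook) and made-up target and prediction arrays; in the notebook it would be called with `y_train, pred_train` and `y_test, pred_test`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Return the four metrics for one split (hypothetical helper)."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R-Squared": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # always derived from this split's own MSE
    }

# Made-up targets and predictions, for illustration only
y_train_demo = np.array([2.72, 2.03, 4.00, 4.50, 2.03])
pred_train_demo = np.array([2.50, 3.00, 3.50, 4.00, 2.50])
y_test_demo = np.array([3.18, 2.00, 2.00, 5.16, 2.00])
pred_test_demo = np.array([3.00, 1.80, 3.50, 4.00, 2.20])

# One row per split, one column per metric
summary_demo = pd.DataFrame({
    "Training": evaluate(y_train_demo, pred_train_demo),
    "Testing": evaluate(y_test_demo, pred_test_demo),
}).T.round(2)

print(summary_demo)
```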
