> # <u>Regression Model: Feature Engineering</u>

## Assignment Scope: Part I

Load the tips dataset.

(a). Create a column named price_per_person. This should be the total bill divided by the party size.

(b). Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

(c). Use select k best to select the top 2 features for predicting tip amount. What are they?

(d). Use recursive feature elimination to select the top 2 features for tip amount. What are they?

(e). Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features your are selecting?

## Import required Libraries

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_regression, RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from feature_eng_modules import train_split, scale_tips_data, get_tips_data

import warnings
warnings.filterwarnings('ignore')



## Acquire the data

In [2]:
tips = get_tips_data()
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,0,0,7,1,2
2,10.34,1.66,1,0,7,1,3
3,21.01,3.5,1,0,7,1,3
4,23.68,3.31,1,0,7,1,2
5,24.59,3.61,0,0,7,1,4


### (a). Create a column named price_per_person. This should be the total bill divided by the party size.


In [3]:
# Multiplied by 1000 to remove the decimals*
tips['price_per_person'] = ((tips.total_bill / tips.size)* 1000)
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
1,16.99,1.01,0,0,7,1,2,9.988242
2,10.34,1.66,1,0,7,1,3,6.078777
3,21.01,3.5,1,0,7,1,3,12.351558
4,23.68,3.31,1,0,7,1,2,13.921223
5,24.59,3.61,0,0,7,1,4,14.456202


## Split the data 

In [4]:
train, validate, test = train_split(tips)
print(train.shape, validate.shape, test.shape)

(39, 8) (155, 8) (49, 8)


# Scale the data

>- ##### It is important that data scaling happens after data splitting. We don't want to leak information from our test/validate splits by using those to calculate parameters for scaling.

In [5]:
# # Scaled data 

# train_scaled, validate_scaled, test_scaled = scale_tips_data(train, validate, test)
# train_scaled, validate_scaled, test_scaled

## (Returned data from function is encorded data**) 

>- #### Encoding Key:
>- ##### Mon == 1, .... Sun == 7
>- ##### Male == 1, Female == 0
>- ##### Yes == 1, No == 0


In [6]:
X_train = train[['total_bill', 'tip', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_train = train.tip

X_validate = validate[['total_bill', 'tip', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_validate = validate.tip

X_test = test[['total_bill', 'tip', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_test = test.tip


In [7]:
# Examine the data
X_train.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
204,16.4,2.5,0,1,4,0,2,9.641387
101,11.35,2.5,0,1,5,1,2,6.672546
157,48.17,5.0,1,0,7,1,6,28.318636
50,18.04,3.0,1,0,7,1,2,10.605526
15,14.83,3.02,0,0,7,1,2,8.718401


### (b). Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

> I think price per person is better than total bill as it narrows down to persons which has more meaning in understanding behaviors.

> Dinner would also be a good/ interesting predictor to observe disparities by meals times





### (c). Use select <u>KBest</u> to select the top 2 features for predicting tip amount. What are they?

> #### KBest Recommends 'tip' & 'price_per_person'

In [8]:
## SelectKBest lib' already imported above. continuing....

kbest = SelectKBest(f_regression, k = 2)

kbest.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7fa3847d2430>)

In [9]:
kbest_results = pd.DataFrame(dict(p = kbest.pvalues_, f = kbest.scores_), index = X_train.columns)
kbest_results


Unnamed: 0,p,f
total_bill,0.000157,17.720252
tip,0.0,inf
sex,0.9717,0.001276
smoker,0.756391,0.097676
day,0.264688,1.282678
time,0.452346,0.576891
size,0.005405,8.735074
price_per_person,0.000157,17.720252


### Get KBest predictors for tip

In [10]:
X_train.columns[kbest.get_support()]

Index(['tip', 'price_per_person'], dtype='object')

In [11]:

X_train_transformed = pd.DataFrame(kbest.transform(X_train),index = X_train.index,
                                   columns = X_train.columns[kbest.get_support()])

X_train_transformed.head()


Unnamed: 0,tip,price_per_person
204,2.5,9.641387
101,2.5,6.672546
157,5.0,28.318636
50,3.0,10.605526
15,3.02,8.718401


### (d). Use <u>Recursive Feature Elimination (RFE)</u> to select the top 2 features for tip amount. What are they?

>- ##### RFE recommends tip, & price_per_person


In [12]:
# Call linear reg'
model = LinearRegression()

# Fit the model and select best two features (n_features_to_select = 2)
rfe = RFE(model, n_features_to_select = 2)
rfe.fit(X_train, y_train)


RFE(estimator=LinearRegression(), n_features_to_select=2)

In [13]:
# Rank the RFE features

pd.DataFrame({'rfe_ranks': rfe.ranking_}, index = X_train.columns)

Unnamed: 0,rfe_ranks
total_bill,2
tip,1
sex,7
smoker,3
day,6
time,5
size,4
price_per_person,1


In [14]:
# Get FRE recommendation for best feature

X_train.columns[rfe.get_support()]

Index(['tip', 'price_per_person'], dtype='object')

In [15]:
# RFE Ranking
X_train_transformed = pd.DataFrame(rfe.transform(X_train),index = X_train.index,
                                   columns = X_train.columns[rfe.support_])

X_train_transformed.head()


Unnamed: 0,tip,price_per_person
204,2.5,9.641387
101,2.5,6.672546
157,5.0,28.318636
50,3.0,10.605526
15,3.02,8.718401


### (e). Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features your are selecting?

>- ### NOTE: If your dataset is large (> 1GB; df.info()) use select k best instead
