> # <u>Regression Model: Feature Engineering</u>

## 1. Assignment Scope

Load the tips dataset.

(a). Create a column named price_per_person. This should be the total bill divided by the party size.

(b). Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

(c). Use select k best to select the top 2 features for predicting tip amount. What are they?

(d). Use recursive feature elimination to select the top 2 features for tip amount. What are they?

(e). Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

## Import required Libraries

In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, f_regression, RFE, SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from feature_eng_modules import train_split, scale_tips_data, get_tips_data, get_swiss_data
from feature_eng_modules import select_kbest, rfe

import warnings
warnings.filterwarnings('ignore')



## Acquire the data

In [39]:
tips = get_tips_data()
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
1,16.99,1.01,0,0,7,1,2
2,10.34,1.66,1,0,7,1,3
3,21.01,3.5,1,0,7,1,3
4,23.68,3.31,1,0,7,1,2
5,24.59,3.61,0,0,7,1,4


### (a). Create a column named price_per_person. This should be the total bill divided by the party size.


In [40]:
# Price per person = total bill / party size

# tips.size is the DataFrame attribute (rows * cols), not the column,
# so the party-size column has to be accessed as tips['size']

tips['price_per_person'] = tips.total_bill / tips['size']
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,price_per_person
1,16.99,1.01,0,0,7,1,2,8.495
2,10.34,1.66,1,0,7,1,3,3.446667
3,21.01,3.5,1,0,7,1,3,7.003333
4,23.68,3.31,1,0,7,1,2,11.84
5,24.59,3.61,0,0,7,1,4,6.1475
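
The comment about `tips.size` is easy to miss, so here is a minimal sketch of the pitfall. The small `demo` frame is a stand-in built from the first rows shown above, not the real dataset.

```python
import pandas as pd

# Tiny stand-in for the tips data (values copied from the first rows above).
demo = pd.DataFrame({"total_bill": [16.99, 10.34, 21.01], "size": [2, 3, 3]})

# Attribute access resolves to DataFrame.size: the element count (rows * columns).
n_elements = demo.size

# Bracket access returns the party-size column as a Series.
party_size = demo["size"]

demo["price_per_person"] = demo["total_bill"] / party_size

print(n_elements)
print(demo)
```

Because `size` collides with a built-in DataFrame attribute, bracket access is the safer habit for any column whose name might shadow a pandas attribute.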


## Split the data 

In [41]:
train, validate, test = train_split(tips)
print(train.shape, validate.shape, test.shape)

(135, 8) (59, 8) (49, 8)
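
The body of `train_split` lives in `feature_eng_modules` and is not shown here. As a rough sketch only, a three-way split can be built from two calls to `train_test_split`. The name `train_split_sketch`, the 80/20 and 70/30 proportions, and the seed are all assumptions; these proportions happen to give the same row counts for 243 rows, but the real helper may work differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def train_split_sketch(df, seed=123):
    """Hypothetical stand-in for train_split: 80/20 split, then 70/30."""
    # First carve off the test set.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Then split what remains into train and validate.
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test


# Synthetic frame with the same number of rows as the three splits above combined.
demo = pd.DataFrame({"x": np.arange(243), "y": np.arange(243) * 2.0})
train, validate, test = train_split_sketch(demo)
print(train.shape, validate.shape, test.shape)
```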


## Scale the data

>- ##### It is important that data scaling happens after data splitting. We don't want to leak information from our test/validate splits by using those to calculate parameters for scaling.

In [42]:
# # Scaled data 

# train_scaled, validate_scaled, test_scaled = scale_tips_data(train, validate, test)
# train_scaled, validate_scaled, test_scaled
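
The scaling cell is commented out and `scale_tips_data` is not shown, so the following is only a sketch of the fit-on-train, transform-everything pattern that the note describes. The arrays are synthetic and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=20, scale=8, size=(100, 2))
X_validate_demo = rng.normal(loc=20, scale=8, size=(40, 2))

scaler = StandardScaler()

# Learn the mean and standard deviation from the training split only.
X_train_scaled = scaler.fit_transform(X_train_demo)

# Reuse the training parameters on validate (and test); never refit here.
X_validate_scaled = scaler.transform(X_validate_demo)

print(X_train_scaled.mean(axis=0).round(6))
print(X_validate_scaled.mean(axis=0).round(3))
```

The validate means will be close to zero but not exactly zero, which is expected: they were standardized with the training statistics rather than their own.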

## (Data returned from the function is already encoded)

>- #### Encoding Key:
>- ##### Mon == 1, .... Sun == 7
>- ##### Male == 1, Female == 0
>- ##### Yes == 1, No == 0
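
The encoding itself happens inside `get_tips_data`, which is not shown. A minimal sketch of how the key above could be applied with `Series.map` follows; the raw label spellings (for example `"Thur"`) are assumptions about the unencoded data.

```python
import pandas as pd

# Hypothetical raw labels; the real spellings in the source data may differ.
raw = pd.DataFrame({
    "sex": ["Female", "Male"],
    "smoker": ["No", "Yes"],
    "day": ["Sun", "Thur"],
})

day_codes = {"Mon": 1, "Tue": 2, "Wed": 3, "Thur": 4, "Fri": 5, "Sat": 6, "Sun": 7}

encoded = raw.assign(
    sex=raw["sex"].map({"Female": 0, "Male": 1}),
    smoker=raw["smoker"].map({"No": 0, "Yes": 1}),
    day=raw["day"].map(day_codes),
)
print(encoded)
```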


In [43]:
train.columns

Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size',
       'price_per_person'],
      dtype='object')

In [44]:
X_train = train[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_train = train.tip

X_validate = validate[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_validate = validate.tip

X_test = test[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_test = test.tip


In [45]:
# Examine the data
X_train.head()

Unnamed: 0,total_bill,sex,smoker,day,time,size,price_per_person
149,9.78,1,0,4,0,2,4.89
214,13.27,0,1,6,1,2,6.635
15,14.83,0,0,7,1,2,7.415
97,27.28,1,1,5,1,2,13.64
124,15.95,1,0,4,0,2,7.975


### (b). Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount?

> I think price per person is a better predictor than total bill because it normalizes the bill by party size, which says more about how each diner behaves.

> Time of day (lunch vs. dinner) would also be an interesting predictor, to see whether tipping differs by meal.





### (c). Use select <u>KBest</u> to select the top 2 features for predicting tip amount. What are they?

> #### KBest recommends 'total_bill' & 'size'

In [46]:
# SelectKBest and f_regression were imported above

kbest = SelectKBest(f_regression, k = 2)

kbest.fit(X_train, y_train)

SelectKBest(k=2, score_func=<function f_regression at 0x7f7941ace4c0>)

In [47]:
kbest_results = pd.DataFrame(dict(p = kbest.pvalues_, 
                                  f = kbest.scores_), 
                             index = X_train.columns)
kbest_results


Unnamed: 0,p,f
total_bill,1.133953e-16,90.388913
sex,0.1865121,1.763105
smoker,0.5950088,0.283957
day,0.01233037,6.437157
time,0.03174534,4.711034
size,6.156797e-13,63.608371
price_per_person,0.01346994,6.272655
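
The `f` column is a univariate score: `f_regression` looks at each feature on its own, and with its default settings the F-statistic depends only on that feature's Pearson correlation `r` with the target, `F = r² / (1 - r²) * (n - 2)`. The sketch below checks this on synthetic data (not the tips data).

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
n = 135
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Scores as SelectKBest would see them.
f_sklearn, p_values = f_regression(X, y)

# The same scores rebuilt from each feature's correlation with the target.
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
f_manual = r**2 / (1 - r**2) * (n - 2)

print(np.allclose(f_sklearn, f_manual))
```

Since each feature is scored in isolation, two strongly correlated predictors can both score highly even if the second adds little once the first is in the model.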


### Get KBest predictors for tip

In [48]:
X_train.columns[kbest.get_support()]

Index(['total_bill', 'size'], dtype='object')

In [49]:

X_train_transformed = pd.DataFrame(kbest.transform(X_train),index = X_train.index,
                                   columns = X_train.columns[kbest.get_support()])

X_train_transformed.head()


Unnamed: 0,total_bill,size
149,9.78,2.0
214,13.27,2.0
15,14.83,2.0
97,27.28,2.0
124,15.95,2.0


### (d). Use <u>Recursive Feature Elimination (RFE)</u> to select the top 2 features for tip amount. What are they?

>- ##### RFE recommends 'day' & 'size'


In [50]:
# Instantiate the linear regression estimator
model = LinearRegression()

# Fit the model and select best two features (n_features_to_select = 2)
rfe = RFE(model, n_features_to_select = 2)
rfe.fit(X_train, y_train)


RFE(estimator=LinearRegression(), n_features_to_select=2)

In [51]:
# Rank the RFE features

pd.DataFrame({'rfe_ranks': rfe.ranking_}, index = X_train.columns)

Unnamed: 0,rfe_ranks
total_bill,3
sex,2
smoker,4
day,1
time,5
size,1
price_per_person,6
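
In `rfe_ranks`, rank 1 marks the selected features and larger ranks mark features that were eliminated earlier. With a `LinearRegression` estimator and the default step of one, each round refits the model and drops the feature with the smallest coefficient magnitude. The loop below is a simplified sketch of that procedure on synthetic data, compared against `RFE` itself.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(135, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=135)

# Manual elimination: refit, drop the weakest coefficient, repeat.
remaining = list(range(X.shape[1]))
while len(remaining) > 2:
    coefs = LinearRegression().fit(X[:, remaining], y).coef_
    weakest = int(np.argmin(np.abs(coefs)))
    remaining.pop(weakest)

# The library version for comparison.
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
sklearn_kept = np.flatnonzero(selector.support_).tolist()

print(remaining, sklearn_kept)
print(selector.ranking_)
```

Because the elimination criterion is the raw coefficient size, the result depends on the units of each column, which matters when the predictors have not been scaled.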


In [52]:
# Get the RFE-selected features

X_train.columns[rfe.get_support()]

Index(['day', 'size'], dtype='object')

In [53]:
# Keep only the RFE-selected columns
X_train_transformed = pd.DataFrame(rfe.transform(X_train),index = X_train.index,
                                   columns = X_train.columns[rfe.support_])

X_train_transformed.head()


Unnamed: 0,day,size
149,4.0,2.0
214,6.0,2.0
15,7.0,2.0
97,5.0,2.0
124,4.0,2.0


### (e). Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

>- ### NOTE: If your dataset is large (> 1 GB; check memory usage with df.info()), use SelectKBest instead, since RFE refits the model repeatedly
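
One way to think about the difference: SelectKBest scores every feature on its own against the target, while RFE fits a model on all features together and repeatedly drops the one with the smallest coefficient. Correlation between predictors and the scale of each column can therefore change RFE's answer without changing SelectKBest's. In this notebook the scaling cell is commented out, so RFE ran on unscaled columns, which may be part of why small-range columns such as `day` and `size` outranked `total_bill`. The overlap also shifts with the count: with two features the methods agree only on `size`, and with three (see the function calls in sections 2 and 3) they agree on `day` and `size`. The sketch below uses synthetic data with made-up column names to illustrate the scale effect.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "x_big": rng.normal(0, 100, n),   # wide range, small coefficient
    "x_small": rng.normal(0, 1, n),   # narrow range, larger coefficient
})
# x_big explains most of the variation in y despite its small coefficient.
y = 0.05 * X["x_big"] + 0.5 * X["x_small"] + rng.normal(0, 1, n)

kbest = SelectKBest(f_regression, k=1).fit(X, y)
kbest_pick = X.columns[kbest.get_support()].tolist()

rfe_raw = RFE(LinearRegression(), n_features_to_select=1).fit(X, y)
rfe_raw_pick = X.columns[rfe_raw.get_support()].tolist()

# Same RFE, but on standardized columns so coefficients are comparable.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
rfe_scaled = RFE(LinearRegression(), n_features_to_select=1).fit(X_scaled, y)
rfe_scaled_pick = X.columns[rfe_scaled.get_support()].tolist()

print(kbest_pick, rfe_raw_pick, rfe_scaled_pick)
```

Here the unscaled RFE keeps the narrow-range column because its coefficient is numerically larger, while SelectKBest and the scaled RFE both keep the column that actually explains more of the target.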





### (2). Write a function named select_kbest that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the SelectKBest class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [54]:
X_train = train[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_train = train.tip

X_validate = validate[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_validate = validate.tip

X_test = test[['total_bill', 'sex','smoker', 'day', 'time', 'size', 'price_per_person']]
y_test = test.tip

### Split the tips data

In [55]:
train, validate, test = train_split(tips)
print(train.shape, validate.shape, test.shape)

(135, 8) (59, 8) (49, 8)


In [71]:
# Create the train, validate and test sets

X_train = train.drop(columns = 'tip')
y_train = train.tip

X_validate = validate.drop(columns = 'tip')
y_validate = validate.tip

X_test = test.drop(columns = 'tip')
y_test = test.tip

# Predictors and target

# predictors = X_train
# target = y_train

# The function prompts for the feature count, so the third argument (0) acts as a placeholder
select_kbest(X_train, y_train, 0)

Enter count of SelectKBest features to return: 3


Index(['total_bill', 'day', 'size'], dtype='object')
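
The body of `select_kbest` is in `feature_eng_modules` and is not shown. As a sketch under that caveat, a non-interactive version that uses its `k` argument directly could look like this. The name `select_kbest_sketch` and the synthetic demo data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression


def select_kbest_sketch(X, y, k):
    """Return the names of the k features with the highest F-scores."""
    selector = SelectKBest(f_regression, k=k).fit(X, y)
    return X.columns[selector.get_support()]


# Synthetic check: only columns "a" and "c" drive the target.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(120, 4)), columns=["a", "b", "c", "d"])
y_demo = 2.0 * X_demo["a"] - 1.5 * X_demo["c"] + rng.normal(size=120)

top_two = list(select_kbest_sketch(X_demo, y_demo, 2))
print(top_two)
```

Taking `k` from the argument rather than from `input()` keeps the function usable in scripts and makes the result reproducible from the call alone.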

### (3). Write a function named rfe that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the RFE class. Test your function with the tips dataset. You should see the same results as when you did the process manually.

In [58]:
# Using the tips data splits from (2)

X_test = test.drop(columns = 'tip')
y_test = test.tip

In [59]:
def rfe(predictors, target, num_features):
    '''
        Takes in the predictors, the target, and the number of features desired,
        and returns the names of the top features chosen by Recursive Feature
        Elimination (RFE) with a linear regression estimator.
    '''
    model = LinearRegression()
    
    # Note: this prompt overrides the num_features argument passed by the caller
    num_features = int(input('Enter count of RFE features to return: '))
    
    rfe = RFE(model, n_features_to_select = num_features)
    
    rfe.fit(predictors, target)
    
    # Boolean mask of the selected columns
    result = rfe.get_support()
    
    return predictors.columns[result]

In [60]:
# Call the RFE feature selection model

rfe(X_train, y_train, 2)

Enter count of RFE features to return: 3


Index(['sex', 'day', 'size'], dtype='object')
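
Note that the call above passes 2 but the prompt was answered with 3, so three features come back. A non-interactive variant that honors its argument might look like the sketch below; `rfe_features` is a hypothetical name chosen to avoid shadowing the `rfe` function and the earlier `rfe` selector object, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rfe_features(X, y, k):
    """Return the names of the k features kept by RFE with linear regression."""
    selector = RFE(LinearRegression(), n_features_to_select=k).fit(X, y)
    return X.columns[selector.get_support()]


# Synthetic check: only columns "a" and "c" drive the target.
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(120, 4)), columns=["a", "b", "c", "d"])
y_demo = 2.0 * X_demo["a"] - 1.5 * X_demo["c"] + rng.normal(size=120)

kept = list(rfe_features(X_demo, y_demo, 2))
print(kept)
```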

### (4). Load the swiss dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [61]:
# Load the swiss dataset

swiss = get_swiss_data()
swiss.head()

Unnamed: 0,Fertility,Agriculture,Examination,Education,Catholic,Infant.Mortality
Courtelary,80.2,17.0,15,12,9.96,22.2
Delemont,83.1,45.1,6,9,84.84,22.2
Franches-Mnt,92.5,39.7,5,5,93.4,20.2
Moutier,85.8,36.5,12,7,33.77,20.3
Neuveville,76.9,43.5,17,15,5.16,20.6


## Split the swiss data 

In [73]:
swiss_train, swiss_val, swiss_test = train_split(swiss)

swiss_train.shape, swiss_val.shape, swiss_test.shape

((25, 6), (12, 6), (10, 6))

In [77]:
# Create the train, validate and test sets

X_train = swiss_train.drop(columns = 'Fertility')
y_train = swiss_train.Fertility

X_validate = swiss_val.drop(columns = 'Fertility')
y_validate = swiss_val.Fertility

X_test = swiss_test.drop(columns = 'Fertility')
y_test = swiss_test.Fertility

### Using SelectKBest on Swiss Data


In [81]:
# Using the SelectKBest 

select_kbest(X_train, y_train, 0)

Enter count of SelectKBest features to return: 2


Index(['Examination', 'Education'], dtype='object')

### Using RFE on Swiss Data

In [80]:
rfe(X_train, y_train, 0)

Enter count of RFE features to return: 2


Index(['Examination', 'Infant.Mortality'], dtype='object')