# Feature Engineering Exercises

Do your work for this exercise in a Jupyter notebook named `feature_engineering` within the `regression-exercises` repo. Add, commit, and push your work.

### 1. Load the `tips` dataset.

In [1]:
# import modules
from pydataset import data
import pandas as pd
# load data, assign to variable
tips = data('tips')
# preview
tips.head()

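| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |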


#### Create a column named `tip_percentage`. This should be the tip amount divided by the total bill.

In [2]:
# create column
tips['tip_percentage'] = tips.tip/tips.total_bill
# re-view
tips.head()

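| | total_bill | tip | sex | smoker | day | time | size | tip_percentage |
|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0.166587 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.13978 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |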


#### Create a column named `price_per_person`. This should be the total bill divided by the party size.

In [5]:
# create column; index with tips['size'] because tips.size is the DataFrame's element count
tips['price_per_person'] = (tips.total_bill / tips['size']).round(2)
# re-view
tips.head()

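| | total_bill | tip | sex | smoker | day | time | size | tip_percentage | price_per_person |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 | 8.49 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 | 3.45 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0.166587 | 7.0 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.13978 | 11.84 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 | 6.15 |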
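
The bracket lookup `tips['size']` in the cell above matters, because `size` is also the name of a built-in DataFrame attribute. The sketch below illustrates the difference on a small frame rebuilt from the first five rows of the preview; the variable names `df`, `n_cells`, and `party_size` are only illustrative.

```python
import pandas as pd

# First five rows of two tips columns, copied from the preview above.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "size": [2, 3, 3, 2, 4],
})

# Attribute access resolves to the built-in DataFrame.size property:
# the total number of cells (5 rows x 2 columns), not the party-size column.
n_cells = df.size

# Bracket access always returns the column itself.
party_size = df["size"]

# Same calculation as the notebook cell, on the five-row sample.
price_per_person = (df["total_bill"] / party_size).round(2)
print(n_cells)
print(price_per_person.tolist())
```

Any column whose name collides with a DataFrame attribute or method (`size`, `count`, `shape`, and so on) needs the bracket form.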


#### Before using any of the methods discussed in the lesson, which features do you think would be most important for predicting the tip amount? The tip percentage?

I think `size` and `total_bill` will be the most important features for predicting the tip amount. `tip_percentage` would predict the tip amount almost perfectly, but only because it is computed from the tip itself. That is target leakage, so it would not be a legitimate predictor.
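
To make the leakage concrete, here is a minimal sketch using the first five rows from the preview above. It only shows that `tip_percentage` combined with `total_bill` reproduces the target exactly; the list names are illustrative, not part of the exercise.

```python
# First five rows of the tips data, copied from the preview above.
total_bill = [16.99, 10.34, 21.01, 23.68, 24.59]
tip = [1.01, 1.66, 3.50, 3.31, 3.61]

# The engineered feature is built directly from the target.
tip_percentage = [t / b for t, b in zip(tip, total_bill)]

# Multiplying it back by total_bill recovers the target,
# up to floating-point rounding error.
reconstructed = [p * b for p, b in zip(tip_percentage, total_bill)]
max_error = max(abs(r - t) for r, t in zip(reconstructed, tip))
print(f"largest reconstruction error: {max_error:.2e}")
```

A model given both columns can learn this product, so its apparent accuracy would say nothing about how well it predicts tips for a bill whose tip is not yet known.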

#### Use select k best and recursive feature elimination to select the top 2 features for predicting tip amount. What are they?

In [7]:
# import functions
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression

# split tips into x and y

# define predictors
x = tips[['total_bill', 'size', 'tip_percentage', 'price_per_person']]
# define target
y = tips.tip

# set parameters for f_selector object
f_selector = SelectKBest(f_regression, k=2)
# fit object to data
f_selector.fit(x, y)
# get bool mask of features list
f_mask = f_selector.get_support()
# get list of features (True)
f_feature = x.iloc[:,f_mask].columns.tolist()
f_feature

['total_bill', 'size']

In [8]:
# create linear regression object
lm = LinearRegression()
# create rfe object, set parameters (n_features_to_select is keyword-only in recent scikit-learn)
rfe = RFE(lm, n_features_to_select=2)
# fit rfe object to data
rfe.fit(x, y)
# mask of selected features (those in x)
f_mask = rfe.support_
# get list of features (True)
rfe_feature = x.iloc[:,f_mask].columns.tolist()
rfe_feature



['total_bill', 'tip_percentage']

#### Use select k best and recursive feature elimination to select the top 2 features for predicting tip percentage. What are they?

In [35]:
x = tips.drop(columns=['tip', 'tip_percentage']).select_dtypes(exclude='O')
y = tips.tip_percentage

f_selector = SelectKBest(f_regression, k=2)
f_selector.fit(x, y)
f_mask = f_selector.get_support()
f_feature = x.iloc[:,f_mask].columns.tolist()
f_feature

['total_bill', 'price_per_person']

In [10]:
# `tip` is among the predictors for this run, which is why it can appear in the result
x = tips[['total_bill', 'size', 'price_per_person', 'tip']]
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=2)
rfe.fit(x, y)
f_mask = rfe.support_
rfe_feature = x.iloc[:,f_mask].columns.tolist()
rfe_feature



['size', 'tip']

#### Why do you think select k best and recursive feature elimination might give different answers for the top features? Does this change as you change the number of features you are selecting?

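They give different answers because they measure different things. Select k best scores each feature on its own with a univariate statistical test (here `f_regression`) against the target, whereas recursive feature elimination fits a model on all the features together and repeatedly drops the one with the smallest coefficient. Because the features here are unscaled, a small-scale feature such as `tip_percentage` can carry a large coefficient and survive RFE even when its individual test score is low. The answers do change with the number of features: with three features selected, the two methods agreed on two of the three for each target, as the cells below show.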
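
The difference between the two approaches can be sketched by hand. The block below is a NumPy approximation of the ideas, not scikit-learn's actual code: `univariate_f_scores` and `backward_elimination` are hypothetical helper names, and the data are synthetic numbers made up for illustration (not the tips data).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, made-up predictors (not the tips data):
a = rng.normal(0, 100, n)   # large scale, small coefficient, drives most of y
b = rng.normal(0, 0.1, n)   # small scale, large coefficient, little effect on y
c = rng.normal(0, 1, n)     # unrelated to y
X = np.column_stack([a, b, c])
names = ["a", "b", "c"]
y = 0.1 * a + 5 * b + rng.normal(0, 0.1, n)


def univariate_f_scores(X, y):
    """F statistic of a one-feature linear fit, computed column by column."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return r**2 / (1 - r**2) * (len(y) - 2)


def backward_elimination(X, y, n_keep):
    """Refit least squares, dropping the smallest |coefficient| each round."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        design = np.column_stack([np.ones(len(y)), X[:, keep]])  # intercept + features
        coefs = np.linalg.lstsq(design, y, rcond=None)[0][1:]    # skip the intercept
        keep.pop(int(np.argmin(np.abs(coefs))))
    return keep


best_by_f = names[int(np.argmax(univariate_f_scores(X, y)))]
best_by_elimination = names[backward_elimination(X, y, 1)[0]]
print(best_by_f, best_by_elimination)
```

On this synthetic data the univariate score favors `a`, which explains most of the variance in `y`, while the elimination loop keeps `b`, whose coefficient is largest. The same mechanism would explain `tip_percentage` surviving RFE in the cells above.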

In [41]:
x = tips[['total_bill', 'size', 'price_per_person', 'tip']]
y = tips.tip_percentage

f_selector = SelectKBest(f_regression, k=3)
f_selector.fit(x, y)
f_mask = f_selector.get_support()
f_feature = x.iloc[:,f_mask].columns.tolist()
print(f_feature)

lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=3)
rfe.fit(x, y)
f_mask = rfe.support_
rfe_feature = x.iloc[:,f_mask].columns.tolist()
print(rfe_feature)

['total_bill', 'price_per_person', 'tip']
['size', 'price_per_person', 'tip']




In [39]:
x = tips[['total_bill', 'size', 'price_per_person', 'tip_percentage']]
y = tips.tip

f_selector = SelectKBest(f_regression, k=3)
f_selector.fit(x, y)
f_mask = f_selector.get_support()
f_feature = x.iloc[:,f_mask].columns.tolist()
print(f_feature)

lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=3)
rfe.fit(x, y)
f_mask = rfe.support_
rfe_feature = x.iloc[:,f_mask].columns.tolist()
print(rfe_feature)

['total_bill', 'size', 'price_per_person']
['total_bill', 'size', 'tip_percentage']




### 2. Write a function named `select_kbest` that takes in the predictors (X), the target (y), and the number of features to select (k) and returns the names of the top k selected features based on the `SelectKBest` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

In [25]:
# use previous steps to define function
def select_kbest(predictors, target, n_features):
    f_selector = SelectKBest(f_regression, k=n_features)
    f_selector.fit(predictors, target)
    f_mask = f_selector.get_support()
    f_feature = predictors.iloc[:,f_mask].columns.tolist()
    return f_feature
# test function
select_kbest(tips[['total_bill', 'size', 'price_per_person']], tips.tip, 2)

['total_bill', 'size']

### 3. Write a function named `rfe` that takes in the predictors, the target, and the number of features to select. It should return the top k features based on the `RFE` class. Test your function with the `tips` dataset. You should see the same results as when you did the process manually.

In [27]:
# use previous steps to define function
def rfe(predictors, target, n_features):
    lm = LinearRegression()
    # named `selector` so it does not shadow this function's own name
    selector = RFE(lm, n_features_to_select=n_features)
    selector.fit(predictors, target)
    f_mask = selector.support_
    rfe_feature = predictors.iloc[:,f_mask].columns.tolist()
    return rfe_feature
# test function
rfe(tips[['total_bill', 'size', 'price_per_person']], tips.tip, 2)



['total_bill', 'price_per_person']

### 4. Load the `swiss` dataset and use all the other features to predict Fertility. Find the top 3 features using both select k best and recursive feature elimination (use the functions you just built to help you out).

In [28]:
# view `swiss` documentation
data('swiss', show_doc=True)

swiss

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Swiss Fertility and Socioeconomic Indicators (1888) Data

### Description

Standardized fertility measure and socio-economic indicators for each of 47
French-speaking provinces of Switzerland at about 1888.

### Usage

    data(swiss)

### Format

A data frame with 47 observations on 6 variables, each of which is in percent,
i.e., in [0,100].

- [,1] Fertility: Ig, "common standardized fertility measure"
- [,2] Agriculture
- [,3] Examination
- [,4] Education
- [,5] Catholic
- [,6] Infant.Mortality: live births who live less than 1 year.

All variables but 'Fertility' give proportions of the population.

### Source

Project "16P5", pages 549-551 in

Mosteller, F. and Tukey, J. W. (1977) “Data Analysis and Regression: A Second
Course in Statistics”. Addison-Wesley, Reading Mass.

indicating their source as "Data used by permission of Franice van de Walle.
Office of Population Research, Princeton Univer

In [29]:
# assign dataframe to variable
swiss = data('swiss')
# preview
swiss.head()

| | Fertility | Agriculture | Examination | Education | Catholic | Infant.Mortality |
|---|---|---|---|---|---|---|
| Courtelary | 80.2 | 17.0 | 15 | 12 | 9.96 | 22.2 |
| Delemont | 83.1 | 45.1 | 6 | 9 | 84.84 | 22.2 |
| Franches-Mnt | 92.5 | 39.7 | 5 | 5 | 93.4 | 20.2 |
| Moutier | 85.8 | 36.5 | 12 | 7 | 33.77 | 20.3 |
| Neuveville | 76.9 | 43.5 | 17 | 15 | 5.16 | 20.6 |


In [30]:
# run select_kbest function on swiss data
select_kbest(swiss.drop(columns='Fertility'), swiss.Fertility, 3)

['Examination', 'Education', 'Catholic']

In [31]:
# run rfe function on swiss data
rfe(swiss.drop(columns='Fertility'), swiss.Fertility, 3)



['Examination', 'Education', 'Infant.Mortality']