# Customer Spending Prediction - Feature Elimination

Feature elimination removes features (x variables) that are not important to the model. First, the estimator is trained on the initial set of features, and the importance of each feature is obtained through a specific attribute (such as `coef_` or `feature_importances_`) or a callable. Then the least important features are pruned from the current set. The procedure is repeated recursively on the pruned set until the desired number of features is reached.
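
The loop described above can be sketched by hand. This is a minimal illustration, not scikit-learn's actual implementation: the helper name `manual_rfe` and the synthetic data are assumptions made for the example, and it drops one feature per round based on the absolute value of the linear regression coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def manual_rfe(X, y, feature_names, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until n remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # refit on the features that are still in play
        model = LinearRegression().fit(X[:, remaining], y)
        # position (within `remaining`) of the least important feature
        weakest = int(np.argmin(np.abs(model.coef_)))
        remaining.pop(weakest)
    return [feature_names[i] for i in remaining]

# synthetic data (hypothetical): only columns "a" and "c" influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

selected = manual_rfe(X_demo, y_demo, ["a", "b", "c", "d"], 2)
print(selected)
```

The `RFE` class used later in this notebook wraps the same idea and additionally records a rank for every eliminated feature.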

In this activity, I am going to use a customer spend dataset and then:

1. Build a linear regression model with all of the variables
2. Build a linear regression model that includes only the top 3 variables
3. Compare the RMSE of the first and second models

In [41]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split  # for train and test split
from sklearn.linear_model import LinearRegression  # import linear regression from sklearn library
from sklearn.metrics import mean_squared_error  # method to calculate RMSE from the linear regression
from sklearn.feature_selection import RFE  # feature selection using recursive feature elimination (RFE) algorithm

    1. Import pandas, use it to read in the data in customer_spend.csv, and use the head function to view the first five rows of data.

In [4]:
df = pd.read_csv('customer_spend.csv')

display(df.head(5))
print('number of rows from the data frame:', len(df))

Unnamed: 0,cur_year_spend,prev_year_spend,days_since_last_purchase,days_since_first_purchase,total_transactions,age,income,engagement_score
0,5536.46,1681.26,7,61,34,61,97914.93,-0.652392
1,871.41,1366.74,12,34,33,68,30904.69,0.007327
2,2046.74,1419.38,10,81,22,54,48194.59,0.221666
3,4662.7,1561.21,12,32,34,49,93551.98,1.149641
4,3539.46,1397.6,17,72,34,66,66267.57,0.835834


number of rows from the data frame: 1000


The dataset has several columns with information about each customer, including:

- **prev_year_spend:** How much they spent in the previous year 
- **days_since_last_purchase:** The number of days since their last purchase 
- **days_since_first_purchase:** The number of days since their first purchase 
- **total_transactions:** The total number of transactions
- **age:** The customer's age 
- **income:** The customer's income
- **engagement_score:** A customer engagement score, which is a score created based on customers' engagement with previous marketing offers.

    2. Use train_test_split from sklearn to split the data into training and test sets, with random_state=100 and cur_year_spend as the y variable.

In [10]:
x_cols = df.columns[1:]  # slice columns only from the 2nd column to the last column
x_var = df[x_cols]

y_var = df['cur_year_spend']  # use all x's to predict this year's customer spending

display(x_var.head(5))
display(y_var[:5])

Unnamed: 0,prev_year_spend,days_since_last_purchase,days_since_first_purchase,total_transactions,age,income,engagement_score
0,1681.26,7,61,34,61,97914.93,-0.652392
1,1366.74,12,34,33,68,30904.69,0.007327
2,1419.38,10,81,22,54,48194.59,0.221666
3,1561.21,12,32,34,49,93551.98,1.149641
4,1397.6,17,72,34,66,66267.57,0.835834


0    5536.46
1     871.41
2    2046.74
3    4662.70
4    3539.46
Name: cur_year_spend, dtype: float64

In [45]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_var, y_var, random_state=100)

In [14]:
print('length of x_train:', len(x_train))
print('length of x_test:', len(x_test))
print('length of y_train:', len(y_train))
print('length of y_test:', len(y_test))

length of x_train: 750
length of x_test: 250
length of y_train: 750
length of y_test: 250
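
The 750/250 split follows from the default behaviour of `train_test_split`: when neither `test_size` nor `train_size` is given, 25% of the rows go to the test set. A small sketch on synthetic arrays (the `X_demo` and `y_demo` names are made up for this example) shows that spelling out `test_size=0.25` produces the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic stand-in with 1000 rows, like the customer data
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.arange(1000)

# default split versus an explicit 25% test set, same random_state
default_split = train_test_split(X_demo, y_demo, random_state=100)
explicit_split = train_test_split(X_demo, y_demo, test_size=0.25, random_state=100)

print(len(default_split[0]), len(default_split[1]))
```

Passing `test_size` explicitly makes the split ratio visible to the reader and easy to change later.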


    3. Train a linear regression model on the full training dataset (all features) and calculate the RMSE value on the test dataset.

In [44]:
from sklearn.linear_model import LinearRegression

linreg_model = LinearRegression().fit(x_train, y_train)
linreg_model.coef_

array([ 0.85320131, -0.3119968 , 15.55680368, 49.05766265, -0.0852239 ,
        0.05969942,  1.06228814])

In [59]:
from sklearn.metrics import mean_squared_error

predict_linreg_model = linreg_model.predict(x_test)
print('RMSE from the first model is:', mean_squared_error(y_test, predict_linreg_model)**0.5)  # **0.5 takes the square root of the MSE to get the RMSE

RMSE from the first model is: 50.45039920074803
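
For reference, RMSE is the square root of the mean of the squared prediction errors. The sketch below uses three made-up values (not taken from the customer data) to show that the manual formula agrees with `mean_squared_error(...)**0.5`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# toy values, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# errors are 1, 0, -2, so the mean squared error is 5/3
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmse_sklearn = mean_squared_error(y_true, y_pred) ** 0.5

print(rmse_manual, rmse_sklearn)
```

Because RMSE is expressed in the same unit as the target, a value of about 50 here means the predictions are off by roughly 50 units of yearly spend on typical test rows.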


    4. Use RFE to obtain the three most important features and obtain the reduced versions of the training and test datasets by using only the selected columns.

In [46]:
from sklearn.feature_selection import RFE

rfe = RFE(estimator=LinearRegression(), n_features_to_select=3)  # n_features_to_select: keep the 3 most important variables
rfe.fit(x_train, y_train)

RFE(estimator=LinearRegression(), n_features_to_select=3)

In [47]:
x_train.shape[1]

7

In [55]:
# build a helper that fits RFE and prints each selected feature with its rank
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html

def rfe_ranks(x, y, estimator, n_features):
    rfe = RFE(estimator=estimator, n_features_to_select=n_features)
    rfe.fit(x, y)

    # iterate over the columns of the x passed in, not the global x_train
    for feature_num in range(x.shape[1]):
        if rfe.support_[feature_num]:
            # print the name of the selected column and its rank in the model
            print('Feature: {}, Rank: {}'.format(x.columns[feature_num], rfe.ranking_[feature_num]))

In [56]:
rfe_ranks(x_train, y_train, estimator=LinearRegression(), n_features=3)

Feature: days_since_first_purchase, Rank: 1
Feature: total_transactions, Rank: 1
Feature: engagement_score, Rank: 1
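
All selected features share rank 1, so the printout above does not show the order in which the other features were eliminated. A possible way to see the full ordering is to read `ranking_` for every column. The sketch below assumes a small synthetic frame with hypothetical column names `f0` to `f3`, not the customer data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# synthetic data: f0 matters most, f3 not at all
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f0", "f1", "f2", "f3"])
y_demo = 5 * X_demo["f0"] + 3 * X_demo["f1"] + 1 * X_demo["f2"] + rng.normal(scale=0.1, size=300)

# keeping a single feature forces RFE to assign a distinct rank to every column
selector = RFE(estimator=LinearRegression(), n_features_to_select=1).fit(X_demo, y_demo)
ranking = pd.Series(selector.ranking_, index=X_demo.columns).sort_values()
print(ranking)
```

Rank 1 marks the kept feature, and higher ranks mark features that were eliminated earlier.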


So, according to RFE, the three most important variables are days_since_first_purchase, total_transactions, and engagement_score. I am going to build another model that uses only these three variables.
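
Instead of typing the selected column names by hand, the reduced datasets can also be derived from the fitted selector. This is a sketch on synthetic data with hypothetical column names, assuming the features are held in a pandas DataFrame as in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic stand-in for the customer data (column names are hypothetical)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
y_demo = 4 * X_demo["f1"] + 2 * X_demo["f3"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=100)

# fit RFE on the training data only, then read the boolean mask of kept columns
selector = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_tr, y_tr)
selected_cols = X_tr.columns[selector.support_]

# apply the same column selection to both splits
X_tr_red = X_tr[selected_cols]
X_te_red = X_te[selected_cols]
print(list(selected_cols))
```

Reusing the existing split this way keeps the training and test rows identical between the full and the reduced model, so the two RMSE values stay directly comparable.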

Train a linear regression model on the reduced training dataset and calculate the RMSE value on the test dataset.

In [62]:
df_reduced = df[['cur_year_spend', 'days_since_first_purchase', 'total_transactions', 'engagement_score']]

In [63]:
x_cols_reduced = df_reduced.columns[1:]  # slice columns only from the 2nd column to the last column
x_var_red = df_reduced[x_cols_reduced]

y_var_red = df_reduced['cur_year_spend']  # use only the selected x's to predict this year's customer spending

display(x_var_red.head(5))
display(y_var_red[:5])

Unnamed: 0,days_since_first_purchase,total_transactions,engagement_score
0,61,34,-0.652392
1,34,33,0.007327
2,81,22,0.221666
3,32,34,1.149641
4,72,34,0.835834


0    5536.46
1     871.41
2    2046.74
3    4662.70
4    3539.46
Name: cur_year_spend, dtype: float64

In [65]:
x_train_red, x_test_red, y_train_red, y_test_red = train_test_split(x_var_red, y_var_red, random_state=100)

print('length of x_train_red:', len(x_train_red))
print('length of x_test_red:', len(x_test_red))
print('length of y_train_red:', len(y_train_red))
print('length of y_test_red:', len(y_test_red))

length of x_train_red: 750
length of x_test_red: 250
length of y_train_red: 750
length of y_test_red: 250


In [66]:
rfereg_model = LinearRegression().fit(x_train_red, y_train_red)
rfereg_model.coef_

predict_rfereg_model = rfereg_model.predict(x_test_red)

In [67]:
print('RMSE from the first model (all features) is:', mean_squared_error(y_test, predict_linreg_model)**0.5)
print('RMSE from the second model (after RFE) is:', mean_squared_error(y_test_red, predict_rfereg_model)**0.5)

RMSE from the first model (all features) is: 50.45039920074803
RMSE from the second model (after RFE) is: 1075.9083016269915


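From the result above, the model with all seven variables predicts far more accurately (RMSE of about 50) than the model restricted to the three RFE-selected variables (RMSE of about 1,076). So, if prediction accuracy is the goal, it is better to keep all the variables here. If we are instead aiming to reduce memory use and computation, RFE is a method we can consider, at the cost of accuracy in this case.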
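
One possible explanation for the size of the gap between the two RMSE values: with `LinearRegression`, RFE ranks features by the magnitude of their coefficients, and coefficient magnitude depends on the unit of each feature. The three features selected above are also the ones with the largest absolute values in the `coef_` output of the first model, while a large-scale column such as income has a small coefficient. The sketch below uses synthetic data (the column names and numbers are assumptions, not the customer data) to show how standardizing the features first can change which feature RFE keeps:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
income = rng.normal(60000, 20000, size=n)  # large scale, small coefficient
score = rng.normal(0, 1, size=n)           # small scale, larger coefficient
noise_feat = rng.normal(0, 1, size=n)      # unrelated to the target
X = np.column_stack([income, score, noise_feat])
y = 0.05 * income + 2.0 * score + rng.normal(scale=1.0, size=n)
names = np.array(["income", "score", "noise_feat"])

# RFE on raw features: income is eliminated because its coefficient is tiny
raw = RFE(LinearRegression(), n_features_to_select=1).fit(X, y)

# RFE on standardized features: coefficients become comparable across columns
X_scaled = StandardScaler().fit_transform(X)
scaled = RFE(LinearRegression(), n_features_to_select=1).fit(X_scaled, y)

print(names[raw.support_], names[scaled.support_])
```

Whether standardizing would change the selection on the customer data has not been checked here. It would need to be tested, with the scaler fitted on the training split only.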