In [14]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression as LR
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

In [29]:
data = pd.read_csv('data/crp_cleandata.csv')

In this assignment we are continuing to work with customer reward programs (review Week 1 Application Assignment if you haven’t completed it). An analyst performed some preliminary data preprocessing on the raw data and shared the data with you in the file crp_cleandata.xlsx (see download link below). Note that some additional columns are created and some data columns are scaled. In this exercise, you will complete a predictive modeling task where the target variable is continuous based on the data in the shared file. First remove all rows where either the Reward or NumStores column takes the value 0. Also remove all rows where the rewards do not expire (ExpirationMonth=999). Consider linear regression models with ExpirationMonth column as the target variable.

In [30]:
data = data[(data["Reward"] > 0) & (data["NumStores"] > 0) & (data["ExpirationMonth"] != "999")]

In [31]:
data.reset_index(inplace=True, drop=True)

In [39]:
data["ExpirationMonth"] = data["ExpirationMonth"].astype(float)
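The filter above compares ExpirationMonth with the string "999", which only drops rows if the column is stored as text. Here is a minimal sketch of a dtype-independent version, assuming made-up rows (the `toy` frame and its values are illustrative and are not the assignment data):

```python
import pandas as pd

# Made-up rows; the column names mirror the notebook, the values do not.
toy = pd.DataFrame({
    "Reward": [1, 0, 2, 3, 4],
    "NumStores": [5.0, 2.0, 0.0, 1.5, 3.0],
    "ExpirationMonth": ["12", "6", "3", "999", "6"],  # stored as text here
})

# Coerce to numbers first so the comparison does not depend on the dtype.
# Note: values that cannot be parsed become NaN and would pass the != 999 test.
months = pd.to_numeric(toy["ExpirationMonth"], errors="coerce")
keep = (toy["Reward"] > 0) & (toy["NumStores"] > 0) & (months != 999)

filtered = toy[keep].reset_index(drop=True)
filtered["ExpirationMonth"] = months[keep].to_numpy()
print(filtered)
```

Comparing against the number 999 after `pd.to_numeric` behaves the same whether the file was read with a text or a numeric column.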

Find the model with one predictor variable and the highest R-squared. Consider the following set of predictor variables: Salerank, X2013USSales, X2013WorldSales, NumStores, RewardSize, and ProfitMargin.
Which variable did you choose?
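For a regression with one predictor and an intercept, the in-sample R-squared equals the squared Pearson correlation between the predictor and the target, so ranking candidates by R-squared is the same as ranking them by absolute correlation. A small sketch on synthetic data (the arrays `x` and `y` are generated here and are not the assignment columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, used only to illustrate the identity.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(size=50)

# R-squared from a one-predictor linear regression.
X = x.reshape(-1, 1)
r2_model = LinearRegression().fit(X, y).score(X, y)

# Squared Pearson correlation between predictor and target.
r2_corr = np.corrcoef(x, y)[0, 1] ** 2

print(r2_model, r2_corr)
```

The two printed numbers should agree up to floating-point error.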

In [120]:
features = ["Salerank", "X2013USSales", "X2013WorldSales", "RewardSize", "ProfitMargin"]

In [128]:
y = data["ExpirationMonth"].values.reshape(46, 1)

(46,)

In [58]:
lr = LR()

In [129]:
test_x = data['Salerank'].values.reshape(46, 1)

In [130]:
model = lr.fit(test_x, y)

In [131]:
model.score(test_x, y)

0.062154267461252077

In [132]:
model.intercept_

array([ 10.43886397])

In [134]:
pred = model.predict(test_x)

In [137]:
np.argmax(abs(pred-y))

38

In [140]:
data.loc[38]['Retailer']

'Subway'
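`np.argmax` returns a position, while `data.loc[...]` looks up by index label; the two agree in this notebook only because the index was reset after filtering. A hedged sketch with made-up values, using positional lookup so it does not depend on the index labels (the frame `toy` and the array `pred` are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up data with an index that is deliberately not 0..n-1.
toy = pd.DataFrame(
    {"Retailer": ["A", "B", "C", "D"], "y": [1.0, 2.0, 9.0, 4.0]},
    index=[10, 11, 12, 13],
)
pred = np.array([1.5, 2.0, 3.0, 4.5])  # made-up predictions

# Absolute residuals and the position of the largest one.
abs_resid = np.abs(pred - toy["y"].to_numpy())
pos = int(np.argmax(abs_resid))

# iloc is positional, so it matches what argmax returns.
worst = toy.iloc[pos]["Retailer"]
print(pos, worst)
```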

In [93]:
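# One-predictor fits: regress ExpirationMonth on each candidate separately.
candidates = ["Salerank", "X2013USSales", "X2013WorldSales",
              "NumStores", "RewardSize", "ProfitMargin"]
for feature in candidates:
    X = data[[feature]].values  # shape (n_rows, 1)
    model = lr.fit(X, y)
    score = model.score(X, y)
    intercept = model.intercept_
    slope = model.coef_
    print("{} R^2= {} intercept= {} slope= {}".format(feature, score, intercept, slope))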

Salerank R^2= 0.0621542674613 intercept= [ 10.43886397] slope= [[-0.06486178]]
X2013USSales R^2= 0.00843068655898 intercept= [ 7.908877] slope= [[-35.65556047]]
X2013WorldSales R^2= 0.00510513333909 intercept= [ 7.80172806] slope= [[-23.49759515]]
NumStores R^2= 0.253714718743 intercept= [ 4.82846651] slope= [[ 0.88984632]]
RewardSize R^2= 0.0204022561058 intercept= [ 8.36590472] slope= [[-0.15332229]]
ProfitMargin R^2= 0.00953160707593 intercept= [ 8.55066839] slope= [[-0.02689186]]


Data transformation is a great way to improve model fit. Now consider the log transformation for the model identified in the previous question. You can choose to transform neither of them, one of them, or both of them. You should have four different models.
* Model 1: neither variable is transformed; this gives you the same model as in the previous question.
* Model 2: only the target variable is transformed.
* Model 3: only the explanatory variable is transformed.
* Model 4: both variables are transformed.
Report the R-squared values of all four models.
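The four combinations can be sketched as a small loop. The data below is synthetic and the helper `r2` is a hypothetical name introduced for illustration; only the structure of the comparison mirrors the task. Note that Models 2 and 4 report R-squared for log(y), so their values describe a different target scale than Models 1 and 3.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, strictly positive data so that the logarithm is defined.
rng = np.random.default_rng(1)
x = rng.uniform(0.2, 8.0, size=46)
y = np.exp(0.3 * np.log(x) + rng.normal(scale=0.5, size=46))

def r2(x_col, y_col):
    """In-sample R-squared of a one-predictor linear regression."""
    X = x_col.reshape(-1, 1)
    return LinearRegression().fit(X, y_col).score(X, y_col)

results = {
    "Model 1 (y ~ x)": r2(x, y),
    "Model 2 (log y ~ x)": r2(x, np.log(y)),
    "Model 3 (y ~ log x)": r2(np.log(x), y),
    "Model 4 (log y ~ log x)": r2(np.log(x), np.log(y)),
}
for name, value in results.items():
    print(name, round(value, 4))
```

Each call builds a fresh `LinearRegression`, so no fitted model is overwritten by a later fit.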

In [94]:
transformed_data = data.copy()

In [96]:
transformed_data = transformed_data[['NumStores', 'ExpirationMonth']].copy()  # explicit copy avoids SettingWithCopyWarning when adding columns

In [102]:
transformed_data["LogY"] = np.log(transformed_data["ExpirationMonth"])

In [104]:
transformed_data["LogX"] = np.log(transformed_data["NumStores"])

In [108]:
lr = LR()

In [116]:
y_1 = transformed_data['ExpirationMonth'].values.reshape(-1, 1)
x_1 = transformed_data['NumStores'].values.reshape(-1, 1)
model_1 = LR().fit(x_1, y_1)  # fresh estimator, so later fits do not overwrite model_1
model_1.score(x_1, y_1)

0.25371471874343032

In [117]:
y_2 = transformed_data['LogY'].values.reshape(-1, 1)
x_2 = transformed_data['NumStores'].values.reshape(-1, 1)
model_2 = LR().fit(x_2, y_2)  # fresh estimator, so later fits do not overwrite model_2
model_2.score(x_2, y_2)

0.070419836120243939

In [118]:
y_3 = transformed_data['ExpirationMonth'].values.reshape(-1, 1)
x_3 = transformed_data['LogX'].values.reshape(-1, 1)
model_3 = LR().fit(x_3, y_3)  # fresh estimator, so later fits do not overwrite model_3
model_3.score(x_3, y_3)

0.14469682613098556

In [119]:
y_4 = transformed_data['LogY'].values.reshape(-1, 1)
x_4 = transformed_data['LogX'].values.reshape(-1, 1)
model_4 = LR().fit(x_4, y_4)  # fresh estimator, independent of the earlier models
model_4.score(x_4, y_4)

0.065266847624963931

In [105]:
transformed_data

   NumStores  ExpirationMonth      LogY       LogX
0      7.974             12.0  2.484907   2.076186
1      4.023              2.0  0.693147   1.392028
2      0.767              1.0  0.000000  -0.265268
3      3.854              1.0  0.000000   1.349112
4      4.802              3.0  1.098612   1.569032
5      1.492             12.0  2.484907   0.400118
6      0.684              3.0  1.098612  -0.379797
7      0.201              6.0  1.791759  -1.604450
8      1.288              3.0  1.098612   0.253091
9      1.309              4.0  1.386294   0.269263


In [146]:
# Two-predictor fits: NumStores plus one additional candidate.
for feature in features:
    X = data[["NumStores", feature]].values  # shape (n_rows, 2)
    model = lr.fit(X, y)
    predicted = model.predict(X)
    score = model.score(X, y)
    intercept = model.intercept_
    coeffs = model.coef_
    # Rows with the largest and smallest absolute residuals
    # (label lookup works because the index was reset to 0..n-1).
    abs_resid = np.abs(predicted - y)
    max_resid_store = data.loc[np.argmax(abs_resid), "Retailer"]
    min_resid_store = data.loc[np.argmin(abs_resid), "Retailer"]
    print("NumStores+{}, R^2={}, intercept={}, coeffs={}, "
          "max_abs_resid_store={}, min_abs_resid_store={}".format(
              feature, score, intercept, coeffs, max_resid_store, min_resid_store))

NumStores+Salerank, R^2=0.270645681981, intercept=[ 6.67687718], coeffs=[[ 0.83189299 -0.03491206]], max_abs_resid_store=Gap, min_abs_resid_store=7-Eleven
NumStores+X2013USSales, R^2=0.284475863177, intercept=[ 5.85759906], coeffs=[[  0.94050316 -69.01206803]], max_abs_resid_store=TJX, min_abs_resid_store=Neiman Marcus
NumStores+X2013WorldSales, R^2=0.267165104154, intercept=[ 5.58882773], coeffs=[[  0.90784555 -38.28747116]], max_abs_resid_store=TJX, min_abs_resid_store=7-Eleven
NumStores+RewardSize, R^2=0.284326058729, intercept=[ 6.07700564], coeffs=[[ 0.90938412 -0.18817992]], max_abs_resid_store=TJX, min_abs_resid_store=Bloomin' Brands (Outback)
NumStores+ProfitMargin, R^2=0.259580527324, intercept=[ 5.82282779], coeffs=[[ 0.88417119 -0.02111462]], max_abs_resid_store=TJX, min_abs_resid_store=7-Eleven
