In [57]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [3]:
data = pd.read_csv("data/crp_cleandata.csv")

Let’s reconsider the customer reward program dataset. In this exercise, you will complete a predictive modeling task in which the target variable is binary. Again, use the data in cleandata.csv (loaded above from data/crp_cleandata.csv). Consider logistic regression models with the Reward column as the target variable. cleandata.csv also contains a column IndustryType, which was derived from the Industry column in the raw data.

Note that Industry has too many categories. The analyst who prepared the data chose to combine the categories, which resulted in the column IndustryType. IndustryType has five categories: Department, Discount, Grocery, Restaurants, Specialty. You can create a set of dummy variables based on IndustryType in XLMiner by using the Transform functions.

(Part 1) Fit a logistic regression model with two indicator variables, one indicating whether a retailer is a discount store (i.e., IndustryType is Discount), and the other indicating whether a retailer is a grocery store (i.e., IndustryType is Grocery). Report the coefficient estimates in the next three questions.

In [5]:
log_reg_data = data[["IndustryType", "Reward"]]

In [9]:
# drop_first=True drops the first category (Department), which becomes part of the baseline
dummified = pd.get_dummies(data=log_reg_data, drop_first=True)

In [11]:
dummified.head()

   Reward  IndustryType_Discount  IndustryType_Grocery  IndustryType_Restaurants  IndustryType_Specialty
0       0                    1.0                   0.0                       0.0                     0.0
1       0                    0.0                   1.0                       0.0                     0.0
2       0                    0.0                   1.0                       0.0                     0.0
3       0                    0.0                   1.0                       0.0                     0.0
4       0                    0.0                   0.0                       0.0                     1.0


In [12]:
y = dummified.Reward.values  # as_matrix() was removed in pandas 1.0

In [13]:
X = dummified[["IndustryType_Discount", "IndustryType_Grocery"]].values  # as_matrix() was removed in pandas 1.0

In [17]:
log_reg = LogisticRegression()

In [18]:
model = log_reg.fit(X, y)

What is the estimated intercept coefficient (report your answer to 4 decimal places, i.e., x.xxxx)?

In [33]:
intercept = round(model.intercept_[0], 4)  # intercept_ is a length-1 array
intercept

0.4046


What is the estimated coefficient for IndustryType_Discount (report your answer to 4 decimal places, i.e., x.xxxx)?

What is the estimated coefficient for IndustryType_Grocery (report your answer to 3 decimal places, i.e., x.xxx)?

In [35]:
coeffs = model.coef_  # shape (1, 2): one row, one column per predictor
print("beta IndustryType_Discount", round(coeffs[0][0], 4))
print("beta IndustryType_Grocery", round(coeffs[0][1], 3))

beta IndustryType_Discount -0.6963
beta IndustryType_Grocery -0.513
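
The coefficients above are on the log-odds scale. As a rough sketch, using only the rounded estimates printed above (the helper name `sigmoid` and the variable names are illustrative, not part of the exercise), the logistic function turns them into a predicted probability of Reward = 1 for each group of retailers:

```python
import math

def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Rounded estimates copied from the output above.
intercept = 0.4046
beta_discount = -0.6963
beta_grocery = -0.513

# Both indicators are 0 for Department, Restaurants, and Specialty retailers,
# so the intercept alone determines their predicted probability.
p_other = sigmoid(intercept)
p_discount = sigmoid(intercept + beta_discount)
p_grocery = sigmoid(intercept + beta_grocery)

print("P(Reward=1 | other types):", round(p_other, 4))
print("P(Reward=1 | Discount):   ", round(p_discount, 4))
print("P(Reward=1 | Grocery):    ", round(p_grocery, 4))
```

Because the model has only two indicator predictors, it can produce just three distinct probabilities. Only the baseline group lands above 0.5, so with the default cutoff Discount and Grocery retailers are classified as 0 and all other retailers as 1.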


What is the number of true positives?

In [36]:
predictions = model.predict(X)

In [41]:
conf_matr = confusion_matrix(y, predictions)
conf_matr

array([[21, 24],
       [15, 40]])

In [53]:
tp = conf_matr[1][1]
tp

40

In [55]:
tn = conf_matr[0][0]
tn

21
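
For reference, scikit-learn's `confusion_matrix` puts the actual classes in rows and the predicted classes in columns, with labels in sorted order, so for 0/1 labels the layout is [[TN, FP], [FN, TP]]. A minimal sketch, using the counts from the matrix above, of unpacking all four cells and deriving a few summary rates (the variable names are illustrative):

```python
# Counts copied from the confusion matrix output above.
conf = [[21, 24],
        [15, 40]]

# Row 0 holds actual negatives, row 1 holds actual positives.
(tn, fp), (fn, tp) = conf

total = tn + fp + fn + tp
accuracy = (tp + tn) / float(total)     # share of correct predictions
sensitivity = tp / float(tp + fn)       # true positive rate
specificity = tn / float(tn + fp)       # true negative rate

print("accuracy:", round(accuracy, 4))
print("sensitivity:", round(sensitivity, 4))
print("specificity:", round(specificity, 4))
```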

(Part 2) Split the dataset into training and validation sets using a 60:40 split (set the seed for partitioning to 12345; this should be the default value if you have not changed it). Report the new coefficient estimates in the next three questions. Use the same two predictor variables as in Part 1.

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 12345)

In [63]:
# fit a fresh estimator so the Part 1 model is not overwritten
model_2 = LogisticRegression().fit(X_train, y_train)

What is the estimated intercept coefficient (report your answer to 4 decimal places, i.e., x.xxxx)?

In [65]:
intercept_2 = round(model_2.intercept_[0], 4)  # intercept_ is a length-1 array
intercept_2

0.4419

In [66]:
coeffs_2 = model_2.coef_

What is the estimated coefficient for IndustryType_Discount (report your answer to 3 decimal places, i.e., x.xxx)?

In [67]:
print("beta IndustryType_Discount", round(coeffs_2[0][0], 3))

beta IndustryType_Discount -0.628


What is the estimated coefficient for IndustryType_Grocery (report your answer to 4 decimal places, i.e., x.xxxx)?

In [68]:
print("beta IndustryType_Grocery", round(coeffs_2[0][1], 4))

beta IndustryType_Grocery -0.5816


How many observations are in the training set?

In [69]:
X_train.shape[0]

60

What is the number of true positives on the validation data?

In [70]:
predictions_2 = model_2.predict(X_test)

In [71]:
conf_matrix_2 = confusion_matrix(y_test, predictions_2)

In [72]:
tp_2 = conf_matrix_2[1][1]
tp_2

14

What is the number of true negatives on the validation data?

In [74]:
tn_2 = conf_matrix_2[0][0]
tn_2

9


(Part 3) By default, XLMiner uses the cutoff threshold 0.5. Repeat Part 2 with a cutoff threshold 0.3. What are the numbers of true positives and true negatives on the validation data?
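
scikit-learn's `predict` takes no cutoff argument; for a binary logistic regression it corresponds to a 0.5 cutoff on the predicted probability, so any other cutoff has to be applied to the probabilities directly. A small sketch with made-up probabilities (`probs` and `apply_cutoff` are hypothetical names, and the values are not from the dataset):

```python
def apply_cutoff(probs, cutoff):
    """Label an observation 1 when its positive-class probability >= cutoff."""
    # Every observation gets a label, so the result has the same length as probs.
    return [1 if p >= cutoff else 0 for p in probs]

probs = [0.61, 0.47, 0.45, 0.25]  # illustrative values only

labels_default = apply_cutoff(probs, 0.5)
labels_lower = apply_cutoff(probs, 0.3)

print(labels_default)  # [1, 0, 0, 0]
print(labels_lower)    # [1, 1, 1, 0]
```

Lowering the cutoff can only turn predicted 0s into predicted 1s, so the number of true positives cannot decrease and the number of true negatives cannot increase.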

In [84]:
pred_proba = model_2.predict_proba(X_test)
pred_proba

array([[ 0.53486832,  0.46513168],
       [ 0.39129356,  0.60870644],
       [ 0.54646522,  0.45353478],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.53486832,  0.46513168],
       [ 0.39129356,  0.60870644],
       [ 0.54646522,  0.45353478],
       [ 0.54646522,  0.45353478],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.54646522,  0.45353478],
       [ 0.53486832,  0.46513168],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.53486832,  0.46513168],
       [ 0.54646522,  0.45353478],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.39129356,  0.60870644],
       [ 0.54646522,  0.45353478],
       [ 0.54646522,

In [80]:
model_2.classes_

array([0, 1])

In [85]:
# label every observation: 1 if P(Reward = 1) >= 0.3, else 0 (column 1 is class 1)
predictions_3 = (pred_proba[:, 1] >= 0.3).astype(int)

In [86]:
predictions_3

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [87]:
conf_matr_3 = confusion_matrix(y_test, predictions_3)

Report the number of true positives:

In [88]:
tp_3 = conf_matr_3[1][1]
tp_3

21

Report the number of true negatives:

In [90]:
tn_3 = conf_matr_3[0][0]
tn_3

0