#### Testing which variables are important for predicting responses to a marketing offer

* The data describes a company’s marketing campaign, which offered discounts on various products. You want to build a model that predicts the number of responses to an offer (responses) from three predictors: the size of the discount (offer_discount), the number of customers the offer reached (offer_reach), and a quality score that the marketing team assigned to the offer (offer_quality).
* The goal is a model that is accurate but contains no unnecessary variables. Use the test-set RMSE of the model with all variables as a baseline, and compare it with the RMSE obtained when each variable is dropped from the model in turn.

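1. Import pandas and read data_science/offer_responses.csv into a DataFrame named df, then inspect the first few rows with df.head().
2. Import train_test_split from sklearn and use it to split the data into training and test sets, using responses as the y variable and all other columns as the predictor (X) variables. Use random_state=10 for the train-test split.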
3. Import LinearRegression and mean_squared_error from sklearn. Fit a model to the training data (using all of the predictors), get predictions from the model on the test data, and print out the calculated RMSE on the test data. The RMSE with all variables should be approximately 966.2461828577945.
4. Create X_train2 and X_test2 by dropping offer_quality from X_train and X_test. Train a model on X_train2 and evaluate its RMSE on X_test2. The RMSE without offer_quality should be approximately 965.5346123758474.
5. Repeat step 4, this time dropping offer_discount instead of offer_quality. The RMSE without offer_discount should be approximately 1231.6766556327284.
6. Repeat step 4 once more, this time dropping offer_reach. The RMSE without offer_reach should be approximately 1185.8456831644114.

In [7]:
import pandas as pd

df = pd.read_csv('data_science/offer_responses.csv')
df.head()

Unnamed: 0,responses,offer_discount,offer_quality,offer_reach
0,4151.0,26.0,10.25768,31344.0
1,3397.0,35.0,15.19438,24016.0
2,3274.0,21.0,13.971468,28832.0
3,3426.0,27.0,6.054338,26747.0
4,5745.0,42.0,16.801365,46968.0


In [8]:
from sklearn.model_selection import train_test_split

X = df[['offer_quality',
        'offer_discount',
        'offer_reach'
       ]]

y = df['responses']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 10)

In [9]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

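# Fit a linear regression on all three predictors
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the held-out test set
predictions = model.predict(X_test)

# RMSE is the square root of the mean squared error
print('RMSE with all variables: ' + str(mean_squared_error(y_test, predictions)**0.5))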

RMSE with all variables: 966.2461828578139
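
The expression mean_squared_error(...)**0.5 used above is the root mean squared error: the square root of the average squared difference between the actual and predicted values. As a minimal sketch of the same calculation in plain Python (the function name rmse and the three example values are made up for illustration and are not taken from offer_responses.csv):

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared difference."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only (an assumption, not real campaign data)
actual = [4000.0, 3500.0, 3000.0]
predicted = [4300.0, 3500.0, 2600.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, RMSE is expressed in the same units as responses and penalizes large misses more heavily than small ones.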


In [11]:
# Drop offer_quality from both the training and test predictors
X_train2 = X_train.drop('offer_quality', axis=1)
X_test2 = X_test.drop('offer_quality', axis=1)

# Refit and evaluate on the reduced feature set
model = LinearRegression()
model.fit(X_train2, y_train)

predictions = model.predict(X_test2)

print('RMSE without offer quality: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer quality: 965.5346123758474


In [12]:
# Drop offer_discount from both the training and test predictors
X_train3 = X_train.drop('offer_discount', axis=1)
X_test3 = X_test.drop('offer_discount', axis=1)

# Refit and evaluate on the reduced feature set
model = LinearRegression()
model.fit(X_train3, y_train)

predictions = model.predict(X_test3)

print('RMSE without offer discount: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer discount: 1231.6766556327284


In [13]:
# Drop offer_reach from both the training and test predictors
X_train4 = X_train.drop('offer_reach', axis=1)
X_test4 = X_test.drop('offer_reach', axis=1)

# Refit and evaluate on the reduced feature set
model = LinearRegression()
model.fit(X_train4, y_train)

predictions = model.predict(X_test4)

print('RMSE without offer reach: ' + str(mean_squared_error(y_test, predictions)**0.5))

RMSE without offer reach: 1185.8456831644116
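
In the results above, dropping offer_quality leaves the RMSE essentially unchanged (about 965.5 versus 966.2 with all variables), while dropping offer_discount or offer_reach raises it to about 1231.7 and 1185.8. That suggests offer_quality is the variable the model can do without. The three drop-one cells repeat the same steps, so they could also be written as a loop. The following is a rough sketch only: the helper name holdout_rmse is hypothetical, and the data is a synthetic stand-in generated with NumPy (an assumption, since offer_responses.csv is not reproduced here), so the printed numbers will not match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def holdout_rmse(X_train, X_test, y_train, y_test, drop=None):
    """Fit a linear regression, optionally without one column, and return the test RMSE."""
    if drop is not None:
        X_train = X_train.drop(columns=drop)
        X_test = X_test.drop(columns=drop)
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test)) ** 0.5


# Synthetic stand-in data (an assumption, NOT the real offer_responses.csv)
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "offer_quality": rng.uniform(5, 20, n),
    "offer_discount": rng.uniform(10, 50, n),
    "offer_reach": rng.uniform(20000, 50000, n),
})
# In this toy data, responses depend on discount and reach only, plus noise
demo["responses"] = (70 * demo["offer_discount"]
                     + 0.1 * demo["offer_reach"]
                     + rng.normal(0, 500, n))

X_demo = demo[["offer_quality", "offer_discount", "offer_reach"]]
y_demo = demo["responses"]
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=10)

# Baseline with every predictor, then one model per dropped column
results = {"all variables": holdout_rmse(Xd_train, Xd_test, yd_train, yd_test)}
for column in X_demo.columns:
    results["without " + column] = holdout_rmse(
        Xd_train, Xd_test, yd_train, yd_test, drop=column
    )

for label, value in results.items():
    print(f"RMSE {label}: {value:.1f}")
```

Every model here is scored on the same test split, which is what makes the RMSE values comparable with one another.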


In [18]:
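# Two coefficients, so this is the last model fitted above (offer_reach dropped);
# the values follow the column order of X_train4: offer_quality, offer_discount
model.coef_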

array([ 7.75510648, 72.82095482])
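
The array returned by coef_ carries no labels; the coefficients follow the column order of the data passed to fit. One way to make that explicit is to wrap them in a pandas Series indexed by the column names. The sketch below assumes synthetic stand-in data (not the real campaign data), and the names X_demo, y_demo, and demo_model are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (an assumption, NOT the real offer_responses.csv)
rng = np.random.default_rng(1)
n = 200
X_demo = pd.DataFrame({
    "offer_quality": rng.uniform(5, 20, n),
    "offer_discount": rng.uniform(10, 50, n),
})
y_demo = (8 * X_demo["offer_quality"]
          + 70 * X_demo["offer_discount"]
          + rng.normal(0, 100, n))

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ is ordered like the columns of the frame passed to fit()
coefficients = pd.Series(demo_model.coef_, index=X_demo.columns)
print(coefficients)
```

In the notebook itself, the equivalent would be pd.Series(model.coef_, index=X_train4.columns), assuming model is still the one fitted on X_train4.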