In [1]:
# Lasso Regression
## Ramya Prabhakar
### In marketing analytics we often have too many candidate predictors for a simple multiple regression to handle well.
### Multiple regression falls apart in these cases because of multicollinearity, and because
### many variables can come out statistically significant, leaving us with no real idea which few factors truly drive
### the result we care about.
### LASSO stands for least absolute shrinkage and selection operator.
### LASSO is useful because it performs feature (predictor) selection. It shrinks the coefficients of weak predictors to
### exactly zero, so it automatically keeps the variables that explain the most variance in our regression while leaving
### out those that explain little unique variance.
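
For reference, the selection behaviour described above comes from the L1 penalty in the lasso objective. The form below follows the scaling used in the scikit-learn documentation; the symbols (n observations, response y_i, predictor vector x_i, intercept \beta_0, coefficient vector \beta, penalty weight \alpha) are notation introduced here, not taken from the notebook.

```latex
\hat{\beta}_0, \hat{\beta}
  = \arg\min_{\beta_0,\, \beta}
    \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top} \beta \right)^2
    + \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

A larger \alpha pushes more coefficients to exactly zero; \alpha = 0 recovers ordinary least squares.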




In [2]:
# import necessary packages
import pandas as pd
# train_test_split lives in sklearn.model_selection;
# the old sklearn.cross_validation module was removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error



In [3]:
# data can be found in github repository
# read the data using pandas
alldata = pd.read_csv('finalmaster-ratios.csv')

In [4]:
# create a list of all predictors
allvariablenames = list(alldata.columns.values)

In [5]:
# drop the target column ('# Purchases') and the first seven census columns, which are invalid as predictors
newdata = alldata.drop(columns=['# Purchases', 'B01001001', 'B01001002', 'B01001003', 'B01001004', 'B01001005', 'B01001006', 'B01001007'])

In [6]:
#create a list of all predictors after removing the invalid columns
listofallpredictors = list(newdata.columns.values)

In [8]:
#load predictors into a dataframe
predictors = newdata[listofallpredictors]  

In [9]:
#load target into a dataframe
target = alldata['# Purchases']                         

In [10]:
# split data into training and test sets, with 30% retained for the test set
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)    
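
As a quick sanity check on what the split does, here is a minimal sketch on synthetic data. The names `demo_X`, `demo_y` and the column labels are made up for illustration and are not part of the census data set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 100 rows, 3 predictors
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
demo_y = pd.Series(rng.normal(size=100), name="y")

# Same arguments as in the notebook: 30% held out, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=123
)

# Predictors and target are shuffled together, so their row indices stay aligned
print(X_train.shape, X_test.shape)
print((X_train.index == y_train.index).all())
```

Fixing `random_state` means the same rows land in the training and test sets on every run, which keeps the error figures reproducible.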


In [11]:
# Fit a LassoLarsCV model on the training data, using 10-fold cross-validation to choose the penalty strength
model=LassoLarsCV(cv=10).fit(pred_train,tar_train)
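
To see the selection effect in isolation, the following sketch fits the same estimator on synthetic data where only two of twenty predictors matter. The data and the `demo_` names are assumptions for illustration; `alpha_` and `coef_` are attributes of the fitted scikit-learn estimator.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV

# Synthetic data (hypothetical): 20 predictors, only the first two drive the target
rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
demo_X = rng.normal(size=(n_samples, n_features))
demo_y = 3.0 * demo_X[:, 0] - 2.0 * demo_X[:, 1] + rng.normal(scale=0.5, size=n_samples)

# 10-fold cross-validation picks the penalty strength alpha
demo_model = LassoLarsCV(cv=10).fit(demo_X, demo_y)

# Count how many coefficients survived the L1 penalty
n_selected = int(np.sum(demo_model.coef_ != 0))
print("chosen alpha:", demo_model.alpha_)
print("non-zero coefficients:", n_selected, "of", n_features)
```

On the real data, the same two attributes (`model.alpha_` and `model.coef_`) show which penalty was chosen and how many census variables were kept.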




In [12]:
# build coefficient chart
print('Create a predictors model data frame that contains the list of all possible predictor variables')
predictors_model = pd.DataFrame(listofallpredictors)
print('Label the columns of the predictors model data frame')
predictors_model.columns = ['label']
print('Assign the coefficients of the model to the coeff column in the predictors model data frame')
predictors_model['coeff'] = model.coef_


Create a predictors model data frame that contains the list of all possible predictor variables
Label the columns of the predictors model data frame
Assign the coefficients of the model to the coeff column in the predictors model data frame


In [13]:
print('Iterate through each row of the predictors model data frame and print the positive coefficients')
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)

Iterate through each row of the predictors model data frame and print the positive coefficients
['B01001036' 2.784856986420159]
['B01001037' 0.9234930857404328]
['B01001038' 0.9491459380764333]
['B02001005' 0.3919782979991068]
['B13014026' 0.22147975335090184]
['B13014027' 0.05121418112617723]
['B19001017' 1.6058830181449382]
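
The loop above prints only positive coefficients. Lasso can also keep predictors with negative coefficients, so a filter on non-zero values may be more complete. The sketch below uses a small stand-in frame: three labels and rounded coefficients are taken from the output above, while the `HYPOTHETICAL_` rows are invented to show the filtering.

```python
import pandas as pd

# Stand-in for predictors_model; the last two rows are hypothetical
demo_coeffs = pd.DataFrame({
    "label": ["B01001036", "B01001037", "B19001017", "HYPOTHETICAL_NEG", "HYPOTHETICAL_ZERO"],
    "coeff": [2.78, 0.92, 1.61, -0.50, 0.0],
})

# Keep every predictor the lasso did not shrink to zero, positive or negative
selected = demo_coeffs[demo_coeffs["coeff"] != 0]

# Order by absolute size so the strongest effects come first
order = selected["coeff"].abs().sort_values(ascending=False).index
selected_sorted = selected.reindex(order)
print(selected_sorted)
```

A boolean mask avoids the row-by-row `iterrows` loop and returns a data frame that can be sorted or exported directly.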


In [14]:
#Calculate the mean squared error for the training set:        
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print ('Training Data MSE')
print(train_error)


Training Data MSE
22525.63625144556


In [15]:
#Calculate the mean squared error for the test set:        
test_error=mean_squared_error(tar_test, model.predict(pred_test))
print ('Testing Data MSE')
print(test_error)


Testing Data MSE
41573.80112905681
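
MSE is expressed in squared units of the target, which makes the raw numbers hard to read. The sketch below first checks the definition on a tiny made-up example, then takes the square root of the two MSE values reported above to get a root mean squared error in the same units as the number of purchases. The `toy_` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Tiny made-up example: MSE is the mean of the squared prediction errors
toy_true = np.array([3.0, 5.0, 7.0])
toy_pred = np.array([2.0, 5.0, 9.0])
manual_mse = np.mean((toy_true - toy_pred) ** 2)
sklearn_mse = mean_squared_error(toy_true, toy_pred)
print(manual_mse, sklearn_mse)

# Square roots of the MSE values printed by the notebook above
train_rmse = np.sqrt(22525.63625144556)
test_rmse = np.sqrt(41573.80112905681)
print(train_rmse, test_rmse)
```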


In [16]:
# Calculate R-squared (the proportion of variance explained) for the training set
rsquared_train=model.score(pred_train,tar_train)
print ('Training data R-square')
print(rsquared_train)


Training data R-square
0.2227648778602469


In [18]:
# Calculate R-squared (the proportion of variance explained) for the test set
rsquared_test=model.score(pred_test,tar_test)
print ('Testing data R-square')
print(rsquared_test)   

Testing data R-square
0.1753817900469531
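
For a regressor, `model.score` returns the coefficient of determination, R-squared = 1 - SS_res / SS_tot, which is the share of the target's variance that the predictions account for. The sketch below checks that formula against `sklearn.metrics.r2_score` on made-up numbers; the `toy_` arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed and predicted values
toy_true = np.array([3.0, 5.0, 7.0, 9.0])
toy_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Residual sum of squares: variation the predictions fail to explain
ss_res = np.sum((toy_true - toy_pred) ** 2)
# Total sum of squares: variation around the mean of the observed values
ss_tot = np.sum((toy_true - toy_true.mean()) ** 2)

manual_r2 = 1.0 - ss_res / ss_tot
print(manual_r2, r2_score(toy_true, toy_pred))
```

Read this way, a test R-squared of about 0.175 means roughly 17.5% of the variance in purchases is explained on held-out data.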


In [19]:
# print the y-intercept of the model
print("y intercept:")
print(model.intercept_)

y intercept:
2.8174754145509553
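
The intercept is the model's prediction when every predictor equals zero. The sketch below illustrates this on synthetic data with a plain `Lasso` estimator and a fixed, arbitrarily chosen `alpha`; this is a stand-in for illustration, not the cross-validated model fitted above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (hypothetical): true baseline of 5.0, one informative predictor
rng = np.random.default_rng(1)
toy_X = rng.normal(size=(50, 4))
toy_y = 5.0 + 2.0 * toy_X[:, 0] + rng.normal(scale=0.1, size=50)

toy_model = Lasso(alpha=0.1).fit(toy_X, toy_y)

# Predicting for an all-zero row returns the intercept
zero_row = np.zeros((1, toy_X.shape[1]))
baseline_prediction = toy_model.predict(zero_row)[0]
print(baseline_prediction, toy_model.intercept_)
```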


In [22]:
print(" Women in the 30 to 44 age groups, people of Asian-only origin, households with income of $200,000 or more, and women aged 15-50 who had a birth in the past 12 months (tabulated by marital status and educational attainment) are associated with more purchases in our example.\n")
print("\n The two census variables with the largest positive coefficients are: B01001036 - Females aged 30-34, and B19001017 - Households with income of $200,000 or more.\n")
print("\n Training Data MSE - 22525.64 Testing Data MSE - 41573.8 The training and test sets have different mean squared errors. The mean squared error (MSE) is the average squared difference between the observed and predicted values, so it measures how close the fitted model is to the data points. In our example, the training set has a lower MSE than the test set. This is expected, since the model was fit to the training data and usually predicts unseen data less accurately. The size of the gap suggests that the model generalizes imperfectly to new data.\n ")
print("\n Training data R-square 0.2227648778602469 testing data R-square 0.1753817900469531 The R-square of the testing data set means that the census predictors explain about 17.5% of the variance in purchases on unseen data. This does not give us much confidence in the model, since most of the variation in sales is left unexplained. \n")
print("\n The baseline sales number is 2.82. This means that if all the predictor variables (x) are set to zero, the model still predicts a baseline of 2.82 purchases.")


 Women in the 30 to 44 age groups, people of Asian-only origin, households with income of $200,000 or more, and women aged 15-50 who had a birth in the past 12 months (tabulated by marital status and educational attainment) are associated with more purchases in our example.


 The two census variables with the largest positive coefficients are: B01001036 - Females aged 30-34, and B19001017 - Households with income of $200,000 or more.


 Training Data MSE - 22525.64 Testing Data MSE - 41573.8 The training and test sets have different mean square errors. Practically the mean square error is a measure of the dispersion within a data set. The Mean Squared Error (MSE) is a measure of how close a fitted line is to data points in a given data set. In our example, the training set has a lesser mean square error than the test set. This seems logically correct, since the training set contains 70% of the data while the test set contains only 30% of the data. Increasing the sample si