# Constructing and Evaluating a Lasso Regression Model Applied to Census and Sales Data
## By Tyler Chambers
### Created for APRD6432: Digital Advertising

## Project Summary

In this project I construct a Lasso regression model that relates sales data for different areas of the US to Census data for those regions. The goal is to find a handful of strong predictor variables that we can use to improve our market targeting in the future.

## Setting up the Environment

In [3]:
#Importing packages for later use
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

#Importing the csv file; you can find it in my GitHub
Filename = 'finalmaster-ratios.csv'
#putting the file into a dataframe called fmr
fmr = pd.read_csv(Filename)

## Building the Model

In [5]:
#First I'm building the list of variable names as vn
vn = list(fmr.columns.values)
#Next I'm dropping the first 8 values from the list, which are not relevant for our predictions, and keeping the 182 variables that follow
vn = vn[8:190]

#Next we are making our dataframes/columns of predictor and output variables
predictors = fmr[vn]
target = fmr['# Purchases']
             
#Next we are splitting the data into testing and training data sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)
#Showing that our split worked
print('\nShowing Successful Data Split')
print('-----------------------------')
print(pred_test.head(1))
print(pred_train.head(1))
print(tar_test.head(1))
print(tar_train.head(1))

#Setting up the Lasso model's parameters
#cv=10 runs 10-fold cross-validation on the training data to choose the penalty strength (alpha)
model = LassoLarsCV(cv=10, precompute=False)

#Attaching our variables to the model using fit
model.fit(pred_train, tar_train)


Showing Successful Data Split
-----------------------------
     B01001008  B01001009  B01001010  B01001011  B01001012  B01001013  \
521   6.542778   6.569125  20.093794  30.571022  30.860836  29.648886   

     B01001014  B01001015  B01001016  B01001017    ...      B19001008  \
521  26.794653  29.815748  34.900673  35.893067    ...      58.186313   

     B19001009  B19001010  B19001011   B19001012   B19001013  B19001014  \
521  57.459287  43.694261  98.318147  115.500194  130.355758  77.161691   

     B19001015  B19001016  B19001017  
521  43.985072  34.751842  26.439511  

[1 rows x 182 columns]
     B01001008  B01001009  B01001010  B01001011  B01001012  B01001013  \
309   9.023455    9.72758  25.827791  35.673712   35.08201  32.744787   

     B01001014  B01001015  B01001016  B01001017    ...      B19001008  \
309  30.756668  36.123405  37.874843  34.283212    ...      50.537432   

     B19001009  B19001010  B19001011   B19001012   B19001013  B19001014  \
309  50.537432  47.085

LassoLarsCV(copy_X=True, cv=10, eps=2.220446049250313e-16, fit_intercept=True,
      max_iter=500, max_n_alphas=1000, n_jobs=1, normalize=True,
      positive=False, precompute=False, verbose=False)
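
To make the fitting step more concrete, here is a minimal sketch of the same split-then-fit workflow on synthetic data. The data from `make_regression` and the name `demo_model` are assumptions for illustration only; they are not the census or sales data used in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and target (NOT the real project data)
X, y = make_regression(n_samples=300, n_features=40, n_informative=5,
                       noise=10.0, random_state=123)

# Same 70/30 split and random_state as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# 10-fold cross-validation picks the penalty strength (alpha)
demo_model = LassoLarsCV(cv=10).fit(X_train, y_train)

print("alpha chosen by cross-validation:", demo_model.alpha_)
n_kept = int(np.sum(demo_model.coef_ != 0))
print("non-zero coefficients:", n_kept, "of", X.shape[1])
```

The fitted attribute `alpha_` holds the penalty chosen by cross-validation, and any coefficient shrunk exactly to zero corresponds to a variable the Lasso dropped.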

## Extracting and Evaluating the Coefficients

In [6]:
#Building the coefficient table
#The following line creates a dataframe named predictors_model with one column that has 
#each row representing one of the variables used for prediction.
predictors_model=pd.DataFrame(vn)
#The following line simply retitles all the columns in the dataframe 
#(which is currently only one column large) to the title label, 
#as before it was simply denoted by 0.
predictors_model.columns = ['label']
#The following line appends a new column to our dataframe titled 'coeff' 
#that has a value equal to each coefficient calculated from our Lasso Model. 
#Since both our dataframe and our model are structured the same way, 
#the coefficients line up with their correct variable label.
predictors_model['coeff'] = model.coef_

#This for loop picks out all of the variables in our dataframe that have a coefficient larger than 0,
#meaning the Lasso kept them and they are associated with higher sales, and then prints these values.
#Variables with a negative coefficient are not printed by this check.
for index, row in predictors_model.iterrows():
    if row['coeff'] > 0:
        print(row.values)

['B01001014' 0.8557908775529921]
['B01001036' 2.505392496591849]
['B01001037' 0.8894214357013622]
['B01001038' 1.5315839680821497]
['B02001005' 0.4125408937426837]
['B13014026' 0.4800240326923769]
['B13014027' 0.6977454940063235]
['B13016001' 874922971.7249781]
['B19001017' 1.4834465563617387]
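
The loop above only prints positive coefficients. As a hedged alternative, the sketch below lists every non-zero coefficient, including negative ones, ranked by absolute size. The labels and values here are hypothetical placeholders standing in for `vn` and `model.coef_`; they are not results from this model.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and coefficients (placeholders for vn and model.coef_)
labels = ['var_a', 'var_b', 'var_c', 'var_d', 'var_e']
coefs = np.array([0.0, 1.5, -2.0, 0.0, 0.4])

coef_series = pd.Series(coefs, index=labels)

# Keep only the variables the Lasso did not shrink to zero
nonzero = coef_series[coef_series != 0]

# Rank by absolute size so strong negative predictors are not hidden
ranked = nonzero.reindex(nonzero.abs().sort_values(ascending=False).index)
print(ranked)
```

With the real model, `pd.Series(model.coef_, index=vn)` would play the role of `coef_series`.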


### Explanation of the Coefficients

a.	B01001014: This variable corresponds to Males aged 40 to 44. Therefore, we can conclude that in areas with higher concentrations of Males aged 40 to 44, we would sell more Bobo Bars.

b.	B01001036: This variable corresponds to Females aged 30 to 34. Therefore, we can conclude that in areas with higher concentrations of Females aged 30 to 34, we would sell more Bobo Bars.

c.	B01001037: This variable corresponds to Females aged 35 to 39. Therefore, we can conclude that in areas with higher concentrations of Females aged 35 to 39, we would sell more Bobo Bars.

d.	B01001038: This variable corresponds to Females aged 40 to 44. Therefore, we can conclude that in areas with higher concentrations of Females aged 40 to 44, we would sell more Bobo Bars.

e.	B02001005: This variable corresponds to individuals who identify as being only Asian in race. Therefore, we can conclude that in areas with higher concentrations of individuals with wholly Asian descent we would sell more Bobo Bars. 

f.	B13014026: This variable corresponds to women aged 15-50 who have not had a birth in the last 12 months, are not married, and have attained a Bachelors degree. Therefore, we can conclude that in areas with higher concentration of women aged 15-50 who have not had a birth in the last 12 months, are not married, and have attained a Bachelors degree we would sell more Bobo Bars. A more succinct (but less robust) way to say this might be independent adult women with a 4-year college degree seem to enjoy our Bobo bar.

g.	B13014027: This variable corresponds to women aged 15-50 who have not had a birth in the last 12 months, are not married, and have attained a graduate or professional degree. Therefore, we can conclude that in areas with higher concentration of women aged 15-50 who have not had a birth in the last 12 months, are not married, and have attained a graduate or professional degree we would sell more Bobo Bars. A more succinct (but less robust) way to say this might be that independent adult women with a graduate degree or above seem to enjoy our Bobo Bars.

h.	B13016001: This variable is interesting. It is the highest level of a variable that breaks down women into two camps based on whether or not they have had a baby in the last 12 months, and then creates levels for both of those camps based on age ranges. Since this variable is the highest level, it is simply all women aged 15 to 50. Therefore, we can conclude that in areas with higher concentrations of women aged 15-50, we would sell more Bobo bars.

i.	B19001017: This variable corresponds to households with an income of over 200,000 for the year. Thus, we would conclude that in areas with higher concentrations of households with an income over 200,000, we would expect to sell more Bobo bars.


### Analysis of the Coefficients

The most important variable to report appears to be women aged 15-50 (B13016001), which had by far the largest coefficient at 874922971.7249781. Taken at face value, adult women are the strongest base for our product, although a coefficient this many orders of magnitude above the rest should be checked against how that variable is scaled before relying on it. The next largest coefficient was women aged 30-34, at 2.505392496591849, which narrows our market segmentation further. Unfortunately, these two predictors overlap with one another, so although they have the largest coefficients, they might not be the most practically useful.

If I were to look at these variables critically, I might consider lumping our three women's age groups (30-34, 35-39, and 40-44) together into one variable covering women aged 30-44, and also pulling in the next highest predictive variable, household income over 200,000. This would create a more actionable prediction, but it makes some assumptions about the data. The safer bet is women aged 15-50, followed by women aged 30-34. The more practical one is women aged 30-44 together with household income over 200,000.
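
When comparing coefficient sizes like this, it helps to remember that a regression coefficient depends on the units of its predictor. The sketch below illustrates that point on synthetic data with an ordinary linear regression (a simpler model swapped in for illustration, not the Lasso used above). All names and values are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on x with a true slope of 3
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

# Fit once with x as-is
coef_original = LinearRegression().fit(x.reshape(-1, 1), y).coef_[0]

# Fit again with the same information, but with values 1000 times smaller
coef_rescaled = LinearRegression().fit((x / 1000).reshape(-1, 1), y).coef_[0]

print(coef_original, coef_rescaled)
```

The second coefficient is 1000 times the first even though the predictor carries exactly the same information, so a very large coefficient can come from a variable with very small values.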

## Examining the Mean Squared Errors

In [8]:
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print('Training data MSE')
print(train_error)
print('-------------------')
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('Testing data MSE')
print(test_error)

Training data MSE
22025.312777378716
-------------------
Testing data MSE
41549.12573000182


For our training set, I received an MSE of 22025.312777378716. For our testing set I got an MSE of 41549.12573000182, which is close to double that of our training set. This is concerning to me, as it says our regression equation fits our training data much better than our testing data: the average squared prediction error is nearly twice as large on data the model has not seen. To me this says that our model might be overfitting our training data.
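
Because MSE is in squared units, taking the square root (RMSE) puts the error back in units of purchases, which can be easier to interpret. A small worked calculation using the two MSE values reported above:

```python
import math

# MSE values copied from the output above
train_mse = 22025.312777378716
test_mse = 41549.12573000182

# RMSE is the square root of MSE, in the same units as the target (# Purchases)
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)
print(round(train_rmse, 1), round(test_rmse, 1))

# Ratio of testing MSE to training MSE
mse_ratio = test_mse / train_mse
print(round(mse_ratio, 2))
```

This works out to a typical error of roughly 148 purchases on the training set and roughly 204 on the testing set, with the testing MSE about 1.89 times the training MSE.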

## Examining the R^2 Values

In [9]:
rsquared_train = model.score(pred_train, tar_train)
print('Training data R-square')
print(rsquared_train)
print('------------------------')
rsquared_test = model.score(pred_test, tar_test)
print('Testing data R-square')
print(rsquared_test)

Training data R-square
0.24002827375880997
------------------------
Testing data R-square
0.17587122769388464


For our training set, we are receiving an R-squared of 0.24002827375880997, which is noticeably larger than that of our testing set, which had an R-squared of 0.17587122769388464. Again, this points to the fact that our model, developed on our training data, may not be generalizable to the larger population, or at the very least may be overfitted and less powerful.

I would say that census data does not seem to be a very good predictor of Bobo bar sales. Even on our training set, we were only able to get an R-squared of approximately .24, which means we are only accounting for 24% of the variability in purchases with this model. And since our model performs even more poorly on our testing set, I'd have to say this model is not a strong indicator of sales. At the very least, this data set may give us some insights into key populations, like adult women and affluent households, but by itself it is not a great predictor.
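
For reference, the R-squared reported by `model.score` is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a tiny made-up example (the arrays are hypothetical, not project data) and compares it with scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: error from always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

An R-squared of .24 therefore means the model's squared error is only 24% smaller than that of a model that always predicts the mean number of purchases.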

## Examining the Baseline Sales Value

In [10]:
print("y intercept:")
print(model.intercept_)

y intercept:
22.194697684317433


Our baseline sales, based on our y-intercept, is 22.194697684317433. Thus, in an area where every predictor variable in the model equals zero, we would still expect to sell about 22.19 Bobo bars.
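
To show why the intercept acts as a baseline, the sketch below fits a model on synthetic data (an assumption for illustration, not the project data) and checks that each prediction equals the intercept plus the weighted sum of the predictors, so a row of all zeros predicts exactly the intercept.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV

# Synthetic stand-in data (NOT the census or sales data)
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
demo_model = LassoLarsCV(cv=5).fit(X, y)

# A prediction is the intercept plus each predictor times its coefficient
manual = demo_model.intercept_ + X @ demo_model.coef_

# With every predictor at zero, only the intercept remains
zero_row = np.zeros((1, X.shape[1]))
baseline = demo_model.predict(zero_row)[0]
print(baseline, demo_model.intercept_)
```

With the real model, `model.intercept_` plays the same role as `demo_model.intercept_` here.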