<a href="https://colab.research.google.com/github/toplyn/structured/blob/master/Python_Exercise_3_Using_LASSO_Regression_to_Model_Fat_Sales_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook analyzes purchase behavior and which variables have the strongest relationship with the number of purchases.  It fits a Lasso regression (LassoLarsCV) to pick out a small set of useful predictors from a large number of candidate variables.

Question 1.
The first line builds a DataFrame whose single column holds all of the predictor column names.  The second line renames that column to 'label'.  The third line adds a 'coeff' column holding the model's coefficient for each variable, so each coefficient sits next to its variable name.  The for loop goes through each row and prints only those whose coefficient is greater than 0.  In other words, these are the variables with a positive linear relationship to purchases: higher values of these variables are associated with more purchases.
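The same filtering can be sketched without an explicit loop by using a boolean mask. This is a minimal sketch, not the notebook's own code: the seven coefficients are copied from the output further down, while the two `HYPOTHETICAL_*` entries are made up here only so that the filter has something to drop.

```python
import pandas as pd

# Coefficients copied from the notebook output, plus two made-up entries
# (HYPOTHETICAL_*) that stand in for predictors the Lasso did not keep.
coefs = pd.Series({
    "B01001036": 2.789241245108878,
    "B01001037": 0.91172111116467,
    "B01001038": 0.9381412192491126,
    "B02001005": 0.3905728308786057,
    "B13014026": 0.21833407432229296,
    "B13014027": 0.04875992798343665,
    "B19001017": 1.6072015752692943,
    "HYPOTHETICAL_ZERO": 0.0,   # assumption: a predictor shrunk to zero
    "HYPOTHETICAL_NEG": -0.5,   # assumption: a negative coefficient
})

# Same condition as the for loop: keep coefficients greater than 0,
# then sort so the largest coefficients come first.
positive = coefs[coefs > 0].sort_values(ascending=False)

# Lasso "selects" every non-zero coefficient, including negative ones,
# so a check on != 0 can return more rows than a check on > 0.
nonzero = coefs[coefs != 0]

print(positive)
```

Note that the `> 0` condition would hide any predictor with a negative coefficient, so checking `!= 0` is a safer way to see everything the model kept.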

Question 2.
- B01001036 - Female age 30 to 34
- B01001037 - Female age 35 to 39
- B01001038 - Female age 40 to 44
- B02001005 - Asian Alone
- B13014026 - Women, birth in past 12 months, unmarried with Bachelors Degree
- B13014027 - Women, birth in past 12 months, unmarried with graduate or professional degree
- B19001017 - Household income (last 12 months) $200k or more

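- Three of the seven selected variables cover females aged 30 to 44, and one of them has the largest coefficient.  This suggests a positive linear relationship between sales and areas with more females between 30 and 44.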

Question 3. 
- If I had to pick two variables, I would pick B01001036 (Female age 30 to 34) and B19001017 (Household income (last 12 months) $200k or more) because they have the two largest coefficients (about 2.79 and 1.61).  Targeting females aged 30 to 34 may also align with the interests of females aged 35 to 44, whose variables have positive coefficients as well.

Question 4.
- The MSE for our training set is 22535.419, but for our test set it is much higher at 41589.162.  Some gap is expected, since the model was fit on the training set and has never seen the test set, but a test MSE nearly twice the training MSE suggests the model does not generalize well.  It could be that variables outside the dataset account for more of the variation, or that there isn't enough training data.
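MSE is in squared units, so it can help to take the square root and read the error in units of purchases. A minimal sketch using the two MSE values printed by the notebook (the variable names here are illustrative only):

```python
import math

# MSE values copied from the notebook output.
train_mse = 22535.419468897468
test_mse = 41589.16222617473

# RMSE is the square root of MSE, expressed in the target's own units.
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

# Ratio of test to training error: values well above 1 point to a
# model that fits the training data better than unseen data.
mse_ratio = test_mse / train_mse

print(f"training RMSE: {train_rmse:.1f}")
print(f"testing RMSE:  {test_rmse:.1f}")
print(f"test/train MSE ratio: {mse_ratio:.2f}")
```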

Question 5.
- The R Squared value for our training set is 0.2224, while for our test set it is 0.1751.  R Squared measures how much of the variance in the target variable is explained by the model.  On our training data about 22% of the variance is explained, while only about 17.5% is explained on our test set.  Overall, I'd say that this model doesn't predict sales with a high level of confidence, though the selected variables may still be significant and may still predict more than other factors.
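To make the definition concrete, R Squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a small made-up dataset rather than the notebook's data, and `r_squared` is a helper written for this illustration, not a library function:

```python
def r_squared(actual, predicted):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_actual = sum(actual) / len(actual)
    # Variation left over after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total variation of the target around its mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up target values, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]

# A model that predicts every value exactly explains all the variance.
perfect_score = r_squared(actual, list(actual))

# A model that always predicts the mean explains none of it.
mean_only_score = r_squared(actual, [25.0] * len(actual))

print(perfect_score, mean_only_score)
```

On this scale, the notebook's test score of about 0.175 sits much closer to the mean-only model than to a perfect fit.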

Question 6.
- The intercept (baseline sales number) is 2.6162, which would suggest about 2.6 purchases when every predictor equals zero.  However, this is likely extrapolation outside the range of the data and may not be accurate.
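The role of the intercept can be sketched by writing the linear prediction out by hand. This is only an illustration: it uses the intercept and two of the seven coefficients from the notebook output, and the input value passed to `predict` is hypothetical.

```python
# Intercept and a subset of coefficients copied from the notebook output.
intercept = 2.616228871954327
coefs = {
    "B01001036": 2.789241245108878,
    "B19001017": 1.6072015752692943,
}

def predict(features):
    """Linear prediction: intercept plus coefficient times feature value.

    Features missing from the dict are treated as zero.
    """
    return intercept + sum(
        coefs[name] * features.get(name, 0.0) for name in coefs
    )

# With every predictor at zero, the prediction is just the intercept.
baseline = predict({})

# Hypothetical input: raising B01001036 by one unit adds its coefficient.
one_unit = predict({"B01001036": 1.0})

print(baseline, one_unit)
```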




In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
final = 'drive/My Drive/Colab Notebooks/finalmaster-ratios.csv'
alldata = pd.read_csv(final)

In [0]:
## Review the shape of the dataset
alldata.shape

(732, 190)

In [0]:
## Review data types
alldata.dtypes

# Purchases      int64
B01001001        int64
B01001002      float64
B01001003      float64
B01001004      float64
                ...   
B19001013      float64
B19001014      float64
B19001015      float64
B19001016      float64
B19001017      float64
Length: 190, dtype: object

In [0]:
## Determine all variables
allvariablenames = list(alldata.columns.values)

In [0]:
# Remove the 7 repetitive columns (positions 1-7) that are similar to # Purchases
alldata = alldata.drop(columns=alldata.columns[1:8])

In [0]:
#load predictors into dataframe
predictors = alldata.loc[:,alldata.columns != '# Purchases']

#load target into dataframe
target = alldata['# Purchases']

In [0]:
# split data into train and test sets, with 30% retained for test
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.3, random_state=123)    

In [0]:
# LassoLarsCV model with 10-fold cross-validation (precompute left at its default)
model = LassoLarsCV(cv = 10).fit(pred_train,tar_train)



In [0]:
## List the predictors with a positive coefficient for # Purchases
predictors_model = pd.DataFrame(predictors.columns)
predictors_model.columns = ['label']
predictors_model['coeff'] = model.coef_

for index, row in predictors_model.iterrows():
  if row['coeff'] > 0:
    print(row.values)

['B01001036' 2.789241245108878]
['B01001037' 0.91172111116467]
['B01001038' 0.9381412192491126]
['B02001005' 0.3905728308786057]
['B13014026' 0.21833407432229296]
['B13014027' 0.04875992798343665]
['B19001017' 1.6072015752692943]


In [0]:
# Mean squared error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
print ('training data MSE')
print(train_error)

training data MSE
22535.419468897468


In [0]:
# Mean squared error for the test set
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('testing data MSE')
print(test_error)

testing data MSE
41589.16222617473


In [0]:
# R Squared for training set
rsquared_train=model.score(pred_train,tar_train)
print ('training data R-square')
print(rsquared_train)

training data R-square
0.22242731313504793


In [0]:
# R Squared for test set
rsquared_test=model.score(pred_test,tar_test)
print ('testing data R-square')
print(rsquared_test)

testing data R-square
0.1750771019966838


In [0]:
print("y intercept:")
print(model.intercept_)

y intercept:
2.616228871954327
