### Goals of this notebook:
### 1. Preprocess the data to prepare for modeling
### 2. Model the data using a number of different models


In [30]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing

In [31]:
import pickle
# raw string so the backslashes in the Windows path are not read as escape sequences
os.chdir(r"C:\Springboard\Github\Capstone2_cust\Intermediate_Data")

In [32]:
# load pickled version of X
with open("X1.pkl", "rb") as f:
    X = pickle.load(f)
# look at the first 10 rows of this file
X.head(10)

Unnamed: 0,first_total,Marketing_first,first_items,first_order,server,Vendor,Source,Area_Code,Ship_Zip,lead_sku,weekday,mon,first_tot_lg,first_it_lg
1,145.58,1,2,2019-11-26 21:44:16+00:00,custom,1.0,web,404,30087,other,Tuesday,November,2.163102,0.30103
2,137.55,0,5,2019-11-26 20:52:08+00:00,custom,1.0,web,845,12545,other,Tuesday,November,2.138461,0.69897
3,22.98,1,2,2019-11-26 18:12:04+00:00,custom,1.0,web,262,53402,ROUTEINS10,Tuesday,November,1.36135,0.30103
4,28.0,0,1,2019-08-07 18:14:49+00:00,custom,0.0,web,617,1983,BEM1003,Wednesday,August,1.447158,0.0
5,12.0,1,1,2019-08-07 18:05:28+00:00,custom,0.0,web,740,43143,other,Wednesday,August,1.079181,0.0
6,42.0,0,2,2019-08-07 03:45:52+00:00,custom,0.0,web,701,58801,BES1006,Wednesday,August,1.623249,0.30103
7,27.2,1,1,2019-08-06 22:00:54+00:00,custom,0.0,web,754,33026,BEM6001,Tuesday,August,1.434569,0.0
8,22.0,1,2,2019-08-06 20:22:25+00:00,custom,0.0,web,unknown,1880,BES5001,Tuesday,August,1.342423,0.30103
9,100.0,1,5,2019-08-06 19:59:05+00:00,custom,0.0,web,617,1880,BEM1007,Tuesday,August,2.0,0.69897
10,36.64,1,2,2019-08-06 19:14:49+00:00,custom,0.0,web,626,92887,BEM1007,Tuesday,August,1.563955,0.30103


That looks pretty good. Let's review what each column means: <br>
- first_total: total $ spent on first order <br>
- Marketing_first: whether the customer accepted marketing on the first order <br>
- first_items: number of items on first order <br>
- first_order: date-time of first order <br>
- server: domain name of the customer email server <br>
- Vendor: 0 = first order from company; 1 = first order from outside source <br>
- Source: web or iphone
- Area_Code: area code of order placed
- Ship_Zip: zip code of shipping address
- lead_sku: name of SKU that was lead item on purchase
- weekday: day of week first order was placed
- mon: month that first order was placed
- first_tot_lg: log (base 10) of first order total
- first_it_lg: log (base 10) of number of items in first order <br>
<br>
The values for some of these categorical features need to be converted to numbers.
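
The sample rows suggest that the two log columns use base 10 (a first_total of 100.0 maps to 2.0, for example). A quick check, using values copied from the rows shown above rather than the full data set:

```python
import numpy as np

# Values copied from the sample rows of X shown above
first_total = np.array([145.58, 100.0])
first_items = np.array([2, 5])

# If the log columns are base 10, these should match first_tot_lg and first_it_lg
tot_lg = np.log10(first_total)
it_lg = np.log10(first_items)
print(tot_lg.round(6), it_lg.round(5))
```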

In [33]:
X['Source'].value_counts()

web        33949
1356615     5423
294517       273
457101        67
580111        16
412739         2
Name: Source, dtype: int64

In [34]:
# let's drop first order item log
X.drop('first_it_lg', axis=1, inplace=True)

In [35]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39730 entries, 1 to 39770
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype              
---  ------           --------------  -----              
 0   first_total      39730 non-null  float64            
 1   Marketing_first  39730 non-null  int64              
 2   first_items      39730 non-null  int64              
 3   first_order      39730 non-null  datetime64[ns, UTC]
 4   server           39730 non-null  object             
 5   Vendor           39730 non-null  float64            
 6   Source           39730 non-null  object             
 7   Area_Code        39730 non-null  object             
 8   Ship_Zip         39730 non-null  object             
 9   lead_sku         39730 non-null  object             
 10  weekday          39730 non-null  object             
 11  mon              39730 non-null  object             
 12  first_tot_lg     39730 non-null  float64            
dtypes: datetime64[ns

Traditional regression models can't use a date-time column directly, so we'll drop it.

In [36]:
# drop date-time
X.drop('first_order', axis=1, inplace=True)
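
Dropping first_order does not throw away the calendar information, since weekday and mon are already in the table. How those two columns were built is not shown in this notebook; one plausible derivation, sketched on a toy frame with timestamps copied from the sample rows, would be:

```python
import pandas as pd

# Toy frame with two timestamps copied from the sample rows above
toy = pd.DataFrame({
    "first_order": pd.to_datetime(
        ["2019-11-26 21:44:16+00:00", "2019-08-07 18:14:49+00:00"], utc=True
    )
})

# Assumed derivation: day and month names via the .dt accessor
toy["weekday"] = toy["first_order"].dt.day_name()
toy["mon"] = toy["first_order"].dt.month_name()
print(toy)
```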

Features that need dummy variables:
- server
- Source
- Area_Code
- Ship_Zip
- lead_sku
- weekday
- mon <br>
<br>
Let's look at the size of these

In [37]:
X['server'].nunique()

26

In [38]:
X['Source'].nunique()

6

In [39]:
X['Area_Code'].nunique()

373

That is too many categorical levels for one feature. I wonder if there are high concentrations of purchases in some area codes that we could account for.

In [40]:
X['Area_Code'].value_counts().head(20)

unknown    21387
949          273
714          204
720          161
760          155
310          148
512          144
214          144
817          139
801          135
757          132
917          131
208          131
909          130
503          130
704          129
916          129
360          128
619          127
303          121
Name: Area_Code, dtype: int64

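There are about 40k customers, and more than half of them (21,387) have an unknown area code. The largest known area code (949) accounts for only about 0.69% of all orders. Each level is too sparse to be useful and would add unnecessary dimensions, so we will drop this feature.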
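
The percentages can be checked from the counts in the output above. This sketch hard-codes the top three counts and the row total reported by X.info(); on the real data, `X['Area_Code'].value_counts(normalize=True)` would give the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above (top rows only)
counts = pd.Series({"unknown": 21387, "949": 273, "714": 204})
n_rows = 39730  # total rows reported by X.info()

# Share of all rows for each level, in percent
share_pct = (counts / n_rows * 100).round(2)
print(share_pct)
```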

In [41]:
X.drop('Area_Code', axis=1, inplace=True)

In [42]:
X['Ship_Zip'].nunique()

15930

That is far too many levels to encode as dummy variables. Unless orders are very concentrated in a few zip codes, we'll drop it.

In [43]:
X['Ship_Zip'].value_counts().head(20)

92692      48
92688      48
unknown    48
92691      46
28532      41
92694      40
92630      39
92627      38
92656      31
92679      29
92677      27
92675      25
92672      24
92626      23
92629      22
80013      22
92592      22
92660      21
93551      20
79936      20
Name: Ship_Zip, dtype: int64
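
To put a number on "concentrated", the coverage of the 20 most common zip codes can be computed from the counts above. This is a sketch with the counts typed in by hand; `n_rows` is the row total reported by X.info().

```python
import pandas as pd

# Top-20 counts copied from the value_counts() output above
top20 = pd.Series([48, 48, 48, 46, 41, 40, 39, 38, 31, 29,
                   27, 25, 24, 23, 22, 22, 22, 21, 20, 20])
n_rows = 39730  # total rows reported by X.info()

# Fraction of all rows covered by the 20 most common zip codes
coverage = top20.sum() / n_rows
print(f"{coverage:.2%}")
```

A small coverage value here means the zip codes are spread thinly, which supports dropping the column.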

In [44]:
# drop that feature
X.drop('Ship_Zip', axis=1, inplace=True)

In [45]:
X['lead_sku'].nunique()

26

We already know how many levels weekday and month have (7 and 12). This means encoding should add: <br>
- server: 26
- Source: 6
- lead_sku: 26
- weekday: 7
- mon: 12 <br>
- total: 77 more features <br>
<br>
That seems reasonable
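
The arithmetic can be written out as a short check. The level counts come from the nunique() calls above; the weekday and month counts assume every day and every month appears in the data.

```python
# Unique-level counts reported by the nunique() calls above;
# weekday and mon are assumed to cover all 7 days and 12 months
levels = {"server": 26, "Source": 6, "lead_sku": 26, "weekday": 7, "mon": 12}

# Columns added by encoding: one per level
n_dummies = sum(levels.values())

# X has 10 columns left after the three drops; 5 of them are replaced by dummies
n_numeric = 10 - len(levels)
n_total = n_numeric + n_dummies
print(n_dummies, n_total)
```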

In [46]:
# Dummy for server
dfs = X['server']
dummy_server = pd.get_dummies(dfs)
X = pd.concat([X.drop('server', axis=1), dummy_server], axis=1)

Unnamed: 0,first_total,Marketing_first,first_items,Vendor,Source,lead_sku,weekday,mon,first_tot_lg,aim.com,...,msn.com,optonline.net,outlook.com,rocketmail.com,sbcglobal.net,vendor,verizon.net,windstream.net,yahoo.com,ymail.com
1,145.58,1,2,1.0,web,other,Tuesday,November,2.163102,0,...,0,0,0,0,0,0,0,0,0,0
2,137.55,0,5,1.0,web,other,Tuesday,November,2.138461,0,...,0,0,0,0,0,0,0,0,0,0
3,22.98,1,2,1.0,web,ROUTEINS10,Tuesday,November,1.361350,0,...,0,0,0,0,0,0,0,0,0,0
4,28.00,0,1,0.0,web,BEM1003,Wednesday,August,1.447158,0,...,0,0,0,0,0,0,0,0,0,0
5,12.00,1,1,0.0,web,other,Wednesday,August,1.079181,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39766,27.98,0,2,1.0,web,ROUTEINS10,Saturday,September,1.446848,0,...,0,0,0,0,0,0,0,0,0,0
39767,59.97,0,2,1.0,web,ROUTEINS10,Saturday,September,1.777934,0,...,0,0,0,0,0,0,0,0,0,0
39768,54.97,1,2,1.0,web,ROUTEINS10,Saturday,September,1.740126,0,...,0,0,0,0,0,0,0,0,0,0
39769,90.68,0,3,1.0,web,other,Saturday,September,1.957512,0,...,0,0,0,0,0,0,0,0,0,0


In [47]:
# Dummy for Source
dfs = X['Source']
dummy_source = pd.get_dummies(dfs)
X = pd.concat([X.drop('Source', axis=1), dummy_source], axis=1)

Unnamed: 0,first_total,Marketing_first,first_items,Vendor,lead_sku,weekday,mon,first_tot_lg,aim.com,aol.com,...,verizon.net,windstream.net,yahoo.com,ymail.com,1356615,294517,412739,457101,580111,web
1,145.58,1,2,1.0,other,Tuesday,November,2.163102,0,0,...,0,0,0,0,0,0,0,0,0,1
2,137.55,0,5,1.0,other,Tuesday,November,2.138461,0,0,...,0,0,0,0,0,0,0,0,0,1
3,22.98,1,2,1.0,ROUTEINS10,Tuesday,November,1.361350,0,0,...,0,0,0,0,0,0,0,0,0,1
4,28.00,0,1,0.0,BEM1003,Wednesday,August,1.447158,0,0,...,0,0,0,0,0,0,0,0,0,1
5,12.00,1,1,0.0,other,Wednesday,August,1.079181,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39766,27.98,0,2,1.0,ROUTEINS10,Saturday,September,1.446848,0,0,...,0,0,0,0,0,0,0,0,0,1
39767,59.97,0,2,1.0,ROUTEINS10,Saturday,September,1.777934,0,0,...,0,0,0,0,0,0,0,0,0,1
39768,54.97,1,2,1.0,ROUTEINS10,Saturday,September,1.740126,0,0,...,0,0,0,0,0,0,0,0,0,1
39769,90.68,0,3,1.0,other,Saturday,September,1.957512,0,0,...,0,0,0,0,0,0,0,0,0,1


In [48]:
# Dummy for lead_sku
dfs = X['lead_sku']
dummy_sku = pd.get_dummies(dfs)
X = pd.concat([X.drop('lead_sku', axis=1), dummy_sku], axis=1)

Unnamed: 0,first_total,Marketing_first,first_items,Vendor,weekday,mon,first_tot_lg,aim.com,aol.com,att.net,...,BES1009,BES1010,BES1011,BES3003,BES5001,ROUTEINS10,ROUTEINS18,ROUTEINS19,ROUTEINS22,other
1,145.58,1,2,1.0,Tuesday,November,2.163102,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,137.55,0,5,1.0,Tuesday,November,2.138461,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,22.98,1,2,1.0,Tuesday,November,1.361350,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,28.00,0,1,0.0,Wednesday,August,1.447158,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,12.00,1,1,0.0,Wednesday,August,1.079181,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39766,27.98,0,2,1.0,Saturday,September,1.446848,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39767,59.97,0,2,1.0,Saturday,September,1.777934,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39768,54.97,1,2,1.0,Saturday,September,1.740126,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39769,90.68,0,3,1.0,Saturday,September,1.957512,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [49]:
# Dummy for weekday
dfs = X['weekday']
dummy_weekday = pd.get_dummies(dfs)
X = pd.concat([X.drop('weekday', axis=1), dummy_weekday], axis=1)

Unnamed: 0,first_total,Marketing_first,first_items,Vendor,mon,first_tot_lg,aim.com,aol.com,att.net,bellsouth.net,...,ROUTEINS19,ROUTEINS22,other,Friday,Monday,Saturday,Sunday,Thursday,Tuesday,Wednesday
1,145.58,1,2,1.0,November,2.163102,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
2,137.55,0,5,1.0,November,2.138461,0,0,0,0,...,0,0,1,0,0,0,0,0,1,0
3,22.98,1,2,1.0,November,1.361350,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,28.00,0,1,0.0,August,1.447158,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,12.00,1,1,0.0,August,1.079181,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39766,27.98,0,2,1.0,September,1.446848,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39767,59.97,0,2,1.0,September,1.777934,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39768,54.97,1,2,1.0,September,1.740126,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
39769,90.68,0,3,1.0,September,1.957512,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0


In [50]:
# Dummy for month
dfs = X['mon']
dummy_mon = pd.get_dummies(dfs)
X = pd.concat([X.drop('mon', axis=1), dummy_mon], axis=1)
X

Unnamed: 0,first_total,Marketing_first,first_items,Vendor,first_tot_lg,aim.com,aol.com,att.net,bellsouth.net,charter.net,...,December,February,January,July,June,March,May,November,October,September
1,145.58,1,2,1.0,2.163102,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,137.55,0,5,1.0,2.138461,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,22.98,1,2,1.0,1.361350,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
4,28.00,0,1,0.0,1.447158,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,12.00,1,1,0.0,1.079181,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
39766,27.98,0,2,1.0,1.446848,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
39767,59.97,0,2,1.0,1.777934,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
39768,54.97,1,2,1.0,1.740126,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
39769,90.68,0,3,1.0,1.957512,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
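
The five encoding cells above repeat the same pattern. As a sketch of a more compact alternative (not what this notebook runs), `pd.get_dummies` can encode several columns in one call through its `columns` argument. It then prefixes each dummy with the source column name, which keeps the origin of levels such as `other` or `vendor` clear. The toy frame below stands in for X and its values are illustrative.

```python
import pandas as pd

# Toy frame standing in for X (values are illustrative)
toy = pd.DataFrame({
    "first_total": [145.58, 28.0, 12.0],
    "Source": ["web", "web", "1356615"],
    "weekday": ["Tuesday", "Wednesday", "Wednesday"],
})

# One call encodes every listed column; each dummy is named <column>_<level>.
# dtype=int keeps the 0/1 integers seen in the tables above.
encoded = pd.get_dummies(toy, columns=["Source", "weekday"], dtype=int)
print(encoded.columns.tolist())
```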


That looks good. Let's load y.

In [52]:
# load pickled version of y
with open("Y1.pkl", "rb") as f:
    y = pickle.load(f)
# look at the first 10 values
y.head(10)

1     2.163102
2     2.138461
3     1.361350
4     1.447158
5     1.079181
6     1.623249
7     1.434569
8     1.342423
9     2.000000
10    1.563955
Name: life_lg, dtype: float64

In [53]:
# let's scale the X values
# Fit a StandardScaler on X so it learns each column's mean and standard deviation
scaler = preprocessing.StandardScaler().fit(X)
# Apply the fitted scaling to X and store the result in X_scaled
X_scaled = scaler.transform(X)

array([[ 1.90467987,  1.19308323, -0.10354591, ...,  2.61788669,
        -0.1628155 , -0.1798618 ],
       [ 1.72547237, -0.83816449,  2.42911099, ...,  2.61788669,
        -0.1628155 , -0.1798618 ],
       [-0.83141472,  1.19308323, -0.10354591, ...,  2.61788669,
        -0.1628155 , -0.1798618 ],
       ...,
       [-0.11748596,  1.19308323, -0.10354591, ..., -0.3819875 ,
        -0.1628155 ,  5.55982433],
       [ 0.67946296, -0.83816449,  0.74067306, ..., -0.3819875 ,
        -0.1628155 ,  5.55982433],
       [-0.78633388, -0.83816449, -0.94776488, ..., -0.3819875 ,
        -0.1628155 ,  5.55982433]])

In [54]:
# let's split the training and test sets
from sklearn.model_selection import train_test_split

# Convert y to a flat 1-dimensional NumPy array (Series.ravel() is deprecated in recent pandas versions)
y = y.to_numpy().ravel()

# let's do the split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=1)
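
Note that the cells above fit the scaler on all rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; the `_demo` names and the random values are placeholders, not part of this notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the feature matrix and target
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first, then learn the scaling parameters from the training rows only
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)  # reuses the training mean and std
```

With this order the training columns have mean zero after scaling, while the test columns are only approximately centered, which is the expected behavior.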
