## Second Data Wrangling and Pre-processing
In the first notebook, we removed all of the data that could identify an individual purchaser. <br>
In this notebook, we'll eliminate some unnecessary columns and create some more important feature columns that we can then examine in more detail in the Exploratory Data Analysis.

## Goal: Eliminate unnecessary columns, create some obvious features, minimize NaN values, and separate into Items, Orders, and Customers DataFrames

In [2]:
import os
import pandas as pd
import numpy as np
import datetime
import pickle

In [3]:
# change to the path with the intermediate data (raw string, so the backslashes are not read as escape sequences)
os.chdir(r"C:\Springboard\Github\Capstone2_cust\Intermediate_Data")
# load the pickled DataFrame
with open("cust_pub4.pkl", "rb") as f:
    df = pickle.load(f)
# look at the first 10 rows of this file
df.head(10)

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,...,Tax 3 Value,Tax 4 Name,Tax 4 Value,Tax 5 Name,Tax 5 Value,Receipt Number,Server,ship_bill,Area_Code,Cust_ID
0,#64088,paid,2020-09-12 05:49:40 -0700,unfulfilled,,no,USD,25.0,0.0,0.0,...,,,,,,,gmail.com,True,,2779653000000.0
1,#64087,paid,2020-09-12 05:36:26 -0700,unfulfilled,,no,USD,90.68,0.0,0.0,...,,,,,,,gmail.com,True,,2779539000000.0
2,#64087,,,,,,,,,,...,,,,,,,gmail.com,False,,2779539000000.0
3,#64087,,,,,,,,,,...,,,,,,,gmail.com,False,,2779539000000.0
4,#64086,paid,2020-09-12 05:28:27 -0700,unfulfilled,,yes,USD,54.97,0.0,0.0,...,,,,,,,gmail.com,True,803.0,2779470000000.0
5,#64086,,,,,,,,,,...,,,,,,,gmail.com,False,,2779470000000.0
6,#64085,paid,2020-09-12 05:26:56 -0700,unfulfilled,,no,USD,54.97,5.0,0.0,...,,,,,,,gmail.com,True,,2779457000000.0
7,#64085,,,,,,,,,,...,,,,,,,gmail.com,False,,2779457000000.0
8,#64084,paid,2020-09-12 05:07:11 -0700,unfulfilled,,no,USD,24.98,3.0,0.0,...,,,,,,,gmail.com,True,,2779303000000.0
9,#64084,,,,,,,,,,...,,,,,,,gmail.com,False,,2779303000000.0


In [4]:
# let's drop all of the tax columns from this DF
df.drop(['Tax 1 Name', 'Tax 1 Value', 'Tax 2 Name', 'Tax 2 Value', 'Tax 3 Name', 'Tax 3 Value', 'Tax 4 Name', 'Tax 4 Value',
       'Tax 5 Name', 'Tax 5 Value'], axis=1, inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142224 entries, 0 to 142223
Data columns (total 45 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Name                         142224 non-null  object 
 1   Financial Status             63084 non-null   object 
 2   Paid at                      61027 non-null   object 
 3   Fulfillment Status           63084 non-null   object 
 4   Fulfilled at                 62270 non-null   object 
 5   Accepts Marketing            63084 non-null   object 
 6   Currency                     63084 non-null   object 
 7   Subtotal                     63084 non-null   float64
 8   Shipping                     63084 non-null   float64
 9   Taxes                        63084 non-null   float64
 10  Total                        63084 non-null   float64
 11  Discount Code                7212 non-null    object 
 12  Discount Amount              63084 non-null   float64
 13 

In [6]:
# we noticed from the first 10 rows that only the first row of each order has this value. Let's forward fill, since the remaining rows of an order directly follow its first row
df['Paid at'] = df['Paid at'].ffill()

In [7]:
# we need to convert the "Paid at" column into datetime
df['Paid at'] = pd.to_datetime(df['Paid at'], infer_datetime_format=True)
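
One thing to watch here: the timestamps carry two different UTC offsets (-0700 and -0800), and depending on the pandas version, parsing mixed offsets without `utc=True` can leave the column as `object` rather than a real datetime dtype. A minimal sketch on made-up sample strings (the `stamps` Series is an illustration, not the real data):

```python
import pandas as pd

# two made-up timestamps with different UTC offsets, in the same layout as 'Paid at'
stamps = pd.Series(["2020-09-12 05:49:40 -0700", "2019-12-26 09:57:17 -0800"])

# utc=True converts every value to UTC, so the result has a single timezone-aware dtype
parsed = pd.to_datetime(stamps, utc=True)

print(parsed.dtype)
print(parsed.dt.hour.tolist())  # hours are now expressed in UTC
```

Normalizing to UTC keeps the column sortable and comparable; the local wall-clock hour can be recovered later with a timezone conversion if it is needed as a feature.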

In [8]:
df.head(10)

Unnamed: 0,Name,Financial Status,Paid at,Fulfillment Status,Fulfilled at,Accepts Marketing,Currency,Subtotal,Shipping,Taxes,...,Id,Tags,Risk Level,Source,Lineitem discount,Receipt Number,Server,ship_bill,Area_Code,Cust_ID
0,#64088,paid,2020-09-12 05:49:40-07:00,unfulfilled,,no,USD,25.0,0.0,0.0,...,2779653000000.0,,Low,web,0.0,,gmail.com,True,,2779653000000.0
1,#64087,paid,2020-09-12 05:36:26-07:00,unfulfilled,,no,USD,90.68,0.0,0.0,...,2779539000000.0,,Low,web,0.0,,gmail.com,True,,2779539000000.0
2,#64087,,2020-09-12 05:36:26-07:00,,,,,,,,...,,,,,0.0,,gmail.com,False,,2779539000000.0
3,#64087,,2020-09-12 05:36:26-07:00,,,,,,,,...,,,,,0.0,,gmail.com,False,,2779539000000.0
4,#64086,paid,2020-09-12 05:28:27-07:00,unfulfilled,,yes,USD,54.97,0.0,0.0,...,2779470000000.0,,Low,web,0.0,,gmail.com,True,803.0,2779470000000.0
5,#64086,,2020-09-12 05:28:27-07:00,,,,,,,,...,,,,,0.0,,gmail.com,False,,2779470000000.0
6,#64085,paid,2020-09-12 05:26:56-07:00,unfulfilled,,no,USD,54.97,5.0,0.0,...,2779457000000.0,,Low,web,0.0,,gmail.com,True,,2779457000000.0
7,#64085,,2020-09-12 05:26:56-07:00,,,,,,,,...,,,,,0.0,,gmail.com,False,,2779457000000.0
8,#64084,paid,2020-09-12 05:07:11-07:00,unfulfilled,,no,USD,24.98,3.0,0.0,...,2779303000000.0,,Low,web,0.0,,gmail.com,True,,2779303000000.0
9,#64084,,2020-09-12 05:07:11-07:00,,,,,,,,...,,,,,0.0,,gmail.com,False,,2779303000000.0


In [9]:
# let's drop some more useless columns
df.drop(['Taxes', 'Notes', 'Note Attributes',
       'Cancelled at'], axis=1, inplace=True)

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142224 entries, 0 to 142223
Data columns (total 41 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   Name                         142224 non-null  object 
 1   Financial Status             63084 non-null   object 
 2   Paid at                      142224 non-null  object 
 3   Fulfillment Status           63084 non-null   object 
 4   Fulfilled at                 62270 non-null   object 
 5   Accepts Marketing            63084 non-null   object 
 6   Currency                     63084 non-null   object 
 7   Subtotal                     63084 non-null   float64
 8   Shipping                     63084 non-null   float64
 9   Total                        63084 non-null   float64
 10  Discount Code                7212 non-null    object 
 11  Discount Amount              63084 non-null   float64
 12  Shipping Method              62285 non-null   object 
 13 

In [11]:
# Receipt Number is empty - drop that
# Fulfilled at is missing a lot of values - we are using 'Paid at' instead
# remove a few more columns that are too sparse to be useful in modeling
df.drop(['Fulfilled at', 'Receipt Number', 'Location', 'Device ID', 'Id', 'Risk Level'], axis=1, inplace=True)

In [12]:
# Let's see what currencies are used
df['Currency'].value_counts()

USD    63084
Name: Currency, dtype: int64

In [13]:
# it's just USD ($) or NaN. Not worth keeping that column
df.drop(['Currency'], axis=1, inplace=True)

In [14]:
# let's look at Paid at vs. Created at
df[['Paid at', 'Created at']].sample(10)

Unnamed: 0,Paid at,Created at
101976,2019-12-26 09:57:17-08:00,2019-12-26 09:57:16 -0800
109088,2019-12-01 16:25:58-08:00,2019-12-01 16:25:58 -0800
127379,2019-09-09 20:55:48-07:00,2019-09-09 20:55:48 -0700
103109,2019-12-22 04:59:51-08:00,2019-12-22 04:59:51 -0800
135196,2018-07-03 22:19:13-07:00,2018-07-03 22:19:12 -0700
111447,2019-11-28 16:58:15-08:00,2019-11-28 16:57:08 -0800
90933,2020-01-24 05:32:02-08:00,2020-01-24 05:32:01 -0800
65493,2020-04-15 17:14:20-07:00,2020-04-15 17:14:19 -0700
79223,2020-03-01 07:40:41-08:00,2020-03-01 07:40:41 -0800
109923,2019-11-30 08:37:51-08:00,2019-11-30 08:37:50 -0800


Those look nearly identical, apart from a short lag for the payment (at most a second in most of the sampled rows, about a minute in one). I'm good with dropping the 'Paid at' column.

In [15]:
df.drop(['Paid at'], axis=1, inplace=True)

In [16]:
# since we are using 'Created at' as the time stamp, let's convert it to datetime
df['Created at'] = pd.to_datetime(df['Created at'], infer_datetime_format=True, utc=True)

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142224 entries, 0 to 142223
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype              
---  ------                       --------------   -----              
 0   Name                         142224 non-null  object             
 1   Financial Status             63084 non-null   object             
 2   Fulfillment Status           63084 non-null   object             
 3   Accepts Marketing            63084 non-null   object             
 4   Subtotal                     63084 non-null   float64            
 5   Shipping                     63084 non-null   float64            
 6   Total                        63084 non-null   float64            
 7   Discount Code                7212 non-null    object             
 8   Discount Amount              63084 non-null   float64            
 9   Shipping Method              62285 non-null   object             
 10  Created at                   142

### These look pretty good. Now, it's time to start filling in some of the NaN values

In [18]:
# For financial status
df['Financial Status'].value_counts()

paid                  61766
refunded                690
partially_refunded      616
partially_paid            7
pending                   5
Name: Financial Status, dtype: int64

In [19]:
# it looks like only the first line of an order has the Financial Status; we'll forward fill
df['Financial Status'] = df['Financial Status'].ffill(limit=25)

In [20]:
# same applies for Fulfillment Status
df['Fulfillment Status'] = df['Fulfillment Status'].ffill(limit=25)

In [21]:
# same is true for Accepts Marketing
df['Accepts Marketing'] = df['Accepts Marketing'].ffill(limit=25)
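
A side note on the `limit=25`: after these fills, `df.info()` reports 142217 non-null values out of 142224, so 7 rows stay empty. That is consistent with the largest order having 33 rows (1 filled row plus 25 forward-filled rows leaves 7). A plain forward fill can also carry a value across an order boundary if an order's first row happens to be empty. As a hedged alternative, the sketch below fills within each order only; the `toy` DataFrame is made up for illustration:

```python
import numpy as np
import pandas as pd

# made-up data: order #1 has a status on its first row, order #2 has none at all
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#1", "#2", "#2"],
    "Financial Status": ["paid", np.nan, np.nan, np.nan, np.nan],
})

# a plain forward fill leaks 'paid' from order #1 into order #2
plain = toy["Financial Status"].ffill()

# a group-wise forward fill stays inside each order and needs no row limit
grouped = toy.groupby("Name")["Financial Status"].ffill()

print(plain.tolist())
print(grouped.tolist())
```

With the group-wise version, an order of any length is filled completely, and an order with no status stays empty instead of inheriting its neighbor's value.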

In [22]:
df['Tags'].value_counts()

Subscription, Subscription Recurring Order                                                                                                        1113
Subscription, Subscription First Order                                                                                                             896
recurring_order                                                                                                                                    480
0-3DB542442Y809192H, carthook-checkout, ch_FID_carthook-checkout, ch_id_0htFEtCty4bxpLxlrrqK                                                         3
0-100137263438, CartHook Checkout, carthook-skinquiz-minikits, ch_FID_carthook-skinquiz-minikit, ch_id_H7WWZqswDx8D5DW9jH9d                          3
                                                                                                                                                  ... 
0-pi_0HPcbCnIsXf3x9XH5n71VFaf, carthook-checkout, ch_FID_carthook-checkout, ch_id_GktwTKYxLDo6

In [23]:
# these look unnecessarily complicated; we considered dropping them, but we'll keep the column for now
# df.drop(['Tags'], axis=1, inplace=True)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142224 entries, 0 to 142223
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype              
---  ------                       --------------   -----              
 0   Name                         142224 non-null  object             
 1   Financial Status             142217 non-null  object             
 2   Fulfillment Status           142217 non-null  object             
 3   Accepts Marketing            142217 non-null  object             
 4   Subtotal                     63084 non-null   float64            
 5   Shipping                     63084 non-null   float64            
 6   Total                        63084 non-null   float64            
 7   Discount Code                7212 non-null    object             
 8   Discount Amount              63084 non-null   float64            
 9   Shipping Method              62285 non-null   object             
 10  Created at                   142

In [25]:
# let's look at Payment Reference
df['Payment Reference'].value_counts()

c11911267909689.1    1
c12032780271673.1    1
c12195769483321.1    1
c12251903066169.1    1
#59738.1             1
                    ..
c14758843809960.2    1
c11954251890745.1    1
#54261.1             1
c11874050342969.1    1
c14787129376936.1    1
Name: Payment Reference, Length: 61651, dtype: int64

In [26]:
# let's drop it
df.drop(['Payment Reference'], axis=1, inplace=True)

In [27]:
# let's create one more feature that would be usable: total items in an order
df['ITEMS'] = df.groupby('Name')['Lineitem quantity'].transform('sum')
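
As a quick illustration of what `transform('sum')` does here, this toy example (made-up order names and quantities) shows that every row of an order receives the order's total quantity, so the result lines up with the original rows:

```python
import pandas as pd

# made-up line items: order #1 has two lines, order #2 has one
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2"],
    "Lineitem quantity": [1, 2, 5],
})

# transform returns one value per original row, so it can be assigned straight back
toy["ITEMS"] = toy.groupby("Name")["Lineitem quantity"].transform("sum")

print(toy)
```

Using `transform` rather than `agg` keeps the line-item rows intact, which matters because the same DataFrame later becomes the Items table.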

In [28]:
df['ITEMS'].unique()

array([  1,   3,   2,   5,   4,   6,  10,   7,   8,  11,  12,   9,  14,
        22,  15,  80,  20,  40,  19,  18,  30,  50,  52,  45,  33,  16,
        13, 100,  25,  35,  26,  21,  17], dtype=int64)

In [29]:
# see how many unique "names" are in the DF
df['Name'].value_counts()

#5957     33
#60457    22
#34086    20
#25140    19
#10488    16
          ..
#51999     1
#53695     1
#17108     1
#10648     1
#32476     1
Name: Name, Length: 63084, dtype: int64

The number of unique order names (63,084) matches the non-null count of "Subtotal" and the other order-specific fields.

In [30]:
# let's see if we can use the compare-at price relative to the line item price as another feature
df['compared'] = (df['Lineitem compare at price'] - df['Lineitem price'])/df['Lineitem price']
# positive values mean the line item price is cheaper than the compare-at price
# we expect this relative difference to matter more than the absolute one
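
Two edge cases may be worth guarding against in this ratio: a line item price of 0 makes the division blow up, and a compare-at price of 0 yields exactly -1.0, which probably means "no compare-at price set" rather than a real discount (that reading is an assumption, not something confirmed by the data source). A minimal sketch on made-up prices:

```python
import pandas as pd

# made-up prices covering the normal case and both edge cases
items = pd.DataFrame({
    "Lineitem price": [20.0, 34.0, 0.0],
    "Lineitem compare at price": [25.0, 0.0, 10.0],
})

price = items["Lineitem price"]
compare = items["Lineitem compare at price"]

# keep the ratio only where both prices are positive; everything else becomes NaN
valid = (price > 0) & (compare > 0)
items["compared"] = ((compare - price) / price).where(valid)

print(items)
```

Whether to mask these rows or keep the raw values depends on how the feature is used later; masking simply makes the unusual cases explicit.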

In [31]:
# let's convert this to a difference in price
df['Lineitem compare at price'] = df['Lineitem compare at price'] - df['Lineitem price']

### Separate the Dataframe <br>
Right now, the items and the orders are each lines in the DataFrame; we are going to separate the data into 3 DataFrames:
### 1. Order - contains the order information
### 2. Items - line by line items contained in an order
### 3. Customers - contains the sum of the orders and items

In [32]:
# create Order DF by grouping on Name; note that first() takes the first non-null value of each column within an order
Order = df.groupby('Name').first()

# alternatively, group by Name and keep the row whose Subtotal is not null
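
To make the behavior of `groupby(...).first()` concrete: it returns the first non-null value of each column per group, which is not always the same as the first row. The toy data below is made up to show the difference, with `drop_duplicates` used as one way to take the literal first row:

```python
import numpy as np
import pandas as pd

# made-up data: the first row of order #1 has no Subtotal
toy = pd.DataFrame({
    "Name": ["#1", "#1", "#2"],
    "Subtotal": [np.nan, 25.0, 90.68],
    "Lineitem name": ["item A", "item B", "item C"],
})

# first non-null value per column within each order
first_non_null = toy.groupby("Name").first()

# literal first row of each order
first_row = toy.drop_duplicates("Name").set_index("Name")

print(first_non_null)
print(first_row)
```

For this dataset the two should agree on order-level fields, since those sit on an order's first row, but line-item columns with gaps could be filled from a later row of the same order.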

In [33]:
Order.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63084 entries, #10000 to #9999
Data columns (total 33 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   Financial Status             63084 non-null  object             
 1   Fulfillment Status           63084 non-null  object             
 2   Accepts Marketing            63084 non-null  object             
 3   Subtotal                     63084 non-null  float64            
 4   Shipping                     63084 non-null  float64            
 5   Total                        63084 non-null  float64            
 6   Discount Code                7212 non-null   object             
 7   Discount Amount              63084 non-null  float64            
 8   Shipping Method              62285 non-null  object             
 9   Created at                   63084 non-null  datetime64[ns, UTC]
 10  Lineitem quantity            63084 non-null  i

In [34]:
# let's look at discount codes
Order['Discount Code'].value_counts()

BAMBUBEAUTY            1404
THANKYOU10              664
save10                  559
Custom discount         274
CARACLARKNUTRITION      221
                       ... 
6650a83187f2              1
669c0584d084              1
RealSavings3c0d2d9d       1
83b673ae5ff8              1
RealSavings9368f887       1
Name: Discount Code, Length: 920, dtype: int64

The most popular discount codes are used often enough that they could provide some value, but the most common code appears on only about 2% of all orders, and discount codes are used on about 11% of orders overall. I think it's best to start with just the discount amount, which is already contained in another column, so we'll drop this column.
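
For reference, the two percentages can be reproduced from the counts shown above (63084 orders, 7212 orders with a discount code, 1404 uses of the most common code):

```python
# counts taken from the Order.info() and value_counts() outputs above
n_orders = 63084
orders_with_code = 7212
top_code_uses = 1404

share_with_code = orders_with_code / n_orders
share_top_code = top_code_uses / n_orders

print(f"{share_with_code:.1%}")  # 11.4%
print(f"{share_top_code:.1%}")   # 2.2%
```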

In [35]:
Order.drop(['Discount Code'], axis=1, inplace=True)

In [36]:
# for orders, it shouldn't matter whether a particular item is taxable or what its line-item fulfillment status is, so we'll drop both columns
Order.drop(['Lineitem taxable', 'Lineitem fulfillment status'], axis=1, inplace=True)

In [37]:
Order.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63084 entries, #10000 to #9999
Data columns (total 30 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   Financial Status            63084 non-null  object             
 1   Fulfillment Status          63084 non-null  object             
 2   Accepts Marketing           63084 non-null  object             
 3   Subtotal                    63084 non-null  float64            
 4   Shipping                    63084 non-null  float64            
 5   Total                       63084 non-null  float64            
 6   Discount Amount             63084 non-null  float64            
 7   Shipping Method             62285 non-null  object             
 8   Created at                  63084 non-null  datetime64[ns, UTC]
 9   Lineitem quantity           63084 non-null  int64              
 10  Lineitem name               63084 non-null  object        

In [38]:
# let's fill the missing payment method values with "Unknown"
Order['Payment Method'] = Order['Payment Method'].fillna('Unknown')

In [39]:
# let's look at Line item requires shipping
Order['Lineitem requires shipping'].value_counts()

True     33591
False    29493
Name: Lineitem requires shipping, dtype: int64

That seems reasonable enough; let's keep that

In [40]:
# fill missing SKUs with the line item name, then drop the name column since the sku covers it
Order['Lineitem sku'] = Order['Lineitem sku'].fillna(Order['Lineitem name'])
Order.drop(['Lineitem name'], axis=1, inplace=True)

In [41]:
Order.info()

<class 'pandas.core.frame.DataFrame'>
Index: 63084 entries, #10000 to #9999
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   Financial Status            63084 non-null  object             
 1   Fulfillment Status          63084 non-null  object             
 2   Accepts Marketing           63084 non-null  object             
 3   Subtotal                    63084 non-null  float64            
 4   Shipping                    63084 non-null  float64            
 5   Total                       63084 non-null  float64            
 6   Discount Amount             63084 non-null  float64            
 7   Shipping Method             62285 non-null  object             
 8   Created at                  63084 non-null  datetime64[ns, UTC]
 9   Lineitem quantity           63084 non-null  int64              
 10  Lineitem price              63084 non-null  float64       

In [42]:
# Customer ID should be an integer - but this gets weird, so we'll skip it.
# Order['Cust_ID'] = Order['Cust_ID'].astype('int')

In [43]:
# let's find out how this shipping method looks
Order['Shipping Method'].value_counts()

Standard Shipping (5-7 Business Days)             20338
USPS First Class Package (5-7 Business Days)      14909
USPS First Class Package (2-5 Business Days)       7680
Free Shipping (5-8 Business Days)                  6902
Priority Mail                                      2349
Fedex 2Day (2-3 Business Days)                     1124
First Class Package                                 925
Flat Rate Shipping                                  818
Standard Shipping (5-8 Business Days)               781
USPS Priority Mail (1-3 Business Days)              772
Free shipping                                       742
Standard Shipping (free)                            696
Priority Mail (2-4 Business Days)                   569
UPS® Ground                                         539
USPS First Class International                      503
Always Free Shipping                                480
Free shipping for orders over $99                   463
USPS First Class                                

In [44]:
# let's fill the missing Shipping Method values with "Unknown"
Order['Shipping Method'] = Order['Shipping Method'].fillna('Unknown')

In [45]:
Order.head(10)

Name,Financial Status,Fulfillment Status,Accepts Marketing,Subtotal,Shipping,Total,Discount Amount,Shipping Method,Created at,Lineitem quantity,...,Employee,Tags,Source,Lineitem discount,Server,ship_bill,Area_Code,Cust_ID,ITEMS,compared
#10000,paid,fulfilled,yes,8.0,0.0,8.62,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 15:49:51+00:00,1,...,False,,web,0.0,gmail.com,True,,2029549000000.0,1,
#10001,paid,fulfilled,yes,44.0,0.0,44.0,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 16:22:12+00:00,1,...,False,,web,0.0,gmail.com,True,,2604838000000.0,2,
#10002,paid,fulfilled,yes,34.0,0.0,34.0,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 16:57:00+00:00,1,...,False,,web,0.0,yahoo.com,False,813.0,1928534000000.0,1,-1.0
#10003,paid,fulfilled,yes,34.0,0.0,34.0,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 17:15:01+00:00,1,...,False,,web,0.0,gmail.com,True,,1825239000000.0,1,-1.0
#10004,paid,fulfilled,yes,8.0,0.0,8.0,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 17:26:41+00:00,1,...,False,,web,0.0,gmail.com,True,513.0,1886785000000.0,1,
#10005,paid,fulfilled,yes,34.0,14.55,48.55,0.0,USPS First Class International,2019-10-25 17:26:46+00:00,1,...,False,,web,0.0,gmail.com,True,,1825258000000.0,1,-1.0
#10006,partially_refunded,fulfilled,yes,56.0,8.99,64.99,0.0,Priority Mail,2019-10-25 18:24:25+00:00,1,...,False,,web,0.0,gmail.com,False,,1825330000000.0,2,
#10007,paid,fulfilled,yes,80.0,0.0,80.0,0.0,USPS First Class Package (2-5 Business Days),2019-10-25 18:38:36+00:00,1,...,False,,web,0.0,gmail.com,False,,1825345000000.0,5,
#10008,paid,fulfilled,yes,72.0,0.0,72.0,8.0,USPS First Class Package (2-5 Business Days),2019-10-25 18:59:44+00:00,1,...,False,,web,0.0,gmail.com,True,,2010503000000.0,4,
#10009,paid,fulfilled,yes,0.0,0.0,0.0,18.0,Free shipping,2019-10-25 19:08:24+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,347.0,2110373000000.0,1,


In [46]:
# based on some weird data, let's look at the source
Order.Source.value_counts()

web                    50325
1356615                 8653
294517                  1977
shopify_draft_order     1564
457101                   478
580111                    45
1424624                   20
charge_rabbit             12
iphone                     8
412739                     2
Name: Source, dtype: int64

shopify_draft_order may just be draft orders that were used to test the system and not actual orders

In [47]:
Order[Order['Source'] == 'shopify_draft_order']

Name,Financial Status,Fulfillment Status,Accepts Marketing,Subtotal,Shipping,Total,Discount Amount,Shipping Method,Created at,Lineitem quantity,...,Employee,Tags,Source,Lineitem discount,Server,ship_bill,Area_Code,Cust_ID,ITEMS,compared
#10009,paid,fulfilled,yes,0.0,0.0,0.0,18.0,Free shipping,2019-10-25 19:08:24+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,347,2.110373e+12,1,
#10010,paid,unfulfilled,yes,0.0,0.0,0.0,28.0,Free shipping,2019-10-25 19:13:09+00:00,1,...,True,,shopify_draft_order,0.0,yahoo.com,True,,1.825382e+12,1,
#10299,paid,fulfilled,yes,0.0,0.0,0.0,1192.0,Free shipping,2019-10-29 21:38:36+00:00,3,...,True,,shopify_draft_order,0.0,hold,True,949,2.615669e+12,15,
#10308,paid,fulfilled,yes,0.0,0.0,0.0,46.0,Free shipping,2019-10-29 23:10:34+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,,1.834192e+12,1,
#10409,paid,fulfilled,yes,0.0,0.0,0.0,38.0,Free shipping,2019-10-31 02:42:21+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,,1.982813e+12,2,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
#9754,paid,fulfilled,yes,0.0,0.0,0.0,34.0,Free shipping,2019-10-21 20:43:35+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,False,570,1.836541e+12,1,
#9755,paid,fulfilled,yes,0.0,0.0,0.0,28.0,Free shipping,2019-10-21 20:45:10+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,618,1.814752e+12,1,
#9761,paid,fulfilled,yes,0.0,0.0,0.0,28.0,Free shipping,2019-10-21 22:14:39+00:00,1,...,True,,shopify_draft_order,0.0,yahoo.com,True,,2.588982e+12,1,
#9852,paid,fulfilled,yes,0.0,0.0,0.0,30.0,Free shipping,2019-10-22 22:15:20+00:00,1,...,True,,shopify_draft_order,0.0,gmail.com,True,,1.819088e+12,3,


These look weird and are probably just tests. I'm dropping them.

In [48]:
Order = Order[Order['Source'] != 'shopify_draft_order']

In [49]:
Order.info()

<class 'pandas.core.frame.DataFrame'>
Index: 61520 entries, #10000 to #9999
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   Financial Status            61520 non-null  object             
 1   Fulfillment Status          61520 non-null  object             
 2   Accepts Marketing           61520 non-null  object             
 3   Subtotal                    61520 non-null  float64            
 4   Shipping                    61520 non-null  float64            
 5   Total                       61520 non-null  float64            
 6   Discount Amount             61520 non-null  float64            
 7   Shipping Method             61520 non-null  object             
 8   Created at                  61520 non-null  datetime64[ns, UTC]
 9   Lineitem quantity           61520 non-null  int64              
 10  Lineitem price              61520 non-null  float64       

### I think that wraps it up for the Order DF

### On to the Items DF that contains all of the line items in the orders

In [50]:
# every row in the dataframe represents a line item, so we'll keep them all in the Items DF
Items = df.copy()
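
Note that `df.copy()` still contains the line items of the `shopify_draft_order` orders that were removed from `Order`. If the two tables should stay consistent, one option is to keep only the lines whose order name survives in the order table. This is a sketch on made-up data, with lowercase names (`lines`, `orders`, `items`) chosen to avoid implying it is part of the notebook:

```python
import pandas as pd

# made-up line items: order #2 is a draft order
lines = pd.DataFrame({
    "Name": ["#1", "#1", "#2", "#3"],
    "Source": ["web", None, "shopify_draft_order", "web"],
    "Lineitem quantity": [1, 2, 1, 4],
})

# order table built the same way as above, with draft orders removed
orders = lines.groupby("Name").first()
orders = orders[orders["Source"] != "shopify_draft_order"]

# keep only line items that belong to a remaining order
items = lines[lines["Name"].isin(orders.index)].copy()

print(items)
```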

### That takes care of the Items DF

### Still have to work on the Customer DF

In [51]:
Order['Cust_ID'].value_counts()

2.746662e+12    48
2.695320e+12    47
2.577523e+12    36
2.602749e+12    36
2.599750e+12    36
                ..
2.585387e+12     1
1.930881e+12     1
2.068324e+12     1
2.617386e+12     1
1.864016e+12     1
Name: Cust_ID, Length: 39771, dtype: int64

Let's separate the customers based on these value counts

In [52]:
#Order[Order['Cust_ID'] == -2147483648]
# this value showed up in 60k orders when we changed these from float to integer. I have no idea why
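
A likely explanation for the `-2147483648` value, offered as an assumption rather than something verified on this machine: on Windows, NumPy versions before 2.0 map `astype('int')` to a 32-bit integer, and customer IDs around 2.7e12 do not fit in 32 bits, so the cast overflows to the smallest int32 value. The sketch below uses made-up IDs of the same magnitude and pandas' nullable `Int64` dtype, which also tolerates missing IDs:

```python
import numpy as np
import pandas as pd

# made-up customer IDs of the same magnitude as Cust_ID, plus one missing value
cust_id = pd.Series([2779653000000.0, 2779539000000.0, np.nan])

# the largest value a 32-bit integer can hold is far below these IDs
int32_max = np.iinfo(np.int32).max
print(int32_max, cust_id.max() > int32_max)

# an explicit 64-bit nullable integer dtype avoids the overflow and keeps NaN as <NA>
as_int = cust_id.astype("Int64")
print(as_int)
```

Asking for a 64-bit dtype by name, instead of the platform-dependent `'int'`, should give the same result on every operating system.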

In [53]:
# Create customer DF by aggregating the orders DF over the Customer ID
# 'Accepts Marketing': 'mode', 'Shipping Method': 'mode', 'Payment Method': 'mode',
Cust = Order.groupby('Cust_ID', as_index=False).agg({'Total': ["sum", 'mean'], 'Fulfillment Status': 'count', 'Subtotal': 'sum', 'Shipping': 'sum', 'Refunded Amount': 'sum', 'Accepts Marketing': 'sum', 'ITEMS': 'sum', 'Created at': ['first', 'last'], 'Server': 'first', 'Discount Amount': 'sum', 'Vendor': 'first', 'Employee': 'first', 'Source': 'first', 'ship_bill': 'first', 'Area_Code': 'first', 'Shipping Zip': 'first'})
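
Two details of this aggregation may deserve a second look. Summing the 'yes'/'no' strings in `Accepts Marketing` appears to concatenate them for repeat customers, and `first`/`last` on `Created at` follow the row order of `Order` (sorted by order name as text) rather than time. The sketch below is one hedged alternative on made-up data: it sorts by date, takes the most recent marketing preference, and uses `min`/`max` for the dates. Named aggregation is used so that the result has flat column names, which are all hypothetical:

```python
import pandas as pd

# made-up orders: customer 1.0 has two orders that are not in chronological order
orders = pd.DataFrame({
    "Cust_ID": [1.0, 1.0, 2.0],
    "Total": [10.0, 30.0, 5.0],
    "Accepts Marketing": ["yes", "no", "no"],
    "Created at": pd.to_datetime(
        ["2020-03-01", "2019-10-25", "2020-01-15"], utc=True
    ),
})

# sort by time first so that 'last' means "most recent" within each customer
cust = (
    orders.sort_values("Created at")
    .groupby("Cust_ID", as_index=False)
    .agg(
        total_sum=("Total", "sum"),
        total_mean=("Total", "mean"),
        n_orders=("Total", "count"),
        accepts_marketing=("Accepts Marketing", "last"),
        first_order=("Created at", "min"),
        last_order=("Created at", "max"),
    )
)

print(cust)
```

Flat column names also make the exported CSV easier to read back, since a two-level header does not round-trip cleanly without extra arguments.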

In [54]:
Cust.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39771 entries, 0 to 39770
Data columns (total 19 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   (Cust_ID, )                  39771 non-null  float64            
 1   (Total, sum)                 39771 non-null  float64            
 2   (Total, mean)                39771 non-null  float64            
 3   (Fulfillment Status, count)  39771 non-null  int64              
 4   (Subtotal, sum)              39771 non-null  float64            
 5   (Shipping, sum)              39771 non-null  float64            
 6   (Refunded Amount, sum)       39771 non-null  float64            
 7   (Accepts Marketing, sum)     39771 non-null  object             
 8   (ITEMS, sum)                 39771 non-null  int64              
 9   (Created at, first)          39771 non-null  datetime64[ns, UTC]
 10  (Created at, last)           39771 non-null  d

In [55]:
# this is exciting let's look at the first 10 rows
Cust.head(10)

Unnamed: 0_level_0,Cust_ID,Total,Total,Fulfillment Status,Subtotal,Shipping,Refunded Amount,Accepts Marketing,ITEMS,Created at,Created at,Server,Discount Amount,Vendor,Employee,Source,ship_bill,Area_Code,Shipping Zip
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,count,sum,sum,sum,sum,sum,first,last,first,sum,first,first,first,first,first,first
0,75306.0,0.0,0.0,1,0.0,0.0,0.0,no,4,2020-03-18 04:04:53+00:00,2020-03-18 04:04:53+00:00,custom,154.0,0.0,True,iphone,False,,32817
1,112053.0,145.58,145.58,1,145.58,0.0,0.0,yes,2,2019-11-26 21:44:16+00:00,2019-11-26 21:44:16+00:00,custom,16.17,1.0,False,web,True,404.0,30087
2,112055.0,137.55,137.55,1,137.55,0.0,0.0,no,5,2019-11-26 20:52:08+00:00,2019-11-26 20:52:08+00:00,custom,0.0,1.0,False,web,True,845.0,12545
3,112095.0,22.98,22.98,1,22.98,0.0,0.0,yes,2,2019-11-26 18:12:04+00:00,2019-11-26 18:12:04+00:00,custom,0.0,1.0,False,web,True,262.0,53402
4,130108.0,28.0,28.0,1,28.0,0.0,0.0,no,1,2019-08-07 18:14:49+00:00,2019-08-07 18:14:49+00:00,custom,0.0,0.0,False,web,True,617.0,1983
5,130110.0,12.0,12.0,1,12.0,0.0,0.0,yes,1,2019-08-07 18:05:28+00:00,2019-08-07 18:05:28+00:00,custom,0.0,0.0,False,web,True,740.0,43143
6,130188.0,42.0,42.0,1,42.0,0.0,0.0,no,2,2019-08-07 03:45:52+00:00,2019-08-07 03:45:52+00:00,custom,0.0,0.0,False,web,True,701.0,58801
7,130231.0,27.2,27.2,1,27.2,0.0,0.0,yes,1,2019-08-06 22:00:54+00:00,2019-08-06 22:00:54+00:00,custom,6.8,0.0,False,web,True,754.0,33026
8,130241.0,22.0,22.0,1,22.0,0.0,22.0,yes,2,2019-08-06 20:22:25+00:00,2019-08-06 20:22:25+00:00,custom,0.0,0.0,False,web,True,,1880
9,130245.0,100.0,100.0,1,100.0,0.0,100.0,yes,5,2019-08-06 19:59:05+00:00,2019-08-06 19:59:05+00:00,custom,0.0,0.0,False,web,True,617.0,1880


In [56]:
Cust[('Fulfillment Status', 'count')].value_counts()

1     30072
2      5194
3      2083
4      1008
5       517
6       287
7       184
8       106
9        76
10       51
11       35
12       24
13       22
14       20
17       15
15       12
18       12
16       10
19        8
20        7
21        4
28        4
36        3
24        3
23        3
30        2
29        2
25        2
47        1
48        1
34        1
26        1
32        1
Name: (Fulfillment Status, count), dtype: int64

I think that does it for data wrangling. Let's export the data so that we can do EDA in the next notebook.

In [57]:
os.chdir(r"C:\Springboard\Github\Capstone2_cust\Intermediate_Data")
# let's save the Order to both pickle and CSV
Order.to_pickle("Order1.pkl")
Order.to_csv("Order1.csv")

In [58]:
# let's save the Items to both pickle and CSV
Items.to_pickle("Items1.pkl")
Items.to_csv("Items1.csv")

In [59]:
# let's save the Cust to both pickle and CSV
Cust.to_pickle("Cust1.pkl")
Cust.to_csv("Cust1.csv")

See you in the EDA