# SamCart Data Analyst Challenge

### Background:

###### 
SamCart is the first e-commerce platform built for direct-to-consumer brands. We are growing rapidly, but we are on the hunt to better understand and leverage our data to make decisions.
In this role, you will have access to millions of rows of data. It is important not only that you know how to analyze the data, but also that you can help tell the story of what the data says.

We have created a short challenge to do the following:
- Ensure you have a working knowledge of data analysis tools and methodologies
- Test your experience analyzing large data sets
- Understand how you prepare and present data

### Prompt:
###### 
We are using two data sets (linked below) from the Brazilian retailer Olist.
Here are some facts:
 - It’s currently September 2018 (i.e., you can ignore all data after September 2018)
 - The company’s inception was January 2017 (so you can ignore all data before January 2017)
 - The company is US-based but launched in Brazil (which is why some information is in Portuguese)
 - You can assume all orders are delivered (so ignore the order state field)

Please prepare a presentation that addresses the following points:
 - Summarizes the seller funnel from marketing sign-up to launching products on the platform
 - Provides Customer LTV ([Helpful Site with Examples of Equations (geckoboard.com)](https://www.geckoboard.com/best-practice/kpi-examples/))
 - Summarizes the current state of the business
 - Shows what monthly performance has been
 - Identifies the best-selling categories
 - Predicts order volume and revenue for the next 12 months

### Data Sources:
###### 
- [Marketing Funnel Data Set](https://www.kaggle.com/olistbr/marketing-funnel-olist?select=olist_marketing_qualified_leads_dataset.csv)
- [E-Commerce Data Set](https://www.kaggle.com/olistbr/brazilian-ecommerce/home?select=product_category_name_translation.csv)
- [Instructions to Link Data Sets](https://www.kaggle.com/andresionek/joining-marketing-funnel-with-brazilian-e-commerce)

###### 
Please let David Rapoport (drapoport@samcart.com) know if you have any questions. This challenge is due prior to your meeting with David.

Check out this reference on calculating customer lifetime value: https://blog.hubspot.com/service/how-to-calculate-customer-lifetime-value
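As a rough sketch of the kind of equation those references describe, one common formulation multiplies average order value by purchase frequency and by average customer lifespan. The function name, parameter names, and numbers below are hypothetical and only illustrate the arithmetic; they are not results from the Olist data.

```python
def customer_lifetime_value(avg_order_value, purchases_per_period, lifespan_in_periods):
    """Simple CLV estimate: value per order x orders per period x periods as a customer."""
    return avg_order_value * purchases_per_period * lifespan_in_periods


# Hypothetical inputs: 100.0 currency units per order, 2 orders per year, 3 years as a customer.
clv = customer_lifetime_value(
    avg_order_value=100.0,
    purchases_per_period=2.0,
    lifespan_in_periods=3.0,
)
print(clv)
```

The period unit (month, year) is a free choice, as long as the frequency and the lifespan use the same one.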

In [1]:
import time
import pandas as pd
import datetime
import numpy as np
import plotly.graph_objects as go
from tqdm import tqdm

In [2]:
df_cd = pd.read_csv('SamCart-Data/olist_closed_deals_dataset.csv')
df_c = pd.read_csv('SamCart-Data/olist_customers_dataset.csv')
df_g = pd.read_csv('SamCart-Data/olist_geolocation_dataset.csv')
df_m = pd.read_csv('SamCart-Data/olist_marketing_qualified_leads_dataset.csv')
df_oi = pd.read_csv('SamCart-Data/olist_order_items_dataset.csv')
df_op = pd.read_csv('SamCart-Data/olist_order_payments_dataset.csv')
df_o = pd.read_csv('SamCart-Data/olist_orders_dataset.csv')
df_or = pd.read_csv('SamCart-Data/olist_order_reviews_dataset.csv')
df_p = pd.read_csv('SamCart-Data/olist_products_dataset.csv')
df_s = pd.read_csv('SamCart-Data/olist_sellers_dataset.csv')
df_t = pd.read_csv('SamCart-Data/product_category_name_translation.csv')

In [3]:
#KEEP

# Read CSV into DataFrame (repeats the load from the previous cell so this cell can run on its own)
df_p = pd.read_csv('SamCart-Data/olist_products_dataset.csv')
# Convert the Portuguese and English category-name columns into lists
product_portuguese = df_t['product_category_name'].to_list()
product_english = df_t['product_category_name_english'].to_list()
# Replace all Portuguese category names with their English translations
df_pt = df_p.replace(product_portuguese, product_english)
# Write the translated DataFrame to .csv
df_pt.to_csv('SamCart-Data/olist_products_dataset_english.csv')
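The translation step above can also be restricted to the category column with a dictionary lookup, which avoids touching any other column that happens to contain a matching string. This is a minimal sketch on toy rows; the two frames below are stand-ins for `df_t` and `df_p`, and the third category name is made up to show the fallback for untranslated values.

```python
import pandas as pd

# Toy stand-ins for the translation table and the products table.
translation = pd.DataFrame({
    "product_category_name": ["beleza_saude", "informatica_acessorios"],
    "product_category_name_english": ["health_beauty", "computers_accessories"],
})
products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["beleza_saude", "informatica_acessorios", "categoria_sem_traducao"],
})

# Build a Portuguese -> English lookup and apply it to the category column only.
mapping = dict(zip(translation["product_category_name"],
                   translation["product_category_name_english"]))
translated = products["product_category_name"].map(mapping)

# Keep the original name wherever no translation exists.
products["product_category_name"] = translated.fillna(products["product_category_name"])
print(products)
```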


In [4]:
print(df_or.shape)
print(df_or.review_id.unique().shape)
print(df_or.order_id.unique().shape)

(100000, 7)
(99173,)
(99441,)


In [5]:
t = df_g.geolocation_zip_code_prefix.value_counts()
t.where(t>1).dropna()

24220    1146.0
24230    1102.0
38400     965.0
35500     907.0
11680     879.0
          ...  
63145       2.0
55334       2.0
58685       2.0
73365       2.0
11570       2.0
Name: geolocation_zip_code_prefix, Length: 17972, dtype: float64
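The counts above come back as floats because `where` fills non-matching rows with NaN before `dropna` removes them. A boolean filter keeps the integer dtype; the sketch below uses a few made-up zip prefixes rather than `df_g`.

```python
import pandas as pd

# Toy zip-code prefixes; 99999 appears once and should be filtered out.
zips = pd.Series([24220, 24220, 24230, 24230, 11570, 99999],
                 name="geolocation_zip_code_prefix")

counts = zips.value_counts()
repeated = counts[counts > 1]  # boolean indexing keeps the integer dtype
print(repeated)
```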

In [6]:
t = df_oi[["order_id", "order_item_id"]].value_counts() # Counts the occurrences of each (order_id, order_item_id) pair
t

order_id                          order_item_id
fffe41c64501cc87c801fd61db3f6244  1                1
553c468f68af19869694de8bcd095213  2                1
5536d8682e25ee3f39ed444105996fee  1                1
553901a853048dcd33ec8de19f90c5d0  1                1
5539bd029cf95d97ba8f51f6f323c839  1                1
                                                  ..
ab1430e26162f925a945d18941057aee  1                1
ab14fdcfbe524636d65ee38360e22ce8  1                1
                                  2                1
                                  3                1
00010242fe8c5a6d1ba2dd792cb16214  1                1
Length: 112650, dtype: int64
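Every pair in the output above has a count of 1, which suggests that `(order_id, order_item_id)` works as a compound key. A more direct check, sketched here on a toy frame standing in for `df_oi`, is to ask whether any pair is duplicated.

```python
import pandas as pd

# Toy stand-in for the order items table.
items = pd.DataFrame({
    "order_id": ["a", "a", "b"],
    "order_item_id": [1, 2, 1],
})

# True if any (order_id, order_item_id) pair occurs more than once.
has_duplicates = items.duplicated(subset=["order_id", "order_item_id"]).any()
print(has_duplicates)
```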

In [7]:
# Creating the DataFrame, assigning datatypes
df_oi_o_c = df_oi.merge(df_o.merge(df_c, on='customer_id', how='inner'), on='order_id', how='inner')
df_oi_o_c['order_approved_at'] = pd.to_datetime(df_oi_o_c['order_approved_at'])
df_oi_o_c = df_oi_o_c.loc[(df_oi_o_c['order_approved_at'] >= datetime.datetime.strptime('2017-01', '%Y-%m')) & (df_oi_o_c['order_approved_at'] < datetime.datetime.strptime('2018-09', '%Y-%m')) & (df_oi_o_c['order_status'] == 'delivered')]
display(df_oi_o_c.sort_values(by=['order_approved_at'], ascending=False).head(30))

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
23530,35a972d7f8436f405b56e36add1a7140,1,d04857e7b4b708ee8b8b9921163edba3,9f505651f4a6abe901a56cdc21508025,2018-08-31 15:10:26,84.99,8.76,898b7fee99c4e42170ab69ba59be0a8b,delivered,2018-08-29 15:00:37,2018-08-29 15:10:26,2018-08-29 16:57:00,2018-08-30 16:23:36,2018-09-05 00:00:00,24ac2b4327e25baf39f2119e4228976a,13483,limeira,SP
1722,03ef5dedbe7492bdae72eec50764c43f,1,c7f27c5bef2338541c772b5776403e6a,7d7866a99a8656a42c7ff6352a433410,2018-08-31 15:05:22,24.9,8.33,496630b6740bcca28fce9ba50d8a26ef,delivered,2018-08-29 14:52:00,2018-08-29 15:05:22,2018-08-29 20:01:00,2018-08-30 16:36:59,2018-09-03 00:00:00,b701bebbdf478f5500348f03aff62121,9541,sao caetano do sul,SP
4924,0b223d92c27432930dfe407c6aea3041,1,2b4472df15512a2825ae86fd9ae79335,67bf6941ba2f1fa1d02c375766bc3e53,2018-08-31 14:30:19,209.0,46.48,e60df9449653a95af4549bbfcb18a6eb,delivered,2018-08-29 14:18:23,2018-08-29 14:31:07,2018-08-29 15:29:00,2018-08-30 16:24:55,2018-09-04 00:00:00,5c58de6fb80e93396e2f35642666b693,80045,curitiba,PR
4925,0b223d92c27432930dfe407c6aea3041,2,2b4472df15512a2825ae86fd9ae79335,67bf6941ba2f1fa1d02c375766bc3e53,2018-08-31 14:30:19,209.0,46.48,e60df9449653a95af4549bbfcb18a6eb,delivered,2018-08-29 14:18:23,2018-08-29 14:31:07,2018-08-29 15:29:00,2018-08-30 16:24:55,2018-09-04 00:00:00,5c58de6fb80e93396e2f35642666b693,80045,curitiba,PR
9829,168626408cb32af0ffaf76711caae1dc,1,bdcf6a834e8faa30dac3886c7a58e92e,2a84855fd20af891be03bc5924d2b453,2018-08-31 14:30:23,45.9,15.39,6e353700bc7bcdf6ebc15d6de16d7002,delivered,2018-08-29 14:18:28,2018-08-29 14:30:23,2018-08-29 18:51:00,2018-08-30 16:52:31,2018-09-11 00:00:00,7febafa06d9d8f232a900a2937f04338,38600,paracatu,MG
36164,52018484704db3661b98ce838612b507,1,777798445efd625458a90c13f3b3e6e7,5f2684dab12e59f83bef73ae57724e45,2018-08-31 12:35:17,63.9,9.2,e450a297a7bc6839ceb0cf1a2377fa02,delivered,2018-08-29 12:25:59,2018-08-29 12:35:17,2018-08-29 13:38:00,2018-08-30 22:48:27,2018-09-03 00:00:00,7a22d14aa3c3599238509ddca4b93b01,5863,sao paulo,SP
91787,d03ca98f59480e7e76c71fa83ecd8fb6,1,06601c3059e35a3bf65e72f2fd2ac626,6b90f847357d8981edd79a1eb1bf0acb,2018-08-31 11:24:02,109.9,9.52,56b1ac2855cc6d7950b4ffa6a9b41b0d,delivered,2018-08-29 11:06:11,2018-08-29 11:24:02,2018-08-29 17:46:00,2018-08-30 23:56:54,2018-09-04 00:00:00,0421e7a23f21e5d54efed456aedbc513,13322,salto,SP
94807,d70442bc5e3cb7438da497cc6a210f80,1,9a8706b8c060b16e5f0d2925f20bc35b,0be8ff43f22e456b4e0371b2245e4d01,2018-09-03 10:35:16,6.9,7.39,10a79ef2783cae3d8d678e85fde235ac,delivered,2018-08-29 10:22:35,2018-08-29 10:35:16,2018-08-29 19:57:00,2018-08-30 16:03:19,2018-09-04 00:00:00,21dbe8eabd00b34492a939c540e2b1a7,2413,sao paulo,SP
63590,912859fef5a0bd5059b6d48fa79d121a,1,9865c67a74684715521d1e70226cce0b,fa1c13f2614d7b5c4749cbc52fecda94,2018-09-03 10:04:16,169.8,8.45,b8c19e70d00f6927388e4f31c923d785,delivered,2018-08-29 09:48:09,2018-08-29 10:04:16,2018-08-29 19:01:00,2018-08-30 23:28:52,2018-09-04 00:00:00,0c6d7218d5f3fa14514fd29865269993,9625,sao bernardo do campo,SP
110563,fb393211459aac00af932cd7ab4fa2cc,1,b6b76b074ed0d77d0f3443b12d8adb5e,6560211a19b47992c3666cc44a7e94c0,2018-08-31 09:25:12,99.0,7.95,54365416b7ef5599f54a6c7821d5d290,delivered,2018-08-29 09:14:11,2018-08-29 09:25:12,2018-08-29 15:48:00,2018-08-30 13:03:28,2018-09-04 00:00:00,b4dcade04bc548b7e3b0243c801f8c26,13184,hortolandia,SP


In [9]:

# Creating the DataFrame, assigning datatypes. 
# Only considering records:
#   - between order_approved_at dates 2017-01 and 2018-08
#   - orders that have order_status delivered
df_oi_o_c = df_oi.merge(df_o.merge(df_c, on='customer_id', how='inner'), on='order_id', how='inner')
df_oi_o_c['order_approved_at'] = pd.to_datetime(df_oi_o_c['order_approved_at'])
df_oi_o_c = df_oi_o_c.loc[(df_oi_o_c['order_approved_at'] >= datetime.datetime.strptime('2017-01', '%Y-%m')) & (df_oi_o_c['order_approved_at'] < datetime.datetime.strptime('2018-09', '%Y-%m')) & (df_oi_o_c['order_status'] == 'delivered')]

# Dropping Values (Where order_approved_at is null)
nan_index = pd.isnull(df_oi_o_c['order_approved_at']) # to show use: df_oi_o_c.loc[nan_index]
df_oi_o_c = df_oi_o_c.drop(index=df_oi_o_c.loc[nan_index].index.to_list())

# Showing Dataframe
display(df_oi_o_c.sort_values(by=['order_approved_at'], ascending=False))


# Notes:
#     - (order_id, order_item_id) is a compound key: each pair is unique. This can be checked with:
#         - d = pd.DataFrame(); d['bar'] = df_oi_o_c.order_id + df_oi_o_c.order_item_id.map(str); d['bar'].unique().shape
#
#     - Summing price and freight_value over all order_item_id rows of an order_id gives the order's total payment, as verified.
#         - Example:
#
#           From olist_order_items_dataset
#           order_id                              order_item_id     price     freight_value
#           0b223d92c27432930dfe407c6aea3041      1                 209.00    46.48
#           0b223d92c27432930dfe407c6aea3041      2                 209.00    46.48
#
#           From olist_order_payments_dataset
#           order_id                              payment_value
#           0b223d92c27432930dfe407c6aea3041      510.96
#
#           (2 x 209.00 + 2 x 46.48 = 510.96)

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
23530,35a972d7f8436f405b56e36add1a7140,1,d04857e7b4b708ee8b8b9921163edba3,9f505651f4a6abe901a56cdc21508025,2018-08-31 15:10:26,84.99,8.76,898b7fee99c4e42170ab69ba59be0a8b,delivered,2018-08-29 15:00:37,2018-08-29 15:10:26,2018-08-29 16:57:00,2018-08-30 16:23:36,2018-09-05 00:00:00,24ac2b4327e25baf39f2119e4228976a,13483,limeira,SP
1722,03ef5dedbe7492bdae72eec50764c43f,1,c7f27c5bef2338541c772b5776403e6a,7d7866a99a8656a42c7ff6352a433410,2018-08-31 15:05:22,24.90,8.33,496630b6740bcca28fce9ba50d8a26ef,delivered,2018-08-29 14:52:00,2018-08-29 15:05:22,2018-08-29 20:01:00,2018-08-30 16:36:59,2018-09-03 00:00:00,b701bebbdf478f5500348f03aff62121,9541,sao caetano do sul,SP
4924,0b223d92c27432930dfe407c6aea3041,1,2b4472df15512a2825ae86fd9ae79335,67bf6941ba2f1fa1d02c375766bc3e53,2018-08-31 14:30:19,209.00,46.48,e60df9449653a95af4549bbfcb18a6eb,delivered,2018-08-29 14:18:23,2018-08-29 14:31:07,2018-08-29 15:29:00,2018-08-30 16:24:55,2018-09-04 00:00:00,5c58de6fb80e93396e2f35642666b693,80045,curitiba,PR
4925,0b223d92c27432930dfe407c6aea3041,2,2b4472df15512a2825ae86fd9ae79335,67bf6941ba2f1fa1d02c375766bc3e53,2018-08-31 14:30:19,209.00,46.48,e60df9449653a95af4549bbfcb18a6eb,delivered,2018-08-29 14:18:23,2018-08-29 14:31:07,2018-08-29 15:29:00,2018-08-30 16:24:55,2018-09-04 00:00:00,5c58de6fb80e93396e2f35642666b693,80045,curitiba,PR
9829,168626408cb32af0ffaf76711caae1dc,1,bdcf6a834e8faa30dac3886c7a58e92e,2a84855fd20af891be03bc5924d2b453,2018-08-31 14:30:23,45.90,15.39,6e353700bc7bcdf6ebc15d6de16d7002,delivered,2018-08-29 14:18:28,2018-08-29 14:30:23,2018-08-29 18:51:00,2018-08-30 16:52:31,2018-09-11 00:00:00,7febafa06d9d8f232a900a2937f04338,38600,paracatu,MG
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
101735,e6db6e9529fecbe14cd05dd349816656,1,06ead9c0b05b368667d858c09148af03,b2ba3715d723d245138f291a6fe42594,2017-01-15 21:27:49,109.90,14.94,6b3efc29f67766dd725bb29e857614f2,delivered,2017-01-06 21:27:49,2017-01-06 21:35:20,2017-01-09 12:08:34,2017-01-13 13:57:29,2017-02-20 00:00:00,c9517b423493063fa4e805acf2cc4564,33400,lagoa santa,MG
9408,157ec3dc3f38cdbd2706bd216edfe8fb,1,27066995b777fb84dbcb25961fd6d007,46dc3b2cc0980fb8ec44634e21d2718e,2017-01-10 13:43:16,159.99,15.29,7dfd10dcc726950fc6171cea83872351,delivered,2017-01-06 13:43:16,2017-01-06 13:55:11,2017-01-09 16:03:33,2017-01-13 10:58:13,2017-02-16 00:00:00,48368b31672665cca1b8a03047a1702a,33115,santa luzia,MG
106868,f2dd5f15184c73c0d45c02941c7c23d1,1,b931645cdc2d9868f01544e8db63f5ab,b14db04aa7881970e83ffa9426897925,2017-01-09 22:52:33,65.00,26.92,4b60b3ade055c6ad88a00758c8e8a162,delivered,2017-01-05 22:52:33,2017-01-05 23:05:27,2017-01-06 16:08:45,2017-01-13 17:06:48,2017-02-23 00:00:00,b88b78a413e70182e18b032361b24f91,44900,irece,BA
99649,e1fe072ef14b519af1f0a8ed997c1301,1,743801b34cc44776de511ba8eff778e2,48efc9d94a9834137efd9ea76b065a38,2017-01-09 15:32:59,9.90,14.52,758b633d88b82063db189810084f4ea9,delivered,2017-01-05 15:32:59,2017-01-05 16:15:16,2017-01-06 13:45:22,2017-01-12 14:13:19,2017-02-15 00:00:00,4b3207464f5f7a48a7f63fa0b1251d86,14025,ribeirao preto,SP
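The reconciliation described in the notes of this cell (item prices plus freight adding up to the payment value) can be checked per order along these lines. The frames are toy stand-ins for `df_oi` and `df_op` that reuse the single example order from the notes; the 0.01 tolerance is an assumption to absorb floating-point rounding.

```python
import numpy as np
import pandas as pd

order = "0b223d92c27432930dfe407c6aea3041"
items = pd.DataFrame({
    "order_id": [order, order],
    "order_item_id": [1, 2],
    "price": [209.00, 209.00],
    "freight_value": [46.48, 46.48],
})
payments = pd.DataFrame({"order_id": [order], "payment_value": [510.96]})

# Total of price + freight per order, from the items table.
items_total = (items["price"] + items["freight_value"]).groupby(items["order_id"]).sum()
items_total = items_total.rename("items_total")

# Total paid per order (an order can have several payment rows).
paid = payments.groupby("order_id")["payment_value"].sum()

check = pd.concat([items_total, paid], axis=1)
check["matches"] = np.isclose(check["items_total"], check["payment_value"], atol=0.01)
print(check)
```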


In [10]:

# (Use Approved At Date)
periods = ['2017-01',
 '2017-02',
 '2017-03',
 '2017-04',
 '2017-05',
 '2017-06',
 '2017-07',
 '2017-08',
 '2017-09',
 '2017-10',
 '2017-11',
 '2017-12',
 '2018-01',
 '2018-02',
 '2018-03',
 '2018-04',
 '2018-05',
 '2018-06',
 '2018-07',
 '2018-08',]
newperiods = periods.copy()
newperiods.append('2018-09')
df = pd.DataFrame(periods, columns=['Periods']).set_index('Periods')
df['Total_Revenue'] = 0
df['Number_of_Orders'] = 0              # Counts each order once: several items purchased under one order_id count as 1 order.
df['Unique_Customers_Who_Made_a_Purchase'] = 0 # A single customer is counted only once per month.
df['Average_Customer_Lifespans'] = 0    # Average customer lifespan = sum of customer lifespans / number of customers.
                                        # Examples
                                        # Example 1: 1 purchase
                                            # A customer (ID 181) who makes one purchase on 09/30/2017 and never purchases again has a lifespan of 1 day.
                                        # Example 2: More than 1 purchase
                                            # A customer (ID 182) makes purchases on 09/30/2017 and 11/30/2017 and none afterwards. The lifespan is reported in two months only.
                                            # - The purchase made on 09/30/2017 is reported in 09/2017 with a lifespan of 1 day.
                                            # - Since no purchase was made in 10/2017, nothing is reported for that month.
                                            # - The purchase made on 11/30/2017 is reported in 11/2017 with a lifespan of 2 months (61 days).
                                            # - No other month counts this customer (ID 182), as no other purchases are made.
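The hand-typed list of month labels in this cell can also be generated. This is a small sketch using `pd.period_range`; the variable names mirror the ones above, but the block is independent of the cell.

```python
import pandas as pd

# Monthly labels from January 2017 through August 2018, formatted as 'YYYY-MM'.
periods = [str(p) for p in pd.period_range("2017-01", "2018-08", freq="M")]

# Same labels plus the exclusive upper bound used when slicing the last month.
newperiods = periods + ["2018-09"]
print(len(periods), periods[0], periods[-1])
```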


In [315]:
for i, period in enumerate(periods):
    # newperiods holds one extra label ('2018-09'), so newperiods[i + 1] is always
    # the exclusive upper bound of the current month (no hard-coded last index needed).
    start_time = datetime.datetime.strptime(period, '%Y-%m')
    end_time = datetime.datetime.strptime(newperiods[i + 1], '%Y-%m')
    period_df = df_oi_o_c.loc[(df_oi_o_c['order_approved_at'] >= start_time) & (df_oi_o_c['order_approved_at'] < end_time)]
    # Revenue is item price plus freight; orders and customers are counted once each per month
    df.at[period, 'Total_Revenue'] = period_df['price'].sum() + period_df['freight_value'].sum()
    df.at[period, 'Number_of_Orders'] = period_df['order_id'].unique().shape[0]
    df.at[period, 'Unique_Customers_Who_Made_a_Purchase'] = period_df['customer_unique_id'].unique().shape[0]
display(df)
    

Periods,Total_Revenue,Number_of_Orders,Unique_Customers_Who_Made_a_Purchase,Average_Customer_Lifespans
2017-01,121884,715,684,0
2017-02,270749,1638,1615,0
2017-03,410734,2554,2516,0
2017-04,387782,2278,2249,0
2017-05,568069,3548,3482,0
2017-06,494351,3143,3084,0
2017-07,560275,3828,3759,0
2017-08,646567,4217,4137,0
2017-09,691353,4170,4103,0
2017-10,755320,4441,4380,0
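The same monthly figures can be produced without an explicit loop by grouping on the approval month. The sketch below runs on a toy frame with the same column names as `df_oi_o_c`; the rows are invented, and the output column names are copied from the table above.

```python
import pandas as pd

# Toy stand-in for df_oi_o_c: one row per order item.
orders = pd.DataFrame({
    "order_id": ["o1", "o1", "o2", "o3"],
    "customer_unique_id": ["c1", "c1", "c2", "c1"],
    "order_approved_at": pd.to_datetime(
        ["2017-01-05", "2017-01-05", "2017-01-20", "2017-02-03"]),
    "price": [100.0, 50.0, 20.0, 10.0],
    "freight_value": [10.0, 5.0, 2.0, 1.0],
})
orders["revenue"] = orders["price"] + orders["freight_value"]

# Group item rows by approval month and aggregate the three KPIs.
month = orders["order_approved_at"].dt.to_period("M")
monthly = orders.groupby(month).agg(
    Total_Revenue=("revenue", "sum"),
    Number_of_Orders=("order_id", "nunique"),
    Unique_Customers_Who_Made_a_Purchase=("customer_unique_id", "nunique"),
)
monthly.index = monthly.index.astype(str)  # 'YYYY-MM' labels, as in the table above
print(monthly)
```

Months with no orders do not appear in the grouped result, so a `reindex` over the full list of periods would be needed if empty months must show as zero.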


In [12]:
lifetime_df = pd.DataFrame(columns = ['customer_unique_id', '2017-01','2017-02','2017-03','2017-04','2017-05','2017-06','2017-07','2017-08','2017-09','2017-10','2017-11','2017-12','2018-01','2018-02','2018-03','2018-04','2018-05','2018-06','2018-07','2018-08','2018-09', ])
lifetime_df['customer_unique_id'] = df_oi_o_c['customer_unique_id'].unique()

for i in tqdm(lifetime_df.index):
    id = lifetime_df['customer_unique_id'][i]
    id_df = df_oi_o_c.loc[(df_oi_o_c['customer_unique_id'] == id)][['customer_unique_id','order_approved_at']].reset_index(drop=True)
    id_df['order_approved_at'] = pd.to_datetime(id_df['order_approved_at'])
    for p in range(len(periods)):
        start_time = datetime.datetime.strptime(periods[p], '%Y-%m')
        if p == 19:
            end_time = datetime.datetime.strptime('2018-09', '%Y-%m')
        else:
            end_time = datetime.datetime.strptime(periods[p+1], '%Y-%m')
        period_df = id_df.loc[(id_df['order_approved_at'] >= start_time) & (id_df['order_approved_at'] < end_time)]
        #Good ^^^

        # If a customer (customer_unique_id) has a prior purchase not in the period_df
        if id_df['order_approved_at'].min() not in period_df['order_approved_at'].values:
            lifetime_df.at[i, periods[p]] = (period_df['order_approved_at'].max() - id_df['order_approved_at'].min()).days
        else:
            # Finds the number of days between purchases in a month, if no other purchases had been made prior
            if period_df.shape[0] == 0:         # No purchase, leave as NaN
                pass
            elif period_df.shape[0] == 1:       # 1 purchase, counts as 1 day
                lifetime_df.at[i, periods[p]] = 1
            else:                               # More than 1 purchase in a month, counts the number of days between purchases
                if period_df['order_approved_at'].max() == period_df['order_approved_at'].min(): # For situations where multiple orders were placed at the same time
                    lifetime_df.at[i, periods[p]] = 1
                elif (period_df['order_approved_at'].max() - period_df['order_approved_at'].min()).days < 1:
                    lifetime_df.at[i, periods[p]] = 1
                else:
                    lifetime_df.at[i, periods[p]] = (period_df['order_approved_at'].max() - period_df['order_approved_at'].min()).days


100%|██████████| 93091/93091 [41:26<00:00, 37.43it/s]
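The progress bar shows the per-customer loop taking about 41 minutes. A grouped computation can approximate the same table much faster. The sketch below is an assumption-laden rewrite, not the loop itself: it measures each month's latest purchase against the customer's first purchase and floors the result at 1 day. That matches the loop for the cases shown in this notebook, but differs when a purchase in a later month follows the first purchase by less than 24 hours (the loop would record 0). The toy rows reuse timestamps printed elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for df_oi_o_c with only the two columns the lifespan logic needs.
orders = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b"],
    "order_approved_at": pd.to_datetime([
        "2017-06-18 23:10:19",
        "2017-07-18 23:23:26",
        "2017-07-24 22:25:14",
        "2017-12-01 22:52:47",
    ]),
})

# Month label of each purchase and whole days elapsed since the customer's first purchase.
orders["period"] = orders["order_approved_at"].dt.to_period("M").astype(str)
first_purchase = orders.groupby("customer_unique_id")["order_approved_at"].transform("min")
orders["days_since_first"] = (orders["order_approved_at"] - first_purchase).dt.days

# One row per customer, one column per month; months without purchases stay NaN.
lifespan = (
    orders.groupby(["customer_unique_id", "period"])["days_since_first"]
    .max()
    .clip(lower=1)  # a month containing only the first purchase counts as 1 day
    .unstack("period")
)
print(lifespan)
```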


In [14]:
pd.set_option("display.max_columns", 100)
lifetime_df.head(5)

Unnamed: 0,customer_unique_id,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
0,871766c5855e863f6eccc05f988b23cb,,,,,,,,,1.0,,,,,,,,,,,,
1,eb28e67c4c0b83846050ddfb8a35d051,,,,1.0,,,,,145.0,,,,,,,,,,,,
2,3818d81c6709e39d06b2738a8d3a2474,,,,,,,,,,,,,1.0,,,,,,,,
3,af861d436cfc08b2c2ddefd0ba074622,,,,,,,,,,,,,,,,,,,,1.0,
4,64b576fb70d441e8f1b2d7d446e483c5,,1.0,,,,,,,,,,,,,,,,,,,


In [21]:
new_customers = [lifetime_df.loc[lifetime_df[p]==1].shape[0] for p in periods]
total_customers = []
for i in range(len(new_customers)):
    if i ==0:
        total_customers.append(new_customers[i])
    else:
        total_customers.append(total_customers[i-1] + new_customers[i])
average_customer_lifespan_in_days = []
for i, p in enumerate(periods):
    average_customer_lifespan_in_days.append(lifetime_df[p].sum()/total_customers[i])
average_customer_lifespan_in_days

[1.0029282576866765,
 0.7456293706293706,
 0.5635036496350365,
 0.41642938496583143,
 0.4914897685982023,
 0.4166481357942332,
 0.44097282830045964,
 0.4031521994824747,
 0.42421723469105016,
 0.43205740007445764,
 0.5656050256494665,
 0.42833973334605385,
 0.5648252768247838,
 0.4069839420310964,
 0.4444300518134715,
 0.4532558787718758,
 0.4888087125008352,
 0.46845027138219036,
 0.3880388304634231,
 0.4306450570055389]
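The calculation in this cell divides each month's summed lifespans by the cumulative number of customers first seen up to that month. A compact sketch of that same arithmetic is shown below on a made-up three-customer table standing in for `lifetime_df`; like the cell above, it treats a value of exactly 1 as the marker of a new customer.

```python
import numpy as np
import pandas as pd

# Toy lifespan table: rows are customers, columns are months, NaN means no purchase.
lifetime = pd.DataFrame({
    "2017-01": [1.0, np.nan, np.nan],
    "2017-02": [np.nan, 1.0, np.nan],
    "2017-03": [40.0, np.nan, 1.0],
})

new_customers = (lifetime == 1).sum()     # customers whose lifespan is 1 in that month
total_customers = new_customers.cumsum()  # running total of customers seen so far
average_lifespan = lifetime.sum() / total_customers  # sum() skips NaN by default
print(average_lifespan)
```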

In [20]:
df.to_excel('SamCart-Data/KPI-Dataset.xlsx', engine='openpyxl')

[683,
 2288,
 4795,
 7024,
 10458,
 13491,
 17187,
 21255,
 25263,
 29547,
 36453,
 41927,
 48677,
 54926,
 61760,
 68169,
 74835,
 80698,
 86530,
 92798]

In [368]:
# Testing Data

# Test Dataset 2
test2_df = df_oi_o_c.loc[(df_oi_o_c['customer_unique_id'] == '8d50f5eadf50201ccdcedfb9e2ac8455') | (df_oi_o_c['customer_unique_id'] == '3e43e6105506432c953e165fb2acf44c')][['customer_unique_id', 'order_approved_at']]
test2_df['order_approved_at'] = pd.to_datetime(test2_df['order_approved_at'])

# Test Lifetime Dataset 2
lifetime2_df = pd.DataFrame(columns = ['customer_unique_id', '2017-01','2017-02','2017-03','2017-04','2017-05','2017-06','2017-07','2017-08','2017-09','2017-10','2017-11','2017-12','2018-01','2018-02','2018-03','2018-04','2018-05','2018-06','2018-07','2018-08','2018-09', ])
lifetime2_df['customer_unique_id'] = test2_df['customer_unique_id'].unique()

display(lifetime2_df)
display(test2_df)

Unnamed: 0,customer_unique_id,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,...,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
0,3e43e6105506432c953e165fb2acf44c,,,,,,,,,,...,,,,,,,,,,
1,8d50f5eadf50201ccdcedfb9e2ac8455,,,,,,,,,,...,,,,,,,,,,


Unnamed: 0,customer_unique_id,order_approved_at
7561,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:47
7577,8d50f5eadf50201ccdcedfb9e2ac8455,2018-07-24 10:31:34
15443,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 01:53:35
23961,8d50f5eadf50201ccdcedfb9e2ac8455,2017-06-18 23:10:19
35059,8d50f5eadf50201ccdcedfb9e2ac8455,2017-07-18 23:23:26
35995,8d50f5eadf50201ccdcedfb9e2ac8455,2017-08-05 09:10:13
47446,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 23:36:01
49516,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
49517,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
57284,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:53


In [379]:
for i in lifetime2_df.index:
    id = lifetime2_df['customer_unique_id'][i]
    id_df = df_oi_o_c.loc[(df_oi_o_c['customer_unique_id'] == id)][['customer_unique_id','order_approved_at']].reset_index(drop=True)
    id_df['order_approved_at'] = pd.to_datetime(id_df['order_approved_at'])
    display(id_df)
    for p in range(len(periods)):
        start_time = datetime.datetime.strptime(periods[p], '%Y-%m')
        if p == 19:
            end_time = datetime.datetime.strptime('2018-09', '%Y-%m')
        else:
            end_time = datetime.datetime.strptime(periods[p+1], '%Y-%m')
        period_df = id_df.loc[(id_df['order_approved_at'] >= start_time) & (id_df['order_approved_at'] < end_time)]
        display(period_df)
        #Good ^^^

        # If a customer (customer_unique_id) has a prior purchase not in the period_df
        if id_df['order_approved_at'].min() not in period_df['order_approved_at'].values:
            lifetime2_df.at[i, periods[p]] = (period_df['order_approved_at'].max() - id_df['order_approved_at'].min()).days
        else:
            # Finds the number of days between purchases in a month, if no other purchases had been made prior
            if period_df.shape[0] == 0:         # No purchase, leave as NaN
                pass
            elif period_df.shape[0] == 1:       # 1 purchase, counts as 1 day
                lifetime2_df.at[i, periods[p]] = 1
            else:                               # More than 1 purchase in a month, counts the number of days between purchases
                if period_df['order_approved_at'].max() == period_df['order_approved_at'].min(): # For situations where multiple orders were placed at the same time
                    lifetime2_df.at[i, periods[p]] = 1
                elif (period_df['order_approved_at'].max() - period_df['order_approved_at'].min()).days < 1:
                    lifetime2_df.at[i, periods[p]] = 1
                else:
                    lifetime2_df.at[i, periods[p]] = (period_df['order_approved_at'].max() - period_df['order_approved_at'].min()).days

Unnamed: 0,customer_unique_id,order_approved_at
0,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:47
1,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
2,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
3,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:53
4,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:53
5,3e43e6105506432c953e165fb2acf44c,2018-01-11 11:09:12
6,3e43e6105506432c953e165fb2acf44c,2017-12-02 15:50:49
7,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23
8,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23
9,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
7,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23
8,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23
9,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23
10,3e43e6105506432c953e165fb2acf44c,2017-09-18 19:05:23


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
0,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:47
3,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:53
4,3e43e6105506432c953e165fb2acf44c,2017-12-01 22:52:53
6,3e43e6105506432c953e165fb2acf44c,2017-12-02 15:50:49


Unnamed: 0,customer_unique_id,order_approved_at
5,3e43e6105506432c953e165fb2acf44c,2018-01-11 11:09:12


Unnamed: 0,customer_unique_id,order_approved_at
1,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
2,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:11
11,3e43e6105506432c953e165fb2acf44c,2018-02-20 11:50:38
12,3e43e6105506432c953e165fb2acf44c,2018-02-12 10:28:08
13,3e43e6105506432c953e165fb2acf44c,2018-02-27 18:50:29


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
0,8d50f5eadf50201ccdcedfb9e2ac8455,2018-07-24 10:31:34
1,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 01:53:35
2,8d50f5eadf50201ccdcedfb9e2ac8455,2017-06-18 23:10:19
3,8d50f5eadf50201ccdcedfb9e2ac8455,2017-07-18 23:23:26
4,8d50f5eadf50201ccdcedfb9e2ac8455,2017-08-05 09:10:13
5,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 23:36:01
6,8d50f5eadf50201ccdcedfb9e2ac8455,2018-07-05 16:27:55
7,8d50f5eadf50201ccdcedfb9e2ac8455,2017-10-29 17:10:09
8,8d50f5eadf50201ccdcedfb9e2ac8455,2018-08-18 12:50:37
9,8d50f5eadf50201ccdcedfb9e2ac8455,2017-11-22 20:12:32


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
2,8d50f5eadf50201ccdcedfb9e2ac8455,2017-06-18 23:10:19


Unnamed: 0,customer_unique_id,order_approved_at
3,8d50f5eadf50201ccdcedfb9e2ac8455,2017-07-18 23:23:26
10,8d50f5eadf50201ccdcedfb9e2ac8455,2017-07-24 22:25:14


Unnamed: 0,customer_unique_id,order_approved_at
4,8d50f5eadf50201ccdcedfb9e2ac8455,2017-08-05 09:10:13


Unnamed: 0,customer_unique_id,order_approved_at
14,8d50f5eadf50201ccdcedfb9e2ac8455,2017-09-05 22:30:56


Unnamed: 0,customer_unique_id,order_approved_at
7,8d50f5eadf50201ccdcedfb9e2ac8455,2017-10-29 17:10:09
12,8d50f5eadf50201ccdcedfb9e2ac8455,2017-10-19 00:36:08


Unnamed: 0,customer_unique_id,order_approved_at
9,8d50f5eadf50201ccdcedfb9e2ac8455,2017-11-22 20:12:32


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
1,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 01:53:35
5,8d50f5eadf50201ccdcedfb9e2ac8455,2018-05-22 23:36:01


Unnamed: 0,customer_unique_id,order_approved_at


Unnamed: 0,customer_unique_id,order_approved_at
0,8d50f5eadf50201ccdcedfb9e2ac8455,2018-07-24 10:31:34
6,8d50f5eadf50201ccdcedfb9e2ac8455,2018-07-05 16:27:55


Unnamed: 0,customer_unique_id,order_approved_at
8,8d50f5eadf50201ccdcedfb9e2ac8455,2018-08-18 12:50:37
11,8d50f5eadf50201ccdcedfb9e2ac8455,2018-08-07 23:45:21
13,8d50f5eadf50201ccdcedfb9e2ac8455,2018-08-20 19:30:05


In [380]:
pd.set_option("display.max_columns", 100)
display(lifetime2_df)
type(lifetime2_df)

Unnamed: 0,customer_unique_id,2017-01,2017-02,2017-03,2017-04,2017-05,2017-06,2017-07,2017-08,2017-09,2017-10,2017-11,2017-12,2018-01,2018-02,2018-03,2018-04,2018-05,2018-06,2018-07,2018-08,2018-09
0,3e43e6105506432c953e165fb2acf44c,,,,,,,,,1,,,74.0,114.0,161.0,,,,,,,
1,8d50f5eadf50201ccdcedfb9e2ac8455,,,,,,1.0,35.0,47.0,78,132.0,156.0,,,,,,338.0,,400.0,427.0,


pandas.core.frame.DataFrame

Unnamed: 0,order_id,order_item_id,product_id,seller_id,shipping_limit_date,price,freight_value,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,customer_unique_id,customer_zip_code_prefix,customer_city,customer_state
7561,1124c329070977fbd414f046bba149d7,1,689a9f33ae479ec0d9f68a9b6b0cfcbf,95e03ca3d4146e4011985981aeb959b9,2017-12-12 22:52:47,50.0,15.1,b70f8552b91ef49129519206966e2472,delivered,2017-12-01 22:37:41,2017-12-01 22:52:47,2017-12-04 22:09:20,2017-12-19 19:15:14,2018-01-02 00:00:00,3e43e6105506432c953e165fb2acf44c,11700,praia grande,SP
7577,112eb6f37f1b9dabbced368fbbc6c9ef,1,41f6cb7c3b1200749326e50106f32d58,db4350fd57ae30082dec7acbaacc17f9,2018-07-30 09:43:43,99.0,8.85,65f9db9dd07a4e79b625effa4c868fcb,delivered,2018-07-23 21:53:02,2018-07-24 10:31:34,2018-07-25 10:25:00,2018-07-26 18:29:28,2018-08-02 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
15443,23427a6bd9f8fd1b51f1b1e5cc186ab8,1,5cb96c51c55f57503465e4d2558dc053,db4350fd57ae30082dec7acbaacc17f9,2018-05-24 01:53:35,45.99,7.39,a8fabc805e9a10a3c93ae5bff642b86b,delivered,2018-05-21 22:44:31,2018-05-22 01:53:35,2018-05-22 14:18:00,2018-05-23 15:33:09,2018-05-29 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
23961,369634708db140c5d2c4e365882c443a,1,d83509907a19c72e1e4cdde78b8177ec,94e93ce877be27a515118dbfd2c2be41,2017-06-22 23:10:19,39.9,11.85,b2b13de0770e06de50080fea77c459e6,delivered,2017-06-18 22:56:48,2017-06-18 23:10:19,2017-06-19 20:12:26,2017-06-23 12:55:50,2017-07-07 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
35059,4f62d593acae92cea3c5662c76122478,1,94cc774056d3f2b0dc693486a589025e,1da3aeb70d7989d1e6d9b0e887f97c23,2017-07-24 23:23:26,13.99,7.78,dfb941d6f7b02f57a44c3b7c3fefb44b,delivered,2017-07-18 23:10:58,2017-07-18 23:23:26,2017-07-20 19:00:02,2017-07-21 16:19:40,2017-07-31 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
35995,519203404f6116d406a970763ee75799,1,5fb61f482620cb672f5e586bb132eae9,94e93ce877be27a515118dbfd2c2be41,2017-08-10 09:10:13,69.9,11.99,1c62b48fb34ee043310dcb233caabd2e,delivered,2017-08-05 08:59:43,2017-08-05 09:10:13,2017-08-07 18:50:00,2017-08-09 15:22:28,2017-08-25 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
47446,6bdf325f0966e3056651285c0aed5aad,1,d6354128c28cc56532ba7393d9373083,412a4720f3e9431b4afa1476a1acddbe,2018-05-24 23:31:13,51.8,11.15,6289b75219d757a56c0cce8d9e427900,delivered,2018-05-22 23:08:55,2018-05-22 23:36:01,2018-05-23 19:02:00,2018-05-24 11:58:23,2018-05-30 00:00:00,8d50f5eadf50201ccdcedfb9e2ac8455,4045,sao paulo,SP
49516,70863e8ef99613bbc8f854807d187ea7,1,ceeb8b2d571b23399910f1b83980c973,1900267e848ceeba8fa32d80c1a5f5a8,2018-02-16 10:28:11,53.0,12.29,2bbd32d4ef14893d2d8c1a0df08403cf,delivered,2018-02-12 10:12:54,2018-02-12 10:28:11,2018-02-16 01:49:11,2018-02-27 22:09:24,2018-03-06 00:00:00,3e43e6105506432c953e165fb2acf44c,11701,praia grande,SP
49517,70863e8ef99613bbc8f854807d187ea7,2,d296d7996d75ec35fcfc75b3a06cae63,1900267e848ceeba8fa32d80c1a5f5a8,2018-02-16 10:28:11,53.0,12.29,2bbd32d4ef14893d2d8c1a0df08403cf,delivered,2018-02-12 10:12:54,2018-02-12 10:28:11,2018-02-16 01:49:11,2018-02-27 22:09:24,2018-03-06 00:00:00,3e43e6105506432c953e165fb2acf44c,11701,praia grande,SP
57284,826b47e4cd7bba4e4c6fa5485f898b74,1,82254d93b897fde054504b15f8fe923c,02ecc2a19303f05e59ce133fd923fff7,2017-12-06 22:52:53,239.9,14.36,e68e6423401e85c138229b23d4bf4761,delivered,2017-12-01 22:37:42,2017-12-01 22:52:53,2017-12-05 20:59:00,2017-12-06 19:09:43,2017-12-27 00:00:00,3e43e6105506432c953e165fb2acf44c,11700,praia grande,SP


In [7]:
# Join payments (df_op) with orders (df_o) and customers (df_c):
# olist_order_payments_dataset + olist_orders_dataset + olist_customers_dataset
df_op_o_c = df_op.merge(df_o.merge(df_c, on='customer_id', how='inner'), on='order_id', how='inner')
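A minimal sketch of the same chained inner join on toy data, assuming `df_op`, `df_o`, and `df_c` hold the payments, orders, and customers tables named in the comment above. The values and the `validate` checks are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the three Olist tables (hypothetical values, not real data).
df_c = pd.DataFrame({"customer_id": ["c1", "c2"],
                     "customer_unique_id": ["u1", "u1"]})
df_o = pd.DataFrame({"order_id": ["o1", "o2"],
                     "customer_id": ["c1", "c2"]})
df_op = pd.DataFrame({"order_id": ["o1", "o1", "o2"],
                      "payment_value": [10.0, 5.0, 20.0]})

# Orders -> customers should be many-to-one; validate raises MergeError otherwise.
orders_customers = df_o.merge(df_c, on="customer_id", how="inner",
                              validate="many_to_one")

# Payments -> orders is also many-to-one: one order may carry several payment rows.
df_op_o_c = df_op.merge(orders_customers, on="order_id", how="inner",
                        validate="many_to_one")

# The merged frame has one row per payment record, not one row per order.
rows_per_order = df_op_o_c.groupby("order_id").size()
print(rows_per_order)
```

Because the result has one row per payment record, a plain row count can exceed the number of distinct orders whenever an order has more than one payment row.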

In [9]:
#22 minute query
# tqdm.write('query est. time: 22 min')


# q = pd.DataFrame(df_c['customer_unique_id'].unique())
# q.columns = ['customer_unique_id']

# q['num_transactions_delivered'] = 0
# q['total_spent_delivered'] = 0
# q['lifetime_days'] = 0
# for i in tqdm(range(q.shape[0])):
#     # Identifying a single customer
#     id  = q['customer_unique_id'][i]
#     customer_transactions_delivered_df = df_op_o_c.loc[(df_op_o_c['customer_unique_id'] == id) & (df_op_o_c['order_status'] == 'delivered')]

#     # Considering only delivered orders, find the total amount spent and the number of transactions.
#     q.at[i,'total_spent_delivered'] = round(customer_transactions_delivered_df['payment_value'].sum(), 2)
#     q.at[i,'num_transactions_delivered'] = customer_transactions_delivered_df.shape[0]

#     #Calculating the lifetime of a customer in days
#     transaction_times = pd.to_datetime(customer_transactions_delivered_df['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S')
#     lifetime_days = 0
#     if transaction_times.shape[0] < 1:
#         pass
#     elif min(transaction_times) == max(transaction_times):
#         lifetime_days = 1
#     else:
#         lifetime_days = (max(transaction_times)-min(transaction_times)).days

#     q.at[i,'lifetime_days'] = lifetime_days


  0%|          | 4/96096 [00:00<41:48, 38.31it/s]query est. time: 22 min
100%|██████████| 96096/96096 [23:44<00:00, 67.47it/s]
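The per-customer loop above can likely be replaced by a single `groupby` aggregation, which avoids filtering the full frame once per customer. The sketch below runs on toy data; the column names follow the loop, while the helper columns `ts`, `first_ts`, and `last_ts` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Toy stand-in for df_op_o_c (hypothetical values, not real data).
df_op_o_c = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
    "order_status": ["delivered", "delivered", "delivered", "canceled"],
    "payment_value": [10.0, 20.0, 5.5, 99.0],
    "order_purchase_timestamp": [
        "2017-01-01 10:00:00", "2017-01-11 09:00:00",
        "2017-03-05 12:00:00", "2017-04-01 08:00:00",
    ],
})

# Keep delivered rows only and parse the purchase timestamps once.
delivered = df_op_o_c[df_op_o_c["order_status"] == "delivered"].copy()
delivered["ts"] = pd.to_datetime(delivered["order_purchase_timestamp"])

agg = (
    delivered.groupby("customer_unique_id")
    .agg(
        num_transactions_delivered=("payment_value", "size"),
        total_spent_delivered=("payment_value", "sum"),
        first_ts=("ts", "min"),
        last_ts=("ts", "max"),
    )
    .reset_index()
)
agg["total_spent_delivered"] = agg["total_spent_delivered"].round(2)

# Same convention as the loop: identical first and last timestamps count as 1 day.
span_days = (agg["last_ts"] - agg["first_ts"]).dt.days
same_ts = agg["first_ts"] == agg["last_ts"]
agg["lifetime_days"] = span_days.where(~same_ts, 1)
agg = agg.drop(columns=["first_ts", "last_ts"])

# Customers with no delivered rows get zeros, as in the loop.
all_ids = pd.DataFrame(
    {"customer_unique_id": df_op_o_c["customer_unique_id"].unique()}
)
q = all_ids.merge(agg, on="customer_unique_id", how="left").fillna(0)
print(q)
```

On the full data set this should finish in seconds rather than minutes, though that has not been measured here.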


In [222]:
q.sort_values(by=['lifetime_days'], ascending=False).head(100)

Unnamed: 0,customer_unique_id,num_transactions_delivered,total_spent_delivered,lifetime_days
11008,32ea3bdedab835c3aa6cb68ce66565ef,4,137,633
2037,ccafc1c3f270410521c3c6f3b249870f,2,207,608
72136,d8f3c4f441a9b59a29f977df16724f38,2,158,582
61254,94e5ea5a8c1bf546db2739673060c43f,2,187,580
56309,87b3f231705783eb2217e25851c0a45d,2,523,572
...,...,...,...,...
42724,8c21dd8c37144807c601f99f2a209dfb,3,848,375
7646,9c540c186d344c460d85726b531b287b,2,187,373
47000,f3d7cb12e7dc54a424d935aae4426461,2,201,373
836,b7b162291adff744d6f3c6450557ffee,3,519,368


In [26]:
q['lifetime_days'].mode()

0    1
dtype: int64

In [25]:
q['lifetime_days'].median()

1.0

In [11]:
q['lifetime_days'].mean()

3.503787878787879

In [12]:
q['total_spent_delivered'].mean()

160.02584915084915

In [13]:
q['total_spent_delivered'].sum()

15377844

In [14]:
q['num_transactions_delivered'].sum()

100756

In [24]:
# total time elapsed

end = datetime.datetime.strptime('2018-09-01 00:00:00', '%Y-%m-%d %H:%M:%S')
start = datetime.datetime.strptime('2017-01-01 00:00:00', '%Y-%m-%d %H:%M:%S')
time_elapsed = (end - start).days

print(end, '- ', start)
print(time_elapsed)

2018-09-01 00:00:00 -  2017-01-01 00:00:00
608


In [17]:
# KPIs

# average purchase value
#       = total revenue / number of orders
# (to be separated by month)

total_revenue = q['total_spent_delivered'].sum()
number_of_purchases = q['num_transactions_delivered'].sum()

average_purchase_value = total_revenue / number_of_purchases
print("Average Purchase Value (over {} days) = {}".format(time_elapsed, round(average_purchase_value, 2)))

# average purchase frequency
#       = number of purchases / total unique customers

total_unique_customers = q.shape[0]
average_purchase_frequency = number_of_purchases / total_unique_customers
print("Average Purchase Frequency (over {} days) = {}".format(time_elapsed, round(average_purchase_frequency,2)))

# average customer lifespan
#       = sum of customer lifespans / number of customers

average_customer_lifespan = q['lifetime_days'].mean()/365.25
print("Average Customer Lifespan (over {} days) = {}".format(time_elapsed, round(average_customer_lifespan,5)))

# customer lifetime value
#       = (average purchase value * average purchase frequency) * average customer lifespan

customer_lifetime_value = (average_purchase_value * average_purchase_frequency) * average_customer_lifespan
print("Customer Lifetime Value (over {} days) = {}".format(time_elapsed, round(customer_lifetime_value,2)))




Average Purchase Value (over 772 days) = 152.62
Average Purchase Frequency (over 772 days) = 1.05
Average Customer Lifespan (over 772 days) = 0.00959
Customer Lifetime Value (over 772 days) = 1.54


In [92]:
# min(df_op_o_c['order_purchase_timestamp']) + max(df_op_o_c['order_purchase_timestamp'])
time_elapsed = (datetime.datetime.strptime(max(df_op_o_c['order_purchase_timestamp']), '%Y-%m-%d %H:%M:%S') - datetime.datetime.strptime(min(df_op_o_c['order_purchase_timestamp']), '%Y-%m-%d %H:%M:%S')).days


772

In [14]:
#apv    = average purchase value; 
#       = total revenue / number of orders

#apfr   = average purchase frequency rate
#       = number of purchases / number of customers

#cv     = customer value
#       = average purchase value * average purchase frequency

#acl    = average customer lifespan
#       = sum of customer lifespans / number of customers

#cltv   = customer lifetime value
#       = customer value * average customer lifespan
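The KPI definitions above can be sketched as one small function. It takes customer value as purchase value multiplied by purchase frequency, which is the usual form of this formula and an assumption of the sketch. The function name and parameter names are hypothetical; the inputs in the usage line are the aggregates printed earlier in this notebook (total spent, transaction count, unique customers, and mean lifetime in days).

```python
def customer_lifetime_value(total_revenue, number_of_purchases,
                            number_of_customers, mean_lifespan_days):
    """Return the intermediate KPIs and CLTV as a dict (illustrative sketch)."""
    apv = total_revenue / number_of_purchases         # average purchase value
    apfr = number_of_purchases / number_of_customers  # average purchase frequency rate
    cv = apv * apfr                                   # customer value (assumed multiplicative)
    acl = mean_lifespan_days / 365.25                 # average customer lifespan, in years
    cltv = cv * acl                                   # customer lifetime value
    return {"apv": apv, "apfr": apfr, "cv": cv, "acl": acl, "cltv": cltv}


# Aggregates taken from the outputs printed earlier in the notebook.
kpis = customer_lifetime_value(
    total_revenue=15377844,
    number_of_purchases=100756,
    number_of_customers=96096,
    mean_lifespan_days=3.503787878787879,
)
for name, value in kpis.items():
    print(name, round(value, 5))
```

Note that with this form `cv` reduces to total revenue divided by the number of customers, so it should agree with the mean of `total_spent_delivered`.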


In [None]:
#Average Purchase Value:
