# Quantium Data Analytics Virtual Experience
## Part 1: Data Preparation and Customer Analytics
We need to present a data-backed strategic recommendation to Julia that she can use in the upcoming category review. To do so, we first analyse the data to understand current purchasing trends and behaviours. The client is particularly interested in customer segments and their chip purchasing behaviour, so we should consider which metrics best describe how customers purchase.
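
As a rough illustration of the kind of metrics that could describe purchasing behaviour per segment, the sketch below joins made-up customer and transaction rows on `LYLTY_CARD_NBR` and aggregates by `LIFESTAGE` and `PREMIUM_CUSTOMER`. The column names follow the two source files; the values, the frame names `customers` and `transactions`, and the choice of metrics are assumptions for illustration only.

```python
import pandas as pd

# Toy data with the same column names as the two source files (values are made up)
customers = pd.DataFrame({
    "LYLTY_CARD_NBR": [1, 2, 3],
    "LIFESTAGE": ["RETIREES", "RETIREES", "YOUNG FAMILIES"],
    "PREMIUM_CUSTOMER": ["Budget", "Budget", "Premium"],
})
transactions = pd.DataFrame({
    "LYLTY_CARD_NBR": [1, 1, 2, 3],
    "PROD_QTY": [2, 1, 2, 3],
    "TOT_SALES": [6.0, 3.0, 5.0, 12.0],
})

# Attach the segment labels to every transaction
merged = transactions.merge(customers, on="LYLTY_CARD_NBR", how="left")

# One row per segment: total sales, distinct customers, units, and two derived ratios
metrics = (
    merged.groupby(["LIFESTAGE", "PREMIUM_CUSTOMER"])
    .agg(
        total_sales=("TOT_SALES", "sum"),
        customers=("LYLTY_CARD_NBR", "nunique"),
        units=("PROD_QTY", "sum"),
    )
    .assign(
        units_per_customer=lambda d: d["units"] / d["customers"],
        avg_price_per_unit=lambda d: d["total_sales"] / d["units"],
    )
)
print(metrics)
```

Total sales and customer counts show where the volume sits, while units per customer and average price per unit help separate segments that buy more from segments that pay more.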

In [141]:
import pandas as pd

In [142]:
purchase_df = pd.read_csv('QVI_purchase_behaviour.csv')
transact_df = pd.read_excel('QVI_transaction_data.xlsx')

In [143]:
purchase_df.head()

,LYLTY_CARD_NBR,LIFESTAGE,PREMIUM_CUSTOMER
0,1000,YOUNG SINGLES/COUPLES,Premium
1,1002,YOUNG SINGLES/COUPLES,Mainstream
2,1003,YOUNG FAMILIES,Budget
3,1004,OLDER SINGLES/COUPLES,Mainstream
4,1005,MIDAGE SINGLES/COUPLES,Mainstream


In [144]:
transact_df.head()

,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES
0,43390,1,1000,1,5,Natural Chip Compny SeaSalt175g,2,6.0
1,43599,1,1307,348,66,CCs Nacho Cheese 175g,3,6.3
2,43605,1,1343,383,61,Smiths Crinkle Cut Chips Chicken 170g,2,2.9
3,43329,2,2373,974,69,Smiths Chip Thinly S/Cream&Onion 175g,5,15.0
4,43330,2,2426,1038,108,Kettle Tortilla ChpsHny&Jlpno Chili 150g,3,13.8


In [145]:
purchase_df.shape

(72637, 3)

In [146]:
transact_df.shape

(264836, 8)

In [147]:
purchase_df.dtypes

LYLTY_CARD_NBR       int64
LIFESTAGE           object
PREMIUM_CUSTOMER    object
dtype: object

In [148]:
# Note: the result is only displayed, not assigned back, so the column keeps its int64 dtype
purchase_df["LYLTY_CARD_NBR"].astype("object")

0           1000
1           1002
2           1003
3           1004
4           1005
          ...   
72632    2370651
72633    2370701
72634    2370751
72635    2370961
72636    2373711
Name: LYLTY_CARD_NBR, Length: 72637, dtype: object

In [149]:
# Note: the result is only displayed, not assigned back, so the column keeps its int64 dtype
transact_df["LYLTY_CARD_NBR"].astype("object")

0           1000
1           1307
2           1343
3           2373
4           2426
           ...  
264831    272319
264832    272358
264833    272379
264834    272379
264835    272380
Name: LYLTY_CARD_NBR, Length: 264836, dtype: object

In [150]:
transact_df.dtypes

DATE                int64
STORE_NBR           int64
LYLTY_CARD_NBR      int64
TXN_ID              int64
PROD_NBR            int64
PROD_NAME          object
PROD_QTY            int64
TOT_SALES         float64
dtype: object
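
The dtypes above still report `int64` for `LYLTY_CARD_NBR` because `astype` returns a new Series and leaves the DataFrame untouched. If the intent was to treat card numbers as identifiers rather than quantities (an assumption about the author's goal), the result has to be assigned back. A minimal sketch on made-up values:

```python
import pandas as pd

df = pd.DataFrame({"LYLTY_CARD_NBR": pd.Series([1000, 1002, 1003], dtype="int64")})

# The converted Series is discarded here, so df is unchanged
df["LYLTY_CARD_NBR"].astype("object")
dtype_before = df["LYLTY_CARD_NBR"].dtype

# Assigning the result back is what actually changes the column
df["LYLTY_CARD_NBR"] = df["LYLTY_CARD_NBR"].astype("object")
dtype_after = df["LYLTY_CARD_NBR"].dtype

print(dtype_before, dtype_after)
```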

In [151]:
transact_df.describe()

,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES
count,264836.0,264836.0,264836.0,264836.0,264836.0,264836.0,264836.0
mean,43464.03626,135.08011,135549.5,135158.3,56.583157,1.907309,7.3042
std,105.389282,76.78418,80579.98,78133.03,32.826638,0.643654,3.083226
min,43282.0,1.0,1000.0,1.0,1.0,1.0,1.5
25%,43373.0,70.0,70021.0,67601.5,28.0,2.0,5.4
50%,43464.0,130.0,130357.5,135137.5,56.0,2.0,7.4
75%,43555.0,203.0,203094.2,202701.2,85.0,2.0,9.2
max,43646.0,272.0,2373711.0,2415841.0,114.0,200.0,650.0


In [152]:
purchase_df.LIFESTAGE.value_counts()

RETIREES                  14805
OLDER SINGLES/COUPLES     14609
YOUNG SINGLES/COUPLES     14441
OLDER FAMILIES             9780
YOUNG FAMILIES             9178
MIDAGE SINGLES/COUPLES     7275
NEW FAMILIES               2549
Name: LIFESTAGE, dtype: int64

In [153]:
purchase_df["PREMIUM_CUSTOMER"].value_counts()

Mainstream    29245
Budget        24470
Premium       18922
Name: PREMIUM_CUSTOMER, dtype: int64

In [154]:
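# Note: .unique is missing its call parentheses, so this displays the bound method rather than the unique values
purchase_df["LYLTY_CARD_NBR"].unique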

<bound method Series.unique of 0           1000
1           1002
2           1003
3           1004
4           1005
          ...   
72632    2370651
72633    2370701
72634    2370751
72635    2370961
72636    2373711
Name: LYLTY_CARD_NBR, Length: 72637, dtype: int64>
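
The output above is the repr of the method itself, since `unique` was referenced without being called. To check whether each loyalty card appears only once, `nunique()`, `is_unique`, and `duplicated()` are usually more direct. A small sketch on made-up card numbers (one value is repeated on purpose):

```python
import pandas as pd

# Made-up card numbers with one deliberate repeat
cards = pd.Series([1000, 1002, 1003, 1003], name="LYLTY_CARD_NBR")

n_distinct = cards.nunique()                           # number of distinct card numbers
all_unique = cards.is_unique                           # True only if no card number repeats
repeats = cards[cards.duplicated(keep=False)].tolist() # every occurrence of a repeated value

print(n_distinct, all_unique, repeats)
```

On the real data, comparing `purchase_df["LYLTY_CARD_NBR"].nunique()` with `len(purchase_df)` would confirm whether there is one row per customer.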

In [155]:
transact_df["DATE"].head()

0    43390
1    43599
2    43605
3    43329
4    43330
Name: DATE, dtype: int64

Convert the Excel serial dates to ISO-format date strings.

In [156]:
import xlrd

# Convert each Excel serial day number (datemode 0 = 1900-based) to an ISO date string
transact_df["DATE"] = transact_df["DATE"].apply(
    lambda x: xlrd.xldate_as_datetime(x, 0).date().isoformat()
)
transact_df["DATE"]

0         2018-10-17
1         2019-05-14
2         2019-05-20
3         2018-08-17
4         2018-08-18
             ...    
264831    2019-03-09
264832    2018-08-13
264833    2018-11-06
264834    2018-12-27
264835    2018-09-22
Name: DATE, Length: 264836, dtype: object
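
The same conversion can be sketched with pandas alone, which avoids the `xlrd` dependency and yields a proper `datetime64` column. This is an alternative, not what the notebook ran: it relies on the fact that in Excel's 1900 date system, serial numbers for modern dates count days from 1899-12-30. The three serials below are the first `DATE` values shown in the `head()` output.

```python
import pandas as pd

# First three DATE values from the transaction data
serials = pd.Series([43390, 43599, 43605])

# Treat each serial as a day count from Excel's effective origin
dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

iso_dates = dates.dt.strftime("%Y-%m-%d").tolist()
print(iso_dates)
```

Keeping the column as `datetime64` rather than strings makes later steps such as resampling by month or filling missing days simpler.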

In [157]:
transact_df["STORE_NBR"].value_counts()

226    2022
88     1873
93     1832
165    1819
237    1785
       ... 
11        2
31        2
206       2
76        1
92        1
Name: STORE_NBR, Length: 272, dtype: int64

In [158]:
transact_df.isnull().any()

DATE              False
STORE_NBR         False
LYLTY_CARD_NBR    False
TXN_ID            False
PROD_NBR          False
PROD_NAME         False
PROD_QTY          False
TOT_SALES         False
dtype: bool

In [159]:
purchase_df.isnull().any()

LYLTY_CARD_NBR      False
LIFESTAGE           False
PREMIUM_CUSTOMER    False
dtype: bool

In [160]:
transact_df["PROD_NAME"].value_counts()

Kettle Mozzarella   Basil & Pesto 175g      3304
Kettle Tortilla ChpsHny&Jlpno Chili 150g    3296
Cobs Popd Swt/Chlli &Sr/Cream Chips 110g    3269
Tyrrells Crisps     Ched & Chives 165g      3268
Cobs Popd Sea Salt  Chips 110g              3265
                                            ... 
RRD Pc Sea Salt     165g                    1431
Woolworths Medium   Salsa 300g              1430
NCC Sour Cream &    Garden Chives 175g      1419
French Fries Potato Chips 175g              1418
WW Crinkle Cut      Original 175g           1410
Name: PROD_NAME, Length: 114, dtype: int64

In [161]:
# Keep letters and spaces only (this also drops the trailing "g" of the pack size)
transact_df["CHIP_NAME"] = transact_df["PROD_NAME"].replace(r'[^a-zA-Z ]|g$', '', regex=True)
# Collapse runs of whitespace, then trim both ends
transact_df["CHIP_NAME"] = transact_df["CHIP_NAME"].replace(r'\s+', ' ', regex=True).str.strip()

In [162]:
transact_df["CHIP_NAME"].value_counts()

Kettle Mozzarella Basil Pesto         3304
Kettle Tortilla ChpsHnyJlpno Chili    3296
Cobs Popd SwtChlli SrCream Chips      3269
Tyrrells Crisps Ched Chives           3268
Cobs Popd Sea Salt Chips              3265
                                      ... 
RRD Pc Sea Salt                       1431
Woolworths Medium Salsa               1430
NCC Sour Cream Garden Chives          1419
French Fries Potato Chips             1418
WW Crinkle Cut Original               1410
Name: CHIP_NAME, Length: 114, dtype: int64
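
The cleaning pattern above only removes a `g` at the very end of the name, so a pack size placed mid-string would leave a stray `g` behind. The sketch below is one possible variant that removes the whole size token first, wherever it sits. The first two names come from the data shown earlier; the third is a hypothetical name invented to show the mid-string case.

```python
import pandas as pd

names = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "Kettle Mozzarella   Basil & Pesto 175g",
    "Example Brand 135g Sea Salt",  # hypothetical name with the size mid-string
])

cleaned = (
    names.str.replace(r"\d+\s*[gG]\b", "", regex=True)  # drop the pack-size token
         .str.replace(r"[^a-zA-Z ]", "", regex=True)    # keep letters and spaces only
         .str.replace(r"\s+", " ", regex=True)          # collapse repeated whitespace
         .str.strip()
)
print(cleaned.tolist())
```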

In [163]:
# Drop salsa products so that only chips remain
trans_df_clean = transact_df[~transact_df["PROD_NAME"].str.contains("Salsa")].copy()

The maximum PROD_QTY of 200 is an outlier, so we inspect those transactions.

In [164]:
trans_df_clean[trans_df_clean["PROD_QTY"] == 200]

,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,CHIP_NAME
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme 380g,200,650.0,Dorito Corn Chp Supreme
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme 380g,200,650.0,Dorito Corn Chp Supreme


In [165]:
trans_df_clean[trans_df_clean["LYLTY_CARD_NBR"] == 226000]

,DATE,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_NAME,PROD_QTY,TOT_SALES,CHIP_NAME
69762,2018-08-19,226,226000,226201,4,Dorito Corn Chp Supreme 380g,200,650.0,Dorito Corn Chp Supreme
69763,2019-05-20,226,226000,226210,4,Dorito Corn Chp Supreme 380g,200,650.0,Dorito Corn Chp Supreme


This customer has only two transactions, both for 200 packets of chips. Since we want to look at retail customers, we can drop this loyalty card.

In [166]:
trans_df_clean.drop(trans_df_clean[trans_df_clean["LYLTY_CARD_NBR"] == 226000].index, inplace=True)
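
A more general version of this step might flag every loyalty card with an unusually large purchase instead of hard-coding one card number. The sketch below uses made-up rows; the name `QTY_THRESHOLD` and its value of 100 are hypothetical and would need to be chosen by looking at the quantity distribution.

```python
import pandas as pd

# Made-up transactions, including one bulk buyer
tx = pd.DataFrame({
    "LYLTY_CARD_NBR": [1000, 1307, 226000, 226000],
    "PROD_QTY": [2, 3, 200, 200],
})

QTY_THRESHOLD = 100  # hypothetical cut-off for a non-retail purchase

# Cards with at least one purchase at or above the threshold
bulk_cards = tx.loc[tx["PROD_QTY"] >= QTY_THRESHOLD, "LYLTY_CARD_NBR"].unique()

# Remove all transactions belonging to those cards
tx_retail = tx[~tx["LYLTY_CARD_NBR"].isin(bulk_cards)].copy()
print(tx_retail)
```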

In [168]:
trans_df_clean.describe()

,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES
count,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0
mean,135.050361,135530.3,135130.4,56.352213,1.906456,7.316113
std,76.786971,80715.2,78147.6,33.695235,0.342499,2.474897
min,1.0,1000.0,1.0,1.0,1.0,1.7
25%,70.0,70015.0,67568.75,26.0,2.0,5.8
50%,130.0,130367.0,135181.5,53.0,2.0,7.4
75%,203.0,203083.2,202652.2,87.0,2.0,8.8
max,272.0,2373711.0,2415841.0,114.0,5.0,29.5


In [180]:
trans_df_clean.DATE.value_counts().sort_values()

2019-06-13    607
2018-09-22    609
2018-11-25    610
2018-10-18    611
2019-06-24    612
             ... 
2018-12-20    808
2018-12-19    839
2018-12-22    840
2018-12-23    853
2018-12-24    865
Name: DATE, Length: 364, dtype: int64

Only 364 distinct dates appear, so one day of the year is missing, and it appears to be Christmas Day. Since shops are usually closed on that day, we can assume there were no sales.
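
One way to confirm which date is absent is to compare the observed dates against a complete daily range. The sketch below builds a toy stand-in for the `DATE` column (ISO strings covering 2018-07-01 to 2019-06-30, with one day removed on purpose) and recovers the gap; on the real data the same steps would be applied to `trans_df_clean["DATE"]`.

```python
import pandas as pd

# Toy stand-in for the DATE column: every day in the range except one
all_days = pd.date_range("2018-07-01", "2019-06-30", freq="D")
dates = pd.Series(all_days.strftime("%Y-%m-%d"))
dates = dates[dates != "2018-12-25"]

# Parse the strings, build the full expected range, and take the difference
observed = pd.to_datetime(dates)
expected = pd.date_range(observed.min(), observed.max(), freq="D")
missing = expected.difference(pd.DatetimeIndex(observed.unique()))

missing_days = missing.strftime("%Y-%m-%d").tolist()
print(missing_days)
```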

In [189]:
trans_df_clean["CHIP_SIZE"] = trans_df_clean["PROD_NAME"].str.extract(r'(\d+)', expand=False).astype('int')

In [190]:
trans_df_clean.describe()

,STORE_NBR,LYLTY_CARD_NBR,TXN_ID,PROD_NBR,PROD_QTY,TOT_SALES,CHIP_SIZE
count,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0,246740.0
mean,135.050361,135530.3,135130.4,56.352213,1.906456,7.316113,175.583521
std,76.786971,80715.2,78147.6,33.695235,0.342499,2.474897,59.432118
min,1.0,1000.0,1.0,1.0,1.0,1.7,70.0
25%,70.0,70015.0,67568.75,26.0,2.0,5.8,150.0
50%,130.0,130367.0,135181.5,53.0,2.0,7.4,170.0
75%,203.0,203083.2,202652.2,87.0,2.0,8.8,175.0
max,272.0,2373711.0,2415841.0,114.0,5.0,29.5,380.0
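
The pack-size extraction works here because every product name contains digits. If a name had none, `str.extract` would return a missing value and a plain integer cast would fail. A slightly more defensive sketch, using two names from the data and one hypothetical name without a size:

```python
import pandas as pd

names = pd.Series([
    "Natural Chip Compny SeaSalt175g",
    "Dorito Corn Chp Supreme 380g",
    "Example Brand Sea Salt",  # hypothetical name with no pack size
])

# Extract the first run of digits; names without digits become missing values
raw = names.str.extract(r"(\d+)", expand=False)

# Coerce to numbers and use the nullable integer dtype so missing sizes are allowed
sizes = pd.to_numeric(raw, errors="coerce").astype("Int64")
print(sizes)
```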


In [195]:
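trans_df_clean["BRAND"] = trans_df_clean["CHIP_NAME"].apply(lambda x: x.split(" ")[0])

Output of the intermediate split, trans_df_clean["CHIP_NAME"].apply(lambda x: x.split(" ")), before the first token is taken as the brand:
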
0                  [Natural, Chip, Compny, SeaSalt]
1                              [CCs, Nacho, Cheese]
2            [Smiths, Crinkle, Cut, Chips, Chicken]
3               [Smiths, Chip, Thinly, SCreamOnion]
4           [Kettle, Tortilla, ChpsHnyJlpno, Chili]
                            ...                    
264831    [Kettle, Sweet, Chilli, And, Sour, Cream]
264832                 [Tostitos, Splash, Of, Lime]
264833                          [Doritos, Mexicana]
264834     [Doritos, Corn, Chip, Mexican, Jalapeno]
264835                 [Tostitos, Splash, Of, Lime]
Name: CHIP_NAME, Length: 246740, dtype: object
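
Taking the first token as the brand treats spelling variants as different brands. For example, both `Dorito` and `Doritos` appear as first tokens in the names shown above. The sketch below extracts the brand with the vectorised string accessor and then applies a small alias map. The map `BRAND_ALIASES` is a hypothetical illustration, not part of the original notebook, and would need to be built by reviewing the actual brand counts.

```python
import pandas as pd

chip_names = pd.Series([
    "Natural Chip Compny SeaSalt",
    "CCs Nacho Cheese",
    "Dorito Corn Chp Supreme",
    "Doritos Mexicana",
])

# First whitespace-separated token of each name
brand = chip_names.str.split(" ").str[0]

# Hypothetical harmonisation map (an assumption for illustration)
BRAND_ALIASES = {"Dorito": "Doritos"}
brand_clean = brand.replace(BRAND_ALIASES)

print(brand_clean.value_counts())
```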