You are part of Quantium’s retail analytics team. Your client, the Category Manager for Chips, wants to better understand the types of customers who purchase chips and their purchasing behaviour within the region.
The insights from your analysis will feed into the supermarket’s strategic plan for the chip category in the next half year.

In [None]:
# Importing libraries
import pandas as pd
import numpy as np 
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
import plotly.express as px

In [None]:
#Loading the datasets
transaction =  pd.read_csv("/kaggle/input/quantium-data-analytics-virtual-experience-program/Transactions.csv")
behaviour=pd.read_csv("/kaggle/input/quantium-data-analytics-virtual-experience-program/PurchaseBehaviour.csv")

In [None]:
# Viewing the transaction data
transaction.head(10)

In [None]:
# Viewing the behaviour data
behaviour.head(10)

### Initial Tasks for Data Cleaning

#### Transaction Table

1) Converting the DATE column of the transaction table to a date format.

2) Treating store numbers, product numbers and loyalty card numbers as labels rather than quantities, since they identify unique stores, products and customers.

3) Creating columns in the transaction table from the product name: brand name, pack weight and product description.

4) Checking word frequencies in the product description to distinguish chips from any other products, and keeping only chip products.

5) Checking the summary of the data: identify and remove outliers if any, and check for null values, data types, etc.

6) Left joining the transaction table with the behaviour table to add life stage and premium customer details.
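
Task 1 in the list above can be sketched as follows. This is a minimal example that assumes `DATE` holds Excel-style serial day numbers; the sample values are made up for illustration. For such serials the conventional origin is `1899-12-30`, and an origin of `1900-01-01` gives dates two days later, so it is worth checking which convention the source file uses.

```python
import pandas as pd

# Hypothetical serial day numbers, as they might appear in a DATE column.
serials = pd.Series([43282, 43283, 43646])

# Conventional origin for Excel-style serial dates.
excel_dates = pd.to_datetime(serials, unit="D", origin="1899-12-30")

# The same serials with a 1900-01-01 origin land two days later.
shifted_dates = pd.to_datetime(serials, unit="D", origin="1900-01-01")

print(excel_dates.dt.date.tolist())
print((shifted_dates - excel_dates).unique())
```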






In [None]:
transaction.info()

In [None]:
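# Convert the integer day counts in DATE to datetimes.
# If these are Excel serial dates, the usual origin is '1899-12-30'; '1900-01-01' gives dates two days later.
transaction['DATE'] = pd.to_datetime(transaction['DATE'], errors='coerce', unit='d', origin='1900-01-01')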

In [None]:
# Checking for the time periods covered by the transactions and how it is distributed over the years.
pd.DatetimeIndex(transaction.DATE).year.value_counts()

In [None]:
transaction['STORE_NBR'] = transaction['STORE_NBR'].astype('object')
transaction['LYLTY_CARD_NBR'] = transaction['LYLTY_CARD_NBR'].astype('object')
transaction['PROD_NBR'] = transaction['PROD_NBR'].astype('object')
transaction['TXN_ID'] = transaction['TXN_ID'].astype('object')

In [None]:
transaction.info()

In [None]:
transaction.describe()
 

It looks like there are no missing values in the data so far, and the min/max ranges suggest that we have outliers. Let's first analyse the data based on product name and see whether the outliers are still present afterwards.

The product name can be separated into brand name, pack size and product description.
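
As an alternative view of this split, here is a small regex-based sketch. The product names are illustrative examples, and the assumption is that the pack size always appears as digits followed by `g` or `G`; names that break this pattern would need separate handling.

```python
import pandas as pd

# Illustrative product names (assumed examples, not notebook output).
names = pd.Series([
    "Kettle Sweet Chilli And Sour Cream 175g",
    "Smiths Crinkle Cut Chips Chicken 170g",
    "Kettle 135g Swt Pot Sea Salt",
])

# Brand: first whitespace-separated word.
brand = names.str.split().str[0]

# Pack size: digits followed by g/G anywhere in the name.
size_g = names.str.extract(r"(\d+)\s*[gG]", expand=False).astype(int)

# Description: drop the pack size, then drop the brand word.
desc = (
    names.str.replace(r"\d+\s*[gG]", "", regex=True)
    .str.split(n=1).str[1]
    .str.strip()
)
print(pd.DataFrame({"brand": brand, "size_g": size_g, "desc": desc}))
```

Matching the size anywhere in the name means a row such as the third one, where the size sits in the middle, does not need a manual fix.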

In [None]:
# Extracting the first word of the product name, which is the brand name
transaction['BRAND_NAME']=transaction['PROD_NAME'].apply(lambda x: x.split(" ")[0])
    

In [None]:
# Extracting the last word of the product name, which holds the pack size details
transaction['PROD_PKG']=transaction['PROD_NAME'].apply(lambda x: x.split(" ")[-1])

In [None]:
# removing the first word from product name
transaction['PROD_DESC'] = transaction['PROD_NAME'].str.split(n=1).str[1]

In [None]:
# also removing the last word further to get the product description
# n must be passed by keyword in recent pandas versions
transaction['PROD_DESC'] = transaction['PROD_DESC'].str.rsplit(' ', n=1).str[0]

In [None]:
transaction

In [None]:
transaction.info()

In [None]:
transaction.PROD_PKG.value_counts()

You can see that many values mix product description and pack size. We have to clean these up.

In [None]:
transaction['PROD_DESC'] = transaction['PROD_DESC']+' '+transaction['PROD_PKG'].str[:-4]

In [None]:
transaction

In [None]:
# Extracting only the numeric characters (the pack size in grams)
transaction['PROD_PKG'] = transaction.PROD_PKG.str.extract(r'(\d+)', expand=False)

In [None]:
transaction

In [None]:
transaction.info()

In [None]:
transaction.PROD_PKG.value_counts()

You can see that around 3257 observations are missing a pack size. For these rows the last word of the product name contains no digits, so the extraction above returns nothing.

In [None]:
transaction["PROD_PKG"] = transaction["PROD_PKG"].fillna("No Value")

In [None]:
transaction.PROD_PKG.value_counts()

In [None]:
transaction[transaction['PROD_PKG'] == 'No Value']

In [None]:
transaction["PROD_PKG"] = transaction["PROD_PKG"].replace({"No Value": "135"})


In [None]:
transaction[transaction['PROD_PKG'] == '135']
# We have replaced the missing values with 135 as needed. However, PROD_DESC still needs to be edited; further text cleaning will be done later.

In [None]:
transaction.info()

#### Now we have zero null values and all products are accounted for.

In [None]:
transaction.PROD_DESC.value_counts()

#### There are salsa products mixed into our chips data set, so we have to remove them before analysing the data. We can also remove the punctuation and extra spaces in the product description.
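
A minimal sketch of such a filter on a toy frame is shown here. The product names are illustrative, and the `case=False` and `na=False` options are a defensive choice rather than something the data is known to require.

```python
import pandas as pd

# Toy frame; the product names are illustrative only.
df = pd.DataFrame({
    "PROD_NAME": [
        "Doritos Salsa Medium 300g",
        "Kettle Original 175g",
        "Old El Paso Salsa Dip Tomato Mild 300g",
        "Smiths Crinkle Cut Chips Chicken 170g",
    ]
})

# case=False catches "salsa"/"SALSA"; na=False treats missing names as non-matches.
is_salsa = df["PROD_NAME"].str.contains("salsa", case=False, na=False)
chips_only = df[~is_salsa].copy()
print(chips_only)
```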

In [None]:
chips_data = transaction[~transaction['PROD_DESC'].str.contains('Salsa')].copy()

In [None]:
chips_data

In [None]:
import re
import string

def clean_text(text):
    '''Make text lowercase, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Strip every punctuation character
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Drop any token that contains a digit (e.g. pack sizes such as "135g")
    text = re.sub(r'\w*\d\w*', '', text)
    return text

clean = lambda x: clean_text(x)

In [None]:
chips_data['PROD_DESC'] = chips_data.PROD_DESC.apply(clean)

In [None]:
chips_data

In [None]:
chips_data[chips_data['PROD_PKG'] == '135']
# Checking whether the rows that previously had a numeric value in PROD_DESC have been cleaned. The cleaning function has worked.

In [None]:
chips_data.BRAND_NAME.value_counts()

Several brand names here are duplicated under different spellings; for example, Red is the same as RRD and Snbts is the same as Sunbites. We need to replace them.

In [None]:
# Map alternative spellings to a single brand name
brand_map = {
    'Red': 'RRD',
    'Snbts': 'Sunbites',
    'Dorito': 'Doritos',
    'Grain': 'GrnWves',
    'Infzns': 'Infuzions',
    'WW': 'Woolworths',
    'Smith': 'Smiths',
    'NCC': 'Natural',
}
chips_data['BRAND_NAME'] = chips_data['BRAND_NAME'].replace(brand_map)


In [None]:
chips_data.BRAND_NAME.value_counts()

In [None]:
#top 20
chips_data.PROD_DESC.str.split(expand=True).stack().value_counts()[:20].plot(kind='barh', figsize=(20,10))

In [None]:
#bottom 20
chips_data.PROD_DESC.str.split(expand=True).stack().value_counts()[-20:].plot(kind='barh', figsize=(10,10))

You can see that words like chips, chip and chp all mean the same thing. The most frequent word is chips, while the least frequent ones, such as fries, garden, onion dip, honey and chilliscream, all occur about equally often.

Flavour and style words such as cheese, salt, crinkle, corn and chicken are also among the most common descriptions. A mix of these flavours could be the most sought after in the chips section. This needs further investigation.

In [None]:
chips_data.describe()

As observed earlier, the outlier still exists. The gap between the third quartile and the maximum is far too large for both product quantity and total sales. This needs further investigation before removal.

In [None]:
chips_data[chips_data['TOT_SALES']== 650]

It looks like there are 2 data points with a PROD_QTY of 200 and total sales of 650, and both belong to the same customer, loyalty card number 226000. Let's check whether this card holder has any other purchases.

In [None]:
chips_data[chips_data['LYLTY_CARD_NBR']== 226000]

In [None]:
behaviour[behaviour['LYLTY_CARD_NBR']== 226000]

This confirms that this customer could be a bulk buyer, so we can treat these transactions as outliers and remove them. The behaviour dataset shows that the customer is a premium customer in the older families category.



In [None]:
dategroup = chips_data.groupby('DATE')[['TXN_ID']].count()

In [None]:
dategroup

There are only 364 rows, which shows that one date is missing. Let's find the missing date.

In [None]:
pd.date_range(start = '2018-07-03', end = '2019-07-02' ).difference(dategroup.index)

It looks like the missing date is 27 December 2018.

In [None]:
dategroup = dategroup.reindex(pd.date_range("2018-07-03", "2019-07-02"), fill_value= 0)

In [None]:
dategroup

In [None]:
dategroup['TXN_ID']= dategroup['TXN_ID'].astype('int')

In [None]:
px.line(dategroup, x=dategroup.index, y='TXN_ID')


The plot shows transactions over time. The sharp dip to zero is the missing date, 27 December 2018. The increase in sales around the last week of December is most likely due to the Christmas season.

It is time to merge our datasets with a left join, using the chips data as the main table and adding the life stage and premium customer details to it. Before that, let's get rid of the outlier. There are many methods available, including the IQR method, but since the dataset has only 2 outlying rows, a simple filter is enough.
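
For reference, the IQR method mentioned above could look like the following sketch. The quantities are made up for illustration, and the 1.5 multiplier is the common rule of thumb rather than something derived from this dataset.

```python
import pandas as pd

# Made-up quantities with one extreme value, for illustration only.
qty = pd.Series([1, 2, 2, 3, 3, 4, 4, 5, 5, 200])

q1, q3 = qty.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers = qty[(qty < lower) | (qty > upper)]
print(outliers.tolist())
```

When most values in a column are identical, the IQR can collapse to zero and flag every other value, which is one reason a targeted filter can be preferable for a handful of known rows.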

In [None]:
chips = chips_data[chips_data['LYLTY_CARD_NBR']!= 226000]

In [None]:
chips.describe()

The outliers have been removed.

In [None]:
behaviour['LYLTY_CARD_NBR'] = behaviour['LYLTY_CARD_NBR'].astype('object')

In [None]:
merged = chips.merge(behaviour, on='LYLTY_CARD_NBR', how='left')

In [None]:
merged

In [None]:
merged.info()

In [None]:
merged.describe()

In [None]:
merged.isnull().sum()

There are no null values or outliers in the final merged table. The dataset is finally ready for analysis.
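
The join itself can also be checked directly. The sketch below uses toy tables (the values are invented) and the `validate` and `indicator` options of `DataFrame.merge` to confirm that each card number appears once in the customer table and to list transactions without a matching customer.

```python
import pandas as pd

# Toy tables standing in for the chips and behaviour data.
txns = pd.DataFrame({"LYLTY_CARD_NBR": [1, 1, 2, 3], "TOT_SALES": [3.0, 6.0, 2.7, 4.4]})
cust = pd.DataFrame({
    "LYLTY_CARD_NBR": [1, 2],
    "LIFESTAGE": ["RETIREES", "NEW FAMILIES"],
    "PREMIUM_CUSTOMER": ["Budget", "Premium"],
})

# validate raises if a card number is duplicated in the customer table;
# indicator adds a _merge column that shows unmatched transactions.
check = txns.merge(cust, on="LYLTY_CARD_NBR", how="left",
                   validate="many_to_one", indicator=True)

unmatched = check[check["_merge"] == "left_only"]
print(len(check), len(unmatched))
```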

In [None]:
merged.LIFESTAGE.value_counts().plot(kind='bar',figsize=(20,10))

Older singles/couples seem to make the most purchases, and new families the fewest.

In [None]:
merged.PREMIUM_CUSTOMER.value_counts().plot(kind='bar',figsize=(20,10))

Mainstream customers make the most purchases, followed by budget and then premium customers.

In [None]:
merged.PROD_PKG.value_counts().plot(kind='bar',figsize=(20,10))

By pack size, 175 g is on top, followed by 150 g, 134 g, 170 g and 165 g, making up the top 5.

The 125 g, 180 g, 70 g, 220 g and 160 g packs are purchased least frequently.

In [None]:
merged.BRAND_NAME.value_counts()[:5].plot(kind='bar',figsize=(20,10))
plt.title("Top 5 brands")
plt.show()

In [None]:
merged.BRAND_NAME.value_counts()[-5:].plot(kind='bar',figsize=(20,10))
plt.title("Least performing 5 brands")
plt.show()

In [None]:
totalsales_cust= merged.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'])[['TOT_SALES']].sum().reset_index()
totalsales_cust = totalsales_cust.sort_values('TOT_SALES', ascending=False)


In [None]:
totalsales_cust

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(x='LIFESTAGE',y='TOT_SALES',hue='PREMIUM_CUSTOMER',data = totalsales_cust)
plt.title("Sales Distribution across lifestages clustered by membership type")
plt.show()


Sales come mainly from Mainstream young singles/couples, Mainstream retirees and Budget older families.
New families have the lowest sales in every membership type.



In [None]:
totalsales_brands= merged.groupby(['BRAND_NAME','PREMIUM_CUSTOMER'])[['TOT_SALES']].sum().reset_index()
totalsales_brands = totalsales_brands.sort_values('TOT_SALES', ascending=False)

In [None]:

plt.figure(figsize=(30,10))
sns.barplot(x='BRAND_NAME',y='TOT_SALES',hue='PREMIUM_CUSTOMER',data = totalsales_brands)
plt.title("Sales Distribution across Brands clustered by Membership Type")
plt.show()

In terms of brand performance by membership type, sales per brand are distributed almost equally across all membership types.

Kettle, Doritos, Smiths and Pringles contribute the most to sales per brand, with Kettle leading by a wide margin.

As shown earlier, Mainstream is the membership type with the most purchases.



In [None]:
grouped_royalty = merged.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'])[['LYLTY_CARD_NBR']].nunique().reset_index()

In [None]:
grouped_royalty = grouped_royalty.rename(columns={"LYLTY_CARD_NBR": "Loyalty_Card_Members"})

In [None]:
grouped_royalty=grouped_royalty.sort_values('Loyalty_Card_Members', ascending=False)


In [None]:
grouped_royalty

In [None]:
plt.figure(figsize=(30,10))
sns.barplot(x='LIFESTAGE',y='Loyalty_Card_Members',hue='PREMIUM_CUSTOMER',data = grouped_royalty)
plt.title(" Membership Distribution across Life stages clustered by Membership Type")
plt.ylabel('Distribution of Loyalty card members')
plt.show()

Older singles/couples are fairly evenly distributed across all membership types.

Older families and retirees lean towards budget and premium memberships.

Customer numbers are highest in Mainstream, driven by young singles/couples, retirees and older singles/couples.


In [None]:
avgunits_cust= merged.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'])[['PROD_QTY']].mean().reset_index()

In [None]:
avgunits_cust = avgunits_cust.rename(columns={"PROD_QTY": "Avg_Prod_Qty"})

In [None]:
avgunits_cust=avgunits_cust.sort_values('Avg_Prod_Qty', ascending=False)

In [None]:
plt.figure(figsize=(30,10))
sns.barplot(x='LIFESTAGE',y='Avg_Prod_Qty',hue='PREMIUM_CUSTOMER',data = avgunits_cust )
plt.title(" Average Product Qty across Life stages clustered by Membership Type")
plt.ylabel('Average Product Quantity')
plt.show()

Older families and young families buy more units per transaction than all the other categories. The remaining categories follow a similar trend at slightly lower levels.

In [None]:
avgprice_unit= merged.groupby(['LIFESTAGE','PREMIUM_CUSTOMER'])[['TOT_SALES','PROD_QTY']].sum().reset_index()

In [None]:
avgprice_unit['AVG_price/unit']=avgprice_unit['TOT_SALES']/avgprice_unit['PROD_QTY']

In [None]:
avgprice_unit= avgprice_unit.sort_values('AVG_price/unit', ascending=False)

In [None]:
avgprice_unit

In [None]:
plt.figure(figsize=(30,10))
sns.barplot(x='LIFESTAGE',y='AVG_price/unit',hue='PREMIUM_CUSTOMER',data = avgprice_unit )
plt.title(" Average Price per unit across Life stages clustered by Membership Type")
plt.ylabel('Average Price per unit')
plt.show()

Midage singles/couples and young singles/couples pay a higher average price per unit, especially in the Mainstream membership. Together with the previous visualizations, this shows a clear trend: young and mid-age singles/couples who buy chips are less likely to be premium customers. Their consumption may be mostly for entertainment rather than healthy snacking, compared to the other segments.

Apart from these two categories, the trend across life stages is almost the same for the different memberships. How significant is the difference between these groups and the others?

The next step is a t-test to check whether the unit price paid by mainstream young and mid-age singles and couples is significantly higher than that paid by budget or premium young and mid-age singles and couples.



In [None]:
merged['PricePerUnit'] = merged['TOT_SALES'] / merged['PROD_QTY']

In [None]:
sample1 = merged[(merged['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"]))  & (merged['PREMIUM_CUSTOMER'] == 'Mainstream')]
sample2 = merged[(merged['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"]))  & (merged['PREMIUM_CUSTOMER'] != 'Mainstream')]

In [None]:
sample1

In [None]:
sample2

The sample sizes are unequal. Let's check for normality.

In [None]:
plt.hist(sample1.PricePerUnit)

In [None]:
plt.hist(sample2.PricePerUnit)

Both histograms look approximately normal.
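
A histogram is only a visual check. One way to put a number on it is the correlation from a normal Q-Q plot, sketched here on synthetic samples (not the notebook's data) with `scipy.stats.probplot`; values close to 1 indicate a shape close to normal. The helper name `qq_correlation` is hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples for illustration: one bell-shaped, one right-skewed.
normal_like = rng.normal(loc=4.0, scale=1.0, size=2000)
skewed = rng.exponential(scale=1.0, size=2000)

def qq_correlation(x):
    """Correlation between sample quantiles and normal quantiles (near 1 = close to normal)."""
    (_, _), (_, _, r) = stats.probplot(x, dist="norm")
    return r

r_normal = qq_correlation(normal_like)
r_skewed = qq_correlation(skewed)
print(round(r_normal, 3), round(r_skewed, 3))
```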

Since the samples are independent, unequal in size and approximately normal, we first test for equality of variances (an F-test and Levene's test in this case), followed by a t-test chosen according to the result (pooled variance or separate variances).

In [None]:
from scipy import stats
def f_test(x, y):
    x = np.array(x)
    y = np.array(y)
    f = np.var(x, ddof=1)/np.var(y, ddof=1) #calculate F test statistic 
    dfn = x.size-1 #define degrees of freedom numerator 
    dfd = y.size-1 #define degrees of freedom denominator 
    p = 1-stats.f.cdf(f, dfn, dfd) #find p-value of F test statistic 
    return f, p

#perform F-test
f_test(sample1.PricePerUnit, sample2.PricePerUnit)


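The p-value is close to 1. This is an upper-tail (one-sided) p-value, so it approaches 1 whenever the first sample has the smaller variance. The F-test is also not ideal when the sample sizes are very unequal, which might lead to false conclusions, so we confirm with Levene's test.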
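
A two-sided version of the variance-ratio test avoids the dependence on which sample is passed first. The sketch below is one common way to compute it, shown on synthetic data; the function name `f_test_two_sided` is hypothetical and not part of SciPy.

```python
import numpy as np
from scipy import stats

def f_test_two_sided(x, y):
    """Variance-ratio F-test with a two-sided p-value."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    f = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    # Double the smaller tail so the result does not depend on sample order.
    p = 2 * min(stats.f.cdf(f, dfn, dfd), stats.f.sf(f, dfn, dfd))
    return f, min(p, 1.0)

rng = np.random.default_rng(1)
a = rng.normal(0, 1.0, size=500)   # synthetic, smaller variance
b = rng.normal(0, 1.5, size=2000)  # synthetic, larger variance

f_ab, p_ab = f_test_two_sided(a, b)
f_ba, p_ba = f_test_two_sided(b, a)
print(f_ab, p_ab, p_ba)
```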

In [None]:
# Testing for equality of variances with unequal sample sizes using Levene's test
from scipy.stats import levene
a = sample1.PricePerUnit.values.tolist()
b = sample2.PricePerUnit.values.tolist()
stat, p = levene(a, b)
print('W=%.3f, p=%.3f ' % (stat, p))

With a p-value below 0.05 we can reject the null hypothesis of equal variances, so we conclude that the variances of the two samples differ.

In [None]:
from scipy.stats import ttest_ind
# Levene's test rejected equal variances, so use Welch's t-test (separate variances)
stat, p = ttest_ind(sample1.PricePerUnit, sample2.PricePerUnit, equal_var=False)
print('t=%.3f, p=%.3f ' % (stat, p))

Since we can reject the null hypothesis yet again, the unit price for mainstream young and mid-age singles and couples is significantly higher than that of budget or premium young and mid-age singles and couples.
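
Because the claim is directional ("higher"), a one-sided Welch test matches it more closely than the default two-sided test. The sketch below uses synthetic unit prices, not the notebook's samples, and assumes SciPy 1.6 or later for the `alternative` argument.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
# Synthetic unit prices: group_a is centred slightly higher and has a different spread.
group_a = rng.normal(loc=4.05, scale=1.0, size=800)
group_b = rng.normal(loc=3.70, scale=1.2, size=3000)

# equal_var=False selects Welch's t-test (no pooled variance);
# alternative="greater" tests H1: mean(group_a) > mean(group_b).
stat, p = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
print("t=%.3f, p=%.4f" % (stat, p))
```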

In [None]:
Analysis1 = merged[(merged['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES"]))  & (merged['PREMIUM_CUSTOMER'] == 'Mainstream')]

In [None]:
Analysis1

In [None]:
Analysis1.BRAND_NAME.value_counts().plot(kind='bar',figsize=(20,10))

It looks like Kettle is the most preferred brand among Mainstream young singles/couples.

In [None]:
Analysis1.PROD_PKG.value_counts().plot(kind='bar',figsize=(20,10))

Consistent with the previous analysis, the top 5 pack sizes for Mainstream young singles/couples are 175 g, followed by 150 g, 134 g, 170 g and 165 g.

Let's do a basket analysis for Mainstream young singles/couples.

In [None]:
basket1 = Analysis1.groupby(['LYLTY_CARD_NBR', 'BRAND_NAME'])['PROD_QTY'].sum().unstack().reset_index().fillna(0).set_index('LYLTY_CARD_NBR')

In [None]:
basket1

In [None]:
def encode_units(x):
    '''Return 1 if at least one unit was bought, otherwise 0.'''
    return 1 if x >= 1 else 0

basket_sets = basket1.applymap(encode_units)


In [None]:
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)

In [None]:
frequent_itemsets

In [None]:
rules = association_rules(frequent_itemsets, metric="lift")
rules.head()

In [None]:
basket2= Analysis1.groupby(['LYLTY_CARD_NBR', 'PROD_PKG'])['PROD_QTY'].sum().unstack().reset_index().fillna(0).set_index('LYLTY_CARD_NBR')

In [None]:
basket_sets2 = basket2.applymap(encode_units)

In [None]:
frequent_itemsets2 = apriori(basket_sets2, min_support=0.07, use_colnames=True)

In [None]:
frequent_itemsets2.sort_values('support',ascending=False)

In [None]:
rules_pkg=association_rules(frequent_itemsets2, metric="lift")

In [None]:
rules_pkg.head()

The lift values are below 1 for both sets of association rules for Mainstream young singles/couples, whether between brands or between pack sizes. This means there is not enough evidence of strong relationships between multiple brands or pack sizes.
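
To make the lift metric concrete, the sketch below computes support, confidence and lift by hand for a toy basket. The data is invented for illustration. A lift of 1 means the two items are bought independently, above 1 means they co-occur more often than chance, and below 1 means less often.

```python
import pandas as pd

# Toy one-hot basket: rows are customers, columns are brands (illustrative data).
basket = pd.DataFrame(
    {"Kettle": [1, 1, 1, 0, 0, 1, 0, 1],
     "Doritos": [1, 0, 1, 0, 1, 0, 0, 0]},
    dtype=bool,
)

support_k = basket["Kettle"].mean()                          # P(Kettle)
support_d = basket["Doritos"].mean()                         # P(Doritos)
support_kd = (basket["Kettle"] & basket["Doritos"]).mean()   # P(Kettle and Doritos)

confidence = support_kd / support_k            # P(Doritos | Kettle)
lift = support_kd / (support_k * support_d)    # ratio to the independence baseline
print(support_k, support_d, support_kd, confidence, lift)
```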


According to the Apriori output, the support (the share of customers in the segment who bought the item at least once) for each brand and pack size among Mainstream young singles/couples is as follows:

Kettle 38%, Doritos 26%, Pringles 25%, Smiths 20%

175 g has a support of 45%, 150 g 31%, 134 g 25%, etc.



In [None]:
quantity_bybrand =  Analysis1.groupby(['BRAND_NAME'])[['PROD_QTY']].sum().reset_index()

In [None]:
quantity_bybrand.PROD_QTY = quantity_bybrand.PROD_QTY / Analysis1.PROD_QTY.sum()

In [None]:
quantity_bybrand =quantity_bybrand.rename(columns={"PROD_QTY": "Targeted_Segment"})

In [None]:
quantity_bybrand

In [None]:
# Comparison group: everyone except Mainstream young singles/couples
other = merged[~((merged['LIFESTAGE'] == "YOUNG SINGLES/COUPLES") & (merged['PREMIUM_CUSTOMER'] == 'Mainstream'))]

In [None]:
other_bybrand =  other.groupby(['BRAND_NAME'])[['PROD_QTY']].sum().reset_index()
other_bybrand.PROD_QTY = other_bybrand.PROD_QTY / other.PROD_QTY.sum()
other_bybrand =other_bybrand.rename(columns={"PROD_QTY": "Other_Segment"})

In [None]:
merged_segment_test = quantity_bybrand.merge(other_bybrand, on='BRAND_NAME', how='outer')

In [None]:
merged_segment_test['Affinitytobrand'] = merged_segment_test['Targeted_Segment']/merged_segment_test['Other_Segment']

In [None]:
merged_segment_test.sort_values('Affinitytobrand',ascending=False)

Mainstream young singles/couples are more than 20% more likely to choose Tyrrells, Twisties and Doritos than the other segment. Brands such as Woolworths, Burger and Sunbites are more than 50% less likely to be chosen.
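
The affinity calculation can be illustrated on a toy table. The segment labels and quantities below are invented; the point is only the ratio of quantity shares between the target segment and the comparison segment.

```python
import pandas as pd

# Toy purchase data; segment labels and quantities are illustrative only.
df = pd.DataFrame({
    "segment": ["target"] * 3 + ["other"] * 3,
    "BRAND_NAME": ["Tyrrells", "Kettle", "Burger"] * 2,
    "PROD_QTY": [30, 60, 10, 20, 50, 30],
})

# Share of each segment's total quantity that goes to each brand.
qty = df.pivot_table(index="BRAND_NAME", columns="segment",
                     values="PROD_QTY", aggfunc="sum")
share = qty / qty.sum()

# Affinity > 1: the target segment over-indexes on the brand; < 1: under-indexes.
affinity = (share["target"] / share["other"]).sort_values(ascending=False)
print(affinity)
```

An affinity of 1.5 reads as "50% more likely than the comparison segment", and a value of about 0.33 as roughly two thirds less likely.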

In [None]:
quantity_bydesc =  Analysis1.groupby(['PROD_PKG'])[['PROD_QTY']].sum().reset_index()
quantity_bydesc.PROD_QTY = quantity_bydesc.PROD_QTY / Analysis1.PROD_QTY.sum()
quantity_bydesc =quantity_bydesc.rename(columns={"PROD_QTY": "Targeted_Segment"})

In [None]:
other_bydesc =  other.groupby(['PROD_PKG'])[['PROD_QTY']].sum().reset_index()
other_bydesc.PROD_QTY = other_bydesc.PROD_QTY / other.PROD_QTY.sum()
other_bydesc =other_bydesc.rename(columns={"PROD_QTY": "Other_Segment"})

In [None]:
qty_segment_test = quantity_bydesc.merge(other_bydesc, on='PROD_PKG', how='outer')

In [None]:
qty_segment_test['Affinitytopackaging'] = qty_segment_test['Targeted_Segment']/qty_segment_test['Other_Segment']

In [None]:
qty_segment_test.sort_values('Affinitytopackaging',ascending=False)

Pack sizes of 270 g, 380 g and 330 g are more than 20% more likely to be purchased by Mainstream young singles/couples.



In [None]:
analyseddata = merged[(merged['PROD_PKG'].isin(["270"]))]

In [None]:
analyseddata.BRAND_NAME.unique()

Twisties 270 g seems to be the product most favoured by Mainstream young singles/couples.

Conclusion:

Sales have mainly been due to Budget older families, Mainstream young singles/couples and Mainstream retirees.

We found that the high spend on chips by mainstream young singles/couples and retirees is due to there being more of them than other buyers. Mainstream mid-age and young singles and couples are also more likely to pay more per packet of chips. This is indicative of impulse buying behaviour.

We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population.
The Category Manager may want to increase the category’s performance by off-locating some Tyrrells and smaller packs of chips in discretionary space near segments where young singles and couples shop more often, to increase visibility and impulse behaviour.

Quantium can help the Category Manager with recommendations of where these segments are and further help them with measuring the impact of the changed placement. 
