# **Quantium Data Analysis on Chips Products**

# Project Objectives:
1. Explore transaction data and produce insights on chips products
2. Analyse different customer segments and their contribution to chip sales
3. Investigate which brand and packet size customers prefer

# **Dataset Information**
The dataset was created by Quantium as part of a virtual internship hosted on InsideSherpa. It consists of customer transactions and customer details.

Contents:
1. [Dataset loading and Cleaning](#1)
2. [Adding new features](#2)
3. [Exploratory data analysis](#3)
4. [Independent t-test](#4)
5. [Affinity analysis using apriori and association rules](#5)
6. [Conclusion](#6)

<a id=1> </a>
# 1. Dataset loading and Cleaning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates
import matplotlib.ticker as mticker
import seaborn as sns
sns.set()

%matplotlib inline

In [None]:
transaction_df = pd.read_csv("/kaggle/input/quantium-data-analytics-virtual-experience-program/Transactions.csv")
behaviour_df = pd.read_csv("/kaggle/input/quantium-data-analytics-virtual-experience-program/PurchaseBehaviour.csv")

In [None]:
transaction_df.head()

In [None]:
behaviour_df.head()

In [None]:
transaction_df.isna().sum()

In [None]:
behaviour_df.isna().sum()

In the transaction dataframe, dates are stored as Excel serial numbers (integer day counts starting from 1900). We convert them into standard DateTime format.

In [None]:
def convert_to_datetime(num):
    dt = datetime.fromordinal(datetime(1900, 1, 1).toordinal() + num - 2)
    return dt

In [None]:
%%time
transaction_df['DATE'] = transaction_df['DATE'].apply(convert_to_datetime)
transaction_df.head()
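The row-wise `apply` above works, but the same conversion can be done in one vectorised call. The sketch below assumes the usual Excel convention, where day counts are measured from an origin of 1899-12-30, and checks it against the notebook's function on a few made-up serial numbers (these values are illustrative, not taken from the dataset).

```python
import pandas as pd
from datetime import datetime

def convert_to_datetime(num):
    # Same row-wise conversion as in the notebook, repeated here for comparison
    return datetime.fromordinal(datetime(1900, 1, 1).toordinal() + num - 2)

# Hypothetical Excel serial dates, for illustration only
serials = pd.Series([43282, 43283, 43646])

# Vectorised conversion: integer day counts measured from the 1899-12-30 origin
vectorised = pd.to_datetime(serials, unit="D", origin="1899-12-30")
row_wise = serials.apply(convert_to_datetime)

# Both approaches should agree element by element
print((vectorised == row_wise).all())
print(vectorised.tolist())
```

On a large dataframe the vectorised form is usually much faster than `apply`, since it avoids calling a Python function once per row.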

The transaction data also contains salsa products, which are not chips, so let's remove them.

In [None]:
transaction_df["PROD_NAME"].unique()[:20]

In [None]:
def remove_prod(df):
    unwanted = "salsa"
    if unwanted in df["PROD_NAME"].lower().split():
        return df.name

In [None]:
%%time
drop_index = list(transaction_df.apply(remove_prod,axis=1))

In [None]:
%%time
#Making a copy of all transactions, then dropping the rows flagged as salsa products
old_df = transaction_df.copy()
transaction_df = transaction_df.drop([i for i in drop_index if not pd.isna(i)])

In [None]:
print("Total Transaction: "+str(len(old_df)))
print("Total Salsa Products: "+str(len(old_df)-len(transaction_df)))
print("Transactions without Salsa: "+str(len(transaction_df)))
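The same filter can be sketched without a row-wise `apply`, using a boolean mask. The product names below are hypothetical stand-ins, and the regular expression is an assumption: it matches "salsa" as a whole word, which is close to, but not exactly the same as, splitting on whitespace (for example, it would also match "Salsa," followed by punctuation).

```python
import pandas as pd

# Hypothetical product names, for illustration only
sample = pd.DataFrame({"PROD_NAME": [
    "Old El Paso Salsa   Dip Tomato Mild 300g",
    "Kettle Original 175g",
    "Doritos Salsa       Medium 300g",
]})

# True where "salsa" appears as a whole word, ignoring case
is_salsa = sample["PROD_NAME"].str.contains(r"\bsalsa\b", case=False, regex=True)

# Keep only the rows that are not salsa products
chips_only = sample[~is_salsa].reset_index(drop=True)
print(chips_only)
```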

Checking for outliers. The summary below shows that the mean PROD_QTY is about 2, while the maximum PROD_QTY is 200.

In [None]:
transaction_df.describe()

In [None]:
transaction_df[transaction_df.PROD_QTY > 100]

Looking up this customer by Loyalty Card Number shows that it is a Premium customer.

In [None]:
behaviour_df[behaviour_df.LYLTY_CARD_NBR == 226000]

Removing the outlier for better analysis

In [None]:
transaction_df = transaction_df[transaction_df['PROD_QTY'] < 200].reset_index(drop=True)
transaction_df.describe()

<a id=2> </a>
# 2. Adding new features

Adding Packet size as a new column.

In [None]:
def packet_size(grp):
    string = grp["PROD_NAME"]
    num = []
    for i in string:
        if i.isdigit():
            num.append(i)
    number = "".join(num)
    return int(number)

In [None]:
%%time
transaction_df["PACKET_SIZE"] = transaction_df.apply(packet_size,axis=1)
transaction_df.head()

In [None]:
print("Largest Packet Size: "+str(max(transaction_df["PACKET_SIZE"]))+"g")
print("Smallest Packet Size: "+str(min(transaction_df["PACKET_SIZE"]))+"g")
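Joining every digit in the name works as long as each product name contains exactly one number. As a more defensive sketch, the size can be taken as the digits that sit directly before a `g` unit marker. The names below are illustrative examples in the style of the dataset, and the pattern is an assumption about how sizes are written.

```python
import pandas as pd

# Illustrative product names in the style of PROD_NAME
names = pd.Series([
    "Natural Chip        Compny SeaSalt175g",
    "Kettle 135g Swt Pot Sea Salt",
    "Smiths Crinkle Cut  Chips Chicken 170g",
])

# Capture the digits immediately before a "g" or "G" unit marker
sizes = names.str.extract(r"(\d+)\s*[gG]", expand=False).astype(int)
print(sizes.tolist())
```

With this pattern, any other digits in a name (a hypothetical "2 Pack 150g", say) would not be merged into the packet size.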

Adding the brand name as a new column. It is the first word in the "PROD_NAME" column.

In [None]:
def Product_Company(grp):
    return grp["PROD_NAME"].split()[0]

In [None]:
%%time
transaction_df["BRAND"] = transaction_df.apply(Product_Company,axis=1)
transaction_df.head()

Some brands appear under abbreviated or alternative spellings in PROD_NAME, so we map each variant to a single brand name.

In [None]:
d = {'red':'RRD','ww':'WOOLWORTHS','ncc':'NATURAL','snbts':'SUNBITES','infzns':'INFUZIONS','smith':'SMITHS','dorito':'DORITOS','grain':'GRNWVES'}
transaction_df['BRAND'] = transaction_df['BRAND'].str.lower().replace(d).str.upper()
transaction_df.head()

<a id=3> </a>
# 3. Exploratory Data Analysis

**By Date**

In [None]:
date_sales = pd.DataFrame(transaction_df.groupby("DATE").agg({'TOT_SALES':'sum'}))

The plot below shows unusually high sales in December, with a sharp drop in the middle of that peak.

In [None]:
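# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')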
fig = date_sales.plot(figsize=(18,8))
plt.ylabel("Total Sales",{'fontsize':15})
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_minor_locator(mdates.MonthLocator(bymonthday=15))
ax.xaxis.set_major_formatter(mticker.NullFormatter())
ax.xaxis.set_minor_formatter(mdates.DateFormatter("%b-%y"))

Examining December in more detail

In [None]:
# Build a complete December calendar; days with no recorded sales become 0
december_days = pd.date_range(start='2018-12-01', end='2018-12-31', name='Date')
december = date_sales['TOT_SALES'].reindex(december_days, fill_value=0).to_frame('sales')

In [None]:
fig = plt.figure(figsize=(18,6))
plt.bar(december.index,december.sales)
ax = plt.gca()
formatter = mdates.DateFormatter("%b-%d")
ax.xaxis.set_major_formatter(formatter)
locator = mdates.DayLocator()
ax.xaxis.set_major_locator(locator)
plt.ylabel("Total Sales",{'fontsize':15})
plt.xticks(rotation='vertical')
plt.show()

There are no sales on 25 December, which suggests the stores were closed for Christmas.

**By Brand**

The chart below shows that Kettle has the highest number of transactions of any brand.

In [None]:
ax = transaction_df.BRAND.value_counts().sort_values().plot(kind='barh',figsize=(18,8))
# Set the labels after plotting so they attach to the chart's own axes
ax.set_xlabel('Number of Transactions',fontsize=15)
ax.set_ylabel('Brands',fontsize=15)

**By Packet Size**

In [None]:
ax = transaction_df.PACKET_SIZE.value_counts().plot(kind='bar',figsize=(18,8))
# In a vertical bar chart the packet size is on the x-axis and the count on the y-axis
ax.set_xlabel('Packet Size (g)',fontsize=15)
ax.set_ylabel('Number of Transactions',fontsize=15)
plt.xticks(rotation='horizontal')
plt.show()

**Customer Data**

In [None]:
behaviour_df.head()

In [None]:
behaviour_df.nunique()

In [None]:
plt.figure(figsize=(7,7))
plt.title('Customers Valuation Distribution',{'fontsize': 15})
plt.pie(behaviour_df.PREMIUM_CUSTOMER.value_counts(),labels=behaviour_df.PREMIUM_CUSTOMER.value_counts().index)
plt.show()

In [None]:
plt.figure(figsize=(21,4))

plt.subplot(131)
plt.title('Lifestage Distribution of Budget Customers',{'fontsize': 15})
plt.ylabel("Number of Customers",{"fontsize":12})
behaviour_df.groupby("PREMIUM_CUSTOMER").LIFESTAGE.value_counts()['Budget'].plot(kind='bar',fontsize=12)

plt.subplot(132)
plt.ylabel("Number of Customers",{"fontsize":12})
plt.title('Lifestage Distribution of Mainstream Customers',{'fontsize': 15})
behaviour_df.groupby("PREMIUM_CUSTOMER").LIFESTAGE.value_counts()['Mainstream'].plot(kind='bar',fontsize=12)

plt.subplot(133)
plt.ylabel("Number of Customers",{"fontsize":12})
plt.title('Lifestage Distribution of Premium Customers',{'fontsize': 15})
behaviour_df.groupby("PREMIUM_CUSTOMER").LIFESTAGE.value_counts()['Premium'].plot(kind='bar',fontsize=12)

From the graphs above we can see that the largest single group of customers is Mainstream young singles/couples. Note that these charts count customers in the customer table, not transactions.

**Merge Data**


Now we merge the customer data with the transaction data for deeper analysis.

In [None]:
combined_data = transaction_df.join(behaviour_df.set_index('LYLTY_CARD_NBR'), on = 'LYLTY_CARD_NBR')
combined_data.head()
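After a join like this it is worth confirming that every transaction found a matching customer. The sketch below uses small made-up frames in place of `transaction_df` and `behaviour_df` (the values are hypothetical) and relies on the `indicator` option of `DataFrame.merge` to flag rows without a match.

```python
import pandas as pd

# Hypothetical toy frames standing in for transaction_df and behaviour_df
tx = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1002, 1003],
                   "TOT_SALES": [6.0, 2.7, 3.6]})
cust = pd.DataFrame({"LYLTY_CARD_NBR": [1000, 1002],
                     "LIFESTAGE": ["YOUNG SINGLES/COUPLES", "RETIREES"]})

# A left merge keeps every transaction; the _merge column flags unmatched rows
merged = tx.merge(cust, on="LYLTY_CARD_NBR", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print("Transactions without a customer record:", len(unmatched))
```

On the real data, a count of zero would mean the customer columns in the combined frame contain no missing values introduced by the join.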

Now let's group by lifestage and premium status to check which customer segments have the highest sales.

In [None]:
customer_groups = combined_data.groupby(['LIFESTAGE','PREMIUM_CUSTOMER']).agg({'TOT_SALES':'sum','PROD_QTY':'sum'}).reset_index().sort_values('TOT_SALES')
customer_groups['SEGMENT'] = customer_groups.LIFESTAGE + '_' + customer_groups.PREMIUM_CUSTOMER

In [None]:
x = list(customer_groups.SEGMENT)
y = list(customer_groups.TOT_SALES)
plt.figure(figsize=(15,8))
plt.xlabel("Total Sales",{'fontsize':15})
plt.barh(x,y)

In [None]:
customer_groups = customer_groups.sort_values("PROD_QTY")
x = list(customer_groups.SEGMENT)
y = list(customer_groups.PROD_QTY)
plt.figure(figsize=(15,8))
plt.xlabel("Number of Products Purchased",{'fontsize':15})
plt.barh(x,y)

**Total Sales per Customer**

In [None]:
# Count distinct loyalty cards so that the ratios below are genuinely per customer
sales_pc = combined_data.groupby(['LIFESTAGE','PREMIUM_CUSTOMER']).agg({'PROD_QTY':'sum','TOT_SALES':'sum','LYLTY_CARD_NBR':'nunique'}).reset_index().sort_values('TOT_SALES')
sales_pc['SEGMENT'] = sales_pc.LIFESTAGE + '_' + sales_pc.PREMIUM_CUSTOMER
sales_pc['SALES_PC'] = sales_pc.TOT_SALES / sales_pc.LYLTY_CARD_NBR
sales_pc['QTY_PC'] = sales_pc.PROD_QTY / sales_pc.LYLTY_CARD_NBR
sales_pc['AVG_PP'] = sales_pc.TOT_SALES / sales_pc.PROD_QTY

In [None]:
sales_pc = sales_pc.sort_values("SALES_PC")
x = list(sales_pc.SEGMENT)
y = list(sales_pc.SALES_PC)
plt.figure(figsize=(15,8))
plt.xlabel("Total Sales Per Customer",{'fontsize':15})
plt.barh(x,y)

**Total Products Purchased per Customer**

In [None]:
sales_pc = sales_pc.sort_values("QTY_PC")
x = list(sales_pc.SEGMENT)
y = list(sales_pc.QTY_PC)
plt.figure(figsize=(15,8))
plt.xlabel("Total Products purchased Per Customer",{'fontsize':15})
plt.barh(x,y)

**Average price per product by LIFESTAGE and PREMIUM_CUSTOMER**

In [None]:
sales_pc = sales_pc.sort_values("AVG_PP")
x = list(sales_pc.SEGMENT)
y = list(sales_pc.AVG_PP)
plt.figure(figsize=(15,8))
plt.xlabel("Average Price paid per Product",{'fontsize':15})
plt.barh(x,y)

<a id=4> </a>
# **4. Independent t-test**

Performing an independent t-test on the average price per packet: Mainstream customers versus Budget and Premium customers, within the young and midage singles/couples lifestages.

In [None]:
combined_data["AVG_PACKET"] = combined_data["TOT_SALES"] / combined_data["PROD_QTY"]

data1 = list(combined_data[(combined_data['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"]))  & (combined_data['PREMIUM_CUSTOMER'] == 'Mainstream')]["AVG_PACKET"])
data2 = list(combined_data[(combined_data['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES", "MIDAGE SINGLES/COUPLES"]))  & (combined_data['PREMIUM_CUSTOMER'] != 'Mainstream')]["AVG_PACKET"])

In [None]:
from scipy.stats import ttest_ind

stat, p = ttest_ind(data1, data2,equal_var=True)
print('t=%.3f, p=%.3f ' % (stat, p))

Since p < 0.05, we reject the null hypothesis of equal means. As t = 37.832 is positive, we can conclude that the unit price paid by Mainstream young and midage singles/couples is significantly higher than that paid by Budget or Premium young and midage singles/couples.
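The test above assumes equal variances in the two groups and is two-sided. As a hedged alternative, the sketch below runs Welch's t-test (no equal-variance assumption) with a one-sided alternative, which matches the question "is the first group's mean higher?" more directly. The data here is synthetic and is not the notebook's data; the `alternative` argument requires SciPy 1.6 or later.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic unit prices for two groups, NOT the notebook's data
group_a = rng.normal(loc=4.0, scale=1.0, size=1000)
group_b = rng.normal(loc=3.5, scale=1.0, size=1000)

# equal_var=False selects Welch's t-test;
# alternative="greater" tests whether mean(group_a) > mean(group_b)
stat, p = ttest_ind(group_a, group_b, equal_var=False, alternative="greater")
print('t=%.3f, p=%.3g' % (stat, p))
```

To apply the same idea to the notebook, `group_a` and `group_b` would be replaced by `data1` and `data2`.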

<a id = 5> </a>
# **5. Affinity analysis using apriori and association rules**

Association rule mining is a technique for identifying underlying relationships between items. Take the example of a supermarket, where customers can buy a variety of items: there is usually a pattern in what they buy.

Three metrics are central to the Apriori algorithm and association rules:

* Support
* Confidence
* Lift

**Support**
Support is the baseline popularity of an item. It is calculated as the number of transactions containing the item divided by the total number of transactions.

**Confidence**
Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought.

**Lift**
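Lift measures how much more likely 'B' is to be bought when 'A' is bought, compared with how often 'B' is bought overall. It is the confidence of 'A' -> 'B' divided by the support of 'B'.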
* A Lift of 1 means there is no association between products A and B. 
* Lift of greater than 1 means products A and B are more likely to be bought together. 
* Finally, Lift of less than 1 refers to the case where two products are unlikely to be bought together.

Suppose Lift('A' -> 'B') = 3.33

Here, lift tells us that 'B' is 3.33 times as likely to be bought when 'A' is bought as it is to be bought in general.
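To make the three definitions concrete, here is a small worked example in plain Python. The five baskets are made up for illustration, and the helper name `support` is only a convenience for this sketch.

```python
# Five hypothetical baskets, for illustration only
baskets = [{"A", "B"}, {"A", "B"}, {"A"}, {"B"}, {"C"}]
n = len(baskets)

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / n

support_a = support({"A"})             # 3 of 5 baskets contain A
support_b = support({"B"})             # 3 of 5 baskets contain B
support_ab = support({"A", "B"})       # 2 of 5 baskets contain both

# Confidence(A -> B): of the baskets with A, the share that also contain B
confidence_a_b = support_ab / support_a

# Lift(A -> B): confidence relative to B's baseline popularity
lift_a_b = confidence_a_b / support_b

print(f"support(A)={support_a:.2f}, support(B)={support_b:.2f}")
print(f"confidence(A->B)={confidence_a_b:.2f}, lift(A->B)={lift_a_b:.2f}")
```

In this toy data the lift is slightly above 1, so buying A makes buying B only a little more likely than it already is.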

We have found quite a few interesting insights that we can dive deeper into.

We might want to target customer segments that contribute the most to sales to retain them or further increase sales. Let’s look at Mainstream - young singles/couples. For instance, let’s find out if they tend to buy a particular brand of chips.

In [None]:
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Apriori and association rules for Brand names

In [None]:
basket_brand = (combined_data[(combined_data['LIFESTAGE']=='YOUNG SINGLES/COUPLES') & (combined_data['PREMIUM_CUSTOMER']=='Mainstream')]
        .groupby(['LYLTY_CARD_NBR','BRAND'])['PROD_QTY']
        .sum().unstack().reset_index().fillna(0)
        .set_index('LYLTY_CARD_NBR'))

def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_brand = basket_brand.applymap(encode_units)
basket_brand

In [None]:
frequent_itemsets = apriori(basket_brand, min_support=0.07, use_colnames=True)
frequent_itemsets.head()

In [None]:
rules = association_rules(frequent_itemsets, metric="lift")
rules.head()

Apriori and association rules for packet size

In [None]:
basket_packet_size= (combined_data.groupby(['LYLTY_CARD_NBR', 'PACKET_SIZE'])['PROD_QTY']
                     .sum().unstack().reset_index().fillna(0).set_index('LYLTY_CARD_NBR'))

def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

basket_packet_size = basket_packet_size.applymap(encode_units)
basket_packet_size

In [None]:
frequent_itemsets = apriori(basket_packet_size, min_support=0.07, use_colnames=True)
frequent_itemsets.head()

In [None]:
rules = association_rules(frequent_itemsets, metric="lift")
rules.head()

Since all the lift values are close to 1, purchasing the antecedents has little effect on purchasing the consequents. The evidence therefore does not support a strong relationship between different brands or between different packet sizes.

**Affinity to Brand**

Checking which brand is favoured in Mainstream Young Single/Couples

In [None]:
segment1 = (combined_data[(combined_data['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES"]))  
                          & (combined_data['PREMIUM_CUSTOMER'] == 'Mainstream')])
other = (combined_data[~((combined_data['LIFESTAGE'].isin(["YOUNG SINGLES/COUPLES"]))  
                       & (combined_data['PREMIUM_CUSTOMER'] == 'Mainstream'))])

In [None]:
quantity_segment1 =  segment1.groupby(['BRAND'])[['PROD_QTY']].sum().reset_index().set_index("BRAND")
quantity_other = other.groupby(['BRAND'])[['PROD_QTY']].sum().reset_index().set_index("BRAND")

quantity_segment1_by_brand = quantity_segment1.PROD_QTY / segment1.PROD_QTY.sum()
quantity_segment1_by_brand.name = "TARGETTED_SEGMENT"

quantity_other_by_brand = quantity_other.PROD_QTY / other.PROD_QTY.sum()
quantity_other_by_brand.name = "OTHER_SEGMENT"

# The column names come from the .name attributes set on each Series above
brand_proportions = pd.concat([quantity_segment1_by_brand,quantity_other_by_brand], axis=1)

brand_proportions["AFFINITY_TO_BRAND"] = brand_proportions.TARGETTED_SEGMENT / brand_proportions.OTHER_SEGMENT
brand_proportions = brand_proportions.sort_values("AFFINITY_TO_BRAND")

In [None]:
brand_proportions.tail()
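The affinity calculation above can be wrapped in a small helper so that the same logic serves both brand and packet size. This is a sketch on made-up data: the function name `affinity` and the toy frames are hypothetical, while the column name `PROD_QTY` follows the notebook.

```python
import pandas as pd

def affinity(target, rest, key):
    """Share of quantity per `key` in `target`, divided by the share in `rest`."""
    target_share = target.groupby(key)["PROD_QTY"].sum() / target["PROD_QTY"].sum()
    rest_share = rest.groupby(key)["PROD_QTY"].sum() / rest["PROD_QTY"].sum()
    return (target_share / rest_share).sort_values()

# Hypothetical toy data standing in for segment1 and other
target = pd.DataFrame({"BRAND": ["KETTLE", "TYRRELLS", "TYRRELLS"],
                       "PROD_QTY": [2, 1, 1]})
rest = pd.DataFrame({"BRAND": ["KETTLE", "KETTLE", "TYRRELLS"],
                     "PROD_QTY": [2, 4, 2]})

result = affinity(target, rest, "BRAND")
print(result)
```

A value above 1 means the target segment buys proportionally more of that brand than everyone else does, and a value below 1 means proportionally less. In the notebook, the equivalent calls would use `segment1` and `other` with either `"BRAND"` or `"PACKET_SIZE"` as the key.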

**Affinity to packet size**

Checking which size is favoured in Mainstream Young Single/Couples

In [None]:
quantity_segment2 =  segment1.groupby(['PACKET_SIZE'])[['PROD_QTY']].sum().reset_index().set_index("PACKET_SIZE")
quantity_other = other.groupby(['PACKET_SIZE'])[['PROD_QTY']].sum().reset_index().set_index("PACKET_SIZE")

quantity_segment2_by_pack = quantity_segment2.PROD_QTY / segment1.PROD_QTY.sum()
quantity_segment2_by_pack.name = "TARGETTED_SEGMENT"

quantity_other_by_pack = quantity_other.PROD_QTY / other.PROD_QTY.sum()
quantity_other_by_pack.name = "OTHER_SEGMENT"

pack_proportions = pd.concat([quantity_segment2_by_pack,quantity_other_by_pack], axis=1)

pack_proportions["AFFINITY_TO_BRAND"] = pack_proportions.TARGETTED_SEGMENT / pack_proportions.OTHER_SEGMENT
pack_proportions = pack_proportions.sort_values("AFFINITY_TO_BRAND")

In [None]:
pack_proportions.tail()

In [None]:
fig = plt.figure(figsize=(18,10))

#Plotting Affinity by Brand
plt.subplot(121)
my_range = range(1, len(brand_proportions.index) + 1)
color_code = np.where(brand_proportions["AFFINITY_TO_BRAND"]==brand_proportions["AFFINITY_TO_BRAND"].max(), '#ff7f0e', '#1f77b4')
plt.hlines(y=my_range, xmin=0, xmax=brand_proportions['AFFINITY_TO_BRAND'],color=color_code)
plt.scatter(brand_proportions['AFFINITY_TO_BRAND'], my_range, color=color_code)
plt.yticks(np.arange(1,len(brand_proportions)+1), brand_proportions.index)

#Plotting Affinity by Packet Size
plt.subplot(122)
my_range = range(1, len(pack_proportions.index) + 1)
color_code = np.where(pack_proportions["AFFINITY_TO_BRAND"]==pack_proportions["AFFINITY_TO_BRAND"].max(), '#ff7f0e', '#1f77b4')
plt.hlines(y=my_range, xmin=0, xmax=pack_proportions['AFFINITY_TO_BRAND'],color=color_code)
plt.scatter(pack_proportions['AFFINITY_TO_BRAND'], my_range, color=color_code)
plt.yticks(np.arange(1,len(pack_proportions)+1), pack_proportions.index)
plt.show()

From the plots above we can say that Mainstream young singles/couples show the strongest affinity for Tyrrells, Twisties and Doritos, and for packet sizes of 270g, 380g and 330g.

In [None]:
combined_data[combined_data["PACKET_SIZE"] == 270].BRAND.unique()

Twisties is the only brand offering 270g packs, so the preference for this size may simply reflect a higher likelihood of purchasing Twisties.

<a id = 6> </a>
# **6. Conclusion**

Let’s recap what we’ve found!

* Sales have mainly been due to Budget - older families, Mainstream - young singles/couples, and Mainstream - retirees shoppers. 
* We found that the high spend in chips for mainstream young singles/couples and retirees is due to there being more of them than other buyers.
* Mainstream, midage and young singles and couples are also more likely to pay more per packet of chips. This is indicative of impulse buying behaviour.
* We’ve also found that Mainstream young singles and couples are 23% more likely to purchase Tyrrells chips compared to the rest of the population. 
* Apriori and association rule analysis for brand and packet size did not reveal any strong relationship between different brands or between different packet sizes.
* Finally, affinity analysis of Mainstream young singles/couples suggests that they prefer the brand "Tyrrells" and the packet size "270g".