# Objective

Analyse customer, product, aisle, order and prior (purchase history) data to build customer segments with the k-means algorithm and to mine association rules with the Apriori algorithm. Use case: devising upselling and cross-selling strategies.

PCA will be used to reduce the dimensionality of the dataset, and the reduced data will then be fed into k-means.
The value of 'k' will be chosen using the elbow method.

- Customer Segmentation (Part 1)
- Association Rules (Part 2)

In the second part, the Apriori algorithm and association rules are used to identify product combinations with strong support, confidence and lift. These metrics aid decision making by pointing to cross-selling and upselling opportunities around products bought together.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
#Standard data science libraries.
import pandas as pd
import numpy as np

from sklearn.cluster import KMeans #for kmeans algorithm

#For dimensionality reduction.
from sklearn.decomposition import PCA #pca from decomposition module.
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition #decomposition module

#Plotting params.
%matplotlib inline
import matplotlib.pyplot as plt
from pylab import rcParams
import seaborn as sb
rcParams['figure.figsize'] = 12, 4
sb.set_style('whitegrid')

np.random.seed(42) # set the seed to make examples repeatable

In [None]:
#Since the files are zipped, they need to be imported with the following approach. 

prior = "order_products__prior.csv"
order_train = "order_products__train.csv"
orders = "orders.csv"
products = "products.csv"
aisles = "aisles.csv"
departments = "departments.csv"

In [None]:
import zipfile # Unzips the files

#Prior Dataset
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+prior+".zip","r") as z:
    z.extractall(".")
prior = pd.read_csv("order_products__prior.csv")

#Order_Train Dataset.
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+order_train+".zip","r") as z:
    z.extractall(".")
order_train = pd.read_csv("order_products__train.csv")

#Orders Dataset.
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+orders+".zip","r") as z:
    z.extractall(".")
orders = pd.read_csv("orders.csv")

#Products
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+products+".zip","r") as z:
    z.extractall(".")
products = pd.read_csv("products.csv")

#Aisles
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+aisles+".zip","r") as z:
    z.extractall(".")
aisles = pd.read_csv("aisles.csv")

#Departments
with zipfile.ZipFile("/kaggle/input/instacart-market-basket-analysis/"+departments+".zip","r") as z:
    z.extractall(".")
departments = pd.read_csv("departments.csv")
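
The six near-identical extraction blocks above could be folded into one helper. This is a minimal sketch, assuming each archive is named `<csv name>.zip` and holds a member with that CSV name at its top level (which is what the `extractall`/`read_csv` pairs above imply). `load_zipped_csv` is a hypothetical helper, not part of pandas or of the dataset.

```python
import zipfile
from pathlib import Path

import pandas as pd


def load_zipped_csv(csv_name, base_dir):
    """Read `csv_name` from `<base_dir>/<csv_name>.zip` without extracting it to disk.

    Hypothetical helper: assumes the archive member has the same name as the CSV.
    """
    zip_path = Path(base_dir) / f"{csv_name}.zip"
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(csv_name) as handle:
            return pd.read_csv(handle)


# Possible usage in this notebook (path taken from the cell above):
# base = "/kaggle/input/instacart-market-basket-analysis"
# aisles = load_zipped_csv("aisles.csv", base)
```

Reading straight from the archive avoids leaving extracted copies in the working directory, at the cost of re-reading the zip each time the cell is run.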

# PART 1: Customer Segmentation

The first part revolves around inspecting the data and segmenting customers into clusters using K-Means.

In [None]:
# Inspect all the dataframes, join them and make a combined df to form clusters. 

In [None]:
#Put them in a list to print shape.
combined_df_list = [products,orders, departments, aisles, prior, order_train]

In [None]:
#Check the size of the datasets.
for i in combined_df_list:
    print (i.shape)
#Two of the df's are very large, so they are subset later to run on a local machine with limited compute power.
del combined_df_list

In [None]:
#Products Dataframe
products.head(2)

In [None]:
#Departments Dataframe
departments.head(2)

In [None]:
#Aisles Dataframe - Products are kept in aisles.
aisles.head(2)

In [None]:
#Orders Dataframe
orders.head(2)

In [None]:
#Orders Train Dataframe
order_train.head(2)

In [None]:
#Products in Orders (Prior) - These files specify which products were purchased in each order. Contains Previous Orders.
prior.head(2) #notice the reordered feature.

In [None]:
#Since the dataframe is too big for in-memory computation, keep only the first 500k rows of prior.
prior = prior[:500000]

#### Once the df's have been inspected, the next step is to combine them on primary and foreign keys. 

Merge 1 - Combining the orders to prior df. This will give the products that were ordered in each order. 

Merge 2 - Combining the department and aisle df's to product df. 
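
The two merges can be illustrated on a few hand-made rows. This is a sketch on hypothetical toy frames (the `*_demo` names and their values are invented for illustration); the column names follow the Instacart tables. The `validate` argument is optional and only asserts that the right-hand key is unique, which is what a primary-key/foreign-key join expects.

```python
import pandas as pd

# Hypothetical toy tables mirroring the key structure of the real ones.
orders_demo = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 11]})
prior_demo = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [100, 101, 100]})
products_demo = pd.DataFrame(
    {"product_id": [100, 101], "product_name": ["Banana", "Bread"], "aisle_id": [1, 2]}
)
aisles_demo = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["fresh fruits", "bread"]})

# Merge 1: attach the order (and therefore the user) to every purchased product.
step1 = prior_demo.merge(orders_demo, on="order_id", validate="many_to_one")

# Merge 2: attach the aisle name to every product.
step2 = products_demo.merge(aisles_demo, on="aisle_id", validate="many_to_one")

# Final merge: one row per purchased product, with user and aisle attached.
combined_demo = step1.merge(step2, on="product_id", validate="many_to_one")
print(combined_demo)
```

The row count stays equal to the number of purchased products (three here), since each join only adds columns from a lookup table.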

In [None]:
#Merge 1 - Prior and Orders DF (Joining Orders to prior df)
#Combining the Prior and Orders dataframe - shows which user ordered what products and in which order.
df1 = pd.merge(prior, orders, on= 'order_id')
df1.head(2)

In [None]:
#Merge 2
#Combining the department and aisle df's to product df. 
prod_aisles = pd.merge(products, aisles, on = 'aisle_id')
df2 = pd.merge(prod_aisles, departments, on = 'department_id')
df2.head(2)

In [None]:
#Combining df1 and df2
combined_df = pd.merge(df1, df2, on = 'product_id').reset_index(drop=True)
combined_df.head(2)

# Data Exploration - Mini Version

A lot more will be covered on this in the subsequent commits.

In [None]:
#Check Nulls
sb.heatmap(combined_df.isnull(), cbar=True)

In [None]:
#These are null values in the feature 'days_since_prior_order'
combined_df[combined_df['days_since_prior_order'].isnull()].head(2)

#To be dealt with later, as this does not influence the current scope of work.

In [None]:
#Most ordering customer. Favourite Customer?
pd.DataFrame(combined_df.groupby('user_id')['product_id'].count()).sort_values('product_id', ascending=False).head(2)

#User_id = 142131

In [None]:
#Most ordered items.
pd.DataFrame(combined_df['product_name'].value_counts()).head(5)

In [None]:
#Most sold items as per aisle.
pd.DataFrame(combined_df['aisle'].value_counts()).head(5)

# Data Modeling

## Preparing Data 

In [None]:
combined_df.shape

In [None]:
#Cross-tabulating user_id and aisle. This shows how many items each user purchased from each aisle.
user_by_aisle_df = pd.crosstab(combined_df['user_id'], combined_df['aisle'])
user_by_aisle_df.head(2)
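
To make the shape of this table concrete, here is a small sketch on hypothetical purchase rows (the `purchases_demo` values are invented). `pd.crosstab` counts how often each (user, aisle) pair occurs, so each cell is the number of items that user bought from that aisle.

```python
import pandas as pd

# Hypothetical long-format purchases: one row per purchased item.
purchases_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "aisle": ["fresh fruits", "fresh fruits", "bread", "bread", "yogurt"],
    }
)

# Rows are users, columns are aisles, cells are purchase counts.
user_by_aisle_demo = pd.crosstab(purchases_demo["user_id"], purchases_demo["aisle"])
print(user_by_aisle_demo)
```

Aisles a user never bought from show up as zeros, which is why the real table is mostly zeros across its 134 columns.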

In [None]:
#The final dataframe has about 134 features.
user_by_aisle_df.shape

In [None]:
#Standardization is not applied in this case.
user_by_aisle_df.describe() #all features are purchase counts on the same scale (units bought), so they are left unstandardized.

## Dimensionality Reduction using Elbow Method and PCA 

With 134 features, the data needs to be projected onto fewer dimensions that keep only the most important information.

The elbow method will be used to choose the number of clusters 'k', and PCA to reduce the features before clustering.

PCA is one of the most common applications of SVD (Singular Value Decomposition). SVD decomposes a matrix into a product of simpler matrices, which makes it possible to discard redundant information and noise. In the case above, the idea is to reduce the 134 features to a handful of components that capture the essence of the data.
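
The link between PCA and SVD can be checked numerically. The sketch below uses NumPy only, on hypothetical synthetic count data (`X_demo`), rather than the notebook's table: the squared singular values of the centered matrix, divided by `n - 1`, equal the eigenvalues of the covariance matrix, and these are the per-component variances.

```python
import numpy as np

# Hypothetical synthetic count data: 200 "users" x 4 "aisles".
rng = np.random.default_rng(0)
X_demo = rng.poisson(lam=[5.0, 3.0, 1.0, 0.5], size=(200, 4)).astype(float)

# PCA works on mean-centered data.
X_centered = X_demo - X_demo.mean(axis=0)

# Thin SVD: X_centered = U @ diag(S) @ Vt; the rows of Vt are the principal axes.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Variance captured by each component, and its share of the total.
explained_variance = S**2 / (len(X_demo) - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Coordinates of every row on the first two principal components.
scores = X_centered @ Vt[:2].T

print(explained_variance_ratio.round(3))
```

Keeping only the first few columns of `scores` is the dimensionality reduction step; the discarded components are the ones with the smallest variance.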

#### But how to choose the number of principal components for PCA?

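The elbow method applied to K-Means: the value at the bend of the inertia curve is used as the starting point for the number of components, and is then checked against the variance those components explain.

Important Note: The data does not need to be standardized, since every feature is a count of items the user bought from an aisle.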

### Using Elbow Method 

The bend of the elbow is where the ideal value of k lies.

In [None]:
#Taking array of 'user_by_aisle_df'. To use for elbow method.
X = user_by_aisle_df.values

In [None]:
user_by_aisle_df.head()

In [None]:
#Implementing the Elbow method to identify the ideal value of 'k'. 

ks = range(1,10) #trial and error: try k = 1 to 9.
inertias = []
for k in ks:
    model = KMeans(n_clusters=k)    # Create a KMeans instance with k clusters: model
    model.fit(X)                    # Fit model to samples
    inertias.append(model.inertia_) # Append the inertia to the list of inertias
    
plt.plot(ks, inertias, '-o', color='black') #Plotting. The plot will give the 'elbow'.
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()

### Using PCA 

The data is reduced to just 6 components, which explain about 50% of the variation in the data.

In [None]:
#Seeing the above plot, the inertia curve bends at around k = 5 to 6; adding more clusters beyond that
# reduces the inertia only slightly.

#Decomposing the features into 6 components using PCA (n_components = 6, taken from the bend in the plot above).
pca = decomposition.PCA(n_components=6)
pca_user_order = pca.fit_transform(X)

#You can use trial and error here: change the number of components and see how much variation in the data
#is explained by the chosen n_components.

In [None]:
#Checking the % variation explained by the 6 pca components.
pca.explained_variance_ratio_.sum()
#More than half (50%) of the variability in the data can be explained by just 6 components.

In [None]:
# Plot the explained variances to verify the variation.
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('explained variance ratio')

#A majority of the variance can be explained by just five to six components. Anything beyond that does not capture much of the variation in the dataset.
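
Instead of changing `n_components` by trial and error, the cumulative explained variance can be read off directly. This is a sketch on hypothetical synthetic data (`X_demo`), with a 50% target chosen only to mirror the figure quoted above; it is one possible way to pick the number of components, not the method used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical synthetic count data: 500 "users" x 20 "aisles".
rng = np.random.default_rng(42)
X_demo = rng.poisson(lam=np.linspace(0.2, 5.0, 20), size=(500, 20)).astype(float)

# Fit PCA with all components, then accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.50
n_components = int(np.argmax(cumulative >= target) + 1)

print(n_components, cumulative[n_components - 1].round(3))
```

On the real data the same three lines could be run on `X` to see how many components are needed for any chosen share of variance.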

### Build The Model - K Means

Once the dimensionality has been reduced, the model can be built with the chosen parameters.

In [None]:
#Chosen components.
PCA_components = pd.DataFrame(pca_user_order)
PCA_components.head(5)

In [None]:
#Build the model (kmeans using 5 clusters)
kmeans = KMeans(n_clusters=5)
X_clustered = kmeans.fit_predict(pca_user_order) #fit_predict on chosen components only.

In [None]:
#Visualize it.

label_color_mapping = {0:'r', 1: 'g', 2: 'b',3:'c' , 4:'m'}
label_color = [label_color_mapping[l] for l in X_clustered]

#Scatterplot showing the cluster to which each user_id belongs.
plt.figure(figsize = (15,8))
plt.scatter(pca_user_order[:,0],pca_user_order[:,2], c= label_color, alpha=0.3) 
plt.xlabel('PCA component 0') #call the function; assigning a string to plt.xlabel would overwrite it.
plt.ylabel('PCA component 2')
plt.show()

In [None]:
#This contains all the clusters which are to be mapped to each user_id in the user_by_aisle_df.
X_clustered.shape

In [None]:
#Mapping clusters to users.
user_by_aisle_df['cluster']=X_clustered

In [None]:
#Checking cluster concentration. 
user_by_aisle_df['cluster'].value_counts().sort_values(ascending = False)

In [None]:
#Check out cluster mapping.
user_by_aisle_df.head()
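
Once every user carries a cluster label, a natural next step is to look at what each segment buys. The sketch below uses a hypothetical toy table (`toy` and its values are invented) with the same layout as `user_by_aisle_df`: aisle counts plus a `cluster` column.

```python
import pandas as pd

# Hypothetical toy version of user_by_aisle_df: aisle counts plus a cluster label.
toy = pd.DataFrame(
    {
        "fresh fruits": [9, 8, 0, 1],
        "baby food": [0, 1, 7, 6],
        "cluster": [0, 0, 1, 1],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Average number of purchases per aisle within each cluster.
cluster_profile = toy.groupby("cluster").mean()

# The aisle each cluster buys from most, on average.
top_aisle = cluster_profile.idxmax(axis=1)

print(cluster_profile)
print(top_aisle)
```

Segments that differ mainly in their top aisles are the ones that lend themselves to targeted cross-sell offers.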

# PART 2: Association Rules

In [None]:
# Apriori Algorithm - Association Rules

#Some theory - just a little.

#Formatting the data.
#Applying Apriori to get support, to see which items go well together.
#Applying association rules to get the confidence and lift scores.
#How to come up with up-selling and cross-selling strategies. The END.

Assumptions/Caveats

Transaction data has to be in a sparse, one-hot encoded format (one row per transaction, one column per item) for the Apriori algorithm.

* Fast
* Works well with less data
* Few (if any) feature engineering requirement 
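
The format requirement can be illustrated with plain pandas. This is a sketch on hypothetical transactions (the item lists are invented): each transaction becomes one row, each item one boolean column, and the support of an item is simply the mean of its column.

```python
import pandas as pd

# Hypothetical transactions, each a list of purchased items.
transactions = [
    ["bread", "eggs"],
    ["bread"],
    ["eggs", "milk"],
    ["bread", "eggs", "milk"],
]

# Long format: one row per (transaction, item) pair.
rows = [(tid, item) for tid, items in enumerate(transactions) for item in items]
long_df = pd.DataFrame(rows, columns=["transaction_id", "item"])

# One-hot basket: True where the transaction contains the item.
basket_demo = pd.crosstab(long_df["transaction_id"], long_df["item"]).astype(bool)

# Support of each single item = share of transactions containing it.
support_demo = basket_demo.mean()

print(basket_demo)
print(support_demo)
```

Most cells are `False` in a real basket table, which is why the format is described as sparse.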

Association rule mining is a process that uses pattern recognition to identify and quantify relationships between different yet related items. Once two products (say eggs and bread) are known to be related, this suggests actions such as:

* Action 1: Place eggs and bread together so the customer does not need to walk to get both items.
* Action 2: Advertise eggs to bread buyers so they buy both together.


Example: out of 5000 total transactions, 500 contain bread, 350 contain eggs, and 150 contain both eggs and bread.

Measures of Association

*  Support: Relative frequency of an item within a transaction dataset. Support for bread is 500/5000 = 0.1.

*  Confidence: How likely eggs (item 2) are to be bought given that bread (item 1) was purchased. Ex: (150/5000)/(500/5000) = 0.3. There is a 30% chance that eggs will be bought when bread is bought.
 
*  Lift: A value that shows the strength of the relationship between two items. It is confidence(A->C)/support(C).
 
    * If lift > 1: A is positively associated with C (C is more likely to be bought when item A was bought).
    * If lift < 1: If A was purchased, C is less likely to be purchased too.
    * If lift = 1: No association between items A and C.

Lift = 0.3/(350/5000) = 4.29
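
The arithmetic of the bread and eggs example can be reproduced in a few lines. `association_metrics` is a hypothetical helper written for this illustration; it only restates the definitions above and is not part of mlxtend.

```python
def association_metrics(n_total, n_a, n_c, n_both):
    """Support, confidence and lift for the rule A -> C, from raw transaction counts."""
    support_a = n_a / n_total        # share of transactions containing A
    support_c = n_c / n_total        # share of transactions containing C
    support_both = n_both / n_total  # share containing both A and C
    confidence = support_both / support_a  # P(C | A)
    lift = confidence / support_c          # how much A raises the chance of C
    return {
        "support_a": support_a,
        "support_c": support_c,
        "confidence": confidence,
        "lift": lift,
    }


# Bread (A) -> eggs (C): 5000 transactions, 500 with bread, 350 with eggs, 150 with both.
metrics = association_metrics(n_total=5000, n_a=500, n_c=350, n_both=150)
print({name: round(value, 2) for name, value in metrics.items()})
```

A lift of about 4.29 means that eggs are bought roughly four times as often in transactions with bread as in transactions overall.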

In [None]:
#Importing Libraries
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
#Checking with only a few samples. Concept is replicable.
np.random.seed(942) # set the seed to make examples repeatable
df2 = combined_df.sample(n=1000)[['user_id','product_name']]
basket = pd.crosstab(df2['user_id'],df2['product_name']).astype('bool').astype('int')
del df2

In [None]:
#Checking and removing index.
basket=basket.reset_index(drop=True)
basket.index

In [None]:
#Lets see if the format is correct.
basket.head(2)

In [None]:
#Calling apriori algorithm on dummified data - basket.
frequent_itemsets=apriori(basket, min_support=0.00002, use_colnames=True).sort_values('support', ascending=False) 

#These are the 20 most popular item sets purchased from the store.
frequent_itemsets.head(20)

In [None]:
#Let's check the length of each item set.
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets.head()

In [None]:
#Filtering for item sets of length 3 or more (i.e. three or more items purchased together).
frequent_itemsets[frequent_itemsets['length'] >= 3]

ASSOCIATION RULES

* First Part - Confidence
* Second Part - Lift
* Third Part - Confidence + Lift

In [None]:
#FIRST PART - CONFIDENCE

#For association rules, metric can be either confidence or lift. Second argument is minimum threshold (0.5).
#Trying confidence first.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
rules.head()

#The confidence column starts at the minimum threshold of 0.5. Confidence answers: how likely is item C to be purchased if A was purchased?
#Hence if 'Kidz All Natural Baked Chicken Nuggets' was purchased, it is very likely that 'Quart Sized Easy Open Freezer Bags' will be purchased by the same user (the basket is built per user_id).
#Note that confidence is directional: confidence(A->C) is generally not equal to confidence(C->A).

In [None]:
#SECOND PART - LIFT

#Changing metric to lift. Minimum threshold is 1.
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.head()

#Lift tells how much more likely the items are to be bought together than would be expected if they were bought independently.
#Row 0: If Kidz All Natural Baked Chicken Nuggets is purchased, then Quart Sized Easy Open Freezer Bags will be purchased too.
#Row 1: If Quart Sized Easy Open Freezer Bags is purchased, then Kidz All Natural Baked Chicken Nuggets will be purchased. There is SLIGHTLY more confidence in row 1 (compared to row 0).


In [None]:
#THIRD PART - CONFIDENCE AND LIFT

#Select lift >= 5 and confidence >= 0.5
rules[(rules['lift'] >= 5) & (rules['confidence']>= 0.5)] 

#These items will mostly be bought together, so cross-sell/upsell strategies can be based on them.

In [None]:
#Next steps - Some tuning to improve performance. 