# Recommender System on E-commerce Data using Association Rules
## Pierre-Louis Danieau

The objective of the system developed below is to **anticipate the needs of an online store's customers** by building a recommendation system.

A recommendation system **personalizes** product recommendations according to the needs of each customer. 

Such a system serves the interests of both **the customer and the company**. 

* **For the customers:** It makes it easier to find products that interest them. Searching for the next product to buy takes less effort, which improves the user experience and customer satisfaction.

* **For the company:** A recommendation engine improves customer loyalty. Customers who are satisfied with the products they buy are more likely to keep buying from the company. The churn rate decreases, which reduces the cost of acquiring new customers. The company also increases its turnover by proposing new products to customers through cross-selling.

The recommendation engine developed here is based on a public dataset provided by [The UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php). This dataset contains the transaction history of an online store over 1 year.

The goal of the system is to propose 1 product to each customer, based on their current shopping basket, that maximizes the probability of purchase. The final table contains the customer identifiers, the product proposed to each customer and the associated purchase probability. 


# 1. Set up environment 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load package *fpgrowth_py* for association rules

In [None]:
! pip install fpgrowth_py

# 2. Import some libraries & data transformation

In [None]:
from fpgrowth_py import fpgrowth # FP-Growth frequent itemsets and association rules
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical plots built on matplotlib
%matplotlib inline
import time

data=pd.read_csv('/kaggle/input/ecommerce-data/data.csv') # import from a csv file
data['GroupPrice']=data['Quantity']*data['UnitPrice']
data=data.dropna()
print('The dimensions of the dataset are : ', data.shape)
print('---------')
data.head()

**Explanation of the variables:**

* **InvoiceNo**: Invoice number corresponding to the product purchase.
* **StockCode**: Identifier of the purchased product. Each product has its own identifier.
* **Description**: Description of the purchased product.
* **Quantity**: Quantity of the product purchased.
* **InvoiceDate**: Date of the invoice, from 01/12/2010 to 09/12/2011.
* **UnitPrice**: Price of one unit of the product.
* **CustomerID**: Identifier of the customer. Each customer has their own identifier.
* **Country**: Country where the customer placed the order.
* **GroupPrice**: Total price of the invoice line, i.e. Quantity x UnitPrice.




# 3. Data preprocessing

* #### Removal of products that correspond to gifts offered by the company to customers. We keep only the products that the customer has actually put in their shopping cart. 

* #### We group together all the products purchased on the same invoice. Each line corresponds to a transaction composed of the invoice number, the customer ID and all the products purchased.


In [None]:
stock_codes = data['StockCode'].unique()
stock_to_del = []
for el in stock_codes:
    # el[0] is a single character, so it can never equal '10'; codes starting with '1' already cover that case
    if el[0] not in ['1','2','3','4','5','6','7','8','9']: # codes that do not start with 1-9 are treated as gifts
        stock_to_del.append(el)

data = data[~data['StockCode'].isin(stock_to_del)] # delete these products

basket = data.groupby(['InvoiceNo','CustomerID']).agg({'StockCode': lambda s: list(set(s))}) # grouping product from the same invoice. 

print('Dimension of the new grouped dataset : ', basket.shape)
print('----------')
basket.head()

# 4. Association Rules modelling: FP-Growth algorithm

FP-Growth is a data mining algorithm used to derive **association rules**.

From a transaction history, it determines the most frequent itemsets and the association rules that hold between them. As input, it needs the set of transactions, i.e. the product baskets the customers have already purchased. 

Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items.
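
As a minimal sketch of this first step, assuming a small made-up list of transactions (the item names and the `min_sup_ratio` value are illustrative and not taken from the dataset), item frequencies can be counted with `collections.Counter`:

```python
from collections import Counter

# Hypothetical toy transactions (not from the e-commerce dataset)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "D"},
    {"A", "B", "C"},
]
min_sup_ratio = 0.5  # illustrative threshold


def frequent_items(transactions, min_sup_ratio):
    """Return {item: support} for the items that reach the minimum support ratio."""
    # Count in how many transactions each item appears
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items() if c / n >= min_sup_ratio}


item_support = frequent_items(transactions, min_sup_ratio)
print(sorted(item_support.items()))
```

Here item `D` appears in only 1 of the 5 transactions, so it falls below the threshold and is dropped before any tree is built.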

The second step of FP-growth uses a prefix tree (FP-tree) structure to encode transactions without explicitly generating candidate sets, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree, and the model returns a set of product association rules like the example below: 

            {Product A + Product B} --> {Product C} with 60% probability
            {Product B + Product C} --> {Product A + Product D} with 78% probability
            {Product C} --> {Product B + Product D} with 67% probability
            etc.
            
To establish this table, the model needs to be provided with 2 hyperparameters:
* **minSupRatio**: minimum support for an itemset to be identified as frequent. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
* **minConf**: minimum confidence for generating an association rule. Confidence indicates how often an association rule has been found to be true. For example, if itemset X appears in 4 transactions and X and Y co-occur in only 2 of them, the confidence of the rule X => Y is 2/4 = 0.5. This parameter does not affect the mining of frequent itemsets; it only specifies the minimum confidence for generating association rules from them.
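
The two definitions above can be checked on a toy example. The sketch below is only an illustration: the transactions and the helper names `count`, `support` and `confidence` are hypothetical and are not part of `fpgrowth_py`.

```python
# Hypothetical toy transactions: X appears 4 times, X and Y co-occur 2 times, Z appears 3 times
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Z"},
    {"X"},
    {"Z"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= set(t))


def support(itemset, transactions):
    """Share of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


print(support({"Z"}, transactions))            # 3/5 = 0.6
print(confidence({"X"}, {"Y"}, transactions))  # 2/4 = 0.5
```

With `minSupRatio=0.5` and `minConf=0.3`, for instance, the itemset `{Z}` would be kept as frequent and a rule with the confidence of X => Y would pass the confidence filter.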

Once the association rules have been calculated, all you have to do is apply them to the customers' product baskets. 

In [None]:
a=time.time()
freqItemSet, rules = fpgrowth(basket['StockCode'].values, minSupRatio=0.005, minConf=0.3)
b=time.time()
print('time to execute in seconds : ',b-a, ' s.')
print('Number of rules generated : ', len(rules))

association=pd.DataFrame(rules,columns =['basket','next_product','proba']) 
association=association.sort_values(by='proba',ascending=False)
print('Dimensions of the association table are : ', association.shape)
association.head(10)

In [None]:
def compute_next_best_product(basket_el):
    """
    Parameter : basket_el = list of consumer basket elements
    Return : next_pdt, proba = next product to recommend, buying probability. Or (0,0) if no product is found.

    Description : fallback used when the whole basket was not found as an antecedent
    in the association table of the FP-Growth model.
    We search the association table for a product to recommend from each
    individual product in the consumer's basket.

    """

    for k in basket_el: # for each element in the consumer basket
        matches=association[association['basket']=={k}] # rules whose antecedent is this single product
        if len(matches) !=0: # if we find a corresponding association in the FP-Growth table
            next_pdt=list(matches['next_product'].values[0])[0] # we take the consequent product
            if next_pdt not in basket_el : # we verify that the product is not already in the basket
                proba=matches['proba'].values[0] # find the associated probability
                return(next_pdt,proba)

    return(0,0) # return (0,0) if no product was found.

In [None]:
def find_next_product(basket):
    """
    Parameter : basket = consumer basket dataframe
    Return : list_next_pdt, list_proba = list of next elements to recommend and the buying probabilities associated.

    Description : Main function that uses the one above. For each basket in the dataset we look for a rule
    in the FP-Growth association table whose antecedent is the whole basket. If none is found, we call the
    compute_next_best_product function, which searches for individual product associations.
    If no individual associations are found, that function returns (0,0).

    """
    n=basket.shape[0]
    list_next_pdt=[]
    list_proba=[]
    for i in range(n): # for each basket
        el=set(basket['StockCode'].iloc[i]) # customer's basket (positional access, the index is a MultiIndex)
        matches=association[association['basket']==el] # rules whose antecedent is the whole basket
        if len(matches) !=0: # if we find an association corresponding to the whole basket
            next_pdt=list(matches['next_product'].values[0])[0] # we take the consequent product
            proba=matches['proba'].values[0] # probability associated in the table
        else: # if the whole basket was not found as an antecedent in the table
            next_pdt,proba= compute_next_best_product(basket['StockCode'].iloc[i]) # previous function
        list_next_pdt.append(next_pdt)
        list_proba.append(proba)

    return(list_next_pdt, list_proba)
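
The two functions above filter the whole association table for every basket. As a possible alternative, here is a sketch that indexes the rules once in a dictionary keyed by the antecedent. It assumes rules are given as `[antecedent, consequent, confidence]` triples, like the columns `basket`, `next_product` and `proba` of the association table. The names `build_rule_index` and `recommend` and the toy rules are hypothetical, and the consequent product is picked in alphabetical order, which may differ from the choice made by the functions above.

```python
def build_rule_index(rules):
    """Map each antecedent (as a frozenset) to its highest-confidence (consequent, confidence) pair."""
    index = {}
    for antecedent, consequent, conf in rules:
        key = frozenset(antecedent)
        if key not in index or conf > index[key][1]:
            index[key] = (set(consequent), conf)
    return index


def recommend(basket_items, index):
    """Return (product, confidence) for a basket, or (0, 0) if no rule applies."""
    items = set(basket_items)
    # First try the whole basket as an antecedent, then fall back to single products
    for key in [frozenset(items)] + [frozenset([i]) for i in sorted(items)]:
        if key in index:
            consequent, conf = index[key]
            candidates = sorted(consequent - items)  # skip products already in the basket
            if candidates:
                return candidates[0], conf
    return 0, 0


# Hypothetical toy rules in the same shape as the FP-Growth output
toy_rules = [
    [{"A", "B"}, {"C"}, 0.6],
    [{"C"}, {"B", "D"}, 0.67],
    [{"A"}, {"B"}, 0.4],
]
index = build_rule_index(toy_rules)
print(recommend({"A", "B"}, index))       # whole basket matches: ('C', 0.6)
print(recommend({"A", "D"}, index))       # falls back to the single product A: ('B', 0.4)
print(recommend({"B", "C", "D"}, index))  # nothing new to propose: (0, 0)
```

Each lookup is then a dictionary access instead of a scan of the table, which may matter when there are many baskets.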

# 5. Computation for each customer

In [None]:
a=time.time()
list_next_pdt, list_proba= find_next_product(basket) 
b=time.time()
print(b-a)
basket['Recommended Product']=list_next_pdt # Set of recommended products
basket['Probability']=list_proba # Set of associated probabilities
basket.head()

* #### Calculation of estimated prices from the recommendations made and display of the final table with the association (customer, product recommended)

In [None]:
basket=basket.rename(columns = {'StockCode': 'Customer basket'})
data_stock=data.drop_duplicates(subset ="StockCode", inplace = False)
prices=[]
description_list=[]
for i in range(basket.shape[0]):
    stockcode=basket['Recommended Product'].iloc[i] # positional access, the index is a MultiIndex
    probability= basket['Probability'].iloc[i]
    if stockcode != 0:
        unitprice=data_stock[data_stock['StockCode']==stockcode]['UnitPrice'].values[0]
        description=data_stock[data_stock['StockCode']==stockcode]['Description'].values[0]
        estim_price=unitprice*probability
        prices.append(estim_price)
        description_list.append(description)
        
    else :
        prices.append(0)
        description_list.append('Null')

    

basket['Price estimation']=prices 
basket['Product description']=description_list 
basket = basket.reindex(columns=['Customer basket','Recommended Product','Product description','Probability','Price estimation'])
basket.head()
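
The price estimation computed above is an expected value: the unit price of the recommended product multiplied by its purchase probability. The short sketch below reproduces this arithmetic on made-up values. The product codes, prices and the helper `expected_revenue` are hypothetical.

```python
# Hypothetical unit prices and recommendations (product code, probability)
unit_price = {"P1": 4.0, "P2": 10.0}
recommendations = [("P1", 0.5), ("P2", 0.25), (0, 0)]  # (0, 0) means no recommendation


def expected_revenue(code, proba, unit_price):
    """Unit price weighted by the purchase probability, 0 when there is no recommendation."""
    return unit_price.get(code, 0.0) * proba


estimates = [expected_revenue(code, proba, unit_price) for code, proba in recommendations]
total = sum(estimates)
print(estimates)  # [2.0, 2.5, 0.0]
print(total)      # 4.5
```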

# 6. Results

#### Anticipation of customer needs :

In [None]:
print('On average, the recommended product has an estimated purchase probability (rule confidence) of ', round(basket['Probability'].mean()*100, 1), '%.')

#### Turnover generated :

In [None]:
print('With only 1 product proposed per basket, the recommendation system can generate an estimated turnover of up to : ', round(basket['Price estimation'].sum()), ' pounds sterling.') 

# 7. Conclusion 

Among a product catalog of more than 3000 items, a simple model based on association rules proposes a next product with an average estimated purchase probability of about **35%**, and can thus generate significant additional revenue. 

The advantage of this model is that it offers very good accuracy while being both easy to implement and explainable. Unlike some other artificial intelligence models that can seem like "black boxes" because they are difficult to explain, the results of the FP-Growth model are understandable: the association table lists the rules specific to your business. For example, if your customers often buy product A and product B together, you will see it immediately in your association table! 
