# Recommender System | Revenue Potential 

This notebook contains all the code to produce product recommendations, estimate the total revenue over 9 months and then annualise it.

### Loading Necessary Libraries and Defining Functions

Getting the notebook ready for this part of the project.

In [1]:
##Libraries
#Dataframe & Arrays
import pandas as pd
import numpy as np

#SciPy for pairwise distance analysis
from scipy.spatial.distance import pdist, squareform

#partial pre-fills a function argument before the function is passed to DataFrame.apply
from functools import partial

In [2]:
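def customertypeinrecsys(row):
    """
    Input: CustomerID
    Output: Flag to identify if B2B customer or not (1 or 0)
    
    The purpose of this function is to vlookup whether the CustomerID belongs to a business or a regular customer.
    It relies on a reference dataframe (df_custtype) with CustomerID and B2B columns being defined beforehand.
    """
    try:
        return int(df_custtype.loc[df_custtype['CustomerID'] == row, "B2B"].iloc[0])
    except Exception:
        #Customer not found (or df_custtype not defined): treat as a regular customer
        return 0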
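
As a rough sketch of the same lookup without a row-by-row function, a CustomerID-to-flag mapping can be built once and applied with `Series.map`. The `toy_custtype` table and its values are hypothetical stand-ins for `df_custtype`; unmatched customers default to 0, mirroring the function above.

```python
import pandas as pd

# Hypothetical stand-in for df_custtype (values invented for illustration)
toy_custtype = pd.DataFrame({"CustomerID": [12347, 12348], "B2B": [1, 0]})

# Build the CustomerID -> B2B mapping once
b2b_by_customer = toy_custtype.set_index("CustomerID")["B2B"]

customers = pd.Series([12347, 12348, 99999])
# Customers missing from the reference table get NaN, which is replaced with 0
flags = customers.map(b2b_by_customer).fillna(0).astype(int)
print(flags.tolist())
```

This assumes each CustomerID appears only once in the reference table; duplicates would need to be resolved first.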

In [3]:
#Working version
def pricelookup(row, col):
    """
    Input: A dataframe row and the name of the column (col) that holds a StockCode
    Output: The corresponding value of that item (AWP * MeanQty)
    
    The purpose of this function is to vlookup the value of an item from the reference dataframes:
    df_awp (matched on StockCode and B2B flag) first, then df_awp_noB2B (matched on StockCode only)
    """
    #pd.notna treats both None and NaN as "no recommendation"
    if pd.notna(row[col]):
        try:
            match = df_awp.loc[(df_awp['StockCode'] == row[col]) & (df_awp['B2B'] == row['B2B'])]
            awp = float(match["AWP"].iloc[0])
            mean = float(match["MeanQty"].iloc[0])
            return awp*mean
        except Exception:
            #No match for this customer type, so fall back to the overall figures
            match2 = df_awp_noB2B.loc[df_awp_noB2B['StockCode'] == row[col]]
            awp2 = float(match2["AWP"].iloc[0])
            mean2 = float(match2["MeanQty"].iloc[0])
            return float(awp2*mean2)
    else:
        return 0
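
The two-stage lookup can be illustrated on toy data. This is only a sketch: `toy_awp`, `toy_awp_noB2B`, `value_lookup` and all the numbers are hypothetical, and returning 0.0 for an item found in neither table is an assumption (the function above would raise in that case).

```python
import pandas as pd

# Hypothetical reference tables with the same key columns as df_awp / df_awp_noB2B
toy_awp = pd.DataFrame({
    "B2B": [0, 1],
    "StockCode": ["A", "A"],
    "AWP": [2.0, 1.5],
    "MeanQty": [3, 10],
})
toy_awp_noB2B = pd.DataFrame({
    "StockCode": ["A", "B"],
    "AWP": [1.8, 4.0],
    "MeanQty": [5, 2],
})

def value_lookup(stock_code, b2b):
    """Return AWP * MeanQty, preferring the row for the customer's type."""
    if stock_code is None:
        return 0.0
    specific = toy_awp[(toy_awp["StockCode"] == stock_code) & (toy_awp["B2B"] == b2b)]
    if not specific.empty:
        first = specific.iloc[0]
        return float(first["AWP"] * first["MeanQty"])
    overall = toy_awp_noB2B[toy_awp_noB2B["StockCode"] == stock_code]
    if not overall.empty:
        first = overall.iloc[0]
        return float(first["AWP"] * first["MeanQty"])
    return 0.0  # assumption: unknown items contribute no revenue

print(value_lookup("A", 1))  # customer-type-specific row
print(value_lookup("B", 1))  # falls back to the overall table
```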

### Loading Necessary Dataframes from CSV Files

The files were created in a different notebook.

In [4]:
#Original source file for the main dataframe (replaced by the cell below)
#df = pd.read_csv('../data/dataframe_recom_system.csv')

In [5]:
#Load the main dataframe required for this project (replaces the cell above, which used a different source file)
df = pd.read_csv('../data/dataframe_recom_system_NR.csv')

In [6]:
#Load the dataframe that contains the Average Weighted Prices and Quantity Purchased by Customer Type (B2B or B2C)
df_awp = pd.read_csv('../data/df_awp.csv')

In [7]:
#Load the dataframe that contains the overall Average Weighted Prices and Quantity Purchased (no B2B/B2C split)
#Used as a fallback when a StockCode cannot be found in the file above
df_awp_noB2B = pd.read_csv('../data/df_awp_noB2B.csv')

In [8]:
df_awp.dtypes

B2B          float64
StockCode     object
Rev          float64
Qty            int64
AWP          float64
ModeQty        int64
MeanQty        int64
dtype: object

In [9]:
#Ensure the B2B Flag is an Integer (1 or 0)
df_awp['B2B'] = df_awp['B2B'].astype(int)

In [10]:
#Ensure CustomerID values are integers and confirm there are no NaNs
df['CustomerID'] = df['CustomerID'].astype(int)
df['CustomerID'].isna().sum()

0

In [11]:
df.shape

(306237, 18)

### Recommender System

First, determine the total quantity of each item (StockCode) purchased by each customer (CustomerID).

In [12]:
df_cust_prod_grouped = pd.DataFrame(df.groupby(['CustomerID', 'StockCode'])['Quantity'].agg('sum'))
df_cust_prod_grouped.head()

CustomerID,StockCode,Quantity
12347,16008,24
12347,17021,36
12347,20665,6
12347,20719,40
12347,20780,12


Next, create a pivot table of StockCode against CustomerID showing the quantity of each item purchased.

In [13]:
#Applying pivot table to df created above (df_cust_prod_grouped).
df_cust_matrix = df_cust_prod_grouped.pivot_table('Quantity', 'StockCode', 'CustomerID', aggfunc='sum', fill_value = 0)
df_cust_matrix.head()

StockCode \ CustomerID,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
10002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10123C,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10124A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
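
The group-then-pivot step can be seen on a tiny, made-up purchase log. The `toy` frame and its values are hypothetical; `reset_index()` is used before pivoting in this sketch so the pivot keys are ordinary columns.

```python
import pandas as pd

# Hypothetical purchase log: customer 1 buys product A twice
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "StockCode": ["A", "A", "B", "B", "C"],
    "Quantity": [2, 3, 1, 4, 5],
})

# Total quantity per (customer, product) pair
toy_grouped = toy.groupby(["CustomerID", "StockCode"])[["Quantity"]].sum()

# Products as rows, customers as columns, 0 where nothing was purchased
toy_matrix = toy_grouped.reset_index().pivot_table(
    index="StockCode", columns="CustomerID", values="Quantity",
    aggfunc="sum", fill_value=0)
print(toy_matrix)
```

Each column of the result is one customer's purchase vector, which is what the distance calculation later compares.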


Now create a matrix of pairwise distances between customers, which will be used to measure how similar they are.

In [14]:
# The matrix needs to be transposed, otherwise the distances are computed between products, not customers.
# The pdist function returns a condensed 1-dimensional array of pairwise distances.
# Applying squareform then converts that condensed array into a square, symmetric distance matrix.
# Finally, convert it into a DataFrame indexed by CustomerID.
# The "euclidean" metric will be used for this example.

df_distance_matrix = pd.DataFrame(squareform(pdist(df_cust_matrix.T, metric='euclidean')),
                                  index=df_cust_matrix.columns, columns=df_cust_matrix.columns)
df_distance_matrix.head()

CustomerID,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
12347,0.0,635.655567,401.766101,394.970885,396.294083,392.57356,401.880579,347.997126,447.18788,490.658741,...,474.527133,400.492197,378.457395,392.593938,392.802749,392.149206,393.146283,394.717621,405.086411,451.18289
12348,635.655567,0.0,569.817515,564.191457,566.325878,562.003559,569.09226,540.118506,574.054875,637.45745,...,617.243874,567.563212,559.489053,562.017793,562.419772,562.105862,562.40377,563.897154,554.340148,595.815408
12349,401.766101,569.817515,0.0,106.113147,110.17713,95.31002,134.773885,150.943698,300.44467,306.065352,...,274.779912,123.951603,98.59006,95.39392,97.365292,95.911417,97.642204,105.280578,158.874164,283.591255
12350,394.970885,564.191457,106.113147,0.0,87.068938,52.0,108.967885,128.093716,304.018092,301.502902,...,271.834508,94.783965,57.792733,52.153619,56.320511,53.094256,56.160484,69.541355,142.313035,274.218891
12352,396.294083,566.325878,110.17713,87.068938,0.0,71.533209,116.425942,137.116739,305.018032,305.347343,...,267.677044,106.756733,75.848533,71.644958,74.732858,72.332565,74.612331,85.023526,150.339616,277.548194
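
To make the roles of `pdist` and `squareform` concrete, here is a small sketch on three made-up "customers" with two "products" each. The array `X` is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are observations (customers), columns are features (products)
X = np.array([[0, 0],
              [3, 4],
              [6, 8]])

# Condensed form: one entry per pair, in the order (0,1), (0,2), (1,2)
condensed = pdist(X, metric="euclidean")

# Square form: symmetric matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed)
print(square)
```

Because `pdist` measures distances between rows, the product-by-customer matrix has to be transposed before it is passed in.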


In [15]:
# The raw distances are large and hard to interpret, so they are converted to a similarity score with 1/(1 + distance).
# The score lies in (0, 1]: the closer to 1, the more similar the two customers are.

df_distance_matrix_norm = pd.DataFrame(1/(1 + df_distance_matrix))
df_distance_matrix_norm.head()

CustomerID,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
12347,1.0,0.001571,0.002483,0.002525,0.002517,0.002541,0.002482,0.002865,0.002231,0.002034,...,0.002103,0.002491,0.002635,0.002541,0.002539,0.002544,0.002537,0.002527,0.002463,0.002211
12348,0.001571,1.0,0.001752,0.001769,0.001763,0.001776,0.001754,0.001848,0.001739,0.001566,...,0.001617,0.001759,0.001784,0.001776,0.001775,0.001776,0.001775,0.00177,0.001801,0.001676
12349,0.002483,0.001752,1.0,0.009336,0.008995,0.010383,0.007365,0.006581,0.003317,0.003257,...,0.003626,0.008003,0.010041,0.010374,0.010166,0.010319,0.010138,0.009409,0.006255,0.003514
12350,0.002525,0.001769,0.009336,1.0,0.011355,0.018868,0.009094,0.007746,0.003278,0.003306,...,0.003665,0.01044,0.017009,0.018813,0.017446,0.018486,0.017495,0.014176,0.006978,0.003633
12352,0.002517,0.001763,0.008995,0.011355,1.0,0.013787,0.008516,0.00724,0.003268,0.003264,...,0.003722,0.00928,0.013013,0.013766,0.013204,0.013637,0.013225,0.011625,0.006608,0.00359
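
The transformation can be checked by hand against the tables above. A minimal sketch, with `distance_to_similarity` as a hypothetical helper name:

```python
import numpy as np

def distance_to_similarity(distance):
    """Map a non-negative distance to a score in (0, 1]."""
    return 1.0 / (1.0 + np.asarray(distance, dtype=float))

# Distances taken from the 12350 row of the euclidean matrix above:
# to itself (0.0), to 12353 (52.0) and to 12347 (394.970885)
scores = distance_to_similarity([0.0, 52.0, 394.970885])
print(np.round(scores, 6))
```

A distance of 0 maps to exactly 1, and larger distances shrink towards 0, so sorting by score in descending order gives the same ranking as sorting by distance in ascending order.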


To check the output, list the 5 customers most similar to a specific CustomerID. Six rows are taken because the first is the customer itself (similarity 1).

In [16]:
df_top5_recs_spec_customer = df_distance_matrix_norm[12350].sort_values(ascending = False).head(6)
df_top5_recs_spec_customer

CustomerID
12350    1.000000
15180    0.020469
15422    0.020460
15435    0.019740
16484    0.019289
17956    0.019289
Name: 12350, dtype: float64
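
Skipping the first row works when the customer is the only entry with similarity 1. If another customer had an identical purchase vector, the ordering of the tie would not be guaranteed. A sketch of a variant that removes the customer by label instead; `top_k_similar` and the toy scores are hypothetical:

```python
import pandas as pd

# Hypothetical similarity scores for customer "c1"; "c3" is identical to "c1"
sim = pd.Series({"c1": 1.0, "c2": 0.4, "c3": 1.0, "c4": 0.7})

def top_k_similar(similarities, customer, k):
    """Return the k highest scores after removing the customer's own entry."""
    return similarities.drop(labels=customer).nlargest(k)

neighbours = top_k_similar(sim, "c1", 2)
print(list(neighbours.index))
```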

Next, look at what the similar customers identified above purchased (the first rows shown belong to the most similar customer).

In [17]:
df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(df_top5_recs_spec_customer.index[1:],)]
df_sample_similar_cust_recomms.head()

CustomerID,StockCode,Quantity
15180,22112,3
15180,22113,4
15180,22114,4
15180,22348,12
15180,22835,4


The dataframe above is now aggregated by StockCode, summing Quantity, and sorted in descending order of quantity.

This gives the total quantity of each product purchased by the 5 customers most similar to the chosen customer. (The assumption is that the higher the quantity purchased, the more likely the product is to appeal to a similar customer.)

In [18]:
df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
df_agg_similar_prod_ranked.head()

StockCode,Quantity
22348,28
72741,9
21531,8
21218,7
21844,6


Next, filter the list to products that the chosen customer has not yet purchased and recommend the 5 remaining products with the highest quantities.

In [19]:
df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[12350]], axis=1, sort=False)
df_product_recomms.rename(columns = {12350:'test_customer'}, inplace = True)
df_product_recomms

StockCode,Quantity,test_customer
22348,28.0,24
72741,9.0,0
21531,8.0,0
21218,7.0,0
21844,6.0,0
22417,6.0,0
22964,4.0,0
22835,4.0,0
22113,4.0,0
22114,4.0,0


In [20]:
df_top5_recomm = df_product_recomms.query('Quantity > 0 and test_customer == 0').head(5)
df_top5_recomm

StockCode,Quantity,test_customer
72741,9.0,0
21531,8.0,0
21218,7.0,0
21844,6.0,0
22417,6.0,0
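
The aggregate-then-filter logic can be sketched end to end on a made-up purchase log. The `purchases` frame, the customer labels and the `recommend` helper are all hypothetical; the ranking rule (total quantity bought by the neighbours) is the one described above.

```python
import pandas as pd

# Hypothetical purchases: quantity per (customer, product)
purchases = pd.DataFrame({
    "CustomerID": ["n1", "n1", "n2", "n2", "target"],
    "StockCode": ["A", "B", "B", "C", "A"],
    "Quantity": [5, 2, 4, 1, 3],
})

def recommend(purchases, target, neighbours, k):
    """Rank products bought by the neighbours that the target has not bought."""
    neighbour_totals = (purchases[purchases["CustomerID"].isin(neighbours)]
                        .groupby("StockCode")["Quantity"].sum()
                        .sort_values(ascending=False))
    already_bought = set(purchases.loc[purchases["CustomerID"] == target, "StockCode"])
    candidates = neighbour_totals[~neighbour_totals.index.isin(already_bought)]
    return list(candidates.head(k).index)

print(recommend(purchases, "target", ["n1", "n2"], k=2))
```

Product A has the second-highest neighbour total but is dropped because the target customer already bought it.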


The recommendations made for a single customer now need to be applied across all customers.

In [21]:
dict_recommendations = {}
unique_ID = df_distance_matrix_norm.columns.unique()

In [22]:
for customer in unique_ID:
    head = df_distance_matrix_norm[customer].sort_values(ascending = False).head(6)
    df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(head.index[1:],)]
    df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
    df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[customer]], axis=1, sort=False)
    df_product_recomms.rename(columns = {customer:'customer'}, inplace = True)
    dict_recommendations[customer] = list(df_product_recomms.query('Quantity > 0 and customer == 0').head(5).index)
    

The above results now need to be converted into a dataframe (from dictionary).

In [23]:
df_recommendations = pd.DataFrame.from_dict(dict_recommendations, orient='index', 
                                columns=['rec1', 'rec2', 'rec3', 'rec4', 'rec5'])
df_recommendations.head()

CustomerID,rec1,rec2,rec3,rec4,rec5
12347,23077,22693,22418,22614,84375
12348,15056N,22693,21829,22384,20727
12349,85194S,22265,22851,22322,72741
12350,72741,21531,21218,21844,22417
12352,22915,21231,22907,22196,22072


## Putting it all together...
This section runs all of the code above for each of the available distance metrics and compares the results.
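
Different metrics can rank the same pair of customers very differently. As a small illustration on made-up data, two customers who buy the same products in the same proportions are far apart under `cityblock` and `euclidean` but identical under `cosine`, which ignores overall volume. The array `X` is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Two hypothetical customers: the second buys exactly twice as much of everything
X = np.array([[1, 0, 2],
              [2, 0, 4]])

# One pairwise distance per metric
distances = {m: float(pdist(X, metric=m)[0])
             for m in ["cityblock", "euclidean", "cosine"]}

for name, value in distances.items():
    print(name, round(value, 4))
```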

In [24]:
metrics = [ 'cityblock', 'correlation', 'cosine', 'dice', 
           'euclidean', 'hamming', 'jaccard']


for i in metrics:
    df_distance_matrix = pd.DataFrame(squareform(pdist(df_cust_matrix.T, metric=i)), index=df_cust_matrix.columns, columns=df_cust_matrix.columns)
    df_distance_matrix_norm = pd.DataFrame(1/(1 + df_distance_matrix))
    dict_recommendations = {}
    unique_ID = df_distance_matrix_norm.columns.unique()

    for customer in unique_ID:
        head = df_distance_matrix_norm[customer].sort_values(ascending = False).head(6)
        df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(head.index[1:],)]
        df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                    .sort_values(by = 'Quantity', ascending = False)
        df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[customer]], axis=1, sort=False)
        df_product_recomms.rename(columns = {customer:'customer'}, inplace = True)
        dict_recommendations[customer] = list(df_product_recomms.query('Quantity > 0 and customer == 0').head(5).index)

    df_recommendations = pd.DataFrame.from_dict(dict_recommendations, orient='index', 
                                    columns=['rec1', 'rec2', 'rec3', 'rec4', 'rec5'])    

    #move the CustomerID index into a regular column
    df_recommendations.reset_index(drop=False, inplace=True)

    #rename the columns
    df_recommendations.columns = ['CustomerID','rec1','rec2','rec3','rec4','rec5']

    #add B2B column
    df_recommendations['B2B'] = df_recommendations['CustomerID'].apply(customertypeinrecsys)

    #lookup prices based on the stock code and B2B flag
    for column in ['rec1','rec2','rec3','rec4','rec5']:
        look = partial(pricelookup, col=column)
        df_recommendations[f"{column}value"] = df_recommendations.apply(look, axis=1)
    
    
    
    df_recommendations['TotalPossCustRev'] = df_recommendations.iloc[:,-5:].sum(axis=1)

    Total = df_recommendations['TotalPossCustRev'].sum()
    Total_Annual = ((Total/11)*12)

    print(f"The total for {i} metric was : {Total} for 11 months or {Total_Annual} annualised")
    #print("The total for {} metric was : {:.2f} for 11 months or {:.2f} annualised".format(i, Total, Total_Annual))

print("All done")

The total for cityblock metric was : 373847.2466246232 for 11 months or 407833.35995413444 annualised
The total for correlation metric was : 343420.03292339144 for 11 months or 374640.03591642703 annualised
The total for cosine metric was : 343782.16664453375 for 11 months or 375035.0908849459 annualised
The total for dice metric was : 349375.10380965436 for 11 months or 381136.4768832593 annualised
The total for euclidean metric was : 310986.66764364456 for 11 months or 339258.18288397585 annualised
The total for hamming metric was : 447534.2087055604 for 11 months or 488219.1367697023 annualised
The total for jaccard metric was : 335381.43709187454 for 11 months or 365870.65864568134 annualised
All done


The total for cityblock metric was : 370758.48566564394 for 9 months or 404463.80254433886 annualised
The total for correlation metric was : 315393.3666921541 for 9 months or 344065.49093689537 annualised
The total for cosine metric was : 315750.65981281555 for 9 months or 344455.26525034424 annualised
The total for dice metric was : 349029.88932659273 for 9 months or 380759.87926537386 annualised
The total for euclidean metric was : 310853.27841711615 for 9 months or 339112.6673641267 annualised
The total for hamming metric was : 423977.1892219803 for 9 months or 462520.57006034214 annualised
The total for jaccard metric was : 335560.0019931393 for 9 months or 366065.4567197883 annualised
All done
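
The annualisation is a simple pro-rata scaling. The loop divides by 11 when annualising, while the introduction and the second set of results refer to 9 months, so the number of months is left as a parameter in this sketch. `annualise` is a hypothetical helper name.

```python
def annualise(total, months_observed):
    """Scale a revenue total observed over some months to a 12-month figure."""
    if months_observed <= 0:
        raise ValueError("months_observed must be positive")
    return total / months_observed * 12

# Euclidean total from the first set of results above, scaled from 11 months
print(annualise(310986.66764364456, 11))
```

Whichever period is used, it should match the number of months actually covered by the source data, since the annualised figure scales inversely with it.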