# Recommender System | Revenue Potential 

Complete the exercises below to solidify your knowledge and understanding of recommender systems.

For this lab, we will build a user-similarity-based recommender system step by step. Our data set contains customer grocery purchases, and we will use similar purchase behavior to inform the recommendations. The system will generate 5 recommendations for each customer based on the purchases they have made.

In [1]:
##Libraries
#Dataframe & Arrays
import pandas as pd
import numpy as np

#SciPy for pairwise distance computation
from scipy.spatial.distance import pdist, squareform

In [2]:
#df = pd.read_excel('../data/online_fashion.xlsx')

In [3]:
df = pd.read_csv('../data/cleaned_df2.csv')

In [4]:
df['CustomerID'].isna().sum()

0

In [5]:
df['CustomerID'] = df['CustomerID'].astype(int)


In [6]:
df.shape

(298407, 11)

In [7]:
##Just weighted average price

In [8]:
#data.rename(columns={'Description':'ProductName'}, inplace=True)

## Step 1: Create a data frame that contains the total quantity of each product purchased by each customer.

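You will need to group by CustomerID and ProductName and then sum the Quantity field. In this data set the product identifier column is `StockCode`, so it stands in for ProductName in the code throughout.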

In [9]:
df_cust_prod_grouped = pd.DataFrame(df.groupby(['CustomerID', 'StockCode'])['Quantity'].agg('sum'))
df_cust_prod_grouped.head()

CustomerID,StockCode,Quantity
12347,16008,24
12347,17021,36
12347,20665,6
12347,20719,40
12347,20780,12


## Step 2: Use the `pivot_table` method to create a product by customer matrix.

The rows of the matrix should represent the products, the columns should represent the customers, and the values should be the quantities of each product purchased by each customer. You will also need to replace nulls with zeros, which you can do using the `fillna` method.

In [10]:
df_cust_matrix = df_cust_prod_grouped.pivot_table('Quantity', 'StockCode', 'CustomerID', aggfunc='sum', fill_value = 0)
df_cust_matrix.head()

StockCode,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
10002,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10120,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10123C,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10124A,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
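
Steps 1 and 2 can be checked on a small scale. The sketch below uses a made-up purchase table (the values are hypothetical; only the column names mirror the lab's data set) so the result of the grouping and the pivot is easy to verify by eye.

```python
import pandas as pd

# Hypothetical toy purchases; column names mirror the lab's data set.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "StockCode":  ["A", "A", "B", "B", "C", "A"],
    "Quantity":   [2, 3, 1, 4, 5, 6],
})

# Step 1: total quantity per (customer, product) pair.
toy_grouped = toy.groupby(["CustomerID", "StockCode"])["Quantity"].sum().to_frame()

# Step 2: products as rows, customers as columns, missing pairs filled with 0.
toy_matrix = toy_grouped.pivot_table(
    values="Quantity", index="StockCode", columns="CustomerID",
    aggfunc="sum", fill_value=0,
)
print(toy_matrix)
```

Customer 1 bought product A twice (2 and 3 units), so that cell holds 5, while pairs that never occur, such as customer 3 and product C, hold 0.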


## Step 3: Create a customer similarity matrix using `squareform` and `pdist`. For the distance metric, choose "euclidean."

In [11]:
# pdist computes distances between rows, so I transpose the matrix to compare customers rather than products.
# pdist returns a condensed 1-D array of pairwise distances.
# squareform expands that array into a square, symmetric matrix.
# Finally, wrap the result in a DataFrame labelled by CustomerID.

df_distance_matrix = pd.DataFrame(squareform(pdist(df_cust_matrix.T, metric='euclidean')), index=df_cust_matrix.columns, columns=df_cust_matrix.columns)
df_distance_matrix.head()

CustomerID,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
12347,0.0,635.655567,401.766101,394.970885,394.313327,392.57356,401.880579,347.997126,447.18788,490.658741,...,474.527133,400.492197,378.457395,392.593938,392.802749,392.149206,393.146283,394.717621,405.086411,451.18289
12348,635.655567,0.0,569.817515,564.191457,563.220206,562.003559,569.09226,540.118506,574.054875,637.45745,...,617.243874,567.563212,559.489053,562.017793,562.419772,562.105862,562.40377,563.897154,554.340148,595.815408
12349,401.766101,569.817515,0.0,106.113147,102.239914,95.31002,134.773885,150.943698,300.44467,306.065352,...,274.779912,123.951603,98.59006,95.39392,97.365292,95.911417,97.642204,105.280578,158.874164,283.591255
12350,394.970885,564.191457,106.113147,0.0,63.82006,52.0,108.967885,128.093716,304.018092,301.502902,...,271.834508,94.783965,57.792733,52.153619,56.320511,53.094256,56.160484,69.541355,142.313035,274.218891
12352,394.313327,563.220206,102.239914,63.82006,0.0,40.112342,103.474635,123.745707,301.734983,303.501236,...,270.279485,88.820043,47.381431,40.311289,45.574115,41.521079,45.376205,61.163715,139.168962,272.214989
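
To see what `pdist` and `squareform` each contribute, here is a minimal sketch on a hypothetical 3-customer matrix whose distances can be worked out by hand (a 3-4-5 triangle). The matrix values and the names `toy_matrix`, `condensed`, and `toy_distance` are illustrative assumptions.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical product-by-customer matrix (rows: products, columns: customers).
toy_matrix = pd.DataFrame(
    {101: [3, 0, 4], 102: [0, 0, 0], 103: [3, 0, 0]},
    index=["A", "B", "C"],
)

# pdist compares rows, so transpose to get one row per customer.
# For 3 customers it returns 3 pairwise distances in condensed form.
condensed = pdist(toy_matrix.T, metric="euclidean")

# squareform expands the condensed array into a symmetric matrix with a zero diagonal.
toy_distance = pd.DataFrame(
    squareform(condensed),
    index=toy_matrix.columns, columns=toy_matrix.columns,
)
print(toy_distance)
```

Without the transpose, the same call would return distances between products A, B, and C instead of between customers.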


In [12]:
# The raw distances are hard to interpret, so I convert them to similarity scores
# between 0 and 1 with 1 / (1 + distance): the closer to 1, the more similar two customers are

df_distance_matrix_norm = pd.DataFrame(1/(1 + df_distance_matrix))
df_distance_matrix_norm.head()

CustomerID,12347,12348,12349,12350,12352,12353,12354,12355,12356,12357,...,18272,18273,18276,18277,18278,18280,18281,18282,18283,18287
12347,1.0,0.001571,0.002483,0.002525,0.00253,0.002541,0.002482,0.002865,0.002231,0.002034,...,0.002103,0.002491,0.002635,0.002541,0.002539,0.002544,0.002537,0.002527,0.002463,0.002211
12348,0.001571,1.0,0.001752,0.001769,0.001772,0.001776,0.001754,0.001848,0.001739,0.001566,...,0.001617,0.001759,0.001784,0.001776,0.001775,0.001776,0.001775,0.00177,0.001801,0.001676
12349,0.002483,0.001752,1.0,0.009336,0.009686,0.010383,0.007365,0.006581,0.003317,0.003257,...,0.003626,0.008003,0.010041,0.010374,0.010166,0.010319,0.010138,0.009409,0.006255,0.003514
12350,0.002525,0.001769,0.009336,1.0,0.015427,0.018868,0.009094,0.007746,0.003278,0.003306,...,0.003665,0.01044,0.017009,0.018813,0.017446,0.018486,0.017495,0.014176,0.006978,0.003633
12352,0.00253,0.001772,0.009686,0.015427,1.0,0.024324,0.009572,0.008016,0.003303,0.003284,...,0.003686,0.011133,0.020669,0.024206,0.021471,0.023518,0.021563,0.016087,0.007134,0.00366
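
The `1 / (1 + distance)` transform can be illustrated with a few hypothetical distances. The sketch below shows that a distance of 0 maps to a similarity of 1, that large distances shrink toward 0, and that the ordering of neighbours is the same whether you sort by distance ascending or by similarity descending.

```python
import pandas as pd

# Hypothetical distances from one customer to itself and three others.
distances = pd.Series({"self": 0.0, "near": 1.0, "mid": 3.0, "far": 99.0})

# Same transform as in the notebook cell above.
similarity = 1 / (1 + distances)
print(similarity)

# The transform is strictly decreasing, so the neighbour ranking is unchanged.
by_similarity = list(similarity.sort_values(ascending=False).index)
by_distance = list(distances.sort_values().index)
same_order = by_similarity == by_distance
print(same_order)
```

Because only the ranking matters for picking neighbours, the transform changes how the scores read, not which customers are selected.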


## Step 4: Check your results by generating a list of the top 5 most similar customers for a specific CustomerID.

In [13]:
# Pick one customer to inspect; any ID in df_distance_matrix_norm.columns.unique() works.
# Take the top 6 rows because the first is the customer itself (similarity 1.0).
df_top5_recs_spec_customer = df_distance_matrix_norm[12350].sort_values(ascending = False).head(6)
df_top5_recs_spec_customer

CustomerID
12350    1.000000
15180    0.020469
15422    0.020460
15435    0.019740
16484    0.019289
17956    0.019289
Name: 12350, dtype: float64

## Step 5: From the data frame you created in Step 1, select the records for the list of similar CustomerIDs you obtained in Step 4.

In [14]:
# I slice the index from position 1 to skip the first entry,
# which is the chosen customer itself

df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(df_top5_recs_spec_customer.index[1:],)]
df_sample_similar_cust_recomms.head()

CustomerID,StockCode,Quantity
15180,22112,3
15180,22113,4
15180,22114,4
15180,22348,12
15180,22835,4


## Step 6: Aggregate those customer purchase records by ProductName, sum the Quantity field, and then rank them in descending order by quantity.

This will give you the total number of each product purchased by the 5 most similar customers to the customer you selected in order from most purchased to least.

In [15]:
df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
df_agg_similar_prod_ranked.head()

StockCode,Quantity
22348,28
72741,9
21531,8
21218,7
21844,6


## Step 7: Filter the list for products that the chosen customer has not yet purchased and then recommend the top 5 products with the highest quantities that are left.

- Merge the ranked products data frame with the customer product matrix on the ProductName field.
- Filter for records where the chosen customer has not purchased the product.
- Show the top 5 results.

In [16]:
df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[12350]], axis=1, sort=False)
df_product_recomms.rename(columns = {12350:'test_customer'}, inplace = True)
df_product_recomms

StockCode,Quantity,test_customer
22348,28.0,24
72741,9.0,0
21531,8.0,0
21218,7.0,0
21844,6.0,0
22417,6.0,0
22964,4.0,0
22835,4.0,0
22113,4.0,0
22114,4.0,0


In [17]:
df_top5_recomm = df_product_recomms.query('Quantity > 0 and test_customer == 0').head(5)
df_top5_recomm

StockCode,Quantity,test_customer
72741,9.0,0
21531,8.0,0
21218,7.0,0
21844,6.0,0
22417,6.0,0
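
The bullets for Step 7 describe a merge, while the cells above use `pd.concat`. As a hedged alternative, the sketch below uses a left join on hypothetical data. With `concat`, products the neighbours never bought get a NaN quantity, which is why `Quantity` shows as a float and the `Quantity > 0` filter is needed; a left join keeps only the neighbours' products, so that filter becomes unnecessary. The stock code `99999` and the variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical ranked products bought by the similar customers.
ranked = pd.DataFrame(
    {"Quantity": [28, 9, 8]},
    index=pd.Index(["22348", "72741", "21531"], name="StockCode"),
)

# Hypothetical purchase column of the chosen customer from the product-by-customer matrix.
chosen = pd.Series(
    [24, 0, 0, 5],
    index=pd.Index(["22348", "72741", "21531", "99999"], name="StockCode"),
    name="test_customer",
)

# Left join: keep the neighbours' products only, then drop what the customer already owns.
merged = ranked.join(chosen, how="left")
recs = merged[merged["test_customer"] == 0].head(5)
print(recs)
```

Product `22348` is removed because the chosen customer already bought it, and `99999` never appears because no neighbour bought it.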


## Step 8: Now that we have generated product recommendations for a single user, put the pieces together and iterate over a list of all CustomerIDs.

- Create an empty dictionary that will hold the recommendations for all customers.
- Create a list of unique CustomerIDs to iterate over.
- Iterate over the customer list performing steps 4 through 7 for each and appending the results of each iteration to the dictionary you created.

In [18]:
dict_recommendations = {}
unique_ID = df_distance_matrix_norm.columns.unique()

In [33]:
#df['CustomerID'].unique()

In [19]:
for customer in unique_ID:
    head = df_distance_matrix_norm[customer].sort_values(ascending = False).head(6)
    df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(head.index[1:],)]
    df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                .sort_values(by = 'Quantity', ascending = False)
    df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[customer]], axis=1, sort=False)
    df_product_recomms.rename(columns = {customer:'customer'}, inplace = True)
    dict_recommendations[customer] = list(df_product_recomms.query('Quantity > 0 and customer == 0').head(5).index)
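
As a variation on the loop above, steps 4 through 7 can be wrapped in a function. This is a minimal sketch on hypothetical toy data, not the notebook's exact logic: the helper name `recommend` and its parameters are assumptions, and it drops the customer by label rather than relying on `head(6)` followed by `index[1:]`, which assumes the customer always sorts first (not guaranteed when another customer has an identical purchase vector).

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

def recommend(customer, similarity, grouped, matrix, n_neighbours=5, n_recs=5):
    """Hypothetical helper: run steps 4 to 7 for one customer."""
    # Step 4: most similar customers, excluding the customer itself by label.
    neighbours = similarity[customer].drop(customer).nlargest(n_neighbours).index
    # Step 5: purchase records of those neighbours.
    mask = grouped.index.get_level_values("CustomerID").isin(neighbours)
    # Step 6: total quantity per product, most purchased first.
    ranked = (grouped[mask].groupby("StockCode")["Quantity"].sum()
              .sort_values(ascending=False))
    # Step 7: remove products the customer already bought.
    already_bought = matrix.index[matrix[customer] > 0]
    return list(ranked.drop(already_bought, errors="ignore").head(n_recs).index)

# Hypothetical toy data shaped like the lab's data frames.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 2, 3],
    "StockCode":  ["A", "A", "B", "C", "C"],
    "Quantity":   [2, 2, 5, 1, 9],
})
grouped = toy.groupby(["CustomerID", "StockCode"])["Quantity"].sum().to_frame()
matrix = grouped.pivot_table("Quantity", "StockCode", "CustomerID",
                             aggfunc="sum", fill_value=0)
distance = pd.DataFrame(squareform(pdist(matrix.T, metric="euclidean")),
                        index=matrix.columns, columns=matrix.columns)
similarity = 1 / (1 + distance)

toy_recs = {c: recommend(c, similarity, grouped, matrix, n_neighbours=1)
            for c in matrix.columns}
print(toy_recs)
```

A function like this keeps the loop body in one place, which makes it easier to rerun the whole pipeline when the distance metric changes.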
    

In [20]:
dict_recommendations

{12347: ['23077', '22418', '22614', '22029', '84375'],
 12348: ['15056N', '22693', '21829', '22384', '20727'],
 12349: ['85194S', '22265', '22851', '22322', '72741'],
 12350: ['72741', '21531', '21218', '21844', '22417'],
 12352: ['22915', '21733', '23321', '23322', '22469'],
 12353: ['22485', '22802', '22803', '22982', '23073'],
 12354: ['20979', '21245', '20674', '22993', '22962'],
 12355: ['71477', '21167', '21381', '21380', '20829'],
 12356: ['16161P', '21210', '22961', '22952', '22986'],
 12357: ['72351B', '85034B', '21108', '72349B', '72225C'],
 12358: ['16008', '85015', '84946', '15044D', '85048'],
 12359: ['16156S', '23170', '22915', '84947', '22921'],
 12360: ['22631', '22993', '22962', '22966', '22659'],
 12361: ['20979', '23371', '22352', '22138', '22617'],
 12362: ['79190B', '85040A', '84569D', '82552', '22973'],
 12363: ['22961', '22909', '21975', '21880', '47591D'],
 12364: ['23077', '23080', '21231', '23076', '21232'],
 12365: ['22485', '22802', '22803', '22982', '23073'],
 ...

## Step 9: Store the results in a Pandas data frame. The data frame should have a column for Customer ID and then a column for each of the 5 product recommendations for each customer.

In [21]:
df_recommendations = pd.DataFrame.from_dict(dict_recommendations, orient='index', 
                                columns=['rec1', 'rec2', 'rec3', 'rec4', 'rec5'])
df_recommendations

CustomerID,rec1,rec2,rec3,rec4,rec5
12347,23077,22418,22614,22029,84375
12348,15056N,22693,21829,22384,20727
12349,85194S,22265,22851,22322,72741
12350,72741,21531,21218,21844,22417
12352,22915,21733,23321,23322,22469
12353,22485,22802,22803,22982,23073
12354,20979,21245,20674,22993,22962
12355,71477,21167,21381,21380,20829
12356,16161P,21210,22961,22952,22986
12357,72351B,85034B,21108,72349B,72225C


## Step 10: Change the distance metric used in Step 3 to something other than euclidean (correlation, cityblock, cosine, jaccard, etc.). Regenerate the recommendations for all customers and note the differences.

In [22]:
#Create a price lookup table: one row per unique StockCode
#Every product is assigned the same price of 1.5 here
df_price_lookup = pd.DataFrame({'StockCode': df['StockCode'].unique(), 'ModePrice': 1.5})
df_price_lookup

,StockCode,ModePrice
0,22749,1.5
1,22310,1.5
2,84969,1.5
3,22913,1.5
4,22912,1.5
5,22914,1.5
6,21756,1.5
7,21724,1.5
8,21883,1.5
9,10002,1.5


In [23]:
#Step 10
#calculate the value of these recommendations
#df_recommendations.iloc[:,1]

for index, row in df_recommendations.head(n=2).iterrows():
    print(row)
     #print(index, row)


rec1    23077
rec2    22418
rec3    22614
rec4    22029
rec5    84375
Name: 12347, dtype: object
rec1    15056N
rec2     22693
rec3     21829
rec4     22384
rec5     20727
Name: 12348, dtype: object


In [None]:
df_recommendations.head()

In [39]:
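def pricelookup(stock_code):
    """
    Input: StockCode taken from a row within the dataframe
    Output: The corresponding value of that item, or NaN if the StockCode is not found
    
    The purpose of this function is to look up (like a spreadsheet VLOOKUP) the value
    of an item in a reference dataframe (df_price_lookup)
    """
    matches = df_price_lookup.loc[df_price_lookup['StockCode'] == stock_code, 'ModePrice']
    # No match (for example a missing recommendation): return NaN so that sums skip it
    if matches.empty:
        return np.nan
    return float(matches.iloc[0])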

In [40]:
df_recommendations['rec1value'] = df_recommendations['rec1'].apply(pricelookup)

In [41]:
df_recommendations['rec1value'] = df_recommendations['rec1'].apply(pricelookup)
df_recommendations['rec2value'] = df_recommendations['rec2'].apply(pricelookup)
df_recommendations['rec3value'] = df_recommendations['rec3'].apply(pricelookup)
df_recommendations['rec4value'] = df_recommendations['rec4'].apply(pricelookup)
df_recommendations['rec5value'] = df_recommendations['rec5'].apply(pricelookup)

In [42]:
df_recommendations.head()

CustomerID,rec1,rec2,rec3,rec4,rec5,rec1value,rec2value,rec3value,rec4value,rec5value,TotalPossCustRev
12347,23077,22418,22614,22029,84375,1.5,1.5,1.5,1.5,1.5,7.5
12348,15056N,22693,21829,22384,20727,1.5,1.5,1.5,1.5,1.5,7.5
12349,85194S,22265,22851,22322,72741,1.5,1.5,1.5,1.5,1.5,7.5
12350,72741,21531,21218,21844,22417,1.5,1.5,1.5,1.5,1.5,7.5
12352,22915,21733,23321,23322,22469,1.5,1.5,1.5,1.5,1.5,7.5


In [29]:
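# Select the value columns by name: iloc[:, -5:] would pick up TotalPossCustRev itself if this cell is re-run
value_cols = ['rec1value', 'rec2value', 'rec3value', 'rec4value', 'rec5value']
df_recommendations['TotalPossCustRev'] = df_recommendations[value_cols].sum(axis=1)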

In [30]:
df_recommendations.head()

CustomerID,rec1,rec2,rec3,rec4,rec5,rec1value,rec2value,rec3value,rec4value,rec5value,TotalPossCustRev
12347,23077,22418,22614,22029,84375,1.5,1.5,1.5,1.5,1.5,7.5
12348,15056N,22693,21829,22384,20727,1.5,1.5,1.5,1.5,1.5,7.5
12349,85194S,22265,22851,22322,72741,1.5,1.5,1.5,1.5,1.5,7.5
12350,72741,21531,21218,21844,22417,1.5,1.5,1.5,1.5,1.5,7.5
12352,22915,21733,23321,23322,22469,1.5,1.5,1.5,1.5,1.5,7.5
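
The per-row `apply(pricelookup)` calls filter the whole price table once for every recommendation. A possible alternative, sketched here on hypothetical data with made-up prices, is to build a StockCode-to-price Series once and pass it to `Series.map`. The names `price_by_code`, `rec_cols`, and `value_cols` are illustrative.

```python
import pandas as pd

# Hypothetical price table and recommendations; the prices are made up.
price_lookup = pd.DataFrame({"StockCode": ["A", "B", "C"],
                             "ModePrice": [1.5, 2.0, 0.5]})
recs = pd.DataFrame({"rec1": ["A", "B"], "rec2": ["C", "Z"]}, index=[1, 2])

# Build the lookup once: index is StockCode, values are prices.
price_by_code = price_lookup.set_index("StockCode")["ModePrice"]

rec_cols = ["rec1", "rec2"]
for col in rec_cols:
    # Codes missing from the lookup (here "Z") become NaN.
    recs[f"{col}value"] = recs[col].map(price_by_code)

# Sum the value columns by name; NaN entries are skipped by default.
value_cols = [f"{col}value" for col in rec_cols]
recs["TotalPossCustRev"] = recs[value_cols].sum(axis=1)
print(recs)
```

Naming the value columns explicitly also keeps the total stable if the cell is run more than once.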


### Next step: start monitoring how often recommendations are accepted, since the acceptance (success) rate is a valuable metric

In [None]:
df_price_lookup.duplicated().sum()

In [None]:
df_distance_matrix = pd.DataFrame(squareform(pdist(df_cust_matrix.T, metric='euclidean')), index=df_cust_matrix.columns, columns=df_cust_matrix.columns)
df_distance_matrix_norm = pd.DataFrame(1/(1 + df_distance_matrix))

In [None]:
metrics = [ 'cityblock', 'correlation', 'cosine', 'dice', 
           'euclidean', 'hamming', 'jaccard']

#Note on try/except: "try" wraps the code below; "except" defines what happens if it fails
for i in metrics:
    df_distance_matrix = pd.DataFrame(squareform(pdist(df_cust_matrix.T, metric=i)), index=df_cust_matrix.columns, columns=df_cust_matrix.columns)
    df_distance_matrix_norm = pd.DataFrame(1/(1 + df_distance_matrix))
    dict_recommendations = {}
    unique_ID = df_distance_matrix_norm.columns.unique()

    for customer in unique_ID:
        head = df_distance_matrix_norm[customer].sort_values(ascending = False).head(6)
        df_sample_similar_cust_recomms = df_cust_prod_grouped.loc[(head.index[1:],)]
        df_agg_similar_prod_ranked = df_sample_similar_cust_recomms.groupby('StockCode')[['Quantity']].sum()\
                    .sort_values(by = 'Quantity', ascending = False)
        df_product_recomms = pd.concat([df_agg_similar_prod_ranked, df_cust_matrix[customer]], axis=1, sort=False)
        df_product_recomms.rename(columns = {customer:'customer'}, inplace = True)
        dict_recommendations[customer] = list(df_product_recomms.query('Quantity > 0 and customer == 0').head(5).index)

    df_recommendations = pd.DataFrame.from_dict(dict_recommendations, orient='index', 
                                    columns=['rec1', 'rec2', 'rec3', 'rec4', 'rec5'])    

    df_recommendations['rec1value'] = df_recommendations['rec1'].apply(pricelookup)
    df_recommendations['rec2value'] = df_recommendations['rec2'].apply(pricelookup)
    df_recommendations['rec3value'] = df_recommendations['rec3'].apply(pricelookup)
    df_recommendations['rec4value'] = df_recommendations['rec4'].apply(pricelookup)
    df_recommendations['rec5value'] = df_recommendations['rec5'].apply(pricelookup)

    df_recommendations['TotalPossCustRev'] = df_recommendations.iloc[:,-5:].sum(axis=1)

    Total = df_recommendations['TotalPossCustRev'].sum()
    
    print(f"The total for {i} metric was : {Total}")

print("All done")
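
With a flat price per product, the printed totals may say little about how the metrics differ. A more direct check is to compare which neighbours each metric selects. The sketch below does this on a hypothetical 3-customer matrix; the helper `nearest_neighbour` and the data are assumptions for illustration. Customer 2 buys the same product as customer 1 in ten times the quantity, so cosine distance (which ignores magnitude) treats them as identical, while euclidean distance considers customer 3 closer.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical product-by-customer matrix (rows: products, columns: customers).
toy_matrix = pd.DataFrame({1: [1, 0], 2: [10, 0], 3: [1, 1]}, index=["A", "B"])

def nearest_neighbour(matrix, metric):
    """Hypothetical helper: closest other customer for each customer under `metric`."""
    dist = pd.DataFrame(
        squareform(pdist(matrix.T, metric=metric)),
        index=matrix.columns, columns=matrix.columns,
    )
    # Drop each customer's own entry before taking the minimum distance.
    return {c: dist[c].drop(c).idxmin() for c in dist.columns}

nn_euclidean = nearest_neighbour(toy_matrix, "euclidean")
nn_cosine = nearest_neighbour(toy_matrix, "cosine")
print(nn_euclidean[1], nn_cosine[1])
```

Different neighbours lead to different aggregated purchases in Step 6, which is where the recommendations start to diverge between metrics.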

#Determine how many of the units that were sold during the time period were later returned, and divide that number by the number of units sold. Multiply your answer by 100 to calculate the percentage of units returned.
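
The return-rate calculation described in the note above is a single division. A minimal sketch follows, with a hypothetical function name and made-up unit counts; how returned units are identified in this data set is not covered here.

```python
def return_rate(units_returned, units_sold):
    """Percentage of sold units that were later returned."""
    if units_sold == 0:
        raise ValueError("units_sold must be greater than zero")
    return units_returned / units_sold * 100

# Hypothetical counts: 25 of 500 units sold were returned.
rate = return_rate(25, 500)
print(f"{rate:.1f}% of units were returned")
```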