## Simple recommendation model based on product popularity

Useful for, e.g., anonymous sessions where the customer is not logged in and has not yet put anything in the basket, for example on the front page.

We use sales figures (the number of individual orders containing each product, without multiplying by the quantity per order) as the measure of product popularity.
* Alternatively, one can use "how many unique customers have bought product X". This avoids situations where consumables such as vacuum cleaner bags, which the same customer buys repeatedly, "outcompete" durable goods such as the vacuum cleaner itself.
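
The difference between the two measures can be illustrated with a small sketch. The data below is hypothetical (a consumable bought three times by one customer, and a durable product bought once each by two customers); the column names follow the dataset used later in this notebook.

```python
import pandas as pd

# Toy sales lines (hypothetical data); one row per invoice line.
sales = pd.DataFrame({
    "InvoiceNo":  ["1", "2", "3", "4", "5"],
    "StockCode":  ["BAGS", "BAGS", "BAGS", "VACUUM", "VACUUM"],
    "Quantity":   [10, 10, 10, 1, 1],
    "CustomerID": ["A", "A", "A", "B", "C"],
})

grouped = sales.groupby("StockCode")
popularity = pd.DataFrame({
    # Number of order lines per product; Quantity is deliberately ignored.
    "OrdersCount": grouped.size(),
    # Number of distinct customers who bought the product.
    "UsersCount": grouped["CustomerID"].nunique(),
})
print(popularity)
```

By order count the bags rank above the vacuum cleaner; by unique customers the ranking is reversed.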

It is important to introduce some randomness, so that the same top-5 products are not recommended every time.

Each product is ranked by popularity.
The sales count serves as the weight in a random distribution.

We pick product recommendations by drawing products at random, with each product's normalised sales count as its weight.
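
A minimal sketch of this weighted draw, using only the standard library. The product codes and sales counts are hypothetical, and the fixed seed is only there to make the sketch reproducible.

```python
import random
from collections import Counter

# Hypothetical sales counts per product.
sales_counts = {"A": 60, "B": 30, "C": 10}

total = sum(sales_counts.values())
# Normalised weights sum to 1; random.choices would also accept the raw counts.
weights = {code: n / total for code, n in sales_counts.items()}

rng = random.Random(42)  # fixed seed so the sketch is reproducible
draws = rng.choices(list(weights), weights=list(weights.values()), k=10_000)

# Compare each product's weight with how often it was actually drawn.
frequencies = Counter(draws)
for code in sales_counts:
    print(code, weights[code], frequencies[code] / len(draws))
```

Over many draws the observed frequencies approach the normalised weights, so popular products are recommended more often without being the only ones shown.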

In [2]:
import pandas as pd

from collections import defaultdict
import timeit
import itertools
import random
import matplotlib
from matplotlib import pyplot
from pprint import pprint
from importlib import reload

In [12]:
# Load data
df_raw = pd.read_csv("../data/raw/data.csv", keep_default_na=False)

# Remove bad data:
df = df_raw
print("Rows before QA filtering:", len(df))

# Remove lines with N/A values:
df = df_dropna = df.dropna()
print("Rows after dropping N/A:", len(df))

print("Columns with empty values:")
for column in df.columns:
    df = df[~(df[column] == '')]
print("Rows after dropping rows with empty values:", len(df))

# Remove lines that do not represent an actual product:
print("Non-product stock codes:")
non_product_stock_codes = ['BANK CHARGES', 'C2', 'CRUK', 'D', 'DOT', 'M', 'PADS', 'POST']
df = df[~df['StockCode'].isin(non_product_stock_codes)]
print("Rows after dropping non-product lines:", len(df))

print("Rows after QA filtering:", len(df))


Rows before QA filtering: 541909
Rows after dropping N/A: 541909
Columns with empty values:
Rows after dropping rows with empty values: 406829
Non-product stock codes:
Rows after dropping non-product lines: 404909
Rows after QA filtering: 404909
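
The row count does not change after `dropna()` because `keep_default_na=False` makes `read_csv` keep empty fields as empty strings instead of converting them to NaN; the rows are only removed by the explicit `== ''` filter. A minimal sketch of that behaviour, using a hypothetical two-row CSV rather than the real data file:

```python
import io
import pandas as pd

# Hypothetical two-row CSV; the second row has an empty CustomerID field.
csv_text = "StockCode,CustomerID\n22423,17850\n85123A,\n"

as_strings = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, dtype=str)
as_nan = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(len(as_strings.dropna()))                         # empty field is '', nothing to drop
print(len(as_nan.dropna()))                             # empty field is NaN, row is dropped
print(len(as_strings[as_strings["CustomerID"] != ""]))  # explicit filter on ''
```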


In [14]:
# Build product table:
from src.data import data_utils 
reload(data_utils)

products_df = data_utils.build_products_df_from_sales(df)
# products_df.head(5)

### Product Popularity Recommender 

In [22]:
import random

class SalesCountRecommendationModel():

    def __init__(self, sales_df, popularity_measure="UsersCount"):
        
        # First, build a table of all products.
        # Note: Doing it this way makes 'StockCode' the table index, not a column:
        products_df = sales_df.groupby('StockCode').first()
        del products_df['InvoiceNo']
        del products_df['Quantity']
        del products_df['CustomerID']
        del products_df['Country']
        # self.products_df = products_df

        # Compute product order count:
        # product_orders_count = df.groupby('StockCode')['InvoiceNo'].count().sort_values(ascending=False)
        # products_df['OrdersCount'] = sales_df.groupby('StockCode')['InvoiceNo'].count()
        products_df['OrdersCount'] = sales_df.groupby('StockCode').size()  # same as count() on series
        products_df['UsersCount'] = sales_df.groupby('StockCode')['CustomerID'].nunique()
        # Sort and store:
        self.products_df_sorted = products_df.sort_values(by=popularity_measure, ascending=False)
        self.popularity_measure = popularity_measure
        # display(self.products_df_sorted)

    def recommend_stockcodes(self, basket=None, k=1, popularity_measure=None):
        """ This model doesn't use basket, only static sales numbers. """
        if popularity_measure is None:
            popularity_measure = self.popularity_measure
        # display(self.products_df_sorted.index)
        return random.choices(
            self.products_df_sorted.index, 
            weights=self.products_df_sorted[popularity_measure], 
            k=k
        )

    def recommend_top_stockcodes(self, basket=None, k=1, popularity_measure=None):
        """ This model doesn't use basket, only static sales numbers. """
        if popularity_measure is None:
            popularity_measure = self.popularity_measure
        # sort_values returns a new DataFrame, so keep the result:
        products_df_sorted = self.products_df_sorted.sort_values(by=popularity_measure, ascending=False)
        return products_df_sorted.index[:k]

    def recommend_product_rows(self, basket=None, k=1, popularity_measure=None):
        stockcodes = self.recommend_stockcodes(basket=basket, k=k, popularity_measure=popularity_measure)
        print(f"{stockcodes!r}", type(stockcodes))
        return self.products_df_sorted.loc[stockcodes]

    def recommend_top_product_rows(self, basket=None, k=1, popularity_measure=None):
        """ This model doesn't use basket, only static sales numbers. """
        if popularity_measure is None:
            popularity_measure = self.popularity_measure
        # sort_values returns a new DataFrame, so keep the result:
        products_df_sorted = self.products_df_sorted.sort_values(by=popularity_measure, ascending=False)
        return products_df_sorted.iloc[:k]

    
sales_count_rm = SalesCountRecommendationModel(sales_df=df)

n_recommendations = 5
print(f"\nRecommend {n_recommendations} semi-random products based on sales/users count:\n")
product_rows = sales_count_rm.recommend_product_rows(k=n_recommendations)
product_rows



Recommend 5 semi-random products based on sales/users count:

['22696', '22309', '22776', '22614', '21704'] <class 'list'>


| StockCode | Description | InvoiceDate | UnitPrice | OrdersCount | UsersCount |
|---|---|---|---|---|---|
| 22696 | WICKER WREATH LARGE | 12/1/2010 15:37 | 1.95 | 130 | 108 |
| 22309 | TEA COSY RED STRIPE | 12/17/2010 15:57 | 2.55 | 66 | 48 |
| 22776 | SWEETHEART CAKESTAND 3 TIER | 12/1/2010 12:49 | 9.95 | 491 | 333 |
| 22614 | PACK OF 12 SPACEBOY TISSUES | 12/1/2010 13:24 | 0.29 | 179 | 133 |
| 21704 | BAG 250g SWIRLY MARBLES | 12/1/2010 17:35 | 0.85 | 201 | 137 |
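
Note that `random.choices` samples with replacement, so the same product can appear more than once in a single set of recommendations. One possible way to avoid that is weighted sampling without replacement. The sketch below is not part of the model above; it uses the Efraimidis-Spirakis scheme with the standard library only, and the function name `weighted_sample_without_replacement` is a hypothetical helper introduced here for illustration.

```python
import heapq
import random

def weighted_sample_without_replacement(weights, k, rng=random):
    """Return up to k distinct keys of `weights`, favouring large weights.

    Uses the Efraimidis-Spirakis scheme: each item gets the key
    u ** (1 / w) with u uniform in [0, 1), and the k largest keys win.
    """
    keyed = ((rng.random() ** (1.0 / w), item) for item, w in weights.items() if w > 0)
    return [item for _, item in heapq.nlargest(k, keyed)]

# UsersCount values taken from the output above.
users_count = {"22696": 108, "22309": 48, "22776": 333, "22614": 133, "21704": 137}
picks = weighted_sample_without_replacement(users_count, k=3, rng=random.Random(0))
print(picks)
```

Each product can be picked at most once, while products with a higher count are still more likely to be included.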


In [23]:
# The 5 top products are always the same:

product_rows = sales_count_rm.recommend_top_product_rows(k=n_recommendations)
product_rows

| StockCode | Description | InvoiceDate | UnitPrice | OrdersCount | UsersCount |
|---|---|---|---|---|---|
| 22423 | REGENCY CAKESTAND 3 TIER | 12/1/2010 12:27 | 10.95 | 1905 | 887 |
| 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 12/1/2010 8:26 | 2.55 | 2077 | 858 |
| 47566 | PARTY BUNTING | 12/3/2010 12:43 | 4.65 | 1416 | 708 |
| 84879 | ASSORTED COLOUR BIRD ORNAMENT | 12/1/2010 8:34 | 1.69 | 1418 | 679 |
| 22720 | SET OF 3 CAKE TINS PANTRY DESIGN | 12/13/2010 15:13 | 4.95 | 1232 | 640 |
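
The list above is ordered by `UsersCount`, the default measure. A minimal sketch, using only the five rows shown, of how the order changes under `OrdersCount`. It also shows that `sort_values` returns a new DataFrame rather than sorting in place, so the result has to be assigned.

```python
import pandas as pd

# The five rows from the output above (counts only).
top = pd.DataFrame(
    {
        "OrdersCount": [1905, 2077, 1416, 1418, 1232],
        "UsersCount": [887, 858, 708, 679, 640],
    },
    index=pd.Index(["22423", "85123A", "47566", "84879", "22720"], name="StockCode"),
)

# sort_values returns a new DataFrame; `top` itself keeps its original order.
by_orders = top.sort_values(by="OrdersCount", ascending=False)

print(list(top.index))
print(list(by_orders.index))
```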
