In [1]:
#used for sampling
import random 

#data handling
import numpy as np
import pandas as pd

#data visualisation
import matplotlib.pyplot as plt
#import matplotlib.patches as mpatches
#import seaborn as sns

#handling missing values where not dropped
#from sklearn.impute import SimpleImputer

#Encoding categorical data for transformation
#from sklearn import preprocessing
#from sklearn.compose import ColumnTransformer
#from sklearn.preprocessing import OneHotEncoder

# Transaction Dataset

## Import Dataset

The transactions dataset is over 3gb in size. This will be slow and unweidly if we imported it all. Instead we will sample it and import only some of the data

In [3]:
p = 0.005  # ~.5% of the random lines
# keep the header, then take only % of lines
# if random from [0,1] interval is greater than 0.01 the row will be skipped
transactions_train_df = pd.read_csv(
         "data/transactions_train.csv",
         header=0, 
         skiprows=lambda i: i>0 and random.random() > p
)

## Check for Missing Values

In [2]:
#check for missing values
def check_missing(x):
    # ref: https://stackoverflow.com/questions/59694988/python-pandas-dataframe-find-missing-values
    try:
        missing = x.isnull().sum()
        print(x.shape)
        print(missing)
    except:
        print("An issue occured. Check dataframe.")

In [4]:
check_missing(transactions_train_df)

(158683, 5)
t_dat               0
customer_id         0
article_id          0
price               0
sales_channel_id    0
dtype: int64


## Check result

In [5]:
transactions_train_df.head()

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,00a4f2c8cc87ffb0ea7db47b4b0b247b9ef11033c8d976...,610097001,0.028797,2
1,2018-09-20,023b48de81f6af9de6bef17e720e3873ce40df3f561c97...,651863004,0.005068,2
2,2018-09-20,029ceb992cb63df03c109790046e3fdebfce0b63c96882...,660618001,0.016712,2
3,2018-09-20,02d796ea767fa2e94fc6228fe70d8af1a570da973c32f7...,634037002,0.041814,2
4,2018-09-20,03de367bf4b373ddf9963d3d7347a5dca4a2aa99df2065...,680492001,0.050831,2


# Transaction Dataset

## Import Dataset

In [6]:
articles_df = pd.read_csv("data/articles.csv")

## Check for Missing Values

In [7]:
check_missing(articles_df)

(105542, 25)
article_id                        0
product_code                      0
prod_name                         0
product_type_no                   0
product_type_name                 0
product_group_name                0
graphical_appearance_no           0
graphical_appearance_name         0
colour_group_code                 0
colour_group_name                 0
perceived_colour_value_id         0
perceived_colour_value_name       0
perceived_colour_master_id        0
perceived_colour_master_name      0
department_no                     0
department_name                   0
index_code                        0
index_name                        0
index_group_no                    0
index_group_name                  0
section_no                        0
section_name                      0
garment_group_no                  0
garment_group_name                0
detail_desc                     416
dtype: int64


In [8]:
articles_df.tail()

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
105537,953450001,953450,5pk regular Placement1,302,Socks,Socks & Tights,1010014,Placement print,9,Black,...,Socks Bin,F,Menswear,3,Menswear,26,Men Underwear,1021,Socks and Tights,Socks in a fine-knit cotton blend with a small...
105538,953763001,953763,SPORT Malaga tank,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey,A,Ladieswear,1,Ladieswear,2,H&M+,1005,Jersey Fancy,Loose-fitting sports vest top in ribbed fast-d...
105539,956217002,956217,Cartwheel dress,265,Dress,Garment Full body,1010016,Solid,9,Black,...,Jersey,A,Ladieswear,1,Ladieswear,18,Womens Trend,1005,Jersey Fancy,"Short, A-line dress in jersey with a round nec..."
105540,957375001,957375,CLAIRE HAIR CLAW,72,Hair clip,Accessories,1010016,Solid,9,Black,...,Small Accessories,D,Divided,2,Divided,52,Divided Accessories,1019,Accessories,Large plastic hair claw.
105541,959461001,959461,Lounge dress,265,Dress,Garment Full body,1010016,Solid,11,Off White,...,Jersey,A,Ladieswear,1,Ladieswear,18,Womens Trend,1005,Jersey Fancy,Calf-length dress in ribbed jersey made from a...


# Simple Recommendation System

In a perfect scenario for recommendation systems we would have a table of m users by n items, with each product given a rating r_ij by each user. However We could have hundreds of users and thousands of products. A user may not have tried every product so our table would have missing values. To solve this issue we could predict the value for missing cells (rhat_ij). A good prediction would mean a good recommendation to the user. Another way to make recommendations to users would be to rank the top k products for each user based on information we have available on products and users.In essence the prediction problem boils down to how we rate products. 

In order for us to rate products we will first need some metric to rate them by. Secondly, we will decide the prerequisites that a product must meet to recieve a rating. Thirdly we will calculate the score for each item that satisfies the prerequisites and finally we will output a list of items in decreasing order.

In our Transaction dataset we have customers who bought products at a particular price and time. We can use these features to create a simple collaborative model recommendation system that has a list of products with a popularity rating. 

We will create a new table for each item, how many customers bought the item and at what price. We will then rank each product on how many were sold. Before we do this, must consider an issue that will arrise with this recommendation system. If 100 customers bought product A and 10 customers bought product B but both products cost ten euros, our system would rate these two products with the same popularity rating. This is not ideal since the product that has more customer purchases would be more popular. We must instead use a weighted rating (WR) that takes into account the number of customers who bought a product. 

Our weighted formula can be defined as:

WR $= \frac{v}{v+m} x R + \frac{m}{v+m} x C$<br><br>
Where:<br>
&emsp; v = # of customers who bought the item<br>
&emsp; m = minimum number of customers who need to buy the item for it to appear in the list<br>
&emsp; R = the mean item rating of the product<br>
&emsp; C - the mean qty of all customers who bought items in the dataset<br>


In [9]:
df = transactions_train_df.drop(['t_dat', 'sales_channel_id'], axis=1)

pop_prod = df['article_id'].value_counts().reset_index()
pop_prod.columns = ['article_id', 'customers']
pop_df = pop_prod.merge(transactions_train_df, how = 'inner', on = ['article_id']).drop(['t_dat', 'customer_id', 'sales_channel_id'], axis=1)

m = pop_df['customers'].quantile(0.75) #only show products that at least 14 customers purchased
C = pop_df['customers'].median()
R = pop_df['price'].median() # we don't have a user rating e.g 6.2/10. We just use the price.

def weighted_rating(x, m=m, C=C, R=R):
    v = x['customers']
    return (v /(v + m) * R) + (m / (m + v) * C)

pop_df = pop_df.copy().loc[pop_df['customers'] >= m] #get qty greater than m
pop_df['rating'] = pop_df.apply(weighted_rating, axis=1)

q_df = pop_df.sort_values('rating', ascending=False)
q_df.head(12)

Unnamed: 0,article_id,customers,price,rating
39749,832036001,14,0.016322,3.012703
37561,745219001,14,0.022017,3.012703
37571,854678005,14,0.008458,3.012703
37570,854678005,14,0.006458,3.012703
37569,854678005,14,0.007186,3.012703
37568,854678005,14,0.007186,3.012703
37567,854678005,14,0.007898,3.012703
37566,854678005,14,0.007186,3.012703
37565,745219001,14,0.002525,3.012703
37564,745219001,14,0.022017,3.012703


In [10]:
pd.options.display.max_colwidth = 50

def get_product_info(df, n):
    if n >= 20:#cap to stop entry error issues
        n = 20
        
    for i in range(0, n):
        item_group = articles_df.loc[articles_df['article_id'].eq(q_df['article_id'].iloc(0)[i]), 'index_group_name']
        item_name = articles_df.loc[articles_df['article_id'].eq(q_df['article_id'].iloc(0)[i]), 'prod_name']
        item_desc = articles_df.loc[articles_df['article_id'].eq(q_df['article_id'].iloc(0)[i]), 'detail_desc']
        print(item_group.values[0] + "\n")
        print(item_name.values[0] + "\n")
        print(item_desc.values[0] + "\n")
    
get_product_info(q_df, 12)

Ladieswear

Mimosa padded softbra P2

Soft, non-wired bra in lace with adjustable shoulder straps and mesh-lined, triangular cups with removable inserts that shape the bust and provide good support. Elastic hem, and a hook-and-eye fastening at the back.

Divided

MINT TANK S.0

Short, flared top in a crinkled viscose weave in a narrow cut at the top with short shoulder straps, an opening and button at the back of the neck and a trim around the arm openings.

Ladieswear

C Lolly Bandeau

Lined, bandeau bikini top with side support and cups with removable inserts that shape the bust and provide good support. No fasteners. The polyester content of the outer fabric is partly recycled. The polyester content of the lining is recycled.

Ladieswear

C Lolly Bandeau

Lined, bandeau bikini top with side support and cups with removable inserts that shape the bust and provide good support. No fasteners. The polyester content of the outer fabric is partly recycled. The polyester content of the lini

# Knowledge Recommendation System

Some products may be rarely purchased or only purchased once in a life time (e.g a wedding dress). In this case it is difficult to include these in a simple popularity recommendation system. We could instead build a knowledge recommender ontop of our popularity one by getting custom requirements from the customer such as colour and clothing type (e.g blue shirt) then return results that match.

We will ask the user a series of questions and filter our table based on that. We will then rank the remaining items and display 12 recommendations to the user. 

In [11]:
popk_df = q_df.merge(articles_df, how = 'inner', on = ['article_id']).drop([
    'product_code',
    'prod_name',
    'product_type_no',
    'product_type_name',
    'graphical_appearance_no',
    'graphical_appearance_name',
    'colour_group_code',
    'perceived_colour_value_id',
    'perceived_colour_value_name',
    'perceived_colour_master_id',
    'perceived_colour_master_name',
    'department_no',
    'department_name',
    'index_code',
    'index_name',
    'index_group_no',
    'section_no',
    'section_name',
    'garment_group_no',
    'detail_desc'], axis=1)

In [12]:
popk_df.head()

Unnamed: 0,article_id,customers,price,rating,product_group_name,colour_group_name,index_group_name,garment_group_name
0,832036001,14,0.016322,3.012703,Underwear,Black,Ladieswear,"Under-, Nightwear"
1,832036001,14,0.010661,3.012703,Underwear,Black,Ladieswear,"Under-, Nightwear"
2,832036001,14,0.016932,3.012703,Underwear,Black,Ladieswear,"Under-, Nightwear"
3,832036001,14,0.011847,3.012703,Underwear,Black,Ladieswear,"Under-, Nightwear"
4,832036001,14,0.015237,3.012703,Underwear,Black,Ladieswear,"Under-, Nightwear"


In [13]:
def build_knowledge_recommendation(p_df, percentile=0.9):
    print("Input preferred group (e.g Ladieswear):")
    group = str(input())

    print("Input preferred Product (e.g Garment Upper body):")
    product = str(input())
    
    print("Input preferred Colour (e.g Dark Grey):")
    colour = (input())
    
    print("Input budget Low Price (e.g €2.00) €: ")
    low_price = float(input())
    low_price = low_price/100
    
    print("Input budget Low Price (e.g €4.00) €: ")
    high_price = float(input())
    high_price = high_price/100
    
    #Define a new movies variable to store the preferred products. Copy the
    #contents of p_df to movies
    pre_df = p_df.copy()
    
    #Filter based on the condition
    pre_df = pre_df[(pre_df['index_group_name'] == group) &
                    (pre_df['product_group_name'] == product) &
                    (pre_df['colour_group_name'] == colour) &
                    (pre_df['price'] >= low_price) &
                    (pre_df['price'] <= high_price)]
    
    #Compute the values of C and m for the filtered products
    m = pre_df['customers'].quantile(percentile)
    C = pre_df['customers'].median()
    R = pre_df['price'].median() # we don't have a user rating e.g 6.2/10. We just use the pricec.
    
    #Only consider products that have higher than m qty.
    q2_df = pre_df.copy().loc[pre_df['customers'] >= m]
    
    #Calculate score using the weighted rating formula
    q2_df['rating'] = q2_df.apply(weighted_rating, axis=1)

    #Sort products in descending order of their scores
    q2_df = q2_df.sort_values('rating', ascending=False)
    return q2_df

In [14]:
pref_df = build_knowledge_recommendation(popk_df)

Input preferred group (e.g Ladieswear):
Ladieswear
Input preferred Product (e.g Garment Upper body):
Garment Upper body
Input preferred Colour (e.g Dark Grey):
Dark Grey
Input budget Low Price (e.g €2.00) €: 
2.00
Input budget Low Price (e.g €4.00) €: 
4.00


In [16]:
get_product_info(pref_df, 12)

Ladieswear

Mimosa padded softbra P2

Soft, non-wired bra in lace with adjustable shoulder straps and mesh-lined, triangular cups with removable inserts that shape the bust and provide good support. Elastic hem, and a hook-and-eye fastening at the back.

Divided

MINT TANK S.0

Short, flared top in a crinkled viscose weave in a narrow cut at the top with short shoulder straps, an opening and button at the back of the neck and a trim around the arm openings.

Ladieswear

C Lolly Bandeau

Lined, bandeau bikini top with side support and cups with removable inserts that shape the bust and provide good support. No fasteners. The polyester content of the outer fabric is partly recycled. The polyester content of the lining is recycled.

Ladieswear

C Lolly Bandeau

Lined, bandeau bikini top with side support and cups with removable inserts that shape the bust and provide good support. No fasteners. The polyester content of the outer fabric is partly recycled. The polyester content of the lini

In [None]:
Item Description Based Recommender