# Content-based filtering

To perform content-based filtering, I am going to use a single text column of the articles dataframe as a first approach. The intended column is prod_name, which contains the names of the articles; note that the code below vectorizes product_group_name instead.

First, I import the datasets needed for the recommendation. The text features are built with the TfidfVectorizer provided by sklearn.

In [1]:
import pandas as pd
import numpy as np

In [236]:
df_articles = pd.read_csv("../data/h-and-m-personalized-fashion-recommendations/articles.csv", dtype={'article_id': 'string'})
df_customers = pd.read_csv("../data/h-and-m-personalized-fashion-recommendations/customers.csv")
df_sample = pd.read_csv("../data/h-and-m-personalized-fashion-recommendations/sample_submission.csv")
df_transactions = pd.read_csv("../data/h-and-m-personalized-fashion-recommendations/transactions_train.csv", dtype={'article_id': 'string'}, parse_dates=[0])

In [237]:
df_articles_20k = df_articles.head(20000)
df_articles_20k

Unnamed: 0,article_id,product_code,prod_name,product_type_no,product_type_name,product_group_name,graphical_appearance_no,graphical_appearance_name,colour_group_code,colour_group_name,...,department_name,index_code,index_name,index_group_no,index_group_name,section_no,section_name,garment_group_no,garment_group_name,detail_desc
0,0108775015,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,9,Black,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
1,0108775044,108775,Strap top,253,Vest top,Garment Upper body,1010016,Solid,10,White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
2,0108775051,108775,Strap top (1),253,Vest top,Garment Upper body,1010017,Stripe,11,Off White,...,Jersey Basic,A,Ladieswear,1,Ladieswear,16,Womens Everyday Basics,1002,Jersey Basic,Jersey top with narrow shoulder straps.
3,0110065001,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,9,Black,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
4,0110065002,110065,OP T-shirt (Idro),306,Bra,Underwear,1010016,Solid,10,White,...,Clean Lingerie,B,Lingeries/Tights,1,Ladieswear,61,Womens Lingerie,1017,"Under-, Nightwear","Microfibre T-shirt bra with underwired, moulde..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,0586608009,586608,Malibu,274,Shorts,Garment Lower body,1010001,All over pattern,73,Dark Blue,...,Kids Boy Shorts,H,Children Sizes 92-140,4,Baby/Children,46,Kids Boy,1025,Shorts,Cotton poplin shorts with an elasticated draws...
19996,0586645001,586645,Robyn cargo,262,Jacket,Garment Upper body,1010016,Solid,19,Greenish Khaki,...,Woven Tops,A,Ladieswear,1,Ladieswear,8,Mama,1010,Blouses,"Cotton parka with a high collar, shoulder tabs..."
19997,0586648001,586648,Isabella,258,Blouse,Garment Upper body,1010016,Solid,53,Dark Pink,...,Woven Tops,A,Ladieswear,1,Ladieswear,8,Mama,1010,Blouses,Blouse in airy plumeti chiffon with a stand-up...
19998,0586651002,586651,Viggo jogger,272,Trousers,Garment Lower body,1010016,Solid,9,Black,...,Young Boy Trouser,I,Children Sizes 134-170,4,Baby/Children,47,Young Boy,1009,Trousers,Pull-on trousers in cotton twill with an elast...


In [238]:
from sklearn.feature_extraction.text import TfidfVectorizer

#Define a TF-IDF Vectorizer Object. Remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df_articles_20k['product_group_name'])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(20000, 14)

As the output above shows, 14 distinct words are used to describe the first 20000 articles in the dataframe. The number of words depends on which column is passed to the vectorizer (here "product_group_name") and on how many articles are included (here 20000).
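To make the dependence on the chosen column concrete, here is a minimal sketch on a toy frame. The four rows reuse article names and product groups from the table above; `toy_articles` and `shapes` are names introduced only for this illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the articles dataframe (four rows taken from the table above)
toy_articles = pd.DataFrame({
    "prod_name": ["Strap top", "OP T-shirt (Idro)", "Robyn cargo", "Viggo jogger"],
    "product_group_name": ["Garment Upper body", "Underwear",
                           "Garment Upper body", "Garment Lower body"],
})

shapes = {}
for column in ["prod_name", "product_group_name"]:
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(toy_articles[column])
    # One row per article, one column per distinct token that survived
    shapes[column] = matrix.shape
    print(column, matrix.shape, sorted(vectorizer.vocabulary_))
```

The second entry of each shape is the vocabulary size, so switching the column changes the width of the TF-IDF matrix while the number of rows stays fixed.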

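After calculating the tfidf_matrix, we need to compute the similarity between the articles. For this we use the cosine similarity (other possible metrics are Euclidean distance or Pearson correlation). Because TfidfVectorizer L2-normalizes every row by default, the dot product of two rows already equals their cosine similarity, so we use the linear_kernel function provided by sklearn instead of cosine_similarity: it gives the same values and skips the redundant normalization.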

In [239]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
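
The equivalence between linear_kernel and cosine_similarity on TF-IDF rows can be checked on a small example. This is a sketch on toy strings (`docs` is illustrative, not the real column), and it relies on the default `norm="l2"` of TfidfVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = ["Garment Upper body", "Garment Lower body", "Underwear", "Garment Upper body"]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

# With the default norm="l2", every non-empty row has unit length
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(row_norms)

# For unit-length rows the plain dot product is the cosine similarity
dot = linear_kernel(X, X)
cos = cosine_similarity(X, X)
print(np.allclose(dot, cos))
```

If the vectorizer were created with `norm=None`, the two functions would no longer agree and cosine_similarity would be the one to use.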

In [240]:
tfidf_matrix

<20000x14 sparse matrix of type '<class 'numpy.float64'>'
	with 46511 stored elements in Compressed Sparse Row format>

In [241]:
print(tfidf_matrix[1])

  (0, 1)	0.5101112443555542
  (0, 13)	0.6925121203011946
  (0, 3)	0.5101112443555542


In [242]:
cosine_sim[0].shape

(20000,)

In [243]:
print(cosine_sim)
print("---"*30)
print("Shape of cosine_sim: " + str(cosine_sim.shape))

[[1.         1.         1.         ... 1.         0.43144042 0.43144042]
 [1.         1.         1.         ... 1.         0.43144042 0.43144042]
 [1.         1.         1.         ... 1.         0.43144042 0.43144042]
 ...
 [1.         1.         1.         ... 1.         0.43144042 0.43144042]
 [0.43144042 0.43144042 0.43144042 ... 0.43144042 1.         1.        ]
 [0.43144042 0.43144042 0.43144042 ... 0.43144042 1.         1.        ]]
------------------------------------------------------------------------------------------
Shape of cosine_sim: (20000, 20000)


In [244]:
#Construct a reverse map of indices and article_id
indices = pd.Series(df_articles_20k.index, index=df_articles_20k['article_id']).drop_duplicates()
print(indices)

article_id
0108775015        0
0108775044        1
0108775051        2
0110065001        3
0110065002        4
              ...  
0586608009    19995
0586645001    19996
0586648001    19997
0586651002    19998
0586651005    19999
Length: 20000, dtype: int64


In [245]:
indices.loc[random_item].values

array([9183])

In [269]:
# Function that takes in article_id as input and outputs most similar articles
def get_recommendations(article_id, cosine_sim=cosine_sim):
    # Get the row index of the article that matches article_id
    # (raises a KeyError if the article is not among the indexed articles)
    idx = indices.loc[article_id]

    # Get the pairwise similarity scores of all articles with that article
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the articles based on the similarity scores (stable sort, highest first)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 12 entries after the first one. The first entry is meant to be
    # the searched article itself; with tied scores it can be another article.
    sim_scores = sim_scores[1:13]

    # Get the article indices
    article_indices = [i[0] for i in sim_scores]

    # Return the top 12 most similar articles
    return ' '.join(df_articles_20k['article_id'].iloc[article_indices].to_list())

In [270]:
random_row = df_articles_20k.sample().reset_index()
#random_item = int(random_row.iloc[0][1])
random_item = random_row.article_id

# \033[4m and \033[0m is used to underline a string
print(f"Recommended items for item \033[4m{random_item}\033[0m:")
get_recommendations(random_item)
#print(type(xxx))

Recommended items for item 0    0583928001
Name: article_id, dtype: string:


''
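
The empty string above is consistent with `random_item` being a one-element Series rather than a plain article id, which is also why the printed header shows the Series repr. A minimal sketch of the difference, assuming toy stand-ins (`toy_articles`, `toy_indices` and `toy_sim` are hypothetical and only mimic the shapes involved):

```python
import numpy as np
import pandas as pd

toy_articles = pd.DataFrame(
    {"article_id": pd.array(["0001", "0002", "0003"], dtype="string")}
)
toy_indices = pd.Series(toy_articles.index, index=toy_articles["article_id"])
toy_sim = np.eye(3)  # placeholder similarity matrix

random_row = toy_articles.sample(random_state=0).reset_index()
as_series = random_row.article_id          # one-element Series, as in the cell above
as_scalar = random_row.article_id.iloc[0]  # plain Python string

# A Series key selects a (1, n) block; enumerate() then yields a single row,
# so the slice [1:13] leaves nothing to recommend.
block = toy_sim[toy_indices.loc[as_series]]
print(block.shape)

# A scalar key selects one row of length n, which is what the function expects.
row = toy_sim[toy_indices.loc[as_scalar]]
print(row.shape)
```

Passing `random_row.article_id.iloc[0]` to get_recommendations would therefore be one way to get a non-empty result for a randomly drawn article.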

In [273]:
get_recommendations('0568601043')

'0108775044 0108775051 0112679048 0112679052 0116379047 0145872001 0145872037 0145872043 0145872051 0145872052 0145872053 0146706001'
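
With a coarse column such as product_group_name, many articles tie at a similarity of 1.0, so dropping the first sorted entry does not necessarily drop the queried article. The sketch below is one possible variant, not the notebook's method: it masks the query index explicitly and computes only the one similarity row it needs. The frame `toy`, the ids and the helper `recommend_excluding_query` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy stand-in for df_articles_20k (hypothetical ids)
toy = pd.DataFrame({
    "article_id": ["0001", "0002", "0003", "0004"],
    "product_group_name": ["Garment Upper body", "Garment Upper body",
                           "Garment Lower body", "Garment Upper body"],
})
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy["product_group_name"])
toy_index = pd.Series(toy.index, index=toy["article_id"])

def recommend_excluding_query(article_id, k=2):
    idx = toy_index.loc[article_id]
    # Similarity of the query article against all articles, computed on demand
    scores = linear_kernel(toy_matrix[idx], toy_matrix).ravel()
    # Mask the query itself so it can never be recommended
    scores[idx] = -np.inf
    # Stable sort keeps the original article order among tied scores
    top = np.argsort(-scores, kind="stable")[:k]
    return toy["article_id"].iloc[top].to_list()

result = recommend_excluding_query("0004")
print(result)
```

Computing a single row per query also avoids holding the full 20000 x 20000 float64 matrix (about 3.2 GB) in memory.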

In [248]:
item_ids = df_articles_20k.article_id.to_list()

In [249]:
# compute wardrobe and dates for all customers, remove item ids that can't be predicted
df_transactions.t_dat = df_transactions.t_dat.dt.strftime('%Y-%m-%d')
wardrobe = df_transactions[df_transactions.article_id.isin(item_ids)].groupby('customer_id').article_id.agg(lambda id: ' '.join(id))
wardrobe_dates = df_transactions[df_transactions.article_id.isin(item_ids)].groupby('customer_id').t_dat.agg(lambda id: ' '.join(id))
df_wardrobe = pd.DataFrame({'dates':wardrobe_dates.astype('string'), 'wardrobe':wardrobe})

In [250]:
df_wardrobe = df_customers.join(df_wardrobe, on='customer_id', how='left').set_index('customer_id')

In [251]:
df_wardrobe.wardrobe = df_wardrobe.fillna('').wardrobe
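
The wardrobe construction above can be followed on a toy example. The article ids are taken from the outputs in this notebook, while the customer ids `c1` to `c3` and the list `known_items` are made up for illustration.

```python
import pandas as pd

toy_tx = pd.DataFrame({
    "t_dat": ["2018-12-27", "2019-05-25", "2019-10-01", "2020-09-05"],
    "customer_id": ["c1", "c1", "c2", "c1"],
    "article_id": ["0176209023", "0568601006", "0399061015", "0568601043"],
})
toy_customers = pd.DataFrame({"customer_id": ["c1", "c2", "c3"]})

# Pretend that c2's article is not among the articles we can predict
known_items = ["0176209023", "0568601006", "0568601043"]

# Keep predictable articles only and join each customer's purchases in order
kept = toy_tx[toy_tx.article_id.isin(known_items)]
toy_wardrobe = kept.groupby("customer_id").article_id.agg(" ".join).rename("wardrobe")

# Left join so that customers without (kept) purchases stay in the frame
toy = toy_customers.join(toy_wardrobe, on="customer_id", how="left").set_index("customer_id")
toy["wardrobe"] = toy["wardrobe"].fillna("")
print(toy)
```

Customers `c2` and `c3` end up with an empty wardrobe string, which is the cold-start case handled further down.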

In [252]:
baseline = '0706016001 0706016002 0372860001 0610776002 0759871002 0464297007 0372860002 0610776001 0399223001 0706016003 0720125001 0156231001'.split(' ')

In [256]:
df_wardrobe.head()

customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,dates,wardrobe
00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...,2018-12-27 2019-05-25 2019-05-25 2020-09-05,0176209023 0568601006 0568601006 0568601043
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...,2018-09-21 2018-09-25 2018-09-27 2019-05-07 20...,0583558001 0521269001 0583558001 0351484002 04...
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...,2018-09-20 2019-03-01 2020-02-03 2020-02-03,0541518023 0578020002 0351484002 0351484002
00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2c5feb1ca5dff07c43e,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...,,
00006413d8573cd20ed7128e53b7b13819fe5cfc2d801fe7fc0f26dd8d65a85a,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...,2019-10-01 2019-10-09,0399061015 0399061015


In [257]:
def predict_from_wardrobe(user_id, df_wardrobe, baseline, k=12):
    wardrobe = df_wardrobe.loc[user_id, 'wardrobe']
    if wardrobe == '':
        # cold start
        prediction = baseline
    else:
        # convert wardrobe to list with newest items first
        wardrobe = wardrobe.split(' ')[::-1]
        n_items = len(wardrobe)
#        i = 0
        # simplest most possible way: predict k items based on latest purchase
        item_id = wardrobe[0]
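        try:
            prediction = get_recommendations(item_id).split(' ')
        except KeyError:
            # item is not in the similarity index: fall back to the baseline
            # instead of leaving `prediction` undefined
            print(item_id)
            prediction = baseline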
#        q = max(k//n_items, 4)
#        while (len(prediction) < k) and (i < n_items):
#            item_id = wardrobe[i]
#            best_items = recommend_from_item(item_id, item_id_map, item_id_map_rev, D_csr, k=12).split(' ')
#            while (len(prediction) < min(q, k)) and len(best_items) > 0:
#                item_id = best_items.pop(0)
                # recommend item if its not in wardrobe and hasn't already been recommended
#                if (item_id not in prediction) and (item_id not in wardrobe):
#                    prediction.append(item_id)
#            i += 1
#            q += q        
#        if len(prediction) < k:
#            print('Warning: incomplete predictions...')
    return ' '.join(prediction)
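
The commented-out lines in the function hint at filling the k slots from several wardrobe items rather than only the latest one. The following is a simplified sketch of that idea, not a reconstruction of the commented code: it has no per-item quota, and `fill_from_wardrobe`, the `recommend` callable and the toy data are all hypothetical.

```python
def fill_from_wardrobe(wardrobe, recommend, baseline, k=12):
    """Collect up to k unseen items, walking the wardrobe newest-first."""
    prediction = []
    for item_id in wardrobe:
        for candidate in recommend(item_id):
            # Skip items the customer already owns or that are already picked
            if candidate in wardrobe or candidate in prediction:
                continue
            prediction.append(candidate)
            if len(prediction) == k:
                return prediction
    # Pad with baseline items if the wardrobe did not yield enough candidates
    for candidate in baseline:
        if len(prediction) == k:
            break
        if candidate not in prediction:
            prediction.append(candidate)
    return prediction


# Toy usage: `toy_recs` plays the role of get_recommendations(...).split(' ')
toy_recs = {"b": ["a", "c", "d"], "a": ["b", "e", "f"]}
filled = fill_from_wardrobe(
    wardrobe=["b", "a"],                      # newest item first
    recommend=lambda item: toy_recs.get(item, []),
    baseline=["x", "y", "c"],
    k=5,
)
print(filled)
```

Items already owned are skipped, and the baseline is only used to pad the list when the wardrobe-based candidates run out.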

In [274]:
# make predictions for all users
tmp = df_wardrobe.copy()
tmp2 = tmp.reset_index().customer_id.apply(lambda id: predict_from_wardrobe(id, df_wardrobe, baseline, k=12))
tmp2 = pd.DataFrame({'customer_id':df_wardrobe.reset_index().customer_id, 'prediction':tmp2}).set_index('customer_id')
df_wardrobe['prediction'] = tmp2.prediction

In [275]:
df_wardrobe.head()

customer_id,FN,Active,club_member_status,fashion_news_frequency,age,postal_code,dates,wardrobe,prediction
00000dbacae5abe5e23885899a1fa44253a17956c6d1c3d25f88aa139fdfc657,,,ACTIVE,NONE,49.0,52043ee2162cf5aa7ee79974281641c6f11a68d276429a...,2018-12-27 2019-05-25 2019-05-25 2020-09-05,0176209023 0568601006 0568601006 0568601043,0108775044 0108775051 0112679048 0112679052 01...
0000423b00ade91418cceaf3b26c6af3dd342b51fd051eec9c12fb36984420fa,,,ACTIVE,NONE,25.0,2973abc54daa8a5f8ccfe9362140c63247c5eee03f1d93...,2018-09-21 2018-09-25 2018-09-27 2019-05-07 20...,0583558001 0521269001 0583558001 0351484002 04...,0184123020 0188183001 0188183008 0188183009 01...
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318,,,ACTIVE,NONE,24.0,64f17e6a330a85798e4998f62d0930d14db8db1c054af6...,2018-09-20 2019-03-01 2020-02-03 2020-02-03,0541518023 0578020002 0351484002 0351484002,0184123020 0188183001 0188183008 0188183009 01...
00005ca1c9ed5f5146b52ac8639a40ca9d57aeff4d1bd2c5feb1ca5dff07c43e,,,ACTIVE,NONE,54.0,5d36574f52495e81f019b680c843c443bd343d5ca5b1c2...,,,0706016001 0706016002 0372860001 0610776002 07...
00006413d8573cd20ed7128e53b7b13819fe5cfc2d801fe7fc0f26dd8d65a85a,1.0,1.0,ACTIVE,Regularly,52.0,25fa5ddee9aac01b35208d01736e57942317d756b32ddd...,2019-10-01 2019-10-09,0399061015 0399061015,0108775044 0108775051 0112679048 0112679052 01...


In [276]:
# make frame containing all available individualized recommendations and join with customer table
# submission = pd.DataFrame({'prediction':tmp}, index=user_ids)
# submission = df_customers.join(submission, on='customer_id', how='left').set_index('customer_id')
# now fill empty predictions with baseline
# baseline_prediction = '0706016001 0706016002 0372860001 0610776002 0759871002 0464297007 0372860002 0610776001 0399223001 0706016003 0720125001 0156231001'
# submission.fillna(baseline_prediction, inplace=True)

In [277]:
# save to csv
df_wardrobe.loc[:, 'prediction'].to_csv('../data/content_based_sinan_20k.csv')

----

In [171]:
# reduce data, WARNING: more than 100k samples will take ages!
df = df_transactions.sample(100000, random_state=42)
# create new column filled with ones
df['rating'] = 1
# now have rating be the number of times a customer bought an item
# (summing only 'rating' avoids aggregating the date, price and channel columns)
df = df.groupby(['customer_id', 'article_id'], as_index=False)['rating'].sum()
# cap ratings at 5 so that they lie between 1 and 5 (very naive method)
df['rating'] = df['rating'].clip(upper=5)
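
A small worked example of the purchase-count rating, assuming a toy frame (`toy_purchases` and the ids in it are made up): a customer who bought the same article seven times ends up with the capped rating 5.

```python
import pandas as pd

toy_purchases = pd.DataFrame({
    "customer_id": ["c1"] * 7 + ["c2"],
    "article_id": ["a"] * 7 + ["b"],
})

# One unit per purchase row, summed per (customer, article) pair
toy_purchases["rating"] = 1
ratings = toy_purchases.groupby(["customer_id", "article_id"], as_index=False)["rating"].sum()

# Cap the purchase count at 5
ratings["rating"] = ratings["rating"].clip(upper=5)
print(ratings)
```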

In [173]:
df.head(50)

Unnamed: 0,customer_id,article_id,rating
0,00006413d8573cd20ed7128e53b7b13819fe5cfc2d801f...,399061015,1
1,00007d2de826758b65a93dd24ce629ed66842531df6699...,721257001,1
2,0000f1c71aafe5963c3d195cf273f7bfd50bbf17761c91...,612075001,1
3,0001ab2ebc1bb9a21d135e2fefdb11f12bee5c74ab2984...,742947001,1
4,0001d44dbe7f6c4b35200abdb052c77a87596fe1bdcc37...,756582001,1
5,0001f8cef6b9702d54abf66fd89eb21014bf98567065a9...,699867001,1
6,00023e3dd8618bc63ccad995a5ac62e21177338d642d66...,772927001,1
7,00035e92a9bd02a8e28e2e59721663fdeb39bfd57c4860...,673718001,1
8,0003abe64294e66a6310c3436fa9e5b754cc5603deef4f...,579541026,1
9,0003e867a930d0d6842f923d6ba7c9b77aba33fe2a0fbf...,377277001,1


In [134]:
def calc_basketsize(purchases):
    """Build a dataframe of basket sizes from a dataframe of single purchases
    (e.g. the dataframe read from transactions_train.csv).
    Assumption: all purchases of one customer on one day form one order.

    Args:
        purchases (pd.DataFrame): One purchased article per row, with the
            columns 't_dat' and 'customer_id'.

    Returns:
        pd.DataFrame: One row per (datetime, customer_id) with the number of
            purchased articles in the column 'basketsize'.
    """
    # Group on a parsed date key without adding a column to the caller's dataframe
    order_date = pd.to_datetime(purchases['t_dat']).rename('datetime')

    orderbaskets = (
        purchases.groupby([order_date, 'customer_id'])
        .size()
        .reset_index(name='basketsize')
    )

    return orderbaskets

In [135]:
test_df = calc_basketsize(df_transactions)

In [153]:
test_df.head(50)

Unnamed: 0,datetime,customer_id,basketsize
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,2
1,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,5
2,2018-09-20,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,5
3,2018-09-20,0008968c0d451dbc5a9968da03196fe20051965edde741...,2
4,2018-09-20,000aa7f0dc06cd7174389e76c9e132a67860c5f65f9706...,30
5,2018-09-20,001127bffdda108579e6cb16080440e89bf1250a776c6e...,1
6,2018-09-20,001ea4e9c54f7e9c88811260d954edc059d596147e1cf8...,2
7,2018-09-20,001fd23db1109a94bba1319bb73df0b479059027c182da...,2
8,2018-09-20,0021da829b898f82269fc51feded4eac2129058ee95bd7...,4
9,2018-09-20,00228762ecff5b8d1ea6a2e52b96dafa198febddbc3bf3...,1


In [163]:
customer_id = test_df.customer_id.drop_duplicates()

In [166]:
print(customer_id.shape)
print(customer_id[0])

(1362281,)
000058a12d5b43e67d225668fa1f8d618c13dc232df0cad8ffe7ad4a1091e318


In [169]:
one_customer = test_df[test_df["customer_id"] == customer_id[2]]

In [170]:
one_customer

Unnamed: 0,datetime,customer_id,basketsize
2,2018-09-20,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,5
445471,2018-10-24,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,9
480456,2018-10-26,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,2
528168,2018-10-29,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,1
1589387,2019-01-23,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,2
1800341,2019-02-10,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,7
6891193,2020-04-01,00083cda041544b2fbb0e0d2905ad17da7cf1007526fb4...,1
