# Personalized Recommendation System
A personalized recommendation system trained on raw clickstream data from an e-commerce website.

## Approach
I treated this as a classic recommender-system problem solved with matrix factorization, using a hybrid of collaborative and content-based filtering.

Session-based recommendations built with transformer or RNN architectures would be a more modern and robust approach, but given the limited time I chose a simpler one.

The dataset also has a fairly small number of unique users and items, which makes this approach a reasonable fit.

## Solution
In this notebook we create a user-item interaction matrix, train a recommendation model, evaluate it, and then run inference.

LightFM was chosen for this task. It is a simple implementation of popular recommendation algorithms and lets us apply matrix factorization while incorporating user and item metadata.

In our problem, each item gets a feature representation built from the categories it belongs to, and each user gets a feature representation built from the categories they showed interest in.

With this approach, the model uses user- and item-specific data in addition to the user-item interaction matrix. This improves performance considerably and leaves room for future improvements.
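
As a rough illustration of what "incorporating metadata" means, the sketch below builds an item representation as the sum of the embeddings of its active features (its own identity feature plus its categories) and scores it against a user vector with a dot product. The embedding values, the feature name `item:bag_1`, and the helper `represent` are made up for illustration; LightFM learns such embeddings during training and also adds bias terms, which are omitted here.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings, one per feature (values are made up).
feature_embeddings = {
    "item:bag_1": np.array([0.1, 0.0]),         # identity feature of the item
    "kadın çanta": np.array([0.5, 0.2]),        # category feature
    "omuz askılı çanta": np.array([0.3, 0.4]),  # category feature
}

def represent(active_features):
    """Sum the embeddings of all features that are active for one entity."""
    return np.sum([feature_embeddings[f] for f in active_features], axis=0)

item_vector = represent(["item:bag_1", "kadın çanta", "omuz askılı çanta"])
user_vector = np.array([1.0, 2.0])  # hypothetical user representation

# The affinity score is the dot product of the two representations.
score = float(user_vector @ item_vector)
print(item_vector, score)
```

Because two items that share categories also share part of their representation, an item with few interactions can still borrow signal from its categories.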


## Result
Because this is a hybrid model, I built two applications for now. The first recommends items to existing users based on their previous interactions. The second gives item-based recommendations, either for new users (the cold-start problem) or for a separate section such as 'similar items'.
LightFM also lets us find similar users, which could power a section like 'Users also viewed these:', but that is left for future development.

Enough talk; let's dive into the implementation, starting with preprocessing the dataset.

## Preprocessing

For readability, I removed the cells I used to explore the dataset, e.g. counts of zeros and NaNs, column types, and unique values.

In [24]:
# Import needed libraries
import pandas as pd
import numpy as np
from pyarrow import parquet # For reading dataset saved as .parquet file
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MultiLabelBinarizer

import ast
import pickle

# Import only needed functions from lightfm framework
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score, precision_at_k, recall_at_k



In [25]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [26]:
# Read the Parquet file
df = pd.read_parquet('/content/drive/MyDrive/insider-cp/train.parquet')

# Take a look
df.head()

Unnamed: 0,date,userId,sessionId,pageType,itemId,category,productPrice,oldProductPrice
0,2019-08-05 19:30:37,00172f1d9a71e9a8de0aa34288a6b19b,e8167c23f8ac2f9be979c32380e0fc2b7e94941e917d30...,productDetail,83472aea4051c00d031b01ff42ef73fc,"[""kadın çanta"",""omuz askılı çanta""]",622.0,1220.0
1,2019-08-31 16:53:55,00172f1d9a71e9a8de0aa34288a6b19b,c7f54acdf56e2d7539ffa59107b9017c2a8164495df909...,category,[],"[""seyahat samsonite"",""laptop çantası""]",,
2,2019-08-31 16:53:29,00172f1d9a71e9a8de0aa34288a6b19b,c7f54acdf56e2d7539ffa59107b9017c2a8164495df909...,main,[],[],,
3,2019-08-31 16:53:43,00172f1d9a71e9a8de0aa34288a6b19b,c7f54acdf56e2d7539ffa59107b9017c2a8164495df909...,category,[],"[""seyahat samsonite"",""laptop çantası""]",,
4,2019-08-31 16:54:13,00172f1d9a71e9a8de0aa34288a6b19b,c7f54acdf56e2d7539ffa59107b9017c2a8164495df909...,productDetail,d6afa22ab475d41e7dc9b721f3f795ad,"[""seyahat samsonite"",""laptop çantası""]",389.0,389.0


In [27]:
# See basic info about dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691842 entries, 0 to 691841
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   date             691842 non-null  object 
 1   userId           691842 non-null  object 
 2   sessionId        691842 non-null  object 
 3   pageType         691842 non-null  object 
 4   itemId           691842 non-null  object 
 5   category         691842 non-null  object 
 6   productPrice     217985 non-null  float64
 7   oldProductPrice  217985 non-null  float64
dtypes: float64(2), object(6)
memory usage: 42.2+ MB


In [28]:
# Keep only the columns we need (copy so later column assignments do not hit a view)
columns_needed = ['userId', 'itemId', 'category', 'pageType']
df = df[columns_needed].copy()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691842 entries, 0 to 691841
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   userId    691842 non-null  object
 1   itemId    691842 non-null  object
 2   category  691842 non-null  object
 3   pageType  691842 non-null  object
dtypes: object(4)
memory usage: 21.1+ MB


In [29]:
# itemId and category are stored as string representations of lists,
# so parse them into actual Python lists
def convert_str_to_list(row):
    if isinstance(row, str) and row.startswith('['):
        return ast.literal_eval(row)
    return [row]

# Apply function to itemId and category columns
df['itemId'] = df['itemId'].apply(convert_str_to_list)
df['category'] = df['category'].apply(convert_str_to_list)
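
To make the parsing step concrete, here is a small standalone sketch that applies the same logic to three values taken from the `df.head()` output above. The function is repeated only so the example runs on its own.

```python
import ast

def convert_str_to_list(value):
    # Strings that look like a list literal are parsed; anything else is wrapped.
    if isinstance(value, str) and value.startswith('['):
        return ast.literal_eval(value)
    return [value]

parsed_categories = convert_str_to_list('["seyahat samsonite","laptop çantası"]')
parsed_empty = convert_str_to_list('[]')
parsed_item = convert_str_to_list('83472aea4051c00d031b01ff42ef73fc')

print(parsed_categories)
print(parsed_empty)
print(parsed_item)
```

Every cell ends up as a list, which is why a single itemId is later joined back into a string.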

## Feature Engineering

### Feature Vectors
In this step we create the feature vectors for items and users described above.

We pass these features to LightFM as one-hot encoded vectors, so we one-hot encode each item by its categories.
To do this, we keep the rows where an item is viewed on a productDetail page, then take their categories and encode them.

In [30]:
# Filter rows with pageType as 'productDetail'
filtered_df = df[df['pageType'] == 'productDetail'][['itemId', 'category']]

# On productDetail rows itemId is a single-element list; convert it to a string
filtered_df['itemId'] = filtered_df['itemId'].apply(lambda x: ''.join(x))

# Instantiate one-hot encoder
mlb = MultiLabelBinarizer()

# Encode the category column and convert it to dataframe
item_features = pd.DataFrame(mlb.fit_transform(filtered_df['category']), columns=mlb.classes_)

# Reset index of filtered_df
filtered_df.reset_index(drop=True, inplace=True)

# Add itemId column to the encoded dataframe
item_features.insert(loc=0, column='itemId', value=filtered_df['itemId'])

# An item appears in many rows, so merge its vectors by grouping on itemId and taking the max (1 for every category the item was seen with).
item_features = item_features.groupby('itemId').max().reset_index()

# Collect every unique item in the dataset, including items never viewed on a productDetail page
unique_item_ids_df = pd.DataFrame({'itemId': df['itemId'].explode().unique()}).dropna().reset_index(drop=True)

# Merge item_features with the remaining items. Fill their vectors with zeros as we don't know their categories
item_features = pd.merge(unique_item_ids_df, item_features, on='itemId', how='left').fillna(0)

# Convert all columns except itemId to int
item_features[item_features.columns[1:]] = item_features[item_features.columns[1:]].astype(int)

print("number of unique items: ", len(item_features))

# Take a look at our feature vector table
item_features.head()

number of unique items:  10718


Unnamed: 0,itemId,abiye,aksesuar,alışveriş çantası,babet,bakım aksesuarları,batai,bel çantası,bluz,bootie,...,бабетта,большой размер чемодан,женщины кожа одежда,женщины обувь,кожаная куртка,путешествия samsonite,スポーツシューズ,メンズ シューズ,샌들,여성 신발
0,83472aea4051c00d031b01ff42ef73fc,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,d6afa22ab475d41e7dc9b721f3f795ad,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1d84ddc6c6224402a845c0b5c684335b,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2a411dd5f3ffb793a160235d5eb4a881,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9197e3dfdf3da36a2f55c5bc9300528e,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
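
The encode-then-merge step can be seen on a toy example. The sketch below re-implements it in plain Python rather than with `MultiLabelBinarizer` and `groupby`; the item ids and category names are made up.

```python
# Hypothetical (itemId, categories) rows, as they might appear on productDetail pages.
toy_rows = [
    ("a", ["bag"]),
    ("a", ["bag", "travel"]),
    ("b", ["shoes"]),
]

# Sorted vocabulary of all categories, one vector position per category.
vocabulary = sorted({cat for _, cats in toy_rows for cat in cats})

merged_vectors = {}
for item_id, cats in toy_rows:
    # Multi-hot vector for this single row.
    vector = [1 if cat in cats else 0 for cat in vocabulary]
    previous = merged_vectors.get(item_id, [0] * len(vocabulary))
    # Element-wise max keeps a 1 wherever any row of the item had the category.
    merged_vectors[item_id] = [max(p, v) for p, v in zip(previous, vector)]

print(vocabulary)
print(merged_vectors)
```

Taking the max acts as a logical OR across all rows of the same item.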


The same process applies to users. A user's feature vector consists of the categories they clicked on. It captures their interests, which helps find similar users and gives the model more information.

In [31]:
# Keep the rows where a user visits a category page
filtered_df = df[df['pageType'] == 'category'][['userId', 'category']]

# Instantiate the one-hot encoder
mlb = MultiLabelBinarizer()

# Encode the category column and convert it to dataframe
user_features = pd.DataFrame(mlb.fit_transform(filtered_df['category']), columns=mlb.classes_)

# Reset index of filtered_df
filtered_df.reset_index(drop=True, inplace=True)

# Add userId column to the encoded dataframe
user_features.insert(loc=0, column='userId', value=filtered_df['userId'])

# A user appears in many rows, so merge their vectors by grouping on userId and taking the max (1 for every category the user visited).
user_features = user_features.groupby('userId').max().reset_index()

# Create a dataframe of unique userIds so that users who never visited a category page are included
unique_userIds = pd.DataFrame(df['userId'].unique(), columns=['userId'])

# Merge user_features and non-interacted users. Fill non-interacted users with zeros as we don't know their interests
user_features = pd.merge(unique_userIds, user_features, on='userId', how='left').fillna(0)

# Convert all columns except userId to int
user_features[user_features.columns[1:]] = user_features[user_features.columns[1:]].astype(int)

print('Number of unique users: ', len(user_features))

# Take a look
user_features.head()

Number of unique users:  30133


Unnamed: 0,userId,10 indirim subat2019,2018 kis koleksiyonu,2019 ilkbahar yaz,2019 sonbahar kis,abiye,accessoires femmes,accessories,aksesuar,aksesuar erkek deri isimlik,...,النساء الأحذية,خانمها کیف های دستی,زنان,女装 鞋子,남성,남성 가죽 의류,남성 액세서리,여성,여성 신발,여행 삼소 나이트
0,00172f1d9a71e9a8de0aa34288a6b19b,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,02912533de5da26ffac47a2cbb31d2f3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,03daf723d02456b73052fe3fff187d86,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,03ef6492a9b89c0078cffa2687f728e1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,06f55081c7f6e04fa37ec0b5ebe3d5b2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Create LightFM dataset

In this section, we use the Dataset utility from the LightFM library to build our training dataset.

In [32]:
# Prepare the inputs used for Dataset creation. Done separately for better readability
user_features_col = user_features.drop(columns= ['userId']).columns.values
user_feat = user_features.drop(columns= ['userId']).to_dict(orient='records')

item_features_col = item_features.drop(columns= ['itemId']).columns.values
item_feat = item_features.drop(columns= ['itemId']).to_dict(orient='records')

In [33]:
# Init a dataset object
dataset = Dataset()

# Fit it to our data
dataset.fit(
    users=user_features['userId'],
    items=item_features['itemId'],
    user_features=user_features_col,
    item_features=item_features_col
)

In [34]:
# LightFM uses its own internal indices. Get the mappings so we can translate back to our own IDs later.
user_id_mapping, _, item_id_mapping, _ = dataset.mapping()

# Create reverse mappings
reverse_item_id_mapping = {v: k for k, v in item_id_mapping.items()}
reverse_user_id_mapping = {v: k for k, v in user_id_mapping.items()}

In [35]:
# Build the item and user feature matrices (these replace the dataframes of the same name).
# Details can be found in the LightFM documentation
item_features = dataset.build_item_features((x, y) for x, y in zip(item_features['itemId'], item_feat))
user_features = dataset.build_user_features((x, y) for x, y in zip(user_features['userId'], user_feat))
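
Each record passed to the builders above is a dict that maps every category name to 0 or 1. The LightFM documentation also describes a `(id, [feature names])` input form, so one possible alternative is to pass only the active categories. The sketch below shows just that conversion on a made-up record; it does not call LightFM.

```python
# Hypothetical one-hot record for a single item (category name -> 0/1).
toy_record = {"abiye": 0, "aksesuar": 1, "babet": 0, "bluz": 1}

# Keep only the names of the categories that are switched on.
active_categories = [name for name, flag in toy_record.items() if flag]

print(active_categories)
```

This keeps the generator passed to the builder small when most of the category columns are zero.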

### Interaction data
In this section we prepare the interaction data for training. We can think of it as ratings: we pretend that users rate items and then recommend new items according to those ratings.

Scores range from 1 to 3. If a user purchased an item, the interaction gets 3 points. If the user added it to the cart but did not purchase it, 2. If the user only viewed it, 1. Rows with any other page type get 0.

Weighting interactions this way keeps more information. It also makes it easier to extend the system later, for example by not recommending already purchased items or by prioritizing carted items.

-------------
I chose this approach because I did not want to lose important information by using only a binary matrix. It creates a foundation for future development.

In [36]:
# Define a function to assign interaction scores
def assign_interaction_score(row):
    if 'success' in row['pageType']:
        return 3
    elif 'cart' in row['pageType']:
        return 2
    elif 'productDetail' in row['pageType']:
        return 1
    else:
        return 0

# Create a dictionary to hold the interaction scores
interaction_dict = {}

# Populate the dictionary
for _, row in df.iterrows():
    user_id = row['userId']
    item_ids = row['itemId']
    score = assign_interaction_score(row)   # Get the score for each user item pair

    if user_id not in interaction_dict:     # If not already added to dict before
        interaction_dict[user_id] = {}

    # Fill the dict for each item
    for item_id in item_ids:
        if item_id not in interaction_dict[user_id]:
            interaction_dict[user_id][item_id] = score
        else:
            # Keep the highest score seen so far for this user-item pair
            interaction_dict[user_id][item_id] = max(interaction_dict[user_id][item_id], score)

In [37]:
#TODO: There should be a more efficient way of doing this instead of using a dict, a list and a df

In [38]:
# Convert the dictionary to a dataframe to be used in lightfm dataset
interaction_data = []

for userid, items in interaction_dict.items():
    for item_id, score in items.items():
        interaction_data.append([userid, item_id, score])

interaction_df = pd.DataFrame(interaction_data, columns=['userId', 'itemId', 'interaction_score'])

interaction_df.head()

Unnamed: 0,userId,itemId,interaction_score
0,00172f1d9a71e9a8de0aa34288a6b19b,83472aea4051c00d031b01ff42ef73fc,1
1,00172f1d9a71e9a8de0aa34288a6b19b,d6afa22ab475d41e7dc9b721f3f795ad,1
2,02912533de5da26ffac47a2cbb31d2f3,1d84ddc6c6224402a845c0b5c684335b,1
3,02912533de5da26ffac47a2cbb31d2f3,2a411dd5f3ffb793a160235d5eb4a881,1
4,02912533de5da26ffac47a2cbb31d2f3,9197e3dfdf3da36a2f55c5bc9300528e,1
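
One possible answer to the efficiency TODO above is to let pandas do the work: score each row once, explode the item lists, and take the maximum score per user-item pair. This is a sketch on a made-up frame with the same columns as `df`; the names `toy_df`, `score_page_type` and `toy_interactions` are hypothetical, and it has not been timed against the loop.

```python
import pandas as pd

def score_page_type(page_type):
    # Same scoring rules as assign_interaction_score, applied to one value.
    if 'success' in page_type:
        return 3
    if 'cart' in page_type:
        return 2
    if 'productDetail' in page_type:
        return 1
    return 0

toy_df = pd.DataFrame({
    'userId': ['u1', 'u1', 'u1', 'u2'],
    'itemId': [['a'], ['a', 'b'], [], ['a']],
    'pageType': ['productDetail', 'cart', 'main', 'success'],
})

scored = toy_df.assign(interaction_score=toy_df['pageType'].map(score_page_type))

toy_interactions = (
    scored.explode('itemId')                 # one row per (user, item)
          .dropna(subset=['itemId'])         # rows with an empty item list become NaN
          .groupby(['userId', 'itemId'], as_index=False)['interaction_score']
          .max()                             # keep the strongest interaction
)
print(toy_interactions)
```

The result has the same three columns as `interaction_df`, so it could be dropped into the following steps without further changes.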


In [39]:
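# Build interactions and weights objects for training.
# Only (user, item) pairs are passed here, so every stored weight defaults to 1;
# pass (user, item, interaction_score) triples to make the weights reflect the scores.
(interactions, weights) = dataset.build_interactions((x, y) for x, y in zip(interaction_df['userId'], interaction_df['itemId']))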

## Model Training

In [40]:
#TODO: Using masking in matrix here sounds like a better idea instead of train test split

In [41]:
# Split train-test using the ratio 80/20
train, test = random_train_test_split(interactions, test_percentage=0.2, random_state=42)
train_w, test_w = random_train_test_split(weights, test_percentage=0.2, random_state=42)

In [42]:
# Assign parameters for the model
N_COMPONENTS = 30
LOSS = 'warp'
EPOCHS = 10
NUM_THREAD = 1

# Instantiate the model with given parameters
model = LightFM(no_components=N_COMPONENTS, loss=LOSS, random_state=42)

# Train the model. fit_partial trains one epoch per call, in case we want to monitor progress.
for epoch in range(EPOCHS):
    model.fit_partial(train, user_features=user_features, item_features=item_features, epochs=1, num_threads=NUM_THREAD, sample_weight=train_w)

## Evaluation


In [None]:
train_precision = precision_at_k(model, train, k=10, item_features=item_features, user_features=user_features).mean()
train_precision

0.050863966

In [None]:
test_precision = precision_at_k(model, test, train_interactions=train, k=10, item_features=item_features, user_features=user_features).mean()
test_precision

0.02046754


Deeper analysis of these results is still needed. I tried several other models and parameter settings but usually got similar precision@K values. Since time was limited and the AUC value is promising, I kept working with this model.
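
For reference, precision@K for a single user is the fraction of the K highest-scored items that the user actually interacted with. The sketch below computes it on made-up scores; `precision_at_k_single` is a hypothetical helper, not LightFM's implementation, which additionally averages over users and can exclude training interactions.

```python
import numpy as np

def precision_at_k_single(scores, relevant_items, k=10):
    """Fraction of the k highest-scored item indices found in relevant_items."""
    top_k = np.argsort(-np.asarray(scores, dtype=float))[:k]
    hits = sum(1 for item in top_k if int(item) in relevant_items)
    return hits / k

toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]   # made-up scores for 5 items
toy_relevant = {0, 3}                    # items the user interacted with

toy_precision = precision_at_k_single(toy_scores, toy_relevant, k=3)
print(toy_precision)
```

A user with only one or two held-out interactions can never reach a high precision@10, so the metric is naturally small on sparse data.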

In [None]:
test_auc = auc_score(model, test, train_interactions=train, item_features=item_features, user_features=user_features).mean()
test_auc

0.93659043
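
AUC for a single user can be read as the probability that a randomly chosen positive item is scored higher than a randomly chosen negative item. The sketch below computes that quantity by brute force on made-up scores; `auc_single` is a hypothetical helper that ignores ties and is not LightFM's implementation.

```python
import numpy as np

def auc_single(scores, positive_items):
    """Share of (positive, negative) pairs where the positive scores higher."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.zeros(len(scores), dtype=bool)
    is_positive[np.array(list(positive_items), dtype=int)] = True
    positives = scores[is_positive]
    negatives = scores[~is_positive]
    # Compare every positive score with every negative score.
    return float((positives[:, None] > negatives[None, :]).mean())

toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]   # made-up scores for 5 items
toy_auc = auc_single(toy_scores, {0, 3})
print(toy_auc)
```

Unlike precision@K, this looks at the whole ranking, which is one reason the two metrics can differ a lot on the same model.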

## Inference

In [50]:
# Get the N (default 10) best recommendations for the given user_id
def recommend(user_id, N=10):
    # Get model's internal id for given user_id
    user_idx = user_id_mapping[user_id]

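    # Score every item for this user and keep the top N.
    # Note: user/item features are not passed to predict here, although the model
    # was trained with them. Passing user_features=user_features and
    # item_features=item_features would score with the same inputs used in
    # training and evaluation.
    item_ids = np.arange(len(item_id_mapping))
    scores = model.predict(user_idx, item_ids)
    top_items = np.argsort(-scores)[:N]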

    # Extract original ids of items
    recommended_items = [reverse_item_id_mapping[item_id] for item_id in top_items]

    print("Top recommended items for user {}: {}".format(user_id, recommended_items))

# Get an example recommendation
exmpl_user_id = '97e5da3ce9a142fdb2ed7694dd7dc82c'
recommend(exmpl_user_id)

Top recommended items for user 97e5da3ce9a142fdb2ed7694dd7dc82c: ['5edb4e8e2e635e611e3a6f7f0c019eeb', '30b8a8a8c59f1af8aa2e4a808ea41dd0', '0073290e3fa8c300e1b697f68f32ce38', 'aacd2328a7965c764fee0a22869f99bd', 'd5979e28c248b49d1527fda897511593', '3bbdea29f99482c95d230ffc622b6d6a', '82f1cdab53b27c76e559bd7f6a24496a', '840c698196020af86f0d50e6f6d03fa7', '3c53e4a29408a863125d6afbd83a2e6f', '2b2445bba163a98c5035100bf87bddfe']
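
The interaction section mentions not recommending already purchased items as a possible extension. One way to sketch that is to mask the scores of seen items before ranking. The helper `top_n_unseen` and the data are made up; in the notebook, the seen indices would have to be looked up through `item_id_mapping`.

```python
import numpy as np

def top_n_unseen(scores, seen_indices, n=10):
    """Return the indices of the n best-scored items, skipping seen ones."""
    masked = np.array(scores, dtype=float)  # copy so the input is untouched
    seen = np.array(list(seen_indices), dtype=int)
    masked[seen] = -np.inf                  # seen items can never rank first
    return np.argsort(-masked)[:n]

toy_scores = [0.9, 0.1, 0.8, 0.3, 0.7]      # made-up scores for 5 items
toy_top = top_n_unseen(toy_scores, seen_indices={0}, n=2)
print(toy_top)
```

Masking after scoring leaves the model untouched, so the rule can be changed without retraining.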


In [53]:
# Get the N most similar items for the given item_id (the query item itself is included in the result).
def similar_items(item_id, N=10):
    # Get item representations for extracting similar items
    _, item_representations = model.get_item_representations(features=item_features)

    # Get model's internal id for given item_id
    item_idx = item_id_mapping[item_id]

    # Cosine similarity: dot products divided by both vector norms
    scores = item_representations.dot(item_representations[item_idx, :])
    item_norms = np.linalg.norm(item_representations, axis=1)

    scores /= item_norms
    best = np.argpartition(scores, -N)[-N:]
    similar = sorted(zip(best, scores[best] / item_norms[item_idx]), key=lambda x: -x[1])

    similar_idx = [x[0] for x in similar]
    recommended_items = [reverse_item_id_mapping[item_idx] for item_idx in similar_idx]
    print("Top recommended items for item {}: {}".format(item_id, recommended_items))

exmpl_item_id = '83e9e0606443b7bed19cd8f423cb552d'
similar_items(exmpl_item_id)

Top recommended items for item 83e9e0606443b7bed19cd8f423cb552d: ['83e9e0606443b7bed19cd8f423cb552d', '327eebb5d016ebb44f98583b33112476', 'fab1de3716503c35ac80bd4d5cd4e712', '436cc26999a1c57239905dea0c62431d', '6bf12ddc2a1624acbf1ec83ca5eb1d31', '923d329499028ea2676e3421c11444cd', '2d89fa36dd7713888ef1dae79bf3976c', '4ef823eeef4a4bd41d23ba6a8e334eb4', '68b601e4eace3f26ae3d294214e4c044', 'bf16ee90440c80a0f5bc66b2d567fb04']
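
The first result above is the query item itself, since every vector has cosine similarity 1 with itself. The sketch below repeats the cosine computation on a made-up representation matrix and drops the query item before ranking; `cosine_top_n` and the toy values are hypothetical.

```python
import numpy as np

def cosine_top_n(representations, query_idx, n=2):
    """Indices of the n items most similar to query_idx, excluding itself."""
    norms = np.linalg.norm(representations, axis=1)
    sims = representations @ representations[query_idx]
    sims = sims / (norms * norms[query_idx])
    sims[query_idx] = -np.inf   # exclude the query item
    return np.argsort(-sims)[:n]

# Made-up 2-dimensional item representations.
toy_representations = np.array([
    [1.0, 0.0],
    [2.0, 0.1],
    [0.0, 1.0],
    [1.0, 1.0],
])

toy_similar = cosine_top_n(toy_representations, query_idx=0, n=2)
print(toy_similar)
```

Item 1 points in almost the same direction as item 0 despite its larger norm, which is why cosine similarity is used rather than a raw dot product.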


In [54]:
# Save the model and needed utils for deployment
_, item_representations = model.get_item_representations(features=item_features)

model_utils = {
    # 'item_features': item_features,
    # 'user_features': user_features,
    'user_id_mapping': user_id_mapping,
    'item_id_mapping': item_id_mapping,
    'item_representations': item_representations
}

with open('lightfm_model.pkl', 'wb') as f:
    pickle.dump(model, f)

with open('model_utils.pkl', 'wb') as f:
    pickle.dump(model_utils, f)
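
On the deployment side, the saved files would be loaded back with `pickle.load`. Only the forward id mappings are saved above, so the reverse mapping has to be rebuilt after loading. The sketch below shows that round trip with a made-up dict written to a temporary directory; it does not touch the real `model_utils.pkl`.

```python
import os
import pickle
import tempfile

# Made-up stand-in for model_utils with a tiny item mapping.
toy_utils = {"item_id_mapping": {"item_a": 0, "item_b": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_utils.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_utils, f)
    with open(path, "rb") as f:
        loaded_utils = pickle.load(f)

# Rebuild internal index -> original id, as done earlier in the notebook.
toy_reverse_mapping = {idx: item_id for item_id, idx in loaded_utils["item_id_mapping"].items()}
print(toy_reverse_mapping)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.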

Best regards,

Tarık Aytek

tarik.aytek99@gmail.com