# Building a Fashion Recommender System from Learned Embeddings
```
Seyed Saeid Masoumzadeh 
Senior Data Scientist @ Lyst
Open Data Science Conference (ODSC) London - 16th June 2022
```










### Word2vec - SkipGram architecture 

```
The skip-gram architecture is trained to predict the probability of each context (surrounding) word given a target word.

```
- A shallow network with a single hidden layer
- The input size equals the number of unique words/phrases in the corpus
- The output size also equals the number of unique words/phrases in the corpus
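
To make these shapes concrete, here is a minimal NumPy sketch of a skip-gram forward pass. It assumes a toy vocabulary of 5 tokens and a 3-dimensional hidden layer; the names `W_in`, `W_out` and `skipgram_forward` are illustrative, and the weights are random rather than trained.

```python
import numpy as np


def skipgram_forward(target_index, W_in, W_out):
    """Return, for every vocabulary token, the probability of it being a context word."""
    vocab_size = W_in.shape[0]
    one_hot = np.zeros(vocab_size)
    one_hot[target_index] = 1.0
    hidden = one_hot @ W_in      # shape (dim,): selects row `target_index` of W_in
    scores = hidden @ W_out      # shape (vocab_size,): one score per vocabulary token
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return exp_scores / exp_scores.sum()


rng = np.random.default_rng(0)
vocab_size, dim = 5, 3           # toy sizes, assumed for illustration
W_in = rng.normal(size=(vocab_size, dim))
W_out = rng.normal(size=(dim, vocab_size))
probs = skipgram_forward(2, W_in, W_out)
print(probs.shape, probs.sum())
```

After training, the rows of `W_in` are what is normally used as the embeddings.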

![title](img/skipgram.png)


### Word2vec - Data Sampling 
```
Word2vec builds its training pairs with a sliding window: the window size specifies how many preceding and following tokens are paired with each token in the sequence.
```

![title](img/sampling.png)

### Writing a function to sample with a sliding window

In [None]:
def sample_data(sequence, window_size):
    """
    Sample (target, context) pairs with a sliding window: the window moves along
    the sequence of tokens, and every token within `window_size` positions of the
    target is selected as a positive. E.g., if the sequence is [1, 2, 3, 4] and
    the window size is 1, the samples are
    [(1, 2), (2, 1), (2, 3), (3, 2), (3, 4), (4, 3)].
    """

    number_of_tokens = len(sequence)
    samples = []
    for i in range(number_of_tokens):
        nbr_inds = list(range(max(0, i - window_size), i)) + list(
            range(i + 1, min(number_of_tokens, i + window_size + 1))
        )
        for j in nbr_inds:
            samples.append((sequence[i], sequence[j]))
    return samples

In [None]:
sequence = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
sample_data(sequence, window_size=2)

```

For the sample (quick, brown), the input to the word2vec model is a one-hot vector in which every cell is zero except the cell for the word quick, which is set to 1. The target output is likewise a one-hot vector in which every cell is zero except the cell for the word brown. In short, word2vec can be viewed as a multi-class classifier that can be trained with a sampled softmax loss.
```
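
As an illustration of the (quick, brown) sample, the sketch below builds both one-hot vectors for the example sentence and evaluates a cross-entropy loss. It uses a full softmax target rather than a sampled one, and the "model" is just an assumed uniform prediction, so the names `one_hot` and `uniform_prediction` are purely illustrative.

```python
import numpy as np

sentence = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
vocab = sorted(set(sentence))  # unique tokens of the toy corpus
token_to_index = {token: i for i, token in enumerate(vocab)}


def one_hot(token):
    """One-hot vector over the vocabulary for a single token."""
    vector = np.zeros(len(vocab))
    vector[token_to_index[token]] = 1.0
    return vector


x = one_hot('quick')  # network input
y = one_hot('brown')  # classification target

# Cross-entropy of an untrained model that gives every token equal probability.
uniform_prediction = np.full(len(vocab), 1.0 / len(vocab))
loss = -np.sum(y * np.log(uniform_prediction))
print(x, y, loss)
```

Training lowers this loss by moving probability mass towards the tokens that actually co-occur with the input.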


In [None]:
import pandas as pd
import numpy as np
import fasttext
import glob
import re
import cv2
import matplotlib.pyplot as plt
from rpforest import RPForest

```
A sentence is a sequence of words, and word2vec with the skip-gram model estimates the probability of the surrounding words given a word. Can the same idea be applied to other kinds of sequences?
```

### Reading session data

This is anonymized data showing users' click interactions with items, for example on a fashion platform like Lyst.
- the session_id represents a user
- the product_id represents a fashion product/clothing item that was clicked by the user
- the event_time_stamp is the time at which the click event occurred

In [None]:
data = pd.read_parquet("data/data.parquet")
data.head()

### Sorting by event time stamp

In [None]:
data = data.sort_values('event_time_stamp')
data.head()

### Representing the sequence of clicks
```
Grouping the data by session_id gives the sequence of product_ids clicked in each session. Because the data was sorted by event_time_stamp beforehand, each sequence is ordered by click time.
```  

In [None]:
data['product_id'] = data['product_id'].astype(str)
session_seq = data.groupby('session_id')['product_id'].apply(list).reset_index(
).rename(columns={'product_id':"sequence_of_clicks"})
session_seq.head()
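
The same sort-then-group logic can be checked on a tiny hand-made table. The frame below is invented for illustration (it only reuses the three column names of the real data), so the result can be verified by eye.

```python
import pandas as pd

# Hypothetical click log: two sessions whose events are interleaved in time.
toy = pd.DataFrame({
    "session_id": ["s1", "s2", "s1", "s2", "s1"],
    "product_id": [10, 20, 11, 21, 12],
    "event_time_stamp": [3, 1, 1, 2, 2],
})

toy = toy.sort_values("event_time_stamp", kind="stable")
toy["product_id"] = toy["product_id"].astype(str)
toy_seq = (
    toy.groupby("session_id")["product_id"]
    .apply(list)
    .reset_index()
    .rename(columns={"product_id": "sequence_of_clicks"})
)
print(toy_seq)
```

Within each group, `groupby` keeps the rows in the order of the sorted frame, which is why sorting once up front is enough.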

### Some data exploration on sequences

In [None]:
session_seq['sequence_length'] = session_seq['sequence_of_clicks'].apply(len)
session_seq.head()

In [None]:
session_seq['sequence_length'].plot.box()

In [None]:
session_seq['sequence_length'].quantile(0.95)

### Removing the outliers

In [None]:
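# Drop unusually long sessions (above the 95th percentile of sequence length)
session_seq = session_seq[session_seq['sequence_length'] <= session_seq['sequence_length'].quantile(0.95)]
# Drop single-click sessions: they yield no (target, context) pairs
session_seq = session_seq[session_seq['sequence_length'] >= 2]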
session_seq['sequence_length'].plot.box()

In [None]:
session_seq['sequence_length'].value_counts().to_frame().plot.bar()

In [None]:
sample_data(['1463503', '1418365', '1531480'],  2)

### Running SkipGram (using fasttext) on the sequences

In [None]:
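fasttext_params = {
    "model": "skipgram",
    "lr": 0.05,
    "dim": 100,  # embedding size
    "ws": 3,  # context window size
    "epoch": 100,
    "minCount": 1,  # keep every product_id, even if it was clicked only once
    "minn": 3,
    "maxn": 0,  # no character n-grams: each product_id is an atomic token
    "neg": 5,  # negatives sampled per positive pair
    "wordNgrams": 1,
    "loss": "ns",  # negative sampling
    "bucket": 2000000,
    "thread": 24,
    "lrUpdateRate": 100,
    "t": 0.0001,
    "verbose": 2,
}
# fastText reads a plain-text file: one session per line, product_ids separated by spaces
sequence_txt_file = 'data/seq.txt'
sequences = [' '.join(x) for x in session_seq['sequence_of_clicks'].values]
np.savetxt(sequence_txt_file, sequences, fmt="%s", encoding="utf-8")
model = fasttext.train_unsupervised(sequence_txt_file, **fasttext_params)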

### Generating Embeddings

In [None]:
vectors = np.vstack([model[x] for x in model.words]).astype("double")
vocabs = model.words

vectors_dict = dict(zip(vocabs, vectors))

In [None]:
vectors_dict['1531480']

### Cosine similarity

In [None]:
import numpy as np


def cos_sim(a, b):
    """
    Takes two ndarrays, a and b, and returns the cosine similarity between a
    and each row of b, according to the definition of the dot product.
        a should be a single 1-d array
        b should be a 2-d array
    """

    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b, axis=1)
    return np.dot(a, b.T) / (norm_a * norm_b)
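
A quick sanity check on hand-picked 2-d vectors shows the range of values to expect. The function is repeated here so the snippet runs on its own; `query` and `candidates` are made-up toy inputs.

```python
import numpy as np


def cos_sim(a, b):
    """Cosine similarity between a 1-d array a and every row of a 2-d array b."""
    return np.dot(a, b.T) / (np.linalg.norm(a) * np.linalg.norm(b, axis=1))


query = np.array([1.0, 0.0])
candidates = np.array([
    [2.0, 0.0],   # same direction as the query
    [0.0, 3.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
    [1.0, 1.0],   # 45 degrees away
])
sims = cos_sim(query, candidates)
print(sims)
```

Vector length does not matter: only the angle between the query and each candidate affects the score.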

### Build a hash table for (product_id, image)

In [None]:
files = glob.glob('product_images/*.jpeg')
file_dict = {}
for file in files:
    # The file name without its extension is the product_id
    result = re.search(r'images/(.*)\.jpeg$', file)
    file_dict[result.group(1)] = file

### Finding similar items to a given item 

In [None]:
sims = cos_sim(vectors_dict['1556752'], vectors)
sims = sorted(zip(vocabs, sims), key=lambda x: x[1], reverse=True)[:9]
print(sims)
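
As a possible variation, the brute-force lookup can be wrapped in a helper that normalizes the matrix once and leaves the query item out of its own results. This is only a sketch on toy data; `top_k_similar`, `toy_ids` and `toy_vectors` are hypothetical names, not part of the notebook.

```python
import numpy as np


def top_k_similar(query_id, ids, matrix, k):
    """Return the k (id, cosine similarity) pairs closest to query_id, excluding itself."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)  # unit-length rows
    query = unit[ids.index(query_id)]
    sims = unit @ query  # cosine similarity, because every row has length 1
    order = [i for i in np.argsort(-sims) if ids[i] != query_id][:k]
    return [(ids[i], float(sims[i])) for i in order]


toy_ids = ["a", "b", "c", "d"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
neighbours = top_k_similar("a", toy_ids, toy_vectors, k=2)
print(neighbours)
```

This still scans every item per query, so its cost grows linearly with the catalogue size.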

In [None]:
images = []
for product_id, sim in sims: 
    images.append(file_dict[product_id]) 

In [None]:
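# The first result is the query item itself; cv2 loads BGR, so reverse the channels to RGB
img = cv2.imread(images[0], cv2.IMREAD_COLOR)
plt.imshow(img[:, :, ::-1])


# The remaining results are its most similar items, shown on a 3x3 grid
fig = plt.figure(figsize=(10, 7))
for i, image in enumerate(images[1:], start=1):
    img = cv2.imread(image, cv2.IMREAD_COLOR)
    ax = fig.add_subplot(3, 3, i)
    ax.imshow(img[:, :, ::-1])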

### Approximate Nearest Neighbor (ANN)

```
Approximate nearest neighbour search finds points in a high-dimensional space that are close to a given query point quickly, at the cost of exactness.

rpforest builds a forest of random projection trees. In each tree, the set of training points is recursively partitioned into smaller and smaller subsets until a leaf node of at most M points is reached. Each partition is based on the cosine of the angle the points make with a randomly drawn hyperplane: points whose angle is smaller than the median angle fall in the left partition, and the remaining points fall in the right partition.
```
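
A single partition step of the kind described above can be sketched in a few lines of NumPy. This is an illustration of the idea only, not rpforest's actual implementation; `random_projection_split` is a hypothetical helper and the points are random toy data.

```python
import numpy as np


def random_projection_split(points, rng):
    """Split point indices into two groups by their angle to a random direction."""
    normal = rng.normal(size=points.shape[1])  # randomly drawn hyperplane normal
    cosines = (points @ normal) / (
        np.linalg.norm(points, axis=1) * np.linalg.norm(normal)
    )
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    median_angle = np.median(angles)
    left = np.where(angles < median_angle)[0]    # smaller than the median angle
    right = np.where(angles >= median_angle)[0]  # all remaining points
    return left, right


rng = np.random.default_rng(42)
points = rng.normal(size=(8, 4))
left, right = random_projection_split(points, rng)
print(left, right)
```

Applying the split recursively to each side until a side holds at most M points gives one tree; a query then only needs to be compared against the points in the leaves it falls into.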

![title](img/rpforest.png)

### Train rpforest

In [None]:
rpf_model = RPForest(leaf_size=50, no_trees=10)
rpf_model.fit(vectors)

### Finding similar items by making query to ANN

In [None]:
sims_index = rpf_model.query(vectors_dict['1556752'], 9)
sims = [vocabs[i] for i in sims_index]

In [None]:
images = []
for product_id in sims:
    try:
        images.append(file_dict[product_id])
    except KeyError:
        continue

In [None]:
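# Show the first returned item; cv2 loads BGR, so reverse the channels to RGB
img = cv2.imread(images[0], cv2.IMREAD_COLOR)
plt.imshow(img[:, :, ::-1])


# Show the remaining returned items on a 3x3 grid
fig = plt.figure(figsize=(10, 7))
for i, image in enumerate(images[1:], start=1):
    img = cv2.imread(image, cv2.IMREAD_COLOR)
    ax = fig.add_subplot(3, 3, i)
    ax.imshow(img[:, :, ::-1])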

### Cold start problem

Cold start happens in this setup when an item has some content information but no interactions yet, so no embedding can be learned for it from click sequences.
 - **An efficient solution is to use a triplet neural network**
 
A triplet NN learns distributed embeddings from the notion of similarity and dissimilarity. It is a neural network architecture in which multiple parallel networks that share weights are trained together.


![title](img/triplet_NN.png)

![title](img/triplet_rec_embeddings.png)

### Sampling Anchor, Positive and Negative
Sampling the anchor and the positive is the same as the sliding-window sampling explained above for the Item2vec approach (word2vec applied to click sequences).

In [None]:
sample_data(['1463503', '1418365', '1531480'],  2)

What about the negatives?
The negatives are sampled randomly.
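
Putting the two sampling rules together might look like the sketch below. It is a minimal illustration under stated assumptions: `sample_triplets` is a hypothetical helper, the `catalogue` is a toy list built from product ids that appear in this notebook, and excluding the anchor and positive from the negative candidates is an assumption, not something the approach above prescribes.

```python
import random


def sample_triplets(sequence, window_size, catalogue, seed=0):
    """Build (anchor, positive, negative) triplets with uniformly random negatives."""
    rng = random.Random(seed)
    triplets = []
    for i, anchor in enumerate(sequence):
        start = max(0, i - window_size)
        stop = min(len(sequence), i + window_size + 1)
        for j in range(start, stop):
            if j == i:
                continue
            positive = sequence[j]
            # Assumption: the negative must differ from both the anchor and the positive.
            candidates = [item for item in catalogue if item not in (anchor, positive)]
            triplets.append((anchor, positive, rng.choice(candidates)))
    return triplets


catalogue = ['1463503', '1418365', '1531480', '1556752', '1387755', '1451117']
triplets = sample_triplets(['1463503', '1418365', '1531480'], 2, catalogue)
for triplet in triplets:
    print(triplet)
```

Uniform sampling is cheap, but most negatives drawn this way are easy ones, which is where the margin of the loss below comes into play.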

### Triplet loss

L(a, p, n) = max(0, D(a, p) - D(a, n) + margin)

In [None]:
def triplet_loss(a, p, n, margin):
    """Triplet loss with cosine distance D(x, y) = 1 - cos_sim(x, y); a, p, n are 1-d arrays."""
    dist_a_p = 1 - cos_sim(a, p.reshape(1, -1))[0]
    dist_a_n = 1 - cos_sim(a, n.reshape(1, -1))[0]
    return max(dist_a_p - dist_a_n + margin, 0)

The larger the margin, the more the softer (easier) negatives contribute to the loss.
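
The effect of the margin can be seen on hand-picked 2-d vectors. The snippet is self-contained and uses made-up toy vectors rather than learned embeddings; `cosine_distance` is an illustrative helper.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance between two 1-d arrays."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


def triplet_loss(a, p, n, margin):
    return max(cosine_distance(a, p) - cosine_distance(a, n) + margin, 0.0)


anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.1])       # almost the same direction as the anchor
soft_negative = np.array([0.0, 1.0])  # orthogonal to the anchor: easy to separate
hard_negative = np.array([1.0, 0.5])  # close to the anchor: hard to separate

for margin in (0.2, 0.5, 1.0, 1.5):
    soft_loss = triplet_loss(anchor, positive, soft_negative, margin)
    hard_loss = triplet_loss(anchor, positive, hard_negative, margin)
    print(margin, soft_loss, hard_loss)
```

With a small margin only the hard negative produces a non-zero loss; the soft negative starts to contribute once the margin exceeds the gap between its distance and the positive's distance.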

In [None]:
a = vectors_dict['1556752']  # anchor
p = vectors_dict['1387755']  # positive
soft_n = vectors_dict['1418365']  # soft negative
hard_n = vectors_dict['1451117']  # hard negative

triplet_loss(a, p, hard_n, margin=0.5)