# PROJECT - NLPCorps - CMPT825
***
***

## E-Ranked: A Deep Learning Based Product Search Relevance Tool 
#### Contributors
- Tushar Chand Kapoor - tkapoor@sfu.ca
- Shray Khanna - skhanna@sfu.ca
- Manan Parasher - mparashe@sfu.ca
____


## Motivation

- While browsing various online retail websites, we found that NLP can be applied to improve their search results. To provide the best customer experience, online retailers have to return results that are relevant to the customer's search query, which makes **search** an important element of the website. Providing relevant results for complex queries remains a challenge for many retailers.
- With a growing number of customers shopping online, we decided to work on improving the ranking of the retrieved search results.
***
- More formally, we chose an IR (Information Retrieval) approach that uses vector representations of the search query and of the combined product title and product description.
- This is done to learn the context between the query and the product description.
- The details, working and the dataset are explained in the sections below.

## Approach

### - Baseline

###### Approach 1 (XGBoost)
- In the second iteration of the baseline, we framed product relevance as a regression problem, since the task reduces to ranking products.
- To rank the products and assess the relevance of the results, we built a set of statistical features.
- Feature engineering finds the words common to the search query and the product description/title. It also computes the following:
    - Word lengths of the search term, product description, and title.
    - Whether the search term appears in the product title and description.
    - The ratio of description to search term.
    - The ratio of title to search term.
    - The ratio of the combined title and description to the search term.
- We then train XGBoost on these statistical features to predict the rank of products for a search term.
- The loss is computed as the RMSE between the predicted values and the truth values on the test set.
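
The feature list above can be sketched in plain Python. This is only an illustration, not the project's feature code: the function name `relevance_features`, the feature names, and the choice to measure lengths and ratios in whitespace-separated words are all assumptions. The resulting dictionaries could then be passed to a regressor such as XGBoost, which is left out here to keep the sketch dependency-free.

```python
import math


def relevance_features(search_term, title, description):
    """Count-based features for one (query, product) pair (hypothetical names)."""
    q = search_term.lower().split()
    t = title.lower().split()
    d = description.lower().split()
    t_set, d_set = set(t), set(d)
    n_q = max(len(q), 1)  # avoid division by zero for an empty query
    return {
        # Word lengths of the three text fields
        "len_query": len(q),
        "len_title": len(t),
        "len_description": len(d),
        # Does the full search term occur verbatim in the field?
        "query_in_title": int(search_term.lower() in title.lower()),
        "query_in_description": int(search_term.lower() in description.lower()),
        # Number of query words shared with each field
        "common_title": sum(w in t_set for w in q),
        "common_description": sum(w in d_set for w in q),
        # Length ratios (assumed here to be word-count ratios)
        "ratio_title": len(t) / n_q,
        "ratio_description": len(d) / n_q,
        "ratio_combined": (len(t) + len(d)) / n_q,
    }


def rmse(predicted, truth):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - y) ** 2 for p, y in zip(predicted, truth)]
    return math.sqrt(sum(squared) / len(squared))


features = relevance_features(
    "zwave switch",
    "leviton z wave switch",
    "a z wave enabled universal switch",
)
```

Note that an exact-substring feature such as `query_in_title` misses spelling variants like "zwave" versus "z wave", which is one reason count-based features alone do not capture meaning.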

###### Approach 2 (Gensim)
- Model Building
    - We trained the gensim Word2Vec model on our own custom corpus as follows:
    - `Word2Vec(window=10, sg=1, hs=0, negative=10, alpha=0.03, min_alpha=0.0007, seed=14)`, where `negative=10` enables negative sampling.

    - Let’s try to understand the hyperparameters of this model.
        - size: The number of dimensions of the embeddings. The default is 100, which we keep.
        - window: The maximum distance between a target word and the words around it. The default window is 5; we use 10.
        - sg: The training algorithm, either CBOW (0) or skip-gram (1). The default training algorithm is CBOW; we use skip-gram.
        - hs ({0, 1}, optional) – If 1, hierarchical softmax will be used for model training. If 0, and negative is non-zero, negative sampling will be used.
        - negative (int, optional) – If > 0, negative sampling will be used, the int for negative specifies how many “noise words” should be drawn (usually between 5-20). If set to 0, no negative sampling is used.
        - alpha (float, optional) – The initial learning rate.
        - min_alpha (float, optional) – Learning rate will linearly drop to min_alpha as training progresses.
        - seed (int, optional) – Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed).

- Vocabulary Building:
    - `model.build_vocab(train_data, progress_per=200)`
    - To train our model on the custom corpus, we built a custom vocabulary for the model.

- Model Training
    - `model.train(train_data, total_examples = model.corpus_count, epochs=10, report_delay=1)`

- Prediction
    - After training the model, we used the `most_similar` method to get the most similar products. For example, `model.wv.most_similar("100019")` returns `[('102893', 0.9978876113891602), ('116983', 0.9978049993515015),..,('127507', 0.9976705312728882),('102263', 0.997661828994751)]`
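
To make the `window` and `sg` hyperparameters more concrete, here is a minimal sketch of how skip-gram training pairs can be derived from one tokenized sentence. It is a simplified illustration and not gensim's internal implementation: the function `skipgram_pairs` is hypothetical, and details such as subsampling or negative sampling are omitted.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every token within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                 # leftmost context position
        hi = min(len(tokens), i + window + 1)   # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["z", "wave", "wall", "switch"], window=1)
```

With `window=1`, each word is paired only with its immediate neighbours; a larger window such as 10 pairs each word with much more of the surrounding sentence.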

***

### - Main
- We apply a convolutional-pooling architecture to the word sequences of the search query and of the product description and title in order to learn vector representations of both.
- Both search query and the product description + product title are converted into sentence embeddings:
    - Embedding of search query:
    $$embeds_Q = [q_1,q_2,q_3,....q_n]$$
$\;\;\;\;\;\;\;\;\;\;$where $q_n$ represents the nth word of the search query
    - Embedding of product description and title :
    $$embeds_P = [p_1,p_2,p_3,....p_m]$$
$\;\;\;\;\;\;\;\;\;\;$where $p_m$ represents the mth word of the product description + product title


- We use a contextual window [1] to capture the contextual structure of the search query and the product description: a temporal context window slides over each sequence to capture contextual features. The representation is as follows:

$$l_t = [f_{t-d}^T,..,f_{t}^T,..,f_{t+d}^T]^T,\;\;\;\;t=1,..,T$$
$\;\;\;\;\;\;\;\;\;\;$where $f_t$ is the $t^{th}$ word, $n$ is the size of the window, and $d=\frac{n-1}{2}$
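
The window construction $l_t$ can be sketched with NumPy as below. This is an illustrative sketch rather than the project's `sliding_window` function: the name `context_windows` is hypothetical, and zero-padding at the sequence boundaries is an assumption made so that every position gets a full window.

```python
import numpy as np


def context_windows(embeddings, d):
    """Concatenate each word vector with its d left and d right neighbours.

    embeddings: array of shape (T, D), one row per word.
    Returns an array of shape (T, (2 * d + 1) * D); row t corresponds to l_t.
    """
    T, D = embeddings.shape
    # Assumption: pad with d zero vectors on each side of the sequence
    padded = np.vstack([np.zeros((d, D)), embeddings, np.zeros((d, D))])
    # Flatten the 2d + 1 consecutive rows around each position into one vector
    return np.stack([padded[t:t + 2 * d + 1].reshape(-1) for t in range(T)])


f = np.arange(8, dtype=float).reshape(4, 2)  # T = 4 words, D = 2 dimensions
l = context_windows(f, d=1)                  # window size n = 2 * 1 + 1 = 3
```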

- The convolution operation is a sliding-window feature extraction over these context windows; the convolution layer outputs a sequence of feature vectors, one per position:
$$h_t = tanh(W_c\cdot l_t),\;\;\;\;t=1,..,T$$
$\;\;\;\;\;\;\;\;\;\;$ where, $W_c$ is the feature transformation matrix and $tanh$ is used as the activation function


- To retain the most useful features, a max pooling operation is applied to the output of the convolutional layer.
$$v(i) =  \max_{t=1,..,T}{\{h_t(i)\}},\;\;\;\;i=1,..,K$$
$\;\;\;\;\;\;\;\;\;\;$where, K equals the dimension of $h_t$


- To add a non-linearity to the output, a tanh transformation is applied:
$$y = tanh(W_s\cdot v)$$
$\;\;\;\;\;\;\;\;\;\;$where, $W_s$ is the semantic projection matrix
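
Taken together, the convolution, max pooling, and semantic projection steps can be sketched as a single NumPy function. This is a sketch under stated assumptions, not the CPSIR model itself: the function name `encode`, the random weights, and the dimensions are chosen only for illustration, and biases are omitted because the formulas above do not include them.

```python
import numpy as np


def encode(windows, W_c, W_s):
    """Map a sequence of context windows to one semantic vector y.

    windows: (T, n * D) array, one row per context window l_t.
    W_c:     (K, n * D) feature transformation matrix.
    W_s:     (L, K) semantic projection matrix.
    """
    h = np.tanh(windows @ W_c.T)  # h_t = tanh(W_c . l_t), shape (T, K)
    v = h.max(axis=0)             # v(i) = max over t of h_t(i), shape (K,)
    return np.tanh(W_s @ v)       # y = tanh(W_s . v), shape (L,)


rng = np.random.default_rng(0)
T, window_dim, K, L = 5, 6, 4, 3          # illustrative sizes only
windows = rng.normal(size=(T, window_dim))
W_c = rng.normal(size=(K, window_dim))
W_s = rng.normal(size=(L, K))
y = encode(windows, W_c, W_s)
```

Because max pooling collapses the time axis, `y` has the same size for queries and products of any length, which is what allows the two to be compared directly.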


- The semantic relevance score between a search query and a product is:

$$R(Q,P) = \cos (y_Q,y_P) =  \frac{y_{Q}^T y_P}{||y_Q||\;||y_P||}$$
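
As a small worked illustration of the relevance score, the sketch below computes the cosine between two semantic vectors in plain Python. The function name `relevance` is hypothetical, and returning 0.0 for an all-zero vector is an assumption; the formula itself is undefined in that case.

```python
import math


def relevance(y_q, y_p):
    """Cosine similarity between a query vector and a product vector."""
    dot = sum(a * b for a, b in zip(y_q, y_p))
    norm_q = math.sqrt(sum(a * a for a in y_q))
    norm_p = math.sqrt(sum(b * b for b in y_p))
    if norm_q == 0.0 or norm_p == 0.0:
        return 0.0  # assumption: treat an all-zero vector as unrelated
    return dot / (norm_q * norm_p)


same_direction = relevance([1.0, 0.0], [1.0, 0.0])   # identical direction
orthogonal = relevance([1.0, 0.0], [0.0, 1.0])       # nothing in common
opposite = relevance([1.0, 0.0], [-1.0, 0.0])        # opposite direction
```

The score lies in $[-1, 1]$ and depends only on the direction of the vectors, not on their magnitude.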


- Loss Function:
    - We use MSE (Mean Squared Error) loss between the cosine similarity values and truth reference relevances.
        $$l(x,y) = \{l_1,..,l_N\}^T,l_n = (x_n-y_n)^2$$ where x and y are the tensors

## Data

- We use two datasets: [Home Depot Product Search Relevance](https://www.kaggle.com/c/home-depot-product-search-relevance/data) as the training data and [eCommerce search relevance](https://data.world/crowdflower/ecommerce-search-relevance) for testing.
- Both datasets require a series of preprocessing steps, including cleaning the data, combining information from multiple files, and web scraping to complete the dataset.
- The training set is split in two: the first 80% of the rows are used for training and the remaining 20% are used as the dev set to evaluate our metric.
- A glimpse of the train and test data is shown below. (Both are the cleaned versions of the data; links to the raw data are provided.)

In [1]:
import sys
sys.path.append('../')
import pandas as pd
from data_utils import *

#### Train Data: [link](https://www.kaggle.com/c/home-depot-product-search-relevance/data)

In [2]:
pd.read_csv('../data/train_final.csv').head(2)

| Unnamed: 0 | product_uid | product_title | search_term | relevance | brand | product_description |
|---|---|---|---|---|---|---|
| 0 | 116711 | ge z wave 1800 watt resist cfl led indoor plug... | zwave switch | 3.0 | | transform ani home into a smart home with the ... |
| 1 | 141628 | leviton z wave control 3 way/remot scene capab... | zwave switch | 3.0 | leviton | the leviton dzmx1 is a z wave enabl univers di... |


#### TEST DATA: [link](https://data.world/crowdflower/ecommerce-search-relevance)

In [3]:
pd.read_csv('../data/test_final.csv',encoding='ISO-8859-1').head(2)

| Unnamed: 0 | product_uid | relevance | product_title | search_term | rank | product_description | brand |
|---|---|---|---|---|---|---|---|
| 0 | 711158459 | 3.67 | soni playstat 4 ps4 latest model 500 gb jet bl... | playstat 4 | 1 | the playstat 4 system open the door to an incr... | soni |
| 1 | 711158460 | 4.0 | soni playstat 4 latest model 500 gb jet black ... | playstat 4 | 2 | the playstat 4 system open the door to an incr... | soni |


## Code

- Most of the code was written by the group members. The few portions that were not are listed below (the source links are given as comments next to the functions inside the files):
    - `def removeAdditional` - removes unnecessary characters
    - `def sliding_window` - slides a window over the sentence
    - `forward` of CPSIR (partial implementation) - a few lines were taken to complement the `sliding_window` function

## Experimental Setup

- The training was done on approximately 10k search queries and their related products.
- All search queries, product descriptions, and titles are pre-processed as follows:
    - The text was lowercased and all the special characters were removed, but the numbers were retained.
    - The text was tokenized.
    - Stemming was done on the cleaned text.
    - Products relating to the same search query were stacked using torch.stack, and the texts were padded to match the stacked length.
    - Both the search query and the description + title are processed in the same way.
- There is no overlap between the train and the test data.
- The model was trained using Stochastic Gradient Descent (SGD).
- The performance of the model is measured by the mean average precision metric shown below:
$$score = \frac{ \sum_{i=1,..,N}[\frac{count(matches)_{P_i}}{|P_i|}] }{N}$$
$\;\;\;\;\;\;\;\;\;\;$ where $N$ is the number of groups obtained by grouping the products by search term and $|P_i|$ is the number of products related to the $i^{th}$ search query.
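
The score formula can be sketched as follows. This is not the code of `check.py`: the function name `mean_match_score`, the dictionary layout, and the definition of a match (a predicted product id that also appears in the reference list for the same search term) are assumptions made for illustration.

```python
def mean_match_score(predicted, reference):
    """Average, over search terms, of the fraction of reference products matched.

    predicted, reference: dicts mapping a search term to a list of product ids.
    Assumes every reference list is non-empty and free of duplicates.
    """
    total = 0.0
    for term, ref_products in reference.items():
        ref_set = set(ref_products)
        # Assumption: a match is a predicted id that is in the reference list;
        # extra predicted products and missing search terms earn nothing
        matches = len(set(predicted.get(term, [])) & ref_set)
        total += matches / len(ref_products)
    return total / len(reference)


reference = {
    "zwave switch": [116711, 141628],
    "playstat 4": [711158459, 711158460],
}
predicted = {
    "zwave switch": [116711, 999999],        # one of two reference products found
    "playstat 4": [711158459, 711158460],    # both reference products found
}
score = mean_match_score(predicted, reference)  # (1/2 + 2/2) / 2 = 0.75
```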


### check.py 

- check.py scores are calculated out of 1. (Max = 1 and Min = 0)
- It reads the output created by the program and references it against the .out files in the reference folder.
- Every match scores a point, but if a product is missing for a particular search query, or a product is present that was not part of the result, we penalize it.
- Below are the options of check.py

In [6]:
!python check.py --help

Usage: check.py [options]

Options:
  -h, --help            show this help message and exit
  -d DEVFILE, --devfile=DEVFILE
                        Dev File Output Path
  -t TESTFILE, --testfile=TESTFILE
                        Test File  Output Path
  -x DEVREFFILE, --devreffile=DEVREFFILE
                        Dev Reference File Path
  -z TESREFFILE, --tesreffile=TESREFFILE
                        Test Reference File Path
  -b BASELINE, --baseline=BASELINE
                        Test Only Base Line Score
  -p BACKDIR, --backdir=BACKDIR
                        directory to go back


## Results

### - Baseline

- XGBoost works by taking into account the statistical features.
- After learning the features, it predicts the relevance.
- It looks at the words present in the search term, product description, title, and brand names and evaluates the ratios and frequencies described above.
- Running our XGBoost model on the cleaned data gives a score of about 12%.
- ***Note:*** This doesn't learn actual meaning in the description and title of the product.

In [26]:
!python check.py -t output/test_bs.out -p ../ -b true

test score: 0.12031


***

### - Main

- The model starts to learn the context with window 1.
- After 10 epochs, the model has learned enough of the context and meaning in the text to find the similarity.
- The architecture defined above helps in finding the actual similarity between search terms and product information.
- The MSE loss drops to 0.39 after the final epoch, which is in the desired range.

In [4]:
!python check.py

dev score: 0.44857
test score: 0.38266


### Files

- output
    - dev.out (dev output of main model)
    - test.out (test output of main model)
    - test_bs.out (test output of baseline model)
- output/reference
    - dev.out (dev truth values)
    - test.out (test truth values)

## Analysis of the Results

- We see a major improvement with our deep learning approach: it scores 38.27% on the test set against 12.03% for the baseline.
- When an output entry matches the truth values we assign a score; otherwise we penalize the model.
- Because of this strict scoring we may not reach a score of 80% or more, but it helps in returning results according to their context and meaning.

- Inference:
    - Using the gensim model was a naive approach. To make the model more accurate, we decided to move to PyTorch and neural networks.

## Future Work

In this project, we implemented a deep learning approach for ranking products against a given query. Deep learning is not a silver bullet, however, so below are our thoughts on future work:
- Although in this iteration of the project we were able to break ties in rank between two similar products using cosine similarity, we would like to add the following:
    - Understanding the context of the product features based on the search query which in many scenarios is a good metric to break the tie.
    - Given more data of user clicks, we can add that as an additional feature to determine the relevance alongside the similarity currently being produced by the model.
- In the future, we also plan to use pre-trained embeddings coupled with our changes to enhance the performance of our model.
- We plan to find a more diverse set of data.

## References

[1] Shen, Yelong & He, Xiaodong & Gao, Jianfeng & Deng, Li & Mesnil, Grégoire. (2014). A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. 101-110. 10.1145/2661829.2661935.

[2] Gensim: Topic Modelling for Humans. Machine Learning Consulting, Retrieved from https://radimrehurek.com/gensim/models/word2vec.html.

[3] Li, Z. (2019 30). A Beginner's Guide to Word Embedding with Gensim Word2Vec Model. Retrieved from https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92.