As an ML beginner, I found this competition very useful for learning various data science concepts, so I decided to document them in this notebook.

### Text processing using RAPIDS

Traditionally, GPUs have been used for neural networks because of their heavy demand for memory and matrix operations. In this case, however, we need to compare the titles of about 70,000 products (at submission time) with each other to find similar ones. Pandas and scikit-learn would force us to do this on the CPU even though GPUs are available, and the CPU computation would time out the submission. So we need to do this on the GPU. Enter RAPIDS.

RAPIDS is a suite of libraries from NVIDIA for making use of GPU capabilities. The important ones for our task are:   
**cuDF**: A GPU counterpart to Pandas. Its API is not as extensive as that of Pandas, but it is catching up.    
**cuML**: A GPU counterpart to scikit-learn. Again, not all scikit-learn functionality is there, but there is enough for this competition.   

### Image and Text Embeddings

To compare an image to another image or a sentence to another sentence, we need to represent these objects as numbers (vectors).    
We can use the penultimate layer of a pre-trained EfficientNet to get a representative vector for an image. Any other pre-trained image model would work as well.   
Similarly, for sentences/text (titles here), you can use the TfidfVectorizer available in the cuML library to get a vector for each title.
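
To make the text side concrete, here is a minimal pure-Python sketch of what a binary TF-IDF vectorizer computes. It assumes the smoothed IDF formula and the L2 row normalization that scikit-learn's TfidfVectorizer uses by default; the helper name `tiny_tfidf` and the toy titles are made up for illustration, and real vectorizers also handle tokenization details and stop words that are skipped here.

```python
import math

def tiny_tfidf(titles):
    """Binary TF-IDF with smoothed IDF and L2-normalized rows (toy version)."""
    # Tokenize by lowercasing and splitting on whitespace (a simplification)
    docs = [set(t.lower().split()) for t in titles]
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1
           for w in vocab}
    rows = []
    for d in docs:
        # Binary term frequency: a word counts once however often it appears
        row = [idf[w] if w in d else 0.0 for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        rows.append([x / norm for x in row])
    return vocab, rows

def dot(u, v):
    """Dot product; for unit-length rows this is the cosine similarity."""
    return sum(x * y for x, y in zip(u, v))

titles = ["red cotton shirt", "red cotton shirt large", "wireless mouse"]
vocab, vectors = tiny_tfidf(titles)

print(vocab)
print(round(dot(vectors[0], vectors[1]), 3))  # titles sharing most words
print(round(dot(vectors[0], vectors[2]), 3))  # no shared words: 0.0
```

Titles that share most of their words end up with a similarity close to 1, while titles with no words in common score exactly 0.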

### Comparing embedding speed

#### scikit-learn

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer as skTfidfVectorizer
train_pd = pd.read_csv('../input/shopee-product-matching/train.csv')
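# At submission time we have about 70k records. Train has about 35k, so we concatenate it with itself to get 70k.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
train_pd = pd.concat([train_pd, train_pd], ignore_index=True)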

In [None]:
%%time
model = skTfidfVectorizer(stop_words='english', binary=True, max_features=25_000)
text_embeddings_pd = model.fit_transform(train_pd['title'])

#### RAPIDS (cuDF and cuML)

In [None]:
import cudf, cuml
from cuml.feature_extraction.text import TfidfVectorizer as cuTfidfVectorizer
train_cu = cudf.read_csv('../input/shopee-product-matching/train.csv')
# At submission time we have about 70k records. Train has about 35k, so we concatenate it with itself to get 70k.
train_cu = cudf.concat([train_cu, train_cu], ignore_index=True)

In [None]:
%%time
model = cuTfidfVectorizer(stop_words='english', binary=True, max_features=25_000)
text_embeddings_cu = model.fit_transform(train_cu['title']).toarray()

As we can see, creating the embeddings is not an expensive task in itself. The CPU takes only about a second, and although RAPIDS cuts that time by about 80%, the absolute saving is under a second. So I think RAPIDS is **not really a major help for creating embeddings.**

### Comparing titles for similarity

Once you have the vector representation of the titles, you can use various measures to compare the vectors and decide whether they are the same or similar. One such measure is the cosine similarity (equivalently, the cosine distance) between the vectors.    
Using the CPU to compare 70k records would take a very long time and would time out on Kaggle, so we have to use the GPU here. There are two options that I know of: RAPIDS or PyTorch. Let us compare them.
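
As a small illustration of the measure itself, the sketch below computes cosine similarity with NumPy on made-up vectors. It shows that once the rows are scaled to unit length, a plain matrix product gives the similarities directly. The helper name `cosine_similarity_matrix` is hypothetical, and the code runs on the CPU purely to show the arithmetic.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / norms          # scale every row to length 1
    return unit @ unit.T      # dot products of unit vectors

x = np.array([[1.0, 1.0, 0.0],
              [2.0, 2.0, 0.0],   # same direction as row 0
              [0.0, 0.0, 3.0]])  # orthogonal to rows 0 and 1

sim = cosine_similarity_matrix(x)
print(np.round(sim, 3))
# Cosine distance is simply 1 - cosine similarity
print(np.round(1.0 - sim, 3))
```

Rows pointing in the same direction get a similarity of 1 regardless of their length, and orthogonal rows get 0.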

In [None]:
from cuml.neighbors import NearestNeighbors
KNN = 50
model = NearestNeighbors(n_neighbors=KNN)

In [None]:
# try:
#     # Compare every record against every other record and find the 50 nearest records for each
#     model.fit(text_embeddings_cu.get())
#     distances, indices = model.kneighbors(text_embeddings_cu)
# except Exception as e:
#     print(e)

Running the cell above (commented out here) gives a **GPU out of memory** error, so we cannot use this approach for submission. We need a technique called chunking.
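
A rough back-of-the-envelope calculation suggests why an all-at-once comparison is heavy. The sketch assumes 70,000 rows, 25,000 TF-IDF features stored densely, and 4-byte float32 values. The real footprint depends on the actual vocabulary size and on how the library allocates memory internally, so treat these as order-of-magnitude figures rather than the exact cause of the error.

```python
GIB = 1024 ** 3

def dense_gib(rows, cols, bytes_per_value=4):
    """Size in GiB of a dense rows x cols array."""
    return rows * cols * bytes_per_value / GIB

n_rows, n_features = 70_000, 25_000   # assumed sizes, see the text above

embeddings = dense_gib(n_rows, n_features)   # dense TF-IDF matrix
all_pairs = dense_gib(n_rows, n_rows)        # every row against every row
one_chunk = dense_gib(4096, n_rows)          # 4096 rows against every row

print(f"dense embeddings  : {embeddings:.1f} GiB")
print(f"all-pairs matrix  : {all_pairs:.1f} GiB")
print(f"one 4096-row chunk: {one_chunk:.1f} GiB")
```

Under these assumptions a full pairwise similarity matrix needs roughly 18 GiB, while the similarities for 4096 rows at a time need about 1 GiB.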

### Chunking

Instead of comparing EVERY record with every other record in ONE GO,    
we compare only a CHUNK of records with every other record at a time. This technique is called chunking.
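
The idea can be checked on a toy example before moving to the GPU. The sketch below uses NumPy on the CPU with random unit-length vectors (made-up data, hypothetical helper name `chunked_matches`). It checks that processing the rows chunk by chunk finds the same matches as one big matrix product, while only ever holding a chunk-sized slice of the similarity matrix.

```python
import numpy as np

def chunked_matches(emb, chunk_len, threshold):
    """For each row, indices of rows whose similarity exceeds threshold."""
    matches = []
    for start in range(0, len(emb), chunk_len):
        stop = min(start + chunk_len, len(emb))
        # Similarity of this chunk to every row: shape (stop - start, len(emb))
        sim = emb[start:stop] @ emb.T
        for row in sim:
            matches.append(np.where(row > threshold)[0])
    return matches

rng = np.random.default_rng(0)
emb = rng.random((10, 4))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-length rows

chunked = chunked_matches(emb, chunk_len=3, threshold=0.7)

# Reference: the same comparison in one go (fine for 10 rows, not for 70k)
full = emb @ emb.T
reference = [np.where(row > 0.7)[0] for row in full]

same = all(np.array_equal(c, r) for c, r in zip(chunked, reference))
print(same)
```

The chunk length only changes how much memory is held at once, not which matches are found, so it can be tuned to whatever the GPU can hold.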

In [None]:
predictions = []
oneChunkLen = 1024 * 4  # rows per chunk
# Number of chunks, rounded up so that the last partial chunk is included
totalChunks = (len(text_embeddings_cu) + oneChunkLen - 1) // oneChunkLen

In [None]:
import cupy, gc

##### Chunking using RAPIDS (CuPy)

In [None]:
%%time
for i in range(totalChunks):
    a = i*oneChunkLen
    b = (i+1)*oneChunkLen
    b = min(b, len(text_embeddings_cu))
#     print('chunk',a,'to',b)
    
    # Cosine similarity: TfidfVectorizer L2-normalizes each row by default (norm='l2'),
    # so a plain dot product between two rows is their cosine similarity.
    cSim = cupy.matmul(text_embeddings_cu, text_embeddings_cu[a:b].T).T
    # cSim now has shape [(b - a) x len(text_embeddings_cu)].
    # That is, for each row in the chunk, it holds the similarity to every row in text_embeddings_cu.
    # Similarity lies between 0 and 1. Values closer to 1 mean more similar.
    
    for j in range(b-a):
        matchIndices = cupy.where(cSim[j,]>0.7)[0]
        posting_ids = train_pd.iloc[cupy.asnumpy(matchIndices)].posting_id.values
        predictions.append(posting_ids)
        
del model,text_embeddings_cu
_ = gc.collect()

##### Chunking using PyTorch

In [None]:
import torch
import numpy as np

In [None]:
%%time
predictions = []
# Convert the sparse matrix to a dense FP16 tensor and move it to the GPU
text_embeddings_tch = torch.from_numpy(text_embeddings_pd.toarray().astype(np.float16)).to('cuda:0')

for i in range(totalChunks):
    a = i*oneChunkLen
    b = (i+1)*oneChunkLen
    b = min(b, len(text_embeddings_tch))
#     print('chunk',a,'to',b)
    
    # Cosine similarity using PyTorch (rows are L2-normalized, so a dot product suffices)
    cSim = torch.matmul(text_embeddings_tch, text_embeddings_tch[a:b].T).T
    
    for j in range(b-a):
        matchIndices = torch.where(cSim[j,]>0.7)[0].cpu().numpy()
        posting_ids = train_pd.iloc[matchIndices].posting_id.values
        predictions.append(posting_ids)
        
del text_embeddings_tch
gc.collect()

PyTorch ran this in 1 min 37 s, and RAPIDS ran it in 1 min 19 s. There is a catch, though: in PyTorch I had to use FP16 via *.astype(np.float16)*, otherwise it resulted in a CUDA out of memory error (you may instead try reducing the chunk size). **In summary, if you are not too familiar with RAPIDS, you can do this with PyTorch.**
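
The FP16 trade-off can be seen on a small scale with NumPy. This sketch uses made-up random data on the CPU. It only shows that casting to float16 halves the memory of a dense array and that rounding the values shifts the similarities slightly; whether that shift matters near the 0.7 threshold is worth checking on the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
emb32 = rng.random((1000, 64)).astype(np.float32)
emb32 /= np.linalg.norm(emb32, axis=1, keepdims=True)  # unit-length rows

emb16 = emb32.astype(np.float16)   # same shape, half the bytes per element

print(emb32.nbytes, emb16.nbytes)

# Compare similarities computed from the original and the rounded values.
# The rounded values are cast back to float32 so only input rounding is measured.
sim32 = emb32 @ emb32.T
rounded = emb16.astype(np.float32)
sim16 = rounded @ rounded.T
max_error = float(np.abs(sim32 - sim16).max())
print(max_error)
```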

### Hash for an image

Hashing is another family of techniques: it creates a fingerprint of a media file and is widely used to compare media, for example to detect piracy. Small changes to the image (for example in resolution or hue) change the fingerprint little or not at all, so you can use it to check whether images are the same or similar.

There are several algorithms here: aHash, pHash, dHash and wHash. The pHash of each image is provided as part of our training data, so we can use it to find duplicate images.
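
As a sketch of how such hashes can be compared, the example below treats each pHash as a hexadecimal string, groups postings with identical hashes, and counts how many bits differ between two hashes (the Hamming distance). The posting ids and hash values are made up and the helper names are hypothetical; in the competition data the hash is assumed to come from a column of the training CSV.

```python
from collections import defaultdict

def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def group_by_hash(rows):
    """Map each hash to the list of posting ids that share it."""
    groups = defaultdict(list)
    for posting_id, phash in rows:
        groups[phash].append(posting_id)
    return dict(groups)

# Made-up (posting_id, phash) pairs for illustration only
rows = [
    ("post_a", "94974f937d4c2433"),
    ("post_b", "94974f937d4c2433"),   # identical hash to post_a
    ("post_c", "94974f937d4c2432"),   # differs from post_a in the last hex digit
    ("post_d", "e6d2913b8c4c6f01"),
]

groups = group_by_hash(rows)
print(groups["94974f937d4c2433"])
print(hamming_distance(rows[0][1], rows[2][1]))
print(hamming_distance(rows[0][1], rows[3][1]))
```

Exact matching only catches identical hashes. A small Hamming distance threshold can also catch near-duplicates, though the right threshold is a judgment call.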

### Acknowledgements

The RAPIDS code is from [this notebook](https://www.kaggle.com/cdeotte/part-2-rapids-tfidfvectorizer-cv-0-700) by [Chris Deotte](https://www.kaggle.com/cdeotte).   
The PyTorch code is from [this notebook](https://www.kaggle.com/nicksergievskiy/pytorch-is-all-you-need-tfidf) by [Nick](https://www.kaggle.com/nicksergievskiy/pytorch-is-all-you-need-tfidf).