<center><h1>Shopee - Price Match Guarantee</h1></center>

<h2>Let’s Understand the Problem First</h2>


Shopee is an e-commerce platform that offers a ‘Lowest Price Guaranteed’ feature on thousands of listed products. To honor this guarantee, Shopee must find duplicate or similar items listed on other retailers’ websites. To perform these matches automatically, we have to build an ML algorithm that can cluster similar items despite differences in images, titles, descriptions, etc. It is given that a product can have at most 49 similar products.


<h2>Let’s Understand the Data Now</h2>


The training data provided has the following features: posting_id, image name, image, image_phash (a perceptual hash of the image), the title of the posting, and the image group (label_group). The image group is an ID code shared by all postings that map to the same product.
The test data has 3 samples, but the model is evaluated privately on more samples (about 70K images) when submitted. The submission file should consist of 2 columns:


•posting_id: the posting ID of the image (taken from the test file)

•matches: the posting IDs of all postings that match the current one, separated by spaces. Keep in mind that every posting matches itself, so its own posting ID must be included in its entry too.
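
The two-column layout described above can be sketched as follows. This is only an illustration: the `predicted` dictionary and the helper `to_matches_string` are hypothetical, and the cap of 50 IDs per row is an assumption derived from the "at most 49 similar products" statement plus the self-match.

```python
import pandas as pd

# Hypothetical predictions: posting_id -> other postings judged to be the same product
predicted = {
    "test_1": ["test_2"],
    "test_2": ["test_1"],
    "test_3": [],
}

def to_matches_string(posting_id, neighbours, max_matches=50):
    """Build the space-separated matches entry for one posting (illustrative helper)."""
    # Put the self-match first, then the predicted neighbours
    ordered = [posting_id] + [n for n in neighbours if n != posting_id]
    # Drop repeated IDs while preserving order
    unique = list(dict.fromkeys(ordered))
    # Assumed cap: 49 similar products plus the posting itself
    return " ".join(unique[:max_matches])

submission = pd.DataFrame({
    "posting_id": list(predicted),
    "matches": [to_matches_string(pid, nbrs) for pid, nbrs in predicted.items()],
})
print(submission)
```

Writing the frame with `submission.to_csv("submission.csv", index=False)` would then produce a file with one row per test posting.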


<h2>Evaluation Metric</h2>

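This competition is judged on the F1 score. The major difference between ‘accuracy’ and ‘F1’ is that accuracy counts ‘True Negatives’ as correct predictions, while F1 ignores them: it is the harmonic mean of precision and recall, so it depends only on ‘True Positives’, ‘False Positives’ and ‘False Negatives’. This matters here because almost every pair of postings is a non-match, so accuracy would look high even for a poor model.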
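
To make the metric concrete, here is a minimal sketch of F1 computed on sets of posting IDs. It assumes the score is calculated per posting and then averaged; the function names `row_f1` and `mean_f1` and the toy IDs are made up for illustration.

```python
def row_f1(true_matches, predicted_matches):
    """F1 for a single posting, comparing the true and predicted match sets."""
    true_set, pred_set = set(true_matches), set(predicted_matches)
    tp = len(true_set & pred_set)   # correctly predicted matches
    fp = len(pred_set - true_set)   # predicted but not true
    fn = len(true_set - pred_set)   # true but missed
    if tp == 0:
        return 0.0
    # Equivalent to 2 * precision * recall / (precision + recall)
    return 2 * tp / (2 * tp + fp + fn)

def mean_f1(true_lists, pred_lists):
    """Average the per-posting F1 over all postings (assumed evaluation scheme)."""
    scores = [row_f1(t, p) for t, p in zip(true_lists, pred_lists)]
    return sum(scores) / len(scores)

# Toy example: three postings, where p1 and p2 are the same product
true_matches = [["p1", "p2"], ["p1", "p2"], ["p3"]]
pred_matches = [["p1", "p2"], ["p2"], ["p3", "p1"]]
print(mean_f1(true_matches, pred_matches))
```

Note that true negatives never appear in the formula, which is why the many non-matching postings do not inflate the score.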


<h2>Libraries</h2>

In [None]:
import numpy as np 
import pandas as pd
import os

import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from wordcloud import WordCloud, STOPWORDS
import glob
import random

<h2>Load Data</h2>

<h3> Train Data </h3>

In [None]:
train_data = pd.read_csv("../input/shopee-product-matching/train.csv")
train_data.head()

<h3>Test Data </h3>

In [None]:
test_data = pd.read_csv("../input/shopee-product-matching/test.csv")
test_data.head()

<h3> Sample Submission</h3>

In [None]:
sample_sub = pd.read_csv("../input/shopee-product-matching/sample_submission.csv")
sample_sub.head()

<h3>Exploratory Data Analysis</h3>

<h5>Data Information</h5>

In [None]:
train_data.info()

<h5>There are no null values in the dataset :)</h5>

<h5> Dataset Size </h5>

In [None]:
print(f"Training Dataset Shape: {train_data.shape}")
print(f"Test Dataset Shape: {test_data.shape}")

<h5> Column-wise Unique values </h5>

In [None]:
for col in train_data.columns:
    print(col + ":" + (str(len(train_data[col].unique()))))

<h5>Except for the posting_id column, every column contains duplicate values </h5>

<h5> Train & Test Image Count </h5>

In [None]:
train_jpg_directory = '../input/shopee-product-matching/train_images'
test_jpg_directory = '../input/shopee-product-matching/test_images'
def getImagePaths(path):
    image_names = []
    for dirname, _, filenames in os.walk(path):
        for filename in filenames:
            fullpath = os.path.join(dirname, filename)
            image_names.append(fullpath)
    return image_names
train_images_path = getImagePaths(train_jpg_directory)
test_images_path = getImagePaths(test_jpg_directory)
print(f"Number of train images: {len(train_images_path)}")
print(f"Number of test images:  {len(test_images_path)}")

<h5>Display Images</h5>

In [None]:
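def display_img(images_paths, rows, cols):
    # Show up to rows * cols images on a grid
    figure, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 8))
    axes = np.ravel(ax)
    for axis in axes:
        axis.set_axis_off()
    # zip stops at the shorter sequence, so paths beyond the grid size are ignored
    for axis, image_path in zip(axes, images_paths):
        image = cv2.imread(image_path)
        if image is None:  # cv2.imread returns None for unreadable files
            continue
        # OpenCV loads images as BGR; matplotlib expects RGB
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        axis.imshow(image)
    plt.tight_layout()
    plt.show()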

<h5>Train Images</h5>

In [None]:
display_img(train_images_path[:25], 5, 5)  # a 5 x 5 grid holds 25 images

<h5>Test Images</h5>

In [None]:
display_img(test_images_path, 1, 3)

<h5> Image Label Groups by No. of Images </h5>

In [None]:
# Count postings per label group once, then keep the 15 largest groups
label_counts = train_data['label_group'].value_counts()
top15_names = label_counts.index.tolist()[:15]
top15_values = label_counts.tolist()[:15]

plt.figure(figsize=(20, 10))
sns.barplot(x=top15_names, y=top15_values)
plt.xticks(rotation=45)
plt.xlabel("Label Group")
plt.ylabel("Image Count")
plt.title("Top-15 Label Groups by Image Count")
plt.show()

<h5> Duplicate Count per Label</h5>

In [None]:
groups = train_data.label_group.value_counts()
plt.figure(figsize=(20,5))
plt.plot(np.arange(len(groups)),groups.values)
plt.ylabel('Duplicate Count',size=14)
plt.xlabel('Index of Unique Item',size=14)
plt.title('Duplicate Count vs. Unique Item Count',size=16)
plt.show()

plt.figure(figsize=(20,5))
plt.bar(groups.index.values[:50].astype('str'),groups.values[:50])
plt.xticks(rotation = 45)
plt.ylabel('Duplicate Count',size=14)
plt.xlabel('Label Group',size=14)
plt.title('Top 50 Duplicated Items',size=16)
plt.show()
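
Because label_group identifies postings of the same product, the ground-truth matches for every posting can be derived by grouping on it. The sketch below does this on a made-up toy frame; the column names `target` and `group_size` are hypothetical and not part of the competition data.

```python
import pandas as pd

# Toy stand-in for train_data with only the two columns needed here
toy = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4"],
    "label_group": [10, 10, 20, 10],
})

# Map each label_group to the list of posting_ids it contains
groups = toy.groupby("label_group")["posting_id"].agg(list).to_dict()

# Every posting gets the full list of postings in its group (including itself)
toy["target"] = toy["label_group"].map(groups)
toy["group_size"] = toy["target"].apply(len)
print(toy)
```

The `group_size` column corresponds to the duplicate counts plotted above, viewed per posting rather than per label group.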

<h4> So we have gathered good in-depth knowledge of the data. Let's try a model now </h4>

<h2>RAPIDS</h2>

In [None]:
import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
import tensorflow as tf
import nltk
from cuml.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer as CV
from wordcloud import WordCloud,STOPWORDS
from tensorflow.keras.applications import ResNet101
print('TF',tf.__version__)
print('RAPIDS',cuml.__version__)

<h3> Finding Similar Titles using RAPIDS </h3>

To find similar items in the train data using only the title text, we first extract text embeddings with RAPIDS cuML's TfidfVectorizer. With binary=True, each title becomes a vector that marks which vocabulary words are present, weighted by inverse document frequency, rather than a plain one-hot encoding. We then compare these vectors with RAPIDS cuML KNN to find titles that are similar.

In [None]:
# Load Data
train_data = cudf.read_csv('../input/shopee-product-matching/train.csv')
train_data.head(2)

<h4>Extract Text Embeddings with RAPIDS TfidfVectorizer</h4>
TfidfVectorizer returns a cupy sparse matrix. Afterward we convert to a cupy dense matrix and feed that into RAPIDS cuML KNN.

In [None]:
model = TfidfVectorizer(stop_words='english', binary=True)
text_embeddings = model.fit_transform(train_data.title).toarray()
print('text embeddings shape is',text_embeddings.shape)
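
As a small CPU-side illustration of what these embeddings look like, the sketch below runs scikit-learn's TfidfVectorizer on three made-up titles. It assumes the cuML vectorizer follows the same conventions as scikit-learn (IDF weighting and L2-normalised rows by default); the titles are invented for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up titles; the first two describe the same kind of product
titles = [
    "samsung galaxy cover",
    "samsung galaxy cover black",
    "cotton shirt",
]

vectorizer = TfidfVectorizer(stop_words="english", binary=True)
embeddings = vectorizer.fit_transform(titles).toarray()

# One column per vocabulary word
print(sorted(vectorizer.vocabulary_))
# Values are IDF-weighted and row-normalised, not plain 0/1 flags
print(embeddings.round(2))
# Each row has unit length, so Euclidean distance is monotone in cosine similarity
print(np.linalg.norm(embeddings, axis=1))
```

On the real data the vocabulary is much larger, which is why converting the sparse matrix to a dense one can be memory-hungry.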

In [None]:
# Find similar 'Titles' with RAPIDS KNN
KNN = 50
model = NearestNeighbors(n_neighbors=KNN)
model.fit(text_embeddings)
distances, indices = model.kneighbors(text_embeddings)

In [None]:
for k in range(5):
    plt.figure(figsize=(20,3))
    plt.plot(np.arange(50),cupy.asnumpy(distances[k,]),'o-')
    plt.title('Text Distance From Train Row %i to Other Train Rows'%k,size=16)
    plt.ylabel('Distance to Train Row %i'%k,size=14)
    plt.xlabel('Index Sorted by Distance to Train Row %i'%k,size=14)
    plt.show()
    
    print( train_data.loc[cupy.asnumpy(indices[k,:10]),['title','label_group']] )
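
One possible way to turn the neighbour distances into a matches column is a simple distance cutoff. The sketch below uses small NumPy arrays in place of the cupy results; the function name `neighbours_to_matches` and the threshold of 0.6 are assumptions for illustration, and a real cutoff would have to be tuned on the training data.

```python
import numpy as np

def neighbours_to_matches(posting_ids, distances, indices, threshold):
    """Keep neighbours closer than `threshold` and join their IDs (illustrative helper)."""
    matches = []
    for row, pid in enumerate(posting_ids):
        # Indices of the neighbours whose distance is below the cutoff
        keep = indices[row][distances[row] < threshold]
        # Self-match first, then the remaining neighbours in distance order
        ids = [pid] + [posting_ids[i] for i in keep if posting_ids[i] != pid]
        matches.append(" ".join(ids))
    return matches

# Toy KNN output for three postings: each row is sorted by distance
posting_ids = ["a", "b", "c"]
distances = np.array([[0.0, 0.3, 1.2],
                      [0.0, 0.3, 1.1],
                      [0.0, 1.1, 1.2]])
indices = np.array([[0, 1, 2],
                    [1, 0, 2],
                    [2, 1, 0]])

matches = neighbours_to_matches(posting_ids, distances, indices, threshold=0.6)
print(matches)
```

A lower threshold favours precision and a higher one favours recall, so the cutoff trades off the two halves of the F1 score.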

<h4> Matching Images Using RAPIDS... In Progress</h4>

<h4>References</h4>

https://www.kaggle.com/cdeotte/rapids-cuml-tfidfvectorizer-and-knn


https://www.kaggle.com/ishandutta/v7-shopee-indepth-eda-one-stop-for-all-your-needs