# Intro
Welcome to the [Shopee - Price Match Guarantee](https://www.kaggle.com/c/shopee-product-matching) competition.
![](https://storage.googleapis.com/kaggle-competitions/kaggle/24286/logos/header.png)

The goal of this competition is to find near-duplicate product listings in a large dataset. On Shopee, everyday users upload their own images and write their own product descriptions, which adds an extra layer of difficulty. Our task is to identify which products have been posted more than once. Different products can look very similar, while photos of the same product can look wildly different!

<span style="color: royalblue;">Please upvote the notebook if it helps you. Feel free to leave a comment. Thank you.</span>

# Libraries

In [None]:
import os
import pandas as pd
import cv2
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Path

In [None]:
path = '/kaggle/input/shopee-product-matching/'
os.listdir(path)

# Load Data

In [None]:
train_data = pd.read_csv(path+'train.csv')
test_data = pd.read_csv(path+'test.csv')
samp_subm = pd.read_csv(path+'sample_submission.csv')

# Functions
We define a helper function for plotting example images of a label group.

In [None]:
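def plot_examples_by_label_group(label_group):
    """Plot up to five example images that share the given label group."""
    fig, axs = plt.subplots(1, 5, figsize=(20, 10))
    fig.subplots_adjust(hspace=.1, wspace=.1)
    # Select the group's rows and renumber them 0..n-1 for positional lookup
    df = train_data[train_data['label_group'] == label_group].reset_index(drop=True)
    axs = axs.ravel()
    for i in range(5):
        # A group may hold fewer than five postings; hide the unused axes
        if i >= len(df):
            axs[i].axis('off')
            continue
        img = cv2.imread(path + 'train_images/' + df.loc[i, 'image'])
        # OpenCV loads images as BGR, matplotlib expects RGB
        axs[i].imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
        axs[i].set_title('label_group:' + str(label_group))
        axs[i].set_xticklabels([])
        axs[i].set_yticklabels([])
    plt.show()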

# Overview

In [None]:
print('number train samples:', len(train_data.index))
print('number test samples:', len(test_data.index))
print('number train images:', len(os.listdir(path+'train_images')))
print('number test images:', len(os.listdir(path+'test_images')))

In [None]:
train_data.head()

In [None]:
samp_subm.head()
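
To make the task concrete, here is a minimal sketch of how the `label_group` column can be turned into a list of matching postings per row. It uses a made-up toy frame instead of `train.csv`, and it assumes that postings are identified by a `posting_id` column and that matches are written as a space-separated string in a `matches` column; check both against the `head()` outputs above before relying on them.

```python
import pandas as pd

# Toy stand-in for train.csv; the posting IDs and label groups are made up.
# 'posting_id' and 'matches' are assumed column names.
toy = pd.DataFrame({
    'posting_id': ['a1', 'a2', 'b1', 'a3', 'c1'],
    'label_group': [10, 10, 20, 10, 30],
})

# Collect all posting IDs that share a label group ...
ids_by_group = toy.groupby('label_group')['posting_id'].agg(list)

# ... and attach the space-separated list to every posting in that group.
toy['matches'] = toy['label_group'].map(ids_by_group).str.join(' ')
print(toy[['posting_id', 'matches']])
```

Every posting appears in its own match list, so a product that was posted only once simply matches itself.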

# EDA

## Images - Near-duplicates

We plot five example images for each of the four label groups that contain the most images:

In [None]:
# value_counts() sorts by frequency, so the largest groups come first
label_groups = train_data['label_group'].value_counts().index
for label_group in label_groups[:4]:
    plot_examples_by_label_group(label_group)
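
The selection above relies on `value_counts()` returning the groups sorted by size in descending order. A small sketch on made-up labels (not the competition data) shows the ordering:

```python
import pandas as pd

# Made-up label groups; in the notebook this would be train_data['label_group'].
labels = pd.Series([7, 3, 7, 5, 7, 3], name='label_group')

# Count how often each label occurs, most frequent first.
group_sizes = labels.value_counts()
print(group_sizes)

# The index holds the labels, so slicing it yields the largest groups.
top_groups = group_sizes.index[:2]
print(list(top_groups))
```

The same `group_sizes` series can also be inspected with `describe()` to see how many groups are very small.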

## Title - Cloud Of Words

In [None]:
stopwords = set(STOPWORDS)
# Cast every title to a string, split it into tokens, lower-case them,
# and join everything into one long text for the word cloud
comment_words = ' '.join(
    token.lower()
    for val in train_data['title']
    for token in str(val).split()
)
wordcloud = WordCloud(width=800, height=600,
                      background_color='white',
                      stopwords=stopwords,
                      min_font_size=10).generate(comment_words)
plt.figure(figsize=(8, 8), facecolor=None)
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()
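
A word cloud shows relative frequencies only visually. As a numeric counterpart, the following sketch counts tokens with `collections.Counter`. The titles are made up for illustration, and unlike the word cloud this sketch does not remove stop words.

```python
from collections import Counter

# Made-up titles standing in for train_data['title'].
titles = ['Tas Wanita Murah', 'tas ransel pria', 'Sepatu Wanita']

# Lower-case every title, split it on whitespace, and count the tokens.
counts = Counter(
    token
    for title in titles
    for token in str(title).lower().split()
)
print(counts.most_common(2))
```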

# Export

In [None]:
output = samp_subm.copy()
output.to_csv('submission.csv', index=False)
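
The cell above writes the sample submission back out unchanged. As a hedged sketch of the simplest submission one could build instead, the block below lets each posting match only itself. It is not a method used in this notebook; the toy IDs are made up, and the `posting_id` and `matches` column names are assumptions to verify against `samp_subm`.

```python
import pandas as pd

# Toy stand-in for test.csv; only the assumed 'posting_id' column is needed.
toy_test = pd.DataFrame({'posting_id': ['t1', 't2', 't3']})

# Trivial baseline: every posting is listed as its own single match.
baseline = pd.DataFrame({
    'posting_id': toy_test['posting_id'],
    'matches': toy_test['posting_id'],
})

# Render as CSV text; pass a file name to to_csv() to write it to disk.
csv_text = baseline.to_csv(index=False)
print(csv_text)
```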