### About me 
I am a physicist who performs statistical analysis on petabytes of physics data to make new discoveries. I was part of the team that discovered the Higgs boson, popularly known as the "God Particle". I am here to try my data science skills on real-world industry problems. 


If you like this notebook, please upvote it to motivate me to add more useful information in the future.  



### Table of contents: 
* [Introduction](#introduction)
 * [What to expect in this notebook](#expect)


* [Load the libraries](#loadlib)
* [Load the dataset and basic operations](#loaddata)
* [Explore data with similar phash](#explorephash)
* [Explore data with similar image label](#exploreimglabel)

### Introduction <a class="anchor" id="introduction"></a>



Aim: The goal of the competition is to find images that represent the same product. This can be achieved by better understanding the product images themselves and the metadata associated with them. Here the metadata is simply the title of the image. 

#### What to expect in this notebook <a class="anchor" id="expect"></a> 
This notebook uses the image data and its associated text metadata to identify the same products.  
##### Strategy 
We have three sets of information available: 
1. Image 
2. Metadata of Image, i.e. title 
3. image_phash 

Ideally, we can use all three to decide whether two (or more) images correspond to the same product. 

Let me guide you through these three sources, see how useful each piece of information is, how to combine them, and what else can be tried. 



In [None]:

train_img='/kaggle/input/shopee-product-matching/train_images/'
train_csv='/kaggle/input/shopee-product-matching/train.csv'



### Load the libraries <a class="anchor" id="loadlib"></a>



In [None]:
import os
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import cv2


### Load the dataset and basic operations <a class="anchor" id="loaddata"></a>

In [None]:

df_train  = pd.read_csv(train_csv)

!ls /kaggle/input/shopee-product-matching/train_images/ | wc -l 



In [None]:
df_train.columns


In [None]:
df_train.shape


In [None]:
len(df_train.posting_id.unique())

In [None]:
len(df_train.image_phash.unique())

In [None]:
len(df_train["label_group"].unique())


Before we start playing around with the data, it is important to understand what exactly is stored in it. The csv file/dataframe has 5 columns. 
* 'posting_id': a unique string associated with each image. 
* 'image': the name of the image file, which can be used for training. Note that the train directory contains fewer images than there are image names listed in the .csv file.
* 'image_phash': phash stands for perceptual hashing. In simple words, phash is an algorithm which creates a fingerprint for an image. If the fingerprints of two images are the same, the images are likely the same. Looking at the number of unique values, we can notice that some images share the same phash value. In a sense, this competition is about creating a new kind of hash that can be used to identify similar images.
* 'title': the string which describes the product in the image. 
* 'label_group': the group the product belongs to. There are 11014 unique product groups in the training dataset. 
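
The fingerprint idea can be made a little more concrete with a small sketch. It assumes that each `image_phash` value is a hex-encoded hash (like `fad28daa2ad05595`, which shows up later in this notebook) and that a small number of differing bits hints at near-duplicate images. The helper name `phash_hamming` and the second hash are made up for illustration.

```python
def phash_hamming(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    # XOR leaves a 1 wherever the two hashes disagree; count those bits
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# h1 is taken from this notebook; h2 is an invented variant with the
# last hex digit changed, purely for illustration.
h1 = "fad28daa2ad05595"
h2 = "fad28daa2ad05597"

print(phash_hamming(h1, h1))  # distance of a hash to itself
print(phash_hamming(h1, h2))  # distance to the slightly altered hash
```

Identical hashes give a distance of zero, so comparing hashes for equality is the special case of this distance.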

### Explore data with similar phash <a class="anchor" id="explorephash"></a>

In [None]:
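def listofimages(df, feature):
    """Return the image paths of all rows sharing the most frequent value of `feature`."""
    # value_counts() sorts by frequency, so the first index entry is the mode
    mode_=df[feature].value_counts().index[0]
    df = df[df[feature]==mode_]
    return [os.path.join(train_img,iimg) for iimg in df.image.to_list()]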

In [None]:
df_train.image_phash.value_counts()


The label_group counts show that no product appears only once: every image in the input dataset has at least one partner. 

In [None]:
df_train.label_group.value_counts()

* The frequency distribution below gives a hint of how often an image has a match in the present dataset. 
* Almost half of the dataset consists of images which have only one matching image; the rest belong to larger groups, with group sizes going up to 51. 

In [None]:
import seaborn as sns 

# Number of label groups per group size: exact sizes for 2-5,
# cumulative counts ("at least i images") for 6-10.
group_sizes = df_train.label_group.value_counts()
nlist = [2,3,4,5,6,7,8,9,10]
x1=np.array(nlist)
y_=[]
for i in nlist: 
    if i<6:
        y_.append(np.sum(group_sizes==i))
    else:
        y_.append(np.sum(group_sizes>=i))
y1 = np.array(y_)

sns.barplot(x=x1,y=y1)
#print (x1,y1)
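
The statements about group sizes can also be checked numerically instead of being read off the plot. The sketch below uses a tiny toy dataframe in place of `df_train` (the label values are invented), so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for df_train; the label values are invented for illustration.
toy = pd.DataFrame({"label_group": [10, 10, 20, 20, 30, 30, 30, 40, 40, 40, 40]})

# Images per label_group
group_sizes = toy["label_group"].value_counts()

# How many groups exist for each group size
size_distribution = group_sizes.value_counts().sort_index()

# Share of images that have exactly one partner (groups of size 2)
share_in_pairs = group_sizes[group_sizes == 2].sum() / len(toy)

print(group_sizes.min())            # smallest group size
print(size_distribution.to_dict())  # group size -> number of groups
print(round(share_in_pairs, 3))
```

Replacing `toy` with `df_train` should give the corresponding numbers for the training set.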

In [None]:
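# I will use this function very often 

def display_multiple_img(images_paths, rows, cols):
    """Show the images at images_paths on a rows x cols grid."""
    # squeeze=False keeps ax two-dimensional even for a single row or column
    figure, ax = plt.subplots(nrows=rows,ncols=cols,figsize=(18,88),squeeze=False)
    axes = ax.ravel()
    for axis in axes:
        axis.set_axis_off()
    # zip stops at the shorter input, so surplus images or cells are ignored
    for axis,image_path in zip(axes,images_paths):
        image=cv2.imread(image_path)
        if image is None:  # cv2.imread returns None for a missing or unreadable file
            continue
        axis.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.tight_layout()
    plt.show()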


## The notebook using RAPIDS, ResNet18 and cosine distance (score 0.712) 
You can check out the notebook that uses RAPIDS and ResNet18 and measures similarity with cosine distance at this link: 


* The following part discusses some corner cases I found while exploring images that are not tagged with the same category. 
* The case here refers to 6 images which represent the same item but are assigned different label_group values in the training dataset. 

### Let's check what those 6 images are. 
Things to notice: 
1. The 6 images belong to **two** label_groups. 
2. Three of them have exactly the same title, while the remaining ones differ slightly. 
3. Visually, they all look very similar. 


In [None]:
# posting_ids of the six images discussed above
suspect_ids = [
    'train_3386243561',
    'train_2120597446',
    'train_3423213080',
    'train_1816968361',
    'train_1831941588',
    'train_3805508898',
]
image_list = df_train[df_train.posting_id.isin(suspect_ids)]

image_list

In [None]:
image_list.title.values.tolist()

In [None]:

image_list_path = [os.path.join(train_img,iimage) for iimage in image_list.image.values.tolist()]
display_multiple_img(image_list_path,3,2)


When cropping is added in the 

**Conclusion:** Looking at this one example, we can guess that the data is not as clean as one might expect, and the same will be true for the test dataset. This will affect the tagging performance on both seen and unseen data. 

1. The images in one label_group can very easily be tagged as another, which will lead to a low score in the end. 
If you have ideas for tackling situations like these, please write them down in the comments. 
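
One possible way to surface candidates like these automatically is to look for identical `image_phash` values that are spread over more than one `label_group`. This is only a sketch on an invented toy dataframe; it catches exact hash collisions only, and whether the six postings above actually share a hash is not checked here.

```python
import pandas as pd

# Toy stand-in for df_train; all values are invented for illustration.
toy = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4", "p5"],
    "image_phash": ["aaaa", "aaaa", "bbbb", "bbbb", "cccc"],
    "label_group": [1, 2, 3, 3, 4],
})

# Number of distinct label groups each hash appears in
labels_per_hash = toy.groupby("image_phash")["label_group"].nunique()

# Hashes that are assigned to more than one label group
conflicting_hashes = labels_per_hash[labels_per_hash > 1].index

# Rows worth inspecting by eye
suspects = toy[toy["image_phash"].isin(conflicting_hashes)]
print(suspects)
```

The resulting rows could then be passed to `display_multiple_img` for a visual check.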



## Looking at the features 

* Looking at the title and label_group, one can guess why this particular image_phash has the highest frequency in the dataset. The image column reveals one more feature of the dataset: a given image is repeated multiple times. The rows share the same hash but are separate entries, most likely because of different titles. Let's try to visualize these 26 images. 

In [None]:
img_TopOcc = df_train[df_train.image_phash=="fad28daa2ad05595"].image.to_list()
img_TopOcc_ = [os.path.join(train_img,iimg) for iimg in img_TopOcc]
display_multiple_img(img_TopOcc_,13,2) ## a 13 x 2 grid has room for 26 images
img_TopOcc_

As expected, they are all the same. Let's check the images based on the label. 

### Explore data with similar image label <a class="anchor" id="exploreimglabel"></a>

In [None]:
list_label = listofimages(df_train,"label_group")
len(list_label)
display_multiple_img(list_label,17,3)
#list_label

In [None]:
df_train[df_train.image=='f9dc2cf9ed811fec7cbc9d5120638f0c.jpg']['title'].values

Now that we have seen some basic features of the data, let's try to dig deeper. 

I personally think that we first need to split the problem into fragments and then combine the solutions. The fragments should therefore be chosen so that they can be combined later, and each solution should come in a format that makes combining them as easy as plug and play. 

To me, the problem can be divided into at least three fragments: 

1. The text associated with the image, i.e. the title.
2. The image itself, which carries a lot of information: the image of the product can be used to decide whether two images are similar or not. 
3. The remaining one or two fragments, which I will focus on later. 

Let's focus on the first two for now. 
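
To illustrate what "plug and play" could look like, here is a minimal sketch in which every fragment returns a dictionary from a posting to the set of postings it considers matches, and the results are merged with a set union. The function name `combine_matches`, the dictionary format, and the toy ids are all assumptions made for this illustration, and a union is only one of several possible ways to combine.

```python
def combine_matches(*sources):
    """Merge several {posting_id: set_of_matches} dictionaries by set union."""
    combined = {}
    for source in sources:
        for posting_id, matches in source.items():
            # setdefault creates a fresh set, so the inputs are never modified
            combined.setdefault(posting_id, set()).update(matches)
    return combined


# Invented outputs of a title-based and an image-based fragment
title_matches = {"p1": {"p1", "p2"}, "p2": {"p2", "p1"}, "p3": {"p3"}}
image_matches = {"p1": {"p1", "p3"}, "p3": {"p3", "p1"}}

combined = combine_matches(title_matches, image_matches)
print({posting: sorted(matches) for posting, matches in combined.items()})
```

A union favours finding more matches at the cost of more false positives; an intersection would make the opposite trade-off.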



### Detecting the image similarity using the images. 

The following is still a work in progress. 



In [None]:
#tmp = df_train.groupby('label_group')["posting_id"].agg('unique').to_dict()
#df_train.label_group.map(tmp)


tmp = df_train.groupby("image_phash")["posting_id"].agg('unique').to_dict()
#df_train.image_phash.map(tmp).values.tolist()
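
The dictionary built above can be mapped back onto the dataframe so that every posting gets the list of postings sharing its hash. The sketch below shows the idea on an invented toy dataframe; the column name `phash_matches` is made up for this example.

```python
import pandas as pd

# Toy stand-in for df_train; all values are invented for illustration.
toy = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4"],
    "image_phash": ["aaaa", "aaaa", "bbbb", "aaaa"],
})

# image_phash -> array of posting_ids sharing that hash
phash_to_postings = toy.groupby("image_phash")["posting_id"].agg("unique").to_dict()

# Attach the matching postings to every row as a plain Python list
toy["phash_matches"] = toy["image_phash"].map(phash_to_postings).apply(list)
print(toy)
```

Every posting matches at least itself, so a posting with a unique hash ends up with a one-element list.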

### Detecting the image similarity using the title of the image. 

For this purpose I will use some natural language processing tools to build a meaningful dataset which can be understood by numpy and ML algorithms. The process mainly involves converting the text into numbers. Let's see how this can be achieved. 

Let's check some features of the images now. 

In [None]:
import os

import numpy as np
import pandas as pd
from numpy import dot
from numpy.linalg import norm

import gensim
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess

import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import PorterStemmer

np.random.seed(2018)
nltk.download('wordnet')
stemmer = SnowballStemmer('english')

In [None]:
'''
text_ = ""
import nltk
from nltk.stem import WordNetLemmatizer
'''

The very first step is to tokenize the string using nltk, find the unique tokens, and lemmatize the verbs. 

In [None]:
'''text_tokens_ = nltk.word_tokenize(text_)
text_tokens_
text_tokens_1 = nltk.Text(text_tokens_)
text_tokens_1
set(text_tokens_)
wlemma = WordNetLemmatizer()
doc1 = [wlemma.lemmatize(w,'v') for w in text_tokens_1]
doc2 = doc1 
'''

In [None]:
'''
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text_])
vectors
feature_names = vectorizer.get_feature_names_out()

dense = vectors.todense()
dense
denselist = dense.tolist()
denselist
df = pd.DataFrame(denselist, columns=feature_names)
df
'''

In [None]:
'''
titles = df_train.title.values.tolist()
titles_skim = titles[:5000]
titles_skim
vectors = vectorizer.fit_transform(titles_skim)
feature_names = vectorizer.get_feature_names_out()
len(feature_names)
'''


In [None]:
'''
dense = vectors.todense()
dense
denselist = dense.tolist()
denselist
df = pd.DataFrame(denselist, columns=feature_names)

df
'''
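
As a small working counterpart to the commented-out cells above, the sketch below turns a few titles into TF-IDF vectors and compares them with cosine similarity. The titles are invented, the vectorizer uses scikit-learn's default settings, and nothing here is tuned for the competition data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented example titles; the first two describe the same product
titles = [
    "red cotton shirt for men",
    "men red cotton shirt",
    "wireless bluetooth headphones",
]

# Each title becomes a sparse vector of TF-IDF weights
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles)

# Pairwise cosine similarity between all titles (3 x 3 matrix)
similarity = cosine_similarity(tfidf)
print(similarity.round(2))
```

Titles that share most of their words score close to 1, while titles without any common word score 0, so a threshold on this matrix gives a first text-based match list.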

## Stay tuned for more details