###### NOTE: This model could be improved by trying different CNN models to compute the dense image vectors, and prediction quality could be increased by cropping the images close to the human figure (i.e. with less white background).

## Assumptions:
1. No seller publishes an item in the wrong category.

2. In a recommender system, products with the same title but different sizes are displayed as one product, with the sizes as product variants. So I treat them as duplicates and add them to the duplicate items.

3. All product titles start with the brand name.

4. I treat products that are identical in appearance (possibly in different sizes), and product variants in a different colour, as identical.
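
Assumption 3 can be spot-checked before relying on it. The following is a minimal sketch, assuming each row exposes the `title` and `productBrand` fields; `title_starts_with_brand` is a hypothetical helper that is not part of this notebook, and the second row is made up to show a violation.

```python
def title_starts_with_brand(title, brand):
    """Return True when the title begins with the brand name (case-insensitive)."""
    return title.strip().lower().startswith(brand.strip().lower())

# toy rows in the shape of the dataset's (title, productBrand) columns
rows = [
    ("zink london Casual Sleeveless Solid Women's Yellow Top", "zink london"),
    ("Casual Sleeveless Solid Women's Yellow Top", "zink london"),  # made-up violation
]

# titles that break the assumption
violations = [title for title, brand in rows if not title_starts_with_brand(title, brand)]
print(violations)
```

Rows that show up in `violations` would need separate handling, because the title comparison later in the notebook uses the first two words of the title as a stand-in for the brand.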

## Definition of duplicate:
Reference: https://www.netstrategy.it/en/seo-blog/duplicated-content-ecommerce

What is duplicated content?

According to the official Mountain View sources, duplicated content is a significant portion of text that is similar, or almost identical, to other text that resides on the same website or on an external website.

Let's look at Google’s original definition:

 “[…] Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin […]. If your site contains multiple pages with largely identical content, there are a number of ways you can indicate your preferred URL to Google. In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved.” (Source: https://support.google.com/webmasters/answer/66359?hl=en)

As you can imagine, not all duplicated contents found on the web are malicious. Regardless, Google adopted the necessary measures to indicate which text parts would be duplicated and which would be the version to point in order to avoid unpleasant inconveniences and guarantee correct indexing of the website.


Reference: https://www.netstrategy.it/en/seo-blog/duplicated-content-ecommerce

#### Types of duplicate content

- Copied content – This is where one web site copies information from another. With e-commerce web sites this is very common. Product descriptions are copied from the manufacturer’s web site and cause duplicate content over multiple properties (sometimes thousands of them).
- Multiple URLs – The use of search filters within e-commerce sites creates numerous possibilities for visitors to find a single product. Each unique search query will create a new URL, but the content looks similar or exactly the same. Other related duplicate pages could result from URL setup and server side settings.
- Similar product – Arises when a site uses one product description to sell other versions of the same product. Other versions could be color, size, etc.
- Closely related content – When one narrow topic is explained multiple times on multiple pages. This is very common on web sites with a blog and causes content cannibalization.
### In this project I consider two different cases:
#### 1. Exactly the same product in different sizes or colours.
#### 2. Similar products as duplicates.

###### I will calculate both of them separately and then combine them afterwards.

In [486]:
#numpy for the efficient matrix calculation
import numpy as np
#pandas for loading data into dataframe and for performing operations on the dataframe
import pandas as pd
#tqdm to show the progress of the process
from tqdm import tqdm_notebook
#for Bag of Words
from sklearn.feature_extraction.text import CountVectorizer
#for tfidf
from sklearn.feature_extraction.text import TfidfVectorizer
#for calculating cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
#for capturing a particular pattern in a  string
import re
#for joining two lists componentwise.
import itertools
#for downloading the images and opening them from the response bytes
from PIL import Image
import requests
from io import BytesIO

In [71]:
#some rows have more entries than the number of columns given, so those lines need to be ignored.
data=pd.read_csv("small-2oq-c1r.csv",usecols=['productId', 'title', 'description', 'imageUrlStr', 'mrp','sellingPrice', 'specialPrice', 'productUrl', 'categories','productBrand', 'productFamily', 'inStock', 'codAvailable','offers', 'discount', 'shippingCharges', 'deliveryTime', 'size','color', 'sizeUnit', 'storage', 'displaySize', 'keySpecsStr','detailedSpecsStr', 'specificationList', 'sellerName','sellerAverageRating', 'sellerNoOfRatings', 'sellerNoOfReviews','sleeve', 'neck', 'idealFor'])#error_bad_lines=False)

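
The idea of ignoring rows that carry more entries than there are columns can be sketched with the standard library alone. This is an illustration under assumptions, not the loading code used above: `read_well_formed_rows` is a hypothetical helper and the sample data is made up.

```python
import csv
import io

def read_well_formed_rows(handle):
    """Yield dict rows, skipping lines whose field count differs from the header."""
    reader = csv.reader(handle)
    header = next(reader)
    for fields in reader:
        if len(fields) != len(header):
            continue  # malformed line: too many or too few entries
        yield dict(zip(header, fields))

# made-up sample: the second data row has one field too many
sample = io.StringIO(
    "productId,title,size\n"
    "TOP1,Brand Casual Top,M\n"
    "TOP2,Brand Casual Top,L,unexpected-extra-field\n"
    "TOP3,Brand Party Top,S\n"
)
rows = list(read_well_formed_rows(sample))
print([row["productId"] for row in rows])
```

Counting the skipped lines as well would show how much data is lost by this policy.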


In [72]:
#each product has 32 features in the raw dataset
data.columns.values

array(['productId', 'title', 'description', 'imageUrlStr', 'mrp',
       'sellingPrice', 'specialPrice', 'productUrl', 'categories',
       'productBrand', 'productFamily', 'inStock', 'codAvailable',
       'offers', 'discount', 'shippingCharges', 'deliveryTime', 'size',
       'color', 'sizeUnit', 'storage', 'displaySize', 'keySpecsStr',
       'detailedSpecsStr', 'specificationList', 'sellerName',
       'sellerAverageRating', 'sellerNoOfRatings', 'sellerNoOfReviews',
       'sleeve', 'neck', 'idealFor'], dtype=object)

In [73]:
#we'll now select only those products which belong to the "Tops" category
#selecting tops_data
data=data[data["categories"].str.contains(">Tops",na=False)]
#resetting the index because the filtered rows keep the index of the parent dataframe
data=data.reset_index(drop=True)
#saving this dataset to a new csv file "tops_data.csv"
data.to_csv("tops_data.csv",index=False)

In [74]:
#loading the saved tops_data csv file into pandas dataframe
data=pd.read_csv("tops_data.csv")
print("Data dimensions",data.shape)



Data dimensions (347669, 32)


In [75]:
#out of 32 features we'll be using only a few for our task.
data=data[['productId', 'title', 'description', 'imageUrlStr','categories','productBrand','size','color','detailedSpecsStr']]
print("Data dimensions",data.shape)

Data dimensions (347669, 9)


In [77]:
#basic stats for productId
print(data['productId'].describe())
#The top frequency of 2 means that some product ids occur more than once.
#There are no NaN values in productId.

count               347669
unique              347579
top       TOPEAZFCZQ28HVNR
freq                     2
Name: productId, dtype: object
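
The gap between `count` and `unique` in the output above is the number of rows that a keep-first de-duplication would drop (347669 - 347579 = 90). A small sketch of the same check on made-up ids, using only the standard library; `duplicated_ids` is a hypothetical helper:

```python
from collections import Counter

def duplicated_ids(product_ids):
    """Return {id: count} for ids that occur more than once."""
    counts = Counter(product_ids)
    return {pid: n for pid, n in counts.items() if n > 1}

ids = ["TOPA", "TOPB", "TOPA", "TOPC"]  # made-up ids
print(duplicated_ids(ids))

# rows a keep-first de-duplication would remove: count minus unique
extra_rows = len(ids) - len(set(ids))
print(extra_rows)
```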


In [79]:
#basic stats for titles
print(data['title'].describe())
#so there are many products with the same product title
#there are 5 products without a title ("NaN" values).

count                                                            347664
unique                                                            50050
top       Snoogg Casual Sleeveless Graphic Print Women's Multicolor Top
freq                                                               3525
Name: title, dtype: object


In [80]:
data["title"].value_counts()

Snoogg Casual Sleeveless Graphic Print Women's Multicolor Top           3525
Diaz Casual Short Sleeve Solid Women's Multicolor Top                   1428
Uptown 18 Casual Short Sleeve Printed Women's White Top                 1176
Uptown 18 Casual Short Sleeve Printed Women's Black Top                 1124
Friskers Casual Sleeveless Solid Women's Multicolor Top                  725
Uptown 18 Casual Short Sleeve Printed Women's Red Top                    617
Amoya Casual Sleeveless Solid Women's Multicolor Top                     602
Uptown 18 Casual Short Sleeve Printed Women's Green Top                  597
Uptown 18 Casual Short Sleeve Printed Women's Blue Top                   596
Uptown 18 Casual Short Sleeve Printed Women's Grey Top                   588
Uptown 18 Casual Short Sleeve Printed Women's Yellow Top                 578
Stop Look Casual 3/4th Sleeve Printed Women's Multicolor Top             493
Piftif Beach Wear Sleeveless Solid Women's Multicolor Top                426

In [81]:
#basic stats for description
print(data['description'].describe())
#There are 347669-215905=131764 values which are "NaN".
#So we'll not use description for our task.
data=data[['productId', 'title', 'imageUrlStr','categories','productBrand','size','color','detailedSpecsStr']]

count                                                                                                  215905
unique                                                                                                  31941
top       Round neck Crop top with cool and amazing prints which makes you feel comfortable for whole day....
freq                                                                                                     5109
Name: description, dtype: object
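
The arithmetic behind dropping `description` can be written out as a short calculation. The two inputs are the figures reported above; the variable names are only illustrative.

```python
total_rows = 347669             # rows in the Tops subset
non_null_descriptions = 215905  # "count" reported by describe() for description

missing = total_rows - non_null_descriptions
missing_share = missing / total_rows
print(missing, round(missing_share, 3))
```

Well over a third of the rows have no description, which is why the column is left out of the comparison.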


In [83]:
#basic stats for imageUrlStr
print(data['imageUrlStr'].describe())
#all the values are present
#some images are shared by many products. The maximum frequency is 150.

count                                                                                                  347669
unique                                                                                                  87969
top       http://img.fkcdn.com/image/top/f/g/8/aarz009-aarzoo-m-original-imaej8seyxbdvuhg.jpeg;http://img....
freq                                                                                                      150
Name: imageUrlStr, dtype: object


In [85]:
#basic stats for categories
print(data['categories'].describe())
print(data['categories'].value_counts())
#all values are present
#there are only 5 categories in category section

count                                                     347669
unique                                                         5
top       Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops
freq                                                      333187
Name: categories, dtype: object
Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops      333187
Apparels>Women>Fusion Wear>Shirts, Tops & Tunics>Tops        12458
Apparels>Women>Maternity Wear>Shirts, Tops & Tunics>Tops      1257
Apparels>Kids>Girls>T-Shirts & Tops>Tops                       696
Apparels>Kids>Infants>Baby Girls>T-Shirts & Tops>Tops           71
Name: categories, dtype: int64


In [87]:
#basic stats for productBrand
print(data['productBrand'].describe())
# There are total 2354 product brands in our dataset.

count        347667
unique         2354
top       Vero Moda
freq           8770
Name: productBrand, dtype: object


In [89]:
#basic stats for size
print(data['size'].describe())
#all rows have data
#there are only 83 different sizes in our dataset

count     347669
unique        83
top            M
freq       73265
Name: size, dtype: object


In [91]:
#basic stats for color
print(data['color'].describe())
#there are some rows which don't contain a colour
#there are 6912 unique colours

count     347024
unique      6912
top        Black
freq       42575
Name: color, dtype: object


In [92]:
#basic stats for detailedSpecsStr
print(data['detailedSpecsStr'].describe())
#There are some rows in which the detailed specification is not available.

count                                                                           346640
unique                                                                           24173
top       Round Neck, Short Sleeve;Fabric: Cotton;Pattern: Printed;Type: Top;Pack of 1
freq                                                                              7350
Name: detailedSpecsStr, dtype: object


In [93]:
#removing those rows which don't have titles.
data = data.loc[~data['title'].isnull()]
#removing those rows which don't have product brand.
data = data.loc[~data['productBrand'].isnull()]
#removing those rows which don't have detailedSpecsStr.
data = data.loc[~data['detailedSpecsStr'].isnull()]
print("Dataset dimensions",data.shape)

Dataset dimensions (346633, 8)


Here I remove the duplicate product ids and save the result to a new CSV file so that I can continue from this point. I also cut each image URL string down to its first URL.

In [94]:
#function that keeps only the first URL of the ';'-separated image URL string.
def splitting(imgUrl):
    return imgUrl.split(";")[0]
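
A slightly more defensive variant of the same idea is sketched below. It assumes that a missing value may arrive as NaN (a float) rather than a string; `first_image_url` is a hypothetical name and the URLs are made up.

```python
def first_image_url(image_url_str):
    """Return the first URL of a ';'-separated string, or None for missing values."""
    if not isinstance(image_url_str, str) or not image_url_str:
        return None
    return image_url_str.split(";")[0].strip()

urls = "http://example.com/a.jpeg;http://example.com/b.jpeg"  # made-up URLs
print(first_image_url(urls))
print(first_image_url(float("nan")))
```

In this dataset `imageUrlStr` has no missing values, so the guard only matters if the function is reused on other columns or data.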

In [96]:
#we keep only the first copy of each duplicated product id, which removes 90 rows from the dataset.
data=data.drop_duplicates(subset=["productId"], keep='first', inplace=False)
#keep only the first image URL of each product
data["imageUrlStr"]=data["imageUrlStr"].map(splitting)
data=data.reset_index(drop=True)
#saving this dataset to a new csv file "clean_tops_data.csv"
data.to_csv("clean_tops_data.csv",index=False)

In [97]:
#loading the saved clean_tops_data csv file into pandas dataframe
data=pd.read_csv("clean_tops_data.csv")
print("Data dimensions",data.shape)

Data dimensions (346543, 8)


In [99]:
# Remove all products with four or fewer words in the title
data_sorted = data[data['title'].apply(lambda x: len(x.split())>4)]
print("After removal of products with short title:", data_sorted.shape[0])

After removal of products with short title: 345994


In [100]:
# Sort the whole data based on title (alphabetical order of title)
#This is done so that we can see that some products have exactly the same title and differ only in size or color.
#sort_values returns a new dataframe here, which avoids the SettingWithCopyWarning raised by sorting a filtered slice in place.
data_sorted=data_sorted.sort_values('title',ascending=False)
data_sorted.head()


Unnamed: 0,productId,title,imageUrlStr,categories,productBrand,size,color,detailedSpecsStr
232623,TOPER5RZG2JHTGFB,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XS,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
329521,TOPER4ZERMKX9BGY,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XL,mustard,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
232801,TOPER4ZEVAKBXRFH,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,L,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
300795,TOPER4Z6AU328YRZ,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,S,mustred,"U Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
232281,TOPER4ZE8HNKPP92,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XXL,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"


In [38]:
#to display the whole title we'll increase the column width to 100 (the default is 50).
pd.options.display.max_colwidth=100

From all these statistics I concluded that I can't use only titles or detailed specifications to detect duplicate items, as there are many products with exactly the same title and detailed specification but different images. So we have to consider the images along with the titles and detailed specifications.
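
One way to see this effect is to group rows by their text fields and count the distinct images per group. The sketch below uses made-up rows and a hypothetical helper, `images_per_listing_text`; it is an illustration, not the analysis that produced the statistics above.

```python
from collections import defaultdict

def images_per_listing_text(rows):
    """Map (title, specs) to the set of distinct image URLs seen with that text."""
    groups = defaultdict(set)
    for title, specs, image_url in rows:
        groups[(title, specs)].add(image_url)
    return groups

rows = [  # made-up rows: (title, detailedSpecsStr, imageUrlStr)
    ("Brand Casual Top", "Round Neck;Fabric: Cotton", "img/1.jpeg"),
    ("Brand Casual Top", "Round Neck;Fabric: Cotton", "img/2.jpeg"),
    ("Brand Party Top", "V Neck;Fabric: Georgette", "img/3.jpeg"),
]

# text that maps to more than one image cannot identify a product on its own
ambiguous = {key: urls for key, urls in images_per_listing_text(rows).items() if len(urls) > 1}
print(ambiguous)
```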

In [122]:
data_sorted.to_csv("sorted_tops_data.csv",index=False)
data_sorted1=pd.read_csv("sorted_tops_data.csv")
data_sorted1.head(14)

Unnamed: 0,productId,title,imageUrlStr,categories,productBrand,size,color,detailedSpecsStr
0,TOPER5RZG2JHTGFB,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XS,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
1,TOPER4ZERMKX9BGY,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XL,mustard,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
2,TOPER4ZEVAKBXRFH,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,L,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
3,TOPER4Z6AU328YRZ,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,S,mustred,"U Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
4,TOPER4ZE8HNKPP92,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XXL,mustard,"Round Neck, Sleeveless;Fabric: knitts;Pattern: Solid;Type: Top;Pack of 1"
5,TOPER4ZEGCBGPZPX,zink london Casual Sleeveless Solid Women's Yellow Top,http://img.fkcdn.com/image/top/b/g/y/xs-t00146-zink-london-original-imaer4eummptmeyf.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,M,mustard,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
6,TOPER4WDNVWSFFPH,zink london Casual Sleeveless Solid Women's Red Top,http://img.fkcdn.com/image/top/r/m/8/xxl-t00161-zink-london-original-imaer4j6dpbws4yh.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XS,red,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
7,TOPER4WHPM3ZKFM9,zink london Casual Sleeveless Solid Women's Red Top,http://img.fkcdn.com/image/top/r/m/8/xxl-t00161-zink-london-original-imaer4j6dpbws4yh.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,M,red,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
8,TOPER4PXZWAENZGX,zink london Casual Sleeveless Solid Women's Red Top,http://img.fkcdn.com/image/top/r/m/8/xxl-t00161-zink-london-original-imaer4j6dpbws4yh.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,L,red,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"
9,TOPER6F9KZHFM82D,zink london Casual Sleeveless Solid Women's Red Top,http://img.fkcdn.com/image/top/r/m/8/xxl-t00161-zink-london-original-imaer4j6dpbws4yh.jpeg,"Apparels>Women>Western Wear>Shirts, Tops & Tunics>Tops",zink london,XXL,red,"Round Neck, Sleeveless;Fabric: Georgette;Pattern: Solid;Type: Top;Pack of 1"


# Code for Converting image into vector:

### Code for downloading images:

In [591]:
#data_sorted1.to_csv("final_sort.csv",index=False)
data_sorted1=pd.read_csv("final_sort.csv")

In [None]:
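#downloading images
#note: data_sorted2 is not defined in the cells shown here; it is presumably the subset of data_sorted1 selected for download.
rm_ls=[]
data_sorted_new=data_sorted2.copy()
for index, row in data_sorted2.iterrows():
    try:
        url = row['imageUrlStr']
        response = requests.get(url)
        img = Image.open(BytesIO(response.content))
        img.save('C:/Users/Hp/Desktop/infilect/tops/'+str(index)+'_'+row['productId']+'.jpeg')
    except OSError :
        #the image could not be fetched or opened, so remember the row and drop it.
        rm_ls.append(index)
        data_sorted_new=data_sorted_new[data_sorted_new.index!=index]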

## Code for converting images to dense vector representation:

In [45]:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import applications
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt
import requests
from PIL import Image
import pandas as pd
import pickle

Using TensorFlow backend.


In [None]:
#dimensions of our images.
img_width, img_height = 224, 224

top_model_weights_path = 'bottleneck_fc_model.h5'
train_data_dir = 'topss/'
nb_train_samples = 21096
epochs = 50
batch_size = 1


def save_bottlebeck_features():
    
    #Compute VGG-16 convolutional (bottleneck) features for every image and save them to disk.
    
    asins = []
    datagen = ImageDataGenerator(rescale=1. / 255)
    
    # build the VGG16 network without its fully connected top layers
    model = applications.VGG16(include_top=False, weights='imagenet')
    generator = datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=batch_size,
        class_mode=None,
        shuffle=False)

    #keep each file name without its first two characters and the ".jpeg" extension
    for i in generator.filenames:
        asins.append(i[2:-5])

    bottleneck_features_train = model.predict_generator(generator, nb_train_samples // batch_size)
    #flatten each image's feature map into one row
    bottleneck_features_train = bottleneck_features_train.reshape((nb_train_samples,-1))
    
    np.save('data_cnn_features_21k.npy', bottleneck_features_train)
    np.save('data_cnn_feature_procuctid_g21k.npy', np.array(asins))
    

save_bottlebeck_features()
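
The reshape in the function above flattens each image's feature map into a single row. The length of that row follows from the VGG16 architecture: its convolutions preserve width and height, each of its five max-pooling layers halves them, and the last block has 512 channels. A small sketch of that arithmetic, where `vgg16_bottleneck_size` is a hypothetical helper:

```python
def vgg16_bottleneck_size(img_width, img_height, channels=512, pool_layers=5):
    """Flattened feature length after VGG16's convolutional blocks.

    Each max-pooling layer halves width and height (floor division);
    the final block outputs `channels` feature maps.
    """
    w, h = img_width, img_height
    for _ in range(pool_layers):
        w, h = w // 2, h // 2
    return w * h * channels

# 224 -> 112 -> 56 -> 28 -> 14 -> 7, so 7 * 7 * 512 values per image
print(vgg16_bottleneck_size(224, 224))
```

With 21096 images this gives a feature matrix of 21096 rows with 25088 columns each, which is worth keeping in mind for memory use.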

In [621]:
#load the image vectors that were already computed, together with the matching file names, for testing.
in_name=np.load("data_cnn_feature_procuctid_g21k.npy")
cnn_features=np.load("data_cnn_features_21k.npy")
#in_name3k=np.load("data_cnn_in_names10k.npy")
#cnn_features3k=np.load("data_cnn_features10k.npy")
in_name1=[]
for i in in_name:
    inn=i.split("\\")[1].split("_")[0]
    in_name1.append(int(inn))
in_name=np.array(in_name1)
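
The parsing above relies on the naming scheme used when the images were saved (`<row index>_<productId>.jpeg`) and on Windows path separators. A sketch of the same round trip that also accepts forward slashes; both helper names are hypothetical:

```python
import re

def image_file_name(row_index, product_id):
    """File name used when saving a downloaded image: '<index>_<productId>.jpeg'."""
    return f"{row_index}_{product_id}.jpeg"

def row_index_from_path(path):
    """Recover the integer row index from a saved image path.

    Splits on both backslashes and forward slashes, so Windows and POSIX
    style paths are handled the same way.
    """
    file_name = re.split(r"[\\/]", path)[-1]
    return int(file_name.split("_")[0])

name = image_file_name(232623, "TOPER5RZG2JHTGFB")
print(name)
print(row_index_from_path("tops\\" + name), row_index_from_path("tops/" + name))
```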

In [595]:
in_name_lt=np.sort(in_name)

In [596]:
#selecting only those rows for which i have converted their image into dense vector.
data_sorted1=data_sorted1.loc[in_name_lt]

In [599]:
data_sorted1=data_sorted1.loc[~data_sorted1['detailedSpecsStr'].isnull()]
data_sorted1.shape

(21075, 9)

In [600]:
# list which holds the index labels of the data, as the indexes are not continuous.
indices = list(data_sorted1.index)

## Code for calculating duplicates that differ only in product attributes (like size).

In [603]:
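#function for checking similarity
def check_similarity(l,m):
    #cosine similarity between the CNN feature vectors of rows l and m
    sim=cosine_similarity(cnn_features[np.where(in_name==l)[0]].reshape(1,-1),cnn_features[np.where(in_name==m)[0]].reshape(1,-1))
    #sim is a 1x1 matrix; two images count as the same product above 0.90
    return bool(sim[0,0] > 0.90)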
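
For reference, the quantity being thresholded is the cosine of the angle between two feature vectors, dot(u, v) / (|u| |v|). A dependency-free sketch on toy vectors follows; `cosine` and `is_duplicate` are hypothetical names, and returning 0.0 for all-zero vectors is a convention chosen for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def is_duplicate(u, v, threshold=0.90):
    """True when the similarity exceeds the threshold."""
    return cosine(u, v) > threshold

print(is_duplicate([1.0, 2.0, 3.0], [2.0, 4.0, 6.1]))  # nearly parallel vectors
print(is_duplicate([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
```

Because the measure ignores vector length, two images whose features differ only by a scale factor score 1.0.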

In [604]:
#products without any duplicate titles.
dedup_prodid=[]
#dictionary that stores each product as key and its duplicates as values.

dup_dict={}

r_i=0
i=0
j=0
#list to track the record of the productid processed before.
marker=[]
start_i=0
num_data=data_sorted1.shape[0]
while i < num_data and j<num_data:
    if start_i == 1:
        i = j_skipped
        start_i=0
    previous_i=i
    #store the words of title as list in a.
    a=data_sorted1['title'].loc[indices[i]].split()

    #now we will search for similar product one by one
    j=i+1
    #for storing the duplicate product ids.
    dup_list=[]

    while j < num_data:

        #store the words of title as list in b.
        b=data_sorted1['title'].loc[indices[j]].split()
        #maximum length of the strings.
        length=max(len(a),len(b))



        #count variable to store number of matched words
        count=0

        #variable for counting the iterations
        l=0
        #variable to know whether the two products are of same brand or not
        sb=0
        
        #itertools.zip_longest(a,b) pairs the words of a and b position by position, padding the shorter title with None.
        for k in itertools.zip_longest(a,b):
            if (k[0] == k[1]):
                if any([ l==0, l==1 ]):
                    sb+=1

                count+=1
            l+=1
        #if the titles differ in more than one word, we consider them different products.
        #if they differ in exactly one word, sb tells whether the brand matches: sb<=1 means a different brand, sb==2 means the same brand with a different size or colour.
        if (length-count) > 1:################### FOR MORE ACCURACY TUNE THIS PARAMETER.
            if i not in marker:
                
                dedup_prodid.append(data_sorted1["productId"].loc[indices[i]])
                marker.append(i)
            #if the comparison is between the last item and the second-to-last item then we append both
            if j == num_data-1:
                dedup_prodid.append(data_sorted1["productId"].loc[indices[j]])
            #since dataset is arranged in alphabetic order so we will change our i when the product changes.
            i = j
            break
        if (length-count) ==1 :################### FOR MORE ACCURACY TUNE THIS PARAMETER.
            
            if sb <= 1:################### FOR MORE ACCURACY TUNE THIS PARAMETER.
                #means different brand names but similar title.means different product
                if i not in marker:
                    dedup_prodid.append(data_sorted1["productId"].loc[indices[i]])
                    marker.append(i)
                #since dataset is arranged in alphabetic order so we will change our i when the product changes.
                i=j
                break
                #checking the similarity
            if check_similarity(indices[i],indices[j]): ################### FOR MORE ACCURACY TUNE THE VALUE OF SIMILARITY.      
                
                if j not in marker:
                    dup_list.append(data_sorted1["productId"].loc[indices[j]])
                    marker.append(j)
                dup_dict[data_sorted1["productId"].loc[indices[i]]] = dup_list
                j+=1
                    
            #if it is of the same brand but the images are not similar then we add it to the different-product list.
            else:
                if i not in marker:
                    dedup_prodid.append(data_sorted1["productId"].loc[indices[i]])
                    marker.append(i)
                i=j
                break
        
        if (length-count) == 0 :

            if check_similarity(indices[i],indices[j]): ################### FOR MORE ACCURACY TUNE THIS VALUE OF SIMILARITY.      
                if j not in marker:
                    dup_list.append(data_sorted1["productId"].loc[indices[j]])
                    marker.append(j)
                    dup_dict[data_sorted1["productId"].loc[indices[i]]] = dup_list
                j+=1
                    
                        
            else:
                #this is done so that we can move to the next row for comparison; the next time the outer loop runs, it starts from the row we skipped.
                if r_i != i:
                    start_i=1
                    j_skipped=j
                    r_i=i
                j+=1
    #if i did not advance, leave the loop so that it cannot run forever.
    if previous_i == i:
        break
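
The word-by-word title comparison at the core of the loop above can be isolated into a small function. This sketch follows the same rules (positional comparison with `itertools.zip_longest`, brand judged by the first two words), but `compare_titles` is a hypothetical helper and not part of the notebook.

```python
from itertools import zip_longest

def compare_titles(title_a, title_b):
    """Return (mismatches, same_brand) for two titles compared word by word.

    mismatches: positions where the words differ; the overhang of the
                longer title counts as mismatches.
    same_brand: True when both of the first two words match.
    """
    a, b = title_a.split(), title_b.split()
    matches = 0
    brand_matches = 0
    for position, (word_a, word_b) in enumerate(zip_longest(a, b)):
        if word_a == word_b:
            matches += 1
            if position < 2:
                brand_matches += 1
    return max(len(a), len(b)) - matches, brand_matches == 2

print(compare_titles("Uptown 18 Casual Short Sleeve Printed Women's White Top",
                     "Uptown 18 Casual Short Sleeve Printed Women's Black Top"))
print(compare_titles("Diaz Casual Short Sleeve Solid Women's Multicolor Top",
                     "Amoya Casual Sleeveless Solid Women's Multicolor Top"))
```

The first pair differs in one word and shares the brand, so it goes on to the image check. The second pair differs in most positions, partly because one extra word shifts every later position, which is a known limitation of a purely positional comparison.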

## Code for finding similar products in the list of products without duplicates (one product with only one size):

In [608]:
data_dedup_size=data_sorted1[data_sorted1["productId"].isin(dedup_prodid)]
data_dedup_size.shape

(4946, 9)

In [612]:
# list which holds the index labels of the data, as the indexes are not continuous.
indices = list(data_dedup_size.index)

In [613]:
#function for checking similarity
def check_similarity(l,m):
    #cosine similarity between the CNN feature vectors of rows l and m
    sim=cosine_similarity(cnn_features[np.where(in_name==l)[0]].reshape(1,-1),cnn_features[np.where(in_name==m)[0]].reshape(1,-1))
    #sim is a 1x1 matrix; two products count as similar above 0.88
    return bool(sim[0,0] > 0.88)

In [641]:
#products without any duplicate titles.
dedup_prodid_sim=[]
#dictionary that stores each product as key and its similar products as values.

dup_dict_sim={}

r_i=0
i=0
j=0
#list to track the record of the productid processed before.
marker=[]
start_i=0
num_data=data_dedup_size.shape[0]
while i < num_data and j<num_data:
    if start_i == 1:
        i = j_skipped
        start_i=0
    previous_i=i
    #store the words of title as list in a.
    a=data_dedup_size['title'].loc[indices[i]].split()

    #now we will search for similar product one by one
    j=i+1
    #for storing the duplicate product ids.
    dup_list=[]

    while j < num_data:

        #store the words of title as list in b.
        b=data_dedup_size['title'].loc[indices[j]].split()
        #maximum length of the strings.
        length=max(len(a),len(b))



        #count variable to store number of matched words
        count=0

        #variable for counting the iterations
        l=0
        #variable to know whether the two products are of same brand or not
        sb=0
        
        #itertools.zip_longest(a,b) pairs the words of a and b position by position, padding the shorter title with None.
        for k in itertools.zip_longest(a,b):
            if (k[0] == k[1]):
                if any([ l==0, l==1 ]):
                    sb+=1

                count+=1
            l+=1
        #if the titles differ in more than one word, we consider them different products.
        #if they differ in exactly one word, sb tells whether the brand matches: sb<=1 means a different brand, sb==2 means the same brand.
        if (length-count) > 1:################### FOR MORE ACCURACY TUNE THIS PARAMETER.
            if i not in marker:
                
                dedup_prodid_sim.append(data_dedup_size["productId"].loc[indices[i]])
                marker.append(i)
            #if the comparison is between the last item and the second-to-last item then we append both
            if j == num_data-1:
                dedup_prodid_sim.append(data_dedup_size["productId"].loc[indices[j]])
            #since dataset is arranged in alphabetic order so we will change our i when the product changes.
            i = j
            break
            
                   
        if (length-count) == 0 :

            if check_similarity(indices[i],indices[j]):    
                if j not in marker:
                    dup_list.append(data_dedup_size["productId"].loc[indices[j]])
                    marker.append(j)
                    dup_dict_sim[data_dedup_size["productId"].loc[indices[i]]] = dup_list
                j+=1
                    
                        
            else:

                if r_i != i:
                    start_i=1
                    j_skipped=j
                    r_i=i
                j+=1
        else :
            
            if sb <= 1:
                #means different brand names but similar title.means different product
                if i not in marker:
                    dedup_prodid_sim.append(data_dedup_size["productId"].loc[indices[i]])
                    marker.append(i)
                #since dataset is arranged in alphabetic order so we will change our i when the product changes.
                i=j
                break
            if check_similarity(indices[i],indices[j]): ################### FOR MORE ACCURACY TUNE THE VALUE OF SIMILARITY.      
                
                if j not in marker:
                    dup_list.append(data_dedup_size["productId"].loc[indices[j]])
                    marker.append(j)
                dup_dict_sim[data_dedup_size["productId"].loc[indices[i]]] = dup_list
                j+=1
            #not similar: remember this row so that the outer loop restarts from it.
            elif (length-count)==1:
                if r_i != i:
                    start_i=1
                    j_skipped=j
                    r_i=i
                j+=1
                        
            else:

                if i not in marker:
                    dedup_prodid_sim.append(data_dedup_size["productId"].loc[indices[i]])
                    marker.append(i)
                i=j
                break


                        
    if previous_i == i:
        break

## Saving the dictionary which contains duplicates(product variants):

In [699]:
import json

with open('dup_dict_size21k.json', 'w') as fp:
    json.dump(dup_dict, fp)

In [653]:
with open("dup_dict_size21k.json") as infile:
    data_json_size = json.load(infile)

## Saving the dictionary which contains Similar products:

In [697]:
with open('dup_dict_sim21k.json', 'w') as fp:
    json.dump(dup_dict_sim, fp)

In [652]:
with open("dup_dict_sim21k.json") as infile:
    data_json_sim = json.load(infile)

### Merging both dictionaries:

In [681]:
combined_dup_dict={}
marker=[]
for key,value in dup_dict.items():
    if key not in marker:
        value1=value
        marker.append(key)
        if key in dup_dict_sim.keys():
            value1=value1+dup_dict_sim[key]
            for values in dup_dict_sim[key]:
                if values in dup_dict.keys():
                    value1=value1+dup_dict[values]
                    marker.append(values)
        
        combined_dup_dict[key]=value1
not_added=[ele for ele in dup_dict_sim.keys() if ele not in marker]
for el in not_added:
    if len(dup_dict_sim[el])>0:
        combined_dup_dict[el]=dup_dict_sim[el]
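
The merge logic above can be exercised on toy dictionaries to see what it produces. The function below is a sketch that mirrors the cell's steps; `merge_duplicate_dicts` is a hypothetical name, the ids are made up, and a set replaces the `marker` list for membership tests.

```python
def merge_duplicate_dicts(dup_size, dup_sim):
    """Combine variant duplicates (dup_size) with similar-product duplicates (dup_sim)."""
    combined = {}
    seen = set()
    for key, variants in dup_size.items():
        if key in seen:
            continue  # already absorbed into another product's group
        seen.add(key)
        similars = dup_sim.get(key, [])
        merged = list(variants) + list(similars)
        for similar in similars:
            if similar in dup_size:
                # pull in the variants of the similar product as well
                merged += dup_size[similar]
                seen.add(similar)
        combined[key] = merged
    # similar-product groups whose key never appeared in dup_size
    for key, similars in dup_sim.items():
        if key not in seen and similars:
            combined[key] = list(similars)
    return combined

dup_size = {"A": ["A2", "A3"], "B": ["B2"], "C": ["C2"]}
dup_sim = {"A": ["B"], "D": ["E"], "F": []}
print(merge_duplicate_dicts(dup_size, dup_sim))
```

Product `B` is folded into the group of `A` together with its own variant `B2`, while the empty group for `F` is dropped.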

## Saving the whole duplicate dictionary ( different size,colour and Similar Products):

In [685]:
with open('combined_dup_dict21k.json', 'w') as fp:
    json.dump(combined_dup_dict, fp)

In [688]:
with open("combined_dup_dict21k.json") as infile:
    combined_json_sim = json.load(infile)
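
Once the combined dictionary is loaded, a typical use is to look up the duplicates of a given product, whether it is stored as a key or inside another product's list. A minimal sketch on made-up ids; `duplicates_of` is a hypothetical helper:

```python
def duplicates_of(combined, product_id):
    """Return the other members of the duplicate group containing product_id."""
    for key, members in combined.items():
        group = [key] + list(members)
        if product_id in group:
            return [pid for pid in group if pid != product_id]
    return []  # the product has no recorded duplicates

combined = {"A": ["A2", "B"], "C": ["C2"]}  # made-up ids
print(duplicates_of(combined, "B"))
print(duplicates_of(combined, "Z"))
```

The linear scan is fine for a one-off lookup; for many lookups, an inverted index from product id to group key would avoid rescanning the dictionary.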