## Part 2: Product matching

### Problem statement
Using ML/DL techniques, match similar products from the Flipkart dataset with the Amazon dataset. Once
similar products are matched, display the retail price from FK and AMZ side by side. Please explore as
many techniques as possible before choosing the final technique.

Dataset Link: https://www.dropbox.com/sh/aypq6h3254207bs/AACzMLvo-XtK9sYAAma6FW0la?dl=0

In [1]:
# Importing Warnings
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Read the Amazon dataset (only the first 5000 products are used)
amazon=pd.read_csv("amz_com-ecommerce_sample.csv",encoding= 'unicode_escape')
amazon=amazon.head(5000)

In [4]:
# Top 1 row
amazon.head(1)

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,982,438,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."


In [5]:
# Shape of Amazon Dataset - 5000 rows, 15 columns
amazon.shape

(5000, 15)

In [6]:
# Unwanted columns: uniq_id, crawl_timestamp, product_url, image, is_FK_Advantage_product, product_rating, overall_rating, product_category_tree, product_specifications, brand (pid is kept)

In [7]:
# Dropping Unwanted columns.
amazon.drop(["uniq_id","crawl_timestamp","product_url","image","is_FK_Advantage_product","product_rating","overall_rating","product_category_tree","product_specifications","brand"],inplace=True,axis=1)

In [8]:
# Shape of Amazon Dataset - 5000 rows, 5 columns
amazon.shape

(5000, 5)

In [9]:
# Information about Amazon Dataset
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      5000 non-null   object
 1   pid               5000 non-null   object
 2   retail_price      5000 non-null   int64 
 3   discounted_price  5000 non-null   int64 
 4   description       4999 non-null   object
dtypes: int64(2), object(3)
memory usage: 195.4+ KB


###### Comment: the description column contains one null value; it is replaced with "no description" below.

In [10]:
# Add a column recording which website each product comes from (a = Amazon, f = Flipkart)
amazon["product_on"]="a" #---->label for amazon-a

In [11]:
# Top 5 rows of amazon dataset
amazon.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,982,438,Key Features of Alisha Solid Women's Cycling S...,a
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32143,29121,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,a
2,AW Bellies,SHOEH4GRSUBJGZXE,991,551,Key Features of AW Bellies Sandals Wedges Heel...,a
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,694,325,Key Features of Alisha Solid Women's Cycling S...,a
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,208,258,Specifications of Sicons All Purpose Arnica Do...,a


In [12]:
# Show the rows with a null description, as flagged by info() above
amazon[amazon["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
553,Ozel Studio Casual Sleeveless Printed Women's Top,TOPEYV38KYVJKM54,1278,781,,a


In [13]:
# Fill the null value with "no description" (assigning back avoids the chained inplace call, which newer pandas versions warn about)
amazon["description"] = amazon["description"].fillna("no description")

In [14]:
# checking the null value in description.
amazon[amazon["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on


In [15]:
# Information about the Amazon dataset again: no null values remain
amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   product_name      5000 non-null   object
 1   pid               5000 non-null   object
 2   retail_price      5000 non-null   int64 
 3   discounted_price  5000 non-null   int64 
 4   description       5000 non-null   object
 5   product_on        5000 non-null   object
dtypes: int64(2), object(4)
memory usage: 234.5+ KB


In [16]:
# Read the Flipkart dataset (only the first 5000 products are used)
flipkart=pd.read_csv("flipkart_com-ecommerce_sample.csv",encoding= 'unicode_escape')
flipkart=flipkart.head(5000)

In [17]:
# Shape of Flipkart Dataset
flipkart.shape

(5000, 15)

In [18]:
# Top 1 row in flipkart dataset.
flipkart.head(1)

Unnamed: 0,uniq_id,crawl_timestamp,product_url,product_name,product_category_tree,pid,retail_price,discounted_price,image,is_FK_Advantage_product,description,product_rating,overall_rating,brand,product_specifications
0,c2d766ca982eca8304150849735ffef9,2016-03-25 22:59:23 +0000,http://www.flipkart.com/alisha-solid-women-s-c...,Alisha Solid Women's Cycling Shorts,"[""Clothing >> Women's Clothing >> Lingerie, Sl...",SRTEH2FF9KEDEFGF,999.0,379.0,"[""http://img5a.flixcart.com/image/short/u/4/a/...",False,Key Features of Alisha Solid Women's Cycling S...,No rating available,No rating available,Alisha,"{""product_specification""=>[{""key""=>""Number of ..."


In [19]:
# Dropping the unwanted columns.
flipkart.drop(["uniq_id","crawl_timestamp","product_url","image","is_FK_Advantage_product","product_rating","overall_rating","product_category_tree","product_specifications","brand"],inplace=True,axis=1)

In [20]:
# checking the shape of dataset
flipkart.shape

(5000, 5)

In [21]:
# Information about Flipkart Dataset.
flipkart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      5000 non-null   object 
 1   pid               5000 non-null   object 
 2   retail_price      4989 non-null   float64
 3   discounted_price  4989 non-null   float64
 4   description       4999 non-null   object 
dtypes: float64(2), object(3)
memory usage: 195.4+ KB


###### Comment: retail_price, discounted_price and description contain null values. Null prices are replaced with 0 because the actual price is unknown (0 is a placeholder, not a real price), and the null description is replaced with "no description".
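As a compact illustration of the null handling described above, the three fills can be written as a single `fillna` call with a per-column mapping. This is only a sketch on a toy frame: the column names come from the dataset, but the values and the names `toy`, `fill_values` and `toy_filled` are made up for the example.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
toy = pd.DataFrame({
    "retail_price": [999.0, None],
    "discounted_price": [379.0, None],
    "description": ["some description", None],
})

# One fillna call with a per-column mapping; it returns a new frame.
fill_values = {"retail_price": 0, "discounted_price": 0, "description": "no description"}
toy_filled = toy.fillna(fill_values)

print(toy_filled)
print(toy_filled.isna().sum().sum())  # number of nulls left
```

Because 0 stands in for an unknown price, rows with a price of 0 are probably best excluded from any later price comparison.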

In [22]:
# Checking the null values in description column.
flipkart[flipkart["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description
553,Ozel Studio Casual Sleeveless Printed Women's Top,TOPEYV38KYVJKM54,1290.0,645.0,


In [23]:
# Fill the null value with "no description" (assigning back avoids the chained inplace call)
flipkart["description"] = flipkart["description"].fillna("no description")

In [24]:
#Checking the null value.
flipkart[flipkart["description"].isna()]

Unnamed: 0,product_name,pid,retail_price,discounted_price,description


In [25]:
# Fill the null values in the retail_price column with 0
flipkart["retail_price"] = flipkart["retail_price"].fillna(0)

In [26]:
# Fill the null values in the discounted_price column with 0
flipkart["discounted_price"] = flipkart["discounted_price"].fillna(0)

In [27]:
# Information about flipkart dataset
flipkart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   product_name      5000 non-null   object 
 1   pid               5000 non-null   object 
 2   retail_price      5000 non-null   float64
 3   discounted_price  5000 non-null   float64
 4   description       5000 non-null   object 
dtypes: float64(2), object(3)
memory usage: 195.4+ KB


In [28]:
# Add a column recording which website each product comes from (a = Amazon, f = Flipkart)
flipkart["product_on"]="f" #----> Label for flipkart - f

In [29]:
# Top 5 Rows in dataset
flipkart.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,999.0,379.0,Key Features of Alisha Solid Women's Cycling S...,f
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32157.0,22646.0,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,f
2,AW Bellies,SHOEH4GRSUBJGZXE,999.0,499.0,Key Features of AW Bellies Sandals Wedges Heel...,f
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,699.0,267.0,Key Features of Alisha Solid Women's Cycling S...,f
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,220.0,210.0,Specifications of Sicons All Purpose Arnica Do...,f


In [30]:
# description of 1st row/product
flipkart["description"][0]

"Key Features of Alisha Solid Women's Cycling Shorts Cotton Lycra Navy, Red, Navy,Specifications of Alisha Solid Women's Cycling Shorts Shorts Details Number of Contents in Sales Package Pack of 3 Fabric Cotton Lycra Type Cycling Shorts General Details Pattern Solid Ideal For Women's Fabric Care Gentle Machine Wash in Lukewarm Water, Do Not Bleach Additional Details Style Code ALTHT_3P_21 In the Box 3 shorts"

In [31]:
# Description of the product at index 3 (the 4th row), the same product in a different pack size
flipkart["description"][3]

"Key Features of Alisha Solid Women's Cycling Shorts Cotton Lycra Black, Red,Specifications of Alisha Solid Women's Cycling Shorts Shorts Details Number of Contents in Sales Package Pack of 2 Fabric Cotton Lycra Type Cycling Shorts General Details Pattern Solid Ideal For Women's Fabric Care Gentle Machine Wash in Lukewarm Water, Do Not Bleach Additional Details Style Code ALTGHT_11 In the Box 2 shorts"

In [32]:
# Checking the duplicate rows in amazon dataset
amazon.duplicated().sum()

0

In [33]:
# checking the duplicate rows in flipkart dataset
flipkart.duplicated().sum()

0

In [34]:
# shape of amazon dataset
amazon.shape

(5000, 6)

In [35]:
# shape of flipkart dataset
flipkart.shape

(5000, 6)

In [36]:
# Create a new dataset by concatenating the Amazon and Flipkart datasets
df=pd.concat([amazon,flipkart],ignore_index=True)

In [37]:
# Top 5 rows of new dataset 
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,Alisha Solid Women's Cycling Shorts,SRTEH2FF9KEDEFGF,982.0,438.0,Key Features of Alisha Solid Women's Cycling S...,a
1,FabHomeDecor Fabric Double Sofa Bed,SBEEH3QGU7MFYJFY,32143.0,29121.0,FabHomeDecor Fabric Double Sofa Bed (Finish Co...,a
2,AW Bellies,SHOEH4GRSUBJGZXE,991.0,551.0,Key Features of AW Bellies Sandals Wedges Heel...,a
3,Alisha Solid Women's Cycling Shorts,SRTEH2F6HUZMQ6SJ,694.0,325.0,Key Features of Alisha Solid Women's Cycling S...,a
4,Sicons All Purpose Arnica Dog Shampoo,PSOEH3ZYDMSYARJ5,208.0,258.0,Specifications of Sicons All Purpose Arnica Do...,a


In [38]:
# Shuffle the rows (in the new dataset the first 5000 rows are Amazon and the last 5000 rows are Flipkart)
df=df.sample(frac = 1)

In [39]:
# after shuffling,top 5 rows in new dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
744,FabAlley Casual Short Sleeve Solid Women's Top,TOPE7ZHZNGYQVMSV,743.0,856.0,FabAlley Casual Short Sleeve Solid Women's Top...,a
9246,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,NKCEA4CAPHDH5TJT,400.0,400.0,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,f
266,"Marvel DW100243 Digital Watch - For Boys, Girls",WATE3GEYE8JQT7WM,282.0,348.0,"Marvel DW100243 Digital Watch - For Boys, Gir...",a
956,AKUP ur-own-kind Ceramic Mug,MUGEGZUGKCVTS27W,497.0,327.0,Key Features of AKUP ur-own-kind Ceramic Mug P...,a
7247,Tia by Ten on Ten Cathy Women's T-Shirt Bra,BRAEB2S8XHF9P96G,999.0,399.0,Tia by Ten on Ten Cathy Women's T-Shirt Bra\n ...,f


In [40]:
# Resetting the index of the dataset
df=df.reset_index(drop=True)

In [41]:
# top 5 rows in dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,FabAlley Casual Short Sleeve Solid Women's Top,TOPE7ZHZNGYQVMSV,743.0,856.0,FabAlley Casual Short Sleeve Solid Women's Top...,a
1,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,NKCEA4CAPHDH5TJT,400.0,400.0,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,f
2,"Marvel DW100243 Digital Watch - For Boys, Girls",WATE3GEYE8JQT7WM,282.0,348.0,"Marvel DW100243 Digital Watch - For Boys, Gir...",a
3,AKUP ur-own-kind Ceramic Mug,MUGEGZUGKCVTS27W,497.0,327.0,Key Features of AKUP ur-own-kind Ceramic Mug P...,a
4,Tia by Ten on Ten Cathy Women's T-Shirt Bra,BRAEB2S8XHF9P96G,999.0,399.0,Tia by Ten on Ten Cathy Women's T-Shirt Bra\n ...,f


In [42]:
# Shape of new dataset - 10000 rows, 6 columns
df.shape

(10000, 6)

###### In these datasets the product names on Amazon and Flipkart are identical, and the same product can appear in packs of different quantities. To tell the rows apart, the product_name column is extended with the product id and the website label.
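The composite name described above can be sketched with two small helpers. `make_key` and `split_key` are hypothetical names introduced only for this illustration, and the sketch assumes that neither the product id nor the website label contains a space.

```python
def make_key(name, pid, site):
    """Build the composite product name: '<name> <pid> <site>'."""
    return f"{name} {pid} {site}"


def split_key(key):
    """Recover (name, pid, site) by splitting on the last two spaces."""
    name, pid, site = key.rsplit(" ", 2)
    return name, pid, site


key = make_key("AKUP ur-own-kind Ceramic Mug", "MUGEGZUGKCVTS27W", "a")
print(key)
print(split_key(key))
```

Splitting from the right keeps product names that themselves contain spaces intact, and the last character of the key is always the website label.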

In [43]:
# modifying the product name column
df["product_name"]=df["product_name"]+" "+df["pid"]+" "+df["product_on"]

In [44]:
# Top 5 rows of dataset
df.head()

Unnamed: 0,product_name,pid,retail_price,discounted_price,description,product_on
0,FabAlley Casual Short Sleeve Solid Women's Top...,TOPE7ZHZNGYQVMSV,743.0,856.0,FabAlley Casual Short Sleeve Solid Women's Top...,a
1,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,NKCEA4CAPHDH5TJT,400.0,400.0,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,f
2,"Marvel DW100243 Digital Watch - For Boys, Gir...",WATE3GEYE8JQT7WM,282.0,348.0,"Marvel DW100243 Digital Watch - For Boys, Gir...",a
3,AKUP ur-own-kind Ceramic Mug MUGEGZUGKCVTS27W a,MUGEGZUGKCVTS27W,497.0,327.0,Key Features of AKUP ur-own-kind Ceramic Mug P...,a
4,Tia by Ten on Ten Cathy Women's T-Shirt Bra BR...,BRAEB2S8XHF9P96G,999.0,399.0,Tia by Ten on Ten Cathy Women's T-Shirt Bra\n ...,f


In [45]:
# product name of 1st product.
df["product_name"][0]

"FabAlley Casual Short Sleeve Solid Women's Top TOPE7ZHZNGYQVMSV a"

In [46]:
# dropping the extra columns.
df.drop(["pid","product_on"],inplace=True,axis=1)

In [47]:
# top 5 rows of dataset
df.head()

Unnamed: 0,product_name,retail_price,discounted_price,description
0,FabAlley Casual Short Sleeve Solid Women's Top...,743.0,856.0,FabAlley Casual Short Sleeve Solid Women's Top...
1,Vijisan Beautiful Pink Fire Bead Moonstone Sto...,400.0,400.0,Vijisan Beautiful Pink Fire Bead Moonstone Sto...
2,"Marvel DW100243 Digital Watch - For Boys, Gir...",282.0,348.0,"Marvel DW100243 Digital Watch - For Boys, Gir..."
3,AKUP ur-own-kind Ceramic Mug MUGEGZUGKCVTS27W a,497.0,327.0,Key Features of AKUP ur-own-kind Ceramic Mug P...
4,Tia by Ten on Ten Cathy Women's T-Shirt Bra BR...,999.0,399.0,Tia by Ten on Ten Cathy Women's T-Shirt Bra\n ...


#### Text Preprocessing

In [48]:
# Importing the libraries used for text preprocessing
# (unused imports removed; the stop-word list is defined by hand below, so nltk's stopwords corpus is not needed)
from tqdm import tqdm
from bs4 import BeautifulSoup

In [49]:
import re

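def decontracted(phrase):
    """Expand common English contractions, e.g. "won't" -> "will not"."""
    # specific cases that the general rules below would get wrong
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general suffix rules
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    # possessives are expanded too ("Women's" -> "Women is"); the extra "is" is removed later as a stop word
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase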

In [50]:
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
            "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
            "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
            'won', "won't", 'wouldn', "wouldn't"])

In [51]:
from tqdm import tqdm
preprocessed_description = []
# tqdm prints a progress bar while the descriptions are cleaned
for sentence in tqdm(df["description"].values):
    sentence = re.sub(r"http\S+", "", sentence)               # remove URLs
    sentence = BeautifulSoup(sentence, 'lxml').get_text()     # strip HTML tags
    sentence = decontracted(sentence)                         # expand contractions
    sentence = re.sub(r"\S*\d\S*", "", sentence).strip()      # remove tokens that contain a digit
    sentence = re.sub('[^A-Za-z]+', ' ', sentence)            # keep letters only
    sentence = ' '.join(e.lower() for e in sentence.split() if e.lower() not in stopwords)
    preprocessed_description.append(sentence.strip())

100%|███████████████████████████████████| 10000/10000 [00:10<00:00, 989.80it/s]
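To see what the cleaning loop above does to a single description, here is a reduced sketch that uses only the `re` module. It leaves out the HTML stripping and the contraction expansion, and it uses a tiny illustrative stop-word set; `clean_text`, `STOP` and the sample string are assumptions made for this example, not part of the notebook.

```python
import re

# Small illustrative stop-word set; the notebook uses a much longer list.
STOP = {"of", "the", "is", "in", "for", "a", "and"}


def clean_text(text, stop_words=STOP):
    text = re.sub(r"http\S+", "", text)        # drop URLs
    text = re.sub(r"\S*\d\S*", "", text)       # drop tokens that contain a digit
    text = re.sub(r"[^A-Za-z]+", " ", text)    # replace runs of non-letters with a space
    words = (w.lower() for w in text.split())
    return " ".join(w for w in words if w not in stop_words)


sample = "Pack of 3 Cotton Lycra Shorts, Style Code ALTHT_3P_21 http://example.com/x"
cleaned = clean_text(sample)
print(cleaned)
```

Note that dropping every token with a digit also removes pack sizes and style codes, which is why two packs of the same product end up with nearly identical cleaned descriptions.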


In [52]:
# Shuffle the rows again with a fixed random_state, so the same order is produced on every run
sample_data = df.sample(n = 10000,random_state=2)
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,description
7878,Perfect Women's Leggings LJGEHWDVCNNGPECS a,980.0,690.0,Key Features of Perfect Women's Leggings 100% ...
3224,Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a,439.0,250.0,Printland CMW1618 Ceramic Mug (350 g)\n ...
1919,Allure Auto CM 334 Car Mat Hyundai Sonata Embe...,2794.0,1051.0,Buy Allure Auto CM 334 Car Mat Hyundai Sonata ...
4432,Belle Gambe Boots SHOEC9FZ7W9HTAQG f,3499.0,1795.0,Belle Gambe Boots - Buy Belle Gambe Boots - 7C...
4835,"Merchbay Pupcakes Accessory, Lotta Farber Jewe...",1599.0,799.0,"Merchbay Pupcakes Accessory, Lotta Farber Jewe..."


In [53]:
# Reorder preprocessed_description so that it follows the shuffled index of sample_data
sample_description = [ preprocessed_description[i] for i in sample_data.index.values]
sample_description[0]

'key features perfect women leggings cotton mill dyed fabric assurance best quality super soft fabric highly stretchable comfort fit no bubbles perfect women leggings pack price rs full length leggings best fitting feeling leggings casual clothing wardrobe leggings ultra comfortable waistband fit wonderfully around body toes waist far basic women leggings go must basic legging want wear everyday specifications perfect women leggings pack box leggings general details number contents sales package pack fabric cotton type leggings pattern solid occasion casual ideal women fabric care machine wash'

In [54]:
# Create the new preprocessed column
sample_data['preprocessed'] = sample_description
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,description,preprocessed
7878,Perfect Women's Leggings LJGEHWDVCNNGPECS a,980.0,690.0,Key Features of Perfect Women's Leggings 100% ...,key features perfect women leggings cotton mil...
3224,Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a,439.0,250.0,Printland CMW1618 Ceramic Mug (350 g)\n ...,printland ceramic mug g price rs printland cof...
1919,Allure Auto CM 334 Car Mat Hyundai Sonata Embe...,2794.0,1051.0,Buy Allure Auto CM 334 Car Mat Hyundai Sonata ...,buy allure auto cm car mat hyundai sonata embe...
4432,Belle Gambe Boots SHOEC9FZ7W9HTAQG f,3499.0,1795.0,Belle Gambe Boots - Buy Belle Gambe Boots - 7C...,belle gambe boots buy belle gambe boots rs fli...
4835,"Merchbay Pupcakes Accessory, Lotta Farber Jewe...",1599.0,799.0,"Merchbay Pupcakes Accessory, Lotta Farber Jewe...",merchbay pupcakes accessory lotta farber jewel...


In [55]:
# Dropping the description column
sample_data.drop("description",axis=1,inplace=True)
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
7878,Perfect Women's Leggings LJGEHWDVCNNGPECS a,980.0,690.0,key features perfect women leggings cotton mil...
3224,Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a,439.0,250.0,printland ceramic mug g price rs printland cof...
1919,Allure Auto CM 334 Car Mat Hyundai Sonata Embe...,2794.0,1051.0,buy allure auto cm car mat hyundai sonata embe...
4432,Belle Gambe Boots SHOEC9FZ7W9HTAQG f,3499.0,1795.0,belle gambe boots buy belle gambe boots rs fli...
4835,"Merchbay Pupcakes Accessory, Lotta Farber Jewe...",1599.0,799.0,merchbay pupcakes accessory lotta farber jewel...


In [56]:
# resetting the index
sample_data=sample_data.reset_index(drop=True)

In [57]:
# top 1 row in sample_data
sample_data.head(1)

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
0,Perfect Women's Leggings LJGEHWDVCNNGPECS a,980.0,690.0,key features perfect women leggings cotton mil...


In [58]:
# Lemmatization is used to reduce each word to its root form.
# (The WordNet lemmatizer maps words to dictionary forms, whereas PorterStemmer can return a truncated stem that is not an English word.)
from nltk.stem import WordNetLemmatizer

In [59]:
import nltk
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [60]:
# object creation 
lemmatizer = WordNetLemmatizer()

In [61]:
# User-defined function that lemmatizes every word of a text.
def lema(text):
    return " ".join(lemmatizer.lemmatize(word) for word in text.split())

In [62]:
# Applying lemmatization to each word in every document.
sample_data["preprocessed"]=sample_data["preprocessed"].map(lema)

In [63]:
# Top 5 rows after lemmatization.
sample_data.head()

Unnamed: 0,product_name,retail_price,discounted_price,preprocessed
0,Perfect Women's Leggings LJGEHWDVCNNGPECS a,980.0,690.0,key feature perfect woman legging cotton mill ...
1,Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a,439.0,250.0,printland ceramic mug g price r printland coff...
2,Allure Auto CM 334 Car Mat Hyundai Sonata Embe...,2794.0,1051.0,buy allure auto cm car mat hyundai sonata embe...
3,Belle Gambe Boots SHOEC9FZ7W9HTAQG f,3499.0,1795.0,belle gambe boot buy belle gambe boot r flipka...
4,"Merchbay Pupcakes Accessory, Lotta Farber Jewe...",1599.0,799.0,merchbay pupcakes accessory lotta farber jewel...


In [64]:
# Shape of dataset.
sample_data.shape

(10000, 4)

In [65]:
# Vectorization - convert text to vectors.
# TF-IDF is used because it weights each word by how informative it is instead of treating all words equally.
# Only the 5000 most frequent terms across the corpus are kept (max_features=5000).
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=5000)

In [66]:
# Creating the vectors
vectors=tfidf.fit_transform(sample_data["preprocessed"]).toarray()

In [67]:
vectors

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
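To make the TF-IDF weighting concrete, the following pure-Python sketch computes the weights for a three-document toy corpus. It follows the formula that scikit-learn documents for `TfidfVectorizer` with default settings (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation); the corpus and the names `docs`, `doc_freq` and `tfidf_weights` are made up for the illustration.

```python
import math
from collections import Counter

# Made-up, already preprocessed toy corpus.
docs = ["cotton legging woman", "ceramic mug coffee", "cotton mug"]
tokens = [d.split() for d in docs]
n_docs = len(tokens)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in tokens for term in set(doc))


def tfidf_weights(doc):
    counts = Counter(doc)
    # term frequency * smoothed inverse document frequency
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    # L2-normalise so that every document vector has length 1
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


vec = tfidf_weights(tokens[0])
print(vec)
```

In this toy corpus "cotton" occurs in two documents and therefore gets a lower weight than "legging" or "woman", which occur in only one.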

In [68]:
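# The 5000 selected features
# (on scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2)
tfidf.get_feature_names_out().tolist()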

['aa',
 'aaa',
 'aadivasi',
 'aadyaa',
 'aahana',
 'aaina',
 'aaishwarya',
 'aakash',
 'aakshi',
 'aaliya',
 'aao',
 'aapno',
 'aara',
 'aaradhi',
 'ab',
 'abaya',
 'abdominal',
 'abhira',
 'ability',
 'able',
 'abony',
 'abrasive',
 'abroad',
 'absolute',
 'absolutely',
 'absorbing',
 'absorbs',
 'absorption',
 'abstract',
 'abstrcts',
 'absurd',
 'ac',
 'accent',
 'accentuate',
 'accentuates',
 'accesories',
 'access',
 'accessible',
 'accessoreez',
 'accessorise',
 'accessorize',
 'accessory',
 'accident',
 'accord',
 'according',
 'accu',
 'acheived',
 'achieve',
 'achievement',
 'acid',
 'acm',
 'across',
 'acrylic',
 'act',
 'action',
 'active',
 'activity',
 'actn',
 'actual',
 'actually',
 'ada',
 'adaa',
 'adapater',
 'adapt',
 'adapter',
 'adaptive',
 'add',
 'added',
 'addiction',
 'adding',
 'addition',
 'additional',
 'addons',
 'address',
 'addyvero',
 'aden',
 'adhesive',
 'adidas',
 'adimani',
 'adiwalk',
 'adjust',
 'adjustable',
 'adjustble',
 'adjusting',
 ...

In [69]:
# used cosine similarity to get similarity score between vectors. 
from sklearn.metrics.pairwise import cosine_similarity

In [70]:
# Similarity matrix
similarity=cosine_similarity(vectors)

In [71]:
#similarity scores of 1st product with remaining products.
similarity[0]

array([1.00000000e+00, 4.34968193e-02, 7.87251770e-03, ...,
       9.23697799e-04, 3.19580185e-02, 8.31571988e-03])
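For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. A minimal pure-Python sketch of that definition follows; `cosine` and the three toy vectors are illustrative, and returning 0.0 for an all-zero vector is a convention chosen for this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for this sketch: no direction, no similarity
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # shares no non-zero component with a

print(cosine(a, b), cosine(a, c))
```

Since TF-IDF weights are never negative, the scores in the similarity matrix lie between 0 (no shared terms) and 1 (same direction).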

In [72]:
# Top 5 similar products of 1st product.
sorted(list(enumerate(similarity[0])),reverse=True,key=lambda x:x[1])[1:6]

[(5909, 1.0000000000000002),
 (5427, 0.8339181519905545),
 (8892, 0.8339181519905545),
 (976, 0.81553507926201),
 (8617, 0.81553507926201)]
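The `[1:6]` slice above assumes that the product itself is always ranked first. When another row has an identical description, both score about 1.0 and that assumption can fail, so a slightly more defensive variant drops the query row by index. The sketch below uses made-up scores; `top_k` is a hypothetical helper, not part of the notebook.

```python
def top_k(scores, self_index, k=5):
    """Return the k highest-scoring (index, score) pairs, excluding self_index."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return [(i, s) for i, s in ranked if i != self_index][:k]


# Rows 0 and 2 are exact duplicates, so both score 1.0 against row 2.
scores = [1.0, 0.2, 1.0, 0.8, 0.1]
print(top_k(scores, self_index=2, k=2))
```

With these scores a plain `[1:3]` slice of the ranking would return row 2 itself and miss its duplicate, row 0.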

In [73]:
# Return the details of the selected product and of its closest match on the other website.
def match(product):
    p1_index = sample_data[sample_data["product_name"] == product].index[0]
    p1 = sample_data.iloc[p1_index]

    # Rank all products by similarity to the selected one, highest first.
    ranked = sorted(enumerate(similarity[p1_index]), reverse=True, key=lambda x: x[1])

    # Keep the five nearest neighbours, skipping the selected product by index
    # (duplicates can tie at 1.0, so the product itself is not guaranteed to be ranked first).
    neighbours = [i for i, _ in ranked if i != p1_index][:5]

    # The last character of product_name is the website label ("a" or "f").
    other_site = [i for i in neighbours if sample_data.iloc[i].product_name[-1] != product[-1]]
    if not other_site:
        raise ValueError(f"No product from the other website among the top 5 matches of {product!r}")
    p2 = sample_data.iloc[other_site[0]]

    return [[p1.product_name, p1.retail_price, p1.discounted_price],
            [p2.product_name, p2.retail_price, p2.discounted_price]]
    
    

In [74]:
# Testing
sample_data["product_name"][1]

'Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a'

In [79]:
# Testing
match("Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a")

[['Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a', 439.0, 250.0],
 ['Printland CMW1667 Ceramic Mug MUGEACY8TXR6HT5K f', 458.0, 199.0]]
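Since the problem statement asks for the prices to be shown side by side, the nested list returned above can be reshaped into a small table. This is one possible sketch: the values are copied from the output above, while `site_names` and `side_by_side` are names introduced for the example.

```python
import pandas as pd

# Shape returned by match(): [[name, retail_price, discounted_price], [...]]
result = [
    ["Printland CMW1618 Ceramic Mug MUGEACY82SGHCHYY a", 439.0, 250.0],
    ["Printland CMW1667 Ceramic Mug MUGEACY8TXR6HT5K f", 458.0, 199.0],
]

site_names = {"a": "Amazon", "f": "Flipkart"}

# One column per website; the trailing " a" / " f" label is stripped from the name.
columns = {}
for name, retail, discounted in result:
    columns[site_names[name[-1]]] = {
        "product": name[:-2],
        "retail_price": retail,
        "discounted_price": discounted,
    }

side_by_side = pd.DataFrame(columns)
print(side_by_side)
```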

In [80]:
# Importing the pickle
import pickle

In [81]:
# Dumping the sample data (the with-block closes the file after writing)
with open("product_dict.pkl", "wb") as f:
    pickle.dump(sample_data.to_dict(), f)

In [82]:
# Dumping the similarity matrix (the with-block closes the file after writing)
with open("similarity.pkl", "wb") as f:
    pickle.dump(similarity, f)
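The dumped files can later be read back with `pickle.load`. The sketch below shows the round trip on a small made-up dictionary written to a temporary directory, so it does not touch the real files; `payload`, `path` and `restored` are illustrative names.

```python
import os
import pickle
import tempfile

# Made-up stand-in for sample_data.to_dict()
payload = {"product_name": {0: "example product a", 1: "example product f"}}

# Write to a temporary directory so the real product_dict.pkl is left alone.
path = os.path.join(tempfile.mkdtemp(), "product_dict.pkl")
with open(path, "wb") as fh:
    pickle.dump(payload, fh)

# Read it back; the restored object equals the original.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == payload)
```

In the notebook's setting, `pd.DataFrame(restored)` would presumably rebuild `sample_data` from the loaded dictionary. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.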