<div>
    <img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">End-of-degree Project: Product Matching</p>
<p style="margin: 0; text-align:right;">Master's Degree in Data Science</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Computer Science, Multimedia and Telecommunication Studies</p>
</div>

<h1>Product Matching using Natural Language Processing</h1>

<hr style="background-color:#BCBCBC"/>

* [Machine Learning :: Cosine Similarity for Vector Space Models (Part III)](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/)

<h2>Imports</h2>

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

<h2>Data loading</h2>

In [2]:
products_basic_df = pd.read_csv('../RawData/ProductsBasic.csv')

product_names_df = pd.read_csv('../RawData/ProductDataFinal.csv')
product_names_df = product_names_df.loc[:, ['ProductName', 'ProdanetID']]

# Remove duplicates
product_names_df = product_names_df.drop_duplicates(subset='ProdanetID', keep='first')

<h2>Text preprocessing</h2>

In [3]:
# TODO remove stopwords: English, German and Spanish
# TODO remove punctuation marks
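
The two pending steps above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `STOPWORDS` set is a deliberately tiny, hypothetical sample (a real run would need fuller English, German and Spanish lists), and `preprocess` is a hypothetical helper name, not part of this notebook.

```python
import re

# Hypothetical, deliberately tiny stop word sample; not a complete list
STOPWORDS = {
    "the", "and", "of", "for", "with",     # English
    "der", "die", "das", "und", "mit",     # German
    "el", "la", "los", "de", "con", "y",   # Spanish
}


def preprocess(name):
    """Lowercase a product name, replace punctuation with spaces, drop stop words."""
    # Anything that is neither a word character nor whitespace becomes a space
    text = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("Webcam LOGITECH 960-001106 with microphone"))
# webcam logitech 960 001106 microphone
```

Replacing every punctuation mark with a space also splits model numbers such as 960-001106 into two tokens. Whether that helps or hurts matching is an open question that would have to be checked against the data.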

<h2>TF-IDF vectorization and cosine similarity</h2>

<h4>TF-IDF vectorization</h4>

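TF-IDF (term frequency-inverse document frequency) represents each product name as a sparse vector with one component per word in the corpus vocabulary. A word's weight grows with the number of times it occurs in the name (TF) and shrinks with the number of names that contain it (IDF), so distinctive tokens such as model numbers weigh more than words shared by many products. With its default settings, TfidfVectorizer lowercases the text, keeps tokens of two or more alphanumeric characters, and scales every row to unit Euclidean length.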
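
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a three-name toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. The whitespace tokenizer, the toy names and the helper `tfidf` are simplifications invented for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    # Simplified tokenizer (assumption): lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for doc in tokenized for token in doc})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term
    df = Counter(token for doc in tokenized for token in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # guard for empty docs
        rows.append([w / norm for w in row])
    return vocab, rows


docs = ["logitech webcam pro", "logitech mouse", "samsung ssd pro"]
vocab, rows = tfidf(docs)
for term, weight in zip(vocab, rows[0]):
    print("{:10s} {:.3f}".format(term, weight))
```

In the first name, "webcam" occurs in one document only and therefore receives a larger weight than "logitech" or "pro", which each occur in two.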

In [4]:
documents = product_names_df['ProductName'].values.astype('U').tolist()

tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

print("Number of names of products (rows): {}".format(tfidf_matrix.shape[0]))
print("Number of feature names (columns): {}".format(tfidf_matrix.shape[1]))

Number of names of products (rows): 161023
Number of feature names (columns): 115317


The number of features, that is, the distinct words that appear in the corpus (the set of documents), can also be obtained with:

In [5]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(tfidf_vectorizer.get_feature_names_out())

115317

<h4>Getting the N most similar products using Cosine Similarity</h4>

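Cosine similarity measures the cosine of the angle between two vectors: their dot product divided by the product of their Euclidean norms. TF-IDF weights are non-negative, so the score lies between 0 (no shared terms) and 1 (same direction). Each product name to match is vectorized with the vocabulary already fitted on the corpus, compared against every row of the TF-IDF matrix, and the N highest-scoring rows are reported.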
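
A minimal sketch of the measure itself, written in plain Python so the arithmetic is visible. The helper `cosine` and the toy vectors are made up for this example, and returning 0.0 for an all-zero vector is a convention assumed here.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)


print(cosine([1, 2, 3], [2, 4, 6]))  # same direction, different length
print(cosine([1, 0], [0, 1]))        # no shared components
```

Because the measure depends only on direction, a long title and a short one that use the same words in the same proportions score close to 1.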

In [11]:
N = 3

# These products are in the dataset with Amazon.de as their 'source' web shop; the titles below come from other shops
products2match = [
    'PANASONIC SC-PM 250 EG-S Micro-Anlage (CD, CD-R/-RW, USB, Silber/Schwarz)',      # Title in MediaMarkt.de
    'Logitech PTZ Pro 2 (960-001186)',                                                # Title in Geizhals.at
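    # Caution: '' does not escape a quote in Python; the two adjacent literals are
    # concatenated, so this title is actually 'bbc Radio 1s Dance anthems ibiza (cd)'
    'bbc Radio 1''s Dance anthems ibiza (cd)'                                         # Title in ElCorteIngles.es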
]

vectorized_products2match = tfidf_vectorizer.transform(products2match)
similarity_matrix = cosine_similarity(vectorized_products2match, tfidf_matrix)

print("Number of products to match (rows): {}".format(similarity_matrix.shape[0]))
print("Number of products to compare against (columns): {}".format(similarity_matrix.shape[1]))

Number of products to match (rows): 3
Number of products to compare against (columns): 161023
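
The next cell ranks each row with a full `np.argsort` and keeps the last N entries. For wide similarity rows, one possible alternative is `np.argpartition`, which avoids sorting every column. The sketch below is an assumption-laden illustration, not the method used in this notebook: `top_n_indices` is a hypothetical helper and the scores are invented.

```python
import numpy as np


def top_n_indices(scores, n):
    """Return the indices of the n largest scores, highest first."""
    scores = np.asarray(scores)
    n = min(n, scores.size)
    if n == 0:
        return np.array([], dtype=int)
    # argpartition moves the n largest values into the last n slots, unordered
    candidates = np.argpartition(scores, -n)[-n:]
    # Sort only those n candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]


toy_scores = np.array([0.10, 0.52, 0.00, 0.33, 0.29])
print(top_n_indices(toy_scores, 3))
```

With distinct scores this returns the same indices, in the same order, as reversing the tail of a full argsort.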


In [12]:
for i in range(similarity_matrix.shape[0]):
    print("\nProduct to match [{}]: {}".format(i, products2match[i]))
    similarity_vector_i = similarity_matrix[i, :]

    # Select the N most similar products for products2match[i] (descending order)
    n_most_similar_products_idx = reversed(np.argsort(similarity_vector_i)[-N:])

    print("Similar products")
    print("================")
    for similar_product_idx in n_most_similar_products_idx:
        similar_product_df = product_names_df.iloc[[similar_product_idx]]
        print("{} (ProdanetID = {})".format(similar_product_df['ProductName'].values, 
                                            similar_product_df['ProdanetID'].values))
        print("Similarity: {}\n".format(similarity_vector_i[similar_product_idx]))


Product to match [0]: PANASONIC SC-PM 250 EG-S Micro-Anlage (CD, CD-R/-RW, USB, Silber/Schwarz)
Similar products
================
['PANASONIC SC-PM 250 EG-K schwarz - Kompaktanlage (20 Watt, Bluetooth, USB, Uhr, Timer)'] (ProdanetID = [1495704])
Similarity: 0.5249146344137537

['SC-PM 250BEGS'] (ProdanetID = [1236441])
Similarity: 0.32764730821916754

['Canton CD 250 schwarz (stück)'] (ProdanetID = [29850])
Similarity: 0.29049375547703127


Product to match [1]: Logitech PTZ Pro 2 (960-001186)
Similar products
================
['Webcam LOGITECH Logitech WebCam PTZ Pro 2 960-001184 (1080p - Microfone incorporado)'] (ProdanetID = [1746977])
Similarity: 0.5935985269001952

['Webcam LOGITECH 960-001106'] (ProdanetID = [1663972])
Similarity: 0.4060668439912333

['Disco SSD SAMSUNG 1TB 960 PRO M2 PCIE'] (ProdanetID = [1614023])
Similarity: 0.32095411698336673


Product to match [2]: bbc Radio 1s Dance anthems ibiza (cd)
Similar products
================
["BBC Radio 1's Dance Anthems Ibiza"] (ProdanetID = [1444577])
Similarity: 0.86443197240