# System Configuration
RAM -> 8GB  
Processor -> AMD A6-5350M  
OS -> Ubuntu 16.04  
OS Type -> 64-bit  

## Approach
The original dataset contains 4,057,189 rows. The problem statement asks for only those rows whose subcategory is "Tunics"; filtering for them with pandas leaves 644,661 rows. Since this was still too large for the system to handle, I randomly selected 40,000 rows from this subset.
Since no labels were given, this is an unsupervised problem. Out of the 32 features, I primarily selected title, description, productFamily, keySpecsStr, and detailedSpecsStr: title and description describe the product in layman's language, productFamily describes the class of product, and keySpecsStr and detailedSpecsStr describe the product in detail.
All five features are strings, so each one was converted to document vectors with gensim's Doc2Vec (a word2vec-style model). Cosine similarity was then computed on these vectors for each feature; it indicates how similar two products are with respect to that feature. Finally, the five per-feature similarities were combined into a single score.
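
As a minimal sketch of the similarity measure described above, the snippet below computes cosine similarity between two vectors with NumPy. The function name `cosine_similarity` and the toy vectors are illustrative assumptions, not part of the notebook; the notebook obtains the same quantity as one minus SciPy's cosine distance.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return cos(angle) between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors (hypothetical, not taken from the dataset)
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # perpendicular vectors
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # opposing vectors
```

Values near 1 mean the two vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.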

In [1]:
import pandas as pd
import numpy as np
from gensim.models import doc2vec
from collections import namedtuple
from scipy import spatial



In [2]:
# data1 = pd.read_csv('2oq-c1r.csv') # This is the original dataset's dataframe
# # 'unic' matches both "Tunic" and "tunic"; rows with a missing category are skipped
# tunic_mask = data1['categories'].str.contains('unic', na=False)
# data2 = data1[tunic_mask] # This dataframe contains values having only Tunics subcategory

In [3]:
# data2 = data2.sample(frac=1).reset_index(drop=True)
# df = data2[:40000] # This dataframe contains randomly selected 40000 rows containing Tunic subcategory

In [4]:
df = pd.read_csv('tunic40k.csv')

In [5]:
df.shape

(40000, 34)

In [6]:
df.columns

Index([u'Unnamed: 0', u'Unnamed: 0.1', u'productId', u'title', u'description',
       u'imageUrlStr', u'mrp', u'sellingPrice', u'specialPrice', u'productUrl',
       u'categories', u'productBrand', u'productFamily', u'inStock',
       u'codAvailable', u'offers', u'discount', u'shippingCharges',
       u'deliveryTime', u'size', u'color', u'sizeUnit', u'storage',
       u'displaySize', u'keySpecsStr', u'detailedSpecsStr',
       u'specificationList', u'sellerName', u'sellerAverageRating',
       u'sellerNoOfRatings', u'sellerNoOfReviews', u'sleeve', u'neck',
       u'idealFor'],
      dtype='object')

In [7]:
df.neck = df.neck.fillna('')
df.sizeUnit = df.sizeUnit.fillna('')
df['size'] = df['size'].fillna('')
df.keySpecsStr = df.keySpecsStr.fillna('')
df.detailedSpecsStr = df.detailedSpecsStr.fillna('')
df.description = df.description.fillna('')
df.title = df.title.fillna('')
df.productFamily = df.productFamily.fillna('')

In [8]:
df = df.drop(['Unnamed: 0', 'Unnamed: 0.1', 'imageUrlStr', 'productUrl', 'inStock', 'codAvailable', 'offers', 'shippingCharges', 'deliveryTime', 'storage', 'displaySize', 'sellerName', 'sellerAverageRating', 'sellerNoOfRatings', 'sellerNoOfReviews', 'specificationList', 'idealFor'], axis=1)

In [9]:
df.shape

(40000, 17)

In [10]:
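def return_similarity_score(doc1):
    """Train Doc2Vec on the strings in doc1; return each document's cosine similarity to document 0."""
    docs = []
    analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
    for i, text in enumerate(doc1):
        words = text.lower().split()
        tags = [i]  # tag each document with its row position
        docs.append(analyzedDocument(words, tags))
    # Note: `size` and `docvecs` are the names used by gensim releases before 4.0
    # (renamed to `vector_size` and `dv` in 4.0).
    model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)
    sim_score = []
    for i in range(len(model.docvecs)):
        # scipy's cosine() is a distance, so 1 - distance gives the similarity
        sim_score.append(1-(spatial.distance.cosine(model.docvecs[0], model.docvecs[i])))
    return sim_score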

In [11]:
doc_title = df['title'].tolist()
sim_score_title = return_similarity_score(doc_title)



In [12]:
sorted(sim_score_title, reverse=True)

[1.0,
 0.9580023884773254,
 0.9572941064834595,
 0.9560819864273071,
 0.9532979130744934,
 0.9523184895515442,
 0.9522967338562012,
 0.9522823691368103,
 0.9508616924285889,
 0.9507434368133545,
 0.9503090381622314,
 0.9481085538864136,
 0.9478361010551453,
 0.9473828673362732,
 0.9472674131393433,
 0.947053074836731,
 0.9460029006004333,
 0.9445982575416565,
 0.9435232877731323,
 0.9435194134712219,
 0.9434640407562256,
 0.9430810809135437,
 0.9427804350852966,
 0.9427653551101685,
 0.9424615502357483,
 0.9421488642692566,
 0.9421411752700806,
 0.9420633316040039,
 0.9419798851013184,
 0.9419723153114319,
 0.9409561157226562,
 0.9409391283988953,
 0.9409373998641968,
 0.9408707022666931,
 0.9407768249511719,
 0.9407556056976318,
 0.940752387046814,
 0.9405441880226135,
 0.9405302405357361,
 0.9401859045028687,
 0.9401128888130188,
 0.9399386048316956,
 0.9398705363273621,
 0.9397174119949341,
 0.9396159052848816,
 0.9395330548286438,
 0.939530611038208,
 0.9393796324729919,
 0.9392148

In [13]:
doc_description = df['description'].tolist()
sim_score_description = return_similarity_score(doc_description)
doc_keyspec = df['keySpecsStr'].tolist()
sim_score_keyspec = return_similarity_score(doc_keyspec)
doc_detailspec = df['detailedSpecsStr'].tolist()
sim_score_detailspec = return_similarity_score(doc_detailspec)
doc_prodfam = df['productFamily'].tolist()
sim_score_prodfam = return_similarity_score(doc_prodfam)

In [14]:
avg_score_list = []
for i in range(len(df)):
    sum_score = 0
    sum_score = sim_score_title[i] + sim_score_description[i] + sim_score_keyspec[i] + sim_score_detailspec[i] + sim_score_prodfam[i]
    avg_score = sum_score / 5
    avg_score_list.append(avg_score)

In [15]:
avg_score_list

[1.0,
 0.19913675151765348,
 0.12857295498251914,
 0.24324544370174409,
 0.291514178365469,
 0.2309822678565979,
 0.3559082821011543,
 0.17346120178699492,
 -0.07107783854007721,
 0.2975344300270081,
 0.25993516221642493,
 0.11861620284616947,
 0.36114214099943637,
 0.38421919345855715,
 0.015791203081607818,
 -0.15120927318930627,
 0.3144567418843508,
 0.3376820407807827,
 0.28393520414829254,
 0.18876098841428757,
 0.33565987199544906,
 0.3609116733074188,
 0.04419657401740551,
 0.1991051932796836,
 0.30343054942786696,
 0.39893912672996523,
 -0.058128141611814496,
 0.3418563693761826,
 0.3148974165320396,
 0.1370036207139492,
 0.09239943586289882,
 0.2136879812926054,
 0.40556347388774155,
 0.2590924009680748,
 0.1299200788140297,
 0.17243934869766236,
 0.27137675806879996,
 0.12001860151067376,
 0.09272409677505493,
 0.19870373010635375,
 0.16775750182569027,
 0.3101510763168335,
 0.018730311840772628,
 -0.10449664462357759,
 0.24982584714889527,
 0.3001817863434553,
 0.18173222243

In the code above, all five features are weighted equally, but they could be weighted according to their importance.
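
One way to weight the features is sketched here under assumptions: the helper `weighted_score` and the weights are hypothetical choices for illustration, and the short lists stand in for the per-feature similarity lists computed in the notebook.

```python
def weighted_score(scores_by_feature, weights):
    """Combine per-feature similarity lists into one weighted-average score per product."""
    total_weight = float(sum(weights.values()))
    n_products = len(next(iter(scores_by_feature.values())))
    combined = []
    for i in range(n_products):
        weighted_sum = sum(weights[name] * scores_by_feature[name][i] for name in weights)
        combined.append(weighted_sum / total_weight)
    return combined

# Toy scores for three products (illustrative values, not notebook output)
scores = {
    'title': [1.0, 0.8, 0.2],
    'description': [1.0, 0.4, 0.6],
}
# Hypothetical weights: title counts three times as much as description
weights = {'title': 3.0, 'description': 1.0}
combined = weighted_score(scores, weights)
```

Setting every weight to the same value reproduces the plain average used in the notebook.
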
Also, the cosine similarities for all five features are calculated only against the zeroth product. Ideally this would be the full pairwise similarity matrix. I did not compute it because scoring all 40,000 products against a single product takes nearly 20 seconds, so the full matrix would take about 20 * 40,000 = 800,000 seconds (roughly nine days).
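
A possible way to make the pairwise comparison cheaper is sketched below, assuming the document vectors are already trained and stacked into a NumPy array `vectors` of shape (products, dimensions). The function `top_k_similar` and its parameters `k` and `chunk_size` are hypothetical names, not part of the notebook. Normalizing each row once turns cosine similarity into a matrix product, and working in chunks avoids holding the full 40,000 x 40,000 matrix (about 6.4 GB in float32) in memory.

```python
import numpy as np

def top_k_similar(vectors, k=5, chunk_size=1024):
    """For each row, return the indices of the k most cosine-similar other rows."""
    vectors = np.asarray(vectors, dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)  # guard against all-zero rows
    n = unit.shape[0]
    result = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # (chunk, n) block of cosine similarities for rows start..stop-1
        sims = unit[start:stop] @ unit.T
        # a product should not be returned as its own neighbour
        sims[np.arange(stop - start), np.arange(start, stop)] = -np.inf
        # sort each row by descending similarity and keep the first k columns
        result[start:stop] = np.argsort(-sims, axis=1)[:, :k]
    return result

# Toy vectors (hypothetical): rows 0 and 1 are close, rows 2 and 3 are close
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
neighbours = top_k_similar(toy_vectors, k=1, chunk_size=2)
```

Only the k best matches per product are kept, which is usually what a similar-products lookup needs; `k` must be smaller than the number of products.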

The model does not perform very well because I had to leave out nearly 94% of the Tunics rows due to system limitations; the system could not handle such a large dataset. Run on a more capable machine, the model could give more promising output.

## Additional Work
If this were a supervised problem, i.e., if we had a labelled dataset, we could train a separate classification model, such as a neural network, on each of these Doc2Vec feature vectors and combine their results.
Apart from these five features, price also plays a significant role in determining similar products. Within a single subcategory, prices generally follow a roughly normal distribution, so we can look for outliers in the dataset using statistical tests, such as the chi-square test, and remove them.
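
The outlier step could look like the sketch below. It swaps in a simple z-score rule in place of a chi-square test, which is an assumption made here for brevity; the function `drop_price_outliers`, the threshold `z_max`, and the toy prices are all hypothetical. In the notebook the prices would come from a column such as `sellingPrice`.

```python
from statistics import mean, stdev

def drop_price_outliers(prices, z_max=3.0):
    """Keep prices whose z-score (distance from the mean in standard deviations) is at most z_max."""
    mu = mean(prices)
    sigma = stdev(prices)
    if sigma == 0:
        return list(prices)  # all prices identical, nothing to remove
    return [p for p in prices if abs(p - mu) / sigma <= z_max]

# Toy prices (hypothetical): nine typical values and one extreme value
toy_prices = [499, 520, 510, 495, 505, 515, 490, 500, 508, 9999]
kept = drop_price_outliers(toy_prices, z_max=2.0)
```

On small samples a single extreme value inflates the standard deviation, so a threshold below the usual 3 may be needed, as in the toy call above.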
Features like color and size also play a significant role, and they are categorical. Two products falling into the same category helps narrow the search, while products in different categories can be rejected.
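
A rough sketch of this narrowing step with pandas follows. The helper `same_category_candidates` and the toy rows are hypothetical; only the column names `color` and `size` come from the dataset.

```python
import pandas as pd

def same_category_candidates(df, query_index, columns=('color', 'size')):
    """Return index labels of rows that match the query row on every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        mask &= df[col] == df.loc[query_index, col]
    mask.loc[query_index] = False  # the query product is not its own candidate
    return df.index[mask].tolist()

# Toy catalogue (hypothetical rows)
toy = pd.DataFrame({
    'color': ['Red', 'Red', 'Blue', 'Red'],
    'size': ['M', 'M', 'M', 'L'],
})
candidates = same_category_candidates(toy, query_index=0)
```

The text-similarity scores would then only need to be compared among the returned candidates.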
Apart from all these features, one very important feature is the product image. The imageUrls currently return error 400; if they were accessible, one could train a model to identify similar-looking products.