#### Product Description Example

To demonstrate the cosine similarity model, we will build a simple example. Imagine we have a number of product descriptions, each with a label that says what type of product it describes. This is our training data. We also have a set of product descriptions with unknown types. We would like to assign a product type label to each unknown description so that it receives the label of the known descriptions it most resembles.


In [338]:
import re
import spacy
import pandas as pd
import numpy as np
import random
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Regexes to be used for preprocessing
alpha_only = re.compile(r'[^a-zA-Z ,.!]+')
comma_find = re.compile(r'\,{1,1} *')
period_find = re.compile(r'\.{1,1} *')
nlp = spacy.load('en_core_web_sm')


train_data = [['HEADPHONE',"STABLE, FAST, EASY PAIRINGN - ever worry about walking away or losing your connection again. Bluetooth headset ear buds INSTANTLY PAIR with ANY Bluetooth device in seconds - your cell phone,TV, laptop, tablet, smart watch, really anything, then STAYS CONNECTED, providing CALLS and CHATS with SIRI - crisp, clear, unrivaled sound quality as you move about your day with an UNWAVERING, STABLE SIGNAL from 40 FEET AWAY. Humanize Designed - Weight only 8g, lightweight and secure, comfortable fit with most shape of ears. While you moving around or working out won't popping off. Very suitable for answering calls or listening music. Truly Sweatproof Technology - SweatGuard is specially designed to resist the corrosive properties of sweat. Long Battery Life - Built-in 3.7V/ 80 mAH lithium battery, allowing you to enjoy your music for a long time up to 3-4 hours, up to 4 hours of talking time, and 50 hours of standby time with a quick charge of only 30 minutes. During the charging process, the indicator light will turn red, and after full charge, the indicator light will turn blue. Please unplug the charger at this time\
    What You Get - 1× Bluetooth sports earphones , 1× charging cable, 1× user manual and our 12 months worry free warranty. Just feel free to contact us if you have any question with our wireless earbuds, we will reply you in 24 hours"],
         
              ['HEADPHONE'," Wired In-Ear Headphones: Perfect for exercising; With three sets of earpads (S/M/L), headphones stay in your ears while keeping surrounding noise out\
ErgoFit Design for Perfect Fit: Black, ultra-soft ErgoFit in-ear earbud headphones conform instantly to your ears (S/M/L earpads included for a perfect fit)\
Smartphone Compatible: Panasonic in-ear headphones with integrated microphone and remote are compatible with Apple (iPhone / iPod / iPad), Android and Blackberry Audio devices\
In-Ear Stereo Audio: Tonally balanced audio with crisp highs and deep low notes, plus wider frequency response and lively sound quality for recorded audio\
Extended Headphone Cord: Long, 3.6-ft cord threads comfortably through clothing and bags making it easy to connect "],
         
              ['SLEEPING_BAG'," ULTRA COMFORTABLE SLEEPING BAG – Abco sleeping bags are designed to ensure that after a tiring day of trekking, hiking, travel or any other exploration you can get a good and relaxing night’s sleep. The bags have barrel shaped design which is wide at the shoulders and narrow at the leg’s end to offer maximum comfort, warmth and freedom.\
DESIGNED FOR EXTREME WEATHERS – Our sleeping bags are designed for near-freezing temperatures and have a rating of 20 degrees Fahrenheit – meaning these are designed to keep the average sleeper warm even at 20F. Moreover, these bags also have a waterproof, weather-resistant design to keep you warm even in extreme conditions and prevent you from any dampness - this is achieved through double-filled technology and S-shaped quilted design.\
EASY TO CLEAN AND CARRY – Our sleeping bags are also extremely easy to clean as they are safer for machine wash too. Moreover, each sleeping bag comes with a travel-friendly carry bag, a compression sack with straps, which makes it quite convenient to store and carry the sleeping bag along. The bags are not only ideal for cold conditions but even for warmer weather.\
LIGHTWEIGHT, SKIN-FRIENDLY AND DURABLE – The sleeping bag offers you extra comfort during Adventurous activities but without adding any extra pounds to your backpack like a foam pad. It has 100% polyester lining which is skin-friendly and uses high quality 210T polyester on the outer side to offer durability to the sleeping bag. The high-quality precisely done stitches enhance the durability even further.\
100% RISK-FREE SATISFACTION GUARANTEE – We also offer you 100% risk-free satisfaction guarantee to let you buy with confidence; no questions asked. However, we are quite sure that this sleeping bag would bring enormous comfort during your rough and adventurous rides, camping, hiking, or long term travel, while also making it extremely convenient to carry along. "],             
             
              ['WATER_FILTER', 'SMALL WATER PITCHER: This small, clear plastic pitcher is designed to be space efficient and easily fit into narrow, tight places. Height 9.8"; Width 4.45"; Length/Depth 9.37"; Weight 1.39 pounds\
CLEANER AND GREAT TASTING: The BPA free Brita filter reduces chlorine taste and odor, copper, mercury, and cadmium impurities found in tap water. *Substances reduced may not be found in all users\' water.\
FILTER INDICATOR: For optimum performance, a helpful electronic filter indicator tracks when your water filter needs to be replaced.\
REDUCE WASTE: One Brita water filter can replace 300 standard 16.9 ounce water bottles.\
REPLACEMENTS: Change Brita filters every 40 gallons, about 2 months for the average household for optimal performance.'],
             
              ['WATER_FILTER',' SPACE EFFICIENT: The UltraMax Water Dispenser holds 18 cups or 1.13 gallons of water, making it great for families and fits neatly on countertops and refrigerator shelves with a modern, slim design. Height 10.47"; Width 5.67"; Length/Depth 14.37"; Weight 3 pounds\
REDUCES LEAD: The BPA-Free Longlast filter is certified by WQA to reduce 99% of lead, chlorine (taste and odor), cadmium, mercury, benzene, asbestos and more found in tap water for cleaner, great tasting water. *Contaminants reduced may not be in all users\' water\
FILTER CHANGE REMINDER: An electronic indicator indicates when it’s time to replace your Longlast filters, which last 6 months.  That\’s 4x longer compared to PUR Lead Reduction 30 gallon filter life and Zerowater 15 gallon filter life\
INSTANT POUR: This large, 18 cup filtered water dispenser has a spigot that makes pouring easy. With the flip top lid, refilling is a breeze\
BPA REDUCE WASTE & SAVE: One Brita Longlast Filter can replace 900 standard 16 oz. water bottles. You’ll stay hydrated, save money, and reduce plastic waste Free '],
              
              ['WATER_FILTER', 'EASY AND CONVENIENT: This Brita water filtration system attaches to your standard faucet making tap water cleaner and great-tasting. Filtration system is easy to install; no tools required. Height 8.25"; Width 2.38"; Length/Depth 6"; Weight 0.84 pounds\
REDUCES LEAD: Water filter system filters out 60 contaminants such as 99% lead, chlorine (taste and odor), benzene and asbestos contaminants* that may be found in tap water. *Substances reduced may not be in all users\' water\
FILTER CHANGE REMINDER: For optimum performance, a helpful green and red light lets you know when your filter is working and when it needs to be replaced with a 1-click filter replacement. Replace your tap water filter every 4 months for the average family\
REDUCE WASTE & SAVE: 1 BPA free Brita faucet filter can provide up to 100 gallons of filtered tap water, replacing up to 750 standard 16 oz. plastic water bottles\
3 SPRAY OPTIONS: Filtering made easy with an on and off filtering switch and 3 spray options: filtered water, unfiltered water, and unfiltered spray. *Fits standard faucets only. Does not fit pull out and spray style faucets ']]

unkn_products = [
    ['PRO_123',"IPX7 SWEATPROOF EARPHONES: Mpow IPX7 Water-resistant Nano-coating efficiently protects sport headphones from sweat and ensure more guaranteed life span, perfect for running, jogging, hiking, yoga, exercises, gym, fitness, travelling and etc. Everyday with Mpow is like a valentine's day!\
    REDEFINE YOUR EARS IN RICHER RANGE: You may not get used to this earbuds with richer bass and mid at first if you used to use earbuds with flat or sharp sound. Thanks to the tuned driver, CSR chip and Bluetooth 4.1, you can get superb bass sound, as well as richer and crisp sound with Mpow earbuds at the furthest degree that in-ear & Bluetooth-compression items can achieve.\
    ENHANCED COMFORT & WEARABILITY: 1. We have improved the ear hooks to the proper hardness for snug fit. 2. Additionally come with a pair of memory-foam ear tips (adapt to ear canal to provide a perfect seal and snug fit to help keep your earbuds in place) and a cord clip, besides 3 pairs of regular ear tips in different sizes for your custom fit. 3. Suitable for normal size ears.\
    1.5-HOUR QUICK CHARGE FOR 7-9 HOURS PLAYING: Improved lithium polymer battery brings up to 7-9 hours pleasure musically and socially with a quick charge of only 1.5 hours. It will show the remaining battery power of the headphones on the iOS Phone screen. Note: 1. Mpow Flame has 12V over-voltage hardware cut off, 1A over-current restored fuse to achieve safe charging. 2. Please use charging cable provided, or certified brand charging cable. 3. We don't recommend using fast charging.\
    WHY WE RECOMMEND MPOW FLAME TO YOU: Mpow has been dedicated to produce Bluetooth headphones for many years, and we have a professional team of experts in this area. Mpow Flame not only gives you attractive & stylish look, but also provide IPX7 waterproof protection and richer bass to meet your practical needs. Every Mpow product includes a 45 days money back & 18-month warranty."],
        
    ['PRO_456',' GET A GREAT NIGHT OF SLEEP CAMPING AND BACKPACKING - No more sore back or annoying discomfort from every little rocks or leafs underneath! This 2" thick PATENT PENDING ultralight sleeping pad uses individual (but interconnected) air cells that can adjust to your body shape to provide optimal comfort, support, and warmth (R-value 1.3).\
YOU WON\'T MISS YOUR MATTRESS AT HOME - Great comfort and support with our interconnected, self-adjusting air-cell design that conforms to your body\
PERFECT FOR BACKPACKING, CAMPING,TRAVEL - Ultralight, Ultra-Compact and weights only ~16 oz and packs down TINY to 8"x3"x3". Easily fit in your backpack and go with included sack. Get a great night of sleep wherever you go.\
INFLATE / DEFLATE IN SECONDS - Easy-to-use air valve allows for quick inflation (in 10 - 15 breaths) and deflation (in seconds!)\
MADE TO LAST WITH LIFETIME WARRANTY - durable and weather-proof w/ ultralight ripstop 20D Nylon fabric and extruded TPU layer. Fully warranted against defects in materials and workmanship with a lifetime guarantee ' ],
        
    ['PRO_555', ' The accuracy of filtration is up to 0.1um so that it can filtrate impurities,sediment,harmful materials,chlorine,rust,heavy metals,algae,colloid,bleaching powder,red worm,impurity and so on,the water tastes pure and natural.\
Double outlet,purified water and tap water outlet,so the water is clean and it is safety for the healthy of your family members.\
Equiped with the multi-use interface,simple and quick installation. Water purification flow is 2L/MIN,large capacity for your daily uses.\
Three kinds of water filtration faucet mount with great power,you can use different filter depending on the water quality,very easy to replace the filter.\
The shell of the faucet water filter is made of high-quality non-toxic plastic,safty and suitable to protect your health,especially for your mom and a baby.For your healthy,a filter is your best choice. '],
        
    ['PRO_777', 'For years Ryder Carroll tried countless organizing systems, online and off, but none of them fit the way his mind worked. Out of sheer necessity, he developed a method called the Bullet Journal that helped him become consistently focused and effective. When he started sharing his system with friends who faced similar challenges, it went viral. Just a few years later, to his astonishment, Bullet Journaling is a global movement.\
The Bullet Journal Method is about much more than organizing your notes and to-do lists. It\'s about what Carroll calls "intentional living": weeding out distractions and focusing your time and energy in pursuit of what\'s truly meaningful, in both your work and your personal life. It\'s about spending more time with what you care about, by working on fewer things. His new book shows you how to...\
  *  Track the past: Using nothing more than a pen and paper, create a clear and comprehensive record of your thoughts.\
  *  Order the present: Find daily calm by tackling your to-do list in a more mindful, systematic, and productive way.\
  *  Design the future: Transform your vague curiosities into meaningful goals, and then break those goals into manageable action steps that lead to big change. '],
        
    ['TEST'," ULTRA COMFORTABLE SLEEPING BAG – Abco sleeping bags are designed to ensure that after a tiring day of trekking, hiking, travel or any other exploration you can get a good and relaxing night’s sleep. The bags have barrel shaped design which is wide at the shoulders and narrow at the leg’s end to offer maximum comfort, warmth and freedom.\
DESIGNED FOR EXTREME WEATHERS – Our sleeping bags are designed for near-freezing temperatures and have a rating of 20 degrees Fahrenheit – meaning these are designed to keep the average sleeper warm even at 20F. Moreover, these bags also have a waterproof, weather-resistant design to keep you warm even in extreme conditions and prevent you from any dampness - this is achieved through double-filled technology and S-shaped quilted design.\
EASY TO CLEAN AND CARRY – Our sleeping bags are also extremely easy to clean as they are safer for machine wash too. Moreover, each sleeping bag comes with a travel-friendly carry bag, a compression sack with straps, which makes it quite convenient to store and carry the sleeping bag along. The bags are not only ideal for cold conditions but even for warmer weather.\
LIGHTWEIGHT, SKIN-FRIENDLY AND DURABLE – The sleeping bag offers you extra comfort during Adventurous activities but without adding any extra pounds to your backpack like a foam pad. It has 100% polyester lining which is skin-friendly and uses high quality 210T polyester on the outer side to offer durability to the sleeping bag. The high-quality precisely done stitches enhance the durability even further.\
100% RISK-FREE SATISFACTION GUARANTEE – We also offer you 100% risk-free satisfaction guarantee to let you buy with confidence; no questions asked. However, we are quite sure that this sleeping bag would bring enormous comfort during your rough and adventurous rides, camping, hiking, or long term travel, while also making it extremely convenient to carry along. "]]

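# This function is passed to the vectorizer and pre-processes each document before tokenization.
# It lowercases the text, replaces every run of characters other than letters, spaces, commas,
# periods and exclamation marks with a single space, and normalizes the spacing after periods and commas.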
def my_preprocessor(doc):
    a = alpha_only.sub(' ',doc.lower())
    b = period_find.sub('. ', a)
    c = comma_find.sub(', ', b)
    return(c)


# This function is passed to the vectorizer and is used to tokenize our text data in whatever way
# we choose.
def my_tokenizer(doc):
    my_nlp = nlp(doc)
    tokens = []
    for token in my_nlp:
        # Keep only tokens longer than 3 characters that are neither stop words nor whitespace
        # (digits were already removed by my_preprocessor), and store the lemma of each kept token.
        if len(token.text) > 3 and not token.is_stop and not token.is_space:
            tokens.append(token.lemma_)
    return (tokens)

# This function is used to make the vectorized array easier to visualize.
# It creates a Pandas dataframe from the word matrix.
def wm2df(wm, feat_names):
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return (df)

# This function averages together rows of a numpy array that share the same index values,
# in our case these index values are the product types.
def f_numpy(names, values):
    result_names = np.unique(names)
    result_values = np.empty((result_names.shape[0], values.shape[1]))

    for i, name in enumerate(result_names):
        result_values[i,:] = np.mean(values[names == name], axis=0)

    return result_names, result_values 

# Given a cosine similarity matrix, this function returns, for each row, the column indices of the
# top 'N' largest values and the values themselves (scaled to percentages).
def find_top_n(cs, top_n):
    # Column indices of each row, sorted from most to least similar
    bst_mtch = np.argsort(-cs, axis=1)
    i = bst_mtch[:, :top_n]
    # Pick the similarity values at those column positions, row by row
    top_vals = np.take_along_axis(cs, i, axis=1)
    return i, top_vals*100



### Main Idea

The main idea is to convert our training data into a vector space along with a set of term weights: each description becomes a vector with one dimension per term. We will use TF-IDF to weight the terms. Let's build this document-term matrix.
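
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L1 normalization of each row (as with `norm='l1'`). The toy corpus and the helper name `tfidf_l1` are made up for illustration; this is not the library's implementation.

```python
import math
from collections import Counter

def tfidf_l1(docs):
    """Return (vocabulary, rows) for a list of already-tokenized documents."""
    n = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: in how many documents does each term occur?
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    # Smoothed inverse document frequency (assumed formula, see lead-in)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        total = sum(raw)
        # L1 normalization: each non-empty row sums to 1
        rows.append([v / total if total else 0.0 for v in raw])
    return vocab, rows

# Hypothetical toy corpus, one token list per document
toy_docs = [
    ["bluetooth", "earbud", "battery"],
    ["water", "filter", "pitcher"],
    ["water", "filter", "faucet"],
]
vocab, rows = tfidf_l1(toy_docs)
for row in rows:
    print({t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

In the second toy document, "pitcher" occurs in only one document and so receives a larger weight than "water" or "filter", which are shared with the third document.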

In [339]:
# Build our corpus of text data, 1 element of the array per document.
train = [t[1] for t in train_data]

# Initialize the vectorizer. This will 1st run the 'my_preprocessor()' function I defined above, followed by 
# the 'my_tokenizer()' function I defined. (Note that you must re-initialize the vectorizer each time you want to
# use a different corpus)
vectorizer = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer, norm='l1', smooth_idf=True)

# Fit the model using the tfidf vectorizer (ie - transform to the vector space)
doc_matrix = vectorizer.fit_transform(train)

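# Extract our tokens. (get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.)
tokens = vectorizer.get_feature_names_out()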

# Make it pretty
out_df = pd.DataFrame(data=doc_matrix.toarray(), index=[t[0] for t in train_data],
                      columns=tokens)
out_df

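| | abco | achieve | activity | add | adventurous | allow | android | answering | apple | asbestos | ... | waterproof | weather | weight | wide | width | wire | wireless | work | worry | zerowater |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HEADPHONE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.009183 | 0.0 | 0.009183 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.005448 | 0.0 | 0.0 | 0.0 | 0.009183 | 0.00753 | 0.018366 | 0.0 |
| HEADPHONE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.014729 | 0.0 | 0.014729 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.012078 | 0.0 | 0.014729 | 0.0 | 0.0 | 0.0 | 0.0 |
| SLEEPING_BAG | 0.006918 | 0.006918 | 0.006918 | 0.006918 | 0.013835 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.006918 | 0.020753 | 0.0 | 0.005673 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WATER_FILTER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.010598 | 0.0 | 0.012368 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WATER_FILTER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.010165 | ... | 0.0 | 0.0 | 0.007354 | 0.0 | 0.008582 | 0.0 | 0.0 | 0.0 | 0.0 | 0.012396 |
| WATER_FILTER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.010025 | ... | 0.0 | 0.0 | 0.007253 | 0.0 | 0.008464 | 0.0 | 0.0 | 0.010025 | 0.0 | 0.0 |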


### The problem
Notice that each product type (HEADPHONE, WATER_FILTER) can appear more than once. That makes sense: we have 6 descriptions, so the document-term matrix has 6 rows. It is also a problem, because we want to match an "unknown" product description to a single product type. Multiple rows per product type muddy the waters and make it harder to select the "right" one.

The solution is to make each product type appear exactly once in our document space. There are two potential ways to do this (amongst others).

The first method is to train the vectorizer first on all documents, and then average all of the vectors that have the same product types. The second method is to concatenate all of the text together for a given product type, and treat that as a single document, which is then fed to the vectorizer. 
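
The two methods do not generally produce the same vector. A minimal sketch, using plain term counts instead of TF-IDF and two made-up documents of different lengths, shows one difference: averaging normalized vectors gives every document the same influence, while concatenating the text lets longer documents dominate. The two-term vocabulary and the counts below are illustrative assumptions.

```python
def l1_normalize(vec):
    """Scale a vector so its entries sum to 1 (all-zero vectors are left alone)."""
    total = sum(vec)
    return [v / total for v in vec] if total else list(vec)

# Term counts over a hypothetical 2-term vocabulary: ["wireless", "cord"]
short_doc = [2, 0]    # short description, mentions only "wireless"
long_doc = [0, 18]    # long description, mentions only "cord"

# Method 1 style: normalize each document, then average the vectors
normalized = [l1_normalize(short_doc), l1_normalize(long_doc)]
averaged = [sum(col) / len(normalized) for col in zip(*normalized)]

# Method 2 style: concatenate the text (add the counts), then normalize
concatenated = l1_normalize([a + b for a, b in zip(short_doc, long_doc)])

print(averaged)       # [0.5, 0.5]
print(concatenated)   # [0.1, 0.9]
```

With TF-IDF there is a second difference: the IDF weights themselves change, because Method 2 computes document frequencies over 3 concatenated documents rather than 6 individual ones.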

#### Method 1: Train first, reduce document space second

In [340]:
# In method 1, train the vectorizer on every document first, then reduce the document space
# by averaging the rows that share a product type.
train = [t[1] for t in train_data]
query = [p[1] for p in unkn_products]


# Initialize the vectorizer with the same options as before.
vectorizer = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer, norm='l1', smooth_idf=True)
# Train the vectorizer using the training data corpus.
cwm = vectorizer.fit_transform(train)

# Transform the query data using that trained vectorizer (no re-fitting).
qm = vectorizer.transform(query)

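# Extract our tokens. (get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.)
tokens = vectorizer.get_feature_names_out()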

# Average together all similar product type vectors
arr = cwm.toarray()
nms = np.array([t[0] for t in train_data])
nms, nm = f_numpy(nms, arr)

# Make the output pretty
df = pd.DataFrame(data=nm, index=nms,
                      columns=tokens)
df

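| | abco | achieve | activity | add | adventurous | allow | android | answering | apple | asbestos | ... | waterproof | weather | weight | wide | width | wire | wireless | work | worry | zerowater |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HEADPHONE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.004592 | 0.007364 | 0.004592 | 0.007364 | 0.0 | ... | 0.0 | 0.0 | 0.002724 | 0.006039 | 0.0 | 0.007364 | 0.004592 | 0.003765 | 0.009183 | 0.0 |
| SLEEPING_BAG | 0.006918 | 0.006918 | 0.006918 | 0.006918 | 0.013835 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.006918 | 0.020753 | 0.0 | 0.005673 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WATER_FILTER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.00673 | ... | 0.0 | 0.0 | 0.008402 | 0.0 | 0.009804 | 0.0 | 0.0 | 0.003342 | 0.0 | 0.004132 |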


Now we "query" the training data by calculating the cosine similarity, that is, the cosine of the angle between two vectors. The smaller the angle, the higher the similarity; when the angle between the two vectors is 0, the cosine is 1. Because TF-IDF weights are never negative, the values here fall between 0 and 1. Cells in the cosine similarity matrix that are closer to 1 imply a strong similarity between the two descriptions, while cells closer to 0 imply orthogonal, or unrelated, descriptions.
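
As a small illustration of the measure itself, the following sketch computes cosine similarity by hand for a few toy vectors. The helper `cosine` and the vectors are made up for this example; returning 0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
print(cosine(a, [2.0, 4.0, 0.0]))   # same direction, different length: about 1
print(cosine(a, [0.0, 0.0, 3.0]))   # no shared terms: 0
print(cosine(a, [1.0, 0.0, 0.0]))   # partial overlap: between 0 and 1
```

Cosine similarity depends only on direction, not on vector length, so a long and a short description that use their terms in the same proportions still score close to 1.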

In [341]:
# Now calculate cosine similarity between the unknown product "queries" and the known product data.
cs = cosine_similarity(qm, nm)
# Make the output readable
cdf = pd.DataFrame(data=cs, index=[u[0] for u in unkn_products], columns=nms)
cdf

| | HEADPHONE | SLEEPING_BAG | WATER_FILTER |
| --- | --- | --- | --- |
| PRO_123 | 0.49785 | 0.096927 | 0.053384 |
| PRO_456 | 0.093415 | 0.360921 | 0.05312 |
| PRO_555 | 0.03773 | 0.048461 | 0.652556 |
| PRO_777 | 0.173583 | 0.021237 | 0.127817 |
| TEST | 0.078533 | 1.0 | 0.028545 |


In [342]:
# Make output easier to interpret, where we rank by degree of similarity. Note that the values do not add up to 100.
n_depth = 3
top_index, _ = find_top_n(cs, n_depth)
for i,ind in enumerate(top_index):
    out = "{}: ".format(unkn_products[i][0])
    for j in range(n_depth):
        out += ' '
        out += "{}({}% similarity)".format(nms[ind[j]],
                                          round(100*cs[i,ind[j]],1))
    print(out)

PRO_123:  HEADPHONE(49.8% similarity) SLEEPING_BAG(9.7% similarity) WATER_FILTER(5.3% similarity)
PRO_456:  SLEEPING_BAG(36.1% similarity) HEADPHONE(9.3% similarity) WATER_FILTER(5.3% similarity)
PRO_555:  WATER_FILTER(65.3% similarity) SLEEPING_BAG(4.8% similarity) HEADPHONE(3.8% similarity)
PRO_777:  HEADPHONE(17.4% similarity) WATER_FILTER(12.8% similarity) SLEEPING_BAG(2.1% similarity)
TEST:  SLEEPING_BAG(100.0% similarity) HEADPHONE(7.9% similarity) WATER_FILTER(2.9% similarity)


#### Method 2: Reduce document space first, train second

In [344]:
# In method 2, reduce the document space first (one concatenated document per product type),
# then train the vectorizer.

query = [p[1] for p in unkn_products]

# Convert the train data structure to a data frame.
df = pd.DataFrame([t[0] for t in train_data], columns=['id'])
df['text'] = [t[1] for t in train_data]

# Group by product name (called 'id' here) and then join all of the individual product descriptions within these
# groups.
df2 = df.groupby('id', as_index=False).agg(lambda x:' '.join(x))

# Convert the resulting text array to a list.
train = df2['text'].tolist()

# Initialize the vectorizer.
vectorizer = TfidfVectorizer(preprocessor=my_preprocessor, tokenizer=my_tokenizer, norm='l1', smooth_idf=True)

# Train our vectorizer
cwm = vectorizer.fit_transform(train)

# Transform the queries with the trained vectorizer (no re-fitting).
qm = vectorizer.transform(query)

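# Extract our tokens. (get_feature_names() was removed in scikit-learn 1.2;
# on versions older than 1.0, call get_feature_names() instead.)
tokens = vectorizer.get_feature_names_out()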

# Make the output pretty
out_df = pd.DataFrame(data=cwm.toarray(), index=df2['id'].tolist(),
                      columns=tokens)
out_df

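| | abco | achieve | activity | add | adventurous | allow | android | answering | apple | asbestos | ... | waterproof | weather | weight | wide | width | wire | wireless | work | worry | zerowater |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HEADPHONE | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.005577 | 0.005577 | 0.005577 | 0.005577 | 0.0 | ... | 0.0 | 0.0 | 0.004242 | 0.004242 | 0.0 | 0.005577 | 0.005577 | 0.004242 | 0.011154 | 0.0 |
| SLEEPING_BAG | 0.006911 | 0.006911 | 0.006911 | 0.006911 | 0.013823 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.006911 | 0.020734 | 0.0 | 0.005256 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WATER_FILTER | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.007528 | ... | 0.0 | 0.0 | 0.008587 | 0.0 | 0.011291 | 0.0 | 0.0 | 0.002862 | 0.0 | 0.003764 |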


In [345]:
# Now calculate cosine similarity between the unknown product "queries" and the known product data.
cs2 = cosine_similarity(qm, cwm)
# Make the output readable
cdf2 = pd.DataFrame(data=cs2, index=[u[0] for u in unkn_products], columns=df2['id'].tolist())
cdf2

| | HEADPHONE | SLEEPING_BAG | WATER_FILTER |
| --- | --- | --- | --- |
| PRO_123 | 0.537663 | 0.093461 | 0.062939 |
| PRO_456 | 0.100689 | 0.349385 | 0.060209 |
| PRO_555 | 0.032306 | 0.040067 | 0.722178 |
| PRO_777 | 0.1897 | 0.028164 | 0.122765 |
| TEST | 0.081445 | 1.0 | 0.025448 |


In [346]:
# Make output easier to interpret, where we rank by degree of similarity. Note that the values do not add up to 100.

n_depth = 3
top_index, _ = find_top_n(cs2, n_depth)
for i,ind in enumerate(top_index):
    out = "{}: ".format(unkn_products[i][0])
    for j in range(n_depth):
        out += ' '
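        # Look up labels in df2 (this method's row order) rather than reusing nms from Method 1.
        out += "{}({}% similarity)".format(df2['id'].iloc[ind[j]],
                                          round(100*cs2[i,ind[j]],1))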
    print(out)

PRO_123:  HEADPHONE(53.8% similarity) SLEEPING_BAG(9.3% similarity) WATER_FILTER(6.3% similarity)
PRO_456:  SLEEPING_BAG(34.9% similarity) HEADPHONE(10.1% similarity) WATER_FILTER(6.0% similarity)
PRO_555:  WATER_FILTER(72.2% similarity) SLEEPING_BAG(4.0% similarity) HEADPHONE(3.2% similarity)
PRO_777:  HEADPHONE(19.0% similarity) WATER_FILTER(12.3% similarity) SLEEPING_BAG(2.8% similarity)
TEST:  SLEEPING_BAG(100.0% similarity) HEADPHONE(8.1% similarity) WATER_FILTER(2.5% similarity)
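
The rankings stop short of the stated goal of assigning a label. One possible final step, sketched here with the Method 2 similarities copied from the table above, is to take the best-scoring product type for each query and to withhold a label when even the best score is low. The helper `assign_label` and the cutoff of 0.3 are illustrative assumptions, not part of the original notebook; a suitable cutoff would have to be chosen from held-out labeled data.

```python
labels = ["HEADPHONE", "SLEEPING_BAG", "WATER_FILTER"]

# Method 2 cosine similarities, copied from the output table
similarities = {
    "PRO_123": [0.537663, 0.093461, 0.062939],
    "PRO_456": [0.100689, 0.349385, 0.060209],
    "PRO_555": [0.032306, 0.040067, 0.722178],
    "PRO_777": [0.189700, 0.028164, 0.122765],
    "TEST":    [0.081445, 1.000000, 0.025448],
}

def assign_label(scores, labels, min_similarity=0.3):
    """Return the best-scoring label, or None if its score is below the cutoff.

    min_similarity is a hypothetical cutoff chosen for illustration only.
    """
    best = max(range(len(scores)), key=lambda k: scores[k])
    return labels[best] if scores[best] >= min_similarity else None

assigned = {pid: assign_label(s, labels) for pid, s in similarities.items()}
for pid, label in assigned.items():
    print(pid, label or "NO_CONFIDENT_MATCH")
```

With this cutoff, PRO_777 receives no label, since its best score is below 0.3.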


### Questions

1. What are the relative advantages and disadvantages of using method 1 or method 2? What about if we have HUGE input text for each description or many thousands of descriptions for each product type?

2. Why does the TEST product have 100% similarity to the SLEEPING_BAG product type, but also 8.1% similarity to HEADPHONE?

3. PRO_777 doesn't have a strong match to any particular category. How do we deal with this?