# Introduction

# Read the data

Follow the link below to download a dataset of 500 items (shoes, shirts, etc.), each with an item id and a textual description.

In [63]:
import pandas as pd
from sklearn.metrics.pairwise import linear_kernel 
ds = pd.read_csv("sample-data.csv")

In [3]:
ds

Unnamed: 0,id,description
0,1,Active classic boxers - There's a reason why o...
1,2,Active sport boxer briefs - Skinning up Glory ...
2,3,Active sport briefs - These superbreathable no...
3,4,"Alpine guide pants - Skin in, climb ice, switc..."
4,5,"Alpine wind jkt - On high ridges, steep ice an..."
...,...,...
495,496,Cap 2 bottoms - Cut loose from the maddening c...
496,497,Cap 2 crew - This crew takes the edge off fick...
497,498,All-time shell - No need to use that morning T...
498,499,All-wear cargo shorts - All-Wear Cargo Shorts ...


# Creating a TF-IDF Vectorizer

We now have the textual description of each item. As explained in class, we first need a feature extraction stage that identifies the most relevant tokens in each description and transforms the text into a format that our recommender system can read.

For feature extraction we will use TF-IDF (please review the slides from the previous class), specifically the implementation that sklearn provides (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). sklearn automates the TF-IDF computation, so we do not need to implement it ourselves. It also returns the data in a format that any ML or recommendation algorithm can use directly.
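
For reference, the weights that TfidfVectorizer produces can be written down explicitly. The formulas below assume sklearn's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'); the symbols are introduced here for illustration only: $n$ is the number of documents, $\mathrm{df}(t)$ the number of documents containing term $t$, and $\mathrm{tf}(t,d)$ the raw count of $t$ in document $d$.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t),
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
$$

The last step rescales each document vector to unit Euclidean length, so every row of the resulting matrix has norm 1 (or is all zeros when no vocabulary term occurs in the document).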

There are several parameters that we can use to configure the computation of TF-IDF (for more info, please check the sklearn documentation):
- ngram_range: Lets you consider n-grams in addition to individual words (e.g., the unigrams and bigrams of "I like it" are [I, like, it, I like, like it])
- min_df: Sets a document-frequency threshold. Any token that appears in fewer than min_df documents is ignored (an integer is an absolute number of documents, a float between 0.0 and 1.0 is a proportion of documents)
- stop_words: Removes stop words (words that are very common but carry little meaning in a language, such as: a, the, it,...) from the textual description. You have to set the language of the description, since stop words are language-dependent

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1 keeps every token that appears in at least one document
tfidf = TfidfVectorizer(analyzer='word', min_df=1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(ds['description'])


At this point, each item is represented by an array of descriptors, where each position holds the TF-IDF weight of the term associated with that position. As you can see in the following cell, most of the weights are 0 (i.e., each document only includes a small subset of the vocabulary).
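
One way to quantify how sparse such a representation is would be to count the stored non-zero weights. The snippet below is a sketch on a made-up toy corpus (the documents and names are assumptions for illustration), not on the notebook's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with no words shared between documents
toy_docs = [
    "merino wool base layer",
    "waterproof shell jacket",
    "organic cotton shirt",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

n_rows, n_cols = toy_matrix.shape
# nnz is the number of explicitly stored (non-zero) entries of the sparse matrix
density = toy_matrix.nnz / (n_rows * n_cols)
print(f"{toy_matrix.nnz} non-zero weights out of {n_rows * n_cols} ({density:.0%})")
```

The same two lines applied to the notebook's tfidf_matrix would report the density of the real item representation.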

In [52]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
df

Unnamed: 0,000,03,10,100,1000,1021,1027,103,1038,1055,...,zest,zinger,zip,zipped,zipper,zippered,zippers,zipping,zips,zones
0,0.0,0.0,0.000000,0.076141,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
1,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.224591,0.068949,0.0,0.0,0.0
4,0.0,0.0,0.046321,0.045701,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.086593,0.034395,0.052795,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,0.0,0.0,0.000000,0.030434,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
496,0.0,0.0,0.000000,0.061080,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0
497,0.0,0.0,0.000000,0.131317,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.134173,0.0,0.000000,0.039532,0.000000,0.0,0.0,0.0
498,0.0,0.0,0.000000,0.068533,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.058352,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0


Most important tokens for the item at row index 1 (item id 2, "Active sport boxer briefs")

In [53]:
dict(df.sort_values(by=1, ascending=False, axis=1).iloc[1])


{'71': 0.24563626354575435,
 'sport': 0.2340858842003656,
 'inner': 0.22512671394623607,
 'boxer': 0.22512671394623607,
 'sewn': 0.21161741898376388,
 'briefs': 0.21161741898376388,
 'li': 0.2007490290748775,
 'prevent': 0.1736974687891044,
 'seamless': 0.16564057933354295,
 'br': 0.160599223259902,
 'support': 0.15741812343381353,
 'active': 0.15741812343381353,
 'thigh': 0.15595796958240477,
 '93': 0.15595796958240477,
 'inseam': 0.15549279707890054,
 'gusseted': 0.14163753984419739,
 'dries': 0.136026124997511,
 'poach': 0.1309578044505226,
 'deciding': 0.1309578044505226,
 'route': 0.1309578044505226,
 'skinning': 0.12281813177287718,
 'edges': 0.12281813177287718,
 'chafe': 0.11916907462759568,
 'fast': 0.11916907462759568,
 'efficiently': 0.1170429421001828,
 'flat': 0.11534238941626163,
 'mesh': 0.11281641154295,
 'requires': 0.11256335697311803,
 'boxers': 0.11256335697311803,
 'wicking': 0.1090788491336157,
 'fly': 0.10776234411456517,
 'bind': 0.10580870949188194,
 'size': 0.

# Cosine Similarity

Once we have a vector representation of each item, we can apply a similarity metric to find the items most similar to a given one and base the recommendations on them.
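
The cell that follows uses linear_kernel (a plain dot product) rather than cosine_similarity. The sketch below, on a made-up toy corpus, checks that the two agree for the L2-normalized rows that TfidfVectorizer produces with its default norm='l2'; the documents and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, used only for this check
toy_docs = [
    "sun hoody with hood",
    "sun shirt for hot days",
    "insulated winter jacket",
]
toy_matrix = TfidfVectorizer().fit_transform(toy_docs)

dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

# Rows have unit length, so the dot product already is the cosine
kernels_match = np.allclose(dot_products, cosines)
print(kernels_match)
```

Skipping the explicit normalization is what makes linear_kernel the cheaper choice here; with a vectorizer configured with norm=None the two functions would no longer agree.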

In [54]:
from sklearn.metrics.pairwise import linear_kernel 

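# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel is equal to the cosine similarity
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

# For each item, store its most similar items as (score, id) pairs
results = {}
for idx, row in ds.iterrows():
    # Indices of the 99 highest scores, in descending order of similarity
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], ds['id'][i]) for i in similar_indices]
    # Drop the first entry, which is the item itself
    results[row['id']] = similar_items[1:]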
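
The slice `[:-100:-1]` in the loop above is a compact way of taking the largest scores in descending order. A small sketch with made-up scores shows the same idiom for the top 3; the array values and names are assumptions for illustration.

```python
import numpy as np

# Hypothetical similarity scores of one item against five items
scores = np.array([0.1, 1.0, 0.4, 0.7, 0.2])
top_k = 3

# argsort() sorts ascending; slicing backwards from the end and stopping
# before position -(top_k + 1) yields the top_k indices, best first
top_indices = scores.argsort()[:-(top_k + 1):-1]
print(top_indices.tolist())
```

With `-100` as the stop value the slice returns 99 indices, and since the best match of an item is normally the item itself, dropping the first entry leaves up to 98 candidates per item.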

In [55]:
def item(item_id):
    # Return the item name, i.e. the text before the first ' - ' in its description
    return ds.loc[ds['id'] == item_id]['description'].tolist()[0].split(' - ')[0]

# Just reads the results out of the dictionary
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")
    print("-------")
    recs = results[item_id][:num]
    for rec in recs:
        print("Recommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

In [56]:
recommend(item_id=11, num=5)

Recommending 5 products similar to Baby sunshade top...
-------
Recommended: Sunshade hoody (score:0.3208009592771187)
Recommended: Baby baggies apron dress (score:0.2987535109056266)
Recommended: Sunshade shirt (score:0.24775920989207287)
Recommended: Hooded monk sweatshirt (score:0.24274238697476735)
Recommended: Baby baggies shorts (score:0.23739223868857867)
