Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python

In [2]:
# Import packages

In [3]:
import pandas as pd

In [4]:
# Import dataset

In [5]:
dataset = pd.read_csv("marketing_sample_for_amazon_com_ecommerce_20200101_20200131_10k_data.csv")
dataset["category"].head()

0    Sports & Outdoors | Outdoor Recreation | Skate...
1    Toys & Games | Learning & Education | Science ...
2            Toys & Games | Arts & Crafts | Craft Kits
3    Toys & Games | Hobbies | Models & Model Kits |...
4              Toys & Games | Puzzles | Jigsaw Puzzles
Name: category, dtype: object

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 28 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   uniq_id                10002 non-null  object 
 1   product_name           10002 non-null  object 
 2   brand_name             0 non-null      float64
 3   asin                   0 non-null      float64
 4   category               9172 non-null   object 
 5   upc_ean_code           34 non-null     object 
 6   list_price             0 non-null      float64
 7   selling_price          9895 non-null   object 
 8   quantity               0 non-null      float64
 9   model_number           8232 non-null   object 
 10  about_product          9729 non-null   object 
 11  product_specification  8370 non-null   object 
 12  technical_details      9212 non-null   object 
 13  shipping_weight        8864 non-null   object 
 14  product_dimensions     479 non-null    object 
 15  im

We can compute the similarity between products from the text of their category field using TfidfVectorizer.

In [7]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF vectorizer object. Remove English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

#Replace NaN with an empty string
dataset["category"] = dataset["category"].fillna("")

#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(dataset["category"])

#Output the shape of tfidf_matrix
tfidf_matrix.shape

(10002, 1133)

We can observe that the vocabulary contains 1133 distinct terms across the category strings of our 10002 products.
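
To make the shape easier to interpret, here is a minimal sketch on three made-up category strings written in the dataset's "A | B | C" style. The names toy_categories, toy_tfidf and toy_matrix are hypothetical and not part of the notebook. Each row is one product and each column is one distinct term, so a product whose category was missing (an empty string after fillna) becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical category strings, not taken from the real dataset
toy_categories = [
    "Toys & Games | Puzzles | Jigsaw Puzzles",
    "Toys & Games | Arts & Crafts | Craft Kits",
    "",  # stands in for a missing category after fillna("")
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_categories)

# Rows = documents (products), columns = distinct terms in the vocabulary
print(toy_matrix.shape)

# vocabulary_ maps each term to its column index; punctuation such as
# "&" and "|" is dropped by the default tokenizer
print(sorted(toy_tfidf.vocabulary_))

# The empty category has no stored (non-zero) entries
print(toy_matrix[2:3].nnz)
```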

In [8]:
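# List the first 20 feature names (terms), ordered by column index.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
tfidf.get_feature_names_out()[:20].tolist()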

['accent',
 'accents',
 'accessories',
 'accessory',
 'action',
 'activities',
 'activity',
 'additives',
 'adhesives',
 'adirondack',
 'adult',
 'advent',
 'agility',
 'aids',
 'air',
 'airbrush',
 'aircraft',
 'airplane',
 'airplanes',
 'albums']

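We will use this matrix to calculate pairwise similarity scores with cosine similarity. Because TfidfVectorizer L2-normalizes each row by default, the plain dot product computed by linear_kernel is equal to the cosine similarity.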

In [9]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [10]:
cosine_sim.shape

(10002, 10002)
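
The reason linear_kernel can stand in for cosine similarity here is that TfidfVectorizer uses norm='l2' by default, so every non-empty row already has unit length. The following sketch checks this on a few made-up category strings; toy_categories, toy_matrix, row_norms and same are hypothetical names used only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical category strings, not taken from the real dataset
toy_categories = [
    "Toys & Games | Puzzles | Jigsaw Puzzles",
    "Toys & Games | Arts & Crafts | Craft Kits",
    "Sports & Outdoors | Outdoor Recreation",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_categories)

# Euclidean length of each row; the default norm='l2' makes these 1.0
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)

# With unit-length rows, the dot product and the cosine similarity agree
same = np.allclose(
    linear_kernel(toy_matrix, toy_matrix),
    cosine_similarity(toy_matrix, toy_matrix),
)
print(same)
```

If the vectorizer were created with norm=None, the two functions would no longer agree and cosine_similarity would be the one to use.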

In [11]:
cosine_sim[1]

array([0.        , 1.        , 0.27876877, ..., 0.25244281, 0.27876877,
       0.        ])

Each row of this matrix holds one product's category similarity score with every product in the dataset, itself included (which is why cosine_sim[1][1] is 1).
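
A dense 10002 x 10002 array of 64-bit floats takes roughly 800 MB. If that is too much, one alternative is to compute a single row only when it is needed. The sketch below assumes a hypothetical helper named similarity_row and uses made-up category strings; it is not how the notebook itself proceeds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical category strings, not taken from the real dataset
toy_categories = [
    "Toys & Games | Puzzles | Jigsaw Puzzles",
    "Toys & Games | Puzzles | 3-D Puzzles",
    "Sports & Outdoors | Outdoor Recreation",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_categories)


def similarity_row(matrix, idx):
    """Hypothetical helper: similarity of row idx to every row of matrix."""
    # Slicing with idx:idx + 1 keeps the result two-dimensional (1 x n_terms)
    return linear_kernel(matrix[idx:idx + 1], matrix).ravel()


row = similarity_row(toy_matrix, 0)
print(row.shape)
print(row)
```

The trade-off is that each query pays for one sparse matrix product instead of a lookup in a precomputed table.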

We need to define a function that takes the product name as an input and outputs a list of the 10 most similar products. For this we need a reverse mapping of products and DataFrame indices. This means we need a mechanism to identify the index of a product in our DataFrame.

In [12]:
#Construct a reverse map of indices and product names
indices = pd.Series(dataset.index, index=dataset["product_name"])

In [13]:
indices[:20]

product_name
DB Longboards CoreFlex Crossbow 41" Bamboo Fiberglass Longboard Complete                                                                             0
Electronic Snap Circuits Mini Kits Classpack, FM Radio, Motion Detector, Music Box (Set of 5)                                                        1
3Doodler Create Flexy 3D Printing Filament Refill Bundle (X5 Pack, Over 1000'. of Extruded Plastics! - Innovate                                      2
Guillow Airplane Design Studio with Travel Case Building Kit                                                                                         3
Woodstock- Collage 500 pc Puzzle                                                                                                                     4
Terra by Battat – 4 Dinosaur Toys, Medium – Dinosaurs for Kids & Collectors, Scientifically Accurate & Designed by A Paleo-Artist; Age 3+ (4 Pc)     5
Rubie's Child's Pokemon Deluxe Pikachu Costume, X-Small                          
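
One caveat with this reverse map: if a product name appears more than once in the dataset (not checked here), looking it up returns a Series of positions rather than a single integer, and code that expects one index will misbehave. The sketch below illustrates this on a made-up frame and shows one possible workaround, keeping the first row per name. The names toy, toy_indices and first_indices are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one repeated product name
toy = pd.DataFrame({"product_name": ["Puzzle A", "Kite B", "Puzzle A"]})
toy_indices = pd.Series(toy.index, index=toy["product_name"])

# A unique name yields a single position
print(toy_indices["Kite B"])

# A repeated name yields a Series holding every matching position
print(toy_indices["Puzzle A"])

# One possible workaround: keep only the first occurrence of each name
first_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]
print(first_indices["Puzzle A"])
```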

In [14]:
# Function that takes a product name as input and returns the most similar products
def get_recommendations(product_name, cosine_sim=cosine_sim):
    # Get the index of the product that matches the product name
    idx = indices[product_name]

    # Get the pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the products based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep ranks 1 to 10, skipping the top-ranked entry, which is assumed to be
    # the product itself (with tied scores this assumption may not hold)
    sim_scores = sim_scores[1:11]

    # Get the product indices
    product_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar products
    return dataset[["product_name","selling_price"]].iloc[product_indices]

In [15]:
name = input("What would you like to search for today? ")
result = get_recommendations(name)
print(result)

What would you like to search for today? Yellies! Frizz; Voice-Activated Spider Pet; Ages 5 & Up
                                           product_name selling_price
19    Yellies! Frizz; Voice-Activated Spider Pet; Ag...        $17.85
750   Bright Starts Lots of Links Accessory and Baby...        $13.42
779           Basic Fun Fisher-Price Play Tape Recorder        $49.99
1091  Manhattan Toy Musical Shapes Maraca Wooden Tod...         $7.99
1650                      Playgo Musical Spinning Wheel        $19.95
1818           Disney Baby Dumbo On The Go Activity Toy        $15.00
3502  Singing Machine Kids Candy House Portable Blue...        $35.74
3518    Hape Toddler Beat Box Set, Wooden Music Toy Set        $32.00
3521                        Playtex Musical Monkey Blue         $8.99
3578                   Fisher-Price Musical Smart Phone         $4.99
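
Note that row 19 in the output above is the queried product itself. Many products share an identical category string, so several rows tie at the top score, and dropping only the first sorted entry does not guarantee that the query is the one removed. A minimal sketch of one way around this is to filter out the query index explicitly. The helper top_similar and the array toy_row are hypothetical and use made-up scores.

```python
import numpy as np


def top_similar(sim_row, idx, k=10):
    """Hypothetical helper: positions of the k highest scores, excluding idx."""
    # Stable sort on negated scores: descending order, ties kept in index order
    order = np.argsort(-np.asarray(sim_row), kind="stable")
    # Remove the query product itself wherever it landed in the ranking
    order = order[order != idx]
    return order[:k].tolist()


# Made-up similarity row for product 2; product 0 ties with it at 1.0
toy_row = np.array([1.0, 0.2, 1.0, 0.7, 0.0])
print(top_similar(toy_row, idx=2, k=3))
```

Inside get_recommendations, the returned positions could replace product_indices, for example product_indices = top_similar(cosine_sim[idx], idx).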


In [None]:
# So far this only works when the input is the exact name of a product
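
One way to relax the exact-match requirement is to map the user's input to the closest known product name before the lookup. The sketch below uses difflib from the Python standard library; the helper closest_name and the cutoff of 0.6 are assumptions chosen for illustration, and the matching is case-sensitive.

```python
import difflib


def closest_name(query, known_names, cutoff=0.6):
    """Hypothetical helper: best approximate match for query, or None."""
    matches = difflib.get_close_matches(query, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Two product names taken from the indices listing shown earlier
names = [
    "Woodstock- Collage 500 pc Puzzle",
    "Guillow Airplane Design Studio with Travel Case Building Kit",
]

# A slightly misspelled query still resolves to the intended product
print(closest_name("Woodstock Collage 500 pc Puzzle", names))

# A query with no reasonable match returns None
print(closest_name("zzzz", names))
```

In the notebook this could be wired in as get_recommendations(closest_name(name, list(indices.index))), after checking that the result is not None. Comparing against all 10002 names on every query is slow with difflib, so this is only a starting point.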