**Table of contents**<a id='toc0_'></a>    
- [Introduction](#toc1_1_1_)    
      - [Importing Python Libraries](#toc1_1_1_1_)    
      - [Loading Clean Dataset](#toc1_1_1_2_)    
    - [Content based Recommender System](#toc1_1_2_)    
    - [Enhanced content -based recommender that utilizes numerical features](#toc1_1_3_)    
    - [Conclusion](#toc1_1_4_)    
    - [Pickle the dataframe](#toc1_1_5_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

### <a id='toc1_1_1_'></a>[Introduction](#toc0_)

In this notebook, we build a content-based recommender system that identifies similar products using:

- Cosine similarity based on product titles and categories.
- Weighting based on numerical product features such as Bayesian ratings, number of ratings, product age, and price.

By leveraging the content based features of products, we can provide personalized recommendations that align with user's preferences.

#### <a id='toc1_1_1_1_'></a>[Importing Python Libraries](#toc0_)

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import regex as re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler,RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity
import string
import spacy

# Ignore all warnings to avoid cluttering the output
import warnings
warnings.filterwarnings("ignore")

#Useful settings
plt.rcParams['figure.figsize'] = (8.0, 6.0) # set matplotlib global settings eg. figsize
sns.set_style("white")                   #Setting grid style in seaborn

#### <a id='toc1_1_1_2_'></a>[Loading Clean Dataset](#toc0_)

In [2]:
# Here we load the pickled meta DataFrame which has undergone basic cleaning, EDA and preprocessing 
meta_df = pd.read_pickle('../data/meta_sample_preprocessed.pkl')
meta_df.head()

Unnamed: 0,product_title,average_rating,rating_number,product_price,store,parent_asin,all_subcategories,date_first_available,bayesian_rating,is_popular,...,subcategory1_rating_mean,subcategory1_rating_std,combined_category_rating_mean,combined_category_product_counts,combined_category_rating_std,log_combined_category_product_counts,subcategory1_rating_cv,combined_category_rating_cv,subcategory1_target_encoded,combined_category_target_encoded
0,Sterling Silver Hammered Ear Cuff,4.4,243,24.0,twisted designs jewelry,B0178HXZUY,Jewelry Earrings Ear Cuff,2015-10-27,4.407956,0,...,4.442259,0.155776,4.451759,2899,0.151968,7.972466,0.035067,0.034137,0.410765,0.433598
1,"Humorous Cat Wall Art - Decor for Home, Office...",4.5,108,12.95,yellowbird art & design,B07ZFJXDH8,Home & Kitchen Artwork Prints,2019-11-05,4.496569,0,...,4.495776,0.138751,4.48832,7350,0.140358,8.902592,0.030863,0.031272,0.553306,0.523537
2,Whiskey Glasses by Black Lantern – Floral Whis...,4.4,11,31.0,black lantern,B089LRPX7X,Home & Kitchen Dining Tableware Glassware Tumb...,2016-01-26,4.457846,0,...,4.495776,0.138751,4.518841,1839,0.128856,7.517521,0.030863,0.028515,0.553306,0.613921
3,LOVE Dog Paw Print Heart Sticker Decal Compati...,4.1,3,3.99,generic,B01MXKS1L5,Electronics Accessories Laptop Skins & Decals,2016-11-14,4.442769,0,...,4.471043,0.141498,4.508194,206,0.139531,5.332719,0.031648,0.030951,0.481562,0.592233
4,"Bachelorette Party Shirts, Soft Crew Neck and ...",4.6,64,15.59,patyz,B07Q5VXBCC,Clothing Shoes & Accessories Men Tops Tees T-S...,2019-03-30,4.566003,1,...,4.49006,0.131767,4.479571,1147,0.122598,7.045777,0.029346,0.027368,0.524289,0.471665


In [3]:
print(f'The shape of the meta dataframe is {meta_df.shape}.')

The shape of the meta dataframe is (38154, 36).


In [4]:
#display info
meta_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38154 entries, 0 to 38153
Data columns (total 36 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   product_title                         38154 non-null  object        
 1   average_rating                        38154 non-null  float64       
 2   rating_number                         38154 non-null  int64         
 3   product_price                         38154 non-null  float64       
 4   store                                 38154 non-null  object        
 5   parent_asin                           38154 non-null  object        
 6   all_subcategories                     38154 non-null  object        
 7   date_first_available                  38154 non-null  datetime64[ns]
 8   bayesian_rating                       38154 non-null  float64       
 9   is_popular                            38154 non-null  int64         
 10

In [5]:
#checking null values 
meta_df.isna().sum().loc[lambda x: x> 0]

Series([], dtype: int64)

### <a id='toc1_1_2_'></a>[Content based Recommender System](#toc0_)

In [6]:
#subsetting dataframe to only filter products with atleast 10 reviews
filtered_df = meta_df[meta_df['rating_number'] > 10].copy()

#Extract necessary columns to build the recommender
cols_to_keep = ['product_title','rating_number','average_rating','product_age_days',
                'bayesian_rating','product_price','parent_asin','title_category','all_subcategories']
filtered_df = filtered_df[cols_to_keep].reset_index(drop=True)

filtered_df.head()

Unnamed: 0,product_title,rating_number,average_rating,product_age_days,bayesian_rating,product_price,parent_asin,title_category,all_subcategories
0,Sterling Silver Hammered Ear Cuff,243,4.4,2987,4.407956,24.0,B0178HXZUY,Sterling Silver Hammered Ear Cuff Jewelry Earr...,Jewelry Earrings Ear Cuff
1,"Humorous Cat Wall Art - Decor for Home, Office...",108,4.5,1517,4.496569,12.95,B07ZFJXDH8,"Humorous Cat Wall Art - Decor for Home, Office...",Home & Kitchen Artwork Prints
2,Whiskey Glasses by Black Lantern – Floral Whis...,11,4.4,2896,4.457846,31.0,B089LRPX7X,Whiskey Glasses by Black Lantern – Floral Whis...,Home & Kitchen Dining Tableware Glassware Tumb...
3,"Bachelorette Party Shirts, Soft Crew Neck and ...",64,4.6,1737,4.566003,15.59,B07Q5VXBCC,"Bachelorette Party Shirts, Soft Crew Neck and ...",Clothing Shoes & Accessories Men Tops Tees T-S...
4,Rainbow Titanium Crystal Quartz Point Antique ...,36,4.3,1731,4.376456,25.0,B07QBBGDM3,Rainbow Titanium Crystal Quartz Point Antique ...,Jewelry Necklaces Pendant


We define a custom tokenizer function that processes text by:
- Removing punctuation and English stop words  
- Converting text to lowercase  
- Tokenizing the sentence into words  
- Applying lemmatization  

This preprocessing step helps in preparing text data for natural language processing tasks like TF-IDF.

In [7]:
import unicodedata

# Load the large English pipeline
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])  # Disabling parser & NER for efficiency

def normalize_text(text):
    """
    Normalizes text by converting special Unicode characters into standard ASCII.
    """
    normalized_text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
    return normalized_text

def custom_tokenizer(row):
    """
    Tokenizes and lemmatizes text, normalizing Unicode characters
    and removing stopwords.

    Args:
        row (str): Text based product features.

    Returns:
        str: Processed text with lemmatized words.
    """
    # Normalize Unicode styles
    normalized_text = normalize_text(row)

    # Process text with SpaCy
    parsed_title = nlp(normalized_text)

    # Extract only relevant tokens
    tok_lemmas = [
        token.lemma_.lower()    # Convert lemma to lowercase
        for token in parsed_title 
        if token.is_alpha       # Ensure token is alphabetic
        and not token.is_stop   # Remove stopwords
        and len(token) > 3      # Ignore very short words
    ]

    # Remove duplicates while preserving order
    unique_tokens = list(dict.fromkeys(tok_lemmas))

    return unique_tokens 

Now, we create a TF-IDF matrix using the custom tokenizer function to transform the text data into a numerical format, enabling the computation of similarity scores between products.

The key text features we consider are the product title and combined category, as they provide meaningful information about the product and enhance similarity calculations. By computing cosine similarity between these features, we can identify products with similar titles and belonging to the same category. This similarity-based approach provides relevant and personalized product recommendations to users based on their preferences and the characteristics of products they have enjoyed in the past.

 We set the max_df parameter to 0.7 to filter out extremely common tokens that appear in a large fraction of documents, reducing noise and improving the relevance of similarity scores.

In [8]:
# Creating a TF-IDF vectorizer for item descriptions

tfidf_vectorizer =  TfidfVectorizer(
tokenizer = custom_tokenizer,
lowercase=True,
min_df=10,
max_df=0.7, #ignores terms that appear in 70% of products 
stop_words='english',
)

# Applying TF-IDF vectorization to item features

tfidf_matrix_content = tfidf_vectorizer.fit_transform(filtered_df['title_category'])

In [9]:
# Print the shape of the TF-IDF matrix
print("TF-IDF Matrix Shape:", tfidf_matrix_content.shape)

TF-IDF Matrix Shape: (17898, 2374)


In [10]:
# Create a DataFrame from the TF-IDF transformed data
tokens_df  = pd.DataFrame(tfidf_matrix_content.toarray(), columns= tfidf_vectorizer.get_feature_names_out())

display(tokens_df)

Unnamed: 0,abstract,accent,accessorie,accessory,acorn,acrylic,activity,actual,additive,address,...,zebra,zero,zinc,zipper,zippered,zircon,zirconia,zlkapt,zodiac,zombie
0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.0,0.0,0.092271,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17893,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17894,0.000000,0.0,0.0,0.146686,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17895,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
17896,0.000000,0.0,0.0,0.178927,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
print("Top 10 tokens:")                             
tokens_df.sum(axis=0).sort_values(ascending=False).head(10)

Top 10 tokens:


home         839.734502
kitchen      838.442664
jewelry      707.364985
decor        696.808504
gift         664.123771
artwork      588.027797
accessory    567.249812
print        548.953766
wall         508.884961
silver       403.118580
dtype: float64

In [12]:
# Calculate the cosine similarity matrix 
cosine_similarity_content = cosine_similarity(tfidf_matrix_content,dense_output=False)

print("Shape of cosine_similarity_matrix:", cosine_similarity_content.shape)

Shape of cosine_similarity_matrix: (17898, 17898)


In [13]:
#Function to generate recommendations using pre-computed cosine similarity matrix

def content_based_recommendations(df, item_title, top_n=10,rating_threshold=20):

    """
    Generate content-based product recommendations using pre-computed cosine similarities.

    Parameters:
    df (pd.DataFrame): DataFrame containing product information.
    item_title (str): The title of the product for which recommendations are needed.
    top_n (int, optional): Number of top similar products to return. Default is 10.
    rating_threshold (int, optional): Minimum number of ratings required for a product to be considered. Default is 10.

    Returns:
    pd.DataFrame: A DataFrame containing the top N most similar products, sorted by similarity score, 
                  and filtered by the rating threshold.
    """

    # Extract the parent_asin of the item 
    product_id = df.loc[df['product_title'] == item_title,'parent_asin'].values[0]
    item_index = df[df['parent_asin'] == product_id].index[0]

    # Create a dataframe with the movie titles
    sim_df = pd.DataFrame(
    {"product_title": df["product_title"],
    "similarity": np.array(cosine_similarity_content[item_index, :].todense()).squeeze().round(2),
    "bayesian_rating":df['bayesian_rating'].round(2),
    "rating_number": df["rating_number"],
    "average_rating": df["average_rating"],
    "all_subcategories":df['all_subcategories']}
)
    
    # Sorting similar items by similarity score in descending order
    similar_items = sim_df.sort_values(by="similarity",ascending=False)

    # Exclude the item itself
    similar_items = similar_items.loc[similar_items['product_title'] != item_title]
    
    # Filter items that meet the rating threshold
    qualified_items = similar_items[similar_items["rating_number"] > rating_threshold]

    # Getting the top N most similar items (excluding the item itself)
    top_similar_items = qualified_items.head(top_n)

    return top_similar_items

In [14]:
filtered_df[filtered_df['product_title'].str.contains('Personalzed')]

Unnamed: 0,product_title,rating_number,average_rating,product_age_days,bayesian_rating,product_price,parent_asin,title_category,all_subcategories
12497,Personalzed Baby Shark Birthday Outfit Tutu Set,25,4.4,1983,4.441967,43.99,B07KRX17BY,Personalzed Baby Shark Birthday Outfit Tutu Se...,Clothing Shoes & Accessories Girls Sets


In [15]:
# content based recommendation for a specific item
item_name = filtered_df.iloc[12497]['product_title']

print(f'Recommendations for {item_name}:')
#apply the function
content_based_rec = content_based_recommendations(filtered_df, item_name, top_n=8)

#show recommendations
content_based_rec

Recommendations for Personalzed Baby Shark Birthday Outfit Tutu Set:


Unnamed: 0,product_title,similarity,bayesian_rating,rating_number,average_rating,all_subcategories
950,Mouse Birthday Number Party Dress 2nd Birthday...,0.64,4.38,37,4.3,Clothing Shoes & Accessories Girls Sets
13589,Raiders Baby Outfit - Tutus and Touchdowns - R...,0.63,4.68,24,4.9,Clothing Shoes & Accessories Baby Girls Sets
12492,Mouse Birthday Tutu Outfit Set Dress Shirt Fir...,0.61,4.43,48,4.4,Clothing Shoes & Accessories Baby Girls Dresses
4092,Baby Girl Football Outfit - Tutus and Touchdow...,0.6,4.67,148,4.7,Clothing Shoes & Accessories Baby Girls Sets
1407,Groovy one birthday shirt 1st birthday groovy ...,0.58,4.54,24,4.6,Clothing Shoes & Accessories Baby Girls Sets
3750,Mermaid Birthday Outfit Under the Sea First Bi...,0.57,4.29,27,4.1,Clothing Shoes & Accessories Girls Sets
15168,Birthday Outfit Baby Outfit First Birthday Out...,0.56,4.59,238,4.6,Clothing Shoes & Accessories Girls Sets
12550,Peppa Girl three birthday Outfit Peppa Baby Gi...,0.56,4.49,36,4.5,Clothing Shoes & Accessories Baby Boys Sets


### <a id='toc1_1_3_'></a>[Enhanced content -based recommender that utilizes numerical features](#toc0_)

In [16]:
def enhanced_content_based_recommendations(df, item_title, top_n=8, text_weight=0.7, 
                                          numeric_weights={'bayesian_rating': 0.7, 
                                                          'product_price': -0.3},
                                                          rating_threshold = 20,
                                                          new_product_threshold=1500):
    """
    Enhanced content-based recommendation system that combines text similarity with weighted numeric features, 
    and ensures a mix of atleast one new product with popular ones.
    
    Parameters:
    - df: DataFrame containing product information
    - item_title: Title of the item to find recommendations for
    - top_n: Number of recommendations to return
    - text_weight: Weight for text similarity (0-1)
    - numeric_weights: Dictionary of weights for numeric features (must sum to 1)
    - new_product_threshold: Age in days below which a product is considered "new"
    
    Returns:
    - DataFrame with top_n recommended items and their details
    """

    # Create TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer(
        lowercase=True,
        min_df=10,
        max_df=0.7,
        stop_words='english'
    )

    # Extract the parent_asin of the item 
    product_id = df.loc[df['product_title'] == item_title,'parent_asin'].values[0]
    item_index = df[df['parent_asin'] == product_id].index[0]

    # Apply TF-IDF vectorization to text features
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['title_category'])
    
    # Calculate cosine similarity for text features
    text_sim = cosine_similarity(tfidf_matrix)
    
    df['text_similarity'] = text_sim[item_index]

    #  Normalize numerical features (0 to 1)
    df_normalized = df.copy()
    for col in numeric_weights:
        if numeric_weights[col] > 0:  # Higher is better
            df_normalized[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())
        else:  # Lower is better (e.g., price)
            df_normalized[col] = 1 - (df[col] - df[col].min()) / (df[col].max() - df[col].min())

    # Calculate numerical scores
    df['numeric_score'] = (
    df_normalized['bayesian_rating'] * 0.7 -  # Recommend popular products
    df_normalized['product_price'] * 0.3  # Recommend less pricey products
)
    
    # Combine scores
    df['combined_score'] = (text_weight * df['text_similarity']) + ((1 - text_weight) * df['numeric_score'])
    
    # Save results into dataframe
    sim_df = pd.DataFrame({
        'product_title': df['product_title'],
        'similarity_score': df['combined_score'].round(2),
        'bayesian_rating': df['bayesian_rating'].round(2),
        'rating_number': df['rating_number'],
        'product_age_days': df['product_age_days'],
    })

    # Sorting similar items by similarity score and rating number in descending order
    similar_items = sim_df.sort_values(by=["similarity_score","rating_number"],ascending=False)

    # Exclude the item itself
    similar_items = similar_items.loc[similar_items['product_title'] != item_title]

    # Filter items that meet the rating threshold
    qualified_items = similar_items[similar_items["rating_number"] > rating_threshold]

    # Identify new products (below new_product_threshold)
    new_items = qualified_items[qualified_items['product_age_days'] <= new_product_threshold]

    # Ensure at least 1 new product is included if possible
    if len(new_items) >= 1:
        new_items = new_items.head(1)  # Pick the top new item
    else:
        new_items = pd.DataFrame()  # No new products, continue with popular items

    # Select popular items to fill the remaining spots
    popular_items = qualified_items.head(top_n - len(new_items))

    # Combine both new items and popular items
    top_similar_items = pd.concat([new_items, popular_items]).head(top_n)

    return top_similar_items

In [17]:
# content based recommendation for a specific item
item_name = filtered_df.iloc[12497]['product_title']

print(f'Recommendations for {item_name}:')

#apply function to `title_category` column
enhanced_content_based_rec = enhanced_content_based_recommendations(filtered_df, item_name, top_n=8)

#print recommendations
enhanced_content_based_rec

Recommendations for Personalzed Baby Shark Birthday Outfit Tutu Set:


Unnamed: 0,product_title,similarity_score,bayesian_rating,rating_number,product_age_days
14420,Peppa Girl four birthday Outfit Peppa Baby Gir...,0.45,4.72,33,832
13589,Raiders Baby Outfit - Tutus and Touchdowns - R...,0.58,4.68,24,1614
15168,Birthday Outfit Baby Outfit First Birthday Out...,0.53,4.59,238,1812
16211,First Birthday Outfit Girl Shirt Rainbow Tutu ...,0.52,4.57,65,1812
12492,Mouse Birthday Tutu Outfit Set Dress Shirt Fir...,0.48,4.43,48,2390
11154,Girls Birthday Shark Personalized Baby Shark L...,0.48,4.64,26,1652
950,Mouse Birthday Number Party Dress 2nd Birthday...,0.47,4.38,37,1683
14420,Peppa Girl four birthday Outfit Peppa Baby Gir...,0.45,4.72,33,832


### <a id='toc1_1_5_'></a>[Pickle the dataframe](#toc0_)

In [18]:
### Pickle the dataframe 
filtered_df.to_pickle('../Streamlit/data/content_rec_data.pkl')

### <a id='toc1_1_4_'></a>[Conclusion](#toc0_)

The two recommenders follow a similar method for generating recommendations but differ in how they refine their predictions. 

- The content-based recommender relies entirely on pre-computed cosine similarity scores derived from product text features. 

- In contrast, the enhanced recommender improves the quality of recommendations by integrating both text similarity and weighted numerical factors such as Bayesian ratings and product price. These additional factors enable the enhanced method to prioritize popular and relevant products, ensuring recommendations are not only similar in content but also well-rated and widely reviewed. 

- Additionally, to boost the visibility of newer products, the enhanced system includes at least one product considered “new” (based on its age being below a specified threshold). This strategy strikes a balance between recommending well-reviewed, popular products and giving newer items a chance to gain traction in the market.