This script implements a product recommendation system for e-commerce users.

# Data Prep

In [3]:
import pandas as pd
import os
from dotenv import load_dotenv

# User based collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

# Content based filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk

In [4]:
# Load database settings from the .env file in the parent directory
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

load_dotenv(f'{parent_dir}/.env')

postgres_password = os.getenv('POSTGRES_PASSWORD')
postgres_port_no = os.getenv('POSTGRES_PORT_NO')
host = os.getenv('POSTGRES_HOST')
database = os.getenv('POSTGRES_DB')
user = os.getenv('POSTGRES_USER')

In [5]:
online_sales = pd.read_csv('../data/online_sales_edited.csv')
products = pd.read_csv('../data/products.csv')

In [6]:
online_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53701 entries, 0 to 53700
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   user_id           53701 non-null  int64  
 1   transaction_id    53701 non-null  int64  
 2   date              53701 non-null  object 
 3   product_id        53701 non-null  object 
 4   Quantity          53701 non-null  int64  
 5   Delivery_Charges  53701 non-null  float64
 6   Coupon_Status     53701 non-null  object 
 7   Coupon_Code       53701 non-null  object 
 8   Discount_pct      53701 non-null  float64
dtypes: float64(2), int64(3), object(4)
memory usage: 3.7+ MB


In [7]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1351 entries, 0 to 1350
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   product_id           1351 non-null   object 
 1   product_name         1351 non-null   object 
 2   about_product        1351 non-null   object 
 3   category             1351 non-null   object 
 4   actual_price         1351 non-null   float64
 5   discounted_price     1351 non-null   float64
 6   discount_percentage  1351 non-null   float64
dtypes: float64(3), object(4)
memory usage: 74.0+ KB


In [8]:
# left join online_sales and products

df = pd.merge(online_sales, products, on='product_id', how='left')
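
A left join silently keeps sales rows whose product_id has no match in the product table. As a minimal sketch of how that could be checked, assuming two small made-up tables (`toy_orders` and `toy_catalog` are hypothetical names, not part of this notebook's data):

```python
import pandas as pd

# Hypothetical toy tables, for illustration only
toy_orders = pd.DataFrame({"product_id": ["A", "B", "Z"], "Quantity": [1, 2, 3]})
toy_catalog = pd.DataFrame({"product_id": ["A", "B"],
                            "product_name": ["Cable", "Heater"]})

# indicator=True adds a '_merge' column; validate raises an error if
# product_id is not unique in the right-hand (catalogue) table
checked = toy_orders.merge(toy_catalog, on="product_id", how="left",
                           indicator=True, validate="many_to_one")

# Rows marked 'left_only' are sales with no matching product record
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["product_id"].tolist())  # ['Z']
```

In this notebook the merged frame reports no null values in the product columns, which suggests every sale found a matching product.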

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53701 entries, 0 to 53700
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53701 non-null  int64  
 1   transaction_id       53701 non-null  int64  
 2   date                 53701 non-null  object 
 3   product_id           53701 non-null  object 
 4   Quantity             53701 non-null  int64  
 5   Delivery_Charges     53701 non-null  float64
 6   Coupon_Status        53701 non-null  object 
 7   Coupon_Code          53701 non-null  object 
 8   Discount_pct         53701 non-null  float64
 9   product_name         53701 non-null  object 
 10  about_product        53701 non-null  object 
 11  category             53701 non-null  object 
 12  actual_price         53701 non-null  float64
 13  discounted_price     53701 non-null  float64
 14  discount_percentage  53701 non-null  float64
dtypes: float64(5), int64(3), object(7)
m

In [10]:
df.head()

Unnamed: 0,user_id,transaction_id,date,product_id,Quantity,Delivery_Charges,Coupon_Status,Coupon_Code,Discount_pct,product_name,about_product,category,actual_price,discounted_price,discount_percentage
0,17850,16679,2019-01-01,B09DL9978Y,1,6.5,Used,ELEC10,0.1,Hindware Atlantic Compacto 3 Litre Instant wat...,Stainless Steel Tank|Copper Heating element|IS...,"Home&Kitchen|Heating,Cooling&AirQuality|WaterH...",55.08,28.79,0.48
1,17850,16680,2019-01-01,B09DL9978Y,1,6.5,Used,ELEC10,0.1,Hindware Atlantic Compacto 3 Litre Instant wat...,Stainless Steel Tank|Copper Heating element|IS...,"Home&Kitchen|Heating,Cooling&AirQuality|WaterH...",55.08,28.79,0.48
2,17850,16681,2019-01-01,B07GXHC691,1,6.5,Used,OFF10,0.1,STRIFF PS2_01 Multi Angle Mobile/Tablet Tablet...,"[PORTABLE SIZE]- 98mm*96mm*19mm, STRIFF desk p...",Electronics|Mobiles&Accessories|MobileAccessor...,5.99,1.19,0.8
3,17850,16682,2019-01-01,B08NCKT9FG,5,6.5,Not Used,SALE10,0.1,Boat A 350 Type C Cable 1.5m(Jet Black),"2 years warranty from the date of purchase, yo...",Computers&Accessories|Accessories&Peripherals|...,9.58,3.59,0.63
4,17850,16682,2019-01-01,B08H21B6V7,1,6.5,Used,AIO10,0.1,Nokia 150 (2020) (Cyan),MicroSD card slot expandable up to 32. Network...,Electronics|Mobiles&Accessories|Smartphones&Ba...,35.99,31.19,0.13


In [11]:
# Keep only rows with a strictly positive 'Quantity'
df = df[df['Quantity'] > 0]

# User-based collaborative filtering

User-based collaborative filtering (UBCF) personalizes recommendations by comparing users' purchase histories: each user is represented as a vector of purchased quantities, and users are compared by cosine similarity. It is intuitive, easy to implement, and needs no product metadata, and it can surface products a user would not have searched for. Its main weaknesses are scalability (the similarity matrix grows quadratically with the number of users), data sparsity (most users buy only a few products, which weakens the similarity measure), and the cold-start problem (new users and new products lack interaction data). It can also suffer from popularity bias, which makes recommendations less diverse. Balancing these strengths and limitations is key to using UBCF effectively in an e-commerce recommendation system.
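
The similarity step can be sketched on a toy example. The user IDs, product IDs, and quantities below are made up for illustration and are not taken from the dataset; the column names simply mirror the ones used in this notebook.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy transactions, for illustration only
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 3],
    "product_id": ["A", "B", "A", "B", "C"],
    "Quantity":   [2, 1, 4, 2, 5],
})

# Rows are users, columns are products, cells are total quantity bought
toy_matrix = toy.pivot_table(index="user_id", columns="product_id",
                             values="Quantity", aggfunc="sum", fill_value=0)

# Pairwise cosine similarity between the user rows
toy_sim = pd.DataFrame(cosine_similarity(toy_matrix),
                       index=toy_matrix.index, columns=toy_matrix.index)

# Users 1 and 2 bought the same products in the same proportions, so their
# similarity is 1; user 3 shares no products with them, so it is 0.
print(toy_sim.round(2))
```

Note that cosine similarity ignores the overall purchase volume: user 2 bought twice as much as user 1, yet the two are treated as identical in taste.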

In [12]:
def user_based_recommendation(user_id, df, top_n=5):
    """
    Recommend top N products for a given user based on user similarity.

    Parameters:
    - user_id (int): The ID of the user for whom to generate recommendations.
    - df (DataFrame): The preprocessed DataFrame containing user transactions.
    - top_n (int): Number of top recommendations to return.

    Returns:
    - recommended_products (list): List of recommended product IDs.
    - recommended_product_names (list): Names of the recommended products.
    """

    # Check if the user_id exists in the DataFrame
    if user_id not in df['user_id'].unique():
        print(f"User ID {user_id} not found in the dataset.")
        return [], []

    # 1. Create the User-Item Matrix
    user_item_matrix = df.pivot_table(index='user_id',
                                      columns='product_id',
                                      values='Quantity',
                                      aggfunc='sum',
                                      fill_value=0)

    # 2. Compute User Similarity Matrix using Cosine Similarity
    # Quantities are non-negative, so cosine similarity lies between 0 and 1
    similarity_matrix = cosine_similarity(user_item_matrix)
    
    # Convert the similarity matrix to a DataFrame for easier handling
    similarity_df = pd.DataFrame(similarity_matrix, 
                                 index=user_item_matrix.index, 
                                 columns=user_item_matrix.index)

    # 3. Find Similar Users
    # Get similarity scores for the target user
    user_similarities = similarity_df[user_id].sort_values(ascending=False)
    
    # Exclude the target user from the similarity scores
    user_similarities = user_similarities.drop(labels=[user_id])

    # Select top similar users (you can adjust the number, e.g., top 10)
    top_similar_users = user_similarities.head(10).index.tolist()

    if not top_similar_users:
        print(f"No similar users found for User ID {user_id}.")
        return [], []

    # 4. Aggregate Products from Similar Users
    # Select the purchase data of similar users
    similar_users_data = df[df['user_id'].isin(top_similar_users)]

    # Aggregate the quantities for each product from similar users
    product_scores = similar_users_data.groupby('product_id')['Quantity'].sum()

    # 5. Exclude Products Already Purchased by the Target User
    # Get the list of products already purchased by the target user
    user_purchased_products = df[df['user_id'] == user_id]['product_id'].unique()

    # Remove these products from the recommendation candidates
    product_scores = product_scores.drop(labels=user_purchased_products, errors='ignore')

    if product_scores.empty:
        print(f"No new products to recommend for User ID {user_id}.")
        return [], []

    # 6. Sort the products based on the aggregated scores in descending order
    sorted_products = product_scores.sort_values(ascending=False)

    # 7. Select the Top N Products
    recommended_products = sorted_products.head(top_n).index.tolist()

    # Look up product names in the global `products` DataFrame
    # (names come back in catalogue order, not in ranking order)
    recommended_product_names = products[products['product_id'].isin(recommended_products)]['product_name'].tolist()

    return recommended_products, recommended_product_names

In [13]:
user_based_recommendation(12583, df, top_n=5)

(['B078KRFWQB', 'B07LFWP97N', 'B07VX71FZP', 'B078W65FJ7', 'B0B61HYR92'],
 ['Lapster usb 2.0 mantra cable, mantra mfs 100 data cable (black)',
  'Gizga Essentials Laptop Bag Sleeve Case Cover Pouch with Handle for 14.1 Inch Laptop for Men & Women, Padded Laptop Compartment, Premium Zipper Closure, Water Repellent Nylon Fabric, Grey',
  'boAt BassHeads 900 On-Ear Wired Headphones with Mic (White)',
  'Amazon Brand - Solimo 2000/1000 Watts Room Heater with Adjustable Thermostat (ISI certified, White colour, Ideal for small to medium room/area)',
  'Havells Cista Room Heater, White, 2000 Watts'])

# Content-based recommendation

The function aims to generate personalized product recommendations for a user by analyzing the content (attributes) of products they have previously purchased. It uses a content-based filtering approach, which relies on the similarity between product attributes to suggest new products that are similar to those the user has already bought.

To encourage upselling, the higher-priced items among the most similar candidates are recommended to the user.
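
As a minimal sketch of the similarity computation, assuming three made-up product descriptions (not taken from the product table), TF-IDF vectors can be compared as follows. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` is the same as cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical product descriptions, for illustration only
toy_docs = [
    "usb type c charging cable",
    "usb type c fast charging cable braided",
    "room heater with adjustable thermostat",
]

# Each row of the TF-IDF matrix has unit length (norm='l2' is the default)
toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# Dot products between unit-length rows are cosine similarities
toy_content_sim = linear_kernel(toy_tfidf, toy_tfidf)
print(np.allclose(toy_content_sim, cosine_similarity(toy_tfidf)))  # True

# The two cable descriptions are closer to each other than to the heater
print(toy_content_sim.round(2))
```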

In [14]:
# Define a function to preprocess text
def preprocess_text(text):
    stop_words = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    
    # Tokenize, remove stop words, and apply stemming
    tokens = text.split()
    tokens = [stemmer.stem(word) for word in tokens if word.lower() not in stop_words]
    
    return ' '.join(tokens)

def content_based_recommendation(user_id, transactions_df, products_df, top_n=5):
    """
    Recommend top N products for a given user based on content similarity.
    
    Parameters:
    - user_id (int): The ID of the user for whom to generate recommendations.
    - transactions_df (DataFrame): The DataFrame containing user transactions.
    - products_df (DataFrame): The DataFrame containing product details.
    - top_n (int): Number of top recommendations to return.
    
    Returns:
    - recommendations (list): List of recommended product IDs.
    """
    
    # Check if the user_id exists in the transactions DataFrame
    if user_id not in transactions_df['user_id'].unique():
        print(f"User ID {user_id} not found in the dataset.")
        return []

    # 1. Create a new feature by combining relevant product attributes
    #    (work on a copy so the caller's products DataFrame is not modified)
    products_df = products_df.copy()
    products_df['combined_features'] = products_df['product_name'] + ' ' + products_df['about_product'] + ' ' + products_df['category']
    
    # 2. Apply text preprocessing to the combined features
    products_df['combined_features'] = products_df['combined_features'].apply(preprocess_text)
    
    # 3. Remove duplicate products to ensure each product is unique
    products_df = products_df[['product_id', 'combined_features','actual_price']].drop_duplicates().reset_index(drop=True)
    
    # 4. Initialize the TF-IDF Vectorizer
    tfidf = TfidfVectorizer(stop_words='english')
    
    # 5. Fit and transform the combined features
    tfidf_matrix = tfidf.fit_transform(products_df['combined_features'])
    
    # 6. Compute the cosine similarity matrix (TF-IDF rows are L2-normalized,
    #    so the dot product from linear_kernel equals cosine similarity)
    cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
    
    # 7. Create a reverse mapping of product indices and IDs
    indices = pd.Series(products_df.index, index=products_df['product_id']).drop_duplicates()
    
    # 8. Get the list of products purchased by the user from the transactions DataFrame
    user_purchases = transactions_df[transactions_df['user_id'] == user_id]['product_id'].unique()
    
    # 9. Initialize a series to hold similarity scores
    similarity_scores = pd.Series(dtype=float)
    
    # 10. Iterate over each purchased product and accumulate similarity scores
    for product_id in user_purchases:
        if product_id not in indices:
            continue  # Skip if the product_id is not in the dataset
        idx = indices[product_id]
        sim_scores = pd.Series(cosine_sim[idx]).sort_values(ascending=False)
        sim_scores = sim_scores.iloc[1:]  # Exclude the product itself
        similarity_scores = similarity_scores.add(sim_scores, fill_value=0)
    
    if similarity_scores.empty:
        print(f"No similar products found for User ID {user_id}.")
        return []
    
    # 11. Remove products already purchased by the user
    similarity_scores = similarity_scores.drop(labels=[indices[pid] for pid in user_purchases if pid in indices], errors='ignore')
    
    if similarity_scores.empty:
        print(f"No new products to recommend for User ID {user_id}.")
        return []
    
    # 12. Sort the products based on similarity scores and price (for upselling)
    similarity_scores = similarity_scores.sort_values(ascending=False)
    top_indices = similarity_scores.head(top_n * 2).index.tolist()  # Get more candidates
    
    # 13. Map indices back to product IDs and filter by price
    recommended_products = products_df.iloc[top_indices]
    recommended_products = recommended_products.sort_values(by='actual_price', ascending=False)  # Prioritize higher-priced items
    recommended_product_ids = recommended_products.head(top_n)['product_id'].tolist()
    
    return recommended_product_ids

In [15]:
content_based_recommendation(12583, df, products, top_n=5)

['B0BC8BQ432', 'B095JQVC7N', 'B0B997FBZT', 'B0B1YZ9CB8', 'B0B1YZX72F']

# Cold-start recommendation

The cold_start_recommendation function recommends products to new users, who have no purchase history, by falling back on overall product popularity (total quantity sold). This ensures that new users still receive recommendations even without prior interaction data.
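
The popularity ranking itself is a short aggregation. A minimal sketch on made-up sales data (the `toy_sales` table is hypothetical):

```python
import pandas as pd

# Hypothetical toy sales, for illustration only
toy_sales = pd.DataFrame({
    "user_id":    [1, 2, 3, 3, 4],
    "product_id": ["A", "A", "B", "C", "A"],
    "Quantity":   [1, 2, 5, 1, 1],
})

# Total quantity sold per product, most popular first, top 2 kept
toy_popular = (toy_sales.groupby("product_id")["Quantity"]
               .sum()
               .sort_values(ascending=False)
               .head(2)
               .index.tolist())
print(toy_popular)  # ['B', 'A']
```

Product B ranks first even though only one user bought it, because the ranking uses total quantity. Counting distinct buyers instead would be an alternative design choice.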

In [16]:
def popularity_based_recommendation(transactions_df, products_df, top_n=5, category=None):
    """
    Recommend top N popular products based on overall sales or within a specific category.
    
    Parameters:
    - transactions_df (DataFrame): DataFrame containing user transactions with columns ['user_id', 'product_id', 'Quantity', ...].
    - products_df (DataFrame): DataFrame containing product details with columns ['product_id', 'product_name', 'about_product', 'category', ...].
    - top_n (int): Number of top recommendations to return.
    - category (str, optional): If specified, recommend popular products within this category.
    
    Returns:
    - recommended_product_ids (list): List of recommended product IDs.
    """
    
    # Merge transactions with products to get category information
    merged_df = transactions_df.merge(products_df, on='product_id', how='left')
    
    # If category is specified, filter by category
    if category:
        merged_df = merged_df[merged_df['category'] == category]
    
    # Aggregate the total quantity sold for each product
    product_sales = merged_df.groupby('product_id')['Quantity'].sum().sort_values(ascending=False)
    
    # Get the top N product IDs
    recommended_product_ids = product_sales.head(top_n).index.tolist()
    
    return recommended_product_ids

def cold_start_recommendation(user_id, transactions_df, products_df, users_df=None, top_n=5):
    """
    Recommend top N products for a new user based on overall product popularity.
    
    Parameters:
    - user_id (int): The ID of the user for whom to generate recommendations.
    - transactions_df (DataFrame): DataFrame containing user transactions with columns ['user_id', 'product_id', 'Quantity', ...].
    - products_df (DataFrame): DataFrame containing product details with columns ['product_id', 'product_name', 'about_product', 'category', ...].
    - users_df (DataFrame, optional): Reserved for demographic-based recommendations; currently unused.
    - top_n (int): Number of top recommendations to return.
    
    Returns:
    - recommended_product_ids (list): List of recommended product IDs.
    """
    
    # Check if the user exists in the transactions (i.e., is not a new user)
    if user_id in transactions_df['user_id'].unique():
        print(f"User ID {user_id} exists in the dataset. Use user-based or content-based recommendations instead.")
        return []
    
    # Initialize a dictionary to hold recommendation scores
    rec_scores = {}
    
    popular_recs = popularity_based_recommendation(transactions_df, products_df, top_n=top_n*2)
    for pid in popular_recs:
        rec_scores[pid] = rec_scores.get(pid, 0) + 1  # Weight can be adjusted as needed

    # Sort the products based on accumulated scores in descending order
    sorted_recs = sorted(rec_scores.items(), key=lambda x: x[1], reverse=True)
    
    # Extract product IDs from the sorted list
    recommended_product_ids = [pid for pid, score in sorted_recs]
    
    # Remove duplicates while preserving order
    recommended_product_ids = list(dict.fromkeys(recommended_product_ids))
    
    # Limit to top_n recommendations
    recommended_product_ids = recommended_product_ids[:top_n]
    
    # If not enough recommendations, fallback to popularity-based
    if len(recommended_product_ids) < top_n:
        additional_recs = popularity_based_recommendation(transactions_df, products_df, top_n=top_n - len(recommended_product_ids))
        # Append additional recommendations, ensuring no duplicates
        for pid in additional_recs:
            if pid not in recommended_product_ids:
                recommended_product_ids.append(pid)
            if len(recommended_product_ids) == top_n:
                break
    
    return recommended_product_ids

# Overall recommendation engine

In this overall recommendation engine, I have combined user-based collaborative filtering, content-based recommendation, and cold-start recommendation into a single system, so that each recommender compensates for the others' weaknesses. For existing users, I have assigned a higher weight to user-based recommendations (+2) than to content-based ones (+1); new users are handled by the cold-start strategy.
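
The weighting scheme can be sketched in isolation. The helper `combine_scores` below is a hypothetical name introduced only for this illustration; it applies the same +2 / +1 weights to two made-up candidate lists.

```python
def combine_scores(user_based_ids, content_based_ids, top_n=5):
    """Merge two candidate lists with weights +2 (user-based) and +1 (content-based)."""
    scores = {}
    for pid in user_based_ids:
        scores[pid] = scores.get(pid, 0) + 2
    for pid in content_based_ids:
        scores[pid] = scores.get(pid, 0) + 1
    # sorted() is stable, so ties keep their insertion order
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

# 'B' appears in both lists (score 3), 'A' only in the user-based list (2),
# 'C' only in the content-based list (1)
print(combine_scores(["A", "B"], ["B", "C"], top_n=3))  # ['B', 'A', 'C']
```

One consequence of these weights is that a content-only candidate (score 1) can never outrank a user-based candidate (score 2 or 3). When the user-based recommender already supplies top_n items, the final list therefore contains only user-based items.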

In [23]:
def overall_recommendation(user_id, transactions_df, products_df, top_n=5):
    """
    Generate a consolidated list of product recommendations for a user by integrating
    User-Based Collaborative Filtering, Content-Based Filtering, and Cold Start strategies.
    
    Parameters:
    - user_id (int): The ID of the user for whom to generate recommendations.
    - transactions_df (DataFrame): DataFrame containing user transactions with columns ['user_id', 'product_id', 'Quantity', ...].
    - products_df (DataFrame): DataFrame containing product details with columns ['product_id', 'product_name', 'about_product', 'category', ...].
    - top_n (int): Number of top recommendations to return.
    
    Returns:
    - final_recommendations (list): List of recommended product IDs.
    - final_recommendation_names (list, optional): List of recommended product names (if available in products_df).
    """
    
    # Initialize a dictionary to hold aggregated recommendation scores
    recommendation_scores = {}
    
    # Check if the user exists in the transactions (i.e., is not a new user)
    if user_id in transactions_df['user_id'].unique():
        print(f"Existing User: Generating recommendations using User-Based and Content-Based strategies.\n")
        
        # 1. User-Based Collaborative Filtering Recommendations
        print("Generating User-Based Collaborative Filtering Recommendations...")
        user_based_recs, user_based_names = user_based_recommendation(user_id, transactions_df, top_n=top_n)
        for pid in user_based_recs:
            recommendation_scores[pid] = recommendation_scores.get(pid, 0) + 2  # Assign higher weight
        
        print("Generating Content-Based Recommendations...")
        # 2. Content-Based Recommendations
        content_based_recs = content_based_recommendation(user_id, transactions_df, products_df, top_n=top_n)
        for pid in content_based_recs:
            recommendation_scores[pid] = recommendation_scores.get(pid, 0) + 1  # Assign lower weight
        
    else:
        print(f"New User: Generating recommendations using Cold Start strategy.")
        
        # 3. Cold Start Recommendations
        cold_start_recs = cold_start_recommendation(user_id, transactions_df, products_df, top_n=top_n)
        for pid in cold_start_recs:
            recommendation_scores[pid] = recommendation_scores.get(pid, 0) + 1  # Assign weight
        
    # Convert the recommendation_scores dictionary to a DataFrame for sorting
    rec_scores_df = pd.DataFrame(list(recommendation_scores.items()), columns=['product_id', 'score'])
    
    # Sort the recommendations based on the aggregated scores in descending order
    rec_scores_df = rec_scores_df.sort_values(by='score', ascending=False)
    
    # Extract the top_n product_ids
    top_recommendations = rec_scores_df.head(top_n)['product_id'].tolist()
    
    print("Generating Final Recommendations...  Done!")
    # Optionally, retrieve product names for better readability
    if 'product_name' in products_df.columns:
        # Ensure that all product_ids are present in products_df
        valid_pids = [pid for pid in top_recommendations if pid in products_df['product_id'].values]
        product_names = products_df.set_index('product_id').loc[valid_pids]['product_name'].tolist()
        return top_recommendations, product_names
    else:
        return top_recommendations

In [24]:
overall_recommendation(12583, df, products, top_n=5)

Existing User: Generating recommendations using User-Based and Content-Based strategies.

Generating User-Based Collaborative Filtering Recommendations...
Generating Content-Based Recommendations...
Generating Final Recommendations...  Done!


(['B078KRFWQB', 'B07LFWP97N', 'B07VX71FZP', 'B078W65FJ7', 'B0B61HYR92'],
 ['Havells Cista Room Heater, White, 2000 Watts',
  'Gizga Essentials Laptop Bag Sleeve Case Cover Pouch with Handle for 14.1 Inch Laptop for Men & Women, Padded Laptop Compartment, Premium Zipper Closure, Water Repellent Nylon Fabric, Grey',
  'Amazon Brand - Solimo 2000/1000 Watts Room Heater with Adjustable Thermostat (ISI certified, White colour, Ideal for small to medium room/area)',
  'boAt BassHeads 900 On-Ear Wired Headphones with Mic (White)',
  'Lapster usb 2.0 mantra cable, mantra mfs 100 data cable (black)'])

Evaluation of the recommendation system is not included in this project.