# The Home Depot - Use Case Study (Content recommender system)

The following code is a use case task for The Home Depot: recommending content to users.

The task is to build a model that recommends relevant content based on the similarity between a user's search term and the title of each content item. The recommendations are one-to-many: a single search term can map to several content items. The search is designed to be agnostic of the user's past behavior, so the context for a recommendation is derived purely from the current search term.

This system will recommend relevant products or content from the dataset, leveraging natural language processing (NLP) techniques and embeddings to improve the relevance of the recommendations. My final objective is to evaluate the accuracy of the model using a labeled dataset and generate predictions for unseen data.



## Library initialization and Data Load:

Libraries used in the case study are as follows:
1. Pandas: For data manipulation and analysis.
2. NumPy: Used for numerical operations & array handling.
3. NLTK (Natural Language Toolkit): For text processing, lemmatization, and managing stopwords.
4. re: Regular expressions for cleaning text.
5. spaCy: For Named Entity Recognition (NER) and advanced NLP tasks.
6. SentenceTransformer: To generate sentence embeddings using a pre-trained model.
7. sklearn (scikit-learn): For cosine similarity, the classification report (precision, recall, F1 score), and the accuracy score.

We are working with three main datasets:

`content_data`: Contains the product or content titles for which recommendations need to be generated. This dataset includes information such as the title and slug of each content piece.

`label_data`: A labeled dataset with search terms and slugs, where each entry is tagged as either relevant or irrelevant for a particular search term.

`test_data`: This dataset contains search terms for which we will generate recommendations without labels.

We load these datasets into Pandas dataframes for manipulation and analysis.

While loading `label_data`, I found that the header was misplaced: the column names appear as a data row at index 726 instead of on the first line of the file. To fix this:

1. The file is read with `header=None`, and the column names are set from the values of row 726.
2. The erroneous row (row 726) is then dropped from the dataset.

This adjustment ensures that the columns in `label_data` are correctly named for further processing.

In [20]:
import pandas as pd
import numpy as np
import nltk, re, spacy
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report, accuracy_score
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer

#data load
content_data = pd.read_csv('data/content_data_MASTER.csv')
label_data = pd.read_csv('data/labels_MASTER.csv', header=None)
test_data = pd.read_csv('data/test_MASTER.csv')

#removing duplicates based on 'title' as it may cause bias due to multiple entries (retrospective removal) 
content_data = content_data.drop_duplicates(subset='title', keep='first')

#setting the header using the row at index 726 as the header is wrongly entered in that entry
label_data.columns = label_data.iloc[726].values
label_data = label_data.drop(index=726)

#display the first few rows for exploration
#print(content_data.head())
#print(label_data.head())
#print(test_data.head())
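
Hard-coding index 726 works for this file, but the same fix can be written so that it locates the stray header row itself. The following is a minimal sketch on a tiny in-memory CSV. The sample rows and the column order (`searchTerm`, `slug`, `Label`) are assumptions made for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical three-line file whose header sits in the middle of the data
raw = io.StringIO(
    "drill bits,how-to-use-a-drill,relevant\n"
    "searchTerm,slug,Label\n"
    "paint primer,how-to-paint-a-room,irrelevant\n"
)
df = pd.read_csv(raw, header=None)

# Assumed column names; adjust to whatever the real header row contains
expected = ['searchTerm', 'slug', 'Label']

# Find the row whose values exactly match the expected header
header_mask = df.apply(lambda row: list(row.values) == expected, axis=1)
header_idx = header_mask[header_mask].index[0]

# Promote that row to the header, drop it from the data, and renumber the rows
df.columns = df.iloc[header_idx].values
df = df.drop(index=header_idx).reset_index(drop=True)
print(df)
```

Resetting the index is optional; the notebook code above keeps the original index after dropping row 726.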

Before using NLTK and spaCy, we first need to download the resources they depend on. The NLTK downloads run with `quiet=True` to suppress the log output on repeated runs.

The following resources are loaded: the averaged perceptron tagger, WordNet, the NLTK stopwords and words corpora, the spaCy `en_core_web_sm` model (used for NER), and the spaCy stopword list.

In [21]:
#downloading necessary NLTK resources (quiet mode for repeated downloads)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('words', quiet=True)

#loading spaCy model for NER and stopword removal
nlp = spacy.load('en_core_web_sm')
spacy_stopwords = nlp.Defaults.stop_words
nltk_stopwords = set(stopwords.words('english'))
combined_stopwords = spacy_stopwords.union(nltk_stopwords)

#initializing lemmatizer
lemmatizer = WordNetLemmatizer()

## Exploratory Data Analysis:

The Exploratory Data Analysis (EDA) phase is essential for understanding the structure, content, and characteristics of the datasets. Here, I explored the data by gathering basic information, computing summary statistics, checking for data quality issues such as duplicates (which I retrospectively removed in the data load step above), and identifying frequently occurring values. This helps prepare the data for further cleaning and modeling.

1. Basic Information of the Datasets
The `info()` function provides a summary of each dataset, including the number of entries, the data types, and the number of non-null values in each column.

- Content Data: We inspected the structure and integrity of the content data, ensuring that fields such as title and slug contain valid entries without missing values.
- Label Data: This dataset, which contains search terms and their corresponding labels, was checked for the number of entries and any missing or misaligned data.
- Test Data: The test data was analyzed to ensure that all search terms are present and complete.

In [None]:
#basic information
print("Basic Information of the Content Data:")
print(content_data.info())

print("\nBasic Information of the Labels Data:")
print(label_data.info())

print("\nBasic Information of the Test Data:")
print(test_data.info())

2. Summary Statistics
We used the `describe()` function to obtain summary statistics. For numerical columns it reports count, mean, and percentiles, which helps spot irregularities such as extremely large or small values. For text columns it reports count, number of unique values, the most frequent value, and its frequency.

- Content Data: The summary covers any numerical columns in the dataset (if present); otherwise it summarizes the text columns.
- Label Data: Summary statistics were used to understand the distribution of search terms and their labels.
- Test Data: We checked how many distinct search terms are present to ensure a good variety of queries.

In [None]:
#summary statistics
print("\nSummary of Content Data:")
print(content_data.describe())

print("\nSummary of Labels Data:")
print(label_data.describe())

print("\nSummary of Test Data:")
print(test_data.describe())
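
As a quick illustration of what `describe()` returns for a text column, here is a small sketch on made-up search terms (the values are hypothetical and not drawn from the datasets):

```python
import pandas as pd

# Hypothetical search terms, for illustration only
toy = pd.DataFrame({'searchTerm': ['drill', 'paint', 'drill', 'garden hose']})

# For object (text) columns the summary rows are count, unique, top, freq
summary = toy.describe()
print(summary)
```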

3. Distribution of Labels
The distribution of the Label column in the label data is important for understanding class balance. If one class (e.g., "relevant" or "irrelevant") dominates, it can lead to biased predictions. We used value_counts() to check the number of instances for each label.

Observation: If the dataset is heavily imbalanced, we may need to adjust the modeling approach by using techniques such as oversampling or class weighting.

In [None]:
#distribution of 'Label' column in the labels data
print("\nDistribution of the 'Label' column in Labels Data:")
print(label_data['Label'].value_counts())
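
To turn the raw counts into a class-balance check, the shares per label can be computed with `value_counts(normalize=True)`. This is a minimal sketch on a made-up label column; the values and the `imbalance_ratio` name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical labels, for illustration only
labels = pd.Series(['relevant', 'irrelevant', 'relevant', 'relevant'], name='Label')

# Share of each class instead of raw counts
shares = labels.value_counts(normalize=True)

# Ratio of the largest class to the smallest; values far above 1 suggest imbalance
imbalance_ratio = shares.max() / shares.min()

print(shares)
print("Imbalance ratio:", imbalance_ratio)
```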

4. Duplicate Titles in Content Data
Duplicates in content titles can cause bias in recommendations, as they may skew the results by giving some content undue prominence. Using `duplicated()`, we checked for repeated entries based on the title column in the content dataset and displayed the duplicated rows.

Observation: Any duplicates found were removed in the data load step to ensure that the model doesn't favor duplicate content. With that removal in place, this check should return an empty result and serves as a sanity check.

In [None]:
#checking for duplicates in the content data
print("\nDuplicate Titles in Content Data:")
print(content_data[content_data.duplicated(subset='title', keep=False)])

5. Frequent Search Terms in Labels Data
To understand user behavior, we analyzed the most frequently occurring search terms in the label_data dataset using value_counts(). This helps in identifying popular search queries that the model will need to handle efficiently.

Top 10 Frequent Search Terms: These insights into frequent search terms can inform future data cleaning and processing steps, especially if certain terms appear with high frequency and may require special handling.

In [None]:
#Frequent search terms in the labels data
print("\nTop 10 Frequent Search Terms in Labels Data:")
print(label_data['searchTerm'].value_counts().head(10))

6. Frequent Content Titles in Content Data
We also analyzed the top 10 most frequent content titles using `value_counts()`. Repeated titles would indicate overrepresented content in the dataset.

Top 10 Frequent Content Titles: Identifying repeated titles gives us an idea of common content themes and whether we need to adjust the model to handle any overrepresentation. Because duplicate titles were dropped during the data load, every title is expected to appear exactly once here.

In [None]:
#frequent content titles in the content data
print("\nTop 10 Frequent Content Titles in Content Data:")
print(content_data['title'].value_counts().head(10))

In [28]:
#label_data.to_csv('labels_check.csv', index=False)
#content_data.to_csv('content_check.csv', index=False)
#test_data.to_csv('test_check.csv', index=False)

## Data Cleaning & Pre-processing: 
This step is crucial: the model should receive clean data rather than noise. The process is split into multiple steps.

1. Removal of the product specifications.

The available content data has no product-specific numbers: none of the titles or slugs contain them. Numbers, and the tokens attached to them, can therefore be removed safely.

The `clean_product_specs` function removes irrelevant product specifications and unwanted characters from text. It:

- Removes words containing digits, such as model numbers (e.g., "T123") and compact dimensions (e.g., "4x6"), which add no value for matching against content titles.
- Eliminates any isolated 'x' left over from spaced product dimensions (e.g., "4 x 6"), which is meaningless once the numbers are removed.
- Collapses extra spaces, ensuring clean and neatly formatted text.


In [29]:
#cleaning product specifications and unwanted characters
def clean_product_specs(text):
    #removing product-specification words, i.e. words that contain at least one digit
    cleaned_text = re.sub(r'\b\w*\d\w*\b', '', text)  #removes words that contain digits
    cleaned_text = re.sub(r'\bx\b', '', cleaned_text)  #removes isolated 'x'
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()  #removes extra spaces
    return cleaned_text
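
To see the three substitutions in action, the sketch below repeats the function so that the block runs on its own and applies it to two made-up product strings (the sample inputs are hypothetical):

```python
import re

def clean_product_specs(text):
    # Same three steps as the notebook function above
    cleaned_text = re.sub(r'\b\w*\d\w*\b', '', text)   # drop words containing digits
    cleaned_text = re.sub(r'\bx\b', '', cleaned_text)  # drop an isolated 'x'
    return re.sub(r'\s+', ' ', cleaned_text).strip()   # collapse whitespace

# Spaced dimensions: the digits go first, then the leftover 'x'
spaced = clean_product_specs("DEWALT 20V MAX drill 4 x 6 in")

# Compact dimensions: '4x6' contains digits, so the first rule removes it whole
compact = clean_product_specs("frame 4x6 photo")

print(spaced)
print(compact)
```

Note that the unit "in" survives this step; it is handled later through the custom stopword list.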

This section of code performs text preprocessing for product data by:

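1. Loading an English dictionary (`english_words`) intended for keeping only meaningful English words. Note that the `preprocess_text` function below does not currently filter tokens against it.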

2. Defining custom stopwords that are identified as non-informative during data analysis (e.g., 'diy', 'guide').

3. Preprocessing text by:

    a) Cleaning product specifications.

    b) Converting text to lowercase.

    c) Removing standard and custom stopwords.

    d) Performing part-of-speech-based lemmatization to reduce words to their base forms (nouns, verbs).

    e) Retaining named entities (important terms like product names) and filtering out non-informative symbols.

The result is a cleaned and structured text, ready for further analysis or model training.

In [30]:
#english dictionary load (not referenced by preprocess_text below)
english_words = set(words.words())

#NLTK stopwords plus custom stopwords identified through data scan
#note: this reassignment replaces the spaCy+NLTK union built in the earlier cell
combined_stopwords = set(stopwords.words('english')).union(
    {'diy', 'guide', 'tutorial', 'how-to', 'best', 'ideas', 'gal', 'in', 'ft'}
)

#preprocessing function with NER, punctuation removal and POS-based lemmatization
def preprocess_text(text, custom_stopwords=None):
    #calling function to clean product specifications
    text = clean_product_specs(text)   
    #lowercase conversion
    doc = nlp(text.lower())
    tokens = []
    entities = []
    for token in doc:
        #removing stopwords and custom stopwords
        if token.text not in combined_stopwords and (custom_stopwords is None or token.text not in custom_stopwords):         
            #skip tokens if they are non-informative symbols
            if re.match(r'^\W+$', token.text):
                continue
            #prioritizing named entity
            if token.ent_type_:
                entities.append(token.text)
            else:
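                #POS-based lemmatization for non-entities
                #spaCy's token.pos_ holds universal POS tags, so match them exactly
                #(a startswith('N') check would also match 'NUM')
                pos = token.pos_
                if pos == 'NOUN':
                    lemmatized_word = lemmatizer.lemmatize(token.text, pos='n')  #noun
                elif pos == 'VERB':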
                    lemmatized_word = lemmatizer.lemmatize(token.text, pos='v')  #verb
                else:
                    lemmatized_word = token.text  #others retained as is
                #adding tokens to list
                tokens.append(lemmatized_word)
    #combine named entities and lemmatized tokens
    processed_text = ' '.join(entities + tokens)
    #remove extra spaces which may have come up
    return re.sub(r'\s+', ' ', processed_text).strip()

This section applies the previously defined preprocessing function to clean and standardize the text data by doing the following:

1. Preprocessing content titles: The preprocess_text function is applied to each title in the content_data dataset to clean product names and specifications.
2. Preprocessing search terms in label data: The search terms in the label_data dataset undergo the same cleaning process to ensure consistency between the queries and content titles.
3. Preprocessing search terms in test data: The search terms in the test_data dataset are also processed so that predictions can be generated for them.

In [31]:
#applying preprocessing to content titles and to the search terms in the label and test data
content_data['preprocessed_title'] = content_data['title'].apply(preprocess_text)
label_data['preprocessed_searchTerm'] = label_data['searchTerm'].apply(preprocess_text)
test_data['preprocessed_searchTerm'] = test_data['searchTerm'].apply(preprocess_text)

In [32]:
#content_data.to_csv('content_check1.csv', index=False)
#label_data.to_csv('labels_check1.csv', index=False)

## Model Building:

I used the SentenceTransformer library (SBERT model) with the "all-MiniLM-L6-v2" model to generate embeddings for search terms and content titles. Sentence embeddings capture the semantic meaning of text, which is crucial for making recommendations based on content similarity.

Metrics used for content matching:

**Cosine Similarity:** This was employed to measure the similarity between the embeddings of search terms and content titles. The closer the cosine similarity score is to 1, the more relevant the content is for the search term.

In [33]:
#loading the SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')

#function to get SBERT embeddings
def get_sbert_embeddings(texts):
    return model.encode(texts)

#call to get SBERT embeddings for all preprocessed content titles
content_embeddings = get_sbert_embeddings(content_data['preprocessed_title'].tolist())
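
To make the metric concrete, the sketch below computes cosine similarity by hand with NumPy on tiny two-dimensional vectors. The vectors stand in for real embeddings and are made up for illustration; `cosine_sim_matrix` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def cosine_sim_matrix(queries, docs):
    """Cosine similarity between every query row and every document row."""
    # Normalize each row to unit length, then take dot products
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    return q @ d.T

# Toy "embeddings": one query and three documents
query = np.array([[1.0, 0.0]])
docs = np.array([
    [2.0, 0.0],   # same direction as the query
    [0.0, 3.0],   # orthogonal to the query
    [1.0, 1.0],   # 45 degrees from the query
])

scores = cosine_sim_matrix(query, docs).flatten()
print(scores)
```

The score depends only on direction, not on vector length, which is why the first document scores as a perfect match even though it is twice as long as the query.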

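**Recommendation Generation and Metrics Used**

The recommendation system works by:

1. Generating sentence embeddings for both the search terms and the content titles.
2. Calculating the cosine similarity between a search term and all content titles.
3. Applying keyword boosting for domain-specific terms like "power tool," "garden," or "hydrangea": when the search term contains one of these keywords, every similarity score for that search is multiplied by 1.1 (once per matching keyword). Because the boost is applied uniformly to all titles, it leaves the ranking unchanged and in effect lowers the threshold for these searches.
4. Keeping the content titles with a similarity score at or above a defined threshold (e.g., 0.35), up to a maximum number of results.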

In [34]:
#list of domain-specific keywords for boosting relevance
boost_keywords = ['power tool', 'paint', 'home decor', 'drill', 'saw', 'garden', 'furniture', 'shower', 'christmas', 'hydrangea']

#function to generate recommendations with cosine similarity, keyword boosting, and thresholding
def generate_recommendations(search_term, content_embeddings, content_data, similarity_threshold=0.35, min_top_n=2, max_top_n=60):
    search_embedding = model.encode([search_term])
    cosine_similarities = cosine_similarity(search_embedding, content_embeddings).flatten()
    
    #applying keyword boosting when the search term contains a domain-related phrase
    for keyword in boost_keywords:
        if keyword in search_term:
            cosine_similarities *= 1.1  #boost every score by 10% (uniform across all content)
    
    valid_indices = np.where(cosine_similarities >= similarity_threshold)[0]
    
    if len(valid_indices) == 0:
        return pd.DataFrame()  # Return empty DataFrame if no valid content
    
    top_n = max(min(len(valid_indices), max_top_n), min_top_n)
    top_indices = valid_indices[np.argsort(cosine_similarities[valid_indices])[-top_n:][::-1]]
    
    recommendations = content_data.iloc[top_indices].copy()
    recommendations['similarity_score'] = cosine_similarities[top_indices]
    
    return recommendations[['slug', 'title', 'similarity_score']]
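
The threshold-then-rank step can be illustrated in isolation. The following is a simplified sketch with made-up scores; `select_top` is a hypothetical helper that mirrors only the filtering and sorting logic, not the boosting or the DataFrame handling.

```python
import numpy as np

def select_top(scores, threshold=0.35, max_top_n=60):
    """Return indices of scores at or above the threshold, best first."""
    # Keep only the candidates that clear the threshold
    valid = np.where(scores >= threshold)[0]
    # Sort the surviving candidates by score, highest first
    order = valid[np.argsort(scores[valid])[::-1]]
    # Cap the number of results
    return order[:max_top_n]

# Toy similarity scores for five content items
scores = np.array([0.10, 0.80, 0.36, 0.50, 0.34])

picked = select_top(scores, threshold=0.35, max_top_n=2)
print(picked, scores[picked])
```

Items 0 and 4 fall below the threshold and are dropped, and item 2 clears the threshold but is cut by the cap of two results.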

To evaluate the model, we iterate over the labeled data, extracting the preprocessed search terms and their corresponding slugs. For each search term, we generate recommendations using the cosine similarity between the search term embedding and content embeddings. We then check if the actual content slug is present in the recommended slugs to determine the predicted relevance.

- **True Labels**: Whether the label in the dataset is marked as "relevant."
- **Predicted Labels**: Whether the actual slug is present in the recommended slugs.

After predictions are generated for all search terms, we compute the model's performance using a classification report, which includes metrics such as precision, recall, and F1-score, alongside an overall accuracy score.


In [None]:
#evaluate model performance on label dataset
y_true = []
y_pred = []

for i, row in label_data.iterrows():
    search_term = row['preprocessed_searchTerm']
    slug = row['slug']
    true_label = row['Label'].strip().lower() == 'relevant'
    
    recommendations = generate_recommendations(search_term, content_embeddings, content_data, similarity_threshold=0.35)
    
    if not recommendations.empty:
        predicted_label = slug in recommendations['slug'].values
    else:
        predicted_label = False
    
    y_true.append(true_label)
    y_pred.append(predicted_label)

# Display classification report and accuracy score
print("Classification Report:")
print(classification_report(y_true, y_pred))
print("Accuracy Score:", accuracy_score(y_true, y_pred))
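
For readers unfamiliar with the numbers in the classification report, the sketch below computes precision, recall, F1, and accuracy by hand for the positive ("relevant") class. The label vectors are made up for illustration and are not results from the model.

```python
import numpy as np

# Hypothetical true and predicted relevance flags
y_true_toy = np.array([True, True, False, False, True])
y_pred_toy = np.array([True, False, True, False, True])

# Confusion counts for the positive class
tp = np.sum(y_true_toy & y_pred_toy)    # relevant and recommended
fp = np.sum(~y_true_toy & y_pred_toy)   # irrelevant but recommended
fn = np.sum(y_true_toy & ~y_pred_toy)   # relevant but not recommended

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.mean(y_true_toy == y_pred_toy)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```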

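Now we are ready to generate recommendations for the search terms in the test data (`test_MASTER.csv`). The following code calls the `generate_recommendations` function for each preprocessed search term. Note that it uses a stricter similarity threshold (0.5) than the 0.35 used in the evaluation above.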

In [None]:
# Generate recommendations for search terms in test data
recommendations_dict = {}
for search_term in test_data['preprocessed_searchTerm']:
    recommendations_dict[search_term] = generate_recommendations(search_term, content_embeddings, content_data, similarity_threshold=0.5)

# Display recommendations for each search term
for search_term, recs in recommendations_dict.items():
    print(f"Search Term: {search_term}")
    print(recs)
    print("\n")
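
If the per-term results need to be exported or inspected as one table, the dictionary of DataFrames can be flattened with `pd.concat`. This is a minimal sketch on a made-up dictionary; the slugs, titles, and scores are hypothetical, and empty results are skipped.

```python
import pandas as pd

# Hypothetical stand-in for recommendations_dict
recs = {
    'wine glass': pd.DataFrame({
        'slug': ['slug-a', 'slug-b'],
        'title': ['Title A', 'Title B'],
        'similarity_score': [0.80, 0.60],
    }),
    'primer paint': pd.DataFrame(),  # no content cleared the threshold
}

columns = ['search_term', 'slug', 'title', 'similarity_score']

# Tag each non-empty result with its search term
frames = [df.assign(search_term=term) for term, df in recs.items() if not df.empty]

# pd.concat raises on an empty list, so fall back to an empty frame
flat = pd.concat(frames, ignore_index=True)[columns] if frames else pd.DataFrame(columns=columns)
print(flat)
```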

****************    ***END***    ******************

In [37]:
# # Generate recommendations for search terms in test data
# recommendations_dict = {}
# for search_term in test_data['preprocessed_searchTerm']:
#     recommendations_dict[search_term] = generate_recommendations(search_term, content_embeddings, content_data, similarity_threshold=0.5)

# # Convert recommendations dictionary to a DataFrame for easier export
# recommendations_df = pd.DataFrame([
#     {'search_term': search_term, 'slug': rec['slug'], 'title': rec['title'], 'similarity_score': rec['similarity_score']}
#     for search_term, recs in recommendations_dict.items()
#     for _, rec in recs.iterrows()
# ])

# # Export recommendations to a CSV file
# recommendations_df.to_csv('recommendations.csv', index=False)

# print("Recommendations exported to 'recommendations.csv'.")

In [40]:
# # Manually setting multiple search terms
# search_terms = ["best wine glass", "primer paint", "tools for work"]

# # Loop through each search term
# for search_term in search_terms:
#     # Preprocess the search term before recommendation
#     preprocessed_search_term = preprocess_text(search_term)

#     # Generate recommendations using the current search term
#     recommendations = generate_recommendations(preprocessed_search_term, content_embeddings, content_data, similarity_threshold=0.5)

#     # Display recommendations for the current search term
#     print(f"Search Term: {search_term}")
#     print(recommendations)
#     print("\n")  # Add a line break between different search terms

Search Term: best wine glass
                                                   slug  \
342        the-best-wine-glasses-to-complement-any-wine   
1022                                   how-to-cut-glass   
50    the-types-of-drinking-glasses-you-need-in-your...   
1678                     how-to-pack-dishes-and-glasses   
690         Types-of-cocktail-glasses-for-your-home-bar   
343                           wine-coolers-buying-guide   
955                        how-to-drill-a-hole-in-glass   
1500                        how-to-replace-window-glass   
512             types-of-beer-glasses-to-suit-every-sip   
1551                  how-to-get-red-wine-out-of-carpet   
273                  how-to-remove-scratches-from-glass   

                                                  title  similarity_score  
342        The Best Wine Glasses to Complement Any Wine          0.804738  
1022                                   How to Cut Glass          0.661957  
50    The Types of Drinking Glasse