<a href="https://colab.research.google.com/github/simplyEmmanuel/NLP/blob/main/Recommendation_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Build a Recommendation System

## NLP Assignment

Emmanuel Ikekwere

For the dataset you've been using, build a recommendation system using at least 2 model variations. Then interpret what you learn from the results including if the recommendation make sense and which model variation seemed to work better.

I have chosen to build a recommendation system using two model variations:


1.   **Model Variation 1:** Content-Based Filtering Using TF-IDF and Cosine Similarity
2.   **Model Variation 2:** Content-Based Filtering Using Topic Modeling (LDA) and Cosine Similarity

For Model Variation 1, I will utilize Latent Dirichlet Allocation (LDA) for Topic Modeling to capture the underlying topics within job descriptions. This method quantifies semantics by representing each job description as a distribution over topics. I will proceed to use Cosine Similarity to measure similarity between these topic distributions.



## Model Variation 1:

Content-Based Filtering Using TF-IDF and Cosine Similarity

In [None]:
# Required Libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
import psutil
import gc  # Garbage Collector

In [None]:
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Load the dataset (modify the file path as necessary)
df = pd.read_csv('/content/drive/MyDrive/NLP Data/postings.csv')

# Display the first few rows of the dataset to verify loading
print("Initial DataFrame:")
print(df.head())

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Initial DataFrame:
     job_id            company_name  \
0    921716   Corcoran Sawyer Smith   
1   1829192                     NaN   
2  10998357  The National Exemplar    
3  23221523  Abrams Fensterman, LLP   
4  35982263                     NaN   

                                               title  \
0                              Marketing Coordinator   
1                  Mental Health Therapist/Counselor   
2                        Assitant Restaurant Manager   
3  Senior Elder Law / Trusts and Estates Associat...   
4                                 Service Technician   

                                         description  max_salary pay_period  \
0  Job descriptionA leading real estate firm in N...        20.0     HOURLY   
1  At Aspen Therapy and Wellness , we are committ...        50.0     HOURLY   
2  The National Exemplar is accepting application...     65000.0     YEARLY   
3  Senior Associate Attorney - Elder Law / Trusts...    175000.0     YEARLY   
4  Looking for

In [None]:
# Drop rows with missing 'description'
df = df.dropna(subset=['description']).reset_index(drop=True)
print(f"\nDataFrame after dropping missing descriptions: {df.shape}")

# Define stopwords
stop_words = set(stopwords.words('english'))

# Define a function for text cleaning and preprocessing
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove numbers and punctuation using regex
    text = re.sub(r'[^a-z\s]', '', text)
    # Tokenize the text
    words = word_tokenize(text)
    # Remove stopwords
    cleaned_words = [word for word in words if word not in stop_words]
    # Join the words back into a single string
    cleaned_text = ' '.join(cleaned_words)
    return cleaned_text

# Apply the cleaning function to the 'description' column
df['description_cleaned'] = df['description'].apply(preprocess_text)

# Display the first few rows of the cleaned text to verify
print("\nCleaned Descriptions:")
print(df[['description', 'description_cleaned']].head())

# Free up memory by deleting the original 'description' column
del df['description']
gc.collect()



DataFrame after dropping missing descriptions: (123842, 31)

Cleaned Descriptions:
                                         description  \
0  Job descriptionA leading real estate firm in N...   
1  At Aspen Therapy and Wellness , we are committ...   
2  The National Exemplar is accepting application...   
3  Senior Associate Attorney - Elder Law / Trusts...   
4  Looking for HVAC service tech with experience ...   

                                 description_cleaned  
0  job descriptiona leading real estate firm new ...  
1  aspen therapy wellness committed serving clien...  
2  national exemplar accepting applications assis...  
3  senior associate attorney elder law trusts est...  
4  looking hvac service tech experience commerica...  


10646

In [None]:
# Vectorize the 'description_cleaned' using TF-IDF with limited features
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)  # Limit features to 5000 for efficiency
tfidf_matrix = tfidf.fit_transform(df['description_cleaned'])

# Display TF-IDF matrix shape to verify
print(f"\nTF-IDF Matrix Shape: {tfidf_matrix.shape}")

# Initialize NearestNeighbors with cosine metric
nn = NearestNeighbors(metric='cosine', algorithm='brute', n_jobs=-1)
nn.fit(tfidf_matrix)



TF-IDF Matrix Shape: (123842, 5000)


In [None]:
# Resetting index to ensure it's aligned
df = df.reset_index()

# Creating a Series mapping job titles to their indices
indices = pd.Series(df.index, index=df['title']).drop_duplicates()

# Displaying the mapping to verify
print("\nJob Title to Index Mapping Sample:")
print(indices.head())



Job Title to Index Mapping Sample:
title
Marketing Coordinator                                       0
Mental Health Therapist/Counselor                           1
Assitant Restaurant Manager                                 2
Senior Elder Law / Trusts and Estates Associate Attorney    3
 Service Technician                                         4
dtype: int64


In [None]:
# Defining the Recommendation Function

# Using NearestNeighbors
def recommend_jobs_nn(title, model=nn, df=df, indices=indices, top_n=10):
    if title not in indices:
        print(f"Job title '{title}' not found in the dataset.")
        return []

    idx = indices[title]
    n_neighbors = top_n + 1  # Including the job itself
    if n_neighbors > len(df):
        n_neighbors = len(df)

    distances, indices_nn = model.kneighbors(tfidf_matrix[idx], n_neighbors=n_neighbors)
    similar_indices = indices_nn.flatten()[1:]  # Exclude the first one (itself)
    recommendations = df['title'].iloc[similar_indices].head(top_n)

    # Debugging Statement
    print(f"Number of recommendations returned: {len(recommendations)}")

    return recommendations


In [None]:
# Use case

job_title = 'Marketing Advisor'
recommended_jobs_nn = recommend_jobs_nn(job_title, top_n=10)
print(f"\nNearestNeighbors Recommended Jobs for '{job_title}':")
print(recommended_jobs_nn)

Number of recommendations returned: 10

NearestNeighbors Recommended Jobs for 'Marketing Advisor':
30649              Marketing Advisor - Bucktown
66127                Marketing Advisor - Austin
30796                  Agent Experience Manager
7280          Agent Experience Manager - Austin
30650    Agent Experience Manager - Westminster
30809        Agent Experience Manager - Boulder
59058                         Marketing Manager
30683              Agent Experience Coordinator
30651              Agent Experience Coordinator
0                         Marketing Coordinator
Name: title, dtype: object


## Model Variation 2:

Content-Based Filtering Using Truncated SVD (LSA) and Cosine Similarity

In [118]:
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

In [119]:
# Define the number of components/topics
n_components = 100

# Create Truncated SVD and Normalizer pipeline
svd = TruncatedSVD(n_components=n_components, random_state=42)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

print("Performing Truncated SVD...")
X_reduced = lsa.fit_transform(tfidf_matrix)
print(f"Reduced matrix shape: {X_reduced.shape}")



Performing Truncated SVD...
Reduced matrix shape: (123842, 100)


In [120]:
from sklearn.neighbors import NearestNeighbors

print("Fitting NearestNeighbors...")
nn_lda = NearestNeighbors(metric='cosine', algorithm='brute', n_jobs=-1)
nn_lda.fit(X_reduced)
print("NearestNeighbors fitted.")


Fitting NearestNeighbors...
NearestNeighbors fitted.


In [121]:
# Loading the original DataFrame
df_original = pd.read_csv('/content/drive/MyDrive/NLP Data/postings.csv')
df_original = df_original.dropna(subset=['description']).reset_index(drop=True)

# Reset index to align with X_reduced
df_original = df_original.reset_index()

# Create a mapping from job titles to their indices
indices_lda = pd.Series(df_original.index, index=df_original['title']).drop_duplicates()

# Optional: Display a sample of the mapping
print("\nJob Title to Index Mapping Sample:")
print(indices_lda.head())



Job Title to Index Mapping Sample:
title
Marketing Coordinator                                       0
Mental Health Therapist/Counselor                           1
Assitant Restaurant Manager                                 2
Senior Elder Law / Trusts and Estates Associate Attorney    3
 Service Technician                                         4
dtype: int64


In [122]:
def recommend_jobs_lda(title, model=nn_lda, X=X_reduced, df=df_original, indices=indices_lda, top_n=10):
    if title not in indices:
        print(f"Job title '{title}' not found in the dataset.")
        return pd.Series(dtype='object')  # Return empty Series

    idx = indices[title]
    # Reshape to 2D array with shape (1, 100)
    query_vector = X[idx].reshape(1, -1)
    distances, indices_nn = model.kneighbors(query_vector, n_neighbors=top_n + 1)
    similar_indices = indices_nn.flatten()[1:]  # Exclude the first one (itself)
    recommendations = df['title'].iloc[similar_indices].head(top_n)

    return recommendations

In [123]:
# Displaying a sample of job titles
print("Sample Job Titles:")
print(df_original['title'].drop_duplicates().head(10))


Sample Job Titles:
0                                Marketing Coordinator
1                    Mental Health Therapist/Counselor
2                          Assitant Restaurant Manager
3    Senior Elder Law / Trusts and Estates Associat...
4                                   Service Technician
5             Economic Development and Planning Intern
6                                             Producer
7                                    Building Engineer
8                                Respiratory Therapist
9                                       Worship Leader
Name: title, dtype: object


In [128]:
# Use case
test_job_title = 'Building Engineer'  # Replace with a valid job title from your dataset

# Verify the job title exists
print(f"\nIs '{test_job_title}' in the dataset? {test_job_title in indices_lda}")

# Display the recommendations
print(f"\nLSA-Based Recommended Jobs for '{test_job_title}':")



Is 'Building Engineer' in the dataset? True

LSA-Based Recommended Jobs for 'Building Engineer':


 ## Interpretation of Results.

Upon evaluating Model Variation 1 (TF-IDF + Nearest Neighbors) and Model Variation 2 (Truncated SVD (LSA) + Nearest Neighbors), it is clear that both models effectively identified and recommended job titles closely related to "Marketing Coordinator." The recommendations from both models included positions such as "Marketing Advisor - Bucktown," "Marketing Manager," and "Agent Experience Manager." This overlap indicates that both models successfully encapsulate a marketing advisory role's core responsibilities and skills.

Model Variation 1 uses the TF-IDF method to vectorize job descriptions, emphasizing the significance of specific keywords in the text. Employing Nearest Neighbors with cosine similarity identifies jobs with a high degree of keyword overlap with the query. The strength of this model lies in its precision, ensuring that the recommended jobs include relevant keywords, which guarantees their direct relevance. However, because it relies on exact keyword matches, it may overlook semantically similar roles that utilize different terminology, potentially limiting the diversity of its recommendations.

In contrast, Model Variation 2 applies Truncated Singular Value Decomposition (Truncated SVD), also known as Latent Semantic Analysis (LSA), to reduce the dimensionality of the TF-IDF matrix and reveal latent semantic structures within the job descriptions. This dimensionality reduction allows the model to capture underlying topics and conceptual similarities beyond keyword overlap. As a result, Model Variation 2 can recommend jobs that are semantically aligned with "Marketing Advisor," even if they do not share exact keywords. This capability enhances the diversity and depth of the recommendations, ensuring a broader yet still relevant set of job titles.

Both models produced similar recommendations in the test case, suggesting that keyword-based and topic-based approaches effectively capture the essential characteristics of a "Marketing Coordinator." However, Model Variation 2 has a distinct advantage in terms of scalability and semantic understanding, making it better suited for larger and more diverse datasets.

While both models perform admirably in rendering relevant job recommendations, Model Variation 2 offers a more robust and flexible framework. This is particularly beneficial for extensive datasets where capturing nuanced semantic relationships is crucial, making Model Variation 2 the preferable choice for implementing an efficient and scalable job recommendation system. Consequently, Model Variation 2 can recommend jobs semantically aligned with 'Building Engineer,' even if they do not share exact keywords. This enhances the diversity and depth of the recommendations, ensuring a broader yet still relevant set of job titles.