**<h1 style="color:blue;">Content-Based Job Recommendation System</h1>**

# **1. Problem Understanding and Research**  

### **1.1 Introduction**  
In today’s dynamic job market, individuals often struggle to find jobs that accurately align with their **skills and expertise**. Traditional job search engines rely on **keyword-based filtering**, which often fails to capture **semantic relevance** between a job description and a candidate’s profile.  

To bridge this gap, we develop a **Content-Based Job Recommendation System** that suggests the **top 5 most relevant job postings** based on a user’s input skills. This system utilizes **TF-IDF (Term Frequency-Inverse Document Frequency)** for textual representation and **Cosine Similarity** for ranking job recommendations.

We used two different dataset sizes: a sample of 5,000 records and the full dataset. This approach allowed us to analyze the impact of dataset size on the quality of TF-IDF calculations and on the similarity measurements between job descriptions and skills.


### **1.2 Objective**  
The primary objective of this project is to build an **efficient job recommendation system** using **TF-IDF vectorization** and **Cosine Similarity**. By analyzing the **textual content** of job descriptions and matching them with the user’s skills, we aim to **improve job search accuracy** beyond simple keyword matching.


### **1.3 Approach**  
Our approach consists of the following key steps:

1. **Data Preprocessing** – Cleaning job descriptions, removing stopwords, and standardizing text.  
2. **TF-IDF Feature Extraction** – Converting job descriptions into **numerical vector representations** using TF-IDF.  
3. **Similarity Computation** – Comparing the user’s inputted skills with job postings using **Cosine Similarity**.  
4. **Job Recommendation System** – Ranking job postings based on similarity scores and returning the **top 5 most relevant jobs**.  
5. **Evaluation** – Assessing the system’s effectiveness in recommending meaningful job matches using **Precision@5**.  


### **1.4 Notebook Structure**  
This notebook is structured into the following sections:

- **Data Preprocessing:** Cleaning and preparing job postings for analysis.  
- **TF-IDF Representation:** Transforming job descriptions into TF-IDF vectors.  
- **Cosine Similarity Computation:** Finding the closest job matches based on user input.  
- **Recommendation System Implementation:** Returning the **top 5 job recommendations**.  
- **Evaluation (Precision@5):** Measuring the accuracy of job recommendations by calculating how often the correct job appears in the **top 5 results**.

By implementing this **TF-IDF-based approach**, we ensure a **lightweight, fast, and interpretable** recommendation system suitable for large-scale job datasets.


# **2. Data Preparation and Preprocessing**

### **2.1 Import Necessary Libraries**  
Before proceeding, we need to install and import the required libraries.

If running for the first time, install the following libraries by uncommenting and executing the lines below:

In [None]:
## Additional installs, required on first run only
# !pip install sentence-transformers
# !pip install nltk
# !pip install pandas

The following libraries are essential for **data preprocessing and NLP**:

- **pandas**: For handling structured data.
- **numpy**: For numerical computations.
- **nltk**: For natural language processing (e.g., stopword removal).
- **re**: For text cleaning using regular expressions.

In [1]:
# pandas and numpy for data handling.
# nltk for text preprocessing (stopword removal).
# re for regex-based text cleaning.

import pandas as pd
import numpy as np
import nltk
import re

from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

print("Libraries loaded successfully!")

Libraries loaded successfully!


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### **2.2 Load and Merge the Dataset**  

To build an effective **job recommendation system**, we need to integrate multiple datasets that contain **job postings** and their respective **required skills**. This section focuses on loading, merging, and refining the data.  

#### **A. Loading Data**  
- We first **mount Google Drive** to access stored datasets.  
- Two CSV files are read into Pandas DataFrames:  
  - **`linkedin_job_postings.csv`** – Contains job postings, including job titles and descriptions.  
  - **`job_skills.csv`** – Lists the required skills associated with each job.  
- A quick preview of the first two rows from both datasets confirms successful loading.  

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
df_postings = pd.read_csv('/content/drive/MyDrive/linkedin_job_postings.csv') # Load job postings dataset
df_skills = pd.read_csv('/content/drive/MyDrive/job_skills.csv') # Load job skills dataset

In [4]:
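# Display the first two rows of each dataset to check the structure.
# Note: a notebook cell renders only its last expression, so only
# df_skills.head(2) appears in the output below.
df_postings.head(2)
df_skills.head(2)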

Unnamed: 0,job_link,job_skills
0,https://www.linkedin.com/jobs/view/housekeeper...,"Building Custodial Services, Cleaning, Janitor..."
1,https://www.linkedin.com/jobs/view/assistant-g...,"Customer service, Restaurant management, Food ..."


#### **B. Merging Data**  
- We perform an **inner join** on the `job_link` column to merge job postings with their corresponding required skills.  
- The resulting dataset, **`df_merged`**, combines job descriptions with relevant skill requirements.  
- The shape of the merged dataset is displayed to verify successful integration.  
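
As a minimal sketch of what the inner join does, the toy frames below (hypothetical data, not the LinkedIn files) show that only rows whose `job_link` appears in both tables survive. The `validate` argument is an optional safeguard, assumed here rather than used in the notebook, that raises an error if a key is duplicated on either side.

```python
import pandas as pd

# Hypothetical stand-ins for df_postings and df_skills
postings = pd.DataFrame({
    "job_link": ["link/a", "link/b", "link/c"],
    "job_title": ["Nurse", "Dentist", "Chef"],
})
skills = pd.DataFrame({
    "job_link": ["link/a", "link/b", "link/d"],
    "job_skills": ["Patient care", "Root canal", "Welding"],
})

# Inner join keeps only links present in both frames;
# validate="one_to_one" raises MergeError if any key repeats
merged = pd.merge(postings, skills, on="job_link", how="inner",
                  validate="one_to_one")
print(merged.shape)  # (2, 3)
```

Without such a check, repeated keys on either side would multiply rows in the merged result.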

In [5]:
# Merge both datasets on 'job_link' to combine job postings with their required skills
df_merged = pd.merge(df_postings, df_skills, on='job_link', how='inner')
print("Merged Data Shape:", df_merged.shape) # Print the shape of the merged dataset

Merged Data Shape: (1296381, 15)


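#### **C. Loading the Test Dataset**  
- We load `test_data.csv`, which contains **job titles** and their associated **skills**.  
- This dataset will be used for evaluating the **recommendation system's accuracy**.


#### **D. Removing Test Records from Training Data**  
- We remove **test job postings** from `df_merged` by dropping the rows whose index labels match `test_data.index`.  
- Because `test_data` is loaded from CSV with a default `0..N-1` index, this drops the first 200 rows of `df_merged`; it removes the intended records only if the test set was built from exactly those rows.  
- This is meant to prevent **data leakage**, ensuring that the TF-IDF model is **not fitted** on the same records it is evaluated on.  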
- The shape of the updated `df_merged` dataset is displayed to confirm the changes.
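
If the test set does not line up with the first rows of `df_merged`, index-based dropping would remove the wrong records. One alternative, sketched below on toy data, is to filter on the record content itself. Matching on the (`job_title`, `job_skills`) pair is an assumption, chosen because those are the columns the test file is described as containing.

```python
import pandas as pd

train = pd.DataFrame({
    "job_title": ["Nurse", "Dentist", "Chef", "Nurse"],
    "job_skills": ["Patient care", "Root canal", "Cooking", "Triage"],
})
test = pd.DataFrame({
    "job_title": ["Dentist", "Nurse"],
    "job_skills": ["Root canal", "Triage"],
})

# Left-join with an indicator column, then keep rows found only in train
flagged = train.merge(test.drop_duplicates(), on=["job_title", "job_skills"],
                      how="left", indicator=True)
train_only = flagged[flagged["_merge"] == "left_only"].drop(columns="_merge")
print(train_only)
```

A content-based filter like this also removes exact copies of test records that sit elsewhere in the data, which a positional drop would leave in place.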

In [6]:
# Load test dataset
test_data = pd.read_csv('/content/drive/MyDrive/test_data.csv')

In [7]:
# Remove the test records from the original dataset using their indices
df_merged = df_merged.drop(test_data.index)
df_merged.shape

(1296181, 15)

#### **E. Filtering Relevant Columns**  
- Since **job title** and **job skills** are the key components for similarity-based recommendations, we retain only these two columns.  
- A final preview of the refined dataset ensures it is structured correctly for further preprocessing and feature extraction.  

In [8]:
# Keep only the columns needed for matching: job_title and job_skills
df_merged = df_merged[['job_title', 'job_skills']]

In [9]:
# Display the first two rows of the merged dataset
df_merged.head(2)

Unnamed: 0,job_title,job_skills
200,Travel RN Peds Onc 3184.24/week - 24183827EXPPLAT,"Pediatrics RN, BLS, PALS, American Heart Assoc..."
201,RN Clinical - 4S Cardiovascular Surgery / Hear...,"Patient Care, Nursing Process, Medical Procedu..."


### **2.3 Data Cleaning & Preprocessing**  

To ensure high-quality job recommendations, we clean and preprocess the dataset by **removing duplicates, handling missing values, and standardizing text**.  



### **A. Handling Duplicates & Missing Values**  
- **Remove duplicate records** to eliminate redundancy.  
- **Check for missing values** in key columns (`job_title` and `job_skills`).  
- Since the dataset is large, **rows with missing values** in `job_skills` are safely removed.  
- The dataset size is updated and displayed after these operations.  



In [10]:
# Remove duplicates
df_merged.drop_duplicates(inplace=True)

In [11]:
# Check the null rows
print(df_merged.isnull().sum())

job_title        0
job_skills    1384
dtype: int64


In [12]:
# Since the dataset is large, we can safely drop these rows with null values
df_merged.dropna(inplace=True)

In [13]:
print("Merged Data Shape:", df_merged.shape) # Print the shape of the merged dataset

Merged Data Shape: (1290930, 2)



### **B. Text Preprocessing**  
To improve **text consistency and accuracy**, we apply multiple preprocessing steps:  

#### **- Cleaning Text**  
- Convert all text to **lowercase** to maintain uniformity.  
- Remove **special characters** to eliminate noise.  
- Remove **extra spaces** to standardize formatting.  

#### **- Removing Stopwords**  
- Stopwords (e.g., "the", "is", "and") are removed using **NLTK’s predefined list**, ensuring only meaningful words remain.

#### **- Combining Job Title and Skills**  
- A **new column** (`combined_text`) is created by merging `job_title` and `job_skills`.  
- This ensures that both **job roles** and **required skills** contribute to the similarity calculations.  



### **C. Final Dataset Preview**  
- The cleaned dataset, including the `combined_text` column, is displayed to confirm successful preprocessing.  
- This processed data is now ready for **feature extraction and recommendation generation**.

In [14]:
# Clean a text string: lowercase it, delete every character that is not
# a-z or whitespace (digits and punctuation included), and collapse runs
# of whitespace into single spaces.

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
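
To make the effect of these cleaning rules concrete, the sketch below repeats `clean_text` and applies it to a sample string. Because unwanted characters are deleted rather than replaced, tokens joined by punctuation fuse together. `clean_text_spaced` is a hypothetical variant, not part of the notebook, that substitutes a space instead.

```python
import re

def clean_text(text):
    # Same rules as the notebook's function
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_text_spaced(text):
    # Hypothetical variant: replace unwanted characters with a space
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "Physician, ABEM/AOBEM board-certified"
print(clean_text(sample))         # physician abemaobem boardcertified
print(clean_text_spaced(sample))  # physician abem aobem board certified
```

Which behavior is preferable depends on the data; the fused form keeps hyphenated terms as one token, while the spaced form keeps slash-separated abbreviations apart.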

In [15]:
# Defining the function for removal of stop words using NLTK .

# Remove common English stopwords

def remove_stopwords(text):
    tokens = text.split()
    tokens = [w for w in tokens if w not in stop_words]
    return ' '.join(tokens)


In [16]:
# Building combined text from job_title & job_skills

# Combining the title and skills

df_merged['combined_text'] = (
    df_merged['job_title'].fillna('') + ' ' +
    df_merged['job_skills'].fillna('')
)

In [17]:
# Apply cleaning (clean text and remove stopwords on the data)

df_merged['combined_text'] = df_merged['combined_text'].apply(clean_text).apply(remove_stopwords)

In [18]:
# Check the dataset sample

# Dataset with cleaned combined_text to process further

df_merged.head()

Unnamed: 0,job_title,job_skills,combined_text
200,Travel RN Peds Onc 3184.24/week - 24183827EXPPLAT,"Pediatrics RN, BLS, PALS, American Heart Assoc...",travel rn peds onc week expplat pediatrics rn ...
201,RN Clinical - 4S Cardiovascular Surgery / Hear...,"Patient Care, Nursing Process, Medical Procedu...",rn clinical cardiovascular surgery heart lung ...
202,Emergency Medicine Physician Near Myrtle Beach...,"Emergency medicine, Physician, ABEM/AOBEM boar...",emergency medicine physician near myrtle beach...
203,To Go Specialist,"Teamwork, Customer service, Attention to detai...",go specialist teamwork customer service attent...
204,Strategic Content and Relationship Management ...,"Licensing agreements, Vendor management, Contr...",strategic content relationship management spec...


# **3. Feature Extraction Technique**  

# **First Section: Running the Recommendation System on 5,000 Records**

We experimented with different dataset sizes, starting with a **smaller subset (5,000 records)** before scaling to the **full dataset**. This approach allowed us to analyze how the **amount of data impacts TF-IDF feature extraction, similarity calculations, and overall recommendation accuracy**.


In [19]:
# Limit the dataset to 5000 records for processing and evaluation
df_merged_5000 = df_merged.sample(n=5000, random_state=42)

df_merged_5000.shape

(5000, 3)

# **4.1 TF-IDF**

### **A. Steps in TF-IDF Processing:**  
1. **Import `TfidfVectorizer`** from `sklearn.feature_extraction.text`.  
2. **Initialize the vectorizer** (`TfidfVectorizer()`).  
3. **Fit and transform the `combined_text` column** into a **TF-IDF matrix**.  
4. **Display the shape** of the resulting matrix to verify the transformation.
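
To show what the fitted matrix contains, here is a minimal pure-Python sketch of the weighting that scikit-learn documents as the `TfidfVectorizer` defaults (raw term counts, smoothed IDF, L2-normalized rows). The three-document corpus is made up, and whitespace tokenization is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter

docs = [
    "nurse patient care",
    "dentist root canal patient",
    "chef cooking",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))
# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    # Term count times IDF for every vocabulary term, then L2-normalize
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

matrix = [tfidf_row(doc) for doc in tokenized]
print(len(matrix), len(vocab))  # rows = documents, columns = unique terms
```

A term such as "patient", which appears in two documents, receives a lower IDF than a term that appears in only one, so it contributes less to the similarity between postings.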

In [20]:
# Import TfidfVectorizer from scikit-learn for feature extraction

from sklearn.feature_extraction.text import TfidfVectorizer

In [21]:
# Apply TF-IDF vectorization on the text data for the selected 5000 records
tfidf_vectorizer_5000 = TfidfVectorizer()
tfidf_matrix_5000 = tfidf_vectorizer_5000.fit_transform(df_merged_5000['combined_text'])

### **B. Output Interpretation**  
- The TF-IDF matrix has **5,000 job postings** (rows) and **15,557 unique terms** (columns).  
- This matrix serves as a **numerical representation** of each posting's combined title and skills text, enabling **text similarity calculations** for recommendations.

By working with a **smaller dataset**, we can efficiently **test and refine** our recommendation model before scaling it to larger datasets.


In [22]:
# Print the shape of the TF-IDF matrix to confirm feature extraction
print("TF-IDF Matrix Shape:", tfidf_matrix_5000.shape)

TF-IDF Matrix Shape: (5000, 15557)


# **4.2 TF-IDF with Cosine Similarity**  

To retrieve the most relevant job postings for a given skill set, we use **TF-IDF (Term Frequency-Inverse Document Frequency)** to represent job descriptions as vectors and **Cosine Similarity** to measure their relevance to the user’s skills.


### **A. Why Cosine Similarity?**  
- Cosine similarity measures how **similar** two text vectors are by computing the **cosine of the angle** between them.  
- It is widely used in **information retrieval and recommendation systems** to rank documents based on their content similarity.  
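
As a small worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The vectors below are made-up term weights over a four-term vocabulary, not values from the dataset.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)

query = [1.0, 1.0, 0.0, 0.0]   # e.g. weights for "dental" and "surgery"
job_a = [2.0, 2.0, 0.0, 0.0]   # same direction as the query, larger weights
job_b = [0.0, 1.0, 1.0, 0.0]   # shares one term with the query
job_c = [0.0, 0.0, 0.0, 3.0]   # shares no terms with the query

for name, vec in [("job_a", job_a), ("job_b", job_b), ("job_c", job_c)]:
    print(name, round(cosine(query, vec), 3))
```

Because the measure depends only on direction, `job_a` scores about 1.0 even though its weights are twice the query's, `job_b` scores about 0.5, and `job_c` scores 0.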

### **B. Implementing the Recommendation System**  

### **C. Steps in the Process:**  
1. **Preprocess the user input** to ensure consistency with the dataset.  
2. **Convert the input into a TF-IDF vector** using the trained `TfidfVectorizer`.  
3. **Compute cosine similarity** between the user’s skillset and all job descriptions.  
4. **Retrieve the top N most similar jobs**, ranked by similarity scores.  
5. **Return job recommendations** with job titles, relevant skills, and similarity scores.


### **D. Output Interpretation**  
- The system **ranks job postings** based on their similarity to the user’s skills.  
- Higher similarity scores indicate **better matches**.  
- The retrieved recommendations include:
  - **Job Title**
  - **Required Skills**
  - **Similarity Score**  

This method provides a **fast and interpretable way** to recommend jobs based on **textual similarity**, though it does not capture deeper semantic meanings as well as **transformer-based embeddings** like SBERT.


In [23]:
# Import cosine_similarity from scikit-learn

from sklearn.metrics.pairwise import cosine_similarity

In [24]:
# Defining the recommender system with TF-IDF and cosine similarity on the 5,000-record sample

def recommend_jobs(user_skills, method='tfidf', top_n=5):
    # Only the TF-IDF method is implemented; fail early on anything else
    # instead of hitting an undefined `similarities` further down
    if method != 'tfidf':
        raise ValueError(f"Unsupported method: '{method}'")

    # Preprocess user input
    user_input = clean_text(user_skills)

    # Transform user input into TF-IDF space
    user_vector = tfidf_vectorizer_5000.transform([user_input])

    # Compute cosine similarity against every posting in the sample
    similarities = cosine_similarity(user_vector, tfidf_matrix_5000).flatten()

    # Get top N job indices based on similarity, highest first
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Retrieve job recommendations with similarity scores
    recommendations = df_merged_5000.iloc[top_indices][['job_title', 'job_skills']].copy()
    recommendations['similarity_score'] = similarities[top_indices]  # Add similarity scores

    return recommendations
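
`argsort` orders every similarity score even though only the top few are needed. As an optional optimization for larger matrices, the sketch below uses `numpy.argpartition` to pick the `top_n` best candidates first and then sorts only those. `top_n_indices` is a hypothetical helper, not part of the notebook, and its ordering of tied scores may differ from the `argsort` approach.

```python
import numpy as np

def top_n_indices(similarities, top_n=5):
    # Hypothetical helper: indices of the top_n highest scores, best first
    top_n = min(top_n, len(similarities))
    if top_n <= 0:
        return np.array([], dtype=int)
    # Partition so the top_n largest values occupy the last top_n slots (unordered)
    candidates = np.argpartition(similarities, -top_n)[-top_n:]
    # Sort just those candidates by descending score
    return candidates[np.argsort(similarities[candidates])[::-1]]

scores = np.array([0.10, 0.71, 0.05, 0.66, 0.39, 0.42, 0.00])
print(top_n_indices(scores, top_n=3))  # [1 3 5]
```

Whether this is measurably faster depends on the data size; for the 5,000-record sample the difference is unlikely to matter.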

# **5. Recommendation System: Job Matching Based on Skills**  

In this section, we test the **job recommendation system** by providing a sample **user input (skills)** and retrieving the **top 5 most relevant job postings** based on **cosine similarity with TF-IDF**.


### **5.1 Testing the Recommender System**  
- The user enters **"Dental Surgery"** as the skill set.  
- The system processes the input and retrieves **the top 5 job matches**.  
- Each recommended job is ranked based on **similarity scores**.


### **5.2 Output Interpretation**  
The table displays:  
- **Job Title** – The recommended job title.  
- **Job Skills** – Relevant skills associated with the job.  
- **Similarity Score** – A numerical value indicating how closely the job matches the user's input.

#### **5.3 Observations from the Output:**  
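- Jobs related to **surgery, dentistry, and healthcare** are retrieved.  
- Higher similarity scores indicate **closer matches**; here the top two results are surgery-related (Certified Surgical Assistant, 0.71; RN, 0.66).  
- The only dentistry posting, **Dentists**, ranks fourth with a score of 0.39, and an **Outpatient Surgery Veterinarian** role also appears, suggesting that **TF-IDF alone may not fully capture contextual meaning**.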


### **5.4 Evaluating Recommendation Accuracy**  
- **Precision@K** is used to measure how often the correct job appears in the **top K recommendations**.  
- If a recommended job is a **direct or highly relevant match**, it is considered correct.  
- The **final precision score** helps assess the **effectiveness of the TF-IDF-based model**.


- ✔ **TF-IDF with Cosine Similarity provides reasonable job recommendations**.  
- ✔ **Some irrelevant results highlight the limitations of keyword-based similarity**.  
- ✔ **Using more advanced embeddings (e.g., SBERT) could improve contextual accuracy**.  


In [25]:
# Example usage with a sample skill input
user_skills = "Dental Surgery"
recommendations = recommend_jobs(user_skills, method='tfidf')

# Display recommendations with similarity scores
print(recommendations)

                                                 job_title  \
120252   CSA - Certified Surgical Assistant - Cardiac -...   
309128                                                  RN   
773309   10ES Registered Nurse - ENT/Gyne/Urology/Plastics   
743414                                            Dentists   
1109673                    Outpatient Surgery Veterinarian   

                                                job_skills  similarity_score  
120252   Suturing, Surgical instrumentation, Patient po...          0.712563  
309128   Nursing, Operating Room, Neuro Surgery, Client...          0.660768  
773309   Nursing, Surgical Oncology, Airway Management,...          0.423656  
743414   General Dentistry, Root Canal, Molar Extractio...          0.389755  
1109673  Veterinary Medicine, Surgery, Dentistry, Conti...          0.383614  


# **6. Evaluating Model Performance: Precision@5**  

To measure how effectively the **job recommendation system** retrieves relevant job postings, we use **Precision@5**, a common metric in ranking-based recommendations.


### **6.1 What is Precision@5?**  
- As used in this notebook, Precision@5 is the fraction of **test records** for which at least one of the **top 5 recommendations** counts as correct. It is a hit rate over the test set, rather than the fraction of relevant items within each top-5 list.  
- A recommendation is considered correct if:
  - The **actual job title** exactly matches a recommended job title, or  
  - The **cosine similarity** between the TF-IDF vectors of the actual job title and a recommended job title **exceeds 0.5**.  

The **Precision@5 score** is computed as:

$$
\text{Precision@5} = \frac{\text{Number of Test Records with a Correct Match in the Top 5}}{\text{Total Test Records}}
$$


### **6.2 How the Function Works**  
1. **Validate the test dataset** to ensure required columns (`job_title`, `job_skills`) exist.  
2. **Loop through each test job** and retrieve recommendations using `recommend_jobs()`.  
3. **Compare actual job titles with recommendations**:
   - Check for **exact title matches**.  
   - Compute **cosine similarity** between the actual job title and each recommended title.  
   - If **similarity > 0.5**, consider it a correct match.  
4. **Count correct predictions** and compute the **final Precision@5 score**.  
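
The counting logic in these steps can be sketched independently of TF-IDF. In the toy example below, each inner list marks which of a test record's five recommendations were judged correct (made-up flags). `hit_rate_at_k` follows the formula used in this notebook, while `mean_precision_at_k` shows the more common per-list definition for comparison. Both function names are illustrative.

```python
def hit_rate_at_k(judgements):
    # Share of test records with at least one correct recommendation
    hits = sum(1 for flags in judgements if any(flags))
    return hits / len(judgements)

def mean_precision_at_k(judgements):
    # Average, over test records, of the fraction of correct recommendations
    return sum(sum(flags) / len(flags) for flags in judgements) / len(judgements)

judgements = [
    [True, False, False, False, False],   # one correct match
    [False, False, False, False, False],  # no match
    [True, True, True, False, False],     # three correct matches
    [False, False, False, False, True],   # one correct match
]

print(round(hit_rate_at_k(judgements), 2))
print(round(mean_precision_at_k(judgements), 2))
```

On these flags the hit rate is 3 out of 4 records, while the per-list precision averages to roughly a quarter, so the two definitions can give very different numbers for the same recommendations.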


### **6.3 Output Interpretation**  
- **Precision@5 Score:** `0.40`  
- This means **40% of test job queries** had a correct job in the **top 5 recommendations**.  
- A **higher score** indicates a **better recommendation system**.  


- ✔ **The system successfully retrieves relevant job matches, but accuracy can be improved.**  
- ✔ **Lower-than-expected Precision@5 may indicate limitations in TF-IDF-based matching.**  
- ✔ **Using more advanced models like SBERT or fine-tuned embeddings could enhance contextual understanding.**  

By analyzing **Precision@5**, we can determine areas for improvement and refine the recommendation approach.


In [26]:
# Evaluate every row of test_data against the 5,000-record recommender and compute Precision@5.

def precision_at_5(test_data, threshold=0.5):
    # Validate columns
    required_columns = ['job_title', 'job_skills']
    for col in required_columns:
        if col not in test_data.columns:
            raise KeyError(f"Missing column: '{col}' in test_data")

    correct_predictions = 0
    total_tests = len(test_data)

    for index, row in test_data.iterrows():
        actual_job_title = row['job_title']
        user_skills = row['job_skills']
        # Get job recommendations for the given user skills
        recommendations = recommend_jobs(user_skills, method='tfidf', top_n=5)
        # Vectorize the actual title once per test row
        actual_vector = tfidf_vectorizer_5000.transform([actual_job_title])

        matched = False
        for rec_title in recommendations['job_title']:
            # Compute similarity between the actual and the recommended job title
            sim_score = cosine_similarity(
                actual_vector,
                tfidf_vectorizer_5000.transform([rec_title])
            )[0][0]
            # If the exact title matches or similarity exceeds the threshold, count it as correct
            if actual_job_title == rec_title or sim_score > threshold:
                matched = True
                break

        if matched:
            correct_predictions += 1
    # Fraction of test rows with at least one match in the top 5
    # (named `score` to avoid shadowing the function name)
    score = correct_predictions / total_tests
    return round(score, 4)

In [27]:
# Compute precision@5 for the 5000-record dataset
precision_score_5000 = precision_at_5(test_data)
print("Precision@5 Score:", precision_score_5000)

Precision@5 Score: 0.4


# **Second Section: Running the Recommendation System on the Full Dataset**  

# **1. TF-IDF**

In [28]:
# Apply TF-IDF vectorization on the full dataset
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df_merged['combined_text'])

In [29]:
# Print the shape of the TF-IDF matrix to confirm feature extraction
print("TF-IDF Matrix Shape:", tfidf_matrix.shape)

TF-IDF Matrix Shape: (1290930, 260160)


# **2. TF-IDF with Cosine_Similarity**

In [30]:
# Defining the recommender system using TF-IDF and cosine similarity for the full dataset.
# This redefinition replaces the 5,000-record version of recommend_jobs defined earlier.

def recommend_jobs(user_skills, method='tfidf', top_n=5):
    # Only the TF-IDF method is implemented; fail early on anything else
    # instead of hitting an undefined `similarities` further down
    if method != 'tfidf':
        raise ValueError(f"Unsupported method: '{method}'")

    # Preprocess user input
    user_input = clean_text(user_skills)

    # Transform user input into TF-IDF space
    user_vector = tfidf_vectorizer.transform([user_input])

    # Compute cosine similarity against every posting in the full dataset
    similarities = cosine_similarity(user_vector, tfidf_matrix).flatten()

    # Get top N job indices based on similarity, highest first
    top_indices = similarities.argsort()[-top_n:][::-1]

    # Retrieve job recommendations with similarity scores
    recommendations = df_merged.iloc[top_indices][['job_title', 'job_skills']].copy()
    recommendations['similarity_score'] = similarities[top_indices]  # Add similarity scores

    return recommendations

# <h3 style="color:blue;"> **3. Recommendation System** </h3>

###### Testing the recommender system on the full dataset with the same sample input ("Dental Surgery")

In [31]:
# Example usage with a sample skill input
user_skills = "Dental Surgery"
recommendations = recommend_jobs(user_skills, method='tfidf')

# Display recommendations with similarity scores
print(recommendations)

                                            job_title  \
999935                                      Surg Tech   
85815            Registered Nurse (RN)-Operating Room   
268741                            Surgical Technician   
729278  Certified Registered Nurse Anesthetist (CRNA)   
267565                 Anesthesiologist (ANES) Locums   

                                               job_skills  similarity_score  
999935  Surgical Technology, EPIC, General Surgery, GY...          0.770690  
85815   Robotics, Pediatrics, Urology, Nursing, Patien...          0.741812  
268741  Surgical Technician, Patient Care, Equipment M...          0.720098  
729278  Nurse Anesthesia, Surgical Procedures, Nursing...          0.716647  
267565  General Surgery, GYN Surgery, ENT Surgery, Uro...          0.713267  


# <h3 style="color:blue;"> **4. Evaluation: Precision@K** </h3>

##### Defining the function that calculates precision on the test data: it computes the cosine similarity between the actual job title and each of the top 5 recommended titles, using `threshold=0.5` as the match cutoff

In [32]:
# Evaluate every row of test_data against the full-dataset recommender and compute Precision@5.

def precision_at_5(test_data, threshold=0.5):
    # Validate columns
    required_columns = ['job_title', 'job_skills']
    for col in required_columns:
        if col not in test_data.columns:
            raise KeyError(f"Missing column: '{col}' in test_data")

    correct_predictions = 0
    total_tests = len(test_data)

    for index, row in test_data.iterrows():
        actual_job_title = row['job_title']
        user_skills = row['job_skills']
        # Get job recommendations for the given user skills
        recommendations = recommend_jobs(user_skills, method='tfidf', top_n=5)
        # Vectorize the actual title once per test row
        actual_vector = tfidf_vectorizer.transform([actual_job_title])

        matched = False
        for rec_title in recommendations['job_title']:
            # Compute similarity between the actual and the recommended job title
            sim_score = cosine_similarity(
                actual_vector,
                tfidf_vectorizer.transform([rec_title])
            )[0][0]
            # If the exact title matches or similarity exceeds the threshold, count it as correct
            if actual_job_title == rec_title or sim_score > threshold:
                matched = True
                break

        if matched:
            correct_predictions += 1
    # Fraction of test rows with at least one match in the top 5
    # (named `score` to avoid shadowing the function name)
    score = correct_predictions / total_tests
    return round(score, 4)

In [33]:
# Compute precision@5 for the full dataset
precision_score = precision_at_5(test_data)
print("Precision@5 Score (TF-IDF):", precision_score)

Precision@5 Score (TF-IDF): 0.98


<h3 style="color:blue;"> 6. Conclusion </h3>

The **TF-IDF** method was used for feature extraction and **Cosine Similarity** for matching, achieving a **Precision@5 score of 0.98** on the full dataset, which indicates highly accurate recommendations on the current test set. Cleaning the text and combining job titles with skills into a single field let both contribute to each match.  

**TF-IDF** demonstrated efficiency in keyword-based recommendations but may be less effective when exact keyword matches are unavailable, as it does not capture semantic similarity between different words.  

The impact of dataset size on matching accuracy was evaluated using two sets: a small sample (5,000 records) and the full dataset (1,290,930 records). With the full dataset, the TF-IDF vocabulary grew from 15,557 to 260,160 terms and Precision@5 rose from 0.40 to 0.98. With the small sample, far fewer postings and terms were available to match against, which limited recommendation quality.  

Overall, the method showed strong performance in mapping skills to jobs when sufficient data was available, reinforcing its potential as a practical job recommendation system.