# LinkedIn Recommendation System

## Dataset download

This project uses the **"1.3M LinkedIn Jobs and Skills 2024"** dataset available on [Kaggle](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024).

The dataset contains over **1.3 million LinkedIn job postings** collected in 2024, including detailed information on job titles, descriptions, companies, and associated skills. It is used to train and evaluate our job recommendation system.

### Download Options

You can obtain the dataset in two ways:

1. **Using the Kaggle API (Recommended)**: automatic download and extraction.
2. **Manual Download**: download the ZIP file directly from the dataset page and extract it yourself.


### Option 1: Using the Kaggle API

To use the Kaggle API, ensure you have the Kaggle CLI installed and configured.

```bash
# Install Kaggle CLI
pip install kaggle

# Move your Kaggle API key (kaggle.json) into place
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Once configured, you can run the provided script to automatically download and unzip the dataset into the `data/` folder.

```bash
chmod +x ./download_linkedin_dataset.sh
./download_linkedin_dataset.sh
```

This script:
- Creates the `data/` folder if it does not exist.
- Downloads the dataset from Kaggle.
- Extracts the contents.
- Removes the ZIP file after extraction.
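
The script itself is not reproduced in this README. As a rough Python equivalent of the four steps above, the sketch below assumes the Kaggle CLI is installed and on the `PATH`. The function names (`build_download_command`, `extract_and_remove`, `download_dataset`) are hypothetical and only illustrate the flow; they are not the contents of `download_linkedin_dataset.sh`.

```python
import subprocess
import zipfile
from pathlib import Path

DATASET = "asaniczka/1-3m-linkedin-jobs-and-skills-2024"


def build_download_command(dataset: str, target_dir: Path) -> list[str]:
    """Build the Kaggle CLI call that downloads the dataset ZIP into target_dir."""
    return ["kaggle", "datasets", "download", "-d", dataset, "-p", str(target_dir)]


def extract_and_remove(zip_path: Path, target_dir: Path) -> list[str]:
    """Extract the archive into target_dir, delete the ZIP, return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target_dir)
    zip_path.unlink()
    return names


def download_dataset(dataset: str = DATASET, target_dir: Path = Path("data")) -> None:
    """Hypothetical end-to-end helper: create the folder, download, extract, clean up."""
    target_dir.mkdir(parents=True, exist_ok=True)  # create data/ if it does not exist
    subprocess.run(build_download_command(dataset, target_dir), check=True)
    for zip_path in list(target_dir.glob("*.zip")):  # extract every downloaded archive
        extract_and_remove(zip_path, target_dir)
```

Calling `download_dataset()` needs network access and valid Kaggle credentials, so it is left out of the snippet.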

### Option 2: Manual Download

If you prefer not to use the Kaggle API, you can manually download the dataset from:

**[https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024)**

After downloading:
1. Extract the ZIP file.  
2. Move all extracted files into the `data/` directory in your project.

### Notes
- Ensure your Kaggle API credentials (`kaggle.json`) are correctly configured in `~/.kaggle/`.
- The dataset is distributed under the **ODC Attribution License (ODC-By)**.
- The total size is about 2 GB (~1.3M entries), so the download may take several minutes depending on your connection.

## Data preparation

### Load and Inner Join

We load three CSVs:

- `job_postings_df` from `./data/linkedin_job_postings.csv`
- `job_summary_df` from `./data/job_summary.csv`
- `job_skills_df` from `./data/job_skills.csv`

We then **inner join** on the unique key `job_link`:

- `jobs_df = pd.merge(job_postings_df, job_skills_df, on="job_link", how="inner")`
- `jobs_df = pd.merge(jobs_df, job_summary_df, on="job_link", how="inner")`

This keeps only postings that exist in all three sources and ensures aligned rows across tables.
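
To illustrate what the inner join keeps, here is a small sketch on made-up data (the three toy frames are not taken from the dataset). The `validate="one_to_one"` argument is an optional addition that is not used in the notebook; it makes pandas raise an error if `job_link` turns out not to be unique on either side.

```python
import pandas as pd

postings = pd.DataFrame({"job_link": ["a", "b", "c"], "job_title": ["Nurse", "Chef", "Clerk"]})
skills = pd.DataFrame({"job_link": ["a", "b"], "job_skills": ["care", "cooking"]})
summaries = pd.DataFrame({"job_link": ["b", "c"], "job_summary": ["Cook meals", "File papers"]})

# validate="one_to_one" raises MergeError if job_link is duplicated on either side
merged = pd.merge(postings, skills, on="job_link", how="inner", validate="one_to_one")
merged = pd.merge(merged, summaries, on="job_link", how="inner", validate="one_to_one")

print(merged)  # only "b" is present in all three tables
```

Postings `a` and `c` are dropped because each is missing from one of the three tables.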


In [1]:
import pandas as pd

# Load the datasets
job_skills_df = pd.read_csv('./data/job_skills.csv')
job_summary_df = pd.read_csv('./data/job_summary.csv')
job_postings_df = pd.read_csv('./data/linkedin_job_postings.csv')

In [2]:
job_skills_df.head()

| | job_link | job_skills |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/housekeeper... | Building Custodial Services, Cleaning, Janitor... |
| 1 | https://www.linkedin.com/jobs/view/assistant-g... | Customer service, Restaurant management, Food ... |
| 2 | https://www.linkedin.com/jobs/view/school-base... | Applied Behavior Analysis (ABA), Data analysis... |
| 3 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Engineering, Project Controls, Sche... |
| 4 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Assembly, Point to point wiring, St... |


In [3]:
job_summary_df.head()

| | job_link | job_summary |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/restaurant-... | Rock N Roll Sushi is hiring a Restaurant Manag... |
| 1 | https://www.linkedin.com/jobs/view/med-surg-re... | Schedule\n: PRN is required minimum 12 hours p... |
| 2 | https://www.linkedin.com/jobs/view/registered-... | Description\nIntroduction\nAre you looking for... |
| 3 | https://uk.linkedin.com/jobs/view/commercial-a... | Commercial account executive\nSheffield\nFull ... |
| 4 | https://www.linkedin.com/jobs/view/store-manag... | Address:\nUSA-CT-Newington-44 Fenn Road\nStore... |


In [4]:
job_postings_df.head()

| | job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.00256+00 | t | t | f | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | United States | Color Maker | Mid senior | Onsite |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.88137+00 | t | t | f | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | United States | Director Nursing Service | Mid senior | Onsite |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00 | t | t | f | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | United States | Stand-In | Mid senior | Onsite |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00 | t | t | f | Independent Real Estate Agent | Howard Hanna \| Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | United States | Real-Estate Clerk | Mid senior | Onsite |
| 4 | https://www.linkedin.com/jobs/view/group-unit-... | 2024-01-19 09:45:09.215838+00 | f | f | f | Group/Unit Supervisor (Systems Support Manager... | IRS, Office of Chief Counsel | Chamblee, GA | 2024-01-17 | Gadsden | United States | Supervisor Travel-Information Center | Mid senior | Onsite |


In [5]:
jobs_df = pd.merge(job_postings_df, job_skills_df, on='job_link', how='inner')
jobs_df = pd.merge(jobs_df, job_summary_df, on='job_link', how='inner')

In [6]:
jobs_df.describe()

| | job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | job_skills | job_summary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296372 | 1296362 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1294296 | 1296381 |
| unique | 1296381 | 722728 | 1 | 1 | 1 | 565695 | 88995 | 28791 | 6 | 1018 | 4 | 1923 | 2 | 3 | 1287101 | 957570 |
| top | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-19 09:45:09.215838+00 | t | t | f | LEAD SALES ASSOCIATE-FT | Health eCareers | New York, NY | 2024-01-14 | North Carolina | United States | Account Executive | Mid senior | Onsite | Front Counter, DriveThru, Outside Order Taker,... | Dollar General Corporation has been delivering... |
| freq | 1 | 573487 | 1296381 | 1296381 | 1296381 | 7315 | 40049 | 12580 | 459354 | 9495 | 1105410 | 19465 | 1155276 | 1285565 | 169 | 4565 |


In [7]:
input_cols = [
    'job_link',
    'job_title',
    'job_location',
    'search_country',
    'job_skills',
    'company',
    'search_position',
    'job_summary',
    'job_level',
]

### Row Filtering

We remove rows that do not contain the required NLP outputs and rows flagged as in-progress:

- Drop entries **without NER** results (`got_ner` is not `"t"`).
- Drop entries **without summary** (`got_summary` is not `"t"`).
- Drop entries that are still **being processed** (`is_being_worked` is not `"f"`).

This reduces noise and guarantees each training sample has complete text features.

In [8]:
jobs_df = jobs_df.loc[
    (jobs_df["is_being_worked"] == "f")
    & (jobs_df["got_summary"] == "t")
    & (jobs_df["got_ner"] == "t")
]

jobs_df = jobs_df[input_cols].dropna().reset_index(drop=True)

### Title Normalization and Reduction of Unique Values

We normalize `job_title` with a custom function:

1. **Lowercase** titles.
2. **Trim at the first dash**: keep text before `"-"` to collapse variants like  
   `Senior Software Engineer - Backend` → `senior software engineer`.
3. **Remove parenthetical fragments**: cut the title at the first `(`, which drops the parenthetical and anything after it, e.g.  
   `data scientist (NLP)` → `data scientist`.
4. **Strip whitespace**.

**Effect:** Different textual variants map to a **single canonical form**, which **reduces the number of unique job titles** and stabilizes downstream grouping and modeling.
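
As a quick check of these rules, the sketch below applies an equivalent standalone function (`clean_job_title`, a hypothetical name that mirrors the notebook's `_clean_job_titles`) to a few made-up titles. The third example is a caveat worth knowing: because the cut happens at the first `-` of any kind, hyphenated words are truncated too.

```python
def clean_job_title(job_title: str) -> str:
    """Lowercase, cut at the first '-' and the first '(', then strip whitespace."""
    job_title = job_title.lower()
    job_title = job_title.split("-")[0]
    job_title = job_title.split("(")[0]
    return job_title.strip()


examples = [
    "Senior Software Engineer - Backend",
    "Data Scientist (NLP)",
    "Part-Time Cashier",  # the hyphen inside the word also triggers the cut
]
cleaned = [clean_job_title(title) for title in examples]
print(cleaned)
# ['senior software engineer', 'data scientist', 'part']
```

If truncated titles such as `part` turn out to matter, one possible refinement would be to split only on a dash surrounded by spaces; that is a design option, not something the notebook does.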


In [9]:
def _clean_job_titles(job_title: str):
    job_title = job_title.lower()
    if '-' in job_title:
        job_title = job_title.split('-')[0].strip()
    job_title = job_title.split('(')[0].strip()
    return job_title.strip()

jobs_df['job_title'] = jobs_df['job_title'].astype(str).apply(_clean_job_titles)

### Skill Canonicalization and Reduction of Unique Values

We clean `job_skills` as a comma-separated list:

1. **Lowercase** the whole string.
2. **Split by comma**.
3. **Remove non-letter characters** from each token, collapse repeated whitespace, and **strip** the ends.
4. **De-duplicate per posting** to avoid repeated skills.
5. **Sort tokens** so the per-row skill list has a consistent order.

We also track a **global set of unique skills** to measure coverage.

**Effect:** Canonicalization merges superficial variants and ordering differences, which **reduces both per-row and global unique skill counts**. This yields a more compact and reliable skill space.
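
The following sketch reproduces the canonicalization on two made-up skill strings. The function name `canonicalize_skills` is hypothetical; like the notebook cell, it also strips every non-letter character, which has the side effect shown in the second example.

```python
import re


def canonicalize_skills(skills_str: str) -> str:
    """Lowercase, split on commas, keep letters only, de-duplicate and sort."""
    tokens = set()
    for part in skills_str.lower().split(","):
        token = re.sub(r"[^a-z\s]", "", part)        # keep letters and whitespace only
        token = re.sub(r"\s+", " ", token).strip()   # collapse runs of whitespace
        if token:
            tokens.add(token)
    return ", ".join(sorted(tokens))


print(canonicalize_skills("SQL, Python,  python , Data   Analysis"))
# data analysis, python, sql
print(canonicalize_skills("C++, C#, C"))
# c
```

Dropping digits and symbols merges distinct skills such as `C++`, `C#` and `C` into a single token, so the reduction in vocabulary comes with some loss of precision.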


In [10]:
import re

# create a cleaned list of skills per job and a global unique skills array
def _clean_split_skills(skills_str: str):
    skills_str = skills_str.lower()
    parts = str(skills_str).split(',')
    unique = set()
    for part in parts:
        # remove non A-Za-z characters except whitespace, collapse spaces and strip ends
        c = re.sub(r'[^A-Za-z\s]', '', part)
        c = re.sub(r'\s+', ' ', c).strip()
        if c:
            unique.add(c)
    # return a deterministic, cleaned, ordered string
    ordered = sorted(unique)
    return ', '.join(ordered)

jobs_df['job_skills'] = jobs_df['job_skills'].astype(str).apply(_clean_split_skills)

In [11]:
display(jobs_df.head())

| | job_link | job_title | job_location | search_country | job_skills | company | search_position | job_summary | job_level |
|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | account executive | San Diego, CA | United States | bachelors degree, bd offerings, challenges, co... | BD | Color Maker | Responsibilities\nJob Description Summary\nJob... | Mid senior |
| 1 | https://www.linkedin.com/jobs/view/registered-... | registered nurse | Norton Shores, MI | United States | bachelor of science in nursing, care managemen... | Trinity Health MI | Director Nursing Service | Employment Type:\nFull time\nShift:\nDescripti... | Mid senior |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | restaurant supervisor | Sandy, UT | United States | arithmetic skills, bending and kneeling abilit... | Wasatch Adaptive Sports | Stand-In | Job Details\nDescription\nWhat You'll Do\nAs a... | Mid senior |
| 3 | https://www.linkedin.com/jobs/view/independent... | independent real estate agent | Englewood Cliffs, NJ | United States | closing statements, communication, customer se... | Howard Hanna \| Rand Realty | Real-Estate Clerk | Who We Are\nRand Realty is a family-owned brok... | Mid senior |
| 4 | https://www.linkedin.com/jobs/view/registered-... | registered nurse | Muskegon, MI | United States | bsn, diversity, equal opportunity employer, eq... | Trinity Health MI | Nurse Practitioner | Employment Type:\nFull time\nShift:\n12 Hour N... | Mid senior |


In [12]:
seen = set()
skills_array = []
for lst in jobs_df['job_skills']:
    for skill in lst.split(','):
        skill = skill.strip()
        if skill not in seen:
            seen.add(skill)
            skills_array.append(skill)

print(f"Jobs rows: {len(jobs_df)}, sample job_skills (first 5):\n", jobs_df['job_skills'].head())
print(f"Global unique skills count: {len(skills_array)}")
#skills_array[:20]

Jobs rows: 1294268, sample job_skills (first 5):
 0    bachelors degree, bd offerings, challenges, co...
1    bachelor of science in nursing, care managemen...
2    arithmetic skills, bending and kneeling abilit...
3    closing statements, communication, customer se...
4    bsn, diversity, equal opportunity employer, eq...
Name: job_skills, dtype: object
Global unique skills count: 2668569


### Location Cleaning and Reduction of Unique Values

We standardize `job_location` by **keeping only the part before the first comma**:

- Example: `San Diego, CA` → `San Diego`

**Effect:** This collapses formatting variants that differ only by state or country suffix. It **reduces the number of unique locations** and helps counter sparse geography fields while preserving city-level signal.
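
A short sketch on made-up location strings shows both the intended collapse and a possible side effect. The function name `clean_location` is a hypothetical stand-in for the notebook's `_clean_location`.

```python
def clean_location(loc: str) -> str:
    """Keep only the text before the first comma."""
    return loc.split(",")[0].strip()


raw = [
    "San Diego, CA",
    "New York, NY",
    "New York, United States",
    "Springfield, IL",
    "Springfield, MO",
]
unique_cities = sorted({clean_location(loc) for loc in raw})
print(unique_cities)
# ['New York', 'San Diego', 'Springfield']
```

The two `New York` variants merge as intended, but so do the two different Springfields. Combining the cleaned city with `search_country` would separate same-named cities across countries, though not within one country.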

In [13]:
def _clean_location(loc: str):
    return loc.split(',')[0].strip()

jobs_df['job_location'] = jobs_df['job_location'].astype(str).apply(_clean_location)

## Data Filtering
The clean dataset contains 1,294,268 job posts. To keep the analysis simple, we focus on the city with the most job posts (New York).

In [14]:
jobs_df.shape

(1294268, 9)

In [15]:
jobs_df['job_location'].value_counts()

job_location
New York         14850
London           11551
Houston          10380
Chicago          10154
Los Angeles       9724
                 ...  
Mathern              1
Hurdle Mills         1
Haylands             1
Kirribilli           1
Yallabatharra        1
Name: count, Length: 21036, dtype: int64

In [16]:
# Alternative filter (disabled): keep only locations with at least min_jobs posts
# and sample at most max_jobs posts from each of them.

# min_jobs = 100      
# max_jobs = 500   

# jobs_df = (
#     jobs_df
#     .groupby(['search_country', 'job_location'])
#     .filter(lambda x: len(x) >= min_jobs)  
#     .groupby(['search_country', 'job_location'], group_keys=False)
#     .apply(lambda x: x.sample(n=min(len(x), max_jobs), random_state=42))
# ).reset_index(drop=True)

In [17]:
jobs_df = jobs_df[jobs_df['job_location'] == 'New York'].reset_index(drop=True)

In [18]:
jobs_df.describe()

| | job_link | job_title | job_location | search_country | job_skills | company | search_position | job_summary | job_level |
|---|---|---|---|---|---|---|---|---|---|
| count | 14850 | 14850 | 14850 | 14850 | 14850 | 14850 | 14850 | 14850 | 14850 |
| unique | 14850 | 8822 | 1 | 2 | 14842 | 4032 | 1147 | 13817 | 2 |
| top | https://www.linkedin.com/jobs/view/part-time-h... | executive assistant | New York | United States | background check, csea tuition vouchers, envir... | DocCafe | Account Executive | AC-Full E2E Rule Based Multiplexing Test - Obj... | Mid senior |
| freq | 1 | 174 | 14850 | 14849 | 2 | 218 | 402 | 35 | 12481 |


### Feature Set Used

After preprocessing, the working DataFrame includes the key fields required for analysis and modeling:

- `job_link` (primary key, post-join)
- `job_title` (normalized)
- `job_location` (city-only normalized)
- `search_country`
- `job_skills` (cleaned, sorted, de-duplicated)
- `company`
- `search_position`
- `job_summary` (job description text)
- `job_level`

These fields form the basis for representation building and recommendation.


### Impact on Cardinality (Unique Values)

The following transformations are specifically designed to **reduce the number of unique values**:

- **Job titles:** lowercasing, dash-trim, and parenthesis removal collapse stylistic variants.
- **Skills:** lowercase normalization, de-duplication, and sorted lists produce canonical rows and reduce global skill vocabulary.
- **Locations:** truncation before the first comma unifies location strings.

This cardinality reduction improves:
- Statistical reliability of counts and co-occurrences.
- Memory usage and runtime.
- Model stability and generalization.
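
One way to quantify this effect is to compare `nunique()` per column before and after cleaning. The helper below, `cardinality_report`, is a hypothetical addition that is not part of the notebook, and the two small frames are made-up data standing in for the raw and cleaned versions of `jobs_df`.

```python
import pandas as pd


def cardinality_report(before: pd.DataFrame, after: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Compare the number of unique values per column before and after cleaning."""
    report = pd.DataFrame({
        "unique_before": before[columns].nunique(),
        "unique_after": after[columns].nunique(),
    })
    report["reduction_pct"] = (
        100 * (1 - report["unique_after"] / report["unique_before"])
    ).round(1)
    return report


before = pd.DataFrame({
    "job_title": ["Data Scientist (NLP)", "data scientist", "Nurse - ICU", "Nurse"],
    "job_location": ["Austin, TX", "Austin, Texas", "Boston, MA", "Boston, MA"],
})
after = pd.DataFrame({
    "job_title": ["data scientist", "data scientist", "nurse", "nurse"],
    "job_location": ["Austin", "Austin", "Boston", "Boston"],
})
report = cardinality_report(before, after, ["job_title", "job_location"])
print(report)
```

Applied to the real data, this would require keeping a copy of the columns before the cleaning cells run.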

# Content-Based Recommendation System

In [19]:
# number of distinct skill-list strings (one per posting), not distinct individual skills
jobs_df["job_skills"].nunique()

14842

In [20]:
from collections import defaultdict
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

To build a content-based recommendation system, the dataset must first be preprocessed. This matters because the dataset is large, and unprocessed text makes computation slow and expensive.
The New York subset contains 14,842 distinct skill lists, and many individual skills occur only a handful of times. As a first step, we remove rare skills that appear 10 times or fewer, which reduces noise and improves the quality of the TF-IDF representations.
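
The boundary of the rare-skill filter is easy to get wrong, so here is a toy illustration with made-up skill lists and a threshold of 1 instead of 10. It uses the same inclusive comparison (`<=`) as the cell below, so a skill whose count equals the threshold is removed as well.

```python
from collections import Counter

skill_lists = ["python, sql", "python, excel", "python, sql, fortran"]
threshold = 1  # toy value; the notebook uses 10

# Count how often each individual skill occurs across all postings
counts = Counter(s.strip() for skills in skill_lists for s in skills.split(","))

# Inclusive bound: skills with count == threshold are treated as rare too
rare = {skill for skill, n in counts.items() if n <= threshold}

kept = [
    ", ".join(s.strip() for s in skills.split(",") if s.strip() not in rare)
    for skills in skill_lists
]
print(sorted(rare))  # ['excel', 'fortran']
print(kept)          # ['python, sql', 'python', 'python, sql']
```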

In [21]:
rare_skill_threshold = 10

In [22]:
def remove_rare_skills(jobs_df, rare_skill_threshold):
    # Flatten all skills and strip spaces
    all_skills = [s.strip() for skills in jobs_df['job_skills'] for s in skills.split(',')]
    
    # Count frequency
    skill_counts = Counter(all_skills)
    
    # Identify rare skills
    rare_skills = {skill for skill, count in skill_counts.items() if count <= rare_skill_threshold}
    
    # Remove rare skills from each job
    def filter_skills(skills_str):
        skills = [s.strip() for s in skills_str.split(',')]
        skills = [s for s in skills if s not in rare_skills]
        return ', '.join(skills)
    
    jobs_df['job_skills'] = jobs_df['job_skills'].apply(filter_skills)
    # Remove jobs with no skills left
    jobs_df = jobs_df[jobs_df['job_skills'].str.strip() != ''].reset_index(drop=True)
    return jobs_df

jobs_df = remove_rare_skills(jobs_df, rare_skill_threshold)
jobs_df.shape

(14767, 9)

As the next step, we preprocess the data to prepare it for vectorization. Since our goal is to recommend jobs based on a user's title and skills, we combine these two fields into a single text input. This merged representation lets the TF-IDF vectorizer draw on terms from both the job title and its associated skills.

In [23]:
def preprocess(df):
    df['title_skills'] = df['job_title'] + " " + df['job_skills']
    return df
jobs_df = preprocess(jobs_df)
    

Finally, we create the TF-IDF representation. At this stage, we select the `min_df` and `max_df` parameters, which bound the number of documents a term may appear in to be included in the vocabulary.

`min_df` filters out terms that appear in too few documents (here, fewer than 100 job posts).

`max_df` filters out terms that appear in too many documents (here, more than 80% of job posts).

After fitting the vectorizer, we obtain a TF-IDF matrix of shape (14,767 × 526), where each row corresponds to a job post and each column represents a distinct term from the combined title and skills text.

In [24]:
vectorizer = TfidfVectorizer(
            stop_words='english',
            max_features=10000,       
            min_df=100,               
            max_df=0.8                
        )
matrix = vectorizer.fit_transform(jobs_df["title_skills"])

Once we have the TF-IDF matrix, we can generate job recommendations based on a user's title and skills. The user provides a job title of interest along with a list of skills they possess. The algorithm concatenates these inputs into a single text string and uses the vectorizer's transform function to convert it into a TF-IDF vector.
We then compute cosine similarity between the user's vector and all job posts in the dataset. Finally, the system selects the top 5 job posts with the highest similarity scores, returning the most relevant recommendations.
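
The ranking step can be illustrated on a toy corpus before looking at the full implementation. The four job strings and the query are made up, and the vectorizer uses default settings rather than the notebook's parameters.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

jobs = [
    "data scientist python, machine learning, sql",
    "registered nurse patient care, medication administration",
    "software engineer java, python, git",
    "executive assistant scheduling, communication",
]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(jobs)

# Build the query text the same way: title + " " + comma-joined skills
query = "data scientist" + " " + ", ".join(["python", "sql"])
scores = cosine_similarity(vectorizer.transform([query]), matrix).flatten()

# Indices of the two most similar jobs, best first
top = np.argsort(scores)[::-1][:2]
print(top)
```

The data scientist posting ranks first because it shares four terms with the query; the software engineer posting comes second through the shared term `python`.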

In [25]:
def preprocess_input(title, skills):
    input_title_skills = title + " " + ", ".join(skills)
    return input_title_skills

def fit_input_tfidf(vectorizer, input_title_skills):
    input_vec = vectorizer.transform([input_title_skills])
    return input_vec

def print_recommendations(recommended_jobs, columns=None):
    if columns is None:
        columns = ['company', 'job_title', 'job_skills', 'job_location']
    print("Recommended jobs:")
    display(recommended_jobs[columns])

def recommend(df, query, vectorizer, matrix, top_k=5, return_scores=False):
    input_title_skills = preprocess_input(query['title'], query['skills'])
    input_vec = fit_input_tfidf(vectorizer, input_title_skills)

    # Filter by city, fallback to country
    jobs_to_search = df[df['job_location'] == query['city']].index.to_list()
    if len(jobs_to_search) < 2:
        jobs_to_search = df[df['search_country'] == query['country']].index.to_list()

    matrix_subset = matrix[jobs_to_search]
    cos_scores = cosine_similarity(input_vec, matrix_subset).flatten()
    top_indices_in_subset = cos_scores.argsort()[::-1][:top_k]
    similar_indices = [jobs_to_search[i] for i in top_indices_in_subset]
    recommended_jobs = df.iloc[similar_indices].copy()

    # Attach the similarity scores if requested
    if return_scores:
        recommended_jobs["skills_score"] = cos_scores[top_indices_in_subset]
        return recommended_jobs.reset_index(drop=True)

    print_recommendations(recommended_jobs)
    return recommended_jobs.reset_index(drop=True)

## NLP Features Based on the Job Descriptions
In this section, we add NLP features derived from the job descriptions to the recommendation system.
First we import the necessary packages.

In [26]:
import numpy as np

from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.preprocessing import normalize

from scipy.sparse import csr_matrix, issparse

Next we create a class that recommends jobs based on the job text (title and description, optionally followed by the skills).
Techniques used inside the recommender:
- TF-IDF with n-grams
- Latent Semantic Analysis (LSA) via TruncatedSVD
- Topic modelling via NMF
- Rocchio pseudo-relevance feedback for query expansion
- A weighted combination of the lexical, LSA and topic similarities into one score
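
For reference, the last two items can be written compactly. The notation is introduced here and does not appear in the code: $\mathbf{q}$ is the TF-IDF query vector, $D_{+}$ and $D_{-}$ are the top-ranked and bottom-ranked jobs under the initial cosine ranking, and the maximum is taken element-wise. The class below uses the defaults $\alpha = 1.0$, $\beta = 0.75$, $\gamma = 0.15$, $|D_{+}| = 20$ and $|D_{-}| = 10$.

$$
\mathbf{q}' = \max\left(\mathbf{0},\; \alpha\,\mathbf{q} + \frac{\beta}{|D_{+}|}\sum_{\mathbf{d}\in D_{+}}\mathbf{d} - \frac{\gamma}{|D_{-}|}\sum_{\mathbf{d}\in D_{-}}\mathbf{d}\right)
$$

Each similarity vector $s$ (lexical, LSA, topic) is min-max normalized over all jobs $i$ and the three are then combined with weights $w$:

$$
\hat{s}_i = \frac{s_i - \min_j s_j}{\max_j s_j - \min_j s_j},
\qquad
S_i = w_{\text{lex}}\,\hat{s}^{\text{lex}}_i + w_{\text{lsa}}\,\hat{s}^{\text{lsa}}_i + w_{\text{topic}}\,\hat{s}^{\text{topic}}_i
$$

With the default weights of 0.3 each, the combined score $S_i$ lies in $[0, 0.9]$.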

In [27]:
class TextRecommender:
    """
    Job-description based recommender.

    It builds three representations of each job posting:
      1. Lexical: TF-IDF with n-grams
      2. Low-dimensional semantic space: Latent Semantic Analysis (LSA) via TruncatedSVD
      3. Topics: topic modelling via NMF

    Queries can optionally be expanded with Rocchio pseudo-relevance feedback.
    Similarities are computed in all three spaces and combined into a single score.
    """

    def __init__(
        self,
        jobs_df,
        title_col="job_title",
        desc_col="job_summary",
        skills_col=None,
        min_df=5,
        max_df=0.9,
        ngram_range=(1, 2),
        stop_words="english",
        n_lsa_components=150,
        n_topics=20,
        random_state=42,
    ):
        """
        Parameters
        ----------
        jobs_df : pd.DataFrame
            DataFrame containing job postings.
        title_col : str
            Column name for the job title text.
        desc_col : str
            Column name for the job description / summary text.
        skills_col : str or None
            Optional column name for skills text to append.
        min_df, max_df, ngram_range, stop_words :
            TF-IDF hyperparameters.
        n_lsa_components : int
            Number of latent dimensions for LSA.
        n_topics : int
            Number of topics for NMF.
        random_state : int
            Random seed for reproducibility.
        """

        # Store a clean copy of the input data
        self.jobs_df = jobs_df.copy().reset_index(drop=True)
        self.title_col = title_col
        self.desc_col = desc_col
        self.skills_col = skills_col

        # Basic sanity check: make sure required columns exist
        required_cols = [title_col, desc_col]
        if skills_col is not None:
            required_cols.append(skills_col)

        missing = [c for c in required_cols if c not in self.jobs_df.columns]
        if missing:
            raise ValueError(
                f"Missing required columns in jobs_df: {missing}"
            )

        # Build a single text field per job: title + description (+ skills)
        #    - Replace NaNs with empty strings
        #    - Cast to string to avoid problems in TfidfVectorizer
        title_text = self.jobs_df[title_col].fillna("").astype(str)
        desc_text = self.jobs_df[desc_col].fillna("").astype(str)

        if skills_col is not None:
            skills_text = self.jobs_df[skills_col].fillna("").astype(str)
            full_text = title_text + " " + desc_text + " " + skills_text
        else:
            full_text = title_text + " " + desc_text

        # Store the combined text as an internal column
        self.jobs_df["__full_text__"] = full_text

        # TF-IDF representation for all jobs
        self.vectorizer = TfidfVectorizer(
            min_df=min_df,
            max_df=max_df,
            ngram_range=ngram_range,
            stop_words=stop_words,
        )

        self.job_tfidf = self.vectorizer.fit_transform(
            self.jobs_df["__full_text__"].values
        )


        # LSA representation (low-dimensional dense vectors)
        self.svd = TruncatedSVD(
            n_components=n_lsa_components,
            random_state=random_state,
        )

        job_lsa = self.svd.fit_transform(self.job_tfidf)
        # L2-normalize rows
        self.job_lsa = normalize(job_lsa)


        # Topic representation via NMF
        self.nmf = NMF(
            n_components=n_topics,
            init="nndsvd",
            random_state=random_state,
            max_iter=300,
        )

        job_topics = self.nmf.fit_transform(self.job_tfidf)
        self.job_topics = normalize(job_topics)  # L2-normalize rows

    # Helpers

    # Project user text into TF-IDF space
    def _tfidf_query(self, user_text: str):
        return self.vectorizer.transform([user_text])

    # Project a TF-IDF query vector into LSA space
    def _lsa_query(self, tfidf_vec):
        q_lsa = self.svd.transform(tfidf_vec)
        return normalize(q_lsa)

    # Project a TF-IDF query vector into topic space (NMF)
    def _topic_query(self, tfidf_vec):
        q_topics = self.nmf.transform(tfidf_vec)
        return normalize(q_topics)

    def _rocchio_expand(
        self,
        query_vec,
        alpha=1.0,
        beta=0.75,
        gamma=0.15,
        n_pos=20,
        n_neg=10,
    ):
        """
        Rocchio pseudo-relevance feedback in TF-IDF space.

        Uses the current ranking of documents to:
          - pull the query closer to top-ranked (pseudo-relevant) jobs
          - push it away from bottom-ranked (pseudo-non-relevant) jobs

        Parameters follow the standard Rocchio notation:
          - alpha: weight of original query
          - beta: weight of positive centroid
          - gamma: weight of negative centroid
        """
        # Rank documents with the current query
        sims = cosine_similarity(query_vec, self.job_tfidf).ravel()
        ranked_idx = np.argsort(-sims)

        # Select pseudo-relevant and non-relevant sets
        n_pos = min(n_pos, len(ranked_idx))
        n_neg = min(n_neg, len(ranked_idx)) if n_neg > 0 else 0

        pos_idx = ranked_idx[:n_pos]
        neg_idx = ranked_idx[-n_neg:] if n_neg > 0 else []

        # Compute centroids in TF-IDF space
        pos_centroid = self.job_tfidf[pos_idx].mean(axis=0)
        pos_centroid = csr_matrix(pos_centroid)

        if n_neg > 0:
            neg_centroid = self.job_tfidf[neg_idx].mean(axis=0)
            neg_centroid = csr_matrix(neg_centroid)
        else:
            # zero vector of the same dimension
            neg_centroid = csr_matrix(query_vec.shape, dtype=query_vec.dtype)

        # Rocchio formula:
        expanded = (
            alpha * query_vec
            + beta * pos_centroid
            - gamma * neg_centroid
        )

        # Ensure we work with a CSR sparse matrix
        if not issparse(expanded):
            expanded = csr_matrix(expanded)

        # Clamp negative weights to zero (TF-IDF is non-negative)
        data = np.asarray(expanded.data)
        data[data < 0] = 0.0
        expanded.data[:] = data

        return expanded

    # Public "API"
    def recommend_from_text(
        self,
        user_text,
        top_k=20,
        use_rocchio=True,
        rocchio_params=None,
        weights=None,
        return_intermediate=False,
    ):
        """
        Recommend jobs based on a free-text description of what the user is looking for.

        This function takes a natural-language description of the desired job
        and scores all jobs using a mix of lexical similarity, LSA and topic similarity.
        It can optionally refine the query using Rocchio pseudo-relevance feedback.

        Parameters
        ----------
        user_text : str
            Free-text description of the job the user is interested in.
        top_k : int
            How many jobs to return.
        use_rocchio : bool
            If True, refine the query using Rocchio pseudo-relevance feedback.
        rocchio_params : dict or None
            Optional Rocchio hyperparameters, e.g.
            {'alpha': ..., 'beta': ..., 'gamma': ..., 'n_pos': ..., 'n_neg': ...}.
            If None, sensible defaults are used.
        weights : dict or None
            Weights for combining the different similarity scores, e.g.
            {'lexical': 0.5, 'lsa': 0.3, 'topic': 0.2}.
            If None, defaults are used.
        return_intermediate : bool
            If True, also return the individual component scores
            ('lexical_score', 'lsa_score', 'topic_score') in the result.

        Returns
        -------
        pd.DataFrame
            A slice of jobs_df containing the top-k recommended jobs with an extra
            'combined_text_score' column (and, if requested, the component scores),
            sorted from most to least relevant.
        """
        if not isinstance(user_text, str) or not user_text.strip():
            raise ValueError("user_text must be a non-empty string")

        # Project user text into TF-IDF space
        q_tfidf = self._tfidf_query(user_text)

        # Rocchio expansion
        if use_rocchio:
            r_params = {
                "alpha": 1.0,
                "beta": 0.75,
                "gamma": 0.15,
                "n_pos": 20,
                "n_neg": 10,
            }
            if rocchio_params is not None:
                r_params.update(rocchio_params)

            q_tfidf = self._rocchio_expand(
                q_tfidf,
                alpha=r_params["alpha"],
                beta=r_params["beta"],
                gamma=r_params["gamma"],
                n_pos=r_params["n_pos"],
                n_neg=r_params["n_neg"],
            )

        # Represent query in all three spaces
        q_lsa = self._lsa_query(q_tfidf)
        q_topics = self._topic_query(q_tfidf)

        # Compute cosine similarities in each space
        sim_lexical = cosine_similarity(q_tfidf, self.job_tfidf).ravel()
        sim_lsa = cosine_similarity(q_lsa, self.job_lsa).ravel()
        sim_topic = cosine_similarity(q_topics, self.job_topics).ravel()

        # Normalize scores to [0, 1] for stable combination
        def _norm(x):
            x = np.asarray(x)
            if x.max() == x.min():
                # Avoid division by zero if all scores are equal
                return np.zeros_like(x)
            return (x - x.min()) / (x.max() - x.min())

        sim_lexical_n = _norm(sim_lexical)
        sim_lsa_n = _norm(sim_lsa)
        sim_topic_n = _norm(sim_topic)

        # Combine scores with given weights
        if weights is None:
            weights = {"lexical": 0.3, "lsa": 0.3, "topic": 0.3}

        combined = (
            weights["lexical"] * sim_lexical_n
            + weights["lsa"] * sim_lsa_n
            + weights["topic"] * sim_topic_n
        )

        # Take top_k jobs by combined score
        idx_sorted = np.argsort(-combined)[:top_k]

        result = self.jobs_df.iloc[idx_sorted].copy()
        result["combined_text_score"] = combined[idx_sorted]

        # Optionally expose individual component scores
        if return_intermediate:
            result["lexical_score"] = sim_lexical_n[idx_sorted]
            result["lsa_score"] = sim_lsa_n[idx_sorted]
            result["topic_score"] = sim_topic_n[idx_sorted]

        return result.reset_index(drop=True)
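
To make the Rocchio step in `_rocchio_expand` more concrete, here is a dense NumPy sketch on a made-up four-term vocabulary. The function `rocchio_expand` and the toy vectors are illustrative assumptions; the class itself works on sparse TF-IDF matrices and uses larger feedback sets.

```python
import numpy as np


def rocchio_expand(query, docs, alpha=1.0, beta=0.75, gamma=0.15, n_pos=2, n_neg=2):
    """Dense toy version of Rocchio pseudo-relevance feedback (assumes n_neg > 0)."""
    # Cosine similarity between the query and every document row
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    ranked = np.argsort(-sims)
    pos_centroid = docs[ranked[:n_pos]].mean(axis=0)    # top-ranked: pseudo-relevant
    neg_centroid = docs[ranked[-n_neg:]].mean(axis=0)   # bottom-ranked: pseudo-non-relevant
    expanded = alpha * query + beta * pos_centroid - gamma * neg_centroid
    return np.clip(expanded, 0.0, None)                 # keep weights non-negative


# Columns: python, sql, nurse, care
docs = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.0, 0.0, 0.0])  # the user only mentions "python"
expanded = rocchio_expand(query, docs)
print(expanded)
```

The expanded query gains weight on `sql` (0.375), a term the user never typed but that occurs in a top-ranked document, while the negative weights on `nurse` and `care` are clamped to zero.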

We create an instance of a text recommender.

In [28]:
text_rec = TextRecommender(
    jobs_df=jobs_df,
    title_col="job_title",
    desc_col="job_summary",
    skills_col="job_skills",
    min_df=5,
    max_df=0.9,
    ngram_range=(1, 2),
    n_lsa_components=150,
    n_topics=20,
)

Next we define a function that combines the two approaches into a hybrid recommender.
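
The core of the combination is an outer merge of the two score tables followed by a weighted sum. The sketch below shows that step in isolation on made-up scores, using the same default weights as the function (`alpha_text=0.75`, `alpha_skills=0.25`).

```python
import pandas as pd

text_recs = pd.DataFrame({"job_link": ["a", "b"], "combined_text_score": [0.8, 0.4]})
skills_recs = pd.DataFrame({"job_link": ["b", "c"], "skills_score": [0.6, 0.9]})

alpha_text, alpha_skills = 0.75, 0.25

# Outer merge keeps jobs found by either recommender; a missing score counts as 0
merged = pd.merge(text_recs, skills_recs, on="job_link", how="outer").fillna(0.0)
merged["hybrid_score"] = (
    alpha_text * merged["combined_text_score"]
    + alpha_skills * merged["skills_score"]
)
ranked = merged.sort_values("hybrid_score", ascending=False).reset_index(drop=True)
print(ranked)
```

Job `c` has the highest skills score but ranks last, because it is absent from the text-based candidates and therefore receives a text score of 0. Fetching `top_k * 5` candidates from each recommender, as the function does, widens the overlap between the two lists.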

In [29]:
def hybrid_recommend(
    df,
    query,
    description_text,
    vectorizer,
    matrix,
    text_rec,
    top_k=5,
    alpha_text=0.75,
    alpha_skills=0.25,
):
    """
    Hybrid recommender combining:
    - traditional title+skills TF-IDF similarity ("skills_score")
    - description-based NLP similarity from TextRecommender ("combined_text_score")
    """
    # Skills-based recommendations
    skills_recs = recommend(
        df=df,
        query=query,
        vectorizer=vectorizer,
        matrix=matrix,
        top_k=top_k * 5,
        return_scores=True,
    )

    # Text-based recommendations
    text_recs = text_rec.recommend_from_text(
        user_text=description_text,
        top_k=top_k * 5,
        use_rocchio=True,
        return_intermediate=False,
    )

    # Ensure we have the primary key
    if "job_link" not in skills_recs.columns or "job_link" not in text_recs.columns:
        raise KeyError("job_link column not found in recommendations; ensure it is kept in jobs_df.")

    # Merge scores on job_link
    merged_scores = pd.merge(
        text_recs[["job_link", "combined_text_score"]],
        skills_recs[["job_link", "skills_score"]],
        on="job_link",
        how="outer",
    ).fillna(0.0)

    # Compute hybrid score
    merged_scores["hybrid_score"] = (
        alpha_text * merged_scores["combined_text_score"]
        + alpha_skills * merged_scores["skills_score"]
    )

    # Join back job metadata
    full = merged_scores.merge(df, on="job_link", how="left")

    # Re-apply location / country filter similar to recommend()
    city = query.get("city")
    country = query.get("country")

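    # Start from all-False masks so the filter stays valid when a field or column is missing
    mask_city = pd.Series(False, index=full.index)
    mask_country = pd.Series(False, index=full.index)
    if city is not None and "job_location" in full.columns:
        mask_city = full["job_location"] == city
    if country is not None and "search_country" in full.columns:
        mask_country = full["search_country"] == country

    filtered = full[mask_city | mask_country]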
    if not filtered.empty:
        full = filtered

    # Sort and attach original index from df
    full = full.sort_values("hybrid_score", ascending=False).head(top_k)

    # Build mapping: job_link -> original index in df
    index_map = (
        df.reset_index()[["index", "job_link"]]
          .set_index("job_link")["index"]
    )
    full["orig_index"] = full["job_link"].map(index_map)

    # For printing: use orig_index as DataFrame index
    full_print = full.set_index("orig_index").copy()

    print("Hybrid recommendations (description + title/skills):")
    print_recommendations(
        full_print,
        columns=[
            "company",
            "job_title",
            "job_skills",
            "job_location",
            "hybrid_score",
        ],
    )

    # For returning: clean 0..n index, but keep orig_index as a column
    return full.reset_index(drop=True)
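
The outer merge with zero-fill means that a job appearing in only one candidate list still receives a hybrid score, with the missing component counted as 0. A minimal dependency-free sketch of that blending step follows. `blend_scores` and the toy job links are hypothetical names used only for illustration, and plain dictionaries stand in for the DataFrames.

```python
def blend_scores(text_scores, skills_scores, alpha_text=0.75, alpha_skills=0.25):
    """Blend two {job_link: score} maps; a missing score counts as 0.0."""
    all_links = set(text_scores) | set(skills_scores)   # outer join on the key
    return {
        link: alpha_text * text_scores.get(link, 0.0)
        + alpha_skills * skills_scores.get(link, 0.0)
        for link in all_links
    }


# Made-up scores: "job-b" is only in the text list, "job-c" only in the skills list.
text_scores = {"job-a": 0.8, "job-b": 0.6}
skills_scores = {"job-a": 0.4, "job-c": 1.0}

hybrid = blend_scores(text_scores, skills_scores)
# Job links ordered from highest to lowest hybrid score.
top = sorted(hybrid, key=hybrid.get, reverse=True)
```

Since a missing component contributes nothing, jobs found by both recommenders tend to rank above jobs found by only one. Requesting `top_k * 5` candidates from each side, as the function above does, is one way to increase that overlap.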

Example query:

In [30]:
# Query without NLP features (except skills)
query = {'city': 'New York', 'country': 'United States', 'title': 'Data Scientist', 'skills': ['python', 'machine learning', 'data analysis']}
recommend(jobs_df, query, vectorizer, matrix, top_k=10)

print()

# Query with NLP features
description_text = """
I am looking for a junior machine learning engineer position where I can build,
train and deploy deep learning models in production. I enjoy working with
Python, PyTorch, MLOps tools and scalable model pipelines. Ideally in a
tech company with strong engineering culture.
"""

hybrid_recommend(
    df=jobs_df,
    query=query,
    description_text=description_text,
    vectorizer=vectorizer,
    matrix=matrix,
    text_rec=text_rec,
    top_k=10,
    alpha_text=0.6,
    alpha_skills=0.4,
)

Recommended jobs:


index,company,job_title,job_skills,job_location
2887,X4 Life Sciences,senior/principal machine learning scientist,"collaboration, data preprocessing, machine lea...",New York
2879,257,data scientist,"algorithms, big data, data analysis, data engi...",New York
3713,Tribal Tech - The Digital & Tech Recruitment S...,machine learning / data scientist,"clustering, collaboration, communication, comp...",New York
3335,JBC,staff data scientist,"clustering, collaboration skills, communicatio...",New York
13200,"Tribal Tech - The Digital, Data & AI Specialists",machine learning / data scientist,"clustering, collaboration, communication, comp...",New York
8528,JPMorgan Chase & Co.,machine learning scientist,"big data, data science, deep learning, machine...",New York
12201,X4 Life Sciences,principal machine learning scientist,"collaboration, communication, data preprocessi...",New York
5069,Arena,machine learning scientist,"analytical skills, communication skills, data ...",New York
5068,Genentech,machine learning scientist,"aws, azure, data analysis, devops, gcp, git, l...",New York
14300,RWE,energy trading data scientist m/f/t,"applied mathematics, attention to detail, comm...",New York



Hybrid recommendations (description + title/skills):
Recommended jobs:


orig_index,company,job_title,job_skills,job_location,hybrid_score
2887,X4 Life Sciences,senior/principal machine learning scientist,"collaboration, data preprocessing, machine lea...",New York,0.875349
12201,X4 Life Sciences,principal machine learning scientist,"collaboration, communication, data preprocessi...",New York,0.83024
3713,Tribal Tech - The Digital & Tech Recruitment S...,machine learning / data scientist,"clustering, collaboration, communication, comp...",New York,0.798307
13200,"Tribal Tech - The Digital, Data & AI Specialists",machine learning / data scientist,"clustering, collaboration, communication, comp...",New York,0.785765
5069,Arena,machine learning scientist,"analytical skills, communication skills, data ...",New York,0.770405
8528,JPMorgan Chase & Co.,machine learning scientist,"big data, data science, deep learning, machine...",New York,0.763374
6818,Algo Capital Group,machine learning engineer c++/ python,"algorithms, analytical skills, c, communicatio...",New York,0.756973
3335,JBC,staff data scientist,"clustering, collaboration skills, communicatio...",New York,0.7411
2787,Hexaware Technologies,lead machine learning engineer,"agile, deep learning, educational technology, ...",New York,0.733918
5068,Genentech,machine learning scientist,"aws, azure, data analysis, devops, gcp, git, l...",New York,0.716075


index,job_link,combined_text_score,skills_score,hybrid_score,job_title,job_location,search_country,job_skills,company,search_position,job_summary,job_level,title_skills,orig_index
0,https://www.linkedin.com/jobs/view/senior-prin...,0.896621,0.843442,0.875349,senior/principal machine learning scientist,New York,United States,"collaboration, data preprocessing, machine lea...",X4 Life Sciences,Agricultural-Research Engineer,My client is a leading firm at the intersectio...,Mid senior,senior/principal machine learning scientist co...,2887
1,https://www.linkedin.com/jobs/view/principal-m...,0.89556,0.73226,0.83024,principal machine learning scientist,New York,United States,"collaboration, communication, data preprocessi...",X4 Life Sciences,Biochemist,My client is a leading firm at the intersectio...,Mid senior,principal machine learning scientist collabora...,12201
2,https://www.linkedin.com/jobs/view/machine-lea...,0.815857,0.771982,0.798307,machine learning / data scientist,New York,United States,"clustering, collaboration, communication, comp...",Tribal Tech - The Digital & Tech Recruitment S...,Electrical-Research Engineer,Location: New York\nPosition Type: Full-Time\n...,Mid senior,"machine learning / data scientist clustering, ...",3713
3,https://www.linkedin.com/jobs/view/machine-lea...,0.817309,0.738449,0.785765,machine learning / data scientist,New York,United States,"clustering, collaboration, communication, comp...","Tribal Tech - The Digital, Data & AI Specialists",Biochemist,Location: New York\nPosition Type: Full-Time\n...,Mid senior,"machine learning / data scientist clustering, ...",13200
4,https://www.linkedin.com/jobs/view/machine-lea...,0.811007,0.709502,0.770405,machine learning scientist,New York,United States,"analytical skills, communication skills, data ...",Arena,Agricultural-Research Engineer,Who we are:\nOur name is inspired by Theodore ...,Mid senior,"machine learning scientist analytical skills, ...",5069
5,https://www.linkedin.com/jobs/view/machine-lea...,0.780892,0.737097,0.763374,machine learning scientist,New York,United States,"big data, data science, deep learning, machine...",JPMorgan Chase & Co.,Horticulturist,Job Description\nApplied AI ML opportunities a...,Mid senior,"machine learning scientist big data, data scie...",8528
6,https://www.linkedin.com/jobs/view/machine-lea...,0.861407,0.600323,0.756973,machine learning engineer c++/ python,New York,United States,"algorithms, analytical skills, c, communicatio...",Algo Capital Group,Agricultural-Research Engineer,My client is a world-class quantitative invest...,Mid senior,machine learning engineer c++/ python algorith...,6818
7,https://www.linkedin.com/jobs/view/staff-data-...,0.728915,0.759377,0.7411,staff data scientist,New York,United States,"clustering, collaboration skills, communicatio...",JBC,Chemist,"Location: Long Island City, New York\nType: Pe...",Associate,"staff data scientist clustering, collaboration...",3335
8,https://www.linkedin.com/jobs/view/lead-machin...,0.856341,0.550284,0.733918,lead machine learning engineer,New York,United States,"agile, deep learning, educational technology, ...",Hexaware Technologies,Data Communications Analyst,"HIR ING\nJob Skills\nTensorFlow, TensorFlow, B...",Mid senior,"lead machine learning engineer agile, deep lea...",2787
9,https://www.linkedin.com/jobs/view/machine-lea...,0.736229,0.685845,0.716075,machine learning scientist,New York,United States,"aws, azure, data analysis, devops, gcp, git, l...",Genentech,Agricultural-Research Engineer,The Position\nThe Opportunity\nThe Large Molec...,Associate,"machine learning scientist aws, azure, data an...",5068


# Association Rules

In addition to recommending jobs based on similarity, we can suggest complementary skills using association rules. By analyzing which skills frequently appear together across job posts, we can discover patterns such as "users with skill A often also have skill B." This allows the system to recommend additional skills a user might consider learning to improve their job prospects.

## Apriori Algorithm
We use the Apriori algorithm to identify frequent sets of skills in the dataset. Apriori scans the job posts to find combinations of skills that occur together above a minimum support threshold. From these frequent itemsets, we generate association rules that describe patterns such as "if a job requires skill A and B, it often also requires skill C." These rules can then be used to recommend additional skills to a user.

In [31]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

With a minimum support of 0.005, we only consider skill combinations that appear in at least 0.5% of the job posts. From these frequent itemsets, we generate association rules with a minimum confidence of 0.5.

These rules allow the system to recommend complementary skills: given a user's existing skills, the algorithm suggests additional skills that often appear together in job posts.
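
For intuition about the two thresholds, the sketch below computes support, confidence, and lift by hand for a single rule on a tiny made-up set of skill lists. It uses only the standard library and does not call mlxtend. The `support` helper and the transactions are illustrative assumptions.

```python
# Four toy job posts, each represented as a set of skills (made-up data).
transactions = [
    {"python", "sql", "machine learning"},
    {"python", "sql"},
    {"python", "machine learning"},
    {"excel", "sql"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


# Rule under inspection: {python} -> {sql}
antecedent, consequent = {"python"}, {"sql"}

supp_rule = support(antecedent | consequent)   # share of posts with python AND sql
confidence = supp_rule / support(antecedent)   # share of python posts that also list sql
lift = confidence / support(consequent)        # confidence relative to sql's base rate
```

Here the rule has support 0.5 and confidence 2/3, so it would pass both thresholds used in this notebook. Its lift is below 1, though, which means `sql` is slightly less common among `python` posts than overall in this toy data. This is one reason to look at lift as well as confidence when ranking rules.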

In [32]:
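min_support = 0.005    # minimum fraction of job posts containing an itemset
min_threshold = 0.5    # minimum confidence for a rule to be kept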

In [33]:
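def compute_frequent_itemsets(df, min_support, min_threshold=0.5):
    """Mine frequent skill sets with Apriori and return the derived association rules."""
    # One transaction per job post: the comma-separated skills as a list
    transactions = df['job_skills'].apply(lambda x: [s.strip() for s in x.split(',')]).tolist()

    # One-hot encode the transactions (rows = job posts, columns = skills)
    te = TransactionEncoder()
    te_ary = te.fit(transactions).transform(transactions)
    df_onehot = pd.DataFrame(te_ary, columns=te.columns_)

    # Frequent itemsets above min_support, then rules above the confidence threshold
    frequent_itemsets = apriori(df_onehot, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_threshold)
    return rules


rules = compute_frequent_itemsets(jobs_df, min_support, min_threshold)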

### Recommending Additional Skills

Given a user's current skills, we can filter the association rules to find consequents that complement their skill set. The top recommendations are sorted by confidence or lift.

In [34]:
def recommend_skills(rules, user_skills, top_n=5, sort_by='confidence'):
    """Suggest up to top_n skills implied by rules whose antecedents the user covers."""
    user_skills = set(user_skills)

    # Keep rules whose antecedents are a subset of the user's skills
    matching_rules = rules[rules['antecedents'].apply(lambda x: set(x).issubset(user_skills))]
    if matching_rules.empty:
        return []

    # Walk rules from strongest to weakest (by confidence or lift) and collect
    # consequents the user does not already have, skipping duplicates
    sorted_rules = matching_rules.sort_values(by=sort_by, ascending=False)
    recommended_skills_ordered = []
    for cons in sorted_rules['consequents']:
        for skill in cons:
            if skill not in user_skills and skill not in recommended_skills_ordered:
                recommended_skills_ordered.append(skill)
    return recommended_skills_ordered[:top_n]


recommended_skills = recommend_skills(rules, query['skills'], top_n=5)
print("Recommended additional skills:", recommended_skills)

Recommended additional skills: ['sql']
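
As a quick check of the matching logic, the sketch below applies the same subset test to a few hand-written rules. The rules, their confidence values, and the `suggest` helper are hypothetical. Plain tuples are used here instead of the mlxtend rules DataFrame, so the example runs without any dependencies.

```python
# (antecedents, consequents, confidence): hand-written toy rules.
toy_rules = [
    (frozenset({"python"}), frozenset({"sql"}), 0.60),
    (frozenset({"python", "machine learning"}), frozenset({"deep learning"}), 0.70),
    (frozenset({"java"}), frozenset({"spring"}), 0.90),
    (frozenset({"data analysis"}), frozenset({"python", "excel"}), 0.55),
]


def suggest(rules, user_skills, top_n=5):
    """Return new skills from rules whose antecedents the user fully covers."""
    owned = set(user_skills)
    matching = [r for r in rules if r[0] <= owned]
    matching.sort(key=lambda r: r[2], reverse=True)   # highest confidence first
    ordered = []
    for _, consequents, _ in matching:
        for skill in sorted(consequents):             # sorted for a stable order
            if skill not in owned and skill not in ordered:
                ordered.append(skill)
    return ordered[:top_n]


suggestions = suggest(toy_rules, ["python", "machine learning", "data analysis"])
```

The `java` rule is skipped even though it has the highest confidence, because the user does not hold its antecedent. `python` is dropped from the last rule's consequents because the user already has it.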
