# LinkedIn Recommendation System

## Dataset download

This project uses the **"1.3M LinkedIn Jobs and Skills 2024"** dataset available on [Kaggle](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024).

The dataset contains over **1.3 million LinkedIn job postings** collected in 2024, including detailed information on job titles, descriptions, companies, and associated skills. It is used to train and evaluate our job recommendation system.

### Download Options

You can obtain the dataset in two ways:

1. **Using the Kaggle API (Recommended)**: automatic download and extraction.
2. **Manual Download**: download the ZIP file directly from the dataset page and extract it yourself.


### Option 1: Using the Kaggle API

To use the Kaggle API, ensure you have the Kaggle CLI installed and configured.

```bash
# Install Kaggle CLI
pip install kaggle

# Move your Kaggle API key (kaggle.json) into place
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Once configured, you can run the provided script to automatically download and unzip the dataset into the `data/` folder.

```bash
chmod +x ./download_linkedin_dataset.sh
./download_linkedin_dataset.sh
```

This script:
- Creates the `data/` folder if it does not exist.
- Downloads the dataset from Kaggle.
- Extracts the contents.
- Removes the ZIP file after extraction.

### Option 2: Manual Download

If you prefer not to use the Kaggle API, you can manually download the dataset from:

**[https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024)**

After downloading:
1. Extract the ZIP file.  
2. Move all extracted files into the `data/` directory in your project.
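
For the manual route, the extraction step can also be scripted. The following is a minimal sketch using only the Python standard library; the function name `extract_dataset` and its parameters are assumptions made for illustration and are not part of this repository.

```python
import zipfile
from pathlib import Path

def extract_dataset(zip_path, data_dir="data", remove_zip=False):
    """Extract a downloaded archive into `data_dir` and return the member names."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)  # create data/ if it does not exist
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target)
        names = archive.namelist()
    if remove_zip:
        Path(zip_path).unlink()  # optional cleanup after extraction
    return names
```

Calling `extract_dataset("archive.zip")` from the project root (the archive name here is hypothetical) would place the CSV files under `data/`, which is where the notebook cells expect them.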

### Notes
- Ensure your Kaggle API credentials (`kaggle.json`) are correctly configured in `~/.kaggle/`.
- The dataset is distributed under the **ODC Attribution License (ODC-By)**.
- The total size is about 2 GB (~1.3M entries), so the download may take several minutes depending on your connection.

## Data preparation

### Load and Inner Join

The pipeline is built around three CSVs:

- `job_postings_df` from `./data/linkedin_job_postings.csv`
- `job_summary_df` from `./data/job_summary.csv`
- `job_skills_df` from `./data/job_skills.csv`

The tables are **inner joined** on the unique key `job_link`:

- `jobs_df = pd.merge(job_postings_df, job_skills_df, on="job_link", how="inner")`
- `jobs_df = pd.merge(jobs_df, job_summary_df, on="job_link", how="inner")`

An inner join keeps only postings that exist in every source, so rows stay aligned across tables. In the cells below, the `job_summary.csv` load and its join are commented out, so `jobs_df` currently combines postings and skills only.
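
To see what an inner join discards, it can help to run it on a toy example first. The sketch below uses hypothetical three-row tables (only the column names `job_link`, `job_title`, and `job_skills` come from the dataset) and adds an outer join with `indicator=True` to list the links that would be dropped.

```python
import pandas as pd

# Hypothetical toy tables; only the column names mirror the real data.
postings = pd.DataFrame({
    "job_link": ["a", "b", "c"],
    "job_title": ["Nurse", "Chef", "Analyst"],
})
skills = pd.DataFrame({
    "job_link": ["a", "c", "d"],
    "job_skills": ["Care, BSN", "SQL, Excel", "Welding"],
})

# Inner join: keep only links present in both tables.
joined = pd.merge(postings, skills, on="job_link", how="inner")

# Outer join with indicator=True shows which side each unmatched link came from.
audit = pd.merge(postings, skills, on="job_link", how="outer", indicator=True)
dropped = audit.loc[audit["_merge"] != "both", ["job_link", "_merge"]]

print(joined)
print(dropped)
```

Running the same audit on the real tables before the merge is one way to check how many postings lack a skills row.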


In [3]:
import pandas as pd

# Load the datasets
job_skills_df = pd.read_csv('./data/job_skills.csv')
#job_summary_df = pd.read_csv('./data/job_summary.csv')
job_postings_df = pd.read_csv('./data/linkedin_job_postings.csv')


In [4]:
job_skills_df.head()

Unnamed: 0,job_link,job_skills
0,https://www.linkedin.com/jobs/view/housekeeper...,"Building Custodial Services, Cleaning, Janitor..."
1,https://www.linkedin.com/jobs/view/assistant-g...,"Customer service, Restaurant management, Food ..."
2,https://www.linkedin.com/jobs/view/school-base...,"Applied Behavior Analysis (ABA), Data analysis..."
3,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Engineering, Project Controls, Sche..."
4,https://www.linkedin.com/jobs/view/electrical-...,"Electrical Assembly, Point to point wiring, St..."


In [5]:
#job_summary_df.head()

In [6]:
job_postings_df.head()

Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type
0,https://www.linkedin.com/jobs/view/account-exe...,2024-01-21 07:12:29.00256+00,t,t,f,Account Executive - Dispensing (NorCal/Norther...,BD,"San Diego, CA",2024-01-15,Coronado,United States,Color Maker,Mid senior,Onsite
1,https://www.linkedin.com/jobs/view/registered-...,2024-01-21 07:39:58.88137+00,t,t,f,Registered Nurse - RN Care Manager,Trinity Health MI,"Norton Shores, MI",2024-01-14,Grand Haven,United States,Director Nursing Service,Mid senior,Onsite
2,https://www.linkedin.com/jobs/view/restaurant-...,2024-01-21 07:40:00.251126+00,t,t,f,RESTAURANT SUPERVISOR - THE FORKLIFT,Wasatch Adaptive Sports,"Sandy, UT",2024-01-14,Tooele,United States,Stand-In,Mid senior,Onsite
3,https://www.linkedin.com/jobs/view/independent...,2024-01-21 07:40:00.308133+00,t,t,f,Independent Real Estate Agent,Howard Hanna | Rand Realty,"Englewood Cliffs, NJ",2024-01-16,Pinehurst,United States,Real-Estate Clerk,Mid senior,Onsite
4,https://www.linkedin.com/jobs/view/group-unit-...,2024-01-19 09:45:09.215838+00,f,f,f,Group/Unit Supervisor (Systems Support Manager...,"IRS, Office of Chief Counsel","Chamblee, GA",2024-01-17,Gadsden,United States,Supervisor Travel-Information Center,Mid senior,Onsite


In [7]:
jobs_df = pd.merge(job_postings_df, job_skills_df, on='job_link', how='inner')
#jobs_df = pd.merge(jobs_df, job_summary_df, on='job_link', how='inner')

In [8]:
jobs_df.describe()

Unnamed: 0,job_link,last_processed_time,got_summary,got_ner,is_being_worked,job_title,company,job_location,first_seen,search_city,search_country,search_position,job_level,job_type,job_skills
count,1296381,1296381,1296381,1296381,1296381,1296381,1296372,1296362,1296381,1296381,1296381,1296381,1296381,1296381,1294296
unique,1296381,722728,1,1,1,565695,88995,28791,6,1018,4,1923,2,3,1287101
top,https://www.linkedin.com/jobs/view/account-exe...,2024-01-19 09:45:09.215838+00,t,t,f,LEAD SALES ASSOCIATE-FT,Health eCareers,"New York, NY",2024-01-14,North Carolina,United States,Account Executive,Mid senior,Onsite,"Front Counter, DriveThru, Outside Order Taker,..."
freq,1,573487,1296381,1296381,1296381,7315,40049,12580,459354,9495,1105410,19465,1155276,1285565,169


In [9]:
#input_cols = ['job_link', 'job_title', 'job_location', 'search_country', 'job_skills', 'job_description']
input_cols = ['job_link', 'job_title', 'job_location', 'search_country', 'job_skills', 'company', 'search_position', 'job_level']

### Row Filtering

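We remove rows that lack the required NLP outputs and rows flagged as in progress. The flag columns store the strings `"t"` and `"f"` rather than booleans:

- Drop entries **without NER** results (`got_ner` is not `"t"`).
- Drop entries **without summary** (`got_summary` is not `"t"`).
- Drop entries where **`is_being_worked` is `"t"`**.

We then keep only the columns in `input_cols` and drop rows with missing values, so each remaining sample has complete features.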

In [10]:
jobs_df = jobs_df.loc[
    (jobs_df["is_being_worked"] == "f")
    & (jobs_df["got_summary"] == "t")
    & (jobs_df["got_ner"] == "t")
]

jobs_df = jobs_df[input_cols].dropna().reset_index(drop=True)

### Title Normalization and Reduction of Unique Values

We normalize `job_title` with a custom function:

1. **Lowercase** titles.
2. **Trim at the first dash**: keep text before `"-"` to collapse variants like  
   `Senior Software Engineer - Backend` → `senior software engineer`. Any hyphen triggers the cut, including one inside a word.
3. **Drop parenthetical fragments**: keep only the text before the first `(`, e.g.  
   `data scientist (NLP)` → `data scientist`. Text after the closing `)` is dropped too.
4. **Strip whitespace**.

**Effect:** Different textual variants map to a **single canonical form**, which **reduces the number of unique job titles** and stabilizes downstream grouping and modeling.


In [11]:
def _clean_job_titles(job_title: str):
    job_title = job_title.lower()
    if '-' in job_title:
        job_title = job_title.split('-')[0].strip()
    job_title = job_title.split('(')[0].strip()
    return job_title.strip()

jobs_df['job_title'] = jobs_df['job_title'].astype(str).apply(_clean_job_titles)
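
The function above cuts at any hyphen and at the first `(`. If it matters to keep hyphenated words and the text that follows a parenthetical, a regex-based variant is one option. The sketch below is a hypothetical alternative, not the notebook's function; the name `normalize_title` and the spaced-dash rule are assumptions.

```python
import re

def normalize_title(title: str) -> str:
    """Hypothetical variant of the title cleaner, not the notebook's function."""
    title = title.lower()
    # Remove every "(...)" fragment but keep the text that follows it.
    title = re.sub(r"\([^)]*\)", " ", title)
    # Cut at the first dash that has whitespace on both sides, so that
    # hyphenated words such as "e-commerce" survive.
    title = re.split(r"\s+-\s+", title, maxsplit=1)[0]
    # Collapse repeated whitespace left behind by the removals.
    return re.sub(r"\s+", " ", title).strip()

for raw in [
    "Senior Software Engineer - Backend",
    "Data Scientist (NLP) II",
    "E-Commerce Manager",
]:
    print(f"{raw!r} -> {normalize_title(raw)!r}")
```

The trade-off is that the spaced-dash rule leaves titles such as `LEAD SALES ASSOCIATE-FT` untrimmed, so it merges fewer variants than the notebook's function.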

### Skill Canonicalization and Reduction of Unique Values

We clean `job_skills` as a comma-separated list:

1. **Lowercase** the whole string.
2. **Split by comma**.
3. **Remove every character that is not a letter or whitespace**, collapse repeated spaces, and **strip** each token.
4. **De-duplicate per posting** to avoid repeated skills.
5. **Sort tokens** so the per-row skill list has a consistent order.

We also track a **global set of unique skills** to measure coverage.

**Effect:** Canonicalization merges superficial variants and ordering differences, which **reduces both per-row and global unique skill counts**. This yields a more compact and reliable skill space.


In [12]:
import re

# create a cleaned list of skills per job and a global unique skills array
def _clean_split_skills(skills_str: str):
    # coerce to str before lowercasing so non-string inputs (e.g. NaN) do not raise
    parts = str(skills_str).lower().split(',')
    unique = set()
    for part in parts:
        # remove non A-Za-z characters except whitespace, collapse spaces and strip ends
        c = re.sub(r'[^A-Za-z\s]', '', part)
        c = re.sub(r'\s+', ' ', c).strip()
        if c:
            unique.add(c)
    # return a deterministic, cleaned, ordered string
    ordered = sorted(unique)
    return ', '.join(ordered)

jobs_df['job_skills'] = jobs_df['job_skills'].astype(str).apply(_clean_split_skills)
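
Because the cleaner strips digits and punctuation, some distinct skills end up with the same canonical token. The sketch below applies the same two substitutions to a handful of hypothetical skill names to make that effect visible; `clean_skill` is an illustrative helper, not a function from the notebook.

```python
import re

def clean_skill(token: str) -> str:
    # Same two substitutions as the notebook's loop body, applied to one token.
    token = re.sub(r"[^A-Za-z\s]", "", token.lower())
    return re.sub(r"\s+", " ", token).strip()

# Hypothetical skill tokens chosen to show collisions and lost characters.
samples = ["C++", "C#", "C", "Node.js", "3D Modeling", "401(k)"]
cleaned = {raw: clean_skill(raw) for raw in samples}

for raw, out in cleaned.items():
    print(f"{raw!r} -> {out!r}")
```

If skills like these matter for the recommendations, a small allow-list of characters (for example `+` and `#`) in the first pattern would be one way to keep them apart.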

In [13]:
display(jobs_df.head())

Unnamed: 0,job_link,job_title,job_location,search_country,job_skills,company,search_position,job_level
0,https://www.linkedin.com/jobs/view/account-exe...,account executive,"San Diego, CA",United States,"bachelors degree, bd offerings, challenges, co...",BD,Color Maker,Mid senior
1,https://www.linkedin.com/jobs/view/registered-...,registered nurse,"Norton Shores, MI",United States,"bachelor of science in nursing, care managemen...",Trinity Health MI,Director Nursing Service,Mid senior
2,https://www.linkedin.com/jobs/view/restaurant-...,restaurant supervisor,"Sandy, UT",United States,"arithmetic skills, bending and kneeling abilit...",Wasatch Adaptive Sports,Stand-In,Mid senior
3,https://www.linkedin.com/jobs/view/independent...,independent real estate agent,"Englewood Cliffs, NJ",United States,"closing statements, communication, customer se...",Howard Hanna | Rand Realty,Real-Estate Clerk,Mid senior
4,https://www.linkedin.com/jobs/view/registered-...,registered nurse,"Muskegon, MI",United States,"bsn, diversity, equal opportunity employer, eq...",Trinity Health MI,Nurse Practitioner,Mid senior


In [14]:
seen = set()
skills_array = []
for lst in jobs_df['job_skills']:
    for skill in lst.split(','):
        skill = skill.strip()
        if skill not in seen:
            seen.add(skill)
            skills_array.append(skill)

print(f"Jobs rows: {len(jobs_df)}, sample job_skills (first 5):\n", jobs_df['job_skills'].head())
print(f"Global unique skills count: {len(skills_array)}")
#skills_array[:20]

Jobs rows: 1294268, sample job_skills (first 5):
 0    bachelors degree, bd offerings, challenges, co...
1    bachelor of science in nursing, care managemen...
2    arithmetic skills, bending and kneeling abilit...
3    closing statements, communication, customer se...
4    bsn, diversity, equal opportunity employer, eq...
Name: job_skills, dtype: object
Global unique skills count: 2668569


### Location Cleaning and Reduction of Unique Values

We standardize `job_location` by **keeping only the part before the first comma**:

- Example: `San Diego, CA` → `San Diego`

**Effect:** This removes the state, region, and country suffix, which **reduces the number of unique locations** and eases sparsity in the geography field while keeping the city name. The trade-off is that different cities sharing a name, for example in two different states, collapse into one value.

In [15]:
jobs_df['job_location'].value_counts()

job_location
New York, NY                         12561
London, England, United Kingdom      10878
Houston, TX                          10317
Chicago, IL                          10154
Los Angeles, CA                       9724
                                     ...  
Avonwick, England, United Kingdom        1
Kenley, England, United Kingdom          1
Oxwich, Wales, United Kingdom            1
Greenwood, WI                            1
Echo, UT                                 1
Name: count, Length: 28776, dtype: int64

In [16]:
def _clean_location(loc: str):
    return loc.split(',')[0].strip()

jobs_df['job_location'] = jobs_df['job_location'].astype(str).apply(_clean_location)
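
Before relying on the cleaned column, it can be useful to check which cleaned values absorbed more than one original string. The sketch below groups raw strings by their cleaned value; the first two strings appear in the output above, while the Springfield pair is a hypothetical illustration.

```python
from collections import defaultdict

def clean_location(loc: str) -> str:
    # Same rule as the notebook: keep the text before the first comma.
    return loc.split(",")[0].strip()

raw_locations = [
    "New York, NY",
    "London, England, United Kingdom",
    "Springfield, IL",   # hypothetical
    "Springfield, MO",   # hypothetical
]

# Group the original strings by their cleaned value.
groups = defaultdict(list)
for loc in raw_locations:
    groups[clean_location(loc)].append(loc)

# Cleaned values that absorbed more than one distinct original string.
merged = {city: srcs for city, srcs in groups.items() if len(srcs) > 1}
print(merged)
```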

In [17]:
# min_jobs = 100      
# max_jobs = 500   

# jobs_df = (
#     jobs_df
#     .groupby(['search_country', 'job_location'])
#     .filter(lambda x: len(x) >= min_jobs)  
#     .groupby(['search_country', 'job_location'], group_keys=False)
#     .apply(lambda x: x.sample(n=min(len(x), max_jobs), random_state=42))
# ).reset_index(drop=True)

In [18]:
# Keep only postings whose cleaned location is 'New York'
jobs_df = jobs_df[jobs_df['job_location'] == 'New York'].reset_index(drop=True)

In [19]:
jobs_df.head()

Unnamed: 0,job_link,job_title,job_location,search_country,job_skills,company,search_position,job_level
0,https://www.linkedin.com/jobs/view/part-time-h...,part time,New York,United States,"employee complaints, employee onboarding, empl...",Creative Financial Staffing (CFS),Human Resource Advisor,Mid senior
1,https://www.linkedin.com/jobs/view/sr-experien...,"sr experience design manager, learn and help",New York,United States,"adobe creative cloud, adobe experience cloud, ...",Adobe,Cost-And-Sales-Record Supervisor,Mid senior
2,https://www.linkedin.com/jobs/view/manager-of-...,manager of platform and operations,New York,United States,"analytical skills, artificial intelligence, bu...",Aegis Ventures,Tier,Mid senior
3,https://www.linkedin.com/jobs/view/team-lead-c...,team lead,New York,United States,"adaptability, communication written and verbal...",Whizz,Supervisor Customer Services,Mid senior
4,https://www.linkedin.com/jobs/view/senior-acco...,senior account manager,New York,United States,"advertising operations, analysis, business int...","Urban One, Inc",Manager Advertising Agency,Mid senior


In [20]:
jobs_df.describe()

Unnamed: 0,job_link,job_title,job_location,search_country,job_skills,company,search_position,job_level
count,14850,14850,14850,14850,14850,14850,14850,14850
unique,14850,8822,1,2,14842,4032,1147,2
top,https://www.linkedin.com/jobs/view/part-time-h...,executive assistant,New York,United States,"background check, csea tuition vouchers, envir...",DocCafe,Account Executive,Mid senior
freq,1,174,14850,14849,2,218,402,12481


### Feature Set Used

After preprocessing, the working DataFrame contains the columns listed in `input_cols`:

- `job_link` (primary key, post-join)
- `job_title` (normalized)
- `job_location` (city-only normalized)
- `search_country`
- `job_skills` (cleaned, sorted, de-duplicated)
- `company`
- `search_position`
- `job_level`

`job_description` appears only in the commented-out `input_cols` variant and is not part of the current feature set.

These fields form the basis for representation building and recommendation.


### Impact on Cardinality (Unique Values)

The following transformations are specifically designed to **reduce the number of unique values**:

- **Job titles:** lowercasing, dash-trim, and parenthesis removal collapse stylistic variants.
- **Skills:** lowercase normalization, de-duplication, and sorted lists produce canonical rows and reduce global skill vocabulary.
- **Locations:** truncation before the first comma unifies location strings.

This cardinality reduction improves:
- Statistical reliability of counts and co-occurrences.
- Memory usage and runtime.
- Model stability and generalization.
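
The reduction can be measured directly by comparing `nunique()` before and after cleaning. The sketch below does this on a hypothetical three-row frame with simplified versions of the title and location rules; the helper names are assumptions.

```python
import pandas as pd

# Hypothetical raw values; the cleaning rules mirror the ones described above.
df = pd.DataFrame({
    "job_title": ["Data Scientist (NLP)", "data scientist", "Data Scientist - Remote"],
    "job_location": ["New York, NY", "New York, United States", "Boston, MA"],
})

def clean_title(title: str) -> str:
    # lowercase, cut at the first "-" and at the first "(", then strip
    return title.lower().split("-")[0].split("(")[0].strip()

def clean_location(loc: str) -> str:
    # keep the text before the first comma
    return loc.split(",")[0].strip()

before = df.nunique()
cleaned = pd.DataFrame({
    "job_title": df["job_title"].map(clean_title),
    "job_location": df["job_location"].map(clean_location),
})
after = cleaned.nunique()

print(pd.DataFrame({"before": before, "after": after}))
```

On the full dataset, the same comparison could be run on a copy of `jobs_df` taken before the cleaning cells.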

## Content-Based Recommendation System

In [21]:
from collections import defaultdict
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter

In [24]:
jobs_df_test = jobs_df.copy()

In [51]:
from collections import Counter
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

class JobRecommender:
    def __init__(self, jobs_df, min_support=0.005, rare_skill_threshold=10):
        self.jobs_df = jobs_df.copy()
        self.vectorizer = TfidfVectorizer(
            stop_words='english',
            max_features=10000,       
            min_df=100,               
            max_df=0.8                
        )
        self.min_support = min_support
        self.tfidf_matrix = None
        self.rules = None
        self.rare_skill_threshold = rare_skill_threshold
        self.df_onehot = None
        self.rare_skills = set()
        
        # Clean and filter rare skills
        self.remove_rare_skills()
        
        # Prepare TF-IDF
        self.preprocess()
        self.fit_tfidf()

    # -----------------------------
    # Rare skills filtering
    # -----------------------------
    def remove_rare_skills(self):
        # Flatten all skills and strip spaces
        all_skills = [s.strip() for skills in self.jobs_df['job_skills'] for s in skills.split(',')]
        
        # Count frequency
        skill_counts = Counter(all_skills)
        
        # Identify rare skills
        self.rare_skills = {skill for skill, count in skill_counts.items() if count <= self.rare_skill_threshold}
        
        # Remove rare skills from each job
        def filter_skills(skills_str):
            skills = [s.strip() for s in skills_str.split(',')]
            skills = [s for s in skills if s not in self.rare_skills]
            return ', '.join(skills)
        
        self.jobs_df['job_skills'] = self.jobs_df['job_skills'].apply(filter_skills)
        # Remove jobs with no skills left
        self.jobs_df = self.jobs_df[self.jobs_df['job_skills'].str.strip() != ''].reset_index(drop=True)

    # -----------------------------
    # Preprocessing and TF-IDF
    # -----------------------------
    def preprocess(self):
        self.jobs_df['title_skills'] = self.jobs_df['job_title'] + " " + self.jobs_df['job_skills']
    
    def preprocess_input(self, title, skills):
        input_title_skills = title + " " + ", ".join(skills)
        return input_title_skills
    
    def fit_input_tfidf(self, input_title_skills):
        input_vec = self.vectorizer.transform([input_title_skills])
        return input_vec
    
    def fit_tfidf(self):
        self.tfidf_matrix = self.vectorizer.fit_transform(self.jobs_df['title_skills'])
    
    # -----------------------------
    # Content-based recommendation
    # -----------------------------
    def recommend(self, query, top_k=5):
        input_title_skills = self.preprocess_input(query['title'], query['skills'])
        input_vec = self.fit_input_tfidf(input_title_skills)
        matrix = self.tfidf_matrix
        
        # Filter by city, fallback to country
        jobs_to_search = self.jobs_df[self.jobs_df['job_location'] == query['city']].index.tolist()
        if len(jobs_to_search) < 2:
            jobs_to_search = self.jobs_df[self.jobs_df['search_country'] == query['country']].index.tolist()

        matrix_subset = matrix[jobs_to_search]
        cos_scores = cosine_similarity(input_vec, matrix_subset).flatten()
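        # NOTE: [1:top_k+1] skips the highest-scoring posting. That slice suits
        # item-to-item lookups where the query is itself a row of the matrix; for an
        # external query like this one, [:top_k] would keep the best match.
        top_indices_in_subset = cos_scores.argsort()[::-1][1:top_k+1]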
        similar_indices = [jobs_to_search[i] for i in top_indices_in_subset]
        recommended_jobs = self.jobs_df.iloc[similar_indices]

        return self.print_recommendations(recommended_jobs)

    def print_recommendations(self, recommended_jobs):
        print("Recommended jobs:")
        display(recommended_jobs[['company','job_title', 'job_skills', 'job_location']])

    # -----------------------------
    # Step 1: Precompute frequent itemsets
    # -----------------------------
    def compute_frequent_itemsets(self):
        transactions = self.jobs_df['job_skills'].apply(lambda x: [s.strip() for s in x.split(',')]).tolist()
        te = TransactionEncoder()
        te_ary = te.fit(transactions).transform(transactions)
        self.df_onehot = pd.DataFrame(te_ary, columns=te.columns_)
        
        frequent_itemsets = apriori(self.df_onehot, min_support=self.min_support, use_colnames=True)
        self.rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
        return self.rules

    # -----------------------------
    # Step 2: Recommend skills given user skills
    # -----------------------------
    def recommend_skills(self, user_skills, top_n=5, sort_by='confidence'):
        # Filter rules where antecedents are subset of user_skills
        matching_rules = self.rules[self.rules['antecedents'].apply(lambda x: set(x).issubset(user_skills))]
        
        # Collect consequents
        recommended_skills = set()
        for cons in matching_rules['consequents']:
            recommended_skills.update(cons)
        
        # Remove skills user already has
        recommended_skills = recommended_skills - set(user_skills)
        
        # Optionally, sort by confidence or lift
        if not matching_rules.empty:
            sorted_rules = matching_rules.sort_values(by=sort_by, ascending=False)
            recommended_skills_ordered = []
            for cons in sorted_rules['consequents']:
                for skill in cons:
                    if skill in recommended_skills and skill not in recommended_skills_ordered:
                        recommended_skills_ordered.append(skill)
            return recommended_skills_ordered[:top_n]
        
        return list(recommended_skills)[:top_n]
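
The content-based part of the class boils down to three steps: fit TF-IDF on the `title_skills` strings, transform the query with the fitted vectorizer, and rank postings by cosine similarity. The sketch below shows those steps on a hypothetical three-document corpus. It uses the default `TfidfVectorizer` settings, because the notebook's `min_df=100` would discard every term in a corpus this small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical "title + skills" strings standing in for the `title_skills` column.
corpus = [
    "data scientist python, machine learning, statistics",
    "registered nurse patient care, bsn",
    "data analyst sql, excel, reporting",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# The query only goes through transform(), so it reuses the fitted vocabulary
# and IDF weights; terms unseen during fitting ("kubernetes") are ignored.
query = "data scientist " + ", ".join(["python", "kubernetes"])
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
ranked = scores.argsort()[::-1]  # posting indices, best match first

print("scores:", scores.round(3))
print("ranking:", ranked.tolist())
```

A posting that shares no vocabulary with the query scores exactly zero, which is why a location filter that leaves very few candidates can return unrelated jobs.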


In [52]:
job_recommender = JobRecommender(jobs_df)

query = {'city': 'New York', 'country': 'United States', 'title': 'Data Scientist', 'skills': ['python', 'machine learning', 'data analysis']}
job_recommender.recommend(query, top_k=5)
job_recommender.compute_frequent_itemsets()

recommended_skills = job_recommender.recommend_skills(query['skills'], top_n=5)
print("Recommended additional skills:", recommended_skills)


Recommended jobs:


Unnamed: 0,company,job_title,job_skills,job_location
2879,257,data scientist,"algorithms, big data, data analysis, data engi...",New York
3713,Tribal Tech - The Digital & Tech Recruitment S...,machine learning / data scientist,"clustering, collaboration, communication, comp...",New York
3335,JBC,staff data scientist,"clustering, collaboration skills, communicatio...",New York
13200,"Tribal Tech - The Digital, Data & AI Specialists",machine learning / data scientist,"clustering, collaboration, communication, comp...",New York
8528,JPMorgan Chase & Co.,machine learning scientist,"big data, data science, deep learning, machine...",New York


Recommended additional skills: ['sql']


In [53]:
job_recommender.rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(ability to work independently),(attention to detail),0.011377,0.133541,0.006095,0.535714,4.011609,1.0,0.004575,1.866219,0.759363,0.043902,0.464157,0.290677
1,(ability to work under pressure),(attention to detail),0.009481,0.133541,0.005824,0.614286,4.599978,1.0,0.004558,2.246375,0.790098,0.042448,0.554838,0.328948
2,(account management),(sales),0.024379,0.069818,0.012325,0.505556,7.241066,1.0,0.010623,1.881267,0.883436,0.150538,0.468443,0.341042
3,(cpa),(accounting),0.014018,0.047200,0.008939,0.637681,13.510241,1.0,0.008277,2.629728,0.939147,0.170984,0.619733,0.413532
4,(finance),(accounting),0.033385,0.047200,0.018826,0.563895,11.946959,1.0,0.017250,2.184793,0.947944,0.304825,0.542291,0.481373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1850,"(job search, professional profile)","(email alerts, cv, confidentiality)",0.006095,0.007178,0.005011,0.822222,114.544864,1.0,0.004967,5.584623,0.997348,0.606557,0.820937,0.760168
1851,"(job search, confidentiality)","(email alerts, cv, professional profile)",0.007517,0.006027,0.005011,0.666667,110.614232,1.0,0.004966,2.981919,0.998465,0.587302,0.664646,0.749064
1852,(email alerts),"(job search, cv, professional profile, confide...",0.008803,0.005011,0.005011,0.569231,113.592308,1.0,0.004967,2.309795,1.000000,0.569231,0.567061,0.784615
1853,(professional profile),"(job search, email alerts, cv, confidentiality)",0.007788,0.005892,0.005011,0.643478,109.221189,1.0,0.004965,2.788353,0.998621,0.578125,0.641365,0.747026
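
The `support`, `confidence`, and `lift` columns in the rules table are related by simple ratios. For rule 3 above (`cpa` → `accounting`), confidence is 0.008939 / 0.014018 ≈ 0.638 and lift is 0.637681 / 0.047200 ≈ 13.51. The sketch below recomputes the same quantities by hand on a hypothetical set of five transactions, without mlxtend.

```python
# Hypothetical skill transactions, one set per posting.
transactions = [
    {"cpa", "accounting", "excel"},
    {"cpa", "accounting"},
    {"accounting", "finance"},
    {"cpa", "tax"},
    {"sales", "excel"},
]
n = len(transactions)

def support(items: set) -> float:
    """Share of transactions that contain every item in `items`."""
    return sum(items <= t for t in transactions) / n

antecedent, consequent = {"cpa"}, {"accounting"}

sup_a = support(antecedent)                # antecedent support
sup_c = support(consequent)                # consequent support
sup_ac = support(antecedent | consequent)  # support of the full rule

confidence = sup_ac / sup_a  # P(consequent | antecedent)
lift = confidence / sup_c    # confidence relative to the consequent's base rate

print(f"support={sup_ac:.2f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift well above 1, as in the last rows of the table, means the skills co-occur far more often than their individual frequencies would suggest.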
