# LinkedIn Recommendation System

## Dataset download

This project uses the **“1.3M LinkedIn Jobs and Skills 2024”** dataset available on [Kaggle](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024).

The dataset contains over **1.3 million LinkedIn job postings** collected in 2024, including detailed information on job titles, descriptions, companies, and associated skills. It is used to train and evaluate our job recommendation system.

### Download Options

You can obtain the dataset in two ways:

1. **Using the Kaggle API (Recommended)** — automatic download and extraction.  
2. **Manual Download** — download the ZIP file directly from the dataset page and extract it yourself.


### Option 1 — Using the Kaggle API

To use the Kaggle API, ensure you have the Kaggle CLI installed and configured.

```bash
# Install Kaggle CLI
pip install kaggle

# Move your Kaggle API key (kaggle.json) into place
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Once configured, you can run the provided script to automatically download and unzip the dataset into the `data/` folder.

```bash
chmod +x ./download_linkedin_dataset.sh
./download_linkedin_dataset.sh
```

This script:
- Creates the `data/` folder if it does not exist.
- Downloads the dataset from Kaggle.
- Extracts the contents.
- Removes the ZIP file after extraction.
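
For readers who prefer to stay in Python, the steps above can be approximated by calling the Kaggle CLI through `subprocess`. This is a minimal sketch and not the contents of `download_linkedin_dataset.sh`, which is not reproduced here. The helper names `build_download_command` and `download_dataset` are hypothetical, and the sketch assumes the `kaggle` executable is on your `PATH` with credentials configured as described above.

```python
import subprocess
from pathlib import Path
from typing import List

# Dataset slug taken from the Kaggle URL referenced in this README.
DATASET = "asaniczka/1-3m-linkedin-jobs-and-skills-2024"


def build_download_command(dataset: str, target_dir: Path) -> List[str]:
    """Return the Kaggle CLI invocation that downloads and unzips a dataset."""
    return ["kaggle", "datasets", "download", "-d", dataset,
            "-p", str(target_dir), "--unzip"]


def download_dataset(dataset: str = DATASET, target_dir: Path = Path("data")) -> None:
    """Create the target folder if it does not exist, then run the Kaggle CLI."""
    target_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the CLI exits with an error.
    subprocess.run(build_download_command(dataset, target_dir), check=True)
```

Calling `download_dataset()` from the project root should leave the CSV files under `data/`. Whether the archive itself is left behind may depend on the CLI version, so check the folder after the first run.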

### Option 2 — Manual Download

If you prefer not to use the Kaggle API, you can manually download the dataset from:

🔗 **[https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024](https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024)**

After downloading:
1. Extract the ZIP file.  
2. Move all extracted files into the `data/` directory in your project.

### Notes
- Ensure your Kaggle API credentials (`kaggle.json`) are correctly configured in `~/.kaggle/`.
- The dataset is distributed under the **ODC Attribution License (ODC-By)**.
- The total size is ~2 GB (~1.3M entries), so the download may take several minutes depending on your connection.

## Data preparation

### Load and Inner Join

We load three CSVs:

- `job_postings_df` from `./data/linkedin_job_postings.csv`
- `job_summary_df` from `./data/job_summary.csv`
- `job_skills_df` from `./data/job_skills.csv`

We then **inner join** on the unique key `job_link`:

- `jobs_df = pd.merge(job_postings_df, job_skills_df, on="job_link", how="inner")`
- `jobs_df = pd.merge(jobs_df, job_summary_df, on="job_link", how="inner")`

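This keeps only postings that exist in every source and ensures aligned rows across tables. In the notebook cells below, the `job_summary_df` load and join are commented out, so only the postings and skills tables are merged.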
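
To make the effect of the inner join concrete, here is a small sketch on made-up tables. The frames `postings` and `skills` and their contents are illustrative only. `validate="one_to_one"` is an optional safeguard (not used in the notebook) that raises an error if `job_link` is duplicated on either side, and the outer join with `indicator=True` is one way to audit which keys the inner join discards.

```python
import pandas as pd

# Toy stand-ins for the real tables; the links are made up.
postings = pd.DataFrame({"job_link": ["a", "b", "c"],
                         "job_title": ["Nurse", "Chef", "Clerk"]})
skills = pd.DataFrame({"job_link": ["a", "b", "d"],
                       "job_skills": ["care", "cooking", "sales"]})

# Inner join: only keys present in both tables survive.
joined = pd.merge(postings, skills, on="job_link", how="inner",
                  validate="one_to_one")

# Outer join with an indicator column shows which keys had no partner.
audit = pd.merge(postings, skills, on="job_link", how="outer", indicator=True)
dropped = sorted(audit.loc[audit["_merge"] != "both", "job_link"].tolist())

print(joined["job_link"].tolist())  # keys kept by the inner join
print(dropped)                      # keys that appear in only one table
```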


In [1]:
import pandas as pd

# Load the datasets
job_skills_df = pd.read_csv('./data/job_skills.csv')
#job_summary_df = pd.read_csv('./data/job_summary.csv')
job_postings_df = pd.read_csv('./data/linkedin_job_postings.csv')


In [2]:
job_skills_df.head()

| | job_link | job_skills |
|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/housekeeper... | Building Custodial Services, Cleaning, Janitor... |
| 1 | https://www.linkedin.com/jobs/view/assistant-g... | Customer service, Restaurant management, Food ... |
| 2 | https://www.linkedin.com/jobs/view/school-base... | Applied Behavior Analysis (ABA), Data analysis... |
| 3 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Engineering, Project Controls, Sche... |
| 4 | https://www.linkedin.com/jobs/view/electrical-... | Electrical Assembly, Point to point wiring, St... |


In [3]:
#job_summary_df.head()

In [4]:
job_postings_df.head()

| | job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-21 07:12:29.00256+00 | t | t | f | Account Executive - Dispensing (NorCal/Norther... | BD | San Diego, CA | 2024-01-15 | Coronado | United States | Color Maker | Mid senior | Onsite |
| 1 | https://www.linkedin.com/jobs/view/registered-... | 2024-01-21 07:39:58.88137+00 | t | t | f | Registered Nurse - RN Care Manager | Trinity Health MI | Norton Shores, MI | 2024-01-14 | Grand Haven | United States | Director Nursing Service | Mid senior | Onsite |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | 2024-01-21 07:40:00.251126+00 | t | t | f | RESTAURANT SUPERVISOR - THE FORKLIFT | Wasatch Adaptive Sports | Sandy, UT | 2024-01-14 | Tooele | United States | Stand-In | Mid senior | Onsite |
| 3 | https://www.linkedin.com/jobs/view/independent... | 2024-01-21 07:40:00.308133+00 | t | t | f | Independent Real Estate Agent | Howard Hanna \| Rand Realty | Englewood Cliffs, NJ | 2024-01-16 | Pinehurst | United States | Real-Estate Clerk | Mid senior | Onsite |
| 4 | https://www.linkedin.com/jobs/view/group-unit-... | 2024-01-19 09:45:09.215838+00 | f | f | f | Group/Unit Supervisor (Systems Support Manager... | IRS, Office of Chief Counsel | Chamblee, GA | 2024-01-17 | Gadsden | United States | Supervisor Travel-Information Center | Mid senior | Onsite |


In [5]:
jobs_df = pd.merge(job_postings_df, job_skills_df, on='job_link', how='inner')
#jobs_df = pd.merge(jobs_df, job_summary_df, on='job_link', how='inner')

In [6]:
jobs_df.describe()

| | job_link | last_processed_time | got_summary | got_ner | is_being_worked | job_title | company | job_location | first_seen | search_city | search_country | search_position | job_level | job_type | job_skills |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296372 | 1296362 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1296381 | 1294296 |
| unique | 1296381 | 722728 | 1 | 1 | 1 | 565695 | 88995 | 28791 | 6 | 1018 | 4 | 1923 | 2 | 3 | 1287101 |
| top | https://www.linkedin.com/jobs/view/account-exe... | 2024-01-19 09:45:09.215838+00 | t | t | f | LEAD SALES ASSOCIATE-FT | Health eCareers | New York, NY | 2024-01-14 | North Carolina | United States | Account Executive | Mid senior | Onsite | Front Counter, DriveThru, Outside Order Taker,... |
| freq | 1 | 573487 | 1296381 | 1296381 | 1296381 | 7315 | 40049 | 12580 | 459354 | 9495 | 1105410 | 19465 | 1155276 | 1285565 | 169 |


These are the input columns kept for modeling. `job_description` is commented out because `job_summary_df` is not loaded:

In [7]:
#input_cols = ['job_link', 'job_title', 'job_location', 'search_country', 'job_skills', 'job_description']
input_cols = ['job_link', 'job_title', 'job_location', 'search_country', 'job_skills']

### Row Filtering

We remove rows that do not contain the required NLP outputs and rows flagged as in-progress:

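- Drop entries **without NER** results (`got_ner` is not `"t"`).
- Drop entries **without summary** (`got_summary` is not `"t"`).
- Drop entries where **`is_being_worked` is true**, i.e. keep only rows where it equals `"f"`.

The flags are stored as the strings `"t"` and `"f"`, not as booleans, so the filter compares against those strings. We then keep only the input columns and drop rows with missing values. This reduces noise and guarantees each training sample has complete text features.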

In [8]:
jobs_df = jobs_df.loc[
    (jobs_df["is_being_worked"] == "f")
    & (jobs_df["got_summary"] == "t")
    & (jobs_df["got_ner"] == "t")
]

jobs_df = jobs_df[input_cols].dropna().reset_index(drop=True)
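
The filter can be checked on a handful of made-up rows before running it on the full table. The sketch below is illustrative only: the frame `toy` and its values are invented, but the mask is the same three-way condition used in the cell above.

```python
import pandas as pd

# Made-up rows; the flags are the strings "t"/"f", as in the CSV.
toy = pd.DataFrame({
    "job_link": ["a", "b", "c", "d"],
    "got_summary": ["t", "f", "t", "t"],
    "got_ner": ["t", "t", "f", "t"],
    "is_being_worked": ["f", "f", "f", "t"],
    "job_skills": ["care", "cooking", "sales", "driving"],
})

# A row is kept only if all three conditions hold.
keep = (
    (toy["is_being_worked"] == "f")
    & (toy["got_summary"] == "t")
    & (toy["got_ner"] == "t")
)
filtered = toy.loc[keep].reset_index(drop=True)

print(filtered["job_link"].tolist())
print(int(keep.sum()), "of", len(toy), "rows kept")
```

Printing the number of rows kept after each step is a cheap way to notice when a condition removes far more rows than expected.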

### Title Normalization and Reduction of Unique Values

We normalize `job_title` with a custom function:

1. **Lowercase** titles.
2. **Trim at the first dash**: keep text before `"-"` to collapse variants like  
   `Senior Software Engineer - Backend` → `senior software engineer`.
3. **Remove parenthetical fragments**: truncate at the first `(`, e.g.  
   `data scientist (NLP)` → `data scientist`. Any text after the closing parenthesis is dropped as well.
4. **Strip whitespace**.

**Effect:** Different textual variants map to a **single canonical form**, which **reduces the number of unique job titles** and stabilizes downstream grouping and modeling.


In [9]:
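def _clean_job_titles(job_title: str) -> str:
    # Lowercase so that case variants collapse.
    job_title = job_title.lower()
    # Keep only the text before the first dash (a no-op when there is none).
    job_title = job_title.split('-')[0]
    # Keep only the text before the first opening parenthesis.
    job_title = job_title.split('(')[0]
    return job_title.strip()

jobs_df['job_title'] = jobs_df['job_title'].astype(str).apply(_clean_job_titles)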
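
As a quick sanity check, the same steps can be run on a few made-up titles. The function name `normalize_title` and the sample strings are hypothetical; the logic is intended to mirror `_clean_job_titles` (lowercase, cut at the first dash, cut at the first opening parenthesis, strip).

```python
def normalize_title(title: str) -> str:
    """Lowercase, keep text before the first '-', then before the first '('."""
    title = title.lower()
    title = title.split("-")[0]
    title = title.split("(")[0]
    return title.strip()


samples = [
    "Senior Software Engineer - Backend",
    "Senior Software Engineer (Remote)",
    "senior software engineer",
    "X-Ray Technician",  # hyphenated title: everything after "x" is lost
]
normalized = [normalize_title(t) for t in samples]

print(normalized)
print(len(set(samples)), "->", len(set(normalized)), "unique titles")
```

The first three variants collapse into one canonical title, which is the intended effect. The last sample shows a side effect of cutting at every dash: hyphenated titles are truncated too, so it may be worth inspecting the shortest normalized titles after this step.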

### Skill Canonicalization and Reduction of Unique Values

We clean `job_skills` as a comma-separated list:

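1. **Lowercase** the string and **split by comma**.
2. **Remove every character that is not a letter or whitespace**, collapse repeated spaces, and **strip** each token. Digits and symbols are lost in this step, so `c++` and `c#` both become `c`.
3. **De-duplicate per posting** to avoid repeated skills.
4. **Sort tokens** so the per-row skill list has a consistent order.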

We also track a **global set of unique skills** to measure coverage.

**Effect:** Canonicalization merges superficial variants and ordering differences, which **reduces both per-row and global unique skill counts**. This yields a more compact and reliable skill space.


In [10]:
import re

# create a cleaned list of skills per job and a global unique skills array
def _clean_split_skills(skills_str: str):
    # coerce to str before lowercasing so non-string values (e.g. NaN) cannot break .lower()
    parts = str(skills_str).lower().split(',')
    unique = set()
    for part in parts:
        # remove non A-Za-z characters except whitespace, collapse spaces and strip ends
        c = re.sub(r'[^A-Za-z\s]', '', part)
        c = re.sub(r'\s+', ' ', c).strip()
        if c:
            unique.add(c)
    # return a deterministic, cleaned, ordered string
    ordered = sorted(unique)
    return ', '.join(ordered)

jobs_df['job_skills'] = jobs_df['job_skills'].astype(str).apply(_clean_split_skills)
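
The sketch below runs the same canonicalization on made-up skill strings to show that ordering, casing, and spacing differences disappear. `canonicalize_skills` is a hypothetical name; the regular expressions follow the ones in `_clean_split_skills`.

```python
import re


def canonicalize_skills(skills: str) -> str:
    """Lowercase, split on commas, drop non-letters, de-duplicate, sort."""
    cleaned = set()
    for part in skills.lower().split(","):
        token = re.sub(r"[^a-z\s]", "", part)       # keep letters and whitespace only
        token = re.sub(r"\s+", " ", token).strip()  # collapse runs of whitespace
        if token:
            cleaned.add(token)
    return ", ".join(sorted(cleaned))


a = canonicalize_skills("Python,  SQL , python, Data  Analysis")
b = canonicalize_skills("data analysis, sql, PYTHON")
symbols = canonicalize_skills("C++, C#, C, HTML5")

print(a)
print(a == b)
print(symbols)
```

The last call illustrates the cost of stripping non-letter characters: `C++`, `C#`, and `C` all end up as the same token, and `HTML5` loses its version digit. If those distinctions matter for recommendations, the character filter would need to be relaxed.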

In [11]:
seen = set()
skills_array = []
for lst in jobs_df['job_skills']:
    for skill in lst.split(','):
        skill = skill.strip()
        if skill not in seen:
            seen.add(skill)
            skills_array.append(skill)

print(f"Jobs rows: {len(jobs_df)}, sample job_skills (first 5):\n", jobs_df['job_skills'].head())
print(f"Global unique skills count: {len(skills_array)}")
#skills_array[:20]

Jobs rows: 1294277, sample job_skills (first 5):
 0    bachelors degree, bd offerings, challenges, co...
1    bachelor of science in nursing, care managemen...
2    arithmetic skills, bending and kneeling abilit...
3    closing statements, communication, customer se...
4    bsn, diversity, equal opportunity employer, eq...
Name: job_skills, dtype: object
Global unique skills count: 2668580


We clean the locations because the same place appears under several differently formatted strings.

### Location Cleaning and Reduction of Unique Values

We standardize `job_location` by **keeping only the part before the first comma**:

- Example: `San Diego, CA` → `San Diego`

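**Effect:** This collapses formatting variants that differ only by state or country suffix. It **reduces the number of unique locations** and helps counter sparse geography fields while preserving city-level signal. The trade-off is that distinct cities sharing a name (for example in different states or countries) are merged as well.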

In [12]:
jobs_df['job_location'].value_counts()

job_location
New York, NY                           12562
London, England, United Kingdom        10879
Houston, TX                            10317
Chicago, IL                            10155
Los Angeles, CA                         9724
                                       ...  
Haarlem, North Holland, Netherlands        1
Benington, England, United Kingdom         1
La Tuna, TX                                1
Voluntown, CT                              1
New Jerusalem, PA                          1
Name: count, Length: 28776, dtype: int64

In [13]:
def _clean_location(loc: str):
    return loc.split(',')[0].strip()

jobs_df['job_location'] = jobs_df['job_location'].astype(str).apply(_clean_location)
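
A small illustration of the truncation on made-up location strings follows. `city_only` is a hypothetical name for the same one-line rule as `_clean_location`, and the sample list is invented to show both the intended merge and the collision between same-named cities.

```python
from collections import Counter


def city_only(location: str) -> str:
    """Keep the text before the first comma."""
    return location.split(",")[0].strip()


raw = [
    "New York, NY",
    "New York, New York, United States",
    "London, England, United Kingdom",
    "London, KY",  # a different London
    "San Diego, CA",
]
cities = Counter(city_only(loc) for loc in raw)

print(len(set(raw)), "->", len(cities), "unique locations")
print(cities.most_common())
```

The two New York variants merge as intended, but the two Londons merge as well. Keeping `search_country` alongside `job_location` helps separate some of these cases downstream.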

In [14]:
jobs_df.head()

| | job_link | job_title | job_location | search_country | job_skills |
|---|---|---|---|---|---|
| 0 | https://www.linkedin.com/jobs/view/account-exe... | account executive | San Diego | United States | bachelors degree, bd offerings, challenges, co... |
| 1 | https://www.linkedin.com/jobs/view/registered-... | registered nurse | Norton Shores | United States | bachelor of science in nursing, care managemen... |
| 2 | https://www.linkedin.com/jobs/view/restaurant-... | restaurant supervisor | Sandy | United States | arithmetic skills, bending and kneeling abilit... |
| 3 | https://www.linkedin.com/jobs/view/independent... | independent real estate agent | Englewood Cliffs | United States | closing statements, communication, customer se... |
| 4 | https://www.linkedin.com/jobs/view/registered-... | registered nurse | Muskegon | United States | bsn, diversity, equal opportunity employer, eq... |


In [15]:
jobs_df.describe()

| | job_link | job_title | job_location | search_country | job_skills |
|---|---|---|---|---|---|
| count | 1294277 | 1294277 | 1294277 | 1294277 | 1294277 |
| unique | 1294277 | 314791 | 21036 | 4 | 1283123 |
| top | https://www.linkedin.com/jobs/view/account-exe... | registered nurse | New York | United States | coaching, dining room attendant, drinks, drive... |
| freq | 1 | 27530 | 14851 | 1103598 | 215 |


### Feature Set Used

After preprocessing, the working DataFrame includes the key fields required for analysis and modeling:

- `job_link` (primary key, post-join)
- `job_title` (normalized)
- `job_location` (city-only normalized)
- `search_country`
- `job_skills` (cleaned, sorted, de-duplicated)
- `job_description` (only when `job_summary_df` is joined; it is excluded from `input_cols` in the cells above)

These fields form the basis for representation building and recommendation.


### Impact on Cardinality (Unique Values)

The following transformations are specifically designed to **reduce the number of unique values**:

- **Job titles:** lowercasing, dash-trim, and parenthesis removal collapse stylistic variants.
- **Skills:** lowercase normalization, de-duplication, and sorted lists produce canonical rows and reduce global skill vocabulary.
- **Locations:** truncation before the first comma unifies location strings.

This cardinality reduction improves:
- Statistical reliability of counts and co-occurrences.
- Memory usage and runtime.
- Model stability and generalization.
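
In the `describe()` outputs above, `job_title` goes from 565,695 to 314,791 unique values and `job_location` from 28,776 to 21,036. The title figure before cleaning was taken before row filtering, so that comparison is approximate. A helper along the following lines could report the reduction per column in one place. This is a sketch under assumptions: `cardinality_report` is a hypothetical name, and the two small frames are made up to stand in for snapshots of `jobs_df` taken before and after cleaning.

```python
import pandas as pd


def cardinality_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Compare per-column unique counts for the columns both frames share."""
    cols = [c for c in before.columns if c in after.columns]
    report = pd.DataFrame({
        "before": before[cols].nunique(),
        "after": after[cols].nunique(),
    })
    # Percentage of unique values removed by the cleaning step.
    report["reduction_pct"] = (100 * (1 - report["after"] / report["before"])).round(1)
    return report


# Made-up snapshots standing in for jobs_df before and after cleaning.
raw = pd.DataFrame({
    "job_title": ["Chef - Nights", "chef", "Chef (PT)", "Nurse"],
    "job_location": ["Sandy, UT", "Sandy, Utah", "Reno, NV", "Reno, NV"],
})
clean = pd.DataFrame({
    "job_title": ["chef", "chef", "chef", "nurse"],
    "job_location": ["Sandy", "Sandy", "Reno", "Reno"],
})

report = cardinality_report(raw, clean)
print(report)
```

To use it on the real data, keep a copy of the relevant columns (for example `jobs_df[input_cols].copy()`) before the cleaning cells and pass it as `before`.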