In [2]:
!pip install confluent_kafka pandas


Collecting confluent_kafka
  Downloading confluent_kafka-2.10.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (22 kB)
Downloading confluent_kafka-2.10.0-cp311-cp311-manylinux_2_28_x86_64.whl (3.8 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m85.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: confluent_kafka
Successfully installed confluent_kafka-2.10.0


In [3]:
from confluent_kafka import Consumer, KafkaException
import pandas as pd
import json
from google.colab import userdata
import uuid

## Kafka Consumer & Dynamic Corpus Creation:

The Kafka consumer is responsible for subscribing to a topic where job data is streamed dynamically from the LinkedIn API using the Producer. As new job postings are published to the topic, the consumer listens in real time, retrieves each job payload, and stores the relevant information — such as job title, description, skills, and location — into a structured format. This dynamic corpus forms the real-time stream of job opportunities that reflect the most up-to-date market demand.

In [20]:
conf = {
    'bootstrap.servers': userdata.get("BOOTSTRAP_SERVER"),
    'security.protocol': "SASL_SSL",
    'sasl.mechanism': "PLAIN",
    'sasl.username': userdata.get("CONFLUENT_API_KEY"),
    'sasl.password': userdata.get("CONFLUENT_API_SECRET"),
    'group.id': 'group-12',   # Always use a fresh group
    'auto.offset.reset': 'earliest',              # Read all from start
    'enable.auto.commit': False
}


In [21]:
TOPIC = "topic_2"


In [22]:
from confluent_kafka import Consumer, KafkaException
import json
import pandas as pd

TOPIC = 'topic_2'
consumer = Consumer(conf)
consumer.subscribe([TOPIC])
all_jobs = []

print("📡 Listening for job messages on topic_2...")

# 🔁 Stop after 10 consecutive empty polls
no_message_limit = 10
no_message_count = 0

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            no_message_count += 1
            print(f" No message received... ({no_message_count}/{no_message_limit})")
            if no_message_count >= no_message_limit:
                print("No new messages. Stopping consumer.")
                break
            continue

        if msg.error():
            raise KafkaException(msg.error())

        # ✅ Reset counter when a message arrives
        no_message_count = 0

        job = json.loads(msg.value().decode('utf-8'))
        all_jobs.append(job)

        print(f"{job['title']} at {job['company']} in {job['location']}")
        print(f"  Type: {job['type']}, Level: {job['job_level']}")
        print(f"  Salary Range: ${job['salary_min']} - ${job['salary_max']}")
        print(f"  Skills Required: {job['skills_required']}")
        print(f"  Apply Here: {job['apply_link']}")
        print()

except KeyboardInterrupt:
    print(" Stopped by user.")

finally:
    consumer.close()
    print(" Done receiving all jobs.")


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
  Apply Here: https://www.linkedin.com/jobs/view/4204032138

Canada PR (Non Voice) (Remote - Mumbai/Bangalore only) at First Advantage in United States (Remote)
  Type: , Level: Unspecified
  Salary Range: $80000 - $110000
  Skills Required: ['agile', 'wireframes', 'a/b testing', 'jira', 'stakeholder management', 'prototyping', 'confluence', 'figma', 'design thinking', 'mvp', 'kanban', 'competitive analysis', 'product roadmap', 'user stories', 'scrum', 'user research', 'market research']
  Apply Here: https://www.linkedin.com/jobs/view/4216398014

Summer 2025 Cybersecurity Undergraduate Internship at Brennan Center for Justice in New York, NY
  Type: , Level: Junior
  Salary Range: $30000 - $50000
  Skills Required: []
  Apply Here: https://www.linkedin.com/jobs/view/4214826538

Government Affairs Intern at Socure in United States (Remote)
  Type: , Level: Junior
  Salary Range: $30000 - $50000
  Skills Required: []
  App

In [23]:
jobs_df = pd.DataFrame(all_jobs)
jobs_df.to_csv("jobs.csv", index=False)
print(f"Saved {len(jobs_df)} jobs to jobs.csv")



Saved 2652 jobs to jobs.csv


In [24]:
jobs_df.head()

Unnamed: 0,job_id,title,company,location,type,benefits,description,salary_min,salary_max,job_level,skills_required,apply_link
0,4219413357,Measurement Intern-Fall 2025 (Unpaid),International Rescue Committee,"New York, NY (Remote)",,,,30000,50000,Unspecified,[],https://www.linkedin.com/jobs/view/4219413357
1,4221383857,Data Science Intern - Summer 2025,WSP in the U.S.,"New York, NY (On-site)",,,,30000,50000,Unspecified,"[numpy, machine learning, mlflow, jupyter, das...",https://www.linkedin.com/jobs/view/4221383857
2,4203643173,Summer Internship 2025 - AI Solutions,TradeStation,United States (Remote),,,,30000,50000,Unspecified,[],https://www.linkedin.com/jobs/view/4203643173
3,4215910516,Research Assistant (Part-Time),American Museum of Natural History,"New York, NY (On-site)",,$20/hr · 1 benefit,,80000,110000,Unspecified,[],https://www.linkedin.com/jobs/view/4215910516
4,4189605482,Junior Java Developer,Moyi-Tech,United States (Remote),,$70K/yr - $120K/yr,,70000,120000,Junior,"[full stack, angular, vue, typescript, rest ap...",https://www.linkedin.com/jobs/view/4189605482


In [25]:
jobs_df.isna().sum()

Unnamed: 0,0
job_id,0
title,0
company,0
location,0
type,0
benefits,0
description,0
salary_min,0
salary_max,0
job_level,0


In [27]:
import ast

df_dynamic_raw = jobs_df.copy()

# Format experience level
df_dynamic_raw["formatted_experience_level"] = df_dynamic_raw["job_level"].fillna("Unspecified").str.capitalize()

# 2. Format salary
df_dynamic_raw["salary"] = (
    df_dynamic_raw["salary_min"].fillna(0).astype(int).astype(str) + " - " +
    df_dynamic_raw["salary_max"].fillna(0).astype(int).astype(str)
)
df_dynamic_raw["salary"] = df_dynamic_raw["salary"].replace("0 - 0", "unspecified")

# 3. Parse skills list (stored as stringified list)
df_dynamic_raw["skills_desc"] = df_dynamic_raw["skills_required"].apply(lambda x: ", ".join(ast.literal_eval(x)) if isinstance(x, str) else "")

# 4. Build text_for_embedding
df_dynamic_raw["text_for_embedding"] = (
    df_dynamic_raw["title"].fillna('') + " " +
    df_dynamic_raw["description"].fillna('') + " " +
    df_dynamic_raw["skills_desc"] + " " +
    df_dynamic_raw["formatted_experience_level"] + " " +
    df_dynamic_raw["salary"]
)

# 5. Rename columns to match static schema
df_dynamic_jobs_corpus = df_dynamic_raw.rename(columns={
    "apply_link": "job_posting_url"
})[[
    "job_id", "title", "location", "formatted_experience_level",
    "text_for_embedding", "salary", "job_posting_url"
]]


In [28]:
df_dynamic_jobs_corpus.head()

Unnamed: 0,job_id,title,location,formatted_experience_level,text_for_embedding,salary,job_posting_url
0,4219413357,Measurement Intern-Fall 2025 (Unpaid),"New York, NY (Remote)",Unspecified,Measurement Intern-Fall 2025 (Unpaid) Unspec...,30000 - 50000,https://www.linkedin.com/jobs/view/4219413357
1,4221383857,Data Science Intern - Summer 2025,"New York, NY (On-site)",Unspecified,Data Science Intern - Summer 2025 Unspecifie...,30000 - 50000,https://www.linkedin.com/jobs/view/4221383857
2,4203643173,Summer Internship 2025 - AI Solutions,United States (Remote),Unspecified,Summer Internship 2025 - AI Solutions Unspec...,30000 - 50000,https://www.linkedin.com/jobs/view/4203643173
3,4215910516,Research Assistant (Part-Time),"New York, NY (On-site)",Unspecified,Research Assistant (Part-Time) Unspecified 8...,80000 - 110000,https://www.linkedin.com/jobs/view/4215910516
4,4189605482,Junior Java Developer,United States (Remote),Junior,Junior Java Developer Junior 70000 - 120000,70000 - 120000,https://www.linkedin.com/jobs/view/4189605482


## Augmenting with Static User and Job Datasets:

To enhance the size and diversity of the recommendation corpus, especially given the rate limits imposed by the API, I incorporated static datasets for both users and job postings. The static job dataset adds a broader pool of potential roles, while the static user dataset provides a reliable set of user profiles with detailed background information. Combining dynamic and static sources ensures the recommendation system remains robust and scalable, even under API constraints.

In [29]:
import pandas as pd
import zipfile

# Unzip directly
with zipfile.ZipFile("/content/postings.csv.zip", 'r') as zip_ref:
    zip_ref.extractall("/content/")

# Load the extracted CSV (assuming it's named postings.csv)
df_jobs_static = pd.read_csv("/content/postings.csv")

# Optional: Clean column names
df_jobs_static.columns = df_jobs_static.columns.str.strip().str.lower()

# Preview the data
df_jobs_static.head()



Unnamed: 0,job_id,company_name,title,description,max_salary,pay_period,location,company_id,views,med_salary,...,skills_desc,listed_time,posting_domain,sponsored,work_type,currency,compensation_type,normalized_salary,zip_code,fips
0,921716,Corcoran Sawyer Smith,Marketing Coordinator,Job descriptionA leading real estate firm in N...,20.0,HOURLY,"Princeton, NJ",2774458.0,20.0,,...,Requirements: \n\nWe are seeking a College or ...,1713398000000.0,,0,FULL_TIME,USD,BASE_SALARY,38480.0,8540.0,34021.0
1,1829192,,Mental Health Therapist/Counselor,"At Aspen Therapy and Wellness , we are committ...",50.0,HOURLY,"Fort Collins, CO",,1.0,,...,,1712858000000.0,,0,FULL_TIME,USD,BASE_SALARY,83200.0,80521.0,8069.0
2,10998357,The National Exemplar,Assitant Restaurant Manager,The National Exemplar is accepting application...,65000.0,YEARLY,"Cincinnati, OH",64896719.0,8.0,,...,We are currently accepting resumes for FOH - A...,1713278000000.0,,0,FULL_TIME,USD,BASE_SALARY,55000.0,45202.0,39061.0
3,23221523,"Abrams Fensterman, LLP",Senior Elder Law / Trusts and Estates Associat...,Senior Associate Attorney - Elder Law / Trusts...,175000.0,YEARLY,"New Hyde Park, NY",766262.0,16.0,,...,This position requires a baseline understandin...,1712896000000.0,,0,FULL_TIME,USD,BASE_SALARY,157500.0,11040.0,36059.0
4,35982263,,Service Technician,Looking for HVAC service tech with experience ...,80000.0,YEARLY,"Burlington, IA",,3.0,,...,,1713452000000.0,,0,FULL_TIME,USD,BASE_SALARY,70000.0,52601.0,19057.0


In [30]:
df = df_jobs_static.copy()

# Clean column names
df.columns = df.columns.str.strip().str.lower()
# Drop rows with missing title or description
df = df.dropna(subset=["title", "description", "location"])


In [31]:
#  Define NYC-only filter list
nyc_locations = [
    "new york, ny", "brooklyn, ny", "manhattan, ny", "queens, ny",
    "bronx, ny", "staten island, ny", "new york, united states",
    "new york city metropolitan area", "brooklyn, new york, united states",
    "queens, new york, united states", "staten island, new york, united states",
    "new york, new york, united states"
]

#  Filter by NYC locations
df = df[df["location"].str.lower().isin(nyc_locations)]

# Drop rows with missing required fields
required_cols = ["title", "description", "location"]
df = df.dropna(subset=required_cols)

df["salary"] = df["med_salary"]
df["salary"] = df["salary"].fillna(df["min_salary"])


#  Handle salary (combine median → min fallback)
df["salary"] = df["med_salary"]
df["salary"] = df["salary"].fillna(df["min_salary"])


In [32]:
import numpy as np

def infer_experience_level(title):
    title = title.lower()
    if "intern" in title or "junior" in title:
        return "entry level"
    elif any(x in title for x in ["senior", "lead", "principal"]):
        return "senior level"
    elif any(x in title for x in ["manager", "director", "vp", "head"]):
        return "manager"
    elif "associate" in title:
        return "mid level"
    else:
        return 'unspecified'  # leave for possible future inference

# Fill missing experience level only
df["formatted_experience_level"] = df["formatted_experience_level"].fillna(df["title"].apply(infer_experience_level))


In [33]:
df["salary_imputed"] = df["salary"]

# Compute median salary per experience level
salary_by_exp = df.groupby("formatted_experience_level")["salary"].median()

# Fill function with 'unspecified' fallback
def fill_salary(row):
    if pd.notna(row["salary"]):
        return row["salary"]
    elif pd.notna(row["formatted_experience_level"]) and row["formatted_experience_level"] in salary_by_exp:
        return salary_by_exp[row["formatted_experience_level"]]
    else:
        return "unspecified"

df["salary"] = df.apply(fill_salary, axis=1).astype(str)
df = df.drop(columns=["salary_imputed"])

# --- Step 3: Rebuild text_for_embedding column ---
df["text_for_embedding"] = (
    df["title"].fillna('') + " " +
    df["description"].fillna('') + " " +
    df["skills_desc"].fillna('') + " " +
    df["formatted_experience_level"].fillna('') + " " +
    df["salary"]
)


In [34]:

df["text_for_embedding"] = (
    df["title"].fillna('') + " " +
    df["description"].fillna('') + " " +
    df["skills_desc"].fillna('') + " " +
    df["formatted_experience_level"].fillna('') + " " +
    df["salary"]
)
#  Keep only final columns you care about
df_static_jobs_corpus = df[[
    "job_id", "title", "location", "formatted_experience_level",
    "text_for_embedding", "salary", "job_posting_url"
]]

In [35]:
df_static_jobs_corpus['text_for_embedding']

Unnamed: 0,text_for_embedding
43,HVAC Technician Service and installation of HV...
49,"Transactional Attorney Growing, boutique law f..."
57,Director of Training Job Posting: Service and ...
59,Social Media Coordinator 🚀 Exciting Opportunit...
60,Equity Institutional Sales Position Role:Equit...
...,...
123697,Underwriting Associate(Middle Market Private E...
123735,NetSuite EDI Consultant Role Overview:As an ED...
123748,Client Account Manager II About Pinterest:\n\n...
123808,Construction Estimator Our client is looking f...


In [36]:
import pandas as pd

# Adjust the path if needed
dataset_path = "/content/LinkedIn_Dataset.pcl"
df_users_raw = pd.read_pickle(dataset_path)

# Check structure
df_users_raw.head()
df_users_raw.columns


Index(['Intro', 'Full Name', 'Workplace', 'Location', 'Connections', 'Photo',
       'Followers', 'About', 'Experiences', 'Number of Experiences',
       'Educations', 'Number of Educations', 'Licenses', 'Number of Licenses',
       'Volunteering', 'Number of Volunteering', 'Skills', 'Number of Skills',
       'Recommendations', 'Number of Recommendations', 'Projects',
       'Number of Projects', 'Publications', 'Number of Publications',
       'Courses', 'Number of Courses', 'Honors', 'Number of Honors', 'Scores',
       'Number of Scores', 'Languages', 'Number of Languages', 'Organizations',
       'Number of Organizations', 'Interests', 'Number of Interests',
       'Activities', 'Number of Activities', 'Label'],
      dtype='object')

In [37]:
df_users = df_users_raw.dropna(subset=["Full Name", "Location", "Workplace"]).copy()


In [38]:
cols_to_use = ["Intro", "About", "Experiences", "Educations", "Projects", "Skills", "Languages"]

# Convert all columns to string safely using .loc
for col in cols_to_use:
    df_users.loc[:, col] = df_users[col].astype(str)

# Rebuild text_for_embedding safely
df_users.loc[:, "text_for_embedding"] = (
    df_users["Intro"].fillna('') + " " +
    df_users["About"].fillna('') + " " +
    df_users["Experiences"].fillna('') + " " +
    df_users["Educations"].fillna('') + " " +
    df_users["Projects"].fillna('') + " " +
    df_users["Skills"].fillna('') + " " +
    df_users["Languages"].fillna('')
)

df_users = df_users.reset_index(drop=True).copy()
df_users["user_id"] = df_users.index

# Final corpus
df_user_corpus = df_users[[
    "user_id", "Full Name", "Location", "Workplace", "text_for_embedding"
]]


In [39]:
role_keywords = [
    "data scientist", "data analyst", "software engineer", "machine learning",
    "developer", "ml engineer", "product manager", "ai", "cloud", "nlp"
]

# Join keywords into a regex pattern
pattern = '|'.join(role_keywords)

# Filter based on text_for_embedding
matching_users = df_user_corpus[
    df_user_corpus["text_for_embedding"].str.lower().str.contains(pattern, na=False)
]

# Get their user_ids
matching_user_ids = matching_users["user_id"].tolist()

# Preview
print("Matching user IDs based on profile content:", matching_user_ids[:50])



Matching user IDs based on profile content: [0, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 16, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 32, 33, 34, 35, 36, 38, 41, 46, 48, 49, 50, 53, 58, 61, 62, 63, 68, 69, 71, 73, 74, 76, 78, 79, 80, 81, 82, 83]


## Embedding and Similarity-Based Matching:

Once both the user and job corpora are assembled, each profile and posting is converted into dense vector embeddings using the sentence-transformers library. These embeddings capture the semantic meaning of each profile’s content. To generate personalized job recommendations, cosine similarity is calculated between each user's vector and all job vectors. The top-ranked results — based on similarity scores — are returned as the most relevant job matches for that user.

In [40]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-4.1.0-py3-none-any.whl.metadata (13 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_

In [41]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


In [42]:
# Load pretrained model
model = SentenceTransformer('all-MiniLM-L6-v2')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [43]:
# Generate embeddings
df_all_jobs = pd.concat([df_static_jobs_corpus, df_dynamic_jobs_corpus], ignore_index=True)
print("Encoding job postings...")
job_embeddings = model.encode(df_all_jobs["text_for_embedding"].tolist(), show_progress_bar=True)
print("Encoding user profiles...")
user_embeddings = model.encode(df_user_corpus["text_for_embedding"].tolist(), show_progress_bar=True)


Encoding job postings...


Batches:   0%|          | 0/239 [00:00<?, ?it/s]

Encoding user profiles...


Batches:   0%|          | 0/78 [00:00<?, ?it/s]

In [44]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def recommend_jobs(user_id, top_n=10):
    # Check if user exists
    if user_id not in df_users["user_id"].values:
        print(" User ID not found.")
        return

    # Get user vector
    user_index = df_users[df_users["user_id"] == user_id].index[0]
    user_vector = user_embeddings[user_index]

    # Cosine similarity between user and all jobs
    similarities = cosine_similarity([user_vector], job_embeddings)[0]
    top_indices = np.argsort(similarities)[::-1][:top_n]

    # Get top jobs
    recommended = df_all_jobs.iloc[top_indices].copy()
    recommended["similarity_score"] = similarities[top_indices]

    # Truncate description if needed
    if "description_short" not in recommended.columns and "description" in recommended.columns:
        def truncate_description(text, word_limit=100):
            if pd.isna(text):
                return ""
            words = text.split()
            return " ".join(words[:word_limit]) + ("..." if len(words) > word_limit else "")
        recommended["description_short"] = recommended["description"].apply(truncate_description)

    # Select columns to return
    columns = [
        "title", "location", "salary", "formatted_experience_level",
        "skills_desc", "description_short", "job_posting_url", "similarity_score"
    ]
    columns_to_return = [col for col in columns if col in recommended.columns]

    return recommended[columns_to_return]


In [45]:
#function to view user details
def view_user_profile(user_id):
    if user_id not in df_user_corpus["user_id"].values:
        print(" User ID not found.")
        return
    user_profile = df_user_corpus[df_user_corpus["user_id"] == user_id].iloc[0]
    print(f"\n👤 Profile for User ID {user_id}")
    print(f"Name: {user_profile['Full Name']}")
    print(f"Location: {user_profile['Location']}")
    print(f"Workplace: {user_profile['Workplace']}")
    print(f"Label (0=Real, 1=Fake, 10/11=Generated): {user_profile['Label']}\n")
    print("📝 Profile Content:")
    print(user_profile['text_for_embedding'][:2000])  # limit to 2000 chars for readability


In [46]:
#funtion to show user and recommended jobs
def show_user_and_recommend_jobs(user_id, top_n=5):
    # Validate user
    if user_id not in df_user_corpus["user_id"].values:
        print("❌ User ID not found.")
        return

    # Get user info
    user = df_user_corpus[df_user_corpus["user_id"] == user_id].iloc[0]
    print("👤 USER PROFILE")
    print(f"User ID     : {user_id}")
    print(f"Name        : {user['Full Name']}")
    print(f"Location    : {user['Location']}")
    print(f"Workplace   : {user['Workplace']}")
    print("\n📝 Profile Summary:")
    print(user['text_for_embedding'][:300] + "...\n")

    # Get recommendations
    print("💼 TOP RECOMMENDED JOBS:\n")
    recommendations = recommend_jobs(user_id=user_id, top_n=top_n)
    display(recommendations)


In [47]:
print(f"df_all_jobs shape: {df_all_jobs.shape}")
print(f"job_embeddings shape: {job_embeddings.shape}")



df_all_jobs shape: (7632, 7)
job_embeddings shape: (7632, 384)


In [181]:
show_user_and_recommend_jobs(user_id=955, top_n=5)


👤 USER PROFILE
User ID     : 955
Name        : Dheevatsa Mudigere
Location    : San Francisco Bay Area
Workplace   : Accelerated compute and Systems for AI

📝 Profile Summary:
{'Full Name': 'Dheevatsa Mudigere', 'Workplace': 'Accelerated compute and Systems for AI', 'Location': 'San Francisco Bay Area', 'Connections': '500+', 'Photo': 'No', 'Followers': '1,734'} - HW / SW co-design- AI systems at scale- Deep Learning / AI- Scientific Computing- Parallel computing/High per...

💼 TOP RECOMMENDED JOBS:



Unnamed: 0,title,location,salary,formatted_experience_level,job_posting_url,similarity_score
4627,Senior Search Engineer - Artificial Intelligen...,"New York, NY",160000.0,senior level,https://www.linkedin.com/jobs/view/3905855257/...,0.554334
7603,Senior AI Infrastructure Engineer - DGX Cloud,United States (Remote),120000 - 160000,Senior,https://www.linkedin.com/jobs/view/4217966015,0.548831
4542,"Distinguished Engineer, Generative AI Systems ...","New York, New York, United States",180000.0,Executive,https://www.linkedin.com/jobs/view/3905806278/...,0.546163
4901,"Senior Software Engineer, Machine Learning","New York, NY",100844.0,senior level,https://www.linkedin.com/jobs/view/3906240774/...,0.534527
6961,"Director, Cloud GTM Practice Lead, Artificial ...","New York, NY (On-site)",120000 - 160000,Executive,https://www.linkedin.com/jobs/view/4225018609,0.530964


## Identifying “Better” Jobs and Skill Gaps
Beyond top matches, the system also recommends “better” jobs — roles that may offer higher salary or seniority, even if they're not the closest match. These are identified by analyzing salary and job level relative to the user’s profile. To help users close the gap, a LLaMA-based skill gap analysis is performed to suggest specific technical or domain skills needed to qualify for such roles.

In [182]:
#function to recommend better jobs
def recommend_better_jobs(user_id, base_jobs=10, better_jobs=5, similarity_range=(0.4, 0.7)):
    # Validate user
    if user_id not in df_users["user_id"].values:
        return "❌ User ID not found."

    # User vector and similarity computation
    user_index = df_users[df_users["user_id"] == user_id].index[0]
    user_vector = user_embeddings[user_index]
    similarities = cosine_similarity([user_vector], job_embeddings)[0]

    # Step 1: Get top matches to estimate user level and salary
    top_indices = np.argsort(similarities)[::-1][:base_jobs]
    top_jobs_df = df_all_jobs.iloc[top_indices].copy()
    top_jobs_df["similarity_score"] = similarities[top_indices]

    # Extract numeric salary from top jobs
    top_jobs_df["salary_numeric"] = pd.to_numeric(
        top_jobs_df["salary"].str.extract(r'(\d+)$')[0], errors="coerce"
    )
    base_avg_salary = top_jobs_df["salary_numeric"].dropna().mean()

    # Map user level
    level_order = {
        "unspecified": 0,
        "entry": 1,
        "mid": 2,
        "senior": 3
    }
    def map_level(level): return level_order.get(str(level).strip().lower(), 0)
    top_jobs_df["level_mapped"] = top_jobs_df["formatted_experience_level"].map(map_level)
    user_level = int(top_jobs_df["level_mapped"].mode()[0])
    target_level = user_level + 1 if user_level < max(level_order.values()) else user_level

    # Step 2: Filter candidate jobs in similarity range (exclude some top jobs)
    mask = (similarities >= similarity_range[0]) & (similarities <= similarity_range[1])
    candidate_indices = np.where(mask)[0]

    # Exclude top 70% of top jobs (allow 30% overlap)
    excluded = set(top_indices[:int(0.7 * len(top_indices))])
    filtered_indices = [i for i in candidate_indices if i not in excluded]

    candidate_jobs_df = df_all_jobs.iloc[filtered_indices].copy()
    candidate_jobs_df["similarity_score"] = similarities[filtered_indices]
    candidate_jobs_df["salary_numeric"] = pd.to_numeric(
        candidate_jobs_df["salary"].str.extract(r'(\d+)$')[0], errors="coerce"
    )
    candidate_jobs_df["level_mapped"] = candidate_jobs_df["formatted_experience_level"].map(map_level)

    # Step 3: Filter for better jobs (one level up or higher salary)
    better_jobs_df = candidate_jobs_df[
        (candidate_jobs_df["level_mapped"] == target_level) |
        (candidate_jobs_df["salary_numeric"] > base_avg_salary)
    ].sort_values(by="similarity_score", ascending=False).head(better_jobs)

    # Step 4: Shorten description if available
    if "description" in better_jobs_df.columns:
        def truncate_description(text, word_limit=100):
            if pd.isna(text): return ""
            words = text.split()
            return " ".join(words[:word_limit]) + ("..." if len(words) > word_limit else "")
        better_jobs_df["description_short"] = better_jobs_df["description"].apply(truncate_description)

    # Step 5: Parse skills if available
    if "skills_required" in better_jobs_df.columns and "skills_desc" not in better_jobs_df.columns:
        import ast
        def parse_skills(s):
            try: return ", ".join(ast.literal_eval(s)) if isinstance(s, str) else ""
            except: return ""
        better_jobs_df["skills_desc"] = better_jobs_df["skills_required"].apply(parse_skills)

    # Step 6: Final columns to show
    display_cols = [
        "title", "location", "formatted_experience_level",
        "skills_desc", "description_short", "job_posting_url", "similarity_score"
    ]
    cols_available = [col for col in display_cols if col in better_jobs_df.columns]

    if better_jobs_df.empty:
        return "No better jobs found in the specified similarity range."

    return better_jobs_df[cols_available]


In [184]:
recommend_better_jobs(user_id=634, base_jobs=5, better_jobs=5)


Unnamed: 0,title,location,formatted_experience_level,job_posting_url,similarity_score
1322,Wealth Management Associate,"New York, NY",mid level,https://www.linkedin.com/jobs/view/3895210863/...,0.525044
848,Technology Investment Banking Associate,"New York, NY",mid level,https://www.linkedin.com/jobs/view/3889752925/...,0.51436
1529,Onboarding Manager,"New York, NY",Mid-Senior level,https://www.linkedin.com/jobs/view/3898169701/...,0.500597
3101,Branch Banking - Client Consultant II,"New York, New York, United States",Mid-Senior level,https://www.linkedin.com/jobs/view/3903823469/...,0.492864
3709,Associate Client Advisor - Cyber,"New York, NY",mid level,https://www.linkedin.com/jobs/view/3904700847/...,0.491084


In [185]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch


from huggingface_hub import login
from google.colab import userdata
hf_token = userdata.get('HF_token')
login(token=hf_token)

In [186]:
#find the skill gap in user using llama
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Load model + tokenizer
model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Text generation pipeline
llama_chat = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False
)

# 🔍 Skill gap generator
def generate_missing_skills(user_text, job_text, max_tokens=100):
    user_text = user_text[:500]
    job_text = job_text[:500]


    prompt = f"""
You are an expert career advisor.

Given the user's profile and a job description, return a concise bullet-point list of 5–7 technical or domain-specific skills the user should learn to qualify for the job.

Respond only with the bullet points. No introduction or explanatiion needed.
User Profile:
{user_text}

Job Description:
{job_text}

Missing skills:"""

    response = llama_chat(prompt, max_new_tokens=max_tokens, do_sample=True, temperature=0.7)[0]["generated_text"]
    return response.strip()


Device set to use cpu


In [187]:
#show full output
def show_full_recommendation_output(user_id, top_n=5):
    if user_id not in df_user_corpus["user_id"].values:
        print("❌ User ID not found.")
        return

    # Step 1: Show user profile
    user = df_user_corpus[df_user_corpus["user_id"] == user_id].iloc[0]
    print("👤 USER PROFILE")
    print(f"User ID     : {user_id}")
    print(f"Name        : {user['Full Name']}")
    print(f"Location    : {user['Location']}")
    print(f"Workplace   : {user['Workplace']}")
    print("\n📝 Profile Summary:")
    print(user['text_for_embedding'][:300] + "...\n")

    # Step 2: Show top-N recommended jobs
    print("💼 TOP MATCHING JOBS (Based on Skills):\n")
    top_matches = recommend_jobs(user_id=user_id, top_n=top_n)
    print(top_matches[[
    "title",
    "location",
    "salary",
    "formatted_experience_level",
    "similarity_score"
]])



    # Step 3: Show similar but better jobs
    print("\n🚀 BETTER JOBS (Similar but Higher Salary/Seniority):\n")
    better_jobs_df = recommend_better_jobs(user_id=user_id, base_jobs=top_n, better_jobs=top_n)
    if isinstance(better_jobs_df, str):  # e.g., "⚠️ No better jobs found..."
        print(better_jobs_df)
        return
    display(better_jobs_df)

    # Step 4: Show missing skills from LLaMA
    print("\n📚 SUGGESTED SKILLS TO UPSKILL (From LLaMA):\n")
    user_text = user["text_for_embedding"]

    for idx, job in better_jobs_df.iterrows():
        job_text = job.get("description_short") or job.get("description", "")
        job_title = job["title"]
        job_location = job["location"]
        job_url = job.get("job_posting_url", "")

        print(f"🔹 Job: {job_title} ({job_location})")
        if job_url:
            print(f"🔗 Link: {job_url}")
        try:
            missing_skills = generate_missing_skills(user_text=user_text, job_text=job_text)
        except Exception as e:
            missing_skills = f"⚠️ Error: {e}"
        print("🧠 Missing Skills:\n" + missing_skills)
        print("-" * 80)



In [191]:
show_full_recommendation_output(user_id=955, top_n=5)


👤 USER PROFILE
User ID     : 955
Name        : Dheevatsa Mudigere
Location    : San Francisco Bay Area
Workplace   : Accelerated compute and Systems for AI

📝 Profile Summary:
{'Full Name': 'Dheevatsa Mudigere', 'Workplace': 'Accelerated compute and Systems for AI', 'Location': 'San Francisco Bay Area', 'Connections': '500+', 'Photo': 'No', 'Followers': '1,734'} - HW / SW co-design- AI systems at scale- Deep Learning / AI- Scientific Computing- Parallel computing/High per...

💼 TOP MATCHING JOBS (Based on Skills):

                                                  title  \
4627  Senior Search Engineer - Artificial Intelligen...   
7603      Senior AI Infrastructure Engineer - DGX Cloud   
4542  Distinguished Engineer, Generative AI Systems ...   
4901         Senior Software Engineer, Machine Learning   
6961  Director, Cloud GTM Practice Lead, Artificial ...   

                               location           salary  \
4627                       New York, NY         160000.0   
7603

Unnamed: 0,title,location,formatted_experience_level,job_posting_url,similarity_score
6961,"Director, Cloud GTM Practice Lead, Artificial ...","New York, NY (On-site)",Executive,https://www.linkedin.com/jobs/view/4225018609,0.530964
5092,AI/Machine Learning Engineer,United States (Remote),Unspecified,https://www.linkedin.com/jobs/view/4225830009,0.527054
6419,Senior Machine Learning Engineer - (Remote - A...,United States (Remote),Senior,https://www.linkedin.com/jobs/view/4206941132,0.52041
6471,Senior Machine Learning Engineer - (Remote - A...,United States (Remote),Senior,https://www.linkedin.com/jobs/view/4206941132,0.52041
5375,Lead/Senior Backend Engineer - Distributed Sys...,United States (Remote),Senior,https://www.linkedin.com/jobs/view/4212837517,0.517524


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.



📚 SUGGESTED SKILLS TO UPSKILL (From LLaMA):

🔹 Job: Director, Cloud GTM Practice Lead, Artificial intelligence (New York, NY (On-site))
🔗 Link: https://www.linkedin.com/jobs/view/4225018609


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🧠 Missing Skills:
'Dheevatsa Mudigere'
Missing connections: '500+'
Missing photo: 'No'
Missing followers: '1,734'

Please provide the missing information.

To solve this problem, I can follow these steps:

1. Analyze the job description to identify key technical skills required for the job.
2. Compare the user's profile with the job description to find the missing skills.
3. Provide the missing information to the user.

Step 1: Analyze the job
--------------------------------------------------------------------------------
🔹 Job: AI/Machine Learning Engineer (United States (Remote))
🔗 Link: https://www.linkedin.com/jobs/view/4225830009


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🧠 Missing Skills:
- Deep Learning / AI - Scientific Computing - Parallel computing/High performance numerical computing - HW / SW co-design - AI systems at scale - Scientific visualization - Machine learning - Operating System - Networking - Cloud computing - Database management - Cybersecurity - Cloud infrastructure - Cloud security - Agile methodologies - Agile testing - Version control - Cloud migration - Cloud deployment - Cloud security - Security architecture - Security policies - Incident response - Compliance - Data governance - Data quality - Data engineering - Data science - Data analysis - Data
--------------------------------------------------------------------------------
🔹 Job: Senior Machine Learning Engineer - (Remote - Anywhere) (United States (Remote))
🔗 Link: https://www.linkedin.com/jobs/view/4206941132


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🧠 Missing Skills:
Identify the missing skills in the job description to be able to recommend a list of skills the user should learn to qualify for the job. Based on the job description, here are the skills required:

- Deep learning frameworks and libraries (TensorFlow, PyTorch)
- Scientific computing libraries (e.g., NumPy, SciPy)
- Parallel computing frameworks (e.g., OpenMP, MPI)
- High-performance numerical computing (e.g., Intel MKL, OpenBLAS)
- Data storage
--------------------------------------------------------------------------------
🔹 Job: Senior Machine Learning Engineer - (Remote - Anywhere) (United States (Remote))
🔗 Link: https://www.linkedin.com/jobs/view/4206941132


Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


🧠 Missing Skills:
AI, Scientific Computing, Parallel Computing. The job description mentions "accelerated compute and systems for AI", but the user's profile shows expertise in Deep Learning / AI, Scientific Computing, and Parallel Computing.

Here are the missing skills:

- AI
- Scientific Computing
- Parallel Computing

To address this, I will provide the following response:

• AI
• Scientific Computing
• Parallel Computing
• Parallel Programming
• Numerical Methods
• Parallel Algorithms
• Machine Learning
• Optimization
--------------------------------------------------------------------------------
🔹 Job: Lead/Senior Backend Engineer - Distributed Systems/API - Applied AI (United States (Remote))
🔗 Link: https://www.linkedin.com/jobs/view/4212837517
🧠 Missing Skills:
{ '0': {'Role': 'Distinguished Engineer', 'Workplace': 'NVIDIA', 'Duration': 'Dec 2022 - Present · 2 mos', 'Workplace Location': 'Santa Clara, California, United States', 'Description': 'Deep Learning / AI- Scientific 