# Career Recommender System - Data Preprocessing

This notebook preprocesses user profiles and job catalog data so they can be used by the career recommendation system.

**Workflow:**
1. Install dependencies and setup environment
2. Data loading and exploration
3. Text preprocessing and feature engineering
4. Embedding generation using sentence-transformers
5. Vector database setup with FAISS
6. Export processed data for training

## 📋 Prerequisites for Colab/Kaggle
- Upload the data files (`sample_users.csv`, `sample_jobs.csv`) to your environment
- Or use the sample data generation code provided below
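
If neither CSV file is at hand, stand-in files could be generated along the following lines. This is only a sketch: the column names are taken from the columns this notebook reads in later cells, while the helper name `generate_sample_data`, the value lists, and the value ranges are all hypothetical placeholders rather than the project's actual sample data.

```python
import csv
import random
from pathlib import Path

# Column names mirror the ones this notebook reads later; every value is made up.
EDUCATION_LEVELS = ["High School", "Associate", "Bachelor", "Master", "PhD"]
SKILLS = ["python", "sql", "excel", "statistics", "communication", "java"]
INTERESTS = ["data analysis", "design", "finance", "healthcare", "teaching"]
FIELDS = ["Computer Science", "Business", "Biology", "Psychology"]
INDUSTRIES = ["Technology", "Finance", "Healthcare", "Education"]


def generate_sample_data(data_dir, n_users=20, n_jobs=30, seed=42):
    """Write illustrative sample_users.csv and sample_jobs.csv into data_dir."""
    rng = random.Random(seed)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    users = [
        {
            "user_id": f"U{i:03d}",
            "gpa": round(rng.uniform(2.0, 4.0), 2),
            "education_level": rng.choice(EDUCATION_LEVELS),
            "experience_years": rng.randint(0, 15),
            "field_of_study": rng.choice(FIELDS),
            "interests": ", ".join(rng.sample(INTERESTS, 2)),
            "skills": ", ".join(rng.sample(SKILLS, 3)),
        }
        for i in range(1, n_users + 1)
    ]

    jobs = []
    for i in range(1, n_jobs + 1):
        low = rng.randrange(40000, 100000, 5000)
        min_years = rng.randint(0, 5)
        jobs.append({
            "job_id": f"J{i:03d}",
            "job_title": f"Sample Role {i}",
            "description": "Placeholder description for a synthetic job posting.",
            "required_skills": ", ".join(rng.sample(SKILLS, 3)),
            "experience_requirement": f"{min_years}-{min_years + 2} years",
            "salary_range": f"{low}-{low + 20000}",
            "education_requirement": rng.choice(EDUCATION_LEVELS),
            "industry": rng.choice(INDUSTRIES),
        })

    # Write both tables with a header row so pd.read_csv can load them directly
    for name, rows in (("sample_users.csv", users), ("sample_jobs.csv", jobs)):
        with open(data_dir / name, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return users, jobs
```

Calling `generate_sample_data(DATA_DIR)` before the data loading cell should be enough for the rest of the notebook to run end to end, although the resulting matches carry no real signal.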

## 0. Install Dependencies

**⚠️ Run this cell first in Colab/Kaggle environments:**

In [None]:
# Install required packages for Colab/Kaggle environments
!pip install pandas numpy scikit-learn matplotlib seaborn plotly
!pip install sentence-transformers transformers torch
!pip install faiss-cpu xgboost
# Quote version specifiers so the shell does not treat ">=" as an output redirect
!pip install "python-jobspy>=1.1.79" "datasets>=2.14.0" "serpapi>=1.0.0"

## 1. Environment Setup and Dependencies

In [None]:
import os
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
from tqdm.auto import tqdm
import pickle
import json

# Machine learning libraries
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
import faiss

# Text processing and embeddings
from sentence_transformers import SentenceTransformer
import re
from collections import Counter

# Environment setup
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")

# Set random seed for reproducibility
np.random.seed(42)

# Setup paths
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"
SRC_DIR = PROJECT_ROOT / "src"

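# Create directories if they don't exist
MODELS_DIR.mkdir(parents=True, exist_ok=True)

# Add the project root and src to the path so that `src.data_utils` can be imported
sys.path.append(str(PROJECT_ROOT))
sys.path.append(str(SRC_DIR))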

print(f"Project root: {PROJECT_ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Models directory: {MODELS_DIR}")
print("Environment setup complete!")

In [None]:
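# Optional: Set up Hugging Face authentication
try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv(PROJECT_ROOT / ".env")
except ImportError:
    print("python-dotenv not installed - skipping .env loading")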

hf_token = os.getenv("HF_TOKEN")
if hf_token:
    from huggingface_hub import login
    login(hf_token)
    print("Hugging Face authentication successful!")
else:
    print("No HF_TOKEN found - using public models only")

## 2. Data Loading and Exploration

In [None]:
# Load datasets - try real data first, fallback to sample data
print("🔍 Looking for datasets...")

# Try to import our data utilities
try:
    from src.data_utils import load_real_or_sample_data
    REAL_DATA_AVAILABLE = True
    print("✅ Real data utilities available")
except ImportError:
    REAL_DATA_AVAILABLE = False
    print("⚠️ Real data utilities not available, using sample data")

# Load job data
if REAL_DATA_AVAILABLE:
    # Load real job dataset (lukebarousse/data_jobs) or fallback to sample
    jobs_df = load_real_or_sample_data(max_samples=10000, prefer_real=True)
    print(f"✅ Loaded jobs dataset: {len(jobs_df)} jobs")
else:
    # Fallback to original sample data loading
    jobs_df = pd.read_csv(DATA_DIR / "sample_jobs.csv")
    print(f"✅ Loaded sample jobs: {len(jobs_df)} jobs")

# Load user data (always from sample for now)
users_df = pd.read_csv(DATA_DIR / "sample_users.csv")

print("Datasets loaded successfully!")
print(f"Users dataset shape: {users_df.shape}")
print(f"Jobs dataset shape: {jobs_df.shape}")

# Display basic info
print("\n=== USERS DATASET ===")
print(users_df.info())
print("\nFirst few rows:")
print(users_df.head())

print("\n=== JOBS DATASET ===")
print(jobs_df.info())
print("\nFirst few rows:")
print(jobs_df.head())

In [None]:
# Exploratory data analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# GPA distribution
axes[0, 0].hist(users_df['gpa'], bins=20, alpha=0.7, color='skyblue')
axes[0, 0].set_title('GPA Distribution')
axes[0, 0].set_xlabel('GPA')
axes[0, 0].set_ylabel('Frequency')

# Education level distribution
education_counts = users_df['education_level'].value_counts()
axes[0, 1].pie(education_counts.values, labels=education_counts.index, autopct='%1.1f%%')
axes[0, 1].set_title('Education Level Distribution')

# Experience years distribution
axes[1, 0].hist(users_df['experience_years'], bins=15, alpha=0.7, color='lightgreen')
axes[1, 0].set_title('Experience Years Distribution')
axes[1, 0].set_xlabel('Years')
axes[1, 0].set_ylabel('Frequency')

# Industry distribution in jobs
industry_counts = jobs_df['industry'].value_counts().head(10)
axes[1, 1].barh(range(len(industry_counts)), industry_counts.values)
axes[1, 1].set_yticks(range(len(industry_counts)))
axes[1, 1].set_yticklabels(industry_counts.index)
axes[1, 1].set_title('Top Industries in Job Catalog')
axes[1, 1].set_xlabel('Number of Jobs')

plt.tight_layout()
plt.show()

## 3. User Profile Preprocessing

In [None]:
def clean_text_fields(text):
    """Clean and normalize text fields"""
    if pd.isna(text):
        return ""
    # Convert to lowercase, remove extra spaces
    text = str(text).lower().strip()
    # Remove special characters but keep commas for skill/interest separation
    text = re.sub(r'[^\w\s,]', '', text)
    return text

def parse_skills_interests(text):
    """Parse comma-separated skills/interests into list"""
    if pd.isna(text) or text == "":
        return []
    items = [item.strip() for item in str(text).split(',')]
    return [item for item in items if item]  # Remove empty strings

# Clean user profiles
users_clean = users_df.copy()

# Clean text fields
users_clean['interests_clean'] = users_clean['interests'].apply(clean_text_fields)
users_clean['skills_clean'] = users_clean['skills'].apply(clean_text_fields)
users_clean['field_of_study_clean'] = users_clean['field_of_study'].apply(clean_text_fields)

# Parse into lists
users_clean['interests_list'] = users_clean['interests_clean'].apply(parse_skills_interests)
users_clean['skills_list'] = users_clean['skills_clean'].apply(parse_skills_interests)

# Create combined text for embeddings
users_clean['profile_text'] = (
    users_clean['field_of_study_clean'] + ' ' + 
    users_clean['interests_clean'] + ' ' + 
    users_clean['skills_clean']
).str.strip()

# Normalize GPA to 0-1 scale
users_clean['gpa_normalized'] = users_clean['gpa'] / 4.0

# Encode education levels
education_levels = ['High School', 'Associate', 'Bachelor', 'Master', 'PhD']
education_mapping = {level: i for i, level in enumerate(education_levels)}
users_clean['education_level_encoded'] = users_clean['education_level'].map(education_mapping)

print("User profiles preprocessed!")
print(f"Sample profile text: {users_clean['profile_text'].iloc[0]}")
print(f"Education encoding: {dict(list(education_mapping.items())[:3])}")
print(users_clean[['user_id', 'gpa_normalized', 'education_level_encoded', 'profile_text']].head())
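
To see what the cleaning step does to a single value, the two helpers can be re-created without pandas. The sketch below uses hypothetical names (`clean_text`, `parse_items`) and a made-up input string. It also shows a side effect worth knowing about: because every character other than word characters, whitespace, and commas is dropped, skills such as `C++` and `C#` both collapse to `c`.

```python
import re


def clean_text(text):
    """Lowercase, trim, and drop everything except word chars, whitespace, and commas."""
    if text is None:
        return ""
    return re.sub(r"[^\w\s,]", "", str(text).lower().strip())


def parse_items(text):
    """Split a comma-separated string into a list of non-empty, trimmed items."""
    return [item.strip() for item in text.split(",") if item.strip()]


raw_skills = "  Python, C++, C#, Machine-Learning! "  # made-up example input
cleaned = clean_text(raw_skills)
print(cleaned)               # python, c, c, machinelearning
print(parse_items(cleaned))  # ['python', 'c', 'c', 'machinelearning']
```

If distinguishing such skills matters for matching, one option would be to whitelist `+` and `#` in the character class, at the cost of keeping a little more punctuation in the embedding text.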

## 4. Job Catalog Preprocessing

In [None]:
def extract_experience_years(exp_text):
    """Extract minimum experience years from requirement text"""
    if pd.isna(exp_text):
        return 0
    
    # Look for patterns like "2-4 years", "3+ years", "1-3 years"
    numbers = re.findall(r'(\d+)', str(exp_text))
    if numbers:
        return int(numbers[0])  # Take first number as minimum
    return 0

def parse_salary_range(salary_text):
    """Parse salary range and return min, max, and average"""
    if pd.isna(salary_text):
        return 0, 0, 0
    
    # Drop thousands separators so "50,000" is read as one number, not "50" and "000"
    numbers = re.findall(r'(\d+)', str(salary_text).replace(',', ''))
    if len(numbers) >= 2:
        min_sal, max_sal = int(numbers[0]), int(numbers[1])
        avg_sal = (min_sal + max_sal) / 2
        return min_sal, max_sal, avg_sal
    elif len(numbers) == 1:
        sal = int(numbers[0])
        return sal, sal, sal
    return 0, 0, 0

# Clean job data
jobs_clean = jobs_df.copy()

# Clean text fields
jobs_clean['description_clean'] = jobs_clean['description'].apply(clean_text_fields)
jobs_clean['required_skills_clean'] = jobs_clean['required_skills'].apply(clean_text_fields)
jobs_clean['job_title_clean'] = jobs_clean['job_title'].apply(clean_text_fields)

# Parse skills
jobs_clean['required_skills_list'] = jobs_clean['required_skills_clean'].apply(parse_skills_interests)

# Extract experience requirements
jobs_clean['min_experience_years'] = jobs_clean['experience_requirement'].apply(extract_experience_years)

# Parse salary information
salary_info = jobs_clean['salary_range'].apply(parse_salary_range)
jobs_clean['salary_min'] = [s[0] for s in salary_info]
jobs_clean['salary_max'] = [s[1] for s in salary_info]
jobs_clean['salary_avg'] = [s[2] for s in salary_info]

# Encode education requirements
jobs_clean['education_requirement_encoded'] = jobs_clean['education_requirement'].map(education_mapping)
jobs_clean['education_requirement_encoded'] = jobs_clean['education_requirement_encoded'].fillna(0)

# Create job text for embeddings
jobs_clean['job_text'] = (
    jobs_clean['job_title_clean'] + ' ' + 
    jobs_clean['description_clean'] + ' ' + 
    jobs_clean['required_skills_clean']
).str.strip()

print("Job catalog preprocessed!")
print(f"Sample job text: {jobs_clean['job_text'].iloc[0][:100]}...")
print(f"Experience distribution: {jobs_clean['min_experience_years'].value_counts().sort_index()}")
print(jobs_clean[['job_id', 'salary_avg', 'min_experience_years', 'education_requirement_encoded']].head())
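
The two parsers are easier to reason about with concrete inputs. The following is a simplified, pandas-free sketch of the same idea with hypothetical function names and made-up input strings. The salary variant strips thousands separators before matching, since a bare `\d+` pattern would otherwise split `50,000` into two numbers.

```python
import re


def min_experience_years(text):
    """Return the first integer in the text, treated as the minimum years required."""
    numbers = re.findall(r"\d+", str(text))
    return int(numbers[0]) if numbers else 0


def salary_bounds(text):
    """Return (min, max, average) from a salary string, ignoring thousands separators."""
    numbers = [int(n) for n in re.findall(r"\d+", str(text).replace(",", ""))]
    if not numbers:
        return 0, 0, 0
    low = numbers[0]
    high = numbers[1] if len(numbers) > 1 else low
    return low, high, (low + high) / 2


for text in ["2-4 years", "3+ years", "Entry level"]:
    print(text, "->", min_experience_years(text))
# 2-4 years -> 2
# 3+ years -> 3
# Entry level -> 0

print(salary_bounds("$50,000 - $70,000"))  # (50000, 70000, 60000.0)
print(salary_bounds("65000"))              # (65000, 65000, 65000.0)
```

Formats such as `50k-70k` or hourly rates are not handled by this approach and would need their own rules.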

## 5. Feature Engineering

In [None]:
def calculate_skill_overlap(user_skills, job_skills):
    """Calculate the Jaccard similarity (intersection over union) of user skills and job requirements"""
    if not user_skills or not job_skills:
        return 0.0
    
    user_set = set(user_skills)
    job_set = set(job_skills)
    intersection = len(user_set.intersection(job_set))
    union = len(user_set.union(job_set))
    
    return intersection / union if union > 0 else 0.0

def create_user_job_features(user_row, job_row):
    """Create features for a user-job pair"""
    features = {}
    
    # Basic compatibility features
    features['gpa_normalized'] = user_row['gpa_normalized']
    features['experience_years'] = user_row['experience_years']
    features['education_level_match'] = 1 if user_row['education_level_encoded'] >= job_row['education_requirement_encoded'] else 0
    features['experience_match'] = 1 if user_row['experience_years'] >= job_row['min_experience_years'] else 0
    
    # Skill overlap
    features['skill_overlap'] = calculate_skill_overlap(
        user_row['skills_list'], 
        job_row['required_skills_list']
    )
    
    # Education over-qualification (clipped at 0; may be a negative signal for some positions)
    features['education_overqualified'] = max(0, user_row['education_level_encoded'] - job_row['education_requirement_encoded'])
    
    # Experience over-qualification
    features['experience_overqualified'] = max(0, user_row['experience_years'] - job_row['min_experience_years'])
    
    # Salary features (normalized)
    features['salary_avg_normalized'] = job_row['salary_avg'] / 150000.0  # Normalize by reasonable max
    
    return features

# Generate sample training data (user-job pairs with features)
print("Generating sample user-job interaction features...")
sample_features = []

# Create positive examples (good matches) and negative examples
for _, user in users_clean.head(5).iterrows():  # Limited sample for demo
    for _, job in jobs_clean.head(10).iterrows():
        features = create_user_job_features(user, job)
        features['user_id'] = user['user_id']
        features['job_id'] = job['job_id']
        
        # Simple heuristic for creating labels (in real scenario, use actual user feedback)
        label = 1 if (features['education_level_match'] and 
                     features['skill_overlap'] > 0.1 and 
                     features['gpa_normalized'] > 0.7) else 0
        features['label'] = label
        
        sample_features.append(features)

features_df = pd.DataFrame(sample_features)
print(f"Generated {len(features_df)} user-job feature vectors")
print(f"Positive examples: {features_df['label'].sum()}")
print(f"Feature columns: {list(features_df.columns)}")
print(features_df.head())
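
As a worked example of the skill overlap feature and the labelling heuristic, the sketch below computes intersection over union for one made-up user-job pair and then applies the same three-condition rule used in the cell above. The function names and the example values are illustrative only.

```python
def jaccard(a, b):
    """Intersection over union of two collections; 0.0 if either is empty."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def heuristic_label(education_match, skill_overlap, gpa_normalized):
    """Demo rule: education requirement met, overlap above 0.1, normalized GPA above 0.7."""
    return int(bool(education_match) and skill_overlap > 0.1 and gpa_normalized > 0.7)


user_skills = ["python", "sql", "excel"]
job_skills = ["python", "sql", "java", "statistics"]

overlap = jaccard(user_skills, job_skills)
print(overlap)                                 # 0.4 (2 shared skills out of 5 distinct ones)
print(heuristic_label(1, overlap, 3.4 / 4.0))  # 1
print(heuristic_label(1, overlap, 2.6 / 4.0))  # 0
```

Because `gpa_normalized` feeds both the feature vector and the label, a model trained on these heuristic labels can largely recover the rule itself, which is one more reason to replace the labels with real feedback when it becomes available.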

## 6. Text Embedding Generation

In [None]:
# Load sentence transformer model
print("Loading sentence transformer model...")
model_name = "all-MiniLM-L6-v2"  # Fast and efficient model
embedding_model = SentenceTransformer(model_name)

print(f"Model loaded: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Generate embeddings for user profiles
print("Generating user profile embeddings...")
user_texts = users_clean['profile_text'].tolist()
user_embeddings = embedding_model.encode(
    user_texts, 
    show_progress_bar=True, 
    convert_to_numpy=True
)

print(f"Generated {user_embeddings.shape[0]} user embeddings with dimension {user_embeddings.shape[1]}")

# Generate embeddings for job descriptions
print("Generating job description embeddings...")
job_texts = jobs_clean['job_text'].tolist()
job_embeddings = embedding_model.encode(
    job_texts, 
    show_progress_bar=True, 
    convert_to_numpy=True
)

print(f"Generated {job_embeddings.shape[0]} job embeddings with dimension {job_embeddings.shape[1]}")

# Add embeddings to dataframes
users_clean['embedding'] = list(user_embeddings)
jobs_clean['embedding'] = list(job_embeddings)

print("Embeddings generated and stored!")

## 7. Vector Database Setup

In [None]:
# Create FAISS index for job embeddings
embedding_dim = job_embeddings.shape[1]

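# Use exact L2 search (IndexFlatL2 reports squared L2 distances).
# For cosine similarity, L2-normalize the vectors and use an inner-product index instead.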
index = faiss.IndexFlatL2(embedding_dim)

# Add job embeddings to index
index.add(job_embeddings.astype('float32'))

print(f"FAISS index created with {index.ntotal} job embeddings")
print(f"Index dimension: {embedding_dim}")

# Test the index with a sample query
test_user_embedding = user_embeddings[0:1].astype('float32')
distances, indices = index.search(test_user_embedding, k=5)

print(f"\nTest search for user 0:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    job_title = jobs_clean.iloc[idx]['job_title']
    print(f"  {i+1}. {job_title} (distance: {dist:.3f})")
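
The choice between L2 distance and cosine similarity matters less than it may seem once the vectors are unit-normalized: for unit vectors, the squared L2 distance equals `2 - 2 * cosine`, so both metrics produce the same ranking. The pure-Python sketch below checks this on small made-up vectors; it does not use FAISS, and the vector values are arbitrary.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Arbitrary toy embeddings, normalized to unit length
user = normalize([0.2, 0.9, 0.4])
jobs = {
    "job_a": normalize([0.1, 0.8, 0.5]),
    "job_b": normalize([0.9, 0.1, 0.2]),
    "job_c": normalize([0.3, 0.7, 0.1]),
}

for name, vec in jobs.items():
    d2 = squared_l2(user, vec)
    cos = cosine(user, vec)
    # For unit vectors the two quantities are tied together by d2 = 2 - 2 * cos
    print(f"{name}: squared L2 = {d2:.4f}, 2 - 2*cos = {2 - 2 * cos:.4f}")

ranking_by_l2 = sorted(jobs, key=lambda n: squared_l2(user, jobs[n]))
ranking_by_cos = sorted(jobs, key=lambda n: -cosine(user, jobs[n]))
print(ranking_by_l2 == ranking_by_cos)
```

In FAISS terms, this corresponds to calling `faiss.normalize_L2` on the embeddings and either keeping `IndexFlatL2` or switching to `IndexFlatIP`; without normalization the two metrics can rank jobs differently.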

## 8. Save Processed Data

In [None]:
# Initialize embedding model - using JobBERT-v3 for specialized job embeddings
print("Setting up embedding model...")

# Try JobBERT-v3 first (specialized for job titles), fallback to all-MiniLM-L6-v2
try:
    model_name = "TechWolf/JobBERT-v3"  # Specialized job title embedding model
    embedding_model = SentenceTransformer(model_name)
    print(f"✅ Successfully loaded JobBERT-v3 (specialized for job embeddings)")
except Exception as e:
    print(f"⚠️ JobBERT-v3 not available: {e}")
    print("Falling back to all-MiniLM-L6-v2...")
    model_name = "all-MiniLM-L6-v2"  # Fast and efficient fallback model
    embedding_model = SentenceTransformer(model_name)
    print(f"✅ Loaded fallback model: {model_name}")

print(f"Model: {model_name}")
print(f"Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Generate embeddings
print("\nGenerating user embeddings...")
user_embeddings = embedding_model.encode(
    users_clean['profile_text'].tolist(),  # combined text column built in section 3
    show_progress_bar=True,
    convert_to_numpy=True
)

print("Generating job embeddings...")
job_embeddings = embedding_model.encode(
    jobs_clean['job_text'].tolist(), 
    show_progress_bar=True,
    convert_to_numpy=True
)

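# The stored embeddings and the FAISS index were built with the previous model;
# refresh both so they match the embeddings generated above
users_clean['embedding'] = list(user_embeddings)
jobs_clean['embedding'] = list(job_embeddings)
index = faiss.IndexFlatL2(job_embeddings.shape[1])
index.add(job_embeddings.astype('float32'))

print("✅ Generated embeddings:")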
print(f"  • User embeddings shape: {user_embeddings.shape}")
print(f"  • Job embeddings shape: {job_embeddings.shape}")
print(f"  • Model used: {model_name}")
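
The cell above produces the final embeddings but does not write anything to disk. One way to persist the results is sketched below, using only the standard library: each artifact is pickled to its own file and a small JSON file records metadata such as the model name. The helper name `save_artifacts`, the file naming scheme, and the metadata keys are assumptions, not part of the project.

```python
import json
import pickle
from pathlib import Path


def save_artifacts(output_dir, artifacts, metadata):
    """Pickle each artifact to <name>.pkl and write metadata to metadata.json."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for name, obj in artifacts.items():
        path = output_dir / f"{name}.pkl"
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        written.append(path)

    # Metadata must be JSON-serializable (strings, numbers, lists, dicts)
    meta_path = output_dir / "metadata.json"
    with open(meta_path, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    written.append(meta_path)
    return written
```

In this notebook the call might look like `save_artifacts(MODELS_DIR, {"users_clean": users_clean, "jobs_clean": jobs_clean, "features_df": features_df}, {"model_name": model_name})`, since DataFrames and NumPy arrays are picklable. The FAISS index is better written separately with `faiss.write_index(index, str(MODELS_DIR / "jobs.index"))`, where the file name `jobs.index` is again only a placeholder.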

## Summary

✅ **Data Loading**: Successfully loaded user profiles and job catalog  
✅ **Data Cleaning**: Cleaned text fields, parsed skills and interests  
✅ **Feature Engineering**: Created user-job compatibility features  
✅ **Embeddings**: Generated semantic embeddings using sentence-transformers  
✅ **Vector Database**: Set up FAISS index for efficient similarity search  
✅ **Data Export**: Saved all processed data for training pipeline

**Next Steps:**
1. Run `02_train_xgboost.ipynb` to train the reranking model
2. Use `03_evaluate.ipynb` to measure system performance  
3. Try `04_inference_demo.ipynb` for interactive recommendations

**Key Outputs:**
- Processed user profiles with embeddings
- Job catalog with semantic vectors
- FAISS index for fast similarity search
- Training features for reranking model