# Career Recommender System - XGBoost Reranker Training

This notebook trains an XGBoost model to rerank job recommendations based on structured features derived from user profiles and job requirements.

**Workflow:**
1. Load preprocessed data and features
2. Prepare training data with positive/negative examples
3. Train XGBoost reranker (with optional hyperparameter tuning)
4. Evaluate model performance on the held-out test set
5. Save trained model for inference pipeline

## 1. Install Dependencies

Run this cell first to install all required packages:

In [None]:
# Install required packages for Colab/Kaggle environments
!pip install pandas numpy scikit-learn matplotlib seaborn plotly
!pip install sentence-transformers transformers torch
!pip install faiss-cpu xgboost
# Quote version specifiers so the shell does not treat ">" as output redirection
!pip install "python-jobspy>=1.1.79" "datasets>=2.14.0" "serpapi>=1.0.0"
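
To confirm what actually got installed, we can query package metadata after the install cell finishes. This is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list simply mirrors a few names from the `pip install` lines above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the packages this notebook relies on most directly
for name, ver in installed_versions(["xgboost", "scikit-learn", "pandas", "numpy"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

A `None` entry means the install step did not complete for that package and the cell should be re-run.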

## 2. Environment Setup

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import pickle
import json

# ML libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

warnings.filterwarnings('ignore')
plt.style.use('default')
np.random.seed(42)

# Create necessary directories
os.makedirs('models', exist_ok=True)
os.makedirs('data', exist_ok=True)

print("Environment setup complete!")

## 3. Load Preprocessed Data

The cell below loads the output of the preprocessing notebook. If those files are missing, it falls back to a synthetic sample dataset so the rest of this notebook still runs:

In [None]:
# Try to load preprocessed data, create sample if not available
try:
    # Load from preprocessing notebook
    users_df = pd.read_pickle('models/users_processed.pkl')
    jobs_df = pd.read_pickle('models/jobs_processed.pkl')
    features_df = pd.read_pickle('models/training_features.pkl')
    print("✅ Loaded preprocessed data from models/")
    
except FileNotFoundError:
    print("⚠️ Preprocessed data not found. Creating sample training data...")
    
    # Create sample training features for demonstration
    np.random.seed(42)
    n_samples = 500
    
    # Generate sample features
    features_data = {
        'user_id': np.random.randint(1, 21, n_samples),
        'job_id': np.random.randint(1, 31, n_samples),
        'gpa_normalized': np.random.normal(0.75, 0.15, n_samples).clip(0, 1),
        'experience_years': np.random.randint(0, 10, n_samples),
        'education_level_match': np.random.choice([0, 1], n_samples, p=[0.3, 0.7]),
        'experience_match': np.random.choice([0, 1], n_samples, p=[0.4, 0.6]),
        'skill_overlap': np.random.beta(2, 5, n_samples),  # Skewed towards lower values
        'education_overqualified': np.random.choice([0, 1, 2], n_samples, p=[0.6, 0.3, 0.1]),
        'experience_overqualified': np.random.randint(0, 5, n_samples),
        'salary_avg_normalized': np.random.beta(3, 2, n_samples),
        'location_match': np.random.choice([0, 1], n_samples, p=[0.7, 0.3])
    }
    
    features_df = pd.DataFrame(features_data)
    
    # Draw labels from a probability that is a weighted sum of the features (weights sum to 1.0)
    label_prob = (
        0.3 * features_df['education_level_match'] +
        0.2 * features_df['experience_match'] +
        0.3 * features_df['skill_overlap'] +
        0.1 * features_df['gpa_normalized'] +
        0.1 * features_df['location_match']
    )
    
    features_df['label'] = np.random.binomial(1, label_prob.clip(0, 1))
    
    print("✅ Created sample training data")

print(f"Training data shape: {features_df.shape}")
print(f"Positive examples: {features_df['label'].sum()}")
print(f"Negative examples: {len(features_df) - features_df['label'].sum()}")

features_df.head()
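
To make the synthetic labels concrete, here is a worked example of the label probability for a single user-job pair. The weights are copied from the fallback branch above; the example row and the helper name `label_probability` are hypothetical. Because every feature in the sum lies in [0, 1] and the weights add up to 1.0, the probability stays in [0, 1], so the `.clip(0, 1)` call acts only as a safeguard.

```python
# Weights copied from the sample-data cell; they sum to 1.0
WEIGHTS = {
    "education_level_match": 0.3,
    "experience_match": 0.2,
    "skill_overlap": 0.3,
    "gpa_normalized": 0.1,
    "location_match": 0.1,
}


def label_probability(row):
    """Weighted sum of the five features used to draw the synthetic label."""
    return sum(weight * row[name] for name, weight in WEIGHTS.items())


# Hypothetical pair: education and experience match, moderate skill overlap, no location match
example = {
    "education_level_match": 1,
    "experience_match": 1,
    "skill_overlap": 0.4,
    "gpa_normalized": 0.8,
    "location_match": 0,
}

p = label_probability(example)
print(round(p, 2))  # 0.3 + 0.2 + 0.12 + 0.08 + 0.0 = 0.7
```

The label for this pair would then be drawn as a single Bernoulli trial with success probability 0.7.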

## 4. Exploratory Data Analysis

In [None]:
# Analyze feature distributions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

feature_cols = ['gpa_normalized', 'skill_overlap', 'experience_years', 
                'education_level_match', 'experience_match', 'location_match']

for i, col in enumerate(feature_cols):
    if col in ['education_level_match', 'experience_match', 'location_match']:
        # Bar plot for binary features
        feature_counts = features_df[col].value_counts()
        axes[i].bar(feature_counts.index, feature_counts.values, alpha=0.7)
        axes[i].set_title(f'{col.replace("_", " ").title()}')
    else:
        # Histogram for continuous features
        axes[i].hist(features_df[col], bins=20, alpha=0.7)
        axes[i].set_title(f'{col.replace("_", " ").title()}')
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

# Correlation matrix
print("Feature Correlations:")
feature_columns = [col for col in features_df.columns if col not in ['user_id', 'job_id', 'label']]
corr_matrix = features_df[feature_columns + ['label']].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Feature Correlation Matrix')
plt.show()

## 5. Prepare Training Data

In [None]:
# Prepare features and labels
feature_columns = [col for col in features_df.columns if col not in ['user_id', 'job_id', 'label']]
X = features_df[feature_columns].values
y = features_df['label'].values

# Handle any missing values
X = np.nan_to_num(X, nan=0.0)

print(f"Feature matrix shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Feature columns: {feature_columns}")

# Split into train/validation/test sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

print(f"\nData splits:")
print(f"Train: {X_train.shape[0]} samples ({y_train.sum()} positive)")
print(f"Validation: {X_val.shape[0]} samples ({y_val.sum()} positive)")
print(f"Test: {X_test.shape[0]} samples ({y_test.sum()} positive)")

# Scale features (fit on the training split only, so validation/test statistics do not leak in)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
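
The two `train_test_split` calls above produce a 60/20/20 split: 20% is held out for testing, then 25% of the remaining 80% becomes the validation set. The sketch below works through that arithmetic in plain Python; the helper name `split_fractions` is hypothetical, and the row counts are approximate because scikit-learn rounds split sizes to whole rows.

```python
def split_fractions(test_size=0.2, val_size_of_remainder=0.25):
    """Return (train, val, test) fractions of the full dataset for a two-stage split."""
    remainder = 1.0 - test_size
    val = remainder * val_size_of_remainder
    train = remainder - val
    return train, val, test_size


train_frac, val_frac, test_frac = split_fractions()

n_samples = 500  # size of the synthetic fallback dataset
for name, frac in [("train", train_frac), ("val", val_frac), ("test", test_frac)]:
    print(f"{name}: {frac:.0%} -> about {round(frac * n_samples)} rows")
```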

## 6. Train XGBoost Model

In [None]:
# Define XGBoost model with initial parameters
xgb_model = xgb.XGBClassifier(
    objective='binary:logistic',
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=10
)

# Train the model
print("Training XGBoost reranker...")
xgb_model.fit(
    X_train_scaled, y_train,
    eval_set=[(X_val_scaled, y_val)],
    verbose=False
)

print("✅ Model training completed!")

# Make predictions
y_pred_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]
y_pred = xgb_model.predict(X_test_scaled)

# Calculate metrics
from sklearn.metrics import precision_score, recall_score, f1_score

auc_score = roc_auc_score(y_test, y_pred_proba)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"\n📊 Model Performance:")
print(f"AUC: {auc_score:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-Score: {f1:.4f}")
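
For reference, precision, recall, and F1 can be reproduced by hand from the thresholded predictions. The sketch below uses toy labels and probabilities (not results from this notebook) and a hypothetical helper `binary_metrics`; it assumes the default 0.5 decision threshold and returns 0.0 when a ratio is undefined.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for 0/1 labels; undefined ratios fall back to 0.0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: three relevant pairs followed by three irrelevant ones
toy_true = [1, 1, 1, 0, 0, 0]
toy_proba = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1]
toy_pred = [int(p > 0.5) for p in toy_proba]  # threshold the probabilities at 0.5

toy_precision, toy_recall, toy_f1 = binary_metrics(toy_true, toy_pred)
print(f"precision={toy_precision:.3f} recall={toy_recall:.3f} f1={toy_f1:.3f}")
```

Here two of the three predicted positives are correct and two of the three true positives are found, so all three metrics come out to 2/3.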

## 7. Feature Importance Analysis

In [None]:
# Feature importance analysis
feature_importance = xgb_model.feature_importances_
importance_df = pd.DataFrame({
    'feature': feature_columns,
    'importance': feature_importance
}).sort_values('importance', ascending=False)

print("🔍 Feature Importance Rankings:")
print(importance_df)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(data=importance_df, x='importance', y='feature', palette='viridis')
plt.title('XGBoost Feature Importance')
plt.xlabel('Importance Score')
plt.tight_layout()
plt.show()

# Plot ROC curve
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
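
AUC has a useful ranking interpretation for a reranker: it is the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as half. The sketch below computes it directly from that definition on toy data; `pairwise_auc` is a hypothetical helper for illustration, and its quadratic loop is only practical for small inputs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: one positive (0.4) is scored below one negative (0.6)
toy_auc = pairwise_auc([1, 1, 1, 0, 0, 0], [0.9, 0.7, 0.4, 0.6, 0.2, 0.1])
print(f"{toy_auc:.3f}")  # 8 of 9 pairs are ordered correctly
```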

## 8. Hyperparameter Tuning (Optional)

In [None]:
# Hyperparameter tuning with GridSearchCV (set run_tuning = True below to run)
# Warning: the grid search trains many models and can take several minutes

run_tuning = False  # Set to True to run hyperparameter tuning

if run_tuning:
    print("🔧 Starting hyperparameter tuning...")
    
    param_grid = {
        'max_depth': [3, 6, 9],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200],
        'subsample': [0.8, 1.0]
    }
    
    xgb_tuned = xgb.XGBClassifier(
        objective='binary:logistic',
        random_state=42,
        eval_metric='logloss'
    )
    
    grid_search = GridSearchCV(
        xgb_tuned, param_grid,
        cv=3,
        scoring='roc_auc',
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train_scaled, y_train)
    
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation score: {grid_search.best_score_:.4f}")
    
    # Use best model
    best_model = grid_search.best_estimator_
    y_pred_tuned = best_model.predict_proba(X_test_scaled)[:, 1]
    auc_tuned = roc_auc_score(y_test, y_pred_tuned)
    print(f"Tuned model AUC: {auc_tuned:.4f}")
    
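    # Update model if better, and refresh the metrics so the saved values describe the saved model
    if auc_tuned > auc_score:
        xgb_model = best_model
        auc_score = auc_tuned
        y_pred_proba = y_pred_tuned
        y_pred = xgb_model.predict(X_test_scaled)
        precision = precision_score(y_test, y_pred)
        recall = recall_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        feature_importance = xgb_model.feature_importances_
        print("✅ Using tuned model (better performance)")
    else:
        print("✅ Keeping original model (tuning didn't improve)")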
else:
    print("⏭️ Skipping hyperparameter tuning (set run_tuning=True to enable)")
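
Before enabling tuning, it helps to estimate how much work the grid search will do. The sketch below counts the parameter combinations in the grid above and multiplies by the number of cross-validation folds; it uses only the standard library and copies `param_grid` and `cv=3` from the tuning cell.

```python
from itertools import product

# Same grid and fold count as the tuning cell
param_grid = {
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.2],
    "n_estimators": [50, 100, 200],
    "subsample": [0.8, 1.0],
}
cv_folds = 3

combinations = list(product(*param_grid.values()))
n_cv_fits = len(combinations) * cv_folds

print(f"{len(combinations)} combinations x {cv_folds} folds = {n_cv_fits} fits")
```

With its default `refit=True`, `GridSearchCV` also trains one final model on the full training set using the best combination, so the total is one fit more than the cross-validation count.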

## 9. Save Trained Model

In [None]:
# Save the trained model and preprocessing components
model_data = {
    'model': xgb_model,
    'scaler': scaler,
    'feature_columns': feature_columns,
    'metrics': {
        'auc': auc_score,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
}

# Save model
with open('models/reranker_model.pkl', 'wb') as f:
    pickle.dump(model_data, f)

# Save metadata
metadata = {
    'model_type': 'XGBClassifier',
    'feature_columns': feature_columns,
    'training_samples': len(X_train),
    'test_auc': float(auc_score),  # AUC measured on the held-out test split
    'feature_importance': {feat: float(imp) for feat, imp in zip(feature_columns, feature_importance)}
}

with open('models/reranker_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print("✅ Model saved successfully!")
print(f"📁 Files saved:")
print("  - models/reranker_model.pkl")
print("  - models/reranker_metadata.json")

# Test loading the model
print("\n🧪 Testing model loading...")
with open('models/reranker_model.pkl', 'rb') as f:
    loaded_model_data = pickle.load(f)

test_input = X_test_scaled[:5]  # Test with 5 samples
test_predictions = loaded_model_data['model'].predict_proba(test_input)[:, 1]
print(f"✅ Model loading test successful!")
print(f"Sample predictions: {test_predictions}")
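
The saved `feature_columns` list matters at inference time: the model expects features in exactly the column order used during training. The sketch below shows one way to assemble a row from a feature dictionary; `build_feature_row` is a hypothetical helper, the column names come from the synthetic fallback dataset, and the default of 0.0 for missing values is an assumption that mirrors the `np.nan_to_num(X, nan=0.0)` step used in training.

```python
def build_feature_row(pair_features, feature_columns, default=0.0):
    """Order a feature dict like the training matrix; fill missing features with `default`."""
    return [float(pair_features.get(name, default)) for name in feature_columns]


# Column order as produced by the synthetic fallback dataset in this notebook
example_columns = [
    "gpa_normalized", "experience_years", "education_level_match",
    "experience_match", "skill_overlap", "education_overqualified",
    "experience_overqualified", "salary_avg_normalized", "location_match",
]

# Hypothetical user-job pair with keys in arbitrary order and two features missing
pair = {
    "skill_overlap": 0.45,
    "gpa_normalized": 0.8,
    "location_match": 1,
    "experience_years": 3,
    "education_level_match": 1,
    "experience_match": 0,
    "salary_avg_normalized": 0.6,
}

row = build_feature_row(pair, example_columns)
print(row)
```

The resulting row would then go through `scaler.transform` before being passed to `predict_proba`, just as the test samples do above.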

## Summary

✅ **XGBoost Reranker Training Complete!**

**What we accomplished:**
- Loaded/created training data with user-job compatibility features
- Trained XGBoost classifier for job recommendation reranking
- Evaluated model performance with AUC, precision, recall, and F1 on the held-out test set
- Analyzed feature importance to understand key factors
- Saved trained model for use in recommendation pipeline

**Key Results:**
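- The model learns to predict job relevance from structured user-job features
- With the synthetic sample data, skill overlap and education match carry the largest label weights (0.3 each), so they should rank near the top of the importance chart; rankings on real data may differ
- The model can be used to rerank semantic search results for better recommendations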
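
As a rough illustration of the reranking step, the sketch below scores a list of candidate jobs and sorts them by predicted relevance. The `rerank` helper and the two stand-in classes are hypothetical and exist only to keep the example self-contained; in the real pipeline, the `model` and `scaler` entries loaded from `models/reranker_model.pkl` would take their place.

```python
def rerank(candidates, feature_rows, model, scaler, top_k=None):
    """Sort candidates by the model's positive-class probability, highest first."""
    scaled = scaler.transform(feature_rows)
    probabilities = model.predict_proba(scaled)
    scores = [float(row[1]) for row in probabilities]  # column 1 = P(relevant)
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


class IdentityScaler:
    """Stand-in scaler that returns its input unchanged."""

    def transform(self, rows):
        return rows


class FirstFeatureModel:
    """Stand-in model that treats the first feature as the relevance probability."""

    def predict_proba(self, rows):
        return [[1.0 - row[0], row[0]] for row in rows]


# Candidates as they might come back from semantic search, with one feature each
jobs = ["job_a", "job_b", "job_c"]
features = [[0.2], [0.9], [0.5]]

ranked = rerank(jobs, features, FirstFeatureModel(), IdentityScaler(), top_k=2)
print(ranked)
```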

**Next Steps:**
1. Run the evaluation notebook to measure ranking metrics (NDCG@k)
2. Use the inference demo notebook to test recommendations
3. Integrate into production recommendation system
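
For orientation, NDCG@k can be sketched in a few lines. This is a minimal version under two assumptions: it uses the linear-gain form (relevance divided by log2 of the position plus one), and it derives the ideal ordering from the same list, so it presumes every relevant job already appears among the ranked results. The evaluation notebook may define the metric differently.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering; 0.0 if nothing is relevant."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in the order the reranker returned the jobs (1 = relevant)
ranked_relevance = [1, 0, 1]
score = ndcg_at_k(ranked_relevance, k=3)
print(f"NDCG@3 = {score:.4f}")
```

A perfect ordering scores 1.0, and moving a relevant job further down the list lowers the score.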