# 🎓 ResumeIQ: Job Role Prediction Model Training

This notebook trains a **TF-IDF + Logistic Regression** classifier that ranks the most likely job roles for the skills listed on a resume.

### 📂 Input file:
- `data/training_data.json`: dataset of skill strings and their job role labels. Adding more entries here gives the model more examples per role to learn from.

### 📦 Output files (saved into `models/` folder):
- `models/job_model.pkl`: trained Logistic Regression classifier
- `models/tfidf.pkl`: fitted TF-IDF vectorizer
- `models/label_encoder.pkl`: label encoder for job role names

### ▶ How to run:
```bash
pip install scikit-learn pandas numpy joblib
jupyter notebook train_model.ipynb
```
Run all cells top to bottom. The model files will be saved automatically.

### ✏️ To add more training data:
Open `data/training_data.json` and add entries in this format:
```json
{ "skills": "python docker kubernetes aws git", "job_role": "DevOps Engineer" }
```
Then re-run the notebook to retrain.
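
Entries can also be appended programmatically. The sketch below assumes the file is a single JSON list of such objects (which is what `pd.DataFrame(raw)` in the loading cell implies); the helper name `add_training_entry` is hypothetical, and the demo writes to a temporary file rather than the real dataset.

```python
import json
import os
import tempfile


def add_training_entry(json_path, skills, job_role):
    """Append one {"skills", "job_role"} record to a JSON list file.

    Hypothetical helper: assumes the file holds a JSON list of objects.
    Returns the number of records after the append.
    """
    with open(json_path, 'r', encoding='utf-8') as f:
        records = json.load(f)
    records.append({'skills': skills.strip(), 'job_role': job_role})
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=2)
    return len(records)


# Demo on a throwaway copy so the real dataset is left untouched.
demo_path = os.path.join(tempfile.mkdtemp(), 'training_data.json')
with open(demo_path, 'w', encoding='utf-8') as f:
    json.dump(
        [{'skills': 'python docker kubernetes aws git',
          'job_role': 'DevOps Engineer'}],
        f,
    )

n_records = add_training_entry(
    demo_path, 'terraform ansible jenkins linux', 'DevOps Engineer'
)
print(f'{n_records} records in {demo_path}')
```

The `job_role` string must match an existing role exactly (including capitalisation), otherwise the label encoder will treat it as a new class.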

In [2]:
import pandas as pd
import numpy as np
import joblib
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline


In [4]:
%pwd

'd:\\ML PROJECTS\\AI-Based-Resume-analyser\\notebook'

In [5]:
os.chdir('../')

In [6]:
%pwd

'd:\\ML PROJECTS\\AI-Based-Resume-analyser'

In [7]:
os.makedirs('models', exist_ok=True)

In [10]:
import json

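# Load the training data from the JSON file
JSON_PATH = os.path.join('data', 'training_data.json')

# Read as UTF-8 explicitly so the result does not depend on the OS default encoding
with open(JSON_PATH, 'r', encoding='utf-8') as f:
    raw = json.load(f)

df = pd.DataFrame(raw)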

# Validate expected columns exist
assert 'skills'   in df.columns, "JSON must have a 'skills' field"
assert 'job_role' in df.columns, "JSON must have a 'job_role' field"

print(f'Loaded {len(df)} samples from {JSON_PATH}')

Loaded 100 samples from data\training_data.json


In [11]:
print(f'   {df["job_role"].nunique()} unique job roles\n')
print(df['job_role'].value_counts())

   10 unique job roles

job_role
Data Scientist            10
Web Developer             10
DevOps Engineer           10
Data Analyst              10
Backend Developer         10
Mobile Developer          10
Cybersecurity Analyst     10
Cloud Engineer            10
ML Engineer               10
Database Administrator    10
Name: count, dtype: int64


In [12]:
le = LabelEncoder()
df['label'] = le.fit_transform(df['job_role'])

print('Job Role → Label mapping:')
for label, role in enumerate(le.classes_):
    print(f'  {label:2d} → {role}')

Job Role → Label mapping:
   0 → Backend Developer
   1 → Cloud Engineer
   2 → Cybersecurity Analyst
   3 → Data Analyst
   4 → Data Scientist
   5 → Database Administrator
   6 → DevOps Engineer
   7 → ML Engineer
   8 → Mobile Developer
   9 → Web Developer


In [13]:
X = df['skills']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Train: {len(X_train)} samples | Test: {len(X_test)} samples')

Train: 80 samples | Test: 20 samples


In [15]:
# ── Build the model (TF-IDF + Logistic Regression) ──
# TF-IDF converts skill text → numeric vector
# Logistic Regression gives a probability score per class, which suits top-3 ranking

tfidf = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams
    min_df=1,
    max_features=5000,
    sublinear_tf=True,     # use 1 + log(tf) instead of raw term counts
)

clf = LogisticRegression(
    max_iter=1000,
    C=5.0,
    class_weight='balanced',
    solver='lbfgs',
    random_state=42,
)

X_train_vec = tfidf.fit_transform(X_train)
X_test_vec  = tfidf.transform(X_test)
clf.fit(X_train_vec, y_train)

print('✅ Model trained!')

✅ Model trained!
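
The vectorizer and classifier are fitted as two separate objects here. As an alternative, the same two steps can be wrapped in a scikit-learn `Pipeline` so that raw skill strings go in and predictions come out of a single object. The following is a minimal sketch on a hypothetical toy dataset (the six rows below are made up and are not taken from the training file); it is not how this notebook saves its models.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real training set.
toy_skills = [
    'python pandas scikit-learn statistics',
    'python numpy machine learning statistics',
    'docker kubernetes aws terraform',
    'docker jenkins linux kubernetes',
    'html css javascript react',
    'javascript react node css',
]
toy_roles = [
    'Data Scientist', 'Data Scientist',
    'DevOps Engineer', 'DevOps Engineer',
    'Web Developer', 'Web Developer',
]

# Same hyperparameters as the cell above, chained into one estimator.
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=1,
                              max_features=5000, sublinear_tf=True)),
    ('clf', LogisticRegression(max_iter=1000, C=5.0,
                               class_weight='balanced', random_state=42)),
])

# fit() runs fit_transform on the vectorizer, then fits the classifier.
pipe.fit(toy_skills, toy_roles)

# predict() accepts raw strings; the vectorizer step is applied automatically.
pred = pipe.predict(['docker kubernetes jenkins'])
print(pred[0])
```

A pipeline also means one file to save and load instead of two, at the cost of changing the artifact layout that downstream code expects.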


In [17]:
y_pred = clf.predict(X_test_vec)
acc    = accuracy_score(y_test, y_pred)

print(f'Test Accuracy: {acc * 100:.1f}%\n')

Test Accuracy: 100.0%



In [18]:
print(classification_report(y_test, y_pred, target_names=le.classes_))

                        precision    recall  f1-score   support

     Backend Developer       1.00      1.00      1.00         2
        Cloud Engineer       1.00      1.00      1.00         2
 Cybersecurity Analyst       1.00      1.00      1.00         2
          Data Analyst       1.00      1.00      1.00         2
        Data Scientist       1.00      1.00      1.00         2
Database Administrator       1.00      1.00      1.00         2
       DevOps Engineer       1.00      1.00      1.00         2
           ML Engineer       1.00      1.00      1.00         2
      Mobile Developer       1.00      1.00      1.00         2
         Web Developer       1.00      1.00      1.00         2

              accuracy                           1.00        20
             macro avg       1.00      1.00      1.00        20
          weighted avg       1.00      1.00      1.00        20
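
With only two test samples per role, a single 80/20 split gives a coarse accuracy estimate: each misclassified sample moves the score by five points. Stratified k-fold cross-validation is one common way to get a steadier number. The sketch below assumes a hypothetical toy dataset generated from made-up per-role vocabularies; on the real data, `X_toy` and `y_toy` would be replaced by `df['skills']` and `df['label']`.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical vocabularies used only to generate illustrative samples.
role_vocab = {
    'Data Scientist': ['python', 'pandas', 'statistics', 'numpy',
                       'matplotlib', 'tensorflow'],
    'DevOps Engineer': ['docker', 'kubernetes', 'jenkins', 'terraform',
                        'linux', 'ansible'],
    'Web Developer': ['html', 'css', 'javascript', 'react',
                      'node', 'typescript'],
}

rng = random.Random(42)
X_toy, y_toy = [], []
for role, vocab in role_vocab.items():
    for _ in range(10):  # 10 samples per role, as in the real dataset
        X_toy.append(' '.join(rng.sample(vocab, 4)))
        y_toy.append(role)

# Putting TF-IDF inside the pipeline means it is re-fitted on the training
# part of every fold, so no vocabulary leaks in from the held-out part.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000, C=5.0, class_weight='balanced',
                       random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring='accuracy')
print(f'CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Five folds on ten samples per role leaves two held-out samples per role in each fold, the same as the split above, but every sample is tested exactly once.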



In [19]:
sample = 'python machine learning tensorflow docker kubernetes git sql'
vec    = tfidf.transform([sample])
proba  = clf.predict_proba(vec)[0]

top3_idx   = np.argsort(proba)[::-1][:3]
top3_roles = [(le.classes_[i], round(proba[i] * 100, 1)) for i in top3_idx]

print(f'Input skills: "{sample}"\n')
print('Top 3 predicted roles:')
for rank, (role, pct) in enumerate(top3_roles, 1):
    print(f'  #{rank}: {role} - {pct}% confidence')

Input skills: "python machine learning tensorflow docker kubernetes git sql"

Top 3 predicted roles:
  #1: Data Scientist - 33.9% confidence
  #2: ML Engineer - 16.2% confidence
  #3: DevOps Engineer - 11.6% confidence
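
The top-3 logic above can be wrapped in a reusable function. This is a sketch under assumptions: `predict_top_k` is a hypothetical name, and the model is trained on a made-up toy dataset so the block runs on its own. One small difference from the cell above is that it decodes labels through `model.classes_`, which gives the column order of `predict_proba`, rather than indexing `le.classes_` directly; the two agree whenever every role appears in the training split.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder


def predict_top_k(skills, vectorizer, model, encoder, k=3):
    """Return the k most probable (role, probability) pairs for one skill string."""
    proba = model.predict_proba(vectorizer.transform([skills]))[0]
    top_idx = np.argsort(proba)[::-1][:k]  # indices of the k largest scores
    # Columns of predict_proba follow model.classes_, so decode through it.
    roles = encoder.inverse_transform(model.classes_[top_idx])
    return [(str(role), float(proba[i])) for role, i in zip(roles, top_idx)]


# Hypothetical toy data standing in for the real training set.
toy_skills = [
    'python pandas scikit-learn statistics',
    'python numpy machine learning statistics',
    'docker kubernetes aws terraform',
    'docker jenkins linux kubernetes',
    'html css javascript react',
    'javascript react node css',
]
toy_roles = [
    'Data Scientist', 'Data Scientist',
    'DevOps Engineer', 'DevOps Engineer',
    'Web Developer', 'Web Developer',
]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_roles)
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
toy_clf = LogisticRegression(max_iter=1000, C=5.0, random_state=42)
toy_clf.fit(toy_tfidf.fit_transform(toy_skills), toy_y)

top2 = predict_top_k('docker kubernetes linux', toy_tfidf, toy_clf, toy_le, k=2)
for rank, (role, p) in enumerate(top2, 1):
    print(f'  #{rank}: {role} ({p * 100:.1f}%)')
```

The probabilities are the softmax scores of the logistic regression and are spread across all roles, which is why even the top prediction in the cell above sits at 33.9%.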


In [20]:
joblib.dump(tfidf, 'models/tfidf.pkl')
joblib.dump(clf,   'models/job_model.pkl')
joblib.dump(le,    'models/label_encoder.pkl')

print('Saved:')
print('   models/tfidf.pkl')
print('   models/job_model.pkl')
print('   models/label_encoder.pkl')
print('\nDone! Copy the models/ folder into your Flask project root.')

Saved:
   models/tfidf.pkl
   models/job_model.pkl
   models/label_encoder.pkl

Done! Copy the models/ folder into your Flask project root.
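
On the consuming side, the three files are read back with `joblib.load` and applied in the same order as during training: vectorize, predict, decode. The sketch below is self-contained, so it first trains a hypothetical toy model and saves it to a temporary directory that stands in for `models/`; the function name `predict_role` is an assumption, not part of the Flask project.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Train and save a toy model (hypothetical data; stands in for the real run).
toy_skills = [
    'python pandas scikit-learn statistics',
    'python numpy machine learning statistics',
    'docker kubernetes aws terraform',
    'docker jenkins linux kubernetes',
    'html css javascript react',
    'javascript react node css',
]
toy_roles = [
    'Data Scientist', 'Data Scientist',
    'DevOps Engineer', 'DevOps Engineer',
    'Web Developer', 'Web Developer',
]
toy_le = LabelEncoder()
toy_tfidf = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)
toy_clf = LogisticRegression(max_iter=1000, C=5.0, random_state=42)
toy_clf.fit(toy_tfidf.fit_transform(toy_skills), toy_le.fit_transform(toy_roles))

model_dir = tempfile.mkdtemp()  # stands in for the models/ folder
joblib.dump(toy_tfidf, os.path.join(model_dir, 'tfidf.pkl'))
joblib.dump(toy_clf, os.path.join(model_dir, 'job_model.pkl'))
joblib.dump(toy_le, os.path.join(model_dir, 'label_encoder.pkl'))

# Load once at application start-up, then reuse for every request.
tfidf_loaded = joblib.load(os.path.join(model_dir, 'tfidf.pkl'))
clf_loaded = joblib.load(os.path.join(model_dir, 'job_model.pkl'))
le_loaded = joblib.load(os.path.join(model_dir, 'label_encoder.pkl'))


def predict_role(skills):
    """Return the single most likely job role name for a skill string."""
    label = clf_loaded.predict(tfidf_loaded.transform([skills]))
    return str(le_loaded.inverse_transform(label)[0])


predicted = predict_role('html css react')
print(predicted)
```

Pickled scikit-learn objects are tied to the library version that wrote them, so the Flask environment should use the same scikit-learn version as this notebook. Only load `.pkl` files from a source you trust, since unpickling can execute arbitrary code.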
