# Job Title Prediction using Natural Language Processing

## Executive Summary

This project builds an end-to-end NLP pipeline to classify job titles based on job description text.  
The task is formulated as a supervised multi-class classification problem.

A systematic modeling approach was followed:

- Classical NLP baselines (TF-IDF + Naive Bayes / Logistic Regression)
- Word Embedding-based modeling (Word2Vec)
- Deep Learning sequence modeling (LSTM)
- Transformer-based contextual modeling (BERT)

The objective is to compare how these modeling strategies perform on recruitment text data.

## Problem Definition

Given a job description, the goal is to predict the corresponding job title.

This is a supervised multi-class classification problem:

- **Input (X):** Job description (unstructured text)
- **Output (y):** Job title (categorical label)

Performance is evaluated on unseen test data to estimate generalization capability.


In [None]:
!pip install gensim

In [None]:
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import nltk
import re
import numpy as np
import gensim

## NLP Resource Initialization

To support preprocessing, essential NLP resources were downloaded:

- `punkt` and `punkt_tab` for tokenization
- `stopwords` corpus for filtering non-informative words
- `wordnet` for lemmatization

These resources enable standardized text cleaning and normalization.

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')


In [None]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

## Data Understanding

The dataset contains:

- **Job Description:** Detailed text outlining responsibilities and requirements
- **Job Title:** Categorized role (e.g., Java Developer, Data Scientist, Backend Developer)

The class distribution was observed to be relatively balanced across major job roles, making it suitable for multi-class classification modeling.

In [None]:
df=pd.read_csv('/content/job_title_des.csv')

In [None]:
df

In [None]:
df.drop('Unnamed: 0',axis=1,inplace=True)

In [None]:
df.info()

In [None]:
df['Job Title'].value_counts()

In [None]:
df.nunique()

In [None]:
df['Job Title'].value_counts().sort_values(ascending=False)

## Job Title Distribution

To understand the dataset better, the frequency of each job title was visualized using a bar chart.

The distribution shows that all job roles have similar sample counts, indicating a relatively balanced dataset.



In [None]:
df['Job Title'].value_counts().sort_values(ascending=False).plot(kind='bar')

plt.title('Distribution of Job Titles', fontsize=14)
plt.xlabel('Job Title', fontsize=12)
plt.ylabel('Number of Samples', fontsize=12)

plt.xticks(rotation=75)
plt.tight_layout()

plt.show()


In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
df['Job Title_encoded']=le.fit_transform(df['Job Title'])

In [None]:
df[['Job Title','Job Title_encoded']].value_counts()


## Text Cleaning and Lemmatization

Before modeling, job descriptions were preprocessed to improve text quality and reduce noise.

Steps applied:

- Convert text to lowercase
- Remove special characters and punctuation
- Tokenize text into words
- Remove stopwords
- Apply lemmatization to normalize word forms

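Lemmatization maps inflected word forms to a common base form, which shrinks the vocabulary. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun by default, so the code below normalizes plurals (e.g., "developers" becomes "developer") but leaves verb forms such as "developing" and "developed" unchanged unless a part-of-speech tag (`pos='v'`) is passed.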

A new feature column, **"Job Description Lemmatized"**, was created for modeling.
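As a rough illustration of how the regex step listed above behaves, the sketch below applies the same lowercasing and `[^a-z\s]` filter to a made-up sentence. The `clean_sketch` function and the `TOY_STOPWORDS` set are hypothetical stand-ins: the notebook itself uses NLTK's English stopword list and `word_tokenize`, whereas this sketch splits on whitespace so that it runs without any downloads.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's full English list.
TOY_STOPWORDS = {"and", "the", "a", "of", "in", "to", "with"}

def clean_sketch(text):
    """Lowercase, keep only letters and whitespace, split, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # same pattern as the notebook's clean()
    return [tok for tok in text.split() if tok not in TOY_STOPWORDS]

example = "Senior C++ and Node.js Developer with 5+ years"
tokens = clean_sketch(example)
print(tokens)
```

In this example `C++` collapses to `c` and the `5+` token disappears entirely. Skill names that depend on symbols or digits lose information under this pattern, which is worth keeping in mind when comparing cleaned and raw text later on.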

In [None]:
stop_words=set(stopwords.words('english'))
def clean(text):
  # Lowercase, strip everything except letters and whitespace, tokenize, drop stopwords
  text=text.lower()
  text=re.sub(r'[^a-z\s]','',text)
  tokens=word_tokenize(text)
  return [token for token in tokens if token not in stop_words]


In [None]:
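lemmatizer=WordNetLemmatizer()
def lemmatized_word(text):
  # Clean and tokenize, lemmatize each token (noun POS by default), then rejoin into a string
  tokens=clean(text)
  lemmas=[lemmatizer.lemmatize(token) for token in tokens]
  return ' '.join(lemmas)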

In [None]:
df['Job Description Lemmatized']=df['Job Description'].apply(lemmatized_word)

In [None]:
df

## Feature Selection and Train-Test Split

After preprocessing, the modeling dataset contains:

- `Job Title_encoded` (Target variable)
- `Job Description Lemmatized` (Processed text feature)

The dataset was split into:

- 70% Training data
- 30% Test data

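Stratified sampling is the intended way to keep the class distribution identical in both splits. Note that the `train_test_split` call below does not pass `stratify`, so class proportions are preserved only approximately by the random split (`random_state=45`).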

The vectorizers were fit only on training data to prevent data leakage.
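To make the idea of stratification concrete, here is a minimal plain-Python sketch that splits each class separately so the test set receives about the same fraction of every label. The `stratified_split` helper and the toy labels are hypothetical; in scikit-learn the same effect is obtained by passing `stratify=y` to `train_test_split`, as the BERT section of this notebook does.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=45):
    """Return (train_idx, test_idx) with about test_size of each class in the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within the class only
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: three made-up classes with 10 samples each.
toy_labels = ["java"] * 10 + ["data"] * 10 + ["backend"] * 10
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Each toy class contributes exactly three of its ten samples to the test set. A purely random split would only match these proportions on average.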

In [None]:
df_lemmatize=df[['Job Title_encoded','Job Description Lemmatized']]

In [None]:
df_lemmatize

In [None]:
x_lemm=df_lemmatize['Job Description Lemmatized']
y_lemm=df_lemmatize['Job Title_encoded']

In [None]:
from sklearn.model_selection import train_test_split
x_train_lemm,x_test_lemm,y_train_lemm,y_test_lemm=train_test_split(x_lemm,y_lemm,test_size=0.3,random_state=45)

## TF-IDF Feature Extraction

TF-IDF (Term Frequency–Inverse Document Frequency) transforms text into numerical feature vectors.

It:

- Converts textual data into numerical representation
- Assigns higher weight to informative terms
- Reduces the influence of very common words

This serves as the baseline feature representation for classical ML models.
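The weighting described above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the notebook's code path: it uses the smoothed IDF formula that scikit-learn documents as its default, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, and it tokenizes by whitespace instead of using `TfidfVectorizer`'s own tokenizer. The `tfidf_rows` name and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """TF-IDF with smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for doc in tokenized for tok in doc})
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

toy_docs = [
    "java developer spring",
    "python developer pandas",
    "developer sql database",
]
vocab, idf, rows = tfidf_rows(toy_docs)
print(idf["developer"], idf["java"])
```

Because "developer" appears in every toy document it gets the minimum IDF, while "java" appears in only one and is weighted more heavily. This is the sense in which TF-IDF down-weights common words.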

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf=TfidfVectorizer(lowercase=True,stop_words='english',max_features=5000)
x_train_vectorized_lemmatized=tfidf.fit_transform(x_train_lemm)
x_test_vectorized_lemmatized=tfidf.transform(x_test_lemm)

## Multinomial Naive Bayes (TF-IDF on Lemmatized Text)

A Multinomial Naive Bayes classifier was trained using TF-IDF features generated from lemmatized text.

Naive Bayes is commonly used in text classification because:

- It performs well with high-dimensional sparse data
- It is computationally efficient
- It provides a strong baseline for NLP tasks



In [None]:
from sklearn.naive_bayes import MultinomialNB
nm_lemmatized_model=MultinomialNB()
nm_lemmatized_model.fit(x_train_vectorized_lemmatized,y_train_lemm)

In [None]:
y_pred_lemmatized=nm_lemmatized_model.predict(x_test_vectorized_lemmatized)

## Model Evaluation

The model was evaluated using:

- Accuracy
- Precision (Weighted)
- Recall (Weighted)
- F1-Score
- Confusion Matrix
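For reference, the "weighted" averages listed above weight each class's score by its share of the true labels. The following plain-Python sketch shows the arithmetic on made-up labels; `weighted_precision_recall` is a hypothetical helper, not a scikit-learn function, and the notebook itself relies on `precision_score` and `recall_score` with `average='weighted'`.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(y_true)       # true samples per class
    predicted = Counter(y_pred)     # predicted samples per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = 0.0
    for label, n_true in support.items():
        weight = n_true / total
        # A class that is never predicted contributes a precision of 0 here.
        class_precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        class_recall = true_pos[label] / n_true
        precision += weight * class_precision
        recall += weight * class_recall
    return precision, recall

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```

One consequence of this weighting is that weighted recall always equals accuracy in single-label classification, so the two numbers reported by the notebook carry the same information. Weighted precision and the per-class rows of the classification report are the more informative additions.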



In [None]:
from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,classification_report,precision_recall_fscore_support
print('accuracy score for a multinomial model built on lemmatized text is',accuracy_score(y_test_lemm,y_pred_lemmatized))
print('\n')
print('precision score for a multinomial model built on lemmatized text is',precision_score(y_test_lemm,y_pred_lemmatized, average='weighted'))
print('\n')
print('recall score for a multinomial model built on lemmatized text is',recall_score(y_test_lemm,y_pred_lemmatized, average='weighted'))
print('\n')
print('classification report for a multinomial model built on lemmatized text is','\n',classification_report(y_test_lemm,y_pred_lemmatized))
print('\n')
print('confusion matrix for a multinomial model built on lemmatized text is','\n',confusion_matrix(y_test_lemm,y_pred_lemmatized))

### Observations

- The model achieved an accuracy of approximately **66%** on the test dataset.
- Precision and recall scores indicate moderate performance across job title categories.
- Some classes perform better than others, suggesting opportunities for improvement using more advanced models.

This baseline establishes a performance benchmark before experimenting with more complex architectures such as LSTM or Transformer-based models.


## TF-IDF Feature Extraction (Raw Text)

TF-IDF (Term Frequency–Inverse Document Frequency) was used to convert raw job descriptions into numerical feature vectors.

TF-IDF:

- Measures the importance of a word within a document relative to the entire corpus  
- Reduces the influence of very common words  
- Preserves discriminative keywords useful for classification  

The vectorizer was fitted on the training data to avoid data leakage, and the same transformation was applied to the test data.


In [None]:
x=df['Job Description']
y=df['Job Title_encoded']

In [None]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=45)

In [None]:
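# Refit the same vectorizer on the raw training text; this replaces the vocabulary learned from the lemmatized text
x_train_vectorized=tfidf.fit_transform(x_train)
x_test_vectorized=tfidf.transform(x_test)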


## Multinomial Naive Bayes (TF-IDF on Raw Text)

TF-IDF features were generated directly from raw job descriptions.

The goal was to compare performance between:

- Raw text representation
- Lemmatized text representation



In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
nm_model=MultinomialNB()

In [None]:
nm_model.fit(x_train_vectorized,y_train)

In [None]:
y_pred_nb=nm_model.predict(x_test_vectorized)

In [None]:
print('accuracy score for a multinomial model is',accuracy_score(y_test,y_pred_nb))
print('\n')
print('precision score for a multinomial model is',precision_score(y_test,y_pred_nb, average='weighted'))
print('\n')
print('recall score for a multinomial model is',recall_score(y_test,y_pred_nb, average='weighted'))
print('\n')
print('classification report for a multinomial model is','\n',classification_report(y_test,y_pred_nb))
print('\n')
print('confusion matrix for a multinomial model is','\n',confusion_matrix(y_test,y_pred_nb))

### Results

The model achieved approximately **75% accuracy**, outperforming the lemmatized version.

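This suggests that TF-IDF on raw text preserved sufficient discriminative information for this classification task. One plausible reason is that `TfidfVectorizer` already lowercases the text and removes English stop words, while the custom cleaning step additionally strips digits and symbols that can appear in technology names.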

## Logistic Regression (TF-IDF on Raw Text)

Logistic Regression was trained on the same raw-text TF-IDF features used for the Naive Bayes model above.



In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr=LogisticRegression()

In [None]:
lr.fit(x_train_vectorized,y_train)

In [None]:
y_pred_lr=lr.predict(x_test_vectorized)

In [None]:
print('accuracy score for a Logistic Model is',accuracy_score(y_test,y_pred_lr))
print('\n')
print('precision score for a Logistic Model is',precision_score(y_test,y_pred_lr, average='weighted'))
print('\n')
print('recall score for a Logistic Model  is',recall_score(y_test,y_pred_lr, average='weighted'))
print('\n')
print('classification report for a Logistic Model is','\n',classification_report(y_test,y_pred_lr))
print('\n')
print('confusion matrix for a Logistic Model is','\n',confusion_matrix(y_test,y_pred_lr))


### Results

The model achieved approximately **82–83% accuracy**, outperforming Naive Bayes.


## Word2Vec + Logistic Regression (Raw Text)

Word2Vec was trained on the raw, whitespace-tokenized job descriptions (the full dataset, before the train-test split) to learn dense word embeddings from local word context.

Each job description was converted into a fixed-length vector by averaging the embeddings of its words. These vectors were then used as input features for a Logistic Regression classifier.
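The averaging step can be illustrated without gensim. In this sketch the three-dimensional `toy_vectors` and the `average_embedding` helper are hypothetical; real Word2Vec vectors are learned from the corpus and, in this notebook, have 1000 dimensions.

```python
# Hypothetical 3-dimensional embeddings; real Word2Vec vectors are learned from data.
toy_vectors = {
    "python": [1.0, 0.0, 2.0],
    "developer": [3.0, 2.0, 0.0],
}

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]

doc_vec = average_embedding("python developer wanted".split(), toy_vectors, dim=3)
empty_vec = average_embedding("unknown words only".split(), toy_vectors, dim=3)
print(doc_vec, empty_vec)
```

Averaging discards word order and gives every word equal weight, so a long description dominated by generic vocabulary can end up with a vector close to that of an unrelated role.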



In [None]:
from gensim.models import Word2Vec

In [None]:
sentences = [sent.split() for sent in df['Job Description']]

In [None]:
word2vec_model=Word2Vec(sentences=sentences,vector_size=1000,window=1,sg=0,min_count=1)

In [None]:
def sentences2vec_wv(inp_sentences,vec_model):
  vectors=[vec_model.wv[word] for word in inp_sentences if word in vec_model.wv]
  if not vectors:
    return np.zeros(vec_model.vector_size)
  return np.mean(vectors,axis=0)

In [None]:
x_vector_wv=[sentences2vec_wv(sent.split(),word2vec_model) for sent in df['Job Description']]

In [None]:
x_train_wv,x_test_wv,y_train_wv,y_test_wv=train_test_split(x_vector_wv,y,test_size=0.3,random_state=45)

In [None]:
lr_word2vec_pre_trained=LogisticRegression()

In [None]:
lr_word2vec_pre_trained.fit(x_train_wv,y_train_wv)

In [None]:
y_pred_wv=lr_word2vec_pre_trained.predict(x_test_wv)

In [None]:
print('accuracy score for a Logistic Model for Word2Vec Model is',accuracy_score(y_test_wv,y_pred_wv))
print('\n')
print('precision score for a Logistic Model for Word2Vec Model is',precision_score(y_test_wv,y_pred_wv, average='weighted'))
print('\n')
print('recall score for a Logistic Model for Word2Vec Model is',recall_score(y_test_wv,y_pred_wv, average='weighted'))
print('\n')
print('classification report for a Logistic Model for Word2Vec Model is','\n',classification_report(y_test_wv,y_pred_wv))
print('\n')
print('confusion matrix for a Logistic Model for Word2Vec Model is','\n',confusion_matrix(y_test_wv,y_pred_wv))

### Result

The model achieved significantly lower performance compared to TF-IDF-based models.

## LSTM-Based Text Classification

A deep learning model was built using an Embedding layer followed by an LSTM network.

The tokenizer converted job descriptions into numerical sequences, which were padded to a fixed length before training.
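The encode-then-pad step can be sketched in plain Python as follows. This is a simplified stand-in for the Keras `Tokenizer` and `pad_sequences`, not their implementation: `encode_and_pad` and `toy_vocab` are hypothetical, and the convention of reserving index 0 for padding and index 1 for out-of-vocabulary words is assumed here for illustration.

```python
def encode_and_pad(text, vocab, max_len, oov_index=1, pad_index=0):
    """Map words to integer ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(word, oov_index) for word in text.lower().split()]
    ids = ids[:max_len]  # like truncating='post': keep the first max_len ids
    return ids + [pad_index] * (max_len - len(ids))  # like padding='post': append zeros

# Hypothetical vocabulary; index 0 is reserved for padding and 1 for unknown words.
toy_vocab = {"java": 2, "developer": 3, "spring": 4}

short = encode_and_pad("Java developer", toy_vocab, max_len=4)
long = encode_and_pad("Senior Java developer Spring Boot", toy_vocab, max_len=4)
print(short, long)
```

With `max_len=100` and post-truncation, as configured in this notebook, only the first 100 tokens of each description reach the LSTM. Anything after that point is dropped.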

The model architecture consists of:
- Embedding layer for dense word representations
- LSTM layer to capture sequential dependencies
- Dropout layer for regularization
- Dense output layer with softmax activation for multi-class classification
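As a small illustration of the final layer in the list above, the sketch below turns raw class scores into probabilities with softmax and then picks the most likely class, which is what the later `np.argmax` call does on the model's predictions. The scores are made up, and three classes are used instead of the notebook's 15 to keep the example short.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three classes; the notebook's output layer has 15 units.
probs = softmax([2.0, 0.5, -1.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

The predicted index maps back to a job title through the fitted `LabelEncoder` (for example with `le.inverse_transform`).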



In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
df.columns

In [None]:
sent = df['Job Description'].values

In [None]:
tokenizer=Tokenizer(num_words=10000,oov_token='<OOV>')
tokenizer.fit_on_texts(sent)

In [None]:
max_len=100

In [None]:
seq=tokenizer.texts_to_sequences(sent)

In [None]:
x_padded=pad_sequences(seq,max_len,padding='post',truncating='post')

In [None]:
x_padded_train,x_padded_test,y_train,y_test=train_test_split(x_padded,y,test_size=0.3,random_state=45)

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,LSTM,Dropout,Dense

In [None]:
lstm_model=Sequential([
    Embedding(input_dim=10000,output_dim=128,input_length=max_len),
    LSTM(64,return_sequences=False),
    Dropout(0.3),
    Dense(15,activation='softmax')
])

In [None]:
lstm_model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])

In [None]:
lstm_model.summary()

In [None]:
lstm_model.fit(x_padded_train,y_train,epochs=50,validation_split=0.1,batch_size=64)

In [None]:
y_pred_lstm=lstm_model.predict(x_padded_test)

In [None]:
y_pred_lstm

In [None]:
y_pred_lstm_val=np.argmax(y_pred_lstm,axis=1)

In [None]:
y_pred_lstm_val

In [None]:
print('accuracy score for LSTM is',accuracy_score(y_test,y_pred_lstm_val))
print('\n')
print('precision score for LSTM is',precision_score(y_test,y_pred_lstm_val, average='weighted'))
print('\n')
print('recall score for LSTM is',recall_score(y_test,y_pred_lstm_val, average='weighted'))
print('\n')
print('classification report for LSTM is','\n',classification_report(y_test,y_pred_lstm_val))
print('\n')
print('confusion matrix for LSTM is','\n',confusion_matrix(y_test,y_pred_lstm_val))

### Result

The LSTM model achieved moderate performance but did not outperform TF-IDF + Logistic Regression.

## Transformer-Based Model

Transformer models use self-attention to understand contextual meaning in text.
They capture long-range dependencies better than traditional NLP models.

In this project, a pre-trained Transformer model was fine-tuned
for multi-class job title classification.

In [None]:
!pip install transformers datasets torch scikit-learn

In [None]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df['Job Description'],
    df['Job Title_encoded'],
    test_size=0.3,
    random_state=42,
    stratify=df['Job Title_encoded']
)

In [None]:
train_df = pd.DataFrame({'text': X_train, 'label': y_train})
test_df = pd.DataFrame({'text': X_test, 'label': y_test})

train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
def tokenize_function(example):
    return tokenizer(
        example['text'],
        padding='max_length',
        truncation=True,
        max_length=128
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset = test_dataset.map(tokenize_function, batched=True)

train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

In [None]:
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=15
)

In [None]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average='weighted'
    )
    acc = accuracy_score(labels, predictions)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

## BERT (bert-base-uncased)

BERT is a bidirectional Transformer model pre-trained on large corpora.
It understands context from both left and right directions.

Configuration:
- Model: bert-base-uncased
- Max Length: 128
- Epochs: 2
- Batch Size: 8
- Learning Rate: 2e-5

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy='epoch',
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir='./logs',
    load_best_model_at_end=True,
    save_strategy='epoch'
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

In [None]:
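predictions = trainer.predict(test_dataset)
y_pred = np.argmax(predictions.predictions, axis=1)

print("Accuracy:", accuracy_score(y_test, y_pred))
# Weighted precision, recall, and F1 from compute_metrics are available in predictions.metrics
print("Metrics:", predictions.metrics)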

## Results

The fine-tuned BERT model achieved:

- Accuracy: ~79%
- Weighted F1-score: ~0.79


## Final Conclusion

This project compared multiple NLP approaches for multi-class job title prediction, including TF-IDF, Word2Vec, LSTM, and BERT.

Among all models, **TF-IDF + Logistic Regression achieved the highest accuracy (~83%)**, indicating that the dataset is strongly driven by keyword-level patterns.

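While LSTM and BERT can model context, the classical linear model scored higher here, likely because of the dataset size and the keyword-driven nature of the task. The comparison is approximate: the BERT run used a different split (`random_state=42`, stratified) from the other models (`random_state=45`), was trained for only 2 epochs, and truncated inputs to 128 tokens, while the LSTM saw only the first 100 tokens of each description.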

This demonstrates that effective model selection depends on data characteristics, and higher model complexity does not always guarantee better performance.