# Extracting "Total Experience" from Job Descriptions

## Objective

You are provided with an Excel sheet containing 3000 Job Descriptions (JDs). Your task is to:

1. Research and choose 5–6 models or techniques to extract the total experience required for
   the job from each JD. You can use LangChain, Hugging Face, or Transformers, as you
   prefer.
2. Write a single script that implements all the models and compares the total experience
   each one extracts.
3. Compare the results from each model and output them in an Excel sheet. Use parallel
   processing so the run is efficient.
4. Analyze which model performs the best based on consistency, logic, or accuracy (as per your
   understanding).
5. Submit your recommendation of the best model with justification.
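
Item 3 asks for parallel processing. The following is a minimal sketch of one way to do it, assuming each extractor is a plain function that takes the JD text and returns a string. `run_parallel` and `dummy_extractor` are hypothetical names used only for illustration. Threads are a reasonable fit when the extractor waits on a remote API; a CPU-bound local model may be better served by batching or processes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(extractor, texts, max_workers=8):
    """Apply `extractor` to every text concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as `texts`
        return list(pool.map(extractor, texts))


def dummy_extractor(text):
    # Hypothetical stand-in for a real model call
    return "5 years" if "5 years" in text else "Not mentioned"


if __name__ == "__main__":
    jds = ["Requires 5 years of Python.", "No requirement listed."]
    print(run_parallel(dummy_extractor, jds))
```

Because the results come back in input order, they can be assigned directly to a new DataFrame column.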

## Guidelines for Research & Model Selection

• Share the final, running code.

• Share the generated output Excel sheet.

• Think of a holistic algorithm as a starting point for extracting the “Total Years of Experience” mentioned in each JD.

• Go through the data carefully and note the variations across JDs.

• State which approach you consider best.

• When writing prompts, take care to match each prompt to the model it is used with.
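
As a possible starting algorithm, a rule-based baseline can be sketched with a regular expression. This is an assumption about what a reasonable baseline looks like, not one of the models tested in this notebook. `YEARS_RE`, `extract_experience_regex`, and the 60-character `window` are hypothetical choices. It handles digits only, takes the lower bound of a range, and keeps a match only when the word "experience" appears nearby.

```python
import re

# Digits with an optional range ("3-5", "3 to 5") and optional "+", then a year unit
YEARS_RE = re.compile(
    r"\b(\d{1,2})(?:\s*(?:-|–|to)\s*(\d{1,2}))?\s*\+?\s*(?:years?|yrs?)\b",
    re.IGNORECASE,
)


def extract_experience_regex(text, window=60):
    """Return '<n> years' for the first year mention with 'experience' nearby."""
    for match in YEARS_RE.finditer(text):
        # Look a short distance either side of the match for the word "experience"
        start = max(0, match.start() - window)
        context = text[start : match.end() + window].lower()
        if "experience" in context:
            return f"{int(match.group(1))} years"  # lower bound of a range
    return "Not mentioned"
```

The proximity check is a heuristic: it filters out phrases such as "founded 25 years ago" but can still be fooled when an unrelated year count sits close to the word "experience".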

## 📁 Project Directory Structure 

- `data/` – Contains the input Excel file with all job descriptions.

- `models/` – Holds individual scripts for each model or technique used to extract experience.

- `outputs/` – Stores generated results like extracted experience and final comparison files.

- `notebooks/` – Used for prototyping, testing, and experimenting with code in Jupyter.

- `main.py` – The main driver script that loads data, runs models, and saves outputs.

I'll be using this notebook as a scratchpad for exploration and testing, and then transferring stable code into structured scripts.

In [1]:
# imports
import pandas as pd
import spacy

In [2]:
# reading input data
file_path = "../data/jd_data.xlsx"
jd_df = pd.read_excel(file_path)

In [3]:
jd_df

Unnamed: 0,JD_Text
0,\n**Overview \n \n** Lazydays RV is looking ...
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...
2,\n\n\nðŸ”µ Capitole is still growing and we wa...
3,"\n\n\nAs a Solutions Engineer, you will work c..."
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...
...,...
2992,\n Job Description\n \n\n**Who We Are:**\n\nLe...
2993,\n Job Description\n \n\nIt's fun to work in a...
2994,\n Job Description\n \n\n**About Bazaarvoice ...
2995,\n Job Description\n \n\n**Opportunity Overvie...


Going through the input data, I found that not every JD explicitly states the required total experience, so every model has to account for that. Each model script must return a sentinel such as "Not mentioned" when no experience requirement is found in the JD.
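
One way to enforce that contract uniformly is a small wrapper around each extractor. This is a sketch under the assumption that extractors are functions from text to string; `with_fallback` and `NOT_MENTIONED` are hypothetical names. Exceptions are deliberately left to propagate so that errors stay distinguishable from a missing value.

```python
import math

NOT_MENTIONED = "Not mentioned"


def with_fallback(extractor):
    """Wrap an extractor so that empty input or empty output becomes the sentinel."""

    def wrapped(text):
        # Missing cells arrive from pandas as NaN (a float), not as a string
        if text is None or (isinstance(text, float) and math.isnan(text)):
            return NOT_MENTIONED
        if not str(text).strip():
            return NOT_MENTIONED
        result = extractor(str(text))
        if isinstance(result, str) and result.strip():
            return result.strip()
        return NOT_MENTIONED

    return wrapped
```

Wrapping also avoids passing the literal string "nan" to a model, which is what `str(x)` produces for an empty cell.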

## Testing Models

#### **1. spaCy en_core_web_sm**

In [4]:
import spacy

In [5]:
nlp = spacy.load("en_core_web_sm")

In [6]:
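# .copy() makes sample_df an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning
sample_df = jd_df.head(10).copy()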
sample_df

Unnamed: 0,JD_Text
0,\n**Overview \n \n** Lazydays RV is looking ...
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...
2,\n\n\nðŸ”µ Capitole is still growing and we wa...
3,"\n\n\nAs a Solutions Engineer, you will work c..."
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...
5,\n\n\n**Job Role: Oracle PL/SQL Developer**\n\...
6,\n\n\n**Job Title:** Azure Data Engineer\n\n**...
7,\n\n\nSynergie Italia seleziona per azienda le...
8,\n**About \n \n** Du hast als IT-Ass eine Af...
9,\n\n\n**Position: Senior Microsoft SQL Server ...


In [7]:
# map number words to digits so spelled-out values (e.g. "five years") can be detected
WORD_NUM_MAP = {
    "zero": "0",
    "one": "1",
    "two": "2",
    "three": "3",
    "four": "4",
    "five": "5",
    "six": "6",
    "seven": "7",
    "eight": "8",
    "nine": "9",
    "ten": "10",
    "eleven": "11",
    "twelve": "12",
    "thirteen": "13",
    "fourteen": "14",
    "fifteen": "15",
    "sixteen": "16",
    "seventeen": "17",
    "eighteen": "18",
    "nineteen": "19",
    "twenty": "20",
}

In [8]:
def extract_experience_years(text, nlp):
    """Return the first '<n> years' mention found in `text`, or 'Not mentioned'."""
    text = text.lower()
    doc = nlp(text)

    # First: spaCy-based entity detection. Take the first numeric entity that is
    # followed by "year" or "yrs" within the next 20 characters.
    for ent in doc.ents:
        if ent.label_ in ("CARDINAL", "QUANTITY", "ORDINAL"):
            window = text[ent.start_char : ent.end_char + 20]
            if "year" in window or "yrs" in window:
                return ent.text.strip() + " years"

    # Second: manual word-number scanning ("five year" also matches "five years")
    for word, digit in WORD_NUM_MAP.items():
        if f"{word} year" in text:
            return digit + " years"

    return "Not mentioned"

In [9]:
sample_df['spacy_experience'] = sample_df['JD_Text'].apply(lambda x: extract_experience_years(str(x), nlp))
sample_df.to_excel("../sample_outputs/spacy_model_results.xlsx", index=False)
print("Extraction complete. Results saved to ../sample_outputs/spacy_model_results.xlsx")

Extraction complete. Results saved to ../sample_outputs/spacy_model_results.xlsx




#### **2. LLaMA 3 8B (llama3-8b-8192 via Groq)**

In [25]:
from openai import OpenAI
import os

# Load Groq API key from environment variable
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

if not GROQ_API_KEY:
    raise ValueError("GROQ_API_KEY environment variable is not set.")

# Setup Groq client
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=GROQ_API_KEY
)

# Extraction function
def extract_experience_groq(jd_text: str, model: str = "llama3-8b-8192") -> str:
    prompt = (
        "Only analyze job descriptions written in English. "
        "Extract only the number of years of experience required. "
        "Only return a number followed by the word 'years', like '3 years'. "
        "If it's not mentioned or the text is not in English, return 'Not mentioned'. "
        "Do not include any explanation.\n\n"
        f"{jd_text}\n\n"
        "Answer:"
    )

    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {str(e)}"
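
When this is run over the full dataset, and especially in parallel, transient API failures such as rate limits seem likely (an assumption; none occurred on the 10-row sample). A minimal retry-with-backoff sketch follows. `call_with_retry` is a hypothetical helper, and it would need to wrap the API call itself rather than `extract_experience_groq`, because that function already converts exceptions into an "Error: ..." string.

```python
import time


def call_with_retry(func, *args, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on an exception wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return func(*args)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.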

In [26]:
import pandas as pd

sample_df["groq_experience"] = sample_df["JD_Text"].apply(lambda x: extract_experience_groq(str(x)))

# View results
sample_df[["JD_Text", "groq_experience"]]




Unnamed: 0,JD_Text,groq_experience
0,\n**Overview \n \n** Lazydays RV is looking ...,Not mentioned
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...,Not mentioned
2,\n\n\nðŸ”µ Capitole is still growing and we wa...,Not mentioned
3,"\n\n\nAs a Solutions Engineer, you will work c...",Not mentioned
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...,5 years
5,\n\n\n**Job Role: Oracle PL/SQL Developer**\n\...,10 years
6,\n\n\n**Job Title:** Azure Data Engineer\n\n**...,5 years
7,\n\n\nSynergie Italia seleziona per azienda le...,Not mentioned
8,\n**About \n \n** Du hast als IT-Ass eine Af...,Not mentioned
9,\n\n\n**Position: Senior Microsoft SQL Server ...,10 years


In [27]:
sample_df[['JD_Text','groq_experience']].to_excel("../sample_outputs/groq_model_results.xlsx", index=False)

#### **3. deepset/roberta-base-squad2**

In [35]:
# trying code from docs

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/roberta-base-squad2"

# a) Get predictions
# (named qa_nlp so it does not overwrite the spaCy `nlp` object loaded earlier)
qa_nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = qa_nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(res)
print(res['answer'])

Device set to use cpu


{'score': 0.21171416342258453, 'start': 59, 'end': 84, 'answer': 'gives freedom to the user'}
gives freedom to the user


In [36]:
# implementation for task

from transformers import pipeline

qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

def extract_experience_roberta(jd_text: str) -> str:
    question = "How many years of experience is required for this job?"
    try:
        result = qa_pipeline(question=question, context=jd_text)
        if result["score"] > 0.3 and result["answer"].strip():
            return result["answer"]
        else:
            return "Not mentioned"
    except Exception as e:
        print(f"Error: {e}")
        return "Error"


Device set to use cpu


In [14]:
sample_df["hf_roberta_experience"] = sample_df["JD_Text"].apply(lambda x: extract_experience_roberta(str(x)))

# View results
sample_df[["JD_Text", "hf_roberta_experience"]]



Unnamed: 0,JD_Text,hf_roberta_experience
0,\n**Overview \n \n** Lazydays RV is looking ...,Not mentioned
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...,Not mentioned
2,\n\n\nðŸ”µ Capitole is still growing and we wa...,Not mentioned
3,"\n\n\nAs a Solutions Engineer, you will work c...",Not mentioned
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...,Not mentioned
5,\n\n\n**Job Role: Oracle PL/SQL Developer**\n\...,10+
6,\n\n\n**Job Title:** Azure Data Engineer\n\n**...,Not mentioned
7,\n\n\nSynergie Italia seleziona per azienda le...,Not mentioned
8,\n**About \n \n** Du hast als IT-Ass eine Af...,Not mentioned
9,\n\n\n**Position: Senior Microsoft SQL Server ...,10+


In [15]:
sample_df[['JD_Text','hf_roberta_experience']].to_excel("../sample_outputs/hf_roberta_results.xlsx", index=False)

#### **4. deepset/tinyroberta-squad2**

In [39]:
# trying code from docs
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "deepset/tinyroberta-squad2"

# a) Get predictions
# (named qa_nlp so it does not overwrite the spaCy `nlp` object loaded earlier)
qa_nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Why is model conversion important?',
    'context': 'The option to convert models between FARM and transformers gives freedom to the user and let people easily switch between frameworks.'
}
res = qa_nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(res)
print(res['answer'])

Device set to use cpu


{'score': 0.26244911551475525, 'start': 59, 'end': 132, 'answer': 'gives freedom to the user and let people easily switch between frameworks'}
gives freedom to the user and let people easily switch between frameworks


In [19]:
sample_df

Unnamed: 0,JD_Text,spacy_experience,groq_experience,hf_roberta_experience
0,\n**Overview \n \n** Lazydays RV is looking ...,Not mentioned,Not mentioned,Not mentioned
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...,Not mentioned,Not mentioned,Not mentioned
2,\n\n\nðŸ”µ Capitole is still growing and we wa...,Not mentioned,Not mentioned,Not mentioned
3,"\n\n\nAs a Solutions Engineer, you will work c...",Not mentioned,Not mentioned,Not mentioned
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...,2 years,5 years,5
5,\n\n\n**Job Role: Oracle PL/SQL Developer**\n\...,Not mentioned,10 years,10+
6,\n\n\n**Job Title:** Azure Data Engineer\n\n**...,up to years,5 years,"Up to 145,000/Year"
7,\n\n\nSynergie Italia seleziona per azienda le...,Not mentioned,Not mentioned,Not mentioned
8,\n**About \n \n** Du hast als IT-Ass eine Af...,Not mentioned,Not mentioned,Not mentioned
9,\n\n\n**Position: Senior Microsoft SQL Server ...,Not mentioned,10 years,10+
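
The three columns above use different formats ("5 years", "5", "10+"), which makes direct comparison awkward. A sketch of one way to put them on a common scale and see where the models agree: `to_years` and `majority_years` are hypothetical helpers, and treating "10+" as 10 (the lower bound) is an assumption.

```python
import re
from collections import Counter


def to_years(value):
    """Map a raw model output to an integer number of years, or None."""
    if not isinstance(value, str):
        return None
    # Accept "5", "5 years", "10+", "10+ years"; reject anything else
    match = re.fullmatch(
        r"\s*(\d{1,2})\s*\+?\s*(?:years?|yrs?)?\s*", value, re.IGNORECASE
    )
    return int(match.group(1)) if match else None


def majority_years(outputs):
    """Return the most common parsed value across models, or None if none parsed."""
    values = [y for y in map(to_years, outputs) if y is not None]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]
```

For row 4 above, `majority_years(["2 years", "5 years", "5"])` gives 5, which flags the spaCy value as the outlier. Outputs such as "up to years" or "Up to 145,000/Year" parse to `None` and drop out of the vote.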


In [20]:
from transformers import pipeline, AutoTokenizer, AutoModelForQuestionAnswering

# Use a smaller, faster model
MODEL_NAME = "deepset/tinyroberta-squad2"

def load_model():
    try:
        tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, local_files_only=True)
        model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME, local_files_only=True)
        qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)
        return qa_pipeline
    except Exception as e:
        print(f"[Tiny RoBERTa Load Error] Make sure the model is cached locally.\n{e}")
        return None

def extract_experience_hf_tinyroberta(jd_text: str, qa_pipeline) -> str:
    if qa_pipeline is None:
        return "Error: Model not loaded"

    question = "How many years of experience is required for this job?"

    try:
        result = qa_pipeline({"context": jd_text, "question": question})
        answer = result["answer"].strip()

        if any(char.isdigit() for char in answer):
            return answer
        else:
            return "Not mentioned"
    except Exception as e:
        print(f"[Tiny RoBERTa QA Error] {e}")
        return "Error"


In [21]:
qa_pipeline = load_model()

sample_df["hf_tinyroberta_experience"] = sample_df["JD_Text"].apply(
    lambda x: extract_experience_hf_tinyroberta(str(x), qa_pipeline)
)


Device set to use cpu


In [22]:
sample_df

Unnamed: 0,JD_Text,spacy_experience,groq_experience,hf_roberta_experience,hf_tinyroberta_experience
0,\n**Overview \n \n** Lazydays RV is looking ...,Not mentioned,Not mentioned,Not mentioned,1976
1,\n**Ãrea De AtuaÃ§Ã£o \n \n** TÃ©cnico em I...,Not mentioned,Not mentioned,Not mentioned,Not mentioned
2,\n\n\nðŸ”µ Capitole is still growing and we wa...,Not mentioned,Not mentioned,Not mentioned,34 days per year
3,"\n\n\nAs a Solutions Engineer, you will work c...",Not mentioned,Not mentioned,Not mentioned,Not mentioned
4,\n\n\n**ABOUT DAYONE**\n\nDayOne is a global l...,2 years,5 years,5,Minimum 5 years
5,\n\n\n**Job Role: Oracle PL/SQL Developer**\n\...,Not mentioned,10 years,10+,10+ years
6,\n\n\n**Job Title:** Azure Data Engineer\n\n**...,up to years,5 years,"Up to 145,000/Year",5+ years
7,\n\n\nSynergie Italia seleziona per azienda le...,Not mentioned,Not mentioned,Not mentioned,12 maggio 2025
8,\n**About \n \n** Du hast als IT-Ass eine Af...,Not mentioned,Not mentioned,Not mentioned,Not mentioned
9,\n\n\n**Position: Senior Microsoft SQL Server ...,Not mentioned,10 years,10+,10+ years
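
The Tiny RoBERTa column shows that the "contains a digit" check lets through answers like "1976", "34 days per year", and "12 maggio 2025". A stricter check could require a number directly followed by a year unit. The sketch below is one possible replacement for that check; `clean_qa_answer` and the `max_years=50` cutoff are hypothetical.

```python
import re

# A one- or two-digit number, optional "+", then "year(s)" or "yr(s)"
_ANSWER_RE = re.compile(r"\b(\d{1,2})\s*\+?\s*(?:years?|yrs?)\b", re.IGNORECASE)


def clean_qa_answer(answer, max_years=50):
    """Return the answer if it names a plausible number of years, else 'Not mentioned'."""
    match = _ANSWER_RE.search(answer or "")
    if match and int(match.group(1)) <= max_years:
        return answer.strip()
    return "Not mentioned"
```

Applied to the column above, this keeps "Minimum 5 years", "10+ years", and "5+ years" and maps the three spurious answers to "Not mentioned". It would also reject bare answers such as "10+", so it is stricter than what the roberta-base column currently returns.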


In [23]:
sample_df[["JD_Text", "hf_tinyroberta_experience"]].to_excel("../sample_outputs/hf_tinyroberta_model_results.xlsx", index=False)

#### **5. distilbert/distilbert-base-cased-distilled-squad**

In [40]:
# trying code from docs
from transformers import pipeline
question_answerer = pipeline("question-answering", model='distilbert-base-cased-distilled-squad')

context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example     of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script.
"""

result = question_answerer(question="What is a good example of a question answering dataset?", context=context)
print(result)
print(result['answer'])

Device set to use cpu


{'score': 0.5152314901351929, 'start': 151, 'end': 164, 'answer': 'SQuAD dataset'}
SQuAD dataset
