# Resume Information Extraction using Hugging Face NER

This project fine-tunes a Hugging Face Token Classification model (BERT) to extract structured job experience information from raw resumes.

The generated output is comparable to the earlier **GPT_Output** column, but it is now produced by a **custom-trained model**.

Pipeline built:

resume → cleaning → GPT annotation → span tagging → BERT fine-tuning → inference → structured JSON generation



In [None]:
!pip install -q \
  "transformers==4.57.3" \
  "datasets" \
  "seqeval" \
  "dateparser" \
  "pyarrow==22.0.0"


In [None]:
!pip install dateparser



In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
import pandas as pd
import regex as re
import dateparser
from pathlib import Path
import json
import math

In [None]:
CSV_PATH = "/content/prepared_ent_9999.csv"

In [None]:
df = pd.read_csv(CSV_PATH,
                 dtype=str,
                 keep_default_na=False,
                 engine="python",
                 on_bad_lines="skip")
print("Columns:", df.columns.tolist())
print("Rows:", len(df))

pd.set_option("display.max_colwidth", 300)
df[["ResumeText"]].head(3)

Columns: ['ResumeId', 'ResumeUrl', 'ParentResourceId', 'ResumeText', 'GPT_Output', 'Education', 'EduEntity', 'CleanedText', 'EntityText', 'EntityList', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14', 'Unnamed: 15', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22', 'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25', 'Unnamed: 26', 'Unnamed: 27', 'Unnamed: 28', 'Unnamed: 29', 'Unnamed: 30', 'Unnamed: 31', 'Unnamed: 32', 'Unnamed: 33', 'Unnamed: 34', 'Unnamed: 35', 'Unnamed: 36', 'Unnamed: 37', 'Unnamed: 38', 'Unnamed: 39', 'Unnamed: 40', 'Unnamed: 41', 'Unnamed: 42', 'Unnamed: 43', 'Unnamed: 44', 'Unnamed: 45', 'Unnamed: 46', 'Unnamed: 47', 'Unnamed: 48', 'Unnamed: 49', 'Unnamed: 50', 'Unnamed: 51', 'Unnamed: 52', 'Unnamed: 53', 'Unnamed: 54', 'Unnamed: 55', 'Unnamed: 56', 'Unnamed: 57', 'Unnamed: 58', 'Unnamed: 59', 'Unnamed: 60', 'Unnamed: 61', 'Unnamed: 62', 'Unnamed: 63', 'Unnamed: 64', 'Unnamed: 65', 'Unnamed

Unnamed: 0,ResumeText
0,"JYOTI SINGH{new_line} QA Engineer{new_line}{new_line} 799 942 - 8937, jyotisingh5396@gmail.com{new_line} https://www.linkedin.com/in/jyoti - singh - 1a3199118{new_line}{new_line}Pune{new_line}{new_line} SUMMARY{new_line} 6+ years of total experience and IT{new_line}QA professional with 4+ years ..."
1,"Damini Meshram{new_line} daminisbhagat@outlook.com, - Linkedin Profile{new_line}{new_line} - 9284608602{new_line}{new_line} PROFILE{new_line}Dynamic IT specialist with 3+ years of proven expertise in Functional, Manual, and Automation Testing, particularly\t \tskilled in Selenium WebDriver wi..."
2,"SKILL: - {new_line}{new_line} AKASH SALUNKE{new_line} Software Test Engineer{new_line}{new_line} Experienced Quality Assurance (Certified Automation Testing and Development ){new_line}{new_line} Contact :{new_line} with, over 3+ years of experience a demonstrated history of working in the{new_li..."


In [None]:
def clean_resume_text(text: str) -> str:
    """Normalize raw resume text: restore newlines, drop bullet glyphs, tidy whitespace."""
    if not isinstance(text, str):
        return ""

    # 1) Replace the {new_line} placeholder with a real newline
    txt = text.replace("{new_line}", "\n")

    # 2) Replace non-breaking spaces and common bullet glyphs with a plain space
    junk_chars = [
        "\u00a0",  # non-breaking space
        "\uf0b7",  # bullet (private-use glyph)
        "\u2022",  # bullet •
    ]
    for jc in junk_chars:
        txt = txt.replace(jc, " ")

    # 3) Collapse runs of spaces/tabs into a single space (newlines are kept)
    txt = re.sub(r"[ \t]+", " ", txt)

    # 4) Strip spaces at the start/end of each line
    lines = [ln.strip() for ln in txt.split("\n")]
    txt = "\n".join(lines)

    # 5) Drop empty lines
    lines = [ln for ln in txt.split("\n") if ln.strip() != ""]
    txt = "\n".join(lines)

    return txt

In [None]:
# cleaning ResumeText -> new column
df["ResumeText_clean"] = df["ResumeText"].apply(clean_resume_text)

# before/after for a couple of rows
for i in range(2):
    print("="*80)
    print("RAW:")
    print(df.loc[i, "ResumeText"][:800])  # first 800 characters..
    print("\nCLEANED:")
    print(df.loc[i, "ResumeText_clean"][:800])
    print()

RAW:
JYOTI SINGH{new_line} QA Engineer{new_line}{new_line} 799 942 - 8937, jyotisingh5396@gmail.com{new_line} https://www.linkedin.com/in/jyoti - singh - 1a3199118{new_line}{new_line}Pune{new_line}{new_line} SUMMARY{new_line} 6+ years of total experience and IT{new_line}QA professional with 4+ years of	 	relevant experience in Functional{new_line}Testing where I owned end - to - end	 	activity starting with writing test	 	execution strategy, test scripts,	 	execution of software programs &	 	test scripts, defect logging,	 	reporting, follow - up, iterative test	 	execution until functional test	 	completion. I have Understanding	 	of languages like C++/Java, SQL,{new_line}HTML, Selenium which have	 	enabled to perform techno	 	functional testing at component	 	level and holistic testing of	 	ap

CLEANED:
JYOTI SINGH
QA Engineer
799 942 - 8937, jyotisingh5396@gmail.com
https://www.linkedin.com/in/jyoti - singh - 1a3199118
Pune
SUMMARY
6+ years of total experience and IT
QA professional 

In [None]:
OUT_PATH = "/content/prepared_ent_9999_clean.csv"
df.to_csv(OUT_PATH, index=False)
print("Saved cleaned CSV to:", OUT_PATH)

Saved cleaned CSV to: /content/prepared_ent_9999_clean.csv


In [None]:
CLEAN_CSV_PATH = "/content/prepared_ent_9999_clean.csv"
df = pd.read_csv(CLEAN_CSV_PATH, dtype=str, keep_default_na=False)

## Parsing GPT Output into Python Dict

The **GPT_Output** column in the cleaned dataset contains JSON-like text generated earlier.

To train and evaluate our NER model, this **string representation** is first converted into a **Python dictionary**.


In [None]:
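import ast

# Helper to turn a GPT_Output string into a Python dict
def parse_gpt_output(text):
    if not isinstance(text, str) or not text.strip():
        return None
    s = text.strip()
    # Convert JSON-style null to Python None so literal_eval accepts it.
    # Note: this is a plain substring replace, so it also touches "null" inside quoted values.
    s = s.replace("null", "None")
    try:
        return ast.literal_eval(s)
    except Exception:
        # Rows that are not valid Python literals are treated as unparsed
        return None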
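
The string replacement above also rewrites `null` when it occurs inside a quoted value. A minimal sketch of a stricter alternative, assuming the GPT output is otherwise a valid Python-style literal: parse the text into an AST, swap only bare `null`/`true`/`false` names for Python constants, then evaluate the result. The names `parse_gpt_output_strict`, `_JsonNames` and `_JSON_CONSTANTS` are hypothetical and not part of the notebook.

```python
import ast

# JSON-style bare names mapped to their Python equivalents (illustrative helper)
_JSON_CONSTANTS = {"null": None, "true": True, "false": False}


class _JsonNames(ast.NodeTransformer):
    """Replace bare null/true/false identifiers with Python constants."""

    def visit_Name(self, node):
        if node.id in _JSON_CONSTANTS:
            constant = ast.Constant(value=_JSON_CONSTANTS[node.id])
            return ast.copy_location(constant, node)
        return node


def parse_gpt_output_strict(text):
    """Return the parsed literal, or None if the text cannot be parsed."""
    if not isinstance(text, str) or not text.strip():
        return None
    try:
        tree = ast.parse(text.strip(), mode="eval")
        # literal_eval accepts an ast.Expression node as well as a string
        return ast.literal_eval(_JsonNames().visit(tree))
    except (ValueError, SyntaxError):
        return None


# Made-up record: the quoted word "nullable" must survive untouched
sample = "{'Companies': [{'Role': 'nullable types lead', 'End Date': null}]}"
parsed = parse_gpt_output_strict(sample)
```

Because only identifier nodes are rewritten, string values are left exactly as GPT produced them.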

In [None]:
df["GPT_parsed"] = df["GPT_Output"].apply(parse_gpt_output)

### Verifying GPT-Parsed Data Structure

Before using GPT-parsed entities for training, we validate that:

- Each row was successfully parsed into a Python `dict`
- The dictionary contains a `"Companies"` key
- `"Companies"` is a list of extracted job-experience objects

This check ensures that only correctly structured rows are passed into the next processing step.

In [None]:
has_companies = df["GPT_parsed"].apply(
    lambda x: isinstance(x, dict) and "Companies" in x and isinstance(x["Companies"], list)
)
print("Rows with valid Companies:", has_companies.sum())

print("Type of first parsed row:", type(df["GPT_parsed"].iloc[0]))
print("First parsed keys:", df["GPT_parsed"].iloc[0].keys())

Rows with valid Companies: 4955
Type of first parsed row: <class 'dict'>
First parsed keys: dict_keys(['Companies', 'Education'])


### Building Training Samples (Text + Entity Spans)

This step converts the GPT-extracted JSON data into the format required for training a HuggingFace NER model.

For each resume:
- Read the cleaned resume text
- Iterate through the list of extracted job experiences (`Companies`)
- For each experience, locate the character spans of:
  - **Company Name**
  - **Role**
  - **Start Date**
  - **End Date**

Span detection logic:
- A helper function `find_span_window()` performs a **case-insensitive search** and returns the first match
- If a company span is found, Role/Start Date/End Date are first searched within 350 characters of its midpoint, with a fallback to a search over the whole text
- Each detected entity is stored as:  
  `[start_index, end_index, ENTITY_LABEL]`

Finally, each resume contributes an entry in `training_data` of the form:
```json
{"text": <resume_text>, "entities": [[start, end, label], ...]}
```

In [None]:
def find_span_window(text, value, center=None, window=300):
    """Case-insensitive search; if center given, search only in [center-window, center+window]."""
    if not isinstance(value, str) or not value.strip():
        return None
    v = value.strip()
    txt = text
    txt_lower = txt.lower()
    v_lower = v.lower()

    if center is not None:
        start_win = max(0, center - window)
        end_win = min(len(txt_lower), center + window)
        segment = txt_lower[start_win:end_win]
        idx = segment.find(v_lower)
        if idx == -1:
            return None
        start = start_win + idx
    else:
        idx = txt_lower.find(v_lower)
        if idx == -1:
            return None
        start = idx

    end = start + len(v)
    return [start, end]


training_data = []

for idx, row in df.iterrows():
    parsed = row["GPT_parsed"]
    text = row["ResumeText_clean"]
    if not isinstance(parsed, dict) or "Companies" not in parsed:
        continue

    entities = []
    for job in parsed["Companies"]:
        comp = job.get("Company Name") or job.get("Company") or job.get("Name")
        role = job.get("Role")
        s_date = job.get("Start Date")
        e_date = job.get("End Date")

        # 1) company first
        comp_span = find_span_window(text, comp)
        center = None
        if comp_span:
            entities.append(comp_span + ["COMPANY"])
            center = (comp_span[0] + comp_span[1]) // 2

        # 2) role, start, end searched near company (if we have center)
        if role:
            span = find_span_window(text, role, center=center, window=350) or \
                   find_span_window(text, role)  # fallback global
            if span:
                entities.append(span + ["ROLE"])

        if s_date:
            span = find_span_window(text, s_date, center=center, window=350) or \
                   find_span_window(text, s_date)
            if span:
                entities.append(span + ["START_DATE"])

        if e_date:
            span = find_span_window(text, e_date, center=center, window=350) or \
                   find_span_window(text, e_date)
            if span:
                entities.append(span + ["END_DATE"])

    if entities:
        training_data.append({"text": text, "entities": entities})
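
Because each value is matched independently with a first-occurrence search, two entities can land on overlapping character ranges (for instance a start date that is a substring of another matched value). The sketch below is one way to audit that before training; `find_overlaps` is a hypothetical helper, and the sample spans are made up for illustration.

```python
def find_overlaps(entities):
    """Return pairs of [start, end, label] entries whose character ranges overlap."""
    ordered = sorted(entities, key=lambda e: (e[0], e[1]))
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            # Sorted by start: once b starts at or after a's end, no later span overlaps a
            if b[0] >= a[1]:
                break
            overlaps.append((a, b))
    return overlaps


# Made-up spans: the second one starts inside the first
sample_entities = [[0, 5, "ROLE"], [3, 8, "COMPANY"], [10, 12, "START_DATE"]]
conflicts = find_overlaps(sample_entities)
```

Running this over every item in `training_data` would give a rough count of conflicting annotations, which matters because a token can carry only one BIO label.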

In [None]:
print("Total training samples:", len(training_data))
training_data[0]  # preview

Total training samples: 4577


{'text': 'JYOTI SINGH\nQA Engineer\n799 942 - 8937, jyotisingh5396@gmail.com\nhttps://www.linkedin.com/in/jyoti - singh - 1a3199118\nPune\nSUMMARY\n6+ years of total experience and IT\nQA professional with 4+ years of relevant experience in Functional\nTesting where I owned end - to - end activity starting with writing test execution strategy, test scripts, execution of software programs & test scripts, defect logging, reporting, follow - up, iterative test execution until functional test completion. I have Understanding of languages like C++/Java, SQL,\nHTML, Selenium which have enabled to perform techno functional testing at component level and holistic testing of applications. It also enabled me to explain/discuss issues with the team to remediate within the time frame.\nEXPERIENCE\nQA Engineer, 12/2021 - Present\nVenturit, Pune\nResponsibilities\nCollaborated with cross - functional teams, including developers, product managers, and designers, to define test strategies and establis

### Converting Entity Tuples to Dictionary Format
The previous step stored entity annotations as lists of the form:

[start_index, end_index, label]

The later stages expect each entity as a dictionary with named fields:

{"start": ..., "end": ..., "label": ...}

This prepares the final dataset for tokenization and NER model training.

In [None]:
def fix_entities(item):
    fixed = []
    for ent in item["entities"]:
        # ent = [start, end, label]
        s, e, lab = ent
        fixed.append({"start": s, "end": e, "label": lab})
    item["entities"] = fixed
    return item

training_data = [fix_entities(x) for x in training_data]

### 🔀 Splitting the Dataset into Train / Validation / Test Sets

After preparing the full `training_data` list, the samples are shuffled and split into 80% train, 10% validation (used to tune hyperparameters and track overfitting), and 10% test (unseen data, so that the evaluation metrics reflect generalization rather than memorization).

In [None]:
import random
# We already have: training_data = [ {"text": ..., "entities": [...]} , ...]

# 1) shuffle in-place so splits are random
random.shuffle(training_data)

n = len(training_data)
n_test = int(0.10 * n)   # 10% test
n_val  = int(0.10 * n)   # 10% val
n_train = n - n_test - n_val

test_data  = training_data[:n_test]
val_data   = training_data[n_test:n_test + n_val]
train_data = training_data[n_test + n_val:]
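
The shuffle above is unseeded, so every run produces a different split. A minimal sketch of a reproducible variant, assuming a fixed seed is acceptable; `split_dataset` and the seed value `42` are illustrative choices, not part of the notebook.

```python
import random


def split_dataset(items, val_frac=0.10, test_frac=0.10, seed=42):
    """Shuffle a copy of `items` with a fixed seed and return (train, val, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state untouched
    n_test = int(test_frac * len(shuffled))
    n_val = int(val_frac * len(shuffled))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Stand-in data with the same size as the notebook's training_data (4577 samples)
train_part, val_part, test_part = split_dataset(range(4577))
```

With 4577 samples this yields the same 3663 / 457 / 457 sizes as the cell above, and the same seed always gives the same membership.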

In [None]:
print("Total:", n)
print("Train:", len(train_data))
print("Val:", len(val_data))
print("Test:", len(test_data))

Total: 4577
Train: 3663
Val: 457
Test: 457


### Defining NER Labels & Initializing the Tokenizer

To prepare the model for fine-tuning on resume data, we define the labels with the **BIO tagging scheme**:
- **B-TAG** marks the beginning of an entity
- **I-TAG** marks continuation of the same entity
- **O** represents tokens that are not part of any entity

| Entity | Meaning |
|--------|----------|
| COMPANY | Company / Employer Name |
| ROLE | Job Title / Position |
| START_DATE | Beginning of employment |
| END_DATE | End of employment |
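
As a small worked example of the scheme, the sketch below tags whitespace-separated words from character spans. It is a simplification: the notebook tags BERT word pieces, not whole words, and `bio_tags` is a hypothetical helper.

```python
import re


def bio_tags(text, spans):
    """Tag each whitespace-separated word with B-/I-/O from (start, end, label) spans."""
    tagged = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A word belongs to a span if their character ranges overlap
            if match.start() < end and match.end() > start:
                prefix = "B-" if match.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((match.group(), tag))
    return tagged


example = bio_tags(
    "QA Engineer at Venturit",
    [(0, 11, "ROLE"), (15, 23, "COMPANY")],
)
# example == [("QA", "B-ROLE"), ("Engineer", "I-ROLE"), ("at", "O"), ("Venturit", "B-COMPANY")]
```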


Finally, the **pretrained BERT tokenizer (`bert-base-cased`)** converts raw resume text into the token IDs and attention masks that the model uses during training.

In [None]:
from transformers import AutoTokenizer

label_list = [
    "O",
    "B-COMPANY", "I-COMPANY",
    "B-ROLE",    "I-ROLE",
    "B-START_DATE", "I-START_DATE",
    "B-END_DATE",   "I-END_DATE",
]
label2id = {label: i for i, label in enumerate(label_list)} # maps label text → numeric index
id2label = {i: label for label, i in label2id.items()} # reverse lookup (necessary for prediction output)

MODEL_NAME = "bert-base-cased" # "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)


### Data Collator (Padding to equal lengths)

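During training, batches of resume texts contain sequences of **different lengths**.  
The data collator therefore **pads every sequence in a batch to the same length** (the longest one) and pads the labels with `-100` so that padded positions are ignored by the loss.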

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(
    tokenizer=tokenizer,
    label_pad_token_id=-100,   # matches how we masked special tokens
    padding="longest",         # pads to longest sequence in batch
)
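
To make the effect of the collator concrete, here is a simplified pure-Python illustration of padding a batch to its longest sequence. It is not the library's implementation; `pad_batch` is a hypothetical helper, the token ids are made up, and `pad_token_id=0` is assumed to match BERT's `[PAD]` id.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Right-pad input_ids, attention_mask and labels to the longest example."""
    longest = max(len(ex["input_ids"]) for ex in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        gap = longest - len(ex["input_ids"])
        padded["input_ids"].append(ex["input_ids"] + [pad_token_id] * gap)
        padded["attention_mask"].append(ex["attention_mask"] + [0] * gap)
        # -100 marks positions the loss function should skip
        padded["labels"].append(ex["labels"] + [label_pad_id] * gap)
    return padded


# Two made-up examples of different lengths
toy_batch = [
    {"input_ids": [101, 7, 8, 102], "attention_mask": [1, 1, 1, 1], "labels": [-100, 3, 4, -100]},
    {"input_ids": [101, 9, 102], "attention_mask": [1, 1, 1], "labels": [-100, 0, -100]},
]
padded_batch = pad_batch(toy_batch)
```

The shorter example gains one pad token, a `0` in its attention mask, and a `-100` label, so both rows end up with length 4.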

### Encoding Training Samples for Token Classification (BIO Tagging)

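To train the model, each example is tokenized and every token receives a BIO label id; special tokens get `-100` so that the loss ignores them. Texts are truncated to 512 tokens, so entities located beyond that point are not seen during training.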

In [None]:
def encode_example(example):
    """
    example: {"text": str, "entities": [[start, end, label], ...] or list of dicts}
    returns: {"input_ids": [...], "attention_mask": [...], "labels": [...]}
    """

    text = example["text"]
    ents = example.get("entities", [])

    # 1) Tokenizing the text with offset mappings
    enc = tokenizer(
        text,
        return_offsets_mapping=True,
        truncation=True,
        max_length=512,
    )

    offsets = enc["offset_mapping"]  # returns the start/end character indices for each token, allowing us to align entity spans with tokens.

    # 2) Matching entity spans to tokens (BIO Tag for each Token)
    spans = []
    if isinstance(ents, list):
        for ent in ents:
            # dict like {"start": ..., "end": ..., "label": ...}
            if isinstance(ent, dict):
                if not {"start", "end", "label"} <= set(ent.keys()):
                    continue
                s = ent["start"]
                e = ent["end"]
                lab = ent["label"]

            # list/tuple like [start, end, label]
            elif isinstance(ent, (list, tuple)) and len(ent) == 3:
                s, e, lab = ent

            else:
                # any other weird format
                continue

            # casting to int; if fails ("start"/"end") -> skip
            try:
                s = int(s)
                e = int(e)
            except (TypeError, ValueError):
                continue

            spans.append({"start": s, "end": e, "label": lab})

    # 3) BIO tags as strings
    labels = ["O"] * len(offsets)

    for i, (tok_start, tok_end) in enumerate(offsets):
        # special tokens like [CLS], [SEP] -> will become -100 later
        if tok_start == tok_end:
            continue

        for span in spans:
            if tok_start < span["end"] and tok_end > span["start"]:
                prefix = "B" if tok_start == span["start"] else "I"
                tag = f"{prefix}-{span['label']}"
                if labels[i] == "O":
                    labels[i] = tag
                break

    # 4) converting label strings to ids, -100 for special tokens
    label_ids = []
    for (tok_start, tok_end), lab in zip(offsets, labels):
        if tok_start == tok_end:
            label_ids.append(-100)
        else:
            label_ids.append(label2id[lab])

    # 5) dropping offsets to return flat lists
    enc.pop("offset_mapping")

    return {
        "input_ids": enc["input_ids"],
        "attention_mask": enc["attention_mask"],
        "labels": label_ids,
    }
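
Since `truncation=True` with `max_length=512` discards everything after the first 512 tokens, long resumes lose their later job entries. One common workaround is to split the token sequence into overlapping windows (Hugging Face tokenizers expose this through `return_overflowing_tokens=True` and `stride`). The sketch below only computes the window boundaries in plain Python; `window_bounds` and its default values are assumptions for illustration.

```python
def window_bounds(n_tokens, max_len=510, stride=128):
    """Return (start, end) token index pairs covering n_tokens with `stride` overlap."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    if n_tokens <= 0:
        return []
    bounds = []
    start = 0
    while True:
        end = min(start + max_len, n_tokens)
        bounds.append((start, end))
        if end >= n_tokens:
            break
        # Step back by `stride` tokens so entities cut at a boundary appear whole in the next window
        start = end - stride
    return bounds


# A 1000-token resume, leaving room for [CLS] and [SEP] in each 512-token input
windows = window_bounds(1000, max_len=510, stride=128)
# windows == [(0, 510), (382, 892), (764, 1000)]
```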

### Python Lists into HuggingFace `Dataset` Objects and Applying Encoding

In [None]:
from datasets import Dataset

train_ds = Dataset.from_list(train_data)
val_ds   = Dataset.from_list(val_data)
test_ds  = Dataset.from_list(test_data)

# After converting them, we apply encode_example() to every sample to prepare features in the format expected by model

encoded_train = train_ds.map(encode_example, batched=False)
encoded_val   = val_ds.map(encode_example, batched=False)
encoded_test  = test_ds.map(encode_example, batched=False)

print(encoded_train[0].keys())
print(len(encoded_train[0]["input_ids"]), len(encoded_train[0]["labels"]))

dict_keys(['text', 'entities', 'input_ids', 'attention_mask', 'labels'])
512 512


In [None]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

### --- DONE WITH THE PREPROCESSING ---

#### Token-Classification Model for NER Fine-Tuning


In [None]:
num_labels = len(label_list)

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=num_labels, # Total number of BIO classes used for NER
    id2label=id2label, #  Maps between class-IDs and class-names (required for readable outputs)
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### --- ready for training via the HuggingFace Trainer API ---

### Metric Computation for NER (Precision / Recall / F1 using `seqeval`)


In [None]:
def align_predictions(predictions, label_ids):
    preds = np.argmax(predictions, axis=2)

    batch_size, seq_len = preds.shape
    pred_tags, true_tags = [], []

    for i in range(batch_size):
        p_tags, t_tags = [], []
        for j in range(seq_len):
            if label_ids[i, j] == -100:  # skip special and padding tokens (label = -100)
                continue
            p_tags.append(id2label[preds[i, j]])  # Convert label IDs to tag strings using id2label
            t_tags.append(id2label[label_ids[i, j]])
        pred_tags.append(p_tags)
        true_tags.append(t_tags)
    return pred_tags, true_tags

def compute_metrics(p):
    preds, labels = p  # raw logits and gold label ids for the evaluation set
    pred_tags, true_tags = align_predictions(preds, labels)  # BIO tag strings per example, -100 positions dropped
    return {
        "precision": precision_score(true_tags, pred_tags),
        "recall":    recall_score(true_tags, pred_tags),
        "f1":        f1_score(true_tags, pred_tags),
    }
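
`seqeval` scores whole entities, not individual tokens: a prediction counts only if both its boundaries and its label match. The sketch below is a simplified pure-Python illustration of that idea, not `seqeval`'s implementation; `bio_to_spans` and `entity_prf` are hypothetical helpers.

```python
def bio_to_spans(tags):
    """Collect (start, end, label) entity spans from a list of BIO tags."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = (None, None) if tag == "O" else (i, tag[2:])
    return spans


def entity_prf(true_tags, pred_tags):
    """Exact-match precision, recall and F1 over entity spans of one sequence."""
    gold, pred = bio_to_spans(true_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The predicted ROLE misses its first token, so it does not count as correct
gold_tags = ["B-ROLE", "I-ROLE", "O", "B-COMPANY"]
pred_tags = ["O", "B-ROLE", "O", "B-COMPANY"]
scores = entity_prf(gold_tags, pred_tags)  # (0.5, 0.5, 0.5)
```

A role predicted with one token missing therefore costs both a false positive and a false negative, which is part of why entity-level F1 is stricter than token accuracy.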

### Configuring Training Hyperparameters

In [None]:
batch_size = 8

training_args = TrainingArguments(
    output_dir="/content/resume_ner_model",
    eval_strategy="epoch",
    save_strategy="epoch",
    report_to=[],
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01, # Regularization -> prevent overfitting
    logging_steps=100,
    load_best_model_at_end=True, # load best checkpoint based on f1 score
    metric_for_best_model="f1"
)

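### Loading `bert-base-cased` for token classification


In [None]:
# Re-initializing the model for token classification with our custom BIO label set,
# so it can predict COMPANY, ROLE, START_DATE and END_DATE (work experience).
# This repeats the earlier load, so training starts from a freshly initialized classification head.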

model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)



### Training the Resume NER Model

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,     # controls learning rate, batch size, epochs, checkpoint frequency
    train_dataset=encoded_train,  # uses encoded training samples to optimize model weights
    eval_dataset=encoded_val, # monitors performance after each epoch
    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
    data_collator=data_collator,  # dynamically pads sequences per batch
    compute_metrics=compute_metrics,  # calculates Precision, Recall, and F1
)


### TRAINING

In [None]:
train_result = trainer.train()
trainer.save_model("/content/resume_ner_model")  # saves model + config

tokenizer.save_pretrained("/content/resume_ner_model")

train_result.metrics

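| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
|-------|---------------|-----------------|-----------|--------|----|
| 1 | 0.0816 | 0.077594 | 0.455638 | 0.490547 | 0.472448 |
| 2 | 0.0626 | 0.067061 | 0.487419 | 0.597512 | 0.53688 |
| 3 | 0.0514 | 0.067937 | 0.493512 | 0.624378 | 0.551285 |
| 4 | 0.0409 | 0.072928 | 0.503573 | 0.59602 | 0.54591 |
| 5 | 0.034 | 0.07209 | 0.500406 | 0.612935 | 0.550984 |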


{'train_runtime': 1950.3376,
 'train_samples_per_second': 9.391,
 'train_steps_per_second': 1.174,
 'total_flos': 4785952967009280.0,
 'train_loss': 0.0650511606291392,
 'epoch': 5.0}

In [None]:
trainer.save_model("/content/resume_ner_model")
tokenizer.save_pretrained("/content/resume_ner_model")

print("Model and tokenizer saved.")

Model and tokenizer saved.


### Evaluating Model Performance (Validation & Test Sets)

In [None]:
# Validation metrics
val_metrics = trainer.evaluate(encoded_val)
print("Validation metrics:", val_metrics)

# Test metrics
test_metrics = trainer.evaluate(encoded_test)
print("Test metrics:", test_metrics)

Validation metrics: {'eval_loss': 0.06793682277202606, 'eval_precision': 0.4935116004718836, 'eval_recall': 0.6243781094527363, 'eval_f1': 0.5512848671205798, 'eval_runtime': 14.5997, 'eval_samples_per_second': 31.302, 'eval_steps_per_second': 3.973, 'epoch': 5.0}
Test metrics: {'eval_loss': 0.06775468587875366, 'eval_precision': 0.4640159045725646, 'eval_recall': 0.6087636932707355, 'eval_f1': 0.526624548736462, 'eval_runtime': 14.4019, 'eval_samples_per_second': 31.732, 'eval_steps_per_second': 4.027, 'epoch': 5.0}


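##### -> Validation and test F1 are close (0.55 vs 0.53) → the checkpoint performs similarly on both held-out sets
##### -> Validation loss is lowest at epoch 2 and rises afterwards while training loss keeps falling → mild overfitting in later epochs; `load_best_model_at_end` keeps the best-F1 checkpoint (epoch 3)
##### -> Test F1 ≈ 0.53 → the model recovers a majority of the labelled entities, but:
#####       it misses about 39% of them (recall ≈ 0.61)
#####       and more than half of the extracted entities do not exactly match a labelled span (precision ≈ 0.46)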
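
The per-epoch log can be checked directly. The short sketch below copies the validation loss and F1 values from the training table and finds the epoch with the lowest validation loss and the epoch with the highest F1; the variable names are illustrative.

```python
# (epoch, training_loss, validation_loss, f1) copied from the training log
training_log = [
    (1, 0.0816, 0.077594, 0.472448),
    (2, 0.0626, 0.067061, 0.536880),
    (3, 0.0514, 0.067937, 0.551285),
    (4, 0.0409, 0.072928, 0.545910),
    (5, 0.0340, 0.072090, 0.550984),
]

lowest_val_loss_epoch = min(training_log, key=lambda row: row[2])[0]
best_f1_epoch = max(training_log, key=lambda row: row[3])[0]

print("Lowest validation loss at epoch:", lowest_val_loss_epoch)  # 2
print("Best validation F1 at epoch:", best_f1_epoch)              # 3
```

The best-F1 epoch matches the validation metrics printed by `trainer.evaluate`, whose loss and F1 equal the epoch 3 row.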

#### Testing on a sample resume sentence

In [None]:
from transformers import pipeline

ner_pipe = pipeline(
    "token-classification",
    model="/content/resume_ner_model",
    tokenizer="/content/resume_ner_model",
    aggregation_strategy="simple",  # groups B-/I- tags
)

text = "Senior Software Engineer at Google from Jan 2020 to May 2024."
preds = ner_pipe(text)
preds

Device set to use cuda:0


[{'entity_group': 'ROLE',
  'score': np.float32(0.6840112),
  'word': 'Software Engineer',
  'start': 7,
  'end': 24}]

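##### The `ner_preds_to_json()` function defined in the next cell:

##### -> Sorts predicted entities by their character position in the text

##### -> Applies a confidence threshold (`CONF_THRESHOLD = 0.40`) to ignore weak predictions

##### -> Groups related entities under the same job entry

##### -> Starts a new job block whenever a new COMPANY tag is detected, so a role or date that appears before its company in the text ends up in a separate entry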

In [None]:
import json

CONF_THRESHOLD = 0.40

def ner_preds_to_json(preds):
    preds = sorted(preds, key=lambda x: x["start"])  # sort by position

    jobs = []
    current = {"Company Name": None, "Role": None, "Start Date": None, "End Date": None}

    def start_new_job():
        nonlocal current
        if any(current.values()):   # add previous job only if something was filled
            jobs.append(current.copy())
        current = {"Company Name": None, "Role": None, "Start Date": None, "End Date": None}

    for ent in preds:
        if ent["score"] < CONF_THRESHOLD:
            continue

        label = ent["entity_group"]
        text  = ent["word"].strip()

        # COMPANY always indicates a new job
        if label == "COMPANY":
            start_new_job()
            current["Company Name"] = text

        elif label == "ROLE":
            current["Role"] = text if current["Role"] is None else current["Role"] + " " + text

        elif label == "START_DATE":
            current["Start Date"] = text if current["Start Date"] is None else current["Start Date"] + " " + text

        elif label == "END_DATE":
            current["End Date"] = text if current["End Date"] is None else current["End Date"] + " " + text

    start_new_job()  # add last collected job

    return {"Companies": jobs}
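
Resumes in this dataset often list the role and dates before the company name, in which case the COMPANY-triggered rule splits one job into two entries. A hedged sketch of an alternative grouping rule: start a new job only when a field that is already filled shows up again. `group_entities` is a hypothetical variant, and the sample predictions (offsets and scores) are made up.

```python
FIELD_FOR_LABEL = {
    "COMPANY": "Company Name",
    "ROLE": "Role",
    "START_DATE": "Start Date",
    "END_DATE": "End Date",
}


def empty_job():
    return {"Company Name": None, "Role": None, "Start Date": None, "End Date": None}


def group_entities(preds, threshold=0.40):
    """Group entities in reading order; a repeated field opens a new job entry."""
    jobs, current = [], empty_job()
    for ent in sorted(preds, key=lambda x: x["start"]):
        field = FIELD_FOR_LABEL.get(ent["entity_group"])
        if field is None or ent["score"] < threshold:
            continue
        if current[field] is not None:  # field already filled: next job begins
            jobs.append(current)
            current = empty_job()
        current[field] = ent["word"].strip()
    if any(current.values()):
        jobs.append(current)
    return {"Companies": jobs}


# Made-up predictions in "role, dates, then company" order
sample_preds = [
    {"entity_group": "ROLE", "score": 0.9, "word": "QA Engineer", "start": 0, "end": 11},
    {"entity_group": "START_DATE", "score": 0.9, "word": "12 / 2021", "start": 13, "end": 20},
    {"entity_group": "END_DATE", "score": 0.9, "word": "Present", "start": 23, "end": 30},
    {"entity_group": "COMPANY", "score": 0.9, "word": "Venturit", "start": 31, "end": 39},
    {"entity_group": "ROLE", "score": 0.9, "word": "Test Engineer", "start": 60, "end": 73},
    {"entity_group": "COMPANY", "score": 0.2, "word": "Pune", "start": 80, "end": 84},
]
grouped = group_entities(sample_preds)
```

The trade-off is that a single entity predicted in two fragments would open a spurious job here, whereas the original function concatenates the fragments.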

In [None]:
text = "Senior Software Engineer at Google from Jan 2020 to May 2024."
preds = ner_pipe(text)
print(preds)

print(json.dumps(ner_preds_to_json(preds), indent=2))

[{'entity_group': 'ROLE', 'score': np.float32(0.6840112), 'word': 'Software Engineer', 'start': 7, 'end': 24}]
{
  "Companies": [
    {
      "Company Name": null,
      "Role": "Software Engineer",
      "Start Date": null,
      "End Date": null
    }
  ]
}


#### Running the NER Model on Every Resume & Saving the Final CSV



In [None]:
def extract_json_for_row(text):
    preds = ner_pipe(text)
    return ner_preds_to_json(preds)

df["Model_Output"] = df["ResumeText_clean"].apply(
    lambda t: json.dumps(extract_json_for_row(t), ensure_ascii=False)
)

# Saving final CSV
df.to_csv("resumes_with_model_output.csv", index=False)
print("Saved to resumes_with_model_output.csv")

Saved to resumes_with_model_output.csv


In [None]:
dfff = pd.read_csv('/content/resumes_with_model_output.csv')

In [None]:
dfff.head(2)

Unnamed: 0,ResumeId,ResumeUrl,ParentResourceId,ResumeText,GPT_Output,Education,EduEntity,CleanedText,EntityText,EntityList,...,Unnamed: 87,Unnamed: 88,Unnamed: 89,Unnamed: 90,Unnamed: 91,Unnamed: 92,Unnamed: 93,ResumeText_clean,GPT_parsed,Model_Output
0,5F92A9F0-D752-4C75-BB82-5DCD6869E574,https://hiringsolutions.blob.core.windows.net/resumebank/00a27d1b-7bc8-4631-9b1a-755ca40e8afa/1717499296633Naukri_JyotiSingh[4y_0m].pdf,00A27D1B-7BC8-4631-9B1A-755CA40E8AFA,"JYOTI SINGH{new_line} QA Engineer{new_line}{new_line} 799 942 - 8937, jyotisingh5396@gmail.com{new_line} https://www.linkedin.com/in/jyoti - singh - 1a3199118{new_line}{new_line}Pune{new_line}{new_line} SUMMARY{new_line} 6+ years of total experience and IT{new_line}QA professional with 4+ years ...","{'Companies': [{'Company Name': 'Venturit', 'Role': 'QA Engineer', 'Internship_Flag': 0, 'Start Date': '12/2021', 'End Date': 'Present', 'Current_Flag': 1}, {'Company Name': 'Globalstep', 'Role': 'Test Engineer', 'Internship_Flag': 0, 'Start Date': '11/2020', 'End Date': '12/2021', 'Current_Flag...","[{'College Name': 'Rungta Engineering College', 'Degree': 'Bachelor of Engineering in Computer Science', 'Specialization': None, 'Start Date': '2013', 'End Date': '2017', 'Education Type': 'graduate'}]","[['Bachelor of Engineering in Computer Science', 'G_Degree']]",jyoti singh qa engineer 799 942 8937 jyotisingh5396 gmail com https www linkedin com in jyoti singh 1a3199118 pune summary 6 years of total experience and it qa professional with 4 years of relevant experience in functional testing where i owned end to end activity starting with writing test exe...,[],[],...,,,,,,,,"JYOTI SINGH\nQA Engineer\n799 942 - 8937, jyotisingh5396@gmail.com\nhttps://www.linkedin.com/in/jyoti - singh - 1a3199118\nPune\nSUMMARY\n6+ years of total experience and IT\nQA professional with 4+ years of relevant experience in Functional\nTesting where I owned end - to - end activity startin...","{'Companies': [{'Company Name': 'Venturit', 'Role': 'QA Engineer', 'Internship_Flag': 0, 'Start Date': '12/2021', 'End Date': 'Present', 'Current_Flag': 1}, {'Company Name': 'Globalstep', 'Role': 'Test Engineer', 'Internship_Flag': 0, 'Start Date': '11/2020', 'End Date': '12/2021', 
'Current_Flag...","{""Companies"": [{""Company Name"": null, ""Role"": ""QA Engineer"", ""Start Date"": ""12 / 2021"", ""End Date"": ""Present""}, {""Company Name"": ""Venturit"", ""Role"": null, ""Start Date"": null, ""End Date"": null}]}"
1,AA3B4B6A-A237-4408-8B07-45C43E65B1EC,https://hiringsolutions.blob.core.windows.net/resumebank/00a27d1b-7bc8-4631-9b1a-755ca40e8afa/1717665135973Damini_CV.pdf,00A27D1B-7BC8-4631-9B1A-755CA40E8AFA,"Damini Meshram{new_line} daminisbhagat@outlook.com, - Linkedin Profile{new_line}{new_line} - 9284608602{new_line}{new_line} PROFILE{new_line}Dynamic IT specialist with 3+ years of proven expertise in Functional, Manual, and Automation Testing, particularly\t \tskilled in Selenium WebDriver wi...","{'Companies': [{'Company Name': 'Tectigon IT Solution Pvt. Ltd.', 'Role': 'Software Test Engineer', 'Internship_Flag': 0, 'Start Date': '2021', 'End Date': 'Present', 'Current_Flag': 1}], 'Education': [{'College Name': 'RTMN University', 'Degree': 'Bachelor of Engineering', 'Specialization': 'Co...","[{'College Name': 'RTMN University', 'Degree': 'Bachelor of Engineering', 'Specialization': 'Computer Science and Engineering', 'Start Date': '2021', 'End Date': 'null', 'Education Type': 'graduate'}, {'College Name': 'MSBTE', 'Degree': 'Diploma', 'Specialization': 'Polytechnic in Computer Scien...","[['Bachelor of Engineering', 'G_Degree'], ['Computer Science and Engineering', 'SPL_Degree'], ['Diploma', 'D_Degree'], ['Polytechnic in Computer Science', 'SPL_Degree']]",damini meshram daminisbhagat outlook com linkedin profile 9284608602 profile dynamic it specialist with 3 years of proven expertise in functional manual and automation testing particularly skilled in selenium webdriver with java proficient in testing concepts and experienced in using jira for ef...,"[['bachelor of engineering', 'G_Degree'], ['computer science and engineering', 'SPL_Degree'], ['diploma', 'D_Degree'], ['polytechnic in computer science', 'SPL_Degree']]","[[505, 528, 'G_Degree'], [564, 596, 'SPL_Degree'], [602, 609, 'D_Degree'], [639, 670, 'SPL_Degree']]",...,,,,,,,,"Damini Meshram\ndaminisbhagat@outlook.com, - Linkedin Profile\n- 9284608602\nPROFILE\nDynamic IT specialist with 3+ years of proven 
expertise in Functional, Manual, and Automation Testing, particularly skilled in Selenium WebDriver with Java. Proficient in Testing concepts and experienced in usi...","{'Companies': [{'Company Name': 'Tectigon IT Solution Pvt. Ltd.', 'Role': 'Software Test Engineer', 'Internship_Flag': 0, 'Start Date': '2021', 'End Date': 'Present', 'Current_Flag': 1}], 'Education': [{'College Name': 'RTMN University', 'Degree': 'Bachelor of Engineering', 'Specialization': 'Co...","{""Companies"": [{""Company Name"": null, ""Role"": ""Software Test Engineer"", ""Start Date"": null, ""End Date"": null}, {""Company Name"": ""Tectigon IT Solution Pvt. Ltd."", ""Role"": null, ""Start Date"": null, ""End Date"": ""Present""}]}"


##### Keeping only three columns and dropping the others

In [None]:
df_final = dfff[["ResumeText", "GPT_Output", "Model_Output"]]
df_final.to_csv("/content/resumes_with_model_output.csv", index=False)

In [None]:
df_final.head(2)

Unnamed: 0,ResumeText,GPT_Output,Model_Output
0,"JYOTI SINGH{new_line} QA Engineer{new_line}{new_line} 799 942 - 8937, jyotisingh5396@gmail.com{new_line} https://www.linkedin.com/in/jyoti - singh - 1a3199118{new_line}{new_line}Pune{new_line}{new_line} SUMMARY{new_line} 6+ years of total experience and IT{new_line}QA professional with 4+ years ...","{'Companies': [{'Company Name': 'Venturit', 'Role': 'QA Engineer', 'Internship_Flag': 0, 'Start Date': '12/2021', 'End Date': 'Present', 'Current_Flag': 1}, {'Company Name': 'Globalstep', 'Role': 'Test Engineer', 'Internship_Flag': 0, 'Start Date': '11/2020', 'End Date': '12/2021', 'Current_Flag...","{""Companies"": [{""Company Name"": null, ""Role"": ""QA Engineer"", ""Start Date"": ""12 / 2021"", ""End Date"": ""Present""}, {""Company Name"": ""Venturit"", ""Role"": null, ""Start Date"": null, ""End Date"": null}]}"
1,"Damini Meshram{new_line} daminisbhagat@outlook.com, - Linkedin Profile{new_line}{new_line} - 9284608602{new_line}{new_line} PROFILE{new_line}Dynamic IT specialist with 3+ years of proven expertise in Functional, Manual, and Automation Testing, particularly\t \tskilled in Selenium WebDriver wi...","{'Companies': [{'Company Name': 'Tectigon IT Solution Pvt. Ltd.', 'Role': 'Software Test Engineer', 'Internship_Flag': 0, 'Start Date': '2021', 'End Date': 'Present', 'Current_Flag': 1}], 'Education': [{'College Name': 'RTMN University', 'Degree': 'Bachelor of Engineering', 'Specialization': 'Co...","{""Companies"": [{""Company Name"": null, ""Role"": ""Software Test Engineer"", ""Start Date"": null, ""End Date"": null}, {""Company Name"": ""Tectigon IT Solution Pvt. Ltd."", ""Role"": null, ""Start Date"": null, ""End Date"": ""Present""}]}"
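
To compare `Model_Output` against `GPT_Output` more systematically than by eye, one option is a per-field agreement score after light normalization (the model output contains `12 / 2021` where GPT has `12/2021`). This is a sketch under stated assumptions: `normalize_value` and `field_agreement` are hypothetical helpers, and the sample uses only the two GPT jobs visible in the truncated preview of the first row.

```python
import re

FIELDS = ("Company Name", "Role", "Start Date", "End Date")


def normalize_value(value):
    """Lower-case, collapse whitespace, and remove spaces around '/' and '-'."""
    if value is None:
        return None
    text = re.sub(r"\s*([/\-])\s*", r"\1", str(value))
    return re.sub(r"\s+", " ", text).strip().lower()


def field_agreement(gpt_jobs, model_jobs, fields=FIELDS):
    """Per field: share of distinct GPT values that also appear in the model output."""
    result = {}
    for field in fields:
        gold = {normalize_value(job.get(field)) for job in gpt_jobs} - {None}
        pred = {normalize_value(job.get(field)) for job in model_jobs} - {None}
        result[field] = len(gold & pred) / len(gold) if gold else None
    return result


gpt_jobs = [
    {"Company Name": "Venturit", "Role": "QA Engineer", "Start Date": "12/2021", "End Date": "Present"},
    {"Company Name": "Globalstep", "Role": "Test Engineer", "Start Date": "11/2020", "End Date": "12/2021"},
]
model_jobs = [
    {"Company Name": None, "Role": "QA Engineer", "Start Date": "12 / 2021", "End Date": "Present"},
    {"Company Name": "Venturit", "Role": None, "Start Date": None, "End Date": None},
]
agreement = field_agreement(gpt_jobs, model_jobs)
```

For these two jobs every field scores 0.5: the model recovers the first job's values but none of the second job's.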


**CONCLUSION**

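**In this project, a Named Entity Recognition (NER) model based on `bert-base-cased` was fine-tuned to extract structured work experience information (company, role, start date, end date) from unstructured resume text. The best checkpoint reaches an entity-level F1 of about 0.55 on the validation set and 0.53 on the test set.**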