## Text Classification Using Hugging Face Models

### Resume Dataset Description

The [Resume Dataset](https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset/data) is a collection of over 2400 resumes sourced from livecareer.com, each labeled with a job category. It can be used to train machine learning models that sort and classify resumes automatically, helping recruiters match candidates to job openings.

### Dataset Content
- ID: A unique identifier for each resume, which also corresponds to the filename of the resume in PDF format within the dataset.
- Resume_str: The resume content is in plain text format. This field contains the entire resume text extracted from the PDF files, providing a clean and straightforward format for text processing and analysis.
- Resume_html: The resume in HTML format, as captured during web scraping.
- Category: The job category for which the resume was used to apply. The dataset includes various categories such as HR, Designer, Information Technology, Teacher, Advocate, Business Development, Healthcare, and many more, covering a broad spectrum of industries and job functions.
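
The categories may not be equally represented, so it can help to inspect the class balance before training. The snippet below is a minimal sketch on a hypothetical handful of labels (not the real dataset). It counts resumes per category and derives "balanced" weights using the formula `n_samples / (n_classes * count)`, which is what scikit-learn's `compute_class_weight` (imported later in this notebook) applies for `class_weight='balanced'`.

```python
from collections import Counter

# Hypothetical toy labels standing in for the 'Category' column
categories = ['HR', 'HR', 'HR', 'DESIGNER', 'TEACHER', 'TEACHER']

counts = Counter(categories)
n_samples = len(categories)
n_classes = len(counts)

# 'Balanced' weighting: rarer categories receive larger weights
class_weights = {label: n_samples / (n_classes * count)
                 for label, count in counts.items()}

for label in sorted(counts):
    print(f'{label}: count={counts[label]}, weight={class_weights[label]:.2f}')
```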

### Objective
The primary objective is to utilize this dataset to train a text classification model capable of accurately categorizing a given resume into one of the predefined job categories. This involves several key steps:
1. Preprocessing the dataset to ensure it is in a suitable format for model training, including splitting the data into training, validation, and test sets.
2. Uploading the processed dataset to the Hugging Face Hub to make it accessible for model training and sharing.
3. Selecting an appropriate pre-trained language model from the Hugging Face Model Hub as the foundation for fine-tuning.
4. Fine-tuning the selected model on the resume dataset to optimize its performance for the task of resume classification.
5. Evaluating the model's performance and making it available on the Hugging Face Hub for public use and further research.

In [1]:
# Core libraries: Hugging Face datasets and PyTorch
!pip install -q datasets torch
# Reinstall transformers and accelerate together so their versions are compatible
!pip uninstall -q -y transformers accelerate
!pip install -q transformers accelerate
!pip install -q scikit-learn

In [2]:
# Generic Imports
import pandas as pd
import numpy as np
import torch
import os

# Model related imports
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn.preprocessing import LabelEncoder
from datasets import Dataset, DatasetDict, load_dataset, load_metric
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments, pipeline

# Hugging Face dataset push
from huggingface_hub import notebook_login, HfApi, HfFolder

from IPython.display import display, HTML
display(HTML('<style>.container { width:100% !important; }</style>'))

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Setting constants

# Seed for reproducibility (used below for the train/validation/test splits)
RANDOM_SEED = 42

# Name of the pre-trained model to be used from Hugging Face's model hub
MODEL_NAME = 'distilbert-base-uncased'

# Device configuration - uses GPU if available, otherwise CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Directory where the trained model and checkpoints will be saved
output_dir='./results'

# Strategy to use for evaluating the model during training (after each epoch)
evaluation_strategy='epoch'

# Strategy to use for saving the model during training (after each epoch)
save_strategy='epoch'

# Learning rate for the optimizer during training
learning_rate=2e-5

# Batch size for training, per device (e.g., per GPU if using CUDA)
per_device_train_batch_size=16

# Batch size for evaluation, per device (e.g., per GPU if using CUDA)
per_device_eval_batch_size=16

# Total number of training epochs (complete passes over the dataset)
num_train_epochs=5

# Weight decay regularization term for the optimizer
weight_decay=0.01

# Whether to load the best checkpoint (lowest validation loss by default) at the end of training
load_best_model_at_end=True

# Path to save or load the model and tokenizer
model_path = './model'

In [4]:
!nvidia-smi

Sun Feb 25 19:22:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:82:00.0 Off |                  Off |
|  0%   30C    P8              12W / 450W |      6MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

#### Objective 1: Preprocessing the dataset to ensure it is in a suitable format for model training, including splitting the data into training, validation, and test sets.

In [5]:
# Read
resumes = pd.read_csv('./Resume/Resume.csv')
print(f'Shape of the dataframe: {resumes.shape}')
print('First few rows of the resumes dataframe')
display(resumes.head(3))

Shape of the dataframe: (2484, 4)
First few rows of the resumes dataframe


| | ID | Resume_str | Resume_html | Category |
|---|---|---|---|---|
| 0 | 16852973 | HR ADMINISTRATOR/MARKETING ASSOCIATE\... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 1 | 22323967 | HR SPECIALIST, US HR OPERATIONS ... | `<div class="fontsize fontface vmargins hmargin...` | HR |
| 2 | 33176873 | HR DIRECTOR Summary Over 2... | `<div class="fontsize fontface vmargins hmargin...` | HR |


In [6]:
print('Are there any duplicates in resumes?')
len(resumes['ID']), resumes['ID'].nunique()

Are there any duplicates in resumes?


(2484, 2484)

In [7]:
# Setting ID as the index of the dataframe
resumes.set_index('ID', inplace = True)

In [8]:
count = 0  # Initializing a variable to store the total count of files

# Iterating through all directories and files in the specified root directory
for root_dir, cur_dir, files in os.walk(r'Input Files/data/data/'):
   # Incrementing the count by the number of files in each directory
   count += len(files)

print('Displaying count of files after traversing all directories', count)

Displaying count of files after traversing all directories 0
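
A count of 0 here suggests that the directory `Input Files/data/data/` was not found relative to the working directory: `os.walk` yields nothing for a missing path instead of raising an error. The sketch below illustrates this behavior with a temporary directory. The helper `count_files` and the file names are hypothetical.

```python
import os
import tempfile

def count_files(root):
    """Count files under root, including all subdirectories."""
    return sum(len(files) for _, _, files in os.walk(root))

with tempfile.TemporaryDirectory() as tmp:
    # Create two category folders holding three placeholder files in total
    for rel_path in ['HR/1.pdf', 'HR/2.pdf', 'TEACHER/3.pdf']:
        full_path = os.path.join(tmp, rel_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        open(full_path, 'w').close()

    existing_count = count_files(tmp)
    # A path that does not exist produces no iterations, hence a count of 0
    missing_count = count_files(os.path.join(tmp, 'does_not_exist'))

print(existing_count, missing_count)
```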


In [9]:
# Drop the Resume_html column, which is not needed for text classification
resumes.drop('Resume_html', axis = 1, inplace = True)

In [10]:
# Split the dataset into training and temporary sets (90% train, 10% temp)
train_df, temp_df = train_test_split(resumes,
                                     test_size=0.1,
                                     random_state=RANDOM_SEED)

# Split the temporary set evenly into validation and test sets (about 5% of the full data each)
val_df, test_df = train_test_split(temp_df,
                                   test_size=0.5,
                                   random_state=RANDOM_SEED)
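
The resulting split sizes can be worked out by hand. With a float `test_size` and no `train_size`, scikit-learn takes the ceiling of `test_size * n_samples` for the second subset and gives the remainder to the first. A short sketch of that arithmetic for the 2484 resumes loaded above (the helper `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a float test_size."""
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

# First split: 90% train, 10% temporary
n_train, n_temp = split_sizes(2484, 0.1)
# Second split: temporary set halved into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)
print(n_train, n_val, n_test)
```

This gives 2235 training, 124 validation, and 125 test rows, which matches the split sizes reported when the dataset is loaded back from the Hub.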

In [11]:
# Convert the Pandas DataFrames to Hugging Face Dataset objects
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
test_dataset = Dataset.from_pandas(test_df)

# Combine them into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': val_dataset,
    'test': test_dataset
})

#### Objective 2: Uploading the processed dataset to the Hugging Face Hub to make it accessible for model training and sharing.

In [12]:
# Log in to the Hugging Face Hub from the notebook
notebook_login()

In [13]:
# Authenticate with Hugging Face
api = HfApi()
token = HfFolder.get_token()
user = api.whoami(token=token)
username = user['name']

# Push the dataset to the Hub; change 'Resume_Classification' to use a different dataset name
dataset_dict.push_to_hub(f'{username}/Resume_Classification', token=token)

CommitInfo(commit_url='https://huggingface.co/datasets/sharmapratik88/Resume_Classification/commit/681788dbe5f88af7d2fe58d75e7b2a5e0618bb8e', commit_message='Upload dataset', commit_description='', oid='681788dbe5f88af7d2fe58d75e7b2a5e0618bb8e', pr_url=None, pr_revision=None, pr_num=None)

#### Objective 3: Selecting an appropriate pre-trained language model from the Hugging Face Model Hub as the foundation for fine-tuning.

In [14]:
# Load the dataset
dataset = load_dataset(f'{username}/Resume_Classification')
dataset = dataset.rename_column('Category', 'label')
dataset

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'label', 'ID'],
        num_rows: 2235
    })
    validation: Dataset({
        features: ['Resume_str', 'label', 'ID'],
        num_rows: 124
    })
    test: Dataset({
        features: ['Resume_str', 'label', 'ID'],
        num_rows: 125
    })
})

In [15]:
# Define the LabelEncoder
label_encoder = LabelEncoder()
labels = dataset['train']['label']
encoded_labels = label_encoder.fit_transform(labels)
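
`LabelEncoder` assigns integer ids to the sorted unique class names, so the mapping is deterministic for a given set of categories. Below is a minimal pure-Python sketch of the same mapping on a hypothetical subset of categories. The `label2id` and `id2label` dictionaries are illustrative names only.

```python
# Hypothetical subset of the category labels
labels = ['TEACHER', 'HR', 'DESIGNER', 'HR', 'CHEF']

# Sorted unique classes, mirroring the order LabelEncoder stores in classes_
classes = sorted(set(labels))
label2id = {name: idx for idx, name in enumerate(classes)}
id2label = {idx: name for name, idx in label2id.items()}

# Equivalent of fit_transform: replace each name by its integer id
encoded = [label2id[name] for name in labels]
print(classes)
print(encoded)
```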

In [16]:
# Update the dataset with encoded labels
def update_labels(example):
    example['label'] = label_encoder.transform([example['label']])[0]
    return example

dataset = dataset.map(update_labels, batched=False)

In [17]:
# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

# Tokenize and encode labels
def tokenize_and_encode_labels(examples):
    tokenized_inputs = tokenizer(examples['Resume_str'],
                                 padding='max_length',
                                 truncation=True,
                                 max_length=512)
    tokenized_inputs['labels'] = examples['label']
    return tokenized_inputs

# Apply the function to the dataset
tokenized_datasets = dataset.map(tokenize_and_encode_labels, batched=True)
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['Resume_str', 'label', 'ID', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 2235
    })
    validation: Dataset({
        features: ['Resume_str', 'label', 'ID', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 124
    })
    test: Dataset({
        features: ['Resume_str', 'label', 'ID', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 125
    })
})
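
With `truncation=True` and `max_length=512`, only the beginning of each resume reaches the model and any tokens beyond the limit are dropped. One alternative, which this notebook does not use, is to keep some tokens from the start and some from the end of a long document. The sketch below shows the idea on plain Python lists. The helper `head_tail_truncate`, the 382/128 split, and the integer token ids are all hypothetical, and the budget of 510 assumes two positions are reserved for special tokens.

```python
def head_tail_truncate(token_ids, max_tokens=510, head=382):
    """Keep the first `head` tokens and the last `max_tokens - head` tokens."""
    if len(token_ids) <= max_tokens:
        return list(token_ids)
    tail = max_tokens - head
    tail_part = list(token_ids[-tail:]) if tail > 0 else []
    return list(token_ids[:head]) + tail_part

# Made-up token ids standing in for a long tokenized resume
long_resume = list(range(1000))
truncated = head_tail_truncate(long_resume)
print(len(truncated), truncated[0], truncated[-1])
```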

In [18]:
# Select the tokenized training, validation, and test splits
train_dataset = tokenized_datasets['train']
val_dataset = tokenized_datasets['validation']
test_dataset = tokenized_datasets['test']

In [19]:
# Define a function to compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_metric('accuracy', trust_remote_code=True).compute(predictions=predictions, 
                                               references=labels)
    f1 = load_metric('f1', trust_remote_code=True).compute(predictions=predictions, 
                                   references=labels, 
                                   average='weighted')
    return {'accuracy': accuracy['accuracy'], 'f1': f1['f1']}
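
For reference, the two metrics returned by `compute_metrics` can be reproduced in a few lines of plain Python. The sketch below uses hypothetical labels and predictions. Weighted F1 is the per-class F1 averaged with weights proportional to the number of true examples of each class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for cls in set(y_true):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn  # number of true examples of this class
        total += f1 * support
    return total / len(y_true)

# Hypothetical reference labels and predictions
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 1, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```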

In [20]:
# Load the pre-trained DistilBert model, specifying the number of labels for classification
model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME,
                          num_labels=len(np.unique(encoded_labels))).to(device)

# Set up training arguments
training_args = TrainingArguments(
    output_dir=output_dir,
    evaluation_strategy=evaluation_strategy,
    save_strategy=save_strategy,
    learning_rate=learning_rate,
    per_device_train_batch_size=per_device_train_batch_size,
    per_device_eval_batch_size=per_device_eval_batch_size,
    num_train_epochs=num_train_epochs,
    weight_decay=weight_decay,
    load_best_model_at_end=load_best_model_at_end
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

# Start training
trainer.train()

# Save the model and tokenizer
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | No log | 1.95604 | 0.701613 | 0.644654 |
| 2 | No log | 1.141211 | 0.798387 | 0.770666 |
| 3 | No log | 0.825942 | 0.846774 | 0.837074 |
| 4 | 1.598000 | 0.726854 | 0.846774 | 0.842475 |
| 5 | 1.598000 | 0.709095 | 0.846774 | 0.844304 |


('./model/tokenizer_config.json',
 './model/special_tokens_map.json',
 './model/vocab.txt',
 './model/added_tokens.json')
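
The "No log" entries in the training log for the first three epochs follow from the step count. The arithmetic below assumes a single GPU, no gradient accumulation, and the `Trainer` default of logging the training loss every 500 steps. None of these is set explicitly in this notebook.

```python
import math

train_rows = 2235       # size of the training split
batch_size = 16         # per_device_train_batch_size
num_epochs = 5          # num_train_epochs
logging_steps = 500     # assumed Trainer default

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

# First epoch whose final step lies at or beyond the first logging step
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)
```

Under these assumptions each epoch has 140 steps, so epochs 1 to 3 end at steps 140, 280, and 420, before any loss has been logged. The only logged value, at step 500, is what appears for both epoch 4 and epoch 5. Passing a smaller `logging_steps` to `TrainingArguments` would give a training loss for every epoch.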

In [21]:
# Evaluate the model on the test dataset
results = trainer.evaluate(test_dataset)
display(results)

{'eval_loss': 0.6475618481636047,
 'eval_accuracy': 0.864,
 'eval_f1': 0.8579330547677297,
 'eval_runtime': 1.4867,
 'eval_samples_per_second': 84.079,
 'eval_steps_per_second': 5.381,
 'epoch': 5.0}
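
These numbers are consistent with the 125-row test split. Here is a quick sanity check of the arithmetic, assuming evaluation ran on a single device with the evaluation batch size of 16 set earlier:

```python
import math

test_rows = 125
eval_batch_size = 16     # per_device_eval_batch_size
eval_accuracy = 0.864
eval_runtime = 1.4867    # seconds, as reported above

# Number of correctly classified test resumes implied by the accuracy
correct = round(eval_accuracy * test_rows)
# Number of evaluation batches
eval_steps = math.ceil(test_rows / eval_batch_size)

samples_per_second = round(test_rows / eval_runtime, 3)
steps_per_second = round(eval_steps / eval_runtime, 3)
print(correct, eval_steps, samples_per_second, steps_per_second)
```

An accuracy of 0.864 therefore corresponds to 108 of the 125 test resumes being classified correctly.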

In [22]:
# Prediction pipeline
def predict_and_compare(resume_text, actual_label, model_path, label_encoder):
    # Load model and tokenizer for prediction
    model = DistilBertForSequenceClassification.from_pretrained(model_path)
    tokenizer = DistilBertTokenizer.from_pretrained(model_path)

    # Create prediction pipeline
    device = 0 if torch.cuda.is_available() else -1
    predict_pipeline = pipeline('text-classification', model=model, tokenizer=tokenizer, device=device)

    # Make prediction
    predictions = predict_pipeline(resume_text, truncation=True, max_length=512)
    predicted_label_index = int(predictions[0]['label'].split('_')[-1])

    # Use the integer index with inverse_transform to get the original label
    predicted_label = label_encoder.inverse_transform([predicted_label_index])

    # Compare actual vs predicted
    print('\n\n=================')
    print(f'Resume Text: {resume_text[0:50].strip()},\nActual label: {actual_label},\nPredicted label: {predicted_label[0]}')
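
Because the model was saved without custom label names, the pipeline reports classes as `LABEL_<index>`, which is why the function splits on the underscore before calling `inverse_transform`. Below is a minimal sketch of that decoding step, using a hypothetical pipeline result and a hypothetical list of class names. The helper `decode_prediction` is illustrative only.

```python
def decode_prediction(prediction, class_names):
    """Map a pipeline result such as {'label': 'LABEL_2', ...} to a class name."""
    index = int(prediction['label'].split('_')[-1])
    return class_names[index]

# Hypothetical values for illustration only
class_names = ['CHEF', 'DESIGNER', 'HR', 'TEACHER']
prediction = {'label': 'LABEL_2', 'score': 0.91}
decoded = decode_prediction(prediction, class_names)
print(decoded)
```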

In [23]:
# Sample 10 resumes from the full dataframe (this may include resumes seen during training)
random_resume = resumes.sample(n=10)
text = random_resume['Resume_str']
label = random_resume['Category']

for resume_text, actual_label in zip(text, label):
    predict_and_compare(resume_text, actual_label, model_path, label_encoder)



Resume Text: Kpandipou    Koffi         Summary,
Actual label: TEACHER,
Predicted label: TEACHER


Resume Text: DIRECTOR OF DIGITAL TRANSFORMATION,
Actual label: DIGITAL-MEDIA,
Predicted label: DIGITAL-MEDIA


Resume Text: SENIOR PROJECT MANAGER       Professional,
Actual label: CONSTRUCTION,
Predicted label: CONSTRUCTION


Resume Text: CHEF       Summary     Experienced cateri,
Actual label: CHEF,
Predicted label: CHEF


Resume Text: OPERATIONS MANAGER       Summary    Exper,
Actual label: BANKING,
Predicted label: APPAREL


Resume Text: BUSINESS DEVELOPMENT MANAGER       Profes,
Actual label: BUSINESS-DEVELOPMENT,
Predicted label: BUSINESS-DEVELOPMENT


Resume Text: JOB CAPTAIN
DESIGNER         Highlights,
Actual label: DESIGNER,
Predicted label: DESIGNER


Resume Text: DIRECTOR OF BUSINESS DEVELOPMENT       Su,
Actual label: BUSINESS-DEVELOPMENT,
Predicted label: BUSINESS-DEVELOPMENT


Resume Text: SUBSTITUTE TEACHER       Professional Sum,
Actual label: TEACHER,
Predicted label: 