<a href="https://colab.research.google.com/github/staswo86/API_skill_based_department_classification/blob/main/PROJECT_skill_based_department_classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Skill based department Classification Project**

This is a multi-class text classification project that predicts a suitable department for a candidate based on their listed skills. There are 4 classes (departments): Data Science & AI, IT Software Engineering, Cyber Security and ERP. The model, dataset, and demo are publicly available on my Hugging Face profile (https://huggingface.co/staswo86).

## 0. Import necessary libraries and install dependencies

In [None]:
!pip install huggingface_hub
try:
  import datasets, evaluate, accelerate
  import gradio as gr # same alias as in the except branch, so `gr` is always defined
except ModuleNotFoundError:
  !pip install -U datasets evaluate accelerate gradio
  import datasets, evaluate, accelerate
  import gradio as gr

import random
import numpy as np
import pandas as pd

import torch
import transformers
print(f"Using transformers version : {transformers.__version__}")
print(f"Using torch version: {torch.__version__}")
print(f"Using datasets version: {datasets.__version__}")


Using transformers version : 4.48.3
Using torch version: 2.5.1+cu124
Using datasets version: 3.3.2


## 1. Getting dataset
Getting or creating a dataset is an essential first step in building a model.
I used an LLM-generated dataset called **skill_based_department_classification_dataset**, which is available on my Hugging Face profile (https://huggingface.co/datasets/staswo86/skill_based_department_classification_dataset).

In [None]:
from datasets import load_dataset

dataset = load_dataset(path = "staswo86/skill_based_department_classification_dataset")
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 240
    })
})

In [None]:
# Inspect the features of the dataset by using column_names
dataset.column_names

{'train': ['text', 'label']}

In [None]:
# Accessing the training split (right now the dataset only has training data)
dataset["train"]

Dataset({
    features: ['text', 'label'],
    num_rows: 240
})

In [None]:
# Getting the first object of our dataset
dataset["train"][0]

{'text': 'Network Security, Firewalls, Intrusion Detection Systems (IDS), Intrusion Prevention Systems (IPS), Secure Network Architecture, Network Segmentation, Network Access Control (NAC), Network Traffic Analysis',
 'label': 'Cyber Security'}

In [None]:
# To better understand the dataset and get more examples I can randomly inspect different entries

import random
# Create 4 random indexes
random_indexs = random.sample(range(len(dataset["train"])), 4)
print(random_indexs)

# Apply random indexes to the dataset in order to produce random samples
random_samples = dataset["train"][random_indexs]

print(f"[INFO] Random samples from dataset:\n")
for text, label in zip(random_samples["text"], random_samples["label"]):
  print(f"Text: {text} | Label: {label}")

[163, 28, 6, 189]
[INFO] Random samples from dataset:

Text: Blockchain for AI, Decentralized AI, Federated Learning, Differential Privacy, Homomorphic Encryption, Hugging Face Transformers, AI in Cybersecurity | Label: Data Science & AI
Text: Security Incident Management, Security Incident Management Tools, Security Incident Management Frameworks, Security Incident Management Implementation, Security Incident Management Monitoring, Security Incident Management Reporting, Security Incident Management Improvement, Security Incident Management Training | Label: Cyber Security
Text: Security Operations, Security Information and Event Management (SIEM), Security Orchestration Automation and Response (SOAR), Threat Hunting, Incident Response, Security Monitoring, Security Analytics, Security Operations Center (SOC) | Label: Cyber Security
Text: Agile, Scrum, Kanban, JIRA, Confluence | Label: IT Software Engineering


In [None]:
# Get unique label values of our dataset
unique_labels = dataset["train"].unique("label")

# Get number of unique labels
number_labels = len(unique_labels)
print(f"Number of departments (labels) of the dataset: {number_labels}")
print(f"\nNames of the departments (labels) of the dataset:\n {unique_labels}")

Number of departments (labels) of the dataset: 4

Names of the departments (labels) of the dataset:
 ['Cyber Security', 'ERP', 'Data Science & AI', 'IT Software Engineering']


In [None]:
# Additionally using Counter to count the occurrences of each label
from collections import Counter

Counter(dataset["train"]["label"])

Counter({'Cyber Security': 60,
         'ERP': 60,
         'Data Science & AI': 60,
         'IT Software Engineering': 60})

## 2. Data preprocessing
In this step I map the labels to numeric IDs, split the data into train and test sets, tokenize the text, and set up the evaluation metric.

In [None]:
# Creating a mapping between labels and numeric IDs (needed later, since id2label and label2id are model arguments)
id2label = {idx: label for idx, label in enumerate(dataset["train"].unique("label"))}
label2id = {label: idx for idx, label in id2label.items()}
print(id2label)
print(label2id)

{0: 'Cyber Security', 1: 'ERP', 2: 'Data Science & AI', 3: 'IT Software Engineering'}
{'Cyber Security': 0, 'ERP': 1, 'Data Science & AI': 2, 'IT Software Engineering': 3}
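
The mapping above follows the order in which `unique("label")` returns the labels, so it depends on the row order of the dataset. A minimal sketch, assuming IDs should stay stable even if the rows are reordered, is to sort the label names first. The helper `build_label_maps` and the `example_*` names are hypothetical and not part of this notebook, and the sorted order gives different IDs than the mapping printed above.

```python
from typing import Dict, Iterable, Tuple

def build_label_maps(labels: Iterable[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Build id2label / label2id from label names, sorted for a stable order."""
    unique_sorted = sorted(set(labels))
    id2label = {idx: label for idx, label in enumerate(unique_sorted)}
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id

# Illustrative input: the four department names in an arbitrary order, with a duplicate
example_labels = ["ERP", "Cyber Security", "IT Software Engineering", "Data Science & AI", "ERP"]
example_id2label, example_label2id = build_label_maps(example_labels)
print(example_id2label)
print(example_label2id)
```

Whichever order is used, the same mapping has to be passed to the model and applied to the dataset.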


In [None]:
# Implement a function that maps each department (label) to its numeric ID
def map_labels_to_number(example):
  example["label"] = label2id[example["label"]]
  return example


example_sample = {"text": "Python proficiency, Tensorflow, Pytorch, Pandas", "label" : "Data Science & AI"}

# Test our function (the expected ID can be checked against the mapping printed above)
map_labels_to_number(example_sample)

{'text': 'Python proficiency, Tensorflow, Pytorch, Pandas', 'label': 2}

In [None]:
# Use dataset.map() to apply map_labels_to_number to every row of the dataset
dataset = dataset["train"].map(map_labels_to_number)

In [None]:
# Output shuffled data to look at 10 more random samples
dataset.shuffle()[:10]

{'text': ['F#, Fable, SAFE Stack, Giraffe, Suave',
  'Kenandy Cloud ERP, Kenandy Supply Chain Management, Kenandy Manufacturing Management, Kenandy Inventory Management, Kenandy Order Management, Kenandy Financial Management, Kenandy Project Management, Kenandy Analytics',
  'AI in Finance, Fraud Detection, Credit Scoring, Algorithmic Trading, Risk Management, Portfolio Optimization, Financial Forecasting, Regulatory Compliance',
  'QAD Adaptive ERP, QAD Cloud ERP, QAD Enterprise Applications, QAD Automation Solutions, QAD Precision, QAD Supplier Portal, QAD Cloud EDI, QAD Cloud QMS',
  'Rootstock Cloud ERP, Rootstock Salesforce Manufacturing Cloud, Rootstock Supply Chain Management, Rootstock Inventory Management, Rootstock Engineering Change Management, Rootstock Cost Management, Rootstock Project Control, Rootstock Compliance Management',
  'Application Security, Secure Software Development Lifecycle (SDLC), Application Vulnerability Assessment, Application Penetration Testing, Secu

In [None]:
# Split our dataset into train/test splits (80% of data is going to be in the train split and seed 42 for reproducibility of the results)
dataset = dataset.train_test_split(test_size = 0.2, seed = 42)
dataset
# 192 instances in train and 48 instances in test split

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 192
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 48
    })
})
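
With only 48 test rows, a random split can leave the four departments unevenly represented. Below is a small sketch for checking this, assuming the integer labels of each split are available as plain lists (for example via `dataset["train"]["label"]`). The helper name `label_share` and the toy lists are illustrative only and are not the notebook's data.

```python
from collections import Counter
from typing import Dict, List

def label_share(labels: List[int]) -> Dict[int, float]:
    """Return the fraction of rows per label id, rounded to 3 decimals."""
    counts = Counter(labels)
    total = len(labels)
    return {label: round(count / total, 3) for label, count in sorted(counts.items())}

# Toy labels standing in for the real splits (made up for illustration)
toy_train_labels = [0, 0, 1, 1, 2, 2, 3, 3]
toy_test_labels = [0, 1, 1, 3]

print(label_share(toy_train_labels))
print(label_share(toy_test_labels))
```

If the shares differ a lot between splits, `train_test_split` also accepts a `stratify_by_column` argument; to my knowledge it requires the label column to be a `ClassLabel` feature.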

In [None]:
# Inspect a random observation in the test split (randrange excludes the upper bound, so the index is always valid)
random_idx_test = random.randrange(len(dataset["test"]))
random_sample_test = dataset["test"][random_idx_test]
random_sample_test

{'text': 'COBOL, JCL, CICS, DB2, IMS', 'label': 3}

In [None]:
# Tokenize text data using well-known transformers Hugging Face library (AutoTokenizer is the most robust option for matching tokenizers to models)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)

In [None]:
# Getting the length of our tokenizer vocabulary
length_of_tokenizer_vocabulary = len(tokenizer.vocab)
print(f"Number of items in our tokenizer vocabulary: {length_of_tokenizer_vocabulary}")

# Getting the maximum sequence length the tokenizer can handle
max_tokenizer_input_sequence_length = tokenizer.model_max_length
print(f"Max tokenizer input sequence length: {max_tokenizer_input_sequence_length}")

Number of items in our tokenizer vocabulary: 30522
Max tokenizer input sequence length: 512


In [None]:
# Creating a preprocessing function to tokenize the text

def tokenize_text(examples):
  """
  Tokenize given example text and return the tokenized text
  """

  return tokenizer(examples["text"],
                   padding = True, # pad shorter sequences to the longest sequence in the batch (not necessarily to the model maximum of 512)
                   truncation = True # truncate sequences longer than the model maximum (e.g. a 1000-token sample is shortened to 512, the limit shown in the cell above)
                  )
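
To make the two arguments concrete, here is a simplified sketch of what padding and truncation do to lists of token IDs. It is not the tokenizer's implementation: the helper `pad_and_truncate` and the toy IDs are made up, and the real tokenizer also keeps the closing `[SEP]` token when truncating, which this plain slice does not.

```python
from typing import Dict, List

def pad_and_truncate(batch: List[List[int]], max_length: int, pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Mimic padding=True / truncation=True on lists of token ids."""
    # Truncation: cut every sequence down to at most max_length tokens
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one left in the batch
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # Attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy token ids, with a tiny max_length so that truncation is visible
toy_batch = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102], [101, 102]]
toy_encoded = pad_and_truncate(toy_batch, max_length=5)
print(toy_encoded["input_ids"])
print(toy_encoded["attention_mask"])
```

The default `pad_id` of 0 matches the `[PAD]` token ID shown in the tokenizer printout above.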

In [None]:
# Map tokenize text preprocessing function to the dataset
tokenized_dataset = dataset.map(function = tokenize_text,
                                batched = True, # positively impact the time of tokenization process
                                batch_size = 1000)

In [None]:
# Setting up an evaluation metric (accuracy here, but F1 score, precision, recall etc. would also work)
import evaluate
import numpy as np
from typing import Tuple

accuracy_metric = evaluate.load("accuracy")

def compute_accuracy(predictions_and_labels : Tuple[np.ndarray, np.ndarray]):
  predictions, labels = predictions_and_labels
  # The model returns logits, so take the argmax to get the predicted class ids
  if predictions.ndim >= 2:
    predictions = np.argmax(predictions, axis = 1)
  # compute() expects keyword arguments
  return accuracy_metric.compute(predictions = predictions, references = labels)



## 3. Model setup
Apply transfer learning: start from a pretrained model and fine-tune it, so that the patterns it has already learned are adapted to my use case.

In [None]:
from transformers import AutoModelForSequenceClassification
# AutoModelForSequenceClassification selects the matching architecture for the checkpoint; here that is DistilBERT, which is the model being fine-tuned
model = AutoModelForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path = "distilbert/distilbert-base-uncased",
    num_labels = 4, # Since we have 4 departments
    id2label = id2label,
    label2id = label2id
)
model

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert/distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
# Implement a function that counts parameters with torch.numel(), which returns the total number of elements in a tensor. requires_grad marks a parameter as trainable (gradients are computed for it during training)
def count_params(model):
  """
  Count the parameters of a PyTorch model.
  """
  trainable_parameters = sum(param.numel() for param in model.parameters() if param.requires_grad)
  total_parameters = sum(param.numel() for param in model.parameters())

  return {"trainable_parameters": trainable_parameters, "total_parameters": total_parameters}
count_params(model)

{'trainable_parameters': 66956548, 'total_parameters': 66956548}
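
As a sanity check, the total can be reproduced by hand from the layer shapes. The sketch below assumes the standard DistilBERT-base dimensions; the FFN width of 3072 and the second LayerNorm per block are assumptions on my side, since they are not visible in the truncated model printout above.

```python
# Assumed DistilBERT-base dimensions (FFN width and per-block LayerNorms
# are not visible in the truncated printout, so they are assumptions here)
vocab_size, max_positions, dim, ffn_dim = 30522, 512, 768, 3072
num_layers, num_labels = 6, 4

def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * dim  # scale and shift vectors

embeddings = vocab_size * dim + max_positions * dim + layer_norm
attention = 4 * linear(dim, dim)  # q_lin, k_lin, v_lin, out_lin
ffn = linear(dim, ffn_dim) + linear(ffn_dim, dim)
block = attention + layer_norm + ffn + layer_norm
head = linear(dim, dim) + linear(dim, num_labels)  # pre_classifier + classifier

total_params = embeddings + num_layers * block + head
print(total_params)
```

Under these assumptions the sum comes to 66,956,548, which agrees with the count reported by `count_params`.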

In [None]:
# Creating a directory for saving models

# Creating model output directory
from pathlib import Path

# Creating models directory
model_dir = Path("models")
model_dir.mkdir(exist_ok = True)

# Creating model save name
model_saved_name = "skill_based_department_classifier-distilbert-base-uncased"

# Creating model save path
model_save_dir = Path(model_dir, model_saved_name)

model_save_dir

PosixPath('models/skill_based_department_classifier-distilbert-base-uncased')

## 4. Setting up training arguments (also called hyperparameters, since they are manually adjusted) with TrainingArguments

In [None]:
from transformers import TrainingArguments

print(f" Saving model checkpoints: {model_save_dir}")

# Create training arguments
training_args = TrainingArguments(
    output_dir = model_save_dir,
    learning_rate = 0.001,
    per_device_train_batch_size=32,
    per_device_eval_batch_size = 32,
    num_train_epochs = 20,
    eval_strategy = "epoch",
    save_strategy = "epoch",
    save_total_limit = 3,
    use_cpu = False,
    seed = 42,
    load_best_model_at_end = True,
    logging_strategy = "epoch",
    report_to = "none",
    hub_private_repo = False # when uploading on HF Hub, repo is going to be public as default
)

 Saving model checkpoints: models/skill_based_department_classifier-distilbert-base-uncased


In [None]:
training_args

TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
average_tokens_across_devices=False,
batch_eval_metrics=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
dataloader_prefetch_factor=None,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=True,
do_predict=False,
do_train=False,
eval_accumulation_steps=None,
eval_delay=0,
eval_do_concat_batches=True,
eval_on_start=False,
eval_steps=None,
eval_strategy=IntervalStrategy.EPOCH,
eval_

In [None]:
def compute_accuracy(predictions_and_labels : Tuple[np.array, np.array]):
  """
  Computes accuracy of a model by comparing the predictions and labels.
  """
  # Getting the maximum value from the model output (the index) as this is the "most likely" label according to the model
  predictions, labels = predictions_and_labels

  if len(predictions.shape) >= 2:
    predictions = np.argmax(predictions, axis = 1)

  return accuracy_metric.compute(predictions= predictions, references = labels)
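
To make the argmax step concrete, here is a dependency-free sketch of the same computation on made-up logits. The helper names `argmax_rows` and `accuracy` and the toy values are illustrative only; the notebook itself relies on `np.argmax` and the `evaluate` accuracy metric.

```python
from typing import List, Sequence

def argmax_rows(logits: Sequence[Sequence[float]]) -> List[int]:
    """Return the index of the largest value in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predicted: Sequence[int], references: Sequence[int]) -> float:
    """Share of positions where prediction and reference agree."""
    correct = sum(int(p == r) for p, r in zip(predicted, references))
    return correct / len(references)

# One row of four logits per sample (one logit per department)
toy_logits = [
    [2.0, 0.1, -1.0, 0.3],
    [0.2, 0.1, 3.1, -0.5],
    [0.0, 1.5, 0.4, 0.2],
    [0.9, 0.2, 0.1, 0.8],
]
toy_references = [0, 2, 3, 0]

toy_predictions = argmax_rows(toy_logits)
print(toy_predictions)
print(accuracy(toy_predictions, toy_references))
```

Three of the four toy predictions match their reference, so the sketch reports an accuracy of 0.75.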

## 5. Setting up Trainer instance

In [None]:
from transformers import Trainer

trainer = Trainer(
    model = model,
    args= training_args,
    train_dataset = tokenized_dataset["train"],
    eval_dataset = tokenized_dataset["test"],
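    processing_class = tokenizer, # the "tokenizer" argument is deprecated in recent transformers versions in favor of "processing_class"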
    compute_metrics = compute_accuracy
)
trainer


<transformers.trainer.Trainer at 0x7ff81e65de90>

## 6. Train the model

In [None]:
results = trainer.train()

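| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 1.4941        | 1.26429         | 0.416667 |
| 2     | 1.2729        | 0.757279        | 0.770833 |
| 3     | 1.0054        | 0.777736        | 0.583333 |
| 4     | 1.5248        | 0.90707         | 0.583333 |
| 5     | 0.9654        | 0.987505        | 0.5      |
| 6     | 1.0701        | 0.629408        | 0.854167 |
| 7     | 0.5829        | 0.305055        | 0.916667 |
| 8     | 0.7055        | 0.353971        | 0.895833 |
| 9     | 0.4808        | 0.638761        | 0.854167 |
| 10    | 0.5519        | 0.556948        | 0.875    |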


In [None]:
# Inspect training metrics
for key, value in results.metrics.items():
  print(f"{key} : {value}")

train_runtime : 293.7452
train_samples_per_second : 13.073
train_steps_per_second : 0.409
total_flos : 63586619228160.0
train_loss : 0.6330297907193502
epoch : 20.0
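
The throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes a single device (the `TrainingArguments` printout shows `_n_gpu=1`) and no gradient accumulation; the variable names are my own.

```python
import math

train_rows = 192          # size of the train split
batch_size = 32           # per_device_train_batch_size
epochs = 20               # num_train_epochs
train_runtime = 293.7452  # seconds, from the metrics above

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
samples_seen = train_rows * epochs

print(steps_per_epoch, total_steps)
print(round(total_steps / train_runtime, 3))   # compare with train_steps_per_second
print(round(samples_seen / train_runtime, 3))  # compare with train_samples_per_second
```

Six steps per epoch over 20 epochs give 120 optimizer steps, and both rates agree with the reported metrics after rounding.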


In [None]:
# Save the trained model into local directory
print(f"Saving model to {model_save_dir}")
trainer.save_model(output_dir = model_save_dir)

Saving model to models/skill_based_department_classifier-distilbert-base-uncased


## 7. Model deployment to the Hugging Face Hub

In [None]:
# Save our model to the Hugging Face Hub
model_upload_url = trainer.push_to_hub(
    commit_message = "Uploading skills based multi class classifier model"
)

# Print the URL the model was uploaded to
print(f"Model successfully uploaded to the Hugging Face Hub with URL: {model_upload_url}")

No files have been modified since last commit. Skipping to prevent empty commit.


Model successfully uploaded to the Hugging Face Hub with URL: https://huggingface.co/staswo86/skill_based_department_classifier-distilbert-base-uncased/tree/main/


## 8. Making inference/predictions with the help of Pytorch and Hugging Face
In the PyTorch workflow, the predicted logits (the raw outputs of the model) are transformed into prediction probabilities with torch.softmax and finally into predicted labels.
In addition, I used the Hugging Face pipeline and a mix of PyTorch and Hugging Face for inference.

In [None]:
# Firstly, perform predictions on the test data
predictions_all = trainer.predict(tokenized_dataset["test"])
prediction_values = predictions_all.predictions
prediction_metrics = predictions_all.metrics

print(f"Prediction metrics on the test data is following: {prediction_metrics}")


Prediction metrics on the test data is following: {'test_loss': 1.3856066465377808, 'test_model_preparation_time': 0.0122, 'test_accuracy': 0.3125, 'test_runtime': 0.1053, 'test_samples_per_second': 455.673, 'test_steps_per_second': 18.986}


In [None]:
import torch
from sklearn.metrics import accuracy_score

# 1. Get prediction probabilities
pred_probs = torch.softmax(torch.tensor(prediction_values), dim=1)
pred_probs

# 2. Get the predicted labels (argmax since I want to get index of the best prediction)
pred_labels = torch.argmax(pred_probs, dim=1)

# 3. Get the true labels
true_labels = tokenized_dataset["test"]["label"]

# 4. Compare the predicted labels to the true labels and get the test accuracy
test_accuracy = accuracy_score(y_true = true_labels,
                               y_pred = pred_labels)

print(f"Test accuracy: {test_accuracy}")

Test accuracy: 0.3125


In [None]:
# Creating objects for local path and huggingface model path
local_model_path = "models/skill_based_department_classifier-distilbert-base-uncased"
huggingface_model_path = "staswo86/skill_based_department_classifier-distilbert-base-uncased"

In [None]:
# Setting up device in order to make the predictions faster

def set_device():
  if torch.cuda.is_available():
    device = torch.device("cuda")
  elif torch.backends.mps.is_available() and torch.backends.mps.is_built(): # MPS for mac users
    device = torch.device("mps")
  else:
    device = torch.device("cpu")
  return device

# Check the device (cuda equals GPU)
DEVICE = set_device()
print(f"Using device: {DEVICE}")

Using device: cuda


In [None]:
# Predictions can also be obtained by using pipeline mode of Hugging Face library
import torch
from transformers import pipeline

BATCH_SIZE = 32

# Create an instance of transformers.pipeline
skill_based_department_classifier = pipeline(task = "text-classification",
                                             model = huggingface_model_path,
                                             device = DEVICE,
                                             top_k = 1, # Let's show TOP 1 result
                                             batch_size = BATCH_SIZE)
skill_based_department_classifier

Device set to use cuda


<transformers.pipelines.text_classification.TextClassificationPipeline at 0x7ff807250f50>

In [None]:
# Let's test the classifier with the custom sequence of skills
test_custom_sequence = "Machine Learning, Tensorflow, Proficiency in Python, SQL, Keras, Pytorch"

skill_based_department_classifier(test_custom_sequence)
# As expected, the inputted skills are in line with Data Science & AI department

[[{'label': 'Data Science & AI', 'score': 0.9032952189445496}]]

In [None]:
# Inference using Pytorch and Hugging Face
from transformers import AutoTokenizer

# Create an example to predict on
sample_skills_text = "Proficiency in Python, Pytorch, Hugging Face, Seaborn, Deep Learning knowledge, Pandas library, NumPy"

# Preparing the tokenizer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path= huggingface_model_path)
inputs = tokenizer(sample_skills_text,
                   return_tensors = "pt") # Pytorch tensor format
inputs

{'input_ids': tensor([[  101, 26293,  1999, 18750,  1010,  1052, 22123,  2953,  2818,  1010,
         17662,  2227,  1010,  2712, 10280,  1010,  2784,  4083,  3716,  1010,
         25462,  2015,  3075,  1010, 16371,  8737,  2100,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1]])}

In [None]:
from transformers import AutoModelForSequenceClassification

# Load the text classification model
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name_or_path = huggingface_model_path)
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
import torch

# I could've also used torch.no_grad() instead of torch.inference_mode()
with torch.inference_mode():
  outputs = model(**inputs)
  outputs_verbose = model(input_ids = inputs["input_ids"],
                          attention_mask = inputs["attention_mask"])
outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2098,  0.3504,  3.0700, -2.6282]]), hidden_states=None, attentions=None)

In [None]:
outputs_verbose

SequenceClassifierOutput(loss=None, logits=tensor([[-0.2098,  0.3504,  3.0700, -2.6282]]), hidden_states=None, attentions=None)

In [None]:
# Converting logits (raw outputs) to prediction probability + label
predicted_class_id = outputs.logits.argmax().item()
predicted_class_id

2

In [None]:
# Getting prediction probabilities
prediction_probability = torch.softmax(outputs.logits, dim = 1).max().item()
prediction_probability

0.903435468673706

In [None]:
# First, print the sample skills text
print(f"Text: {sample_skills_text}")
# Then look up the name of the department with the highest prediction probability
print(f"Predicted label: {model.config.id2label[predicted_class_id]}")
# Lastly, print the prediction probability
print(f"Prediction probability: {prediction_probability}")

Text: Proficiency in Python, Pytorch, Hugging Face, Seaborn, Deep Learning knowledge, Pandas library, NumPy
Predicted label: Data Science & AI
Prediction probability: 0.903435468673706
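
The logits-to-probability step can also be reproduced by hand with the standard library. The sketch below reuses the logits printed above (rounded to four decimals there, so the result matches only approximately) and the `id2label` mapping from section 2; the `softmax` helper and the `example_*` names are my own.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Logits copied from the model output above
example_logits = [-0.2098, 0.3504, 3.0700, -2.6282]
example_id2label = {0: "Cyber Security", 1: "ERP", 2: "Data Science & AI", 3: "IT Software Engineering"}

example_probs = softmax(example_logits)
best_id = max(range(len(example_probs)), key=example_probs.__getitem__)
print(example_id2label[best_id], round(example_probs[best_id], 3))
```

Subtracting the maximum logit before exponentiating does not change the result; it only avoids overflow for large logits.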


## 9. Creating a local demo via Gradio

In [None]:
# Creating a function to perform inference (in other words, return the model's predictions in an easily readable format)
from typing import Dict
from transformers import pipeline
import torch

def skill_based_department_classifier(text:str) -> Dict[str, float]:
  skill_based_department_classifier_pipeline = pipeline(task = "text-classification",
                                                        model = huggingface_model_path,
                                                        batch_size = 32,
                                                        device ="cuda" if torch.cuda.is_available() else "cpu",
                                                        top_k = 1,
                                                        )
  # Getting the outputs of the pipeline created above
  outputs = skill_based_department_classifier_pipeline(text)[0]
  # Formatting output for Gradio
  output_dict = {}
  for item in outputs:
    output_dict[item["label"]] = item["score"]

  return output_dict

skill_based_department_classifier("Pytorch, Pandas, Seaborn, Hugging Face, NLTK, Machine Learning")

Device set to use cpu


{'Data Science & AI': 0.9033569693565369}

In [None]:
# First, I built a demo to run locally within the notebook.
import gradio as gr
# Creating a Gradio interface
demo = gr.Interface(
    fn = skill_based_department_classifier,
    inputs = "text",
    outputs = gr.Label(num_top_classes=1),
    title = "Skill Based Department Classifier",
    description = "This text classifier aims to determine a suitable IT department based on a candidate's skills. There are four distinct IT departments: Cyber Security, ERP, Data Science & AI, IT Software Engineering",
    examples = [["Python proficiency, Machine Learning knowledge, Hugging Face, Pytorch, Pandas"],
                ["Node.js, Rust, Docker, C++, OOP"],
                ]
)

# Launching the interface
demo.launch()

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://919d92779db44f83d4.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




## 10. Deployment of the demo on Hugging Face Spaces

In [2]:
# Making a directory to store the demo

from pathlib import Path

demos_dir = Path("../demos")
demos_dir.mkdir(exist_ok =True)

# Creating a folder for skill_based_department_classifier
skill_based_department_classifier_demo_dir = Path(demos_dir, "skill_based_department_classifier")
skill_based_department_classifier_demo_dir.mkdir(exist_ok = True)

In [9]:
%%writefile ../demos/skill_based_department_classifier/app.py
# app.py contains the logic of the application to run (by default, Hugging Face Spaces runs 'app.py' automatically)
# %%writefile has to be the first line of the cell to be recognized as a cell magic, so these comments end up in app.py as well
# Import required packages
import torch
import gradio as gr

from typing import Dict
from transformers import pipeline

# Previously defined function to use with our model
def skill_based_department_classifier(text:str) -> Dict[str, float]:
  skill_based_department_classifier_pipeline = pipeline(task = "text-classification",
                                                        model = "staswo86/skill_based_department_classifier-distilbert-base-uncased",
                                                        batch_size = 32,
                                                        device ="cuda" if torch.cuda.is_available() else "cpu",
                                                        top_k = 1,
                                                        )
  # Getting the outputs of the pipeline created above
  outputs = skill_based_department_classifier_pipeline(text)[0]
  # Formatting output for Gradio
  output_dict = {}
  for item in outputs:
    output_dict[item["label"]] = item["score"]

  return output_dict

# Creating a Gradio interface
description = """
This text classifier aims to determine a suitable IT department based on a candidate's skills. There are four IT departments a candidate can be assigned to: Data Science & AI, IT Software Engineering, Cyber Security and ERP.
Fine-tuned from [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on a relatively small [LLM-generated dataset](https://huggingface.co/datasets/staswo86/skill_based_department_classification_dataset).
See source code here ()
"""

demo = gr.Interface(
    fn = skill_based_department_classifier,
    inputs = "text",
    outputs = gr.Label(num_top_classes=1),
    title = "Skill Based Department Classifier",
    description = description,
    examples = [["Python proficiency, Machine Learning knowledge, Hugging Face, Pytorch, Pandas"],
                ["Node.js, Rust, Docker, C++, OOP"],
                ]
)
# Launching the interface
if __name__ == "__main__":
  demo.launch()



Overwriting ../demos/skill_based_department_classifier/app.py


In [4]:
## Creating a README file used for metadata + settings
%%writefile ../demos/skill_based_department_classifier/README.md
---
title: Skill based department Classifier
emoji: 🖥️📱
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.19.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Skill-based department Classifier
Demo of a text classifier that aims to determine a suitable IT department based on a candidate's skills.
There are four IT departments a candidate can be assigned to: Cyber Security, ERP, Data Science & AI and IT Software Engineering.
Fine-tuned from [DistilBERT](https://huggingface.co/distilbert/distilbert-base-uncased) on a relatively small [LLM-generated dataset](https://huggingface.co/datasets/staswo86/skill_based_department_classification_dataset).
See source code here ().

Writing ../demos/skill_based_department_classifier/README.md


In [5]:
%%writefile ../demos/skill_based_department_classifier/requirements.txt
# Requirements file, to minimize the probability of a ModuleNotFoundError
gradio
torch
transformers

Writing ../demos/skill_based_department_classifier/requirements.txt


In [10]:
# Importing the required methods for deployment to the Hugging Face Hub
from huggingface_hub import(
    create_repo,
    get_full_repo_name,
    upload_file,
    upload_folder
)

# Define the constants that I am going to use for uploading to the Hugging Face Space
LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD = "../demos/skill_based_department_classifier"
HF_TARGET_SPACE_NAME = "skill_based_department_classifier_demo"
HF_REPO_TYPE = "space"
HF_SPACE_SDK = "gradio"

# Creating a Space repo on Hugging Face Hub
create_repo(
    repo_id = HF_TARGET_SPACE_NAME,
    repo_type = HF_REPO_TYPE,
    private = False, # PUBLIC
    space_sdk = HF_SPACE_SDK,
    exist_ok = True
)

# Getting the full name
hf_full_repo_name = get_full_repo_name(model_id = HF_TARGET_SPACE_NAME)
print(f"Full Hugging Face Hub repo name is: {hf_full_repo_name}")

# Uploading the demo folder
folder_upload_url = upload_folder(
    repo_id = hf_full_repo_name,
    folder_path = LOCAL_DEMO_FOLDER_PATH_TO_UPLOAD,
    path_in_repo = ".", # at the root
    repo_type = HF_REPO_TYPE,
    commit_message = "Uploading skill based department classifier demo"
)

print(f"Demo folder successfully uploaded with commit URL: {folder_upload_url}")


Full Hugging Face Hub repo name is: staswo86/skill_based_department_classifier_demo
Demo folder successfully uploaded with commit URL: https://huggingface.co/spaces/staswo86/skill_based_department_classifier_demo/tree/main/.
