## Data Science - Technical Exercise

We load the required libraries.

In [42]:
import pandas as pd
import numpy as np
import torch
import evaluate

from IPython.display import display, HTML
from datasets import load_dataset, DatasetDict
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                            DataCollatorWithPadding, TrainingArguments, Trainer)

#### Exploratory Data Analysis

We start by loading the data into pandas dataframes.

In [2]:
df_train = pd.read_csv("data/train_data.csv")
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17927 entries, 0 to 17926
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         17927 non-null  object
 1   POSTED     17927 non-null  object
 2   TITLE_RAW  17927 non-null  object
 3   BODY       17927 non-null  object
 4   ONET_NAME  17927 non-null  object
 5   ONET       17927 non-null  object
dtypes: object(6)
memory usage: 840.5+ KB


In [3]:
df_test = pd.read_csv("data/test_data.csv")
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19394 entries, 0 to 19393
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   ID         19394 non-null  object
 1   POSTED     19394 non-null  object
 2   TITLE_RAW  19394 non-null  object
 3   BODY       19394 non-null  object
 4   ONET_NAME  19394 non-null  object
 5   ONET       19394 non-null  object
dtypes: object(6)
memory usage: 909.2+ KB


Next, we glance at the dataframes.

In [4]:
df_train.head()

Unnamed: 0,ID,POSTED,TITLE_RAW,BODY,ONET_NAME,ONET
0,3a9bc988d77e46507f6753429dd848a816d0b9b9,2023-05-03,Executive Meeting Manager,Executive Meeting Manager Marriott La Jolla - ...,"Meeting, Convention, and Event Planners",13-1121.00
1,eb3a017370d55577e892ff8207a640b7d7136f31,2023-05-03,Rehabilitation Technician-Outpatient Rehab-Fle...,Rehabilitation Technician-Outpatient Rehab-Fle...,Occupational Therapy Aides,31-2012.00
2,8717d2213055d39271bd12490263a7fbe603aedb,2023-05-03,Office/Bookkeeping Assistant,"Office/Bookkeeping Assistant\nSanta Barbara, C...","Office Clerks, General",43-9061.00
3,43b55e4334835e20e1c64d9ac7bb0a0267369b9e,2023-05-03,Administrative Support Coordinator - VA - (REM...,Find Jobs Administrative Support Coordinator -...,"Secretaries and Administrative Assistants, Exc...",43-6014.00
4,afa355a328687ddb88d6265a237d0375bb36eae7,2023-05-03,Receptionist/Administrative Assistant,Receptionist/Administrative Assistant Burgess ...,"Secretaries and Administrative Assistants, Exc...",43-6014.00



We count the postings per unique `ONET` code in each dataframe.

In [6]:
onet_counts_train = df_train.ONET.value_counts().to_frame().reset_index()
onet_counts_train

Unnamed: 0,ONET,count
0,29-1141.00,959
1,41-2031.00,627
2,99-9999.00,608
3,41-1011.00,445
4,41-4012.00,407
...,...,...
694,53-1042.01,1
695,51-8011.00,1
696,13-2053.00,1
697,29-1129.00,1


In [7]:
onet_counts_test = df_test.ONET.value_counts().to_frame().reset_index()
onet_counts_test

Unnamed: 0,ONET,count
0,29-1141.00,765
1,99-9999.00,763
2,15-1252.00,526
3,41-4012.00,471
4,41-2031.00,434
...,...,...
703,49-2021.00,1
704,13-1041.00,1
705,19-3093.00,1
706,35-2012.00,1


As the numbers of unique codes differ, we count the ONET codes that appear in only one of the dataframes as well as the codes common to both.

In [8]:
print(f"Number of ONET codes only in TRAIN: {len(set(onet_counts_train.ONET.values) - set(onet_counts_test.ONET.values))}")
print(f"Number of ONET codes only in TEST: {len(set(onet_counts_test.ONET.values) - set(onet_counts_train.ONET.values))}")
print(f"Number of ONET codes common to both: {len(set(onet_counts_train.ONET.values) & set(onet_counts_test.ONET.values))}")

Number of ONET codes only in TRAIN: 97
Number of ONET codes only in TEST: 106
Number of ONET codes common to both: 602
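
The counts above are per code, not per posting. A minimal sketch of how one might measure the share of test postings whose code never occurs in training is shown below; the two lists are hypothetical stand-ins for `df_train.ONET` and `df_test.ONET`, not the real data.

```python
# Toy ONET codes standing in for df_train.ONET and df_test.ONET (hypothetical values)
train_codes = ["29-1141.00", "41-2031.00", "41-2031.00", "43-6014.00"]
test_codes = ["29-1141.00", "15-1252.00", "15-1252.00", "41-2031.00", "19-3093.00"]

seen_in_train = set(train_codes)

# Test rows whose code never occurs in the training data
unseen_rows = [code for code in test_codes if code not in seen_in_train]
unseen_share = len(unseen_rows) / len(test_codes)

print(f"Test rows with an unseen code: {len(unseen_rows)} of {len(test_codes)} ({unseen_share:.0%})")
```

A classifier trained only on the training codes cannot get these rows right, so this share bounds the achievable test accuracy from above.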


#### Data Preparation & Fine-Tuning

First, we load the CSV files into a Hugging Face `Dataset`. We also hold out 20% of the training data to form a validation set.

In [9]:
train_ds = load_dataset("csv", data_files="data/train_data.csv", split="train")
trainval_ds = train_ds.train_test_split(test_size=0.2, shuffle=True, seed=42)
test_ds = load_dataset("csv", data_files="data/test_data.csv", split="train")

dataset = DatasetDict({
    'train': trainval_ds['train'],
    'validation': trainval_ds['test'],
    'test': test_ds
})

In [10]:
dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET'],
        num_rows: 14341
    })
    validation: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET'],
        num_rows: 3586
    })
    test: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET'],
        num_rows: 19394
    })
})
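
The row counts can be checked by hand. The sketch below assumes the validation size is rounded up and the remainder goes to training, which is consistent with the counts printed above.

```python
import math

n_rows = 17927      # rows in data/train_data.csv, from df_train.info()
test_size = 0.2     # fraction passed to train_test_split

# Assumption: the held-out count is rounded up, the rest is used for training
n_validation = math.ceil(n_rows * test_size)
n_train = n_rows - n_validation

print(n_train, n_validation)  # 14341 3586
```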

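The model expects integer labels, so we create dictionaries to convert `ONET_NAME`s to integer labels and put them in the `labels` column. We also define `num_labels` to be the total number of classes in the training and validation sets. Note that in this notebook `id2label` maps occupation names to integers and `label2id` maps the integers back to names.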

In [11]:
# Get the unique labels in training and validation and create a dictionary mapping
unique_ids = np.unique(dataset['train']['ONET_NAME'] + dataset['validation']['ONET_NAME'])
id2label = {id: i for i, id in enumerate(unique_ids)}
num_labels = len(id2label)
# add extra ids from the test set (sorted so the assignment is reproducible across runs)
for i, id in enumerate(sorted(set(dataset['test']['ONET_NAME']) - set(id2label.keys()))):
    id2label[id] = num_labels + i
# define the reverse
label2id = {label:i for i, label in id2label.items()}

# Add a new 'labels' column to the dataset
dataset = dataset.map(lambda examples: {'labels': [id2label[id] for id in examples['ONET_NAME']]}, batched=True)

Map:   0%|          | 0/14341 [00:00<?, ? examples/s]

Map:   0%|          | 0/3586 [00:00<?, ? examples/s]

Map:   0%|          | 0/19394 [00:00<?, ? examples/s]
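
To make the mapping concrete, here is a minimal sketch on toy data. The occupation names and the variable names (`name_to_index`, `unpredictable`) are hypothetical, and `sorted(set(...))` stands in for `np.unique`, which also returns sorted unique values.

```python
# Hypothetical occupation names standing in for the ONET_NAME column
train_val_names = ["Registered Nurses", "Cashiers", "Registered Nurses", "Retail Salespersons"]
test_names = ["Cashiers", "Actuaries", "Registered Nurses"]

# Sorted unique names from train + validation get indices 0 .. num_labels - 1
name_to_index = {name: i for i, name in enumerate(sorted(set(train_val_names)))}
num_labels = len(name_to_index)

# Names that occur only in the test set get indices >= num_labels
for i, name in enumerate(sorted(set(test_names) - set(name_to_index))):
    name_to_index[name] = num_labels + i

test_labels = [name_to_index[name] for name in test_names]

# The classifier head has num_labels outputs, so these labels can never be predicted
unpredictable = [label for label in test_labels if label >= num_labels]

print(num_labels, test_labels, unpredictable)  # 3 [0, 3, 1] [3]
```

Since the classification head only has `num_labels` outputs, every test posting with a label at or above `num_labels` counts as an error regardless of the prediction.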

Next, we load a DistilBERT tokenizer to preprocess the text field.

In [12]:
# Specify the name of the pretrained model
model_name = 'distilbert-base-uncased'

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

Create a preprocessing function to tokenize the `BODY` column and truncate sequences.

In [14]:
# Define a function to tokenize the dataset
def tokenize(batch):
    return tokenizer(batch['BODY'], truncation=True) 

Tokenize the dataset.

In [15]:
# Tokenize every split in batches (default batch_size=1000)
tokenized_dataset = dataset.map(tokenize, batched=True)

Map:   0%|          | 0/14341 [00:00<?, ? examples/s]

Map:   0%|          | 0/3586 [00:00<?, ? examples/s]

Map:   0%|          | 0/19394 [00:00<?, ? examples/s]

In [16]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 14341
    })
    validation: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 3586
    })
    test: Dataset({
        features: ['ID', 'POSTED', 'TITLE_RAW', 'BODY', 'ONET_NAME', 'ONET', 'labels', 'input_ids', 'attention_mask'],
        num_rows: 19394
    })
})

We also load a metric to evaluate the performance of the model while training.

In [17]:
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return f1_metric.compute(predictions=predictions, references=labels, average="weighted")
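
For reference, `average="weighted"` computes one F1 score per class and averages them with weights proportional to each class's number of true instances. The following is a minimal NumPy sketch of that definition on toy labels; the function name `weighted_f1` is ours and it is not the implementation used by `evaluate`.

```python
import numpy as np

def weighted_f1(references, predictions):
    """Support-weighted mean of per-class F1 scores."""
    references = np.asarray(references)
    predictions = np.asarray(predictions)
    total = 0.0
    # Classes without true instances have zero support and contribute nothing
    for cls in np.unique(references):
        tp = np.sum((predictions == cls) & (references == cls))
        fp = np.sum((predictions == cls) & (references != cls))
        fn = np.sum((predictions != cls) & (references == cls))
        f1 = 2 * tp / (2 * tp + fp + fn)
        support = np.sum(references == cls)
        total += f1 * support
    return float(total / len(references))

# Toy example: per-class F1 is 2/3, 0.8 and 0.0 with supports 3, 2 and 1
score = weighted_f1([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 0])
print(f"{score:.4f}")
```

With heavily imbalanced classes, as in this dataset, the weighted score is dominated by the frequent codes; a macro average would weigh every code equally.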

Finally, we load DistilBERT with `AutoModelForSequenceClassification` along with the number of expected labels.

In [18]:
# Use GPU for computation if possible.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

In [19]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
model = model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

We define the training hyperparameters, pass them to the `Trainer`, and fine-tune the model for 8 epochs.

In [21]:
training_args = TrainingArguments(
    output_dir='./sequence-classification-model',
    num_train_epochs=8,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./sequence-classification-logs',
    save_total_limit=2,
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics
)

trainer.train()

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,F1
1,5.6931,3.302109,0.309772
2,2.7892,2.39515,0.480812
3,1.8828,2.055944,0.565727
4,1.3065,1.898311,0.610321
5,1.0854,1.82305,0.630892
6,0.7459,1.81887,0.651258
7,0.5661,1.839117,0.659924
8,0.4687,1.839912,0.661965


TrainOutput(global_step=7176, training_loss=1.7007938319092477, metrics={'train_runtime': 2008.671, 'train_samples_per_second': 57.116, 'train_steps_per_second': 3.573, 'total_flos': 1.5386627452870656e+16, 'train_loss': 1.7007938319092477, 'epoch': 8.0})
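
Two quick checks on the output above, as a minimal sketch. The step count assumes a single device and no gradient accumulation. The second check matters because, to our knowledge, `load_best_model_at_end=True` without `metric_for_best_model` makes the `Trainer` rank checkpoints by validation loss rather than by F1.

```python
import math

# (epoch, validation loss, weighted F1) copied from the training log above
log = [
    (1, 3.302109, 0.309772),
    (2, 2.39515, 0.480812),
    (3, 2.055944, 0.565727),
    (4, 1.898311, 0.610321),
    (5, 1.82305, 0.630892),
    (6, 1.81887, 0.651258),
    (7, 1.839117, 0.659924),
    (8, 1.839912, 0.661965),
]

# Optimizer steps: one per batch of 16, including a final partial batch each epoch
steps_per_epoch = math.ceil(14341 / 16)
total_steps = steps_per_epoch * 8

# Best epoch according to each criterion
best_by_loss = min(log, key=lambda row: row[1])[0]
best_by_f1 = max(log, key=lambda row: row[2])[0]

print(steps_per_epoch, total_steps)  # 897 7176
print(best_by_loss, best_by_f1)      # 6 8
```

The two criteria disagree here: validation loss is lowest after epoch 6, while F1 keeps improving until epoch 8. Setting `metric_for_best_model="f1"` in `TrainingArguments` would select on F1 instead.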

#### Effectiveness of the model

We define a function to predict the top N labels for each posting.

In [31]:
def predict(self, text_inputs, top_N=1, batch_size=64):
    """
    Predicts the top N ONET names, their integer labels
    and their probabilities for the given text inputs
    """
    # Prepare the inputs
    inputs = tokenizer(text_inputs, return_tensors='pt', truncation=True, padding=True)
    inputs = {name: tensor.to(self.device) for name, tensor in inputs.items()}

    # Initialize lists to hold the outputs
    all_labels = []
    all_probs = []
    
    # Process the inputs in batches
    for i in range(0, len(text_inputs), batch_size):
        batch_inputs = {name: tensor[i:i+batch_size] for name, tensor in inputs.items()}
        
        # Get the model outputs
        with torch.no_grad():
            outputs = self(**batch_inputs)

        # Compute the softmax, get the top N probabilities and their indices
        probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
        top_probs, top_labels = torch.topk(probs, top_N)
        
        # Append the results to the lists
        all_labels.extend(top_labels)
        all_probs.extend(top_probs)

    # Convert the lists to PyTorch tensors
    all_labels = torch.stack(all_labels)
    all_probs = torch.stack(all_probs)
    # Get the ONET names (label2id maps integer labels back to names)
    all_ids = [[label2id[idx] for idx in row] for row in all_labels.cpu().numpy().tolist()]

    return all_ids, all_labels, all_probs
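
The core of `predict` is a softmax followed by a top-N selection. Below is a minimal NumPy sketch of those two steps on toy logits; the helper name `top_n_from_logits` is hypothetical and it mirrors, rather than replaces, the `torch.nn.functional.softmax` and `torch.topk` calls above.

```python
import numpy as np

def top_n_from_logits(logits, top_n=1):
    """Return (indices, probabilities) of the top_n classes per row, best first."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax over the class axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Sort classes by descending probability and keep the first top_n
    order = np.argsort(-probs, axis=-1)[:, :top_n]
    return order, np.take_along_axis(probs, order, axis=-1)

logits = [[2.0, 0.5, 1.0], [0.1, 3.0, 0.2]]
indices, probs = top_n_from_logits(logits, top_n=2)
print(indices.tolist())  # [[0, 2], [1, 2]]
```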

We also add a function to compute the top N accuracy score.

In [32]:
def top_N_accuracy_score(self, text_inputs, labels, top_N=1, batch_size=64):
    """
    Get the top N accuracy score
    """
    labels = torch.tensor(labels).to(self.device)
    _, best_N_labels, _ = predict(self, text_inputs, top_N=top_N, batch_size=batch_size)    

    correct_bool = best_N_labels.eq(labels.view(-1, 1).expand_as(best_N_labels))
    
    # Compute the top N accuracy
    correct = correct_bool.view(-1).float().sum(0, keepdim=True)
    score = correct.mul_(100.0 / labels.size(0))
    
    return np.round(score.item(), 2)
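
A posting counts as correct under top N accuracy if its true label appears anywhere among the N highest-ranked predictions. The sketch below restates that rule in NumPy on toy values; `top_n_accuracy` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def top_n_accuracy(top_indices, labels):
    """Percentage of rows whose true label appears among the top-N indices."""
    top_indices = np.asarray(top_indices)
    labels = np.asarray(labels).reshape(-1, 1)
    # Broadcast each true label across its row of predictions
    hits = (top_indices == labels).any(axis=1)
    return round(100.0 * float(hits.mean()), 2)

top_indices = [[0, 2], [1, 2], [2, 0], [1, 0]]  # top-2 predictions, best first
labels = [0, 2, 1, 3]

print(top_n_accuracy(top_indices, labels))                         # 50.0
print(top_n_accuracy([row[:1] for row in top_indices], labels))    # 25.0
```

Because the top N set always contains the top N-1 set, the score can only stay the same or increase as N grows.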

Attach these methods to the model.

In [37]:
model.predict = predict.__get__(model)
model.top_N_accuracy_score = top_N_accuracy_score.__get__(model)
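
The `__get__` call works because plain Python functions are descriptors: calling `function.__get__(obj)` returns a method bound to `obj`, so `self` is filled in automatically. A minimal sketch with a hypothetical `Greeter` class:

```python
class Greeter:
    def __init__(self, name):
        self.name = name

def greet(self, greeting="Hello"):
    return f"{greeting}, {self.name}!"

g = Greeter("model")

# Bind the free function to this one instance; other instances are unaffected
g.greet = greet.__get__(g)

print(g.greet())              # Hello, model!
print(g.greet.__self__ is g)  # True
```

`types.MethodType(greet, g)` is an equivalent and somewhat more explicit way to write the same binding.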

Find the top N accuracy for the test dataset for different values of N.

In [25]:
for N in range(1,10):
    print(f"N:{N}\tTop N Test Accuracy: {model.top_N_accuracy_score(dataset['test']['BODY'], dataset['test']['labels'], top_N=N)}%")

N:1	Top N Test Accuracy: 61.63%
N:2	Top N Test Accuracy: 71.54%
N:3	Top N Test Accuracy: 75.46%
N:4	Top N Test Accuracy: 77.95%
N:5	Top N Test Accuracy: 79.69%
N:6	Top N Test Accuracy: 81.1%
N:7	Top N Test Accuracy: 82.0%
N:8	Top N Test Accuracy: 82.81%
N:9	Top N Test Accuracy: 83.54%


We check the quality of the top 3 predictions for a random sample of the test set.

In [52]:
N = 3
n_rows = 30
test_sample = dataset["test"].shuffle(seed=42).select(range(n_rows))

ids_list, _, _ = model.predict(test_sample["BODY"], top_N=N)
values = []
for i, ids in enumerate(ids_list):
    row = [test_sample['ONET_NAME'][i]] + ids
    values.append(row)
results = pd.DataFrame(values, columns=["True Labels"]+[f"Prediction {i}" for i in range(1,N+1)])
display(HTML(results.to_html(index=False)))

True Labels,Prediction 1,Prediction 2,Prediction 3
"Maintenance and Repair Workers, General","Maintenance and Repair Workers, General",Industrial Engineering Technologists and Technicians,Industrial Machinery Mechanics
"Hotel, Motel, and Resort Desk Clerks","Hotel, Motel, and Resort Desk Clerks",Medical Secretaries and Administrative Assistants,"Office Clerks, General"
Cargo and Freight Agents,"Shipping, Receiving, and Inventory Clerks",Light Truck Drivers,"Packers and Packagers, Hand"
Food Service Managers,Food Service Managers,Lodging Managers,First-Line Supervisors of Food Preparation and Serving Workers
Customer Service Representatives,Customer Service Representatives,"Sales Representatives of Services, Except Advertising, Insurance, Financial Services, and Travel",Patient Representatives
Marketing Managers,Marketing Managers,Sales Managers,General and Operations Managers
Computer and Information Research Scientists,Computer Systems Engineers/Architects,"Engineers, All Other",Database Administrators
Fast Food and Counter Workers,Fast Food and Counter Workers,Retail Salespersons,Cashiers
Heavy and Tractor-Trailer Truck Drivers,Heavy and Tractor-Trailer Truck Drivers,Light Truck Drivers,"Laborers and Freight, Stock, and Material Movers, Hand"
Business Intelligence Analysts,Business Intelligence Analysts,Management Analysts,Human Resources Specialists
