Using the data webscraped with the "jobscryer" repo here:

https://github.com/tyleryou/jobscryer

The goal of the following script is to leverage a LLM to summarize each job description, then find a pattern between the summaries. This is all purely academic and for self-training purposes.

The following script utilized ChatGPT 4 to build a general model template, with manually added specification on what each element of the model does. 

In [18]:
import pandas as pd
import os

path = os.environ.get('jobscryer_data_path')

df_init = pd.read_csv(path)

In [19]:
df = df_init[df_init['description'] != '{}'].dropna()

In [30]:
df = df.iloc[0]

In [31]:
from transformers import BartTokenizer, BartForConditionalGeneration
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset
import torch
from tqdm import tqdm

# Load the pre-trained model tokenizer (for example, facebook's BART)
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')

                        # -- Tokenize and encode sequences in the "description" column -- #
# Tokenizing is just separating the sentences (sequences) into individual words. 
# Then the tokenizer assigns token ids (pre-defined in the BART-large model pre-training process) to be encoded. 
# Encoding is the process of converting each word (token) into vectors.
# In classical NLP such as classifiers, we encode only the tokens. But in LLM models like BART 
# (Bidirectional Auto Regression from Transformers), the vectors are generated with respect to the tokens to the left and right
# of the token being encoded. The token limit per sequence is 1024 for BART (so the amount of words allowed per input sentence).
# Which is the "max_length" in the encoded_plus method below. We pad the tokenized sequence (add 0's) to the vector so that 
# all encoded elements are the same length, which is important when doing tensor maths. Each encoded vector is 
# length 1024 elements, each element 0 <= x >= 1. Tensor size will be based on 3 values, n = rows in dataset, 
# the max sequence length chosen (1024 for this model), and the embedding dimension which was chosen during the original
# training of the model. In summary: tensor size = (n, max_sequence_length, embedding_dimension)

input_ids = []
attention_masks = []

for desc in tqdm(df['description'], desc="Tokenizing description", unit="desc"):
    encoded_dict = tokenizer.encode_plus(
        desc,                        # Sentence to encode.
        add_special_tokens=True,     # Add '[CLS]' and '[SEP]', these are the beginning and end of the sentence
        max_length=1024,             # Max characters allowed for BART-large is 1024
        padding='max_length',        # Pad all to max_length (required)
        return_attention_mask=True,  # Construct attention masks to use later.
        return_tensors='pt',         # Return pytorch tensors to use later.
        truncation=True              # Truncate longer messages over 1024 (shortens sentence to max allowed).
    )
    
    # Add the encoded sentence to the list.
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)

# Set the batch size.
batch_size = 32  

# Create the DataLoader. TensorDataset builds out a single tensor out of the input IDs and the attention mask.
# The attention mask tells the model what token to pay attention to and which to ignore. AKA the "attention" in transformers.
# The sample selects a subset of data points (or samples) from the larger dataset. 
# This forms a mini-batch, which is set to 32 above. Batches are computed collectively as opposed to sequentially 
# (one at a time) like more traditional NLP models like a Recurrent Neural Network (RNN).
# The DataLoader then builds a new PyTorch class that wraps the tensor built from the input data and attention masks
# that acts as a pipeline for shuffling, batching and parallel loading.
prediction_data = TensorDataset(input_ids, attention_masks)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

# Prediction on test set
print('Summarizing {:,} documents...'.format(len(input_ids)))

# Load the pre-trained BART model for sequence-to-sequence summarization
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')

# Put model in evaluation mode. This ensures the model isn't really trained (as it is already pre-trained) 
# but instead points it towards predicting values.
model.eval()

# Tracking variables 
summaries = []

# Summarize 
for batch in tqdm(prediction_dataloader, desc="Summarizing", unit="batch"):
    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask = batch

    # Telling the model not to compute or store gradients, saving memory and speeding up prediction
    with torch.no_grad():
        # Generate summary
        summary_ids = model.generate(
            b_input_ids, 
            num_beams=4,    # Number of beams for beam search
            min_length=30,  # Minimum length of the summary
            max_length=200, # Maximum length of the summary
            early_stopping=True,  # Stop generation when all beam hypotheses reached EOS
            attention_mask=b_input_mask
        )

    # Decode the summary tokens and convert to strings
    summaries.extend([tokenizer.decode(summary, skip_special_tokens=True, clean_up_tokenization_spaces=True) for summary in summary_ids])

print('    DONE.')

# Now the 'summaries' list contains the generated summaries for each input document


Tokenizing description: 100%|████████████████████████████████████████████████████| 795/795 [00:00<00:00, 1647.45desc/s]


Summarizing 795 documents...


Summarizing:   8%|█████                                                           | 2/25 [07:33<1:26:53, 226.67s/batch]


KeyboardInterrupt: 