# IV: COSINE SIMILARITY CALCULATION WITH DIFFERENT TEXT REPRESENTATION TECHNIQUES.
 
This section focuses entirely on calculating the cosine similarity between the participant data and the job ads data. The calculation uses the last hidden states of a fine-tuned BERT model, embeddings from a pre-trained Word2Vec model, and a combined TF-IDF and BoW embedding.

## GENERAL

- **load module**

In [1]:
# Load necessary libraries.
import re
import sys
import time
import torch
import psutil
import gpustat
import warnings
import platform
import numpy as np
import pandas as pd
import torch.nn.functional as F
from scipy.sparse import hstack
from nltk.corpus import stopwords
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity as cos
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoTokenizer
from transformers import BertForSequenceClassification
warnings.filterwarnings('ignore')

- **check computational environment**

In [2]:
# List the software and hardware configurations used for conducting the experiment.
print('WINDOWS VERSION:', platform.platform())
print('PYTHON VERSION:', sys.version)
print('CPU CORE:', psutil.cpu_count(logical=False))
print('CPU SPEED:', psutil.cpu_freq())
print('GPU:', gpustat.new_query().gpus[0].name)
print(f'RAM: {psutil.virtual_memory().total/(1024 ** 3):.2f} GB')
print(f"HARD DRIVE: {psutil.disk_usage('/').total/(1024 ** 3):.2f} GB")

WINDOWS VERSION: Windows-10-10.0.22631-SP0
PYTHON VERSION: 3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)]
CPU CORE: 4
CPU SPEED: scpufreq(current=2496.0, min=0.0, max=2496.0)
GPU: NVIDIA GeForce GTX 1650
RAM: 31.87 GB
HARD DRIVE: 237.45 GB


- **load dataset**

*job seekers*

In [3]:
# Load the experiment participants dataset.
df_jobseeker = pd.read_csv('data_jobseeker.csv', index_col=None)
print("The shape of the job seekers' data frame is:", df_jobseeker.shape)

df_jobseeker.head()

The shape of the job seekers' data frame is: (3, 8)


Unnamed: 0,participant,data_collection,date,location,preferred_position,education,skill,experience
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing,"patient care, wound care, medical procedures, ...",registered nurse: 3 years
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce...","circuit testing, blueprint reading, fault find...",residential electrician's helper: 1 year
2,user_3,google form,2023-12-31 13:39:00,"dublin, ireland",data analyst,"degree: master of science in data analytics, b...","python, data mining and extraction, data analy...",entry level data analyst: 1 year; data coordin...


The first dataset consists of 3 rows and 8 columns of data collected from experiment participants through interviews. The last three columns in this DataFrame (DF), which contain text data on education, skill, and experience, are intended to be used for analysis. Calculating the cosine score for each column individually is impractical and illogical. Therefore, it is necessary to combine these columns into a single one.

In [4]:
# Apply minor modifications for further use.
df_jobseeker['combined_info'] = df_jobseeker.education + '. ' + df_jobseeker.skill + '. ' + df_jobseeker.experience + '.'
df_jobseeker.drop(['education', 'skill', 'experience'], axis=1, inplace=True)

df_jobseeker.head()

Unnamed: 0,participant,data_collection,date,location,preferred_position,combined_info
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing. pati...
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce..."
2,user_3,google form,2023-12-31 13:39:00,"dublin, ireland",data analyst,"degree: master of science in data analytics, b..."


Having merged the text data into a single column, it is essential to perform a word count. This step will guide us in determining the appropriate approach for processing this text in the subsequent analytical stages.

In [5]:
# Calculate the word count for each participant's combined text and add its values to a new column.
df_jobseeker['word_count'] = df_jobseeker['combined_info'].apply(lambda x: len(x.split()))

df_jobseeker.head()

Unnamed: 0,participant,data_collection,date,location,preferred_position,combined_info,word_count
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing. pati...,27
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce...",33
2,user_3,google form,2023-12-31 13:39:00,"dublin, ireland",data analyst,"degree: master of science in data analytics, b...",60


*job ads*

In [6]:
# Load the online job ads dataset and apply minor modifications for further use.
df_jobads = pd.read_csv('data_jobads_final.csv', index_col=None)
df_jobads['job_description'] = df_jobads['job_description'].str.replace('\n', ' ')
df_jobads = df_jobads.dropna().reset_index(drop=True)

print("The shape of the job ads' data frame is:", df_jobads.shape)
df_jobads.head(3)

The shape of the job ads' data frame is: (1166, 6)


Unnamed: 0,title,id,link,date,job_description,label
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse
2,registered nurse,job_4e16e9830b072344,https://ie.indeed.com/rc/clk?jk=4e16e9830b0723...,"January 10, 2024","access healthcare, one of irelands leading hea...",registered_nurse


The second dataset consists of 1166 rows and 6 columns of data scraped from Indeed.com. The most essential column in this DF is the one with job descriptions. As with the first DF, the words in each row are counted.

In [7]:
df_jobads['word_count'] = df_jobads['job_description'].apply(lambda x: len(x.split()))
df_jobads.head(3)

Unnamed: 0,title,id,link,date,job_description,label,word_count
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231
2,registered nurse,job_4e16e9830b072344,https://ie.indeed.com/rc/clk?jk=4e16e9830b0723...,"January 10, 2024","access healthcare, one of irelands leading hea...",registered_nurse,182


All necessary libraries have been imported, and the datasets are loaded and ready for use.

## 1. WITH FINE-TUNED BERT MODEL

### 1.1 Test

In this sub-section, the text columns from both DFs are fed into the fine-tuned BERT encoding layers, and the resulting text representations from the last hidden layer are collected for cosine similarity computation. For demonstration purposes, let's run the test on a single row and retrieve its final hidden state.

In [8]:
# Assigning the text for demonstration to a variable.
input_text_test = df_jobseeker.iat[0, -2]

# Initialize a fine-tuned model with the hidden state output enabled.
model = BertForSequenceClassification.from_pretrained('ft_bert_temuulen2', output_hidden_states=True)

# Initialize a tokenizer used for the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained('ft_bert_temuulen_tokenizer2')

# Tokenize the input text and convert it to PyTorch tensors.
inputs = tokenizer(input_text_test, return_tensors='pt')
print(inputs)

{'input_ids': tensor([[  101,  5065,  1005,  1055,  3014,  1024,  4187,  2729,  8329,  1012,
          5776,  2729,  1010,  6357,  2729,  1010,  2966,  8853,  1010,  4639,
          8329,  1010,  8985,  2491,  1010, 16474,  1010,  2051,  2968,  1010,
          4807,  4813,  1010,  3086,  2000,  6987,  1012,  5068,  6821,  1024,
          1017,  2086,  1012,   102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


In the previous cell, the test text specified for demonstration purposes was assigned to a variable and tokenized. The results were formatted as tensors to be compatible with the deep learning framework, PyTorch in this instance. The output of the cell shows that the input consists of **input_ids** and **attention_mask** values, which are important for further processing, as well as **token_type_ids** values, which are optional in the current context because each input is a single text segment.
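
To make the role of these fields more concrete, here is a minimal sketch of how an attention mask relates to padding when several texts of different lengths are batched. It is not part of the experiment, which processes one text at a time (hence the all-ones mask above). The token ids are placeholders and the padding id `0` and the helper `pad_batch` are assumptions made for illustration.

```python
# Hypothetical token-id sequences of unequal length (101 = [CLS], 102 = [SEP]).
sequences = [
    [101, 5065, 3014, 102],
    [101, 5776, 2729, 1010, 6357, 2729, 102],
]

PAD_ID = 0  # assumed padding id, for illustration only


def pad_batch(seqs, pad_id=PAD_ID):
    """Pad every sequence to the longest length and build matching masks."""
    max_len = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        input_ids.append(s + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(s) + [0] * n_pad)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(sequences)
print(input_ids)
print(attention_mask)
```

With a single unpadded text, every position holds a real token, which is why the mask printed in the cell above contains only ones.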

In [9]:
# Perform a forward pass through the model to get the hidden states.
with torch.no_grad():
    outputs = model(**inputs)

# Extract the last hidden states from the model outputs.
last_hidden_states = outputs.hidden_states[-1]

print('The size of the last hidden state tensor is:', last_hidden_states.shape, '\n')
print('The data type of the last hidden state tensor is:', type(last_hidden_states), '\n')
print(last_hidden_states)

The size of the last hidden state tensor is: torch.Size([1, 44, 768]) 

The data type of the last hidden state tensor is: <class 'torch.Tensor'> 

tensor([[[-0.0363,  0.2450,  0.5963,  ...,  0.3683, -0.0280, -0.8850],
         [ 0.2678,  0.2943,  0.7582,  ...,  0.6081,  0.3891, -0.7876],
         [ 0.6730,  0.5588,  0.0457,  ...,  0.3886, -0.1934, -0.3093],
         ...,
         [ 0.2063,  1.2522,  0.8961,  ...,  0.4514, -0.2191, -0.9838],
         [-0.4627,  0.0477,  0.2985,  ...,  1.0643, -0.1532, -0.7737],
         [ 0.6308,  0.5534,  0.1932,  ...,  0.4055, -0.1966, -0.3423]]])


Following tokenization, the input values were passed forward through the model, resulting in the extraction of a torch tensor representing hidden states with dimensions of ([1, 44, 768]). This tensor will then be used for cosine similarity calculations.

### 1.2 Experiment

The test demonstration went well and the tensor was successfully extracted. Now let's begin the experiment.

In [10]:
# Starting the timer to track the execution duration.
start = time.time()

*initialize model*

The encoding model has been fine-tuned from the **bert-base-uncased** architecture for text sequence classification and was imported from a personal drive. The tokenizer is loaded with Hugging Face's AutoTokenizer, which selects the tokenizer class that matches the model. In this instance, it is the **BertTokenizer**.

In [11]:
# Initialize a fine-tuned model with the hidden state output enabled.
model = BertForSequenceClassification.from_pretrained('ft_bert_temuulen2', output_hidden_states=True)

# Initialize a tokenizer used for the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained('ft_bert_temuulen_tokenizer2')

*load dataset*

The dataset used in this implementation is a duplicate of the primary DFs containing information about job seekers and job advertisements.

In [12]:
# Load datasets.
df_bert_js = df_jobseeker.copy()
df_bert_ja = df_jobads.copy()

*initialize gpu* (optional)

To enhance the effectiveness of managing matrix and tensor operations, the CUDA device was created. This capability represents a key advantage of utilizing the BERT model within the Torch framework.

In [13]:
# Check whether CUDA is accessible and, if so, create a CUDA device; otherwise fall back to the CPU.
cuda_available = torch.cuda.is_available()

if cuda_available:
    device = torch.device('cuda')
    # Query the device name only when CUDA exists, since the call raises an error otherwise.
    cuda_device = torch.cuda.get_device_name(0)
    print('CUDA was successfully installed and compiled on my device.')
    print('CUDA device name is:', cuda_device)
else:
    device = torch.device('cpu')
    print('CUDA is not available; using the CPU instead.')

CUDA was successfully installed and compiled on my device.
CUDA device name is: NVIDIA GeForce GTX 1650


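Before starting the encoding process, it is essential to check that no input exceeds BERT's length limit. The model accepts at most 512 tokens per input, two of which are taken by the special [CLS] and [SEP] tokens, leaving 510 for content. Word count is used here as a rough proxy for token count; because WordPiece can split a word into several tokens, the true token count is usually higher. If a text exceeds this threshold, a different strategy is needed to obtain its encoding.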

In [14]:
print('The total number of rows having word counts greater than 510 in the first DF is:', df_bert_js[df_bert_js['word_count'] > 510].shape[0])
print('The total number of rows having word counts greater than 510 in the second DF is:', df_bert_ja[df_bert_ja['word_count'] > 510].shape[0])
print('The word count for the longest text is:', df_bert_ja.iat[df_bert_ja['word_count'].idxmax(), -1])

The total number of rows having word counts greater than 510 in the first DF is: 0
The total number of rows having word counts greater than 510 in the second DF is: 236
The word count for the longest text is: 3145


*create custom function*

From the output observed in the preceding cell, it is clear that the DF for job seekers does not contain entries exceeding the 510-word limit, allowing the definition of a standard custom function for tokenization and extraction of the last hidden state without additional conditions. Conversely, the DF for job advertisements contains 236 entries surpassing the 510-word threshold, with the longest text totaling 3145 words. To process these inputs through the model, a custom function incorporating special conditions must be developed and applied. The upcoming two custom functions are designed specifically for this purpose.

In [15]:
# Define a custom function to extract the final layer encodings from BERT, without conditions.
def process_text(text):
    
    # Tokenize the input text
    inputs = tokenizer(text, return_tensors='pt')
    
    # Pass the tokenized input through the model.
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Retrieve the last hidden states from the model's outputs.
    last_hidden_states = outputs.hidden_states[-1]
    
    return last_hidden_states

In [16]:
# Define a custom function to extract the final layer encodings from BERT, with conditions.
def embed_with_bert(df_column):
    
    embedded_texts = []
    
    # Iterate through each text in the DataFrame column.
    for text in df_column:
        
        # Tokenize each text without adding special tokens and without truncation or padding.
        tokens = tokenizer(text, add_special_tokens=False, return_tensors='pt', truncation=False, padding=False)['input_ids'].squeeze()
        token_length = len(tokens)
        
        # If the text still fits once [CLS] and [SEP] are added (510 content tokens or fewer), process it normally.
        if token_length <= 510:
            inputs = tokenizer(text, return_tensors='pt').to(device)
            with torch.no_grad():
                outputs = model(**inputs)
            last_hidden_states = outputs.hidden_states[-1].cpu()  
            embedded_texts.append(last_hidden_states)
            
        # Otherwise, split the tokens into consecutive windows without overlap (stride = 0).
        else:
            max_length = 512
            stride = 0
            tokens = tokenizer(text, add_special_tokens=False, return_tensors='pt', truncation=False, padding=False)['input_ids'].squeeze().to(device)
            token_windows = [tokens[i:i+max_length] for i in range(0, len(tokens), max_length - stride)]
            
            all_hidden_states = []
            
            # Add special tokens (CLS and SEP) and truncate if needed.
            for window in token_windows:
                window = torch.cat([torch.tensor([tokenizer.cls_token_id], device=device), window, torch.tensor([tokenizer.sep_token_id], device=device)])
                if len(window) > max_length:
                    window = torch.cat((window[:max_length-1], torch.tensor([tokenizer.sep_token_id], device=device)))
                inputs = {'input_ids': window.unsqueeze(0)}
                with torch.no_grad():
                    outputs = model(**inputs)
                hidden_states = outputs.hidden_states[-1].cpu()  
                all_hidden_states.append(hidden_states)
            
            # Concatenate all hidden states from each sliding window.
            embedded_texts.append(torch.cat(all_hidden_states, dim=1))
            
    return embedded_texts
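
As an aside, the windowing idea can be sketched in plain Python on a list of token ids. This is an alternative illustration, not the function used in the experiment: it reserves two positions per window up front, whereas the function above slices 512-token chunks and then trims each full chunk back to 512 after adding the special tokens. The helper `split_into_windows` and the placeholder ids are assumptions; the ids 101 and 102 are taken from the tokenizer output shown earlier.

```python
CLS_ID, SEP_ID = 101, 102        # special-token ids seen in the tokenizer output earlier
MAX_POSITIONS = 512              # BERT's maximum input length, special tokens included
CONTENT_LEN = MAX_POSITIONS - 2  # room left for content tokens in each window


def split_into_windows(token_ids, content_len=CONTENT_LEN):
    """Split content token ids into non-overlapping windows wrapped in [CLS] ... [SEP]."""
    windows = []
    for start in range(0, len(token_ids), content_len):
        chunk = token_ids[start:start + content_len]
        windows.append([CLS_ID] + chunk + [SEP_ID])
    return windows


# Hypothetical 615-token text (placeholder ids, not real vocabulary entries).
fake_ids = list(range(1000, 1615))
windows = split_into_windows(fake_ids)
print([len(w) for w in windows])  # [512, 107]
```

Under this variant no content token is dropped, at the cost of a slightly different window layout than the one used in the experiment.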


Furthermore, as observed in the previous test demonstration, the text that passes through the encoders generates a hidden states tensor with three dimensions. To keep the textual information without aggregating these dimensions, it is necessary to define a custom function. The function below processes the tensor of a user's text, computes the cosine score for each pair, and then returns the average score.

In [17]:
# Define a custom function that generates the average cosine similarity between the user's tensor and a job ad's tensor.
def calculate_average_similarity(tensor_user, tensor_ad):
    
    # Squeeze dimensions if the tensors have a batch dimension.
    tensor_user = tensor_user.squeeze(0) if tensor_user.dim() == 3 else tensor_user
    tensor_ad = tensor_ad.squeeze(0) if tensor_ad.dim() == 3 else tensor_ad

    tensor_ad = tensor_ad.to(tensor_user.device)

    # Initialize a similarity matrix with zeros.
    similarity_matrix = torch.zeros(tensor_user.size(0), tensor_ad.size(0), device=tensor_user.device)
    
    # Calculate cosine similarity for each pair of vectors.
    for i in range(tensor_user.size(0)):
        for j in range(tensor_ad.size(0)):
            similarity_matrix[i, j] = F.cosine_similarity(tensor_user[i].unsqueeze(0), tensor_ad[j].unsqueeze(0), dim=1)
            
    # Calculate the average similarity and convert it to a Python float.
    average_similarity = torch.mean(similarity_matrix).item()
    
    return average_similarity
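
The nested loop above evaluates one token pair at a time. The same average can be obtained by normalising the rows and taking a single matrix product. The sketch below checks this equivalence on random stand-in data; it uses NumPy rather than PyTorch for portability, and the function names, array sizes, and `eps` guard are assumptions made for illustration, not the code used in the experiment.

```python
import numpy as np


def average_similarity_loop(a, b):
    """Reference version: cosine similarity for every row pair, then the mean."""
    total = 0.0
    for u in a:
        for v in b:
            total += np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return total / (len(a) * len(b))


def average_similarity_vectorized(a, b, eps=1e-8):
    """Same quantity via row normalisation and a single matrix product."""
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return float((a_unit @ b_unit.T).mean())


rng = np.random.default_rng(0)
user = rng.normal(size=(44, 768))  # stand-in for a [tokens, features] user tensor
ad = rng.normal(size=(617, 768))   # stand-in for a job-ad tensor

print(np.isclose(average_similarity_loop(user, ad),
                 average_similarity_vectorized(user, ad)))
```

The same pattern carries over to PyTorch tensors, where it would typically run much faster than a Python-level double loop, especially on a GPU.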

*encode text*

The custom functions created earlier are now used to process each DF and extract the tensor of the final hidden layer.

In [18]:
# Apply function and create a new column with the extracted results.
df_bert_js['last_layer'] = df_bert_js.iloc[:, -2].apply(process_text)

print('The shape of the first tensor:', df_bert_js.iat[0, -1].shape, '\n')
print('The shape of the second tensor:', df_bert_js.iat[1, -1].shape, '\n')
print('The shape of the third tensor:', df_bert_js.iat[2, -1].shape, '\n')
print(df_bert_js.iat[0, -1], '\n')

# Check the Data Frame.
df_bert_js.head()

The shape of the first tensor: torch.Size([1, 44, 768]) 

The shape of the second tensor: torch.Size([1, 60, 768]) 

The shape of the third tensor: torch.Size([1, 89, 768]) 

tensor([[[-0.0363,  0.2450,  0.5963,  ...,  0.3683, -0.0280, -0.8850],
         [ 0.2678,  0.2943,  0.7582,  ...,  0.6081,  0.3891, -0.7876],
         [ 0.6730,  0.5588,  0.0457,  ...,  0.3886, -0.1934, -0.3093],
         ...,
         [ 0.2063,  1.2522,  0.8961,  ...,  0.4514, -0.2191, -0.9838],
         [-0.4627,  0.0477,  0.2985,  ...,  1.0643, -0.1532, -0.7737],
         [ 0.6308,  0.5534,  0.1932,  ...,  0.4055, -0.1966, -0.3423]]]) 



Unnamed: 0,participant,data_collection,date,location,preferred_position,combined_info,word_count,last_layer
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing. pati...,27,"[[[tensor(-0.0363), tensor(0.2450), tensor(0.5..."
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce...",33,"[[[tensor(-0.0979), tensor(-0.6441), tensor(0...."
2,user_3,google form,2023-12-31 13:39:00,"dublin, ireland",data analyst,"degree: master of science in data analytics, b...",60,"[[[tensor(-0.4467), tensor(0.4116), tensor(-0...."


In [19]:
# Move the model to the GPU.
model.to(device)

# Apply the 'embed_with_bert' function to each ad.
df_bert_ja['tensors'] = df_bert_ja['job_description'].apply(lambda x: embed_with_bert([x])[0])

# Inspect the first row to see the results.
print(df_bert_ja.iat[0, -1].shape, '\n')
print(df_bert_ja.iat[0, -1], '\n')

# Check the Data Frame.
df_bert_ja.head(2)

Token indices sequence length is longer than the specified maximum sequence length for this model (615 > 512). Running this sequence through the model will result in indexing errors


torch.Size([1, 617, 768]) 

tensor([[[-0.2150,  0.5150,  0.9837,  ...,  0.3223,  0.1705, -0.9307],
         [ 0.4196,  0.1590,  0.8688,  ...,  0.7425,  0.5898, -0.4012],
         [ 0.0512,  0.1291,  1.1575,  ...,  0.5806,  0.6952, -0.7819],
         ...,
         [ 0.3018,  0.2411,  0.6686,  ...,  0.8269,  0.3707,  0.1042],
         [ 0.1389,  1.0224,  0.7068,  ...,  0.7878, -0.0621, -0.2271],
         [ 0.5809,  1.0120,  0.4087,  ...,  0.9232,  0.4165, -0.5864]]]) 



Unnamed: 0,title,id,link,date,job_description,label,word_count,tensors
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502,"[[[tensor(-0.2150), tensor(0.5150), tensor(0.9..."
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231,"[[[tensor(-0.0909), tensor(0.6022), tensor(0.9..."


The results from the previous cells indicate that the tensors generated by passing each text entry through the encoding layers of the fine-tuned model have consistent sizes in the first and third dimensions. This is because each text is encoded one sample at a time, with a batch size of one, and each token is represented by a 768-feature vector. The second dimension, the number of tokens, varies from text to text and exceeds the word count of each text. Part of the difference comes from the added [CLS] and [SEP] tokens and from punctuation marks being tokenized separately; the rest is due to the WordPiece tokenization used by BERT, which breaks a word into smaller pieces when it is not present in the tokenizer's vocabulary. This approach enables the model to handle unrecognized words more effectively.
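
To illustrate why the token count can exceed the word count, here is a toy version of greedy longest-match-first sub-word splitting, the general idea behind WordPiece. The vocabulary is hypothetical and tiny, and the helper `wordpiece_tokenize` is a simplified sketch rather than the tokenizer used by the model.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary, for illustration only.
toy_vocab = {"nurse", "wound", "care", "tri", "##age", "##s"}

print(wordpiece_tokenize("nurses", toy_vocab))  # ['nurse', '##s']
print(wordpiece_tokenize("triage", toy_vocab))  # ['tri', '##age']
print(wordpiece_tokenize("xyz", toy_vocab))     # ['[UNK]']
```

A single word can therefore contribute two or more positions to the second dimension of the hidden-state tensor.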

*calculate cosine*

In [20]:
print('The cosine similarity between the texts from user1 and user2 is:', calculate_average_similarity(df_bert_js.iat[0, -1], df_bert_js.iat[1, -1]))

The cosine similarity between the texts from user1 and user2 is: -0.045444291085004807


Before running the full calculation, the cosine similarity between the tensors of two participants was evaluated as a sanity check. As the previous cell shows, the result is negative, which is expected given that the first participant is interested in registered nurse positions, whereas the second is seeking opportunities as an electrician. Next, the custom function is applied across the entire DataFrames to compute the results.

In [21]:
# Assign User's tensors to variables and then move them to a GPU for processing with PyTorch.
user1_tensor = df_bert_js.iat[0, -1]
user1_tensor = user1_tensor.to(device)

user2_tensor = df_bert_js.iat[1, -1]
user2_tensor = user2_tensor.to(device)

user3_tensor = df_bert_js.iat[2, -1]
user3_tensor = user3_tensor.to(device)

In [22]:
# Apply the calculation of average cosine similarity function to each job ad's tensor.
# The 'tensors' column is selected by name so that newly added columns cannot shift the selection.
df_bert_ja['cos_user1'] = df_bert_ja['tensors'].apply(lambda x: calculate_average_similarity(user1_tensor, x.to(device)))
torch.cuda.empty_cache()

df_bert_ja['cos_user2'] = df_bert_ja['tensors'].apply(lambda x: calculate_average_similarity(user2_tensor, x.to(device)))
torch.cuda.empty_cache()

df_bert_ja['cos_user3'] = df_bert_ja['tensors'].apply(lambda x: calculate_average_similarity(user3_tensor, x.to(device)))
torch.cuda.empty_cache()

In [23]:
# Drop the encoded column from the Data Frame (it takes up too much memory and is no longer needed).
df_bert_ja = df_bert_ja.drop(columns=['tensors']) 

# Save the DF to local drive.
df_bert_ja.to_csv('cosine_bert.csv', index=False)

df_bert_ja.head(2)

Unnamed: 0,title,id,link,date,job_description,label,word_count,cos_user1,cos_user2,cos_user3
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502,0.644372,-0.061539,-0.079645
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231,0.654735,-0.091259,-0.054981


The cosine similarity scores were calculated for each user, and the resulting DF, now including the cosine similarity results, has been saved locally.

In [24]:
end = time.time()
print(f'The calculation of cosine similarity score using fine-tuned Bert model completed in: {int((end - start)) // 60} minutes and {int((end - start)) % 60} seconds.')

The calculation of cosine similarity score using fine-tuned Bert model completed in: 373 minutes and 54 seconds.


## 2. WITH PRE-TRAINED WORD2VEC

In this sub-section, the text of each DF is passed through the chosen Word2Vec model, and the average of all its word vectors is taken as a single embedding vector. These vectors are then compared with each other using cosine similarity.

In [25]:
# Starting the timer to track the execution duration.
start = time.time()

*initialize model*

The pre-trained GoogleNews-vectors-negative300 model from Google is used for embedding. This model is trained on a dataset of approximately 100 billion words from Google News and provides 300-dimensional vectors for 3 million words and phrases, which is why it is widely used for various natural language processing (NLP) tasks.

In [26]:
# Load the pre-trained Word2Vec model
word2vec = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

*load dataset*

As for the datasets, duplicates of the primary DFs are used.

In [27]:
# Load datasets.
df_word2vec_js = df_jobseeker.copy()
df_word2vec_ja = df_jobads.copy()

*create custom function*

The preprocessing and tokenization steps for Word2Vec differ from those of transformer models: Word2Vec looks up whole words rather than sub-word pieces, and averaging pre-trained vectors places no practical limit on input length (the 10,000-word limit associated with gensim applies to training batches, not to vector lookup). Every word in this model has a fixed vector, which makes it easy to derive a text embedding by aggregation. The following custom functions have been defined to accommodate these characteristics.

In [28]:
# Define a custom function for preprocessing and tokenizing the text.
def preprocess_text_word2vec(text):
    # Lowercasing.
    text = text.lower()
    # Removing punctuation.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenization.
    tokens = word_tokenize(text)
    
    return tokens

In [29]:
# Define a custom function that returns an embedding vector.
def embed_tokens(tokens_list, model):
    # Iterate through each token in the list
    vectors = [model[word] for word in tokens_list if word in model]
    if vectors:
        # Averaging the vectors.
        embedding = np.mean(vectors, axis=0)
    else:
        # Use a zero vector if none of the tokens were found in the Word2Vec model.
        embedding = np.zeros(model.vector_size)
        
    return embedding

Once the texts have been preprocessed, tokenized, and embedded, the resulting vectors can be compared pairwise using the following custom function.

In [30]:
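# Define a scoring function that returns the raw dot product of two vectors.
# Note: this equals cosine similarity only for unit-length vectors, and the name shadows the scikit-learn `cos` imported earlier.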
def cos(vector1, vector2):
    return np.dot(vector1, vector2)
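
Because averaged word vectors are generally not unit length, a dot product and a cosine similarity can give different values. The following sketch contrasts the two on toy vectors. The helper `cosine_similarity_1d` and its zero-vector handling are assumptions made for illustration; it is not the function that produced the scores reported in this section.

```python
import numpy as np


def cosine_similarity_1d(v1, v2, eps=1e-12):
    """Cosine similarity of two 1-D vectors; returns 0.0 if either has near-zero norm."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom < eps:
        return 0.0
    return float(np.dot(v1, v2) / denom)


a = np.array([3.0, 4.0])  # length 5
b = np.array([6.0, 8.0])  # same direction, length 10

print(np.dot(a, b))                # 50.0, grows with vector length
print(cosine_similarity_1d(a, b))  # 1.0, depends on direction only
```

The zero-norm guard matters here because a text with no in-vocabulary tokens is embedded as an all-zero vector, for which the cosine is otherwise undefined.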

*preprocessing*

In [31]:
# Apply the function to preprocess the input text.
df_word2vec_js['processed_ci'] = df_word2vec_js['combined_info'].apply(preprocess_text_word2vec)
df_word2vec_ja['processed_jd'] = df_word2vec_ja['job_description'].apply(preprocess_text_word2vec)

df_word2vec_js.head(2)

Unnamed: 0,participant,data_collection,date,location,preferred_position,combined_info,word_count,processed_ci
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing. pati...,27,"[bachelors, degree, critical, care, nursing, p..."
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce...",33,"[high, school, diploma, vocational, electricia..."


*embedding*

In [32]:
# Apply the function to embed the input text.
df_word2vec_js['vectors'] = df_word2vec_js['processed_ci'].apply(lambda x: embed_tokens(x, word2vec))
df_word2vec_ja['vectors'] = df_word2vec_ja['processed_jd'].apply(lambda x: embed_tokens(x, word2vec))

print('The shape of the first random vector is:', df_word2vec_js.iat[0, -1].shape, '\n')
print('The shape of the second random vector is:', df_word2vec_js.iat[1, -1].shape, '\n')
print(df_word2vec_js.iat[0, -1], '\n')

df_word2vec_ja.head(2)

The shape of the first random vector is: (300,) 

The shape of the second random vector is: (300,) 

[-6.08450100e-02  6.02449253e-02 -1.20016243e-02  7.13876588e-03
 -7.40720332e-02  4.22175489e-02  1.12180419e-01 -1.45709693e-01
  1.79331861e-02 -1.19441107e-01 -4.40439060e-02 -9.83276367e-02
  1.14898682e-02  1.12257734e-01 -4.97694761e-02  2.10800171e-02
  2.50693094e-02  1.47782549e-01 -1.60428565e-02 -4.22304608e-02
  5.18317595e-02 -4.93539646e-02  9.00793448e-02  5.93214780e-02
 -7.79548055e-03  1.60757210e-02 -1.52020961e-01  2.08504014e-02
  9.60305985e-03 -1.08370267e-01 -4.76262011e-02 -1.63949821e-02
 -6.17863573e-02 -8.93061683e-02 -9.40270051e-02 -2.90926415e-02
  9.55681428e-02  4.57352847e-02  2.02589761e-03  2.31417138e-02
 -1.01036662e-02 -2.45408285e-02 -4.26882245e-02  4.94384766e-03
 -5.06456830e-02 -1.39112025e-01  4.37387303e-02  8.94681513e-02
 -2.47074999e-02  3.91387939e-02 -8.30829293e-02 -1.72072183e-02
 -2.95879655e-02 -1.32446289e-02 -3.79685611e-02  2.60

Unnamed: 0,title,id,link,date,job_description,label,word_count,processed_jd,vectors
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502,"[silver, stream, healthcare, group, offer, gre...","[-0.04270588, 0.0259989, 0.015020199, 0.031403..."
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231,"[create, a, better, future, for, yourself, rec...","[-0.067376204, 0.035147008, 0.023164311, 0.027..."


Each text column underwent preprocessing and tokenization, followed by the extraction of its embeddings. As demonstrated by the output of the preceding cells, each piece of text now possesses a vector value with a fixed dimensionality of 300.

*calculate cosine*

Using the custom function for cosine computation, we can obtain the similarity score of each job ad against each experiment participant's information.

In [33]:
# Copy the vector values for each user and assign it to a new variable.
user1_vector = df_word2vec_js.iat[0, -1].copy()
user2_vector = df_word2vec_js.iat[1, -1].copy()
user3_vector = df_word2vec_js.iat[2, -1].copy()

In [34]:
# Calculate the cosine similarity.
df_word2vec_ja['cos_user1'] = df_word2vec_ja['vectors'].apply(lambda x: cos(x, user1_vector))
df_word2vec_ja['cos_user2'] = df_word2vec_ja['vectors'].apply(lambda x: cos(x, user2_vector))
df_word2vec_ja['cos_user3'] = df_word2vec_ja['vectors'].apply(lambda x: cos(x, user3_vector))

In [35]:
# Drop unnecessary columns.
df_word2vec_ja.drop(columns=['processed_jd', 'vectors'], inplace=True)
# Save the DF to local drive.
df_word2vec_ja.to_csv('cosine_word2vec.csv', index=False)

df_word2vec_ja.head(2)

Unnamed: 0,title,id,link,date,job_description,label,word_count,cos_user1,cos_user2,cos_user3
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502,0.698632,0.601519,0.501146
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231,0.779827,0.687005,0.570256


The cosine similarity scores were calculated for each user, and the resulting DF, now including the cosine similarity results, has been saved locally.

In [36]:
end = time.time()
print(f'The calculation of cosine similarity using pretrained Word2Vec model completed in: {int((end - start)) // 60} minutes and {int((end - start)) % 60} seconds.')

The calculation of cosine similarity using pretrained Word2Vec model completed in: 0 minutes and 22 seconds.


## 3. WITH TF-IDF AND BOW

In this sub-section, the text of each DF is converted into numerical form using TF-IDF and BoW, and the participant vectors are then compared with the job ad vectors using cosine similarity.

In [37]:
# Starting the timer to track the execution duration.
start = time.time()

*initialize tools*


In [38]:
# Initialize a TF-IDF vectorizer object.
tfidf_vectorizer = TfidfVectorizer()
# Initialize a bag-of-words (BoW) vectorizer object.
bow_vectorizer = CountVectorizer()
# Initialize a WordNet lemmatizer object.
lemmatizer = WordNetLemmatizer()

*load dataset*

As for the datasets, duplicates of the primary DFs are used.

In [39]:
# Load datasets.
df_tfidf_js = df_jobseeker.copy()
df_tfidf_ja = df_jobads.copy()

*create custom function*

The preprocessing and tokenization steps for TF-IDF and BoW differ from those used for the transformer and neural network models. These two text representation methods use whole words to build their vectors, impose no limit on the number of tokens, and depend entirely on the vocabulary of the corpus they are fitted on.
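The dependence on the fitted vocabulary can be illustrated with a small sketch. The two-document `toy_corpus` below is invented for illustration; the vectorizers are used with their default settings, as in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the experiment data.
toy_corpus = [
    "nurse patient care",
    "electrician wiring care",
]

bow = CountVectorizer()
bow_matrix = bow.fit_transform(toy_corpus)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_corpus)

# The vocabulary is learned from the corpus; each word becomes one column.
vocab = sorted(bow.vocabulary_)
print(vocab)

# A word that was not in the fitted corpus ("plumber") is silently ignored.
unseen = bow.transform(["plumber care"]).toarray()
print(unseen)
```

Because unseen words are dropped at transform time, fitting on a corpus that contains both the participant text and the job ads keeps every word from both sides in the vocabulary.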

In [40]:
# Define a custom function for preprocessing and tokenizing the text.
def preprocess_text_tfidf(text):
    # Lowercasing.
    text = text.lower()
    # Removing punctuation.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Tokenization.
    tokens = word_tokenize(text)
    # Removing stopwords and lemmatization.
    stop_words = set(stopwords.words('english'))
    processed_tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    # Re-joining tokens.
    processed_text = ' '.join(processed_tokens)
    
    return processed_text

*preprocessing*

In [41]:
# Apply the function to preprocess the input text.
df_tfidf_js['processed_ci'] = df_tfidf_js['combined_info'].apply(preprocess_text_tfidf)
df_tfidf_ja['processed_jd'] = df_tfidf_ja['job_description'].apply(preprocess_text_tfidf)

df_tfidf_js.head(2)

Unnamed: 0,participant,data_collection,date,location,preferred_position,combined_info,word_count,processed_ci
0,user_1,voice call,2023-12-17 15:30:00,"dublin, ireland",registered nurse,bachelor's degree: critical care nursing. pati...,27,bachelor degree critical care nursing patient ...
1,user_2,voice call,2023-12-27 11:50:00,"dublin, ireland",electrician,"high school diploma, vocational electrician ce...",33,high school diploma vocational electrician cer...


The custom function has been applied to the text columns of both DFs, and now we need to merge the text into a single DF to create a unified corpus for further processing.

In [42]:
# Create a DataFrame with empty values, matching the columns of df_tfidf_ja, repeated three times.
empty_rows = pd.DataFrame([[''] * len(df_tfidf_ja.columns)] * 3, columns=df_tfidf_ja.columns)

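# Prepend the three empty rows to the job ads DF.
df_tfidf_ja = pd.concat([empty_rows, df_tfidf_ja], ignore_index=True)
# Place the participants' processed text in the first three rows (.loc avoids chained assignment).
values_to_add = df_tfidf_js['processed_ci'].tolist()[:3]
df_tfidf_ja.loc[:2, 'processed_jd'] = values_to_add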

df_tfidf_ja.head(4)

Unnamed: 0,title,id,link,date,job_description,label,word_count,processed_jd
0,,,,,,,,bachelor degree critical care nursing patient ...
1,,,,,,,,high school diploma vocational electrician cer...
2,,,,,,,,degree master science data analytics bachelor ...
3,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502.0,silver stream healthcare group offer great emp...
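As an aside, a unified corpus can also be built without padding the job ads DF with empty rows, by concatenating only the two text columns. This is an alternative sketch, not the approach used in this notebook; the tiny DataFrames `js` and `ja` and their contents are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-ins for the participant and job ads DataFrames.
js = pd.DataFrame({'processed_ci': ['nurse care', 'electrician wiring', 'data analytics']})
ja = pd.DataFrame({'processed_jd': ['nursing director', 'site electrician']})

# Stack the two text columns; participants come first, job ads follow.
corpus = pd.concat([js['processed_ci'], ja['processed_jd']], ignore_index=True)
n_users = len(js)

# The first n_users entries are participants, the rest are job ads.
user_texts = corpus.iloc[:n_users]
ad_texts = corpus.iloc[n_users:]
print(len(corpus), len(user_texts), len(ad_texts))
```

Keeping the corpus as a separate Series leaves the job ads DF untouched, so no placeholder rows need to be removed afterwards.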


*embedding*

In [43]:
# Transform the column values into a TF-IDF matrix.
tfidf_matrix = tfidf_vectorizer.fit_transform(df_tfidf_ja['processed_jd'])
# Transform the column values into a BoW matrix.
bow_matrix = bow_vectorizer.fit_transform(df_tfidf_ja['processed_jd'])
# Combine the TF-IDF matrix and the BoW matrix horizontally (side by side).
combined_matrix = hstack([tfidf_matrix, bow_matrix])
# Convert each row of the matrix to a list and store in a new DF column.
df_tfidf_ja['vectors'] = list(combined_matrix.toarray())

check_vector = df_tfidf_ja.iat[0, -1]
print('The dimensions of the newly created vectors are:', check_vector.shape)
print('The vectors are stored as a:', type(check_vector))
df_tfidf_ja.head(4)

The dimensions of the newly created vectors are: (20162,)
The vectors are stored as a: <class 'numpy.ndarray'>


Unnamed: 0,title,id,link,date,job_description,label,word_count,processed_jd,vectors
0,,,,,,,,bachelor degree critical care nursing patient ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,,,,,,,,high school diploma vocational electrician cer...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,,,,,,,,degree master science data analytics bachelor ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502.0,silver stream healthcare group offer great emp...,"[0.0, 0.0, 0.04749643991878368, 0.0, 0.0, 0.0,..."


The TF-IDF and BoW vectorizers each transform the text into a matrix, and the two matrices are then joined into a single matrix by horizontal stacking, so that every document vector captures both the raw frequency of words and their importance.
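Since both vectorizers are fitted on the same column with default settings, they learn the same vocabulary, so the stacked matrix has twice as many columns as there are vocabulary words (the 20162 dimensions above correspond to two blocks of 10081 columns each). A small sketch on an invented `toy_corpus` shows the layout:

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, not taken from the experiment data.
toy_corpus = ["nurse patient care care", "electrician wiring care"]

tfidf_part = TfidfVectorizer().fit_transform(toy_corpus)
bow_part = CountVectorizer().fit_transform(toy_corpus)

# Left block: TF-IDF weights; right block: raw counts.
combined = hstack([tfidf_part, bow_part]).tocsr()

vocab_size = bow_part.shape[1]
print(combined.shape)             # two blocks of vocab_size columns
print(combined[0, vocab_size])    # count of the first vocabulary word in document 0
```

In the toy data the first vocabulary word is "care", which appears twice in the first document, while every value in the TF-IDF block stays at or below 1.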

*normalization*

In [44]:
# Expand the vectors column into a DataFrame with one column per feature.
vectors_array = pd.DataFrame(df_tfidf_ja['vectors'].tolist())
vectors_array.head(4)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20152,20153,20154,20155,20156,20157,20158,20159,20160,20161
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.047496,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# Normalize the vectors using L2 normalization along the rows.
normalized_vectors = normalize(vectors_array, norm='l2', axis=1)
# Convert the normalized vectors to a list and assign them to a new column.
df_tfidf_ja['normolized_vec'] = normalized_vectors.tolist()

df_tfidf_ja.head(4)

Unnamed: 0,title,id,link,date,job_description,label,word_count,processed_jd,vectors,normolized_vec
0,,,,,,,,bachelor degree critical care nursing patient ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,,,,,,,,high school diploma vocational electrician cer...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,,,,,,,,degree master science data analytics bachelor ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502.0,silver stream healthcare group offer great emp...,"[0.0, 0.0, 0.04749643991878368, 0.0, 0.0, 0.0,...","[0.0, 0.0, 0.0016647493576457985, 0.0, 0.0, 0...."


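TF-IDF and BoW vectorizations produce feature values on different scales: BoW counts are integer word frequencies, while TF-IDF weights are floating-point numbers that reflect how important a word is to a document within the collection. Row-wise L2 normalization rescales every combined vector to unit length, so documents of different lengths are placed on a common scale before comparison. Note that this rescales each row as a whole and does not rebalance the TF-IDF and BoW blocks against each other; and because cosine similarity is itself invariant to the length of the vectors, the step leaves the scores computed below unchanged up to floating-point rounding.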
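The effect of row-wise L2 normalization can be checked on a small example. The matrix `raw` below is invented for illustration; `normalize` and `cosine_similarity` are the same scikit-learn functions imported at the top of the notebook.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical combined vectors: small TF-IDF-like weights next to larger counts.
raw = np.array([[0.2, 0.0, 3.0, 1.0],
                [0.0, 0.5, 1.0, 2.0],
                [0.7, 0.1, 0.0, 4.0]])

# Scale every row to unit Euclidean length.
unit = normalize(raw, norm='l2', axis=1)
row_norms = np.linalg.norm(unit, axis=1)

# Cosine similarity depends only on direction, so it is the same before and after.
same = np.allclose(cosine_similarity(raw), cosine_similarity(unit))
print(row_norms, same)
```

Each row ends up with length 1, and the pairwise cosine scores match for the raw and the normalized matrix, which is why the normalization step mainly serves as a consistency measure here.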

*calculate cosine*

In [46]:
# Convert the normalized vectors to a NumPy array.
vectors_tf = np.array(df_tfidf_ja['normolized_vec'].tolist()).copy()

# Reshape vector to a row vector.
user1_vector_tf = vectors_tf[0].reshape(1, -1).copy()
user2_vector_tf = vectors_tf[1].reshape(1, -1).copy()
user3_vector_tf = vectors_tf[2].reshape(1, -1).copy()

print("The shape of the vector's collection is:", vectors_tf.shape)
print('The final shape of the single vector is:', user1_vector_tf.shape)

The shape of the vector's collection is: (1169, 20162)
The final shape of the single vector is: (1, 20162)


In [47]:
# Calculate the cosine similarities between the vector for user1, user2 and user3 and all vectors in vectors_tf.
cos_sim = cos(user1_vector_tf, vectors_tf).flatten()
df_tfidf_ja['cos_user1'] = cos_sim

cos_sim = cos(user2_vector_tf, vectors_tf).flatten()
df_tfidf_ja['cos_user2'] = cos_sim

cos_sim = cos(user3_vector_tf, vectors_tf).flatten()
df_tfidf_ja['cos_user3'] = cos_sim

In [48]:
# Removing the first three rows (the participants' text), leaving only the job ads.
df_tfidf_ja = df_tfidf_ja.iloc[3:].reset_index(drop=True)
# Removing unnecessary columns.
df_tfidf_ja.drop(columns=['processed_jd', 'vectors', 'normolized_vec'], inplace=True)
# Save the DF to local drive.
df_tfidf_ja.to_csv('cosine_tfidf.csv', index=False)

df_tfidf_ja.head(2)

Unnamed: 0,title,id,link,date,job_description,label,word_count,cos_user1,cos_user2,cos_user3
0,assistant director of nursing,sj_3c7e64c7996bb9d6,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,"January 10, 2024",silver stream healthcare group offer great emp...,registered_nurse,502,0.301457,0.022477,0.033491
1,clinical nurse manager (cnm),sj_358f1f68cde928c4,https://ie.indeed.com/pagead/clk?mo=r&ad=-6NYl...,unknown,create a better future for yourself recruitne...,registered_nurse,231,0.301988,0.03765,0.005109


In [49]:
end = time.time()
print(f'The calculation of cosine similarity using TF-IDF and BoW completed in: {int((end - start)) // 60} minutes and {int((end - start)) % 60} seconds.')

The calculation of cosine similarity using TF-IDF and BoW completed in: 0 minutes and 16 seconds.
