This notebook runs the `distilbert-base-uncased-emotion` model (`bhadresh-savani/distilbert-base-uncased-emotion`) hosted on Hugging Face.

In [1]:
import pandas as pd
import json
import torch
import os
from torch.nn.functional import cosine_similarity
from tqdm import tqdm
import numpy as np

In [2]:
# Load the tokenizer and model from the Hugging Face Hub
from transformers import DistilBertTokenizer
from sentence_transformers import SentenceTransformer

# The tokenizer is not used below; model.encode() tokenizes internally
tokenizer = DistilBertTokenizer.from_pretrained('bhadresh-savani/distilbert-base-uncased-emotion')
# The checkpoint is not a native sentence-transformers model, so it is wrapped with mean pooling
model = SentenceTransformer('bhadresh-savani/distilbert-base-uncased-emotion')

No sentence-transformers model found with name bhadresh-savani/distilbert-base-uncased-emotion. Creating a new one with MEAN pooling.
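
The warning above says that the checkpoint was wrapped with mean pooling, so each sentence embedding is an average of the token vectors. The sketch below illustrates what masked mean pooling computes, using NumPy and toy numbers. The function name `mean_pool` and the toy arrays are illustrative assumptions, not part of the library or of this notebook.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=float)
    mask = np.asarray(attention_mask, dtype=float)[..., None]  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)              # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)              # avoid division by zero
    return summed / counts

# Toy input (made up): 2 sentences, 3 token slots, hidden size 2.
# The last slot of the second sentence is padding and must not affect the mean.
toy_tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                       [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]]])
toy_mask = np.array([[1, 1, 1],
                     [1, 1, 0]])
pooled = mean_pool(toy_tokens, toy_mask)
```

Here `pooled` has one row per sentence, and the padded slot in the second sentence is excluded from its average.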


In [3]:
# Load data
clean_ef_data = pd.read_csv('tabulated_cleaned_emotionfiltered_trec.csv')
clean_ef_data = clean_ef_data.drop(['polarity', 'self_ref', 'PRE', 'POST'], axis=1)
print("clean emotion filtered data shape: ", clean_ef_data.shape)
clean_ef_data.head()

clean emotion filtered data shape:  (27981, 2)


Unnamed: 0,docid,TEXT
0,s_1287_153_9,I mean what the hell bro.
1,s_1287_187_0,"Yeah, crazy isn't it?"
2,s_1287_204_0,No :( sadly it doesn't have everything
3,s_1287_222_4,I'm worried.
4,s_1287_240_1,Better weapons and going against a weaker team...


In [4]:
# Load set of augmented answers
aug_answer_set = pd.read_csv('augmented_answer_sets.csv')
print("aug answer set shape: ", aug_answer_set.shape)
aug_answer_set.sample(4)

aug answer set shape:  (928, 3)


Unnamed: 0,Question,Severity,Text
195,5,2,I feel guilty a significant amount of time
691,16,3,I am experiencing sleep disruption in the ear...
403,10,1,My crying frequency is normal
26,1,3,I’m sad all the time and can’t recover from it


Now that we have answer sets and input text, let's create embeddings.

In [5]:
# Encode every post into a sentence embedding (one vector per post)
post_texts = clean_ef_data['TEXT'].to_list()

post_embeddings = model.encode(post_texts, device='cuda', show_progress_bar=True,
                               output_value='sentence_embedding', convert_to_tensor=True)

Batches:   0%|          | 0/875 [00:00<?, ?it/s]

In [6]:
# Initialize an empty dictionary to store the embeddings
bdi_embeddings = {}

# Loop over all 21 BDI questions
for i in tqdm(range(1, 22), total=21):
    # Filter the DataFrame for the current question and severity > 1
    bdi_i_embedding_df = aug_answer_set[(aug_answer_set['Question'] == i) & (aug_answer_set['Severity'] > 1)]
    
    # Get embeddings for the filtered DataFrame
    bdi_i_embeddings = model.encode(
        bdi_i_embedding_df['Text'].to_list(), device='cuda', output_value='sentence_embedding', convert_to_tensor=True
    )
    
    # Store the embeddings in the dictionary
    bdi_embeddings[f"q{i}"] = bdi_i_embeddings

100%|██████████| 21/21 [00:00<00:00, 51.39it/s]


Now that we have embeddings for each post and for each question's answer set, we can rank the posts. Rankings are computed per question and are based on the maximum cosine similarity between a post's embedding and the embeddings of that question's answer set.
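
In other words, the score of post p for question q is the maximum of cos(p, a) over all answers a in that question's answer set. As a minimal sketch of this scoring rule, the same quantity can be computed for all posts at once with a single matrix product on L2-normalised vectors. The example uses NumPy and toy vectors; `max_cosine_similarity` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np

def max_cosine_similarity(post_vecs, answer_vecs):
    """For each post vector, return its highest cosine similarity to any answer vector."""
    posts = np.asarray(post_vecs, dtype=float)
    answers = np.asarray(answer_vecs, dtype=float)
    # L2-normalise rows so that a dot product equals cosine similarity
    posts = posts / np.clip(np.linalg.norm(posts, axis=1, keepdims=True), 1e-12, None)
    answers = answers / np.clip(np.linalg.norm(answers, axis=1, keepdims=True), 1e-12, None)
    sims = posts @ answers.T   # (n_posts, n_answers) similarity matrix
    return sims.max(axis=1)    # best-matching answer per post

# Toy vectors (made up) standing in for post and answer embeddings
toy_posts = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_answers = np.array([[3.0, 0.0], [0.0, -1.0]])
scores = max_cosine_similarity(toy_posts, toy_answers)
```

The first toy post points in the same direction as the first answer, so its score is 1; the third lies at 45 degrees to it, giving roughly 0.707.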

In [7]:
cosine_similarity_dict = {}

for i in range(1, 22):

    # Get the correct embeddings list by key
    qa_embeddings = bdi_embeddings[f"q{i}"]

    # Get the max cosine similarity between each post embedding and the answer set for the current BDI question
    qi_cos_similarities = [
        torch.max(cosine_similarity(post_embedding, qa_embeddings)).item()
        for post_embedding in tqdm(post_embeddings, total=len(post_embeddings))
    ]

    # Assign these max cosine similarity rankings to the cosine similarity dictionary
    cosine_similarity_dict[f'q{i}'] = qi_cos_similarities

100%|██████████| 27981/27981 [00:05<00:00, 5318.71it/s]

Now that we have cosine similarities for each question, we attach them to the post IDs and keep the top-ranked posts.

In [8]:
# Initialize an empty dictionary to store the rankings for each question
rankings_dict = {}

# Loop over all 21 BDI questions
for i in tqdm(range(1, 22), total=21):
    # Make a copy of the clean_ef_data DataFrame
    q_rankings = clean_ef_data.copy()
    
    # Add a 'score' column to the DataFrame, which contains the cosine similarity scores for the current question
    q_rankings['score'] = cosine_similarity_dict[f'q{i}']
    
    # Sort the DataFrame by the 'score' column in descending order and reset the index
    q_rankings = q_rankings.sort_values('score', ascending=False, ignore_index=True)
    
    # Keep only the top 1000 rows
    q_rankings = q_rankings.head(1000)
    
    # Store the DataFrame in the rankings_dict dictionary with the key as the current question
    rankings_dict[f'q{i}'] = q_rankings

100%|██████████| 21/21 [00:00<00:00, 136.55it/s]


In [9]:
# Format each question's rankings as TREC-style columns before saving to disk.
for i, key in enumerate(rankings_dict.keys()):
    df = rankings_dict[key].copy()
    df['q_id'] = f'{i+1}'  # query number (1-21), stored as a string
    df['doc_id'] = df['docid']
    df['q0'] = '0'
    df['rank'] = [str(r) for r in range(1, len(df) + 1)]  # rank follows the sorted row order
    df['model'] = 'distilbert-base-uncased-emotion'
    rankings_dict[key] = df[["q_id", "q0", "doc_id", "score", "rank", "model"]]
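
For comparison, the run-file layout conventionally read by `trec_eval` is `qid Q0 docno rank score run_tag`, whitespace-separated and without a header row. The cell above instead places `score` before `rank` and uses the literal `'0'` in the second column. Whether the downstream consumer of this notebook's output needs the conventional layout is not stated here, so the following is only a sketch. `to_trec_run_lines` is a hypothetical helper and the scores are invented for illustration.

```python
def to_trec_run_lines(query_id, scored_docs, run_tag, top_k=1000):
    """Format (doc_id, score) pairs as run lines: qid Q0 docno rank score run_tag."""
    # Highest score first, truncated to the top_k documents
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:top_k]
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]

# Toy scores (invented) for three document IDs
toy_scores = [("s_1287_153_9", 0.41), ("s_1287_222_4", 0.87), ("s_1287_187_0", 0.15)]
run_lines = to_trec_run_lines(1, toy_scores, "distilbert-base-uncased-emotion", top_k=2)
```

Each element of `run_lines` is one line of the run file; joining them with newlines gives text that could be written to disk.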


In [10]:
# for key, df in rankings_dict.items():
#     df.to_csv(f'distilbert-base-uncased-ranking-outputs/{key}_rankings.tsv', sep='\t', index=False)

In [11]:
# Concatenate all the DataFrames in the dictionary
all_rankings = pd.concat(rankings_dict.values(), ignore_index=True)

# Save the concatenated DataFrame as a TSV file
all_rankings.to_csv('distilbert_mod_results.tsv', sep='\t', index=False)

-----------------------------------------------------------------------------------------------------------

Now we have completed the rankings portion for our fine-tuned model.

-----------------------------------------------------------------------------------------------------------------------------