# Baseline 2: Vanilla-BERT implementation for expert ranking

ToDo:
- Potentially fix expert criteria: Accepted by user

## Reproducibility comments
- The published data is incomplete: the labels file is missing entries.
- Data cleaning, preprocessing, and normalization are not described at all.
- It is unclear exactly which data went into the train/validation/test sets. With no seed given, the split may differ on every run.
- The purpose of the validation set is not stated. Was it used for BERT?
- Chapter 3: the term "test collection" is misleading, since it cannot refer to the test set.
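
Since the paper gives no seed, its exact split cannot be recovered, but our own runs can at least be made repeatable. A minimal sketch, assuming a helper named `set_seed` (introduced here for illustration, not part of the paper's code) is called once before any sampling or training:

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators this notebook may rely on.

    Hypothetical helper; the seed value 42 is an arbitrary choice.
    """
    random.seed(seed)
    try:
        import numpy as np
        # pandas' DataFrame.sample falls back to NumPy's global state
        # when no random_state is passed
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


set_seed(42)
first_draw = random.random()
set_seed(42)
second_draw = random.random()  # same value as first_draw
```

Seeding does not make GPU training bit-for-bit deterministic, but it removes the run-to-run variation in sampling and shuffling.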

In [39]:
import json
import pickle
import random

import pandas as pd
import torch
# transformers.AdamW is deprecated; torch's AdamW defaults to weight_decay=0.01 (transformers used 0.0)
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer, BertTokenizerFast

# Set to True to load the cached BERT tokens from disk instead of re-tokenizing
load_tokens = True
# Set to True to write the derived datasets (mappings, labels) to disk
save_datasets = False

## 1. Initial data screening

In [2]:
# Relative file paths
queries_file = "../data/queries_bankruptcy.csv"
labels_file = "../data/labels.qrel"
lawyerid_to_url_file = "../data/lawyerid_to_lawyerurl.json"
answers_file = "../data/all_questions_and_answer_new.parquet"  # Parquet file with answers

# Load datasets
queries_df = pd.read_csv(queries_file, header=None, names=["queryid", "query"])
labels_df = pd.read_csv(labels_file, sep=" ", header=None, names=["queryid", "iteration", "lawyerid", "label"])
with open(lawyerid_to_url_file, "r") as f:
    lawyerid_to_url = json.load(f)
answers_df = pd.read_parquet(answers_file)

In [3]:
queries_df.head()

Unnamed: 0,queryid,query
0,0,chapter 13 bankruptcy reorganization plan
1,1,nondischargeable debt and student loans
2,2,employment as an independent contractor
3,3,debt collection and debt settlement
4,4,chapter 7 bankruptcy for businesses


In [4]:
labels_df.head()

Unnamed: 0,queryid,iteration,lawyerid,label
0,1,0,3,1
1,2,0,3,1
2,3,0,3,1
3,4,0,3,1
4,5,0,3,1


In [5]:
print(labels_df["label"].value_counts())

label
1    1576
Name: count, dtype: int64


In [6]:
print(labels_df["lawyerid"].nunique())

51


Right away, we run into a major problem: the labels file provided by the authors is incomplete. First, it contains only relevant lawyers (every label is 1), even though training requires both relevant and irrelevant ones. One could assume that all lawyers not listed are irrelevant for the given queries. However, the file contains only 51 unique relevant lawyers, whereas the authors report finding 61.

The best way to deal with this is to determine the relevant lawyers ourselves, following the process described in the paper, and to use those labels for training. **Note, however,** that this will inevitably lead to differing results, since the questions and answers had to be scraped manually and the webpages changed significantly over time (mainly pages that are no longer available).

## 2. Create our own dataset because the one on GitHub is not usable

In [7]:
answers_file = "../data/all_questions_and_answer_new.parquet"

answers_df = pd.read_parquet(answers_file)
answers_df.head()

Unnamed: 0,number,url,title,question,question_tags,answers,lawyers,posted_times,answer_card_text,stars,reviews,rating,helpful,lawyers_agree,best_answer
0,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"Bankruptcy,Debt,Bankruptcy and debt",Definitely get out of this arrangement immedia...,https://www.avvo.com/attorneys/750961.html,2021-07-18,Answer\nLarry R. Maitland II\nSocial Security ...,5.0,20.0,9.5,1.0,1.0,False
1,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"Bankruptcy,Debt,Bankruptcy and debt",Sounds to me like you are getting scammed. Was...,https://www.avvo.com/attorneys/370602.html,2021-07-16,Answer\nStuart Gregory Steingraber\nBankruptcy...,4.994382,178.0,9.8,1.0,1.0,False
2,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"Bankruptcy,Debt,Bankruptcy and debt",Why not pay bills yourself? You are in control...,https://www.avvo.com/attorneys/16108.html,2021-07-16,Answer\nRichard D. Granvold\nChapter 7 Bankrup...,4.680556,72.0,,0.0,1.0,False
3,1,https://www.avvo.com/legal-answers/high-credit...,High credit card balance consolidation offer.,Can I be held accountable for my late fathers ...,"Bankruptcy,Credit,Debt,Debt settlement,Debt ne...",Were you on the credit card along with your fa...,https://www.avvo.com/attorneys/383564.html,2021-07-15,Answer\nHarlene Miller\nBankruptcy Attorney in...,5.0,10.0,9.0,1.0,1.0,False
4,1,https://www.avvo.com/legal-answers/high-credit...,High credit card balance consolidation offer.,Can I be held accountable for my late fathers ...,"Bankruptcy,Credit,Debt,Debt settlement,Debt ne...",Probate may be required.\nCreditor may be requ...,https://www.avvo.com/attorneys/312867.html,2021-07-15,Answer\nJames Charles Shields\nBankruptcy Atto...,4.625,24.0,9.7,0.0,1.0,False


For many questions, the links were no longer available, resulting in rows without any answer data. We remove those.

In [8]:
# Count the number of None or NaN values in each column
nan_counts = answers_df.isna().sum()
print(nan_counts)

number                 0
url                    0
title                  0
question               0
question_tags       6867
answers             6867
lawyers             6867
posted_times        6867
answer_card_text    6867
stars               7275
reviews             7620
rating              8413
helpful             6867
lawyers_agree       9743
best_answer         6867
dtype: int64


In [9]:
# Drop rows where 'answers' column is NaN or None
answers_df = answers_df.dropna(subset=['answers'])

# Verify the changes
print(answers_df.shape)

(11223, 15)


### Create our own lawyerid_to_lawyerurl and queries_bankruptcy file

These will then be used to create our own labels file.

In [10]:
# Extract unique lawyer URLs
unique_lawyers = answers_df['lawyers'].unique()

# Create a mapping of lawyer IDs to lawyer URLs
lawyer_mapping = {'lawyer_id': [], 'lawyer_url': []}
for idx, url in enumerate(unique_lawyers, start=1):
    lawyer_mapping['lawyer_id'].append(idx)
    lawyer_mapping['lawyer_url'].append(url)

# Convert the mapping to a DataFrame
lawyer_mapping_df = pd.DataFrame(lawyer_mapping)

if save_datasets:
    # Save the mapping to a CSV file
    lawyer_mapping_df.to_csv('../data/own_files/lawyerid_to_lawyerurl_own.csv', index=False)

In [11]:
# Extract all queries from the "question_tags" column
all_queries = answers_df['question_tags'].str.split(',').explode().str.strip()

# Count the frequency of each query
query_counts = all_queries.value_counts()

# Select the top 20% most frequent queries according to paper methodology
top_20_percent_queries = query_counts.head(int(len(query_counts) * 0.2)).index

# Create a mapping of query IDs to these queries
query_mapping = {'query_id': [], 'query': []}
for idx, query in enumerate(top_20_percent_queries, start=1):
    query_mapping['query_id'].append(idx)
    query_mapping['query'].append(query)

# Convert the mapping to a DataFrame
query_mapping_df = pd.DataFrame(query_mapping)
query_mapping_df.shape

if save_datasets:
    # Save the mapping to a CSV file
    query_mapping_df.to_csv('../data/own_files/queries_bankruptcy_own.csv', index=False)

In [12]:
query_mapping_df.head()

Unnamed: 0,query_id,query
0,1,Bankruptcy
1,2,Bankruptcy and debt
2,3,Debt
3,4,Credit
4,5,Chapter 7 bankruptcy


Add the `lawyer_id` and `query_id_list` columns to `answers_df` so that we can compute the expert criteria for each lawyer.

In [13]:
# Create the lawyer_id column in answers_df by joining on the lawyer URL
answers_df = answers_df.merge(lawyer_mapping_df, left_on='lawyers', right_on='lawyer_url', how='left')

# Convert the question_tags column to a list format
answers_df['question_tags'] = answers_df['question_tags'].apply(lambda x: [tag.strip() for tag in x.split(',')])

# Create the query_id column
def get_query_ids(tags, query_mapping_df):
    tag_to_id = dict(zip(query_mapping_df['query'], query_mapping_df['query_id']))
    return [tag_to_id[tag] for tag in tags if tag in tag_to_id]

answers_df['query_id_list'] = answers_df['question_tags'].apply(lambda tags: get_query_ids(tags, query_mapping_df))
answers_df.rename(columns={"best_answer": "user_accepted"}, inplace=True)
answers_df.head()

Unnamed: 0,number,url,title,question,question_tags,answers,lawyers,posted_times,answer_card_text,stars,reviews,rating,helpful,lawyers_agree,user_accepted,lawyer_id,lawyer_url,query_id_list
0,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"[Bankruptcy, Debt, Bankruptcy and debt]",Definitely get out of this arrangement immedia...,https://www.avvo.com/attorneys/750961.html,2021-07-18,Answer\nLarry R. Maitland II\nSocial Security ...,5.0,20.0,9.5,1.0,1.0,False,1,https://www.avvo.com/attorneys/750961.html,"[1, 3, 2]"
1,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"[Bankruptcy, Debt, Bankruptcy and debt]",Sounds to me like you are getting scammed. Was...,https://www.avvo.com/attorneys/370602.html,2021-07-16,Answer\nStuart Gregory Steingraber\nBankruptcy...,4.994382,178.0,9.8,1.0,1.0,False,2,https://www.avvo.com/attorneys/370602.html,"[1, 3, 2]"
2,0,https://www.avvo.com/legal-answers/a-company-a...,A company assigned by SSA to pay my bills.,I have an organization assigned by SSA to pay ...,"[Bankruptcy, Debt, Bankruptcy and debt]",Why not pay bills yourself? You are in control...,https://www.avvo.com/attorneys/16108.html,2021-07-16,Answer\nRichard D. Granvold\nChapter 7 Bankrup...,4.680556,72.0,,0.0,1.0,False,3,https://www.avvo.com/attorneys/16108.html,"[1, 3, 2]"
3,1,https://www.avvo.com/legal-answers/high-credit...,High credit card balance consolidation offer.,Can I be held accountable for my late fathers ...,"[Bankruptcy, Credit, Debt, Debt settlement, De...",Were you on the credit card along with your fa...,https://www.avvo.com/attorneys/383564.html,2021-07-15,Answer\nHarlene Miller\nBankruptcy Attorney in...,5.0,10.0,9.0,1.0,1.0,False,4,https://www.avvo.com/attorneys/383564.html,"[1, 4, 3, 37, 109, 2, 42, 34]"
4,1,https://www.avvo.com/legal-answers/high-credit...,High credit card balance consolidation offer.,Can I be held accountable for my late fathers ...,"[Bankruptcy, Credit, Debt, Debt settlement, De...",Probate may be required.\nCreditor may be requ...,https://www.avvo.com/attorneys/312867.html,2021-07-15,Answer\nJames Charles Shields\nBankruptcy Atto...,4.625,24.0,9.7,0.0,1.0,False,5,https://www.avvo.com/attorneys/312867.html,"[1, 4, 3, 37, 109, 2, 42, 34]"


In [14]:
print(answers_df["helpful"].value_counts())

helpful
0.0     8777
1.0     2193
2.0      159
3.0       44
4.0       16
5.0       10
6.0        8
8.0        5
7.0        5
9.0        3
14.0       2
20.0       1
Name: count, dtype: int64


In [15]:
answers_df.head().to_csv('first_5_rows_answers_df.csv', index=False)

### Implement expert-lawyer criteria

A lawyer is an expert for a query/tag if all of the following hold:
- They have 10 or more answers in the bankruptcy category that were accepted by the asker (column "user_accepted"). Our entire dataset consists of bankruptcy questions, so this is counted across the whole dataset.
- They have more best answers for the query than the average lawyer. A best answer is one that was accepted by the asker (column "user_accepted" is True) OR that more than 3 lawyers found useful (column "lawyers_agree").
- Their ratio of best answers to total answers is higher than the average ratio in the query category.

In [16]:
# Define a function to determine expert lawyers
def identify_expert_lawyers(answers_df, query_mapping_df):
    # Initialize an empty list to store expert labels
    expert_labels = []

    # Calculate global user_accepted counts for each lawyer
    global_user_accepted_counts = answers_df[
        answers_df['user_accepted'] == True
    ].groupby('lawyer_id').size()

    # Loop through each query in the mapping DataFrame
    for _, query_row in query_mapping_df.iterrows():
        query_id = query_row['query_id']
        query_name = query_row['query']

        # Filter the answers DataFrame for rows related to the current query
        query_answers = answers_df[answers_df['query_id_list'].apply(lambda x: query_id in x)]

        if query_answers.empty:
            continue

        # Calculate metrics
        lawyer_answer_counts = query_answers.groupby('lawyer_id').size()
        lawyer_best_answer_counts = query_answers[
            (query_answers['user_accepted'] == True) | (query_answers['lawyers_agree'] > 3)
        ].groupby('lawyer_id').size()

        # Average metrics
        avg_best_answers_per_query = lawyer_best_answer_counts.sum() / len(lawyer_answer_counts)
        avg_best_answer_ratio = (lawyer_best_answer_counts / lawyer_answer_counts).mean()

        # Identify experts and create labels
        for lawyer_id in lawyer_answer_counts.index:
            is_expert = int(
                global_user_accepted_counts.get(lawyer_id, 0) >= 10 and \
                lawyer_best_answer_counts.get(lawyer_id, 0) > avg_best_answers_per_query and \
                (lawyer_best_answer_counts.get(lawyer_id, 0) / lawyer_answer_counts[lawyer_id]) > avg_best_answer_ratio
            )
            expert_labels.append({'query_id': query_id, 'lawyer_id': lawyer_id, 'label': is_expert})

    return pd.DataFrame(expert_labels)

# Call the function
expert_lawyers_df = identify_expert_lawyers(answers_df, query_mapping_df)

# Display the result
print(expert_lawyers_df)


       query_id  lawyer_id  label
0             1          1      0
1             1          2      0
2             1          3      0
3             1          4      0
4             1          5      0
...         ...        ...    ...
30071       131       1796      0
30072       131       1797      0
30073       131       1800      0
30074       131       1801      0
30075       131       1888      0

[30076 rows x 3 columns]


In [17]:
# Count unique expert lawyers
unique_expert_lawyers = expert_lawyers_df[expert_lawyers_df['label'] == 1]['lawyer_id'].nunique()

print(f"Number of unique expert lawyers: {unique_expert_lawyers}")

Number of unique expert lawyers: 3


We run into a problem when using the original expert criteria from the paper. Because our dataset differs and is smaller, only 3 lawyers qualify as experts, far fewer than the 61 in the paper. We therefore adjust the expert criteria slightly. Instead of the "user_accepted" condition, which is met very rarely, we use "helpful", which counts how many other platform users (non-lawyers) found the answer useful. We also lower the threshold for the dataset-wide count of helpful answers from 10 to 8.
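
One detail of `identify_expert_lawyers` also affects the third criterion. Lawyers without any best answer are missing from `lawyer_best_answer_counts`, so the division yields NaN for them, and `.mean()` skips NaN by default. The average ratio is therefore taken only over lawyers with at least one best answer, while `avg_best_answers_per_query` divides by all lawyers. The paper does not say which average is intended. A toy sketch with made-up counts (not values from our data) shows the difference:

```python
import pandas as pd

# Toy values: answers per lawyer for one query
answer_counts = pd.Series({1: 4, 2: 2, 3: 5})
# Only lawyer 1 has any best answers, so lawyers 2 and 3 are absent here
best_counts = pd.Series({1: 2})

# As in the notebook: NaN for lawyers 2 and 3, which .mean() ignores
ratio_skipping_nan = (best_counts / answer_counts).mean()

# Alternative: treat missing lawyers as having zero best answers
aligned_best = best_counts.reindex(answer_counts.index, fill_value=0)
ratio_all_lawyers = (aligned_best / answer_counts).mean()

print(ratio_skipping_nan, ratio_all_lawyers)
```

The first average is the higher of the two, which makes the ratio threshold stricter than an average over all lawyers would.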

In [18]:
# Define a function to determine expert lawyers
def identify_expert_lawyers(answers_df, query_mapping_df):
    # Initialize an empty list to store expert labels
    expert_labels = []

    # Calculate global helpful answer counts for each lawyer
    global_helpful_counts = answers_df[answers_df['helpful'] >= 1].groupby('lawyer_id').size()

    # Loop through each query in the mapping DataFrame
    for _, query_row in query_mapping_df.iterrows():
        query_id = query_row['query_id']
        query_name = query_row['query']

        # Filter the answers DataFrame for rows related to the current query
        query_answers = answers_df[answers_df['query_id_list'].apply(lambda x: query_id in x)]

        if query_answers.empty:
            continue

        # Calculate metrics
        lawyer_answer_counts = query_answers.groupby('lawyer_id').size()
        lawyer_best_answer_counts = query_answers[
            (query_answers['helpful'] >= 1) | (query_answers['lawyers_agree'] > 3)
        ].groupby('lawyer_id').size()

        # Average metrics
        avg_best_answers_per_query = lawyer_best_answer_counts.sum() / len(lawyer_answer_counts)
        avg_best_answer_ratio = (lawyer_best_answer_counts / lawyer_answer_counts).mean()

        # Identify experts and create labels
        for lawyer_id in lawyer_answer_counts.index:
            is_expert = int(
                global_helpful_counts.get(lawyer_id, 0) >= 8 and \
                lawyer_best_answer_counts.get(lawyer_id, 0) > avg_best_answers_per_query and \
                (lawyer_best_answer_counts.get(lawyer_id, 0) / lawyer_answer_counts[lawyer_id]) > avg_best_answer_ratio
            )
            expert_labels.append({'query_id': query_id, 'lawyer_id': lawyer_id, 'label': is_expert})

    return pd.DataFrame(expert_labels)

# Call the function
expert_lawyers_df = identify_expert_lawyers(answers_df, query_mapping_df)

# Display the results
expert_lawyers_df.head()


Unnamed: 0,query_id,lawyer_id,label
0,1,1,0
1,1,2,0
2,1,3,0
3,1,4,0
4,1,5,0


In [19]:
# Count unique expert lawyers
unique_expert_lawyers = expert_lawyers_df[expert_lawyers_df['label'] == 1]['lawyer_id'].nunique()

print(f"Number of unique expert lawyers: {unique_expert_lawyers}")

Number of unique expert lawyers: 53


With this, we have a dataframe with one row per lawyer-query combination that occurs in the data, where the "label" column denotes whether the lawyer is an expert for that query. We also obtain 53 expert lawyers, which is reasonably close to the 61 reported in the paper.

As per the paper methodology, only queries with at least two expert lawyers are retained.

In [20]:
# Filter queries that have at least two ones in the "label" column
query_label_counts = expert_lawyers_df[expert_lawyers_df['label'] == 1].groupby('query_id').size()
queries_with_at_least_two_ones = query_label_counts[query_label_counts >= 2].index

# Retain only the unique queries in expert_lawyers_df that meet the criteria
filtered_expert_lawyers_df = expert_lawyers_df[expert_lawyers_df['query_id'].isin(queries_with_at_least_two_ones)]

In [21]:
print(filtered_expert_lawyers_df.shape)

(29810, 3)


In [22]:
filtered_expert_lawyers_df.head()

Unnamed: 0,query_id,lawyer_id,label
0,1,1,0
1,1,2,0
2,1,3,0
3,1,4,0
4,1,5,0


In [23]:
if save_datasets:
    filtered_expert_lawyers_df.to_csv('../data/own_files/labels_own.csv', index=False, header=True)

BERT is trained on query-answer pairs, each with a label indicating whether the answer was written by a lawyer who is relevant to the query. To build these, we combine `answers_df` (the lawyers' answers), `query_mapping_df` (the corresponding queries), and `filtered_expert_lawyers_df` (the labels).

In [24]:
merged_df = pd.merge(filtered_expert_lawyers_df, query_mapping_df, on="query_id", how="inner")
label_answer_df = pd.merge(merged_df, answers_df, on="lawyer_id", how="inner")
# Select the relevant columns for the final output
final_df = label_answer_df[["lawyer_id", "answers", "query_id", "query", "label"]]
final_df.head(20)

Unnamed: 0,lawyer_id,answers,query_id,query,label
0,1,Definitely get out of this arrangement immedia...,1,Bankruptcy,0
1,2,Sounds to me like you are getting scammed. Was...,1,Bankruptcy,0
2,2,The debtor's lawyer cannot knowingly submit in...,1,Bankruptcy,0
3,2,"Chances are you can't remove the lien, but may...",1,Bankruptcy,0
4,2,What does your BK lawyer say? No lawyer? For s...,1,Bankruptcy,0
5,2,"With the new equity exemption amounts, your BK...",1,Bankruptcy,0
6,2,My colleagues have given you good advice with ...,1,Bankruptcy,0
7,2,"Probably not. To be certain, ask your BK lawye...",1,Bankruptcy,0
8,2,What does your lawyer say? No lawyer? Remember...,1,Bankruptcy,0
9,2,"Your question has several ""moving parts"" invol...",1,Bankruptcy,0


Even though we do not have exactly the same data, we can still apply the data-splitting methodology outlined in the paper: the expert lawyers are divided roughly evenly among the train, validation, and test sets, and all non-relevant lawyers appear in all three sets.

In [25]:
# Identify relevant and non-relevant lawyers
relevant_lawyers = final_df[final_df['label'] == 1]['lawyer_id'].unique()
non_relevant_lawyers = final_df[~final_df['lawyer_id'].isin(relevant_lawyers)]['lawyer_id'].unique()

# Ensure equal split of relevant lawyers
num_relevant = len(relevant_lawyers)
split_size = num_relevant // 3

train_relevant = relevant_lawyers[:split_size]
val_relevant = relevant_lawyers[split_size:2*split_size]
test_relevant = relevant_lawyers[2*split_size:]
print(train_relevant.shape, val_relevant.shape, test_relevant.shape)

# Create subsets
train_set = final_df[final_df['lawyer_id'].isin(train_relevant)]
val_set = final_df[final_df['lawyer_id'].isin(val_relevant)]
test_set = final_df[final_df['lawyer_id'].isin(test_relevant)]

# Add all non-relevant lawyers to each subset
train_set = pd.concat([train_set, final_df[final_df['lawyer_id'].isin(non_relevant_lawyers)]])
val_set = pd.concat([val_set, final_df[final_df['lawyer_id'].isin(non_relevant_lawyers)]])
test_set = pd.concat([test_set, final_df[final_df['lawyer_id'].isin(non_relevant_lawyers)]])

# Reset index
train_set.reset_index(drop=True, inplace=True)
val_set.reset_index(drop=True, inplace=True)
test_set.reset_index(drop=True, inplace=True)

(17,) (17,) (19,)
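
The split above assigns lawyers in the order they appear in `final_df`, and `num_relevant // 3` leaves the remainder in the test set (17/17/19). A sketch of a seeded, shuffled alternative with more balanced sizes follows. The helper name `split_relevant_lawyers` and the seed value are assumptions made for illustration; the paper reports no seed, so this cannot reproduce the authors' split.

```python
import random


def split_relevant_lawyers(lawyer_ids, seed=42):
    """Shuffle lawyer IDs with a fixed seed and split them into three parts.

    Hypothetical helper: the remainder is spread over the first parts,
    so the sizes differ by at most one.
    """
    ids = list(lawyer_ids)
    random.Random(seed).shuffle(ids)  # local RNG, does not touch global state
    base, extra = divmod(len(ids), 3)
    sizes = [base + (1 if i < extra else 0) for i in range(3)]
    train = ids[:sizes[0]]
    val = ids[sizes[0]:sizes[0] + sizes[1]]
    test = ids[sizes[0] + sizes[1]:]
    return train, val, test


# Toy IDs standing in for the 53 expert lawyers
toy_train, toy_val, toy_test = split_relevant_lawyers(range(1, 54), seed=42)
print(len(toy_train), len(toy_val), len(toy_test))
```

The returned lists can be passed to `isin` in the same way as `train_relevant`, `val_relevant`, and `test_relevant`.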


In [26]:
train_set.head()

Unnamed: 0,lawyer_id,answers,query_id,query,label
0,7,If the IRS's right to attach your refund arose...,1,Bankruptcy,1
1,7,"As the other attorneys have pointed out, now i...",1,Bankruptcy,1
2,7,Your situation presents complicated issues tha...,1,Bankruptcy,1
3,7,There is nothing stopping you from filing eith...,1,Bankruptcy,1
4,7,"Yes, absolutely. Do you mean can you cancel th...",1,Bankruptcy,1


We tokenize the data while keeping the original row index of each instance, so that the results can be interpreted later. The tokens are saved to disk so that the process does not need to be repeated on every run.

In [27]:
# Function to tokenize the query-answer pairs and preserve additional information
def tokenize_query_answer_with_metadata(row):
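    # Pass query and answer as a text pair: the tokenizer inserts [CLS] and [SEP] itself.
    # Writing them into the string by hand duplicates them (the cached output starts with 101, 101).
    # Set load_tokens = False once to regenerate the cached tokens with this format.
    tokenized = tokenizer(row['query'], row['answers'], padding='max_length', truncation=True, max_length=512)
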
    
    # Add original row metadata to the tokenized result
    tokenized_with_metadata = {
        "input_ids": tokenized["input_ids"],
        "attention_mask": tokenized["attention_mask"],
        "original_index": row.name  # Original row index
    }
    return tokenized_with_metadata

# Function to tokenize an entire dataframe
def tokenize_dataframe_with_metadata(dataframe):
    # Apply the tokenization function to each row
    tokenized_data = dataframe.apply(tokenize_query_answer_with_metadata, axis=1)
    # Convert the tokenized data to a DataFrame for structured handling
    tokenized_df = pd.DataFrame(tokenized_data.tolist())
    return tokenized_df

# Define a function to save tokenized data
def save_tokenized_data(dataframe, filename):
    path = f'../data/own_files/{filename}.pkl'
    with open(path, 'wb') as f:
        pickle.dump(dataframe, f)
    print(f"Tokenized data saved to {path}")

if not load_tokens:
    # Load the BERT tokenizer
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    
    # Tokenize train, val, and test sets with metadata
    tokenized_train_set = tokenize_dataframe_with_metadata(train_set)
    tokenized_val_set = tokenize_dataframe_with_metadata(val_set)
    tokenized_test_set = tokenize_dataframe_with_metadata(test_set)

    # Save the tokenized dataframes
    save_tokenized_data(tokenized_train_set, 'train_set_tokenized')
    save_tokenized_data(tokenized_val_set, 'val_set_tokenized')
    save_tokenized_data(tokenized_test_set, 'test_set_tokenized')
else:
    # Load the tokenized data from files
    with open('../data/own_files/train_set_tokenized.pkl', 'rb') as f:
        tokenized_train_set = pickle.load(f)
    with open('../data/own_files/val_set_tokenized.pkl', 'rb') as f:
        tokenized_val_set = pickle.load(f)
    with open('../data/own_files/test_set_tokenized.pkl', 'rb') as f:
        tokenized_test_set = pickle.load(f)


In [28]:
print(tokenized_train_set.shape, tokenized_val_set.shape, tokenized_test_set.shape)

(264453, 3) (297442, 3) (486486, 3)


In [29]:
tokenized_train_set.head()

Unnamed: 0,input_ids,attention_mask,original_index
0,"[101, 101, 10528, 102, 2065, 1996, 25760, 1005...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",0
1,"[101, 101, 10528, 102, 2004, 1996, 2060, 16214...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1
2,"[101, 101, 10528, 102, 2115, 3663, 7534, 8552,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",2
3,"[101, 101, 10528, 102, 2045, 2003, 2498, 7458,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",3
4,"[101, 101, 10528, 102, 2748, 1010, 7078, 1012,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",4


In [None]:
# Reset the index of train_set to make it a column
train_set_reset = train_set.reset_index()

# Attach the labels to the tokens via the original row index (TODO: rename this DataFrame)
input_label = pd.merge(tokenized_train_set, train_set_reset, left_on='original_index', right_on='index')

# Keep only the token columns, the original index, and the label
input_label = input_label.drop(columns=['index', 'lawyer_id', 'answers', 'query_id', 'query'])

input_label.head()

Unnamed: 0,input_ids,attention_mask,original_index,label
0,"[101, 101, 10528, 102, 2065, 1996, 25760, 1005...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",0,1
1,"[101, 101, 10528, 102, 2004, 1996, 2060, 16214...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",1,1
2,"[101, 101, 10528, 102, 2115, 3663, 7534, 8552,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",2,1
3,"[101, 101, 10528, 102, 2045, 2003, 2498, 7458,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",3,1
4,"[101, 101, 10528, 102, 2748, 1010, 7078, 1012,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...",4,1


In [34]:
# Separate positive and negative examples
positive_df = input_label[input_label['label'] == 1]
negative_df = input_label[input_label['label'] == 0]

# Create a list to store the pairs
pairs = []

# Number of negative samples to pair with each positive sample
num_negative_samples = 1  # Adjust as needed

# Generate pairs
for _, pos_row in positive_df.iterrows():
    for _ in range(num_negative_samples):
        # Randomly select a negative sample
        neg_row = negative_df.sample(n=1).iloc[0]
        
        # Create a pair dictionary
        pairs.append({
            'positive_input_ids': pos_row['input_ids'],
            'positive_attention_mask': pos_row['attention_mask'],
            'negative_input_ids': neg_row['input_ids'],
            'negative_attention_mask': neg_row['attention_mask']
        })

# Convert the pairs to a DataFrame
pairs_df = pd.DataFrame(pairs)
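
The cell above draws each negative from all negatives in the training set, regardless of query, and without a seed, so the pairs change between runs. Below is a minimal sketch of a seeded variant that samples negatives from the same query as the positive. It assumes `query_id` is kept in the frame (the earlier merge drops it). The function name `build_pairs`, its parameters, and the toy data are illustrative and not part of the original notebook; whether the paper samples negatives per query is not stated.

```python
import numpy as np
import pandas as pd


def build_pairs(df, num_negative_samples=1, seed=42):
    """Pair every positive row with negatives sampled from the same query.

    Hypothetical helper: expects the columns query_id, label, input_ids
    and attention_mask.
    """
    rng = np.random.default_rng(seed)  # fixed seed gives identical pairs on every run
    pairs = []
    for query_id, group in df.groupby("query_id"):
        positives = group[group["label"] == 1]
        negatives = group[group["label"] == 0]
        if positives.empty or negatives.empty:
            continue  # a pair needs one row of each class
        for _, pos_row in positives.iterrows():
            # Sample negative row positions with replacement
            for pos in rng.integers(0, len(negatives), size=num_negative_samples):
                neg_row = negatives.iloc[int(pos)]
                pairs.append({
                    "query_id": query_id,
                    "positive_input_ids": pos_row["input_ids"],
                    "positive_attention_mask": pos_row["attention_mask"],
                    "negative_input_ids": neg_row["input_ids"],
                    "negative_attention_mask": neg_row["attention_mask"],
                })
    return pd.DataFrame(pairs)


# Toy data: two queries, token IDs are made up
toy = pd.DataFrame({
    "query_id": [1, 1, 1, 2, 2],
    "label": [1, 0, 0, 1, 0],
    "input_ids": [[101, 5], [101, 6], [101, 7], [101, 8], [101, 9]],
    "attention_mask": [[1, 1]] * 5,
})
toy_pairs = build_pairs(toy, num_negative_samples=2, seed=0)
```

Grouping by query also avoids calling `sample` on the full negative frame once per positive row.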


In [None]:
class PairwiseDataset(Dataset):
    def __init__(self, pairs_df):
        self.pairs_df = pairs_df
    
    def __len__(self):
        return len(self.pairs_df)
    
    def __getitem__(self, idx):
        row = self.pairs_df.iloc[idx]
        
        return {
            'positive_input_ids': torch.tensor(row['positive_input_ids'], dtype=torch.long),
            'positive_attention_mask': torch.tensor(row['positive_attention_mask'], dtype=torch.long),
            'negative_input_ids': torch.tensor(row['negative_input_ids'], dtype=torch.long),
            'negative_attention_mask': torch.tensor(row['negative_attention_mask'], dtype=torch.long)
        }

# Pairwise Hinge Loss
def pairwise_hinge_loss(positive_scores, negative_scores, margin=1.0):
    return torch.mean(torch.clamp(margin - positive_scores + negative_scores, min=0))

# Define BERT Ranker Model
class BertRanker(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.classifier = torch.nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_output = outputs.last_hidden_state[:, 0, :]  # [CLS] token output
        scores = self.classifier(cls_output)
        return scores

# Training Function
def train_bert_ranker(model, dataloader, num_epochs=100, gradient_accumulation_steps=1):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)

    # Define Optimizer
    optimizer = AdamW([
        {'params': model.bert.parameters(), 'lr': 2e-5},  # BERT layers
        {'params': model.classifier.parameters(), 'lr': 0.001}  # Classifier layer
    ])

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0
        optimizer.zero_grad()

        for step, batch in enumerate(dataloader):
            # Move data to device
            positive_input_ids = batch['positive_input_ids'].to(device)
            positive_attention_mask = batch['positive_attention_mask'].to(device)
            negative_input_ids = batch['negative_input_ids'].to(device)
            negative_attention_mask = batch['negative_attention_mask'].to(device)

            # Forward pass
            positive_scores = model(positive_input_ids, positive_attention_mask)
            negative_scores = model(negative_input_ids, negative_attention_mask)

            # Compute loss
            loss = pairwise_hinge_loss(positive_scores, negative_scores)
            epoch_loss += loss.item()

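            # Backward pass (scaled so that accumulated gradients are averaged over the accumulation steps)
            (loss / gradient_accumulation_steps).backward()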

            # Gradient accumulation
            if (step + 1) % gradient_accumulation_steps == 0 or (step + 1) == len(dataloader):
                optimizer.step()
                optimizer.zero_grad()

        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss / len(dataloader):.4f}")

    # Save the trained model
    torch.save(model.state_dict(), "bert_ranker.pth")
    print("Model saved to bert_ranker.pth")


# Assume pairs_df is prepared as per previous steps
dataloader = DataLoader(PairwiseDataset(pairs_df), batch_size=16, shuffle=True)

# Initialize Model
bert_ranker = BertRanker()

# Train Model
train_bert_ranker(bert_ranker, dataloader, num_epochs=10, gradient_accumulation_steps=1)


model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
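
For reference, the `pairwise_hinge_loss` function in the training cell above computes the following for a batch of $N$ positive/negative pairs with scores $s_i^{+}$ and $s_i^{-}$ and margin $m$ (1.0 by default):

$$
\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \max\left(0,\; m - s_i^{+} + s_i^{-}\right)
$$

A worked example with made-up scores (not results from this notebook), using $m = 1$ and $N = 2$:

$$
\begin{aligned}
(s_1^{+}, s_1^{-}) = (2.0,\ 0.5) &: \quad \max(0,\ 1 - 2.0 + 0.5) = 0 \\
(s_2^{+}, s_2^{-}) = (0.2,\ 0.4) &: \quad \max(0,\ 1 - 0.2 + 0.4) = 1.2 \\
\mathcal{L} &= \tfrac{1}{2}\,(0 + 1.2) = 0.6
\end{aligned}
$$

A pair contributes nothing once the positive answer outscores the negative one by at least the margin, so only pairs that are misordered or too close drive the updates.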


## Modeling