This code snippet reads a CSV file named "train.csv" and loads its data into a Pandas DataFrame named `train_df`.

In [1]:
import pandas as pd
train_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/train.csv')
train_df

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,D
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,C
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,D
...,...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...,C
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,B
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be...",B
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...,D


This code snippet involves several steps to encode the training data for a multiple-choice question answering task using the Hugging Face Transformers library.

1. It imports necessary libraries, including `AutoTokenizer` from Transformers and `Dataset` from Datasets.

2. The variable `MODEL_DIR` is set to the directory path where the pretrained BERT model is located.

3. An instance of the BERT tokenizer is created using `AutoTokenizer.from_pretrained()` with the specified BERT model directory.

4. The `encode()` function is defined. This function takes a row from the training DataFrame (`train_df`) and performs the following steps:
   - It retrieves the question prompt and options from the DataFrame row.
   - It maps the answer labels ('A', 'B', 'C', 'D', 'E') to integer values (0, 1, 2, 3, 4) and gets the correct answer's integer ID.
   - It tokenizes the question prompt and each option using the BERT tokenizer. The encoded result includes the question and option as a text pair, with truncation, padding, and a maximum sequence length of 512 tokens.
   - It sets the 'labels' key in the encoded data dictionary to 1 if the current option is the correct answer, otherwise 0.
   - It appends the encoded data to a list named `encoded_rows`.

5. A loop iterates over each row in the `train_df` DataFrame. For each row, the `encode()` function is called, and the resulting encoded data is extended to the `encoded_train` list.

6. After the loop, the `encoded_train` list contains dictionaries representing individual encoded examples.

7. The `encoded_train` list is converted into a `Dataset` using the `Dataset.from_dict()` method. Each key in the dictionary corresponds to a key in the encoded examples, and the corresponding values are extracted and grouped together to form the dataset.

At this point, the `encoded_train_dataset` contains the encoded training data in the form of a `Dataset` that can be used for training a multiple-choice question answering model with the Transformers library.

In [2]:
from transformers import AutoTokenizer
from datasets import Dataset

MODEL_DIR = "/kaggle/input/huggingface-bert/"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR + "bert-large-uncased")

def encode(row):
    # Format the context and the options.
    prompt = str(row['prompt'])
    options = [str(option) for option in row[['A', 'B', 'C', 'D', 'E']].values.tolist()]
    
    answer_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
    correct_answer_id = answer_mapping[row['answer']]

    encoded_rows = []
    # Tokenize the question and the options, and include the correct answer label.
    for idx, option in enumerate(options):
        text_pair = [prompt, option]
        encoded = tokenizer(text_pair, truncation = True, padding = 'max_length', max_length = 512)
        
        # We set the label to 1 if this is the correct answer, otherwise 0.
        encoded['labels'] = 1 if idx == correct_answer_id else 0
        encoded_rows.append(encoded)

    return encoded_rows

encoded_train = []
for _, row in train_df.iterrows():
    encoded_train.extend(encode(row))

# Now each item in encoded_train is a dictionary representing a single example.
# We can convert it into a Dataset.
encoded_train_dataset = Dataset.from_dict({key: [dic[key] for dic in encoded_train] for key in encoded_train[0]})

In [3]:
encoded_train_dataset

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 1000
})

# See the Labels

In [4]:
print(encoded_train_dataset['labels'][:10])

[0, 0, 0, 1, 0, 1, 0, 0, 0, 0]


In [5]:
answer_mapping = {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
train_labels = train_df['answer'].map(answer_mapping)

In [6]:
train_labels

0      3
1      0
2      0
3      2
4      3
      ..
195    2
196    1
197    1
198    3
199    2
Name: answer, Length: 200, dtype: int64

# Initialize the Model

We need to initialize the LLM for fine-tuning. We use a version of the LLM that is **specifically designed for multiple-choice tasks**.

**This time we use BERT-large.** The primary difference between the "base" and "large" versions of BERT models lies in **their size, which is reflected in the number of parameters they have, the number of transformer layers (i.e., the "depth" of the network), and the size of these layers (i.e., the "width" of the network)**. This directly impacts the model's capacity to learn from data, its computational requirements, and its performance on different tasks.

Here's a quick comparison:

    BERT-base: BERT-base models are smaller versions, with 12 transformer layers, each with a hidden size of 768, and 12 attention heads. This results in a total of about 110 million parameters.

    BERT-large: BERT-large models are much bigger, with 24 transformer layers, each with a hidden size of 1024, and 16 attention heads. This results in a total of about 340 million parameters.

**Because BERT-large models are larger and have more parameters, they have a greater capacity to learn and model complex patterns in data.** As a result, they typically perform better on tasks involving understanding natural language. **However, they also require more computational resources (both for training and inference), and the improvements they provide may not always justify the increased computational cost**, depending on the specific application and available resources.

**The uncased model does not distinguish between uppercase and lowercase letters (it lowercases all input before tokenizing), whereas the cased model does keep the original letter cases.**

In [7]:
from transformers import AutoModelForMultipleChoice

model = AutoModelForMultipleChoice.from_pretrained(MODEL_DIR + "bert-large-uncased")

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']
Some weights of the model checkpoint at /kaggle/input/huggingface-bert/bert-large-uncased were not used when initializing BertForMultipleChoice: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMultipleChoice from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSe

# Train the Model

Finally, we can train the model using a library like Hugging Face's Transformers, which provides an easy-to-use Trainer class. We need to provide our encoded dataset, the correct labels, and some training arguments to the Trainer, and then call the train method to start training.

In [9]:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions
    map3 = mean_average_precision_at_3(labels, preds)
    return {
        'map3': map3
    }

def mean_average_precision_at_3(labels, preds):
    ap3s = [average_precision_at_3(label, pred) for label, pred in zip(labels, preds)]
    return sum(ap3s) / len(ap3s)

def average_precision_at_3(label, pred):
    try:
        return (1 / (pred[:3].index(label) + 1))
    except ValueError:
        return 0

This code **computes the average precision at 3 for each question, then takes the mean of these scores**. The average_precision_at_3 function returns the precision at the rank of the correct label if it is within the top 3 predictions, or 0 otherwise. It uses the index method to find the rank of the correct label, adding 1 because index is 0-based while ranks are 1-based. The try/except block handles the case where the correct label is not in the top 3 predictions.

In [12]:
from transformers import TrainingArguments, Trainer

# Disable wandb globally.
import os
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir = './finetuned_bert',  # change to a local directory
    num_train_epochs = 3,
    per_device_train_batch_size = 1,
    learning_rate = 2e-5,
    gradient_accumulation_steps = 2,
    report_to =  [],  # Disable all integrations.
)


trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = encoded_train_dataset,
    compute_metrics = compute_metrics,  # optional function to compute metrics for evaluation
)

In [13]:
trainer.train()



Step,Training Loss
500,0.8751
1000,0.7368
1500,0.9487


TrainOutput(global_step=1500, training_loss=0.8535211995442709, metrics={'train_runtime': 1267.5025, 'train_samples_per_second': 2.367, 'train_steps_per_second': 1.183, 'total_flos': 5591569287168000.0, 'train_loss': 0.8535211995442709, 'epoch': 3.0})

# Predict the Test Data

We will make predictions with the trained model and test data.

In [14]:
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
test_df

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...
...,...,...,...,...,...,...,...
195,195,What is the relation between the three moment ...,The three moment theorem expresses the relatio...,The three moment theorem is used to calculate ...,The three moment theorem describes the relatio...,The three moment theorem is used to calculate ...,The three moment theorem is used to derive the...
196,196,"What is the throttling process, and why is it ...",The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...,The throttling process is a steady adiabatic f...,The throttling process is a steady flow of a f...,The throttling process is a steady adiabatic f...
197,197,What happens to excess base metal as a solutio...,"The excess base metal will often solidify, bec...",The excess base metal will often crystallize-o...,"The excess base metal will often dissolve, bec...","The excess base metal will often liquefy, beco...","The excess base metal will often evaporate, be..."
198,198,"What is the relationship between mass, force, ...",Mass is a property that determines the weight ...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is an inertial property that determines a...,Mass is a property that determines the size of...


Encoding: This is the step we are performing with our encode_test() function. Each prompt and option pair is tokenized.

In [15]:
def encode_test(example):
    # Format the context and the options.
    prompt = str(example['prompt'])
    options = [str(option) for option in example[['A', 'B', 'C', 'D', 'E']].values.tolist()]
    examples = []

    # Tokenize the question and the options.
    for option in options:
        text_pair = [prompt, option]
        encoded = tokenizer(text_pair, truncation = True, padding = 'max_length', max_length = 512)
        examples.append(encoded)

    return examples

encoded_test_df = test_df.apply(encode_test, axis = 1)
encoded_test_df

0      [[input_ids, token_type_ids, attention_mask], ...
1      [[input_ids, token_type_ids, attention_mask], ...
2      [[input_ids, token_type_ids, attention_mask], ...
3      [[input_ids, token_type_ids, attention_mask], ...
4      [[input_ids, token_type_ids, attention_mask], ...
                             ...                        
195    [[input_ids, token_type_ids, attention_mask], ...
196    [[input_ids, token_type_ids, attention_mask], ...
197    [[input_ids, token_type_ids, attention_mask], ...
198    [[input_ids, token_type_ids, attention_mask], ...
199    [[input_ids, token_type_ids, attention_mask], ...
Length: 200, dtype: object

Prediction: Next, we need to loop over the encoded inputs, feed them to the model, and store the model outputs.

In [16]:
import torch

# Check if a GPU is available and if not, default to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Reduce batch size and limit sequence length.
batch_size = 2
sequence_length = 128

predictions = []
for row in encoded_test_df:
    # Truncate or pad sequences to a fixed length.
    row = row[:sequence_length]

    # Create tensors for input_ids and attention_mask.
    input_ids = torch.tensor([item['input_ids'] for item in row], dtype = torch.long).to(device)
    attention_mask = torch.tensor([item['attention_mask'] for item in row], dtype = torch.long).to(device)

    # Run inference with reduced batch size.
    with torch.no_grad():
        outputs = model(input_ids = input_ids, attention_mask = attention_mask)

    predictions.append(outputs.logits.detach().cpu().numpy())

    # Free GPU memory by deleting tensors.
    del input_ids, attention_mask, outputs

In [17]:
predictions[0:5]

[array([[ 3.0476558, -1.3318549],
        [ 3.3362415, -1.2809052],
        [ 3.1815457, -1.5713125],
        [ 3.1966562, -1.5860578],
        [ 3.276362 , -1.494632 ]], dtype=float32),
 array([[ 3.1131527, -1.2204671],
        [ 3.2919345, -1.607218 ],
        [ 3.4295268, -1.1190839],
        [ 3.3223789, -1.2548648],
        [ 3.1287382, -1.0641866]], dtype=float32),
 array([[ 3.150208 , -1.2270874],
        [ 3.2730749, -1.2493844],
        [ 3.5324976, -1.5926801],
        [ 3.2353568, -1.6759583],
        [ 2.9885433, -1.3391691]], dtype=float32),
 array([[ 3.328914 , -1.7800725],
        [ 3.5202162, -1.4254386],
        [ 3.4338472, -1.2145462],
        [ 2.9645188, -1.6453615],
        [ 3.3694499, -1.5751523]], dtype=float32),
 array([[ 3.5551229, -1.4472487],
        [ 3.1200242, -0.94486  ],
        [ 3.0833318, -1.3686523],
        [ 3.1800926, -1.530232 ],
        [ 3.1129584, -1.0864849]], dtype=float32)]

# Submission

In [18]:
import numpy as np

# Convert the list of predictions to a numpy array.
predictions = np.array(predictions)

# Get the indices of the top 3 predictions for each question.
top_three_indices = (-predictions).argsort(axis = 1)[:, :3].tolist()

In [20]:
# Initialize an empty list to store the extracted values.
top_values = []

# Loop over all elements in the 'top_three_indices' list.
for i in range(len(top_three_indices)):
    # Use a list comprehension to extract the second element (index 1) from each sublist.
    # This will create a new list 'values' containing these three elements.
    values = [top_three_indices[i][j][1] for j in range(3)]
    # Append this new list to our 'top_values' list.
    top_values.append(values)

# Print the resulting list of lists.
print(top_values)

[[1, 0, 4], [4, 2, 0], [0, 1, 4], [2, 1, 4], [1, 4, 2], [3, 1, 4], [3, 2, 1], [0, 3, 2], [3, 0, 1], [3, 0, 1], [2, 0, 4], [0, 4, 3], [3, 4, 2], [2, 3, 0], [2, 3, 0], [2, 3, 4], [1, 0, 3], [2, 0, 1], [0, 4, 1], [3, 1, 4], [4, 2, 1], [1, 4, 3], [1, 3, 4], [2, 3, 0], [4, 0, 2], [2, 1, 4], [4, 0, 3], [3, 0, 4], [3, 4, 0], [1, 0, 3], [0, 2, 3], [0, 1, 2], [1, 3, 0], [4, 3, 1], [4, 2, 0], [3, 1, 4], [1, 3, 0], [3, 4, 0], [4, 2, 1], [0, 3, 2], [4, 2, 3], [4, 2, 3], [4, 3, 2], [3, 4, 0], [4, 1, 2], [2, 4, 3], [3, 0, 4], [3, 1, 2], [2, 3, 0], [3, 1, 0], [2, 0, 1], [1, 4, 0], [4, 1, 2], [4, 0, 2], [2, 0, 4], [2, 0, 1], [0, 2, 3], [4, 0, 2], [1, 0, 2], [2, 1, 4], [1, 0, 4], [2, 0, 3], [4, 2, 3], [0, 4, 3], [0, 4, 1], [4, 3, 1], [3, 4, 2], [3, 1, 2], [2, 3, 0], [1, 0, 4], [3, 1, 2], [2, 3, 1], [3, 4, 0], [0, 4, 1], [3, 2, 0], [3, 1, 4], [3, 4, 1], [1, 2, 4], [2, 0, 1], [0, 2, 4], [1, 2, 0], [0, 4, 3], [1, 4, 0], [4, 1, 0], [2, 1, 4], [0, 3, 1], [0, 3, 2], [4, 0, 2], [0, 3, 4], [1, 4, 3], [4, 2, 1]

In [21]:
# Define a mapping from indices to labels.
index_to_label = {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E'}

# Convert the top three indices to the required format (labels separated by spaces).
top_three_labels = [' '.join([index_to_label[idx] for idx in sublist]) for sublist in top_values]
top_three_labels[0:5]

['B A E', 'E C A', 'A B E', 'C B E', 'B E C']

In [22]:
# Create a new DataFrame for the submission.
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'question': test_df['prompt'],  # Add the question text
    'prediction': top_three_labels
})

# Save the submission DataFrame to a .csv file.
submission_df.to_csv('submission.csv', index=False)


In [23]:
submission_df

Unnamed: 0,id,question,prediction
0,0,Which of the following statements accurately d...,B A E
1,1,Which of the following is an accurate definiti...,E C A
2,2,Which of the following statements accurately d...,A B E
3,3,What is the significance of regularization in ...,C B E
4,4,Which of the following statements accurately d...,B E C
...,...,...,...
195,195,What is the relation between the three moment ...,E A C
196,196,"What is the throttling process, and why is it ...",D A B
197,197,What happens to excess base metal as a solutio...,D B A
198,198,"What is the relationship between mass, force, ...",D E B


# Conclusion

**You will understand the basic concept as to the use of LLMs with the data.**

In [None]:
I am a medical doctor working on **artificial intelligence (AI) for medicine**. At present AI is also widely used in the medical field. Particularly, AI performs in the healthcare sector following tasks: **image classification, object detection, semantic segmentation, GANs, text classification, etc**. **If you are interested in AI for medicine, please see my other notebooks.**