### Setting Up Conda Environment with Python 3.12 and Installing Required Libraries
##### Step 1 - Created a Conda Environment with Python 3.12 and Connected to its Kernel
I created a new Conda environment with Python 3.12 and activated it. After that, I connected to the Python kernel in Jupyter/VSCode using this environment.

Commands to create and activate the environment:
```bash
conda create -n llm_env python=3.12
conda activate llm_env
```
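
To confirm that the notebook kernel really runs inside `llm_env`, a quick check of the interpreter version and path can help. This is a minimal sketch; the helper name `kernel_matches` is hypothetical and not part of the original notebook.

```python
import sys

def kernel_matches(required=(3, 12), actual=None):
    """Return True if the interpreter's major.minor version equals `required`."""
    actual = sys.version_info if actual is None else actual
    return tuple(actual[:2]) == tuple(required)

# Path of the interpreter backing the current kernel; it should point into llm_env.
print(sys.executable)
print("Python 3.12 kernel:", kernel_matches())
```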

##### Step 2 - Imported Required Libraries
Once the environment was set up and activated, I imported the following libraries for my project:
- `numpy` as `np`: For numerical operations and working with arrays.
- `pandas` as `pd`: For data manipulation and analysis.
- `torch`: PyTorch, for deep learning model building and training.
- `tqdm.notebook`: For displaying progress bars in Jupyter notebooks.
- `sklearn.preprocessing.LabelEncoder`: For encoding categorical labels into numerical format.
- `sklearn.model_selection.train_test_split`: For splitting datasets into training and testing sets.
- `transformers.BertTokenizer`: For tokenizing text data for BERT models.
- `transformers.BertForSequenceClassification`: For using BERT for sequence classification tasks.
- `torch.utils.data.TensorDataset`: For creating datasets from tensors.
- `torch.utils.data.DataLoader`: For batching datasets and loading them efficiently during training.
- `torch.utils.data.RandomSampler` and `torch.utils.data.SequentialSampler`: For random and sequential sampling of datasets.
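- `transformers.AdamW` and `transformers.get_linear_schedule_with_warmup`: For the AdamW optimizer and learning rate scheduler used with transformers. Note that `transformers.AdamW` has long been deprecated in favor of `torch.optim.AdamW`, so this import may fail on recent `transformers` releases.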
- `sklearn.metrics.f1_score`: For evaluating model performance using the F1 score.
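- `random`: For random number generation, typically used in data shuffling.
- `sklearn.model_selection.StratifiedKFold`: For stratified K-fold cross-validation splits.
- `matplotlib.pyplot` as `plt`: For plotting.
- `sklearn.utils.shuffle`: For shuffling data.
- `datetime`: For timestamping the saved model file name.
- `collections.defaultdict`: For dictionaries with default values.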

Commands to install the libraries:
```bash
conda install numpy pandas scikit-learn tqdm pytorch matplotlib
conda install -c huggingface transformers
```

Once the libraries were installed, I successfully imported them into the notebook and was ready to proceed with building and training the machine learning model.
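
Before running the import cell, it can be useful to check which packages are visible to the active environment. The following is a minimal sketch using only the standard library; the helper `missing_packages` and the `REQUIRED` list are hypothetical names, and the list is assumed from the import cell of this notebook.

```python
import importlib.util

# Top-level import names used by the notebook (assumed from the import cell).
REQUIRED = ["numpy", "pandas", "torch", "tqdm", "sklearn", "transformers", "matplotlib"]

def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

An empty result suggests the environment is ready; any listed name still needs to be installed.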

In [104]:
import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, StratifiedKFold
from transformers import BertTokenizer, BertForSequenceClassification, AdamW, get_linear_schedule_with_warmup
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.metrics import f1_score
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
import datetime
import random
from collections import defaultdict

In [105]:
num_epochs = 4
learning_rate = 5e-5
num_folds  = 5 
batch_size = 16
# output_model_path = "bert_model_final.pth"
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
output_model_path = f"model_{timestamp}.pth"

### Data Load and Handling

- **`df`**: The full dataset loaded from the CSV (`dataset.csv`).
- **`labels`**: The label column (`'Artifact Id'`) extracted from the dataset.
- **`label_counts`**: The number of occurrences of each unique label in the dataset.
- **`filtered_labels`**: The counts for the labels that appear at least 5 times in the dataset.
- **`filtered_labels_list`**: The names of those labels as a plain Python list.
- **`filtered_df`**: The filtered DataFrame containing only the rows with labels that appear at least 5 times.

By running this script, you get a new DataFrame, `filtered_df`, containing only the rows with labels that appear 5 or more times in the original dataset.

**Note**: This is a very simple preprocessing step where we filter out labels with fewer than 5 occurrences. There is no further data cleaning, such as handling missing values or data normalization, applied at this stage.
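
The frequency filter can be illustrated without pandas. This is a minimal sketch on made-up rows, not on `dataset.csv`; the function name `filter_rare_labels` is hypothetical, and the column names are assumed to match the notebook.

```python
from collections import Counter

def filter_rare_labels(rows, min_count=5, label_key="Artifact Id"):
    """Keep only rows whose label occurs at least `min_count` times."""
    counts = Counter(row[label_key] for row in rows)
    return [row for row in rows if counts[row[label_key]] >= min_count]

# Toy rows (made up for illustration, not taken from dataset.csv).
rows = (
    [{"Artifact Id": "d3f:Database", "Example Description": f"db {i}"} for i in range(6)]
    + [{"Artifact Id": "d3f:BootLoader", "Example Description": f"boot {i}"} for i in range(2)]
)
kept = filter_rare_labels(rows)
print(Counter(row["Artifact Id"] for row in kept))
```

Here the six `d3f:Database` rows survive, while the two `d3f:BootLoader` rows fall below the threshold and are dropped.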

In [106]:
# df = pd.read_csv('dataset.csv')

# labels = df['Artifact Id']

# label_counts = labels.value_counts()

# filtered_labels = label_counts[label_counts >= 5]

# filtered_labels_list = filtered_labels.index.tolist()

# filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

# print(filtered_df['Artifact Id'].value_counts())

### Testing for Overfitting on a Small Portion of the Data

In this part of the code, we filter the dataset to include only labels that have exactly **16 samples**. Training on this small subset is a sanity check: we want to see whether the model can **overfit**, that is, memorize the data rather than generalize. If it cannot fit even this small subset, the model or the training pipeline likely has a problem.


In [107]:
# df = pd.read_csv('dataset.csv')

# labels = df['Artifact Id']

# label_counts = labels.value_counts()

# filtered_labels = label_counts[label_counts == 16]

# filtered_labels_list = filtered_labels.index.tolist()

# filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

### Balancing the Dataset

To address class imbalance in the dataset, we performed the following steps:

1. **Random Sampling of 'Command' Class**: The **'d3f:Command'** class is overrepresented in the dataset: it exceeds the 200-row upper bound of the filter, which therefore removes it. We then **randomly sampled 16 instances** from this class so that it is represented at a size comparable to the other classes.

2. **Combining the Data**: After sampling the 'Command' class, we combined it with the filtered dataset to create a **more balanced dataset**.

This process reduces the model's bias towards the larger classes and should help it generalize better across all classes.
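
The sampling step, capping one class at a fixed number of rows, can be sketched in plain Python. The row counts below are made up, and `downsample_class` is a hypothetical helper; the notebook itself uses `DataFrame.sample` with `random_state=42`.

```python
import random

def downsample_class(rows, label, n, label_key="Artifact Id", seed=42):
    """Cap the number of rows with `label` at `n`, leaving other classes untouched."""
    target = [row for row in rows if row[label_key] == label]
    others = [row for row in rows if row[label_key] != label]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(target, min(n, len(target)))
    return others + sampled

# Toy data: one large class and one small class (counts are illustrative only).
rows = (
    [{"Artifact Id": "d3f:Command"} for _ in range(250)]
    + [{"Artifact Id": "d3f:Database"} for _ in range(16)]
)
balanced = downsample_class(rows, "d3f:Command", 16)
print(len(balanced))  # 32
```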


In [108]:
df = pd.read_csv('dataset.csv')

labels = df['Artifact Id']

label_counts = labels.value_counts()

filtered_labels = label_counts[(label_counts >= 5) & (label_counts <= 200)] # Remove "Command" label

filtered_labels_list = filtered_labels.index.tolist()

filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

command_df = df[df['Artifact Id'] == 'd3f:Command']
sample_size = min(len(command_df), 16)
sampled_command_df = command_df.sample(n=sample_size, random_state=42)
combined_df = pd.concat([filtered_df, sampled_command_df])

combined_df.reset_index(drop=True, inplace=True)

print(combined_df['Artifact Id'].value_counts())

Artifact Id
d3f:Database                     16
d3f:Software                     16
d3f:Command                      16
d3f:HardwareDriver               14
d3f:DisplayServer                11
d3f:OperatingSystem               8
d3f:FileSystem                    7
d3f:BootLoader                    6
d3f:InterprocessCommunication     5
Name: count, dtype: int64


In [109]:
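# Hold out 20% of the data as a test set, stratified by label.
# Note: the split uses filtered_df, so the sampled 'd3f:Command' rows
# in combined_df are not part of the train/validation/test data.
train_val_df, test_df = train_test_split(filtered_df,
                                         test_size=0.2,
                                         stratify=filtered_df['Artifact Id'],
                                         random_state=42)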

In [110]:
# print("Training set label distribution:")
# print(train_df['Artifact Id'].value_counts())

In [111]:
# print("\nValidation set label distribution:")
# print(val_df['Artifact Id'].value_counts())

In [112]:
# print("\nTest set label distribution:")
# print(test_df['Artifact Id'].value_counts())

In [113]:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [114]:
def tokenize_data(texts, tokenizer):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)


In [115]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [116]:
def train(model, train_loader, optimizer, scheduler, device, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_samples = 0

        all_labels = []
        all_preds = []

        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
            optimizer.zero_grad()

            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits
            
            preds = torch.argmax(logits, dim=-1)

            # Store predictions and labels for F1 score calculation
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

            # Accuracy calculation
            correct = (preds == labels).sum().item()
            total_correct += correct
            total_samples += labels.size(0)

            loss.backward()
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

        # Calculate metrics for the epoch
        accuracy = total_correct / total_samples * 100
        f1 = f1_score(all_labels, all_preds, average="weighted")

        print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, Accuracy: {accuracy:.2f}%, F1 Score: {f1:.4f}")

### Training with Validation

The `train_with_validation` function trains the model for multiple epochs and evaluates it on a validation set after each epoch. 

1. **Training**: For each batch, the model computes the loss, updates parameters, and tracks accuracy and F1 score.
2. **Validation**: After each epoch, the model's performance is evaluated on the validation set, and validation accuracy and F1 score are calculated.

Metrics (training and validation accuracy, F1 score) are printed after each epoch.
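
Both training functions call `scheduler.step()` once per batch. The sketch below approximates how the learning rate from `get_linear_schedule_with_warmup` evolves: linear warmup followed by linear decay to zero. The function name `linear_schedule_lr` is hypothetical, and the 16 total steps assume 4 batches per epoch over `num_epochs = 4`; it illustrates the shape of the schedule and is not the library's code.

```python
def linear_schedule_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: ramp linearly from base_lr down to 0 at the last step.
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)

base_lr = 5e-5     # learning_rate from the notebook
total_steps = 16   # assumed: 4 batches per epoch * 4 epochs
for step in (0, 4, 8, 12, 16):
    print(step, linear_schedule_lr(step, base_lr, total_steps))
```

With `num_warmup_steps=0`, as in this notebook, the rate starts at `learning_rate` and shrinks after every batch, so later epochs take smaller steps.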


In [117]:
def train_with_validation(model, train_loader, val_loader, optimizer, scheduler, device, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_samples = 0
        all_labels = []
        all_preds = []

        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
            optimizer.zero_grad()

            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            logits = outputs.logits

            preds = torch.argmax(logits, dim=-1)

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

            correct = (preds == labels).sum().item()
            total_correct += correct
            total_samples += labels.size(0)

            loss.backward()
            optimizer.step()
            scheduler.step()

            total_loss += loss.item()

        # Calculate training metrics
        accuracy = total_correct / total_samples * 100
        f1 = f1_score(all_labels, all_preds, average="weighted")
        
        # Validation phase
        model.eval()
        val_correct = 0
        val_samples = 0
        val_all_labels = []
        val_all_preds = []

        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validating Epoch {epoch+1}/{num_epochs}"):
                input_ids, attention_mask, labels = [item.to(device) for item in batch]

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs.logits

                preds = torch.argmax(logits, dim=-1)

                val_all_preds.extend(preds.cpu().tolist())
                val_all_labels.extend(labels.cpu().tolist())

                correct = (preds == labels).sum().item()
                val_correct += correct
                val_samples += labels.size(0)

        val_accuracy = val_correct / val_samples * 100
        val_f1 = f1_score(val_all_labels, val_all_preds, average="weighted")

        print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.6f}, Accuracy: {accuracy:.6f}%, F1 Score: {f1:.6f}")
        print(f"Validation Accuracy: {val_accuracy:.6f}%, Validation F1 Score: {val_f1:.6f}")

    return val_accuracy, val_f1


### Model Evaluation

After training, the model is evaluated on the test set to assess its performance on unseen data. During this phase, the model is set to evaluation mode using `model.eval()`, and predictions are made on the test set.

- **Prediction**: For each batch in the test set, the model generates predictions (logits), which are converted to class labels using `torch.argmax()`.
  
- **F1 Score**: The weighted F1 score, computed with `f1_score(..., average='weighted')` from `sklearn.metrics`, averages the per-class F1 scores weighted by the number of true samples in each class.
  
- **Accuracy**: The accuracy of the model is calculated by comparing the predicted labels to the true labels and computing the percentage of correct predictions.

The evaluation results help in understanding the model's effectiveness and generalization to the test set.

*Note: This evaluation process does not update the model parameters, ensuring that it is a true test of performance.*
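
To make the two metrics concrete, here is a minimal pure-Python sketch of a support-weighted F1 score and accuracy. The function name `weighted_f1_and_accuracy` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
from collections import Counter

def weighted_f1_and_accuracy(y_true, y_pred):
    """Compute support-weighted F1 and accuracy (in percent) from label lists."""
    support = Counter(y_true)  # number of true samples per class
    weighted_f1 = 0.0
    for label, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = count - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        weighted_f1 += f1 * count / len(y_true)
    accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return weighted_f1, accuracy

# Toy labels (made up): class 0 has F1 = 2/3, class 1 has F1 = 0.8, class 2 has F1 = 0.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 0]
f1, acc = weighted_f1_and_accuracy(y_true, y_pred)
print(f"F1 Score (Weighted): {f1:.4f}")  # 0.6000
print(f"Accuracy: {acc:.2f}%")           # 66.67%
```

Because each class contributes in proportion to its support, a class that is never predicted correctly (class 2 here) lowers the score only by its share of the samples.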


In [118]:
def test(model, test_loader, device):
    model.eval()
    predictions, true_labels = [], []

    with torch.no_grad(): 
        for batch in tqdm(test_loader, desc="Evaluating"):
            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs.logits

            preds = torch.argmax(logits, dim=-1)
            predictions.extend(preds.cpu().numpy())
            true_labels.extend(labels.cpu().numpy())

    # Calculate F1 score (weighted) for multiclass classification
    f1 = f1_score(true_labels, predictions, average='weighted')

    # Calculate accuracy
    accuracy = np.sum(np.array(predictions) == np.array(true_labels)) / len(true_labels) * 100

    print(f"F1 Score (Weighted): {f1:.4f}")
    print(f"Accuracy: {accuracy:.2f}%")

In [119]:
# test(model, test_loader, device)


### K-Fold Cross Validation with BERT

This code performs **K-Fold Cross Validation** using **StratifiedKFold** for evaluating a BERT-based model on a classification task. For each fold, it:

1. Splits the dataset into training and validation sets.
2. Tokenizes the input text (e.g., "Example Description").
3. Prepares and loads data into `DataLoader` for training and validation.
4. Initializes the BERT model and optimizer, and trains the model for a specified number of epochs.
5. Evaluates the model on the validation set using **F1 score** (weighted) and **accuracy**.
6. Computes average metrics across all folds.

Finally, the parameters of the fold models are combined into an F1-weighted average, and the resulting model is evaluated on the held-out test set. The final performance metrics are then printed.
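
The stratified splitting in step 1 can be sketched in plain Python. This is a simplified illustration of the idea, spreading each class evenly across folds; it is not scikit-learn's actual `StratifiedKFold` algorithm, and `stratified_folds` is a hypothetical helper.

```python
import random
from collections import defaultdict

def stratified_folds(labels, num_folds=5, seed=42):
    """Assign each sample index to a fold so every class is spread across folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(num_folds)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, idx in enumerate(indices):
            folds[position % num_folds].append(idx)
    return folds

labels = ["a"] * 10 + ["b"] * 5  # toy labels, made up for illustration
folds = stratified_folds(labels)
for k, val_idx in enumerate(folds):
    # Every fold serves once as validation data; the rest is training data.
    train_idx = [i for f in folds if f is not val_idx for i in f]
    print(f"fold {k}: {len(train_idx)} train, {len(val_idx)} validation")
```

Each fold here receives two samples of class `a` and one of class `b`, so the class ratio of the full dataset is preserved in every validation set.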

In [120]:
test_encodings = tokenize_data(test_df['Example Description'].tolist(), tokenizer)
label_mapping = {label: idx for idx, label in enumerate(filtered_labels_list)}
test_labels = torch.tensor([label_mapping.get(x, -1) for x in test_df['Artifact Id']])
test_dataset = TensorDataset(
    test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(
    test_dataset), batch_size=batch_size)

In [121]:
# # K-Fold Cross Validation on 80% train/val data
# best_model = None
# best_val_f1 = 0.0
# fold_results = []  # Store (accuracy, F1 score) for each fold
# skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
# for fold, (train_index, val_index) in enumerate(skf.split(train_val_df, train_val_df['Artifact Id'])):
#     print(f"\n--- Fold {fold+1}/{num_folds} ---")

#     # Split the train/val set into training and validation data for this fold
#     train_df = train_val_df.iloc[train_index]
#     val_df = train_val_df.iloc[val_index]

#     # Tokenize the training and validation data
#     train_encodings = tokenize_data(train_df['Example Description'].tolist(), tokenizer)
#     val_encodings = tokenize_data(val_df['Example Description'].tolist(), tokenizer)

#     # Map labels to indices for training and validation
#     train_labels = torch.tensor(train_df['Artifact Id'].map(
#         lambda x: filtered_labels_list.index(x)).tolist())
#     val_labels = torch.tensor(val_df['Artifact Id'].map(
#         lambda x: filtered_labels_list.index(x)).tolist())

#     # Create TensorDatasets for training and validation
#     train_dataset = TensorDataset(
#         train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
#     val_dataset = TensorDataset(
#         val_encodings['input_ids'], val_encodings['attention_mask'], val_labels)

#     # Create DataLoaders for training and validation
#     train_loader = DataLoader(train_dataset, sampler=RandomSampler(
#         train_dataset), batch_size=batch_size)
#     val_loader = DataLoader(val_dataset, sampler=SequentialSampler(
#         val_dataset), batch_size=batch_size)

#     # Initialize model, optimizer, and scheduler
#     model = BertForSequenceClassification.from_pretrained(
#         'bert-base-uncased', num_labels=len(filtered_labels_list))
#     model.to(device)
#     optimizer = AdamW(model.parameters(), lr=learning_rate)
#     total_steps = len(train_loader) * num_epochs
#     scheduler = get_linear_schedule_with_warmup(
#         optimizer, num_warmup_steps=0, num_training_steps=total_steps
#     )

#     # Train the model with validation
#     accuracy, f1 = train_with_validation(
#         model=model,
#         train_loader=train_loader,
#         val_loader=val_loader,
#         optimizer=optimizer,
#         scheduler=scheduler,
#         device=device,
#         num_epochs=num_epochs
#     )
    
#     fold_results.append((accuracy, f1))
#     if f1 > best_val_f1:
#         best_val_f1 = f1
#         best_model = model.state_dict()
#  # Best model should be taken in a different way!

# # # Average results over all folds (fold_results holds (accuracy, f1) tuples)
# # average_f1 = np.mean([result[1] for result in fold_results])
# # average_accuracy = np.mean([result[0] for result in fold_results])
# # print(f"\n--- K-Fold Results ---")
# # print(f"Average F1 Score (Weighted): {average_f1:.4f}")
# # print(f"Average Accuracy: {average_accuracy:.2f}%")
        
# test(model, test_loader, device)






--- Fold 1/5 ---


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training Epoch 1/4:   0%|          | 0/4 [00:00<?, ?it/s]

Validating Epoch 1/4:   0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1, Loss: 2.142305, Accuracy: 17.307692%, F1 Score: 0.122151
Validation Accuracy: 21.428571%, Validation F1 Score: 0.133333


Training Epoch 2/4:   0%|          | 0/4 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
# Initialize variables for K-Fold Cross Validation
best_val_f1 = 0.0
fold_results = []  # Store (accuracy, F1 score) for each fold
model_weights = {}  # Store cumulative weights for averaging
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)

# Initialize model weights storage
for name, param in BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=len(filtered_labels_list)
).named_parameters():
    model_weights[name] = torch.zeros_like(param.data)

for fold, (train_index, val_index) in enumerate(skf.split(train_val_df, train_val_df['Artifact Id'])):
    print(f"\n--- Fold {fold+1}/{num_folds} ---")

    # Split the train/val set into training and validation data for this fold
    train_df = train_val_df.iloc[train_index]
    val_df = train_val_df.iloc[val_index]

    # Tokenize the training and validation data
    train_encodings = tokenize_data(train_df['Example Description'].tolist(), tokenizer)
    val_encodings = tokenize_data(val_df['Example Description'].tolist(), tokenizer)

    # Map labels to indices for training and validation
    train_labels = torch.tensor(train_df['Artifact Id'].map(
        lambda x: filtered_labels_list.index(x)).tolist())
    val_labels = torch.tensor(val_df['Artifact Id'].map(
        lambda x: filtered_labels_list.index(x)).tolist())

    # Create TensorDatasets for training and validation
    train_dataset = TensorDataset(
        train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
    val_dataset = TensorDataset(
        val_encodings['input_ids'], val_encodings['attention_mask'], val_labels)

    # Create DataLoaders for training and validation
    train_loader = DataLoader(train_dataset, sampler=RandomSampler(
        train_dataset), batch_size=batch_size)
    val_loader = DataLoader(val_dataset, sampler=SequentialSampler(
        val_dataset), batch_size=batch_size)

    # Initialize model, optimizer, and scheduler
    model = BertForSequenceClassification.from_pretrained(
        'bert-base-uncased', num_labels=len(filtered_labels_list))
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=learning_rate)
    total_steps = len(train_loader) * num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=total_steps
    )

    # Train the model with validation
    accuracy, f1 = train_with_validation(
        model=model,
        train_loader=train_loader,
        val_loader=val_loader,
        optimizer=optimizer,
        scheduler=scheduler,
        device=device,
        num_epochs=num_epochs
    )
    
    fold_results.append((accuracy, f1))
    
    # Accumulate model weights for the F1-weighted average.
    # The accumulators live on the CPU, so move each parameter there first.
    for name, param in model.named_parameters():
        model_weights[name] += param.data.detach().cpu() * f1

# Calculate weighted average of model weights
total_f1_sum = sum([result[1] for result in fold_results])  # Total F1 across folds
average_model_state_dict = {}
for name, weight in model_weights.items():
    average_model_state_dict[name] = weight / total_f1_sum

# Load the averaged model weights into a new model
average_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=len(filtered_labels_list)
)
average_model.load_state_dict(average_model_state_dict)
average_model.to(device)  # test() moves batches to `device`, so the model must be there too

# Evaluate the averaged model on the test set
test(average_model, test_loader, device)

# Average results over all folds
average_f1 = np.mean([result[1] for result in fold_results])
average_accuracy = np.mean([result[0] for result in fold_results])
print(f"\n--- K-Fold Results ---")
print(f"Average F1 Score (Weighted): {average_f1:.4f}")
print(f"Average Accuracy: {average_accuracy:.2f}%")
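
The F1-weighted parameter averaging in the cell above reduces to simple arithmetic: each fold's parameters are scaled by that fold's F1 score, summed, and divided by the total F1. The sketch below shows this with plain floats instead of tensors; `weighted_average_params` and the toy values are hypothetical, and the zero-sum guard is an addition that the notebook code does not have.

```python
def weighted_average_params(fold_params, fold_scores):
    """Average per-fold parameter dicts, weighting each fold by its score."""
    total = sum(fold_scores)
    if total == 0:
        # Guard added for the sketch: all-zero scores would divide by zero.
        raise ValueError("All fold scores are zero; weights are undefined.")
    names = fold_params[0].keys()
    return {
        name: sum(params[name] * score
                  for params, score in zip(fold_params, fold_scores)) / total
        for name in names
    }

# Two toy "models" with one weight and one bias each (values are made up).
fold_params = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
fold_scores = [0.25, 0.75]
averaged = weighted_average_params(fold_params, fold_scores)
print(averaged)  # {'w': 2.5, 'b': 1.5}
```

The fold with the higher score pulls the average towards its own parameters. This only illustrates the arithmetic; whether averaging independently fine-tuned models yields a useful model is a separate question.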


In [None]:
model = average_model

In [None]:
torch.save(model.state_dict(), output_model_path)

In [None]:
saved_dict = torch.load(output_model_path, weights_only=True)

_model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=len(filtered_labels_list)  
)
_model.to(device)  # keep the reloaded model on the same device as the evaluation batches
_model.load_state_dict(saved_dict)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<All keys matched successfully>

In [None]:
all_weights_match = True  # Flag to track if all weights match

for (name1, param1), (name2, param2) in zip(model.named_parameters(), _model.named_parameters()):
    # Check if parameter names match (optional, but useful for debugging)
    if name1 != name2:
        print(f"Parameter mismatch: {name1} vs {name2}")
        all_weights_match = False
        continue
    
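    # Compare parameter values on the CPU so the check works even if the models sit on different devices
    if not torch.allclose(param1.detach().cpu(), param2.detach().cpu(), atol=1e-6):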
        print(f"Mismatch found in parameter: {name1}")
        all_weights_match = False

if all_weights_match:
    print("All model weights match!")
else:
    print("Some weights do not match.")

All model weights match!


In [None]:
test_df.to_csv('test_data.csv', index=False)
print("DataFrame saved to test_data.csv")

DataFrame saved to test_data.csv


### Testing the Model on the Baseline Dataset

In [None]:
csv_path = "baseline.csv"
baseline_df = pd.read_csv(csv_path)

baseline_encodings = tokenize_data(baseline_df['Example Description'].tolist(), tokenizer)

baseline_labels = torch.tensor(baseline_df['Artifact Id'].map(
    lambda x: filtered_labels_list.index(x)).tolist())

baseline_dataset = TensorDataset(
   baseline_encodings['input_ids'], baseline_encodings['attention_mask'], baseline_labels)

# Evaluation does not need shuffling, so iterate the baseline data in order
baseline_loader = DataLoader(baseline_dataset, sampler=SequentialSampler(
    baseline_dataset), batch_size=batch_size)

test(model, baseline_loader, device)
test(_model, baseline_loader, device)

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

F1 Score (Weighted): 0.0278
Accuracy: 12.50%


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

F1 Score (Weighted): 0.0278
Accuracy: 12.50%
