### Setting Up Conda Environment with Python 3.12 and Preparing for Model Development  

##### Step 1 - Create and Activate a Conda Environment  
1. Create a new Conda environment with Python 3.12 using the following command:  
   ```bash  
   conda create -n llm_env python=3.12  
   ```  
2. Activate the environment:  
   ```bash  
   conda activate llm_env  
   ```  
3. Select `llm_env` as the Python kernel in Jupyter or VS Code so that the notebook runs inside this environment.

##### Step 2 - Install Required Libraries  
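1. Use Conda to install the libraries that the notebook imports: `numpy`, `pandas`, PyTorch (`torch`), `tqdm`, `scikit-learn`, and `transformers`. Add extra channels if a package is not available from the default channel.  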

##### Step 3 - Verify the Setup  
1. Open a notebook or script and try importing the installed libraries.  
2. Ensure there are no import errors to confirm that the environment is properly configured and ready for building and training the machine learning model.  
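
One way to run this check from a script is sketched below. It is a minimal example rather than part of the original notebook: the module list is copied from the notebook's import cell, and `find_missing` is a hypothetical helper name.

```python
import importlib.util

# Top-level module names taken from the notebook's import cell.
REQUIRED_MODULES = ["numpy", "pandas", "torch", "tqdm", "sklearn", "transformers"]

def find_missing(modules):
    """Return the modules that cannot be located in the active environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only locates a module without importing it, so this check is fast but does not catch errors raised while a package initializes.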

In [1]:
import numpy as np
import pandas as pd
import torch
import datetime
from torch import nn
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from torch.optim import AdamW  # transformers.AdamW is deprecated in favor of the PyTorch implementation
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score


### Output file naming
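Saved model files are named with a timestamp (`YYYY-MM-DD_HH-MM-SS`) so that each run writes a new file instead of overwriting an earlier one.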


In [2]:
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
output_model_path = f"models/model_{timestamp}.pth"
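
`torch.save` writes to the path it is given but does not create missing parent directories, so the `models/` folder has to exist before the model is saved. A minimal sketch of one way to guarantee that is shown below; `timestamped_model_path` is a hypothetical helper, not part of the original notebook.

```python
import datetime
from pathlib import Path

def timestamped_model_path(base_dir="models"):
    """Create base_dir if needed and return a timestamped .pth path inside it."""
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    return out_dir / f"model_{timestamp}.pth"
```

Calling `output_model_path = timestamped_model_path()` would then replace the two assignments above while keeping the same naming scheme.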

### Setting the Device for Model Training  
The variable `device` is set to use the GPU (`cuda`) if available, otherwise defaults to the CPU. This ensures the model runs on the most suitable hardware.

In [3]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Hyperparameters:

In [4]:
num_epochs = 4
learning_rate = 5e-5
batch_size = 16

### Custom BERT Model for Sequence Classification

This is a thin wrapper around the `BertForSequenceClassification` class from Hugging Face's `transformers` library. Here's how the model works:

- **Initialization**: The wrapper loads the pre-trained `bert-base-uncased` checkpoint into `BertForSequenceClassification` and takes the number of output labels (`num_labels`) as a required argument.
  
- **Forward Pass**: During the forward pass, the input IDs and attention mask are passed through the BERT model. If labels are provided, the wrapped model also computes the loss using `CrossEntropyLoss`. 

- **Outputs**: The model returns a dictionary containing:
  - `loss`: Computed if labels are provided.
  - `logits`: The raw predictions (before softmax) for each class.

In [5]:
class CustomBertModel(nn.Module):
    def __init__(self, num_labels):
        super(CustomBertModel, self).__init__()
        # Use BertForSequenceClassification directly
        self.bert = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        
        logits = outputs.logits

        loss = None
        if labels is not None:
            loss = outputs.loss

        return {"loss": loss, "logits": logits}


### Dataset Creation for Text Classification

These helper functions prepare a dataset for text classification by:
1. **Tokenizing** the 'Example Description' column, truncating each text to at most 512 tokens and padding to the longest sequence in the tokenized set (`padding=True`).
2. **Mapping labels** from 'Artifact Id' to indices using the `label_mapping` dictionary passed in; labels missing from the mapping become `-1`.
3. **Creating a `TensorDataset`** with input IDs, attention mask, and labels for model training.

The output is a PyTorch `TensorDataset` ready for use.


In [6]:
def tokenize_data(texts, tokenizer):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

def create_dataset(df, tokenizer, label_mapping):
    texts = df['Example Description'].tolist() 
    encodings = tokenize_data(texts, tokenizer)

    labels = torch.tensor([label_mapping.get(x, -1) for x in df['Artifact Id']])  # Default to -1 for unknown labels
    
    input_ids = encodings['input_ids']
    attention_mask = encodings['attention_mask']
    
    dataset = TensorDataset(input_ids, attention_mask, labels)
    return dataset
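
Because `create_dataset` falls back to `-1` for labels missing from the mapping, and `CrossEntropyLoss` expects class indices in the range `[0, num_labels)` (or its `ignore_index`, which defaults to `-100`), an unmapped label would only surface later as a training error. A minimal sketch of an up-front check follows. `find_unmapped_labels` is a hypothetical helper, and the mapping and label values are illustrative only (`d3f:Kernel` is not taken from the dataset).

```python
def find_unmapped_labels(labels, label_mapping):
    """Return the distinct labels (in first-seen order) missing from label_mapping."""
    unmapped = []
    for label in labels:
        if label not in label_mapping and label not in unmapped:
            unmapped.append(label)
    return unmapped

# Hypothetical mapping and label column, for illustration only.
example_mapping = {"d3f:Command": 0, "d3f:Database": 1}
example_labels = ["d3f:Command", "d3f:Kernel", "d3f:Database", "d3f:Kernel"]

print(find_unmapped_labels(example_labels, example_mapping))  # ['d3f:Kernel']
```

Running such a check on `df['Artifact Id']` before building the tensors makes it possible to fail early with a readable message.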


### Training
Trains the model for a specified number of epochs using the provided data loader, optimizer, and device. 

**Input Parameters**:
- `model`: The neural network model to be trained.
- `train_loader`: A PyTorch DataLoader containing the training data.
- `optimizer`: The optimization algorithm used to update the model's weights.
- `device`: The hardware device (CPU or GPU) where the model and data will be processed.
- `num_epochs`: The number of epochs to train the model.

The function performs the following:

1. **Training**: For each batch, the model computes the loss, updates parameters, and tracks accuracy and F1 score.
2. **Metrics**: After each epoch, the loss, training accuracy, and F1 score are printed.

This function does not include validation; it is an earlier version that focuses solely on training the model.

In [7]:
def train(model, train_loader, optimizer, device, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_samples = 0

        all_labels = []
        all_preds = []

        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
            optimizer.zero_grad()

            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs['loss']
            logits = outputs['logits']
            
            preds = torch.argmax(logits, dim=-1)

            # Store predictions and labels for F1 score calculation
            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

            # Accuracy calculation
            correct = (preds == labels).sum().item()
            total_correct += correct
            total_samples += labels.size(0)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        # Calculate metrics for the epoch
        accuracy = accuracy_score(all_labels, all_preds) * 100
        f1 = f1_score(all_labels, all_preds, average="weighted")

        print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, Accuracy: {accuracy:.2f}%, F1 Score: {f1:.4f}")

### Training with Validation

The `train_with_validation` function trains the model for several epochs while evaluating it on a validation set after each epoch.

1. **Training**: For each batch, the model computes the loss, updates parameters, and tracks accuracy and F1 score.
2. **Validation**: After each epoch, the model's performance is evaluated on the validation set, calculating validation accuracy and F1 score.
3. **Metrics**: After each epoch, the training loss, accuracy, and F1 score are displayed, along with the validation accuracy and validation F1 score.

In [8]:
def train_with_validation(model, train_loader, val_loader, optimizer, device, num_epochs):
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        total_correct = 0
        total_samples = 0
        all_labels = []
        all_preds = []

        for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
            optimizer.zero_grad()

            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs['loss']
            logits = outputs['logits']

            preds = torch.argmax(logits, dim=-1)

            all_preds.extend(preds.cpu().tolist())
            all_labels.extend(labels.cpu().tolist())

            correct = (preds == labels).sum().item()
            total_correct += correct
            total_samples += labels.size(0)

            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        # Calculate training metrics
        accuracy = accuracy_score(all_labels, all_preds) * 100
        f1 = f1_score(all_labels, all_preds, average="weighted")
        
        # Validation phase
        model.eval()
        val_correct = 0
        val_samples = 0
        val_all_labels = []
        val_all_preds = []

        with torch.no_grad():
            for batch in tqdm(val_loader, desc=f"Validating Epoch {epoch+1}/{num_epochs}"):
                input_ids, attention_mask, labels = [item.to(device) for item in batch]

                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                logits = outputs['logits']

                preds = torch.argmax(logits, dim=-1)

                val_all_preds.extend(preds.cpu().tolist())
                val_all_labels.extend(labels.cpu().tolist())

                correct = (preds == labels).sum().item()
                val_correct += correct
                val_samples += labels.size(0)

        val_accuracy = accuracy_score(val_all_labels, val_all_preds) * 100
        val_f1 = f1_score(val_all_labels, val_all_preds, average="weighted")

        print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.6f}, Accuracy: {accuracy:.6f}%, F1 Score: {f1:.6f}")
        print(f"Validation Accuracy: {val_accuracy:.6f}%, Validation F1 Score: {val_f1:.6f}")


### Model Evaluation

After training, the model is evaluated on the test set to assess its performance on unseen data. During this phase, the model is set to evaluation mode using `model.eval()`, and predictions are made on the test set.

- **Prediction**: For each batch in the test set, the model generates predictions (logits), which are converted to class labels using `torch.argmax()`.
  
- **F1 Score**: The weighted F1 score is calculated with the `f1_score` function from `sklearn.metrics` (`average='weighted'`), which averages the per-class F1 scores weighted by the number of true samples in each class.
  
- **Accuracy**: The accuracy of the model is calculated by comparing the predicted labels to the true labels and computing the percentage of correct predictions.

The evaluation results help in understanding the model's effectiveness and generalization to the test set.

*Note: This evaluation process does not update the model parameters, ensuring that it is a true test of performance.*


In [27]:
def test(model, test_loader, device):
    model.eval()
    predictions, true_labels = [], []

    with torch.no_grad(): 
        for batch in tqdm(test_loader, desc="Evaluating"):
            input_ids, attention_mask, labels = [item.to(device) for item in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            logits = outputs['logits']
            preds = torch.argmax(logits, dim=-1)
            predictions.extend(preds.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())

    # Calculate F1 score (weighted) for multiclass classification
    f1 = f1_score(true_labels, predictions, average='weighted')

    # Calculate accuracy
    accuracy = accuracy_score(true_labels, predictions) * 100

    print(f"F1 Score (Weighted): {f1:.4f}")
    print(f"Accuracy: {accuracy:.2f}%")
    return predictions, true_labels

### Data Load and Handling

- **`df`**: The full dataset loaded from the CSV (`dataset.csv`).
- **`labels`**: The list of labels (i.e., 'Artifact Id') extracted from the dataset.
- **`label_counts`**: The count of occurrences of each unique label in the dataset.
- **`filtered_labels_at_least_5`**: The labels that appear at least 5 times in the dataset.
- **`filtered_labels_at_least_5_list`**: A list of labels that appear at least 5 times.
- **`filtered_df`**: The filtered DataFrame containing only the rows with labels that appear at least 5 times.

By running this script, you get a new DataFrame, `filtered_df`, containing only the rows with labels that appear 5 or more times in the original dataset.

**Note**: This is a very simple preprocessing step where we filter out labels with fewer than 5 occurrences. There is no further data cleaning, such as handling missing values or data normalization, applied at this stage.

In [10]:
df = pd.read_csv('csv/dataset.csv')

labels = df['Artifact Id']

label_counts = labels.value_counts()

filtered_labels_at_least_5 = label_counts[label_counts >= 5]

filtered_labels_at_least_5_list = filtered_labels_at_least_5.index.tolist()

filtered_df = df[df['Artifact Id'].isin(filtered_labels_at_least_5_list)]

print(filtered_df['Artifact Id'].value_counts())

Artifact Id
d3f:Command                      250
d3f:Database                      16
d3f:Software                      16
d3f:HardwareDriver                14
d3f:DisplayServer                 11
d3f:OperatingSystem                8
d3f:FileSystem                     7
d3f:BootLoader                     6
d3f:InterprocessCommunication      5
Name: count, dtype: int64


### Balancing the Dataset

To address class imbalance in the dataset, we performed the following steps:

1. **Random Sampling of 'Command' Class**: The **'d3f:Command'** class is heavily overrepresented: it has 250 rows, while every other retained class has at most 16. We therefore **randomly sampled 16 instances** from this class so that it matches the size of the next-largest classes.

2. **Combining the Data**: We combined the sampled 'Command' rows with the rows of the remaining classes (those with between 5 and 200 occurrences) to create a **more balanced dataset**.

This process reduces the model's bias towards the dominant class. The classes are still not perfectly balanced, since their sizes range from 5 to 16 rows.


In [11]:
filtered_labels_balance = label_counts[(label_counts >= 5) & (label_counts <= 200)]

filtered_labels_balance_list = filtered_labels_balance.index.tolist()

filtered_balance_df = df[df['Artifact Id'].isin(filtered_labels_balance_list)]

command_df = df[df['Artifact Id'] == 'd3f:Command']
sample_size = min(len(command_df), 16)
sampled_command_df = command_df.sample(n=sample_size, random_state=42)
combined_df = pd.concat([filtered_balance_df, sampled_command_df])

combined_df.reset_index(drop=True, inplace=True)

print(combined_df['Artifact Id'].value_counts())

Artifact Id
d3f:Database                     16
d3f:Software                     16
d3f:Command                      16
d3f:HardwareDriver               14
d3f:DisplayServer                11
d3f:OperatingSystem               8
d3f:FileSystem                    7
d3f:BootLoader                    6
d3f:InterprocessCommunication     5
Name: count, dtype: int64
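
The effect of the downsampling can be quantified with a simple imbalance ratio (largest class size divided by smallest). The sketch below uses the class counts from the two printed outputs above; `imbalance_ratio` is a hypothetical helper introduced only for illustration.

```python
# Class counts copied from the printed value_counts() outputs.
counts_before = {
    "d3f:Command": 250, "d3f:Database": 16, "d3f:Software": 16,
    "d3f:HardwareDriver": 14, "d3f:DisplayServer": 11, "d3f:OperatingSystem": 8,
    "d3f:FileSystem": 7, "d3f:BootLoader": 6, "d3f:InterprocessCommunication": 5,
}
counts_after = dict(counts_before, **{"d3f:Command": 16})

def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())

print(imbalance_ratio(counts_before))  # 50.0
print(imbalance_ratio(counts_after))   # 3.2
print(sum(counts_after.values()))      # 99 rows remain in combined_df
```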


In [12]:
label_mapping = {label: idx for idx, label in enumerate(filtered_labels_at_least_5_list)}
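
The mapping assigns indices in the order of `filtered_labels_at_least_5_list`, so it still contains `d3f:Command` even though that class was excluded from `filtered_labels_balance_list`. Since the evaluation functions return integer indices, an inverse lookup is handy for reading the results. The sketch below assumes the list order matches the printed counts above (the order of the two tied classes, `d3f:Database` and `d3f:Software`, is an assumption); `example_mapping` and `decode` are illustrative names.

```python
# Assumed order of filtered_labels_at_least_5_list, copied from the printed counts.
label_order = [
    "d3f:Command", "d3f:Database", "d3f:Software", "d3f:HardwareDriver",
    "d3f:DisplayServer", "d3f:OperatingSystem", "d3f:FileSystem",
    "d3f:BootLoader", "d3f:InterprocessCommunication",
]
example_mapping = {label: idx for idx, label in enumerate(label_order)}
index_to_label = {idx: label for label, idx in example_mapping.items()}

def decode(indices):
    """Translate predicted class indices back to their label names."""
    return [index_to_label.get(i, "<unknown>") for i in indices]

print(decode([0, 3, 8]))
# ['d3f:Command', 'd3f:HardwareDriver', 'd3f:InterprocessCommunication']
```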

### Dataset Split for Training and Testing  
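The dataset `combined_df` is split into **training**, **validation**, and **test** sets in two steps:  
- **20% of `combined_df`** is held out as the test set  
- **20% of the remaining rows** is held out as the validation set  
- **Both splits are stratified by `Artifact Id`**  

After the split, the label distribution of each subset is printed to verify stratification.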

In [13]:
train_val_df, test_df = train_test_split(combined_df,
                                         test_size=0.2,
                                         stratify=combined_df['Artifact Id'],
                                         random_state=42)
train_df, val_df = train_test_split(train_val_df,
                                         test_size=0.2,
                                         stratify=train_val_df['Artifact Id'],
                                         random_state=42)
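
The resulting subset sizes can be worked out by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which is consistent with the distributions printed in this notebook; `split_sizes` is a hypothetical helper and 99 is the row count of `combined_df` (the sum of the balanced class counts).

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 99  # rows in combined_df
n_train_val, n_test = split_sizes(n_total, 0.2)
n_train, n_val = split_sizes(n_train_val, 0.2)

print(n_train, n_val, n_test)  # 63 16 20
```

The effective proportions are therefore roughly 64% training, 16% validation, and 20% test.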

In [14]:
print("Training set label distribution:")
print(train_df['Artifact Id'].value_counts())

Training set label distribution:
Artifact Id
d3f:Database                     11
d3f:Software                     10
d3f:Command                      10
d3f:HardwareDriver                9
d3f:DisplayServer                 7
d3f:OperatingSystem               5
d3f:BootLoader                    4
d3f:FileSystem                    4
d3f:InterprocessCommunication     3
Name: count, dtype: int64


In [15]:
print("Validation set label distribution:")
print(val_df['Artifact Id'].value_counts())

Validation set label distribution:
Artifact Id
d3f:Software                     3
d3f:Command                      3
d3f:Database                     2
d3f:DisplayServer                2
d3f:HardwareDriver               2
d3f:OperatingSystem              1
d3f:FileSystem                   1
d3f:InterprocessCommunication    1
d3f:BootLoader                   1
Name: count, dtype: int64


In [16]:
print("Test set label distribution:")
print(test_df['Artifact Id'].value_counts())


Test set label distribution:
Artifact Id
d3f:Command                      3
d3f:Database                     3
d3f:Software                     3
d3f:HardwareDriver               3
d3f:OperatingSystem              2
d3f:DisplayServer                2
d3f:FileSystem                   2
d3f:InterprocessCommunication    1
d3f:BootLoader                   1
Name: count, dtype: int64


### Model and Optimizer Setup

This code initializes the model, optimizer, and tokenizer for training:
1. **Model**: `CustomBertModel` is initialized with the number of labels set to the length of `filtered_labels_at_least_5_list`.
2. **Optimizer**: AdamW is used for optimization with the model's parameters and a specified learning rate.
3. **Tokenizer**: The BERT tokenizer (`bert-base-uncased`) is loaded for tokenizing the input text.
4. The model is then moved to the specified **device** (GPU or CPU).

The setup is ready for model training.

In [17]:
model = CustomBertModel(num_labels=len(filtered_labels_at_least_5_list))
optimizer = AdamW(model.parameters(), lr=learning_rate)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model.to(device)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.




CustomBertModel(
  (bert): BertForSequenceClassification(
    (bert): BertModel(
      (embeddings): BertEmbeddings(
        (word_embeddings): Embedding(30522, 768, padding_idx=0)
        (position_embeddings): Embedding(512, 768)
        (token_type_embeddings): Embedding(2, 768)
        (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (encoder): BertEncoder(
        (layer): ModuleList(
          (0-11): 12 x BertLayer(
            (attention): BertAttention(
              (self): BertSdpaSelfAttention(
                (query): Linear(in_features=768, out_features=768, bias=True)
                (key): Linear(in_features=768, out_features=768, bias=True)
                (value): Linear(in_features=768, out_features=768, bias=True)
                (dropout): Dropout(p=0.1, inplace=False)
              )
              (output): BertSelfOutput(
                (dense): Linear(in_features=768, out_features=

### Dataset and DataLoader Creation

This code prepares the datasets and DataLoaders for training, validation, and testing:

1. **Datasets**: 
   - The `create_dataset` function is called to process the training (`train_df`), validation (`val_df`), and testing (`test_df`) data, using the BERT tokenizer and the label-to-index dictionary (`label_mapping`).

2. **DataLoaders**:
   - `train_loader`: Uses a `RandomSampler` to shuffle data for training.
   - `val_loader` and `test_loader`: Use a `SequentialSampler` to load validation and test data sequentially.
   - The batch size is specified by `batch_size`.

These DataLoaders are ready to be used for model training and evaluation.


In [18]:
train_dataset = create_dataset(train_df, tokenizer, label_mapping)
val_dataset = create_dataset(val_df, tokenizer, label_mapping)
test_dataset = create_dataset(test_df, tokenizer, label_mapping)

# Create DataLoaders
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=batch_size)
val_loader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=batch_size)
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=batch_size)


### Training and Validation

In [19]:
train_with_validation(model, train_loader, val_loader, optimizer, device, num_epochs)

Training Epoch 1/4:   0%|          | 0/4 [00:00<?, ?it/s]

Validating Epoch 1/4:   0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1, Loss: 2.270321, Accuracy: 14.285714%, F1 Score: 0.060647
Validation Accuracy: 12.500000%, Validation F1 Score: 0.027778


Training Epoch 2/4:   0%|          | 0/4 [00:00<?, ?it/s]

Validating Epoch 2/4:   0%|          | 0/1 [00:00<?, ?it/s]

Epoch 2, Loss: 2.072563, Accuracy: 23.809524%, F1 Score: 0.161118
Validation Accuracy: 31.250000%, Validation F1 Score: 0.196429


Training Epoch 3/4:   0%|          | 0/4 [00:00<?, ?it/s]

Validating Epoch 3/4:   0%|          | 0/1 [00:00<?, ?it/s]

Epoch 3, Loss: 1.907913, Accuracy: 38.095238%, F1 Score: 0.299060
Validation Accuracy: 56.250000%, Validation F1 Score: 0.435714


Training Epoch 4/4:   0%|          | 0/4 [00:00<?, ?it/s]

Validating Epoch 4/4:   0%|          | 0/1 [00:00<?, ?it/s]

Epoch 4, Loss: 1.674347, Accuracy: 68.253968%, F1 Score: 0.606885
Validation Accuracy: 56.250000%, Validation F1 Score: 0.494792


### Save the Model

In [20]:
torch.save(model.state_dict(), output_model_path)
print(f"Model saved to {output_model_path}")

Model saved to model_2024-12-16_14-53-35.pth


### Load the Saved Model

In [21]:
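_model = CustomBertModel(num_labels=len(filtered_labels_at_least_5_list))
_model.to(device)  # keep the reloaded model on the same device as the evaluation batches

# map_location lets a checkpoint saved on a GPU load on a CPU-only machine as well
state_dict = torch.load(output_model_path, map_location=device, weights_only=True)
_model.load_state_dict(state_dict)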


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<All keys matched successfully>

### Testing the Model

We evaluate both the trained model and the loaded model to ensure that they produce the same results.

In [28]:
test(model, test_loader, device)
test(_model, test_loader, device)

Evaluating:   0%|          | 0/2 [00:00<?, ?it/s]

F1 Score (Weighted): 0.4110
Accuracy: 50.00%


Evaluating:   0%|          | 0/2 [00:00<?, ?it/s]

F1 Score (Weighted): 0.4110
Accuracy: 50.00%


([0, 3, 1, 0, 4, 2, 0, 3, 1, 0, 1, 3, 3, 1, 0, 1, 1, 1, 3, 1],
 [0, 5, 1, 0, 4, 2, 0, 3, 8, 7, 3, 4, 2, 1, 6, 5, 6, 2, 3, 1])
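
The reported metrics can be reproduced by hand from the two lists returned by `test` (predictions first, true labels second). The sketch below is a pure-Python illustration of support-weighted F1, not the scikit-learn implementation; `weighted_f1` is a hypothetical helper name.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its true count."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = 0.0
    for label, n_true in support.items():
        # F1 = 2*TP / (predicted + true); 0 when the class is never predicted correctly
        f1 = 2 * true_pos[label] / (n_true + predicted[label])
        total += n_true * f1
    return total / len(y_true)

y_pred = [0, 3, 1, 0, 4, 2, 0, 3, 1, 0, 1, 3, 3, 1, 0, 1, 1, 1, 3, 1]
y_true = [0, 5, 1, 0, 4, 2, 0, 3, 8, 7, 3, 4, 2, 1, 6, 5, 6, 2, 3, 1]

accuracy = 100 * sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"F1 Score (Weighted): {weighted_f1(y_true, y_pred):.4f}")  # 0.4110
print(f"Accuracy: {accuracy:.2f}%")                               # 50.00%
```

The lists also show that the model only ever predicts classes 0 to 4, so every sample from classes 5 to 8 is misclassified.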

### Model Weights Comparison

We compare the weights of the trained model and the loaded model to verify that they are identical.

In [23]:
all_weights_match = True  # Flag to track if all weights match

for (name1, param1), (name2, param2) in zip(model.named_parameters(), _model.named_parameters()):
    # Check if parameter names match (optional, but useful for debugging)
    if name1 != name2:
        print(f"Parameter mismatch: {name1} vs {name2}")
        all_weights_match = False
        continue
    
    # Compare parameter values
    if not torch.allclose(param1, param2, atol=1e-6):
        print(f"Mismatch found in parameter: {name1}")
        all_weights_match = False

if all_weights_match:
    print("All model weights match!")
else:
    print("Some weights do not match.")

All model weights match!


### Saving the test set

In [24]:
test_df.to_csv('csv/test_data.csv', index=False)
print("DataFrame saved to test_data.csv")

DataFrame saved to test_data.csv


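### Baseline Comparison

This cell compares the **trained model's predictions** with the **baseline model's predictions**, read from `csv/baseline.csv`, to evaluate performance improvements. Note that the baseline F1 score is macro-averaged while `test` reports a weighted F1 score, so the accuracy values are the more direct comparison.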

In [29]:
csv_path = "csv/baseline.csv"
baseline_df = pd.read_csv(csv_path)

def test_baseline(df):
    true_labels = df['Artifact Id'].str.replace('d3f:', '', regex=False).tolist()
    predicted_labels = df['Prediction'].tolist()
    accuracy = accuracy_score(true_labels, predicted_labels) * 100
    f1 = f1_score(true_labels, predicted_labels, average='macro')
    # Print results
    print(f"Baseline Accuracy: {accuracy:.4f}")
    print(f"Baseline F1 Score: {f1:.4f}")

baseline_dataset = create_dataset(baseline_df, tokenizer, label_mapping)
baseline_loader = DataLoader(baseline_dataset, sampler=SequentialSampler(baseline_dataset), batch_size=batch_size)

test_baseline(baseline_df)
test(model, baseline_loader, device)
test(_model, baseline_loader, device)


Baseline Accuracy: 18.7500
Baseline F1 Score: 0.2083


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

F1 Score (Weighted): 0.5568
Accuracy: 62.50%


Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

F1 Score (Weighted): 0.5568
Accuracy: 62.50%


([1, 2, 1, 4, 2, 2, 0, 3, 2, 3, 2, 1, 2, 1, 4, 2],
 [5, 2, 1, 4, 1, 2, 6, 3, 2, 3, 8, 5, 2, 1, 4, 1])
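
`test_baseline` reports a macro-averaged F1 score, whereas `test` reports a weighted one. To compare like with like, a macro F1 for the trained model can be derived from the two lists returned above. The sketch below is a pure-Python illustration; it assumes the macro average is taken over every label that appears in either list, which is our understanding of scikit-learn's default behavior. `per_class_f1` is a hypothetical helper name.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label that appears in y_true or y_pred."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for label in set(y_true) | set(y_pred):
        scores[label] = 2 * true_pos[label] / (support[label] + predicted[label])
    return scores

y_pred = [1, 2, 1, 4, 2, 2, 0, 3, 2, 3, 2, 1, 2, 1, 4, 2]
y_true = [5, 2, 1, 4, 1, 2, 6, 3, 2, 3, 8, 5, 2, 1, 4, 1]

scores = per_class_f1(y_true, y_pred)
support = Counter(y_true)
macro_f1 = sum(scores.values()) / len(scores)  # every class counts equally
weighted_f1 = sum(support[label] * f1 for label, f1 in scores.items()) / len(y_true)

print(f"macro: {macro_f1:.4f}, weighted: {weighted_f1:.4f}")  # macro: 0.4034, weighted: 0.5568
```

Under that assumption, the model's macro F1 of about 0.40 is the value to set against the baseline's 0.2083.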