### Setting Up Conda Environment with Python 3.12 and Installing Required Libraries
##### Step 1 - Created a Conda Environment with Python 3.12 and Connected to its Kernel
I created a new Conda environment with Python 3.12 and activated it. After that, I connected to the Python kernel in Jupyter/VSCode using this environment.

Commands to create and activate the environment:
```bash
conda create -n llm_env python=3.12
conda activate llm_env
```
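
To confirm that the notebook is really attached to the new environment's kernel, a quick check from inside a cell can help. The sketch below is an illustration, not part of the original notebook; the helper name `describe_kernel` is hypothetical.

```python
import sys

def describe_kernel():
    """Return the interpreter path and Python version of the running kernel."""
    return {
        "executable": sys.executable,
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
    }

info = describe_kernel()
print(info)
# When the kernel comes from the environment created above, "python" should
# read 3.12 and "executable" should point inside the llm_env directory.
```

If the printed path points somewhere else, the notebook is probably still using a different kernel.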

##### Step 2 - Imported Required Libraries
Once the environment was set up and activated, I imported the following libraries for my project:
- `numpy` as `np`: For numerical operations and working with arrays.
- `pandas` as `pd`: For data manipulation and analysis.
- `torch`: PyTorch, for deep learning model building and training.
- `tqdm.notebook`: For displaying progress bars in Jupyter notebooks.
- `sklearn.preprocessing.LabelEncoder`: For encoding categorical labels into numerical format.
- `sklearn.model_selection.train_test_split`: For splitting datasets into training and testing sets.
- `transformers.BertTokenizer`: For tokenizing text data for BERT models.
- `transformers.BertForSequenceClassification`: For using BERT for sequence classification tasks.
- `torch.utils.data.TensorDataset`: For creating datasets from tensors.
- `torch.utils.data.DataLoader`: For batching datasets and loading them efficiently during training.
- `torch.utils.data.RandomSampler` and `torch.utils.data.SequentialSampler`: For random and sequential sampling of datasets.
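- `transformers.AdamW` and `transformers.get_linear_schedule_with_warmup`: For the AdamW optimizer and the linear learning rate scheduler used with transformers. Note that `transformers.AdamW` is deprecated in recent `transformers` releases; `torch.optim.AdamW` is the usual replacement.
- `sklearn.metrics.f1_score`: For evaluating model performance using the F1 score.
- `matplotlib.pyplot` as `plt`: For plotting training metrics.
- `random`: For random number generation, typically used in data shuffling. Its import is commented out in the import cell, so it is currently unused.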

Commands to install the libraries:
```bash
conda install numpy pandas scikit-learn tqdm pytorch matplotlib
conda install -c huggingface transformers
```

Once the libraries were installed, I successfully imported them into the notebook and was ready to proceed with building and training the machine learning model.
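
Before importing everything, it can be useful to confirm which versions ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the list uses distribution names (for example `scikit-learn` and `torch`), which differ from some import and conda package names.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in distributions:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Distribution names, not import names (scikit-learn rather than sklearn).
packages = ["numpy", "pandas", "scikit-learn", "tqdm", "torch", "transformers", "matplotlib"]
for name, ver in installed_versions(packages).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any line that reports `not installed` points to a package that still needs to be added to the environment.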

In [197]:
import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup
from sklearn.metrics import f1_score
# import random
import matplotlib.pyplot as plt

In [198]:
num_epochs = 8
learning_rate = 5e-5

With a learning rate of 5e-5, the model reached roughly 77% to 79% accuracy on both the train set and the test set by the **third** epoch.

This learning rate gave the best results of the values I tried; the alternatives were 5e-3, 3e-4, 3e-5, and 2e-5.

### Data Load and Handling

- **`df`**: The full dataset loaded from the CSV (`dataset.csv`).
- **`labels`**: The list of labels (i.e., 'Artifact Id') extracted from the dataset.
- **`label_counts`**: The count of occurrences of each unique label in the dataset.
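- **`filtered_labels`**: The subset of `label_counts` (a Series of counts indexed by label) for the labels that appear at least 5 times.
- **`filtered_labels_list`**: The label names from `filtered_labels` as a plain Python list. A label's position in this list is later used as its class index.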
- **`filtered_df`**: The filtered DataFrame containing only the rows with labels that appear at least 5 times.

By running this script, you get a new DataFrame, `filtered_df`, containing only the rows with labels that appear 5 or more times in the original dataset.

**Note**: This is a very simple preprocessing step where we filter out labels with fewer than 5 occurrences. There is no further data cleaning, such as handling missing values or data normalization, applied at this stage.

In [199]:
df = pd.read_csv('dataset.csv')

labels = df['Artifact Id']

label_counts = labels.value_counts()

filtered_labels = label_counts[label_counts >= 5]

filtered_labels_list = filtered_labels.index.tolist()

filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

print(filtered_df['Artifact Id'].value_counts())

Artifact Id
d3f:Command                      250
d3f:Database                      16
d3f:Software                      16
d3f:HardwareDriver                14
d3f:DisplayServer                 11
d3f:OperatingSystem                8
d3f:FileSystem                     7
d3f:BootLoader                     6
d3f:InterprocessCommunication      5
Name: count, dtype: int64


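##### Check Whether the Model Can Overfit a Small Dataset with Two Labels

As a sanity check that the model can learn at all, I trained it on a small two-label subset to deliberately induce overfitting. By the third epoch it reached 96% accuracy on the train set compared to 57% on the test set, which confirms that it can fit the training data. The first commented-out cell below (`label_counts == 16`) selects the two labels with exactly 16 examples each, `d3f:Database` and `d3f:Software`.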


In [200]:
# df = pd.read_csv('dataset.csv')

# labels = df['Artifact Id']

# label_counts = labels.value_counts()

# filtered_labels = label_counts[label_counts == 16]

# filtered_labels_list = filtered_labels.index.tolist()

# filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

In [201]:
# df = pd.read_csv('dataset.csv')

# labels = df['Artifact Id']

# label_counts = labels.value_counts()

# filtered_labels = label_counts[(label_counts >= 5) & (label_counts <= 200)]

# filtered_labels_list = filtered_labels.index.tolist()

# filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

### Divide to Train, Validation, and Test While Keeping the Distribution

In this step, the dataset is split into a training set, a validation set, and a testing set, while maintaining the same label distribution across all sets. This is done using the `train_test_split` function from `sklearn.model_selection`, with the `stratify` parameter set to the labels (`Artifact Id`) to ensure that all splits preserve the same class proportions as the original dataset.

- **Training Set**: Contains 64% of the data.
- **Validation Set**: Contains 16% of the data.
- **Test Set**: Contains 20% of the data.

Stratified splitting keeps the label proportions consistent across the training, validation, and test sets, so every label stays represented in each split even though the dataset is small and heavily skewed toward `d3f:Command`. It does not rebalance the classes.
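
The 64/16/20 figures follow from applying a 20% split twice. The sketch below works through that arithmetic for the 333 rows that remain after filtering (the sum of the label counts printed earlier). It assumes each held-out share is rounded up to a whole row, which is my understanding of how `train_test_split` treats a fractional `test_size`; the `split_sizes` helper is hypothetical.

```python
import math

def split_sizes(n_rows, test_fraction=0.2, val_fraction=0.2):
    """Sizes of train/val/test after two successive hold-out splits."""
    # First split: hold out the test set from the full data.
    n_test = math.ceil(n_rows * test_fraction)
    n_train_val = n_rows - n_test
    # Second split: hold out the validation set from what is left.
    n_val = math.ceil(n_train_val * val_fraction)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(333)
print(n_train, n_val, n_test)  # 212 54 67

# Nominal shares: 0.8 * 0.8 = 64% train, 0.8 * 0.2 = 16% validation, 20% test.
for name, n in (("train", n_train), ("val", n_val), ("test", n_test)):
    print(f"{name}: {n / 333:.1%}")
```

These sizes agree with the totals of the three label distributions printed in this section.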


In [202]:
train_df, test_df = train_test_split(filtered_df,
                                     test_size=0.2,
                                     stratify=filtered_df['Artifact Id'],
                                     random_state=42)

train_df, val_df = train_test_split(train_df,
                                    test_size=0.2,
                                    stratify=train_df['Artifact Id'],
                                    random_state=42)

In [203]:
print("Training set label distribution:")
print(train_df['Artifact Id'].value_counts())

Training set label distribution:
Artifact Id
d3f:Command                      160
d3f:Software                      10
d3f:Database                      10
d3f:HardwareDriver                 9
d3f:DisplayServer                  7
d3f:OperatingSystem                5
d3f:FileSystem                     4
d3f:BootLoader                     4
d3f:InterprocessCommunication      3
Name: count, dtype: int64


In [204]:
print("\nValidation set label distribution:")
print(val_df['Artifact Id'].value_counts())


Validation set label distribution:
Artifact Id
d3f:Command                      40
d3f:Database                      3
d3f:Software                      3
d3f:HardwareDriver                2
d3f:DisplayServer                 2
d3f:InterprocessCommunication     1
d3f:OperatingSystem               1
d3f:FileSystem                    1
d3f:BootLoader                    1
Name: count, dtype: int64


In [205]:
print("\nTest set label distribution:")
print(test_df['Artifact Id'].value_counts())


Test set label distribution:
Artifact Id
d3f:Command                      50
d3f:HardwareDriver                3
d3f:Software                      3
d3f:Database                      3
d3f:DisplayServer                 2
d3f:FileSystem                    2
d3f:OperatingSystem               2
d3f:BootLoader                    1
d3f:InterprocessCommunication     1
Name: count, dtype: int64


### Model and Tokenizer Initialization

- **Model**:
  - A pre-trained `BertForSequenceClassification` model is loaded from the `bert-base-uncased` variant of BERT.
  - The model includes a classification head configured with `num_labels=len(filtered_labels_list)`, the number of unique labels in the filtered dataset (9 in this run).

- **Tokenizer**:
  - The `BertTokenizer` associated with `bert-base-uncased` is initialized for consistent tokenization of input text.

These components leverage the pre-trained BERT architecture for fine-tuning on the specific multiclass classification task.


In [206]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(filtered_labels_list))
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Data Preparation: Tokenization and Dataloader Creation

- **`tokenize_data` Function**: A helper function that tokenizes input text using the BERT tokenizer. With `padding=True`, every sequence is padded to the longest one in the call; sequences are truncated to a maximum length of 512 tokens, and the result is returned as PyTorch tensors.

- **Tokenization**:
  - The training, validation, and test data (`'Example Description'`) are tokenized into input IDs and attention masks using the `tokenize_data` function.

- **Label Encoding**:
  - Labels (`'Artifact Id'`) are mapped to their index positions in the `filtered_labels_list` to create tensor labels for the training, validation, and test datasets.

- **Dataset Creation**:
  - Training, validation, and test data are combined into `TensorDataset` objects, including tokenized inputs (`input_ids`, `attention_mask`) and the corresponding labels.

- **DataLoader**:
  - `train_loader`: A DataLoader with a random sampling strategy for shuffling and a batch size of 16.
  - `val_loader`: A DataLoader with a sequential sampling strategy for evaluation and the same batch size, used for validation.
  - `test_loader`: A DataLoader with a sequential sampling strategy for evaluation and the same batch size.

These steps prepare the tokenized data and labels for efficient use during training, validation, and evaluation.
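
The label encoding step maps each label to its position in `filtered_labels_list`. One way to picture it is as a dictionary lookup. The sketch below is an alternative to repeated `list.index` calls, not the code used in the notebook; the helper names are hypothetical and the label list is shortened to three entries for illustration.

```python
def build_label_index(label_names):
    """Map each label name to its position in the list."""
    return {label: idx for idx, label in enumerate(label_names)}

def encode_labels(labels, label_to_id):
    """Convert label names to integer class ids."""
    return [label_to_id[label] for label in labels]

# Shortened stand-in for the notebook's filtered_labels_list.
filtered_labels_list = ["d3f:Command", "d3f:Database", "d3f:Software"]
label_to_id = build_label_index(filtered_labels_list)
id_to_label = {idx: label for label, idx in label_to_id.items()}

encoded = encode_labels(["d3f:Software", "d3f:Command", "d3f:Command"], label_to_id)
print(encoded)                            # [2, 0, 0]
print([id_to_label[i] for i in encoded])  # back to the label names
```

The reverse mapping is handy later for turning predicted class ids back into readable labels.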


In [207]:
def tokenize_data(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Tokenize the training, validation, and test data
train_encodings = tokenize_data(train_df['Example Description'].tolist())
val_encodings = tokenize_data(val_df['Example Description'].tolist())  # Validation set
test_encodings = tokenize_data(test_df['Example Description'].tolist())

# Convert labels to tensors
train_labels = torch.tensor(train_df['Artifact Id'].map(lambda x: filtered_labels_list.index(x)).tolist())
val_labels = torch.tensor(val_df['Artifact Id'].map(lambda x: filtered_labels_list.index(x)).tolist())  # Validation set
test_labels = torch.tensor(test_df['Artifact Id'].map(lambda x: filtered_labels_list.index(x)).tolist())

train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
val_dataset = TensorDataset(val_encodings['input_ids'], val_encodings['attention_mask'], val_labels)  # Validation set
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=16)
val_loader = DataLoader(val_dataset, sampler=SequentialSampler(val_dataset), batch_size=16)  # Validation set
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=16)


### Optimizer and Learning Rate Scheduler

- **Optimizer**:
  - `AdamW` (Adam with decoupled weight decay) is used as the optimizer; it is a common choice for fine-tuning transformer-based models like BERT.
  - The learning rate is set to `learning_rate` (`5e-5` in this run).

- **Learning Rate Scheduler**:
  - A linear scheduler from `get_linear_schedule_with_warmup` is used with `num_warmup_steps=0`, so there is no warm-up phase.
  - The scheduler decays the learning rate linearly from `learning_rate` to 0 over `total_steps` (calculated as `number of batches × number of epochs`).

These components ensure efficient and stable optimization of the model parameters during training.
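
To make the schedule concrete, the sketch below reproduces the multiplier that a linear warm-up/decay schedule applies to the base learning rate. It reflects my understanding of the rule behind `get_linear_schedule_with_warmup` rather than the library's own code; the function name `linear_schedule_factor` is hypothetical. The step count uses the 14 batches per epoch shown in the training progress bars and the 8 epochs set earlier.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warm-up: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

learning_rate = 5e-5
total_steps = 14 * 8  # batches per epoch x epochs = 112

for step in (0, 56, 112):
    lr = learning_rate * linear_schedule_factor(step, 0, total_steps)
    print(f"step {step:3d}: lr = {lr:.2e}")
```

With zero warm-up steps the rate starts at its full value, is halved at the midpoint, and reaches 0 on the final step.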


In [208]:
optimizer = AdamW(model.parameters(), lr=learning_rate)

total_steps = len(train_loader) * num_epochs

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)



### Training Evaluation

In this part of the process, the model is trained for a specified number of epochs, with performance monitored by calculating both the training loss and accuracy.

- **Training Phase**: The model is trained using batches of data, with the optimizer and scheduler being updated accordingly. For each batch, the loss is calculated and backpropagated to adjust the model's parameters. Training accuracy is calculated based on correct predictions.

The average training loss and the training accuracy are printed at the end of each epoch, so they can be plotted afterwards to visualize the model's performance.

This process helps track the model's training progress, ensuring it learns effectively from the data.

---

*Note: The metrics reported by this cell are training metrics only; the validation set is not evaluated inside this loop.*
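
A minimal sketch of how the per-epoch metrics could be plotted is shown here. The helper name `plot_training_history` is hypothetical, and the two lists are copied by hand from the values printed by the training cell in this section; in practice they would be appended to inside the loop.

```python
import matplotlib.pyplot as plt

def plot_training_history(losses, accuracies):
    """Draw per-epoch loss and accuracy side by side and return the figure."""
    epochs = range(1, len(losses) + 1)
    fig, (loss_ax, acc_ax) = plt.subplots(1, 2, figsize=(10, 4))

    loss_ax.plot(epochs, losses, marker="o")
    loss_ax.set_xlabel("Epoch")
    loss_ax.set_ylabel("Average training loss")

    acc_ax.plot(epochs, accuracies, marker="o")
    acc_ax.set_xlabel("Epoch")
    acc_ax.set_ylabel("Training accuracy (%)")

    fig.tight_layout()
    return fig

# Values taken from the printed output of the 8-epoch run.
train_losses = [1.5569, 0.9137, 0.7610, 0.5999, 0.5277, 0.4531, 0.4205, 0.4162]
train_accuracies = [58.49, 75.47, 75.47, 80.66, 81.13, 87.26, 88.68, 91.04]
fig = plot_training_history(train_losses, train_accuracies)
```

In a notebook the figure renders automatically at the end of the cell; elsewhere, `fig.savefig(...)` can write it to a file.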


In [210]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train() 
    total_loss = 0
    total_correct = 0
    total_samples = 0
    
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()

        input_ids, attention_mask, labels = [item.to(device) for item in batch]

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        
        preds = torch.argmax(logits, dim=-1)

        correct = (preds == labels).sum().item()
        total_correct += correct
        total_samples += labels.size(0)

        loss.backward()
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    # Calculate and print accuracy for the epoch
    accuracy = total_correct / total_samples * 100
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, Accuracy: {accuracy:.2f}%")

Training Epoch 1/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 1, Loss: 1.5569, Accuracy: 58.49%


Training Epoch 2/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 2, Loss: 0.9137, Accuracy: 75.47%


Training Epoch 3/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 3, Loss: 0.7610, Accuracy: 75.47%


Training Epoch 4/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 4, Loss: 0.5999, Accuracy: 80.66%


Training Epoch 5/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 5, Loss: 0.5277, Accuracy: 81.13%


Training Epoch 6/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 6, Loss: 0.4531, Accuracy: 87.26%


Training Epoch 7/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 7, Loss: 0.4205, Accuracy: 88.68%


Training Epoch 8/8:   0%|          | 0/14 [00:00<?, ?it/s]

Epoch 8, Loss: 0.4162, Accuracy: 91.04%


### Model Evaluation

After training, the model is evaluated on the test set to assess its performance on unseen data. During this phase, the model is set to evaluation mode using `model.eval()`, and predictions are made on the test set.

- **Prediction**: For each batch in the test set, the model generates predictions (logits), which are converted to class labels using `torch.argmax()`.
  
- **F1 Score**: The weighted F1 score is computed with the `f1_score` function from `sklearn.metrics` using `average='weighted'`, which averages the per-class F1 scores weighted by each class's number of true samples.
  
- **Accuracy**: The accuracy of the model is calculated by comparing the predicted labels to the true labels and computing the percentage of correct predictions.

The evaluation results help in understanding the model's effectiveness and generalization to the test set.

*Note: Evaluation runs under `torch.no_grad()` and never calls the optimizer, so the model parameters are not updated and the test set remains a true test of performance.*
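
To show what the weighted average means, the sketch below computes a weighted F1 score by hand on a tiny made-up example. It is a plain-Python illustration of the formula, not the `sklearn` implementation; the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Toy data: class 0 dominates, and one class-1 sample is misclassified as 0.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy:    {accuracy:.4f}")
print(f"weighted F1: {weighted_f1(y_true, y_pred):.4f}")
```

Here the per-class F1 scores are 8/9 for class 0 and 2/3 for class 1, so the weighted F1 (22/27, about 0.815) comes out slightly below the accuracy (5/6, about 0.833), because the minority class's weaker score still counts in proportion to its size.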


In [211]:
model.eval() 
predictions, true_labels = [], []

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids, attention_mask, labels = [item.to(device) for item in batch]

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        preds = torch.argmax(logits, dim=-1)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

# Calculate the F1 score for multiclass classification
f1 = f1_score(true_labels, predictions, average='weighted')
print(f"F1 Score (Weighted): {f1}")

# Calculate accuracy
accuracy = np.sum(np.array(predictions) == np.array(true_labels)) / len(true_labels) * 100
print(f"Accuracy: {accuracy:.2f}%")

Evaluating:   0%|          | 0/5 [00:00<?, ?it/s]

F1 Score (Weighted): 0.8430827830335148
Accuracy: 88.06%
