### Setting Up a Conda Environment with Python 3.12 and Installing Required Libraries
##### Step 1 - Created a Conda Environment with Python 3.12 and Connected to its Kernel
I created a new Conda environment with Python 3.12, activated it, and then selected it as the Python kernel in Jupyter/VS Code.

Commands to create and activate the environment:
```bash
conda create -n llm_env python=3.12
conda activate llm_env
```

##### Step 2 - Imported Required Libraries
Once the environment was set up and activated, I imported the following libraries for my project:
- `numpy` as `np`: For numerical operations and working with arrays.
- `pandas` as `pd`: For data manipulation and analysis.
- `torch`: PyTorch, for deep learning model building and training.
- `tqdm.notebook`: For displaying progress bars in Jupyter notebooks.
- `sklearn.preprocessing.LabelEncoder`: For encoding categorical labels into numerical format.
- `sklearn.model_selection.train_test_split`: For splitting datasets into training and testing sets.
- `transformers.BertTokenizer`: For tokenizing text data for BERT models.
- `transformers.BertForSequenceClassification`: For using BERT for sequence classification tasks.
- `torch.utils.data.TensorDataset`: For creating datasets from tensors.
- `torch.utils.data.DataLoader`: For batching datasets and loading them efficiently during training.
- `torch.utils.data.RandomSampler` and `torch.utils.data.SequentialSampler`: For random and sequential sampling of datasets.
- `AdamW` and `get_linear_schedule_with_warmup`: For the AdamW optimizer and the linear learning-rate schedule with warmup used when fine-tuning transformer models.
- `sklearn.metrics.f1_score`: For evaluating model performance using the F1 score.
- `random`: For random number generation, typically used in data shuffling.

Commands to install the libraries:
```bash
conda install numpy pandas scikit-learn tqdm pytorch
conda install -c huggingface transformers
```

Once the libraries were installed, I successfully imported them into the notebook and was ready to proceed with building and training the machine learning model.
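
A quick way to confirm that the kernel really is using the new environment is to list the interpreter version and the installed package versions. The following is a minimal sketch using only the standard library; the `REQUIRED` list and the helper name `installed_versions` are illustrative assumptions, and the distribution names (for example `torch` for the conda package `pytorch`) are assumed to match what the install commands above provide.

```python
import sys
from importlib import metadata

# Distribution names assumed to correspond to the install commands above.
REQUIRED = ["numpy", "pandas", "scikit-learn", "tqdm", "torch", "transformers"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the interpreter version, then one line per required package.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:>14}: {version or 'NOT INSTALLED'}")
```

If the printed Python version is not 3.12, the notebook is probably attached to a different kernel than `llm_env`.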

In [3]:
import numpy as np
import pandas as pd
import torch
from tqdm.notebook import tqdm
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW  # transformers.AdamW is deprecated in newer transformers releases
from transformers import get_linear_schedule_with_warmup
from sklearn.metrics import f1_score
import random

In [4]:
num_epochs = 3
learning_rate = 5e-5

### Data Load and Handling

- **`df`**: The full dataset loaded from the CSV (`dataset.csv`).
- **`labels`**: The `Artifact Id` column of the dataset, which holds the label of each row.
- **`label_counts`**: The number of occurrences of each unique label in the dataset.
- **`filtered_labels`**: The entries of `label_counts` for labels that appear at least 5 times.
- **`filtered_labels_list`**: The same labels as a plain Python list. Its order later defines the class index of each label.
- **`filtered_df`**: The filtered DataFrame containing only the rows whose label appears at least 5 times.

Running this cell produces a new DataFrame, `filtered_df`, containing only the rows with labels that appear 5 or more times in the original dataset.

**Note**: This is a very simple preprocessing step where we filter out labels with fewer than 5 occurrences. There is no further data cleaning, such as handling missing values or data normalization, applied at this stage.
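
To make the filtering step concrete, here is a minimal sketch on a made-up toy DataFrame. The column names match the notebook, but the values and the helper name `filter_by_min_count` are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for dataset.csv: label "C" occurs only twice and should be dropped.
toy_df = pd.DataFrame({
    "Artifact Id": ["A"] * 6 + ["B"] * 5 + ["C"] * 2,
    "Example Description": [f"text {i}" for i in range(13)],
})


def filter_by_min_count(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[column].isin(frequent)]


toy_filtered = filter_by_min_count(toy_df, "Artifact Id", min_count=5)
print(toy_filtered["Artifact Id"].value_counts().to_dict())  # {'A': 6, 'B': 5}
```

Note that `value_counts()` ignores missing values by default, so rows with a missing label would also be dropped by this filter.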

In [22]:
df = pd.read_csv('dataset.csv')

labels = df['Artifact Id']

label_counts = labels.value_counts()

filtered_labels = label_counts[label_counts >= 5]

filtered_labels_list = filtered_labels.index.tolist()

filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

# print(filtered_df['Artifact Id'].value_counts())

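##### Check whether the model can overfit a small dataset with two labels

As a sanity check that the training loop works, I trained on a small subset containing only two labels (the commented-out cell below selects labels with exactly 16 occurrences) to induce overfitting. By the third epoch, the model reached 96% accuracy on the training set compared with 57% on the test set, which confirms that it can fit the training data.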


In [20]:
# df = pd.read_csv('dataset.csv')

# labels = df['Artifact Id']

# label_counts = labels.value_counts()

# filtered_labels = label_counts[label_counts == 16]

# filtered_labels_list = filtered_labels.index.tolist()

# filtered_df = df[df['Artifact Id'].isin(filtered_labels_list)]

### Divide to Train and Test While Keeping the Distribution

In this step, the dataset is split into a training and a testing set while maintaining the same label distribution across both sets. This is done using the `train_test_split` function from `sklearn.model_selection`, with the `stratify` parameter set to the labels (`Artifact Id`) to ensure that both the training and testing sets have the same class proportions as the original dataset.

- **Training Set**: Contains 80% of the data.
- **Test Set**: Contains 20% of the data.

With stratified splitting, the label distribution is consistent across the training and test sets. This avoids the bias that a purely random split can introduce when some labels have only a few examples.
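
The effect of `stratify` can be checked on a small example. The sketch below uses a made-up toy DataFrame with a 3:1 label ratio (the data is an assumption for illustration); both splits should keep that ratio.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 15 rows of label "A" and 5 rows of label "B" (a 3:1 ratio).
toy_df = pd.DataFrame({
    "Artifact Id": ["A"] * 15 + ["B"] * 5,
    "Example Description": [f"text {i}" for i in range(20)],
})

toy_train, toy_test = train_test_split(
    toy_df,
    test_size=0.2,
    stratify=toy_df["Artifact Id"],
    random_state=42,
)

train_counts = toy_train["Artifact Id"].value_counts().to_dict()
test_counts = toy_test["Artifact Id"].value_counts().to_dict()
print("train:", train_counts)
print("test: ", test_counts)
```

Stratified splitting needs at least two examples of every label, which is one practical reason to drop very rare labels before splitting.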

In [23]:
train_df, test_df = train_test_split(filtered_df,
                                     test_size=0.2,
                                     stratify=filtered_df['Artifact Id'],
                                     random_state=42)

# Check the label distribution in both sets
# print("Training set label distribution:")
# print(train_df['Artifact Id'].value_counts())

# print("\nTest set label distribution:")
# print(test_df['Artifact Id'].value_counts())

In [15]:
# Load the pre-trained BERT model with a classification head
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(filtered_labels_list))
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### Preprocess the Data (Tokenization)

In [16]:
def tokenize_data(texts):
    return tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Tokenize the training and test data
train_encodings = tokenize_data(train_df['Example Description'].tolist()) 
test_encodings = tokenize_data(test_df['Example Description'].tolist())

# Convert labels to tensors: each label becomes its index in filtered_labels_list
label_to_id = {label: idx for idx, label in enumerate(filtered_labels_list)}
train_labels = torch.tensor(train_df['Artifact Id'].map(label_to_id).tolist())
test_labels = torch.tensor(test_df['Artifact Id'].map(label_to_id).tolist())

# Create TensorDataset
train_dataset = TensorDataset(train_encodings['input_ids'], train_encodings['attention_mask'], train_labels)
test_dataset = TensorDataset(test_encodings['input_ids'], test_encodings['attention_mask'], test_labels)

# Create DataLoader
train_loader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=16)
test_loader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset), batch_size=16)

### Define the Optimizer and Learning Rate Scheduler

In [17]:
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Number of training steps (number of batches * number of epochs)
total_steps = len(train_loader) * num_epochs

# Learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)
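
To see what a linear schedule with warmup does to the learning rate, the multiplier it applies at each step can be re-implemented in a few lines of plain Python. This is a hedged sketch of the general rule (linear ramp-up during warmup, then linear decay to zero), not the library's own code; the function name `linear_schedule_factor` and the step counts are assumptions chosen for illustration.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup phase: ramp linearly from 0 up to 1.
        return step / max(1, num_warmup_steps)
    # Decay phase: fall linearly from 1 down to 0 at the last training step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 5e-5
total = 2 * 3  # illustrative: 2 batches per epoch * 3 epochs

# With no warmup, the rate starts at base_lr and shrinks by the same amount each step.
for step in range(total + 1):
    print(step, base_lr * linear_schedule_factor(step, 0, total))
```

With `num_warmup_steps=0`, as in the cell above, the learning rate starts at its full value and decays linearly to zero by the final step.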



### Training Loop

In [18]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

for epoch in range(num_epochs):
    model.train()  # Set model to training mode
    total_loss = 0
    total_correct = 0
    total_samples = 0
    
    for batch in tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs}"):
        optimizer.zero_grad()

        input_ids, attention_mask, labels = [item.to(device) for item in batch]

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        logits = outputs.logits
        
        # Get predictions
        preds = torch.argmax(logits, dim=-1)

        # Calculate accuracy
        correct = (preds == labels).sum().item()
        total_correct += correct
        total_samples += labels.size(0)

        # Backward pass
        loss.backward()
        optimizer.step()
        scheduler.step()

        total_loss += loss.item()

    # Calculate and print accuracy for the epoch
    accuracy = total_correct / total_samples * 100
    print(f"Epoch {epoch+1}, Loss: {total_loss / len(train_loader):.4f}, Accuracy: {accuracy:.2f}%")


Epoch 1, Loss: 0.7183, Accuracy: 52.00%


Epoch 2, Loss: 0.5315, Accuracy: 80.00%


Epoch 3, Loss: 0.3861, Accuracy: 96.00%


### Evaluation Loop

In [19]:
model.eval()  # Set model to evaluation mode
predictions, true_labels = [], []

with torch.no_grad():
    for batch in tqdm(test_loader, desc="Evaluating"):
        input_ids, attention_mask, labels = [item.to(device) for item in batch]

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        # Get predictions
        preds = torch.argmax(logits, dim=-1)
        predictions.extend(preds.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

# Calculate the F1 score for multiclass classification
f1 = f1_score(true_labels, predictions, average='weighted')
print(f"F1 Score (Weighted): {f1}")

# Calculate accuracy
accuracy = np.sum(np.array(predictions) == np.array(true_labels)) / len(true_labels) * 100
print(f"Accuracy: {accuracy:.2f}%")

F1 Score (Weighted): 0.5714285714285714
Accuracy: 57.14%
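
For reference, the weighted F1 score is the mean of the per-class F1 scores, with each class weighted by its number of true examples. The sketch below computes it by hand in plain Python on made-up labels; the function name `weighted_f1` and the example data are assumptions for illustration, not results from this notebook.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for cls, n_cls in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_cls - tp
        denom = 2 * tp + fp + fn
        f1_cls = 2 * tp / denom if denom else 0.0
        score += (n_cls / total) * f1_cls
    return score


y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
# Class 0: F1 = 0.5 (support 2); class 1: F1 = 2/3 (support 3).
print(round(weighted_f1(y_true, y_pred), 4))  # 0.6
```

Because each class is weighted by its support, frequent labels dominate this score; a macro average would weight every label equally instead.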
