# Time Complexity Analysis Using CodeBERT

## 1. Import Libraries and Initialize the CodeBERT Feature-Extraction Pipeline with Device Selection


This cell sets up a `feature-extraction` pipeline using the pretrained `microsoft/codebert-base` model from the Hugging Face Transformers library and runs it on the GPU when one is available.

- **Device Selection:** Passes `device=0` (the first GPU) if CUDA is available, otherwise `device=-1` (CPU).
- **Pipeline Purpose:** Extracts one embedding vector per input token from code or text inputs using CodeBERT.


In [25]:
from transformers import pipeline
import torch

# Determine the device to use: 0 for GPU if available, -1 for CPU
device_id = 0 if torch.cuda.is_available() else -1

# Create the pipeline with the specified device
pipe = pipeline("feature-extraction", model="microsoft/codebert-base", device=device_id)

Device set to use cuda:0


In [26]:
device_id = 0 if torch.cuda.is_available() else -1
# Pass the selected device explicitly so device_id is actually used
pipe = pipeline("feature-extraction", model="microsoft/codebert-base", device=device_id)

Device set to use cuda:0
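
The `feature-extraction` pipeline returns one vector per token rather than one vector per snippet, so a pooling step is usually needed before the embeddings can be compared or fed to a classifier. The following is a minimal sketch of mean pooling in plain Python. The nested list `fake_output` is a made-up stand-in for the output of `pipe(...)` (assumed here to be nested as input, then tokens, then hidden values), and `mean_pool` is a hypothetical helper, not part of Transformers.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into a single vector."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    n = len(token_vectors)
    # zip(*...) walks the vectors column by column (one hidden dimension at a time)
    return [sum(column) / n for column in zip(*token_vectors)]


# Made-up stand-in for pipeline output: 1 input, 3 tokens, hidden size 4.
# Real CodeBERT output would have a much larger hidden size per token.
fake_output = [[
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]]

snippet_vector = mean_pool(fake_output[0])
print(snippet_vector)  # [2.0, 2.0, 2.0, 2.0]
```

Mean pooling is only one option; taking the first token's vector is another common choice, and which works better for complexity prediction is not established by this notebook.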


## 2. Load Pretrained CodeBERT Tokenizer and Model Directly

In [27]:
# Load model directly
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

## 3. Import dataset

This dataset consists of competitive programming code snippets along with metadata, used for code complexity analysis and classification tasks.

**Features:**
- `src`: The source code snippet (typically in Java).
- `complexity`: The algorithmic time complexity (e.g., `linear`, `quadratic`, `nlogn`, `constant`).
- `problem`: Problem name and identifier from a coding platform.
- `from`: The source platform of the problem (e.g., `CODEFORCES`).

In [28]:
import pandas as pd


df = pd.read_json("/kaggle/input/codecomp/codebertdata.jsonl", lines=True)
print(df.shape)


(4517, 4)


In [29]:
df.head(10)


|   | src | complexity | problem | from |
|---|-----|------------|---------|------|
| 0 | `import java.io.*;\nimport java.math.BigInteger...` | quadratic | 1179_B. Tolik and His Uncle | CODEFORCES |
| 1 | `import java.util.Scanner;\n \npublic class pil...` | linear | 1197_B. Pillars | CODEFORCES |
| 2 | `import java.io.BufferedReader;\nimport java.io...` | linear | 1059_C. Sequence Transformation | CODEFORCES |
| 3 | `import java.util.*;\n\nimport java.io.*;\npubl...` | linear | 1011_A. Stages | CODEFORCES |
| 4 | `import java.io.OutputStream;\nimport java.io.I...` | linear | 1190_C. Tokitsukaze and Duel | CODEFORCES |
| 5 | `import java.math.BigDecimal;\nimport java.math...` | quadratic | 527_B. Error Correct System | CODEFORCES |
| 6 | `import java.util.*;\nimport java.io.*;\n\nimpo...` | nlogn | 913_D. Too Easy Problems | CODEFORCES |
| 7 | `import java.io.*;\nimport java.util.*;\n\nimpo...` | nlogn | 1197_C. Array Splitting | CODEFORCES |
| 8 | `\n// LM10: The next Ballon d'or\nimport java.u...` | linear | 1038_D. Slime | CODEFORCES |
| 9 | `import java.util.*;\nimport java.io.*;\nimport...` | constant | 1028_B. Unnatural Conditions | CODEFORCES |


## 4. Extract Code Snippets and Labels

In [30]:
codes = df['src'].tolist()
labels = df['complexity'].tolist()

In [31]:
# Collect the distinct complexity classes (a set, not a count)
unique_labels = set(labels)
sorted_labels = sorted(unique_labels)
print(f"Number of classes: {len(unique_labels)}")
print(f"Labels in alphabetical order: {sorted_labels}")


Number of classes: 7
Labels in alphabetical order: ['constant', 'cubic', 'linear', 'logn', 'nlogn', 'np', 'quadratic']
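
Knowing that there are seven classes does not say how balanced they are, and a skewed distribution would make plain accuracy harder to interpret. A minimal sketch of a per-class count follows. The helper `class_distribution` is hypothetical, and `sample_labels` holds only the ten labels visible in the `df.head(10)` preview above, not the full dataset, so the printed proportions say nothing about the real distribution.

```python
from collections import Counter


def class_distribution(labels):
    """Map each label to (count, share of all samples), sorted by label name."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# The ten labels shown in the preview; in the notebook, pass `labels` instead.
sample_labels = [
    "quadratic", "linear", "linear", "linear", "linear",
    "quadratic", "nlogn", "nlogn", "linear", "constant",
]

distribution = class_distribution(sample_labels)
for label, (count, share) in distribution.items():
    print(f"{label:<10} {count:>4} {share:.1%}")
```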


## 5. Initialize the CodeBERT Tokenizer and Tokenize the Code


In [32]:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
inputs = tokenizer(codes, padding=True, truncation=True, return_tensors="pt")
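
With `padding=True` the tokenizer pads every sequence to the longest one in the call, and with `truncation=True` it cuts sequences that exceed the model's maximum length (which should be 512 tokens for this checkpoint), so long Java files lose their tail. The sketch below illustrates that behavior on plain lists of token ids. It is an illustration only, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, and special tokens are ignored.

```python
def pad_and_truncate(sequences, max_length, pad_id):
    """Cut each sequence to max_length, then right-pad to the longest remaining one."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Made-up token ids; pad_id=1 is assumed here for illustration.
toy_ids, toy_mask = pad_and_truncate([[5, 6, 7, 8, 9, 10], [5, 6]], max_length=4, pad_id=1)
print(toy_ids)   # [[5, 6, 7, 8], [5, 6, 1, 1]]
print(toy_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```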


## 6. Encode the Labels and Build the Dataset


In [33]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

In [34]:
print("Encoded labels:", encoded_labels[:10])


Encoded labels: [6 2 2 2 2 6 4 4 2 0]
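
`LabelEncoder` assigns integer ids in sorted order of the class names, which is why `quadratic` becomes 6 and `constant` becomes 0. The sketch below reproduces that mapping with plain dictionaries so the ids can be read back as class names. The names `label_to_id` and `id_to_label` are introduced here for illustration; in the notebook, `label_encoder.classes_` and `label_encoder.inverse_transform` provide the same information.

```python
# The seven classes printed in section 4
classes = ["constant", "cubic", "linear", "logn", "nlogn", "np", "quadratic"]

# Sorted position of each class name is its integer id
label_to_id = {label: idx for idx, label in enumerate(sorted(classes))}
id_to_label = {idx: label for label, idx in label_to_id.items()}

# The first ten labels from the dataset preview
first_ten = [
    "quadratic", "linear", "linear", "linear", "linear",
    "quadratic", "nlogn", "nlogn", "linear", "constant",
]
encoded = [label_to_id[label] for label in first_ten]
print(encoded)         # [6, 2, 2, 2, 2, 6, 4, 4, 2, 0]
print(id_to_label[4])  # nlogn
```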


### Create Custom PyTorch Dataset and DataLoader for CodeBERT

The cell below defines a custom `CodeDataset` class that pairs the tokenized inputs with their encoded labels, so that PyTorch's `DataLoader` can batch and shuffle them.



In [35]:
import torch
from torch.utils.data import Dataset, DataLoader

class CodeDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = CodeDataset(inputs, encoded_labels)
dataloader = DataLoader(dataset, batch_size=8, shuffle=True)


## 7. Split Encoded Dataset into Training and Validation Sets


In [37]:
from torch.utils.data import random_split

In [38]:
# Split the dataset into training and validation sets
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

In [39]:
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8)
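
As written, `random_split` draws a different partition on every run. If a reproducible split is wanted, a seeded `torch.Generator` can be passed in. The sketch below shows this on a toy list instead of the real `CodeDataset`; `seeded_split` and the seed value are assumptions made for illustration.

```python
import torch
from torch.utils.data import random_split


def seeded_split(dataset, train_fraction=0.8, seed=42):
    """Split a dataset into train/validation subsets with a fixed random seed."""
    train_size = int(train_fraction * len(dataset))
    val_size = len(dataset) - train_size
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)


toy_dataset = list(range(10))  # stand-in for the real dataset
train_a, val_a = seeded_split(toy_dataset)
train_b, val_b = seeded_split(toy_dataset)

print(len(train_a), len(val_a))            # 8 2
print(train_a.indices == train_b.indices)  # True: same seed, same partition
```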

## 8. Load CodeBERT for Sequence Classification with Custom Label Count


Initialize the CodeBERT model (`RobertaForSequenceClassification`) with the number of output labels based on the encoded classes for code complexity classification.


In [43]:
from transformers import RobertaForSequenceClassification

model = RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=len(label_encoder.classes_))


## 9. Train-Test Split of Raw Data for Tokenization and Encoding


In [19]:
from sklearn.model_selection import train_test_split

# Split the raw data; these splits replace the random_split subsets from section 7
train_codes, val_codes, train_labels, val_labels = train_test_split(codes, labels, test_size=0.2, random_state=42)

# Tokenize each split separately
train_inputs = tokenizer(train_codes, padding=True, truncation=True, return_tensors="pt")
val_inputs = tokenizer(val_codes, padding=True, truncation=True, return_tensors="pt")

# Reuse the encoder already fitted on all labels in section 6 so class ids stay consistent
train_labels_encoded = label_encoder.transform(train_labels)
val_labels_encoded = label_encoder.transform(val_labels)

## 10. Create Encoded Datasets and DataLoaders for Model Training


In [None]:

train_dataset = CodeDataset(train_inputs, train_labels_encoded)
val_dataset = CodeDataset(val_inputs, val_labels_encoded)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)

## 11. Train CodeBERT Model with Training Accuracy Reported per Epoch


In [20]:
import torch
from torch.optim import AdamW  # Use AdamW from torch.optim
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader

# Define device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()

# Define accuracy function
def compute_accuracy(preds, labels):
    preds = torch.argmax(preds, dim=1)
    return accuracy_score(labels.cpu().numpy(), preds.cpu().numpy())

# Set the learning rate
learning_rate = 1e-5

# Initialize optimizer with the specified learning rate
optimizer = AdamW(model.parameters(), lr=learning_rate)

# Training parameters
epochs = 10  # You can adjust this as needed
best_val_accuracy = 0

for epoch in range(epochs):
    model.train()
    total_accuracy = 0
    total_loss = 0

    for batch in train_dataloader:
        # Move batch to the device
        batch = {k: v.to(device) for k, v in batch.items()}

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss
        logits = outputs.logits

        # Backward pass
        loss.backward()
        optimizer.step()

        # Compute accuracy
        labels = batch['labels']
        accuracy = compute_accuracy(logits, labels)

        total_loss += loss.item()
        total_accuracy += accuracy

    avg_loss = total_loss / len(train_dataloader)
    avg_accuracy = total_accuracy / len(train_dataloader)

    print(f"Epoch {epoch + 1}: Train Loss: {avg_loss:.4f}, Train Accuracy: {avg_accuracy:.4f}")


Epoch 1: Train Loss: 1.5004, Train Accuracy: 0.4194
Epoch 2: Train Loss: 0.7896, Train Accuracy: 0.7319
Epoch 3: Train Loss: 0.5175, Train Accuracy: 0.8322
Epoch 4: Train Loss: 0.3398, Train Accuracy: 0.8881
Epoch 5: Train Loss: 0.2080, Train Accuracy: 0.9351
Epoch 6: Train Loss: 0.1496, Train Accuracy: 0.9538
Epoch 7: Train Loss: 0.1065, Train Accuracy: 0.9671
Epoch 8: Train Loss: 0.1071, Train Accuracy: 0.9649
Epoch 9: Train Loss: 0.0785, Train Accuracy: 0.9762
Epoch 10: Train Loss: 0.0824, Train Accuracy: 0.9729
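
The numbers above are measured on the training data only, so they cannot show whether the model generalizes or is memorizing the training set. A minimal sketch of a validation pass follows, assuming a model whose output exposes `.loss` and `.logits` as in the training loop. The `evaluate` helper and `ToyClassifier` are hypothetical; the toy model exists only so the sketch runs without downloading CodeBERT.

```python
from types import SimpleNamespace

import torch
from torch import nn
from torch.utils.data import DataLoader


@torch.no_grad()
def evaluate(model, dataloader, device):
    """Return (mean batch loss, accuracy over all samples) without updating weights."""
    model.eval()  # switch off dropout for evaluation
    total_loss, correct, seen = 0.0, 0, 0
    for batch in dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        total_loss += outputs.loss.item()
        preds = outputs.logits.argmax(dim=1)
        correct += (preds == batch["labels"]).sum().item()
        seen += batch["labels"].size(0)
    return total_loss / max(len(dataloader), 1), correct / max(seen, 1)


class ToyClassifier(nn.Module):
    """Hypothetical stand-in that mimics the `.loss` / `.logits` output fields."""

    def __init__(self, num_labels=3):
        super().__init__()
        self.num_labels = num_labels

    def forward(self, input_ids, labels):
        # Predict the class stored in the first token id (always correct here)
        logits = nn.functional.one_hot(input_ids[:, 0], self.num_labels).float()
        loss = nn.functional.cross_entropy(logits, labels)
        return SimpleNamespace(loss=loss, logits=logits)


toy_data = [
    {"input_ids": torch.tensor([i % 3, 0]), "labels": torch.tensor(i % 3)}
    for i in range(5)
]
toy_loader = DataLoader(toy_data, batch_size=2)
val_loss, val_accuracy = evaluate(ToyClassifier(), toy_loader, torch.device("cpu"))
print(f"Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}")
```

In the notebook, the equivalent call would be `evaluate(model, val_dataloader, device)` at the end of each epoch; the existing `model.train()` at the top of the loop restores training mode afterwards. Counting correct predictions over all samples also avoids the small bias of averaging per-batch accuracies when the last batch is shorter.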


## 12. Save the model

In [44]:
torch.save(model.state_dict(), '/kaggle/working/bertmodel.pth')
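
A `state_dict` file stores only the weights, so loading it later requires rebuilding the same architecture first and then calling `load_state_dict`. The sketch below shows the round trip with a small `nn.Linear` stand-in and a temporary file so that it runs without downloading CodeBERT; `build_model` is a hypothetical helper. In the notebook, the stand-in would correspond to `RobertaForSequenceClassification.from_pretrained("microsoft/codebert-base", num_labels=...)` with the same number of labels used for training.

```python
import os
import tempfile

import torch
from torch import nn


def build_model():
    """Stand-in architecture; 7 outputs mirror the seven complexity classes."""
    return nn.Linear(4, 7)


trained = build_model()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "weights.pth")
    torch.save(trained.state_dict(), path)

    # Rebuild the architecture, then copy the saved weights into it
    restored = build_model()
    state = torch.load(path, map_location="cpu")
    restored.load_state_dict(state)

restored.eval()  # inference mode before making predictions

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(weights_match)  # True
```

The label names are not part of the saved weights, so the class order from `label_encoder.classes_` has to be stored separately if predictions are to be decoded later.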