<a href="https://colab.research.google.com/github/yasidew/Codeharbor-2.0/blob/feature%2Fcode-analysis/code_analysis_model_1.0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1.   Dataset Preparation**

*   The dataset preparation process involves creating a structured dataset that supports machine learning or data analysis workflows. This step is critical for ensuring data consistency, enabling reproducibility, and providing diverse examples that help improve the robustness of models.

**Workflow Overview:**

1.    Create a dataset covering a variety of vulnerability checks, with each record split into two fields, `func` and `target`:

*   **func:** Contains the raw code snippet representing a function or block.
*   **target:** A suggestion or description of the detected issue or vulnerability.

2.   Store the dataset in a standard text-based format such as JSON Lines (JSONL), with one record per line, for easy parsing.

3. **Data Validation:** Normalize code snippets by ensuring consistent formatting (e.g., proper indentation, line breaks).


4. **Augment Dataset:** Include both synthetic examples (generated or modified to demonstrate specific vulnerabilities) and real-world examples sourced from repositories and Hugging Face.
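
Steps 1 to 3 can be sketched as follows. This is a minimal illustration, assuming each record is one JSON object per line with the `func` and `target` fields described above. The helper names `normalize_snippet`, `write_jsonl`, and `read_jsonl` are hypothetical and not part of the notebook, and the temporary file path is only for demonstration.

```python
import json
import os
import tempfile
import textwrap


def normalize_snippet(code: str) -> str:
    """Unify line endings, remove common leading indentation and trailing whitespace."""
    code = code.replace("\r\n", "\n").replace("\r", "\n")
    code = textwrap.dedent(code).strip("\n")
    return "\n".join(line.rstrip() for line in code.split("\n"))


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")


def read_jsonl(path):
    """Read the records back, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# One record in the func/target layout (the snippet is deliberately messy)
records = [
    {
        "func": normalize_snippet("    def risky_eval(data):\r\n        return eval(data)   \r\n"),
        "target": "Avoid using eval as it can execute arbitrary code. "
                  "Use ast.literal_eval for safer parsing.",
    },
]

path = os.path.join(tempfile.mkdtemp(), "custom_dataset.jsonl")
write_jsonl(records, path)
loaded = read_jsonl(path)
print(loaded[0]["func"])
```

Normalizing before writing means two snippets that differ only in indentation or line endings end up stored identically, which makes duplicates easier to spot.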







# **2.   Preprocess Dataset**

**Why Preprocess the Dataset?**

Preprocessing is a crucial step in preparing the dataset for model training and evaluation. It ensures that the data is:

**Randomized:** Prevents biases introduced by inherent ordering in the dataset.

**Well-Partitioned:** Divided into training and test subsets to evaluate the model's performance effectively.

**Tokenized:** Code snippets are converted into a tokenized format suitable for input into machine learning models, such as sequence-to-sequence or transformer models.
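
The randomize-and-partition idea can be sketched without any libraries. The notebook itself relies on `Dataset.train_test_split(test_size=0.2, seed=42)`; the function `shuffle_and_split` below is a hypothetical stand-in that follows the same principle, so the exact partition it produces will differ from the library's.

```python
import random


def shuffle_and_split(records, test_size=0.2, seed=42):
    """Return (train, test) after a seeded shuffle; the input list is left untouched."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


# Toy records in the func/target layout
records = [
    {"func": f"def f{i}():\n    pass", "target": "No specific issue detected."}
    for i in range(10)
]
train, test = shuffle_and_split(records)
print(len(train), len(test))
```

Fixing the seed matters for reproducibility: rerunning the split yields the same train and test subsets, so metrics from different training runs stay comparable.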





# **3. Fine-Tuning**

**What is Fine-Tuning?**

Fine-tuning involves adapting the pre-trained CodeT5 model to specialize in detecting code vulnerabilities, security issues, and other specific programming checks. The model learns to identify patterns and features unique to vulnerable code.



---

# **Why CodeT5 was Chosen for Fine-Tuning**
**Overview of CodeT5:**

CodeT5, developed by Salesforce, is a powerful pre-trained sequence-to-sequence model designed specifically for code-related tasks, such as:



*   **Code Generation:** Generating code snippets from input queries.
*   **Code Translation:** Converting code from one programming language to another.
*   **Code Summarization:** Generating concise summaries of code functionality.
*   **Code Completion:** Auto-completing partially written code.



**Key Features of CodeT5:**


1.   **Specialized for Programming Tasks**
      *   Pre-trained on diverse programming languages, including Java, Python, and JavaScript, making it highly effective for understanding and generating code.

2.   **Sequence-to-Sequence Architecture:**
      *   Allows flexible input-output relationships, such as transforming a code snippet into a piece of text, which matches our task of mapping a function (`func`) to a suggestion (`target`).

3.   **Model Size Options:**
      *   Offers multiple variants like base, small, and large, depending on available compute resources.

4.   **Transformer-Based:**
      *   Built on the Transformer architecture, which is state-of-the-art for natural language processing (NLP) and code-related tasks.



---
# **Other Available Models and Why They Weren't Chosen**
### **1. GPT Models (e.g., Codex by OpenAI)**:
- **Advantages**:
  - Extremely powerful and capable of complex reasoning.
  - Performs well across various programming languages.
- **Limitations**:
  - Requires substantial computational resources for fine-tuning, especially on large datasets.
  - Fine-tuning access for Codex is limited compared to its use as a pre-trained API (closed access).
  - Higher cost associated with cloud usage and resource allocation.

---

### **2. CodeBERT**:
- **Advantages**:
  - Designed for programming tasks like code search and clone detection.
  - Lightweight and easier to fine-tune compared to GPT models.
- **Limitations**:
  - Primarily built for tasks like code retrieval and classification, not for sequence-to-sequence tasks like code transformation.
  - Less effective for generating code due to its focus on representation learning rather than generation.


## **Why CodeT5 is the Best Fit for This Task**
1. **Focus on Code Generation**:
   - CodeT5's architecture is optimized for understanding and generating code, making it ideal for tasks like identifying vulnerabilities and checks. Its ability to comprehend complex code structures and generate meaningful outputs aligns perfectly with the requirements of analyzing and improving code quality.
2. **Extensive Pre-Training on Programming Data**:
   - CodeT5 has been trained on datasets like CodeSearchNet, giving it a strong foundation in understanding and generating high-quality code.
3. **Ease of Fine-Tuning**:
   - It supports fine-tuning with Hugging Face's Trainer API, simplifying the implementation process.
4. **Scalability**:
   - The `base` version balances model size and performance, making it suitable for running on Colab Pro's GPU resources.

In [None]:
import os
import json
from datasets import Dataset
from transformers import AutoTokenizer
import torch


def preprocess_data_with_auto_tokenizer(data, tokenizer):
    """
    Tokenize the code snippets ("func") and their target suggestions ("target").
    """
    def tokenize_function(example):
        # Encode the input code
        inputs = tokenizer(example["func"], truncation=True, padding="max_length", max_length=512)
        # Encode the target suggestion; its token ids become the labels
        targets = tokenizer(example["target"], truncation=True, padding="max_length", max_length=128)
        inputs["labels"] = targets["input_ids"]
        return inputs

    # Debug: Print the first few samples before tokenization
    print("Data before tokenization:", data[:3])

    # Tokenize and format the dataset
    tokenized_data = data.map(tokenize_function, batched=False)
    tokenized_data.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])

    # Debug: Print the first few samples after tokenization
    print("Data after tokenization:", tokenized_data[:3])

    return tokenized_data

def validate_dataset(dataset, stage="train"):
    """
    Validate the dataset to ensure all necessary fields are present.
    """
    print(f"Validating {stage} dataset...")
    for i, sample in enumerate(dataset):
        assert "input_ids" in sample, f"Sample {i} missing 'input_ids'"
        assert "attention_mask" in sample, f"Sample {i} missing 'attention_mask'"
        assert "labels" in sample, f"Sample {i} missing 'labels'"

        # Each field is a sequence of token ids (list or tensor)
        assert isinstance(sample["input_ids"], (list, torch.Tensor)), f"Sample {i} 'input_ids' is not a list or tensor"
        assert isinstance(sample["attention_mask"], (list, torch.Tensor)), f"Sample {i} 'attention_mask' is not a list or tensor"
        assert isinstance(sample["labels"], (list, torch.Tensor)), f"Sample {i} 'labels' is not a list or tensor"

        # Additional check for tensor shape if it's a tensor
        if isinstance(sample["input_ids"], torch.Tensor):
            assert sample["input_ids"].ndim == 1, f"Sample {i} 'input_ids' tensor is not 1-dimensional"
        if isinstance(sample["attention_mask"], torch.Tensor):
            assert sample["attention_mask"].ndim == 1, f"Sample {i} 'attention_mask' tensor is not 1-dimensional"

    print(f"{stage} dataset validation complete.")


def preprocess_dataset(raw_data_path, model_name, output_dir="dataset/processed_data"):
    """
    Preprocess the raw dataset and save it.
    """
    os.makedirs(output_dir, exist_ok=True)

    # Load raw data: one JSON object per line, skipping blank lines
    print(f"Loading dataset from: {raw_data_path}")
    with open(raw_data_path, 'r') as file:
        data = [json.loads(line) for line in file if line.strip()]

    # Keep the two fields described in the dataset preparation step
    dataset = Dataset.from_dict({"func": [d["func"] for d in data], "target": [d["target"] for d in data]})

    # Split dataset
    split_data = dataset.train_test_split(test_size=0.2, seed=42)
    train_dataset = split_data["train"]
    test_dataset = split_data["test"]

    print(f"Train dataset size: {len(train_dataset)}")
    print(f"Test dataset size: {len(test_dataset)}")

    # Initialize tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

    # Preprocess data
    train_dataset = preprocess_data_with_auto_tokenizer(train_dataset, tokenizer)
    test_dataset = preprocess_data_with_auto_tokenizer(test_dataset, tokenizer)

    # Validate datasets
    validate_dataset(train_dataset, stage="train")
    validate_dataset(test_dataset, stage="test")

    # Save preprocessed datasets
    train_path = os.path.join(output_dir, "custom_train.jsonl")
    test_path = os.path.join(output_dir, "custom_test.jsonl")
    train_dataset.to_json(train_path)
    test_dataset.to_json(test_path)

    print(f"Processed data saved at: {output_dir}")
    return train_dataset, test_dataset

# Preprocess the raw dataset
train_dataset, test_dataset = preprocess_dataset(
    raw_data_path="dataset/raw_data/custom_dataset.jsonl",
    model_name="Salesforce/codet5-base"
)

In [None]:
!python preprocess.py

Loading dataset from: custom_dataset.json
Train dataset size: 1200
Test dataset size: 300
tokenizer_config.json: 100% 1.48k/1.48k [00:00<00:00, 9.53MB/s]
vocab.json: 100% 703k/703k [00:00<00:00, 1.11MB/s]
merges.txt: 100% 294k/294k [00:00<00:00, 643kB/s]
added_tokens.json: 100% 2.00/2.00 [00:00<00:00, 11.1kB/s]
special_tokens_map.json: 100% 12.5k/12.5k [00:00<00:00, 48.5MB/s]
Data before tokenization: {'func': ['def unused_variable():\n    x = 10\n    return 5', 'def risky_eval(data):\n    return eval(data)', "def sql_injection(user_input):\n    query = 'SELECT * FROM users WHERE name = ' + user_input\n    return query"], 'target': ['No specific issue detected.', 'Avoid using eval as it can execute arbitrary code. Use ast.literal_eval for safer parsing.', 'Avoid string concatenation in SQL queries. Use parameterized queries to prevent SQL injection.']}
Map: 100% 1200/1200 [00:00<00:00, 2933.44 examples/s]
Data after tokenization: {'input_ids': tensor([[    1,   536, 10197,  ...,     0,

In [None]:
!pip install transformers datasets

Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 480.6/480.6 kB 27.9 MB/s eta 0:00:00
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 116.3/116.3 kB 11.0 MB/s eta 0:00:00
Downloading fsspec-2024.9.0-py3-none-any.whl 

# **4. Model Evaluation - F1 Score**

## **What is Model Evaluation?**
Model evaluation measures how well the fine-tuned model performs on the held-out test set. The F1 Score is a robust metric for evaluating the model's ability to identify vulnerabilities and checks accurately. It accounts for both false positives (e.g., incorrectly flagged non-issues) and false negatives (e.g., missed vulnerabilities), both of which are critical for this component.

---

## **Why Use F1 Score?**
The F1 Score is the harmonic mean of **precision** and **recall**, making it particularly useful when there is an uneven class distribution or when both false positives and false negatives are critical.

- **Precision**: Measures the accuracy of the vulnerabilities detected by the model, ensuring fewer false positives.
- **Recall**: Assesses the model's ability to detect all actual vulnerabilities, reducing false negatives.
- **F1 Score**: Provides a balanced view of precision and recall, ensuring the model performs well across diverse vulnerability categories.
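
The three metrics above reduce to simple ratios over counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below is a plain-Python illustration; the function name `precision_recall_f1` and the counts are hypothetical and are not results from this notebook.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall and F1 from raw counts, returning 0.0 on empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed vulnerabilities
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.800 recall=0.667 f1=0.727
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1 by maximizing precision alone or recall alone.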

# **5. What Happens During Model Training?**

**Data Preparation:**

* The dataset is tokenized using the pre-trained CodeT5 tokenizer (loaded via `AutoTokenizer`) for both input (`func`) and target (`target`) sequences.
* The input_ids, attention_mask, and labels are prepared, ensuring uniform length via padding.

**Batching with DataLoader:**

* The tokenized data is loaded into DataLoader objects for both training and testing.
* A custom collation function ensures correct batching with appropriate padding.
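
What a collation function does can be shown with plain Python lists. This is a sketch only: the name `collate_batch` is hypothetical, a pad token id of 0 is assumed, and labels are padded with -100, the index that PyTorch's cross-entropy loss ignores by default. The notebook's own `custom_collate_fn` works on tensors with `pad_sequence` and pads labels with the tokenizer's pad token instead.

```python
def collate_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad every field to the longest sequence in the batch and build attention masks."""
    max_in = max(len(item["input_ids"]) for item in batch)
    max_lab = max(len(item["labels"]) for item in batch)
    input_ids, attention_mask, labels = [], [], []
    for item in batch:
        n_pad = max_in - len(item["input_ids"])
        input_ids.append(item["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(item["input_ids"]) + [0] * n_pad)
        labels.append(item["labels"] + [label_pad_id] * (max_lab - len(item["labels"])))
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Two toy samples of different lengths (token ids are made up)
batch = [
    {"input_ids": [1, 536, 2], "labels": [1, 9, 2]},
    {"input_ids": [1, 536, 10197, 7, 2], "labels": [1, 2]},
]
out = collate_batch(batch)
print(out["input_ids"])
print(out["attention_mask"])
print(out["labels"])
```

After collation every row in a field has the same length, so the batch can be stacked into a rectangular tensor.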

**Model Setup:**

* A T5ForConditionalGeneration model (CodeT5) is initialized.
* The model is moved to the GPU for efficient training and inference.

**Optimizer and Training Loop:**

* The AdamW optimizer is used for weight updates.
* For each epoch:
  * The model processes the input and calculates the loss.
  * The loss is backpropagated, and model weights are updated to minimize it.
* The average loss per epoch is printed to monitor training progress.

**Model Saving:**

After training, the fine-tuned model and tokenizer are saved to disk for later use.

**Evaluation:**

* The model is evaluated on the test dataset by generating predictions for the input code snippets.
* Predictions are compared with ground-truth suggestions, and metrics like Precision, Recall, and F1 Score are computed.
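
One possible way to turn generated text into these metrics is to treat each distinct ground-truth suggestion as a class and count a prediction as correct only on an exact string match. The sketch below makes that assumption explicit; `macro_prf` is a hypothetical helper and the inputs are toy strings, not model output.

```python
from collections import Counter


def macro_prf(references, predictions):
    """Macro-averaged precision, recall, F1 with each distinct reference string as a class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, pred in zip(references, predictions):
        if ref == pred:
            tp[ref] += 1
        else:
            fp[pred] += 1  # the predicted class was wrong
            fn[ref] += 1   # the true class was missed
    classes = sorted(set(references))
    scores = []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        r = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        scores.append((p, r, f))
    # Average each metric over the classes so every class weighs the same
    return tuple(sum(s[i] for s in scores) / len(classes) for i in range(3))


references = ["Avoid eval.", "Avoid eval.", "Use parameterized queries.", "No specific issue detected."]
predictions = ["Avoid eval.", "Use parameterized queries.", "Use parameterized queries.", "Looks fine."]
precision, recall, f1 = macro_prf(references, predictions)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```

Exact matching is strict: a suggestion that is correct but worded differently counts as a miss, so these scores are best read alongside the qualitative sample outputs.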

**Sample Outputs:**

A few examples of input code, generated suggestions, and ground-truth suggestions are displayed for qualitative analysis.

In [None]:
from transformers import T5ForConditionalGeneration, AutoTokenizer
from torch.utils.data import DataLoader
from datasets import Dataset
from sklearn.metrics import precision_recall_fscore_support
import torch
from torch.nn.utils.rnn import pad_sequence

# Initialize model and tokenizer
model_name = "Salesforce/codet5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Custom collation function for seq2seq tasks
def custom_collate_fn(batch):
    input_ids = [torch.tensor(item["input_ids"]) for item in batch]
    attention_masks = [torch.tensor(item["attention_mask"]) for item in batch]
    labels = [torch.tensor(item["labels"]) for item in batch]

    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_masks = pad_sequence(attention_masks, batch_first=True, padding_value=0)
    labels = pad_sequence(labels, batch_first=True, padding_value=tokenizer.pad_token_id)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_masks,
        "labels": labels,
    }

# Load and preprocess datasets
def load_and_preprocess_dataset(file_path, tokenizer):
    data = Dataset.from_json(file_path)

    def tokenize_function(example):
        inputs = tokenizer(
            example["func"], truncation=True, padding="max_length", max_length=512
        )
        targets = tokenizer(
            example["target"], truncation=True, padding="max_length", max_length=128
        )
        return {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"],
            "labels": targets["input_ids"],
        }

    return data.map(tokenize_function, batched=True)

# Paths to the processed JSON Lines files written by the preprocessing step
train_file_path = "dataset/processed_data/custom_train.jsonl"
test_file_path = "dataset/processed_data/custom_test.jsonl"

# Preprocess datasets
train_dataset = load_and_preprocess_dataset(train_file_path, tokenizer)
test_dataset = load_and_preprocess_dataset(test_file_path, tokenizer)

# Set up DataLoaders
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=custom_collate_fn)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False, collate_fn=custom_collate_fn)

# Move model to GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Set up optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Training loop
epochs = 6
model.train()
for epoch in range(epochs):
    total_loss = 0
    for batch in train_dataloader:
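        inputs = {k: v.to(device) for k, v in batch.items()}
        # Exclude padding positions from the loss: label index -100 is ignored by the model's cross-entropy loss
        inputs["labels"] = inputs["labels"].masked_fill(
            inputs["labels"] == tokenizer.pad_token_id, -100
        )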

        optimizer.zero_grad()
        outputs = model(**inputs)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}/{epochs} Loss: {total_loss / len(train_dataloader)}")

# Save the trained model
model_save_path = "models/custom_seq2seq_model"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

# Evaluate the model
print("Evaluating model on test dataset...")
model.eval()

generated_targets = []
ground_truth_targets = []

with torch.no_grad():
    for batch in test_dataloader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
        labels = batch["labels"].to(device)
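        # Pass the attention mask so padding tokens are ignored during generation
        outputs = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=128,
        )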

        generated_targets.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
        ground_truth_targets.extend(tokenizer.batch_decode(labels, skip_special_tokens=True))

# Calculate Precision, Recall, and F1 Score
# Each distinct ground-truth suggestion is treated as a class, and a prediction
# is correct only on an exact string match. Macro-averaging weights every class equally.
class_labels = sorted(set(ground_truth_targets))
precision, recall, f1, _ = precision_recall_fscore_support(
    ground_truth_targets,
    generated_targets,
    labels=class_labels,
    average="macro",
    zero_division=0,
)

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")

# Display some examples (at most five, in case the test set is smaller)
for i in range(min(5, len(generated_targets))):
    print(f"Input Code: {test_dataset[i]['func']}")
    print(f"Generated Suggestion: {generated_targets[i]}")
    print(f"Ground Truth Suggestion: {ground_truth_targets[i]}")
    print()

In [None]:
!python train_model.py

2024-12-01 10:26:18.834932: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-01 10:26:18.854207: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-01 10:26:18.861105: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-01 10:26:18.875186: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
config.json: 100% 1.57k/1.57k [00:00<00:00, 9

In [None]:
!python preprocess.py

Loading dataset from: custom_dataset.json
Train dataset size: 1227
Test dataset size: 307
Data before tokenization: {'func': ['def long_function():\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n    pass\n 

In [None]:
!python train_model.py

2024-12-01 10:50:22.840696: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-01 10:50:22.860294: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-01 10:50:22.866355: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-01 10:50:22.881323: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Generating train split: 1227 examples [00:00,

In [None]:
!zip -r custom_seq2seq_model.zip ./models/custom_seq2seq_model

  adding: models/custom_seq2seq_model/ (stored 0%)
  adding: models/custom_seq2seq_model/tokenizer_config.json (deflated 94%)
  adding: models/custom_seq2seq_model/special_tokens_map.json (deflated 97%)
  adding: models/custom_seq2seq_model/tokenizer.json (deflated 82%)
  adding: models/custom_seq2seq_model/merges.txt (deflated 54%)
  adding: models/custom_seq2seq_model/vocab.json (deflated 59%)
  adding: models/custom_seq2seq_model/model.safetensors (deflated 7%)
  adding: models/custom_seq2seq_model/config.json (deflated 61%)
  adding: models/custom_seq2seq_model/generation_config.json (deflated 33%)


In [None]:
import time

def keep_colab_active():
    while True:
        # Print a simple message to the log (you can also comment this out to reduce output)
        print("Keeping Colab active...")
        time.sleep(300)  # Wait for 5 minutes (300 seconds) before the next iteration

# Run the function
keep_colab_active()

In [None]:
!python preprocess.py

Loading dataset from: custom_dataset.json
Train dataset size: 1521
Test dataset size: 381
Data before tokenization: {'func': ['def overly_complex_parameters():\n    def process(data, config, env, context):\n        return True', "def hardcoded_api_key():\n    api_key = '12345-abcdef-67890'\n    return api_key", 'def insecure_data_storage():\n    stored_password = password'], 'target': ['Simplify parameters by grouping related items into objects or dictionaries.', 'Avoid hardcoding API keys. Use secure storage or environment variables.', 'Encrypt sensitive data before storing it to protect against unauthorized access.']}
Map: 100% 1521/1521 [00:00<00:00, 2997.50 examples/s]
Data after tokenization: {'input_ids': tensor([[    1,   536,  1879,  ...,     0,     0,     0],
        [    1,   536,  7877,  ...,     0,     0,     0],
        [    1,   536, 22785,  ...,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,

In [None]:
!python train_model.py

2024-12-03 10:35:34.388374: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-12-03 10:35:34.409209: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-12-03 10:35:34.415150: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-12-03 10:35:34.429494: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Map: 100% 381/381 [00:00<00:00, 2586.46 examp

In [None]:
!zip -r models.zip ./models/custom_seq2seq_model

  adding: models/custom_seq2seq_model/ (stored 0%)
  adding: models/custom_seq2seq_model/tokenizer_config.json (deflated 94%)
  adding: models/custom_seq2seq_model/special_tokens_map.json (deflated 97%)
  adding: models/custom_seq2seq_model/tokenizer.json (deflated 82%)
  adding: models/custom_seq2seq_model/merges.txt (deflated 54%)
  adding: models/custom_seq2seq_model/vocab.json (deflated 59%)
  adding: models/custom_seq2seq_model/model.safetensors (deflated 7%)
  adding: models/custom_seq2seq_model/config.json (deflated 61%)
  adding: models/custom_seq2seq_model/generation_config.json (deflated 33%)
