# Lab 2: Large Language Models for Code Generation

In this lab, we will explore how **Large Language Models (LLMs)** can be applied to source code.  
You will learn how to use pre-trained models for generating and understanding code, as well as how to fine-tune models for specific programming-related tasks.

### **Objectives**
By the end of this lab, you should be able to:
- Run inference with multiple code-based LLMs (e.g., Qwen2.5-Coder, StarCoder).
- Generate code snippets from natural language prompts.
- Compare how different models handle code generation tasks.
- Train and evaluate a model for a downstream task such as **code classification**.

### **Why This Matters**
Code-based LLMs are increasingly being used in software development, data science, and education.  
Understanding how to apply, evaluate, and fine-tune them will give you practical skills for building intelligent programming assistants or automated code analysis tools.


# I. Dependencies
In this lab, we will use [Hugging Face](https://huggingface.co/), an open platform where machine learning engineers, researchers, and companies share their models and datasets. Hugging Face maintains several libraries, three of which we need here:
1. **Transformers**: a library of pretrained natural language processing, computer vision, audio, and multimodal models for inference and training. Use Transformers to train models on your data, build inference applications, and generate text with large language models.
2. **Datasets**: a library for easily accessing and sharing datasets for audio, computer vision, and natural language processing (NLP) tasks.
3. **Evaluate**: a library for easily evaluating machine learning models and datasets.

In [7]:
# Install Hugging Face libraries
!pip install -q transformers datasets evaluate

# Import main classes for code generation
from transformers import AutoModelForCausalLM, AutoTokenizer

# (Later we’ll use datasets + evaluate when we fine-tune and assess performance)


# II. Code Generation

Large Language Models (LLMs) have recently achieved impressive results on **code-related tasks**.  
Specialized versions of LLMs for programming are often called **Code LLMs**.  
Their main applications include:
- **Code generation**: turning natural language descriptions into working source code  
- **Code completion**: suggesting the next tokens in a partially written program  
- **Code translation**: translating a snippet of code from one programming language to another (e.g., Java -> Python)  
- **Bug fixing & refactoring**: improving existing code  

A well-known real-world example is **GitHub Copilot**, which assists developers by suggesting code inline.

---

### Our Goal in this Lab
In this part, we will experiment with **inference** (generation) using a state-of-the-art Code LLM:  

**Qwen2.5-Coder**: the latest series of **code-based models** from the Qwen family (previously known as *CodeQwen*).  

We will use it to generate **Python code snippets** from natural language prompts.

---

### Example
- **Input (natural language):**  
  `"Generate a python function that writes 'Hello, World!'"`

- **Output (generated code):**
```python
def hello():
    print("Hello, World!")
```


In [None]:
# I. Load the Code LLM

from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face model repo (Qwen2.5 Coder, 1.5B parameters, instruction-tuned version)
model_name = "Qwen/Qwen2.5-Coder-1.5B-Instruct"

## TODO : Load the model model_name, automatically choosing the precision and GPU if available.

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    # arg1,
    # arg2 = "value2",   # automatically chooses best precision (saves memory)
    # arg3 = "value3"     # automatically uses GPU if available
)

# Load the tokenizer (responsible for converting text ↔ tokens)
tokenizer = AutoTokenizer.from_pretrained(model_name)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
# II. Prepare the Prompt

# Define the task we want the model to perform

## TODO : Try out other prompts
prompt = "write a quick sort algorithm in Python."

# Chat-style messages for instruction-tuned LLMs
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to convert messages into model-readable input
# - `tokenize=False`: we want text, not token IDs yet
# - `add_generation_prompt=True`: adds the model's generation cue
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Show the prepared prompt
print("=== Prepared Input for the Model ===")
print(text)

=== Prepared Input for the Model ===
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
write a quick sort algorithm in Python.<|im_end|>
<|im_start|>assistant



In [None]:
# III. Generate Code from the Prompt

# Tokenize the prepared prompt and move to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate code
# - max_new_tokens: limits the number of tokens generated
# - other parameters (temperature, top_p) can control creativity

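## TODO : limit the number of generated tokens to 512
## TODO : try different temperature values and observe how they affect the model's creativity
##        (temperature and top_p only have an effect when sampling is enabled, i.e. do_sample=True)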
generated_ids = model.generate(
    **model_inputs,
    # arg 1 = value 1,
    # arg 2 = value 2
)

# Remove the input tokens from the output to get only the newly generated tokens
generated_ids_only = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# Decode token IDs back to readable text
response = tokenizer.batch_decode(generated_ids_only, skip_special_tokens=True)[0]

# Display the generated code
print("=== Generated Code ===")
print(response)


=== Generated Code ===
Certainly! Below is the implementation of the Quick Sort algorithm in Python:

```python
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)

# Example usage:
if __name__ == "__main__":
    example_array = [3, 6, 8, 10, 1, 2, 1]
    sorted_array = quick_sort(example_array)
    print("Sorted array:", sorted_array)
```

### Explanation:
1. **Base Case**: If the list has one or zero elements, it is already sorted, so we return it as is.
2. **Choose Pivot**: We choose the middle element of the list as the pivot.
3. **Partition**:
   - `left`: Elements less than the pivot.
   - `middle`: Elements equal to the pivot.
   - `right`: Elements greater than the pivot.
4. **Recursive Sort**: Recursively apply the quick sort 
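
The response mixes explanatory prose with a fenced code block, so running the generated function first requires pulling the code out of the text. The following is a minimal sketch using only the standard library; `extract_code_blocks` is a hypothetical helper (not part of Transformers), and the sample `response` string is a shortened, made-up stand-in for a model response.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically so they do not clash with this block

def extract_code_blocks(response: str, language: str = "python") -> list[str]:
    """Return the contents of every fenced block tagged with `language`."""
    pattern = re.compile(
        re.escape(FENCE + language) + r"\s*\n(.*?)" + re.escape(FENCE),
        re.DOTALL,
    )
    return [match.strip() for match in pattern.findall(response)]

# Shortened stand-in for a model response (hypothetical text)
response = (
    "Certainly! Here is the function:\n\n"
    + FENCE + "python\n"
    + "def quick_sort(arr):\n"
    + "    return sorted(arr)\n"
    + FENCE + "\n\n"
    + "### Explanation: ..."
)

blocks = extract_code_blocks(response)
print(blocks[0])
```

Generated code should be reviewed before it is executed, since the model gives no guarantee that it is correct or safe.
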

# III. Code Translation

In this part of the lab, we will use the **same Code LLM (Qwen2.5-Coder)** to **translate code from one programming language to another**.  

Code translation is an increasingly important task in software development and research:
- Migrate legacy code to modern languages
- Facilitate cross-language compatibility
- Help programmers learn new languages by example

---

### How It Works
1. Provide the **source code** along with a natural language instruction indicating the target language.  
2. The model generates the **equivalent code** in the target language.  
3. We decode the output tokens to obtain human-readable code.
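
Step 1 can be made concrete by looking at what the chat template produces. The sketch below rebuilds the prompt by hand, following the `<|im_start|>` / `<|im_end|>` layout visible in the prepared inputs printed in this lab. `build_chatml_prompt` is a hypothetical helper written for illustration only; in practice `tokenizer.apply_chat_template` should be used, because the real template is defined by the model's tokenizer configuration.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format chat messages in the ChatML-style layout shown in this lab's printed prompts."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Cue that tells the model it is now the assistant's turn to write
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

source_code = "def add(a, b):\n    return a + b"
instruction = f"Translate the following code from Python to Java:\n{source_code}"

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": instruction},
]

print(build_chatml_prompt(messages))
```
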

---

### Example
- **Input (Python)**:  
```python
def add(a, b):
    return a + b
```
- **Output (Java)**:  
```java
public int add(int a, int b) {
    return a + b;
}
```

In [None]:
# I. Prepare Code Translation Prompt

# Define source and target languages

## TODO : Try different programming languages 
source_programming_language = "Java"
target_programming_language = "Python"

# Example code snippet in the source language
code = """
public static boolean isEven(int number) {
    return number % 2 == 0;
}
"""

# Create a natural language instruction for translation
prompt = f"Translate the following code from {source_programming_language} to {target_programming_language}:\n{code}"

# Prepare chat-style messages for the instruction-tuned LLM
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to generate model-readable input
# - tokenize=False: we want text, not token IDs yet
# - add_generation_prompt=True: adds the model's generation cue
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Show the prepared prompt for clarity
print("=== Prepared Input for the Model ===")
print(text)

=== Prepared Input for the Model ===
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Translate the following code from Java to Python:

public static boolean isEven(int number) {
    return number % 2 == 0;
}
<|im_end|>
<|im_start|>assistant



In [None]:
# II. Generate Translated Code

# Tokenize the prepared prompt and move it to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the translated code
# - max_new_tokens: limits the number of tokens generated
# - other generation parameters (temperature, top_p) can control creativity
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Remove input tokens to keep only the newly generated tokens
generated_ids_only = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# Decode token IDs to human-readable text
response = tokenizer.batch_decode(generated_ids_only, skip_special_tokens=True)[0]

# Display the translated code
print("=== Translated Code ===")
print(response)


=== Translated Code ===
Here is the equivalent Python function:

```python
def is_even(number):
    return number % 2 == 0
```

In Python, the modulo operator `%` returns the remainder of the division of two numbers. So `number % 2` will be `0` if `number` is even and `1` if it's odd. The function then checks if this value equals `0`. If it does, the function returns `True`, indicating that the number is even; otherwise, it returns `False`.


# IV. Code Summarization

In this part of the lab, we will use the **same Code LLM (Qwen2.5-Coder)** to **summarize and explain code snippets**.  

Code summarization is a valuable task in software development and education:
- Understand unfamiliar code quickly  
- Generate documentation automatically  
- Help beginners learn programming concepts  

---

### How It Works
1. Provide the **source code snippet** as input.  
2. The model generates a **natural language explanation** of what the code does.  
3. We decode the output tokens to obtain human-readable text.
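
Before step 3, the prompt tokens have to be removed, because for decoder-only models `generate` returns the prompt followed by the continuation. The slicing used in the code cells of this lab can be illustrated with plain Python lists; the token IDs below are made-up placeholders, not real vocabulary entries.

```python
# Hypothetical token IDs: each output row is the prompt followed by the new tokens
input_ids = [[101, 7, 8, 9], [101, 4, 5, 6]]
generated_ids = [[101, 7, 8, 9, 50, 51, 52], [101, 4, 5, 6, 60, 61]]

# Drop the first len(prompt) tokens of every row to keep only the continuation
new_tokens_only = [
    output_row[len(input_row):]
    for input_row, output_row in zip(input_ids, generated_ids)
]

print(new_tokens_only)  # [[50, 51, 52], [60, 61]]
```
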

---

### Example
- **Input (Python code)**:  
```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```
- **Output (Explanation)**:  


> This function calculates the factorial of a number n using recursion. If n is 0, it returns 1; otherwise, it multiplies n by the factorial of n-1.



In [None]:
# I. Prepare Code Summarization Prompt

# Example code snippet to summarize
## TODO : change the code for different summarization examples
code = """
public static boolean isEven(int number) {
    return number % 2 == 0;
}
"""

# Create a natural language instruction for summarization
prompt = f"Summarize the following code snippet:\n{code}"

# Prepare chat-style messages for the instruction-tuned LLM
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to generate model-readable input
# - tokenize=False: we want text, not token IDs yet
# - add_generation_prompt=True: adds the model's generation cue
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Show the prepared prompt for clarity
print("=== Prepared Input for the Model ===")
print(text)

=== Prepared Input for the Model ===
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
Summarize the following code snippet:

public static boolean isEven(int number) {
    return number % 2 == 0;
}
<|im_end|>
<|im_start|>assistant



In [None]:
# II. Generate Code Summary

# Tokenize the prepared prompt and move it to the same device as the model
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the summary
# - max_new_tokens: limits the number of tokens generated
# - other parameters (temperature, top_p) can control creativity
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Remove input tokens to keep only the newly generated tokens
generated_ids_only = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# Decode token IDs to human-readable text
response = tokenizer.batch_decode(generated_ids_only, skip_special_tokens=True)[0]

# Display the summary
print("=== Generated Code Summary ===")
print(response)

=== Generated Code Summary ===
This Java method checks if a given integer `number` is even. It returns `true` if the number is divisible by 2 without any remainder, indicating that it is an even number. Otherwise, it returns `false`.


# V. Task: Experiment with Another Code LLM

**Objective:** Explore another code-based LLM from Hugging Face and apply it to one of the tasks we’ve learned so far:
- **Code Generation**  
- **Code Translation**  
- **Code Summarization**

---

### Instructions:
1. Go to the [Hugging Face Model Hub](https://huggingface.co/models) and search for **code-related models** (keywords: `code`, `StarCoder`, `codellama`, etc.).  
2. Pick one model that suits your task.  
3. Load the model and its tokenizer using `AutoModelForCausalLM` and `AutoTokenizer`.  
4. Prepare a prompt following the same pattern as in this lab.  
5. Generate the output and inspect the results.  

---

### Tips:
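- Check whether the model is **instruction-tuned** or a **base** (pre-trained only) model: both are causal language models, but base models usually have no chat template and expect a plain completion-style prompt.  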
- Limit `max_new_tokens` to avoid long outputs.  
- Compare the outputs with Qwen2.5-Coder to see differences in style, accuracy, or completeness.


In [None]:
# I. Load the Code LLM

In [None]:
# II. Prepare the Prompt

In [None]:
# III. Generate Code

# VI. Evaluating a Text-to-SQL Model

In this section, we evaluate a **Text-to-SQL large language model** using the **Exact Match (EM)** metric.  
The goal is to measure how often the model’s generated SQL query exactly matches the ground truth query from a labeled dataset.

The workflow is as follows:

1. **Load the model and tokenizer** – a fine-tuned T5 model for text-to-SQL generation.  
2. **Define a helper function** – a reusable function to generate SQL queries from natural language prompts.  
3. **Load the dataset** – we use the `b-mc2/sql-create-context` dataset, which provides schemas, questions, and reference SQL queries.  
4. **Run inference** – for each example, we provide the schema (`tables`) and the natural language question (`query for:`) and collect the generated SQL.  
5. **Evaluate with Exact Match** – predictions are compared to the reference SQL queries to compute accuracy.

This evaluation provides a first quantitative measure of the model’s performance on the Text-to-SQL task.
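
Exact Match is simple enough to compute by hand, which also makes its strictness visible: a prediction that differs from the reference only in letter case or spacing counts as a miss. The sketch below is a plain-Python illustration and not the implementation used by the `evaluate` library; the `normalize` option and the sample queries are assumptions added for the example.

```python
def exact_match(predictions, references, normalize=False):
    """Fraction of predictions that are string-identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    def prepare(query):
        # Optional, hypothetical normalization: lowercase and collapse whitespace
        return " ".join(query.lower().split()) if normalize else query

    hits = sum(prepare(p) == prepare(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Made-up examples for illustration
references = ["SELECT name FROM students", "SELECT COUNT(*) FROM courses"]
predictions = ["SELECT name FROM students", "select count(*) from courses"]

print(exact_match(predictions, references))                  # 0.5
print(exact_match(predictions, references, normalize=True))  # 1.0
```

Two queries can also return the same rows without being identical strings, so EM tends to underestimate how often the model is actually right.
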


In [1]:
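import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

# I. First load the model and the tokenizer

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('cssupport/t5-small-awesome-text-to-sql').to(device)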

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565



In [2]:
# II. Prepare the prompt

input_prompt = "tables:\n" + "CREATE TABLE student_course_attendance (student_id VARCHAR); CREATE TABLE students (student_id VARCHAR)" + "\n" + "query for:" + "List the id of students who never attends courses?"

# Move the inputs to whichever device the model is on
inputs = tokenizer(input_prompt, padding=True, truncation=True, return_tensors="pt").to(model.device)

In [3]:
# III. generate SQL queries
outputs = model.generate(**inputs, max_length=512)

generated_sql = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(f"The generated SQL query is: {generated_sql}")

The generated SQL query is: SELECT student_id FROM students WHERE NOT student_id IN (SELECT student_id FROM student_course_attendance)


In [14]:
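# IV. Prepare a text-to-SQL function

def text_to_SQL(input_prompt):
    """Generate a SQL query from a prompt of the form 'tables: <schema> query for: <question>'."""
    # Tokenize the prompt and move it to whichever device the model is on
    inputs = tokenizer(input_prompt, padding=True, truncation=True, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=512)
    generated_sql = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_sql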

text_to_SQL(input_prompt)

'SELECT student_id FROM students WHERE NOT student_id IN (SELECT student_id FROM student_course_attendance)'

In [9]:
# V. Load the evaluation metric

from evaluate import load
exact_match_metric = load("exact_match")

In [22]:
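# VI. Prepare the test set
from datasets import load_dataset

# The examples come from the dataset's "train" split; we shuffle it with a fixed seed
# and keep 100 examples as a small, reproducible test set
dataset = load_dataset("b-mc2/sql-create-context")
test_set = dataset["train"].shuffle(seed=42).select(range(100))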
test_set


Dataset({
    features: ['answer', 'question', 'context'],
    num_rows: 100
})

In [None]:
# VII. Perform inference on the test set
predictions = []
references = test_set[:]["answer"]
for example in test_set:
    # Same prompt layout as above: schema first, then the question
    input_prompt = f"tables:\n {example['context']} \n  query for: {example['question']}"
    sql_query = text_to_SQL(input_prompt)
    predictions.append(sql_query)

In [None]:
## TODO : use the exact match metric to compare predictions with references
results = exact_match_metric.compute(
    # arg1,
    # arg2
)
print(results)

{'exact_match': np.float64(0.63)}


# VII. Fine-tuning CodeBERT for Programming Language Classification

In this part of the lab, we will **fine-tune CodeBERT**, a **bimodal pre-trained model** that understands both **programming languages (PL)** and **natural language (NL)**.

**About CodeBERT:**
- Learns general-purpose representations for code and natural language.
- Supports downstream applications such as:
  - **Code search using natural language queries**
  - **Code documentation generation**
  - **Code classification tasks**

**Today's Task:**  
We will fine-tune CodeBERT to **recognize the programming language of a given code snippet**.  
- **Input:** a code snippet  
- **Output:** the predicted programming language (e.g., Python, Java, C++)

---

### Fine-tuning Workflow:
1. **Load the pretrained CodeBERT model** and tokenizer.
2. **Prepare a labeled dataset** with code snippets and their programming language labels.
3. **Train the model** using supervised learning.
4. **Evaluate performance** with metrics such as accuracy or F1-score.
5. (Optional) **Use the fine-tuned model** to predict the language of new code snippets.
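
Step 4 boils down to turning the model's raw scores (logits) into class predictions and counting how many agree with the labels. The following is a library-free sketch, with made-up logits for three snippets and the label convention 0 = Java, 1 = Python used later in this lab:

```python
def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=lambda i: scores[i])

def accuracy(logits, labels):
    """Share of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits: one row per snippet, one column per class (0 = Java, 1 = Python)
logits = [[2.1, -1.3], [-0.4, 3.0], [0.9, 0.2]]
labels = [0, 1, 1]

print(accuracy(logits, labels))  # 2 of the 3 predictions are correct
```
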

---

### Example:
- **Input:**  
```python
def factorial(n):
    return 1 if n == 0 else n * factorial(n-1)
```
- **Output:**
> Python



In [None]:
# I. Load Dataset for Code Language Classification

from datasets import load_dataset

# Load the dataset containing Java and Python code snippets
dataset_name = "salmane11/java_python_code_snippets"

## TODO : load dataset_name using load_dataset
dataset = ...  # replace the ellipsis with your load_dataset call


In [None]:
# II. Display the structure of the dataset
print("=== Dataset Structure ===")
print(dataset)

# Show a few examples from the training set
print("\n=== Example Code Snippets ===")
for example in dataset["train"].select(range(2)):
    print(f"Language: {example['language']}")
    print(f"Code:\n{example['code']}")
    print("-" * 50)

=== Dataset Structure ===
DatasetDict({
    train: Dataset({
        features: ['code', 'repo_name', 'path', 'language', 'license', 'size'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['code', 'repo_name', 'path', 'language', 'license', 'size'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['code', 'repo_name', 'path', 'language', 'license', 'size'],
        num_rows: 1000
    })
})

=== Example Code Snippets ===
Language: 0
Code:
package sunning.democollection.learn._0331.component;

import dagger.Component;
import sunning.democollection.learn._0331.UserActivity;
import sunning.democollection.learn._0331.module.ShoppingCartModule;

/**
 * Created by sunning on 16/3/31.
 */
@Component(dependencies = ActivityComponent.class, modules = ShoppingCartModule.class)
public interface ShoppingCartComponent {
    void inject(UserActivity userActivity);
}

--------------------------------------------------
Language: 0
Code:
package com.ford.op

In [None]:
# III. Prepare Dataset Columns for Fine-tuning

# Rename columns to standard names expected by Hugging Face Trainer
dataset = dataset.rename_column("code", "text")     # "text" will be used as model input
dataset = dataset.rename_column("language", "label") # "label" will be used as target

# Display the updated dataset structure
print("=== Updated Dataset Structure ===")
print(dataset)


=== Updated Dataset Structure ===
DatasetDict({
    train: Dataset({
        features: ['text', 'repo_name', 'path', 'label', 'license', 'size'],
        num_rows: 8000
    })
    validation: Dataset({
        features: ['text', 'repo_name', 'path', 'label', 'license', 'size'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['text', 'repo_name', 'path', 'label', 'license', 'size'],
        num_rows: 1000
    })
})


In [None]:
# IV. Load Tokenizer for CodeBERT

from transformers import AutoTokenizer

# Load the tokenizer corresponding to the CodeBERT model
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

# Quick example: tokenize a sample code snippet
sample_code = "def add(a, b): return a + b"
tokens = tokenizer(sample_code)
print("=== Tokenized Sample ===")
print(tokens)


=== Tokenized Sample ===
{'input_ids': [0, 9232, 1606, 1640, 102, 6, 741, 3256, 671, 10, 2055, 741, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# V. Preprocess Dataset for Fine-tuning

# Define a preprocessing function to tokenize code snippets
def preprocess_function(examples):
    """
    Tokenizes input code snippets for the CodeBERT model.

    Args:
        examples (dict): A batch of examples from the dataset, containing the 'text' field.

    Returns:
        dict: Tokenized inputs suitable for model training.
    """
    return tokenizer(
        examples["text"],     # the code snippet
        truncation=True,      # truncate sequences longer than max_length
        max_length=60         # maximum sequence length (adjustable)
    )

# Quick test with a sample example
sample_example = {"text": "def factorial(n): return 1 if n == 0 else n * factorial(n-1)"}
print("=== Tokenized Sample ===")
print(preprocess_function(sample_example))


=== Tokenized Sample ===
{'input_ids': [0, 9232, 754, 17707, 1640, 282, 3256, 671, 112, 114, 295, 45994, 321, 1493, 295, 1009, 754, 17707, 1640, 282, 12, 134, 43, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


In [None]:
# VI. Tokenize the Entire Dataset

# Apply the preprocessing function to the entire dataset
# - .map() applies the function to every split (train, validation, test)
# - batched=True passes the examples to the function in batches, which speeds up tokenization
# - returns a new dataset with the tokenized inputs added
tokenized_dataset = dataset.map(preprocess_function, batched=True)


In [None]:
# VII. Set Up Data Collator

from transformers import DataCollatorWithPadding

# A data collator forms batches and dynamically pads inputs to the same length in each batch
# This is necessary because code snippets have varying lengths
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
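
To see what dynamic padding means, the sketch below pads a batch of token-ID lists to the length of its longest member and builds the matching attention masks. It is a simplified stand-in for what the collator does, not its actual implementation; `pad_batch` is a hypothetical helper, and the pad ID of 1 is an assumption chosen to match the `padding_idx=1` of RoBERTa-style embeddings.

```python
def pad_batch(batch, pad_id=1):
    """Right-pad every sequence to the longest one in the batch."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs for two snippets of different lengths
batch = [[0, 9232, 1606, 2], [0, 9232, 2]]
padded = pad_batch(batch)

print(padded["input_ids"])       # [[0, 9232, 1606, 2], [0, 9232, 2, 1]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to one global length keeps batches of short snippets small, which saves memory and computation.
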


In [None]:
# Load Evaluation Metric

import evaluate
accuracy = evaluate.load("accuracy")


In [None]:
# Define Metric Computation Function

import numpy as np
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)

In [None]:
# VIII. Define Label Mappings

# Mapping from numerical class IDs to programming language names
id2label = {0: "Java", 1: "Python"}

# Mapping from programming language names to numerical class IDs
label2id = {"Java": 0, "Python": 1}

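# Explanation:
# - These mappings are stored in the model configuration for classification tasks
# - The model outputs numerical IDs, which we can convert to human-readable labels with id2label
# - The dataset labels are already integer IDs (0 = Java, 1 = Python), so label2id is not needed to
#   prepare the training data; it simply records the reverse mapping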

In [None]:
# IX. Load CodeBERT for Sequence Classification

from transformers import AutoModelForSequenceClassification

# Load the pretrained CodeBERT model for classification
# Arguments:
# - num_labels: number of classes (Java, Python)
# - id2label: maps class IDs to human-readable labels
# - label2id: maps human-readable labels to class IDs
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base",
    num_labels=2,
    id2label=id2label,
    label2id=label2id
)

# Display a summary of the model architecture
print("=== Model Summary ===")
print(model)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at microsoft/codebert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


=== Model Summary ===
RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768

In [None]:
# X. Fine-tune CodeBERT with Trainer

from transformers import TrainingArguments, Trainer

# Training arguments
training_args = TrainingArguments(
    output_dir="models/CodeClassifier",   # where to save checkpoints
    learning_rate=2e-5,                    # learning rate for optimizer
    per_device_train_batch_size=16,        # batch size per device (adjust if GPU memory is limited)
    per_device_eval_batch_size=16,         # batch size for evaluation
    num_train_epochs=5,                    # number of training epochs
    weight_decay=0.01,                     # regularization
    eval_strategy="epoch",                  # evaluate at the end of each epoch
    save_strategy="epoch",                  # save checkpoint at the end of each epoch
    load_best_model_at_end=True,            # automatically load the best model
    push_to_hub=False,                      # do not push to Hugging Face hub
    report_to="none",                       # no logging (disable wandb or tensorboard)
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Start fine-tuning
trainer.train()


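| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.0252        | 0.009691        | 0.998    |
| 2     | 0.0047        | 2.4e-05         | 1.0      |
| 3     | 0.0014        | 1.3e-05         | 1.0      |
| 4     | 0.0           | 9e-06           | 1.0      |
| 5     | 0.0           | 8e-06           | 1.0      |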


TrainOutput(global_step=2500, training_loss=0.006257697998732328, metrics={'train_runtime': 691.9251, 'train_samples_per_second': 57.81, 'train_steps_per_second': 3.613, 'total_flos': 1233333072000000.0, 'train_loss': 0.006257697998732328, 'epoch': 5.0})

# VIII. Using the Fine-tuned CodeBERT Model

After fine-tuning, the model can **classify new code snippets** to predict their programming language.  

### How to Use:
1. Load the fine-tuned model using a Hugging Face pipeline.  
2. Provide a **code snippet** as input.  
3. The model outputs the **predicted language** along with a confidence score.  
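
The confidence score in step 3 is a probability derived from the model's logits. Below is a minimal sketch of that last step, assuming a softmax over the two classes (the usual choice for single-label classification) and using made-up logits; `classify_from_logits` is a hypothetical helper, not part of the pipeline API.

```python
import math

id2label = {0: "Java", 1: "Python"}

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify_from_logits(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for a Python snippet
result = classify_from_logits([-5.2, 6.4])
print(result["label"], round(result["score"], 4))
```
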

### Example Use Cases:
- Identify the programming language of a new snippet from a code repository.  
- Automatically tag code samples in educational platforms.  
- Analyze mixed-language projects and categorize code sections.


In [None]:
# I. Load Fine-tuned Model for Inference

from transformers import pipeline

# Initialize a text-classification pipeline with the fine-tuned CodeBERT model
# - model: path to the folder where the fine-tuned model is saved
classifier = pipeline("text-classification", model="/content/models/CodeClassifier/checkpoint-2500")

# Quick test: classify a new code snippet
sample_code = """
def hello_world():
  print('Hello, RANLP2025!')
"""
result = classifier(sample_code)

print("=== Classification Result ===")
print(result)


Device set to use cuda:0


=== Classification Result ===
[{'label': 'Python', 'score': 0.9999912977218628}]
