## 📦 CodeParrot GitHub Code Dataset Overview

The **CodeParrot GitHub Code Dataset** is a large-scale dataset of open-source code files collected from GitHub via BigQuery. It is intended for tasks such as language modeling and code generation, supporting a wide variety of programming languages and licenses.

---

### 🔹 General Info

- **Dataset Name**: `codeparrot/github-code`
- **Files**: ~115 million
- **Total Size**: ~1TB uncompressed (~300GB compressed)
- **Source**: Public GitHub repositories (via Google BigQuery)
- **Split**: Only contains a `train` split

---

### 🔸 Data Fields

| Field       | Type   | Description                            |
|-------------|--------|----------------------------------------|
| `code`      | string | File contents                          |
| `repo_name` | string | GitHub repository name                 |
| `path`      | string | Path of file in the repo               |
| `language`  | string | Programming language (by extension)    |
| `license`   | string | Repository license                     |
| `size`      | int    | File size in bytes                     |
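
As a rough illustration of how these fields are typically used, the sketch below filters an iterable of records by `language` and `size`. The records are hand-written stand-ins that mimic the schema, not rows taken from the dataset, and `select_records` is a hypothetical helper name.

```python
from typing import Iterable, Iterator, Set


def select_records(records: Iterable[dict], languages: Set[str], max_size: int) -> Iterator[dict]:
    """Yield records whose `language` is wanted and whose `size` is small enough."""
    for record in records:
        if record["language"] in languages and record["size"] <= max_size:
            yield record


# Hand-written stand-in records that mimic the schema in the table above.
sample_records = [
    {"code": "print('hi')\n", "repo_name": "user/a", "path": "a.py",
     "language": "Python", "license": "mit", "size": 12},
    {"code": "int main(void) { return 0; }\n", "repo_name": "user/b", "path": "b.c",
     "language": "C", "license": "gpl-3.0", "size": 29},
    {"code": "<p>hello</p>\n", "repo_name": "user/c", "path": "c.html",
     "language": "HTML", "license": "apache-2.0", "size": 13},
]

selected = list(select_records(sample_records, {"Python", "C"}, max_size=20))
print([r["path"] for r in selected])
```

Because the helper only relies on iteration, the same pattern should also work on a streamed dataset, which yields one dictionary per file.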

---

### 🧠 Programming Languages (30 total)

Includes, but is not limited to:

- Python, JavaScript, Java, C, C++, C#, Go, PHP, HTML, CSS
- TypeScript, Shell, Rust, Scala, Lua, Dockerfile, SQL
- Markdown, Haskell, Perl, Ruby, FORTRAN, PowerShell, Assembly

---

### 📝 License Types (15 total)

Examples:

- `mit`, `apache-2.0`, `gpl-3.0`, `bsd-3-clause`, `isc`, `unlicense`, `cc0-1.0`, etc.

---

### 📊 Dataset Statistics (Top Languages by File Count)

| Language    | Files        | Size (GB) |
|-------------|--------------|-----------|
| Java        | 19.5M        | 107.7     |
| C           | 14.1M        | 183.8     |
| JavaScript  | 11.8M        | 87.8      |
| HTML        | 11.1M        | 118.1     |
| PHP         | 11.1M        | 61.4      |
| Python      | 7.2M         | 52.0      |

---

### ⚠️ Considerations

- May include sensitive or unsafe content (e.g., passwords, API keys)
- Quality varies across repositories and files
- Deduplication and line-length filtering were applied during preprocessing

---

# Import Libraries

In [4]:
from datasets import load_dataset
import pandas as pd
from collections import defaultdict
import random
import os
from ollama import pull, chat, show, ResponseError
import time
import json
import re
import io
import tokenize



---

# Random Sampling Data and Preprocessing

## `clean_code`
- **Purpose:**  
  Removes comments and docstrings from the source code while preserving the original indentation and spacing of non-empty lines.
- **Key Features:**
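  - For **Python**:
    - Removes every triple-quoted string (`""" ... """` and `''' ... '''`) with a regular expression, which covers docstrings but also any other triple-quoted literal.
    - Removes everything from a `#` to the end of the line, including a `#` that appears inside a string literal.
  - For **Other Languages** (e.g., C, C++, Java, JavaScript):
    - Removes single-line comments starting with `//`; a `//` inside a string, such as a URL, is cut as well.
    - Removes multi-line comments (enclosed in `/* ... */`).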
  - **Line Processing:**  
    Collapses consecutive empty lines into a single blank line (retains a single empty line for clarity).
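
The regex approach can be compared with a tokenizer-based alternative for Python. The following is a minimal sketch, not the notebook's method: it uses the standard `tokenize` module to locate real comment tokens, so a `#` inside a string literal is left alone. `strip_python_comments` is a hypothetical name, and the sketch assumes the input tokenizes cleanly (otherwise `tokenize` raises an exception).

```python
import io
import tokenize


def strip_python_comments(source: str) -> str:
    """Remove `#` comments using the tokenizer, so `#` inside strings is kept."""
    # Split on "\n" only, matching how io.StringIO.readline numbers the lines.
    lines = source.split("\n")
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            row, col = tok.start  # row is 1-based, col is 0-based
            # A comment always runs to the end of its line, so cut there.
            lines[row - 1] = lines[row - 1][:col].rstrip()
    return "\n".join(lines)


example = 'url = "http://host/#frag"  # trailing comment\n# full-line comment\nprint(url)\n'
print(strip_python_comments(example))
```

Full-line comments become empty lines, which the blank-line collapsing step described above would then merge.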

## `count_definitions`
- **Purpose:**  
  Counts the occurrences of function and class definitions in the cleaned code.
- **Key Features:**
  - For **Python**:
    - Counts the keywords `def` (functions) and `class` (classes).
  - For **JavaScript**:
    - Counts the keywords `function` and `class`.
  - For **Java, C, and C++**:
    - Uses heuristic regular expressions to identify function definitions.
    - Counts the keyword `class` for class declarations.
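
Keyword counting also matches `def` or `class` when they appear inside strings. For Python, a stricter alternative is to count definition nodes in the syntax tree. This is a sketch under the assumption that the sample parses with the running interpreter (Python 2 code, for example, would raise `SyntaxError`); `count_python_definitions` is a hypothetical name.

```python
import ast
import re


def count_python_definitions(source: str) -> int:
    """Count function, async function, and class definitions via the AST."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        for node in ast.walk(tree)
    )


example = '''
class Greeter:
    def greet(self):
        return "def is only a word here"

def main():
    Greeter().greet()
'''

ast_count = count_python_definitions(example)
# Keyword counting, as described above, also picks up the "def" inside the string.
regex_count = len(re.findall(r'\bdef\b', example)) + len(re.findall(r'\bclass\b', example))
print(ast_count, regex_count)  # 3 4
```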

## `passes_filters`
- **Purpose:**  
  Determines whether a given code snippet meets specific criteria for further processing.
- **Key Features:**
  - **Code Cleaning:**  
    Calls `clean_code` to remove unwanted elements (comments, docstrings, extra blank lines).
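  - **Validation Checks:**  
    - Verifies that the cleaned code is between `min_length` and `max_length` characters long.
    - Checks that the code contains at least a minimum number of function/class definitions (as counted by `count_definitions`).
    - Rejects code that still mentions one of the disallowed license strings.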
  - **Output:**  
    Returns a tuple `(bool, cleaned_code)` where `bool` indicates if the snippet passes all filters.


In [8]:
# ---- PARAMETERS ----
target_languages = ['Python', 'C', 'Java', 'JavaScript', 'C++']
samples_per_language = 25
max_total_samples = 100000  # Total items to process before stopping
min_definitions = 1         # Require at least one function or class definition
min_length = 500            # Minimum length (in characters) after cleaning
max_length = 2000           # Maximum length (in characters) for medium-sized code

# List of disallowed license strings (lower-case)
disallowed_licenses = [
    'mit',
    'apache-2.0',
    'gpl-3.0',
    'gpl-2.0',
    'bsd-3-clause',
    'agpl-3.0',
    'lgpl-3.0',
    'lgpl-2.1',
    'bsd-2-clause',
    'cc0-1.0',
    'epl-1.0',
    'mpl-2.0',
    'unlicense',
    'isc',
    'artistic-2.0'
]

# ---- CODE CLEANING AND FILTERING FUNCTIONS ----
def clean_code(code, language):
    """
    Remove comments and docstrings (if applicable) while:
      - Preserving indentation and spaces in non-empty lines.
      - Collapsing consecutive empty lines into a single empty line.
    """
    cleaned = code
    if language == "Python":
        # Remove triple-quoted docstrings (both """ and ''')
        cleaned = re.sub(r'("""[\s\S]*?"""|\'\'\'[\s\S]*?\'\'\')', '', cleaned)
        # Remove inline comments starting with #
        cleaned = re.sub(r'#.*', '', cleaned)
    else:
        # Remove single-line comments (// ...) for C, C++, Java, JavaScript
        cleaned = re.sub(r'//.*', '', cleaned)
        # Remove multi-line comments (/* ... */)
        cleaned = re.sub(r'/\*[\s\S]*?\*/', '', cleaned)
    
    # Split the cleaned code into lines.
    lines = cleaned.splitlines()
    collapsed_lines = []
    last_line_blank = False
    for line in lines:
        if not line.strip():
            if not last_line_blank:
                # Add the first blank line.
                collapsed_lines.append(line)
                last_line_blank = True
            # Skip additional consecutive blank lines.
        else:
            collapsed_lines.append(line)
            last_line_blank = False
    cleaned = "\n".join(collapsed_lines)
    return cleaned

def count_definitions(cleaned, language):
    """
    Naively count functions and class definitions based on language-specific keywords.
    """
    count = 0
    if language == "Python":
        count += len(re.findall(r'\bdef\b', cleaned))
        count += len(re.findall(r'\bclass\b', cleaned))
    elif language == "JavaScript":
        count += len(re.findall(r'\bfunction\b', cleaned))
        count += len(re.findall(r'\bclass\b', cleaned))
    elif language in ["Java", "C++", "C"]:
        # Heuristic for function definitions and class declarations
        count += len(re.findall(r'\b[A-Za-z_][A-Za-z0-9_]*\s+\**[A-Za-z_][A-Za-z0-9_]*\s*\([^)]*\)\s*\{', cleaned))
        count += len(re.findall(r'\bclass\b', cleaned))
    return count

def contains_disallowed_license(cleaned, disallowed_licenses):
    """
    Check if the cleaned code mentions a disallowed license string.
    Names are matched as whole tokens, so 'mit' does not match 'submit' or 'limit'.
    """
    names = '|'.join(re.escape(dl) for dl in disallowed_licenses)
    pattern = re.compile(r'(?<![a-z0-9])(?:' + names + r')(?![a-z0-9])')
    return pattern.search(cleaned.lower()) is not None

def passes_filters(code, language, min_definitions=min_definitions, min_length=min_length, max_length=max_length):
    """
    Clean the code and check if it meets the criteria:
      - Length must be at least `min_length` and at most `max_length` after cleaning.
      - Contains at least `min_definitions` number of function/class definitions.
      - Does not include any disallowed license information.
    Returns a tuple (bool, cleaned_code).
    """
    cleaned = clean_code(code, language)
    # Check length requirements.
    if len(cleaned) < min_length or len(cleaned) > max_length:
        return False, cleaned
    # Check for sufficient definitions.
    if count_definitions(cleaned, language) < min_definitions:
        return False, cleaned
    # Discard any sample that mentions a disallowed license.
    if contains_disallowed_license(cleaned, disallowed_licenses):
        return False, cleaned
    return True, cleaned

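# ---- LOADING THE DATASET ----
# No revision is pinned here. The dataset only provides a `train` split,
# so the `test` request is expected to fail and fall back to `train`.
try:
    ds = load_dataset("codeparrot/github-code", streaming=True, split="test")
except Exception: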
    print("Test split not available; using train split instead.")
    ds = load_dataset("codeparrot/github-code", streaming=True, split="train")

# ---- RESERVOIR SAMPLING WITH FILTERING ----
reservoir = {lang: [] for lang in target_languages}
counts = {lang: 0 for lang in target_languages}

for i, example in enumerate(ds):
    # Early exit to limit processed examples (checked first so `continue` cannot skip it).
    if i >= max_total_samples:
        print("Stopped early: processed max_total_samples examples.")
        break

    lang = example['language']
    if lang not in target_languages:
        continue

    # Apply cleaning and filtering criteria (length, definition count, license mentions)
    passes, cleaned_code = passes_filters(example['code'], lang)
    if not passes:
        continue
    counts[lang] += 1

    # Reservoir sampling: if reservoir isn't full, add sample; otherwise, randomly replace.
    if len(reservoir[lang]) < samples_per_language:
        reservoir[lang].append(cleaned_code)
    else:
        j = random.randint(0, counts[lang] - 1)
        if j < samples_per_language:
            reservoir[lang][j] = cleaned_code

# ---- BUILDING THE DATAFRAME AND SAVING TO CSV ----
rows = []
for lang in target_languages:
    for code in reservoir[lang]:
        rows.append({'language': lang, 'code': code})

df = pd.DataFrame(rows)
df.index.name = "index"

csv_filename = "github_code_sample_random_5langs.csv"
df.to_csv(csv_filename, index=True)
print(f"✅ Randomized and filtered sample DataFrame created and saved as '{csv_filename}'")


Test split not available; using train split instead.
Stopped early: processed max_total_samples examples.
✅ Randomized and filtered sample DataFrame created and saved as 'github_code_sample_random_5langs.csv'
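
The per-language sampling in the cell above follows the reservoir sampling pattern. Stripped of the filtering and the per-language bookkeeping, the idea can be sketched as a standalone function. `reservoir_sample` is a hypothetical helper, not part of the notebook, and the seed below is only there to make the demo repeatable.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for seen, item in enumerate(items, start=1):
        if len(reservoir) < k:
            # Fill phase: keep the first k items unconditionally.
            reservoir.append(item)
        else:
            # Replace phase: item number `seen` survives with probability k / seen.
            j = rng.randrange(seen)  # uniform over 0 .. seen - 1
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), k=5, rng=random.Random(0))
print(sample)
```

The stream is read once and only `k` items are held in memory, which is why the pattern suits a streamed dataset whose length is not known in advance.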


---

## Ollama Model Download and Test

This code snippet uses the Ollama Python library to automate the process of downloading and testing several code-related models. Here’s what the code does:

1. **Define Models**:  
   It starts by listing the models to be tested:
   - `qwen2.5-coder:32b`
   - `codellama:70b`
   - `deepseek-coder:33b`
   - `codegemma:7b`
   - `codestral`

2. **Pull Models**:  
   For each model in the list, the code uses the `pull(model)` function to download the model if it isn’t already available locally.

3. **Show Model Info**:  
   After pulling, it calls `show(model)` and prints the returned model details, which confirms that the model is available locally.

4. **Test Model Functionality**:  
   The code sends a simple chat query (“Hello, is your model working?”) using the `chat()` function. It then prints the model's response to verify that the model is functioning correctly.

5. **Error Handling**:  
   If a step raises `ResponseError` (e.g., if a model is not available), the error is caught and printed and the loop moves on to the next model. Other exception types are not handled.

This process helps verify that all specified models are properly downloaded and operational on your system.

In [10]:
# List of models to pull and test
models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Iterate over each model, pull it, and perform a simple test chat query
for model in models:
    print(f"Pulling model: {model}...")
    try:
        # Pull the model; if not present locally, it will be downloaded.
        pull(model)
        print(f"Model '{model}' pulled successfully!")
        
        # Optionally, show basic model details/info.
        info = show(model)
        print(f"Model info for '{model}':\n{info}\n")
        
        # Test the model with a simple chat query.
        response = chat(model=model, messages=[{'role': 'user', 'content': 'Hello, is your model working?'}])
        print(f"Response from '{model}':")
        print(response['message']['content'])
    except ResponseError as e:
        print(f"Error with model '{model}': {e}")
    print("-" * 50)

Pulling model: qwen2.5-coder:32b...
Model 'qwen2.5-coder:32b' pulled successfully!
Model info for 'qwen2.5-coder:32b':
modified_at=datetime.datetime(2025, 4, 8, 20, 49, 45, 432944, tzinfo=TzInfo(-05:00)) template='{{- if .Suffix }}<|fim_prefix|>{{ .Prompt }}<|fim_suffix|>{{ .Suffix }}<|fim_middle|>\n{{- else if .Messages }}\n{{- if or .System .Tools }}<|im_start|>system\n{{- if .System }}\n{{ .System }}\n{{- end }}\n{{- if .Tools }}\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{{- range .Tools }}\n{"type": "function", "function": {{ .Function }}}\n{{- end }}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{"name": <function-name>, "arguments": <args-json-object>}\n</tool_call>\n{{- end }}<|im_end|>\n{{ end }}\n{{- range $i, $_ := .Messages }}\n{{- $last := eq (len (sli

---

# CSV Processing with Ollama Models for Code Documentation

This Jupyter Notebook cell processes a CSV file of code samples with several pre-downloaded models, called through the Ollama Python library. The goal is to generate documented code for each sample using two prompts sent in sequence. The output is saved incrementally to a new CSV file.

---

## 1. Prompt Setup

- **Prompt 1 (`engineered_prompt-1.md`):**  
  Contains the main engineered instructions for code cleaning and documentation. This prompt is loaded into the variable `engineered_prompt`.

- **Prompt 2 (`engineered_prompt-2.md`):**  
  Contains additional instructions. It is loaded into the variable `engineered_prompt_2`.  
  *Note:* It is assumed that Prompt 2 already includes the required instruction line.

---

## 2. Model List

A list of models (e.g., `qwen2.5-coder:32b`, `codellama:70b`, etc.) is defined. These models are assumed to be already downloaded on your system.

---

## 3. Conversation History

- A helper function `send_message_with_metrics` is defined to maintain a conversation history for each request.  
- The function appends each user message and the corresponding assistant response to a conversation list and returns the full response, which carries the evaluation metrics (`eval_count` and `eval_duration`) alongside the message.
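
The conversation-history pattern can be tried without a running model. In the sketch below the chat call is passed in as a parameter so that a stand-in can replace it. `send_message` and `fake_chat` are hypothetical names, and the response shape (`response["message"]["content"]`) is assumed to mirror the one used elsewhere in this notebook.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def send_message(chat_fn: Callable[..., dict], model: str,
                 conversation: List[Message], content: str) -> dict:
    """Append the user turn, call the model, then append the assistant turn."""
    conversation.append({"role": "user", "content": content})
    response = chat_fn(model=model, messages=conversation)
    conversation.append({"role": "assistant", "content": response["message"]["content"]})
    return response


def fake_chat(model: str, messages: List[Message]) -> dict:
    """Stand-in for a real chat call; reports how many messages it received."""
    reply = f"{model} saw {len(messages)} message(s)"
    return {"message": {"role": "assistant", "content": reply}}


history: List[Message] = []
send_message(fake_chat, "demo-model", history, "first prompt")
send_message(fake_chat, "demo-model", history, "second prompt")
print([m["role"] for m in history])
```

On the second call the model receives three messages (user, assistant, user), which is how the first prompt's context carries over to the second.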

---

## 4. CSV File Loading

- The CSV file (`github_code_sample_random_5langs.csv`) is loaded into a Pandas DataFrame.  
- Each row in the DataFrame contains a code sample in the column named `code`.

---

## 5. Processing Code Samples for Each Model

For each model, the following steps are executed for every code sample:

1. **New Conversation Initialization:**  
   A new conversation history is started for each code sample.

2. **Sending Prompt 1:**  
   The engineered prompt (Prompt 1) is sent to establish context with the model. Metrics such as evaluation duration and tokens processed per second are collected.

3. **Sending Prompt 2 + Code Sample:**  
   The second message is built by concatenating Prompt 2 (which already includes the instruction) with the actual code sample. The model's response containing the documented code is recorded.

4. **Recording the Response:**  
   The model's documented code response is immediately written into a new column in the DataFrame, corresponding to the current model.

5. **Incremental Saving:**  
   After processing each model, an intermediate CSV file is saved to ensure work is not lost. Additionally, separate CSV files for model-specific metrics (such as evaluation duration and token throughput) are generated.
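
The throughput metric mentioned in step 2 is plain arithmetic on two response fields: a token count and a duration reported in nanoseconds. A minimal sketch of that conversion follows; `throughput` is a hypothetical helper and the numbers are illustrative, not taken from a run.

```python
from typing import Optional, Tuple


def throughput(eval_count: Optional[int],
               eval_duration_ns: Optional[int]) -> Tuple[Optional[float], Optional[float]]:
    """Return (duration in seconds, tokens per second), or (None, None) if unusable."""
    if not eval_duration_ns or eval_duration_ns <= 0 or eval_count is None:
        return None, None
    duration_sec = eval_duration_ns / 1e9  # nanoseconds to seconds
    return duration_sec, eval_count / duration_sec


# Illustrative numbers only: 500 tokens generated in 25 seconds.
print(throughput(500, 25_000_000_000))  # (25.0, 20.0)
print(throughput(500, None))            # (None, None)
```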

---

## 6. Final Output

- After processing all models and code samples, the updated DataFrame is saved to `documented_code_responses_all_incremental.csv`.  
- This final CSV contains the original code samples along with new columns for each model's documented code response.

---

## Additional Details

- **Pausing:**  
  Short pauses (`time.sleep(1)`) are added between requests to avoid overloading the Ollama server that handles them.

- **Progress Tracking:**  
  Progress messages are printed for each data point and model, making it easier to monitor the process.

---

## Usage

1. Ensure that the files `engineered_prompt-1.md`, `engineered_prompt-2.md`, and `github_code_sample_random_5langs.csv` are in the same directory as your notebook.
2. Run the cell to process all data points and update the CSV incrementally.
3. The final output CSV will contain the original code samples along with a new column for each model's documented code response.
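
Before starting a long run, it may help to confirm that the three input files from step 1 are present. The check below is a small sketch using `pathlib`; `missing_files` is a hypothetical helper and it assumes the notebook's working directory is the one that holds the files.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(directory: Path, names: Iterable[str]) -> List[str]:
    """Return the names that do not exist as files inside `directory`."""
    return [name for name in names if not (directory / name).is_file()]


required = [
    "engineered_prompt-1.md",
    "engineered_prompt-2.md",
    "github_code_sample_random_5langs.csv",
]

missing = missing_files(Path.cwd(), required)
if missing:
    print("Missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```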

In [16]:
# Load Prompt 1 from 'engineered_prompt-1.md'
with open("engineered_prompt-1.md", "r", encoding="utf-8") as f:
    engineered_prompt = f.read()

# Load Prompt 2 from 'engineered_prompt-2.md'
with open("engineered_prompt-2.md", "r", encoding="utf-8") as f:
    engineered_prompt_2 = f.read()

# List of models (assumed to be already downloaded)
models = [
    "qwen2.5-coder:32b",
    "codellama:70b",
    "deepseek-coder:33b",
    "codegemma:7b",
    "codestral"
]

# Helper function to send a message and return the full response (with metrics)
def send_message_with_metrics(model, conversation, message_content):
    conversation.append({"role": "user", "content": message_content})
    response = chat(model=model, messages=conversation)
    conversation.append({"role": "assistant", "content": response['message']['content']})
    return response  # Returns full response dict including metrics

# Load the CSV file containing code samples (ensure it's in the same directory)
csv_filename = "github_code_sample_random_5langs.csv"
df = pd.read_csv(csv_filename, index_col=0)

# For each model, ensure there is a column in our main dataset for the documented code response.
for model in models:
    if model not in df.columns:
        df[model] = ""

# Process each model one by one.
for model in models:
    print(f"\nProcessing metrics and documented code for model: {model}")
    
    # Create a DataFrame to store all metrics for the current model.
    # Columns include conversation history and metrics for Prompt 1 and Prompt 2.
    metrics_df = pd.DataFrame(index=df.index, columns=[
        "conversation_history", 
        "prompt1_eval_duration_sec", "prompt1_eval_count", "prompt1_tokens_per_sec",
        "prompt2_eval_duration_sec", "prompt2_eval_count", "prompt2_tokens_per_sec"
    ])
    
    for idx in df.index:
        code_sample = df.at[idx, 'code']
        conversation = []
        
        try:
            # ---------------------------
            # Process Prompt 1
            # ---------------------------
            response1 = send_message_with_metrics(model, conversation, engineered_prompt)
            # Extract metrics for Prompt 1 (eval_duration is in nanoseconds)
            eval_duration_1 = response1.get("eval_duration", None)
            eval_count_1 = response1.get("eval_count", None)
            if eval_duration_1 and eval_duration_1 > 0 and eval_count_1 is not None:
                prompt1_tokens_per_sec = eval_count_1 * 1e9 / eval_duration_1
                prompt1_eval_duration_sec = eval_duration_1 / 1e9
            else:
                prompt1_tokens_per_sec = None
                prompt1_eval_duration_sec = None
            
            # Pause briefly (this sleep is not included in time measurement)
            time.sleep(1)
            
            # ---------------------------
            # Process Prompt 2
            # ---------------------------
            second_message = engineered_prompt_2 + code_sample
            response2 = send_message_with_metrics(model, conversation, second_message)
            documented_code = response2['message']['content']  # Documented code response
            
            eval_duration_2 = response2.get("eval_duration", None)
            eval_count_2 = response2.get("eval_count", None)
            if eval_duration_2 and eval_duration_2 > 0 and eval_count_2 is not None:
                prompt2_tokens_per_sec = eval_count_2 * 1e9 / eval_duration_2
                prompt2_eval_duration_sec = eval_duration_2 / 1e9
            else:
                prompt2_tokens_per_sec = None
                prompt2_eval_duration_sec = None
            
        except ResponseError as e:
            documented_code = f"Error: {e}"
            prompt1_eval_duration_sec = None
            eval_count_1 = None
            prompt1_tokens_per_sec = None
            prompt2_eval_duration_sec = None
            eval_count_2 = None
            prompt2_tokens_per_sec = None
        
        # Update the main dataset with the documented code response for this model.
        df.at[idx, model] = documented_code
        
        # Save metrics for this data point
        conv_history_str = json.dumps(conversation)
        metrics_df.at[idx, "conversation_history"] = conv_history_str
        metrics_df.at[idx, "prompt1_eval_duration_sec"] = prompt1_eval_duration_sec
        metrics_df.at[idx, "prompt1_eval_count"] = eval_count_1
        metrics_df.at[idx, "prompt1_tokens_per_sec"] = prompt1_tokens_per_sec
        metrics_df.at[idx, "prompt2_eval_duration_sec"] = prompt2_eval_duration_sec
        metrics_df.at[idx, "prompt2_eval_count"] = eval_count_2
        metrics_df.at[idx, "prompt2_tokens_per_sec"] = prompt2_tokens_per_sec
        
        print(f"Model '{model}' | Data point index: {idx} | P1: {prompt1_eval_duration_sec} sec, {prompt1_tokens_per_sec} tokens/sec | P2: {prompt2_eval_duration_sec} sec, {prompt2_tokens_per_sec} tokens/sec")
        time.sleep(1)
    
    # Save the metrics for this model to a separate CSV file.
    metrics_csv = f"metrics_{model.replace(':','_')}.csv"
    metrics_df.to_csv(metrics_csv)
    print(f"Metrics for model '{model}' saved to '{metrics_csv}'.")
    
    # Save the updated main dataset (with documented code responses for the current model)
    df.to_csv("documented_code_responses_all_incremental.csv")
    print(f"Documented code responses updated in main dataset CSV after processing model '{model}'.")
    
    time.sleep(1)

print("\nAll model metrics and documented code responses have been processed and saved.")


Processing metrics and documented code for model: qwen2.5-coder:32b
Model 'qwen2.5-coder:32b' | Data point index: 0 | P1: 24.063415208 sec, 21.3186696720194 tokens/sec | P2: 73.807443833 sec, 22.59934653439687 tokens/sec
Model 'qwen2.5-coder:32b' | Data point index: 1 | P1: 24.923316666 sec, 21.947319746019552 tokens/sec | P2: 30.392599792 sec, 23.85449105906484 tokens/sec
Model 'qwen2.5-coder:32b' | Data point index: 2 | P1: 17.511808542 sec, 21.985164986050588 tokens/sec | P2: 40.942399458 sec, 23.642971902343067 tokens/sec
Model 'qwen2.5-coder:32b' | Data point index: 3 | P1: 20.528826875 sec, 21.822971313844256 tokens/sec | P2: 39.691171833 sec, 23.556875668322373 tokens/sec
Model 'qwen2.5-coder:32b' | Data point index: 4 | P1: 15.567693958 sec, 21.775864872124693 tokens/sec | P2: 36.816477333 sec, 23.549238352113473 tokens/sec
Model 'qwen2.5-coder:32b' | Data point index: 5 | P1: 15.513294666 sec, 21.981146322667293 tokens/sec | P2: 38.335533375 sec, 23.685597148679296 tokens/sec