# Kotlin ML Pack: Technical Report - Main Implementation

## Paper Information
- **Title**: Kotlin ML Pack: Technical Report
- **Authors**: Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, et al. (JetBrains Research)
- **Link**: [arXiv:2405.19250v1](https://arxiv.org/abs/2405.19250)
- **Date**: May 29, 2024

## Abstract Summary
This paper presents three novel datasets for Kotlin code generation:
1. **KStack**: The largest collection of permissively licensed Kotlin files (4M files, 3.1B tokens)
2. **KStack-clean**: A highly filtered version with 25,000 high-quality examples
3. **KExercises**: Translated and improved version of Python exercises dataset

The authors fine-tuned CodeLlama and DeepSeek models on these datasets, achieving up to a 16-point increase in pass rate on the Kotlin HumanEval benchmark.

## 1. Environment Setup and Dependencies

### Why LangChain/LangGraph?
- **LangChain** provides abstractions for working with LLMs, embeddings, and document processing
- **LangGraph** can orchestrate the multi-stage pipeline of dataset filtering and model evaluation
- The paper's dataset creation involves LLM-based classification, which maps naturally onto LangChain's prompt and chat-model abstractions

### Why DeepEval?
- The paper evaluates models on the HumanEval benchmark, and DeepEval's `BaseMetric` interface lets us wrap compile-and-run checks as a custom metric
- Pass rate and the error-rate breakdown are computed here by a custom `KotlinCodeMetric` (Section 4.1) rather than by a built-in DeepEval metric

In [None]:
# Install required packages
!pip install langchain langchain-openai langchain-community langgraph
!pip install deepeval transformers datasets torch
!pip install pandas numpy matplotlib seaborn
!pip install huggingface-hub

In [None]:
import os
from typing import List, Dict, Tuple, Optional
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dataclasses import dataclass

# LangChain imports
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.messages import HumanMessage, SystemMessage
from langchain.chains import LLMChain  # legacy chain API; newer code composes `prompt | llm`

# LangGraph imports
from langgraph.graph import StateGraph, END

# DeepEval imports  
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Transformers imports
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Set up visualization
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 2. Dataset Implementation

### 2.1 KStack Dataset Structure
Based on Section III.A of the paper, KStack is collected from:
- GitHub repositories where main language is Kotlin
- Repositories with 10+ stars containing Kotlin files
- Stack v1.2 repositories with Kotlin
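
The three collection sources above can be read as a single repository-level predicate. The following is a minimal sketch under that reading; the `RepoInfo` record and its field names are hypothetical and are not taken from the paper or from any GitHub API.

```python
from dataclasses import dataclass

@dataclass
class RepoInfo:
    """Hypothetical repository record; field names are illustrative only."""
    main_language: str
    stars: int
    has_kotlin_files: bool
    in_stack_v1_2: bool

def is_kstack_candidate(repo: RepoInfo) -> bool:
    """Return True if the repository matches any of the three listed sources."""
    kotlin_main = repo.main_language == "Kotlin"
    popular_with_kotlin = repo.stars >= 10 and repo.has_kotlin_files
    stack_with_kotlin = repo.in_stack_v1_2 and repo.has_kotlin_files
    return kotlin_main or popular_with_kotlin or stack_with_kotlin

repos = [
    RepoInfo("Kotlin", 0, True, False),  # Kotlin is the main language
    RepoInfo("Java", 25, True, False),   # 10+ stars and contains Kotlin files
    RepoInfo("Java", 3, True, False),    # too few stars and not in the Stack
    RepoInfo("Java", 3, True, True),     # listed in Stack v1.2 with Kotlin files
]
print([is_kstack_candidate(r) for r in repos])
```

Only the third repository is rejected, since it satisfies none of the three conditions.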

In [None]:
@dataclass
class KotlinFile:
    """Represents a Kotlin file from the dataset"""
    content: str
    repo_stars: int
    file_path: str
    extension: str  # .kt, .kts, .gradle.kts
    license: str
    
@dataclass
class DatasetStats:
    """Statistics from Table I in the paper"""
    name: str
    files: int
    repositories: int
    lines: int
    tokens: int

# Dataset statistics from the paper
dataset_stats = {
    "The Stack v2": DatasetStats("The Stack v2", 2_000_000, 109_547, 162_000_000, 1_700_000_000),
    "KStack": DatasetStats("KStack", 4_000_000, 163_310, 293_000_000, 3_100_000_000),
    "KStack-clean": DatasetStats("KStack-clean", 25_000, 3_366, 2_000_000, 22_000_000)
}
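
As a quick sanity check on these figures, the sketch below derives the average number of tokens per file and the share of KStack files that survive into KStack-clean. The counts are copied from `dataset_stats` above and are rounded in the source, so the results are approximate; the helper name is illustrative.

```python
# (files, tokens) copied from the dataset statistics above
stats = {
    "The Stack v2": (2_000_000, 1_700_000_000),
    "KStack": (4_000_000, 3_100_000_000),
    "KStack-clean": (25_000, 22_000_000),
}

def tokens_per_file(files: int, tokens: int) -> float:
    """Average number of tokens in one file."""
    return tokens / files

for name, (files, tokens) in stats.items():
    print(f"{name}: {tokens_per_file(files, tokens):.0f} tokens/file")

# Share of KStack files kept after quality filtering
kept_fraction = stats["KStack-clean"][0] / stats["KStack"][0]
print(f"KStack-clean keeps {kept_fraction:.3%} of KStack files")
```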

### 2.2 KStack-clean Quality Filtering with LangChain

Section III.B describes using LLM classifiers for quality filtering. We'll implement this using LangChain.

In [None]:
class KotlinQualityClassifier:
    """Implements the pairwise quality classification from Section III.B"""
    
    def __init__(self, model_name: str = "gpt-3.5-turbo"):
        self.llm = ChatOpenAI(model_name=model_name, temperature=0)
        
        # Prompt for pairwise comparison as described in the paper.
        # (role, template) tuples are used so that {file_a} and {file_b} are
        # treated as template variables; message objects are inserted verbatim.
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", "You are evaluating Kotlin code quality."),
            ("human", """Compare these two Kotlin code files and determine which has 
greater educational value for learning algorithms in Kotlin.

File A:
{file_a}

File B:
{file_b}

Which file (A or B) has higher educational value? Respond with only 'A' or 'B'.""")
        ])
        
        # Compose prompt and model into a runnable chain
        self.chain = self.prompt | self.llm
    
    def compare_files(self, file_a: str, file_b: str) -> str:
        """Compare two files and return 'A' or 'B' for the preferred one"""
        result = self.chain.invoke({"file_a": file_a, "file_b": file_b})
        return result.content.strip().upper()
    
    def calculate_score(self, file: str, comparison_file: str) -> float:
        """
        Calculate quality score using the formula from the paper:
        s(f) = (s(f,c)_A - s(f,c)_B + s(c,f)_B - s(c,f)_A) / 2

        Each s(.,.)_X is approximated here by a hard 0/1 vote (1 if the model
        answered X), so the result lies in [-1, 1]. Querying both orders
        cancels position bias.
        """
        # First comparison: file is shown as A, comparison_file as B
        first = self.compare_files(file, comparison_file)
        # Second comparison: order reversed, so file is shown as B
        second = self.compare_files(comparison_file, file)
        
        # Votes for `file` minus votes against it, averaged over both orders
        first_term = (first == 'A') - (first == 'B')
        second_term = (second == 'B') - (second == 'A')
        return (first_term + second_term) / 2
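
The score formula quoted in the docstring can also be evaluated on soft preferences rather than hard votes. The sketch below assumes each comparison yields a probability for answer 'A' and for answer 'B' (for example from token log-probabilities, which is an assumption here and not something this notebook extracts); `debiased_score` is a hypothetical helper name.

```python
def debiased_score(p_fc_a: float, p_fc_b: float,
                   p_cf_a: float, p_cf_b: float) -> float:
    """s(f) = (s(f,c)_A - s(f,c)_B + s(c,f)_B - s(c,f)_A) / 2

    p_fc_*: preferences when f is shown as File A and c as File B.
    p_cf_*: preferences when the order is swapped, so f is File B.
    """
    return (p_fc_a - p_fc_b + p_cf_b - p_cf_a) / 2

# A judge that always answers 'A' regardless of content (pure position bias)
biased = debiased_score(1.0, 0.0, 1.0, 0.0)
# A judge that prefers f in both presentation orders
consistent = debiased_score(0.9, 0.1, 0.2, 0.8)
print(biased, consistent)
```

A purely position-biased judge scores zero because its two votes cancel, while a judge that prefers `f` in both orders yields a positive score.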

### 2.3 KExercises Dataset Translation Pipeline

Section III.D describes translating Python exercises to Kotlin using GPT-3.5-turbo.

In [None]:
class PythonToKotlinTranslator:
    """Translates Python code exercises to Kotlin as per Section III.D"""
    
    def __init__(self):
        self.llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
        
        # Translation prompt from Figure 2 in the paper
        # (role, template) tuples keep {python_code} a template variable;
        # message objects would be inserted verbatim without substitution
        self.prompt = ChatPromptTemplate.from_messages([
            ("system", "You are a helpful assistant."),
            ("human", """Rewrite to Kotlin (do not forget about docstring):

{python_code}""")
        ])
        
        # Compose prompt and model into a runnable chain
        self.chain = self.prompt | self.llm
    
    def translate(self, python_code: str) -> str:
        """Translate Python code to Kotlin"""
        return self.chain.invoke({"python_code": python_code}).content

# Example translation
translator = PythonToKotlinTranslator()
python_example = """
def fibonacci(n: int) -> int:
    '''Calculate the nth Fibonacci number'''
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

# Note: In real implementation, you would translate the actual dataset
# kotlin_translation = translator.translate(python_example)

## 3. Model Fine-tuning Implementation

### 3.1 Training Configuration
Based on Section V.C, the paper uses specific training techniques.

In [None]:
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, SequentialLR

class ZLoss(nn.Module):
    """Z-loss implementation from Section V.C
    
    The z-loss term is log^2(Z) where Z = sum(exp(y_j))
    This prevents logit divergence during training.
    """
    def __init__(self):
        super().__init__()
        
    def forward(self, logits):
        # Calculate Z = sum(exp(logits))
        # Use logsumexp for numerical stability
        log_z = torch.logsumexp(logits, dim=-1)
        # Z-loss is log^2(Z)
        z_loss = log_z ** 2
        return z_loss.mean()

class KotlinModelTrainer:
    """Implements training setup from Section V.C"""
    
    def __init__(self, model, learning_rate=1e-4, weight_decay=0.1):
        self.model = model
        self.z_loss = ZLoss()
        
        # AdamW with decreased epsilon as per paper
        self.optimizer = AdamW(
            model.parameters(),
            lr=learning_rate,
            weight_decay=weight_decay,
            eps=1e-16  # Decreased from default 1e-8
        )
        
        # Gradient clipping value
        self.max_grad_norm = 1.0
        
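    def compute_loss(self, logits, labels, z_loss_weight=0.01):
        """Compute combined cross-entropy and z-loss"""
        # Causal LM: the logits at position t predict token t+1, so drop the
        # last logit and the first label before comparing them
        shift_logits = logits[..., :-1, :]
        shift_labels = labels[..., 1:]
        
        # Standard cross-entropy loss (labels equal to -100 are ignored by default)
        ce_loss = nn.functional.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1)
        )
        
        # Add z-loss for stability
        z_loss = self.z_loss(logits)
        
        total_loss = ce_loss + z_loss_weight * z_loss
        return total_loss, ce_loss, z_loss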
    
    def training_step(self, batch):
        """Single training step with gradient clipping"""
        self.optimizer.zero_grad()
        
        # Forward pass
        outputs = self.model(**batch)
        loss, ce_loss, z_loss = self.compute_loss(outputs.logits, batch['labels'])
        
        # Backward pass
        loss.backward()
        
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)
        
        # Optimizer step
        self.optimizer.step()
        
        return {
            'loss': loss.item(),
            'ce_loss': ce_loss.item(),
            'z_loss': z_loss.item()
        }
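
To make the z-loss term concrete, here is a dependency-free sketch that computes log^2(Z) for a single logit vector with a numerically stable log-sum-exp. It illustrates why the term discourages logit drift: adding a constant to every logit leaves the softmax unchanged but changes log(Z). The function names are illustrative.

```python
import math

def log_sum_exp(logits):
    """Stable log(sum(exp(x))) using max-subtraction."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))

def z_loss(logits):
    """Squared log-partition function, log^2(Z)."""
    return log_sum_exp(logits) ** 2

def softmax(logits):
    """Probabilities exp(x - log Z) for each logit."""
    log_z = log_sum_exp(logits)
    return [math.exp(x - log_z) for x in logits]

logits = [2.0, 1.0, 0.0]
shifted = [x + 5.0 for x in logits]

# Same predicted distribution, different penalty
print(softmax(logits))
print(softmax(shifted))
print(z_loss(logits), z_loss(shifted))
```

The two softmax outputs agree, yet the shifted logits incur a larger z-loss, which is the drift the regularizer penalizes.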

### 3.2 Model Configuration
The paper uses CodeLlama-7B and DeepSeek-coder models.

In [None]:
# Model configurations from the paper
MODEL_CONFIGS = {
    "CodeLlama-7B": {
        "model_name": "codellama/CodeLlama-7b-hf",
        "context_length": 16384,
        "supports_fim": True  # Fill-in-the-middle
    },
    "Deepseek-coder-6.7B": {
        "model_name": "deepseek-ai/deepseek-coder-6.7b-base",
        "context_length": 16384,
        "supports_fim": True
    },
    "Deepseek-coder-1.3B": {
        "model_name": "deepseek-ai/deepseek-coder-1.3b-base",
        "context_length": 16384,
        "supports_fim": True
    }
}

def load_model_and_tokenizer(model_config: dict):
    """Load model and tokenizer with proper configuration"""
    tokenizer = AutoTokenizer.from_pretrained(
        model_config["model_name"],
        trust_remote_code=True
    )
    
    model = AutoModelForCausalLM.from_pretrained(
        model_config["model_name"],
        torch_dtype=torch.bfloat16,  # BF16 precision as per paper
        device_map="auto",
        trust_remote_code=True
    )
    
    return model, tokenizer

## 4. Evaluation with DeepEval

### 4.1 HumanEval for Kotlin
Section IV describes the improved HumanEval benchmark for Kotlin.

In [None]:
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase
import subprocess
import tempfile
import os

class KotlinCodeMetric(BaseMetric):
    """Custom metric for evaluating Kotlin code generation
    
    Implements the metrics from Section IV.B:
    - Pass@1
    - Compilation Error Rate
    - Test Error Rate
    - Runtime Error Rate
    """
    
    def __init__(self):
        self.threshold = 1.0  # Binary pass/fail
        self.score = 0
        self.reason = None
        self.error_type = None
    
    def _fail(self, error_type: str, reason: str) -> float:
        """Record a failed run and return the score"""
        self.score, self.error_type, self.reason = 0, error_type, reason
        return self.score
    
    def measure(self, test_case: LLMTestCase) -> float:
        # Extract generated code
        generated_code = test_case.actual_output
        expected_tests = test_case.expected_output
        
        # Combine generated code with test harness
        full_code = f"""
{generated_code}

fun main() {{
    // Test harness
    {expected_tests}
}}
"""
        # A private working directory avoids name clashes and is removed on exit
        with tempfile.TemporaryDirectory() as workdir:
            kotlin_file = os.path.join(workdir, 'Main.kt')
            jar_file = os.path.join(workdir, 'main.jar')
            with open(kotlin_file, 'w') as f:
                f.write(full_code)
            
            try:
                # Compile Kotlin code into a runnable jar
                compile_result = subprocess.run(
                    ['kotlinc', kotlin_file, '-include-runtime', '-d', jar_file],
                    capture_output=True,
                    text=True,
                    timeout=15
                )
                if compile_result.returncode != 0:
                    return self._fail("compilation_error",
                                      f"Compilation error: {compile_result.stderr}")
                
                # Run with -ea: Kotlin's assert() only fires when JVM assertions are enabled
                run_result = subprocess.run(
                    ['java', '-ea', '-jar', jar_file],
                    capture_output=True,
                    text=True,
                    timeout=15
                )
            except subprocess.TimeoutExpired:
                return self._fail("timeout_error", "Execution timeout (>15 seconds)")
        
        if run_result.returncode != 0:
            # Heuristic: a failed assert() is a test error, any other crash a runtime error
            if "AssertionError" in run_result.stderr:
                return self._fail("test_error", f"Test error: {run_result.stderr}")
            return self._fail("runtime_error", f"Runtime error: {run_result.stderr}")
        
        self.score, self.error_type, self.reason = 1, None, "All tests passed"
        return self.score
    
    async def a_measure(self, test_case: LLMTestCase) -> float:
        # No native async path; reuse the synchronous implementation
        return self.measure(test_case)
    
    def is_successful(self):
        return self.score >= self.threshold
    
    @property
    def __name__(self):
        return "Kotlin Code Evaluation"
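
The docstring above lists Pass@1. When several samples are drawn per problem, pass@k is commonly estimated with the unbiased estimator introduced alongside the original HumanEval benchmark. The sketch below assumes that estimator, which this notebook does not otherwise use; with one sample per problem it reduces to the plain pass rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that passed all tests
    k: sampling budget being evaluated
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

Averaging `pass_at_k` over all benchmark problems gives the reported score for a given `k`.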

### 4.2 Evaluation Pipeline with LangGraph

We'll use LangGraph to orchestrate the evaluation pipeline.

In [None]:
from typing import TypedDict, List
from langgraph.graph import StateGraph, END

class EvaluationState(TypedDict):
    """State for the evaluation pipeline"""
    model_name: str
    prompts: List[str]
    generated_codes: List[str]
    test_results: List[dict]
    metrics: dict

def generate_code(state: EvaluationState) -> EvaluationState:
    """Generate Kotlin code from prompts"""
    # Load model based on state.model_name
    # Generate code for each prompt
    # This is a placeholder - in real implementation, use actual model
    state['generated_codes'] = [
        f"fun solution() {{ /* Generated code for {prompt} */ }}" 
        for prompt in state['prompts']
    ]
    return state

def evaluate_code(state: EvaluationState) -> EvaluationState:
    """Evaluate generated code using DeepEval"""
    metric = KotlinCodeMetric()
    results = []
    
    for code in state['generated_codes']:
        test_case = LLMTestCase(
            input="Generate Kotlin function",
            actual_output=code,
            expected_output="assert(solution() == expected)"  # placeholder; substitute the benchmark's real assertions
        )
        metric.measure(test_case)
        results.append({
            'passed': metric.is_successful(),
            'error_type': getattr(metric, 'error_type', None),
            'reason': metric.reason
        })
    
    state['test_results'] = results
    return state

def calculate_metrics(state: EvaluationState) -> EvaluationState:
    """Calculate final metrics as per Section IV.B"""
    results = state['test_results']
    total = len(results)
    if total == 0:
        # Nothing was evaluated; avoid dividing by zero
        state['metrics'] = {}
        return state
    
    def rate(error_type: str) -> float:
        """Percentage of results with the given error type"""
        return sum(1 for r in results if r['error_type'] == error_type) / total * 100
    
    metrics = {
        'pass_rate': sum(r['passed'] for r in results) / total * 100,
        'compilation_error_rate': rate('compilation_error'),
        'test_error_rate': rate('test_error'),
        'runtime_error_rate': rate('runtime_error'),
        'timeout_error_rate': rate('timeout_error')
    }
    
    # Syntax errors surface at compile time, so the compilation error rate is
    # used as the proxy here (an assumption); runtime and test failures are
    # reported separately rather than folded into this figure
    metrics['syntax_error_rate'] = metrics['compilation_error_rate']
    
    state['metrics'] = metrics
    return state

# Build the evaluation graph
evaluation_graph = StateGraph(EvaluationState)

# Add nodes
evaluation_graph.add_node("generate", generate_code)
evaluation_graph.add_node("evaluate", evaluate_code)
evaluation_graph.add_node("calculate_metrics", calculate_metrics)

# Add edges
evaluation_graph.add_edge("generate", "evaluate")
evaluation_graph.add_edge("evaluate", "calculate_metrics")
evaluation_graph.add_edge("calculate_metrics", END)

# Set entry point
evaluation_graph.set_entry_point("generate")

# Compile the graph
evaluation_pipeline = evaluation_graph.compile()
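
Because the graph above is a straight line (generate, evaluate, calculate_metrics), its data flow can be sketched without LangGraph as a fold over node functions that each take and return a state dictionary. The stub nodes below are hypothetical stand-ins, not the notebook's real nodes, and only show how state accumulates between steps.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Node = Callable[[State], State]

def stub_generate(state: State) -> State:
    """Stand-in generator: one fake snippet per prompt."""
    return {**state, "generated_codes": [f"// code for {p}" for p in state["prompts"]]}

def stub_evaluate(state: State) -> State:
    """Stand-in evaluator: pretend that only non-empty prompts pass."""
    results = [{"passed": bool(p)} for p in state["prompts"]]
    return {**state, "test_results": results}

def stub_metrics(state: State) -> State:
    """Aggregate the pass rate as a percentage."""
    results = state["test_results"]
    rate = 100 * sum(r["passed"] for r in results) / len(results)
    return {**state, "metrics": {"pass_rate": rate}}

def run_linear_pipeline(nodes: List[Node], state: State) -> State:
    """Apply each node in order, threading the state through."""
    return reduce(lambda s, node: node(s), nodes, state)

final = run_linear_pipeline(
    [stub_generate, stub_evaluate, stub_metrics],
    {"prompts": ["sum two ints", ""]},
)
print(final["metrics"])
```

The compiled graph is run analogously with `evaluation_pipeline.invoke(initial_state)`, where `initial_state` is a dictionary matching `EvaluationState`.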

## 5. Results Visualization

### 5.1 Reproducing Figure 1: Model Performance Comparison

In [None]:
# Results from Table II in the paper
results_data = {
    'Model': ['CodeLlama-7B Base', 'CodeLlama-7B KStack', 'CodeLlama-7B KStack-clean', 'CodeLlama-7B KExercises',
              'Deepseek-7B Base', 'Deepseek-7B KExercises',
              'Deepseek-1.3B Base', 'Deepseek-1.3B KStack', 'Deepseek-1.3B KExercises'],
    'Pass Rate': [26.09, 29.19, 37.89, 42.24, 40.99, 55.28, 26.71, 27.95, 36.65],
    'Syntax Error Rate': [22.98, 22.98, 18.64, 19.25, 21.12, 15.53, 19.26, 19.88, 18.63],
    'Completion Rate': [0.388, 0.396, 0.403, 0.344, 0.403, 0.411, 0.403, 0.404, 0.388]
}

results_df = pd.DataFrame(results_data)

# Create visualization
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

# Pass Rate
ax1.bar(range(len(results_df)), results_df['Pass Rate'], color='green', alpha=0.7)
ax1.set_xlabel('Model Configuration')
ax1.set_ylabel('Pass Rate (%)')
ax1.set_title('Pass Rate on Kotlin HumanEval')
ax1.set_xticks(range(len(results_df)))
ax1.set_xticklabels(results_df['Model'], rotation=45, ha='right')

# Syntax Error Rate
ax2.bar(range(len(results_df)), results_df['Syntax Error Rate'], color='red', alpha=0.7)
ax2.set_xlabel('Model Configuration')
ax2.set_ylabel('Syntax Error Rate (%)')
ax2.set_title('Syntax Error Rate')
ax2.set_xticks(range(len(results_df)))
ax2.set_xticklabels(results_df['Model'], rotation=45, ha='right')

# Completion Rate
ax3.bar(range(len(results_df)), results_df['Completion Rate'], color='blue', alpha=0.7)
ax3.set_xlabel('Model Configuration')
ax3.set_ylabel('Completion Rate')
ax3.set_title('Code Completion (Exact Match)')
ax3.set_xticks(range(len(results_df)))
ax3.set_xticklabels(results_df['Model'], rotation=45, ha='right')

plt.tight_layout()
plt.show()
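
The gains summarized in the abstract can be recomputed from the pass rates listed above. This sketch copies those values and reports each fine-tuned configuration's change relative to its base model, in percentage points; the nested dictionary is just an illustrative layout.

```python
# Pass rates (%) copied from results_data above, grouped by base model
pass_rates = {
    "CodeLlama-7B": {"Base": 26.09, "KStack": 29.19, "KStack-clean": 37.89, "KExercises": 42.24},
    "Deepseek-7B": {"Base": 40.99, "KExercises": 55.28},
    "Deepseek-1.3B": {"Base": 26.71, "KStack": 27.95, "KExercises": 36.65},
}

def gains_over_base(rates):
    """Percentage-point change of each fine-tuned variant versus its base."""
    base = rates["Base"]
    return {name: round(value - base, 2)
            for name, value in rates.items() if name != "Base"}

for model, rates in pass_rates.items():
    print(model, gains_over_base(rates))
```

The largest gain, about 16 points, comes from fine-tuning CodeLlama-7B on KExercises, which matches the headline figure in the abstract.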

### 5.2 Training Progress Visualization (Figure 3)

In [None]:
# Illustrative synthetic curves loosely shaped like Figure 3; these are not the paper's measurements
optimization_steps = np.linspace(0, 1400, 50)

# Saturating exponentials plus Gaussian noise (seeded so the plot is reproducible)
rng = np.random.default_rng(0)
openai_classifier = 26 + 10 * (1 - np.exp(-optimization_steps / 400)) + rng.normal(0, 0.5, 50)
mistral_classifier = 26 + 14 * (1 - np.exp(-optimization_steps / 300)) + rng.normal(0, 0.3, 50)

plt.figure(figsize=(10, 6))
plt.plot(optimization_steps, openai_classifier, label='OpenAI GPT-3.5-based classifier', color='blue')
plt.plot(optimization_steps, mistral_classifier, label='Mistral-based classifier', color='orange')
plt.xlabel('Optimization step')
plt.ylabel('Pass rate')
plt.title('Pass rate on HumanEval for Kotlin - Different Filtration Strategies')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## 6. Template for Personal Research

This section provides a template for extending the Kotlin ML Pack research.

In [None]:
class ResearchTemplate:
    """Template for extending Kotlin ML Pack research"""
    
    def __init__(self, research_area: str):
        self.research_area = research_area
        self.results = []
        
    def experiment_1_custom_filtering(self):
        """Experiment with different quality filtering approaches"""
        # TODO: Implement custom filtering logic
        # Ideas from Section VII:
        # - Use compiler feedback for filtering
        # - Use IDE inspections for quality assessment
        pass
    
    def experiment_2_synthetic_data(self):
        """Generate synthetic Kotlin data for specific domains"""
        # TODO: Implement synthetic data generation
        # Focus on production-oriented tasks as mentioned in Section VII
        pass
    
    def experiment_3_new_benchmarks(self):
        """Create issue-based benchmarks for Kotlin"""
        # TODO: Create SWE-Bench style dataset for Kotlin
        # Use real-world Kotlin projects and issues
        pass
    
    def run_experiments(self):
        """Run all experiments and collect results"""
        print(f"Running experiments for: {self.research_area}")
        # Run your experiments here
        pass

# Example usage
my_research = ResearchTemplate("Kotlin Android Development")
# my_research.run_experiments()

## 7. Conclusions and Key Insights

### Key Findings from the Paper:
1. **Quality over Quantity**: KStack-clean (25K examples) outperforms KStack (4M files) when fine-tuning CodeLlama-7B (37.89% vs. 29.19% pass rate), which shows that careful curation matters
2. **Synthetic Data Works**: KExercises gives the best pass rate for every model family, peaking at 55.28% with Deepseek-7B, despite being synthetic
3. **Training Stability Matters**: Z-loss and other techniques prevent training instabilities
4. **Evaluation is Complex**: Pass rate alone isn't sufficient; syntax error rate and completion metrics give a fuller picture

### Implementation with LangChain/DeepEval:
- **LangChain** handles the LLM-based classification pipeline for dataset curation
- **LangGraph** orchestrates the multi-stage evaluation pipeline
- **DeepEval**'s custom-metric interface hosts the paper's evaluation criteria through `KotlinCodeMetric`

### Future Research Directions (Section VII):
1. **Learning from Tools**: Integrate Kotlin compiler and IDE inspections
2. **Synthetic Data**: Generate production-oriented Kotlin tasks
3. **Better Benchmarks**: Create SWE-Bench style evaluations for Kotlin

In [None]:
# Summary statistics
print("Kotlin ML Pack Summary:")
print("=" * 50)
for name, stats in dataset_stats.items():
    print(f"\n{name}:")
    print(f"  Files: {stats.files:,}")
    print(f"  Repositories: {stats.repositories:,}")
    print(f"  Lines: {stats.lines:,}")
    print(f"  Tokens: {stats.tokens:,}")

print("\nBest Results:")
print(f"  Model: Deepseek-7B + KExercises")
print(f"  Pass Rate: 55.28% (+14.29 percentage points over base)")
print(f"  Syntax Error Rate: 15.53% (-5.59 percentage points vs. base)")