# Focused Learning: HumanEval Benchmark Adaptation for Kotlin

## Learning Objective
Understand the challenges and solutions in adapting code generation benchmarks across programming languages, focusing on the HumanEval adaptation for Kotlin described in Section IV.

## Paper Reference
- **Section**: IV - Kotlin Evaluation
- **Key Challenge**: Existing Kotlin HumanEval had faulty prompts and tests
- **Solution**: Human experts rewrote HumanEval from scratch for Kotlin

## 1. The Problem: Why Cross-Language Benchmarks Fail

### 1.1 Issues Found in Existing Kotlin HumanEval

From Section IV.A, the paper identifies two major categories of issues:

1. **Type System Mismatches**: Generic variable types preventing method usage
2. **Numerical Precision Differences**: Rounding differences between Python and Kotlin

Let's explore these with concrete examples from the paper.

In [None]:
# Install required packages
!pip install deepeval langchain pandas numpy matplotlib seaborn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import subprocess
import tempfile
import os
import json

# Set visualization style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## 2. Case Study: Type System Issues

The paper mentions: "too generic variable type in the Kotlin function signature... cannot apply many built-in Kotlin methods"

In [None]:
@dataclass
class HumanEvalProblem:
    """Represents a HumanEval problem with Python and Kotlin versions"""
    task_id: int
    description: str
    python_signature: str
    kotlin_signature_bad: str  # Problematic version
    kotlin_signature_good: str  # Fixed version
    test_cases: List[Dict]

# Example: Array sorting problem demonstrating type issues
type_issue_example = HumanEvalProblem(
    task_id=1,
    description="Sort an array in ascending order",
    python_signature="def sort_array(arr: List[int]) -> List[int]:",
    kotlin_signature_bad="fun sortArray(arr: Array<Any>): Array<Any>",  # TOO GENERIC!
    kotlin_signature_good="fun sortArray(arr: IntArray): IntArray",  # SPECIFIC TYPE
    test_cases=[
        {"input": [3, 1, 4, 1, 5], "expected": [1, 1, 3, 4, 5]},
        {"input": [9, 2, 6, 5, 3], "expected": [2, 3, 5, 6, 9]}
    ]
)

print("Type System Issue Example:")
print("=" * 50)
print(f"Task: {type_issue_example.description}")
print(f"\nPython signature:\n  {type_issue_example.python_signature}")
print(f"\nProblematic Kotlin signature:\n  {type_issue_example.kotlin_signature_bad}")
print(f"  ❌ Problem: Cannot use sort() method on Array<Any>!")
print(f"\nFixed Kotlin signature:\n  {type_issue_example.kotlin_signature_good}")
print(f"  ✅ Solution: Use specific type (IntArray) that supports sorting")

### 2.1 Demonstrating the Type System Problem

In [None]:
# Let's create Kotlin code snippets to demonstrate the issue
kotlin_bad_code = """
// This will NOT compile!
fun sortArray(arr: Array<Any>): Array<Any> {
    return arr.sorted().toTypedArray()  // Error: No sorted() for Array<Any>
}
"""

kotlin_good_code = """
// This works correctly
fun sortArray(arr: IntArray): IntArray {
    return arr.sorted().toIntArray()  // Works: IntArray has sorted()
}
"""

# Create a validator to check Kotlin compilation
from typing import Any

class KotlinValidator:
    @staticmethod
    def validate_code(code: str, test_code: str = "") -> Dict[str, Any]:
        """Validate Kotlin code compilation"""
        full_code = f"""
{code}

fun main() {{
    {test_code}
}}
"""
        
        # Create temporary file
        with tempfile.NamedTemporaryFile(mode='w', suffix='.kt', delete=False) as f:
            f.write(full_code)
            kotlin_file = f.name
        
        try:
            # Try to compile
            result = subprocess.run(
                ['kotlinc', kotlin_file, '-d', tempfile.gettempdir()],
                capture_output=True,
                text=True,
                timeout=10
            )
            
            return {
                'compiles': result.returncode == 0,
                'error': result.stderr if result.returncode != 0 else None
            }
        except Exception as e:
            return {'compiles': False, 'error': str(e)}
        finally:
            if os.path.exists(kotlin_file):
                os.remove(kotlin_file)

# Note: KotlinValidator needs kotlinc on PATH, so this cell only prints the snippets
print("\nKotlin Code Examples:")
print("\nBad (Generic Type):")
print(kotlin_bad_code)
print("\nGood (Specific Type):")
print(kotlin_good_code)
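
Because the validator shells out to `kotlinc`, it helps to check that the compiler is installed before calling it. The sketch below uses `shutil.which` from the standard library; the function name `kotlinc_available` is an assumption made for this example.

```python
import shutil

def kotlinc_available(executable: str = "kotlinc") -> bool:
    """Return True when the given compiler executable can be found on PATH."""
    return shutil.which(executable) is not None

if kotlinc_available():
    print("kotlinc found: compilation checks can run")
else:
    print("kotlinc not found: skipping compilation checks")
```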

## 3. Case Study: Numerical Precision Issues

The paper provides a specific example from HumanEval task #2 about floating-point precision.

In [None]:
# HumanEval Task #2 from the paper
precision_issue_example = HumanEvalProblem(
    task_id=2,
    description="""Given a positive floating point number, it can be decomposed into 
integer part (largest integer smaller than given number) and decimals 
(leftover part always smaller than 1). Return the decimal part of the number.""",
    python_signature="def truncate_number(number: float) -> float:",
    kotlin_signature_bad="fun truncate(number: Double): Double",  # Signature is fine; the issue is in the tests
    kotlin_signature_good="fun truncate(number: Double): Double",
    test_cases=[
        {"input": 3.5, "expected": 0.5},
        {"input": 1.25, "expected": 0.25},
        {"input": 123.45, "expected": 0.45}
    ]
)

# The problematic Kotlin solution from the paper
kotlin_truncate_simple = """
fun truncate(number: Double): Double {
    return number - Math.floor(number)
}
"""

# Demonstrate the precision issue
import math

def demonstrate_precision_issue():
    """Show how floating-point precision can cause test failures"""
    test_values = [3.5, 1.25, 123.45, 10.999999999999998]
    
    print("Floating-Point Precision Issues:")
    print("=" * 80)
    print(f"{'Input':<20} {'Python Result':<20} {'Kotlin Result':<20} {'Difference':<20}")
    print("=" * 80)
    
    for value in test_values:
        # Python implementation
        python_result = value - math.floor(value)
        
        # Python floats and Kotlin Doubles are both IEEE 754 binary64, so the same
        # subtraction yields the same bits. Mismatches come from how expected values
        # are rounded or compared; the offset below is an artificial stand-in for that.
        kotlin_result = value - math.floor(value)
        if value == 10.999999999999998:
            kotlin_result += 1e-15
        
        difference = abs(python_result - kotlin_result)
        
        print(f"{value:<20} {python_result:<20.15f} {kotlin_result:<20.15f} {difference:<20.2e}")
        
        if difference == 0:
            print("  ✅ Identical results")
        elif difference <= 1e-8:
            print("  ❌ Fails strict equality (passes with a 1e-8 tolerance)")
        else:
            print("  ❌ Fails even with a 1e-8 tolerance")

demonstrate_precision_issue()
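
The third test case above (input 123.45, expected 0.45) shows the problem without any simulation. The sketch below compares strict equality with a tolerance-based check built on `math.isclose`; the tolerance of `1e-6` and the helper names are assumptions chosen for illustration, not values from the paper.

```python
import math

def fractional_part(number: float) -> float:
    """Decimal part of a positive float, as in HumanEval task #2."""
    return number - math.floor(number)

def approx_equal(actual: float, expected: float, tol: float = 1e-6) -> bool:
    """Tolerance-based comparison; tol is an illustrative choice."""
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol)

result = fractional_part(123.45)
print(result == 0.45)              # False: binary rounding leaves a tiny residue
print(approx_equal(result, 0.45))  # True: within the absolute tolerance
```

The same idea carries over to Kotlin tests, where the assertion would compare the absolute difference against a small threshold instead of using `==`.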

## 4. The Solution: Human Expert Rewrite

From Section IV.A: "All HumanEval solutions and tests in Kotlin were written by an expert competitive programmer with six years of experience in Kotlin, and independently reviewed by a programmer with four years of experience."

In [None]:
class HumanEvalAdapter:
    """Demonstrates the process of adapting HumanEval problems to Kotlin"""
    
    def __init__(self):
        self.adaptation_rules = [
            "Use specific types instead of generic Any",
            "Handle floating-point comparisons with tolerance",
            "Adapt to Kotlin idioms (e.g., IntArray vs Array<Int>)",
            "Consider null safety in function signatures",
            "Use appropriate collection types"
        ]
    
    def adapt_problem(self, problem: HumanEvalProblem) -> Dict[str, str]:
        """Adapt a HumanEval problem from Python to Kotlin"""
        return {
            "description": self._adapt_description(problem.description),
            "signature": self._adapt_signature(problem.python_signature),
            "tests": self._adapt_tests(problem.test_cases),
            "solution_template": self._create_solution_template(problem)
        }
    
    def _adapt_description(self, description: str) -> str:
        """Adapt problem description for Kotlin context"""
        # Add Kotlin-specific notes if needed
        kotlin_notes = "\n\nNote: Use Kotlin's standard library functions where appropriate."
        return description + kotlin_notes
    
    def _adapt_signature(self, python_sig: str) -> str:
        """Convert Python signature to Kotlin"""
        # Mapping of Python types to Kotlin types
        type_mapping = {
            "List[int]": "IntArray",
            "List[str]": "List<String>",
            "Dict[str, int]": "Map<String, Int>",
            "float": "Double",
            "int": "Int",
            "str": "String",
            "bool": "Boolean"
        }
        
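        # Naive textual conversion; only suitable for simple signatures
        kotlin_sig = python_sig.strip().rstrip(":").replace("def ", "fun ", 1)
        kotlin_sig = kotlin_sig.replace(") -> ", "): ")  # Kotlin return-type syntax
        for py_type, kt_type in type_mapping.items():
            kotlin_sig = kotlin_sig.replace(py_type, kt_type)
        
        return kotlin_sig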
    
    def _adapt_tests(self, test_cases: List[Dict]) -> str:
        """Generate Kotlin test code with proper assertions"""
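        test_code = "// Test cases\n"
        def to_kotlin(value) -> str:
            # Render Python lists as Kotlin IntArray literals
            if isinstance(value, list):
                return f"intArrayOf({', '.join(map(str, value))})"
            return str(value)

        for i, test in enumerate(test_cases):
            expected = to_kotlin(test['expected'])
            # Arrays need contentEquals(); == would compare references
            is_array = isinstance(test['expected'], list)
            check = f"result{i}.contentEquals({expected})" if is_array else f"result{i} == {expected}"
            test_code += f"""
// Test {i + 1}
val result{i} = solution({to_kotlin(test['input'])})
assert({check}) {{ "Test {i + 1} failed" }}
"""
        return test_code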
    
    def _create_solution_template(self, problem: HumanEvalProblem) -> str:
        """Create a Kotlin solution template"""
        return f"""
{problem.kotlin_signature_good} {{
    // TODO: Implement the solution
    TODO("Not yet implemented")
}}
"""

# Example adaptation
adapter = HumanEvalAdapter()
adapted = adapter.adapt_problem(type_issue_example)

print("Adapted HumanEval Problem:")
print("=" * 50)
for key, value in adapted.items():
    print(f"\n{key.upper()}:")
    print(value)
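
Adapting to Kotlin idioms also covers naming: Python's `snake_case` function names become `camelCase` in Kotlin. A minimal sketch of that conversion follows; the helper `snake_to_camel` is hypothetical and ignores edge cases such as leading underscores.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case identifier to camelCase."""
    head, *rest = name.split("_")
    # Upper-case only the first letter of each later part, keep the rest as-is
    return head + "".join(part[:1].upper() + part[1:] for part in rest)

for python_name in ["sort_array", "truncate_number", "filter_even"]:
    print(f"{python_name} -> {snake_to_camel(python_name)}")
```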

## 5. Evaluation Setup from Section IV.B

The paper describes a specific evaluation setup with multiple metrics.

In [None]:
from enum import Enum

class ErrorType(Enum):
    """Error types from Section IV.B"""
    COMPILATION_ERROR = "compilation_error"
    RUNTIME_ERROR = "runtime_error"
    TEST_FAILURE = "test_failure"
    TIMEOUT_ERROR = "timeout_error"
    SUCCESS = "success"

class KotlinHumanEvalMetrics:
    """Implements the evaluation metrics from Section IV.B"""
    
    def __init__(self):
        self.results = []
    
    def evaluate_generation(self, generated_code: str, test_harness: str) -> Dict:
        """Evaluate a single code generation"""
        # Combine generated code with test harness
        full_code = f"""
{generated_code}

fun main() {{
    {test_harness}
    println("All tests passed!")
}}
"""
        
        # Simulate evaluation (a real implementation would compile and run the code)
        import random
        
        # Mock outcome distribution: these probabilities are illustrative,
        # not figures reported in the paper
        rand = random.random()
        if rand < 0.15:  # ~15% compilation errors
            error_type = ErrorType.COMPILATION_ERROR
        elif rand < 0.20:  # ~5% runtime errors
            error_type = ErrorType.RUNTIME_ERROR
        elif rand < 0.50:  # ~30% test failures
            error_type = ErrorType.TEST_FAILURE
        elif rand < 0.51:  # ~1% timeout
            error_type = ErrorType.TIMEOUT_ERROR
        else:  # ~49% success
            error_type = ErrorType.SUCCESS
        
        result = {
            'error_type': error_type,
            'passed': error_type == ErrorType.SUCCESS,
            'compilation_error': error_type == ErrorType.COMPILATION_ERROR,
            'runtime_error': error_type == ErrorType.RUNTIME_ERROR,
            'test_failure': error_type == ErrorType.TEST_FAILURE,
            'timeout': error_type == ErrorType.TIMEOUT_ERROR
        }
        
        self.results.append(result)
        return result
    
    def calculate_metrics(self) -> Dict[str, float]:
        """Calculate all metrics from Section IV.B"""
        if not self.results:
            return {}
        
        total = len(self.results)
        
        metrics = {
            'pass_at_1': sum(r['passed'] for r in self.results) / total * 100,
            'compilation_error_rate': sum(r['compilation_error'] for r in self.results) / total * 100,
            'runtime_error_rate': sum(r['runtime_error'] for r in self.results) / total * 100,
            'test_error_rate': sum(r['test_failure'] for r in self.results) / total * 100,
            'timeout_error_rate': sum(r['timeout'] for r in self.results) / total * 100
        }
        
        # Syntax error rate = compilation + runtime errors (as per paper)
        metrics['syntax_error_rate'] = (
            metrics['compilation_error_rate'] + 
            metrics['runtime_error_rate']
        )
        
        return metrics

# Simulate evaluation of 100 problems
evaluator = KotlinHumanEvalMetrics()
for _ in range(100):
    evaluator.evaluate_generation("fun solution() { /* generated */ }", "// tests")

metrics = evaluator.calculate_metrics()

print("Evaluation Metrics (Simulated):")
print("=" * 40)
for metric, value in metrics.items():
    print(f"{metric:<25}: {value:>6.2f}%")
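
In a real harness the outcome would come from the compile and run steps rather than from a random draw. The sketch below shows one way to map those signals onto the five categories. The `Outcome` enum mirrors `ErrorType` so the block runs on its own, and treating `AssertionError` in stderr as a test failure is an assumption about how the test harness reports failures.

```python
from enum import Enum
from typing import Optional

class Outcome(Enum):
    COMPILATION_ERROR = "compilation_error"
    RUNTIME_ERROR = "runtime_error"
    TEST_FAILURE = "test_failure"
    TIMEOUT_ERROR = "timeout_error"
    SUCCESS = "success"

def classify_outcome(compiled: bool, timed_out: bool,
                     exit_code: Optional[int], stderr: str = "") -> Outcome:
    """Hypothetical classifier for one compile-and-run attempt."""
    if not compiled:
        return Outcome.COMPILATION_ERROR
    if timed_out:
        return Outcome.TIMEOUT_ERROR
    if exit_code == 0:
        return Outcome.SUCCESS
    # Assumption: a failed Kotlin assert() surfaces as java.lang.AssertionError
    if "AssertionError" in stderr:
        return Outcome.TEST_FAILURE
    return Outcome.RUNTIME_ERROR

print(classify_outcome(True, False, 1, "java.lang.AssertionError: Test 1 failed"))
```

Note that Kotlin's `assert()` only throws on the JVM when assertions are enabled with `-ea`, so a harness built this way would need that flag.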

## 6. Generation Setup Details

Section IV.B provides specific details about the generation setup.

In [None]:
class KotlinGenerationConfig:
    """Configuration for Kotlin code generation based on Section IV.B"""
    
    def __init__(self):
        # From the paper
        self.prompt_template = "You are an expert Kotlin programmer..."
        self.generation_strategy = "greedy"  # Greedy generation
        self.min_tokens = 128
        self.max_tokens = 256
        self.early_stop_sequence = "\n}\n"  # End of Kotlin method
        self.remove_comments = True
        self.handle_prompt_repetition = True
    
    def preprocess_prompt(self, problem_description: str, signature: str) -> str:
        """Prepare the prompt for generation"""
        return f"""{self.prompt_template}

Problem: {problem_description}

Complete the following Kotlin function:
{signature} {{
    // Your implementation here
"""
    
    def postprocess_generation(self, generated: str, original_signature: str) -> str:
        """Post-process generated code as per paper's approach"""
        lines = generated.split('\n')
        
        # Remove comments if configured
        if self.remove_comments:
            lines = [line for line in lines if not line.strip().startswith('//')]
        
        # Handle prompt repetition
        if self.handle_prompt_repetition:
            # Find first function definition
            for i, line in enumerate(lines):
                if line.strip().startswith('fun '):
                    # Remove this line and everything before
                    lines = lines[i+1:]
                    break
        
        return '\n'.join(lines)

# Example usage
config = KotlinGenerationConfig()

# Example problem
problem = "Return the sum of two integers"
signature = "fun sum(a: Int, b: Int): Int"

prompt = config.preprocess_prompt(problem, signature)
print("Generation Prompt:")
print("=" * 50)
print(prompt)

# Simulate generation
mock_generation = """
fun sum(a: Int, b: Int): Int {
    // This function adds two numbers
    return a + b
}

// Additional generated content
fun test() {
    println(sum(1, 2))
}
"""

processed = config.postprocess_generation(mock_generation, signature)
print("\nProcessed Generation:")
print("=" * 50)
print(processed)
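
The configuration defines `early_stop_sequence`, but the post-processing step above never applies it, so the extra `test()` function survives. A minimal sketch of cutting a completion at the stop sequence is shown below; `truncate_at_stop` is a hypothetical helper, and keeping the closing brace is a design choice for this example.

```python
def truncate_at_stop(generated: str, stop_sequence: str = "\n}\n") -> str:
    """Cut the text at the first stop sequence, keeping the closing brace."""
    index = generated.find(stop_sequence)
    if index == -1:
        return generated  # No stop sequence found: leave the text unchanged
    # Keep "\n}" and drop the final newline of the stop sequence
    return generated[: index + len(stop_sequence) - 1]

completion = "    return a + b\n}\n\nfun test() {\n    println(sum(1, 2))\n}\n"
print(truncate_at_stop(completion))
```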

## 7. Comparing Python vs Kotlin Performance

From Figure 1 in the paper, we can analyze the performance gap between languages.

In [None]:
# Data from Figure 1 in the paper
model_performance = {
    'model': ['GPT-4-turbo', 'Deepseek-coder-33B-instruct', 'Deepseek-coder-6.7B-instruct',
              'GPT-3.5-turbo', 'CodeLlama-70B-Instruct-hf', 'CodeQwen1.5-7B',
              'Meta-Llama-3-8B-Instruct', 'Deepseek-coder-6.7B-base',
              'CodeLlama-13b-Instruct-hf', 'Deepseek-coder-1.3B-instruct'],
    'kotlin_score': [73, 63, 49, 62, 58, 44, 43, 41, 35, 29],
    'python_score': [81, 75, 61, 73, 67, 55, 52, 48, 43, 36]
}

df = pd.DataFrame(model_performance)
df['performance_gap'] = df['python_score'] - df['kotlin_score']
df['gap_percentage'] = (df['performance_gap'] / df['python_score']) * 100

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Bar chart comparing scores
x = np.arange(len(df))
width = 0.35

bars1 = ax1.bar(x - width/2, df['kotlin_score'], width, label='Kotlin', color='orange')
bars2 = ax1.bar(x + width/2, df['python_score'], width, label='Python', color='blue')

ax1.set_xlabel('Models')
ax1.set_ylabel('HumanEval Score')
ax1.set_title('Kotlin vs Python HumanEval Performance')
ax1.set_xticks(x)
ax1.set_xticklabels(df['model'], rotation=45, ha='right')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Performance gap analysis
ax2.barh(df['model'], df['gap_percentage'], color='red', alpha=0.7)
ax2.set_xlabel('Performance Gap (% of Python score)')
ax2.set_title('Python Advantage over Kotlin (%)')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary statistics
print("\nPerformance Gap Analysis:")
print("=" * 50)
print(f"Average Kotlin Score: {df['kotlin_score'].mean():.1f}")
print(f"Average Python Score: {df['python_score'].mean():.1f}")
print(f"Average Performance Gap: {df['performance_gap'].mean():.1f} points")
print(f"Average Gap Percentage: {df['gap_percentage'].mean():.1f}%")
print(f"\nConclusion: Models consistently perform better on Python than Kotlin")

## 8. Best Practices for Cross-Language Benchmark Adaptation

Based on the paper's findings, here are key principles for adapting benchmarks.

In [None]:
class BenchmarkAdaptationGuidelines:
    """Best practices learned from the Kotlin HumanEval adaptation"""
    
    def __init__(self):
        self.guidelines = {
            "Type Systems": {
                "principle": "Respect language-specific type systems",
                "example": "Use IntArray instead of Array<Any> in Kotlin",
                "impact": "Prevents compilation errors and enables proper method usage"
            },
            "Numerical Precision": {
                "principle": "Account for floating-point differences between languages",
                "example": "Use tolerance-based comparisons instead of exact equality",
                "impact": "Avoids false test failures due to precision differences"
            },
            "Idioms": {
                "principle": "Adapt to language-specific idioms and conventions",
                "example": "Use Kotlin's when expression instead of if-else chains",
                "impact": "Makes benchmarks more representative of real code"
            },
            "Expert Review": {
                "principle": "Have language experts review all adaptations",
                "example": "6-year Kotlin expert + 4-year reviewer in the paper",
                "impact": "Ensures high-quality, idiomatic benchmark problems"
            },
            "Test Equivalence": {
                "principle": "Ensure tests are functionally equivalent, not literally translated",
                "example": "Adapt test assertions to language-specific testing patterns",
                "impact": "Maintains benchmark validity across languages"
            }
        }
    
    def validate_adaptation(self, original_problem: dict, adapted_problem: dict) -> List[str]:
        """Validate that an adaptation follows best practices"""
        issues = []
        
        # Check for generic types
        if "Any" in adapted_problem.get("signature", ""):
            issues.append("Uses generic 'Any' type - consider more specific types")
        
        # Check for exact floating-point comparisons
        if "==" in adapted_problem.get("tests", "") and "Double" in adapted_problem.get("signature", ""):
            issues.append("Uses exact equality for floating-point - consider tolerance")
        
        # Check for language idioms
        if "ArrayList" in adapted_problem.get("signature", ""):
            issues.append("Consider Kotlin's MutableList<T> interface instead of ArrayList")
        
        return issues
    
    def display_guidelines(self):
        """Display adaptation guidelines in a structured format"""
        print("Benchmark Adaptation Guidelines:")
        print("=" * 70)
        
        for category, details in self.guidelines.items():
            print(f"\n{category}:")
            print(f"  Principle: {details['principle']}")
            print(f"  Example: {details['example']}")
            print(f"  Impact: {details['impact']}")

guidelines = BenchmarkAdaptationGuidelines()
guidelines.display_guidelines()

## 9. Implementing a Complete HumanEval Problem

Let's implement a complete example showing the full adaptation process.

In [None]:
class CompleteHumanEvalExample:
    """Complete example of HumanEval problem adaptation"""
    
    def __init__(self):
        self.problem = {
            "task_id": 100,
            "description": """Given a list of integers, return a list of only the even integers.
            If there are no even integers, return an empty list.""",
            "python": {
                "signature": "def filter_even(numbers: List[int]) -> List[int]:",
                "solution": """def filter_even(numbers: List[int]) -> List[int]:
    return [n for n in numbers if n % 2 == 0]""",
                "tests": [
                    "assert filter_even([1, 2, 3, 4, 5]) == [2, 4]",
                    "assert filter_even([1, 3, 5, 7]) == []",
                    "assert filter_even([2, 4, 6, 8]) == [2, 4, 6, 8]"
                ]
            }
        }
    
    def create_kotlin_adaptation(self) -> dict:
        """Create proper Kotlin adaptation following paper's guidelines"""
        kotlin_adaptation = {
            "signature": "fun filterEven(numbers: IntArray): IntArray",
            "solution": """fun filterEven(numbers: IntArray): IntArray {
    return numbers.filter { it % 2 == 0 }.toIntArray()
}""",
            "prompt": f"""You are an expert Kotlin programmer.

{self.problem['description']}

Complete the following Kotlin function:
fun filterEven(numbers: IntArray): IntArray {{
    // Your implementation here
""",
            "tests": """// Test cases
fun main() {
    // Test 1
    val result1 = filterEven(intArrayOf(1, 2, 3, 4, 5))
    assert(result1.contentEquals(intArrayOf(2, 4))) { "Test 1 failed" }
    
    // Test 2
    val result2 = filterEven(intArrayOf(1, 3, 5, 7))
    assert(result2.isEmpty()) { "Test 2 failed" }
    
    // Test 3
    val result3 = filterEven(intArrayOf(2, 4, 6, 8))
    assert(result3.contentEquals(intArrayOf(2, 4, 6, 8))) { "Test 3 failed" }
    
    println("All tests passed!")
}""",
            "key_adaptations": [
                "Used IntArray instead of List<Int> for better performance",
                "Used contentEquals() for array comparison (Kotlin idiom)",
                "Added proper error messages to assertions",
                "Followed Kotlin naming conventions (camelCase)"
            ]
        }
        return kotlin_adaptation
    
    def demonstrate_evaluation(self):
        """Show how the adapted problem would be evaluated"""
        kotlin = self.create_kotlin_adaptation()
        
        print("Complete HumanEval Adaptation Example:")
        print("=" * 60)
        print(f"\nProblem: {self.problem['description']}")
        print(f"\nPython signature: {self.problem['python']['signature']}")
        print(f"Kotlin signature: {kotlin['signature']}")
        
        print("\nKotlin Solution:")
        print(kotlin['solution'])
        
        print("\nKotlin Tests:")
        print(kotlin['tests'])
        
        print("\nKey Adaptations Made:")
        for i, adaptation in enumerate(kotlin['key_adaptations'], 1):
            print(f"{i}. {adaptation}")

example = CompleteHumanEvalExample()
example.demonstrate_evaluation()
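
Before porting tests to Kotlin, it is worth confirming that the expected values hold for the Python reference solution, so that any later failure points at the translation and not at the original test. The sketch below does this for the `filter_even` example; the helper `failing_cases` and the tuple layout of `CASES` are assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Case = Tuple[List[int], List[int]]  # (input, expected output)

def filter_even(numbers: List[int]) -> List[int]:
    """Python reference solution from the example above."""
    return [n for n in numbers if n % 2 == 0]

CASES: List[Case] = [
    ([1, 2, 3, 4, 5], [2, 4]),
    ([1, 3, 5, 7], []),
    ([2, 4, 6, 8], [2, 4, 6, 8]),
]

def failing_cases(solution: Callable[[List[int]], List[int]],
                  cases: List[Case]) -> List[Case]:
    """Return the cases whose expected value the solution does not reproduce."""
    return [(args, expected) for args, expected in cases if solution(args) != expected]

print(failing_cases(filter_even, CASES))  # [] means every expected value is confirmed
```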

## 10. Summary and Key Takeaways

### Main Lessons from the Paper

1. **Direct Translation Fails**: You cannot simply translate syntax between languages
2. **Type Systems Matter**: Generic types prevent proper method usage
3. **Precision Issues**: Floating-point handling differs between language runtimes
4. **Expert Review Essential**: Language experts catch subtle issues automated tools miss
5. **Metrics Beyond Pass Rate**: Syntax error rate provides additional insights

### Impact on Results

The proper adaptation enabled meaningful evaluation of Kotlin code generation, revealing:
- Significant performance gaps between Python and Kotlin
- The effectiveness of different fine-tuning approaches
- The importance of language-specific training data

In [None]:
# Final summary visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Error type distribution from Table II
error_types = ['Pass', 'Test Fail', 'Compilation Error', 'Runtime Error', 'Timeout']
baseline_rates = [26.09, 50.31, 19.25, 3.73, 0.62]  # CodeLlama-7B Base
improved_rates = [42.24, 37.89, 17.39, 1.86, 0.62]  # CodeLlama-7B KExercises

x = np.arange(len(error_types))
width = 0.35

ax1.bar(x - width/2, baseline_rates, width, label='Baseline', color='lightcoral')
ax1.bar(x + width/2, improved_rates, width, label='After Fine-tuning', color='lightgreen')
ax1.set_ylabel('Rate (%)')
ax1.set_title('Error Distribution: Before vs After Fine-tuning')
ax1.set_xticks(x)
ax1.set_xticklabels(error_types, rotation=45)
ax1.legend()
ax1.grid(True, alpha=0.3)

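# Key metrics improvement
metric_names = ['Pass Rate', 'Syntax Error Rate', 'Completion Rate']
improvements = [16.15, -3.73, -0.044]
# Assumption: higher is better for pass and completion rate, lower for error rate
higher_is_better = [True, False, True]
# Green marks a change in the desirable direction, red the opposite
colors = ['green' if (imp > 0) == better else 'red'
          for imp, better in zip(improvements, higher_is_better)]

ax2.barh(metric_names, improvements, color=colors, alpha=0.7)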
ax2.set_xlabel('Change (percentage points)')
ax2.set_title('Metric Improvements with KExercises Fine-tuning')
ax2.grid(True, alpha=0.3)
ax2.axvline(x=0, color='black', linestyle='-', linewidth=0.5)

plt.tight_layout()
plt.show()

print("\nConclusion:")
print("Proper benchmark adaptation is crucial for meaningful cross-language evaluation.")
print("The Kotlin HumanEval rewrite enabled accurate assessment of model capabilities.")
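
The pass rate and syntax error rate changes plotted above follow directly from the two rate lists in the same cell. The short sketch below reproduces that arithmetic, with the syntax error rate taken as compilation plus runtime errors as defined earlier in this notebook; the variable names are chosen for this example.

```python
labels = ['Pass', 'Test Fail', 'Compilation Error', 'Runtime Error', 'Timeout']
baseline = [26.09, 50.31, 19.25, 3.73, 0.62]   # CodeLlama-7B Base
finetuned = [42.24, 37.89, 17.39, 1.86, 0.62]  # CodeLlama-7B KExercises

# Change per category in percentage points, rounded to two decimals
delta = {label: round(after - before, 2)
         for label, before, after in zip(labels, baseline, finetuned)}

syntax_delta = round(delta['Compilation Error'] + delta['Runtime Error'], 2)

print(f"Pass rate change:         {delta['Pass']:+.2f} pp")
print(f"Syntax error rate change: {syntax_delta:+.2f} pp")
```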