# 💻 DeepEval - Evaluating Code Generation and Review

Welcome to **Notebook 3** in the DeepEval framework series!

## 🎯 Goals of This Notebook

1. **Custom Metrics with G-Eval**: Define custom evaluation criteria for code
2. **Code Generation Metrics**: Correctness, Readability, Efficiency
3. **Code Review Metrics**: Completeness, Security, Best Practices
4. **Automated Code Review**: End-to-end evaluation pipeline
5. **Security Assessment**: Detect vulnerabilities in code

## 📖 Why Does Code Evaluation Matter?

Code generation has become one of the most important applications of LLMs, but automatically evaluating code quality is extremely challenging:

### 🔍 Challenges of Code Evaluation:
- **Functional Correctness**: Does the code run correctly?
- **Code Quality**: Is it readable, maintainable, and efficient?
- **Security**: Does it contain vulnerabilities?
- **Best Practices**: Does it follow coding standards?
- **Context Awareness**: Does the code fit the requirements?
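
Functional correctness is the one challenge in this list that can often be checked without an LLM. The sketch below is a minimal illustration that runs a code string against a few input/output pairs. The helper name `passes_test_cases` and the sample snippets are hypothetical and not part of DeepEval, and `exec` should only be used on untrusted code inside a sandbox.

```python
from typing import Any, Callable, Dict, List, Tuple


def passes_test_cases(
    code: str, func_name: str, cases: List[Tuple[tuple, Any]]
) -> bool:
    """Return True if `func_name` defined by `code` maps every input to its expected output."""
    namespace: Dict[str, Any] = {}
    try:
        # WARNING: exec runs arbitrary code; only use this inside a sandbox.
        exec(code, namespace)
        func: Callable = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
cases = [((1, 2), 3), ((0, 0), 0)]

print(passes_test_cases(good, "add", cases))
print(passes_test_cases(bad, "add", cases))
```

A check like this complements LLM-based scoring: the tests settle whether the code works, while the judge covers qualities that tests cannot capture.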

### ✅ How DeepEval Addresses This:
- **G-Eval Framework**: Custom metrics with LLM-based evaluation
- **Multi-dimensional Assessment**: Evaluate multiple aspects at once
- **Automated Code Review**: Scale code review process
- **Security-focused Metrics**: Specialized security evaluation

## 🛠️ Part 1: Setup and Imports

In [None]:
# Core imports
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional, Tuple
import warnings
warnings.filterwarnings('ignore')

# DeepEval imports
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import GEval
from deepeval.metrics.utils import trimAndLoadJson

# Code analysis libraries
import ast
import re
import subprocess
import tempfile
from pathlib import Path

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')

print(f"✅ DeepEval version: {deepeval.__version__}")
print("✅ All imports successful!")

In [None]:
# Setup environment
from dotenv import load_dotenv
load_dotenv()

# Check API keys
api_keys_status = {
    "OpenAI": "✅ Configured" if os.getenv("OPENAI_API_KEY") else "❌ Missing",
    "Anthropic": "✅ Configured" if os.getenv("ANTHROPIC_API_KEY") else "❌ Missing"
}

print("🔑 API Keys Status:")
for provider, status in api_keys_status.items():
    print(f"  {provider}: {status}")

if not os.getenv("OPENAI_API_KEY"):
    print("\n⚠️  OPENAI_API_KEY is required to run the code evaluation!")
    print("   Create a .env file containing: OPENAI_API_KEY=your_key_here")

## 📁 Part 2: Load Code Samples Data

In [None]:
def load_code_samples():
    """
    Load code samples from the data folder
    """
    
    code_samples_path = "data/code_samples.json"
    
    try:
        with open(code_samples_path, 'r', encoding='utf-8') as f:
            code_data = json.load(f)
        
        print(f"📄 Loaded code samples data")
        print(f"  Code Generation Problems: {len(code_data.get('code_generation_problems', []))}")
        print(f"  Buggy Code Samples: {len(code_data.get('buggy_code_samples', []))}")
        print(f"  Code Review Scenarios: {len(code_data.get('code_review_scenarios', []))}")
        
        return code_data
        
    except FileNotFoundError:
        print(f"❌ File not found: {code_samples_path}")
        print("💡 Make sure the notebook is running from the correct directory")
        return {}
    except Exception as e:
        print(f"❌ Error loading code samples: {e}")
        return {}

# Load code samples
code_data = load_code_samples()

# Preview data structure
if code_data:
    print("\n🔍 Preview Data Structure:")
    for category, items in code_data.items():
        if items and len(items) > 0:
            print(f"\n{category}:")
            first_item = items[0]
            for key in first_item.keys():
                print(f"  - {key}")
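
Later cells read specific keys from each category, so a quick structural check can surface a malformed data file before any API calls are made. This is a minimal sketch: the key sets are inferred from how the rest of this notebook accesses the data, and `find_schema_problems` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Keys that later cells of this notebook read from each category (inferred, not an official schema).
REQUIRED_KEYS = {
    "code_generation_problems": {"id", "problem", "correct_solution"},
    "buggy_code_samples": {"description", "buggy_code"},
    "code_review_scenarios": {"title", "code_to_review", "review_points"},
}


def find_schema_problems(data: Dict[str, Any]) -> List[str]:
    """List every item that lacks a key the notebook expects."""
    problems = []
    for category, required in REQUIRED_KEYS.items():
        for index, item in enumerate(data.get(category, [])):
            missing = required - set(item)
            if missing:
                problems.append(f"{category}[{index}] is missing: {sorted(missing)}")
    return problems


# Illustrative data only; the real file is data/code_samples.json.
sample = {"code_generation_problems": [{"id": "bubble_sort", "problem": "Sort a list"}]}
print(find_schema_problems(sample))
```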

## 🎯 Part 3: Custom Metrics with G-Eval

### 3.1 Understanding the G-Eval Framework

G-Eval is a framework for building custom evaluation metrics that use an LLM as the judge. Instead of relying on predefined metrics, G-Eval lets you define the evaluation criteria in natural language.
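
To make the idea concrete, here is a simplified sketch of what an LLM-as-judge metric does conceptually: it assembles criteria and steps into a prompt, then maps the judge's 0-10 rating onto a 0-1 score. The prompt wording and both function names are illustrative assumptions, not DeepEval's internal template or API.

```python
from typing import List


def build_judge_prompt(criteria: str, steps: List[str], code: str) -> str:
    """Assemble a simplified judge prompt (illustrative, not DeepEval's template)."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        f"Criteria:\n{criteria}\n\n"
        f"Evaluation steps:\n{numbered}\n\n"
        f"Code to evaluate:\n{code}\n\n"
        "Return an integer score from 0 to 10 and a short reason."
    )


def normalize_score(raw_score: int) -> float:
    """Clamp a judge rating to 0-10 and map it onto the 0-1 range."""
    return max(0, min(10, raw_score)) / 10


prompt = build_judge_prompt(
    "Is the code correct?",
    ["Read the problem", "Trace the code"],
    "def f(x): return x",
)
print(prompt)
print(normalize_score(8))
```

Because the final score lives on a 0-1 scale, pass/fail thresholds should be expressed on that scale too (for example 0.7 rather than 7).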

In [None]:
def create_code_correctness_metric():
    """
    Create a G-Eval metric that evaluates code correctness
    """
    
    # Define evaluation criteria
    evaluation_criteria = """
    You will evaluate the correctness of generated Python code.
    
    Evaluation criteria:
    1. LOGIC CORRECTNESS (40%):
       - The code correctly implements the requested algorithm/logic
       - Edge cases are handled appropriately
       - There are no logical errors
    
    2. SYNTAX & RUNTIME (30%):
       - The code is syntactically correct (no syntax errors)
       - No runtime errors occur with valid inputs
       - Variables are properly declared and scoped
    
    3. FUNCTIONAL REQUIREMENTS (30%):
       - Output matches expected results
       - The function signature matches the requirements
       - Input validation is handled appropriately
    
    Scoring:
    - 9-10: Fully correct, handles all cases
    - 7-8: Mostly correct, minor issues
    - 5-6: Some logic errors or unhandled edge cases
    - 3-4: Major logic errors or many bugs
    - 1-2: Fundamentally incorrect or does not run
    """
    
    # Define evaluation steps
    evaluation_steps = [
        "Analyze the problem requirements and expected functionality",
        "Check syntax correctness and potential runtime errors",
        "Trace through the algorithm logic with sample inputs",
        "Identify potential edge cases and how the code handles them",
        "Assess overall correctness and assign a score"
    ]
    
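    # Create G-Eval metric
    # evaluation_params expects LLMTestCaseParams enum members, not attributes
    # of LLMTestCase. Imported here so this cell runs on its own.
    from deepeval.test_case import LLMTestCaseParams

    correctness_metric = GEval(
        name="Code Correctness",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Problem description
            LLMTestCaseParams.ACTUAL_OUTPUT,  # Generated code
            LLMTestCaseParams.EXPECTED_OUTPUT  # Reference solution (test cases must supply it)
        ],
        threshold=0.7,  # G-Eval scores are normalized to 0-1, so 0.7 corresponds to 7/10
        model="gpt-4"  # Use GPT-4 for better code understanding
    )  # include_reason is not a documented GEval parameter; a reason is always returned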
    
    return correctness_metric

# Create correctness metric
code_correctness_metric = create_code_correctness_metric()
print("✅ Code Correctness Metric created")
print(f"Name: {code_correctness_metric.name}")
print(f"Threshold: {code_correctness_metric.threshold}")
print(f"Model: {code_correctness_metric.model}")
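
The "SYNTAX & RUNTIME" part of the criteria can be partly verified deterministically before spending an LLM call. The sketch below uses the standard library's `ast.parse` to catch syntax errors; `syntax_error_message` is a hypothetical helper name, and this check says nothing about runtime or logic errors.

```python
import ast
from typing import Optional


def syntax_error_message(code: str) -> Optional[str]:
    """Return None if `code` parses as Python, else a short error description."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"


print(syntax_error_message("def f():\n    return 1\n"))
print(syntax_error_message("def f(:\n    pass\n"))
```

One possible design is to short-circuit: if the code does not parse, record a failing correctness result directly instead of asking the judge.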

### 3.2 Code Readability Metric

In [None]:
def create_code_readability_metric():
    """
    Create a G-Eval metric that evaluates code readability
    """
    
    evaluation_criteria = """
    You will evaluate the readability of Python code.
    
    Evaluation criteria:
    1. NAMING & CLARITY (35%):
       - Variable names are descriptive and meaningful
       - Function names are clear and verb-based
       - Abbreviations and cryptic names are avoided
    
    2. CODE STRUCTURE (25%):
       - Proper indentation and formatting
       - Logical organization of code blocks
       - Appropriate use of whitespace
    
    3. COMMENTS & DOCUMENTATION (25%):
       - Docstrings for functions
       - Inline comments for complex logic
       - Clear explanation of algorithm steps
    
    4. SIMPLICITY & ELEGANCE (15%):
       - Code is concise but not cryptic
       - Unnecessary complexity is avoided
       - Appropriate Python idioms are used
    
    Scoring:
    - 9-10: Exceptionally readable, professional quality
    - 7-8: Well-written, easy to understand
    - 5-6: Acceptable, but readability could be improved
    - 3-4: Hard to read, poor naming/structure
    - 1-2: Very difficult to understand
    """
    
    evaluation_steps = [
        "Check the naming conventions of variables and functions",
        "Assess code structure, indentation, and formatting",
        "Review comments, docstrings, and documentation quality",
        "Evaluate overall clarity and ease of understanding",
        "Assign readability score based on criteria"
    ]
    
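    # evaluation_params expects LLMTestCaseParams enum members.
    # Imported here so this cell runs on its own.
    from deepeval.test_case import LLMTestCaseParams

    readability_metric = GEval(
        name="Code Readability",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT
        ],
        threshold=0.6,  # G-Eval scores are normalized to 0-1 (0.6 corresponds to 6/10)
        model="gpt-4"
    )  # include_reason is not a documented GEval parameter; a reason is always returned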
    
    return readability_metric

# Create readability metric
code_readability_metric = create_code_readability_metric()
print("✅ Code Readability Metric created")
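
Some of the readability criteria, such as docstring presence and single-letter names, can be approximated with static analysis. The following sketch walks the syntax tree with the standard `ast` module. `readability_signals` is a hypothetical helper, and these two signals are only a rough proxy for what the LLM judge assesses.

```python
import ast
from typing import Dict, List


def readability_signals(code: str) -> Dict[str, List[str]]:
    """Collect functions without docstrings and single-letter parameter names."""
    missing_docstrings: List[str] = []
    short_names: List[str] = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.FunctionDef):
            if ast.get_docstring(node) is None:
                missing_docstrings.append(node.name)
            for arg in node.args.args:
                if len(arg.arg) == 1:
                    short_names.append(arg.arg)
    return {"missing_docstrings": missing_docstrings, "short_names": short_names}


terse = "def search(a, x):\n    return -1\n"
print(readability_signals(terse))
```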

### 3.3 Code Efficiency Metric

In [None]:
def create_code_efficiency_metric():
    """
    Create a G-Eval metric that evaluates code efficiency
    """
    
    evaluation_criteria = """
    You will evaluate the efficiency of Python code in terms of time and space complexity.
    
    Evaluation criteria:
    1. TIME COMPLEXITY (40%):
       - Does the algorithm have optimal time complexity?
       - Avoid unnecessary nested loops
       - Use efficient data structures (dict vs list lookup)
    
    2. SPACE COMPLEXITY (30%):
       - Minimize memory usage
       - Avoid unnecessary data copies
       - Consider in-place operations when possible
    
    3. ALGORITHMIC APPROACH (20%):
       - Choose an appropriate algorithm for the problem
       - Consider trade-offs between time and space
       - Use built-in optimized functions
    
    4. IMPLEMENTATION EFFICIENCY (10%):
       - Avoid redundant calculations
       - Minimize function call overhead
       - Early termination conditions
    
    Scoring Guide:
    - 9-10: Optimal complexity, highly efficient
    - 7-8: Good efficiency, minor optimizations possible
    - 5-6: Acceptable but not optimal
    - 3-4: Inefficient approach, needs optimization
    - 1-2: Very poor efficiency, major issues
    """
    
    evaluation_steps = [
        "Analyze the algorithm's approach and complexity",
        "Calculate time complexity (Big O notation)",
        "Assess space complexity and memory usage",
        "Identify potential optimizations",
        "Compare with alternative approaches, if any",
        "Assign efficiency score"
    ]
    
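    # evaluation_params expects LLMTestCaseParams enum members.
    # Imported here so this cell runs on its own.
    from deepeval.test_case import LLMTestCaseParams

    efficiency_metric = GEval(
        name="Code Efficiency",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,
            LLMTestCaseParams.ACTUAL_OUTPUT
        ],
        threshold=0.6,  # G-Eval scores are normalized to 0-1 (0.6 corresponds to 6/10)
        model="gpt-4"
    )  # include_reason is not a documented GEval parameter; a reason is always returned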
    
    return efficiency_metric

# Create efficiency metric
code_efficiency_metric = create_code_efficiency_metric()
print("✅ Code Efficiency Metric created")
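
The criteria above ask the judge to watch for unnecessary nested loops. A cheap static signal for that is the maximum loop nesting depth, sketched here with the standard `ast` module. `max_loop_depth` is a hypothetical helper; nesting depth is only a heuristic and not a complexity proof (it ignores comprehensions, recursion, and early exits).

```python
import ast


def max_loop_depth(code: str) -> int:
    """Return the deepest nesting level of for/while loops in `code`."""

    def depth(node: ast.AST, current: int) -> int:
        if isinstance(node, (ast.For, ast.While)):
            current += 1
        deepest = current
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, depth(child, current))
        return deepest

    return depth(ast.parse(code), 0)


single = "for i in range(3):\n    pass\n"
nested = "for i in a:\n    for j in a:\n        pass\n"
print(max_loop_depth(single), max_loop_depth(nested))
```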

## 🧪 Part 4: Test Code Generation Metrics

### 4.1 Evaluate Bubble Sort Implementation

In [None]:
def test_bubble_sort_evaluation():
    """
    Test the code generation metrics with a bubble sort example
    """
    
    if not code_data or 'code_generation_problems' not in code_data:
        print("❌ No code generation data available for testing")
        return {}, None  # Keep the return shape consistent: the caller unpacks two values
    
    # Get the bubble sort problem
    bubble_sort_problem = None
    for problem in code_data['code_generation_problems']:
        if problem['id'] == 'bubble_sort':
            bubble_sort_problem = problem
            break
    
    if not bubble_sort_problem:
        print("❌ Bubble sort problem not found")
        return {}, None
    
    print("🧪 Testing Code Generation Metrics with Bubble Sort")
    print(f"Problem: {bubble_sort_problem['problem'][:100]}...")
    
    # Create the test case
    test_case = LLMTestCase(
        input=bubble_sort_problem['problem'],
        actual_output=bubble_sort_problem['correct_solution'],
        expected_output=bubble_sort_problem['correct_solution']  # Using same solution as reference
    )
    
    # Test metrics
    metrics = {
        "Correctness": code_correctness_metric,
        "Readability": code_readability_metric,
        "Efficiency": code_efficiency_metric
    }
    
    results = {}
    
    for metric_name, metric in metrics.items():
        print(f"\n🔍 Testing {metric_name}...")
        
        try:
            # Create a fresh metric instance to avoid state conflicts
            fresh_metric = type(metric)(
                name=metric.name,
                criteria=metric.criteria,
                evaluation_steps=metric.evaluation_steps,
                evaluation_params=metric.evaluation_params,
                threshold=metric.threshold,
                model=metric.model
            )  # include_reason is not a documented GEval parameter; a reason is always returned
            
            fresh_metric.measure(test_case)
            
            results[metric_name] = {
                "score": fresh_metric.score,
                "passed": fresh_metric.is_successful(),
                "reason": fresh_metric.reason
            }
            
            status = "✅" if fresh_metric.is_successful() else "❌"
            # G-Eval scores are on a 0-1 scale; multiply by 10 for the /10 display
            print(f"  {status} Score: {fresh_metric.score * 10:.1f}/10")
            print(f"  Reason: {fresh_metric.reason[:150]}...")
            
        except Exception as e:
            print(f"  ❌ Error: {e}")
            results[metric_name] = {
                "score": 0,
                "passed": False,
                "error": str(e)
            }
    
    return results, test_case

# Run bubble sort evaluation
bubble_sort_results, bubble_sort_test_case = test_bubble_sort_evaluation()

### 4.2 Testing with Generated Code (Mock LLM Response)

In [None]:
def test_generated_code_samples():
    """
    Test the metrics against generated code of varying quality
    """
    
    # Mock generated code samples at different quality levels
    generated_samples = [
        {
            "name": "High Quality Implementation",
            "problem": "Implement binary search algorithm",
            "code": '''def binary_search(arr, target):
    """
    Perform binary search on a sorted array.
    
    Args:
        arr: Sorted list of integers
        target: Integer to search for
    
    Returns:
        Index of target if found, -1 otherwise
    """
    left, right = 0, len(arr) - 1
    
    while left <= right:
        mid = (left + right) // 2
        
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    
    return -1'''
        },
        {
            "name": "Poor Quality Implementation", 
            "problem": "Implement binary search algorithm",
            "code": '''def search(a, x):
    # search for x
    i=0
    j=len(a)-1
    while i<=j:
        m=(i+j)/2  # Bug: should use // for integer division
        if a[m]==x:
            return m
        if a[m]<x:
            i=m+1
        else:
            j=m-1
    return -1'''
        },
        {
            "name": "Inefficient Implementation",
            "problem": "Find maximum element in array",
            "code": '''def find_max(arr):
    # O(n^2) approach - very inefficient
    max_val = arr[0]
    for i in range(len(arr)):
        is_max = True
        for j in range(len(arr)):
            if arr[j] > arr[i]:
                is_max = False
                break
        if is_max:
            max_val = arr[i]
    return max_val'''
        }
    ]
    
    print("🧪 Testing Generated Code Samples with Different Quality Levels\n")
    
    all_results = []
    
    for sample in generated_samples:
        print(f"📝 Testing: {sample['name']}")
        print(f"Problem: {sample['problem']}")
        
        # Create test case
        test_case = LLMTestCase(
            input=sample['problem'],
            actual_output=sample['code']
        )
        
        # Test with all metrics
        sample_results = {
            "name": sample['name'],
            "problem": sample['problem'],
            "metrics": {}
        }
        
        metrics = {
            "Correctness": code_correctness_metric,
            "Readability": code_readability_metric,
            "Efficiency": code_efficiency_metric
        }
        
        for metric_name, metric in metrics.items():
            try:
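                # Create a fresh instance. Keep only the params this test case
                # provides: these samples have no expected_output, and recent
                # DeepEval versions raise an error when a listed param is missing.
                fresh_metric = type(metric)(
                    name=metric.name,
                    criteria=metric.criteria,
                    evaluation_steps=metric.evaluation_steps,
                    evaluation_params=[
                        p for p in metric.evaluation_params
                        if getattr(test_case, p.value, None) is not None
                    ],
                    threshold=metric.threshold,
                    model=metric.model
                )  # include_reason is not a documented GEval parameter
                
                fresh_metric.measure(test_case)
                
                sample_results["metrics"][metric_name] = {
                    "score": fresh_metric.score,
                    "passed": fresh_metric.is_successful()
                }
                
                status = "✅" if fresh_metric.is_successful() else "❌"
                # G-Eval scores are on a 0-1 scale; multiply by 10 for the /10 display
                print(f"  {metric_name}: {status} {fresh_metric.score * 10:.1f}/10")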
                
            except Exception as e:
                print(f"  {metric_name}: ❌ Error - {e}")
                sample_results["metrics"][metric_name] = {
                    "score": 0,
                    "passed": False,
                    "error": str(e)
                }
        
        all_results.append(sample_results)
        print()
    
    return all_results

# Test generated code samples
generated_code_results = test_generated_code_samples()
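
The "Poor Quality Implementation" sample contains a bug that can be confirmed by simply running it: in Python 3, `/` returns a float, and a float cannot index a list. The snippet below reproduces that sample (reformatted, same logic) and records what happens. The variable name `outcome` is only for illustration.

```python
def search(a, x):
    i = 0
    j = len(a) - 1
    while i <= j:
        m = (i + j) / 2  # float result: this is the bug from the poor-quality sample
        if a[m] == x:
            return m
        if a[m] < x:
            i = m + 1
        else:
            j = m - 1
    return -1


try:
    search([1, 3, 5], 3)
    outcome = "no error"
except TypeError as exc:
    outcome = f"TypeError: {exc}"

print(outcome)
```

A correctness judge should therefore score this sample low. If it does not, that is a sign the criteria or the evaluation model need tuning.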

## 🛡️ Part 5: Security Review Metrics

### 5.1 Security Vulnerability Detection

In [None]:
def create_security_review_metric():
    """
    Create a G-Eval metric that detects security vulnerabilities
    """
    
    evaluation_criteria = """
    You will evaluate Python code to identify security vulnerabilities and issues.
    
    Security issues to check:
    1. INJECTION VULNERABILITIES (30%):
       - SQL injection (raw SQL queries)
       - Command injection (subprocess, os.system)
       - Code injection (eval, exec with user input)
    
    2. INPUT VALIDATION (25%):
       - Lack of input sanitization
       - Missing bounds checking
       - Improper data type validation
    
    3. AUTHENTICATION & AUTHORIZATION (20%):
       - Hard-coded credentials
       - Weak authentication mechanisms
       - Missing access controls
    
    4. DATA EXPOSURE (15%):
       - Sensitive data in logs
       - Unencrypted sensitive data
       - Information leakage in error messages
    
    5. OTHER SECURITY ISSUES (10%):
       - Insecure random number generation
       - Improper error handling
       - Race conditions
    
    Scoring (Security Score - higher is better):
    - 9-10: No security issues detected
    - 7-8: Minor security concerns
    - 5-6: Some security issues present
    - 3-4: Multiple security vulnerabilities
    - 1-2: Critical security flaws
    """
    
    evaluation_steps = [
        "Scan the code for injection vulnerabilities (SQL, command, code)",
        "Check input validation and sanitization",
        "Look for hard-coded credentials or secrets",
        "Identify potential data exposure issues",
        "Check error handling and information leakage",
        "Assess the overall security posture and assign a score"
    ]
    
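    # evaluation_params expects LLMTestCaseParams enum members.
    # Imported here so this cell runs on its own.
    from deepeval.test_case import LLMTestCaseParams

    security_metric = GEval(
        name="Security Review",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT  # Code to review
        ],
        threshold=0.7,  # High threshold for security (scores are normalized to 0-1)
        model="gpt-4"
    )  # include_reason is not a documented GEval parameter; a reason is always returned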
    
    return security_metric

# Create security metric
security_review_metric = create_security_review_metric()
print("✅ Security Review Metric created")
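
The injection items in the criteria (`eval`, `exec`, `os.system`) can also be flagged by a deterministic pre-scan, which gives a baseline to compare the LLM judge against. This is a minimal sketch using the standard `ast` module (`ast.unparse` requires Python 3.9+). The `RISKY_CALLS` set and `find_risky_calls` are illustrative assumptions and far from a complete security scanner.

```python
import ast
from typing import List

# Illustrative subset of call names taken from the criteria above.
RISKY_CALLS = {"eval", "exec", "os.system"}


def find_risky_calls(code: str) -> List[str]:
    """Report calls whose dotted name appears in RISKY_CALLS."""
    findings = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            name = ast.unparse(node.func)  # e.g. "os.system"; Python 3.9+
            if name in RISKY_CALLS:
                findings.append(f"line {node.lineno}: {name}")
    return sorted(findings)


snippet = "import os\n\ndef run(cmd):\n    os.system(cmd)\n    return eval(cmd)\n"
print(find_risky_calls(snippet))
```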

### 5.2 Testing the Security Metric with Vulnerable Code

In [None]:
def test_security_vulnerabilities():
    """
    Test the security metric on code that contains vulnerabilities
    """
    
    if not code_data or 'buggy_code_samples' not in code_data:
        print("❌ No buggy code data available for security testing")
        return []
    
    print("🛡️ Testing Security Review Metrics\n")
    
    security_results = []
    
    # Test the buggy code samples that have security issues
    for sample in code_data['buggy_code_samples']:
        if sample.get('security_issues'):
            print(f"🔍 Testing: {sample['description']}")
            print(f"Expected Issues: {', '.join(sample['security_issues'])}")
            
            # Create test case
            test_case = LLMTestCase(
                input=f"Review this code for security vulnerabilities: {sample['description']}",
                actual_output=sample['buggy_code']
            )
            
            try:
                # Create a fresh metric instance
                fresh_security_metric = GEval(
                    name=security_review_metric.name,
                    criteria=security_review_metric.criteria,
                    evaluation_steps=security_review_metric.evaluation_steps,
                    evaluation_params=security_review_metric.evaluation_params,
                    threshold=security_review_metric.threshold,
                    model=security_review_metric.model
                )  # include_reason is not a documented GEval parameter
                
                fresh_security_metric.measure(test_case)
                
                result = {
                    "description": sample['description'],
                    "expected_issues": sample['security_issues'],
                    "score": fresh_security_metric.score,
                    "passed": fresh_security_metric.is_successful(),
                    "reason": fresh_security_metric.reason,
                    # A score below the metric's threshold means the judge flagged issues
                    "detected_correctly": fresh_security_metric.score < fresh_security_metric.threshold
                }
                
                security_results.append(result)
                
                status = "✅ Detected" if not fresh_security_metric.is_successful() else "❌ Missed"
                # G-Eval scores are on a 0-1 scale; multiply by 10 for the /10 display
                print(f"  {status} Security Score: {fresh_security_metric.score * 10:.1f}/10")
                print(f"  Analysis: {fresh_security_metric.reason[:200]}...")
                
            except Exception as e:
                print(f"  ❌ Error: {e}")
                security_results.append({
                    "description": sample['description'],
                    "error": str(e)
                })
            
            print()
    
    # Summary
    if security_results:
        detected_count = sum(1 for r in security_results if r.get('detected_correctly', False))
        total_count = len([r for r in security_results if 'detected_correctly' in r])
        
        print(f"📊 Security Detection Summary:")
        print(f"  Vulnerabilities Detected: {detected_count}/{total_count}")
        print(f"  Detection Rate: {detected_count/total_count*100:.1f}%" if total_count > 0 else "  Detection Rate: N/A")
    
    return security_results

# Test security vulnerabilities
security_test_results = test_security_vulnerabilities()
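
A low score alone does not show that the judge found the *expected* issue. One rough complementary check is to see whether each expected issue is mentioned in the judge's reason. The sketch below uses a case-insensitive substring match. `issues_mentioned` and the sample `reason` text are hypothetical, and substring matching will miss paraphrases.

```python
from typing import Dict, List


def issues_mentioned(expected_issues: List[str], reason: str) -> Dict[str, bool]:
    """Case-insensitive substring check of each expected issue in the judge's reason."""
    lowered = reason.lower()
    return {issue: issue.lower() in lowered for issue in expected_issues}


# Made-up reason text for illustration only.
reason = "The query is built with string formatting, which allows SQL injection."
print(issues_mentioned(["SQL injection", "hard-coded credentials"], reason))
```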

## 📝 Part 6: Code Review Automation

### 6.1 Comprehensive Code Review Metric

In [None]:
def create_code_review_completeness_metric():
    """
    Create a metric that evaluates the completeness of a code review
    """
    
    evaluation_criteria = """
    You will evaluate the completeness of a code review.
    
    Aspects the review should cover:
    1. FUNCTIONALITY REVIEW (25%):
       - Logic correctness is checked
       - Edge cases are identified
       - Error handling is evaluated
    
    2. CODE QUALITY REVIEW (25%):
       - Readability and naming conventions
       - Code structure and organization
       - Documentation quality
    
    3. PERFORMANCE REVIEW (20%):
       - Algorithm efficiency
       - Time/space complexity analysis
       - Optimization opportunities
    
    4. SECURITY REVIEW (20%):
       - Security vulnerabilities check
       - Input validation review
       - Authentication/authorization issues
    
    5. BEST PRACTICES REVIEW (10%):
       - Coding standards compliance
       - Design patterns usage
       - Maintainability considerations
    
    Scoring:
    - 9-10: Comprehensive review covering all aspects
    - 7-8: Good review but missing some areas
    - 5-6: Basic review, several gaps
    - 3-4: Incomplete review, major aspects missed
    - 1-2: Very superficial or inadequate review
    """
    
    evaluation_steps = [
        "Check whether functionality and logic were thoroughly reviewed",
        "Verify that code quality aspects were addressed",
        "Assess whether performance considerations were discussed",
        "Confirm that security implications were reviewed",
        "Evaluate coverage of best practices",
        "Assess the overall completeness of the review"
    ]
    
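    # evaluation_params expects LLMTestCaseParams enum members.
    # Imported here so this cell runs on its own.
    from deepeval.test_case import LLMTestCaseParams

    review_completeness_metric = GEval(
        name="Code Review Completeness",
        criteria=evaluation_criteria,
        evaluation_steps=evaluation_steps,
        evaluation_params=[
            LLMTestCaseParams.INPUT,  # Original code
            LLMTestCaseParams.ACTUAL_OUTPUT  # Review comments
        ],
        threshold=0.7,  # G-Eval scores are normalized to 0-1 (0.7 corresponds to 7/10)
        model="gpt-4"
    )  # include_reason is not a documented GEval parameter; a reason is always returned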
    
    return review_completeness_metric

# Create code review metric
code_review_completeness_metric = create_code_review_completeness_metric()
print("✅ Code Review Completeness Metric created")

### 6.2 Test Code Review Scenarios

In [None]:
def test_code_review_scenarios():
    """
    Test the code review metric with realistic review scenarios
    """
    
    if not code_data or 'code_review_scenarios' not in code_data:
        print("❌ No code review scenarios available for testing")
        return []
    
    print("📝 Testing Code Review Scenarios\n")
    
    review_results = []
    
    for scenario in code_data['code_review_scenarios']:
        print(f"🔍 Scenario: {scenario['title']}")
        
        # Generate mock review comments based on review points
        review_comments = []
        for point in scenario['review_points']:
            severity = point['severity'].upper()
            comment = f"[{severity}] {point['description']} - {point['suggestion']}"
            review_comments.append(comment)
        
        mock_review = "\n".join(review_comments)
        
        print(f"Code length: {len(scenario['code_to_review'])} characters")
        print(f"Review points: {len(scenario['review_points'])}")
        
        # Create test case
        test_case = LLMTestCase(
            input=scenario['code_to_review'],
            actual_output=mock_review
        )
        
        try:
            # Test review completeness
            fresh_review_metric = GEval(
                name=code_review_completeness_metric.name,
                criteria=code_review_completeness_metric.criteria,
                evaluation_steps=code_review_completeness_metric.evaluation_steps,
                evaluation_params=code_review_completeness_metric.evaluation_params,
                threshold=code_review_completeness_metric.threshold,
                model=code_review_completeness_metric.model
            )  # include_reason is not a documented GEval parameter
            
            fresh_review_metric.measure(test_case)
            
            result = {
                "scenario": scenario['title'],
                "review_points_count": len(scenario['review_points']),
                "completeness_score": fresh_review_metric.score,
                "passed": fresh_review_metric.is_successful(),
                "reason": fresh_review_metric.reason
            }
            
            review_results.append(result)
            
            status = "✅" if fresh_review_metric.is_successful() else "❌"
            # G-Eval scores are on a 0-1 scale; multiply by 10 for the /10 display
            print(f"  {status} Completeness Score: {fresh_review_metric.score * 10:.1f}/10")
            print(f"  Assessment: {fresh_review_metric.reason[:150]}...")
            
        except Exception as e:
            print(f"  ❌ Error: {e}")
            review_results.append({
                "scenario": scenario['title'],
                "error": str(e)
            })
        
        print()
    
    # Analyze results
    if review_results:
        valid_results = [r for r in review_results if 'completeness_score' in r]
        
        if valid_results:
            avg_score = np.mean([r['completeness_score'] for r in valid_results])
            pass_rate = np.mean([r['passed'] for r in valid_results]) * 100
            
            print(f"📊 Code Review Analysis Summary:")
            print(f"  Scenarios Tested: {len(valid_results)}")
            print(f"  Average Completeness Score: {avg_score * 10:.1f}/10")
            print(f"  Pass Rate: {pass_rate:.1f}%")
    
    return review_results

# Test code review scenarios
review_scenario_results = test_code_review_scenarios()
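
The completeness criteria name five review areas. A quick way to sanity-check the judge's score is to count which areas a review touches at all, using keyword matching. The keyword lists below are illustrative assumptions chosen for this sketch, and `covered_categories` is a hypothetical helper, so treat the result as a coarse signal only.

```python
from typing import Dict, List

# Hypothetical keyword stems per review area (assumptions for illustration).
CATEGORY_KEYWORDS: Dict[str, List[str]] = {
    "functionality": ["logic", "edge case", "error handling"],
    "code quality": ["naming", "readab", "docstring"],
    "performance": ["complexity", "efficien", "optimiz"],
    "security": ["injection", "validation", "credential"],
    "best practices": ["standard", "pattern", "maintainab"],
}


def covered_categories(review: str) -> List[str]:
    """Return the review areas for which at least one keyword appears."""
    lowered = review.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


review = "[HIGH] SQL injection risk - use parameterized queries\n[LOW] Poor naming - rename `d`"
print(covered_categories(review))
```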

## 📊 Part 7: Comprehensive Code Evaluation Pipeline

### 7.1 End-to-End Code Evaluation
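
An end-to-end pipeline reduces several per-metric results to one summary. The sketch below shows that aggregation step in isolation, assuming scores and thresholds on the 0-1 scale. `summarize` and the numbers are hypothetical and only illustrate the arithmetic.

```python
from statistics import mean
from typing import Dict, Union


def summarize(
    scores: Dict[str, float], thresholds: Dict[str, float]
) -> Dict[str, Union[int, float]]:
    """Average the metric scores and count how many meet their threshold."""
    pass_count = sum(1 for name, score in scores.items() if score >= thresholds[name])
    return {
        "overall_score": mean(scores.values()),
        "pass_count": pass_count,
        "total_metrics": len(scores),
    }


# Made-up scores for illustration only.
scores = {"correctness": 0.9, "readability": 0.5, "efficiency": 0.7, "security": 0.3}
thresholds = {"correctness": 0.7, "readability": 0.6, "efficiency": 0.6, "security": 0.7}
summary = summarize(scores, thresholds)
print(summary)
```

A plain mean treats all metrics equally. A weighted mean is a reasonable alternative when, for example, security failures should dominate the overall score.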

In [None]:
class CodeEvaluationPipeline:
    """
    Comprehensive pipeline for code evaluation
    """
    
    def __init__(self, model="gpt-4"):
        self.model = model
        self.evaluation_history = []
        
        # Initialize all metrics
        self.metrics = {
            "correctness": self._create_correctness_metric(),
            "readability": self._create_readability_metric(),
            "efficiency": self._create_efficiency_metric(),
            "security": self._create_security_metric()
        }
    
    def _create_correctness_metric(self):
        return create_code_correctness_metric()
    
    def _create_readability_metric(self):
        return create_code_readability_metric()
    
    def _create_efficiency_metric(self):
        return create_code_efficiency_metric()
    
    def _create_security_metric(self):
        return create_security_review_metric()
    
    def evaluate_code(self, problem_description: str, generated_code: str, 
                     expected_solution: Optional[str] = None) -> Dict[str, Any]:
        """
        Comprehensive evaluation of generated code
        """
        
        # Create test case
        test_case = LLMTestCase(
            input=problem_description,
            actual_output=generated_code,
            expected_output=expected_solution
        )
        
        results = {
            "problem": problem_description,
            "code": generated_code,
            "code_length": len(generated_code),
            "metrics": {},
            "overall_score": 0,
            "pass_count": 0,
            "total_metrics": len(self.metrics)
        }
        
        # Evaluate each metric
        metric_scores = []
        
        for metric_name, metric in self.metrics.items():
            try:
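                # Create a fresh metric instance. Keep only the params this test
                # case provides: expected_solution is optional, and recent DeepEval
                # versions raise an error when a listed param is missing.
                fresh_metric = type(metric)(
                    name=metric.name,
                    criteria=metric.criteria,
                    evaluation_steps=metric.evaluation_steps,
                    evaluation_params=[
                        p for p in metric.evaluation_params
                        if getattr(test_case, p.value, None) is not None
                    ],
                    threshold=metric.threshold,
                    model=metric.model
                )  # include_reason is not a documented GEval parameter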
                
                fresh_metric.measure(test_case)
                
                results["metrics"][metric_name] = {
                    "score": fresh_metric.score,
                    "passed": fresh_metric.is_successful(),
                    "reason": fresh_metric.reason,
                    "threshold": fresh_metric.threshold
                }
                
                metric_scores.append(fresh_metric.score)
                if fresh_metric.is_successful():
                    results["pass_count"] += 1
                
            except Exception as e:
                results["metrics"][metric_name] = {
                    "score": 0,
                    "passed": False,
                    "error": str(e)
                }
                metric_scores.append(0)
        
        # Calculate overall score
        if metric_scores:
            results["overall_score"] = np.mean(metric_scores)
        
        # Store in history
        self.evaluation_history.append(results)
        
        return results
    
    def batch_evaluate(self, code_samples: List[Dict]) -> List[Dict[str, Any]]:
        """
        Evaluate multiple code samples in one batch
        """
        
        results = []
        
        for i, sample in enumerate(code_samples):
            print(f"🔍 Evaluating {i+1}/{len(code_samples)}: {sample.get('name', f'Sample {i+1}')}")
            
            try:
                result = self.evaluate_code(
                    problem_description=sample.get('problem', 'Code evaluation'),
                    generated_code=sample.get('code', ''),
                    expected_solution=sample.get('expected_solution')
                )
                
                result["sample_name"] = sample.get('name', f'Sample {i+1}')
                results.append(result)
                
                # Quick summary
                print(f"  ✅ Overall Score: {result['overall_score']:.1f}/10")
                print(f"  📊 Passed: {result['pass_count']}/{result['total_metrics']} metrics")
                
            except Exception as e:
                print(f"  ❌ Error: {e}")
                results.append({
                    "sample_name": sample.get('name', f'Sample {i+1}'),
                    "error": str(e)
                })
        
        return results
    
    def get_performance_summary(self) -> Dict[str, Any]:
        """
        Get comprehensive performance summary
        """
        
        if not self.evaluation_history:
            return {"message": "No evaluations performed yet"}
        
        # Collect metrics data
        metric_summaries = {}
        overall_scores = []
        pass_rates = []
        
        for result in self.evaluation_history:
            if "overall_score" in result:
                overall_scores.append(result["overall_score"])
                pass_rates.append(result["pass_count"] / result["total_metrics"])
            
            if "metrics" in result:
                for metric_name, metric_data in result["metrics"].items():
                    if "score" in metric_data:
                        if metric_name not in metric_summaries:
                            metric_summaries[metric_name] = []
                        metric_summaries[metric_name].append(metric_data["score"])
        
        # Calculate summary statistics
        summary = {
            "total_evaluations": len(self.evaluation_history),
            "overall_performance": {
                "average_score": round(np.mean(overall_scores), 2) if overall_scores else 0,
                "average_pass_rate": round(np.mean(pass_rates) * 100, 1) if pass_rates else 0,
                "score_std": round(np.std(overall_scores), 2) if overall_scores else 0
            },
            "metric_performance": {}
        }
        
        for metric_name, scores in metric_summaries.items():
            summary["metric_performance"][metric_name] = {
                "average_score": round(np.mean(scores), 2),
                "min_score": round(min(scores), 2),
                "max_score": round(max(scores), 2),
                "std_dev": round(np.std(scores), 2)
            }
        
        return summary

# Create evaluation pipeline
code_eval_pipeline = CodeEvaluationPipeline()
print("✅ Code Evaluation Pipeline created successfully!")
print(f"Available metrics: {list(code_eval_pipeline.metrics.keys())}")

### 7.2 Run Comprehensive Evaluation
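
Each sample passed to `batch_evaluate` is a plain dict with `name`, `problem`, `code`, and an optional `expected_solution`. The roll-up that `evaluate_code` performs on the per-metric results can be sketched without any LLM call. In the sketch below, `aggregate_metric_results` and the hard-coded scores are hypothetical and only illustrate the arithmetic: the overall score is the mean of the per-metric scores, and a metric that raised an error counts as 0.

```python
from statistics import mean
from typing import Any, Dict


def aggregate_metric_results(metrics: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Roll per-metric results up into an overall score and a pass count."""
    scores = [m.get("score", 0) for m in metrics.values()]
    return {
        "overall_score": mean(scores) if scores else 0,
        "pass_count": sum(1 for m in metrics.values() if m.get("passed")),
        "total_metrics": len(metrics),
    }


# Hypothetical per-metric results; the third metric errored and counts as 0
example_metrics = {
    "correctness": {"score": 8.0, "passed": True},
    "readability": {"score": 6.0, "passed": False},
    "security": {"score": 0, "passed": False, "error": "timeout"},
}

summary = aggregate_metric_results(example_metrics)
print(summary)
```

Because errors are averaged in as zeros, a single failed API call can pull the overall score down sharply, so it may be worth checking the `error` field before comparing samples.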

In [None]:
def run_comprehensive_code_evaluation():
    """
    Run comprehensive evaluation pipeline
    """
    
    if not code_data or 'code_generation_problems' not in code_data:
        print("❌ No code generation data available for comprehensive evaluation")
        # Return an empty pair so the caller's tuple unpacking still works
        return [], {}
    
    print("🚀 Running Comprehensive Code Evaluation Pipeline\n")
    
    # Prepare evaluation samples from the code data
    evaluation_samples = []
    
    # Add code generation problems
    for problem in code_data['code_generation_problems']:
        evaluation_samples.append({
            "name": f"Problem: {problem['id']}",
            "problem": problem['problem'],
            "code": problem['correct_solution'],
            "expected_solution": problem['correct_solution']
        })
    
    # Add some buggy code samples (a missing key falls back to an empty list)
    for sample in code_data.get('buggy_code_samples', [])[:2]:  # Just first 2
        evaluation_samples.append({
            "name": f"Buggy: {sample['id']}",
            "problem": sample['description'],
            "code": sample['buggy_code']
        })
    
    print(f"📝 Evaluating {len(evaluation_samples)} code samples")
    
    # Run batch evaluation
    comprehensive_results = code_eval_pipeline.batch_evaluate(evaluation_samples)
    
    # Get performance summary
    performance_summary = code_eval_pipeline.get_performance_summary()
    
    return comprehensive_results, performance_summary

# Run comprehensive evaluation
comprehensive_results, performance_summary = run_comprehensive_code_evaluation()

In [None]:
# Display comprehensive results
def display_comprehensive_results(results, summary):
    """
    Display and analyze comprehensive evaluation results
    """
    
    if not results:
        print("❌ No results to display")
        return
    
    print("📊 Comprehensive Code Evaluation Results\n")
    
    # Create summary DataFrame
    summary_data = []
    
    for result in results:
        if "overall_score" in result:
            row = {
                "Sample": result.get("sample_name", "Unknown"),
                "Overall_Score": result["overall_score"],
                "Pass_Rate": f"{result['pass_count']}/{result['total_metrics']}",
                "Code_Length": result.get("code_length", 0)
            }
            
            # Add individual metric scores
            for metric_name, metric_data in result.get("metrics", {}).items():
                if "score" in metric_data:
                    row[metric_name.title()] = metric_data["score"]
            
            summary_data.append(row)
    
    if summary_data:
        df = pd.DataFrame(summary_data)
        print(df.round(1).to_string(index=False))
    
    # Display performance summary
    print(f"\n📈 Performance Summary:")
    if "overall_performance" in summary:
        perf = summary["overall_performance"]
        print(f"  Total Evaluations: {summary['total_evaluations']}")
        print(f"  Average Overall Score: {perf['average_score']}/10")
        print(f"  Average Pass Rate: {perf['average_pass_rate']}%")
        print(f"  Score Std Dev: {perf['score_std']}")
    
    print(f"\n🎯 Metric Performance:")
    if "metric_performance" in summary:
        for metric_name, stats in summary["metric_performance"].items():
            print(f"  {metric_name.title()}:")
            print(f"    Average: {stats['average_score']}/10")
            print(f"    Range: {stats['min_score']} - {stats['max_score']}")
            print(f"    Std Dev: {stats['std_dev']}")
    
    # Insights
    print(f"\n💡 Key Insights:")
    
    if "metric_performance" in summary:
        metric_avgs = {name: stats['average_score'] for name, stats in summary["metric_performance"].items()}
        
        if metric_avgs:
            best_metric = max(metric_avgs.keys(), key=lambda k: metric_avgs[k])
            worst_metric = min(metric_avgs.keys(), key=lambda k: metric_avgs[k])
            
            print(f"  • Strongest area: {best_metric.title()} ({metric_avgs[best_metric]:.1f}/10)")
            print(f"  • Weakest area: {worst_metric.title()} ({metric_avgs[worst_metric]:.1f}/10)")
            
            if metric_avgs[worst_metric] < 6.0:
                print(f"  • ⚠️  {worst_metric.title()} needs improvement (below threshold)")
    
    return df if 'df' in locals() else None

# Display results
results_df = display_comprehensive_results(comprehensive_results, performance_summary)

## 📊 Part 8: Visualization and Analysis

### 8.1 Code Evaluation Results Visualization
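
The correlation panel in this section relies on `DataFrame.corr()`, which computes the pairwise Pearson correlation between metric columns and excludes missing values pairwise. A minimal pure-Python sketch of that calculation for two metrics is shown here. The `pearson` helper and the score lists are hypothetical; pairs where either score is missing are skipped rather than filled with 0.

```python
from math import sqrt
from typing import List, Optional


def pearson(xs: List[Optional[float]], ys: List[Optional[float]]) -> Optional[float]:
    """Pearson correlation over the pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 2:
        return None  # not enough data points
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for four samples; one correctness score is missing
correctness = [8.0, 6.0, None, 9.0]
readability = [7.0, 5.0, 4.0, 8.0]
r = pearson(correctness, readability)
print(r)
```

With only a handful of samples, as in this notebook, correlation values swing widely, so the heatmap is better read as a rough hint than as evidence that two metrics measure the same thing.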

In [None]:
def visualize_code_evaluation_results(results, summary):
    """
    Create comprehensive visualizations for code evaluation results
    """
    
    if not results or not summary.get("metric_performance"):
        print("❌ Insufficient data for visualization")
        return
    
    # Setup plots
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    fig.suptitle('Code Evaluation Results Analysis', fontsize=16, fontweight='bold')
    
    # 1. Metric Performance Comparison
    metric_names = list(summary["metric_performance"].keys())
    metric_scores = [summary["metric_performance"][name]["average_score"] for name in metric_names]
    
    bars = axes[0,0].bar(metric_names, metric_scores, 
                        color=['skyblue', 'lightgreen', 'coral', 'lightpink'])
    axes[0,0].set_title('Average Metric Scores')
    axes[0,0].set_ylabel('Score (0-10)')
    axes[0,0].tick_params(axis='x', rotation=45)
    axes[0,0].axhline(y=7.0, color='red', linestyle='--', alpha=0.7, label='Typical Threshold')
    axes[0,0].legend()
    
    # Add value labels
    for bar, score in zip(bars, metric_scores):
        axes[0,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
                      f'{score:.1f}', ha='center', va='bottom', fontweight='bold')
    
    # 2. Score Distribution
    all_scores = []
    all_metric_labels = []
    
    for result in results:
        if "metrics" in result:
            for metric_name, metric_data in result["metrics"].items():
                if "score" in metric_data:
                    all_scores.append(metric_data["score"])
                    all_metric_labels.append(metric_name)
    
    if all_scores:
        axes[0,1].hist(all_scores, bins=10, alpha=0.7, color='lightblue', edgecolor='black')
        axes[0,1].set_title('Score Distribution')
        axes[0,1].set_xlabel('Score')
        axes[0,1].set_ylabel('Frequency')
        axes[0,1].axvline(x=np.mean(all_scores), color='red', linestyle='--', 
                         label=f'Mean: {np.mean(all_scores):.1f}')
        axes[0,1].legend()
    
    # 3. Pass Rate Analysis
    pass_rates = []
    sample_names = []
    
    for result in results:
        if "pass_count" in result and "total_metrics" in result:
            pass_rate = result["pass_count"] / result["total_metrics"] * 100
            pass_rates.append(pass_rate)
            sample_names.append(result.get("sample_name", "Unknown")[:20])
    
    if pass_rates:
        bars = axes[1,0].bar(range(len(pass_rates)), pass_rates, 
                            color=['green' if pr >= 75 else 'orange' if pr >= 50 else 'red' for pr in pass_rates])
        axes[1,0].set_title('Pass Rate by Sample')
        axes[1,0].set_ylabel('Pass Rate (%)')
        axes[1,0].set_xlabel('Samples')
        axes[1,0].set_xticks(range(len(sample_names)))
        axes[1,0].set_xticklabels(sample_names, rotation=45, ha='right')
        
        # Add value labels
        for bar, rate in zip(bars, pass_rates):
            axes[1,0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                          f'{rate:.0f}%', ha='center', va='bottom', fontsize=8)
    
    # 4. Metric Correlation Heatmap
    if len(metric_names) > 1:
        # Create correlation matrix
        metric_data_matrix = []
        
        for result in results:
            if "metrics" in result:
                row = []
                for metric_name in metric_names:
                    if metric_name in result["metrics"] and "score" in result["metrics"][metric_name]:
                        row.append(result["metrics"][metric_name]["score"])
                    else:
                        # Leave gaps as NaN; DataFrame.corr() ignores them pairwise
                        row.append(np.nan)
                if len(row) == len(metric_names):
                    metric_data_matrix.append(row)
        
        if len(metric_data_matrix) > 1:
            correlation_df = pd.DataFrame(metric_data_matrix, columns=metric_names)
            correlation_matrix = correlation_df.corr()
            
            sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
                        square=True, ax=axes[1,1], cbar_kws={'shrink': 0.8})
            axes[1,1].set_title('Metric Score Correlations')
        else:
            axes[1,1].text(0.5, 0.5, 'Insufficient data\nfor correlation', 
                          ha='center', va='center', transform=axes[1,1].transAxes, fontsize=12)
            axes[1,1].set_title('Metric Score Correlations')
    else:
        axes[1,1].text(0.5, 0.5, 'Need multiple metrics\nfor correlation', 
                      ha='center', va='center', transform=axes[1,1].transAxes, fontsize=12)
        axes[1,1].set_title('Metric Score Correlations')
    
    plt.tight_layout()
    plt.show()
    
    # Print additional insights
    print("\n🔍 Visualization Insights:")
    
    if metric_scores:
        highest_metric = metric_names[metric_scores.index(max(metric_scores))]
        lowest_metric = metric_names[metric_scores.index(min(metric_scores))]
        
        print(f"  • Highest performing metric: {highest_metric.title()} ({max(metric_scores):.1f})")
        print(f"  • Lowest performing metric: {lowest_metric.title()} ({min(metric_scores):.1f})")
        
        score_range = max(metric_scores) - min(metric_scores)
        if score_range > 3.0:
            print(f"  • ⚠️  Large score spread ({score_range:.1f}) indicates uneven quality across metrics")
        
    if pass_rates:
        avg_pass_rate = np.mean(pass_rates)
        print(f"  • Average pass rate: {avg_pass_rate:.1f}%")
        
        if avg_pass_rate < 50:
            print(f"  • ⚠️  Low pass rate suggests code quality issues")
        elif avg_pass_rate > 80:
            print(f"  • ✅ High pass rate indicates good code quality")

# Create visualizations
visualize_code_evaluation_results(comprehensive_results, performance_summary)

## 🎓 Part 9: Exercises and Practice

### Exercise 1: Custom Security Metric

In [None]:
# Exercise 1: Create a custom security metric for a specific vulnerability type
def exercise_1_custom_security_metric():
    """
    TODO: Create a specialized security metric for SQL injection detection
    Requirements:
    1. Focus specifically on SQL injection patterns
    2. Include common SQL injection techniques
    3. Test with vulnerable code samples
    4. Compare with the general security metric
    """
    
    # TODO: Define SQL injection specific criteria
    sql_injection_criteria = """
    Your SQL injection detection criteria here...
    Focus on:
    - String concatenation in SQL queries
    - Missing parameterized queries
    - User input directly in SQL
    - Dynamic query construction
    """
    
    # TODO: Create G-Eval metric
    
    # TODO: Test with vulnerable samples
    
    # TODO: Compare results
    
    return None

print("💡 Exercise 1 Template created. Complete the function above!")
print("Hints:")
print("- Use specific SQL injection patterns in criteria")
print("- Test with the entries in code_data['buggy_code_samples'] that contain SQL injection")
print("- Compare specificity vs general security metric")
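
As a concrete reference for the patterns listed in the criteria above, the sketch below contrasts string concatenation with a parameterized query using the standard-library `sqlite3` module. The `users` table, the `find_user_vulnerable` and `find_user_safe` helpers, and the payload are hypothetical fixtures for illustration, not part of the notebook's dataset.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_vulnerable(conn: sqlite3.Connection, name: str):
    # User input is concatenated into the SQL text: the pattern a metric should flag
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # Parameterized query: the input is bound as data, never parsed as SQL
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"
leaked_rows = find_user_vulnerable(conn, payload)  # condition is always true
safe_rows = find_user_safe(conn, payload)          # no user has this literal name
print(len(leaked_rows), len(safe_rows))
```

A pair like this makes a useful test case for the exercise: a well-targeted SQL injection metric should score the first helper low and the second one high.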

### Exercise 2: Code Style Metric

In [None]:
# Exercise 2: Create a code style compliance metric
def exercise_2_code_style_metric():
    """
    TODO: Create a metric that checks PEP 8 compliance and Python best practices
    Requirements:
    1. Check naming conventions (snake_case, PascalCase)
    2. Line length and formatting
    3. Import organization
    4. Comment style and docstrings
    5. Python idioms usage
    """
    
    # TODO: Define PEP 8 criteria
    pep8_criteria = """
    Your PEP 8 compliance criteria here...
    Check for:
    - Variable naming (snake_case)
    - Function naming conventions
    - Line length (79 characters)
    - Proper spacing and indentation
    - Import statements organization
    """
    
    # TODO: Create the metric and test it
    
    return None

print("💡 Exercise 2 Template created. Complete the function above!")
print("Hints:")
print("- Reference PEP 8 guidelines")
print("- Create test cases with good/bad style examples")
print("- Consider using the ast module to analyze code structure")
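
The last hint can be sketched with the standard-library `ast` module. The example below is a minimal static check, not a full PEP 8 linter: `find_style_issues`, the `SNAKE_CASE` pattern, and the sample source are hypothetical, and only three rules are covered (line length, function naming, missing docstrings). Its output could be fed to a G-Eval metric as extra context or used to sanity-check the metric's scores.

```python
import ast
import re
from typing import List

# Hypothetical rule set: lowercase names with underscores, optional leading underscores
SNAKE_CASE = re.compile(r"^_{0,2}[a-z][a-z0-9_]*$")
MAX_LINE_LENGTH = 79


def find_style_issues(source: str) -> List[str]:
    """Return human-readable style issues found in the given source code."""
    issues = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            issues.append(f"line {lineno}: longer than {MAX_LINE_LENGTH} characters")

    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if not SNAKE_CASE.match(node.name):
                issues.append(
                    f"line {node.lineno}: function '{node.name}' is not snake_case"
                )
            if ast.get_docstring(node) is None:
                issues.append(
                    f"line {node.lineno}: function '{node.name}' has no docstring"
                )
    return issues


sample = '''
def ComputeTotal(items):
    return sum(items)


def compute_mean(items):
    """Return the arithmetic mean."""
    return sum(items) / len(items)
'''

style_issues = find_style_issues(sample)
for issue in style_issues:
    print(issue)
```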

### Exercise 3: Automated Code Review Pipeline

In [None]:
# Exercise 3: Build automated code review system
def exercise_3_automated_code_review():
    """
    TODO: Create a complete automated code review system
    Requirements:
    1. Integrate all metrics (correctness, style, security, performance)
    2. Generate comprehensive review report
    3. Provide specific recommendations
    4. Rank issues by severity
    5. Output structured review format
    """
    
    class AutomatedCodeReviewer:
        def __init__(self):
            # TODO: Initialize all metrics
            pass
        
        def review_code(self, code: str, context: str = "") -> Dict:
            # TODO: Run all evaluations
            # TODO: Generate recommendations
            # TODO: Format review report
            pass
        
        def generate_review_report(self, results: Dict) -> str:
            # TODO: Create structured review report
            pass
    
    # TODO: Test with various code samples
    
    return None

print("💡 Exercise 3 Template created. Complete the class above!")
print("Hints:")
print("- Combine all metrics from previous sections")
print("- Use severity levels: Critical, High, Medium, Low")
print("- Generate actionable recommendations")
print("- Consider code context in recommendations")
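
One possible starting point for the severity ranking and the structured report is sketched below. The `ReviewIssue` dataclass, `SEVERITY_ORDER`, and the example issues are hypothetical; in a full solution the issues would be derived from the metric results rather than written by hand.

```python
from dataclasses import dataclass
from typing import List

# Lower number means more severe; matches the levels suggested in the hints
SEVERITY_ORDER = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


@dataclass
class ReviewIssue:
    severity: str
    category: str
    message: str


def rank_issues(issues: List[ReviewIssue]) -> List[ReviewIssue]:
    """Sort issues from most to least severe; ties keep their input order."""
    return sorted(issues, key=lambda issue: SEVERITY_ORDER[issue.severity])


def format_report(issues: List[ReviewIssue]) -> str:
    """Render the ranked issues as a plain-text review report."""
    lines = ["Code Review Report", "=================="]
    for issue in rank_issues(issues):
        lines.append(f"[{issue.severity}] {issue.category}: {issue.message}")
    return "\n".join(lines)


# Hypothetical findings, deliberately listed out of order
issues = [
    ReviewIssue("Low", "style", "Function name is not snake_case"),
    ReviewIssue("Critical", "security", "User input concatenated into SQL query"),
    ReviewIssue("Medium", "performance", "Nested loop over the same list"),
]

report = format_report(issues)
print(report)
```

Keeping the severity mapping in one dict makes it easy to add levels or change the ordering without touching the sorting or reporting code.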

## 🎯 Summary and Next Steps

### 🏆 What You Learned in This Notebook:

1. **✅ G-Eval Framework Mastery**
   - Create custom metrics with natural language criteria
   - Define evaluation steps and parameters
   - Flexible threshold setting

2. **✅ Code Generation Metrics**
   - **CodeCorrectnessMetric**: Logic, syntax, functional requirements
   - **ReadabilityMetric**: Naming, structure, documentation
   - **EfficiencyMetric**: Time/space complexity, algorithmic approach

3. **✅ Security Evaluation**
   - **SecurityReviewMetric**: Vulnerability detection
   - Injection attacks, input validation, authentication issues
   - Automated security assessment

4. **✅ Code Review Automation**
   - **ReviewCompletenessMetric**: Comprehensive review coverage
   - Multi-dimensional code assessment
   - Automated review pipeline

5. **✅ Comprehensive Evaluation Pipeline**
   - End-to-end code evaluation system
   - Batch processing capabilities
   - Performance analytics and insights

### 🚀 Next Steps - Notebook 4: AI Agents & Chain-of-Thought

In the next notebook, we will cover:

- 🤖 **LangGraph Agent Construction**: Multi-tool agents
- 🧠 **Chain-of-Thought Evaluation**: Reasoning process assessment
- 🛠️ **Agent-Specific Metrics**: LogicalFlow, ToolUsage, PlanExecution
- 🔄 **Multi-step Evaluation**: Intermediate step analysis
- 📊 **Agent Performance Benchmarking**: Comprehensive agent assessment

### 💡 Key Insights from Code Evaluation:

- **Multi-dimensional Assessment**: Code quality is about more than correctness
- **Security First**: Security evaluation should be prioritized
- **Custom Metrics Power**: G-Eval enables domain-specific evaluation
- **Automated Review**: Scale the code review process with consistent standards
- **Comprehensive Pipeline**: A holistic approach works better than individual metrics in isolation

### 🎯 Best Practices Summary:

1. **Use multiple metrics** for a comprehensive assessment
2. **Customize evaluation criteria** for specific use cases
3. **Include security metrics** in all code evaluations
4. **Test with various code quality levels** to validate metrics
5. **Visualize results** to identify patterns and insights
6. **Automate the evaluation pipeline** for consistent assessment

### 📊 Evaluation Framework Comparison:

| Aspect | Traditional Testing | DeepEval Code Metrics |
|--------|--------------------|-----------------------|
| **Scope** | Functionality only | Multi-dimensional |
| **Automation** | Limited | Comprehensive |
| **Customization** | Rigid | Highly flexible |
| **Security** | Manual review | Automated detection |
| **Scalability** | Manual effort | Automated pipeline |
| **Consistency** | Reviewer dependent | Standardized criteria |

---

## 🎉 Outstanding Progress!

You've mastered code generation evaluation with DeepEval!

Ready to tackle **Notebook 4: Evaluating AI Agents and CoT with LangGraph**? 🚀🤖