## 📋 Overview

This is an improved version of Phase 1 and includes the following key improvements:

### Key Improvements
- ✅ **Multilingual Support**: Llama-3.1-8B native multilingual understanding
- ✅ **Fixed English Output**: All reports in English regardless of input language
- ✅ **Code-switching Handling**: Understands mixed-language content naturally
- ✅ **Modular Design**: Model, prompts, and preprocessing split into modules
- ✅ **Extensible Architecture**: Ready for team model integration
- ✅ **Better Preprocessing**: URL removal, language detection

### Multilingual Processing
- **Input**: Korean (한글), English, Japanese (日本語), Mixed languages
- **Processing**: LLM native understanding (no translation layer)
- **Output**: English (fixed)

### Expected Quality
- Korean input → English output: **7-9/10**
- English input → English output: **8-9/10**
- Mixed language input → English output: **7-8/10**
- Japanese input → English output: **7-8/10**

---

## 📦 1. Setup & Installation

In [None]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes torch
!pip install -q langdetect isodate
!pip install -q google-api-python-client  # For YouTube API (optional)

In [None]:
import json
import re
import warnings
from datetime import datetime
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field

import torch
import isodate
from langdetect import detect, LangDetectException
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## ⚙️ 2. Configuration

All settings can be controlled here.

In [None]:
@dataclass
class PipelineConfig:
    """Overall pipeline configuration"""
    
    # Model Configuration
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"
    use_4bit: bool = True
    max_new_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9
    
    # Data Configuration
    use_youtube_api: bool = False
    youtube_api_key: Optional[str] = None
    data_path: str = "full_dataset_20251013_215347.json"
    
    # Processing Configuration
    max_description_length: int = 2000
    max_comments_to_process: int = 50
    min_comment_length: int = 10
    remove_urls: bool = True
    detect_language: bool = True
    
    # Multilingual Configuration
    output_language: str = "English"
    multilingual_understanding: bool = True
    
    # Experimental Configuration (NEW)
    num_videos_for_test: int = 3  # Number of videos to process in test mode
    enable_detailed_logging: bool = True  # Detailed logs for experiments
    log_token_counts: bool = True  # Log token usage for analysis
    
    # Team Model Integration
    use_category_model: bool = False
    use_sentiment_model: bool = False
    category_model_path: Optional[str] = None
    sentiment_model_path: Optional[str] = None
    
    # Output Configuration
    output_format: str = "markdown"
    save_reports: bool = True
    output_dir: str = "reports"

# Create configuration
config = PipelineConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    use_4bit=True,
    use_youtube_api=False,
    data_path="full_dataset_20251013_215347.json",
    output_language="English",
    multilingual_understanding=True,
    num_videos_for_test=3,  # NEW: Configurable test size
    enable_detailed_logging=True  # NEW: Enable detailed logs
)

print("="*60)
print("Pipeline Configuration")
print("="*60)
print(f"Model: {config.model_name}")
print(f"4-bit: {config.use_4bit}")
print(f"Temperature: {config.temperature}")
print(f"Output Language: {config.output_language}")
print(f"Multilingual Understanding: {config.multilingual_understanding}")
print(f"Test Videos: {config.num_videos_for_test}")
print(f"Detailed Logging: {config.enable_detailed_logging}")
print(f"YouTube API: {config.use_youtube_api}")
print("="*60)
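
Because `PipelineConfig` is a dataclass, experiment variants can be derived from a base configuration without mutating it. The sketch below uses a trimmed stand-in, `MiniConfig` (a hypothetical class holding only a few of the fields above), to illustrate `dataclasses.replace` for that purpose.

```python
from dataclasses import dataclass, replace, asdict

@dataclass
class MiniConfig:
    """Hypothetical stand-in with a subset of PipelineConfig's fields."""
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"
    temperature: float = 0.7
    max_new_tokens: int = 512
    num_videos_for_test: int = 3

base = MiniConfig()

# replace() returns a new instance; `base` is left untouched.
low_temp = replace(base, temperature=0.2, num_videos_for_test=1)

# Collect only the fields that differ, e.g. to log next to an experiment's reports.
base_fields = asdict(base)
changed = {k: v for k, v in asdict(low_temp).items() if base_fields[k] != v}
print(changed)
```

The same call should work on the full `PipelineConfig`, since it only relies on standard dataclass behaviour.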

## 🧩 3. Prompt Templates

In [None]:
class PromptTemplates:
    """Prompt template management (multilingual native)"""
    
    # ===== Video Summary Prompts =====
    VIDEO_SUMMARY_SYSTEM = """You are an expert multilingual content analyst specializing in YouTube video analysis.
Your task is to understand content in ANY language (Korean, English, Japanese, or mixed) and create summaries in English.
You have native-level understanding of multiple languages and can capture nuances across different cultures.
Focus on accuracy and avoid hallucinations."""
    
    VIDEO_SUMMARY_USER = """Analyze this YouTube video and create a summary.

Video Information:
- Title: {title}
- Description: {description}
- Category: {category}
- Duration: {duration} seconds
- Channel: {channel}

Context:
The title and description may be in Korean (한글), English, Japanese (日本語), or mixed languages.
Please understand the content in its original language(s) and provide your analysis.

Instructions:
1. Read and comprehend the content regardless of the language(s) used
2. Understand cultural context and nuances in the original language
3. Write a 3-5 sentence summary in ENGLISH
4. Capture the key points, main theme, and purpose of the video
5. Be accurate - do NOT make up information or misinterpret due to language barriers
6. If description is minimal, acknowledge this limitation

Summary (in English):"""
    
    # ===== Reaction Summary Prompts =====
    REACTION_SUMMARY_SYSTEM = """You are an expert in multilingual social media sentiment analysis.
You can understand and analyze comments in Korean (한글), English, Japanese (日本語), and mixed languages.
Your task is to capture audience reactions across all language communities and summarize in English."""
    
    REACTION_SUMMARY_USER = """Analyze these YouTube comments and summarize the audience reaction.

Video: {title}

Audience Comments:
{comments}

Context:
These comments are from a multilingual audience and may include:
- Korean (ÌïúÍ∏Ä) comments
- English comments
- Japanese (Êó•Êú¨Ë™û) comments
- Mixed-language comments (e.g., "Ïù¥ ÎÖ∏Îûò ÏßÑÏßú beautifulÌïòÎã§" - Korean + English in one comment)
- Code-switching between languages

Your Task:
Please analyze ALL comments by:
1. Reading and understanding each comment in its original language(s)
2. For mixed-language comments, understanding the complete meaning and emotional tone
3. Identifying sentiment patterns (positive/negative/neutral) across all language groups
4. Finding common themes and topics that appear across different languages
5. Noting any cultural references or language-specific expressions

Instructions:
1. Comprehend ALL comments regardless of language
2. Identify overall sentiment (positive, negative, or mixed)
3. Highlight common themes that appear across language communities
4. Mention notable reactions or insightful comments
5. Write a 3-5 sentence summary in ENGLISH
6. Be objective and balanced in capturing diverse reactions
7. If different language communities show different reactions, mention this

Audience Reaction Summary (in English):"""
    
    CUSTOM_PROMPT = None
    
    @classmethod
    def get_video_summary_prompt(cls, video_info: Dict) -> List[Dict]:
        """Generate video summary prompt (language-agnostic)"""
        if cls.CUSTOM_PROMPT:
            return cls.CUSTOM_PROMPT
        
        return [
            {"role": "system", "content": cls.VIDEO_SUMMARY_SYSTEM},
            {"role": "user", "content": cls.VIDEO_SUMMARY_USER.format(
                title=video_info.get('title', 'N/A'),
                description=video_info.get('description', 'N/A')[:2000],
                category=video_info.get('category_name', 'N/A'),
                duration=video_info.get('duration_seconds', 'N/A'),
                channel=video_info.get('channel_title', 'N/A')
            )}
        ]
    
    @classmethod
    def get_reaction_summary_prompt(cls, title: str, comments: str) -> List[Dict]:
        """Generate reaction summary prompt (language-agnostic)"""
        if cls.CUSTOM_PROMPT:
            return cls.CUSTOM_PROMPT
        
        return [
            {"role": "system", "content": cls.REACTION_SUMMARY_SYSTEM},
            {"role": "user", "content": cls.REACTION_SUMMARY_USER.format(
                title=title,
                comments=comments
            )}
        ]

print("✅ Multilingual prompt templates loaded")
print("📝 Output language: English (fixed)")
print("🌍 Input languages: Korean, English, Japanese, Mixed")
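
Both prompt getters return a list of role/content dictionaries, and a `CUSTOM_PROMPT` override would need the same shape. The sketch below uses shortened, hypothetical templates (`SYSTEM`, `USER`) and a hypothetical helper `build_messages` to show that shape, and to show that braces in user-supplied text are safe with `str.format` because only the template is parsed.

```python
from typing import Dict, List

# Shortened, hypothetical templates; the real ones live in PromptTemplates.
SYSTEM = "You are a multilingual content analyst. Always answer in English."
USER = "Video: {title}\n\nAudience Comments:\n{comments}\n\nSummary (in English):"

def build_messages(title: str, comments: str) -> List[Dict[str, str]]:
    """Return chat messages in the role/content format used by chat templates."""
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": USER.format(title=title, comments=comments)},
    ]

# Braces inside the values are inserted verbatim; only the template is parsed.
messages = build_messages("NMIXX 'Blue Valentine' M/V", "1. [12 likes] love it {so much}")
for m in messages:
    print(m["role"], "->", m["content"][:40])
```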

### 🌍 Multilingual Processing Strategy

**Approach**: LLM Native Multilingual Understanding

**How it works:**
1. **Input**: Content in ANY language (Korean, English, Japanese, Mixed)
2. **Processing**: Llama-3.1-8B understands content in original language(s)
3. **Output**: Summary always in English

**Examples:**
```
Input:  "NMIXX(엔믹스) 'Blue Valentine' M/V"
Output: "This is NMIXX's music video for 'Blue Valentine'..."

Input:  "이 노래 진짜 좋다 I love it so much"
Output: "Positive reaction praising the song..."

Input:  "この曲最高！choreography も great"
Output: "Enthusiastic praise for the song and choreography..."
```

**Benefits:**
- ✅ No translation layer needed
- ✅ Preserves cultural context and nuances
- ✅ Handles code-switching naturally
- ✅ Fast and efficient (single LLM call)
- ✅ Consistent English output for all reports

**Language Detection:**
- Still performed for logging/debugging
- Helps monitor language distribution
- Not used for prompt selection (English output always)
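
`langdetect.detect` returns a single label per text, so a code-switched comment is counted under one language only. As a rough complement for logging, the scripts in a comment can be counted by Unicode range. The functions below are an illustrative sketch and not part of the pipeline; `script_profile`, `is_code_switched`, and the range table are assumptions, and Latin-script proper nouns in otherwise Korean text will also be flagged.

```python
from collections import Counter

# Code point ranges for the scripts this notebook cares about (not exhaustive).
RANGES = {
    "Hangul": [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)],
    "Kana": [(0x3040, 0x309F), (0x30A0, 0x30FF)],
    "Han": [(0x4E00, 0x9FFF)],
    "Latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
}

def script_profile(text: str) -> Counter:
    """Count characters per script; characters outside RANGES are ignored."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for script, spans in RANGES.items():
            if any(lo <= cp <= hi for lo, hi in spans):
                counts[script] += 1
                break
    return counts

def is_code_switched(text: str) -> bool:
    """True when Latin letters appear together with Hangul, Kana, or Han."""
    profile = script_profile(text)
    return profile["Latin"] > 0 and any(
        profile[s] > 0 for s in ("Hangul", "Kana", "Han")
    )

print(is_code_switched("이 노래 진짜 좋다 I love it so much"))
print(is_code_switched("This song is great"))
```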

---

## 🔧 4. Text Preprocessing

In [None]:
class TextPreprocessor:
    """Text preprocessing"""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.url_pattern = re.compile(r'http[s]?://\S+')
    
    def remove_urls(self, text: str) -> str:
        return self.url_pattern.sub('', text)
    
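    def detect_language(self, text: str) -> str:
        """Detect the dominant language; used for logging only."""
        try:
            lang = detect(text)
        except LangDetectException:
            # Raised for empty or feature-less input (e.g. only digits or emoji)
            return 'English'
        lang_map = {'ko': 'Korean', 'en': 'English', 'ja': 'Japanese'}
        # Languages outside the map are also reported as English
        return lang_map.get(lang, 'English')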
    
    def clean_description(self, description: str) -> str:
        if not description:
            return "No description available."
        if self.config.remove_urls:
            description = self.remove_urls(description)
        description = ' '.join(description.split())
        if len(description) > self.config.max_description_length:
            description = description[:self.config.max_description_length] + "..."
        if len(description.strip()) < 20:
            return "Minimal description available."
        return description
    
    def filter_comments(self, comments: List[Dict]) -> List[Dict]:
        filtered = [c for c in comments if len(c.get('text', '')) >= self.config.min_comment_length]
        filtered.sort(key=lambda x: x.get('like_count', 0), reverse=True)
        return filtered[:self.config.max_comments_to_process]
    
    def format_comments_for_prompt(self, comments: List[Dict]) -> str:
        if not comments:
            return "No comments available."
        formatted = []
        for i, comment in enumerate(comments, 1):
            text = comment.get('text', '')
            likes = comment.get('like_count', 0)
            if self.config.remove_urls:
                text = self.remove_urls(text)
            formatted.append(f"{i}. [{likes} likes] {text}")
        return "\n".join(formatted)
    
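    def parse_duration(self, duration_str: str) -> int:
        """Convert an ISO 8601 duration (e.g. 'PT3M20S') to seconds; 0 on failure."""
        try:
            duration = isodate.parse_duration(duration_str)
            return int(duration.total_seconds())
        except (isodate.ISO8601Error, TypeError):
            # Malformed duration string or non-string input
            return 0

preprocessor = TextPreprocessor(config)
print("✅ Text preprocessor initialized")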
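
One detail of the class above: `filter_comments` checks the length of the raw text, while URLs are removed later in `format_comments_for_prompt`, so a comment that consists only of a link can pass the filter and reach the prompt as an empty entry. The sketch below shows one possible ordering that strips URLs first; `select_comments` is a hypothetical helper, and the sample comments are made up.

```python
import re
from typing import Dict, List

URL_PATTERN = re.compile(r"https?://\S+")

def select_comments(comments: List[Dict], min_length: int = 10, limit: int = 50) -> List[Dict]:
    """Strip URLs first, then drop short comments and keep the most-liked ones."""
    cleaned = []
    for c in comments:
        text = URL_PATTERN.sub("", c.get("text", "")).strip()
        if len(text) >= min_length:
            cleaned.append({**c, "text": text})
    cleaned.sort(key=lambda c: c.get("like_count", 0), reverse=True)
    return cleaned[:limit]

sample = [
    {"text": "https://example.com/watch?v=123456", "like_count": 90},  # URL only
    {"text": "이 노래 진짜 좋다 I love it so much", "like_count": 12},
    {"text": "nice", "like_count": 40},  # too short
    {"text": "Great choreography, see https://example.com", "like_count": 30},
]
selected = select_comments(sample)
for c in selected:
    print(c["like_count"], c["text"])
```

Only the two substantive comments remain, ordered by like count.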

## 🤖 5. Model Loading

In [None]:
class ModelManager:
    """LLM model management (enhanced with explicit chat template)"""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.model = None
        self.tokenizer = None
        self.pipe = None
    
    def load_model(self):
        print(f"\n🔄 Loading: {self.config.model_name}")
        print(f"⚙️ 4-bit: {self.config.use_4bit}")
        
        # Tokenizer
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config.model_name, trust_remote_code=True
        )
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Quantization config
        if self.config.use_4bit:
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )
        else:
            quantization_config = None
        
        # Model
        print("Loading model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_name,
            quantization_config=quantization_config,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype=torch.float16 if not self.config.use_4bit else None
        )
        
        # Pipeline
        print("Creating pipeline...")
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=self.config.max_new_tokens,
            temperature=self.config.temperature,
            top_p=self.config.top_p,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id
        )
        
        print("✅ Model loaded!")
        if self.config.enable_detailed_logging:
            print(f"📊 Model parameters: ~{self.model.num_parameters() / 1e9:.1f}B")
    
    def generate(self, messages: List[Dict]) -> str:
        """Generate text with explicit chat template and detailed logging"""
        if self.pipe is None:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        
        try:
            # Render the chat template only to count input tokens; the pipeline
            # applies the same template itself when it receives `messages`.
            if self.config.log_token_counts and self.config.enable_detailed_logging:
                prompt_text = self.tokenizer.apply_chat_template(
                    messages,
                    tokenize=False,
                    add_generation_prompt=True
                )
                # The rendered template already contains the special tokens,
                # so encode() must not add them a second time.
                input_tokens = len(
                    self.tokenizer.encode(prompt_text, add_special_tokens=False)
                )
                print(f"      [Input tokens: {input_tokens}]")
            
            # Generate
            outputs = self.pipe(messages)
            
            # Extract generated text
            generated_text = outputs[0]["generated_text"]
            
            if isinstance(generated_text, list):
                # Chat format output: the last message is the assistant reply
                result = generated_text[-1]["content"]
            else:
                # String output
                result = generated_text
            
            # Log output token count
            if self.config.log_token_counts and self.config.enable_detailed_logging:
                output_tokens = len(
                    self.tokenizer.encode(result, add_special_tokens=False)
                )
                print(f"      [Output tokens: {output_tokens}]")
            
            return result
            
        except Exception as e:
            # Enhanced error logging
            print(f"\n❌ Generation Error:")
            print(f"   Error type: {type(e).__name__}")
            print(f"   Error message: {str(e)}")
            print(f"   Messages length: {len(messages)}")
            
            if self.config.enable_detailed_logging:
                # Log message sizes for debugging. Iterating avoids an IndexError
                # (which would mask the original error) when a custom prompt
                # has fewer than two messages.
                total_chars = sum(len(m.get('content', '')) for m in messages)
                print(f"   Total message chars: {total_chars}")
                for m in messages:
                    role = m.get('role', 'unknown')
                    print(f"   {role} prompt length: {len(m.get('content', ''))} chars")
            
            raise

model_manager = ModelManager(config)
model_manager.load_model()
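
`load_model()` prints the parameter count but not the memory the weights need, which is the main reason for the `use_4bit` option. The arithmetic below is a back-of-the-envelope sketch: `NUM_PARAMS` is an assumed round figure for an 8B-parameter model, and the result covers weights only. It ignores quantization constants, layers that stay unquantized, the KV cache, and activations, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8e9  # assumed round figure for an 8B-parameter model

fp16 = weight_memory_gib(NUM_PARAMS, 16)
nf4 = weight_memory_gib(NUM_PARAMS, 4)
print(f"fp16 weights:  ~{fp16:.1f} GiB")
print(f"4-bit weights: ~{nf4:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the fp16 footprint.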

## 🔌 6. Team Model Integration (Optional)

In [None]:
class TeamModelIntegration:
    """Team model integration interface"""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.category_model = None
        self.sentiment_model = None
    
    def load_category_model(self):
        if not self.config.use_category_model:
            return
        # TODO: to be implemented by a team member
        print("⚠️ Category model not implemented yet")
    
    def load_sentiment_model(self):
        if not self.config.use_sentiment_model:
            return
        # TODO: to be implemented by a team member
        print("⚠️ Sentiment model not implemented yet")
    
    def predict_category(self, video_info: Dict) -> Optional[str]:
        if not self.config.use_category_model or self.category_model is None:
            return None
        # TODO: model inference
        return None
    
    def analyze_sentiment(self, comments: List[Dict]) -> Optional[Dict]:
        if not self.config.use_sentiment_model or self.sentiment_model is None:
            return None
        # TODO: model inference
        return None

team_models = TeamModelIntegration(config)
team_models.load_category_model()
team_models.load_sentiment_model()
print("✅ Team model interface ready")

## 📥 7. Data Loading

In [None]:
class DataLoader:
    def __init__(self, config: PipelineConfig):
        self.config = config
    
    def load_from_file(self, file_path: str) -> List[Dict]:
        print(f"\n📂 Loading: {file_path}")
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f"✅ Loaded {len(data)} videos")
        return data
    
    def get_data(self) -> List[Dict]:
        if self.config.use_youtube_api:
            raise NotImplementedError("YouTube API not implemented yet")
        return self.load_from_file(self.config.data_path)

data_loader = DataLoader(config)
print("✅ Data loader ready")
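
The loader does not check the shape of the JSON file. Judging from how later cells read the data, each item needs a `video_info` dict with a `video_id`, plus an optional `comments` list whose entries carry `text`, `like_count`, and `author`. The sketch below is a hypothetical pre-flight check (`validate_dataset` is not part of the pipeline), and the sample records are made up.

```python
from typing import Dict, List

def validate_dataset(data: List[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the shape is usable."""
    problems = []
    for i, item in enumerate(data):
        info = item.get("video_info")
        if not isinstance(info, dict):
            problems.append(f"item {i}: missing 'video_info'")
            continue
        if not info.get("video_id"):
            problems.append(f"item {i}: missing 'video_info.video_id'")
        for j, c in enumerate(item.get("comments", [])):
            if not isinstance(c.get("text"), str):
                problems.append(f"item {i}, comment {j}: 'text' is not a string")
    return problems

sample = [
    {"video_info": {"video_id": "abc123", "title": "Sample", "duration": "PT3M20S"},
     "comments": [{"text": "Great video!", "like_count": 5, "author": "user1"}]},
    {"comments": []},  # no video_info
]
print(validate_dataset(sample))
```

Running such a check right after `get_data()` would surface malformed items before any model time is spent on them.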

## 🎬 8. Report Generation Pipeline

In [None]:
class ReportGenerator:
    """Enhanced report generator with detailed logging"""
    
    def __init__(self, config, model_manager, preprocessor, team_models):
        self.config = config
        self.model_manager = model_manager
        self.preprocessor = preprocessor
        self.team_models = team_models
    
    def generate_video_summary(self, video_info: Dict) -> Tuple[str, str]:
        """Generate video summary (returns summary and detected language)"""
        # Preprocess
        video_info['description'] = self.preprocessor.clean_description(
            video_info.get('description', '')
        )
        video_info['duration_seconds'] = self.preprocessor.parse_duration(
            video_info.get('duration', 'PT0S')
        )
        
        # Detect language for logging
        detected_lang = 'Unknown'
        if self.config.detect_language:
            detected_lang = self.preprocessor.detect_language(
                video_info.get('title', '') + ' ' + 
                video_info.get('description', '')[:500]
            )
            if self.config.enable_detailed_logging:
                print(f"    [Detected input language: {detected_lang}]")
        
        # Generate prompt
        messages = PromptTemplates.get_video_summary_prompt(video_info)
        
        # Generate summary
        try:
            summary = self.model_manager.generate(messages)
            if self.config.enable_detailed_logging:
                print(f"    [Output language: {self.config.output_language}]")
            return summary.strip(), detected_lang
        except Exception as e:
            print(f"⚠️ Video summary error: {e}")
            return "Summary generation failed.", detected_lang
    
    def generate_reaction_summary(self, title: str, comments: List[Dict]) -> Tuple[str, str]:
        """Generate reaction summary (returns summary and language distribution)"""
        # Filter and format comments
        filtered_comments = self.preprocessor.filter_comments(comments)
        
        if not filtered_comments:
            return "No comments available.", "N/A"
        
        comments_text = self.preprocessor.format_comments_for_prompt(filtered_comments)
        
        # Detect language distribution
        lang_distribution = "N/A"
        if self.config.detect_language:
            sample_texts = [c.get('text', '')[:100] for c in filtered_comments[:10]]
            langs = [self.preprocessor.detect_language(t) for t in sample_texts if t]
            
            if langs:
                lang_dist = {}
                for lang in langs:
                    lang_dist[lang] = lang_dist.get(lang, 0) + 1
                lang_distribution = ', '.join([f"{k}: {v}" for k, v in lang_dist.items()])
                
                if self.config.enable_detailed_logging:
                    print(f"    [Comment languages: {lang_dist}]")
        
        # Generate prompt
        messages = PromptTemplates.get_reaction_summary_prompt(title, comments_text)
        
        # Generate summary
        try:
            summary = self.model_manager.generate(messages)
            if self.config.enable_detailed_logging:
                print(f"    [Output language: {self.config.output_language}]")
            return summary.strip(), lang_distribution
        except Exception as e:
            print(f"⚠️ Reaction summary error: {e}")
            return "Reaction summary generation failed.", lang_distribution
    
    def calculate_engagement_metrics(self, video_info: Dict) -> Dict:
        """Calculate engagement metrics"""
        views = video_info.get('view_count', 0)
        likes = video_info.get('like_count', 0)
        comments = video_info.get('comment_count', 0)
        
        engagement_rate = ((likes + comments) / views * 100) if views > 0 else 0
        like_rate = (likes / views * 100) if views > 0 else 0
        comment_rate = (comments / views * 100) if views > 0 else 0
        
        return {
            'views': views,
            'likes': likes,
            'comments': comments,
            'engagement_rate': round(engagement_rate, 2),
            'like_rate': round(like_rate, 2),
            'comment_rate': round(comment_rate, 2)
        }

print("✅ Report generator class defined (enhanced with language tracking)")
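
As a worked example of the rates computed by `calculate_engagement_metrics`, the standalone sketch below applies the same formulas to made-up counts. The function name `engagement_rates` is hypothetical.

```python
def engagement_rates(views: int, likes: int, comments: int) -> dict:
    """Percentages relative to views; all rates are 0 when views is 0."""
    if views <= 0:
        return {"engagement_rate": 0, "like_rate": 0, "comment_rate": 0}
    return {
        "engagement_rate": round((likes + comments) / views * 100, 2),
        "like_rate": round(likes / views * 100, 2),
        "comment_rate": round(comments / views * 100, 2),
    }

# Made-up counts for illustration.
rates = engagement_rates(views=200_000, likes=9_000, comments=1_000)
print(rates)
```

With these counts, 10,000 interactions on 200,000 views give an engagement rate of 5%, split into a 4.5% like rate and a 0.5% comment rate.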

In [None]:
def format_markdown_report(self, video_data: Dict, video_summary: str, 
                           reaction_summary: str, metrics: Dict,
                           team_predictions: Dict = None,
                           detected_languages: Dict = None) -> str:  # NEW parameter
    """Format markdown report with language detection info"""
    video_info = video_data['video_info']
    comments = video_data.get('comments', [])
    
    report = f"""# YouTube Video Report

**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Model**: {self.config.model_name}
**Pipeline Version**: 2.1 (Multilingual Native)

---

## 📹 Video Information

- **Title**: {video_info.get('title', 'N/A')}
- **Channel**: {video_info.get('channel_title', 'N/A')}
- **Category**: {video_info.get('category_name', 'N/A')}
- **Published**: {video_info.get('published_at', 'N/A')}
- **Duration**: {self.preprocessor.parse_duration(video_info.get('duration', 'PT0S'))} seconds
- **Video ID**: `{video_info.get('video_id', 'N/A')}`
- **URL**: https://www.youtube.com/watch?v={video_info.get('video_id', '')}
"""
    
    # Add language detection info (NEW)
    if detected_languages and self.config.enable_detailed_logging:
        report += "\n### 🌍 Detected Languages\n\n"
        if 'video' in detected_languages:
            report += f"- **Video content**: {detected_languages['video']}\n"
        if 'comments' in detected_languages:
            report += f"- **Comments**: {detected_languages['comments']}\n"
        report += f"- **Report output**: {self.config.output_language}\n"
    
    report += "\n---\n\n"
    
    report += f"""## 📊 Engagement Metrics

| Metric | Value |
|--------|-------|
| Views | {metrics['views']:,} |
| Likes | {metrics['likes']:,} |
| Comments | {metrics['comments']:,} |
| Engagement Rate | {metrics['engagement_rate']}% |
| Like Rate | {metrics['like_rate']}% |
| Comment Rate | {metrics['comment_rate']}% |

---

## 📝 Video Summary

{video_summary}

---

## 💬 Audience Reaction Summary

{reaction_summary}

---
"""
    
    if team_predictions:
        report += """## 🤖 Team Model Predictions\n\n"""
        if 'category' in team_predictions:
            report += f"- **Predicted Category**: {team_predictions['category']}\n"
        if 'sentiment' in team_predictions:
            s = team_predictions['sentiment']
            report += f"- **Sentiment Distribution**:\n"
            report += f"  - Positive: {s.get('positive',0):.1%}\n"
            report += f"  - Neutral: {s.get('neutral',0):.1%}\n"
            report += f"  - Negative: {s.get('negative',0):.1%}\n"
        report += "\n---\n\n"
    
    top_comments = self.preprocessor.filter_comments(comments)[:5]
    if top_comments:
        report += """## 🔍 Top Comments\n\n"""
        for i, c in enumerate(top_comments, 1):
            full_text = c.get('text', '')
            # Append an ellipsis only when the comment was actually truncated
            text = full_text[:200] + ("..." if len(full_text) > 200 else "")
            likes = c.get('like_count', 0)
            author = c.get('author', 'Anonymous')
            report += f"{i}. **{author}** ({likes} likes): {text}\n\n"
        report += "---\n\n"
    
    report += """## 📌 Technical Notes

- This report was automatically generated using LLM-based multilingual analysis
- Input content processed in original language(s) without translation layer
- Summaries generated through native multilingual understanding
- Output language fixed to English for consistency

---

*Generated by YouTube Report Generator - Phase 2 Full Pipeline*
"""
    return report

# Add method to ReportGenerator class
ReportGenerator.format_markdown_report = format_markdown_report
print("✅ Report formatting enhanced with language detection info")

In [None]:
def generate_report(self, video_data: Dict) -> str:
    """Generate complete report with enhanced logging"""
    video_info = video_data['video_info']
    comments = video_data.get('comments', [])
    
    print(f"\n🎬 Processing: {video_info.get('title', 'Unknown')}")
    
    # Collect language detection info
    detected_languages = {}
    
    print("  📝 Generating video summary...")
    video_summary, video_lang = self.generate_video_summary(video_info)
    detected_languages['video'] = video_lang
    
    print("  💬 Generating reaction summary...")
    reaction_summary, comment_langs = self.generate_reaction_summary(
        video_info.get('title', ''), comments
    )
    detected_languages['comments'] = comment_langs
    
    print("  📊 Calculating metrics...")
    metrics = self.calculate_engagement_metrics(video_info)
    
    # Team model predictions
    team_predictions = {}
    if self.config.use_category_model:
        pred = self.team_models.predict_category(video_info)
        if pred:
            team_predictions['category'] = pred
    
    if self.config.use_sentiment_model:
        sent = self.team_models.analyze_sentiment(comments)
        if sent:
            team_predictions['sentiment'] = sent
    
    print("  📄 Formatting report...")
    report = self.format_markdown_report(
        video_data, video_summary, reaction_summary,
        metrics,
        team_predictions if team_predictions else None,
        detected_languages  # NEW: Pass language info
    )
    
    print("  ✅ Done!")
    return report

# Add method
ReportGenerator.generate_report = generate_report

# Initialize report generator
report_generator = ReportGenerator(config, model_manager, preprocessor, team_models)
print("✅ Report generator initialized (enhanced)")

## 🚀 9. Run Pipeline

In [None]:
# Load data
dataset = data_loader.get_data()

# Select videos using config (no hardcoding)
videos_to_process = dataset[:config.num_videos_for_test]

print(f"\n🎯 Processing {len(videos_to_process)} videos...")
print(f"   (Configured test size: {config.num_videos_for_test})")
print("="*60)

In [None]:
# Generate reports
reports = []

for i, video_data in enumerate(videos_to_process, 1):
    print(f"\n{'='*60}")
    print(f"Video {i}/{len(videos_to_process)}")
    print(f"{'='*60}")
    
    try:
        report = report_generator.generate_report(video_data)
        reports.append({
            'video_id': video_data['video_info']['video_id'],
            'report': report
        })
    except Exception as e:
        print(f"❌ Error: {e}")
        continue

print(f"\n{'='*60}")
print(f"✅ Completed {len(reports)}/{len(videos_to_process)} videos")
print(f"{'='*60}")

## 💾 10. Save Reports

In [None]:
import os

if config.save_reports:
    os.makedirs(config.output_dir, exist_ok=True)
    
    print(f"\n💾 Saving to: {config.output_dir}/")
    
    for report_data in reports:
        video_id = report_data['video_id']
        report = report_data['report']
        
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"{config.output_dir}/report_{video_id}_{timestamp}.md"
        
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(report)
        
        print(f"  ✅ {filename}")
    
    print(f"\n✅ All reports saved!")

## 📊 11. Display Sample Report

In [None]:
if reports:
    print("\n" + "="*60)
    print("Sample Report (First Video)")
    print("="*60 + "\n")
    print(reports[0]['report'])
else:
    print("⚠️ No reports generated")

## 📈 12. Language Statistics Summary

Track language distribution across processed videos.

In [None]:
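# Display language statistics
# ReportGenerator does not keep per-video language data, so the counts are
# recomputed here with the same sampling rules used during report generation.
from collections import Counter

if reports:
    print("\n" + "="*60)
    print("Language Distribution Summary")
    print("="*60)
    
    processed_ids = {r['video_id'] for r in reports}
    lang_stats = {'video': Counter(), 'comments': Counter()}
    
    for video_data in videos_to_process:
        info = video_data['video_info']
        if info.get('video_id') not in processed_ids:
            continue
        text = info.get('title', '') + ' ' + info.get('description', '')[:500]
        lang_stats['video'][preprocessor.detect_language(text)] += 1
        for c in preprocessor.filter_comments(video_data.get('comments', []))[:10]:
            sample = c.get('text', '')[:100]
            if sample:
                lang_stats['comments'][preprocessor.detect_language(sample)] += 1
    
    if lang_stats['video']:
        print("\n📹 Video Metadata Languages:")
        for lang, count in lang_stats['video'].most_common():
            print(f"  {lang}: {count} videos")
    
    if lang_stats['comments']:
        print("\n💬 Comment Languages (sample of top 10 per video):")
        for lang, count in lang_stats['comments'].most_common():
            print(f"  {lang}: {count} occurrences")
    
    print("\n" + "="*60)
    print("\n💡 Insight: This shows the pipeline's multilingual processing capability")
    print("   All outputs are in English regardless of input language mix.")
else:
    print("⚠️ No reports to analyze")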

## 🎯 13. End-to-End Wrapper Function

High-level function for easy pipeline usage.

In [None]:
def analyze_youtube_video(video_data: Dict, 
                          report_generator: ReportGenerator,
                          save_report: bool = True) -> Dict:
    """
    End-to-end YouTube video analysis pipeline.
    
    Args:
        video_data: Dict containing 'video_info' and 'comments'
        report_generator: Initialized ReportGenerator instance
        save_report: Whether to save markdown report to file
    
    Returns:
        Dict with analysis results:
        {
            'video_id': str,
            'title': str,
            'input_language': str,
            'output_language': str,
            'video_summary': str,
            'reaction_summary': str,
            'engagement_metrics': Dict,
            'comment_language_distribution': Dict,
            'markdown_report': str,
            'report_path': Optional[str]
        }
    """
    video_info = video_data['video_info']
    video_id = video_info['video_id']
    
    print(f"\n🎬 Analyzing: {video_info.get('title', 'Unknown')}")
    
    # Generate report
    report_text = report_generator.generate_report(video_data)
    
    # Extract components (these calls may repeat LLM generation already done inside generate_report)
    video_summary = report_generator.generate_video_summary(video_info)
    reaction_summary = report_generator.generate_reaction_summary(
        video_info['title'], 
        video_data.get('comments', [])
    )
    metrics = report_generator.calculate_engagement_metrics(video_info)
    
    # Detect languages
    input_lang = report_generator.preprocessor.detect_language(
        video_info.get('title', '') + ' ' + video_info.get('description', '')[:200]
    )
    
    # Save if requested
    report_path = None
    if save_report:
        import os
        os.makedirs('reports', exist_ok=True)
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        report_path = f"reports/report_{video_id}_{timestamp}.md"
        with open(report_path, 'w', encoding='utf-8') as f:
            f.write(report_text)
        print(f"  ✅ Report saved: {report_path}")
    
    # Return structured results
    return {
        'video_id': video_id,
        'title': video_info.get('title', 'N/A'),
        'input_language': input_lang,
        'output_language': report_generator.config.output_language,
        'video_summary': video_summary,
        'reaction_summary': reaction_summary,
        'engagement_metrics': metrics,
        'comment_language_distribution': getattr(report_generator, 'current_video_lang_dist', {}),
        'markdown_report': report_text,
        'report_path': report_path
    }

print("✅ End-to-end wrapper function defined")
print("\n💡 Usage example:")
print("   result = analyze_youtube_video(video_data, report_generator)")
print("   print(result['video_summary'])")

In [None]:
# Example: Use wrapper function on first video
if dataset:
    print("\n" + "="*60)
    print("Example: End-to-End Analysis")
    print("="*60)
    
    result = analyze_youtube_video(
        dataset[0], 
        report_generator,
        save_report=False  # Don't save to avoid duplication
    )
    
    print("\n📊 Analysis Result:")
    print(f"  Video ID: {result['video_id']}")
    print(f"  Title: {result['title']}")
    print(f"  Input Language: {result['input_language']}")
    print(f"  Output Language: {result['output_language']}")
    print(f"  Engagement Rate: {result['engagement_metrics']['engagement_rate']}%")
    print(f"  Comment Languages: {result['comment_language_distribution']}")
    print(f"\n  Summary Preview: {result['video_summary'][:150]}...")

## 🎓 14. Usage Tips & Best Practices

### 🔧 Configuration Best Practices

```python
# For research/experiments
config = PipelineConfig(
    num_videos_for_test=5,         # Process subset
    log_token_counts=True,          # Track token usage
    log_language_distribution=True, # Monitor language mix
    temperature=0.7                 # Balanced creativity
)

# For production
config = PipelineConfig(
    num_videos_for_test=None,       # Process all
    log_token_counts=False,         # Reduce logging
    temperature=0.5                 # More deterministic
)
```
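
Since `PipelineConfig` is declared as a dataclass in section 2, one way to keep the two setups in sync is to derive the production config from the research config with `dataclasses.replace`. The sketch below uses a stand-in `DemoConfig` that mirrors only the fields shown above. Its default values are assumptions made for illustration, not the real defaults.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass
class DemoConfig:
    """Stand-in for PipelineConfig; only the fields used above are mirrored."""
    num_videos_for_test: Optional[int] = 5
    log_token_counts: bool = True
    log_language_distribution: bool = True
    temperature: float = 0.7

research = DemoConfig()
# replace() returns a new instance and leaves `research` untouched
production = replace(
    research,
    num_videos_for_test=None,
    log_token_counts=False,
    temperature=0.5,
)
print(production)
```

Fields that are not overridden, such as `log_language_distribution`, carry over from the base config unchanged.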

### 🌍 Multilingual Processing

**The pipeline automatically handles:**
```python
# Korean video ‚Üí English report
Input:  Title: "NMIXX(ÏóîÎØπÏä§) 'Blue Valentine' M/V"
        Comments: "Ïù¥ ÎÖ∏Îûò ÏßÑÏßú Ï¢ãÎã§", "ÏôÑÏ†Ñ ÎåÄÎ∞ï"
Output: English summary with [Detected: Korean] in logs

# Mixed language ‚Üí English report  
Input:  Comments: "Ïù¥ ÎÖ∏Îûò beautifulÌïòÎã§", "choreography ÏßÑÏßú amazing"
Output: English summary understanding code-switching
        [Comment languages: {'Korean': 5, 'English': 3}]
```

### 📊 Monitoring & Debugging

**Check logs for:**
```
🎬 Processing: NMIXX(엔믹스) "Blue Valentine" M/V
  📝 Generating video summary...
    [Detected input language: Korean]
    [Input tokens: 450]
    [Output tokens: 120]
    [Generation ratio: 120/450 = 0.27x]
    [Output language: English]
  💬 Generating reaction summary...
    [Comment languages (top 10): {'English': 7, 'Korean': 3}]
    [Input tokens: 850]
    [Output tokens: 150]
    [Output language: English]
```
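
If the console output is captured to a string, the token counts can be pulled back out for later comparison. The sketch below is an illustration that assumes the log lines keep the exact `[Input tokens: N]` / `[Output tokens: N]` format shown above. `token_ratios` is a hypothetical helper, not something the pipeline provides.

```python
import re
from typing import Dict, List

TOKEN_LINE = re.compile(r"\[(Input|Output) tokens: (\d+)\]")

def token_ratios(log_text: str) -> List[Dict[str, float]]:
    """Pair each 'Input tokens' line with the next 'Output tokens' line."""
    results = []
    pending_input = None
    for kind, value in TOKEN_LINE.findall(log_text):
        if kind == "Input":
            pending_input = int(value)
        elif pending_input:  # skip outputs with no (or zero) preceding input
            out = int(value)
            results.append({
                "input": pending_input,
                "output": out,
                "ratio": round(out / pending_input, 2),
            })
            pending_input = None
    return results

sample_log = """
    [Input tokens: 450]
    [Output tokens: 120]
    [Generation ratio: 120/450 = 0.27x]
    [Input tokens: 850]
    [Output tokens: 150]
"""
for row in token_ratios(sample_log):
    print(row)
```

The `Generation ratio` line is ignored by the pattern, so the ratio is always recomputed from the two counts.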

### 🎯 Using the Wrapper Function

```python
# Simple usage
result = analyze_youtube_video(video_data, report_generator)

# Access structured results
print(f"Summary: {result['video_summary']}")
print(f"Sentiment: {result['reaction_summary']}")
print(f"Metrics: {result['engagement_metrics']}")
print(f"Languages: {result['comment_language_distribution']}")
```
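
Because the wrapper returns a plain dict, results from several videos can be written out as JSON Lines for later analysis. This is a minimal sketch: `results_to_jsonl` and the choice of `SUMMARY_KEYS` are hypothetical, and the demo record only imitates the shape of the wrapper's return value.

```python
import json
from typing import Dict, List

SUMMARY_KEYS = ("video_id", "title", "input_language", "video_summary")

def results_to_jsonl(results: List[Dict], keys=SUMMARY_KEYS) -> str:
    """Serialize selected fields of each result as one JSON object per line."""
    lines = []
    for result in results:
        row = {key: result.get(key) for key in keys}
        # ensure_ascii=False keeps Korean/Japanese titles readable in the file
        lines.append(json.dumps(row, ensure_ascii=False))
    return "\n".join(lines)

# Illustrative record shaped like the wrapper's return value (values are made up)
demo = [{"video_id": "abc123", "title": "데모", "input_language": "Korean",
         "video_summary": "A short demo summary.", "markdown_report": "# ..."}]
print(results_to_jsonl(demo))
```

Leaving `markdown_report` out of the selected keys keeps each line short. The full report can still be saved separately through `save_report=True`.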

### 🔧 Experiment Design

**Temperature Ablation:**
```python
for temp in [0.3, 0.5, 0.7, 0.9]:
    config.temperature = temp
    # Regenerate and compare outputs
```
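
The loop above can be fleshed out so that each run is recorded for side-by-side comparison. The sketch below is an assumption-laden illustration: `run_temperature_ablation` is a hypothetical helper, and `fake_generate` is a stub standing in for a real summary call made with the given temperature.

```python
from typing import Callable, Dict, List

def run_temperature_ablation(
    generate_fn: Callable[[float], str],
    temperatures: List[float],
) -> List[Dict]:
    """Call generate_fn once per temperature and record simple output statistics."""
    rows = []
    for temp in temperatures:
        text = generate_fn(temp)
        rows.append({
            "temperature": temp,
            "num_words": len(text.split()),
            "text": text,
        })
    return rows

# Stub generator for illustration; in the notebook this would wrap a real
# summary call made with a config whose temperature is set to `temp`.
def fake_generate(temp: float) -> str:
    return "summary " * round(temp * 10)

for row in run_temperature_ablation(fake_generate, [0.3, 0.5, 0.7, 0.9]):
    print(row["temperature"], row["num_words"])
```

Whether changing `config.temperature` after the generator has been created takes effect depends on how the generator reads its config, so it is worth confirming that the value actually reaches the generation call.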

**Language-specific Analysis:**
```python
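# Filter dataset by title language (langdetect returns ISO 639-1 codes such as 'ko', 'en')
def title_lang(video):
    return detect(video['video_info']['title'])  # raises LangDetectException on empty text

korean_videos = [v for v in dataset if title_lang(v) == 'ko']
english_videos = [v for v in dataset if title_lang(v) == 'en']
# Compare quality metrics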
```
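
Statistical language detection may be unreliable on short, mixed-script titles, so a script-based check can serve as a complement when splitting the dataset. The sketch below counts characters by Unicode block. `script_counts` and `is_mixed_script` are hypothetical helpers, and the heuristic deliberately ignores Kanji, which cannot be told apart from Chinese by code point alone.

```python
from typing import Dict

def script_counts(text: str) -> Dict[str, int]:
    """Count characters by script using Unicode code point ranges."""
    counts = {"hangul": 0, "kana": 0, "latin": 0}
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3:          # Hangul Syllables block
            counts["hangul"] += 1
        elif 0x3040 <= code <= 0x30FF:        # Hiragana and Katakana blocks
            counts["kana"] += 1
        elif ch.isascii() and ch.isalpha():   # basic Latin letters
            counts["latin"] += 1
    return counts

def is_mixed_script(text: str) -> bool:
    """True when at least two scripts appear in the text."""
    return sum(1 for n in script_counts(text).values() if n > 0) >= 2

title = "NMIXX(엔믹스) 'Blue Valentine' M/V"
print(script_counts(title), is_mixed_script(title))
```

A title flagged as mixed could be routed to a separate "mixed language" bucket instead of being forced into Korean or English.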

### 📈 Performance Expectations

**Expected quality and typical token counts (from token logging):**
- Korean input: **7-9/10**, ~400-600 input tokens, ~100-150 output tokens
- English input: **8-9/10**, ~300-500 input tokens, ~100-150 output tokens
- Mixed language: **7-8/10**, ~500-700 input tokens, ~120-180 output tokens

**Speed (T4 GPU):**
- Model loading: 3-5 min (first time)
- Per video: 2-3 min
- 10 videos: ~25-35 min
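
The batch figure follows from the other two: total time is roughly the one-time model load plus the per-video time multiplied by the number of videos. A small sketch of that arithmetic, using the ranges listed above as defaults (`estimate_runtime_minutes` is a hypothetical helper):

```python
from typing import Tuple

def estimate_runtime_minutes(
    num_videos: int,
    load_range: Tuple[float, float] = (3, 5),       # model loading, first run
    per_video_range: Tuple[float, float] = (2, 3),  # per-video generation
) -> Tuple[float, float]:
    """Return (low, high) total minutes: loading time plus per-video time."""
    low = load_range[0] + num_videos * per_video_range[0]
    high = load_range[1] + num_videos * per_video_range[1]
    return low, high

print(estimate_runtime_minutes(10))
print(estimate_runtime_minutes(20))
```

For 10 videos this gives 23 to 35 minutes, in line with the estimate above. For 20 videos the same ranges give 43 to 65 minutes.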

### 🐛 Troubleshooting

**Issue: Generation fails**
```
Check logs:
❌ Generation failed for video_summary
   Error: CUDA out of memory
   Message count: 2
   Total message length: 4500 chars

Solution: Reduce max_description_length or use a smaller model
```
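
Shortening the input is usually the cheaper of the two fixes. How the pipeline applies `max_description_length` internally is not shown here, so the sketch below is only an illustration of the idea: `truncate_text` is a hypothetical helper that cuts at a word boundary where possible.

```python
def truncate_text(text: str, max_chars: int) -> str:
    """Cut text to at most max_chars, preferring to break at the last space."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_space = cut.rfind(" ")
    # Break at a word boundary unless that would discard most of the text
    if last_space > max_chars // 2:
        cut = cut[:last_space]
    return cut.rstrip()

description = "Official music video. " * 300   # about 6600 characters
short = truncate_text(description, 500)
print(len(description), len(short))
```

Character limits are only a proxy for token counts, so the logged input token numbers remain the better signal for whether the reduction was enough.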

**Issue: Poor quality for rare languages**
```
Add the language to the system prompt:
PromptTemplates.VIDEO_SUMMARY_SYSTEM += (
    " You also understand Arabic, Hindi, Thai..."
)
```

---

### 🎯 Next Steps for Research

1. **Prompt Ablation Study**
   - Compare short vs long prompts
   - With/without examples
   - Measure quality vs token cost

2. **Language-specific Evaluation**
   - Human evaluation (5 videos × 3 languages)
   - Rate accuracy, fluency, completeness (1-5)
   - Compare across language pairs

3. **Integration with Classification Models**
   - Add category classifier to TeamModelIntegration
   - Add sentiment analyzer
   - Compare LLM summaries vs model predictions

4. **End-to-End System Evaluation**
   - Process full dataset (20 videos)
   - Generate language statistics report
   - Analyze quality across different content types
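
For the human evaluation in step 2, the 1-5 ratings can be averaged per input language to compare language pairs. The sketch below assumes each rating is a dict with a `language` key plus the three criteria. That record layout and `mean_scores_by_language` are assumptions for illustration, and the ratings are placeholders rather than real evaluation data.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

CRITERIA = ("accuracy", "fluency", "completeness")

def mean_scores_by_language(ratings: List[Dict]) -> Dict[str, Dict[str, float]]:
    """Average each 1-5 criterion over all rated videos, grouped by input language."""
    grouped = defaultdict(list)
    for rating in ratings:
        grouped[rating["language"]].append(rating)
    return {
        lang: {c: round(mean(r[c] for r in rows), 2) for c in CRITERIA}
        for lang, rows in grouped.items()
    }

# Placeholder ratings to show the expected shape; not real evaluation data
ratings = [
    {"language": "Korean", "accuracy": 4, "fluency": 5, "completeness": 3},
    {"language": "Korean", "accuracy": 2, "fluency": 3, "completeness": 3},
    {"language": "English", "accuracy": 5, "fluency": 5, "completeness": 4},
]
print(mean_scores_by_language(ratings))
```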

---

✅ **Phase 2 Complete: Multilingual LLM-based Summarization Pipeline**

**Key Achievements:**
- Native multilingual understanding (Ko/En/Ja/Mixed)
- Fixed English output
- Comprehensive logging and monitoring
- End-to-end wrapper function
- Language statistics tracking
- Modular, extensible architecture

**Ready for Phase 3:**
- Team model integration (category/sentiment classification)
- YouTube API integration (link → report)
- Evaluation framework (human study)
- Production deployment