# YouTube Report Generator - Full Pipeline (Phase 2)

**Version**: 2.0  
**Date**: 2025-12-02  
**Model**: Llama-3.1-8B-Instruct (4-bit quantization)  
**Environment**: Google Colab (T4 GPU recommended)

---

## 📋 Overview

This pipeline is an improved version of Phase 1 and includes the following key improvements:

### Key Improvements
- ✅ **Multilingual Support**: Llama-3.1-8B (improved Korean quality)
- ✅ **Modular Design**: model, prompts, and preprocessing split into modules
- ✅ **Extensible Architecture**: ready for team model integration
- ✅ **Better Preprocessing**: URL removal, language detection
- ✅ **YouTube API Integration**: groundwork for real-time data collection

### Expected Quality
- Korean: 0-2/10 → **6-8/10**
- English: 4-8/10 → **7-9/10**

---

## 📦 1. Setup & Installation

In [None]:
# Install required packages
!pip install -q transformers accelerate bitsandbytes torch
!pip install -q langdetect isodate
!pip install -q google-api-python-client  # For YouTube API (optional)

In [None]:
import json
import re
import warnings
from datetime import datetime
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, field

import torch
import isodate
from langdetect import detect, LangDetectException
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

## ⚙️ 2. Configuration

All pipeline settings can be controlled from here.

In [None]:
@dataclass
class PipelineConfig:
    """Configuration for the entire pipeline."""
    
    # Model Configuration
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"
    use_4bit: bool = True
    max_new_tokens: int = 512
    temperature: float = 0.7
    top_p: float = 0.9
    
    # Data Configuration
    use_youtube_api: bool = False
    youtube_api_key: Optional[str] = None
    data_path: str = "full_dataset_20251013_215347.json"
    
    # Processing Configuration
    max_description_length: int = 2000
    max_comments_to_process: int = 50
    min_comment_length: int = 10
    remove_urls: bool = True
    detect_language: bool = True
    
    # Team Model Integration
    use_category_model: bool = False
    use_sentiment_model: bool = False
    category_model_path: Optional[str] = None
    sentiment_model_path: Optional[str] = None
    
    # Output Configuration
    output_format: str = "markdown"
    save_reports: bool = True
    output_dir: str = "reports"

# Create configuration
config = PipelineConfig(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    use_4bit=True,
    use_youtube_api=False,
    data_path="full_dataset_20251013_215347.json"
)

print("="*60)
print("Pipeline Configuration")
print("="*60)
print(f"Model: {config.model_name}")
print(f"4-bit: {config.use_4bit}")
print(f"Temperature: {config.temperature}")
print(f"YouTube API: {config.use_youtube_api}")
print(f"Category Model: {config.use_category_model}")
print(f"Sentiment Model: {config.use_sentiment_model}")
print("="*60)
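Because the configuration is a plain dataclass, a variant for a quick experiment can be derived without editing the defaults. The sketch below is only an illustration: `DemoConfig` is a hypothetical, trimmed stand-in for `PipelineConfig` (three of its fields) so that the snippet runs on its own.

```python
from dataclasses import dataclass, replace


@dataclass
class DemoConfig:
    """Hypothetical trimmed stand-in for PipelineConfig (three fields only)."""
    model_name: str = "meta-llama/Llama-3.1-8B-Instruct"
    temperature: float = 0.7
    max_comments_to_process: int = 50


base = DemoConfig()

# replace() returns a new object; `base` itself is left unchanged.
quick_test = replace(base, temperature=0.2, max_comments_to_process=10)

print(quick_test.temperature, quick_test.max_comments_to_process)
print(base.temperature, base.max_comments_to_process)
```

Objects such as `TextPreprocessor(config)` and `ModelManager(config)` store a reference to the config they were built with, so they would need to be re-created to pick up a new one.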

## 🧩 3. Prompt Templates

In [None]:
class PromptTemplates:
    """Manages the prompt templates."""
    
    VIDEO_SUMMARY_SYSTEM = """You are an expert content analyst specializing in YouTube video analysis.
Your task is to create concise, informative summaries based on video metadata.
Focus on accuracy and avoid hallucinations."""
    
    VIDEO_SUMMARY_USER = """Analyze this YouTube video and create a summary.

Video Information:
- Title: {title}
- Description: {description}
- Category: {category}
- Duration: {duration} seconds
- Channel: {channel}

Instructions:
1. Write a 3-5 sentence summary in {language}
2. Focus on what the video is about and key highlights
3. Do NOT make up information
4. Be factual and avoid speculation

Summary:"""
    
    REACTION_SUMMARY_SYSTEM = """You are an expert in social media sentiment analysis.
Analyze YouTube comments and summarize audience reactions."""
    
    REACTION_SUMMARY_USER = """Analyze these YouTube comments and summarize the audience reaction.

Video: {title}

Top Comments:
{comments}

Instructions:
1. Write a 3-5 sentence summary in {language}
2. Identify overall sentiment (positive, negative, mixed)
3. Highlight common themes
4. Be objective and balanced

Audience Reaction:"""
    
    CUSTOM_PROMPT = None
    
    @classmethod
    def get_video_summary_prompt(cls, video_info: Dict, language: str = "Korean") -> List[Dict]:
        if cls.CUSTOM_PROMPT:
            return cls.CUSTOM_PROMPT
        return [
            {"role": "system", "content": cls.VIDEO_SUMMARY_SYSTEM},
            {"role": "user", "content": cls.VIDEO_SUMMARY_USER.format(
                title=video_info.get('title', 'N/A'),
                description=video_info.get('description', 'N/A')[:2000],
                category=video_info.get('category_name', 'N/A'),
                duration=video_info.get('duration_seconds', 'N/A'),
                channel=video_info.get('channel_title', 'N/A'),
                language=language
            )}
        ]
    
    @classmethod
    def get_reaction_summary_prompt(cls, title: str, comments: str, language: str = "Korean") -> List[Dict]:
        if cls.CUSTOM_PROMPT:
            return cls.CUSTOM_PROMPT
        return [
            {"role": "system", "content": cls.REACTION_SUMMARY_SYSTEM},
            {"role": "user", "content": cls.REACTION_SUMMARY_USER.format(
                title=title, comments=comments, language=language
            )}
        ]

print("✅ Prompt templates loaded")

## 🔧 4. Text Preprocessing

In [None]:
class TextPreprocessor:
    """Text preprocessing."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.url_pattern = re.compile(r'http[s]?://\S+')
    
    def remove_urls(self, text: str) -> str:
        return self.url_pattern.sub('', text)
    
    def detect_language(self, text: str) -> str:
        try:
            lang = detect(text)
            lang_map = {'ko': 'Korean', 'en': 'English', 'ja': 'Japanese'}
            return lang_map.get(lang, 'English')
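        except (LangDetectException, TypeError):
            # Empty or undetectable text, or a non-string value: fall back to English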
            return 'English'
    
    def clean_description(self, description: str) -> str:
        if not description:
            return "No description available."
        if self.config.remove_urls:
            description = self.remove_urls(description)
        description = ' '.join(description.split())
        if len(description) > self.config.max_description_length:
            description = description[:self.config.max_description_length] + "..."
        if len(description.strip()) < 20:
            return "Minimal description available."
        return description
    
    def filter_comments(self, comments: List[Dict]) -> List[Dict]:
        filtered = [c for c in comments if len(c.get('text', '')) >= self.config.min_comment_length]
        filtered.sort(key=lambda x: x.get('like_count', 0), reverse=True)
        return filtered[:self.config.max_comments_to_process]
    
    def format_comments_for_prompt(self, comments: List[Dict]) -> str:
        if not comments:
            return "No comments available."
        formatted = []
        for i, comment in enumerate(comments, 1):
            text = comment.get('text', '')
            likes = comment.get('like_count', 0)
            if self.config.remove_urls:
                text = self.remove_urls(text)
            formatted.append(f"{i}. [{likes} likes] {text}")
        return "\n".join(formatted)
    
    def parse_duration(self, duration_str: str) -> int:
        try:
            duration = isodate.parse_duration(duration_str)
            return int(duration.total_seconds())
        except Exception:
            # Malformed or missing ISO 8601 duration: treat it as zero seconds
            return 0

preprocessor = TextPreprocessor(config)
print("✅ Text preprocessor initialized")
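To make the comment path concrete, here is a minimal standalone sketch of what `filter_comments` followed by `format_comments_for_prompt` does. It mirrors the logic above with plain functions; the sample comments are made up for illustration.

```python
import re

# Same URL pattern as TextPreprocessor uses
URL_PATTERN = re.compile(r'http[s]?://\S+')


def strip_urls(text: str) -> str:
    """Delete anything that looks like an http(s) URL."""
    return URL_PATTERN.sub('', text)


# Made-up comments, for illustration only
sample_comments = [
    {"text": "Great video! More at https://example.com/x", "like_count": 12},
    {"text": "lol", "like_count": 99},  # shorter than min_len, so dropped
    {"text": "The editing in the second half was excellent", "like_count": 40},
]

min_len = 10  # corresponds to min_comment_length

# Keep long-enough comments, most-liked first
kept = [c for c in sample_comments if len(c["text"]) >= min_len]
kept.sort(key=lambda c: c["like_count"], reverse=True)

# Number each comment and prefix its like count, as the prompt expects
lines = [
    f"{i}. [{c['like_count']} likes] {strip_urls(c['text'])}"
    for i, c in enumerate(kept, 1)
]
prompt_block = "\n".join(lines)
print(prompt_block)
```

Note that the length filter runs on the raw text, before URLs are removed, so a comment consisting mostly of a link can pass the filter and end up nearly empty in the prompt.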

## 🤖 5. Model Loading

In [None]:
class ModelManager:
    """Manages the LLM."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.model = None
        self.tokenizer = None
        self.pipe = None
    
    def load_model(self):
        print(f"\n🔄 Loading: {self.config.model_name}")
        print(f"⚙️ 4-bit: {self.config.use_4bit}")
        
        # Tokenizer
        print("Loading tokenizer...")
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.config.model_name, trust_remote_code=True
        )
        self.tokenizer.pad_token = self.tokenizer.eos_token
        
        # Quantization config
        if self.config.use_4bit:
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )
        else:
            quantization_config = None
        
        # Model
        print("Loading model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.config.model_name,
            quantization_config=quantization_config,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype=torch.float16 if not self.config.use_4bit else None
        )
        
        # Pipeline
        print("Creating pipeline...")
        self.pipe = pipeline(
            "text-generation",
            model=self.model,
            tokenizer=self.tokenizer,
            max_new_tokens=self.config.max_new_tokens,
            temperature=self.config.temperature,
            top_p=self.config.top_p,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id
        )
        
        print("✅ Model loaded!")
    
    def generate(self, messages: List[Dict]) -> str:
        if self.pipe is None:
            raise RuntimeError("Model not loaded")
        outputs = self.pipe(messages)
        generated_text = outputs[0]["generated_text"]
        if isinstance(generated_text, list):
            return generated_text[-1]["content"]
        return generated_text

model_manager = ModelManager(config)
model_manager.load_model()
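A rough back-of-the-envelope calculation suggests why 4-bit loading matters on a T4, which has 16 GB of memory. The sketch below only counts weight storage at a uniform bit width and assumes a nominal 8 billion parameters; real usage is higher, since some layers typically stay in 16-bit and the KV cache and activations need room too.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a uniform bit width."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 8e9  # assumption: nominal "8B" parameter count

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"fp16 weights : ~{fp16_gib:.1f} GiB")
print(f"4-bit weights: ~{int4_gib:.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to filling the card, while the 4-bit weights leave most of it free for generation.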

## 🔌 6. Team Model Integration (Optional)

In [None]:
class TeamModelIntegration:
    """Integration interface for team models."""
    
    def __init__(self, config: PipelineConfig):
        self.config = config
        self.category_model = None
        self.sentiment_model = None
    
    def load_category_model(self):
        if not self.config.use_category_model:
            return
        # TODO: to be implemented by a team member
        print("⚠️ Category model not implemented yet")
    
    def load_sentiment_model(self):
        if not self.config.use_sentiment_model:
            return
        # TODO: to be implemented by a team member
        print("⚠️ Sentiment model not implemented yet")
    
    def predict_category(self, video_info: Dict) -> Optional[str]:
        if not self.config.use_category_model or self.category_model is None:
            return None
        # TODO: model inference
        return None
    
    def analyze_sentiment(self, comments: List[Dict]) -> Optional[Dict]:
        if not self.config.use_sentiment_model or self.sentiment_model is None:
            return None
        # TODO: model inference
        return None

team_models = TeamModelIntegration(config)
team_models.load_category_model()
team_models.load_sentiment_model()
print("✅ Team model interface ready")
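Until a real sentiment model exists, it may help to see the return shape the report formatter expects from `analyze_sentiment`: a dict with `positive`, `neutral`, and `negative` fractions. The sketch below is a hypothetical keyword-based placeholder, not the team's model; the word lists and the function name are invented for illustration.

```python
from typing import Dict, List

# Hypothetical word lists, for illustration only
POSITIVE_WORDS = {"great", "love", "excellent"}
NEGATIVE_WORDS = {"bad", "boring", "hate"}


def keyword_sentiment(comments: List[Dict]) -> Dict[str, float]:
    """Label each comment by keyword match and return class fractions."""
    counts = {"positive": 0, "neutral": 0, "negative": 0}
    for c in comments:
        words = set(c.get("text", "").lower().split())
        if words & POSITIVE_WORDS:
            counts["positive"] += 1
        elif words & NEGATIVE_WORDS:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {label: n / total for label, n in counts.items()}


demo = [
    {"text": "I love this"},
    {"text": "so boring"},
    {"text": "first"},
    {"text": "great edit"},
]
result = keyword_sentiment(demo)
print(result)
```

The fractions are rendered in the report with the `:.0%` format, so values should lie between 0 and 1. In the class above, `analyze_sentiment` keeps returning `None` until `self.sentiment_model` is set and the inference TODO is filled in.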

## 📥 7. Data Loading

In [None]:
class DataLoader:
    def __init__(self, config: PipelineConfig):
        self.config = config
    
    def load_from_file(self, file_path: str) -> List[Dict]:
        print(f"\n📂 Loading: {file_path}")
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        print(f"✅ Loaded {len(data)} videos")
        return data
    
    def get_data(self) -> List[Dict]:
        if self.config.use_youtube_api:
            raise NotImplementedError("YouTube API not implemented yet")
        return self.load_from_file(self.config.data_path)

data_loader = DataLoader(config)
print("✅ Data loader ready")
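The YouTube API branch is not implemented yet. As a possible starting point, the sketch below maps one item of a YouTube Data API v3 `videos.list` response (parts `snippet`, `contentDetails`, `statistics`) onto the `video_info` keys the rest of this pipeline reads. The function name and the sample item are hypothetical, and no network call is made here.

```python
from typing import Dict


def api_item_to_video_info(item: Dict) -> Dict:
    """Convert one videos.list item into the pipeline's video_info shape."""
    snippet = item.get("snippet", {})
    stats = item.get("statistics", {})
    details = item.get("contentDetails", {})
    return {
        "video_id": item.get("id", ""),
        "title": snippet.get("title", ""),
        "description": snippet.get("description", ""),
        "channel_title": snippet.get("channelTitle", ""),
        "published_at": snippet.get("publishedAt", ""),
        # The API exposes only a category id; mapping it to a name would
        # need a separate lookup, so the id is used as a placeholder here.
        "category_name": snippet.get("categoryId", "N/A"),
        "duration": details.get("duration", "PT0S"),  # ISO 8601 string
        # Counts arrive as strings in the JSON; cast them for the metric math.
        "view_count": int(stats.get("viewCount", 0)),
        "like_count": int(stats.get("likeCount", 0)),
        "comment_count": int(stats.get("commentCount", 0)),
    }


# Made-up response item, for illustration only
sample_item = {
    "id": "abc123",
    "snippet": {"title": "Demo", "channelTitle": "Demo Channel",
                "description": "A demo video.", "categoryId": "22",
                "publishedAt": "2025-01-01T00:00:00Z"},
    "contentDetails": {"duration": "PT4M13S"},
    "statistics": {"viewCount": "1000", "likeCount": "50"},
}
info = api_item_to_video_info(sample_item)
print(info["view_count"], info["like_count"], info["comment_count"])
```

Missing statistics (for example, when comments are disabled) fall back to 0 here, which keeps `calculate_engagement_metrics` from failing on absent keys.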

## 🎬 8. Report Generation Pipeline

In [None]:
class ReportGenerator:
    def __init__(self, config, model_manager, preprocessor, team_models):
        self.config = config
        self.model_manager = model_manager
        self.preprocessor = preprocessor
        self.team_models = team_models
    
    def generate_video_summary(self, video_info: Dict) -> str:
        video_info['description'] = self.preprocessor.clean_description(
            video_info.get('description', '')
        )
        video_info['duration_seconds'] = self.preprocessor.parse_duration(
            video_info.get('duration', 'PT0S')
        )
        
        language = "Korean"
        if self.config.detect_language:
            title_lang = self.preprocessor.detect_language(video_info.get('title', ''))
            desc_lang = self.preprocessor.detect_language(video_info.get('description', ''))
            language = title_lang if title_lang != 'English' else desc_lang
        
        messages = PromptTemplates.get_video_summary_prompt(video_info, language)
        
        try:
            summary = self.model_manager.generate(messages)
            return summary.strip()
        except Exception as e:
            print(f"⚠️ Error: {e}")
            return "Summary generation failed."
    
    def generate_reaction_summary(self, title: str, comments: List[Dict]) -> str:
        filtered_comments = self.preprocessor.filter_comments(comments)
        
        if not filtered_comments:
            return "No comments available."
        
        comments_text = self.preprocessor.format_comments_for_prompt(filtered_comments)
        
        language = "Korean"
        if self.config.detect_language and filtered_comments:
            sample_text = " ".join([c.get('text', '') for c in filtered_comments[:5]])
            language = self.preprocessor.detect_language(sample_text)
        
        messages = PromptTemplates.get_reaction_summary_prompt(title, comments_text, language)
        
        try:
            summary = self.model_manager.generate(messages)
            return summary.strip()
        except Exception as e:
            print(f"⚠️ Error: {e}")
            return "Reaction summary generation failed."
    
    def calculate_engagement_metrics(self, video_info: Dict) -> Dict:
        views = video_info.get('view_count', 0)
        likes = video_info.get('like_count', 0)
        comments = video_info.get('comment_count', 0)
        
        engagement_rate = ((likes + comments) / views * 100) if views > 0 else 0
        like_rate = (likes / views * 100) if views > 0 else 0
        comment_rate = (comments / views * 100) if views > 0 else 0
        
        return {
            'views': views,
            'likes': likes,
            'comments': comments,
            'engagement_rate': round(engagement_rate, 2),
            'like_rate': round(like_rate, 2),
            'comment_rate': round(comment_rate, 2)
        }

print("✅ Report generator class defined")
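As a worked example of the engagement formulas in `calculate_engagement_metrics`, the standalone sketch below applies them to made-up counts. The helper name is hypothetical; the arithmetic is the same as in the method above.

```python
from typing import Dict


def engagement_metrics(views: int, likes: int, comments: int) -> Dict[str, float]:
    """Same formulas as calculate_engagement_metrics, on plain integers."""
    if views <= 0:
        # Guard against division by zero, as the pipeline does
        return {"engagement_rate": 0.0, "like_rate": 0.0, "comment_rate": 0.0}
    return {
        "engagement_rate": round((likes + comments) / views * 100, 2),
        "like_rate": round(likes / views * 100, 2),
        "comment_rate": round(comments / views * 100, 2),
    }


# Made-up counts: 10,000 views, 450 likes, 50 comments
example = engagement_metrics(views=10_000, likes=450, comments=50)
print(example)
```

With these numbers, (450 + 50) / 10,000 gives an engagement rate of 5%, split into a 4.5% like rate and a 0.5% comment rate.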

In [None]:
def format_markdown_report(self, video_data: Dict, video_summary: str, 
                           reaction_summary: str, metrics: Dict,
                           team_predictions: Dict = None) -> str:
    video_info = video_data['video_info']
    comments = video_data.get('comments', [])
    
    report = f"""# YouTube Video Report

**Generated**: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
**Model**: {self.config.model_name}

---

## 📹 Video Information

- **Title**: {video_info.get('title', 'N/A')}
- **Channel**: {video_info.get('channel_title', 'N/A')}
- **Category**: {video_info.get('category_name', 'N/A')}
- **Published**: {video_info.get('published_at', 'N/A')}
- **Duration**: {self.preprocessor.parse_duration(video_info.get('duration', 'PT0S'))} seconds
- **Video ID**: `{video_info.get('video_id', 'N/A')}`
- **URL**: https://www.youtube.com/watch?v={video_info.get('video_id', '')}

---

## 📊 Engagement Metrics

| Metric | Value |
|--------|-------|
| Views | {metrics['views']:,} |
| Likes | {metrics['likes']:,} |
| Comments | {metrics['comments']:,} |
| Engagement Rate | {metrics['engagement_rate']}% |
| Like Rate | {metrics['like_rate']}% |
| Comment Rate | {metrics['comment_rate']}% |

---

## 📝 Video Summary

{video_summary}

---

## 💬 Audience Reaction Summary

{reaction_summary}

---
"""
    
    if team_predictions:
        report += """## 🤖 Team Model Predictions\n\n"""
        if 'category' in team_predictions:
            report += f"- **Category**: {team_predictions['category']}\n"
        if 'sentiment' in team_predictions:
            s = team_predictions['sentiment']
            report += f"- **Sentiment**: Positive {s.get('positive',0):.0%}, "
            report += f"Neutral {s.get('neutral',0):.0%}, Negative {s.get('negative',0):.0%}\n"
        report += "\n---\n\n"
    
    top_comments = self.preprocessor.filter_comments(comments)[:5]
    if top_comments:
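        report += """## 🔍 Top Comments\n\n"""
        for i, c in enumerate(top_comments, 1):
            full_text = c.get('text', '')
            # Append an ellipsis only when the comment was actually truncated
            text = full_text[:200] + ("..." if len(full_text) > 200 else "")
            likes = c.get('like_count', 0)
            author = c.get('author', 'Anonymous')
            report += f"{i}. **{author}** ({likes} likes): {text}\n\n"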
        report += "---\n\n"
    
    report += """*Generated by YouTube Report Generator - Phase 2*"""
    return report

# Add method to ReportGenerator class
ReportGenerator.format_markdown_report = format_markdown_report
print("✅ Report formatting method added")

In [None]:
def generate_report(self, video_data: Dict) -> str:
    video_info = video_data['video_info']
    comments = video_data.get('comments', [])
    
    print(f"\n🎬 Processing: {video_info.get('title', 'Unknown')}")
    
    print("  📝 Generating video summary...")
    video_summary = self.generate_video_summary(video_info)
    
    print("  💬 Generating reaction summary...")
    reaction_summary = self.generate_reaction_summary(
        video_info.get('title', ''), comments
    )
    
    print("  📊 Calculating metrics...")
    metrics = self.calculate_engagement_metrics(video_info)
    
    team_predictions = {}
    if self.config.use_category_model:
        pred = self.team_models.predict_category(video_info)
        if pred:
            team_predictions['category'] = pred
    
    if self.config.use_sentiment_model:
        sent = self.team_models.analyze_sentiment(comments)
        if sent:
            team_predictions['sentiment'] = sent
    
    print("  📄 Formatting report...")
    report = self.format_markdown_report(
        video_data, video_summary, reaction_summary, 
        metrics, team_predictions if team_predictions else None
    )
    
    print("  ✅ Done!")
    return report

# Add method
ReportGenerator.generate_report = generate_report

# Initialize report generator
report_generator = ReportGenerator(config, model_manager, preprocessor, team_models)
print("✅ Report generator initialized")

## 🚀 9. Run Pipeline

In [None]:
# Load data
dataset = data_loader.get_data()

# Select videos (start with 3 for testing)
videos_to_process = dataset[:3]  # Change to dataset for all 20

print(f"\n🎯 Processing {len(videos_to_process)} videos...")
print("="*60)

In [None]:
# Generate reports
reports = []

for i, video_data in enumerate(videos_to_process, 1):
    print(f"\n{'='*60}")
    print(f"Video {i}/{len(videos_to_process)}")
    print(f"{'='*60}")
    
    try:
        report = report_generator.generate_report(video_data)
        reports.append({
            'video_id': video_data['video_info']['video_id'],
            'report': report
        })
    except Exception as e:
        print(f"❌ Error: {e}")
        continue

print(f"\n{'='*60}")
print(f"✅ Completed {len(reports)}/{len(videos_to_process)} videos")
print(f"{'='*60}")

## 💾 10. Save Reports

In [None]:
import os

if config.save_reports:
    os.makedirs(config.output_dir, exist_ok=True)
    
    print(f"\n💾 Saving to: {config.output_dir}/")
    
    for report_data in reports:
        video_id = report_data['video_id']
        report = report_data['report']
        
        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
        filename = f"{config.output_dir}/report_{video_id}_{timestamp}.md"
        
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(report)
        
        print(f"  ✅ {filename}")
    
    print("\n✅ All reports saved!")

## 📊 11. Display Sample Report

In [None]:
if reports:
    print("\n" + "="*60)
    print("Sample Report (First Video)")
    print("="*60 + "\n")
    print(reports[0]['report'])
else:
    print("⚠️ No reports generated")

## 🎓 12. Usage Tips

### Change Model
```python
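# After changing the config, re-create ModelManager and call load_model() again;
# the manager that is already loaded keeps the model it was built with.
config = PipelineConfig(
    model_name="google/gemma-2-9b-it"  # or other models
)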
```

### Customize Prompts
```python
PromptTemplates.VIDEO_SUMMARY_USER = """Your custom prompt..."""
```
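One detail to keep in mind when replacing a template: `get_video_summary_prompt` fills `VIDEO_SUMMARY_USER` through `str.format` with the keyword arguments `title`, `description`, `category`, `duration`, `channel`, and `language`. A custom template can use any subset of those placeholders. The sketch below, with made-up values, shows the fill step in isolation.

```python
# A custom user template using four of the six available placeholders.
# Any literal braces in a template would have to be doubled: {{ and }}.
CUSTOM_USER_TEMPLATE = """Summarize this video in {language}.

Title: {title}
Channel: {channel}
Description: {description}

Summary:"""

# Made-up values, for illustration only
filled = CUSTOM_USER_TEMPLATE.format(
    title="Demo video",
    description="A short demo.",
    category="Education",   # unused by this template; str.format ignores it
    duration=253,           # unused as well
    channel="Demo Channel",
    language="Korean",
)
print(filled)
```

A placeholder name outside those six (say `{views}`) would raise a `KeyError` when the pipeline formats the prompt, so it is worth testing a new template this way before a full run.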

### Add Team Models
```python
config = PipelineConfig(
    use_category_model=True,
    category_model_path="path/to/model"
)
```

### Process All Videos
```python
videos_to_process = dataset  # All 20 videos
```

---

**Expected Performance**:
- Korean quality: **6-8/10** (vs 0-2/10 Phase 1)
- English quality: **7-9/10** (vs 4-8/10 Phase 1)
- Time: ~2-3 min/video on T4 GPU

**Next Steps**:
1. Test with 3 sample videos
2. Compare quality with Phase 1
3. Try different models
4. Tune prompts
5. Integrate team models
6. Process full dataset

---

✅ **Phase 2 Full Pipeline Ready!**