# Final Integrated Topic Analysis with DTM Metrics

This notebook analyzes topics-over-time output from two platforms, YouTube and Telegram. It computes the Dynamic Topic Model (DTM) metrics directly from the raw topic data and combines them with per-topic information, including frequency, temporal patterns, and word analysis.

## 🎯 Key Features:

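1.  **DTM Metrics from Scratch**: Calculates temporal topic coherence (TTC), temporal topic smoothness (TTS), temporal topic quality (TTQ), topic coherence (TC), topic diversity (TD), topic quality (TQ), and dynamic topic quality (DTQ).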
2.  **Integrated Topic Profiles**: For **every topic**, you get:
    *   Frequency and persistence statistics.
    *   Temporal patterns (duration, start/end dates).
    *   Word analysis (top words, diversity).
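    *   **Calculated DTM quality score (TTQ)**, computed for the `top_n_topics` most frequent topics (30 in this notebook); all other topics are assigned a TTQ of 0.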
3.  **Comprehensive Visualizations**:
    *   Datetime axes for all temporal plots.
    *   Charts comparing topics by both frequency and DTM quality.
    *   Cross-platform comparison between YouTube and Telegram.
4.  **Unified Analysis**: One DataFrame per platform that joins each topic's frequency, persistence, and TTQ score.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import itertools
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")
print("Ready for final integrated analysis!")

## 1. Data Loading and Preprocessing
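
The loader in this section reads four columns from each CSV: `Topic`, `Words`, `Frequency`, and `Timestamp`. That schema is inferred from how the rest of the notebook uses the data, not from a documented specification. The sketch below builds a hypothetical three-row sample with those columns and applies the same timestamp handling, which can serve as a quick schema check before pointing the loader at the real files.

```python
import io
import pandas as pd

# Hypothetical sample rows; the real data comes from the CSV files read below.
sample_csv = io.StringIO(
    "Topic,Words,Frequency,Timestamp\n"
    '-1,"the, and, of",120,2023-01-01\n'
    '0,"election, vote, party",45,2023-01-01\n'
    '0,"election, ballot, party",38,2023-02-01\n'
)
sample = pd.read_csv(sample_csv)

# Fail early if a column the notebook relies on is missing.
required = {"Topic", "Words", "Frequency", "Timestamp"}
missing = required - set(sample.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

# Same conversions as preprocess_dataset(): parse dates, derive a monthly period.
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"])
sample["Year_Month"] = sample["Timestamp"].dt.to_period("M")
print(sample[["Topic", "Frequency", "Year_Month"]])
```

Topic `-1` is kept in the sample on purpose: the later cells filter it out with `Topic != -1`.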

In [None]:
# Load raw topic data
def load_and_preprocess_data():
    """Load and preprocess raw topic data from both platforms"""
    
    print("Loading raw topic data...")
    
    # Load YouTube and Telegram topic data
    youtube_topics = pd.read_csv('analysis_thailand/bert_overtime/youtube/youtube_topics_over_time_raw.csv')
    telegram_topics = pd.read_csv('analysis_thailand/bert_overtime/telegram/telegram_topics_over_time_raw.csv')
    
    # Preprocess function
    def preprocess_dataset(df, source_name):
        """Preprocess individual dataset"""
        df = df.copy()
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
        df['Source'] = source_name
        df['Year'] = df['Timestamp'].dt.year
        df['Month'] = df['Timestamp'].dt.month
        df['Year_Month'] = df['Timestamp'].dt.to_period('M')
        return df
    
    # Process both datasets
    youtube_topics = preprocess_dataset(youtube_topics, 'YouTube')
    telegram_topics = preprocess_dataset(telegram_topics, 'Telegram')
    
    print("\n=== Raw Data Summary ===")
    print(f"YouTube: {len(youtube_topics)} records, {youtube_topics['Topic'].nunique()} unique topics")
    print(f"Telegram: {len(telegram_topics)} records, {telegram_topics['Topic'].nunique()} unique topics")
    print(f"YouTube date range: {youtube_topics['Timestamp'].min()} to {youtube_topics['Timestamp'].max()}")
    print(f"Telegram date range: {telegram_topics['Timestamp'].min()} to {telegram_topics['Timestamp'].max()}")
    
    return youtube_topics, telegram_topics

# Load the data
youtube_data, telegram_data = load_and_preprocess_data()

## 2. DTM Metrics Calculator from Scratch
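
For reference, the formulas below restate what the `DTMMetricsCalculator` class in this section computes. They are read off the code, not quoted from an external definition, and the notation is introduced here for convenience: $W_k^{(t)}$ is the top-word list of topic $k$ at time slice $t$, $D_w$ is the set of pseudo-documents containing word $w$ (one pseudo-document per non-empty topic-time word list), $N$ is the number of pseudo-documents, $K$ is the set of selected topics, $K_t \subseteq K$ are the topics with a non-empty word list at $t$, $T$ is the number of time slices, and $\epsilon = 10^{-10}$.

$$
\begin{aligned}
\mathrm{PMI}(w_i, w_j) &= \log \frac{|D_{w_i} \cap D_{w_j}|/N + \epsilon}{(|D_{w_i}|/N)\,(|D_{w_j}|/N) + \epsilon} \\
C(P) &= \frac{1}{|P^{+}|} \sum_{(w_i, w_j) \in P^{+}} \mathrm{PMI}(w_i, w_j), \qquad P^{+} = \{(w_i, w_j) \in P : |D_{w_i} \cap D_{w_j}| > 0\} \\
\mathrm{TTC}_k(t) &= C\big(W_k^{(t)} \times W_k^{(t+1)}\big) \\
\mathrm{TTS}_k(t) &= \frac{|W_k^{(t)} \cap W_k^{(t+1)}|}{|W_k^{(t)} \cup W_k^{(t+1)}|} \\
\mathrm{TTQ}_k &= \frac{1}{T-1} \sum_{t=0}^{T-2} \mathrm{TTC}_k(t)\,\mathrm{TTS}_k(t) \\
\mathrm{TC}(t) &= \frac{1}{|K_t|} \sum_{k \in K_t} C\big(\{(w_i, w_j) : w_i, w_j \in W_k^{(t)},\ i < j\}\big) \\
\mathrm{TD}(t) &= \frac{\big|\bigcup_{k \in K_t} W_k^{(t)}\big|}{\sum_{k \in K_t} |W_k^{(t)}|} \\
\mathrm{TQ}(t) &= \mathrm{TC}(t)\,\mathrm{TD}(t) \\
\mathrm{DTQ} &= \frac{1}{2}\Big(\frac{1}{T}\sum_{t=0}^{T-1} \mathrm{TQ}(t) + \frac{1}{|K|}\sum_{k \in K} \mathrm{TTQ}_k\Big)
\end{aligned}
$$

Two conventions in the code matter when reading the scores: $C(P)$ is 0 when no pair in $P$ co-occurs, and $\mathrm{TTC}_k(t)$ and $\mathrm{TTS}_k(t)$ are both 0 whenever topic $k$ has no words in slice $t$ or $t+1$, so topics that appear in only a few slices are pulled toward a TTQ of 0.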

In [None]:
class DTMMetricsCalculator:
    """
    Calculate Dynamic Topic Model metrics from scratch using raw topic data.
    Implements TTC, TTS, TTQ, TC, TD, TQ, DTQ metrics.
    """
    
    def __init__(self, topics_df, dataset_name, top_n_topics=30, n_words=10):
        self.topics_df = topics_df
        self.dataset_name = dataset_name
        self.top_n_topics = top_n_topics
        self.n_words = n_words
        print(f"\n=== Initializing DTM Metrics Calculator for {dataset_name} ===")
        self._prepare_data()
        
    def _prepare_data(self):
        """Prepare data for DTM evaluation"""
        valid_topics = self.topics_df[self.topics_df['Topic'] != -1].copy()
        if valid_topics.empty:
            raise ValueError("No valid topics found in the data")
        topic_counts = valid_topics.groupby('Topic')['Frequency'].sum().sort_values(ascending=False)
        self.selected_topics = topic_counts.head(self.top_n_topics).index.tolist()
        self.filtered_df = valid_topics[valid_topics['Topic'].isin(self.selected_topics)].copy()
        self.timestamps = sorted(self.topics_df['Timestamp'].unique())
        self.num_time_slices = len(self.timestamps)
        self.timestamp_map = {ts: i for i, ts in enumerate(self.timestamps)}
        self._prepare_topic_words_by_time()
        self._prepare_corpus()
    
    def _prepare_topic_words_by_time(self):
        """Prepare topic words organized by time and topic"""
        self.topic_words_by_time = {}
        for t_idx, timestamp in enumerate(self.timestamps):
            self.topic_words_by_time[t_idx] = {}
            for topic_id in self.selected_topics:
                mask = (self.topics_df['Timestamp'] == timestamp) & (self.topics_df['Topic'] == topic_id)
                topic_data = self.topics_df[mask]
                if not topic_data.empty:
                    words = topic_data['Words'].iloc[0]
                    if isinstance(words, str):
                        word_list = [word.strip() for word in words.split(',') if word.strip()]
                        self.topic_words_by_time[t_idx][topic_id] = word_list[:self.n_words]
                    else:
                        self.topic_words_by_time[t_idx][topic_id] = []
                else:
                    self.topic_words_by_time[t_idx][topic_id] = []

    def _prepare_corpus(self):
        """Prepare reference corpus for coherence calculations"""
        all_words = [word for time_dict in self.topic_words_by_time.values() for topic_words in time_dict.values() for word in topic_words]
        self.vocabulary = list(set(all_words))
        self.corpus_texts = [' '.join(topic_words) for time_dict in self.topic_words_by_time.values() for topic_words in time_dict.values() if topic_words]
        self.tokenized_texts = [text.split() for text in self.corpus_texts]
        doc_freq = {word: set() for word in self.vocabulary}
        for i, text in enumerate(self.tokenized_texts):
            for word in set(text):
                if word in doc_freq:
                    doc_freq[word].add(i)
        self.doc_freq = doc_freq

    def _calculate_pmi_coherence(self, word_pairs):
        """Mean PMI over the word pairs that co-occur in at least one pseudo-document.

        Pairs that never co-occur are skipped, not counted as zero.
        """
        if not word_pairs: return 0.0
        coherence_sum = 0.0
        pair_count = 0
        epsilon = 1e-10
        num_docs = len(self.tokenized_texts)
        for word_i, word_j in word_pairs:
            # doc_freq is keyed by the vocabulary, so this is the same check with O(1) dict lookups
            if word_i in self.doc_freq and word_j in self.doc_freq:
                docs_i = self.doc_freq[word_i]
                docs_j = self.doc_freq[word_j]
                count_ij = len(docs_i.intersection(docs_j))
                if count_ij > 0:
                    prob_i = len(docs_i) / num_docs
                    prob_j = len(docs_j) / num_docs
                    prob_ij = count_ij / num_docs
                    coherence = np.log((prob_ij + epsilon) / (prob_i * prob_j + epsilon))
                    coherence_sum += coherence
                    pair_count += 1
        return coherence_sum / pair_count if pair_count > 0 else 0.0

    def calculate_temporal_topic_coherence(self, topic_k, time_t):
        """Mean PMI over all pairs of one word from time_t and one word from time_t + 1 for topic_k."""
        if time_t + 1 >= self.num_time_slices: return 0.0
        words_t = self.topic_words_by_time[time_t].get(topic_k, [])
        words_t_plus = self.topic_words_by_time[time_t + 1].get(topic_k, [])
        if not words_t or not words_t_plus: return 0.0
        word_pairs = list(itertools.product(words_t, words_t_plus))
        return self._calculate_pmi_coherence(word_pairs)

    def calculate_temporal_topic_smoothness(self, topic_k, time_t):
        """Jaccard overlap between topic_k's top-word sets at time_t and time_t + 1."""
        if time_t + 1 >= self.num_time_slices: return 0.0
        words_i = set(self.topic_words_by_time[time_t].get(topic_k, []))
        words_j = set(self.topic_words_by_time[time_t + 1].get(topic_k, []))
        if not words_i or not words_j: return 0.0
        intersection = len(words_i & words_j)
        union = len(words_i | words_j)
        return intersection / union if union > 0 else 0.0

    def calculate_topic_coherence(self, topic_words):
        """Mean PMI over all unordered word pairs within one topic's word list."""
        if not topic_words or len(topic_words) < 2: return 0.0
        word_pairs = list(itertools.combinations(topic_words, 2))
        return self._calculate_pmi_coherence(word_pairs)

    def calculate_topic_diversity(self, all_topic_words):
        """Fraction of unique words among all top words of the topics in one time slice."""
        if not all_topic_words: return 0.0
        all_words = [word for sublist in all_topic_words for word in sublist]
        if not all_words: return 0.0
        unique_words = len(set(all_words))
        total_words = len(all_words)
        return unique_words / total_words if total_words > 0 else 0.0

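    def compute_all_metrics(self):
        """Compute per-topic, per-time-slice, and overall metrics; store them in self.results and return them."""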
        print(f"\n=== Computing DTM Metrics for {self.dataset_name} ===")
        results = {'ttc_per_topic_per_time': {}, 'tts_per_topic_per_time': {}, 'ttq_per_topic': {}, 'tc_per_time': {}, 'td_per_time': {}, 'tq_per_time': {}, 'overall_metrics': {}}
        for topic_k in self.selected_topics:
            results['ttc_per_topic_per_time'][topic_k] = {}
            results['tts_per_topic_per_time'][topic_k] = {}
            ttc_scores, tts_scores = [], []
            for time_t in range(self.num_time_slices - 1):
                ttc = self.calculate_temporal_topic_coherence(topic_k, time_t)
                tts = self.calculate_temporal_topic_smoothness(topic_k, time_t)
                results['ttc_per_topic_per_time'][topic_k][time_t] = ttc
                results['tts_per_topic_per_time'][topic_k][time_t] = tts
                ttc_scores.append(ttc)
                tts_scores.append(tts)
            ttq_scores = [c * s for c, s in zip(ttc_scores, tts_scores)]
            results['ttq_per_topic'][topic_k] = np.mean(ttq_scores) if ttq_scores else 0.0
        for time_t in range(self.num_time_slices):
            all_topic_words_t, tc_scores_t = [], []
            for topic_k in self.selected_topics:
                topic_words = self.topic_words_by_time[time_t].get(topic_k, [])
                if topic_words:
                    all_topic_words_t.append(topic_words)
                    tc_scores_t.append(self.calculate_topic_coherence(topic_words))
            results['tc_per_time'][time_t] = np.mean(tc_scores_t) if tc_scores_t else 0.0
            results['td_per_time'][time_t] = self.calculate_topic_diversity(all_topic_words_t)
            results['tq_per_time'][time_t] = results['tc_per_time'][time_t] * results['td_per_time'][time_t]
        all_ttc = [v for scores in results['ttc_per_topic_per_time'].values() for v in scores.values()]
        results['overall_metrics']['TTC'] = np.mean(all_ttc) if all_ttc else 0.0
        all_tts = [v for scores in results['tts_per_topic_per_time'].values() for v in scores.values()]
        results['overall_metrics']['TTS'] = np.mean(all_tts) if all_tts else 0.0
        results['overall_metrics']['TTQ'] = np.mean(list(results['ttq_per_topic'].values())) if results['ttq_per_topic'] else 0.0
        results['overall_metrics']['TC'] = np.mean(list(results['tc_per_time'].values())) if results['tc_per_time'] else 0.0
        results['overall_metrics']['TD'] = np.mean(list(results['td_per_time'].values())) if results['td_per_time'] else 0.0
        results['overall_metrics']['TQ'] = np.mean(list(results['tq_per_time'].values())) if results['tq_per_time'] else 0.0
        results['overall_metrics']['DTQ'] = 0.5 * (results['overall_metrics']['TQ'] + results['overall_metrics']['TTQ'])
        print(f"✅ All metrics calculated for {self.dataset_name}!")
        self.results = results
        return results

    def get_metrics_summary(self):
        if not hasattr(self, 'results'): raise ValueError("Run compute_all_metrics() first.")
        return pd.DataFrame([{'Metric': m, 'Score': s} for m, s in self.results['overall_metrics'].items()])

print("DTM Metrics Calculator class defined!")

## 3. Calculate and Visualize DTM Metrics
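
Before running the calculators on the real data, it can help to check the two simplest metrics by hand. The sketch below re-states the logic of `calculate_temporal_topic_smoothness` and `calculate_topic_diversity` as standalone functions; the function names `jaccard_smoothness` and `topic_diversity` and the word lists are illustrative and not part of the notebook's class.

```python
def jaccard_smoothness(words_t, words_next):
    """Jaccard overlap of a topic's word sets in two consecutive time slices."""
    a, b = set(words_t), set(words_next)
    if not a or not b:
        return 0.0  # same convention as the class: an empty slice scores 0
    return len(a & b) / len(a | b)


def topic_diversity(topic_word_lists):
    """Share of unique words among all top words of the topics in one time slice."""
    all_words = [word for words in topic_word_lists for word in words]
    if not all_words:
        return 0.0
    return len(set(all_words)) / len(all_words)


# Hypothetical word lists for one topic in two consecutive slices.
slice_0 = ["election", "vote", "party", "ballot"]
slice_1 = ["election", "vote", "party", "campaign"]
tts = jaccard_smoothness(slice_0, slice_1)  # 3 shared words / 5 distinct words = 0.6

# Two topics in the same slice that share the word "vote".
td = topic_diversity([slice_0, ["price", "fuel", "vote", "market"]])  # 7 unique / 8 total = 0.875

print(f"TTS = {tts:.3f}, TD = {td:.3f}")
```

A smoothness near 1 means a topic's vocabulary barely changes between slices; a diversity near 1 means the topics in a slice share few words.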

In [None]:
# Create DTM metrics calculators for both datasets
print("Creating DTM metrics calculators...")

# Initialize calculators
youtube_calculator = DTMMetricsCalculator(
    topics_df=youtube_data, 
    dataset_name="YouTube", 
    top_n_topics=30, 
    n_words=10
)

telegram_calculator = DTMMetricsCalculator(
    topics_df=telegram_data, 
    dataset_name="Telegram", 
    top_n_topics=30, 
    n_words=10
)

# Calculate metrics for both datasets
youtube_results = youtube_calculator.compute_all_metrics()
telegram_results = telegram_calculator.compute_all_metrics()

print("\n" + "="*50)
print("🚀 DTM METRICS CALCULATION COMPLETED!")
print("="*50)

In [None]:
# Display metrics summary
def create_metrics_visualization(youtube_calc, telegram_calc):
    youtube_summary = youtube_calc.get_metrics_summary()
    telegram_summary = telegram_calc.get_metrics_summary()
    
    # Create comprehensive comparison
    comparison_df = pd.DataFrame({
        'Metric': youtube_summary['Metric'],
        'YouTube': youtube_summary['Score'],
        'Telegram': telegram_summary['Score']
    })
    print("\n🔍 COMPREHENSIVE METRICS COMPARISON:")
    print(comparison_df.round(4))
    
    # Create comparison plot
    fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'bar'}, {'type': 'polar'}]], subplot_titles=('DTM Metrics Comparison', 'DTM Metrics Radar Chart'))

    fig.add_trace(go.Bar(name='YouTube', x=youtube_summary['Metric'], y=youtube_summary['Score'], marker_color='red'), row=1, col=1)
    fig.add_trace(go.Bar(name='Telegram', x=telegram_summary['Metric'], y=telegram_summary['Score'], marker_color='blue'), row=1, col=1)

    fig.add_trace(go.Scatterpolar(r=youtube_summary['Score'], theta=youtube_summary['Metric'], fill='toself', name='YouTube', marker_color='red'), row=1, col=2)
    fig.add_trace(go.Scatterpolar(r=telegram_summary['Score'], theta=telegram_summary['Metric'], fill='toself', name='Telegram', marker_color='blue'), row=1, col=2)

    fig.update_layout(height=500, width=1000, title_text="DTM Metrics: YouTube vs Telegram")
    fig.show()

create_metrics_visualization(youtube_calculator, telegram_calculator)

## 4. Integrated Topic Analysis (Frequency, Words, and DTM Metrics)

In [None]:
def get_all_topics_integrated_overview(data, calculator):
    """Get a comprehensive overview of all topics, integrating DTM metrics."""
    dataset_name = calculator.dataset_name
    print(f"\n📊 INTEGRATED TOPIC OVERVIEW - {dataset_name}")
    print("="*60)
    
    valid_data = data[data['Topic'] != -1].copy()
    if valid_data.empty: return None
    
    # Basic frequency and temporal stats
    topic_stats = valid_data.groupby('Topic').agg(
        Frequency_sum=('Frequency', 'sum'),
        Frequency_mean=('Frequency', 'mean'),
        Time_Periods=('Timestamp', 'nunique'),
        Start_Date=('Timestamp', 'min'),
        End_Date=('Timestamp', 'max')
    ).reset_index()
    
    # Get DTM metrics for each topic
    dtm_metrics = pd.DataFrame.from_dict(calculator.results['ttq_per_topic'], orient='index', columns=['TTQ_Score'])
    dtm_metrics.index.name = 'Topic'
    
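    # Merge frequency/temporal stats with DTM metrics.
    # TTQ exists only for the calculator's selected top_n_topics; fillna(0) gives every other topic a TTQ of 0.
    integrated_stats = pd.merge(topic_stats, dtm_metrics, on='Topic', how='left').fillna(0)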
    integrated_stats['Duration_Days'] = (integrated_stats['End_Date'] - integrated_stats['Start_Date']).dt.days
    
    # Sort by total frequency, most frequent topics first
    integrated_stats = integrated_stats.sort_values('Frequency_sum', ascending=False)
    
    print("🏆 Top 10 Topics by Frequency (with TTQ Score):")
    print(integrated_stats.head(10).round(3))
    return integrated_stats

def create_integrated_topic_charts(df, dataset_name):
    """Create charts from the integrated topic statistics DataFrame."""
    if df is None: return
    
    print(f"\n📈 Creating integrated charts for {dataset_name}")
    fig = make_subplots(
        rows=2, cols=2, 
        subplot_titles=(
            'Total Frequency vs. TTQ Score',
            'Top 20 Topics by TTQ Score',
            'Topic Duration vs. TTQ Score',
            'Topic Persistence (Time Periods) vs. TTQ Score'
        )
    )

    # Scatter plot: Frequency vs. TTQ
    fig.add_trace(go.Scatter(
        x=df['Frequency_sum'], 
        y=df['TTQ_Score'], 
        mode='markers', 
        marker=dict(size=10, color=df['Time_Periods'], colorscale='Viridis', showscale=True, colorbar=dict(title='Time Periods')),
        text=df['Topic'].apply(lambda x: f'Topic {x}')
    ), row=1, col=1)

    # Bar chart: Top topics by TTQ
    top_ttq = df.nlargest(20, 'TTQ_Score')
    fig.add_trace(go.Bar(
        x=top_ttq['Topic'].astype(str), 
        y=top_ttq['TTQ_Score'],
        marker_color=top_ttq['Frequency_sum']
    ), row=1, col=2)

    # Scatter plot: Duration vs TTQ
    fig.add_trace(go.Scatter(
        x=df['Duration_Days'], 
        y=df['TTQ_Score'], 
        mode='markers',
        marker=dict(color=df['Frequency_sum'], colorscale='Plasma', showscale=True, colorbar=dict(title='Frequency'))
    ), row=2, col=1)

    # Scatter plot: Time Periods vs TTQ
    fig.add_trace(go.Scatter(
        x=df['Time_Periods'], 
        y=df['TTQ_Score'], 
        mode='markers',
        marker=dict(color=df['Frequency_sum'], colorscale='Inferno', showscale=True, colorbar=dict(title='Frequency'))
    ), row=2, col=2)

    fig.update_layout(height=800, width=1200, title_text=f'Integrated Topic Analysis for {dataset_name}', showlegend=False)
    fig.update_xaxes(title_text="Total Frequency", row=1, col=1)
    fig.update_yaxes(title_text="TTQ Score", row=1, col=1)
    fig.update_xaxes(title_text="Topic ID", type='category', row=1, col=2)
    fig.update_yaxes(title_text="TTQ Score", row=1, col=2)
    fig.update_xaxes(title_text="Duration (Days)", row=2, col=1)
    fig.update_yaxes(title_text="TTQ Score", row=2, col=1)
    fig.update_xaxes(title_text="Number of Time Periods", row=2, col=2)
    fig.update_yaxes(title_text="TTQ Score", row=2, col=2)
    
    fig.show()

# Run integrated analysis
youtube_integrated_stats = get_all_topics_integrated_overview(youtube_data, youtube_calculator)
create_integrated_topic_charts(youtube_integrated_stats, "YouTube")

telegram_integrated_stats = get_all_topics_integrated_overview(telegram_data, telegram_calculator)
create_integrated_topic_charts(telegram_integrated_stats, "Telegram")

## 5. Summary and How to Use

This notebook runs the full pipeline: loading the raw topic data, computing the DTM metrics, and merging them into per-topic profiles.

### How to Access the Integrated Data

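The DataFrames `youtube_integrated_stats` and `telegram_integrated_stats` contain one row per topic, sorted by total frequency, with the columns `Topic`, `Frequency_sum`, `Frequency_mean`, `Time_Periods`, `Start_Date`, `End_Date`, `TTQ_Score`, and `Duration_Days`. You can use them for further custom analysis.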

```python
# Example: View the five most frequent YouTube topics with all their stats
print(youtube_integrated_stats.head())

# Example: Find the topic with the highest TTQ score
best_topic = youtube_integrated_stats.loc[youtube_integrated_stats['TTQ_Score'].idxmax()]
print("\nTopic with the best TTQ Score:")
print(best_topic)
```
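
Because TTQ is computed only for each calculator's `selected_topics` and the merge fills the rest with 0, a cross-platform summary is more meaningful when restricted to the scored topics. The sketch below shows one way to do that. The two small DataFrames and the `scored_ids` dictionary are hypothetical stand-ins for `youtube_integrated_stats`, `telegram_integrated_stats`, and the calculators' `selected_topics` lists.

```python
import pandas as pd

# Hypothetical stand-ins for the two integrated DataFrames (values are made up).
youtube_demo = pd.DataFrame(
    {"Topic": [0, 1, 2], "Frequency_sum": [500, 300, 40], "TTQ_Score": [0.12, 0.30, 0.0]}
)
telegram_demo = pd.DataFrame(
    {"Topic": [0, 1], "Frequency_sum": [800, 90], "TTQ_Score": [0.25, 0.05]}
)

# Stack both platforms into one frame with a Source column.
combined = pd.concat(
    [youtube_demo.assign(Source="YouTube"), telegram_demo.assign(Source="Telegram")],
    ignore_index=True,
)

# Stand-in for youtube_calculator.selected_topics / telegram_calculator.selected_topics.
scored_ids = {"YouTube": [0, 1], "Telegram": [0, 1]}
mask = [topic in scored_ids[source] for topic, source in zip(combined["Topic"], combined["Source"])]
scored = combined.loc[mask]

# Per-platform TTQ summary over scored topics only.
summary = scored.groupby("Source")["TTQ_Score"].agg(["count", "mean", "max"])
print(summary)
```

Filtering on the selected topic IDs, not on `TTQ_Score != 0`, keeps topics whose computed TTQ happens to be exactly 0.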

### Key Insights from Integrated Analysis

By combining frequency metrics with quality metrics like TTQ, you can answer deeper questions:
- Are my most frequent topics also high-quality and coherent over time?
- Are there any high-quality 'hidden gem' topics that are not very frequent but are very consistent?
- Do long-lasting topics tend to have higher or lower quality scores?
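
The questions above can be approached with a few lines of pandas. The sketch below is one possible reading, not the only one: it uses a rank correlation for the first and third questions and a median split to define "hidden gems". The `stats` DataFrame holds made-up values standing in for an integrated stats DataFrame, and the helper `rank_correlation` is a name introduced here.

```python
import pandas as pd

# Hypothetical values standing in for youtube_integrated_stats or telegram_integrated_stats.
stats = pd.DataFrame({
    "Topic": [0, 1, 2, 3, 4, 5],
    "Frequency_sum": [900, 700, 400, 120, 60, 30],
    "Duration_Days": [300, 280, 150, 200, 90, 30],
    "TTQ_Score": [0.10, 0.35, 0.20, 0.40, 0.05, 0.30],
})


def rank_correlation(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return a.rank().corr(b.rank())


# Q1: do frequent topics also score well on TTQ?
freq_vs_ttq = rank_correlation(stats["Frequency_sum"], stats["TTQ_Score"])

# Q2: "hidden gems" = below-median frequency but above-median TTQ (an assumed definition).
hidden_gems = stats[
    (stats["Frequency_sum"] < stats["Frequency_sum"].median())
    & (stats["TTQ_Score"] > stats["TTQ_Score"].median())
]

# Q3: do long-lasting topics tend to have higher or lower TTQ?
duration_vs_ttq = rank_correlation(stats["Duration_Days"], stats["TTQ_Score"])

print(f"Frequency vs TTQ (Spearman): {freq_vs_ttq:.3f}")
print(f"Duration vs TTQ (Spearman): {duration_vs_ttq:.3f}")
print("Hidden gems:", hidden_gems["Topic"].tolist())
```

On the real DataFrames, consider restricting the analysis to the topics the calculator actually scored, since unscored topics carry a filled-in TTQ of 0 and would distort both correlations.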