## Summary: in Step 2, we fix $depth=3$, $\gamma=0.05$, and $\alpha=0.1$, and vary $\eta$

### 1. We repeat the Rényi entropy and Jensen-Shannon divergence analysis from Step 1.
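
For reference, the two quantities can be written out to match the code cells in this notebook. Each node's word counts $n_w$ are smoothed with the run's $\eta$ over the vocabulary $V$. The Rényi order (the code's `renyi_alpha`, default 2.0) is written $\alpha_R$ here, a notation chosen only to keep it apart from the model's $\alpha$. All logarithms are natural.

$$
p_w = \frac{n_w + \eta}{\sum_{v \in V} (n_v + \eta)}, \qquad
H_{\alpha_R}(p) = \frac{1}{1-\alpha_R} \log \sum_{w \in V} p_w^{\alpha_R} \quad (\alpha_R \neq 1)
$$

$$
\mathrm{JSD}(p, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m), \qquad
m = \tfrac{1}{2}(p + q), \qquad
d_{JS}(p, q) = \sqrt{\mathrm{JSD}(p, q)}
$$

For $\alpha_R = 1$ the code falls back to the Shannon entropy $-\sum_w p_w \log p_w$. The value saved per node pair is the distance $d_{JS}$, not the divergence itself.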

### 2. New in this step: two semantic indicators, **coherence** and **perplexity**, and two structural indicators, **Gini coefficient** and **branching factor**.

### 2.1 For each run (chain), we calculate the coherence of every node with **calculate_standard_coherence_from_corpus_corrected()**
- save to: <u>step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_3/standard_coherence.csv</u>
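
The body of `calculate_standard_coherence_from_corpus_corrected()` is not shown in this notebook, so the following is only a minimal sketch of one common corpus-based score (UMass-style), which may differ from the measure actually used. The function name `umass_coherence`, the toy documents, and the +1 smoothing are assumptions for illustration.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Average UMass-style coherence of a ranked top-word list.

    top_words: words ordered from most to least probable in the node.
    documents: iterable of token collections (one per document).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    # combinations() keeps rank order, so `higher` always outranks `lower`
    for higher, lower in combinations(top_words, 2):
        df_higher = doc_freq(higher)
        if df_higher == 0:
            continue  # Word never occurs in the corpus; skip the pair
        scores.append(math.log((doc_freq(higher, lower) + 1) / df_higher))
    return sum(scores) / len(scores) if scores else 0.0

# Toy corpus (hypothetical): "a" and "b" co-occur more often than "a" and "c"
docs = [["a", "b"], ["a", "b"], ["a", "c"]]
coherence_ab = umass_coherence(["a", "b"], docs)
coherence_ac = umass_coherence(["a", "c"], docs)
```

Scores closer to zero indicate top words that co-occur more often, so `coherence_ab` comes out higher than `coherence_ac` on this toy corpus.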

### 2.2 For each run, we average node coherence within each layer with **aggregate_coherence_by_eta()**
- save to: <u>step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_3/layer_coherence_summary_k5.csv</u>

### 3. For each eta, we calculate the layer-level coherence and the document-count-weighted coherence with **calculate_coherence_layered_analysis()**, using the top 5 words per node (top_k = 5)
- save to: <u>step2/step2_d3_g005_e01_收敛/eta_0.1_coherence_layer_summary_k5.csv</u>
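
The difference between the plain layer mean and the document-weighted mean can be sketched as follows. This is an illustrative stand-in, not the notebook's `calculate_coherence_layered_analysis()`; the function name `layer_coherence_summary`, the dictionary keys, and the toy values are assumptions.

```python
from collections import defaultdict

def layer_coherence_summary(nodes):
    """nodes: iterable of dicts with 'layer', 'coherence', 'document_count'."""
    grouped = defaultdict(list)
    for node in nodes:
        grouped[node["layer"]].append(node)

    summary = {}
    for layer, members in sorted(grouped.items()):
        total_docs = sum(n["document_count"] for n in members)
        simple_mean = sum(n["coherence"] for n in members) / len(members)
        # Fall back to the simple mean when the layer holds no documents
        weighted_mean = (
            sum(n["coherence"] * n["document_count"] for n in members) / total_docs
            if total_docs > 0 else simple_mean
        )
        summary[layer] = {"mean": simple_mean, "weighted_mean": weighted_mean}
    return summary

# Toy nodes (hypothetical values)
toy_nodes = [
    {"layer": 1, "coherence": -1.0, "document_count": 3},
    {"layer": 1, "coherence": -3.0, "document_count": 1},
    {"layer": 2, "coherence": -2.0, "document_count": 4},
]
summary = layer_coherence_summary(toy_nodes)
```

In layer 1 the weighted mean is pulled toward the node that holds more documents, which is why the two columns can diverge when a layer has a few large nodes and many small ones.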

### 4. Across all eta values, we aggregate the coherence results with **aggregate_coherence_by_eta()**
- save to: <u>/Volumes/My Passport/收敛结果/step2/02_eta_coherence_layer_comparison_k5.csv</u>

### 5. For each eta, we calculate the perplexity of each run to assess predictive performance with **calculate_hlda_perplexity_with_path_mapping_complete()**
- save to: <u>step2/step2_d3_g005_e01_收敛/eta_0.1_perplexity_summary.csv</u>
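
Perplexity is the exponential of the negative mean per-token log-likelihood. How `calculate_hlda_perplexity_with_path_mapping_complete()` obtains the per-token probabilities from the hLDA paths is not shown here, so the sketch below assumes they are already available; the function name `perplexity` is hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the negative mean log-likelihood per token (natural log)."""
    token_log_probs = list(token_log_probs)
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Sanity check: a model that spreads probability uniformly over 4 words
uniform_over_four = perplexity([math.log(0.25)] * 10)
```

A uniform model over 4 words gives a perplexity of 4, which is a convenient check; lower values mean the model predicts held-out tokens better.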

### 6. Across all eta values, we aggregate the perplexity results with **aggregate_perplexity_by_eta_groups()**
- save to: <u>step2/03_eta_perplexity_comparison.csv</u>

### 7.1 For each run (chain), we calculate the branching factor and Gini coefficient of its tree structure with **calculate_branching_and_gini_metrics()**
- save to: <u>step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/layer_branching_gini_metrics.csv</u>
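
The two structural indicators can be sketched as below. The notebook does not show which quantity the Gini coefficient is computed over, so this example assumes per-node document counts within a layer, and defines the branching factor as the mean child count over nodes that have children. Both function names are hypothetical.

```python
def gini_coefficient(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum with 1-based ranks over the ascending values
    ranked_sum = sum(rank * value for rank, value in enumerate(ordered, start=1))
    return 2.0 * ranked_sum / (n * total) - (n + 1) / n

def branching_factor(child_counts):
    """Mean number of children over nodes that have at least one child."""
    internal = [c for c in child_counts if c > 0]
    return sum(internal) / len(internal) if internal else 0.0

# Toy inputs (hypothetical): document counts per node, child counts per node
even_gini = gini_coefficient([5, 5, 5, 5])
skewed_gini = gini_coefficient([0, 0, 0, 20])
toy_branching = branching_factor([2, 3, 0, 0, 0, 0, 0])
```

An even split gives a Gini of 0, while concentrating all documents in one of four nodes gives 0.75, the maximum for four values under this formula.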

### 7.2 For each eta, we calculate the mean of each layer's metrics across runs with **aggregate_branching_gini_by_eta()**
- save to: <u>step2/step2_d3_g005_e01_收敛/eta_0.1_layer_branching_gini_summary.csv</u>

### 7.3 Across all eta values, we aggregate the results with **display_branching_gini_summary()**
- save to: <u>step2/04_eta_layer_branching_gini_comparison.csv</u>

In [None]:
import pandas as pd
import numpy as np
import os
import glob
from scipy.special import gammaln

def calculate_renyi_entropy_vectorized(node_data, all_words, eta_prior=1.0, renyi_alpha=2.0):
    """
    Vectorized version of Renyi entropy calculation
    
    Parameters:
    node_data: DataFrame, node data containing word and count columns
    all_words: list, complete vocabulary
    eta_prior: float, Dirichlet prior smoothing parameter (obtained from eta value)
    renyi_alpha: float, order parameter for Renyi entropy
    
    Returns:
    tuple: (entropy, nonzero_word_count) Renyi entropy value and number of non-zero words
    """
    if len(all_words) == 0:
        return 0.0, 0
    
    # Create word to index mapping
    word_to_idx = {word: idx for idx, word in enumerate(all_words)}
    
    # Initialize count vector
    counts = np.zeros(len(all_words))
    
    # Fill actual counts
    for _, row in node_data.iterrows():
        word = row['word']
        if pd.notna(word) and word in word_to_idx:
            counts[word_to_idx[word]] = row['count']
    
    # Count non-zero words (before smoothing)
    nonzero_word_count = np.sum(counts > 0)
    
    # Add eta smoothing
    smoothed_counts = counts + eta_prior
    
    # Calculate probability distribution
    probabilities = smoothed_counts / np.sum(smoothed_counts)
    
    # Calculate Renyi entropy (using natural logarithm)
    if renyi_alpha == 1.0:
        # Shannon entropy (all probabilities > 0 due to eta smoothing, no need to add small constant)
        entropy = -np.sum(probabilities * np.log(probabilities))
    else:
        # General Renyi entropy
        entropy = (1 / (1 - renyi_alpha)) * np.log(np.sum(probabilities ** renyi_alpha))
    
    return entropy, int(nonzero_word_count)

def process_all_iteration_files_by_eta(base_path=".", renyi_alpha=2.0):
    """
    Process each iteration_node_word_distributions.csv file separately and save results
    Automatically extract eta value from folder name as prior smoothing parameter
    """
    pattern = os.path.join(base_path, "**", "iteration_node_word_distributions.csv")
    files = glob.glob(pattern, recursive=True)
    
    # Remove duplicates to ensure each file is processed only once
    files = list(set(files))
    files.sort()  # Sort for ordered processing
    
    print(f"Found {len(files)} files to process")
    
    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        
        # Extract eta value from folder name
        eta_prior = 0.1  # Default value
        if 'eta_' in folder_name:
            try:
                # Find the numeric part after eta_
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta_prior = float(eta_part)
            except (IndexError, ValueError) as e:
                print(f"Warning: Unable to extract eta value from folder name {folder_name}, using default value {eta_prior}")
        
        print(f"\n[{idx}/{len(files)}] Processing file: {file_path}")
        print(f"Folder: {folder_name}")
        print(f"Extracted eta value: {eta_prior}")
        
        try:
            df = pd.read_csv(file_path)
            
            # Clean column names, remove single quotes, double quotes and spaces
            df.columns = [col.strip("'\" ") for col in df.columns]
            
            if 'node_id' not in df.columns:
                print(f"Warning: {file_path} missing node_id column, skipping this file")
                continue
                
            max_iteration = df['iteration'].max()
            last_iteration_data = df[df['iteration'] == max_iteration]
            all_words = list(last_iteration_data['word'].dropna().unique())
            
            print(f"Last iteration: {max_iteration}, vocabulary size: {len(all_words)}, number of nodes: {last_iteration_data['node_id'].nunique()}")
            
            results = []
            for node_id in last_iteration_data['node_id'].unique():
                node_data = last_iteration_data[last_iteration_data['node_id'] == node_id]
                
                entropy, nonzero_words = calculate_renyi_entropy_vectorized(
                    node_data, all_words, eta_prior, renyi_alpha
                )
                
                # Calculate sparsity ratio (proportion of non-zero words)
                sparsity_ratio = nonzero_words / len(all_words) if len(all_words) > 0 else 0
                
                results.append({
                    'node_id': node_id,
                    'renyi_entropy_corrected': entropy,
                    'nonzero_word_count': nonzero_words,
                    'total_vocabulary_size': len(all_words),
                    'sparsity_ratio': sparsity_ratio,
                    'eta_prior': eta_prior,
                    'renyi_alpha': renyi_alpha,
                    'iteration': max_iteration
                })
            
            # Save new corrected_renyi_entropy.csv file
            results_df = pd.DataFrame(results)
            output_path = os.path.join(folder_path, 'corrected_renyi_entropy.csv')
            results_df.to_csv(output_path, index=False)
            print(f"✓ Saved corrected Renyi entropy results to: {output_path}")
            
            # Output some statistics
            print(f"Node vocabulary sparsity statistics:")
            print(f"  - Average non-zero words: {results_df['nonzero_word_count'].mean():.1f}")
            print(f"  - Non-zero word count range: {results_df['nonzero_word_count'].min()}-{results_df['nonzero_word_count'].max()}")
            print(f"  - Average sparsity: {results_df['sparsity_ratio'].mean():.3f}")
            print("=" * 50)
            
                
        except Exception as e:
            import traceback
            print(f"❌ Error processing file {file_path}: {str(e)}")
            print("Detailed error information:")
            traceback.print_exc()
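
The folder-name parsing used above (`split('eta_')`) recurs in several cells of this notebook. A small regex helper is one way to centralize it; this is a sketch only, and the names `ETA_PATTERN`, `RUN_PATTERN`, and `parse_eta_and_run` are hypothetical, not part of the notebook.

```python
import re

# Folder names look like "depth_3_gamma_0.05_eta_0.1_run_3"
ETA_PATTERN = re.compile(r"eta_(\d+(?:\.\d+)?)")
RUN_PATTERN = re.compile(r"_run_(\d+)")

def parse_eta_and_run(folder_name):
    """Return (eta, run_id); either is None when it cannot be found."""
    eta_match = ETA_PATTERN.search(folder_name)
    run_match = RUN_PATTERN.search(folder_name)
    eta = float(eta_match.group(1)) if eta_match else None
    run_id = run_match.group(1) if run_match else None
    return eta, run_id

parsed = parse_eta_and_run("depth_3_gamma_0.05_eta_0.1_run_3")
unparsed = parse_eta_and_run("some_other_folder")
```

Returning `None` instead of silently defaulting to 0.1 makes it visible when a folder name does not follow the expected pattern, which matters because eta doubles as the smoothing parameter.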

In [2]:
# Set parameters
base_path = "/Volumes/My Passport/收敛结果/step2"  # Root directory
renyi_alpha = 2.0  # Renyi entropy order parameter

print("=" * 50)
print("Starting batch calculation of corrected Renyi entropy (automatically adjusting prior by eta value)...")
print("=" * 50)
process_all_iteration_files_by_eta(base_path, renyi_alpha)
print("=" * 50)
print("All processing completed!")
print("=" * 50)

Starting batch calculation of corrected Renyi entropy (automatically adjusting prior by eta value)...
Found 18 files to process

[1/18] Processing file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e0005_基于e01_收敛/depth_3_gamma_0.05_eta_0.005_run_1/iteration_node_word_distributions.csv
Folder: depth_3_gamma_0.05_eta_0.005_run_1
Extracted eta value: 0.005
Last iteration: 115, vocabulary size: 1490, number of nodes: 438
✓ Saved corrected Renyi entropy results to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e0005_基于e01_收敛/depth_3_gamma_0.05_eta_0.005_run_1/corrected_renyi_entropy.csv
Node vocabulary sparsity statistics:
  - Average non-zero words: 23.0
  - Non-zero word count range: 0-1108
  - Average sparsity: 0.015

[2/18] Processing file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e0005_基于e01_收敛/depth_3_gamma_0.05_eta_0.005_run_2/iteration_node_word_distributions.csv
Folder: depth_3_gamma_0.05_eta_0.005_run_2
Extracted eta value: 0.005
Last iteration: 115, vocabulary size: 1490, n

In [3]:
def calculate_node_document_counts(path_structures_df):
    """
    Aggregate from leaf nodes upward to calculate document count and hierarchical relationships for each node
    
    Parameters:
    path_structures_df: DataFrame, data from iteration_path_structures.csv (filtered for the last iteration)
    
    Returns:
    dict: {node_id: {'document_count': int, 'layer': int, 'parent_id': int, 'child_ids': list, 'child_count': int}} mapping
    """
    # Get all layer columns (named layer_<i>_node_id)
    layer_columns = [col for col in path_structures_df.columns if col.startswith('layer_') and col.endswith('_node_id')]
    layer_columns.sort()  # Ensure ordered arrangement
    max_layer_idx = len(layer_columns) - 1
    
    print(f"[DEBUG] Found layer columns: {layer_columns}")
    print(f"[DEBUG] Maximum layer index: {max_layer_idx}")
    
    # Initialize node information dictionary
    node_info = {}
    
    # First establish hierarchical and parent-child relationships for all nodes
    for _, row in path_structures_df.iterrows():
        path_nodes = []
        for layer_idx in range(max_layer_idx + 1):
            layer_col = f'layer_{layer_idx}_node_id'
            if layer_col in path_structures_df.columns and pd.notna(row[layer_col]):
                path_nodes.append(row[layer_col])
            else:
                break
        
        # Establish hierarchical and parent-child relationships for each node in the path
        for i, node in enumerate(path_nodes):
            if node not in node_info:
                node_info[node] = {
                    'document_count': 0,
                    'layer': i,
                    'parent_id': None,
                    'child_ids': [],
                    'child_count': 0
                }
            else:
                # Update layer information (ensure consistency)
                node_info[node]['layer'] = i
            
            # Set parent node relationships
            if i > 0:  # Not root node
                parent_node = path_nodes[i-1]
                node_info[node]['parent_id'] = parent_node
                
                # Add current node to parent node's child list
                if parent_node not in node_info:
                    node_info[parent_node] = {
                        'document_count': 0,
                        'layer': i-1,
                        'parent_id': None,
                        'child_ids': [],
                        'child_count': 0
                    }
                
                if node not in node_info[parent_node]['child_ids']:
                    node_info[parent_node]['child_ids'].append(node)
    
    # Then process leaf node document counts - after establishing hierarchical relationships
    for _, row in path_structures_df.iterrows():
        leaf_node = row['leaf_node_id']
        if pd.notna(leaf_node) and leaf_node in node_info:
            node_info[leaf_node]['document_count'] += row['document_count']
    
    # Aggregate document counts from second-to-last layer upward
    for layer_idx in range(max_layer_idx - 1, -1, -1):  # From second-to-last layer to layer 0
        layer_col = f'layer_{layer_idx}_node_id'
        
        if layer_col not in path_structures_df.columns:
            continue
            
        # Get all unique nodes in this layer
        layer_nodes = path_structures_df[layer_col].dropna().unique()
        
        for node in layer_nodes:
            if node in node_info and node_info[node]['document_count'] == 0:
                # Calculate document count: sum all child node document counts
                child_doc_count = 0
                for child_id in node_info[node]['child_ids']:
                    if child_id in node_info:
                        child_doc_count += node_info[child_id]['document_count']
                
                # If no child node document count, calculate directly from path structure
                if child_doc_count == 0:
                    total_docs = path_structures_df[path_structures_df[layer_col] == node]['document_count'].sum()
                    node_info[node]['document_count'] = total_docs
                else:
                    node_info[node]['document_count'] = child_doc_count

    # Calculate child node count for each node
    for node_id, info in node_info.items():
        info['child_count'] = len(info['child_ids'])
    
    return node_info

def add_document_counts_to_entropy_files(base_path="."):
    """
    Add document count and hierarchical information to corrected_renyi_entropy.csv files
    """
    pattern = os.path.join(base_path, "**", "iteration_path_structures.csv")
    files = glob.glob(pattern, recursive=True)
    
    for file_path in files:
        folder_path = os.path.dirname(file_path)
        print(f"\nProcessing path structure file: {file_path}")
        
        try:
            # Read path_structures file
            df = pd.read_csv(file_path)
            df.columns = [col.strip("'\" ") for col in df.columns]
            
            # Get last iteration data
            max_iteration = df['iteration'].max()
            last_iteration_data = df[df['iteration'] == max_iteration]
            
            print(f"Last iteration: {max_iteration}, path count: {len(last_iteration_data)}")
            
            # Calculate document count and hierarchical relationships for each node
            node_info = calculate_node_document_counts(last_iteration_data)
            
            print(f"Calculated information for {len(node_info)} nodes")
            
            # Read corresponding corrected_renyi_entropy.csv
            entropy_file = os.path.join(folder_path, 'corrected_renyi_entropy.csv')
            if os.path.exists(entropy_file):
                entropy_df = pd.read_csv(entropy_file)
                
                # Add new columns - fix child_ids format and child_count calculation
                entropy_df['document_count'] = entropy_df['node_id'].map(lambda x: node_info.get(x, {}).get('document_count', 0))
                entropy_df['layer'] = entropy_df['node_id'].map(lambda x: node_info.get(x, {}).get('layer', -1))
                entropy_df['parent_id'] = entropy_df['node_id'].map(lambda x: node_info.get(x, {}).get('parent_id', None))
                
                # Format child_ids as a bracketed, comma-separated string (empty string for leaf nodes)
                entropy_df['child_ids'] = entropy_df['node_id'].map(
                    lambda x: '[' + ','.join(map(str, node_info.get(x, {}).get('child_ids', []))) + ']' 
                    if node_info.get(x, {}).get('child_ids') else ''
                )
                
                # Fix child_count: use list length directly
                entropy_df['child_count'] = entropy_df['node_id'].map(lambda x: len(node_info.get(x, {}).get('child_ids', [])))

                # Save updated file
                entropy_df.to_csv(entropy_file, index=False)
                print(f"Updated {entropy_file}, added document_count, layer, parent_id, child_ids, child_count columns")
                
                # Display some statistics
                print(f"Node layer statistics:")
                print(f"  - Layer distribution: {entropy_df['layer'].value_counts().sort_index().to_dict()}")
                print(f"  - Document count range: {entropy_df['document_count'].min()}-{entropy_df['document_count'].max()}")
                print(f"  - Root node count: {entropy_df[entropy_df['parent_id'].isna()].shape[0]}")
                print(f"  - Leaf node count: {entropy_df[entropy_df['child_ids'] == ''].shape[0]}")
                print(f"  - Child count distribution: {entropy_df['child_count'].value_counts().sort_index().to_dict()}")
            else:
                print(f"Warning: Corresponding entropy file not found {entropy_file}")
                
        except Exception as e:
            import traceback
            print(f"Error processing file {file_path}: {str(e)}")
            print("Detailed error information:")
            traceback.print_exc()
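
A node's document count should equal the total over all root-to-leaf paths that pass through it, which gives an independent check on the bottom-up aggregation in `calculate_node_document_counts()`. The sketch below assumes each row of the path table is one distinct path with its own document count; the function name and toy tree are hypothetical.

```python
from collections import defaultdict

def node_document_counts(paths):
    """paths: iterable of (node_ids_from_root_to_leaf, document_count)."""
    counts = defaultdict(int)
    for node_ids, document_count in paths:
        # Every node on a path contains all documents assigned to that path
        for node_id in node_ids:
            counts[node_id] += document_count
    return dict(counts)

# Toy tree: root 0 with children 1 and 2; leaves 3 and 4 under 1, leaf 5 under 2
toy_paths = [((0, 1, 3), 5), ((0, 1, 4), 2), ((0, 2, 5), 1)]
toy_counts = node_document_counts(toy_paths)
```

The root ends up with the full corpus size (8 in the toy tree), so comparing the root count against the known number of documents is a quick sanity check on a run.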

In [4]:
# Main function: Add document count and hierarchical information to entropy files
import os
import glob
import pandas as pd 

base_path = "/Volumes/My Passport/收敛结果/step2"  # Root directory

print("=" * 50)
print("Starting to add document count and hierarchical information to entropy files...")
print("=" * 50)
add_document_counts_to_entropy_files(base_path)
print("=" * 50)
print("Document count and hierarchical information addition completed!")
print("=" * 50)

Starting to add document count and hierarchical information to entropy files...

Processing path structure file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/iteration_path_structures.csv
Last iteration: 175, path count: 186
[DEBUG] Found layer columns: ['layer_0_node_id', 'layer_1_node_id', 'layer_2_node_id']
[DEBUG] Maximum layer index: 2
Calculated information for 231 nodes
Updated /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/corrected_renyi_entropy.csv, added document_count, layer, parent_id, child_ids, child_count columns
Node layer statistics:
  - Layer distribution: {0: 1, 1: 44, 2: 186}
  - Document count range: 1-970
  - Root node count: 1
  - Leaf node count: 186
  - Child count distribution: {0: 186, 1: 4, 2: 10, 3: 10, 4: 10, 5: 8, 6: 1, 44: 1, 46: 1}

Processing path structure file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_3/iteration_path_structures

In [5]:
def jensen_shannon_distance(p, q):
    """
    Calculate Jensen-Shannon distance between two probability distributions
    
    Parameters:
    p, q: array-like, probability distributions (should be normalized)
    
    Returns:
    float: Jensen-Shannon distance
    """
    # Ensure inputs are numpy arrays
    p = np.array(p)
    q = np.array(q)
    
    # Calculate midpoint distribution
    m = 0.5 * (p + q)
    
    # Calculate KL divergence, add small constant to avoid log(0)
    eps = 1e-10
    kl_pm = np.sum(p * np.log((p + eps) / (m + eps)))
    kl_qm = np.sum(q * np.log((q + eps) / (m + eps)))
    
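    # Jensen-Shannon divergence
    js_divergence = 0.5 * kl_pm + 0.5 * kl_qm
    
    # Jensen-Shannon distance (square root of divergence); clip at 0 because
    # rounding and the eps term can push the divergence slightly below zero
    js_distance = np.sqrt(max(js_divergence, 0.0))
    
    return js_distance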

def calculate_jensen_shannon_distances_with_weighted_entropy_by_eta(base_path="."):
    """
    Calculate Jensen-Shannon distances between nodes in each layer and document-weighted average Renyi entropy
    Automatically extract eta value from folder name as Dirichlet smoothing parameter
    """
    # Find all iteration_node_word_distributions.csv files
    pattern = os.path.join(base_path, "**", "iteration_node_word_distributions.csv")
    files = glob.glob(pattern, recursive=True)
    
    print(f"Found {len(files)} word distribution files to process")
    
    # Group files by eta value for display
    files_by_eta = {}
    for file_path in files:
        folder_name = os.path.basename(os.path.dirname(file_path))
        eta = 0.1  # Default value
        if 'eta_' in folder_name:
            try:
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta = float(eta_part)
            except (IndexError, ValueError):
                pass  # Keep the default eta if the folder name cannot be parsed
        
        if eta not in files_by_eta:
            files_by_eta[eta] = []
        files_by_eta[eta].append(file_path)
    
    print("File distribution:")
    for eta in sorted(files_by_eta.keys()):
        print(f"  Eta {eta}: {len(files_by_eta[eta])} files")
    print()
    
    # Process each file
    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        
        # Extract eta value and run information from folder name
        eta = 0.1  # Default value
        run_id = "unknown"
        if 'eta_' in folder_name:
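            try:
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta = float(eta_part)
            except (IndexError, ValueError):
                pass  # Keep the default eta if the folder name cannot be parsed
        
        if '_run_' in folder_name:
            # The membership check guarantees split() yields a second element
            run_id = folder_name.split('_run_')[1]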
        
        print("=" * 80)
        print(f"[{idx}/{len(files)}] Processing Eta={eta}, Run={run_id}")
        print("=" * 80)
        
        try:
            # Read word distribution data
            word_df = pd.read_csv(file_path)
            word_df.columns = [col.strip("'\" ") for col in word_df.columns]
            
            # Get last iteration data
            max_iteration = word_df['iteration'].max()
            last_iteration_data = word_df[word_df['iteration'] == max_iteration]
            
            # Get complete vocabulary
            all_words = sorted(list(last_iteration_data['word'].dropna().unique()))
            
            # Read entropy file to get layer information
            entropy_file = os.path.join(folder_path, 'corrected_renyi_entropy.csv')
            if not os.path.exists(entropy_file):
                print(f"⚠️  Entropy file not found, skipping this file")
                continue
                
            entropy_df = pd.read_csv(entropy_file)
            
            # Basic information
            print(f"📊 Basic information:")
            print(f"   Vocabulary size: {len(all_words)}")
            print(f"   Last iteration: {max_iteration}")
            
            # Group nodes by layer
            layers = entropy_df.groupby('layer')['node_id'].apply(list).to_dict()
            print(f"   Layer distribution: {[(layer, len(nodes)) for layer, nodes in layers.items()]}")
            
            # Build probability distributions for each node
            print(f"🔄 Building probability distributions...")
            node_distributions = {}
            # Build the word-to-index mapping once; it is identical for every node
            word_to_idx = {word: idx for idx, word in enumerate(all_words)}
            
            for node_id in entropy_df['node_id'].unique():
                # Get word distribution for this node
                node_words = last_iteration_data[last_iteration_data['node_id'] == node_id]
                
                # Initialize count vector
                counts = np.zeros(len(all_words))
                
                # Fill actual counts
                for _, row in node_words.iterrows():
                    word = row['word']
                    if pd.notna(word) and word in word_to_idx:
                        counts[word_to_idx[word]] = row['count']
                
                # Add Dirichlet smoothing
                smoothed_counts = counts + eta
                
                # Calculate probability distribution
                probabilities = smoothed_counts / np.sum(smoothed_counts)
                node_distributions[node_id] = probabilities
            
            print(f"   ✓ Completed {len(node_distributions)} node probability distributions")
            
            # Calculate JS distances within each layer and weighted average entropy
            all_js_distances = []
            layer_avg_distances = []
            
            print(f"📐 Calculating JS distances...")
            for layer, layer_nodes in layers.items():
                layer_js_distances = []
                n = len(layer_nodes)
                
                # Calculate JS distances for all node pairs within this layer
                for i, node1 in enumerate(layer_nodes):
                    for j, node2 in enumerate(layer_nodes):
                        if i < j:  # Only calculate upper triangle to avoid duplicates and self-comparisons
                            if node1 in node_distributions and node2 in node_distributions:
                                p = node_distributions[node1]
                                q = node_distributions[node2]
                                
                                # Calculate Jensen-Shannon distance
                                js_distance = jensen_shannon_distance(p, q)
                                
                                layer_js_distances.append({
                                    'layer': layer,
                                    'node1_id': node1,
                                    'node2_id': node2,
                                    'js_distance': js_distance,
                                    'node1_doc_count': entropy_df[entropy_df['node_id'] == node1]['document_count'].iloc[0] if len(entropy_df[entropy_df['node_id'] == node1]) > 0 else 0,
                                    'node2_doc_count': entropy_df[entropy_df['node_id'] == node2]['document_count'].iloc[0] if len(entropy_df[entropy_df['node_id'] == node2]) > 0 else 0
                                })
                
                all_js_distances.extend(layer_js_distances)
                
                # Calculate average JS distance for this layer
                avg_js_distance = 0.0
                if layer_js_distances and n > 1:
                    total_js_distance = sum(d['js_distance'] for d in layer_js_distances)
                    max_pairs = n * (n - 1) // 2
                    avg_js_distance = total_js_distance / max_pairs
                
                # Calculate document-weighted average Renyi entropy for this layer
                layer_entropy_data = entropy_df[entropy_df['layer'] == layer]
                total_docs = layer_entropy_data['document_count'].sum()
                
                if total_docs > 0:
                    weighted_entropy = (layer_entropy_data['document_count'] * layer_entropy_data['renyi_entropy_corrected']).sum() / total_docs
                else:
                    weighted_entropy = 0.0
                
                layer_avg_distances.append({
                    'layer': layer,
                    'node_count': n,
                    'total_pairs': len(layer_js_distances),
                    'max_pairs': n * (n - 1) // 2 if n > 1 else 0,
                    'sum_js_distance': sum(d['js_distance'] for d in layer_js_distances),
                    'avg_js_distance': avg_js_distance,
                    'total_documents': total_docs,
                    'weighted_avg_renyi_entropy': weighted_entropy,
                    'eta_used': eta
                })
                
                # Concise layer statistics output
                print(f"   Layer {layer}: {n} nodes, JS={avg_js_distance:.4f}, entropy={weighted_entropy:.4f}")
            
            # Save result files
            if all_js_distances:
                js_df = pd.DataFrame(all_js_distances)
                output_path = os.path.join(folder_path, 'jensen_shannon_distances.csv')
                js_df.to_csv(output_path, index=False)
            
            if layer_avg_distances:
                avg_df = pd.DataFrame(layer_avg_distances)
                avg_output_path = os.path.join(folder_path, 'layer_average_js_distances.csv')
                avg_df.to_csv(avg_output_path, index=False)
            
            print(f"💾 Results saved")
            
        except Exception as e:
            print(f"❌ Processing failed: {str(e)}")
    
    print("\n" + "=" * 80)
    print("✅ All files processed!")
    print("=" * 80)
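
With natural logarithms, the Jensen-Shannon distance lies between 0 (identical distributions) and $\sqrt{\ln 2} \approx 0.8326$ (distributions with disjoint support), which helps when reading the per-layer averages. The sketch below checks both ends with an independent re-implementation that masks zero entries instead of adding a small constant; the name `js_distance` is hypothetical.

```python
import math
import numpy as np

def js_distance(p, q):
    """Jensen-Shannon distance (natural log) between two distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(divergence, 0.0))

same = js_distance([0.5, 0.5], [0.5, 0.5])
disjoint = js_distance([1.0, 0.0], [0.0, 1.0])
upper_bound = math.sqrt(math.log(2))
```

Layer averages around 0.44 to 0.49 therefore sit a little above the middle of the attainable range.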

In [6]:
import numpy as np
import os
import glob
import pandas as pd 

# Main function: Calculate Jensen-Shannon distances and weighted average Renyi entropy
base_path = "/Volumes/My Passport/收敛结果/step2"  # Root directory

print("=" * 50)
print("Starting Jensen-Shannon distance and weighted average Renyi entropy calculation (auto-adjusting by eta value)...")
print("=" * 50)
calculate_jensen_shannon_distances_with_weighted_entropy_by_eta(base_path)
print("=" * 50)
print("Jensen-Shannon distance and weighted average Renyi entropy calculation completed!")
print("=" * 50)

Starting Jensen-Shannon distance and weighted average Renyi entropy calculation (auto-adjusting by eta value)...
Found 18 word distribution files to process
File distribution:
  Eta 0.005: 3 files
  Eta 0.01: 3 files
  Eta 0.02: 3 files
  Eta 0.05: 3 files
  Eta 0.1: 3 files
  Eta 0.2: 3 files

[1/18] Processing Eta=0.1, Run=2
📊 Basic information:
   Vocabulary size: 1490
   Last iteration: 175
   Layer distribution: [(0, 1), (1, 44), (2, 186)]
🔄 Building probability distributions...
   ✓ Completed 231 node probability distributions
📐 Calculating JS distances...
   Layer 0: 1 nodes, JS=0.0000, entropy=4.8965
   Layer 1: 44 nodes, JS=0.4414, entropy=5.3491
   Layer 2: 186 nodes, JS=0.4917, entropy=5.0623
💾 Results saved
[2/18] Processing Eta=0.1, Run=3
📊 Basic information:
   Vocabulary size: 1490
   Last iteration: 175
   Layer distribution: [(0, 1), (1, 41), (2, 173)]
🔄 Building probability distributions...
   ✓ Completed 215 node probability distributions
📐 Calculating JS distances..

In [8]:
def aggregate_layer_statistics_by_eta(base_path="."):
    """
    Aggregate layer-level JS distance and weighted entropy statistics by eta value,
    generate summary tables at the same level as run folders
    """
    # Find all layer_average_js_distances.csv files
    pattern = os.path.join(base_path, "**", "layer_average_js_distances.csv")
    files = glob.glob(pattern, recursive=True)
    
    # Store all data and grouping information
    all_data = []
    eta_groups = {}  # Store parent directory for each eta combination
    
    for file_path in files:
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        parent_folder = os.path.dirname(folder_path)  # Parent directory of run folder
        
        # Extract eta value from folder name
        eta = None
        if 'eta_' in folder_name:
            try:
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta = float(eta_part)
            except (IndexError, ValueError):
                print(f"Warning: Unable to extract eta value from folder name {folder_name}")
                continue
        else:
            print(f"Warning: Folder name {folder_name} does not contain eta information")
            continue
        
        # Extract run number
        run_match = folder_name.split('_run_')
        if len(run_match) > 1:
            run_id = run_match[1]
        else:
            print(f"Warning: Unable to extract run number from folder name {folder_name}")
            continue
        
        # Record parent directory for eta combination
        if eta not in eta_groups:
            eta_groups[eta] = parent_folder
        
        try:
            df = pd.read_csv(file_path)
            
            for _, row in df.iterrows():
                all_data.append({
                    'eta': eta,
                    'run_id': run_id,
                    'layer': row['layer'],
                    'node_count': row['node_count'],
                    'avg_js_distance': row['avg_js_distance'],
                    'weighted_avg_renyi_entropy': row['weighted_avg_renyi_entropy'],
                    'total_documents': row['total_documents'],
                    'parent_folder': parent_folder
                })
                
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
    
    # Convert to DataFrame
    summary_df = pd.DataFrame(all_data)
    
    if summary_df.empty:
        print("No valid data found")
        return
    
    print("=" * 70)
    print("Layer-level Summary Statistics by ETA Value")
    print("=" * 70)
    
    # Group by eta and generate summary files
    for eta, group_data in summary_df.groupby('eta'):
        parent_folder = group_data['parent_folder'].iloc[0]
        
        print(f"\nProcessing Eta={eta}")
        print(f"Output directory: {parent_folder}")
        
        # Calculate summary statistics for each layer
        layer_summary = group_data.groupby('layer').agg({
            'avg_js_distance': ['mean', 'std', 'count'],
            'weighted_avg_renyi_entropy': ['mean', 'std', 'count'],
            'node_count': ['mean', 'std'],
            'total_documents': 'mean',
            'run_id': lambda x: ', '.join(sorted(x.unique()))
        }).round(4)
        
        # Flatten column names
        layer_summary.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in layer_summary.columns]
        layer_summary = layer_summary.reset_index()
        
        # Rename aggregated columns for clarity; names absent from the frame are ignored
        column_mapping = {
            'avg_js_distance_count': 'run_count',
            'weighted_avg_renyi_entropy_count': 'entropy_run_count',
            'node_count_mean': 'avg_node_count',
            'total_documents_mean': 'avg_total_documents',
            'run_id_<lambda>': 'included_runs'
        }
        layer_summary = layer_summary.rename(columns=column_mapping)
        
        # Add eta information
        layer_summary.insert(0, 'eta', eta)
        
        # Save summary results at the same level as run folders
        output_filename = f'eta_{eta}_layer_summary.csv'
        output_path = os.path.join(parent_folder, output_filename)
        layer_summary.to_csv(output_path, index=False)
        
        print(f"  Saved summary file: {output_path}")
        print(f"  Included runs: {layer_summary['included_runs'].iloc[0] if 'included_runs' in layer_summary.columns else 'N/A'}")
        print(f"  Number of layers: {len(layer_summary)}")
        
        # Display brief statistics
        for _, row in layer_summary.iterrows():
            layer_num = int(row['layer'])
            js_mean = row['avg_js_distance_mean']
            js_std = row['avg_js_distance_std'] if 'avg_js_distance_std' in row else 0
            entropy_mean = row['weighted_avg_renyi_entropy_mean']
            entropy_std = row['weighted_avg_renyi_entropy_std'] if 'weighted_avg_renyi_entropy_std' in row else 0
            node_count = row['avg_node_count']
            run_count = int(row['run_count']) if 'run_count' in row else 0
            
            print(f"    Layer {layer_num}: JS={js_mean:.4f}(±{js_std:.4f}), entropy={entropy_mean:.4f}(±{entropy_std:.4f}), nodes={node_count:.1f}, runs={run_count}")
    
    # Generate overall comparison file (saved in base_path)
    print("\n" + "=" * 70)
    print("Generating Overall Comparison File")
    print("=" * 70)
    
    overall_summary = summary_df.groupby(['eta', 'layer']).agg({
        'avg_js_distance': ['mean', 'std'],
        'weighted_avg_renyi_entropy': ['mean', 'std'],
        'node_count': ['mean', 'std'],
        'run_id': 'count'
    }).round(4)
    
    # Flatten column names
    overall_summary.columns = ['_'.join(col).strip() for col in overall_summary.columns]
    overall_summary = overall_summary.reset_index()
    
    overall_output_path = os.path.join(base_path, 'eta_layer_comparison.csv')
    overall_summary.to_csv(overall_output_path, index=False)
    print(f"Overall comparison file saved to: {overall_output_path}")
    
    # Display cross-eta comparison
    for layer in sorted(summary_df['layer'].unique()):
        print(f"\nLayer {int(layer)} Cross-Eta Comparison:")
        print("Eta Value  JS Distance(±std)   Weighted Entropy(±std)  Node Count(±std)  Run Count")
        print("-" * 75)
        
        layer_data = overall_summary[overall_summary['layer'] == layer]
        for _, row in layer_data.iterrows():
            eta = row['eta']
            js_mean = row['avg_js_distance_mean']
            js_std = row['avg_js_distance_std']
            entropy_mean = row['weighted_avg_renyi_entropy_mean']
            entropy_std = row['weighted_avg_renyi_entropy_std']
            node_mean = row['node_count_mean']
            node_std = row['node_count_std']
            run_count = int(row['run_id_count'])
            
            print(f"{eta:6.3f}    {js_mean:6.4f}(±{js_std:5.4f})   {entropy_mean:6.4f}(±{entropy_std:5.4f})   {node_mean:6.1f}(±{node_std:4.1f})   {run_count:4d}")
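
The aggregation above depends on run folders being named like `depth_3_gamma_0.05_eta_0.1_run_3`. A minimal sketch of the same extraction with a single regular expression is shown here; `parse_run_folder` and `RUN_FOLDER_RE` are hypothetical names introduced for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper: a regex-based alternative to the chained split() calls
RUN_FOLDER_RE = re.compile(r"eta_(?P<eta>[0-9.]+)_run_(?P<run>\d+)")

def parse_run_folder(folder_name):
    """Return (eta, run_id) parsed from a run folder name, or None if it does not match."""
    match = RUN_FOLDER_RE.search(folder_name)
    if match is None:
        return None
    try:
        return float(match.group("eta")), match.group("run")
    except ValueError:
        # e.g. a malformed number such as "0.1.2"
        return None

print(parse_run_folder("depth_3_gamma_0.05_eta_0.1_run_3"))  # (0.1, '3')
print(parse_run_folder("step2_d3_g005_e01"))                 # None
```

Returning `None` for non-matching names plays the same role as the `continue` branches in the function: folders that do not encode an eta value and a run number are skipped rather than guessed.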

In [None]:
# Execute aggregation analysis
base_path = "/Volumes/My Passport/收敛结果/step2"
print("=" * 70)
print("Starting aggregation of layer statistics by Eta values...")
print("=" * 70)
aggregate_layer_statistics_by_eta(base_path)
print("=" * 70)
print("Aggregation analysis completed!")
print("=" * 70)

Starting aggregation of layer statistics by Eta values...
Layer-level Summary Statistics by ETA Value

Processing Eta=0.005
Output directory: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e0005_基于e01_收敛
  Saved summary file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e0005_基于e01_收敛/eta_0.005_layer_summary.csv
  Included runs: 1, 2, 3
  Number of layers: 3
    Layer 0: JS=0.0000(±0.0000), entropy=5.4077(±0.0307), nodes=1.0, runs=3
    Layer 1: JS=0.7370(±0.0077), entropy=3.5804(±0.0238), nodes=85.0, runs=3
    Layer 2: JS=0.7486(±0.0018), entropy=3.0128(±0.0564), nodes=346.3, runs=3

Processing Eta=0.01
Output directory: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e001_基于e01_收敛
  Saved summary file: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e001_基于e01_收敛/eta_0.01_layer_summary.csv
  Included runs: 1, 2, 3
  Number of layers: 3
    Layer 0: JS=0.0000(±0.0000), entropy=5.3045(±0.0209), nodes=1.0, runs=3
    Layer 1: JS=0.6998(±0.0031), entropy=3.7481(±0.0531), nodes=79.0, runs=3
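
The `±` values in this output come from pandas' `std`, which by default is the sample standard deviation (denominator n - 1) across the runs of one eta value. The sketch below reproduces that grouping with the standard library only; the rows are hypothetical numbers, not results from these runs.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-run rows: (eta, run_id, layer, avg_js_distance)
rows = [
    (0.1, "1", 1, 0.40),
    (0.1, "2", 1, 0.50),
    (0.1, "3", 1, 0.60),
    (0.1, "1", 2, 0.45),
]

groups = defaultdict(list)
for eta, run_id, layer, js in rows:
    groups[(eta, layer)].append(js)

summary = {}
for key, values in sorted(groups.items()):
    # stdev() is the sample standard deviation (n - 1 denominator), like pandas' default std();
    # it is undefined for a single run, where pandas would report NaN
    spread = stdev(values) if len(values) > 1 else float("nan")
    summary[key] = (round(mean(values), 4), round(spread, 4), len(values))

for (eta, layer), (m, s, n) in summary.items():
    print(f"eta={eta} layer={layer}: JS={m:.4f}(±{s:.4f}), runs={n}")
```

With only three runs per eta value, these standard deviations are rough indicators of run-to-run variation rather than precise estimates.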


In [11]:
# Aggregate all result_layers.csv files: group by layer and take the mean (including nodes_in_layer)
import pandas as pd
import glob
import os

base_path = "/Volumes/My Passport/收敛结果/step2"
pattern = os.path.join(base_path, "**", "result_layers.csv")
files = glob.glob(pattern, recursive=True)

all_rows = []
for file in files:
    df = pd.read_csv(file)
    run_folder = os.path.dirname(file)
    folder_name = os.path.basename(run_folder)
    
    # Extract the eta value from the folder name
    eta = None
    if 'eta_' in folder_name:
        try:
            eta_part = folder_name.split('eta_')[1].split('_')[0]
            eta = float(eta_part)
        except (IndexError, ValueError):
            eta = None
    
    # Group by layer and take the mean
    grouped = df.groupby('layer').agg({
        'entropy_wavg': 'mean',
        'distinctiveness_wavg_jsd': 'mean',
        'nodes_in_layer': 'mean'
    }).reset_index()
    grouped['run_folder'] = run_folder
    grouped['eta'] = eta
    
    # Fill in other parameters (depth, gamma, alpha): take the first row of df when the column exists
    for col in ['depth', 'gamma', 'alpha']:
        if col in df.columns:
            grouped[col] = df[col].iloc[0]
        else:
            # Otherwise extract the value from the folder name
            if col == 'depth' and 'depth_' in folder_name:
                try:
                    grouped[col] = int(folder_name.split('depth_')[1].split('_')[0])
                except (IndexError, ValueError):
                    grouped[col] = 3  # default depth=3
            elif col == 'gamma' and 'gamma_' in folder_name:
                try:
                    grouped[col] = float(folder_name.split('gamma_')[1].split('_')[0])
                except (IndexError, ValueError):
                    grouped[col] = 0.05  # default gamma=0.05
            elif col == 'alpha':
                grouped[col] = 0.1  # default alpha=0.1
            else:
                grouped[col] = None
    
    all_rows.append(grouped)

summary_df = pd.concat(all_rows, ignore_index=True)
summary_df.to_csv(os.path.join(base_path, "all_layers_summary.csv"), index=False)
print("Aggregated per-layer means of all runs into all_layers_summary.csv")

Aggregated per-layer means of all runs into all_layers_summary.csv


In [12]:
import pandas as pd

# filepath: /Volumes/My Passport/收敛结果/step2/all_layers_summary.csv
df = pd.read_csv("/Volumes/My Passport/收敛结果/step2/all_layers_summary.csv")

# Group by eta and layer; compute the mean and standard deviation
summary = df.groupby(['eta', 'layer']).agg({
    'entropy_wavg': ['mean', 'std'],
    'distinctiveness_wavg_jsd': ['mean', 'std'],
    'nodes_in_layer': ['mean', 'std'],
    'depth': 'first',
    'gamma': 'first',
    'alpha': 'first'
}).reset_index()

# Flatten the multi-level column names
summary.columns = ['_'.join(col).strip('_') for col in summary.columns]

summary.to_csv("/Volumes/My Passport/收敛结果/step2/layer_eta_group_mean.csv", index=False)
print("Generated layer_eta_group_mean.csv: overall means across all runs by eta and layer")

Generated layer_eta_group_mean.csv: overall means across all runs by eta and layer
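
The column-flattening line in this cell is easy to misread, so here is a small standalone sketch of what it produces. It uses plain tuples that mimic the two-level column labels pandas yields after `groupby(...).agg(...).reset_index()`; the listed labels are an illustrative subset, not the full set written to the CSV.

```python
# Column labels as they look after groupby(...).agg(...).reset_index():
# group keys get an empty second level, aggregated columns get (column, statistic)
columns = [
    ("eta", ""),
    ("layer", ""),
    ("entropy_wavg", "mean"),
    ("entropy_wavg", "std"),
    ("depth", "first"),
]

# Join both levels with "_" and strip the trailing "_" left by empty second levels
flat = ["_".join(col).strip("_") for col in columns]
print(flat)
# ['eta', 'layer', 'entropy_wavg_mean', 'entropy_wavg_std', 'depth_first']
```

Note that parameters aggregated with `'first'` end up named `depth_first`, `gamma_first`, and `alpha_first` in the saved file, which matters when the CSV is read back by column name.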


In [13]:
import pandas as pd
import glob
import os

base_path = "/Volumes/My Passport/收敛结果/step2"
pattern = os.path.join(base_path, "**", "result_layers.csv")
files = glob.glob(pattern, recursive=True)

all_rows = []
for file in files:
    df = pd.read_csv(file)
    df['run_folder'] = os.path.dirname(file)  # Record the source folder
    
    # Extract parameters from the path
    folder = os.path.dirname(file)
    folder_name = os.path.basename(folder)
    
    for col in ['depth', 'gamma', 'eta', 'alpha']:
        if col not in df.columns:
            if f"{col}_" in folder_name:
                try:
                    value = float(folder_name.split(f"{col}_")[1].split("_")[0])
                except (IndexError, ValueError):
                    value = None
                df[col] = value
            else:
                # Fall back to default values
                if col == 'depth':
                    df[col] = 3
                elif col == 'gamma':
                    df[col] = 0.05
                elif col == 'alpha':
                    df[col] = 0.1
                else:
                    df[col] = None
    
    all_rows.append(df)

summary_df = pd.concat(all_rows, ignore_index=True)
summary_df.to_csv(os.path.join(base_path, "all_result_layers_merged.csv"), index=False)
print("Merged all result_layers.csv files into all_result_layers_merged.csv")

Merged all result_layers.csv files into all_result_layers_merged.csv


In [14]:
import pandas as pd
import glob
import os

base_path = "/Volumes/My Passport/收敛结果/step2"
pattern = os.path.join(base_path, "**", "result_layers.csv")
files = glob.glob(pattern, recursive=True)

all_rows = []
for file in files:
    df = pd.read_csv(file)
    folder = os.path.dirname(file)
    folder_name = os.path.basename(folder)
    
    # Fill in parameter information extracted from the folder name
    for col in ['depth', 'gamma', 'eta', 'alpha']:
        if col not in df.columns:
            if f"{col}_" in folder_name:
                try:
                    value = float(folder_name.split(f"{col}_")[1].split("_")[0])
                except (IndexError, ValueError):
                    value = None
                df[col] = value
            else:
                # Fall back to default values
                if col == 'depth':
                    df[col] = 3
                elif col == 'gamma':
                    df[col] = 0.05
                elif col == 'alpha':
                    df[col] = 0.1
                else:
                    df[col] = None
    all_rows.append(df)

merged = pd.concat(all_rows, ignore_index=True)

# Group by parameter set and layer; compute the mean and standard deviation
group_cols = ['depth', 'gamma', 'eta', 'alpha', 'layer']
summary = merged.groupby(group_cols).agg({
    'entropy_wavg': ['mean', 'std'],
    'distinctiveness_wavg_jsd': ['mean', 'std'],
    'nodes_in_layer': ['mean', 'std'],
}).reset_index()

# Flatten the multi-level column names
summary.columns = ['_'.join(col).strip('_') for col in summary.columns]

summary.to_csv(os.path.join(base_path, "all_params_layer_mean.csv"), index=False)
print("Generated all_params_layer_mean.csv: per-layer means for each parameter set")

Generated all_params_layer_mean.csv: per-layer means for each parameter set


In [15]:
import pandas as pd
import numpy as np
import os
import glob
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

def extract_params_from_folder(folder_name):
    params = {'eta': 0.1, 'gamma': 0.05, 'depth': 3, 'alpha': 0.1}
    for param in params.keys():
        if f'{param}_' in folder_name:
            try:
                value = folder_name.split(f'{param}_')[1].split('_')[0]
                if param == 'depth':
                    params[param] = int(value)
                else:
                    params[param] = float(value)
            except Exception as e:
                print(f"⚠️ Failed to extract parameter {param}: {e}")
    return params['eta'], params['gamma'], params['depth'], params['alpha']


def calculate_standard_coherence_from_corpus_corrected(corpus, node_word_df, top_k=15):
    """
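    Compute topic coherence metrics (NPMI, C_V, UMass) for each node and return the global average coherence.
    Args:
        corpus: dict {doc_id: [word list]}
        node_word_df: DataFrame with columns such as 'node_id', 'word', 'count'
        top_k: int, number of highest-count words selected per node
    Returns:
        global_coherence: dict, global average coherence per measure
        per_topic_coherence: dict, list of per-node coherence scores for each measure
        node_to_topic_idx: dict, mapping from node_id to topic index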
    """
    texts = list(corpus.values())
    dictionary = Dictionary(texts)

    topics = []
    node_to_topic_idx = {}
    topic_idx = 0
    for node_id in node_word_df['node_id'].unique():
        node_data = node_word_df[node_word_df['node_id'] == node_id]
        top_words = node_data.nlargest(top_k, 'count')['word'].tolist()
        valid_words = [w for w in top_words if pd.notna(w) and w in dictionary.token2id]
        if len(valid_words) >= 2:
            topics.append(valid_words)
            node_to_topic_idx[node_id] = topic_idx
            topic_idx += 1

    if len(topics) == 0:
        return {}, {}, {}

    coherence_measures = ['c_npmi', 'c_v', 'u_mass']
    per_topic_coherence = {}
    global_coherence = {}

    for measure in coherence_measures:
        try:
            cm = CoherenceModel(
                topics=topics,
                texts=texts,
                dictionary=dictionary,
                coherence=measure,
                processes=1
            )
            per_topic_scores = cm.get_coherence_per_topic()
            per_topic_coherence[measure] = per_topic_scores
            global_coherence[measure] = cm.get_coherence()
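        except Exception as e:
            # Fall back to zeros, but report the failure instead of hiding it
            print(f"⚠️ Failed to compute {measure}: {e}")
            per_topic_coherence[measure] = [0.0] * len(topics)
            global_coherence[measure] = 0.0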

    return global_coherence, per_topic_coherence, node_to_topic_idx

def calculate_coherence_layered_analysis(base_path=".", corpus=None, top_k=15):
    """
    Calculate node coherence metrics and perform weighted aggregation analysis by layer
    """
    if corpus is None:
        print("❌ Must provide original corpus")
        return

    pattern = os.path.join(base_path, "**", "iteration_node_word_distributions.csv")
    files = glob.glob(pattern, recursive=True)

    print(f"🔍 Found {len(files)} word distribution files to process (top_k={top_k})")

    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)

        # Extract parameters from the folder name
        eta, gamma, depth, alpha = extract_params_from_folder(folder_name)

        print(f"\n{'='*80}")
        print(f"[{idx}/{len(files)}] Processing file: {folder_name} (k={top_k})")
        print(f"Parameters - Eta: {eta}, Gamma: {gamma}, Depth: {depth}, Alpha: {alpha}")
        print(f"{'='*80}")

        try:
            # Read data
            df = pd.read_csv(file_path)
            df.columns = [col.strip("'\" ") for col in df.columns]

            max_iteration = df['iteration'].max()
            last_iteration_data = df[df['iteration'] == max_iteration]

            # Read layer and document count information
            entropy_file = os.path.join(folder_path, 'corrected_renyi_entropy.csv')
            if not os.path.exists(entropy_file):
                print("⚠️ Entropy file not found, skipping this file")
                continue

            entropy_df = pd.read_csv(entropy_file)

            print(f"📈 Last iteration: {max_iteration}")
            print(f"📈 Number of nodes: {last_iteration_data['node_id'].nunique()}")

            # Calculate node-level coherence (node-level only)
            texts = list(corpus.values())
            dictionary = Dictionary(texts)

            topics = []
            node_to_topic_idx = {}

            topic_idx = 0
            for node_id in last_iteration_data['node_id'].unique():
                node_data = last_iteration_data[last_iteration_data['node_id'] == node_id]
                top_words = node_data.nlargest(top_k, 'count')['word'].tolist()

                valid_words = []
                for word in top_words:
                    if pd.notna(word) and word in dictionary.token2id:
                        valid_words.append(word)

                if len(valid_words) >= 2:
                    topics.append(valid_words)
                    node_to_topic_idx[node_id] = topic_idx
                    topic_idx += 1

            if len(topics) == 0:
                print("⚠️ No valid topics, skipping this file")
                continue

            # Calculate coherence metrics
            coherence_measures = ['c_npmi', 'c_v', 'u_mass']
            per_topic_coherence = {}

            for measure in coherence_measures:
                try:
                    print(f"   Calculating {measure}...")

                    cm = CoherenceModel(
                        topics=topics,
                        texts=texts,
                        dictionary=dictionary,
                        coherence=measure,
                        processes=1
                    )

                    per_topic_scores = cm.get_coherence_per_topic()
                    per_topic_coherence[measure] = per_topic_scores

                    print(f"   ✓ {measure}: Range=[{min(per_topic_scores):.4f}, {max(per_topic_scores):.4f}]")

                except Exception as e:
                    print(f"   ❌ Error calculating {measure}: {e}")
                    per_topic_coherence[measure] = [0.0] * len(topics)

            # Merge node-level coherence with layer information
            node_coherence_data = []

            for node_id in last_iteration_data['node_id'].unique():
                node_words = last_iteration_data[last_iteration_data['node_id'] == node_id]
                top_words = node_words.nlargest(top_k, 'count')['word'].tolist()
                top_words = [word for word in top_words if pd.notna(word)]

                # Get layer and document count information
                node_entropy_info = entropy_df[entropy_df['node_id'] == node_id]
                if len(node_entropy_info) > 0:
                    layer = node_entropy_info['layer'].iloc[0]
                    document_count = node_entropy_info['document_count'].iloc[0]
                else:
                    layer = -1
                    document_count = 0

                # Get node coherence scores
                node_coherence_scores = {}
                if node_id in node_to_topic_idx:
                    topic_idx = node_to_topic_idx[node_id]
                    for measure in ['c_npmi', 'c_v', 'u_mass']:
                        if measure in per_topic_coherence:
                            measure_name = measure.replace('c_', '') if measure.startswith('c_') else measure
                            node_coherence_scores[f'node_{measure_name}'] = per_topic_coherence[measure][topic_idx]
                        else:
                            measure_name = measure.replace('c_', '') if measure.startswith('c_') else measure
                            node_coherence_scores[f'node_{measure_name}'] = 0.0
                else:
                    for measure in ['npmi', 'v', 'u_mass']:
                        node_coherence_scores[f'node_{measure}'] = 0.0

                node_coherence_data.append({
                    'node_id': node_id,
                    'eta': eta,
                    'gamma': gamma,
                    'depth': depth,
                    'alpha': alpha,
                    'layer': layer,
                    'document_count': document_count,
                    'top_k': top_k,
                    'top_words': ', '.join(top_words[:10]),
                    'word_count': len(top_words),

                    # Node-level coherence metrics only
                    'node_npmi': node_coherence_scores.get('node_npmi', 0.0),
                    'node_c_v': node_coherence_scores.get('node_v', 0.0),
                    'node_u_mass': node_coherence_scores.get('node_u_mass', 0.0),

                    'iteration': max_iteration
                })

            # Save node-level coherence results (with k value)
            coherence_df = pd.DataFrame(node_coherence_data)
            node_output_path = os.path.join(folder_path, f'node_coherence_k{top_k}.csv')
            coherence_df.to_csv(node_output_path, index=False)

            # Calculate layer-weighted average coherence
            layer_coherence_summary = []

            for layer in coherence_df['layer'].unique():
                if layer == -1:  # Skip invalid layers
                    continue

                layer_data = coherence_df[coherence_df['layer'] == layer]
                total_docs = layer_data['document_count'].sum()

                if total_docs > 0:
                    # Document count weighted average
                    weighted_npmi = (layer_data['document_count'] * layer_data['node_npmi']).sum() / total_docs
                    weighted_c_v = (layer_data['document_count'] * layer_data['node_c_v']).sum() / total_docs
                    weighted_u_mass = (layer_data['document_count'] * layer_data['node_u_mass']).sum() / total_docs

                    # Simple average (unweighted)
                    simple_npmi = layer_data['node_npmi'].mean()
                    simple_c_v = layer_data['node_c_v'].mean()
                    simple_u_mass = layer_data['node_u_mass'].mean()

                    layer_coherence_summary.append({
                        'layer': layer,
                        'node_count': len(layer_data),
                        'total_documents': total_docs,
                        'avg_documents_per_node': total_docs / len(layer_data),

                        # Document count weighted average coherence
                        'weighted_avg_npmi': weighted_npmi,
                        'weighted_avg_c_v': weighted_c_v,
                        'weighted_avg_u_mass': weighted_u_mass,

                        # Simple average coherence
                        'simple_avg_npmi': simple_npmi,
                        'simple_avg_c_v': simple_c_v,
                        'simple_avg_u_mass': simple_u_mass,

                        # Standard deviations
                        'std_npmi': layer_data['node_npmi'].std(),
                        'std_c_v': layer_data['node_c_v'].std(),
                        'std_u_mass': layer_data['node_u_mass'].std(),

                        'top_k': top_k,  # Add k value record
                        'eta': eta,
                        'gamma': gamma,
                        'depth': depth,
                        'alpha': alpha
                    })

            # Save layer summary results (with k value)
            if layer_coherence_summary:
                layer_summary_df = pd.DataFrame(layer_coherence_summary)
                layer_output_path = os.path.join(folder_path, f'layer_coherence_summary_k{top_k}.csv')
                layer_summary_df.to_csv(layer_output_path, index=False)

                print(f"💾 Node coherence results saved to: {node_output_path}")
                print(f"💾 Layer summary results saved to: {layer_output_path}")

                print(f"📊 Layer coherence summary (k={top_k}):")
                for _, row in layer_summary_df.iterrows():
                    layer_num = int(row['layer'])
                    node_count = int(row['node_count'])
                    w_npmi = row['weighted_avg_npmi']
                    w_cv = row['weighted_avg_c_v']
                    w_umass = row['weighted_avg_u_mass']
                    print(f"   Layer {layer_num} ({node_count} nodes): NPMI={w_npmi:.4f}, C_V={w_cv:.4f}, U_Mass={w_umass:.4f}")

        except Exception as e:
            import traceback
            print(f"❌ Error processing file {file_path}: {str(e)}")
            traceback.print_exc()

    print(f"\n✅ Coherence layered analysis for all files completed! (k={top_k})")

def process_coherence_with_original_corpus_corrected(base_path=".", corpus=None, top_k=15):
    """
    Corrected version: Complete calculation of node-level and global-level coherence metrics
    """
    if corpus is None:
        print("❌ Must provide original corpus")
        return

    pattern = os.path.join(base_path, "**", "iteration_node_word_distributions.csv")
    files = glob.glob(pattern, recursive=True)

    print(f"🔍 Found {len(files)} word distribution files to process")

    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)

        # Extract parameters from the folder name
        eta, gamma, depth, alpha = extract_params_from_folder(folder_name)

        print(f"\n{'='*80}")
        print(f"[{idx}/{len(files)}] Processing file: {folder_name}")
        print(f"Parameters - Eta: {eta}, Gamma: {gamma}, Depth: {depth}, Alpha: {alpha}")
        print(f"{'='*80}")

        try:
            # Read data
            df = pd.read_csv(file_path)
            df.columns = [col.strip("'\" ") for col in df.columns]

            max_iteration = df['iteration'].max()
            last_iteration_data = df[df['iteration'] == max_iteration]

            print(f"📈 Last iteration: {max_iteration}")
            print(f"📈 Number of nodes: {last_iteration_data['node_id'].nunique()}")

            # Calculate coherence (corrected version)
            global_coherence, per_topic_coherence, node_to_topic_idx = calculate_standard_coherence_from_corpus_corrected(
                corpus, last_iteration_data, top_k=top_k
            )

            if not global_coherence:
                print("⚠️ Coherence calculation failed, skipping this file")
                continue

            # Prepare data for saving
            results_data = []

            for node_id in last_iteration_data['node_id'].unique():
                node_words = last_iteration_data[last_iteration_data['node_id'] == node_id]
                top_words = node_words.nlargest(top_k, 'count')['word'].tolist()
                top_words = [word for word in top_words if pd.notna(word)]

                # Get various coherence metrics for this node (corrected version)
                node_coherence_scores = {}

                if node_id in node_to_topic_idx:
                    # Get metrics directly through index
                    topic_idx = node_to_topic_idx[node_id]

                    for measure in ['c_npmi', 'c_v', 'u_mass']:
                        if measure in per_topic_coherence:
                            measure_name = measure.replace('c_', '') if measure.startswith('c_') else measure
                            node_coherence_scores[f'node_{measure_name}'] = per_topic_coherence[measure][topic_idx]
                        else:
                            measure_name = measure.replace('c_', '') if measure.startswith('c_') else measure
                            node_coherence_scores[f'node_{measure_name}'] = 0.0
                else:
                    # If node not in mapping, set to 0
                    for measure in ['npmi', 'v', 'u_mass']:
                        node_coherence_scores[f'node_{measure}'] = 0.0

                results_data.append({
                    'node_id': node_id,
                    'eta': eta,
                    'gamma': gamma,
                    'depth': depth,
                    'alpha': alpha,
                    'top_k': top_k,
                    'top_words': ', '.join(top_words[:10]),
                    'word_count': len(top_words),

                    # Node-level coherence metrics (corrected)
                    'node_npmi': node_coherence_scores.get('node_npmi', 0.0),
                    'node_c_v': node_coherence_scores.get('node_v', 0.0),
                    'node_u_mass': node_coherence_scores.get('node_u_mass', 0.0),

                    # Global-level coherence metrics
                    'global_npmi': global_coherence.get('c_npmi', 0.0),
                    'global_c_v': global_coherence.get('c_v', 0.0),
                    'global_u_mass': global_coherence.get('u_mass', 0.0),

                    'iteration': max_iteration
                })

            # Save results
            results_df = pd.DataFrame(results_data)
            output_path = os.path.join(folder_path, 'standard_coherence.csv')
            results_df.to_csv(output_path, index=False)

            print(f"💾 Standard coherence results saved to: {output_path}")
            print(f"📊 Results summary:")
            print(f"   - Global NPMI: {global_coherence.get('c_npmi', 0.0):.4f}")
            print(f"   - Global C_V: {global_coherence.get('c_v', 0.0):.4f}")
            print(f"   - Global U_Mass: {global_coherence.get('u_mass', 0.0):.4f}")

            # Display node-level metric ranges
            if len(results_df) > 0:
                print(f"   - Node NPMI range: [{results_df['node_npmi'].min():.4f}, {results_df['node_npmi'].max():.4f}]")
                print(f"   - Node C_V range: [{results_df['node_c_v'].min():.4f}, {results_df['node_c_v'].max():.4f}]")
                print(f"   - Node U_Mass range: [{results_df['node_u_mass'].min():.4f}, {results_df['node_u_mass'].max():.4f}]")

        except Exception as e:
            import traceback
            print(f"❌ Error processing file {file_path}: {str(e)}")
            traceback.print_exc()

    print(f"\n✅ Standard coherence calculation for all files completed!")
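
The layer summary written by `calculate_coherence_layered_analysis()` reports both a document-weighted and a simple average of the node scores. The sketch below isolates that one step on hypothetical numbers to show how the two can disagree; `weighted_and_simple_mean` is an illustrative helper, not a function from the notebook.

```python
# Hypothetical node-level rows for one layer: (document_count, node_npmi)
layer_nodes = [
    (90, 0.10),
    (10, -0.30),
]

def weighted_and_simple_mean(nodes):
    """Return (document-weighted mean, unweighted mean) of the node scores."""
    total_docs = sum(count for count, _ in nodes)
    if total_docs == 0:
        return None, None  # mirrors the notebook skipping layers without documents
    weighted = sum(count * score for count, score in nodes) / total_docs
    simple = sum(score for _, score in nodes) / len(nodes)
    return weighted, simple

weighted, simple = weighted_and_simple_mean(layer_nodes)
print(f"weighted={weighted:.4f}, simple={simple:.4f}")
# weighted=0.0600, simple=-0.1000
```

Weighting by document count lets large nodes dominate, so a few small incoherent nodes barely move the weighted figure while they pull the simple mean down. In the functions above, nodes with fewer than two valid top words receive a score of 0.0 and still enter both averages, which shifts the layer values toward zero.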

In [16]:
""" 0. Set-up part: import necessary libraries and set up environment """

import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from collections import Counter, defaultdict
import numpy as np
import math
import copy
import itertools
import matplotlib.pyplot as plt
import matplotlib as mpl

import joblib
from joblib import Parallel, delayed
from threading import Thread

import os
import pickle
import time

import operator
from functools import reduce
import json
import cProfile

# Download nltk data once
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('omw-1.4')
# nltk.download('punkt_tab')
# nltk.download('averaged_perceptron_tagger_eng')

# Chinese character support in matplotlib
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

""" 1.1 Data Preprocessing: load data, clean text, lemmatization, remove low-frequency words """

# Map POS tags to WordNet format: Penn Treebank annotation (fine-grained, 45 tags), WordNet annotation (coarse-grained, 4 tags: a, v, n, r)
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    elif treebank_tag.startswith('V'):
        return 'v'  # verb
    elif treebank_tag.startswith('N'):
        return 'n'  # noun
    elif treebank_tag.startswith('R'):
        return 'r'  # adverb
    else:
        return 'n'  # default noun

# Text cleaning and lemmatization preprocessing function
def clean_and_lemmatize(text):
    if pd.isnull(text):
        return []
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetic characters using regex
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words]
    pos_tags = pos_tag(tokens)
    lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(pos)) for w, pos in pos_tags]
    return lemmatized

# -----------------Load data----------------
data = pd.read_excel('/Volumes/My Passport/收敛结果/step2/papers_CM.xlsx', usecols=['PaperID', 'Abstract', 'Keywords', 'Year'])

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Clean and lemmatize the abstracts
data['Lemmatized_Tokens'] = data['Abstract'].apply(clean_and_lemmatize)

# Count word frequencies
all_tokens = [word for tokens in data['Lemmatized_Tokens'] for word in tokens]
word_counts = Counter(all_tokens)

# Set a minimum frequency threshold for valid words
min_freq = 10
valid_words = set([word for word, freq in word_counts.items() if freq >= min_freq])

# Remove rare words based on frequency threshold
def remove_rare_words(tokens):
    return [word for word in tokens if word in valid_words]

data['Filtered_Tokens'] = data['Lemmatized_Tokens'].apply(remove_rare_words)

# Join tokens back into cleaned abstracts
data['Cleaned_Abstract'] = data['Filtered_Tokens'].apply(lambda x: " ".join(x))

# Create a cleaned DataFrame with relevant columns
cleaned_data = data[['PaperID', 'Year', 'Cleaned_Abstract']]
cleaned_data = cleaned_data[~(cleaned_data['PaperID'] == 57188)] # this paper has no abstract
cleaned_data = cleaned_data.reset_index(drop=True)
cleaned_data.insert(0, 'Document_ID', range(len(cleaned_data)))
abstract_list = cleaned_data['Cleaned_Abstract'].apply(lambda x: x.split()).tolist()

corpus = {doc_id: tokens for doc_id, tokens in enumerate(abstract_list)}
# cleaned_data.to_csv('./data/processed/cleaned_data.xlsx', index=False, encoding='utf-8-sig')
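
The rare-word filter used in the preprocessing cell can be illustrated on a toy corpus. The sketch below is a minimal, standard-library-only example; the helper name `filter_rare_words`, the toy documents, and the threshold of 2 are assumptions for illustration (the notebook itself uses `min_freq = 10` on the lemmatized abstracts).

```python
from collections import Counter

def filter_rare_words(docs, min_freq):
    """Drop tokens whose corpus-wide frequency is below min_freq."""
    counts = Counter(tok for doc in docs for tok in doc)
    valid = {tok for tok, freq in counts.items() if freq >= min_freq}
    return [[tok for tok in doc if tok in valid] for doc in docs]

# Hypothetical toy corpus, not taken from papers_CM.xlsx
toy_docs = [
    ["topic", "model", "tree"],
    ["topic", "model", "prior"],
    ["topic", "depth"],
]
filtered = filter_rare_words(toy_docs, min_freq=2)
print(filtered)
```

A document can lose all of its tokens under this filter, which is why the downstream perplexity code skips documents with an empty word list.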

In [17]:
# Delete old incomplete files
def clean_old_coherence_files(base_path="."):
    """Delete old incomplete standard coherence files"""
    pattern = os.path.join(base_path, "**", "standard_coherence.csv")
    files = glob.glob(pattern, recursive=True)
    
    deleted_count = 0
    for file_path in files:
        try:
            os.remove(file_path)
            print(f"✓ Deleted old file: {os.path.basename(os.path.dirname(file_path))}")
            deleted_count += 1
        except Exception as e:
            print(f"❌ Failed to delete: {file_path} - {e}")
    
    print(f"🗑️ Total deleted {deleted_count} old coherence files")

# Execute corrected version calculation
base_path = "/Volumes/My Passport/收敛结果/step2"
top_k = 5

print("=" * 80)
print("🗑️ Cleaning old incomplete files...")
print("=" * 80)
clean_old_coherence_files(base_path)

print("\n" + "=" * 80)
print("🔄 Starting recalculation of complete coherence metrics...")
print("=" * 80)

# Use corrected version function
process_coherence_with_original_corpus_corrected(base_path, corpus, top_k)

print("=" * 80)
print("✅ Corrected coherence calculation completed!")
print("=" * 80)

🗑️ Cleaning old incomplete files...
🗑️ Total deleted 0 old coherence files

🔄 Starting recalculation of complete coherence metrics...
🔍 Found 18 word distribution files to process

[1/18] Processing file: depth_3_gamma_0.05_eta_0.1_run_2
Parameters - Eta: 0.1, Gamma: 0.05, Depth: 3, Alpha: 0.1
📈 Last iteration: 175
📈 Number of nodes: 231
💾 Standard coherence results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/standard_coherence.csv
📊 Results summary:
   - Global NPMI: 0.0460
   - Global C_V: 0.5559
   - Global U_Mass: -2.9908
   - Node NPMI range: [-0.4494, 0.7414]
   - Node C_V range: [0.1778, 0.9900]
   - Node U_Mass range: [-14.9995, -0.3317]

[2/18] Processing file: depth_3_gamma_0.05_eta_0.1_run_3
Parameters - Eta: 0.1, Gamma: 0.05, Depth: 3, Alpha: 0.1
📈 Last iteration: 175
📈 Number of nodes: 215
💾 Standard coherence results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_3/standard_

In [18]:
def add_layer_and_document_info_to_coherence(base_path="."):
    """
    Add 'layer' and 'document_count' info from corrected_renyi_entropy.csv to standard_coherence.csv

    Parameters:
    base_path: str, root directory of result files
    """
    # Find all folders containing standard_coherence.csv
    pattern = os.path.join(base_path, "**", "standard_coherence.csv")
    coherence_files = glob.glob(pattern, recursive=True)

    print(f"🔍 Found {len(coherence_files)} standard coherence files to process")

    for idx, coherence_file_path in enumerate(coherence_files, 1):
        folder_path = os.path.dirname(coherence_file_path)
        folder_name = os.path.basename(folder_path)

        print(f"\n{'='*80}")
        print(f"[{idx}/{len(coherence_files)}] Processing folder: {folder_name}")
        print(f"{'='*80}")

        # Check if corresponding corrected_renyi_entropy.csv exists
        entropy_file_path = os.path.join(folder_path, 'corrected_renyi_entropy.csv')

        if not os.path.exists(entropy_file_path):
            print(f"⚠️  Entropy file not found: {entropy_file_path}")
            continue

        try:
            # Read both files
            print("📖 Reading files...")
            coherence_df = pd.read_csv(coherence_file_path)
            entropy_df = pd.read_csv(entropy_file_path)

            print(f"   Coherence file: {len(coherence_df)} rows")
            print(f"   Entropy file: {len(entropy_df)} rows")

            # Check if layer and document_count columns already exist
            existing_cols = coherence_df.columns.tolist()
            has_layer = 'layer' in existing_cols
            has_doc_count = 'document_count' in existing_cols

            print(f"   Current columns: {existing_cols}")
            print(f"   Has layer column: {has_layer}")
            print(f"   Has document_count column: {has_doc_count}")

            # Create node_id to layer and document_count mapping
            node_layer_map = entropy_df.set_index('node_id')['layer'].to_dict()
            node_doc_count_map = entropy_df.set_index('node_id')['document_count'].to_dict()

            print(f"   Mappable nodes: {len(node_layer_map)}")

            # Add or update layer column
            coherence_df['layer'] = coherence_df['node_id'].map(node_layer_map)
            print("   ✓ Layer column added/updated")

            # Add or update document_count column
            coherence_df['document_count'] = coherence_df['node_id'].map(node_doc_count_map)
            print("   ✓ Document_count column added/updated")

            # Check mapping results
            layer_null_count = coherence_df['layer'].isnull().sum()
            doc_count_null_count = coherence_df['document_count'].isnull().sum()

            if layer_null_count > 0:
                print(f"   ⚠️  {layer_null_count} nodes missing layer info")

            if doc_count_null_count > 0:
                print(f"   ⚠️  {doc_count_null_count} nodes missing document_count info")

            # Show layer distribution stats
            layer_stats = coherence_df['layer'].value_counts().sort_index()
            print(f"   📊 Layer distribution: {layer_stats.to_dict()}")

            # Show document count stats
            doc_stats = coherence_df['document_count'].describe()
            print(f"   📊 Document count stats:")
            print(f"      Min: {doc_stats['min']:.0f}")
            print(f"      Max: {doc_stats['max']:.0f}")
            print(f"      Mean: {doc_stats['mean']:.1f}")

            # Save updated file
            coherence_df.to_csv(coherence_file_path, index=False)
            print(f"💾 Updated and saved: {coherence_file_path}")

            # Show updated columns
            updated_cols = coherence_df.columns.tolist()
            print(f"   Updated columns: {updated_cols}")

        except Exception as e:
            import traceback
            print(f"❌ Error processing file {coherence_file_path}: {str(e)}")
            print("Detailed error info:")
            traceback.print_exc()

    print(f"\n✅ All standard coherence files updated with layer and document_count info!")


def verify_coherence_files_update(base_path="."):
    """
    Verify update status of standard_coherence.csv files
    """
    pattern = os.path.join(base_path, "**", "standard_coherence.csv")
    coherence_files = glob.glob(pattern, recursive=True)

    print("🔍 Verifying update results:")
    print("="*80)

    all_have_layer = True
    all_have_doc_count = True

    for file_path in coherence_files:
        folder_name = os.path.basename(os.path.dirname(file_path))

        try:
            df = pd.read_csv(file_path)
            has_layer = 'layer' in df.columns
            has_doc_count = 'document_count' in df.columns

            layer_null = df['layer'].isnull().sum() if has_layer else "No column"
            doc_null = df['document_count'].isnull().sum() if has_doc_count else "No column"

            status = "✅" if (has_layer and has_doc_count and layer_null == 0 and doc_null == 0) else "⚠️"

            print(f"{status} {folder_name}")
            print(f"   Layer column: {'Yes' if has_layer else 'No'} (Nulls: {layer_null})")
            print(f"   DocCount column: {'Yes' if has_doc_count else 'No'} (Nulls: {doc_null})")

            if not has_layer:
                all_have_layer = False
            if not has_doc_count:
                all_have_doc_count = False

        except Exception as e:
            print(f"❌ {folder_name}: Read failed - {e}")

    print("="*80)
    print(f"📋 Summary:")
    print(f"   Total files: {len(coherence_files)}")
    print(f"   All have layer column: {'Yes' if all_have_layer else 'No'}")
    print(f"   All have document_count column: {'Yes' if all_have_doc_count else 'No'}")

In [19]:
# Execute update
base_path = "/Volumes/My Passport/收敛结果/step2"

print("=" * 80)
print("Starting to add layer and document_count information to standard_coherence.csv ...")
print("=" * 80)

# Add layer and document_count information
add_layer_and_document_info_to_coherence(base_path)

print("\n" + "=" * 80)
print("Verifying update results ...")
print("=" * 80)

# Verify update results
verify_coherence_files_update(base_path)

print("\n" + "=" * 80)
print("✅ Layer and document_count information addition completed!")
print("=" * 80)

Starting to add layer and document_count information to standard_coherence.csv ...
🔍 Found 18 standard coherence files to process

[1/18] Processing folder: depth_3_gamma_0.05_eta_0.1_run_2
📖 Reading files...
   Coherence file: 231 rows
   Entropy file: 231 rows
   Current columns: ['node_id', 'eta', 'gamma', 'depth', 'alpha', 'top_k', 'top_words', 'word_count', 'node_npmi', 'node_c_v', 'node_u_mass', 'global_npmi', 'global_c_v', 'global_u_mass', 'iteration']
   Has layer column: False
   Has document_count column: False
   Mappable nodes: 231
   ✓ Layer column added/updated
   ✓ Document_count column added/updated
   📊 Layer distribution: {0: 1, 1: 44, 2: 186}
   📊 Document count stats:
      Min: 1
      Max: 970
      Mean: 12.6
💾 Updated and saved: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/standard_coherence.csv
   Updated columns: ['node_id', 'eta', 'gamma', 'depth', 'alpha', 'top_k', 'top_words', 'word_count', 'node_npmi', 'node_c_v', '

In [24]:
def aggregate_coherence_by_eta(base_path=".", top_k=15):
    """
    Aggregate layer-level coherence statistics by eta value (including k value),
    and generate weighted and unweighted (simple) summary files.
    """
    # Find all layer_coherence_summary_k{top_k}.csv files
    pattern = os.path.join(base_path, "**", f"layer_coherence_summary_k{top_k}.csv")
    files = glob.glob(pattern, recursive=True)

    print(f"🔍 Search pattern: layer_coherence_summary_k{top_k}.csv")
    print(f"🔍 Found {len(files)} layer summary files")

    all_data = []
    eta_groups = {}

    for file_path in files:
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        parent_folder = os.path.dirname(folder_path)

        # Extract eta value
        eta = None
        if 'eta_' in folder_name:
            try:
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta = float(eta_part)
            except ValueError:
                continue
        else:
            continue

        # Extract run number
        run_match = folder_name.split('_run_')
        if len(run_match) > 1:
            run_id = run_match[1]
        else:
            continue

        if eta not in eta_groups:
            eta_groups[eta] = parent_folder

        try:
            df = pd.read_csv(file_path)

            for _, row in df.iterrows():
                all_data.append({
                    'eta': eta,
                    'run_id': run_id,
                    'layer': row['layer'],
                    'node_count': row['node_count'],
                    'total_documents': row['total_documents'],
                    'weighted_avg_npmi': row['weighted_avg_npmi'],
                    'weighted_avg_c_v': row['weighted_avg_c_v'],
                    'weighted_avg_u_mass': row['weighted_avg_u_mass'],
                    'simple_avg_npmi': row['simple_avg_npmi'],
                    'simple_avg_c_v': row['simple_avg_c_v'],
                    'simple_avg_u_mass': row['simple_avg_u_mass'],
                    'top_k': top_k,
                    'parent_folder': parent_folder
                })

        except Exception as e:
            print(f"Error reading file {file_path}: {e}")

    # Convert to DataFrame and aggregate by eta
    summary_df = pd.DataFrame(all_data)

    if summary_df.empty:
        print("No valid data found")
        return

    print("=" * 70)
    print(f"Layer Coherence Summary Statistics by ETA Value (k={top_k})")
    print("=" * 70)

    for eta, group_data in summary_df.groupby('eta'):
        parent_folder = group_data['parent_folder'].iloc[0]

        print(f"\nProcessing Eta={eta} (k={top_k})")

        # Per-layer aggregation across runs (weighted and simple metrics)
        layer_summary = group_data.groupby('layer').agg({
            'weighted_avg_npmi': ['mean', 'std', 'count'],
            'weighted_avg_c_v': ['mean', 'std', 'count'],
            'weighted_avg_u_mass': ['mean', 'std', 'count'],
            'simple_avg_npmi': ['mean', 'std'],
            'simple_avg_c_v': ['mean', 'std'],
            'simple_avg_u_mass': ['mean', 'std'],
            'node_count': 'mean',
            'total_documents': 'mean',
            'run_id': lambda x: ', '.join(sorted(x.unique()))
        }).round(4)

        # Flatten column names
        layer_summary.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in layer_summary.columns]
        layer_summary = layer_summary.reset_index()
        layer_summary.insert(0, 'eta', eta)
        layer_summary.insert(1, 'top_k', top_k)

        # Save weighted + simple aggregation results (filename includes k and eta value)
        output_filename = f'eta_{eta}_coherence_layer_comparison_k{top_k}.csv'
        output_path = os.path.join(parent_folder, output_filename)
        layer_summary.to_csv(output_path, index=False)

        print(f"  Saved summary file: {output_path}")
        print(f"  Number of layers: {len(layer_summary)}")

        # Show brief statistics
        for _, row in layer_summary.iterrows():
            layer_num = int(row['layer'])
            w_npmi = row['weighted_avg_npmi_mean']
            w_cv = row['weighted_avg_c_v_mean']
            w_umass = row['weighted_avg_u_mass_mean']
            run_count = int(row['weighted_avg_npmi_count'])
            print(f"    Layer {layer_num}: W_NPMI={w_npmi:.4f}, W_C_V={w_cv:.4f}, W_U_Mass={w_umass:.4f}, runs={run_count}")

        # Save unweighted simple coherence summary (filename includes k and eta value)
        simple_summary = group_data.groupby('layer').agg({
            'simple_avg_npmi': ['mean', 'std'],
            'simple_avg_c_v': ['mean', 'std'],
            'simple_avg_u_mass': ['mean', 'std'],
            'node_count': ['mean', 'std'],
            'run_id': 'count'
        }).round(4)
        simple_summary.columns = ['_'.join(col).strip() for col in simple_summary.columns]
        simple_summary = simple_summary.reset_index()
        simple_summary.insert(0, 'eta', eta)
        simple_summary.insert(1, 'top_k', top_k)
        simple_output_filename = f'eta_{eta}_coherence_layer_comparison_k{top_k}_simple.csv'
        simple_output_path = os.path.join(parent_folder, simple_output_filename)
        simple_summary.to_csv(simple_output_path, index=False)
        print(f"  Saved simple (unweighted) coherence file: {simple_output_path}")

    # Generate overall comparison file (filename includes k value)
    overall_summary = summary_df.groupby(['eta', 'layer']).agg({
        'weighted_avg_npmi': ['mean', 'std'],
        'weighted_avg_c_v': ['mean', 'std'],
        'weighted_avg_u_mass': ['mean', 'std'],
        'simple_avg_npmi': ['mean', 'std'],
        'simple_avg_c_v': ['mean', 'std'],
        'simple_avg_u_mass': ['mean', 'std'],
        'node_count': 'mean',
        'total_documents': 'mean',
        'run_id': 'count'
    }).round(4)

    overall_summary.columns = ['_'.join(col).strip() for col in overall_summary.columns]
    overall_summary = overall_summary.reset_index()
    overall_summary.insert(2, 'top_k', top_k)

    overall_output_path = os.path.join(base_path, f'eta_coherence_layer_comparison_k{top_k}.csv')
    overall_summary.to_csv(overall_output_path, index=False)
    print(f"\nOverall comparison file saved to: {overall_output_path}")
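
The aggregation above consumes two kinds of per-layer averages from `layer_coherence_summary_k{top_k}.csv`: `weighted_avg_*` and `simple_avg_*`. The function that writes those columns is not shown in this cell, so the sketch below is only an assumption about how the two differ, taking each node's document count as its weight (as the summary at the top of the notebook describes). The helper name `layer_averages` and the node values are invented for illustration.

```python
def layer_averages(nodes):
    """Return (simple_avg, weighted_avg) of node scores for one layer.

    nodes: non-empty list of (score, document_count) pairs.
    """
    scores = [score for score, _ in nodes]
    simple_avg = sum(scores) / len(scores)

    total_docs = sum(doc_count for _, doc_count in nodes)
    weighted_avg = sum(score * doc_count for score, doc_count in nodes) / total_docs
    return simple_avg, weighted_avg

# Hypothetical layer: one large low-coherence node, one small high-coherence node
toy_layer = [(0.10, 90), (0.50, 10)]
simple_avg, weighted_avg = layer_averages(toy_layer)
print(f"simple={simple_avg:.2f}, weighted={weighted_avg:.2f}")
```

In this toy layer the weighted average is pulled toward the large node, so the two summaries can diverge noticeably when a few nodes hold most of the documents.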

In [25]:
import pandas as pd
import numpy as np
import os
import glob
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary

# Execute streamlined layered coherence analysis (filename includes k value)
base_path = "/Volumes/My Passport/收敛结果/step2"
top_k = 5

print("=" * 80)
print(f"Starting node coherence calculation and layered analysis (k={top_k})...")
print("=" * 80)

# Calculate node coherence and layer summary
calculate_coherence_layered_analysis(base_path, corpus, top_k)

print("\n" + "=" * 80)
print(f"Starting aggregation of layer coherence statistics by eta value (k={top_k})...")
print("=" * 80)

# Aggregate by eta (pass top_k parameter)
aggregate_coherence_by_eta(base_path, top_k)

print("=" * 80)
print(f"✅ Layered coherence analysis completed! (k={top_k})")
print("=" * 80)

Starting node coherence calculation and layered analysis (k=5)...
🔍 Found 18 word distribution files to process (top_k=5)

[1/18] Processing file: depth_3_gamma_0.05_eta_0.1_run_2 (k=5)
Parameters - Eta: 0.1, Gamma: 0.05, Depth: 3, Alpha: 0.1
📈 Last iteration: 175
📈 Number of nodes: 231
   Calculating c_npmi...
   ✓ c_npmi: Range=[-0.4494, 0.7414]
   Calculating c_v...
   ✓ c_v: Range=[0.1778, 0.9900]
   Calculating u_mass...
   ✓ u_mass: Range=[-14.9995, -0.3317]
💾 Node coherence results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/node_coherence_k5.csv
💾 Layer summary results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/layer_coherence_summary_k5.csv
📊 Layer coherence summary (k=5):
   Layer 0 (1 nodes): NPMI=-0.0013, C_V=0.4948, U_Mass=-0.5699
   Layer 1 (44 nodes): NPMI=-0.0083, C_V=0.5269, U_Mass=-1.7559
   Layer 2 (186 nodes): NPMI=0.0763, C_V=0.5772, U_Mass=-2.8969

[2/18] Proce

In [26]:
def compute_perplexity_with_path_mapping_fixed(node_word_df, path_mapping_df, corpus, test_doc_ids, eta_prior=0.1):
    """
    Compute hLDA perplexity for test documents using path-document mapping and node-word distributions.
    Args:
        node_word_df: DataFrame, node-word distributions for the last iteration
        path_mapping_df: DataFrame, path-document mapping for the last iteration
        corpus: dict, {doc_id: [word list]}
        test_doc_ids: list, document IDs in the test set
        eta_prior: float, Dirichlet prior smoothing parameter (from eta value)
    Returns:
        dict: {
            'perplexity': float,
            'avg_doc_perplexity': float,
            'log_likelihood': float,
            'valid_docs': int,
            'matched_docs': int,
            'total_words': int,
            'match_rate': float,
            'avg_path_length': float
        }
    """
    # Build node-word probability distributions
    all_words = sorted(node_word_df['word'].dropna().unique())
    word_to_idx = {word: idx for idx, word in enumerate(all_words)}
    node_distributions = {}

    for node_id in node_word_df['node_id'].unique():
        node_data = node_word_df[node_word_df['node_id'] == node_id]
        counts = np.zeros(len(all_words))
        for _, row in node_data.iterrows():
            word = row['word']
            if pd.notna(word) and word in word_to_idx:
                counts[word_to_idx[word]] = row['count']
        smoothed_counts = counts + eta_prior
        probabilities = smoothed_counts / np.sum(smoothed_counts)
        node_distributions[node_id] = probabilities

    # Map each test document to its path (sequence of node_ids)
    test_doc_id_set = set(test_doc_ids)  # set gives O(1) membership checks
    doc_path_map = {}
    for _, row in path_mapping_df.iterrows():
        doc_id = row['doc_id'] if 'doc_id' in row else row['document_id']
        if doc_id in test_doc_id_set:
            # Extract path as a list of node_ids for all layers
            path = []
            for col in path_mapping_df.columns:
                if col.startswith('layer_') and col.endswith('_node_id') and pd.notna(row[col]):
                    path.append(row[col])
            if path:
                doc_path_map[doc_id] = path

    log_likelihood = 0.0
    total_words = 0
    valid_docs = 0
    matched_docs = 0
    path_lengths = []

    for doc_id in test_doc_ids:
        words = corpus.get(doc_id, [])
        if not words or doc_id not in doc_path_map:
            continue
        path = doc_path_map[doc_id]
        path_lengths.append(len(path))
        matched_docs += 1

        # For each word, average probability over all nodes in the path
        word_probs = []
        for word in words:
            if word not in word_to_idx:
                continue
            prob_sum = 0.0
            for node_id in path:
                if node_id in node_distributions:
                    prob_sum += node_distributions[node_id][word_to_idx[word]]
            if len(path) > 0:
                avg_prob = prob_sum / len(path)
                word_probs.append(avg_prob)
        if word_probs:
            valid_docs += 1
            total_words += len(word_probs)
            log_likelihood += np.sum(np.log(np.maximum(word_probs, 1e-12)))  # avoid log(0)

    if total_words == 0 or valid_docs == 0:
        return {
            'perplexity': np.nan,
            'avg_doc_perplexity': np.nan,
            'log_likelihood': 0.0,
            'valid_docs': valid_docs,
            'matched_docs': matched_docs,
            'total_words': total_words,
            'match_rate': matched_docs / len(test_doc_ids) if test_doc_ids else 0,
            'avg_path_length': np.mean(path_lengths) if path_lengths else 0
        }

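    # Corpus-level per-word perplexity: exp(-log_likelihood / number of scored words).
    # NOTE: 'avg_doc_perplexity' is currently identical to 'perplexity'; it is a
    # per-word value, not an average of per-document perplexities.
    avg_doc_perplexity = np.exp(-log_likelihood / total_words)
    perplexity = avg_doc_perplexity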
    match_rate = matched_docs / len(test_doc_ids) if test_doc_ids else 0
    avg_path_length = np.mean(path_lengths) if path_lengths else 0

    return {
        'perplexity': perplexity,
        'avg_doc_perplexity': avg_doc_perplexity,
        'log_likelihood': log_likelihood,
        'valid_docs': valid_docs,
        'matched_docs': matched_docs,
        'total_words': total_words,
        'match_rate': match_rate,
        'avg_path_length': avg_path_length
    }
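
The computation in `compute_perplexity_with_path_mapping_fixed` reduces to three steps: additive smoothing of each node's word counts, a uniform mixture over the nodes on a document's path, and exponentiating the negative mean log-probability. The sketch below reproduces those steps with the standard library only; the function name `toy_path_perplexity`, the vocabulary, the counts, and the path are all invented for illustration.

```python
import math

def toy_path_perplexity(node_counts, vocab, path, words, eta):
    """Per-word perplexity of `words` under a uniform mixture of path nodes."""
    if not path:
        return float("nan")

    # Additive (eta) smoothing of each node's counts over the full vocabulary
    dists = {}
    for node, counts in node_counts.items():
        total = sum(counts.get(w, 0) for w in vocab) + eta * len(vocab)
        dists[node] = {w: (counts.get(w, 0) + eta) / total for w in vocab}

    log_lik, n_words = 0.0, 0
    for w in words:
        if w not in vocab:  # out-of-vocabulary words are skipped
            continue
        # Average the word's probability over all nodes on the path
        p = sum(dists[node][w] for node in path if node in dists) / len(path)
        log_lik += math.log(max(p, 1e-12))  # avoid log(0)
        n_words += 1

    return math.exp(-log_lik / n_words) if n_words else float("nan")

vocab = ["a", "b", "c", "d"]
uniform = toy_path_perplexity({0: {}, 1: {}}, vocab, [0, 1], ["a", "b", "zzz"], eta=0.1)
peaked = toy_path_perplexity({0: {"a": 10}, 1: {"b": 10}}, vocab, [0, 1], ["a", "b"], eta=0.1)
print(uniform, peaked)
```

With empty counts the smoothed distributions are uniform, so the perplexity equals the vocabulary size (4, up to floating-point error). Once the nodes concentrate mass on the observed words, the value drops below that.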

In [None]:
# First run full perplexity computation (if not already run)
from sklearn.model_selection import train_test_split
import math
import pandas as pd
import numpy as np
import os
import glob
import re

def extract_eta_from_folder(folder_name, default=0.1):
    """
    Robust eta extraction from folder name.
    Supports patterns like: eta_0.1, eta0.1, eta-0.1, eta_1, etc.
    Returns float or default on failure.
    """
    if not isinstance(folder_name, str):
        return default
    
    # Primary pattern: eta followed by separator and number
    m = re.search(r'eta[_-]?([0-9]+(?:\.[0-9]+)?)', folder_name, flags=re.IGNORECASE)
    if m:
        try:
            return float(m.group(1))
        except ValueError:
            return default
    
    # Fallback pattern: looser match
    m2 = re.search(r'eta\s*[:=]?\s*([0-9]+(?:\.[0-9]+)?)', folder_name, flags=re.IGNORECASE)
    if m2:
        try:
            return float(m2.group(1))
        except ValueError:
            return default
    
    return default

def extract_params_from_folder(folder_name):
    """Extract all parameters from folder name using robust regex"""
    params = {'eta': 0.1, 'gamma': 0.05, 'depth': 3, 'alpha': 0.1}
    
    for param in params.keys():
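        # The lookbehind keeps a parameter name from matching inside a longer word (e.g. 'eta' in 'beta')
        pattern = rf'(?<![a-z]){param}[_-]?([0-9]+(?:\.[0-9]+)?)'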
        m = re.search(pattern, folder_name, flags=re.IGNORECASE)
        if m:
            try:
                value = m.group(1)
                if param == 'depth':
                    params[param] = int(value)
                else:
                    params[param] = float(value)
            except Exception as e:
                print(f"⚠️ Failed to extract parameter {param}: {e}")
    
    return params['eta'], params['gamma'], params['depth'], params['alpha']

def calculate_hlda_perplexity_with_path_mapping_complete(base_path=".", corpus=None, test_ratio=0.2, random_state=42):
    """
    Full version: hLDA perplexity calculation based on iteration_path_document_mapping.csv
    """
    if corpus is None:
        print("❌ corpus (original text data) must be provided")
        return

    # Split train/test
    doc_ids = list(corpus.keys())
    train_ids, test_ids = train_test_split(doc_ids, test_size=test_ratio, random_state=random_state)

    print(f"📊 Dataset split:")
    print(f"   Total documents: {len(doc_ids)}")
    print(f"   Training set: {len(train_ids)} documents")
    print(f"   Test set: {len(test_ids)} documents")

    pattern = os.path.join(base_path, "**", "iteration_node_word_distributions.csv")
    files = glob.glob(pattern, recursive=True)

    print(f"🔍 Found {len(files)} model result folders to process")

    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)

        # Use robust parameter extraction
        eta, gamma, depth, alpha = extract_params_from_folder(folder_name)

        print(f"\n{'='*80}")
        print(f"[{idx}/{len(files)}] Computing perplexity for: {folder_name}")
        print(f"Parameters - Eta: {eta}, Gamma: {gamma}, Depth: {depth}, Alpha: {alpha}")
        print(f"{'='*80}")

        try:
            # read node-word distributions
            word_df = pd.read_csv(file_path)
            word_df.columns = [col.strip("'\" ") for col in word_df.columns]

            # read path-document mapping
            path_mapping_file = os.path.join(folder_path, 'iteration_path_document_mapping.csv')
            if not os.path.exists(path_mapping_file):
                print("⚠️ Path-document mapping file not found, skipping this folder")
                continue

            path_mapping_df = pd.read_csv(path_mapping_file)
            path_mapping_df.columns = [col.strip("'\" ") for col in path_mapping_df.columns]

            # select last iteration
            max_iteration = word_df['iteration'].max()
            last_word_data = word_df[word_df['iteration'] == max_iteration]
            last_path_mapping_data = path_mapping_df[path_mapping_df['iteration'] == max_iteration]

            print(f"📈 Last iteration: {max_iteration}")
            print(f"📈 Number of nodes: {last_word_data['node_id'].nunique()}")
            print(f"📈 Path mappings: {len(last_path_mapping_data)}")

            # compute perplexity using the fixed function
            perplexity_results = compute_perplexity_with_path_mapping_fixed(
                last_word_data,
                last_path_mapping_data,
                corpus,
                test_ids,
                eta
            )

            if perplexity_results is not None:
                # save perplexity results
                perplexity_data = [{
                    'eta': eta,
                    'gamma': gamma,
                    'depth': depth,
                    'alpha': alpha,
                    'iteration': max_iteration,
                    'test_docs_count': len(test_ids),
                    'valid_test_docs': perplexity_results['valid_docs'],
                    'matched_docs': perplexity_results['matched_docs'],
                    'total_test_words': perplexity_results['total_words'],
                    'log_likelihood': perplexity_results['log_likelihood'],
                    'perplexity': perplexity_results['perplexity'],
                    'avg_doc_perplexity': perplexity_results['avg_doc_perplexity'],
                    'doc_match_rate': perplexity_results['match_rate'],
                    'avg_path_length': perplexity_results['avg_path_length']
                }]

                perplexity_df = pd.DataFrame(perplexity_data)
                output_path = os.path.join(folder_path, 'perplexity_results_final.csv')
                perplexity_df.to_csv(output_path, index=False)

                print(f"💾 Perplexity results saved to: {output_path}")
                print(f"📊 Perplexity summary:")
                print(f"   - Perplexity: {perplexity_results['perplexity']:.4f}")
                print(f"   - Average doc perplexity: {perplexity_results['avg_doc_perplexity']:.4f}")
                print(f"   - Document match rate: {perplexity_results['match_rate']:.1%}")
                print(f"   - Average path length: {perplexity_results['avg_path_length']:.1f}")
                print(f"   - Valid test docs: {perplexity_results['valid_docs']}/{len(test_ids)}")

        except Exception as e:
            import traceback
            print(f"❌ Error processing {file_path}: {str(e)}")
            traceback.print_exc()

    print(f"\n✅ Perplexity computation completed for all folders!")

def aggregate_perplexity_by_eta_groups(base_path="."):
    """
    Aggregate average perplexity and related metrics across runs grouped by eta (fixed version).
    """
    # find final perplexity result files
    pattern = os.path.join(base_path, "**", "perplexity_results_final.csv")
    files = glob.glob(pattern, recursive=True)

    # if no final files, try alternative patterns
    if len(files) == 0:
        patterns = [
            "perplexity_results_path_mapping.csv",
            "perplexity_results_test.csv",
            "perplexity_results.csv"
        ]
        for pattern_name in patterns:
            pattern = os.path.join(base_path, "**", pattern_name)
            files = glob.glob(pattern, recursive=True)
            if len(files) > 0:
                print(f"🔍 Using file pattern: {pattern_name}")
                break

    print(f"🔍 Found {len(files)} perplexity result files")

    if len(files) == 0:
        print("❌ No perplexity result files found")
        return

    all_data = []
    eta_groups = {}

    for file_path in files:
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        parent_folder = os.path.dirname(folder_path)

        # Use robust eta extraction
        eta = extract_eta_from_folder(folder_name, default=None)
        if eta is None:
            print(f"Warning: cannot extract eta from folder name {folder_name}")
            continue

        # extract run id
        run_match = folder_name.split('_run_')
        if len(run_match) > 1:
            run_id = run_match[1]
        else:
            print(f"Warning: cannot extract run id from folder name {folder_name}")
            run_id = "unknown"

        if eta not in eta_groups:
            eta_groups[eta] = parent_folder

        try:
            df = pd.read_csv(file_path)
            print(f"📖 Reading file: {folder_name} - {len(df)} rows")

            # ensure required field exists (checked once per file)
            if 'perplexity' not in df.columns:
                print(f"Warning: {file_path} missing perplexity column")
                continue

            for _, row in df.iterrows():

                all_data.append({
                    'eta': eta,
                    'run_id': run_id,
                    'gamma': row.get('gamma', 0.05),
                    'depth': row.get('depth', 3),
                    'alpha': row.get('alpha', 0.1),
                    'perplexity': row['perplexity'],
                    # Missing metrics become NaN (not 0) so the mean/std aggregation
                    # skips them instead of pulling the averages toward zero.
                    'avg_doc_perplexity': row.get('avg_doc_perplexity', row['perplexity']),
                    'valid_test_docs': row.get('valid_test_docs', float('nan')),
                    'total_test_words': row.get('total_test_words', float('nan')),
                    'doc_match_rate': row.get('doc_match_rate', float('nan')),
                    'avg_path_length': row.get('avg_path_length', float('nan')),
                    'log_likelihood': row.get('log_likelihood', float('nan')),
                    'parent_folder': parent_folder
                })

        except Exception as e:
            print(f"Error reading file {file_path}: {e}")

    # to DataFrame
    summary_df = pd.DataFrame(all_data)

    if summary_df.empty:
        print("No valid data found")
        return

    print(f"📊 Data summary:")
    print(f"   Total rows: {len(summary_df)}")
    print(f"   Unique eta values: {sorted(summary_df['eta'].unique())}")
    print(f"   Counts per eta: {summary_df['eta'].value_counts().sort_index().to_dict()}")

    print("=" * 80)
    print("Perplexity aggregation by ETA")
    print("=" * 80)

    # group by eta and save summaries
    for eta, group_data in summary_df.groupby('eta'):
        parent_folder = group_data['parent_folder'].iloc[0]

        print(f"\nProcessing Eta={eta}")
        print(f"Output directory: {parent_folder}")
        print(f"Group size: {len(group_data)}")

        if len(group_data) == 0:
            print(f"Warning: no data for Eta={eta}, skip")
            continue

        # safer aggregation dict construction
        agg_dict = {}

        numeric_cols = ['perplexity', 'avg_doc_perplexity', 'valid_test_docs',
                       'total_test_words', 'doc_match_rate', 'avg_path_length', 'log_likelihood']

        for col in numeric_cols:
            if col in group_data.columns:
                valid_data = group_data[col].dropna()
                if len(valid_data) > 0:
                    if col in ['perplexity', 'avg_doc_perplexity', 'doc_match_rate', 'avg_path_length', 'log_likelihood']:
                        agg_dict[col] = ['mean', 'std', 'min', 'max']
                    else:
                        agg_dict[col] = ['mean', 'std']
                else:
                    print(f"   Warning: column {col} has no valid data")

        if 'run_id' in group_data.columns:
            agg_dict['run_id'] = 'count'

        for col in ['gamma', 'depth', 'alpha']:
            if col in group_data.columns:
                agg_dict[col] = 'first'

        if not agg_dict:
            print(f"Warning: no aggregatable columns for Eta={eta}, skip")
            continue

        try:
            print(f"   Aggregating using keys: {list(agg_dict.keys())}")
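            # Aggregate via groupby so the result has (column, function) MultiIndex
            # columns; DataFrame.agg with a dict of lists returns one row per function.
            eta_summary = group_data.groupby('eta').agg(agg_dict).round(4)

            # flatten multiindex columns, e.g. ('perplexity', 'mean') -> 'perplexity_mean'
            eta_summary.columns = ['_'.join(col).strip() if isinstance(col, tuple) else col for col in eta_summary.columns]
            # 'eta' is the group key, so reset_index() restores it as the first column
            eta_summary = eta_summary.reset_index()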

            run_ids = ', '.join(sorted(group_data['run_id'].unique()))
            eta_summary['run_ids'] = run_ids

            output_filename = f'eta_{eta}_perplexity_summary.csv'
            output_path = os.path.join(parent_folder, output_filename)
            eta_summary.to_csv(output_path, index=False)

            print(f"  ✓ Saved summary: {output_path}")

            if 'run_id_count' in eta_summary.columns:
                print(f"  Runs: {int(eta_summary['run_id_count'].iloc[0])}")

            if 'perplexity_mean' in eta_summary.columns:
                mean_perp = eta_summary['perplexity_mean'].iloc[0]
                std_perp = eta_summary.get('perplexity_std', pd.Series([0])).iloc[0]
                print(f"  Average perplexity: {mean_perp:.4f} (±{std_perp:.4f})")

            print(f"  Included runs: {run_ids}")

        except Exception as e:
            print(f"❌ Error processing Eta={eta}: {e}")
            import traceback
            traceback.print_exc()

    # overall comparison file at base_path
    print(f"\n" + "=" * 80)
    print("Generating overall comparison file")
    print("=" * 80)

    try:
        overall_agg_dict = {}

        for col in ['perplexity', 'avg_doc_perplexity', 'doc_match_rate', 'avg_path_length',
                   'valid_test_docs', 'total_test_words', 'log_likelihood']:
            if col in summary_df.columns:
                valid_data = summary_df[col].dropna()
                if len(valid_data) > 0:
                    overall_agg_dict[col] = ['mean', 'std']
                    if col in ['perplexity', 'avg_doc_perplexity']:
                        overall_agg_dict[col].extend(['min', 'max'])

        if 'run_id' in summary_df.columns:
            overall_agg_dict['run_id'] = 'count'

        if not overall_agg_dict:
            print("Warning: no columns available for overall aggregation")
            return None

        overall_summary = summary_df.groupby('eta').agg(overall_agg_dict).round(4)

        overall_summary.columns = ['_'.join(col).strip() for col in overall_summary.columns]
        overall_summary = overall_summary.reset_index()

        overall_output_path = os.path.join(base_path, 'eta_perplexity_comparison.csv')
        overall_summary.to_csv(overall_output_path, index=False)
        print(f"✓ Overall comparison saved to: {overall_output_path}")

        print(f"\nCross-eta perplexity comparison:")
        print("Eta      Avg Perplexity(±std)   Runs")
        print("-" * 50)

        for _, row in overall_summary.iterrows():
            eta = row['eta']
            run_count = int(row.get('run_id_count', 0))

            if 'perplexity_mean' in row:
                mean_perp = row['perplexity_mean']
                std_perp = row.get('perplexity_std', 0)
                print(f"{eta:6.3f}    {mean_perp:8.4f}(±{std_perp:6.4f})        {run_count:4d}")
            else:
                print(f"{eta:6.3f}    data missing                    {run_count:4d}")

        return overall_summary

    except Exception as e:
        print(f"❌ Error generating overall comparison: {e}")
        import traceback
        traceback.print_exc()
        return None

def analyze_perplexity_trends(base_path="."):
    """
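    Report how perplexity varies with eta, using eta_perplexity_comparison.csv:
    correlations with eta, the eta with the lowest mean perplexity, and the
    coefficient of variation (std / mean) per eta.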
    """
    comparison_file = os.path.join(base_path, 'eta_perplexity_comparison.csv')

    if os.path.exists(comparison_file):
        df = pd.read_csv(comparison_file)

        print(f"\n📈 Perplexity trend analysis:")
        print("=" * 60)

        eta_perp_corr = df['eta'].corr(df['perplexity_mean'])
        eta_match_corr = df['eta'].corr(df['doc_match_rate_mean'])
        eta_path_corr = df['eta'].corr(df['avg_path_length_mean'])

        print(f"Correlation eta vs avg perplexity: {eta_perp_corr:.4f}")
        print(f"Correlation eta vs doc match rate: {eta_match_corr:.4f}")
        print(f"Correlation eta vs avg path length: {eta_path_corr:.4f}")

        best_eta_idx = df['perplexity_mean'].idxmin()
        best_eta = df.loc[best_eta_idx, 'eta']
        best_perplexity = df.loc[best_eta_idx, 'perplexity_mean']

        print(f"\n🏆 Best performance:")
        print(f"   Lowest average perplexity: {best_perplexity:.4f} (Eta={best_eta})")
        print(f"   Corresponding match rate: {df.loc[best_eta_idx, 'doc_match_rate_mean']:.1%}")
        print(f"   Runs: {int(df.loc[best_eta_idx, 'run_id_count'])}")

        print(f"\n📊 Stability (coefficient of variation):")
        for _, row in df.iterrows():
            eta = row['eta']
            cv = row['perplexity_std'] / row['perplexity_mean'] if row['perplexity_mean'] > 0 else 0
            print(f"   Eta {eta}: CV={cv:.4f}")

    else:
        print("⚠️ Overall comparison file not found. Run aggregation first.")

# Execute full perplexity calculation and aggregation
base_path = "/Volumes/My Passport/收敛结果/step2"

print("=" * 80)
print("Starting full perplexity computation...")
print("=" * 80)

# 1. Compute perplexity (if not done)
calculate_hlda_perplexity_with_path_mapping_complete(base_path, corpus, test_ratio=0.2)

print("\n" + "=" * 80)
print("Starting aggregation of perplexity by eta...")
print("=" * 80)

# 2. Aggregate by eta
overall_summary = aggregate_perplexity_by_eta_groups(base_path) 

print("\n" + "=" * 80)
print("Starting perplexity trend analysis...")
print("=" * 80)

# 3. Trend analysis
analyze_perplexity_trends(base_path)

print("=" * 80)
print("✅ Perplexity computation and aggregation completed!")
print("=" * 80)

Starting full perplexity computation...
📊 Dataset split:
   Total documents: 970
   Training set: 776 documents
   Test set: 194 documents
🔍 Found 18 model result folders to process

[1/18] Computing perplexity for: depth_3_gamma_0.05_eta_0.1_run_2
Parameters - Eta: 0.1, Gamma: 0.05, Depth: 3, Alpha: 0.1
📈 Last iteration: 175
📈 Number of nodes: 231
📈 Path mappings: 970
💾 Perplexity results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/perplexity_results_final.csv
📊 Perplexity summary:
   - Perplexity: 402.1430
   - Average doc perplexity: 402.1430
   - Document match rate: 100.0%
   - Average path length: 3.0
   - Valid test docs: 194/194

[2/18] Computing perplexity for: depth_3_gamma_0.05_eta_0.1_run_3
Parameters - Eta: 0.1, Gamma: 0.05, Depth: 3, Alpha: 0.1
📈 Last iteration: 175
📈 Number of nodes: 215
📈 Path mappings: 970
💾 Perplexity results saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_ru

Traceback (most recent call last):
  File "/var/folders/v5/6mdkg5713kxgwg5xs24g8rvr0000gn/T/ipykernel_1460/147612263.py", line 300, in aggregate_perplexity_by_eta_groups
    eta_summary = group_data.agg(agg_dict).round(4)
  File "/Users/wenlinsuniverse/opt/anaconda3/envs/huggingface/lib/python3.8/site-packages/pandas/core/frame.py", line 9342, in aggregate
    result = op.agg()
  File "/Users/wenlinsuniverse/opt/anaconda3/envs/huggingface/lib/python3.8/site-packages/pandas/core/apply.py", line 776, in agg
    result = super().agg()
  File "/Users/wenlinsuniverse/opt/anaconda3/envs/huggingface/lib/python3.8/site-packages/pandas/core/apply.py", line 172, in agg
    return self.agg_dict_like()
  File "/Users/wenlinsuniverse/opt/anaconda3/envs/huggingface/lib/python3.8/site-packages/pandas/core/apply.py", line 504, in agg_dict_like
    results = {
  File "/Users/wenlinsuniverse/opt/anaconda3/envs/huggingface/lib/python3.8/site-packages/pandas/core/apply.py", line 505, in <dictcomp>
    key
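
The traceback above is cut off, so its exact cause is not visible, but it originates at `group_data.agg(agg_dict)`. As a point of comparison, here is a minimal sketch of how `DataFrame.agg` and `groupby(...).agg` shape their results when given a dict of function lists. The `runs` table and its values are made up for illustration and are not results from these experiments.

```python
import pandas as pd

# Toy stand-in for the per-run rows collected in `all_data` (values are made up).
runs = pd.DataFrame({
    "eta": [0.1, 0.1, 0.1],
    "run_id": ["1", "2", "3"],
    "perplexity": [400.0, 410.0, 420.0],
    "gamma": [0.05, 0.05, 0.05],
})

# DataFrame.agg with a dict of lists: one row per function, plain column names.
flat = runs.agg({"perplexity": ["mean", "std"]})
print(list(flat.index), list(flat.columns))

# groupby(...).agg with the same kind of dict: (column, function) MultiIndex columns,
# which is the shape the '_'.join(col) flattening step expects.
agg_dict = {"perplexity": ["mean", "std"], "run_id": "count", "gamma": "first"}
summary = runs.groupby("eta").agg(agg_dict)
summary.columns = ["_".join(col) for col in summary.columns]
summary = summary.reset_index()
print(list(summary.columns))
```

Only the grouped form yields names such as `perplexity_mean` and `run_id_count`, which the later printing code looks up.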

In [None]:
import pandas as pd
import numpy as np
import os
import glob

def calculate_branching_and_gini_metrics(base_path="."):
    """
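    Calculate per-layer branching factor and Gini coefficient metrics for each run.

    Reads every corrected_renyi_entropy.csv under base_path and writes
    layer_branching_gini_metrics.csv into the same run folder.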
    """
    pattern = os.path.join(base_path, "**", "corrected_renyi_entropy.csv")
    files = glob.glob(pattern, recursive=True)
    
    print(f"🔍 Found {len(files)} entropy files to process")
    
    for idx, file_path in enumerate(files, 1):
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        
        print(f"\n[{idx}/{len(files)}] Processing folder: {folder_name}")
        
        try:
            # Read entropy file
            entropy_df = pd.read_csv(file_path)
            
            # Check if required columns exist
            required_cols = ['node_id', 'layer', 'document_count', 'child_count']
            missing_cols = [col for col in required_cols if col not in entropy_df.columns]
            
            if missing_cols:
                print(f"⚠️ Missing required columns: {missing_cols}, skipping this file")
                continue
            
            # Calculate layer-level branching factor and Gini coefficient metrics
            layer_metrics = []
            
            for layer in entropy_df['layer'].unique():
                if layer == -1:  # Skip invalid layers
                    continue
                    
                layer_nodes = entropy_df[entropy_df['layer'] == layer]
                
                # Basic statistics
                node_count = len(layer_nodes)
                total_documents = layer_nodes['document_count'].sum()
                
                # Branching statistics
                child_counts = layer_nodes['child_count'].values
                total_branches = child_counts.sum()
                
                # Non-leaf node statistics
                non_leaf_nodes = (child_counts > 0).sum()
                non_leaf_counts = child_counts[child_counts > 0]
                
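                # Branching factor statistics. Both averages are taken over non-leaf
                # nodes only, so avg_branching_factor equals non_leaf_avg_branching;
                # a layer made up entirely of leaves reports 0.0.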
                if len(non_leaf_counts) > 0:
                    avg_branching_factor = non_leaf_counts.mean()
                    std_branching_factor = non_leaf_counts.std()
                    non_leaf_avg_branching = non_leaf_counts.mean()
                else:
                    avg_branching_factor = 0.0
                    std_branching_factor = 0.0
                    non_leaf_avg_branching = 0.0
                
                # Gini coefficient calculation
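                def gini_coefficient(values):
                    """Gini coefficient of the positive values (0 = even, near 1 = concentrated).

                    Zero entries are dropped before the calculation, so nodes with no
                    documents or no children do not count toward the inequality.
                    """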
                    if len(values) == 0:
                        return 0.0
                    values = np.array(values)
                    values = values[values > 0]  # Only consider positive values
                    if len(values) <= 1:
                        return 0.0
                    
                    values = np.sort(values)
                    n = len(values)
                    cumsum = np.cumsum(values)
                    return (n + 1 - 2 * np.sum(cumsum) / cumsum[-1]) / n
                
                # Document distribution Gini coefficient
                doc_counts = layer_nodes['document_count'].values
                gini_doc_distribution = gini_coefficient(doc_counts)
                
                # Branch distribution Gini coefficient
                gini_branch_distribution = gini_coefficient(child_counts)
                
                layer_metrics.append({
                    'layer': layer,
                    'node_count': node_count,
                    'total_branches': total_branches,
                    'avg_branching_factor': avg_branching_factor,
                    'std_branching_factor': std_branching_factor,
                    'non_leaf_nodes': non_leaf_nodes,
                    'non_leaf_avg_branching': non_leaf_avg_branching,
                    'total_documents': total_documents,
                    'gini_doc_distribution': gini_doc_distribution,
                    'gini_branch_distribution': gini_branch_distribution
                })
            
            # Save layer-level metrics
            if layer_metrics:
                layer_df = pd.DataFrame(layer_metrics)
                layer_output_path = os.path.join(folder_path, 'layer_branching_gini_metrics.csv')
                layer_df.to_csv(layer_output_path, index=False)
                print(f"✓ Layer metrics saved to: {layer_output_path}")
                
                # Display brief statistics
                print(f"📊 Layer metrics summary:")
                for _, row in layer_df.iterrows():
                    layer_num = int(row['layer'])
                    node_count = int(row['node_count'])
                    avg_branch = row['avg_branching_factor']
                    doc_gini = row['gini_doc_distribution']
                    branch_gini = row['gini_branch_distribution']
                    print(f"   Layer {layer_num} ({node_count} nodes): Branching={avg_branch:.2f}, Doc Gini={doc_gini:.4f}, Branch Gini={branch_gini:.4f}")
            
        except Exception as e:
            import traceback
            print(f"❌ Error processing file {file_path}: {str(e)}")
            traceback.print_exc()

def aggregate_branching_gini_by_eta(base_path="."):
    """
    Aggregate branching factor and Gini coefficient statistics by eta value (layer-level only)
    """
    
    print("=" * 80)
    print("Aggregating layer-level branching factor and Gini coefficient metrics...")
    print("=" * 80)
    
    pattern = os.path.join(base_path, "**", "layer_branching_gini_metrics.csv")
    files = glob.glob(pattern, recursive=True)
    
    print(f"🔍 Found {len(files)} layer metric files")
    
    all_layer_data = []
    eta_groups = {}
    
    for file_path in files:
        folder_path = os.path.dirname(file_path)
        folder_name = os.path.basename(folder_path)
        parent_folder = os.path.dirname(folder_path)
        
        # Extract eta value
        eta = None
        if 'eta_' in folder_name:
            try:
                eta_part = folder_name.split('eta_')[1].split('_')[0]
                eta = float(eta_part)
            except (IndexError, ValueError):  # malformed eta segment in folder name
                continue
        else:
            continue
        
        # Extract run number
        run_match = folder_name.split('_run_')
        if len(run_match) > 1:
            run_id = run_match[1]
        else:
            continue
        
        if eta not in eta_groups:
            eta_groups[eta] = parent_folder
        
        try:
            df = pd.read_csv(file_path)
            
            for _, row in df.iterrows():
                all_layer_data.append({
                    'eta': eta,
                    'run_id': run_id,
                    'layer': row['layer'],
                    'node_count': row['node_count'],
                    'total_branches': row['total_branches'],
                    'avg_branching_factor': row['avg_branching_factor'],
                    'std_branching_factor': row['std_branching_factor'],
                    'non_leaf_nodes': row['non_leaf_nodes'],
                    'non_leaf_avg_branching': row['non_leaf_avg_branching'],
                    'total_documents': row['total_documents'],
                    'gini_doc_distribution': row['gini_doc_distribution'],
                    'gini_branch_distribution': row['gini_branch_distribution'],
                    'parent_folder': parent_folder
                })
                
        except Exception as e:
            print(f"Error reading file {file_path}: {e}")
    
    # Convert to DataFrame and aggregate by eta
    if all_layer_data:
        layer_summary_df = pd.DataFrame(all_layer_data)
        
        print("Layer-level branching factor and Gini coefficient summary statistics by ETA value")
        print("=" * 80)
        
        # Generate layer summary files by eta
        for eta, group_data in layer_summary_df.groupby('eta'):
            parent_folder = group_data['parent_folder'].iloc[0]
            
            print(f"\nProcessing Eta={eta}")
            
            layer_summary = group_data.groupby('layer').agg({
                'node_count': ['mean', 'std'],
                'total_branches': ['mean', 'std'],
                'avg_branching_factor': ['mean', 'std'],
                'std_branching_factor': ['mean', 'std'],
                'non_leaf_nodes': ['mean', 'std'],
                'non_leaf_avg_branching': ['mean', 'std'],
                'total_documents': ['mean', 'std'],
                'gini_doc_distribution': ['mean', 'std'],
                'gini_branch_distribution': ['mean', 'std'],
                'run_id': 'count'
            }).round(4)
            
            # Flatten column names
            layer_summary.columns = ['_'.join(col).strip() for col in layer_summary.columns]
            layer_summary = layer_summary.reset_index()
            layer_summary.insert(0, 'eta', eta)
            
            # Save aggregated results
            output_filename = f'eta_{eta}_layer_branching_gini_summary.csv'
            output_path = os.path.join(parent_folder, output_filename)
            layer_summary.to_csv(output_path, index=False)
            
            print(f"  Saved layer summary file: {output_path}")
            print(f"  Number of layers: {len(layer_summary)}")
            
            # Count column produced by flattening ('run_id', 'count')
            count_col = 'run_id_count' if 'run_id_count' in layer_summary.columns else None
            
            # Display brief statistics
            for _, row in layer_summary.iterrows():
                layer_num = int(row['layer'])
                avg_branch = row['avg_branching_factor_mean']
                doc_gini = row['gini_doc_distribution_mean']
                branch_gini = row['gini_branch_distribution_mean']
                run_count = int(row[count_col]) if count_col else 0
                
                print(f"    Layer {layer_num}: Branching={avg_branch:.2f}, Doc Gini={doc_gini:.4f}, Branch Gini={branch_gini:.4f}, runs={run_count}")
        
        # Generate overall layer comparison file
        overall_layer_summary = layer_summary_df.groupby(['eta', 'layer']).agg({
            'avg_branching_factor': ['mean', 'std'],
            'gini_doc_distribution': ['mean', 'std'],
            'gini_branch_distribution': ['mean', 'std'],
            'node_count': ['mean', 'std'],
            'run_id': 'count'
        }).round(4)
        
        overall_layer_summary.columns = ['_'.join(col).strip() for col in overall_layer_summary.columns]
        overall_layer_summary = overall_layer_summary.reset_index()
        
        overall_layer_output_path = os.path.join(base_path, 'eta_layer_branching_gini_comparison.csv')
        overall_layer_summary.to_csv(overall_layer_output_path, index=False)
        print(f"\nOverall layer comparison file saved to: {overall_layer_output_path}")

def display_branching_gini_summary(base_path="."):
    """
    Display summary report of branching factor and Gini coefficient analysis
    """
    print("=" * 100)
    print("Branching Factor and Gini Coefficient Analysis Summary Report")
    print("=" * 100)
    
    # Read overall comparison file
    layer_comparison_file = os.path.join(base_path, 'eta_layer_branching_gini_comparison.csv')
    
    if os.path.exists(layer_comparison_file):
        print("\n📊 Layer-level branching factor and Gini coefficient analysis:")
        print("-" * 60)
        
        df = pd.read_csv(layer_comparison_file)
        
        # Count column produced by flattening ('run_id', 'count')
        count_col = 'run_id_count' if 'run_id_count' in df.columns else None
        
        for layer in sorted(df['layer'].unique()):
            print(f"\nLayer {int(layer)} Cross-Eta comparison:")
            print("Eta Value  Avg Branching(±std)  Doc Gini(±std)     Branch Gini(±std)     Runs")
            print("-" * 75)
            
            layer_data = df[df['layer'] == layer]
            for _, row in layer_data.iterrows():
                eta = row['eta']
                avg_branch = row['avg_branching_factor_mean']
                branch_std = row['avg_branching_factor_std']
                doc_gini = row['gini_doc_distribution_mean']
                doc_gini_std = row['gini_doc_distribution_std']
                branch_gini = row['gini_branch_distribution_mean']
                branch_gini_std = row['gini_branch_distribution_std']
                run_count = int(row[count_col]) if count_col else 0
                
                print(f"{eta:6.3f}    {avg_branch:6.2f}(±{branch_std:4.2f})     {doc_gini:6.4f}(±{doc_gini_std:5.4f})     {branch_gini:6.4f}(±{branch_gini_std:5.4f})     {run_count:4d}")
    else:
        print("⚠️ Layer comparison file not found")
    
    print("\n" + "=" * 100)
    print("✅ Branching factor and Gini coefficient analysis completed!")
    print("=" * 100)

# Execute branching factor and Gini coefficient analysis
base_path = "/Volumes/My Passport/收敛结果/step2"

print("=" * 80)
print("Starting calculation of branching factor and Gini coefficient metrics...")
print("=" * 80)

# 1. Calculate branching factor and Gini coefficient for each model
calculate_branching_and_gini_metrics(base_path)

print("\n" + "=" * 80)
print("Starting aggregation of branching factor and Gini coefficient statistics by eta value...")
print("=" * 80)

# 2. Aggregate by eta
aggregate_branching_gini_by_eta(base_path)

print("\n" + "=" * 80)
print("Displaying branching factor and Gini coefficient summary report...")
print("=" * 80)

# 3. Display summary report
display_branching_gini_summary(base_path)
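
As a sanity check on the Gini formula used in `calculate_branching_and_gini_metrics`, here is a minimal standalone sketch of the same sorted-cumulative-sum calculation applied to toy counts. The function name `gini` and the input lists are illustrative assumptions, not values from these runs.

```python
import numpy as np

def gini(values):
    """Gini coefficient of the positive entries, via the sorted cumulative-sum formula."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[x > 0])          # zeros are dropped, as in the notebook's helper
    n = len(x)
    if n <= 1:
        return 0.0
    cumsum = np.cumsum(x)
    return float((n + 1 - 2 * cumsum.sum() / cumsum[-1]) / n)

even = gini([5, 5, 5, 5])            # documents spread evenly over four nodes
skewed = gini([1, 1, 1, 97])         # almost all documents in one node
with_zeros = gini([0, 0, 10])        # only one positive value survives the filter

print(even)                 # 0.0
print(round(skewed, 4))     # 0.72
print(with_zeros)           # 0.0
```

Because zeros are filtered out first, a layer where a single node holds everything and the rest hold nothing scores 0.0 rather than a high value. This is worth keeping in mind when reading the Doc Gini and Branch Gini columns.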

Starting calculation of branching factor and Gini coefficient metrics...
🔍 Found 18 entropy files to process

[1/18] Processing folder: depth_3_gamma_0.05_eta_0.1_run_2
✓ Layer metrics saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_2/layer_branching_gini_metrics.csv
📊 Layer metrics summary:
   Layer 0 (1 nodes): Branching=44.00, Doc Gini=0.0000, Branch Gini=0.0000
   Layer 1 (44 nodes): Branching=4.23, Doc Gini=0.6345, Branch Gini=0.3915
   Layer 2 (186 nodes): Branching=0.00, Doc Gini=0.4864, Branch Gini=0.0000

[2/18] Processing folder: depth_3_gamma_0.05_eta_0.1_run_3
✓ Layer metrics saved to: /Volumes/My Passport/收敛结果/step2/step2_d3_g005_e01_收敛/depth_3_gamma_0.05_eta_0.1_run_3/layer_branching_gini_metrics.csv
📊 Layer metrics summary:
   Layer 0 (1 nodes): Branching=41.00, Doc Gini=0.0000, Branch Gini=0.0000
   Layer 1 (41 nodes): Branching=4.22, Doc Gini=0.6697, Branch Gini=0.4892
   Layer 2 (173 nodes): Branching=0.00, Doc Gini=0.4607