# Sentence and Paragraph Metric Analysis Notebook

This notebook performs the following steps:
- **Import Libraries and Load Data:** It reads JSON files containing evaluation metrics for sentence-level and paragraph-level translations.
- **DataFrame Creation:** It builds two DataFrames (`df_sentence` and `df_paragraph`), each holding BLEU, ROUGE-1, ROUGE-2, ROUGE-L, chrF, and a mean score per language.
- **Merge DataFrames:** The two DataFrames are merged on the `language` column to combine sentence and paragraph metrics.
- **Correlation Analysis:** For each metric, it computes the Pearson correlation coefficient and p-value between the sentence-level and paragraph-level scores across languages.

This analysis shows how closely each metric's sentence-level scores track its paragraph-level scores across languages.
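The loading cell reads seven keys from every entry, so it appears to assume that each JSON file is an array of objects carrying all of them. A minimal sketch of a pre-check under that assumption follows; `REQUIRED_KEYS`, `find_missing_keys`, and the sample values are hypothetical and not part of the notebook.

```python
import json

# Keys the loading cell reads from every entry.
REQUIRED_KEYS = ("language", "bleu", "rouge1", "rouge2", "rougeL", "chrf", "mean")


def find_missing_keys(entries):
    """Return {entry index: [missing keys]} for entries lacking a required key."""
    problems = {}
    for index, entry in enumerate(entries):
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems[index] = missing
    return problems


# Hypothetical sample mimicking the assumed file layout; all values are made up.
sample = json.loads("""
[
  {"language": "xx", "bleu": 0.5, "rouge1": 0.4, "rouge2": 0.2,
   "rougeL": 0.4, "chrf": 70.0, "mean": 0.5},
  {"language": "yy", "bleu": 0.3, "rouge1": 0.2}
]
""")

problems = find_missing_keys(sample)
print(problems)
```

Running a check like this before building the DataFrames would point at the offending entry by index, rather than letting the list comprehension stop with a bare `KeyError`.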

In [None]:
import json

import pandas as pd
from scipy.stats import pearsonr

# Fields kept from each JSON entry, in output column order.
COLUMNS = ['language', 'bleu', 'rouge1', 'rouge2', 'rougeL', 'chrf', 'mean']


def load_json(path):
    """Read a JSON results file and return the parsed list of entries."""
    with open(path, "r") as f:
        return json.load(f)


def to_frame(entries):
    """Build a DataFrame with one row per entry and one column per field."""
    return pd.DataFrame([{col: entry[col] for col in COLUMNS} for entry in entries])


# Load sentence-level metrics from JSON file and create a DataFrame.
sentence = load_json("../results/sentence.json")
df_sentence = to_frame(sentence)

# Load paragraph-level metrics from JSON file and create a DataFrame.
paragraph = load_json("../results/paragraph.json")
df_paragraph = to_frame(paragraph)

## Merge DataFrames and Prepare for Correlation Analysis

In the next cell, we merge the sentence-level and paragraph-level DataFrames on the `language` column.
The suffixes `_sent` and `_para` distinguish the two sets of metrics. Because `pd.merge` defaults to an inner join, only languages present in both files are kept. We then compute the Pearson correlation coefficient and p-value for each metric.
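To illustrate how the suffixes and the join type behave, here is a small sketch on toy data. The frames `toy_sentence` and `toy_paragraph`, the language codes, and the scores are all made up; only the `pd.merge` call shape mirrors the notebook. The outer join with `indicator=True` is an optional check, not something the notebook itself does.

```python
import pandas as pd

# Hypothetical toy scores; languages and values are made up for illustration.
toy_sentence = pd.DataFrame({"language": ["aa", "bb", "cc"], "bleu": [0.2, 0.5, 0.8]})
toy_paragraph = pd.DataFrame({"language": ["aa", "bb", "dd"], "bleu": [0.1, 0.4, 0.7]})

# Same call shape as the notebook's merge: inner join, suffixed metric columns.
inner = pd.merge(toy_sentence, toy_paragraph, on="language", suffixes=("_sent", "_para"))

# An outer join with indicator=True adds a `_merge` column that flags
# rows found in only one of the two inputs.
outer = pd.merge(
    toy_sentence,
    toy_paragraph,
    on="language",
    how="outer",
    suffixes=("_sent", "_para"),
    indicator=True,
)
unmatched = outer.loc[outer["_merge"] != "both", ["language", "_merge"]]

print(inner)
print(unmatched)
```

In this toy case the inner join keeps only `aa` and `bb`, while `unmatched` lists `cc` and `dd`, the languages that would silently drop out of the correlation.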

In [2]:
merged_df = pd.merge(df_sentence, df_paragraph, on="language", suffixes=('_sent', '_para'))

print("Merged DataFrame:")
print(merged_df)

# Define the list of metrics to compare between sentence and paragraph evaluations.
metrics = ["chrf", "bleu", "rouge1", "rouge2", "rougeL", "mean"]

print("\nPearson Correlation Coefficients:")
for metric in metrics:
    col_sent = metric + "_sent"
    col_para = metric + "_para"
    # Calculate Pearson correlation and corresponding p-value.
    r, p_value = pearsonr(merged_df[col_sent], merged_df[col_para])
    print(f"{metric:8s}: r = {r:.3f}, p-value = {p_value:.3e}")

Merged DataFrame:
      language  bleu_sent  rouge1_sent  rouge2_sent  rougeL_sent  chrf_sent  \
0       arabic   0.609044     0.328829     0.081081     0.324324  76.850865   
1      bengali   0.765244     0.184211     0.078947     0.184211  88.050489   
2      burmese   0.204676     0.215189     0.103448     0.216297  54.971608   
3    cantonese   0.783907     0.400000     0.314286     0.400000  77.152628   
4        hindi   0.699851     0.400000     0.165714     0.390476  82.477808   
5   indonesian   0.831442     0.946900     0.858708     0.938013  92.343698   
6     japanese   0.000000     0.401628     0.233831     0.396201  66.158589   
7        khmer   0.233038     0.209622     0.097938     0.209622  52.982974   
8     mandarin   0.713609     0.356436     0.158416     0.356436  70.227798   
9    mongolian   0.474205     0.390110     0.120679     0.380530  66.725151   
10      nepali   0.631054     0.129032     0.043011     0.124424  79.519024   
11     persian   0.796547     0.20