## Research Question: Is there a relationship between the quality and length of a summary and its subreddit?

#### Evaluating Summary Quality Without a Gold Standard

In many summarization evaluation tasks, the quality of a summary is assessed by comparing it to a human-written **gold standard summary** using metrics such as ROUGE or BLEU. These metrics need one or more reference summaries and score the n-gram overlap between the candidate and those references, which serves as an indirect signal of content coverage and informativeness.
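
To make the idea of reference-based scoring concrete, here is a minimal sketch of a simplified ROUGE-1 F1 score. It assumes lowercased whitespace tokenization; the function name `rouge1_f1` and the example sentences are hypothetical, and real ROUGE implementations typically add further preprocessing, so treat this as an illustration rather than the official metric.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap on whitespace tokens."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical candidate/reference pair, for illustration only.
score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
print(f"ROUGE-1 F1: {score:.3f}")
```

The score depends entirely on the reference string, which is why this family of metrics cannot be applied when no independent reference exists.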

However, in the TLDR-17 dataset, **there is no separate gold summary available for evaluation** beyond the provided summary itself. Standard reference-based metrics therefore cannot be applied directly, as we lack an independent “correct” summary to compare against.

---

#### Using Semantic Similarity as a Proxy for Summary Quality

To work around this, we adopt an **unsupervised, reference-free approach** to estimate summary quality by measuring the semantic similarity between the original content and its corresponding summary. The key assumption is that a good summary should capture the essential meaning of the original content. Therefore, higher semantic similarity scores suggest that the summary effectively represents the content, while lower scores indicate a potential loss of information or off-topic summarization.

---

#### Why Sentence Transformers and Cosine Similarity?

To quantify semantic similarity, we use **sentence-transformer models**, which are deep learning models trained to produce high-quality vector representations (embeddings) of sentences or paragraphs. These embeddings capture the semantic meaning of text beyond simple keyword matching.

By encoding both the content and the summary into embeddings, we can compute the **cosine similarity** between their vectors. Cosine similarity measures the angle between two vectors in high-dimensional space, providing a normalized score between -1 and 1 that reflects how semantically close the texts are.

- A score close to **1** means the summary and content share very similar meaning.
- A score near **0** indicates little semantic overlap.
- Negative values mean the embeddings point in opposing directions; they are rare in this context and do not necessarily signal opposite meanings.

This method allows us to **objectively assess the relationship between content and summary without relying on external references**, making it a practical and meaningful proxy for summary quality in the absence of gold standards.
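
As a small illustration of the score ranges described above, the following sketch computes cosine similarity for toy two-dimensional vectors in plain Python. The vectors are made up for the example and stand in for real sentence embeddings, which have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (not real embeddings) covering the three cases in the list above.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Because the score divides by both norms, scaling a vector does not change the result; only the direction matters.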


In [5]:
# Install dependencies
!pip install -q sentence-transformers tqdm pandas datasets matplotlib

# Imports (unused `util` and `torch` imports removed)
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
import pandas as pd
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

# Load dataset
stream = load_dataset("webis/tldr-17", split="train", streaming=True)
N = 100000  # Sample size
rows = []

# Sample N rows
for i, ex in enumerate(tqdm(stream, total=N, desc="Sampling TLDR-17")):
    if i >= N:
        break
    rows.append({
        "content": ex["content"],
        "summary": ex["summary"],
        "subreddit": ex["subreddit"]
    })

df = pd.DataFrame(rows)

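# SentenceTransformer model
# Note: all-MiniLM-L6-v2 truncates inputs beyond its maximum sequence length
# (256 word pieces by default), so long posts are only partially encoded.
model = SentenceTransformer('all-MiniLM-L6-v2')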

# Encode content and summaries
contents = df["content"].tolist()
summaries = df["summary"].tolist()

content_embs = model.encode(contents, convert_to_tensor=True, batch_size=128, show_progress_bar=True, normalize_embeddings=True)
summary_embs = model.encode(summaries, convert_to_tensor=True, batch_size=128, show_progress_bar=True, normalize_embeddings=True)

# Row-wise cosine similarity: the embeddings are L2-normalized, so the dot
# product of each content/summary pair equals their cosine similarity.
similarities = (content_embs * summary_embs).sum(dim=1)


# Add features
df["similarity"] = similarities.cpu().numpy()
df["summary_len"] = df["summary"].str.len()
df["content_len"] = df["content"].str.len()
df["compression_ratio"] = df["summary_len"] / df["content_len"].replace(0, 1)

# Save full data
df.to_csv("tldr_summary_analysis.csv", index=False)

# Basic stats
print(df[["similarity", "summary_len", "content_len"]].describe())

# Optional: Subreddit-level aggregation
agg = df.groupby("subreddit").agg({
    "similarity": "mean",
    "summary_len": "mean",
    "content_len": "mean",
    "compression_ratio": "mean",
    "content": "count"
}).rename(columns={"content": "num_posts"}).sort_values("similarity", ascending=False)

agg.to_csv("subreddit_summary_stats.csv")
print("\nTop subreddits by average TL;DR similarity:")
print(agg.head(10))

# Plot similarity distribution
plt.hist(df["similarity"], bins=50, alpha=0.7)
plt.title("Semantic Similarity between Content and Summary")
plt.xlabel("Cosine Similarity")
plt.ylabel("Number of Examples")
plt.show()


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


KeyboardInterrupt: 
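
The subreddit-level aggregation in the cell above ranks subreddits by mean similarity. In a sample of this kind, some subreddits may contribute only one or two posts, and a mean over so few values is noisy. The sketch below shows one possible guard: a hypothetical `min_posts` threshold applied before ranking. It is written in plain Python on made-up toy rows so it runs without the notebook's dependencies; the function name, the threshold, and the data are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


def subreddit_means(rows, min_posts=2):
    """Rank subreddits by mean similarity, skipping those with too few posts."""
    grouped = defaultdict(list)
    for subreddit, similarity in rows:
        grouped[subreddit].append(similarity)
    stats = {
        name: {"mean_similarity": mean(values), "num_posts": len(values)}
        for name, values in grouped.items()
        if len(values) >= min_posts  # drop subreddits with too few posts
    }
    # Highest mean similarity first.
    return sorted(stats.items(), key=lambda item: item[1]["mean_similarity"], reverse=True)


# Synthetic (subreddit, similarity) pairs, not taken from TLDR-17.
toy_rows = [("a", 0.5), ("a", 0.75), ("b", 1.0), ("c", 0.25), ("c", 0.25)]
ranking = subreddit_means(toy_rows, min_posts=2)
for name, stat in ranking:
    print(name, stat)
```

Here subreddit `b` would otherwise top the ranking on the strength of a single post. In the notebook's pandas code, an analogous filter would be `agg[agg["num_posts"] >= min_posts]` applied before `head(10)`.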

In [None]:
# # Install dependencies first (run once)
# !pip install -q sentence-transformers tqdm pandas datasets

# # Imports
# from datasets import load_dataset
# from sentence_transformers import SentenceTransformer, util
# from sentence_transformers.util import cos_sim
# import pandas as pd
# from tqdm.auto import tqdm

# # Load a manageable subset from TLDR-17 for demonstration
# stream = load_dataset("webis/tldr-17", split="train", streaming=True)

# # Sample a small number of examples to a list for demo (adjust N as needed)
# N = 100000
# rows = []
# for i, ex in enumerate(tqdm(stream, total=N, desc="Sampling TLDR-17")):
#     if i >= N:
#         break
#     rows.append({
#         "content": ex["content"],
#         "summary": ex["summary"],
#         "subreddit": ex["subreddit"]
#     })

# df = pd.DataFrame(rows)
# # Batch embed all content and all summaries
# contents = df["content"].tolist()
# summaries = df["summary"].tolist()

# model = SentenceTransformer('all-MiniLM-L6-v2')


# # Encode in batches
# content_embs = model.encode(contents, convert_to_tensor=True, batch_size=128, show_progress_bar=True)
# summary_embs = model.encode(summaries, convert_to_tensor=True, batch_size=128, show_progress_bar=True)
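# # NOTE: util.cos_sim builds the full N x N similarity matrix before .diagonal()
# # is taken; for N = 100000 that is the 40 GB allocation in the RuntimeError below.
# similarities = util.cos_sim(content_embs, summary_embs).diagonal()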

# # Load sentence-transformers model

# # Apply with progress bar
# tqdm.pandas(desc="Computing similarity")
# #df["similarity"] = df.progress_apply(semantic_similarity, axis=1)
# df["similarity"] = similarities.cpu().numpy()
# df.to_csv("tldr_cos_similarities.csv", index=False)


# # Quick stats and visualization
# print(df["similarity"].describe())

# import matplotlib.pyplot as plt
# plt.hist(df["similarity"], bins=50, alpha=0.7)
# plt.title("Semantic Similarity between Content and Summary")
# plt.xlabel("Cosine Similarity")
# plt.ylabel("Number of Examples")
# plt.show()



RuntimeError: [enforce fail at alloc_cpu.cpp:119] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 40000000000 bytes. Error code 12 (Cannot allocate memory)
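
The allocation size in the error above matches what a full pairwise similarity matrix would need: 100000 × 100000 values at 4 bytes each (assuming float32 embeddings) is 40000000000 bytes. Only the diagonal of that matrix is needed, that is, each post compared against its own summary. The sketch below, using hypothetical helper names and toy unit vectors in plain Python, reproduces the arithmetic and shows the row-wise dot product that avoids building the matrix.

```python
def full_matrix_bytes(n_rows: int, bytes_per_value: int = 4) -> int:
    """Memory needed for an n_rows x n_rows matrix (4 bytes per float32 value)."""
    return n_rows * n_rows * bytes_per_value


def rowwise_dot(a_rows, b_rows):
    """Dot product of each row in a_rows with the matching row in b_rows."""
    return [sum(x * y for x, y in zip(a, b)) for a, b in zip(a_rows, b_rows)]


needed = full_matrix_bytes(100_000)
print(needed)  # 40000000000

# Toy unit-length vectors standing in for normalized embeddings.
content_vecs = [[1.0, 0.0], [0.6, 0.8]]
summary_vecs = [[1.0, 0.0], [0.8, 0.6]]
pairwise = rowwise_dot(content_vecs, summary_vecs)
print(pairwise)
```

The row-wise version needs memory proportional to the number of pairs rather than its square, which is the same idea as the `(content_embs * summary_embs).sum(dim=1)` line in the earlier cell.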