## **Negative Retrieve and Rerank for extractive text summarization**


### **Abstract** 
This notebook demonstrates a basic Negative Retrieve and Rerank setup for document summarization and highlighting
using positive and negative query hits.<br>The script uses semantic search and a cross-encoder to retrieve and score passages
by their relevance to the query instructions,<br> then applies PageRank to the retrieved passages to exploit the additional information present in their similarity structure.<br>

### **Problem** 

<a href="https://courses.nus.edu.sg/course/elltankw/EL1102-24a.JPG">
  <img src="https://courses.nus.edu.sg/course/elltankw/EL1102-24a.JPG" alt="Description">
</a>

<br>

For each document $D$, there exists a superstructure that is a superset of an infinite number of structures: 

<br>

\begin{align*}
D_i : \quad
& \text{set of variables } x = \{x_i \mid i \in \mathbb{N}\}, \text{ denoted as elements, connoted as } \textit{items, units}, \\
& \text{set of domains } d = \{d_i \mid i \in \mathbb{N}\}, \text{ denoted as structures, connoted as } \textit{referents, categories} \\
\end{align*}

<br>

such that each structure in $D_i$ can be read as a sign, in the sense of the following passage:

*"Signs of all types are recognizable as such because they have certain predictable and regular properties or structures.
For example, most human signs have the capacity to encode two primary kinds of referents, denotative and connotative, 
depending on usage and situation. Denotation is the initial referent a sign intends to capture. But the denotated referent, 
or denotatum, is not something specific in the world, but rather a prototypical category of something. For instance, 
the word cat does not refer to a specific 'cat,' although it can, but to the category of animals that we recognize as having 
the quality 'catness.' The denotative meaning of cat is, therefore, really catness, a prototypical mental picture marked by specific distinctive features such as mammal, retractile claws, long tail, etc."*

Without specifying the scope and relation, we obtain something like this:  

<br>

\begin{align*}
D \; :: \;
\{d_i \mid i \in \mathbb{N}\} \; :: \;
\{x_i \mid i \in \mathbb{N}\}
\end{align*}

<br>

To frame summarization as a solvable constraint satisfaction problem, we rewrite the definitions of the components as follows: <br>

<br>

\begin{align*}
D : \quad
& \text{variables } x=\{x_1,...,x_n\} \\
& \text{non-empty domains } d=\{d_1,...,d_n\} \\
& \text{constraints } c=\{c_1,...,c_n\}\\
\end{align*}

<br>

Each constraint $c_i$ therefore restricts the possible combinations of elements that can be successfully grouped into a finite structure. 
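
To make the constraint view concrete, here is a minimal sketch of passage selection as a tiny constraint satisfaction problem. It is an illustration only: the relevance scores, the excluded set and the length budget are made-up assumptions, and the names `satisfies` and `best_selection` are hypothetical, not part of the notebook code.

```python
from itertools import combinations

# Hypothetical toy data: passage ids mapped to made-up relevance scores.
relevance = {0: 0.9, 1: 0.7, 2: 0.4, 3: 0.8}
excluded = {3}        # constraint c_1: passages hit by the negative query
max_passages = 2      # constraint c_2: length budget of the summary


def satisfies(selection):
    """Return True when a candidate selection meets every constraint."""
    return len(selection) <= max_passages and not (set(selection) & excluded)


def best_selection(scores):
    """Enumerate all subsets and keep the feasible one with the highest total score."""
    best, best_score = (), 0.0
    ids = sorted(scores)
    for size in range(1, len(ids) + 1):
        for selection in combinations(ids, size):
            if satisfies(selection):
                total = sum(scores[i] for i in selection)
                if total > best_score:
                    best, best_score = selection, total
    return best


print(best_selection(relevance))
```

Exhaustive enumeration is only workable for a handful of passages; the notebook instead approximates the selection with retrieval, reranking and a centrality score.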

<br>

\begin{align*}
\text{Linguistic family:}  \
& \text{structural denotata:} \
& \begin{aligned}
    & \text{textual} : \{x_1,...,x_n\}, \\
    & \text{pragmatic} : \{x_1,...,x_n\}, \\
    & \text{syntactic} : \{x_1,...,x_n\}, \\
    & \text{semantic} : \{x_1,...,x_n\}, \\
    & \text{lexical} : \{x_1,...,x_n\}, \\
    & \text{phonemic} : \{x_1,...,x_n\} \\
\end{aligned}
\end{align*}

<br>

Therefore it is often advantageous to implement the summarization layer on top of the information retrieval layer. <br>
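
As a rough illustration of this layering, the sketch below places a summarization step on top of a retrieval step and drops passages that the negative query also retrieves. It uses a bag-of-words cosine similarity as a crude stand-in for the bi-encoder, so it runs without any model download; `bow`, `cosine`, `retrieve`, `summarize_toy` and the toy passages are all hypothetical names and data, not the notebook's implementation.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, passages, top_k):
    """Retrieval layer: indices of the top_k passages most similar to the query."""
    q = bow(query)
    ranked = sorted(range(len(passages)), key=lambda i: cosine(q, bow(passages[i])), reverse=True)
    return ranked[:top_k]


def summarize_toy(prompt, negative_prompt, passages, top_k=2, negative_k=1):
    """Summarization layer: keep positive hits that are not among the negative hits."""
    positive = retrieve(prompt, passages, len(passages))
    negative = set(retrieve(negative_prompt, passages, negative_k))
    kept = [i for i in positive if i not in negative][:top_k]
    return [passages[i] for i in kept]


toy_passages = [
    "the lander touched down on the moon",
    "photo credits and stock images of the moon lander",
    "engineers fixed the lander range finder before the moon landing",
]
print(summarize_toy("moon lander", "stock images credits", toy_passages))
```

The second passage matches the positive query as well as the first one does, yet it is removed because it is the top hit for the negative query.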


In [5]:
from typing import List, Dict, Any, Optional

import nltk.data
from nltk.tokenize import word_tokenize

from sentence_transformers import SentenceTransformer, CrossEncoder, util
from tokenizers.normalizers import BertNormalizer
import networkx as nx

from IPython.display import Markdown, display
import time

########  
normalizer = BertNormalizer(
            clean_text=True,
            handle_chinese_chars=False,
            strip_accents=True,
            lowercase=False,
) 
sentencizer = nltk.data.load('tokenizers/punkt/english.pickle')  
bi_encoder = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
######## // 


########
def _normalize_text(text: str) -> str:
    if not text:
        raise ValueError("Text must not be empty")
    try:
        return normalizer.normalize_str(text)
    except Exception as e:
        # Chain the original exception so the traceback keeps the root cause.
        raise RuntimeError(f"Text normalization failed: {e}") from e

def window_slider(text: str, window_size: int, debug: bool = False) -> List[str]:
    """Split text into passages of up to `window_size` consecutive sentences per paragraph."""
    document = _normalize_text(text)

    # Sentence-tokenize each non-empty paragraph.
    paragraphs = []
    for paragraph in document.replace("\r\n", "\n").split("\n\n"):
        if len(paragraph.strip()) > 0:
            paragraphs.append(sentencizer.tokenize(paragraph.strip()))

    # Group sentences into non-overlapping windows that never cross a paragraph boundary.
    passages = []
    for paragraph in paragraphs:
        for start_idx in range(0, len(paragraph), window_size):
            end_idx = min(start_idx + window_size, len(paragraph))
            passages.append(" ".join(paragraph[start_idx:end_idx]))
    if debug:
        tokens = word_tokenize(document)
        word_count = len([token for token in tokens if token.isalpha()])
        print("Word Count:", word_count)
        print("Paragraphs: ", len(paragraphs))
        print("Sentences: ", sum([len(p) for p in paragraphs]))
        print("Passages: ", len(passages))
        print("Window Size:", window_size)
    # Always return the passages; debug mode only adds diagnostics.
    return passages

########

########

def summarize(
    prompt: str, 
    negative_prompt: Optional[str] = None, 
    markdown_mode: Optional[bool] = False, 
) -> Optional[str]:
    """Negative prompting via dense retrieval. Returns None when markdown_mode is set."""
    start_time = time.time()
    if negative_prompt:
        positive_hits = _execute_query(prompt)
        negative_hits = _execute_query(negative_prompt)[:3] 
        summary_candidates = _remove_negative_hits(positive_hits, negative_hits)[:10]
    else: 
        summary_candidates = _execute_query(prompt)[:10]

    corpus_indices = [hit['corpus_id'] for hit in summary_candidates]
    summary_embeddings = corpus_embeddings[corpus_indices]
    cos_scores = util.cos_sim(summary_embeddings, summary_embeddings).numpy()
    centrality_scores = nx.pagerank(nx.from_numpy_array(cos_scores))
    most_central_sentence_indices = sorted(centrality_scores, key=centrality_scores.get, reverse=True)[:3]

    # Concatenate the most central passages (already limited to the top 3 above).
    summary: str = "" 
    for idx in most_central_sentence_indices:
        corpus_id = summary_candidates[idx]['corpus_id']
        passage = passages[corpus_id]
        summary += passage.strip() + " "
    end_time = time.time()
    
    if markdown_mode: 
        _markdown_summary(
            summary, title, prompt, 
            start_time, end_time, negative_prompt)
    else:  
        return summary 

def _markdown_summary(
    summary: str, 
    title: str, 
    prompt: str, 
    start_time: float,
    end_time: float,
    negative_prompt: Optional[str] = None, 
) -> None:
    display(Markdown(f"\n\n# {title}")) 
    display(Markdown("Results (after {:.2f} seconds):".format(end_time - start_time)))
    display(Markdown("\n-------------------------\n"))
    display(Markdown("<font size='4'>{}</font>".format(summary)))
    display(Markdown(f"summary size: {len(summary.split())} words"))
    display(Markdown(f"\n\n prompt: {prompt} | negative prompt: {negative_prompt}")) 

def hightlight(rules: Optional[Dict[str, str]]=None, markdown_mode: Optional[bool]= False) -> Optional[Dict[str, str]]: 
    #### SAMPLE QUERY RULES
    """rules: Dict= {
            "entity": "who was it?", 
            "event": "what happened?",
            "relation": "how about the relation between them?",
            "causation": "why did it happen?",
            "attribute": "where and when?",
            ... etc
    }"""
    #### 
    if not rules:
        raise ValueError("rules must map field names to query strings")
    start_time = time.time()
    # Keep the best-scoring passage for each field.
    highlighted_text: Dict[str, str] = {}
    for field, query in rules.items():
        hits = _execute_query(query)
        if hits:
            highlighted_text[field] = passages[hits[0]['corpus_id']]
    end_time = time.time()
    if not markdown_mode:
        return highlighted_text
    display(Markdown(f"\n\n# {title}")) 
    for field, passage in highlighted_text.items():
        display(Markdown(
            f"- <font size='4'>{field.capitalize()} Structure:</font>\n  - <font size='3'>{passage}</font>"))
    display(Markdown("Results (after {:.2f} seconds):".format(end_time - start_time)))

def _assign_pseudo_label():
    """
    // i.e. connotation component -> connotata 
    """
    raise NotImplementedError

def _execute_query(query: str) -> List[Dict[str, Any]]:
    """
    Execute a query and retrieve hits. 
    // i.e. denotation component -> denotatum 
    """
    query_embeddings = bi_encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embeddings, corpus_embeddings)
    hits = hits[0]  
    cross_inp = [[query, passages[hit['corpus_id']]] for hit in hits]
    cross_scores = cross_encoder.predict(cross_inp)
    for idx in range(len(cross_scores)):
        hits[idx]['cross-score'] = cross_scores[idx]
    hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
    return hits

def _remove_negative_hits(
    summary_candidates: List[Dict[str, Any]], 
    negative_hits: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """
    Remove overlapping hits between positive 
    and negative queries using mask.
    """
    negative_corpus_ids = {hit['corpus_id'] for hit in negative_hits}
    positive_summary_candidates = [hit for hit in summary_candidates if hit['corpus_id'] not in negative_corpus_ids]
    if not positive_summary_candidates:
        raise ValueError("No valid candidates after removing negative hits.")
    return positive_summary_candidates

def search(query: str) -> None:
    """dense-retrieval information extraction"""
    start_time = time.time()
    hits = _execute_query(query)
    end_time = time.time()

    display(Markdown(f"# Input question: {query}"))
    display(Markdown("Results (after {:.2f} seconds):".format(end_time - start_time)))
    display(Markdown("\n-------------------------\n"))
    for hit in hits[0:1]:
        display(Markdown(passages[hit['corpus_id']].replace("\n", " ")))
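
The centrality step in `summarize` relies on `nx.pagerank` over the cosine-similarity graph of the candidate passages. For readers who want to see what that score measures, below is a minimal power-iteration sketch in plain Python. It is not the networkx implementation: the damping value, the stopping rule, the function name `pagerank_sketch` and the toy similarity matrix are assumptions for illustration, and the sketch expects non-negative weights.

```python
def pagerank_sketch(weights, damping=0.85, iterations=100, tol=1e-9):
    """Power-iteration PageRank on a dense matrix of non-negative edge weights."""
    n = len(weights)
    # Total outgoing weight of each node, used to normalize its row.
    out = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for j in range(n):
            # Rank flowing into node j, proportional to the edge weights.
            # Nodes without outgoing weight are skipped (their mass is dropped here).
            incoming = sum(rank[i] * weights[i][j] / out[i] for i in range(n) if out[i] > 0)
            new_rank.append((1 - damping) / n + damping * incoming)
        converged = sum(abs(a - b) for a, b in zip(new_rank, rank)) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Hypothetical similarity matrix for three passages (symmetric, like cosine scores).
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
scores = pagerank_sketch(similarity)
order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
print(order)
```

Passages that are similar to many other candidates accumulate rank, which is why the most central passages tend to restate the main point of the retrieved set.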


In [10]:
from newspaper import Article

########  
url = "https://www.sciencenews.org/article/nasa-odysseus-moon-landing"
article = Article(url)
article.download()
article.parse()
article.nlp()
text = article.text
title = article.title
######## // 
########
# input: text (str)
passages = window_slider(text, window_size=2)
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True, show_progress_bar=True)
# output: corpus_embeddings (torch.Tensor)
######## // 
######## single document summarization with negative retrieve and rerank     
summarize(prompt=(
        f"{title}"
    ), negative_prompt=(
        f""" 
        stock,
        photograph
        images, 
        credits, 
        """
    ), markdown_mode=True)

Batches: 100%|██████████| 1/1 [00:01<00:00,  1.19s/it]




# The first U.S. lunar lander since 1972 touches down on the moon

Results (after 0.86 seconds):


-------------------------


<font size='4'>After a nail-biting descent, the United States took one small step back to the surface of the moon. A spindly robotic lander named Odysseus — designed and built by a private U.S. company — touched down near the moon’s south pole at about 6:23 p.m. Eastern time. The landing by Odysseus today has moved the United States closer to its next giant leap in space exploration. “Today, for the first time in more than a half century, the U.S. has returned to the moon,” said NASA administrator Bill Nelson during the NASA broadcast. The probe, which is carrying six NASA payloads plus a few other odds and ends, is the first U.S. vehicle to perform a controlled descent to the lunar soil since Apollo 17 landed in 1972. “I know this was a nail-biter but we are on the surface and we are transmitting and welcome to the moon,” Intuitive Machines CEO Steve Altemus said during a live NASA broadcast of the touchdown. </font>

summary size: 163 words



 prompt: The first U.S. lunar lander since 1972 touches down on the moon | negative prompt:  
        stock,
        photograph
        images, 
        credits, 
        

In [11]:
query_instructions: Dict= {
            "entity": "who was it?", 
            "event": "what happened?",
            "relation": "how about the relation between them?",
            "causation": "cause and effect, consequence, induce",
            "attribute": "why, where, when?",
}

hightlight(rules=query_instructions, markdown_mode=True)



# The first U.S. lunar lander since 1972 touches down on the moon

- <font size='4'>Entity Structure:</font>
  - <font size='3'>The probe, which is carrying six NASA payloads plus a few other odds and ends, is the first U.S. vehicle to perform a controlled descent to the lunar soil since Apollo 17 landed in 1972. “I know this was a nail-biter but we are on the surface and we are transmitting and welcome to the moon,” Intuitive Machines CEO Steve Altemus said during a live NASA broadcast of the touchdown.</font>

- <font size='4'>Event Structure:</font>
  - <font size='3'>After a nail-biting descent, the United States took one small step back to the surface of the moon. A spindly robotic lander named Odysseus — designed and built by a private U.S. company — touched down near the moon’s south pole at about 6:23 p.m. Eastern time.</font>

- <font size='4'>Relation Structure:</font>
  - <font size='3'>Communication with the spacecraft was patchy, and it was unclear immediately what shape it was in. Odysseus, which stands about 4 meters tall and 1.5 meters wide, is hauling a half dozen NASA instruments designed to demonstrate equipment for future landings and better understand the environment near the south pole in service of planned astronaut missions.</font>

- <font size='4'>Causation Structure:</font>
  - <font size='3'>“You’re gonna bring a tear to my eye,” Crain said when asked about the prospect of Odysseus’ demise during the Feb. 23 NASA briefing. But he was also jubilant when reflecting about his team’s accomplishments.</font>

- <font size='4'>Attribute Structure:</font>
  - <font size='3'>The spot is near one of several potential landing sites for future NASA astronauts. Engineers had to deal with several unexpected problems during the landing attempt, most prominently the fact that the spacecraft’s laser range finder, part of its autonomous landing system, stopped functioning.</font>

Results (after 0.74 seconds):

## **Results comparison** 

In [12]:
from summarizer import Summarizer
from summarizer.sbert import SBertSummarizer
import warnings

warnings.filterwarnings("ignore")
display(Markdown(f"\n\n# {title}")) 

# newspaper3k 
newspaper3k_summary = article.summary
display(Markdown("\n-------------------------\n"))
display(Markdown(f"## Newspaper3k\n {newspaper3k_summary}"))

# negative-retrieve-and-rerank
result = summarize(
    prompt=f"{title}", 
    negative_prompt="no quotes, images, credits, citations, bibliography, author, paper") 
display(Markdown("\n-------------------------\n"))
display(Markdown(f"## negative-retrieve-and-rerank\n {result}"))

# distilbert-base-uncased 
Bertmodel = Summarizer('distilbert-base-uncased')
Bertmodel_result = Bertmodel(text)
display(Markdown("\n-------------------------\n"))
display(Markdown(f"## distilbert-base-uncased\n {Bertmodel_result}"))

# paraphrase-MiniLM-L6-v2 
SBertmodel = SBertSummarizer('paraphrase-MiniLM-L6-v2')
SBertmodel_result = SBertmodel(text)
display(Markdown("\n-------------------------\n"))
display(Markdown(f"## paraphrase-MiniLM-L6-v2\n {SBertmodel_result}"))



# The first U.S. lunar lander since 1972 touches down on the moon


-------------------------


## Newspaper3k
 After a nail-biting descent, the United States took one small step back to the surface of the moon.
The probe, which is carrying six NASA payloads plus a few other odds and ends, is the first U.S. vehicle to perform a controlled descent to the lunar soil since Apollo 17 landed in 1972.
The telescope, named ILO-X, expects to take scientific images of the Milky Way from the lunar surface that will be used by researchers to study our galaxy.
The Intuitive Machines venture is part of NASA’s Commercial Lunar Payload Services program, wherein the agency hires companies to scout the moon in support of the Artemis lunar program (SN: 11/16/22).
Under Artemis, NASA aims to reestablish a human presence on the moon, with the first crewed landing no earlier than late 2026.


-------------------------


## negative-retrieve-and-rerank
 The landing by Odysseus today has moved the United States closer to its next giant leap in space exploration. “Today, for the first time in more than a half century, the U.S. has returned to the moon,” said NASA administrator Bill Nelson during the NASA broadcast. After a nail-biting descent, the United States took one small step back to the surface of the moon. A spindly robotic lander named Odysseus — designed and built by a private U.S. company — touched down near the moon’s south pole at about 6:23 p.m. Eastern time. The probe, which is carrying six NASA payloads plus a few other odds and ends, is the first U.S. vehicle to perform a controlled descent to the lunar soil since Apollo 17 landed in 1972. “I know this was a nail-biter but we are on the surface and we are transmitting and welcome to the moon,” Intuitive Machines CEO Steve Altemus said during a live NASA broadcast of the touchdown. 


-------------------------


## distilbert-base-uncased
 After a nail-biting descent, the United States took one small step back to the surface of the moon. “I know this was a nail-biter but we are on the surface and we are transmitting and welcome to the moon,” Intuitive Machines CEO Steve Altemus said during a live NASA broadcast of the touchdown. “ Data collected after the touchdown suggests the lander ended up tipped on its side as it sat on the lunar surface with its solar arrays deployed and its battery charged to 100 percent, Altemus said during a NASA news conference February 23. Odysseus’ destination was a flat region near the Malapert A crater, about 300 kilometers from the moon’s south pole. Odysseus, which stands about 4 meters tall and 1.5 meters wide, is hauling a half dozen NASA instruments designed to demonstrate equipment for future landings and better understand the environment near the south pole in service of planned astronaut missions. But most of the payloads are on the sides of the lander that are facing up, the team said, and the instruments appear to be operational and able to send information back to Earth. It’s been more than 50 years since astronaut Eugene Cernan left the last U.S. footprints on the moon. But he was also jubilant when reflecting about his team’s accomplishments.


-------------------------


## paraphrase-MiniLM-L6-v2
 After a nail-biting descent, the United States took one small step back to the surface of the moon. A spindly robotic lander named Odysseus — designed and built by a private U.S. company — touched down near the moon’s south pole at about 6:23 p.m. Eastern time. “I know this was a nail-biter but we are on the surface and we are transmitting and welcome to the moon,” Intuitive Machines CEO Steve Altemus said during a live NASA broadcast of the touchdown. “ Houston, Odysseus has found its new home.” The payloads will test precision landing technologies, try out a new way of knowing how much lander fuel is left, investigate the radio environment near the moon’s surface, and plop a set of retroreflectors on the ground that will serve as a permanent location marker. The telescope, named ILO-X, expects to take scientific images of the Milky Way from the lunar surface that will be used by researchers to study our galaxy. Engineers are still hopeful that they’ll be able to fire it away from the spacecraft and take images at a later date. But he was also jubilant when reflecting about his team’s accomplishments.

## References and Citations

@architectures<br>
normalizer: "BertNormalizer",<br>
sentencizer: "nltk-punkt-sent-tokenizer",<br>
bi encoder: "multi-qa-MiniLM-L6-cos-v1",<br>
cross encoder: "cross-encoder/ms-marco-MiniLM-L-6-v2",<br>
centrality scorer: {"PageRank"<br>
<span style="display: inline-block; margin-left: 20px;">
  title = "The Anatomy of a Large-Scale Hypertextual Web Search Engine", <br>
  author = "Sergey Brin and Lawrence Page", <br>
  year = "1998",<br>
  publisher = "Stanford University",<br>
}
</span>

@inproceedings{reimers-2019-sentence-bert,<br>
<span style="display: inline-block; margin-left: 20px;">
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",<br>
  author = "Reimers, Nils and Gurevych, Iryna",<br>
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",<br>
  month = "11",<br>
  year = "2019",<br>
  publisher = "Association for Computational Linguistics",<br>
  url = "http://arxiv.org/abs/1908.10084",<br>
}
</span>