# Filtering and Getting Matched Context

This notebook demonstrates how to filter the text extracted from a PDF document down to the paragraphs that match a question, and how to pass that matched context to an LLM to answer the question.

## Step 1: Install Required Libraries

First, install the necessary libraries. Run the following command in a Jupyter Notebook cell (drop the leading `!` to run it in a terminal):

```python
!pip install nltk scikit-learn openai
```

## Step 2: Import Libraries

Next, we will import the required libraries.

In [1]:
import json
import re
import string
from typing import List, Dict

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Optional: install PyMuPDF if you haven't already (it is not imported in this notebook)
# !pip install pymupdf

# Ensure the punkt tokenizer and the stopword list are downloaded
nltk.download('punkt')
nltk.download('stopwords')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\shuklajiwank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\shuklajiwank\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Step 3: Load Tokenized Sentences

We assume you have already extracted the text using the previous notebook and saved it as `extracted_data.json`. Here, we load that file: a list of pages, each holding numbered paragraph chunks.

In [2]:
# Load the JSON data from a file
with open('extracted_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

# Display the loaded JSON data (optional for verification)
print(json.dumps(data, indent=4))



[
    {
        "page_number": 1,
        "chunks": [
            {
                "paragraph": 1,
                "text": "This is an initiative aiming to combat misinformation in the age of LLMs (Correspondence to: Kai Shu) (New Preprint) Can Knowledge Editing Really Correct Hallucinations? - We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. We find their effectiveness could be far from what their performance on existing datasets suggests, and the performance beyond Efficacy for all methods is generally unsatisfactory. (New Preprint) Can Editing LLMs Inject Harm? - We propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and discover its emerging risk of injecting misinformation or bias into LLMs stealthily, indicating the feasibility of disseminating misinformation or bia

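Before filtering, it can help to confirm that the loaded data has the layout shown above. The following is a minimal sketch, not part of the original notebook: `validate_pages` is a hypothetical helper, and the sample list is made up to mirror the `page_number` / `chunks` / `paragraph` / `text` keys seen in the output.

```python
from typing import Any, Dict, List


def validate_pages(data: List[Dict[str, Any]]) -> int:
    """Check the page/chunk layout and return the total number of chunks."""
    total_chunks = 0
    for page in data:
        # Each page needs a page number and a list of chunks
        if "page_number" not in page or "chunks" not in page:
            raise ValueError(f"Malformed page entry: {page!r}")
        for chunk in page["chunks"]:
            # Each chunk needs a paragraph number and its text
            if "paragraph" not in chunk or "text" not in chunk:
                raise ValueError(
                    f"Malformed chunk on page {page['page_number']}"
                )
            total_chunks += 1
    return total_chunks


# Made-up sample in the same layout as extracted_data.json
sample = [
    {"page_number": 1, "chunks": [{"paragraph": 1, "text": "Example text."}]}
]
print(validate_pages(sample))
```

Calling `validate_pages(data)` on the real file would raise early if the extraction step produced an unexpected structure.
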
## Step 4: Define a Function to Filter Relevant Context

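Now, we define two functions to filter the relevant context for a given question. `calculate_cosine_similarity` scores a paragraph against the question using TF-IDF vectors and cosine similarity, and `filter_content_by_similarity` keeps the paragraphs whose score meets a threshold (0.5 by default).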


In [3]:
def calculate_cosine_similarity(question: str, text: str) -> float:
    """
    Calculate cosine similarity between the question and the text.
    
    Args:
    - question (str): The question to compare.
    - text (str): The text to compare.
    
    Returns:
    - float: The cosine similarity between the question and the text.
    """
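    # Initialize TF-IDF Vectorizer
    vectorizer = TfidfVectorizer(stop_words='english')

    # Vectorize the question and the text; fit_transform raises ValueError
    # when both inputs are empty or contain only stop words
    try:
        tfidf_matrix = vectorizer.fit_transform([question, text])
    except ValueError:
        return 0.0

    # Calculate cosine similarity between row 0 (question) and row 1 (text)
    cosine_sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
    return float(cosine_sim[0][0])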
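
The function above fits a new TF-IDF model for every question/paragraph pair, so the inverse-document-frequency weights only ever see two documents. One possible alternative, sketched here under the assumption that all paragraph texts are available as a list, is to fit a single vectorizer over every paragraph plus the question and score them in one call. `rank_texts_by_similarity` is a hypothetical name, and the two sample texts are made up for illustration.

```python
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_texts_by_similarity(question: str, texts: List[str]) -> List[float]:
    """Score every text against the question using one shared TF-IDF model."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on the texts plus the question so all vectors share one vocabulary
    matrix = vectorizer.fit_transform(texts + [question])
    question_vec = matrix[-1:]  # last row holds the question
    text_vecs = matrix[:-1]     # all other rows hold the texts
    # cosine_similarity returns an array of shape (1, len(texts))
    scores = cosine_similarity(question_vec, text_vecs)[0]
    return [float(s) for s in scores]


# Made-up sample texts
texts = [
    "LLMs can help detect misinformation.",
    "The weather was sunny in the afternoon.",
]
scores = rank_texts_by_similarity("How can LLMs fight misinformation?", texts)
print(scores)
```

Scores from a shared model are not directly comparable to the pairwise scores, so a threshold such as 0.5 would likely need to be re-tuned.
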


In [4]:
def filter_content_by_similarity(data: List[Dict], question: str, threshold: float = 0.5) -> List[Dict]:
    """
    Filter the JSON content based on cosine similarity with the question.
    
    Args:
    - data (List[Dict]): The JSON data.
    - question (str): The question to filter content against.
    - threshold (float): The similarity threshold for filtering.
    
    Returns:
    - List[Dict]: Filtered content with relevant paragraphs having cosine similarity above the threshold.
    """
    filtered_data = []

    for page in data:
        page_number = page['page_number']
        relevant_chunks = []
        for chunk in page['chunks']:
            similarity = calculate_cosine_similarity(question, chunk['text'])
            if similarity >= threshold:
                relevant_chunks.append({
                    'paragraph': chunk['paragraph'],
                    'text': chunk['text'],
                    'similarity': similarity
                })
        
        if relevant_chunks:
            filtered_page = {
                'page_number': page_number,
                'chunks': relevant_chunks
            }
            filtered_data.append(filtered_page)
    
    return filtered_data

# Example usage: Filter content based on cosine similarity with the question
question = "How can LLMs help in combating misinformation?"
filtered_data = filter_content_by_similarity(data, question)

# Display the filtered data (optional for verification)
print(json.dumps(filtered_data, indent=4))


[
    {
        "page_number": 3,
        "chunks": [
            {
                "paragraph": 1,
                "text": "Abstract Misinformation such as fake news and rumors is a serious threat to information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities. Thus, one emergent question is: can we utilize LLMs to combat misinformation? On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale. Then, another important question is: how to combat LLM-generated misinformation? In this paper, we first systematically review the history of combating misinformation before the advent of LLMs. Then we illustrate the current 

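With a fixed threshold, a question may match no paragraph at all, which would leave the filtered result empty. A minimal sketch of one fallback follows: keep everything at or above the threshold, and otherwise return the `k` best-scoring chunks. `select_chunks` is a hypothetical helper, and it assumes each chunk dictionary already carries a `similarity` value as in the filtered output above; the sample scores are made up.

```python
from typing import Dict, List


def select_chunks(
    scored_chunks: List[Dict], threshold: float = 0.5, k: int = 2
) -> List[Dict]:
    """Keep chunks at or above the threshold, else fall back to the top k."""
    above = [c for c in scored_chunks if c["similarity"] >= threshold]
    if above:
        return above
    # Nothing passed the threshold: rank by similarity and keep the best k
    ranked = sorted(scored_chunks, key=lambda c: c["similarity"], reverse=True)
    return ranked[:k]


# Made-up scores where no chunk reaches the 0.5 threshold
scored = [
    {"paragraph": 1, "text": "a", "similarity": 0.12},
    {"paragraph": 2, "text": "b", "similarity": 0.31},
    {"paragraph": 3, "text": "c", "similarity": 0.05},
]
selected = select_chunks(scored)
print([c["paragraph"] for c in selected])
```
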
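## Step 5: Preparing Context for LLM

Next, we concatenate the filtered chunks into a single context string, labeled by page and paragraph number.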

In [5]:
def prepare_context(filtered_data: List[Dict]) -> str:
    """
    Prepare context for LLM question answering by concatenating relevant text chunks.
    
    Args:
    - filtered_data (List[Dict]): The filtered content data.
    
    Returns:
    - str: The concatenated context for LLM.
    """
    context = ""
    
    for page in filtered_data:
        page_number = page['page_number']
        context += f'--- Page {page_number} ---\n'
        
        for chunk in page['chunks']:
            paragraph_number = chunk['paragraph']
            text = chunk['text']
            context += f'Paragraph {paragraph_number}:\n{text}\n\n'
    
    return context

# Prepare the context from the filtered data
context = prepare_context(filtered_data)

# Display the context (optional for verification)
print(context)


--- Page 3 ---
Paragraph 1:
Abstract Misinformation such as fake news and rumors is a serious threat to information ecosystems and public trust. The emergence of Large Language Models (LLMs) has great potential to reshape the landscape of combating misinformation. Generally, LLMs can be a double-edged sword in the fight. On the one hand, LLMs bring promising opportunities for combating misinformation due to their profound world knowledge and strong reasoning abilities. Thus, one emergent question is: can we utilize LLMs to combat misinformation? On the other hand, the critical challenge is that LLMs can be easily leveraged to generate deceptive misinformation at scale. Then, another important question is: how to combat LLM-generated misinformation? In this paper, we first systematically review the history of combating misinformation before the advent of LLMs. Then we illustrate the current efforts and present an outlook for these two fundamental questions respectively. The goal of this
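
When many paragraphs pass the filter, the concatenated context can grow larger than is practical to send in one prompt. The sketch below trims the context at the last paragraph break that fits a character budget. `truncate_context` is a hypothetical helper and the default `max_chars` value is an arbitrary illustration, not a limit taken from the original notebook or from any model.

```python
def truncate_context(context: str, max_chars: int = 8000) -> str:
    """Cut the context at the last paragraph break that fits in max_chars."""
    if len(context) <= max_chars:
        return context
    # Look for the last blank-line separator inside the budget
    cut = context.rfind("\n\n", 0, max_chars)
    if cut == -1:
        # No paragraph break found: fall back to a hard cut
        cut = max_chars
    return context[:cut]


# Made-up context in the format produced by prepare_context
sample_context = "Paragraph 1:\naaaa\n\nParagraph 2:\nbbbb\n\n"
print(truncate_context(sample_context, max_chars=25))
```

Because `prepare_context` separates paragraphs with a blank line, cutting at `"\n\n"` keeps whole paragraphs rather than ending mid-sentence.
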

## Step 6: Question Answering with LLM

Finally, we send the question together with the prepared context to an LLM and print its answer.

In [6]:
# Use an LLM through the OpenAI-compatible client
import os

from openai import OpenAI



In [7]:
def get_answer_from_llm(question: str, context: str) -> str:
    """
    Get an answer from the LLM given a question and context.
    
    Args:
    - question (str): The question to ask the LLM.
    - context (str): The context to provide to the LLM.
    
    Returns:
    - str: The answer from the LLM.
    """
    # Read the API token from the environment instead of hard-coding it
    # (assumes the token has been exported as HF_TOKEN)
    client = OpenAI(
        base_url="https://huggingface.co/api/inference-proxy/together",
        api_key=os.environ["HF_TOKEN"],
    )


    messages = [
        {
            "role": "user",
            "content": f"You are a question-answering bot. Answer the following question based on the context below.\nQuestion: {question}\nContext:\n{context}"
        }
    ]

    completion = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1",
        messages=messages,
        max_tokens=5000
    )
    
    return completion.choices[0].message.content

# Example question
question = "How can LLMs help in combating misinformation?"

# Get the answer from the LLM
answer = get_answer_from_llm(question, context)

# Display the answer
print(f'Question: {question}')
print(f'Answer: {answer}')

Question: How can LLMs help in combating misinformation?
Answer: <think>
Okay, so I need to figure out how LLMs can help combat misinformation based on the given context. Let's start by reading through the context carefully. The context is from page 3 of a paper by Chen and Shu from 2024. The first paragraph talks about how LLMs are a double-edged sword—they can help fight misinformation but also generate it. The main questions are whether we can use LLMs to combat misinformation and how to deal with the misinformation they might create.

The user's question is specifically asking about how LLMs can help, so I should focus on the opportunities mentioned. The context says LLMs have "profound world knowledge and strong reasoning abilities," which are good for combating misinformation. The paper also mentions that they review past efforts before LLMs and discuss current efforts and outlooks for the two questions. The goal is to promote using LLMs against misinformation and call for interd
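
The answer above begins with a `<think>` block containing the model's reasoning rather than the final answer. A minimal sketch for removing it before display is shown here, assuming the reasoning is closed by a matching `</think>` tag (the closing tag is not visible in the truncated output above). `strip_reasoning` is a hypothetical helper, and the sample string is made up.

```python
import re


def strip_reasoning(answer: str) -> str:
    """Remove any <think>...</think> block and surrounding whitespace."""
    # DOTALL lets '.' match the newlines inside the reasoning trace
    cleaned = re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL)
    return cleaned.strip()


# Made-up response in the same shape as the output above
raw = "<think>\nLet me read the context first.\n</think>\n\nLLMs can help."
print(strip_reasoning(raw))
```

If the response is cut off before the closing tag, for example by `max_tokens`, the pattern does not match and the text is returned unchanged apart from whitespace trimming.
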