# PDF Parsing and Cleaning

This notebook demonstrates how to parse and clean text from a PDF document using Python libraries such as `PyMuPDF` and `nltk`.

## Step 1: Install Required Libraries

First, install the necessary libraries. Run the following command in a Jupyter Notebook cell, or drop the leading `!` to run it in your terminal:

```python
!pip install PyMuPDF nltk
```




Next, we will import the required libraries.

In [1]:
import fitz  # PyMuPDF
import re
import json
import nltk
from nltk.tokenize import sent_tokenize


## Loading and Parsing PDF

In [2]:
# Specify the path to your PDF file
pdf_path = 'llms.pdf'

# Open the PDF document
pdf_document = fitz.open(pdf_path)

# Check the number of pages
num_pages = pdf_document.page_count
print(f'The PDF document has {num_pages} pages.')


The PDF document has 15 pages.
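
The cell above keeps the document open for the later cells and assumes the file exists. As a minimal sketch of a more defensive variant, the hypothetical helper `iter_page_text` below (not part of PyMuPDF) checks the path first and uses the document as a context manager so it is closed automatically. The lazy `import fitz` is an assumption made here so that the path check can run without PyMuPDF installed.

```python
from pathlib import Path


def iter_page_text(pdf_path):
    """Yield (page_number, text) pairs, numbering pages from 1."""
    path = Path(pdf_path)
    if not path.is_file():
        # Fail early with a clear message instead of a library-specific error
        raise FileNotFoundError(f"No PDF found at {path}")

    # Imported here so the path check above works without PyMuPDF installed
    import fitz  # PyMuPDF

    # The context manager closes the document even if extraction fails
    with fitz.open(str(path)) as doc:
        for page in doc:
            yield page.number + 1, page.get_text()
```

Because this is a generator, nothing is opened until iteration starts, for example with `for number, text in iter_page_text('llms.pdf'): ...`.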


## Extracting Text Page by Page

In [3]:
# Extract text from each page and print it
for page_num in range(num_pages):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    print(f'--- Page {page_num + 1} ---')
    print(text)
    print('\n')


--- Page 1 ---
This is an initiative aiming to combat misinformation in the age
of LLMs
(Correspondence to: Kai Shu)
(New Preprint) Can Knowledge Editing Really Correct Hallucinations?
- We proposed HalluEditBench to holistically benchmark knowledge
editing methods in correcting real-world hallucinations on five
dimensions including Efficacy, Generalization, Portability, Locality, and
Robustness. We find their effectiveness could be far from what their
performance on existing datasets suggests, and the performance
beyond Efficacy for all methods is generally unsatisfactory.
(New Preprint) Can Editing LLMs Inject Harm?
- We propose to reformulate knowledge editing as a new type of safety
threat for LLMs, namely Editing Attack, and discover its emerging risk
of injecting misinformation or bias into LLMs stealthily, indicating the
feasibility of disseminating misinformation or bias with LLMs as new
channels.
(SIGKDD Explorations 2024) Authorship Attribution in the Era of LLMs:
Problems, M
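
The raw output above keeps the line wraps from the PDF layout, so sentences are broken in the middle. One possible cleaning step, sketched here under stated assumptions, joins wrapped lines while keeping blank lines as paragraph breaks. `clean_page_text` is a hypothetical helper, and its hyphen rule is a heuristic: it would also merge genuine compounds such as "real-world" if they happen to break across lines.

```python
import re


def clean_page_text(text):
    """Normalize whitespace in text extracted from one PDF page."""
    # Rejoin words hyphenated across a line break: "knowl-\nedge" -> "knowledge"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks and protect them with a marker
    text = re.sub(r"\n\s*\n", "\x00", text)
    # Remaining newlines are layout wraps, so collapse all whitespace to spaces
    text = re.sub(r"\s+", " ", text)
    # Restore paragraph breaks and drop empty paragraphs
    paragraphs = [p.strip() for p in text.split("\x00")]
    return "\n\n".join(p for p in paragraphs if p)
```

It could be applied to the result of `page.get_text()` before any further splitting.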

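## Parsing Text Paragraph-wise

### Defining a Function to Split Text into Paragraph-Sized Chunks

The function below approximates paragraphs with fixed-size chunks: it splits the text on whitespace and groups the tokens into chunks of at most `threshold` tokens (200 by default). The rest of the notebook refers to these chunks as paragraphs.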

In [4]:
def extract_text_by_token_count(text, threshold=200):
    """Split text into chunks of at most `threshold` whitespace-separated tokens."""
    # Split text by whitespace to count tokens
    tokens = re.findall(r'\S+', text)
    chunks = []
    current_chunk = []
    current_token_count = 0

    for token in tokens:
        if current_token_count + 1 > threshold:
            # If adding this token exceeds the threshold, save the current chunk and start a new one
            chunks.append(' '.join(current_chunk))
            current_chunk = [token]
            current_token_count = 1
        else:
            # Otherwise, continue adding to the current chunk
            current_chunk.append(token)
            current_token_count += 1

    # Append the last chunk if it exists
    if current_chunk:
        chunks.append(' '.join(current_chunk))

    return chunks
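
Because the function above cuts purely by token count, a chunk can end in the middle of a sentence. The sketch below shows one way to pack whole sentences into chunks instead. Both function names are hypothetical, and the regex splitter is a deliberately naive stand-in for `sent_tokenize`, which the notebook imports but which needs NLTK's tokenizer data to be downloaded first.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    # Stand-in for nltk's sent_tokenize, which handles abbreviations better
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_by_sentences(text, threshold=200):
    """Pack whole sentences into chunks of at most `threshold` tokens.

    A single sentence longer than `threshold` becomes its own oversized chunk.
    """
    chunks = []
    current = []  # tokens collected for the chunk being built
    for sentence in split_sentences(text):
        tokens = sentence.split()
        # Close the current chunk when this sentence would overflow it
        if current and len(current) + len(tokens) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.extend(tokens)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_by_sentences("One two three. Four five. Six seven eight nine.", threshold=5))
```

With a threshold of 5, the first two sentences (3 + 2 tokens) fit in one chunk and the third starts a new one.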



### Extracting and Printing Paragraphs from Each Page

In [5]:
for page_num in range(num_pages):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    paragraphs = extract_text_by_token_count(text)
    print(f'--- Page {page_num + 1} ---')
    for para_num, paragraph in enumerate(paragraphs, start=1):
        print(f'Paragraph {para_num}:')
        print(paragraph)
        print('\n')


--- Page 1 ---
Paragraph 1:
This is an initiative aiming to combat misinformation in the age of LLMs (Correspondence to: Kai Shu) (New Preprint) Can Knowledge Editing Really Correct Hallucinations? - We proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. We find their effectiveness could be far from what their performance on existing datasets suggests, and the performance beyond Efficacy for all methods is generally unsatisfactory. (New Preprint) Can Editing LLMs Inject Harm? - We propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and discover its emerging risk of injecting misinformation or bias into LLMs stealthily, indicating the feasibility of disseminating misinformation or bias with LLMs as new channels. (SIGKDD Explorations 2024) Authorship Attribution in the Era of LLMs

## Additional Parsing and Analysis
### Saving Extracted Text to a File

In [6]:

# Collect the chunks from every page into a list of records, tagged with page and paragraph numbers
extracted_data = []

for page_num in range(num_pages):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    chunks = extract_text_by_token_count(text)
    page_data = {
        'page_number': page_num + 1,
        'chunks': [{'paragraph': chunk_num + 1, 'text': chunk} for chunk_num, chunk in enumerate(chunks)]
    }
    extracted_data.append(page_data)



In [7]:
# Save the extracted data to a JSON file
with open('extracted_data.json', 'w', encoding='utf-8') as f:
    json.dump(extracted_data, f, ensure_ascii=False, indent=4)

# Close the PDF now that extraction is finished
pdf_document.close()
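
To check what was written, the JSON can be loaded back and flattened into one row per chunk, which is often easier to inspect or pass to later steps. The sketch below is illustrative only: `flatten_records` is a hypothetical helper, and it runs on made-up sample data with the same schema, written to a temporary file, instead of the real `extracted_data.json`.

```python
import json
import tempfile
from pathlib import Path


def flatten_records(extracted_data):
    """Turn the nested per-page structure into one flat row per chunk."""
    return [
        {
            "page_number": page["page_number"],
            "paragraph": chunk["paragraph"],
            "text": chunk["text"],
        }
        for page in extracted_data
        for chunk in page["chunks"]
    ]


# Made-up sample data with the same schema as extracted_data.json
sample = [
    {"page_number": 1, "chunks": [{"paragraph": 1, "text": "first chunk"},
                                  {"paragraph": 2, "text": "second chunk"}]},
    {"page_number": 2, "chunks": [{"paragraph": 1, "text": "third chunk"}]},
]

# Round-trip through a temporary file to mimic saving and reloading
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sample_extracted_data.json"
    json_path.write_text(json.dumps(sample, ensure_ascii=False, indent=4), encoding="utf-8")
    reloaded = json.loads(json_path.read_text(encoding="utf-8"))

rows = flatten_records(reloaded)
print(len(rows))  # one row per chunk across all pages
```

For the real file, the same function could be applied to the result of `json.load` on `extracted_data.json`.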
