### STARTUP
To turn a PDF's text into a hierarchy of topics, you only need to initialize a Deconstructor object
with the relative path to the PDF. The file is loaded and parsed when the object is created.

In [1]:
from destructor import Deconstructor as pdfdestructor

# load the pdf into memory and parse it (optionally set the start page and the initial heading level)
# ! Due to a bug, you must manually specify the level of the document's highest heading (title is 1, heading 1 is 2, etc.)
text = pdfdestructor("test.pdf", start_page = 0, heading_level = 1)

In [2]:
# the hierarchy looks like this
# each element is a dict with 4 keys: title, content, subsections, and links
print(text.content)

[{'title': 'Heading 1', 'content': '\nHi, im the first header', 'subsections': [{'title': 'Heading 2', 'content': '\nHi, im the 2[nd] header\n[https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)\n[https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)\ngoodbye', 'subsections': [{'title': 'Heading 3', 'content': '\nHello, I am the 3[rd] header\nSee ya', 'subsections': [], 'links': []}], 'links': ['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)']}, {'title': 'Heading 2', 'content': '\nThe second 2[nd] header\nBye', 'subsections': [], 'links': []}], 'links': ['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8q

### Contents
To access the content of a topic, no matter how deeply it is nested, read its "content" key.

In [3]:
# content of the first heading 1 
print(text.content[0]["content"])


Hi, im the first header
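
Reaching a deeply nested topic means chaining `["subsections"][i]` lookups. The sketch below wraps that chain in a small helper. The name `get_topic` and the `sample` data are hypothetical (they are not part of the destructor module); the sample only mirrors the dict layout printed in cell 2.

```python
# Hypothetical sample that mirrors the structure printed by text.content.
sample = [
    {
        "title": "Heading 1",
        "content": "\nHi, im the first header",
        "subsections": [
            {
                "title": "Heading 2",
                "content": "\nHi, im the 2[nd] header",
                "subsections": [
                    {
                        "title": "Heading 3",
                        "content": "\nHello, I am the 3[rd] header\nSee ya",
                        "subsections": [],
                        "links": [],
                    }
                ],
                "links": [],
            }
        ],
        "links": [],
    }
]


def get_topic(sections, path):
    """Follow a list of indices down through nested 'subsections' lists."""
    topic = None
    for index in path:
        topic = sections[index]          # pick the topic at this level
        sections = topic["subsections"]  # then descend into its children
    return topic


# Path [0, 0, 0]: first heading 1 -> its first heading 2 -> its first heading 3
print(get_topic(sample, [0, 0, 0])["title"])  # Heading 3
```

With a real document, the same helper would be called as `get_topic(text.content, [0, 0, 0])`, assuming `text.content` has the layout shown above.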


### SUBCONTENTS
Everything in this hierarchy is nested inside something else, except the top-level headings.



In [4]:
# here are the top level headings
for elem in text.content:
    print(elem["title"])

Heading 1
Heading 1
Heading 1


In [5]:
# and here are the elements inside the first heading 1
# TODO: consider adding a tree function that displays the entire hierarchy, including every nested subsection
for elem in text.content[0]["subsections"]:
    print(elem["title"])

Heading 2
Heading 2
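
Since every topic stores its children under "subsections", the whole hierarchy can be walked recursively. The following is a minimal sketch of such a tree printer, not a function provided by the module. `render_tree` and the `sample` data are hypothetical names, and the sample only imitates the dict layout shown earlier.

```python
# Hypothetical sample with the same keys as the elements of text.content.
sample = [
    {
        "title": "Heading 1",
        "content": "",
        "subsections": [
            {
                "title": "Heading 2",
                "content": "",
                "subsections": [
                    {"title": "Heading 3", "content": "", "subsections": [], "links": []}
                ],
                "links": [],
            },
            {"title": "Heading 2", "content": "", "subsections": [], "links": []},
        ],
        "links": [],
    }
]


def render_tree(sections, depth=0):
    """Return one indented line per topic, visiting the hierarchy depth-first."""
    lines = []
    for section in sections:
        lines.append("  " * depth + section["title"])
        # Children are rendered one level deeper than their parent.
        lines.extend(render_tree(section["subsections"], depth + 1))
    return lines


print("\n".join(render_tree(sample)))
# Heading 1
#   Heading 2
#     Heading 3
#   Heading 2
```

Returning a list of lines instead of printing directly keeps the helper easy to test and lets the caller decide how to display the result.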


### LINKS
Accessing links is easy. If header 1 contains a header 2 that contains a link, the link can be accessed via text.content[header1_id]["subsections"][header2_id]["links"]
or via text.content[header1_id]["links"].


In [6]:
# ! Link parsing still needs a touch up: the entries below keep markdown residue and appear more than once
# access the links inside header 2
print(text.content[0]["subsections"][0]["links"])

['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)']


In [7]:
# and access the same links via header 1
print(text.content[0]["links"])

['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)']


In [8]:
# all the links in a document can be accessed via the Deconstructor object's links attribute, which holds the "global" links
print(text.links)

['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)', 'https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk']
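
The outputs above show two symptoms of the unfinished link parsing: entries shaped like `url](url)` and repeated entries. Until that is fixed in the parser, the lists can be cleaned after the fact. The sketch below is one possible workaround, assuming the residue always has the `url](url)` shape seen here; `clean_links` is a hypothetical helper, not part of the module.

```python
def clean_links(raw_links):
    """Strip 'url](url)' residue and drop duplicates, keeping first-seen order."""
    cleaned = []
    for raw in raw_links:
        # Keep only the part before the first "](" if the residue is present.
        url = raw.split("](", 1)[0].strip()
        if url and url not in cleaned:
            cleaned.append(url)
    return cleaned


# The same three entries printed by text.links above.
raw = [
    "https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)",
    "https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk](https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk)",
    "https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk",
]
print(clean_links(raw))
# ['https://youtu.be/GJDNkVDGM_s?si=8qFrQmmqy0yTGnyk']
```

Note that dropping duplicates also hides how many times a link appears in the document; skip that step if the count matters.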


### Misc
A few leftover methods are either deprecated or very specific, but could still be useful.

In [9]:
help(pdfdestructor.feed_links)
print("-----------------------")
help(pdfdestructor.feed_pages)
print("-----------------------")
help(pdfdestructor.feed_topic)

Help on function feed_links in module destructor:

feed_links(self) -> collections.abc.Generator[str]
    Provide a link for web scraping
    
    Yields:
        Generator[str]: link for web scraper

-----------------------
Help on function feed_pages in module destructor:

feed_pages(self) -> collections.abc.Generator[str]
    Provide a page for LLM processing
    
    Yields:
        Generator[str]: text for the LLM

-----------------------
Help on function feed_topic in module destructor:

feed_topic(self, max_lines: int) -> collections.abc.Generator[str]
    Provide a topic for LLM processing. If a topic is too big,
    it will get split into separate chunks
    
    Args:
        max_lines (int): number of maximum lines allowed
    
    Yields:
        str: text for LLM
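
The feed_topic docstring says that a topic that is too big gets split into separate chunks, bounded by max_lines. The module's actual splitting logic is not shown here, so the sketch below only illustrates the general idea of line-based chunking with a generator. `chunk_by_lines` is a hypothetical function and may differ from what feed_topic really does.

```python
def chunk_by_lines(text, max_lines):
    """Yield pieces of `text`, each holding at most `max_lines` lines."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    lines = text.splitlines()
    # Step through the lines in windows of max_lines.
    for start in range(0, len(lines), max_lines):
        yield "\n".join(lines[start:start + max_lines])


topic_text = "line 1\nline 2\nline 3\nline 4\nline 5"
for chunk in chunk_by_lines(topic_text, max_lines=2):
    print(repr(chunk))
# 'line 1\nline 2'
# 'line 3\nline 4'
# 'line 5'
```

A generator fits this use because each chunk can be handed to the LLM as soon as it is produced, without building the full list first.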

