# Introduction

Let's learn how to ingest a PDF file into Weaviate using unstructured!

# Data

Here's a sample of the data:

![image.png](./pdf01_snippet.png)

# The Basics

You can convert a PDF to text with a single function call:

In [1]:
from unstructured.partition.pdf import partition_pdf

In [2]:
elements = partition_pdf(filename="../data/paper01.pdf")

In [3]:
for elem in elements[:10]:
    print(elem)

A survey on Image Data Augmentation for Deep Learning
Connor Shorten* and Taghi M. Khoshgoftaar
*Correspondence: cshorten2015@fau.edu Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, USA
Abstract
Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data. Unfor
tunately, many application domains do not have access to big data, such as medical image analysis. This survey focuses on Data Augmentation, a data
space solution to the problem of limited data. Data Augmentation encompasses a suite of techniques that enhance the size and quality of training datasets such that better Deep Learning models can be built using them. The image augmentation algorithms discussed in this su

Each element has a category, and titles have their own: `Title`.

Here are all the titles that unstructured was able to find:

In [4]:
titles = [elem for elem in elements if elem.category == "Title"]

for title in titles:
    print(title.text)

A survey on Image Data Augmentation for Deep Learning
Abstract
Introduction
Background
Image Data Augmentation techniques
Data Augmentations based on basic image manipulations
Flipping
Color space
Cropping
Rotation
Translation
Noise injection
Color space transformations
Geometric versus photometric transformations
Kernel filters
Mixing images
Random erasing
A note on combining augmentations
Data Augmentations based on Deep Feature space augmentation
Data Augmentations based on Deep Learning
Feature space augmentation
Adversarial training
GAN‑based Data Augmentation
Generated images
Neural Style Transfer
Meta learning Data Augmentations
Comparing Augmentations
Design considerations for image Data Augmentation
Test-time augmentation
Curriculum learning
Resolution impact
Final dataset size
Alleviating class imbalance with Data Augmentation
Discussion
Future work
Conclusion
Abbreviations
Acknowledgements
Authors’ contributions
Funding
References
Publisher’s Note


It was able to recognize most of the titles, but not all of them. 

Examples of missing titles:

* Availability of data and materials
* Competing interests
* Consent for publication
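
One possible workaround, sketched below, is to compare the text of non-title elements against a list of headings you expect to see. This is only an illustration: `FakeElement` is a hypothetical stand-in for the elements returned by `partition_pdf` (it only mimics the `category` and `text` attributes used above), and `EXPECTED_HEADINGS` is an assumed, hand-written list.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured element (category + text only)."""
    category: str
    text: str


# Assumption: headings we expect in the paper that were not tagged as Title.
EXPECTED_HEADINGS = {
    "availability of data and materials",
    "competing interests",
    "consent for publication",
}


def find_untagged_headings(elements, expected=EXPECTED_HEADINGS):
    """Return the text of non-Title elements that exactly match an expected heading."""
    return [
        elem.text
        for elem in elements
        if elem.category != "Title" and elem.text.strip().lower() in expected
    ]


# Made-up sample elements, only to exercise the function.
sample = [
    FakeElement("Title", "Funding"),
    FakeElement("NarrativeText", "Competing interests"),
    FakeElement("NarrativeText", "Some body text that mentions competing interests."),
]

print(find_untagged_headings(sample))
```

The match is exact (after lowercasing and stripping), so body text that merely mentions a heading is not picked up.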

The body text that follows a title is stored in `NarrativeText` elements:

In [5]:
import textwrap

narrative_texts = [elem for elem in elements if elem.category == "NarrativeText"]

for index, elem in enumerate(narrative_texts[:5]):
    print(f"Narrative text {index + 1}:")
    print("\n".join(textwrap.wrap(elem.text, width=70)))
    print("\n" + "-" * 70 + "\n")

Narrative text 1:
Connor Shorten* and Taghi M. Khoshgoftaar

----------------------------------------------------------------------

Narrative text 2:
*Correspondence: cshorten2015@fau.edu Department of Computer and
Electrical Engineering and Computer Science, Florida Atlantic
University, Boca Raton, USA

----------------------------------------------------------------------

Narrative text 3:
Deep convolutional neural networks have performed remarkably well on
many Computer Vision tasks. However, these networks are heavily
reliant on big data to avoid overfitting. Overfitting refers to the
phenomenon when a network learns a function with very high variance
such as to perfectly model the training data. Unfor- tunately, many
application domains do not have access to big data, such as medical
image analysis. This survey focuses on Data Augmentation, a data-space
solution to the problem of limited data. Data Augmentation encompasses
a suite of techniques that enhance the size and quality 

So, we could vectorize all the narrative texts and store them in Weaviate.
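
Before storing the texts, it may help to clean them up a little. The output above contains words split by line-break hyphens ("Unfor- tunately") and short fragments such as the author line. The following is a rough sketch of such a preparation step, not part of unstructured or Weaviate: `FakeElement` is a hypothetical stand-in for the real elements, the `text` property name and the `min_words` threshold are arbitrary choices, and the regular expression is a heuristic that would also merge legitimate constructions like "pre- and post-processing".

```python
import re
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured element (category + text only)."""
    category: str
    text: str


# Heuristic: a word character, a hyphen, a space, then a lowercase letter.
# This is how the line-break hyphens appear in the printed output.
_LINE_BREAK_HYPHEN = re.compile(r"(\w)- ([a-z])")


def dehyphenate(text):
    """Join words that were split by a hyphen at a line break."""
    return _LINE_BREAK_HYPHEN.sub(r"\1\2", text)


def build_passages(elements, min_words=8):
    """Turn NarrativeText elements into dicts, skipping very short fragments."""
    passages = []
    for elem in elements:
        if elem.category != "NarrativeText":
            continue
        text = dehyphenate(elem.text)
        if len(text.split()) < min_words:
            continue  # e.g. author lines or stray captions
        passages.append({"text": text})
    return passages


sample = [
    FakeElement("Title", "Abstract"),
    FakeElement(
        "NarrativeText",
        "Unfor- tunately, many application domains do not have access to big data.",
    ),
    FakeElement("NarrativeText", "Connor Shorten* and Taghi M. Khoshgoftaar"),
]

for passage in build_passages(sample):
    print(passage["text"])
```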

# Advanced Stuff

What if we are only interested in extracting the abstract?

We build a state machine that can extract the narrative texts under the abstract section given a list of elements:

In [6]:
import logging

logging.basicConfig(level=logging.INFO)


class AbstractExtractor:
    def __init__(self):
        self.current_section = None  # Keep track of the current section being processed
        self.have_extracted_abstract = (
            False  # Keep track of whether the abstract has been extracted
        )
        self.in_abstract_section = (
            False  # Keep track of whether we're inside the Abstract section
        )
        self.texts = []  # Keep track of the extracted abstract text

    def process(self, element):
        if element.category == "Title":
            self.set_section(element.text)

            if self.current_section == "Abstract":
                self.in_abstract_section = True
                return True

            if self.in_abstract_section:
                return False

        if self.in_abstract_section and element.category == "NarrativeText":
            self.consume_abstract_text(element.text)
            return True

        return True

    def set_section(self, text):
        self.current_section = text
        logging.info(f"Current section: {self.current_section}")

    def consume_abstract_text(self, text):
        logging.info(f"Abstract part extracted: {text}")
        self.texts.append(text)

    def consume_elements(self, elements):
        for element in elements:
            should_continue = self.process(element)

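            if not should_continue:
                break

        # The abstract counts as extracted as soon as any text was collected,
        # even if no later Title element closes the Abstract section.
        self.have_extracted_abstract = bool(self.texts)

        if not self.have_extracted_abstract:
            logging.warning("No abstract found in the given list of objects.")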

    def abstract(self):
        return "\n".join(self.texts)

Using the class:

In [7]:
abstract_extractor = AbstractExtractor()
abstract_extractor.consume_elements(elements)

INFO:root:Current section: A survey on Image Data Augmentation for Deep Learning
INFO:root:Current section: Abstract
INFO:root:Abstract part extracted: Deep convolutional neural networks have performed remarkably well on many Computer Vision tasks. However, these networks are heavily reliant on big data to avoid overfitting. Overfitting refers to the phenomenon when a network learns a function with very high variance such as to perfectly model the training data. Unfor- tunately, many application domains do not have access to big data, such as medical image analysis. This survey focuses on Data Augmentation, a data-space solution to the problem of limited data. Data Augmentation encompasses a suite of techniques that enhance the size and quality of training datasets such that better Deep Learning models can be built using them. The image augmentation algorithms discussed in survey include geometric transformations, color space augmentations, kernel filters, mixing images, random erasing

The extracted abstract is:

In [8]:
print("\n".join(textwrap.wrap(abstract_extractor.abstract(), width=70)))

Deep convolutional neural networks have performed remarkably well on
many Computer Vision tasks. However, these networks are heavily
reliant on big data to avoid overfitting. Overfitting refers to the
phenomenon when a network learns a function with very high variance
such as to perfectly model the training data. Unfor- tunately, many
application domains do not have access to big data, such as medical
image analysis. This survey focuses on Data Augmentation, a data-space
solution to the problem of limited data. Data Augmentation encompasses
a suite of techniques that enhance the size and quality of training
datasets such that better Deep Learning models can be built using
them. The image augmentation algorithms discussed in survey include
geometric transformations, color space augmentations, kernel filters,
mixing images, random erasing, feature space augmentation, adversarial
training, generative adversarial networks, neural style transfer, and
meta-learning. The applica- tion of augm
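
The same idea can be generalized from the abstract to every section. The sketch below groups narrative text under the most recent title; it is an illustration under assumptions, with `FakeElement` as a hypothetical stand-in for the elements returned by `partition_pdf` and made-up sample text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured element (category + text only)."""
    category: str
    text: str


def group_by_section(elements):
    """Map each Title text to the list of NarrativeText strings that follow it."""
    sections = defaultdict(list)
    current = None  # No section until the first Title is seen
    for elem in elements:
        if elem.category == "Title":
            current = elem.text
        elif elem.category == "NarrativeText" and current is not None:
            sections[current].append(elem.text)
    return dict(sections)


sample = [
    FakeElement("Title", "Abstract"),
    FakeElement("NarrativeText", "First abstract paragraph."),
    FakeElement("NarrativeText", "Second abstract paragraph."),
    FakeElement("Title", "Introduction"),
    FakeElement("NarrativeText", "Intro paragraph."),
]

sections = group_by_section(sample)
print("\n".join(sections["Abstract"]))
```

Unlike `AbstractExtractor`, this walks the whole document instead of stopping at the first title after the abstract, which costs more time but keeps every section available.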

# End-to-End Example

Let's read a folder containing PDFs of papers, extract their abstracts, and store them in Weaviate.

In [9]:
from pathlib import Path
import weaviate
from weaviate.embedded import EmbeddedOptions
import os

First, we initialize Weaviate:

In [10]:
client = weaviate.Client(
    embedded_options=EmbeddedOptions(
        additional_env_vars={"OPENAI_APIKEY": os.environ["OPENAI_APIKEY"]}
    )
)

Started /home/.cache/weaviate-embedded: process ID 10617


{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2023-05-05T12:01:18Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2023-05-05T12:01:18Z"}
{"action":"hnsw_vector_cache_prefill","count":50000,"index_id":"document_H3hGhHFqxwSF","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-05-05T12:01:18Z","took":1762513}
{"action":"restapi_management","level":"info","msg":"Serving weaviate at http://127.0.0.1:6666","time":"2023-05-05T12:01:18Z"}
  _warn("subprocess %s is still running" % self.pid,


In [11]:
client.schema.delete_all()

schema = {
    "class": "Document",
    "vectorizer": "text2vec-openai",
    "properties": [
        {
            "name": "source",
            "dataType": ["text"],
        },
        {
            "name": "abstract",
            "dataType": ["text"],
            "moduleConfig": {
                "text2vec-openai": {"skip": False, "vectorizePropertyName": False}
            },
        },
    ],
    "moduleConfig": {
        "generative-openai": {},
        "text2vec-openai": {"model": "ada", "modelVersion": "002", "type": "text"},
    },
}

client.schema.create_class(schema)

{"action":"hnsw_vector_cache_prefill","count":25000,"index_id":"document_E98XgwOuptlB","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2023-05-05T12:01:18Z","took":1316246}


Next, we build the objects that we want to store in Weaviate:

In [12]:
data_folder = "../data"

data_objects = []

for path in Path(data_folder).iterdir():
    if path.suffix != ".pdf":
        continue

    print(f"Processing {path.name}...")

    elements = partition_pdf(filename=path)

    abstract_extractor = AbstractExtractor()
    abstract_extractor.consume_elements(elements)

    data_object = {"source": path.name, "abstract": abstract_extractor.abstract()}

    data_objects.append(data_object)

INFO:unstructured_inference:Loading the Detectron2 layout model ...


Processing paper01.pdf...


INFO:detectron2.checkpoint.detection_checkpoint:[DetectionCheckpointer] Loading from /home/.cache/huggingface/hub/models--layoutparser--detectron2/snapshots/bdedfd979ad33a5713af334da14ec09688e7e9de/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth ...
INFO:fvcore.common.checkpoint:[Checkpointer] Loading from /home/.cache/huggingface/hub/models--layoutparser--detectron2/snapshots/bdedfd979ad33a5713af334da14ec09688e7e9de/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth ...
INFO:unstructured_inference:Reading PDF for file: ../data/paper01.pdf ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
I

Processing paper02.pdf...


INFO:detectron2.checkpoint.detection_checkpoint:[DetectionCheckpointer] Loading from /home/.cache/huggingface/hub/models--layoutparser--detectron2/snapshots/bdedfd979ad33a5713af334da14ec09688e7e9de/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth ...
INFO:fvcore.common.checkpoint:[Checkpointer] Loading from /home/.cache/huggingface/hub/models--layoutparser--detectron2/snapshots/bdedfd979ad33a5713af334da14ec09688e7e9de/PubLayNet/faster_rcnn_R_50_FPN_3x/model_final.pth ...
INFO:unstructured_inference:Reading PDF for file: ../data/paper02.pdf ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:unstructured_inference:Detecting page elements ...
INFO:root:Current section: A Comparison of House Price Classification with Structured and Unstructured
INFO:root:Current section: Connor Shorten
INFO:root:Current section: Abstract
INFO:root:Abstract part extracted: fident an
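
When the extractor finds no abstract, `abstract()` returns an empty string, and the object would still be uploaded. A small sketch of one way to separate such objects before uploading is shown below. The function name and the second sample entry are hypothetical and only serve as illustration.

```python
def split_by_abstract(data_objects):
    """Separate objects with a non-empty abstract from those without one."""
    with_abstract, without_abstract = [], []
    for obj in data_objects:
        if obj["abstract"].strip():
            with_abstract.append(obj)
        else:
            without_abstract.append(obj)
    return with_abstract, without_abstract


# Sample objects in the same shape as data_objects above; the second is made up.
sample_objects = [
    {"source": "paper01.pdf", "abstract": "Deep convolutional neural networks ..."},
    {"source": "hypothetical_no_abstract.pdf", "abstract": ""},
]

usable, skipped = split_by_abstract(sample_objects)
print("Skipped:", [obj["source"] for obj in skipped])
```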

Finally, we upload the data objects to Weaviate:

In [13]:
with client.batch as batch:
    for data_object in data_objects:
        batch.add_data_object(data_object, "Document")
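
Running the upload cell twice would store each paper twice, because every object gets a fresh random ID. One common remedy is to derive a deterministic ID from the file name. The sketch below uses only Python's standard `uuid` module; the namespace and the `object_uuid` helper are assumptions made for illustration, and passing the result through a `uuid` keyword of `batch.add_data_object` assumes the client version used here accepts it.

```python
import uuid

# Arbitrary fixed namespace chosen for this sketch; any constant UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def object_uuid(data_object, class_name="Document"):
    """Derive a stable UUID from the class name and the object's source file."""
    return str(uuid.uuid5(NAMESPACE, f"{class_name}:{data_object['source']}"))


doc = {"source": "paper01.pdf", "abstract": "..."}
print(object_uuid(doc))

# Assumed usage with the batch from the cell above (not executed here):
# with client.batch as batch:
#     for data_object in data_objects:
#         batch.add_data_object(
#             data_object, "Document", uuid=object_uuid(data_object)
#         )
```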

Now we can run some queries. First, a BM25 keyword search, followed by a generative query that summarizes each abstract:

In [14]:
client.query.get("Document", "source").with_bm25(
    query="some paper about housing prices"
).with_additional("score").do()

{'data': {'Get': {'Document': [{'_additional': {'score': '0.8450042'},
     'source': 'paper02.pdf'},
    {'_additional': {'score': '0.26854637'}, 'source': 'paper01.pdf'}]}}}
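
The response is a nested dictionary, and the scores come back as strings. A minimal sketch of flattening it into `(source, score)` pairs follows. The `response` literal is copied from the output above, and `ranked_sources` is a hypothetical helper, not part of the Weaviate client.

```python
# Response copied from the query output above.
response = {
    "data": {
        "Get": {
            "Document": [
                {"_additional": {"score": "0.8450042"}, "source": "paper02.pdf"},
                {"_additional": {"score": "0.26854637"}, "source": "paper01.pdf"},
            ]
        }
    }
}


def ranked_sources(response, class_name="Document"):
    """Return (source, score) pairs sorted by descending score."""
    hits = response["data"]["Get"][class_name]
    pairs = [(hit["source"], float(hit["_additional"]["score"])) for hit in hits]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for source, score in ranked_sources(response):
    print(f"{source}: {score:.3f}")
```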

In [15]:
prompt = """
Please summarize the following academic abstract in a one-liner for a layperson:

{abstract}
"""

results = (
    client.query.get("Document", "source").with_generate(single_prompt=prompt).do()
)

docs = results["data"]["Get"]["Document"]

for doc in docs:
    source = doc["source"]
    # The generated one-line summary, not the original abstract
    summary = doc["_additional"]["generate"]["singleResult"]
    wrapped_summary = textwrap.fill(summary, width=80)
    print(f"Source: {source}\nSummary:\n{wrapped_summary}\n")

Source: paper02.pdf
Summary:
Using machine learning techniques, researchers explore predicting house prices
with structured and unstructured data, finding that the best predictive
performance is achieved with term frequency-inverse document frequency (TF-IDF)
representations of house descriptions.

Source: paper01.pdf
Summary:
Data Augmentation is a technique that enhances the size and quality of training
datasets for Deep Learning models, particularly useful in domains with limited
data such as medical image analysis.

