
In [1]:
%%capture
%pip install 'unstructured[pdf]'

In [2]:
from unstructured.partition.auto import partition

In [3]:
from urllib.request import urlretrieve

In [11]:
path, headers = urlretrieve('https://www.nature.com/articles/s41598-022-07995-7.pdf','microbiome.pdf')

In [12]:
parts = partition('microbiome.pdf')

In [14]:
from collections import defaultdict

# Count how many elements of each type the partitioner produced
categories = defaultdict(int)
for element in parts:
    categories[str(type(element))] += 1
for k, v in categories.items():
    print(k, v)

<class 'unstructured.documents.elements.Title'> 65
<class 'unstructured.documents.elements.NarrativeText'> 84
<class 'unstructured.documents.elements.Text'> 64
<class 'unstructured.documents.elements.ListItem'> 40
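
The full class paths printed above are verbose. A minimal sketch of the same tally with `collections.Counter` and bare class names follows. The `Title` and `NarrativeText` classes defined here are hypothetical stand-ins so the snippet runs without `unstructured` installed, and `count_element_types` is an illustrative helper, not part of the library.

```python
from collections import Counter


def count_element_types(elements):
    """Tally elements by bare class name, e.g. 'Title' instead of the full class path."""
    return Counter(type(element).__name__ for element in elements)


# Hypothetical stand-ins for unstructured's element classes (demo only).
class Title:
    pass


class NarrativeText:
    pass


demo_counts = count_element_types([Title(), NarrativeText(), NarrativeText()])
for name, count in demo_counts.most_common():
    print(name, count)
```

In the notebook, `count_element_types(parts)` could take the place of the `defaultdict` loop.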


## Fine-tuning the PDF extraction

In [8]:
from unstructured.partition.pdf import partition_pdf

In [9]:
!sudo apt-get install -y poppler-utils tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
tesseract-ocr is already the newest version (4.1.1-2.1build1).
poppler-utils is already the newest version (22.02.0-2ubuntu0.3).
0 upgraded, 0 newly installed, 0 to remove and 24 not upgraded.


In [15]:
parts2 = partition_pdf(
    filename="microbiome.pdf",
    # Do not extract the embedded image blocks from the PDF
    extract_images_in_pdf=False,
    # Use the layout model (YOLOX) to get bounding boxes (for tables) and find titles;
    # a title is any sub-section of the document
    infer_table_structure=True,
    # Post-processing: aggregate text under each title into chunks
    chunking_strategy="by_title",
    # Hard upper limit on chunk size, in characters
    max_characters=4000,
    # Attempt to start a new chunk once the current one reaches 3800 characters
    new_after_n_chars=3800,
    # Attempt to combine text blocks under 2000 characters so chunks stay above that size
    combine_text_under_n_chars=2000,
    # image_output_dir_path=path
)

Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TableTransformerForObjectDetection from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [16]:
from collections import defaultdict

# Repeat the element-type tally for the chunked output
categories = defaultdict(int)
for element in parts2:
    categories[str(type(element))] += 1
for k, v in categories.items():
    print(k, v)

<class 'unstructured.documents.elements.CompositeElement'> 24
<class 'unstructured.documents.elements.Table'> 6


In [26]:
# Character length of each chunk
lens = [len(x.text) for x in parts2]
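
The `lens` list is easiest to read as a few summary numbers. The sketch below uses only the standard library. `summarize_lengths` is a hypothetical helper, and the lengths in the demo are made-up values, not results from this PDF. The `hard_cap` default of 4000 mirrors the `max_characters` setting passed to `partition_pdf`.

```python
from statistics import mean, median


def summarize_lengths(lengths, hard_cap=4000):
    """Return basic statistics for chunk lengths and count chunks longer than hard_cap."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
        "max": max(lengths),
        "over_cap": sum(1 for n in lengths if n > hard_cap),
    }


# Made-up lengths for illustration only.
demo_summary = summarize_lengths([2100, 3900, 3500, 4200])
print(demo_summary)
```

In the notebook, `summarize_lengths(lens)` would show whether the chunks cluster near the soft limit and whether any exceed the hard cap.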

## ChromaDB

In [44]:
%%capture
%pip install chromadb

In [55]:
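import chromadb

# Persist the vector store on disk under "db.db"
client = chromadb.PersistentClient(path="db.db")
# Start from an empty collection. Deleting raises if the collection does not
# exist yet (first run); the exception type differs between chromadb versions.
try:
    client.delete_collection("my_collection")
except Exception:
    pass
collection = client.create_collection(name="my_collection")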


In [56]:
collection.add(
    documents=[x.text for x in parts2],
    metadatas=[{"filename": x.metadata.filename} for x in parts2],
    ids=[str(i) for i in range(len(parts2))],
)
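
The `collection.add` call takes three parallel lists: documents, metadatas, and ids. A minimal sketch of building them in one pass follows. `build_records` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for the chunks in `parts2`, and the `"unknown"` fallback is an assumption made so that every metadata value is a string even when a chunk has no filename.

```python
from types import SimpleNamespace


def build_records(elements):
    """Build parallel documents, metadatas, and ids lists from chunk-like objects."""
    documents, metadatas, ids = [], [], []
    for index, element in enumerate(elements):
        # Assumption: fall back to "unknown" so the metadata value is never None.
        filename = element.metadata.filename or "unknown"
        documents.append(element.text)
        metadatas.append({"filename": filename})
        ids.append(str(index))  # ids are the string form of the list position
    return documents, metadatas, ids


# Stand-in chunks for illustration only.
demo_elements = [
    SimpleNamespace(text="first chunk", metadata=SimpleNamespace(filename="microbiome.pdf")),
    SimpleNamespace(text="second chunk", metadata=SimpleNamespace(filename=None)),
]
documents, metadatas, ids = build_records(demo_elements)
print(ids, metadatas)
```

With the real chunks, the three lists could be passed as `collection.add(documents=documents, metadatas=metadatas, ids=ids)`.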

In [57]:
res = collection.query(
    query_texts=["What is the statistical model used for analysis?"],
    n_results=10,
)

In [59]:
res.keys()

dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data'])
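
Each populated value in the result is indexed first by query and then by hit, so `res['documents'][0]` holds the hits for the first query text. A minimal sketch of unpacking that shape follows. `top_hits` is a hypothetical helper, and the demo dictionary is hand-built to mimic the nesting rather than copied from a real query.

```python
def top_hits(result, query_index=0, preview_chars=80):
    """Return (id, distance, text preview) tuples for one query, in the order returned."""
    ids = result["ids"][query_index]
    distances = result["distances"][query_index]
    documents = result["documents"][query_index]
    return [
        (doc_id, distance, text[:preview_chars])
        for doc_id, distance, text in zip(ids, distances, documents)
    ]


# Hand-built stand-in with the same nesting as a query result (illustration only).
demo_result = {
    "ids": [["3", "7"]],
    "distances": [[0.42, 0.57]],
    "documents": [["text of the closest chunk", "text of the next chunk"]],
}
demo_hits = top_hits(demo_result)
for doc_id, distance, preview in demo_hits:
    print(doc_id, distance, preview)
```

In the notebook, `top_hits(res)` would list the ten chunks retrieved for the question, closest first.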