# <a id='toc1_'></a>[PDF Extraction](#toc0_)

This notebook shows examples of text extraction from PDF files using different **non-OCR** packages.

**Table of contents**<a id='toc0_'></a>    
- [PDF Extraction](#toc1_)    
  - [PyPDF2](#toc1_1_)    
  - [PyMuPDF](#toc1_2_)    
  - [Fitz](#toc1_3_)    
  - [Unstructured](#toc1_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import glob
# Location of the PDF files
folder_loc = 'sample_data/sample_pdfs'
pdf_files = glob.glob(f'{folder_loc}/*.pdf')
# Use the first PDF found as the sample for every extractor below
sample_pdf = pdf_files[0]
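
`glob.glob` returns paths in arbitrary order, and indexing an empty result raises an `IndexError`. A minimal sketch of a more defensive lookup is shown here; the helper name `find_pdfs` is hypothetical and not part of this notebook.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF paths in `folder` as strings, sorted for a stable order.

    `find_pdfs` is a hypothetical helper, not part of the notebook.
    """
    paths = sorted(str(p) for p in Path(folder).glob("*.pdf"))
    if not paths:
        # Fail early with a clear message instead of an IndexError later on
        raise FileNotFoundError(f"No PDF files found in {folder!r}")
    return paths


# Possible usage with the folder from the cell above:
# pdf_files = find_pdfs('sample_data/sample_pdfs')
# sample_pdf = pdf_files[0]
```

Sorting makes `sample_pdf` the same file on every run, which keeps the outputs of the different extractors comparable.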

## <a id='toc1_1_'></a>[PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/) [&#8593;](#toc0_)

In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader(sample_pdf)
docs = loader.load()

for doc in docs:
    print(doc.page_content)


Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like 
Google or Facebook, you’ll be hard-pressed to attract top talent, resulting in a subpar AI solution. It’s time you demand a 
solution to this challenge and it’s time that companies meet this need.
  01 Copyright © 2021 SambaNova Systems, Inc. All rights reserved.ROADBLOCKS TO INNOVATION
UNLOCKING THE FUTURE
SambaNova enables organizations to build and deploy AI solutions for natural language processing, high-resolution 
computer vision, and recommendation. With state-of-the-art accuracy, unmatched scalability, and ease of use, 
SambaNova delivers AI capabilities at a fraction of the time and expense it takes to develop complex in-house 
infrastructure and 

## <a id='toc1_2_'></a>[PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) [&#8593;](#toc0_)

In [3]:
from langchain.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(sample_pdf)
docs = loader.load()

for doc in docs:
    print(doc.page_content)


Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like 
Google or Facebook, you’ll be hard-pressed to attract top talent, resulting in a subpar AI solution. It’s time you demand a 
solution to this challenge and it’s time that companies meet this need.
  01
Copyright © 2021 SambaNova Systems, Inc. All rights reserved.
ROADBLOCKS TO INNOVATION
UNLOCKING THE FUTURE
SambaNova enables organizations to build and deploy AI solutions for natural language processing, high-resolution 
computer vision, and recommendation. With state-of-the-art accuracy, unmatched scalability, and ease of use, 
SambaNova delivers AI capabilities at a fraction of the time and expense it takes to develop complex in-house 
infrastructure and

## <a id='toc1_3_'></a>[Fitz](https://pymupdf.readthedocs.io/en/latest/module.html) [&#8593;](#toc0_)

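`fitz` is the import name of the PyMuPDF package. Calling it directly, rather than through the LangChain loader, provides more flexibility: the example below clips text extraction to column bounding boxes and skips the page footer.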

In [4]:
import fitz
from src.multi_column import column_boxes

docs = fitz.open(sample_pdf)

for page in docs:
    full_text = ''
    bboxes = column_boxes(page, footer_margin=100, no_image_text=True)
    for rect in bboxes:
        full_text += page.get_text(clip=rect, sort=True)
    print(full_text)

Accelerate Data-Driven 
Decision-Making with 
State-of-the-Art AI
AI is increasingly being adopted across commercial and public sector industries and is as disruptive 
today as the advent of the internet a few decades ago. And like the internet—AI promises decisive 
competitive and operational advantages to organizations that can leverage it for innovation sooner 
rather than later.
ROADBLOCKS TO INNOVATION
Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These 
challenges include:
• Critical skill gaps due to machine learning talent scarcity
• Lack of expertise in computing architectures
• Difculty in keeping on top of latest models and techniques
• Investment justiﬁcation without proof of prior impact
UNLOCKING THE FUTURE
The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like 
Google or Facebook, you’ll be hard-pressed to attract top talent, result
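
The idea behind column-aware extraction can be sketched without PyMuPDF. The example below is a simplified illustration, not the implementation of `column_boxes`: it assumes exactly two columns split at the page midline, and the function name `reading_order` is hypothetical. The block tuples mirror the first five fields that `page.get_text("blocks")` returns in PyMuPDF (`x0, y0, x1, y1, text`).

```python
def reading_order(blocks, page_width):
    """Order text blocks column by column, each column top to bottom.

    `blocks` are (x0, y0, x1, y1, text) tuples. Assumes a two-column
    layout split at the horizontal midline of the page.
    """
    mid = page_width / 2

    def center_x(block):
        return (block[0] + block[2]) / 2

    def top_then_left(block):
        return (block[1], block[0])

    left = [b for b in blocks if center_x(b) < mid]
    right = [b for b in blocks if center_x(b) >= mid]
    return sorted(left, key=top_then_left) + sorted(right, key=top_then_left)


# Made-up blocks on a 612-point-wide page, listed out of reading order
blocks = [
    (310, 50, 590, 100, "right top"),
    (20, 200, 290, 250, "left bottom"),
    (20, 50, 290, 100, "left top"),
    (310, 200, 590, 250, "right bottom"),
]
ordered = [b[4] for b in reading_order(blocks, page_width=612)]
print(ordered)  # ['left top', 'left bottom', 'right top', 'right bottom']
```

Real pages need more care (full-width headings, more than two columns, images), which is what a dedicated helper such as `column_boxes` is there to handle.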

## <a id='toc1_4_'></a>[Unstructured](https://unstructured.io/) [&#8593;](#toc0_)

The Unstructured package extracts text from a variety of document types and integrates with [LangChain](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file).


In [5]:
from langchain.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(sample_pdf)
docs = loader.load()

for doc in docs:
    print(doc.page_content)


Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like Google or Facebook, you’ll be hard-pressed to attract top talent, resulti
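
The outputs above differ in line wrapping and in how ligatures such as "ﬁ" are preserved, which makes them awkward to compare by eye. A minimal sketch of one way to normalize and compare extracted text follows; the helper names `normalize` and `similarity` are hypothetical, and NFKC normalization is one possible choice rather than something the packages above apply.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Fold compatibility characters and collapse whitespace."""
    # NFKC maps the "fi" ligature (U+FB01) to the two letters "fi"
    text = unicodedata.normalize("NFKC", text)
    # Collapse newlines and repeated spaces so line wrapping is ignored
    return " ".join(text.split())


def similarity(a, b):
    """Return a 0..1 similarity ratio between two normalized texts."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


wrapped = "Investment justi\ufb01cation\nwithout proof of prior impact"
flat = "Investment justification without proof of prior impact"
print(normalize(wrapped) == flat)  # True
print(similarity(wrapped, flat))   # 1.0
```

Applied to the `page_content` strings from two loaders, a ratio well below 1.0 would point to differences in content or reading order rather than in formatting alone.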