# Unstructured multiple document type extraction

This notebook shows examples of text extraction from different file types, using both the Unstructured API and loaders that run locally.

**Table of contents**<a id='toc0_'></a>    
- [1. CSV file](#toc1_)    
  - [Using unstructured API](#toc1_1_)    
  - [Using CSV loader](#toc1_2_)    
- [2. Excel file (.xls)](#toc2_)    
  - [Using unstructured API](#toc2_1_)    
  - [Using unstructured local XLS loader](#toc2_2_)    
- [3. Docx file](#toc3_)    
  - [Using unstructured API](#toc3_1_)    
  - [Using unstructured local DOC loader](#toc3_2_)    
- [4. RTF file](#toc4_)    
  - [Using unstructured API](#toc4_1_)    
  - [Using unstructured local RTF loader](#toc4_2_)    
- [5. Markdown file](#toc5_)    
  - [Using unstructured API](#toc5_1_)    
  - [Using unstructured local loader](#toc5_2_)    
- [6. Web](#toc6_)    
  - [Using Async local loader](#toc6_1_)    
- [7. PDF file](#toc7_)    
  - [Using unstructured API](#toc7_1_)    
  - [Using unstructured local PDF loader](#toc7_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [1]:
import sys
sys.path.append('../')

import os
import glob
from tqdm.autonotebook import trange
from dotenv import load_dotenv
from langchain.document_loaders import UnstructuredAPIFileLoader

load_dotenv('./export.env')

True
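
The API-based cells in this notebook read `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_URL` from the environment, and `os.environ.get` returns `None` without complaint when a variable is missing. A minimal sketch of an early check, assuming those two names are the only ones needed; `missing_env_vars` is a hypothetical helper and not part of this kit:

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names from `names` that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# The two variable names used by the API loader cells in this notebook.
required = ["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_URL"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running a check like this right after `load_dotenv` reports a missing key before any loader call is made.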

# <a id='toc1_'></a>[CSV file](#toc0_)

In [2]:
folder_loc = 'sample_data/sample_files/'
csv_files = list(glob.glob(f'{folder_loc}/*.csv'))
sample_csv = csv_files[0]
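
`glob.glob` returns matches in arbitrary order, so the file selected by index `[0]` can differ between machines, and an empty folder raises `IndexError`. A small sketch of a deterministic alternative; `first_file` is a hypothetical helper introduced for illustration and is not used by the other cells:

```python
import glob
import os


def first_file(folder, extension):
    """Return the alphabetically first file in `folder` with `extension`, or None."""
    pattern = os.path.join(folder, f"*.{extension}")
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else None
```

For example, `first_file('sample_data/sample_files', 'csv')` returns `None` when nothing matches, rather than raising.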

## <a id='toc1_1_'></a>[Using unstructured API](#toc0_)

In [3]:
loader = UnstructuredAPIFileLoader(sample_csv, 
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')




Index
Customer Id
First Name
Last Name
Company
City
Country
Phone 1
Phone 2
Email
Subscription Date
Website


1
DD37Cf93aecA6Dc
Sheryl
Baxter
Rasmussen Group
East Leonard
Chile
229.077.5154
397.884.0519x718
zunigavanessa@smith.info
2020-08-24
http://www.stephenson.com/


2
1Ef7b82A4CAAD10
Preston
Lozano
Vega-Gentry
East Jimmychester
Djibouti
5153435776
686-620-1820x944
vmata@colon.com
2021-04-23
http://www.hobbs.com/


3
6F94879bDAfE5a6
Roy
Berry
Murillo-Perry
Isabelborough
Antigua and Barbuda
+1-539-402-0259
(496)978-3969x58947
beckycarr@hogan.com
2020-03-25
http://www.lawrence.com/


4
5Cef8BFA16c5e3c
Linda
Olsen
Dominguez, Mcmillan and Donovan
Bensonview
Dominican Republic
001-808-617-6467x12895
+1-813-324-8756
stanleyblackwell@benson.org
2020-06-02
http://www.good-lyons.com/


5
053d585Ab6b3159
Joanna
Bender
Martin, Lang and Andrade
West Priscilla
Slovakia (Slovak Republic)
001-234-203-0635x76146
001-199-446-3860x3486
colinalvarado@miles.net
2021-04-17
https://goodwin-ingram.com

## <a id='toc1_2_'></a>[Using CSV loader](#toc0_)

In [4]:
from langchain.document_loaders.csv_loader import CSVLoader

loader_csv = CSVLoader(file_path=sample_csv, encoding="utf-8", csv_args={'delimiter': ','})
docs_csv = loader_csv.load()
for doc in docs_csv:
    print(f'{doc.page_content}\n')

Index: 1
Customer Id: DD37Cf93aecA6Dc
First Name: Sheryl
Last Name: Baxter
Company: Rasmussen Group
City: East Leonard
Country: Chile
Phone 1: 229.077.5154
Phone 2: 397.884.0519x718
Email: zunigavanessa@smith.info
Subscription Date: 2020-08-24
Website: http://www.stephenson.com/

Index: 2
Customer Id: 1Ef7b82A4CAAD10
First Name: Preston
Last Name: Lozano
Company: Vega-Gentry
City: East Jimmychester
Country: Djibouti
Phone 1: 5153435776
Phone 2: 686-620-1820x944
Email: vmata@colon.com
Subscription Date: 2021-04-23
Website: http://www.hobbs.com/

Index: 3
Customer Id: 6F94879bDAfE5a6
First Name: Roy
Last Name: Berry
Company: Murillo-Perry
City: Isabelborough
Country: Antigua and Barbuda
Phone 1: +1-539-402-0259
Phone 2: (496)978-3969x58947
Email: beckycarr@hogan.com
Subscription Date: 2020-03-25
Website: http://www.lawrence.com/

Index: 4
Customer Id: 5Cef8BFA16c5e3c
First Name: Linda
Last Name: Olsen
Company: Dominguez, Mcmillan and Donovan
City: Bensonview
Country: Dominican Republic
P
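
The output above shows the shape `CSVLoader` produces: one document per data row, with each cell rendered as `column: value` on its own line. The following is a rough standard-library sketch of that formatting, not LangChain's actual implementation; `rows_to_text` is a hypothetical name:

```python
import csv
import io


def rows_to_text(csv_text, delimiter=","):
    """Render each CSV data row as newline-separated 'column: value' pairs."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


# A few columns taken from the sample rows shown above.
sample = "Index,First Name,Country\n1,Sheryl,Chile\n2,Preston,Djibouti\n"
for text in rows_to_text(sample):
    print(f"{text}\n")
```

Because the header travels with every row, each document stays self-describing even when rows are processed independently.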

# <a id='toc2_'></a>[Excel file (.xls)](#toc0_)

In [5]:
folder_loc = 'sample_data/sample_files/'
xls_files = list(glob.glob(f'{folder_loc}/*.xls'))
sample_xls = xls_files[0]

## <a id='toc2_1_'></a>[Using unstructured API](#toc0_)

In [6]:
loader = UnstructuredAPIFileLoader(sample_xls, 
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'), 
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')




MC
What is 2+2?
4
correct
3
incorrect





MA
What C datatypes are 8 bits? (assume i386)
int

float

double

char


TF
Bagpipes are awesome.
true








ESS
How have the original Henry Hornbostel buildings influenced campus architecture and design in the last 30 years?









ORD
Rank the following in their order of operation.
Parentheses
Exponents
Division
Addition





FIB
The student activities fee is
95
dollars for students enrolled in
19
units or more,





MAT
Match the lower-case greek letter with its capital form.
λ
Λ
α
γ
Γ
φ
Φ




http://www.cmu.edu/blackboard

Question Format Abbreviations




Abbreviation
Question Type


MC
Multiple Choice


MA
Multiple Answer


TF
True/False


ESS
Essay


ORD
Ordering


MAT
Matching


FIB
Fill in the Blank


FIL
File response


NUM
Numeric Response


SR
Short response


OP
Opinion


FIB_PLUS
Multiple Fill in the Blank


JUMBLED_SENTENCE
Jumbled Sentence


QUIZ_BOWL
Quiz Bowl




http://www.cmu.edu/blackboard

File Information

Source


## <a id='toc2_2_'></a>[Using unstructured local XLS loader](#toc0_)

In [7]:
from langchain.document_loaders import UnstructuredExcelLoader

loader = UnstructuredExcelLoader(sample_xls, mode="elements")
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')




MC
What is 2+2?
4
correct
3
incorrect





MA
What C datatypes are 8 bits? (assume i386)
int

float

double

char


TF
Bagpipes are awesome.
true








ESS
How have the original Henry Hornbostel buildings influenced campus architecture and design in the last 30 years?









ORD
Rank the following in their order of operation.
Parentheses
Exponents
Division
Addition





FIB
The student activities fee is
95
dollars for students enrolled in
19
units or more,





MAT
Match the lower-case greek letter with its capital form.
λ
Λ
α
γ
Γ
φ
Φ




http://www.cmu.edu/blackboard

Question Format Abbreviations




Abbreviation
Question Type


MC
Multiple Choice


MA
Multiple Answer


TF
True/False


ESS
Essay


ORD
Ordering


MAT
Matching


FIB
Fill in the Blank


FIL
File response


NUM
Numeric Response


SR
Short response


OP
Opinion


FIB_PLUS
Multiple Fill in the Blank


JUMBLED_SENTENCE
Jumbled Sentence


QUIZ_BOWL
Quiz Bowl




http://www.cmu.edu/blackboard

File Information

Source


# <a id='toc3_'></a>[Docx file](#toc0_)

In [8]:
folder_loc = 'sample_data/sample_files/'
docx_files = list(glob.glob(f'{folder_loc}/*.docx'))
sample_docx = docx_files[0]

## <a id='toc3_1_'></a>[Using unstructured API](#toc0_)

In [9]:
loader = UnstructuredAPIFileLoader(sample_docx,
                                   mode="elements",
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')

US Trustee Handbook

CHAPTER 1

INTRODUCTION

CHAPTER 1 – INTRODUCTION

A.	PURPOSE

The United States Trustee appoints and supervises standing trustees and monitors and supervises cases under chapter 13 of title 11 of the United States Code.  28 U.S.C. § 586(b).  The Handbook, issued as part of our duties under 28 U.S.C. § 586, establishes or clarifies the position of the United States Trustee Program (Program) on the duties owed by a standing trustee to the debtors, creditors, other parties in interest, and the United States Trustee.  The Handbook does not present a full and complete statement of the law; it should not be used as a substitute for legal research and analysis.  The standing trustee must be familiar with relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law.  11 U.S.C. § 321, 28 U.S.C. § 586, 28 C.F.R. § 58.6(a)(3).  Standing trustees are encouraged to follow Practice Tips identified in this Ha

## <a id='toc3_2_'></a>[Using unstructured local DOC loader](#toc0_)

In [10]:
from langchain.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader(sample_docx, mode="elements")
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')


US Trustee Handbook

CHAPTER 1

INTRODUCTION

CHAPTER 1 – INTRODUCTION

A.	PURPOSE

The United States Trustee appoints and supervises standing trustees and monitors and supervises cases under chapter 13 of title 11 of the United States Code.  28 U.S.C. § 586(b).  The Handbook, issued as part of our duties under 28 U.S.C. § 586, establishes or clarifies the position of the United States Trustee Program (Program) on the duties owed by a standing trustee to the debtors, creditors, other parties in interest, and the United States Trustee.  The Handbook does not present a full and complete statement of the law; it should not be used as a substitute for legal research and analysis.  The standing trustee must be familiar with relevant provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules), any local bankruptcy rules, and case law.  11 U.S.C. § 321, 28 U.S.C. § 586, 28 C.F.R. § 58.6(a)(3).  Standing trustees are encouraged to follow Practice Tips identified in this Ha

# <a id='toc4_'></a>[RTF file](#toc0_)

In [11]:
folder_loc = 'sample_data/sample_files/'
rtf_files = list(glob.glob(f'{folder_loc}/*.rtf'))
sample_rtf = rtf_files[0]

## <a id='toc4_1_'></a>[Using unstructured API](#toc0_)

In [12]:
loader = UnstructuredAPIFileLoader(sample_rtf, 
                                   mode="elements", 
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs = loader.load()
for doc in docs:
    print(doc.page_content)

My First Heading
My first paragraph.
Table Example:
Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2


## <a id='toc4_2_'></a>[Using unstructured local RTF loader](#toc0_)

This loader uses pypandoc, which requires pandoc to be installed: https://pandoc.org/installing.html
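
Since the note above says pandoc must be installed, a quick check before running the loader can save a confusing failure. A minimal sketch using only the standard library; it assumes the executable is named `pandoc` and is on `PATH`, and `has_executable` is a hypothetical helper:

```python
import shutil


def has_executable(name):
    """Return True if `name` resolves to an executable on PATH."""
    return shutil.which(name) is not None


if not has_executable("pandoc"):
    print("pandoc not found on PATH; see https://pandoc.org/installing.html")
```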

In [None]:
from langchain.document_loaders import UnstructuredRTFLoader

loader = UnstructuredRTFLoader(sample_rtf, mode="elements")
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')

# <a id='toc5_'></a>[Markdown file](#toc0_)

In [14]:
folder_loc = '.'
md_files = list(glob.glob(f'{folder_loc}/*.md'))
sample_md = md_files[0]

## <a id='toc5_1_'></a>[Using unstructured API](#toc0_)

In [15]:
loader = UnstructuredAPIFileLoader(sample_md, 
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get("UNSTRUCTURED_URL"))
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')

SambaNova AI Starter Kits

Data Extraction Examples

Data Extraction Examples
Overview
Getting started
File Loaders
CSV Documents
XLS/XLSX Documents
DOC/DOCX Documents
RTF Documents
Markdown Documents
HTML Documents
Multidocument
PDF Documents
Included Files

Overview

This kit include a series of Notebooks that demonstrates various methods for extracting text from documents in different input formats. including Markdown, PDF, CSV, RTF, DOCX, XLS, HTML

Getting started

Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

Install requirements: It is recommended to use virtualenv or conda environment for installation.
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirements.txt

Some text extraction examples use Unstructured lib. Please register at Unstructured.io to get a free API Key. then create an enviroment file to store the APIkey and URL provided.
echo 'UNSTRUCTURED_API_KEY="your_API_ke

## <a id='toc5_2_'></a>[Using unstructured local loader](#toc0_)

In [16]:
from langchain.document_loaders import UnstructuredMarkdownLoader

loader = UnstructuredMarkdownLoader(sample_md, mode="elements")
docs = loader.load()
for doc in docs:
    print(f'{doc.page_content}\n')

SambaNova AI Starter Kits

Data Extraction Examples

Data Extraction Examples
Overview
Getting started
File Loaders
CSV Documents
XLS/XLSX Documents
DOC/DOCX Documents
RTF Documents
Markdown Documents
HTML Documents
Multidocument
PDF Documents
Included Files

Overview

This kit include a series of Notebooks that demonstrates various methods for extracting text from documents in different input formats. including Markdown, PDF, CSV, RTF, DOCX, XLS, HTML

Getting started

Clone repo.
git clone https://github.sambanovasystems.com/SambaNova/ai-starter-kit.git

Install requirements: It is recommended to use virtualenv or conda environment for installation.
python3 -m venv data_extract_env
source data_extract_env/bin/activate
cd data_extraction
pip install -r requirements.txt

Some text extraction examples use Unstructured lib. Please register at Unstructured.io to get a free API Key. then create an enviroment file to store the APIkey and URL provided.
echo 'UNSTRUCTURED_API_KEY="your_API_ke

# <a id='toc6_'></a>[Web](#toc0_)

In [17]:
urls = [
    "https://en.wikipedia.org/wiki/Unstructured_data",
    "https://unstructured-io.github.io/unstructured/introduction.html",
]

## <a id='toc6_1_'></a>[Using Async local loader](#toc0_)

In [18]:
from langchain.document_loaders import AsyncHtmlLoader

loader = AsyncHtmlLoader(urls, verify_ssl=False)
docs = loader.load()

for doc in docs:
    print(f'{doc.page_content}\n')

Fetching pages:   0%|          | 0/2 [00:00<?, ?it/s]

Fetching pages: 100%|##########| 2/2 [00:00<00:00,  3.24it/s]

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>Unstructured data - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-wi




### Clean HTML output from the async loader with the Html2Text transformer

In [19]:
from langchain.document_transformers import Html2TextTransformer

html2text_transformer = Html2TextTransformer()
docs = html2text_transformer.transform_documents(documents=docs)
for doc in docs:
    print(f'{doc.page_content}\n')

Jump to content

Main menu

Main menu

move to sidebar hide

Navigation

  * Main page
  * Contents
  * Current events
  * Random article
  * About Wikipedia
  * Contact us
  * Donate

Contribute

  * Help
  * Learn to edit
  * Community portal
  * Recent changes
  * Upload file

Languages

Language links are at the top of the page.

Search

Search

  * Create account
  * Log in

Personal tools

  * Create account
  * Log in

Pages for logged out editors learn more

  * Contributions
  * Talk

## Contents

move to sidebar hide

  * (Top)

  * 1Background

  * 2Issues with terminology

  * 3Dealing with unstructured data

Toggle Dealing with unstructured data subsection

    * 3.1Approaches in natural language processing

    * 3.2Approaches in medicine and biomedical research

  * 4The use of "unstructured" in data privacy regulations

  * 5See also

  * 6Notes

  * 7References

  * 8External links

Toggle the table of contents

# Unstructured data

11 languages

  * العربية
  * Català
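
To make the cleaning step concrete: the idea is to keep the text of the page and drop markup, scripts, and styles. The sketch below illustrates that idea with Python's built-in `html.parser`. It is a simplified stand-in, not how `Html2TextTransformer` is implemented (that transformer relies on the `html2text` package and produces Markdown-like output). `TextOnlyParser` and `html_to_text` are hypothetical names.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    """Return the visible text of `html`, one text node per line."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><title>Unstructured data - Wikipedia</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Unstructured data</h1></body></html>"
)
print(html_to_text(page))
```

Unlike the transformer's output above, this sketch loses heading and list structure, which is one reason to prefer a dedicated converter for real pages.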

# <a id='toc7_'></a>[PDF file](#toc0_)

In [20]:
folder_loc = 'sample_data/sample_pdfs'
pdf_files = list(glob.glob(f'{folder_loc}/*.pdf'))
sample_pdf = pdf_files[0]

## <a id='toc7_1_'></a>[Using unstructured API](#toc0_)

In [22]:
loader = UnstructuredAPIFileLoader(sample_pdf, 
                                   api_key=os.environ.get('UNSTRUCTURED_API_KEY'),
                                   url=os.environ.get('UNSTRUCTURED_URL'))
docs = loader.load()
for doc in docs:
    print(doc.page_content)

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like Google or Facebook, you’ll be hard-pressed to attract top talent, resulti

## <a id='toc7_2_'></a>[Using unstructured local PDF loader](#toc0_)

In [23]:
from langchain.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(sample_pdf)
docs = loader.load()

for doc in docs:
    print(f'{doc.page_content}\n')

Accelerate Data-Driven Decision-Making with State-of-the-Art AI

AI is increasingly being adopted across commercial and public sector industries and is as disruptive today as the advent of the internet a few decades ago. And like the internet—AI promises decisive competitive and operational advantages to organizations that can leverage it for innovation sooner rather than later.

ROADBLOCKS TO INNOVATION

Today, all but the big tech giants face seemingly insurmountable obstacles to bring AI applications into production. These challenges include:

Critical skill gaps due to machine learning talent scarcity • Lack of expertise in computing architectures • Difficulty in keeping on top of latest models and techniques • Investment justiﬁcation without proof of prior impact

UNLOCKING THE FUTURE

The AI talent shortage continues to burden companies across all industries. And the reality is, unless you’re a company like Google or Facebook, you’ll be hard-pressed to attract top talent, resulti