**Title**: Text Analytics 3.2 Exercises  
**Author**: Ryan Weeks  
**Date**: 3/26/2025  
**Description**: In this exercise, I used Python to extract and process text from PDFs and an image. I worked with PyPDF and tabula-py to read text and tables from two different PDF files, then used pytesseract to perform OCR on an image containing printed text. Finally, I used spaCy to tokenize the extracted image text and analyze each token’s part of speech and dependency.


### Exercise 1: Reading Text from a PDF without Tables (Using PyPDF)

In this step, I’m using the `pypdf` library to extract text from a PDF that contains only plain text. `pypdf` is lightweight and straightforward for basic text extraction, although it is not designed to parse tables into structured data.


In [8]:
from pypdf import PdfReader

# File path to the PDF with no tables
pdf_path_no_tables = "C:/Users/Weekseey/Documents/Bellevue Work/Text Analytics/Week_3_No_Tables.pdf"

# Create a PDF reader object
reader = PdfReader(pdf_path_no_tables)

# Extract text from all pages
all_text_no_tables = ""
for page in reader.pages:
    all_text_no_tables += page.extract_text()

print(all_text_no_tables)

Exercises 
1. Create a simple PDF file without tables (you can use Microsoft Word to create a document and 
save it as a PDF file) and read the text using Python. Print the results.  
2. Create a simple PDF file with tables (you can use Microsoft Word to create a document and save 
it as a PDF file) and read the text using Python. Print the results. 
3. Go through the Microsoft tutorial to create a form processing model using the Microsoft invoice 
samples. Do a “quick test” using the test invoice. Then, after reading about how you can 
incorporate your model in Power Automate, create a simple Power Automate flow that reads 
that test invoice and shows the data fields within it. (There may be a tutorial available from 
Microsoft that shows you how to do this.)  
4. Assuming you installed Tesseract, use pytesseract to read the Bowers text image found in the 
GitHub for Week 3 (week_3\data\bowers.jpg). Then use spaCy to print out the tokens (the text, 
part of speech, and dependency).  


The PDF text was extracted successfully using `pypdf`. Since this file only had regular text, the results are clean and readable. For documents with more complex layouts or visual elements, `pypdf` might miss some structure.
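
The loop above appends each page's text directly to the previous one, so the last word of one page can run into the first word of the next. A minimal sketch of a page-aware variant is shown here. The helper names `join_page_texts` and `read_pdf_text` are hypothetical, and the sketch assumes only that each page object exposes an `extract_text()` method, as `pypdf` pages do.

```python
def join_page_texts(pages, separator="\n\n"):
    """Join the text of each page, skipping pages that yield no text."""
    texts = []
    for page in pages:
        # Treat a missing or empty result as an empty string
        text = (page.extract_text() or "").strip()
        if text:
            texts.append(text)
    return separator.join(texts)


def read_pdf_text(pdf_path):
    """Read all text from a PDF with pypdf (imported lazily inside the function)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_page_texts(reader.pages)
```

Calling `read_pdf_text(pdf_path_no_tables)` would then return the document text with a blank line between pages.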


### Exercise 2: Reading Text from a PDF with Tables (Using tabula-py)

While libraries like `pypdf` and `PyMuPDF` are great for extracting plain text, they don't preserve the structure of tables very well. That's where `tabula-py` comes in: it's specifically designed to extract tabular data from PDFs into a structured format like a DataFrame.

Since this PDF includes a table, I’ll use `tabula-py` to pull the data in a more organized way.


In [16]:
import tabula
import pandas as pd

# File path to the PDF with tables
pdf_path_with_tables = "C:/Users/Weekseey/Documents/Bellevue Work/Text Analytics/Week_3_With_Tables.pdf"

# Extract tables — returns a list of DataFrames
tables = tabula.read_pdf(pdf_path_with_tables, pages='all', multiple_tables=True)

# Show number of tables found and preview the first one
print(f"Number of tables extracted: {len(tables)}")
tables[0].head()

Number of tables extracted: 1


Unnamed: 0,X1,X2,X3,X4
0,14.36055,Jun,-11.0728,asia
1,0.328324,July,4.601376,asia
2,3.824882,Aug,17.35175,asia
3,-6.20102,Aug,6.084073,asia


`tabula-py` was able to extract the table from the PDF and return it as a pandas DataFrame. The table looks well-structured, and I can now work with the data using standard pandas operations.

This worked much better than pulling raw text, especially for a PDF whose content is laid out like a spreadsheet.
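
As an illustration of the kind of pandas operations that become possible once the table is a DataFrame, here is a small sketch. To keep it self-contained, it rebuilds the four previewed rows by hand instead of calling `tabula.read_pdf`. The variable names are illustrative, and treating `X1` and `X3` as numeric and `X2` as a month label is an assumption based on the preview.

```python
import pandas as pd

# Rows copied from the preview above; in the notebook this would be tables[0]
df = pd.DataFrame(
    {
        "X1": [14.36055, 0.328324, 3.824882, -6.20102],
        "X2": ["Jun", "July", "Aug", "Aug"],
        "X3": [-11.0728, 4.601376, 17.35175, 6.084073],
        "X4": ["asia", "asia", "asia", "asia"],
    }
)

# Make sure the numeric columns are numeric; unparseable cells become NaN
for col in ["X1", "X3"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Count rows per month label and average X3 within each month
month_counts = df["X2"].value_counts()
mean_x3_by_month = df.groupby("X2")["X3"].mean()

print(df.dtypes)
print(month_counts)
print(mean_x3_by_month)
```

The `errors="coerce"` step matters more on real extractions, where a stray character in a numeric cell would otherwise leave the whole column as text.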


### Exercise 3: OCR and NLP with pytesseract and spaCy

In this step, I'm using `pytesseract` to perform OCR (Optical Character Recognition) on a text image. The image contains a paragraph of printed text, and my goal is to extract that text, clean up any extraneous characters introduced during OCR, and then use `spaCy` to tokenize the text and identify parts of speech and dependencies.

I'll start by downloading and loading the image, then extracting text with `pytesseract`, and finally processing it with `spaCy`.


In [26]:
import pytesseract
from PIL import Image
import requests
from io import BytesIO

# Tell pytesseract where to find the tesseract.exe
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image from the URL, failing early if the download does not succeed
img_url = "https://raw.githubusercontent.com/bellevue-university/dsc360/main/12%20Week/week_3/bowers.jpg"
response = requests.get(img_url, timeout=30)
response.raise_for_status()
img = Image.open(BytesIO(response.content))

# Extract text from the image
raw_text = pytesseract.image_to_string(img)
print(raw_text)

The Life and Work of
Fredson Bowers

by
G. THOMAS TANSELLE

N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE AGCOM-
plishment and influence cause them to be the symbols of their age;
their careers and oeuvres become the touchstones by which the
field is measured and its history told. In the related pursuits of

analytical and descriptive bibliography, textual criticism, and scholarly
editing, Fredson Bowers was such a figure, dominating the four decades
after 1949, when his Principles of Bibliographical Description was pub-
lished. By 1973 the period was already being called “the age of Bowers”:
in that year Norman Sanders, writing the chapter on textual scholarship
for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to
a section of his essay. For most people, it would be achievement enough
to rise to such a position in a field as complex as Shakespearean textual
studies; but Bowers played an equally important role in other areas.
Editors of nineteenth-centur

The raw OCR text contains line breaks (`\n`), a few misrecognized characters (for example, `AGCOM-` where the source reads `ACCOM-`, and a dropped initial `I` in `N EVERY FIELD`), and words hyphenated across lines. As a first step, I’ll replace the line breaks with spaces so that `spaCy` sees continuous text.


In [29]:
# Basic cleanup: replace newlines with spaces and trim leading/trailing whitespace
clean_text = raw_text.replace('\n', ' ').strip()
print(clean_text)

The Life and Work of Fredson Bowers  by G. THOMAS TANSELLE  N EVERY FIELD OF ENDEAVOR THERE ARE A FEW FIGURES WHOSE AGCOM- plishment and influence cause them to be the symbols of their age; their careers and oeuvres become the touchstones by which the field is measured and its history told. In the related pursuits of  analytical and descriptive bibliography, textual criticism, and scholarly editing, Fredson Bowers was such a figure, dominating the four decades after 1949, when his Principles of Bibliographical Description was pub- lished. By 1973 the period was already being called “the age of Bowers”: in that year Norman Sanders, writing the chapter on textual scholarship for Stanley Wells's Shakespeare: Select Bibliographies, gave this title to a section of his essay. For most people, it would be achievement enough to rise to such a position in a field as complex as Shakespearean textual studies; but Bowers played an equally important role in other areas. Editors of nineteenth-centur

The cleaned text now reads as a single paragraph, although double spaces and hyphenated line-break fragments such as `pub- lished` remain. It is clean enough for spaCy to tokenize and analyze.
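
A more thorough cleanup would also rejoin words split by end-of-line hyphens and collapse repeated whitespace. The following is a minimal sketch of that idea, not the cleanup used in this notebook. The function name `clean_ocr_text` is hypothetical, and dropping every end-of-line hyphen is an assumption that would also merge genuinely hyphenated compounds that happen to break at a line end.

```python
import re


def clean_ocr_text(raw_text):
    """Rejoin words split by end-of-line hyphens and collapse whitespace."""
    # "pub-\nlished" -> "published": drop a hyphen directly before a line break
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw_text)
    # Collapse newlines, tabs, and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Description was pub-\nlished. By 1973 the period\n\nwas already being called"
print(clean_ocr_text(sample))
```

Because the hyphen rule relies on the line break, it has to run on `raw_text` before the newlines are replaced, not on `clean_text`.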


In [32]:
import spacy

# Load the small English pipeline (tokenizer, tagger, and parser)
nlp = spacy.load("en_core_web_sm")
doc = nlp(clean_text)

# Print each token, its part of speech, and its syntactic dependency
for token in doc:
    print(f"{token.text:<15} POS: {token.pos_:<10} DEP: {token.dep_}")

The             POS: DET        DEP: det
Life            POS: PROPN      DEP: nsubj
and             POS: CCONJ      DEP: cc
Work            POS: NOUN       DEP: conj
of              POS: ADP        DEP: prep
Fredson         POS: PROPN      DEP: compound
Bowers          POS: PROPN      DEP: pobj
                POS: SPACE      DEP: dep
by              POS: ADP        DEP: prep
G.              POS: PROPN      DEP: compound
THOMAS          POS: PROPN      DEP: compound
TANSELLE        POS: PROPN      DEP: pobj
                POS: SPACE      DEP: dep
N               POS: PROPN      DEP: cc
EVERY           POS: PROPN      DEP: compound
FIELD           POS: NOUN       DEP: conj
OF              POS: ADP        DEP: prep
ENDEAVOR        POS: PROPN      DEP: pobj
THERE           POS: PRON       DEP: appos
ARE             POS: AUX        DEP: ccomp
A               POS: DET        DEP: det
FEW             POS: ADJ        DEP: amod
FIGURES         POS: NOUN       DEP: npadvmod
WHOSE           POS

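Using spaCy, I tokenized the text and printed each token's part of speech and syntactic dependency. The output also shows how OCR artifacts carry through: the double spaces left by the cleanup appear as `SPACE` tokens, and the all-caps opening line with its missing initial letter is tagged unreliably (for example, `N` is labeled `PROPN` with dependency `cc`). Token-level annotations like these are a key input for downstream tasks such as named entity recognition or sentiment analysis.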
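
The `SPACE` rows in the output come from whitespace tokens. One way to leave them out is to filter on spaCy's `token.is_space` attribute, as in the sketch below. The helper names `token_rows` and `analyze` are hypothetical, and the default model name simply mirrors the one loaded above.

```python
def token_rows(doc):
    """Return (text, pos, dep) tuples, skipping whitespace-only tokens."""
    return [
        (token.text, token.pos_, token.dep_)
        for token in doc
        if not token.is_space
    ]


def analyze(text, model_name="en_core_web_sm"):
    """Run a spaCy pipeline over the text (spaCy is imported lazily here)."""
    import spacy

    nlp = spacy.load(model_name)
    return token_rows(nlp(text))
```

For example, `for text, pos, dep in analyze(clean_text): print(f"{text:<15} POS: {pos:<10} DEP: {dep}")` would print the same table as above without the blank rows.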
