# Parsing Choices

## Uploading Only the Document

#### Document Loader into Text

In [1]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("../data/sample_returns/dummy1.pdf")
documents = loader.load()

# Each element of `documents` is one page; print the first page
print(documents[0].page_content)

# Alternatively, concatenate all pages into a single string
full_text = "\n\n".join(doc.page_content for doc in documents)

GOLDEN STATE ACCOUNTING INC.
1221 BRIDGEWAY SUITE 2
SAUSALITO, CA 94965
415-331-9900
May 31, 2024
Joseph W and Stacy T Smith
16023 Via Del Alba
Rancho Santa Fe, CA 92067
Dear Joe and Stacy, 
Your 2023 Federal Individual Income Tax return will be
electronically filed with the Internal Revenue Service upon receipt
of a signed Form 8879 - IRS e-file Signature Authorization.  There
is a balance due of $700.  
Make your check payable to the "United States Treasury" and mail
your Form 1040-V payment voucher on or before April 15, 2024 to: 
INTERNAL REVENUE SERVICE
P.O. BOX 802501
CINCINNATI, OH 45280-2501
The deductible contribution to your spouse's Health Savings Account
for 2023 is $5,350.  To ensure that your spouse's contribution is
allowable, $5,350 must be deposited to your spouse's account on or
before April 15, 2024.  
Your 2023 California Individual Income Tax Return will be
electronically filed with the Franchise Tax Board upon receipt of a
signed Form 8879 - California e-file Sign
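
When all pages are concatenated into one string, the page boundaries are lost. A minimal sketch of one way to keep them is shown here; it assumes each loaded page exposes `page_content` and a zero-based `"page"` entry in `metadata`. The `Page` class and the `join_pages` helper are hypothetical stand-ins so the example runs without a PDF.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def join_pages(pages, separator="\n\n"):
    """Concatenate page texts, prefixing each with a 1-based page marker."""
    parts = []
    for index, page in enumerate(pages):
        # Fall back to the list position if no "page" key is present
        number = page.metadata.get("page", index)
        parts.append(f"[page {number + 1}]\n{page.page_content}")
    return separator.join(parts)


sample = [
    Page("Dear Joe and Stacy,", {"page": 0}),
    Page("There is a balance due of $700.", {"page": 1}),
]
full_text = join_pages(sample)
print(full_text)
```

The markers make it possible to trace an excerpt back to its page later, at the cost of a few extra characters per page.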

#### OCR for Images

In [15]:
from langchain_community.document_loaders import UnstructuredPDFLoader

# TODO: not working yet; we don't have a scanned PDF example
pdf_image = "../data/sample_returns/dummy1_scanned.pdf"
loader = UnstructuredPDFLoader(pdf_image, mode="elements", strategy="ocr_only")
documents = loader.load()

print(documents[0].page_content)

ImportError: unstructured package not found, please install it with `pip install unstructured`
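
The ImportError above comes from a missing optional dependency. A small sketch of checking for it before running the OCR cell is shown here; `missing_packages` is a hypothetical helper, and the check only covers importable Python packages. OCR strategies may need further system dependencies that this does not detect.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "unstructured" is the package named in the ImportError above
missing = missing_packages(["unstructured"])
if missing:
    print("Install before running the OCR cell:", ", ".join(missing))
else:
    print("OCR dependencies found; the loader cell can be attempted.")
```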

## Parsing and Embedding the PDF Text (To Be Used as a Tool for AgentState?)

In [4]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

True

In [8]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

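pdf = "../data/sample_returns/dummy1.pdf"

# Load the PDF (one Document per page)
loader = PyPDFLoader(pdf)
documents = loader.load()

# Split the pages into overlapping chunks of up to 1000 characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

chunks = splitter.split_documents(documents)

# Embed each chunk and index the vectors in an in-memory FAISS store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = FAISS.from_documents(chunks, embeddings)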
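
To get a feel for what `chunk_size` and `chunk_overlap` do, here is a deliberately simplified sketch using fixed-size character windows. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraphs and lines); `sliding_window_chunks` is a hypothetical helper for illustration only.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(demo, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece repeats the last three characters of the previous one, so a value that straddles a boundary still appears whole in at least one chunk, provided it is no longer than the overlap.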

#### Tool for Searching Tax Return for AgentState

In [6]:
from langchain_core.tools import tool

@tool
def semantic_doc_search(query: str) -> str:
    """
    Search the embedded tax document for relevant information.
    Returns relevant excerpts to help answer the user's question.
    """

    results = vectorstore.similarity_search(query, k=3)
    return "\n\n".join([doc.page_content for doc in results])
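
Conceptually, `similarity_search(query, k=3)` embeds the query and returns the three chunks whose vectors are closest to it. A toy sketch of that lookup is shown here, using made-up 2-D vectors and Euclidean distance. The vectors, labels, and the `top_k` helper are all hypothetical, and the distance metric is an assumption about how the index is configured.

```python
import math


def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k vectors closest to query_vec (Euclidean)."""
    distances = [
        (math.dist(query_vec, vec), index)
        for index, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [index for _, index in distances[:k]]


# Toy 2-D "embeddings"; real embeddings have many more dimensions
toy_chunks = ["total income table", "cover letter", "HSA deduction", "payment voucher"]
toy_vectors = [(0.9, 0.1), (0.1, 0.9), (0.7, 0.3), (0.2, 0.8)]
query_vector = (1.0, 0.0)

nearest = top_k(query_vector, toy_vectors, k=3)
print("\n\n".join(toy_chunks[i] for i in nearest))
```

With `k=3` the tool always returns three excerpts, even when only one is relevant, so the agent consuming the result still has to decide which excerpt answers the question.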

In [7]:
response = semantic_doc_search.invoke({"query": "What is the reported total income?"})
print(response)

2023 2022 DIFF
INCOME
 WAGES, SALARIES, TIPS, ETC. . . . . . . . . . . . . . . . . . . . . 266,350 249,408 16,942
 INTEREST INCOME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9,553 13,389 -3,836
 DIVIDEND INCOME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23,877 23,931 -54
 BUSINESS INCOME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 0 3,196 -3,196
 CAPITAL GAIN OR LOSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . -3,000 -3,000 0
 RENT, ROYALTY, PARTNERSHIP, SCORP, TRUST -10,744 589 -11,333
 TOTAL INCOME. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 286,036 287,513 -1,477
ADJUSTMENTS TO INCOME
 HEALTH SAVINGS ACCOUNT DEDUCTION. . . . . . . . . . . . 5,350 0 5,350
 DEDUCTIBLE PART OF SELF-EMPLOYMENT TAX. . . 0 43 -43
 TOTAL ADJUSTMENTS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5,350 43 5,307

2023 2022 DIFF
INCOME
 WAGE