# Multiple document extraction

This notebook shows how to extract text from a list of files of different types using the Unstructured API.

**Table of contents**<a id='toc0_'></a>    
- [Utility functions](#toc1_1_)    
- [Read files](#toc2_)    
- [Load and split files](#toc3_)    
- [Embedding & Storage](#toc4_)    


In [1]:
import sys
sys.path.append('../')
import os
import glob
from dotenv import load_dotenv
from langchain.document_loaders import UnstructuredAPIFileLoader
from tqdm.autonotebook import trange

load_dotenv("export.env")


True

## <a id='toc1_1_'></a>[Utility functions](#toc0_)

In [2]:
# Split the text into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
def get_text_splitter(chunk_size = 200, chunk_overlap = 20, separators= ["\n\n\n","\n\n","\n"," "]):
    """
    Returns a text splitter that splits text into chunks with some overlap based on separators.

    Args:
        chunk_size (int): The maximum number of characters in each chunk.
        chunk_overlap (int): The number of characters to overlap between chunks.
        separators (List[str]): A list of strings that delineate the boundaries of chunks.

    Returns:
        RecursiveCharacterTextSplitter: A text splitter object that splits text into chunks.
    """
    text_splitter = RecursiveCharacterTextSplitter(
            # Chunk size and overlap are measured in characters (length_function = len).
            chunk_size = chunk_size,
            chunk_overlap  = chunk_overlap,
            length_function = len,
            add_start_index = True,
            separators = separators
        )
    return text_splitter

# Extract text and metadata from a file
def get_text_and_metadata(file, splitter_param=None):
    """
    Loads a file with UnstructuredAPIFileLoader and splits it into chunks.

    Each chunk is returned as a Document whose page_content holds the text and
    whose metadata holds the source file path and the chunk's start index.
    The splitter can be configured through splitter_param, a dictionary with
    the keys 'chunk_size', 'chunk_overlap', and 'separators'. If splitter_param
    is None, the default splitter is used.

    Args:
        file (str): The file path.
        splitter_param (dict, optional): Parameters for the splitter. Defaults to None.

    Returns:
        List[Document]: The chunks of the file, with text and metadata.
    """
    loader = UnstructuredAPIFileLoader(file, 
                                       api_key=os.environ.get("UNSTRUCTURED_API_KEY"), 
                                       url=os.environ.get("UNSTRUCTURED_URL"))
    if splitter_param is not None:
        text_splitter = get_text_splitter(splitter_param['chunk_size'], 
                                          splitter_param['chunk_overlap'], 
                                          splitter_param['separators'])
    else:
        text_splitter = get_text_splitter()
    docs = loader.load_and_split(text_splitter = text_splitter)
    return docs

# Read the files and extract text + metadata
def get_data_for_splitting(files, splitter_parameter=None):
    """
    Loads and splits every file in a list and returns all chunks in one list.

    Each chunk is a Document holding the text and metadata of part of a file.
    Splitters can be configured per file through splitter_parameter, a dictionary
    that maps a file path to a dictionary with the keys 'chunk_size',
    'chunk_overlap', and 'separators'. Files without an entry, and all files
    when splitter_parameter is None, use the default splitter.

    Args:
        files (List[str]): A list of file paths.
        splitter_parameter (dict, optional): Per-file splitter parameters, keyed by file path. Defaults to None.

    Returns:
        List[Document]: The chunks of all files, in the order the files were given.
    """
    docs_list = []
    for file in files:
        # Look up this file's parameters; None falls back to the default splitter
        file_param = splitter_parameter.get(file) if splitter_parameter else None
        docs_list.extend(get_text_and_metadata(file, file_param))
    return docs_list
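
The `splitter_parameter` argument of `get_data_for_splitting` is looked up by file path, so it has to map each path to its own dictionary of `chunk_size`, `chunk_overlap`, and `separators`. Below is a minimal sketch of one way to build such a mapping, assuming parameters are chosen by file extension. The helper name `build_splitter_parameters`, the `DEFAULT_PARAMS` constant, and the CSV override values are hypothetical and not part of the notebook.

```python
import os

# Hypothetical defaults, mirroring the default arguments of get_text_splitter.
DEFAULT_PARAMS = {
    "chunk_size": 200,
    "chunk_overlap": 20,
    "separators": ["\n\n\n", "\n\n", "\n", " "],
}

def build_splitter_parameters(files, overrides_by_ext=None):
    """Map every file path to a complete splitter parameter dictionary."""
    overrides_by_ext = overrides_by_ext or {}
    parameters = {}
    for path in files:
        ext = os.path.splitext(path)[1].lower()
        # Start from the defaults, then apply any override for this extension.
        parameters[path] = {**DEFAULT_PARAMS, **overrides_by_ext.get(ext, {})}
    return parameters

example_files = [
    "sample_data/sample_files/handbook-1p.docx",
    "sample_data/sample_files/customers-100.csv",
]
# Illustrative override: larger, line-based chunks for CSV files.
splitter_parameter = build_splitter_parameters(
    example_files,
    overrides_by_ext={".csv": {"chunk_size": 500, "separators": ["\n"]}},
)
```

The resulting dictionary can be passed as the second argument, as in `get_data_for_splitting(example_files, splitter_parameter)`.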



# <a id='toc2_'></a>[Read files](#toc0_)

In [3]:
folder_path = 'sample_data/sample_files'
file_paths = glob.glob(f'{folder_path}/*')

# <a id='toc3_'></a>[Load and split files](#toc0_)

In [4]:
# No per-file parameters are passed, so every file uses the default splitter
docs = get_data_for_splitting(file_paths)
docs[:20]

[Document(page_content='US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION\n\nA.\tPURPOSE', metadata={'source': 'sample_data/sample_files/handbook-1p.docx', 'start_index': 0}),
 Document(page_content='The United States Trustee appoints and supervises standing trustees and monitors and supervises cases under chapter 13 of title 11 of the United States Code.  28 U.S.C. § 586(b).  The Handbook,', metadata={'source': 'sample_data/sample_files/handbook-1p.docx', 'start_index': 84}),
 Document(page_content='The Handbook, issued as part of our duties under 28 U.S.C. § 586, establishes or clarifies the position of the United States Trustee Program (Program) on the duties owed by a standing trustee to the', metadata={'source': 'sample_data/sample_files/handbook-1p.docx', 'start_index': 264}),
 Document(page_content='trustee to the debtors, creditors, other parties in interest, and the United States Trustee.  The Handbook does not present a full and complete statement o

In [5]:
docs[50:60]

[Document(page_content="24\n010468dAA11382c\nJanet\nValenzuela\nWatts-Donaldson\nVeronicamouth\nLao People's Democratic Republic\n354.259.5062x7538\n500.433.2022\nstefanie71@spence.com\n2020-09-08\nhttps://moreno.biz/", metadata={'source': 'sample_data/sample_files/customers-100.csv', 'start_index': 4096}),
 Document(page_content='25\neC1927Ca84E033e\nShane\nWilcox\nTucker LLC\nBryanville\nAlbania\n(429)005-9030x11004\n541-116-4501\nmariah88@santos.com\n2021-04-06\nhttps://www.ramos.com/', metadata={'source': 'sample_data/sample_files/customers-100.csv', 'start_index': 4281}),
 Document(page_content='26\n09D7D7C8Fe09aea\nMarcus\nMoody\nGiles Ltd\nKaitlyntown\nPanama\n674-677-8623\n909-277-5485x566\ndonnamullins@norris-barrett.org\n2022-05-24\nhttps://www.curry.com/', metadata={'source': 'sample_data/sample_files/customers-100.csv', 'start_index': 4432}),
 Document(page_content='27\naBdfcF2c50b0bfD\nDakota\nPoole\nSimmons Group\nMichealshire\nBelarus\n(371)987-8576x4720\n071-152-1376\ns
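
Because every chunk keeps its origin in `metadata['source']`, the split can be sanity-checked by counting chunks per file. Below is a minimal sketch. The `Chunk` dataclass is a hypothetical stand-in for LangChain's `Document` so that the example runs without calling the API, and the sample chunks are shortened, illustrative values rather than real output.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def chunks_per_source(docs):
    """Count how many chunks each source file produced."""
    return Counter(doc.metadata.get("source", "<unknown>") for doc in docs)

# Illustrative chunks only; in the notebook, pass the real `docs` list instead.
sample_docs = [
    Chunk("US Trustee Handbook", {"source": "handbook-1p.docx", "start_index": 0}),
    Chunk("The United States Trustee", {"source": "handbook-1p.docx", "start_index": 84}),
    Chunk("24 Janet Valenzuela", {"source": "customers-100.csv", "start_index": 4096}),
]

counts = chunks_per_source(sample_docs)
for source, n in counts.most_common():
    print(f"{source}: {n} chunks")
```

A file that produces no chunks at all, or far more than expected, is usually a sign that its separators or chunk size need adjusting.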

# <a id='toc4_'></a>[Embedding & Storage](#toc0_)

In [6]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceInstructEmbeddings

# Use Hugging Face embeddings and FAISS to store the embedding vectors
encode_kwargs = {'normalize_embeddings': True}
embd_model = HuggingFaceInstructEmbeddings(model_name='intfloat/e5-large-v2',
                                           embed_instruction="", # no instruction is needed for candidate passages
                                           query_instruction="Represent this sentence for searching relevant passages: ",
                                           encode_kwargs=encode_kwargs)
vectorstore = FAISS.from_documents(documents=docs, embedding=embd_model)

load INSTRUCTOR_Transformer
max_seq_length  512
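
Setting `normalize_embeddings` to `True` gives every embedding unit length. For unit vectors the squared Euclidean distance equals `2 - 2 * cosine similarity`, so ranking by L2 distance and ranking by cosine similarity produce the same order. Below is a small pure-Python sketch of that identity. It uses made-up 3-dimensional vectors rather than real model output, and it assumes the vector store ranks results by Euclidean distance.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors standing in for a query embedding and three passage embeddings.
query = normalize([1.0, 2.0, 2.0])
passages = {
    "a": normalize([2.0, 4.0, 4.1]),
    "b": normalize([-1.0, 0.5, 3.0]),
    "c": normalize([3.0, -1.0, 0.0]),
}

# Highest cosine similarity first versus smallest distance first.
by_cosine = sorted(passages, key=lambda k: dot(query, passages[k]), reverse=True)
by_l2 = sorted(passages, key=lambda k: squared_l2(query, passages[k]))
print(by_cosine == by_l2)
```

With the store built above, a query such as `vectorstore.similarity_search("duties of a standing trustee", k=3)` would return the three closest chunks; the query text here is only an illustration.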
