#### What does LangChain do?
```document-parser``` is a tool that splits a structured document into smaller chunks using ```regex``` and ```langchain```. LangChain is an open-source framework for building LLM applications; here it is used to embed each chunk (turn the text into a vector representation) and store those embeddings in a vector database. This makes it possible to search and retrieve information from a large document, and may have significant use cases in GPT/LLM evaluations. A typical retrieval flow works as follows: a user writes a query, a vector representation of that query is used to run a similarity search inside the vector database, and the most relevant chunks of information are fetched. Those chunks are fed into the LLM together with the initial query, so the LLM has both the question and the relevant context when it generates a response. This response is then sent back to the user.
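
The similarity-search step described above can be sketched without any external service. This is a minimal illustration only: it assumes tiny hand-made 3-dimensional vectors in place of real embeddings and a plain Python list in place of the vector database. The names `toy_store`, `cosine_similarity`, and `similarity_search` are hypothetical and are not part of LangChain or Pinecone.

```python
import math

# Hypothetical stand-in for a vector database: (chunk text, embedding) pairs.
# The 3-dimensional vectors are made up for illustration; real embeddings
# have hundreds or thousands of dimensions.
toy_store = [
    ("Term Date: The date on which a contract or agreement is set to expire or terminate.", [0.9, 0.1, 0.0]),
    ("Collateral: Assets pledged by a borrower to secure a loan or other credit.", [0.1, 0.9, 0.1]),
    ("Obligee: The party in a contract to whom an obligation is owed.", [0.0, 0.2, 0.9]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(query_vector, store, k=2):
    """Return the k chunk texts whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up query vector that points roughly in the "Term Date" direction.
query_vector = [0.8, 0.2, 0.1]
top_chunks = similarity_search(query_vector, toy_store, k=1)
print(top_chunks)
```

In the real pipeline the embedding model produces the vectors and the vector database performs the ranking; the retrieved chunk texts are what would be handed to the LLM alongside the query.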

In [3]:
import sys
from os import getenv

from dotenv import load_dotenv

# print the Python version used by this kernel
print(sys.version)

# load variables from a local .env file into the environment
load_dotenv()

# retrieve the API keys from the environment variables
OPENAI_API_KEY = getenv("OPENAI_API_KEY")
PINECONE_API_KEY = getenv("PINECONE_API_KEY")
PINECONE_INDEX_NAME = getenv("PINECONE_INDEX_NAME")

3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]


In [4]:
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
pc = Pinecone(api_key=PINECONE_API_KEY)

In [6]:
from langchain_openai import OpenAIEmbeddings

# OpenAIEmbeddings reads OPENAI_API_KEY from the environment by default
embeddings = OpenAIEmbeddings()

Convert the Word document into a LangChain Document object.

In [4]:
%pip install --upgrade --quiet  docx2txt
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("./docs/business_terms.docx")

data = loader.load()

data

Note: you may need to restart the kernel to use updated packages.


[Document(metadata={'source': './docs/business_terms.docx'}, page_content="Title: An official designation of an individual's position or job role within an organization.\n\nTerm Date: The date on which a contract or agreement is set to expire or terminate.\n\nGuarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default.\n\nApplicant: A person who applies for something, such as a job, a loan, or admission to an institution.\n\nObligor: The party in a contract who is obligated to provide payment or services.\n\nObligee: The party in a contract to whom an obligation is owed.\n\nCollateral: Assets pledged by a borrower to secure a loan or other credit.\n\nLien: A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.\n\nIndemnity: A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.\

The Document object has a page_content attribute that contains all the text from the document. Next, we parse this text into smaller chunks so that each keyword and its definition becomes a separate entry, and later a separate Document object. Afterwards we'll put the chunks in our vector database.

In [5]:
import regex as re

# the string containing keywords and definitions
page_content = data[0].page_content

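# regex that'll match a keyword followed by a colon and a definition
# (the variable-width lookbehind needs the third-party `regex` module)
# note: the keyword character class has no "-", so a hyphenated keyword is not
# matched on its own and stays inside the previous definition
pattern = r"(?<=^|\n)([\w\s/()]+?):\s*(.*?)(?=\n[\w\s/()]+?:|\Z)"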

# find all matches of the pattern in the page_content string
matches = re.findall(pattern, page_content, re.DOTALL)

# Convert matches to a dictionary
keywords_definitions = {keyword.strip(): definition.strip() for keyword, definition in matches}

# Example usage: print each keyword and its definition
for keyword, definition in keywords_definitions.items():
    print(f"{keyword}: {definition}\n")

Title: An official designation of an individual's position or job role within an organization.

Term Date: The date on which a contract or agreement is set to expire or terminate.

Guarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default.

Applicant: A person who applies for something, such as a job, a loan, or admission to an institution.

Obligor: The party in a contract who is obligated to provide payment or services.

Obligee: The party in a contract to whom an obligation is owed.

Collateral: Assets pledged by a borrower to secure a loan or other credit.

Lien: A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.

Indemnity: A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.

Arbitration: A method of dispute resolution where an impartial third party, the arbitrator
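
The effect of the keyword character class can be checked on a small sample. The sketch below is an illustration under assumptions, not the notebook's own code: it uses the standard-library `re` with `^` under `re.MULTILINE` in place of the lookbehind, the definitions in `sample` are abbreviated for the example, and `parse`, `original_class`, and `extended_class` are hypothetical names.

```python
import re

# Abbreviated sample text in the same "Keyword: definition" layout,
# with entries separated by blank lines.
sample = (
    "Lien: A legal right held by a lender.\n\n"
    "Non-Disclosure Agreement (NDA): A confidentiality contract.\n\n"
    "Tenure: The period during which an office is held."
)

# Keyword character class as used above (no hyphen) and a variant that adds "-".
original_class = r"[\w\s/()]+?"
extended_class = r"[\w\s/()-]+?"


def parse(text, keyword_class):
    """Split text into {keyword: definition} using the given keyword class."""
    pattern = rf"^({keyword_class}):\s*(.*?)(?=\n{keyword_class}:|\Z)"
    matches = re.findall(pattern, text, re.DOTALL | re.MULTILINE)
    return {keyword.strip(): definition.strip() for keyword, definition in matches}


without_hyphen = parse(sample, original_class)
with_hyphen = parse(sample, extended_class)

print(sorted(without_hyphen))  # the hyphenated keyword is not a key here
print(sorted(with_hyphen))     # all three keywords are keys here
```

With the hyphen-free class, the hyphenated entry ends up inside the definition of the entry before it rather than being dropped, so no text is lost, but it is not yet available as its own key.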

Export all the keywords and definitions to a .txt file, one entry per line with a blank line between entries. The goal is to avoid losing any entries: a plain text file in this layout is easier to load back in, line by line.

In [6]:
# Open a new .txt file in write mode
with open('keywords_definitions.txt', 'w') as file:
    # Iterate through the keywords_definitions dictionary
    for keyword, definition in keywords_definitions.items():
        # Write the keyword and its definition to the file
        file.write(f"{keyword}: {definition}\n\n")

Load in the text file and parse the data into a dictionary again.

In [7]:
def read_keywords_from_file(file_path):
    keywords_dict = {}
    with open(file_path, 'r') as file:
        for line in file:
            if ':' in line:  # Ensure the line contains a colon
                keyword, definition = line.split(':', 1)  # Split on the first colon only
                keywords_dict[keyword.strip()] = definition.strip()
    return keywords_dict

# Example usage
file_path = 'keywords_definitions.txt'
keywords_definitions = read_keywords_from_file(file_path)
print(keywords_definitions)

{'Title': "An official designation of an individual's position or job role within an organization.", 'Term Date': 'The date on which a contract or agreement is set to expire or terminate.', 'Guarantor': "An individual or entity that agrees to be responsible for another's debt or obligations in case of default.", 'Applicant': 'A person who applies for something, such as a job, a loan, or admission to an institution.', 'Obligor': 'The party in a contract who is obligated to provide payment or services.', 'Obligee': 'The party in a contract to whom an obligation is owed.', 'Collateral': 'Assets pledged by a borrower to secure a loan or other credit.', 'Lien': "A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.", 'Indemnity': 'A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.', 'Arbitration': 'A method of dispute resolution where 
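
One way to gain confidence that the export and re-import lose nothing is a round-trip check on a small sample. This is a sketch under assumptions: `write_entries` and `read_entries` are hypothetical helpers that mirror the two cells above, the explicit `encoding="utf-8"` is an addition, and a temporary directory is used so no real file is touched.

```python
import os
import tempfile


def write_entries(entries, file_path):
    """Write each entry as 'keyword: definition' followed by a blank line."""
    with open(file_path, "w", encoding="utf-8") as file:
        for keyword, definition in entries.items():
            file.write(f"{keyword}: {definition}\n\n")


def read_entries(file_path):
    """Read the file line by line, splitting each entry on its first colon."""
    entries = {}
    with open(file_path, "r", encoding="utf-8") as file:
        for line in file:
            if ":" in line:
                keyword, definition = line.split(":", 1)
                entries[keyword.strip()] = definition.strip()
    return entries


sample_entries = {
    "Term Date": "The date on which a contract or agreement is set to expire or terminate.",
    "Obligee": "The party in a contract to whom an obligation is owed.",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keywords_definitions.txt")
    write_entries(sample_entries, path)
    round_tripped = read_entries(path)

# True when every keyword and definition survived the round trip
print(round_tripped == sample_entries)
```

Because the reader works line by line, any line that contains a colon is treated as a new entry, which is worth keeping in mind for definitions that span several lines.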

As we can see, "Non-Disclosure Agreement (NDA)" is now a key in the dictionary: reading the file line by line picks it up as its own entry. At this point every definition from the document is accounted for. We can now export each keyword and definition to a separate .yaml file.

In [8]:
print(keywords_definitions['Non-Disclosure Agreement (NDA)'])

A legally binding contract establishing a confidential relationship between parties to protect any type of confidential and proprietary information or trade secrets.


In [9]:
import os
import yaml

def sanitize_filename(filename):
    """Remove characters that are not allowed in filenames and cap the length at 255."""
    return re.sub(r'[\\/*?:"<>|]', "", filename)[:255]

# make sure the output directory exists before writing to it
os.makedirs("./results", exist_ok=True)

print(len(keywords_definitions))
for keyword, definition in keywords_definitions.items():
    filename = "./results/" + sanitize_filename(keyword) + '.yaml'
    with open(filename, 'w') as file:
        print(f"Creating file: {filename}")
        yaml.dump({keyword: definition}, file, default_flow_style=False)

18
Creating file: ./results/Title.yaml
Creating file: ./results/Term Date.yaml
Creating file: ./results/Guarantor.yaml
Creating file: ./results/Applicant.yaml
Creating file: ./results/Obligor.yaml
Creating file: ./results/Obligee.yaml
Creating file: ./results/Collateral.yaml
Creating file: ./results/Lien.yaml
Creating file: ./results/Indemnity.yaml
Creating file: ./results/Arbitration.yaml
Creating file: ./results/Breach of Contract.yaml
Creating file: ./results/Due Diligence.yaml
Creating file: ./results/Force Majeure.yaml
Creating file: ./results/Intellectual Property.yaml
Creating file: ./results/Non-Disclosure Agreement (NDA).yaml
Creating file: ./results/Partnership.yaml
Creating file: ./results/Shareholder.yaml
Creating file: ./results/Tenure.yaml
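
Since the filename is derived from the keyword after stripping characters, two different keywords could in principle map to the same file and overwrite each other. The check below is a minimal sketch of how that could be detected before writing; `find_collisions` is a hypothetical helper, and the keyword "Term Date?" is made up to force a collision.

```python
import re
from collections import defaultdict


def sanitize_filename(filename):
    """Remove characters that are not allowed in filenames and cap the length at 255."""
    return re.sub(r'[\\/*?:"<>|]', "", filename)[:255]


def find_collisions(keywords):
    """Group keywords by sanitized filename and return groups with more than one keyword."""
    by_filename = defaultdict(list)
    for keyword in keywords:
        by_filename[sanitize_filename(keyword) + ".yaml"].append(keyword)
    return {name: group for name, group in by_filename.items() if len(group) > 1}


# "Term Date?" is a made-up keyword that sanitizes to the same name as "Term Date".
keywords = ["Term Date", "Non-Disclosure Agreement (NDA)", "Term Date?"]
collisions = find_collisions(keywords)
print(collisions)
```

An empty result means every keyword gets its own file.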


In [9]:
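from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_pinecone import PineconeVectorStore

# load the exported text file as a single Document
loader = TextLoader("keywords_definitions.txt")
documents = loader.load()

# split on blank lines (the default "\n\n" separator); an entry longer than
# chunk_size cannot be split further, which triggers the warnings below
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=10)
docs = text_splitter.split_documents(documents)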

Created a chunk of size 118, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 125, which is longer than the specified 100
Created a chunk of size 153, which is longer than the specified 100
Created a chunk of size 150, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 136, which is longer than the specified 100
Created a chunk of size 149, which is longer than the specified 100
Created a chunk of size 152, which is longer than the specified 100
Created a chunk of size 197, which is longer than the specified 100
Created a chunk of size 130, which is longer than the specified 100
Created a chunk of size 123, which is longer than the specified 100
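
The warnings above appear because a single entry can be longer than the requested chunk size. The following is a simplified illustration of that situation, not LangChain's actual algorithm: it only assumes that the text is separated on blank lines and checks which pieces exceed the limit. `CHUNK_SIZE`, `pieces`, and `oversized` are names chosen for this sketch.

```python
# Same limit as the chunk_size used above.
CHUNK_SIZE = 100

# Two entries from the exported file, separated by a blank line.
sample_text = (
    "Obligee: The party in a contract to whom an obligation is owed.\n\n"
    "Guarantor: An individual or entity that agrees to be responsible "
    "for another's debt or obligations in case of default.\n\n"
)

# Separate on blank lines and drop empty pieces.
pieces = [piece for piece in sample_text.split("\n\n") if piece.strip()]

# Pieces longer than the limit cannot be shortened by this kind of split.
oversized = [piece for piece in pieces if len(piece) > CHUNK_SIZE]

for piece in pieces:
    status = "over" if len(piece) > CHUNK_SIZE else "within"
    print(f"{len(piece):>4} chars ({status} the limit): {piece[:30]}...")
```

For this data set the oversized chunks are harmless, because the goal is exactly one definition per chunk; a larger chunk size would silence the warnings but could merge several short definitions into one chunk.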


Create a Pinecone vector store object from the documents

In [11]:
vectorstore_from_texts = PineconeVectorStore.from_documents(
    docs,
    index_name=PINECONE_INDEX_NAME,
    embedding=embeddings,
)

In [12]:
print(docs)

[Document(metadata={'source': 'keywords_definitions.txt', 'text': "Title: An official designation of an individual's position or job role within an organization."}, page_content="Title: An official designation of an individual's position or job role within an organization."), Document(metadata={'source': 'keywords_definitions.txt', 'text': 'Term Date: The date on which a contract or agreement is set to expire or terminate.'}, page_content='Term Date: The date on which a contract or agreement is set to expire or terminate.'), Document(metadata={'source': 'keywords_definitions.txt', 'text': "Guarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default."}, page_content="Guarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default."), Document(metadata={'source': 'keywords_definitions.txt', 'text': 'Applicant: A person who applies for something, such as a job, a loan, or admiss

Now let's get connected to our vector database!

In [13]:
vectorstore = PineconeVectorStore(index_name=PINECONE_INDEX_NAME, embedding=embeddings)
#vectorstore.add_texts(["More text to embed and add to the index!"])

In [14]:
query = "What is the definition of a term date?"
vectorstore.similarity_search(query)

[Document(metadata={'source': 'keywords_definitions.txt'}, page_content='Term Date: The date on which a contract or agreement is set to expire or terminate.'),
 Document(metadata={'source': 'keywords_definitions.txt'}, page_content='Term Date: The date on which a contract or agreement is set to expire or terminate.'),
 Document(metadata={'source': 'keywords_definitions.txt'}, page_content='Tenure: The period or term during which an individual holds a position or office.'),
 Document(metadata={'source': 'keywords_definitions.txt'}, page_content='Tenure: The period or term during which an individual holds a position or office.')]
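
The search output above lists each chunk twice. One possible way to post-process such results is to drop repeated texts while keeping the ranking order. This is a sketch only: `deduplicate` is a hypothetical helper that works on the page_content strings, and it does not remove duplicate vectors from the index itself.

```python
def deduplicate(results):
    """Drop repeated chunk texts while keeping the original ranking order."""
    return list(dict.fromkeys(results))


# page_content strings in the order returned by the similarity search above
results = [
    "Term Date: The date on which a contract or agreement is set to expire or terminate.",
    "Term Date: The date on which a contract or agreement is set to expire or terminate.",
    "Tenure: The period or term during which an individual holds a position or office.",
    "Tenure: The period or term during which an individual holds a position or office.",
]

unique_results = deduplicate(results)
for text in unique_results:
    print(text)
```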

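This appears to be a success. The document was parsed and split into smaller chunks, the chunks were stored in a vector database, and a similarity search for our question returned the matching "Term Date" definition first. Each result appears twice, which suggests that the chunks were added to the index more than once, for example by running the from_documents cell repeatedly. The retrieved chunks have not yet been passed to an LLM to generate an answer, but for the parsing and retrieval steps this is a successful implementation of our tool.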