#### What does LangChain do?
```document-parser``` is a tool that splits a structured document into smaller chunks using ```regex```, embeds each chunk as a vector representation of its text, and stores those vectors in a vector database. ```langchain``` is the open-source framework that connects these pieces: the document loader, the embedding model, the vector database, and the LLM. This is useful for searching and retrieving information from a large document, and may have significant use cases in GPT/LLM evaluations. At query time, the user's query is converted into a vector, and that vector is used to run a similarity search against the vector database. The search returns the most relevant chunks, which are passed to the LLM together with the original query. The LLM then generates a response from both, and that response is sent back to the user.
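
The retrieval flow described above can be sketched without any external service. The example below is a toy illustration, not LangChain's implementation: `toy_embed` is a hypothetical stand-in for a real embedding model (it only counts words from a small fixed vocabulary), and the "vector database" is a plain Python list. It shows only the similarity-search step that selects which chunk is placed in the prompt.

```python
import math
import re

# Hypothetical vocabulary; a real embedding model produces learned dense vectors instead.
VOCAB = ["loan", "debt", "contract", "job", "borrower", "expire"]

def toy_embed(text):
    """Stand-in for an embedding model: count occurrences of each vocabulary word."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

chunks = [
    "Collateral: Assets pledged by a borrower to secure a loan.",
    "Title: An official designation of a job role.",
    "Term Date: The date on which a contract is set to expire.",
]

# The "vector database": each chunk stored next to its vector.
store = [(chunk, toy_embed(chunk)) for chunk in chunks]

def retrieve(query, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vector = toy_embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:k]]

query = "What secures a loan for the borrower?"
context = retrieve(query)

# The retrieved chunk and the original query together form the prompt for the LLM.
prompt = f"Context:\n{context[0]}\n\nQuestion: {query}"
print(prompt)
```

In the real pipeline the embedding model and the vector database replace `toy_embed` and `store`, but the ranking idea is the same.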

In [2]:
import sys
print(sys.version)
from dotenv import load_dotenv
load_dotenv()

from os import getenv

# retrieve the API keys from the environment variables
OPENAI_API_KEY = getenv("OPENAI_API_KEY")
PINECONE_API_KEY = getenv("PINECONE_API_KEY")
PINECONE_INDEX_NAME = getenv("PINECONE_INDEX_NAME")

3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:27:10) [MSC v.1938 64 bit (AMD64)]


In [3]:
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
pc = Pinecone(api_key=PINECONE_API_KEY)



In [4]:
from langchain_pinecone import PineconeVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter

embeddings = OpenAIEmbeddings()

Convert the Word document into a LangChain Document object.

In [5]:
%pip install --upgrade --quiet  docx2txt
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("./docs/business_terms.docx")

data = loader.load()

data

Note: you may need to restart the kernel to use updated packages.


[Document(metadata={'source': './docs/business_terms.docx'}, page_content="Title: An official designation of an individual's position or job role within an organization.\n\nTerm Date: The date on which a contract or agreement is set to expire or terminate.\n\nGuarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default.\n\nApplicant: A person who applies for something, such as a job, a loan, or admission to an institution.\n\nObligor: The party in a contract who is obligated to provide payment or services.\n\nObligee: The party in a contract to whom an obligation is owed.\n\nCollateral: Assets pledged by a borrower to secure a loan or other credit.\n\nLien: A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.\n\nIndemnity: A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.\

The Document object has a page_content attribute that contains all the text from the document. Next, we parse this text into smaller chunks so that each keyword and its definition becomes a separate Document object. Afterwards, we'll store those objects in our vector database.
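
A minimal sketch of the target shape, with one document object per keyword/definition pair. The sample dictionary, the `to_documents` helper, and the `keyword` metadata field are illustrative assumptions. `Document` is imported from `langchain_core.documents` when it is installed; the fallback dataclass is a hypothetical stand-in with the same two fields so that the sketch also runs without LangChain.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Hypothetical stand-in that mirrors the two fields used in this sketch.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

# Illustrative sample; the notebook builds this dictionary from the .docx text.
sample_definitions = {
    "Obligor": "The party in a contract who is obligated to provide payment or services.",
    "Obligee": "The party in a contract to whom an obligation is owed.",
}

def to_documents(definitions, source):
    """Create one Document per keyword, keeping the keyword and source as metadata."""
    return [
        Document(
            page_content=f"{keyword}: {definition}",
            metadata={"source": source, "keyword": keyword},
        )
        for keyword, definition in definitions.items()
    ]

docs = to_documents(sample_definitions, "./docs/business_terms.docx")
for doc in docs:
    print(doc.metadata["keyword"], "->", doc.page_content)
```

Keeping the keyword in the metadata is one possible design choice; it lets a search result be traced back to its term without re-parsing the text.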

In [6]:
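# third-party `regex` package, not the built-in `re`: the pattern below uses a
# variable-width lookbehind, which the built-in module rejects
import regex as re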

# the string containing keywords and definitions
page_content = data[0].page_content

# regex that'll match a keyword followed by a colon and a definition
pattern = r"(?<=^|\n)([\w\s/()]+?):\s*(.*?)(?=\n[\w\s/()]+?:|\Z)"

# find all matches of the pattern in the page_content string
matches = re.findall(pattern, page_content, re.DOTALL)

# Convert matches to a dictionary
keywords_definitions = {keyword.strip(): definition.strip() for keyword, definition in matches}

# Example usage: print each keyword and its definition
for keyword, definition in keywords_definitions.items():
    print(f"{keyword}: {definition}\n")

Title: An official designation of an individual's position or job role within an organization.

Term Date: The date on which a contract or agreement is set to expire or terminate.

Guarantor: An individual or entity that agrees to be responsible for another's debt or obligations in case of default.

Applicant: A person who applies for something, such as a job, a loan, or admission to an institution.

Obligor: The party in a contract who is obligated to provide payment or services.

Obligee: The party in a contract to whom an obligation is owed.

Collateral: Assets pledged by a borrower to secure a loan or other credit.

Lien: A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.

Indemnity: A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.

Arbitration: A method of dispute resolution where an impartial third party, the arbitrator

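Export all the keywords and definitions to a .txt file. The goal is to avoid losing entries: a keyword that the regex does not match (for example, one containing a hyphen, which the character class above does not appear to cover) ends up inside the preceding definition. Reading the text file back line by line makes such entries easier to recover.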

In [7]:
# Open a new .txt file in write mode
with open('keywords_definitions.txt', 'w') as file:
    # Iterate through the keywords_definitions dictionary
    for keyword, definition in keywords_definitions.items():
        # Write the keyword and its definition to the file
        file.write(f"{keyword}: {definition}\n\n")

Load the text file and parse it into a dictionary again.

In [8]:
def read_keywords_from_file(file_path):
    keywords_dict = {}
    with open(file_path, 'r') as file:
        for line in file:
            if ':' in line:  # Ensure the line contains a colon
                keyword, definition = line.split(':', 1)  # Split on the first colon only
                keywords_dict[keyword.strip()] = definition.strip()
    return keywords_dict

# Example usage
file_path = 'keywords_definitions.txt'
keywords_definitions = read_keywords_from_file(file_path)
print(keywords_definitions)

{'Title': "An official designation of an individual's position or job role within an organization.", 'Term Date': 'The date on which a contract or agreement is set to expire or terminate.', 'Guarantor': "An individual or entity that agrees to be responsible for another's debt or obligations in case of default.", 'Applicant': 'A person who applies for something, such as a job, a loan, or admission to an institution.', 'Obligor': 'The party in a contract who is obligated to provide payment or services.', 'Obligee': 'The party in a contract to whom an obligation is owed.', 'Collateral': 'Assets pledged by a borrower to secure a loan or other credit.', 'Lien': "A legal right or interest that a lender has in the borrower's property, granted until the debt obligation is satisfied.", 'Indemnity': 'A contractual obligation of one party to compensate the loss incurred to the other party due to the acts of the indemnitor or any other party.', 'Arbitration': 'A method of dispute resolution where 

"Non-Disclosure Agreement (NDA)" is now a key in the dictionary, which the next cell confirms, and every definition from the document is accounted for. We can now export each keyword and definition to a separate .yaml file.

In [10]:
print(keywords_definitions['Non-Disclosure Agreement (NDA)'])

A legally binding contract establishing a confidential relationship between parties to protect any type of confidential and proprietary information or trade secrets.
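
The sketch below illustrates why the line-based re-read appears to recover entries such as this one. It is an adaptation, not the notebook's exact code: it uses the standard-library `re` module (imported as `stdlib_re`), so the lookbehind is replaced by `^` with `MULTILINE`, and the three-entry `sample` string is made up for the demonstration. The character class is the same as in the notebook's pattern.

```python
import re as stdlib_re  # standard library, kept separate from the notebook's `regex as re`

# Made-up sample in the same "Keyword: definition" layout as the source document.
sample = (
    "Lien: A legal right held by a lender.\n\n"
    "Non-Disclosure Agreement (NDA): A confidentiality contract.\n\n"
    "Collateral: Assets pledged to secure a loan."
)

# Same character class as the notebook's pattern; `^` with MULTILINE replaces the lookbehind.
pattern = r"^([\w\s/()]+?):\s*(.*?)(?=\n[\w\s/()]+?:|\Z)"
matches = stdlib_re.findall(pattern, sample, stdlib_re.DOTALL | stdlib_re.MULTILINE)
by_regex = {keyword.strip(): definition.strip() for keyword, definition in matches}

# The hyphen is not in [\w\s/()], so the NDA entry is absorbed into the "Lien" definition.
print(sorted(by_regex))

# Re-serialise the dictionary, then split every line on its first colon,
# mirroring the export and re-read steps of the notebook.
exported = "".join(f"{keyword}: {definition}\n\n" for keyword, definition in by_regex.items())
by_line = {}
for line in exported.splitlines():
    if ":" in line:
        key, value = line.split(":", 1)
        by_line[key.strip()] = value.strip()

print(sorted(by_line))
```

Because the absorbed entry sits on its own line inside the "Lien" definition, splitting per line turns it back into a separate key.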


In [None]:
import os
import yaml

def sanitize_filename(filename):
    """Sanitize the string to make it a valid filename."""
    return re.sub(r'[\\/*?:"<>|]', "", filename)[:255]

# Create the output directory if it does not exist yet; open() would fail otherwise
os.makedirs("./results", exist_ok=True)

print(len(keywords_definitions))
for keyword, definition in keywords_definitions.items():
    # One YAML file per keyword, named after the sanitized keyword
    filename = "./results/" + sanitize_filename(keyword) + '.yaml'
    with open(filename, 'w') as file:
        print(f"Creating file: {filename}")
        yaml.dump({keyword: definition}, file, default_flow_style=False)
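
To make the effect of `sanitize_filename` concrete, here is a small standalone sketch. The function body is copied from the cell above; the example keywords other than "Non-Disclosure Agreement (NDA)" are hypothetical and chosen only to exercise the character class.

```python
import re

def sanitize_filename(filename):
    """Remove characters that are not allowed in Windows filenames, then cap the length at 255."""
    return re.sub(r'[\\/*?:"<>|]', "", filename)[:255]

# Hypothetical keywords, apart from the NDA entry taken from the document.
examples = ["Non-Disclosure Agreement (NDA)", "Principal/Interest", "Due Date?"]
for keyword in examples:
    name = sanitize_filename(keyword) + ".yaml"
    print(f"{keyword!r} -> {name!r}")
```

Two things are worth keeping in mind with this approach: keywords that differ only in stripped characters map to the same file, and the 255-character cap is applied before `.yaml` is appended.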

In [12]:
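# Function to convert text to a vector
def text_to_vector(text):
    # OpenAIEmbeddings has no `embed` method (calling it raises AttributeError);
    # embed_query embeds a single string and returns a list of floats
    return embeddings.embed_query(text)

# PINECONE_INDEX_NAME is only the index name (a string), so get a handle to the index.
# This assumes the index already exists and its dimension matches the embedding model.
index = pc.Index(PINECONE_INDEX_NAME)

# Upload data to Pinecone
for keyword, definition in keywords_definitions.items():
    # Convert keyword or definition to vector. Here we use definition as an example.
    vector = text_to_vector(definition)
    # Upload the vector to Pinecone with the keyword as its ID
    index.upsert(vectors=[(keyword, vector)])

print("Upload complete.")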
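
The upload loop in this cell issues one upsert call per keyword. Pinecone's upsert accepts a list of vectors, so records can be grouped into batches instead. The sketch below shows only the batching step: the `batched` helper, the dummy records, and the commented-out `index` handle (assumed to come from `pc.Index(PINECONE_INDEX_NAME)`) are illustrative assumptions, not part of the original notebook.

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# Hypothetical records in (id, vector) form; the vectors are dummy values.
records = [(f"term-{i}", [0.0, 0.1 * i]) for i in range(5)]

for batch in batched(records, batch_size=2):
    # With a real index handle this would be: index.upsert(vectors=batch)
    print([record_id for record_id, _ in batch])
```

A suitable batch size depends on the vector dimension and the request size limits of the service, so the value 2 here is only for demonstration.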