# Max.AI LLM Components for Agent Development
**Author:** `Max.AI`

This document outlines the modules of the Max.AI LLM accelerators. 
Data scientists can use Max's built-in accelerators to build generative AI LLM agents. Max Development Environments come with these accelerators pre-installed. The accelerators streamline LLM-based operations such as parsing, extraction, chunking, vectorization, embedding, retrieval, and generation.


#### **Components**:
1. `MaxaiLLM`
    - This module loads large language models (LLMs) and offers a unified interface for both open-source and hosted (online) LLMs.
    
  
2. `MaxExtractor`
   - This package extracts relevant text from unstructured formats like PDFs, PPTX, TXT, and DOCX files, turning complex documents into usable data.


3. `MaxSplitter`
   - This component breaks down text into smaller units such as paragraphs for detailed analysis and processing. It also extracts metadata from the text.


4. `MaxEmbeddings`
   - Responsible for converting text or tokens into dense vector representations, this module is vital for NLP tasks like classification or similarity analysis.

5. `MaxVectorization`
   - MaxVectorization transforms data into a vector format, which is crucial for various machine learning and natural language processing tasks.


6. `MaxRetriever`
   - MaxRetriever is associated with information retrieval tasks. This module searches for relevant documents or information within a large corpus based on specific queries or contexts.


7. `MaxGenerator`
   - This component is designed for text generation tasks, including summarization, text completion, and creative text generation based on given prompts or contexts.




### Step 1 : Initialize the Environment Variables from Config Store

Use the remote Config Store to load the required runtime configuration, such as:
- **LLM API Key**: The API key for the desired cloud LLM service.
- **Vector Database Details**: The database user, password, name, port, and other information required to set up and connect to the vector database.
- **Experiment Tracking Environment Credentials** (e.g. AWS SageMaker, MLflow, Google Vertex AI): These configurations enable the Max Monitoring capability, which logs and monitors each step's parameters, metrics, and artifacts such as models and configurations.

In [55]:
from maxairesources.config_store.config_store import ConfigStore
config_store = ConfigStore(secrets_manager='max/agents')
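
Before moving on, it can help to confirm that the loaded configuration contains what later steps need. The sketch below is not part of the Max.AI API: `require_keys` is a hypothetical helper, and it assumes only that the loaded configuration behaves like a dictionary, as the `config_store.config_data['REDIS_CONNECTION_STRING']` lookup in Step 5.2 suggests. The key name `OPENAI_API_KEY` is a placeholder for illustration.

```python
from typing import Iterable, Mapping


def require_keys(config: Mapping[str, object], keys: Iterable[str]) -> None:
    """Raise a KeyError naming every required key that is missing or empty."""
    missing = [key for key in keys if not config.get(key)]
    if missing:
        raise KeyError(f"Missing configuration keys: {', '.join(missing)}")


# Stand-in for config_store.config_data; real values come from the config store.
example_config = {
    "REDIS_CONNECTION_STRING": "redis://localhost:6379",
    "OPENAI_API_KEY": "",  # hypothetical key name, deliberately left empty
}

try:
    require_keys(example_config, ["REDIS_CONNECTION_STRING", "OPENAI_API_KEY"])
except KeyError as err:
    print(f"Configuration problem: {err}")
```

Failing early with the full list of missing keys is usually easier to debug than a connection error several steps later.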

### Step 2 : Setup LLM Flow

Loading a Large Language Model (LLM) means initializing and preparing a pre-trained language model for use in applications such as text generation, question answering, or other natural language processing tasks. It typically involves the following stages:

- **Initialization**: This is the first step where the language model is instantiated. It involves setting up the model with its pre-trained parameters, which define how the model will interpret and generate language.

- **Configuration**: The model may need to be configured with specific settings. This can include adjusting parameters for performance, accuracy, or output style. For example, setting the length of generated text, the level of creativity or randomness in responses, or specifying certain language rules.

- **Testing and Calibration**: Once loaded, the model may undergo some initial tests to ensure it's functioning as expected. This can involve generating test outputs or running benchmark tasks.

- **Ready for Use**: After these steps, the LLM is ready to be used for various applications, such as text completion, question answering, content generation, insight extraction or any custom natural language processing task.
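
To make the "level of creativity or randomness" setting concrete, here is a minimal sketch of how a temperature parameter reshapes a next-token probability distribution. It illustrates the general technique (temperature-scaled softmax) and is not taken from the Max.AI or OpenAI code; the logits are made-up values.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At a low temperature almost all probability goes to the highest-scoring token, so output is close to deterministic; higher values spread probability across more tokens.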

In [56]:
from maxaillm.model.llm import MaxOpenAILLM

MAX_LLM = "gpt-3.5-turbo"



llm = MaxOpenAILLM(MAX_LLM).load_model()

Loaded OpenAI model: gpt-3.5-turbo


In [57]:
llm

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x7fda7434da60>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x7fda74346df0>, temperature=0.1, openai_api_key='<redacted>', openai_proxy='', streaming=True)

### Step 3 : Parsing the files using MaxExtractor

Max Extractor provides out-of-the-box methods for parsing documents of different formats. Its capabilities include:
- **Parsing Documents**: Max Extractor can read and process a wide range of document formats, including PDFs, Word documents (DOCX), text files (TXT), presentation files (PPTX), and Markdown (MD). Parsing reads these files and converts their contents into a structured plain-text format.

- **Extracting Metadata**: Beyond just reading the main content of documents, Max Extractor can extract metadata. Metadata is the data about the data, which could include information like the document's author, creation date, modification dates, file size, and other properties that aren't part of the main text content but are important for cataloging, searching, and managing documents.

- **Cleaning Documents**: This functionality refers to the tool's ability to process and refine the extracted text. Cleaning might involve removing unnecessary or irrelevant sections, correcting formatting issues, standardizing text for consistency, and eliminating errors or noise. This process is crucial for ensuring that the data used for analysis or other applications is accurate, relevant, and in a usable format.

In [58]:
from maxaillm.data.extractor.MaxExtractor import MaxExtractor

me = MaxExtractor()
path = "path/to/sample.pdf"

#### Step 3.1 -  Extracting text for all docs with Metadata

In [59]:
text, metadata = me.extract_text_metadata(path)

#### Step 3.2 -  Cleaning Text

In [60]:
clean_text = me.clean_text(text, 
                           dehyphenate=True, 
                           ascii_only=True, 
                           remove_isolated_symbols=True, 
                           compress_whitespace=True)
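
The flag names above suggest what each cleaning step does. The sketch below shows one plausible interpretation using regular expressions. It is an assumption about the behavior, not the actual `MaxExtractor.clean_text` implementation, and `clean_text_sketch` is a hypothetical name.

```python
import re


def clean_text_sketch(text: str) -> str:
    """Apply four cleaning steps in sequence (illustrative only)."""
    # Dehyphenate: rejoin words split across a line break ("exam-\nple" -> "example").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # ASCII only: drop any character outside the ASCII range.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Remove isolated symbols: a single non-alphanumeric character surrounded by whitespace.
    text = re.sub(r"(?<!\S)[^\w\s](?!\S)", " ", text)
    # Compress whitespace: collapse runs of spaces, tabs, and newlines into one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "Quarterly re-\nsults   were | strong"
print(clean_text_sketch(sample))
```

Note that dropping non-ASCII characters is lossy for text with accents or non-Latin scripts, so a flag like `ascii_only` is worth enabling only when the corpus is known to be plain English.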

### Step 4 : Chunking Layer

This stage involves taking cleaned text as input and performing several key functions:

 - **Creating Documents**: This module takes clean, pre-processed text as input and organizes it into structured documents, so that the text is in a manageable format for analysis. It can also split documents based on their structure (for example, HTML and Markdown files) or based on semantics.
 - **Extracting Metadata**: In this phase, the module performs multiple tasks to extract valuable information from the text:

> - **Extracting Associated Links**: Identifying and extracting any hyperlinks or references to external sources contained within the text.
> - **Summarizing Text**: Generating concise summaries of the chunk to capture the main points or themes, useful for quick overviews or indexing.
> - **Entity Extraction**: Identifying and extracting named entities (like names of people, places, organizations) from the chunk. This is a key aspect of understanding the content and context of the documents.
> - **Keyword Extraction**: Isolating significant keywords or phrases that are central to the text’s topics or themes. This aids in categorizing and searching the document.
> - **Default File-Specific Metadata**: Extracting standard metadata associated with the file, such as author, creation date, file size, etc.
- **Return - List of Documents with Content and Metadata for Each Chunk**: The final output of this step is a collection of documents. Each document in this collection contains not only the chunked content but also the extracted metadata. This structured output is primed for further processing.
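
As a rough illustration of two of the metadata extractors listed above (links and frequent keywords), the sketch below uses a regular expression and a word counter. The function name, the URL pattern, and the four-letter minimum word length are all assumptions made for this example; the Max.AI splitter may work quite differently.

```python
import re
from collections import Counter
from typing import Dict, List

# Simplified URL pattern: stops at whitespace, angle brackets, and brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>()\[\]]+")


def extract_chunk_metadata(chunk: str, top_k: int = 3) -> Dict[str, List[str]]:
    """Return the links and the most frequent words found in one chunk."""
    links = [url.rstrip(".,;") for url in URL_PATTERN.findall(chunk)]
    # Remove URLs before counting so their fragments are not treated as keywords.
    text_without_links = URL_PATTERN.sub(" ", chunk).lower()
    words = re.findall(r"[a-z]{4,}", text_without_links)
    keywords = [word for word, _ in Counter(words).most_common(top_k)]
    return {"links": links, "frequent_keywords": keywords}


chunk = "Sales grew. See https://example.com/report for sales details. Sales rose."
print(extract_chunk_metadata(chunk))
```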

In [61]:
from maxaillm.data.chunking.TextSplitter import TextSplitter

In [62]:
splitter = TextSplitter(chunk_size=4000, chunk_overlap=200)

In [63]:
# Pass the cleaned text from Step 3.2, since this stage takes cleaned text as input.
docs = splitter.create_documents([clean_text],
                                 file_metadata=metadata,
                                 metadata={
                                     'default': True,
                                     'summary': False,
                                     'entities': False,
                                     'frequent_keywords': False,
                                     'links': True
                                 })
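
To show how `chunk_size` and `chunk_overlap` interact, here is a minimal character-based splitter. It is a sketch of the general sliding-window technique, assuming both values are measured in characters; whether `TextSplitter` counts characters or tokens, and how it respects sentence boundaries, is not shown in this notebook.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.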

### Step 5 : Embedding and Vector DB Backend Integration
This phase transforms text data into a format that can be efficiently processed and analyzed by a large language model. It involves two main tasks:
- **Initialize the Embeddings** : 
> - The first task is to initialize embeddings, which are high-dimensional vector representations of text. These embeddings convert words, phrases, or entire documents into numerical form, capturing the semantic meaning in a way that machines can understand.
> - This process often involves using pre-trained models that have learned these representations from large datasets. The initialization ensures that the system is equipped to convert text into these vector formats. Here we use MaxHuggingFaceEmbeddings to load an open-source embedding model and generate vectors from text.
- **Initialize the vector DB** : 
> - The second task is to initialize a vector database. This database is designed to store and manage the vector representations of the text.
> - Supported vector databases include Chroma, MaxPGVector, Milvus, Redis, and OpenSearch. Each of these systems specializes in handling high-dimensional data.
> - Once the vector database is initialized, the documents, now converted into their vectorized forms (embeddings), are pushed into the vector store. The MaxRetriever module then uses these numerical representations to retrieve the chunks relevant to a user query.

#### Step 5.1 Initialize HuggingFaceEmbeddings

In [64]:
from maxaillm.data.embeddings.MaxHuggingFaceEmbeddings import MaxHuggingFaceEmbeddings

In [65]:

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}

embeddings = MaxHuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

print(embeddings)

<maxaillm.data.embeddings.MaxHuggingFaceEmbeddings.MaxHuggingFaceEmbeddings object at 0x7fda71e43ca0>
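
The `normalize_embeddings` option matters for how similarity is later computed. The sketch below, using plain Python lists as stand-in vectors, shows that after L2 normalization a simple dot product equals cosine similarity. The helper names are illustrative and the vectors are made up; no embedding model is called.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine_similarity(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))  # same value, computed on unit vectors
```

With `normalize_embeddings` set to `False`, as in the cell above, vectors keep their original lengths, so a vector store that ranks by raw dot product or Euclidean distance may order results differently than one that ranks by cosine similarity.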


#### Step 5.2 Initialize VectorDB 

In [66]:
from maxaillm.data.vectorstore.MaxRedis import MaxRedis
vectordb = MaxRedis(
                    redis_url= config_store.config_data['REDIS_CONNECTION_STRING'],
                    index_name = "collection-name",
                    embedding_function = embeddings
                    )

#### Step 5.3 Adding Documents to Vector Database


In [None]:
vectordb.add(docs)

In [68]:
li = vectordb.search("Group Sales Results", k =2)
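
Conceptually, a call like `search(query, k=2)` embeds the query and returns the `k` stored vectors closest to it. The brute-force sketch below illustrates that idea with toy two-dimensional vectors. It is an assumption about the general technique, not the `MaxRedis` implementation, which would normally rely on the database's own index.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          store: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ones have hundreds of dimensions.
store = {"sales": [0.9, 0.1], "hr": [0.1, 0.9], "finance": [0.7, 0.3]}
print(top_k([1.0, 0.0], store, k=2))
```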

### Step 6 : Setting up Retriever
In this phase, the focus is on configuring the retrievers and rerankers that find relevant information in large datasets or document collections.
- **Retrievers** 
A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not necessarily store documents, it only returns them. 

We currently support the following retrievers:

>- `MultiQuery` - Distance-based vector retrieval embeds a query in high-dimensional space and finds similar embedded text chunks based on "distance", so results can change with small differences in query wording. A multi-query retriever uses the LLM to generate several variants of the query, retrieves chunks for each variant, and returns the combined set of unique results.
> - `HybridSearch` - A hybrid search retriever typically combines keyword-based search with vector similarity search to select the documents relevant to a query.
>- `Base` - A Base retriever uses the default search algorithm offered by the vector DB and retrieves the chunks most similar to the query.
>- `HyDE` - At a high level, HyDE (Hypothetical Document Embeddings) takes a query, generates a hypothetical answer, embeds that generated document, and uses the embedding to search for similar real chunks.
- **Reranker** 
A reranker improves search relevance by reordering the result set returned by a retriever with a different model. Reranking computes a relevance score between the query and each retrieved text chunk, and returns the list of text chunks sorted from the most to the least relevant.


> - `Cohere`: Reorders search results with a Cohere reranking model, which scores how relevant each chunk is to the query.
> - `LostInMiddle`: Addresses the "lost in the middle" problem: when many chunks are retrieved, language models tend to overlook information placed in the middle of a long context. This reranker reorders the retrieved chunks so that the most relevant ones sit at the beginning and end of the context.
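
One common way to counter the tendency of models to overlook the middle of a long context is to place the most relevant chunks at the two ends and the least relevant in the middle. The sketch below shows that reordering on a list already sorted from most to least relevant. It illustrates the general idea only and is an assumption about what a `LostInMiddle` reranker does, not the Max.AI implementation.

```python
from typing import List, TypeVar

T = TypeVar("T")


def reorder_for_long_context(ranked: List[T]) -> List[T]:
    """Place top-ranked items at both ends; input is sorted most to least relevant."""
    front: List[T] = []
    back: List[T] = []
    for index, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4, 6... fill the back.
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


ranked_chunks = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_long_context(ranked_chunks))
```

Here the two best chunks end up first and last, while the weakest chunk lands in the center of the context.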

In [69]:
from maxaillm.data.retriever.Retriever import MaxRetriever

#### Step 6.1 Initialize Retriever (MultiQuery) and Ranking Method (LostInMiddle)

In [70]:
retrieve = MaxRetriever(vectordb = vectordb, llm = llm, retriever_type="MultiQuery", reranker_type="LostInMiddle")
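
The multi-query idea can be sketched without any model: rephrase the question several ways, search with each phrasing, and merge the results while dropping duplicates. In the sketch below, `generate_variants` and `search` are hypothetical callables standing in for the LLM and the vector store, and the stub data is invented for the example. It does not reflect how `MaxRetriever` is implemented internally.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(question: str,
                         generate_variants: Callable[[str], List[str]],
                         search: Callable[[str], List[str]]) -> List[str]:
    """Search with the question and its variants; return unique hits in first-seen order."""
    seen = set()
    merged: List[str] = []
    for query in [question, *generate_variants(question)]:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Stubs standing in for an LLM and a vector store.
def fake_variants(question: str) -> List[str]:
    return ["How did the company perform in 2022?", "2022 financial results"]


FAKE_INDEX: Dict[str, List[str]] = {
    "Summarize 2022 performance": ["chunk-a", "chunk-b"],
    "How did the company perform in 2022?": ["chunk-b", "chunk-c"],
    "2022 financial results": ["chunk-d", "chunk-a"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


print(multi_query_retrieve("Summarize 2022 performance", fake_variants, fake_search))
```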

In [71]:
retrieve

<maxaillm.data.retriever.Retriever.MaxRetriever at 0x7fda71e43fd0>

In [72]:
out = retrieve.retrieve_and_rerank("Summarize 2022 performance")[:2]

Metadata key file_name not found in metadata. Setting to None. 
Metadata fields defined for this instance: ['file_name', 'file_metadata', 'links']

### Step 7 : Setting up Generator
The generator is a large transformer model, such as GPT-3.5, GPT-4, Llama 2, Falcon, or PaLM. It takes the input query and the retrieved documents, which are concatenated and fed into the model, and generates a response. MaxGenerator also supports streaming while generating responses.

Different generators suit different tasks:
- **Language Generation Models**: For tasks like summarization and question answering, language models like GPT (Generative Pre-trained Transformer) or BERT (Bidirectional Encoder Representations from Transformers) could be used.
- **Specialized NLP Tools**: For sentiment analysis and insight extraction, there are specialized tools and libraries that focus on specific aspects of text analysis, like sentiment scoring or topic modeling.
- **Document Intelligence**: For document intelligence, we provide generators that combine OCR (Optical Character Recognition), NLP, and machine learning to extract and process information from documents.


In [73]:
from maxaillm.app.generator.MaxGenerator import MaxGenerator, LangChainGenerator

In [74]:
mg = LangChainGenerator(llm=llm, 
                        method='stuff', 
                        prompt_config={'moderations':'',
                                       'task':'',
                                       'identity':''},
                        verbose=True, 
                        streamable = False )
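
With the `'stuff'` method, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below shows one way such a prompt could be assembled. The prompt layout and the `build_stuff_prompt` helper are assumptions for illustration; the parameter names `identity` and `task` are borrowed from the `prompt_config` keys above, but how `LangChainGenerator` actually uses them is not shown in this notebook.

```python
from typing import List


def build_stuff_prompt(query: str,
                       chunks: List[str],
                       identity: str = "",
                       task: str = "") -> str:
    """Concatenate optional instructions, all context chunks, and the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    parts = [part for part in (identity, task) if part]  # skip empty instructions
    parts.append(f"Context:\n{context}")
    parts.append(f"Question: {query}")
    return "\n\n".join(parts)


prompt = build_stuff_prompt(
    query="How did sales develop?",
    chunks=["Sales rose in the first half.", "Costs were stable."],
    identity="You are a financial analyst.",
)
print(prompt)
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason to retrieve and rerank only a handful of chunks.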

#### Step 7.1 Streaming output for QA Generator

In [None]:
async for i in mg.generate_stream(query='Enter your question here', context=out):
    print(i, end='')
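
The `async for` loop above consumes an asynchronous generator that yields pieces of the answer as they become available. The self-contained sketch below mimics that pattern with a fake token stream, so the mechanics can be tried without an LLM. `fake_token_stream` and `collect` are hypothetical names and do not belong to the Max.AI API.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_token_stream(answer: str, delay: float = 0.0) -> AsyncIterator[str]:
    """Yield the answer one word at a time, pausing briefly like a real model would."""
    for token in answer.split(" "):
        await asyncio.sleep(delay)
        yield token + " "


async def collect(stream: AsyncIterator[str]) -> str:
    """Print tokens as they arrive and also return the full text."""
    parts: List[str] = []
    async for token in stream:
        print(token, end="", flush=True)
        parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    # In a plain Python script, an event loop has to be started explicitly.
    asyncio.run(collect(fake_token_stream("Streaming prints tokens as they arrive")))
```

Inside a Jupyter notebook an event loop is already running, so `await collect(...)` or a top-level `async for` is used directly instead of `asyncio.run`.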

In [None]:
result = mg.generate(query='Enter your question here', context=out)

In [None]:
print(result)