**Requirements**

Tutorial from: [web](https://www.datacamp.com/tutorial/run-llama-3-locally)

In [1]:
%pip install unstructured[docx] langchain langchainhub langchain_community langchain-chroma libmagic pypdf

Note: you may need to restart the kernel to use updated packages.


In [2]:
#%ollama serve

**Variables**
Change these to match your own paths.

In [1]:
file_path = "./input/c06.pdf"
fileExtToSearch = "**/*.pdf"
#pdfToRead = "C:/tmp/"
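
The `fileExtToSearch` pattern is not used by the cells that follow. As a minimal sketch of how such a glob pattern could list every PDF under a folder, here is a standard-library version; `find_files` and the temporary folder layout are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def find_files(root: str, pattern: str) -> list[Path]:
    """Return all files under `root` that match the glob `pattern`, sorted."""
    return sorted(p for p in Path(root).glob(pattern) if p.is_file())


# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.pdf").write_bytes(b"")
    (Path(tmp) / "sub" / "b.pdf").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("not a pdf")
    found = [p.name for p in find_files(tmp, "**/*.pdf")]

print(found)
```

The `**` part of the pattern makes the search recursive, so PDFs in subfolders are found as well as those at the top level.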

**Loading the documents**
It is good practice to develop and test the code in a Jupyter Notebook before building the app.

We will load a single PDF file using the PyPDFLoader.

In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF into a list of Document objects (one entry per page or page fragment)
loader = PyPDFLoader(file_path)
books = loader.load_and_split()

**Splitting the text**
Feeding an entire document to the model is not feasible, as it would exceed the model's context window. To overcome this limitation, we must divide the text into smaller chunks that fit comfortably within that window.

In our case, the source is the single ICD-11 PDF loaded above. We try two approaches: splitting at disorder codes (for example 6A00.3) so that each chunk covers one disorder, and splitting into fixed chunks of 500 characters.
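
As a rough illustration of fixed-size chunking, independent of LangChain, the sketch below cuts a string into consecutive pieces of at most 500 characters. The helper name `chunk_text` is hypothetical. Library splitters typically try to break at separators such as newlines first, whereas this sketch cuts blindly.

```python
def chunk_text(text: str, chunk_size: int = 500) -> list[str]:
    """Cut `text` into consecutive pieces of at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "x" * 1200
chunks = chunk_text(sample, chunk_size=500)

# The last chunk holds whatever is left over after the full-size pieces.
print([len(c) for c in chunks])
```

Joining the chunks back together reproduces the input exactly, which is a handy sanity check for any splitter that uses no overlap.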

In [21]:
import re
def split_text(text: str) -> list[str]:
    # Split the text into paragraphs
    paragraphs = text.split('\n')
    
    # Initialize variables
    chunks = []
    current_chunk = ""
    
    for paragraph in paragraphs:
        # Check if the paragraph starts with a disorder code (e.g., 6A00.3)
        if re.match(r'^\s*6[A-Z]\d+(\.\d+)?([A-Z])?', paragraph):
            # If we have a current chunk, add it to the list of chunks
            if current_chunk:
                chunks.append(current_chunk.strip())
            # Start a new chunk with this paragraph
            current_chunk = paragraph
        else:
            # If it's not a new disorder, add to the current chunk
            current_chunk += "\n" + paragraph
    
    # Add the last chunk if it exists
    if current_chunk:
        chunks.append(current_chunk.strip())
    
    return chunks

In [12]:
split_text(books[8].page_content)

['396  ICD-11 MMS',
 '6A03.1  Developmental learning disorder with impairment in written expression  \nDevelopmental learning disorder with impairment in written expression is \ncharacterised by significant and persi stent difficulties in learning academic skills \nrelated to writing, such as spelling accuracy, grammar and punctuation accuracy, \nand organisation and coherence of ideas in writing. The individual’s performance in \nwritten expression is markedly below what wo uld be expected for chronological age \nand level of intellectual functioning and results in significant impairment in the \nindividual’s academic or occupational functioning. Developmental learning disorder \nwith impairment in written expression is not due to a disorder of intellectual \ndevelopment, sensory impairment (vision or hearing), a neurological or motor \ndisorder, lack of availability of education, lack of proficiency in the language of \nacademic instruction, or psychosocial adversity.  \nExclusions: 

In [7]:
# Read the plain-text version of the document; the file is closed automatically
with open('./input/c06.txt', encoding="utf-8") as f:
    content = f.read()

In [8]:
split_text(content)

['ICD-11 MMS  389 \n • Mental or behavioural disorders associated with pregnancy, childbirth or the puerperium  \n• Secondary mental or behavioural syndromes associated with disorders or diseases classified \nelsewhere  \n \nNeurodevelopmental disorders (6A00 ‑6A0Z)  \nNeurodevelopmental disorders are behavioural and cognitive disorders that arise during the \ndevelopmental period t hat involve significant difficulties in the acquisition and execution of specific \nintellectual, motor, language, or social functions. Although behavioural and cognitive deficits are \npresent in many mental and behavioural disorders that can arise during the  developmental period \n(e.g., Schizophrenia, Bipolar disorder), only disorders whose core features are neurodevelopmental \nare included in this grouping. The presumptive etiology for neurodevelopmental disorders is complex, \nand in many individual cases is un known.  \nCoded Elsewhere:  Primary tics or tic disorders (8A05.0)  \nSecondary neurodevel

In [24]:
import re

from langchain_text_splitters import TextSplitter

class DisorderTextSplitter(TextSplitter):
    def split_text(self, text: str) -> list[str]:
        # Split the text into paragraphs
        paragraphs = text.split('\n')
        
        # Initialize variables
        chunks = []
        current_chunk = ""
        
        for paragraph in paragraphs:
            # Check if the paragraph starts with a disorder code (e.g., 6A00.3)
            if re.match(r'^\s*6[A-Z]\d{2}(\.\d+)?([A-Z])?(?!\s+ICD-11 MMS)', paragraph):
                # If we have a current chunk, add it to the list of chunks
                if current_chunk:
                    chunks.append(current_chunk.strip())
                # Start a new chunk with this paragraph
                current_chunk = paragraph
            else:
                # If it's not a new disorder, add to the current chunk
                current_chunk += "\n" + paragraph
        
        # Add the last chunk if it exists
        if current_chunk:
            chunks.append(current_chunk.strip())
        
        return chunks

# Usage
text_splitter = DisorderTextSplitter()
all_splits = text_splitter.split_documents(books)
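
The standalone `split_text` function and `DisorderTextSplitter` use slightly different patterns. The sketch below runs both against a few sample lines to show where they disagree. The sample lines are hypothetical (in particular the header-like line `6A00  ICD-11 MMS` is made up), and the reading that the negative lookahead is meant to skip page-header text is an assumption.

```python
import re

# The two patterns from the cells above, compiled once.
LOOSE = re.compile(r'^\s*6[A-Z]\d+(\.\d+)?([A-Z])?')
STRICT = re.compile(r'^\s*6[A-Z]\d{2}(\.\d+)?([A-Z])?(?!\s+ICD-11 MMS)')

# Hypothetical sample lines, loosely modelled on the extracted text.
samples = [
    "6A00.3  Disorder of intellectual development, profound",
    "6A03.1  Developmental learning disorder",
    "396  ICD-11 MMS",
    "6A00  ICD-11 MMS",  # made-up header-like line
    "Exclusions:  Disorders of intellectual development (6A00)",
]

# For each line, record whether the loose and strict patterns start a new chunk.
results = [(bool(LOOSE.match(s)), bool(STRICT.match(s)), s) for s in samples]

for loose, strict, line in results:
    print(f"loose={loose!s:5} strict={strict!s:5} {line}")
```

Both patterns only match at the start of a line, so a code mentioned in the middle of a sentence (as in the `Exclusions:` line) does not start a new chunk.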


In [20]:
print(all_splits)

[Document(metadata={'source': './input/c06.pdf', 'page': 1}, page_content='6A00   Disorders of intellectual development'), Document(metadata={'source': './input/c06.pdf', 'page': 1}, page_content='6A00.0  Disorder of intellectual development, mild'), Document(metadata={'source': './input/c06.pdf', 'page': 2}, page_content='6A00.1  Disorder of intellectual development, moderate'), Document(metadata={'source': './input/c06.pdf', 'page': 2}, page_content='6A00.2  Disorder of intellectual development, severe'), Document(metadata={'source': './input/c06.pdf', 'page': 2}, page_content='6A00.3  Disord er of intellectual development, profound'), Document(metadata={'source': './input/c06.pdf', 'page': 3}, page_content='6A00.4  Disorder of intellectual development, provisional'), Document(metadata={'source': './input/c06.pdf', 'page': 3}, page_content='6A00.Z  Disorders of intellectual development, unspecified'), Document(metadata={'source': './input/c06.pdf', 'page': 3}, page_content='6A01   De

In [5]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Alternative: fixed chunks of at most 500 characters (replaces any earlier all_splits)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
all_splits = text_splitter.split_documents(books)

**Ollama embeddings and Chroma vector store**
We will use LangChain to convert each chunk into an embedding vector and store the vectors in a Chroma database.

We use the Llama 3 model served by Ollama as the embedding model.


In [6]:
from langchain_chroma import Chroma
from langchain_community.embeddings import OllamaEmbeddings

vectorstore = Chroma.from_documents(
    documents=all_splits,
    embedding=OllamaEmbeddings(model="llama3", show_progress=True),
    persist_directory="./chroma_db",
)

OllamaEmbeddings: 100%|██████████| 1353/1353 [51:20<00:00,  2.28s/it]


**TEST**

In [7]:
question = "What is anxiety?"
docs = vectorstore.similarity_search(question)
docs

OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.31s/it]


[Document(metadata={'page': 42, 'source': 'C:/tmp/c06.pdf'}, page_content='Hypochondriasis (6B23)'),
 Document(metadata={'page': 107, 'source': 'C:/tmp/c06.pdf'}, page_content='Disorders due to use of caffeine (6C48)'),
 Document(metadata={'page': 69, 'source': 'C:/tmp/c06.pdf'}, page_content='Possession trance disorder (6B63)'),
 Document(metadata={'page': 7, 'source': 'C:/tmp/c06.pdf'}, page_content='Exclusions:  Disorders of intellectual development (6A00)')]
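
To picture what a similarity search does, the toy sketch below ranks stored vectors by cosine similarity to a query vector. The three-dimensional vectors, the chunk labels, and the helpers `cosine_similarity` and `top_k` are all made up for illustration. Chroma's actual index and distance metric may differ, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], store: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up "embeddings" for three chunks and one query.
toy_store = {
    "chunk about anxiety": [0.9, 0.1, 0.0],
    "chunk about caffeine": [0.1, 0.9, 0.0],
    "chunk about sleep": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

nearest = top_k(toy_query, toy_store, k=2)
print(nearest)
```

The quality of the ranking depends entirely on how well the embedding model places related texts near each other.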

**Building LangChain chains for a Q&A retrieval system**
To build a question-and-answer retrieval system, we will use LangChain chains and add the modules one at a time.

In our Q&A chain, we will:

1. Use the vector store as the retriever and format the results.
2. Provide the RAG prompt, which you can pull from the LangChain Hub.
3. Provide the Ollama Llama 3 inference function.
4. Parse the results so that only the response is displayed.

Simply put, before your question reaches the Llama 3 model, the similarity search retrieves context for it and the RAG prompt combines the two.
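
The flow described above can be sketched in plain Python with stand-in functions. Everything here is hypothetical: `make_chain` and the `fake_*` stubs are not LangChain APIs, and the stub "model" simply echoes its prompt so the data flow is visible.

```python
from typing import Callable


def make_chain(
    retrieve: Callable[[str], list[str]],
    format_docs: Callable[[list[str]], str],
    build_prompt: Callable[[str, str], str],
    llm: Callable[[str], dict],
    parse: Callable[[dict], str],
) -> Callable[[str], str]:
    """Compose retrieve -> format -> prompt -> llm -> parse into one function."""

    def chain(question: str) -> str:
        context = format_docs(retrieve(question))
        prompt = build_prompt(context, question)
        return parse(llm(prompt))

    return chain


# Stand-ins for the retriever, prompt template, model and output parser.
def fake_retrieve(question: str) -> list[str]:
    return ["first chunk", "second chunk"]


def fake_format(docs: list[str]) -> str:
    return "\n\n".join(docs)


def fake_prompt(context: str, question: str) -> str:
    return f"Context:\n{context}\n\nQuestion: {question}"


def fake_llm(prompt: str) -> dict:
    return {"text": f"  echo: {prompt}  "}


def fake_parse(raw: dict) -> str:
    return raw["text"].strip()


chain = make_chain(fake_retrieve, fake_format, fake_prompt, fake_llm, fake_parse)
answer = chain("What is X?")
print(answer)
```

The question is used twice: once to retrieve context and once, unchanged, inside the prompt.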

In [19]:
from langchain import hub
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = Ollama(model="llama3")

retriever = vectorstore.as_retriever()


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_prompt = hub.pull("rlm/rag-prompt")
qa_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)



**Testing the Q&A retrieval chain**
Ask questions about the loaded document to check that the chain retrieves relevant context and answers from it.

In [16]:
question = "from the context, which type of disease are being described?"
qa_chain.invoke(question)
#question = "which are anorexia symptoms?"
#qa_chain.invoke(question)

OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.35s/it]


'Based on the context, it appears that mental, behavioral, or neurodevelopmental disorders are being described. Specifically, these are disorders that affect the manifestation, treatment, or course of another disorder or disease classified elsewhere in the chapters.'