# Install Libraries
Install all the required libraries. This notebook uses:
* `langchain` for retrieval augmented generation,
* `chromadb` as the vector store,
* `sentence-transformers` for text embeddings,
* `pinecone-client` for the Pinecone experiment described later (not used in the final pipeline).

In [1]:
!pip install langchain

!pip install pinecone-client==2.2.4
!pip install chromadb
!pip install sentence-transformers

Collecting langchain
  Downloading langchain-0.1.14-py3-none-any.whl.metadata (13 kB)
Collecting langchain-community<0.1,>=0.0.30 (from langchain)
  Downloading langchain_community-0.0.31-py3-none-any.whl.metadata (8.4 kB)
Collecting langchain-core<0.2.0,>=0.1.37 (from langchain)
  Downloading langchain_core-0.1.40-py3-none-any.whl.metadata (5.9 kB)
Collecting langchain-text-splitters<0.1,>=0.0.1 (from langchain)
  Downloading langchain_text_splitters-0.0.1-py3-none-any.whl.metadata (2.0 kB)
Collecting langsmith<0.2.0,>=0.1.17 (from langchain)
  Downloading langsmith-0.1.42-py3-none-any.whl.metadata (13 kB)
Collecting packaging<24.0,>=23.2 (from langchain-core<0.2.0,>=0.1.37->langchain)
  Downloading packaging-23.2-py3-none-any.whl.metadata (3.2 kB)
Collecting orjson<4.0.0,>=3.9.14 (from langsmith<0.2.0,>=0.1.17->langchain)
  Downloading orjson-3.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (49 kB)

# Configure Notebook
## Set Hugging Face Token
Hugging Face is used to access the Gemma 2B instruction-tuned model (`gemma-1.1-2b-it`). The Hugging Face token is stored as a Kaggle secret named `hf_token`.

In [2]:
import os
from kaggle_secrets import UserSecretsClient

token = UserSecretsClient().get_secret('hf_token')
os.environ["HUGGINGFACEHUB_API_TOKEN"] = token

# Pinecone Setup
I originally planned to store the vector embeddings in a Pinecone database, but Chroma turned out to work better and faster, so the final pipeline uses `chromadb`.

In [3]:
PINECONE_API_KEY = "f4145b01-ab06-4e7d-8a5f-51f7d8d9e9a1"
PINECONE_API_ENV = "gcp-starter"

# Import Libraries
Import all the necessary libraries here.

* **[PyPDFLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html)**: For loading data from PDF files; the notebook uses its directory-level counterpart, `PyPDFDirectoryLoader`, which loads every PDF in a folder.
* **[SentenceTransformerEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceEmbeddings.html)**: For generating sentence / text embeddings for comparison (to find the parts of the PDFs related to a question).
* **[Chroma](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html#langchain_community.vectorstores.chroma.Chroma)**: For vector (embeddings) storage.
* **[RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html#langchain-text-splitters-character-recursivecharactertextsplitter)**: To recursively try splitting text using different characters to find one that works.
* **[HuggingFaceEndpoint](https://api.python.langchain.com/en/latest/llms/langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint.html#langchain_community.llms.huggingface_endpoint.HuggingFaceEndpoint)**: To access Hugging Face Hub models.
* **[ConversationBufferMemory](https://api.python.langchain.com/en/latest/memory/langchain.memory.buffer.ConversationBufferMemory.html#langchain-memory-buffer-conversationbuffermemory)**: For storing and extracting the messages.
* **[PromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.prompt.PromptTemplate.html#langchain_core.prompts.prompt.PromptTemplate)**: To generate a customized prompt for the language model.
* **[ConversationalRetrievalChain](https://api.python.langchain.com/en/latest/chains/langchain.chains.conversational_retrieval.base.ConversationalRetrievalChain.html#langchain-chains-conversational-retrieval-base-conversationalretrievalchain)**: To create a conversational question-answering chain.

In [4]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma, Pinecone
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_community.llms import HuggingFaceEndpoint

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain_core.prompts import PromptTemplate
import pinecone



## Load Data from PDF
For loading data, I am using `pypdf`, a free and open-source pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files.
`PyPDFDirectoryLoader` loads every PDF file in the dataset directory, producing one document per page.

In [5]:
loader = PyPDFDirectoryLoader("/kaggle/input/research-papers/autism_papers")

docs = loader.load()
print(len(docs))

229


# Process and Store Data
## Split Data for Processing
For improving the information processing, comprehension, and retrieval it is essential to split large volumes of complex information into smaller, more manageable units or chunks. We need to group similar information together.

For that I am using `RecursiveCharacterTextSplitter`, which is the recommended one for generic text. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.
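
The idea behind recursive splitting can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name `recursive_split` and the sample text are made up for this example, and chunk overlap is left out to keep it short.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator first and fall
    back to finer separators only for pieces that are still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "Three."
)
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

Short paragraphs stay whole, while the long middle paragraph is broken at word boundaries, which is the "paragraphs, then sentences, then words" preference described above.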

## Create Embeddings
Embeddings map a sentence or passage to a numeric vector. These vectors can then be compared to find text with a similar meaning, which is useful for semantic textual similarity, semantic search, and paraphrase mining.

For embeddings, I am using `SentenceTransformers`, which is a Python framework for state-of-the-art sentence, text and image embeddings. 
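
As a rough sketch of what "comparing" embeddings means, the snippet below computes cosine similarity between small hand-written vectors. The three-dimensional vectors and their names are hypothetical stand-ins; real sentence embeddings from the model have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings: the two "sleep" sentences point in a
# similar direction, the "diet" sentence points elsewhere.
sleep_a = [0.9, 0.1, 0.0]
sleep_b = [0.8, 0.2, 0.1]
diet = [0.0, 0.2, 0.9]

similar = cosine_similarity(sleep_a, sleep_b)
dissimilar = cosine_similarity(sleep_a, diet)
print(similar > dissimilar)
```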

## Store Embeddings
To store the embeddings, I am using `Chroma`, which is the AI-native open-source embedding (vector) database. 
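
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors lie closest to a query vector. The class below is a toy in-memory sketch of that idea, not how Chroma is implemented: `TinyVectorStore`, the two-dimensional vectors, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class TinyVectorStore:
    """Toy stand-in for a vector database: brute-force nearest neighbours."""

    def __init__(self):
        self._items = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._items.append((vector, text))

    def search(self, query_vector, k=2):
        # Sort every stored item by distance to the query and keep the top k.
        ranked = sorted(self._items, key=lambda item: squared_l2(item[0], query_vector))
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1], "chunk about diagnosis")
store.add([0.1, 0.9], "chunk about therapy")
store.add([0.8, 0.2], "chunk about screening")

results = store.search([1.0, 0.0], k=2)
print(results)
```

A real vector database replaces the brute-force sort with an approximate nearest-neighbour index so that search stays fast as the number of chunks grows.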

In [6]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 300, chunk_overlap = 80)
documents = text_splitter.split_documents(docs)

embeddings = SentenceTransformerEmbeddings(model_name = 'all-MiniLM-L6-v2')

In [7]:
vectorstore = Chroma.from_documents(documents, embeddings)

# Pinecone Setup
As mentioned above, **Pinecone** was used earlier to store the vectors. Storing them requires an index first, so an index was created with 384 dimensions and the **cosine** metric.
The dimension is **384** because the sentence embedding model used here maps text to 384-dimensional vectors.

For example:

`query_result = embeddings.embed_query("Hello world")`<br>
`print("Length", len(query_result))`

The reported length would be **384**.

## Screenshot of the Index

![index-image](http://res.cloudinary.com/dwmwpmrpo/image/upload/v1712676972/dv44imdsp8g7x22ye9pu.png)


In [8]:
# pinecone.init(api_key = PINECONE_API_KEY,
#               environment = PINECONE_API_ENV)

# index_name = "rp-intern-project"

# docsearch = Pinecone.from_texts([t.page_content for t in documents], embeddings, index_name = index_name)

# Get Access to the Gemma Model
Use Hugging Face to access the Gemma model. For that I am using `HuggingFaceEndpoint`, an integration with Hugging Face's free serverless inference endpoints, so the model can be called without downloading or hosting it.

In [9]:
repo_id = "google/gemma-1.1-2b-it"

llm = HuggingFaceEndpoint(
    repo_id        = repo_id, 
    max_length     = 512,
    temperature    = 0.2,
    token          = token,
)

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## Use Chat History
I am also using `ConversationBufferMemory`, which keeps the chat history so that earlier turns can inform later answers.

## Create Prompt Template
Prompt templates are predefined recipes for generating customized prompts for language models. A prompt template may include instructions, few-shot examples, and specific context and questions appropriate for a given task.

I am using `PromptTemplate` for including instructions with the question that is entered by the user.
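
The core of a prompt template is placeholder substitution. The sketch below uses plain `str.format` rather than `PromptTemplate`, with a shortened, made-up instruction text, to show how the `{question}` slot is filled and why the separators between instruction lines matter.

```python
# Adjacent string literals are joined with no separator, so each
# instruction ends with an explicit newline to keep the lines apart.
template = (
    "Use the retrieved information to answer the question.\n"
    "Answer in full sentences.\n"
    "Question: {question}"
)


def build_prompt(question):
    """Fill the {question} placeholder, as a prompt template would."""
    return template.format(question=question)


prompt = build_prompt("What is retrieval augmented generation?")
print(prompt)
```

Without the trailing `\n` on each literal, the instructions would run together into a single unbroken sentence.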

## Retrieve the Answer through Conversation Chain
To retrieve the answer I am using `ConversationalRetrievalChain`, which takes the chat history and a new question and returns an answer to that question.
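
The data flow of such a chain can be sketched with plain functions. Everything here is a hypothetical stand-in: `condense`, `retrieve`, and `answer` replace the LLM and retriever calls with trivial string logic, purely to show how the history, the question, and the retrieved chunks fit together.

```python
def condense(chat_history, question):
    """Stand-in for the LLM call that rewrites a follow-up as a standalone question."""
    if not chat_history:
        return question
    last_question, _ = chat_history[-1]
    return f"{question} (follow-up to: {last_question})"


def retrieve(query, chunks, k=2):
    """Stand-in retriever: rank chunks by how many words they share with the query."""
    words = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: len(words & set(c.lower().split())), reverse=True)
    return ranked[:k]


def answer(question, context):
    """Stand-in for the answering LLM call: just report what it received."""
    return f"Answer to '{question}' based on {len(context)} chunk(s)."


def ask(question, chat_history, chunks):
    standalone = condense(chat_history, question)   # 1. use history to rephrase
    context = retrieve(standalone, chunks)          # 2. fetch relevant chunks
    reply = answer(standalone, context)             # 3. answer from those chunks
    chat_history.append((question, reply))          # 4. remember this turn
    return reply


chunks = ["autism is a spectrum disorder", "therapy options for autism", "unrelated text"]
history = []
first = ask("what is autism", history, chunks)
second = ask("how is it managed", history, chunks)
print(first)
print(second)
```

Because `history` is created once and passed into every call, the second question can see the first. A history created inside the function would start empty on each call.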

In [13]:
def get_answer(question):
    # Note: the memory is created on every call, so the chat history
    # starts empty for each question.
    memory = ConversationBufferMemory(memory_key = 'chat_history',
                                      return_messages = True)

    # Adjacent string literals are joined without a separator, so each
    # piece ends with an explicit space or newline.
    template = (
        "To answer your question, you will use the retrieved information, integrating it with your knowledge. "
        "The response should be comprehensive, and the entire question should be answered in an elaborate way using all the information you got. "
        "Try to answer in at least 100 to 800 words depending on the question.\n"
        "Question: {question}"
    )

    prompt = PromptTemplate.from_template(template)

    chain = ConversationalRetrievalChain.from_llm(
        llm                      = llm,
        chain_type               = "stuff",
        # With Pinecone, the retriever was: docsearch.as_retriever(search_kwargs = {'k': 3})
        retriever                = vectorstore.as_retriever(),
        memory                   = memory,
        condense_question_prompt = prompt,
    )

    # invoke() replaces the deprecated direct call chain({...})
    return chain.invoke({"question": question})

In [14]:
from IPython.display import display, Markdown

def format_response(res):
    return '\n\n'.join((
        f"**<font color='red'>Question:</font>** {res['question']}",
        f"**<font color='green'>Answer:</font>** {res['answer']}"
    ))

In [15]:
def ask_question(question):
    response = get_answer(question)
    return display(Markdown(format_response(response)))

# Question Answering

In [18]:
question = 'What is the cure of Autism Spectrum Disorder?'
ask_question(question)

**<font color='red'>Question:</font>** What is the cure of Autism Spectrum Disorder?

**<font color='green'>Answer:</font>**  The provided text does not contain any information regarding the cure of Autism Spectrum Disorder, so I cannot answer this question from the provided context.

In [19]:
question = 'What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?'
ask_question(question)

**<font color='red'>Question:</font>** What are Stereotypical and maladaptive behaviors in Autism Spectrum, how are these detected and managed?

**<font color='green'>Answer:</font>** 

Stereotypical and maladaptive behaviors in Autism Spectrum Disorder (ASD) are detected and managed through various approaches, including:

- **Clinical evaluation:** Comprehensive clinical evaluation by experienced professionals is crucial for identifying specific and persistent behaviors that deviate from typical patterns.


- **Developmental observation:** Observing a child's behavior in various contexts provides insights into their social and communication skills.


- **Behavioral assessments:** Standardized assessments like the Autism Diagnostic Observation Schedule (ADOS) and the Autism Diagnostic Interview (ADI) help quantify and categorize specific behaviors.


- **Sensory processing:** Understanding how a child processes sensory information is essential for managing challenging behaviors related to sensory sensitivities or overstimulation.


- **Environmental modifications:** Creating a structured and predictable environment with appropriate sensory supports can help reduce challenging behaviors.


- **Positive reinforcement:** Implementing positive reinforcement strategies can encourage desired behaviors and reduce negative behaviors.


- **Individualized interventions:** Tailoring interventions to each child's specific needs and strengths is crucial for effective management.