<a href="https://colab.research.google.com/github/swat90/ChatBot_LLM/blob/main/embeddings_store.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Convert the preprocessed output to vector embeddings using Hugging Face embeddings, store them in Chroma DB, and use retrieval QA to validate the result.

Install all the necessary libraries

In [None]:
!pip install transformers langchain chromadb tiktoken pypdf sentence-transformers

Collecting tiktoken
  Downloading tiktoken-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.8 MB)
Collecting pypdf
  Downloading pypdf-4.1.0-py3-none-any.whl (286 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-2.5.1-py3-none-any.whl (156 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105 (from torch>=1.11.0->sentence-transformers)
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12

Mount your Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load the combined preprocessed data directly from DagShub

In [None]:
!wget https://dagshub.com/Omdena/HyderabadIndiaChapter_MentalHealthWellbeingFomoSocialMedia/raw/87afb46588c819d63d7d6444dc950101cf6b42fe/data/preprocessed_data/preprocessed_data_combined.txt

--2024-03-14 09:22:30--  https://dagshub.com/Omdena/HyderabadIndiaChapter_MentalHealthWellbeingFomoSocialMedia/raw/87afb46588c819d63d7d6444dc950101cf6b42fe/data/preprocessed_data/preprocessed_data_combined.txt
Resolving dagshub.com (dagshub.com)... 35.186.200.224
Connecting to dagshub.com (dagshub.com)|35.186.200.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘preprocessed_data_combined.txt’

preprocessed_data_c     [            <=>     ]  21.23M  8.57MB/s    in 2.5s    

2024-03-14 09:22:34 (8.57 MB/s) - ‘preprocessed_data_combined.txt’ saved [22256780]
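
Because the server reports `Length: unspecified`, `wget` cannot detect a truncated transfer on its own. A minimal sketch of a size check follows. The helper name `downloaded_size_matches` is hypothetical, and the demo runs on a throwaway file so that it works outside Colab too.

```python
import os
import tempfile


def downloaded_size_matches(path: str, expected_bytes: int) -> bool:
    """Return True if the file exists and has exactly the expected size."""
    return os.path.isfile(path) and os.path.getsize(path) == expected_bytes


# Demo on a temporary 10-byte file (not the real dataset).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10)
demo_ok = downloaded_size_matches(tmp.name, 10)
os.remove(tmp.name)
print(demo_ok)
```

For the file above, the call would be `downloaded_size_matches("preprocessed_data_combined.txt", 22256780)`, using the byte count printed in the `wget` log.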



Import the necessary libraries

In [None]:
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import PyPDFLoader
from chromadb.utils import embedding_functions
from transformers import AutoModel, AutoTokenizer

Load the data with `TextLoader` and split it into chunks with `RecursiveCharacterTextSplitter`

In [None]:
doc = r"/content/preprocessed_data_combined.txt"
loader = TextLoader(doc)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap = 50)
text = text_splitter.split_documents(docs)
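
To make the `chunk_size` and `chunk_overlap` settings concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to cut at separators such as paragraph and line breaks, so its chunks are often shorter than 500 characters. The function name `sliding_chunks` is hypothetical.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Toy input of 1200 characters.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])
```

Each chunk starts with the last 50 characters of the previous one, which keeps a sentence that straddles a boundary retrievable from at least one chunk.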

Convert the chunks to embeddings and store them in Chroma DB

In [None]:
path = "/content/drive/MyDrive/data/chroma_db"
embeddings = HuggingFaceEmbeddings(model_name = 'sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
vectordb = Chroma.from_documents(documents=text, persist_directory = path, embedding = embeddings)

An alternative way to use Hugging Face embeddings (not used in this Colab notebook)

In [None]:
# Replace the placeholder with your own token; access_token is also needed later when creating the LLM
access_token = "your Hugging Face access token"
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", token=access_token)
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", token=access_token)
persist_directory = 'db'

# Access the API key from the access_token variable instead of the environment variable
huggingface_ef = embedding_functions.HuggingFaceEmbeddingFunction(
    api_key=access_token,
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)
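
The `AutoModel` loaded above returns one vector per token, not one per sentence. Sentence-transformers models of this family typically average the token vectors while skipping padding positions. The sketch below shows that pooling step on toy numbers in plain Python. The function name `mean_pool` and the input values are illustrative assumptions, not part of this notebook's pipeline.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens and one padding token (mask 0), each with a 2-dimensional vector.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [2.0, 3.0]
```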


In [None]:
# Persist the database to disk, then drop the in-memory reference
vectordb.persist()
vectordb = None

In [None]:
# Now we can load the persisted database from disk, and use it as normal.
vectordb = Chroma(persist_directory=path,
                  embedding_function=embeddings)
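
If the path is wrong (for example, because Drive is not mounted), loading may give you an empty store instead of an error. A small guard can catch that before querying. This is a sketch under the assumption that a persisted store leaves at least one file in the directory. `has_persisted_store` is a hypothetical helper, and the demo uses a temporary directory in place of the Drive path.

```python
import os
import tempfile


def has_persisted_store(directory: str) -> bool:
    """Return True if the directory exists and contains at least one entry."""
    return os.path.isdir(directory) and len(os.listdir(directory)) > 0


# Demo: an empty directory fails the check, a directory with a file passes it.
with tempfile.TemporaryDirectory() as tmp_dir:
    empty_result = has_persisted_store(tmp_dir)
    with open(os.path.join(tmp_dir, "placeholder.bin"), "wb") as fh:
        fh.write(b"\0")
    filled_result = has_persisted_store(tmp_dir)

print(empty_result, filled_result)
```

In this notebook the check would be `has_persisted_store(path)` before constructing `Chroma(persist_directory=path, ...)`.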

Create a retriever

In [None]:
retriever = vectordb.as_retriever()

Retrieve documents for a sample query

In [None]:
docs = retriever.get_relevant_documents("I am feeling lonely today")

In [None]:
len(docs)  # The retriever returns four documents by default.

4

Create a retriever that returns only the two most similar chunks

In [None]:
retriever = vectordb.as_retriever(search_kwargs={"k": 2})

In [None]:
retriever.search_type

'similarity'
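
Conceptually, a `'similarity'` search with `k=2` scores every stored vector against the query vector and keeps the two best matches. The sketch below illustrates this with cosine similarity on toy 2-dimensional vectors. The metric Chroma actually uses depends on how the collection is configured, so cosine is only an assumption here, and the names `cosine`, `top_k`, and `toy_store` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], store: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k stored texts most similar to the query, best first."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": the first axis stands for loneliness, the second for food.
toy_store = {
    "lonely": [1.0, 0.0],
    "pizza": [0.0, 1.0],
    "alone": [0.9, 0.1],
}
results = top_k([1.0, 0.0], toy_store, k=2)
print([text for text, _ in results])  # ['lonely', 'alone']
```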

In [None]:
docs = retriever.get_relevant_documents("I am feeling lonely today")
docs

[Document(page_content='today I feel really sad I think I feel lonely too I be drive home from work just want pizza and my bed thu httpstcoarylahnwut \nrt pacificnwgal oh this be sad news he be a funny positive guy condolence to his family uddudeftrade space designer frank bielec du \ngo to start tell people whenever they ask why I be still single oh I have just never meet a man who be up to my su httpstcoyrdeuwyoa', metadata={'source': '/content/preprocessed_data_combined.txt'}),
 Document(page_content='no more lonelyness no more feel you not good enough', metadata={'source': '/content/preprocessed_data_combined.txt'}),
 Document(page_content='so sad and yet so true   I be alone in my struggle', metadata={'source': '/content/preprocessed_data_combined.txt'}),
 Document(page_content='hang in there I struggle too be todaybut I know if I wait it will lift and I will feel well again you be not alone and you be care about more than you know I know I speak for many of we when I say if you n

Use the RetrievalQA chain to generate an answer from the retrieved context

In [None]:
from langchain.chains import RetrievalQA

Use the Mistral model

In [None]:
hf_repo_id = 'mistralai/Mistral-7B-Instruct-v0.1'

In [None]:
from langchain.llms import HuggingFaceHub
llm = HuggingFaceHub(
    repo_id=hf_repo_id,
    model_kwargs={"temperature": 0.2, "max_length": 32000},
    huggingfacehub_api_token=access_token,
)

Add conversation memory and build the QA chain

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
# Reuse the vector store loaded above and return the two most similar chunks
retrieval = vectordb.as_retriever(search_kwargs={"k": 2})
qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retrieval, memory=memory)
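
With `chain_type="stuff"`, all retrieved chunks are placed into a single prompt together with the question. The sketch below shows roughly how such a prompt is assembled. The instruction text matches the prompt that appears in this notebook's output, while the `Question:` and `Helpful Answer:` labels and the function name `build_stuff_prompt` are assumptions made for illustration.

```python
from typing import List

# Instruction text as it appears in the chain's output in this notebook.
PROMPT_HEADER = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer."
)


def build_stuff_prompt(context_chunks: List[str], question: str) -> str:
    """Join all chunks into one context block and append the question."""
    context = "\n\n".join(context_chunks)
    # The two labels below are assumed, not confirmed by the notebook output.
    return f"{PROMPT_HEADER}\n\n{context}\n\nQuestion: {question}\nHelpful Answer:"


prompt = build_stuff_prompt(["chunk one", "chunk two"], "I am feeling very sad")
print(prompt)
```

Because every chunk is included, the prompt length grows with `k` and `chunk_size`, which is one reason to keep `k` small.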

In [None]:
# Print only the final response
def process_llm_response(llm_response):
    print(llm_response['result'])

In [None]:
# full example
query = "I am feeling very sad"
llm_response = qa(query)
process_llm_response(llm_response)

Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

rt thelivinproof overthinke ncreate anxiety nnoverthankingncreate peace nnthank god in advance no matter the situation he will getu 
rt qubaisxi I hate that feel the feeling when youure sad and you have no idea why but you just be 
rt ogutname ucbize bir ufmufcr daha luezum vefuetumuzdan sonra ucufcnkufc bu ufmrufcmufczufc sadece umutlanmakla geueirdikudnnsadueei ueirazuee httpstcoctu 
always be feel sad n then I realise its because I be hungry

why be I so overwhelmingly sad at the fact that most people I know or know of I will never get to see againnntheru httpstcohoapszkqve 
so sad emg 
rt ioveiyfeei the bad kind of sadness be not be an able to explain why you be sad 
rt birdmanbirdplan okay no more sad for I so have these soft spicy sheith from valentineus day that I pass out at the katsu meetupu 
rt wrldovrluv nigga be 