# Importing required packages

In [7]:
!pip install --upgrade langchain -q    # -q runs pip in quiet mode, suppressing most output so the notebook stays uncluttered

In [8]:
!pip install sentence_transformers -q     # pre-trained models that convert text into numerical embeddings for semantic search, text classification, and information retrieval

In [9]:
!pip install unstructured -q    # tools for parsing unstructured data such as text documents, PDFs, and images
!pip install "unstructured[local-inference]" -q     # adds the local-inference extra: the dependencies needed to run model inference on this machine instead of a remote server
# !pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q       # installs detectron2 (v0.6) from GitHub, a framework for object detection and image segmentation; the #egg=detectron2 fragment tells pip the project name of the source checkout

In [10]:
!apt-get install poppler-utils          # poppler-utils is a set of command-line tools built on the Poppler library for extracting text, images, and metadata from PDF files

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
poppler-utils is already the newest version (22.02.0-2ubuntu0.5).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


# 1. Loading from File Upload

In [11]:
!pip install PyMuPDF  # for pdf



In [14]:
from google.colab import files
import fitz  # PyMuPDF
import os

def upload_pdf():
    # Upload the file
    uploaded_file = files.upload()

    # Get the filename
    for fn in uploaded_file.keys():
        print(f'User uploaded file "{fn}" with length {len(uploaded_file[fn])} bytes')

        # Save the uploaded file to the local environment
        with open(fn, 'wb') as f:
            f.write(uploaded_file[fn])

        file_path = os.path.abspath(fn)
        print(f"File uploaded successfully to: {file_path}")

        return file_path

# Upload the PDF file and get the file path
file_path = upload_pdf()


Saving Nepal.pdf to Nepal.pdf
User uploaded file "Nepal.pdf" with length 183715 bytes
File uploaded successfully to: /content/Nepal.pdf


In [13]:
!pip install -U langchain-community       # community-maintained integrations (embeddings, vector stores, loaders) for langchain

Collecting langchain-community
  Downloading langchain_community-0.2.14-py3-none-any.whl.metadata (2.7 kB)
Downloading langchain_community-0.2.14-py3-none-any.whl (2.3 MB)
Installing collected packages: langchain-community
Successfully installed langchain-community-0.2.14


# 2. Splitting documents

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_docs(pdf_path, chunk_size=500, chunk_overlap=20):
    # Extract the text of every page in the PDF at pdf_path
    with fitz.open(pdf_path) as doc:
        text = ""
        for page in doc:
            text += page.get_text()

    # Split the text into chunks of at most chunk_size characters that overlap by chunk_overlap
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    return text_splitter.split_text(text)

# If a file was uploaded, split it into chunks
if file_path:
    docs = split_docs(file_path)
    print(len(docs))

5


In [16]:
print(docs[2])

below the poverty line. Political instability has hindered economic development and social 
progress. Additionally, Nepal is prone to natural disasters such as earthquakes, landslides, and 
floods, which can have devastating consequences. 
Despite these challenges, Nepal is a resilient nation with a rich cultural heritage. The country is 
home to diverse ethnic groups, each with its own unique language, customs, and traditions. Nepali
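
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); the helper name `chunk_text` and the sample string are assumptions made for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 20) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same idea the overlap of 20 characters serves in the cell above: context that straddles a chunk boundary is not lost entirely.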


# 3. Creating embeddings

In [17]:
from langchain_community.embeddings import SentenceTransformerEmbeddings  # the langchain.embeddings import path is deprecated

embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [18]:
query_result = embeddings.embed_query("Welcome to Chatbot")
len(query_result) # print the length of the embedding vector

384
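
Each text becomes a 384-dimensional vector, and semantic search works by comparing such vectors. The sketch below shows one common comparison, cosine similarity, on toy 3-dimensional vectors in plain Python. The vectors and the helper name `cosine_similarity` are illustrative assumptions, not values produced by the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction, 0.0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings
v_query = [1.0, 0.0, 1.0]
v_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
v_far = [0.0, 1.0, 0.0]     # orthogonal to the query

sim_close = cosine_similarity(v_query, v_close)
sim_far = cosine_similarity(v_query, v_far)
print(sim_close, sim_far)
```

Cosine similarity ignores vector length, so `v_close` scores as a near-perfect match even though it is twice as long as the query.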

# 4. Storing embeddings in Pinecone

In [19]:
!pip install --upgrade pinecone

Collecting pinecone
  Downloading pinecone-5.1.0-py3-none-any.whl.metadata (19 kB)
Collecting pinecone-plugin-inference<2.0.0,>=1.0.3 (from pinecone)
  Downloading pinecone_plugin_inference-1.0.3-py3-none-any.whl.metadata (2.2 kB)
Collecting pinecone-plugin-interface<0.0.8,>=0.0.7 (from pinecone)
  Downloading pinecone_plugin_interface-0.0.7-py3-none-any.whl.metadata (1.2 kB)
Downloading pinecone-5.1.0-py3-none-any.whl (245 kB)
Downloading pinecone_plugin_inference-1.0.3-py3-none-any.whl (117 kB)
[?25hDownloading pineco

In [20]:
!pip install langchain_pinecone

Collecting langchain_pinecone
  Downloading langchain_pinecone-0.1.3-py3-none-any.whl.metadata (1.7 kB)
Collecting pinecone-client<6.0.0,>=5.0.0 (from langchain_pinecone)
  Downloading pinecone_client-5.0.1-py3-none-any.whl.metadata (19 kB)
Downloading langchain_pinecone-0.1.3-py3-none-any.whl (10 kB)
Downloading pinecone_client-5.0.1-py3-none-any.whl (244 kB)
Installing collected packages: pinecone-client, langchain_pinecone
Successfully installed langchain_pinecone-0.1.3 pinecone-client-5.0.1


In [None]:
# !pip install --upgrade pinecone-client -q
# !pip install --upgrade pinecone-client[grpc] -q

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-metadata 1.15.0 requires protobuf<4.21,>=3.20.3; python_version < "3.11", but you have

In [None]:
# from pinecone.grpc import PineconeGRPC as Pinecone
# from pinecone import ServerlessSpec

# # Initialize Pinecone with the gRPC client (read the key from the environment rather than hard-coding it)
# pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# # Check if the index exists; if not, create it
# index_name = "chatbot2"
# try:
#   if index_name not in pc.list_indexes().names():
#     pc.create_index(
#         name=index_name,
#         dimension=384,  # Dimension of your embeddings
#         metric="euclidean",
#         spec=ServerlessSpec(
#             cloud="aws",
#             region="us-east-1"
#         )
#     )
#   print("Index creation or verification successful.")
# except Exception as e:
#    print("Error:", e)


Index creation or verification successful.
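
The index above uses `metric="euclidean"`. If the stored vectors are unit length (an assumption here; verify it for the embedding model in use), squared Euclidean distance and cosine similarity are tied together by the identity ‖a − b‖² = 2 − 2·(a · b), so both metrics rank neighbors in the same order. The sketch below checks that identity numerically on two toy vectors; all names in it are illustrative.

```python
import math

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

lhs = squared_euclidean(a, b)   # squared distance between the unit vectors
rhs = 2 - 2 * dot(a, b)         # same quantity expressed through the dot product
print(lhs, rhs)
```

For vectors that are not normalized the identity does not hold, and the two metrics can disagree on which neighbor is closest.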


In [43]:
# With HuggingFace embeddings
import os
from langchain.schema import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone

# Initialize Pinecone; PINECONE_API_KEY must already be set in the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "chatbot2"

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")  # Replace if needed

# Convert docs (list of strings) to Document objects
documents = [Document(page_content=doc) for doc in docs]

# Create the Pinecone vector store (the API key is read from the PINECONE_API_KEY environment variable)
vectorstore = PineconeVectorStore.from_documents(documents, embeddings, index_name=index_name)

In [45]:
query = "How is Nepal's economy?"
# PineconeVectorStore has no query() method; similarity_search returns the k closest Document objects
similar_docs = vectorstore.similarity_search(query, k=5)

print(f"Similar response for '{query}':")
for doc in similar_docs:
    print(f"- {doc.page_content}")

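In [41]:
from google.colab import userdata
import getpass
import os

# Prompt for the key without echoing it, and expose it to the Pinecone client through the environment
os.environ['PINECONE_API_KEY'] = getpass.getpass('Pinecone API Key:')

# Alternatively, read the key from Colab secrets (requires a secret named PINECONE_API_KEY)
api_key = userdata.get('PINECONE_API_KEY')
# Do not display the key in the cell output

Pinecone API Key:··········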

In [46]:
import os
from pinecone import Pinecone, PineconeApiException, ServerlessSpec
from langchain.schema import Document
from langchain_pinecone import PineconeVectorStore

# Initialize Pinecone with the key from the environment instead of a hard-coded value
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

index_name = "chatbot2"

# Check if the index exists
# if index_name not in pc.list_indexes().names():
#     try:
#         # Create the index only if it does not exist
#         pc.create_index(
#             name=index_name,
#             dimension=384,  # Dimension of your embeddings
#             metric="euclidean",
#             spec=ServerlessSpec(
#                 cloud="aws",
#                 region="us-east-1"
#             )
#         )
#         print(f"Index '{index_name}' created successfully.")
#     except PineconeApiException as e:
#         if "ALREADY_EXISTS" in str(e):
#             print(f"Index '{index_name}' already exists. Skipping creation.")
#         else:
#             raise e  # Raise if the error is something else

# Convert docs (list of strings) to Document objects
documents = [Document(page_content=doc) for doc in docs]

# Use the Langchain integration with Pinecone
# index = LangchainPinecone.from_documents(documents, embeddings, index_name=index_name)
vectorstore = PineconeVectorStore.from_documents(documents, embeddings, index_name=index_name)

print("Index creation and document insertion completed successfully.")

Index creation and document insertion completed successfully.


# 5. Access and search embeddings using similarity_search

In [48]:
def get_similiar_docs(query, k=1, score=False):
    # Return the k chunks closest to the query, optionally paired with their scores
    if score:
        similar_docs = vectorstore.similarity_search_with_score(query, k=k)
    else:
        similar_docs = vectorstore.similarity_search(query, k=k)
    return similar_docs

query = "How is Nepal natural disaster"
similar_docs = get_similiar_docs(query)
similar_docs


[Document(page_content='below the poverty line. Political instability has hindered economic development and social \nprogress. Additionally, Nepal is prone to natural disasters such as earthquakes, landslides, and \nfloods, which can have devastating consequences. \nDespite these challenges, Nepal is a resilient nation with a rich cultural heritage. The country is \nhome to diverse ethnic groups, each with its own unique language, customs, and traditions. Nepali')]
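
Conceptually, a similarity search with parameter `k` embeds the query, measures its distance to every stored vector, and returns the `k` nearest chunks. The sketch below reproduces that idea as a brute-force, in-memory search over toy 2-dimensional vectors. It is only an illustration: Pinecone uses its own indexing rather than a linear scan, and the names `top_k`, `store`, and the vectors themselves are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, store, k=1):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, euclidean(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

# Toy store: (chunk text, embedding) pairs standing in for the indexed PDF chunks
store = [
    ("chunk about geography", [0.9, 0.1]),
    ("chunk about natural disasters", [0.1, 0.9]),
    ("chunk about culture", [0.5, 0.5]),
]

query_vec = [0.0, 1.0]  # pretend this is the embedded query
results = top_k(query_vec, store, k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

Returning the distance alongside each text mirrors the `score=True` branch of `get_similiar_docs`, which yields the document together with its score.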

In [49]:
# BEST RESPONSE:

def get_best_response(similar_docs, keyword="natural disaster"):
    # Concatenate the retrieved chunks whose text mentions the keyword
    best_response = ""
    for doc in similar_docs:
        if keyword in doc.page_content:
            best_response += doc.page_content
    return best_response

query = "How is Nepal natural disaster"
similar_docs = get_similiar_docs(query)
best_response = get_best_response(similar_docs)
print(best_response)

below the poverty line. Political instability has hindered economic development and social 
progress. Additionally, Nepal is prone to natural disasters such as earthquakes, landslides, and 
floods, which can have devastating consequences. 
Despite these challenges, Nepal is a resilient nation with a rich cultural heritage. The country is 
home to diverse ethnic groups, each with its own unique language, customs, and traditions. Nepali
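
The keyword check in `get_best_response` is case-sensitive and tied to a single phrase. As a possible variation, the sketch below filters plain strings against several keywords while ignoring case. The helper name `filter_by_keywords` and the sample texts are hypothetical and not part of the notebook above.

```python
def filter_by_keywords(texts, keywords):
    """Keep the texts that contain at least one keyword, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [text for text in texts if any(keyword in text.lower() for keyword in lowered)]

# Sample strings standing in for the page_content of retrieved chunks
texts = [
    "Nepal is prone to Natural Disasters such as earthquakes.",
    "Nepal has a rich cultural heritage.",
]

matches = filter_by_keywords(texts, ["natural disaster", "flood"])
print(matches)
```

To apply the same idea to retrieved results, pass the `page_content` of each document instead of the sample strings.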
