In [None]:
import sys
import os

# Get the absolute path of the parent directory of the "notebooks" directory
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

# Add the parent directory to the Python path
sys.path.append(parent_dir)

## To Do

- Parse PDF
- Text Splitters
- Count Tokens
- Use different libraries - langchain
- Generate embeddings
- Store in Qdrant
- Retrieve from Qdrant
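
The "Text Splitters" and "Count Tokens" steps above can be sketched in plain Python before reaching for a library. This is a minimal illustration, not the LangChain implementation: `split_text` and `rough_token_count` are hypothetical helper names, the chunk size and overlap are arbitrary, and the word-based count is only a crude stand-in for a real tokenizer.

```python
from typing import List


def split_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at
    a chunk boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


def rough_token_count(text: str) -> int:
    """Crude estimate: number of whitespace-separated words (assumption, not a real tokenizer)."""
    return len(text.split())
```

Overlapping chunks trade some storage for better recall, since context that straddles a boundary is not lost.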


### Different PDF parsers

- `PyMuPDF` is a Python wrapper for the MuPDF library, which is a lightweight PDF and XPS viewer and parser. It can be used to extract text, images, and other data from PDF files, as well as to manipulate PDF files programmatically. It provides a comprehensive set of tools for working with PDF files, including merging and splitting PDFs, adding annotations and bookmarks, and converting PDFs to other formats.

- `deepdoctection` is a Python library for document analysis and OCR (Optical Character Recognition). It provides tools for detecting text, images, and tables in PDF files, as well as for performing OCR on scanned documents. It uses deep learning models for layout analysis and text recognition.

In [None]:
%pip install PyMuPDF

Set the OpenAI API key as an environment variable. On Linux or macOS, run the following command in a terminal:

`export OPENAI_API_KEY=<your_key_here>`

Then start (or restart) Jupyter from that same terminal session so that the notebook kernel inherits the variable.
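
Reading the key with `os.environ["OPENAI_API_KEY"]` raises a bare `KeyError` when the variable is missing. A small check like the sketch below gives a clearer message. The helper name `require_env` and its `env` parameter are assumptions made for illustration; they are not part of any library.

```python
import os


def require_env(name: str, env=os.environ) -> str:
    """Return the value of an environment variable, or fail with a readable message."""
    value = env.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell and restart Jupyter."
        )
    return value
```

In the notebook this could be used as `openai_api_key = require_env("OPENAI_API_KEY")`.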

In [None]:
import os
import qdrant_client

collection_name = "langchain_documents"
qdrant_url = "http://localhost:6333/"
qdrant_port = 6333
openai_api_key = os.environ["OPENAI_API_KEY"]
query = "What wrappers are provided by SearxNG search API"

## Retrieval

### Similarity search

The simplest way to use the Qdrant vector store is to perform a similarity search. Under the hood, the query is encoded with the `embedding_function`, and the resulting vector is used to find the most similar documents in the Qdrant collection.
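
To make the "under the hood" step concrete, here is a brute-force sketch of vector similarity search over a handful of toy vectors. It assumes cosine similarity as the metric; in Qdrant the distance metric is configured per collection and the search uses an approximate index rather than a full scan, so this illustrates the idea only. The names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the k (doc_id, score) pairs with the highest similarity to the query."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

With real data, `query_vec` would come from the embedding model and `doc_vecs` from the vectors stored at indexing time.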

In [None]:
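from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

# Embedding model used to encode the query; it must match the model used at indexing time
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)

client = qdrant_client.QdrantClient(url=qdrant_url, port=qdrant_port)

qdrant = Qdrant(client=client,
                collection_name=collection_name,
                embedding_function=embeddings.embed_query)

found_docs = qdrant.similarity_search(query)
print(found_docs[0].page_content)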

Sometimes we want the search to also return a relevance score, so we can judge how good a particular result is.

In [None]:
found_docs = qdrant.similarity_search_with_score(query)
document, score = found_docs[0]
print(document.page_content)
print(f"\nScore: {score}")

In [None]:

from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import Qdrant
from qdrant_client import QdrantClient

client = QdrantClient(url=qdrant_url)

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)
qdrant = Qdrant(client=client, collection_name=collection_name, embedding_function=embeddings.embed_query)

# Retrieve the two most similar documents to use as context
search_results = qdrant.similarity_search(query, k=2)

# "stuff" chain: all retrieved documents are inserted into a single prompt
chain = load_qa_chain(OpenAI(openai_api_key=openai_api_key, temperature=0.2), chain_type="stuff")
results = chain({"input_documents": search_results, "question": query}, return_only_outputs=True)

print(results["output_text"])