DocuQuery AI is an intelligent document query application built using Streamlit. It allows users to upload PDF and DOCX files, processes them into vector embeddings for efficient document retrieval, and then enables users to query the contents of the documents using advanced language models. The application supports several pre-trained large language models and provides a user-friendly interface to query document content efficiently.
- Users can upload multiple files in PDF or DOCX format.
- The application processes and extracts the content of the files to be used in vector embeddings for easy document retrieval.
- The uploaded documents are split into smaller chunks and converted into vector embeddings using the
GoogleGenerativeAIEmbeddings
. - The FAISS vector store is used to store and manage the vectorized document chunks.
- Users can choose from a variety of powerful language models available in the sidebar. Models include:
gemma-7b-it
gemma2-9b-it
mixtral-8x7b-32768
llama3-70b-8192
llama3-8B
- ...and more.
- After processing the documents, users can ask questions in natural language. The application retrieves the relevant information from the uploaded documents and returns the most accurate response based on the context.
- The application displays the response time and the selected model for each query.
- The retrieved response includes document chunks that are most similar to the user's query, displayed in the expanded view for in-depth review.
- The app uses Streamlit to offer an intuitive and interactive experience, making it easy to upload documents, ask questions, and explore results.
- Each session automatically cleans up temporary files used during document upload and vectorization to optimize space and performance.
Follow the steps below to install and run DocuQuery AI on your local machine.
- Python 3.8 or higher
- Virtual environment (optional but recommended)
git clone https://github.com/your-repository/docuquery-ai.git
cd docuquery-ai
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
pip install -r requirements.txt
Make sure your requirements.txt
contains the following dependencies:
streamlit
langchain
langchain_groq
langchain_google_genai
langchain_community
faiss-cpu
PyPDF2
docx
shutil
Create a file called .streamlit/secrets.toml
in the root directory of your project and add your API keys:
[secrets]
GROQ_API_KEY = "your-groq-api-key"
GOOGLE_API_KEY = "your-google-api-key"
Replace your-groq-api-key
and your-google-api-key
with your actual API keys for the respective services.
Start the Streamlit application by running the following command:
streamlit run app.py
The application will open in your browser, allowing you to upload documents and start querying them.
- Upload Documents: On the sidebar, upload your PDF or DOCX files. You can upload multiple files at once.
- Process Files: Click the "Upload & Process Files" button to process your files. This will convert the document content into vector embeddings.
- Ask Questions: Once the files are processed, use the text input at the bottom to ask any questions related to the content of the uploaded documents.
- Review Results: The results will include the AI's answer and a list of document chunks that match your query the most.
docuquery-ai/
│
├── app.py # Main application code
├── requirements.txt # Dependencies file
└── .streamlit/
└── secrets.toml # API keys (not included in the repository)
- Document Vectorization: After documents are uploaded, they are split into chunks using
RecursiveCharacterTextSplitter
. These chunks are embedded into vectors using Google Generative AI embeddings and stored using the FAISS vector store. - Document Retrieval: When a user inputs a query, the system retrieves the most relevant document chunks using FAISS and responds based on the content retrieved.
- Model Switching: Users can switch between different pre-trained language models in the sidebar to get responses tailored by different architectures.
- Integration with additional file formats (e.g., TXT, HTML).
- Enhanced document visualization and interaction features.
- Real-time document summarization and preview before query.
Developed by Rakesh Kumar.
Feel free to connect on LinkedIn and share your feedback!
MIT License © 2024 DocuQuery AI.