This repo captures early learning activities acround the building of a local GenAi solution.
The approach taken is iterative using the folowing steps:
- Query CHatGPT for guidance
- Investigate the usefulness of the guidance
- Expand on queries and document findings.
A Jupyter Notebook Environment would dramatically help here.
Creating a domain-specific text analytics application with a natural language interface (NLU) as the user experience (UX).
| Goal | ChatGPT Guidance |
|---|---|
| Establish a comprehensive guide for a beginner to explore a corpus of text data from PDF files. | Prompt 1 Guidance |
- Code lacks integration of text-analytics with a model
- Proposed code uses deprecated Open SDK APIs
ModuleNotFoundError: No Module Named openai
Follow the steps below to install the openai package for the current interpreter
Enter the python terminal session using python and then run the following code
import sys
print(sys.executable)
get the current interpreter path
/Users/dag/Code/sandbox/chatgpt-101/text_analytics_env/bin/python
Copy the path and install openai using the following command in the terminal
/Users/dag/Code/sandbox/chatgpt-101/text_analytics_env/bin/python -m pip install openai
OpenAI Deprecated API.
Setting OpenAI API Key
export OPENAI_API_KEY=<ENTER KEY HERE>
Integration of text-analytics with language model
Creating a domain-specific text analytics application with a natural language interface (NLU) as the user experience (UX).
| Goal | ChatGPT Guidance |
|---|---|
| Integrate the extracted PDF data into the text analytics application and ensure the language model (engine) can provide accurate, domain-specific responses. | Prompt 2 Guidance |
Two options were proposed. Explored Option 2: Using Document Embeddings for Retrieval-Based Q&A which yielded numerous runt-time errors.
Integrated PDF data
pip install -U sentence-transformers
Object of type SentenceTransformer is not JSON serializable
Creating a domain-specific text analytics application with a natural language interface using Python involves several steps. Here’s a detailed guide for beginners to establish a reusable set of Python scripts to accomplish this task, using the latest versions of OpenAI SDK and Streamlit for the UX.
| Goal | ChatGPT Guidance |
|---|---|
| Propose a solution that manages embeddings and indexes manually with FAISS. | Prompt 1 Guidance |
How would the solution to Approach 3 be modified and improved by using Vector DBs?
| Goal | ChatGPT Guidance |
|---|---|
| Explore Vector DB benefits | Technology Comparison from Prompts 3 and 5 |
Several Vector DB options, namely, Milvus, Weaviate, Pinecone, Cassio, and MindsDB are considered and compared.
Test an alternative using MindsDB as the Vector DB. This will aloow for the managing and querying of embeddings. MindsDB is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.
| Goal | ChatGPT Guidance |
|---|---|
| Leverage an open-source self managed vector db solution using MindsDB. | Prompt 4 Guidance |
pip install mindsdb openai streamlit PyPDF2
brew install libmagic # for macOS
- Create Virtual Env
python -m venv mindsdb-venv
- Activate Virtual environment
source mindsdb-venv/bin/activate
Abandoned approach. MAy still be viable but new insighjts suggest Chroma is a better approach.
Test an alternative using Chroma as the Vector DB. This will allow for the managing and querying of embeddings. Chroma is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.
| Goal | ChatGPT Guidance |
|---|---|
| Leverage an open-source self managed vector db solution using Chroma. | Prompt 1 Guidance |
- Chroma Setup
- Install and run Docker Hub Image
chromadb/chroma:latest - This approach publically shares the PDF data with OpenAI Servers. To avoid this we can consider Local Embedding Generation.
Explore a Local Embedding Generation solution.
Test an alternative using Chroma as the Vector DB and local embedding. This will prevent the sharing of data and allow for the managing and querying of embeddings. Chroma is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.
| Goal | ChatGPT Guidance |
|---|---|
| Leverage an open-source self managed vector db solution using Chroma and Local Embedding Generation. | Prompt 2 Guidance |
- Chroma Setup
- Install and run Docker Hub Image
chromadb/chroma:latest - See ULIDs
pip install py-ulid - Good Sentence Transformers Article
Solution works in that it connects a front-end with local vector database that is primed with locally processed data. This will not scale but it helps to learn some of teh solution components.
The solution does not currently yield actual results. It is more of an operational example that needs tect analytics work.
- Work on Text Analytics capabilities so that a query will actually result in a list of meaningful results.
- Consider chuncking the PDF docs into sentences.