Skip to content

vinomaster/chatgpt-stack-tutorial

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Learning Journey

This repo captures early learning activities acround the building of a local GenAi solution.

The approach taken is iterative using the folowing steps:

  • Query CHatGPT for guidance
  • Investigate the usefulness of the guidance
  • Expand on queries and document findings.

A Jupyter Notebook Environment would dramatically help here.

Approach 1

Creating a domain-specific text analytics application with a natural language interface (NLU) as the user experience (UX).

Goal ChatGPT Guidance
Establish a comprehensive guide for a beginner to explore a corpus of text data from PDF files. Prompt 1 Guidance

Observations

  1. Code lacks integration of text-analytics with a model
  2. Proposed code uses deprecated Open SDK APIs

Issues

Problem 1

ModuleNotFoundError: No Module Named openai

Solution 1

Follow the steps below to install the openai package for the current interpreter

Enter the python terminal session using python and then run the following code

import sys
print(sys.executable)

get the current interpreter path

/Users/dag/Code/sandbox/chatgpt-101/text_analytics_env/bin/python

Copy the path and install openai using the following command in the terminal

/Users/dag/Code/sandbox/chatgpt-101/text_analytics_env/bin/python -m pip install openai

Problem 2

OpenAI Deprecated API.

Solution 2

OpenAI SDK Migration

Problem 3

Setting OpenAI API Key

Solution 3

export OPENAI_API_KEY=<ENTER KEY  HERE>

Approach 2

Integration of text-analytics with language model

Creating a domain-specific text analytics application with a natural language interface (NLU) as the user experience (UX).

Goal ChatGPT Guidance
Integrate the extracted PDF data into the text analytics application and ensure the language model (engine) can provide accurate, domain-specific responses. Prompt 2 Guidance

Observations

Two options were proposed. Explored Option 2: Using Document Embeddings for Retrieval-Based Q&A which yielded numerous runt-time errors.

Issues

Problem 4

Integrated PDF data

Solution 4

See library docs

pip install -U sentence-transformers

Problem 5

Object of type SentenceTransformer is not JSON serializable

Research Inquiry 1

Creating a domain-specific text analytics application with a natural language interface using Python involves several steps. Here’s a detailed guide for beginners to establish a reusable set of Python scripts to accomplish this task, using the latest versions of OpenAI SDK and Streamlit for the UX.

Goal ChatGPT Guidance
Propose a solution that manages embeddings and indexes manually with FAISS. Prompt 1 Guidance

Research Inquiry 2

How would the solution to Approach 3 be modified and improved by using Vector DBs?

Goal ChatGPT Guidance
Explore Vector DB benefits Technology Comparison from Prompts 3 and 5

Several Vector DB options, namely, Milvus, Weaviate, Pinecone, Cassio, and MindsDB are considered and compared.

Approach 3

Test an alternative using MindsDB as the Vector DB. This will aloow for the managing and querying of embeddings. MindsDB is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.

Goal ChatGPT Guidance
Leverage an open-source self managed vector db solution using MindsDB. Prompt 4 Guidance

Observations

  1. Setup
pip install mindsdb openai streamlit PyPDF2
brew install libmagic # for macOS
  1. Create Virtual Env
python -m venv mindsdb-venv
  1. Activate Virtual environment
source mindsdb-venv/bin/activate

Results

Abandoned approach. MAy still be viable but new insighjts suggest Chroma is a better approach.

Approach 4

Test an alternative using Chroma as the Vector DB. This will allow for the managing and querying of embeddings. Chroma is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.

Goal ChatGPT Guidance
Leverage an open-source self managed vector db solution using Chroma. Prompt 1 Guidance

Observations

  1. Chroma Setup
  2. Install and run Docker Hub Image chromadb/chroma:latest
  3. This approach publically shares the PDF data with OpenAI Servers. To avoid this we can consider Local Embedding Generation.

Decision

Explore a Local Embedding Generation solution.

Approach 5

Test an alternative using Chroma as the Vector DB and local embedding. This will prevent the sharing of data and allow for the managing and querying of embeddings. Chroma is particularly suitable for integrating machine learning models with databases, and it can work well with vector search tasks.

Goal ChatGPT Guidance
Leverage an open-source self managed vector db solution using Chroma and Local Embedding Generation. Prompt 2 Guidance

Observations

  1. Chroma Setup
  2. Install and run Docker Hub Image chromadb/chroma:latest
  3. See ULIDs
    pip install py-ulid
    
  4. Good Sentence Transformers Article

Status

Solution works in that it connects a front-end with local vector database that is primed with locally processed data. This will not scale but it helps to learn some of teh solution components.

The solution does not currently yield actual results. It is more of an operational example that needs tect analytics work.

Next Steps

  1. Work on Text Analytics capabilities so that a query will actually result in a list of meaningful results.
  2. Consider chuncking the PDF docs into sentences.

About

Code Assisted Learning Tutorial for PDF Analysis and Q&A Retrieval.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages