In [1]:
import requests

In [2]:
import sys
import os

# 1.3 Retrieval and Search - Notes
- Before jumping into using Elasticsearch to index our documents, we're going to use the search engine built by DTC
- To do that, I need to import minsearch.py (the search engine library)
- Once that's done, we need to understand a few things about how `minsearch` is implemented:
    - `minsearch.Index()` is the constructor used to create the index. It takes a few parameters, including `text_fields` and `keyword_fields`
        - text_fields: the fields we search through
        - keyword_fields: the fields used to filter the data (i.e. similar to a WHERE clause in SQL)
    - So for example, if you pass a query like: ***"How do I execute a command in a running docker container?"*** the search engine would filter results by `keyword_fields` and would search through `text_fields`
    - `index.fit()` is the method used to *fit* your Index to the documents. So in this case, you would pass it the list of documents containing the relevant keyword_fields and text_fields
    - `index.search()`:
        - This is the method used to actually search the fitted documents for a specific question
        - All the `text_fields` you search through are given equal weighting. If you want to change that, you can use a parameter called `boost_dict` which lets you ***relatively*** overweight or underweight certain `text_fields` by passing it a dictionary of the form `{text_field: weight}`
        - There are two other parameters that are pretty straightforward: `filter_dict`, which lets you filter on `keyword_fields` by passing a dict of the form `{keyword_field: value}`, and `num_results`, which limits the number of results it returns
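
To make the filter / search / boost split concrete, here is a toy sketch in plain Python. It is *not* how `minsearch` scores documents: `toy_search`, its word-overlap scoring, and the three sample documents are all made up for illustration. The only things it borrows from the notes above are the parameter names (`filter_dict`, `boost_dict`, `num_results`) and the idea that keyword fields filter while text fields are searched.

```python
import re

# Toy illustration only: this is NOT minsearch's actual scoring or implementation.
def tokenize(text):
    """Lowercase the text and split it into a set of words, dropping punctuation."""
    return set(re.findall(r"\w+", text.lower()))


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=10):
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_words = tokenize(query)

    scored = []
    for doc in docs:
        # Keyword fields behave like a WHERE clause: exact match or the doc is dropped.
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        # Text fields are searched: count shared words, weighted by the field's boost.
        score = 0.0
        for field in text_fields:
            overlap = len(query_words & tokenize(doc.get(field, "")))
            score += boost_dict.get(field, 1.0) * overlap
        if score > 0:
            scored.append((score, doc))

    # Highest score first; sorting on the score only avoids comparing dicts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical sample documents, shaped like the FAQ entries used in this notebook.
docs = [
    {"question": "How do I run docker?", "text": "Install it first.",
     "course": "data-engineering-zoomcamp"},
    {"question": "Where is the homework?", "text": "Use docker exec to run a command.",
     "course": "data-engineering-zoomcamp"},
    {"question": "How do I run docker?", "text": "Ask in Slack.",
     "course": "mlops-zoomcamp"},
]

common = dict(
    docs=docs,
    query="run docker command",
    text_fields=["question", "text"],
    filter_dict={"course": "data-engineering-zoomcamp"},
)

boosted = toy_search(boost_dict={"question": 3, "text": 1}, **common)
scaled = toy_search(boost_dict={"question": 6, "text": 2}, **common)
unboosted = toy_search(**common)
```

In this toy setup, boosting `question` puts the document with the matching question first, while the unboosted search ranks the document with the matching answer text first. Doubling every boost (`scaled`) leaves the order unchanged, which is what "relative" weighting means here.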

In [3]:
# Map the relative path in order to import minsearch.py
current_dir = os.getcwd()
intro_dir = os.path.abspath(os.path.join(current_dir, "../../../01-intro"))
sys.path.append(intro_dir)

In [4]:
import minsearch

In [5]:
# import the FAQ documents (already parsed into json) into a list called documents

docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # fail early if the download did not succeed
documents_raw = docs_response.json()

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [6]:
field_names = {key for document in documents for key in document.keys()}
print("\nField names:", list(field_names))
print("\ncourses:\n",'\n'.join({course['course'] for course in documents_raw}))


Field names: ['question', 'text', 'section', 'course']

courses:
 machine-learning-zoomcamp
mlops-zoomcamp
data-engineering-zoomcamp


In [7]:
# Index based on the fields in our FAQ document
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [8]:
question = "How do I execute a command in a running docker container?"

In [9]:
index.fit(documents)

<minsearch.Index at 0x76f0fdc59e40>

In [29]:
boost = {
    "question":3,
    "text":1,
    "section":0.5
}

results = index.search(
    query = question,
    filter_dict = {"course":"data-engineering-zoomcamp"},
    boost_dict = boost,
    num_results = 20
)

In [11]:
results

[{'text': 'In case running pgcli  locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.\nBelow the usage with values used in the videos of the course for:\nnetwork name (docker network)\npostgres related variables for pgcli\nHostname\nUsername\nPort\nDatabase name\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\nPassword for root:\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\nVersion: 4.0.1\nHome: http://pgcli.com\nroot@pg-database:ny_taxi> \\dt\n+--------+------------------+-------+-------+\n| Schema | Name             | Type  | Owner |\n|--------+------------------+-------+-------|\n| public | yellow_taxi_data | table | root  |\n+--------+------------------+-------+-------+\nSELECT 1\nTime: 0.009s\nroot@pg-database:ny_taxi>',
  'section': 'Module 1: Docker and Terraform',
  'question': 'PGCLI - running in a Docker container',
 

# 1.4 Generating Answers with OpenAI GPT-4o
- In this section, we'll package up the results from our basic search engine in 1.3 and pass them as part of the context to the OpenAI API
- Using the chat completions API is pretty straightforward for basic usage. The documentation for the chat completions API can be found here: https://platform.openai.com/docs/api-reference/chat/create
- The general structure of this section is as follows:
    - Assume a set of results has been generated by minsearch (or any search engine) in the previous section
    - We want to build a prompt that includes instructions telling the LLM to restrict its answers to the content of those results *AND* the relevant content from those results for it to analyze
    - We pass that prompt to the LLM and get an answer back
    - That's it!
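
The three steps above can be sketched as one small pipeline. This is a minimal sketch, not the course's reference implementation: `build_context`, `build_prompt`, and `answer_question` are names I made up, and `fake_search` / `fake_llm` are hypothetical stand-ins so the flow runs without minsearch or an OpenAI API key. The prompt wording and the `section` / `question` / `text` fields are the ones used in this notebook.

```python
# Sketch only: `search_fn` and `llm_fn` are hypothetical stand-ins for the
# real search engine and the real LLM call.
PROMPT_TEMPLATE = """
You're a teaching assistant for a bootcamp course.
Restrict your answers to the QUESTION to the content in CONTEXT only.

QUESTION: {question}

CONTEXT: {context}
""".strip()


def build_context(results):
    """Turn search results into one text block, one entry per document."""
    return "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in results
    )


def build_prompt(question, results):
    """Fill the template with the user's question and the retrieved context."""
    return PROMPT_TEMPLATE.format(question=question, context=build_context(results))


def answer_question(question, search_fn, llm_fn):
    """Search, build the prompt, then hand the prompt to the LLM."""
    results = search_fn(question)
    prompt = build_prompt(question, results)
    return llm_fn(prompt)


def fake_search(question):
    # Made-up single result, shaped like the FAQ documents in this notebook.
    return [{
        "section": "Module 1: Docker and Terraform",
        "question": "PGCLI - running in a Docker container",
        "text": "placeholder answer text",
    }]


def fake_llm(prompt):
    # Stand-in that only reports what it received instead of calling an API.
    return f"(received a prompt of {len(prompt)} characters)"


sample_question = "How do I execute a command in a running docker container?"
sample_prompt = build_prompt(sample_question, fake_search(sample_question))
sample_answer = answer_question(sample_question, fake_search, fake_llm)
```

In the notebook itself, `search_fn` would wrap the `index.search(...)` call from 1.3, and `llm_fn` would wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`. Passing both in as functions is just a design choice that makes it easy to swap the search engine later.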

In [12]:
from openai import OpenAI

In [13]:
client = OpenAI()

In [20]:
prompt_template = """
You're a teaching assistant for a bootcamp course.
Restrict your answers to the QUESTION to the content in CONTEXT only.

QUESTION: {question}

CONTEXT: {context}
""".strip()

In [30]:
context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

In [31]:
prompt = prompt_template.format(question=question, context=context).strip()

In [25]:
print(prompt)

You're a teaching assistant for a bootcamp course.
Restrict your answers to the QUESTION to the content in CONTEXT only.

QUESTION: How do I execute a command in a running docker container?

CONTEXT: section: Module 1: Docker and Terraform
question: PGCLI - running in a Docker container
answer: In case running pgcli  locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.
Below the usage with values used in the videos of the course for:
network name (docker network)
postgres related variables for pgcli
Hostname
Username
Port
Database name
$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1
175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi
Password for root:
Server: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)
Version: 4.0.1
Home: http://pgcli.com
root@pg-database:ny_taxi> \dt
+--------+------------------+-------+-------+
| Schema | Name             | Type  | Owner |
|--------+------------------+

In [32]:
response = client.chat.completions.create(
    model = "gpt-4o",
    messages = [{"role": "user", "content": prompt}]
)

In [33]:
print(response.choices[0].message.content)

To execute a command in a running Docker container, you can follow these steps:

1. **Start a Bash session** in the running container by using the `docker exec` command. For example, if your container name is `pg-database`, you can use:
   ```sh
   docker exec -it pg-database bash
   ```

2. **Run the command** you need once you are inside the container. 

For example, if you want to run `pgcli` on a running Docker container:
```sh
docker exec -it pg-database pgcli -h pg-database -U root -p 5432 -d ny_taxi
```
This command starts a `pgcli` session directly in the running `pg-database` container with the specified host, user, port, and database information.
