In [14]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-06-25 06:05:02--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py.4’


2024-06-25 06:05:02 (15.0 MB/s) - ‘minsearch.py.4’ saved [3832/3832]



In [15]:
import minsearch
import json

with open('documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

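# Flatten the per-course lists into one list of documents,
# tagging each document with the course it belongs to.
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)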

documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [16]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [17]:
index.fit(documents)

<minsearch.Index at 0x77df86c004d0>

In [18]:
from openai import OpenAI
import os
from dotenv import load_dotenv

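# Load environment variables (such as OPENAI_API_KEY) from a local .env file.
load_dotenv()
# Machine-specific: points SSL at a local Zscaler proxy certificate.
# Change or remove this path when running elsewhere.
zscaler_cert_path = "/home/victoralmeida/continuous-learning/certificados/ZscalerCert.pem"
os.environ['SSL_CERT_FILE'] = zscaler_cert_path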

client = OpenAI()

# Baseline: ask the model directly, with no retrieved context.
q = 'the course has already started, can I still enroll?'
response = client.chat.completions.create(
    model='gpt-3.5-turbo',
    messages=[{"role": "user", "content": q}]
)

response.choices[0].message.content

'It depends on the specific policies of the course and the institution offering it. Some courses may allow late enrollment within a certain window of time, while others may have a strict cutoff date for enrollment. It is best to contact the course instructor or administrator to inquire about the possibility of enrolling late.'

In [19]:
def search(query):
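    # Weight matches in 'question' 3x and in 'section' 0.5x;
    # fields not listed here keep their default weight.
    boost = {'question': 3.0, 'section': 0.5}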

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results
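
The effect of `filter_dict` and `boost_dict` in the search call above can be illustrated with a toy ranking function. This is a minimal sketch, not minsearch's actual scoring: it uses plain term counts as a stand-in, and `toy_search` and the sample documents are hypothetical names introduced only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep only word characters, so "kafka?" matches "kafka".
    return re.findall(r"\w+", text.lower())


def field_score(query_tokens, text):
    # Stand-in relevance score: how often the query terms occur in the field.
    counts = Counter(tokenize(text))
    return sum(counts[token] for token in query_tokens)


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_tokens = tokenize(query)
    scored = []
    for doc in docs:
        # Keyword filter: exact match required on every filtered field.
        if any(doc.get(key) != value for key, value in filter_dict.items()):
            continue
        # Boost: each field's score is multiplied by its weight (default 1.0).
        score = sum(
            boost_dict.get(field, 1.0) * field_score(query_tokens, doc.get(field, ""))
            for field in text_fields
        )
        if score > 0:
            scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical sample documents, made up for this example.
toy_docs = [
    {"question": "How do I run kafka?", "text": "Use the terminal.", "section": "Module 6", "course": "a"},
    {"question": "Setup", "text": "kafka kafka kafka", "section": "Module 6", "course": "a"},
    {"question": "How do I run kafka?", "text": "Other course.", "section": "Module 6", "course": "b"},
]

fields = ["question", "text", "section"]
plain = toy_search(toy_docs, "run kafka", fields, filter_dict={"course": "a"})
boosted = toy_search(toy_docs, "run kafka", fields, filter_dict={"course": "a"},
                     boost_dict={"question": 3.0, "section": 0.5})
print(plain[0]["question"], "|", boosted[0]["question"])
```

Without the boost, the document that repeats the term in its text ranks first; with the question field weighted 3x, the document whose question matches the query moves to the top. The document from course `b` never appears because of the filter.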

In [20]:
def build_prompt(query, search_results):

    prompt_template = """
        You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
        Use only the facts from the CONTEXT when answering the QUESTION.

        QUESTION: {question}

        CONTEXT: 
        {context}
        """.strip()

    context = ""

    for doc in search_results:
        context = context + \
            f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt
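
Because the template above is an indented triple-quoted string and `.strip()` only trims the ends, the inner lines keep their leading spaces in the final prompt. A possible variant, sketched here under the assumption that the extra indentation is not wanted, uses `textwrap.dedent`. The names `PROMPT_TEMPLATE`, `format_context`, and `build_prompt_dedented`, as well as the sample document, are hypothetical.

```python
import textwrap

# Same wording as the template above, dedented so that source-code
# indentation does not end up inside the prompt.
PROMPT_TEMPLATE = textwrap.dedent("""
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT:
    {context}
""").strip()


def format_context(search_results):
    # One block per document, separated by a blank line.
    return "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )


def build_prompt_dedented(query, search_results):
    context = format_context(search_results)
    return PROMPT_TEMPLATE.format(question=query, context=context)


# Hypothetical sample document, only to show the shape of the prompt.
sample_docs = [{
    "section": "General course-related questions",
    "question": "Course - Can I still join?",
    "text": "Yes.",
    "course": "data-engineering-zoomcamp",
}]
print(build_prompt_dedented("Can I still join?", sample_docs))
```

Printing the prompt before sending it is a cheap way to check what the model actually receives.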

In [21]:
def llm(prompt):
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

In [22]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer
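
The `rag` function above is a fixed chain of retrieval, prompt construction, and generation. A minimal sketch of the same flow with the three steps passed in as arguments makes it possible to exercise the pipeline offline; `rag_with` and the `fake_*` helpers are hypothetical stand-ins, not part of the notebook.

```python
def rag_with(query, search_fn, prompt_fn, llm_fn):
    # Retrieval -> prompt construction -> generation, in the same order as rag().
    search_results = search_fn(query)
    prompt = prompt_fn(query, search_results)
    return llm_fn(prompt)


# Hypothetical stand-ins used only to exercise the control flow
# without a search index or an API call.
def fake_search(query):
    return [{"section": "s", "question": "q", "text": "You can still join."}]


def fake_prompt(query, search_results):
    context = "\n".join(doc["text"] for doc in search_results)
    return f"QUESTION: {query}\nCONTEXT: {context}"


def fake_llm(prompt):
    # Returns the context part of the prompt instead of calling a model.
    return prompt.split("CONTEXT: ", 1)[1]


print(rag_with("Can I still join?", fake_search, fake_prompt, fake_llm))
```

With this shape, swapping the retrieval backend means passing a different `search_fn` instead of redefining the whole function.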

In [23]:
query = 'how do I run kafka?'
rag(query)

'To run Kafka, follow the instructions based on the programming language you\'re using.\n\n**For Java:**\n1. Ensure you\'re in the project directory.\n2. Run the following command in the terminal:\n   ```sh\n   java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n   ```\n\n**For Python:**\n1. Create a virtual environment and set it up:\n   ```sh\n   python -m venv env\n   source env/bin/activate\n   pip install -r ../requirements.txt\n   ```\n   Note: For Windows, the activation command is `env/Scripts/activate`.\n\n2. If you encounter any permission issues, particularly while running a `build.sh` script, ensure it has executable permissions by running:\n   ```sh\n   chmod +x build.sh\n   ```\n\n3. To resolve the error "ModuleNotFoundError: No module named \'kafka.vendor.six.moves\'", install the alternative kafka-python package:\n   ```sh\n   pip install kafka-python-ng\n   ```\n\nThese steps should help you run Kafka using either Java or Py

In [24]:
query = 'the course has already started, can I still enroll?'
rag(query)


'Yes, you can still enroll in the course even after it has started. You are eligible to submit the homeworks, but be mindful of the deadlines for turning in the final projects.'

In [25]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [33]:
from elasticsearch import Elasticsearch
es_client = Elasticsearch('http://localhost:9200') 
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

if not es_client.indices.exists(index=index_name):
    es_client.indices.create(index=index_name, body=index_settings)
    print(f"Index '{index_name}' created.")
else:
    print(f"Index '{index_name}' already exists.")

Index 'course-questions' already exists.


In [34]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:04<00:00, 189.98it/s]
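
The loop above lets Elasticsearch generate a new id for every document, so running it again against an index that already exists adds the same 948 documents a second time. One way around this, sketched below, is to derive a deterministic `_id` from the document and send everything through `elasticsearch.helpers.bulk`. The id scheme and the names `doc_id`, `bulk_actions`, and `bulk_index` are assumptions made for illustration.

```python
import hashlib


def doc_id(doc):
    # Hypothetical id scheme: hash of the fields that identify an FAQ entry.
    key = f"{doc['course']}|{doc['question']}|{doc['text']}"
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def bulk_actions(documents, index_name):
    # Each action names the target index, a stable id, and the document body.
    for doc in documents:
        yield {"_index": index_name, "_id": doc_id(doc), "_source": doc}


def bulk_index(es_client, documents, index_name):
    # Imported here so the action generator can be used and tested
    # without the elasticsearch package installed.
    from elasticsearch.helpers import bulk

    # bulk() returns a tuple: (number of successful actions, list of errors).
    return bulk(es_client, bulk_actions(documents, index_name))
```

Calling `bulk_index(es_client, documents, index_name)` a second time would then overwrite documents with the same id rather than duplicate them.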


In [35]:
query = 'I just discovered the course. Can I still join it?'

def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs
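
The function above hard-codes the course filter and the result size. A minimal sketch that keeps the same query body but takes both as parameters follows; `build_search_query` and `elastic_search_for` are hypothetical names, and the defaults simply mirror the values used above.

```python
def build_search_query(query, course="data-engineering-zoomcamp", size=5):
    # Same structure as the query above: a boosted multi_match that must
    # match, plus a non-scoring term filter on the keyword field.
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }


def elastic_search_for(es_client, index_name, query,
                       course="data-engineering-zoomcamp", size=5):
    response = es_client.search(
        index=index_name,
        body=build_search_query(query, course, size),
    )
    # Keep only the stored documents, dropping scores and metadata.
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

Building the query in a separate function also makes it easy to inspect or print the request body before sending it.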

In [37]:
query = 'How do I execute a command in a running docker container?'

elastic_search(query=query)

[{'text': 'In case running pgcli  locally causes issues or you do not want to install it locally you can use it running in a Docker container instead.\nBelow the usage with values used in the videos of the course for:\nnetwork name (docker network)\npostgres related variables for pgcli\nHostname\nUsername\nPort\nDatabase name\n$ docker run -it --rm --network pg-network ai2ys/dockerized-pgcli:4.0.1\n175dd47cda07:/# pgcli -h pg-database -U root -p 5432 -d ny_taxi\nPassword for root:\nServer: PostgreSQL 16.1 (Debian 16.1-1.pgdg120+1)\nVersion: 4.0.1\nHome: http://pgcli.com\nroot@pg-database:ny_taxi> \\dt\n+--------+------------------+-------+-------+\n| Schema | Name             | Type  | Owner |\n|--------+------------------+-------+-------|\n| public | yellow_taxi_data | table | root  |\n+--------+------------------+-------+-------+\nSELECT 1\nTime: 0.009s\nroot@pg-database:ny_taxi>',
  'section': 'Module 1: Docker and Terraform',
  'question': 'PGCLI - running in a Docker container',
 

In [28]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [29]:
query = 'how do I run kafka?'
rag(query)

'To run Kafka, based on the context provided for your course, you can follow these steps specifically for running Java Kafka producer/consumer/kstreams from the terminal:\n\nIn the project directory, run the following command:\n\n```sh\njava -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java\n```\n\nMake sure to replace `<jar_name>` with the actual name of your JAR file.\n\nThis command will execute the Kafka producer/consumer/kstream you have set up in your Java project.'

In [36]:
query = 'I just discovered the course. Can I still join it?'
rag(query)

"Yes, you can still join the course even if you just discovered it now. Even if you're not registered, you're still eligible to submit the homeworks. Just keep in mind that there will be deadlines for submitting the final projects, so try not to leave everything for the last minute."