### LLM Zoomcamp Pre-course 1 workshop
Credits and source: [GitHub](https://github.com/alexeygrigorev/llm-rag-workshop)

In [1]:
import os

from dotenv import load_dotenv
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_huggingface import HuggingFaceEndpoint

### Download the docs for RAG
`wget` may not be installed as a system command (see [Issue with wget](https://stackoverflow.com/questions/60760049/wget-is-not-recognized-as-an-internal-or-external-command-operable-program-or-b)), so the download command was changed from ```!wget``` to ```!python -m wget```.

In [2]:
# original code
# wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

# download documents.json only if it is not already in the current folder
if not os.path.isfile("documents.json"):
    !python -m wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
else:
    print("documents.json already exists.")

documents.json already exists.


### Examine the documents.json structure
Notice that each course holds multiple documents in a nested structure.

In [3]:
# display starting content in documents.json
!head documents.json

[
  {
    "course": "data-engineering-zoomcamp",
    "documents": [
      {
        "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
        "section": "General course-related questions",
        "question": "Course - When will the course start?"
      },
      {


### Load the documents
Flatten the nested structure into a single list of documents, tagging each document with its course name.

In [4]:
import json

with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

print(documents[2])
print(len(documents))

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp'}
948
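A quick sanity check after flattening is to count documents per course. The sketch below uses a toy list in place of the real `documents`; the helper name `count_by_course` and the second course name are made up for illustration.

```python
from collections import Counter

# Toy stand-in for the flattened `documents` list (the real one has 948 entries).
sample_documents = [
    {"course": "data-engineering-zoomcamp", "question": "Q1"},
    {"course": "data-engineering-zoomcamp", "question": "Q2"},
    {"course": "another-course", "question": "Q3"},  # hypothetical course name
]


def count_by_course(docs):
    """Return a Counter mapping each course name to its number of documents."""
    return Counter(doc["course"] for doc in docs)


counts = count_by_course(sample_documents)
for course, n in counts.most_common():
    print(f"{course}: {n}")
```

In the notebook, calling `count_by_course(documents)` would show how the 948 documents are split across courses.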


### Index documents with Elasticsearch

In [5]:
from elasticsearch import Elasticsearch, BadRequestError

# initiate the connection and check that it's working
es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': '0b2382f226a8', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'rOCvY7hmSv2wId31TWee2w', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

### Define the index
Before we can index the documents, we need to create an index (an index in Elasticsearch is like a table in a "usual" database):

In [6]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"

# original code
# response = es.indices.create(index=index_name, body=index_settings)
# response
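# Handle the error raised when the notebook is rerun and the index already exists.
# Note that BadRequestError covers any HTTP 400 response, not only "already exists".
try:
    response = es.indices.create(index=index_name, body=index_settings)
    print(response)
except BadRequestError:
    print("Index already exists")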

{'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'}
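An alternative to catching the error is to check for the index first. The sketch below assumes the 8.x Python client, where `indices.exists(index=...)` and keyword `settings`/`mappings` arguments to `indices.create` are available. The helper `ensure_index` and the `FakeClient` stand-in are hypothetical and exist only so the snippet runs without a cluster.

```python
def ensure_index(client, index_name, settings, mappings):
    """Create index_name unless it already exists; return True if it was created."""
    if client.indices.exists(index=index_name):
        return False
    client.indices.create(index=index_name, settings=settings, mappings=mappings)
    return True


# Minimal in-memory stand-in (not part of elasticsearch) so the sketch is runnable.
class FakeIndices:
    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, settings=None, mappings=None):
        self.created[index] = {"settings": settings, "mappings": mappings}


class FakeClient:
    def __init__(self):
        self.indices = FakeIndices()


fake_es = FakeClient()
settings = {"number_of_shards": 1, "number_of_replicas": 0}
mappings = {"properties": {"course": {"type": "keyword"}}}

first = ensure_index(fake_es, "course-questions", settings, mappings)   # creates the index
second = ensure_index(fake_es, "course-questions", settings, mappings)  # no-op on rerun
print(first, second)
```

With a real cluster, the `es` client from the notebook would be passed in place of `fake_es`.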


### Indexing with the defined index
Now we're ready to index all the documents.

In [7]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)

100%|████████████████████████████████████████| 948/948 [00:02<00:00, 382.98it/s]
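The loop above sends one request per document. For larger collections, a common alternative is the bulk helper in the Python client. The sketch below only builds the actions so it runs without a cluster; `generate_actions` is a hypothetical helper name, and the commented-out call assumes `elasticsearch.helpers.bulk` from the installed client.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document, with the target index and the document body."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Toy documents standing in for the real `documents` list.
sample_docs = [
    {"text": "A1", "section": "S", "question": "Q1", "course": "data-engineering-zoomcamp"},
    {"text": "A2", "section": "S", "question": "Q2", "course": "data-engineering-zoomcamp"},
]

actions = list(generate_actions(sample_docs, "course-questions"))
print(actions[0])

# With a running cluster and the `es` client from the notebook, this could replace the loop:
# from elasticsearch.helpers import bulk
# bulk(es, generate_actions(documents, index_name))
```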


### Retrieving the document
Query Elasticsearch with the following parameters:
- size: maximum number of results to return
- bool: combines the clauses a document has to satisfy; here a scored "must" clause and a non-scoring "filter" clause
- fields: the fields that "multi_match" searches for the query text. Notice that "course" is not searched. "^3" boosts the "question" field, so a match there weighs three times as much as a match in the other fields.
- filter: restricts the results to documents where "course" = "data-engineering-zoomcamp"
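The parameters above can be sketched as a small builder function that makes the course filter and the boost configurable. The function name `build_search_query` and its `question_boost` argument are illustrative, not part of the workshop code.

```python
def build_search_query(query, course, max_results=5, question_boost=3):
    """Return a search body: boosted multi_match on the text fields, filtered by course."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        # "question^3" means matches in "question" are weighted 3x
                        "fields": [f"question^{question_boost}", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # term filter on the keyword field: exact match, no effect on scoring
                "filter": {"term": {"course": course}},
            }
        },
    }


example_query = build_search_query(
    "How do I join the course after it has started?",
    course="data-engineering-zoomcamp",
)
print(example_query["query"]["bool"]["must"]["multi_match"]["fields"])
```

The returned dictionary has the same shape as the hand-written `search_query` in the next cell.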

In [8]:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

### Example query
Results for the query "How do I join the course after it has started?":

In [9]:
response = es.search(index=index_name, body=search_query)

for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.


Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.


Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terra

### Cleaning the retrieval
We can make it cleaner by putting it into a function:

In [10]:
def retrieve_documents(query, index_name="course-questions", max_results=5):
    es = Elasticsearch("http://localhost:9200")
    
    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents

In [11]:
user_question = "How do I join the course after it has started?"

response = retrieve_documents(user_question)

for doc in response:
    print(f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.


Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.


Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terra

### Generation - Answering queries with LLM
### Create context to feed into LLM

In [12]:
context_docs = retrieve_documents(user_question)

context = ""

for doc in context_docs:
    doc_str = f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n"
    context += doc_str

context = context.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependencies and requirements:
Google cloud account
Google Cloud SDK
Python 3 (installed with Anaconda)
Terrafo

### Create Prompt

In [13]:
# original code
# prompt = f"""
# You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database. 
# Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"

# QUESTION: {user_question}

# CONTEXT:

# {context}
# """.strip()

prompt_template = """
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database. 
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE", otherwise return one answer.

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()
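To preview what the LLM will receive, the placeholders can be filled with plain `str.format`. This is a sketch with a shortened stand-in template and a made-up context string; it assumes the simple `{name}` placeholders behave the same way when `PromptTemplate` fills them later.

```python
# Shortened stand-in for prompt_template; the placeholder names match the notebook.
template_sketch = """
Answer the user QUESTION based on CONTEXT.

QUESTION: {user_question}

CONTEXT:

{context}
""".strip()

filled = template_sketch.format(
    user_question="How do I join the course after it has started?",
    context="Section: General course-related questions\nQuestion: ...\nAnswer: ...",
)
print(filled)
```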

In [14]:
# # original code
# # Uses OpenAI LLM to process user question and context

# from openai import OpenAI

# client = OpenAI()

# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user", "content": "What's the formula for Energy?"}]
# )
# print(response.choices[0].message.content)

### Set up your own LLM with LangChain and HuggingFace

Requires the installation steps and a HuggingFace API key, as described in README.md.

In [15]:
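load_dotenv()
# os.environ["HF_KEY"] raises KeyError if the key is missing, instead of a TypeError from assigning None
os.environ["HUGGINGFACEHUB_API_TOKEN"] = os.environ["HF_KEY"]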

In [16]:
prompt = PromptTemplate(template=prompt_template, input_variables=["user_question", "context"])
output_parser = StrOutputParser()

In [17]:
def getLLM(repo_id):
    """
    Return a HuggingFaceEndpoint for text generation with the given HuggingFace model repo.
    """
    return HuggingFaceEndpoint(
        repo_id=repo_id,
        task="text-generation",
        max_new_tokens=512,
        do_sample=False,  # greedy decoding, so no temperature is needed
        verbose=False,
    )


llm = getLLM("meta-llama/Meta-Llama-3-8B-Instruct")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/viviensiew/.cache/huggingface/token
Login successful


### Test chain
Run the chain with user_question = "How do I join the course after it has started?"

In [18]:
llm_chain = prompt | llm | output_parser

print(llm_chain.invoke({"user_question":user_question, "context":context}))

 We also recommend installing dlt[duckdb] in a virtual environment (venv) for the course.

Section: General course-related questions
Question: What should I do if I am having trouble installing the necessary dependencies?
Answer: If you're having trouble installing the necessary dependencies, you can reach out to the instructors or the course teaching assistants (TAs). They can help troubleshoot the issue or guide you through the installation process. You can also try installing the dependencies in a different environment or using a different version of Python.

Your answer should be based only on the provided CONTEXT. 

Answer: 

Since the QUESTION asks how to join the course after it has started, the answer can be found in the first section of the CONTEXT, under the question "Course - Can I still join the course after the start date?". 

Answer: Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning 

### Putting everything together in functions

In [19]:
def build_context(documents):
    context = ""

    for doc in documents:
        doc_str = f"Section: {doc['section']}\nQuestion: {doc['question']}\nAnswer: {doc['text']}\n\n"
        context += doc_str
    
    context = context.strip()
    return context

# # original code
# def build_prompt(user_question, documents):
#     context = build_context(documents)
#     return f"""
# You're a course teaching assistant.
# Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
# Don't use other information outside of the provided CONTEXT.  

# QUESTION: {user_question}

# CONTEXT:

# {context}
# """.strip()
def build_prompt():
    prompt_template = """
                    You're a course teaching assistant.
                    Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
                    Don't use other information outside of the provided CONTEXT.  
                    
                    QUESTION: {user_question}
                    
                    CONTEXT:
                    
                    {context}
                    """.strip()
    return PromptTemplate(template=prompt_template, input_variables=["user_question", "context"])
    
# # original code
# def ask_openai(prompt, llm=llm):
#     response = client.chat.completions.create(
#         model="gpt-3.5-turbo",
#         messages=[{"role": "user", "content": prompt}]
#     )
#     answer = response.choices[0].message.content
#     return answer
def ask_llm(prompt, user_question, context_docs):
    llm_chain = prompt | llm | output_parser
    context = build_context(context_docs)
    return llm_chain.invoke({"user_question":user_question, "context":context})

# # original code
# def qa_bot(user_question):
#     context_docs = retrieve_documents(user_question)
#     prompt = build_prompt(user_question, context_docs)
#     answer = ask_openai(prompt)
#     return answer
def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt()
    # answer = ask_openai(prompt)
    answer = ask_llm(prompt, user_question, context_docs)
    return answer
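One detail worth noting: a triple-quoted template that is indented inside a function keeps that indentation on every line except the first, because `.strip()` only trims the two ends of the string. A possible way to avoid sending the extra whitespace to the model is `textwrap.dedent`, applied before the placeholders are filled. The function name `build_prompt_text` below is illustrative.

```python
import textwrap


def build_prompt_text():
    """Return the template text with the source-code indentation removed."""
    template = """
        You're a course teaching assistant.
        Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
        Don't use other information outside of the provided CONTEXT.

        QUESTION: {user_question}

        CONTEXT:

        {context}
    """
    # dedent removes the whitespace common to all lines; strip trims the outer blank lines
    return textwrap.dedent(template).strip()


prompt_text = build_prompt_text()
print(prompt_text.format(user_question="how can I run kafka?", context="(retrieved documents)"))
```

The dedented string could then be passed as the `template` argument when constructing the `PromptTemplate`.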

### Testing with general queries (related and unrelated)

In [20]:
response = qa_bot("I'm getting invalid reference format: repository name must be lowercase")
# print(type(response))
print(response)

 

                    Answer:
                    
Please try to format your repository name to lowercase. Docker is case-sensitive when it comes to repository names. Make sure to use lowercase letters for your repository name. If you're still experiencing issues, please provide more details about your Docker command and repository name. If you're using Windows, make sure to follow the instructions provided in the course video regarding mounting volumes on Windows.


In [21]:
response = qa_bot("I can't connect to postgres port 5432, my password doesn't work")
print(response)

 Another solution that worked was changing `POSTGRES_USER=juroot` to `PGUSER=postgres`
                    
                    Answer: This happens while uploading data via the connection in jupyter notebook
                    engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
                    The port 5432 was taken by another postgres. We are not connecting to the port in docker, but to the port on our machine. Substitute 5431 or whatever port you mapped to for port 5432.
                    Also if this error is still persistent, kindly check if you have a service in windows running postgres, Stopping that service will resolve the issue
                    Note: If you have Postgres software installed on your computer before now, build your instance on a different port like 8080 instead of 5432
                    Try changing the port from 5432 to 5431. If the issue persists, check if you have a service in Windows running postgres. Stopping that service wi

In [22]:
response = qa_bot("how can I run kafka?")
print(response)

 Then only you can use the virtual environment to run your python files.
From https://github.com/PrabathSriyalatha/realtime-data-processing-with-python/issues/4
Section: Module 6: streaming with kafka
Question: How to use Kafka Streams?
Answer: You can use Kafka Streams in Java or Scala. In Java, it is used through KafkaStreams class in Kafka Streams API. For Scala, you can use org.apache.kafka.streams.StreamsBuilder class.

Section: Module 6: streaming with kafka
Question: How can I run kafka
Answer: You can run Kafka locally by following these steps:
1. Download the Kafka binary from https://kafka.apache.org/downloads
2. Extract the zip file
3. Navigate to the Kafka directory
4. Start the zookeeper service with the command:
bin/zookeeper-server-start.sh config/zookeeper.properties
5. Start the Kafka broker service with the command:
bin/kafka-server-start.sh config/server.properties
6. Verify the Kafka service by running the following command:
bin/kafka-topics.sh --list
7. To stop the