# Get the data

The link below downloads the FAQ documents in JSON format.

A Python script was used to parse the FAQ from Google Docs into JSON:
https://github.com/dimzachar/llm_zoomcamp/blob/master/notes/01-intro/retrieval-with-minsearch.md

This was covered in another workshop.

In [1]:
# !wget https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/main/01-intro/documents.json

In [74]:
import json
import minSearch

In [75]:
with open('documents.json','rt') as f_in:
    docs_raw = json.load(f_in)

In [76]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
        
documents[:2]  # list of dictionaries, one per FAQ entry

# STRUCTURE 
# text: answer to the question
# section: section question belongs to
# question: actual question
# course: what course question belongs to

[{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp'}]
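The flattening step above can be tried without downloading anything. The following is a minimal sketch on a made-up miniature of `documents.json`; the sample data and the `flatten` helper are hypothetical and only mirror the structure shown in the output. Unlike the loop above, it copies each document instead of modifying the loaded data in place.

```python
# Hypothetical miniature of documents.json: a list of courses,
# each holding its own list of FAQ documents.
sample_raw = [
    {"course": "course-a", "documents": [
        {"text": "Answer 1", "section": "General", "question": "Q1"},
        {"text": "Answer 2", "section": "General", "question": "Q2"},
    ]},
    {"course": "course-b", "documents": [
        {"text": "Answer 3", "section": "Setup", "question": "Q3"},
    ]},
]


def flatten(raw):
    """Return one flat list of documents, each tagged with its course name."""
    # {**doc, ...} builds a new dict, so the input structure is left untouched.
    return [
        {**doc, "course": course["course"]}
        for course in raw
        for doc in course["documents"]
    ]


flat = flatten(sample_raw)
print(len(flat))
print(flat[0])
```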

In [77]:
import pandas as pd
docs_df = pd.DataFrame(documents)
docs_df.groupby('course').count()

| course | text | section | question |
|---|---|---|---|
| data-engineering-zoomcamp | 435 | 435 | 435 |
| machine-learning-zoomcamp | 375 | 375 | 375 |
| mlops-zoomcamp | 138 | 138 | 138 |


## Indexing documents using minSearch

In [78]:
index = minSearch.Index(
    text_fields=['question', 'text', 'section'], #similarity matching, processed for TF-IDF vectorization
    keyword_fields=['course'] # exact matching
)

In [79]:
q = 'the course has already started, can i still join?'

In [80]:
index.fit(documents)

<minSearch.Index at 0x158e64530>

In [81]:
boost = {'question': 3.0, 'section': 0.5}  # weight fields by importance: matches in 'question' count 3x, matches in 'section' count half

results = index.search(
    query=q,
    boost_dict=boost,
    filter_dict={'course': 'data-engineering-zoomcamp'}, 
    num_results=5
)
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin
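To make the roles of `text_fields`, `keyword_fields`, `boost_dict` and `filter_dict` more concrete, here is a rough pure-Python sketch of the idea: TF-IDF vectors per text field, cosine similarity against the query, a per-field boost, and exact matching on keyword fields. The `TinyIndex` class and its sample documents are hypothetical and written only for illustration; this is not minSearch's actual code, and its scores will differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def normalize(vec):
    """Scale a sparse vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


class TinyIndex:
    """Illustrative stand-in for a TF-IDF index (assumption, not minSearch itself)."""

    def __init__(self, text_fields, keyword_fields):
        self.text_fields = text_fields
        self.keyword_fields = keyword_fields

    def fit(self, docs):
        self.docs = docs
        self.idf = {}      # field -> {term: inverse document frequency}
        self.vectors = {}  # field -> one sparse TF-IDF vector per document
        n = len(docs)
        for field in self.text_fields:
            counts = [Counter(tokenize(d.get(field, ""))) for d in docs]
            df = Counter()
            for c in counts:
                df.update(c.keys())
            idf = {t: math.log((1 + n) / (1 + k)) + 1 for t, k in df.items()}
            self.idf[field] = idf
            self.vectors[field] = [
                normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts
            ]
        return self

    def search(self, query, filter_dict=None, boost_dict=None, num_results=5):
        filter_dict = filter_dict or {}
        boost_dict = boost_dict or {}
        q_counts = Counter(tokenize(query))
        scores = [0.0] * len(self.docs)
        for field in self.text_fields:
            idf = self.idf[field]
            q_vec = normalize({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
            boost = boost_dict.get(field, 1.0)
            for i, d_vec in enumerate(self.vectors[field]):
                scores[i] += boost * sum(w * d_vec.get(t, 0.0) for t, w in q_vec.items())
        # Keep documents that match every keyword filter exactly and share a term with the query.
        hits = [
            (score, i) for i, score in enumerate(scores)
            if score > 0 and all(self.docs[i].get(k) == v for k, v in filter_dict.items())
        ]
        hits.sort(key=lambda pair: -pair[0])
        return [self.docs[i] for _, i in hits[:num_results]]


sample_docs = [
    {"question": "Can I still join the course after the start date?",
     "text": "Yes, you can still submit homework.", "section": "General", "course": "de"},
    {"question": "How do I install Docker?",
     "text": "Follow the official guide.", "section": "Setup", "course": "de"},
    {"question": "Can I still join the course after the start date?",
     "text": "Yes.", "section": "General", "course": "ml"},
]

tiny = TinyIndex(text_fields=["question", "text", "section"], keyword_fields=["course"])
tiny.fit(sample_docs)
tiny_results = tiny.search(
    "can i still join the course",
    filter_dict={"course": "de"},
    boost_dict={"question": 3.0, "section": 0.5},
)
print(tiny_results[0]["question"])
```

The boost only rescales each field's contribution before the per-field scores are summed, which is why a strong match in `question` outranks a similar match in `section`.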

## Generating answer with Claude

In [82]:
import anthropic
from dotenv import load_dotenv
import os

# Load environment variables from .env file into script environment
load_dotenv()


True

In [83]:
q

'the course has already started, can i still join?'

In [45]:
client = anthropic.Anthropic()

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": q}
    ]
)
print(message)

Message(id='msg_01EvtSFiKXWGihAmM1yCFuvn', content=[TextBlock(text="I apologize, but I don't have specific information about any particular course or its enrollment policies. Whether you can join a course that has already started depends on several factors:\n\n1. The course's specific policies\n2. How far along the course is\n3. The institution or platform offering the course\n4. The nature of the course (online, in-person, self-paced, etc.)\n\nTo get an accurate answer, you should:\n\n1. Contact the course instructor or the institution offering the course directly.\n2. Check the course syllabus or website for late enrollment policies.\n3. If it's an online course on a platform like Coursera or edX, check their policies on joining courses in progress.\n\nSome courses may allow late enrollment with the understanding that you'll need to catch up, while others might be more strict. It's best to inquire directly for the most accurate and up-to-date information.", type='text')], model='clau

In [55]:
# Steps: the user writes a query -> the query is used to search the documents -> the matched documents are retrieved -> a prompt is built with the matched documents as "context" -> the LLM receives the prompt and generates a response

In [56]:
# def search: searches the documents for a query
# def build_prompt: generates a prompt, passing the search results as context
# def llm: calls the LLM with the prompt
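The three steps listed above can be wired together before any real component exists. The sketch below uses stand-in functions (`fake_search`, `fake_build_prompt` and `fake_llm` are hypothetical placeholders, not part of any library) to show the data flow. Passing the pieces in as arguments also keeps the pipeline independent of whatever notebook globals happen to be defined when a cell runs.

```python
def make_rag(search_fn, build_prompt_fn, llm_fn):
    """Compose search -> prompt building -> LLM call into a single function."""
    def rag(query):
        search_results = search_fn(query)
        prompt = build_prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag


# Stand-ins so the flow can be run without an index or an API key.
def fake_search(query):
    return [{"question": "Can I join late?", "text": "Yes."}]


def fake_build_prompt(query, search_results):
    context = "\n\n".join(f"Q: {d['question']}\nA: {d['text']}" for d in search_results)
    return f"QUESTION: {query}\nCONTEXT:\n{context}"


def fake_llm(prompt):
    # A real implementation would send the prompt to a model here.
    return "ANSWER based on: " + prompt.splitlines()[0]


demo_rag = make_rag(fake_search, fake_build_prompt, fake_llm)
print(demo_rag("can I still join?"))
```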

In [12]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return results

In [13]:
search(q)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineerin

In [14]:
def build_prompt(query, search_results):
    prompt_template = """ 
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT: 
    {context}
    """.strip()
    
    context = ""
    for doc in search_results:
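        # Double-quoted f-string: reusing the same quote type inside the braces is a syntax error before Python 3.12
        context += f"section: {doc['section']}\nquestion: {doc['question']}\ncourse: {doc['course']}\nanswer: {doc['text']}"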
        
    prompt = prompt_template.format(question=query, context = context)
    return prompt
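In the function above, consecutive FAQ entries are appended without any separator, and the template keeps the indentation of the source code. A possible variant is sketched below, assuming a blank line between entries is wanted. `build_prompt_separated` and `PROMPT_TEMPLATE` are hypothetical names introduced here, not part of the original notebook.

```python
import textwrap

# dedent removes the source-code indentation from the template text.
PROMPT_TEMPLATE = textwrap.dedent("""\
    You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
    Use only the facts from the CONTEXT when answering the QUESTION.

    QUESTION: {question}

    CONTEXT:
    {context}
""").strip()


def build_prompt_separated(query, search_results):
    """Build the prompt with one blank line between FAQ entries."""
    entries = [
        f"section: {doc['section']}\nquestion: {doc['question']}\n"
        f"course: {doc['course']}\nanswer: {doc['text']}"
        for doc in search_results
    ]
    return PROMPT_TEMPLATE.format(question=query, context="\n\n".join(entries))


demo_docs = [
    {"section": "General", "question": "Q1", "course": "c", "text": "Yes."},
    {"section": "Setup", "question": "Q2", "course": "c", "text": "No."},
]
demo_prompt = build_prompt_separated("can I still join?", demo_docs)
print(demo_prompt)
```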
    

In [15]:
build_prompt(q,search(q))

"You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.\n    Use only the facts from the CONTEXT when answering the QUESTION.\n\n    QUESTION: the course has already started, can i still join?\n\n    CONTEXT: \n    section: General course-related questions\nquestion: Course - Can I still join the course after the start date?\ncourse: data-engineering-zoomcamp\nanswer: Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.section: General course-related questions\nquestion: Course - When will the course start?\ncourse: data-engineering-zoomcamp\nanswer: The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public G

In [84]:
def llm(prompt):
    import anthropic
    client = anthropic.Anthropic()
    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=256,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    # The reply is a list of content blocks; return the text of the first one
    return message.content[0].text

In [72]:
query = 'What is Data Engineering Zoomcamp? Can it help someone become a data engineer? Write in short'

def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [73]:
rag('the course has already started, can I still enroll?')


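AttributeError: 'str' object has no attribute 'search'

The only `.search` call on this path is `index.search` inside `search`, so the error indicates that `index` was bound to a string when this cell ran. Re-running the cells that build and fit the index restores it.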

In [61]:
rag(query)

"Data Engineering Zoomcamp is a comprehensive course that can indeed help someone become a data engineer. Here's a brief overview:\n\nThe course covers essential topics in data engineering, including:\n\n1. Docker and Terraform for containerization and infrastructure management\n2. Workflow orchestration\n3. Data warehousing\n4. Stream processing with Kafka\n5. Batch processing\n\nIt provides hands-on experience through practical exercises and homework assignments, allowing students to work with real-world tools and technologies used in the field. The course structure and content are designed to equip participants with the skills and knowledge needed to pursue a career in data engineering."

# Spin up an Elasticsearch container

What is Elasticsearch?
https://www.youtube.com/watch?v=ZP0NmfyfsoM&pp=ygUOZWxhc3RpYyBzZWFyY2g%3D

In [3]:
# docker run -it \
#     --rm \
#     --name elasticsearch \
#     -m 4GB \
#     -p 9200:9200 \
#     -p 9300:9300 \
#     -e "discovery.type=single-node" \
#     -e "xpack.security.enabled=false" \
#     docker.elastic.co/elasticsearch/elasticsearch:8.4.3


# Check that Elasticsearch is responding: curl http://localhost:9200


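# Ingest the documents into Elasticsearch (indexing)
- Elasticsearch is persistent: it writes indexed data to disk, so the data survives a restart. With the `docker run` command above, though, the container is started with `--rm` and no mounted volume, so the data is lost once the container is removed.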


In [18]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [20]:
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

In [21]:
es_client = Elasticsearch('http://localhost:9200')

In [22]:
es_client.info()

ObjectApiResponse({'name': '5c83f0e2e1e8', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'NoESQU1aRkevW5oESnnb8w', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [45]:
index_settings = {
    'settings': {
        'number_of_shards':1,
        'number_of_replicas':0
    },
    'mappings': {
        'properties': {
            'text': {'type':'text'},
            'section': {'type':'text'},
            'question': {'type':'text'},
            'course': {'type':'keyword'},
        }
    }
}

index_name = 'course-questions'

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
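Re-running the cell above fails once the index exists. One way around this is to check for the index first. The sketch below assumes the client's `indices.exists` and `indices.create` calls; `ensure_index` is a hypothetical helper name, and the small fake client is only there so the logic can be tried without a running cluster.

```python
def ensure_index(client, name, settings):
    """Create the index only if it does not exist yet; return True if it was created."""
    if client.indices.exists(index=name):
        return False
    client.indices.create(index=name, body=settings)
    return True


# Minimal in-memory stand-in for the Elasticsearch client (illustration only).
class FakeIndices:
    def __init__(self):
        self.created = {}

    def exists(self, index):
        return index in self.created

    def create(self, index, body):
        self.created[index] = body


class FakeClient:
    def __init__(self):
        self.indices = FakeIndices()


fake_client = FakeClient()
print(ensure_index(fake_client, "course-questions", {"settings": {}}))  # first call creates it
print(ensure_index(fake_client, "course-questions", {"settings": {}}))  # second call is a no-op
```

With the real client this would be called as `ensure_index(es_client, index_name, index_settings)`.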

In [46]:
# List all indices
indices_response = es_client.cat.indices(format='json')

# Print the indices
for index_info in indices_response:
    print(index_info['index'])

course-questions


In [43]:
# Delete the index (uncomment to run)
# response = es_client.indices.delete(index='course-questions')
# print(response)

{'acknowledged': True}


In [47]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

100%|██████████| 948/948 [00:01<00:00, 483.68it/s]
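Sending one request per document is fine for 948 documents. For larger collections, the Python client's `helpers.bulk` function can send documents in batches. The following is a sketch under that assumption; `to_bulk_actions` and `bulk_index` are hypothetical helper names, and only the action generator is exercised here since the bulk call needs a running cluster.

```python
def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document, in the shape helpers.bulk accepts."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(client, docs, index_name):
    # Imported lazily so the generator above can be tried without the package installed.
    from elasticsearch import helpers

    # helpers.bulk returns a tuple: (number of successful actions, list of errors).
    return helpers.bulk(client, to_bulk_actions(docs, index_name))


demo_actions = list(to_bulk_actions([{"question": "Q1", "text": "A1"}], "course-questions"))
print(demo_actions)
```

With a running cluster this would be called as `bulk_index(es_client, documents, index_name)`.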


In [48]:
q = 'the course has already started, can i still join?'
q

'the course has already started, can i still join?'

In [49]:
sample_query = {
    'size': 1,
    '_source': ['course'],
    'query': {
        'match_all': {}
    }
}
sample_response = es_client.search(index=index_name, body=sample_query)
print(sample_response)

{'took': 4, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 948, 'relation': 'eq'}, 'max_score': 1.0, 'hits': [{'_index': 'course-questions', '_id': '26qwhJAB9W3OmpxwCdBF', '_score': 1.0, '_source': {'course': 'data-engineering-zoomcamp'}}]}}


In [50]:
mapping = es_client.indices.get_mapping(index=index_name)
from pprint import pprint
pprint(mapping)

ObjectApiResponse({'course-questions': {'mappings': {'properties': {'course': {'type': 'keyword'}, 'question': {'type': 'text'}, 'section': {'type': 'text'}, 'text': {'type': 'text'}}}}})


In [65]:
# Query sent to Elasticsearch

def elastic_search(query):
    search_query = {
        'size': 5,
        'query': {
            'bool': {
                'must': {
                    'multi_match': {
                        'query': query,  # use the function argument, not the global q
                        'fields': ['question^3','text', 'section'],
                        'type': 'best_fields'
                    }
                },
                'filter': {
                    'term': {
                        'course': "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
        
    return result_docs
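The function above hard-codes the course in the filter. A small variation, sketched here with the hypothetical helpers `build_search_query` and `extract_sources`, separates building the query body from sending it, so the course and result count become parameters and the body can be inspected without a running cluster.

```python
def build_search_query(query, course, size=5):
    """Return the request body: boosted multi_match on text fields, exact filter on course."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }


def extract_sources(response):
    """Pull the stored documents out of a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


demo_body = build_search_query("can I still join?", "data-engineering-zoomcamp", size=3)
print(demo_body["query"]["bool"]["filter"])

# Hand-written response in the shape shown in the outputs of this notebook.
demo_response = {"hits": {"hits": [{"_score": 1.0, "_source": {"question": "Q1"}}]}}
print(extract_sources(demo_response))
```

With the real client, the body would be sent as `es_client.search(index=index_name, body=demo_body)` and the result passed to `extract_sources`.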

In [56]:
response = es_client.search(index=index_name, body=search_query)

response['hits']['hits']

{'_index': 'course-questions',
 '_id': '3aqwhJAB9W3OmpxwCdBq',
 '_score': 66.088936,
 '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'}}

In [63]:
result_docs = []

for hit in response['hits']['hits']:
    result_docs.append(hit['_source'])
    
result_docs

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at

In [66]:
elastic_search(q)

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at

In [85]:
# Adjust the rag function to use Elasticsearch for retrieval

query = 'What is Data Engineering Zoomcamp? Can it help someone become a data engineer? Write in short'

def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = llm(prompt)
    return answer

In [86]:
rag(query=query)

"Data Engineering Zoomcamp is a course designed to help individuals become data engineers. Here's a short overview:\n\n1. It's a comprehensive program covering essential data engineering topics.\n\n2. The course is accessible even after the start date, allowing flexible participation.\n\n3. It provides hands-on experience with tools like Google Cloud, Python, Terraform, and Git.\n\n4. Materials remain available after the course ends, enabling self-paced learning.\n\n5. Support is available through a Slack channel, where participants can ask questions and seek help.\n\n6. The course includes homework assignments and a final capstone project, providing practical experience.\n\n7. It's designed to prepare participants for real-world data engineering roles.\n\nIn summary, Data Engineering Zoomcamp can indeed help someone become a data engineer by providing structured learning, practical experience, and community support."

In [87]:
rag(query='What week is currently going on in Data Engineering Zoomcamp? When did the course start?')

"I apologize, but I don't have any specific information about the current week of the Data Engineering Zoomcamp or its start date in the provided context. The context does not mention any details about the current status or timeline of the course. It only contains general information about joining the course, prerequisites, and following the course after it finishes. Without more specific information, I cannot accurately answer your question about the current week or start date of the course."