# LangChain: Creating an AI Tutor for Astronomy 12

In this tutorial, we'll build an AI tutor chatbot tailored for an [Astronomy 12 course](https://teaghan.github.io/astronomy-12/). 

By integrating techniques like conversational retrieval-augmented generation (RAG) with course-specific content, we aim to provide a tool that enhances student learning through interactive and personalized tutoring sessions.

> For a more interactive experience, check out the [web app](https://teaghan-educational-prompt-engineering-tutormain-dkogwm.streamlit.app/) version of this chatbot.

### Objectives

- **Load and Process Educational Material**: Retrieve course material and convert the raw text into structured data that the AI can understand.
- **Embedding for Information Retrieval**: Convert the course content into vector embeddings so the AI can search it by meaning rather than by exact keywords.
- **Conversation History Management**: Ensure the AI can remember and utilize past interactions to provide contextually relevant responses.
- **Content Retrieval**: Develop a method for the AI to gather the most contextually relevant information from its knowledge base to best address the question.
- **Set Up a Dynamic Question Answering System**: Allow the AI to answer questions using both the retrieved course content and the conversation so far.

---

## Setup and Configuration

### Environment Preparation

First, we'll import the necessary libraries and configure our environment, including setting up API keys for accessing LangChain and OpenAI services.

In [1]:
import os
import re
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.runnables.history import RunnableWithMessageHistory

### Secure API Key Management

In [2]:
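def load_api_key(key_file):
    # Read the key from a local file that is kept out of version control
    with open(key_file) as f:
        key = f.read().strip()  # drop surrounding whitespace and newlines
    return key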

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["OPENAI_API_KEY"] = load_api_key('key.txt')
os.environ["LANGCHAIN_API_KEY"] = load_api_key(key_file='langchain_key.txt')
os.environ['USER_AGENT'] = 'myagent'

## Integrating Course Content

### Loading Course Documents

Here we load the textual content of the Astronomy 12 course. These documents form the backbone of our AI tutor's knowledge base.

In [3]:
def load_text_file(file_path):
    # Use a context manager so the file handle is closed after reading
    with open(file_path, 'r') as f:
        return f.read()

tutor_instructions = load_text_file('tutor/Tutor_Instructions.txt')
course_content = load_text_file('tutor/course_content.txt')

### Structuring the Content

Breaking down the course material into structured documents allows the AI to navigate and retrieve specific information efficiently.

In [4]:
def split_by_files(content, context_window):
    # Define patterns for splitting based on README, lesson files, and assignments
    readme_pattern = r"# Unit\d+_README\.md"
    lesson_pattern = r"# \d+_\d+_[\w_]+\.md"
    assignment_pattern = r"# Unit \d+ Assignment"

    # Combine all patterns
    combined_pattern = f"({readme_pattern}|{lesson_pattern}|{assignment_pattern})"
    
    # Find all matches (file sections) and split the content
    matches = re.split(combined_pattern, content)

    # Collect file content
    chunks = []
    for i in range(1, len(matches), 2):
        chunk_title = matches[i].strip()
        chunk_content = matches[i + 1].strip()

        # Check the length of the chunk_content and split if necessary
        if chunk_content:
            content_length = len(chunk_content)
            if content_length <= context_window:
                chunks.append(Document(page_content=chunk_content, metadata={"title": chunk_title}))
            else:
                # Split the content into smaller chunks if it exceeds the context window
                part_number = 0
                start = 0
                while start < content_length:
                    end = min(start + context_window, content_length)
                    # Ensure the split does not cut off mid-word
                    if end < content_length and not chunk_content[end].isspace():
                        # Back up to the nearest space; keep the hard cut if none is found
                        last_space = chunk_content.rfind(' ', start, end)
                        if last_space > start:
                            end = last_space
                    sub_content = chunk_content[start:end].strip()
                    if sub_content:
                        part_number += 1
                        subtitle = f"{chunk_title} ({part_number})"
                        chunks.append(Document(page_content=sub_content, metadata={"title": subtitle}))
                    # Resume exactly where this chunk ended so no text is skipped
                    start = end

    return chunks
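
To see why the loop in `split_by_files` steps through `matches` two items at a time, it may help to look at what `re.split` returns when the pattern contains a capturing group: the matched headings are kept in the result, alternating with the text between them. The sketch below applies the same heading patterns to a made-up two-file sample string (not actual course content).

```python
import re

# Made-up sample text for illustration; the real course file is much longer.
sample = (
    "# Unit1_README.md\nWelcome to Unit 1.\n"
    "# 1_1_the_night_sky.md\nStars appear to move across the sky.\n"
)

pattern = r"(# Unit\d+_README\.md|# \d+_\d+_[\w_]+\.md|# Unit \d+ Assignment)"
parts = re.split(pattern, sample)

# parts[0] is whatever precedes the first heading (empty here);
# after that, headings sit at odd indices and their bodies at even indices.
pairs = [(parts[i].strip(), parts[i + 1].strip()) for i in range(1, len(parts), 2)]
for title, body in pairs:
    print(f"{title!r} -> {body!r}")
```

Each `(title, body)` pair corresponds to one `Document` in the real function, with the title stored in the metadata.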

## Building the Chatbot

### Initializing AI Models for Embedding and Interaction

To empower our AI tutor with the capability to understand and process natural language effectively, we initialize two key components from the OpenAI suite: the embedding model and the language model. 

- **`OpenAIEmbeddings`**: Converts textual content into numerical vectors, capturing semantic meanings essential for tasks like document retrieval and context understanding.

- **`ChatOpenAI` with `model="gpt-4o-mini"`**: This language model processes and generates human-like text, enabling the AI to provide coherent, contextually appropriate responses, balancing performance with resource efficiency for real-time interactions.

In [5]:
model = "gpt-4o-mini"
embedding_model = OpenAIEmbeddings()
llm = ChatOpenAI(model=model)

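# Max context window, in tokens, for each model. Note that split_by_files compares this
# value against a character count, so it serves only as a rough upper bound on chunk size.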
model_context_windows = {
    "gpt-4o": 128000,
    "gpt-4o-2024-05-13": 128000,
    "gpt-4o-2024-08-06": 128000,
    "chatgpt-4o-latest": 128000,
    "gpt-4o-mini": 128000,
    "gpt-4o-mini-2024-07-18": 128000,
    "gpt-4-turbo": 128000,
    "gpt-4-turbo-2024-04-09": 128000,
    "gpt-4-turbo-preview": 128000,
    "gpt-4-0125-preview": 128000,
    "gpt-4-1106-preview": 128000,
    "gpt-4": 8192,
    "gpt-4-0613": 8192,
    "gpt-4-0314": 8192,
    "gpt-3.5-turbo-0125": 16385,
    "gpt-3.5-turbo": 16385,
    "gpt-3.5-turbo-1106": 16385,
    "gpt-3.5-turbo-instruct": 4096
}
context_window = model_context_windows[model]

### Embedding Documents

Document embedding is a crucial step in preparing our AI tutor for effective teaching. This process transforms the segmented course material—such as lessons and sections—into numerical vectors that the AI can understand and analyze, akin to how a student internalizes topics for easier recall.

#### Importance of Chunks

Segmenting the course content into smaller, manageable chunks serves three key purposes:

1. **Improved Retrieval Accuracy**: By organizing the material into distinct topics, the AI can more precisely identify and fetch relevant information in response to student queries, enhancing the educational experience.
   
2. **Efficient Processing**: Smaller text chunks lead to more targeted embeddings, allowing for quicker and more accurate retrieval of information.

3. **Focused Content Delivery**: This approach mimics a tutor's method of referencing specific textbook sections to answer questions, providing students with precise and relevant information.

#### Embedding Process

Here’s a brief overview of how embedding benefits the AI tutor:

- **Vector Representation**: Each document chunk is converted into a vector that captures its semantic meaning, enabling the AI to understand content beyond mere keywords.

- **Semantic Understanding**: These vectors help the AI grasp underlying concepts within the text, similar to a student understanding ideas for application rather than rote memorization.

- **Efficient Query Matching**: When a query is made, the AI quickly identifies the most relevant document chunks, ensuring accurate and contextually appropriate responses.
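
The query-matching idea can be illustrated without calling any API. The sketch below ranks three chunks against a query using cosine similarity, which is one common way to compare embeddings; the vector store may use a different distance metric internally. The three-dimensional vectors and the `4_1_stellar_spectra` title are invented for illustration, and real embeddings have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings" (hypothetical values, chosen only to show the ranking step).
chunk_vectors = {
    "2_2_gravity": [0.9, 0.1, 0.0],
    "2_1_keplers_laws": [0.6, 0.7, 0.1],
    "4_1_stellar_spectra": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for an embedded question about gravity

# Sort chunk titles from most to least similar to the query.
ranked = sorted(
    chunk_vectors,
    key=lambda title: cosine_similarity(query_vector, chunk_vectors[title]),
    reverse=True,
)
print(ranked)
```

With these toy values the gravity chunk ranks first, because its vector points in nearly the same direction as the query vector.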

In [6]:
course_content = split_by_files(course_content, context_window)
tutor_instructions = Document(page_content=tutor_instructions, metadata={"title": "Tutor Instructions"})
course_content_vecs = Chroma.from_documents(course_content, embedding=embedding_model)

### Managing Conversation History

Effective tutoring requires remembering past interactions, and our AI tutor uses conversation history management to achieve this. By maintaining a record of each student's dialogue, the AI can provide responses that are not only accurate but also contextually relevant to ongoing discussions.

#### Implementation Details

Conversation histories are kept in memory and looked up by session identifier, so each student's dialogue stays separate. Because the store is a plain dictionary, histories are lost when the program restarts.

**Key Components:**
- **`store`**: A dictionary that holds conversation histories keyed by session identifiers.
- **`get_session_history`**: A function that retrieves an existing history or creates a new one if none exists for the given session ID.

In [7]:
store = {}
def get_session_history(session_id: str):
    return store.setdefault(session_id, ChatMessageHistory())
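
As a rough illustration of how the session lookup keeps students apart, the sketch below repeats the same `setdefault` pattern with a stand-in history class, so it runs without LangChain installed. `FakeHistory`, `demo_store`, and `get_demo_history` are hypothetical names used only here.

```python
class FakeHistory:
    """Stand-in for ChatMessageHistory, used only for this illustration."""
    def __init__(self):
        self.messages = []

demo_store = {}

def get_demo_history(session_id: str) -> FakeHistory:
    # setdefault returns the stored history, creating it on first use
    return demo_store.setdefault(session_id, FakeHistory())

get_demo_history("student_a").messages.append("What is a parsec?")
get_demo_history("student_b").messages.append("How hot is the Sun?")
get_demo_history("student_a").messages.append("And a light-year?")

print(len(get_demo_history("student_a").messages))  # 2
print(len(get_demo_history("student_b").messages))  # 1
```

Each session ID maps to its own history object, so a follow-up question from one student never sees another student's messages.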

### Setting Up the Chatbot Interaction

To tutor students effectively, the chatbot combines two sources of context: the conversation so far and the course material held in the vector store. Together they keep responses accurate and tailored to what the student is currently working on.

#### Implementing Context-Aware Question Handling

The AI tutor is designed to process inquiries by considering the entire conversation history, ensuring that each response builds on previous interactions.

**Key Components:**
- **`contextualize_q_system_prompt`**: A prompt that instructs the AI to reformulate the student's query into a clear, standalone question. This helps in stripping any ambiguity that might arise from a lack of context, making sure the question is clear and precise.
- **`contextualize_q_prompt`**: This template structures the reformulation process, integrating the system's instructions with the user's input and the chat history.
- **`history_aware_retriever`**: It leverages the reformulated question to fetch the most relevant information from the embedded documents. This retriever is aware of the chat history, which allows it to consider previous exchanges when selecting content to use in responses.

By refining questions based on the chat history and directly retrieving information from contextually embedded documents, the AI tutor mimics a knowledgeable and attentive human tutor. This method ensures that each interaction is informed by a comprehensive understanding of both the course content and the student's ongoing educational journey.

In [8]:
contextualize_q_system_prompt = (
    "Given a chat history and the latest user question "
    "which might reference context in the chat history, "
    "formulate a standalone question which can be understood "
    "without the chat history. Do NOT answer the question, "
    "just reformulate it if needed and otherwise return it as is."
)
contextualize_q_prompt = ChatPromptTemplate.from_messages([
        ("system", contextualize_q_system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
])

history_aware_retriever = create_history_aware_retriever(llm, course_content_vecs.as_retriever(), contextualize_q_prompt)
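
The control flow behind a history-aware retriever can be sketched in plain Python. This is a conceptual approximation, not LangChain's implementation: with no chat history the question is searched as asked, otherwise it is first rewritten into a standalone query. `stub_reformulate` and `stub_retrieve` are hypothetical stand-ins for the LLM call and the vector-store search.

```python
def history_aware_retrieve(question, chat_history, reformulate, retrieve):
    """Conceptual flow only; not LangChain's actual implementation."""
    if not chat_history:
        # Nothing to resolve against, so search with the question as asked
        search_query = question
    else:
        # Ask the (stubbed) LLM for a standalone version of the question
        search_query = reformulate(chat_history, question)
    return search_query, retrieve(search_query)

# Hypothetical stand-ins for the LLM call and the vector-store search
def stub_reformulate(chat_history, question):
    return "How is surface gravity derived from F = G*m1*m2/r^2?"

def stub_retrieve(query):
    return [f"<chunk matching: {query}>"]

history = [
    ("human", "How do I calculate surface gravity?"),
    ("ai", "Which formula do you think applies?"),
]

print(history_aware_retrieve("How do I use it?", history, stub_reformulate, stub_retrieve))
print(history_aware_retrieve("What is a light-year?", [], stub_reformulate, stub_retrieve))
```

The vague follow-up "How do I use it?" would retrieve poorly on its own, which is why the rewrite step runs before the search.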

### Integrating Document-Based Responses

For our AI tutor to function as an effective educational tool, it's crucial to integrate the tutor's guidance directly into the system's responses. This integration ensures that the tutor not only answers queries but does so in a manner that aligns with the educational objectives and instructions specified in the tutor guidelines.

#### Creating a Contextual Response System

We set up a response system that utilizes the tutor instructions along with the context retrieved from the documents to provide detailed and educational answers.

**Key Elements of the System:**

- **`system_prompt`**: Structures the AI's response approach by incorporating tutor instructions and explicitly directing the use of retrieved document context to answer queries, ensuring responses are pedagogically aligned and informative.

- **`qa_prompt`**: This template structures the interaction by embedding the system’s guidance, chat history, and user’s current query. It guides the AI in engaging with the student in a manner that is responsive and informed by prior interactions.

- **`question_answer_chain`**: Built with `create_stuff_documents_chain`, it inserts the retrieved documents into the prompt's `{context}` placeholder and passes the completed prompt to the language model. Retrieval itself happens earlier, in the `history_aware_retriever`.

By employing these elements, the AI tutor is equipped to provide responses that are not just reactive but also deeply integrated with the educational content and objectives of the Astronomy 12 course.

In [9]:
system_prompt = (
    f"{tutor_instructions.page_content}\n\n"
    "## Your Task\n\n"
    "Following the instructions above, use the following pieces of retrieved context to answer the question. \n\n"
    "# Context\n\n"
    "{context}"
)
qa_prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
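
"Stuffing" means that all retrieved chunks are concatenated and placed into the prompt's `{context}` slot for a single LLM call. The sketch below is a plain-Python approximation of that idea rather than the LangChain implementation; `stuff_documents`, the dict-based documents, and the blank-line separator are assumptions made for illustration.

```python
def stuff_documents(system_template, documents, separator="\n\n"):
    """Join every document's text and place it in the {context} slot."""
    context = separator.join(doc["page_content"] for doc in documents)
    return system_template.format(context=context)

# Hypothetical retrieved chunks, represented as plain dicts for illustration
retrieved = [
    {"page_content": "Surface gravity depends on mass and radius."},
    {"page_content": "Newton's second law relates force and acceleration."},
]
template = "Use the context to answer the question.\n\n# Context\n\n{context}"

print(stuff_documents(template, retrieved))
```

Because every chunk goes into one prompt, the combined length of the retrieved chunks, the tutor instructions, and the chat history has to fit within the model's context window.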

### Implementing the Retrieval-Augmented Generation Chain

To enhance the AI tutor's ability to deliver precise and contextually relevant answers, we integrate a retrieval-augmented generation (RAG) chain into the system. This structure utilizes both the power of document retrieval and the generative capabilities of language models.

**Breakdown:**

- **`create_retrieval_chain`**: This function links the `history_aware_retriever` with the `question_answer_chain`. The retriever first identifies relevant documents based on the user’s query and the context from previous interactions, while the question-answer chain generates responses based on the retrieved information. This combination ensures that the AI's responses are both informed by the course content and tailored to the ongoing conversation.

- **`RunnableWithMessageHistory`**: Wraps the RAG chain to make it aware of the conversational history. This wrapper ensures continuity and context preservation across multiple interactions with the same student, simulating a more human-like tutoring experience. It uses three keys to manage the flow of information:
  - `input_messages_key`: The key in the input dictionary that holds the student's latest message (here `"input"`).
  - `history_messages_key`: The key under which the stored chat history is passed to the chain (here `"chat_history"`).
  - `output_messages_key`: The key in the chain's output that holds the AI's response (here `"answer"`), which is saved to the history for use in later turns.

This setup enables the AI tutor to not just respond to isolated queries but to engage in a dynamic, ongoing educational dialogue. By leveraging both retrieval and generative processes, the tutor can provide answers that are deeply integrated with the educational content and are responsive to the evolving needs of the student, enhancing the learning experience in the Astronomy 12 course.

In [10]:
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)
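
The bookkeeping performed around each turn can be approximated as follows. This is a simplified sketch of the idea and not how `RunnableWithMessageHistory` is implemented; `run_with_history` and `stub_chain` are hypothetical names, and the stub chain only reports how much history it received.

```python
def run_with_history(chain, history, user_input):
    """Approximate bookkeeping around one turn; not LangChain's implementation."""
    # history_messages_key: prior messages are injected under "chat_history"
    # input_messages_key: the new student message travels under "input"
    result = chain({"input": user_input, "chat_history": list(history)})
    # output_messages_key: the reply under "answer" is saved along with the question
    history.append(("human", user_input))
    history.append(("ai", result["answer"]))
    return result

# Hypothetical chain that only reports how much history it received
def stub_chain(inputs):
    return {"answer": f"I can see {len(inputs['chat_history'])} earlier messages."}

session_history = []
print(run_with_history(stub_chain, session_history, "What is surface gravity?")["answer"])
print(run_with_history(stub_chain, session_history, "How do I compute it?")["answer"])
```

The second call sees two earlier messages (the first question and its answer), which is the continuity the wrapper provides across turns in the same session.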

## Testing the AI Tutor

Finally, we'll test our AI tutor by simulating a few student inquiries to see how it performs in a real-world educational scenario.

In [15]:
response = conversational_rag_chain.invoke({"input": "How do I calculate the surface gravity of a planet given its mass and radius?"}, 
                                           config={"configurable": {"session_id": "abc1234"}})
print(response["answer"])

Great question! Before we dive into the calculation, do you know which formula is used to calculate the surface gravity of a planet?


In [16]:
response = conversational_rag_chain.invoke({"input": "Well I know that F=G*m1*m2/r^2"}, 
                                           config={"configurable": {"session_id": "abc1234"}})
print(response["answer"])

You're on the right track with the gravitational force equation! However, for surface gravity specifically, we use a slightly different approach. 

Surface gravity ($g$) can be derived from Newton's law of universal gravitation. Do you know how we can express surface gravity in terms of the mass and radius of the planet?


In [17]:
response = conversational_rag_chain.invoke({"input": "Do we need one of Newton's second law, F=ma?"}, 
                                           config={"configurable": {"session_id": "abc1234"}})
print(response["answer"])

Yes, exactly! We can use Newton's second law, $F = ma$, along with the gravitational force formula you mentioned. 

To find the surface gravity, we consider the force acting on an object at the surface of a planet due to gravity. Can you think of how we can combine these ideas to derive the formula for surface gravity ($g$)?


We can also use the most relevant course material found by the model to recommend further reading:

In [18]:
def generate_links(context_list):
    base_url = "https://teaghan.github.io/astronomy-12/"
    links = []

    for context in context_list:
        title = context.metadata['title']
        content = context.page_content

        # Handle README files
        if "_README.md" in title:
            unit_number = re.search(r"# Unit(\d+)_README\.md", title).group(1)
            links.append(f"- [Unit {unit_number}]({base_url}md_files/Unit{unit_number}_README.html)")

        # Handle lesson files
        elif ".md" in title and "_" in title:
            # Extract lesson title from content
            lesson_title_match = re.search(r'# (.*?)\n', content)
            lesson_title = lesson_title_match.group(1) if lesson_title_match else "Lesson"

            lesson_parts = re.search(r"# (\d+)_(\d+)_(\w+)\.md", title)
            if lesson_parts:
                unit, lesson, name = lesson_parts.groups()
                name_formatted = name.replace('_', ' ')  # Assuming names are using underscores instead of spaces
                link_text = f"Lesson {unit}.{lesson} {name_formatted}"
                links.append(f"- [{link_text}]({base_url}md_files/{unit}_{lesson}_{name}.html)")

        # Handle assignments
        elif "Assignment" in title:
            assignment_number = re.search(r"# Unit (\d+) Assignment", title).group(1)
            links.append(f"- [Unit {assignment_number} Assignment]({base_url}Unit{assignment_number}/Unit{assignment_number}_Assignment.pdf)")

    return "\n".join(links)

print('Recommended course content for review:\n' + generate_links(response['context']))

Recommended course content for review:
- [Lesson 2.2 gravity](https://teaghan.github.io/astronomy-12/md_files/2_2_gravity.html)
- [Unit 2](https://teaghan.github.io/astronomy-12/md_files/Unit2_README.html)
- [Unit 2 Assignment](https://teaghan.github.io/astronomy-12/Unit2/Unit2_Assignment.pdf)
- [Lesson 2.1 keplers laws](https://teaghan.github.io/astronomy-12/md_files/2_1_keplers_laws.html)
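
To make the title parsing in `generate_links` easier to follow, here is a small standalone check of the lesson regular expression. `parse_lesson_title` is a hypothetical helper written for this illustration; it uses the same pattern as the function above.

```python
import re

def parse_lesson_title(title):
    """Return (unit, lesson, name) for a lesson heading, or None if it does not match."""
    match = re.search(r"# (\d+)_(\d+)_(\w+)\.md", title)
    return match.groups() if match else None

print(parse_lesson_title("# 2_1_keplers_laws.md"))
print(parse_lesson_title("# Unit2_README.md"))
```

The first call yields the unit number, lesson number, and file name used to build the link, while the README heading returns `None` because it does not start with two underscore-separated numbers.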
