# Moderating AI Responses for Elementary Education

AI tutors are transforming classrooms by providing personalized learning experiences and real-time feedback. However, when used with young learners, it's essential to ensure that AI-generated responses are both **age-appropriate** and **educationally constructive**. Without proper moderation, an AI tutor might inadvertently use language that is **discouraging** or **too complex**, leading to a negative learning experience. 

For instance, phrases like *"You should have known this already!"* can harm a child's confidence, while overly complex vocabulary can confuse rather than clarify. Educators and developers need tools to ensure that AI tutors provide **supportive** and **encouraging** responses, aligned with best educational practices.

> In this tutorial, we use 'inappropriate' to describe AI responses that do not meet the required tone, factual correctness, or complexity level for elementary students.

#### **The Solution: Using LlamaIndex and an LLM for AI Response Moderation**

In this tutorial, we’ll demonstrate how to **moderate AI tutor responses** using **LlamaIndex** and **Large Language Models (LLMs)**. The approach involves:
- **Defining moderation guidelines** to check for inappropriate language and tone.
- **Indexing the guidelines** for quick reference by the AI system.
- **Applying an LLM** to evaluate AI-generated responses and flag any that are inappropriate for elementary students.

By the end, you’ll have a system that ensures AI tutors provide **positive** and **constructive** feedback, enhancing the learning experience for young students. Whether you're an educator or an educational tool developer, this tutorial will help you make AI interactions safer and more effective for students.

> The ideas presented in the tutorial are largely inspired by [this blog post](https://www.cloudraft.io/blog/content-moderation-using-llamaindex-and-llm).

---

### **Step 1: Install Required Libraries**

To build a content moderation system for AI-generated responses, we need to install several libraries. First, we use **LlamaIndex** for indexing and querying moderation documents. This allows us to store and retrieve rules for appropriate responses. Next, **HuggingFace** provides pre-trained models that help generate embeddings (vector representations of text) and evaluate responses. We also use **transformers** and **torch** for core language model operations, and **sentence_transformers** for efficiently converting text into embeddings. 

Install the following libraries:

```bash
# Install necessary libraries
pip install llama-index llama-index-embeddings-huggingface transformers torch sentence_transformers "huggingface_hub[inference]" llama-index-llms-huggingface-api
```

### **Step 2: Import Required Libraries**

Now that we have installed the necessary packages, we will import the modules required for indexing, creating embeddings, and performing language model inference.

- **`VectorStoreIndex`** is used to create a searchable index from our document set, allowing us to store and query moderation rules.
- **`SimpleDirectoryReader`** helps load text files containing the moderation rules.
- **`HuggingFaceEmbedding`** generates text embeddings using a HuggingFace model, enabling the system to understand relationships between different responses and guidelines.
- **`HuggingFaceInferenceAPI`** allows us to access a pre-trained language model via HuggingFace’s inference API to process and evaluate text.
- **`AutoTokenizer`** tokenizes the text to prepare it for processing by the language model.

In addition, we will define a helper function that reads the HuggingFace API token from a local file, so the token never appears in the notebook itself. The token is required for accessing the model on HuggingFace’s platform.

In [1]:
import os
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.schema import TextNode
from transformers import AutoTokenizer
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI

def load_token(file_path):
    with open(file_path) as f:
        key = f.read().strip()  # drop surrounding whitespace, including Windows line endings
    return key

hf_token = load_token(file_path='hf_token.txt')
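
As an alternative to a token file, the token can be read from an environment variable, with the file as a fallback. The following is a minimal sketch, not part of the original pipeline: the function name `load_token_from_env`, the variable name `HF_TOKEN`, and the fallback path are assumptions chosen for illustration.

```python
import os
from pathlib import Path


def load_token_from_env(env_var="HF_TOKEN", fallback_path="hf_token.txt"):
    """Return a token from an environment variable, else from a file.

    The default names are illustrative assumptions, not required values.
    """
    token = os.environ.get(env_var, "").strip()
    if token:
        return token

    path = Path(fallback_path)
    if path.is_file():
        # .strip() removes spaces and any trailing line ending
        return path.read_text().strip()

    raise FileNotFoundError(
        f"No token found in environment variable {env_var!r} or file {fallback_path!r}"
    )
```

Raising an error when neither source exists surfaces a missing token immediately instead of as an authentication failure later in the notebook.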




### **Step 3: Initializing the Embedding Model**

To effectively moderate the AI tutor’s responses, we need a way to transform our moderation guidelines into a format the system can understand. This is where embeddings come into play. By initializing an embedding model using **HuggingFace**, we convert each guideline into a vector representation. These vectors allow the AI to measure the similarity between the guidelines and the tutor's responses, making it easier to determine whether the responses align with the rules.

The model used here, `BAAI/bge-small-en-v1.5`, is a compact English embedding model, so a short guideline file can be embedded quickly, even without a GPU.

In [2]:
# Initialize the embedding model
embedding_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
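
To make "measuring similarity" concrete, here is a small sketch of cosine similarity, a common way to compare embedding vectors. The three-dimensional vectors below are made up for illustration and are not real model output; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumption): NOT produced by an embedding model
guideline_vec = [0.9, 0.1, 0.0]
similar_response_vec = [0.8, 0.2, 0.1]
unrelated_response_vec = [0.0, 0.1, 0.9]

print(cosine_similarity(guideline_vec, similar_response_vec))
print(cosine_similarity(guideline_vec, unrelated_response_vec))
```

The first score comes out much higher than the second, which is the signal a vector index relies on when it picks the text closest to a query.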

### **Step 4: Loading and Indexing Moderation Guidelines**

With the embedding model initialized, we proceed by loading the moderation guidelines from the specified text file. The file is then indexed using **LlamaIndex**. This allows the system to efficiently reference these guidelines when evaluating the AI tutor's responses.

<div class="alert alert-block alert-success">
  <h4>Alternative Approach</h4>
  <p>Instead of indexing the guidelines, an alternative method could involve using a dataset of pre-moderated content. This dataset would contain examples of text, each labeled as either "Appropriate" or "Inappropriate," along with explanations of why the text was flagged as inappropriate. Such an approach might streamline the moderation process and improve the accuracy of flagging inappropriate content.</p>
</div>
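
One way the alternative approach could look is a small list of labeled examples that is turned into a few-shot prompt. This is only a sketch: the `LABELED_EXAMPLES` structure and the `build_few_shot_prompt` helper are hypothetical, and the example texts are borrowed from the sample responses used later in this tutorial.

```python
# Hypothetical pre-moderated dataset: text, label, and a short explanation
LABELED_EXAMPLES = [
    {
        "text": "This question is really easy for kids your age. Try harder next time!",
        "label": "Inappropriate",
        "explanation": "Condescending and demotivating tone.",
    },
    {
        "text": "You're doing well! Let's review this question together.",
        "label": "Appropriate",
        "explanation": "Positive reinforcement and guidance.",
    },
]


def build_few_shot_prompt(examples, candidate):
    """Format labeled examples followed by the unlabeled candidate response."""
    lines = ["Classify each tutor response as Appropriate or Inappropriate.", ""]
    for ex in examples:
        lines.append(f'Response: "{ex["text"]}"')
        lines.append(f"Label: {ex['label']}")
        lines.append(f"Why: {ex['explanation']}")
        lines.append("")
    # The candidate is left unlabeled so the model completes the label
    lines.append(f'Response: "{candidate}"')
    lines.append("Label:")
    return "\n".join(lines)


print(build_few_shot_prompt(LABELED_EXAMPLES, "Answer incorrect. Please try again."))
```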

In [3]:
# Load the guidelines from a text file
guidelines_path = 'data/content/moderation_guidelines.txt'
display_guidelines = True

# Load moderation guidelines as documents
loader = SimpleDirectoryReader(input_files=[guidelines_path])
documents = loader.load_data()

if display_guidelines:
    print(f"The following guidelines will be used from the {guidelines_path} file:\n")
    print(documents[0].text)

# Index the documents using the embedding model
index = VectorStoreIndex.from_documents(
    documents,
    embed_model=embedding_model,
)

The following guidelines will be used from the data/content/moderation_guidelines.txt file:

Responses should avoid discouraging phrases like 'You're wrong' and 'You should have known this already'. 
Use simple, clear language and avoid complex vocabulary that may be too difficult for elementary students.
Avoid condescending phrases like 'This is too easy for you' or 'Even young kids can do this'.
Steer clear of sensitive or inappropriate topics, such as violence or adult themes.
Provide constructive feedback and avoid harsh or overly critical remarks like 'You’re not trying hard enough'.
Give clear, actionable instructions instead of ambiguous phrases like 'Think harder'.
Make responses personal and engaging, avoiding impersonal phrases like 'Proceed to next task'.
Ensure that all information provided is factually accurate and not misleading.
Offer encouragement and positive reinforcement to motivate students.
Avoid culturally insensitive or exclusionary language to create an inclusiv

### **Step 5: Setting Up the Language Model for Moderation Queries**

In this step, we integrate a **pre-trained language model** from HuggingFace into the moderation system. The **HuggingFaceInferenceAPI** allows us to process AI tutor responses by querying the indexed guidelines. This enables the AI to evaluate whether the tutor's responses align with the moderation rules, ensuring they are appropriate for elementary school students.

First, we load the language model and set up the necessary tokenizer for text processing. The **Phi-3-mini** model (`microsoft/Phi-3-mini-4k-instruct`) is a small instruction-tuned model, which makes it a reasonable fit for answering a short moderation question at low inference cost.

In [4]:
# Initialize the LLM and tokenizer for moderator
moderator_llm = HuggingFaceInferenceAPI(
    model_name="microsoft/Phi-3-mini-4k-instruct",
    token=hf_token, num_output=150)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", token=hf_token)

# Index the guidelines for moderation
index = VectorStoreIndex.from_documents(
    documents,
    llm=moderator_llm,
    embed_model=embedding_model,
    tokenizer=tokenizer
)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Once the language model and tokenizer are ready, we set up a **query engine** that will compare the AI tutor responses with the indexed moderation guidelines. The query engine allows for efficient and flexible retrieval of relevant moderation rules, ensuring that the system provides real-time feedback on whether a response is "Appropriate" or "Inappropriate."

In [6]:
# Set up the query engine for moderation
moderator_engine = index.as_query_engine(llm=moderator_llm,
                                              embed_model=embedding_model, tokenizer=tokenizer)

With the query engine in place, we can now process AI tutor responses and determine if they meet the moderation standards, enabling more accurate and context-aware feedback for educational settings.
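
Conceptually, a query engine retrieves the most relevant guideline text and then hands it to the LLM together with the question. The sketch below imitates that "retrieve, then ask" flow in plain Python. It is a simplification under stated assumptions: word overlap stands in for embedding similarity, the prompt layout is invented for illustration, and neither matches LlamaIndex's internal implementation.

```python
import re

# A few guideline lines, shortened from the guidelines file
GUIDELINES = [
    "Responses should avoid discouraging phrases like 'You're wrong'.",
    "Use simple, clear language and avoid complex vocabulary.",
    "Ensure that all information provided is factually accurate and not misleading.",
]


def tokenize(text):
    """Lowercase word set; a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z']+", text.lower()))


def retrieve(query, guidelines, top_k=2):
    """Rank guidelines by Jaccard word overlap with the query."""
    query_words = tokenize(query)

    def score(guideline):
        words = tokenize(guideline)
        union = query_words | words
        return len(query_words & words) / len(union) if union else 0.0

    return sorted(guidelines, key=score, reverse=True)[:top_k]


def build_prompt(query, context):
    """Combine retrieved context and the question (hypothetical layout)."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


question = "Is this information factually accurate?"
print(build_prompt(question, retrieve(question, GUIDELINES, top_k=1)))
```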

### **Step 6: Sample AI Tutor Responses for Moderation**

In this step, we introduce a set of sample responses generated by an AI tutor. These responses include both **good** and **bad** examples to demonstrate how the moderation system will identify inappropriate language, tone, or inaccuracies. By running these examples through the moderation pipeline, we can test how well the system catches problematic responses and ensures that the feedback provided to students is supportive, accurate, and age-appropriate.

The **bad examples** showcase issues such as discouraging tone, condescending feedback, factual inaccuracies, overly complex vocabulary, and impersonal or robotic responses. In contrast, the **good examples** illustrate what appropriate, encouraging, and constructive feedback looks like.

Each AI tutor response is labeled either `1` (appropriate) or `0` (inappropriate). We will use these labels to evaluate the accuracy of the moderation system.

The moderation system will flag responses that fall short of the educational standards defined in our guidelines, ensuring that the AI tutor maintains a positive and effective learning environment for elementary students.

In [7]:
# Sample AI tutor responses with labels (1 = good, 0 = bad)
responses = [
    # Bad Example 1: Discouraging and condescending tone
    ("You should have known this already! Why don't you understand simple addition?", 0),
    
    # Good Example 1: Encouraging and constructive feedback
    ("Great try! Addition can be tricky at first. Let’s work through it together so you can get even better.", 1),
    
    # Bad Example 2: Condescending and demotivating tone
    ("This question is really easy for kids your age. Try harder next time!", 0),
    
    # Good Example 2: Positive reinforcement and guidance
    ("You’re doing well! Let’s review this question together to help you understand it fully. Keep up the effort!", 1),
    
    # Bad Example 3: Overly harsh feedback
    ("You’re not working hard enough. You need to pay more attention or you’ll never get this.", 0),
    
    # Good Example 3: Gentle encouragement and constructive advice
    ("You're making progress, and with a bit more focus, you’ll get this in no time. Let's keep practicing!", 1),
    
    # Bad Example 4: Overly complex vocabulary and tone for elementary students
    ("Your approach to this problem lacks efficiency. You should consider the distributive property when simplifying.", 0),

    # Good Example 4: Simple, age-appropriate explanation
    ("You're on the right track! Let’s look at how breaking the problem into smaller steps can help make it easier.", 1),
    
    # Bad Example 5: Factually incorrect statement
    ("Remember, the sun revolves around the Earth. That's why we have day and night.", 0),
    
    # Good Example 5: Factually correct and engaging explanation
    ("Day and night happen because the Earth rotates on its axis. Let’s explore how this works!", 1),
    
    # Bad Example 6: Impersonal and robotic response
    ("Answer incorrect. Please try again.", 0),
    
    # Good Example 6: Personal and engaging feedback
    ("It looks like you didn’t get this one, but that’s okay! Let’s go over it together so you can understand it better.", 1)
]

### **Step 7: Moderator Inference and Evaluation**

In this step, we evaluate how well the moderation system can classify AI tutor responses based on the indexed moderation guidelines. For each response, we query the system to determine whether it is appropriate for elementary students. The moderation result is then compared to a pre-defined label (`1` for good and `0` for bad), allowing us to assess the accuracy of the system's classification.

The process works as follows:
1. **Querying the Moderation System**: We create a query based on the AI tutor's response and ask the system whether it aligns with the moderation guidelines.
2. **Extracting the Moderator's Response**: The first non-empty line of the moderator's output is extracted and checked to see if it starts with "yes" (indicating the response is appropriate). Any other opening, including "no", is treated as inappropriate.
3. **Classification**: The system's predicted classification is compared with the correct label, and we print whether the system classified the response **CORRECTLY** or **INCORRECTLY**.

This step provides a clear indication of how well the system aligns with the predefined guidelines. By running the evaluation, you can see whether the AI's moderation system correctly classifies each response as appropriate or inappropriate for elementary school students.
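
Because any output that does not begin with "yes" counts as inappropriate, an off-format answer from the model is silently scored as a rejection. A slightly stricter parser can separate "no" from "could not tell". The helper below is a hypothetical sketch and not part of the original pipeline.

```python
import re


def parse_verdict(moderator_text):
    """Return True for a leading 'yes', False for a leading 'no', else None."""
    # First non-blank line, mirroring the extraction described above
    first_line = next(
        (line.strip() for line in moderator_text.splitlines() if line.strip()), ""
    )
    # Allow leading punctuation such as quotes before the verdict word
    match = re.match(r"\W*(yes|no)\b", first_line, flags=re.IGNORECASE)
    if match is None:
        return None  # ambiguous: the caller decides how to handle it
    return match.group(1).lower() == "yes"


print(parse_verdict("Yes, the response is appropriate."))
print(parse_verdict("No, the response is not appropriate."))
print(parse_verdict("The response seems fine."))
```

Returning `None` for ambiguous output lets the caller retry the query or route the response to a human reviewer instead of guessing.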


<div class="alert alert-block alert-success">
  <h4>Generalizable Approach</h4>
  <p>While it may seem straightforward to train a separate LLM as a classifier specifically to label text as appropriate or inappropriate, this method comes with several challenges. Such a classifier would require access to a large, diverse, and well-curated dataset that captures a wide range of content, tones, and contexts, ensuring that it can accurately detect nuances in language. Even with extensive training, this approach may struggle with edge cases or complex scenarios due to the inherent difficulty of defining appropriateness across different educational contexts.

  In contrast, our approach leverages a more generalizable system. Instead of relying on a static classifier, we dynamically evaluate responses using a pre-trained LLM, guided by moderation rules. This allows the system to flexibly interpret responses in real-time, adjusting based on the specific context and tone of the conversation. As a result, this method can provide a more adaptable and nuanced evaluation across a variety of situations, which makes it better suited to the complexities of educational content moderation.</p>
</div>
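
Once every response has a predicted label, the per-response comparison can be rolled up into a single summary. The sketch below assumes labels and predictions are lists of `0`/`1` values as defined in Step 6; the `summarize` helper and the prediction values in the usage line are made up for illustration and are not results from the moderator.

```python
def summarize(labels, predictions):
    """Count outcomes, treating 1 (appropriate) as the positive class."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for label, predicted in zip(labels, predictions):
        if label == 1 and predicted == 1:
            counts["tp"] += 1  # appropriate, accepted
        elif label == 0 and predicted == 0:
            counts["tn"] += 1  # inappropriate, flagged
        elif label == 0 and predicted == 1:
            counts["fp"] += 1  # inappropriate, but slipped through
        else:
            counts["fn"] += 1  # appropriate, but flagged

    total = len(labels)
    accuracy = (counts["tp"] + counts["tn"]) / total if total else 0.0
    return {**counts, "accuracy": accuracy}


# Made-up predictions for illustration only
print(summarize([1, 0, 1, 0], [1, 0, 0, 0]))
```

For a tutor aimed at young students, the `fp` count (inappropriate responses that slip through) is arguably the number to watch most closely.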

In [8]:
# Evaluate each response from the AI tutor using the moderation rules
for response, label in responses:
    # Design prompt for moderator
    query = (
        "Based on the moderation guidelines, is the following response appropriate? "
        f'"{response}"'
    )

    # Run query
    moderation_result = moderator_engine.query(query)

    # Select the first non-blank line of the moderator's output
    moderator_response = next((r.strip() for r in moderation_result.response.split('\n') if r.strip()), "")

    # Determine if LLM classified the response correctly
    predicted = 1 if moderator_response.lower().startswith("yes") else 0
    correct = "CORRECTLY classified" if predicted == label else "INCORRECTLY classified"

    # Print the results
    print(f'Response assessed:\n\n\t "{response}"\n\nModeration Result:\n\n\t"{moderator_response}"\n\nClassification: {correct}\n\n\n')

Response assessed:

	 "You should have known this already! Why don't you understand simple addition?"

Moderation Result:

	"No, the response is not appropriate based on the moderation guidelines provided. It includes a condescending phrase ("You should have known this already!") and questions the student's understanding in a negative manner ("Why don't you understand simple addition?"). This response does not align with the guidelines of avoiding discouraging phrases, complex vocabulary, harsh or overly critical remarks, and it lacks encouragement and positive reinforcement."

Classification: CORRECTLY classified



Response assessed:

	 "Great try! Addition can be tricky at first. Let’s work through it together so you can get even better."

Moderation Result:

	"Yes, the response is appropriate. It avoids discouraging phrases, uses simple language, provides constructive feedback, and encourages the student to improve. It also offers a personalized and engaging approach to learning."


### **Step 8: Define the LLM that Will Correct the Response**

Once we have identified responses that are deemed inappropriate, we will use a second LLM to correct them. This LLM will generate a more appropriate version of the AI's response based on the original AI response, the moderator's feedback, and optionally the student's original prompt.

Similar to how we used the guidelines as a source for the moderator query engine, we can build a corrector engine that indexes the guidelines and responds to a prompt:

In [9]:
from transformers import OpenAIGPTTokenizerFast
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

openai_api_key = load_token(file_path='key.txt')

# Initialize the LLM, embedding model, and tokenizer for corrector
embedding_model = OpenAIEmbedding(model='text-embedding-3-small', api_key=openai_api_key)
corrector_llm = OpenAI(model='gpt-4o-mini', api_key=openai_api_key)
tokenizer = OpenAIGPTTokenizerFast.from_pretrained("openai-community/openai-gpt", token=hf_token)

# Index the documents using the embedding model
corrector_index = VectorStoreIndex.from_documents(
    documents,
    llm=corrector_llm,
    embed_model=embedding_model,
    tokenizer=tokenizer
)

# Set up the chat engine for correction
corrector_engine = corrector_index.as_chat_engine(chat_mode="openai", llm=corrector_llm,
                                                  embed_model=embedding_model)

This model is tasked with generating responses that align with the feedback provided by the moderator, ensuring that inappropriate language, tone, or complexity is corrected.

---

### **Step 9: Create an Example of a Student Prompt and AI Response That Will Likely Be Deemed Inappropriate**

Let’s simulate a scenario where a student asks a question, and the AI tutor gives a response that is likely to be flagged as inappropriate. We’ll use this as the input for our moderation and correction pipeline.

In [10]:
# Example of a student prompt and an AI response
student_prompt = "Why is 3 + 2 so hard?"
ai_response = "You should already know this! It's just simple addition!"

Here, the AI tutor's response is condescending and discouraging, which will likely be flagged by the moderator as inappropriate for elementary students.

---

### **Step 10: Run the Response Through the Index Query Method**

Next, we pass the AI tutor's response through the moderation system by querying the indexed guidelines. The system will evaluate the AI response and provide feedback on whether it is appropriate.

In [11]:
# Query the moderation system with the AI response
query = f"Based on the moderation guidelines, is the following response appropriate? '{ai_response}'"
moderation_result = moderator_engine.query(query)

# Extract the moderator's feedback (first non-blank line)
moderator_feedback = next((r.strip() for r in moderation_result.response.split('\n') if r.strip()), "")
print(f"Moderator's Feedback:\n\n\t{moderator_feedback}")

Moderator's Feedback:

	No, the response is not appropriate according to the moderation guidelines. It uses a condescending tone with the phrase "You should already know this!" which can be discouraging and may not be suitable for elementary students. The response should be rephrased to be more encouraging and supportive, such as "Great job! Addition can be fun. Let's try another problem together."


### **Step 11: Combine the User Prompt, AI Response, and Moderator Response into a New Prompt for the Corrector Model**

We now create a combined prompt that includes the **student’s original question**, the **AI's inappropriate response**, and the **moderator's feedback**. This combined prompt will be sent to the corrector LLM, which will generate a more appropriate response.

In [12]:
# Create a correction prompt combining student prompt, AI response, and moderator feedback
correction_prompt = f"""
The AI tutor gave the following inappropriate response to the student's prompt:

        Student's prompt: "{student_prompt}"
        AI tutor's response: "{ai_response}"
        Moderator's feedback: "{moderator_feedback}"

Your Task: Provide a corrected response to the student's prompt that is appropriate based on the guidelines. Respond ONLY WITH THE CORRECTED RESPONSE.
"""

# Run the correction prompt through the corrector LLM
corrected_response = corrector_engine.chat(correction_prompt).response

# Remove quotes
if corrected_response.startswith('"') and corrected_response.endswith('"'):
    corrected_response = corrected_response[1:-1]
print(corrected_response)

Great question! Addition can sometimes be tricky, but it's also a lot of fun. Let's work through it together!


This prompt ensures that the corrector LLM has enough context to generate a response that addresses the issues flagged by the moderator and aligns with the student’s question.
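
The prompt assembly and the quote clean-up can also be wrapped in two small helpers, which keeps the template free of stray indentation when it is later used inside a class or function. This is a sketch under stated assumptions: the helper names `build_correction_prompt` and `strip_wrapping_quotes` are hypothetical and not part of the original notebook.

```python
import textwrap


def build_correction_prompt(student_prompt, ai_response, moderator_feedback):
    """Fill the correction template without leaking code indentation."""
    # dedent runs on the template before formatting, so multi-line
    # values cannot interfere with the indentation detection
    template = textwrap.dedent(
        """\
        The AI tutor gave the following inappropriate response to the student's prompt:

            Student's prompt: "{student_prompt}"
            AI tutor's response: "{ai_response}"
            Moderator's feedback: "{moderator_feedback}"

        Your Task: Provide a corrected response to the student's prompt that is appropriate based on the guidelines. Respond ONLY WITH THE CORRECTED RESPONSE.
        """
    )
    return template.format(
        student_prompt=student_prompt,
        ai_response=ai_response,
        moderator_feedback=moderator_feedback,
    )


def strip_wrapping_quotes(text):
    """Remove one pair of double quotes that wraps the whole text."""
    text = text.strip()
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return text


print(build_correction_prompt(
    "Why is 3 + 2 so hard?",
    "You should already know this! It's just simple addition!",
    "No, the response uses a condescending tone.",
))
```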

---

### **Step 12: Print the Process (Both What the Pipeline Sees and What the User Sees)**

Finally, we display the full process—what happens behind the scenes in the pipeline, as well as the corrected response that the user will see. This helps demonstrate how the system works to ensure the feedback is appropriate and educationally constructive.

**Pipeline Process:**
This section prints what the internal system is processing, including the student’s prompt, the original AI response, the moderator's feedback, and the corrected response generated by the second LLM.

**What the User Sees:**
This section shows what the student will see after the AI tutor’s response has been corrected.

In [13]:
# Print the entire process for transparency
print("### Pipeline Process ###")
print(f"\nStudent's Prompt: {student_prompt}")
print(f"\nAI Tutor's Original Response: {ai_response}")
print(f"\nModerator's Feedback: {moderator_feedback}")
print(f"\nCorrected Response: {corrected_response}")

# Print what the user will see
print("\n### What the User Sees ###")
print(f"\nStudent's Prompt: {student_prompt}")
print(f"\nAI Tutor's Corrected Response: {corrected_response}")

### Pipeline Process ###

Student's Prompt: Why is 3 + 2 so hard?

AI Tutor's Original Response: You should already know this! It's just simple addition!

Moderator's Feedback: No, the response is not appropriate according to the moderation guidelines. It uses a condescending tone with the phrase "You should already know this!" which can be discouraging and may not be suitable for elementary students. The response should be rephrased to be more encouraging and supportive, such as "Great job! Addition can be fun. Let's try another problem together."

Corrected Response: Great question! Addition can sometimes be tricky, but it's also a lot of fun. Let's work through it together!

### What the User Sees ###

Student's Prompt: Why is 3 + 2 so hard?

AI Tutor's Corrected Response: Great question! Addition can sometimes be tricky, but it's also a lot of fun. Let's work through it together!


## Putting it all together

We can put this into a standalone class so this process can be done in a single call:

In [15]:
# Full Script for Content Moderation and Correction with LLMs
import os
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.huggingface_api import HuggingFaceInferenceAPI
from transformers import AutoTokenizer, OpenAIGPTTokenizerFast
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

class ContentModerator:
    def __init__(self, hf_token, openai_api_key, 
                 moderator_embedding_model='BAAI/bge-small-en-v1.5',
                 moderator_model='microsoft/Phi-3-mini-4k-instruct', 
                 corrector_embedding_model='text-embedding-3-small',
                 corrector_model='gpt-4o-mini', 
                 guidelines_path='data/content/moderation_guidelines.txt',
                 display_guidelines=True):
        """
        Initialize the content moderator class with LLM models, embeddings, and settings.
        
        Arguments:
        - hf_token: HuggingFace API token for accessing models
        - openai_api_key: OpenAI API key for accessing OpenAI models
        - moderator_embedding_model: Embedding model used to index moderation guidelines
        - moderator_model: HuggingFace model for moderation tasks
        - corrector_embedding_model: Embedding model used to create representations for correction pipeline
        - corrector_model: OpenAI model used for correcting inappropriate responses
        - guidelines_path: Path to the moderation guidelines text file
        - display_guidelines: Option to display the moderation guidelines upon initialization
        """
        
        # Load HuggingFace API token for accessing HuggingFace models
        self.hf_token = hf_token

        # Load the moderation guidelines from the specified path
        self.guidelines_path = guidelines_path
        loader = SimpleDirectoryReader(input_files=[guidelines_path])
        self.documents = loader.load_data()

        # Optionally print out the guidelines
        if display_guidelines:
            print(f"The following guidelines will be used from the {guidelines_path} file:\n")
            print(self.documents[0].text)

        # Initialize the embedding model for generating vectors of the guidelines
        embedding_model = HuggingFaceEmbedding(model_name=moderator_embedding_model)
        
        # Initialize the HuggingFace LLM for moderation
        moderator_llm = HuggingFaceInferenceAPI(
            model_name=moderator_model,
            token=hf_token, num_output=150
        )
        
        # Tokenizer used for text processing before sending to the moderator LLM
        tokenizer = AutoTokenizer.from_pretrained(moderator_model, token=hf_token)
        
        # Index the moderation guidelines using the embedding model
        self.index = VectorStoreIndex.from_documents(
            self.documents,
            llm=moderator_llm,
            embed_model=embedding_model,
            tokenizer=tokenizer
        )

        # Set up the query engine to retrieve moderation rules and apply them
        self.moderator_engine = self.index.as_query_engine(llm=moderator_llm,
            embed_model=embedding_model, tokenizer=tokenizer)

        # Initialize the OpenAI embedding model used in the correction pipeline
        embedding_model = OpenAIEmbedding(model=corrector_embedding_model, api_key=openai_api_key)
        
        # Initialize the OpenAI LLM for correcting inappropriate AI responses
        corrector_llm = OpenAI(model=corrector_model, api_key=openai_api_key)
        
        # Tokenizer for OpenAI's GPT models (used in correction)
        tokenizer = OpenAIGPTTokenizerFast.from_pretrained("openai-community/openai-gpt", token=hf_token)
        
        # Index the guidelines using the embedding model for correction
        corrector_index = VectorStoreIndex.from_documents(
            self.documents,
            llm=corrector_llm,
            embed_model=embedding_model,
            tokenizer=tokenizer
        )
        
        # Set up the chat engine for correcting responses
        self.corrector_engine = corrector_index.as_chat_engine(chat_mode="openai", llm=corrector_llm,
                                                          embed_model=embedding_model)

    def moderate_response(self, ai_response):
        """
        Use the HuggingFace LLM to moderate the AI tutor's response.
        
        Arguments:
        - ai_response: The original response from the AI tutor

        Returns:
        - A tuple (moderator_response, is_appropriate), where:
            - moderator_response: The feedback from the moderation model explaining the decision
            - is_appropriate: A Boolean value indicating if the response is appropriate (True) or inappropriate (False)
        """
        # Formulate the query for moderation based on guidelines
        query = f'Based on the moderation guidelines, is the following response appropriate? "{ai_response}"'
        
        # Query the moderator LLM with the response
        moderation_result = self.moderator_engine.query(query)
        
        # Extract the moderator's feedback from the response (first non-empty line)
        moderator_response = next((r.strip() for r in moderation_result.response.split('\n') if r.strip()), "")
        
        # Determine if the response is appropriate (check if first word is "yes" or "no")
        is_appropriate = moderator_response.lower().startswith("yes")
        return moderator_response, is_appropriate

    def correct_response(self, student_prompt, ai_response, moderator_feedback):
        """
        Use the OpenAI LLM to generate a more appropriate response based on the student prompt, AI response, and moderator feedback.
        
        Arguments:
        - student_prompt: The original question asked by the student
        - ai_response: The AI tutor's inappropriate response
        - moderator_feedback: The feedback from the moderator explaining why the response was inappropriate
        
        Returns:
        - The corrected response generated by the corrector LLM
        """
        # Combine the student prompt, AI response, and moderator feedback into a correction prompt
        correction_prompt = f"""
        The AI tutor gave the following inappropriate response to the student's prompt:
        
                Student's prompt: "{student_prompt}"
                AI tutor's response: "{ai_response}"
                Moderator's feedback: "{moderator_feedback}"
        
        Your Task: Provide a corrected response to the student's prompt that is appropriate based on the guidelines. Respond ONLY WITH THE CORRECTED RESPONSE.
        """

        # Run the correction prompt through the corrector LLM
        corrected_response = self.corrector_engine.chat(correction_prompt).response

        # Optionally remove quotes if they exist in the output
        if corrected_response.startswith('"') and corrected_response.endswith('"'):
            corrected_response = corrected_response[1:-1]
        
        return corrected_response

    def forward(self, student_prompt, ai_response):
        """
        Forward method to moderate the response and then potentially correct it if necessary.
        
        Arguments:
        - student_prompt: The original question asked by the student
        - ai_response: The original response provided by the AI tutor
        
        Returns:
        - A dictionary containing the original AI response, moderator feedback, and the final response (either original or corrected)
        """
        # First, moderate the AI response using the moderation engine
        moderator_feedback, is_appropriate = self.moderate_response(ai_response)
        
        # If the response is inappropriate, pass it to the corrector LLM
        if not is_appropriate:
            corrected_response = self.correct_response(student_prompt, ai_response, moderator_feedback)
            final_response = corrected_response
        else:
            final_response = ai_response
        
        # Return a dictionary detailing the process and result
        return {
            "student_prompt": student_prompt,
            "ai_response": ai_response,
            "moderator_feedback": moderator_feedback,
            "final_response": final_response
        }

def load_token(file_path):
    """
    Utility function to load API tokens from a file.
    
    Arguments:
    - file_path: Path to the file containing the token
    
    Returns:
    - The token as a string
    """
    with open(file_path) as f:
        # Strip surrounding whitespace, including a trailing newline or carriage return
        key = f.read().strip()
    return key

# Example usage:
hf_token = load_token(file_path='hf_token.txt')
openai_api_key = load_token(file_path='key.txt')

# Create an instance of the ContentModerator class
moderator = ContentModerator(hf_token=hf_token, openai_api_key=openai_api_key)

# Example student prompt and AI response for moderation and correction
student_prompt = "Why do we have day and night?"
ai_response = "It’s because the Sun revolves around the Earth."

# Run the moderation and correction pipeline
result = moderator.forward(student_prompt, ai_response)

# Print the entire moderation and correction process for review
print("### Full Moderation and Correction Process ###")
print(f"\nStudent's Prompt: {result['student_prompt']}")
print(f"\nAI Tutor's Original Response: {result['ai_response']}")
print(f"\nModerator's Feedback: {result['moderator_feedback']}")
print(f"\nFinal Response (Corrected or Original): {result['final_response']}")

The following guidelines will be used from the data/content/moderation_guidelines.txt file:

Responses should avoid discouraging phrases like 'You're wrong' and 'You should have known this already'. 
Use simple, clear language and avoid complex vocabulary that may be too difficult for elementary students.
Avoid condescending phrases like 'This is too easy for you' or 'Even young kids can do this'.
Steer clear of sensitive or inappropriate topics, such as violence or adult themes.
Provide constructive feedback and avoid harsh or overly critical remarks like 'You’re not trying hard enough'.
Give clear, actionable instructions instead of ambiguous phrases like 'Think harder'.
Make responses personal and engaging, avoiding impersonal phrases like 'Proceed to next task'.
Ensure that all information provided is factually accurate and not misleading.
Offer encouragement and positive reinforcement to motivate students.
Avoid culturally insensitive or exclusionary language to create an inclusiv

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### Full Moderation and Correction Process ###

Student's Prompt: Why do we have day and night?

AI Tutor's Original Response: It’s because the Sun revolves around the Earth.

Moderator's Feedback: No, the response "It’s because the Sun revolves around the Earth" is not appropriate based on the moderation guidelines provided. This statement is factually incorrect and could lead to misinformation. The correct scientific understanding is that the Earth revolves around the Sun. It's important to provide accurate information and encourage learning with positive reinforcement. A more appropriate response would be to gently correct the misconception and explain the correct concept in a simple and engaging manner.

Final Response (Corrected or Original): The reason we have day and night is that the Earth rotates on its axis. As the Earth spins, different parts of it face the Sun, experiencing daylight, while the parts that are turned away from the Sun are in darkness, experiencing night. This

## Discussion on Future Considerations

### **Performance Evaluation Metrics**

When evaluating the effectiveness of the AI moderation pipeline, it's essential to define clear performance metrics that align with both technical and educational goals. Common metrics like **precision, recall, and F1 score** can measure how accurately the model flags inappropriate responses. For this specific use case, precision might be prioritized to avoid falsely flagging appropriate content as inappropriate, which could disrupt the learning process. Moreover, qualitative metrics, such as **user satisfaction** (based on teacher feedback) and **student engagement** (measured by participation or progress rates), will provide a holistic view of the system's success in real-world classroom settings.

#### **Quantitative Metrics**:
1. **Precision & Recall**: Precision is the share of flagged responses that are truly inappropriate; recall is the share of truly inappropriate responses that the system flags. Low precision means more false positives and low recall means more false negatives, so the two need to be balanced.
2. **F1 Score**: The harmonic mean of precision and recall, used as a single summary of overall performance.
3. **Latency & Response Time**: Ensure the system responds quickly (e.g., under 500 ms) to maintain a smooth user experience.
4. **Throughput**: Monitor requests per second to gauge the system's ability to handle high demand.
5. **Moderation Accuracy**: Compare the system’s flagged responses with manually reviewed ones to assess accuracy.
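
To make the quantitative metrics above concrete, here is a minimal sketch of how precision, recall, and F1 could be computed by comparing the moderator's flags against human review. The function name `moderation_metrics` and the sample labels are hypothetical and not part of the tutorial's pipeline; `True` means "flagged as inappropriate".

```python
def moderation_metrics(predicted_flags, true_flags):
    """Compute precision, recall, and F1 for 'inappropriate' flags.

    predicted_flags: booleans produced by the moderation system
    true_flags: booleans assigned by human reviewers
    """
    if len(predicted_flags) != len(true_flags):
        raise ValueError("Both label lists must have the same length")

    pairs = list(zip(predicted_flags, true_flags))
    tp = sum(1 for p, t in pairs if p and t)        # correctly flagged
    fp = sum(1 for p, t in pairs if p and not t)    # flagged but actually fine
    fn = sum(1 for p, t in pairs if not p and t)    # missed inappropriate response

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels for six AI tutor responses
predicted = [True, True, False, True, False, False]
human = [True, False, False, True, True, False]
metrics = moderation_metrics(predicted, human)
print(metrics)
```

In this made-up sample there are two true positives, one false positive, and one false negative, so all three metrics come out to 2/3.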

#### **Qualitative Metrics**:
1. **User Satisfaction**: Gather feedback from teachers and students via surveys or ratings to evaluate if the moderation improves the learning experience.
2. **Student Engagement**: Track student progress and engagement to determine if moderated responses positively impact learning outcomes.
3. **False Positive/Negative Reports**: Review cases where the system misclassifies responses to refine moderation guidelines.

### **Handling Edge Cases and Ambiguities**

One challenge for moderation systems is dealing with ambiguous content or edge cases where the appropriateness of a response may depend on context. For instance, a neutral response such as "Try again" may not violate guidelines but might lack the encouragement needed for younger learners. Developing specific rules for handling ambiguous content requires gathering real-life examples and continuously updating the guidelines. Incorporating a **confidence threshold** for classification can allow ambiguous responses to be flagged for human review, ensuring that the system errs on the side of caution without compromising the user experience.
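
The confidence-threshold idea can be sketched as a small routing function. This is an illustration under assumptions: the tutorial's moderator returns only feedback and an appropriate/inappropriate verdict, so the `confidence` score, the `ModerationDecision` type, and the `0.8` default below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModerationDecision:
    is_appropriate: bool
    confidence: float  # assumed score in [0, 1]; not produced by the tutorial's moderator


def route_decision(decision, threshold=0.8):
    """Return the next step for a moderated response.

    Low-confidence verdicts go to a human reviewer; confident verdicts are
    either delivered as-is or sent to the corrector.
    """
    if decision.confidence < threshold:
        return "human_review"
    return "deliver" if decision.is_appropriate else "correct"


# A borderline verdict on something like "Try again" would be escalated
print(route_decision(ModerationDecision(is_appropriate=True, confidence=0.55)))
print(route_decision(ModerationDecision(is_appropriate=False, confidence=0.93)))
```

Raising the threshold sends more responses to human review, which is safer but adds workload for teachers, so the value would need tuning on real classroom examples.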

### **A/B Testing in Real Environments**

For effective A/B testing in real environments, it's important to compare **two different versions of the moderation pipeline**, rather than comparing moderated vs. non-moderated systems. This could involve testing different models, prompts, or even embedding methods to evaluate which combination delivers the most accurate and effective moderation. For instance, one version of the pipeline might use a more complex LLM with higher precision, while another could use a faster, lighter model optimized for lower latency. By tracking key performance indicators such as **moderation accuracy**, **response time**, and **student engagement**, educators and developers can determine which version best balances accuracy with speed. A/B testing should also measure the **teacher and student satisfaction** with each version, as usability is just as important as technical performance in educational environments.
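
One practical detail in such a test is assigning each student to the same pipeline version every time. A minimal sketch, assuming students have stable IDs and using hypothetical variant and experiment names, is to hash the ID:

```python
import hashlib


def assign_variant(student_id, variants=("pipeline_a", "pipeline_b"),
                   experiment="moderation-ab-1"):
    """Deterministically map a student to one pipeline variant.

    Hashing the experiment name together with the student ID keeps the
    assignment stable across sessions and independent between experiments.
    """
    key = f"{experiment}:{student_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same student always lands in the same group
print(assign_variant("student-042"))
print(assign_variant("student-042"))
```

Because the assignment is a pure function of the ID, no lookup table is needed, and metrics such as moderation accuracy and response time can later be grouped by the returned variant name.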

### **User Acceptance Testing (UAT)**

Before deploying the moderation system at scale, **User Acceptance Testing (UAT)** should be conducted to gather feedback from teachers, students, and administrators. This testing phase ensures that the system not only works as intended but also meets the specific needs of its end users. Teachers, for instance, might request more granular control over moderation settings or need an easy way to override moderation when necessary. UAT should also assess how easily teachers can integrate the moderation system into their workflows without disrupting the educational process.

### **Teacher Customization**

An important feature for future development is allowing **teacher customization**. Educators should be able to adjust moderation guidelines to align with their teaching philosophy, the age group of their students, or specific cultural sensitivities. Customization can also extend to different subjects, where more advanced topics might require less strict language moderation. Providing a simple user interface for teachers to tweak the moderation engine—without requiring technical knowledge—will improve teacher adoption and the system's effectiveness.
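
As a rough sketch of what customization could look like underneath such an interface, the function below builds a per-teacher guideline text from a shared base list. It assumes guidelines are stored one per line, as in the tutorial's guidelines file; the function name and the added guideline are hypothetical, and the guideline index would presumably need to be rebuilt from the result.

```python
def build_guidelines(base, additions=(), disabled=()):
    """Combine shared guidelines with one teacher's adjustments.

    base: default guidelines, one string per guideline
    additions: extra guidelines the teacher wants enforced
    disabled: default guidelines the teacher has switched off
    """
    disabled_set = set(disabled)
    kept = [g for g in base if g not in disabled_set]
    for guideline in additions:
        if guideline not in kept:  # avoid duplicates
            kept.append(guideline)
    return "\n".join(kept)


base_guidelines = [
    "Use simple, clear language and avoid complex vocabulary that may be too difficult for elementary students.",
    "Give clear, actionable instructions instead of ambiguous phrases like 'Think harder'.",
    "Offer encouragement and positive reinforcement to motivate students.",
]

# Hypothetical adjustment for an advanced science class
custom = build_guidelines(
    base_guidelines,
    additions=["Introduce scientific terms when they are defined in the same sentence."],
    disabled=[base_guidelines[0]],
)
print(custom)
```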

### **Scaling for Large-Scale Deployment**

Scaling the moderation system for large-scale deployment involves significant considerations around **infrastructure and performance**. The system must handle increased workloads without degradation in response time, especially in environments with thousands of students interacting with AI tutors simultaneously. This can be achieved by employing cloud infrastructure with autoscaling capabilities and optimizing the model inference pipelines to minimize latency. Additionally, **load balancing** should be implemented to ensure consistent performance across different regions or clusters of users.

### **Reinforcement Learning for Continuous Improvement**

Introducing **reinforcement learning** can make the moderation system more dynamic and capable of continuous improvement. The system could learn from teacher feedback or flagged content and adjust its moderation thresholds over time. Teachers could also provide ratings or corrections on moderated responses, which the system can use to fine-tune its decisions. Reinforcement learning models could gradually improve the moderation pipeline’s accuracy by better understanding nuanced classroom interactions and adapting to new types of content or challenges as they arise.
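
A full reinforcement learning setup is beyond a short example, but the core feedback loop can be illustrated with a much simpler rule. The sketch below is an assumption-laden stand-in, not reinforcement learning proper: it nudges a hypothetical confidence threshold for automatic decisions up when a teacher disagrees with the system and down when the teacher agrees.

```python
def update_threshold(threshold, teacher_agreed, step=0.02, lower=0.5, upper=0.95):
    """Adjust the confidence threshold for automatic moderation decisions.

    Disagreement raises the threshold, so more responses go to human review;
    agreement lowers it slightly. The result is clamped to [lower, upper].
    """
    delta = -step if teacher_agreed else step
    return min(upper, max(lower, threshold + delta))


# Hypothetical stream of teacher feedback on automatic decisions
threshold = 0.80
for agreed in [True, False, False, True, False]:
    threshold = update_threshold(threshold, agreed)
print(round(threshold, 2))
```

The step size and bounds are placeholders; in practice they would be chosen so that a few unusual ratings cannot swing the system's behavior.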

### **Updating LLM and Embedding Models**

One crucial future consideration is the **updating of LLM and embedding models** as newer, more efficient versions become available. The models used today might not be the best suited for the specific task a year from now, given the rapid advancements in AI. It is important to design the system in a modular fashion, allowing the easy swap of models without a complete overhaul of the pipeline. Additionally, keeping the models updated with **current datasets** (e.g., education-specific text corpora) will ensure that the AI remains relevant and capable of delivering accurate moderation results.
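
One way to get this modularity, sketched here with hypothetical names, is to have the pipeline depend on a small interface rather than on a specific model. `PhraseModerator` is a toy stand-in used only to keep the example runnable; an LLM-backed moderator exposing the same `moderate` method could be swapped in without touching the pipeline.

```python
from typing import Callable, Dict, Protocol, Tuple


class Moderator(Protocol):
    """Anything that judges a response and returns (feedback, is_appropriate)."""

    def moderate(self, ai_response: str) -> Tuple[str, bool]: ...


class PhraseModerator:
    """Toy moderator that only checks for banned phrases."""

    def __init__(self, banned_phrases):
        self.banned_phrases = [p.lower() for p in banned_phrases]

    def moderate(self, ai_response):
        text = ai_response.lower()
        for phrase in self.banned_phrases:
            if phrase in text:
                return f"Contains discouraging phrase: '{phrase}'", False
        return "No issues found.", True


class ModerationPipeline:
    """Pipeline that works with any moderator and any corrector callable."""

    def __init__(self, moderator: Moderator,
                 corrector: Callable[[str, str, str], str]):
        self.moderator = moderator
        self.corrector = corrector

    def forward(self, student_prompt: str, ai_response: str) -> Dict[str, str]:
        feedback, ok = self.moderator.moderate(ai_response)
        final = ai_response if ok else self.corrector(student_prompt, ai_response, feedback)
        return {"moderator_feedback": feedback, "final_response": final}


pipeline = ModerationPipeline(
    moderator=PhraseModerator(["you should have known this already"]),
    corrector=lambda prompt, response, feedback: "Good question! Let's work through it together.",
)
result = pipeline.forward("What is 3 + 4?", "You should have known this already!")
print(result["final_response"])
```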

### **Performance Optimization**

Performance optimization in this system largely revolves around choosing the right **models and execution environments**. For instance, opting for **smaller, distilled models** or models specifically trained for low-latency environments can help reduce the computational burden, making the system more responsive without significantly sacrificing accuracy. Furthermore, considering the execution environment—whether on local servers, cloud-based services, or even edge devices—can significantly impact the user experience. **Edge computing** could provide faster, real-time feedback by processing data closer to the user, while cloud-based solutions offer greater scalability for larger deployments but could introduce latency due to data transfer times. By optimizing both model selection and execution environments, the system can strike a balance between performance and scalability, ensuring smooth operation even under high workloads.
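
Comparing models or execution environments requires measuring latency the same way for each candidate. The helper below is a minimal sketch with a hypothetical name, `measure_latency`; the `fn` argument would be whatever callable wraps the moderation pipeline, and a trivial function stands in for it here.

```python
import math
import statistics
import time


def measure_latency(fn, inputs, warmup=1):
    """Time fn on each input and report latency statistics in milliseconds."""
    if not inputs:
        raise ValueError("At least one input is required")

    # Warm-up calls are excluded so one-time setup costs do not skew results
    for item in inputs[:warmup]:
        fn(item)

    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    ordered = sorted(latencies_ms)
    p95_index = math.ceil(0.95 * len(ordered)) - 1  # nearest-rank percentile
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[p95_index],
        "max_ms": ordered[-1],
    }


# Stand-in workload; replace the lambda with a call into the real pipeline
stats = measure_latency(lambda text: text.upper(), ["Why do we have day and night?"] * 20)
print(stats)
```

Reporting the 95th percentile alongside the median matters because a system with a good typical response time can still feel slow if a small share of requests takes much longer.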