# AI Engineer Code Challenge
## Document Embedding and RAG Implementation

## Objective

Develop a simple question-answering system using a Retrieval-Augmented Generation (RAG) framework. The system should embed documents from a public dataset, retrieve relevant information based on a user query, and generate an appropriate response.

## Requirements

1. Use the "sciq" dataset from Hugging Face (https://huggingface.co/datasets/sciq)
2. Implement document embedding
3. Create a simple RAG framework for retrieval and reranking
4. Provide a basic interface for asking questions and receiving answers

## Tasks

### 1. Data Preparation (30 minutes)

- Load the "sciq" dataset from Hugging Face
- Select a subset of 1000 documents from the dataset
- Embed the `support` field of each record
    - We will use the remaining fields on our end to assess RAG quality
- Preprocess the documents as needed (e.g., cleaning, tokenization)
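
The data preparation steps above could be sketched roughly as follows. This is a minimal illustration, not a reference solution: the helper names `clean_passages` and `load_support_passages` are hypothetical, the `"sciq"` identifier and `train` split are assumptions taken from the dataset URL (the Hub may list the dataset under a namespaced identifier), and dropping empty or duplicate `support` texts is one assumed preprocessing choice among many.

```python
import re


def clean_passages(rows, limit=1000):
    """Return up to `limit` unique, non-empty, whitespace-normalized `support` texts.

    `rows` is any iterable of dict-like records with a `support` key.
    """
    seen = set()
    passages = []
    for row in rows:
        # Collapse runs of whitespace and trim the ends.
        text = re.sub(r"\s+", " ", row.get("support") or "").strip()
        if not text or text in seen:
            continue  # skip empty and duplicate passages
        seen.add(text)
        passages.append(text)
        if len(passages) == limit:
            break
    return passages


def load_support_passages(limit=1000, split="train"):
    """Load the dataset from Hugging Face and return cleaned `support` passages."""
    # Imported lazily so the cleaning helper works without the `datasets` package.
    from datasets import load_dataset

    dataset = load_dataset("sciq", split=split)  # identifier assumed from the URL
    return clean_passages(dataset, limit=limit)
```

Keeping the cleaning step separate from the download makes it easy to unit-test on a handful of in-memory records.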

### 2. Document Embedding (1 hour)

- Choose an appropriate embedding model (e.g., SentenceTransformers)
- Embed the selected documents
- Store the embeddings efficiently for quick retrieval
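
One possible shape for the embedding step is sketched below. It assumes the `sentence-transformers` package with the `all-MiniLM-L6-v2` model and NumPy's `.npy` format for storage; both are illustrative choices, and the function names are hypothetical. Vectors are normalized to unit length so that a plain dot product later equals cosine similarity.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def build_index(passages, encode):
    """Embed passages with `encode` (list[str] -> iterable of vectors) and normalize."""
    return [l2_normalize([float(x) for x in vec]) for vec in encode(passages)]


def sentence_transformer_encoder(model_name="all-MiniLM-L6-v2"):
    """Return an `encode` callable backed by SentenceTransformers (assumed model)."""
    from sentence_transformers import SentenceTransformer  # lazy import

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts))


def save_index(path, vectors):
    """Persist the index as a compact float32 array (NumPy appends `.npy` if missing)."""
    import numpy as np  # lazy import

    np.save(path, np.asarray(vectors, dtype="float32"))


# Example usage (downloads the model on first run):
#   index = build_index(passages, sentence_transformer_encoder())
#   save_index("support_index", index)
```

Passing the encoder in as a callable keeps the indexing logic independent of any particular model, so a different embedding backend can be swapped in without touching the rest of the pipeline.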

### 3. RAG Framework Implementation (1.5 hours)

- Implement a retrieval mechanism to find relevant documents based on a query
- Decide which retrieved documents count as relevant using a dynamic threshold (e.g., the Kneedle method, BIC, or the elbow method)
- Develop a simple reranking algorithm to improve the relevance of retrieved documents
- Integrate a language model (e.g., Llama 70B via HuggingChat) to generate responses from the retrieved information
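
The retrieval, thresholding, and reranking steps above might look something like the sketch below. Several assumptions apply: the index holds unit-length vectors, so a dot product equals cosine similarity; the dynamic threshold is a simple largest-gap heuristic, a stand-in for elbow-style methods and not the Kneedle algorithm itself; the reranker blends the embedding score with lexical overlap using a hypothetical weight `alpha`; and `build_prompt` only assembles text, leaving the actual model call to whichever client is chosen.

```python
import re


def retrieve(query_vec, index, passages, top_k=20):
    """Score every passage by dot product with the query and return the best first."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, doc_vec)), passage)
        for doc_vec, passage in zip(index, passages)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def largest_gap_cutoff(scored):
    """Keep the results ranked above the largest drop between consecutive scores."""
    if len(scored) < 2:
        return list(scored)
    gaps = [scored[i][0] - scored[i + 1][0] for i in range(len(scored) - 1)]
    cut = gaps.index(max(gaps)) + 1  # number of results to keep
    return scored[:cut]


def tokens(text):
    """Lowercased alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query, scored, alpha=0.7):
    """Reorder by alpha * embedding score + (1 - alpha) * query-token overlap."""
    query_tokens = tokens(query)

    def combined(pair):
        score, passage = pair
        overlap = (
            len(query_tokens & tokens(passage)) / len(query_tokens)
            if query_tokens
            else 0.0
        )
        return alpha * score + (1 - alpha) * overlap

    return sorted(scored, key=combined, reverse=True)


def build_prompt(question, passages):
    """Assemble a grounded prompt for the language model from numbered passages."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The largest-gap rule adapts the number of documents to each query, whereas a fixed top-k would return weak matches whenever only one or two passages are truly relevant.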

### 4. Query Interface (30 minutes)

- Create a simple command-line interface where users can enter questions
- Alternatively, build the interface with Gradio or Streamlit
- Display the top retrieved documents and the generated answer
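
A command-line loop for this could be as small as the sketch below. The `answer_question` callable is a hypothetical hook assumed to return the retrieved `(score, passage)` pairs together with the generated answer; the `read` and `write` parameters default to `input` and `print` and exist only to make the loop testable.

```python
def run_cli(answer_question, read=input, write=print):
    """Prompt for questions until a blank line or end of input.

    `answer_question(question)` must return (hits, answer), where `hits`
    is a list of (score, passage) pairs ordered best first.
    """
    while True:
        try:
            question = read("Question (blank to quit): ").strip()
        except EOFError:
            break
        if not question:
            break
        hits, answer = answer_question(question)
        # Show the retrieved documents first, then the generated answer.
        for rank, (score, passage) in enumerate(hits, start=1):
            write(f"{rank}. [{score:.3f}] {passage}")
        write(f"Answer: {answer}")


# Example usage, with `pipeline` standing in for your own function:
#   run_cli(pipeline)
```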

### 5. Evaluation and Documentation (30 minutes)

- Implement a basic evaluation metric (e.g., relevance score)
- Provide clear documentation on how to run the system and any dependencies
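
As one example of a basic metric, the sketch below computes hit rate and mean reciprocal rank at `k`. It assumes each question's own `support` passage is treated as the single relevant document, which is an evaluation convention chosen here for illustration and not something the challenge prescribes; the function name is hypothetical.

```python
def hit_rate_and_mrr(results, k=5):
    """Compute (hit rate@k, MRR@k).

    `results` is a list of (gold_passage, ranked_passages) pairs, where
    `ranked_passages` is the retriever's output ordered best first.
    """
    hits = 0
    reciprocal_rank_sum = 0.0
    for gold, ranked in results:
        top = ranked[:k]
        if gold in top:
            hits += 1
            reciprocal_rank_sum += 1.0 / (top.index(gold) + 1)
    n = len(results)
    if n == 0:
        return 0.0, 0.0
    return hits / n, reciprocal_rank_sum / n
```

Hit rate shows how often the gold passage is retrieved at all, while MRR also rewards placing it near the top, which makes the effect of the reranker visible.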

## Deliverables

1. Python, Rust, or Go code implementing the entire pipeline
2. Requirements file listing all necessary dependencies
3. Brief documentation explaining the approach, challenges faced, and potential improvements

## Bonus (if time permits)

- Implement a simple caching mechanism to speed up repeated queries
- Add a feature to explain the reasoning behind the generated answer
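
For the caching bonus, a small in-memory wrapper such as the one sketched below may be enough. The name `make_cached`, the `max_entries` limit, the case- and whitespace-insensitive cache key, and the first-in-first-out eviction are all assumptions made for illustration.

```python
def make_cached(answer_question, max_entries=256):
    """Wrap `answer_question` so repeated queries skip retrieval and generation."""
    cache = {}

    def wrapper(question):
        # Normalize case and whitespace so trivially different queries share a key.
        key = " ".join(question.lower().split())
        if key not in cache:
            if len(cache) >= max_entries:
                # Dicts preserve insertion order, so this evicts the oldest entry.
                cache.pop(next(iter(cache)))
            cache[key] = answer_question(question)
        return cache[key]

    wrapper.cache = cache  # exposed for inspection and manual clearing
    return wrapper
```

Cached results are returned as the same object on every hit, so callers should treat them as read-only.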

## Evaluation Criteria

1. Code quality and organization
2. Effectiveness of the RAG implementation
3. Appropriate use of embedding techniques
4. Clarity of documentation
5. Overall system performance and response quality