## Day 1

1. **Easy** — What is the difference between *training* and *inference* in an AI system? Give one real-world example.

2. **Easy–Medium** — What problem does an *embedding* solve in AI systems, and where are embeddings typically used in production?

3. **Medium** — Explain how a basic **RAG (Retrieval-Augmented Generation)** pipeline works end to end. Where can it fail?

4. **Medium–Hard** — You are serving an LLM-based API. Latency suddenly spikes under load. What are the *first 3 things* you would inspect, and why?

5. **Hard** — You need to design a **scalable, cost-efficient AI agent system** that:

   * Uses tools (APIs, DBs)
   * Maintains short-term and long-term memory
   * Avoids hallucinations

   Describe the **architecture**, key components, and trade-offs.


1. Easy
- **Training** an AI system involves cleaning the data, selecting a model, and fitting the model to the data so that it learns the patterns in that data.
- **Inference** means getting output from a model that has already been trained.

**Correct Answer**<br>
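- Training is the phase where model parameters are optimized using labeled or unlabeled data via a loss function and backpropagation.
- Inference is the phase where the trained, frozen model generates predictions on new data with no learning involved.
- Example: a spam filter is trained on a corpus of labeled emails; at inference time it classifies each incoming email without updating its weights.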
---
2. Easy - Medium
- Embeddings are numerical representations of data that capture its semantic meaning.
- In production, embeddings help whenever a system needs to search for information related to a query.

**Correct Answer**<br>
- Embeddings map unstructured data into dense vectors such that semantic similarity becomes measurable using distance metrics.
- In production, they’re used for semantic search, RAG, recommendations, deduplication, and clustering via vector databases.
---
3. Medium
- RAG pipeline in short:
- Documents -> document loader -> chunks -> embeddings -> stored in a vector database
- User query -> converted into an embedding -> a bi-encoder compares the query and document embeddings to retrieve the relevant documents
- Retrieved documents + user query + prompt => sent to the LLM => the LLM generates a response based on the retrieved information and the prompt.
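
The flow above can be sketched in a few lines. This is only an illustration under simplifying assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, and `build_prompt` stops short of calling an LLM. All function names are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document: str, size: int = 8) -> list[str]:
    """Split a document into fixed-size word chunks."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents: list[str]) -> list[tuple[str, Counter]]:
    """Indexing: documents -> chunks -> embeddings -> store."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """Query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(query: str, contexts: list[str]) -> str:
    """Combine retrieved chunks, the user query, and instructions for the LLM."""
    context_block = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"

docs = [
    "Refunds are processed within five business days of approval.",
    "Support is available by email on weekdays.",
]
index = build_index(docs)
top = retrieve(index, "how long do refunds take", k=1)
prompt = build_prompt("how long do refunds take", top)
```

Note that the fixed 8-word chunking cuts the first document mid-sentence, which is a small instance of the "poor chunking" failure mode.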

**Correct Answer**<br>
##### **What’s missing**
* Failure modes (explicitly asked).
* No mention of chunk size, retrieval quality, or context window limits.
##### **Key failures interviewers expect**
* Poor chunking → lost context
* Bad embeddings → irrelevant retrieval
* Retrieval misses critical docs
* Context overflow → truncation
* LLM hallucination despite correct retrieval
---
4. Medium - Hard
- Check whether the LLM API calls are looping.
- Check the logs.

**Correct Answer**<br>
##### Expected top 3 checks
* LLM provider latency / rate limiting (timeouts, retries, cold starts)
* Embedding + vector DB latency (ANN index type, cache misses)
* Concurrency & scaling (async vs sync, worker saturation, request queuing)
> Red flag: Not thinking in terms of systems, infrastructure, and bottlenecks.
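
One way to find out which of these stages dominates is to time each one separately. The sketch below assumes a two-stage request (retrieval, then the LLM call); the stage names and the `time.sleep` placeholders are hypothetical stand-ins for the real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_timings: dict[str, list[float]] = defaultdict(list)

@contextmanager
def timed(stage: str):
    """Record wall-clock seconds spent inside the block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append(time.perf_counter() - start)

def handle_request(query: str) -> str:
    with timed("retrieval"):
        time.sleep(0.002)   # placeholder for embedding + vector DB lookup
    with timed("llm_call"):
        time.sleep(0.02)    # placeholder for the LLM provider call
    return "response"

def slowest_stage() -> str:
    """Return the stage with the highest average latency."""
    return max(stage_timings, key=lambda s: sum(stage_timings[s]) / len(stage_timings[s]))

for _ in range(3):
    handle_request("hello")
```
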
---
5. Hard
pass
### Interview-ready answer (short, crisp, no fluff)

> **Question:** Design a scalable, cost-efficient AI agent system with tools, memory, and low hallucination.

**Correct Answer:**

I’d design it as a **modular agent architecture**:

1. **Planner (LLM-based)**
   Interprets user intent and breaks it into steps. This reduces random tool calls and keeps execution structured.

2. **Tool Executor Layer**
   Controlled interface for APIs, databases, and services. Tools are schema-validated and permissioned to avoid unsafe or unnecessary calls.

3. **Memory System (2 layers)**

   * **Short-term memory:** Conversation context stored in-session (sliding window or summary).
   * **Long-term memory:** Vector DB + metadata for facts, preferences, and past outcomes, retrieved via embeddings.

4. **Retrieval + Grounding (Anti-hallucination)**
   Use RAG for any factual queries. The agent is forced to answer *only* from retrieved context, with fallback to “I don’t know” if confidence is low.

5. **Guardrails & Validation**

   * Output validation (JSON schema, regex)
   * Tool call verification
   * Confidence thresholds on retrieval

6. **Scalability & Cost Control**

   * Async execution
   * Caching embeddings and tool results
   * Smaller models for planning, larger ones only for final response

**Trade-offs:**
More components increase complexity, but you gain reliability, lower hallucinations, and predictable costs—mandatory for production agents.
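
As an illustration of the Tool Executor Layer, here is a minimal sketch of a schema-validated, permissioned tool call. The class names, the `get_order_status` tool, and the role string are all hypothetical; a real system would likely use a full schema library rather than plain type checks.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Tool:
    name: str
    func: Callable[..., Any]
    schema: dict[str, type]        # argument name -> expected type
    allowed_roles: frozenset[str]  # roles permitted to call this tool

class ToolExecutor:
    """Single controlled entry point for every tool call the planner requests."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def execute(self, name: str, args: dict[str, Any], role: str) -> Any:
        tool = self._tools.get(name)
        if tool is None:
            raise KeyError(f"unknown tool: {name}")
        if role not in tool.allowed_roles:
            raise PermissionError(f"role {role!r} may not call {name}")
        if set(args) != set(tool.schema):
            raise ValueError(f"expected arguments {sorted(tool.schema)}, got {sorted(args)}")
        for key, expected in tool.schema.items():
            if not isinstance(args[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return tool.func(**args)

executor = ToolExecutor()
executor.register(Tool(
    name="get_order_status",
    func=lambda order_id: {"order_id": order_id, "status": "shipped"},  # stub backend
    schema={"order_id": int},
    allowed_roles=frozenset({"support_agent"}),
))
result = executor.execute("get_order_status", {"order_id": 42}, role="support_agent")
```

Rejecting a malformed call before it reaches the backend is what keeps a bad planner output from turning into an unsafe or wasted API request.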

## Day 5

### 1. Easy

**What is the difference between a *prompt* and a *system prompt* in an LLM-based application?**
Why does this distinction matter in production?

- The prompt is the instruction or query given by the user, and it can differ for every query. The system prompt defines the character/persona of the LLM and how it should behave or think. It stays almost the same throughout the conversation.
- The distinction helps the model respond as a specific persona. The system prompt is usually written by the developer and steers the model to think in a particular way.


### 2. Easy–Medium

**Why do most production LLM systems avoid sending the entire conversation history to the model every time?**
What strategies are used instead?

- An LLM has a limited context length that it can process. When the entire conversation history is sent to the model every time, the context keeps growing, which consumes more tokens and degrades the model's reasoning, leading to more generic responses or hallucinations.
- To overcome this, you can send a summary of the conversation history that captures the intent of the conversation, store the conversation history in a vector database and pull out relevant messages using RAG, or limit the history to the most recent messages (for example, only the last 10).
- Among these, I feel summarizing the chat history is the best option because it explicitly states the intent of the conversation.
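
Two of these strategies, the sliding window and the running summary, can be combined. The sketch below is an assumption-heavy illustration: `ConversationMemory` is a hypothetical class, and `_summarize` just truncates and concatenates text where a real system would ask an LLM to compress the evicted messages.

```python
from collections import deque

class ConversationMemory:
    """Keep the last `window` messages verbatim and fold older ones into a summary."""

    def __init__(self, window: int = 10) -> None:
        self.recent: deque = deque()
        self.window = window
        self.summary = ""

    def _summarize(self, summary: str, role: str, text: str) -> str:
        # Placeholder: a real system would ask an LLM to compress this.
        return (summary + f" {role}: {text[:40]}").strip()

    def add(self, role: str, text: str) -> None:
        self.recent.append((role, text))
        while len(self.recent) > self.window:
            old_role, old_text = self.recent.popleft()
            self.summary = self._summarize(self.summary, old_role, old_text)

    def build_context(self) -> str:
        """Context sent to the model: summary first, then recent messages."""
        lines = [f"Summary so far: {self.summary}"] if self.summary else []
        lines += [f"{role}: {text}" for role, text in self.recent]
        return "\n".join(lines)

memory = ConversationMemory(window=2)
for i in range(4):
    memory.add("user", f"message {i}")
context = memory.build_context()
```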

### 3. Medium

**You are building a customer-support AI assistant.**
How would you prevent it from:

* hallucinating answers
* giving outdated information

Explain your approach at a system level.

- First, identify the cause of the hallucinations, such as an underspecified prompt or a lack of information.
- If the cause is a lack of information, I would implement RAG to retrieve information relevant to the query, send it to the LLM together with the query, and explicitly instruct the model to answer based on that context.
- I would upload data such as FAQs, policies, and resolved queries to the vector database.
- This method not only reduces hallucinations but also keeps answers current, as long as the knowledge base itself is kept up to date.

---

I’d ground the model using **RAG** with a trusted, up-to-date knowledge base.✅

At a system level:

* Retrieve relevant documents at inference time
* Instruct the model to **answer only from retrieved context**
* Add a fallback: *if information isn’t found, say “I don’t know”*

To prevent outdated info:

* Keep the knowledge base **continuously updated**
* Version documents and monitor retrieval quality

This reduces hallucinations and ensures responses stay accurate and current.
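
The "answer only from retrieved context, otherwise say I don't know" rule can be sketched as a thin wrapper around the model call. Everything here is illustrative: `grounded_answer`, `fake_llm`, and the `min_score` threshold of 0.75 are assumptions, and the threshold would need tuning against real retrieval scores.

```python
from typing import Callable

FALLBACK = "I don't know."

def grounded_answer(
    query: str,
    retrieved: list[tuple[str, float]],   # (chunk text, similarity score)
    generate: Callable[[str], str],       # hypothetical LLM call
    min_score: float = 0.75,              # assumed threshold, tune on real data
) -> str:
    """Answer only from sufficiently relevant context, otherwise fall back."""
    relevant = [text for text, score in retrieved if score >= min_score]
    if not relevant:
        return FALLBACK
    context = "\n".join(relevant)
    prompt = (
        "Answer ONLY from the context below. "
        f"If the answer is not there, reply exactly: {FALLBACK}\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)

def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return "Refunds take five business days."

hit = grounded_answer("refund time?", [("Refunds take five business days.", 0.91)], fake_llm)
miss = grounded_answer("warranty length?", [("Refunds take five business days.", 0.20)], fake_llm)
```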

---


### 4. Medium–Hard

**An LLM agent is allowed to call tools (APIs, DBs).**
How do you ensure:

* tool calls are correct
* the agent doesn’t loop or misuse tools

What controls would you put in place?

**Answer**:
- To evaluate whether the tool calls are correct, I would use frameworks such as LangSmith or DeepEval. I would prepare a dataset of edge cases that should trigger tool calls, and give the LLM a name and description for each tool.
- I would evaluate the tool calls against that dataset.
- To avoid agent loops, I would implement guardrails such as rate limiting, which breaks the loop after too many calls, and add a layer that validates tool calls. If a tool executes actions, I would add a human-in-the-loop mechanism.

With these methods, the LLM agent will be robust, make proper tool calls, and not break.

---

To ensure **correct tool calls**:✅

* Define strict **tool schemas and descriptions**
* Evaluate tool usage offline with test cases and edge cases
* Log and monitor tool-call accuracy in production

To prevent **loops or misuse** at runtime:

* Enforce **max tool calls per request**
* Add **timeouts and rate limits**
* Validate tool inputs/outputs before execution
* Define explicit **stop conditions**
* Use **human-in-the-loop** for high-risk actions

This ensures tools are used **correctly, safely, and predictably** in production.
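
The "max tool calls per request" and "explicit stop condition" controls can be sketched as a bounded agent loop. The planner interface here (a function returning `("tool", payload)` or `("final", answer)`) is a hypothetical simplification, not any particular framework's API.

```python
from typing import Callable

class ToolBudgetExceeded(RuntimeError):
    """Raised when the agent asks for more tool calls than the request allows."""

def run_agent(
    next_action: Callable[[list[str]], tuple[str, str]],  # history -> (kind, payload)
    call_tool: Callable[[str], str],
    max_tool_calls: int = 5,
) -> str:
    """Loop until the planner emits a final answer or the call budget runs out."""
    history: list[str] = []
    calls = 0
    while True:
        kind, payload = next_action(history)
        if kind == "final":                 # explicit stop condition
            return payload
        if calls >= max_tool_calls:         # hard cap per request
            raise ToolBudgetExceeded(f"no final answer after {max_tool_calls} tool calls")
        history.append(call_tool(payload))
        calls += 1

def planner_ok(history: list[str]) -> tuple[str, str]:
    # Calls one tool, then finishes.
    if history:
        return ("final", f"done after {len(history)} call(s)")
    return ("tool", "lookup")

def planner_stuck(history: list[str]) -> tuple[str, str]:
    # Never finishes: always asks for the same tool.
    return ("tool", "lookup")

answer = run_agent(planner_ok, call_tool=lambda arg: f"result for {arg}")
```

With `planner_stuck`, the same loop raises `ToolBudgetExceeded` instead of running forever, which the caller can turn into a fallback response.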

---


### 5. Hard

**Design a multi-tenant AI platform** where:

* thousands of users share the same LLM backend
* data privacy between users is guaranteed
* costs stay predictable

Explain the **architecture, isolation strategy, and cost controls**.


#### 5. Hard: Multi-Tenant AI Platform (Interview Answer)

I’d use a **shared LLM backend** with a **tenant-aware application layer**.

Each request carries a **tenant ID**, which is used to:

* Route data to **tenant-isolated vector indexes / storage**
* Enforce **access control and encryption** per tenant

For **isolation**, I’d ensure:

* Logical isolation at the data layer (separate namespaces / indexes)
* Strict auth checks so prompts, embeddings, and logs never mix across tenants

For **cost control**:

* Per-tenant **rate limits and token quotas**
* Request batching and response caching
* Tiered plans (limits on context size, tools, and calls)

This keeps inference shared and efficient, while **data privacy and costs remain predictable**.
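
A minimal sketch of the two ideas above, namespace-level data isolation and per-tenant token quotas. `TenantStore` and `TokenQuota` are hypothetical in-memory stand-ins; a real deployment would back them with a vector database's namespaces and a shared counter store.

```python
from collections import defaultdict

class QuotaExceeded(RuntimeError):
    """Raised when a tenant would exceed its token budget."""

class TenantStore:
    """Logical isolation: every read and write is scoped by tenant ID."""

    def __init__(self) -> None:
        self._namespaces: dict[str, list[str]] = defaultdict(list)

    def add(self, tenant_id: str, document: str) -> None:
        self._namespaces[tenant_id].append(document)

    def search(self, tenant_id: str, term: str) -> list[str]:
        # Only the caller's namespace is ever scanned.
        return [d for d in self._namespaces.get(tenant_id, []) if term.lower() in d.lower()]

class TokenQuota:
    """Per-tenant token budget for a billing period."""

    def __init__(self, limits: dict[str, int]) -> None:
        self._limits = limits
        self._used: dict[str, int] = defaultdict(int)

    def charge(self, tenant_id: str, tokens: int) -> None:
        # Tenants without a configured limit get a budget of zero.
        if self._used[tenant_id] + tokens > self._limits.get(tenant_id, 0):
            raise QuotaExceeded(tenant_id)
        self._used[tenant_id] += tokens

store = TenantStore()
store.add("acme", "Acme pricing sheet")
store.add("globex", "Globex pricing sheet")
acme_hits = store.search("acme", "pricing")

quota = TokenQuota({"acme": 1000})
quota.charge("acme", 600)
```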

## Day 8



### 1. Easy (Real usage)

You deploy an LLM-based chatbot. Users complain that answers are *sometimes correct, sometimes completely wrong*.
**What are the first two things you would check, and why?**

- I would check whether the chatbot is retrieving information and basing its answers on it, because an LLM's hallucination rate is high when key information for the query is missing.
- If retrieval is not the problem, I would check whether the prompt and the query are aligned. If the query and the prompt have different goals, the LLM gives inaccurate responses to fill the gap.

---

### 2. Easy–Medium (Production reality)

Your RAG system works well in testing, but in production users say:

> “It answers fast, but often misses obvious information from the documents.”

**What are the likely causes, and how would you debug this step by step?**

- The most likely cause of missing obvious information in a RAG system is that the LLM is not using the retrieved information correctly. Semantic search returns approximate results, which may or may not be helpful. The system is fast because there are few steps between retrieval and generation.
- I would check how documents are stored in the vector database, the chunking strategy, and the retrieval logic. I would try reranking with a cross-encoder for more accurate results, though it adds some latency, so it fits best when latency is not the priority.
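
A useful first debugging step is to measure retrieval on its own, separately from generation. The sketch below computes recall@k over a small labeled set; the chunk IDs and the `eval_set` entries are made-up placeholders, not real data.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

# Hypothetical labeled examples: which chunk IDs should answer each query.
eval_set = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": {"c2"}},
    {"retrieved": ["c4", "c1", "c8"], "relevant": {"c5"}},
]
scores = [recall_at_k(e["retrieved"], e["relevant"], k=3) for e in eval_set]
mean_recall = sum(scores) / len(scores)
```

Low recall points at chunking, embeddings, or retrieval settings. High recall with wrong answers points at the prompt or at how the LLM uses the context.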

---

### 3. Medium (System behavior)

An AI agent that uses tools starts repeating the same tool calls and never finishes the task.

**What could cause this behavior, and what concrete safeguards would you add to stop it?**

- The agent calls a tool to get information, and if nothing comes back, it calls the tool again. This happens when the parameters are wrong, or when tool details such as the name or description are wrong or unclear, so the model doesn't understand the tool.
- To avoid this behaviour, I would add more detail to the tool descriptions so that the LLM understands them clearly. I would also implement guardrails such as rate limiting: if the agent calls the same tool again, it stops, breaks the loop, and answers with something like "I don't know."
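
The "stop if it calls the same tool again" guardrail can be sketched as a small counter keyed on the tool name and its arguments. `RepeatedCallGuard` and the `search_orders` tool are hypothetical names used only for illustration.

```python
import json
from collections import Counter

class RepeatedCallGuard:
    """Stop the agent when it issues the same tool call too many times."""

    def __init__(self, max_repeats: int = 2) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def allow(self, tool_name: str, args: dict) -> bool:
        # Canonical key so argument order does not matter.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self._seen[key] += 1
        return self._seen[key] <= self.max_repeats

guard = RepeatedCallGuard(max_repeats=2)
decisions = [guard.allow("search_orders", {"user": "u1", "page": 1}) for _ in range(3)]
```

When `allow` returns `False`, the agent loop would stop calling tools and return a fallback such as "I don't know."
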
---

### 4. Medium–Hard (Scaling & cost)

Your AI application suddenly gets **10× traffic** after a product launch.
Latency and cloud costs both spike.

**What immediate actions do you take, and what long-term changes do you make to the system?**

- To reduce the load on the system and handle latency, I would put load balancers in place so that load is spread across servers.
- I would also scale horizontally.
- To control costs, I would cache responses so that identical or very common questions are served from the cache. I would use hourly-billed cloud instances so that my per-hour cost stays the same even under heavy traffic, although that is costlier than serverless.
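
The caching idea can be sketched as a small TTL cache in front of the model call. This is an exact-match cache on a normalized question string; `ResponseCache`, the 300-second TTL, and the stubbed model call are all assumptions for illustration.

```python
import time
from typing import Callable, Optional

class ResponseCache:
    """Cache answers to repeated questions so they skip the LLM call."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Normalize case and whitespace so trivial variants share an entry.
        return " ".join(question.lower().split())

    def get(self, question: str) -> Optional[str]:
        entry = self._entries.get(self._key(question))
        if entry is None:
            return None
        stored_at, answer = entry
        if self.clock() - stored_at > self.ttl:
            return None                      # entry expired
        return answer

    def put(self, question: str, answer: str) -> None:
        self._entries[self._key(question)] = (self.clock(), answer)

calls = {"llm": 0}

def answer(question: str, cache: ResponseCache) -> str:
    cached = cache.get(question)
    if cached is not None:
        return cached
    calls["llm"] += 1                        # placeholder for the expensive model call
    result = f"answer to: {question.strip()}"
    cache.put(question, result)
    return result

cache = ResponseCache()
first = answer("What is your refund policy?", cache)
second = answer("  what is your REFUND policy? ", cache)
```
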
---

### 5. Hard (Failure under pressure)

A critical enterprise customer reports that:

* the AI gave a **confident but wrong answer**
* it used **their private data in another user’s response**

This is a P0 incident.

**Walk me through exactly what you would do in the first hour, and what system changes you’d implement to ensure this never happens again.**

- The AI always gives confident answers, even when the answer is factually wrong, because these are predictive models. To reduce wrong answers, provide relevant information and instruct the model in the prompt to use that information to answer the question.
- Since private data is leaking, to prevent that:
    - I would avoid returning the plain LLM response. Instead, I would implement nodes/layers that validate the response so that no sensitive data goes out in it.
    - I would implement a role-based check where the session info, which holds the user's information, must match the user information in the response.
- With these techniques, the AI system will be more reliable and accurate.
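
The session-matching check can be sketched as a fail-closed filter that runs before any retrieved data reaches the prompt. `Chunk`, `owner_id`, and `enforce_ownership` are hypothetical names; the assumption is that every stored chunk carries the ID of the user or tenant that owns it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    owner_id: str   # tenant/user that owns this data

class CrossTenantLeak(RuntimeError):
    """Raised when data owned by someone else is about to enter a response."""

def enforce_ownership(chunks: list[Chunk], session_user_id: str) -> list[Chunk]:
    """Fail closed if any chunk headed for the prompt belongs to someone else."""
    foreign = [c for c in chunks if c.owner_id != session_user_id]
    if foreign:
        raise CrossTenantLeak(f"{len(foreign)} chunk(s) not owned by {session_user_id}")
    return chunks

own = [Chunk("Invoice 17 is paid.", owner_id="user_a")]
safe = enforce_ownership(own, session_user_id="user_a")
```

Raising instead of silently dropping the foreign chunks makes the leak visible in logs and alerts, which matters for a P0 investigation.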


## Day 15

Below is a **new set of 5 interview questions on fine-tuning**, **easy → hard**, written from an **applied AI engineering** perspective (production, trade-offs, failures).

---

#### 1. Easy

**What problem does fine-tuning solve that prompting or RAG cannot?**
Give one concrete example.

> Answer:
- Fine-tuning is a method of post-training an LLM so that it aligns with the user: it modifies the model's behaviour and tone to be user-centric.
- Prompting is about giving an LLM instructions on how to act, how to respond to the query, and how to format the output.
- RAG provides information to the LLM, which it uses to answer the query.

For example: in healthcare, fine-tune a model so that it behaves and talks like a doctor, whereas RAG provides extra information when needed to answer the query. The prompt gives the LLM instructions about its capabilities.

---

#### 2. Easy–Medium

You fine-tune an LLM and observe that:

* training loss goes down
* real user responses get worse

**Why can this happen?**

> Answer
---

#### 3. Medium

You are deciding between **fine-tuning vs RAG vs prompt engineering** for a new product.

**How do you decide which approach to use?**
What signals push you toward fine-tuning?

> Answer

Identify the problem we are trying to solve: is it behaviour-based, event-based, or action-based?
- First, I would experiment with carefully crafted prompts to optimize the response, until prompting is maxed out.
- If the model is hallucinating or its responses are factually incorrect, I would add information to the model using RAG.
- If the model gets the facts right but still cannot perform or fulfil the task, I would go with fine-tuning.
- If needed, I would combine RAG with the fine-tuned model and a prompt so that the model performs at its best to complete the tasks.

---

#### 4. Medium–Hard

A fine-tuned model performs well in staging but fails in production with:

* hallucinations
* brittle behavior to small prompt changes

**What went wrong, and how do you fix it?**

---

#### 5. Hard

You must fine-tune an LLM for a **regulated domain** (finance / healthcare / legal) where:

* training data is limited
* mistakes are costly
* explanations are required

**Design the fine-tuning strategy, safety controls, and evaluation framework.**

---

