## Day 1

### Questions

1. **Easy** — What is the difference between **model training** and **model serving** in an ML system? Why are they usually separated?

2. **Easy–Medium** — What is **data leakage** in machine learning? Give one concrete example from a real ML pipeline.

3. **Medium** — Your model performs well offline but poorly in production. What are the **top reasons** this happens, and how would you detect them?

4. **Medium–Hard** — Explain how you would design an **end-to-end ML pipeline** from data ingestion to deployment and monitoring.

5. **Hard** — A deployed model’s performance slowly degrades over time, but there are no obvious bugs.
   How would you **detect, diagnose, and fix model drift** in production?

### Answers
1. Easy
- Model training covers data engineering, data cleaning, feature engineering, hyperparameter tuning, and model selection: everything needed to fit the model.
- Model serving means deploying the trained model for use at production scale, which involves turning each training-time step into Python functions or scripts.

**Correct Answer**<br>
- Training is offline, batch-oriented, and compute-heavy.
- Serving is online, latency-sensitive, stateless, and optimized for scale and reliability.
- They are separated because they have different performance, infra, and failure requirements.

**Perfect answer**

Model training is an offline process where we clean data, engineer features, optimize model parameters using a loss function, and validate performance. It is batch-oriented and compute-heavy.

Model serving is the online process of exposing a trained, versioned model through an API for real-time or batch predictions. It is latency-sensitive, scalable, and reliability-focused.
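A minimal sketch of the two roles, assuming a toy one-feature linear model and a JSON artifact as the hand-off between them. `MODEL_VERSION`, `train`, and `ModelServer` are hypothetical names used only for illustration; a real system would typically use a model registry and a web framework instead.

```python
import json

MODEL_VERSION = "v1"  # hypothetical version tag


def train(xs, ys):
    """Offline step: fit y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    # The serialized artifact is the only thing handed over to serving.
    return json.dumps(
        {"version": MODEL_VERSION, "slope": slope, "intercept": intercept}
    )


class ModelServer:
    """Online step: load a frozen artifact once, then answer requests."""

    def __init__(self, artifact):
        params = json.loads(artifact)
        self.version = params["version"]
        self._slope = params["slope"]
        self._intercept = params["intercept"]

    def predict(self, x):
        # No training logic here: serving only applies stored parameters.
        return {
            "version": self.version,
            "prediction": self._slope * x + self._intercept,
        }


artifact = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
server = ModelServer(artifact)
response = server.predict(10.0)
```

Because the server depends only on the artifact, a rollback amounts to loading an older artifact, without touching the training code.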

They are separated because training prioritizes accuracy and experimentation, while serving prioritizes low latency, high availability, and safe rollbacks.

---

2. Data leakage occurs when information that will not be available at prediction time, such as future data or the evaluation data itself, leaks into training. For example, normalizing the training and test data with a mean computed over both sets causes leakage.

**Correct Answer**<br>
Minor improvement: Mention label leakage or time-based leakage to sound senior.

**Answer:**

Data leakage occurs when information that would not be available at prediction time is inadvertently used during training, leading to overly optimistic evaluation results.

A common example is normalizing both train and test data using global statistics instead of fitting the scaler only on the training set. Other examples include time-based leakage or features derived from future labels.
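The scaler example can be sketched with the standard library alone. The data values and the helper names `fit_scaler` and `transform` are assumptions made for illustration.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardization statistics from the given values only."""
    return mean(values), pstdev(values)


def transform(values, scaler):
    """Apply previously learned statistics without refitting."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


train = [10.0, 12.0, 11.0, 13.0]
test = [30.0, 32.0]  # held-out data with a shifted distribution

# Leaky: statistics are computed over train + test, so the test set
# influences the features the model is trained on.
leaky_scaler = fit_scaler(train + test)

# Correct: fit on the training split only, then reuse it for the test split.
clean_scaler = fit_scaler(train)
train_scaled = transform(train, clean_scaler)
test_scaled = transform(test, clean_scaler)
```

With the clean scaler, the shifted test values show up as large z-scores, which is the honest picture of how unusual they are relative to the training data.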

Leakage causes models to fail silently in production.

---

3. The model is trained on a fixed set of features, but in production other factors or new feature values can appear that make its predictions wrong. To investigate, check the feature correlations and work with the data engineering team to identify data-related issues.

**Answer:**

The most common reasons are:

* Train–serve skew: Feature computation differs between training and production.
* Data drift: Input distributions change over time.
* Label mismatch or delay: Production labels differ from training labels.
* Overfitting to offline data.
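Train-serve skew, the first reason in the list, can be checked by running both feature code paths on the same raw records and comparing the results. The sketch below assumes two hypothetical feature functions, `training_feature` and `serving_feature`, where the serving one carries a deliberate bug (base-10 instead of natural logarithm).

```python
import math


def training_feature(record):
    """Hypothetical batch feature: natural-log-scaled purchase amount."""
    return math.log1p(record["amount"])


def serving_feature(record):
    """Hypothetical online feature with a deliberate bug: base-10 log."""
    return math.log10(1 + record["amount"])


def find_skew(records, train_fn, serve_fn, tolerance=1e-6):
    """Return records whose feature value differs between the two paths."""
    mismatches = []
    for record in records:
        offline, online = train_fn(record), serve_fn(record)
        if abs(offline - online) > tolerance:
            mismatches.append((record, offline, online))
    return mismatches


sample = [{"amount": 0.0}, {"amount": 9.0}, {"amount": 99.0}]
skewed = find_skew(sample, training_feature, serving_feature)
consistent = find_skew(sample, training_feature, training_feature)
```

Note that the record with `amount` 0 matches in both paths, which is why a parity check needs a varied sample rather than a single spot check.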

I would detect this by monitoring feature distributions, prediction confidence, and performance metrics over time, and by validating that the same feature pipelines are used in both training and serving.

---
4. To design a pipeline, you first need to understand the data flow:
- Data ingestion: check for anomalies, missing data, and outliers that would affect the model.
- Feature engineering: derive new features from existing ones to improve model performance.
- Model training: fit the model on the training data and save it, for example as a `.pkl` file.
- Model evaluation: confirm that the model meets the target performance metrics.
- Model deployment: deploy the model with FastAPI, Docker, and a CI/CD pipeline.
- Model monitoring: monitor drift and inference behavior, for example with MLflow.

**Answer:**

I’d design the pipeline as:

1. Data ingestion & validation – schema checks, missing values, anomaly detection
2. Feature engineering – reproducible transformations, ideally via a feature store
3. Model training – experiment tracking and hyperparameter tuning
4. Model evaluation – offline metrics and bias checks
5. Model registry – versioned models with metadata
6. Deployment – containerized service using FastAPI and Docker with CI/CD
7. Monitoring – track data drift, prediction quality, latency, and errors

This ensures reproducibility, scalability, and safe iteration.
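As a rough illustration, the first five stages can be wired together as plain functions. Everything here is a simplifying assumption: the toy threshold model, the in-memory `ModelRegistry`, and the `min_accuracy` quality gate are hypothetical stand-ins, and deployment and monitoring are left out.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """In-memory stand-in for a model registry (hypothetical)."""
    versions: dict = field(default_factory=dict)

    def register(self, model, metrics):
        version = f"v{len(self.versions) + 1}"
        self.versions[version] = {"model": model, "metrics": metrics}
        return version


def validate(rows):
    """Stage 1: drop rows that fail a simple schema check."""
    return [
        r for r in rows
        if isinstance(r.get("x"), (int, float)) and r.get("y") in (0, 1)
    ]


def add_features(rows):
    """Stage 2: deterministic transformation shared by all later stages."""
    return [{**r, "x_sq": r["x"] ** 2} for r in rows]


def train(rows):
    """Stage 3: toy model, a threshold halfway between the class means."""
    pos = [r["x_sq"] for r in rows if r["y"] == 1]
    neg = [r["x_sq"] for r in rows if r["y"] == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}


def predict(model, row):
    return int(row["x_sq"] > model["threshold"])


def evaluate(model, rows):
    """Stage 4: offline accuracy on held-out rows."""
    correct = sum(predict(model, r) == r["y"] for r in rows)
    return {"accuracy": correct / len(rows)}


def run_pipeline(raw_train, raw_test, registry, min_accuracy=0.8):
    train_rows = add_features(validate(raw_train))
    test_rows = add_features(validate(raw_test))
    model = train(train_rows)
    metrics = evaluate(model, test_rows)
    # Stage 5: only models that pass the quality gate reach the registry.
    if metrics["accuracy"] < min_accuracy:
        return None, metrics
    return registry.register(model, metrics), metrics


raw_train = [
    {"x": 1, "y": 0}, {"x": 2, "y": 0},
    {"x": 5, "y": 1}, {"x": 6, "y": 1},
    {"x": None, "y": 1},  # rejected by validation
]
raw_test = [{"x": 1.5, "y": 0}, {"x": 5.5, "y": 1}]
registry = ModelRegistry()
version, metrics = run_pipeline(raw_train, raw_test, registry)
```

Keeping each stage a pure function of its inputs is one way to get the reproducibility the answer calls for: the same raw data always yields the same registered model.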

5. When a model's performance degrades with no obvious bug, the most likely cause is drift.
- There are two types of drift: data drift and concept drift.
- To detect data drift, compare feature statistics such as mean and variance against the baseline (the training data), using statistical tests such as the KS test.
- To detect concept drift, track prediction quality against ground-truth labels as they arrive, since the input distributions can stay stable while the relationship between inputs and the target changes.

Identify the cause of the drift; it may be due to external events or changes in the inputs.
- To fix it, retrain the model on the latest data.
- Or add features that capture the cause of the drift.

**Answer:**

I’d approach drift in three steps:

**Detect**
* Monitor data drift using statistical tests like KS test on feature distributions.
* Monitor prediction drift via confidence scores and output distributions.
* Track performance metrics when labels are available.

**Diagnose**
* Identify which features drifted and correlate them with performance drops.
* Check for external changes like seasonality or business logic updates.

**Fix**
* Retrain the model with recent data.
* Add missing or more stable features.
* Use scheduled or drift-triggered retraining.

This balances model freshness with operational stability.
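The detection step can be sketched with a hand-rolled two-sample KS statistic. This is an illustrative implementation, not a library API: the function names are hypothetical, and the threshold uses the large-sample critical value with the commonly tabulated coefficient 1.358 for a significance level of 0.05.

```python
import math
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap is always attained at one of the observed values.
    for value in ref + cur:
        cdf_ref = bisect_right(ref, value) / len(ref)
        cdf_cur = bisect_right(cur, value) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drifted(reference, current, coefficient=1.358):
    """Flag drift when the statistic exceeds the approximate critical value.

    coefficient=1.358 corresponds to a significance level of about 0.05.
    """
    n, m = len(reference), len(current)
    critical = coefficient * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


training_feature = [float(i) for i in range(100)]  # baseline sample
stable_batch = [i + 0.5 for i in range(100)]       # nearly identical range
shifted_batch = [i + 40.0 for i in range(100)]     # distribution moved by 40

retrain_needed = drifted(training_feature, shifted_batch)
```

A drift-triggered retraining job could run a check like this per feature on each production batch and start retraining only when `drifted` returns true, with scheduled retraining as a fallback.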

## Day 8

### **Q1 (Easy — Fundamentals)**

You have a dataset with **100,000 rows**, **10 numerical features**, and **1 binary target**.
What are the **first 5 checks** you perform before training any model, and **why**?

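- Distributions: the spread of each feature and the class balance of the binary target, which determines the right evaluation metric.
- Null values: most models cannot handle missing values without imputation.
- Duplicated rows: duplicates bias the model and can leak between train and validation splits.
- Outliers: extreme values can distort scaling and linear models.
- Correlation among features: highly correlated features add redundancy and destabilize coefficients.

These are the five checks I would perform before training any model, because these issues have the largest impact on model quality.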
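A minimal standard-library sketch of these checks, assuming the dataset is a list of dicts. The column names (`age`, `spend`, `churn`), the sample rows, and the 1.5 × IQR outlier rule are illustrative assumptions; on a real 100,000-row dataset a dataframe library would be the usual tool.

```python
from collections import Counter
from statistics import mean, quantiles


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def pre_training_report(rows, features, target):
    """Summarize basic data-quality checks for a list of dict rows."""
    complete = [r for r in rows if all(r[f] is not None for f in features)]
    report = {
        # Missing values per feature.
        "nulls": {f: sum(r[f] is None for r in rows) for f in features},
        # Exact duplicate rows.
        "duplicates": len(rows) - len({tuple(sorted(r.items())) for r in rows}),
        # Target distribution (class balance).
        "class_counts": dict(Counter(r[target] for r in rows)),
        "outliers": {},
        "correlations": {},
    }
    # Outliers by the 1.5 * IQR rule, computed on rows without nulls.
    for f in features:
        values = [r[f] for r in complete]
        q1, _, q3 = quantiles(values, n=4)
        low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
        report["outliers"][f] = sum(v < low or v > high for v in values)
    # Pairwise correlation between features.
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            report["correlations"][(a, b)] = pearson(
                [r[a] for r in complete], [r[b] for r in complete]
            )
    return report


rows = [
    {"age": 20.0, "spend": 2.0, "churn": 0},
    {"age": 30.0, "spend": 3.0, "churn": 0},
    {"age": 40.0, "spend": 4.0, "churn": 1},
    {"age": 50.0, "spend": 5.0, "churn": 0},
    {"age": 50.0, "spend": 5.0, "churn": 0},   # duplicate row
    {"age": None, "spend": 1.0, "churn": 1},   # missing value
    {"age": 400.0, "spend": 6.0, "churn": 0},  # outlier
]
report = pre_training_report(rows, ["age", "spend"], "churn")
```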

---

### **Q2 (Easy–Medium — Data Leakage)**

You are building a **customer churn model**.
During EDA you find a feature called `last_login_date` that is **after** the churn label date for some users.

* Is this data leakage?
* How would you **detect** and **fix** it in a real pipeline?


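- Yes, this is data leakage. A login dated after the churn label date is information from the future: it would not be available at prediction time, and it effectively encodes the outcome.
- To detect it, compare each time-stamped feature against the churn label date and flag rows where the feature is later. To fix it, recompute the feature using only events up to the prediction cutoff (for example, days since last login as of that date), or drop the column.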
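A minimal sketch of a timestamp check and a point-in-time version of the feature. The row layout, the `label_date` key, and both helper functions are hypothetical; only `last_login_date` comes from the question.

```python
from datetime import date


def find_leaky_rows(rows, feature_date_key, label_date_key):
    """Rows whose feature timestamp is later than the label timestamp."""
    return [r for r in rows if r[feature_date_key] > r[label_date_key]]


def days_since_last_login(login_dates, cutoff):
    """Point-in-time feature: only logins on or before the cutoff count."""
    visible = [d for d in login_dates if d <= cutoff]
    return (cutoff - max(visible)).days if visible else None


rows = [
    {"user": "a", "last_login_date": date(2024, 1, 10),
     "label_date": date(2024, 2, 1)},
    {"user": "b", "last_login_date": date(2024, 3, 5),
     "label_date": date(2024, 2, 1)},
]
leaky = find_leaky_rows(rows, "last_login_date", "label_date")

# Rebuild the feature for user "b" from the raw login history,
# ignoring the login that happened after the label date.
logins_b = [date(2024, 1, 20), date(2024, 3, 5)]
feature_b = days_since_last_login(logins_b, cutoff=date(2024, 2, 1))
```

A check like `find_leaky_rows` can run as a validation step in the pipeline so that any time-stamped feature later than the label date fails the build instead of reaching training.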
---

### **Q3 (Medium — Model Choice & Bias-Variance)**

Your **training AUC = 0.92** and **validation AUC = 0.71**.

1. What is happening?
2. Name **3 concrete actions** you would take to fix this.
3. Which models are **more prone** to this problem and why?

---

### **Q4 (Medium–Hard — Feature Engineering & Scaling)**

You are using:

* Logistic Regression
* Random Forest
* XGBoost

Answer:

1. Which of these **require feature scaling** and why?
2. Would **one-hot encoding vs target encoding** change model performance? Explain **when and why**.
3. How would you handle a categorical column with **500 unique values**?

---

### **Q5 (Hard — Production & ML Thinking)**

Your model works well offline but **fails in production** after 2 months.

* Training AUC: 0.88
* Production AUC: 0.63

1. List **4 real-world reasons** this happens.
2. How do you **detect** the issue automatically?
3. What **metrics or monitoring** would you put in place?
4. When would you **retrain vs rollback**?

---

### **Rules**

* Answer **Q1 first**.
* I will **challenge your answers like an interviewer**.
* Weak ML intuition → I’ll call it out.
* Good answers → we go deeper.

Start with **Q1**.
