# Interview Questions

### 1. **Question:**

What role does the value of **k** play in the K-Nearest Neighbors algorithm?

* A) It controls the number of features considered during classification.

* B) It determines how many neighbors vote (or average) to make a prediction.

* C) It selects the distance metric used in the algorithm.

* D) It sets the number of training epochs during fitting.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It determines how many neighbors vote (or average) to make a prediction.

**Explanation:** The value of **k** controls how many nearby points influence the prediction. A small **k** gives low bias but high variance, so predictions are sensitive to noise; a larger **k** smooths the decision boundary (higher bias, lower variance).
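
To make the role of **k** concrete, here is a minimal sketch of majority voting in plain Python. The toy data, the `knn_predict` helper, and the query point are all hypothetical, chosen so that one mislabeled point sits inside the "A" cluster.

```python
from collections import Counter
import math

def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    # Sort (point, label) pairs by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: a cluster of "A" with one noisy "B" inside it.
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((0.8, 1.1), "A"),
    ((1.1, 1.1), "B"),  # mislabeled point inside the "A" cluster
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]
query = (1.1, 1.15)

print(knn_predict(train, query, k=1))  # B: the single nearest point is the noisy one
print(knn_predict(train, query, k=3))  # A: two of the three nearest points are "A"
```

With `k=1` the noisy neighbor decides alone; with `k=3` it is outvoted by the surrounding cluster.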

---

### 2. **Question:**

Which of the following is a drawback of the K-Nearest Neighbors algorithm?

* A) High **prediction-time** cost and memory usage because KNN is a lazy learner that scans/stores the whole training set.

* B) It cannot represent non-linear decision boundaries.

* C) It requires gradients and a differentiable loss to train.

* D) It fails on any dataset with more than two classes.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) High **prediction-time** cost and memory usage because KNN is a lazy learner that scans/stores the whole training set.

**Explanation:** KNN does no model building up front. At inference, it must search many (often all) points and keep them in memory. (It *can* handle multi-class and nonlinear boundaries; no gradients are needed.)

---

### 3. **Question:**

Which distance metric is most commonly used for continuous features in the K-Nearest Neighbors algorithm?

* A) Manhattan (L1) distance

* B) Hamming distance

* C) Euclidean (L2) distance

* D) Cosine similarity

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) Euclidean (L2) distance

**Explanation:** For real-valued, scaled features, Euclidean distance is the standard choice. Manhattan and cosine can be preferable in some settings; Hamming fits binary/categorical encodings.
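
The four metrics named in the options can be sketched in a few lines of plain Python. The helper names and sample vectors below are illustrative only; cosine is written as a distance (1 minus cosine similarity) so that smaller means closer, like the others.

```python
import math

def euclidean(a, b):
    # L2: square root of the summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # L1: sum of absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    # Number of positions where the two vectors differ.
    return sum(x != y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))                          # 5.0
print(manhattan(p, q))                          # 7.0
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))      # 2
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0 (orthogonal vectors)
```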

---

### 4. **Question:**

How does increasing the value of **k** affect the bias-variance tradeoff in the KNN algorithm?

* A) Decreases both bias and variance.

* B) Decreases bias but increases variance.

* C) Leaves the bias-variance balance largely unchanged.

* D) **Increases bias but decreases variance** by smoothing the decision boundary.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** D) **Increases bias but decreases variance** by smoothing the decision boundary.

**Explanation:** Larger **k** averages over more neighbors, dampening noise (lower variance) but potentially washing out fine structure (higher bias).
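
The smoothing effect is easy to see in KNN regression. The sketch below assumes a hypothetical 1-D dataset sampled from y = x² and a made-up `knn_regress` helper; it is meant as an illustration of the trend, not a benchmark.

```python
def knn_regress(xs, ys, query, k):
    """Average the targets of the k training points nearest to `query` (1-D inputs)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical data: y = x^2 sampled at six points.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]

# The true value at x = 4.9 is 24.01.
for k in (1, 3, 6):
    print(k, knn_regress(xs, ys, query=4.9, k=k))
```

As **k** grows, the prediction at x = 4.9 drifts away from the true value because more distant points are averaged in (higher bias). At the extreme `k = 6` (all points), every query receives the same global mean, so the prediction no longer reacts to individual nearby points at all.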

---

### 5. **Question:**

Which technique can help reduce the computation time of KNN when dealing with large datasets?

* A) Cross-validation

* B) Building spatial indexes (e.g., **KD-tree/Ball Tree**) for faster neighbor queries

* C) Standardizing features

* D) Early stopping during training

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Building spatial indexes (e.g., **KD-tree/Ball Tree**) for faster neighbor queries

**Explanation:** Tree-based or approximate-nearest-neighbor indexes can avoid full scans (especially in low to moderate dimensions), greatly speeding up queries. Feature scaling and cross-validation are useful but do not by themselves reduce query cost, and early stopping does not apply because KNN has no iterative training phase.
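
As a rough sketch of the idea, the snippet below builds a KD-tree with scikit-learn's `KDTree` and checks that it returns the same neighbors as a brute-force scan. The random data, its size, and `leaf_size=40` are assumptions made for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((2000, 3))      # hypothetical data: 2000 points in 3 dimensions
queries = rng.random((5, 3))

# Build the index once; each query can then skip most of the points.
tree = KDTree(X, leaf_size=40)
tree_dist, tree_idx = tree.query(queries, k=5)

# Brute-force reference: full distance matrix, then sort each row.
diffs = queries[:, None, :] - X[None, :, :]
all_dist = np.sqrt((diffs ** 2).sum(axis=2))
brute_idx = np.argsort(all_dist, axis=1)[:, :5]

# Both approaches should agree on the five nearest neighbors.
print(np.array_equal(tree_idx, brute_idx))
```

The index changes only how the neighbors are found, not which neighbors are returned, so predictions stay the same while query time drops.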

---

### 6. **Question:**

In the KNN algorithm, what happens when the value of **k** is set to 1?

* A) The model becomes a linear classifier with a learned hyperplane.

* B) The model automatically balances classes using priors.

* C) Training error necessarily stays high to avoid overfitting.

* D) The model becomes extremely **sensitive to noise/overfits** (often zero training error).

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** D) The model becomes extremely **sensitive to noise/overfits** (often zero training error).

**Explanation:** With **k = 1**, the nearest single point dictates the prediction. This interpolates the training set and can flip decisions due to a single mislabeled/noisy neighbor.
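
The "zero training error" effect can be checked with scikit-learn's `KNeighborsClassifier`. The synthetic data, the 15% label-flip rate, and the choice of `k = 15` as a comparison point are assumptions for this sketch.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                # hypothetical 2-D features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # clean labels from a linear rule
flip = rng.random(200) < 0.15                # mark roughly 15% of labels as noise
y_noisy = np.where(flip, 1 - y, y)

scores = {}
for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y_noisy)
    # (accuracy on the noisy training labels, agreement with the clean labels)
    scores[k] = (model.score(X, y_noisy), model.score(X, y))
    print(k, scores[k])
```

With `k = 1`, each training point is its own nearest neighbor, so the model reproduces every noisy label exactly: training accuracy is 1.0, and agreement with the clean labels equals the fraction of labels that were not flipped.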

---

### 7. **Question:**

Which of the following is a valid reason for using weighted KNN over traditional KNN?

* A) To **give greater influence to closer neighbors** during voting/averaging.

* B) To reduce training time by precomputing gradients.

* C) To guarantee perfect class separation on any dataset.

* D) To ensure predictions ignore local structure.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) To **give greater influence to closer neighbors** during voting/averaging.

**Explanation:** Distance-weighted KNN (e.g., inverse-distance weights) lets nearer points count more, often improving performance when local neighborhoods are informative.
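
A minimal sketch of inverse-distance voting, using a hypothetical `knn_vote` helper and three made-up points, shows how weighting can change the outcome:

```python
from collections import defaultdict
import math

def knn_vote(train, query, k, weighted=False):
    """Vote among the k nearest points, optionally weighting by 1/distance."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    totals = defaultdict(float)
    for point, label in nearest:
        d = math.dist(point, query)
        # The small epsilon avoids division by zero when the query matches a point.
        totals[label] += 1.0 / (d + 1e-9) if weighted else 1.0
    return max(totals, key=totals.get)

# Hypothetical data: one very close "A", two farther "B" points.
train = [((0.0, 0.0), "A"), ((2.0, 0.0), "B"), ((0.0, 2.0), "B")]
query = (0.1, 0.0)

print(knn_vote(train, query, k=3))                 # B: two votes against one
print(knn_vote(train, query, k=3, weighted=True))  # A: the close point dominates
```

In scikit-learn, the same behavior is available through `weights="distance"` on `KNeighborsClassifier` and `KNeighborsRegressor`.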

---

### 8. **Question:**

How does K-Nearest Neighbors handle missing data during classification?

* A) It natively skips missing features when computing distances.

* B) It learns an imputation model jointly with neighbor selection.

* C) It **does not handle missing features natively**; you must impute/drop them before applying KNN.

* D) It replaces missing values with class priors automatically.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) It **does not handle missing features natively**; you must impute/drop them before applying KNN.

**Explanation:** Preprocessing (e.g., mean/median imputation, KNN-imputation, model-based imputation) is required so that distances are well-defined.
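
The sketch below shows why imputation is needed, using NumPy and a hypothetical three-row matrix with one missing value. Mean imputation is used only because it is the simplest option to show.

```python
import numpy as np

# Hypothetical feature matrix with one missing entry.
X = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, 6.0],
])
query = np.array([2.0, 4.0])

# A missing value makes the distance itself undefined (NaN).
raw = np.sqrt(((X - query) ** 2).sum(axis=1))
print(raw)

# Mean imputation: replace each NaN with the mean of the observed values in its column.
col_means = np.nanmean(X, axis=0)
X_imputed = np.where(np.isnan(X), col_means, X)
imputed = np.sqrt(((X_imputed - query) ** 2).sum(axis=1))
print(imputed)
```

After imputation every distance is a finite number, so neighbors can be ranked. In a scikit-learn workflow, `SimpleImputer` or `KNNImputer` from `sklearn.impute` would typically fill this role.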

---

### 9. **Question:**

What is the computational complexity of making predictions with K-Nearest Neighbors (naïve scan), for a single query?

* A) O(1) due to direct neighbor lookup in a hash table.

* B) O(log n) after sorting the training set once.

* C) **O(nd) to compute distances plus O(n log k) to select the k nearest (e.g., heap), per query.**

* D) O(d) because only feature-wise means are needed.

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) **O(nd) to compute distances plus O(n log k) to select the k nearest (e.g., heap), per query.**

**Explanation:** With **n** training points and **d** features, computing all distances costs O(nd). Selecting the **k** smallest does not require a full sort: a size-k heap does it in O(n log k), and quickselect does it in expected O(n). Index structures can reduce the cost further in practice.
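
The two cost terms map directly onto a naïve implementation. Here is a minimal sketch using the standard-library `heapq`; the `k_nearest` helper and the sample points are hypothetical.

```python
import heapq
import math

def k_nearest(train, query, k):
    """Return the k (distance, index) pairs closest to `query`, nearest first."""
    # O(n*d): one distance per training point, each costing O(d).
    distances = ((math.dist(point, query), i) for i, point in enumerate(train))
    # O(n log k): nsmallest tracks only the k best candidates instead of sorting all n.
    return heapq.nsmallest(k, distances)

train = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (0.5, 0.0)]
result = k_nearest(train, (0.0, 0.0), k=3)
print([i for _, i in result])  # [0, 4, 1]
```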


# Decision Trees


## Employee Attrition

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import io

In [None]:
!gdown 16KtxSt_QEGQvfluEaMls5cCHPwhRXgCk


Downloading...
From: https://drive.google.com/uc?id=16KtxSt_QEGQvfluEaMls5cCHPwhRXgCk
To: /content/HR-Employee-Attrition.csv


In [None]:
df = pd.read_csv("HR-Employee-Attrition.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                

In [None]:
df.head()

| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
