# Interview Questions

### 1. **Question:**
You have a linear regression model $y = w_1x + b$ but the data suggests a clear curvature in residual plots. Which conclusion is more appropriate?

- A) The linear model fits perfectly, so no further action is needed
- B) The curvature in residuals indicates the linear form might be incomplete, suggesting a possible need for polynomial terms
- C) The residual plot only matters if the slope is zero
- D) Curvature in residuals automatically means heteroscedasticity is satisfied

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The curvature in residuals indicates the linear form might be incomplete, suggesting a possible need for polynomial terms

**Explanation:**  
When a plot of residuals against $x$ shows a systematic curve, it signals the linear assumption may be incorrect; you might incorporate polynomial or other transformations to capture non-linear trends.
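
The check described above can be sketched as follows. This is a minimal illustration on synthetic, noise-free data (the data and all variable names are assumptions, not taken from a real dataset): fit a straight line, then test whether the residuals still track $x^2$.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y has a quadratic component.
x = np.linspace(-3.0, 3.0, 61)
y = 1.0 + 2.0 * x + 0.5 * x**2

# Fit the straight line y ≈ w1*x + b and compute its residuals.
w1, b = np.polyfit(x, y, deg=1)
resid_linear = y - (w1 * x + b)

# A systematic curve shows up as a strong correlation between residuals and x^2.
curvature_corr = np.corrcoef(resid_linear, x**2)[0, 1]

# Adding the polynomial term removes the pattern.
coeffs_quad = np.polyfit(x, y, deg=2)
resid_quad = y - np.polyval(coeffs_quad, x)

print(f"corr(residuals, x^2) for the linear fit: {curvature_corr:.3f}")
print(f"max |residual| after adding x^2: {np.abs(resid_quad).max():.2e}")
```

On this toy data the residuals of the linear fit are almost perfectly correlated with $x^2$, and they vanish once the quadratic term is included. With real, noisy data the correlation would be weaker, so a residual plot remains the primary diagnostic.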


---

### 2. **Question:**
A dataset has features highly correlated with each other (e.g., $x_1$ and $x_2$ nearly the same). In **multiple linear regression**, which typical issue may arise?

- A) Residuals become automatically Gaussian
- B) Perfect correlation is necessary for a good fit
- C) Multicollinearity can inflate coefficient variances, making them unstable
- D) Homoscedasticity is guaranteed

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) Multicollinearity can inflate coefficient variances, making them unstable

**Explanation:**  
When features are strongly correlated, the design matrix becomes ill-conditioned, complicating the estimation of unique, stable coefficients. The model can still fit but each weight might become highly sensitive to small data changes.
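
A rough sketch of how this ill-conditioning can be measured, assuming simulated features (the helper `diagnostics` and the noise level `0.01` are hypothetical choices for illustration). It compares the condition number of the design matrix and the variance inflation factor (VIF) for an independent versus a nearly duplicated feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_collinear = x1 + 0.01 * rng.normal(size=n)  # nearly a copy of x1


def diagnostics(x1, x2):
    """Return (condition number of X, variance inflation factor for x2)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    cond = np.linalg.cond(X)
    # VIF = 1 / (1 - R^2), where R^2 comes from regressing x2 on the other columns.
    others = X[:, :2]
    coef, *_ = np.linalg.lstsq(others, x2, rcond=None)
    resid = x2 - others @ coef
    r2 = 1.0 - resid @ resid / np.sum((x2 - x2.mean()) ** 2)
    return cond, 1.0 / (1.0 - r2)


cond_ind, vif_ind = diagnostics(x1, x2_independent)
cond_col, vif_col = diagnostics(x1, x2_collinear)
print(f"independent: cond={cond_ind:.1f}, VIF={vif_ind:.2f}")
print(f"collinear:   cond={cond_col:.1f}, VIF={vif_col:.1f}")
```

A VIF near 1 means the feature carries information the others do not; a very large VIF means its coefficient variance is inflated by the same factor relative to the uncorrelated case.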


---

### 3. **Question:**
You fit a **linear regression** using ordinary least squares. One assumption is that the errors have **constant variance** (homoscedasticity). What pattern in a residuals vs. predicted-values plot might suggest a **violation** of this assumption?

- A) Residuals scattered randomly without structure
- B) Residuals forming a funnel shape, narrowing for small predictions and widening for large predictions
- C) A horizontal band of uniform residual magnitudes
- D) All residuals lying exactly on zero

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Residuals forming a funnel shape, narrowing for small predictions and widening for large predictions

**Explanation:**  
A funnel or “cone” shape in residual plots indicates the variance of errors changes with predicted values, suggesting heteroscedasticity—an assumption violation for standard linear regression.
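
The funnel can also be detected numerically. The sketch below assumes simulated data whose noise standard deviation grows with $x$ (an assumption made purely for illustration) and compares the residual spread in the lower and upper halves of the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(1.0, 10.0, size=n)
# Assumption for illustration: the noise standard deviation grows with x.
y = 2.0 * x + 1.0 + rng.normal(scale=0.2 * x)

slope, intercept = np.polyfit(x, y, deg=1)
pred = slope * x + intercept
resid = y - pred

# Compare residual spread in the lower and upper halves of the predictions.
order = np.argsort(pred)
low_half, high_half = resid[order[: n // 2]], resid[order[n // 2 :]]
spread_ratio = high_half.std() / low_half.std()
print(f"residual std ratio (high predictions / low predictions): {spread_ratio:.2f}")
```

A ratio well above 1 matches the widening funnel from option B; under homoscedasticity the ratio would sit near 1. Formal tests exist for this purpose, but this split-half comparison is only an informal check.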


---

### 4. **Question:**
One typical assumption of **linear regression** is that the model is “linear in parameters.” Which scenario violates that?

- A) A model $\hat{y} = w_1 x + w_2 x^2 + b$, which is linear in $w_1, w_2$
- B) A model $\hat{y} = w_1 \sin(x) + b$, still linear in $w_1$
- C) A model $\hat{y} = w_1^2 x + b$, where the parameter squared multiplies $x$
- D) A model $\hat{y} = w_1 x_1 + w_2 x_2 + b$, linear in $w_1, w_2$

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) A model $\hat{y} = w_1^2 x + b$, where the parameter squared multiplies $x$

**Explanation:**  
“Linear in parameters” means the function is linear with respect to the **coefficients**, even if in terms of $x$ it can be polynomial or sinusoidal. But if the parameter itself appears in a nonlinear way (like $w_1^2$), it breaks that assumption.
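
To make the distinction concrete, here is a small sketch (with illustrative values $w_1 = 2$, $b = 0.5$ chosen for the demo) showing that the model from option B can be fitted by ordinary least squares, because $\sin(x)$ is just another column of the design matrix.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 50)
y = 2.0 * np.sin(x) + 0.5  # generated with w1 = 2.0, b = 0.5 (illustrative values)

# The design matrix holds the transformed feature sin(x) and a column of ones.
X = np.column_stack([np.sin(x), np.ones_like(x)])
(w1, b), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"recovered w1={w1:.3f}, b={b:.3f}")
```

No such design matrix exists for $\hat{y} = w_1^2 x + b$: the prediction cannot be written as a matrix times the parameter vector, so the closed-form least-squares machinery does not apply directly.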


---

### 5. **Question:**
You suspect **outliers** are heavily influencing your ordinary least squares linear regression. Which scenario most strongly suggests that outliers are distorting the slope?

- A) The residuals are all zero
- B) A few points with extremely large residuals, shifting the best-fit line away from the majority
- C) All points lie perfectly on a straight line
- D) The model’s R-squared is 1.0

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) A few points with extremely large residuals, shifting the best-fit line away from the majority

**Explanation:**  
Outliers can disproportionately affect OLS solutions, because the squared error gives heavy weight to large residuals. If a few points have huge errors, they can drastically pull the regression line.
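
A minimal sketch of this effect, using ten made-up points on the line $y = 2x + 1$ and corrupting a single one (the values are illustrative assumptions):

```python
import numpy as np

x = np.arange(10, dtype=float)
y_clean = 2.0 * x + 1.0

# Corrupt a single point at the edge of the x-range (illustrative outlier).
y_outlier = y_clean.copy()
y_outlier[-1] = 100.0

slope_clean, _ = np.polyfit(x, y_clean, deg=1)
slope_outlier, _ = np.polyfit(x, y_outlier, deg=1)
print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with one outlier: {slope_outlier:.2f}")
```

Here one corrupted point out of ten moves the fitted slope from 2 to roughly 6.4, pulling the line away from the other nine points.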


---

### 6. **Question:**
A linear regression model is fitted. You check the residuals vs. each feature separately. One feature’s residual plot shows a clear wave pattern. Which statement is most plausible regarding model assumptions?

- A) There is no violation; wave patterns are normal in residuals
- B) The linear assumption for that feature might be incomplete, or another transformation is needed
- C) This pattern enforces homoscedasticity
- D) Perfect linear relationships produce wave-like residuals

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The linear assumption for that feature might be incomplete, or another transformation is needed

**Explanation:**  
If residuals vs. a particular feature show a systematic shape (like waves, curves), it suggests that the relationship is not purely linear. The model might miss a polynomial or other form for that feature.


---

### 7. **Question:**
You do multiple linear regression on a large dataset. The regression performs extremely well on training data, but test error is much higher. In the context of linear regression, which factor might be **most** responsible?

- A) The data is perfectly linear
- B) Overfitting due to excessive features or combinations
- C) Residuals all vanish on test data
- D) The intercept is negative

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Overfitting due to excessive features or combinations

**Explanation:**  
Even linear regression can overfit when there are many features relative to the sample size (for example, after polynomial expansion). The result is a near-perfect fit on the training set but a high test error, which is the signature of high variance.


---

### 8. **Question:**
In linear regression, assume the residuals are **normally** distributed with mean zero. Which of the following conditions is typically **not** required?

- A) Features must be linearly independent
- B) Features must be normally distributed
- C) Features must be numeric
- D) Residuals must have constant variance

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Features must be normally distributed

**Explanation:**  
Classical linear regression does not require each feature to be normally distributed. The key distributional assumptions concern the residuals (errors), not the features. Features should be numeric or convertible to numeric, but there is no normality requirement for X.
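
As a quick sanity check of this point, the sketch below simulates a heavily skewed (exponential) feature with Gaussian errors; the true slope 3 and intercept 1 are assumptions chosen for the demo. Ordinary least squares still recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.exponential(scale=1.0, size=n)  # heavily right-skewed feature
y = 3.0 * x + 1.0 + rng.normal(scale=1.0, size=n)  # Gaussian errors

slope, intercept = np.polyfit(x, y, deg=1)
print(f"estimated slope={slope:.3f}, intercept={intercept:.3f}")
```

The estimates land close to the values used to generate the data even though the feature is far from normal.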


---

### 9. **Question:**
You apply linear regression. The training error is moderate, but the model generalizes well to new data (similar moderate test error). Which phenomenon does this reflect?

- A) Underfitting with high bias, but good generalization
- B) Overfitting with huge variance
- C) Both training and test error are zero
- D) The residual distribution is guaranteed to be uniform

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Underfitting with high bias, but good generalization

**Explanation:**  
If the model underfits, it cannot drive the training error very low. Similar training and test performance, however, indicates stable generalization (low variance). This pattern typically points to a simple model with consistent, moderate error across both sets: higher bias, lower variance.


---

### 10. **Question:**
A linear regression has a coefficient $\beta_2$ that’s extremely large in magnitude, but the dataset is small and the design matrix is nearly singular. Which explanation best fits?

- A) Because the features are nearly collinear, the model’s weights can blow up to huge values to compensate
- B) A large coefficient always means minimal test error
- C) If a matrix is singular, the model forcibly sets all weights to zero
- D) That scenario is only possible if R-squared = 1

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Because the features are nearly collinear, the model’s weights can blow up to huge values to compensate

**Explanation:**  
When the columns of X are almost linearly dependent, OLS solutions can yield huge coefficients (multicollinearity). Minor changes in data lead to large coefficient swings. A nearly singular design matrix is a prime cause of extreme coefficient magnitudes.

---

### 11. **Question:**
A regression model has **low training error** but a much **higher test error**. Which statement about bias and variance is most consistent with this situation?

- A) It suggests the model has *high bias* and *low variance*
- B) It suggests the model is *overfitting* with *low bias* but *high variance*
- C) It suggests a perfect balance of bias and variance
- D) It implies the data is strictly linear

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It suggests the model is *overfitting* with *low bias* but *high variance*

**Explanation:**  
When training error is significantly lower than test error, it usually indicates overfitting. That corresponds to a scenario of low bias (the model fits training data extremely well) but high variance (the model doesn’t generalize well to unseen data).


---

### 12. **Question:**
You add more features to a multiple linear regression. The unadjusted R² rises slightly, but the **adjusted R²** remains stagnant or even drops. Which conclusion is more justified?

- A) The additional features meaningfully improve the generalization of the model
- B) The model is definitely underfitting by ignoring new features
- C) The new features do *not* provide enough explanatory power to offset the penalty for more parameters, indicating they may be unhelpful or leading to potential overfitting
- D) The adjusted R² is always lower by definition

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) The new features do *not* provide enough explanatory power to offset the penalty for more parameters, indicating they may be unhelpful or leading to potential overfitting

**Explanation:**  
Adjusted R² penalizes for adding parameters that don’t significantly improve the model. If unadjusted R² climbs but adjusted R² stagnates or falls, those features aren’t truly beneficial, possibly indicating overfitting or minimal actual contribution.
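
The penalty can be made explicit with the standard formula, adjusted R² $= 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of samples and $p$ the number of features. The sketch below uses hypothetical numbers (30 samples, R² rising from 0.850 to 0.855 after two extra features) purely for illustration.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical numbers: 30 samples, R^2 creeps from 0.850 to 0.855 after adding 2 features.
before = adjusted_r2(0.850, n_samples=30, n_features=5)
after = adjusted_r2(0.855, n_samples=30, n_features=7)
print(f"adjusted R^2 before: {before:.4f}, after: {after:.4f}")
```

With these numbers the adjusted R² falls from about 0.819 to about 0.809 even though the raw R² went up, which is the situation described in the question.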


---

### 13. **Question:**
You suspect a **high-bias** (underfitting) situation: training and test errors are large and about the same. Which remedy is often most effective?

- A) Use a simpler model
- B) Reduce the number of features
- C) Increase model capacity (e.g., add polynomial features) or reduce regularization, allowing the model to learn more complex patterns
- D) Add random noise to the data

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) Increase model capacity (e.g., add polynomial features) or reduce regularization, allowing the model to learn more complex patterns

**Explanation:**  
High bias means the model is too simple to capture the data’s complexity. Typically, you address it by increasing model flexibility (more parameters, polynomial expansions) so it can better fit or by weakening strong regularization that’s constraining the model.


---

### 14. **Question:**
You compare two regression models: Model A has R²=0.90 but an **adjusted R²**=0.60, whereas Model B has R²=0.85 and adjusted R²=0.74. If your goal is robust generalization, which model is *likely* preferable and why?

- A) Model A is clearly better because 0.90 > 0.85
- B) Model B, because a higher adjusted R² suggests it’s more genuinely explanatory relative to its number of parameters, indicating possibly less overfitting
- C) Model A because it has the largest difference between R² and adjusted R²
- D) Both models are worthless with those metrics

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Model B, because a higher adjusted R² suggests it’s more genuinely explanatory relative to its number of parameters, indicating possibly less overfitting

**Explanation:**  
Model A’s large gap between R² (0.90) and adjusted R² (0.60) suggests that many of its features do not truly help. Model B’s smaller gap (0.85 vs. 0.74) implies a better balance between complexity and explanatory power, which often means better generalization.


---

### 15. **Question:**
A colleague claims “If a model’s training error is high, it must have high variance.” Which subtlety might correct their perspective?

- A) High training error usually indicates underfitting (high bias), not high variance
- B) Any large training error means zero bias
- C) Variance is always zero in linear regression
- D) Overfitting is proven by large training error

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) High training error usually indicates underfitting (high bias), not high variance

**Explanation:**  
When training error is large, the model isn’t fitting even the training data well, which is typically a sign of high bias or underfitting, rather than high variance. High variance is typically indicated by a large discrepancy between train vs. test performance.


---

### 16. **Question:**
You add polynomial features to reduce bias. After doing so, you see training error drop significantly and test error rise. Which statement best reflects the outcome?

- A) The model is likely overfitting: it has now *lower bias* but *higher variance*, as indicated by the increased test error
- B) The model is underfitting, so we need even higher-degree polynomials
- C) Additional features always decrease test error
- D) A large gap between training and test error means reduced variance

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The model is likely overfitting: it has now *lower bias* but *higher variance*, as indicated by the increased test error

**Explanation:**  
When you add complexity (e.g., polynomial terms) and observe a big drop in training error but a jump in test error, that’s a classical sign of overfitting: the model is capturing training details at the cost of generalization.
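
A small simulation can reproduce this pattern. The sketch assumes a quadratic ground truth, 15 noisy training points, and polynomial degrees 1, 2, and 10; all of these are illustrative choices, and the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(3)


def true_fn(x):
    return 1.0 + 2.0 * x - 1.5 * x**2  # assumed ground truth for the simulation


x_train = rng.uniform(-1.0, 1.0, size=15)
y_train = true_fn(x_train) + rng.normal(scale=0.3, size=15)
x_test = rng.uniform(-1.0, 1.0, size=500)
y_test = true_fn(x_test) + rng.normal(scale=0.3, size=500)

errors = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only go down as the degree increases, because each model contains the previous one as a special case. The test error is what reveals whether the extra flexibility is capturing signal or noise.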


---

### 17. **Question:**
If a model has extremely high **variance**, which training vs. test error pattern is most characteristic?

- A) Both training and test errors are large and very close
- B) Training error is low, but test error is quite large
- C) Both training and test errors are near zero
- D) Training error is huge, test error is small

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Training error is low, but test error is quite large

**Explanation:**  
High variance typically means the model can closely adapt to the training set (achieving low training error) but fails to generalize, producing high test error. That’s the hallmark sign of overfitting.


---

### 18. **Question:**
You have a regression with R²=0.85. You add two more features, and R² jumps to 0.90. However, the adjusted R² moves from 0.84 to 0.81. Which is the **best** interpretation?

- A) The new features do not truly improve the model’s explanatory power enough to justify the extra complexity, possibly indicating overfitting
- B) The new features guarantee better real-world performance
- C) Adjusted R² is always higher than R²
- D) The model’s slope must be negative

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The new features do not truly improve the model’s explanatory power enough to justify the extra complexity, possibly indicating overfitting

**Explanation:**  
Although R² improved (it never decreases when features are added to an ordinary least squares fit), the adjusted R² decreased, signifying that the improvement in fit is probably trivial and not worth the penalty for more parameters. Overfitting or minimal net gain is likely.


---

### 19. **Question:**
You train a linear regression on 50 data points with 40 features. The model’s training error is near zero. Which phenomenon or metric might best confirm the suspicion of overfitting?

- A) Adjusted R² remains extremely high as well
- B) A smaller difference between R² and adjusted R²
- C) A large test error or a big drop in adjusted R² relative to R²
- D) More features always ensure less overfitting

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) A large test error or a big drop in adjusted R² relative to R²

**Explanation:**  
When you have nearly as many features as data points, it’s easy to fit the training data almost perfectly, which risks major overfitting. A large test error, or a strong divergence between unadjusted R² (which could be near 1) and adjusted R², is the red flag to look for.
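
The scenario from the question (50 points, 40 features) can be simulated directly. In this sketch only the first feature actually influences the target; that choice, the noise level, and the helper names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, n_features = 50, 1000, 40


def make_data(n):
    X = rng.normal(size=(n, n_features))
    y = 2.0 * X[:, 0] + rng.normal(size=n)  # only the first feature matters
    return X, y


def add_intercept(X):
    return np.column_stack([np.ones(len(X)), X])


X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

w, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
train_mse = np.mean((add_intercept(X_train) @ w - y_train) ** 2)
test_mse = np.mean((add_intercept(X_test) @ w - y_test) ** 2)

r2 = 1.0 - train_mse * n_train / np.sum((y_train - y_train.mean()) ** 2)
adj_r2 = 1.0 - (1.0 - r2) * (n_train - 1) / (n_train - n_features - 1)
print(f"train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
print(f"R^2={r2:.3f}, adjusted R^2={adj_r2:.3f}")
```

Both warning signs from option C appear in the same run: the test error is far above the training error, and the adjusted R² sits below the raw R².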


---

### 20. **Question:**
You believe your model is **underfitting**. However, your colleague notes the *adjusted R²* is relatively high. Which statement might resolve this apparent contradiction?

- A) Adjusted R² can be misleading if the form of the relationship is not properly captured by the linear part (non-linear pattern missed), so you can still underfit
- B) A high adjusted R² means zero underfitting
- C) Underfitting never happens once you have a high R²
- D) The model must have the correct polynomial expansions

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Adjusted R² can be misleading if the form of the relationship is not properly captured by the linear part (non-linear pattern missed), so you can still underfit

**Explanation:**  
Even a fairly high adjusted R² does not guarantee the right functional form: the metric only penalizes the number of parameters. If the true pattern is more complex than a linear structure, the model can still miss key nonlinearities and underfit in the sense of a shape mismatch.
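
A toy example of a high R² coexisting with an obvious shape mismatch, assuming a noise-free quadratic target on $[0, 10]$ (chosen only for illustration):

```python
import numpy as np

x = np.linspace(0.0, 10.0, 101)
y = x**2  # noise-free quadratic, chosen for illustration


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


linear_fit = np.polyval(np.polyfit(x, y, deg=1), x)
quadratic_fit = np.polyval(np.polyfit(x, y, deg=2), x)

r2_linear = r_squared(y, linear_fit)
r2_quadratic = r_squared(y, quadratic_fit)

# The straight line under-predicts at both ends and over-predicts in the middle.
resid = y - linear_fit
print(f"R^2 linear={r2_linear:.3f}, R^2 quadratic={r2_quadratic:.3f}")
print(f"residual at x=0: {resid[0]:.1f}, x=5: {resid[50]:.1f}, x=10: {resid[-1]:.1f}")
```

The straight line reaches an R² of roughly 0.94 here, yet its residuals are positive at both ends and negative in the middle, so the functional form is clearly wrong.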


---

### 21. **Question:**
Your linear regression obtains training MSE=10, test MSE=12. Another, more complex model obtains training MSE=3, test MSE=25. What does this indicate about the second model?

- A) Lower variance but higher bias
- B) Likely overfitting, with much better training fit but worse test performance
- C) Perfectly capturing linear assumptions
- D) Equal generalization

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Likely overfitting, with much better training fit but worse test performance

**Explanation:**  
Model 2 drastically improves training error from 10 down to 3, but test error jumps from 12 up to 25, consistent with higher variance and overfitting.


---

### 22. **Question:**
You’re deciding if you should add an additional feature to your linear regression. It increases R² from 0.70 to 0.72. Meanwhile, your adjusted R² changes from 0.69 to 0.68. Which approach is typically safer?

- A) Keep the feature because R² always outranks adjusted R²
- B) Discard the feature: adjusted R² dropping implies the feature is not beneficial enough for generalization
- C) Add more features until R² stops rising
- D) Force the coefficient to be zero

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Discard the feature: adjusted R² dropping implies the feature is not beneficial enough for generalization

**Explanation:**  
Even though R² rose slightly, the adjusted R² penalizes extra parameters. If it goes down, it indicates the feature doesn’t provide sufficient benefit, potentially leading to overfitting. Typically, you trust adjusted R² in such a scenario to ensure robust modeling.


---

### 23. **Question:**
A regression model is found to have a **low bias** but extremely **high variance**. Which training vs. test error pattern is consistent with that?

- A) Training error 2, test error 5
- B) Training error 1, test error 30
- C) Training error 25, test error 24
- D) Training error 15, test error 2

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Training error 1, test error 30

**Explanation:**  
Low bias means the model can fit training data extremely well (low training error). High variance means it fails to generalize, so test error is significantly higher. The biggest gap is 1 vs. 30 among these choices, best fitting that pattern.


---

### 24. **Question:**
You have a dataset with some polynomial relationship. A purely linear model yields training MSE=15, test MSE=16. A 4th-degree polynomial model yields training MSE=1, test MSE=28. This difference is best explained by:

- A) The linear model is probably underfitting but generalizes better than the 4th-degree which is overfitting
- B) The 4th-degree model is definitely underfitting
- C) The linear model must have infinite variance
- D) The polynomial model has minimal variance

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The linear model is probably underfitting but generalizes better than the 4th-degree which is overfitting

**Explanation:**  
One can see the linear model has moderate errors but consistent performance from train to test (a smaller gap), indicating less variance, albeit some bias. The polynomial gets a very low training error but big test error, showing overfitting. So it’s a high-variance fit.


---

### 25. **Question:**
After training a linear regression, you want to interpret the results. You see a fairly high R² (0.85). However, upon checking **adjusted R²**, it’s 0.50. Which is the best conclusion?

- A) The model might be overfitting or adding many features that only marginally help. A large drop indicates the model’s effective explanatory power is weaker than raw R² suggests
- B) The model’s performance is guaranteed to be perfect
- C) Adjusted R² should always equal R²
- D) We must remove the intercept

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The model might be overfitting or adding many features that only marginally help. A large drop indicates the model’s effective explanatory power is weaker than raw R² suggests

**Explanation:**  
The big gap between R² and adjusted R² implies the number of parameters is large relative to how much extra variance they explain. This is a red flag for potential overfitting or trivial benefit from additional features.

---

### 26. **Question:**
In an $\ell_2$-regularized linear model with weights $w_1, w_2, \dots$, what direct effect does the penalty term $\lambda \sum_j w_j^2$ have?

- A) It penalizes the absolute size of each weight by its sign, driving some to exactly zero
- B) It shrinks all weights smoothly towards smaller magnitudes but rarely forces them exactly to zero
- C) It disregards weight magnitudes and focuses only on the bias
- D) It allows weights to grow unbounded

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It shrinks all weights smoothly towards smaller magnitudes but rarely forces them exactly to zero

**Explanation:**  
$\ell_2$-regularization (Ridge) adds a penalty proportional to the square of the weight magnitudes, leading to a smoother shrinkage effect. It normally does not produce exact zeros in coefficients, just smaller values.
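
The shrinkage can be observed with the closed-form Ridge solution $w = (X^\top X + \lambda I)^{-1} X^\top y$. The sketch below assumes simulated data with two relevant and two irrelevant features and omits the intercept for brevity; the function name `ridge_fit` and the $\lambda$ grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)


def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


lambdas = [0.0, 10.0, 100.0, 1000.0]
weights = [ridge_fit(X, y, lam) for lam in lambdas]
norms = [np.linalg.norm(w) for w in weights]
for lam, w in zip(lambdas, weights):
    print(f"lambda={lam:7.1f}  weights={np.round(w, 3)}")
```

As $\lambda$ grows the weight vector shrinks steadily, but the coefficients of the irrelevant features stay small and nonzero rather than becoming exactly zero.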


---

### 27. **Question:**
You’re using **L1** regularization ($\lambda \sum_j |w_j|$). One commonly noted property is:

- A) It smooths weights but never sets any to zero
- B) It can force some weight coefficients exactly to zero, enabling feature selection
- C) It doesn’t affect overfitting
- D) It penalizes squared weight magnitudes

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It can force some weight coefficients exactly to zero, enabling feature selection

**Explanation:**  
$\ell_1$ regularization (Lasso) is known for creating sparsity in the model by pushing certain coefficients to zero, effectively removing unimportant features from the model. This is a distinct property versus $\ell_2$.
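
One common way to solve the Lasso problem is cyclic coordinate descent with soft-thresholding. The sketch below is a bare-bones version under stated assumptions: no intercept, a fixed number of sweeps instead of a convergence check, the objective scaled as $\frac{1}{2n}\|y - Xw\|^2 + \lambda \|w\|_1$, and simulated data in which only the first two of five features matter. The function names are hypothetical.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exactly 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_scale = (X**2).sum(axis=0) / n
    for _ in range(n_iters):
        for j in range(p):
            partial_resid = y - X @ w + X[:, j] * w[j]  # residual ignoring feature j
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, lam) / col_scale[j]
    return w


rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

w_lasso = lasso_coordinate_descent(X, y, lam=0.5)
print(np.round(w_lasso, 3))
```

The three irrelevant features receive coefficients of exactly zero, while the two relevant ones survive with magnitudes shrunk below their true values of 3 and 2.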


---

### 28. **Question:**
A linear model’s objective function includes $\ell_2$ penalty. During training, if a weight $w_k$ is large in magnitude, the penalty w.r.t. that weight will:

- A) Be linear in $w_k$
- B) Increase quadratically with $|w_k|$, encouraging it to shrink more strongly the larger it becomes
- C) Remain constant
- D) Drive $w_k$ instantly to zero

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Increase quadratically with $|w_k|$, encouraging it to shrink more strongly the larger it becomes

**Explanation:**  
$\ell_2$ regularization penalizes the sum of squares of weights, so large weights are penalized heavily, pushing them to shrink. However, they typically do not become exactly zero as they do with $\ell_1$.


---

### 29. **Question:**
Comparing $\ell_1$ vs. $\ell_2$ regularization in a linear regression:

- A) $\ell_1$ fosters many non-zero weights, while $\ell_2$ often zeroes them
- B) $\ell_2$ commonly yields a sparse solution, while $\ell_1$ keeps all weights nonzero
- C) $\ell_1$ can produce sparse solutions (some coefficients exactly 0), while $\ell_2$ typically yields all small but non-zero coefficients
- D) Both force exactly the same effect

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) $\ell_1$ can produce sparse solutions (some coefficients exactly 0), while $\ell_2$ typically yields all small but non-zero coefficients

**Explanation:**  
$\ell_1$ or Lasso can create actual zeros in the coefficient vector, providing feature selection. $\ell_2$ or Ridge typically shrinks weights but doesn’t force them exactly to zero, leading to smaller but mostly non-zero coefficients.


---

### 30. **Question:**
If a model with $\ell_1$ regularization is tuned to have a very large regularization coefficient $\lambda$, which outcome is likely?

- A) Many coefficients collapse to exactly zero, possibly ignoring important features if $\lambda$ is too high
- B) We produce extremely large coefficients
- C) Overfitting becomes guaranteed
- D) No change occurs in the weight magnitudes

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Many coefficients collapse to exactly zero, possibly ignoring important features if $\lambda$ is too high

**Explanation:**  
A large $\ell_1$ penalty strongly enforces sparsity, zeroing out many weights. If $\lambda$ is too large, the model can oversimplify (underfit) by discarding too many features.


---

### 31. **Question:**
Why might a practitioner prefer **Ridge** ($\ell_2$) regularization over **Lasso** ($\ell_1$) in some contexts?

- A) Lasso can handle collinearity better than Ridge
- B) Ridge is differentiable at all weight values (including zero) and typically handles collinear features more smoothly, not forcing any single weight to zero
- C) Ridge always sets at least half the weights to zero
- D) Lasso offers smoother gradient

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Ridge is differentiable at all weight values (including zero) and typically handles collinear features more smoothly, not forcing any single weight to zero

**Explanation:**  
Ridge’s penalty $\|w\|_2^2$ is differentiable everywhere, making optimization straightforward. It also distributes shrinkage among correlated features rather than picking one. Lasso’s absolute-value penalty is not differentiable at zero; optimization methods such as coordinate descent handle this, and the kink at zero is what lets Lasso produce exact zeros.


---

### 32. **Question:**
When applying Lasso regression with a moderately sized $\lambda$, you notice several weights become zero. From the perspective of **overfitting**:

- A) Zero weights reduce model complexity, potentially mitigating overfitting
- B) Setting coefficients to zero always means the model is overfit
- C) $\ell_1$ penalty always increases variance
- D) Large $\ell_1$ penalty never leads to zero weights

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Zero weights reduce model complexity, potentially mitigating overfitting

**Explanation:**  
Zero weights effectively remove those features, simplifying the model. Simpler models (fewer effective parameters) often reduce variance and risk of overfitting, thus can help generalization.


---

### 33. **Question:**
Your linear regression has a Ridge penalty. You find that the solution yields small but nonzero weights for all features, even irrelevant ones. Why doesn’t Ridge produce many exact zeros?

- A) Because $\ell_2$ penalty encourages smooth shrinkage but doesn’t induce strict sparsity
- B) Because the sum of squares is always negative
- C) Because $\ell_2$ penalty is linear
- D) Because zero is not permissible for the bias term

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Because $\ell_2$ penalty encourages smooth shrinkage but doesn’t induce strict sparsity

**Explanation:**  
The $\ell_2$ penalty (Ridge) is *quadratic*, so its pull on a coefficient weakens as the coefficient approaches zero. Coefficients are pushed towards smaller magnitudes but typically remain nonzero. By contrast, the $\ell_1$ penalty pulls with constant strength all the way to zero, which is why it can create exact zeros.
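
The difference is easiest to see in a one-dimensional simplification (an assumption made for illustration), where $z$ stands for the unregularized estimate of a single weight. Minimizing $\tfrac{1}{2}(w - z)^2 + \tfrac{\lambda}{2} w^2$ gives proportional shrinkage, while minimizing $\tfrac{1}{2}(w - z)^2 + \lambda |w|$ gives soft-thresholding.

```python
def ridge_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + 0.5 * lam * w**2: proportional shrinkage."""
    return z / (1.0 + lam)


def lasso_shrink(z: float, lam: float) -> float:
    """Minimizer of 0.5 * (w - z)**2 + lam * abs(w): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


for z in (2.0, 0.3, -0.1):
    print(f"z={z:+.1f}  ridge={ridge_shrink(z, 0.5):+.3f}  lasso={lasso_shrink(z, 0.5):+.3f}")
```

Ridge divides every estimate by the same factor, so a nonzero $z$ never maps to zero. Lasso subtracts a fixed amount and clips, so any estimate with $|z| \le \lambda$ becomes exactly zero.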


---

### 34. **Question:**
A data scientist tries both Lasso and Ridge on the same dataset. Lasso sets half the features’ weights to zero, while Ridge keeps them all non-zero but smaller. Which scenario might be best if the data truly has many irrelevant features?

- A) Lasso is beneficial here for automatic feature elimination, possibly giving better interpretability and similar or better test performance
- B) Ridge is always superior, ignoring irrelevant features
- C) Both methods create exactly identical solutions
- D) There’s no difference in interpretability

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Lasso is beneficial here for automatic feature elimination, possibly giving better interpretability and similar or better test performance

**Explanation:**  
If many features are genuinely irrelevant, Lasso can zero them out, simplifying the model and improving interpretability. This can also help mitigate overfitting from spurious features. Ridge tends to keep all features but with shrunk weights, which might not be as interpretable.


---

### 35. **Question:**
In a linear regression with **L1 + L2** penalty (Elastic Net), you discover the L1 ratio is large. Which effect is most prominent?

- A) The model is purely a standard linear regression
- B) Stronger push for sparsity due to the L1 portion
- C) We only get L2-like shrinkage
- D) The penalty is negative

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Stronger push for sparsity due to the L1 portion

**Explanation:**  
Elastic Net combines $\ell_1$ and $\ell_2$. A larger L1 ratio means more emphasis on the L1 component, thus encouraging more zeros. The L2 portion still does smoothing. So a large L1 ratio primarily fosters sparsity.
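
The role of the L1 ratio can be written out explicitly. The question does not say which parametrization is meant; the sketch below assumes the one used by scikit-learn's `ElasticNet` (parameters `alpha` and `l1_ratio`), and the weight vector is made up for illustration.

```python
def elastic_net_penalty(weights, alpha, l1_ratio):
    """alpha * (l1_ratio * ||w||_1 + 0.5 * (1 - l1_ratio) * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2_squared)


w = [3.0, -0.5, 0.0]  # illustrative weight vector
mostly_l1 = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.9)
mostly_l2 = elastic_net_penalty(w, alpha=1.0, l1_ratio=0.1)
print(f"l1_ratio=0.9 -> penalty {mostly_l1:.4f}")
print(f"l1_ratio=0.1 -> penalty {mostly_l2:.4f}")
```

Setting `l1_ratio=1.0` recovers a pure Lasso penalty and `l1_ratio=0.0` a pure Ridge penalty, so a large ratio means the sparsity-inducing term dominates.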


---

### 36. **Question:**
If your dataset is small and you want to *avoid large swings in coefficients*, you might choose:

- A) No regularization, for maximum capacity
- B) Ridge ($\ell_2$) regularization, which typically stabilizes weights and reduces variance
- C) L1 regularization for immediate zeroing of all weights
- D) A random assignment of weights at inference

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Ridge ($\ell_2$) regularization, which typically stabilizes weights and reduces variance

**Explanation:**  
With limited data, a purely unconstrained model might overfit, producing large, unstable coefficients. Ridge’s $\ell_2$ penalty specifically helps keep coefficients stable (smaller and more uniform), reducing variance, which is beneficial in small-sample scenarios.
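
The stabilizing effect can be checked by refitting on many small simulated samples and measuring how much the coefficients move. The setup below (20 points per sample, two nearly identical features, $\lambda = 1$, no intercept) is an assumption chosen to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(7)


def fit(X, y, lam):
    """Least squares with an L2 penalty lam (lam = 0 gives plain OLS); no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


ols_coefs, ridge_coefs = [], []
for _ in range(300):
    # Small sample with two nearly identical features (simulation assumption).
    x1 = rng.normal(size=20)
    x2 = x1 + 0.05 * rng.normal(size=20)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(scale=0.5, size=20)
    ols_coefs.append(fit(X, y, lam=0.0))
    ridge_coefs.append(fit(X, y, lam=1.0))

ols_std = np.std(ols_coefs, axis=0)
ridge_std = np.std(ridge_coefs, axis=0)
print(f"OLS coefficient std across samples:   {np.round(ols_std, 3)}")
print(f"Ridge coefficient std across samples: {np.round(ridge_std, 3)}")
```

Across resamples the unregularized coefficients swing widely, while the Ridge coefficients stay in a narrow range. The price is some bias, which is the usual trade-off when reducing variance.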


---

### 37. **Question:**
You see a massive gap between training and test errors. One approach is to apply **L2** penalty. Why might this help reduce that gap?

- A) L2 penalty eliminates features entirely
- B) Smoothing the weights (reducing their magnitude) helps avoid overfitting’s large test error
- C) It always sets training error to zero
- D) The cost function becomes non-convex

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Smoothing the weights (reducing their magnitude) helps avoid overfitting’s large test error

**Explanation:**  
When the model overfits (high variance), it typically has large or extreme coefficient values for certain features. The $\ell_2$ penalty shrinks those, making the model less sensitive to data noise. This can align training and test performance better, decreasing the gap.


---

### 38. **Question:**
An ML engineer uses Lasso on 100 features, sees 80 coefficients are driven to exactly zero, and the test error is decent. Which statement about interpretability is accurate?

- A) It is unclear which features matter because all are zero
- B) The 20 non-zero features are presumably the *most* relevant, giving a simpler, more interpretable model
- C) L1 penalty always leads to worse interpretability
- D) The model is guaranteed to be overfitting

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The 20 non-zero features are presumably the *most* relevant, giving a simpler, more interpretable model

**Explanation:**  
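L1 (Lasso) can zero out many coefficients, effectively selecting a subset of features. The 20 surviving features are the ones the model actually relies on, making it simpler and more interpretable than one that uses all 100 features. One caveat: among strongly correlated features, Lasso tends to keep one somewhat arbitrarily, so a zeroed coefficient does not prove that the feature is irrelevant.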
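
The exact zeros come from the soft-thresholding operation that appears in Lasso solvers. In the special case of an orthonormal design, the Lasso coefficients are simply the soft-thresholded least-squares coefficients. The sketch below assumes that special case; the coefficient values and `lam=0.5` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; any |z| <= lam becomes exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

ols_coefs = [2.5, -0.3, 0.1, -1.8, 0.05]   # hypothetical unpenalized estimates
lasso_coefs = [soft_threshold(c, lam=0.5) for c in ols_coefs]
n_zero = sum(1 for c in lasso_coefs if c == 0.0)
print(lasso_coefs, n_zero)
```

Small coefficients are removed entirely, while large ones survive with reduced magnitude.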

---

### 39. **Question:**
In **logistic regression**, the model predicts $\hat{p} = \sigma(\mathbf{w}^\top \mathbf{x} + b)$, where $\sigma$ is the sigmoid function. Why not directly use a linear regression formula $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ for classification?

- A) A linear output can only produce values in [0,1]
- B) Logistic regression’s sigmoid ensures probabilities remain between 0 and 1, whereas a raw linear output might produce invalid probabilities (<0 or >1)
- C) The linear model easily saturates at 0 or 1 for classification
- D) Logistic regression doesn’t require any features

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Logistic regression’s sigmoid ensures probabilities remain between 0 and 1, whereas a raw linear output might produce invalid probabilities (<0 or >1)

**Explanation:**  
Classification often needs an output interpreted as probability, so we constrain outputs to $[0,1]$. A linear output can exceed these bounds. The sigmoid $\sigma(z)= 1/(1+ e^{-z})$ ensures valid probability predictions.
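
A minimal sketch of the contrast, with hypothetical parameters `w` and `b` for a single feature:

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

w, b = 2.0, -1.0  # hypothetical parameters, not fitted to any data
rows = []
for x in (-3.0, 0.0, 3.0):
    z = w * x + b                 # raw linear output, unbounded
    rows.append((x, z, sigmoid(z)))
    print(f"x={x:+.1f}  linear={z:+.1f}  sigmoid={sigmoid(z):.4f}")
```

The linear column leaves $[0,1]$ at both ends, while the sigmoid column never does.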


---

### 40. **Question:**
Which statement best characterizes the **decision boundary** in logistic regression with input features $\mathbf{x}$?

- A) It’s determined by $\mathbf{w}^\top \mathbf{x}+ b =0$, the point where predicted probability is exactly 0.5
- B) It’s a horizontal line at $\hat{p}=0.5$
- C) It changes arbitrarily for each data point
- D) The decision boundary never depends on $\mathbf{w}$

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) It’s determined by $\mathbf{w}^\top \mathbf{x}+ b =0$, the point where predicted probability is exactly 0.5

**Explanation:**  
Logistic regression classifies a sample as “1” if $\hat{p}>0.5$. The boundary is thus $\hat{p}=0.5$. Since $\hat{p}= \sigma(\mathbf{w}^\top\mathbf{x}+ b)$, $\sigma(z)=0.5$ exactly at $z=0$. Therefore, the boundary is $\mathbf{w}^\top \mathbf{x}+ b=0$.
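
For a single feature the boundary reduces to the point $x^\* = -b/w$. The sketch below checks this with hypothetical parameters (`w = 2`, `b = -3`):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 2.0, -3.0           # hypothetical parameters
x_boundary = -b / w        # solves w*x + b = 0

def predict(x):
    """Class 1 if the predicted probability exceeds 0.5, else class 0."""
    return 1 if sigmoid(w * x + b) > 0.5 else 0

print(x_boundary, sigmoid(w * x_boundary + b), predict(1.4), predict(1.6))
```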


---

### 41. **Question:**
When training logistic regression, we typically minimize **cross-entropy** (log loss) rather than MSE. Which subtle reason supports cross-entropy for classification?

- A) MSE is perfectly suitable for classification, offering the same gradient
- B) The gradient from MSE can lead to slower convergence or non-optimal updates for the sigmoid, while cross-entropy aligns with the likelihood interpretation, providing more stable, direct gradient signals for probability estimates
- C) MSE ensures linear boundaries
- D) Cross-entropy only applies to regression tasks

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The gradient from MSE can lead to slower convergence or non-optimal updates for the sigmoid, while cross-entropy aligns with the likelihood interpretation, providing more stable, direct gradient signals for probability estimates

**Explanation:**  
Cross-entropy loss arises from the maximum likelihood principle for Bernoulli data. It yields gradients well-suited to logistic regression, typically converging faster and more reliably than MSE, which can produce problematic gradient behavior with a sigmoid.
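
One way to see the gradient difference is to compare the derivative of each loss with respect to the logit $z$ for a single sample. With $\hat{p} = \sigma(z)$, cross-entropy gives $\hat{p} - y$, while the squared error $\tfrac{1}{2}(\hat{p} - y)^2$ gives $(\hat{p} - y)\,\hat{p}(1-\hat{p})$. The sketch below evaluates both at a confidently wrong prediction; the value `z = -6` is an arbitrary choice for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_cross_entropy(z, y):
    """d/dz of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z)."""
    return sigmoid(z) - y

def grad_squared_error(z, y):
    """d/dz of 0.5*(p - y)**2 with p = sigmoid(z)."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

z, y = -6.0, 1   # predicted probability is tiny although the label is 1
g_ce = grad_cross_entropy(z, y)
g_se = grad_squared_error(z, y)
print(f"cross-entropy grad={g_ce:.4f}, squared-error grad={g_se:.6f}")
```

The extra factor $\hat{p}(1-\hat{p})$ nearly vanishes when the sigmoid saturates, so the squared-error gradient is tiny exactly when the model is most wrong.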


---

### 42. **Question:**
In logistic regression, the **log-odds** output $\mathbf{w}^\top \mathbf{x}+b$ is interpreted how?

- A) It’s the direct probability of class 1
- B) It’s $\log\frac{\hat{p}}{1-\hat{p}}$, the logarithm of the predicted odds for class 1
- C) It’s always bounded between 0 and 1
- D) A negative log-odds must equal zero probability

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It’s $\log\frac{\hat{p}}{1-\hat{p}}$, the logarithm of the predicted odds for class 1

**Explanation:**  
Logistic regression’s linear combination $\mathbf{w}^\top \mathbf{x}+ b$ equals the log-odds of class 1, i.e. $\log(\hat{p}/(1-\hat{p}))$. Then $\hat{p}$ is found by applying the sigmoid function $\sigma$.
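
A quick numerical check that the log-odds and the sigmoid are inverses of each other (the `z` values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit: log(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for z in (-2.0, 0.0, 1.5):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  log_odds(p)={log_odds(p):+.4f}")
```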


---

### 43. **Question:**
A logistic regression classifier yields training accuracy near 100%, but test accuracy is only around 60%. Which phenomenon is indicated?

- A) **Overfitting**: the model likely memorized training samples but generalizes poorly
- B) **Underfitting**: the model is too simple
- C) Perfect bias with no variance
- D) The model is guaranteed to have a linear boundary

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) **Overfitting**: the model likely memorized training samples but generalizes poorly

**Explanation:**  
A large gap between near-perfect training accuracy and significantly lower test accuracy typically signals overfitting. The logistic regressor is capturing training specifics that do not hold on unseen data.


---

### 44. **Question:**
When using logistic regression on data with **highly imbalanced classes** (e.g., 95% negative, 5% positive), which subtle pitfall might occur if we rely on standard accuracy?

- A) Accuracy is unaffected by class imbalance
- B) The model might just predict the majority class, achieving high accuracy but ignoring the minority class, leading to poor utility
- C) The logistic function saturates for minority classes
- D) Overfitting is guaranteed

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The model might just predict the majority class, achieving high accuracy but ignoring the minority class, leading to poor utility

**Explanation:**  
In highly imbalanced datasets, the model can output the majority label almost every time and still get high accuracy. This gives a misleading picture of performance if the minority class is the one we care about. Other metrics (e.g., precision, recall, F1, AUC) capture minority-class performance better.
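
The pitfall is easy to reproduce with a classifier that always predicts the majority class on the 95%/5% split from the question:

```python
labels = [1] * 5 + [0] * 95        # 5% positive, 95% negative
preds = [0] * len(labels)          # always predict the majority class

correct = sum(1 for y, p in zip(labels, preds) if y == p)
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = correct / len(labels)
recall = tp / (tp + fn)
print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```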


---

### 45. **Question:**
Which statement about logistic regression’s **cost function** is correct?

- A) The cost is typically the sum of squared errors between predicted probability and label
- B) It’s usually a cross-entropy (log loss) that punishes confident wrong predictions heavily, aligning with the maximum likelihood for Bernoulli
- C) No cost function is needed; logistic regression is always solved by linear algebra
- D) It uses a hinge loss

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It’s usually a cross-entropy (log loss) that punishes confident wrong predictions heavily, aligning with the maximum likelihood for Bernoulli

**Explanation:**  
Logistic regression uses cross-entropy (logistic) loss, derived from the negative log-likelihood perspective. It heavily penalizes wrong predictions if the model is very confident, guiding parameter updates accordingly.


---

### 46. **Question:**
You find that in your logistic regression, a certain feature $x_j$ has a very **large positive** weight. Interpreting this weight:

- A) A large positive coefficient indicates that higher values of $x_j$ strongly push the log-odds toward class 1, raising the probability of predicting class 1
- B) The sign is irrelevant for classification
- C) It means that feature is never used
- D) A large positive weight sets predicted probability to zero

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) A large positive coefficient indicates that higher values of $x_j$ strongly push the log-odds toward class 1, raising the probability of predicting class 1

**Explanation:**  
In logistic regression, a big positive weight means that as $x_j$ increases, $\mathbf{w}^\top\mathbf{x}+ b$ grows, thus the sigmoid output shifts closer to 1, significantly favoring class 1 predictions.


---

### 47. **Question:**
Why do we typically avoid using **mean squared error (MSE)** as a loss for logistic regression?

- A) MSE is perfectly aligned with maximum likelihood for binary classification
- B) MSE can lead to less stable gradients, slower convergence with the sigmoid function, and lacks the probabilistic interpretation that cross-entropy provides
- C) MSE ensures faster training
- D) MSE is not differentiable

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) MSE can lead to less stable gradients, slower convergence with the sigmoid function, and lacks the probabilistic interpretation that cross-entropy provides

**Explanation:**  
While MSE is differentiable, it’s not the typical choice for binary classification because it doesn’t match the Bernoulli likelihood perspective, leading to suboptimal gradient behavior. Cross-entropy is standard, giving better theoretical and practical results.


---

### 48. **Question:**
A logistic regression is run with no regularization on high-dimensional data. It achieves extremely **low training error**. Which subtle check might you do to confirm the model is not overfitting?

- A) Just check the final loss on training data
- B) Verify a similarly low error on a separate validation or test set
- C) Confirm the weights are large in magnitude
- D) Ensure the log-odds are negative

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Verify a similarly low error on a separate validation or test set

**Explanation:**  
A model with very low training error in high dimensions can easily overfit. The standard approach is checking performance on unseen (test or validation) data. If test performance remains good, it’s less likely to be overfitting.


---

### 49. **Question:**
When logistic regression’s **$\mathbf{w}$ and $b$** yield predicted probabilities close to 1 or 0 for certain samples but are wrong, how does the cross-entropy loss respond?

- A) It penalizes those misclassifications lightly
- B) It’s indifferent to confident misclassifications
- C) It imposes a **large** penalty on being confidently incorrect
- D) It yields negative cost

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** C) It imposes a **large** penalty on being confidently incorrect

**Explanation:**  
Cross-entropy heavily punishes wrong predictions made with high confidence. If the model outputs probability ~1 for label=0 or ~0 for label=1, the log term in the loss function can blow up, penalizing the model’s parameters strongly.
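
The growth of the penalty can be seen by evaluating the per-sample loss for a true label of 1 at a few predicted probabilities (values chosen for illustration):

```python
import math

def cross_entropy(y, p):
    """Per-sample log loss for label y in {0, 1} and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

losses = {p: cross_entropy(1, p) for p in (0.9, 0.6, 0.4, 0.1, 0.01)}
for p, loss in losses.items():
    print(f"p={p:<5} loss={loss:.3f}")
```

The loss rises slowly while the prediction is roughly right and steeply as the predicted probability of the true class approaches 0.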


---

### 50. **Question:**
Consider logistic regression for a 2-class problem. If the weight vector’s **norm** becomes large, what happens near the decision boundary?

- A) Small changes in $\mathbf{x}$ produce large changes in predicted log-odds, so the predicted probability switches sharply across the boundary
- B) The predicted probabilities remain unaffected
- C) The model always predicts 0.5
- D) The boundary disappears

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Small changes in $\mathbf{x}$ produce large changes in predicted log-odds, so the predicted probability switches sharply across the boundary

**Explanation:**  
A large norm $\|\mathbf{w}\|$ implies the model is sensitive: a small movement in $\mathbf{x}$ significantly modifies $\mathbf{w}^\top \mathbf{x}$. This can lead to a sharper transition around the boundary, making the model’s probability switch rapidly from 0 to 1 near the boundary line $\mathbf{w}^\top \mathbf{x}+ b=0$.
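
To isolate the effect of the norm, the sketch below scales hypothetical parameters `w` and `b` by the same factor. The solution of $wx + b = 0$ stays at the same place, but the probabilities on either side move toward 0 and 1.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 1.0, -2.0   # hypothetical 1D parameters; boundary at x = 2

def near_boundary(scale):
    """Return (boundary, p just left, p just right) after scaling w and b."""
    ws, bs = scale * w, scale * b
    boundary = -bs / ws
    return boundary, sigmoid(ws * 1.9 + bs), sigmoid(ws * 2.1 + bs)

for scale in (1.0, 10.0):
    boundary, p_left, p_right = near_boundary(scale)
    print(f"scale={scale}: boundary={boundary}, p(1.9)={p_left:.3f}, p(2.1)={p_right:.3f}")
```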


---

### 51. **Question:**
Logistic regression typically uses an iterative algorithm (like gradient descent) instead of solving a closed-form normal equation. Why?

- A) The cross-entropy objective with a sigmoid is non-linear, lacking a simple closed-form solution for $\mathbf{w}$
- B) It has no partial derivatives
- C) The model is linear, so normal equations always exist
- D) We prefer not to find the global optimum

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The cross-entropy objective with a sigmoid is non-linear, lacking a simple closed-form solution for $\mathbf{w}$

**Explanation:**  
The sigmoid makes the log-likelihood **non-linear** in $\mathbf{w}$, so setting its gradient to zero does not yield an algebraic closed-form solution as in ordinary least squares. Hence, iterative numerical methods (e.g., gradient descent or Newton-type updates) are used.
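
A minimal sketch of such an iterative fit, using batch gradient descent on a toy one-feature dataset. The data, learning rate, and step count are illustrative assumptions, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, b, xs, ys):
    """Mean cross-entropy of the model sigmoid(w*x + b) on the data."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

def fit(xs, ys, lr=0.1, steps=500):
    """Batch gradient descent; the gradient of the mean loss is mean((p - y) * x)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy, non-separable data so the optimum is finite (illustrative values).
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]
w, b = fit(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, loss={log_loss(w, b, xs, ys):.4f}")
```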


---

### 52. **Question:**
If your logistic regression model’s **decision boundary** on a 2D feature space is highly curved, is that possible?

- A) Yes, if you include feature transformations (like polynomial expansions), otherwise a single linear combination $\mathbf{w}^\top\mathbf{x}+ b=0$ is always a line
- B) No, logistic regression always yields a circular boundary
- C) The boundary can be any random shape
- D) The boundary is never linear in logistic regression

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Yes, if you include feature transformations (like polynomial expansions), otherwise a single linear combination $\mathbf{w}^\top\mathbf{x}+ b=0$ is always a line

**Explanation:**  
Vanilla logistic regression is linear in the original features, so the boundary is linear (a hyperplane). But if you transform your features (e.g., polynomial expansions), the boundary in original input space can appear curved. The model is still linear in transformed features, though.
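
As a sketch of how a transformation bends the boundary, the model below is linear in the squared features $x_1^2$ and $x_2^2$. The weights are hypothetical (not fitted) and were chosen so that the boundary $-x_1^2 - x_2^2 + 1 = 0$ is the unit circle in the original space.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x1, x2, w=(-1.0, -1.0), b=1.0):
    """Logistic model on squared features; hypothetical weights."""
    z = w[0] * x1 ** 2 + w[1] * x2 ** 2 + b
    return sigmoid(z)

print(predict_proba(0.0, 0.0))   # inside the circle
print(predict_proba(1.0, 0.0))   # on the circle
print(predict_proba(2.0, 0.0))   # outside the circle
```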


---

### 53. **Question:**
A logistic regression’s predicted probability for an example is $\hat{p}=0.7$. The label is 1. Intuitively, what happens in **gradient-based** training?

- A) The model is slightly wrong since 0.7 < 1, but not drastically, so the gradient update for that sample is less severe compared to if $\hat{p}$ were 0.1
- B) The model sees no error if $\hat{p}>0.5$
- C) The cost is infinite
- D) The model always sets $\hat{p}=1$ after one update

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The model is slightly wrong since 0.7 < 1, but not drastically, so the gradient update for that sample is less severe compared to if $\hat{p}$ were 0.1

**Explanation:**  
Cross-entropy punishes large deviations from the target. With a sigmoid output, the gradient of the loss with respect to the logit is $\hat{p} - y$: here $0.7 - 1 = -0.3$, versus $0.1 - 1 = -0.9$ if the model had predicted 0.1. The prediction already falls on the correct side of 0.5, so the error is nonzero but small, and the update is correspondingly modest.

---


### 54. **Question:**
In a binary classification task, **precision** measures which aspect?

- A) Among all predicted positives, how many are truly positives  
- B) Among all actual positives, how many are correctly identified  
- C) How often a negative is incorrectly labeled as positive  
- D) The fraction of negatives correctly identified as negatives  

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Among all predicted positives, how many are truly positives

**Explanation:**  
Precision = (True Positives) / (True Positives + False Positives). It focuses on the reliability of positive predictions.


---

### 55. **Question:**
A classifier yields a **recall** of 0.95 but only 0.50 precision. Interpreting this:

- A) It finds 50% of actual positives, ignoring the rest
- B) It correctly identifies 95% of actual positives, but also has many false positives
- C) It mislabels 5% of all negatives as positives
- D) It consistently misses half of the positives

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) It correctly identifies 95% of actual positives, but also has many false positives

**Explanation:**  
Recall = (TP) / (TP + FN). A recall of 0.95 means it captures 95% of all actual positives. A precision of 0.50 means half of its positive predictions are false. So the classifier is generous at detecting positives but not always correct about them.


---

### 56. **Question:**
You want a **single** metric that balances both precision and recall. Which standard measure is typically used?

- A) F1 score = 2 * (Precision * Recall) / (Precision + Recall)
- B) Accuracy = (TP + TN)/(All samples)
- C) AUC from ROC curve
- D) Cohen’s Kappa

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) F1 score = 2 * (Precision * Recall) / (Precision + Recall)

**Explanation:**  
F1 is the harmonic mean of precision and recall, giving them equal weighting. Accuracy can be misleading with class imbalance; ROC-AUC measures a different aspect, and Kappa is a separate agreement measure.


---

### 57. **Question:**
When generating an ROC curve, you vary a threshold on the classifier’s output probability. On the x-axis is the False Positive Rate (FPR), on the y-axis the True Positive Rate (TPR). If a point on the curve has TPR=1.0 but FPR=1.0, what does that imply?

- A) The classifier detects all positives but also wrongly labels all negatives as positives
- B) The classifier is perfectly discriminating
- C) TPR=1.0 means no false positives
- D) The classifier is predicting all instances as negative

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The classifier detects all positives but also wrongly labels all negatives as positives

**Explanation:**  
TPR=1.0 means all positives are found (no false negatives). FPR=1.0 means no true negatives remain (all negatives are mislabeled as positive). So the classifier is effectively predicting everything as positive.
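
The corner points of an ROC curve can be reproduced by sweeping the threshold over a small set of scores. The scores and labels below are made up for illustration, and `roc_point` is a helper defined only for this sketch.

```python
def roc_point(scores, labels, threshold):
    """Return (FPR, TPR) when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fp / n_neg, tp / n_pos

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # illustrative scores
labels = [1, 1, 0, 1, 0, 0]

for threshold in (0.0, 0.5, 1.1):
    print(threshold, roc_point(scores, labels, threshold))
```

A threshold of 0 labels everything positive and lands on (1, 1); a threshold above every score labels everything negative and lands on (0, 0).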


---

### 58. **Question:**
An ROC curve that is **diagonal** from (0,0) to (1,1) indicates:

- A) The classifier is random, no better than chance  
- B) The classifier is perfect with AUC=1.0  
- C) The classifier always predicts the majority class  
- D) The F1 score is guaranteed to be high  

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The classifier is random, no better than chance

**Explanation:**  
A diagonal ROC means that TPR ~ FPR at every threshold, i.e., no discriminative power beyond guessing. That yields AUC=0.5, indicating random performance.


---

### 59. **Question:**
Why might **precision** be misleadingly high on a heavily **imbalanced** dataset if you do not also check recall or other metrics?

- A) Precision only measures how many predicted positives were correct, ignoring how many positives you missed, thus a model that rarely predicts positive can get high precision but poor recall
- B) Precision always equals recall
- C) High precision ensures no false negatives
- D) With imbalance, accuracy is the only relevant metric

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Precision only measures how many predicted positives were correct, ignoring how many positives you missed, thus a model that rarely predicts positive can get high precision but poor recall

**Explanation:**  
If the model predicts very few positives (some correct), it can show high precision but may fail to capture a large portion of actual positives. That’s why looking at recall or F1 can be crucial, especially with imbalanced data.


---

### 60. **Question:**
You have two binary classifiers, both with the same F1 score. However, classifier A has higher precision but lower recall, while classifier B has the opposite. Depending on business needs, in which scenario might you prefer classifier A?

- A) If missing positives is very costly, so we want high TPR
- B) If we want minimal false positives, so high precision is key
- C) If the dataset is balanced
- D) If we only care about capturing all positives

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) If we want minimal false positives, so high precision is key

**Explanation:**  
Classifiers with high precision and lower recall produce fewer false positives but might miss more actual positives. If the priority is to avoid false alarms, one might choose the high-precision classifier.


---

### 61. **Question:**
An F1 score is extremely low, but the ROC AUC is surprisingly high. Which scenario best explains the discrepancy?

- A) F1 focuses on thresholded predictions (precision/recall), while ROC AUC integrates a range of thresholds. It’s possible the classifier’s continuous scores separate classes well overall (leading to decent AUC) but at the chosen threshold, precision or recall is poor
- B) The data must be linearly separable
- C) F1 is always larger than ROC AUC
- D) High AUC guarantees high F1

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) F1 focuses on thresholded predictions (precision/recall), while ROC AUC integrates a range of thresholds. It’s possible the classifier’s continuous scores separate classes well overall (leading to decent AUC) but at the chosen threshold, precision or recall is poor

**Explanation:**  
AUC measures how well scores rank positives vs. negatives across all thresholds. F1 is tied to a specific decision threshold. If the chosen threshold yields a poor trade-off between precision and recall, F1 can be low even if the underlying score separation is good (decent AUC).
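
The sketch below builds such a case by hand: the (made-up) scores rank every positive above every negative, yet all of them sit below a default threshold of 0.5. AUC is computed from its ranking definition, the fraction of positive/negative pairs ordered correctly.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f1_at(scores, labels, threshold):
    """F1 when predicting positive for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

scores = [0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 0, 0]

print(auc(scores, labels), f1_at(scores, labels, 0.5), f1_at(scores, labels, 0.25))
```

Moving the threshold to 0.25 recovers a high F1 without changing the model, which is why the two metrics can disagree.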


---

### 62. **Question:**
How do **micro-averaged** F1 and **macro-averaged** F1 differ in multi-class classification?

- A) Micro-F1 aggregates global true/false positives across classes, thus weighting large classes more. Macro-F1 averages the F1 of each class equally
- B) Both treat each class exactly the same
- C) Macro-F1 never uses recall
- D) Micro-F1 is always higher

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Micro-F1 aggregates global true/false positives across classes, thus weighting large classes more. Macro-F1 averages the F1 of each class equally

**Explanation:**  
Micro-averaging pools all classes’ TP/FP/FN globally, giving bigger classes more influence. Macro-averaging computes F1 per class then averages them, treating each class equally regardless of size.
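
A small numeric sketch of the two averages, using hypothetical per-class counts for one large class and one small class:

```python
def f1(tp, fp, fn):
    """F1 from counts; equivalent to the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: the large class is predicted well, the small one poorly.
counts = {
    "large": {"tp": 90, "fp": 10, "fn": 5},
    "small": {"tp": 2, "fp": 5, "fn": 10},
}

# Macro: average the per-class F1 scores, one vote per class.
macro_f1 = sum(f1(**c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, then compute a single F1.
micro_f1 = f1(
    sum(c["tp"] for c in counts.values()),
    sum(c["fp"] for c in counts.values()),
    sum(c["fn"] for c in counts.values()),
)
print(f"macro={macro_f1:.3f}, micro={micro_f1:.3f}")
```

Here the weak small class pulls macro-F1 down sharply, while micro-F1 stays close to the large class's score.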


---

### 63. **Question:**
If you see a classifier with ROC AUC=0.95 but a **precision** of only 0.40 at a chosen threshold, how could you reconcile that?

- A) The model must be random with no real discriminative power
- B) The model can separate positives vs. negatives well in ranking, but the chosen threshold leads to many false positives, lowering precision
- C) High AUC requires high precision
- D) Precision and AUC are identical metrics

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The model can separate positives vs. negatives well in ranking, but the chosen threshold leads to many false positives, lowering precision

**Explanation:**  
A high AUC means the rank ordering (score) is good overall. However, if the threshold is set in a way that yields many predicted positives (some false), precision can be relatively low, even though the overall ranking is quite discriminative.


---

### 64. **Question:**
Under severe class imbalance, the ROC curve can sometimes give an overly optimistic view of performance. Why do some prefer the **Precision-Recall** curve in that scenario?

- A) Precision-Recall focuses on the positive class: precision is not diluted by the large number of true negatives, which sit in the denominator of FPR and keep it low
- B) ROC is always invalid for imbalanced data
- C) The PR curve is less stable
- D) The negative class is never considered

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Precision-Recall focuses on the positive class: precision is not diluted by the large number of true negatives, which sit in the denominator of FPR and keep it low

**Explanation:**  
With heavily imbalanced data, the false positive rate (in ROC) can remain low simply because negatives are plentiful: FPR = FP / (FP + TN), and TN is huge. Precision = TP / (TP + FP) has no TN term, so PR curves are more sensitive to how many of the predicted positives are actually correct and can be more informative in that scenario.
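
A worked example with hypothetical counts (100 positives against 100,000 negatives) shows how a point that looks strong on the ROC plane can still have poor precision:

```python
# Hypothetical confusion-matrix counts under severe imbalance.
tp, fn = 90, 10
fp, tn = 1_000, 99_000

tpr = tp / (tp + fn)          # recall, ROC y-axis
fpr = fp / (fp + tn)          # ROC x-axis
precision = tp / (tp + fp)    # PR y-axis

print(f"TPR={tpr:.2f}, FPR={fpr:.2f}, precision={precision:.3f}")
```

The ROC point (FPR 0.01, TPR 0.90) looks excellent, yet fewer than one in ten predicted positives is correct.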


---

### 65. **Question:**
A classifier yields the following confusion matrix: TP=30, FP=10, FN=20, TN=940.  
What is the **precision**?

- A) $\tfrac{30}{30+20} = 0.60$
- B) $\tfrac{30}{30+10} = 0.75$
- C) $\tfrac{30}{30+20 +10} = 0.50$
- D) $\tfrac{30}{30+940} \approx 0.03$

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) $\tfrac{30}{30+10} = 0.75$

**Explanation:**  
Precision=TP/(TP+FP)= 30/(30+10)=30/40=0.75. This is how many predicted positives are actually positive.


---

### 66. **Question:**
Referring to the same confusion matrix (TP=30, FP=10, FN=20, TN=940), the **recall** is:

- A) $\tfrac{30}{30+10} = 0.75$
- B) $\tfrac{30}{30+20} = 0.60$
- C) $\tfrac{10}{30+20} = 0.33$
- D) $\tfrac{940}{30+20+10+940}\approx 0.93$

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) $\tfrac{30}{30+20} = 0.60$

**Explanation:**  
Recall= TP/(TP+FN)= 30/(30+20)=30/50=0.60. This measures how many actual positives were found.


---

### 67. **Question:**
Using that same confusion matrix (TP=30, FP=10, FN=20, TN=940), the **F1** score is:

- A) (0.75 * 0.60)/(0.75 + 0.60)= 0.45/1.35= 0.33
- B) 2 * (0.60 * 0.93)/(0.60 + 0.93)= 0.73
- C) 0.75 + 0.60=1.35
- D) 2 * (0.60 * 0.75)/(0.60 + 0.75)= 2*(0.45)/1.35=0.67

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** D) 2 * (0.60 * 0.75)/(0.60 + 0.75)= 2*(0.45)/1.35=0.67

**Explanation:**  
Precision=0.75, Recall=0.60. F1= 2*(P*R)/(P+R)=2*(0.75*0.60)/(0.75+0.60)=2*(0.45)/1.35=0.67. Checking the arithmetic: 0.45*2=0.90, 0.90/1.35=0.666..., ~0.67.
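
The three results for this confusion matrix can be verified in a few lines:

```python
tp, fp, fn, tn = 30, 10, 20, 940

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")
```

Accuracy is included only for contrast: it is far higher than F1 because the 940 true negatives dominate it.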


---

### 68. **Question:**
A model has an AUC of 0.99. However, at your chosen threshold, precision=0.10. Which perspective addresses this disparity?

- A) The ranking of positives vs. negatives is strong overall (high AUC), but the threshold picks many positives with numerous false positives, resulting in low precision
- B) A high AUC ensures high precision at any threshold
- C) The model has no actual discriminative ability
- D) Precision must exceed recall

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) The ranking of positives vs. negatives is strong overall (high AUC), but the threshold picks many positives with numerous false positives, resulting in low precision

**Explanation:**  
AUC reflects the model’s ability to rank samples across thresholds. At a specific threshold, we might accept many positives, including many false positives, thus a low precision. High AUC doesn’t guarantee high precision unless we adjust the threshold appropriately.

---

### 69. **Question:**
You have a dataset with 1% positive class (rare) and 99% negative class. If your classifier simply predicts “negative” for all samples, it obtains 99% accuracy. From a **class imbalance** perspective, which statement is most accurate?

- A) This accuracy is misleadingly high because the model misses every positive, indicating poor utility on the minority class
- B) The classifier must be optimal since 99% accuracy is always good
- C) The data is not imbalanced with 1% positives
- D) A balanced dataset always yields 99% accuracy

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) This accuracy is misleadingly high because the model misses every positive, indicating poor utility on the minority class

**Explanation:**  
A naive classifier can exploit the imbalance by always predicting the majority class to achieve high accuracy, but it fails to detect the crucial minority positives. Accuracy alone is not very informative in highly imbalanced scenarios.


---

### 70. **Question:**
When dealing with **highly imbalanced** data (say 5% positives), which metric often provides more insight than simple accuracy?

- A) Training error alone
- B) The fraction of predicted positives that are correct (precision), or how many actual positives are found (recall), possibly combined into F1 or reviewing PR curves
- C) The negative predictive value
- D) The confusion matrix diagonal sum

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) The fraction of predicted positives that are correct (precision), or how many actual positives are found (recall), possibly combined into F1 or reviewing PR curves

**Explanation:**  
With heavy imbalance, accuracy can be skewed by the huge negative class. Metrics focusing specifically on how well the minority positives are handled (precision, recall, F1, or PR curves) are more revealing about real performance on that underrepresented class.


---

### 71. **Question:**
You have 10,000 samples: 100 positives, 9,900 negatives. Your classifier guesses “positive” for 200 samples, among which 50 are real positives. Which statement about evaluating performance is most correct?

- A) The model’s accuracy is 98%, guaranteeing robust detection
- B) Precision for the positive predictions is 50/200 = 25%, recall is 50/100 = 50%. This indicates moderate success but still many positives missed
- C) The classifier must be random
- D) The negative class is trivial, so the model must have zero FN

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) Precision for the positive predictions is 50/200 = 25%, recall is 50/100 = 50%. This indicates moderate success but still many positives missed

**Explanation:**  
It’s a classic imbalanced scenario. Accuracy is high (9,800 of 10,000 samples correct, i.e., 98%) but says little about minority detection. Among 200 predicted positives, 50 are correct, giving 25% precision. The model also catches 50 of the 100 actual positives, giving 50% recall. So half the actual positives remain undetected, though this is still better than predicting all negatives.
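
The full confusion matrix follows from the four numbers given in the question, as this short derivation shows:

```python
n_total, n_pos = 10_000, 100      # dataset size and actual positives
n_pred_pos, tp = 200, 50          # predicted positives and correct ones among them

fp = n_pred_pos - tp              # predicted positive but actually negative
fn = n_pos - tp                   # actual positives that were missed
tn = n_total - tp - fp - fn       # everything else

accuracy = (tp + tn) / n_total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"FP={fp}, FN={fn}, TN={tn}")
print(f"accuracy={accuracy:.2%}, precision={precision:.0%}, recall={recall:.0%}")
```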


---

### 72. **Question:**
Which approach can *directly* address **class imbalance** in training?

- A) Minimizing mean squared error
- B) **Oversampling** the minority class (e.g., SMOTE) or **undersampling** the majority, giving more balanced representation to help the model pay more attention to minority
- C) Applying a standard random forest with no parameter changes
- D) Relying on a single threshold that yields the highest accuracy

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** B) **Oversampling** the minority class (e.g., SMOTE) or **undersampling** the majority, giving more balanced representation to help the model pay more attention to minority

**Explanation:**  
Methods such as oversampling the minority class (e.g., synthetic sampling) or undersampling the majority class are common ways to handle data skew, ensuring the model sees a more balanced distribution and doesn’t trivially learn to predict the majority class.
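
As a minimal sketch of the oversampling idea, the helper below duplicates randomly chosen minority samples until the classes are the same size. This is plain random oversampling, not SMOTE (which synthesizes new points by interpolation); the function name and toy data are assumptions for illustration. In practice resampling is usually applied to the training split only, so the evaluation data keeps its original class balance.

```python
import random

def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal counts."""
    rng = random.Random(seed)
    minority = [(x, y) for x, y in zip(samples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(samples, labels) if y != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = majority + minority + extra
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)

samples = list(range(100))                          # stand-in feature values
labels = [1 if i < 5 else 0 for i in range(100)]    # 5 positives, 95 negatives
xs_bal, ys_bal = random_oversample(samples, labels, minority_label=1)
print(ys_bal.count(0), ys_bal.count(1))
```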


---

### 73. **Question:**
You tune your model’s **decision threshold** specifically to improve detection of the rare positive class, even if it means more false positives. In the context of class imbalance, which direct effect occurs?

- A) Recall typically increases, but precision might drop, as you label more samples as positive
- B) Fewer positives are caught
- C) Accuracy always improves
- D) The negative class disappears

<br>
<br>
<br>
<br>
<br>
<br>

**Correct Answer:** A) Recall typically increases, but precision might drop, as you label more samples as positive

**Explanation:**  
Lowering the threshold means you’ll predict “positive” more often, capturing more actual positives (higher recall) but at the risk of additional false positives, reducing precision. This threshold tuning is common when you can’t rely solely on accuracy in imbalanced data.
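
The trade-off can be seen on a handful of made-up scores by comparing a threshold of 0.5 with a lower one of 0.35:

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10]   # illustrative
labels = [1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.5, 0.35):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```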