# Machine Learning — Assessment

This assessment aligns with materials in `Machine-Learning/` (e.g., Linear/Polynomial/Ridge/Lasso/ElasticNet/KNN/SVM/Tree/Ensembles/GB/HistGB/XGBoost notebooks, and datasets such as winequality). Focus: supervised regression workflow, preprocessing, model selection, metrics, bias-variance.

Total questions: 25 (10 Theory, 8 Fill-in-the-Blanks, 7 Coding). Difficulty mix: 40% easy, 40% medium, 20% hard.


## Instructions
- Answer all questions.
- Implement coding tasks using scikit-learn idioms; run asserts.
- Keep signatures unchanged.
- Solutions at the bottom.


## References
- All notebooks within `Machine-Learning/`
- `winequality-*.csv`


## Part A — Theory (10)
1. What is the difference between parametric and non-parametric models? Give one example each.
2. MCQ: Which metric is scale-sensitive and penalizes large errors more? (a) MAE (b) RMSE (c) R^2 (d) Accuracy
3. Explain bias-variance tradeoff and relate it to model capacity and regularization.
4. Why is it important to perform feature scaling for KNN and SVR?
5. MCQ: Which cross-validation strategy preserves the class proportions of the target in every fold for classification? (a) KFold (b) StratifiedKFold (c) GroupKFold (d) TimeSeriesSplit
6. When would you use `PolynomialFeatures`? What are the risks?
7. Compare Lasso vs Ridge regularization effects on coefficients.
8. What is feature leakage and how can pipelines help prevent it?
9. MCQ: Which ensemble reduces variance by averaging many decorrelated models? (a) AdaBoost (b) RandomForest (c) GradientBoosting (d) Logistic Regression
10. Explain why `R^2` can be negative on the test set.


## Part B — Fill in the Blanks (8)
1. RMSE is the square root of the __________.
2. Standardizing features rescales them to zero mean and unit __________.
3. In `train_test_split`, passing `random_state` ensures __________ splits.
4. `Pipeline` ensures that transformations are fit only on the __________ data within CV.
5. Lasso tends to drive some coefficients exactly to __________.
6. Tree-based models are generally __________ to feature scaling.
7. For time-ordered data, prefer __________ cross-validation.
8. Hyperparameter search over a discrete grid is implemented by `__________` in scikit-learn.


## Part C — Coding Tasks (7)
Use scikit-learn and NumPy. We'll synthesize small data for asserts to avoid external files.

Tasks:
1. `rmse(y_true, y_pred)` — return RMSE.
2. `scale_then_knn_reg(X, y, k)` — pipeline: StandardScaler + KNeighborsRegressor; return 5-fold CV mean RMSE (negative MSE route).
3. `linreg_r2(X, y)` — fit LinearRegression, return test R^2 using 80/20 split (random_state=0).
4. `poly_ridge_score(X, y, degree, alpha)` — pipeline: PolynomialFeatures(degree), StandardScaler, Ridge(alpha); return 3-fold CV mean R^2.
5. `feature_importance_rf(X, y, n)` — fit RandomForestRegressor; return indices of top-n features by importance (desc).
6. `standardize_columns(X)` — return standardized array (columnwise) with ddof=0; if std=0, output zeros for that column.
7. `grid_search_svr(X, y, Cs, gammas)` — pipeline StandardScaler+SVR with grid over `C` and `gamma`; return best params dict.


In [None]:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred)**2)))

def scale_then_knn_reg(X, y, k=5):
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('knn', KNeighborsRegressor(n_neighbors=k))
    ])
    # 'neg_mean_squared_error' returns the negated MSE per fold (higher is better);
    # negate it and take the square root to get each fold's RMSE
    scores = cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_squared_error')
    rmses = np.sqrt(-scores)
    return float(rmses.mean())

def linreg_r2(X, y):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=0)
    lr = LinearRegression().fit(Xtr, ytr)
    yp = lr.predict(Xte)
    return float(r2_score(yte, yp))

def poly_ridge_score(X, y, degree=2, alpha=1.0):
    pipe = Pipeline([
        ('poly', PolynomialFeatures(degree=degree, include_bias=False)),
        ('sc', StandardScaler()),
        ('rg', Ridge(alpha=alpha))
    ])
    scores = cross_val_score(pipe, X, y, cv=3, scoring='r2')
    return float(scores.mean())

def feature_importance_rf(X, y, n=3, seed=0):
    rf = RandomForestRegressor(n_estimators=50, random_state=seed)
    rf.fit(X, y)
    imp = rf.feature_importances_
    idx = np.argsort(-imp)[:n]
    return idx.tolist()

def standardize_columns(X):
    X = np.asarray(X, float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd_safe = np.where(sd==0, 1.0, sd)
    out = (X - mu) / sd_safe
    out[:, sd==0] = 0.0
    return out

def grid_search_svr(X, y, Cs=(0.1,1,10), gammas=(0.01,0.1,1.0)):
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('svr', SVR())
    ])
    param_grid = {'svr__C': list(Cs), 'svr__gamma': list(gammas)}
    gs = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error')
    gs.fit(X, y)
    return {'C': float(gs.best_params_['svr__C']), 'gamma': float(gs.best_params_['svr__gamma'])}


In [None]:
# Asserts
rs = np.random.RandomState(0)
X = rs.randn(120, 5)
coef = np.array([1.5, -2.0, 0.0, 0.5, 0.0])
y = X @ coef + rs.randn(120)*0.3

assert round(rmse([0,0,0],[1,1,1]), 5) == round(np.sqrt(1.0), 5)

knn_rmse = scale_then_knn_reg(X, y, 3)
assert knn_rmse > 0

r2 = linreg_r2(X, y)
assert r2 <= 1.0  # R^2 is bounded above by 1 but has no lower bound

score = poly_ridge_score(X, y, 2, 1.0)
assert score <= 1.0  # same upper bound applies to the CV mean

top2 = feature_importance_rf(X, y, 2)
assert len(top2) == 2

stdX = standardize_columns(X)
assert np.allclose(stdX.mean(axis=0), 0, atol=1e-7)

best = grid_search_svr(X, y, Cs=(0.1,1.0), gammas=(0.01,0.1))
assert set(best.keys()) == {'C','gamma'}

print('Machine-Learning asserts passed ✅')
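
The asserts above do not exercise the zero-variance rule from task 6. As a possible extra check (not part of the original asserts), the sketch below uses `StandardScaler` as a reference: by default it divides by the population standard deviation and maps constant columns to zeros, which matches the task description. The array `X_demo` is a made-up example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 varies, column 1 is constant (std = 0).
X_demo = np.array([[1.0, 5.0],
                   [2.0, 5.0],
                   [3.0, 5.0]])

# Reference result from scikit-learn's scaler.
ref = StandardScaler().fit_transform(X_demo)

# Manual standardization of the varying column with ddof=0.
col0 = X_demo[:, 0]
manual_col0 = (col0 - col0.mean()) / col0.std(ddof=0)

print(ref)
```

If `standardize_columns(X_demo)` is implemented to the spec, `np.allclose(standardize_columns(X_demo), ref)` should hold.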


## Solutions

### Theory (sample)
1. Parametric models (e.g., Linear Regression) have a fixed number of parameters regardless of training-set size; non-parametric models (e.g., KNN) have an effective complexity that grows with the data.
2. (b) RMSE
3. As model capacity increases, bias tends to decrease and variance tends to increase; regularization lowers effective capacity, trading a little bias for reduced variance.
4. Distance-based models depend on scale; unscaled features distort distances.
5. (b) StratifiedKFold
6. When the relationship between features and target is nonlinear; risks: overfitting, multicollinearity, and a feature count that grows rapidly with degree.
7. Lasso: sparse coefficients; Ridge: shrinkage without sparsity.
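8. Leakage occurs when information unavailable at prediction time (e.g., statistics computed on validation or test folds) influences training; a pipeline refits each transform on the training portion of every CV split only.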
9. (b) RandomForest
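10. `R^2 = 1 - SS_res / SS_tot`, where `SS_tot` is the squared error of always predicting the test-set mean; a model that generalizes poorly can have `SS_res > SS_tot`, which makes `R^2` negative.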
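
Answer 10 can be checked by hand on a tiny case. The sketch below uses made-up numbers (not taken from the course notebooks) in which the predictions run opposite to the targets, so the residual sum of squares exceeds the total sum of squares and `R^2` falls below zero.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only: predictions are anti-correlated with the targets.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([3.0, 2.0, 1.0])

ss_res = float(np.sum((y_true - y_pred) ** 2))         # 4 + 0 + 4 = 8
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # 1 + 0 + 1 = 2
r2_manual = 1.0 - ss_res / ss_tot                      # 1 - 8/2 = -3

# scikit-learn computes the same quantity.
r2_sklearn = float(r2_score(y_true, y_pred))
print(r2_manual, r2_sklearn)
```

Predicting the mean (2.0) for every sample would give `R^2 = 0` here, so this model is worse than that baseline.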

### Fill blanks
1. MSE
2. variance
3. reproducible
4. training
5. zero
6. insensitive
7. TimeSeriesSplit
8. GridSearchCV
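
Blank 7 can be illustrated with a short sketch. It assumes eight time-ordered samples (an arbitrary toy size) and collects the folds produced by `TimeSeriesSplit`, so you can see that each fold trains only on indices that come before its test indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy data: row i is the observation at time step i.
X_time = np.arange(8).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in tscv.split(X_time)]

for train_idx, test_idx in folds:
    print('train:', train_idx, 'test:', test_idx)
```

Unlike shuffled `KFold`, no fold here lets the model see observations from after its test window.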
