# Scikit-Learn — Assessment

This assessment aligns with `Scikit-Learn/Scikit-Learn.ipynb`, `Scikit-Learn/PipeLines.ipynb`, and `Scikit-Learn/NLP.ipynb`. It covers preprocessing, model selection, pipelines, metrics, and basic NLP workflows.

Total: 25 questions (10 Theory, 8 Fill-in-the-Blanks, 7 Coding). Difficulty: 40% easy, 40% medium, 20% hard.


## Instructions
- Answer all questions.
- Implement coding tasks using scikit-learn idioms and run the asserts.
- Keep function signatures intact.
- Solutions are at the bottom.


## References
- `Scikit-Learn/Scikit-Learn.ipynb`
- `Scikit-Learn/PipeLines.ipynb`
- `Scikit-Learn/NLP.ipynb`


## Part A — Theory (10)
1. What is the train/test split and why is it necessary?
2. MCQ: Which class performs standardization? (a) `MinMaxScaler` (b) `StandardScaler` (c) `Normalizer` (d) `OneHotEncoder`
3. Explain the bias-variance tradeoff in the context of model selection.
4. What is cross-validation and how does `StratifiedKFold` differ from `KFold`?
5. MCQ: Which is a proper way to avoid data leakage? (a) Fit scaler on full data (b) Fit scaler inside Pipeline (c) Scale test separately with its own fit (d) Skip scaling
6. What is the purpose of `Pipeline` and `ColumnTransformer`?
7. Explain how `GridSearchCV` works and how it integrates with pipelines.
8. Describe the difference between `precision`, `recall`, and `f1`.
9. MCQ: For text classification, converting raw text to numeric features can be done by (a) `CountVectorizer` (b) `TfidfVectorizer` (c) both (d) neither
10. When would you prefer ROC-AUC vs PR-AUC?


## Part B — Fill in the Blanks (8)
1. To split data reproducibly, pass `__________` to `train_test_split`.
2. Standardization transforms to zero mean and unit __________.
3. `OneHotEncoder` handles __________ variables.
4. In a `Pipeline`, the last step must be an __________.
5. Grid search evaluates combinations of hyperparameters using __________ validation.
6. `classification_report` includes precision, recall, and __________.
7. In a `ColumnTransformer`, columns can be selected by dtype or name pattern with `__________`, and columns not listed in any transformer are kept unchanged by setting `remainder='__________'`.
8. In NLP, `tf-idf` downweights terms that are very __________ across documents.


## Part C — Coding Tasks (7)
Implement with scikit-learn. Run asserts.

Tasks:
1. `make_standard_pipeline()` — returns a `Pipeline` that standardizes numeric features then fits a `LogisticRegression`.
2. `train_test_score(clf, X, y, test_size=0.3, seed=0)` — split data, fit, and return test accuracy.
3. `grid_search_C_logreg(X, y, Cs=(0.1,1,10))` — pipeline with `StandardScaler`+`LogisticRegression`, grid-search over `C`, return best `C`.
4. `text_pipeline_svm()` — returns a `Pipeline` with `TfidfVectorizer` and `LinearSVC`.
5. `column_transform_demo(df)` — given a DataFrame with columns `num1`, `num2`, and `cat`, build a `ColumnTransformer` that scales the numeric columns and one-hot encodes `cat`; return the transformed shape.
6. `cv_mean_accuracy(clf, X, y, k=5)` — return mean cross-val accuracy.
7. `report_metrics(y_true, y_pred)` — return dict with `precision`, `recall`, `f1` (macro).
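
Task 7 asks for macro-averaged metrics. As a minimal sketch of what "macro" means (the helper `per_class_prf` is a hypothetical name introduced only for this illustration), the snippet below computes precision, recall, and F1 per class from raw counts, takes their unweighted mean, and compares the result with scikit-learn's `precision_recall_fscore_support`.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([0, 1, 1])
y_pred = np.array([0, 1, 0])

def per_class_prf(y_true, y_pred, label):
    """Hypothetical helper: precision, recall, F1 for one class treated as 'positive'."""
    tp = np.sum((y_pred == label) & (y_true == label))
    fp = np.sum((y_pred == label) & (y_true != label))
    fn = np.sum((y_pred != label) & (y_true == label))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Macro averaging: compute each metric per class, then take the unweighted mean.
labels = np.union1d(y_true, y_pred)
per_class = np.array([per_class_prf(y_true, y_pred, c) for c in labels])
manual_macro = per_class.mean(axis=0)

p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0
)
print('manual :', manual_macro)
print('sklearn:', (p, r, f1))
```

Because every class contributes equally to the mean, macro averaging gives rare classes the same weight as common ones, unlike `average='weighted'` or `average='micro'`.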


```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def make_standard_pipeline():
    """Standardize features, then fit a logistic regression."""
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000))
    ])
    return pipe

def train_test_score(clf, X, y, test_size=0.3, seed=0):
    """Fit on the training split and return accuracy on the held-out test split."""
    # Stratify only when there is more than one class to balance.
    stratify = y if np.unique(y).size > 1 else None
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=stratify
    )
    clf.fit(Xtr, ytr)
    yp = clf.predict(Xte)
    return float(accuracy_score(yte, yp))

def grid_search_C_logreg(X, y, Cs=(0.1, 1, 10)):
    """Grid-search the regularization strength C and return the best value."""
    pipe = Pipeline([
        ('sc', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000))
    ])
    # 'clf__C' addresses parameter C of the pipeline step named 'clf'.
    gs = GridSearchCV(pipe, param_grid={'clf__C': list(Cs)}, cv=3)
    gs.fit(X, y)
    return float(gs.best_params_['clf__C'])

def text_pipeline_svm():
    """TF-IDF features followed by a linear SVM."""
    return Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('svm', LinearSVC())
    ])

def column_transform_demo(df: pd.DataFrame):
    """Scale numeric columns, one-hot encode 'cat', and return the output shape."""
    ct = ColumnTransformer([
        ('num', StandardScaler(), ['num1', 'num2']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['cat'])
    ])
    out = ct.fit_transform(df)
    return tuple(out.shape)

def cv_mean_accuracy(clf, X, y, k=5):
    """Mean accuracy over k cross-validation folds."""
    scores = cross_val_score(clf, X, y, cv=k)
    return float(scores.mean())

def report_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 as a dict."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='macro', zero_division=0
    )
    return {'precision': float(p), 'recall': float(r), 'f1': float(f1)}
```


```python
# Asserts
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

pipe = make_standard_pipeline()
acc = cv_mean_accuracy(pipe, X, y, k=3)
assert 0.7 <= acc <= 1.0

tt_acc = train_test_score(pipe, X, y, 0.2, 0)
assert 0.6 <= tt_acc <= 1.0

bestC = grid_search_C_logreg(X, y, Cs=(0.01, 0.1, 1))
assert bestC in (0.01, 0.1, 1)

tp = text_pipeline_svm()
docs = ['cat sat on mat', 'dog chased cat', 'bird flew']
labels = [0, 1, 2]
tp.fit(docs, labels)
# Check that the fitted pipeline actually predicts one label per document.
assert len(tp.predict(docs)) == len(docs)

df = pd.DataFrame({'num1': [1, 2, 3], 'num2': [4, 5, 6], 'cat': ['a', 'b', 'a']})
shape = column_transform_demo(df)
# 3 rows; 2 scaled numeric columns + 2 one-hot columns for categories 'a' and 'b'.
assert shape == (3, 4)

met = report_metrics([0, 1, 1], [0, 1, 0])
assert set(met.keys()) == {'precision', 'recall', 'f1'}

print('Scikit-Learn asserts passed ✅')
```


## Solutions

### Theory (sample)
1. Splitting holds out a test set so the model is evaluated on data it did not see during training, which gives an estimate of generalization.
2. (b) `StandardScaler`
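3. High bias means underfitting (the model is too simple); high variance means overfitting (the model is too sensitive to the training sample). Model selection balances the two through model capacity and regularization.
4. Cross-validation splits the data into k folds, trains on k-1 of them, validates on the held-out fold, and averages the scores over all folds. `StratifiedKFold` preserves label proportions in each fold; `KFold` ignores the labels when splitting.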
5. (b) Fit scalers within a Pipeline so fit occurs only on training folds.
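6. `Pipeline` chains preprocessing steps and a final estimator into a single object; `ColumnTransformer` applies different transformers to different subsets of columns.
7. It fits the estimator for every combination in the parameter grid, scores each combination with cross-validation, and by default refits the best one on all the data; with pipelines, grid keys of the form `step__param` tune individual steps.
8. Precision: fraction of predicted positives that are correct, TP / (TP + FP); recall: fraction of actual positives that are found, TP / (TP + FN); f1: harmonic mean of precision and recall.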
9. (c) both
10. ROC-AUC when classes are roughly balanced or both classes matter equally; PR-AUC when the positive class is rare and is the one you care about.
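
As a rough illustration of answer 4, the sketch below counts the classes that land in each test fold for `KFold` and `StratifiedKFold` on the iris data, whose labels are stored sorted by class. The helper name `class_counts_per_test_fold` is hypothetical, and neither splitter shuffles here, which is what makes the contrast visible.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold

X, y = load_iris(return_X_y=True)  # 150 samples, 50 per class, sorted by label

def class_counts_per_test_fold(splitter, X, y):
    """Hypothetical helper: per-class sample counts in each test fold."""
    return [
        np.bincount(y[test_idx], minlength=3)
        for _, test_idx in splitter.split(X, y)
    ]

# Without shuffling, KFold cuts the sorted data into contiguous blocks.
kfold_counts = class_counts_per_test_fold(KFold(n_splits=3), X, y)
# StratifiedKFold allocates each class across the folds.
strat_counts = class_counts_per_test_fold(StratifiedKFold(n_splits=3), X, y)

for name, counts in [('KFold', kfold_counts), ('StratifiedKFold', strat_counts)]:
    print(name)
    for fold, c in enumerate(counts):
        print(f'  test fold {fold}: class counts {c.tolist()}')
```

With unshuffled `KFold`, each test fold holds a single class the model never saw in training, so accuracy collapses; passing `shuffle=True` or using `StratifiedKFold` avoids this.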

### Fill blanks
1. `random_state`
2. variance
3. categorical
4. estimator
5. cross
6. f1-score
7. `make_column_selector` (selects columns by dtype or name pattern); `remainder='passthrough'` keeps columns that are not listed in any transformer.
8. frequent
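
The two ideas in answer 7 can be sketched together. This is a minimal example under stated assumptions: the `flag` column is a hypothetical extra column added for illustration, the regex `'^num'` is just one way to pick the numeric columns by name, and `sparse_threshold=0` is set only so the output is always a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'num1': [1.0, 2.0, 3.0],
    'num2': [4.0, 5.0, 6.0],
    'cat': ['a', 'b', 'a'],
    'flag': [0, 1, 0],  # hypothetical column, not named in any transformer
})

ct = ColumnTransformer(
    [
        # Select every column whose name starts with 'num'.
        ('num', StandardScaler(), make_column_selector(pattern='^num')),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['cat']),
    ],
    remainder='passthrough',  # keep 'flag' unchanged instead of dropping it
    sparse_threshold=0,       # always return a dense array
)

out = ct.fit_transform(df)
print(out.shape)
print(out)
```

Output columns follow the transformer order (scaled `num1` and `num2`, then the one-hot columns for `cat`), with the passed-through remainder appended last. With the default `remainder='drop'`, `flag` would be absent from the output.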