# Quiz Bank — Data Cleaning, Preprocessing & Transformation (13 Steps × 5 = 65 Qs)

This version is **fully visible**: each question is numbered (Q1..Q65) and followed by an **Answer Key** immediately below.
Where useful, you'll also find a **starter code cell** to try the solution.

> Tip: Run the bootstrap cell below if you want a ready-made `df` to experiment with.

### Bootstrap (optional)
Creates a synthetic `df` similar to the training notebook.

In [None]:
import numpy as np, pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, KBinsDiscretizer, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif, VarianceThreshold, SelectKBest, RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, f1_score
from sklearn.neighbors import LocalOutlierFactor
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 600
df = pd.DataFrame({
    "customer_id": np.arange(n),
    "age": rng.integers(18, 70, size=n).astype(float),  # float so missing values can be injected below without a dtype change
    "income": rng.normal(60000, 18000, size=n).round(0),
    "tenure_years": rng.exponential(3, size=n).round(2),
    "segment": rng.choice(["A","B","C"], size=n, p=[0.5,0.3,0.2]),
    "region": rng.choice(["North","South","East","West"], size=n),
    "joined_on": pd.to_datetime("2019-01-01") + pd.to_timedelta(rng.integers(0, 2200, size=n), unit="D"),
    "notes": rng.choice(["loves discounts", "calls often", "email only", "no preference"], size=n),
})
# inject issues
df.loc[rng.choice(n, 35, replace=False), "income"] = np.nan
df.loc[rng.choice(n, 15, replace=False), "age"] = None
df.loc[rng.choice(n, 8, replace=False), "tenure_years"] = 120
df = pd.concat([df, df.sample(10, random_state=1)], ignore_index=True)  # duplicates
# simple target
df["churn"] = ((df["income"].fillna(50000) < 55000) | (df["tenure_years"] < 1.5) | (df["segment"]=="C")).astype(int)
print("df shape:", df.shape)
df.head(3)

## Step 1 — Schema & Dtype Validation

**Q1. Coerce `income` like '40,000' to numeric without crashing; count how many became NaN.**

**Answer Key (Q1)**

In [None]:
before = df["income"].isna().sum()
df["income"] = df["income"].astype(str).str.replace(",", "", regex=False)
df["income"] = pd.to_numeric(df["income"], errors="coerce")
after = df["income"].isna().sum()
print("New NaNs introduced:", after - before)

**Q2. Parse `joined_on` mixing `YYYY-MM-DD` and `DD/MM/YYYY` robustly; keep unparseable as NaT.**

**Answer Key (Q2)**

In [None]:
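iso = pd.to_datetime(df["joined_on"], format="%Y-%m-%d", errors="coerce")  # pass 1: ISO dates; everything else is NaT for now
df["joined_on"] = iso.fillna(pd.to_datetime(df["joined_on"], format="%d/%m/%Y", errors="coerce"))  # pass 2: day-first dates; leftovers stay NaT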

**Q3. Safely convert a float column with NaNs to integer.**

**Answer Key (Q3)**

In [None]:
df["age"] = pd.to_numeric(df["age"], errors="coerce")
df["age"] = df["age"].astype("Int64")  # nullable integer

**Q4. Auto-upgrade dtypes; why useful? Show code.**

**Answer Key (Q4)**

In [None]:
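df_typed = df.convert_dtypes()  # nullable dtypes (Int64, string, boolean) keep integers and booleans intact when values are missing
# Kept separate from df: text columns become `string` dtype, which select_dtypes(include=["object"]) in later cells would skip.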

**Q5. `customer_id` came as float 101.0; make it strict integer keys.**

**Answer Key (Q5)**

In [None]:
df["customer_id"] = pd.to_numeric(df["customer_id"], downcast="integer")
assert pd.api.types.is_integer_dtype(df["customer_id"])

## Step 2 — Text Normalization

**Q6. Normalize `region` so 'North ' and 'north' collapse into one category.**

**Answer Key (Q6)**

In [None]:
df["region"] = df["region"].astype(str).str.strip().str.lower()

**Q7. In `notes`, collapse multiple spaces to one; keep lower-case.**

**Answer Key (Q7)**

In [None]:
df["notes"] = df["notes"].astype(str).str.replace(r"\s+", " ", regex=True).str.lower()

**Q8. Why normalize case before one-hot encoding?**

**Answer Key (Q8)**

To avoid category explosion; 'North' vs 'north' should not become separate dummy columns.
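
A minimal sketch of the effect, using a small made-up `region` series (the values are illustrative, not taken from the bootstrap `df`): each distinct spelling gets its own dummy column until the text is normalized.

```python
import pandas as pd

# Hypothetical messy values: three spellings of one region plus a second region.
raw = pd.Series(["North", "north", "North ", "South"], name="region")

# Encoding the raw strings yields one dummy column per distinct spelling.
raw_dummies = pd.get_dummies(raw)

# Stripping whitespace and lowercasing first collapses the spellings.
clean = raw.str.strip().str.lower()
clean_dummies = pd.get_dummies(clean)

print(raw_dummies.shape[1], "columns before,", clean_dummies.shape[1], "after")
```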

**Q9. Clean and print unique values for ['NORTH','South','  East '].**

**Answer Key (Q9)**

In [None]:
vals = pd.Series(['NORTH','South','  East ']).str.strip().str.lower().unique().tolist()
print(vals)

**Q10. Risk of lowercasing product reviews before sentiment?**

**Answer Key (Q10)**

Loss of signal from proper nouns/emphasis; consider case-aware features for some tasks.

## Step 3 — De-duplication

**Q11. Two rows match on age/region/segment but differ on income. Drop or keep? Justify.**

**Answer Key (Q11)**

It depends on how identity is defined. If those three fields are the identity key, resolve the conflict with an explicit rule (e.g., keep the latest by timestamp). Otherwise keep both rows; they capture real variability.

**Q12. Count duplicates of customer_id; keep latest by joined_on.**

**Answer Key (Q12)**

In [None]:
dup_counts = df["customer_id"].value_counts()
dupe_ids = dup_counts[dup_counts > 1].index
df_sorted = df.sort_values("joined_on")
dedup = df_sorted.drop_duplicates(subset=["customer_id"], keep="last")
print("duplicates:", len(dupe_ids))

**Q13. Why df.duplicated().sum()==0 yet visual duplicates exist?**

**Answer Key (Q13)**

Hidden whitespace/casing/type/precision differences; rows are not byte-identical.
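
The following sketch illustrates this with three made-up rows that look identical on screen. The column names and the rounding precision are assumptions chosen for the example.

```python
import pandas as pd

# Three rows a human would call duplicates, differing only in case,
# trailing whitespace, and a tiny floating-point discrepancy.
rows = pd.DataFrame({
    "region": ["North", "north ", "North"],
    "income": [50000.0, 50000.0, 50000.0000001],
})

# Exact comparison finds nothing.
exact_dupes = int(rows.duplicated().sum())

# Normalize text and round floats, then compare again.
normalized = rows.assign(
    region=rows["region"].str.strip().str.lower(),
    income=rows["income"].round(2),
)
hidden_dupes = int(normalized.duplicated().sum())

print("exact:", exact_dupes, "after normalization:", hidden_dupes)
```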

**Q14. When keep='first' vs keep='last'?**

**Answer Key (Q14)**

`keep='first'` retains the earliest record and `keep='last'` the most recent one, in both cases relative to the current sort order, so sort by a timestamp first.

**Q15. Drop duplicate rows ignoring case in region.**

**Answer Key (Q15)**

In [None]:
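tmp = df.assign(region_norm=df["region"].astype(str).str.strip().str.lower())
key_cols = [c for c in tmp.columns if c != "region"]  # compare on the normalized region, not the raw one
df_nodup = tmp.drop_duplicates(subset=key_cols).drop(columns=["region_norm"])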

## Step 4 — Missing Values

**Q16. 15% missing in income. Mean vs median?**

**Answer Key (Q16)**

Median is more robust to skew/outliers, typically preferred for income.
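
As a quick illustration with made-up numbers (not the bootstrap `df`), a single extreme income drags the mean far from the typical value while the median barely moves:

```python
import numpy as np

# Five ordinary incomes and one extreme value (hypothetical data).
incomes = np.array([40_000, 45_000, 50_000, 55_000, 60_000, 2_000_000], dtype=float)

mean_fill = incomes.mean()        # pulled upward by the single outlier
median_fill = np.median(incomes)  # stays near the typical incomes

print("mean fill:", mean_fill, "median fill:", median_fill)
```

Imputing with the mean here would assign every missing row an income that no typical customer has.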

**Q17. Add boolean was_missing_income.**

**Answer Key (Q17)**

In [None]:
df["was_missing_income"] = df["income"].isna().astype(int)

**Q18. Why might KNNImputer beat SimpleImputer for height/weight?**

**Answer Key (Q18)**

Leverages local correlations; imputes contextually instead of a single global statistic.

**Q19. How can constant imputation (-999) bias linear models?**

**Answer Key (Q19)**

Creates artificial separation; acts like a hidden feature, skewing coefficients.
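
A rough sketch of the distortion, on synthetic data where the true slope is 2 and values go missing completely at random (both are assumptions of this example):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(50, 10, size=500)
y = 2.0 * x + rng.normal(0, 1, size=500)   # true slope is 2
missing = rng.random(500) < 0.1            # about 10% missing at random

# Sentinel imputation: missing rows sit roughly 1000 units from the real data.
x_sentinel = np.where(missing, -999.0, x)
# Median imputation for comparison.
x_median = np.where(missing, np.median(x[~missing]), x)

slope_sentinel = LinearRegression().fit(x_sentinel.reshape(-1, 1), y).coef_[0]
slope_median = LinearRegression().fit(x_median.reshape(-1, 1), y).coef_[0]

print("slope with -999:", round(slope_sentinel, 4))
print("slope with median:", round(slope_median, 4))
```

The sentinel rows dominate the variance of the feature, so the fitted slope collapses toward zero even though the underlying relationship is unchanged.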

**Q20. Median impute ↑ accuracy but ↓ recall. Why?**

**Answer Key (Q20)**

Variance shrinkage can blur minority-class signals; classifier becomes conservative.

## Step 5 — Outliers

**Q21. Cap tenure_years at the 99th percentile (code).**

**Answer Key (Q21)**

In [None]:
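cap = df["tenure_years"].quantile(0.99)
df["tenure_years"] = df["tenure_years"].clip(upper=cap)
# Note: in the bootstrap df more than 1% of rows carry the injected value 120, so the cap is 120 and nothing changes; use a lower quantile or an IQR fence there.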

**Q22. Why prefer RobustScaler with outliers?**

**Answer Key (Q22)**

Uses median/IQR; less sensitive than mean/std used by StandardScaler.
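
To make the difference concrete, here is a small sketch on a toy column (five ordinary values and one extreme value, chosen only for illustration) that compares how much spread the ordinary values keep after each scaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five typical values and one extreme outlier (hypothetical data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

std_scaled = StandardScaler().fit_transform(x).ravel()
rob_scaled = RobustScaler().fit_transform(x).ravel()

# Range covered by the five typical points after scaling.
std_spread = std_scaled[:5].max() - std_scaled[:5].min()
rob_spread = rob_scaled[:5].max() - rob_scaled[:5].min()

print("StandardScaler spread:", round(std_spread, 4))
print("RobustScaler spread:", round(rob_spread, 4))
```

With StandardScaler the outlier inflates the standard deviation, squeezing the typical points into a narrow band. RobustScaler leaves them on a usable scale.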

**Q23. LOF flags 2% rows. Always drop?**

**Answer Key (Q23)**

No. Inspect them first: rare but valid events (e.g., fraud) may be informative.

**Q24. Are outliers always bad in fraud detection?**

**Answer Key (Q24)**

No. They may be the positive class itself (the fraud signal).

**Q25. Visualize income before/after capping.**

**Answer Key (Q25)**

In [None]:
ax = df["income"].hist(bins=40); plt.title("Income — Raw"); plt.show()
cap = df["income"].quantile(0.99)
df_cap = df.copy(); df_cap["income"] = df_cap["income"].clip(upper=cap)
df_cap["income"].hist(bins=40); plt.title("Income — Capped"); plt.show()

## Step 6 — Business Rule Validation

**Q26. Raise error if age < 18 (code).**

**Answer Key (Q26)**

In [None]:
# Check only known ages; filling missing values with 0 would trip the assertion on every NaN.
assert (df["age"].dropna() >= 18).all(), "Found minors in dataset!"

**Q27. Why are assert checks useful at ingestion?**

**Answer Key (Q27)**

Fail fast; catch schema/content issues before expensive modeling.

**Q28. segment has unexpected 'D'. First step?**

**Answer Key (Q28)**

Flag and investigate root cause; avoid silent fixes that hide upstream issues.

**Q29. Schema vs data validation?**

**Answer Key (Q29)**

Schema: structure/types/ranges. Data: content/business rules/anomalies.

**Q30. Function that returns violations for income and age.**

**Answer Key (Q30)**

In [None]:
def validate(df):
    errs = {}
    if not (df["income"].fillna(0) >= 0).all(): errs["income"] = "negative/inconsistent"
    if not (df["age"].fillna(0) >= 0).all(): errs["age"] = "negative"
    return errs
validate(df)

## Step 7 — Categorical Encoding

**Q31. Why can ordinal encoding mislead linear models for education levels?**

**Answer Key (Q31)**

It imposes numeric spacing/order that may not reflect true effects.

**Q32. One-hot encode region, drop first category (code).**

**Answer Key (Q32)**

In [None]:
ohe = OneHotEncoder(drop="first", handle_unknown="ignore", sparse_output=False)
X_region = ohe.fit_transform(df[["region"]])
ohe.get_feature_names_out(["region"])

**Q33. Risk of one-hot encoding city with 10k levels?**

**Answer Key (Q33)**

Dimensionality blow-up, overfitting, memory; consider hashing/target encoding.

**Q34. When target encoding?**

**Answer Key (Q34)**

High-cardinality categoricals with careful CV-based smoothing to avoid leakage.
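
One way to sketch the idea is an out-of-fold smoothed mean encoder. The helper `oof_target_encode`, its `smoothing` parameter, and the synthetic `city` data are all hypothetical names for this illustration, not a library API. Each row is encoded using target statistics from the other folds only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, smoothing=10.0, seed=0):
    """Out-of-fold smoothed mean encoding (illustrative helper)."""
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True).astype(float)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(cat):
        # Statistics come from the fitting folds only.
        prior = y.iloc[fit_idx].mean()
        stats = y.iloc[fit_idx].groupby(cat.iloc[fit_idx]).agg(["sum", "count"])
        # Shrink each category mean toward the prior; rare levels shrink most.
        smooth = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Levels unseen in the fitting folds fall back to the prior.
        encoded.iloc[enc_idx] = cat.iloc[enc_idx].map(smooth).fillna(prior).to_numpy()
    return encoded

# Synthetic high-cardinality feature and binary target.
rng = np.random.default_rng(0)
city = rng.choice([f"city_{i}" for i in range(50)], size=500)
target = rng.integers(0, 2, size=500)
city_encoded = oof_target_encode(city, target)
print(city_encoded.describe())
```

For a held-out test set, the encoding would instead be fitted once on the full training data and applied to the test rows.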

**Q35. Why label encoding ok for trees but not linear models?**

**Answer Key (Q35)**

Trees split by thresholds; linear models treat labels as continuous values.

## Step 8 — Scaling

**Q36. Apply StandardScaler to numeric columns (code).**

**Answer Key (Q36)**

In [None]:
# Scale features only: leave out the identifier and the target.
num_cols = df.select_dtypes(include=["number"]).columns.difference(["customer_id", "churn"]).tolist()
scaler = StandardScaler()
df_scaled = df.copy()
df_scaled[num_cols] = scaler.fit_transform(df_scaled[num_cols])

**Q37. Preferred scaler for salary with extreme outliers?**

**Answer Key (Q37)**

RobustScaler (median/IQR).

**Q38. Why scaling matters for PCA?**

**Answer Key (Q38)**

PCA is variance-based; unscaled high-variance features dominate components.
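
A small sketch with two independent synthetic features (an income-like column and an age-like column, both invented for the example) shows how the first component absorbs almost all variance until the data is standardized:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(60_000, 18_000, size=500),  # income-like: very large variance
    rng.normal(40, 12, size=500),          # age-like: small variance
])

raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:", raw_ratio.round(4))
print("scaled:  ", scaled_ratio.round(4))
```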

**Q39. Does scaling matter for trees?**

**Answer Key (Q39)**

Generally no; trees use order, not distances.

**Q40. Compare histograms pre/post scaling (code).**

**Answer Key (Q40)**

In [None]:
df[num_cols].hist(figsize=(10,6)); plt.suptitle("Before"); plt.show()
df_scaled[num_cols].hist(figsize=(10,6)); plt.suptitle("After"); plt.show()

## Step 9 — Datetime Features

**Q41. Extract year, month, weekday from joined_on (code).**

**Answer Key (Q41)**

In [None]:
dt = pd.to_datetime(df["joined_on"], errors="coerce")
df["year"] = dt.dt.year; df["month"] = dt.dt.month; df["weekday"] = dt.dt.weekday

**Q42. Why days_since_signup beats raw date?**

**Answer Key (Q42)**

Captures recency/tenure signal rather than arbitrary calendar stamps.
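
A minimal sketch of the feature, assuming a fixed snapshot date `as_of` (a hypothetical choice; in practice it would be the scoring or extraction date):

```python
import pandas as pd

# Hypothetical signup dates, including one missing value.
joined = pd.Series(pd.to_datetime(["2019-01-01", "2023-06-15", None]))
as_of = pd.Timestamp("2024-01-01")  # assumed snapshot date

# Elapsed days; missing signup dates stay missing.
days_since_signup = (as_of - joined).dt.days
print(days_since_signup)
```

Fixing `as_of` rather than calling "now" keeps the feature reproducible between training and scoring runs.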

**Q43. Create is_weekend (code).**

**Answer Key (Q43)**

In [None]:
df["is_weekend"] = (dt.dt.weekday >= 5).astype(int)

**Q44. Why quarter useful in retail?**

**Answer Key (Q44)**

Seasonality and quarterly cycles (promotions, budgets).

**Q45. Pitfall of one-hot encoding year?**

**Answer Key (Q45)**

Risk of temporal leakage and overfitting: years act as a proxy for trend rather than as stable categories, and a year first seen at inference has no dummy column at all.

## Step 10 — Binning

**Q46. Bin income into 5 quantile bins using KBinsDiscretizer (code).**

**Answer Key (Q46)**

In [None]:
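kb = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
mask = df["income"].notna()  # KBinsDiscretizer rejects NaN, so bin only the known incomes
df.loc[mask, "income_bin"] = kb.fit_transform(df.loc[mask, ["income"]]).ravel()
df["income_bin"] = df["income_bin"].astype("Int64")  # nullable integer keeps missing rows as <NA>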

**Q47. How binning helps logistic regression?**

**Answer Key (Q47)**

Introduces threshold/piecewise effects approximating non-linear relationships.

**Q48. pd.cut vs pd.qcut for age — which equalizes counts?**

**Answer Key (Q48)**

pd.qcut (quantile-based).
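
The contrast can be sketched on a skewed synthetic age column (generated here only for illustration): equal-width bins from `pd.cut` end up lopsided, while `pd.qcut` places the same number of rows in each bin.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = pd.Series(rng.exponential(10, size=1000) + 18)  # right-skewed ages

# Equal-width bins: counts follow the skew of the data.
width_counts = pd.cut(age, bins=4).value_counts(sort=False)
# Quantile bins: edges are placed so counts are equal.
quant_counts = pd.qcut(age, q=4).value_counts(sort=False)

print(width_counts.tolist())
print(quant_counts.tolist())
```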

**Q49. Why is binning common in credit scoring?**

**Answer Key (Q49)**

Stability, interpretability, and support for monotonic weight-of-evidence (WOE) and information-value (IV) transformations.

**Q50. Drawback of fixed bins across populations?**

**Answer Key (Q50)**

Drift/shift causes uneven coverage; bins go stale across cohorts.

## Step 11 — Feature Selection

**Q51. Why remove near-constant features?**

**Answer Key (Q51)**

Low information content; add noise/compute overhead.

**Q52. Top-5 features via MI (code).**

**Answer Key (Q52)**

In [None]:
numX = df.select_dtypes(include=["number"]).drop(columns=["churn"], errors="ignore")
y = df["churn"].astype(int)
mi = mutual_info_classif(numX.fillna(numX.median()), y, random_state=42)
order = mi.argsort()[::-1][:5]
top5 = numX.columns[order].tolist(); top5

**Q53. How MI differs from correlation?**

**Answer Key (Q53)**

MI captures any (nonlinear) dependency; correlation measures linear association only.
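
A short sketch on synthetic data where `y` depends on `x` through a symmetric quadratic (an assumption chosen so the linear correlation is near zero). It uses `mutual_info_regression` because the target here is continuous.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=2000)
y = x ** 2 + rng.normal(0, 0.01, size=2000)  # strong but non-linear dependency

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print("Pearson r:", round(pearson, 3))
print("Mutual information:", round(mi, 3))
```

A correlation filter would discard `x` here, while an MI-based ranking would keep it.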

**Q54. When prefer wrappers like RFECV?**

**Answer Key (Q54)**

When model-aware selection and interactions matter for performance.

**Q55. Risk of selecting features before split?**

**Answer Key (Q55)**

Leakage: metrics are inflated because the selector has already seen information from the whole dataset, including the rows later used for validation.
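
The effect can be sketched on pure noise, where no feature carries real signal. Everything below is synthetic, and the feature counts are arbitrary choices for the demonstration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: the selector sees every row, including future validation rows.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection is refitted inside each training fold.
pipe = Pipeline([
    ("sel", SelectKBest(f_classif, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print("selection before CV:", round(leaky, 3))
print("selection inside CV:", round(honest, 3))
```

Since the labels are random, any accuracy well above chance in the first variant comes from leakage alone.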

## Step 12 — Leakage-Safe Pipelines

**Q56. Why transforms must live inside CV folds?**

**Answer Key (Q56)**

To prevent leakage: validation data must remain unseen by fitted transformers.

**Q57. Build pipeline: impute → scale → logistic regression (code).**

**Answer Key (Q57)**

In [None]:
num_cols = df.select_dtypes(include=["number"]).drop(columns=["churn"], errors="ignore").columns.tolist()
cat_cols = df.select_dtypes(include=["object","category"]).columns.tolist()
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
pipe = Pipeline([("pre", pre), ("clf", LogisticRegression(max_iter=300))])
scores = cross_val_score(pipe, df.drop(columns=["churn"]), df["churn"], cv=5, scoring="f1")
scores.mean().round(3)

**Q58. Fitting transformer on full data vs training only — difference?**

**Answer Key (Q58)**

Full-data fit leaks validation info; training-only preserves honest evaluation.

**Q59. How ColumnTransformer helps with mixed types?**

**Answer Key (Q59)**

Routes different preprocessors safely to numeric vs categorical feature sets.

**Q60. Why leakage worse in time series?**

**Answer Key (Q60)**

Look-ahead bias; future info contaminates evaluation.
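
One common guard is a forward-chaining split. The sketch below assumes rows are already sorted by time and uses scikit-learn's `TimeSeriesSplit` on a toy index, so every validation fold lies strictly after its training data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (toy data)
tscv = TimeSeriesSplit(n_splits=3)

# Record the last training index and first test index for each fold.
folds = [(int(train.max()), int(test.min())) for train, test in tscv.split(X)]
for last_train, first_test in folds:
    print("train up to", last_train, "| validate from", first_test)
```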

## Step 13 — End-to-End Integration

**Q61. Pipeline with KNNImputer, StandardScaler, RandomForestClassifier (code).**

**Answer Key (Q61)**

In [None]:
features = df.drop(columns=["churn"])
target = df["churn"].astype(int)
num_cols = features.select_dtypes(include=["number"]).columns.tolist()
cat_cols = features.select_dtypes(include=["object","category"]).columns.tolist()
pre = ColumnTransformer([
    ("num", Pipeline([("imp", KNNImputer(n_neighbors=5)), ("sc", StandardScaler())]), num_cols),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
])
pipe = Pipeline([("pre", pre), ("rf", RandomForestClassifier(n_estimators=250, random_state=42))])
print(cross_val_score(pipe, features, target, cv=5, scoring="f1").mean().round(3))

**Q62. Pipeline fails on unseen categories at inference. Fix?**

**Answer Key (Q62)**

Use OneHotEncoder(handle_unknown="ignore") or map rare/unseen to 'Unknown' bucket pre-encode; schedule retraining.
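
A minimal sketch of both behaviours, using a made-up `Central` region that was absent at fit time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"region": ["North", "South", "East", "West"]})
new = pd.DataFrame({"region": ["North", "Central"]})  # "Central" is unseen

# Default behaviour (handle_unknown="error") fails at transform time.
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_failed = False
except ValueError:
    strict_failed = True

# With handle_unknown="ignore" the unseen level becomes an all-zero row.
tolerant = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = tolerant.transform(new).toarray()

print("strict encoder failed:", strict_failed)
print(encoded)
```

The all-zero row keeps inference running, but the model learns nothing about the new level, which is why retraining still needs to be scheduled.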

**Q63. Why use Pipeline over manual sequencing?**

**Answer Key (Q63)**

Ensures order/reproducibility and leakage safety across CV & production.

**Q64. Add PCA (3 comps) to numeric branch after scaling (code).**

**Answer Key (Q64)**

In [None]:
num_branch = Pipeline([("imp", SimpleImputer(strategy="median")), ("sc", StandardScaler()), ("pca", PCA(n_components=3, random_state=42))])
pre = ColumnTransformer([("num", num_branch, num_cols), ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")), ("oh", OneHotEncoder(handle_unknown="ignore"))]), cat_cols)])

**Q65. Deployed model sees new region levels. Strategy?**

**Answer Key (Q65)**

Keep encoder with handle_unknown='ignore', add periodic retraining, and consider rare-category bucketing.