
### <a id="q1"></a>Question 1: Explain the differences between AI, ML, Deep Learning (DL), and Data Science (DS).

**Answer:**  
- **Artificial Intelligence (AI):** The broad field of building systems that can perform tasks that normally require human intelligence (reasoning, planning, perception, language understanding, decision-making). AI includes rule-based systems, search/optimization, and learning-based systems.  
- **Machine Learning (ML):** A subfield of AI focused on algorithms that **learn patterns from data** to make predictions or decisions without being explicitly programmed with rules. Examples: linear/logistic regression, decision trees, random forests, SVMs, gradient boosting.  
- **Deep Learning (DL):** A subfield of ML that uses **neural networks with many layers** to automatically learn hierarchical representations (features). Particularly powerful for unstructured data like images, audio, and text (e.g., CNNs, RNNs, Transformers).  
- **Data Science (DS):** An interdisciplinary field that combines **statistics, programming, and domain knowledge** to extract insights and value from data. DS spans the entire lifecycle: data collection, cleaning, EDA, visualization, modeling (often using ML), evaluation, and communication of results.

**Relationship:** AI ⊇ ML ⊇ DL; DS overlaps with ML/DL but also includes data engineering, analytics, and communication.



### <a id="q2"></a>Question 2: What are the types of machine learning? Describe each with one real-world example.

**Answer:**  
1. **Supervised Learning:** Learn a mapping from inputs to a **labeled** target (regression/classification).  
   - *Example:* Predicting house prices (regression) using features like area, location, and rooms.  
2. **Unsupervised Learning:** Discover structure in **unlabeled** data (clustering, dimensionality reduction).  
   - *Example:* Customer segmentation using purchase behavior (k-means).  
3. **Semi-supervised Learning:** Train on a small amount of labeled data plus a large amount of unlabeled data.  
   - *Example:* Classifying webpages when only a subset is labeled.  
4. **Self-supervised Learning:** Create proxy labels from the data itself to pretrain representations.  
   - *Example:* Masked language modeling to pretrain text encoders.  
5. **Reinforcement Learning:** An agent learns to act by receiving **rewards** from an environment.  
   - *Example:* Game-playing agents (e.g., learning to play Atari or Go).
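
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on synthetic data. The blob centers, model choices, and variable names are assumptions made for illustration, not part of the assignment: the classifier is given the labels, while k-means sees only the features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two well-separated synthetic groups (illustrative data only)
X, y = make_blobs(n_samples=200, centers=[[-5, -5], [5, 5]],
                  cluster_std=1.0, random_state=0)

# Supervised: the labels y are used during training
clf = LogisticRegression().fit(X, y)
supervised_acc = clf.score(X, y)

# Unsupervised: only X is used; cluster ids are arbitrary, so 0/1 may be swapped
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
agreement = max(np.mean(km.labels_ == y), np.mean(km.labels_ != y))

print(f"Supervised accuracy: {supervised_acc:.2f}")
print(f"Cluster/label agreement (up to relabeling): {agreement:.2f}")
```

The agreement score is only computable here because the synthetic data happens to come with labels; in a real unsupervised setting there is no ground truth to compare against.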



### <a id="q3"></a>Question 3: Define overfitting, underfitting, and the bias-variance tradeoff in machine learning.

**Answer:**  
- **Overfitting:** The model fits training data too closely (including noise), yielding **low training error** but **high validation/test error**.  
- **Underfitting:** The model is too simple to capture underlying patterns, giving **high training and test error**.  
- **Bias–Variance Tradeoff:**  
  - **Bias** = error due to simplifying assumptions (too simple → underfit).  
  - **Variance** = error due to sensitivity to training fluctuations (too complex → overfit).  
  Managing model complexity, adding more data, using regularization or ensembles, and validating properly all help balance the two.
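
The tradeoff can be illustrated with polynomial fits of increasing degree. This is a sketch under assumed settings (a noisy `sin(3x)` target, 30 training points, degrees 1, 4, and 12), none of which come from the assignment. Training error can only go down as the degree grows, while test error typically falls and then rises again once the model starts fitting noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of sin(3x) on [-1, 1] (an illustrative target function)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.2, n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(500)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 4, 12):  # too simple, moderate, very flexible
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.3f}  "
          f"test MSE={results[degree][1]:.3f}")
```

A large gap between train and test MSE points to high variance; high error on both points to high bias.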



### <a id="q4"></a>Question 4: What are outliers in a dataset? List three common techniques for handling them.

**Answer:**  
- **Outliers** are observations that deviate markedly from other observations and may arise from data entry errors, rare events, or natural variability.  
- **Common handling techniques:**  
  1. **Capping/Winsorization:** Replace values beyond chosen percentiles (e.g., 1st/99th) with the values at those percentiles.  
  2. **Transformation:** Apply log/Box–Cox/Yeo–Johnson to reduce skewness and compress extremes.  
  3. **Removal/Filtering:** Drop points beyond thresholds (e.g., IQR method or z-score).  
  4. **Robust Models:** Use algorithms or losses robust to outliers (e.g., median-based stats, Huber loss).  
  5. **Imputation by Domain Logic:** If outliers are due to errors, correct with valid values.
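
A minimal sketch of the first three techniques on a toy array (the values are made up for illustration). Note that the capping here uses the IQR fences rather than fixed percentiles, which is one of several reasonable choices.

```python
import numpy as np

# Toy data with one obvious extreme value (illustrative only)
values = np.array([10, 12, 11, 13, 12, 11, 10, 14, 13, 250], dtype=float)

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

filtered = values[~is_outlier]          # removal/filtering
capped = np.clip(values, lower, upper)  # capping at the IQR fences
log_scaled = np.log1p(values)           # transformation: compresses the extreme

print("Fences:", lower, upper)
print("Flagged:", values[is_outlier])
print("Capped max:", capped.max())
```

Which option is appropriate depends on whether the extreme value is an error or a genuine rare event; removal discards information that capping and transformation keep.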



### <a id="q5"></a>Question 5: Explain the process of handling missing values and mention one imputation technique for numerical and one for categorical data.

**Answer:**  
1. **Explore missingness:** quantify % missing per feature; analyze whether values are missing completely at random (MCAR), at random given other features (MAR), or not at random (MNAR).  
2. **Decide strategy per feature:** drop columns/rows (when justified) vs. impute.  
3. **Impute appropriately:**  
   - **Numerical:** mean/median imputation, KNN imputation, model-based, or time-series interpolation.  
   - **Categorical:** most frequent (mode), introduce **'Unknown'** category, or model-based.  
4. **Flag imputation:** add indicator columns where appropriate.  
5. **Validate impact:** compare model performance with/without different strategies.

**Example techniques:**  
- Numerical → **Median imputation** (robust to outliers).  
- Categorical → **Most frequent (mode)** or **'Unknown'** label.
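
The steps above can be sketched with pandas as follows. The column names (`income`, `city`) and the values are hypothetical, chosen only to show median imputation, an `'Unknown'` category, and an indicator flag.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "income": [42.0, np.nan, 55.0, 61.0, np.nan, 1000.0],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi", None],
})

# Step 1: quantify missingness per feature
missing_pct = df.isna().mean() * 100
print(missing_pct)

# Step 4: flag rows before imputing, so the information is not lost
df["income_missing"] = df["income"].isna().astype(int)

# Step 3: median for the numeric column (robust to the 1000.0 outlier),
# an explicit 'Unknown' category for the categorical column
income_median = df["income"].median()
df["income"] = df["income"].fillna(income_median)
df["city"] = df["city"].fillna("Unknown")

print(df)
```

In a modeling workflow the median should be computed on the training split only and then reused for validation and test data, to avoid leakage.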



### <a id="q6"></a>Question 6: Write a Python program that creates a synthetic imbalanced dataset with `make_classification()` and prints the class distribution.

**Answer (Code):**


In [None]:

from sklearn.datasets import make_classification
import numpy as np
import pandas as pd
from collections import Counter

# Reproducibility is handled by random_state in make_classification below

X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=3, n_redundant=2,
    n_clusters_per_class=1, weights=[0.95, 0.05], flip_y=0.01, random_state=42
)

print("Shapes:", X.shape, y.shape)
# flip_y=0.01 randomly reassigns about 1% of labels, so the split is close to, not exactly, 95/5
print("Class distribution:", Counter(y.tolist()))

# Optional: show as a DataFrame head
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
df['target'] = y
df.head()



### <a id="q7"></a>Question 7: Implement one-hot encoding using pandas for the list of colors `['Red', 'Green', 'Blue', 'Green', 'Red']` and print the resulting DataFrame.

**Answer (Code):**


In [None]:

import pandas as pd

colors = ['Red', 'Green', 'Blue', 'Green', 'Red']
df_colors = pd.DataFrame({'color': colors})
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
encoded = pd.get_dummies(df_colors, columns=['color'], prefix='color', dtype=int)

print("Original:")
print(df_colors)
print("\nOne-hot encoded:")
print(encoded)
encoded



### <a id="q8"></a>Question 8: Generate 1000 normal samples, introduce 50 random missing values, fill with mean, and plot histograms before and after imputation.

**Answer (Code):**


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
np.random.seed(42)

# Generate data
data = np.random.normal(loc=0, scale=1, size=1000).astype(float)
data_with_nan = data.copy()

# Introduce 50 random missing values
nan_indices = np.random.choice(len(data_with_nan), size=50, replace=False)
data_with_nan[nan_indices] = np.nan

# Impute with mean
mean_value = np.nanmean(data_with_nan)
imputed = np.where(np.isnan(data_with_nan), mean_value, data_with_nan)

print(f"Mean used for imputation: {mean_value:.4f}")
print("Missing before:", np.isnan(data_with_nan).sum(), "| Missing after:", np.isnan(imputed).sum())

# Plot histograms
plt.figure()
plt.hist(data_with_nan[~np.isnan(data_with_nan)], bins=30, alpha=0.8)
plt.title("Histogram BEFORE imputation")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

plt.figure()
plt.hist(imputed, bins=30, alpha=0.8)
plt.title("Histogram AFTER mean imputation")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()



### <a id="q9"></a>Question 9: Implement Min–Max scaling on `[2, 5, 10, 15, 20]` using `sklearn.preprocessing.MinMaxScaler` and print the scaled array.

**Answer (Code):**


In [None]:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

arr = np.array([[2], [5], [10], [15], [20]], dtype=float)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(arr)

print("Original:", arr.ravel().tolist())
print("Scaled:", scaled.ravel().tolist())
scaled



### <a id="q10"></a>Question 10: Data preparation plan for retail transactions (missing ages, outliers in amount, imbalanced target, categorical variables).

**Answer:**  
**Step-by-step plan:**  
1. **Schema & sanity checks:** Validate columns, dtypes, duplicate rows, impossible values (e.g., negative ages).  
2. **Missing ages:**  
   - Explore age distribution and correlation with other features.  
   - Impute with **median** (robust) or **KNN imputer** using related features; add `age_imputed` flag.  
3. **Outliers in transaction amount:**  
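   - Identify via the **IQR method** or z-scores, ideally after inspecting the distribution on a log scale, since transaction amounts tend to be right-skewed.  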
   - Handle via **capping** at reasonable percentiles (e.g., 1%/99%) or use **robust scalers**.  
4. **Imbalanced target (fraud vs non-fraud):**  
   - Use **stratified splits**, evaluate with **PR-AUC**, **F1**, **recall**.  
   - Apply **class weighting**, **SMOTE/SMOTEENN**, or **threshold tuning**.  
5. **Categorical variables (e.g., payment method):**  
   - Low-cardinality → **one-hot**; high-cardinality → **target encoding** with CV to avoid leakage.  
6. **Scaling:**  
   - Use **StandardScaler/RobustScaler** for numeric features; fit on train only.  
7. **Validation & leakage control:**  
   - Pipeline + ColumnTransformer; fit only on training folds; use nested CV if tuning.  
8. **Modeling:**  
   - Start with **logistic regression** (with class_weight) and **tree-based** baselines; calibrate probabilities if needed.  
9. **Monitoring:**  
   - Track data drift and alert on population changes (PSI), retrain periodically.
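
For the monitoring step, a Population Stability Index can be computed by binning a reference sample and comparing bin frequencies with a newer sample. The sketch below is one common formulation, assuming quantile bins taken from the reference data; the function name `psi`, the bin count, and the synthetic samples are illustrative choices.

```python
import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    # Interior bin edges from quantiles of the reference sample
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1))[1:-1]
    # searchsorted assigns each value a bin index in 0..n_bins-1
    e_counts = np.bincount(np.searchsorted(interior, expected), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(interior, actual), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty
    e_frac = np.clip(e_counts / len(expected), eps, None)
    a_frac = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train_amount = rng.lognormal(3.2, 1.0, 5000)    # reference (training-time) sample
same_amount = rng.lognormal(3.2, 1.0, 5000)     # new data, same distribution
drifted_amount = rng.lognormal(3.8, 1.0, 5000)  # new data, shifted distribution

psi_same = psi(train_amount, same_amount)
psi_drift = psi(train_amount, drifted_amount)
print(f"PSI (no drift): {psi_same:.4f}")
print(f"PSI (drifted):  {psi_drift:.4f}")
```

Values near zero indicate that the two samples fill the bins similarly; larger values indicate a shift worth investigating before deciding whether to retrain.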

**Optional (Code Template – synthetic demonstration):**


In [None]:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# --- Synthetic dataset (for demo only) ---
np.random.seed(42)
n = 3000
df = pd.DataFrame({
    # clip so the synthetic ages stay in a plausible adult range (no negative ages)
    "age": np.random.normal(35, 12, n).round(0).clip(18, 90),
    "amount": np.random.lognormal(mean=3.2, sigma=1.0, size=n),
    "payment_method": np.random.choice(["card", "upi", "netbanking", "cod"], size=n, p=[0.6, 0.25, 0.1, 0.05]),
})

# Introduce missing ages (~10%)
mask = np.random.rand(n) < 0.1
df.loc[mask, "age"] = np.nan

# Introduce some extreme outliers in amount
outlier_idx = np.random.choice(n, size=20, replace=False)
df.loc[outlier_idx, "amount"] *= 50

# Imbalanced target (fraud 3%)
y = (np.random.rand(n) < 0.03).astype(int)

# --- Train/test split ---
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, stratify=y, random_state=42)

# --- Preprocessing ---
numeric_features = ["age", "amount"]
categorical_features = ["payment_method"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),       # or KNNImputer()
    ("scaler", RobustScaler())                           # robust to outliers
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# --- Modeling with imbalance handling ---
model = LogisticRegression(max_iter=1000, class_weight="balanced")

pipeline = ImbPipeline(steps=[
    ("preprocess", preprocess),
    # SMOTE runs only during fit (never on test data); 0.2 = minority/majority ratio after resampling
    ("smote", SMOTE(sampling_strategy=0.2, random_state=42)),
    ("clf", model)
])

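from sklearn.metrics import average_precision_score, recall_score

pipeline.fit(X_train, y_train)

# Accuracy is misleading at ~3% fraud, so report PR-AUC and recall instead.
# The synthetic labels are random noise, so expect near-chance scores here.
test_proba = pipeline.predict_proba(X_test)[:, 1]
print("Test PR-AUC:", average_precision_score(y_test, test_proba))
print("Test recall:", recall_score(y_test, pipeline.predict(X_test)))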

# Inspect one prediction batch
preds = pipeline.predict_proba(X_test)[:5, 1]
pd.DataFrame({"prob_fraud": preds})


---
*End of assignment solutions.*