<a href="https://colab.research.google.com/github/yaldaAlizadeh/Machine_Learning/blob/main/DeepSeek_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comprehensive Machine Learning Algorithms Guide

In [None]:
# Cell 1: Installation and Setup
# Run this cell first to install required packages
!pip install numpy pandas matplotlib seaborn scikit-learn xgboost tensorflow torch transformers plotly graphviz missingno -q
print("✅ All packages installed successfully!")

In [None]:
# Cell 2: Import All Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
# Scikit-learn imports
from sklearn import datasets, linear_model, model_selection, metrics, tree, ensemble, svm, neighbors, naive_bayes, cluster, decomposition, neural_network
from sklearn.preprocessing import StandardScaler, LabelEncoder, PolynomialFeatures
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.tree import plot_tree, export_text
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.manifold import TSNE
from sklearn.calibration import calibration_curve
# Advanced models
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras import layers, models, Sequential
from transformers import pipeline
# Statistical analysis
import scipy.stats as stats
from scipy.cluster.hierarchy import dendrogram, linkage
import warnings
warnings.filterwarnings('ignore')
# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
print("📚 All libraries imported successfully!")

# 1. LINEAR REGRESSION
## Algorithm Background & Mathematical Foundation
**Core Concept:** Linear Regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a linear approach. It's the foundation of predictive modeling and statistical analysis.
**Mathematical Formulation:**
**Model Equation:**
$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p + \epsilon$$
**Matrix Form:**
$$\mathbf{Y} = \mathbf{X}\beta + \epsilon$$
**Cost Function (Mean Squared Error):**
$$J(\beta) = \frac{1}{2m}\sum_{i=1}^{m}(h_\beta(x^{(i)}) - y^{(i)})^2 = \frac{1}{2m}(X\beta - y)^T(X\beta - y)$$
**Closed-form Solution (Normal Equation):**
$$\hat{\beta} = (X^TX)^{-1}X^Ty$$
**Gradient Descent Update:**
$$\beta_j := \beta_j - \alpha\frac{\partial}{\partial\beta_j}J(\beta) = \beta_j - \alpha\frac{1}{m}\sum_{i=1}^m(h_\beta(x^{(i)}) - y^{(i)})x_j^{(i)}$$
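The closed-form solution and the gradient descent update above should agree on the same coefficients. The following is a minimal sketch on synthetic data (the data, learning rate, and iteration count are assumptions chosen for illustration, not values from this guide):

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2 + 3*x + small noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=200)

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])
m = len(y)

# Normal equation: solve (X^T X) beta = X^T y rather than inverting explicitly
beta_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent on J(beta) = (1/2m) * ||X beta - y||^2
beta_gd = np.zeros(2)
alpha = 0.5  # learning rate (assumed)
for _ in range(2000):
    gradient = X.T @ (X @ beta_gd - y) / m
    beta_gd -= alpha * gradient

print("normal equation: ", beta_closed)
print("gradient descent:", beta_gd)
```

Using `np.linalg.solve` instead of an explicit inverse is a common numerical-stability choice; both express the same normal equation.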
## Key Assumptions (Gauss-Markov Conditions plus Normality for Inference)
1. **Linearity:** Relationship between X and Y is linear
2. **Independence:** Observations are independent of each other
3. **Homoscedasticity:** Constant variance of errors
4. **No Autocorrelation:** Errors are uncorrelated
5. **No Perfect Multicollinearity:** Predictors not perfectly correlated
6. **Normality:** Errors normally distributed (for inference)
## Types of Linear Regression
- **Simple Linear Regression:** One predictor variable
- **Multiple Linear Regression:** Multiple predictors
- **Polynomial Regression:** Non-linear relationships via polynomial features
- **Regularized Regression:** Ridge (L2), Lasso (L1), Elastic Net

In [None]:
# Cell 3: Linear Regression - Comprehensive Implementation
print("🚀 LINEAR REGRESSION: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
housing = datasets.fetch_california_housing()  # California Housing (the Boston dataset was removed from scikit-learn)
X = housing.data
y = housing.target
feature_names = housing.feature_names
print("📊 Dataset Overview:")
print(f"• Dataset: California Housing")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Features: {list(feature_names)}")
print(f"• Target: Median House Value")
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['MedHouseVal'] = y
# Exploratory Data Analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Distribution of target variable
axes[0,0].hist(df['MedHouseVal'], bins=50, alpha=0.7, color='skyblue', edgecolor='black')
axes[0,0].set_title('Distribution of Median House Values')
axes[0,0].set_xlabel('Median House Value')
axes[0,0].set_ylabel('Frequency')
# Correlation heatmap
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[0,1])
axes[0,1].set_title('Feature Correlation Matrix')
# Feature vs Target relationships
axes[1,0].scatter(df['MedInc'], df['MedHouseVal'], alpha=0.5)
axes[1,0].set_xlabel('Median Income')
axes[1,0].set_ylabel('Median House Value')
axes[1,0].set_title('Income vs House Value')
# Feature distribution
axes[1,1].boxplot([df['MedInc'], df['HouseAge'], df['AveRooms']])
axes[1,1].set_xticklabels(['MedInc', 'HouseAge', 'AveRooms'])
axes[1,1].set_title('Feature Distributions')
axes[1,1].set_ylabel('Value')
plt.tight_layout()
plt.show()
# Data Preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Multiple Linear Regression Models
# Named regression_models so the dict does not shadow the `models` module imported from tensorflow.keras
regression_models = {
    'Ordinary Least Squares': linear_model.LinearRegression(),
    'Ridge Regression (L2)': linear_model.Ridge(alpha=1.0),
    'Lasso Regression (L1)': linear_model.Lasso(alpha=0.1),
    'Elastic Net': linear_model.ElasticNet(alpha=0.1, l1_ratio=0.5)
}
results = {}
for name, model in regression_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = metrics.mean_squared_error(y_test, y_pred)
    r2 = metrics.r2_score(y_test, y_pred)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    results[name] = {
        'mse': mse,
        'r2': r2,
        'mae': mae,
        'model': model,
        'y_pred': y_pred
    }
    print(f" • MSE: {mse:.4f}")
    print(f" • R²: {r2:.4f}")
    print(f" • MAE: {mae:.4f}")
# Model Comparison
comparison_df = pd.DataFrame({
'Model': list(results.keys()),
'MSE': [results[name]['mse'] for name in results.keys()],
'R2': [results[name]['r2'] for name in results.keys()],
'MAE': [results[name]['mae'] for name in results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_df.round(4))
# Visualization of Results
best_model_name = max(results.keys(), key=lambda x: results[x]['r2'])
best_model = results[best_model_name]['model']
y_pred_best = results[best_model_name]['y_pred']
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Predictions vs Actual
axes[0,0].scatter(y_test, y_pred_best, alpha=0.6)
axes[0,0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
axes[0,0].set_xlabel('Actual Values')
axes[0,0].set_ylabel('Predicted Values')
axes[0,0].set_title(f'{best_model_name}: Predictions vs Actual')
axes[0,0].grid(True)
# Residuals plot
residuals = y_test - y_pred_best
axes[0,1].scatter(y_pred_best, residuals, alpha=0.6)
axes[0,1].axhline(y=0, color='r', linestyle='--')
axes[0,1].set_xlabel('Predicted Values')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Residuals Plot')
axes[0,1].grid(True)
# Distribution of residuals
axes[1,0].hist(residuals, bins=30, alpha=0.7, color='skyblue', edgecolor='black')
axes[1,0].set_xlabel('Residuals')
axes[1,0].set_ylabel('Frequency')
axes[1,0].set_title('Distribution of Residuals')
axes[1,0].grid(True)
# Feature importance for interpretable models
if hasattr(best_model, 'coef_'):
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'coefficient': best_model.coef_
    }).sort_values('coefficient', key=abs, ascending=False)
    axes[1,1].barh(feature_importance['feature'][:10], feature_importance['coefficient'][:10])
    axes[1,1].set_xlabel('Coefficient Value')
    axes[1,1].set_title('Top 10 Feature Coefficients')
plt.tight_layout()
plt.show()
print("✅ Linear Regression Analysis Complete!")

## Linear Regression Interview Questions & Answers
**Q1: What is the difference between L1 (Lasso) and L2 (Ridge) regularization?**
**Answer:**
- **L1 Regularization (Lasso):** Adds absolute value of coefficients as penalty: $J(\beta) = MSE + \lambda\sum|\beta_i|$
- Can shrink coefficients to exactly zero, performing feature selection
- Creates sparse models
- **L2 Regularization (Ridge):** Adds squared value of coefficients as penalty: $J(\beta) = MSE + \lambda\sum\beta_i^2$
- Shrinks coefficients but rarely to exactly zero
- Handles multicollinearity well
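Why L1 produces exact zeros while L2 only shrinks can be seen in a simplified one-coefficient problem. The sketch below is an illustration under assumptions: it uses the one-dimensional objectives stated in the docstrings (not the full multivariate loss), and the coefficient values and `lam` are hypothetical.

```python
def lasso_1d(b, lam):
    """Minimizer of 0.5*(beta - b)**2 + lam*|beta| (soft-thresholding)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set to exactly zero


def ridge_1d(b, lam):
    """Minimizer of 0.5*(beta - b)**2 + 0.5*lam*beta**2 (proportional shrinkage)."""
    return b / (1.0 + lam)


ols_coefficients = [3.0, 0.4, -0.2, -1.5]  # hypothetical unpenalized estimates
lam = 0.5
lasso = [lasso_1d(b, lam) for b in ols_coefficients]
ridge = [ridge_1d(b, lam) for b in ols_coefficients]
print("lasso:", lasso)
print("ridge:", ridge)
```

In this toy setting the two small coefficients become exactly zero under the L1 penalty, while the L2 penalty scales every coefficient toward zero without eliminating any.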
**Q2: How do you interpret the R-squared value?**
**Answer:**
R-squared measures the proportion of variance in the dependent variable that is predictable from the independent variables.
- R² = 1: Perfect fit
- R² = 0: Model explains none of the variance
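- R² < 0: Model fits worse than a horizontal line at the mean of y (possible on held-out data)
- **Important:** A high R² doesn't guarantee model quality: on training data it never decreases when predictors are added, so check adjusted R² or held-out performance as well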
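The boundary cases in this answer can be checked by hand. This is a small sketch using the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$; the helper name `r_squared` and the toy values are assumptions for illustration.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                   # predictions equal the targets
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])  # systematically wrong

print(perfect, mean_only, reversed_fit)
```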
**Q3: What is multicollinearity and why is it problematic?**
**Answer:**
Multicollinearity occurs when independent variables are highly correlated. Problems include:
- Unstable coefficient estimates
- Difficulty interpreting individual feature importance
- Increased variance of coefficient estimates
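- Detection: Variance Inflation Factor, $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature j on the other features; VIF > 10 is a common rule of thumb for severe multicollinearity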
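The VIF can be computed directly from its definition, $VIF_j = 1/(1 - R_j^2)$, by regressing each feature on the others. The sketch below uses NumPy only; the helper name `vif` and the synthetic features are assumptions for illustration.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor per column: 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(target)), others])  # add intercept
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residual = target - A @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a
factors = vif(np.column_stack([a, b, c]))
print(factors)
```

Here the nearly duplicated pair (`a`, `c`) should show large factors, while the independent feature `b` stays close to 1.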
**Q4: Explain the bias-variance tradeoff.**
**Answer:**
Total Error = Bias² + Variance + Irreducible Error
- **Bias:** Error from erroneous assumptions (underfitting)
- **Variance:** Error from sensitivity to training data (overfitting)
- Simple models: High bias, low variance
- Complex models: Low bias, high variance
**Q5: How do you validate a Linear Regression model?**
**Answer:**
1. Check assumptions (linearity, homoscedasticity, normality)
2. Use train/test split or cross-validation
3. Analyze residual plots
4. Calculate performance metrics (MSE, R², MAE)
5. Check for influential points using Cook's distance
# 2. LOGISTIC REGRESSION
## Algorithm Background & Mathematical Foundation
**Core Concept:** Despite the name, Logistic Regression is a classification algorithm that estimates probabilities using a logistic function.
**Mathematical Formulation:**
**Sigmoid Function:**
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
**Model Equation:**
$$P(Y=1|X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + ... + \beta_pX_p)}}$$
**Log-Odds (Logit Function):**
$$\log\left(\frac{P(Y=1)}{1 - P(Y=1)}\right) = \beta_0 + \beta_1X_1 + ... + \beta_pX_p$$
**Cost Function (Log Loss):**
$$J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$$
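The sigmoid, logit, and log-loss formulas above can be evaluated on a few numbers to see how the loss reacts to confident mistakes. This is a minimal sketch; the labels and probabilities are made-up values for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true, y_prob):
    """Average binary cross-entropy over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [1, 0, 1]
loss_right = log_loss(labels, [0.9, 0.1, 0.9])  # confident and correct
loss_wrong = log_loss(labels, [0.1, 0.9, 0.1])  # confident and wrong

# The logit (log-odds) inverts the sigmoid
z = 1.7
p = sigmoid(z)
recovered_z = math.log(p / (1 - p))

print(sigmoid(0.0), loss_right, loss_wrong, recovered_z)
```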
## Key Assumptions
1. **Binary outcome** variable
2. **Linearity** between independent variables and log-odds
3. **No multicollinearity** among independent variables
4. **Independent** observations
5. **Large sample** size

In [None]:
# Cell 4: Logistic Regression - Comprehensive Implementation
print("🚀 LOGISTIC REGRESSION: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
breast_cancer = datasets.load_breast_cancer()
X = breast_cancer.data
y = breast_cancer.target
feature_names = breast_cancer.feature_names
target_names = breast_cancer.target_names
print("📊 Dataset Overview:")
print(f"• Dataset: Wisconsin Breast Cancer")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Classes: {target_names}")
print(f"• Class distribution: {np.bincount(y)}")
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['diagnosis'] = y
# Exploratory Data Analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Class distribution
# sort_index() orders counts by class id (0, 1) so they line up with target_names
class_counts = df['diagnosis'].value_counts().sort_index()
axes[0,0].pie(class_counts.values, labels=target_names, autopct='%1.1f%%',
              colors=['lightcoral', 'lightgreen'])
axes[0,0].set_title('Class Distribution')
# Feature correlation with target
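# Rank by absolute correlation so strong negative correlations are not dropped
correlation_with_target = df.corr()['diagnosis'].drop('diagnosis').sort_values(key=abs, ascending=False)
top_features = correlation_with_target.iloc[:10]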
axes[0,1].barh(range(len(top_features)), top_features.values)
axes[0,1].set_yticks(range(len(top_features)))
axes[0,1].set_yticklabels(top_features.index)
axes[0,1].set_xlabel('Correlation with Diagnosis')
axes[0,1].set_title('Top 10 Features Correlated with Diagnosis')
# Feature distribution by class
for diagnosis in [0, 1]:
    subset = df[df['diagnosis'] == diagnosis]
    axes[1,0].scatter(subset['worst radius'], subset['worst texture'],
                      label=target_names[diagnosis], alpha=0.6)
axes[1,0].set_xlabel('Worst Radius')
axes[1,0].set_ylabel('Worst Texture')
axes[1,0].set_title('Feature Space: Malignant vs Benign')
axes[1,0].legend()
axes[1,0].grid(True)
axes[1,1].axis('off')  # fourth panel is unused
plt.tight_layout()
plt.show()
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Logistic Regression models
logistic_models = {
'Logistic Regression (L2)': linear_model.LogisticRegression(penalty='l2', C=1.0, random_state=42, max_iter=1000),
'Logistic Regression (L1)': linear_model.LogisticRegression(penalty='l1', solver='liblinear', C=1.0, random_state=42, max_iter=1000),
'Logistic Regression (Balanced)': linear_model.LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)
}
logistic_results = {}
for name, model in logistic_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    roc_auc = metrics.roc_auc_score(y_test, y_pred_proba)
    logistic_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'model': model,
        'y_pred_proba': y_pred_proba
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • ROC-AUC: {roc_auc:.4f}")
# Model Comparison
comparison_logistic = pd.DataFrame({
'Model': list(logistic_results.keys()),
'Accuracy': [logistic_results[name]['accuracy'] for name in logistic_results.keys()],
'Precision': [logistic_results[name]['precision'] for name in logistic_results.keys()],
'Recall': [logistic_results[name]['recall'] for name in logistic_results.keys()],
'F1-Score': [logistic_results[name]['f1'] for name in logistic_results.keys()],
'ROC-AUC': [logistic_results[name]['roc_auc'] for name in logistic_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_logistic.round(4))
# Comprehensive Visualization
best_logistic_name = max(logistic_results.keys(), key=lambda x: logistic_results[x]['roc_auc'])
best_logistic_model = logistic_results[best_logistic_name]['model']
y_pred_proba_best = logistic_results[best_logistic_name]['y_pred_proba']
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# ROC Curve
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba_best)
roc_auc = metrics.auc(fpr, tpr)
axes[0,0].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
axes[0,0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
axes[0,0].set_xlabel('False Positive Rate')
axes[0,0].set_ylabel('True Positive Rate')
axes[0,0].set_title('ROC Curve')
axes[0,0].legend()
axes[0,0].grid(True)
# Precision-Recall Curve
precision, recall, _ = metrics.precision_recall_curve(y_test, y_pred_proba_best)
average_precision = metrics.average_precision_score(y_test, y_pred_proba_best)
axes[0,1].plot(recall, precision, color='blue', lw=2, label=f'Avg Precision = {average_precision:.2f}')
axes[0,1].set_xlabel('Recall')
axes[0,1].set_ylabel('Precision')
axes[0,1].set_title('Precision-Recall Curve')
axes[0,1].legend()
axes[0,1].grid(True)
# Probability Distribution
axes[0,2].hist(y_pred_proba_best[y_test == 0], bins=30, alpha=0.7, color='red', label='Malignant', density=True)
axes[0,2].hist(y_pred_proba_best[y_test == 1], bins=30, alpha=0.7, color='green', label='Benign', density=True)
axes[0,2].axvline(x=0.5, color='black', linestyle='--', label='Decision Boundary')
axes[0,2].set_xlabel('Predicted Probability')
axes[0,2].set_ylabel('Density')
axes[0,2].set_title('Probability Distribution by Class')
axes[0,2].legend()
# Confusion Matrix
y_pred_best = best_logistic_model.predict(X_test_scaled)
cm = metrics.confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[1,0],
xticklabels=target_names, yticklabels=target_names)
axes[1,0].set_xlabel('Predicted')
axes[1,0].set_ylabel('Actual')
axes[1,0].set_title('Confusion Matrix')
# Feature Importance
if hasattr(best_logistic_model, 'coef_'):
    feature_importance = pd.DataFrame({
        'feature': feature_names,
        'coefficient': best_logistic_model.coef_[0],
        'abs_coefficient': np.abs(best_logistic_model.coef_[0])
    }).sort_values('abs_coefficient', ascending=True).tail(10)
    axes[1,1].barh(feature_importance['feature'], feature_importance['coefficient'])
    axes[1,1].set_xlabel('Coefficient Value')
    axes[1,1].set_title('Top 10 Feature Coefficients')
# Model Comparison
models_list = list(logistic_results.keys())
roc_auc_scores = [logistic_results[name]['roc_auc'] for name in models_list]
bars = axes[1,2].bar(models_list, roc_auc_scores, color=['blue', 'green', 'orange'])
axes[1,2].set_ylabel('ROC-AUC Score')
axes[1,2].set_title('Model Comparison: ROC-AUC Scores')
axes[1,2].tick_params(axis='x', rotation=45)
for bar, score in zip(bars, roc_auc_scores):
    axes[1,2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{score:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
print("✅ Logistic Regression Analysis Complete!")

## Logistic Regression Interview Questions & Answers
**Q1: Why can't we use Linear Regression for classification?**
**Answer:**
- Linear Regression outputs are unbounded, so predictions can fall outside the [0, 1] range
- Its outputs are therefore not valid probability estimates
- Squared-error loss is sensitive to outliers, which can shift the decision boundary drastically
- It assumes constant-variance, normally distributed errors, which a binary 0/1 target violates
**Q2: How do you interpret coefficients in Logistic Regression?**
**Answer:**
- **Coefficient βⱼ:** Change in log-odds for one unit increase in Xⱼ
- **Odds Ratio e^{βⱼ}:** Multiplicative change in odds for one unit increase
- Example: β = 0.7 means the odds roughly double for each one-unit increase in X (e^0.7 ≈ 2.01, about a 101% increase)
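The odds-ratio interpretation can be traced numerically. In this sketch, β = 0.7 is the coefficient from the example in this answer; the baseline probability of 0.20 is a hypothetical value chosen for illustration.

```python
import math

beta = 0.7                     # coefficient from the example in this answer
odds_ratio = math.exp(beta)    # multiplicative change in odds per unit of X

baseline_probability = 0.20    # hypothetical starting probability
baseline_odds = baseline_probability / (1 - baseline_probability)
new_odds = baseline_odds * odds_ratio
new_probability = new_odds / (1 + new_odds)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"probability: {baseline_probability:.2f} -> {new_probability:.3f}")
```

Note that doubling the odds does not double the probability: the change in probability depends on where the baseline sits on the sigmoid curve.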
**Q3: What is the cost function and why is it used?**
**Answer:**
**Log Loss:** $J(\beta) = -\frac{1}{m}\sum_{i=1}^{m}[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)]$
- Convex in β, so gradient-based optimization (with a suitable learning rate) reaches the global minimum rather than a local one
- Heavily penalizes confident wrong predictions
- Directly related to maximum likelihood estimation
**Q4: How do you handle multiclass classification?**
**Answer:**
1. **One-vs-Rest (OvR):** Train K binary classifiers
2. **One-vs-One (OvO):** Train $\binom{K}{2}$ binary classifiers
3. **Multinomial/Softmax:** Direct extension using softmax function
- Softmax: $P(Y=k|X) = \frac{e^{\beta_k^TX}}{\sum_{j=1}^{K}e^{\beta_j^TX}}$
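The softmax formula can be sketched in a few lines. The scores below are arbitrary illustrative values; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Convert K real-valued scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# With K = 2 and the second score fixed at 0, softmax reduces to the sigmoid
z = 1.3
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + math.exp(-z))
print(two_class, sigmoid_z)
```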
**Q5: What evaluation metrics are appropriate?**
**Answer:**
- **Threshold-dependent:** Accuracy, Precision, Recall, F1-Score
- **Threshold-independent:** ROC-AUC, Precision-Recall AUC
- **Probability calibration:** Log Loss, Brier Score
- Choose based on business objectives and class balance
# 3. DECISION TREE
## Algorithm Background & Mathematical Foundation
**Core Concept:** Decision Trees learn hierarchical if-else rules by recursively partitioning the feature space based on feature values.
**Mathematical Formulation:**
**For Classification:**
**Gini Impurity:**
$$Gini(t) = 1 - \sum_{i=1}^{c} p(i|t)^2$$
**Information Gain (Entropy):**
$$Entropy(t) = -\sum_{i=1}^{c} p(i|t)\log_2 p(i|t)$$
$$Information\ Gain = Entropy(parent) - \sum_{j=1}^{k} \frac{N_j}{N} Entropy(child_j)$$
**For Regression:**
**Variance Reduction:**
$$Variance(t) = \frac{1}{N_t}\sum_{i=1}^{N_t} (y_i - \bar{y}_t)^2$$
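The classification criteria above can be computed by hand for a toy node. This is a minimal sketch; the helper names and the label lists are assumptions for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a"] * 5 + ["b"] * 5
perfect_split = [["a"] * 5, ["b"] * 5]
useless_split = [["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]

print("gini(parent):   ", gini(parent))
print("entropy(parent):", entropy(parent))
print("gain (perfect): ", information_gain(parent, perfect_split))
print("gain (useless): ", information_gain(parent, useless_split))
```

A split that separates the classes completely recovers all of the parent's entropy, while a split that leaves the class mix unchanged gains nothing.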
## Key Features
- **Non-parametric:** No assumptions about data distribution
- **Handles mixed data types:** Numerical and categorical in principle (scikit-learn's trees expect categorical features to be numerically encoded first)
- **Interpretable:** Easy to understand and explain
- **No feature scaling required**

In [None]:
# Cell 5: Decision Tree - Comprehensive Implementation
print("🚀 DECISION TREE: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
class_names = iris.target_names
print("📊 Dataset Overview:")
print(f"• Dataset: Iris Flowers")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Classes: {list(class_names)}")
print(f"• Class distribution: {np.bincount(y)}")
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['species'] = y
df['species_name'] = df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
# Exploratory Data Analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# Feature relationships
scatter = axes[0,0].scatter(df['sepal length (cm)'], df['sepal width (cm)'],
c=df['species'], cmap='viridis')
axes[0,0].set_xlabel('Sepal Length (cm)')
axes[0,0].set_ylabel('Sepal Width (cm)')
axes[0,0].set_title('Sepal Length vs Width')
plt.colorbar(scatter, ax=axes[0,0])
# Feature distributions by class
for species in range(3):
    subset = df[df['species'] == species]
    axes[0,1].hist(subset['petal length (cm)'], alpha=0.7, label=class_names[species])
axes[0,1].set_xlabel('Petal Length (cm)')
axes[0,1].set_ylabel('Frequency')
axes[0,1].set_title('Petal Length Distribution by Species')
axes[0,1].legend()
# Correlation heatmap
correlation_matrix = df.iloc[:, :4].corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', center=0, ax=axes[1,0])
axes[1,0].set_title('Feature Correlation Matrix')
# Class distribution
class_counts = df['species'].value_counts().sort_index()  # order by class id so counts match class_names
axes[1,1].bar(class_names, class_counts.values, color=['skyblue', 'lightgreen', 'lightcoral'])
axes[1,1].set_title('Class Distribution')
axes[1,1].set_ylabel('Count')
plt.tight_layout()
plt.show()
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Decision Tree models with different parameters
tree_models = {
'Decision Tree (Gini, depth=3)': tree.DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42),
'Decision Tree (Entropy, depth=5)': tree.DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=42),
'Decision Tree (Unlimited)': tree.DecisionTreeClassifier(random_state=42)
}
tree_results = {}
for name, model in tree_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred, average='weighted')
    recall = metrics.recall_score(y_test, y_pred, average='weighted')
    f1 = metrics.f1_score(y_test, y_pred, average='weighted')
    tree_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'model': model,
        'y_pred': y_pred
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • Tree Depth: {model.get_depth()}")
    print(f" • Number of Leaves: {model.get_n_leaves()}")
# Model Comparison
comparison_tree = pd.DataFrame({
'Model': list(tree_results.keys()),
'Accuracy': [tree_results[name]['accuracy'] for name in tree_results.keys()],
'Precision': [tree_results[name]['precision'] for name in tree_results.keys()],
'Recall': [tree_results[name]['recall'] for name in tree_results.keys()],
'F1-Score': [tree_results[name]['f1'] for name in tree_results.keys()],
'Depth': [tree_results[name]['model'].get_depth() for name in tree_results.keys()],
'Leaves': [tree_results[name]['model'].get_n_leaves() for name in tree_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_tree.round(4))
# Comprehensive Visualization
best_tree_name = max(tree_results.keys(), key=lambda x: tree_results[x]['accuracy'])
best_tree_model = tree_results[best_tree_name]['model']
# Visualize the decision tree (each plot below uses its own figure)
plt.figure(figsize=(15, 10))
plot_tree(best_tree_model, feature_names=feature_names, class_names=list(class_names),
          filled=True, rounded=True, fontsize=10)
plt.title(f'Decision Tree Visualization - {best_tree_name}')
plt.show()
# Feature Importance
feature_importance = pd.DataFrame({
'feature': feature_names,
'importance': best_tree_model.feature_importances_
}).sort_values('importance', ascending=True)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'], feature_importance['importance'])
plt.xlabel('Feature Importance')
plt.title('Decision Tree Feature Importance')
plt.grid(True, axis='x')
plt.show()
# Decision Boundary Visualization (using first two features)
X_2d = X[:, :2] # Use only first two features for visualization
X_train_2d, X_test_2d, y_train_2d, y_test_2d = model_selection.train_test_split(
X_2d, y, test_size=0.2, random_state=42, stratify=y
)
# Train on 2D data for visualization
tree_2d = tree.DecisionTreeClassifier(max_depth=3, random_state=42)
tree_2d.fit(X_train_2d, y_train_2d)
# Create mesh grid
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Predict on mesh grid
Z = tree_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot decision boundaries
plt.figure(figsize=(12, 8))
plt.contourf(xx, yy, Z, alpha=0.8, cmap='viridis')
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, edgecolor='black', s=50, cmap='viridis')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Decision Tree Decision Boundaries (First Two Features)')
plt.colorbar(scatter)
plt.grid(True)
plt.show()
# Overfitting Analysis
depths = range(1, 15)
train_scores = []
test_scores = []
for depth in depths:
    dt_temp = tree.DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt_temp.fit(X_train, y_train)
    train_scores.append(dt_temp.score(X_train, y_train))
    test_scores.append(dt_temp.score(X_test, y_test))
plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, 'o-', label='Training Score', color='blue')
plt.plot(depths, test_scores, 'o-', label='Test Score', color='red')
plt.xlabel('Tree Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: Overfitting Analysis')
plt.legend()
plt.grid(True)
plt.show()
print("✅ Decision Tree Analysis Complete!")

## Decision Tree Interview Questions & Answers
**Q1: What are the splitting criteria and when to use each?**
**Answer:**
- **Gini Impurity:** $Gini(t) = 1 - \sum p(i|t)^2$
- Faster to compute, default in scikit-learn
- Tends to isolate the most frequent class in its own branch
- **Information Gain (Entropy):** $Entropy(t) = -\sum p(i|t)\log_2 p(i|t)$
- Tends to produce slightly more balanced trees
- Slightly more computationally expensive
- **Choice:** Usually similar results, Gini is slightly faster
**Q2: How does a Decision Tree handle numerical vs categorical features?**
**Answer:**
- **Numerical features:** Finds optimal split point (e.g., age ≤ 30)
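- **Categorical features:**
- For binary: Direct split
- For multi-category: Finds optimal subset (e.g., color in {red, blue})
- **Ordinal features:** Can respect ordering if specified
- **Note:** scikit-learn's trees treat every feature as numeric, so categorical features must be encoded (e.g., one-hot or ordinal) before fitting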
**Q3: What are the advantages of Decision Trees?**
**Answer:**
- **Interpretable:** Easy to understand and explain
- **Little data preprocessing:** No feature scaling needed; support for missing values and raw categorical features depends on the implementation
- **Non-parametric:** No assumptions about data distribution
- **Feature importance:** Natural feature selection
- **Handles non-linear relationships**
**Q4: What are the main limitations and how to address them?**
**Answer:**
- **Overfitting:** Use pruning, max depth, min samples per leaf
- **Unstable:** Small data changes can cause different trees (use ensembles)
- **Biased towards features with more levels:** Use feature selection
- **Poor extrapolation:** Doesn't predict well outside training range
**Q5: What is tree pruning and why is it important?**
**Answer:**
**Pruning** removes branches that have little power in predicting target values.
- **Pre-pruning:** Stop growing tree early (max_depth, min_samples_leaf)
- **Post-pruning:** Grow full tree then remove unnecessary branches
- **Benefits:** Reduces overfitting, improves generalization, smaller trees
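Both pruning styles are available in scikit-learn: pre-pruning through growth limits such as `max_depth` and `min_samples_leaf`, and post-pruning through cost-complexity pruning (`ccp_alpha`). The sketch below is illustrative only; the specific limits and the choice of a mid-range alpha are arbitrary assumptions, and in practice alpha would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: cap growth up front
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then collapse weak branches
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # arbitrary mid-range alpha
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name}: leaves={model.get_n_leaves()}, test accuracy={model.score(X_test, y_test):.3f}")
```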
# 4. RANDOM FOREST
## Algorithm Background & Mathematical Foundation
**Core Concept:** Random Forest is an ensemble method that builds multiple decision trees and combines their predictions using bagging and feature randomness.
**Key Concepts:**
**Bagging (Bootstrap Aggregating):**
- Create multiple datasets by sampling with replacement
- Train a decision tree on each bootstrap sample
- Combine predictions by majority vote (classification) or averaging (regression)
**Feature Randomness:**
- At each split, consider only a random subset of features
- Typically $\sqrt{p}$ features for classification, $p/3$ for regression (where p = total features)
**Mathematical Formulation:**
**Final Prediction:**
- **Classification:** $\hat{y} = \text{mode}\{T_1(x), T_2(x), ..., T_B(x)\}$
- **Regression:** $\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$
**Out-of-Bag (OOB) Error:**
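- Each tree's bootstrap sample contains about 63% of the unique training samples ($1 - 1/e \approx 0.632$); the remaining ~37% are out-of-bag for that tree and serve as its validation set
- OOB error gives a nearly unbiased estimate of generalization error without a separate validation split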
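The roughly 63% / 37% split follows from sampling n items with replacement: the chance that a given sample is never drawn is $(1 - 1/n)^n \approx 1/e$. A quick simulation sketch (the sample size and seed are arbitrary choices for illustration):

```python
import math
import random

random.seed(42)
n_samples = 10_000

# One bootstrap sample: n indices drawn with replacement
bootstrap_indices = [random.randrange(n_samples) for _ in range(n_samples)]
in_bag = set(bootstrap_indices)

in_bag_fraction = len(in_bag) / n_samples
print(f"in-bag fraction:     {in_bag_fraction:.3f}")
print(f"out-of-bag fraction: {1 - in_bag_fraction:.3f}")
print(f"theory (1 - 1/e):    {1 - math.exp(-1):.3f}")
```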

In [None]:
# Cell 6: Random Forest - Comprehensive Implementation
print("🚀 RANDOM FOREST: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names
class_names = wine.target_names
print("📊 Dataset Overview:")
print(f"• Dataset: Wine Recognition")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Classes: {list(class_names)}")
print(f"• Class distribution: {np.bincount(y)}")
# Create DataFrame
df = pd.DataFrame(X, columns=feature_names)
df['wine_class'] = y
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train Random Forest models with different parameters
# oob_score=True makes each forest compute its out-of-bag score during fit
rf_models = {
    'Random Forest (10 trees)': ensemble.RandomForestClassifier(n_estimators=10, oob_score=True, random_state=42),
    'Random Forest (100 trees)': ensemble.RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42),
    'Random Forest (500 trees)': ensemble.RandomForestClassifier(n_estimators=500, oob_score=True, random_state=42),
    'Random Forest (max_features=sqrt)': ensemble.RandomForestClassifier(n_estimators=100, max_features='sqrt', oob_score=True, random_state=42),
    'Random Forest (max_features=log2)': ensemble.RandomForestClassifier(n_estimators=100, max_features='log2', oob_score=True, random_state=42)
}
rf_results = {}
for name, model in rf_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred, average='weighted')
    recall = metrics.recall_score(y_test, y_pred, average='weighted')
    f1 = metrics.f1_score(y_test, y_pred, average='weighted')
    rf_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'model': model,
        'y_pred': y_pred
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    # oob_score_ exists only when the forest was built with oob_score=True;
    # formatting the string 'N/A' with :.4f would raise a ValueError
    oob = getattr(model, 'oob_score_', None)
    print(f" • OOB Score: {oob:.4f}" if oob is not None else " • OOB Score: N/A")
# Model Comparison
comparison_rf = pd.DataFrame({
'Model': list(rf_results.keys()),
'Accuracy': [rf_results[name]['accuracy'] for name in rf_results.keys()],
'Precision': [rf_results[name]['precision'] for name in rf_results.keys()],
'Recall': [rf_results[name]['recall'] for name in rf_results.keys()],
'F1-Score': [rf_results[name]['f1'] for name in rf_results.keys()],
'OOB_Score': [rf_results[name]['model'].oob_score_ if hasattr(rf_results[name]['model'], 'oob_score_') else np.nan for name in rf_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_rf.round(4))
# Comprehensive Visualization
best_rf_name = max(rf_results.keys(), key=lambda x: rf_results[x]['accuracy'])
best_rf_model = rf_results[best_rf_name]['model']
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# Feature Importance
feature_importance = pd.DataFrame({
'feature': feature_names,
'importance': best_rf_model.feature_importances_
}).sort_values('importance', ascending=True)
axes[0,0].barh(feature_importance['feature'], feature_importance['importance'])
axes[0,0].set_xlabel('Feature Importance')
axes[0,0].set_title('Random Forest Feature Importance')
axes[0,0].grid(True, axis='x')
# Confusion Matrix
y_pred_best = best_rf_model.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,1],
xticklabels=class_names, yticklabels=class_names)
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('Actual')
axes[0,1].set_title('Confusion Matrix')
# Number of Trees vs Performance
n_trees_range = [1, 5, 10, 25, 50, 100, 200]
train_scores = []
test_scores = []
oob_scores = []
for n_trees in n_trees_range:
    rf_temp = ensemble.RandomForestClassifier(n_estimators=n_trees, random_state=42, oob_score=True)
    rf_temp.fit(X_train, y_train)
    train_scores.append(rf_temp.score(X_train, y_train))
    test_scores.append(rf_temp.score(X_test, y_test))
    oob_scores.append(rf_temp.oob_score_)
axes[1,0].plot(n_trees_range, train_scores, 'o-', label='Training Score', color='blue')
axes[1,0].plot(n_trees_range, test_scores, 'o-', label='Test Score', color='red')
axes[1,0].plot(n_trees_range, oob_scores, 'o-', label='OOB Score', color='green')
axes[1,0].set_xlabel('Number of Trees')
axes[1,0].set_ylabel('Accuracy')
axes[1,0].set_title('Random Forest: Number of Trees vs Performance')
axes[1,0].legend()
axes[1,0].grid(True)
# Model Comparison
models_rf = [name for name in rf_results.keys() if '100 trees' in name or 'max_features' in name]
accuracy_scores = [rf_results[name]['accuracy'] for name in models_rf]
bars = axes[1,1].bar(models_rf, accuracy_scores, color=['blue', 'green', 'orange'])
axes[1,1].set_ylabel('Accuracy Score')
axes[1,1].set_title('Random Forest: Different Configurations')
axes[1,1].tick_params(axis='x', rotation=45)
for bar, score in zip(bars, accuracy_scores):
    axes[1,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{score:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
# Individual Tree Analysis (for first few trees)
print("\n🌳 Individual Tree Analysis (First 3 Trees):")
for i in range(3):
    single_tree = best_rf_model.estimators_[i]
    single_tree_score = single_tree.score(X_test, y_test)
    print(f"• Tree {i+1}: Accuracy = {single_tree_score:.4f}, Depth = {single_tree.get_depth()}")
# Compare with Single Decision Tree
single_dt = tree.DecisionTreeClassifier(random_state=42)
single_dt.fit(X_train, y_train)
single_dt_score = single_dt.score(X_test, y_test)
print(f"\n📊 Comparison with Single Decision Tree:")
print(f"• Single Decision Tree Accuracy: {single_dt_score:.4f}")
print(f"• Random Forest Accuracy: {rf_results[best_rf_name]['accuracy']:.4f}")
print(f"• Improvement: {rf_results[best_rf_name]['accuracy'] - single_dt_score:.4f}")
print("✅ Random Forest Analysis Complete!")

## Random Forest Interview Questions & Answers
**Q1: How does Random Forest reduce overfitting compared to a single Decision Tree?**
**Answer:**
- **Bagging:** Each tree trained on different bootstrap sample
- **Feature Randomness:** Each split considers random feature subset
- **Averaging:** Combines predictions from multiple trees
- **Result:** Reduces variance while maintaining low bias
**Q2: What is the Out-of-Bag (OOB) error and why is it useful?**
**Answer:**
- **OOB Error:** Error calculated on samples not in bootstrap sample (~37% of data)
- **Benefits:**
- Provides unbiased estimate of generalization error
- No need for separate validation set
- Can be used for hyperparameter tuning
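As a minimal sketch of the OOB idea, the cell below fits a forest with `oob_score=True` on the Wine data and compares the OOB accuracy with a held-out test accuracy. It also checks the "~37%" figure empirically by drawing one bootstrap sample. The variable names (`X_oob`, `rf_oob`, `left_out_fraction`) and the choice of 200 trees are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_oob, y_oob = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_oob, y_oob, test_size=0.2, random_state=42, stratify=y_oob
)

# oob_score=True scores each training sample using only the trees
# whose bootstrap sample did not contain it
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_tr, y_tr)

oob_accuracy = rf_oob.oob_score_
test_accuracy = rf_oob.score(X_te, y_te)
print(f"OOB accuracy:  {oob_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

# Empirical check of the left-out fraction: sampling n items with
# replacement leaves roughly 1/e (about 36.8%) of them unselected
rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)
left_out_fraction = 1 - np.unique(bootstrap_idx).size / n
print(f"Left-out fraction in one bootstrap sample: {left_out_fraction:.3f}")
```

The two accuracies are usually close but not identical, since the OOB estimate is computed on the training split and each sample is scored by only a subset of the trees.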
**Q3: How do you choose the number of trees and features per split?**
**Answer:**
- **Number of trees:** More trees generally better, but diminishing returns (100-500 typical)
- **Features per split:**
- Classification: $\sqrt{p}$ (default)
- Regression: $p/3$
- Can tune as hyperparameter
- **Rule:** Increase trees until OOB error stabilizes
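The "increase trees until OOB error stabilizes" rule can be sketched with `warm_start=True`, which lets the same forest grow instead of being refit from scratch. The tree counts tried here and the names `rf_grow` and `oob_errors` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X_w, y_w = load_wine(return_X_y=True)

# warm_start=True keeps the already-fitted trees and only adds new ones
rf_grow = RandomForestClassifier(
    warm_start=True, oob_score=True, max_features="sqrt", random_state=42
)

oob_errors = {}
for n_trees in [25, 50, 100, 200, 400]:
    rf_grow.set_params(n_estimators=n_trees)
    rf_grow.fit(X_w, y_w)
    oob_errors[n_trees] = 1 - rf_grow.oob_score_

for n_trees, err in oob_errors.items():
    print(f"{n_trees:>4} trees: OOB error = {err:.4f}")
```

Once the error stops changing between successive tree counts, adding more trees mostly costs training time and memory.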
**Q4: What are the advantages of Random Forest over other algorithms?**
**Answer:**
- **High accuracy:** Often state-of-the-art for tabular data
- **Robust:** Handles outliers, missing values, irrelevant features
- **Feature importance:** Natural feature selection
- **Parallelizable:** Trees can be built independently
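- **Stable as trees are added:** Adding more trees does not increase overfitting, because averaging only stabilizes the prediction; individual deep trees can still fit noise, so depth and leaf-size limits still matter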
**Q5: When should you NOT use Random Forest?**
**Answer:**
- **Extrapolation:** Poor at predicting outside training range
- **Sparse data:** May not perform well with very high-dimensional sparse data
- **Interpretability:** Less interpretable than single tree
- **Large datasets:** Memory intensive for very large datasets
- **Real-time applications:** Slower prediction than single tree
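The extrapolation limitation is easy to demonstrate. In this sketch a Random Forest regressor is trained on a noiseless line $y = 3x + 2$ for $x \in [0, 10]$ and then asked to predict at $x = 15$ and $x = 20$. The toy data and names (`X_lin`, `rf_reg`, `pred_out`) are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Training data: a noiseless straight line on [0, 10]
X_lin = np.linspace(0, 10, 200).reshape(-1, 1)
y_lin = 3.0 * X_lin.ravel() + 2.0

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
rf_reg.fit(X_lin, y_lin)

# Query points outside the training range
X_out = np.array([[15.0], [20.0]])
true_out = 3.0 * X_out.ravel() + 2.0
pred_out = rf_reg.predict(X_out)

for x_val, t, p in zip(X_out.ravel(), true_out, pred_out):
    print(f"x = {x_val:>4}: true = {t:.1f}, predicted = {p:.2f}")
print(f"Largest training target: {y_lin.max():.1f}")
```

Each tree predicts the mean of training targets in a leaf, so the forest's output can never exceed the largest target seen during training.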
# 5. GRADIENT BOOSTING
## Algorithm Background & Mathematical Foundation
**Core Concept:** Gradient Boosting builds an ensemble of weak learners (typically decision trees) sequentially, where each new model corrects the errors made by previous models.
**Mathematical Formulation:**
**General Boosting Algorithm:**
1. Initialize model with constant value: $F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)$
2. For m = 1 to M:
a. Compute pseudo-residuals: $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)}$
b. Fit weak learner to pseudo-residuals: $h_m(x)$
c. Compute multiplier: $\gamma_m = \arg\min_\gamma \sum_{i=1}^n L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))$
d. Update model: $F_m(x) = F_{m-1}(x) + \nu \gamma_m h_m(x)$
**For Regression (MSE Loss):**
- Loss: $L(y_i, F(x_i)) = \frac{1}{2}(y_i - F(x_i))^2$
- Pseudo-residuals: $r_{im} = y_i - F_{m-1}(x_i)$
- Each tree fits the residuals from previous model
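The regression case can be written out in a few lines. The sketch below is a from-scratch illustration on a toy sine dataset, not the scikit-learn implementation: it starts from the mean, fits a shallow `DecisionTreeRegressor` to the residuals, and applies the shrunken update. For squared loss, a tree whose leaves hold the mean residual already gives $\gamma_m = 1$, so the line search is skipped. The dataset, `nu = 0.1`, `M = 100`, and `max_depth=2` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X_toy = rng.uniform(-3, 3, size=(300, 1))
y_toy = np.sin(X_toy.ravel()) + rng.normal(0, 0.1, size=300)

nu = 0.1   # learning rate (shrinkage)
M = 100    # number of boosting stages

# Step 1: F_0 is the constant minimizing squared error, i.e. the mean
F = np.full_like(y_toy, y_toy.mean())
weak_learners = []
mse_history = [np.mean((y_toy - F) ** 2)]

for m in range(M):
    # Step 2a: pseudo-residuals for L = 1/2 (y - F)^2 are plain residuals
    residuals = y_toy - F
    # Step 2b: fit a weak learner to the residuals
    h_m = DecisionTreeRegressor(max_depth=2, random_state=m).fit(X_toy, residuals)
    # Steps 2c-2d: gamma_m = 1 for squared loss, so F_m = F_{m-1} + nu * h_m
    F = F + nu * h_m.predict(X_toy)
    weak_learners.append(h_m)
    mse_history.append(np.mean((y_toy - F) ** 2))

print(f"Training MSE after F_0:   {mse_history[0]:.4f}")
print(f"Training MSE after M={M}: {mse_history[-1]:.4f}")
```

To predict on new data, start from `y_toy.mean()` and add `nu * h.predict(X_new)` for every tree in `weak_learners`.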
**For Classification (Log Loss):**
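- The model output $F(x)$ is a log-odds score; the probability is $p(x) = \sigma(F(x)) = \frac{1}{1 + e^{-F(x)}}$
- Loss: $L(y_i, F(x_i)) = -\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$ with $p_i = \sigma(F(x_i))$
- Pseudo-residuals: $r_{im} = y_i - \sigma(F_{m-1}(x_i))$, i.e. the label minus the current predicted probability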
**Key Parameters:**
- **Learning rate (ν):** Shrinks contribution of each tree
- **Number of estimators (M):** Number of boosting stages
- **Max depth:** Complexity of weak learners
- **Subsample:** Fraction of samples used for each tree

In [None]:
# Cell 7: Gradient Boosting - Comprehensive Implementation
print("🚀 GRADIENT BOOSTING: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
from sklearn.datasets import make_classification
# Create a more complex dataset for demonstration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, n_clusters_per_class=1, random_state=42)
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Class distribution: {np.bincount(y)}")
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train Gradient Boosting models with different parameters
gb_models = {
'Gradient Boosting (default)': ensemble.GradientBoostingClassifier(random_state=42),
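# learning_rate=0.1 is the scikit-learn default, so a larger value is used to make this a distinct configuration
'Gradient Boosting (high learning rate)': ensemble.GradientBoostingClassifier(learning_rate=0.5, random_state=42),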
'Gradient Boosting (low learning rate)': ensemble.GradientBoostingClassifier(learning_rate=0.01, n_estimators=500, random_state=42),
'Gradient Boosting (shallow trees)': ensemble.GradientBoostingClassifier(max_depth=2, random_state=42),
'Gradient Boosting (subsample)': ensemble.GradientBoostingClassifier(subsample=0.8, random_state=42)
}
gb_results = {}
for name, model in gb_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    roc_auc = metrics.roc_auc_score(y_test, y_pred_proba)
    gb_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'model': model,
        'y_pred_proba': y_pred_proba
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • ROC-AUC: {roc_auc:.4f}")
# Model Comparison
comparison_gb = pd.DataFrame({
'Model': list(gb_results.keys()),
'Accuracy': [gb_results[name]['accuracy'] for name in gb_results.keys()],
'Precision': [gb_results[name]['precision'] for name in gb_results.keys()],
'Recall': [gb_results[name]['recall'] for name in gb_results.keys()],
'F1-Score': [gb_results[name]['f1'] for name in gb_results.keys()],
'ROC-AUC': [gb_results[name]['roc_auc'] for name in gb_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_gb.round(4))
# XGBoost Implementation
print("\n🔥 XGBOOST IMPLEMENTATION")
xgb_models = {
'XGBoost (default)': xgb.XGBClassifier(random_state=42),
# XGBoost's default learning rate is 0.3, so 0.1 is a lower rate than the default
'XGBoost (learning_rate=0.1)': xgb.XGBClassifier(learning_rate=0.1, random_state=42),
'XGBoost (with regularization)': xgb.XGBClassifier(learning_rate=0.1, reg_alpha=1.0, reg_lambda=1.0, random_state=42)
}
xgb_results = {}
for name, model in xgb_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    roc_auc = metrics.roc_auc_score(y_test, y_pred_proba)
    xgb_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'model': model
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • ROC-AUC: {roc_auc:.4f}")
# Comprehensive Visualization
best_gb_name = max(gb_results.keys(), key=lambda x: gb_results[x]['roc_auc'])
best_gb_model = gb_results[best_gb_name]['model']
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Feature Importance (Gradient Boosting)
feature_importance_gb = pd.DataFrame({
'feature': [f'Feature_{i}' for i in range(X.shape[1])],
'importance': best_gb_model.feature_importances_
}).sort_values('importance', ascending=True).tail(10)
axes[0,0].barh(feature_importance_gb['feature'], feature_importance_gb['importance'])
axes[0,0].set_xlabel('Feature Importance')
axes[0,0].set_title('Gradient Boosting - Top 10 Features')
# 2. Feature Importance (XGBoost)
best_xgb_name = max(xgb_results.keys(), key=lambda x: xgb_results[x]['roc_auc'])
best_xgb_model = xgb_results[best_xgb_name]['model']
feature_importance_xgb = pd.DataFrame({
'feature': [f'Feature_{i}' for i in range(X.shape[1])],
'importance': best_xgb_model.feature_importances_
}).sort_values('importance', ascending=True).tail(10)
axes[0,1].barh(feature_importance_xgb['feature'], feature_importance_xgb['importance'])
axes[0,1].set_xlabel('Feature Importance')
axes[0,1].set_title('XGBoost - Top 10 Features')
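# 3. Learning Curve - Training Deviance
# train_score_[i] is the training loss after boosting iteration i.
# The old loss_ attribute is not available in recent scikit-learn releases,
# so no test deviance is computed here; only the training curve is plotted.
train_score = best_gb_model.train_score_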
axes[0,2].plot(range(1, len(train_score) + 1), train_score, 'b-', label='Training Deviance')
axes[0,2].set_xlabel('Boosting Iterations')
axes[0,2].set_ylabel('Deviance')
axes[0,2].set_title('Training Deviance')
axes[0,2].legend()
axes[0,2].grid(True)
# 4. ROC Curve Comparison
# Gradient Boosting
fpr_gb, tpr_gb, _ = metrics.roc_curve(y_test, gb_results[best_gb_name]['y_pred_proba'])
roc_auc_gb = metrics.auc(fpr_gb, tpr_gb)
# XGBoost
y_pred_proba_xgb = best_xgb_model.predict_proba(X_test_scaled)[:, 1]
fpr_xgb, tpr_xgb, _ = metrics.roc_curve(y_test, y_pred_proba_xgb)
roc_auc_xgb = metrics.auc(fpr_xgb, tpr_xgb)
axes[1,0].plot(fpr_gb, tpr_gb, color='blue', lw=2, label=f'Gradient Boosting (AUC = {roc_auc_gb:.2f})')
axes[1,0].plot(fpr_xgb, tpr_xgb, color='red', lw=2, label=f'XGBoost (AUC = {roc_auc_xgb:.2f})')
axes[1,0].plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--')
axes[1,0].set_xlabel('False Positive Rate')
axes[1,0].set_ylabel('True Positive Rate')
axes[1,0].set_title('ROC Curve Comparison')
axes[1,0].legend()
axes[1,0].grid(True)
# 5. Learning Rate Comparison
learning_rates = [0.01, 0.05, 0.1, 0.2]
lr_scores = []
for lr in learning_rates:
    gb_lr = ensemble.GradientBoostingClassifier(learning_rate=lr, n_estimators=100, random_state=42)
    gb_lr.fit(X_train_scaled, y_train)
    lr_scores.append(gb_lr.score(X_test_scaled, y_test))
axes[1,1].plot(learning_rates, lr_scores, 'o-', color='green')
axes[1,1].set_xlabel('Learning Rate')
axes[1,1].set_ylabel('Accuracy')
axes[1,1].set_title('Learning Rate vs Performance')
axes[1,1].grid(True)
# 6. Model Comparison
all_models = {**gb_results, **xgb_results}
model_names = list(all_models.keys())
accuracy_scores = [all_models[name]['accuracy'] for name in model_names]
bars = axes[1,2].bar(range(len(model_names)), accuracy_scores, color=['blue']*5 + ['red']*3)
axes[1,2].set_ylabel('Accuracy')
axes[1,2].set_title('All Models Comparison')
axes[1,2].set_xticks(range(len(model_names)))
axes[1,2].set_xticklabels(model_names, rotation=45, ha='right')
for bar, score in zip(bars, accuracy_scores):
    axes[1,2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{score:.3f}', ha='center', va='bottom', fontsize=8)
plt.tight_layout()
plt.show()
# Stage-wise Analysis
print("\n📈 Stage-wise Performance Analysis:")
gb_staged = ensemble.GradientBoostingClassifier(n_estimators=200, random_state=42)
gb_staged.fit(X_train_scaled, y_train)
# Get staged predictions
staged_accuracy = []
for i, y_pred in enumerate(gb_staged.staged_predict(X_test_scaled)):
    staged_accuracy.append(metrics.accuracy_score(y_test, y_pred))
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(staged_accuracy) + 1), staged_accuracy, 'b-')
plt.xlabel('Number of Boosting Stages')
plt.ylabel('Test Accuracy')
plt.title('Gradient Boosting: Performance vs Number of Stages')
plt.grid(True)
plt.show()
print("✅ Gradient Boosting Analysis Complete!")

## Gradient Boosting Interview Questions & Answers
**Q1: What is the fundamental difference between Random Forest and Gradient Boosting?**
**Answer:**
- **Random Forest:** Parallel ensemble (bagging) - trees built independently
- **Gradient Boosting:** Sequential ensemble (boosting) - trees built sequentially to correct errors
- **RF:** Reduces variance, maintains bias
- **GB:** Reduces bias, can increase variance
- **RF:** Trees can be deep and overfit individually
- **GB:** Trees are typically shallow (weak learners)
**Q2: Explain the role of the learning rate in Gradient Boosting.**
**Answer:**
- **Learning rate (ν):** Shrinks the contribution of each tree
- **Low learning rate (0.01-0.1):**
- Requires more trees
- More robust, less prone to overfitting
- Better generalization
- **High learning rate (0.1-0.3):**
- Fewer trees needed
- Faster training
- Higher risk of overfitting
- **Typical strategy:** Use small learning rate with many trees
**Q3: What is XGBoost and how does it improve upon basic Gradient Boosting?**
**Answer:**
**XGBoost (Extreme Gradient Boosting) enhancements:**
- **Regularization:** L1 (Lasso) and L2 (Ridge) regularization
- **Handling missing values:** Automatically learns direction for missing values
- **Tree pruning:** More efficient pruning strategy
- **Parallel processing:** Faster training
- **Cross-validation:** Built-in cross-validation
- **Early stopping:** Stop training when no improvement
**Q4: How does Gradient Boosting handle overfitting?**
**Answer:**
1. **Learning rate shrinkage:** Reduces each tree's influence
2. **Subsampling:** Use random subsets of data for each tree
3. **Tree constraints:** Limit tree depth, min samples per leaf
4. **Early stopping:** Stop when validation performance stops improving
5. **Regularization** (XGBoost): L1/L2 regularization on weights
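Several of these controls can be combined in one scikit-learn estimator. The sketch below uses shrinkage, subsampling, a depth limit, and early stopping through `n_iter_no_change`, which holds out `validation_fraction` of the training data and stops once the validation score stops improving. The dataset and the specific values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_es, y_es = make_classification(
    n_samples=600, n_features=20, n_informative=10, random_state=42
)

gb_es = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of stages
    learning_rate=0.1,        # shrinkage
    max_depth=3,              # tree constraint
    subsample=0.8,            # stochastic gradient boosting
    validation_fraction=0.2,  # internal hold-out used for early stopping
    n_iter_no_change=10,      # stop after 10 stages without improvement
    tol=1e-4,
    random_state=42,
)
gb_es.fit(X_es, y_es)

n_stages_used = gb_es.n_estimators_
print(f"Stages requested: {gb_es.n_estimators}, stages actually fitted: {n_stages_used}")
```

`n_estimators_` reports how many stages were kept, which is a convenient starting point when choosing `n_estimators` for a final model.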
**Q5: What are the main hyperparameters to tune in Gradient Boosting?**
**Answer:**
- **n_estimators:** Number of boosting stages
- **learning_rate:** Shrinks contribution of each tree
- **max_depth:** Maximum depth of individual trees
- **min_samples_split:** Minimum samples required to split
- **subsample:** Fraction of samples used for fitting
- **max_features:** Number of features to consider for splits
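A small grid search shows how a few of these hyperparameters might be tuned together. This is a sketch with a deliberately tiny grid so that it runs quickly; the grid values, the synthetic dataset, and the use of ROC-AUC as the scoring metric are assumptions, and a real search would normally cover wider ranges.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X_gs, y_gs = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

param_grid = {
    "learning_rate": [0.05, 0.2],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_gs, y_gs)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated ROC-AUC: {search.best_score_:.4f}")
```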
# 6. SUPPORT VECTOR MACHINES (SVM)
## Algorithm Background & Mathematical Foundation
**Core Concept:** SVM finds the optimal hyperplane that maximizes the margin between classes in feature space.
**Mathematical Formulation:**
**Linear SVM:**
- **Decision function:** $f(x) = w^T x + b$
- **Prediction:** $\text{sign}(f(x))$
- **Margin:** $\frac{2}{\|w\|}$
**Optimization Problem (Hard Margin):**
$$\min_{w,b} \frac{1}{2}\|w\|^2$$
$$\text{subject to } y_i(w^T x_i + b) \geq 1 \quad \forall i$$
**Optimization Problem (Soft Margin):**
$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^n \xi_i$$
$$\text{subject to } y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i$$
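The quantities in the soft-margin problem can be read off a fitted linear SVM. This sketch fits `SVC(kernel="linear")` on two Gaussian blobs, recovers $w$ and $b$, computes the margin width $2/\|w\|$, and evaluates the slack $\xi_i = \max(0, 1 - y_i f(x_i))$ for every point. The blob dataset and the names `X_blob`, `y_pm`, and `slack` are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_blob, y_blob = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
y_pm = np.where(y_blob == 1, 1, -1)   # labels in {-1, +1}, as in the formulation

clf_lin = SVC(kernel="linear", C=1.0).fit(X_blob, y_pm)

w = clf_lin.coef_.ravel()
b = clf_lin.intercept_[0]

f_values = X_blob @ w + b                      # decision function f(x) = w^T x + b
margin_width = 2.0 / np.linalg.norm(w)         # distance between the two margin lines
slack = np.maximum(0.0, 1.0 - y_pm * f_values) # xi_i for each training point

print(f"Margin width 2/||w||: {margin_width:.4f}")
print(f"Points with non-zero slack: {(slack > 1e-6).sum()} of {len(slack)}")
print(f"Number of support vectors: {clf_lin.n_support_.sum()}")
```

Points with zero slack satisfy the hard-margin constraint; points with $\xi_i > 1$ are misclassified.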
**Kernel Trick:**
- Maps data to higher-dimensional space: $\phi(x)$
- **Kernel function:** $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- **Common kernels:**
- Linear: $K(x_i, x_j) = x_i^T x_j$
- Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$
- RBF: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
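The three kernel formulas can be checked numerically. The sketch below builds each Gram matrix directly from its formula with NumPy and compares it with the corresponding function in `sklearn.metrics.pairwise`. The random data and the values `gamma = 0.5`, `r = 1.0`, `d = 3` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))   # 5 samples, 3 features
gamma, r, d = 0.5, 1.0, 3

# Linear: K = x_i^T x_j
K_lin = A @ A.T

# Polynomial: K = (gamma * x_i^T x_j + r)^d
K_poly = (gamma * (A @ A.T) + r) ** d

# RBF: K = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)

print("Linear matches sklearn:    ", np.allclose(K_lin, linear_kernel(A)))
print("Polynomial matches sklearn:", np.allclose(K_poly, polynomial_kernel(A, degree=d, gamma=gamma, coef0=r)))
print("RBF matches sklearn:       ", np.allclose(K_rbf, rbf_kernel(A, gamma=gamma)))
```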

In [None]:
# Cell 8: Support Vector Machines - Comprehensive Implementation
print("🚀 SUPPORT VECTOR MACHINES: COMPREHENSIVE IMPLEMENTATION\n")
# Create a non-linearly separable dataset for demonstration
from sklearn.datasets import make_circles
X, y = make_circles(n_samples=300, noise=0.1, factor=0.3, random_state=42)
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Class distribution: {np.bincount(y)}")
# Visualize the original data
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0, 0], X[y == 0, 1], color='red', alpha=0.6, label='Class 0')
plt.scatter(X[y == 1, 0], X[y == 1, 1], color='blue', alpha=0.6, label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data - Non-linearly Separable')
plt.legend()
plt.grid(True)
plt.show()
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train SVM models with different kernels
svm_models = {
'SVM (Linear Kernel)': svm.SVC(kernel='linear', random_state=42),
'SVM (RBF Kernel)': svm.SVC(kernel='rbf', random_state=42),
'SVM (Polynomial Kernel)': svm.SVC(kernel='poly', degree=3, random_state=42),
'SVM (Sigmoid Kernel)': svm.SVC(kernel='sigmoid', random_state=42),
'LinearSVC': svm.LinearSVC(random_state=42)
}
svm_results = {}
for name, model in svm_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    # For kernels that support decision function
    if hasattr(model, 'decision_function'):
        y_decision = model.decision_function(X_test_scaled)
        roc_auc = metrics.roc_auc_score(y_test, y_decision)
    else:
        roc_auc = metrics.roc_auc_score(y_test, y_pred)
    svm_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'model': model,
        'y_pred': y_pred
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • ROC-AUC: {roc_auc:.4f}")
# Model Comparison
comparison_svm = pd.DataFrame({
'Model': list(svm_results.keys()),
'Accuracy': [svm_results[name]['accuracy'] for name in svm_results.keys()],
'Precision': [svm_results[name]['precision'] for name in svm_results.keys()],
'Recall': [svm_results[name]['recall'] for name in svm_results.keys()],
'F1-Score': [svm_results[name]['f1'] for name in svm_results.keys()],
'ROC-AUC': [svm_results[name]['roc_auc'] for name in svm_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_svm.round(4))
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# Create mesh grid for decision boundaries
x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
np.arange(y_min, y_max, 0.02))
# Plot decision boundaries for different kernels
kernels_to_plot = ['Linear Kernel', 'RBF Kernel', 'Polynomial Kernel']
for i, kernel_name in enumerate(kernels_to_plot):
    model_key = f'SVM ({kernel_name})'
    if model_key in svm_results:
        model = svm_results[model_key]['model']
        # Predict on mesh grid
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        # Plot decision boundary
        axes[0,i].contourf(xx, yy, Z, alpha=0.8, cmap='RdYlBu')
        axes[0,i].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
                          edgecolor='black', s=50, cmap='RdYlBu')
        axes[0,i].set_xlabel('Feature 1 (scaled)')
        axes[0,i].set_ylabel('Feature 2 (scaled)')
        axes[0,i].set_title(f'SVM with {kernel_name}')
        axes[0,i].grid(True)
# Support Vectors Visualization
best_svm_name = max(svm_results.keys(), key=lambda x: svm_results[x]['accuracy'])
best_svm_model = svm_results[best_svm_name]['model']
if hasattr(best_svm_model, 'support_vectors_'):
    axes[1,0].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
                      alpha=0.3, cmap='RdYlBu')
    axes[1,0].scatter(best_svm_model.support_vectors_[:, 0],
                      best_svm_model.support_vectors_[:, 1],
                      s=100, facecolors='none', edgecolors='black',
                      linewidths=1, label='Support Vectors')
    axes[1,0].set_xlabel('Feature 1 (scaled)')
    axes[1,0].set_ylabel('Feature 2 (scaled)')
    axes[1,0].set_title('Support Vectors')
    axes[1,0].legend()
    axes[1,0].grid(True)
# C Parameter Analysis
C_values = [0.1, 1, 10, 100, 1000]
C_scores = []
for C_val in C_values:
    svm_temp = svm.SVC(kernel='rbf', C=C_val, random_state=42)
    svm_temp.fit(X_train_scaled, y_train)
    C_scores.append(svm_temp.score(X_test_scaled, y_test))
axes[1,1].semilogx(C_values, C_scores, 'o-', color='green')
axes[1,1].set_xlabel('C (Regularization Parameter)')
axes[1,1].set_ylabel('Accuracy')
axes[1,1].set_title('SVM: C Parameter vs Performance')
axes[1,1].grid(True)
# Gamma Parameter Analysis (for RBF kernel)
gamma_values = [0.001, 0.01, 0.1, 1, 10, 100]
gamma_scores = []
for gamma_val in gamma_values:
    svm_temp = svm.SVC(kernel='rbf', gamma=gamma_val, random_state=42)
    svm_temp.fit(X_train_scaled, y_train)
    gamma_scores.append(svm_temp.score(X_test_scaled, y_test))
axes[1,2].semilogx(gamma_values, gamma_scores, 'o-', color='purple')
axes[1,2].set_xlabel('Gamma (Kernel Coefficient)')
axes[1,2].set_ylabel('Accuracy')
axes[1,2].set_title('SVM: Gamma Parameter vs Performance (RBF Kernel)')
axes[1,2].grid(True)
plt.tight_layout()
plt.show()
# Kernel Comparison on Different Datasets
print("\n🔍 Kernel Performance on Different Data Patterns:")
# Create different datasets
datasets_info = {
'Linear': datasets.make_classification(n_samples=100, n_features=2, n_redundant=0,
n_informative=2, n_clusters_per_class=1, random_state=42),
'Moons': datasets.make_moons(n_samples=100, noise=0.1, random_state=42),
'Circles': make_circles(n_samples=100, noise=0.1, factor=0.3, random_state=42)
}
kernel_performance = {}
for data_name, (X_data, y_data) in datasets_info.items():
    print(f"\nAnalyzing {data_name} dataset...")
    X_train_d, X_test_d, y_train_d, y_test_d = model_selection.train_test_split(
        X_data, y_data, test_size=0.2, random_state=42
    )
    # Scale features
    scaler_d = StandardScaler()
    X_train_d_scaled = scaler_d.fit_transform(X_train_d)
    X_test_d_scaled = scaler_d.transform(X_test_d)
    kernels = ['linear', 'rbf', 'poly']
    for kernel in kernels:
        svm_temp = svm.SVC(kernel=kernel, random_state=42)
        svm_temp.fit(X_train_d_scaled, y_train_d)
        score = svm_temp.score(X_test_d_scaled, y_test_d)
        kernel_performance[(data_name, kernel)] = score
        print(f" • {kernel} kernel: {score:.4f}")
# SVM for Regression (SVR)
print("\n📈 Support Vector Regression (SVR) Example:")
# Create regression dataset
X_reg, y_reg = datasets.make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Train SVR models
svr_models = {
'SVR (Linear)': svm.SVR(kernel='linear'),
'SVR (RBF)': svm.SVR(kernel='rbf'),
'SVR (Poly)': svm.SVR(kernel='poly')
}
plt.figure(figsize=(15, 5))
for i, (name, model) in enumerate(svr_models.items()):
    model.fit(X_reg, y_reg)
    y_pred_reg = model.predict(X_reg)
    plt.subplot(1, 3, i+1)
    plt.scatter(X_reg, y_reg, alpha=0.6, label='Data')
    # Sort for nice plotting
    X_sorted = np.sort(X_reg, axis=0)
    y_pred_sorted = model.predict(X_sorted)
    plt.plot(X_sorted, y_pred_sorted, 'r-', linewidth=2, label='SVR Prediction')
    plt.xlabel('Feature')
    plt.ylabel('Target')
    plt.title(name)
    plt.legend()
    plt.grid(True)
plt.tight_layout()
plt.show()
print("✅ Support Vector Machines Analysis Complete!")

## Support Vector Machines Interview Questions & Answers
**Q1: What is the kernel trick and why is it important?**
**Answer:**
- **Kernel trick:** Method to operate in high-dimensional feature space without computing coordinates
- **Mathematical basis:** $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$
- **Benefits:**
- Computationally efficient
- Enables non-linear decision boundaries
- Works with infinite-dimensional features
- **Common kernels:** Linear, Polynomial, RBF, Sigmoid
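A tiny worked example makes the kernel trick concrete. For the degree-2 polynomial kernel $K(x, z) = (x^T z)^2$ in two dimensions, one valid explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below checks that the dot product in the 3-dimensional feature space equals the kernel evaluated in the original 2-dimensional space. The two vectors are arbitrary.

```python
import numpy as np

def phi(v):
    """Explicit feature map for K(x, z) = (x . z)^2 with 2-D inputs."""
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x_vec = np.array([1.0, 2.0])
z_vec = np.array([3.0, -1.0])

explicit_value = phi(x_vec) @ phi(z_vec)   # dot product in feature space
kernel_value = (x_vec @ z_vec) ** 2        # same quantity without computing phi

print(f"phi(x) . phi(z) = {explicit_value:.4f}")
print(f"(x . z)^2       = {kernel_value:.4f}")
```

Here $x^T z = 1$, so both expressions equal 1. For the RBF kernel the corresponding feature space is infinite-dimensional, which is why only the kernel form is computable.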
**Q2: Explain the role of the C parameter in SVM.**
**Answer:**
- **C parameter:** Regularization parameter that controls trade-off between margin maximization and error minimization
- **Small C:** Large margin, more misclassifications allowed (underfitting)
- **Large C:** Small margin, fewer misclassifications (overfitting)
- **Effect:**
- C → 0: Very soft margin, many support vectors
- C → ∞: Hard margin, few support vectors
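The effect of C on the number of support vectors can be observed directly. This sketch fits an RBF SVM on a noisy two-moons dataset for three values of C and counts the support vectors each time. The dataset, noise level, and C values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=300, noise=0.25, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

n_support_by_C = {}
for C_value in [0.01, 1, 100]:
    clf_c = SVC(kernel="rbf", C=C_value).fit(X_moons, y_moons)
    # n_support_ holds the number of support vectors per class
    n_support_by_C[C_value] = int(clf_c.n_support_.sum())

for C_value, n_sv in n_support_by_C.items():
    print(f"C = {C_value:>6}: {n_sv} support vectors out of {len(X_moons)} samples")
```

With a very small C almost every point ends up inside the wide margin and becomes a support vector; a large C narrows the margin and leaves far fewer.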
**Q3: What are support vectors and why are they important?**
**Answer:**
- **Support vectors:** Training points that lie on the margin boundaries, inside the margin, or on the wrong side of the decision boundary (the points with non-zero dual coefficients)
- **Importance:**
- Determine the decision boundary
- Only support vectors affect the model
- Model is sparse - depends only on support vectors
- Number of support vectors indicates model complexity
**Q4: Compare Linear SVM and Logistic Regression.**
**Answer:**
- **Similarities:** Both find linear decision boundaries
- **Differences:**
- **SVM:** Maximizes margin, focuses on boundary points
- **LR:** Maximizes likelihood, uses all data points
- **SVM:** Better with clear margin of separation
- **LR:** Provides probability estimates
- **SVM:** More robust to outliers
- **LR:** Faster training for large datasets
**Q5: When should you use SVM vs other classifiers?**
**Answer:**
**Use SVM when:**
- Clear margin of separation exists
- High-dimensional spaces
- Non-linear relationships (with kernels)
- Number of features > number of samples
- Need robust model to outliers
**Avoid SVM when:**
- Very large datasets (slow training)
- Noisy datasets with overlapping classes
- Need probability estimates
- Interpretability is crucial
- Multi-class problems with many classes
# 7. K-NEAREST NEIGHBORS (KNN)
## Algorithm Background & Mathematical Foundation
**Core Concept:** KNN is an instance-based learning algorithm that classifies data points based on the majority class among their k-nearest neighbors.
**Mathematical Formulation:**
**Distance Metrics:**
- **Euclidean:** $d(x,y) = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$
- **Manhattan:** $d(x,y) = \sum_{i=1}^n |x_i - y_i|$
- **Minkowski:** $d(x,y) = (\sum_{i=1}^n |x_i - y_i|^p)^{1/p}$
- **Cosine:** $d(x,y) = 1 - \frac{x \cdot y}{\|x\|\|y\|}$
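The four distance formulas can be verified on a pair of small vectors. The sketch computes each one from its definition with NumPy and compares it with the matching function in `scipy.spatial.distance`. The vectors `u` and `v` and the choice `p = 3` for Minkowski are arbitrary.

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

euclidean_d = np.sqrt(np.sum((u - v) ** 2))             # sqrt(9 + 16 + 0) = 5.0
manhattan_d = np.sum(np.abs(u - v))                     # 3 + 4 + 0 = 7.0
minkowski_d = np.sum(np.abs(u - v) ** 3) ** (1 / 3)     # p = 3
cosine_d = 1 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"Euclidean: {euclidean_d:.4f} (scipy: {distance.euclidean(u, v):.4f})")
print(f"Manhattan: {manhattan_d:.4f} (scipy: {distance.cityblock(u, v):.4f})")
print(f"Minkowski: {minkowski_d:.4f} (scipy: {distance.minkowski(u, v, p=3):.4f})")
print(f"Cosine:    {cosine_d:.4f} (scipy: {distance.cosine(u, v):.4f})")
```

Minkowski with $p = 1$ reduces to Manhattan and with $p = 2$ to Euclidean, so those two are special cases of the third formula.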
**Algorithm:**
1. Choose the number k of neighbors
2. Calculate distance between query instance and all training samples
3. Sort distances and find k nearest neighbors
4. Gather categories of k nearest neighbors
5. Use majority vote or averaging for prediction
**For Classification:**
$$\hat{y} = \text{mode}(y_{i_1}, y_{i_2}, ..., y_{i_k})$$
**For Regression:**
$$\hat{y} = \frac{1}{k}\sum_{j=1}^k y_{i_j}$$
**Weighted KNN:**
$$w_i = \frac{1}{d(x, x_i)^2}$$
$$\hat{y} = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}$$
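The algorithm steps above can be sketched from scratch in a few lines. The helper `knn_predict` is a hypothetical function written for this example, not a library API. For classification, the weighted variant sums the weights $w_i = 1/d^2$ per class and picks the class with the largest total, which is the voting analogue of the weighted average shown for regression. The toy data is chosen so that plain and weighted voting disagree.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, weighted=False):
    """Classify x_query by (optionally distance-weighted) vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every training sample
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    if not weighted:
        # Steps 4-5: plain majority vote
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    # Weighted vote with w_i = 1 / d^2 (small epsilon avoids division by zero)
    class_weight = {}
    for idx in nearest:
        w_i = 1.0 / (dists[idx] ** 2 + 1e-12)
        label = int(y_train[idx])
        class_weight[label] = class_weight.get(label, 0.0) + w_i
    return max(class_weight, key=class_weight.get)

X_small = np.array([[0.0], [3.0], [4.0], [6.5]])
y_small = np.array([0, 0, 0, 1])
query = np.array([6.0])

plain_vote = knn_predict(X_small, y_small, query, k=3)
weighted_vote = knn_predict(X_small, y_small, query, k=3, weighted=True)
print(f"Majority vote:  class {plain_vote}")
print(f"Weighted vote:  class {weighted_vote}")
```

The three nearest neighbors are at distances 0.5 (class 1), 2 and 3 (both class 0). Majority voting picks class 0, while the weights 4 versus 0.25 + 0.11 make the weighted vote pick class 1.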

In [None]:
# Cell 9: K-Nearest Neighbors - Comprehensive Implementation
print("🚀 K-NEAREST NEIGHBORS: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
iris = datasets.load_iris()
X = iris.data[:, :2] # Use only first two features for visualization
y = iris.target
feature_names = iris.feature_names[:2]
class_names = iris.target_names
print("📊 Dataset Overview:")
print(f"• Dataset: Iris Flowers (first two features)")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Classes: {list(class_names)}")
print(f"• Class distribution: {np.bincount(y)}")
# Visualize the data
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', s=50, alpha=0.7)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Iris Dataset - First Two Features')
plt.colorbar(scatter, label='Class')
plt.grid(True)
plt.show()
# Data preprocessing
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Scale features (important for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN models with different parameters
knn_models = {
'KNN (k=1)': neighbors.KNeighborsClassifier(n_neighbors=1),
'KNN (k=3)': neighbors.KNeighborsClassifier(n_neighbors=3),
'KNN (k=5)': neighbors.KNeighborsClassifier(n_neighbors=5),
'KNN (k=10)': neighbors.KNeighborsClassifier(n_neighbors=10),
'KNN (k=20)': neighbors.KNeighborsClassifier(n_neighbors=20),
'KNN (k=50)': neighbors.KNeighborsClassifier(n_neighbors=50)
}
knn_results = {}
for name, model in knn_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred, average='weighted')
    recall = metrics.recall_score(y_test, y_pred, average='weighted')
    f1 = metrics.f1_score(y_test, y_pred, average='weighted')
    knn_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'model': model,
        'y_pred': y_pred
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
# Model Comparison
comparison_knn = pd.DataFrame({
'Model': list(knn_results.keys()),
'Accuracy': [knn_results[name]['accuracy'] for name in knn_results.keys()],
'Precision': [knn_results[name]['precision'] for name in knn_results.keys()],
'Recall': [knn_results[name]['recall'] for name in knn_results.keys()],
'F1-Score': [knn_results[name]['f1'] for name in knn_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_knn.round(4))
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. K Value vs Performance
k_values = [1, 3, 5, 10, 20, 50]
train_scores = []
test_scores = []
for k in k_values:
    knn_temp = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_scaled, y_train)
    train_scores.append(knn_temp.score(X_train_scaled, y_train))
    test_scores.append(knn_temp.score(X_test_scaled, y_test))
axes[0,0].plot(k_values, train_scores, 'o-', label='Training Score', color='blue')
axes[0,0].plot(k_values, test_scores, 'o-', label='Test Score', color='red')
axes[0,0].set_xlabel('k (Number of Neighbors)')
axes[0,0].set_ylabel('Accuracy')
axes[0,0].set_title('KNN: k Value vs Performance')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Distance Metrics Comparison
# Note: 'minkowski' with the default p=2 is equivalent to 'euclidean',
# so those two bars are expected to match
distance_metrics = ['euclidean', 'manhattan', 'minkowski']
distance_scores = []
for metric in distance_metrics:
    knn_temp = neighbors.KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_temp.fit(X_train_scaled, y_train)
    distance_scores.append(knn_temp.score(X_test_scaled, y_test))
axes[0,1].bar(distance_metrics, distance_scores, color=['blue', 'green', 'orange'])
axes[0,1].set_ylabel('Accuracy')
axes[0,1].set_title('KNN: Distance Metrics Comparison')
axes[0,1].set_ylim(0, 1)
for i, score in enumerate(distance_scores):
    axes[0,1].text(i, score + 0.01, f'{score:.3f}', ha='center', va='bottom')
# 3. Decision Boundaries for different k values
# (assumes the data has exactly two features, because the mesh grid is 2-D)
k_values_boundaries = [1, 5, 15, 50]
# Use the four panels not taken by plots 1 and 2, so no subplot is drawn over another
boundary_axes = [axes[0, 2], axes[1, 0], axes[1, 1], axes[1, 2]]
# Create mesh grid once (it does not depend on k)
x_min, x_max = X_train_scaled[:, 0].min() - 0.5, X_train_scaled[:, 0].max() + 0.5
y_min, y_max = X_train_scaled[:, 1].min() - 0.5, X_train_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                     np.arange(y_min, y_max, 0.02))
for ax, k in zip(boundary_axes, k_values_boundaries):
    knn_temp = neighbors.KNeighborsClassifier(n_neighbors=k)
    knn_temp.fit(X_train_scaled, y_train)
    # Predict on mesh grid
    Z = knn_temp.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot decision boundary
    ax.contourf(xx, yy, Z, alpha=0.8, cmap='viridis')
    ax.scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], c=y_train,
               edgecolor='black', s=50, cmap='viridis')
    ax.set_xlabel('Feature 1 (scaled)')
    ax.set_ylabel('Feature 2 (scaled)')
    ax.set_title(f'KNN Decision Boundary (k={k})')
    ax.grid(True)
plt.tight_layout()
plt.show()
# Weighted KNN vs Standard KNN
print("\n⚖️ Weighted KNN vs Standard KNN:")
k_values_compare = [3, 5, 10, 20]
standard_scores = []
weighted_scores = []
for k in k_values_compare:
    # Standard KNN
    knn_standard = neighbors.KNeighborsClassifier(n_neighbors=k, weights='uniform')
    knn_standard.fit(X_train_scaled, y_train)
    standard_scores.append(knn_standard.score(X_test_scaled, y_test))
    # Weighted KNN
    knn_weighted = neighbors.KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn_weighted.fit(X_train_scaled, y_train)
    weighted_scores.append(knn_weighted.score(X_test_scaled, y_test))
plt.figure(figsize=(10, 6))
plt.plot(k_values_compare, standard_scores, 'o-', label='Standard KNN', color='blue')
plt.plot(k_values_compare, weighted_scores, 'o-', label='Weighted KNN', color='red')
plt.xlabel('k (Number of Neighbors)')
plt.ylabel('Accuracy')
plt.title('Standard vs Weighted KNN')
plt.legend()
plt.grid(True)
plt.show()
# KNN Regression Example
print("\n📈 KNN Regression Example:")
# Create regression dataset
X_reg, y_reg = datasets.make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = model_selection.train_test_split(
X_reg, y_reg, test_size=0.2, random_state=42
)
# Scale features
scaler_reg = StandardScaler()
X_train_reg_scaled = scaler_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_reg.transform(X_test_reg)
# Train KNN Regressor
knn_reg = neighbors.KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_reg_scaled, y_train_reg)
y_pred_reg = knn_reg.predict(X_test_reg_scaled)
mse = metrics.mean_squared_error(y_test_reg, y_pred_reg)
r2 = metrics.r2_score(y_test_reg, y_pred_reg)
print(f"KNN Regression Results:")
print(f"• MSE: {mse:.4f}")
print(f"• R²: {r2:.4f}")
# Plot regression results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X_train_reg, y_train_reg, color='blue', alpha=0.6, label='Training Data')
plt.scatter(X_test_reg, y_test_reg, color='red', alpha=0.6, label='Test Data')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('KNN Regression - Data')
plt.legend()
plt.grid(True)
plt.subplot(1, 2, 2)
# Sort for nice plotting
sort_idx = np.argsort(X_test_reg.ravel())
X_sorted = X_test_reg[sort_idx]
y_pred_sorted = y_pred_reg[sort_idx]
plt.scatter(X_test_reg, y_test_reg, color='red', alpha=0.6, label='Actual')
plt.plot(X_sorted, y_pred_sorted, 'black', linewidth=2, label='KNN Prediction')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.title('KNN Regression - Predictions')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# Curse of Dimensionality Demonstration
print("\n⚠️ Curse of Dimensionality in KNN:")
# Generate data with increasing dimensions
# Start at 5: make_classification needs at least 2 informative features for
# 2 classes with 2 clusters each, so dim=1 (dim//2 == 0) would raise a ValueError
dimensions = list(range(5, 105, 10))
dimension_scores = []
for dim in dimensions:
    # Generate high-dimensional data
    X_high_dim, y_high_dim = datasets.make_classification(
        n_samples=1000, n_features=dim, n_informative=dim//2,
        n_redundant=dim//2, random_state=42
    )
    X_train_hd, X_test_hd, y_train_hd, y_test_hd = model_selection.train_test_split(
        X_high_dim, y_high_dim, test_size=0.2, random_state=42
    )
    # Scale features
    scaler_hd = StandardScaler()
    X_train_hd_scaled = scaler_hd.fit_transform(X_train_hd)
    X_test_hd_scaled = scaler_hd.transform(X_test_hd)
    # Train KNN
    knn_hd = neighbors.KNeighborsClassifier(n_neighbors=5)
    knn_hd.fit(X_train_hd_scaled, y_train_hd)
    score = knn_hd.score(X_test_hd_scaled, y_test_hd)
    dimension_scores.append(score)
plt.figure(figsize=(10, 6))
plt.plot(dimensions, dimension_scores, 'o-', color='purple')
plt.xlabel('Number of Dimensions')
plt.ylabel('Accuracy')
plt.title('KNN Performance vs Dimensionality (Curse of Dimensionality)')
plt.grid(True)
plt.show()
print("✅ K-Nearest Neighbors Analysis Complete!")

## K-Nearest Neighbors Interview Questions & Answers
**Q1: What is the curse of dimensionality and how does it affect KNN?**
**Answer:**
- **Curse of dimensionality:** As the number of features increases, data becomes sparse in high-dimensional space
- **Effect on KNN:**
  - Distances between points become similar
  - Nearest neighbors may not be meaningful
  - Performance degrades with irrelevant features
  - Requires exponentially more data
- **Solutions:** Feature selection, dimensionality reduction, careful feature engineering
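The "distances become similar" effect can be checked numerically. The sketch below is an illustration only: the helper name `relative_contrast` and the choice of uniform random points are assumptions made for this example, not part of the notebook above. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

for d in (2, 10, 100, 1000):
    print(f"d={d:>4}: relative contrast = {relative_contrast(1000, d):.3f}")
```

As the dimension grows, the contrast shrinks toward zero, which is why "nearest" neighbors carry less information in high-dimensional spaces.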
**Q2: How do you choose the optimal k value?**
**Answer:**
**Methods for choosing k:**
1. **Cross-validation:** Try different k values and choose best performer
2. **Rule of thumb:** $k = \sqrt{n}$ where n is number of samples
3. **Odd numbers:** Prefer odd k to avoid ties in binary classification
4. **Domain knowledge:** Consider data characteristics
5. **Elbow method:** Plot k vs error and choose elbow point
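Method 1 (cross-validation) can be sketched with `GridSearchCV`. This is a minimal example under stated assumptions: it uses the Iris dataset, odd k values from 1 to 31, and 5-fold CV purely for illustration; the names `X_demo`, `y_demo`, and `pipe` are introduced here and are not defined elsewhere in the notebook. Scaling is placed inside the pipeline so each fold is scaled using only its own training split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Scaler + KNN in one pipeline so that scaling is refit inside every CV fold
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 32, 2))}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
print(f"Best k: {best_k}, mean CV accuracy: {search.best_score_:.3f}")
```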
**Q3: What are the advantages and disadvantages of weighted KNN?**
**Answer:**
- **Weighted KNN:** Closer neighbors have more influence
- **Advantages:**
  - More nuanced predictions
  - Better handling of unevenly distributed data
  - Can improve performance
- **Disadvantages:**
  - More computationally expensive
  - Sensitive to distance metric choice
  - May overfit with small k
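To make the difference concrete, here is a small sketch of uniform versus inverse-distance voting. The function `knn_vote` and the toy distances are hypothetical, written for this illustration; the 1/distance weighting mirrors the idea behind `weights='distance'` but is not scikit-learn's code.

```python
import numpy as np

def knn_vote(distances, labels, weighted=False):
    """Return the winning class among the k nearest neighbours.

    distances, labels: 1-D sequences describing the k nearest neighbours.
    weighted=True weights each vote by 1/distance instead of counting votes equally.
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    if weighted:
        weights = 1.0 / np.maximum(distances, 1e-12)  # guard against zero distance
    else:
        weights = np.ones_like(distances)
    classes = np.unique(labels)
    totals = np.array([weights[labels == c].sum() for c in classes])
    return classes[np.argmax(totals)]

# Hypothetical query: one very close neighbour of class 0, two distant ones of class 1
d = [0.1, 2.0, 2.5]
lab = [0, 1, 1]
print("uniform vote :", knn_vote(d, lab))
print("weighted vote:", knn_vote(d, lab, weighted=True))
```

With uniform votes the two distant class-1 neighbours win; with distance weighting the single close class-0 neighbour dominates.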
**Q4: How does KNN handle categorical features?**
**Answer:**
**For categorical features:**
- **Use appropriate distance metrics:**
  - Hamming distance: For binary/categorical data
  - Jaccard distance: For set-based data
  - Custom distance functions
- **One-hot encoding:** Convert to binary features
- **Label encoding:** May not be appropriate (implies ordering)
- **Best practice:** Use distance metrics designed for categorical data
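One way to apply Hamming distance in scikit-learn is sketched below. The toy colour/size data is hypothetical. The categories are integer-coded with `OrdinalEncoder` only so that the metric can compare them; Hamming distance checks equality per feature, so the arbitrary order of the codes does not matter.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: (colour, size) -> label
X_cat = np.array([
    ["red", "small"], ["red", "large"], ["blue", "small"],
    ["blue", "large"], ["green", "small"], ["green", "large"],
])
y_cat = np.array([0, 0, 1, 1, 0, 1])

# Integer codes are only identifiers here, not magnitudes
encoder = OrdinalEncoder()
X_codes = encoder.fit_transform(X_cat)

# Hamming distance = fraction of features whose values differ
knn_cat = KNeighborsClassifier(n_neighbors=1, metric="hamming")
knn_cat.fit(X_codes, y_cat)

query = encoder.transform([["blue", "small"]])
print(knn_cat.predict(query))
```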
**Q5: When should you use KNN vs other classifiers?**
**Answer:**
**Use KNN when:**
- Simple, interpretable model needed
- Data has clear local patterns
- Small to medium datasets
- Training time is limited (KNN has no explicit training phase)
- Want lazy learning (defer processing until prediction)
**Avoid KNN when:**
- Very large datasets (slow prediction)
- High-dimensional data
- Need fast predictions
- Data has many irrelevant features
- Clear global patterns exist
# 8. NAIVE BAYES
## Algorithm Background & Mathematical Foundation
**Core Concept:** Naive Bayes is a probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions between features.
**Mathematical Formulation:**
**Bayes' Theorem:**
$$P(Y|X) = \frac{P(X|Y)P(Y)}{P(X)}$$
**For Classification:**
$$P(Y=k|X_1,X_2,...,X_n) = \frac{P(Y=k)\prod_{i=1}^n P(X_i|Y=k)}{P(X_1,X_2,...,X_n)}$$
**Prediction Rule:**
$$\hat{y} = \arg\max_k P(Y=k)\prod_{i=1}^n P(X_i|Y=k)$$
**Types of Naive Bayes:**
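1. **Gaussian Naive Bayes:** Continuous features assumed normally distributed within each class, with one mean $\mu_{ik}$ and variance $\sigma_{ik}^2$ per feature $i$ and class $k$
$$P(X_i|Y=k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}}\exp\left(-\frac{(X_i - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$$
2. **Multinomial Naive Bayes:** Discrete features (word counts in text); $\text{count}(Y=k)$ is the total count of all features in class $k$ and $n$ is the number of features
$$P(X_i|Y=k) = \frac{\text{count}(X_i, Y=k) + \alpha}{\text{count}(Y=k) + \alpha n}$$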
3. **Bernoulli Naive Bayes:** Binary features (word presence/absence)
**Laplace Smoothing:**
- Prevents zero probabilities: $P(X_i|Y=k) = \frac{\text{count} + \alpha}{\text{total} + \alpha n}$
- $\alpha = 1$ (Laplace), $\alpha < 1$ (Lidstone)
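The prediction rule and the smoothing formula can be traced by hand on a tiny example. The sketch below is an illustration in NumPy, not scikit-learn's implementation: the count matrix `X_counts`, the labels `y_docs`, and the helper `predict` are made up for this example. It works in log space, which turns the product in the prediction rule into a sum.

```python
import numpy as np

# Hypothetical word-count matrix: rows = documents, columns = vocabulary terms
X_counts = np.array([
    [2, 1, 0],   # class 0
    [1, 2, 0],   # class 0
    [0, 1, 3],   # class 1
])
y_docs = np.array([0, 0, 1])
alpha = 1.0                      # Laplace smoothing
n_terms = X_counts.shape[1]      # n in the smoothing formula

classes = np.unique(y_docs)
log_prior = np.log(np.array([(y_docs == k).mean() for k in classes]))

# Smoothed P(term | class) = (count + alpha) / (total count in class + alpha * n)
term_counts = np.array([X_counts[y_docs == k].sum(axis=0) for k in classes])
log_likelihood = np.log(
    (term_counts + alpha) / (term_counts.sum(axis=1, keepdims=True) + alpha * n_terms)
)

def predict(x):
    """arg max over k of log P(Y=k) + sum_i x_i * log P(X_i | Y=k)."""
    scores = log_prior + log_likelihood @ x
    return classes[np.argmax(scores)]

# Term 2 never occurs in class 0; smoothing keeps its probability non-zero there
print(predict(np.array([0, 0, 2])))
print(predict(np.array([2, 1, 0])))
```

Without smoothing (`alpha = 0`), the third term would have probability zero in class 0 and its log-likelihood would be undefined.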

In [None]:
# Cell 10: Naive Bayes - Comprehensive Implementation
print("🚀 NAIVE BAYES: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset for text classification example
from sklearn.datasets import fetch_20newsgroups
# We'll use a three-category subset of the dataset for demonstration
categories = ['alt.atheism', 'sci.space', 'comp.graphics']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'), random_state=42)
print("📊 Dataset Overview:")
print(f"• Dataset: 20 Newsgroups (subset)")
print(f"• Samples: {len(newsgroups.data)}")
print(f"• Categories: {newsgroups.target_names}")
print(f"• Class distribution: {np.bincount(newsgroups.target)}")
# Sample of the data
print("\n📝 Sample Documents:")
for i in range(2):
    print(f"Category: {newsgroups.target_names[newsgroups.target[i]]}")
    print(f"Text preview: {newsgroups.data[i][:200]}...")
    print()
# Text preprocessing and feature extraction
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', min_df=2, max_df=0.8)
X_text = vectorizer.fit_transform(newsgroups.data)
y_text = newsgroups.target
print(f"• Vocabulary size: {len(vectorizer.get_feature_names_out())}")
print(f"• Feature matrix shape: {X_text.shape}")
# Split the data
X_train_text, X_test_text, y_train_text, y_test_text = model_selection.train_test_split(
X_text, y_text, test_size=0.2, random_state=42, stratify=y_text
)
# Train different Naive Bayes models
nb_models = {
'Multinomial NB': naive_bayes.MultinomialNB(),
'Bernoulli NB': naive_bayes.BernoulliNB(),
'Gaussian NB': naive_bayes.GaussianNB(),
'Complement NB': naive_bayes.ComplementNB()
}
nb_results = {}
for name, model in nb_models.items():
    print(f"\n📊 Training {name}...")
    # Gaussian NB expects dense arrays, others work with sparse
    if name == 'Gaussian NB':
        X_train_dense = X_train_text.toarray()
        X_test_dense = X_test_text.toarray()
        model.fit(X_train_dense, y_train_text)
        y_pred = model.predict(X_test_dense)
        y_pred_proba = model.predict_proba(X_test_dense)
    else:
        model.fit(X_train_text, y_train_text)
        y_pred = model.predict(X_test_text)
        y_pred_proba = model.predict_proba(X_test_text)
    accuracy = metrics.accuracy_score(y_test_text, y_pred)
    precision = metrics.precision_score(y_test_text, y_pred, average='weighted')
    recall = metrics.recall_score(y_test_text, y_pred, average='weighted')
    f1 = metrics.f1_score(y_test_text, y_pred, average='weighted')
    nb_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'model': model,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
# Model Comparison
comparison_nb = pd.DataFrame({
'Model': list(nb_results.keys()),
'Accuracy': [nb_results[name]['accuracy'] for name in nb_results.keys()],
'Precision': [nb_results[name]['precision'] for name in nb_results.keys()],
'Recall': [nb_results[name]['recall'] for name in nb_results.keys()],
'F1-Score': [nb_results[name]['f1'] for name in nb_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_nb.round(4))
# Comprehensive Visualization
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
# 1. Confusion Matrix for Best Model
best_nb_name = max(nb_results.keys(), key=lambda x: nb_results[x]['accuracy'])
best_nb_model = nb_results[best_nb_name]['model']
y_pred_best = nb_results[best_nb_name]['y_pred']
cm = metrics.confusion_matrix(y_test_text, y_pred_best)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,0],
xticklabels=newsgroups.target_names, yticklabels=newsgroups.target_names)
axes[0,0].set_xlabel('Predicted')
axes[0,0].set_ylabel('Actual')
axes[0,0].set_title(f'Confusion Matrix - {best_nb_name}')
# 2. Feature Importance (Top Words per Class)
# Only the count-based variants expose feature_log_prob_ (Gaussian NB does not)
if hasattr(best_nb_model, 'feature_log_prob_'):
    feature_names = vectorizer.get_feature_names_out()
    n_top_words = 10
    for i, class_label in enumerate(newsgroups.target_names):
        top_features = np.argsort(best_nb_model.feature_log_prob_[i])[-n_top_words:]
        top_words = [feature_names[j] for j in top_features]
        print(f"\n🔤 Top words for '{class_label}':")
        print(", ".join(top_words))
# 3. Model Comparison Bar Chart
models_nb = list(nb_results.keys())
accuracy_scores = [nb_results[name]['accuracy'] for name in models_nb]
bars = axes[0,1].bar(models_nb, accuracy_scores, color=['blue', 'green', 'orange', 'red'])
axes[0,1].set_ylabel('Accuracy')
axes[0,1].set_title('Naive Bayes Variants Comparison')
axes[0,1].tick_params(axis='x', rotation=45)
for bar, score in zip(bars, accuracy_scores):
    axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{score:.3f}', ha='center', va='bottom')
# 4. Probability Calibration
if hasattr(best_nb_model, 'predict_proba'):
    y_pred_proba_best = nb_results[best_nb_name]['y_pred_proba']
    # Binarize labels for calibration curve
    from sklearn.preprocessing import label_binarize
    y_test_bin = label_binarize(y_test_text, classes=range(len(newsgroups.target_names)))
    # Plot calibration curves for each class
    for i in range(len(newsgroups.target_names)):
        fraction_of_positives, mean_predicted_value = calibration_curve(
            y_test_bin[:, i], y_pred_proba_best[:, i], n_bins=10
        )
        axes[1,0].plot(mean_predicted_value, fraction_of_positives, "s-",
                       label=newsgroups.target_names[i])
axes[1,0].plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
axes[1,0].set_xlabel('Mean Predicted Probability')
axes[1,0].set_ylabel('Fraction of Positives')
axes[1,0].set_title('Probability Calibration')
axes[1,0].legend()
axes[1,0].grid(True)
# 5. Alpha Parameter Tuning (Smoothing)
alphas = [0.001, 0.01, 0.1, 1, 10, 100]
alpha_scores = []
for alpha in alphas:
    nb_temp = naive_bayes.MultinomialNB(alpha=alpha)
    nb_temp.fit(X_train_text, y_train_text)
    alpha_scores.append(nb_temp.score(X_test_text, y_test_text))
axes[1,1].semilogx(alphas, alpha_scores, 'o-', color='purple')
axes[1,1].set_xlabel('Alpha (Smoothing Parameter)')
axes[1,1].set_ylabel('Accuracy')
axes[1,1].set_title('Alpha Parameter Tuning')
axes[1,1].grid(True)
plt.tight_layout()
plt.show()
# Real-time Prediction Example
print("\n🎯 Real-time Text Classification Example:")
# Train final model on all data
final_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english', min_df=2, max_df=0.8)
X_final = final_vectorizer.fit_transform(newsgroups.data)
final_model = naive_bayes.MultinomialNB(alpha=1.0)
final_model.fit(X_final, newsgroups.target)
# Test sentences
test_sentences = [
"The space shuttle launched successfully into orbit",
"Computer graphics and animation techniques",
"Religious beliefs and atheism discussion"
]
for i, sentence in enumerate(test_sentences):
    sentence_vec = final_vectorizer.transform([sentence])
    prediction = final_model.predict(sentence_vec)[0]
    probabilities = final_model.predict_proba(sentence_vec)[0]
    print(f"\nSentence {i+1}: '{sentence}'")
    print(f"Predicted category: {newsgroups.target_names[prediction]}")
    print("Probabilities:")
    for j, category in enumerate(newsgroups.target_names):
        print(f" {category}: {probabilities[j]:.4f}")
# Numerical Data Example with Gaussian Naive Bayes
print("\n📊 Gaussian Naive Bayes on Numerical Data:")
# Use iris dataset for Gaussian NB demonstration
iris = datasets.load_iris()
X_iris = iris.data
y_iris = iris.target
X_train_iris, X_test_iris, y_train_iris, y_test_iris = model_selection.train_test_split(
X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)
# Scale features
scaler_iris = StandardScaler()
X_train_iris_scaled = scaler_iris.fit_transform(X_train_iris)
X_test_iris_scaled = scaler_iris.transform(X_test_iris)
# Train Gaussian NB
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train_iris_scaled, y_train_iris)
y_pred_iris = gnb.predict(X_test_iris_scaled)
accuracy_iris = metrics.accuracy_score(y_test_iris, y_pred_iris)
print(f"Gaussian NB on Iris dataset: Accuracy = {accuracy_iris:.4f}")
# Visualize Gaussian distributions
# The model was fitted on standardized features, so the histograms must use the
# same scale; otherwise the fitted curves and the data would not line up
X_iris_scaled = scaler_iris.transform(X_iris)
plt.figure(figsize=(12, 8))
features = iris.feature_names
classes = iris.target_names
for i in range(4):  # For each feature
    plt.subplot(2, 2, i+1)
    for class_idx in range(3):  # For each class
        # Get (standardized) feature values for this class
        feature_values = X_iris_scaled[y_iris == class_idx, i]
        # Plot histogram
        plt.hist(feature_values, alpha=0.7, density=True,
                 label=classes[class_idx], bins=15)
        # Plot fitted normal distribution
        xmin, xmax = plt.xlim()
        x = np.linspace(xmin, xmax, 100)
        mean = gnb.theta_[class_idx, i]
        std = np.sqrt(gnb.var_[class_idx, i])
        p = np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))
        plt.plot(x, p, 'k', linewidth=2)
    plt.xlabel(f'{features[i]} (standardized)')
    plt.ylabel('Density')
    plt.title(f'Feature: {features[i]}')
    if i == 0:
        plt.legend()
plt.tight_layout()
plt.show()
print("✅ Naive Bayes Analysis Complete!")

## Naive Bayes Interview Questions & Answers
**Q1: Why is it called "Naive" Bayes?**
**Answer:**
- **Naive assumption:** Features are conditionally independent given the class
- **Mathematically:** $P(X_1,X_2,...,X_n|Y) = \prod_{i=1}^n P(X_i|Y)$
- **Why naive:** This assumption is rarely true in real-world data
- **Why it works:** Even with violated assumptions, it often performs well
- **Benefits:** Simple, fast, works well with high-dimensional data
**Q2: What is Laplace smoothing and why is it important?**
**Answer:**
- **Problem:** Zero probabilities when feature doesn't appear in class
- **Laplace smoothing:** Add small constant α to all counts
- **Formula:** $P(X_i|Y=k) = \frac{\text{count}(X_i,Y=k) + \alpha}{\text{count}(Y=k) + \alpha n}$
- **Purpose:** Prevents zero probabilities, improves generalization
- **Typical values:** α = 1 (Laplace), α < 1 (Lidstone)
**Q3: Compare different Naive Bayes variants.**
**Answer:**
- **Gaussian NB:** Continuous features, assumes normal distribution
- **Multinomial NB:** Discrete counts (text classification, word frequencies)
- **Bernoulli NB:** Binary features (word presence/absence)
- **Complement NB:** Variant of Multinomial NB that estimates each class's parameters from the data of all other classes; suited to imbalanced datasets
- **Choice depends on:** Feature type and distribution
**Q4: What are the advantages of Naive Bayes?**
**Answer:**
- **Fast training and prediction:** Simple probability calculations
- **Works well with high dimensions:** Text classification with thousands of features
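- **Handles missing data:** In principle, a missing feature can be left out of the probability product at prediction time (scikit-learn's Naive Bayes estimators still expect complete inputs, so impute first)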
- **Incremental learning:** Can update with new data easily
- **Interpretable:** Probabilities have clear meaning
- **Good baseline:** Often works surprisingly well despite naive assumption
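The incremental-learning point can be illustrated with `partial_fit`, which the scikit-learn Naive Bayes estimators provide. The data stream below is synthetic and hypothetical (two well-separated Gaussian classes split into five batches); the names `X_stream`, `y_stream`, and `gnb_online` are introduced only for this sketch.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Hypothetical stream: two well-separated Gaussian classes, shuffled
X_stream = np.vstack([rng.normal(0, 1, (250, 2)), rng.normal(4, 1, (250, 2))])
y_stream = np.array([0] * 250 + [1] * 250)
order = rng.permutation(len(y_stream))
X_stream, y_stream = X_stream[order], y_stream[order]

gnb_online = GaussianNB()
batches = zip(np.array_split(X_stream, 5), np.array_split(y_stream, 5))
for X_batch, y_batch in batches:
    # The full list of classes must be given on the first call to partial_fit
    gnb_online.partial_fit(X_batch, y_batch, classes=[0, 1])

print(f"Accuracy after streaming: {gnb_online.score(X_stream, y_stream):.3f}")
```

Each call updates the per-class means, variances, and counts, so the model never needs the whole dataset in memory at once.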
**Q5: When should you use Naive Bayes vs other classifiers?**
**Answer:**
**Use Naive Bayes when:**
- Text classification problems
- High-dimensional feature spaces
- Need fast training/prediction
- Want probability estimates
- Dealing with categorical features
- Need a simple baseline model
**Avoid Naive Bayes when:**
- Strong feature dependencies exist
- Need very high accuracy
- Features have complex relationships
- Probability calibration is critical
- Dealing with small datasets with continuous features
# 9. K-MEANS CLUSTERING
## Algorithm Background & Mathematical Foundation
**Core Concept:** K-means partitions data into K clusters by minimizing within-cluster variance (sum of squared distances).
**Mathematical Formulation:**
**Objective Function:**
$$J = \sum_{i=1}^k \sum_{x \in C_i} \|x - \mu_i\|^2$$
Where:
- $C_i$ is the i-th cluster
- $\mu_i$ is the centroid of cluster i
- $\|x - \mu_i\|^2$ is squared Euclidean distance
**Algorithm Steps:**
1. **Initialize:** Randomly select K centroids
2. **Assignment:** Assign each point to nearest centroid
$$C_i = \{x : \|x - \mu_i\|^2 \leq \|x - \mu_j\|^2 \ \forall j\}$$
3. **Update:** Recompute centroids as mean of assigned points
$$\mu_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
4. **Repeat** steps 2-3 until convergence
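The four steps above can be sketched in plain NumPy. This is a minimal illustration with random initialization, not scikit-learn's `KMeans`; the function name `kmeans_lloyd` and the two-blob toy data are assumptions made for this example.

```python
import numpy as np

def kmeans_lloyd(X, k, n_iters=100, seed=0):
    """Plain K-means iteration (assign, update, repeat) with random initialization."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute centroids (keep the old one if a cluster is empty)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = float(((X - centroids[labels]) ** 2).sum())  # objective J
    return labels, centroids, inertia

rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels_toy, centroids_toy, inertia_toy = kmeans_lloyd(X_toy, k=2)
print(centroids_toy.round(2), f"inertia={inertia_toy:.2f}")
```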
**Initialization Methods:**
- **Random:** Pure random selection
- **K-means++:** Smart initialization to spread out centroids
- **Deterministic:** Pre-specified initial centroids

In [None]:
# Cell 11: K-Means Clustering - Comprehensive Implementation
print("🚀 K-MEANS CLUSTERING: COMPREHENSIVE IMPLEMENTATION\n")
# Create sample dataset for clustering
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• True clusters: {len(np.unique(y_true))}")
print(f"• Cluster distribution: {np.bincount(y_true)}")
# Visualize the true clusters
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.8)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('True Clusters')
plt.grid(True)
# Apply K-means with different K values
k_values = [2, 3, 4, 5, 6]
kmeans_models = {}
for k in k_values:
    kmeans = cluster.KMeans(n_clusters=k, random_state=42, n_init=10)
    y_pred = kmeans.fit_predict(X)
    kmeans_models[k] = {
        'model': kmeans,
        'labels': y_pred,
        'inertia': kmeans.inertia_,
        'centroids': kmeans.cluster_centers_
    }
# Elbow Method for optimal K
inertias = [kmeans_models[k]['inertia'] for k in k_values]
plt.subplot(1, 3, 2)
plt.plot(k_values, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-cluster Sum of Squares (Inertia)')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
# Silhouette Analysis
from sklearn.metrics import silhouette_score
silhouette_scores = []
for k in k_values:
    # Silhouette score requires at least 2 clusters; every value in k_values qualifies
    score = silhouette_score(X, kmeans_models[k]['labels'])
    silhouette_scores.append(score)
plt.subplot(1, 3, 3)
# One score per entry of k_values, so plot against the full list
plt.plot(k_values, silhouette_scores, 'ro-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis')
plt.grid(True)
plt.tight_layout()
plt.show()
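# Comprehensive K-means Analysis
# Inertia always decreases as K grows, so its minimum is just the largest K,
# not an elbow; pick K by the highest silhouette score instead
optimal_k = k_values[int(np.argmax(silhouette_scores))]
print(f"🎯 Optimal K based on silhouette score: {optimal_k}")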
# Visualize clustering results for different K
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
for i, k in enumerate(k_values):
    row = i // 3
    col = i % 3
    model_info = kmeans_models[k]
    # Plot clusters
    scatter = axes[row, col].scatter(X[:, 0], X[:, 1], c=model_info['labels'],
                                     cmap='viridis', s=50, alpha=0.7)
    # Plot centroids
    axes[row, col].scatter(model_info['centroids'][:, 0], model_info['centroids'][:, 1],
                           marker='x', s=200, linewidths=3, color='red', label='Centroids')
    axes[row, col].set_xlabel('Feature 1')
    axes[row, col].set_ylabel('Feature 2')
    axes[row, col].set_title(f'K-means (K={k})\nInertia: {model_info["inertia"]:.2f}')
    axes[row, col].legend()
    axes[row, col].grid(True)
# Remove empty subplot
if len(k_values) < 6:
    axes[1, 2].set_visible(False)
plt.tight_layout()
plt.show()
# Advanced K-means Analysis
print("\n🔍 Advanced K-means Analysis:")
# Compare different initialization methods
init_methods = ['random', 'k-means++']
init_results = {}
for init_method in init_methods:
    kmeans_temp = cluster.KMeans(n_clusters=4, init=init_method, random_state=42, n_init=10)
    y_pred_temp = kmeans_temp.fit_predict(X)
    inertia = kmeans_temp.inertia_
    silhouette = silhouette_score(X, y_pred_temp)
    init_results[init_method] = {
        'inertia': inertia,
        'silhouette': silhouette,
        'labels': y_pred_temp
    }
    print(f"\nInitialization: {init_method}")
    print(f"• Inertia: {inertia:.4f}")
    print(f"• Silhouette Score: {silhouette:.4f}")
# K-means on different dataset shapes
print("\n📊 K-means on Different Data Distributions:")
# Create different dataset shapes
datasets_cluster = {
'Blobs': make_blobs(n_samples=300, centers=3, random_state=42),
'Moons': datasets.make_moons(n_samples=300, noise=0.05, random_state=42),
'Circles': datasets.make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42),
'Anisotropic': datasets.make_blobs(n_samples=300, centers=3, random_state=42)
}
# Make anisotropic data
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(datasets_cluster['Anisotropic'][0], transformation)
datasets_cluster['Anisotropic'] = (X_aniso, datasets_cluster['Anisotropic'][1])
# Apply K-means to different datasets
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
for i, (dataset_name, (X_data, y_true_data)) in enumerate(datasets_cluster.items()):
    # True labels
    axes[0, i].scatter(X_data[:, 0], X_data[:, 1], c=y_true_data, cmap='viridis', s=50, alpha=0.8)
    axes[0, i].set_title(f'{dataset_name}\n(True Clusters)')
    axes[0, i].set_xlabel('Feature 1')
    axes[0, i].set_ylabel('Feature 2')
    axes[0, i].grid(True)
    # K-means results
    kmeans_data = cluster.KMeans(n_clusters=3, random_state=42)
    y_pred_data = kmeans_data.fit_predict(X_data)
    axes[1, i].scatter(X_data[:, 0], X_data[:, 1], c=y_pred_data, cmap='viridis', s=50, alpha=0.8)
    axes[1, i].scatter(kmeans_data.cluster_centers_[:, 0], kmeans_data.cluster_centers_[:, 1],
                       marker='x', s=200, linewidths=3, color='red')
    axes[1, i].set_title(f'{dataset_name}\n(K-means Clusters)')
    axes[1, i].set_xlabel('Feature 1')
    axes[1, i].set_ylabel('Feature 2')
    axes[1, i].grid(True)
    # Calculate metrics
    inertia = kmeans_data.inertia_
    silhouette = silhouette_score(X_data, y_pred_data)
    print(f"\n{dataset_name}:")
    print(f"• Inertia: {inertia:.4f}")
    print(f"• Silhouette Score: {silhouette:.4f}")
plt.tight_layout()
plt.show()
# K-means for Image Compression
print("\n🖼️ K-means for Image Compression:")
# Load a sample image
try:
    from sklearn.datasets import load_sample_image
    china = load_sample_image("china.jpg")
    # Convert to float and reshape
    china = china / 255.0
    w, h, d = original_shape = tuple(china.shape)
    image_array = np.reshape(china, (w * h, d))
    print(f"Image shape: {china.shape}")
    print(f"Reshaped for clustering: {image_array.shape}")
    # Apply K-means for color quantization
    n_colors = 16
    kmeans_image = cluster.KMeans(n_clusters=n_colors, random_state=42)
    labels = kmeans_image.fit_predict(image_array)
    colors = kmeans_image.cluster_centers_
    # Recreate compressed image
    compressed_image = colors[labels].reshape(w, h, d)
    # Plot original and compressed images
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    axes[0].imshow(china)
    axes[0].set_title('Original Image')
    axes[0].axis('off')
    axes[1].imshow(compressed_image)
    axes[1].set_title(f'Compressed Image ({n_colors} colors)')
    axes[1].axis('off')
    plt.tight_layout()
    plt.show()
    print(f"Original image: {china.shape[0]}x{china.shape[1]} pixels, {china.shape[2]} channels")
    print(f"Compressed to: {n_colors} colors")
except ImportError:
    print("Sample image dataset not available. Using alternative demonstration.")
    # Create synthetic image data
    synthetic_image = np.random.rand(100, 100, 3)
    w, h, d = synthetic_image.shape
    image_array = np.reshape(synthetic_image, (w * h, d))
    n_colors = 8
    kmeans_image = cluster.KMeans(n_clusters=n_colors, random_state=42)
    labels = kmeans_image.fit_predict(image_array)
    colors = kmeans_image.cluster_centers_
    compressed_image = colors[labels].reshape(w, h, d)
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    axes[0].imshow(synthetic_image)
    axes[0].set_title('Synthetic Image')
    axes[0].axis('off')
    axes[1].imshow(compressed_image)
    axes[1].set_title(f'Compressed ({n_colors} colors)')
    axes[1].axis('off')
    plt.tight_layout()
    plt.show()
# Customer Segmentation Example
print("\n👥 Customer Segmentation with K-means:")
# Create synthetic customer data
np.random.seed(42)
n_customers = 200
# Generate customer features: age, annual income, spending score
age = np.random.normal(35, 10, n_customers).clip(18, 70)
income = np.random.normal(50000, 20000, n_customers).clip(20000, 100000)
spending = np.random.normal(50, 20, n_customers).clip(1, 100)
customer_data = np.column_stack([age, income, spending])
# Scale the data
scaler_customer = StandardScaler()
customer_data_scaled = scaler_customer.fit_transform(customer_data)
# Apply K-means
kmeans_customer = cluster.KMeans(n_clusters=4, random_state=42)
customer_labels = kmeans_customer.fit_predict(customer_data_scaled)
# Visualize customer segments
fig = plt.figure(figsize=(15, 5))
# Age vs Income
plt.subplot(1, 3, 1)
scatter = plt.scatter(age, income, c=customer_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Annual Income ($)')
plt.title('Customer Segments: Age vs Income')
plt.colorbar(scatter, label='Cluster')
plt.grid(True)
# Age vs Spending
plt.subplot(1, 3, 2)
scatter = plt.scatter(age, spending, c=customer_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer Segments: Age vs Spending')
plt.colorbar(scatter, label='Cluster')
plt.grid(True)
# Income vs Spending
plt.subplot(1, 3, 3)
scatter = plt.scatter(income, spending, c=customer_labels, cmap='viridis', alpha=0.7)
plt.xlabel('Annual Income ($)')
plt.ylabel('Spending Score')
plt.title('Customer Segments: Income vs Spending')
plt.colorbar(scatter, label='Cluster')
plt.grid(True)
plt.tight_layout()
plt.show()
# Analyze customer segments
print("\n📈 Customer Segment Analysis:")
for cluster_id in range(4):
    cluster_mask = customer_labels == cluster_id
    cluster_size = np.sum(cluster_mask)
    if cluster_size > 0:
        avg_age = np.mean(age[cluster_mask])
        avg_income = np.mean(income[cluster_mask])
        avg_spending = np.mean(spending[cluster_mask])
        print(f"\nSegment {cluster_id} (Size: {cluster_size}):")
        print(f"• Average Age: {avg_age:.1f} years")
        print(f"• Average Income: ${avg_income:.0f}")
        print(f"• Average Spending: {avg_spending:.1f}")
print("✅ K-means Clustering Analysis Complete!")

## K-means Interview Questions & Answers
**Q1: How do you choose the optimal number of clusters K?**
**Answer:**
**Methods for choosing K:**
1. **Elbow Method:** Plot inertia vs K, choose elbow point
2. **Silhouette Analysis:** Measure how similar points are to their own cluster vs other clusters
3. **Gap Statistic:** Compare inertia with expected inertia from null reference
4. **Domain Knowledge:** Use business context or prior knowledge
5. **Visual Inspection:** When possible, visualize clusters
**Q2: What are the limitations of K-means?**
**Answer:**
- **Assumes spherical clusters:** Struggles with elongated or irregular shapes
- **Sensitive to initialization:** Different runs can give different results
- **Requires specifying K:** The number of clusters must be chosen in advance
- **Sensitive to outliers:** Outliers can distort centroids
- **Scale-dependent:** Features should be standardized
- **Local minima:** Can converge to suboptimal solutions
**Q3: Explain K-means++ initialization.**
**Answer:**
**K-means++ improvement:**
1. Choose first centroid randomly from data points
2. For each subsequent centroid:
   - Compute distance from each point to nearest existing centroid
   - Choose new centroid with probability proportional to squared distance
3. Repeat until K centroids chosen
**Benefits:**
- Spreads out initial centroids
- Faster convergence
- More consistent results
- Better final solutions
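The seeding steps above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not scikit-learn's implementation: the function name `kmeans_pp_init`, the seed, and the demo data are all hypothetical.

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    n = X.shape[0]
    # Step 1: first centroid uniformly at random
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Step 2a: squared distance from each point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Step 2b: sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
X_demo = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                    for c in ([0, 0], [5, 5], [0, 5])])
init_centers = kmeans_pp_init(X_demo, k=3, rng=rng)
print(init_centers)
```

Because an already chosen point has squared distance zero, it has zero probability of being picked again, which is what spreads the initial centroids apart.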
**Q4: How does K-means handle categorical data?**
**Answer:**
**K-means limitations with categorical data:**
- Designed for continuous numerical data
- Euclidean distance and arithmetic means are not meaningful for categorical variables
- **Solutions:**
  1. Use K-modes algorithm (designed for categorical data; replaces means with modes and uses a matching dissimilarity)
  2. One-hot encoding (but creates high-dimensional space)
  3. Use a medoid-based method (e.g., K-medoids) with an appropriate dissimilarity such as Hamming distance
  4. Consider other clustering algorithms
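The two building blocks that K-modes swaps in for K-means can be sketched with the standard library alone: a matching (Hamming) dissimilarity instead of Euclidean distance, and a per-attribute mode instead of a mean. The helper names and the sample records are assumptions made for illustration, and this is not a full K-modes implementation.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Per-attribute most frequent value: the k-modes analogue of a centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical categorical records: (colour, size, material)
records = [
    ("red", "small", "wood"),
    ("red", "large", "wood"),
    ("blue", "small", "wood"),
]
mode = cluster_mode(records)
print(mode)                                   # ('red', 'small', 'wood')
print([hamming(r, mode) for r in records])    # [0, 1, 1]
```

A full algorithm would alternate between assigning each record to the nearest mode and recomputing the modes, in the same way K-means alternates between assignment and centroid updates.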
**Q5: What is the time and space complexity of K-means?**
**Answer:**
- **Time complexity:** O(n * K * I * d)
  - n: number of points
  - K: number of clusters
  - I: number of iterations
  - d: number of dimensions
- **Space complexity:** O((n + K) * d)
- **Efficient for:** Large n, small K and d
- **Inefficient for:** Very high-dimensional data
# 10. HIERARCHICAL CLUSTERING
## Algorithm Background & Mathematical Foundation
**Core Concept:** Hierarchical clustering builds a tree of clusters (dendrogram) by successively merging or splitting clusters based on similarity.
**Types of Hierarchical Clustering:**
1. **Agglomerative (Bottom-up):** Start with each point as individual cluster, merge closest pairs
2. **Divisive (Top-down):** Start with one cluster, recursively split into smaller clusters
**Linkage Criteria:**
- **Single Linkage:** Minimum distance between clusters
$$d(C_i, C_j) = \min_{x \in C_i, y \in C_j} d(x, y)$$
- **Complete Linkage:** Maximum distance between clusters
$$d(C_i, C_j) = \max_{x \in C_i, y \in C_j} d(x, y)$$
- **Average Linkage:** Average distance between clusters
$$d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$$
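- **Ward's Method:** Merges the pair of clusters whose union gives the smallest increase in within-cluster sum of squares (here $\mu_i$ is the centroid of $C_i$)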
$$d(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} \|\mu_i - \mu_j\|^2$$
**Dendrogram:** Tree diagram showing hierarchical relationships and merge distances.
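As a worked check of the four linkage formulas, the sketch below evaluates each of them for two small clusters on a line. The function name `linkage_distances` and the example points are assumptions chosen so the results are easy to verify by hand.

```python
import numpy as np

def linkage_distances(A, B):
    """Single, complete, average and Ward distances between clusters A and B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    # All pairwise Euclidean distances between a point of A and a point of B
    pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    n_a, n_b = len(A), len(B)
    centroid_gap = np.sum((A.mean(axis=0) - B.mean(axis=0)) ** 2)
    return {
        "single": pair.min(),
        "complete": pair.max(),
        "average": pair.mean(),
        "ward": n_a * n_b / (n_a + n_b) * centroid_gap,
    }

A = [[0.0, 0.0], [1.0, 0.0]]
B = [[3.0, 0.0], [5.0, 0.0]]
for name, value in linkage_distances(A, B).items():
    print(f"{name:>8}: {value:.2f}")
```

The pairwise distances are 3, 5, 2 and 4, so single linkage gives 2.00, complete linkage 5.00 and average linkage 3.50. The centroids are 0.5 and 4.0, so Ward's value is (2·2/4)·3.5² = 12.25.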

In [None]:
# Cell 12: Hierarchical Clustering - Comprehensive Implementation
print("🚀 HIERARCHICAL CLUSTERING: COMPREHENSIVE IMPLEMENTATION\n")
# Create sample dataset
X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.60, random_state=42)
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• True clusters: {len(np.unique(y_true))}")
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Compare different linkage methods
linkage_methods = ['single', 'complete', 'average', 'ward']
linkage_results = {}
plt.figure(figsize=(15, 10))
for i, linkage_method in enumerate(linkage_methods):
    # Perform hierarchical clustering
    hierarchical = cluster.AgglomerativeClustering(
        n_clusters=3, linkage=linkage_method
    )
    labels = hierarchical.fit_predict(X_scaled)
    linkage_results[linkage_method] = {
        'labels': labels,
        'model': hierarchical
    }
    # Plot clustering results
    plt.subplot(2, 3, i+1)
    scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.8)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(f'Linkage: {linkage_method}\nSilhouette: {silhouette_score(X_scaled, labels):.3f}')
    plt.grid(True)
# Plot true labels for comparison
plt.subplot(2, 3, 5)
scatter = plt.scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.8)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('True Clusters')
plt.grid(True)
plt.tight_layout()
plt.show()
# Dendrogram Visualization
print("\n🌳 Dendrogram Visualization:")
# Create linkage matrix for dendrogram
linkage_matrix = linkage(X_scaled, method='ward')
plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, truncate_mode='level', p=5)
plt.xlabel('Sample Index or (Cluster Size)')
plt.ylabel('Distance')
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plt.grid(True, alpha=0.3)
plt.show()
# Comprehensive linkage comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
for i, linkage_method in enumerate(linkage_methods):
    row = i // 2
    col = i % 2
    # Create linkage matrix for this method
    linkage_matrix_method = linkage(X_scaled, method=linkage_method)
    # Plot dendrogram
    dendrogram(linkage_matrix_method, ax=axes[row, col], truncate_mode='level', p=5)
    axes[row, col].set_title(f'Dendrogram - {linkage_method.capitalize()} Linkage')
    axes[row, col].set_xlabel('Sample Index')
    axes[row, col].set_ylabel('Distance')
    axes[row, col].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Cut dendrogram at different levels
print("\n✂️ Cutting Dendrogram at Different Levels:")
n_clusters_range = [2, 3, 4, 5, 6]
cluster_results = {}
plt.figure(figsize=(15, 10))
for i, n_clusters in enumerate(n_clusters_range):
    # Cut dendrogram to get n clusters
    labels = fcluster(linkage_matrix, n_clusters, criterion='maxclust')
    cluster_results[n_clusters] = {
        'labels': labels,
        'silhouette': silhouette_score(X_scaled, labels)
    }
    # Plot clustering results
    plt.subplot(2, 3, i+1)
    scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=50, alpha=0.8)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(f'K={n_clusters}\nSilhouette: {cluster_results[n_clusters]["silhouette"]:.3f}')
    plt.grid(True)
plt.tight_layout()
plt.show()
# Compare with K-means
print("\n🆚 Comparison with K-means:")
kmeans_comparison = cluster.KMeans(n_clusters=3, random_state=42)
kmeans_labels = kmeans_comparison.fit_predict(X_scaled)
hierarchical_ward = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward')
hierarchical_labels = hierarchical_ward.fit_predict(X_scaled)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
# True clusters
scatter = axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=50, alpha=0.8)
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].set_title('True Clusters')
axes[0].grid(True)
# K-means
scatter = axes[1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50, alpha=0.8)
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].set_title(f'K-means\nSilhouette: {silhouette_score(X_scaled, kmeans_labels):.3f}')
axes[1].grid(True)
# Hierarchical
scatter = axes[2].scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis', s=50, alpha=0.8)
axes[2].set_xlabel('Feature 1')
axes[2].set_ylabel('Feature 2')
axes[2].set_title(f'Hierarchical (Ward)\nSilhouette: {silhouette_score(X_scaled, hierarchical_labels):.3f}')
axes[2].grid(True)
plt.tight_layout()
plt.show()
# Advanced: Hierarchical clustering on different dataset shapes
print("\n📊 Hierarchical Clustering on Complex Data Shapes:")
complex_datasets = {
'Moons': datasets.make_moons(n_samples=150, noise=0.05, random_state=42),
'Circles': datasets.make_circles(n_samples=150, noise=0.05, factor=0.5, random_state=42),
'Anisotropic': (np.dot(X, [[0.6, -0.6], [-0.4, 0.8]]), y_true)
}
fig, axes = plt.subplots(3, 4, figsize=(20, 15))
for i, (dataset_name, (X_data, y_true_data)) in enumerate(complex_datasets.items()):
    # Scale data
    X_data_scaled = StandardScaler().fit_transform(X_data)
    # True clusters
    axes[i, 0].scatter(X_data[:, 0], X_data[:, 1], c=y_true_data, cmap='viridis', s=50, alpha=0.8)
    axes[i, 0].set_title(f'{dataset_name}\nTrue Clusters')
    axes[i, 0].set_xlabel('Feature 1')
    axes[i, 0].set_ylabel('Feature 2')
    axes[i, 0].grid(True)
    # Different linkage methods
    for j, linkage_method in enumerate(['single', 'complete', 'ward']):
        hierarchical_temp = cluster.AgglomerativeClustering(
            n_clusters=2, linkage=linkage_method
        )
        labels_temp = hierarchical_temp.fit_predict(X_data_scaled)
        axes[i, j+1].scatter(X_data[:, 0], X_data[:, 1], c=labels_temp, cmap='viridis', s=50, alpha=0.8)
        axes[i, j+1].set_title(f'{linkage_method.capitalize()} Linkage\nSilhouette: {silhouette_score(X_data_scaled, labels_temp):.3f}')
        axes[i, j+1].set_xlabel('Feature 1')
        axes[i, j+1].set_ylabel('Feature 2')
        axes[i, j+1].grid(True)
plt.tight_layout()
plt.show()
# Practical Application: Customer Segmentation
print("\n👥 Customer Segmentation with Hierarchical Clustering:")
# Use the customer data from K-means example
if 'customer_data_scaled' in locals():
    # Perform hierarchical clustering
    linkage_customer = linkage(customer_data_scaled, method='ward')
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    # Dendrogram
    dendrogram(linkage_customer, ax=axes[0], truncate_mode='lastp', p=12)
    axes[0].set_title('Customer Segmentation Dendrogram')
    axes[0].set_xlabel('Customer Index')
    axes[0].set_ylabel('Distance')
    axes[0].grid(True, alpha=0.3)
    # Cut dendrogram to get 4 clusters
    customer_hierarchical_labels = fcluster(linkage_customer, 4, criterion='maxclust')
    # Visualize segments
    scatter = axes[1].scatter(age, income, c=customer_hierarchical_labels,
                              cmap='viridis', alpha=0.7)
    axes[1].set_xlabel('Age')
    axes[1].set_ylabel('Annual Income ($)')
    axes[1].set_title('Hierarchical Clustering: Customer Segments')
    axes[1].grid(True)
    plt.colorbar(scatter, ax=axes[1], label='Cluster')
    plt.tight_layout()
    plt.show()
    # Compare with K-means
    print("\n📊 Comparison: K-means vs Hierarchical Clustering")
    print(f"K-means Silhouette Score: {silhouette_score(customer_data_scaled, customer_labels):.3f}")
    print(f"Hierarchical Silhouette Score: {silhouette_score(customer_data_scaled, customer_hierarchical_labels):.3f}")
# Memory and Computational Considerations
print("\n💾 Computational Considerations:")
sample_sizes = [100, 500, 1000, 2000]
times_hierarchical = []
times_kmeans = []
for n_samples in sample_sizes:
    # Generate data
    X_temp, _ = make_blobs(n_samples=n_samples, centers=3, random_state=42)
    X_temp_scaled = StandardScaler().fit_transform(X_temp)
    # Time hierarchical clustering
    start_time = time.time()
    hierarchical_temp = cluster.AgglomerativeClustering(n_clusters=3)
    hierarchical_temp.fit(X_temp_scaled)
    times_hierarchical.append(time.time() - start_time)
    # Time K-means
    start_time = time.time()
    kmeans_temp = cluster.KMeans(n_clusters=3, random_state=42)
    kmeans_temp.fit(X_temp_scaled)
    times_kmeans.append(time.time() - start_time)
plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, times_hierarchical, 'o-', label='Hierarchical Clustering', linewidth=2)
plt.plot(sample_sizes, times_kmeans, 'o-', label='K-means', linewidth=2)
plt.xlabel('Number of Samples')
plt.ylabel('Execution Time (seconds)')
plt.title('Computational Complexity: Hierarchical vs K-means')
plt.legend()
plt.grid(True)
plt.show()
print("✅ Hierarchical Clustering Analysis Complete!")

## Hierarchical Clustering Interview Questions & Answers
**Q1: Compare and contrast different linkage methods.**
**Answer:**
- **Single Linkage:** Minimum distance between clusters
  - Pros: Can detect non-spherical clusters
  - Cons: Sensitive to noise, prone to chaining (long, straggly clusters)
- **Complete Linkage:** Maximum distance between clusters
  - Pros: Compact, roughly spherical clusters
  - Cons: Tends to break large clusters, sensitive to outliers
- **Average Linkage:** Average distance between clusters
  - Balanced approach, less sensitive to outliers
- **Ward's Method:** Merges the pair of clusters that least increases within-cluster variance
  - Tends to create compact, similarly sized clusters; the default linkage in scikit-learn's `AgglomerativeClustering`
**Q2: What are the advantages of hierarchical over partitional clustering (like K-means)?**
**Answer:**
- **No need to specify K:** The dendrogram encodes a nested clustering for every K from 1 to n, so K can be chosen after fitting
- **Hierarchical structure:** Shows relationships between clusters
- **Deterministic:** Same result each time (unlike K-means with random initialization)
- **Visualization:** Dendrogram provides intuitive visualization
- **Flexibility:** Depending on the linkage, can handle clusters of different shapes and sizes (e.g., single linkage for elongated clusters)
**Q3: How do you interpret a dendrogram?**
**Answer:**
- **Vertical axis:** Distance between merging clusters
- **Horizontal axis:** Data points or clusters
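- **Height of fusion:** Indicates dissimilarity between the merged clusters (merges lower in the tree join more similar clusters)
- **Long vertical lines:** Large gap before the next merge, suggesting well-separated clusters
- **Short vertical lines:** Similar clusters merging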
- **Cutting height:** Determines number of clusters
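To show where merge heights come from and what cutting at a height does, here is a deliberately naive agglomerative sketch. The function names `naive_agglomerative` and `cut_at_height` and the one-dimensional points are assumptions made for illustration; the notebook itself relies on SciPy's `linkage` and `fcluster`.

```python
import numpy as np

def naive_agglomerative(X, linkage="single"):
    """Naive agglomerative clustering; returns merges as (members_a, members_b, height)."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    reduce_fn = np.min if linkage == "single" else np.max  # else: complete linkage
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = reduce_fn(dists[np.ix_(clusters[a], clusters[b])])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], float(h)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def cut_at_height(merges, n_points, height):
    """Apply only the merges below `height` and return the resulting clusters."""
    clusters = [[i] for i in range(n_points)]
    for left, right, h in merges:
        if h >= height:
            break  # merge heights are non-decreasing for single/complete linkage
        clusters = [c for c in clusters if c != left and c != right] + [left + right]
    return sorted(sorted(c) for c in clusters)

points = [[0.0], [1.0], [5.0], [6.0], [20.0]]
merges = naive_agglomerative(points)
for left, right, h in merges:
    print(f"merge {left} + {right} at height {h:.1f}")
print(cut_at_height(merges, len(points), height=3.0))
```

With single linkage the merges happen at heights 1, 1, 4 and 14. Cutting at height 3 keeps only the first two merges and leaves three clusters: `[[0, 1], [2, 3], [4]]`. The long gap between 4 and 14 is the kind of tall vertical segment a dendrogram would show for the isolated point.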
**Q4: What is the time and space complexity of hierarchical clustering?**
**Answer:**
- **Time complexity:** O(n³) for naive implementation, O(n² log n) with optimizations such as a priority queue
- **Space complexity:** O(n²) for storing distance matrix
- **Comparison:**
  - Hierarchical: Better for small datasets (roughly n < 10,000, as a rule of thumb)
  - K-means: Better for large datasets
- **Bottleneck:** Distance matrix computation and storage
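A quick back-of-the-envelope calculation shows why the O(n²) distance matrix is the bottleneck. The sketch assumes each distance is stored once (condensed form) as an 8-byte float; the function name is hypothetical.

```python
def condensed_distance_bytes(n_points, bytes_per_value=8):
    """Bytes needed to store every pairwise distance once (float64 by default)."""
    n_pairs = n_points * (n_points - 1) // 2
    return n_pairs * bytes_per_value

for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"n = {n:>7,}: {gib:8.3f} GiB")
```

Under these assumptions, 10,000 points need roughly 0.37 GiB and 100,000 points need roughly 37 GiB, which is why the n < 10,000 guideline is a reasonable rule of thumb on ordinary hardware.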
**Q5: When should you use hierarchical vs K-means clustering?**
**Answer:**
**Use Hierarchical when:**
- Small to medium datasets (roughly n < 10,000)
- Number of clusters is unknown and you want to choose it from the dendrogram
- Want hierarchical relationships
- Need deterministic results
- Dealing with non-spherical clusters (with a suitable linkage such as single linkage)
**Use K-means when:**
- Large datasets
- Know number of clusters in advance
- Need fast computation
- Spherical clusters expected
- Run-to-run variation from random initialization is acceptable
# 11. PRINCIPAL COMPONENT ANALYSIS (PCA)
## Algorithm Background & Mathematical Foundation
**Core Concept:** PCA is a dimensionality reduction technique that transforms correlated variables into a set of uncorrelated variables (principal components) that capture maximum variance.
**Mathematical Formulation:**
**Step 1: Standardize the Data**
$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$
**Step 2: Compute Covariance Matrix** (here $X$ denotes the standardized, and therefore centered, data from Step 1)
$$\Sigma = \frac{1}{n-1} X^T X$$
**Step 3: Eigen Decomposition**
$$\Sigma v = \lambda v$$
Where:
- $\lambda$: eigenvalues (variance explained)
- $v$: eigenvectors (principal components)
**Step 4: Project Data**
$$Z = X V$$
Where the columns of $V$ are the top-$k$ eigenvectors, ordered by decreasing eigenvalue
**Variance Explained:**
$$\text{Variance Explained}_k = \frac{\lambda_k}{\sum_{j=1}^p \lambda_j}$$
**Reconstruction:**
$$X_{\text{reconstructed}} = Z V^T$$
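The four steps above can be followed line by line in NumPy. This is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions), intended to mirror the formulas rather than replace `decomposition.PCA`.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical data: 200 samples, 3 features, two of them strongly correlated
base = rng.normal(size=(200, 1))
X_raw = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
                   2.0 * base + 0.1 * rng.normal(size=(200, 1)),
                   rng.normal(size=(200, 1))])

# Step 1: standardize (ddof=1 so the covariance diagonal is exactly 1)
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Step 2: covariance matrix of the standardized data
n = X_std.shape[0]
cov = X_std.T @ X_std / (n - 1)

# Step 3: eigendecomposition (eigh: symmetric input, ascending eigenvalues)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: project onto the top-k eigenvectors
k = 2
V = eigvecs[:, :k]
Z = X_std @ V

explained_ratio = eigvals / eigvals.sum()
X_rec = Z @ V.T  # reconstruction in standardized space

print("explained variance ratio:", np.round(explained_ratio, 3))
print("reconstruction MSE:", np.mean((X_std - X_rec) ** 2))
```

Two properties are worth checking on the output: the covariance of `Z` is diagonal with the top eigenvalues on the diagonal (the components are uncorrelated), and the squared reconstruction error equals (n − 1) times the sum of the discarded eigenvalues.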

In [None]:
# Cell 13: Principal Component Analysis - Comprehensive Implementation
print("🚀 PRINCIPAL COMPONENT ANALYSIS: COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names
target_names = iris.target_names
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Features: {list(feature_names)}")
print(f"• Classes: {list(target_names)}")
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA
pca = decomposition.PCA()
X_pca = pca.fit_transform(X_scaled)
print("\n🔍 PCA Analysis Results:")
print(f"• Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"• Cumulative explained variance: {np.cumsum(pca.explained_variance_ratio_)}")
print(f"• Principal components shape: {X_pca.shape}")
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Original Data (first two features)
scatter = axes[0,0].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.8)
axes[0,0].set_xlabel(feature_names[0])
axes[0,0].set_ylabel(feature_names[1])
axes[0,0].set_title('Original Data (First Two Features)')
plt.colorbar(scatter, ax=axes[0,0])
# 2. PCA Projection (first two components)
scatter = axes[0,1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.8)
axes[0,1].set_xlabel('Principal Component 1')
axes[0,1].set_ylabel('Principal Component 2')
axes[0,1].set_title('PCA Projection (PC1 vs PC2)')
plt.colorbar(scatter, ax=axes[0,1])
# 3. Variance Explained
components = range(1, len(pca.explained_variance_ratio_) + 1)
variance_ratio = pca.explained_variance_ratio_
cumulative_variance = np.cumsum(variance_ratio)
axes[0,2].bar(components, variance_ratio, alpha=0.6, label='Individual')
axes[0,2].plot(components, cumulative_variance, 'ro-', label='Cumulative')
axes[0,2].set_xlabel('Principal Components')
axes[0,2].set_ylabel('Explained Variance Ratio')
axes[0,2].set_title('Explained Variance by Principal Components')
axes[0,2].legend()
axes[0,2].grid(True)
# 4. Component Loadings (Heatmap)
loadings = pca.components_.T
sns.heatmap(loadings, annot=True, fmt='.2f', cmap='coolwarm', center=0,
xticklabels=[f'PC{i+1}' for i in range(loadings.shape[1])],
yticklabels=feature_names, ax=axes[1,0])
axes[1,0].set_title('PCA Component Loadings')
# 5. 3D PCA Visualization
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
axes[1,1].remove()  # drop the empty 2D axes so it does not sit behind the 3D axes
ax = fig.add_subplot(2, 3, 5, projection='3d')
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap='viridis', alpha=0.8)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.set_title('3D PCA Projection')
# 6. Reconstruction Error vs Number of Components
reconstruction_errors = []
n_components_range = range(1, X.shape[1] + 1)
for n_comp in n_components_range:
    pca_temp = decomposition.PCA(n_components=n_comp)
    X_temp = pca_temp.fit_transform(X_scaled)
    X_reconstructed = pca_temp.inverse_transform(X_temp)
    reconstruction_error = np.mean((X_scaled - X_reconstructed) ** 2)
    reconstruction_errors.append(reconstruction_error)
axes[1,2].plot(n_components_range, reconstruction_errors, 'bo-')
axes[1,2].set_xlabel('Number of Components')
axes[1,2].set_ylabel('Reconstruction Error (MSE)')
axes[1,2].set_title('Reconstruction Error vs Components')
axes[1,2].grid(True)
plt.tight_layout()
plt.show()
# Advanced PCA Analysis
print("\n🔧 Advanced PCA Applications:")
# PCA for Dimensionality Reduction in Classification
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
# Create pipeline with PCA and classifier
pipeline = Pipeline([
('scaler', StandardScaler()),
('pca', decomposition.PCA(n_components=2)),
('classifier', LogisticRegression(random_state=42))
])
# Compare performance with different number of components
n_components_classification = range(1, 5)
classification_scores = []
for n_comp in n_components_classification:
    pipeline.set_params(pca__n_components=n_comp)
    scores = model_selection.cross_val_score(pipeline, X, y, cv=5)
    classification_scores.append(scores.mean())
plt.figure(figsize=(10, 6))
plt.plot(n_components_classification, classification_scores, 'ro-', linewidth=2)
plt.xlabel('Number of Principal Components')
plt.ylabel('Classification Accuracy')
plt.title('Classification Performance vs Number of PCA Components')
plt.grid(True)
plt.show()
print(f"Best number of components for classification: {np.argmax(classification_scores) + 1}")
# Kernel PCA for Non-linear Dimensionality Reduction
print("\n🎯 Kernel PCA for Non-linear Data:")
# Create non-linear dataset
X_nonlinear, y_nonlinear = datasets.make_circles(n_samples=400, noise=0.05, factor=0.3, random_state=42)
# Apply different PCA variants
pca_variants = {
'Standard PCA': decomposition.PCA(n_components=2),
'Kernel PCA (RBF)': decomposition.KernelPCA(n_components=2, kernel='rbf'),
'Kernel PCA (Poly)': decomposition.KernelPCA(n_components=2, kernel='poly')
}
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Original data
axes[0,0].scatter(X_nonlinear[:, 0], X_nonlinear[:, 1], c=y_nonlinear, cmap='viridis', alpha=0.7)
axes[0,0].set_title('Original Non-linear Data')
axes[0,0].set_xlabel('Feature 1')
axes[0,0].set_ylabel('Feature 2')
axes[0,0].grid(True)
for i, (name, pca_method) in enumerate(pca_variants.items()):
    row = (i + 1) // 2
    col = (i + 1) % 2
    try:
        X_transformed = pca_method.fit_transform(X_nonlinear)
        scatter = axes[row, col].scatter(X_transformed[:, 0], X_transformed[:, 1],
                                         c=y_nonlinear, cmap='viridis', alpha=0.7)
        axes[row, col].set_title(name)
        axes[row, col].set_xlabel('Component 1')
        axes[row, col].set_ylabel('Component 2')
        axes[row, col].grid(True)
    except Exception as e:
        print(f"Error with {name}: {e}")
plt.tight_layout()
plt.show()
# PCA for Image Compression
print("\n🖼️ PCA for Image Compression:")
try:
    from sklearn.datasets import load_digits
    digits = load_digits()
    X_digits = digits.data
    y_digits = digits.target
    print(f"Digits dataset: {X_digits.shape[0]} images, {X_digits.shape[1]} pixels")
    # Apply PCA for image compression
    n_components_image = [10, 25, 50, 64]
    fig, axes = plt.subplots(2, 4, figsize=(15, 8))
    for i, n_comp in enumerate(n_components_image):
        pca_image = decomposition.PCA(n_components=n_comp)
        X_compressed = pca_image.fit_transform(X_digits)
        X_reconstructed = pca_image.inverse_transform(X_compressed)
        # Calculate compression ratio
        original_size = X_digits.shape[1] * X_digits.shape[0]
        compressed_size = n_comp * X_digits.shape[0] + n_comp * X_digits.shape[1]
        compression_ratio = compressed_size / original_size
        # Show original and reconstructed
        axes[0, i].imshow(X_digits[0].reshape(8, 8), cmap='gray')
        axes[0, i].set_title(f'Original Image\n(64 features)')
        axes[0, i].axis('off')
        axes[1, i].imshow(X_reconstructed[0].reshape(8, 8), cmap='gray')
        axes[1, i].set_title(f'Compressed: {n_comp} components\nRatio: {compression_ratio:.2f}')
        axes[1, i].axis('off')
    plt.tight_layout()
    plt.show()
except ImportError:
    print("Digits dataset not available")
# PCA for Anomaly Detection
print("\n🚨 PCA for Anomaly Detection:")
# Create dataset with outliers
X_clean, _ = datasets.make_blobs(n_samples=190, centers=2, cluster_std=1.0, random_state=42)
outliers = np.random.uniform(low=-10, high=10, size=(10, 2))
X_with_outliers = np.vstack([X_clean, outliers])
y_outliers = np.array([0]*190 + [1]*10) # 1 indicates outlier
# Standardize once so projection, reconstruction, and error all share one space
X_outliers_scaled = StandardScaler().fit_transform(X_with_outliers)
# Apply PCA
pca_outlier = decomposition.PCA(n_components=2)
X_pca_outlier = pca_outlier.fit_transform(X_outliers_scaled)
# Calculate reconstruction error as outlier score
pca_recon = decomposition.PCA(n_components=1)
X_pca_recon = pca_recon.fit_transform(X_outliers_scaled)
X_reconstructed = pca_recon.inverse_transform(X_pca_recon)
# Compare against the scaled data: inverse_transform returns values in scaled space
reconstruction_error = np.sum((X_outliers_scaled - X_reconstructed) ** 2, axis=1)
plt.figure(figsize=(15, 5))
# Original data with outliers
plt.subplot(1, 3, 1)
plt.scatter(X_with_outliers[y_outliers==0, 0], X_with_outliers[y_outliers==0, 1],
c='blue', alpha=0.6, label='Normal')
plt.scatter(X_with_outliers[y_outliers==1, 0], X_with_outliers[y_outliers==1, 1],
c='red', alpha=0.8, label='Outlier', marker='x', s=100)
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Original Data with Outliers')
plt.legend()
plt.grid(True)
# PCA projection
plt.subplot(1, 3, 2)
plt.scatter(X_pca_outlier[y_outliers==0, 0], X_pca_outlier[y_outliers==0, 1],
c='blue', alpha=0.6, label='Normal')
plt.scatter(X_pca_outlier[y_outliers==1, 0], X_pca_outlier[y_outliers==1, 1],
c='red', alpha=0.8, label='Outlier', marker='x', s=100)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Projection')
plt.legend()
plt.grid(True)
# Reconstruction error
plt.subplot(1, 3, 3)
plt.scatter(range(len(reconstruction_error)), reconstruction_error,
c=y_outliers, cmap='coolwarm', alpha=0.7)
plt.xlabel('Sample Index')
plt.ylabel('Reconstruction Error')
plt.title('PCA Reconstruction Error\n(Outlier Score)')
plt.grid(True)
plt.tight_layout()
plt.show()
print("✅ Principal Component Analysis Complete!")

## PCA Interview Questions & Answers
**Q1: What is the intuition behind PCA?**
**Answer:**
- **Geometric interpretation:** Find directions of maximum variance in data
- **Statistical interpretation:** Transform correlated variables into uncorrelated components
- **Information preservation:** Keep most information with fewer dimensions
- **Noise reduction:** Leading components often capture most of the signal, while low-variance trailing components often capture mostly noise
**Q2: Why do we need to standardize data before PCA?**
**Answer:**
- **Scale sensitivity:** PCA is sensitive to feature scales
- **Variance domination:** Features with larger scales dominate variance
- **Fair comparison:** Standardization puts all features on same scale
- **Equivalent view:** PCA on standardized data is PCA on the correlation matrix instead of the covariance matrix; when all features already share the same units and comparable scales, centering alone can be enough
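The effect of scale can be seen on two independent features measured in very different units. The sketch below is illustrative only: the feature names, distributions, and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical features: height in metres and income in dollars, independent
height_m = rng.normal(1.70, 0.10, size=500)
income_usd = rng.normal(50_000, 15_000, size=500)
X_units = np.column_stack([height_m, income_usd])

def first_component(X):
    """Unit-length direction of maximum variance and its share of total variance."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    return eigvecs[:, -1], eigvals[-1] / eigvals.sum()

pc1_raw, share_raw = first_component(X_units)
X_stdz = (X_units - X_units.mean(axis=0)) / X_units.std(axis=0)
pc1_std, share_std = first_component(X_stdz)

print("raw data          PC1:", np.round(np.abs(pc1_raw), 3), f"share = {share_raw:.4f}")
print("standardized data PC1:", np.round(np.abs(pc1_std), 3), f"share = {share_std:.4f}")
```

On the raw data the first component points almost entirely along income and appears to explain nearly all the variance, purely because dollars produce larger numbers than metres. After standardization the two independent features share the variance roughly equally.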
**Q3: How do you interpret principal components?**
**Answer:**
- **Eigenvalues:** Amount of variance explained by each component
- **Eigenvectors:** Directions of principal components in original feature space
- **Loadings:** Eigenvectors scaled by the square root of their eigenvalues; for standardized data they equal the correlations between original features and principal components
- **Interpretation:** PC1 captures most variance, PC2 captures next most, etc.
**Q4: What is the difference between PCA and LDA?**
**Answer:**
- **PCA:** Unsupervised, maximizes variance (ignores class labels)
- **LDA:** Supervised, maximizes separation between classes
- **Objective:**
- PCA: max variance in data
- LDA: max ratio of between-class to within-class variance
- **Use cases:** PCA for exploration, LDA for classification
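The difference in objectives shows up clearly on data where the direction of largest variance carries no class information. The following sketch builds such a dataset (an assumption chosen for illustration) and compares the first principal component with the two-class Fisher discriminant direction, computed as $S_w^{-1}(\mu_1 - \mu_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_per_class = 200
# Hypothetical data: large class-independent spread along x,
# small spread along y, and classes separated only along y
class0 = np.column_stack([rng.normal(0, 5.0, n_per_class), rng.normal(-1, 0.2, n_per_class)])
class1 = np.column_stack([rng.normal(0, 5.0, n_per_class), rng.normal(+1, 0.2, n_per_class)])
X_all = np.vstack([class0, class1])

# PCA: leading eigenvector of the total covariance (labels ignored)
eigvals, eigvecs = np.linalg.eigh(np.cov(X_all, rowvar=False))
pca_dir = eigvecs[:, -1]

# Fisher LDA for two classes: w proportional to S_w^{-1} (mu_1 - mu_0)
mu0, mu1 = class0.mean(axis=0), class1.mean(axis=0)
S_w = (class0 - mu0).T @ (class0 - mu0) + (class1 - mu1).T @ (class1 - mu1)
lda_dir = np.linalg.solve(S_w, mu1 - mu0)
lda_dir /= np.linalg.norm(lda_dir)

print("PCA direction:", np.round(np.abs(pca_dir), 3))
print("LDA direction:", np.round(np.abs(lda_dir), 3))
```

PCA picks the x-axis because that is where the variance is, while the Fisher direction picks the y-axis because that is where the classes separate. Projecting onto the PCA direction alone would discard almost all class information in this example.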
**Q5: When should you use PCA vs other dimensionality reduction methods?**
**Answer:**
**Use PCA when:**
- Linear relationships in data
- Need interpretable components
- Want maximum variance preservation
- Dealing with continuous numerical data
- Need deterministic results
**Consider alternatives when:**
- Non-linear relationships (use t-SNE, UMAP, Kernel PCA)
- Preserving local structure (use t-SNE)
- Categorical data (use Multiple Correspondence Analysis, MCA)
- Very high-dimensional sparse data (use Truncated SVD)
# 12. NEURAL NETWORKS (MLP)
## Algorithm Background & Mathematical Foundation
**Core Concept:** Neural networks are computational models inspired by biological neurons, capable of learning complex non-linear relationships through multiple layers of interconnected nodes.
**Mathematical Formulation:**
**Single Neuron:**
$$z = w^T x + b$$
$$a = \sigma(z)$$
Where:
- $w$: weights
- $b$: bias
- $\sigma$: activation function
**Common Activation Functions:**
- **Sigmoid:** $\sigma(z) = \frac{1}{1 + e^{-z}}$
- **Tanh:** $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$
- **ReLU:** $\text{ReLU}(z) = \max(0, z)$
- **Softmax:** $\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}$
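The activation functions listed above are one-liners in NumPy. The sketch below is a plain transcription of the formulas; the only addition is the usual max-subtraction trick in `softmax`, which leaves the result unchanged but avoids overflow for large inputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtracting the maximum leaves the result unchanged but avoids overflow
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

z = np.array([-2.0, 0.0, 3.0])
print("sigmoid:", np.round(sigmoid(z), 3))
print("tanh:   ", np.round(np.tanh(z), 3))
print("relu:   ", relu(z))
print("softmax:", np.round(softmax(z), 3), "sum =", softmax(z).sum())
```

Sigmoid maps to (0, 1), tanh to (-1, 1), ReLU zeroes out negative inputs, and softmax returns a vector of probabilities that sums to 1.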
**Forward Propagation:**
For layer $l$:
$$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$$
$$a^{[l]} = \sigma^{[l]}(z^{[l]})$$
**Backpropagation:**
Compute gradients using chain rule:
$$\frac{\partial L}{\partial W^{[l]}} = \frac{\partial L}{\partial a^{[l]}} \frac{\partial a^{[l]}}{\partial z^{[l]}} \frac{\partial z^{[l]}}{\partial W^{[l]}}$$
**Loss Functions:**
- **MSE:** $L = \frac{1}{m} \sum (y - \hat{y})^2$
- **Binary Cross-Entropy:** $L = -\frac{1}{m} \sum [y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$, where $m$ is the number of training examples and $\hat{y}$ is the predicted probability
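To tie forward propagation, the cross-entropy loss, and backpropagation together, here is a minimal sketch of a one-hidden-layer network in NumPy with a finite-difference check of one gradient. The layer sizes, toy data, and seed are assumptions made for illustration; `MLPClassifier` and Keras handle all of this internally.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# Hypothetical toy problem: 8 samples, 3 features, binary labels
X_nn = rng.normal(size=(3, 8))            # columns are samples, as in z = W a + b
y_nn = rng.integers(0, 2, size=(1, 8)).astype(float)

params = {
    "W1": rng.normal(scale=0.5, size=(4, 3)), "b1": np.zeros((4, 1)),
    "W2": rng.normal(scale=0.5, size=(1, 4)), "b2": np.zeros((1, 1)),
}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(p, X):
    z1 = p["W1"] @ X + p["b1"]
    a1 = np.tanh(z1)                       # hidden layer
    z2 = p["W2"] @ a1 + p["b2"]
    a2 = sigmoid(z2)                       # output probability
    return a1, a2

def loss(p, X, y):
    _, a2 = forward(p, X)
    return float(-np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2)))

def backward(p, X, y):
    m = X.shape[1]
    a1, a2 = forward(p, X)
    dz2 = (a2 - y) / m                       # dL/dz2 for sigmoid + cross-entropy
    dz1 = (p["W2"].T @ dz2) * (1 - a1 ** 2)  # chain rule through tanh
    return {
        "W2": dz2 @ a1.T, "b2": dz2.sum(axis=1, keepdims=True),
        "W1": dz1 @ X.T,  "b1": dz1.sum(axis=1, keepdims=True),
    }

grads = backward(params, X_nn, y_nn)

# Finite-difference check on one weight
eps = 1e-6
bumped = {k: v.copy() for k, v in params.items()}
bumped["W1"][0, 0] += eps
plus = loss(bumped, X_nn, y_nn)
bumped["W1"][0, 0] -= 2 * eps
minus = loss(bumped, X_nn, y_nn)
numeric = (plus - minus) / (2 * eps)
print(f"analytic: {grads['W1'][0, 0]:.8f}  numeric: {numeric:.8f}")
```

The analytic gradient from the chain rule and the numerical estimate should agree to several decimal places. This kind of gradient check is a common way to validate a hand-written backward pass.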

In [None]:
# Cell 14: Neural Networks (MLP) - Comprehensive Implementation
print("🚀 NEURAL NETWORKS (MLP): COMPREHENSIVE IMPLEMENTATION\n")
# Load dataset
X, y = datasets.make_classification(n_samples=1000, n_features=20, n_informative=15,
n_redundant=5, n_classes=2, random_state=42)
print("📊 Dataset Overview:")
print(f"• Samples: {X.shape[0]}, Features: {X.shape[1]}")
print(f"• Classes: {len(np.unique(y))}")
print(f"• Class distribution: {np.bincount(y)}")
# Split and scale data
X_train, X_test, y_train, y_test = model_selection.train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Build MLP models with different architectures
mlp_models = {
'MLP (1 hidden layer)': neural_network.MLPClassifier(
hidden_layer_sizes=(50,), activation='relu', random_state=42, max_iter=1000
),
'MLP (2 hidden layers)': neural_network.MLPClassifier(
hidden_layer_sizes=(50, 25), activation='relu', random_state=42, max_iter=1000
),
'MLP (3 hidden layers)': neural_network.MLPClassifier(
hidden_layer_sizes=(50, 25, 10), activation='relu', random_state=42, max_iter=1000
),
'MLP (tanh activation)': neural_network.MLPClassifier(
hidden_layer_sizes=(50, 25), activation='tanh', random_state=42, max_iter=1000
),
'MLP (with regularization)': neural_network.MLPClassifier(
hidden_layer_sizes=(50, 25), activation='relu', alpha=0.1, random_state=42, max_iter=1000
)
}
mlp_results = {}
for name, model in mlp_models.items():
    print(f"\n📊 Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    roc_auc = metrics.roc_auc_score(y_test, y_pred_proba)
    mlp_results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'model': model,
        'loss_curve': model.loss_curve_ if hasattr(model, 'loss_curve_') else None
    }
    print(f" • Accuracy: {accuracy:.4f}")
    print(f" • Precision: {precision:.4f}")
    print(f" • Recall: {recall:.4f}")
    print(f" • F1-Score: {f1:.4f}")
    print(f" • ROC-AUC: {roc_auc:.4f}")
    print(f" • Iterations: {model.n_iter_}")
# Model Comparison
comparison_mlp = pd.DataFrame({
'Model': list(mlp_results.keys()),
'Accuracy': [mlp_results[name]['accuracy'] for name in mlp_results.keys()],
'Precision': [mlp_results[name]['precision'] for name in mlp_results.keys()],
'Recall': [mlp_results[name]['recall'] for name in mlp_results.keys()],
'F1-Score': [mlp_results[name]['f1'] for name in mlp_results.keys()],
'ROC-AUC': [mlp_results[name]['roc_auc'] for name in mlp_results.keys()]
})
print("\n📋 Model Performance Comparison:")
print(comparison_mlp.round(4))
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Loss Curves
for name, result in mlp_results.items():
    if result['loss_curve'] is not None:
        axes[0,0].plot(result['loss_curve'], label=name)
axes[0,0].set_xlabel('Iterations')
axes[0,0].set_ylabel('Loss')
axes[0,0].set_title('Training Loss Curves')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Model Comparison
models_mlp = list(mlp_results.keys())
accuracy_scores = [mlp_results[name]['accuracy'] for name in models_mlp]
bars = axes[0,1].bar(models_mlp, accuracy_scores, color=['blue', 'green', 'orange', 'red', 'purple'])
axes[0,1].set_ylabel('Accuracy')
axes[0,1].set_title('MLP Architectures Comparison')
axes[0,1].tick_params(axis='x', rotation=45)
for bar, score in zip(bars, accuracy_scores):
    axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
                   f'{score:.3f}', ha='center', va='bottom')
# 3. ROC Curves
for name, result in mlp_results.items():
    y_pred_proba = result['model'].predict_proba(X_test_scaled)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
    roc_auc = metrics.auc(fpr, tpr)
    axes[0,2].plot(fpr, tpr, label=f'{name} (AUC = {roc_auc:.3f})')
axes[0,2].plot([0, 1], [0, 1], 'k--')
axes[0,2].set_xlabel('False Positive Rate')
axes[0,2].set_ylabel('True Positive Rate')
axes[0,2].set_title('ROC Curves Comparison')
axes[0,2].legend()
axes[0,2].grid(True)
# 4. Activation Functions Comparison
activations = ['relu', 'tanh', 'logistic']
activation_scores = []
for activation in activations:
    mlp_temp = neural_network.MLPClassifier(
        hidden_layer_sizes=(50, 25), activation=activation, random_state=42, max_iter=1000
    )
    mlp_temp.fit(X_train_scaled, y_train)
    activation_scores.append(mlp_temp.score(X_test_scaled, y_test))
axes[1,0].bar(activations, activation_scores, color=['blue', 'green', 'orange'])
axes[1,0].set_ylabel('Accuracy')
axes[1,0].set_title('Activation Functions Comparison')
axes[1,0].set_ylim(0, 1)
for i, score in enumerate(activation_scores):
    axes[1,0].text(i, score + 0.01, f'{score:.3f}', ha='center', va='bottom')
# 5. Learning Rate Analysis
learning_rates = [0.001, 0.01, 0.1, 0.5]
lr_scores = []
for lr in learning_rates:
    mlp_temp = neural_network.MLPClassifier(
        hidden_layer_sizes=(50, 25), learning_rate_init=lr, random_state=42, max_iter=1000
    )
    mlp_temp.fit(X_train_scaled, y_train)
    lr_scores.append(mlp_temp.score(X_test_scaled, y_test))
axes[1,1].semilogx(learning_rates, lr_scores, 'ro-', linewidth=2)
axes[1,1].set_xlabel('Learning Rate')
axes[1,1].set_ylabel('Accuracy')
axes[1,1].set_title('Learning Rate vs Performance')
axes[1,1].grid(True)
# 6. Regularization Analysis
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0]
alpha_scores = []
for alpha in alphas:
    mlp_temp = neural_network.MLPClassifier(
        hidden_layer_sizes=(50, 25), alpha=alpha, random_state=42, max_iter=1000
    )
    mlp_temp.fit(X_train_scaled, y_train)
    alpha_scores.append(mlp_temp.score(X_test_scaled, y_test))
axes[1,2].semilogx(alphas, alpha_scores, 'go-', linewidth=2)
axes[1,2].set_xlabel('Alpha (Regularization)')
axes[1,2].set_ylabel('Accuracy')
axes[1,2].set_title('Regularization vs Performance')
axes[1,2].grid(True)
plt.tight_layout()
plt.show()
# TensorFlow/Keras Implementation
print("\n🔥 TENSORFLOW/KERAS IMPLEMENTATION:")
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
# Build model
tf_model = Sequential([
Dense(50, activation='relu', input_shape=(X_train_scaled.shape[1],)),
BatchNormalization(),
Dropout(0.3),
Dense(25, activation='relu'),
BatchNormalization(),
Dropout(0.3),
Dense(1, activation='sigmoid')
])
# Compile model
tf_model.compile(
optimizer=Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
print("Model Architecture:")
tf_model.summary()
# Train model
early_stopping = EarlyStopping(patience=10, restore_best_weights=True)
history = tf_model.fit(
X_train_scaled, y_train,
validation_data=(X_test_scaled, y_test),
epochs=100,
batch_size=32,
callbacks=[early_stopping],
verbose=0
)
# Evaluate model
y_pred_tf = (tf_model.predict(X_test_scaled) > 0.5).astype(int).flatten()
accuracy_tf = metrics.accuracy_score(y_test, y_pred_tf)
print(f"\nTensorFlow Model Performance:")
print(f"• Accuracy: {accuracy_tf:.4f}")
print(f"• Training epochs: {len(history.history['loss'])}")
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Loss
axes[0].plot(history.history['loss'], label='Training Loss')
axes[0].plot(history.history['val_loss'], label='Validation Loss')
axes[0].set_xlabel('Epochs')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training and Validation Loss')
axes[0].legend()
axes[0].grid(True)
# Accuracy
axes[1].plot(history.history['accuracy'], label='Training Accuracy')
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Training and Validation Accuracy')
axes[1].legend()
axes[1].grid(True)
plt.tight_layout()
plt.show()
# Advanced: Neural Network for Regression
print("\n📈 NEURAL NETWORKS FOR REGRESSION:")
# Create regression dataset
X_reg, y_reg = datasets.make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = model_selection.train_test_split(
X_reg, y_reg, test_size=0.2, random_state=42
)
# Scale features and target
scaler_X_reg = StandardScaler()
X_train_reg_scaled = scaler_X_reg.fit_transform(X_train_reg)
X_test_reg_scaled = scaler_X_reg.transform(X_test_reg)
scaler_y_reg = StandardScaler()
y_train_reg_scaled = scaler_y_reg.fit_transform(y_train_reg.reshape(-1, 1)).flatten()
y_test_reg_scaled = scaler_y_reg.transform(y_test_reg.reshape(-1, 1)).flatten()
# Build regression model
mlp_reg = neural_network.MLPRegressor(
hidden_layer_sizes=(50, 25), activation='relu', random_state=42, max_iter=1000
)
mlp_reg.fit(X_train_reg_scaled, y_train_reg_scaled)
y_pred_reg_scaled = mlp_reg.predict(X_test_reg_scaled)
# Convert back to original scale
y_pred_reg = scaler_y_reg.inverse_transform(y_pred_reg_scaled.reshape(-1, 1)).flatten()
mse = metrics.mean_squared_error(y_test_reg, y_pred_reg)
r2 = metrics.r2_score(y_test_reg, y_pred_reg)
print(f"Neural Network Regression Results:")
print(f"• MSE: {mse:.4f}")
print(f"• R²: {r2:.4f}")
# Plot regression results
plt.figure(figsize=(10, 6))
plt.scatter(y_test_reg, y_pred_reg, alpha=0.6)
plt.plot([y_test_reg.min(), y_test_reg.max()], [y_test_reg.min(), y_test_reg.max()], 'r--')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Neural Network Regression: Actual vs Predicted')
plt.grid(True)
plt.show()
# Neural Network Feature Importance
print("\n🔍 NEURAL NETWORK FEATURE IMPORTANCE:")
# Permutation importance
from sklearn.inspection import permutation_importance
best_mlp_name = max(mlp_results.keys(), key=lambda x: mlp_results[x]['accuracy'])
best_mlp_model = mlp_results[best_mlp_name]['model']
result = permutation_importance(best_mlp_model, X_test_scaled, y_test, n_repeats=10, random_state=42)
# Plot feature importance
feature_importance = pd.DataFrame({
'feature': [f'Feature_{i}' for i in range(X.shape[1])],
'importance': result.importances_mean,
'std': result.importances_std
}).sort_values('importance', ascending=True)
plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][-15:], feature_importance['importance'][-15:],
xerr=feature_importance['std'][-15:])
plt.xlabel('Permutation Importance')
plt.title('Neural Network Feature Importance (Top 15)')
plt.grid(True, axis='x')
plt.tight_layout()
plt.show()
print("✅ Neural Networks (MLP) Analysis Complete!")

## Neural Networks Interview Questions & Answers
**Q1: Explain the vanishing/exploding gradient problem.**
**Answer:**
- **Vanishing gradients:** Gradients become extremely small during backpropagation, stopping learning in early layers
- **Exploding gradients:** Gradients become extremely large, causing unstable training
- **Causes:** Deep networks, certain activation functions (sigmoid, tanh), improper weight initialization
- **Solutions:** ReLU activation, batch normalization, residual connections, proper initialization
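
To make the vanishing-gradient effect concrete, the sketch below multiplies one local activation derivative per layer, which is what backpropagation does along a chain of layers. It is a simplified illustration: weights are ignored, and the depth and the pre-activation value `0.5` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid_grad(z):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)

# Hypothetical pre-activations: one value per layer of a 10-layer chain
depth = 10
z = np.full(depth, 0.5)

# Backpropagation multiplies one local derivative per layer (weights ignored here)
sigmoid_chain = np.prod(sigmoid_grad(z))
relu_chain = np.prod(relu_grad(z))

print(f"sigmoid chain factor over {depth} layers: {sigmoid_chain:.2e}")
print(f"ReLU chain factor over {depth} layers:    {relu_chain:.2e}")
```

Because every sigmoid derivative is at most 0.25, the product shrinks geometrically with depth, while active ReLU units pass the gradient through unchanged.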
**Q2: What are the different activation functions and when to use them?**
**Answer:**
- **ReLU:** $f(x) = \max(0, x)$
  - Pros: Mitigates vanishing gradients (the gradient is 1 for positive inputs), computationally efficient
  - Cons: Dying ReLU problem (a neuron whose inputs stay negative outputs zero, receives zero gradient, and stops learning)
- **Sigmoid:** $f(x) = \frac{1}{1 + e^{-x}}$
  - Pros: Smooth gradient, output range (0,1)
  - Cons: Saturates for large $|x|$ (vanishing gradient), not zero-centered
- **Tanh:** $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
  - Pros: Zero-centered, stronger gradients than sigmoid
  - Cons: Still saturates, so gradients can vanish
- **Leaky ReLU:** $f(x) = \max(0.01x, x)$
  - Keeps a small slope for negative inputs, which addresses the dying ReLU problem
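
The four formulas above translate directly into NumPy. This is a minimal sketch for experimentation; the function names and the sample inputs are chosen here for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as max(slope * x, x) for 0 < slope < 1
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
for fn in (relu, leaky_relu, sigmoid, tanh):
    print(f"{fn.__name__:>10}: {fn(x)}")
```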
**Q3: What is batch normalization and why is it important?**
**Answer:**
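- **Batch normalization:** Normalizes each layer's inputs over the mini-batch to zero mean and unit variance, then applies a learned scale $\gamma$ and shift $\beta$
- **Benefits:**
  - Faster training convergence
  - Allows higher learning rates
  - Reduces sensitivity to initialization
  - Acts as a mild regularizer
- **Formula:** $BN(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$, where $\mu$ and $\sigma^2$ are the mini-batch mean and variance and $\epsilon$ is a small constant for numerical stability
- **Placement:** Usually after the linear transformation and before the activation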
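
The training-time forward pass can be sketched in a few lines of NumPy. This version assumes a 2D input of shape `(batch, features)` and adds a small `eps` under the square root for numerical stability; the running statistics used at inference time are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # hypothetical activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print("mean per feature:", out.mean(axis=0).round(6))
print("std per feature: ", out.std(axis=0).round(6))
```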
**Q4: Explain different optimization algorithms.**
**Answer:**
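- **SGD:** Stochastic gradient descent; updates weights using the gradient of a mini-batch, optionally with momentum to smooth and accelerate the updates
- **Adam:** Adaptive learning rates for each parameter; combines momentum with RMSprop-style scaling
- **RMSprop:** Scales the learning rate by a moving average of recent squared gradients
- **Adagrad:** Scales the learning rate for each parameter by its accumulated squared gradients, so the effective rate only decreases over time
- **Choice:** Adam is usually a good default; SGD with momentum can generalize better on some tasks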
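
The update rules are easier to compare side by side. The sketch below runs SGD with momentum and Adam on a toy quadratic loss; the loss, the starting point, and all hyperparameters are assumptions picked for illustration, not recommended settings.

```python
import numpy as np

def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * ||w||^2 (chosen for illustration)."""
    return w

def sgd_momentum(w, steps=200, lr=0.1, beta=0.9):
    velocity = np.zeros_like(w)
    for _ in range(steps):
        velocity = beta * velocity + grad(w)   # accumulate past gradients
        w = w - lr * velocity
    return w

def adam(w, steps=200, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = np.zeros_like(w)   # first moment (momentum term)
    v = np.zeros_like(w)   # second moment (RMSprop-style scaling)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([3.0, -2.0])
w_sgd = sgd_momentum(w0)
w_adam = adam(w0)
print("SGD + momentum:", w_sgd)
print("Adam:          ", w_adam)
```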
**Q5: How do you prevent overfitting in neural networks?**
**Answer:**
1. **Regularization:** L1/L2 regularization on weights
2. **Dropout:** Randomly disable neurons during training
3. **Early stopping:** Stop training when validation performance degrades
4. **Data augmentation:** Artificially increase training data
5. **Batch normalization:** Acts as regularizer
6. **Reduce model complexity:** Fewer layers/neurons
7. **Weight constraints:** Limit weight magnitudes
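
As an illustration of item 2, here is a minimal sketch of inverted dropout, the variant that rescales the surviving activations during training so that nothing needs to change at inference time. The function name and the all-ones input are assumptions made for the example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0.0:
        return x                          # identity at inference time
    keep = 1.0 - rate
    mask = rng.random(x.shape) < keep     # True for units that survive
    return x * mask / keep                # rescale so the expected value is unchanged

rng = np.random.default_rng(42)
activations = np.ones((1000, 100))        # hypothetical layer output
train_out = dropout(activations, rate=0.5, training=True, rng=rng)
eval_out = dropout(activations, rate=0.5, training=False, rng=rng)

print(f"mean during training:  {train_out.mean():.3f}")
print(f"mean during inference: {eval_out.mean():.3f}")
```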
# 13. CONVOLUTIONAL NEURAL NETWORKS (CNN)
## Algorithm Background & Mathematical Foundation
**Core Concept:** CNNs are specialized neural networks for processing grid-like data (images, time series) using convolutional layers that preserve spatial relationships.
**Key Components:**
**Convolution Operation:**
$$(f * g)(t) = \int f(\tau) g(t - \tau) d\tau$$
Discrete 2D form (as implemented in CNN layers; strictly a cross-correlation, since the kernel is not flipped):
$$(I * K)(i,j) = \sum_{m} \sum_{n} I(i+m, j+n) K(m, n)$$
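
The discrete 2D formula can be written as a direct double loop. This is a minimal sketch assuming a single-channel input, stride 1, and no padding; real layers add channels, batching, and far faster implementations.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # each row increases by 1 per column
edge_kernel = np.array([[1.0, -1.0]])              # horizontal difference filter
feature_map = conv2d_valid(image, edge_kernel)
print(feature_map)
```

Every output value is the difference between two horizontally adjacent pixels, so a constant ramp produces a constant feature map.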
**Convolutional Layer:**
- **Filters/Kernels:** Small matrices that detect features
- **Stride:** Step size for filter movement
- **Padding:** Adding zeros around input (same/valid padding)
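
Kernel size, stride, and padding together determine the spatial size of a layer's output. The helper below is a small sketch of the usual formula, floor((n + 2p - k) / s) + 1; the layer sequence traced here (3x3 convolutions without padding, 2x2 pooling with stride 2, on a 28x28 input) is an assumption chosen to resemble a small MNIST network.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (n + 2 * padding - kernel) // stride + 1

size = 28
trace = []
# (kernel, stride) per layer: conv 3x3, pool 2x2, conv 3x3, pool 2x2
for kernel, stride in [(3, 1), (2, 2), (3, 1), (2, 2)]:
    size = conv_output_size(size, kernel, stride)
    trace.append(size)

print("spatial size after each layer:", trace)
print("with padding=1, a 3x3 conv keeps the size:", conv_output_size(28, 3, padding=1))
```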
**Pooling Layers:**
- **Max Pooling:** $ \text{MaxPool}(x) = \max(x_{i:i+p, j:j+p}) $
- **Average Pooling:** $ \text{AvgPool}(x) = \frac{1}{p^2} \sum x_{i:i+p, j:j+p} $
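
Both pooling operations can be sketched with a reshape trick. The version below assumes non-overlapping windows (stride equal to the pool size `p`) on a single-channel 2D input; the 4x4 example values are made up for illustration.

```python
import numpy as np

def pool2d(x, p, mode="max"):
    """Non-overlapping p x p pooling on a 2D array (stride equal to p)."""
    h_out, w_out = x.shape[0] // p, x.shape[1] // p
    # Split the array into (h_out, p, w_out, p) blocks, then reduce over each block
    blocks = x[:h_out * p, :w_out * p].reshape(h_out, p, w_out, p)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)

max_pooled = pool2d(x, 2, mode="max")
avg_pooled = pool2d(x, 2, mode="avg")
print(max_pooled)
print(avg_pooled)
```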
**Architecture Patterns:**
- **Feature extraction:** Conv → Activation → Pooling
- **Classification:** Flatten → Fully connected → Output

In [None]:
# Cell 15: Convolutional Neural Networks - Comprehensive Implementation
print("🚀 CONVOLUTIONAL NEURAL NETWORKS: COMPREHENSIVE IMPLEMENTATION\n")
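# `layers` is used throughout this cell; import it here in case the setup cell did not
from tensorflow.keras import layers
# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()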
print("📊 MNIST Dataset Overview:")
print(f"• Training samples: {X_train.shape[0]}")
print(f"• Test samples: {X_test.shape[0]}")
print(f"• Image shape: {X_train.shape[1:]}")
print(f"• Classes: {len(np.unique(y_train))}")
print(f"• Class distribution: {np.bincount(y_train)}")
# Preprocess data
X_train = X_train.astype('float32') / 255.0
X_test = X_test.astype('float32') / 255.0
# Add channel dimension
X_train = X_train.reshape(-1, 28, 28, 1)
X_test = X_test.reshape(-1, 28, 28, 1)
# Convert labels to categorical
y_train_categorical = tf.keras.utils.to_categorical(y_train, 10)
y_test_categorical = tf.keras.utils.to_categorical(y_test, 10)
# Visualize sample images
plt.figure(figsize=(12, 6))
for i in range(10):
    plt.subplot(2, 5, i+1)
    plt.imshow(X_train[i].reshape(28, 28), cmap='gray')
    plt.title(f'Label: {y_train[i]}')
    plt.axis('off')
plt.tight_layout()
plt.show()
# Build CNN models with different architectures
def create_simple_cnn():
    model = Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    return model

def create_deeper_cnn():
    model = Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        layers.BatchNormalization(),
        layers.Conv2D(32, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(0.25),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.BatchNormalization(),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    return model
# Compile and train models
cnn_models = {
'Simple CNN': create_simple_cnn(),
'Deeper CNN': create_deeper_cnn()
}
cnn_histories = {}
for name, model in cnn_models.items():
    print(f"\n📊 Training {name}...")
    print("Model Architecture:")
    model.summary()
    # Compile model
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    # Train model
    history = model.fit(
        X_train, y_train_categorical,
        validation_data=(X_test, y_test_categorical),
        epochs=10,
        batch_size=128,
        verbose=0
    )
    cnn_histories[name] = {
        'model': model,
        'history': history
    }
    # Evaluate model
    test_loss, test_accuracy = model.evaluate(X_test, y_test_categorical, verbose=0)
    print(f"• Test Accuracy: {test_accuracy:.4f}")
    print(f"• Test Loss: {test_loss:.4f}")
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Training History - Accuracy
for name, cnn_data in cnn_histories.items():
    history = cnn_data['history']
    axes[0,0].plot(history.history['accuracy'], label=f'{name} - Train')
    axes[0,0].plot(history.history['val_accuracy'], label=f'{name} - Val', linestyle='--')
axes[0,0].set_xlabel('Epochs')
axes[0,0].set_ylabel('Accuracy')
axes[0,0].set_title('Training and Validation Accuracy')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Training History - Loss
for name, cnn_data in cnn_histories.items():
    history = cnn_data['history']
    axes[0,1].plot(history.history['loss'], label=f'{name} - Train')
    axes[0,1].plot(history.history['val_loss'], label=f'{name} - Val', linestyle='--')
axes[0,1].set_xlabel('Epochs')
axes[0,1].set_ylabel('Loss')
axes[0,1].set_title('Training and Validation Loss')
axes[0,1].legend()
axes[0,1].grid(True)
# 3. Model Comparison
models_cnn = list(cnn_histories.keys())
final_accuracies = [cnn_histories[name]['history'].history['val_accuracy'][-1] for name in models_cnn]
bars = axes[0,2].bar(models_cnn, final_accuracies, color=['blue', 'green'])
axes[0,2].set_ylabel('Final Validation Accuracy')
axes[0,2].set_title('CNN Architectures Comparison')
axes[0,2].set_ylim(0.9, 1.0)
for bar, accuracy in zip(bars, final_accuracies):
    axes[0,2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                   f'{accuracy:.3f}', ha='center', va='bottom')
# 4. Feature Maps Visualization
best_cnn_name = max(cnn_histories.keys(), key=lambda x: cnn_histories[x]['history'].history['val_accuracy'][-1])
best_cnn_model = cnn_histories[best_cnn_name]['model']
# Get first convolutional layer
first_conv_layer = best_cnn_model.layers[0]
# Create feature map model
feature_map_model = tf.keras.Model(
inputs=best_cnn_model.inputs,
outputs=first_conv_layer.output
)
# Get feature maps for sample image
sample_image = X_test[0:1]
feature_maps = feature_map_model.predict(sample_image)
# Plot feature maps
axes[1,0].imshow(X_test[0].reshape(28, 28), cmap='gray')
axes[1,0].set_title('Original Image')
axes[1,0].axis('off')
# Plot the first two feature maps in the two remaining bottom-row subplots
for i in range(min(2, feature_maps.shape[-1])):
    ax = axes[1, i + 1]
    ax.imshow(feature_maps[0, :, :, i], cmap='viridis')
    ax.set_title(f'Feature Map {i+1}')
    ax.axis('off')
plt.tight_layout()
plt.show()
# Advanced CNN Applications
print("\n🎯 ADVANCED CNN APPLICATIONS:")
# Data Augmentation
print("\n🔄 Data Augmentation:")
data_augmentation = tf.keras.Sequential([
layers.RandomRotation(0.1),
layers.RandomZoom(0.1),
layers.RandomContrast(0.1),
])
# Visualize augmented images
plt.figure(figsize=(12, 6))
for i in range(5):
    plt.subplot(2, 5, i+1)
    plt.imshow(X_train[i].reshape(28, 28), cmap='gray')
    plt.title('Original')
    plt.axis('off')
    plt.subplot(2, 5, i+6)
    augmented = data_augmentation(tf.expand_dims(X_train[i], 0))
    plt.imshow(augmented[0].numpy().reshape(28, 28), cmap='gray')
    plt.title('Augmented')
    plt.axis('off')
plt.tight_layout()
plt.show()
# Transfer Learning Example (using CIFAR-10)
print("\n🔄 Transfer Learning with CIFAR-10:")
try:
    # Load CIFAR-10 dataset
    (X_train_cifar, y_train_cifar), (X_test_cifar, y_test_cifar) = tf.keras.datasets.cifar10.load_data()
    # Preprocess data
    X_train_cifar = X_train_cifar.astype('float32') / 255.0
    X_test_cifar = X_test_cifar.astype('float32') / 255.0
    y_train_cifar_categorical = tf.keras.utils.to_categorical(y_train_cifar, 10)
    y_test_cifar_categorical = tf.keras.utils.to_categorical(y_test_cifar, 10)
    print(f"CIFAR-10 Dataset: {X_train_cifar.shape[0]} training images, {X_test_cifar.shape[0]} test images")
    # Build transfer learning model
    base_model = tf.keras.applications.MobileNetV2(
        input_shape=(32, 32, 3),
        include_top=False,
        weights='imagenet'
    )
    base_model.trainable = False  # Freeze base model
    transfer_model = Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation='relu'),
        layers.Dropout(0.5),
        layers.Dense(10, activation='softmax')
    ])
    transfer_model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    # The model is only built and compiled here, not trained.
    # The ImageNet weights were learned on larger images, so 32x32 inputs are a poor match;
    # in practice, upsample the images or use a model trained on small inputs.
    print("Note: For proper transfer learning, use models trained on appropriate input sizes")
except Exception as e:
    print(f"Transfer learning demonstration skipped: {e}")
# CNN for Different Filter Sizes
print("\n🔍 CNN Filter Size Analysis:")
filter_sizes = [(2, 2), (3, 3), (5, 5)]
filter_results = {}
for filter_size in filter_sizes:
    model = Sequential([
        layers.Conv2D(32, filter_size, activation='relu', input_shape=(28, 28, 1)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation='relu'),
        layers.Dense(10, activation='softmax')
    ])
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    history = model.fit(
        X_train, y_train_categorical,
        validation_data=(X_test, y_test_categorical),
        epochs=5,
        batch_size=128,
        verbose=0
    )
    final_accuracy = history.history['val_accuracy'][-1]
    filter_results[filter_size] = final_accuracy
    print(f"Filter size {filter_size}: Validation Accuracy = {final_accuracy:.4f}")
# Misclassification Analysis
print("\n🔍 Misclassification Analysis:")
best_model = cnn_histories[best_cnn_name]['model']
y_pred_proba = best_model.predict(X_test)
y_pred = np.argmax(y_pred_proba, axis=1)
# Find misclassified examples
misclassified_idx = np.where(y_pred != y_test)[0]
if len(misclassified_idx) > 0:
    plt.figure(figsize=(12, 6))
    for i in range(min(8, len(misclassified_idx))):
        idx = misclassified_idx[i]
        plt.subplot(2, 4, i+1)
        plt.imshow(X_test[idx].reshape(28, 28), cmap='gray')
        plt.title(f'True: {y_test[idx]}, Pred: {y_pred[idx]}')
        plt.axis('off')
    plt.tight_layout()
    plt.show()
print(f"Misclassification rate: {len(misclassified_idx)/len(y_test):.4f}")
# Pooling Strategies Comparison
print("\n🎯 Pooling Strategies (Max vs Average):")
# Demonstrate different pooling strategies on one sample image
sample_image = X_train[0]
# Max Pooling
max_pooled = tf.keras.layers.MaxPool2D(pool_size=(2, 2))(tf.expand_dims(sample_image, 0))
# Average Pooling
avg_pooled = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))(tf.expand_dims(sample_image, 0))
plt.figure(figsize=(12, 4))
plt.subplot(1, 3, 1)
plt.imshow(sample_image.reshape(28, 28), cmap='gray')
plt.title('Original Image')
plt.axis('off')
plt.subplot(1, 3, 2)
plt.imshow(max_pooled[0, :, :, 0], cmap='gray')
plt.title('Max Pooling')
plt.axis('off')
plt.subplot(1, 3, 3)
plt.imshow(avg_pooled[0, :, :, 0], cmap='gray')
plt.title('Average Pooling')
plt.axis('off')
plt.tight_layout()
plt.show()
print("✅ Convolutional Neural Networks Analysis Complete!")

## CNN Interview Questions & Answers
**Q1: What is the difference between CNN and fully connected networks?**
**Answer:**
- **Parameter sharing:** CNN shares weights across spatial locations, FC has separate weights for each connection
- **Sparse connectivity:** CNN neurons connect only to local regions, FC connects to all inputs
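- **Translation equivariance:** A shifted input produces a correspondingly shifted feature map, so the same filter detects a feature wherever it appears; pooling adds approximate translation invariance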
- **Spatial hierarchy:** CNN learns hierarchical features (edges → patterns → objects)
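
The effect of parameter sharing shows up clearly in a parameter count. The sketch below compares a 3x3 convolution with 32 filters on a 28x28 single-channel image against a fully connected layer producing the same number of outputs; the layer sizes are assumptions chosen for illustration.

```python
def conv_params(kernel, in_channels, out_channels):
    """Weights plus one bias per filter; independent of the image size."""
    return (kernel * kernel * in_channels + 1) * out_channels

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return (n_inputs + 1) * n_outputs

conv_count = conv_params(kernel=3, in_channels=1, out_channels=32)

# A dense layer producing the same 26 x 26 x 32 output from the 784 input pixels
n_outputs = 26 * 26 * 32
dense_count = dense_params(28 * 28, n_outputs)

print(f"Conv2D(32, 3x3) parameters:   {conv_count:,}")
print(f"Equivalent Dense parameters:  {dense_count:,}")
```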
**Q2: Explain the concept of receptive field.**
**Answer:**
- **Receptive field:** Region in input space that affects a particular neuron
- **Local receptive field:** Each neuron connects only to a small region of input
- **Increasing receptive field:** Deeper layers have larger receptive fields through:
  - Larger filter sizes
  - Pooling layers
  - Stacking convolutional layers
- **Importance:** Allows network to combine local features into more global patterns
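
The growth of the receptive field can be computed layer by layer: each layer adds `(kernel - 1)` times the current spacing between neighbouring outputs, and each stride multiplies that spacing. The helper below is a small sketch of this recurrence; the layer lists are assumptions chosen for illustration.

```python
def receptive_field(layer_specs):
    """layer_specs: list of (kernel_size, stride) tuples, ordered from input to output."""
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layer_specs:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# conv 3x3 -> pool 2x2 (stride 2) -> conv 3x3 -> pool 2x2 (stride 2)
small_cnn = [(3, 1), (2, 2), (3, 1), (2, 2)]
# three stacked 3x3 convolutions cover the same area as one 7x7 filter
stacked_convs = [(3, 1), (3, 1), (3, 1)]

print("small CNN receptive field:", receptive_field(small_cnn))
print("three 3x3 convs:          ", receptive_field(stacked_convs))
```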
**Q3: What are the benefits of pooling layers?**
**Answer:**
- **Dimensionality reduction:** Reduces spatial size, which shrinks the inputs (and parameter count) of later layers
- **Translation invariance:** Small translations don't affect output
- **Prevent overfitting:** Reduce model complexity
- **Computational efficiency:** Fewer activations to process downstream; pooling itself has no learnable parameters
- **Feature robustness:** Makes features more invariant to small variations
**Q4: Compare different CNN architectures.**
**Answer:**
- **LeNet-5:** Early CNN for digit recognition, basic conv-pool structure
- **AlexNet:** Deeper network, ReLU activation, dropout
- **VGG:** Very deep with small 3x3 filters, uniform architecture
- **ResNet:** Residual (skip) connections ease gradient flow, which makes very deep networks trainable
- **Inception:** Multiple filter sizes in parallel, efficient computation
**Q5: How do you handle overfitting in CNNs?**
**Answer:**
1. **Data augmentation:** Rotation, scaling, flipping, color changes
2. **Dropout:** Randomly disable neurons during training
3. **Batch normalization:** Normalize layer inputs
4. **Early stopping:** Stop when validation performance plateaus
5. **Weight regularization:** L1/L2 regularization on weights
6. **Reduce model complexity:** Fewer layers/filters
7. **Transfer learning:** Use pre-trained models
# 14. RECURRENT NEURAL NETWORKS (RNN)
## Algorithm Background & Mathematical Foundation
**Core Concept:** RNNs are designed for sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
**Mathematical Formulation:**
**Basic RNN:**
- **Hidden state update:** $h_t = \sigma(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$
- **Output:** $y_t = \sigma(W_{hy}h_t + b_y)$
Where:
- $h_t$: hidden state at time t
- $x_t$: input at time t
- $y_t$: output at time t
- $W$: weight matrices
- $b$: bias vectors
- $\sigma$: activation function
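
The hidden-state update can be sketched as a plain loop over time steps. This minimal NumPy version uses `tanh` as the activation and random weights; the dimensions and weight scales are assumptions chosen for illustration, and the output projection is omitted.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Apply h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) over a sequence."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state h_0
    states = []
    for x_t in xs:                       # the same weights are reused at every step
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, seq_len = 3, 5, 7
xs = rng.normal(size=(seq_len, input_dim))
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

hidden_states = rnn_forward(xs, W_xh, W_hh, b_h)
print("hidden states shape:", hidden_states.shape)
```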
**Vanishing Gradient Problem:**
- Gradients become exponentially small through time steps
- Limits learning of long-range dependencies
**LSTM (Long Short-Term Memory):**
- **Forget gate:** $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
- **Input gate:** $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
- **Output gate:** $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
- **Cell state:** $C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
- **Hidden state:** $h_t = o_t \odot \tanh(C_t)$
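
A single LSTM step following the gate equations above can be sketched in NumPy. The dictionary layout for the weights and biases, the dimensions, and the random initialization are assumptions made for this example; framework implementations fuse the four matrices for speed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W[k] has shape (hidden, hidden + input) for k in 'f', 'i', 'o', 'c'."""
    concat = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f = sigmoid(W['f'] @ concat + b['f'])       # forget gate
    i = sigmoid(W['i'] @ concat + b['i'])       # input gate
    o = sigmoid(W['o'] @ concat + b['o'])       # output gate
    c_tilde = np.tanh(W['c'] @ concat + b['c']) # candidate cell state
    c = f * c_prev + i * c_tilde                # new cell state
    h = o * np.tanh(c)                          # new hidden state
    return h, c

rng = np.random.default_rng(1)
input_dim, hidden_dim = 2, 4
W = {k: rng.normal(scale=0.3, size=(hidden_dim, hidden_dim + input_dim)) for k in 'fioc'}
b = {k: np.zeros(hidden_dim) for k in 'fioc'}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(6, input_dim)):     # run over a short random sequence
    h, c = lstm_step(x_t, h, c, W, b)
print("final hidden state:", h)
```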
**GRU (Gated Recurrent Unit):**
Simplified version with update and reset gates
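
The two gates can be written out in the same notation as the LSTM equations. This is one common formulation, offered as a sketch: references differ on which branch $z_t$ weights, and the weight names $W_z$, $W_r$, $W_h$ are labels chosen here.

- **Update gate:** $z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z)$
- **Reset gate:** $r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r)$
- **Candidate state:** $\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h)$
- **Hidden state:** $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$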

In [None]:
# Cell 16: Recurrent Neural Networks - Comprehensive Implementation
print("🚀 RECURRENT NEURAL NETWORKS: COMPREHENSIVE IMPLEMENTATION\n")
# Create synthetic time series data
def generate_time_series_data(n_samples=1000, seq_length=50):
    """Generate synthetic time series data with multiple patterns"""
    time = np.linspace(0, 100, seq_length)
    data = []
    for i in range(n_samples):
        # Combine multiple sine waves with different frequencies
        signal = (np.sin(0.1 * time + i * 0.01) +
                  0.5 * np.sin(0.3 * time + i * 0.02) +
                  0.3 * np.sin(0.7 * time + i * 0.03) +
                  np.random.normal(0, 0.1, seq_length))
        data.append(signal)
    return np.array(data)
# Generate data
X_ts = generate_time_series_data(1000, 50)
print("📊 Time Series Data Overview:")
print(f"• Samples: {X_ts.shape[0]}")
print(f"• Sequence length: {X_ts.shape[1]}")
print(f"• Data range: [{X_ts.min():.2f}, {X_ts.max():.2f}]")
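# Create sequences for prediction (predict next value from previous 10)
def create_sequences(data, seq_length=10):
    """Slide a window along each series: seq_length inputs -> the value that follows."""
    X, y = [], []
    # Window within each series (rows of `data`), not across different series
    for series in np.atleast_2d(data):
        for i in range(len(series) - seq_length):
            X.append(series[i:i + seq_length])
            y.append(series[i + seq_length])
    return np.array(X), np.array(y)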
seq_length = 10
X_seq, y_seq = create_sequences(X_ts, seq_length)
# Reshape for RNN (samples, time steps, features)
X_seq = X_seq.reshape(X_seq.shape[0], X_seq.shape[1], 1)
y_seq = y_seq.reshape(-1, 1)
print(f"• Sequence data shape: {X_seq.shape}")
print(f"• Target shape: {y_seq.shape}")
# Split data
X_train_seq, X_test_seq, y_train_seq, y_test_seq = model_selection.train_test_split(
X_seq, y_seq, test_size=0.2, random_state=42
)
# Build different RNN architectures
def create_simple_rnn():
    model = Sequential([
        layers.SimpleRNN(50, activation='relu', input_shape=(seq_length, 1)),
        layers.Dense(1)
    ])
    return model

def create_lstm_model():
    model = Sequential([
        layers.LSTM(50, activation='relu', input_shape=(seq_length, 1)),
        layers.Dense(1)
    ])
    return model

def create_gru_model():
    model = Sequential([
        layers.GRU(50, activation='relu', input_shape=(seq_length, 1)),
        layers.Dense(1)
    ])
    return model

def create_deep_rnn():
    model = Sequential([
        layers.LSTM(50, activation='relu', return_sequences=True, input_shape=(seq_length, 1)),
        layers.Dropout(0.2),
        layers.LSTM(25, activation='relu'),
        layers.Dropout(0.2),
        layers.Dense(1)
    ])
    return model
# Compile and train models
rnn_models = {
'Simple RNN': create_simple_rnn(),
'LSTM': create_lstm_model(),
'GRU': create_gru_model(),
'Deep LSTM': create_deep_rnn()
}
rnn_histories = {}
for name, model in rnn_models.items():
    print(f"\n📊 Training {name}...")
    # Compile model
    model.compile(
        optimizer='adam',
        loss='mse',
        metrics=['mae']
    )
    # Train model
    history = model.fit(
        X_train_seq, y_train_seq,
        validation_data=(X_test_seq, y_test_seq),
        epochs=20,
        batch_size=32,
        verbose=0
    )
    rnn_histories[name] = {
        'model': model,
        'history': history
    }
    # Evaluate model
    test_loss, test_mae = model.evaluate(X_test_seq, y_test_seq, verbose=0)
    print(f"• Test MSE: {test_loss:.4f}")
    print(f"• Test MAE: {test_mae:.4f}")
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Training History - Loss
for name, rnn_data in rnn_histories.items():
    history = rnn_data['history']
    axes[0,0].plot(history.history['loss'], label=f'{name} - Train')
    axes[0,0].plot(history.history['val_loss'], label=f'{name} - Val', linestyle='--')
axes[0,0].set_xlabel('Epochs')
axes[0,0].set_ylabel('MSE Loss')
axes[0,0].set_title('Training and Validation Loss')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Model Comparison
models_rnn = list(rnn_histories.keys())
final_losses = [rnn_histories[name]['history'].history['val_loss'][-1] for name in models_rnn]
bars = axes[0,1].bar(models_rnn, final_losses, color=['blue', 'green', 'orange', 'red'])
axes[0,1].set_ylabel('Final Validation MSE')
axes[0,1].set_title('RNN Architectures Comparison')
axes[0,1].tick_params(axis='x', rotation=45)
for bar, loss in zip(bars, final_losses):
    axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                   f'{loss:.3f}', ha='center', va='bottom')
# 3. Prediction Visualization
best_rnn_name = min(rnn_histories.keys(), key=lambda x: rnn_histories[x]['history'].history['val_loss'][-1])
best_rnn_model = rnn_histories[best_rnn_name]['model']
# Make predictions
y_pred_seq = best_rnn_model.predict(X_test_seq)
# Plot predictions vs actual for the first 50 test samples
# (each test target is a single value, so plot across samples rather than one point per line)
n_show = 50
axes[0,2].plot(y_test_seq[:n_show].flatten(), 'b-', alpha=0.7, label='Actual')
axes[0,2].plot(y_pred_seq[:n_show].flatten(), 'r--', alpha=0.7, label='Predicted')
axes[0,2].set_xlabel('Test Sample Index')
axes[0,2].set_ylabel('Value')
axes[0,2].set_title(f'Predictions vs Actual ({best_rnn_name})')
axes[0,2].legend()
axes[0,2].grid(True)
# 4. Multi-step Prediction
def multi_step_prediction(model, initial_sequence, steps=20):
    """Generate multi-step predictions"""
    current_sequence = initial_sequence.copy()
    predictions = []
    for _ in range(steps):
        # Predict next value
        next_pred = model.predict(current_sequence.reshape(1, seq_length, 1), verbose=0)[0, 0]
        predictions.append(next_pred)
        # Update sequence (remove first, add prediction)
        current_sequence = np.roll(current_sequence, -1)
        current_sequence[-1] = next_pred
    return np.array(predictions)
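# Test multi-step prediction on one full series, so the "true future" really follows the input window
# (rows of X_test_seq are shuffled windows, so consecutive test targets are not a continuation;
# windows from this series may also appear in the training split, so this is only a demonstration)
demo_series = X_ts[0]
initial_seq = demo_series[:seq_length]
true_future = demo_series[seq_length:seq_length + 20]
pred_future = multi_step_prediction(best_rnn_model, initial_seq, steps=20)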
axes[1,0].plot(range(len(initial_seq)), initial_seq.flatten(), 'g-', label='Input Sequence')
axes[1,0].plot(range(len(initial_seq), len(initial_seq) + len(true_future)), true_future, 'b-', label='True Future')
axes[1,0].plot(range(len(initial_seq), len(initial_seq) + len(pred_future)), pred_future, 'r--', label='Predicted Future')
axes[1,0].set_xlabel('Time Step')
axes[1,0].set_ylabel('Value')
axes[1,0].set_title('Multi-step Prediction')
axes[1,0].legend()
axes[1,0].grid(True)
# 5. Sequence Length Analysis
sequence_lengths = [5, 10, 15, 20]
seq_length_results = {}
for seq_len in sequence_lengths:
    X_temp, y_temp = create_sequences(X_ts, seq_len)
    X_temp = X_temp.reshape(X_temp.shape[0], X_temp.shape[1], 1)
    X_train_temp, X_test_temp, y_train_temp, y_test_temp = model_selection.train_test_split(
        X_temp, y_temp, test_size=0.2, random_state=42
    )
    # Build the LSTM for this window length; create_lstm_model() is tied to the global seq_length
    model_temp = Sequential([
        layers.LSTM(50, activation='relu', input_shape=(seq_len, 1)),
        layers.Dense(1)
    ])
    model_temp.compile(optimizer='adam', loss='mse')
    history_temp = model_temp.fit(
        X_train_temp, y_train_temp,
        validation_data=(X_test_temp, y_test_temp),
        epochs=10,
        batch_size=32,
        verbose=0
    )
    final_loss = history_temp.history['val_loss'][-1]
    seq_length_results[seq_len] = final_loss
axes[1,1].plot(sequence_lengths, [seq_length_results[sl] for sl in sequence_lengths], 'bo-')
axes[1,1].set_xlabel('Sequence Length')
axes[1,1].set_ylabel('Validation MSE')
axes[1,1].set_title('Sequence Length vs Performance')
axes[1,1].grid(True)
# 6. Different Activation Functions
activations = ['relu', 'tanh', 'sigmoid']
activation_results = {}
for activation in activations:
    model_temp = Sequential([
        layers.LSTM(50, activation=activation, input_shape=(seq_length, 1)),
        layers.Dense(1)
    ])
    model_temp.compile(optimizer='adam', loss='mse')
    history_temp = model_temp.fit(
        X_train_seq, y_train_seq,
        validation_data=(X_test_seq, y_test_seq),
        epochs=10,
        batch_size=32,
        verbose=0
    )
    final_loss = history_temp.history['val_loss'][-1]
    activation_results[activation] = final_loss
bars = axes[1,2].bar(activations, [activation_results[act] for act in activations],
color=['blue', 'green', 'orange'])
axes[1,2].set_ylabel('Validation MSE')
axes[1,2].set_title('Activation Functions Comparison')
for bar, loss in zip(bars, [activation_results[act] for act in activations]):
    axes[1,2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
                   f'{loss:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
# Text Generation Example
print("\n📝 RNN for Text Generation:")
# Simple character-level text generation example
text = """Machine learning is a subset of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."""
text = text.lower()
# Create character mapping
chars = sorted(list(set(text)))
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for idx, char in enumerate(chars)}
print(f"• Unique characters: {len(chars)}")
print(f"• Text length: {len(text)}")
# Prepare sequences for training
max_sequence_length = 40
step = 3
sequences = []
next_chars = []
for i in range(0, len(text) - max_sequence_length, step):
    sequences.append(text[i:i + max_sequence_length])
    next_chars.append(text[i + max_sequence_length])
print(f"• Number of sequences: {len(sequences)}")
# Vectorize sequences
X_text = np.zeros((len(sequences), max_sequence_length, len(chars)), dtype=bool)
y_text = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
    for t, char in enumerate(sequence):
        X_text[i, t, char_to_idx[char]] = 1
    y_text[i, char_to_idx[next_chars[i]]] = 1
# Build character-level RNN
text_model = Sequential([
layers.LSTM(128, input_shape=(max_sequence_length, len(chars))),
layers.Dense(len(chars), activation='softmax')
])
text_model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Train for a few epochs
history_text = text_model.fit(
X_text, y_text,
batch_size=128,
epochs=50,
verbose=0
)
print(f"Text model final accuracy: {history_text.history['accuracy'][-1]:.4f}")
# Generate text function
def generate_text(model, seed_text, length=100, temperature=1.0):
    generated = seed_text
    for _ in range(length):
        # Prepare input: one-hot encode the most recent characters, right-aligned in the window
        window = seed_text[-max_sequence_length:]
        offset = max_sequence_length - len(window)
        x = np.zeros((1, max_sequence_length, len(chars)))
        for t, char in enumerate(window):
            if char in char_to_idx:
                x[0, offset + t, char_to_idx[char]] = 1
        # Predict next character
        preds = model.predict(x, verbose=0)[0].astype('float64')
        # Apply temperature (the small constant avoids log(0))
        preds = np.log(preds + 1e-8) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # Sample next character
        next_idx = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_idx]
        generated += next_char
        # Let the context grow until it fills the window, then slide it
        seed_text = (seed_text + next_char)[-max_sequence_length:]
    return generated
# Generate some text
seed = "machine learning is"
generated_text = generate_text(text_model, seed, length=100, temperature=0.5)
print(f"\nGenerated text:")
print(f"Seed: '{seed}'")
print(f"Generated: '{generated_text}'\n")
# Advanced: Bidirectional RNN
print("\n🔄 Bidirectional RNN:")
def create_bidirectional_lstm():
    model = Sequential([
        layers.Bidirectional(layers.LSTM(25, activation='relu'), input_shape=(seq_length, 1)),
        layers.Dense(1)
    ])
    return model
bidirectional_model = create_bidirectional_lstm()
bidirectional_model.compile(optimizer='adam', loss='mse')
history_bi = bidirectional_model.fit(
X_train_seq, y_train_seq,
validation_data=(X_test_seq, y_test_seq),
epochs=20,
batch_size=32,
verbose=0
)
bi_loss = history_bi.history['val_loss'][-1]
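best_loss = rnn_histories[best_rnn_name]['history'].history['val_loss'][-1]
print(f"Bidirectional LSTM Validation MSE: {bi_loss:.4f}")
# The best earlier model is not necessarily an LSTM, so report it by name
print(f"Best unidirectional model ({best_rnn_name}) Validation MSE: {best_loss:.4f}")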
print(f"Improvement: {((best_loss - bi_loss) / best_loss * 100):.1f}%")
# Attention Mechanism Concept
print("\n🎯 Attention Mechanism (Conceptual):")
# Demonstrate the concept of attention weights
def simple_attention(query, keys, values):
"""Simple attention mechanism demonstration"""
# Calculate attention scores
scores = np.dot(keys, query)
# Apply softmax
attention_weights = np.exp(scores) / np.sum(np.exp(scores))
# Weighted sum of values
context_vector = np.dot(attention_weights, values)
return context_vector, attention_weights
# Example usage
query = np.array([0.5, 0.3])
keys = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
values = np.array([[1, 2], [3, 4], [5, 6]])
context, weights = simple_attention(query, keys, values)
print(f"Attention weights: {weights}")
print(f"Context vector: {context}")
print("✅ Recurrent Neural Networks Analysis Complete!")

## RNN Interview Questions & Answers
**Q1: What is the vanishing/exploding gradient problem in RNNs?**
**Answer:**
- **Vanishing gradients:** Gradients become extremely small when backpropagating through many time steps
- **Exploding gradients:** Gradients become extremely large, causing numerical instability
- **Cause:** Repeated multiplication of the same weight matrix through time
- **Impact:** Difficulty learning long-range dependencies
- **Solutions:** LSTM/GRU architectures, gradient clipping, proper initialization
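The cause named above can be checked numerically. This is a minimal sketch, assuming a linear recurrence with no activation and a weight matrix that is a scaled identity, so the effect is exact; `gradient_norm_through_time` and `clip_by_norm` are hypothetical helper names used only for illustration.

```python
import numpy as np

def gradient_norm_through_time(scale, steps=50, hidden=4):
    """Norm of a gradient after backpropagating `steps` time steps
    through the linear recurrence h_t = W h_{t-1}, with W = scale * I."""
    W = scale * np.eye(hidden)
    grad = np.ones(hidden)
    for _ in range(steps):
        grad = W.T @ grad  # one step of backpropagation through time
    return np.linalg.norm(grad)

def clip_by_norm(grad, max_norm=1.0):
    """Gradient clipping: rescale the gradient if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

vanishing = gradient_norm_through_time(0.5)   # weights shrink the gradient
exploding = gradient_norm_through_time(1.5)   # weights amplify the gradient
print(f"scale=0.5 -> gradient norm {vanishing:.3e}")
print(f"scale=1.5 -> gradient norm {exploding:.3e}")
print(f"clipped norm: {np.linalg.norm(clip_by_norm(np.array([3.0, 4.0]))):.1f}")
```

With real RNNs the activation derivative also enters the product, but the same repeated-multiplication effect drives both failure modes.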
**Q2: Compare LSTM and GRU architectures.**
**Answer:**
- **LSTM:** Three gates (input, forget, output), separate cell state
- **GRU:** Two gates (update, reset), merged cell and hidden state
- **Complexity:** LSTM has more parameters, GRU is computationally lighter
- **Performance:** Similar performance in many tasks, GRU often faster to train
- **Use cases:** LSTM for very long sequences, GRU for efficiency
**Q3: What are bidirectional RNNs and when are they useful?**
**Answer:**
- **Bidirectional RNN:** Processes the sequence in both forward and backward directions
- **Architecture:** Two separate hidden layers, one for each direction, whose outputs are combined at each time step
- **Benefits:** Access to both past and future context for each time step
- **Use cases:**
  - Natural language processing (understanding context)
  - Speech recognition
  - Offline time series analysis where the whole sequence is available
- **Limitations:** Needs the complete input sequence, so it is not suited to real-time (streaming) prediction
**Q4: How do you handle variable-length sequences in RNNs?**
**Answer:**
1. **Padding:** Add zeros to make sequences same length
2. **Masking:** Ignore padded positions during computation
3. **Dynamic RNNs:** Handle variable lengths natively (TensorFlow)
4. **Bucketing:** Group sequences by similar lengths
5. **Truncation:** Cut sequences to fixed maximum length
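Padding, masking, and truncation from the list above can be sketched in plain NumPy. The helper names `pad_sequences_np` and `masked_mean` are hypothetical and exist only for this illustration; the masked mean stands in for any computation that should ignore padded positions.

```python
import numpy as np

def pad_sequences_np(sequences, max_len=None, pad_value=0.0):
    """Right-pad (and truncate) 1-D sequences to a common length.
    Returns the padded array and a boolean mask (True = real value)."""
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=float)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        trimmed = np.asarray(seq, dtype=float)[:max_len]  # truncation
        padded[i, :len(trimmed)] = trimmed
        mask[i, :len(trimmed)] = True
    return padded, mask

def masked_mean(padded, mask):
    """Average over real positions only, ignoring the padding."""
    return (padded * mask).sum(axis=1) / mask.sum(axis=1)

sequences = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
padded, mask = pad_sequences_np(sequences)
print(padded)
print(mask)
print(masked_mean(padded, mask))
```

In Keras the same idea is usually expressed with `pad_sequences` plus a `Masking` layer (or `mask_zero=True` on an `Embedding` layer), so the recurrent layer skips the padded steps.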
**Q5: What is teacher forcing in RNN training?**
**Answer:**
- **Teacher forcing:** Use actual previous output instead of predicted output during training
- **Benefits:** Faster convergence, more stable training
- **Drawbacks:** Discrepancy between training and inference
- **Scheduled sampling:** Gradually reduce teacher forcing during training
- **Use cases:** Sequence generation, machine translation
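The choice between ground truth and model output, and its gradual decay under scheduled sampling, can be sketched as follows. This is an illustration under stated assumptions: `decoder_inputs` and `linear_schedule` are hypothetical helpers, the token values are made up, and a linear decay is only one possible schedule.

```python
import numpy as np

def decoder_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """Pick each decoder input: the ground-truth token with probability
    teacher_forcing_ratio, otherwise the model's own previous prediction."""
    use_truth = rng.random(len(targets)) < teacher_forcing_ratio
    return np.where(use_truth, targets, predictions), use_truth

def linear_schedule(epoch, n_epochs, start=1.0, end=0.0):
    """Scheduled sampling: decay the teacher-forcing ratio linearly."""
    frac = epoch / max(n_epochs - 1, 1)
    return start + (end - start) * frac

rng = np.random.default_rng(0)
targets = np.array([1, 2, 3, 4, 5])       # previous ground-truth tokens
predictions = np.array([1, 9, 3, 9, 9])   # previous model outputs (made up)

for epoch in range(0, 10, 3):
    ratio = linear_schedule(epoch, n_epochs=10)
    inputs, _ = decoder_inputs(targets, predictions, ratio, rng)
    print(f"epoch {epoch}: ratio={ratio:.2f}, inputs={inputs}")
```

At ratio 1.0 the decoder only ever sees ground truth (pure teacher forcing); at ratio 0.0 it sees only its own outputs, which matches inference.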
# 15. TRANSFORMERS (BERT, GPT)
## Algorithm Background & Mathematical Foundation
**Core Concept:** Transformers use self-attention mechanisms to process sequences in parallel, capturing global dependencies without recurrence.
**Key Components:**
**Self-Attention Mechanism:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$: Query matrix (what I'm looking for)
- $K$: Key matrix (what I can offer)
- $V$: Value matrix (what I actually contain)
- $d_k$: Dimension of key vectors
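The attention formula above maps directly onto a few lines of NumPy. This is a minimal sketch for a single sequence with 2-D $Q$, $K$, $V$; the function names are illustrative, and the mask convention (True means "may attend") is an assumption of this example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D Q, K, V.
    mask: optional boolean array, True where attention is allowed."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = attention(Q, K, V)
print("row sums:", w.sum(axis=-1))  # each query's weights sum to 1

# Causal mask: position i may only attend to positions <= i
causal = np.tril(np.ones((4, 4), dtype=bool))
out_c, w_c = attention(Q, K, V, mask=causal)
print(np.round(w_c, 3))
```

The division by $\sqrt{d_k}$ keeps the scores from growing with the key dimension, which would otherwise push the softmax toward one-hot weights.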
**Multi-Head Attention:**
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
$$\text{where head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
**Positional Encoding:**
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
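The two positional-encoding formulas can be computed directly. This is a small sketch that assumes an even `d_model`; `sinusoidal_positional_encoding` is an illustrative name.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). Assumes even d_model."""
    pos = np.arange(n_positions)[:, np.newaxis]         # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[np.newaxis, :]     # 2i for each sin/cos pair
    angles = pos / np.power(10000.0, two_i / d_model)   # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(n_positions=50, d_model=8)
print(pe.shape)
print(np.round(pe[:3], 3))
```

Low dimensions oscillate quickly with position and high dimensions slowly, so every position gets a distinct pattern across the vector.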
**Transformer Architecture:**
- Encoder layer: Multi-head self-attention → Add & Norm → Feed forward → Add & Norm
- Decoder layer: Masked multi-head self-attention → Add & Norm → Encoder-decoder attention → Add & Norm → Feed forward → Add & Norm

In [None]:
# Cell 17: Transformers - Comprehensive Implementation
print("🚀 TRANSFORMERS: COMPREHENSIVE IMPLEMENTATION\n")
# We'll use a simplified implementation to demonstrate transformer concepts
import math
class SimpleTransformer:
    """Simplified transformer implementation for educational purposes"""

    def __init__(self, vocab_size, d_model=64, n_heads=4, ff_dim=128, max_len=100):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.n_heads = n_heads
        self.ff_dim = ff_dim
        self.max_len = max_len

    def positional_encoding(self, position, d_model):
        """Generate positional encoding of shape (n_positions, d_model)"""
        # Accept positions as a 1-D array or as a column vector
        position = np.asarray(position, dtype=np.float64).reshape(-1, 1)
        angle_rates = 1 / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
        angle_rads = position * angle_rates
        # Apply sin to even indices, cos to odd indices
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        return angle_rads

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Calculate scaled dot-product attention"""
        d_k = K.shape[-1]
        # Swap the last two axes so this works for 3-D and 4-D (multi-head) inputs
        scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)
        if mask is not None:
            scores += (mask * -1e9)  # mask == 1 marks positions to hide
        attention_weights = self.softmax(scores)
        output = np.matmul(attention_weights, V)
        return output, attention_weights

    def softmax(self, x):
        """Softmax implementation"""
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def multi_head_attention(self, x):
        """Simplified multi-head attention (no learned Q/K/V or output projections)"""
        batch_size, seq_len, d_model = x.shape
        # Split into multiple heads
        x_reshaped = x.reshape(batch_size, seq_len, self.n_heads, d_model // self.n_heads)
        x_reshaped = x_reshaped.transpose(0, 2, 1, 3)  # (batch, heads, seq_len, depth)
        # Self-attention (using same input for Q, K, V)
        attention_output, attention_weights = self.scaled_dot_product_attention(
            x_reshaped, x_reshaped, x_reshaped
        )
        # Concatenate heads
        attention_output = attention_output.transpose(0, 2, 1, 3)
        attention_output = attention_output.reshape(batch_size, seq_len, d_model)
        return attention_output, attention_weights
# Demonstrate transformer concepts
print("🔍 TRANSFORMER CONCEPTS DEMONSTRATION:")
# Create sample data
batch_size = 2
seq_length = 5
d_model = 64
# Sample input (random embeddings)
sample_input = np.random.randn(batch_size, seq_length, d_model)
print(f"Sample input shape: {sample_input.shape}")
# Initialize transformer
transformer = SimpleTransformer(vocab_size=1000, d_model=d_model)
# Positional encoding
positions = np.arange(seq_length)  # 1-D positions give an encoding of shape (seq_length, d_model)
pos_encoding = transformer.positional_encoding(positions, d_model)
print(f"Positional encoding shape: {pos_encoding.shape}")
# Add positional encoding to input
input_with_pos = sample_input + pos_encoding[np.newaxis, :, :]
# Multi-head attention
attention_output, attention_weights = transformer.multi_head_attention(input_with_pos)
print(f"Attention output shape: {attention_output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
# Visualize attention weights
plt.figure(figsize=(10, 8))
plt.imshow(attention_weights[0, 0], cmap='viridis', aspect='auto')
plt.colorbar()
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Self-Attention Weights (First Head, First Batch)')
plt.show()
# Using Hugging Face Transformers for Real Applications
print("\n🤗 HUGGING FACE TRANSFORMERS IMPLEMENTATION:")
try:
from transformers import AutoTokenizer, AutoModel, pipeline
# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Text classification example
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
sample_texts = [
"I love machine learning!",
"This is terrible.",
"The weather is nice today.",
"I'm feeling neutral about this."
]
results = classifier(sample_texts)
print("\n📊 Sentiment Analysis Results:")
for text, result in zip(sample_texts, results):
print(f"'{text}' -> {result['label']} (confidence: {result['score']:.3f})")
# Text generation example
generator = pipeline("text-generation", model="gpt2", max_length=50)
prompt = "The future of artificial intelligence"
generated = generator(prompt, num_return_sequences=1)
print(f"\n🤖 Text Generation:")
print(f"Prompt: '{prompt}'")
print(f"Generated: '{generated[0]['generated_text']}'")
except ImportError:
print("Hugging Face transformers not available. Using simulated results.")
# Simulate sentiment analysis results
sample_texts = [
"I love machine learning!",
"This is terrible.",
"The weather is nice today.",
"I'm feeling neutral about this."
]
simulated_results = [
{"label": "POSITIVE", "score": 0.98},
{"label": "NEGATIVE", "score": 0.95},
{"label": "POSITIVE", "score": 0.87},
{"label": "NEUTRAL", "score": 0.65}
]
print("\n📊 Simulated Sentiment Analysis Results:")
for text, result in zip(sample_texts, simulated_results):
print(f"'{text}' -> {result['label']} (confidence: {result['score']:.3f})")
# BERT for Text Classification
print("\n🔤 BERT for Text Classification:")
# Create synthetic text classification dataset
texts = [
"The movie was fantastic and I loved every minute of it",
"This product is terrible and does not work as advertised",
"The weather today is beautiful and sunny",
"I feel very disappointed with the service provided",
"This book is amazing and well written",
"The food was awful and overpriced",
"Great customer service and fast delivery",
"Poor quality materials used in this product"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0] # 1: positive, 0: negative
# If transformers and PyTorch are available, demonstrate extracting embeddings
try:
    import torch
    from transformers import AutoTokenizer, AutoModel
    # Tokenize texts (PyTorch tensors, to match the PyTorch AutoModel below)
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encoded_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    print(f"Tokenized inputs shape: {encoded_inputs['input_ids'].shape}")
    print(f"Attention mask shape: {encoded_inputs['attention_mask'].shape}")
    # Demonstrate BERT (DistilBERT) embeddings
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    model.eval()
    with torch.no_grad():
        outputs = model(**encoded_inputs)
    embeddings = outputs.last_hidden_state
    print(f"BERT embeddings shape: {embeddings.shape}")
    # Pooled output (first-token [CLS] representation of each sentence)
    pooled_output = embeddings[:, 0, :]
    print(f"Pooled output shape: {pooled_output.shape}")
except ImportError:
    print("transformers or PyTorch not available for BERT demonstration.")
# Transformer Visualization
print("\n📊 TRANSFORMER ARCHITECTURE VISUALIZATION:")
# Create a visualization of transformer components
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# 1. Positional Encoding Visualization
positions = np.arange(50)
d_model = 64
pos_encoding = transformer.positional_encoding(positions, d_model)
im = axes[0,0].imshow(pos_encoding.T, cmap='viridis', aspect='auto')
axes[0,0].set_xlabel('Position')
axes[0,0].set_ylabel('Dimension')
axes[0,0].set_title('Positional Encoding')
plt.colorbar(im, ax=axes[0,0])
# 2. Attention Pattern Examples
def create_attention_patterns(seq_length):
"""Create different attention patterns"""
patterns = {}
# Causal attention (for GPT)
causal = np.tril(np.ones((seq_length, seq_length)))
patterns['Causal'] = causal
# Full attention (for BERT)
full = np.ones((seq_length, seq_length))
patterns['Full'] = full
# Local attention
local = np.zeros((seq_length, seq_length))
for i in range(seq_length):
start = max(0, i-2)
end = min(seq_length, i+3)
local[i, start:end] = 1
patterns['Local'] = local
return patterns
seq_len = 10
patterns = create_attention_patterns(seq_len)
# One panel per pattern, using the three axes not taken by the positional encoding
pattern_axes = [axes[0,1], axes[1,0], axes[1,1]]
for ax, (name, pattern) in zip(pattern_axes, patterns.items()):
    ax.imshow(pattern, cmap='viridis')
    ax.set_xlabel('Key Position')
    ax.set_ylabel('Query Position')
    ax.set_title(f'Attention Pattern: {name}')
# 3. Multi-Head Attention Concept
def visualize_multihead_attention():
"""Visualize multi-head attention concept"""
seq_len = 6
n_heads = 4
# Create sample attention weights for each head
attention_heads = []
for i in range(n_heads):
# Each head focuses on different patterns
if i == 0:
# Head 0: Diagonal attention
head = np.eye(seq_len)
elif i == 1:
# Head 1: Global attention
head = np.ones((seq_len, seq_len)) / seq_len
elif i == 2:
# Head 2: Local attention
head = np.zeros((seq_len, seq_len))
for j in range(seq_len):
start = max(0, j-1)
end = min(seq_len, j+2)
head[j, start:end] = 1.0 / (end - start)
else:
# Head 3: Random pattern
head = np.random.rand(seq_len, seq_len)
head = head / head.sum(axis=1, keepdims=True)
attention_heads.append(head)
return attention_heads
attention_heads = visualize_multihead_attention()
# Plot each attention head in its own figure (the 2x2 grid above has no room for four more panels)
fig_heads, head_axes = plt.subplots(1, len(attention_heads), figsize=(20, 4))
for i, (ax, head) in enumerate(zip(head_axes, attention_heads)):
    im = ax.imshow(head, cmap='viridis', aspect='auto')
    ax.set_xlabel('Key Position')
    ax.set_ylabel('Query Position')
    ax.set_title(f'Attention Head {i+1}')
    plt.colorbar(im, ax=ax)
fig.tight_layout()
fig_heads.tight_layout()
plt.show()
# Transformer Applications
print("\n🎯 TRANSFORMER APPLICATIONS:")
applications = {
"Text Classification": "BERT, RoBERTa, DistilBERT",
"Text Generation": "GPT, GPT-2, GPT-3, GPT-4",
"Machine Translation": "mBART, T5, MarianMT",
"Question Answering": "BERT, RoBERTa, ALBERT",
"Named Entity Recognition": "BERT, Spacy Transformers",
"Text Summarization": "BART, T5, PEGASUS",
"Sentiment Analysis": "DistilBERT, BERT",
"Code Generation": "Codex, CodeGen, StarCoder"
}
print("Common Transformer Applications:")
for app, models in applications.items():
print(f"• {app}: {models}")
# Fine-tuning Transformers
print("\n🔧 FINE-TUNING TRANSFORMERS:")
fine_tuning_steps = [
"1. Choose pre-trained model (BERT, GPT, etc.)",
"2. Prepare domain-specific dataset",
"3. Add task-specific head (classification, generation)",
"4. Freeze base layers or use gradual unfreezing",
"5. Train with lower learning rate",
"6. Evaluate on validation set",
"7. Deploy fine-tuned model"
]
print("Fine-tuning Steps:")
for step in fine_tuning_steps:
print(step)
# Transformer Limitations and Solutions
print("\n⚠️ TRANSFORMER LIMITATIONS:")
limitations = {
"Computational Complexity": "O(n²) for sequence length",
"Memory Usage": "High for long sequences",
"Training Time": "Requires extensive pre-training",
"Data Requirements": "Needs large datasets",
"Interpretability": "Black-box nature"
}
solutions = {
"Computational Complexity": "Sparse attention, Linformer, Performer",
"Memory Usage": "Gradient checkpointing, model parallelism",
"Training Time": "Distributed training, mixed precision",
"Data Requirements": "Transfer learning, data augmentation",
"Interpretability": "Attention visualization, probing"
}
print("Limitations and Solutions:")
for limitation, solution in zip(limitations.items(), solutions.items()):
    print(f"• {limitation[0]}: {limitation[1]} → {solution[1]}")

# 14. RECURRENT NEURAL NETWORKS (RNN)
## Algorithm Background & Mathematical Foundation
**Core Concept:** RNNs are designed for sequential data by maintaining a hidden state that captures information about previous elements in the sequence.
**Mathematical Formulation:**
**Basic RNN:**
- **Hidden state update:** $h_t = \sigma(W_{hh}h_{t-1} + W_{xh}x_t + b_h)$
- **Output:** $y_t = \sigma(W_{hy}h_t + b_y)$
Where:
- $h_t$: hidden state at time t
- $x_t$: input at time t
- $y_t$: output at time t
- $W$: weight matrices
- $b$: bias vectors
- $\sigma$: activation function
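The hidden-state and output equations above translate into a short loop over time steps. This is a minimal sketch with randomly initialised weights; it assumes `tanh` for the hidden activation and an identity output activation, and `rnn_forward` is an illustrative name.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Run a basic RNN over x_seq of shape (T, input_dim).
    Assumes tanh for the hidden state and an identity output activation."""
    h = np.zeros(W_hh.shape[0])  # h_0 = 0
    hidden_states, outputs = [], []
    for x_t in x_seq:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)  # hidden state update
        hidden_states.append(h)
        outputs.append(W_hy @ h + b_y)            # output at time t
    return np.array(hidden_states), np.array(outputs)

rng = np.random.default_rng(42)
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 7
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_h = np.zeros(hidden_dim)
b_y = np.zeros(output_dim)
x_seq = rng.normal(size=(T, input_dim))

H, Y = rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y)
print(H.shape, Y.shape)  # (7, 5) (7, 2)
```

The same weight matrices are reused at every step, which is what lets the network handle sequences of any length.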
**Vanishing Gradient Problem:**
- Gradients become exponentially small through time steps
- Limits learning of long-range dependencies
**LSTM (Long Short-Term Memory):**
- **Forget gate:** $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
- **Input gate:** $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
- **Output gate:** $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
- **Cell state:** $C_t = f_t \odot C_{t-1} + i_t \odot \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
- **Hidden state:** $h_t = o_t \odot \tanh(C_t)$
**GRU (Gated Recurrent Unit):**
Simplified version with update and reset gates
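A single GRU step can be sketched in the same $[h_{t-1}, x_t]$ notation used for the LSTM above. This assumes one common formulation, in which the update gate $z_t$ weights the candidate state ($h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$); some references swap the roles of $z_t$ and $1 - z_t$. The parameter names in `params` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step.
    z: update gate, r: reset gate, h_tilde: candidate hidden state."""
    concat = np.concatenate([h_prev, x_t])
    z = sigmoid(params["W_z"] @ concat + params["b_z"])
    r = sigmoid(params["W_r"] @ concat + params["b_r"])
    concat_reset = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
params = {
    name: rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    for name in ("W_z", "W_r", "W_h")
}
params.update({name: np.zeros(hidden_dim) for name in ("b_z", "b_r", "b_h")})

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)
print(h)
```

Unlike the LSTM, there is no separate cell state: the hidden state itself carries the memory, and the update gate decides how much of it to overwrite.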

In [None]:
# Cell 16: Recurrent Neural Networks - Comprehensive Implementation
print("🚀 RECURRENT NEURAL NETWORKS: COMPREHENSIVE IMPLEMENTATION\n")
# Create synthetic time series data
def generate_time_series_data(n_samples=1000, seq_length=50):
"""Generate synthetic time series data with multiple patterns"""
time = np.linspace(0, 100, seq_length)
data = []
for i in range(n_samples):
# Combine multiple sine waves with different frequencies
signal = (np.sin(0.1 * time + i * 0.01) +
0.5 * np.sin(0.3 * time + i * 0.02) +
0.3 * np.sin(0.7 * time + i * 0.03) +
np.random.normal(0, 0.1, seq_length))
data.append(signal)
return np.array(data)
# Generate data
X_ts = generate_time_series_data(1000, 50)
print("📊 Time Series Data Overview:")
print(f"• Samples: {X_ts.shape[0]}")
print(f"• Sequence length: {X_ts.shape[1]}")
print(f"• Data range: [{X_ts.min():.2f}, {X_ts.max():.2f}]")
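# Create sequences for prediction (predict next value from previous 10)
def create_sequences(data, seq_length=10):
    """Slide a window along each series: seq_length inputs -> the next value"""
    X, y = [], []
    # X_ts holds one series per row, so window inside each row (not across rows)
    for series in np.atleast_2d(data):
        for i in range(len(series) - seq_length):
            X.append(series[i:(i + seq_length)])
            y.append(series[i + seq_length])
    return np.array(X), np.array(y)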
seq_length = 10
X_seq, y_seq = create_sequences(X_ts, seq_length)
# Reshape for RNN (samples, time steps, features)
X_seq = X_seq.reshape(X_seq.shape[0], X_seq.shape[1], 1)
y_seq = y_seq.reshape(-1, 1)
print(f"• Sequence data shape: {X_seq.shape}")
print(f"• Target shape: {y_seq.shape}")
# Split data
X_train_seq, X_test_seq, y_train_seq, y_test_seq = model_selection.train_test_split(
X_seq, y_seq, test_size=0.2, random_state=42
)
# Build different RNN architectures
def create_simple_rnn():
model = Sequential([
layers.SimpleRNN(50, activation='relu', input_shape=(seq_length, 1)),
layers.Dense(1)
])
return model
def create_lstm_model():
model = Sequential([
layers.LSTM(50, activation='relu', input_shape=(seq_length, 1)),
layers.Dense(1)
])
return model
def create_gru_model():
model = Sequential([
layers.GRU(50, activation='relu', input_shape=(seq_length, 1)),
layers.Dense(1)
])
return model
def create_deep_rnn():
model = Sequential([
layers.LSTM(50, activation='relu', return_sequences=True, input_shape=(seq_length, 1)),
layers.Dropout(0.2),
layers.LSTM(25, activation='relu'),
layers.Dropout(0.2),
layers.Dense(1)
])
return model
# Compile and train models
rnn_models = {
'Simple RNN': create_simple_rnn(),
'LSTM': create_lstm_model(),
'GRU': create_gru_model(),
'Deep LSTM': create_deep_rnn()
}
rnn_histories = {}
for name, model in rnn_models.items():
print(f"\n📊 Training {name}...")
# Compile model
model.compile(
optimizer='adam',
loss='mse',
metrics=['mae']
)
# Train model
history = model.fit(
X_train_seq, y_train_seq,
validation_data=(X_test_seq, y_test_seq),
epochs=20,
batch_size=32,
verbose=0
)
rnn_histories[name] = {
'model': model,
'history': history
}
# Evaluate model
test_loss, test_mae = model.evaluate(X_test_seq, y_test_seq, verbose=0)
print(f"• Test MSE: {test_loss:.4f}")
print(f"• Test MAE: {test_mae:.4f}")
# Comprehensive Visualization
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
# 1. Training History - Loss
for name, rnn_data in rnn_histories.items():
history = rnn_data['history']
axes[0,0].plot(history.history['loss'], label=f'{name} - Train')
axes[0,0].plot(history.history['val_loss'], label=f'{name} - Val', linestyle='--')
axes[0,0].set_xlabel('Epochs')
axes[0,0].set_ylabel('MSE Loss')
axes[0,0].set_title('Training and Validation Loss')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Model Comparison
models_rnn = list(rnn_histories.keys())
final_losses = [rnn_histories[name]['history'].history['val_loss'][-1] for name in models_rnn]
bars = axes[0,1].bar(models_rnn, final_losses, color=['blue', 'green', 'orange', 'red'])
axes[0,1].set_ylabel('Final Validation MSE')
axes[0,1].set_title('RNN Architectures Comparison')
axes[0,1].tick_params(axis='x', rotation=45)
for bar, loss in zip(bars, final_losses):
axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
f'{loss:.3f}', ha='center', va='bottom')
# 3. Prediction Visualization
best_rnn_name = min(rnn_histories.keys(), key=lambda x: rnn_histories[x]['history'].history['val_loss'][-1])
best_rnn_model = rnn_histories[best_rnn_name]['model']
# Make predictions
y_pred_seq = best_rnn_model.predict(X_test_seq)
# Plot predictions vs actual for the first 100 test samples
# (each target is a single value, so plot across samples)
n_show = 100
axes[0,2].plot(y_test_seq[:n_show].ravel(), 'b-', alpha=0.7, label='Actual')
axes[0,2].plot(y_pred_seq[:n_show].ravel(), 'r--', alpha=0.7, label='Predicted')
axes[0,2].set_xlabel('Test Sample')
axes[0,2].set_ylabel('Value')
axes[0,2].set_title(f'Predictions vs Actual ({best_rnn_name})')
axes[0,2].legend()
axes[0,2].grid(True)
# 4. Multi-step Prediction
def multi_step_prediction(model, initial_sequence, steps=20):
    """Generate multi-step predictions"""
    current_sequence = initial_sequence.copy()
    predictions = []
    for _ in range(steps):
        # Predict next value
        next_pred = model.predict(current_sequence.reshape(1, seq_length, 1), verbose=0)[0, 0]
        predictions.append(next_pred)
        # Update sequence (remove first, add prediction)
        current_sequence = np.roll(current_sequence, -1)
        current_sequence[-1] = next_pred
    return np.array(predictions)
# Test multi-step prediction on one full series, so the "true future"
# really is the continuation of the input window (the test split is shuffled)
series = X_ts[0]
initial_seq = series[:seq_length]
true_future = series[seq_length:seq_length + 20]
pred_future = multi_step_prediction(best_rnn_model, initial_seq, steps=20)
axes[1,0].plot(range(len(initial_seq)), initial_seq.flatten(), 'g-', label='Input Sequence')
axes[1,0].plot(range(len(initial_seq), len(initial_seq) + len(true_future)), true_future, 'b-', label='True Future')
axes[1,0].plot(range(len(initial_seq), len(initial_seq) + len(pred_future)), pred_future, 'r--', label='Predicted Future')
axes[1,0].set_xlabel('Time Step')
axes[1,0].set_ylabel('Value')
axes[1,0].set_title('Multi-step Prediction')
axes[1,0].legend()
axes[1,0].grid(True)
# 5. Sequence Length Analysis
sequence_lengths = [5, 10, 15, 20]
seq_length_results = {}
for seq_len in sequence_lengths:
    X_temp, y_temp = create_sequences(X_ts, seq_len)
    X_temp = X_temp.reshape(X_temp.shape[0], X_temp.shape[1], 1)
    X_train_temp, X_test_temp, y_train_temp, y_test_temp = model_selection.train_test_split(
        X_temp, y_temp, test_size=0.2, random_state=42
    )
    # Build the LSTM with the matching window length
    # (create_lstm_model() is fixed to the global seq_length)
    model_temp = Sequential([
        layers.LSTM(50, activation='relu', input_shape=(seq_len, 1)),
        layers.Dense(1)
    ])
    model_temp.compile(optimizer='adam', loss='mse')
    history_temp = model_temp.fit(
        X_train_temp, y_train_temp,
        validation_data=(X_test_temp, y_test_temp),
        epochs=10,
        batch_size=32,
        verbose=0
    )
    final_loss = history_temp.history['val_loss'][-1]
    seq_length_results[seq_len] = final_loss
axes[1,1].plot(sequence_lengths, [seq_length_results[sl] for sl in sequence_lengths], 'bo-')
axes[1,1].set_xlabel('Sequence Length')
axes[1,1].set_ylabel('Validation MSE')
axes[1,1].set_title('Sequence Length vs Performance')
axes[1,1].grid(True)
# 6. Different Activation Functions
activations = ['relu', 'tanh', 'sigmoid']
activation_results = {}
for activation in activations:
model_temp = Sequential([
layers.LSTM(50, activation=activation, input_shape=(seq_length, 1)),
layers.Dense(1)
])
model_temp.compile(optimizer='adam', loss='mse')
history_temp = model_temp.fit(
X_train_seq, y_train_seq,
validation_data=(X_test_seq, y_test_seq),
epochs=10,
batch_size=32,
verbose=0
)
final_loss = history_temp.history['val_loss'][-1]
activation_results[activation] = final_loss
bars = axes[1,2].bar(activations, [activation_results[act] for act in activations],
color=['blue', 'green', 'orange'])
axes[1,2].set_ylabel('Validation MSE')
axes[1,2].set_title('Activation Functions Comparison')
for bar, loss in zip(bars, [activation_results[act] for act in activations]):
axes[1,2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
f'{loss:.3f}', ha='center', va='bottom')
plt.tight_layout()
plt.show()
# Text Generation Example
print("\n📝 RNN for Text Generation:")
# Simple character-level text generation example
text = """Machine learning is a subset of artificial intelligence that provides systems the ability to automatically learn and improve from experience without being explicitly programmed."""
text = text.lower()
# Create character mapping
chars = sorted(list(set(text)))
char_to_idx = {char: idx for idx, char in enumerate(chars)}
idx_to_char = {idx: char for idx, char in enumerate(chars)}
print(f"• Unique characters: {len(chars)}")
print(f"• Text length: {len(text)}")
# Prepare sequences for training
max_sequence_length = 40
step = 3
sequences = []
next_chars = []
for i in range(0, len(text) - max_sequence_length, step):
sequences.append(text[i:i + max_sequence_length])
next_chars.append(text[i + max_sequence_length])
print(f"• Number of sequences: {len(sequences)}")
# Vectorize sequences
X_text = np.zeros((len(sequences), max_sequence_length, len(chars)), dtype=bool)
y_text = np.zeros((len(sequences), len(chars)), dtype=bool)
for i, sequence in enumerate(sequences):
for t, char in enumerate(sequence):
X_text[i, t, char_to_idx[char]] = 1
y_text[i, char_to_idx[next_chars[i]]] = 1
# Build character-level RNN
text_model = Sequential([
layers.LSTM(128, input_shape=(max_sequence_length, len(chars))),
layers.Dense(len(chars), activation='softmax')
])
text_model.compile(
optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Train for a few epochs
history_text = text_model.fit(
X_text, y_text,
batch_size=128,
epochs=50,
verbose=0
)
print(f"Text model final accuracy: {history_text.history['accuracy'][-1]:.4f}")
# Generate text function
def generate_text(model, seed_text, length=100, temperature=1.0):
    generated = seed_text
    # Left-pad (or truncate) the seed to the training window length;
    # '\0' is not in the vocabulary, so padded positions stay all-zero
    window = seed_text[-max_sequence_length:].rjust(max_sequence_length, '\0')
    for _ in range(length):
        # Prepare one-hot input
        x = np.zeros((1, max_sequence_length, len(chars)))
        for t, char in enumerate(window):
            if char in char_to_idx:
                x[0, t, char_to_idx[char]] = 1
        # Predict next character (float64 so the probabilities sum to 1 for np.random.choice)
        preds = model.predict(x, verbose=0)[0].astype('float64')
        # Apply temperature (the epsilon avoids log(0))
        preds = np.log(preds + 1e-8) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # Sample next character
        next_idx = np.random.choice(len(chars), p=preds)
        next_char = idx_to_char[next_idx]
        generated += next_char
        window = window[1:] + next_char
    return generated
# Generate some text
seed = "machine learning is"
generated_text = generate_text(text_model, seed, length=100, temperature=0.5)
print(f"\nGenerated text:")
print(f"Seed: '{seed}'")
print(f"Generated: '{generated_text}'\n")
# Advanced: Bidirectional RNN
print("\n🔄 Bidirectional RNN:")
def create_bidirectional_lstm():
model = Sequential([
layers.Bidirectional(layers.LSTM(25, activation='relu'), input_shape=(seq_length, 1)),
layers.Dense(1)
])
return model
bidirectional_model = create_bidirectional_lstm()
bidirectional_model.compile(optimizer='adam', loss='mse')
history_bi = bidirectional_model.fit(
X_train_seq, y_train_seq,
validation_data=(X_test_seq, y_test_seq),
epochs=20,
batch_size=32,
verbose=0
)
bi_loss = history_bi.history['val_loss'][-1]
best_loss = rnn_histories[best_rnn_name]['history'].history['val_loss'][-1]
print(f"Bidirectional LSTM Validation MSE: {bi_loss:.4f}")
print(f"Best unidirectional model ({best_rnn_name}) Validation MSE: {best_loss:.4f}")
print(f"Improvement: {((best_loss - bi_loss) / best_loss * 100):.1f}%")
# Attention Mechanism Concept
print("\n🎯 Attention Mechanism (Conceptual):")
# Demonstrate the concept of attention weights
def simple_attention(query, keys, values):
"""Simple attention mechanism demonstration"""
# Calculate attention scores
scores = np.dot(keys, query)
# Apply softmax
attention_weights = np.exp(scores) / np.sum(np.exp(scores))
# Weighted sum of values
context_vector = np.dot(attention_weights, values)
return context_vector, attention_weights
# Example usage
query = np.array([0.5, 0.3])
keys = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
values = np.array([[1, 2], [3, 4], [5, 6]])
context, weights = simple_attention(query, keys, values)
print(f"Attention weights: {weights}")
print(f"Context vector: {context}")
print("✅ Recurrent Neural Networks Analysis Complete!")

## RNN Interview Questions & Answers
**Q1: What is the vanishing/exploding gradient problem in RNNs?**
**Answer:**
- **Vanishing gradients:** Gradients become extremely small when backpropagating through many time steps
- **Exploding gradients:** Gradients become extremely large, causing numerical instability
- **Cause:** Repeated multiplication of the same weight matrix through time
- **Impact:** Difficulty learning long-range dependencies
- **Solutions:** LSTM/GRU architectures, gradient clipping, proper initialization
**Q2: Compare LSTM and GRU architectures.**
**Answer:**
- **LSTM:** Three gates (input, forget, output), separate cell state
- **GRU:** Two gates (update, reset), merged cell and hidden state
- **Complexity:** LSTM has more parameters, GRU is computationally lighter
- **Performance:** Similar performance in many tasks, GRU often faster to train
- **Use cases:** LSTM for very long sequences, GRU for efficiency
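The parameter difference can be made concrete by counting weights. This sketch assumes the textbook formulations (four weight blocks for the LSTM, three for the GRU, one bias vector per block); the function names are illustrative.

```python
def lstm_param_count(input_dim, hidden_dim):
    """Four blocks (forget, input, output gates + candidate cell),
    each with a (hidden x (hidden + input)) weight matrix and a bias."""
    return 4 * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)

def gru_param_count(input_dim, hidden_dim):
    """Three blocks (update, reset gates + candidate state)."""
    return 3 * (hidden_dim * (hidden_dim + input_dim) + hidden_dim)

input_dim, hidden_dim = 1, 50   # the sizes used by the models in this notebook
lstm_params = lstm_param_count(input_dim, hidden_dim)
gru_params = gru_param_count(input_dim, hidden_dim)
print(f"LSTM parameters: {lstm_params}")                 # 10400
print(f"GRU parameters:  {gru_params}")                  # 7800
print(f"GRU / LSTM ratio: {gru_params / lstm_params}")   # 0.75
```

Framework implementations may differ slightly: Keras's default GRU variant keeps an extra bias vector per block, so `model.summary()` may report a somewhat larger count than this formula.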
**Q3: What are bidirectional RNNs and when are they useful?**
**Answer:**
- **Bidirectional RNN:** Processes the sequence in both forward and backward directions
- **Architecture:** Two separate hidden layers, one for each direction, whose outputs are combined at each time step
- **Benefits:** Access to both past and future context for each time step
- **Use cases:**
  - Natural language processing (understanding context)
  - Speech recognition
  - Offline time series analysis where the whole sequence is available
- **Limitations:** Needs the complete input sequence, so it is not suited to real-time (streaming) prediction
**Q4: How do you handle variable-length sequences in RNNs?**
**Answer:**
1. **Padding:** Add zeros to make sequences same length
2. **Masking:** Ignore padded positions during computation
3. **Dynamic RNNs:** Handle variable lengths natively (TensorFlow)
4. **Bucketing:** Group sequences by similar lengths
5. **Truncation:** Cut sequences to fixed maximum length
**Q5: What is teacher forcing in RNN training?**
**Answer:**
- **Teacher forcing:** Use actual previous output instead of predicted output during training
- **Benefits:** Faster convergence, more stable training
- **Drawbacks:** Discrepancy between training and inference
- **Scheduled sampling:** Gradually reduce teacher forcing during training
- **Use cases:** Sequence generation, machine translation
# 15. TRANSFORMERS (BERT, GPT)
## Algorithm Background & Mathematical Foundation
**Core Concept:** Transformers use self-attention mechanisms to process sequences in parallel, capturing global dependencies without recurrence.
**Key Components:**
**Self-Attention Mechanism:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where:
- $Q$: Query matrix (what I'm looking for)
- $K$: Key matrix (what I can offer)
- $V$: Value matrix (what I actually contain)
- $d_k$: Dimension of key vectors
**Multi-Head Attention:**
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
$$\text{where head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
**Positional Encoding:**
$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
**Transformer Architecture:**
- Encoder layer: Multi-head self-attention → Add & Norm → Feed forward → Add & Norm
- Decoder layer: Masked multi-head self-attention → Add & Norm → Encoder-decoder attention → Add & Norm → Feed forward → Add & Norm

In [None]:
# Cell 17: Transformers - Comprehensive Implementation
print("🚀 TRANSFORMERS: COMPREHENSIVE IMPLEMENTATION\n")
# We'll use a simplified implementation to demonstrate transformer concepts
import math
class SimpleTransformer:
    """Simplified transformer implementation for educational purposes"""

    def __init__(self, vocab_size, d_model=64, n_heads=4, ff_dim=128, max_len=100):
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.n_heads = n_heads
        self.ff_dim = ff_dim
        self.max_len = max_len

    def positional_encoding(self, position, d_model):
        """Generate positional encoding of shape (n_positions, d_model)"""
        # Accept positions as a 1-D array or as a column vector
        position = np.asarray(position, dtype=np.float64).reshape(-1, 1)
        angle_rates = 1 / np.power(10000, (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
        angle_rads = position * angle_rates
        # Apply sin to even indices, cos to odd indices
        angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
        angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
        return angle_rads

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        """Calculate scaled dot-product attention"""
        d_k = K.shape[-1]
        # Swap the last two axes so this works for 3-D and 4-D (multi-head) inputs
        scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)
        if mask is not None:
            scores += (mask * -1e9)  # mask == 1 marks positions to hide
        attention_weights = self.softmax(scores)
        output = np.matmul(attention_weights, V)
        return output, attention_weights

    def softmax(self, x):
        """Softmax implementation"""
        exp_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=-1, keepdims=True)

    def multi_head_attention(self, x):
        """Simplified multi-head attention (no learned Q/K/V or output projections)"""
        batch_size, seq_len, d_model = x.shape
        # Split into multiple heads
        x_reshaped = x.reshape(batch_size, seq_len, self.n_heads, d_model // self.n_heads)
        x_reshaped = x_reshaped.transpose(0, 2, 1, 3)  # (batch, heads, seq_len, depth)
        # Self-attention (using same input for Q, K, V)
        attention_output, attention_weights = self.scaled_dot_product_attention(
            x_reshaped, x_reshaped, x_reshaped
        )
        # Concatenate heads
        attention_output = attention_output.transpose(0, 2, 1, 3)
        attention_output = attention_output.reshape(batch_size, seq_len, d_model)
        return attention_output, attention_weights
# Demonstrate transformer concepts
print("🔍 TRANSFORMER CONCEPTS DEMONSTRATION:")
# Create sample data
batch_size = 2
seq_length = 5
d_model = 64
# Sample input (random embeddings)
sample_input = np.random.randn(batch_size, seq_length, d_model)
print(f"Sample input shape: {sample_input.shape}")
# Initialize transformer
transformer = SimpleTransformer(vocab_size=1000, d_model=d_model)
# Positional encoding
positions = np.arange(seq_length)[:, np.newaxis]
pos_encoding = transformer.positional_encoding(positions, d_model)
print(f"Positional encoding shape: {pos_encoding.shape}")
# Add positional encoding to input
input_with_pos = sample_input + pos_encoding[np.newaxis, :, :]
# Multi-head attention
attention_output, attention_weights = transformer.multi_head_attention(input_with_pos)
print(f"Attention output shape: {attention_output.shape}")
print(f"Attention weights shape: {attention_weights.shape}")
# Visualize attention weights
plt.figure(figsize=(10, 8))
plt.imshow(attention_weights[0, 0], cmap='viridis', aspect='auto')
plt.colorbar()
plt.xlabel('Key Position')
plt.ylabel('Query Position')
plt.title('Self-Attention Weights (First Head, First Batch)')
plt.show()
# Using Hugging Face Transformers for Real Applications
print("\n🤗 HUGGING FACE TRANSFORMERS IMPLEMENTATION:")
try:
from transformers import AutoTokenizer, AutoModel, pipeline
# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
print(f"Loaded model: {model_name}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
# Text classification example
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
sample_texts = [
"I love machine learning!",
"This is terrible.",
"The weather is nice today.",
"I'm feeling neutral about this."
]
results = classifier(sample_texts)
print("\n📊 Sentiment Analysis Results:")
for text, result in zip(sample_texts, results):
print(f"'{text}' -> {result['label']} (confidence: {result['score']:.3f})")
# Text generation example
generator = pipeline("text-generation", model="gpt2", max_length=50)
prompt = "The future of artificial intelligence"
generated = generator(prompt, num_return_sequences=1)
print(f"\n🤖 Text Generation:")
print(f"Prompt: '{prompt}'")
print(f"Generated: '{generated[0]['generated_text']}'")
except ImportError:
print("Hugging Face transformers not available. Using simulated results.")
# Simulate sentiment analysis results
sample_texts = [
"I love machine learning!",
"This is terrible.",
"The weather is nice today.",
"I'm feeling neutral about this."
]
simulated_results = [
{"label": "POSITIVE", "score": 0.98},
{"label": "NEGATIVE", "score": 0.95},
{"label": "POSITIVE", "score": 0.87},
{"label": "NEUTRAL", "score": 0.65}
]
print("\n📊 Simulated Sentiment Analysis Results:")
for text, result in zip(sample_texts, simulated_results):
print(f"'{text}' -> {result['label']} (confidence: {result['score']:.3f})")
# BERT for Text Classification
print("\n🔤 BERT for Text Classification:")
# Create synthetic text classification dataset
texts = [
"The movie was fantastic and I loved every minute of it",
"This product is terrible and does not work as advertised",
"The weather today is beautiful and sunny",
"I feel very disappointed with the service provided",
"This book is amazing and well written",
"The food was awful and overpriced",
"Great customer service and fast delivery",
"Poor quality materials used in this product"
]
labels = [1, 0, 1, 0, 1, 0, 1, 0] # 1: positive, 0: negative
# If transformers is available, demonstrate fine-tuning concept
try:
    import torch
    from transformers import AutoTokenizer, AutoModel
    # Tokenize texts as PyTorch tensors to match the PyTorch AutoModel below
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encoded_inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    print(f"Tokenized inputs shape: {encoded_inputs['input_ids'].shape}")
    print(f"Attention mask shape: {encoded_inputs['attention_mask'].shape}")
    # Demonstrate BERT embeddings
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    model.eval()
    with torch.no_grad():
        outputs = model(**encoded_inputs)
    embeddings = outputs.last_hidden_state
    print(f"BERT embeddings shape: {embeddings.shape}")
    # First-token ([CLS]) embedding as a simple sentence representation
    pooled_output = embeddings[:, 0, :]
    print(f"Pooled output shape: {pooled_output.shape}")
except ImportError:
    print("PyTorch or Hugging Face transformers not available for BERT demonstration.")
# Transformer Visualization
print("\n📊 TRANSFORMER ARCHITECTURE VISUALIZATION:")
# Create a visualization of transformer components
fig, axes = plt.subplots(3, 2, figsize=(15, 15))  # row 0: positional encoding and attention patterns; rows 1-2: four attention heads
# 1. Positional Encoding Visualization
positions = np.arange(50)
d_model = 64
pos_encoding = transformer.positional_encoding(positions, d_model)
im = axes[0,0].imshow(pos_encoding.T, cmap='viridis', aspect='auto')
axes[0,0].set_xlabel('Position')
axes[0,0].set_ylabel('Dimension')
axes[0,0].set_title('Positional Encoding')
plt.colorbar(im, ax=axes[0,0])
# 2. Attention Pattern Examples
def create_attention_patterns(seq_length):
"""Create different attention patterns"""
patterns = {}
# Causal attention (for GPT)
causal = np.tril(np.ones((seq_length, seq_length)))
patterns['Causal'] = causal
# Full attention (for BERT)
full = np.ones((seq_length, seq_length))
patterns['Full'] = full
# Local attention
local = np.zeros((seq_length, seq_length))
for i in range(seq_length):
start = max(0, i-2)
end = min(seq_length, i+3)
local[i, start:end] = 1
patterns['Local'] = local
return patterns
seq_len = 10
patterns = create_attention_patterns(seq_len)
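# Show all three patterns side by side in the one available panel
gap = np.full((seq_len, 1), np.nan)  # blank column between patterns
combined = np.hstack([patterns['Causal'], gap, patterns['Full'], gap, patterns['Local']])
axes[0,1].imshow(combined, cmap='viridis')
axes[0,1].set_xlabel('Key Position (per pattern)')
axes[0,1].set_ylabel('Query Position')
axes[0,1].set_title('Attention Patterns: Causal | Full | Local')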
# 3. Multi-Head Attention Concept
def visualize_multihead_attention():
"""Visualize multi-head attention concept"""
seq_len = 6
n_heads = 4
# Create sample attention weights for each head
attention_heads = []
for i in range(n_heads):
# Each head focuses on different patterns
if i == 0:
# Head 0: Diagonal attention
head = np.eye(seq_len)
elif i == 1:
# Head 1: Global attention
head = np.ones((seq_len, seq_len)) / seq_len
elif i == 2:
# Head 2: Local attention
head = np.zeros((seq_len, seq_len))
for j in range(seq_len):
start = max(0, j-1)
end = min(seq_len, j+2)
head[j, start:end] = 1.0 / (end - start)
else:
# Head 3: Random pattern
head = np.random.rand(seq_len, seq_len)
head = head / head.sum(axis=1, keepdims=True)
attention_heads.append(head)
return attention_heads
attention_heads = visualize_multihead_attention()
# Plot each attention head
for i, head in enumerate(attention_heads):
row = 1 + i // 2
col = i % 2
im = axes[row, col].imshow(head, cmap='viridis', aspect='auto')
axes[row, col].set_xlabel('Key Position')
axes[row, col].set_ylabel('Query Position')
axes[row, col].set_title(f'Attention Head {i+1}')
plt.colorbar(im, ax=axes[row, col])
plt.tight_layout()
plt.show()
# Transformer Applications
print("\n🎯 TRANSFORMER APPLICATIONS:")
applications = {
"Text Classification": "BERT, RoBERTa, DistilBERT",
"Text Generation": "GPT, GPT-2, GPT-3, GPT-4",
"Machine Translation": "mBART, T5, MarianMT",
"Question Answering": "BERT, RoBERTa, ALBERT",
"Named Entity Recognition": "BERT, Spacy Transformers",
"Text Summarization": "BART, T5, PEGASUS",
"Sentiment Analysis": "DistilBERT, BERT",
"Code Generation": "Codex, CodeGen, StarCoder"
}
print("Common Transformer Applications:")
for app, models in applications.items():
print(f"• {app}: {models}")
# Fine-tuning Transformers
print("\n🔧 FINE-TUNING TRANSFORMERS:")
fine_tuning_steps = [
"1. Choose pre-trained model (BERT, GPT, etc.)",
"2. Prepare domain-specific dataset",
"3. Add task-specific head (classification, generation)",
"4. Freeze base layers or use gradual unfreezing",
"5. Train with lower learning rate",
"6. Evaluate on validation set",
"7. Deploy fine-tuned model"
]
print("Fine-tuning Steps:")
for step in fine_tuning_steps:
print(step)
# Transformer Limitations and Solutions
print("\n⚠️ TRANSFORMER LIMITATIONS:")
limitations = {
"Computational Complexity": "O(n²) for sequence length",
"Memory Usage": "High for long sequences",
"Training Time": "Requires extensive pre-training",
"Data Requirements": "Needs large datasets",
"Interpretability": "Black-box nature"
}
solutions = {
"Computational Complexity": "Sparse attention, Linformer, Performer",
"Memory Usage": "Gradient checkpointing, model parallelism",
"Training Time": "Distributed training, mixed precision",
"Data Requirements": "Transfer learning, data augmentation",
"Interpretability": "Attention visualization, probing"
}
print("Limitations and Solutions:")
for limitation, solution in zip(limitations.items(), solutions.items()):
print(f"• {limitation[0]}: {limitation[1]}")
print(f" Solution: {solution[1]}")
print("✅ Transformers Analysis Complete!")

In [None]:
## Transformers Interview Questions & Answers
**Q1: What is the key innovation of transformers over RNNs?**
**Answer:**
- **Parallel processing:** Transformers process all positions of a sequence at once during training, whereas RNNs process tokens one at a time
- **Self-attention:** Connects any two positions directly (constant path length), whereas RNNs must propagate information step by step
- **No recurrence:** Avoids the vanishing gradients that come from backpropagating through many time steps
- **Scalability:** Parallel training scales well to large models and datasets, although attention cost grows as O(n²) with sequence length
- **Performance:** State-of-the-art results across NLP tasks
**Q2: Explain the self-attention mechanism.**
**Answer:**
**Self-attention computes:**
1. **Query, Key, Value vectors** from input embeddings
2. **Attention scores** as dot product between Query and Key
3. **Scaled scores** by dividing by $\sqrt{d_k}$ for stability
4. **Softmax** to get attention weights
5. **Weighted sum** of Value vectors using attention weights
**Intuition:** Each token attends to all other tokens, learning which are most relevant
**Q3: What is the purpose of positional encoding?**
**Answer:**
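- **Problem:** Self-attention is permutation equivariant: reordering the input tokens only reorders the outputs, so the model has no notion of order
- **Solution:** Add positional information to input embeddings
- **Methods:**
  - **Sinusoidal encoding:** Fixed mathematical function
  - **Learned embeddings:** Trainable position vectors
- **Properties:** Unique for each position; sinusoidal encodings are deterministic and can be computed for positions beyond the training length, while learned embeddings are limited to the trained maximum length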
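The order-blindness of self-attention is easy to verify numerically. In this sketch the helper `self_attention` uses Q = K = V = X with no learned projections (a simplifying assumption): shuffling the tokens simply shuffles the output rows in the same way.

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention with Q = K = V = X (no learned projections)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, 8-dimensional embeddings
perm = rng.permutation(5)     # a random reordering of the tokens

out = self_attention(X)
out_perm = self_attention(X[perm])

# The output for the shuffled input equals the shuffled original output
print(np.allclose(out_perm, out[perm]))
```

Adding a positional encoding to `X` before attention breaks this symmetry, which is what lets the model use word order.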
**Q4: Compare encoder-only, decoder-only, and encoder-decoder architectures.**
**Answer:**
- **Encoder-only (BERT):** Bidirectional context, good for understanding tasks
  - Use cases: Classification, NER, QA
- **Decoder-only (GPT):** Causal attention, good for generation tasks
  - Use cases: Text generation, completion
- **Encoder-decoder (T5, BART):** Full transformer, good for sequence-to-sequence
  - Use cases: Translation, summarization
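The practical difference between bidirectional and causal attention comes down to the mask applied before the softmax. The sketch below uses uniform scores so that only the mask shapes the result; `masked_softmax` is a hypothetical helper written for this illustration.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis; positions where mask == 0 get (near) zero weight."""
    scores = np.where(mask == 1, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))                # uniform scores, so only the mask matters

full_mask = np.ones((seq_len, seq_len))              # encoder-style: every token sees every token
causal_mask = np.tril(np.ones((seq_len, seq_len)))   # decoder-style: token i sees tokens 0..i

full_weights = masked_softmax(scores, full_mask)
causal_weights = masked_softmax(scores, causal_mask)
print(full_weights)
print(causal_weights)
```

With the causal mask, the first token can only attend to itself, while the last token spreads its attention over the whole prefix.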
**Q5: How do you handle long sequences with transformers?**
**Answer:**
**Challenges:** O(n²) complexity limits sequence length
**Solutions:**
- **Sparse attention:** Only attend to subset of positions
- **Linear transformers:** Approximate attention with kernels
- **Block-wise attention:** Process sequences in chunks
- **Memory-efficient attention:** Optimize memory usage
- **Longformer, BigBird:** Specialized architectures for long sequences
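To get a feel for the savings from sparse attention, the sketch below counts how many query-key pairs a sliding-window pattern keeps compared with full attention. The window size of 8 is an arbitrary assumption, and `local_attention_mask` is a hypothetical helper. Architectures such as Longformer and BigBird combine patterns like this with additional global or random connections.

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """1 where |query - key| <= window, else 0 (a sliding-window pattern)."""
    idx = np.arange(seq_len)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(int)

for n in (128, 1024):
    full_pairs = n * n                                        # full attention: O(n^2) pairs
    local_pairs = int(local_attention_mask(n, window=8).sum())  # roughly n * (2 * window + 1)
    print(f"n={n}: full={full_pairs}, local={local_pairs}")
```

The full count grows quadratically with `n`, while the windowed count grows only linearly.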
# 16. AUTOENCODERS
## Algorithm Background & Mathematical Foundation
**Core Concept:** Autoencoders are neural networks that learn efficient data encodings by training to reconstruct their input.
**Mathematical Formulation:**
**Architecture:**
- **Encoder:** $z = f(x) = \sigma(W_e x + b_e)$
- **Bottleneck:** Latent representation $z$
- **Decoder:** $\hat{x} = g(z) = \sigma(W_d z + b_d)$
**Loss Function:**
$$L(x, \hat{x}) = \|x - \hat{x}\|^2$$
**Variational Autoencoder (VAE):**
- **Encoder:** Learns $\mu$ and $\sigma$ of latent distribution
- **Sampling:** $z = \mu + \sigma \odot \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
- **Loss:** $L = -\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{KL}(q(z|x) \| p(z))$ (the negative ELBO: reconstruction term plus KL regularizer)
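The sampling step and the KL term can be sketched in a few lines of NumPy. This example assumes a diagonal Gaussian encoder parameterized by the log-variance ($\log \sigma^2$), which is a common convention, and a standard normal prior; the function names are hypothetical.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)) per sample, summed over latent dimensions."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 0.0], [1.0, 0.0]])   # two samples, latent_dim = 2
log_var = np.zeros((2, 2))                # sigma = 1 everywhere

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape)
print(kl)  # zero when q(z|x) already equals the prior, positive otherwise
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what allows gradients to flow through the sampling step.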
**Types of Autoencoders:**
- **Undercomplete:** Bottleneck smaller than input
- **Sparse:** Sparse activations in bottleneck
- **Denoising:** Train to reconstruct clean data from noisy input
- **Contractive:** Regularize to be robust to small input variations

# Cell 18: Autoencoders - Comprehensive Implementation
print("🚀 AUTOENCODERS: COMPREHENSIVE IMPLEMENTATION\n")
# Load MNIST dataset for autoencoder demonstration
(X_train_ae, _), (X_test_ae, _) = tf.keras.datasets.mnist.load_data()
# Preprocess data
X_train_ae = X_train_ae.astype('float32') / 255.0
X_test_ae = X_test_ae.astype('float32') / 255.0
# Flatten images for simple autoencoder
X_train_flat = X_train_ae.reshape(-1, 784)
X_test_flat = X_test_ae.reshape(-1, 784)
print("📊 MNIST Dataset for Autoencoders:")
print(f"• Training samples: {X_train_flat.shape[0]}")
print(f"• Test samples: {X_test_flat.shape[0]}")
print(f"• Input dimension: {X_train_flat.shape[1]}")
# Build different autoencoder architectures
def create_simple_autoencoder():
"""Simple undercomplete autoencoder"""
encoding_dim = 32
# Encoder
encoder = Sequential([
layers.Dense(128, activation='relu', input_shape=(784,)),
layers.Dense(64, activation='relu'),
layers.Dense(encoding_dim, activation='relu')
])
# Decoder
decoder = Sequential([
layers.Dense(64, activation='relu', input_shape=(encoding_dim,)),
layers.Dense(128, activation='relu'),
layers.Dense(784, activation='sigmoid')
])
# Autoencoder
autoencoder = Sequential([encoder, decoder])
return autoencoder, encoder, decoder
def create_conv_autoencoder():
"""Convolutional autoencoder for images"""
# Encoder
encoder = Sequential([
layers.Reshape((28, 28, 1), input_shape=(784,)),
layers.Conv2D(32, (3, 3), activation='relu', padding='same'),
layers.MaxPooling2D((2, 2), padding='same'),
layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
layers.MaxPooling2D((2, 2), padding='same'),
layers.Conv2D(8, (3, 3), activation='relu', padding='same'),
layers.MaxPooling2D((2, 2), padding='same'),
layers.Flatten()
])
# Decoder
decoder = Sequential([
layers.Dense(128, activation='relu', input_shape=(128,)),
layers.Reshape((4, 4, 8)),
layers.Conv2D(8, (3, 3), activation='relu', padding='same'),
layers.UpSampling2D((2, 2)),
layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
layers.UpSampling2D((2, 2)),
layers.Conv2D(32, (3, 3), activation='relu', padding='valid'),  # 16x16 -> 14x14 so the final upsampling yields 28x28
layers.UpSampling2D((2, 2)),
layers.Conv2D(1, (3, 3), activation='sigmoid', padding='same'),
layers.Reshape((784,))
])
# Autoencoder
autoencoder = Sequential([encoder, decoder])
return autoencoder, encoder, decoder
def create_denoising_autoencoder():
"""Denoising autoencoder"""
encoding_dim = 32
# Encoder
encoder = Sequential([
layers.GaussianNoise(0.1, input_shape=(784,)),
layers.Dense(128, activation='relu'),
layers.Dense(64, activation='relu'),
layers.Dense(encoding_dim, activation='relu')
])
# Decoder
decoder = Sequential([
layers.Dense(64, activation='relu', input_shape=(encoding_dim,)),
layers.Dense(128, activation='relu'),
layers.Dense(784, activation='sigmoid')
])
# Autoencoder
autoencoder = Sequential([encoder, decoder])
return autoencoder, encoder, decoder
# Train autoencoders
ae_models = {
'Simple Autoencoder': create_simple_autoencoder(),
'Convolutional Autoencoder': create_conv_autoencoder(),
'Denoising Autoencoder': create_denoising_autoencoder()
}
ae_histories = {}
for name, (autoencoder, encoder, decoder) in ae_models.items():
print(f"\n📊 Training {name}...")
# Compile autoencoder
autoencoder.compile(
optimizer='adam',
loss='mse',
metrics=['mae']
)
# Prepare data (for denoising autoencoder, use noisy input)
if 'Denoising' in name:
# Add noise to training data
X_train_noisy = X_train_flat + 0.1 * np.random.normal(0, 1, X_train_flat.shape)
X_train_noisy = np.clip(X_train_noisy, 0., 1.)
train_data = (X_train_noisy, X_train_flat)
else:
train_data = (X_train_flat, X_train_flat)
# Train model
history = autoencoder.fit(
train_data[0], train_data[1],
validation_data=(X_test_flat, X_test_flat),
epochs=15,
batch_size=128,
verbose=0
)
ae_histories[name] = {
'autoencoder': autoencoder,
'encoder': encoder,
'decoder': decoder,
'history': history
}
# Evaluate model
test_loss, test_mae = autoencoder.evaluate(X_test_flat, X_test_flat, verbose=0)
print(f"• Test MSE: {test_loss:.4f}")
print(f"• Test MAE: {test_mae:.4f}")
# Comprehensive Visualization
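fig, axes = plt.subplots(5, 5, figsize=(18, 18))  # row 0: training curves; row 1: originals; rows 2-4: reconstructions
for ax in axes[0, 3:]:
    ax.axis('off')  # only three training-history panels are used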
# 1. Training History
for i, (name, ae_data) in enumerate(ae_histories.items()):
history = ae_data['history']
axes[0,i].plot(history.history['loss'], label='Train')
axes[0,i].plot(history.history['val_loss'], label='Validation')
axes[0,i].set_xlabel('Epochs')
axes[0,i].set_ylabel('MSE Loss')
axes[0,i].set_title(f'{name} - Training')
axes[0,i].legend()
axes[0,i].grid(True)
# 2. Original vs Reconstructed Images
n_examples = 5
sample_indices = np.random.choice(len(X_test_flat), n_examples, replace=False)
for i, idx in enumerate(sample_indices):
# Original image
original = X_test_flat[idx].reshape(28, 28)
axes[1,i].imshow(original, cmap='gray')
axes[1,i].set_title(f'Original {i+1}')
axes[1,i].axis('off')
# Reconstructed images from different autoencoders
for j, (name, ae_data) in enumerate(ae_histories.items()):
autoencoder = ae_data['autoencoder']
reconstructed = autoencoder.predict(X_test_flat[idx:idx+1], verbose=0)[0].reshape(28, 28)
axes[2+j,i].imshow(reconstructed, cmap='gray')
axes[2+j,i].set_title(f'{name[:10]}...')
axes[2+j,i].axis('off')
plt.tight_layout()
plt.show()
# Latent Space Visualization
print("\n🔍 LATENT SPACE VISUALIZATION:")
# Use the simple autoencoder for latent space analysis
simple_ae_data = ae_histories['Simple Autoencoder']
encoder = simple_ae_data['encoder']
# Encode test images
latent_representations = encoder.predict(X_test_flat, verbose=0)
print(f"Latent representations shape: {latent_representations.shape}")
# Reduce to 2D for visualization using PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
latent_2d = pca.fit_transform(latent_representations)
plt.figure(figsize=(10, 8))
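(_, _), (_, y_test_ae) = tf.keras.datasets.mnist.load_data()  # test labels were discarded when the images were loaded
scatter = plt.scatter(latent_2d[:, 0], latent_2d[:, 1], c=y_test_ae, cmap='tab10', alpha=0.6)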
plt.colorbar(scatter, label='Digit Class')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('Autoencoder Latent Space (PCA projection)')
plt.grid(True, alpha=0.3)
plt.show()
# Anomaly Detection with Autoencoders
print("\n🚨 ANOMALY DETECTION WITH AUTOENCODERS:")
# Calculate reconstruction error for each test sample
reconstruction_errors = []
for ae_name, ae_data in ae_histories.items():
autoencoder = ae_data['autoencoder']
reconstructions = autoencoder.predict(X_test_flat, verbose=0)
errors = np.mean((X_test_flat - reconstructions) ** 2, axis=1)
reconstruction_errors.append((ae_name, errors))
# Plot reconstruction error distribution
plt.figure(figsize=(12, 6))
for ae_name, errors in reconstruction_errors:
plt.hist(errors, bins=50, alpha=0.6, label=ae_name)
plt.xlabel('Reconstruction Error (MSE)')
plt.ylabel('Frequency')
plt.title('Reconstruction Error Distribution for Anomaly Detection')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
# Identify potential anomalies (high reconstruction error)
threshold = np.percentile(reconstruction_errors[0][1], 95) # 95th percentile
anomaly_indices = np.where(reconstruction_errors[0][1] > threshold)[0]
print(f"Anomalies detected: {len(anomaly_indices)} ({len(anomaly_indices)/len(X_test_flat)*100:.1f}%)")
# Show some anomalies
plt.figure(figsize=(12, 4))
for i, idx in enumerate(anomaly_indices[:8]):
plt.subplot(2, 4, i+1)
plt.imshow(X_test_flat[idx].reshape(28, 28), cmap='gray')
plt.title(f'Error: {reconstruction_errors[0][1][idx]:.3f}')
plt.axis('off')
plt.suptitle('Detected Anomalies (High Reconstruction Error)')
plt.tight_layout()
plt.show()
# Variational Autoencoder (VAE) Implementation
print("\n🎯 VARIATIONAL AUTOENCODER (VAE) IMPLEMENTATION:")
class Sampling(layers.Layer):
    """Uses (z_mean, z_log_var) to sample z and registers the KL divergence loss"""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        # KL divergence between q(z|x) and the standard normal prior
        kl_loss = -0.5 * tf.reduce_mean(
            z_log_var - tf.square(z_mean) - tf.exp(z_log_var) + 1
        )
        self.add_loss(kl_loss)
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon
def create_vae(latent_dim=2):
    """Create Variational Autoencoder"""
    # Encoder
    encoder_inputs = layers.Input(shape=(784,))
    x = layers.Dense(128, activation='relu')(encoder_inputs)
    x = layers.Dense(64, activation='relu')(x)
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    z = Sampling()([z_mean, z_log_var])
    encoder = tf.keras.Model(encoder_inputs, [z_mean, z_log_var, z], name="encoder")
    # Decoder
    latent_inputs = layers.Input(shape=(latent_dim,))
    x = layers.Dense(64, activation='relu')(latent_inputs)
    x = layers.Dense(128, activation='relu')(x)
    decoder_outputs = layers.Dense(784, activation='sigmoid')(x)
    decoder = tf.keras.Model(latent_inputs, decoder_outputs, name="decoder")
    # VAE (the KL term is added inside the Sampling layer)
    outputs = decoder(encoder(encoder_inputs)[2])
    vae = tf.keras.Model(encoder_inputs, outputs, name="vae")
    return vae, encoder, decoder
# Create and train VAE
vae, vae_encoder, vae_decoder = create_vae(latent_dim=2)
vae.compile(optimizer='adam', loss='mse')
vae_history = vae.fit(
X_train_flat, X_train_flat,
validation_data=(X_test_flat, X_test_flat),
epochs=20,
batch_size=128,
verbose=0
)
print("VAE training completed!")
# VAE Latent Space Visualization
z_mean, _, _ = vae_encoder.predict(X_test_flat, verbose=0)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
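(_, _), (_, y_test_ae) = tf.keras.datasets.mnist.load_data()  # test labels were discarded when the images were loaded
scatter = plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test_ae, cmap='tab10', alpha=0.6)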
plt.colorbar(scatter, label='Digit Class')
plt.xlabel('z[0]')
plt.ylabel('z[1]')
plt.title('VAE Latent Space')
plt.grid(True, alpha=0.3)
# Generate new samples from VAE
plt.subplot(1, 2, 2)
n = 15
digit_size = 28
figure = np.zeros((digit_size * n, digit_size * n))
# Linearly spaced coordinates on the square [-2, 2] x [-2, 2] of the latent space
grid_x = np.linspace(-2, 2, n)
grid_y = np.linspace(-2, 2, n)[::-1]
for i, yi in enumerate(grid_y):
for j, xi in enumerate(grid_x):
z_sample = np.array([[xi, yi]])
x_decoded = vae_decoder.predict(z_sample, verbose=0)
digit = x_decoded[0].reshape(digit_size, digit_size)
figure[i * digit_size: (i + 1) * digit_size,
j * digit_size: (j + 1) * digit_size] = digit
plt.imshow(figure, cmap='gray')
plt.title('VAE Generated Digits')
plt.axis('off')
plt.tight_layout()
plt.show()
# Autoencoder Applications Summary
print("\n🎯 AUTOENCODER APPLICATIONS:")
applications = {
"Dimensionality Reduction": "Learn compact representations",
"Anomaly Detection": "High reconstruction error indicates anomalies",
"Image Denoising": "Remove noise from images",
"Data Generation": "VAEs can generate new samples",
"Feature Learning": "Learn meaningful features unsupervised",
"Data Compression": "Efficient encoding of data"
}
print("Autoencoder Applications:")
for app, description in applications.items():
print(f"• {app}: {description}")
print("✅ Autoencoders Analysis Complete!")

In [None]:
## Autoencoders Interview Questions & Answers
**Q1: What is the difference between autoencoders and PCA?**
**Answer:**
- **Linearity:** PCA is linear, autoencoders can learn non-linear transformations
- **Flexibility:** Autoencoders can use various architectures (CNN, LSTM)
- **Representation:** Autoencoders can learn more complex feature hierarchies
- **Training:** PCA has closed-form solution, autoencoders require gradient descent
- **Use cases:** PCA for simple linear dimensionality reduction, autoencoders for complex non-linear data
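The "closed-form" point can be made concrete: PCA behaves like a linear autoencoder with tied weights whose solution comes straight from an SVD. The sketch below uses synthetic correlated data (an assumption for illustration) and no gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X_centered = X - X.mean(axis=0)

# Closed-form solution: principal directions from the SVD
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
k = 2
W = Vt[:k].T                 # (5, k) projection matrix

codes = X_centered @ W       # linear "encoder"
X_hat = codes @ W.T          # linear "decoder" with tied weights
mse = np.mean((X_centered - X_hat) ** 2)
print(codes.shape, round(float(mse), 4))
```

A non-linear autoencoder replaces `W` and `W.T` with trained networks, which can reach lower reconstruction error when the data lies on a curved manifold.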
**Q2: What is the bottleneck and why is it important?**
**Answer:**
- **Bottleneck:** Middle layer with reduced dimensionality
- **Purpose:** Forces network to learn compressed representation
- **Undercomplete:** Bottleneck smaller than input - learns compression
- **Overcomplete:** Bottleneck larger than input - needs regularization
- **Optimal size:** Balance between reconstruction quality and compression
**Q3: Explain Variational Autoencoders (VAEs).**
**Answer:**
- **Probabilistic approach:** Learn distribution of latent space
- **Encoder output:** Mean and variance of latent distribution
- **Sampling:** Generate new samples by sampling from latent distribution
- **Loss function:** Reconstruction loss + KL divergence
- **KL divergence:** Encourages latent distribution to match prior (usually Gaussian)
- **Applications:** Data generation, interpolation in latent space
**Q4: What are denoising autoencoders?**
**Answer:**
- **Training:** Learn to reconstruct clean data from corrupted input
- **Corruption:** Add noise, mask pixels, or other transformations
- **Benefits:**
  - Learns robust features
  - Prevents identity function learning
  - Better generalization
- **Use cases:** Image denoising, robust feature learning
**Q5: How do you evaluate autoencoder performance?**
**Answer:**
- **Reconstruction quality:** MSE, SSIM, perceptual metrics
- **Latent space quality:** Clustering, visualization, interpretability
- **Downstream tasks:** Performance on classification/regression using encoded features
- **Generation quality:** For VAEs - quality of generated samples
- **Anomaly detection:** ROC curves for reconstruction error thresholding
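For the anomaly-detection case, reconstruction errors can be scored with ROC AUC when labels are available. The sketch below uses synthetic error values (not results from this notebook) and a small rank-based `roc_auc` helper written for this example; in practice `sklearn.metrics.roc_auc_score` does the same job.

```python
import numpy as np

def roc_auc(scores, labels):
    """AUC as the probability that a random anomaly scores higher than a random normal sample."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic reconstruction errors: anomalies are assumed to reconstruct worse
rng = np.random.default_rng(0)
normal_err = rng.normal(0.02, 0.005, size=950)
anomaly_err = rng.normal(0.05, 0.010, size=50)
errors = np.concatenate([normal_err, anomaly_err])
labels = np.concatenate([np.zeros(950), np.ones(50)])

auc = roc_auc(errors, labels)
print(f"ROC AUC: {auc:.3f}")
```

An AUC near 1.0 means a threshold on reconstruction error separates the two groups well, while 0.5 means it does no better than chance.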
# 17. DBSCAN
## Algorithm Background & Mathematical Foundation
**Core Concept:** DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density connectivity.
**Key Definitions:**
- **ε (eps):** Maximum distance between two samples to be considered neighbors
- **MinPts:** Minimum number of points required to form a dense region
- **Core point:** Point with at least MinPts points within ε distance
- **Border point:** Point within ε distance of core point but doesn't have enough neighbors
- **Noise point:** Point that is neither core nor border point
**Algorithm Steps:**
1. For each point, find points within ε distance
2. Identify core points (≥ MinPts neighbors)
3. Form clusters from connected core points
4. Assign border points to clusters
5. Mark remaining points as noise
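The first two steps and the point definitions can be sketched directly with pairwise distances. This is only an illustration of the definitions, assuming Euclidean distance and counting a point as its own neighbor (the scikit-learn convention for `min_samples`); it does not build the clusters themselves.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' following the definitions above."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                      # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border: not core, but within eps of at least one core point
    near_core = (neighbors & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # dense square
              [0.25, 0.1],                                       # close to the square
              [5.0, 5.0]])                                       # far from everything
point_types = classify_points(X, eps=0.2, min_pts=4)
print(point_types)
```

The four corner points are core points, the fifth point has too few neighbors but sits within ε of a core point (border), and the distant point is noise.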
**Mathematical Properties:**
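- **Density reachability:** Point p is density-reachable from q if there is a chain of points from q to p in which each consecutive pair is within ε and every point before p is a core point
- **Density connectivity:** Points p and q are density-connected if there is a point o from which both p and q are density-reachable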

# Cell 19: DBSCAN - Comprehensive Implementation
print("🚀 DBSCAN: COMPREHENSIVE IMPLEMENTATION\n")
# Create dataset with complex cluster shapes
from sklearn.datasets import make_moons, make_circles, make_blobs
from sklearn.metrics import silhouette_score  # used for the silhouette comparisons below
# Generate different dataset types
datasets_dbscan = {
'Moons': make_moons(n_samples=300, noise=0.05, random_state=42),
'Circles': make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42),
'Blobs': make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42),
'Anisotropic': make_blobs(n_samples=300, centers=3, random_state=42)
}
# Make anisotropic data
transformation = [[0.6, -0.6], [-0.4, 0.8]]
X_aniso = np.dot(datasets_dbscan['Anisotropic'][0], transformation)
datasets_dbscan['Anisotropic'] = (X_aniso, datasets_dbscan['Anisotropic'][1])
# Add some noise to datasets
for name, (X, y) in datasets_dbscan.items():
# Add 10% noise
n_noise = int(0.1 * len(X))
noise_points = np.random.uniform(X.min(axis=0), X.max(axis=0), (n_noise, 2))
X_noisy = np.vstack([X, noise_points])
y_noisy = np.hstack([y, -1 * np.ones(n_noise)]) # -1 for noise
datasets_dbscan[name] = (X_noisy, y_noisy)
print("📊 Dataset Overview for DBSCAN:")
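for name, (X, y) in datasets_dbscan.items():
    # Label -1 marks the injected noise points, so exclude it from the cluster count
    print(f"• {name}: {X.shape[0]} points, {len(np.unique(y[y != -1]))} true clusters (plus injected noise)")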
# Scale datasets
scaler_dbscan = StandardScaler()
scaled_datasets = {}
for name, (X, y) in datasets_dbscan.items():
X_scaled = scaler_dbscan.fit_transform(X)
scaled_datasets[name] = (X_scaled, y)
# DBSCAN with different parameters
dbscan_results = {}
# Parameter grid
eps_values = [0.1, 0.2, 0.3, 0.5]
min_samples_values = [5, 10, 15]
for dataset_name, (X, y_true) in scaled_datasets.items():
    print(f"\n🔍 Analyzing {dataset_name} dataset...")
    dataset_results = {}
    for eps in eps_values:
        for min_samples in min_samples_values:
            # Apply DBSCAN
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(X)
            # Count clusters and noise points (label -1 marks noise)
            n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
            n_noise = list(labels).count(-1)
            # Silhouette needs at least 2 clusters; score only the clustered (non-noise) points
            clustered = labels != -1
            if n_clusters > 1 and clustered.sum() > n_clusters:
                silhouette = silhouette_score(X[clustered], labels[clustered])
            else:
                silhouette = -1
            dataset_results[(eps, min_samples)] = {
                'labels': labels,
                'n_clusters': n_clusters,
                'n_noise': n_noise,
                'silhouette': silhouette
            }
    dbscan_results[dataset_name] = dataset_results
# Find best parameters for each dataset
best_params = {}
for dataset_name, results in dbscan_results.items():
# Find parameters with highest silhouette score (excluding invalid ones)
valid_results = {k: v for k, v in results.items() if v['silhouette'] > 0}
if valid_results:
best_param = max(valid_results.keys(), key=lambda x: valid_results[x]['silhouette'])
best_params[dataset_name] = best_param
print(f"• {dataset_name}: Best eps={best_param[0]}, min_samples={best_param[1]}, "
f"Silhouette={valid_results[best_param]['silhouette']:.3f}")
# Comprehensive Visualization: one row per dataset, four panels per row
n_rows = len(scaled_datasets)
fig, axes = plt.subplots(n_rows, 4, figsize=(20, 4 * n_rows), squeeze=False)
for i, (dataset_name, (X, y_true)) in enumerate(scaled_datasets.items()):
    # Original data with true labels
    axes[i,0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
    axes[i,0].set_title(f'{dataset_name}\nTrue Clusters')
    axes[i,0].set_xlabel('Feature 1')
    axes[i,0].set_ylabel('Feature 2')
    axes[i,0].grid(True, alpha=0.3)
    # K-means for comparison (K taken from the true labels, ignoring noise)
    kmeans = KMeans(n_clusters=len(np.unique(y_true[y_true != -1])), random_state=42)
    kmeans_labels = kmeans.fit_predict(X)
    axes[i,1].scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', alpha=0.7)
    axes[i,1].set_title(f'K-means\nSilhouette: {silhouette_score(X, kmeans_labels):.3f}')
    axes[i,1].set_xlabel('Feature 1')
    axes[i,1].set_ylabel('Feature 2')
    axes[i,1].grid(True, alpha=0.3)
    # DBSCAN with best parameters (panel stays empty if no valid setting was found)
    if dataset_name in best_params:
        eps, min_samples = best_params[dataset_name]
        best_result = dbscan_results[dataset_name][(eps, min_samples)]
        labels = best_result['labels']
        # Create color map that distinguishes noise (gray)
        colors = ['gray' if label == -1 else plt.cm.viridis(label / max(1, best_result['n_clusters']))
                  for label in labels]
        axes[i,2].scatter(X[:, 0], X[:, 1], c=colors, alpha=0.7)
        axes[i,2].set_title(f'DBSCAN (Best)\neps={eps}, min_samples={min_samples}\n'
                            f'Clusters: {best_result["n_clusters"]}, Noise: {best_result["n_noise"]}\n'
                            f'Silhouette: {best_result["silhouette"]:.3f}')
    axes[i,2].set_xlabel('Feature 1')
    axes[i,2].set_ylabel('Feature 2')
    axes[i,2].grid(True, alpha=0.3)
    # DBSCAN with one fixed parameter setting, for comparison with the best one
    eps_for_plot = 0.3
    min_samples_for_plot = 10
    if (eps_for_plot, min_samples_for_plot) in dbscan_results[dataset_name]:
        result = dbscan_results[dataset_name][(eps_for_plot, min_samples_for_plot)]
        labels = result['labels']
        colors = ['gray' if label == -1 else plt.cm.viridis(label / max(1, result['n_clusters']))
                  for label in labels]
        axes[i,3].scatter(X[:, 0], X[:, 1], c=colors, alpha=0.7)
        axes[i,3].set_title(f'DBSCAN (eps={eps_for_plot}, min_samples={min_samples_for_plot})\n'
                            f'Clusters: {result["n_clusters"]}, Noise: {result["n_noise"]}')
    axes[i,3].set_xlabel('Feature 1')
    axes[i,3].set_ylabel('Feature 2')
    axes[i,3].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# DBSCAN Parameter Analysis
print("\n🔧 DBSCAN PARAMETER ANALYSIS:")
# Analyze parameter sensitivity for one dataset
dataset_to_analyze = 'Moons'
X_analyze, y_analyze = scaled_datasets[dataset_to_analyze]
# Create parameter grid
eps_range = np.linspace(0.1, 0.5, 20)
min_samples_range = [5, 10, 15, 20]
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
# 1. Number of clusters vs eps for different min_samples
for min_samples in min_samples_range:
    n_clusters_list = []
    for eps in eps_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_analyze)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_clusters_list.append(n_clusters)
    axes[0,0].plot(eps_range, n_clusters_list, 'o-', label=f'min_samples={min_samples}')
axes[0,0].set_xlabel('eps')
axes[0,0].set_ylabel('Number of Clusters')
axes[0,0].set_title('Number of Clusters vs eps')
axes[0,0].legend()
axes[0,0].grid(True)
# 2. Noise points vs eps for different min_samples
for min_samples in min_samples_range:
    noise_list = []
    for eps in eps_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_analyze)
        n_noise = list(labels).count(-1)
        noise_list.append(n_noise)
    axes[0,1].plot(eps_range, noise_list, 'o-', label=f'min_samples={min_samples}')
axes[0,1].set_xlabel('eps')
axes[0,1].set_ylabel('Number of Noise Points')
axes[0,1].set_title('Noise Points vs eps')
axes[0,1].legend()
axes[0,1].grid(True)
# 3. Silhouette score vs eps for different min_samples
# Note: silhouette_score treats the noise label (-1) as one more cluster here.
for min_samples in min_samples_range:
    silhouette_list = []
    for eps in eps_range:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        labels = dbscan.fit_predict(X_analyze)
        # Only calculate silhouette if we have reasonable clustering
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters > 1 and n_clusters < len(X_analyze) - 1:
            silhouette = silhouette_score(X_analyze, labels)
        else:
            # NaN leaves a gap in the curve instead of a fake score of -1
            silhouette = np.nan
        silhouette_list.append(silhouette)
    axes[1,0].plot(eps_range, silhouette_list, 'o-', label=f'min_samples={min_samples}')
axes[1,0].set_xlabel('eps')
axes[1,0].set_ylabel('Silhouette Score')
axes[1,0].set_title('Silhouette Score vs eps')
axes[1,0].legend()
axes[1,0].grid(True)
# 4. K-distance graph (sorted distance to the k-th nearest neighbor)
from sklearn.neighbors import NearestNeighbors
# Calculate k-distances for different k
k_values = [5, 10, 15]
for k in k_values:
    # Named nn_model so it does not shadow the imported sklearn `neighbors` module
    nn_model = NearestNeighbors(n_neighbors=k).fit(X_analyze)
    # Each point is its own nearest neighbor (distance 0), so column k-1 counts
    # the point itself, matching how DBSCAN counts min_samples
    distances, indices = nn_model.kneighbors(X_analyze)
    k_distances = np.sort(distances[:, k-1])
    axes[1,1].plot(k_distances, label=f'k={k}')
axes[1,1].set_xlabel('Points sorted by distance')
axes[1,1].set_ylabel('k-distance')
axes[1,1].set_title('K-distance Graph (for eps selection)')
axes[1,1].legend()
axes[1,1].grid(True)
plt.tight_layout()
plt.show()
# DBSCAN vs Other Clustering Algorithms
print("\n🆚 DBSCAN vs OTHER CLUSTERING ALGORITHMS:")
comparison_datasets = {
    'Complex Shapes': scaled_datasets['Moons'],
    'Noisy Data': scaled_datasets['Circles']
}
algorithms = {
    'K-means': KMeans(n_clusters=2, random_state=42),
    'Agglomerative': AgglomerativeClustering(n_clusters=2),
    'DBSCAN': DBSCAN(eps=0.3, min_samples=10)
}
fig, axes = plt.subplots(2, 4, figsize=(20, 8))
for i, (dataset_name, (X, y_true)) in enumerate(comparison_datasets.items()):
    # True clusters
    axes[i,0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', alpha=0.7)
    axes[i,0].set_title(f'{dataset_name}\nTrue Clusters')
    axes[i,0].set_xlabel('Feature 1')
    axes[i,0].set_ylabel('Feature 2')
    axes[i,0].grid(True, alpha=0.3)
    # Each algorithm
    for j, (algo_name, algorithm) in enumerate(algorithms.items()):
        labels = algorithm.fit_predict(X)
        if algo_name == 'DBSCAN':
            # Draw noise points (label -1) in gray
            colors = ['gray' if label == -1 else plt.cm.viridis(label / max(1, len(set(labels))-1))
                      for label in labels]
        else:
            colors = plt.cm.viridis(labels / max(1, len(set(labels))))
        axes[i,j+1].scatter(X[:, 0], X[:, 1], c=colors, alpha=0.7)
        # Silhouette is only defined for 2 <= number of labels <= n_samples - 1
        if 1 < len(set(labels)) < len(X):
            silhouette = silhouette_score(X, labels)
            axes[i,j+1].set_title(f'{algo_name}\nSilhouette: {silhouette:.3f}')
        else:
            axes[i,j+1].set_title(f'{algo_name}\nInvalid clustering')
        axes[i,j+1].set_xlabel('Feature 1')
        axes[i,j+1].set_ylabel('Feature 2')
        axes[i,j+1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Real-world Application: Customer Segmentation
print("\n👥 CUSTOMER SEGMENTATION WITH DBSCAN:")
# Create synthetic customer data
np.random.seed(42)
n_customers = 1000
# Generate customer features with different densities
age = np.concatenate([
    np.random.normal(25, 5, 200),   # Young customers
    np.random.normal(45, 8, 300),   # Middle-aged
    np.random.normal(65, 6, 200),   # Senior
    np.random.uniform(18, 80, 300)  # Noise/unsegmented
])
income = np.concatenate([
    np.random.normal(30000, 5000, 200),    # Low income
    np.random.normal(60000, 10000, 300),   # Middle income
    np.random.normal(100000, 20000, 200),  # High income
    np.random.uniform(20000, 120000, 300)  # Noise
])
spending_score = np.concatenate([
    np.random.normal(80, 10, 200),  # High spenders
    np.random.normal(50, 15, 300),  # Moderate spenders
    np.random.normal(20, 8, 200),   # Low spenders
    np.random.uniform(1, 100, 300)  # Noise
])
customer_data = np.column_stack([age, income, spending_score])
# Scale data
scaler_customer = StandardScaler()
customer_data_scaled = scaler_customer.fit_transform(customer_data)
# Apply DBSCAN
dbscan_customer = DBSCAN(eps=0.5, min_samples=20)
customer_labels = dbscan_customer.fit_predict(customer_data_scaled)
# Analyze results
n_clusters = len(set(customer_labels)) - (1 if -1 in customer_labels else 0)
n_noise = list(customer_labels).count(-1)
print(f"Customer Segmentation Results:")
print(f"• Number of clusters: {n_clusters}")
print(f"• Number of noise points: {n_noise}")
print(f"• Percentage of customers segmented: {(len(customer_labels) - n_noise) / len(customer_labels) * 100:.1f}%")
# Visualize customer segments
fig = plt.figure(figsize=(15, 5))
# Age vs Income
plt.subplot(1, 3, 1)
colors = ['gray' if label == -1 else plt.cm.viridis(label / max(1, n_clusters))
for label in customer_labels]
plt.scatter(age, income, c=colors, alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.title('Customer Segments: Age vs Income')
plt.grid(True, alpha=0.3)
# Age vs Spending
plt.subplot(1, 3, 2)
plt.scatter(age, spending_score, c=colors, alpha=0.6)
plt.xlabel('Age')
plt.ylabel('Spending Score')
plt.title('Customer Segments: Age vs Spending')
plt.grid(True, alpha=0.3)
# Income vs Spending
plt.subplot(1, 3, 3)
plt.scatter(income, spending_score, c=colors, alpha=0.6)
plt.xlabel('Income ($)')
plt.ylabel('Spending Score')
plt.title('Customer Segments: Income vs Spending')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Cluster analysis
print("\n📈 CUSTOMER SEGMENT ANALYSIS:")
for cluster_id in range(n_clusters):
    cluster_mask = customer_labels == cluster_id
    cluster_size = np.sum(cluster_mask)
    if cluster_size > 0:
        avg_age = np.mean(age[cluster_mask])
        avg_income = np.mean(income[cluster_mask])
        avg_spending = np.mean(spending_score[cluster_mask])
        print(f"\nSegment {cluster_id} (Size: {cluster_size}):")
        print(f"• Average Age: {avg_age:.1f} years")
        print(f"• Average Income: ${avg_income:.0f}")
        print(f"• Average Spending: {avg_spending:.1f}")
print("✅ DBSCAN Analysis Complete!")

In [None]:
## DBSCAN Interview Questions & Answers
**Q1: How do you choose the optimal eps and min_samples parameters?**
**Answer:**
- **K-distance graph:** Plot sorted k-nearest neighbor distances, choose eps at the "elbow"
- **Domain knowledge:** Use understanding of data scale and expected cluster density
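- **Grid search:** Try different combinations and evaluate with silhouette score, keeping in mind that silhouette favors compact, convex clusters
- **Rule of thumb:** min_samples ≥ dimensionality + 1; 2 × dimensionality is a common choice for noisy data
- **Iterative approach:** Start with the scikit-learn defaults (eps=0.5, min_samples=5) and adjust based on results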
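The k-distance heuristic from Q1 can be sketched as below. This is a minimal illustration, not a definitive recipe: the `make_moons` data, `min_samples = 5`, and the "farthest below the diagonal" elbow rule are all assumptions chosen for the example, and the resulting `eps_guess` is only a starting point to check visually.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Illustrative data (assumption): two noisy half-moons, standardized
X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

min_samples = 5  # candidate value; k is set equal to it

# kneighbors() on the training data returns each point itself at distance 0,
# so the last column is the distance to the k-th point counting the point
# itself, which matches how scikit-learn's DBSCAN counts min_samples.
nn_model = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn_model.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Crude elbow rule (assumption): rescale the curve to the unit square and
# take the point that lies farthest below the diagonal.
x_norm = np.linspace(0.0, 1.0, len(k_distances))
y_norm = (k_distances - k_distances.min()) / (k_distances.max() - k_distances.min())
elbow_index = int(np.argmax(x_norm - y_norm))
eps_guess = float(k_distances[elbow_index])

print(f"Suggested starting eps: {eps_guess:.3f}")
```

Plotting `k_distances` and marking `elbow_index` is a quick way to confirm that the suggested value sits where the curve starts to rise sharply.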
**Q2: What are the advantages of DBSCAN over K-means?**
**Answer:**
- **Arbitrary cluster shapes:** Can find non-spherical clusters
- **Noise handling:** Identifies outliers explicitly
- **No need for K:** Discovers number of clusters automatically
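- **Density-based:** Defines clusters as dense regions separated by sparse ones, so clusters need not be similar in size or shape
- **Mostly order-invariant:** Core points and noise are identified the same way regardless of point order; only border points reachable from more than one cluster can change assignment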
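A small comparison on non-spherical data illustrates the first three points. This is a sketch under assumptions: the two-moons dataset, `eps=0.3`, and `min_samples=5` are chosen for illustration, and the adjusted Rand index (ARI) is used because true labels are available for synthetic data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: non-spherical clusters with known labels
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=42)
X = StandardScaler().fit_transform(X)

# K-means needs K up front and assumes roughly spherical clusters
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN infers the number of clusters and labels outliers as -1
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_true, dbscan_labels)
n_found = len(set(dbscan_labels)) - (1 if -1 in dbscan_labels else 0)
n_noise = int((dbscan_labels == -1).sum())

print(f"K-means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN  ARI: {ari_dbscan:.3f} ({n_found} clusters, {n_noise} noise points)")
```

ARI compares a clustering with the ground truth, so unlike silhouette it does not penalize DBSCAN for finding curved clusters.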
**Q3: What are the limitations of DBSCAN?**
**Answer:**
- **Parameter sensitivity:** Results highly dependent on eps and min_samples
- **Varying densities:** Struggles with clusters of significantly different densities
- **High-dimensional data:** Distance measures become less meaningful
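- **Border points:** A border point within eps of core points from two clusters is assigned to whichever cluster reaches it first
- **Chain clusters:** A thin bridge of dense points can merge otherwise distinct clusters, similar to the chaining effect of single linkage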
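The high-dimensional limitation can be made concrete with a short experiment. The sketch below assumes uniformly random points and uses a hypothetical helper, `relative_contrast`, to measure how much the farthest and nearest distances from one query point differ; as dimensionality grows the distances bunch together, which leaves little room for a useful `eps`.

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(n_dims, n_points=500):
    """Return (max - min) / min of the distances from one point to all others."""
    X = rng.uniform(size=(n_points, n_dims))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Illustrative dimensionalities (assumption)
contrast = {d: relative_contrast(d) for d in (2, 10, 100, 1000)}
for n_dims, value in contrast.items():
    print(f"{n_dims:>5} dims: relative contrast = {value:.2f}")
```

A common response is to reduce dimensionality (for example with PCA) before running DBSCAN.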
**Q4: How does DBSCAN handle clusters of different densities?**
**Answer:**
- **Challenge:** Single eps value may not work for all densities
- **Solutions:**
  - **OPTICS:** Extension that handles varying densities
  - **HDBSCAN:** Hierarchical version that extracts clusters at different density levels
  - **Multiple runs:** Run DBSCAN with different parameters for different density regions
- **Limitation:** Standard DBSCAN requires manual parameter tuning for each density level
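The single-eps problem can be reproduced with two synthetic groups of very different density. This is a sketch under assumptions: the blob locations and spreads, `eps=0.1`, and the OPTICS settings (`min_samples=10`, `xi=0.05`) are illustrative, and what OPTICS recovers depends on those settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, OPTICS

rng = np.random.default_rng(42)

# Assumed toy data: one tight group and one diffuse group, 200 points each
dense = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(200, 2))
sparse = rng.normal(loc=(6.0, 6.0), scale=1.0, size=(200, 2))
X = np.vstack([dense, sparse])

def count_clusters(labels):
    """Number of clusters, not counting the noise label -1."""
    return len(set(labels)) - (1 if -1 in labels else 0)

# eps tuned to the dense group: the diffuse group never reaches min_samples
db_labels = DBSCAN(eps=0.1, min_samples=10).fit_predict(X)
sparse_noise_fraction = float((db_labels[200:] == -1).mean())

# OPTICS orders points by reachability and does not need one global eps
optics_labels = OPTICS(min_samples=10, xi=0.05).fit_predict(X)

print(f"DBSCAN clusters: {count_clusters(db_labels)}, "
      f"diffuse group labeled noise: {sparse_noise_fraction:.0%}")
print(f"OPTICS clusters: {count_clusters(optics_labels)}")
```

Raising `eps` would recover the diffuse group in this well-separated toy case; the difficulty becomes real when dense clusters sit close together and a larger `eps` would merge them.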
**Q5: When should you use DBSCAN vs other clustering algorithms?**
**Answer:**
**Use DBSCAN when:**
- Data has noise/outliers
- Clusters have arbitrary shapes
- Don't know number of clusters in advance
- Need to identify outliers explicitly
- Clusters have similar densities
**Use other algorithms when:**
- Spherical clusters expected (K-means)
- Hierarchical structure needed (Agglomerative)
- Very high-dimensional data (reduce dimensionality first, or try spectral clustering)
- Clusters have widely varying densities (HDBSCAN)
- Need fully deterministic results (Agglomerative; K-means only with a fixed random seed)
---
# 🎉 COMPREHENSIVE MACHINE LEARNING GUIDE COMPLETE!
This guide has covered 17 major machine learning algorithms from your cheatsheet.
## 📚 What We Covered:
1. **Linear Regression** - Foundation of predictive modeling
2. **Logistic Regression** - Probabilistic classification
3. **Decision Tree** - Interpretable rule-based learning
4. **Random Forest** - Robust ensemble method
5. **Gradient Boosting** - State-of-the-art performance
6. **SVM** - Maximum margin classification
7. **KNN** - Simple instance-based learning
8. **Naive Bayes** - Fast probabilistic classification
9. **K-means** - Popular partitioning clustering
10. **Hierarchical Clustering** - Tree-based cluster discovery
11. **PCA** - Dimensionality reduction
12. **Neural Networks (MLP)** - Universal function approximators
13. **CNN** - Spatial pattern recognition
14. **RNN** - Sequential data processing
15. **Transformers** - Attention-based sequence modeling
16. **Autoencoders** - Unsupervised representation learning
17. **DBSCAN** - Density-based clustering
## 🛠️ For Each Algorithm:
- **Mathematical foundations** with formulas
- **Complete runnable code** with real datasets
- **Visualizations** and diagrams
- **Performance comparisons**
- **Hyperparameter tuning**
- **Real-world applications**
- **Comprehensive interview Q&A**
## 🚀 Next Steps:
1. **Practice** with the provided code examples
2. **Experiment** with different datasets and parameters
3. **Combine algorithms** in ensemble methods
4. **Explore advanced topics** like reinforcement learning, GANs, etc.
5. **Build projects** to apply these algorithms to real problems
This comprehensive guide provides everything needed to understand, implement, and discuss these machine learning algorithms effectively!