
# Week 4, Day 7: Review and Feedback Session

## Session Overview
This session reviews the key concepts from Week 4 and provides practice exercises for each topic:

1. K-means Clustering
2. Hierarchical Clustering
3. Principal Component Analysis
4. t-SNE and UMAP
5. Anomaly Detection

## Learning Objectives
- Reinforce core unsupervised learning concepts
- Practice choosing the right technique for a given problem
- Strengthen implementation skills with scikit-learn and umap-learn
- Prepare for Week 5

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from umap import UMAP  # provided by the umap-learn package (pip install umap-learn)
from sklearn.ensemble import IsolationForest

## 1. Clustering Review

In [None]:
def clustering_review():
    """Compare K-means and hierarchical clustering on three synthetic 2D blobs."""
    # Generate synthetic data
    np.random.seed(42)
    n_samples = 300

    # Create three distinct clusters centered near (0, 0), (5, 5), and (2.5, 7.5)
    cluster1 = np.random.normal(0, 1, (n_samples, 2))
    cluster2 = np.random.normal(5, 1, (n_samples, 2))
    cluster3 = np.random.normal(2.5, 1, (n_samples, 2)) + np.array([0, 5])

    # Combine clusters
    X = np.vstack([cluster1, cluster2, cluster3])

    # Scale features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

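    # Apply different clustering methods
    # n_init is set explicitly because its default changed across scikit-learn versions
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
    # AgglomerativeClustering uses Ward linkage by default
    hierarchical = AgglomerativeClustering(n_clusters=3)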

    kmeans_labels = kmeans.fit_predict(X_scaled)
    hierarchical_labels = hierarchical.fit_predict(X_scaled)

    # Visualize results
    plt.figure(figsize=(15, 5))

    # Original data
    plt.subplot(131)
    plt.scatter(X[:, 0], X[:, 1], alpha=0.5)
    plt.title('Original Data')

    # K-means
    plt.subplot(132)
    plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
    plt.title('K-means Clustering')

    # Hierarchical
    plt.subplot(133)
    plt.scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis')
    plt.title('Hierarchical Clustering')

    plt.tight_layout()
    plt.show()

clustering_review()
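
The review above fixes `n_clusters=3` because the synthetic data was built with three blobs. When the number of clusters is not known in advance, the elbow method (quiz question 9) and the silhouette score are two common heuristics. The following is a minimal sketch, assuming blobs with the same centers as in `clustering_review`; the helper name `scan_cluster_counts` and the range of `k` values are illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def scan_cluster_counts(X, k_values):
    """Return (inertias, silhouettes) for each k in k_values (hypothetical helper)."""
    inertias, silhouettes = [], []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=42)
        labels = model.fit_predict(X)
        inertias.append(model.inertia_)                  # within-cluster sum of squares
        silhouettes.append(silhouette_score(X, labels))  # defined only for k >= 2
    return inertias, silhouettes


# Three blobs with the same centers as in clustering_review (assumed setup)
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(0, 1, (300, 2)),
    rng.normal(5, 1, (300, 2)),
    rng.normal(2.5, 1, (300, 2)) + np.array([0, 5]),
])
X_scaled = StandardScaler().fit_transform(X)

k_values = list(range(2, 7))
inertias, silhouettes = scan_cluster_counts(X_scaled, k_values)

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.1f}, silhouette={sil:.3f}")
```

Plotting `inertias` against `k_values` and looking for the bend gives the elbow. The silhouette score offers a second opinion, and the two heuristics need not agree when clusters overlap.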

## 2. Dimensionality Reduction Review

In [None]:
def dimensionality_reduction_review():
    """Project 50-dimensional synthetic data to 2D with PCA, t-SNE, and UMAP."""
    # Generate high-dimensional data
    np.random.seed(42)
    n_samples = 1000
    n_features = 50

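    # Create data with underlying structure
    X = np.random.randn(n_samples, n_features)
    # Make feature 1 a noisy copy of feature 0 (strongly correlated)
    X[:, 1] = X[:, 0] + np.random.randn(n_samples) * 0.1
    # Feature 2 is the difference of two nearly identical features plus noise,
    # so it ends up as low-variance noise rather than a strong signal
    X[:, 2] = X[:, 0] - X[:, 1] + np.random.randn(n_samples) * 0.1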

    # Apply different reduction methods
    pca = PCA(n_components=2)
    tsne = TSNE(n_components=2, random_state=42)
    umap = UMAP(n_components=2, random_state=42)

    X_pca = pca.fit_transform(X)
    X_tsne = tsne.fit_transform(X)
    X_umap = umap.fit_transform(X)

    # Visualize results
    plt.figure(figsize=(15, 5))

    # PCA
    plt.subplot(131)
    plt.scatter(X_pca[:, 0], X_pca[:, 1], alpha=0.5)
    plt.title('PCA')

    # t-SNE
    plt.subplot(132)
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.5)
    plt.title('t-SNE')

    # UMAP
    plt.subplot(133)
    plt.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.5)
    plt.title('UMAP')

    plt.tight_layout()
    plt.show()

    # Print PCA explained variance
    print("PCA explained variance ratio:", pca.explained_variance_ratio_)

dimensionality_reduction_review()
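
The function above prints the explained variance ratio for two components only. Because 47 of the 50 features are independent noise, the 2D PCA plot should be expected to capture only a small share of the total variance. A rough way to check this is to fit PCA with all components and inspect the cumulative ratio. This sketch assumes the same data-generating steps as `dimensionality_reduction_review`; the 90% threshold is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.decomposition import PCA

# Recreate the dataset from dimensionality_reduction_review (assumed identical)
np.random.seed(42)
n_samples, n_features = 1000, 50
X = np.random.randn(n_samples, n_features)
X[:, 1] = X[:, 0] + np.random.randn(n_samples) * 0.1
X[:, 2] = X[:, 0] - X[:, 1] + np.random.randn(n_samples) * 0.1

# Fit PCA with all components to see the full variance spectrum
pca_full = PCA().fit(X)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(f"Variance captured by 2 components: {cumulative[1]:.1%}")
print(f"Components needed for 90% variance: {n_for_90}")
```

A low two-component ratio is a reminder that a tidy-looking 2D scatter plot can hide most of the variation in the data.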

## 3. Anomaly Detection Review

In [None]:
def anomaly_detection_review():
    """Flag outliers in synthetic 2D data with Isolation Forest and plot the result."""
    # Generate data with anomalies
    np.random.seed(42)
    n_samples = 300

    # Normal data
    X_normal = np.random.normal(0, 1, (n_samples, 2))

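    # Add outliers drawn uniformly from [-4, 4]; some fall inside the normal
    # cloud and cannot be told apart from normal points
    X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
    X = np.vstack([X_normal, X_outliers])

    # Apply Isolation Forest
    # contamination=0.1 flags roughly 10% of the 320 points (about 32),
    # although only 20 (6.25%) were injected as outliers
    iso_forest = IsolationForest(contamination=0.1, random_state=42)
    y_pred = iso_forest.fit_predict(X)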

    # Visualize results
    plt.figure(figsize=(10, 6))
    plt.scatter(X[y_pred == 1, 0], X[y_pred == 1, 1], label='Normal')
    plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1],
                color='red', label='Anomaly')
    plt.title('Anomaly Detection Results')
    plt.legend()
    plt.show()

    # Print statistics
    print("Number of detected anomalies:", (y_pred == -1).sum())

anomaly_detection_review()
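
Since the data here is synthetic, the rows that were injected as outliers are known, so the detector can be scored against them. The sketch below assumes the same data-generating steps as `anomaly_detection_review`, but sets `contamination` to the true injected fraction (20 of 320) rather than 0.1. The variable names `y_true` and `y_flagged` are illustrative. Recall is unlikely to be perfect, because uniform outliers in [-4, 4] can land inside the normal cloud.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

# Recreate the data from anomaly_detection_review (assumed identical)
np.random.seed(42)
X_normal = np.random.normal(0, 1, (300, 2))
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

# Ground truth: 1 for injected outliers, 0 for normal points
y_true = np.concatenate([
    np.zeros(len(X_normal), dtype=int),
    np.ones(len(X_outliers), dtype=int),
])

# Use the true injected fraction instead of the 0.1 used in the review cell
contamination = len(X_outliers) / len(X)
iso_forest = IsolationForest(contamination=contamination, random_state=42)
y_flagged = (iso_forest.fit_predict(X) == -1).astype(int)

precision = precision_score(y_true, y_flagged)
recall = recall_score(y_true, y_flagged)

print("Flagged points:", int(y_flagged.sum()))
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

In real datasets the true contamination rate is usually unknown, so this kind of check is only available when labeled anomalies exist for validation.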

## Week 4 Review Quiz

### Multiple Choice Questions

1. Which clustering method requires specifying the number of clusters in advance?
   - a) DBSCAN
   - b) K-means
   - c) Hierarchical clustering
   - d) Mean shift

2. What is the main advantage of hierarchical clustering?
   - a) Speed
   - b) Dendrogram visualization
   - c) Scalability
   - d) Simplicity

3. What does PCA maximize?
   - a) Cluster separation
   - b) Variance explained
   - c) Distance preservation
   - d) Data density

4. Which of these methods is designed primarily for visualizing high-dimensional data in two or three dimensions?
   - a) PCA
   - b) t-SNE
   - c) K-means
   - d) IsolationForest

5. What is the main limitation of t-SNE?
   - a) Linear only
   - b) Slow computation
   - c) No parameters
   - d) Requires labels

6. Which is NOT an application of clustering?
   - a) Customer segmentation
   - b) Image compression
   - c) Time series prediction
   - d) Document grouping

7. What does UMAP aim to preserve?
   - a) Only local structure
   - b) Only global structure
   - c) Both local and global structure
   - d) Neither

8. Which approach is most suitable for detecting anomalies in streaming data?
   - a) t-SNE
   - b) Hierarchical clustering
   - c) Statistical methods
   - d) UMAP

9. What is the elbow method used for?
   - a) Feature selection
   - b) Optimal cluster number
   - c) Anomaly detection
   - d) Dimension selection

10. Which is true about unsupervised learning?
    - a) Requires labeled data
    - b) Finds hidden patterns
    - c) Always accurate
    - d) Linear only

Answers: 1-b, 2-b, 3-b, 4-b, 5-b, 6-c, 7-c, 8-c, 9-b, 10-b
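
Question 8 names statistical methods as the most suitable option for streaming data. One common reading of that answer is a running z-score: keep the mean and variance up to date one observation at a time, and flag values that fall far from the running mean. The sketch below uses Welford's online update; the class name `StreamingZScoreDetector`, the threshold of 3, the warm-up length, and the sample stream are all illustrative assumptions rather than material from the course.

```python
import math


class StreamingZScoreDetector:
    """Flags values far from the running mean (hypothetical helper class)."""

    def __init__(self, threshold=3.0, warmup=10):
        self.threshold = threshold  # |z| above this counts as anomalous
        self.warmup = warmup        # observations required before flagging
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Score x against past data, then fold it into the statistics."""
        is_anomaly = False
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(x - self.mean) / std > self.threshold:
                is_anomaly = True
        # Welford's online update of mean and variance
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        return is_anomaly


detector = StreamingZScoreDetector(threshold=3.0, warmup=10)
stream = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 10.0, 25.0, 10.1]
flags = [detector.update(x) for x in stream]
print(flags)  # only the 25.0 reading is flagged
```

The detector needs constant memory and one pass over the data, which is what makes it a fit for streams. Note that this version folds flagged values into the running statistics; skipping the update for anomalies is a reasonable alternative when outliers should not shift the baseline.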

## Week 4 Summary

### Key Concepts Covered:
1. Clustering algorithms and applications
2. Dimensionality reduction techniques
3. Anomaly detection methods
4. Visualization approaches

### Preparation for Week 5:
- Review challenging concepts
- Practice implementation
- Prepare for deep learning
- Review Python and libraries

### Additional Resources:
- Scikit-learn clustering guide: https://scikit-learn.org/stable/modules/clustering.html
- UMAP documentation: https://umap-learn.readthedocs.io/
- Anomaly detection tutorial: https://scikit-learn.org/stable/modules/outlier_detection.html