# Dimensionality reduction

- Dimensionality reduction is the process of reducing the number of variables under consideration by obtaining a set of principal variables.
- Used as a pre-processing step
- Keep a smaller number of variables while retaining as much information as possible
- Reduce overfitting 
- Obtain independent features
- Lower computational intensity
- Enable visualization of high-dimensional features
- Compression => loss of information => possible loss of performance
- Always measure model performance before and after dimensionality reduction
- 2 Types:
    1. Feature Selection:
        - Selecting a subset of existing features without any transformation (based on predictive power)
        - Looking for best combination of features (not best individual features)
    2. Feature Extraction:
        - Involves transformation
        - Creating new features from combination of existing features
        - 2 types:
            1. Linear projections:
                - faster
                - deterministic
                - Principal Component Analysis (PCA)
                - `from sklearn.decomposition import PCA`
                - Linear Discriminant Analysis (LDA)
                - `from sklearn.discriminant_analysis import LinearDiscriminantAnalysis`
            2. Non-Linear projections:
                - slower
                - often non-deterministic (e.g. t-SNE depends on random initialization)
                - Isomap
                - `from sklearn.manifold import Isomap`
                - t-distributed Stochastic Neighbor Embedding (t-SNE)
                - `from sklearn.manifold import TSNE`
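
To make the "measure performance before and after" advice concrete, here is a minimal sketch comparing a baseline, feature selection, and feature extraction. The Iris dataset, the logistic regression classifier, and the choice of keeping 2 features are illustrative assumptions, not part of the notes above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption): 150 samples, 4 features
X, y = load_iris(return_X_y=True)

pipelines = {
    # Baseline: no dimensionality reduction
    "all 4 features": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    # Feature selection: keep 2 of the existing features, untransformed
    "feature selection (2 kept)": make_pipeline(
        StandardScaler(), SelectKBest(f_classif, k=2),
        LogisticRegression(max_iter=1000)
    ),
    # Feature extraction: build 2 new features from combinations of all 4
    "feature extraction (2 PCs)": make_pipeline(
        StandardScaler(), PCA(n_components=2),
        LogisticRegression(max_iter=1000)
    ),
}

# Compare mean 5-fold cross-validated accuracy before and after reduction
scores = {}
for name, pipeline in pipelines.items():
    scores[name] = cross_val_score(pipeline, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Putting the reduction step inside the pipeline means it is re-fitted on each training fold, so the comparison is not contaminated by the held-out data.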

# Principal Component Analysis (PCA)

- Family: Linear methods.
- Intuition: Principal components are directions of highest variability in data.
- Reduction = keeping only the top N principal components.
- Assumption: Relationships between features are linear; works best when the data is roughly normally distributed.
- Caveat: Very sensitive to outliers.

In [1]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=3)  # keep the top 3 principal components
# X_reduced = pca.fit_transform(X)
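
A runnable version of the cell above might look like the following sketch. The Iris dataset and the standardization step are assumptions added for illustration; `explained_variance_ratio_` shows how much variability each kept component captures.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative dataset (an assumption): 150 samples, 4 features
X, _ = load_iris(return_X_y=True)

# PCA is driven by variance, so features are put on a common scale first
X_scaled = StandardScaler().fit_transform(X)

# Keep only the top 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                      # (n_samples, 2)
print(pca.explained_variance_ratio_)        # share of variance per component
print(pca.explained_variance_ratio_.sum())  # total share of variance retained
```

The retained-variance total is one way to judge how much information the reduction throws away.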

# Clustering

- Cluster = Group of entities or events sharing similar attributes
- Clustering = applying Machine Learning algorithms to discover clusters automatically.
- Very sensitive to pre-processing
- Unsupervised Learning
- Popular models:
    - KMeans clustering:
        - No. of clusters must be specified
        - Works on linear separation boundaries
        - Fast
    - Spectral clustering:
        - No. of clusters must be specified
        - Very non-linear separation boundaries
        - Slow
    - DBSCAN:
        - No. of clusters is figured out by itself
        - Non-linear separation boundaries
        - May leave many points unassigned to any cluster (labeled as noise)
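
The three models listed above can be tried side by side on a toy dataset with non-linear cluster shapes. This is a sketch only: the two-moons data and every hyperparameter value (`eps`, `min_samples`, `n_neighbors`) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans, SpectralClustering
from sklearn.datasets import make_moons

# Two interleaved half-circles: the clusters are not linearly separable
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# KMeans: the number of clusters must be given up front
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Spectral clustering: also needs the number of clusters
spectral_labels = SpectralClustering(
    n_clusters=2, affinity="nearest_neighbors", n_neighbors=10, random_state=0
).fit_predict(X)

# DBSCAN: finds the number of clusters itself; label -1 means "no cluster"
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels.tolist()) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))

print("KMeans clusters:", len(set(kmeans_labels.tolist())))
print("Spectral clusters:", len(set(spectral_labels.tolist())))
print("DBSCAN clusters:", n_dbscan_clusters, "| unassigned points:", n_noise)
```

Plotting `X` colored by each label array is a quick way to see which boundaries each model draws.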
        

# Determining the optimal number of clusters

- Elbow method:
    - Scan a range of candidate cluster counts.
    - For each count, measure the sum of squared distances from each point to the center of its cluster.
    - Plot the sum of squared distances against the number of clusters.
    - The optimal number is at the last bend ("elbow") before the curve becomes almost linear.

<center><img src="images/04.jpg"  style="width: 200px; height: 200px;"/></center>
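
The elbow scan can be sketched with KMeans, whose `inertia_` attribute is the sum of squared distances of samples to their closest cluster center. The synthetic three-blob dataset and the range of 1 to 8 clusters are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated groups (an illustrative assumption)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.8,
    random_state=0,
)

# Scan candidate cluster counts and record the sum of squared distances
cluster_counts = list(range(1, 9))
inertias = []
for k in cluster_counts:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Print the curve; plotting inertias against cluster_counts shows the elbow
for k, inertia in zip(cluster_counts, inertias):
    print(f"k={k}: {inertia:.1f}")
```

On data like this, the drop in inertia should flatten sharply after the true number of groups, which is where the elbow appears.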


# Tuning clusters

1. Unsupervised (no "ground truth", no expectations on how the samples should be clustered)
    - Variance Ratio Criterion: 
        - "What is the average distance of each point to the center of the cluster AND what is the distance between the clusters?"
        - `sklearn.metrics.calinski_harabasz_score`
    - Silhouette score 
        - "How close is each point to its own cluster VS how close it is to the others?"
        - `sklearn.metrics.silhouette_score`
2. Supervised ("ground truth"/expectations provided or evaluating result with supervision)
    - Mutual information (MI) criterion: 
        - `sklearn.metrics.mutual_info_score`
    - Homogeneity score: 
        - `sklearn.metrics.homogeneity_score`
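
The four scores above can be computed as in the sketch below. The synthetic dataset and the KMeans model are assumptions for illustration; the Variance Ratio Criterion is imported as `calinski_harabasz_score`, the spelling used in current scikit-learn releases.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    homogeneity_score,
    mutual_info_score,
    silhouette_score,
)

# Synthetic data with known group membership (an illustrative assumption)
X, y_true = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=0.8,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Unsupervised: only the data and the predicted cluster labels are needed
silhouette = silhouette_score(X, labels)             # in [-1, 1], higher is better
variance_ratio = calinski_harabasz_score(X, labels)  # higher is better

# Supervised: compare predicted labels against the "ground truth"
mutual_info = mutual_info_score(y_true, labels)      # >= 0, higher is better
homogeneity = homogeneity_score(y_true, labels)      # in [0, 1], higher is better

print(f"silhouette={silhouette:.3f}, variance ratio={variance_ratio:.1f}")
print(f"mutual information={mutual_info:.3f}, homogeneity={homogeneity:.3f}")
```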

# Anomaly detection

- Detecting unusual entities or events (rare events).
- Hard to define what's odd, but possible to define what's normal.
- Use cases :
    - Credit card fraud detection
    - Network security monitoring
    - Heart-rate monitoring
- Approaches:
    - Thresholding: flag values that go beyond the threshold that stable values stay within over time
    <center><img src="images/05.jpg"  style="width: 200px; height: 200px;"/></center>

    - Rate of change: monitor the derivative of the signal over some interval
    <center><img src="images/06.jpg"  style="width: 200px; height: 200px;"/></center>

    - Shape monitoring: inspect the shape of the waveform
    <center><img src="images/07.jpg"  style="width: 200px; height: 200px;"/></center>

- Common Models:
    - Robust covariance 
        - assumes normal distribution
        - `from sklearn.covariance import EllipticEnvelope`
    - Isolation Forest 
        - powerful
        - computationally demanding / slower
        - `from sklearn.ensemble import IsolationForest`
    - One-Class SVM 
        - sensitive to outliers
        - many false negatives
        - `from sklearn.svm import OneClassSVM`
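
The thresholding and rate-of-change approaches listed above need no model at all. Here is a minimal NumPy sketch; the signal values, the 50 to 100 band, and the jump limit of 20 are made-up numbers for illustration only.

```python
import numpy as np

# Hypothetical heart-rate-like signal with one spike and one dip
signal = np.array([70, 72, 71, 73, 72, 120, 74, 72, 71, 40, 70])

# Thresholding: flag samples outside the band that stable values stay within
lower, upper = 50, 100
threshold_idx = np.where((signal < lower) | (signal > upper))[0]

# Rate of change: flag samples whose jump from the previous sample is too large
max_jump = 20
jumps = np.abs(np.diff(signal))            # jumps[i] = |signal[i+1] - signal[i]|
jump_idx = np.where(jumps > max_jump)[0] + 1  # index of the sample after the jump

print("outside threshold:", threshold_idx.tolist())  # [5, 9]
print("abrupt change at:", jump_idx.tolist())         # [5, 6, 9, 10]
```

Note that the rate-of-change check fires twice per anomaly here: once on the way out of the stable range and once on the way back.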

# Evaluation

- Precision = Of all PREDICTED anomalies, how many are TRUE anomalies?
- Recall = Of all TRUE anomalies, how many did I manage to detect?
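
Both definitions reduce to simple counts. The sketch below works them out by hand and checks the result against scikit-learn; the label arrays are made up for illustration, with 1 marking an anomaly.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: 1 = anomaly, 0 = normal
y_true = np.array([0, 0, 1, 1, 1, 0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1, 0, 0, 0, 0, 1, 0])

true_positives = int(np.sum((y_pred == 1) & (y_true == 1)))   # detected anomalies
false_positives = int(np.sum((y_pred == 1) & (y_true == 0)))  # false alarms
false_negatives = int(np.sum((y_pred == 0) & (y_true == 1)))  # missed anomalies

# Precision: share of predicted anomalies that are true anomalies
precision = true_positives / (true_positives + false_positives)
# Recall: share of true anomalies that were detected
recall = true_positives / (true_positives + false_negatives)

print(precision, recall)  # 0.75 0.6

# The same numbers from scikit-learn
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
```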

# Modeling

In [2]:
# # Import Package
# from sklearn.ensemble import IsolationForest

# # Provide model and hyperparameters
# algorithm = IsolationForest()

# # Fit the model
# algorithm.fit(X_train)

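# # Apply the model and detect the outliers (predict returns -1 for outliers, +1 for inliers)
# y_predicted = algorithm.predict(X_test)

# # Evaluate Model (assumes y_test uses the same -1/+1 encoding, so anomalies are pos_label=-1)
# from sklearn.metrics import confusion_matrix, precision_score, recall_score
# confusion_matrix(y_test, y_predicted)
# precision_score(y_test, y_predicted, pos_label=-1)
# recall_score(y_test, y_predicted, pos_label=-1)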
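
For a version of this workflow that runs end to end, the sketch below generates synthetic data. The data generator `make_data`, the `contamination` value, and the 0/1 label encoding are all assumptions for illustration. `IsolationForest.predict` returns +1 for inliers and -1 for outliers, so the output is mapped to 1 = anomaly before scoring.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix, precision_score, recall_score

rng = np.random.RandomState(0)

def make_data(n_normal, n_anomalies):
    """Hypothetical generator: normal points near the origin, anomalies on a far ring."""
    normal = rng.normal(0.0, 1.0, size=(n_normal, 2))
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_anomalies)
    radii = rng.uniform(6.0, 9.0, size=n_anomalies)
    anomalies = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
    X = np.vstack([normal, anomalies])
    y = np.concatenate([np.zeros(n_normal, dtype=int),
                        np.ones(n_anomalies, dtype=int)])  # 1 = anomaly
    return X, y

X_train, _ = make_data(500, 25)      # labels are not used for fitting
X_test, y_test = make_data(200, 10)

# Provide model and hyperparameters (contamination = expected anomaly share)
algorithm = IsolationForest(contamination=0.05, random_state=0)
algorithm.fit(X_train)

# predict returns +1 for inliers and -1 for outliers; map to 1 = anomaly
raw = algorithm.predict(X_test)
y_predicted = (raw == -1).astype(int)

print(confusion_matrix(y_test, y_predicted))
precision = precision_score(y_test, y_predicted)
recall = recall_score(y_test, y_predicted)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Mapping the predictions to the label encoding before scoring avoids silently computing precision and recall for the inlier class.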


# Selecting the right model

- Do we have a target variable?
    - Yes (Supervised)
        - Target variable is a category (Classification)
        - Target variable is a number (Regression)
    - No (Unsupervised)
        - Simplify feature space (Dimensionality Reduction)
        - Find groups of similar records (Clustering)
        - Find unusual records (Anomaly Detection)

- Priority: (Simplicity first, move up the complexity ladder as needed)
    - Interpretable models
        - Simple
        - Linear models (Linear regression, Logistic regression, Lasso, Ridge)
        - Decision Trees
    - Well performing models
        - Complex
        - Tree ensembles (Random Forests, Gradient Boosted Trees)
        - Artificial Neural Networks

- Multiple Metrics:
    - Satisficing metrics:
        - Cut-off criteria that every candidate model needs to meet.
        - e.g. minimum accuracy, maximum execution time, etc.
        - Multiple satisficing metrics are possible
    - Optimizing metric:
        - Ultimate priority
        - e.g. "minimize false positives", "maximize recall"
        - "There can be only one"
    - Final model:
        - Passes the bar on all satisficing metrics
        - Has the best score on the optimizing metric
- Interpretation
    - Global
        - Decision-making rules 
        - Common approaches:
            - Visualizing a decision tree
            - Plotting feature importance
    - Local
        - "Why was this datapoint classified in this way?"
        - LIME algorithm (Local Interpretable Model-Agnostic Explanations)
- Data always changes, so the model must change too. The whole process is therefore repeated from time to time.
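
The "cut-off criteria first, then a single optimizing metric" rule from the Multiple Metrics item above can be sketched in plain Python. The candidate models, their scores, and the cut-off values are hypothetical numbers invented for illustration, not measured results.

```python
# Hypothetical candidates with made-up scores (for illustration only)
candidates = [
    {"name": "logistic_regression", "accuracy": 0.91, "latency_ms": 2.0, "recall": 0.78},
    {"name": "decision_tree", "accuracy": 0.88, "latency_ms": 1.0, "recall": 0.80},
    {"name": "random_forest", "accuracy": 0.94, "latency_ms": 35.0, "recall": 0.85},
    {"name": "neural_network", "accuracy": 0.96, "latency_ms": 120.0, "recall": 0.90},
]

# Cut-off criteria: every candidate must meet all of them (assumed values)
MIN_ACCURACY = 0.90
MAX_LATENCY_MS = 50.0

def passes_cutoffs(model):
    """True when the model meets every cut-off criterion."""
    return model["accuracy"] >= MIN_ACCURACY and model["latency_ms"] <= MAX_LATENCY_MS

eligible = [m for m in candidates if passes_cutoffs(m)]

# Optimizing metric: there can be only one; here, "maximize recall"
best = max(eligible, key=lambda m: m["recall"])

print([m["name"] for m in eligible])  # ['logistic_regression', 'random_forest']
print(best["name"])                   # random_forest
```

The neural network has the best recall but is excluded by the latency cut-off, which is exactly the trade-off the two kinds of metric are meant to make explicit.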