# Module 1: Introduction to Scikit-Learn

## Section 2: Supervised Learning Algorithms

### Part 9: Linear Discriminant Analysis (LDA)

In this part, we will explore Linear Discriminant Analysis (LDA), a popular linear classification algorithm used to find a linear combination of features that characterizes or separates two or more classes. LDA is particularly useful for dimensionality reduction and can be used as a preprocessing step for other classifiers.

### 9.1 Understanding Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised learning algorithm that aims to find a linear combination of features that maximizes the separation between multiple classes while minimizing the variance within each class. LDA achieves this by projecting the data onto a lower-dimensional subspace.

The key idea behind LDA is to transform the feature space into a new, lower-dimensional space that maximizes the separation between the classes, making classification easier. It does this by finding a set of linear discriminants that maximize the ratio of between-class variance to within-class variance. In other words, it finds the directions in the feature space that best separate the different classes of data. With K classes and d features, at most min(K - 1, d) such directions exist.

The resulting decision boundaries are linear. Between any two classes, the boundary is a line when there are two features and a hyperplane when there are more. For multiclass classification (more than two classes), LDA does not construct one single hyperplane; the feature space is divided by the pairwise boundaries, and each data point is assigned to the class with the highest posterior probability. For two classes, this amounts to determining which side of the boundary the data point falls on.
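The ratio of between-class to within-class variance is commonly written as the Fisher criterion. The symbols below ($w$ for a projection direction, $S_B$ and $S_W$ for the between-class and within-class scatter matrices, $\mu_k$ and $n_k$ for the mean and size of class $k$, $\mu$ for the overall mean) are notation introduced here for illustration only:

$$
J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w},
\qquad
S_B = \sum_{k} n_k\,(\mu_k - \mu)(\mu_k - \mu)^\top,
\qquad
S_W = \sum_{k} \sum_{i:\, y_i = k} (x_i - \mu_k)(x_i - \mu_k)^\top
$$

The directions that maximize $J(w)$ are the leading eigenvectors of $S_W^{-1} S_B$ (assuming $S_W$ is invertible).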

LDA assumes that the data in each class is normally distributed and that the covariance matrices of the different classes are equal. Under these assumptions the optimal decision boundary is linear, so LDA works best when the classes can be separated reasonably well by linear boundaries, and it can be a powerful technique when these assumptions are met. However, if the class boundaries are highly non-linear, other techniques like Support Vector Machines (SVM) with non-linear kernels or decision trees might be more appropriate.
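To see why Gaussian classes with a shared covariance matrix lead to linear boundaries, it can help to write out the discriminant score of each class. The notation here is an assumption made for this sketch: $\Sigma$ is the shared covariance matrix, $\mu_k$ the mean of class $k$, and $\pi_k$ its prior probability:

$$
\delta_k(x) = x^\top \Sigma^{-1} \mu_k \;-\; \tfrac{1}{2}\, \mu_k^\top \Sigma^{-1} \mu_k \;+\; \log \pi_k
$$

A point $x$ is assigned to the class with the largest $\delta_k(x)$. The quadratic term $x^\top \Sigma^{-1} x$ is the same for every class and cancels, so the boundary $\delta_k(x) = \delta_l(x)$ between two classes is linear in $x$.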

When LDA finds a new set of features (components) that maximize the separation between classes while minimizing the variance within each class, some information from the original features can be lost. The components in LDA are linear combinations of the original features: each component is a weighted sum of the original features, with weights chosen to optimally separate the classes. Because each component combines multiple features, it does not retain an original feature name.

While the components don't have the same interpretation as the original features, they are useful for visualization, dimensionality reduction, and constructing transformed data for classification tasks.
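To make the "weighted sum" description concrete, the following sketch reproduces the projection by hand on the Iris dataset. It assumes the default `svd` solver, for which the fitted model exposes the overall mean as `xbar_` and the weights as `scalings_`; the variable names `weights` and `X_manual` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target

# Fit LDA with the default 'svd' solver and project onto 2 components
lda = LinearDiscriminantAnalysis(n_components=2, solver='svd')
X_lda = lda.fit_transform(X, y)

# Each column of `weights` holds one component's weights for the 4 features
weights = lda.scalings_[:, :2]

# A component is a weighted sum of the (mean-centered) original features
X_manual = (X - lda.xbar_) @ weights

# Show the weight of every original feature in each component
for name, row in zip(iris.feature_names, weights):
    print(f"{name:20s} {row}")

components_match = np.allclose(X_manual, X_lda)
print("Manual projection matches transform():", components_match)
```

Reading the rows of `weights` shows how strongly each original feature contributes to each component, which is the closest thing to an interpretation the components offer.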

### 9.2 Training and Evaluation

To train an LDA model, we need a labeled dataset with the target variable and the corresponding feature values. The model learns by estimating the class means, the class prior probabilities, and a single covariance matrix shared by all classes from the training data.
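As a small sketch of what the model learns during training, the fitted estimator can be inspected directly. The example assumes the Iris dataset and passes `store_covariance=True` so that the shared covariance matrix is kept when the default `svd` solver is used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# store_covariance=True keeps the shared within-class covariance matrix
lda = LinearDiscriminantAnalysis(store_covariance=True)
lda.fit(X, y)

# One mean vector per class: shape (n_classes, n_features)
print("Class means shape:", lda.means_.shape)

# Class priors are estimated from the class frequencies by default
print("Class priors:", lda.priors_)

# A single covariance matrix shared by all classes: (n_features, n_features)
print("Shared covariance shape:", lda.covariance_.shape)
```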

Once trained, we can evaluate the model's performance using evaluation metrics suitable for classification tasks, such as accuracy, precision, recall, F1-score, or area under the ROC curve (AUC-ROC).
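A minimal evaluation sketch, assuming the Iris dataset and a stratified 70-30 split (both choices are illustrative, not prescribed by this module), could compute several of these metrics on held-out data:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the data, keeping class proportions equal in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Fit on the training set only
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

# Hard predictions for accuracy, precision, recall, and F1-score
y_pred = lda.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Class probabilities for a one-vs-rest multiclass AUC-ROC
y_proba = lda.predict_proba(X_test)
auc = roc_auc_score(y_test, y_proba, multi_class='ovr')
print("AUC-ROC (one-vs-rest):", auc)
```

Evaluating on a held-out test set, rather than on the training data, gives a fairer estimate of how the model generalizes.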

Scikit-Learn provides the LinearDiscriminantAnalysis class for performing LDA. Here's an example of how to use it:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Create an LDA model with 2 components
lda_2d = LinearDiscriminantAnalysis(n_components=2)
X_lda_2d = lda_2d.fit_transform(X, y)
# Create an LDA model with 1 component
lda_1d = LinearDiscriminantAnalysis(n_components=1)
X_lda_1d = lda_1d.fit_transform(X, y)

# Create a figure with three side-by-side subplots
fig = plt.figure(figsize=(20, 6))
# Plot the 3D scatter plot of original data
# Iris has four features; select the first three for the 3D plot
X_3d = X[:, :3]
ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, cmap='viridis', edgecolors='k')
ax1.set_xlabel('Sepal Length')
ax1.set_ylabel('Sepal Width')
ax1.set_zlabel('Petal Length')
ax1.set_title('3D Scatter Plot of Original Data')
# Plot the 2D projection after LDA with 2 components
ax2 = fig.add_subplot(132)
ax2.scatter(X_lda_2d[:, 0], X_lda_2d[:, 1], c=y, cmap='viridis', edgecolors='k')
ax2.set_xlabel('LDA Component 1')
ax2.set_ylabel('LDA Component 2')
ax2.set_title('2D Projection after LDA (2 Components)')
# Plot the 1D projection after LDA with 1 component
ax3 = fig.add_subplot(133)
ax3.scatter(X_lda_1d, np.zeros_like(X_lda_1d), c=y, cmap='viridis', edgecolors='k')
ax3.set_xlabel('LDA Component')
ax3.set_title('1D Projection after LDA (1 Component)')
plt.tight_layout()
plt.show()

This code creates a single figure with three subplots. The first subplot is the 3D scatter plot of the original data, the second subplot is the 2D projection after LDA with 2 components, and the third subplot is the 1D projection after LDA with 1 component. Each subplot shows the data points colored by their respective classes.
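To judge how much class separation each component carries, one option is to look at `explained_variance_ratio_`, which is available for the `svd` and `eigen` solvers. The following is a sketch on the same Iris data; the variable name `ratio` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Iris has 3 classes, so at most 2 discriminant components exist
lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)

# Share of the between-class variance captured by each component
ratio = lda.explained_variance_ratio_
for i, r in enumerate(ratio, start=1):
    print(f"LDA Component {i}: {r:.4f}")
```

If the first value dominates, the 1D projection already preserves most of the class separation seen in the 2D projection.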

### 9.3 Hyperparameter Tuning

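We can also perform hyperparameter tuning for Linear Discriminant Analysis (LDA). The main options of Scikit-Learn's LinearDiscriminantAnalysis class are the solver and the number of components (n_components) to project the data onto. Note that n_components only affects the output of transform; the predictions made by predict do not depend on it. Tuning n_components therefore trades off dimensionality reduction against preserved class separability only when the projected data is passed on to a later step, such as visualization or another classifier.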
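One caveat worth checking before tuning n_components: in Scikit-Learn, this parameter changes the shape of the projected data but not the class predictions. The sketch below assumes the same synthetic data settings used in this section; the names `lda_1`, `lda_2`, and `same_predictions` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic data: 3 informative features, 5 classes
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, n_repeated=0,
    n_classes=5, n_clusters_per_class=1, random_state=42)

# Two models that differ only in n_components
lda_1 = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
lda_2 = LinearDiscriminantAnalysis(n_components=2).fit(X, y)

# transform() output depends on n_components ...
print("Projected shapes:", lda_1.transform(X).shape, lda_2.transform(X).shape)

# ... but predict() does not
same_predictions = np.array_equal(lda_1.predict(X), lda_2.predict(X))
print("Identical predictions:", same_predictions)
```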

Example of hyperparameter tuning for LDA using grid search:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Generate a larger synthetic dataset
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, n_repeated=0,
    n_classes=5, n_clusters_per_class=1, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for GridSearch
param_grid = {
    'solver': ['svd', 'lsqr', 'eigen'],
    'n_components': [1, 2]
}
# Create an LDA model
lda = LinearDiscriminantAnalysis()
# Perform GridSearchCV
grid_search = GridSearchCV(lda, param_grid=param_grid, cv=3)
grid_search.fit(X_train, y_train)
# Get the best parameters and best estimator
best_params = grid_search.best_params_
best_lda = grid_search.best_estimator_

# Predict the target values for testing data
y_pred = best_lda.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
# Print the best parameters and accuracy
print("Best Parameters:", best_params)
print("Accuracy:", accuracy)

# Plot step by step
# Create an LDA model with 2 components
lda_2d = LinearDiscriminantAnalysis(n_components=2, solver='svd')
X_lda_2d = lda_2d.fit_transform(X, y)
# Create an LDA model with 1 component
lda_1d = LinearDiscriminantAnalysis(n_components=1, solver='svd')
X_lda_1d = lda_1d.fit_transform(X, y)

# Create a figure with three side-by-side subplots
fig = plt.figure(figsize=(20, 6))
# Plot the 3D scatter plot of original data
# The synthetic dataset has exactly three features, so all of them are plotted
X_3d = X[:, :3]
ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(X_3d[:, 0], X_3d[:, 1], X_3d[:, 2], c=y, cmap='viridis', edgecolors='k')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_zlabel('Feature 3')
ax1.set_title('3D Scatter Plot of Original Data')
# Plot the 2D projection after LDA with 2 components
ax2 = fig.add_subplot(132)
ax2.scatter(X_lda_2d[:, 0], X_lda_2d[:, 1], c=y, cmap='viridis', edgecolors='k')
ax2.set_xlabel('LDA Component 1')
ax2.set_ylabel('LDA Component 2')
ax2.set_title('2D Projection after LDA (2 Components)')
# Plot the 1D projection after LDA with 1 component
ax3 = fig.add_subplot(133)
ax3.scatter(X_lda_1d, np.zeros_like(X_lda_1d), c=y, cmap='viridis', edgecolors='k')
ax3.set_xlabel('LDA Component')
ax3.set_title('1D Projection after LDA (1 Component)')
plt.tight_layout()
plt.show()

A synthetic dataset with 500 samples, 3 informative features, 5 classes, and no redundant or repeated features is generated using the make_classification function. The dataset is split into training and testing sets using a 70-30 split ratio.

GridSearchCV then searches for the best parameters for the LDA model. The grid includes different solvers (svd, lsqr, eigen) and different numbers of components (1 and 2). Since n_components has no effect on predict, candidates that differ only in n_components receive identical cross-validation scores. The best estimator found by the search is used to predict the target values for the testing data, and its accuracy is calculated by comparing the predicted labels with the actual labels.

Finally, the example visualizes the projections. It creates an LDA model with 2 components and applies it to the full dataset to obtain a 2D projection. Similarly, it creates an LDA model with 1 component and obtains a 1D projection. These projections are plotted alongside the original 3D scatter plot of the dataset.
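For n_components to influence the cross-validation score, LDA can be used as a dimensionality-reduction step in front of another classifier. The sketch below is one possible setup, not the method used in the example above: the choice of KNeighborsClassifier as the downstream model and the step names `lda` and `knn` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Same synthetic data settings as in the example above
X, y = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0, n_repeated=0,
    n_classes=5, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# LDA reduces the dimensionality; KNN classifies the projected data
pipe = Pipeline([
    ('lda', LinearDiscriminantAnalysis()),
    ('knn', KNeighborsClassifier(n_neighbors=5)),
])

# With 5 classes and 3 features, at most min(5 - 1, 3) = 3 components exist
param_grid = {'lda__n_components': [1, 2, 3]}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
search.fit(X_train, y_train)

best_n = search.best_params_['lda__n_components']
test_accuracy = search.score(X_test, y_test)
print("Best n_components:", best_n)
print("Test accuracy:", test_accuracy)
```

Here the downstream classifier only sees the projected data, so each value of n_components can lead to a different score.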

The example demonstrates that LDA classifies data using linear decision boundaries, with accuracy as the performance metric. It also points to an important limitation of LDA: when the classes cannot be separated well by linear boundaries, LDA may not reach high classification accuracy. In such cases, LDA can still be valuable for dimensionality reduction and visualization, as shown by the 2D and 1D projections, which help in understanding the structure of the data.

When dealing with non-linear data, other classification models like Support Vector Machines with non-linear kernels, decision trees, or neural networks may be more suitable for accurate classification.
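As a rough illustration of this point, the sketch below compares LDA with an SVM using an RBF kernel on data that no straight line can separate. The dataset (two concentric circles from make_circles) and all parameter values are assumptions chosen for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: the classes are not linearly separable
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Linear decision boundary
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
lda_accuracy = lda.score(X_test, y_test)

# Non-linear decision boundary via the RBF kernel
svm = SVC(kernel='rbf').fit(X_train, y_train)
svm_accuracy = svm.score(X_test, y_test)

print("LDA accuracy:", lda_accuracy)
print("SVM (RBF) accuracy:", svm_accuracy)
```

On data like this, the linear boundary of LDA cannot follow the circular class structure, while the kernel SVM can.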

### 9.4 Summary

Linear Discriminant Analysis (LDA) is a technique used for both dimensionality reduction and classification. It works by finding a lower-dimensional representation of data while maximizing the distinction between different classes. LDA is particularly effective when classes are well-separated and assumes that the features are normally distributed and have equal covariance matrices across classes. 

Advantages of LDA include its ability to reduce dimensionality while maintaining class separability, which can help reduce overfitting, and the fact that each component is an explicit weighted sum of the original features. However, LDA is sensitive to outliers and produces only linear decision boundaries, which can limit its performance in complex, nonlinear data scenarios.

In summary, Linear Discriminant Analysis is a versatile tool suitable for dimensionality reduction and classification tasks, provided that its assumptions align with the data characteristics. Its ability to enhance class separability makes it valuable for diverse real-world applications, though its performance may diminish in cases of nonlinear data patterns or substantial outliers.