# Discriminant Function Analysis Using R

_Discriminant Function Analysis_ (DFA) is a statistical technique used to classify observations into predefined categories based on independent variables. It's widely used in fields like biology, finance, and marketing.

## Key Concepts

- **Dependent Variable**: A categorical variable to be predicted (e.g., species).
- **Independent Variables**: Continuous variables used for prediction (e.g., sepal length, petal width).
- **Discriminant Function**: A linear combination of independent variables to differentiate between groups:

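  $$
  D = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n + c
  $$

  where $b_1, \dots, b_n$ are the discriminant coefficients, $x_1, \dots, x_n$ are the independent variables, and $c$ is a constant.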

- **Eigenvalue**: The ratio of between-group to within-group variance along a discriminant function; a larger eigenvalue means that function separates the groups better.
- **Canonical Correlation**: Measures the correlation between discriminant scores and the groups.
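As a rough illustration of how these quantities can be read off a fitted model, the sketch below fits `lda()` from MASS to the full `iris` data. The variable names (`coefs`, `scores`, `prop_trace`, `canon_cor`) are illustrative, and computing the canonical correlation as the square root of the R² from regressing each set of scores on the group factor is one common approach, not a built-in `lda()` output.

```r
library(MASS)

fit <- lda(Species ~ ., data = iris)

# Coefficients of each discriminant function (one column per function)
coefs <- fit$scaling

# Discriminant scores of every observation on LD1 and LD2
scores <- predict(fit)$x

# Share of between-group variance captured by each function
# (the "proportion of trace" that print(fit) reports)
prop_trace <- fit$svd^2 / sum(fit$svd^2)

# Canonical correlation: sqrt(R^2) from regressing each score on the groups
canon_cor <- apply(scores, 2, function(s) {
  sqrt(summary(lm(s ~ iris$Species))$r.squared)
})

print(round(prop_trace, 4))
print(round(canon_cor, 4))
```

A first function with a share close to 1 means that most of the group separation lies along a single direction.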

## 📘 Linear Discriminant Analysis (LDA)

LDA assumes:
- Multivariate normality of the predictors within each class.
- Equal covariance matrices across groups.
- Linear decision boundaries between classes, which follow from the shared covariance matrix.
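The equal-covariance assumption can be eyeballed before fitting. The sketch below is an informal check only, not a formal test: it compares the per-class covariance matrices of `iris` through their log-determinants and per-variable variances. The names `cov_by_class`, `log_dets`, and `var_by_class` are illustrative.

```r
# Split the four numeric predictors by species
groups <- split(iris[, 1:4], iris$Species)

# One covariance matrix per class
cov_by_class <- lapply(groups, cov)

# Log-determinant of each class covariance matrix;
# very different values hint at unequal covariances
log_dets <- sapply(cov_by_class, function(S) {
  as.numeric(determinant(S, logarithm = TRUE)$modulus)
})

# Per-variable variances, one column per class
var_by_class <- sapply(cov_by_class, diag)

print(round(log_dets, 3))
print(round(var_by_class, 3))
```

If the classes differ markedly, QDA may be worth comparing against LDA.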

### Objective:
Find a linear combination that best separates the classes.

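$$
D_k(\mathbf{x}) = \mathbf{w}_k^T \mathbf{x} + c_k
$$

where $\mathbf{w}_k = \Sigma^{-1} \mu_k$ and $c_k = -\frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \ln \pi_k$, with $\Sigma$ the shared covariance matrix, $\mu_k$ the mean of class $k$, and $\pi_k$ its prior probability. An observation is assigned to the class with the largest $D_k$.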
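A minimal sketch of computing these linear scores by hand, assuming the usual plug-in estimates: class means, a pooled within-class covariance matrix, and class proportions as priors. The helper names (`mu`, `S`, `W`, `c_k`, `manual_class`) are illustrative; the comparison with `lda()` is included only as a sanity check.

```r
library(MASS)

X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)
n <- nrow(X)
g <- length(classes)

# Priors (class proportions) and class means (one row per class)
priors <- as.numeric(table(y)) / n
mu <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))

# Pooled within-class covariance matrix (denominator n - g)
S <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (n - g)
S_inv <- solve(S)

# Weight vectors (column k is w_k) and constants c_k
W <- S_inv %*% t(mu)
c_k <- -0.5 * diag(mu %*% S_inv %*% t(mu)) + log(priors)

# Linear scores D_k(x) for every observation; pick the largest per row
D <- sweep(X %*% W, 2, c_k, `+`)
manual_class <- factor(classes[max.col(D, ties.method = "first")],
                       levels = classes)

# Sanity check against MASS::lda on the same data
agreement <- mean(manual_class == predict(lda(Species ~ ., data = iris))$class)
print(agreement)
```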

### Steps in R:

```r
library(MASS)
library(caret)

data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]

# Fit LDA model
model <- lda(Species ~ ., data = train)
predicted <- predict(model, newdata = test)

# Evaluate model
confusionMatrix(predicted$class, test$Species)
```

output:
```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8278, 0.9992)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.963e-13       
                                          
                  Kappa : 0.95            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.9000
Specificity                 1.0000            0.9500           1.0000
Pos Pred Value              1.0000            0.9091           1.0000
Neg Pred Value              1.0000            1.0000           0.9524
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3000
Detection Prevalence        0.3333            0.3667           0.3000
Balanced Accuracy           1.0000            0.9750           0.9500

```

### Interpretation of the LDA Confusion Matrix (Iris Dataset):

The following confusion matrix and statistics summarize the performance of the Linear Discriminant Analysis (LDA) model on the test set of the Iris dataset:

### Confusion Matrix:

| Prediction \\ Reference | setosa | versicolor | virginica |
|------------------------|--------|------------|-----------|
| **setosa**             | 10     | 0          | 0         |
| **versicolor**         | 0      | 10         | 1         |
| **virginica**          | 0      | 0          | 9         |

- All 10 instances of class `setosa` were correctly classified.
- All 10 instances of class `versicolor` were correctly classified.
- 9 of the 10 `virginica` instances were correctly classified; 1 was misclassified as `versicolor`.

### Overall Statistics:

- **Accuracy**: 0.9667 (96.67%)
  - Indicates that 29 out of 30 predictions were correct.
- **95% Confidence Interval**: (0.8278, 0.9992)
  - With only 30 test observations the interval is fairly wide, but even its lower bound is far above the No Information Rate.
- **No Information Rate (NIR)**: 0.3333
  - The accuracy that would be achieved by always predicting the most frequent class.
- **P-Value [Acc > NIR]**: 2.963e-13
  - Very small p-value indicating that the accuracy is significantly higher than the NIR, i.e., better than always predicting the most frequent class.
- **Kappa**: 0.95
  - Reflects excellent agreement between predictions and actual class labels.
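These overall statistics can be reproduced from the confusion matrix alone. The sketch below rebuilds the matrix shown above by hand and recomputes each value with base R. The names `cm`, `kappa_hat`, and so on are illustrative. It assumes an exact binomial interval and a one-sided binomial test against the NIR, which is consistent with the reported numbers.

```r
# Confusion matrix from the output above (rows = prediction, cols = reference)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n <- sum(cm)
correct <- sum(diag(cm))

# Accuracy and No Information Rate
accuracy <- correct / n
nir <- max(colSums(cm)) / n

# Cohen's kappa: observed agreement corrected for chance agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# Exact binomial 95% CI and one-sided test of accuracy > NIR
ci <- as.numeric(binom.test(correct, n)$conf.int)
p_value <- binom.test(correct, n, p = nir, alternative = "greater")$p.value

print(round(c(accuracy = accuracy, nir = nir, kappa = kappa_hat), 4))
print(round(ci, 4))
print(signif(p_value, 4))
```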


### Class-wise Metrics:

| Metric               | setosa | versicolor | virginica |
|----------------------|--------|------------|-----------|
| Sensitivity (Recall) | 1.0000 | 1.0000     | 0.9000    |
| Specificity          | 1.0000 | 0.9500     | 1.0000    |
| Positive Predictive Value (Precision) | 1.0000 | 0.9091     | 1.0000    |
| Negative Predictive Value | 1.0000 | 1.0000     | 0.9524    |
| Balanced Accuracy    | 1.0000 | 0.9750     | 0.9500    |

- Class `setosa` was perfectly classified in all aspects.
- Class `versicolor` had one false positive (a `virginica` instance predicted as `versicolor`), lowering its precision slightly.
- Class `virginica` had one false negative, where it was predicted as `versicolor`.
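The class-wise metrics follow from one-vs-rest counts of true/false positives and negatives. The sketch below recomputes them from the confusion matrix, re-entered by hand. The names `tp`, `fp`, `fn`, `tn`, and `metrics` are illustrative.

```r
# Confusion matrix from the output above (rows = prediction, cols = reference)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(10,  0, 0,
                0, 10, 1,
                0,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

# One-vs-rest counts for each class
tp <- diag(cm)               # predicted k and truly k
fp <- rowSums(cm) - tp       # predicted k but truly another class
fn <- colSums(cm) - tp       # truly k but predicted as another class
tn <- sum(cm) - tp - fp - fn # neither predicted k nor truly k

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
ppv <- tp / (tp + fp)        # precision
npv <- tn / (tn + fn)
balanced_accuracy <- (sensitivity + specificity) / 2

metrics <- rbind(sensitivity, specificity, ppv, npv, balanced_accuracy)
print(round(metrics, 4))
```

For `versicolor`, for example, this gives 10 true positives and 1 false positive, so the precision is 10/11 ≈ 0.9091.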

### Conclusion:

The LDA model performs excellently, classifying 29 of the 30 held-out test observations correctly. The single misclassification occurred between `versicolor` and `virginica`, which are known to have overlapping feature distributions. All other predictions were correct, indicating strong class separability and a model that generalizes well to the test set.


## 📘 Quadratic Discriminant Analysis (QDA)

QDA allows:
- Unequal covariance matrices between groups.
- Non-linear class boundaries.

### Discriminant Function:
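Each class $k$ has its own quadratic discriminant function, built from its mean $\mu_k$, covariance matrix $\Sigma_k$, and prior probability $\pi_k$; an observation is assigned to the class with the largest value: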

$$
D_k(x) = -\frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \ln \pi_k
$$
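As a hedged sketch, the quadratic score can be evaluated directly with base R, assuming sample means, sample covariance matrices, and class proportions as the estimates of $\mu_k$, $\Sigma_k$, and $\pi_k$. The names `qda_scores` and `manual_class` are illustrative; the comparison with `qda()` is only a sanity check.

```r
library(MASS)

X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# One column of scores D_k(x) per class
qda_scores <- sapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  mu_k <- colMeans(Xk)          # class mean
  S_k <- cov(Xk)                # class-specific covariance
  pi_k <- nrow(Xk) / nrow(X)    # prior as class proportion
  log_det <- as.numeric(determinant(S_k, logarithm = TRUE)$modulus)
  # mahalanobis() returns the squared distance (x - mu)' S^-1 (x - mu)
  -0.5 * log_det -
    0.5 * mahalanobis(X, center = mu_k, cov = S_k) +
    log(pi_k)
})

# Assign each observation to the class with the largest score
manual_class <- factor(classes[max.col(qda_scores, ties.method = "first")],
                       levels = classes)

# Sanity check against MASS::qda on the same data
agreement <- mean(manual_class == predict(qda(Species ~ ., data = iris))$class)
print(agreement)
```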

### Steps in R:

```r
library(MASS)
library(caret)

data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]

# Fit QDA model
qda_model <- qda(Species ~ ., data = train)
predicted <- predict(qda_model, newdata = test)

# Evaluate model
confusionMatrix(predicted$class, test$Species)
```
output:
```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         1
  virginica       0          0         9

Overall Statistics
                                          
               Accuracy : 0.9667          
                 95% CI : (0.8278, 0.9992)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 2.963e-13       
                                          
                  Kappa : 0.95            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.9000
Specificity                 1.0000            0.9500           1.0000
Pos Pred Value              1.0000            0.9091           1.0000
Neg Pred Value              1.0000            1.0000           0.9524
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3000
Detection Prevalence        0.3333            0.3667           0.3000
Balanced Accuracy           1.0000            0.9750           0.9500
```

### Interpretation of the QDA Confusion Matrix (Iris Dataset):

This section presents the performance evaluation of the Quadratic Discriminant Analysis (QDA) model trained on the Iris dataset.

### Confusion Matrix:

| Prediction \\ Reference | setosa | versicolor | virginica |
|------------------------|--------|------------|-----------|
| **setosa**             | 10     | 0          | 0         |
| **versicolor**         | 0      | 10         | 1         |
| **virginica**          | 0      | 0          | 9         |

- All 10 instances of the class `setosa` were classified correctly (100% accurate).
- All 10 `versicolor` instances were correctly identified.
- 9 instances of `virginica` were correctly identified; 1 instance was predicted as `versicolor`.

### Overall Performance:

- **Accuracy**: 0.9667 (96.67%)
  - The model correctly classified 29 out of 30 instances.
- **95% Confidence Interval**: (0.8278, 0.9992)
  - Indicates the range in which the true accuracy lies with 95% confidence.
- **No Information Rate (NIR)**: 0.3333
  - Accuracy of the model that always predicts the most frequent class.
- **P-Value [Acc > NIR]**: 2.963e-13
  - Extremely small p-value suggests that the accuracy is significantly higher than the NIR, i.e., better than always predicting the most frequent class.
- **Kappa**: 0.95
  - Reflects very strong agreement between the predicted and actual classifications.

### Class-wise Metrics:

| Metric               | setosa | versicolor | virginica |
|----------------------|--------|------------|-----------|
| Sensitivity (Recall) | 1.0000 | 1.0000     | 0.9000    |
| Specificity          | 1.0000 | 0.9500     | 1.0000    |
| Positive Predictive Value (Precision) | 1.0000 | 0.9091     | 1.0000    |
| Negative Predictive Value | 1.0000 | 1.0000     | 0.9524    |
| Balanced Accuracy    | 1.0000 | 0.9750     | 0.9500    |

- Class `setosa` is perfectly classified across all metrics.
- Class `versicolor` had one instance of `virginica` incorrectly predicted as `versicolor`, slightly affecting its precision.
- Class `virginica` had one false negative (predicted as `versicolor`), lowering its sensitivity to 0.90.

### Conclusion:

The QDA model provides excellent classification performance on the Iris dataset. It achieves perfect classification for the `setosa` class and performs nearly perfectly for the other two classes. The only misclassifications occurred between `versicolor` and `virginica`, which are commonly overlapping in their features. Overall, the model is highly accurate and reliable for this classification task.


## 📘 Gaussian Discriminant Analysis (Naive Bayes approximation)

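This Naive Bayes form of GDA assumes:
- Each predictor is normally distributed within each class.
- Conditional independence between features given the class (as in Naive Bayes).
- Class-specific variances $\sigma_{kj}^2$ for each feature, with no covariance terms between features (a diagonal covariance matrix per class).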

### Gaussian Naive Bayes Likelihood:

$$
P(\mathbf{x} | y_k) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{kj}^2}} \exp\left( -\frac{(x_j - \mu_{kj})^2}{2\sigma_{kj}^2} \right)
$$
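The likelihood above can be evaluated with `dnorm()` in base R. The sketch below assumes per-class sample means and standard deviations as estimates of $\mu_{kj}$ and $\sigma_{kj}$, and class proportions as priors. It works on the log scale to avoid underflow. The names `log_post`, `post`, and `nb_class` are illustrative, and this is not a reproduction of any package's internals.

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Unnormalised log-posterior: log prior + sum of per-feature log densities
log_post <- sapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  mu_k <- colMeans(Xk)        # per-feature means for class k
  sd_k <- apply(Xk, 2, sd)    # per-feature standard deviations for class k
  log_lik <- rowSums(sapply(seq_len(ncol(X)), function(j) {
    dnorm(X[, j], mean = mu_k[j], sd = sd_k[j], log = TRUE)
  }))
  log(nrow(Xk) / nrow(X)) + log_lik
})

# Normalise to posterior probabilities (subtract the row max for stability)
post <- exp(log_post - apply(log_post, 1, max))
post <- post / rowSums(post)

# Predicted class = largest posterior
nb_class <- factor(classes[max.col(log_post, ties.method = "first")],
                   levels = classes)
print(table(Prediction = nb_class, Reference = y))
```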

### Steps in R:

```r
library(MASS)
library(caret)
library(e1071)

data(iris)
set.seed(123)
trainIndex <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[trainIndex, ]
test <- iris[-trainIndex, ]

# Fit Gaussian model using Naive Bayes
gda_model <- naiveBayes(Species ~ ., data = train)
predicted <- predict(gda_model, newdata = test)

# Evaluate model
confusionMatrix(predicted, test$Species)
```

output:
```
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         2
  virginica       0          0         8

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.7793, 0.9918)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 8.747e-12       
                                          
                  Kappa : 0.9             
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000
```

### Interpretation of the GDA (Naive Bayes) Confusion Matrix (Iris Dataset):

This section explains the classification performance of a Gaussian Discriminant Analysis (GDA) model implemented using the Naive Bayes algorithm on the classic Iris dataset.

### Confusion Matrix:

| Prediction \\ Reference | setosa | versicolor | virginica |
|------------------------|--------|------------|-----------|
| setosa                 | 10     | 0          | 0         |
| versicolor             | 0      | 10         | 2         |
| virginica              | 0      | 0          | 8         |

- All 10 `setosa` instances are correctly predicted (100%).
- All 10 `versicolor` instances are correctly classified, but 2 `virginica` instances were incorrectly labeled as `versicolor`.
- 8 of the `virginica` instances are correctly predicted.

### Overall Model Metrics:

- 🔹 Accuracy: 0.9333 (93.33%)
  - The model made 28 correct predictions out of 30.
- 🔹 95% Confidence Interval: (0.7793, 0.9918)
  - The model's true accuracy is expected to lie within this range.
- 🔹 No Information Rate (NIR): 0.3333
  - Represents the accuracy obtained by guessing the most frequent class.
- 🔹 P-Value [Acc > NIR]: 8.747e-12
  - A very small p-value suggests that the accuracy is significantly higher than the NIR, i.e., better than always predicting the most frequent class.
- 🔹 Kappa: 0.9
  - Indicates strong agreement between actual and predicted classifications.

### Class-wise Performance:

| Metric                       | setosa | versicolor | virginica |
|------------------------------|--------|------------|-----------|
| Sensitivity (Recall)         | 1.0000 | 1.0000     | 0.8000    |
| Specificity                  | 1.0000 | 0.9000     | 1.0000    |
| Positive Predictive Value    | 1.0000 | 0.8333     | 1.0000    |
| Negative Predictive Value    | 1.0000 | 1.0000     | 0.9091    |
| Detection Rate               | 0.3333 | 0.3333     | 0.2667    |
| Detection Prevalence         | 0.3333 | 0.4000     | 0.2667    |
| Balanced Accuracy            | 1.0000 | 0.9500     | 0.9000    |

- Class `setosa`: Perfectly classified with sensitivity, specificity, and precision all equal to 1.0.
- Class `versicolor`: While sensitivity is perfect (1.0), precision drops to 0.8333 due to false positives from `virginica`.
- Class `virginica`: Two false negatives reduce its sensitivity to 0.8000, although precision remains 1.0.

### Summary:

The Naive Bayes-based GDA model demonstrates very good classification accuracy (93.33%) on the Iris test set. It classifies `setosa` with perfect precision and recall. The only confusion occurs between `virginica` and `versicolor`, where two `virginica` instances were labeled `versicolor`, likely due to overlapping distributions in feature space. Overall, the model performs reliably and is suitable for multi-class classification tasks where the assumptions of normally distributed, conditionally independent predictors are reasonable.
