# The limits of LDA

Here, we explore the limitations of LDA and how some of them can be overcome. To this end, we will work with several toy datasets.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import os.path

## Preprocess the features
Sometimes the features clearly violate the __linear separability__ assumption made by LDA. In that case, each feature can be pre-processed by applying a function to it, which yields a generalized linear discriminant analysis:

$$ y = \sum_{i=1}^{D} w_i f_i(x_i) $$
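
As a minimal sketch of the formula above, the snippet below applies one function per feature column and then evaluates the weighted sum. The helper names `transform_features` and `discriminant`, the demo data, the weights, and the choice of `np.square` and `np.log` are all illustrative assumptions; they are not the functions that solve the exercise.

```python
import numpy as np

def transform_features(X, funcs):
    """Apply funcs[i] to column i of X and return the transformed array."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([f(X[:, i]) for i, f in enumerate(funcs)])

def discriminant(X, w, funcs):
    """Evaluate y = sum_i w_i * f_i(x_i) for every row of X."""
    return transform_features(X, funcs) @ np.asarray(w, dtype=float)

# Hypothetical demo values, chosen only to show the mechanics.
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
funcs_demo = [np.square, np.log]   # one function per feature (illustrative)
w_demo = np.array([1.0, -1.0])     # weights would normally come from LDA
y_demo = discriminant(X_demo, w_demo, funcs_demo)
```

In practice the weights come from training LDA on the transformed features, so only `transform_features` would be applied to `X_train` and `X_test` before fitting and predicting.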

__Task:__ Follow the steps (1) to (5) in the code cells below.

In [None]:
ds_data = 'data/non-lin_train.data'
ds_labels = 'data/non-lin_train.labels'
X_train = np.loadtxt(ds_data)
y_train = np.loadtxt(ds_labels)

# 1-> Scatter-plot your 2D-data


#### Q2.4.1) Which LDA assumption(s) are violated by the provided dataset?
#### Q2.4.2) How can the feature(s) be preprocessed in order to improve the performance? Hint: Think of polynomials, trigonometric functions, logarithms, exponential functions, etc.

In [None]:
ds_data = 'data/non-lin_test.data'
X_test = np.loadtxt(ds_data)

# 2-> Preprocess your features
# 3-> Check the performance of your LDA classifier
# 4-> If you think you got it, upload your prediction accuracy to the wiki.
# 5-> In addition, upload your implementation to GitLab under git_dir/Assignment2/Q2-4.ipynb

## Feature extraction
__Tasks:__ 
 1. Visualize the given toy 2D-dataset. 
 2. Train an LDA classifier on the given data. Remember to split your dataset into a training set and a validation set. What performance do you achieve?
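
The split-and-evaluate step can be sketched as follows. This is only an illustration under stated assumptions: `fit_lda`, `predict_lda`, and `train_val_split` are hypothetical helpers, the LDA is a minimal two-class version with a pooled covariance matrix, and the data are synthetic Gaussian blobs rather than the provided files. Your own LDA implementation can be dropped in instead.

```python
import numpy as np

def fit_lda(X, y):
    """Minimal two-class LDA with a pooled (shared) covariance matrix."""
    classes = np.unique(y)  # assumes exactly two distinct labels
    X0, X1 = X[y == classes[0]], X[y == classes[1]]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance estimate.
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    # Bias: midpoint between the class means plus the log prior ratio.
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b, classes

def predict_lda(model, X):
    w, b, classes = model
    return np.where(X @ w + b > 0, classes[1], classes[0])

def train_val_split(X, y, val_fraction=0.3, seed=0):
    """Shuffle the samples and hold out a fraction for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(round(val_fraction * len(X)))
    val, train = idx[:n_val], idx[n_val:]
    return X[train], y[train], X[val], y[val]

# Synthetic stand-in data: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_demo = np.repeat([0.0, 1.0], 100)

X_tr, y_tr, X_val, y_val = train_val_split(X_demo, y_demo)
model = fit_lda(X_tr, y_tr)
val_acc = np.mean(predict_lda(model, X_val) == y_val)
```

Reporting the accuracy on the held-out part rather than on the training part gives a fairer picture of how the classifier generalizes.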


There appears to be some exploitable structure in your data, but LDA cannot capture it directly.

#### Q2.4.3) How can the existing features be combined in order to get a __linearly separable__ dataset?

__Hint:__ Try to think of it in a 3D way: How would you add another dimension such that the 2 classes can be separated by a plane?

__Tasks:__
1. Implement your additional feature that combines the existing features. 
2. Calculate the performance of LDA using this additional feature and see if you can improve upon the performance.
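
The mechanics of these two tasks can be sketched on a different toy problem, so the question above stays open. The snippet assumes an XOR-style dataset (four synthetic blobs), a hypothetical `combine` function that multiplies the two features, and a minimal two-class LDA scored on its own training data for brevity. The combination that suits the given dataset is likely a different one.

```python
import numpy as np

def lda_train_accuracy(X, y):
    """Fit a two-class LDA (pooled covariance) and score it on the same data."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    w = np.linalg.solve(S, mu1 - mu0)
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return np.mean((X @ w + b > 0) == (y == 1))

# XOR-style stand-in data: opposite corners share a class label.
rng = np.random.default_rng(0)
centers = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
X_xor = np.vstack([c + 0.3 * rng.normal(size=(100, 2)) for c in centers])
y_xor = np.repeat([1, 1, 0, 0], 100)

def combine(X):
    """Hypothetical combined feature for the XOR example: the product x1 * x2."""
    return X[:, 0] * X[:, 1]

# Append the combined feature as a third column.
X_aug = np.column_stack([X_xor, combine(X_xor)])

acc_raw = lda_train_accuracy(X_xor, y_xor)   # original 2D features
acc_aug = lda_train_accuracy(X_aug, y_xor)   # with the extra dimension
```

In the augmented 3D space the two classes of this example lie on opposite sides of a plane, which is why LDA improves once the extra column is present.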

In [None]:
ds_data = 'data/circles_train.data'
ds_labels = 'data/circles_train.labels'
X = np.loadtxt(ds_data)
y = np.loadtxt(ds_labels)

# 1-> Scatter-plot your 2D-data.
# 2-> Combine the 2 existing features so that an LDA classifier works

## What else can go wrong?

__Task:__ Visualize the data set.

#### Q2.4.4) What LDA assumption(s) are violated by the next dataset? What would be an optimal decision boundary? 

#### Q2.4.5) (Bonus) Investigate which method would provide an optimal decision boundary.

__Hint:__ It is a slightly more general technique than LDA.

In [None]:

ds_data = 'data/unbalanced_train.data'
ds_labels = 'data/unbalanced_train.labels'
X = np.loadtxt(ds_data)
y = np.loadtxt(ds_labels)

# 1-> Scatter-plot your 2D-data.
# 2-> What is the optimal decision boundary?