# Homework Lecture 5

LDA, Logistic Classification, and Feature Development with the MNIST image dataset

## Preliminaries

### Imports

In [None]:
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression


from sklearn.model_selection import train_test_split, KFold

import sys
sys.path.append("../..")
from E4525_ML import mnist
from E4525_ML.multiclass_logistic import LogisticGDClassifier
%matplotlib inline

### Random Seed

In [None]:
seed=458
np.random.seed(seed)

### Data Directories

In [None]:
data_dir=r"../../raw/mnist/"

<div class="alert alert-block alert-info"> Problem 0 </div>
Make sure to **update** the file `mnist.py` in the `E4525_ML` directory (new version posted on Canvas).

You will need the **updated** version of that file to complete the last section of this notebook.

## Read Data

<div class="alert alert-block alert-info"> Problem 1.0 </div>
Read the MNIST data set and labels; also read the MNIST test data set and test labels.

<div class="alert alert-block alert-info"> Problem 1.2 </div>
Use `sklearn`'s `train_test_split` function to separate the MNIST samples into a 15% validation set and a training set.
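As a minimal sketch of the split, the example below uses random arrays as a stand-in for the MNIST images and labels. The array shapes and the `random_state` value are assumptions for illustration only; substitute the arrays you loaded in Problem 1.0.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the MNIST arrays
# (assumption: each image is flattened into a row of 784 pixels).
rng = np.random.default_rng(0)
X = rng.random((200, 784))
Y = rng.integers(0, 10, size=200)

# test_size=0.15 holds out 15% of the rows as the validation set;
# the remaining 85% form the training set.
X_train, X_val, Y_train, Y_val = train_test_split(
    X, Y, test_size=0.15, random_state=458
)
print(X_train.shape, X_val.shape)
```

Fixing `random_state` makes the split reproducible across runs.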


## LDA

<div class="alert alert-block alert-info"> Problem 2.1 </div>
Fit an LDA model on the training data set using `sklearn`'s `LinearDiscriminantAnalysis` classifier.

<div class="alert alert-block alert-info"> Problem 2.2 </div>
Compute model accuracy on the training set

<div class="alert alert-block alert-info"> Problem 2.3 </div>
Compute accuracy of the model on the validation set
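The fit-then-score pattern for Problems 2.1 to 2.3 might look like the sketch below. It runs on two synthetic Gaussian classes rather than MNIST, and the 170/30 split is an arbitrary assumption chosen for illustration. Accuracy is simply the fraction of predictions that match the labels.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two well-separated Gaussian classes in 5 dimensions (toy data, not MNIST).
X = np.vstack([
    rng.normal(loc=-2.0, scale=1.0, size=(100, 5)),
    rng.normal(loc=2.0, scale=1.0, size=(100, 5)),
])
Y = np.array([0] * 100 + [1] * 100)

# Shuffle the indices and hold out the last 30 samples for validation.
idx = rng.permutation(len(Y))
train_idx, val_idx = idx[:170], idx[170:]

model = LinearDiscriminantAnalysis()
model.fit(X[train_idx], Y[train_idx])

# Accuracy = mean of the element-wise comparison between labels and predictions.
train_acc = np.mean(model.predict(X[train_idx]) == Y[train_idx])
val_acc = np.mean(model.predict(X[val_idx]) == Y[val_idx])
print(train_acc, val_acc)
```

For sklearn classifiers, `model.score(X, Y)` returns the same mean accuracy.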

## Logistic Regression

<div class="alert alert-block alert-info"> Problem 3.1 </div>

Use the `LogisticGDClassifier` class from `E4525_ML.multiclass_logistic` module to fit a logistic model

<div class="alert alert-block alert-info"> Problem 3.2 </div>
Compute model accuracy on the training data set

<div class="alert alert-block alert-info"> Problem 3.3 </div>
Compute model accuracy on the validation data set

## Feature Engineering in one Dimension

In [None]:
N=50
N_val=1000

In [None]:
def f(x):
    return 10*(1-4*(np.abs(np.abs(x)-1)))

In [None]:
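def generate_sample(N):
    # draw N inputs uniformly from [-2,2]
    X=np.random.uniform(-2,2,N)
    eta=f(X)
    # logistic (sigmoid) transform of eta
    theta=1/(1+np.exp(-eta))
    # Boolean labels: Y is True with probability 1-theta
    Y= np.random.uniform(0,1,N)>theta
    return X,Y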

<div class="alert alert-block alert-info"> Problem 4.0 </div>
Generate
1. a training sample of variables $X$ and $Y$ with $N$ data samples
2. a validation set with $N_{val}$ samples
3. a test set with $N_{val}$ samples

<div class="alert alert-block alert-info"> Problem 4.1 </div>
What is the proportion of positive class ($Y=1$) samples on the training data?
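Because the labels are Boolean, the proportion of positive samples is the mean of the label array. A small sketch with a hand-written label array (an illustrative stand-in for the generated `Y`):

```python
import numpy as np

# Hand-written Boolean labels standing in for the generated Y.
Y = np.array([True, False, True, True, False])

# True counts as 1 and False as 0, so the mean is the positive-class fraction.
positive_fraction = np.mean(Y)
print(positive_fraction)  # 0.6
```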

<div class="alert alert-block alert-info"> Problem 4.2 </div>
Write a function that generates the feature matrix
$$
    H_{i,d}= h_d(x_i)
$$
for $i=1,\dots,N$ and $d=1,\dots,D$

where the functions $h_d(x)$ are defined as 
$$
    h_d(x) = x^d
$$

[HINT] Be careful to include $h_D$ (the highest power, $d=D$) in the range of functions.
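One possible way to build $H$ is with NumPy broadcasting, sketched below. The function name `polynomial_features` is a hypothetical choice made for this example. Note that `np.arange(1, D + 1)` is needed to include the power $D$, since the upper bound of `np.arange` is exclusive.

```python
import numpy as np

def polynomial_features(x, D):
    """Return the N x D matrix with entries H[i, d-1] = x[i] ** d for d = 1..D."""
    x = np.asarray(x, dtype=float)
    degrees = np.arange(1, D + 1)  # 1, 2, ..., D (D itself is included)
    # x[:, np.newaxis] has shape (N, 1); broadcasting against (D,) gives (N, D).
    return x[:, np.newaxis] ** degrees

H = polynomial_features([1.0, 2.0, -3.0], D=3)
print(H)
# [[  1.   1.   1.]
#  [  2.   4.   8.]
#  [ -3.   9. -27.]]
```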

<div class="alert alert-block alert-info"> Problem 4.3 </div>
1. Train a logistic regression model (use sklearn's `LogisticRegression` class) on the training data you already generated.
2. Use the validation set to select the best value of $D$, using accuracy as the selection criterion.
3. Plot accuracy on the training and validation sets as a function of $D$.

[HINT]
1. You only need to consider the range $D=1,\dots,10$.
2. Remember to disable regularization by setting the parameter $C$ of the `LogisticRegression` class to a very large number.
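The selection loop could be organized roughly as follows. This is a sketch on toy data: the labelling rule (`|x| > 1`), the sample sizes, the helper names `make_data` and `poly`, and the values `C=1e10` and `max_iter=5000` are all assumptions for illustration, not the homework's `generate_sample`. The solver may emit convergence warnings when regularization is effectively disabled.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n):
    # Toy rule (assumption): the label is 1 when |x| > 1.
    x = rng.uniform(-2, 2, n)
    y = (np.abs(x) > 1).astype(int)
    return x, y

def poly(x, D):
    # Columns x^1, ..., x^D.
    return x[:, None] ** np.arange(1, D + 1)

x_train, y_train = make_data(200)
x_val, y_val = make_data(500)

train_acc, val_acc = {}, {}
for D in range(1, 11):
    # A very large C makes the regularization penalty negligible.
    model = LogisticRegression(C=1e10, max_iter=5000)
    model.fit(poly(x_train, D), y_train)
    train_acc[D] = model.score(poly(x_train, D), y_train)
    val_acc[D] = model.score(poly(x_val, D), y_val)

# Pick the D with the highest validation accuracy (first one wins on ties).
best_D = max(val_acc, key=val_acc.get)
print(best_D, val_acc[best_D])
```

The two dictionaries can then be passed to `plt.plot` to draw both accuracy curves against $D$.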



<div class="alert alert-block alert-info"> Problem 4.4 </div>
Use the test set to measure the accuracy of the optimal classifier you have found
(do not use data from the validation set to train the classifier).

## Feature Engineering for MNIST sample

<div class="alert alert-block alert-info"> Problem 5.1 </div>
In this problem we will use the `mnist.ImageFeatureModel` class to find the optimal number of orientations $\theta$ of the oriented-gradient features for the MNIST data set.

1. Use `mnist.ImageFeatureModel` to generate oriented-gradient image features.
2. Use `LogisticGDClassifier` as the base model.
3. Set the block size to 4 (this reduces memory use).
4. Select the best number of orientations by performing 5-fold cross-validation on the full MNIST data set.
5. Consider only [1,2,4,8] as possible values for the number of orientations.
6. Plot the number of orientations vs. validation accuracy.

[HINT] 
1. The `validate_model` function below will be useful for performing cross-validation.
2. If you run into memory trouble (your computer crashes), reduce the size of the data set. Make sure to indicate this clearly in your solution.
3. This problem is computationally expensive; make sure to allocate enough time to complete it.

In [None]:
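def validate_model(model,K,X,Y):
    """Return the mean validation accuracy of `model` over K shuffled folds."""
    folder=KFold(K,shuffle=True)
    val_accuracy=0.0
    fold_count=0
    for train_idx,val_idx in folder.split(X,Y):
        x_train=X[train_idx]
        y_train=Y[train_idx]
        x_val=X[val_idx]
        y_val=Y[val_idx]
        # refit the model from scratch on this fold's training portion
        model.fit(x_train,y_train)
        y_pred=model.predict(x_val)
        # fraction of correct predictions on this fold (an accuracy, not an error)
        fold_accuracy=np.mean(y_val==y_pred)
        val_accuracy+=fold_accuracy
        fold_count+=1
        print(fold_count,fold_accuracy)
    return val_accuracy/K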

<div class="alert alert-block alert-info"> Problem 5.2 </div>

Fit the model with the optimal number of orientations to the full MNIST data set and estimate its accuracy on the MNIST test set.
