$\newcommand{\xv}{\mathbf{x}}
\newcommand{\Xv}{\mathbf{X}}
\newcommand{\yv}{\mathbf{y}}
\newcommand{\zv}{\mathbf{z}}
\newcommand{\av}{\mathbf{a}}
\newcommand{\Wv}{\mathbf{W}}
\newcommand{\wv}{\mathbf{w}}
\newcommand{\tv}{\mathbf{t}}
\newcommand{\Tv}{\mathbf{T}}
\newcommand{\muv}{\boldsymbol{\mu}}
\newcommand{\sigmav}{\boldsymbol{\sigma}}
\newcommand{\phiv}{\boldsymbol{\phi}}
\newcommand{\Phiv}{\boldsymbol{\Phi}}
\newcommand{\Sigmav}{\boldsymbol{\Sigma}}
\newcommand{\Lambdav}{\boldsymbol{\Lambda}}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}}$

# Assignment 4: Classification with LDA and Logistic Regression

*Type your name here*

## Overview

Compare LDA, linear logistic regression, and nonlinear logistic regression applied to two data sets.

## Required Code

Download [nn2.tar](http://www.cs.colostate.edu/~anderson/cs480/notebooks/nn2.tar) and extract its contents, which are

* `neuralnetworks.py`
* `scaledconjugategradient.py`
* `mlutils.py`

as discussed in lecture. 

Write the following functions that train and evaluate LDA and neural network logistic regression models.

* `model = trainLDA(X,T,parameters)`
* `percentCorrect = evaluateLDA(model,X,T)`
* `model = trainNN(X,T,parameters)`
* `percentCorrect = evaluateNN(model,X,T)`

The `parameters` argument for `trainNN` is a list containing the hidden layer structure and the number of SCG iterations, as in the previous assignment. The `parameters` argument for `trainLDA` is ignored.

Use the `trainValidateTestKFoldsClassification` function in `mlutils.py` to apply the above functions. 
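
To make the train/validate/test idea concrete, here is a minimal sketch of how nested k-fold partitions can be enumerated. It is an illustration only and is not the code in `mlutils.py`: the helper name `nested_fold_indices` is hypothetical, and the real function may shuffle, stratify by class, or combine folds differently.

```python
import numpy as np

def nested_fold_indices(n_samples, n_folds):
    """Yield (train, validate, test) index arrays for every fold pairing.

    Hypothetical helper: each fold serves once as the test set, and each
    remaining fold serves once as the validation set for that test fold.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    for test_i in range(n_folds):
        for validate_i in range(n_folds):
            if validate_i == test_i:
                continue
            # All folds that are neither test nor validation form the training set.
            train_folds = [folds[i] for i in range(n_folds)
                           if i not in (test_i, validate_i)]
            yield np.hstack(train_folds), folds[validate_i], folds[test_i]

partitions = list(nested_fold_indices(12, 3))
for train, validate, test in partitions:
    print('train', train, 'validate', validate, 'test', test)
```

With `n_folds` folds this produces `n_folds * (n_folds - 1)` partitions. Validation accuracy averaged over the inner pairings is what is typically used to choose the best parameter values for each test fold.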

The `NeuralNetworkClassifier` class in the above `neuralnetworks.py` file allows you to specify 0 hidden units.  This creates a neural network with just the output layer designed to do classification.  In other words, specify 0 hidden units to apply linear logistic regression.
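
For intuition about what a network with no hidden units computes, the following is a minimal NumPy sketch of linear logistic regression with a softmax output. It does not use `neuralnetworks.py` or SCG. The function names, the plain gradient descent loop, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def softmax(Z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def train_linear_logistic(X, T, n_iterations=500, learning_rate=0.1):
    """Fit softmax regression by plain gradient descent (hypothetical helper)."""
    classes = np.unique(T)
    # Indicator (one-hot) targets: one column per class.
    T_indicator = (T.reshape(-1, 1) == classes).astype(float)
    # Standardize inputs and prepend a constant column for the bias weights.
    means, stds = X.mean(axis=0), X.std(axis=0)
    X1 = np.hstack([np.ones((X.shape[0], 1)), (X - means) / stds])
    W = np.zeros((X1.shape[1], len(classes)))
    for _ in range(n_iterations):
        Y = softmax(X1 @ W)
        # Gradient of the mean negative log likelihood with respect to W.
        W -= learning_rate * X1.T @ (Y - T_indicator) / X1.shape[0]
    return {'W': W, 'classes': classes, 'means': means, 'stds': stds}

def use_linear_logistic(model, X):
    Xs = (X - model['means']) / model['stds']
    X1 = np.hstack([np.ones((X.shape[0], 1)), Xs])
    # The predicted class is the one with the largest linear output.
    return model['classes'][np.argmax(X1 @ model['W'], axis=1)]

# Toy data: two well-separated clusters in two dimensions.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
T_toy = np.array([0] * 20 + [1] * 20)
model = train_linear_logistic(X_toy, T_toy)
percent_correct = np.mean(use_linear_logistic(model, X_toy) == T_toy) * 100
print(percent_correct)
```

Each column of `W` holds one class's weights, so the decision boundaries between classes are linear in the inputs. Adding hidden units is what makes the model nonlinear.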

In [1]:
import numpy as np
import mlutils as ml
import neuralnetworks as nn

If you prefer to develop your Python code in a separate editor or IDE, you may do so.  If it is stored in a file called `A4mysolution.py`, you can use it here by executing the following cell.

<font color="red">REMEMBER</font> to remove or comment out the following import statement and instead paste all of your function definitions into this notebook.

In [20]:
# from A4mysolution import * 

Here is an example, using our automobile MPG data.  This time, instead of predicting the actual MPG values, we quantize the MPG values into four intervals, and classify each sample as being in one of these four intervals.

In [3]:
def makeMPGData(filename='auto-mpg.data'):
    def missingIsNan(s):
        return np.nan if s == b'?' else float(s)
    data = np.loadtxt(filename, usecols=range(8), converters={3: missingIsNan})
    print("Read",data.shape[0],"rows and",data.shape[1],"columns from",filename)
    goodRowsMask = np.isnan(data).sum(axis=1) == 0
    data = data[goodRowsMask,:]
    print("After removing rows containing question marks, data has",data.shape[0],"rows and",data.shape[1],"columns.")
    X = data[:,1:]
    T = data[:,0:1]
    Xnames =  ['cylinders','displacement','horsepower','weight','acceleration','year','origin']
    Tname = 'mpg'
    return X,T,Xnames,Tname

In [4]:
def makeMPGClasses(T):
    bounds = np.arange(5,45,10)
    Tclasses = -np.ones(T.shape).astype(int)  # np.int was removed from recent NumPy releases
    for i,mpg in enumerate(T):
        for k in range(len(bounds)-1):
            if bounds[k] < mpg <= bounds[k+1]:
                Tclasses[i] = bounds[k+1]
        if Tclasses[i] == -1:
            Tclasses[i] = 50  # max mpg is 46.6
    return Tclasses
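
The same quantization can be written without explicit loops. The sketch below uses `np.digitize` and assumes every MPG value is greater than 5, which is the lower edge of the first interval in `makeMPGClasses`. The name `make_mpg_classes_vectorized` is hypothetical.

```python
import numpy as np

def make_mpg_classes_vectorized(T):
    """Label each MPG value by the upper bound of its interval (hypothetical helper)."""
    upper_bounds = np.array([15, 25, 35])
    labels = np.array([15, 25, 35, 50])  # 50 labels everything above 35
    # With right=True, index i satisfies upper_bounds[i-1] < mpg <= upper_bounds[i].
    bin_index = np.digitize(T, upper_bounds, right=True)
    return labels[bin_index]

T_demo = np.array([[9.0], [15.0], [15.5], [25.0], [30.0], [35.0], [46.6]])
demo_classes = make_mpg_classes_vectorized(T_demo)
print(demo_classes.ravel())  # [15 15 25 25 35 35 50]
```

The result has the same shape as `T`, so it can be passed wherever `Tclasses` is used.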

In [5]:
X,T,Xnames,Tname = makeMPGData('auto-mpg.data')
Tclasses = makeMPGClasses(T)
classes,counts = np.unique(Tclasses,return_counts=True)
print('classes',classes)
print('counts',counts)

Read 398 rows and 8 columns from auto-mpg.data
After removing rows containing question marks, data has 392 rows and 8 columns.
classes [15 25 35 50]
counts [ 69 167 123  33]


In [6]:
def printResults(label,results):
    print('{:4s} {:>20s}{:>8s}{:>8s}{:>8s}'.format('Algo','Parameters','TrnAcc','ValAcc','TesAcc'))
    print('-------------------------------------------------')
    for row in results:
        # 20 is expected maximum number of characters in printed parameter value list
        print('{:>4s} {:>20s} {:7.2f} {:7.2f} {:7.2f}'.format(label,str(row[0]),*row[1:]))

In [7]:
resultsLDA = ml.trainValidateTestKFoldsClassification( trainLDA,evaluateLDA, X,Tclasses, [None],
                                                       nFolds=6, shuffle=False,verbose=False)
printResults('LDA:',resultsLDA)


Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
LDA:                 None   79.21   70.53   56.04
LDA:                 None   75.58   66.36   80.23
LDA:                 None   78.46   67.19   82.72
LDA:                 None   77.72   68.07   82.89
LDA:                 None   79.75   69.57   69.01
LDA:                 None   81.23   72.96   51.52


In [9]:
resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,Tclasses, 
                                                     [ [ [0], 10], [[10], 100] ],
                                                     nFolds=6, shuffle=False,verbose=False)
printResults('NN:',resultsNN)


Algo           Parameters  TrnAcc  ValAcc  TesAcc
-------------------------------------------------
 NN:            [[0], 10]   83.16   74.72   51.65
 NN:          [[10], 100]   92.21   66.58   89.53
 NN:          [[10], 100]   93.59   71.86   72.84
 NN:          [[10], 100]   92.66   72.98   76.32
 NN:            [[0], 10]   80.50   73.58   61.97
 NN:          [[10], 100]   94.32   77.44   60.61


In [10]:
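# Note: ql is not imported in the cells above; import the module that defines the LDA class as ql first.
lda = ql.LDA()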
lda.train(X,Tclasses)
predictedClasses,_,_ = lda.use(X)
ml.confusionMatrix(Tclasses,predictedClasses,classes); # <- semi-colon prevents printing of returned result

      15   25   35   50
    ------------------------
15 | 91.3  8.7  0    0     (69 / 69)
25 | 11.4 68.9 18.6  1.2   (167 / 167)
35 |  0    8.9 68.3 22.8   (123 / 123)
50 |  0    0   18.2 81.8   (33 / 33)
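
In the table above each row corresponds to an actual class and each column to a predicted class, with entries given as percentages of the row's samples. A minimal sketch of that calculation follows. It is not the implementation of `ml.confusionMatrix`, and the name `percent_confusion_matrix` is hypothetical.

```python
import numpy as np

def percent_confusion_matrix(actual, predicted, classes):
    """Return a table whose entry [i, j] is the percent of class i predicted as class j."""
    actual = np.asarray(actual).ravel()
    predicted = np.asarray(predicted).ravel()
    table = np.zeros((len(classes), len(classes)))
    for i, true_class in enumerate(classes):
        in_class = actual == true_class
        n_in_class = in_class.sum()
        if n_in_class == 0:
            continue  # leave the row at zero when a class has no samples
        for j, predicted_class in enumerate(classes):
            n_predicted = np.sum(predicted[in_class] == predicted_class)
            table[i, j] = 100 * n_predicted / n_in_class
    return table

actual = np.array([15, 15, 25, 25, 25, 25])
predicted = np.array([15, 25, 25, 25, 25, 15])
table = percent_confusion_matrix(actual, predicted, [15, 25])
print(table)
```

Large off-diagonal entries show which pairs of classes a model confuses. In the LDA table above, the errors fall mostly in neighboring MPG intervals.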


## Data

Pick at least two classification data sets and apply LDA, Linear Logistic Regression, and Nonlinear Logistic Regression to them.

## Results

In this section, we will be looking for

* clear explanations of each function;
* experiments with two different data sets with descriptions of the data;
* discussion of each result, including
  * accuracies as percent correctly classified,
  * best parameter values,
  * some analysis of each classification algorithm and how it is classifying the data by examining the $\mu$ values for LDA, and the first layer's weight values for the neural networks;
* a discussion of which algorithm works best for each data set.
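
One way to approach the $\mu$ analysis listed above is to compute the class means directly and see which inputs separate them. The sketch below is a self-contained NumPy illustration of LDA, not the course's implementation. The names `lda_fit` and `lda_discriminants` are hypothetical, and it assumes a covariance matrix pooled over the classes and weighted by the class priors, with discriminant functions

$$\delta_k(\xv) = \xv^T \Sigmav^{-1} \muv_k - \half \muv_k^T \Sigmav^{-1} \muv_k + \log P(C=k)$$

```python
import numpy as np

def lda_fit(X, T):
    """Estimate class means, priors, and a pooled covariance (hypothetical helper)."""
    T = np.asarray(T).ravel()
    classes = np.unique(T)
    means = np.array([X[T == c].mean(axis=0) for c in classes])
    priors = np.array([np.mean(T == c) for c in classes])
    # Pooled covariance: each class covariance weighted by its prior.
    pooled = sum(p * np.atleast_2d(np.cov(X[T == c].T))
                 for c, p in zip(classes, priors))
    return classes, means, priors, pooled

def lda_discriminants(X, means, priors, pooled):
    """Evaluate delta_k(x) for every sample (rows) and class (columns)."""
    pooled_inv = np.linalg.pinv(pooled)
    linear = X @ pooled_inv @ means.T
    constant = -0.5 * np.sum(means @ pooled_inv * means, axis=1) + np.log(priors)
    return linear + constant

# Toy data: two well-separated clusters in two dimensions.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
T_toy = np.array([0] * 20 + [1] * 20)

classes, means, priors, pooled = lda_fit(X_toy, T_toy)
deltas = lda_discriminants(X_toy, means, priors, pooled)
predicted = classes[np.argmax(deltas, axis=1)]
percent_correct = np.mean(predicted == T_toy) * 100
print('class means:\n', means)
print('percent correct:', percent_correct)
```

Each row of `means` is one class's $\mu$ vector. Inputs whose means differ greatly between classes, relative to their spread, contribute most to the separation.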

## Grading

Your notebook will be run and graded automatically. Download [A4grader.tar](http://www.cs.colostate.edu/~anderson/cs480/notebooks/A4grader.tar) and extract `A4grader.py` from it. Run the code in the following cell to demonstrate an example grading session. You should see the full coding score of 80/100 if your functions are defined correctly.

The remaining 20% will be based on your writing.  Be sure to explain each function and to discuss your results as outlined in the Results section above.

## Check-in

Do not include this section in your notebook.

Name your notebook ```Lastname-A4.ipynb```.  So, for me it would be ```Anderson-A4.ipynb```.  Submit the file using the ```Assignment 4``` link on [Canvas](https://colostate.instructure.com/courses/41327).

Grading will be based on 

  * correct behavior of the required functions,
  * readability of the notebook,
  * effort in making interesting observations, and in formatting your notebook,
  * testing your code on two different classification data sets of your choice.

In [21]:
%run -i A4grader.py


   Testing   model = trainLDA(X,T)
             accuracy = evaluateLDA(model,X,T)

20/20 points. Accuracy is within 10 of correct value 50%

   Testing   model = trainNN(X,T, [[5],100])
             accuracy = evaluateNN(model,X,T)

30/30 points. Accuracy is within 10 of correct value 100%

  Testing
    resultsNN = ml.trainValidateTestKFoldsClassification( trainNN,evaluateNN, X,T, 
                                                          [ [ [0], 5], [ [10], 100] ],
                                                          nFolds=3, shuffle=False,verbose=False)
    bestParms = [row[0] for row in resultsNN]


30/30 points. You correctly find the best parameters to be [[10],100] for each fold.

a4 CODING GRADE is 80/80

a4 WRITING GRADE is ??/20

a4 FINAL GRADE is ??/100

Remember, this python script is just an example of how your code will be graded.
Do not be satisfied with an 80% from running this script.  Write and run additional
tests of your own design.
