# Quadratic Discriminant Analysis

In this exercise, you will implement quadratic discriminant analysis (QDA) yourself. QDA computes the posterior $p(\omega|x)=\frac{p(x|\omega)\,p(\omega)}{p(x)}$ and assumes that the likelihood $p(x|\omega)$ is a normal distribution with its own parameters for each class $\omega$.
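
For a single feature $x$, as in the exercises of this notebook, the normality assumption can be written out explicitly. The symbols $\mu_\omega$ and $\sigma^2_\omega$ stand for the mean and variance of class $\omega$, and $\hat{\omega}$ for the predicted class; they are notation introduced here for illustration:

$$p(x|\omega) = \frac{1}{\sqrt{2\pi\sigma^2_\omega}} \exp\left(-\frac{(x-\mu_\omega)^2}{2\sigma^2_\omega}\right), \qquad \hat{\omega} = \arg\max_{\omega}\; p(x|\omega)\,p(\omega)$$

Since $p(x)$ is the same for every class, it can be dropped when only the most likely class is needed. Taking the logarithm of $p(x|\omega)\,p(\omega)$ gives a quadratic function of $x$, which is where the name comes from.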

## Exercise 1
A fisher asks you for help in classifying fish. Recently, he caught the following fish:

| Length (m)    | Type          | 
| ------------- |-------------  |
| 1.3           | sea bass       |
| 0.7           | salmon       |
| 0.62           | salmon      |
| 0.9           | salmon       |
| 0.91          | sea bass       |
| 0.31          | herring       |
| 0.26           | herring       |

* Calculate the class priors $p(\omega)$ for each class of fish
* Calculate the parameters $\mu$ and $\sigma^2$ for the likelihoods $p(x|\omega)$. 
* The fisher catches a new fish with length $x = 0.82\,\text{m}$. Calculate the posterior probability $p(\omega|x)$ for each class. Which class is most likely?

Priors: 
* $p(\text{sea bass}) = 2/7$
* $p(\text{salmon}) = 3/7$
* $p(\text{herring}) = 2/7$

Likelihoods (sample variance, i.e. dividing by $n-1$):
* $\mu_{sb} = 1.105, \quad \sigma^2_{sb} = 0.076$
* $\mu_{sa} = 0.74, \quad \sigma^2_{sa} = 0.021$
* $\mu_{h} = 0.285, \quad \sigma^2_{h} = 0.00125$

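Posteriors (without the normalizing constant $p(x)$). The normal density takes the standard deviation $\sigma = \sqrt{\sigma^2}$, and a density value may exceed 1:

* $p(\text{sea bass} \mid x) \propto 0.85 \cdot 2/7 \approx 0.24$
* $p(\text{salmon} \mid x) \propto 2.4 \cdot 3/7 \approx 1.0$
* $p(\text{herring} \mid x) \propto 0 \cdot 2/7 = 0$ (the density is of the order of $10^{-50}$)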

Thus, the fish is most likely a salmon.
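
A small script can serve as a check on the hand calculation. The sketch below uses only the Python standard library; the names `catch` and `fish_posteriors` are made up for this illustration, and the variance is the sample variance (dividing by $n-1$), which is an assumption about the intended estimator.

```python
from statistics import NormalDist, mean, variance

# Catch from the table above: (length in m, type).
catch = [
    (1.30, "sea bass"), (0.70, "salmon"), (0.62, "salmon"), (0.90, "salmon"),
    (0.91, "sea bass"), (0.31, "herring"), (0.26, "herring"),
]

def fish_posteriors(x, catch):
    """Return {type: p(type | x)} under a per-class 1-D Gaussian likelihood."""
    by_type = {}
    for length, kind in catch:
        by_type.setdefault(kind, []).append(length)
    scores = {}
    for kind, lengths in by_type.items():
        prior = len(lengths) / len(catch)
        # statistics.variance is the sample variance (divides by n - 1);
        # NormalDist expects the standard deviation, hence the square root.
        sigma = variance(lengths) ** 0.5
        scores[kind] = NormalDist(mean(lengths), sigma).pdf(x) * prior
    evidence = sum(scores.values())  # p(x), the normalizing constant
    return {kind: s / evidence for kind, s in scores.items()}

posteriors = fish_posteriors(0.82, catch)
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

Dividing by the sum of the unnormalized scores turns them into probabilities that add up to 1, which answers the question for all three classes at once.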

## Exercise 

Implement a function `priors(classes)` that calculates the prior $p(\omega)$ for each class, given a vector of class labels.
The input should be an array of class labels, e.g. `np.array(["stand","sit","sit","stand"])`. The output should be a pandas DataFrame with columns `class` and `prior`.


In [8]:
from scipy.io import arff
import scipy.stats
import pandas as pd
import numpy as np

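def priors(classes):
    # Count how often each label occurs and normalize to relative frequencies.
    unique, counts = np.unique(classes, return_counts=True)
    # Build the frame column by column so that `prior` stays numeric;
    # stacking labels and counts in one array would coerce both to a common dtype.
    df = pd.DataFrame({"class": unique, "prior": counts / counts.sum()})
    return df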
    
pp = priors(np.array(["stand","sit","sit","sit","stand"]))


data = arff.loadarff('features1.arff')
df = pd.DataFrame(data[0])

dat = df.loc[:, ["AccX_mean","class"]]
dat.columns = ["x","class"]

prior = priors(dat["class"])
prior

|     | class         | prior      |
| --- | ------------- | ---------- |
| 0   | b'bike'       | 0.257062   |
| 1   | b'downstairs' | 0.0819209  |
| 2   | b'lie'        | 0.0960452  |
| 3   | b'run'        | 0.163842   |
| 4   | b'sit'        | 0.0621469  |
| 5   | b'stand'      | 0.0451977  |
| 6   | b'upstairs'   | 0.0762712  |
| 7   | b'walk'       | 0.217514   |
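
The same relative frequencies can also be obtained with pandas alone. The following is a minimal alternative sketch, not the notebook's solution; the function name `priors_vc` is hypothetical.

```python
import numpy as np
import pandas as pd

def priors_vc(classes):
    """Relative class frequencies as a DataFrame with columns `class` and `prior`."""
    return (
        pd.Series(classes)
        .value_counts(normalize=True)   # fraction of observations per label
        .sort_index()                   # same ordering as np.unique
        .rename_axis("class")
        .reset_index(name="prior")
    )

example = priors_vc(np.array(["stand", "sit", "sit", "sit", "stand"]))
print(example)
```

Because `value_counts` keeps the proportions in a numeric column, the result can be multiplied with likelihood values directly.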


## Exercise

Implement a function `likelihood(data)` that takes a data frame with a column `class` (the class label $\omega$) and a column `x`, and estimates the parameters of the likelihood $p(x|\omega)$ for each class $\omega$, i.e. returns a mean and a variance for each class. The output should be a data frame with columns `mean`, `variance` and `class`.

Plot the likelihoods for each class. 


In [4]:
from scipy.io import arff
import scipy.stats
import pandas as pd
import numpy as np

def likelihood(data):
    # Estimate a mean and a variance of x separately for each class.
    uc = np.unique(data["class"])
    def getMV(c):
        dsub = data.loc[data["class"]==c,"x"]
        # pandas' var() is the sample variance (ddof=1).
        return [dsub.mean(), dsub.var()]
    mvs = [getMV(c) for c in uc]
    r = pd.DataFrame(mvs, columns=["mean","variance"])
    r["class"] = uc
    return r
    
data = arff.loadarff('features1.arff')
df = pd.DataFrame(data[0])

dat = df.loc[:, ["AccX_mean","class"]]
dat.columns = ["x","class"]
lik = likelihood(dat)
lik

|     | mean       | variance  | class         |
| --- | ---------- | --------- | ------------- |
| 0   | 523.545587 | 15.843081 | b'bike'       |
| 1   | 517.316002 | 18.523919 | b'downstairs' |
| 2   | 467.870175 | 56.697558 | b'lie'        |
| 3   | 527.009294 | 96.483175 | b'run'        |
| 4   | 486.721946 | 71.790221 | b'sit'        |
| 5   | 528.594727 | 1.057665  | b'stand'      |
| 6   | 545.413484 | 62.520043 | b'upstairs'   |
| 7   | 525.170962 | 1.221023  | b'walk'       |
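
The plotting part of this exercise could be approached as follows. This is a sketch under assumptions: the helper names `density_curves` and `plot_likelihoods` and the table `demo_lik` are hypothetical, the parameter table is expected to have the columns `mean`, `variance` and `class` shown above, and matplotlib is assumed to be installed for the plotting function.

```python
import numpy as np
import pandas as pd

def density_curves(lik, num=200):
    """Evaluate each class's Gaussian density on a shared grid of x values."""
    mu = lik["mean"].to_numpy(dtype=float)
    sd = np.sqrt(lik["variance"].to_numpy(dtype=float))
    # Grid spanning four standard deviations around the outermost classes.
    xs = np.linspace((mu - 4 * sd).min(), (mu + 4 * sd).max(), num)
    curves = {}
    for label, m, s in zip(lik["class"], mu, sd):
        curves[label] = np.exp(-0.5 * ((xs - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return xs, curves

def plot_likelihoods(lik):
    """Draw one density curve per class (requires matplotlib)."""
    import matplotlib.pyplot as plt
    xs, curves = density_curves(lik)
    fig, ax = plt.subplots()
    for label, ys in curves.items():
        ax.plot(xs, ys, label=str(label))
    ax.set_xlabel("x")
    ax.set_ylabel("p(x | class)")
    ax.legend()
    return ax

# Hypothetical parameter table for illustration; in the notebook, pass `lik` instead.
demo_lik = pd.DataFrame({"class": ["a", "b"], "mean": [0.0, 3.0], "variance": [1.0, 4.0]})
xs, curves = density_curves(demo_lik)
```

In the notebook, `plot_likelihoods(lik)` would draw the eight activity classes on one axis. Classes with a very small variance, such as `stand` and `walk`, produce narrow, tall peaks, so a logarithmic y-axis may be easier to read.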


## Exercise
Implement a function `myqda(newdat,lik,prior)` that returns the most likely class for each new observation in `newdat`.
Test your implementation on the dataset `features1.arff`. Train your QDA (i.e. compute priors and likelihood parameters), and then classify the data. How accurate is your classification?


In [4]:
from scipy.io import arff
import scipy.stats

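def myqda(newdat,lik,prior):
    def getClass(d):
        def getProb(c,d):
            pr = prior.loc[prior["class"]==c,"prior"].values[0]
            m = lik.loc[lik["class"] == c,"mean"].values[0]
            v = lik.loc[lik["class"] == c,"variance"].values[0]
            # scipy.stats.norm takes the standard deviation as its scale, not the variance.
            li = scipy.stats.norm(m, np.sqrt(v)).pdf(d)
            return li*pr
        # Unnormalized posterior for every class; p(x) does not change the argmax.
        probs = np.array([getProb(c,d) for c in prior["class"].values])
        return prior["class"].values[np.argmax(probs)]
    newclasses = np.array([getClass(d) for d in newdat])
    return newclasses


data = arff.loadarff('features1.arff')
df = pd.DataFrame(data[0])

dat = df.loc[:, ["AccX_mean","class"]]
dat.columns = ["x","class"]

lik = likelihood(dat)
prior = priors(dat["class"])

# Classify the first 100 observations and report the fraction labeled correctly.
# Note: the recorded output below predates the np.sqrt(v) scale and the [:100] slice,
# so rerunning this cell may print a different value.
nc = myqda(dat["x"][:100],lik,prior)
print(np.mean(nc == dat["class"][:100].values))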

0.47
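
To judge how good the classification is beyond a single accuracy figure, a confusion matrix shows which classes get mixed up. The sketch below is a vectorized variant working with log-probabilities; `qda_predict` and the `toy` data are made up for illustration and are not part of the notebook. It assumes parameter tables with the same columns as `lik` and `prior` above.

```python
import numpy as np
import pandas as pd

def qda_predict(x, lik, prior):
    """Vectorized 1-D QDA: argmax over classes of log p(x|class) + log p(class)."""
    x = np.asarray(x, dtype=float)[:, None]              # shape (n, 1)
    # Align both tables on the class label so rows match.
    params = lik.merge(prior, on="class")
    mu = params["mean"].to_numpy(dtype=float)            # shape (k,)
    var = params["variance"].to_numpy(dtype=float)
    log_prior = np.log(params["prior"].to_numpy(dtype=float))
    # Gaussian log-density, written out so that the variance enters explicitly.
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return params["class"].to_numpy()[np.argmax(log_lik + log_prior, axis=1)]

# Synthetic two-class data, made up for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 2.0, 100)]),
    "class": ["low"] * 200 + ["high"] * 100,
})
g = toy.groupby("class")["x"]
toy_lik = pd.DataFrame({"mean": g.mean(), "variance": g.var()}).reset_index()
toy_prior = (toy["class"].value_counts(normalize=True)
             .rename_axis("class").reset_index(name="prior"))

pred = qda_predict(toy["x"], toy_lik, toy_prior)
accuracy = np.mean(pred == toy["class"].to_numpy())
confusion = pd.crosstab(toy["class"], pred, rownames=["true"], colnames=["predicted"])
print(accuracy)
print(confusion)
```

Because this scores the same observations that were used for fitting, the accuracy is optimistic; holding out part of the data for evaluation would give a fairer estimate. Working in log space also avoids underflow when a density is extremely small.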
