## Setup Notebook

We start by importing the packages used throughout this notebook.

In [25]:
# Standard includes
%matplotlib inline
%run helper.ipynb
import numpy as np
import matplotlib.pyplot as plt
from inspect import getsource
from scipy.stats import norm, multivariate_normal # module for dealing with the Gaussians

The object 'helper' has been imported into this notebook.


In [26]:
# import packages for interactive graphs
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider

Next we load the wine training and test datasets.
The dataset consists of 130 training points and 48 test points.
Each point has 13 features and a label taking one of the values 1, 2, or 3 (the winery the wine came from).

* **x:** an `np.array` of the 130 training points' features.
* **y:** an `np.array` of the 130 training labels.
* **tx:** an `np.array` of the 48 test points' features.
* **ty:** an `np.array` of the 48 test points' labels.

In [27]:
x,y,tx,ty = helper.x, helper.y, helper.tx, helper.ty
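# print(...) with a single formatted string behaves the same in Python 2 and Python 3
print("x.shape : {}".format(x.shape))
print("tx.shape : {}".format(tx.shape))
print("y.shape : {}".format(y.shape))
print("ty.shape : {}".format(ty.shape))

featurenames = helper.featurenames
print("")
print("Names of Features:")
print(featurenames)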

x.shape : (130, 13)
tx.shape : (48, 13)
y.shape : (130,)
ty.shape : (48,)

Names of Features:
['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash', 'Magnesium', 'Total phenols', 'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 'OD280/OD315 of diluted wines', 'Proline']


Let's see how many training points there are from each class.

In [28]:
sum(y==1), sum(y==2), sum(y==3)

(43, 51, 36)
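
The counts above work because comparing an array with a scalar gives one boolean per entry, and `True` counts as 1 when summed. Here is a minimal sketch of the same idea in plain Python. The list `toy_labels` is a hypothetical stand-in for illustration, not the real `y`.

```python
from collections import Counter

# Hypothetical labels for illustration only; the real labels live in `y`.
toy_labels = [1, 3, 2, 2, 1, 2, 3, 2]

# Each comparison yields True/False, and True counts as 1 when summed.
counts_by_sum = tuple(sum(label == c for label in toy_labels) for c in (1, 2, 3))

# Counter tallies every class in a single pass over the labels.
tally = Counter(toy_labels)
counts_by_counter = tuple(tally[c] for c in (1, 2, 3))

print(counts_by_sum)      # (2, 4, 2)
print(counts_by_counter)  # (2, 4, 2)
```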

### <font color="magenta">*Fast Exercise*</font>

Can you see how many test points there are from each class?

Write a function, **sumTestClasses**, that, like above, returns a tuple $(\text{sum}_1,\, \text{sum}_2,\, \text{sum}_3)$ but this time for the test points.

In [22]:
# modify this cell

def sumTestClasses():
    # inputs: no inputs, but you should call a variable from x, y, tx, ty
    # output: ( sum_1, sum_2, sum_3)
    
    ### BEGIN SOLUTION
    return sum(ty==1), sum(ty==2), sum(ty==3)
    ### END SOLUTION

In [23]:
# Check Function
assert sumTestClasses() == (16, 20, 12)

## Look at the distribution of a single feature from one of the wineries

Now we'll pick just one feature: 'Alcohol'. This is the first feature, that is, number 0.

We will display a histogram of this feature's values under class 1. We will also show the Gaussian fit to this distribution.
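
To make "Gaussian fit" concrete, here is a rough sketch of what fitting a one-dimensional Gaussian usually involves: estimate the mean and variance from the sample, then evaluate the resulting density on a grid. The functions `fit_gaussian` and `gaussian_pdf` and the values in `toy_values` are hypothetical names and data for illustration. The body of `helper.uni.densityPlot` is not shown in this notebook, so it may differ in detail.

```python
import math

def fit_gaussian(values):
    """Return the maximum-likelihood mean and variance of a 1-D sample."""
    n = len(values)
    mu = sum(values) / float(n)
    var = sum((v - mu) ** 2 for v in values) / float(n)  # divides by n, not n - 1
    return mu, var

def gaussian_pdf(v, mu, var):
    """Density of N(mu, var) evaluated at v."""
    return math.exp(-((v - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical alcohol-like values for illustration; not taken from the wine data.
toy_values = [13.2, 13.9, 14.1, 13.6, 13.7]
mu, var = fit_gaussian(toy_values)
sd = math.sqrt(var)

# Evaluate the fitted density on 101 evenly spaced points around the mean.
grid = [mu - 3 * sd + i * (6 * sd / 100) for i in range(101)]
density = [gaussian_pdf(v, mu, var) for v in grid]
```

Plotting `density` against `grid` on top of a normalized histogram of the same values would give a picture of the kind shown by the widget.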

In [24]:
interact_manual( helper.uni.densityPlot, feature=IntSlider(0,0,12), label=IntSlider(1,1,3))

<function __main__.densityPlot>

### <font color="magenta">*Fast Exercise*</font>

The above cell shows the histogram and Gaussian fit to feature #0 for winery 1. Try looking at the plot for different features and different wineries.

The code for plotting the Gaussian density focuses on the region within 3 standard deviations of the mean. Do you see where this happens? Why did we make this choice?
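
One way to reason about the 3-standard-deviation window is to compute how much probability mass a Gaussian places within $k$ standard deviations of its mean, which is $\text{erf}(k/\sqrt{2})$. The short sketch below uses only the standard library; `mass_within` is a hypothetical helper name.

```python
import math

def mass_within(k):
    """Probability that a Gaussian variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print("{} sd: {:.4f}".format(k, mass_within(k)))
# 1 sd: 0.6827
# 2 sd: 0.9545
# 3 sd: 0.9973
```

Since about 99.7% of the mass falls within 3 standard deviations, a plot restricted to that window shows essentially the whole curve.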

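Write a function, **hueForTwo**, which returns the tuple $(\text{mean},\, \text{var},\, \text{s.d.})$ of feature #11 for winery 2. Note that, despite the function's name, feature #11 in `featurenames` is *OD280/OD315 of diluted wines*; *hue* is feature #10.

***Hint:*** Use `np.mean` and `np.var`.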

In [8]:
# modify this cell

def hueForTwo():
    # inputs: no inputs, but you should call variables from x, y, tx, ty
    # output: ( mean, var, sd)
    
    ### BEGIN SOLUTION
    feature = 11
    label = 2
    mu = np.mean(x[y==label,feature]) # mean
    var = np.var(x[y==label,feature]) # variance
    std = np.sqrt(var) # standard deviation
    return mu, var, std
    ### END SOLUTION

In [9]:
# Check Function
assert sum(abs( hueForTwo() - np.array([2.76411764705, 0.2300477508650, 0.4796329334658]) )) < 10**-5
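
A detail worth knowing when comparing against the check above: `np.var` divides by $n$ by default (`ddof=0`), whereas the sample variance divides by $n-1$ (`ddof=1`). The sketch below spells out both conventions in plain Python on a hypothetical sample, so the difference is easy to see.

```python
import math

# Hypothetical sample for illustration; not taken from the wine data.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
mean = sum(sample) / n
sum_sq_dev = sum((v - mean) ** 2 for v in sample)

pop_var = sum_sq_dev / n         # divides by n, like np.var(sample)
samp_var = sum_sq_dev / (n - 1)  # divides by n - 1, like np.var(sample, ddof=1)
pop_sd = math.sqrt(pop_var)      # standard deviation under the divide-by-n convention

print(pop_var)  # 4.0
print(pop_sd)   # 2.0
```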

## Fit a Gaussian to each class

The helper function `helper.uni.fit_generative_model` fits a Gaussian generative model to the three classes, restricted to just a single feature.

We call this function on the feature 'Alcohol' and print the class weights as a sanity check.
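
For reference, a standard way to fit such a model uses the maximum-likelihood estimates below. This is an assumption about the helper, whose body is not shown in this notebook. Here $n_j$ is the number of training points with label $j$, $n$ is the total number of training points, and $x_i$ is the chosen feature's value for point $i$:

$$\pi_j = \frac{n_j}{n}, \qquad \mu_j = \frac{1}{n_j}\sum_{i:\,y_i=j} x_i, \qquad \sigma_j^2 = \frac{1}{n_j}\sum_{i:\,y_i=j} (x_i-\mu_j)^2$$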

In [10]:
feature = 0 # 'alcohol'
mu, var, pi = helper.uni.fit_generative_model(feature)
pi[1], pi[2], pi[3]

(0.33076923076923076, 0.3923076923076923, 0.27692307692307694)
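
The sanity check can be done by hand: the printed weights should equal the fraction of training points in each class. The sketch below recomputes them from the class counts $(43, 51, 36)$ reported earlier. The names `class_counts` and `class_weights` are introduced here for illustration only.

```python
# Class counts of the 130 training points, as reported earlier in the notebook.
class_counts = {1: 43, 2: 51, 3: 36}
n_train = sum(class_counts.values())

# Each class weight is the fraction of training points carrying that label.
class_weights = {label: count / float(n_train) for label, count in class_counts.items()}

print(class_weights[1])
print(class_weights[2])
print(class_weights[3])
```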

Next, display the Gaussian distribution for each of the three classes.

In [12]:
interact_manual( helper.uni.gaussians, feature=IntSlider(0,0,12) )


<function __main__.gaussians>

### <font color="magenta">*Fast Exercise*</font>

Use the widget above to check out the class distributions for different features.

Write a function, **mostIntense**, which returns the number of the class that has the highest (*mean*) color intensity.

In [13]:
# modify this cell

def mostIntense():
    # inputs: no inputs, use the interactive graphic above to figure out the answer
    # output: an integer, either 1, 2, or 3
    
    ### BEGIN SOLUTION
    return 3
    ### END SOLUTION

In [14]:
# Check Function
assert mostIntense() == 3

## Prediction time

How well do you think we can predict the class (1, 2, or 3) based on just one feature? The code below lets us find out.

Later we will look at larger sets of features as well, so we define a general-purpose routine that returns the test error using *any* subset of features.
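
As a rough sketch of what such a routine typically does for a single feature: each test point is assigned the label $j$ that maximizes $\pi_j \, N(x;\, \mu_j, \sigma_j^2)$, and the test error is the fraction of points labelled incorrectly. The functions and the `toy_*` parameters below are hypothetical and not fitted to the wine data. The body of `helper.uni.test_model` is not shown in this notebook, so it may differ.

```python
import math

def log_gaussian(v, mu, var):
    """Log-density of N(mu, var) at v."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)

def predict(v, mu, var, pi):
    """Return the label j that maximizes log pi[j] + log N(v; mu[j], var[j])."""
    return max(pi, key=lambda j: math.log(pi[j]) + log_gaussian(v, mu[j], var[j]))

def error_rate(values, labels, mu, var, pi):
    """Fraction of points whose predicted label differs from the true label."""
    mistakes = sum(predict(v, mu, var, pi) != t for v, t in zip(values, labels))
    return mistakes / float(len(labels))

# Hypothetical parameters and test points for illustration; not fitted to the wine data.
toy_mu = {1: 13.7, 2: 12.3, 3: 13.1}
toy_var = {1: 0.2, 2: 0.3, 3: 0.3}
toy_pi = {1: 0.33, 2: 0.39, 3: 0.28}
toy_values = [14.2, 12.0, 13.0, 12.4]
toy_labels = [1, 2, 3, 1]

toy_error = error_rate(toy_values, toy_labels, toy_mu, toy_var, toy_pi)
print(toy_error)  # 0.25: only the last point (true label 1, predicted 2) is wrong
```

Working with log-probabilities rather than raw products avoids numerical underflow, which matters more once several features are combined.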

Okay, so let's determine the test error using just the 'Alcohol' feature.

In [15]:
interact(helper.uni.test_model, feature=IntSlider(0,0,12) )


<function __main__.test_model>

### <font color="magenta">*Fast Exercise*</font>

Try checking out the test error for different features.

Write a function, **featureFacts**, that returns $(\text{magnesium_test_error}\,,\,\text{best_feature}\, ,\,\text{worst_feature})$

In [16]:
# modify this cell

def featureFacts():
    # inputs: no inputs, use the interactive graphic above to figure out the answer
    # output: (float, int, int) as described above
    
    ### BEGIN SOLUTION
    return (0.5208333333, 2 , 6)
    ### END SOLUTION

In [17]:
# Check Function
assert sum(abs(featureFacts() - np.array([0.5208333333, 2 , 6]) )) < 10**-5