## Load in the data set

Start by loading in the Wine data set.
There are 178 data points, each with 13 features and a label (1,2,3).
We will divide these into a training set of 130 points and a test set of 48 points.

In [1]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
# Useful module for dealing with the Gaussian density
from scipy.stats import norm, multivariate_normal 
# Now load "wine.data.txt" data set.
# This needs to be in the same directory
# 178 lines, each with one point. First value is the label (1,2,3), remaining 13 numbers are features
data = np.loadtxt('wine.data.txt', delimiter=',')
# Names of features
featurenames = ['Alcohol', 'Malic acid', 'Ash', 'Alcalinity of ash','Magnesium', 'Total phenols', 
                'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins', 'Color intensity', 'Hue', 
                'OD280/OD315 of diluted wines', 'Proline']

In [2]:
# Import packages for interactive graphs
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider

Fix a particular "random" permutation of the data, and use it to effect the training / test split.

In [3]:
perm = np.array([  4,  93, 103, 152,  77,  81,  14,  58, 139,  53,  40, 167,  20,
        80, 130,  16, 110, 158,  42, 135,   8,  69, 153,  94,  91,  51,
       117, 146,  72, 142, 137,  88, 165, 106,  33,  67, 133, 113, 171,
       129, 141,  21,  12,  44,   3, 164, 169,  41,   6, 177,  17, 174,
       104, 176, 168,  26, 173, 122, 159, 111, 163,  50,  15,  37, 114,
         2, 109,  68,  39,  96,  36, 149, 151, 124, 156, 108, 107,  30,
        43,  28,  54,  59, 154,  78,  92, 157, 140,  73,  34,  49, 160,
       118, 125, 126, 127, 145, 144,   9,  24,  90,  84,  55,  19, 148,
        25,  61, 123,   0,  38,  97,  32,  85,  29,  45, 128,  75,  66,
        86,  47, 102, 175,  63,  82,  83, 115, 136,  98,  46,  62, 150,
       162, 134, 138,  76,  87, 170, 105,  65,  89,  71, 112,  56,  74,
       132, 100,  27,  64, 166,  22, 155,  57, 119,  99,   7,  23,  13,
       121, 101, 116, 172,  95, 131,  10,  35,  11,  60, 161,   1,  18,
       147, 143,  31,  79,  48,   5, 120,  52,  70])
# Split 178 instances into training set (x, y) of size 130 and test set (tx, ty) of size 48
# Also split apart data and labels
# perm = np.random.permutation(178)
x = data[perm[0:130],1:14]
y = data[perm[0:130],0]
tx = data[perm[130:178], 1:14]
ty = data[perm[130:178],0]
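
The permutation above is hard-coded so that everyone gets the same split. As a minimal sketch of an alternative, a seeded random generator also gives a repeatable 130/48 split. The names `rng`, `perm_sketch`, `train_idx` and `test_idx` and the seed value are assumptions made for illustration, and this will not reproduce the particular permutation used in this notebook.

```python
import numpy as np

# Assumption: the seed 0 is arbitrary; any fixed seed gives a repeatable split.
rng = np.random.default_rng(0)
n_total, n_train = 178, 130
perm_sketch = rng.permutation(n_total)

train_idx = perm_sketch[:n_train]   # first 130 indices -> training set
test_idx = perm_sketch[n_train:]    # remaining 48 indices -> test set

# Every index lands in exactly one of the two sets
print(len(train_idx), len(test_idx), np.intersect1d(train_idx, test_idx).size)
```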

Let's see how many training points there are from each class.

In [4]:
sum(y==1), sum(y==2), sum(y==3)

(43, 51, 36)

### <font color="magenta">*Fast Exercise*</font>

Can you see how many test points there are from each class?

Write a function, **sumTestClasses**, that, like above, returns a tuple $(\text{sum}_1,\, \text{sum}_2,\, \text{sum}_3)$ but this time for the test points.

In [5]:
# modify this cell

def sumTestClasses():
    # inputs: no inputs, but you should call a variable from x, y, tx, ty
    # output: ( sum_1, sum_2, sum_3)
    
    ### BEGIN SOLUTION
    return sum(ty==1), sum(ty==2), sum(ty==3)
    ### END SOLUTION

In [6]:
# Check Function
assert sumTestClasses() == (16, 20, 12)

## Look at the distribution of a single feature from one of the wineries

Now we'll pick just one feature: 'Alcohol'. This is the first feature, that is, number 0.

We will display a histogram of this feature's values under class 1. We will also show the Gaussian fit to this distribution.

In [7]:
@interact_manual( feature=IntSlider(0,0,12), label=IntSlider(1,1,3))
def densityPlot(feature, label):
    plt.hist(x[y==label,feature], density=True) # 'normed' was removed from newer matplotlib
    #
    mu = np.mean(x[y==label,feature]) # mean
    var = np.var(x[y==label,feature]) # variance
    std = np.sqrt(var) # standard deviation
    #
    x_axis = np.linspace(mu - 3*std, mu + 3*std, 1000)
    plt.plot(x_axis, norm.pdf(x_axis,mu,std), 'r', lw=2)
    plt.title("Winery "+str(label) )
    plt.xlabel(featurenames[feature], fontsize=14, color='red')
    plt.ylabel('Density', fontsize=14, color='red')
    plt.show()
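
A note on the variance used for the fit: `np.var` divides by $n$ by default (`ddof=0`), which is the maximum-likelihood estimate for a Gaussian, whereas `ddof=1` gives the unbiased estimate that divides by $n-1$. The small sketch below uses made-up toy values (not Wine data) to show the difference.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])  # toy data, not from the Wine set

var_mle = np.var(values)               # divides by n (ddof=0, the default)
var_unbiased = np.var(values, ddof=1)  # divides by n - 1

print(var_mle, var_unbiased)
```

With only a few dozen points per class the two estimates differ slightly, so checks that compare against stored values depend on which one is used.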

### <font color="magenta">*Fast Exercise*</font>

The above cell shows the histogram and Gaussian fit to feature #0 for winery 1. Try looking at the plot for different features and different wineries.

The code for plotting the Gaussian density focuses on the region within 3 standard deviations of the mean. Do you see where this happens? Why did we make this choice?

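Write a function, **hueForTwo**, which returns the tuple, $(\text{mean}\,,\text{var}\,,\text{s.d.})$ of feature #11 for winery 2. (Despite the function's name, index 11 in `featurenames` is 'OD280/OD315 of diluted wines'; 'Hue' is index 10.)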

In [8]:
# modify this cell

def hueForTwo():
    # inputs: no inputs, but you should call variables from x, y, tx, ty
    # output: ( mean, var, sd)
    
    ### BEGIN SOLUTION
    feature = 11
    label = 2
    mu = np.mean(x[y==label,feature]) # mean
    var = np.var(x[y==label,feature]) # variance
    std = np.sqrt(var) # standard deviation
    return mu, var, std
    ### END SOLUTION

In [9]:
# Check Function
assert sum(abs( hueForTwo() - np.array([2.76411764705, 0.2300477508650, 0.4796329334658]) )) < 10**-5

Play around with the feature and winery sliders in the plot above to see the kinds of distributions you obtain.

## Fit a Gaussian to each class

Let's define a function that will fit a Gaussian generative model to the three classes, restricted to just a single feature.

In [10]:
# Assumes y takes on values 1,2,3
def fit_generative_model(x,y, feature):
    k = 3 # number of classes
    mu = np.zeros(k+1) # list of means
    var = np.zeros(k+1) # list of variances
    pi = np.zeros(k+1) # list of class weights
    for label in range(1,k+1):
        indices = (y==label)
        mu[label] = np.mean(x[indices,feature])
        var[label] = np.var(x[indices,feature])
        pi[label] = float(sum(indices))/float(len(y))
    return mu, var, pi

Call this function on the feature 'alcohol'. Print the class weights as a sanity check.

In [11]:
feature = 0 # 'alcohol'
mu, var, pi = fit_generative_model(x, y, feature)
pi[1], pi[2], pi[3]

(0.33076923076923076, 0.3923076923076923, 0.27692307692307694)
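
These weights are just the class proportions in the training set. As a quick check, the sketch below recomputes them from the per-class counts (43, 51, 36) reported earlier; the names `counts` and `weights` are introduced here for illustration.

```python
import numpy as np

counts = np.array([43, 51, 36])          # training points in classes 1, 2, 3
weights = counts / float(counts.sum())   # fraction of the 130 training points

print(weights)        # class weights pi_1, pi_2, pi_3
print(weights.sum())  # the weights form a probability distribution
```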

Next, display the Gaussian distribution for each of the three classes

In [12]:
@interact_manual( feature=IntSlider(0,0,12) )
def gaussians(feature):
    mu, var, pi = fit_generative_model(x, y, feature)

    colors = ['r', 'k', 'g']
    for label in range(1,4):
        m = mu[label]
        s = np.sqrt(var[label])
        x_axis = np.linspace(m - 3*s, m+3*s, 1000)
        plt.plot(x_axis, norm.pdf(x_axis,m,s), colors[label-1], label="class " + str(label))
    plt.xlabel(featurenames[feature], fontsize=14, color='red')
    plt.ylabel('Density', fontsize=14, color='red')
    plt.legend()
    plt.show()

### <font color="magenta">*Fast Exercise*</font>

Use the widget above to check out the class distributions for different features.

Write a function, **mostIntense**, which returns the number of the class that has the highest (*mean*) color intensity.

In [13]:
# modify this cell

def mostIntense():
    # inputs: no inputs, use the interactive graphic above to figure out the answer
    # output: an integer, either 1, 2, or 3
    
    ### BEGIN SOLUTION
    return 3
    ### END SOLUTION

In [14]:
# Check Function
assert mostIntense() == 3

## Prediction time

How well do you think we can predict the class (1, 2, 3) based on just one feature? The code below lets us find out.

The routine below takes the index of a single feature and prints the test error of the classifier that uses only that feature. Later we will look at larger sets of features as well.

Okay, so let's determine the test error using just the 'Alcohol' feature.

In [15]:
@interact( feature=IntSlider(0,0,12) )
def test_model(feature):
    mu, var, pi = fit_generative_model(x, y, feature)

    k = 3 # Labels 1,2,...,k
    nt = len(ty) # Number of test points
    score = np.zeros((nt,k+1))
    for i in range(0,nt):
        for label in range(1,k+1):
            score[i,label] = np.log(pi[label]) + \
            norm.logpdf(tx[i,feature], mu[label], np.sqrt(var[label]))
    predictions = np.argmax(score[:,1:k+1], axis=1) + 1
    # Finally, tally up the errors
    errors = np.sum(predictions != ty)
    print("Test error using feature " + featurenames[feature] + ": " + str(errors) + "/" + str(nt))
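
The prediction rule in the cell above picks, for each test point, the label $j$ that maximizes $\log \pi_j + \log P_j(x)$, where $P_j$ is the Gaussian fitted to class $j$. The sketch below applies the same rule without explicit loops, using NumPy broadcasting. The parameters `mu_demo`, `sd_demo`, `pi_demo` and the values in `points` are made up for illustration and are not fitted to the Wine data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical parameters for three classes (illustrative values only)
mu_demo = np.array([0.0, 5.0, 10.0])
sd_demo = np.array([1.0, 1.0, 1.0])
pi_demo = np.array([1.0, 1.0, 1.0]) / 3.0

points = np.array([-0.5, 4.0, 9.0, 6.0])  # toy test values

# scores[i, j] = log pi_j + log N(points[i]; mu_j, sd_j^2), via broadcasting
scores = np.log(pi_demo) + norm.logpdf(points[:, None], mu_demo, sd_demo)
pred = np.argmax(scores, axis=1) + 1  # shift 0-based columns to labels 1..3

print(pred)
```

Because the toy classes share the same weight and standard deviation, each point is simply assigned to the class with the nearest mean.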

### <font color="magenta">*Fast Exercise*</font>

Try checking out the test error for different features. 

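Write a function, **featureFacts**, that returns $(\text{magnesium_test_error}\,,\,\text{best_feature}\, ,\,\text{worst_feature})$, where the test error is given as a fraction of the test points and the features are given by their index.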

In [16]:
# modify this cell

def featureFacts():
    # inputs: no inputs, use the interactive graphic above to figure out the answer
    # output: (float, int, int) as described above
    
    ### BEGIN SOLUTION
    return (0.5208333333, 2 , 6)
    ### END SOLUTION

In [17]:
# Check Function
assert sum(abs(featureFacts() - np.array([0.5208333333, 2 , 6]) )) < 10**-5