# Introduction

This year, we will run the practical exercises in Python, a powerful interpreted language. Python is increasingly used in machine learning research, as it is portable, has a clean syntax, and sports a large collection of scientific software libraries whose compiled routines keep numerical code fast.

To handle the more advanced mathematics in this course, we make use of a few Python libraries:

* numpy, which allows us to work with vectors and matrices
* scipy, which gives us access to scientific algorithms
* matplotlib, which allows us to plot our results

For this introduction, we will run our practical exercises within a "*notebook*": an interactive web page that integrates both code and text, allowing us to combine the code with documentation. You will edit this notebook to write your code and provide answers to the questions. Much of the documentation for Python and its libraries can be found online. In particular, you should be able to do the exercises described in this document with the information available at http://scipy.org/docs.html

Before we start, let's load the libraries that we will need for this exercise.

In [2]:
# The following line makes sure that when we plot stuff it shows up in the notebook
%matplotlib inline 

#import scipy.io as sio          # Allow for the import of Matlab files
import scipy.stats as stats     # Statistics module
import numpy as np              # Module for, among others, matrix operations
import matplotlib.pyplot as plt # Plotting
       


# Basic stuff

Let's start by playing around a little with some data. First, load the variables saved in the file "*data.npz*". This can be done using the "**numpy.load**" command. This file contains multiple variables, which can be accessed as elements of a dictionary-like structure. In the following, we will refer to these variables by their key in the file, so for example, $v$ refers to the vector that you can access as *data['v']* if you called your dictionary "data".
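As a minimal sketch of how such a file behaves, the example below writes two small arrays to a hypothetical file called *example.npz* (in a temporary directory, so the real *data.npz* is left untouched) and reads them back by key. The file name and the keys `v` and `m` are assumptions made purely for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical example file, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez(path, v=np.arange(5.0), m=np.eye(2))

data = np.load(path)

# The names of the stored arrays are listed in the .files attribute.
print(data.files)

# Each array is retrieved by its key, as with a dictionary.
print(data['v'].shape)
print(data['m'])
```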

Answer the following questions:

1. What is the order of $v$?
1. Compute the 2-norm $\|v\|_2$ of $v$. Notice that $v$ is relatively large, which can lead to out-of-memory errors if you're not careful. Check out the *numpy.ndarray.dot()* function.
1. Use *%timeit* to check how long the computation takes, and report the results.
1. How many elements of $v$ are larger than 1? How many are larger than $2, 3,\dots,5$?

In [None]:
d = np.load("data.npz")

v = d['v']

print("The order of v:", v.shape)
print("The two-norm of v:", np.sqrt(v.T.dot(v)))
print("Timing inner product")
%timeit v.T.dot(v)

# Count the elements above each threshold with a vectorised comparison
%time for t in [1, 2, 3, 4, 5]: print("Number of elements >", t, ":", np.sum(v > t))
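
One way the memory problem can arise, assuming $v$ is stored as a column vector, is by multiplying in the wrong order: the outer product has $N \times N$ entries, while the inner product is a single number. The sketch below illustrates this on a small synthetic vector `v_demo` (a stand-in, not the contents of *data.npz*) and cross-checks the norm against `np.linalg.norm`.

```python
import numpy as np

rng = np.random.default_rng(0)
v_demo = rng.standard_normal((1000, 1))   # synthetic stand-in for v

inner = v_demo.T.dot(v_demo)   # shape (1, 1): cheap
outer = v_demo.dot(v_demo.T)   # shape (1000, 1000): grows as N squared

norm_from_dot = np.sqrt(inner[0, 0])
norm_builtin = np.linalg.norm(v_demo)
print(inner.shape, outer.shape)
print(norm_from_dot, norm_builtin)

# Boolean masks count as 0/1, so summing a comparison counts the matches.
counts = {t: int(np.sum(v_demo > t)) for t in range(1, 6)}
print(counts)
```

For a vector with millions of elements, the outer product would need terabytes of memory, which is why the order of the operands matters.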




## Plotting

Plot, in the range $[-5, 5]$, the Gaussian PDFs with parameters $(\mu=0,\sigma=1), (\mu=0, \sigma=2), (\mu=0, \sigma=3)$. Use the functions *plt.plot*, *stats.norm.pdf* and *plt.legend*.
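
A possible starting point is sketched below. The number of sample points (201) and the axis labels are arbitrary choices, and the dictionary `curves` is only there to keep the computed values around for inspection.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

x = np.linspace(-5, 5, 201)

curves = {}
for sigma in [1, 2, 3]:
    # In stats.norm, loc is the mean and scale is the standard deviation.
    curves[sigma] = stats.norm.pdf(x, loc=0, scale=sigma)
    plt.plot(x, curves[sigma], label=rf"$\mu=0, \sigma={sigma}$")

plt.xlabel("x")
plt.ylabel("p(x)")
plt.legend()
plt.show()
```

Note that `scale` is $\sigma$, not $\sigma^2$; passing the variance instead is a common source of curves that look too wide.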


## Playing with matrices and Python

Plot a normalised histogram of the elements in vector $v2$ using 20 bins, and superimpose a plot, in the range $[-5, 5]$, of the Gaussian distribution with the mean and variance of the data. Use the built-in functions to compute these. In particular, use **np.mean** and **np.cov** to fit the parameters to the data.
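
The sketch below shows one way to combine the histogram and the fitted curve. It uses a synthetic sample `v2_demo` in place of $v2$, so the numbers it produces say nothing about the real data; the mean 0.5 and standard deviation 1.5 of the sample are arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

rng = np.random.default_rng(1)
v2_demo = rng.normal(loc=0.5, scale=1.5, size=10000)  # synthetic stand-in for v2

mu = np.mean(v2_demo)
var = float(np.cov(v2_demo))   # np.cov returns a 0-d array for 1-D input
sigma = np.sqrt(var)

# density=True rescales the bars so that the histogram integrates to 1.
heights, edges, _ = plt.hist(v2_demo, bins=20, density=True, alpha=0.5,
                             label="data")

x = np.linspace(-5, 5, 201)
plt.plot(x, stats.norm.pdf(x, loc=mu, scale=sigma), label="fitted Gaussian")
plt.legend()
plt.show()
```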

Get information on the *np.cov* function using the **help** function. With its default settings, this function does not provide you with the maximum likelihood estimators (MLE) of the parameters. For the Gaussian distribution, the MLEs are

$$\boldsymbol{\mu} = \frac{\sum_{n=1}^N \mathbf{x}_n}{N}$$ 

and

$$\boldsymbol{\Sigma} = \frac{\sum_{n=1}^N (\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^\top}{N}$$

Compute these parameters by hand and compare the values you obtain to those returned by the built-in function. How does your implementation compare in terms of execution speed?
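
The comparison can be sketched as follows, on a synthetic data matrix `X` with one observation per row (an assumption about the layout; the real data may be stored differently). By default *np.cov* divides by $N-1$ rather than $N$, and passing `bias=True` switches it to the $N$ normalisation used in the formulas above.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 2))   # synthetic data: N = 500 rows, 2 columns

N = X.shape[0]

# Maximum likelihood estimates, written out from the formulas.
mu_mle = X.sum(axis=0) / N
centred = X - mu_mle
sigma_mle = centred.T.dot(centred) / N

# Built-in estimates; rowvar=False means each column is a variable.
sigma_default = np.cov(X, rowvar=False)            # divides by N - 1
sigma_biased = np.cov(X, rowvar=False, bias=True)  # divides by N

print(np.allclose(sigma_mle, sigma_biased))
print(np.allclose(sigma_mle * N / (N - 1), sigma_default))
```

In a notebook, prefixing each computation with *%timeit* gives the execution-speed comparison.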
    

# More advanced things...

Load the dataset in **data-2class.npz**. This file contains a set of 2-dimensional points $d$, and a corresponding set of labels $l$.

1. Create a 2D scatterplot of $d$, using red for the elements with corresponding label 0, and blue for $l_i=1$.
1. Draw a straight line separating the two classes.
1. Fit two 2D Gaussian distributions, one to the points with label $l_i=0$ and one to those with $l_i=1$.
1. Create a 3D plot with axes $x_0$, $x_1$ and $p$. Create a scatterplot of the data in the $p=0$ plane. Then plot a wireframe plot showing both Gaussians, where $p$ is a function of $x_0,x_1$.
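
The fitting and 3D plotting steps can be sketched as below. The points `pts` and labels `labels` are synthetic stand-ins for $d$ and $l$, the grid limits are arbitrary, and `scipy.stats.multivariate_normal` is one possible way to evaluate the fitted densities. The separating line is not included.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed on older matplotlib

rng = np.random.default_rng(3)
# Synthetic stand-ins for d and l: two clusters of 100 points each.
pts = np.vstack([rng.normal([-2, 0], 1.0, size=(100, 2)),
                 rng.normal([2, 1], 1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Grid of (x0, x1) positions at which the densities are evaluated.
g0, g1 = np.meshgrid(np.linspace(-6, 6, 40), np.linspace(-5, 6, 40))
grid = np.dstack([g0, g1])

fitted = {}
for label, colour in [(0, "red"), (1, "blue")]:
    cls = pts[labels == label]
    mean = cls.mean(axis=0)
    cov = np.cov(cls, rowvar=False, bias=True)   # MLE covariance
    fitted[label] = (mean, cov)
    ax.scatter(cls[:, 0], cls[:, 1], zs=0, c=colour, s=5)   # data at p = 0
    density = multivariate_normal(mean, cov).pdf(grid)
    ax.plot_wireframe(g0, g1, density, color=colour, linewidth=0.5)

ax.set_xlabel("$x_0$")
ax.set_ylabel("$x_1$")
ax.set_zlabel("$p$")
plt.show()
```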



In [3]:
d = np.load("data-2class.npz")
d.keys()

['d', 'l']