
# Introduction to Machine Learning
### Tom Galligan (thomas.galligan@bnc.ox.ac.uk)



Welcome to the Oxford University CodeSoc Introduction to Machine Learning course! By the end of this course, you'll have been exposed to the central ideas of machine learning (ML), and will have worked with some of its most popular and powerful algorithms. 


The first thing you need to do is download the relevant software packages. 
In this course, you'll need Python 3.5 with numpy, pandas, matplotlib and scikit-learn (you might need to downgrade if you have a more recent version). You can download Python 3.5 from https://www.python.org/downloads/. 

I recommend installing Anaconda (https://docs.anaconda.com/anaconda/install/), which comes with all the main Python packages you'll need. You can also use pip, which is a bit more lightweight.

Then in a terminal, type the following:



In [None]:
conda install ipython
conda install keras
conda install tensorflow
conda install scikit-learn
conda install matplotlib

Enter these one at a time.
If you already have any of these packages, you can update them instead:

In [None]:
conda update ipython

and likewise for the other packages.



To help things run more smoothly, please have these packages installed before the first class so we can get straight into coding.

Prerequisites for the course:
* Basic familiarity with Python will help a lot, but complete beginners should still find it enjoyable!
* Some knowledge of basic maths and statistics (mean, variance, notion of a function) will help a lot. 
* The deeper theory won't be covered in the classes, but if you want to look into it you'll find it easier if you're comfortable with things like partial derivatives, n-dimensional Gaussians, linear algebra, and metrics.



# k-Nearest Neighbors

The first thing we'll do is open up an IPython shell. To do this, open a terminal/command prompt and type

In [None]:
ipython --pylab

The --pylab option starts IPython with numpy and matplotlib already imported. 

Next we need to import the other packages we'll use. 

In [None]:
import matplotlib.pyplot as plt
import sklearn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris


The last command here will let us load in the data for our example. To do this, type

In [None]:
X, y  = load_iris(return_X_y=True)

X is a table with one row per flower and the features in the following columns: sepal length, sepal width, petal length, petal width. y contains the species of each flower, encoded as setosa (0), versicolor (1) and virginica (2).
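As a quick sanity check on what was just loaded, the snippet below prints the dimensions of X and y and counts the samples per class. It is only a sketch that uses standard numpy calls alongside the `load_iris` call from above.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

print(X.shape)         # (number of flowers, number of features)
print(y.shape)         # one label per flower
print(np.unique(y))    # the distinct class labels
print(np.bincount(y))  # how many flowers belong to each class
```

The dataset has 150 flowers with 4 features each, split evenly across the three species.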



First of all, let's plot the data to see what we're working with (this is almost always a good place to start).

In [None]:
plt.scatter(X[:50,0],X[:50,1],color='blue',label='setosa')
plt.scatter(X[50:100,0],X[50:100,1],color='red',label='versicolor')
plt.scatter(X[100:150,0],X[100:150,1],color='black',label='virginica')
plt.legend()
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.show()

You can see that it should be easy to identify setosa purely from the sepal length and width, but to distinguish between versicolor and virginica we'll need to bring in more features. Close the previous plot, and type the following:

In [None]:
plt.scatter(X[:50,2],X[:50,3],color='blue',label='setosa')
plt.scatter(X[50:100,2],X[50:100,3],color='red',label='versicolor')
plt.scatter(X[100:150,2],X[100:150,3],color='black',label='virginica')
plt.legend()
plt.xlabel('petal length')
plt.ylabel('petal width')
plt.show()
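To back up what the plot suggests with numbers, here is a small sketch that prints the range of petal lengths for each species. The `names` list is just a helper introduced for this illustration; the labels 0, 1, 2 follow the encoding described above.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Helper list for printing; index matches the class label in y.
names = ['setosa', 'versicolor', 'virginica']

for label, name in enumerate(names):
    petal_length = X[y == label, 2]  # column 2 is petal length
    print(name, petal_length.min(), petal_length.max())
```

The setosa range does not overlap with the other two, whereas versicolor and virginica overlap slightly, which is why a single feature is not quite enough to separate them.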

It looks like we should be able to build a good classifying model from our data! First of all, we need to separate our data into a training set and a test set. This makes sure our test is 'clean', meaning the model is evaluated on flowers it has never seen, and is especially important in models where the training process is more involved. 

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
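It can be worth checking what the split actually produced. The sketch below repeats the same `train_test_split` call and prints the resulting array shapes; nothing here goes beyond the calls already used above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(X_train.shape, y_train.shape)  # training features and labels
print(X_test.shape, y_test.shape)    # held-out features and labels
```

With 150 flowers and `test_size=0.33`, this gives 100 training samples and 50 test samples. Fixing `random_state` means the same split is produced every time the code is run.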

Now we'll instantiate our model, considering the 3 nearest neighbors for now:

In [None]:
knn = KNeighborsClassifier(n_neighbors=3)

And train it on our data:

In [None]:
knn.fit(X_train, y_train)

This might seem like a big step with a lot going on behind the scenes. In reality, all we've done is load the training data into the model. This is because kNN is a 'lazy learner': it puts off the real work until it is asked to make a prediction. 
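To make the 'lazy' idea concrete, here is a minimal from-scratch sketch of what a kNN prediction involves. It is not scikit-learn's actual implementation: the function `knn_predict_one` is a hypothetical helper written for this illustration, it assumes plain Euclidean distance, and ties in the vote go to the lowest class label.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

def knn_predict_one(x, X_train, y_train, k=3):
    """Hypothetical helper: classify one flower by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training point (an assumption of this sketch).
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels among those neighbors and pick the most common one.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# "Training" was just keeping X_train and y_train around;
# all the computation happens here, at prediction time.
predictions = np.array([knn_predict_one(x, X_train, y_train) for x in X_test])
accuracy = np.mean(predictions == y_test)
print(accuracy)
```

Notice that there is no separate training function at all: the stored training data is the model, and each prediction scans it for the closest points.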