# Your first Machine Learning Model

As mentioned in the slides, we're going to start building your first Machine Learning model *from scratch*!

As a recap, a Machine Learning model consists of two parts: __data__ and a __method (algorithm)__. We'll start with the data part.


## Introducing the Iris dataset

![imgs/iris_petal_sepal.png](imgs/iris_petal_sepal.png)

We'll first play around with the [iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), a dataset consisting of the lengths and widths of the petals and sepals of different Iris flowers.

We'll first load our data with a tool called __Pandas__.

In [None]:
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
import numpy

In [None]:
# Reading the dataset file
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'label']
iris_data = pandas.read_csv('datasets/iris.data', names=names)

# Create a colour map column
classes = list(iris_data['label'].unique())
iris_data['colour'] = [classes.index(i) for i in iris_data['label']]

# Show only the first 10 rows of the data.
iris_data.head(10)

In [None]:
_ = scatter_matrix(iris_data, c=iris_data['colour'])

In [None]:
plt.scatter(iris_data['petal_length'], iris_data['petal_width'], c=iris_data['colour'])
plt.show()

You can get a row of data with `iris_data.loc[n]`, where $n$ is the row's index label (counting from 0 here), and get any property of that row by indexing it with the column name.

In [None]:
iris_data.loc[0]

In [None]:
iris_data.loc[0]['sepal_length']

Great! This is basically how you'll access the data in your dataset.

## Heading over to the model

Now that we've got the concept of the $k$-NN algorithm, here is what we're trying to achieve: given the petal/sepal lengths/widths of a flower we wish to "predict", we calculate the Euclidean distance from those measurements to every flower in the dataset, then take the most common label among the $k$ closest ones.
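To make the idea concrete, here is a minimal from-scratch sketch of $k$-NN in plain Python. The function name `knn_predict` and the six toy measurements are made up for illustration (they are not rows from the real dataset), and ties in the vote are broken by whichever label was counted first, which is only one of several reasonable choices.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=1):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Euclidean distance from the query to every training point
    distances = [math.dist(point, query) for point in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those neighbours
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Made-up (petal_length, petal_width) pairs, for illustration only
toy_points = [(1.4, 0.2), (1.5, 0.3), (4.5, 1.5), (4.7, 1.4), (6.0, 2.5), (5.8, 2.2)]
toy_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

print(knn_predict(toy_points, toy_labels, (1.6, 0.25), k=1))  # setosa
print(knn_predict(toy_points, toy_labels, (4.6, 1.45), k=3))  # versicolor
```

In the second call, the three nearest points are two "versicolor" flowers and one "virginica" flower, so the vote goes to "versicolor".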

We could do this by hand, but fortunately Python has a machine learning library called `sklearn`, which we can simply import and use.

Head to [sklearn's documentation on $k$-nearest neighbour](#), and answer the following questions.

* How should the model be imported?
* How can we create the model's instance?
* How can we train the model with our dataset?

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
X = iris_data.loc[:, 'sepal_length':'petal_width']
y = iris_data.loc[:, 'colour']

Create the model, and train it!

In [None]:
k = 1
clf = KNeighborsClassifier(k)
clf.fit(X, y)

Let's try to predict one data point around here.

In [None]:
preds = [4, 1.5, 6, 1.5]

clf.predict([preds])

And let's look at the decision regions.

In [None]:
from mlxtend.plotting import plot_decision_regions

fig, ax = plt.subplots(2, 2, figsize=(10, 10))

for i, n in enumerate([[0, 1, 2, 3], [2, 3, 0, 1], [0, 2, 1, 3], [1, 3, 0, 2]]):
    plot_decision_regions(X.values, y.values, clf=clf,
                         filler_feature_values={n[2]: preds[n[2]], n[3]: preds[n[3]]},
                         filler_feature_ranges={n[2]: 2, n[3]: 2},
                         feature_index=[n[0], n[1]],
                         ax=ax.flat[i])
    ax.flat[i].scatter([preds[n[0]]], [preds[n[1]]], c='red', s=60, marker='X')
    ax.flat[i].set_xlabel(names[n[0]])
    ax.flat[i].set_ylabel(names[n[1]])
plt.show()

In [None]:
clf.score(X, y)

Try changing $k$ to get the best accuracy.

__Question__: Why does $k = 1$ give us the best accuracy (of 1.0 -- meaning that the model can predict all the data labels correctly)?

## Testing-Training split

A better way to measure accuracy is to split our dataset into a training set and a testing set: the model is trained on the training set only, then scored on the testing set, which it has never seen.
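As a rough illustration of what such a split does, here is a minimal sketch in plain Python that shuffles row numbers and cuts them into two parts. The helper `split_rows` is hypothetical and is not how `sklearn` implements its own splitting, so it will not pick the same rows even with the same seed.

```python
import random


def split_rows(rows, test_size=0.3, seed=4):
    """Shuffle `rows` and split them into (train, test) lists."""
    shuffled = list(rows)                   # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle, repeatable between runs
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# The iris dataset has 150 rows, so we split the row numbers 0..149
rows = list(range(150))
train_rows, test_rows = split_rows(rows)
print(len(train_rows), len(test_rows))  # 105 45
```

Every row ends up in exactly one of the two lists, so no flower used for testing was seen during training.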

`sklearn`'s got a function for us.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
(X_train.shape, X_test.shape)

In [None]:
k_2 = 1
clf_tt = KNeighborsClassifier(k_2)
clf_tt.fit(X_train, y_train)

Evaluate the performance...

In [None]:
clf_tt.score(X_test, y_test)
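For a classifier, the `score` method reports the mean accuracy: the fraction of predictions that match the true labels. A minimal sketch of that calculation is shown here; the function name `accuracy` and the two short label lists are made up for illustration.

```python
def accuracy(predicted, actual):
    """Return the fraction of predictions that equal the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    # Count the positions where the prediction matches the true label
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: 4 of the 5 predictions are correct
predicted = [0, 0, 1, 2, 2]
actual = [0, 0, 1, 1, 2]
print(accuracy(predicted, actual))  # 0.8
```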

In [None]:
for k in range(1, 51):
    clf_tt = KNeighborsClassifier(k)
    clf_tt.fit(X_train, y_train)
    score = clf_tt.score(X_test, y_test)
    print("k = {}\tscore = {}".format(k, score))