< [Supervised Learning](../ica05/Supervised_Learning.ipynb) | Contents (TODO) |  [Cluster Analysis](../ica07/Cluster_Analysis.ipynb) >

<a href="https://colab.research.google.com/github/stephenbaek/bigdata/blob/master/in-class-assignments/ica06/Distance_and_Similarity.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

# Distance and Similarity

## 1. The Iris Data Set

The [Iris data set](https://archive.ics.uci.edu/ml/datasets/iris) is a popular "hello world" data set for data scientists. The data set contains samples from three species of Iris flowers: *Iris setosa*, *Iris versicolor*, and *Iris virginica* (see below).

<table>
    <tr>
        <td><img src=https://upload.wikimedia.org/wikipedia/commons/thumb/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg/220px-Kosaciec_szczecinkowaty_Iris_setosa.jpg><br>Iris setosa</td>
        <td><img src=https://upload.wikimedia.org/wikipedia/commons/thumb/4/41/Iris_versicolor_3.jpg/220px-Iris_versicolor_3.jpg><br>Iris versicolor</td>
        <td><img src=https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Iris_virginica.jpg/220px-Iris_virginica.jpg><br>Iris virginica</td>
    </tr>
    <tr>
        <td colspan=3><center>The three species of Iris</center></td>
    </tr>
</table>

These three species differ in their sepal and petal dimensions. The data set contains four attributes for each flower: *sepal length*, *sepal width*, *petal length*, and *petal width*.
![](https://www.integratedots.com/wp-content/uploads/2019/06/iris_petal-sepal-e1560211020463.png)


### 1.1. Reading the Iris data set

The Iris data set is available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris). There are several files in the repository, but all we need here is the data file https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data, which is *comma-separated*.

In a previous session, we learned how to read a comma-separated file using pandas (see Section 4 of [ICA02 - How to Read and Represent Data](../ica02/How_to_Read_and_Represent_Data.ipynb)). We first import numpy and pandas:

In [None]:
import numpy as np
import pandas as pd

#### Assignment
Write code to read the file https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data into a pandas DataFrame. Set `header=None` and `names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'iris_class']` to set the column names manually. Name your DataFrame `data`.

In [None]:
data = # YOUR CODE HERE

To make our job easier later on, we will convert the string data in the `iris_class` column to ordinal variables:

In [None]:
data_ordinal = data.replace({'iris_class': {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}})
data_ordinal.sample(5)
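
As a small illustration of what this mapping does, the sketch below applies the same idea to a tiny hand-made table. The DataFrame `toy` and its values are made up for this example, and it uses `Series.map` rather than `DataFrame.replace`; both approaches should give integer labels here.

```python
import pandas as pd

# Hypothetical three-row table standing in for the real Iris data
toy = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0],
    'iris_class': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
})

label_map = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

# map() looks up each string in the dictionary and returns the matching integer
toy_ordinal = toy.assign(iris_class=toy['iris_class'].map(label_map))
print(toy_ordinal)
```

Note that `toy` itself is left unchanged, just as `data` keeps its string labels after `data_ordinal` is created.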

Once properly loaded, you should be able to run the following line to draw a histogram:

In [None]:
data.sepal_length.hist(bins=30)

There are a few other ways to visualize the data. For example, the code below draws the sepal length distribution separately for each class of iris.

In [None]:
for class_type in data.iris_class.unique():
    # A boolean mask selects the rows that belong to one species
    data.sepal_length[data.iris_class == class_type].hist(bins=30)

#### Assignment
- Write code to draw similar histograms for the other attributes, i.e. sepal width, petal length, and petal width.
- For each attribute, can you tell the three species apart? You can just "eyeball" it.
- Based on your answer above, build simple `if-else` logic to classify irises. Test your logic on the Iris data set. How accurate can you be?

In [None]:
# PROVIDE YOUR ANSWERS HERE. IF NECESSARY, CREATE NEW CODE/TEXT CELLS.

Alternatively, we could visualize the data set in a 2-D scatter plot, with each axis representing one of the attributes. The type of flower can be color-coded. We will use a library called `matplotlib` for visualization, which can be imported like this:

In [None]:
import matplotlib.pyplot as plt

Now, a simple call to `plt.scatter()` will draw a scatter plot. If you are already familiar with MATLAB, `matplotlib` is very similar to MATLAB's visualization functions. For more details, see: https://matplotlib.org/gallery/index.html

In [None]:
plt.scatter(data_ordinal.sepal_length, data_ordinal.sepal_width, c=data_ordinal.iris_class)
plt.xlabel('sepal length')
plt.ylabel('sepal width')
plt.show()

#### Assignment

- Write code to draw the scatter plot for every other pair of attributes besides 'sepal length' - 'sepal width'.
- For each plot, can you draw a straight line separating the different species? (Again, by eyeballing.) Roughly, what are the slope and the intercept of the line you came up with?
- Implement a linear classifier using the line equations you came up with manually. What is the accuracy?

In [None]:
# PROVIDE YOUR ANSWERS HERE. IF NECESSARY, CREATE NEW CODE/TEXT CELLS.

### 1.2. K-Nearest Neighbors

K-nearest neighbors, or *KNN*, is one of the simplest machine learning (?) algorithms for supervised learning. There are many Python libraries that provide nice, pre-defined implementations of KNN, but here we will implement everything from scratch. Implementing KNN is not difficult at all, and the experience will give you deeper insight into how it works.

#### Train-Test Split

The Iris data set contains a total of 150 flower samples. We will randomly split these into two groups: group A with 120 flowers and group B with 30 flowers. We will "train" our KNN model on the flowers in group A, and we will *pretend* that group B is a set of *queries* whose answers we don't know. For example, assume you are building an app that tells the user which species of Iris a flower is, based on its sepal and petal dimensions. Group A is the data already available to you (the app developer), and group B is the queue of queries that your users will send in. In data science, group A, the data used to build the model, is called the *training set*, and group B, the data held out, is called the *test set*.

In the Python world, there are a few pre-defined functions that are quite convenient for splitting data into training and test sets. Implementing the split from scratch is not difficult, but let's just take advantage of one of those functions.

In [None]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(data_ordinal, test_size=0.2)
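
For the curious, a random 80/20 split can also be written by hand. The block below is only a rough sketch on a made-up ten-row DataFrame (`toy`, `toy_train`, and `toy_test` are hypothetical names), and it is not necessarily how `train_test_split` works internally: it shuffles the row positions and cuts the shuffled list in two.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_ordinal: 10 rows, one feature, one label
toy = pd.DataFrame({'x': np.arange(10), 'label': np.arange(10) % 2})

rng = np.random.default_rng(0)          # fixed seed so the split is repeatable
shuffled = rng.permutation(len(toy))    # random ordering of the row positions

n_test = int(round(0.2 * len(toy)))     # 20% of the rows go to the test set
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

toy_train, toy_test = toy.iloc[train_idx], toy.iloc[test_idx]
print(len(toy_train), len(toy_test))    # 8 2
```

With `train_test_split`, passing a fixed `random_state` plays the same role as the seed here: it makes the split the same on every run.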

In [None]:
train

In [None]:
test

Now, to train KNN, we will convert the `train` DataFrame to a numpy array.

In [None]:
train_np = train.to_numpy()
print(train_np)

Note that the first four columns are the attributes we are going to use for making predictions (called *predictors*), and the last column is the species label we would like to predict (the *output*). For simplicity, we will explicitly split those columns into `train_X` and `train_Y`.

In [None]:
train_X = train_np[:, 0:4]
train_Y = train_np[:, 4]

print(train_X)
print(train_Y)

Now, let's simulate a user query from the test data set. In this example, we'll simply pick the first row of the test set and name it `query`.

In [None]:
query = test.iloc[0]
print(query)

Here, similarly, we'll convert it to a numpy array and "pretend" we don't know the species of this query by deleting the information. However, since we are going to check later whether or not the prediction is correct, we will save it somewhere for our record.

In [None]:
query = query.to_numpy()
ground_truth = query[4]
query = np.delete(query, 4)
print(query)

Now, we will find which flowers in the training set are the most similar to the query flower. To do this, we first compute the difference between each training sample and the query:

In [None]:
diff = np.abs(train_X - query)  # absolute difference
print(diff)

Now, let's take the sum of the differences across the attributes. This sum of absolute differences is known as the Manhattan (or L1) distance.

In [None]:
sum_diff = np.sum(diff, axis=-1)
print(sum_diff)
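
To see what these two steps compute, here is a hand-checkable sketch on made-up numbers (`toy_X` and `toy_query` are hypothetical and much smaller than the real arrays). For a training row $x$ and a query $q$, the quantity is $d(x, q) = \sum_j |x_j - q_j|$, where $j$ runs over the attributes.

```python
import numpy as np

# Hypothetical training points (3 flowers, 2 attributes) and one query
toy_X = np.array([[1.0, 2.0],
                  [4.0, 6.0],
                  [0.0, 0.0]])
toy_query = np.array([1.0, 1.0])

# Broadcasting subtracts the query from every row of toy_X
toy_diff = np.abs(toy_X - toy_query)
toy_dist = toy_diff.sum(axis=-1)   # one distance per training row
print(toy_dist)                    # [1. 8. 2.]
```

The first row is closest to the query (distance 1) and the second is farthest (distance 8). Swapping in a different measure, such as the Euclidean distance, would only change how `toy_dist` is computed; the remaining steps stay the same.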

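Next, we will find the k-nearest neighbors by picking the ones that have the smallest differences. To do this, `np.argpartition()` can be extremely useful, especially when you have large data. The function is similar to `np.argsort()` in that it returns indices into the input array, but it performs only a partial sort: the element at position k ends up where it would be in a fully sorted array, every smaller element is placed before it, and every larger element after it. The order within the first k elements is not guaranteed, which is fine here because we only need to know which k flowers are nearest.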

In [None]:
k = 5
idx = np.argpartition(sum_diff, k)
print(sum_diff[idx])  # The first k values are the k smallest, though not necessarily in sorted order.
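
The behavior of `np.argpartition()` is easier to see on a short array. In this sketch the distances are made up, and `toy_k` is used instead of `k` so that the value set above is not overwritten.

```python
import numpy as np

toy_dist = np.array([9.0, 1.0, 7.0, 3.0, 8.0, 2.0])  # made-up distances
toy_k = 3

part = np.argpartition(toy_dist, toy_k)
# The first toy_k indices point at the toy_k smallest values, in no guaranteed order
nearest = part[:toy_k]
print(np.sort(toy_dist[nearest]))       # [1. 2. 3.]

# If the neighbors are needed in order, sort just those toy_k entries
nearest_sorted = nearest[np.argsort(toy_dist[nearest])]
print(nearest_sorted)                   # [1 5 3]
```

Sorting only the selected entries is cheap because there are just `toy_k` of them, whereas a full `np.argsort()` would order the entire array.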

Finally, the labels of the k-nearest neighbors are summarized as:

In [None]:
knn = train_Y[idx[:k]]
print(knn)

Whichever label wins the majority vote becomes the predicted species of the query.

In [None]:
uni, count = np.unique(knn, return_counts=True)
print('Predicted Class: ', uni[np.argmax(count)])
print('Ground Truth: ', ground_truth)             # compare with the ground truth
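
The steps above, from computing differences to taking the vote, can be collected into a single function. The following is a minimal sketch under a few assumptions: the function name `knn_predict` and the toy arrays are hypothetical, the distance is the sum of absolute differences used above, and `k` must be smaller than the number of training rows.

```python
import numpy as np

def knn_predict(X, Y, q, k):
    """Predict the label of query q by majority vote among its k nearest rows of X."""
    dist = np.abs(X - q).sum(axis=-1)          # distance from q to every training row
    nearest = np.argpartition(dist, k)[:k]     # indices of the k smallest distances
    labels, counts = np.unique(Y[nearest], return_counts=True)
    return labels[np.argmax(counts)]           # ties go to the smallest label

# Hypothetical toy data: two well-separated groups labelled 0 and 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_Y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_Y, np.array([0.3, 0.3]), k=3))  # 0
print(knn_predict(toy_X, toy_Y, np.array([4.8, 5.0]), k=3))  # 1
```

Because `np.unique` returns the labels in ascending order and `np.argmax` picks the first maximum, a tied vote resolves to the smallest label. An odd `k` avoids ties when there are only two classes, but not necessarily with three.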

#### Assignment

- Write code to classify all the flowers in the test data set.
- Compare the predicted results with the actual ground truth. What is the accuracy?
- Plot the accuracy as you vary k = 1, 2, 3, ..., 20. Does the accuracy change with k? Is there any pattern you can observe?
