# Machine Learning overview - assignment 1 

## What the assignment is for

In this assignment, we explore the classic Iris dataset and use the simple K-Nearest Neighbors (KNN) algorithm for classification.

**You should pay attention to how this assignment follows the Machine Learning Workflow in our lecture session!**

## Import libraries

In [None]:
%matplotlib inline
import pandas
import numpy
from matplotlib import pyplot as plt

## Load Dataset

NOTE: The Iris dataset includes three iris species with 50 samples each, along with four measurements for each flower. One species is linearly separable from the other two, but the other two are not linearly separable from each other.

In [None]:
iris = pandas.read_csv('../../assets/data/iris.csv')

## Dimension (shape) of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

In [None]:
print(iris.shape)

## Show the first 5 rows of the data

In [None]:
print(iris.head(5))

## Get a global description of the dataset

In [None]:
print(iris.describe())

## Group data

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

In [None]:
print(iris.groupby('Name').size())

## Dividing data into features and labels

As we can see, the dataset contains five columns: SepalLength, SepalWidth, PetalLength, PetalWidth and Name. The first four columns (indices 0-3) hold the features, and the last column holds the label of each sample. First, we need to split the data into two arrays: X (features) and y (labels).


In [None]:
feature_columns = ['SepalLength', 'SepalWidth', 'PetalLength','PetalWidth']
X = iris[feature_columns].values
y = iris['Name'].values

In [None]:
X

In [None]:
y

## Label encoding

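As we can see, the labels are categorical strings. scikit-learn classifiers such as KNeighborsClassifier can work with string labels directly, but integer codes are convenient for later steps, for example coloring plot points by class. We use LabelEncoder to transform the labels into numbers: Iris-setosa corresponds to 0, Iris-versicolor corresponds to 1 and Iris-virginica corresponds to 2.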

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
print(y)
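
To make the mapping less of a black box, here is a minimal pure-Python sketch of the idea behind label encoding: collect the unique class names, sort them, and use each name's position as its code. The helper names `fit_label_encoder`, `encode` and `decode` are hypothetical and exist only for this illustration; they are not part of scikit-learn.

```python
def fit_label_encoder(labels):
    """Return the sorted unique classes and a class-name -> integer mapping."""
    classes = sorted(set(labels))
    mapping = {name: index for index, name in enumerate(classes)}
    return classes, mapping


def encode(labels, mapping):
    """Replace every class name with its integer code."""
    return [mapping[name] for name in labels]


def decode(codes, classes):
    """Recover the class names from integer codes."""
    return [classes[code] for code in codes]


sample = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor', 'Iris-setosa']
classes, mapping = fit_label_encoder(sample)
codes = encode(sample, mapping)
print(codes)                   # [2, 0, 1, 0]
print(decode(codes, classes))  # the original names again
```

Because the classes are sorted alphabetically, setosa gets 0, versicolor gets 1 and virginica gets 2. With the real encoder, `le.classes_` and `le.inverse_transform` give access to the same information.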

## Splitting dataset into training set and test set

Let's split the dataset into a training set and a test set, so that we can later check how well our classifier performs on data it has not seen.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
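
Conceptually, a random split shuffles the sample indices with a fixed seed and holds out a fraction of them. The following pure-Python sketch illustrates that idea under simplifying assumptions: the function `simple_train_test_split` and the demo data are hypothetical, and the sketch will not reproduce the exact samples that scikit-learn selects for the same seed.

```python
import random


def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle sample indices, then hold out a fraction of them for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = round(len(X) * test_size)    # simplified rounding, an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Hypothetical toy data: 10 samples with 2 features each
X_demo = [[float(i), float(i) * 2] for i in range(10)]
y_demo = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(len(X_tr), len(X_te))  # 8 2
```

Every sample ends up in exactly one of the two sets, and calling the function again with the same seed returns the same split, which is the role `random_state` plays in the cell above.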

## Feature scaling

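Because the feature values are all of the same order of magnitude, there is no need for feature scaling here. In other circumstances, however, it is extremely important to apply feature scaling before running classification algorithms, especially distance-based ones such as KNN, where a feature with a large numeric range would otherwise dominate the distance.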
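
As an illustration of when scaling matters, the sketch below standardizes two made-up features (zero mean, unit standard deviation) and compares the Euclidean distance between two samples before and after. The `income` and `age` values are hypothetical and unrelated to the Iris data, and `standardize` is a helper written for this example. In scikit-learn, `StandardScaler` performs the same kind of transformation.

```python
import math
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of numbers to zero mean and unit standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation
    return [(value - mu) / sigma for value in column]


# Hypothetical features whose scales differ by several orders of magnitude
income = [30000.0, 32000.0, 90000.0, 95000.0]
age = [25.0, 60.0, 27.0, 62.0]

# Unscaled distance between samples 0 and 1: almost entirely due to income,
# even though the two people differ far more in age than in income.
raw_distance = math.dist((income[0], age[0]), (income[1], age[1]))

income_scaled = standardize(income)
age_scaled = standardize(age)
scaled_distance = math.dist(
    (income_scaled[0], age_scaled[0]),
    (income_scaled[1], age_scaled[1]),
)

print(round(raw_distance, 2))
print(round(scaled_distance, 2))
```

After standardization the age difference contributes most of the distance between these two samples, which matches the intuition that they have similar incomes but very different ages.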

## Data Visualization

### Pair plot

A pair plot is useful when you want to visualize the distribution of each variable and the pairwise relationships between variables, with each subset of the dataset (here, each species) drawn separately.

In [None]:
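import seaborn as sns

# pairplot creates its own figure, so no plt.figure() call is needed.
# The `size` argument was renamed to `height` in seaborn 0.9.
sns.pairplot(iris, hue="Name", height=3, markers=["o", "s", "D"])
plt.show()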

### Boxplots

In [None]:
# DataFrame.boxplot creates its own figure; figsize sets its dimensions.
iris.boxplot(by="Name", figsize=(15, 10))
plt.show()

### 3D visualization

You can also visualize higher-dimensional datasets in 3D by using color, shape, size and other properties of the plotted objects. In this plot, marker size encodes the fourth dimension, Petal Width.

In [None]:
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions
fig = plt.figure(1, figsize=(20, 15))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=48, azim=134)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=X[:, 3]*50)

# LabelEncoder assigns codes alphabetically: setosa=0, versicolor=1, virginica=2
for name, label in [('Setosa', 0), ('Versicolor', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean(),
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'), size=25)

ax.set_title("3D visualization", fontsize=40)
ax.set_xlabel("Sepal Length", fontsize=25)
ax.xaxis.set_ticklabels([])
ax.set_ylabel("Sepal Width", fontsize=25)
ax.yaxis.set_ticklabels([])
ax.set_zlabel("Petal Length", fontsize=25)
ax.zaxis.set_ticklabels([])

plt.show()

## Using KNN for classification

### Train the model

In [None]:
# Fitting the classifier to the training set
# Loading libraries
from sklearn.neighbors import KNeighborsClassifier

# Instantiate learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)

# Fitting the model
classifier.fit(X_train, y_train)
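
To show what a KNN prediction involves, here is a minimal from-scratch sketch: compute the Euclidean distance from the query point to every training sample, keep the k closest, and return the most common label among them. The function `knn_predict` and the toy data are hypothetical, and scikit-learn's implementation may differ in details such as tie-breaking and how neighbors are searched.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training sample's distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # most_common(1) returns [(label, count)] for the most frequent label
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
                (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0)))  # 0
print(knn_predict(train_points, train_labels, (4.9, 5.0)))  # 1
```

Note that there is no training step in this sketch beyond keeping the data around: all of the work happens at prediction time.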

### Predict, with test dataset

In [None]:
y_pred = classifier.predict(X_test)

In [None]:
y_test

In [None]:
y_pred

### Evaluating predictions

In [None]:
# Building confusion matrix:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print(cm)

In [None]:
from sklearn.metrics import accuracy_score

# Calculating model accuracy as a percentage:

accuracy = accuracy_score(y_test, y_pred) * 100
print('Accuracy of our model is ' + str(round(accuracy, 2)) + '%.')
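
The two metrics above are closely related: in a confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal counts the correct predictions and accuracy is the diagonal sum divided by the total. A minimal sketch with made-up labels follows; the helper names and the demo values are hypothetical.

```python
def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count samples per (true class, predicted class) pair."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Accuracy is the share of samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels: one class-1 sample is misclassified as class 2
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

cm_demo = confusion_matrix_counts(y_true_demo, y_pred_demo, n_classes=3)
print(cm_demo)                        # [[2, 0, 0], [0, 1, 1], [0, 0, 2]]
print(accuracy_from_matrix(cm_demo))  # 5 correct out of 6
```

Off-diagonal entries show which classes get confused with each other, which a single accuracy number cannot tell you.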

### Using cross-validation for parameter tuning

In [None]:
from sklearn.model_selection import cross_val_score

# creating list of K for KNN
k_list = list(range(1,50,2))
# creating list of cv scores
cv_scores = []

# perform 10-fold cross validation
for k in k_list:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
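
For intuition, k-fold cross-validation partitions the training samples into folds and lets each fold serve once as the validation set while the model is fit on the rest. The sketch below generates such index splits in plain Python. `k_fold_indices` is a hypothetical helper, and it implements plain, unshuffled k-fold; when `cross_val_score` is given an integer `cv` and a classifier, it uses stratified folds instead, which also keep the class proportions similar in every fold.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds consecutive validation folds."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((training, validation))
        start = stop
    return folds


for training, validation in k_fold_indices(n_samples=10, n_folds=3):
    print("validate on", validation, "train on", training)
```

Each sample is used for validation exactly once, so the averaged score relies on all of the training data without ever scoring a model on samples it was fit on.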

In [None]:
# changing to misclassification error (1 - accuracy); note this is not a mean squared error
misclassification_error = [1 - x for x in cv_scores]

# set the style before creating the figure so that it applies to the new axes
sns.set_style("whitegrid")
plt.figure(figsize=(15,10))
plt.title('The optimal number of neighbors', fontsize=20, fontweight='bold')
plt.xlabel('Number of Neighbors K', fontsize=15)
plt.ylabel('Misclassification Error', fontsize=15)
plt.plot(k_list, misclassification_error)

plt.show()

In [None]:
# finding best k
best_k = k_list[misclassification_error.index(min(misclassification_error))]
print("The optimal number of neighbors is %d." % best_k)
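
One detail of this selection is worth knowing: `list.index` returns the position of the first match, so when several values of k share the lowest error, the smallest of them is chosen. A small sketch with made-up error values (the numbers are hypothetical, not results from this dataset):

```python
# Hypothetical error values: k=3 and k=5 tie for the lowest error
k_values = [1, 3, 5, 7]
errors = [0.05, 0.03, 0.03, 0.04]

# index() finds the first occurrence of the minimum, i.e. the smallest such k
best_k_demo = k_values[errors.index(min(errors))]
print(best_k_demo)  # 3
```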

## Acknowledgments

Thanks to SkalskiP for creating the open-source [Kaggle jupyter notebook](https://www.kaggle.com/code/skalskip/iris-data-visualization-and-knn-classification), licensed under Apache 2.0. It inspired most of the content of this assignment.