# COSC 311: Introduction to Data Visualization and Interpretation

Instructor: Dr. Shuangquan (Peter) Wang

Email: spwang@salisbury.edu

Department of Computer Science, Salisbury University


# Module 5_Machine Learning Overview and Algorithm

## 1. ML and KNN



**The contents of this note draw on 1) the teaching materials of the Department of Computer Science, William & Mary; 2) the book "Python Machine Learning"; 3) the textbook "Data Science from Scratch"; 4) the Python tutorial: https://docs.python.org/3/tutorial/**

**<font color=red>All rights reserved. Dissemination or sale of any part of this note is NOT permitted.</font>**

## What is Machine Learning?

Machine learning refers to creating and using models that are learned from data. Typically, the goal is to use existing data to develop models that we can use to predict various outcomes for new data. For example:

- Whether an email message is spam or not
- Whether a credit card transaction is fraudulent
- Which football team is going to win the Super Bowl

### Supervised VS Unsupervised

Supervised learning utilizes a set of data labeled with the correct answers to learn from.
Unsupervised learning utilizes a set of data without such labels.

Other related concepts:
- Semi-supervised: only some of the data are labeled
- Online: the model needs to adjust continuously to newly arriving data
- Reinforcement: learning based on rewarding desired behaviors and/or punishing undesired ones. A reinforcement learning agent perceives and interprets its environment, takes actions, and learns through trial and error (https://www.techtarget.com/searchenterpriseai/definition/reinforcement-learning)
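
The difference between supervised and unsupervised learning can be seen in how a model is fitted. The sketch below is only an illustration on made-up toy points: the supervised classifier receives the labels `y`, while the clustering model receives the points alone. The choice of `KNeighborsClassifier` and `KMeans` here is an assumption made for the example, not something the note prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of 2-D points (made-up toy data)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, used only by the supervised model

# Supervised: the model sees both the points and their correct labels
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
supervised_pred = clf.predict([[1.1, 1.0], [5.1, 5.0]])

# Unsupervised: the model sees only the points and groups them on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

print(supervised_pred)  # predicted labels for the two new points
print(cluster_ids)      # cluster index assigned to each training point
```

Note that the cluster indices are arbitrary: the clustering can separate the two groups, but it cannot know which group corresponds to which label.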

# scikit-learn 
https://scikit-learn.org/stable/

scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support-vector machines, random forests, gradient boosting, k-means and DBSCAN. --Wiki

- Simple and efficient tools for predictive data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib (these need to be installed beforehand)
- Open source, commercially usable - BSD license



### Install & import scikit-learn

Installing scikit-learn: https://scikit-learn.org/stable/install.html#install-by-distribution

Anaconda offers scikit-learn as part of its free distribution. To check whether scikit-learn is installed, open the Anaconda Navigator, go to Environments, and search the package list; install it if necessary.

**import scikit-learn:**
- import sklearn
- from sklearn.neighbors import KNeighborsClassifier
- ...
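
A quick way to confirm that the installation works is to import the package and print its version. This is a minimal check, assuming scikit-learn is installed in the active environment.

```python
import sklearn
from sklearn.neighbors import KNeighborsClassifier

# If both imports succeed, scikit-learn is available in this environment
print(sklearn.__version__)
```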



### scikit-learn API

https://scikit-learn.org/stable/modules/classes.html#

Modules: 
- sklearn.base: Base classes and utility functions
- sklearn.calibration: Probability Calibration
- **sklearn.cluster: Clustering**
- sklearn.covariance: Covariance Estimators
- **sklearn.datasets: Datasets**
- sklearn.decomposition: Matrix Decomposition
- sklearn.ensemble: Ensemble Methods
- **sklearn.feature_extraction: Feature Extraction**
- **sklearn.feature_selection: Feature Selection**
- sklearn.gaussian_process: Gaussian Processes
- **sklearn.linear_model: Linear Models (linear classifiers, linear regressors)**
- sklearn.model_selection: Model Selection
- **sklearn.multiclass: Multiclass classification**
- sklearn.multioutput: Multioutput regression and classification
- sklearn.naive_bayes: Naive Bayes
- **sklearn.neighbors: Nearest Neighbors**
- **sklearn.neural_network: Neural network models**
- **sklearn.preprocessing: Preprocessing and Normalization**
- sklearn.semi_supervised: Semi-Supervised Learning
- sklearn.svm: Support Vector Machines
- **sklearn.tree: Decision Trees**
- sklearn.utils: Utilities
- ... ...

### sklearn.datasets

https://scikit-learn.org/stable/datasets/toy_dataset.html

scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.

They can be loaded using the following functions:

- load_iris(*[, return_X_y, as_frame]): Load and return the iris dataset (classification).

- load_diabetes(*[, return_X_y, as_frame, scaled]): Load and return the diabetes dataset (regression).

- load_digits(*[, n_class, return_X_y, as_frame]): Load and return the digits dataset (classification).

- load_linnerud(*[, return_X_y, as_frame]): Load and return the physical exercise Linnerud dataset.

- load_wine(*[, return_X_y, as_frame]): Load and return the wine dataset (classification).

- load_breast_cancer(*[, return_X_y, as_frame]): Load and return the breast cancer wisconsin dataset (classification).
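
As a short sketch of how these loaders are typically used, the example below loads the iris dataset in two ways: as an object with named fields, and as a plain `(X, y)` pair via `return_X_y=True`.

```python
from sklearn.datasets import load_iris

# Default call: an object holding the data, the targets, and their names
iris = load_iris()
print(iris.data.shape)      # (150, 4)
print(iris.feature_names)
print(iris.target_names)

# return_X_y=True: just the feature matrix and the label vector
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)     # (150, 4) (150,)
```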

# K-Nearest Neighbors

- https://github.com/Kulbear/pytorch-the-hard-way/tree/master/Solutions
- https://www.profdavis.net/dm_slides/dmslides_files/6720Chapter7.ppt
- https://github.com/zotroneneis/machine_learning_basics/blob/master/k_nearest_neighbour.ipynb

KNN is a data-driven algorithm, not a model-driven one (no model is trained from the dataset). KNN makes no assumptions about the data (e.g., the data need not be normally distributed).

The KNN algorithm is a simple supervised machine learning algorithm that can be used both for classification and regression. It's an instance-based algorithm. So instead of estimating a model, it stores all training examples in memory and makes predictions using a similarity measure.

Given an input example, the KNN algorithm retrieves the k most similar instances from memory. Similarity is defined in terms of distance: the training examples with the smallest (Euclidean) distance to the input example are considered the most similar.
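
The retrieval step can be sketched with NumPy alone. The helper name `k_nearest` and the toy points are made up for illustration; scikit-learn's own implementation is more elaborate.

```python
import numpy as np

def k_nearest(X_train, query, k):
    """Return the indices and distances of the k training points
    closest to `query`, using Euclidean distance."""
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    order = np.argsort(distances)[:k]   # indices of the k smallest distances
    return order, distances[order]

# Toy training points (hypothetical data)
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])

idx, dist = k_nearest(X_train, np.array([0.0, 0.0]), k=3)
print(idx)    # [0 1 2]
print(dist)   # [0. 1. 2.]
```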

The target value of the input example is computed as follows:

**Classification:**
- Unweighted: output the most common class among the k nearest neighbors.
- Weighted: sum the weights of the k nearest neighbors for each class and output the class with the highest total weight.

**Regression:**
- Unweighted: output the average of the values of the k nearest neighbors.
- Weighted: for example, the distance-weighted KNN regression model weights each neighbor's output value by the inverse of its distance to the test data point and outputs the weighted average, so closer neighbors have a greater influence than neighbors that are farther away (refer to DOI:10.3390/geotechnics1020024).

The weighted KNN is a refined version of the KNN algorithm in which the contribution of each neighbor is weighted according to its distance to the query point. 
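
The unweighted and weighted rules can give different answers for the same neighbors. The sketch below uses three hypothetical neighbors and inverse-distance weights (one common choice, assumed here for illustration).

```python
import numpy as np
from collections import Counter

# Hypothetical k = 3 nearest neighbors of one query point
neighbor_labels = ['blue', 'blue', 'green']       # classes (classification)
neighbor_values = np.array([10.0, 20.0, 40.0])    # targets (regression)
neighbor_dists = np.array([2.0, 4.0, 0.5])        # distances to the query

# Classification, unweighted: plain majority vote
majority = Counter(neighbor_labels).most_common(1)[0][0]

# Classification, weighted: add up 1/distance per class, pick the largest sum
weights = 1.0 / neighbor_dists                    # assumes no distance is zero
totals = {}
for label, w in zip(neighbor_labels, weights):
    totals[label] = totals.get(label, 0.0) + w
weighted_winner = max(totals, key=totals.get)

# Regression: plain mean versus inverse-distance weighted mean
unweighted_value = neighbor_values.mean()
weighted_value = np.average(neighbor_values, weights=weights)

print(majority, weighted_winner)          # blue green
print(unweighted_value, weighted_value)
```

Here the single green neighbor is much closer than the two blue ones, so it wins the weighted vote. In scikit-learn, `KNeighborsClassifier` and `KNeighborsRegressor` accept `weights='uniform'` (the default) or `weights='distance'` to switch between the two behaviors.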

Take classification as an example (https://www.jcchouinard.com/k-nearest-neighbors/):
![KNN.png](attachment:KNN.png)

- In the k=3 circle, green is the majority, so the new data point is predicted as green.
- In the k=6 circle, blue is the majority, so the new data point is predicted as blue.

**Low k VS High k**

Low values of k (1, 3, …) capture local structure in the data (but also noise); high values of k provide more smoothing and less noise, but may miss local structure. The extreme case of k = n (i.e., the entire data set) is the same as the “naive rule” (classify all records according to the majority class).
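
The k = n extreme is easy to verify on a small example. The toy data below are made up: five points of class 0 and two of class 1.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy 1-D data: class 0 is the majority class (5 of 7 samples)
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.4], [5.0], [5.1]])
y = np.array([0, 0, 0, 0, 0, 1, 1])

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)        # local structure
knn_n = KNeighborsClassifier(n_neighbors=len(X)).fit(X, y)   # k = n

queries = np.array([[0.04], [5.04]])
pred_1 = knn_1.predict(queries)
pred_n = knn_n.predict(queries)
print(pred_1)   # [0 1]
print(pred_n)   # [0 0]
```

With k = n every query sees the same neighbors (all of the training data), so the prediction is always the majority class regardless of where the query lies.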

### Decision boundary
- https://cs231n.github.io/classification/

![knn-boundary.jpeg](attachment:knn-boundary.jpeg)

The above figure shows an example of the difference between 1-NN and 5-NN, using 2-dimensional points and 3 classes (red, blue, green). The colored regions show the decision boundaries induced by the classifier with an L2 distance. The white regions show points that are ambiguously classified (i.e. class votes are tied for at least two classes). Notice that in the case of the 1-NN classifier, outlier datapoints (e.g. a green point in the middle of a cloud of blue points) create small islands of likely incorrect predictions, while the 5-NN classifier smooths over these irregularities, likely leading to better generalization on the test data (not shown). Also note that the gray regions in the 5-NN image are caused by ties in the votes among the nearest neighbors (e.g. 2 neighbors are red, the next two neighbors are blue, and the last neighbor is green).

## Example 1: Analyze iris dataset using KNN

In [None]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from matplotlib import pyplot as plt
import pandas as pd

In [None]:
iris = pd.read_csv('iris.data', header=None, 
                   names=['sepal_length', 'sepal_width', 
                          'petal_length', 'petal_width', 'class'])
iris.info()

In [None]:
# Idea: when we look at the ground truth, these look easily classifiable by simple algorithms

# But now, separate by class
setosa = iris[iris['class'] == 'Iris-setosa']
virginica = iris[iris['class'] == 'Iris-virginica']
versicolor = iris[iris['class'] == 'Iris-versicolor']

plt.scatter(x=setosa['sepal_length'], y=setosa['sepal_width'], color='b')
plt.scatter(x=virginica['sepal_length'], y=virginica['sepal_width'], color='r')
plt.scatter(x=versicolor['sepal_length'], y=versicolor['sepal_width'], color='g')
plt.legend(['setosa', 'virginica', 'versicolor'])
plt.title("Sepal length vs width")

In [None]:
# Construct a classifier object
# Then will need to be trained, or "fit"
knn = KNeighborsClassifier(n_neighbors = 1)

# To do that, need to pass an array of data points
# and an array of labels for each of those points

# Extract the two columns we want to classify
X = iris[['sepal_length','sepal_width']].values

# Work on a copy so the in-place relabeling below does not modify the DataFrame
Y = iris['class'].values.copy()

# We have the labels as strings, just map to numbers
# We will set setosa to be class 0, virginica to 1, versicolor to 2

Y[Y == 'Iris-setosa'] = 0
Y[Y == 'Iris-virginica'] = 1
Y[Y == 'Iris-versicolor'] = 2

# Because Y started as strings, they were stored as a generic object
# so we will convert them to be stored as integer

print(Y.dtype)
Y = Y.astype('int')  # https://note.nkmk.me/en/python-numpy-dtype-astype/
print(Y.dtype)

# Fit/train the model/algorithm/classifier to the data and labels
knn.fit(X,Y)

# We can now ask the knn to classify new points!
# They call this "predict"
# The .predict() method needs an input matrix with the same number of columns as X
# It will return an array of predicted labels, one for each row
knn.predict([ [7,2.5], [5,2.7], [5,2.9] ])

In [None]:
# See how confused this algorithm is:

# test using the training data, which is called self-test
knn.predict(X)

In [None]:
Y

In [None]:
num_wrong = np.sum(np.not_equal(knn.predict(X),Y))
print(f'We got {num_wrong} wrong!')

In [None]:
# This is the self-test accuracy using all samples
knn.score(X,Y)

## In-class practice

The above classifier uses only two attributes ('sepal_length' and 'sepal_width'). If all four attributes are used, what is the self-test accuracy?

## Example 2: Analyze breast cancer dataset using KNN

This example refers to https://www.jcchouinard.com/k-nearest-neighbors/

#### About the breast cancer dataset

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)

Associated Tasks: Classification

Number of Instances: 569

Attribute Characteristics: Real

Number of Attributes: 32

- (1) ID number
- (2) Diagnosis (M = malignant, B = benign)
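- (3)-(32) Thirty real-valued features. Ten measurements are computed for each cell nucleus, and the mean, standard error, and "worst" (largest) value of each measurement are recorded for every image: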

*a) radius (mean of distances from center to points on the perimeter)*

*b) texture (standard deviation of gray-scale values)*

*c) perimeter*

*d) area*

*e) smoothness (local variation in radius lengths)*

*f) compactness (perimeter^2 / area - 1.0)*

*g) concavity (severity of concave portions of the contour)*

*h) concave points (number of concave portions of the contour)*

*i) symmetry*

*j) fractal dimension ("coastline approximation" - 1)*

**Step 1: Load data**

In [None]:
import pandas as pd
from sklearn import datasets
 
dataset = datasets.load_breast_cancer()
df = pd.DataFrame(dataset.data,columns=dataset.feature_names)
df['target'] = pd.Series(dataset.target)
df.head()

In [None]:
# show all the features
print(dataset.feature_names)

In [None]:
# show the class labels (target)
print(dataset.target_names)
# This is a two-class classification problem

**Step 2: Split data into training and test sets**

Whenever we build a machine learning model, we want to check its accuracy.

You will need to split your data into training and test datasets using the *train_test_split* function.
- The training dataset is used to fit (or train) the model.
- The test dataset is excluded from training. It is labelled data that will be used to compare against the predictions made by the model.


**Introduction on the *train_test_split* function:**

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
from sklearn.model_selection import train_test_split
 
# Define independent (features) and dependent (targets) variables
X = dataset['data']
y = dataset['target']
 
# split into training and test sets
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
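
It can be useful to inspect what the split produced. The self-contained sketch below repeats the same call and then checks the sizes of the two sets and the share of class 1 in each; with `stratify=y` the class proportions should stay close to those of the full dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 70% / 30% of the 569 samples
print(X_train.shape, X_test.shape)   # (398, 30) (171, 30)

# Fraction of class 1 in the full data, the training set, and the test set
print(y.mean(), y_train.mean(), y_test.mean())
```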


**Step 3: Train the Model**

To make predictions based on the labeled data, we
- Instantiate the KNeighborsClassifier machine learning model
- Use the .fit() method to train the model
- Use the .predict() method to make predictions with the trained model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
 
# train the model
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_train)

In [None]:
# classification results on the training sample
print(y_pred)

In [None]:
# ground truth of the training samples
print(y_train)

**Step 4: Evaluate the model**

It is very important to evaluate the accuracy of the model. Normally we evaluate the model using training data before predicting the test data **(this is called self-test)**. (why?)

The evaluation can be done using the **.score()** method on the knn object.

In [None]:
# compute accuracy of the model on the training data
knn.score(X_train, y_train)

**Step 5: Parameter optimization by changing K value**

K is the most important parameter of KNN.

In [None]:
import numpy as np
import matplotlib.pyplot as plt 
 
neighbors = np.arange(1, 25)
accuracy = np.empty(len(neighbors))
 
for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracy[i] = knn.score(X_train, y_train)

plt.title('k-NN self-test accuracy')
plt.plot(neighbors, accuracy)
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()

Question: what K value is best?

**Pay attention: the parameter that is best on the training data is not guaranteed to give the best result on the test data.**
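
One way to see this is to compute the accuracy on both sets for every K. The sketch below is self-contained and repeats the same split as above. The test accuracy is shown here only to illustrate the gap; choosing K by looking at the test set would leak information, so in practice a separate validation set is normally used for that.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

neighbors = np.arange(1, 25)
train_acc = np.empty(len(neighbors))
test_acc = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc[i] = knn.score(X_train, y_train)   # self-test accuracy
    test_acc[i] = knn.score(X_test, y_test)      # accuracy on unseen data

print('Best K on training data:', neighbors[np.argmax(train_acc)])
print('Best K on test data:    ', neighbors[np.argmax(test_acc)])
```

With K = 1 each training sample is its own nearest neighbor, which is why the self-test accuracy tends to be highest there even though that says little about unseen data.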

**Step 6: Retrain the model using the selected parameter and predict on the test data**

In [None]:
# train the model
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
# y_pred = knn.predict(X_test)
knn.score(X_test, y_test)

How about if n_neighbors = 7?

In [None]:
# train the model
knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train, y_train)
# y_pred = knn.predict(X_test)
knn.score(X_test, y_test)

**Further step: Check the Confusion Matrix**

It is possible that the accuracy is not fully representative. We will now try to see how many predictions are True and how many are False.

We will do this using:
- Confusion matrix
- Classification report

Reminder, in the coming plots we will plot the targets (0s and 1s) and not the target names (0 = malignant, 1 = benign).

**Confusion matrix**

To compute the confusion matrix, we will use the *confusion_matrix* function from the *sklearn.metrics* module and then draw it with matplotlib.

In [None]:
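import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
 
# Predict on the test set first: the earlier y_pred holds training-set predictions,
# whose length does not match y_test
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)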
 

In [None]:
# https://vitalflux.com/python-draw-confusion-matrix-matplotlib/
fig, ax = plt.subplots(figsize=(2, 2))
ax.matshow(cm, cmap=plt.cm.Blues, alpha=0.3)
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(x=j, y=i,s=cm[i, j], va='center', ha='center', size='large')
plt.xlabel('Predictions', fontsize=10)
plt.ylabel('Actuals', fontsize=10)
plt.title('Confusion Matrix', fontsize=10)
plt.show()

**Classification Report**

Let’s compute the classification report to assess the quality of the predictions.

In [None]:
from sklearn.metrics import classification_report
 
print(classification_report(y_test, y_pred))

## Classification evaluation metrics

Four evaluation metrics are often used to quantify the classification performance. They are accuracy, precision, recall and F-score. The last three are often used in binary classification scenarios.

![Confusion-matrix.png](attachment:Confusion-matrix.png)
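
The four metrics can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions on a small made-up binary example (class 1 treated as the positive class) and compares the hand-computed values with the corresponding functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)            # share of correct predictions
precision = tp / (tp + fp)                            # predicted positives that are right
recall = tp / (tp + fn)                               # actual positives that are found
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean of the two

print(accuracy, precision, recall, f1)   # 0.8 0.75 0.75 0.75

# The same values from scikit-learn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```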
