# Introduction to Machine Learning: Classification & Clustering

This notebook introduces fundamental concepts in machine learning, focusing on two common tasks: classification and clustering. We will explore how to implement and evaluate various algorithms for these tasks using the popular Python library Scikit-learn. We will primarily use the well-known Iris dataset to illustrate these techniques.

In [None]:
from sklearn import datasets
from sklearn import model_selection
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn import neighbors
from sklearn import svm
from sklearn import ensemble
from sklearn import cluster

In [None]:
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
# set random seed so that we get the same random numbers
np.random.seed(123)

In [None]:
# check version of scikit-learn
import sklearn

print("The scikit-learn version is {}.".format(sklearn.__version__))

## Classification

Recall that in C03 we saw an example of classification, where we used logistic regression to classify Iris flower samples from their sepal and petal widths and lengths. Let's consider this data again, but now with all 3 species.


In [None]:
# load built-in data in the Scikit-Learn library
iris = datasets.load_iris()

In [None]:
type(iris)

`load_iris` returns a `Bunch`, which is a dictionary-like object. Some useful attributes are:

- `data`: the data to learn
- `target`: the classification labels
- `target_names`: the meaning of the labels
- `feature_names`: the meaning of features
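
As a small illustration of what "dictionary-like" means here, the sketch below accesses the same fields both by key and by attribute, and pairs each integer label with its species name. It only uses the attributes listed above; nothing in it is specific to this notebook's later cells.

```python
from sklearn import datasets

iris = datasets.load_iris()

# A Bunch supports both dictionary-style and attribute-style access
print(sorted(iris.keys()))
print(iris["data"] is iris.data)  # both forms return the same array

# Shapes of the feature matrix and the label vector
print(iris.data.shape, iris.target.shape)

# Pair each integer label with its species name and count the samples
for label, name in enumerate(iris.target_names):
    count = int((iris.target == label).sum())
    print(label, name, count)
```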


In [None]:
# iris.data

In [None]:
iris.data.shape

In [None]:
iris.target

In [None]:
iris.target_names

In [None]:
iris.feature_names

### Prepare Data: Training and Testing Sets

Before training a model, we need to split our dataset. We'll use a portion for training the model (the *training set*) and reserve the rest for evaluating its performance on unseen data (the *testing set*). This helps us understand how well the model generalizes to new examples. We'll use 70% for training and 30% for testing.

In [None]:
# Split the dataset into a training part (70%) and a testing part (30%)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris.data, iris.target, train_size=0.7, random_state=123  # fixed seed gives a reproducible split
)

In [None]:
# First step: create a classifier instance
classifier = linear_model.LogisticRegression(solver="lbfgs")

In [None]:
# Call the 'fit' method to train the classifier
classifier.fit(X_train, y_train)

In [None]:
# Predict the class for the samples in the testing datasets
#    (so that we can compare the predictions with the actual values)
y_test_pred = classifier.predict(X_test)

In [None]:
y_test_pred

In [None]:
y_test

The `sklearn.metrics.classification_report` function returns a text report showing the main classification metrics (see the details [here](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support)).

Also:

- [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall)
- [F1 score](https://en.wikipedia.org/wiki/F1_score)


In [None]:
print(metrics.classification_report(y_test, y_test_pred))

We can also look at the _confusion matrix_ $C$, where $C_{ij}$ is the number of samples of category $i$ that were classified as category $j$. That is, the diagonal elements correspond to the number of correctly classified samples for each category, and the off-diagonal elements are the numbers of incorrectly classified samples.
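
The definition of $C_{ij}$ can be checked directly by counting. The sketch below builds the matrix by hand for made-up labels (the arrays `y_true` and `y_hat` are illustrative assumptions, not Iris results) and compares it with `metrics.confusion_matrix`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem (illustration only)
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
y_hat = np.array([0, 0, 1, 1, 1, 2, 1, 2, 2, 1])

n_classes = 3
C = np.zeros((n_classes, n_classes), dtype=int)
# Add 1 to C[i, j] for every sample with true class i and predicted class j
np.add.at(C, (y_true, y_hat), 1)

print(C)
print(C.sum(axis=1))      # row sums: number of samples in each true class
print(np.bincount(y_true))

# The overall accuracy is the fraction of samples on the diagonal
accuracy = np.trace(C) / C.sum()
print(accuracy)
```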


In [None]:
metrics.confusion_matrix(y_test, y_test_pred)
## NOTE1:  the numbers depend on the random seed
## NOTE2:  the sum of each row is the total number of samples for the corresponding category.

In [None]:
# count the number of samples of each class in y_test
np.bincount(y_test)

In [None]:
# We can also plot the confusion matrix


def plot_confusion_matrix(classifier, y_test, y_test_pred):
    cm = metrics.confusion_matrix(y_test, y_test_pred, labels=classifier.classes_)
    disp = metrics.ConfusionMatrixDisplay(
        confusion_matrix=cm, display_labels=classifier.classes_
    )
    disp.plot()

In [None]:
plot_confusion_matrix(classifier, y_test, y_test_pred)
plt.show()

So far we have only used a logistic regression model as a classifier. Other popular classifiers are:

- decision trees ([detail](https://en.wikipedia.org/wiki/Decision_tree_learning))
- nearest neighbor methods ([detail](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) and [overview](https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn))
- support-vector machines (SVM) ([detail](https://en.wikipedia.org/wiki/Support_vector_machine))
- random forests ([detail](https://en.wikipedia.org/wiki/Random_forest) and [tutorial](https://www.datacamp.com/tutorial/random-forests-classifier-python))

See this [flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/) for a rough guide on how to choose an estimator, and this detailed [comparison](https://www.dataschool.io/comparing-supervised-learning-algorithms/) of supervised learning algorithms.


In [None]:
# Use a decision tree as classifier
classifier = tree.DecisionTreeClassifier()  # set `random_state=123` to get the same random numbers
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

In [None]:
# Look at the decision tree
tree.plot_tree(classifier)
plt.show()

Let's read the tree. Each box (node) shows the following:

- `feature <= value`: the feature and threshold used for splitting the data at this node. Samples satisfying the condition go left, the others go right. Leaf nodes have no such line.
- `gini`: the node impurity ($g = 1 - \sum_c p_c^2$, where $p_c$ is the fraction of samples of class $c$ at the node). A Gini score of 0 means all samples belong to one class. The algorithm chooses splits that maximize the decrease in impurity.
- `samples`: the number of training samples reaching that node.
- `value`: the distribution of training samples among the classes at that node (e.g., `[35, 34, 36]` means 35 samples of class 0, 34 of class 1, and 36 of class 2 reached this node). If the node is a leaf, the majority class is the prediction for samples ending up there.
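
As a worked example of the Gini formula, the sketch below evaluates the impurity of a node with class counts `[35, 34, 36]` and the impurity decrease for one candidate split. The split into `left` and `right` is a hypothetical one chosen for illustration (it separates class 0 perfectly); it is not read off the fitted tree.

```python
import numpy as np


def gini(counts):
    """Gini impurity g = 1 - sum_c p_c^2 for a vector of class counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)


root = [35, 34, 36]   # class counts at the parent node
left = [35, 0, 0]     # hypothetical left child: only class 0
right = [0, 34, 36]   # hypothetical right child: classes 1 and 2

g_root = gini(root)   # about 0.666, close to the maximum 2/3 for 3 classes
n = sum(root)

# Impurity after the split: children weighted by their share of the samples
g_split = sum(left) / n * gini(left) + sum(right) / n * gini(right)
decrease = g_root - g_split

print(g_root, g_split, decrease)
```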

In [None]:
# Use the nearest neighbor classifier
classifier = neighbors.KNeighborsClassifier()
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

The default number of neighbors is 5, but this is another example of a hyperparameter. We can search for the best value using cross-validation on the training set:


In [None]:
klist = list(range(1, 20))
scores = []

for k in klist:
    classifier = neighbors.KNeighborsClassifier(k)
    score = model_selection.cross_val_score(classifier, X_train, y_train, cv=4)
    scores.append(np.mean(score))

In [None]:
plt.plot(klist, scores)
plt.show()

In [None]:
# repeat with the value of k that scored best in cross-validation on the training set
best_k = klist[int(np.argmax(scores))]
classifier = neighbors.KNeighborsClassifier(best_k)
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

In [None]:
# Use SVM classifier --- next class is devoted to SVM
classifier = svm.SVC(gamma="auto")
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

In [None]:
# Use Random Forest classifier
classifier = ensemble.RandomForestClassifier(n_estimators=100)  # set `random_state=123` to get the same random numbers
classifier.fit(X_train, y_train)
y_test_pred = classifier.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

In [None]:
# look at the feature importances --- we again see that the petal length/width are most informative
classifier.feature_importances_

In [None]:
# Look at some of the trees
tree.plot_tree(classifier.estimators_[0])
plt.show()

In [None]:
tree.plot_tree(classifier.estimators_[1])
plt.show()

We now have two hyperparameters:

- The number of estimators
- The maximum depth of the tree

Let's use another approach to search for the best values of these hyperparameters: randomized search. It samples a fixed number of parameter settings from specified distributions. This is often more efficient than `GridSearchCV`, which exhaustively tries all combinations on a predefined grid, especially when the search space is large.
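
The difference in cost can be seen by counting how many parameter settings each search evaluates. The following is a minimal sketch on the full Iris data; the grid values, the distribution bounds, `n_iter=4` and `random_state=0` are arbitrary choices made for this illustration only.

```python
import scipy.stats as stats
from sklearn import datasets, ensemble, model_selection

X_all, y_all = datasets.load_iris(return_X_y=True)

# Grid search: evaluates every combination (2 x 3 = 6 settings)
grid = model_selection.GridSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 50], "max_depth": [2, 4, 8]},
    cv=4,
)
grid.fit(X_all, y_all)

# Randomized search: evaluates only n_iter settings drawn from the distributions
# (stats.randint(low, high) samples integers with low <= value < high)
rand = model_selection.RandomizedSearchCV(
    ensemble.RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": stats.randint(10, 51),
        "max_depth": stats.randint(1, 9),
    },
    n_iter=4,
    cv=4,
    random_state=0,
)
rand.fit(X_all, y_all)

n_grid = len(grid.cv_results_["params"])
n_rand = len(rand.cv_results_["params"])
print(n_grid, grid.best_params_, grid.best_score_)
print(n_rand, rand.best_params_, rand.best_score_)
```

With a larger grid the number of grid-search fits grows multiplicatively with each added hyperparameter, while the randomized search stays at `n_iter` settings.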

In [None]:
import random

# draw one random integer between 7 and 99 (both endpoints included)
random.randint(7, 99)

In [None]:
import scipy.stats as stats

# These names must match the argument names of RandomForestClassifier.
# stats.randint(low, high) samples integers with low <= value < high.
args = {"n_estimators": stats.randint(50, 500), "max_depth": stats.randint(1, 20)}

classifier = ensemble.RandomForestClassifier()  # set `random_state=123` to get the same random numbers

ransearch = model_selection.RandomizedSearchCV(
    classifier, param_distributions=args, n_iter=5, cv=4  #, random_state=123 # uncomment to get the same random numbers
)

ransearch.fit(X_train, y_train)

In [None]:
help(ransearch)

In [None]:
y_test_pred = ransearch.best_estimator_.predict(X_test)
metrics.confusion_matrix(y_test, y_test_pred)

# For this tiny example there is little or no improvement

### Comparing Classifier Performance vs. Training Data Size

Different classifiers may perform better or worse depending on the amount of training data available. Let's systematically compare the accuracy of our chosen classifiers (Decision Tree, KNN, SVM, Random Forest) as we vary the proportion of data used for training from 10% to 90%. We'll look at the accuracy for each Iris species separately.

In [None]:
# create a numpy array with training size ratios, ranging from 10% to 90%
train_size_vec = np.linspace(0.1, 0.9, 30)

# create a list of classifiers
classifiers = [
    tree.DecisionTreeClassifier,
    neighbors.KNeighborsClassifier,
    svm.SVC,
    ensemble.RandomForestClassifier,
]

# create an array that stores the diagonals of the confusion matrix as a function of training size ratio
# and classifier
cm_diags = np.zeros((3, len(train_size_vec), len(classifiers)), dtype=float)

# loop over each training size ratio and classifier
for n, train_size in enumerate(train_size_vec):
    X_train, X_test, y_train, y_test = model_selection.train_test_split(
        iris.data, iris.target, train_size=train_size
    )

    for m, Classifier in enumerate(classifiers):
        classifier = Classifier()
        classifier.fit(X_train, y_train)
        y_test_pred = classifier.predict(X_test)
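        # fix the label set so the matrix is always 3x3, even if a class is missing from a small split
        cm = metrics.confusion_matrix(y_test, y_test_pred, labels=[0, 1, 2])
        # per-class accuracy: correctly classified / number of test samples in that class
        cm_diags[:, n, m] = cm.diagonal() / np.bincount(y_test, minlength=3)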

# plot accuracy as a function of training size ratio
fig, axes = plt.subplots(1, len(classifiers), figsize=(12, 3))

for m, Classifier in enumerate(classifiers):
    axes[m].plot(train_size_vec, cm_diags[2, :, m], label=iris.target_names[2])
    axes[m].plot(train_size_vec, cm_diags[1, :, m], label=iris.target_names[1])
    axes[m].plot(train_size_vec, cm_diags[0, :, m], label=iris.target_names[0])
    axes[m].set_title(Classifier.__name__, fontsize="x-small")
    axes[m].set_ylim(0, 1.1)
    axes[m].set_xlim(0.1, 0.9)
    axes[m].set_ylabel("classification accuracy", fontsize="x-small")
    axes[m].set_xlabel("training size ratio", fontsize="x-small")
    axes[m].legend(loc=4, fontsize="x-small")
    axes[m].tick_params(axis="x", labelsize=12)
    axes[m].tick_params(axis="y", labelsize=12)

fig.tight_layout()
plt.show()

- We can see that the classification accuracy, and how it changes with the training size, differs from model to model.
- Which classifier is best depends on the problem.
- The good news is that it's easy to switch between them in Scikit-learn.
- Besides accuracy, computational performance can be important. For large problems with many features, a tree-based method such as Random Forest is a good one to try first.


## Clustering

- Clustering can be considered a classification problem where the classes are NOT known. For more details, see [Wikipedia](https://en.wikipedia.org/wiki/Cluster_analysis).
- It is an example of _unsupervised learning_ (the data is unlabeled).
- The input of a clustering algorithm contains only the feature variables, and the output is an array of integers that gives the cluster (or class) of each sample.
- Popular clustering methods are:
  - [_K-means algorithm_](https://en.wikipedia.org/wiki/K-means_clustering): groups the samples into clusters such that the within-cluster sum of squared deviations from the cluster centers is minimized (`sklearn.cluster.KMeans`).
  - [_mean-shift algorithm_](https://en.wikipedia.org/wiki/Mean_shift): clusters the samples by seeking the peaks of an estimated density of the data, e.g. one built from Gaussian functions (`sklearn.cluster.MeanShift`).

A full list of the clustering methods in Scikit-learn is available [here](http://scikit-learn.org/stable/modules/clustering.html).
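
To make the K-means objective concrete, the sketch below recomputes the within-cluster sum of squared deviations by hand and compares it with the `inertia_` attribute of a fitted `KMeans` model. The variable names and `random_state=0` are choices made for this illustration.

```python
import numpy as np
from sklearn import cluster, datasets

X_all = datasets.load_iris().data

km = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_all)

# Center of the cluster each sample was assigned to, one row per sample
assigned_centers = km.cluster_centers_[km.labels_]

# Within-cluster sum of squared deviations (the quantity K-means minimizes)
wcss = np.sum((X_all - assigned_centers) ** 2)

print(wcss, km.inertia_)
```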


**Example:** Consider again the Iris dataset but this time we will not use the response variable. We will implement the K-means method. We need to specify the number of clusters (we will use `n_clusters = 3` since we know this in advance).


In [None]:
# store feature data in X and response data in y
X, y = iris.data, iris.target

In [None]:
# Step1: create an instance of KMeans class using number of clusters = 3
clustering = cluster.KMeans(n_clusters=3, n_init=10, random_state=555)

In [None]:
# Step2: call the fit() method
clustering.fit(X)

In [None]:
# Step3: use predict() method to make prediction
y_pred = clustering.predict(X)

In [None]:
# Since the output is long, we'll look at every 8th element
y_pred[::8]

In [None]:
y[::8]
## NOTE: there is a good correlation between the two, but the output has assigned different integers
##   to the groups than those used in the target vector
## - To be able to compare the two arrays with metrics such as the confusion matrix, we need to rename
##      the elements so that the same integers are used

In [None]:
y_pred

In [None]:
idx_0 = np.where(y_pred == 0)
idx_0

### Evaluating Clustering: Mapping Labels

The K-Means algorithm assigned cluster labels (0, 1, 2) to the samples. However, these labels are arbitrary and don't necessarily match the original Iris species labels (0: setosa, 1: versicolor, 2: virginica). To evaluate the clustering against the *true* labels (`y`) using metrics like the confusion matrix, we first need to find the best mapping between the predicted cluster labels (`y_pred`) and the true labels.

We determine this mapping by inspecting the results of this specific run, which depends on the `random_state`. The relabeling code below assumes that cluster 0 corresponds mostly to species 1, cluster 1 to species 0, and cluster 2 to species 2; a different seed or Scikit-learn version may require a different mapping. Remember, this step is only possible because we *know* the true labels in this example; in purely unsupervised clustering, we wouldn't have `y` to compare against.
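
Instead of reading the mapping off by eye, it can also be computed. The sketch below is one possible approach, not the method used in this notebook: it treats the problem as an assignment problem and uses `scipy.optimize.linear_sum_assignment` to find the one-to-one pairing of clusters and species that maximizes the number of matching samples. The names `lookup` and `y_mapped`, and `random_state=0`, are assumptions for this illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import cluster, datasets, metrics

X_all, y_all = datasets.load_iris(return_X_y=True)
raw = cluster.KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_all)

# Rows: true species, columns: raw (arbitrary) cluster ids
cm_raw = metrics.confusion_matrix(y_all, raw)

# One-to-one pairing (species, cluster) with the largest total overlap
species_idx, cluster_idx = linear_sum_assignment(cm_raw, maximize=True)

# lookup[c] is the species id assigned to raw cluster c
lookup = np.empty(3, dtype=int)
lookup[cluster_idx] = species_idx

# Relabel without modifying the raw cluster ids in place
y_mapped = lookup[raw]

print(lookup)
print(metrics.confusion_matrix(y_all, y_mapped))
```

Building a new array (`y_mapped`) rather than overwriting the labels in place avoids the pitfall of relabeling the same sample twice.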

In [None]:
# Rename the elements in y_pred so that the same integers are used as in y
idx_0, idx_1, idx_2 = (np.where(y_pred == n) for n in range(3))
y_pred[idx_0], y_pred[idx_1], y_pred[idx_2] = 1, 0, 2
y_pred[::8]

In [None]:
# Look at the confusion matrix
metrics.confusion_matrix(y, y_pred)

## NOTE (numbers might be different): the algorithm correctly identified all samples in group 0 (first species) as a group of its own.
#  2 elements from group 1 were assigned to group 2
# 14 elements from group 2 were assigned to group 1

In [None]:
# Make scatter plots for each pair of features

N = X.shape[1]

fig, axes = plt.subplots(N, N, figsize=(12, 12), sharex=True, sharey=True)

colors = ["coral", "blue", "green"]  # different color for each cluster
markers = ["^", "v", "o"]  # different symbol for each cluster
n_clusters = 3
for m in range(N):
    for n in range(N):
        # Row m shows feature m on the y-axis and column n shows feature n on the x-axis,
        # so that the data agree with the axis labels set below
        for p in range(n_clusters):
            mask = y_pred == p
            axes[m, n].scatter(
                X[mask, n],
                X[mask, m],
                marker=markers[p],
                s=30,
                color=colors[p],
                alpha=0.25,
            )  # alpha is transparency

        bad = y != y_pred  # Put a red square at bad predictions
        axes[m, n].scatter(
            X[bad, n],
            X[bad, m],
            marker="s",
            s=30,
            edgecolor="red",
            facecolor=(1, 1, 1, 0),
        )
        axes[m, n].set_xlim([0, 8])
        axes[m, n].set_ylim([0, 8])
        axes[m, n].set_xticks([0, 2, 4, 6, 8])
        axes[m, n].set_yticks([0, 2, 4, 6, 8])

    axes[N - 1, m].set_xlabel(iris.feature_names[m], fontsize=16)
    axes[m, 0].set_ylabel(iris.feature_names[m], fontsize=16)
fig.tight_layout()
plt.show()
fig.savefig("clustering_iris.pdf")
## NOTE: the clustering does a very good job of recognizing which samples belong to distinct groups,
## but because of the overlap in the features we cannot expect any unsupervised clustering algorithm to
## fully resolve the various groups in the dataset

## References:

- _Numerical Python: A Practical Techniques Approach for Industry_ by Robert Johansson (Chapter 15)
- _Python Data Science Handbook_ by Jake VanderPlas
- https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/
