## Evaluation Metrics
Evaluation metrics are used to assess the performance of trained models in machine learning.
##### Confusion Matrix: 
A confusion matrix summarizes the performance of a classification model by comparing the predicted classes against the actual classes. It consists of four counts:
##### True Positive (TP):
Instances that are correctly predicted as positive by the model
##### True Negative (TN):
Instances that are correctly predicted as negative by the model
##### False Positive (FP):
Instances that are incorrectly predicted as positive by the model when the actual class is negative (Type I error)
##### False Negative (FN):
Instances that are incorrectly predicted as negative by the model when the actual class is positive (Type II error)
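
As a minimal sketch of how these four counts are obtained, the snippet below tallies them with NumPy for a binary problem. The arrays `y_true` and `y_pred` are made-up illustrative labels (1 = positive, 0 = negative), not data from this chapter.

```python
import numpy as np

# Hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Count each cell of the confusion matrix with boolean masks
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted positive, actually positive
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # predicted negative, actually negative
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted positive, actually negative
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted negative, actually positive

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```

The four counts always add up to the total number of samples, which is a quick sanity check on any confusion matrix.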

Based on the values of the confusion matrix, several evaluation metrics can be derived, including accuracy, precision, recall, and F1-Score, which provide a more comprehensive assessment of the model's performance across different classes.
##### Accuracy: 
Accuracy measures the overall correctness of predictions made by a classification model. It is the ratio of correctly predicted samples to the total number of samples. Accuracy is calculated as

****Accuracy = (TP + TN)/(TP + TN + FP + FN)****

##### Precision:
Precision focuses on the accuracy of the positive predictions made by a classification model. It is the proportion of instances predicted as positive that are actually positive. Precision is calculated as 

****Precision = TP/(TP + FP)****

##### Recall: 
Recall, also called sensitivity or the true positive rate, measures the proportion of actual positive instances that the model correctly predicts as positive. Mathematically, recall is calculated as 

****Recall = TP/(TP + FN)****

##### F1-Score:
The F1-Score is the harmonic mean of precision and recall. It provides a balanced measure of a model's performance by considering both precision and recall. F1-Score can be calculated as:

****F1-Score = 2 * (Precision * Recall)/(Precision + Recall)****
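
The four formulas above can be sketched as one small helper. The function name `classification_metrics` and the counts passed to it are hypothetical, chosen only for illustration. Returning 0.0 when a denominator is zero is an assumed convention, not something the formulas themselves define.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a ratio is undefined (zero denominator)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts for a binary classifier evaluated on 100 samples
accuracy, precision, recall, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
```

With these counts the classifier has higher precision than recall, and the F1-Score falls between the two.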

## Hyperparameters and Tuning
Hyperparameters are settings that determine how a machine learning model learns and makes predictions. They are not learned from the data but are specified by the user before training, and they significantly influence the model's performance and behavior.
Hyperparameter tuning is the process of finding the combination of hyperparameter values that optimizes a model's performance for a specific task, and it is a crucial part of model development.
The process typically starts with selecting a range of values for each hyperparameter. Techniques such as grid search, random search, or Bayesian optimization are then used to explore these values systematically. Cross-validation is often used to evaluate each hyperparameter configuration, which helps ensure that the chosen model generalizes well to unseen data.
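
As a rough illustration of grid search combined with cross-validation, the sketch below uses scikit-learn's `GridSearchCV`. The model (a k-NN classifier), the Iris dataset, and the candidate values in `param_grid` are all assumptions picked to keep the example small. They are not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only to keep the example fast
X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to explore (illustrative choices)
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Evaluate every combination with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Grid search tries every combination, so its cost grows quickly with the number of hyperparameters. Random search samples combinations instead and is often preferred when the grid is large.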

## KMeans Clustering 
KMeans is an iterative unsupervised machine learning algorithm used for clustering data points into K different groups or clusters. It divides the data points into clusters such that all the points in the same cluster have similar properties. 
KMeans is a centroid-based algorithm. Each cluster is associated with a centroid, which is the center point of the cluster. The centroid is calculated by taking the mean of all the data points in the cluster. The value of K specifies the number of clusters to produce and has to be selected by the user.

K-means clustering starts by randomly initializing K cluster centroids. After assigning each data point to the nearest centroid, the algorithm recalculates the centroids by taking the average position of all data points assigned to each cluster. It continues this process until the centroids stabilize, optimizing the clustering by minimizing the total squared distance within each cluster. 

The step-by-step explanation of the KMeans algorithm is as follows:
1. **Initialization:** The algorithm requires the user to specify the number of clusters (K) needed in the output. The algorithm then automatically selects K initial points from the data as the starting centroids, which represent the centers of each cluster.

2. **Cluster Assignment:** Each data point in the dataset is assigned to the nearest centroid. A distance function, such as the Euclidean distance, is used to calculate the distance from each point to each of the centroids, and the data point is then assigned to the nearest centroid. The data points are now grouped into K clusters.

3. **Update Centroids:** Now that we have K clusters, we recalculate the centroid of each cluster by taking the mean value of all the data points in the cluster. 

4. **Convergence:** After the centroids are updated, some data points may be closer to the centroid of another cluster. Steps 2 and 3 are repeated to reassign these data points to the updated clusters. The process continues until the centroids no longer change, indicating that the algorithm has converged. The process can also be stopped when another criterion is met, such as a maximum number of iterations. 
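
The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the implementation used by OpenCV or any other library: the function name `kmeans`, its parameters, and the two synthetic blobs are hypothetical, and the initial centroids are simply K distinct data points drawn at random.

```python
import numpy as np


def kmeans(points, k, max_iters=100, seed=0):
    """Minimal K-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Two well-separated synthetic blobs (hypothetical data)
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print("Centroids:\n", centroids)
```

Because the starting centroids are random, different seeds can lead to different results on harder data. Library implementations usually run the algorithm several times and keep the best result.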

Some common applications of clustering on images are:
**Image Segmentation** Dividing an image into regions of similar pixels. By clustering similar pixels together, it can aid in tasks such as object recognition and image compression. 

**Color Quantization** Color quantization aims to reduce the number of colors needed to represent an image. Clustering achieves this by replacing each color with the color of the centroid of its cluster.

**Content-Based Image Retrieval** Large image databases can be organized by grouping images with similar properties together. This enables efficient retrieval and browsing, making it easier to search for specific images. 

**Image Annotation** Clustering can be employed to automatically categorize and annotate images based on their properties. By grouping visually similar images, it can help in automatically generating tags or annotations for images. 



In [1]:
import numpy as np
import cv2

In [3]:
image = cv2.imread('Images/Input Images/Chapter 8/image.jpg')
# Reshape the image to a 2D array of pixels
pixels = image.reshape(-1, 3).astype(np.float32)

# Define the termination criteria: stop after 10 iterations
# or when the cluster centers move by less than 1.0
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# Set K values
k_values = [2, 3, 7]

# Display the original image first for comparison; press any key to continue
cv2.imshow('K-Mean Segmentation', image)
cv2.waitKey(0)

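# Perform k-means clustering for each k value and display the segmented images
for k in k_values:
    # cv2.kmeans returns the compactness (the sum of squared distances from each
    # point to its assigned center), the label of each pixel, and the cluster centers
    distances, labels, centers = cv2.kmeans(pixels, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
    # Convert the centers to 8-bit integers so they can be used as pixel values
    centers = np.uint8(centers)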
    
    #Replace each pixel with the corresponding cluster center value
    segmented_image = centers[labels.flatten()]
    segmented_image = segmented_image.reshape(image.shape)
    
    cv2.imshow('K Mean Segmentation', segmented_image)
    print("Distance: {}".format(distances))
    cv2.waitKey(0)
    
cv2.destroyAllWindows()


Distance: 185811069.9587214
Distance: 120383869.16633701
Distance: 37909526.4141376


## K-Nearest Neighbors (k-NN)
k-Nearest Neighbors is a popular supervised learning algorithm used for classification and regression. k-NN predicts the class or value of a new data point based on the majority class or the average value of its K nearest neighbors in the training dataset.
k-NN is a non-parametric algorithm. This means that it does not make any assumptions about the underlying data distribution and instead makes predictions based on the similarity between the input data points and the labeled training data. k-NN is also an instance-based algorithm, meaning that it has no explicit training phase: it stores the training dataset and uses all of it during the prediction phase. 

The **k-NN** classification algorithm works by implementing the following steps:
1. The user chooses the number of neighbors (represented by K) to be considered for classification. Load the dataset and apply any preprocessing, such as feature scaling, as required. 

2. Split the data into training and test sets. The test set will be used to evaluate the performance of the algorithm.

3. For each data point in the test set, calculate its distance to all data points in the training set. Any distance metric, such as the Euclidean distance or the Manhattan distance, can be used. 

4. Select the K training points that are nearest to the data point. 

5. Assign the data point the class label that wins the majority vote among its K nearest neighbors. 

6. Steps 3 to 5 are repeated for all the data points in the test set. 

7. The model is then evaluated using evaluation metrics such as accuracy or precision. Depending on the results, the performance of the model can be improved by changing the K value or the distance metric.
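
Steps 3 to 5 above can be sketched from scratch in NumPy. The function name `knn_predict` and the toy data are hypothetical. The sketch assumes Euclidean distance and non-negative integer class labels, and it breaks voting ties in favor of the smallest label, which is an arbitrary choice.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each row of X_query by majority vote among its k nearest neighbors."""
    predictions = []
    for x in X_query:
        # Step 3: Euclidean distance from the query point to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        # Step 4: indices of the k nearest training points
        nearest = np.argsort(distances)[:k]
        # Step 5: majority vote among the neighbors' labels
        votes = np.bincount(y_train[nearest])
        predictions.append(votes.argmax())
    return np.array(predictions)


# Hypothetical toy data: two classes in two dimensions
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[1.2, 1.4], [8.7, 8.2]])

print(knn_predict(X_train, y_train, X_query, k=3))
```

Every prediction scans the whole training set, which is why k-NN is cheap to "train" but can be slow at prediction time on large datasets.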

## Feature Scaling 
Feature scaling is a preprocessing technique used in machine learning to standardize or normalize the range of features in a dataset. It aims to bring all features onto a similar scale to avoid bias towards features with larger magnitudes. 

Commonly used methods for feature scaling are:
**Normalization (Min-Max scaling):** Normalization involves scaling each feature value to a range of 0 to 1. Min-Max scaling is a common normalization technique that is implemented by subtracting the minimum value of the feature from each data point and then dividing it by the range (maximum value minus the minimum value):

***X_scaled = (X - X_min)/(X_max - X_min)***

Where X is the original feature value, X_min and X_max are the minimum and maximum values of the feature, respectively, and X_scaled is the scaled value in the range 0 to 1.
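
A minimal NumPy sketch of Min-Max scaling applied column by column is shown below. The small matrix `X` is made-up data with two features on very different scales, and the sketch assumes every feature has X_max greater than X_min, since a constant feature would cause a division by zero.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features on different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Per-feature minimum and maximum (computed down each column)
X_min = X.min(axis=0)
X_max = X.max(axis=0)

# Min-Max scaling: every feature now lies in the range 0 to 1
X_scaled = (X - X_min) / (X_max - X_min)
print(X_scaled)
```

In practice the minimum and maximum should be computed on the training set only and then reused to scale the validation and test sets.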

**Standardization (Z-score normalization):** In this method, each feature is transformed such that it has a mean of 0 and a standard deviation of 1. It is achieved by subtracting the mean of the feature from each data point and then dividing by the standard deviation. Standardization preserves the shape of the distribution and is suitable for features that have an approximately normal distribution. 

***X_scaled = (X - mean)/(std_deviation)***

Where X is the original feature value, mean and std_deviation are the mean and standard deviation of the feature, and X_scaled is the scaled value of the feature, which has a mean of 0 and a standard deviation of 1. 
The choice between standardization and normalization depends on the specific requirements of the dataset and the machine learning algorithm being used. 
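
The standardization formula can be sketched in NumPy in the same way. The matrix `X` is again made-up data, and the sketch assumes the population standard deviation (NumPy's default, `ddof=0`) and that no feature is constant.

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features on different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Per-feature mean and standard deviation (population std, ddof=0)
mean = X.mean(axis=0)
std_deviation = X.std(axis=0)

# Standardization: every feature now has mean 0 and standard deviation 1
X_scaled = (X - mean) / std_deviation
print("Mean per feature:", X_scaled.mean(axis=0))
print("Std per feature:", X_scaled.std(axis=0))
```

Distance-based algorithms such as k-NN are especially sensitive to feature scale, because an unscaled feature with a large range dominates the distance calculation.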

## Hyperparameters in k-NN
In k-NN, the main hyperparameter is the K value. It represents the number of nearest neighbors to consider for classification or regression, and it determines the complexity and generalization of the model. A smaller value of K makes the model more sensitive to noise and outliers, while a larger value of K smooths the decision boundary, making the model more biased and less flexible.

The distance metric is another important hyperparameter in k-NN. The choice of distance metric, such as the Euclidean distance or the Manhattan distance, affects how similarity between data points is measured and therefore which neighbors are selected. 
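
To illustrate why the distance metric matters, the sketch below compares the Euclidean and Manhattan distances from a query point to two candidate neighbors. The coordinates are made up so that the two metrics disagree about which point is nearest.

```python
import numpy as np

# Hypothetical query point and two candidate neighbors
query = np.array([0.0, 0.0])
points = np.array([[2.5, 2.5],   # point A
                   [0.0, 4.0]])  # point B

# Euclidean distance: square root of the sum of squared differences
euclidean = np.sqrt(((points - query) ** 2).sum(axis=1))
# Manhattan distance: sum of absolute differences
manhattan = np.abs(points - query).sum(axis=1)

names = ["A", "B"]
print("Euclidean distances:", euclidean, "nearest:", names[euclidean.argmin()])
print("Manhattan distances:", manhattan, "nearest:", names[manhattan.argmin()])
```

Point A is nearest under the Euclidean distance, while point B is nearest under the Manhattan distance, so a 1-NN classifier could return a different label depending on the metric.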





In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [5]:
mnist = tf.keras.datasets.mnist

# Load the MNIST handwritten digits dataset
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [8]:
# Flatten the Image
X_train = X_train.reshape((X_train.shape[0], -1))
X_test = X_test.reshape((X_test.shape[0], -1))

# Split the training data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)

# Define the KNN classifier (scikit-learn's default is n_neighbors=5)
knn = KNeighborsClassifier()

#Fit the classifier to training data
knn.fit(X_train, Y_train)

# make predictions on the validation set
y_pred_val = knn.predict(X_val)

# Calculate the validation accuracy
val_accuracy = accuracy_score(Y_val, y_pred_val)
print("Validation Accuracy:", val_accuracy)

#Make predictions on the test set
y_pred_test = knn.predict(X_test)

#Calculate the test accuracy
test_accuracy = accuracy_score(Y_test, y_pred_test)
print("Test Accuracy:", test_accuracy)



Validation Accuracy: 0.9626302083333333
Test Accuracy: 0.9615


In [13]:
# Select three random test images
indices = np.random.randint(0, len(X_test), size=3)
images = X_test[indices]
predicted_labels = y_pred_test[indices]

# Reshape each flattened image back to 28x28 and convert it from grayscale to BGR
reshaped_images = [cv2.cvtColor(image.reshape(28, 28), cv2.COLOR_GRAY2BGR) for image in images]

# Concatenate the images horizontally
concatenated_image = np.hstack(reshaped_images)
cv2.imshow("Images", concatenated_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
print("Predicted Labels:", predicted_labels)

Predicted Labels: [4 2 8]
