## Evaluation Metrics
Evaluation metrics are used to assess the performance of trained machine learning models.
##### Confusion Matrix: 
A confusion matrix summarizes the performance of a classification model by comparing its predictions with the actual classes. It consists of:
##### True Positive (TP):
Instances that are correctly predicted as positive by the model
##### True Negative (TN):
Instances that are correctly predicted as negative by the model
##### False Positive (FP):
Instances that are incorrectly predicted as positive by the model when the actual class is negative (Type I error)
##### False Negative (FN):
Instances that are incorrectly predicted as negative by the model when the actual class is positive (Type II error)

Based on the values of the confusion matrix, several evaluation metrics can be derived, including accuracy, precision, recall, and F1-Score, which provide a more comprehensive assessment of the model's performance across different classes.
##### Accuracy: 
Accuracy measures the overall correctness of the predictions made by a classification model. It is the ratio of correctly predicted samples to the total number of samples. Accuracy is calculated as

****Accuracy = (TP + TN)/(TP + TN + FP + FN)****

##### Precision:
Precision measures how many of the model's positive predictions are actually positive. Precision is calculated as

****Precision = TP/(TP + FP)****

##### Recall: 
Recall, also called sensitivity or the true positive rate, measures the proportion of actual positive instances that the model correctly predicts as positive. Recall is calculated as

****Recall = TP/(TP + FN)****

##### F1-Score:
The F1-Score is the harmonic mean of precision and recall. It provides a balanced measure of a model's performance by considering both precision and recall. F1-Score can be calculated as:

****F1-Score = 2 * (Precision * Recall)/(Precision + Recall)****
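
The four formulas above can be checked with a few lines of plain Python. This is a minimal sketch: the function name `classification_metrics` and the counts passed to it are made up for illustration, and a metric whose denominator is zero is reported as 0.0 by assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 while the recall is 0.80, which shows how the individual metrics can tell different stories about the same model.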

## Hyperparameters and Tuning
Hyperparameters are settings that determine how a machine learning model learns and makes predictions. They are not learned from the data; the user specifies them before training, and they significantly influence the model's performance and behavior.
Hyperparameter tuning is the search for the combination of hyperparameter values that optimizes a model's performance on a specific task, which makes it a crucial part of model development.
The process typically starts with selecting a range of values for each hyperparameter. Techniques such as grid search, random search, or Bayesian optimization then explore these values systematically. Cross-validation is often used to evaluate each hyperparameter configuration, which helps ensure that the selected model generalizes well to unseen data.
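
As a rough illustration of grid search combined with cross-validation, the sketch below tries every value in a small grid and keeps the one with the best mean validation accuracy. Everything in it is an assumption made for the example: the helper names (`k_fold_indices`, `knn_fold_accuracy`, `grid_search`), the toy one-dimensional k-NN model, and the hand-made data with one deliberately noisy label.

```python
from itertools import product

def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) lists; sample i goes to fold i % n_folds."""
    for fold in range(n_folds):
        val_idx = [i for i in range(n_samples) if i % n_folds == fold]
        train_idx = [i for i in range(n_samples) if i % n_folds != fold]
        yield train_idx, val_idx

def knn_fold_accuracy(params, xs, ys, train_idx, val_idx):
    """Toy model: 1-D k-NN classifier scored on one validation fold."""
    correct = 0
    for i in val_idx:
        nearest = sorted(train_idx, key=lambda j: abs(xs[j] - xs[i]))[:params["k"]]
        votes = [ys[j] for j in nearest]
        prediction = max(set(votes), key=votes.count)
        correct += int(prediction == ys[i])
    return correct / len(val_idx)

def grid_search(score_fold, grid, n_samples, n_folds=3):
    """Try every combination in `grid`; return the best params and mean CV score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fold(params, train_idx, val_idx)
                  for train_idx, val_idx in k_fold_indices(n_samples, n_folds)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical 1-D data: two groups, with one deliberately noisy label at x = 2.0
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5]
ys = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
grid = {"k": [1, 3, 5]}

best_params, best_score = grid_search(
    lambda params, tr, va: knn_fold_accuracy(params, xs, ys, tr, va),
    grid, n_samples=len(xs))
print("Best parameters:", best_params, "mean CV accuracy:", round(best_score, 3))
```

Random search and Bayesian optimization follow the same evaluate-and-compare loop; they differ only in how the next candidate configuration is chosen.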

## KMeans Clustering 
KMeans is an iterative unsupervised machine learning algorithm used for clustering data points into K different groups or clusters. It divides the data points into clusters such that all the points in the same cluster have similar properties.
KMeans is a centroid-based algorithm. Each cluster is associated with a centroid, which is the center point of the cluster. The centroid of a cluster is calculated by taking the mean of all the data points in that cluster. The K value specifies the number of clusters to produce and has to be selected by the user.

K-means clustering starts by randomly initializing K cluster centroids. After assigning each data point to the nearest centroid, the algorithm recalculates the centroids by taking the average position of all data points assigned to each cluster. It continues this process until the centroids stabilize, optimizing the clustering by minimizing the total squared distance within each cluster. 

The step-by-step explanation of the KMeans algorithm is as follows:
1. **Initialization:** The algorithm requires the user to specify the number of clusters (K) needed in the output. The algorithm then selects K initial points from the data as the starting centroids, which represent the centers of the clusters.

2. **Cluster Assignment:** Each data point in the dataset is assigned to the nearest centroid. A distance function, such as the Euclidean distance, is used to calculate the distance from each point to each of the centroids, and the data point is then assigned to the centroid with the smallest distance. At the end of this step, the data points are grouped into K clusters.

3. **Update Centroids:** Now that we have k clusters, we recalculate the centroids of each cluster by taking the mean value of all the data points in the cluster. 

4. **Convergence:** After the centroids are updated, some data points may be closer to the centroid of another cluster than to their own. Steps 2 and 3 are repeated to reassign these data points. The process continues until the centroids no longer change, which indicates that the algorithm has converged. The process can also be stopped when another criterion is met, such as a maximum number of iterations.
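
The four steps can be sketched in plain Python as follows. This is an illustrative implementation and not the one used by OpenCV: the function names, the fixed random seed, the handling of empty clusters, and the six sample points are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid
                new_centroids.append(centroids[c])
        # Step 4: stop once the centroids no longer move
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical 2-D points forming two well-separated groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(points, k=2)
print("Centroids:", centroids)
print("Labels:", labels)
```

The cluster numbering (which group gets label 0) depends on the random initialization, so only the grouping itself is meaningful.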

Some common applications of clustering on images are:
**Image Segmentation** Dividing an image into regions of similar pixels. By clustering similar pixels together, it can aid in tasks such as object recognition and image compression.

**Color Quantization** Color quantization aims to reduce the number of colors needed to represent an image. Clustering achieves this by replacing each color with the color of the centroid of its cluster.

**Content-Based Image Retrieval** Large image databases can be organized by grouping images with similar properties together. This enables efficient retrieval and browsing, making it easier to search for specific images.

**Image Annotation** Clustering can be employed to automatically categorize and annotate images based on their properties. By grouping visually similar images, it can help in automatically generating tags or annotations for images.



In [1]:
import numpy as np
import cv2

In [3]:
image = cv2.imread('Images/Input Images/Chapter 8/image.jpg')
# Reshape the image to a 2D array of pixels
pixels = image.reshape(-1, 3).astype(np.float32)

#define the criteria for k-means clustering
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# Set K values
k_values = [2, 3, 7]

cv2.imshow('K-Mean Segmentation', image)
cv2.waitKey(0)

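#Perform k-means clustering for each k value and display the segmented images.
for k in k_values:
    # Perform K-Means clustering. The first return value is the compactness:
    # the sum of squared distances from each pixel to its assigned center.
    distances, labels, centers = cv2.kmeans(pixels, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)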
    # Convert the centers to integers
    centers = np.uint8(centers)
    
    #Replace each pixel with the corresponding cluster center value
    segmented_image = centers[labels.flatten()]
    segmented_image = segmented_image.reshape(image.shape)
    
    cv2.imshow('K Mean Segmentation', segmented_image)
    print("Distance: {}".format(distances))
    cv2.waitKey(0)
    
cv2.destroyAllWindows()


Distance: 185811069.9587214
Distance: 120383869.16633701
Distance: 37909526.4141376


## K-Nearest Neighbors (k-NN)
k-Nearest Neighbors is a popular supervised learning algorithm used for classification and regression. k-NN predicts the class or value of a new data point based on the majority vote (classification) or the average (regression) of its K nearest neighbors in the training dataset.
k-NN is a non-parametric algorithm. This means that it does not make any assumptions about the underlying data distribution and instead makes predictions based on the similarity between the input data point and the labeled training data. k-NN is also an instance-based algorithm: it does not build an explicit model during training but stores the training data and uses the entire training set during the prediction phase.

The **k-NN** classification algorithm works by implementing the following steps:
1. The user chooses the number of neighbors (represented by K) to be considered for classification. Load the dataset and apply any preprocessing, such as feature scaling, as required.

2. Split the data into training and test sets. The test set will be used to evaluate the performance of the algorithm.

3. For a data point in the test set, calculate its distance to all data points in the training set. Any distance metric, such as the Euclidean distance or the Manhattan distance, can be used.

4. Select the K training points that are nearest to the data point.

5. Assign the data point the class label that wins the majority vote among its K nearest neighbors.

6. Steps 3 to 5 are repeated for all the data points in the test set.

7. The model is then evaluated using evaluation metrics such as accuracy or precision. Depending on the results, the performance of the model can be improved by changing the K value or the distance metric.
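
Steps 3 to 5 can be sketched for a single query point as shown below. This is a minimal plain-Python illustration rather than scikit-learn's implementation; the function names, the labels "a" and "b", and the sample points are hypothetical.

```python
from collections import Counter
import math

def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_points, train_labels, query, k=3):
    # Step 3: distance from the query to every training point
    distances = [(euclidean(point, query), label)
                 for point, label in zip(train_points, train_labels)]
    # Step 4: keep the k closest neighbors
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 5: majority vote among the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two groups labeled "a" and "b"
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.5, 1.5)))  # a
print(knn_predict(train_points, train_labels, (6.5, 6.5)))  # b
```

Note that all of the work happens inside `knn_predict`; there is no separate training function, which is what makes k-NN an instance-based method.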

## Feature Scaling 
Feature scaling is a preprocessing technique used in machine learning to standardize or normalize the range of features in a dataset. It aims to bring all features onto a similar scale to avoid bias towards features with larger magnitudes. 

Commonly used methods for feature scaling are:
**Normalization (Min-Max scaling):** Normalization involves scaling each feature value to a range of 0 to 1. Min-Max scaling is a common normalization technique that is implemented by subtracting the minimum value of the feature from each data point and then dividing it by the range (maximum value minus the minimum value):

***X_scaled = (X - X_min)/(X_max - X_min)***

Where X is the original feature value, X_min and X_max are the minimum and maximum values of the feature respectively, and X_scaled is the scaled value in the range 0 to 1.

**Standardization (Z-score normalization):** In this method, each feature is transformed such that it has a mean of 0 and a standard deviation of 1. It is achieved by subtracting the mean of the feature from each data point and then dividing by the standard deviation. Standardization preserves the shape of the distribution and is suitable for features that are approximately normally distributed.

***X_scaled = (X - mean)/(std_deviation)***

Where X is the original feature, X_scaled is the scaled value of the feature with mean value of 0 and a standard deviation of 1. 
The choice between standardization and normalization depends on the specific requirements of the dataset and the machine learning algorithm being used. 
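
Both formulas can be tried on a single feature with the short sketch below. The function names and the sample heights are made up for illustration, and the population standard deviation is used as an assumption. Neither function guards against a constant feature, which would cause a division by zero.

```python
import statistics

def min_max_scale(values):
    """Min-Max scaling: map the values linearly onto the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: result has mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(v - mean) / std for v in values]

# Hypothetical feature values, for illustration only
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

print(min_max_scale(heights_cm))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardize(heights_cm)])
```

In practice the minimum, maximum, mean, and standard deviation are computed on the training data only and then reused to transform the validation and test data.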

## Hyperparameters
In KNN the main hyperparameter is the K value. It represents the number of nearest neighbors to consider for classification and regression. It determines the level of complexity and generalization of the model. A smaller value of K makes the model more sensitive to noise and outliers, while a larger value of K makes the model more biased and less flexible.

The distance metric is another important hyperparameter in k-NN. The choice of distance metric, such as the Euclidean distance or the Manhattan distance, affects how similarity between data points is measured and therefore which neighbors are selected.
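
The small sketch below shows that the choice of metric can change which point counts as the nearest neighbor. The two candidate points and their names are hypothetical and were picked so that the two metrics disagree.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan_distance(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical query point and two candidate neighbors
query = (0.0, 0.0)
candidates = {"a": (2.5, 2.5), "b": (0.0, 4.0)}

for name, metric in [("Euclidean", euclidean_distance),
                     ("Manhattan", manhattan_distance)]:
    nearest = min(candidates, key=lambda label: metric(query, candidates[label]))
    print(name, "nearest neighbor:", nearest)
```

Point "a" is nearer under the Euclidean distance (about 3.54 versus 4.0), while point "b" is nearer under the Manhattan distance (4.0 versus 5.0).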





In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [5]:
mnist = tf.keras.datasets.mnist

#Load the MNIST dataset
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz


In [8]:
# Flatten the Image
X_train = X_train.reshape((X_train.shape[0], -1))
X_test = X_test.reshape((X_test.shape[0], -1))

# Split the training data into training and validation sets
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)

#Define the KNN Classifier
knn = KNeighborsClassifier()

#Fit the classifier to training data
knn.fit(X_train, Y_train)

# make predictions on the validation set
y_pred_val = knn.predict(X_val)

# Calculate the validation accuracy
val_accuracy = accuracy_score(Y_val, y_pred_val)
print("Validation Accuracy:", val_accuracy)

#Make predictions on the test set
y_pred_test = knn.predict(X_test)

#Calculate the test accuracy
test_accuracy = accuracy_score(Y_test, y_pred_test)
print("Test Accuracy:", test_accuracy)



Validation Accuracy: 0.9626302083333333
Test Accuracy: 0.9615


In [13]:
# Select three random test images
indices = np.random.randint(0, len(X_test), size=3)
images = X_test[indices]
predicted_labels = y_pred_test[indices]

# Preprocess the images
reshaped_images = [cv2.cvtColor(image.reshape(28, 28), cv2.COLOR_GRAY2BGR) for image in images]

# Concatenate the images horizontally
concatenated_image = np.hstack(reshaped_images)
cv2.imshow("Images", concatenated_image)
cv2.waitKey(0)
cv2.destroyAllWindows()
print("Predicted Labels:", predicted_labels)

Predicted Labels: [4 2 8]


## Logistic Regression
Logistic regression is a popular and widely used machine learning algorithm for classifying data into categories. While initially developed for binary classification (two categories marked as 0 and 1), it can handle multiclass classification using strategies such as one-vs-rest.

Logistic regression is a supervised machine learning algorithm that models the relationship between the input features and the output category by assuming that the log-odds of the positive class are a linear function of the features.

The logistic regression algorithm uses the following steps:
1. **Input Features:** Logistic regression takes input features that describe the characteristics or attributes of the data we want to classify. For example, in a dog and cat classification problem, the input features could be extracted from images, such as the color distribution, texture patterns, or shape features. These features provide information about the distinguishing characteristics of dogs and cats, and they are used as the input variables for the logistic regression model.

2. **Initialize Weights:** To perform logistic regression, each input feature x_i is multiplied by its weight w_i, and the weighted features are summed up. The weighted sum represents the influence of each feature on the prediction. The weights are initialized to small random values or set to zero and serve as starting points for the training process.

***z = (Sum over i = 1 to n of w_i x_i) + b***

In logistic regression, a bias term 'b' is also added to the weighted sum. Without it, the weighted sum would always be zero, and the predicted probability always 0.5, whenever all the input features are zero.

3. **Train the Model:** Now that our data is ready, we can proceed to train our logistic regression model. During the training process, the algorithm tries to learn the boundary that separates the two classes effectively. The boundary is a hyperplane that divides the feature space into two regions, one for cats and the other for dogs in our case.

The algorithm works by iteratively adjusting the weights assigned to each input feature to minimize the error between the predicted probabilities and the actual labels. The optimization process searches for the optimal values of the weights by minimizing a cost or loss function such as the cross-entropy loss. Logistic regression commonly uses optimization algorithms such as gradient descent, stochastic gradient descent (SGD), or variants of SGD like mini-batch gradient descent.

By adjusting the weights during training, the logistic regression model learns the importance of each input feature in predicting the probability of an image belonging to a specific class (cat or dog). Ultimately, the trained logistic regression model finds the decision boundary that best separates the cat images from the dog images in the feature space using the weights and the bias term.

4. **Gradient Descent:** Gradient descent is an iterative optimization algorithm that aims to find the optimal set of parameters (weights) that minimize the cost function associated with logistic regression.

In gradient descent, the algorithm starts with an initial set of parameter values and updates them iteratively by taking steps proportional to the negative gradient of the cost function. The gradient gives the direction in which the cost function increases the fastest. By taking steps in the opposite direction, we move in the direction where the cost function decreases the fastest, which eventually leads to a minimum of the cost function.

At each iteration, the algorithm computes the gradients of the cost function and then updates the parameters by subtracting a scaled value of the gradients. The learning rate determines the step size taken in each iteration. The process continues until a stopping criterion is met, such as reaching a maximum number of iterations or achieving a desired level of convergence.

5. **Calculate Probabilities:** After training, the model uses the input features and their weights to calculate a weighted sum. The weighted sum is then mapped to a probability value between 0 and 1 using the sigmoid function:

***Sigmoid(x) = 1/(1 + exp(-x))***

6. **Prediction:** Now that the output has been mapped into the 0 to 1 range, the model makes class predictions by assigning the most likely label (cat or dog) to the image. The probability refers to the class labeled 1: if it is above 0.5, the model assigns the image to that class, and if it is below 0.5, the image is assigned to the class labeled 0. In the dog and cat code later in this section, cats are labeled 1 and dogs are labeled 0.
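
The steps above can be put together in a short from-scratch sketch. It uses batch gradient descent on the cross-entropy loss with a single input feature; the function names, the learning rate, the iteration count, and the six centered sample values are assumptions chosen for illustration, and scikit-learn's `LogisticRegression` uses different solvers internally.

```python
import math

def sigmoid(z):
    """Map a weighted sum to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(features, labels, learning_rate=0.1, n_iters=1000):
    n_samples, n_features = len(features), len(features[0])
    weights = [0.0] * n_features  # step 2: weights start at zero
    bias = 0.0
    for _ in range(n_iters):      # steps 3 and 4: batch gradient descent
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(z) - y  # gradient of the cross-entropy loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the average gradient, scaled by the learning rate
        weights = [w - learning_rate * g / n_samples
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias

def predict_proba(weights, bias, x):
    """Step 5: probability that x belongs to the class labeled 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

# Hypothetical, already centered 1-D feature: negatives are class 0, positives class 1
features = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(features, labels)
for x in ([-1.0], [1.0]):
    p = predict_proba(weights, bias, x)
    print(x, "probability:", round(p, 3), "class:", int(p >= 0.5))  # step 6
```

The learned weight is positive, so larger feature values push the probability towards class 1 and the 0.5 threshold falls near zero for this symmetric toy data.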

## Hyperparameters

Some of the key hyperparameters in logistic regression are as follows:
1. penalty: This hyperparameter determines the type of regularization used in logistic regression to prevent overfitting. It can take different values:

 ***L1 Regularization***, also known as Lasso regularization, adds the sum of the absolute values of the coefficients as the penalty term

 ***L2 Regularization***, also known as Ridge regularization, adds the sum of the squared coefficients as the penalty term

 ***None***, no regularization is applied

2. C: This parameter denotes the inverse of the regularization strength. It controls the amount of regularization applied to the model. Smaller values of C result in stronger regularization, while larger values reduce the amount of regularization.

3. max_iter: This hyperparameter sets the maximum number of iterations the solver may take to converge to the optimal solution. If the limit is reached before convergence, training stops early and a warning is reported, as in the output below.
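
To make the role of `C` concrete, the sketch below evaluates a regularized objective of the form `C * data_loss + penalty`. This is one common formulation and is assumed here for illustration; the function names, the weights, and the data loss value are hypothetical.

```python
def penalty_term(weights, penalty="l2"):
    """Regularization penalty for a list of coefficients."""
    if penalty == "l1":
        return sum(abs(w) for w in weights)      # sum of absolute values
    if penalty == "l2":
        return 0.5 * sum(w * w for w in weights)  # half the sum of squares
    if penalty is None:
        return 0.0                                # no regularization
    raise ValueError(f"unknown penalty: {penalty!r}")

def regularized_objective(data_loss, weights, C=1.0, penalty="l2"):
    """Assumed form: C scales the data loss, so a small C lets the penalty dominate."""
    return C * data_loss + penalty_term(weights, penalty)

# Hypothetical coefficients and data loss, for illustration only
weights = [3.0, -4.0]
data_loss = 2.0

for C in (0.1, 10.0):
    total = regularized_objective(data_loss, weights, C=C)
    share = penalty_term(weights) / total
    print(f"C={C}: objective={total:.2f}, penalty share={share:.2f}")
```

With `C=0.1` almost the entire objective comes from the penalty, so the optimizer is pushed hard towards small coefficients; with `C=10.0` the data loss carries most of the weight.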




In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import os
import cv2



In [15]:
#Initialize empty lists for the dataset and labels
dataset = []
labels = []

train_folder = "train"

#Load Images from "dogs" folder
dog_folder = os.path.join(train_folder, 'dogs')
for filename in os.listdir(dog_folder):
    if filename.endswith('.jpg'):
        image = cv2.imread(os.path.join(dog_folder, filename), cv2.IMREAD_GRAYSCALE)
        if image is not None:
            image = cv2.resize(image, (64,64))
            k = image.flatten()
            dataset.append(k)
            labels.append(0) # labels 0 for dog image
            
#Load images from cats folder
cat_folder = os.path.join(train_folder, 'cats')
for filename in os.listdir(cat_folder):
    if filename.endswith('.jpg'):
        image = cv2.imread(os.path.join(cat_folder, filename), cv2.IMREAD_GRAYSCALE)
        if image is not None:
            image = cv2.resize(image, (64, 64))
            k = image.flatten()
            dataset.append(k)
            labels.append(1) # labels 1 for cat image

#Convert the dataset and labels for Numpy arrays
dataset = np.array(dataset)
labels = np.array(labels)

#Split the dataset into train and test
X_train, X_test, Y_train, Y_test = train_test_split(dataset, labels, test_size=0.2)

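# create a logistic regression model
logreg = LogisticRegression(max_iter = 500)

# Train the model on the training split only.
# (The accuracies recorded below came from a run that fitted on the full
# dataset, which leaks the test samples into training and inflates the test score.)
logreg.fit(X_train, Y_train)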

#Evaluate the model on the training set
train_predictions = logreg.predict(X_train)
train_accuracy = accuracy_score(Y_train, train_predictions)
print("Training Accuracy:", train_accuracy)

#Evaluate the model on the testing set
test_predictions = logreg.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_predictions)
print("Test Accuracy:", test_accuracy)

Training Accuracy: 0.86796875
Test Accuracy: 0.868125


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [16]:
#Select five random images from the test/cats folder
image_files = np.random.choice(os.listdir("test/cats"), size=5, replace=False)
print(image_files)

images = []
#Iterate over the selected image file
for image_file in image_files:
    image_path = os.path.join("test/cats", image_file)
    
    #Read the image
    image = cv2.imread(image_path)
    
    #Preprocess the image
    gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    image = cv2.resize(gray_image, (64, 64))
    
    flattened_image = image.flatten()
    reshaped_image = flattened_image.reshape(1, -1)
    
    #Make a prediction on the image
    predicted_label = logreg.predict(reshaped_image)[0]
    
    #Get the class name based on the predicted label
    class_names = ['dog', 'cat']
    predicted_class = class_names[predicted_label]
    
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.7
    
    cv2.putText(image, predicted_class, (20, 20), font, font_scale, (255, 255, 255), 1, cv2.LINE_AA)
    
    #Add image with the predicted class to the list
    images.append(image)
    
output_image = np.hstack(images)

cv2.imshow("output", output_image)
cv2.waitKey(0)
cv2.destroyAllWindows()

['cat.4922.jpg' 'cat.4974.jpg' 'cat.4674.jpg' 'cat.4502.jpg'
 'cat.4967.jpg']


In [17]:
from tensorflow.keras.datasets import cifar10

#initialize empty lists for dataset and labels
dataset = []
labels = []

#Load CIFAR-10 dataset
(X_train, Y_train), (_, _) = cifar10.load_data()

#Select the desired classes (here all ten CIFAR-10 classes)
selected_classes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
num_images_per_class = 100
num_classes = len(selected_classes)

selected_images = []
#Iterate over the selected classes and extract the first images of each class
for class_idx in selected_classes:
    class_images = X_train[Y_train.flatten() == class_idx]
    selected_images.extend(class_images[:num_images_per_class])
    
# Convert the list of selected images to a Numpy Array
selected_images = np.array(selected_images)

# Reshape the images to a flattened shape
flattened_images = selected_images.reshape(-1, np.prod(selected_images.shape[1:]))

#Use the flattened images as the dataset
dataset = flattened_images

#Build the labels: the images were collected class by class,
#so each class index is repeated once per selected image
labels = np.repeat(selected_classes, num_images_per_class)

#Split the extracted images into Train and test data
X_train, X_test, Y_train, Y_test = train_test_split(dataset, labels, test_size=0.2)

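#Create Logistic regression model
logreg = LogisticRegression(C=0.1, max_iter=1000)

#Train the model on the training split only.
#(The accuracies of 1.0 recorded below came from a run that fitted on the full
#dataset, so the test images had already been seen during training.)
logreg.fit(X_train, Y_train)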

#Evaluate the model on the training set
train_predictions = logreg.predict(X_train)
train_accuracy = accuracy_score(Y_train, train_predictions)
print('Train Accuracy:', train_accuracy)

#Evaluate the model on testing set
test_predictions = logreg.predict(X_test)
test_accuracy = accuracy_score(Y_test, test_predictions)
print('Test Accuracy:', test_accuracy)
    

Train Accuracy: 1.0
Test Accuracy: 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [30]:
# function to draw the predicted class on the image
def draw_predicted_class(image, predicted_class):
    class_name = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    label = class_name[predicted_class]
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 0.7
    color = (255, 255, 255)
    thickness = 2
    text_size, _ = cv2.getTextSize(label, font, font_scale, thickness)
    text_x = (image.shape[1] - text_size[0]) // 2
    text_y = (image.shape[0] + text_size[1]) // 2  # putText anchors the text at its bottom-left corner
    cv2.putText(image, label, (text_x, text_y), font, font_scale, color, thickness, cv2.LINE_AA)
    return image

#Load CIFAR-10 dataset
(_, _), (X_test, Y_test) = cifar10.load_data()

# Select one random test image
index = np.random.randint(len(X_test))
image = X_test[index]
true_label = Y_test[index]

#Make a prediction on the selected image.
selected_image = image.reshape(1, -1)
predicted_label = logreg.predict(selected_image)
print(predicted_label)

# Draw the predicted class on a BGR copy of the 32x32 image
# (Keras loads CIFAR-10 as RGB, while OpenCV displays BGR)
display_image = cv2.cvtColor(image, cv2.COLOR_RGB2BGR)
image_with_label = draw_predicted_class(display_image, predicted_label[0])

cv2.imshow("log_final_res.jpg", image_with_label)
cv2.waitKey(0)
cv2.destroyAllWindows()

[5]


## Decision Trees