1. Model Selection:
Choose Algorithm: Select an appropriate machine learning algorithm based on the nature of the problem (classification, regression, clustering, etc.), the size of the dataset, and other factors.

2. Model Building:
Instantiate Model: Create an instance of the chosen machine learning algorithm.

Fit Model: Train the model on the training data by calling the fit() method. During training, the model learns the patterns and relationships present in the data.

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used for both classification and regression tasks. It's a non-parametric, lazy learning algorithm that makes predictions based on the similarity of input data points to the training instances. Here's a detailed description of KNN:

1. Algorithm Overview:
Lazy Learning: KNN is considered a lazy learning algorithm because it doesn't explicitly build a model during the training phase. Instead, it memorizes the training instances and makes predictions based on their proximity to new, unseen instances during the inference phase.

Instance-Based Learning: KNN belongs to the family of instance-based learning algorithms, where predictions are made by comparing new instances to existing instances in the training data.

2. How KNN Works:
K-Nearest Neighbors: The "K" in KNN refers to the number of nearest neighbors (training instances) that are considered when making predictions for a new instance. The value of K is a hyperparameter that needs to be specified by the user.

Distance Metric: KNN typically uses distance metrics such as Euclidean distance, Manhattan distance, or Minkowski distance (which generalizes both: p = 1 gives Manhattan, p = 2 gives Euclidean) to measure the similarity between data points. Euclidean distance is commonly used for continuous features, while other metrics, such as Hamming distance, may be preferred for categorical or mixed-type features.

Prediction Process: To make predictions for a new instance, KNN calculates the distances between the new instance and all training instances. It then selects the K nearest neighbors based on these distances and assigns the majority class label (in classification) or average target value (in regression) of these neighbors as the predicted label or value for the new instance.
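
The prediction process described above can be sketched in plain Python. This is only an illustrative sketch, not how scikit-learn implements KNN internally; the function names and the toy data are assumptions made up for the example.

```python
from collections import Counter


def minkowski_distance(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def k_nearest_targets(X_train, y_train, query, k, p=2):
    """Return the targets of the k training points closest to `query`."""
    pairs = [(minkowski_distance(x, query, p), y) for x, y in zip(X_train, y_train)]
    pairs.sort(key=lambda pair: pair[0])  # sort by distance only
    return [target for _, target in pairs[:k]]


def knn_classify(X_train, y_train, query, k=3, p=2):
    """Classification: majority vote among the k nearest neighbors."""
    neighbors = k_nearest_targets(X_train, y_train, query, k, p)
    return Counter(neighbors).most_common(1)[0][0]


def knn_regress(X_train, y_train, query, k=3, p=2):
    """Regression: average target value of the k nearest neighbors."""
    neighbors = k_nearest_targets(X_train, y_train, query, k, p)
    return sum(neighbors) / len(neighbors)


# Hypothetical toy data: two well-separated groups of points.
X_toy = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
y_class = ["a", "a", "a", "b", "b", "b"]
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

label = knn_classify(X_toy, y_class, [1.2, 1.4], k=3)
value = knn_regress(X_toy, y_value, [8.4, 8.6], k=3)
print(label, value)
```

Note that nothing is learned ahead of time: all the work happens when a query arrives, which is what "lazy learning" refers to.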

3. Key Hyperparameters:
K: The number of nearest neighbors to consider when making predictions. A small K makes predictions sensitive to noise in individual training points, while a large K smooths over local structure, so K should be chosen through hyperparameter tuning, for example with cross-validation.

Distance Metric: The choice of distance metric (e.g., Euclidean distance, Manhattan distance) can affect the algorithm's performance, especially when dealing with high-dimensional or mixed-type data.
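
One common way to pick K is to compare candidate values by cross-validated accuracy. The sketch below uses leave-one-out cross-validation in plain Python on made-up toy data; the helper names and the data are assumptions for illustration, and in practice a library routine such as scikit-learn's `GridSearchCV` would usually be used instead.

```python
import math
from collections import Counter


def predict(X_train, y_train, query, k):
    """Classify `query` by majority vote among its k nearest (Euclidean) neighbors."""
    order = sorted(range(len(X_train)), key=lambda i: math.dist(X_train[i], query))
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: hold out each point in turn and predict it from the rest."""
    correct = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        if predict(X_rest, y_rest, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy data: two clusters, plus one noisy point (the last row)
# that carries label 1 but sits inside the label-0 cluster.
X_toy = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0],
         [8.0, 8.0], [8.0, 9.0], [9.0, 8.0], [9.0, 9.0],
         [1.4, 1.4]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1, 1]

# Score each candidate K and keep the one with the highest accuracy.
scores = {k: loo_accuracy(X_toy, y_toy, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data, K = 1 is misled by the single noisy point, while larger values of K outvote it.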

4. Advantages of KNN:
Simplicity: KNN is easy to understand and implement, making it suitable for beginners and as a baseline model for comparison.

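No Training Phase: Since KNN is a lazy learning algorithm, there is no explicit training phase beyond storing the training data, so fitting is fast. The computational cost is deferred to prediction time, which becomes expensive for large datasets (see the limitations below).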

Versatility: KNN can be applied to both classification and regression tasks and, given a suitable distance metric or feature encoding, can handle data with mixed types of features.

Non-Parametric: KNN makes no assumptions about the underlying data distribution, making it suitable for both linear and non-linear relationships between features and target variables.

5. Limitations of KNN:
Computational Complexity: KNN requires computing distances between the new instance and all training instances, which can be computationally expensive, especially for large datasets or high-dimensional feature spaces.

Sensitivity to Feature Scaling: KNN is sensitive to the scale of features, so it's important to scale or normalize features before applying the algorithm to ensure that all features contribute equally to the distance computation.

Curse of Dimensionality: KNN performance may degrade in high-dimensional feature spaces due to the curse of dimensionality, where the distance between data points becomes less meaningful as the number of dimensions increases.

Need for Optimal K: The choice of K can significantly impact the performance of KNN, and selecting an inappropriate value of K may lead to suboptimal results.
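
The sensitivity to feature scaling can be seen in a small example. The sketch below uses made-up rows (income in dollars, age in years) and hypothetical helper names; it shows how the nearest neighbor of a point can change once each feature is standardized to zero mean and unit variance.

```python
import math
from statistics import mean, pstdev


def standardize(X):
    """Rescale every column to zero mean and unit variance (z-scores)."""
    columns = list(zip(*X))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def nearest_index(X, query_index):
    """Index of the row closest (Euclidean) to row `query_index`, excluding itself."""
    others = [i for i in range(len(X)) if i != query_index]
    return min(others, key=lambda i: math.dist(X[i], X[query_index]))


# Hypothetical rows: [income in dollars, age in years].
X_raw = [
    [50_000.0, 25.0],  # row 0: the query point
    [50_500.0, 70.0],  # row 1: similar income, very different age
    [52_000.0, 26.0],  # row 2: slightly different income, similar age
    [90_000.0, 40.0],  # row 3: far away on both features
]

raw_neighbor = nearest_index(X_raw, 0)                  # income dominates the distance
scaled_neighbor = nearest_index(standardize(X_raw), 0)  # both features contribute
print(raw_neighbor, scaled_neighbor)
```

With scikit-learn, placing a `StandardScaler` ahead of `KNeighborsClassifier` in a `Pipeline` plays the same role.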

Overall, KNN is a versatile and intuitive algorithm that can be effective in many scenarios, especially when the underlying data distribution is not well understood or when the decision boundaries between classes are irregular. However, it's important to consider its limitations and to perform appropriate preprocessing (such as feature scaling) and hyperparameter tuning to get good performance.





In [1]:
# Import required libraries (details are available in the README.md file)
from sklearn.neighbors import KNeighborsClassifier
import os
import matplotlib.pyplot as plt
import random
import cv2
import pickle


In [2]:
# Get the current working directory (where the notebook is run from)
current_dir = os.getcwd()

# Move one level up from the working directory
current_dir = os.path.dirname(current_dir)

# Move up one more level; parent_dir is two levels above the working directory
parent_dir = os.path.dirname(current_dir)

# Print the resolved parent directory
print("Parent Directory:", parent_dir)

Parent Directory: E:\upgrade_capston_project-main


In [3]:
preprocessed_data_dir = os.path.join(parent_dir, 'datasets', 'processed_dataset')

In [4]:
# Load the preprocessed data: X_train from file
with open(os.path.join(preprocessed_data_dir,'X_train.pkl'), 'rb') as f:
    X_train = pickle.load(f)

# Load y_train from file
with open(os.path.join(preprocessed_data_dir,'y_train.pkl'), 'rb') as f:
    y_train = pickle.load(f)


In [5]:
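# Define a function to train a KNN classifier
def train_knn_model(X_train, y_train, n_neighbors=2):
    """Fit a KNeighborsClassifier on the training data and return it."""
    # Fitting a KNN model only stores (and indexes) the training data
    knn_classifier = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn_classifier.fit(X_train, y_train)
    return knn_classifier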

In [6]:
# Train KNN
knn_classifier = train_knn_model(X_train, y_train)

In [7]:
model_dir = os.path.join(parent_dir, 'models')
os.makedirs(model_dir, exist_ok=True)  # create the folder if it does not exist yet

In [8]:
# Save the trained model to a file
with open(os.path.join(model_dir,'knn_model.pkl'), 'wb') as f:
    pickle.dump(knn_classifier, f)
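
To reuse the saved model later, the pickle file can be loaded back. The helper below is a minimal sketch with a hypothetical name; the commented usage assumes `model_dir` from the cell above and an `X_test` array that is not defined in this notebook. Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same scikit-learn version that saved it.

```python
import pickle


def load_pickled_object(path):
    """Load and return whatever object was pickled at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Assumed usage in this notebook (X_test is hypothetical and not defined here):
# knn_classifier = load_pickled_object(os.path.join(model_dir, "knn_model.pkl"))
# predictions = knn_classifier.predict(X_test)
```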