# Module 1: Introduction to Scikit-Learn

## Section 3: Supervised Learning Algorithms

### Part 6: K-Nearest Neighbors (KNN)

In this section, we will explore K-Nearest Neighbors (KNN), a popular non-parametric supervised learning algorithm used for both classification and regression tasks. KNN makes predictions based on the similarity of feature values between data points. Let's dive in!

### 6.1 Understanding K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a simple yet effective algorithm that uses the k closest labeled data points in the training set to make predictions for new, unlabeled data points. KNN assumes that similar data points tend to belong to the same class or have similar values.

To determine the class or value of a new data point, KNN calculates the distance between the new point and every point in the training set. The most common distance metric is the Euclidean distance. The k nearest neighbors are then selected: for classification, the new point is assigned the majority class among those neighbors; for regression, it is assigned the average of their target values.
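The prediction step described above can be sketched from scratch with NumPy. This is a minimal illustration, not how Scikit-Learn implements it internally; the function name `knn_predict` and the toy data are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [6.0, 6.0], [6.5, 7.0], [7.0, 6.5]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.8, 1.6]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([6.2, 6.4]), k=3))  # 1
```

For regression, the majority vote would be replaced by the mean of the neighbors' target values.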

### 6.2 Training and Evaluation

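To train a KNN model, we need a labeled dataset with the target variable and the corresponding feature values. KNN is a "lazy" learner: fitting does not estimate any parameters. The model simply stores the training set (Scikit-Learn may also build a search structure such as a KD-tree or ball tree to speed up neighbor lookups), and the real computation happens at prediction time.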

Once trained, we can evaluate the model's performance using metrics suited to the task: accuracy, precision, recall, or F1-score for classification, and mean squared error for regression.

Scikit-Learn provides the KNeighborsClassifier class for classification tasks and the KNeighborsRegressor class for regression tasks. Here's an example of how to use them:

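```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Stand-in datasets (an assumption for illustration): Iris has class labels,
# Diabetes has a continuous target, so each model gets an appropriate y.
X_clf, y_clf = load_iris(return_X_y=True)
X_reg, y_reg = load_diabetes(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=42)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=42)

# Create an instance of each model (the default is n_neighbors=5)
classifier = KNeighborsClassifier()
regressor = KNeighborsRegressor()

# Fit each model to its own training data
classifier.fit(Xc_train, yc_train)
regressor.fit(Xr_train, yr_train)

# Predict class labels or values for test data
y_pred_classifier = classifier.predict(Xc_test)
y_pred_regressor = regressor.predict(Xr_test)

# Evaluate the model's performance
classification_accuracy = accuracy_score(yc_test, y_pred_classifier)
regression_mse = mean_squared_error(yr_test, y_pred_regressor)
```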

### 6.3 Choosing the Value of K

The value of k, the number of nearest neighbors to consider, is an important hyperparameter in KNN. It significantly affects the model's performance. Choosing an optimal value for k requires careful consideration.

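- A small value of k makes predictions depend on only a few points, so the model follows noise in the training data. This can lead to high variance and overfitting.
- A large value of k averages over many points, which smooths out local structure. This can lead to high bias and underfitting.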

Hyperparameter tuning techniques, such as grid search combined with cross-validation, can be used to find a good value of k for a given dataset. Another useful tool is the elbow technique.
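For the grid search option, here is a minimal sketch using `GridSearchCV`. The Iris dataset, the candidate range of 1 to 15, and the 5-fold setting are all assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset for illustration
X, y = load_iris(return_X_y=True)

# Candidate values of k to try (an arbitrary range)
param_grid = {"n_neighbors": list(range(1, 16))}

# Score each candidate with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_k = search.best_params_["n_neighbors"]
print("best k:", best_k)
print("mean CV accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the supplied data with the selected k.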

The elbow technique involves plotting the value of k on the x-axis and the corresponding evaluation metric, measured on validation data (e.g., error rate or mean squared error), on the y-axis. As k increases, the model becomes more biased, leading to underfitting. On the other hand, as k decreases, the model becomes more complex and prone to overfitting. A good value of k is often found at the "elbow" of the curve: the point beyond which changing k further brings little improvement in the metric.

By iterating through different values of k and evaluating the model's performance, we can determine the optimal value of k that balances bias and variance.
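The iteration over k can be sketched as a simple loop that records the cross-validated error rate for each value. The dataset and the range of 1 to 30 are assumptions for illustration; the resulting `error_rates` list is what one would plot against `k_values` to look for the elbow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in dataset for illustration
X, y = load_iris(return_X_y=True)

k_values = list(range(1, 31))
error_rates = []
for k in k_values:
    # Mean accuracy over 5 folds, converted to an error rate
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    error_rates.append(1.0 - scores.mean())

# Print the curve; plotting k_values against error_rates shows the same data
for k, err in zip(k_values, error_rates):
    print(f"k={k:2d}  error={err:.3f}")
```

On a small dataset like this one the curve can be noisy, so the elbow is a judgment call rather than a precise cutoff.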

### 6.4 Handling Continuous or Categorical Features

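KNN can handle both continuous and categorical features. For continuous features, the Euclidean distance or other distance metrics (such as Manhattan distance) can be used. For categorical features, appropriate distance metrics such as Hamming distance or custom distance functions can be employed. Note that Scikit-Learn's KNN estimators expect numeric input, so categorical features must first be encoded as numbers.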
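As a small sketch of the categorical case, the example below encodes string categories with `OrdinalEncoder` and then uses `metric="hamming"`, which measures the fraction of features on which two rows differ. The data and labels are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data: (color, size) -> label
X_raw = np.array([
    ["red", "small"], ["red", "medium"], ["red", "large"],
    ["blue", "small"], ["blue", "medium"], ["blue", "large"],
])
y = np.array(["warm", "warm", "warm", "cool", "cool", "cool"])

# Encode each category as a number; with Hamming distance only
# equality between codes matters, not their numeric order
encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X_raw)

model = KNeighborsClassifier(n_neighbors=3, metric="hamming")
model.fit(X_encoded, y)

query = encoder.transform(np.array([["red", "small"]]))
print(model.predict(query))  # ['warm']
```

Ordinal codes combined with Euclidean distance would imply an ordering between categories, which is why a metric based on equality is used here instead.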

### 6.5 Scaling Features

Scaling features is crucial in KNN because features with larger scales can dominate the distance calculation. Scaling the features to a similar range ensures that all features contribute equally to the similarity calculation.

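Scikit-Learn provides various scaling techniques, such as StandardScaler or MinMaxScaler, which can be applied to preprocess the features before training the KNN model. The scaler should be fitted on the training data only and then applied to the test data, so that no information from the test set leaks into preprocessing.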
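One way to combine scaling and KNN is a pipeline, which refits the scaler on each training fold automatically. The sketch below compares cross-validated accuracy with and without `StandardScaler`; the Wine dataset is an assumption chosen because its features have very different ranges.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset whose features live on very different scales
X, y = load_wine(return_X_y=True)

# Same model, with and without standardization
unscaled = KNeighborsClassifier(n_neighbors=5)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

unscaled_score = cross_val_score(unscaled, X, y, cv=5).mean()
scaled_score = cross_val_score(scaled, X, y, cv=5).mean()

print(f"without scaling: {unscaled_score:.3f}")
print(f"with scaling:    {scaled_score:.3f}")
```

Because the scaler lives inside the pipeline, it is fitted only on the training portion of each fold, which avoids leaking test-fold statistics.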

### 6.6 Dealing with Imbalanced Classes

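KNN can be sensitive to imbalanced classes, where one class has significantly more instances than the others: the majority class tends to dominate the neighborhood vote simply because it has more points. Techniques like class weighting, adjusting the decision threshold, or using oversampling or undersampling methods can help address the issue. Note that KNeighborsClassifier does not expose a class_weight parameter, so with Scikit-Learn's implementation, resampling the training set or adjusting the threshold are the more direct options; distance-weighted voting (weights="distance") can also reduce the influence of far-away majority points.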
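Adjusting the decision threshold can be sketched with `predict_proba`, which for KNN returns the fraction of neighbors in each class. The synthetic dataset, the 90/10 class split, and the threshold of 0.3 are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Fraction of the 5 neighbors that belong to the minority class
minority_proba = model.predict_proba(X_test)[:, 1]

# Default-style threshold vs. a lowered (arbitrary) threshold
pred_default = (minority_proba >= 0.5).astype(int)
pred_lowered = (minority_proba >= 0.3).astype(int)

recall_default = recall_score(y_test, pred_default)
recall_lowered = recall_score(y_test, pred_lowered)
print(f"minority recall at 0.5: {recall_default:.3f}")
print(f"minority recall at 0.3: {recall_lowered:.3f}")
```

Lowering the threshold can only keep or increase minority recall, but it usually costs precision, so the threshold is best chosen on validation data with both metrics in view.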

### 6.7 Conclusion

K-Nearest Neighbors (KNN) is a simple yet powerful algorithm for classification and regression tasks. It makes predictions based on the similarity of feature values between data points. Scikit-Learn provides the necessary classes to implement KNN easily. Understanding the concepts, training, and evaluation techniques is crucial for effectively using KNN in practice.

In the next part, we will explore Gradient Boosting methods, a family of ensemble learning algorithms widely used for both classification and regression tasks.

Feel free to practice implementing K-Nearest Neighbors (KNN) using Scikit-Learn. Experiment with different values of k, distance metrics, scaling techniques, and evaluation metrics to gain a deeper understanding of the algorithm and its performance.