# kNN From Scratch: Iris Dataset
> A step-by-step implementation of the k-Nearest Neighbours and Linear Regression algorithms using the standard Python libraries.


A detailed breakdown and explanation of the code and concepts in this notebook can be found in this [post](https://outsiders17711.github.io/Mein.Platz/kNN-Linear-Regression-Iris_Dataset/) on my personal [blog](https://outsiders17711.github.io/Mein.Platz/).

I hope you find it useful.


---

In [None]:
# ---
# importing required libraries
import random
import csv
import math
import statistics
import copy

# set random seed
random.seed('iris dataset')

---

# k-Nearest Neighbours From Scratch

The flowchart for implementing the kNN algorithm is shown below. Each step in the implementation will be wrapped in its own function for clarity.

![kNN Flowchart](https://raw.githubusercontent.com/Outsiders17711/Mein.Platz/main/images/ipynb/knn_flowchart.png)

## DataLoader

The dataset is stored in a `.csv` file. We will implement a function `DataLoader` that calls several child functions to load and clean up the data.

In [None]:
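def _load_csv(filename):
    # newline='' is what the csv module docs recommend when opening files for csv.reader
    with open(filename, 'r', newline='') as file:
        csv_reader = csv.reader(file)
        # skip empty rows (e.g. a trailing blank line at the end of the file)
        return [row for row in csv_reader if row]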
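
As a quick illustration of what `_load_csv` returns, the sketch below runs `csv.reader` over an in-memory string instead of a file. The sample rows and the name `sample_text` are made up for the example. The point is that every field comes back as a string (which is why a cleaning step is needed afterwards) and that the blank line is dropped by the `if row` filter.

```python
import csv
import io

# Hypothetical two-row sample in the same layout as the Iris CSV,
# with a blank line in the middle.
sample_text = "5.1,3.5,1.4,0.2,Iris-setosa\n\n7.0,3.2,4.7,1.4,Iris-versicolor\n"

# csv.reader accepts any iterable of lines, so a StringIO works like a file.
rows = [row for row in csv.reader(io.StringIO(sample_text)) if row]

print(rows[0])    # every field is still a string at this point
print(len(rows))  # 2, the blank line was filtered out
```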


In [None]:
csv.reader??

In [None]:
def _clean_features(dataset):
    num_columns = len(dataset[0])

    for row in dataset:
        for column in range(num_columns-1):
            row[column] = float(row[column].strip())


In [None]:
def _map_classes(dataset):
    class_mappings = {}
    for row in dataset:
        _specie = row[-1]
        if _specie not in class_mappings.keys():
            class_mappings[_specie] = len(class_mappings)
        row[-1] = class_mappings[_specie]

    return class_mappings
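
To make the label encoding concrete, here is a small sketch of the same idea on three made-up rows. Each label receives the next free integer the first time it is seen, so the ids depend on the order of first appearance in the file. The names `toy_rows` and `label` are illustrative only.

```python
# Toy rows: the last element is the class label (sample values are made up).
toy_rows = [
    [5.1, 3.5, "Iris-setosa"],
    [7.0, 3.2, "Iris-versicolor"],
    [4.9, 3.0, "Iris-setosa"],
]

class_mappings = {}
for row in toy_rows:
    label = row[-1]
    # A label gets the next free integer the first time it is seen.
    if label not in class_mappings:
        class_mappings[label] = len(class_mappings)
    row[-1] = class_mappings[label]

print(class_mappings)                  # {'Iris-setosa': 0, 'Iris-versicolor': 1}
print([row[-1] for row in toy_rows])   # [0, 1, 0]
```

Because dictionaries keep insertion order (Python 3.7+), `list(class_mappings.keys())[i]` gives back the label for id `i`.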


In [None]:
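def _normalize_data(dataset):
    # min-max scale each feature column to the range [0, 1], in place
    num_features = len(dataset[0])-1
    for i in range(num_features):
        column_values = [row[i] for row in dataset]
        column_min = min(column_values)
        column_max = max(column_values)
        column_range = column_max - column_min

        for row in dataset:
            # a constant column has zero range; map it to 0.0 to avoid dividing by zero
            row[i] = (row[i] - column_min) / column_range if column_range else 0.0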
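
The normalization used here is min-max scaling, `(x - min) / (max - min)`, applied column by column. A minimal sketch on a single list of made-up values is shown below. `min_max_scale` is a hypothetical helper written for this example, and mapping a constant column to `0.0` is an assumption made here, not something the dataset requires.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1]; hypothetical helper for illustration."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Constant column: the assumption made here is to map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

sepal_lengths = [4.3, 5.8, 7.9]   # made-up sample values
scaled = min_max_scale(sepal_lengths)
print(scaled)   # smallest value maps to 0.0, largest to 1.0
```

Scaling matters for kNN because the distance sums squared differences across features, so a feature with a larger numeric range would otherwise dominate.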


In [None]:
def DataLoader(filename):
    dataset = _load_csv(filename)
    _clean_features(dataset)
    class_mappings = _map_classes(dataset)
    _normalize_data(dataset)

    return dataset, class_mappings


---

## kNN Algorithm

Next, we implement the algorithm itself in a main function `kNN_Algorithm` that calls several child functions.  

In [None]:
def _euclidean_distance(row1, row2):
    distance = 0.0
    num_features = len(row1)-1

    for i in range(num_features):
        distance += (row1[i] - row2[i])**2
    return math.sqrt(distance)
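
As a sanity check on the distance function, the sketch below computes the same quantity for two made-up rows that form a 3-4-5 right triangle and compares it against `math.dist` from the standard library (available from Python 3.8). The compact `euclidean_distance` here is an illustrative rewrite, not a replacement for the cell above.

```python
import math

def euclidean_distance(row1, row2):
    """Distance over all but the last element, which is treated as the class label."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1[:-1], row2[:-1])))

# Made-up rows: two features followed by a class id.
row_a = [0.0, 0.0, 0]
row_b = [3.0, 4.0, 1]

d = euclidean_distance(row_a, row_b)
print(d)  # 5.0, the classic 3-4-5 right triangle

# Cross-check against the standard library (math.dist needs Python 3.8+).
print(math.dist(row_a[:-1], row_b[:-1]))  # 5.0
```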


In [None]:
def _get_k_neighbours(test_row, train_data, num_neighbours):
    test_train_distances = []
    for train_row in train_data:
        _test_train_distance = _euclidean_distance(test_row, train_row)
        test_train_distances.append([train_row, _test_train_distance])

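    # sort by distance (second element of each pair), then keep the closest rows;
    # slicing returns every row if num_neighbours exceeds the training set size
    test_train_distances.sort(key=lambda pair: pair[1])
    return [train_row for train_row, _ in test_train_distances[:num_neighbours]]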
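
`_get_k_neighbours` sorts every distance and then keeps the first `num_neighbours` rows. A possible alternative, sketched below under the assumption that `k` is small relative to the training set, is `heapq.nsmallest`, which selects the closest rows without a full sort. The function name `k_nearest` and the toy data are made up for the example.

```python
import heapq
import math

def k_nearest(test_row, train_data, k):
    """Return the k training rows closest to test_row (last element = class)."""
    def distance(train_row):
        return math.dist(test_row[:-1], train_row[:-1])
    # nsmallest avoids sorting the whole list when k is small.
    return heapq.nsmallest(k, train_data, key=distance)

train = [[0.0, 0.0, 0], [1.0, 1.0, 0], [5.0, 5.0, 1], [6.0, 5.0, 1]]
query = [0.9, 0.8, None]   # class unknown

print(k_nearest(query, train, 2))   # [[1.0, 1.0, 0], [0.0, 0.0, 0]]
```

The result is ordered from nearest to farthest, and asking for more neighbours than there are training rows simply returns all of them.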


In [None]:
def _predict_classification(test_row, train_data, num_neighbours):
    nearest_neighbours =  _get_k_neighbours(test_row, train_data, num_neighbours)
    nearest_classes = [neighbour[-1] for neighbour in nearest_neighbours]
    predicted_class = max(set(nearest_classes), key=nearest_classes.count)

    return predicted_class
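
The prediction is a majority vote over the neighbours' classes. With `max(set(...), key=...count)`, a tie is resolved by set iteration order rather than by which neighbour is closest. The sketch below shows one alternative using `collections.Counter`, where ties go to the class encountered first, which is the nearest neighbour's class if the list is ordered by distance. `majority_vote` is a hypothetical name used only here.

```python
from collections import Counter

def majority_vote(classes):
    """Most frequent class; ties go to the class that appears first in the list."""
    return Counter(classes).most_common(1)[0][0]

print(majority_vote([2, 1, 1]))  # 1
print(majority_vote([2, 1]))     # tie: 2, because it comes first
```

Choosing an odd number of neighbours reduces ties but does not rule them out when there are three classes.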


In [None]:
def kNN_Algorithm(test_data, train_data, num_neighbours):
    return [_predict_classification(test_row, train_data, num_neighbours) for test_row in test_data]


---

## Evaluate kNN Algorithm

Now we can evaluate the performance of the algorithm on the dataset. The evaluation is implemented in two functions, `tts_Evaluate_kNN_Algorithm` (test/train split) and `cvs_Evaluate_kNN_Algorithm` (cross-validation split), which call several child functions to split the dataset into test/train samples and calculate accuracies.

In [None]:
def _test_train_split(dataset, test_ratio):
    _dataset = copy.deepcopy(dataset)
    random.shuffle(_dataset)

    split_index = int(len(dataset) * test_ratio)
    # Testing data
    test_sample = _dataset[0:split_index]
    # Training data
    train_sample = _dataset[split_index:]

    return test_sample, train_sample
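
To see what the split sizes look like, the sketch below applies the same shuffle-and-cut logic to a list of 150 integers, standing in for the usual 150-row Iris file. `split_rows` and its `seed` parameter are hypothetical and exist only for this example. Note that `int()` truncates, so a ratio of 0.25 yields 37 test rows, not 38.

```python
import random

def split_rows(rows, test_ratio, seed=0):
    """Shuffle a copy of rows and cut it at int(len(rows) * test_ratio)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # local RNG; the seed value is arbitrary
    cut = int(len(rows) * test_ratio)
    return shuffled[:cut], shuffled[cut:]   # (test, train)

rows = list(range(150))          # stand-in for the 150 Iris rows
test, train = split_rows(rows, 0.25)
print(len(test), len(train))     # 37 113
```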


In [None]:
def _cross_validation_split(dataset, num_groups):
    dataset_groups = []
    _dataset = copy.deepcopy(dataset)
    group_size = int(len(_dataset) / num_groups)

    for i in range(num_groups):
        group = []
        while len(group) < group_size:
            idx = random.randrange(len(_dataset))
            group.append(_dataset.pop(idx))
        dataset_groups.append(group)

    return dataset_groups
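
Because the group size is the integer part of `len(dataset) / num_groups`, any remainder rows are never assigned to a group. The sketch below shows this with a shuffle-and-slice variant that produces the same kind of random partition. `make_folds`, its `seed` parameter, and the 152-row stand-in list are assumptions made for the example.

```python
import random

def make_folds(rows, num_groups, seed=0):
    """Shuffle a copy of rows and slice it into num_groups equal-sized folds."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    size = len(shuffled) // num_groups
    # Rows beyond size * num_groups are left out, as in the sampling loop above.
    return [shuffled[i * size:(i + 1) * size] for i in range(num_groups)]

folds = make_folds(list(range(152)), 5)
print([len(fold) for fold in folds])            # [30, 30, 30, 30, 30]
print(152 - sum(len(fold) for fold in folds))   # 2 rows unused
```

With 150 rows and 5 groups the division is exact, so nothing is dropped for the Iris dataset.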


In [None]:
def _get_accuracy(test_sample, algorithm_predictions, class_mappings):
    test_classes = [row[-1] for row in test_sample]
    num_test_classes = len(test_classes)
    test_labels = list(class_mappings.keys())

    if len(test_classes) != len(algorithm_predictions):
        raise IndexError("The count of test classes is not equal to the count of algorithm predictions!")

    num_correct_predictions = sum([actual == predicted for actual, predicted 
                                                        in zip(test_classes, algorithm_predictions)])

    wrong_predictions = [f'A:{test_labels[actual]} | P:{test_labels[predicted]}'
                                                            for actual, predicted 
                                                            in zip(test_classes, algorithm_predictions)
                                                            if actual != predicted]
                        
    accuracy = (num_correct_predictions / num_test_classes) * 100
    return accuracy, wrong_predictions
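
A small worked example of the accuracy calculation: with five made-up test rows and one wrong prediction, the accuracy is 4 / 5 * 100 = 80%. The class ids and predictions below are invented for illustration, and the label order assumes the three Iris species were first seen in this order.

```python
actual    = [0, 1, 2, 2, 1]
predicted = [0, 2, 2, 2, 1]
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Booleans count as 0/1, so summing the comparisons counts the correct predictions.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual) * 100
wrong = [f"A:{labels[a]} | P:{labels[p]}" for a, p in zip(actual, predicted) if a != p]

print(accuracy)  # 80.0
print(wrong)     # ['A:Iris-versicolor | P:Iris-virginica']
```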


In [None]:
def tts_Evaluate_kNN_Algorithm(dataset, class_mappings, test_ratio=0.25, 
                                                                num_neighbours=3, num_iterations=100):
    
    ACCURACY_HISTORY = []
    WRONG_PREDICTION_HISTORY = []

    for _iter in range(num_iterations):
        _dataset = copy.deepcopy(dataset)
        test_sample, train_sample = _test_train_split(_dataset, test_ratio)

        algorithm_predictions = kNN_Algorithm(test_sample, train_sample, num_neighbours)
        accuracy, wrong_predictions = _get_accuracy(test_sample, algorithm_predictions, class_mappings)
        ACCURACY_HISTORY.append(accuracy)
        WRONG_PREDICTION_HISTORY.extend(wrong_predictions)

    random.shuffle(WRONG_PREDICTION_HISTORY)
    print('kNN algorithm evaluation using the Test/Train Split method:', '\n\t', 
                'Average Accuracy:', round(statistics.mean(ACCURACY_HISTORY), ndigits=4), '\n\t', 
                'Maximum Accuracy:', max(ACCURACY_HISTORY), '\n')

    print('A: Actual | P: Predicted')
    print('\n'.join(WRONG_PREDICTION_HISTORY[:20]))


In [None]:
def cvs_Evaluate_kNN_Algorithm(dataset, class_mappings, num_groups=5, 
                                                                num_neighbours=3, num_iterations=100):
    
    ACCURACY_HISTORY = []
    WRONG_PREDICTION_HISTORY = []

    for _iter in range(num_iterations):
        _dataset = copy.deepcopy(dataset)
        dataset_groups = _cross_validation_split(_dataset, num_groups)

        for idx, group in enumerate(dataset_groups):
            test_sample = group
            _train_sample = copy.deepcopy(dataset_groups)
            del _train_sample[idx]
            
            train_sample = []
            for train_group in _train_sample:
                train_sample.extend(train_group)

            algorithm_predictions = kNN_Algorithm(test_sample, train_sample, num_neighbours)
            accuracy, wrong_predictions = _get_accuracy(test_sample, algorithm_predictions, class_mappings)
            ACCURACY_HISTORY.append(accuracy)
            WRONG_PREDICTION_HISTORY.extend(wrong_predictions)

    random.shuffle(WRONG_PREDICTION_HISTORY)
    print('kNN algorithm evaluation using the Cross Validation Split method:', '\n\t', 
                'Average Accuracy:', round(statistics.mean(ACCURACY_HISTORY), ndigits=4), '\n\t', 
                'Maximum Accuracy:', max(ACCURACY_HISTORY), '\n')

    print('A: Actual | P: Predicted')
    print('\n'.join(WRONG_PREDICTION_HISTORY[:20]))


---

### Evaluate kNN Algorithm: Using Test-Train Split Method 

In [None]:
dataset, class_mappings = DataLoader("../input/iris-dataset/iris.data.csv")
tts_Evaluate_kNN_Algorithm(dataset, class_mappings)


---

### Evaluate kNN Algorithm: Using Cross-Validation Split Method 

In [None]:
dataset, class_mappings = DataLoader("../input/iris-dataset/iris.data.csv")
cvs_Evaluate_kNN_Algorithm(dataset, class_mappings)


---

## Resources & References

-  [Develop k-Nearest Neighbors in Python From Scratch - Machine Learning Mastery](https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/)

-  [K Nearest Neighbors Algorithm using Python From Absolute Scratch - The Nerdy Dev](https://www.youtube.com/watch?v=uclqpQe8TMQ)