#                  <font color=red>       BBM409 : Introduction to Machine Learning Lab - Assignment 1</font>

<img src="logo.png" width=300 height=200 />

## <font color=green><center>Berra Nur SARI - 21727671 <br> Melih SUNMAN - 21827809</center></font>

#   <font color=blue>PART 1: Glass Material Classification</font>

## Abstract :

In the first part of this project, the k-nearest neighbors (k-NN) algorithm is applied to the glass dataset. Accuracy is measured with 5-fold cross validation for k = 1, 3, 5, 7, 9, using four model variants: k-NN and weighted k-NN, each with and without feature normalization.

## Code:

Required libraries are imported

In [72]:
from math import sqrt
import numpy as np
import pandas as pd
from collections import Counter
import time

KNN Classification Class  

1. "fit(self, x, y)" function: Takes the training features and their labels as parameters and stores them.

2. "predict(self, test_validation)" function: Calls _predict(self, x) for each sample to be tested, stores the predictions in an array, and returns that array.

3. "predict_weighted(self, test_validation, hyper_parameter)" function: Calls _predict_weighted(self, x, hyper_parameter) for each sample to be tested, stores the predictions in an array, and returns that array.

4. "euclidean_distance(self, row1, row2)" function: Calculates the Euclidean distance between two given vectors.

5. "manhattan_distance(self, row1, row2)" function: Calculates the Manhattan distance between two given vectors.

6. "_predict(self, x)" function: Builds the distances list, which holds the distance from the given sample to every sample in the training set. Then creates the k_indices array, which holds the indices of the k closest training samples, and the k_nearest_labels list, which holds their labels. Finally, most_common determines which label occurs most often among them, and that label is returned.

7. "_predict_weighted(self, x, hyper_parameter)" function: Builds distances, k_indices and k_nearest_labels as in _predict, and also creates the k_distances list, which holds the distances of the k closest training samples. A dictionary then accumulates a weight for each label. The hyper parameter exists to avoid distances of 0.0: if a distance is zero, 1/d cannot be computed, and the closest neighbor of all could not be used. For this reason, the hyper parameter is added to each of the k distances. This way the closest neighbor still has the shortest distance and receives the greatest weight through the 1/d expression. After the hyper parameter is added, the weights 1/d are summed per label in the dictionary, and the label with the greatest total weight is returned.
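
The inverse-distance vote described in item 7 can be illustrated with a small standalone sketch. The function name `weighted_vote` and the parameter name `offset` are hypothetical and not part of the class below; `offset` plays the role of the hyper parameter, and the labels and distances are toy values.

```python
from collections import defaultdict

def weighted_vote(labels, distances, offset=0.1):
    """Return the label whose neighbors have the largest total 1 / (d + offset)."""
    weights = defaultdict(float)
    for label, d in zip(labels, distances):
        # the offset keeps the weight finite when d == 0
        weights[label] += 1.0 / (d + offset)
    return max(weights, key=weights.get)

# One exact match of class 1 against two farther neighbors of class 2.
winner = weighted_vote(labels=[1, 2, 2], distances=[0.0, 0.9, 1.2])
print(winner)
```

A plain majority vote over the same three neighbors would pick class 2; with 1/d weights the single very close neighbor decides the outcome.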

In [73]:
class KNN_Classification:
    def __init__(self, k=3):
        self.k = k

    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    def predict(self, test_validation):
        predicted_labels = [self._predict(x) for x in test_validation]
        return np.array(predicted_labels)

    def predict_weighted(self, test_validation , hyper_parameter):
        predicted_labels = [self._predict_weighted(x , hyper_parameter) for x in test_validation]
        return np.array(predicted_labels)

    def euclidean_distance(self, row1, row2):
        distance = 0.0
        for i in range(len(row1) - 1):
            distance += (row1[i] - row2[i]) ** 2
        return sqrt(distance)

    def manhattan_distance(self, row1, row2):
        distance = 0.0
        for i in range(len(row1) - 1):
            distance += abs(row1[i] - row2[i])
        return distance

    def _predict(self, x):
        distances = [self.euclidean_distance(x, x_train) for x_train in self.x_train]
        #distances = [self.manhattan_distance(x, x_train) for x_train in self.x_train]

        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        most_common = Counter(k_nearest_labels).most_common(1) 
        return most_common[0][0]

    def _predict_weighted(self, x , hyper_parameter):
        distances = [self.euclidean_distance(x, x_train) for x_train in self.x_train]
        # distances = [self.manhattan_distance(x, x_train) for x_train in self.x_train]

        k_indices = np.argsort(distances)[:self.k]
        k_distances = [distances[i] for i in k_indices]
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        # total weight per label; named so it does not shadow the built-in dict
        label_weights = {}

        # add a small offset to every distance to avoid division by zero
        for i in range(len(k_distances)):
            k_distances[i] = k_distances[i] + hyper_parameter

        # each neighbor votes for its label with weight 1/d
        for i in range(len(k_nearest_labels)):
            label = k_nearest_labels[i]
            label_weights[label] = label_weights.get(label, 0) + 1 / k_distances[i]

        return max(label_weights, key=label_weights.get)


In [74]:
# reading csv file with pandas
df = pd.read_csv('glass.csv')

In [75]:
# convert the DataFrame to a NumPy array
df = df.to_numpy()

In [76]:
#shuffle
np.random.shuffle(df)

In [77]:
# parse the data as features and labels
features = df[:, :-1]
labels = df[:, -1]

Min-max normalization is applied to re-scale each feature (attribute column of the data) to the 0-1 range. Each column is handled separately: its minimum and maximum are found, and every value x in the column is replaced with (x - min) / (max - min).
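
As an illustration of the same rescaling, here is a vectorized sketch using NumPy. The function name `min_max_normalize` is hypothetical, and the handling of constant columns (mapped to 0 rather than dividing by zero) is an assumption that the original function does not make.

```python
import numpy as np

def min_max_normalize(features):
    """Return a copy of `features` with every column rescaled to [0, 1]."""
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    # Assumption: a constant column has range 0; use 1 so it maps to 0.
    col_range[col_range == 0] = 1.0
    return (features - col_min) / col_range

# Toy data: two features with very different ranges.
toy = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])
scaled = min_max_normalize(toy)
print(scaled)
```

Unlike an in-place loop, this version returns a new array, so the unnormalized features stay available for comparison.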


In [78]:
def normalization(df, features):
    for feature in range(df.shape[1] - 1):
        minvalue = features[:, feature].min()
        maxvalue = features[:, feature].max()
        for i in range(len(df)):
            features[i, feature] = (features[i, feature] - minvalue) / (maxvalue - minvalue)


In [79]:
# creating normalized features
normalized_features = features.copy()
normalization(df, normalized_features)

In [80]:
dfSize = len(df) 
numberOfCrossSize = int(len(df) / 5)

For each k value (1, 3, 5, 7, 9), 5-fold cross validation is performed separately and the results are printed. First, the validation and training sets are determined. Then the "unnormalized and unweighted k-NN", "normalized and unweighted k-NN", "unnormalized and weighted k-NN" and "normalized and weighted k-NN" models are run on the training data, in that order, and their accuracies are computed. The accuracy of each fold and the average over the five folds are printed. <br><br>
For each model, the model is first created with the k value, then given the training data, and finally asked to predict the validation samples. Misclassified samples are collected in a dictionary that maps each true label to the labels predicted for it; it is printed after each prediction under the name "Missing values".
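
The fold construction described above can also be sketched compactly with `np.array_split`. This is an illustrative alternative, not the code used in the cell below; `k_fold_indices` is a hypothetical helper and the sample count of 23 is a toy value.

```python
import numpy as np

def k_fold_indices(n_samples, n_folds=5):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i, val_idx in enumerate(folds):
        # training indices are all folds except the current validation fold
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(23))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Because the folds come from a single partition of the index range, the training and validation indices of a fold never overlap, and fold sizes differ by at most one when the sample count is not divisible by five.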

In [81]:
k_value_list = [1, 3, 5, 7, 9]
average_accuracy = 0
average_accuracy_normalization = 0
average_accuracy_weighted = 0
average_accuracy_weighted_normalization = 0
for k in k_value_list:
    for i in range(5):  # 5-fold cross validation
        validate = int(dfSize * .2 * i)
        if i == 4:
            x_validate = features[validate:, :]
            x_normalized_validate = normalized_features[validate:, :]
            y_validate = labels[validate:]
        else:
            x_validate = features[validate:validate + numberOfCrossSize, :]
            x_normalized_validate = normalized_features[validate:validate + numberOfCrossSize, :]
            y_validate = labels[validate:validate + numberOfCrossSize]
        x_train = features[validate + numberOfCrossSize:, :]
        x_normalized_train = normalized_features[validate + numberOfCrossSize:, :]
        y_train = labels[validate + numberOfCrossSize:]
        if i != 0:
            x_train2 = features[:validate, :]
            x_normalized_train2 = normalized_features[:validate, :]
            y_train2 = labels[:validate]

            x_train = np.concatenate((x_train, x_train2), axis=0)
            x_normalized_train = np.concatenate((x_normalized_train, x_normalized_train2), axis=0)
            y_train = np.concatenate((y_train, y_train2), axis=0)


        # predict their classes using kNN without feature normalization
        start = time.time()
        model = KNN_Classification(k)
        model.fit(x_train, y_train)
        predictions = model.predict(x_validate)
        end = time.time()

        accuracy = (np.sum(predictions == y_validate) / len(y_validate)) * 100
        average_accuracy += accuracy
        print("accuracy without normalization for cross validation = ", i, " -> ", accuracy)
        print("Computation time:", end - start)

        dictForMissingValues = {}
        for count in range(len(predictions)):
            if predictions[count] != y_validate[count]:
                if y_validate[count] not in dictForMissingValues.keys():
                    dictForMissingValues[y_validate[count]] = list()
                    dictForMissingValues[y_validate[count]].append(predictions[count])
                else:
                    dictForMissingValues[y_validate[count]].append(predictions[count])

        print("Missing values: ", dictForMissingValues)
        print()

        # predict their classes using kNN with feature normalization
        start = time.time()
        model = KNN_Classification(k)
        model.fit(x_normalized_train, y_train)
        predictions = model.predict(x_normalized_validate)
        end = time.time()

        accuracy = (np.sum(predictions == y_validate) / len(y_validate)) * 100
        average_accuracy_normalization += accuracy
        print("accuracy with normalization for cross validation = ", i, " -> ", accuracy)
        print("Computation time:", end - start)

        dictForMissingValues = {}
        for count in range(len(predictions)):
            if predictions[count] != y_validate[count]:
                if y_validate[count] not in dictForMissingValues.keys():
                    dictForMissingValues[y_validate[count]] = list()
                    dictForMissingValues[y_validate[count]].append(predictions[count])
                else:
                    dictForMissingValues[y_validate[count]].append(predictions[count])


        print("Missing values: ", dictForMissingValues)
        print()

        # predict their classes using weighted kNN without feature normalization
        start = time.time()
        model = KNN_Classification(k)
        model.fit(x_train, y_train)
        predictions = model.predict_weighted(x_validate, 0.1)
        end = time.time()

        accuracy = (np.sum(predictions == y_validate) / len(y_validate)) * 100
        average_accuracy_weighted += accuracy
        print("weighted accuracy without normalization for cross validation = ", i, " -> ", accuracy)
        print("Computation time:", end - start)

        dictForMissingValues = {}
        for count in range(len(predictions)):
            if predictions[count] != y_validate[count]:
                if y_validate[count] not in dictForMissingValues.keys():
                    dictForMissingValues[y_validate[count]] = list()
                    dictForMissingValues[y_validate[count]].append(predictions[count])
                else:
                    dictForMissingValues[y_validate[count]].append(predictions[count])


        print("Missing values: ", dictForMissingValues)
        print()


        # predict their classes using weighted kNN with feature normalization
        start = time.time()
        model = KNN_Classification(k)
        model.fit(x_normalized_train, y_train)
        predictions = model.predict_weighted(x_normalized_validate, 0.1)
        end = time.time()

        accuracy = (np.sum(predictions == y_validate) / len(y_validate)) * 100
        average_accuracy_weighted_normalization += accuracy
        print("weighted accuracy with normalization for cross validation = ", i, " -> ", accuracy)
        print("Computation time:", end - start)

        dictForMissingValues = {}
        for count in range(len(predictions)):
            if predictions[count] != y_validate[count]:
                if y_validate[count] not in dictForMissingValues.keys():
                    dictForMissingValues[y_validate[count]] = list()
                    dictForMissingValues[y_validate[count]].append(predictions[count])
                else:
                    dictForMissingValues[y_validate[count]].append(predictions[count])


        print("Missing values: ", dictForMissingValues)
        print()


        if i == 4:
            print("for k = " , k)
            print("average_accuracy  ",  average_accuracy/ 5)
            average_accuracy = 0
            print("average_accuracy with normalization  ", average_accuracy_normalization / 5)
            average_accuracy_normalization = 0
            print("average_accuracy_weighted without normalization  ", average_accuracy_weighted / 5)
            average_accuracy_weighted = 0
            print("average_accuracy_weighted with normalization  ", average_accuracy_weighted_normalization / 5)
            average_accuracy_weighted_normalization = 0
            print()
            print("\n---------------------------------------------------------------------------------------- \n")

accuracy without normalization for cross validation =  0  ->  66.66666666666666
Computation time: 0.0359959602355957
Missing values:  {1.0: [3.0], 3.0: [1.0, 2.0, 1.0, 2.0], 2.0: [6.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0], 6.0: [7.0]}

accuracy with normalization for cross validation =  0  ->  69.04761904761905
Computation time: 0.03599739074707031
Missing values:  {1.0: [3.0], 3.0: [1.0, 2.0, 1.0, 2.0], 2.0: [6.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0], 6.0: [7.0]}

weighted accuracy without normalization for cross validation =  0  ->  66.66666666666666
Computation time: 0.03601884841918945
Missing values:  {1.0: [3.0], 3.0: [1.0, 2.0, 1.0, 2.0], 2.0: [6.0, 1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0], 6.0: [7.0]}

weighted accuracy with normalization for cross validation =  0  ->  69.04761904761905
Computation time: 0.03500771522521973
Missing values:  {1.0: [3.0], 3.0: [1.0, 2.0, 1.0, 2.0], 2.0: [6.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0], 6.0: [7.0]}

accuracy without normalization for cross validation =  1  -

weighted accuracy without normalization for cross validation =  3  ->  80.95238095238095
Computation time: 0.037009239196777344
Missing values:  {2.0: [1.0, 1.0, 1.0], 1.0: [7.0, 2.0], 6.0: [7.0], 3.0: [1.0], 7.0: [6.0]}

weighted accuracy with normalization for cross validation =  3  ->  85.71428571428571
Computation time: 0.036000967025756836
Missing values:  {2.0: [3.0, 1.0, 1.0], 3.0: [1.0], 7.0: [6.0], 1.0: [2.0]}

accuracy without normalization for cross validation =  4  ->  62.7906976744186
Computation time: 0.04602336883544922
Missing values:  {3.0: [1.0], 1.0: [2.0, 2.0, 3.0, 2.0], 2.0: [1.0, 1.0, 5.0, 1.0, 1.0, 3.0], 6.0: [2.0], 7.0: [1.0, 2.0, 2.0], 5.0: [2.0]}

accuracy with normalization for cross validation =  4  ->  67.44186046511628
Computation time: 0.03700828552246094
Missing values:  {2.0: [1.0, 1.0, 5.0, 1.0, 7.0, 3.0], 6.0: [5.0], 1.0: [2.0, 3.0, 3.0, 2.0], 7.0: [2.0, 2.0], 5.0: [2.0]}

weighted accuracy without normalization for cross validation =  4  ->  62.79069

accuracy without normalization for cross validation =  3  ->  78.57142857142857
Computation time: 0.036020755767822266
Missing values:  {5.0: [7.0], 2.0: [1.0, 1.0, 1.0], 6.0: [7.0], 3.0: [1.0], 7.0: [6.0], 1.0: [2.0, 2.0]}

accuracy with normalization for cross validation =  3  ->  83.33333333333334
Computation time: 0.036995887756347656
Missing values:  {5.0: [7.0], 2.0: [1.0, 1.0], 6.0: [7.0], 3.0: [1.0], 7.0: [6.0], 1.0: [2.0]}

weighted accuracy without normalization for cross validation =  3  ->  80.95238095238095
Computation time: 0.036020755767822266
Missing values:  {2.0: [1.0, 1.0, 1.0], 6.0: [7.0], 3.0: [1.0], 7.0: [6.0], 1.0: [2.0, 2.0]}

weighted accuracy with normalization for cross validation =  3  ->  85.71428571428571
Computation time: 0.035996198654174805
Missing values:  {2.0: [1.0, 1.0], 6.0: [7.0], 3.0: [1.0], 7.0: [6.0], 1.0: [2.0]}

accuracy without normalization for cross validation =  4  ->  60.46511627906976
Computation time: 0.03801417350769043
Missing values

## Error Analysis for Classification

<img src="Part1.png" />

#### Find a few misclassified samples and comment on why you think they were hard to classify.

- Every misclassified sample is printed above (under the "Missing values" label) as a dictionary that maps each true label to the labels predicted for it. The errors follow a pattern. Take label 2 as an example: it is mostly confused with 1 and sometimes with 5. A likely reason is that samples labeled 2 lie close to samples labeled 1 and 5 in feature space. The confusion also runs in the other direction: samples labeled 1 are often predicted as 2, which supports the view that these classes are close to each other. <br> The choice of k matters as well. If k is larger than the optimum, the vote includes distant neighbors and the prediction drifts toward the most frequent classes (underfitting). If k is smaller than the optimum, the prediction depends on only a few neighbors and becomes sensitive to noisy or atypical samples (overfitting). Also, k-NN is a lazy learner: it does not learn from the training set in advance but stores it and does all the work at classification time. For this reason, if the training set is not well shuffled or contains too few samples of the class to be predicted, the prediction may be wrong. <br>
In addition, misclassifications may be due to the lack of normalization. Without normalization, features with larger numeric ranges affect the distance more than the others and can mislead the model.
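
To find the most frequent confusions without reading the printed dictionaries by eye, the (true label, predicted label) pairs can be counted. The sketch below is illustrative only: `confusion_counts` is a hypothetical helper, and the labels are toy values, not taken from the runs above.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (true_label, predicted_label) pairs for misclassified samples only."""
    return Counter((a, p) for a, p in zip(actual, predicted) if a != p)

# Toy labels, not results from the experiment.
actual = [1, 2, 2, 2, 3, 7]
predicted = [1, 1, 1, 5, 3, 6]
errors = confusion_counts(actual, predicted)
print(errors.most_common(1))
```

Sorting the pairs by count makes it easy to see which class pairs overlap most.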

#### Compare performance of different feature normalization choices and investigate the effect of important system parameters (number of training samples used, k in k-NN, etc.). Wherever relevant, feel free to discuss computation time in addition to classification rate.

- In the table, comparing weighted and unweighted results at the same k and the same normalization setting, weighted k-NN gives more accurate results in most cases. This is because weighted k-NN takes the distance of each neighbor into account instead of counting all k neighbors equally. Since the weight is 1/d, it decreases as the distance increases, so the nearest neighbor has the greatest influence on the prediction and the farthest has the least.
<br>
- A dataset can contain outliers: extreme values that fall outside the expected range and differ from the rest of the data. In addition, the ranges of some features can differ greatly from the ranges of others. A feature with a wide range then has more effect on the distance, which may result in incorrect predictions. Normalization addresses the difference in ranges. Here each feature is rescaled to the 0-1 interval by subtracting its minimum value and then dividing by its range (maximum minus minimum). In the table, comparing the average accuracies, the normalized features give better results in all cases. The reason is that the features of this dataset have very different ranges and the dataset contains outliers; putting all features on a fixed interval lets them contribute comparably to the distance.
<br>
- Judging by the table, the best k value for this problem is 1. The accuracy does not change monotonically with k: it decreases at k = 3, increases at k = 5, and then decreases again for larger k values.
<br>
- On the other hand, increasing k affects the computation time. As k increased, and when weighting was applied as well, the prediction time of the model became longer. This is due to the additional processing, but the times are in the range of milliseconds and the differences are quite small.
<br>
- In short, normalizing the data increased the accuracy, and weighted k-NN was more accurate than plain k-NN. Based on these results, 1 is the best k value. For making predictions, the normalized, weighted k-NN model with k = 1 should be preferred.

#   <font color=blue>PART 2: Concrete Material Strength Estimation from Data</font>

## Abstract :

In the second part of this project, the k-NN algorithm is used for regression on the concrete material strength dataset. The mean absolute error (MAE) is measured with 5-fold cross validation for k = 1, 3, 5, 7, 9, using k-NN and weighted k-NN models, each with and without feature normalization.

## Code:

Required libraries are imported

In [82]:
import time  # used below to measure prediction time
import numpy as np
import pandas as pd
from math import sqrt

KNN Regression Class  

1. "fit(self, x, y)" function: Takes the training features and their target values as parameters and stores them.

2. "predict(self, test_validation)" function: Calls _predict(self, x) for each sample to be tested, stores the predictions in an array, and returns that array.

3. "predict_weighted(self, test_validation, hyper_parameter)" function: Calls _predict_weighted(self, x, hyper_parameter) for each sample to be tested, stores the predictions in an array, and returns that array.

4. "euclidean_distance(self, row1, row2)" function: Calculates the Euclidean distance between two given vectors.

5. "manhattan_distance(self, row1, row2)" function: Calculates the Manhattan distance between two given vectors.

6. "_predict(self, x)" function: Builds the distances list, which holds the distance from the given sample to every sample in the training set. Then creates the k_indices array, which holds the indices of the k closest training samples, and the k_nearest_labels list, which holds their target values. The values of the nearest neighbors are summed and divided by their count, that is, they are averaged. The average is returned.

7. "_predict_weighted(self, x, hyper_parameter)" function: Builds distances, k_indices and k_nearest_labels as in _predict, and also creates the k_distances list, which holds the distances of the k closest training samples. A dictionary then accumulates a weight for each target value. The hyper parameter exists to avoid distances of 0.0: if a distance is zero, 1/d cannot be computed, and the closest neighbor of all could not be used. For this reason, the hyper parameter is added to each of the k distances. This way the closest neighbor still has the shortest distance and receives the greatest weight through the 1/d expression. After the hyper parameter is added, the weights 1/d are summed per target value in the dictionary. Then each target value (a key of the dictionary) is multiplied by its weight and accumulated in the variable val, while the weights alone are accumulated in the variable divided. The result val/divided, which is the weighted average, is returned.
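
The weighted average described in item 7 amounts to sum(w * y) / sum(w) with w = 1 / (d + hyper parameter). A minimal standalone sketch follows; the function name `weighted_average` and the parameter name `offset` are hypothetical, and the values are toy numbers rather than data from the concrete dataset.

```python
import numpy as np

def weighted_average(values, distances, offset=0.1):
    """Average `values` with weights 1 / (distance + offset)."""
    weights = 1.0 / (np.asarray(distances, dtype=float) + offset)
    return float(np.average(values, weights=weights))

# An exact match with target 30 and a farther neighbor with target 50.
estimate = weighted_average(values=[30.0, 50.0], distances=[0.0, 0.9])
print(estimate)
```

Here the weights are 10 and 1, so the estimate is 350 / 11 and stays close to the target of the exact match, whereas the unweighted mean would be 40.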

In [83]:
class KNN_Regression:
    def __init__(self, k=3):
        self.k = k

    def fit(self, x, y):
        self.x_train = x
        self.y_train = y

    def predict(self, test_validation):
        predicted_labels = [self._predict(x) for x in test_validation]
        return np.array(predicted_labels)

    def predict_weighted(self, test_validation , hyper_parameter):
        predicted_labels = [self._predict_weighted(x, hyper_parameter) for x in test_validation]
        return np.array(predicted_labels)

    def euclidean_distance(self, row1, row2):
        distance = 0.0
        for i in range(len(row1) - 1):
            distance += (row1[i] - row2[i]) ** 2
        return sqrt(distance)

    def manhattan_distance(self, row1, row2):
        distance = 0.0
        for i in range(len(row1) - 1):
            distance += abs(row1[i] - row2[i])
        return distance


    def _predict(self, x):
        distances = [self.euclidean_distance(x, x_train) for x_train in self.x_train]
        # distances = [self.manhattan_distance(x, x_train) for x_train in self.x_train]

        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        average_value = 0
        for i in k_nearest_labels:
            average_value += i
        average_value = average_value / len(k_nearest_labels)
        return average_value

    def _predict_weighted(self, x, hyper_parameter):
        distances = [self.euclidean_distance(x, x_train) for x_train in self.x_train]
        # distances = [self.manhattan_distance(x, x_train) for x_train in self.x_train]

        k_indices = np.argsort(distances)[:self.k]
        k_distances = [distances[i] for i in k_indices]
        k_nearest_labels = [self.y_train[i] for i in k_indices]

        # total weight per target value; named so it does not shadow the built-in dict
        value_weights = {}

        # add a small offset to every distance to avoid division by zero
        for i in range(len(k_distances)):
            k_distances[i] = k_distances[i] + hyper_parameter

        # each neighbor contributes weight 1/d to its target value
        for i in range(len(k_nearest_labels)):
            value = k_nearest_labels[i]
            value_weights[value] = value_weights.get(value, 0) + 1 / k_distances[i]
        val = 0
        divided = 0

        # weighted average: sum(w * y) / sum(w)
        for value in value_weights:
            val += value * value_weights[value]
            divided += value_weights[value]

        return val / divided


In [84]:
# reading csv file with pandas
df = pd.read_csv('Concrete_Data_Yeh.csv')

# convert the DataFrame to a NumPy array
df = df.to_numpy()

#shuffle
np.random.shuffle(df)

# parse the data as features and labels
features = df[:, :-1]
labels = df[:, -1]

Min-max normalization is applied to re-scale each feature (attribute column of the data) to the 0-1 range. Each column is handled separately: its minimum and maximum are found, and every value x in the column is replaced with (x - min) / (max - min).

In [85]:
def normalization(df, features):
    for feature in range(df.shape[1] - 1):
        minvalue = features[:, feature].min()
        maxvalue = features[:, feature].max()
        for i in range(len(df)):
            features[i, feature] = (features[i, feature] - minvalue) / (maxvalue - minvalue)

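The mean absolute error (MAE) is the average of the absolute differences between the actual and predicted values. It is expressed in the same units as the target, so it is a direct measure of the typical size of the prediction error.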
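
For reference, the same metric can be written in vectorized form with NumPy. This is only an illustrative sketch: `mean_absolute_error` is a hypothetical name, and the numbers are toy values.

```python
import numpy as np

def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(actual - predicted)))

# Absolute errors are 2, 2 and 3, so the MAE is 7 / 3.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(mae)
```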

In [86]:
def meanAbsoluteError(actual, predicted):
    mae = 0
    for i in range(len(actual)):
        mae += abs(actual[i]-predicted[i])
    return mae/len(actual)

In [87]:
# creating normalized features
normalized_features = features.copy()
normalization(df, normalized_features)

In [88]:
dfSize = len(df)  
numberOfCrossSize = int(len(df) / 5)  

For each k value (1, 3, 5, 7, 9), 5-fold cross validation is performed separately and the results are printed. First, the validation and training sets are determined. Then the "unnormalized and unweighted k-NN", "normalized and unweighted k-NN", "unnormalized and weighted k-NN" and "normalized and weighted k-NN" models are run on the training data, in that order, and their MAE values are computed. The MAE of each fold and the average over the five folds are printed.

For each model, the model is first created with the k value, then given the training data, and finally asked to predict the validation samples.

In [None]:
k_value_list = [1,3,5,7,9]
average_mae = 0
average_mae_normalization = 0
average_mae_weighted = 0
average_mae_weighted_normalization = 0
for k in k_value_list:
    for i in range(5):  # 5-fold cross validation
        validate = int(dfSize * .2 * i)
        if i == 4:
            x_validate = features[validate:, :]
            x_normalized_validate = normalized_features[validate:, :]
            y_validate = labels[validate:]
        else:
            x_validate = features[validate:validate + numberOfCrossSize, :]
            x_normalized_validate = normalized_features[validate:validate + numberOfCrossSize, :]
            y_validate = labels[validate:validate + numberOfCrossSize]
        x_train = features[validate + numberOfCrossSize:, :]
        x_normalized_train = normalized_features[validate + numberOfCrossSize:, :]
        y_train = labels[validate + numberOfCrossSize:]
        if i != 0:
            x_train2 = features[:validate, :]
            x_normalized_train2 = normalized_features[:validate, :]
            y_train2 = labels[:validate]

            x_train = np.concatenate((x_train, x_train2), axis=0)
            x_normalized_train = np.concatenate((x_normalized_train, x_normalized_train2), axis=0)
            y_train = np.concatenate((y_train, y_train2), axis=0)

        # predict their continuous values using kNN without feature normalization
        start = time.time()
        model = KNN_Regression(k)
        model.fit(x_train, y_train)
        predictions = model.predict(x_validate)
        end = time.time()

        mae = meanAbsoluteError(y_validate, predictions)
        print("mae without normalization for cross validation = ", i + 1, " -> ", mae)
        print("Computation time:", end - start)
        average_mae += mae


        # predict their continuous values using kNN with feature normalization
        start = time.time()
        model = KNN_Regression(k)
        model.fit(x_normalized_train, y_train)
        predictions = model.predict(x_normalized_validate)
        end = time.time()

        mae = meanAbsoluteError(y_validate, predictions)
        print("mae with normalization for cross validation = ", i+1, " -> ", mae)
        print("Computation time:", end - start)
        average_mae_normalization += mae


        # predict their continuous values using weighted kNN without feature normalization
        start = time.time()
        model = KNN_Regression(k)
        model.fit(x_train, y_train)
        predictions = model.predict_weighted(x_validate, 0.1)
        end = time.time()

        mae = meanAbsoluteError(y_validate, predictions)
        print("weighted mae without normalization for cross validation = ", i + 1, " -> ", mae)
        print("Computation time:", end - start)
        average_mae_weighted += mae
      

        # predict their continuous values using weighted kNN with feature normalization
        start = time.time()
        model = KNN_Regression(k)
        model.fit(x_normalized_train, y_train)
        predictions = model.predict_weighted(x_normalized_validate, 0.1)
        end = time.time()

        mae = meanAbsoluteError(y_validate, predictions)
        print("weighted mae with normalization for cross validation = ", i + 1, " -> ", mae)
        print("Computation time:", end - start)
        average_mae_weighted_normalization += mae


        if i == 4:
            print("for k = ", k)
            print("average_mae ", average_mae / 5)
            average_mae = 0

            print("average_mae with normalization ", average_mae_normalization / 5)
            average_mae_normalization = 0

            print("weighted average_mae without normalization ", average_mae_weighted / 5)
            average_mae_weighted = 0

            print("weighted average_mae with normalization ", average_mae_weighted_normalization / 5)
            average_mae_weighted_normalization = 0

            print()

mae without normalization for cross validation =  1  ->  12.212281553398055
Computation time: 0.7358636856079102
mae with normalization for cross validation =  1  ->  11.518640776699032
Computation time: 0.7356243133544922
weighted mae without normalization for cross validation =  1  ->  12.212281553398055
Computation time: 0.7449262142181396
weighted mae with normalization for cross validation =  1  ->  11.518640776699032
Computation time: 0.7373499870300293
mae without normalization for cross validation =  2  ->  11.693543689320386
Computation time: 0.7461621761322021
mae with normalization for cross validation =  2  ->  11.347864077669907
Computation time: 0.7346677780151367
weighted mae without normalization for cross validation =  2  ->  11.693543689320386
Computation time: 0.7367873191833496
weighted mae with normalization for cross validation =  2  ->  11.347864077669907
Computation time: 0.7381596565246582
mae without normalization for cross validation =  3  ->  12.239368932038

## Error Analysis for Regression

<img src="Part2.png" />

#### Compare performance of different feature normalization choices and investigate the effect of important system parameters (number of training samples used, k in k-NN, etc.). Wherever relevant, feel free to discuss computation time in addition to regression/estimation rate

- In the table, comparing the weighted and unweighted results at the same k and the same normalization setting shows similar errors for the two KNN variants. This may be because, once the samples are represented as feature vectors, the distances to the nearest neighbors do not affect the result much.
<br>
- Normalization here rescales each feature to the 0-1 interval: the values of each feature are shifted so that the minimum is 0 and then divided by the resulting maximum. Comparing the average errors in the table, the normalized features give better results in all cases, although the difference is small. The likely reason is that the features span very different intervals and the dataset contains outliers; mapping every feature to a fixed interval lets each one contribute comparably to the distance. Examining the dataset confirms this: some attributes have a wide range, while others stay within a small range between 0 and 1. This indicates that normalization should be applied to this dataset.
<br>
- Judging from the table, the best k value is 7. The error decreased until k reached 7 and then increased, so k=7 gives the lowest error for this problem.
<br>
- On the other hand, increasing k affects the computation time of the program. As k increased and the weighted variant was used at the same time, the prediction time of the model became longer. This is due to the additional processing, but the difference is on the order of milliseconds and is quite small.
<br>
- In short, normalizing the data reduced the error. Weighted KNN reduced the error compared with the standard KNN algorithm. Based on the data we have, 7 appears to be the best k value. If an estimate is needed, the model with normalized features, weighted KNN, and k=7 should be preferred.
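
The two ingredients discussed above, min-max normalization and distance weighting, can be illustrated with a small sketch on toy data. This is not the notebook's `KNN_Regression` class: the helper names `min_max_fit`, `min_max_apply`, and `knn_predict` are hypothetical, the weights are assumed to be inverse distances, and `eps` is only a guard against division by zero, not necessarily the same quantity as the `0.1` passed to `predict_weighted`.

```python
import numpy as np

def min_max_fit(x_train):
    """Per-feature minimum and range, computed on the training rows only."""
    col_min = x_train.min(axis=0).astype(float)
    col_range = x_train.max(axis=0).astype(float) - col_min
    col_range[col_range == 0] = 1.0  # constant features: avoid division by zero
    return col_min, col_range

def min_max_apply(x, col_min, col_range):
    """Shift each feature so its training minimum is 0, then divide by its range."""
    return (x - col_min) / col_range

def knn_predict(x_train, y_train, query, k, weighted=False, eps=1e-9):
    """Predict a continuous value as the (optionally weighted) mean of k neighbors."""
    dists = np.sqrt(((x_train - query) ** 2).sum(axis=1))  # Euclidean distance
    nearest = np.argsort(dists)[:k]
    if not weighted:
        return y_train[nearest].mean()
    weights = 1.0 / (dists[nearest] + eps)  # assumed inverse-distance weighting
    return np.average(y_train[nearest], weights=weights)

# Toy data: the second feature has a much wider range than the first.
x_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
y_train = np.array([10.0, 20.0, 30.0])

col_min, col_range = min_max_fit(x_train)
x_scaled = min_max_apply(x_train, col_min, col_range)
query = np.array([0.1, 0.1])  # a query point already in the scaled space

plain_k1 = knn_predict(x_scaled, y_train, query, k=1)
weighted_k1 = knn_predict(x_scaled, y_train, query, k=1, weighted=True)
plain_k3 = knn_predict(x_scaled, y_train, query, k=3)
weighted_k3 = knn_predict(x_scaled, y_train, query, k=3, weighted=True)
print(plain_k1, weighted_k1, plain_k3, weighted_k3)
```

With a single neighbor the weighted and unweighted predictions coincide, because a weighted average of one value is that value. With more neighbors, the weighted prediction is pulled toward the closest sample. Fitting the minimum and range on the training fold only, and reusing them for the validation fold, keeps validation data out of the scaling step.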