# 2. k-NN: Regression

---
## Use k-NN

- k=3, L2 distance
- k=5, L2 distance

<br>

As in classification, we first compute the distance between `new_data_point` and every given data point.

Then we sort the distances and take the k nearest neighbors.

For regression, however, the prediction is the average of the k nearest neighbors' target values rather than a majority vote.

<br>

In the given table, excluding the id column, there are 4 features (Age, Horse_Power, Brand, and MPG) and 1 target value (Price).

The process is as follows: 


In [3]:
import numpy as np

# Define the new data point
new_data_point = np.array([6, 200, 5, 30])

# Define the given data points with 4 features and 1 target value
data = np.array([[2, 200, 4, 27, 30000], 
                [5, 150, 3, 35, 20000], 
                [3, 180, 4, 25, 25000], 
                [1, 230, 2, 10, 21000], 
                [5, 180, 5, 40, 38000],
                [4, 210, 3, 30, 31000]])

# Function to calculate Euclidean distances 
def euclidean_distance(data, new_data_point):
    # exclude the given data's target value
    distances = np.sqrt(np.sum((data[:, :-1] - new_data_point)**2, axis=1))
    return distances

# Calculate the Euclidean distances
all_distances = euclidean_distance(data, new_data_point)
target_values = data[:, -1]

# print out the distances and the target values for each data point
for i in range(len(all_distances)):
    print(f"""data point id <{i+1}>: 
            distance: {all_distances[i]},
            target value: {target_values[i]}""")
print("\n")

# when k = 3
# indices of the sorted distances
sorted_indices = np.argsort(all_distances)
# get the target values of the k nearest neighbors
k_3_target_values = target_values[sorted_indices[:3]]
# calculate the mean of the k nearest neighbors target values
k_3_pred = np.mean(k_3_target_values)

# when k = 5
# get the target values of the k nearest neighbors
k_5_target_values = target_values[sorted_indices[:5]]
# calculate the mean of the k nearest neighbors target values
k_5_pred = np.mean(k_5_target_values)


print(f"""(1) The id, distance, target value from the nearest to the farthest:
        id: {sorted_indices+1},
        distance: {all_distances[sorted_indices]},
        target value: {target_values[sorted_indices]}""")

print(f"""(a) The predicted target value for k=3: {k_3_pred}""")
print(f"""(b) The predicted target value for k=5: {k_5_pred}""")

data point id <1>: 
            distance: 5.0990195135927845,
            target value: 30000
data point id <2>: 
            distance: 50.299105359837164,
            target value: 20000
data point id <3>: 
            distance: 20.85665361461421,
            target value: 25000
data point id <4>: 
            distance: 36.52396473549935,
            target value: 21000
data point id <5>: 
            distance: 22.38302928559939,
            target value: 38000
data point id <6>: 
            distance: 10.392304845413264,
            target value: 31000


(1) The id, distance, target value from the nearest to the farthest:
        id: [1 6 3 5 4 2],
        distance: [ 5.09901951 10.39230485 20.85665361 22.38302929 36.52396474 50.29910536],
        target value: [30000 31000 25000 38000 21000 20000]
(a) The predicted target value for k=3: 28666.666666666668
(b) The predicted target value for k=5: 29000.0
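
The three steps described at the start of this section (compute distances, sort, average the targets) can also be wrapped in one reusable function. The following is a minimal sketch, not part of the original notebook: the name `knn_regress` and its parameters are hypothetical, and it assumes numeric features and the L2 distance.

```python
import numpy as np

def knn_regress(features, targets, query, k):
    """Mean target value of the k rows of `features` closest to `query` (L2 distance).

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # Step 1: L2 distance from the query to every data point
    distances = np.sqrt(np.sum((features - query) ** 2, axis=1))
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: average the target values of those neighbors
    return float(np.mean(targets[nearest]))

# Same table as above: 4 feature columns followed by the target column (Price)
data = np.array([[2, 200, 4, 27, 30000],
                 [5, 150, 3, 35, 20000],
                 [3, 180, 4, 25, 25000],
                 [1, 230, 2, 10, 21000],
                 [5, 180, 5, 40, 38000],
                 [4, 210, 3, 30, 31000]])
query = np.array([6, 200, 5, 30])

for k in (3, 5):
    print(f"k={k}: {knn_regress(data[:, :-1], data[:, -1], query, k)}")
```

Called with k=3 and k=5, this should reproduce the predictions (a) and (b) above.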


---
## Use weighted k-NN

- k = 3, weight $w_i = \exp(-\mathrm{dist}(x, x_i))$
- k = 5, weight $w_i = \exp(-\mathrm{dist}(x, x_i))$

Here $x$ is the new data point and $x_i$ is the i-th neighbor.

<br>

This also mirrors the classification case.

The difference is that the prediction is the weighted average of the k nearest neighbors' target values, so closer neighbors contribute more.

The weighted average formula is as follows:

$$\hat{y} = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i}$$
<br>
where $w_i$ is the weight of the i-th neighbor and $y_i$ is the target value of the i-th neighbor.
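
As a rough worked instance of this formula, take the three nearest neighbors from the example above (distances 5.099, 10.392, and 20.857; target values 30000, 31000, and 25000). Assuming the exponential weight $\exp(-d_i)$, where $d_i$ is the distance to the i-th neighbor, the weights are approximately $6.10 \times 10^{-3}$, $3.07 \times 10^{-5}$, and $8.75 \times 10^{-10}$ (rounded here for readability):

$$\hat{y} \approx \frac{(6.10 \times 10^{-3})(30000) + (3.07 \times 10^{-5})(31000) + (8.75 \times 10^{-10})(25000)}{6.10 \times 10^{-3} + 3.07 \times 10^{-5} + 8.75 \times 10^{-10}} \approx 30005$$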

<br>
<br>

The process is as follows:



In [4]:
# weights of all data points, ordered from the nearest to the farthest
# (all_distances[sorted_indices] is the array of sorted distances)
sorted_weights = np.exp(-all_distances[sorted_indices]) 
for i in range(len(sorted_weights)):
    # sorted_indices[i] + 1 is the id of the i-th nearest data point
    print(f"""data point id <{sorted_indices[i]+1}>: 
            weight: {sorted_weights[i]}""")
print("\n") 

def calculate_weighted_mean(target_values, sorted_weights, k):
    # weighted average of the first k target values; both arrays must be sorted by distance
    return np.sum(target_values[:k] * sorted_weights[:k]) / np.sum(sorted_weights[:k])

# when k = 3
k_3_weighted_pred = calculate_weighted_mean(target_values[sorted_indices], sorted_weights, 3)

# when k = 5
k_5_weighted_pred = calculate_weighted_mean(target_values[sorted_indices], sorted_weights, 5)

print(f"""(c) The predicted target value for k=3 using weighted k-NN: {k_3_weighted_pred}""")
print(f"""(d) The predicted target value for k=5 using weighted k-NN: {k_5_weighted_pred}""")


data point id <1>: 
            weight: 0.0061027272741740034
data point id <6>: 
            weight: 3.0667569021157886e-05
data point id <3>: 
            weight: 8.751256721111149e-10
data point id <5>: 
            weight: 1.9018396305356362e-10
data point id <4>: 
            weight: 1.373547422076624e-16
data point id <2>: 
            weight: 1.4301319118006315e-22


(c) The predicted target value for k=3 using weighted k-NN: 30004.999382854297
(d) The predicted target value for k=5 using weighted k-NN: 30004.99963076258
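
One way to see why (c) and (d) are almost identical is to normalize the weights so they sum to 1 and inspect each neighbor's share. The sketch below is an illustrative addition, not part of the original notebook: the helper `weight_shares` is hypothetical, and the distances are copied (rounded) from the sorted output in the previous section.

```python
import numpy as np

# Sorted distances of the six data points, copied (rounded) from the earlier output
sorted_distances = np.array([5.09901951, 10.39230485, 20.85665361,
                             22.38302929, 36.52396474, 50.29910536])
weights = np.exp(-sorted_distances)

def weight_shares(k):
    """Fraction of the total weight held by each of the k nearest neighbors.

    Hypothetical helper for illustration only.
    """
    return weights[:k] / np.sum(weights[:k])

for k in (3, 5):
    print(f"k={k}: shares = {weight_shares(k)}")
```

Because `exp(-dist)` decays quickly over distances of this size, the nearest neighbor holds more than 99% of the total weight for both values of k, so both predictions stay close to its target value of 30000.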
