# Non Parametric Regression

![](./helper/global_vs_local.PNG)

In linear regression, we fit one model to the whole dataset. To capture all the variation in the data, we sometimes have to choose a high-variance model (shown on the left). Even though the points in the middle of the plot might be captured better by a constant line, we are forced to use a high-degree polynomial because we are looking at the data at a global level.

The plot on the right fits the data at a local level. Rather than having a single equation describing the whole curve, such a fit can be non-parametric.

To get there we could divide our data into pockets, but we don't want to define those regions explicitly. Instead, we just look at the neighbourhood of the query point and make a decision based on the logic given below.

## 1 Nearest Neighbour

<img src="helper/K1.PNG" alt="Drawing" style="width: 600px;"/>

<br>
<br>
<br>

For any query point $ x_{q} $, find the training point closest in distance:

$$ X_{NN} = \underset{x_{i}}{\arg\min} \; distance(x_{i}, x_{q}) $$

$$ \hat y_{q} = y_{NN} $$ (the y corresponding to the x closest to $x_{q}$)

## 1 NN in 2 dimensional data  

<br>

<img src="helper/K2.PNG" alt="Drawing" style="width: 300px;"/>

Voronoi tessellation: divide the space into N regions, each containing a single data point. Every location inside a region is closer to that region's data point than to any other data point, so a query that falls in a region is predicted with that point's output.

For 2D data (a similar picture can be imagined in higher dimensions) we can visualise the plane segmented into N regions based on the N data points.
We don't actually need to construct this division. For every query point, we just find the training point at minimum distance.
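
A minimal sketch of the 2D case, assuming Euclidean distance and a made-up toy dataset. The function name `nearest_neighbour_predict` is illustrative and not part of the original notes. It shows that the Voronoi regions never have to be built: the prediction only needs the training point at minimum distance.

```python
import math

def nearest_neighbour_predict(train_x, train_y, query):
    """Return the target of the training point closest to `query` (Euclidean)."""
    # Index of the training point with the smallest distance to the query.
    nn_index = min(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    return train_y[nn_index]

# Toy 2D data (made up for illustration): (feature 1, feature 2) -> target.
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
train_y = [10.0, 20.0, 30.0, 40.0]

# The query lies in the Voronoi cell of (1, 0), so its target is returned.
print(nearest_neighbour_predict(train_x, train_y, (0.9, 0.2)))  # 20.0
```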

### Metrics 

We need to compute distances between data points. We used Euclidean distance in the 1D and 2D examples above, but there are many other choices.

<br>

E.g.:
* Scaled Euclidean distance
    * For our housing price prediction, we may want to give some features more importance than others, so we weight each feature's contribution to the distance differently.
    * $$ distance(x_{i},x_{q}) = \sqrt{a_{1}(x_{i}[1]-x_{q}[1])^2 + \dots + a_{d}(x_{i}[d]-x_{q}[d])^2}$$
     where d = no. of features and $a_{1}, \dots, a_{d}$ are the feature weights
* Manhattan distance
    * Imagine you are driving on the streets of New York: you travel along x, then along y. This differs from Euclidean distance, which just takes the diagonal.
* Other choices: Mahalanobis, Hamming, cosine similarity, etc.
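
As a rough illustration of the first two metrics, here is a small sketch in plain Python. The function names and the example points are assumptions made for this example only.

```python
import math

def scaled_euclidean(x_i, x_q, weights):
    """Euclidean distance with a per-feature weight a_j on each squared difference."""
    return math.sqrt(sum(a * (u - v) ** 2 for a, u, v in zip(weights, x_i, x_q)))

def manhattan(x_i, x_q):
    """Sum of absolute differences along each feature (city-block distance)."""
    return sum(abs(u - v) for u, v in zip(x_i, x_q))

# Hypothetical points with two features each.
point = (3.0, 4.0)
query = (0.0, 0.0)

print(scaled_euclidean(point, query, (1.0, 1.0)))  # plain Euclidean: 5.0
print(manhattan(point, query))                      # 7.0
print(scaled_euclidean(point, query, (1.0, 0.0)))  # second feature ignored: 3.0
```

Setting a weight to zero removes that feature from the distance entirely, which is one way to see how the weights express feature importance.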
    
#### Predictive Surfaces change with the metrics chosen 
As shown below 


<img src="helper/K3.JPG" alt="Drawing" style="width: 400px;"/>

### 1 NN Pseudo Code 

dist1NN = $\infty$

for i = 1, 2,... N: <br>
$\qquad$ $ \delta = distance(x_{i},x_{q}) $ <br>
$\qquad$ if $ \delta $ < dist1NN:<br>
$\qquad$$\qquad$ dist1NN = $\delta$ <br>
$\qquad$$\qquad$ $X_{NN} = x_{i}$ <br>
$\qquad$$\qquad$ $Y_{NN} = y_{i}$

return $ \hat y_{q} = Y_{NN} $
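
The same linear scan can be written as a short Python function. This is only a sketch: the name `one_nn`, the scalar features, and the default absolute-difference distance are assumptions for this example, and the nearest target is recorded inside the loop together with the nearest distance.

```python
import math

def one_nn(train_x, train_y, x_q, distance=lambda a, b: abs(a - b)):
    """Scan all observations once and keep the closest one seen so far."""
    best_dist = math.inf
    best_y = None
    for x_i, y_i in zip(train_x, train_y):
        delta = distance(x_i, x_q)
        if delta < best_dist:
            # New closest point: remember both its distance and its target.
            best_dist = delta
            best_y = y_i
    return best_y, best_dist

# Made-up data: square feet -> price (in thousands).
sqft = [1000, 1500, 2000]
price = [200, 300, 400]
print(one_nn(sqft, price, 1600))  # (300, 100)
```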

### 1 NN Disadvantage

1 NN **doesn't interpolate well**, as shown below

<img src="helper/K4.PNG" alt="Drawing" style="width: 800px;"/>


**Sensitive to noise** in data. (high variance model)

## K Nearest Neighbours

To reduce the effect of noise, we can look at k neighbours rather than only one. (A real estate agent may look at several similar houses and give you their average price, rather than relying on just the single closest house.)

### K NN Pseudo Code 

\# initialize with first k entries in dataset (sorted in ascending distance to query point : $X_{NN_k} $ is farthest)<br><br>
$ X_{NN_1}, X_{NN_2},.. X_{NN_k} = x_{1},  x_{2},...  x_{k} $ sorted by distance to $x_{q}$ <br>
$ DistNN_1, DistNN_2,.. DistNN_k = sorted list(\delta(1,q), \delta(2,q) ...\delta(k,q)) $
<br>
<br>

for i = k+1,... N: <br>
$\qquad$ $ \delta = distance(x_{i},x_{q}) $ <br>
$\qquad$ if $\delta < DistNN_k: $ <br>
$\qquad$$\qquad$ find j s.t. $ \delta \geq DistNN_{j-1}$ but $\delta< DistNN_{j} $ (take $DistNN_0 = 0$) <br>
$\qquad$$\qquad$ shift $X_{NN_j}, ... X_{NN_{k-1}}$ and $DistNN_j, ... DistNN_{k-1}$ one place to the right (the old $k^{th}$ entry is dropped) <br>
$\qquad$$\qquad$ set $X_{NN_j} = x_{i}$ and $ DistNN_j = \delta $ <br>

$ \hat y_{q} $ = $\frac{1}{k} \sum $(y's for k most similar points found above)



Salient points in the pseudo code above:
* Initialize with the first k entries in the dataset (sorted in ascending distance to the query point: $X_{NN_k} $ is the farthest).
* Compute the distance to the query point for observations k+1 to $N$.
* At every step of this loop, check whether the distance is less than that of the farthest of the k neighbours.
* If yes, we need to update our k neighbours. We then check whether this observation also beats the $(k-1)^{th}$ neighbour, and so on, to find its position.
* In the sorted list we insert this observation at position j (entries to the left of it have distance no greater than $dist(i,q)$), push everything else one place to the right, and drop the old $k^{th}$ entry.
* At the end, we have the k nearest observations. We take the average of their outputs and assign that to $\hat y_{q}$.
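
The steps above can be sketched in Python with the standard `bisect` module, which handles the "find position j and shift" step. The function name `knn_predict`, the scalar features, and the toy data are assumptions for this example, and the sketch assumes the dataset has at least k observations.

```python
import bisect

def knn_predict(train_x, train_y, x_q, k, distance=lambda a, b: abs(a - b)):
    """Average the targets of the k training points closest to x_q."""
    # Start with the first k observations, sorted by distance to the query.
    nearest = sorted((distance(x, x_q), y)
                     for x, y in zip(train_x[:k], train_y[:k]))
    for x_i, y_i in zip(train_x[k:], train_y[k:]):
        delta = distance(x_i, x_q)
        if delta < nearest[-1][0]:               # beats the current farthest
            bisect.insort(nearest, (delta, y_i)) # insert at its sorted position
            nearest.pop()                        # drop the old k-th neighbour
    return sum(y for _, y in nearest) / k

# Made-up 1D data.
xs = [1, 2, 3, 4, 10]
ys = [10, 20, 30, 40, 100]
print(knn_predict(xs, ys, 2.6, k=3))  # neighbours are x = 2, 3, 4 -> 30.0
```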

<img src="helper/K5.PNG" alt="Drawing" style="width: 800px;"/> 

So we take the mean of all the points in the yellow box to predict y. Thus, KNN gives a much smoother fit than 1 NN. But if the data is sparse, the interpolation is still poor.

Also, there are a lot of jumps: the prediction changes abruptly whenever one neighbour drops out of the set and another enters.
E.g. a 2600 sq ft house is predicted at 50k, but a 2601 sq ft house at 54k. We don't like to believe such models!

## Weighted K NN 
To overcome such jumpy predictions, we can weight the contribution of each of the k neighbours, with the farthest having the least effect. Then, when a far neighbour drops out of the set and another enters, the effect on the prediction is minimal and we get a smoother response.

Note: We find the k neighbours using the algorithm given above. Here, we only replace the plain average of the y values with a weighted average; the neighbour search itself is unchanged.

 $ \hat y_q $ = $\frac{c_{NN_1}Y_{NN_1} + c_{NN_2}Y_{NN_2} + \dots + c_{NN_k}Y_{NN_k}}{\sum \limits _{j = 1} ^{k} c_{NN_j}} $

One choice for the weights: $ c_{NN_j} = \frac{1}{dist(X_{NN_j}, x_{q})} $

We can also use kernels based on distances:

$$ c_{NN_j} = kernel_{\lambda}(|X_{NN_j} - x_{q}|) $$

For multi-dimensional x,
$$ c_{NN_j} = kernel_{\lambda}(distance(X_{NN_j}, x_{q})) $$
calculated using any of the metrics defined above.


E.g. $ Gaussian_{\lambda}(|X_{NN_j} - x_{q}|) = exp(-(X_{NN_j}- x_{q})^2/\lambda) $

#### Note
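* With a high $\lambda$ the weights decay slowly with distance, so far-away neighbours still contribute; with a low $\lambda$ they decay quickly.
* Weights never go exactly to zero for the Gaussian kernel; they just become very, very small.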
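
A small sketch of weighted k NN with the Gaussian kernel written above, for scalar features. The names `gaussian_kernel` and `weighted_knn_predict` and the toy data are assumptions for this example. For brevity the k neighbours are found by sorting, which is not the incremental search described earlier.

```python
import math

def gaussian_kernel(dist, lam):
    """Weight exp(-dist^2 / lam): close to 1 nearby, decaying with distance."""
    return math.exp(-dist ** 2 / lam)

def weighted_knn_predict(train_x, train_y, x_q, k, lam):
    """Kernel-weighted average of the targets of the k nearest neighbours."""
    neighbours = sorted(zip(train_x, train_y),
                        key=lambda pair: abs(pair[0] - x_q))[:k]
    weights = [gaussian_kernel(abs(x - x_q), lam) for x, _ in neighbours]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbours))
    return weighted_sum / sum(weights)

xs = [1.0, 3.0, 10.0]
ys = [10.0, 30.0, 100.0]
# The query is nearer to x = 1, so the prediction is pulled towards y = 10.
print(weighted_knn_predict(xs, ys, 1.5, k=2, lam=1.0))
```

With a very small $\lambda$ and a query far from every neighbour, all weights can underflow to zero, so a practical version would need to guard the division.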


### Kernel Regression 

Weight all the available observations (rather than only the k nearest neighbours, as in KNN)

$ \hat y_q $ = $\frac{\sum \limits _{i = 1} ^{N} (c_{qi}y_i)}{\sum \limits _{i = 1} ^{N}  c_{qi}}   $

$ \hat y_q $ = $\frac{\sum \limits _{i = 1} ^{N} (kernel_{\lambda}(distance(x_i, x_{q}))y_i)}{\sum \limits _{i = 1} ^{N}  kernel_{\lambda}(distance(x_i, x_{q}))}   $
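
A minimal sketch of this formula for scalar x, assuming the Gaussian kernel $exp(-d^2/\lambda)$. The function name `kernel_regression` and the data are made up for illustration. Calling it with two different values of $\lambda$ gives a feel for how the bandwidth changes the prediction.

```python
import math

def kernel_regression(train_x, train_y, x_q, lam):
    """Weighted average over ALL observations, weights from a Gaussian kernel."""
    weights = [math.exp(-(x - x_q) ** 2 / lam) for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]

# Tiny bandwidth: almost all weight sits on the nearest point (x = 2, y = 4).
print(kernel_regression(xs, ys, 2.1, lam=0.01))
# Huge bandwidth: weights are nearly equal, so the result approaches the mean of ys.
print(kernel_regression(xs, ys, 2.1, lam=1e6))
```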





#### Effect of $\lambda$ on prediction

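More than the choice of kernel, it is $\lambda$ (the bandwidth) that controls the bias-variance trade-off: a small $\lambda$ gives a wiggly, high-variance fit, while a large $\lambda$ gives an over-smoothed, high-bias fit.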

<img src="helper/K7.JPG" alt="Drawing" style="width: 800px;"/> 


(Epanechnikov kernel used)




Below, we compare the boxcar kernel (very similar to unweighted KNN) with the Epanechnikov kernel. Although the boxcar fit has discontinuities, it looks very similar to the Epanechnikov fit. <br>
<br>
<img src="helper/K8.PNG" alt="Drawing" style="width: 600px;"/> 
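
For reference, the two kernels can be sketched as below, using their standard textbook forms with $\lambda$ as the window half-width. The helper `kernel_average` and the data are assumptions for this example. Both kernels are exactly zero outside the window, so the sketch returns `None` when no observation falls inside it.

```python
def boxcar_kernel(dist, lam):
    """Equal weight for every point within distance lam, zero outside."""
    return 1.0 if abs(dist) <= lam else 0.0

def epanechnikov_kernel(dist, lam):
    """Weight 0.75 * (1 - u^2) with u = dist / lam, zero outside the window."""
    u = dist / lam
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_average(train_x, train_y, x_q, lam, kernel):
    """Kernel-weighted average of y; None if every weight is zero."""
    weights = [kernel(abs(x - x_q), lam) for x in train_x]
    total = sum(weights)
    if total == 0.0:
        return None
    return sum(w * y for w, y in zip(weights, train_y)) / total

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 10.0, 20.0, 30.0]
print(kernel_average(xs, ys, 1.2, 1.0, boxcar_kernel))        # plain mean of x = 1, 2
print(kernel_average(xs, ys, 1.2, 1.0, epanechnikov_kernel))  # pulled towards x = 1
```

The boxcar average differs from unweighted k NN in one respect: the window has a fixed width, so the number of points averaged varies from query to query.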


### Local vs Global Fit 

* KNN and kernel regression come under **locally weighted averages** (a constant is fitted locally)
* We can also fit a linear/polynomial regression locally; this is known as **locally weighted regression**
* Good choice: a local linear fit
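
A sketch of a locally weighted linear fit for scalar x, assuming Gaussian kernel weights and the closed-form solution of weighted least squares for a straight line. The function name `local_linear_predict` and the data are made up for illustration, and the sketch assumes at least two distinct x values carry non-zero weight.

```python
import math

def local_linear_predict(train_x, train_y, x_q, lam):
    """Fit a line by weighted least squares around x_q and evaluate it at x_q."""
    w = [math.exp(-(x - x_q) ** 2 / lam) for x in train_x]
    total = sum(w)
    # Weighted means of x and y.
    x_bar = sum(wi * x for wi, x in zip(w, train_x)) / total
    y_bar = sum(wi * y for wi, y in zip(w, train_y)) / total
    # Weighted covariance and variance give the local slope.
    s_xy = sum(wi * (x - x_bar) * (y - y_bar)
               for wi, x, y in zip(w, train_x, train_y))
    s_xx = sum(wi * (x - x_bar) ** 2 for wi, x in zip(w, train_x))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept + slope * x_q

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]   # exactly linear data
# At the left boundary a weighted average would overshoot; the local line does not.
print(local_linear_predict(xs, ys, 0.0, lam=1.0))
```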

<br>
<br>

<img src="helper/K9.PNG" alt="Drawing" style="width: 400px;"/> 

### Nearest Neighbour with increasing observations 

#### Noiseless Data
We see that 1 NN can fit the true curve perfectly as the number of observations increases

    from IPython.display import Video

    Video("./helper/1NN.mp4")

#### Noisy Data 
We need to have a high value of k to smooth over the noise as shown below 

<img src="helper/K11.PNG" alt="Drawing" style="width: 600px;"/> 

### Parametric vs Non Parametric 


<img src="helper/K10.PNG" alt="Drawing" style="width: 800px;"/> 

In the limiting case of an infinite amount of data:
* Non-parametric: true error ($bias^2$ + variance) goes to zero (for noisy data, provided k also grows)
* Parametric (fixed model complexity): true error always retains some bias

### Limitations of K NN 

* High dimensional data
    * The more dimensions (in X), the more observations you need to cover the whole space (N = $ O(exp(D)) $)
    * If some pockets have few observations, KNN fails to interpolate there.
    * Parametric models outperform KNN in such cases.

* Complexity of search with more observations
    * For 1 NN, each query searches all N observations ($O(N)$).
    * For k NN, each query costs $O(N \log k)$ when the k current neighbours are kept in a priority queue.
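
One way to get roughly $O(N \log k)$ per query in Python is the standard library's `heapq.nsmallest`, which, as far as I know, keeps only a bounded heap of candidates while scanning the data once. The function name `k_nearest_indices` and the data are assumptions for this sketch.

```python
import heapq

def k_nearest_indices(train_x, x_q, k):
    """Indices of the k observations closest to x_q, nearest first."""
    return heapq.nsmallest(k, range(len(train_x)),
                           key=lambda i: abs(train_x[i] - x_q))

xs = [5, 1, 9, 3, 7]
print(k_nearest_indices(xs, 4, k=2))  # the two points at distance 1: x = 5 and x = 3
```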

To get over these limitations we can use clustering and retrieval methods.

### K NN for Classification

Rather than taking the average of the k outputs, here we select the class that occurs most often among the k neighbours found.
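
A minimal sketch of the majority vote for scalar features. The function name `knn_classify`, the labels, and the data are made up for illustration. Ties are resolved by whichever class `Counter.most_common` lists first, which is an arbitrary choice in this sketch.

```python
from collections import Counter

def knn_classify(train_x, train_labels, x_q, k):
    """Return the most frequent label among the k nearest neighbours of x_q."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x_q))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

xs = [1, 2, 3, 10, 11, 12]
labels = ["cheap", "cheap", "cheap", "expensive", "expensive", "expensive"]
print(knn_classify(xs, labels, 4, k=3))  # cheap
print(knn_classify(xs, labels, 9, k=3))  # expensive
```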