# Modern Data Science 
**(Module 03: Pattern Classification)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session E - Parzen Window 

##### Gaussian kernel smoothing
The kernel density model is given by $$p(x) = \frac{1}{N} \sum_{i=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\left(-\frac{(x-x_i)^T(x-x_i)}{2h^2}\right)$$
where *D* is the dimension of the data (1 in the first exercise below, 2 in the second), *h* is the standard deviation of the Gaussian kernel, which acts as the smoothing parameter we have to set, and *N* is the total number of samples.
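
As a rough illustration of the formula above, the sketch below evaluates the estimate for any dimension *D* using NumPy broadcasting. The function name `parzen_gaussian` and its argument names are placeholders chosen for this example and are not part of the original notebook.

```python
import numpy as np

def parzen_gaussian(queries, samples, h):
    """Gaussian Parzen window estimate p(x) at each query point.

    queries has shape (M, D) and samples has shape (N, D); 1D data
    must therefore be passed as a column of shape (N, 1).
    (Hypothetical helper, for illustration only.)
    """
    queries = np.atleast_2d(queries)
    samples = np.atleast_2d(samples)
    n, d = samples.shape
    # Squared Euclidean distance between every query and every sample: shape (M, N)
    diff = queries[:, None, :] - samples[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    # Normalising constant of an isotropic D-dimensional Gaussian with std h
    norm = (2 * np.pi * h ** 2) ** (d / 2)
    return np.exp(-sq_dist / (2 * h ** 2)).sum(axis=1) / (n * norm)

# With a single sample, the estimate at that sample equals the kernel peak
single = np.array([[0.0, 0.0]])
peak = parzen_gaussian(single, single, h=0.5)[0]
```

Building the full `(M, N, D)` difference array is simple but memory-hungry, so this form is best suited to small grids and sample sets.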

##### Density estimation in 1 dimension
Let's generate data from a mixture of two 1D Gaussians as follows: toss a fair coin; if the outcome is heads, sample a data point from the first Gaussian, otherwise sample from the second. The two Gaussians have means 2 and 4, and each has a standard deviation of 1.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# Generate 100 points
points = np.array([])

for i in range(100): # sample 100 points
    if np.random.rand() > 0.5:
        points = np.append(points, np.random.normal(2,1))
    else:
        points = np.append(points, np.random.normal(4,1))

plt.hist(points)
plt.show()
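
The sampling loop above can also be written without repeated `np.append` calls. The following is a minimal sketch, assuming a seeded `np.random.default_rng` generator is acceptable here; the seed and the names `rng` and `means` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# Fair coin toss per point: pick the mean 2 or 4 with equal probability
means = rng.choice([2.0, 4.0], size=100)

# Draw each point from a unit-variance Gaussian centred on its chosen mean
points = rng.normal(loc=means, scale=1.0)
```

The resulting `points` array can be passed to `plt.hist` exactly as in the cell above.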

###### Parzen window estimation
Our x ranges approximately from -2 to 10. The pdf is given by $p(x) = \frac{1}{N} \sum\limits_{i=1}^N \frac{1}{(2\pi h^2)^{1/2}} \exp\left(-\frac{(x-x_i)^2}{2h^2}\right)$ for every value of x. To plot the estimated density, we evaluate this pdf on a grid of x values from -2 to 10 in steps of 0.02.
Choose different values for the smoothing parameter *h* to get the best density estimate (try h = 0.08, 0.1, 0.15, etc.). Which value of h gives a bimodal distribution?

In [None]:
# h is the smoothing parameter; try different values
h = 0.19
X = np.arange(-2, 10, 0.02)

# For each point x in X, compute the estimated pdf
Y = np.array([])
N = len(points)

for x in X:
    t = 0
    for xi in points:
        t += np.exp(-(x-xi)**2/(2*h*h))
    
    y = (t/(2*np.pi*h*h)**0.5)/N
    Y = np.append(Y, y)

plt.plot(X, Y)
plt.show()
    
# With h = 0.19 the estimate is close to bimodal (the exact shape varies with the random sample)
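
To compare several bandwidths at once, the estimate can be vectorized and summarized numerically. The sketch below is one possible way to do this, not the notebook's own method: the helpers `parzen_1d` and `count_modes`, the seed, and the list of *h* values are all assumptions made for illustration. It counts the local maxima of each curve and checks that the area under it is close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
means = rng.choice([2.0, 4.0], size=100)
points = rng.normal(means, 1.0)

X = np.arange(-2, 10, 0.02)

def parzen_1d(X, points, h):
    # Pairwise differences between grid values and samples: shape (len(X), N)
    diff = X[:, None] - points[None, :]
    kernels = np.exp(-diff ** 2 / (2 * h ** 2)) / np.sqrt(2 * np.pi * h ** 2)
    # Averaging over the samples gives the Parzen estimate at each grid value
    return kernels.mean(axis=1)

def count_modes(Y):
    # A local maximum is a grid point strictly higher than both neighbours
    return int(np.sum((Y[1:-1] > Y[:-2]) & (Y[1:-1] > Y[2:])))

for h in (0.08, 0.19, 0.5, 1.0):
    Y = parzen_1d(X, points, h)
    area = np.sum(Y) * 0.02  # Riemann-sum approximation of the integral
    print(f"h={h}: {count_modes(Y)} local maxima, area ~ {area:.3f}")
```

Small values of *h* typically produce many spurious peaks, while large values merge the two components into a single bump; the number of maxima reported for each *h* depends on the random sample.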

##### Density estimation in 2 dimensions
Similarly, perform density estimation for the data set sampled from three 2D Gaussians (regenerated in the cell below).

**Note:** Evaluating the density at every point of the 2D plane is computationally expensive, so restrict the estimate to the square $[c-2, c+2] \times [d-2, d+2]$, where $(c, d)$ is the point at which the three discriminant lines from the earlier Linear Discriminant Analysis meet.


In [None]:
cov = np.eye(2)

d1 = np.random.multivariate_normal([7, 5], cov, 500)
d2 = np.random.multivariate_normal([9, 9], cov, 500)
d3 = np.random.multivariate_normal([11, 5], cov, 500)

data = np.vstack([d1, d2, d3])

In [None]:
m1 = np.mean(d1, axis = 0)
m2 = np.mean(d2, axis = 0)
m3 = np.mean(d3, axis = 0)

In [None]:
# Solve for the point (c, d) that is equidistant from the three class means
a = np.array([m1 - m2, m1 - m3])
b = 0.5 * np.array([m1 @ m1 - m2 @ m2,
                    m1 @ m1 - m3 @ m3])
sol = np.linalg.solve(a, b)
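
The linear system solved above can be read as an equidistance condition. Assuming the three discriminant lines are the pairwise boundaries between class means under equal priors and a shared isotropic covariance (an assumption about the earlier analysis, which is not shown in this session), a point $x$ lies on the boundary between classes 1 and 2 when

$$\|x - m_1\|^2 = \|x - m_2\|^2 \;\Longleftrightarrow\; (m_1 - m_2)^T x = \frac{1}{2}\left(m_1^T m_1 - m_2^T m_2\right)$$

Writing the same condition for classes 1 and 3 gives a second equation, and stacking both yields the $2 \times 2$ system $A x = b$ whose solution is $(c, d)$:

$$A = \begin{pmatrix} (m_1 - m_2)^T \\ (m_1 - m_3)^T \end{pmatrix}, \qquad b = \frac{1}{2}\begin{pmatrix} m_1^T m_1 - m_2^T m_2 \\ m_1^T m_1 - m_3^T m_3 \end{pmatrix}$$

The boundary between classes 2 and 3 passes through the same point, since a point equidistant from $m_1$ and $m_2$ and from $m_1$ and $m_3$ is also equidistant from $m_2$ and $m_3$.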

In [None]:
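from mpl_toolkits.mplot3d import Axes3D  # needed for projection='3d' on older Matplotlib versions

# h is the smoothing parameter; try different values
h = 0.3

# The estimated probability density function at the point (a, b)
def z(a, b):
    x = np.array([a, b])
    t = 0
    for xi in data:
        t += np.exp(-np.dot(x - xi, x - xi) / (2*h*h))

    # Normalise by the number of 2D samples, not the N left over from the 1D exercise
    y = t / (2*np.pi*h*h*len(data))
    return y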
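
Calling `z` once per grid point is slow because of the Python loop over all samples. As a hedged alternative, the sketch below uses the fact that the isotropic 2D Gaussian kernel factorizes into an x part and a y part, so the sum over samples becomes a matrix product. The helper `density_on_grid`, the seed, and the grid bounds are assumptions made for this example; the final Riemann sum is a sanity check that the estimate integrates to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
cov = np.eye(2)
data = np.vstack([
    rng.multivariate_normal([7, 5], cov, 500),
    rng.multivariate_normal([9, 9], cov, 500),
    rng.multivariate_normal([11, 5], cov, 500),
])
h = 0.3

def density_on_grid(xs, ys, data, h):
    """Evaluate the 2D Gaussian Parzen estimate on the grid xs x ys."""
    # Per-axis squared distances between grid coordinates and samples
    dx2 = (xs[:, None] - data[None, :, 0]) ** 2   # shape (len(xs), N)
    dy2 = (ys[:, None] - data[None, :, 1]) ** 2   # shape (len(ys), N)
    kx = np.exp(-dx2 / (2 * h * h))
    ky = np.exp(-dy2 / (2 * h * h))
    # exp(-(dx2 + dy2) / 2h^2) = kx * ky, so the sum over samples
    # is a matrix product; rows index y and columns index x
    Z = ky @ kx.T                                  # shape (len(ys), len(xs))
    return Z / (2 * np.pi * h * h * len(data))

xs = np.linspace(0, 18, 181)
ys = np.linspace(-2, 16, 181)
Z = density_on_grid(xs, ys, data, h)

# Riemann sum over the grid; close to 1 for a properly normalised pdf
cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
total_mass = Z.sum() * cell_area
```

The returned array has the same layout as the `Z` built from `np.meshgrid(X, Y)` in the plotting cells, so it could be passed to `plot_surface` in the same way.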

Plot 1 - Visualizing the three 2D Gaussians

In [None]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(d1[:, 0], d1[:, 1], color='red')
ax.scatter(d2[:, 0], d2[:, 1], color='blue')
ax.scatter(d3[:, 0], d3[:, 1], color='green')

# Range of values for x and y axis (helps visualize the 3 2d gaussians)
X = np.linspace(3, 13, 100)
Y = np.linspace(2, 13, 100)
X,Y = np.meshgrid(X,Y)
Z = []
for i,j in zip(X.ravel(),Y.ravel()):
    Z.append(z(i, j))
    
Z = np.asarray(Z).reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.RdYlBu, linewidth=0, antialiased=False)
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.RdYlBu)

ax.set_xlim([2,14])
ax.set_ylim([3,14])
ax.set_zlim([0, 1.1 * Z.max()])  # scale the z-axis to the estimated density
plt.show()


Plot 2 - Plotting the density over the x- and y-axis ranges specified in the note above

In [None]:
fig = plt.figure(figsize=(12,10))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(d1[:, 0], d1[:, 1], color='red')
ax.scatter(d2[:, 0], d2[:, 1], color='blue')
ax.scatter(d3[:, 0], d3[:, 1], color='green')


# Range of values for x and y axis: the square [c-2, c+2] x [d-2, d+2].
# An evenly spaced grid is used so that plot_surface draws a well-formed mesh.
X = np.linspace(sol.item(0)-2, sol.item(0)+2, 100)
Y = np.linspace(sol.item(1)-2, sol.item(1)+2, 100)
X,Y = np.meshgrid(X,Y)
Z = []
for i,j in zip(X.ravel(),Y.ravel()):
    Z.append(z(i, j))
    
Z = np.asarray(Z).reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.RdYlBu, linewidth=0, antialiased=False)
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.RdYlBu)

ax.set_xlim([2,14])
ax.set_ylim([3,14])
ax.set_zlim([0, 1.1 * Z.max()])  # scale the z-axis to the estimated density
plt.show()