### Higher Dimensional Data

In this section we explore high-dimensional data using two approaches: PCA and t-SNE. Both can map data with thousands of dimensions down to two dimensions, where it can be explored visually. Dimensionality reduction is also used as a compression technique. A low-dimensional view can reveal patterns that are hard to see in the original space. For example, clusters that appear in the 2-D projection of a data set can help with classification.

Topics to be covered:

### References 

(https://7264-843222-gh.circle-artifacts.com/0/home/ubuntu/scikit-learn/doc/_build/html/stable/auto_examples/preprocessing/plot_scaling_importance.html)

http://scikit-learn.org/stable/modules/decomposition.html#exact-pca-and-probabilistic-interpretation

https://plot.ly/ipython-notebooks/principal-component-analysis/

https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm

http://alexanderfabisch.github.io/t-sne-in-scikit-learn.html

https://distill.pub/2016/misread-tsne/

In [None]:
import seaborn as sns
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn import datasets
import matplotlib.pyplot as plt
%matplotlib inline

####  Background

Let us look at the IRIS dataset again.

In [None]:
iris = sns.load_dataset("iris")
iris[iris.species=='setosa'][0:5]

In [None]:
iris[iris.species=='virginica'][0:5]

Suppose we find an iris flower and measure all four features: sepal_length, sepal_width, petal_length, and petal_width. Can we tell its species from these features? Ideally we would draw a 4-D scatter plot, color each dot by species, and check whether the dots cluster by species. If they do, we may be able to infer the species of the new flower. The problem is that we cannot draw a 4-D scatter plot!

In [None]:
sns.pairplot(iris)

Can we distinguish the species using just two features? If so, which two?

In [None]:
g = sns.pairplot(iris, hue="species")

It looks as if petal_length and sepal_length give almost separable clusters. Picking features by eye may work for this and other low-dimensional datasets.

In general, for very high-dimensional data it is impractical to inspect every pairwise scatter plot, and it is difficult to identify correlated features (variables) visually and reduce them.

For example, in the figure above there is an almost linear relationship between petal_length and petal_width, so these two variables can be reduced to a single variable.

Principal Component Analysis (PCA) essentially transforms a number of correlated variables into a smaller number of uncorrelated variables (also referred to as principal components).

The principal components (new reduced set of variables) are ordered based on their significance. The first component captures the maximum amount of variability in the data, followed by the second one, and so on. 

In summary, PCA reduces a large set of features (variables) in the data to a reduced set of features (variables) that still contains most of the information. 
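
The ordering of components by variance can be sketched in a few lines of NumPy. The data below is synthetic and made up for illustration (two strongly correlated features plus one low-variance feature); the variable names are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): features 0 and 1 are
# strongly correlated, feature 2 is independent with small variance.
t = rng.normal(size=200)
data = np.column_stack([t,
                        2.0 * t + 0.1 * rng.normal(size=200),
                        0.3 * rng.normal(size=200)])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh handles symmetric matrices and returns eigenvalues in ascending
# order, so reverse them to put the most significant component first.
eig_vals, eig_vecs = np.linalg.eigh(cov)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Fraction of the total variance captured by each principal component.
explained_ratio = eig_vals / eig_vals.sum()
print(explained_ratio)

# The principal components are uncorrelated: their covariance is diagonal.
components = centered @ eig_vecs
comp_cov = np.cov(components, rowvar=False)
print(np.round(comp_cov, 6))
```

Because the first two features carry nearly the same information, the first component should capture most of the variance here.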

NOTE: In the discussion below we use toy problems and small data sets simply to illustrate how PCA and other dimensionality reduction approaches work. In practice these approaches are mostly applied to large, high-dimensional data sets.

<h3> Toy Example to Illustrate PCA </h3>

In this toy example, we consider the following three points in 2D and see how they can be reduced to 1D points.

(1, 2), (2, 4), (3, 6)

![image.png](attachment:image.png)





Now redraw the two axes so that one axis passes through the points.

![image.png](attachment:image.png)

In the new axes the three points are 1D points: the value of the other dimension is zero for all three points.

Question: When we rotated the axes, why did all three points land on the axis?

The three points can be represented as: (a,0), (b,0), (c,0)
Here, $a = \sqrt{1^2+2^2}$, $b = \sqrt{2^2+4^2}$, and $c = \sqrt{3^2+6^2}$

Note that all the information about the three points is in the first dimension, so we can drop the second dimension. In this process, we have also achieved compression.
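
As a quick check of the geometric picture, the sketch below rotates the axes so that the first axis lies along the line $y = 2x$ and prints the coordinates of the three points in the new axes. The names `theta` and `new_axes` are introduced here for illustration only.

```python
import numpy as np

points = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# Angle between the original X axis and the line y = 2x through the points.
theta = np.arctan2(2.0, 1.0)

# Rows are the directions of the new axes expressed in the old coordinates.
new_axes = np.array([[np.cos(theta), np.sin(theta)],
                     [-np.sin(theta), np.cos(theta)]])

# Coordinates of every point in the rotated axes.
rotated = points @ new_axes.T
print(np.round(rotated, 6))

# Distances from the origin, for comparison with the first column.
print(np.sqrt((points ** 2).sum(axis=1)))
```

The second column is zero because the three points and the origin lie on one straight line, so a single axis through them captures everything.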

Real data is not so nice

![image.png](attachment:image.png)


The three points can be represented as:

$(a,-\delta_1), (b,\delta_2), (c,-\delta_3)$

In this case most of the information is still in the first dimension so we can decide to drop the second dimension and achieve the dimensional reduction with some loss. 

Observe that variance along the first dimension is large compared to small variance along the second dimension. <b>A relatively large variance along a dimension is an indication of having more information about the points.</b>

Also observe that the error in approximating the data by one dimension is small. There are two ways to compute this error. One is to take the perpendicular distance of the red dots from the rotated X axis. The other is to reduce the dimension first (project the red dots onto the rotated X axis), then rotate the axes back to their original position and compare the result with the original points.
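
The two ways of measuring the error can be compared numerically. This is a minimal sketch under two assumptions: the slightly perturbed points are made up for illustration, and the rotated X axis is taken to be the dominant singular direction through the origin (no centering), which is a choice made here rather than something the figure specifies.

```python
import numpy as np

pts = np.array([[1.01, 2.01], [1.98, 4.0], [3.0, 5.98]])

# Rows of vt are the rotated axes: vt[0] is the dominant direction.
_, _, vt = np.linalg.svd(pts, full_matrices=False)
axis1, axis2 = vt[0], vt[1]

# Spread (sum of squared coordinates) along each rotated axis.
spread1 = np.sum((pts @ axis1) ** 2)
spread2 = np.sum((pts @ axis2) ** 2)
print(spread1, spread2)

# Way 1: perpendicular distance of each point from the rotated X axis.
perp = np.abs(pts @ axis2)
error_perp = np.sqrt(np.sum(perp ** 2))

# Way 2: project onto the rotated X axis, map back to the original axes,
# and compare with the original points.
reconstructed = np.outer(pts @ axis1, axis1)
error_recon = np.linalg.norm(pts - reconstructed)

print(error_perp, error_recon)
```

Both computations give the same number, and the spread along the first axis is far larger than along the second.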

### Using Eigen Decomposition 

Let $\mathbf X$ be the data matrix of size $n \times d$, where $n$ is the number of data points and $d$ is the number of dimensions. The Eigen Decomposition of the matrix $\mathbf X^\top \mathbf X$ is given by:


$$\mathbf X^\top \mathbf X = \mathbf W \mathbf L \mathbf W^\top,$$

where $\mathbf W$ is a $d \times d$ matrix of eigenvectors and $\mathbf L$ is a diagonal matrix of eigenvalues. The $j$-th principal component is given by the $j$-th column of $\mathbf X \mathbf W$.

In [None]:
# Dimensional reduction without loss
import numpy as np
x = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
cov_mat = x.T.dot (x) # this is like covariance matrix with some scaling
print (cov_mat)
np.shape(x)

In [None]:
np.cov?

In [None]:
cov_matN = np.cov(x.T)
print(cov_matN)

In [None]:
mean = np.mean(x, axis=0)
print(mean)
xn = x - mean
print(xn)

$cov_{x,y}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{N-1}$

$cov_{x,x}=\frac{\sum_{i=1}^{N}(x_{i}-\bar{x})(x_{i}-\bar{x})}{N-1}$

$cov_{y,y}=\frac{\sum_{i=1}^{N}(y_{i}-\bar{y})(y_{i}-\bar{y})}{N-1}$

In [None]:
#N = 3 -> N-1 = 2
print (xn.T.dot(xn)/2)  # after centering the data, x^T x  is same as covariance matrix

In [None]:
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('cov matrix \n%s' %cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

In [None]:
E1 = eig_vecs[:, np.argmax(eig_vals)]  # eigenvector of the largest eigenvalue (70); eig does not sort its output
print (E1)
print (x)
xr = np.dot(x, -E1) 
# minus sign before E1 does not impact the overall conclusion, just makes 
# it easy to match it with the geometric interpreation
print (xr)
# Note that from the figure we know the distance of three points from the origin is:
print (np.sqrt(1+4), np.sqrt(4+16), np.sqrt(9+36)) # this is also the representation of data in one dimension


### Using SVD and its Relationship with Eigen Decomposition 

We can perform the singular value decomposition (SVD) of $\mathbf X^\top$:

$$\mathbf{X^\top} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top$$

Here $\mathbf U$ is a $d \times d$ orthogonal matrix and its columns are the left singular vectors. $\mathbf{\Sigma}$ is a $d \times n$ diagonal matrix of singular values. The matrix $\mathbf V^\top$ is of size $n \times n$ and its rows are the right singular vectors.

How is SVD related to Eigen Decomposition?

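$$\mathbf{X^\top} \mathbf X = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top \mathbf{V} \mathbf{\Sigma}^\top \mathbf{U}^\top  $$

$$\mathbf{X^\top} \mathbf X = \mathbf{U} \mathbf{\Sigma} \mathbf{\Sigma}^\top \mathbf{U}^\top  $$

Since $\mathbf V$ is orthogonal, $\mathbf V^\top \mathbf V$ is the identity, and $\mathbf{\Sigma} \mathbf{\Sigma}^\top$ is a $d \times d$ diagonal matrix holding the squared singular values.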

Here $\mathbf U$ just like  $\mathbf W$ in the Eigen Decomposition is a $d \times d$ matrix of eigenvectors. 
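
This relationship can be checked numerically. The sketch below uses a small made-up data matrix and compares the SVD of $\mathbf X^\top$ with the eigendecomposition of $\mathbf X^\top \mathbf X$; eigenvectors are compared up to sign, since either sign is a valid eigenvector.

```python
import numpy as np

# Small made-up data matrix: n = 3 points, d = 2 dimensions.
X = np.array([[1.01, 2.01], [1.98, 4.0], [3.0, 5.98]])

# SVD of X^T: u is d x d, s holds the singular values in descending order.
u, s, vt = np.linalg.svd(X.T)

# Eigendecomposition of the symmetric matrix X^T X (ascending eigenvalues).
eig_vals, eig_vecs = np.linalg.eigh(X.T @ X)
eig_vals_desc = eig_vals[::-1]
W = eig_vecs[:, ::-1]

# Squared singular values equal the eigenvalues of X^T X.
print(s ** 2)
print(eig_vals_desc)

# Columns of u and W agree up to sign.
print(np.abs(u))
print(np.abs(W))
```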

For dimensionality reduction, multiply the data matrix $\mathbf X$ by the subset of column vectors of $\mathbf U = \{ U_1  U_2 U_3 \ldots U_d \}$ that correspond to the significant singular values.

Let us look at our simple example to see how this computation is done.

In [None]:
# Let us now use SVD for the same computation
u,s,vt = np.linalg.svd(x.T)
print (u)
print (s)
print (vt)

In [None]:
U1 = u[:,0]
xr = np.dot(x,-U1)
print (xr) # 1-D representation of the points

In [None]:
#Matrix reconstruction
V1 = vt[0,:]
x_reconst = s[0]*np.outer(U1,V1)
print (x_reconst.T)

In [None]:
# Dimensional reduction after perturbing the three data points in space
x = np.array([[1.01, 2.01], [1.98, 4.0], [3.0, 5.98]])
u,s,vt = np.linalg.svd(x.T)
U1 = u[:,0]
xr = np.dot(x,-U1)
print (xr)

In [None]:
# Matrix reconstruction using the dominant singular value
from numpy import linalg as LA
V1 = vt[0,:]
x_reconst = s[0]*np.outer(U1,V1)
print (x_reconst.T)
print (LA.norm(x_reconst.T-x))  # compute approximation error

In [None]:
# add the term for the second singular value
U2 = u[:,1]
V2 = vt[1,:]
x_reconst = s[0]*np.outer(U1,V1) + s[1]*np.outer(U2,V2)
print (x_reconst.T)
print (LA.norm(x_reconst.T-x)) # compute approximation error
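
The approximation error of the rank-1 reconstruction is not arbitrary: when measured with the Frobenius norm (the default of `np.linalg.norm` for matrices), it equals the square root of the sum of squares of the dropped singular values. With two singular values, that is simply the second one. A minimal check on the same perturbed points:

```python
import numpy as np

x = np.array([[1.01, 2.01], [1.98, 4.0], [3.0, 5.98]])
u, s, vt = np.linalg.svd(x.T)

# Keep only the term for the dominant singular value.
rank1 = s[0] * np.outer(u[:, 0], vt[0, :])
error = np.linalg.norm(rank1.T - x)

# The error matches the singular value that was dropped.
print(error, s[1])
```

This gives a quick way to judge how much is lost before doing any reconstruction: inspect the singular values that would be discarded.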

Exercise 1. Consider the following three-dimensional points.
(1,2,3), (3,4,5), (5,4,3), (0,2,4), (1,3,5)
Perform dimensional reduction to 2-D using SVD. Print the values of points in 2-D.  Reconstruct the original data and compute the approximation error using the norm. Next perform dimensional reduction to 1-D using SVD and compute the approximation error. 

<h3> Normalization of Data before PCA </h3>

Z-score normalization is important for many machine learning algorithms, including PCA. It rescales each feature so that it has a mean of zero and a standard deviation of one.



"In PCA we are interested in the components that maximize the variance. If there exists components (e.g human height) that vary less then other components (e.g human weight) because of their respective scales (meters vs. kilos) it can be seen how not scaling the features would cause PCA to determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis. As a change in height of one meter can be considered much more important than the change in weight of one kilogram, it is easily seen that this determination is incorrect. In the case of PCA, scaling features using normalization is preferred over using min-max scaling as the primary components are computed using the correlation matrix as opposed to the covariance matrix."

Z-Score Normalization

Take the data from samples along one dimension and rescale the data:

$$z = \frac{x - \mu}{\sigma}$$

Here  μ is the mean (average) and σ is the standard deviation from the mean for the data being rescaled.
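
The formula can be applied column by column with NumPy. The heights and weights below are made-up numbers for illustration. The sketch uses the population standard deviation (`ddof=0`), which to my knowledge is also what scikit-learn's `preprocessing.scale` uses.

```python
import numpy as np

# Hypothetical measurements: column 0 is height in meters,
# column 1 is weight in kilograms.
data = np.array([[1.60, 55.0],
                 [1.75, 72.0],
                 [1.82, 90.0],
                 [1.68, 64.0]])

# Before scaling, weight dominates the variance purely because of its units.
raw_var = data.var(axis=0)
print(raw_var)

mu = data.mean(axis=0)
sigma = data.std(axis=0)  # population standard deviation (ddof=0)
z = (data - mu) / sigma

# After scaling, every column has mean 0 and standard deviation 1.
print(z.mean(axis=0), z.std(axis=0))
```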

<h3>Dimensional Reduction for IRIS data set</h3>

In [None]:
from sklearn.datasets import load_iris 
iris = load_iris() 
 
numSamples, numFeatures = iris.data.shape 
#print (numSamples )
#print (numFeatures )
#print (list(iris.target_names) )

The Iris dataset is a 150×4 matrix in which every column is a feature and every row represents a separate flower sample. Each sample row x can be viewed as a 4-dimensional vector. Each flower sample belongs to one of three flower classes:
['setosa', 'versicolor', 'virginica']

In [None]:
X = iris.data 
#print (X)

In [None]:
from sklearn import preprocessing
Xn = preprocessing.scale(X)
#print (Xn.mean(axis=0), Xn.std(axis=0))

In [None]:
cov_mat = Xn.T.dot (Xn)
#cov_mat = np.cov(Xn.T)  # subtracting the mean is useful for some data, particularly when the data is not scaled
eig_vals, eig_vecs = np.linalg.eig(cov_mat)

print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)

Most PCA implementations perform a Singular Value Decomposition (SVD) to improve computational efficiency. So, let us perform an SVD to confirm that the results are indeed the same:

In [None]:
u,s,v = np.linalg.svd(Xn.T)
print (u)

In [None]:
print (s)

In [None]:
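# extract the two eigenvectors corresponding to the two largest eigenvalues
order = np.argsort(eig_vals)[::-1]  # np.linalg.eig does not guarantee sorted eigenvalues
matrix_w = eig_vecs[:, order[:2]]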
X_r = Xn.dot(matrix_w)
print (X_r.shape)

In [None]:
import matplotlib.pyplot as plt
target_names = iris.target_names
y = iris.target
print(target_names)
print (y)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of IRIS dataset')

plt.show()

Exercise 2. Reconstruct the original iris data after dropping two least-significant eigenvalues and compute the approximation error.

### Scikit Learn

Scikit-learn provides tools for data mining and analysis. For now we will focus only on dimensionality reduction.

http://scikit-learn.org/stable/index.html


In [None]:
# use of PCA in Scikit for dimensional reduction of iris dataset

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_r  = pca.fit_transform(X)
print(np.shape(X), np.shape(X_r))

In [None]:
target_names = iris.target_names
y = iris.target
#print(target_names)
#print (y)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']

for color, i, target_name in zip(colors, [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('SCIKIT PCA of IRIS dataset')

plt.show()

<h3> Introduction to t-SNE </h3>
t-SNE is a relatively new technique for dimensionality reduction and is quite popular in the machine learning community.
PCA transforms a number of correlated variables into a smaller number of uncorrelated variables (principal components) such that the variance of the first component is maximized. Maximizing variance also means trying to preserve the distances between dissimilar points, which may not be good for visualization. What we want to preserve is the actual structure of the data, and intuitively that structure is not sensitive to large distances.

#### Stochastic Neighbor Embedding (SNE)

SNE key idea: the similarity of a point $x_j$ to a point $x_i$ is the conditional probability $p_{j|i}$ that “$x_i$ would pick $x_j$ as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centered at $x_i$”. In other words, if the Gaussian is spread out (large standard deviation $\sigma_i$), $x_i$ will treat more points as similar (neighbors).

$$p_{j|i} = \frac{\exp\left(-d(\boldsymbol{x}_i, \boldsymbol{x}_j) / (2 \sigma_i^2)\right)}{\sum_{k \neq i} \exp\left(-d(\boldsymbol{x}_i, \boldsymbol{x}_k) / (2 \sigma_i^2)\right)}, \quad p_{i|i} = 0.$$

 

We also have the similarity measure in the lower dimension. 

$$q_{j|i} = \frac{\exp\left(-\|\boldsymbol{y}_i - \boldsymbol{y}_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|\boldsymbol{y}_i - \boldsymbol{y}_k\|^2\right)}$$
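
The two similarity measures can be sketched directly in NumPy. This is an illustration under stated assumptions, not the scikit-learn implementation: $d$ is taken to be the squared Euclidean distance, $\sigma_i$ is fixed by hand rather than derived from a perplexity, and the function names and toy points are hypothetical.

```python
import numpy as np

def squared_distances(points):
    """Pairwise squared Euclidean distances between the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def conditional_p(X, sigma):
    """Row i holds p_{j|i} for all j, using a Gaussian of width sigma[i]."""
    logits = -squared_distances(X) / (2.0 * sigma[:, None] ** 2)
    np.fill_diagonal(logits, -np.inf)  # enforces p_{i|i} = 0
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def conditional_q(Y):
    """Row i holds q_{j|i} for all j in the low-dimensional space."""
    logits = -squared_distances(Y)
    np.fill_diagonal(logits, -np.inf)  # enforces q_{i|i} = 0
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)

# Toy data: two tight pairs of points in 3-D, and a naive 2-D embedding
# obtained by dropping the last coordinate.
X = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0],
              [3.0, 3.0, 3.0], [3.1, 3.0, 3.0]])
Y = X[:, :2]

P = conditional_p(X, sigma=np.ones(len(X)))
Q = conditional_q(Y)
print(np.round(P, 4))
print(np.round(Q, 4))
```

Each row sums to one, and each point assigns almost all of its probability to its nearby partner.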

Minimize the cost function  with gradient descent:

$$CostFunction = \sum_{i \neq j} p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$$

The cost function imposes a high penalty when close points are mapped far apart, and a much smaller penalty when distant points are mapped close together.
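
The asymmetry is easy to see by evaluating a single term $p \log (p/q)$ of the cost function for two opposite situations. The probabilities below are made-up values chosen for illustration, not the output of a real embedding, and `kl_term` is a hypothetical helper.

```python
import numpy as np

def kl_term(p, q):
    """Contribution of one pair (i, j) to the cost: p * log(p / q)."""
    return p * np.log(p / q)

# Close in the original space (large p) but mapped far apart (small q).
close_mapped_far = kl_term(0.8, 0.01)

# Far apart in the original space (small p) but mapped close (large q).
far_mapped_close = kl_term(0.01, 0.8)

print(close_mapped_far, far_mapped_close)
```

The first term is large and positive, while the second is close to zero (individual terms may even be slightly negative, although the full sum over a probability distribution never is). This is why SNE concentrates on keeping neighbors together.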

The algorithm uses a hyper-parameter, $perplexity$, which determines the width of the Gaussian centered at $x_i$. A large value of $perplexity$ means a data point will consider more points as its close neighbors.
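
In the t-SNE paper, the perplexity of the distribution $p_{\cdot|i}$ is defined as $2^{H}$, where $H$ is its entropy in bits, and $\sigma_i$ is searched for so that this value matches the requested perplexity. The sketch below goes the other way, for illustration only: it fixes $\sigma$ and reports the resulting perplexity. The helper name and the made-up distances are assumptions.

```python
import numpy as np

def perplexity_of_row(sq_dists, sigma):
    """Perplexity 2**H of p_{j|i}, given squared distances from x_i to the other points."""
    logits = -sq_dists / (2.0 * sigma ** 2)
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    p /= p.sum()
    nonzero = p[p > 0]
    entropy_bits = -np.sum(nonzero * np.log2(nonzero))
    return 2.0 ** entropy_bits

# Made-up example: 20 other points at distances 1, 2, ..., 20 from x_i.
sq_dists = np.arange(1, 21, dtype=float) ** 2

sigmas = [0.5, 2.0, 8.0, 100.0]
perps = [perplexity_of_row(sq_dists, sigma) for sigma in sigmas]
for sigma, perp in zip(sigmas, perps):
    print(sigma, round(perp, 2))
```

A narrow Gaussian gives a perplexity near 1 (only the nearest point counts), while a very wide one approaches 20, the total number of other points. Perplexity can therefore be read as an effective number of neighbors.
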
We now use the scikit-learn implementation of t-SNE to reduce the dimensionality of the iris dataset.

In [None]:
from sklearn.manifold import TSNE
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()
target_names = iris.target_names
X_tsne = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=5).fit_transform(iris.data)
#print (iris.data.shape, X_tsne.shape)
plt.figure()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target)
plt.title('t-SNE of IRIS dataset')
plt.show()

In [None]:
X_tsne = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=30).fit_transform(iris.data)
plt.figure()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target)
plt.title('t-SNE of IRIS dataset')
plt.show()

In [None]:
X_tsne = TSNE(learning_rate=100, n_components=2, random_state=0, perplexity=50).fit_transform(iris.data)
plt.figure()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=iris.target)
plt.title('t-SNE of IRIS dataset')
plt.show()

<h3> Visualizing Handwritten Digits </h3>
Load the digits data set.

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
print (digits.data.shape)
#print(digits['DESCR'])
imagea = digits.images
targeta = digits.target
print (imagea.shape, targeta.shape)

In [None]:
# The data that we are interested in is made of 8x8 images of digits. For these images, 
# we know which digit they represent: it is given in the 'target' of the dataset.
nrows, ncols = 2, 5
plt.figure(figsize=(6,3))
plt.gray()
for i in range(ncols * nrows):
    ax = plt.subplot(nrows, ncols, i + 1)
    ax.matshow(digits.images[i])
    plt.axis('off')
    plt.title(digits.target[i])

plt.show()

In [None]:
# print ten images of handwritten zero
XI = digits.images[digits.target==0]
print (XI.shape)
nrows, ncols = 2, 5
plt.figure(figsize=(6,3))
plt.gray()
for i in range(ncols * nrows):
    ax = plt.subplot(nrows, ncols, i + 1)
    ax.matshow(XI[i])
    plt.axis('off')
    
plt.show()

In [None]:
for i in range(10):
    XT = digits.data[digits.target==i]
    print (i, XT.shape)  # number of images for a digit type

In [None]:
# Reorder the data points according to the handwritten numbers.
X = np.vstack([digits.data[digits.target==i]
               for i in range(10)])
y = np.hstack([digits.target[digits.target==i]
               for i in range(10)])
print (X.shape, y.shape)
print (y[0:20], y[178:198])

In [None]:
# Random state.
RS = 20150101
digits_proj = TSNE(random_state=RS).fit_transform(X)

In [None]:
# imports for plotting 
import matplotlib.patheffects as PathEffects
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
sns.set_context("notebook", font_scale=1.5,
                rc={"lines.linewidth": 2.5})
def scatter(x, colors):
    # We choose a color palette with seaborn.
    palette = np.array(sns.color_palette("hls", 10))

    # We create a scatter plot.
    f = plt.figure(figsize=(8, 8))
    ax = plt.subplot(aspect='equal')
    sc = ax.scatter(x[:,0], x[:,1], lw=0, s=40,
                    c=palette[colors.astype(int)])  # np.int was removed in NumPy 1.24
    plt.xlim(-25, 25)
    plt.ylim(-25, 25)
    ax.axis('off')
    ax.axis('tight')

    # We add the labels for each digit.
    txts = []
    for i in range(10):
        # Position of each label.
        xtext, ytext = np.median(x[colors == i, :], axis=0)
        txt = ax.text(xtext, ytext, str(i), fontsize=24)
        txt.set_path_effects([
            PathEffects.Stroke(linewidth=5, foreground="w"),
            PathEffects.Normal()])
        txts.append(txt)

    return f, ax, sc, txts

In [None]:
scatter(digits_proj, y)
plt.show()

Exercise 3: Use PCA to reduce the handwritten digits dataset to 2D and show the visualization. Which method, PCA or t-SNE, does a better job of reducing the dimension of this data?

<h3> Working with Real Estate Dataset </h3>

In [None]:
#Adapted from: https://www.kaggle.com/yohanb/t-sne-bokeh/notebook , Author: Yohan
df = pd.read_csv('resources/train_house.csv')
df.head()

In [None]:
y = df.pop('SalePrice') # take sale price column out as a separate column

In [None]:
y.head()

In [None]:
# get column names that has numerical values
num_cols = df.select_dtypes(include=['float64', 'int64']).columns.values 

In [None]:
num_cols

In [None]:
# get column names that has categorical values
cat_cols = df.select_dtypes(include=['object']).columns.values 

In [None]:
cat_cols

In [None]:
# Create a data frame of numerical columns and fill NA values
# in a column by its median value
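# (take the median of the numerical columns only; recent pandas versions
# raise an error when median() is called on non-numeric columns)
X_num = df[num_cols].fillna(df[num_cols].median())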
X_num.head()

In [None]:
ids = X_num.pop('Id')  

In [None]:
X_num.head()

In [None]:
ids.head()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # get a scaler
X_num_scaled = pd.DataFrame(scaler.fit_transform(X_num), columns=X_num.columns.values) 


In [None]:
X_num_scaled.head()

In [None]:
X_tsne = TSNE(learning_rate=100, n_components=2, random_state=0, 
              perplexity=5).fit_transform(X_num_scaled)


In [None]:
np.shape(X_tsne)

In [None]:
np.shape(y)

In [None]:
cmap = plt.get_cmap('jet')  # plt.cm.get_cmap was removed in recent matplotlib versions
plt.figure()
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap=cmap, s=20)
plt.colorbar()
plt.title('t-SNE of housing dataset')
plt.show()

In [None]:
df.head()

In [None]:
import matplotlib as mpl
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure
output_notebook()
source_X = ColumnDataSource(
        data=dict(
            x = X_tsne[:len(y),0],
            y = X_tsne[:len(y),1],
            desc = y,
            colors = ["#%02x%02x%02x" % (int(r), int(g), int(b)) for r, g, b, _ in 
                      255*mpl.cm.jet(mpl.colors.Normalize()(y.values))],
            OverallQual = df['OverallQual'].iloc[:len(y)],
            GrLivArea = df['GrLivArea'].iloc[:len(y)],
            GarageCars = df['GarageCars'].iloc[:len(y)]
        )
    )

hover = HoverTool(tooltips=[("Price", "@desc"), 
                            ("OverallQual", "@OverallQual"), 
                            ("GrLivArea", "@GrLivArea"), 
                            ("GarageCars", "@GarageCars")])
#tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
#plot_tsne = figure(plot_width=600, plot_height=600, tools=tools_tsne, title='Prices')

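# Bokeh 3 replaced plot_width/plot_height with width/height
p = figure(width=800, height=800, tools=[hover],
           title="Housing Data")

# scatter draws circles by default; circle(size=...) is deprecated in recent Bokeh
p.scatter('x', 'y', size=10, source=source_X, fill_color='colors')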
show(p)





In [None]:
data=dict(
            f1 = X_tsne[:len(y),0],
            f2 = X_tsne[:len(y),1],
            sale = y,
            quality = df['OverallQual'].iloc[:len(y)]
            )

In [None]:
dfn = pd.DataFrame(data)

In [None]:
dfn.head()

In [None]:
sns.lmplot( x="f1", y="f2", data=dfn, fit_reg=False, hue='sale', palette="coolwarm", legend=False)


In [None]:
sns.lmplot( x="f1", y="f2", data=dfn, fit_reg=False, hue='quality', palette="coolwarm", legend=True)

Exercise 4: Repeat the above process for performing t-SNE and various plots on numerical columns of housing data where column "OverallQual" has a value of 3, 5, 7, or 10. 