# Using dimensionality reduction for visualization

Sometimes dimensionality reduction is used to reduce the size of a dataset to make machine learning more tractable.

However, this is generally avoided because we don't want to throw away information in the training data.  The danger of throwing away valuable information is especially worrisome because dimensionality reduction methods like PCA are unsupervised, so they don't know what information is most important relative to your target variable.
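The point about PCA being unsupervised can be illustrated with a small sketch.  The data here is synthetic and deliberately unscaled, and all variable names are made up for the illustration: one feature has large variance but is unrelated to the target, while a low-variance feature fully determines it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n = 1000
# Feature 0 has large variance but is unrelated to the target;
# feature 1 has small variance and fully determines the target.
x0 = rng.normal(scale=10.0, size=n)
x1 = rng.normal(scale=1.0, size=n)
X_demo = np.column_stack([x0, x1])
y_demo = (x1 > 0).astype(int)

# PCA is fit on X alone; y is never passed in.
pca_demo = PCA(n_components=1)
z_demo = pca_demo.fit_transform(X_demo)[:, 0]

# The retained component is almost exactly feature 0 ...
print("component weights:", pca_demo.components_[0])
# ... so it carries almost no information about the target.
print("corr(z, y): ", np.corrcoef(z_demo, y_demo)[0, 1])
print("corr(x1, y):", np.corrcoef(x1, y_demo)[0, 1])
```

Keeping one component here discards the only feature that matters for the target, because PCA ranks directions by variance alone.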

The use of neural nets also makes dimensionality reduction less important, because neural nets can handle large data objects like images.  In other words, neural nets can handle lots of input variables.

On the other hand, dimensionality reduction is commonly used for visualization, especially for visualizing data with 2D scatterplots.

We learned about many dimensionality reduction methods -- how do they differ in visualizing data?  How would you know which method to use for visualizations?

In this assignment you will use several dimensionality reduction methods to visualize the wine, college, and Swiss roll data sets, and you will see how the visualizations differ.

You will also use dimensionality reduction methods to visualize the results of cluster analysis.

v0.2  author: Glenn Bruns

### Instructions:

Read through the code, then enter code in the cell below each numbered problem.

Most of the problems are very similar.  Write supporting functions to avoid lots of duplicated code.  Part of your grade will be based on how you factor your code.

The instructions are not detailed.  I expect you to think and to use good judgement.

Restart your notebook and run from top to bottom before submitting.

In [None]:
import os
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import LocallyLinearEmbedding, TSNE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.datasets import make_swiss_roll

In [None]:
# use this cell for any imports you want to add

In [None]:
# use if needed to suppress Jupyter notebook warnings
# warnings.filterwarnings('ignore')

#### Red wine data

In [None]:
dfw = pd.read_csv("https://raw.githubusercontent.com/grbruns/cst495/master/winequality-red.csv", sep=";")

In [None]:
# summary
dfw.info()

In [None]:
Xw = dfw.iloc[:,0:11].values
yw = dfw.iloc[:,11].values

In [None]:
# normalize the data
scaler = StandardScaler()
Xw = scaler.fit_transform(Xw)

#### College data

In [None]:
dfc = pd.read_csv('https://raw.githubusercontent.com/grbruns/cst383/master/College.csv', index_col=0)

In [None]:
# summary
dfc.info()

In [None]:
Xc = dfc.iloc[:,1:].values
yc = (dfc.iloc[:,0] == "Yes").astype(int).values

In [None]:
# normalize the data
scaler = StandardScaler()
Xc = scaler.fit_transform(Xc)

#### Swiss roll

In [None]:
Xr, yr = make_swiss_roll(n_samples=1000, noise=0.05, random_state=1)

### Supporting functions

Define any supporting functions you want in this section of the notebook.
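As one possible way to factor the repeated work, here is a minimal sketch, not a required design.  The names `embed_2d` and `embed_all` are hypothetical, and synthetic blobs stand in for the real data sets.  The idea is that every experiment differs only in which reducer is built.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler


def embed_2d(reducer, X):
    """Return a 2D embedding of X using any object with fit_transform."""
    return reducer.fit_transform(X)


def embed_all(make_reducer, datasets):
    """Apply a freshly built reducer to each named data set.

    make_reducer: zero-argument callable returning a new reducer
    datasets: dict mapping a name to a feature matrix
    """
    return {name: embed_2d(make_reducer(), X) for name, X in datasets.items()}


# Synthetic stand-ins for the real data (assumption: any numeric arrays work).
X_a, _ = make_blobs(n_samples=150, n_features=6, centers=3, random_state=0)
X_b, _ = make_blobs(n_samples=100, n_features=4, centers=2, random_state=1)
demo_sets = {
    "a": StandardScaler().fit_transform(X_a),
    "b": StandardScaler().fit_transform(X_b),
}

# Each experiment is then a single line that differs only in the factory.
Z_pca = embed_all(lambda: PCA(n_components=2), demo_sets)
Z_rbf = embed_all(lambda: KernelPCA(n_components=2, kernel="rbf"), demo_sets)
print({name: Z.shape for name, Z in Z_pca.items()})
```

Passing a factory rather than a fitted object means each data set gets its own fresh reducer, so nothing fit on one data set leaks into another.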

### Shared variables

You may want to use one or more code cells in this section of the notebook to define data you will use in the rest of the notebook.

## Problem 1.  Apply dimensionality reduction to the data sets

### 1a.  PCA

Transform the data to 2D using PCA for the wine, college, and Swiss roll data.  Then plot the data.  Use the labels to color the points in 2D.  For example, with the wine data, the data to be plotted is Xw, and yw should be used to color the points.

Seaborn scatterplots make it easy to color the points.

I think putting the three plots side by side is a good idea, but you can use your own judgement.
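A rough sketch of side-by-side plots colored by label follows.  It uses synthetic stand-in data rather than the assignment's data sets, and the panel names and styling choices are assumptions; treat it as one way to lay out the figure, not the expected answer.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_swiss_roll
from sklearn.decomposition import PCA

# Synthetic stand-ins (assumption): one set with class labels, one with a
# continuous label, to show that hue handles both.
X_blobs, y_blobs = make_blobs(n_samples=200, n_features=5, centers=3, random_state=0)
X_roll, y_roll = make_swiss_roll(n_samples=200, noise=0.05, random_state=1)
panels = [("blobs", X_blobs, y_blobs), ("swiss roll", X_roll, y_roll)]

fig, axes = plt.subplots(1, len(panels), figsize=(5 * len(panels), 4))
for ax, (name, X, y) in zip(axes, panels):
    Z = PCA(n_components=2).fit_transform(X)
    # hue colors each point by its label; the legend is dropped to reduce clutter
    sns.scatterplot(x=Z[:, 0], y=Z[:, 1], hue=y, palette="viridis",
                    s=15, legend=False, ax=ax)
    ax.set_title(f"PCA: {name}")
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
fig.tight_layout()
```

Passing `ax=ax` is what places each seaborn plot into its own panel of the shared figure.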

In [None]:
### YOUR CODE HERE

### 1b.  Kernel PCA, RBF kernel.

This problem is just like the previous problem, but with kernel PCA.  Use the RBF kernel.  Use the default value for other hyperparameters.

In [None]:
### YOUR CODE HERE

### 1c.  Kernel PCA, polynomial kernel.

Use Kernel PCA with a polynomial kernel of degree 2.
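The RBF and polynomial variants of kernel PCA differ only in their constructor arguments.  The sketch below shows both on synthetic concentric circles, which are an assumption chosen for illustration and not one of the assignment's data sets.  Note that `degree` must be set explicitly, since scikit-learn's default degree is 3.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a common example of structure that is not
# linearly separable (synthetic data, chosen only for illustration).
X_circ, y_circ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# RBF kernel, other hyperparameters left at their defaults
# (gamma=None means 1 / n_features in scikit-learn).
kpca_rbf = KernelPCA(n_components=2, kernel="rbf")
Z_circ_rbf = kpca_rbf.fit_transform(X_circ)

# Polynomial kernel of degree 2.
kpca_poly = KernelPCA(n_components=2, kernel="poly", degree=2)
Z_circ_poly = kpca_poly.fit_transform(X_circ)

print(Z_circ_rbf.shape, Z_circ_poly.shape)
```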

In [None]:
### YOUR CODE HERE

### 1d.  LLE with default n_neighbors

Use the default value for n_neighbors.  Use the default value for other hyperparameters.
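If you are unsure what the default is, you can read it off the estimator.  The following sketch does that, then fits LLE on a small synthetic Swiss roll for the default and for the neighborhood sizes used in the later parts.  The variable names and the choice of `random_state` are assumptions made only to keep the sketch repeatable.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X_lle, t_lle = make_swiss_roll(n_samples=300, noise=0.05, random_state=1)

# Inspect the default rather than relying on memory of the docs.
default_k = LocallyLinearEmbedding().n_neighbors
print("default n_neighbors:", default_k)

# Same data, different neighborhood sizes; random_state fixes the solver's
# starting point so reruns give the same picture.
lle_embeddings = {}
for k in (default_k, 10, 20):
    lle = LocallyLinearEmbedding(n_components=2, n_neighbors=k, random_state=0)
    lle_embeddings[k] = lle.fit_transform(X_lle)
    print(k, lle_embeddings[k].shape, "reconstruction error:", lle.reconstruction_error_)
```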

In [None]:
### YOUR CODE HERE

### 1e.  LLE with n_neighbors = 10

In [None]:
### YOUR CODE HERE

### 1f.  LLE with n_neighbors = 20

In [None]:
### YOUR CODE HERE

### 1g.  tSNE

Use tSNE with default hyperparameters.
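A minimal sketch of calling t-SNE is shown below, on synthetic blobs (an assumption, as is the use of `random_state`, which is set only so that reruns match).  One practical difference from PCA is worth seeing in code: the estimator embeds only the points it was fit on.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X_ts, y_ts = make_blobs(n_samples=200, n_features=8, centers=4, random_state=0)

# Defaults otherwise; random_state is set only to make the sketch repeatable.
tsne = TSNE(n_components=2, random_state=0)
Z_ts = tsne.fit_transform(X_ts)
print(Z_ts.shape)

# t-SNE embeds only the points it was fit on: there is no separate transform().
print("has transform:", hasattr(tsne, "transform"))
```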

In [None]:
### YOUR CODE HERE

### 1h.  Summary

Discuss the results of the part 1 experiments in the markdown cell below.

Make thoughtful observations -- go beyond the obvious.

*** Replace this text with your thoughts. ***

## Problem 2.  Apply dimensionality reduction after cluster analysis

Here is a k-means cluster analysis object.

In [None]:
kmeans = KMeans(n_clusters = 2, n_init='auto')

### 2a.  PCA

For each data set, first perform k-means cluster analysis, then reduce the dimensionality of the data to 2D and plot using a scatterplot.  Use each point's cluster as its color.

The purpose of this application of dimensionality reduction is to see if the clustering appears to be effective.  We cannot look at the clusters in high-dimensional space.

The use of cluster analysis does not modify the data that dimensionality reduction is being applied to; it is only used to determine the color of each point.
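The order of operations described above can be sketched as follows.  The data is synthetic, the names are hypothetical, and `n_init=10` with a fixed `random_state` is used here only so the sketch behaves the same across scikit-learn versions and reruns.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X_hd, _ = make_blobs(n_samples=300, n_features=10, centers=2, random_state=0)
X_before = X_hd.copy()

# 1. Cluster in the original 10-dimensional space.
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = km_demo.fit_predict(X_hd)

# 2. Reduce the same, unmodified data to 2D for plotting.
Z_hd = PCA(n_components=2).fit_transform(X_hd)

# The cluster ids are used only as colors, e.g. hue=cluster_ids in a scatterplot.
print(Z_hd.shape, np.unique(cluster_ids))
print("data unchanged by clustering:", np.array_equal(X_hd, X_before))
```

Clustering on the full-dimensional data and only then projecting means the plot shows how the real clusters look in 2D, rather than clusters found in an already-reduced space.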

In [None]:
### YOUR CODE HERE

### 2b.  Kernel PCA, RBF kernel

Do the same thing as the last problem, but this time use kernel PCA with an RBF kernel.

In [None]:
### YOUR CODE HERE

### 2c.  Kernel PCA, polynomial kernel

In [None]:
### YOUR CODE HERE

### 2d.  LLE, default n_neighbors

In [None]:
### YOUR CODE HERE

### 2e.  LLE, n_neighbors = 10

In [None]:
### YOUR CODE HERE

### 2f.  LLE, n_neighbors = 20

In [None]:
### YOUR CODE HERE

### 2g.  tSNE

In [None]:
### YOUR CODE HERE

### 2h.  Summary

Discuss the results of the part 2 experiments in the markdown cell below.

Make thoughtful observations -- go beyond the obvious.

*** Replace this text with your thoughts. ***

## Problem 3. Perform further experiments

Add as many cells as you want to perform further experiments.

There are many possible things to explore.  Ask your instructor if you are not sure about what to try.

Be sure to include a clearly-labeled summary at the end to discuss your findings.

You don't need to try lots of different things.  Be thoughtful in your choice of things to try.  Use your curiosity.