# SI 618 Day 9: Dimension Reduction

Version 2023.03.07.3.CT

## Task (to generate data for use later in today's class):
Fill in the [spreadsheet](https://docs.google.com/spreadsheets/d/178npckIJAcp0vY2TEmg9Sn3sJYUYYDSAmw8_ouxZR7A/edit?usp=sharing) with your music preferences.  Rate each genre on a scale of 1 to 10, with 1 being "no way" and 10 being "the best".  Note that you are rating each genre independently on that scale; you are not ranking the genres from 1 to 10.  Thus, you can have all 10s if you love all genres of music, or all 1s if you hate music in general.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import manifold

## Demo

Let's set up a really simple dataframe to play with:

In [None]:
demo = pd.DataFrame({'a': [1, 2, 3, 1], 'b': [1, 4, 6, 1], 'c': [2, 4, 6, 3]},
                    index=['Chris', 'Xin', 'Arjun', 'Buko'])

In [None]:
demo

And then let's split the dataframe into X and y matrices:

In [None]:
demo_X = demo.values

In [None]:
demo_X

In [None]:
demo_y = demo.index

In [None]:
demo_y

## Multi-dimensional scaling (MDS)

In [None]:
nmds = manifold.MDS(n_components=2,
                    metric=False,
                    max_iter=3000,
                    eps=1e-9,
                    random_state=42,
                    dissimilarity='euclidean',
                    normalized_stress='auto',
                    n_jobs=1)

In [None]:
npos = nmds.fit_transform(demo_X)

In [None]:
npos

In [None]:
npos_labelled = pd.concat([pd.DataFrame({'who': demo_y}), pd.DataFrame(npos)], axis=1)

In [None]:
npos_labelled.columns = ['who', 'mds1', 'mds2']

In [None]:
npos_labelled

In [None]:
p1 = sns.scatterplot(data=npos_labelled, x='mds1', y='mds2')

In [None]:
# Based on https://stackoverflow.com/questions/46027653/adding-labels-in-x-y-scatter-plot-with-seaborn
p1 = sns.scatterplot(data=npos_labelled, x='mds1', y='mds2')
# Write each person's name just to the right of their point
for line in range(npos_labelled.shape[0]):
    p1.text(npos_labelled['mds1'][line] + 0.01, npos_labelled['mds2'][line],
            npos_labelled['who'][line], horizontalalignment='left',
            size='medium', color='black')

In [None]:
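def labelled_scatterplot(data=None, x=None, y=None, labs=None):
    """Scatterplot of data[x] vs. data[y], with each point labelled by data[labs]."""
    p1 = sns.scatterplot(data=data, x=x, y=y)
    # Assumes data has the default 0..n-1 index, as the dataframes built above do
    for line in range(data.shape[0]):
        # The label 'Chris T' is drawn in red; every other label is black
        if data[labs][line] == 'Chris T':
            c = 'red'
        else:
            c = 'black'
        p1.text(data[x][line] + 0.01, data[y][line],
                data[labs][line], horizontalalignment='left',
                size='medium', color=c)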

In [None]:
labelled_scatterplot(npos_labelled,'mds1','mds2','who')

## Principal Components Analysis (PCA)

In [None]:
demo

In [None]:
from sklearn.preprocessing import scale
scale(demo_X)

In [None]:
# Each scaled column should have variance 1
# (np.var and scale both use the population formula, ddof=0)
np.var(scale(demo_X), axis=0)

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(scale(demo_X))

In [None]:
X_pca

In [None]:
X_pca_labelled = pd.concat([pd.DataFrame({'who':demo_y}),pd.DataFrame(X_pca,columns=['pca1','pca2'])],axis=1)

In [None]:
X_pca_labelled

In [None]:
labelled_scatterplot(data=X_pca_labelled,x='pca1',y='pca2',labs='who')

In [None]:
pca.explained_variance_

In [None]:
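# rowvar=False treats columns (the components) as the variables;
# the diagonal should then match pca.explained_variance_
np.cov(X_pca, rowvar=False)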

In [None]:
pca.explained_variance_ratio_

### Question: 
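What's the expected explained variance ratio for each component if no variable is more informative than any other?

**Answer:** for n standardized, uncorrelated variables, each component is expected to explain 1/n of the total variance. A component whose ratio is well above 1/n is summarizing more than one variable's worth of information.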
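To make the 1/n baseline concrete, here is a small sketch on a made-up four-row dataset (`toy_X`, not the class data) whose three columns are exactly uncorrelated and have equal variance. Under that assumption PCA has no shared structure to exploit, so each component should explain one third of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: three columns, each with mean 0 and
# mutually orthogonal, so every pairwise correlation is exactly 0.
toy_X = np.array([[ 1,  1,  1],
                  [ 1, -1, -1],
                  [-1,  1, -1],
                  [-1, -1,  1]], dtype=float)

toy_pca = PCA(n_components=3).fit(toy_X)
toy_ratios = toy_pca.explained_variance_ratio_
print(toy_ratios)
```

With real data the variables are usually correlated, so the first few ratios sit above 1/n and the remaining ones below it.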

## t-SNE

In [None]:
# import t-SNE package from sklearn
from sklearn.manifold import TSNE

In [None]:
demo_X

In [None]:
tsne = TSNE(n_components=2, perplexity=2, random_state=0)
X = demo_X.copy()
X_2d = tsne.fit_transform(X)

In [None]:
X_2d

In [None]:
tsne_labelled = pd.concat([pd.DataFrame({'who':demo_y}),pd.DataFrame(X_2d,columns=['d1','d2'])],axis=1)

In [None]:
tsne_labelled

In [None]:
labelled_scatterplot(data=tsne_labelled, x='d1', y='d2', labs='who')

### t-SNE demo
https://cs.stanford.edu/people/karpathy/tsnejs/csvdemo.html

# In your groups

Let's read the CSV file of the music data we generated at the start of today's class:

In [None]:
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vTRe8guKi6zKf4_rQr8BiNmb2-V1Qq72vV7ZqSQ9Upeo6TsBtgFyZ4kk_IJgGdXx6kPdJP6NC_s_HOO/pub?gid=0&single=true&output=csv"
music = pd.read_csv(url)

In [None]:
music

In [None]:
music.info()

In [None]:
music.describe()

In [None]:
sns.heatmap(music.corr(numeric_only=True))

In [None]:
sns.pairplot(music)

## Task
Create X (features matrix) and y (labels vector) from the `music` dataframe:

In [None]:
# insert your code here

## Task
Perform a multi-dimensional scaling on the music data.  Should you use metric or non-metric MDS?  Do the 
results differ between metric and non-metric?  Visualize your results.

In [None]:
# Insert your code here

## Task
Perform a principal components analysis (PCA) on the music data.  Do you think you should scale the data before you do the PCA?  How many principal components should you retain?  (Hint: look at a scree plot and/or the eigenvalues, a.k.a. the `explained_variance_` attribute of the PCA model.)  Visualize your results.
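As a hint on why scaling matters, the sketch below uses made-up data (`fake_features`, an assumption, not the music ratings) in which one column is measured in units 100 times larger than the rest. It compares PCA with and without scaling, then pulls out the eigenvalues and cumulative explained variance you would inspect when deciding how many components to keep.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Hypothetical data: 50 rows, 4 independent columns, with the first
# column on a scale 100 times larger than the others.
rng = np.random.default_rng(0)
fake_features = rng.normal(size=(50, 4))
fake_features[:, 0] *= 100

raw_pca = PCA().fit(fake_features)
scaled_pca = PCA().fit(scale(fake_features))

# Without scaling, the large-unit column dominates the first component.
print(raw_pca.explained_variance_ratio_)
# After scaling, the variance is spread far more evenly.
print(scaled_pca.explained_variance_ratio_)

# Eigenvalues (the y-axis of a scree plot) and cumulative share of variance.
eigenvalues = scaled_pca.explained_variance_
cumulative = np.cumsum(scaled_pca.explained_variance_ratio_)
print(eigenvalues)
print(cumulative)
```

Plotting `eigenvalues` against the component number gives a scree plot. Whether the music ratings need scaling is a separate judgment, since every genre is already rated on the same 1 to 10 scale.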

In [None]:
# insert your code here

### Visualizing principal components

In [None]:
def pca_results(data, columns, pca):
    
    # Dimension indexing
    dimensions = ['Dimension {}'.format(i) for i in range(1, len(pca.components_)+1)]
    
    # PCA components
    components = pd.DataFrame(np.round(pca.components_, 4), columns=columns)
    components.index = dimensions

    # PCA explained variance
    ratios = pca.explained_variance_ratio_.reshape(len(pca.components_), 1) 
    variance_ratios = pd.DataFrame(np.round(ratios, 4), columns = ['Explained Variance'])
    variance_ratios.index = dimensions

    # Create a bar plot visualization
    fig, ax = plt.subplots(figsize=(14,8))

    # Plot the feature weights as a function of the components
    components.plot(ax=ax, kind='bar')
    ax.set_ylabel("Feature Weights")
    ax.set_xticklabels(dimensions, rotation=0)

    # Display the explained variance ratios
    for i, ev in enumerate(pca.explained_variance_ratio_):
        ax.text(i-0.40, ax.get_ylim()[1] + 0.05, "Explained Variance\n %.4f"%(ev))

    # Return a concatenated DataFrame
    return pd.concat([variance_ratios, components], axis=1)



In [None]:
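# Pass the column names of whichever dataframe `pca` was fit on:
# demo.columns for the demo above, or your music feature columns once you have refit `pca`.
pcax = pca_results(X_pca, demo.columns, pca)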

## Task
Perform a t-SNE analysis of the music data.  Experiment with different hyperparameters (e.g. `perplexity` and `n_iter`, which is named `max_iter` in newer scikit-learn releases) to see how your solution changes.  Visualize your results.
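One way to organize the experiment is to loop over several perplexity values and keep each embedding for later plotting. This is only a sketch: `fake_ratings` is a randomly generated stand-in for the music ratings, and the perplexity values are arbitrary choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for the ratings: 30 people, 6 genres, scores 1-10.
rng = np.random.default_rng(0)
fake_ratings = rng.integers(1, 11, size=(30, 6)).astype(float)

# Perplexity must be smaller than the number of rows.
embeddings = {}
for perplexity in (5, 10, 20):
    tsne_model = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne_model.fit_transform(fake_ratings)

for perplexity, coords in embeddings.items():
    print(perplexity, coords.shape)
```

Each entry of `embeddings` can be wrapped in a dataframe and passed to `labelled_scatterplot`, as was done for the demo data.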

In [None]:
# Insert your code here

## Task
Compare the three analyses (MDS, PCA, and t-SNE).  Comment on similarities and differences.  What do you think the best technique is to use with the music data?

Insert your answer here.

## Stretch task

Use a Pipeline to perform a PCA on the music data.

## A few words about pipelines

Consider the following pipeline:

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale',StandardScaler()),
    ('pca', PCA(n_components=5,random_state=42)),
])

In [None]:
pipe

The pipeline can be queried by using `get_params()`:

In [None]:
pipe.get_params()

The `named_steps` attribute holds (unsurprisingly) the named steps of the pipeline:

In [None]:
pipe.named_steps

The steps themselves can be accessed as attributes of the `named_steps` property:

In [None]:
pipe.named_steps.pca

And specifics about the step can be modified by assigning new values to them:

In [None]:
pipe.named_steps.pca.n_components = 3

In [None]:
pipe.named_steps.pca
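The same change can also be made with `set_params()`, using the `<step name>__<parameter>` naming that `get_params()` reports. The sketch below rebuilds the pipeline under a different name (`sketch_pipe`) and fits it to randomly generated data (`fake_matrix`, an assumption standing in for the music features) to show the whole thing running end to end.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

sketch_pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=5, random_state=42)),
])

# Same effect as assigning sketch_pipe.named_steps.pca.n_components = 3
sketch_pipe.set_params(pca__n_components=3)

# Hypothetical data: 20 rows, 6 numeric columns
rng = np.random.default_rng(0)
fake_matrix = rng.normal(size=(20, 6))

# fit_transform runs StandardScaler first, then PCA on the scaled values
fake_scores = sketch_pipe.fit_transform(fake_matrix)
print(fake_scores.shape)
```

After fitting, the PCA attributes used earlier (for example `explained_variance_ratio_`) are available on `sketch_pipe.named_steps.pca`.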