# How to Use UMAP

This material is taken directly from the UMAP documentation:

https://umap-learn.readthedocs.io/en/latest/basic_usage.html#

Before proceeding, install UMAP from a terminal command line using either pip or conda:

* pip install umap-learn
* conda install -c conda-forge umap-learn
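
Once either command finishes, a quick check from Python can confirm that the package is importable. This is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical, and `umap` is the import name used by the code later in this notebook.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None

# "umap" is the module imported further down in this notebook.
status = "found" if is_installed("umap") else "missing"
print(f"umap: {status}")
```

If the check reports `missing`, rerun the install command in the same environment that the notebook kernel uses.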

In addition to using umap, we'll also be using seaborn for the first time:

* seaborn: statistical data visualization

https://seaborn.pydata.org/

The following code blocks come from umap-learn.readthedocs. Rather than cutting and pasting
content from there, read the descriptions provided there and execute the code here:

In [1]:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
%matplotlib inline
import umap.umap_ as umap

Set the seaborn style:

In [None]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

Load the penguins data set and peek at the head of the dataframe:

In [None]:
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/c19a904462482430170bfe2c718775ddb7dbb885/inst/extdata/penguins.csv")
penguins.head()

Notice there are NaN entries, which we will drop:

In [None]:
penguins = penguins.dropna()
penguins.species.value_counts()

The following pair plots are an interesting, though involved, way to visualize a high-dimensional
data set. What are the diagonal entries?

In [None]:
sns.pairplot(penguins.drop("year", axis=1), hue='species');

Now we introduce UMAP:

In [None]:
reducer = umap.UMAP()

In [None]:
penguin_data = penguins[
    [
        "bill_length_mm",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ]
].values

In [None]:
scaled_penguin_data = StandardScaler().fit_transform(penguin_data)
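
As a rough illustration of what the scaling step does: `StandardScaler` with its default settings subtracts each column's mean and divides by its standard deviation, so every feature ends up with mean 0 and unit variance. The sketch below reproduces that with NumPy on a small made-up array (the values and the names `toy` and `scaled_toy` are illustrative assumptions, not taken from the penguins data).

```python
import numpy as np

# Hypothetical measurements: rows are samples, columns are features
# on very different scales (for example millimetres vs. grams).
toy = np.array([
    [39.1, 3750.0],
    [46.5, 3500.0],
    [50.0, 5700.0],
    [36.7, 3450.0],
])

# Column-wise z-score: subtract the mean, then divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses by default.
means = toy.mean(axis=0)
stds = toy.std(axis=0)
scaled_toy = (toy - means) / stds

print(scaled_toy.mean(axis=0))  # approximately 0 for each column
print(scaled_toy.std(axis=0))   # approximately 1 for each column
```

Without this step, a Euclidean distance between penguins would be dominated by `body_mass_g`, simply because its raw numbers are much larger than the millimetre measurements.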

In [None]:
embedding = reducer.fit_transform(scaled_penguin_data)
embedding.shape

In [None]:
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in penguins.species.map({"Adelie":0, "Chinstrap":1, "Gentoo":2})])
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Penguin dataset', fontsize=24);

# Basic UMAP Parameters: Random Colors

The following is from: https://umap-learn.readthedocs.io/en/latest/parameters.html

In [None]:
sns.set(style='white', context='poster', rc={'figure.figsize':(14,10)})

In [None]:
np.random.seed(42)
data = np.random.rand(800, 4)
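
Each row of `data` holds four values in the interval [0, 1), which is why the plotting cells can pass `c=data` directly: Matplotlib reads an N×4 array as one RGBA colour per point. A small sketch that regenerates the same array under a separate, hypothetical name (`colors`) and confirms its shape and range:

```python
import numpy as np

np.random.seed(42)
colors = np.random.rand(800, 4)  # same call as above, under a separate name

# One row per point; the four columns are read as red, green, blue, alpha.
print(colors.shape)

# Values from np.random.rand lie in the half-open interval [0, 1).
print(colors.min() >= 0.0, colors.max() < 1.0)
```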

In [None]:
fit = umap.UMAP()
%time u = fit.fit_transform(data)

In [None]:
plt.scatter(u[:,0], u[:,1], c=data)
plt.title('UMAP embedding of random colours');

In [None]:
def draw_umap(n_neighbors=15, min_dist=0.1, n_components=2, metric='euclidean', title=''):
    # Fit UMAP on the global `data` array with the requested parameters.
    fit = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=n_components,
        metric=metric
    )
    u = fit.fit_transform(data)
    fig = plt.figure()
    # Each point is coloured by its own row of `data`.
    if n_components == 1:
        # 1D embedding: use the sample index as the y coordinate.
        ax = fig.add_subplot(111)
        ax.scatter(u[:,0], range(len(u)), c=data)
    elif n_components == 2:
        ax = fig.add_subplot(111)
        ax.scatter(u[:,0], u[:,1], c=data)
    elif n_components == 3:
        ax = fig.add_subplot(111, projection='3d')
        ax.scatter(u[:,0], u[:,1], u[:,2], c=data, s=100)
    plt.title(title, fontsize=18)


In [None]:
for n in (2, 5, 10, 20, 50, 100, 200):
    draw_umap(n_neighbors=n, title='n_neighbors = {}'.format(n))
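
`n_neighbors` sets how many nearby points UMAP considers when it builds its picture of the local structure around each point. One way to get a feel for the values in the loop above is to measure how far away the k-th nearest neighbor is in the original 4-dimensional data: small k stays very local, large k reaches much further. The sketch below does this with a brute-force NumPy distance computation. It illustrates neighborhood size only and is not a reproduction of UMAP's internal neighbor search; the names `points` and `kth_neighbor_distance` are hypothetical.

```python
import numpy as np

np.random.seed(42)
points = np.random.rand(800, 4)  # same shape as the random colour data

# Brute-force pairwise Euclidean distances (800 x 800).
diffs = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=-1))

# Sort each row; column 0 is the point itself (distance 0).
sorted_dists = np.sort(dists, axis=1)

def kth_neighbor_distance(k):
    """Mean distance from each point to its k-th nearest neighbor."""
    return sorted_dists[:, k].mean()

for k in (2, 5, 10, 20, 50, 100, 200):
    print(f"k = {k:3d}: mean distance = {kth_neighbor_distance(k):.3f}")
```

The printed distances grow with k, which matches what the plots show: low `n_neighbors` emphasizes fine local detail, while high values pull in broader structure.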

In [None]:
for d in (0.0, 0.1, 0.25, 0.5, 0.8, 0.99):
    draw_umap(min_dist=d, title='min_dist = {}'.format(d))


In [None]:
draw_umap(n_components=1, title='n_components = 1')

In [None]:
draw_umap(n_components=3, title='n_components = 3')