## Visualization in Python

*Authors: Zach del Rosario (zdelrosario@olin.edu)*

### Learning outcomes
By working through this notebook, you will be able to:

* Re-create a number of different plots using `plotly`
* Draw conclusions about data using histograms, boxplots, and scatterplots

Tips:
* [This page](https://plotly.com/python/basic-charts/) gives details on how to make a wide variety of charts with plotly.
* *Google is your friend*: If you don't know how to do something and can't find it in the examples, *first* try searching for something related on the web. If you can't find it after 5 minutes of searching on your own, please do ask a question. Learning how to search for your own answers is a valuable skill, and you are *expected* to have to look stuff up in this exercise.
* Not all the code here will make sense if you're new to Python. That's ok! I've tried to scaffold the exercises to focus on `plotly` syntax, rather than more general Python syntax.

In [1]:
# Setup
import plotly.express as px
import plotly.graph_objects as go

import numpy as np
import grama as gr
import pandas as pd

from grama.tran import tran_umap
from plotly.subplots import make_subplots

DF = gr.Intention()

# Necessary to show plotly figures in Jupyter
%matplotlib widget

# Load data
filename_alloy_data = "./data/wiki_comp.csv"
df_data = pd.read_csv(filename_alloy_data)
# Extract the series number
df_data = df_data.assign(
    series=list(map(lambda s: int(s.replace("V-", "")[0]), df_data.alloy))
)


When studying a new dataset, it's always a good idea to do some simple 'first checks' before going straight to visualizing.


In [2]:
# Check the data's shape
df_data.shape


(137, 28)

We have 137 observations on 28 variables. Because there are so many variables, these data will be difficult to visualize!

In [3]:
# Glimpse the data
df_data.head()


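| Unnamed: 0 | alloy | al_percent | Ag | B | Be | Bi | Ca | Cd | Ce | Cr | ... | O | Pb | Sc | Si | Ti | V | Y | Zn | Zr | series |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1100 | 99.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 1230 | 90.85 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.3 | 0.15 | 0.0 | 0.0 | 0.1 | 0.0 | 1 |
| 2 | 1420 | 92.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 1 |
| 3 | 1421 | 92.9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 1 |
| 4 | 1424 | 91.4885 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.08 | 0.08 | 0.0 | 0.0 | 0.0 | 0.7 | 0.1 | 1 |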


This is a dataset about aluminum alloys, scraped from [Wikipedia](https://en.wikipedia.org/wiki/Aluminium_alloy#Wrought_alloys). The `alloy` column identifies the specific alloy, `al_percent` gives the percent aluminum content (by weight), `series` gives the alloy series number (defining "families" of alloys), and the remaining columns give the content of other alloying elements. From the "head" of the data (the first few observations) we can see the alloying elements are *sparse* -- many of the entries are zero.
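
One way to quantify sparsity is to compute the fraction of zero entries in each column. The sketch below uses a small made-up DataFrame (`df_demo` and its values are illustrative assumptions, not the alloy data); the same expression could be applied to the composition columns of `df_data`.

```python
import pandas as pd

# Hypothetical stand-in for a few composition columns (values are made up)
df_demo = pd.DataFrame({
    "Cu": [0.0, 4.4, 0.0, 0.1],
    "Mg": [0.0, 1.5, 2.5, 0.0],
    "Zn": [0.0, 0.0, 0.0, 5.6],
})

# Fraction of entries in each column that are exactly zero
zero_fraction = (df_demo == 0).mean()
print(zero_fraction.sort_values(ascending=False))
```

Columns with a zero fraction near 1 are mostly empty; those are the elements that appear in only a few alloys.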

### Learning `plotly` syntax by doing

### Q1: Reproduce This Plot
Use `plotly` to reproduce the following histogram. Make sure to learn how to change *all* of the following:
* Number of histogram bins
* Axis limits
* Axis labels

The following documentation pages may help:
- [Histograms](https://plotly.com/python/histograms/)
- [Axis range documentation](https://plotly.com/python/axes/#setting-the-range-of-axes-manually)
- [Label documentation](https://plotly.com/python/figure-labels/)


In [12]:
###
# TASK: Reproduce the histogram below
# TODO: Consult the plotly documentation to help you re-create the histogram below
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
fig = px.histogram(
    df_data, 
    x="al_percent", 
    nbins=20,
    title="Aluminum Content",
    labels={
        "al_percent": "Aluminum content by weight (%)",
    }
)
# fig.update_layout(xaxis_range=[70, 100])
fig.update_xaxes(range=[80, 100])
fig.show()
fig.write_image("./incl/04_q1_target.png")
# solution-end


*Note*: With plotly you can click and drag to zoom in on the plot. You can also see additional data by *hovering* your cursor over geometric elements; this is helpful for reading off precise values! You can double-click to reset the view of the plot.


<!-- task-begin -->
<img src="./incl/04_q1_target.png">
<!-- task-end -->

### Q2: What does this histogram tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the histogram above.
<!-- solution-begin -->
* All `Al` percentages are at or above 85%
* Many of the alloys tend to be around 93% `Al`, as evidenced by the tallest peak
* There is a second mode (peak) of alloys around 97-98% `Al`
* There is a fairly long left tail of alloys spread between 85% and 92% `Al`
<!-- solution-end -->

---

Next, we'll study how alloy composition varies across the various Al series.

In [5]:
series = list(range(1, 9))
mean_percentages = [np.mean(df_data[df_data.series == s]["al_percent"])
                    for s in series]  # List comprehension
print("Series, Mean Percent")
for ind in range(len(series)):
    print("     {0:},   {1:2.1f}".format(series[ind], mean_percentages[ind]))


Series, Mean Percent
     1,   93.3
     2,   93.6
     3,   98.6
     4,   92.9
     5,   96.2
     6,   97.9
     7,   90.1
     8,   94.8
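
The per-series means above can also be computed with a pandas `groupby`, which avoids the explicit loop. This is a sketch on made-up values (`df_demo` is a hypothetical stand-in for `df_data`), not a replacement for the cell above.

```python
import pandas as pd

# Made-up values standing in for df_data
df_demo = pd.DataFrame({
    "series": [1, 1, 2, 2, 2],
    "al_percent": [99.0, 93.0, 94.0, 92.0, 93.0],
})

# Mean aluminum content within each series
mean_by_series = df_demo.groupby("series")["al_percent"].mean()
print(mean_by_series.round(1))
```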


This shows that Series 3 and 6 have the highest `Al` content, on average. However, this does not give us as much information as a full histogram. We could plot histograms for the individual series, as demonstrated below:

In [6]:
series1_al = df_data[df_data.series == 1]["al_percent"]
# task-begin
# Solution code hidden!
# No need to write code here
# task-end
# solution-begin
fig = px.histogram(
    df_data[df_data.series == 1], 
    x="al_percent", 
    nbins=20,
    title="Aluminum Content, Series 1 only",
    labels={
        "al_percent": "Aluminum content by weight (%)",
    }
)
# fig.update_layout(xaxis_range=[70, 100])
fig.update_xaxes(range=[80, 100])
fig.show()
fig.write_image("./incl/04_series_hist.png")
# solution-end


<!-- task-begin -->
<img src="./incl/04_series_hist.png">
<!-- task-end -->

However, it is challenging to compare multiple histograms against each other. Instead, we can use *boxplots* to summarize and plot the various series in the same plot.
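
To make the summary concrete, the sketch below computes the quantities a boxplot typically draws: the quartiles, and the points flagged as outliers by the common 1.5 × IQR rule. The sample values are made up, and plotly's own quartile computation may differ slightly from the pandas default used here.

```python
import pandas as pd

# Made-up sample with one unusually low value
values = pd.Series([90.0, 92.0, 93.0, 94.0, 95.0, 96.0, 80.0])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Common 1.5 * IQR rule for flagging points drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(f"Q1={q1}, median={median}, Q3={q3}")
print("Outliers:", outliers.tolist())
```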

### Q3: Reproduce This Plot
Use `plotly` to reproduce the following [boxplots](https://plotly.com/python/box-plots/) depicting each aluminum series.

Hints:
* You may re-use lessons you learned from Q1


In [7]:
###
# TASK: Reproduce the boxplots below
# TODO: Consult the plotly documentation to help you re-create the boxplots below
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
fig = px.box(
    df_data, 
    x="series", 
    y="al_percent",
    title="Aluminum Content",
    labels={
        "series": "Series",
        "al_percent": "Aluminum content by weight (%)",
    }
)
fig.update_yaxes(range=[80, 100])
fig.show()
fig.write_image("./incl/04_q3_target.png")
# solution-end


<!-- task-begin -->
<img src="./incl/04_q3_target.png">
<!-- task-end -->

### Q4: What does this plot tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the boxplots above.
<!-- solution-begin -->
* Series 1 is supposed to be *pure* aluminum, but the data here suggest otherwise. Perhaps some of the alloy designations are incorrect?
* Series 3 and 6 are indeed the most pure in `Al`, though some instances of Series 5 have higher `Al` content
* Series 4 and 8 are highly variable in `Al` content
* Series 2 has a (relatively) large number of outliers (shown as dots)
<!-- solution-end -->

---

### Dimension Reduction

Studying *all* the variables at once is difficult, as we cannot directly visualize 25-dimensional space! Instead, we will use different *dimension reduction* techniques to visualize the data while preserving low-dimensional structure. We will illustrate two methods:

- [Principal Component Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis) is a classic dimension reduction technique that seeks to preserve *variance* in the original data while reducing the number of dimensions. In this approach, the resulting reduced coordinates have a clear interpretation as *distance along special directions* within the original space.
- [UMAP](https://umap-learn.readthedocs.io/en/latest/) stands for *uniform manifold approximation and projection*; this is a nonlinear dimension reduction technique. UMAP coordinates don't have any physical interpretation, but the technique can frequently capture low-dimensional structure that PCA cannot. [This article](https://pair-code.github.io/understanding-umap/) is a fantastic resource for understanding UMAP.
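
To illustrate what "distance along special directions" means for PCA, here is a plain `numpy` sketch on synthetic data. It uses a centered singular value decomposition, which is one standard way to compute PCA; it is not a description of how `gr.tran_pca` is implemented, and all names in it are illustrative.

```python
import numpy as np

# Synthetic 3-D data that varies mostly along one direction
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = t @ np.array([[2.0, 1.0, 0.0]]) + 0.1 * rng.normal(size=(100, 3))

# Center the data, then take the SVD; rows of Vt are the principal directions
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Reduced coordinates: position of each point along the first two directions
Z = X_centered @ Vt[:2].T
print(Z.shape)
```

Each column of `Z` is a projection onto a unit-length direction in the original space, which is why PCA coordinates remain interpretable.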

In [8]:
# Perform PCA
df_pca_weights = gr.tran_pca(df_data.drop(["alloy", "al_percent", "series"], axis=1))
df_pca_data = (
    gr.tran_inner(
        df_data,
        df_pca_weights,
    )
)

# Perform UMAP
df_umap_data = (
    tran_umap(df_data.drop(["alloy", "al_percent", "series"], axis=1), seed=101)
    >> gr.tf_bind_cols(df_data[["alloy", "series"]])
)



ignoring df_weights columns {'lam'}



Both datasets `df_pca_data` and `df_umap_data` have transformed coordinates we can use to plot the data using just two coordinates. We'll illustrate this below.

---

### PCA

When performing PCA, we need to decide how many principal components ("special directions") in the data to retain. We do this by looking at a plot of the singular values, which describe how much variance is captured along each of the principal directions. This kind of plot is sometimes called a [scree plot](https://en.wikipedia.org/wiki/Scree_plot).
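
A minimal sketch of the calculation behind such a plot, using made-up component variances (the array `lam` and the 80% threshold are assumptions chosen for illustration, not values from the alloy data):

```python
import numpy as np

# Hypothetical component variances, largest first (not from the alloy data)
lam = np.array([4.0, 3.0, 2.0, 1.0])

# Cumulative fraction of total variance captured by the first k components
variance_fraction = np.cumsum(lam) / np.sum(lam)

# Smallest number of components capturing at least 80% of the variance
k = int(np.searchsorted(variance_fraction, 0.8) + 1)
print(variance_fraction, k)
```

The threshold is a judgment call; a scree plot lets you see how quickly the cumulative fraction rises and choose accordingly.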

In [9]:
# Create figure with secondary y-axis
fig = make_subplots(specs=[[{"secondary_y": True}]])
# Add traces
fig.add_trace(
    go.Scatter(
        x=df_pca_weights.index, 
        y=df_pca_weights.lam, 
        name="Component variance",
    ),
    secondary_y=False,
)
fig.add_trace(
    go.Scatter(
        x=df_pca_weights.index,
        y=gr.cumsum(df_pca_weights.lam) / gr.colsum(df_pca_weights.lam),
        name="Variance fraction",
    ),
    secondary_y=True,
)
# Add figure title
fig.update_layout(
    title_text="Scree Plot",
)
# Set x-axis title
fig.update_xaxes(title_text="Principal component index")
# Set y-axes titles
fig.update_yaxes(title_text="Component Variance", secondary_y=False)
fig.update_yaxes(title_text="Variance Fraction", secondary_y=True)
fig.show()

The total variance captured in the first two dimensions is less than `50%`; this suggests that a two-dimensional projection of the data will not be very effective. Regardless, let's try plotting the data using just the first two PCA components.

In [10]:
# task-begin
# CODE REDACTED...
# You'll learn to use px.scatter() in Q5
# task-end
# solution-begin
fig = px.scatter(
    df_pca_data, 
    x="dot0", 
    y="dot1",
    color="series",
    hover_data=["alloy"],
    title="Principal Components",
    labels={
        "dot0": "PC 1",
        "dot1": "PC 2",
    }
)
fig.show()
fig.write_image("./incl/04_ex_pca.png")
# solution-end


<!-- task-begin -->
<img src="./incl/04_ex_pca.png">
<!-- task-end -->

We can see a bit of clustering in the data -- the `Series 7` alloys are largely separated from other alloys. However, we can find more distinct clusters by using a UMAP projection.

---

### Q5: Reproduce This UMAP Plot
Use `plotly` to reproduce the following scatterplot on the two UMAP coordinates in `df_umap_data`. The easiest way is probably by using `px.scatter()`, which is documented [here](https://plotly.com/python/line-and-scatter/).

Hints:
* You may re-use lessons you learned from Q1 and Q3


In [11]:
###
# TASK: Reproduce the scatterplot below
# TODO: Consult the plotly documentation to help you re-create the scatterplot below
###

# -- WRITE YOUR CODE BELOW -----
# solution-begin
# Plot data
fig = px.scatter(
    df_umap_data, 
    x="xi0", 
    y="xi1",
    color="series",
    hover_data=["alloy"],
    title="UMAP Plot",
    labels={
        "xi0": "UMAP Coordinate 1",
        "xi1": "UMAP Coordinate 2",
    }
)
fig.show()
fig.write_image("./incl/04_q5_target.png")
# solution-end


<!-- task-begin -->
<img src="./incl/04_q5_target.png">
<!-- task-end -->

Interpreting a UMAP plot is [subtle and challenging](https://pair-code.github.io/understanding-umap/), but also extremely powerful. One caveat of UMAP is that the coordinates themselves do not have a simple interpretation, so the location of clusters in the two coordinates is fairly arbitrary. The distances between clusters are also not necessarily meaningful. A well-calibrated UMAP will tend to reveal clustering in the data, in which case points nearby in UMAP space can be understood to be close in the original space.

### Q6: What does this plot tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the UMAP scatterplot above. Use both the observed clusters and the original Series labels in your analysis.

*Hint*: Remember that you can zoom and hover over points to get more information! Double-click to reset the zoom of the plot.

<!-- solution-begin -->
* The compositional clusters do not perfectly align with the alloy Series
* There is a cluster of 7000 Series alloys that are more similar to each other than to other alloys
* All 6000 Series alloys cluster together, though they also cluster with other alloys from different series
<!-- solution-end -->