![Banner logo](https://raw.githubusercontent.com/CitrineInformatics/community-tools/master/templates/fig/citrine_banner_2.png "Banner logo")

## Visualization in Python

*Authors: Zach del Rosario (zdelrosario@citrine.io)*

### Learning outcomes
By working through this notebook, you will be able to:

* Re-create a number of different plots using `matplotlib`
* Draw conclusions about data using histograms, boxplots, and scatterplots

Tips:
* *Google is your friend*: If you don't know how to do something and can't find it in the examples, *first* try searching for something related on the web. If you can't find it after 5 minutes of searching on your own, then try asking a question. Learning how to search for your own answers is a valuable skill, and you are *expected* to have to look stuff up in this exercise.
* Not all the code here will make sense if you're new to Python. That's ok! I've tried to scaffold the exercises to focus on `matplotlib` syntax, rather than more general Python syntax.

In [None]:
# Setup
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
# For jupyter-matplotlib compatibility
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# External filename handling
filename_alloy_data = "./data/wiki_comp.csv"

# Load data
df_data = pd.read_csv(filename_alloy_data)
# Extract the series number
df_data = df_data.assign(
    series = list(map(lambda s: s.replace("V-", "")[0], df_data.alloy))
)

When studying a new dataset, it's always a good idea to do some simple 'first checks'.

In [None]:
# Check the data's shape
df_data.shape

We have 137 observations on 28 variables. Because there are so many variables, these data will be difficult to visualize!

In [None]:
# Glimpse the data
df_data.head()

This is a dataset about Aluminum alloys, scraped from [Wikipedia](https://en.wikipedia.org/wiki/Aluminium_alloy#Wrought_alloys). The `alloy` column identifies the specific alloy, `al_percent` gives the percent Aluminum content (by weight), `series` gives the alloy series number (defining "families" of alloys), and the remaining columns give other alloying elements. From the "head" of the data (the first few observations) we can see the alloying elements are *sparse* -- many of the entries are zero.
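
One way to put a number on this sparsity is to compute the fraction of zero entries in each column. The following is a minimal sketch on a small made-up table; `df_demo`, its column names, and its values are hypothetical stand-ins, not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for df_data; all values are invented for illustration
df_demo = pd.DataFrame({
    "alloy": ["A", "B", "C", "D"],
    "al_percent": [99.0, 93.5, 97.2, 90.1],
    "cu": [0.0, 4.5, 0.0, 0.0],
    "mg": [0.0, 1.5, 1.0, 0.0],
})

# Compare every numeric entry to zero, then average the True/False values
# column by column to get the fraction of zeros in each column
frac_zero = (df_demo.select_dtypes(include="number") == 0).mean()
print(frac_zero)
```

Applied to the real data, the same expression should work with `df_data` in place of `df_demo`; columns with a fraction near 1 are the mostly-empty alloying elements.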

### Learning `matplotlib` syntax by doing

### Q1: Reproduce This Plot
Use `matplotlib` to reproduce the following histogram. Make sure to learn how to change *all* of the following:
* Font sizes
* Number of histogram bins
* Axis limits
* Axis labels
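
As a starting point, here is a rough sketch of where each of those settings lives in `matplotlib`. It uses synthetic data and arbitrary values (the bin count, limits, labels, and font sizes are all invented), so it will not reproduce the target figure; it only illustrates the relevant calls.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data, invented for illustration only
rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)

fig, ax = plt.subplots()
ax.hist(values, bins=15)                      # number of histogram bins
ax.set_xlim(0, 100)                           # axis limits
ax.set_xlabel("Value (units)", fontsize=14)   # axis label and its font size
ax.set_ylabel("Count", fontsize=14)
ax.tick_params(labelsize=12)                  # font size of the tick labels
```

The same settings are also available through the `plt.` interface (for example `plt.hist`, `plt.xlim`, `plt.xlabel`); which style you use is a matter of preference.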

In [None]:
## Generate histogram


<img src="./incl/03_q1_target.png">
<!-- task-end -->

### Q2: What does this histogram tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the histogram above.


Next, we'll study how alloy composition varies across the various Al series.

In [None]:
series = [str(i) for i in range(1, 9)]
mean_percentages = [np.mean(df_data[df_data.series == s]["al_percent"]) for s in series] # List comprehension
print("Series, Mean Percent")
for ind in range(len(series)):
    print("     {0:},   {1:2.1f}".format(series[ind], mean_percentages[ind]))

This shows that Series 3 and 6 have the highest `Al` content, on average. However, we can use *boxplots* to gain a fair bit more insight into the data.
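
As an aside, per-group means like the ones printed above can also be computed in one step with `pandas` `groupby`. This is a sketch on a tiny hypothetical table (`df_demo` and its values are made up), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-in for df_data; values are invented for illustration
df_demo = pd.DataFrame({
    "series": ["1", "1", "2", "2", "2"],
    "al_percent": [99.0, 99.5, 93.0, 94.0, 95.0],
})

# Split the rows by series, then average al_percent within each group
mean_by_series = df_demo.groupby("series")["al_percent"].mean()
print(mean_by_series)
```

The result is a `pandas` Series indexed by the group label, so `mean_by_series["1"]` looks up the mean for a single group.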

### Q3: Reproduce This Plot
Use `matplotlib` to reproduce the following [boxplot](https://towardsdatascience.com/understanding-boxplots-5e2df7bcbd51).

Hints:
* You may re-use lessons you learned from Q1
* `plt.boxplot()` can take a list of different data sets to produce multiple boxplots on the same figure. You can use a *list comprehension* like in `mean_percentages` above to create such a list, and pass it to `plt.boxplot()`
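
To illustrate the second hint, here is a minimal sketch of passing a list of arrays to `boxplot`. The groups, their names, and their values are synthetic and invented for illustration; the exercise itself should build the list from `df_data`.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic groups, invented for illustration only
rng = np.random.default_rng(0)
group_names = ["A", "B", "C"]
# One array per group, built with a list comprehension
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 2.0, 4.0)]

fig, ax = plt.subplots()
result = ax.boxplot(groups)   # draws one box per array in the list

# By default the boxes sit at x = 1, 2, 3, ...; relabel those ticks
ax.set_xticks(list(range(1, len(group_names) + 1)))
ax.set_xticklabels(group_names)
ax.set_xlabel("Group")
ax.set_ylabel("Value")
```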

In [None]:
## Generate boxplot


<img src="./incl/03_q3_target.png">
<!-- task-end -->

### Q4: What does this plot tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the boxplots above.


Studying *all* the variables at once is difficult, as we cannot directly visualize 26-dimensional space! Instead, we will use [t-SNE](https://lvdmaaten.github.io/tsne/) to *reduce dimensionality* while preserving local structure, in this case clusters of data.

In [None]:
X_series = df_data["series"]
## Reduce dimensionality
X_orig = df_data.drop(["alloy", "al_percent", "series"], axis = 1).values
# np.random.seed(101) # t-SNE is a random algorithm; set seed for reproducibility
X_tsne = TSNE(
    n_components = 2
).fit_transform(X_orig)

print(X_orig.shape)
print(X_tsne.shape)

We can see the modified data `X_tsne` has just two variables, but the same number of observations as `X_orig`. We can use these two coordinates to plot the data and look for patterns.
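
Because t-SNE is a randomized algorithm, re-running it will generally give a different embedding unless the seed is fixed. One way to do that in scikit-learn is the `random_state` argument of `TSNE`. The sketch below uses synthetic data (`X_demo` is invented, and the seed value is arbitrary) and simply checks whether two runs with the same seed agree.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data, invented for illustration: 60 observations on 5 variables
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))

# Two independent runs that share the same seed
emb_a = TSNE(n_components=2, random_state=101).fit_transform(X_demo)
emb_b = TSNE(n_components=2, random_state=101).fit_transform(X_demo)

print(emb_a.shape)                 # same number of rows as X_demo, two columns
print(np.allclose(emb_a, emb_b))   # whether the two embeddings coincide
```

Note that the axes of a t-SNE embedding have no physical meaning; only the relative positions of the points are informative.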

### Q5: Reproduce This Plot
Use `matplotlib` to reproduce the following scatterplot on the two t-SNE coordinates in `X_tsne`.

Hints:
* You may re-use lessons you learned from Q1 and Q3
* You will need to introduce a colormap to mark the alloys from different series. [This tutorial](https://gist.github.com/jakevdp/8a992f606899ac24b711) may be helpful
* I use the colormap `viridis` below; it is a [perceptually uniform](http://bids.github.io/colormap/) colorscale that avoids introducing erroneous patterns in visualizations
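
As a rough illustration of the colormap hint, the sketch below colors a scatterplot by an integer label using `viridis`. The points and labels are synthetic and invented for illustration; they are not the t-SNE coordinates.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic points and labels, invented for illustration only
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)
labels = rng.integers(1, 4, size=50)   # integer category labels in {1, 2, 3}

fig, ax = plt.subplots()
# c supplies one number per point; cmap maps those numbers to colors
sc = ax.scatter(x, y, c=labels, cmap="viridis")
cbar = fig.colorbar(sc, ax=ax)         # legend relating color to label
cbar.set_label("Category")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

When a colormap is used, `c` expects numeric values, so labels stored as strings would need to be converted to numbers first (for example with `astype(int)`).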

<img src="./incl/03_q5_target.png">
<!-- task-end -->

### Q6: What does this plot tell you?
List _at least 3_ observations or conclusions about the data you can draw, based on the t-SNE scatterplot above.


In [None]:
## Generate scatterplot
