### Data science process:
1. Define problem
2. Collect data
3. Process data
4. **Visualize data**
5. **Analyze data**
6. Report

### Color encoding:
See slides.
 * categorical (hue) vs quantitative (saturation,value)
 * Colormaps: https://matplotlib.org/examples/color/colormaps_reference.html 
 
 https://matplotlib.org/stable/tutorials/colors/colormaps.html?highlight=colormaps
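
As a rough illustration of the categorical vs. quantitative distinction above, the sketch below samples one qualitative colormap (`tab10`) and one sequential colormap (`viridis`). These two colormaps and all variable names are assumptions chosen for illustration; they are not taken from the slides.

```python
import numpy as np
import matplotlib.pyplot as plt

# Qualitative colormap: distinct hues, suited to categorical data.
categorical = plt.get_cmap("tab10")
category_colors = [categorical(i) for i in range(3)]  # one RGBA tuple per category

# Sequential colormap: ordered change in lightness, suited to quantitative data.
sequential = plt.get_cmap("viridis")
value_colors = sequential(np.linspace(0, 1, 5))  # 5 evenly spaced values -> 5 RGBA rows

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 2))
ax1.bar(["a", "b", "c"], [3, 5, 2], color=category_colors)
ax1.set_title("categorical: tab10")
ax2.scatter(range(5), [1] * 5, c=np.linspace(0, 1, 5), cmap="viridis", s=200)
ax2.set_title("quantitative: viridis")
```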


### [Common visualizations for low-dimensional data](#Common-visualizations-for-low-dimensional-data):  
* DataFrame.plot library
* Histogram/KDE:   1D, Quantitative, distributions
* Bar:  2D, Cat x Quant, comparing points
* Pie: 2D, Cat x Quant, comparing points to whole
* Line/area:  2D, Quant(Continuous) x Quant, trend
* Scatter: 2D, Quant x Quant, correlation
* Annotating plots: Axes.text(), .add_line(), .add_patch(), .annotate()


### [Common visualizations for high-dimensional data](#Multi-dimensional-data-visualization):  
* pandas.plotting library
* Scatterplot glyphs  https://bost.ocks.org/mike/nations/ 
* Distribution plots:
    * Histogram matrix http://square.github.io/crossfilter/
    * Boxplots, KDEs 
* Scatterplot Matrix https://bl.ocks.org/mbostock/4063663
* Parallel Coordinates 	https://bl.ocks.org/jasondavies/1341281 
* Heatmaps http://bl.ocks.org/mbostock/3202354
* Interaction:  Dynamic Queries https://www.gapminder.org/world/  https://www.youtube.com/watch?v=5X8XY9430fM 

## Review from the EDA class
After all, the first step after data collection is EDA, together with the visualizations that support it.

### reviewing common viz for low-dimensional data

In [None]:
### Libraries:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt

# Static (most reliable)
%matplotlib inline 

In [None]:
s = pd.read_csv("data/Survey-clean.csv")
s

## Common visualizations for low-dimensional data

### Histogram

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

In [None]:
# Histogram
s.Height.plot.hist()

In [None]:
s.Height.plot.kde()

### Bar Chart

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.bar.html

In [None]:
# Bar chart
s.plot.bar(x='Name', y='Height', figsize=(20,3), color='steelblue')
# double click on the chart to expand

**<span class="mark">TODO</span>**: What are the marks and channels here?

#### Data from a Social Media platform (Reddit) 
Let's practice EDA with some social media data.

CSV file containing Reddit user activity from 2007

* author: Reddit username
* num_comments : number of comments made
* num_subs: number of subreddits participated in
* karma : total Reddit score accumulated 
* controv : total number of controversial comments (comments with both high upvotes and high downvotes)
* gild : number of comments that received Reddit gold from other users 
* verbosity: average length of the comment

In [None]:
rdata = pd.read_csv("data/reddit_2007_author.csv")
rdata.head()

## Common visualizations for low-dimensional data

### Histogram

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html

https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html

#### Plot the distribution of number of comments made by the users

In [None]:
rdata.num_comments.plot.hist()

#### Plot the distribution of comments made by users. We are not interested in users who make > 200 comments.

In [None]:
rdata.num_comments[rdata.num_comments < 200].plot.hist()

#### Right-skewed distribution. 
A lot of people make few comments and a few people make a lot of comments, which gives the histogram a long right tail. Now let's look at the distribution of the top 100 num_comments

In [None]:
f1 = rdata.sort_values(by='num_comments', ascending=False).head(100)
f1.num_comments.plot.hist()
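
One way to check the direction of skew numerically is `Series.skew()`. The sketch below uses synthetic log-normal data as a stand-in for `num_comments`; the distribution and its parameters are assumptions, not properties of the Reddit file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for num_comments: many small values, a few very large ones.
rng = np.random.default_rng(0)
fake_comments = pd.Series(rng.lognormal(mean=2.0, sigma=1.0, size=5000))

# Positive skewness means a long right tail (right-skewed).
print(fake_comments.skew())
# In a right-skewed sample the mean is pulled above the median.
print(fake_comments.mean(), fake_comments.median())
```

On the real data, the same check would be `rdata.num_comments.skew()`.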

In [None]:
# who made these top comments?

f1

### Bar chart

Plotting number of comments made by each user

In [None]:
f1.plot.bar(x='author', y='num_comments', figsize=(20,3), color='purple')

#### properly labeling the axes

In [None]:
f1.plot.bar(x='author', y='num_comments', figsize=(20,3), color='purple')
plt.xlabel('Author', fontsize=16) # Add x & y label, change font size
plt.ylabel('# of Comments', fontsize=16)

### Scatter plot

visualizing two quantitative (continuous) variables

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html

**How does the karma earned by the user relate to user's contribution (# of comments)?**

Basic scatterplot shown, but you can add more parameters to visualize additional attributes on the same plot.

Recall EDA class and read the pandas scatterplot documentation.

In [None]:
# How does the karma earned by the user relate to user's contribution (# of comments)?

p = f1.plot.scatter(x='num_comments', y='karma') # you can include additional parameters. Enhance the visualization.
p.set_xlabel("Number of comments", fontsize=16)
p.set_ylabel("Karma", fontsize=16)

In [None]:
# Your code to include additional parameters. Enhance the visualization.
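
One possible direction, sketched on made-up rows that only reuse the Reddit column names: map a third column to color (`c`, `colormap`) and a fourth to marker size (`s`). Which columns to map is an assumption here; other choices may suit the real data better.

```python
import pandas as pd

# Made-up rows that reuse the Reddit column names; not real data.
toy = pd.DataFrame({
    "num_comments": [120, 340, 560, 910, 1500],
    "karma": [300, 2100, 1800, 7500, 9000],
    "num_subs": [3, 12, 7, 25, 40],
    "verbosity": [80, 150, 60, 220, 130],
})

ax = toy.plot.scatter(
    x="num_comments", y="karma",
    c="num_subs", colormap="viridis",  # color channel: a third quantitative column
    s=toy["verbosity"],                # size channel: a fourth quantitative column
    alpha=0.7, figsize=(8, 5),
)
ax.set_xlabel("Number of comments", fontsize=16)
ax.set_ylabel("Karma", fontsize=16)
```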

### EDA for your projects

**<span class="mark">TODO shortly with your group</span>** Start thinking how you would conduct visual EDA with the data that you have collected for your projects.

## Annotating plots 
**new stuff**

* create a Line2D  https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D
* plot.annotate https://matplotlib.org/api/_as_gen/matplotlib.pyplot.annotate.html

In [None]:
# Annotate a plot
plot = s.plot.scatter(x='Born', y='Spicy')
plot.text(2000, 3, "I'm here!", color='Red')
plot.add_line(matplotlib.lines.Line2D(
    [2000, 4000, 6000], [50, 0, 40], color="Orange"))  #Create a Line2D instance with x and y data in sequences xdata, ydata.
plot.add_patch(matplotlib.patches.Ellipse(
    (3000,30), 2000, 10, color='Maroon'))
plot.annotate("Dude", (8000,40), xytext=(10000,20), 
               bbox=dict(), arrowprops=dict(facecolor='red'))  #try with arrowprops=dict(facecolor='red')
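
The difference between `xy` and `xytext` in `annotate` is easier to see on a tiny plot. The sketch below uses made-up points; the coordinates and label text are illustrative assumptions.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4], [10, 20, 15, 40])

# xy is the point being annotated; xytext is where the label is drawn.
# Both are in data coordinates by default.
note = ax.annotate("outlier?", xy=(4, 40), xytext=(2.5, 35),
                   arrowprops=dict(facecolor="red", shrink=0.05))

# Axes.text places a plain label at a data coordinate, with no arrow.
label = ax.text(1, 12, "first point", color="red")
```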

## Multi-dimensional data visualization

pandas.plotting.

In [None]:
f1.hist(figsize=(10,10))

<span class="mark">QUESTION</span>: What do you observe in each of these distributions?

## Parallel Coordinates (less used)

* each variable is given its own axis 
* all the axes are placed in parallel to each other.
* Each axis can have a different scale, or all the axes can be normalised to keep all the scales uniform. 
* Values are plotted as a series of lines that are connected across all the axes. 
* The order of the axes can impact interpretation 
* relationships between adjacent variables are easier to perceive than those between non-adjacent variables.
* DOWNSIDE: can become over-cluttered and illegible when they’re very data-dense. 
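
The points above about normalised axes and axis order can be sketched on a small made-up table. Min-max scaling is one of several possible normalisations and is an assumption here, as are the column names.

```python
import pandas as pd
from pandas.plotting import parallel_coordinates

# Made-up measurements on very different scales.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "height_cm": [160, 172, 181, 190],
    "weight_kg": [55, 68, 80, 95],
    "age": [21, 35, 28, 44],
})

numeric = toy.drop(columns="group")
# Min-max normalisation: every axis now runs from 0 to 1.
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
scaled["group"] = toy["group"]

# The `cols` argument controls which axes are drawn and in what order.
ax = parallel_coordinates(scaled, "group",
                          cols=["age", "height_cm", "weight_kg"],
                          color=["steelblue", "orange"])
```

Changing the order of `cols` changes which variables sit next to each other, and therefore which relationships are easy to read.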

In [None]:
%%html
<div><img src="https://datavizcatalogue.com/methods/images/anatomy/SVG/parallel_coordinates.svg" width="220"/></div>

### Plotting parallel coordinates
Demo with pre-existing iris data. Don't worry about the data too much at this point. We will do additional practice exercises with parallel coordinates in future classes

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.plotting.parallel_coordinates.html

In [None]:
# libraries
from pandas.plotting import parallel_coordinates

# Take the iris dataset
import seaborn as sns
data = sns.load_dataset('iris')

# Make the plot
parallel_coordinates(data, 'species')
plt.show()

## Exercises
part 2

1. visualize average heights of males & females

How would you do it programmatically?
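
A minimal sketch of one approach: aggregate with `groupby`, then plot the result as a bar chart. The rows are made up, and the survey is assumed to have `Gender` and `Height` columns.

```python
import pandas as pd

# Made-up rows; the real survey is assumed to have 'Gender' and 'Height' columns.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Height": [160.0, 180.0, 170.0, 175.0, 165.0],
})

# Aggregate first, then plot: one bar per gender.
mean_height = toy.groupby("Gender")["Height"].mean()
ax = mean_height.plot.bar(color="steelblue")
ax.set_ylabel("Average height")
```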

## Exercise
<span class="mark">**TODO**</span>
2. Visualize Heights, Born, Gender

In [None]:
# Exercise:  scatterplot with categorical colors for Gender?
plot = s.plot.scatter(x='Born', y='Height')

# several solutions possible...

In [None]:
# Your code

## Exercise

<span class="mark">**TODO**</span>
3. Compare height distributions of genders

In [None]:
# YOUR CODE BELOW

## Few more viz examples

### Load datasets, map into categorical variables

In [None]:
white_wine = pd.read_csv('data/winequality-white.csv', sep=';')
red_wine = pd.read_csv('data/winequality-red.csv', sep=';')

In [None]:
white_wine.head()

In [None]:
# store wine type as an attribute
red_wine['wine_type'] = 'red'   
white_wine['wine_type'] = 'white'

In [None]:
white_wine.head()

In [None]:
# bucket wine quality scores into qualitative quality labels
red_wine['quality_label'] = red_wine['quality'].apply(lambda value: 'low' 
                                                          if value <= 5 else 'medium' 
                                                              if value <= 7 else 'high')
red_wine['quality_label'] = pd.Categorical(red_wine['quality_label'], 
                                           categories=['low', 'medium', 'high'])

white_wine['quality_label'] = white_wine['quality'].apply(lambda value: 'low' 
                                                              if value <= 5 else 'medium' 
                                                                  if value <= 7 else 'high')
white_wine['quality_label'] = pd.Categorical(white_wine['quality_label'], 
                                             categories=['low', 'medium', 'high'])

In [None]:
white_wine.head()

In [None]:
red_wine.head()

In [None]:
# merge red and white wine datasets
wines = pd.concat([red_wine, white_wine])

# re-shuffle records just to randomize data points
wines = wines.sample(frac=1, random_state=42).reset_index(drop=True)

In [None]:
wines.head() #check data with & without re-shuffle

**Now we are ready to visualize after all these data wrangling steps**

## Visualizing two dimensions
### with seaborn

In [None]:
f, ax = plt.subplots(figsize=(10, 6))
corr = wines.corr(numeric_only=True)  # skip the text/categorical columns (wine_type, quality_label)
hm = sns.heatmap(round(corr,2), annot=True, ax=ax, cmap="coolwarm",fmt='.2f',
            linewidths=.05)
f.subplots_adjust(top=0.93)
t= f.suptitle('Wine Attributes Correlation Heatmap', fontsize=14)

### without seaborn.

Plot the values of a 2D matrix or array as color-coded image: [matplotlib.pyplot.matshow](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.matshow.html#matplotlib.axes.Axes.matshow)

In [None]:
matrix = wines.corr(numeric_only=True)  # skip the text/categorical columns
img = matplotlib.pyplot.matshow(matrix, 
    cmap=plt.cm.BrBG) #, vmin=-1, vmax=1)
# looks pretty bad. No labels. Figure is too small. 

To know more about adding labels and ticks

[tick-locators]( https://matplotlib.org/3.1.1/gallery/ticks_and_spines/tick-locators.html)

[ticklabels](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html)

<img src="https://matplotlib.org/3.1.1/_images/sphx_glr_tick-locators_001.png" alt="Variables" width="600"/>

 * Colormaps: https://matplotlib.org/examples/color/colormaps_reference.html 

In [None]:
matrix = wines.corr(numeric_only=True)  # skip the text/categorical columns
img = plt.matshow(matrix, #vmin and vmax define the data range that the colormap covers.
    cmap=plt.cm.BrBG, vmin=-1, vmax=1) #Diverging color map. 
img.figure.set_size_inches(10, 8, forward=True)
img.axes.set_xticklabels(list(matrix.columns), rotation='vertical')
img.axes.set_yticklabels(list(matrix.index), rotation='horizontal')
img.axes.xaxis.set_major_locator(matplotlib.ticker.MultipleLocator(1)) #ticks
img.axes.yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(1))
img.figure.colorbar(img) # adding the color bar at the end.
Answer4 = img.axes #HW5, problem4. Make sure that Answer4 is an Axes or AxesSubplot instance 
Answer4

There is a minor problem with the labels. **<span class="mark">Can you spot the problem?</span>**

**Anyone has a better way to fix this?**

See my hackish approach below.

In [None]:
# xticklabels indexes from -1
img.axes.set_xticklabels(list(matrix.columns), rotation='vertical', ha='center')

In [None]:
pos = np.arange(len(matrix.columns))

In [None]:
x = list(matrix.columns)
x = [''] + x
x

In [None]:
img.axes.set_xticks(pos)
img.axes.set_xticklabels(x, rotation='vertical')

In [None]:
# Now you can see "fixed acidity" in the tick labels.
img = plt.matshow(matrix, 
    cmap=plt.cm.BrBG, vmin=-1, vmax=1)
img.figure.set_size_inches(10, 8)
img.axes.set_xticklabels(x, rotation='vertical')
img.axes.set_yticklabels(x, rotation='horizontal')
img.axes.xaxis.set_major_locator(matplotlib.ticker.MultipleLocator(1)) #ticks
img.axes.yaxis.set_major_locator(matplotlib.ticker.MultipleLocator(1))
img.figure.colorbar(img)

In [None]:
matrix.shape
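
A less hackish alternative, sketched on a small made-up frame standing in for the wine data: set the tick positions explicitly, then assign exactly one label per tick, so no padding entry is needed. The toy columns `a` to `d` are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Small made-up frame standing in for the wine data.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])
matrix = toy.corr()

fig, ax = plt.subplots(figsize=(5, 4))
img = ax.matshow(matrix, cmap="BrBG", vmin=-1, vmax=1)

# Fix the tick positions first, then assign exactly one label per tick.
positions = np.arange(len(matrix.columns))
ax.set_xticks(positions)
ax.set_xticklabels(list(matrix.columns), rotation="vertical")
ax.set_yticks(positions)
ax.set_yticklabels(list(matrix.index))
fig.colorbar(img)
```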

## Parallel Coordinates
another example.
Let's get back to our survey data

In [None]:
s = pd.read_csv("data/Survey-clean.csv")

In [None]:
# normalizing to find z-scores of numeric data.
numeric = s.iloc[:, 5:]  # columns from position 5 onward are assumed to be numeric
norm = (numeric - numeric.mean()) / numeric.std()
norm

In [None]:
# joining with major data
norm.join(s.Major)

In [None]:
#sequential color map, Major is the legend
plot = pd.plotting.parallel_coordinates(norm.join(s.Major), 'Major', 
                                      colormap=plt.cm.rainbow)
plot.figure.set_size_inches(40, 10, forward=True)

## Multidimensional visualization (more possibilities)

part 3

 * boxplot
 * KDE
 * Scatterplot matrix

In [None]:
norm.plot.kde()

In [None]:
pd.plotting.boxplot(norm, rot=90, figsize=(10,5))

In [None]:
# Takes a while to run.
plot = pd.plotting.scatter_matrix(s, figsize=(20,20))

### Seaborn as an alternative

Seaborn provides an API on top of Matplotlib  
[seaborn vs. matplotlib](https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html)

#### Visualizing 3 dimensions

Working with the wine data

In [None]:
# box plot to show distributions of alcohol %age vs. wine quality
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", data=wines,  ax=ax)
ax.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)

In [None]:
# violinplots
f, (ax) = plt.subplots(1, 1, figsize=(12, 4))
f.suptitle('Wine Quality - Sulphates Content', fontsize=14)

sns.violinplot(x="quality", y="sulphates", data=wines,  ax=ax)
ax.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax.set_ylabel("Wine Sulphates",size = 12,alpha=0.8)

In [None]:
# utilizing both sides of a violin plot
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Acidity', fontsize=14)

sns.violinplot(x="quality", y="volatile acidity",
               data=wines, inner="quart", linewidth=1.3,ax=ax1)
ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax1.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8)

sns.violinplot(x="quality", y="volatile acidity", hue="wine_type", 
               data=wines, split=True, inner="quart", linewidth=1.3,
               palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax2.set_ylabel("Wine Volatile Acidity",size = 12,alpha=0.8)
l = plt.legend(loc='upper right', title='Wine Type')

<span class="mark">TODO</span>: What are the marks and channels in this visualization?

In [None]:
# visualizing with box plots, showing quality labels and quality class
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Alcohol Content', fontsize=14)

sns.boxplot(x="quality", y="alcohol", hue="wine_type",
               data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax1)
ax1.set_xlabel("Wine Quality",size = 12,alpha=0.8)
ax1.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)

sns.boxplot(x="quality_label", y="alcohol", hue="wine_type",
               data=wines, palette={"red": "#FF9999", "white": "white"}, ax=ax2)
ax2.set_xlabel("Wine Quality Class",size = 12,alpha=0.8)
ax2.set_ylabel("Wine Alcohol %",size = 12,alpha=0.8)
l = plt.legend(loc='best', title='Wine Type')

In [None]:
# pairwise plots
cols = ['density', 'residual sugar', 'total sulfur dioxide', 'fixed acidity', 'wine_type']
pp = sns.pairplot(wines[cols], hue='wine_type', height=1.8, aspect=1.8, 
                  palette={"red": "#FF9999", "white": "#FFE888"},
                  plot_kws=dict(edgecolor="black", linewidth=0.5))
fig = pp.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
t = fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

### Common visualization mistakes: 
* Chart junk: 
    * Data ink ratio = data ink / total chart ink 
    * https://cdn-images-1.medium.com/max/2600/1*iQcNh732KTyKd6qROcebhg.png
    
    <img src="https://infovis-wiki.net/w/images/5/55/DIR.jpg" alt="Variables" width="600"/>
    
* Y-axis: https://xkcd.com/2023/
* https://flowingdata.com/2017/02/09/how-to-spot-visualization-lies/ 
    * Truncated bar axis
    * 2D Size: Area vs width
* Rainbow colormap for quantitative data https://en.wikipedia.org/wiki/Photoreceptor_cell#/media/File:1416_Color_Sensitivity.jpg
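
The truncated bar axis problem listed above can be reproduced in a few lines. The values below are made up and differ by only a few percent; the axis limits are assumptions chosen to exaggerate the effect.

```python
import matplotlib.pyplot as plt

# Made-up values that differ by only a few percent.
labels = ["A", "B", "C"]
values = [96, 98, 100]

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(8, 3))

ax_trunc.bar(labels, values, color="steelblue")
ax_trunc.set_ylim(95, 101)   # truncated axis: small differences look large
ax_trunc.set_title("Truncated y-axis")

ax_zero.bar(labels, values, color="steelblue")
ax_zero.set_ylim(0, 101)     # zero baseline: bar length is proportional to value
ax_zero.set_title("Zero-based y-axis")
```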