# Introduction to Data Visualization in Python

Welcome! In this tutorial, we explore the wine quality dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) and use Python data visualization tools to examine how wine quality relates to the results of various physicochemical tests.

## Import Packages

If any of these packages are not already installed, run `pip3 install <package name>`.

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D

%matplotlib inline

## Pre-processing the Data

Download the wine quality CSVs (red wine & white wine) from the following links: 
* red wine = https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv
* white wine = https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

Move the CSVs to the same directory as this notebook. If you are running in Google Colab, upload them with the cell below instead.

In [0]:
from google.colab import files
uploaded = files.upload()

In [0]:
## loading red wine csv
red_wine = pd.read_csv('winequality-red.csv', sep=';')

## TODO: load the white wine csv using pandas
white_wine = ...
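The `sep=';'` argument matters because these files separate fields with semicolons rather than commas. Here is a minimal sketch of the same call on an in-memory sample; the two rows are invented toy values (not taken from the dataset), and `sample_csv` and `sample` are hypothetical names.

```python
import io
import pandas as pd

# Toy stand-in for a semicolon-separated file; the values are invented.
sample_csv = io.StringIO(
    '"fixed acidity";"citric acid";"quality"\n'
    '7.0;0.30;5\n'
    '6.5;0.25;6\n'
)

# Without sep=';' pandas would read each line as a single column.
sample = pd.read_csv(sample_csv, sep=';')
print(sample.columns.tolist())
print(sample.dtypes)
```

The same call pattern should apply to any file that uses the same separator.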

Let's concatenate the `red_wine` and `white_wine` dataframes into a single `wines` dataframe.

In [0]:
## Add a label for red wine versus white wine
red_wine['type'] = 'red'
white_wine['type'] = 'white'

## TODO: concatenate both dataframes using pandas
wines = ...

## shuffle rows of dataframe
wines = wines.sample(frac=1).reset_index(drop=True)
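To see what labelling, stacking, and shuffling do, here is a small sketch on toy frames. The frames `toy_red` and `toy_white` and their values are made up for illustration, and `random_state=42` is an added assumption that makes the shuffle reproducible.

```python
import pandas as pd

# Hypothetical toy frames standing in for the two wine tables.
toy_red = pd.DataFrame({'alcohol': [9.4, 9.8], 'quality': [5, 6]})
toy_white = pd.DataFrame({'alcohol': [8.8, 10.1, 9.5], 'quality': [6, 7, 5]})
toy_red['type'] = 'red'
toy_white['type'] = 'white'

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index.
toy_wines = pd.concat([toy_red, toy_white], ignore_index=True)

# Shuffle all rows (frac=1) and reset the index again afterwards.
toy_wines = toy_wines.sample(frac=1, random_state=42).reset_index(drop=True)
print(toy_wines['type'].value_counts())
```

Shuffling keeps later plots from being drawn in "all red, then all white" order.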

In [0]:
wines.head()

As can be seen above, each wine dataset has the results of multiple physicochemical tests, alongside a numeric quality score.

Let's discretize this score into a new `quality_label` column as follows:
* low: value <= 5
* med: 5 < value <= 7 
* high: value > 7

In [0]:
quality_label = []
for value in wines['quality']:
    ## TODO: assign quality label array with low, med, high
    if value <= 5: 
        ...
    elif 5 < value <= 7:
        ...
    else: 
        ...

wines['quality_label'] = quality_label
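As an alternative to the explicit loop, pandas can bin a column in one call with `pd.cut`. This is only a sketch on toy scores (the series `scores` is hypothetical), using right-closed bins so the boundaries match the low/med/high rules listed above.

```python
import numpy as np
import pandas as pd

scores = pd.Series([3, 5, 6, 7, 8, 9])  # toy quality scores

# Right-closed bins: (-inf, 5] -> low, (5, 7] -> med, (7, inf) -> high
labels = pd.cut(scores,
                bins=[-np.inf, 5, 7, np.inf],
                labels=['low', 'med', 'high'])
print(labels.tolist())  # ['low', 'low', 'med', 'med', 'high', 'high']
```

The loop version is easier to read for a first exercise; the `pd.cut` version is usually preferable on large frames.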

In [0]:
wines.head()

# 1D Visualization:

## Univariate Analysis

Analyze one data attribute/variable & visualize. 

Let's create a histogram for each numeric attribute!

In [0]:
## DataFrame.hist returns an array of Axes, one per numeric column
axes = wines.hist(bins=15,
                  edgecolor='black',
                  xrot=45, yrot=0,
                  figsize=(10,9),
                  grid=False)

plt.tight_layout(rect=(0, 0, 1.5, 1.5))   

## Continuous Numeric Attribute

Let's analyze a continuous, numeric attribute (e.g. citric acid content).

In [0]:
## create figure
fig = plt.figure(figsize=(24, 15))
title = fig.suptitle("Citric Acid Content in Wine", fontsize=16, fontweight='bold')
fig.subplots_adjust(top=0.88, wspace=0.3)

## histogram of continuous variable
ax1 = fig.add_subplot(221)
ax1.set_xlabel("Citric Acid")
ax1.set_ylabel("Frequency")

## TODO: calculate the mean 
citric_acid_mean = ...
## add mean as text attribute
ax1.text(x=1.2, y=800, 
         s=r'$\mu$='+str(citric_acid_mean), 
         fontsize=12)

freq, bins, patches = ax1.hist(wines['citric acid'], 
                               bins=40,
                               color='coral',
                               edgecolor='black')

## density plot of continuous variable
ax2 = fig.add_subplot(222) 
ax2.set_xlabel("Citric Acid")
ax2.set_ylabel("Density") 
## TODO: use kdeplot to plot the density graph
sns.kdeplot(...)

## Discrete Categorical Attribute

Let's analyze a discrete, categorical attribute (e.g. quality).

In [0]:
## create a figure
fig = plt.figure(figsize=(6, 4))
title = fig.suptitle("Wine Quality Frequency", fontsize=14, fontweight='bold')
fig.subplots_adjust(top=0.9, wspace=0.3)

## TODO: calculate the values within the bin
quality_counts = ...

ax = fig.add_subplot(1,1,1)
ax.set_xlabel("Quality")
ax.set_ylabel("Frequency") 
ax.tick_params(axis='both', which='major', labelsize=8.5)

## TODO: plot barchart using ax.bar
bar = ax.bar(...)
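A frequency bar chart needs two things: the count of each distinct value and one bar per count. The sketch below shows one way to get both on a toy series; `toy_quality` and its values are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

toy_quality = pd.Series([5, 6, 5, 7, 6, 5, 8])  # toy scores

# Count each distinct score, then order by score rather than by count.
counts = toy_quality.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(counts.index, counts.values, color='steelblue', edgecolor='black')
ax.set_xlabel('Quality')
ax.set_ylabel('Frequency')
print(counts.to_dict())  # {5: 3, 6: 2, 7: 1, 8: 1}
```

Sorting by index keeps the x-axis in score order, which is usually what a reader expects for an ordinal attribute.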

# 2D Visualization:

Let's look at finding potential relationships/correlations between data attributes. 

In order to do this, let's start by creating a heatmap from a correlation matrix!

## Heatmap

In [0]:
## TODO: calculate correlation matrix using pandas
corr = ...

## create figure
fig, ax = plt.subplots(1, 1, figsize=(10,6))

## TODO: create a heatmap
heatmap = sns.heatmap(...)

fig.suptitle('Wine Attributes Correlation Heatmap', 
              fontsize=14, 
              fontweight='bold')
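Here is a hedged sketch of the correlation-then-heatmap pattern on a toy frame. The columns `a`, `b`, `c` are invented so that the expected correlations are obvious. Selecting numeric columns first is an added precaution, since string columns such as `type` cannot be correlated.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],    # b = 2a, so corr(a, b) = 1
    'c': [4.0, 3.0, 2.0, 1.0],    # c falls as a rises, so corr(a, c) = -1
    'type': ['red', 'white', 'red', 'white'],  # non-numeric, excluded below
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
toy_corr = toy.select_dtypes(include='number').corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(toy_corr.round(2), annot=True, fmt='.2f', cmap='coolwarm',
            linewidths=0.05, ax=ax)
```

A diverging colormap such as `coolwarm` makes positive and negative correlations easy to tell apart.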

Let's pick the most strongly correlated attributes to analyze. 

These are: density, residual sugar, total sulfur dioxide, free sulfur dioxide, and fixed acidity

## Pair Plot

In [0]:
## TODO: define attributes of interest in an array
cols = ...

## TODO: create a pairplot using seaborn's pairplot function
pairplot = sns.pairplot(...)

fig = pairplot.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', 
              fontsize=14, fontweight='bold')

## Continuous Numeric Attributes

Let's look at the correlation between fixed acidity & density!

In [0]:
## TODO: create a joint plot using seaborn's joint plot function
jointplot = sns.jointplot(...)

In [0]:
## TODO: create a KDE joint plot using seaborn
kdejoint = sns.jointplot(...)

## Discrete Categorical Attributes

Let's make a histogram comparing red wine and white wine quality numbers!

In [0]:
## TODO: create a histogram comparing red & white wine (i.e. countplot)
countplot = sns.countplot(...,
                          palette={"red": "crimson", "white": "lemonchiffon"})
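A count plot with `hue` is the visual form of a two-way frequency table. As a rough sketch on toy data (the frame `toy` is invented), the table and the plot can be produced side by side, assuming seaborn's `data=`/`x=`/`hue=` keyword interface:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

toy = pd.DataFrame({
    'quality': [5, 5, 6, 6, 6, 7],
    'type': ['red', 'white', 'red', 'white', 'white', 'white'],
})

# The counts the plot shows, as a table: rows are scores, columns are types.
table = pd.crosstab(toy['quality'], toy['type'])
print(table)

fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=toy, x='quality', hue='type',
              palette={'red': 'crimson', 'white': 'lemonchiffon'}, ax=ax)
```

Checking the table first is a quick way to confirm that the bar heights are what you expect.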

## Mixed Attributes (Numeric & Categorical)

Let's try using `seaborn`'s facet plot to compare the citric acid content in white wine versus red wine.

In [0]:
## initialize figure
fig = plt.figure(figsize=(10,8))
title = fig.suptitle("Citric Acid Content in Wine", fontsize=14)

ax = fig.add_subplot(111)
ax.set_xlabel("Citric Acid")
ax.set_ylabel("Frequency") 

## TODO: initialize facet grid
grid = sns.FacetGrid(..., 
                  palette={"red": "crimson", "white": "lemonchiffon"})

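## add a histogram with a KDE overlay
## (note: sns.distplot is deprecated since seaborn 0.11; sns.histplot is its replacement)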
grid.map(sns.distplot, 'citric acid', kde=True, bins=15, ax=ax)

ax.legend(title='Wine Type')
plt.close(2)

Let's try making some box & violin plots!

In [0]:
## set up figure
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(24, 8))
f.suptitle('Wine Quality vs. Alcohol Content', fontsize=14)

## TODO: create boxplot using seaborn
sns.boxplot(..., 
            ax=ax1)

ax1.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax1.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)

## TODO: create violin plot using seaborn
sns.violinplot(...,   
               ax=ax2)

ax2.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax2.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)
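Box and violin plots both summarize a numeric column per category. Below is a minimal sketch on synthetic data, where the alcohol values are randomly generated so that their centre rises with the score (an assumption made purely for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(2)
toy = pd.DataFrame({
    'quality': np.repeat([5, 6, 7], 40),
    # Synthetic alcohol values, one block of 40 per score.
    'alcohol': np.concatenate([rng.normal(9.8, 0.5, 40),
                               rng.normal(10.5, 0.5, 40),
                               rng.normal(11.4, 0.5, 40)]),
})

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(12, 4))
sns.boxplot(data=toy, x='quality', y='alcohol', ax=ax_box)
sns.violinplot(data=toy, x='quality', y='alcohol', ax=ax_violin)

# The centre line of each box marks the group median.
medians = toy.groupby('quality')['alcohol'].median()
print(medians)
```

The box plot emphasizes quartiles and outliers, while the violin plot adds the shape of each group's distribution.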

# Multiple Dimension Visualization:

## 3D Visualization

Let's create a "3D" pairplot based on wine type!

In [0]:
## attributes of interest
cols = ['density', 
        'residual sugar', 
        'total sulfur dioxide', 
        'fixed acidity', 
        'type']
  
## TODO: make 3D pairplot
pairplot = sns.pairplot(...)

## add figure
fig = pairplot.fig 
fig.subplots_adjust(top=0.93, wspace=0.3)
fig.suptitle('Wine Attributes Pairwise Plots', fontsize=14)

## 3D Continuous Numeric Attributes

Let's plot the relationship between acidity, alcohol & residual sugar

In [0]:
## TODO: use scatter and set size attribute * 25
plt.scatter(...)

plt.xlabel('Fixed Acidity')
plt.ylabel('Alcohol')
plt.title('Wine Alcohol Content - Fixed Acidity - Residual Sugar', y=1.05)
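The idea here is to encode a third numeric variable as marker size. A small sketch with toy arrays (`x`, `y`, and `third` are hypothetical) shows the mechanics; the factor 25 is only a scaling choice that keeps small values visible.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy values: x and y positions, plus a third variable mapped to marker area.
x = np.array([7.0, 7.5, 8.1, 6.4])
y = np.array([9.5, 10.2, 11.0, 12.3])
third = np.array([1.5, 4.0, 8.0, 2.0])

fig, ax = plt.subplots(figsize=(6, 4))
# s is the marker area in points^2, one entry per point.
points = ax.scatter(x, y, s=third * 25, alpha=0.4, edgecolors='w')
ax.set_xlabel('x')
ax.set_ylabel('y')
```

Because `s` sets area rather than radius, a value four times larger looks roughly twice as wide.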

## 3D Discrete Categorical Attributes

Let's look at quality vs. quality label vs. wine type

In [0]:
## TODO: create a categorical plot using seaborn's catplot
## (sns.factorplot was renamed to catplot in seaborn 0.9 and later removed)
factorplot = sns.catplot(...)
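Three categorical attributes can share one figure by mapping them to the x-axis, bar color, and panel column. The sketch below uses `sns.catplot`, the current name for what older seaborn releases called `factorplot`. The frame `toy` and its values are invented, and `kind='count'` is an assumed choice of plot type.

```python
import pandas as pd
import seaborn as sns

toy = pd.DataFrame({
    'quality': [4, 5, 6, 7, 8, 5, 6, 6, 7, 8],
    'type': ['red'] * 5 + ['white'] * 5,
})
# Derive a coarse label from the score: (0, 5] low, (5, 7] med, (7, 10] high.
toy['quality_label'] = pd.cut(toy['quality'], bins=[0, 5, 7, 10],
                              labels=['low', 'med', 'high']).astype(str)

# One panel per wine type, one bar per score, colored by quality label.
toy_grid = sns.catplot(data=toy, x='quality', hue='quality_label',
                       col='type', kind='count')
```

`catplot` returns a `FacetGrid`, so titles and spacing can be adjusted through its underlying figure afterwards.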

## 3D Mixed Attributes (Categorical & Numerical)

Let's look at the relationship between sulphates, alcohol, and wine type

* x-axis = sulphates
* y-axis = alcohol
* color = wine type

In [0]:
## TODO: make an lm joint plot using seaborn's lmplot
linearreg_jointplot = sns.lmplot(...)

In [0]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 4))
f.suptitle('Wine Type - Quality - Alcohol %', fontsize=14)

## TODO: create a violin plot
sns.violinplot(...,
               ax=ax1)

ax1.set_xlabel("Wine Quality",size=12,alpha=0.8)
ax1.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)
ax1.legend(loc='best', title='Wine Type')

## TODO: create a boxplot
sns.boxplot(..., 
            ax=ax2)

ax2.set_xlabel("Wine Quality Class",size=12,alpha=0.8)
ax2.set_ylabel("Wine Alcohol %",size=12,alpha=0.8)
ax2.legend(loc='best', title='Wine Type')

## 6D Plotting!

Let's visualize this data in *6 dimensions*

1) x-axis
2) y-axis
3) color
4) n columns
5) size
6) m rows

In [0]:
## TODO: fill in row, col, hue
grid = sns.FacetGrid(wines,
                  row=..., # row
                  col=..., # column
                  hue=..., # color
                  height=4) # `size` was renamed to `height` in seaborn 0.9

## TODO: fill in x and y axis
grid.map(plt.scatter,
         ..., # x-axis
         ..., # y-axis
         alpha=0.5, 
         edgecolor='k', 
         linewidth=0.5, 
         s=...) # size

fig = grid.fig 
fig.set_size_inches(18, 8)
fig.subplots_adjust(top=0.85, wspace=0.3)
fig.suptitle('Wine Type - Sulfur Dioxide - Residual Sugar - Alcohol - Quality Class - Quality Rating', fontsize=14)
grid.add_legend(title='Wine Quality Class')