<img src="https://ucfai.org/groups/supplementary/sp20/02-13-visualization/visualization/banner.png">


<div class="col-12">
    <h1> Visualizing Your Data and Advanced Visualization </h1>
    <hr>
</div>

<div style="line-height: 2em;">
    <p>by: 
        <strong> None</strong>
        (<a href="https://github.com/cokeacolaking">@cokeacolaking</a>)
    
        <strong> None</strong>
        (<a href="https://github.com/calvinyong">@calvinyong</a>)
     on 2020-02-13</p>
</div>

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Loading in and Reading Data

In [None]:
tips = pd.read_csv('tips.csv')  # import dataset
tips  # look at the data

By default, `tips.head()` returns the first 5 rows. Pass a number to get that many rows instead, for example `tips.head(7)` or `tips.head(3)`.

In [None]:
tips.head()

In [None]:
tips.tail()

`tips.describe()` gives a statistical summary (count, mean, standard deviation, quartiles, min, max) of the numeric columns.

In [None]:
tips.describe()

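You may have noticed that the categorical columns are missing from that summary. By default, `describe()` only covers numeric columns, because a mean or standard deviation is not defined for text values. Passing `include=['O']` selects the object (string) columns instead and reports the count, the number of unique values, the most frequent value, and its frequency.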

In [None]:
tips.describe(include=['O'])
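
The difference between the two `describe` calls can be tried without the CSV. The sketch below uses a small hand-made frame; the values are made up for illustration and are not rows from `tips.csv`, and the string columns are given an explicit `object` dtype so that `include=["O"]` matches them.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the tips dataset.
# dtype="object" is spelled out so the string columns match include=["O"].
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip": [1.01, 1.66, 3.50, 3.31],
    "sex": pd.Series(["Female", "Male", "Male", "Male"], dtype="object"),
    "day": pd.Series(["Sun", "Sun", "Sat", "Sat"], dtype="object"),
})

numeric_summary = toy.describe()               # numeric columns only
object_summary = toy.describe(include=["O"])   # object (string) columns only

print(list(numeric_summary.columns))
print(object_summary)
```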

# Missing Values

`tips.isnull()` returns a table of booleans with the same shape as `tips`. A cell is True where the value in that position is missing.

In [None]:
tips.isnull()

`tips.isnull().any(axis=0)` collapses the rows: for each column, it returns True if <u>any</u> value in that column is missing.

In [None]:
tips.isnull().any(axis=0)

If we change `axis` to 1, we get one boolean per row instead: True if that row has any missing value.

In [None]:
tips.isnull().any(axis=1).head(13)
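
To make the two `axis` settings concrete, here is a minimal sketch on a three-row frame with two missing cells. The frame and its values are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 is missing both total_bill and day.
toy = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "day": ["Sun", None, "Sat"],
})

missing = toy.isnull()
per_column = missing.any(axis=0)   # one boolean per column
per_row = missing.any(axis=1)      # one boolean per row

print(per_column)
print(per_row)
```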

Now we know how to see if there is a missing value in the data, but how much of our data are we actually missing?

In [None]:
tips.isnull().sum(axis=0)

If you want the total number of missing values, take the sum of the per-column counts above.

In [None]:
tips.isnull().sum(axis=0).sum()

In [None]:
tips.shape
# the size of the dataset (rows, cols)

Many datasets are far larger than our 244 rows, so a raw count of missing values says little on its own: 3 missing values out of 244 matter more than 3 missing values out of thousands.

To see the impact, we divide each column's count by the number of rows (`tips.shape[0]`, here 244) to get the fraction of missing data per column. Multiply by 100 for a percentage.

In [None]:
tips.isnull().sum(axis=0) / tips.shape[0]
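
The same fraction can also be obtained with `.mean()`, since the mean of a boolean column is the share of True values. A small sketch on made-up data (the frame below is an assumption, not part of `tips.csv`):

```python
import numpy as np
import pandas as pd

# Made-up frame: 1 of 4 bills and 2 of 4 tips are missing.
toy = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01, 23.68],
    "tip": [1.01, 1.66, np.nan, np.nan],
})

counts = toy.isnull().sum(axis=0)       # missing values per column
fraction = counts / toy.shape[0]        # same computation as in the cell above
percent = toy.isnull().mean() * 100     # shortcut: mean of booleans, as a percentage

print(fraction)
print(percent)
```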

# Plotting Continuous Data

Data comes in different types. With our `tips` dataset, we have three variables which are numbers (`total_bill`, `tip`, `size`). Although `size` is a number, we are going to treat it as a categorical value.

## Scatterplots

Scatterplots are useful for finding correlations or patterns between two numeric variables. In our graphs, we will assign `total_bill` to the x-axis and `tip` to the y-axis.

In [None]:
sns.relplot(x="total_bill", y="tip", data=tips)
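
A scatterplot shows a correlation visually; a number to go with it can be computed with pandas. The sketch below uses made-up bills and tips (not the real dataset) and the Pearson correlation from `Series.corr`.

```python
import pandas as pd

# Made-up values in which larger bills tend to come with larger tips.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.5, 3.0, 4.0, 6.5, 7.0],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
r = toy["total_bill"].corr(toy["tip"])
print(round(r, 3))
```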

`col=""` splits the data into side-by-side subplots, one per value of the given column.

In [None]:
sns.relplot(x="total_bill", y="tip", col="time", data=tips)

`hue=""` colors the points by the values of the given column.

In [None]:
sns.relplot(x='total_bill', y='tip', hue='sex', data=tips)

`style=""` changes the marker shape by the values of the given column.

In [None]:
sns.relplot(x='total_bill', y='tip', hue='smoker', style='smoker', data=tips)

`size=""` scales the marker size by the values of the given column.

In [None]:
sns.relplot(x='total_bill', y='tip', size='size', data=tips)

All together now!

In [None]:
sns.relplot(x='total_bill', y='tip', col='time',
            hue='smoker', style='sex', size='size', data=tips)

## Histograms

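Histograms let us view the univariate distribution, that is, the distribution of a single variable. Note that `sns.distplot` was deprecated in seaborn 0.11 in favor of `sns.histplot` and `sns.displot`, so recent seaborn versions warn when the calls below are run.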

In [None]:
sns.distplot(tips.total_bill, kde=False)

Now we add in the kernel density estimate, `kde`, which is basically a smoothed curve over the data. This helps us see the shape of the distribution.

In [None]:
sns.distplot(tips.total_bill, kde=True)
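
To see what "a smoothed curve over the data" means, a Gaussian kernel density estimate can be written by hand in a few lines of NumPy. This is only a sketch: the helper name `kde_curve`, the sample values, and the bandwidth of 4.0 are assumptions chosen for illustration, whereas seaborn picks a bandwidth automatically.

```python
import numpy as np

def kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centred on each sample (hypothetical helper)."""
    samples = np.asarray(samples, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up bill amounts, not taken from tips.csv.
bills = np.array([10.3, 13.4, 16.0, 17.8, 18.4, 21.0, 24.6, 25.3, 35.3, 48.3])
grid = np.linspace(-20, 80, 1001)
density = kde_curve(bills, grid, bandwidth=4.0)

# A density integrates to 1; check with a simple Riemann sum.
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a bumpier curve that follows individual points, and a larger one gives a flatter curve.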

`bins` sets how many equal-width intervals the range of the data is divided into. If it is omitted, seaborn picks a number from the data, which comes out to 14 for `total_bill`.

In [None]:
sns.distplot(tips.total_bill, bins=20, kde=True)
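
What `bins` does can be checked directly with `np.histogram`, which returns the counts and the bin edges. The sketch below uses made-up values. The `"fd"` option is NumPy's Freedman-Diaconis rule, shown here as one example of choosing the bin count from the data; it is not guaranteed to match what a given seaborn version picks.

```python
import numpy as np

# Made-up bill amounts, not taken from tips.csv.
values = np.array([10.3, 13.4, 16.0, 17.8, 18.4, 21.0, 24.6, 25.3, 35.3, 48.3])

counts, edges = np.histogram(values, bins=5)
widths = np.diff(edges)   # every bin has the same width

# Let NumPy choose the edges from the data instead of fixing the count.
auto_edges = np.histogram_bin_edges(values, bins="fd")

print(counts, edges)
print(len(auto_edges) - 1, "bins chosen by the Freedman-Diaconis rule")
```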

Lastly, we have `rug`. It places a small tick on the x-axis where each datapoint lies.

In [None]:
sns.distplot(tips.total_bill, bins=20, kde=True, rug=True)

In certain scenarios, `rug` helps us see data that we might otherwise overlook. From the histogram of `tips.tip` alone, it is hard to tell where the points above 7 lie; the rug marks them individually.

In [None]:
sns.distplot(tips['tip'], bins=12, kde=True, rug=True)

## Joint Graphs

The `jointplot` combines a scatterplot of two variables with a histogram of each variable along the margins.

In [None]:
sns.jointplot(x="total_bill", y="tip", data=tips)

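`pairplot` draws a scatterplot for every pair of numeric columns, with each column's own distribution on the diagonal.

In [None]: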
sns.pairplot(tips)

# Plotting Categorical Data

## Categorical Plots

`sns.catplot()` draws different categorical plots depending on its `kind` parameter. Each kind shows a different way to visualize the data; the default, used below, is a strip plot.

In [None]:
sns.catplot(x="time", y="total_bill", data=tips)

In `catplot` we can use `hue` like before.

In [None]:
sns.catplot(x="time", y="total_bill", hue='smoker', data=tips)

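What do we do if we want to separate the points by more than just time? `dodge`! With `dodge=True`, each `hue` level gets its own strip within a category instead of being drawn on top of the others.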

In [None]:
sns.catplot(x="time", y="total_bill", hue='smoker', dodge=True, data=tips)

There are different kinds of `catplot`. A few of them are `swarm`, `box`, `boxen`, and `bar`.

In [None]:
sns.catplot(x="time", y="total_bill", hue='sex',
            dodge=True, kind='swarm', data=tips)

If you want to use the `x` variable as the `hue` too, make sure you set `dodge` to False so the bars stay centered on their category.

In [None]:
sns.catplot(x="time", y="total_bill", hue='time',
            dodge=False, kind='bar', data=tips)
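
With `kind='bar'`, the height of each bar is by default the mean of `y` within that category. The same numbers can be computed with a `groupby`, sketched here on a made-up frame (not the real dataset):

```python
import pandas as pd

# Made-up bills: two at lunch, three at dinner.
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean bill per category, i.e. the bar heights a bar plot would show.
mean_bill = toy.groupby("time")["total_bill"].mean()
print(mean_bill)
```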

# Visualization - Part 2

## 3D plotting / interactive plotting

- https://plot.ly/python/

## Dimension reduction and manifold learning

A common assumption is that the dimension of a dataset is artificially inflated: the data actually lies on or near a lower-dimensional embedding inside the space it was recorded in.

### Principal Component Analysis (PCA)

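PCA is a linear transformation that re-expresses the data along orthogonal directions ordered by how much variance they capture. Keeping only the first few directions also makes it a dimension reduction method.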

https://scikit-learn.org/stable/modules/decomposition.html
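
As a rough illustration of the idea, PCA can be sketched with NumPy's SVD on centered data. The synthetic dataset below (3-D points lying close to a 2-D plane) is an assumption made up for this example; in practice scikit-learn's `PCA` class, linked above, is the usual tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points in 3-D that lie close to a 2-D plane.
latent = rng.normal(size=(200, 2))
mixing = np.array([[2.0, 0.5, 1.0],
                   [0.0, 1.0, -1.0]])
X = latent @ mixing + 0.01 * rng.normal(size=(200, 3))

# PCA via SVD: center the data, then the rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

explained = S**2 / np.sum(S**2)   # share of variance along each direction
X2 = Xc @ Vt[:2].T                # project onto the first two directions

print(explained)
print(X2.shape)
```

Because the points were built from two latent coordinates, the first two directions should account for almost all of the variance.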

### Manifold learning

Manifold learning covers nonlinear algorithms that find a lower-dimensional embedding of the data.

- https://scikit-learn.org/stable/modules/manifold.html
- https://github.com/lmcinnes/umap

### How to misread manifold learned embeddings

- https://distill.pub/2016/misread-tsne/
- https://pair-code.github.io/understanding-umap/index.html

## Plotting huge amounts of data

https://datashader.org/index.html

### Plotting Pitfalls

https://datashader.org/user_guide/Plotting_Pitfalls.html

Explains how Datashader avoids pitfalls encountered when plotting big datasets using techniques designed for small ones.
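
One idea behind plotting very large datasets is to aggregate points into a fixed grid of cells before drawing, so that overlapping points are counted rather than painted over each other. The sketch below shows that idea with plain NumPy on synthetic points. It is not Datashader's API, and the grid size and log scaling are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic point cloud standing in for a large dataset.
n = 200_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Count how many points fall into each cell of a 300 x 200 grid.
counts, xedges, yedges = np.histogram2d(x, y, bins=(300, 200))

# Compress the range so sparse cells stay visible next to dense ones.
image = np.log1p(counts)

print(counts.shape, int(counts.sum()))
```

The resulting `image` array could be passed to `plt.imshow` to draw one pixel per cell.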

## More

https://pyviz.org/

The PyViz.org website is an open platform for helping users decide on the best open-source (OSS) Python data visualization tools for their purposes, with links, overviews, comparisons, and examples.

- https://github.com/ResidentMario/missingno
- https://github.com/altair-viz/altair
- https://github.com/bokeh/bokeh