#### Univariate
- Continuous
    - Histograms
    - Boxplots
- Categorical
    - Bar charts

#### Bivariate
- Continuous vs Categorical
    - Use Univariate techniques but use a different chart for each 
    - Use colors, markers, shapes
- Discrete vs Categorical
- Continuous vs Continuous

# A stroll through basic visualizations using plotly express.

I've recently discovered plotly express, and am super-pumped to add it to my toolbox.

To put it through its paces, I thought it would be a good time to run through all the standard visualization plots.

Let's request some data from Kaggle.

In [1]:
import pandas as pd
import json
import os
import plotly.express as px

In [None]:
# def get_keys(path):
#     with open(path) as f:
#         return json.load(f)

# keys = get_keys(os.path.join(os.environ['HOME'], '.secret/kaggle.json'))

# client_id = keys['username']
# api_key = keys['key']

I found this dataset on kaggle that looks interesting:
https://www.kaggle.com/magshimimsummercamp/superheroes-info-and-stats#superheroes_info.csv

Here is the API command:
`kaggle datasets download -d magshimimsummercamp/superheroes-info-and-stats`

Which you can get easily onto your clipboard by just clicking here:
![alt_command](./images/api_cmd_small.png)

Let's use the shell to get that data!

In [None]:
# Download the file
!kaggle datasets download -d magshimimsummercamp/superheroes-info-and-stats

In [None]:
# Make sure it's in there
!ls

In [None]:
# Unzip the file
!unzip superheroes-info-and-stats.zip

In [None]:
# Look at files again
!ls

Sweet. After perusing the files on Kaggle, I decided to use a combination of `superheroes_info.csv` and `superheroes_stats.csv`.

Let's start with `info`.

In [None]:
info = pd.read_csv('superheroes_info.csv')
info.head()

Let's drop that `Unnamed: 0` column.

In [None]:
info.columns

In [None]:
info.drop(['Unnamed: 0'], axis=1, inplace=True)

In [None]:
info.columns

In [None]:
info.head()

There appear to be multiple versions of some superheroes. We are just interested in putting together some dummy data for our visualization exercises rather than a rigorous analysis, but let's see how big a risk these 'dupes' are.

In [None]:
print("Number of rows: " + str(info.shape[0]))
print("Number of unique names: " + str(len(info['Name'].unique())))

In [None]:
22265 / 23777

That seems sizeable, but based on our small sample it appears that the repeated superheroes are just alternate-universe versions of each other, so I think it's fine for our not-so-serious exercise.
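Here is a minimal sketch of a more direct way to size up the dupes. It uses a tiny made-up frame (`toy_info`, with hypothetical names) in place of the real `info`, so the numbers mean nothing beyond illustrating the calls.

```python
import pandas as pd

# Hypothetical stand-in for `info`; the real rows come from superheroes_info.csv.
toy_info = pd.DataFrame({
    "Name": ["Hero A", "Hero A", "Hero B", "Hero C", "Hero B"],
    "Publisher": ["Pub 1", "Pub 1", "Pub 2", "Pub 1", "Pub 2"],
})

n_rows = len(toy_info)
n_unique = toy_info["Name"].nunique()

# keep=False flags every row whose Name appears more than once.
dupe_rows = toy_info[toy_info["Name"].duplicated(keep=False)]

print(f"Number of rows: {n_rows}")
print(f"Number of unique names: {n_unique}")
print(f"Share of unique names: {n_unique / n_rows:.2f}")
print(dupe_rows.sort_values("Name"))
```

Looking at `dupe_rows` side by side is a quick way to eyeball whether the repeats really are alternate versions of the same character.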

OK, now let's look at `stats`.

In [None]:
stats = pd.read_csv('superheroes_stats.csv')
stats.head()

Looks ok. Let's try to `pd.merge`

In [None]:
heroes = pd.merge(info, stats, how='inner', on='Name')

In [None]:
heroes.head().T

Looks like 'Alignment' is duplicated. Let's clean that up.

In [None]:
heroes.columns

In [None]:
heroes.rename(columns={'Alignment_x':'Alignment'}, inplace=True)
heroes.drop(['Alignment_y'], axis=1, inplace=True)

In [None]:
heroes.columns

Better. 
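For the curious, here is a small sketch of where those `_x` and `_y` columns come from, and one way to avoid them. The two frames below are hypothetical stand-ins for `info` and `stats`, with made-up heroes and values.

```python
import pandas as pd

# Hypothetical stand-ins for `info` and `stats`; both carry an Alignment column.
toy_info = pd.DataFrame({
    "Name": ["Hero A", "Hero B"],
    "Alignment": ["good", "bad"],
    "Publisher": ["Pub 1", "Pub 2"],
})
toy_stats = pd.DataFrame({
    "Name": ["Hero A", "Hero B"],
    "Alignment": ["good", "bad"],
    "Strength": [80, 95],
})

# Merging on Name alone leaves two Alignment columns, suffixed _x and _y.
merged = pd.merge(toy_info, toy_stats, how="inner", on="Name")
print(list(merged.columns))

# Dropping the repeated column from one side first avoids the suffixes.
clean = pd.merge(toy_info, toy_stats.drop(columns="Alignment"), how="inner", on="Name")
print(list(clean.columns))
```

Either route ends up in the same place. Dropping before the merge just saves the rename-and-drop step afterwards.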

To simplify our demo work, let's remove any rows where any of the numeric variables are `NaN`.

In [None]:
cols = ['Intelligence', 'Strength', 'Speed', 'Durability', 'Power', 'Combat', 'Total'] 
heroes = heroes.dropna(subset=cols).copy()

In [None]:
heroes.shape

In [None]:
heroes.describe()

In [None]:
heroes.info()

Good enough for our practice work.

# Univariate Statistics

Let's cover a couple of classic ways to perform univariate exploration of continuous variables: Histograms and Boxplots. 

First up, the histogram:

In [None]:
px.histogram(data_frame=heroes
     , x="Strength"
     , title="Strength Distribution : Count of Heroes"
     , template='plotly'
     )

You can read more on how cool histograms are here.
https://medium.com/@johnnaujoks/extreme-makeover-histogram-edition-fdb824d7e58

And then the boxplot. So elegant in its simplicity it almost makes me tear up.

In [None]:
px.box(data_frame=heroes
    , y="Speed"
    , title="Distribution of Heroes' Speed Ratings"
    , template='presentation'
    )

More can be found on boxplots here: https://dev.to/annalara/deconstructing-the-box-and-whisker-plot-11f3
https://medium.com/@larrychewy/the-box-and-the-bees-7d0b6ded65db
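To make the box and whiskers a bit more concrete, here is a rough sketch of the numbers a boxplot summarizes, using only the standard library. The ratings are made up, and the quartile convention (`method="inclusive"`) is my assumption; plotting libraries may use a slightly different one, so don't expect the values to match a chart exactly.

```python
import statistics

# Hypothetical speed ratings, not taken from the Kaggle data.
speeds = [10, 20, 25, 30, 35, 40, 50, 55, 60, 100]

# Quartiles via the "inclusive" method; other conventions shift the cut points a little.
q1, median, q3 = statistics.quantiles(speeds, n=4, method="inclusive")
iqr = q3 - q1

# The common 1.5 * IQR rule: points beyond the fences are usually drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in speeds if s < lower_fence or s > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}], outliers: {outliers}")
```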

But wait. Is that violin music I hear?

In [None]:
px.violin(data_frame=heroes
          , y="Speed"
          , box=True
          , title="Distribution of Heroes' Speed Ratings"
          , template='presentation'
         )

[Violin plots](https://en.wikipedia.org/wiki/Violin_plot) are becoming increasingly popular. I like to think of them as boxplot's cooler, younger sibling.

What about investigating categorical variables one by one? Usually, we want to see what the relative counts of distinct values look like.

Enter, the [bar chart](https://medium.com/@Infogram/the-dos-and-donts-of-bar-charts-bd2df09e5cd1). Here's a classic version:

In [None]:
# Aggregate publisher counts
heroes_publisher = pd.DataFrame(heroes['Publisher'].value_counts()).reset_index()
heroes_publisher.columns = ['publisher','counts']

In [None]:
px.bar(data_frame=heroes_publisher
       , x='publisher'
       , y='counts'
       , template='plotly_white'
       , title='Count of Heroes by Publisher'
      )
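A side note on the aggregation step: the column names that `value_counts().reset_index()` produces have differed between pandas versions, which is why the columns get renamed explicitly above. Below is a sketch of an alternative that names both columns in one chain. `toy_heroes` is a hypothetical stand-in for `heroes`, with made-up rows.

```python
import pandas as pd

# Hypothetical stand-in for `heroes`; only the Publisher column matters here.
toy_heroes = pd.DataFrame({
    "Publisher": ["Pub 1", "Pub 2", "Pub 1", "Pub 3", "Pub 1", "Pub 2"],
})

publisher_counts = (
    toy_heroes["Publisher"]
    .value_counts()               # counts per publisher, largest first
    .rename_axis("publisher")     # name the index explicitly
    .reset_index(name="counts")   # name the value column explicitly
)
print(publisher_counts)
```

The resulting frame has the same `publisher` and `counts` columns that the bar chart expects.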

Univariate analysis is all well and good, but usually we are not solely trying to get a feel for the distribution of one variable, but for its relationship to one or more other variables. So let's flex our `plotly.express` muscles on some examples of bivariate techniques.



# Bivariate Comparison

Let's start with comparing continuous variables to other continuous variables.

### Continuous vs Continuous

[Scatter plots](https://medium.com/@mia.iseman/in-praise-of-scatterplots-and-bubble-charts-e1f39548ee84) are the tried and true way of comparing two continuous (numeric) variables. It's a great way to quickly assess whether a relationship exists between the two variables. 

In the example below, we further give ourselves a helping hand at spotting a relationship by adding a trendline. It appears that there is a weak positive correlation between `Strength` and `Intelligence`.

In [None]:
px.scatter(data_frame=heroes
           , x="Strength"
           , y="Intelligence"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , template='plotly_dark'
          )
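If you want to sanity-check what an OLS trendline is telling you, here is a minimal sketch that fits the same kind of straight line by hand with `numpy`. The ratings are made up for illustration. Also worth knowing: as far as I can tell, `trendline='ols'` needs the `statsmodels` package installed, so treat that as an assumption to verify in your own environment.

```python
import numpy as np

# Hypothetical ratings, not taken from the Kaggle data.
strength = np.array([10, 20, 40, 60, 80, 100], dtype=float)
intelligence = np.array([50, 45, 60, 55, 70, 65], dtype=float)

# Ordinary least squares fit of a degree-1 polynomial: intelligence ~ slope * strength + intercept.
slope, intercept = np.polyfit(strength, intelligence, deg=1)

# Pearson correlation coefficient, a quick gauge of how strong the linear relationship is.
r = np.corrcoef(strength, intelligence)[0, 1]

print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}")
```

A positive slope with a small `r` is what "weak positive correlation" looks like in numbers.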

In [None]:
heroes['Year'] = heroes.loc[:,'Year'].fillna(0).astype(int)

In [None]:
heroes.info()

In [None]:
heroes_first_appear_year = heroes.loc[
        heroes['Year']!=0,:].groupby(['Year']).agg(
            {'Name':'count'}).reset_index().rename(
                columns={'Name':'Num_Heroes'})

A special case of continuous versus continuous (or, if you really want to, discrete) comparison is the time series. The classic way to plot one is with a [line plot](https://medium.com/@patrickbfuller/line-plot-7b4068a3a9fc). Almost always the time variable runs along the x-axis while the other continuous variable is measured along the y-axis, so you can see how it changed over time!

Here's an example looking at `Number of Superheroes` by their `Year of First Appearance`.

In [None]:
px.line(data_frame=heroes_first_appear_year
        ,x='Year'
        ,y='Num_Heroes'
        ,template='ggplot2'
        ,title="Number of Heroes by Year of First Appearance"
        ,labels={"Num_Heroes":"Number of Heroes"}
       )

In [None]:
heroes.columns

### Categorical vs Continuous

What if we want to compare categorical versus continuous variables? Well, it turns out that we can just use univariate techniques, but "repeat" them! One of my favorite ways is using a stacked histogram. We can make a histogram of our continuous variable for each value of a categorical variable, and then just stack them!

For example, let's revisit our histogram from before on `Strength`, but this time we'd like to see the counts separated out by `Gender`. I prefer to see them stacked like this:

In [None]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution : Count of Heroes"
             , template='plotly'
            )

But maybe you want to see the like bins grouped together? 

In [None]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , barmode = 'group'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution : Count of Heroes"
             , template='plotly'
            )

...or maybe you prefer to see them unstacked? 

In [None]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , facet_row='Gender'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution"
             , template='plotly')
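All three histogram variants above draw the same underlying table: hero counts per bin, per category. Here is a rough sketch of building that table by hand, with a made-up `toy_heroes` frame and arbitrary bin edges that I picked for illustration (the charts choose their own bins).

```python
import pandas as pd

# Hypothetical stand-in for `heroes`, with made-up values.
toy_heroes = pd.DataFrame({
    "Strength": [5, 15, 35, 55, 85, 95, 10, 60],
    "Gender": ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Male"],
})

# Bin Strength into fixed-width buckets; these edges are an arbitrary choice.
strength_bin = pd.cut(toy_heroes["Strength"], bins=[0, 25, 50, 75, 100])

# Rows are bins, columns are Gender values, cells are hero counts.
counts = pd.crosstab(strength_bin, toy_heroes["Gender"])
print(counts)
```

Stacking, grouping, and faceting are just three ways of laying out the columns of this table.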

Boxplots want to get in on the action!

In [None]:
px.box(
        data_frame=heroes[~heroes.Gender.isna()]
        , y="Speed"
        , color="Gender"
        , title="Distribution of Heroes' Speed Ratings"
        , template='presentation')

And whatever boxplots can do, so can violin plots!

In [None]:
px.violin(
        heroes[~heroes.Gender.isna()]
        , y="Speed"
        , color="Gender"
        , box=True
        , title="Distribution of Heroes' Speed Ratings"
        , template='presentation')

### Categorical vs. Categorical

So what if you just want to compare categorical vs. categorical values? Usually, in that case, you want to look at relative counts, so stacked bars are a good way to go:

In [None]:
heroes.columns

In [None]:
px.histogram(data_frame=heroes
             ,x="Publisher"
             ,y="Name"
             ,color="Alignment"
             ,histfunc="count"
             ,title="Distribution of Heroes, by Publisher | Good-Bad-Neutral"
             ,labels={'Name':'Characters'}
             ,template='plotly_white'
            )

Aside: It turns out that stacked bar charts are way easier using `.histogram`, since it gives access to `histfunc`, which lets you apply an aggregation function to the values in each bin. That saves you from having to aggregate first (which you may have noticed was done for the `.bar` chart above).
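For comparison, here is a sketch of the aggregate-first route that `histfunc` lets us skip. `toy_heroes` is a hypothetical stand-in with made-up rows, and the `Characters` column name is just borrowed from the chart label above.

```python
import pandas as pd

# Hypothetical stand-in for `heroes`, with made-up rows.
toy_heroes = pd.DataFrame({
    "Name": ["Hero A", "Hero B", "Hero C", "Hero D", "Hero E"],
    "Publisher": ["Pub 1", "Pub 1", "Pub 2", "Pub 2", "Pub 1"],
    "Alignment": ["good", "bad", "good", "good", "good"],
})

# One row per Publisher/Alignment pair, with the number of characters in it.
publisher_alignment = (
    toy_heroes.groupby(["Publisher", "Alignment"])
    .size()
    .reset_index(name="Characters")
)
print(publisher_alignment)

# This frame could then be passed to px.bar with
# x="Publisher", y="Characters", color="Alignment".
```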

## Mix it up!

So we may be sensing a pattern here. We can turn any univariate visualization into a bivariate one (or more) by using  another visual element, such as color; or by splitting (sometimes called `facet`ing) along category values. 

Let's explore a third variable! 

Maybe add a categorical variable to our scatter plot using color?

In [None]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Intelligence"
           , color="Alignment"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , opacity=0.5
           , template='plotly_dark')

Maybe *this* data set is not that interesting with the added category, but categories really stand out when you find the right pattern, such as with the classic iris data set...like this:

![image.png](iris_scatter.png)

#### Univariate
- Continuous
    - Histograms
    - Boxplots
- Categorical
    - Bar charts

#### Bivariate
- Continuous vs Continuous
    - Scatter
    - Line Plot
- Continuous vs Categorical
    - Use Univariate techniques but use a different chart for each 
    - Use colors, markers, shapes
- Categorical vs Categorical


But going back to our original scatter with color, what if we wanted to add on a *third* continuous variable? How about if we tied it to the size of our markers?

In [None]:
heroes.columns

In [None]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Intelligence"
           , color="Alignment"
           , size="Power"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , opacity=0.5
           , template='plotly_dark'
          )

In [None]:
heroes.loc[:,['Gender','Publisher']].notnull()

Wow, Galactus is tops in Strength, Intelligence, and size! One thing I noticed is that the legend doesn't automatically add an entry for Size = 'Power'. Hey, plotly.express has already spoiled me in the course of this post!

### Scatter Matrix

In [None]:
heroes.columns

In [None]:
heroes[~heroes['Gender'].isna() & ~heroes['Publisher'].isna()].describe()

In [None]:
px.scatter_matrix(data_frame=heroes[~heroes['Gender'].isna()]
                  , dimensions=["Strength", "Speed", "Power"] 
                  , color="Alignment"
                  , symbol="Gender" 
                  , title='Heroes Attributes Comparison'
                  , hover_name='Name'
                  , template='seaborn'
                 )

### Scatter with Marginal Plots

In [None]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Speed"
           , color="Alignment"
           , title='Strength vs Speed | by Alignment'
           , trendline='ols' 
           , marginal_x='histogram'
           , marginal_y='box'
           , hover_name='Name'
           , opacity=0.2
           , template='seaborn'
          )

## Sources:

https://www.plotly.express/

plotly. Introducing Plotly Express, 20 Mar 2019, https://medium.com/@plotlygraphs/introducing-plotly-express-808df010143d. Accessed 11 May 2019.


https://towardsdatascience.com/plotly-express-yourself-98366e35ad0f