#### Univariate
- Continuous
    - Histograms
    - Boxplots
- Categorical
    - Bar charts

#### Bivariate
- Continuous vs Categorical
    - Use Univariate techniques but use a different chart for each 
    - Use colors, markers, shapes
- Discrete vs Categorical
- Continuous vs Continuous

# A stroll through basic visualizations using plotly express.

I've recently discovered plotly express, and am super-pumped to add it to my toolbox.

To put it through its paces, I thought it would be a good time to run through all the standard visualization plots.

Let's request some data from Kaggle.

In [2]:
import pandas as pd
import json
import os
import plotly_express as px

In [3]:
# def get_keys(path):
#     with open(path) as f:
#         return json.load(f)

# keys = get_keys(os.path.join(os.environ['HOME'], '.secret/kaggle.json'))

# client_id = keys['username']
# api_key = keys['key']

I found this dataset on Kaggle that looks interesting:
https://www.kaggle.com/magshimimsummercamp/superheroes-info-and-stats#superheroes_info.csv

Here is the API command:
`kaggle datasets download -d magshimimsummercamp/superheroes-info-and-stats`

You can copy it to your clipboard by clicking here:
![alt_command](./images/api_cmd_small.png)

Let's use the shell to get that data!

In [4]:
# Download the file
!kaggle datasets download -d magshimimsummercamp/superheroes-info-and-stats

superheroes-info-and-stats.zip: Skipping, found more recently modified local copy (use --force to force download)


In [5]:
# Make sure it's in there
!ls

Blog 2..ipynb                  superheroes-info-and-stats.zip
blog_2.ipynb                   superheroes_info.csv
data                           superheroes_power_matrix.csv
images                         superheroes_stats.csv
python_libraries.ipynb


In [6]:
# Unzip the file
!unzip superheroes-info-and-stats.zip

Archive:  superheroes-info-and-stats.zip
replace superheroes_stats.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C


In [7]:
# Look at files again
!ls

Blog 2..ipynb                  superheroes-info-and-stats.zip
blog_2.ipynb                   superheroes_info.csv
data                           superheroes_power_matrix.csv
images                         superheroes_stats.csv
python_libraries.ipynb


Sweet. After perusing the files on Kaggle, I decided to use a combination of `superheroes_info.csv` and `superheroes_stats.csv`.

Let's start with `info`.

In [8]:
info = pd.read_csv('superheroes_info.csv')
info.head()

Unnamed: 0.1,Unnamed: 0,Name,Identity,Status,Gender,Alignment,Race,Height,Weight,EyeColor,HairColor,SkinColor,Publisher,Year,Appearances,FirstAppearance,AdditionalData
0,0,Spider-Man,Secret,Living,Male,Good,Human,178.0,74.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
1,1,Spider-Man,Secret,Living,Male,Good,Human,178.0,77.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
2,2,Spider-Man,Secret,Living,Male,Good,Human,157.0,56.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
3,3,Captain America,Public,Living,Male,Good,Human,188.0,108.0,Blue,White,,Marvel,1941.0,3360.0,1941-03-01,Steven Rogers
4,4,Captain America,Secret,Living,Male,Bad,Human,188.0,108.0,Blue,Blond,,Marvel,1966.0,1.0,1966-10-01,"Impersonator, Sons of the Serpent"


Let's drop that `Unnamed: 0` column.

In [9]:
info.columns

Index(['Unnamed: 0', 'Name', 'Identity', 'Status', 'Gender', 'Alignment',
       'Race', 'Height', 'Weight', 'EyeColor', 'HairColor', 'SkinColor',
       'Publisher', 'Year', 'Appearances', 'FirstAppearance',
       'AdditionalData'],
      dtype='object')

In [10]:
info.drop(['Unnamed: 0'], axis=1, inplace=True)

In [11]:
info.columns

Index(['Name', 'Identity', 'Status', 'Gender', 'Alignment', 'Race', 'Height',
       'Weight', 'EyeColor', 'HairColor', 'SkinColor', 'Publisher', 'Year',
       'Appearances', 'FirstAppearance', 'AdditionalData'],
      dtype='object')

In [12]:
info.head()

Unnamed: 0,Name,Identity,Status,Gender,Alignment,Race,Height,Weight,EyeColor,HairColor,SkinColor,Publisher,Year,Appearances,FirstAppearance,AdditionalData
0,Spider-Man,Secret,Living,Male,Good,Human,178.0,74.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
1,Spider-Man,Secret,Living,Male,Good,Human,178.0,77.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
2,Spider-Man,Secret,Living,Male,Good,Human,157.0,56.0,Hazel,Brown,,Marvel,1962.0,4043.0,1962-08-01,Peter Parker
3,Captain America,Public,Living,Male,Good,Human,188.0,108.0,Blue,White,,Marvel,1941.0,3360.0,1941-03-01,Steven Rogers
4,Captain America,Secret,Living,Male,Bad,Human,188.0,108.0,Blue,Blond,,Marvel,1966.0,1.0,1966-10-01,"Impersonator, Sons of the Serpent"


There appear to be multiple versions of some superheroes. We are only putting together some dummy data for our visualization exercises rather than a rigorous analysis, but let's see how much of a risk these 'dupes' pose.

In [13]:
print("Number of rows: " + str(info.shape[0]))
print("Number of unique names: " + str(len(info['Name'].unique())))

Number of rows: 23777
Number of unique names: 22265


In [14]:
22265 / 23777

0.9364091348782437

So roughly 6% of the rows repeat a name that already appears. That seems sizeable, but judging from our small sample, the repeats appear to be alternate-universe versions of the same superhero, so I think it's fine for our not-so-serious exercise.
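The same check can be run directly with pandas instead of dividing the two counts by hand. This is a minimal sketch on a made-up stand-in for `info` (the `toy_info` frame and its rows are assumptions for illustration, not the Kaggle data):

```python
import pandas as pd

# Toy stand-in for the `info` frame; the rows are invented for illustration.
toy_info = pd.DataFrame({
    "Name": ["Spider-Man", "Spider-Man", "Spider-Man",
             "Captain America", "Captain America", "Abe Sapien"],
    "Height": [178.0, 178.0, 157.0, 188.0, 188.0, 191.0],
})

# Rows whose Name already appeared earlier in the frame (the "extra" copies).
extra_copies = toy_info["Name"].duplicated().sum()

# Share of rows that are extra copies, i.e. 1 - unique names / total rows.
dupe_share = extra_copies / len(toy_info)

# Names that occur more than once, with the number of versions of each.
versions = toy_info["Name"].value_counts()
multi_version = versions[versions > 1]

print(extra_copies, dupe_share)
print(multi_version.to_dict())
```

On the real data, swapping `toy_info` for `info` should give the complement of the ratio computed above, plus a list of which heroes have the most versions.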

Ok, now let's look at `stats`

In [26]:
stats = pd.read_csv('superheroes_stats.csv')
stats.head()

Unnamed: 0,Name,Alignment,Intelligence,Strength,Speed,Durability,Power,Combat,Total
0,3-D Man,good,50.0,31.0,43.0,32.0,25.0,52.0,233.0
1,A-Bomb,good,38.0,100.0,17.0,80.0,17.0,64.0,316.0
2,Abe Sapien,good,88.0,14.0,35.0,42.0,35.0,85.0,299.0
3,Abin Sur,good,50.0,90.0,53.0,64.0,84.0,65.0,406.0
4,Abomination,bad,63.0,80.0,53.0,90.0,55.0,95.0,436.0


Looks ok. Let's try joining the two with `pd.merge`.

In [27]:
heroes = pd.merge(info, stats, how='inner', left_on='Name', right_on= 'Name')

In [28]:
heroes.head().T

Unnamed: 0,0,1,2,3,4
Name,Spider-Man,Spider-Man,Spider-Man,Captain America,Captain America
Identity,Secret,Secret,Secret,Public,Secret
Status,Living,Living,Living,Living,Living
Gender,Male,Male,Male,Male,Male
Alignment_x,Good,Good,Good,Good,Bad
Race,Human,Human,Human,Human,Human
Height,178,178,157,188,188
Weight,74,77,56,108,108
EyeColor,Hazel,Hazel,Hazel,Blue,Blue
HairColor,Brown,Brown,Brown,White,Blond


Looks like 'Alignment' is duplicated. Let's clean that up.

In [29]:
heroes.columns

Index(['Name', 'Identity', 'Status', 'Gender', 'Alignment_x', 'Race', 'Height',
       'Weight', 'EyeColor', 'HairColor', 'SkinColor', 'Publisher', 'Year',
       'Appearances', 'FirstAppearance', 'AdditionalData', 'Alignment_y',
       'Intelligence', 'Strength', 'Speed', 'Durability', 'Power', 'Combat',
       'Total'],
      dtype='object')

In [30]:
heroes.rename(columns={'Alignment_x':'Alignment'}, inplace=True)
heroes.drop(['Alignment_y'], axis=1, inplace=True)

In [31]:
heroes.columns

Index(['Name', 'Identity', 'Status', 'Gender', 'Alignment', 'Race', 'Height',
       'Weight', 'EyeColor', 'HairColor', 'SkinColor', 'Publisher', 'Year',
       'Appearances', 'FirstAppearance', 'AdditionalData', 'Intelligence',
       'Strength', 'Speed', 'Durability', 'Power', 'Combat', 'Total'],
      dtype='object')

Better. 
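One way to avoid the `_x`/`_y` cleanup altogether is to drop the overlapping column from one side before merging. The sketch below uses two made-up frames (`toy_info` and `toy_stats` are assumptions for illustration); it keeps the `Alignment` values from the left-hand frame, which matches the choice made above.

```python
import pandas as pd

# Toy stand-ins for `info` and `stats`; the values are illustrative only.
toy_info = pd.DataFrame({
    "Name": ["A-Bomb", "Abomination", "Nobody"],
    "Alignment": ["Good", "Bad", "Neutral"],
})
toy_stats = pd.DataFrame({
    "Name": ["A-Bomb", "Abomination"],
    "Alignment": ["good", "bad"],
    "Strength": [100.0, 80.0],
})

# Drop the overlapping column from the right-hand frame before merging,
# so the result keeps a single `Alignment` column and no suffixes appear.
toy_heroes = pd.merge(
    toy_info,
    toy_stats.drop(columns=["Alignment"]),
    how="inner",
    on="Name",
)

print(list(toy_heroes.columns))
print(len(toy_heroes))
```

Because this is an inner join, "Nobody" (who has no stats row) is left out of the result.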

To simplify our demo work, let's remove any rows where any of the numeric stat columns (`Intelligence` through `Total`) are `NaN`.

In [32]:
cols = ['Intelligence', 'Strength', 'Speed', 'Durability', 'Power', 'Combat', 'Total'] 
heroes = heroes.dropna(subset=cols).copy()

In [33]:
heroes.shape

(525, 23)

In [34]:
heroes.describe()

Unnamed: 0,Height,Weight,Year,Appearances,Intelligence,Strength,Speed,Durability,Power,Combat,Total
count,443.0,437.0,249.0,241.0,525.0,525.0,525.0,525.0,525.0,525.0,525.0
mean,194.960497,125.71167,1979.578313,237.975104,64.220952,43.847619,39.885714,62.44,58.462857,62.007619,330.864762
std,85.55066,119.869096,20.949894,685.387007,20.38037,33.130663,23.045969,30.013208,27.195873,22.705128,107.013977
min,15.2,2.0,1939.0,1.0,8.0,4.0,8.0,5.0,5.0,10.0,61.0
25%,175.0,63.0,1964.0,2.0,50.0,10.0,23.0,32.0,35.0,42.0,248.0
50%,183.0,86.0,1982.0,10.0,63.0,34.0,35.0,64.0,60.0,64.0,323.0
75%,191.0,135.0,1997.0,109.0,75.0,80.0,53.0,90.0,76.0,80.0,406.0
max,975.0,900.0,2013.0,4043.0,113.0,100.0,100.0,120.0,100.0,101.0,581.0


In [35]:
heroes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 525 entries, 0 to 711
Data columns (total 23 columns):
Name               525 non-null object
Identity           211 non-null object
Status             263 non-null object
Gender             521 non-null object
Alignment          525 non-null object
Race               343 non-null object
Height             443 non-null float64
Weight             437 non-null float64
EyeColor           474 non-null object
HairColor          479 non-null object
SkinColor          61 non-null object
Publisher          522 non-null object
Year               249 non-null float64
Appearances        241 non-null float64
FirstAppearance    249 non-null object
AdditionalData     165 non-null object
Intelligence       525 non-null float64
Strength           525 non-null float64
Speed              525 non-null float64
Durability         525 non-null float64
Power              525 non-null float64
Combat             525 non-null float64
Total              525 non-n

Good enough for our practice work.

# Univariate Statistics

Let's cover a couple of classic ways to perform univariate exploration of continuous variables: Histograms and Boxplots. 

First up, the histogram:

In [38]:
px.histogram(data_frame=heroes
     , x="Strength"
     , title="Strength Distribution : Count of Heroes"
     , template='plotly'
     )

You can read more on how cool histograms are here.
https://medium.com/@johnnaujoks/extreme-makeover-histogram-edition-fdb824d7e58

And then the boxplot. So elegant in its simplicity it almost makes me tear up.

In [39]:
px.box(data_frame=heroes
    , y="Speed"
    , title="Distribution of Heroes' Speed Ratings"
    , template='presentation'
    )

More can be found on boxplots here: https://dev.to/annalara/deconstructing-the-box-and-whisker-plot-11f3
https://medium.com/@larrychewy/the-box-and-the-bees-7d0b6ded65db
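To make the anatomy of a boxplot concrete, here is a rough sketch of the numbers it summarizes: the quartiles, the interquartile range, and the 1.5 × IQR fences commonly used to flag outliers. The `speed` values are made up, and plotly may compute quartiles with a slightly different interpolation than pandas' default, so treat this as an approximation of what the chart draws.

```python
import pandas as pd

# Made-up speed ratings, including one deliberately extreme value.
speed = pd.Series([8, 17, 23, 35, 35, 43, 53, 53, 60, 100], name="Speed")

# The box: first quartile, median, third quartile.
q1, median, q3 = speed.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Common outlier rule: points beyond 1.5 * IQR from the box edges.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = speed[(speed < lower_fence) | (speed > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Here only the value 100 falls outside the fences, so it is the one point a boxplot would draw on its own.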

But wait. Is that violin music I hear?

In [40]:
px.violin(data_frame=heroes
          , y="Speed"
          , box=True
          , title="Distribution of Heroes' Speed Ratings"
          , template='presentation'
         )

[Violin plots](https://en.wikipedia.org/wiki/Violin_plot) are becoming increasingly popular. I like to think of them as boxplot's cooler, younger sibling.

What about investigating categorical variables one by one? Usually, we want to see what the relative counts of distinct values look like.

Enter, the [bar chart](https://medium.com/@Infogram/the-dos-and-donts-of-bar-charts-bd2df09e5cd1). Here's a classic version:

In [41]:
# Aggregate publisher counts
heroes_publisher = pd.DataFrame(heroes['Publisher'].value_counts()).reset_index()
heroes_publisher.columns = ['publisher','counts']

In [42]:
px.bar(data_frame=heroes_publisher
       , x='publisher'
       , y='counts'
       , template='plotly_white'
       , title='Count of Heroes by Publisher'
      )
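The aggregation step for the bar chart can also be written as a single method chain that names both columns explicitly, rather than overwriting `.columns` afterwards. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in for the `Publisher` column, including one missing value.
toy = pd.DataFrame(
    {"Publisher": ["Marvel", "DC", "Marvel", None, "Marvel", "DC", "Image"]}
)

publisher_counts = (
    toy["Publisher"]
    .value_counts()               # skips missing values, sorts by count descending
    .rename_axis("publisher")     # name the index so it becomes a named column
    .reset_index(name="counts")   # name the counts column
)

print(publisher_counts)
```

The resulting frame has the same `publisher` and `counts` columns that the `px.bar` call above expects.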

Univariate analysis is all well and good, but usually we are not solely trying to get a feel for the distribution of one variable, but for its relationship to one or more other variables. So let's flex our `plotly-express` muscles on some examples of bivariate techniques.



# Bivariate Comparison

Let's start with comparing continuous variables to other continuous variables.

### Continuous vs Continuous

[Scatter plots](https://medium.com/@mia.iseman/in-praise-of-scatterplots-and-bubble-charts-e1f39548ee84) are the tried and true way of comparing two continuous (numeric) variables. It's a great way to quickly assess whether a relationship exists between the two variables. 

In the example below, we further give ourselves a helping hand at spotting a relationship by adding a trendline. It appears that there is a weak positive correlation between `Strength` and `Intelligence`.

In [43]:
px.scatter(data_frame=heroes
           , x="Strength"
           , y="Intelligence"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , template='plotly_dark'
          )
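A trendline gives a visual impression of the relationship; a correlation coefficient puts a number on it. The sketch below computes Pearson's r with pandas on made-up values (the `toy` frame is an assumption for illustration, so the result says nothing about the real heroes data):

```python
import pandas as pd

# Made-up ratings for five heroes; not taken from the Kaggle data.
toy = pd.DataFrame({
    "Strength": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Intelligence": [50.0, 40.0, 60.0, 55.0, 70.0],
})

# Pearson correlation between the two columns (ranges from -1 to 1).
r = toy["Strength"].corr(toy["Intelligence"])

# For a simple one-variable linear fit, r squared is the share of variance explained.
r_squared = r ** 2

print(round(r, 3), round(r_squared, 3))
```

Running the same two lines on `heroes` would show how weak the "weak positive correlation" actually is.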

In [45]:
heroes['Year'] = heroes.loc[:,'Year'].fillna(0).astype(int)

In [46]:
heroes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 525 entries, 0 to 711
Data columns (total 23 columns):
Name               525 non-null object
Identity           211 non-null object
Status             263 non-null object
Gender             521 non-null object
Alignment          525 non-null object
Race               343 non-null object
Height             443 non-null float64
Weight             437 non-null float64
EyeColor           474 non-null object
HairColor          479 non-null object
SkinColor          61 non-null object
Publisher          522 non-null object
Year               525 non-null int64
Appearances        241 non-null float64
FirstAppearance    249 non-null object
AdditionalData     165 non-null object
Intelligence       525 non-null float64
Strength           525 non-null float64
Speed              525 non-null float64
Durability         525 non-null float64
Power              525 non-null float64
Combat             525 non-null float64
Total              525 non-nul

In [47]:
heroes_first_appear_year = heroes.loc[
        heroes['Year']!=0,:].groupby(['Year']).agg(
            {'Name':'count'}).reset_index().rename(
                columns={'Name':'Num_Heroes'})

A special case of continuous versus continuous (or, if you really want to, discrete) comparison is the time series. The classic way to show one is with a [line plot](https://medium.com/@patrickbfuller/line-plot-7b4068a3a9fc). Almost always the time variable runs along the x-axis while the other continuous variable is measured along the y-axis. And now you can see how it changed over time!

Here's an example looking at `Number of Superheroes` by their `Year of First Appearance`.

In [48]:
px.line(data_frame=heroes_first_appear_year
        ,x='Year'
        ,y='Num_Heroes'
        ,template='ggplot2'
        ,title="Number of Heroes by Year of First Appearance"
        ,labels={"Num_Heroes":"Number of Heroes"}
       )
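The yearly counts feeding this line plot were built by filling missing years with `0` and then filtering the zeros back out. An alternative sketch, assuming you would rather not introduce a sentinel value, is to drop the missing years first and count what remains. The `toy` frame below is made up for illustration:

```python
import pandas as pd

# Toy stand-in for `heroes`, with one hero missing a Year.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Year": [1962.0, 1941.0, 1962.0, None, 1941.0],
})

per_year = (
    toy.dropna(subset=["Year"])          # discard heroes with no known year
       .astype({"Year": int})            # 1962.0 -> 1962 for cleaner axis labels
       .groupby("Year")
       .size()                           # number of rows (heroes) per year
       .reset_index(name="Num_Heroes")
)

print(per_year)
```

This leaves the `Year` column in the original frame untouched, so summary statistics on it are not skewed by zeros.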

### Categorical vs Continuous

What if we want to compare categorical versus continuous variables? Well, it turns out that we can just use univariate techniques and "repeat" them! One of my favorite ways is using a stacked histogram. We can make a histogram of our continuous variable for each value of a categorical variable, and then just stack them!

For example, let's revisit our histogram from before on `Strength`, but this time we'd like to see it separated out by `Gender`. I prefer to see them stacked like this:

In [50]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution : Count of Heroes"
             , template='plotly'
            )

But maybe you want to see like bins grouped side by side?

In [51]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , barmode = 'group'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution : Count of Heroes"
             , template='plotly'
            )

...or maybe you prefer to see them unstacked? 

In [52]:
px.histogram(data_frame=heroes[~heroes.Gender.isna()]
             , x="Strength"
             , color='Gender'
             , facet_row='Gender'
             , labels={'count':'Count of Heroes'}
             , title="Strength Distribution"
             , template='plotly')

Boxplots want to get in on the action!

In [53]:
px.box(
        data_frame=heroes[~heroes.Gender.isna()]
        , y="Speed"
        , color="Gender"
        , title="Distribution of Heroes' Speed Ratings"
        , template='presentation')

And whatever boxplots can do, violin plots can do too!

In [54]:
px.violin(
        heroes[~heroes.Gender.isna()]
        , y="Speed"
        , color="Gender"
        , box=True
        , title="Distribution of Heroes' Speed Ratings"
        , template='presentation')

### Categorical vs. Categorical

What if you just want to compare one categorical variable against another? Usually that means looking at relative counts, so stacked bars are a good way to go:

In [56]:
px.histogram(data_frame=heroes
             ,x="Publisher"
             ,y="Name"
             ,color="Alignment"
             ,histfunc="count"
             ,title="Distribution of Heroes, by Publisher | Good-Bad-Neutral"
             ,labels={'Name':'Characters'}
             ,template='plotly_white'
            )

Aside: It turns out that stacked bar charts are way easier using `.histogram`, since it gives access to `histfunc`, which lets you apply an aggregation function to each bin. That saves you from having to aggregate first (which you may have noticed was done for the `.bar` chart above).

## Mix it up!

So we may be sensing a pattern here. We can turn any univariate visualization into a bivariate one (or more) by using  another visual element, such as color; or by splitting (sometimes called `facet`ing) along category values. 

Let's explore a third variable! 

Maybe add a categorical variable to our scatter plot using color?

In [57]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Intelligence"
           , color="Alignment"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , opacity=0.5
           , template='plotly_dark')

Maybe *this* data set is not that interesting with the added category, but categories really stand out when you find the right pattern, such as with the classic iris data set...like this:

![image.png](iris_scatter.png)

But going back to our original scatter with color, what if we wanted to add on a *third* continuous variable? How about if we tied it to the size of our markers?

In [59]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Intelligence"
           , color="Alignment"
           , size="Power"
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , opacity=0.5
           , template='plotly_dark'
          )

In [None]:
heroes.loc[:,['Gender','Publisher']].notnull()

In [None]:
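px.scatter(data_frame=heroes.dropna(subset=['Gender', 'Publisher'])  # keep rows that have both Gender and Publisher
           , x="Strength"
           , y="Intelligence"
           , color="Alignment"
           , size="Power"
           , symbol="Gender"  # assumption: the original left this blank; Gender is the symbol used in the scatter matrix below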
           , trendline='ols'
           , title='Heroes Comparison: Strength vs Intelligence'
           , hover_name='Name'
           , opacity=0.5
           , template='plotly_dark'
          )

Wow, Galactus is tops in Strength, Intelligence, and size! One thing I noticed is that the legend doesn't automatically add an entry for Size = 'Power'. Hey, plotly.express has already spoiled me in the course of this post!

### Scatter Matrix

In [61]:
heroes[~heroes['Gender'].isna() & ~heroes['Publisher'].isna()].describe()

Unnamed: 0,Height,Weight,Year,Appearances,Intelligence,Strength,Speed,Durability,Power,Combat,Total
count,439.0,434.0,518.0,241.0,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,195.07631,125.354839,951.573359,237.975104,64.505792,43.864865,39.92278,62.642857,58.859073,62.083012,331.878378
std,85.929271,119.583683,990.113516,685.387007,20.326136,33.199373,23.089386,29.993001,27.097885,22.648326,106.82377
min,15.2,2.0,0.0,1.0,8.0,4.0,8.0,5.0,5.0,10.0,61.0
25%,175.0,63.0,0.0,2.0,50.0,10.0,23.0,32.75,36.0,42.0,251.5
50%,183.0,86.0,0.0,10.0,63.0,34.0,35.0,64.0,60.0,64.0,323.0
75%,191.0,135.0,1979.0,109.0,75.0,80.0,53.0,90.0,77.75,80.0,412.75
max,975.0,900.0,2013.0,4043.0,113.0,100.0,100.0,120.0,100.0,101.0,581.0


In [80]:
px.scatter_matrix(data_frame=heroes[~heroes['Gender'].isna()]
                  , dimensions=["Strength", "Speed", "Power"] 
                  , color="Alignment"
                  , symbol="Gender" 
                  , title='Heroes Attributes Comparison'
                  , hover_name='Name'
                  , template='seaborn'
                 )

### Scatter with marginal plots

In [91]:
px.scatter(data_frame=heroes[~heroes.Gender.isna()]
           , x="Strength"
           , y="Speed"
           , color="Alignment"
           , title='Strength vs Speed | by Alignment'
           , trendline='ols' 
           , marginal_x='histogram'
           , marginal_y='box'
           , hover_name='Name'
           , opacity=0.2
           , template='seaborn'
          )

## Sources:

https://www.plotly.express/

plotly. Introducing Plotly Express, 20 Mar 2019, https://medium.com/@plotlygraphs/introducing-plotly-express-808df010143d. Accessed 11 May 2019.
