In [1]:
%run supportvectors-common.ipynb






# The scatterplots revisited, with altair

## Load and explore the data

For this exercise, we will work with the classic, oft-used `auto` dataset, which relates various automobile engine characteristics to the mileage of the automobile. Perhaps the reader has encountered this dataset before, especially in other SupportVectors workshops, or perhaps in other data science or machine learning textbooks.

This data has some missing values, which we will drop before we continue with the visualization journey.

In [2]:
source = 'https://raw.githubusercontent.com/supportvectors/ml-100/master/Auto.csv'
data = pd.read_csv(source)
data.sample(5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name
114,26.0,4,98.0,90,2265,15.5,73,2,fiat 124 sport coupe
278,31.5,4,89.0,71,1990,14.9,78,2,volkswagen scirocco
237,30.5,4,98.0,63,2051,17.0,77,1,chevrolet chevette
57,24.0,4,113.0,95,2278,15.5,72,3,toyota corona hardtop
72,15.0,8,304.0,150,3892,12.5,72,1,amc matador (sw)


In [3]:
data.describe(include="all").transpose()

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
mpg,397.0,,,,23.515869,7.825804,9.0,17.5,23.0,29.0,46.6
cylinders,397.0,,,,5.458438,1.701577,3.0,4.0,4.0,8.0,8.0
displacement,397.0,,,,193.532746,104.379583,68.0,104.0,146.0,262.0,455.0
horsepower,397.0,94.0,150,22.0,,,,,,,
weight,397.0,,,,2970.261965,847.904119,1613.0,2223.0,2800.0,3609.0,5140.0
acceleration,397.0,,,,15.555668,2.749995,8.0,13.8,15.5,17.1,24.8
year,397.0,,,,75.994962,3.690005,70.0,73.0,76.0,79.0,82.0
origin,397.0,,,,1.574307,0.802549,1.0,1.0,1.0,2.0,3.0
name,397.0,304.0,ford pinto,6.0,,,,,,,


Now, the feature `name` of an automobile is irrelevant to its mileage; therefore, we will drop it. We will also drop records where a missing `horsepower` value is recorded as `?`. Next, we will convert `horsepower` to a numerical value, and `origin` to a `string` value. (We know *a priori*, from the documentation of the dataset, that the geographical origin of the automobile is encoded as a number: `1`, `2`, or `3`.)

In [4]:
data = data[data.horsepower !='?'] \
            .astype({'horsepower':'float', 'origin':'string'}) \
            .drop(columns='name')
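
As an aside, the same cleaning step can be sketched with `pd.to_numeric`, which turns any non-numeric marker (not only `?`) into `NaN`. This is a minimal sketch on a hypothetical three-row frame named `raw`, not the actual `Auto.csv` file; the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the raw data: '?' marks a missing horsepower.
raw = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
    "origin": [1, 2, 1],
})

cleaned = (
    raw.assign(horsepower=pd.to_numeric(raw["horsepower"], errors="coerce"))  # '?' becomes NaN
       .dropna(subset=["horsepower"])                                         # drop the incomplete rows
       .astype({"origin": "string"})                                          # treat origin as a category label
)
print(cleaned.dtypes)
```

The filter used in the notebook is shorter; coercion is only worth considering when the missing-value marker is not known in advance.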

## Scatterplots using `altair`

Let us see how we can render the scatterplots using `altair`. 

Refer to the `altair` documentation here: https://altair-viz.github.io/getting_started/overview.html

### A simple scatter plot

Let us start by plotting a simple scatter plot of `horsepower` vs `mpg`.

`altair` chart visualization can be done using the `Chart` class and its methods.

*  the top-level **Chart** object accepts the data
*  the **mark** method specifies how the encoded attributes should be represented in the chart
*  the **encode** method maps data columns to visual attributes of the chart

#### Encoding types

The encoding type of a data column determines how altair interprets the values of the column. The data columns can be encoded in several different types:

| encoding type | shorthand* | description |
|---|---|---|
| `quantitative` | `Q` | continuous real valued quantity |
| `ordinal` | `O` | discrete ordered quantity |
| `nominal` | `N` | discrete unordered category |
| `temporal` | `T` | time or date value |
| `geojson` | `G` | geographic shape |

To learn more about encoding types, refer to:
https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types 

#### Shorthand

A **shorthand** is used to conveniently specify the column name and its type as a single string. (It can specify the aggregate as well.) The shorthand is used in the code below. In the following code cells, we will use the long form, which allows for more customization.

In [5]:

(alt.Chart(data)
    .mark_circle(size=60)          # mark_circle for creating scatter plots
    .encode(
            x='horsepower:Q',      # encode x-axis value to the data column 'horsepower' of type "quantitative"
            y='mpg:Q',             # encode y-axis value to the data column 'mpg' of type "quantitative"
))       
    

# compare to plotting with pandas
# data.plot.scatter(x='horsepower', y='mpg')

The basic plot in altair is formatted better than the basic matplotlib plot. However, there is scope for improvement. Let us re-tread the path we have taken with the previous notebooks pertaining to the bar plots.

### Applying a dash of style 

We will do the following:

* resize the figure
* pick a color for the points
* add a bit of transparency, for aesthetics
* increase the size of the points
* add a title
* slightly change the `y-axis title`

The above tasks can be done with the methods `.encode()`, `.mark_*()` and `.properties()`.

#### Customize with mark
The `Chart.mark_*()` methods allow customization of the mark properties. For a scatter plot we use `Chart.mark_circle()`. A few of the mark properties are: 
* `size`
* `stroke`
* `strokeWidth`
* `color` etc.

For the full list of mark properties, refer to: https://altair-viz.github.io/user_guide/marks.html#mark-properties.

These properties can also be encoded using the encode method.


#### Customize with encode

The `Chart.encode()` method provides several **channels** for mapping the data columns to visual attributes of which we will review three in this notebook:

1. position channels - `x`, `y`, `theta` etc.
2. mark property channels - `color`, `size`, `fill`, etc.
3. text and tooltip channels - `text`, `tooltip` 

Each channel can be customized using the **encode channel options**. For example the channel `x` can be encoded using the `alt.X` class with the encode channel options passed as parameters. The encode channel options include:
* `field`
* `type`
* `aggregate`
* `title`
* `scale` 
* `sort` etc.

For the full list of encode channel options, refer to: https://altair-viz.github.io/user_guide/encoding.html#encoding-channel-options.

Note that, in the code below we do not use the shorthand to encode the `x` channel. Instead the altair class `X` is used. For each encoding channel we specify the `field` and `type` individually along with the `scale` domain.

#### Customize with properties

The `Chart.properties()` method is used to configure the figure. 

In [6]:

(alt.Chart(data)
    .mark_circle(
                 size=150,                             # increase the size of the points
                 color='salmon',                       # pick a color for the points
                 opacity=0.5,                          # add transparency
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     # encode x-axis using the Altair class "X" instead of the shorthand 
                    type='quantitative',                    # specify encoding type
                    scale=alt.Scale(domain=[40, 240]),      # change the x-axis domain from the default to [40, 240]
                    ),
            y=alt.Y(
                    field='mpg',                            # y-axis is encoded using the Altair class "Y"
                    type='quantitative',                    # specify encoding type
                    title="mpg (mileage)",                  # change the y-axis title
                    ),                 
           )
    .properties(
                width=800,                                      # resize width of the figure
                height=400,                                     # resize height of figure                         
                title="Automobile's mileage versus horsepower", # set title to the figure
               )
      
)

### Add more information with color

#### Continuous color scale 
Perhaps it would be instructive to further color each datum in the graph by the weight of the automobile it represents.

This is done by encoding the mark property channel - **color**, similar to encoding the x and y axes. Specifying the type of the data column plays an important role here. A "quantitative" type produces a continuous color scale, while the "ordinal" type produces a discrete ordered color scale. We use the shorthand code `weight:Q` to specify the type as quantitative, since weight takes continuous values.

For the predefined color schemes, refer to:
https://vega.github.io/vega/docs/schemes/

In [7]:
(alt.Chart(data)
    .mark_circle(
                 size=150,                             
                 color='salmon',                                       # note: the color specified within mark_circle is overridden                       
                 opacity=0.5,                          
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     
                    type='quantitative',                   
                    scale=alt.Scale(domain=[40, 240]),
                    ),     
            y=alt.Y(
                    field='mpg',                           
                    type='quantitative',                   
                    title="mpg (mileage)",
                    ),
            color=alt.Color(
                            'weight:Q',                                 # encode color property of mark_circle with data column "weight" 
                             scale=alt.Scale(scheme='yelloworangered')  # set a predefined color scheme 
                            ), 
            )                 
    .properties(
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower",
    )
)


#### Discrete unordered color scale 
Next, we will explore how we can color each datum by the geographical origin of the automobile it represents. In altair this can be easily done by changing the encoding `field` to the data column `origin` and specifying the `type` as `nominal`. The `nominal` type produces a discrete unordered color scale. Alternatively, the shorthand `origin:N` can be used.

In [8]:
(alt.Chart(data)
    .mark_circle(
                 size=150,                             
                 color='salmon',                                  # note: the color specified within mark_circle is overridden                       
                 opacity=0.5,                          
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     
                    type='quantitative',                   
                    scale=alt.Scale(domain=[40, 240]),
                    ),     
            y=alt.Y(
                    field='mpg',                           
                    type='quantitative',                   
                    title="mpg (mileage)",
                    ),
            color=alt.Color(
                            'origin:N',                           # encode color property of mark_circle with data column "origin" 
                            scale=alt.Scale(scheme='tableau10'),  # set a predefined color scheme 
                            ), 
            )                 
    .properties(
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower", 
    )
)

Observe how we have managed to use color to add another dimension to the plot, making it represent three features `mpg`, `horsepower`, and `origin`.

### Add more information with size

We can take it one step further, by harnessing the size of the points to represent another scalar feature, say, number of cylinders. Thus in total we now are showing a four-dimensional projection of the data. This is done by encoding the mark property channel - **size**, similar to encoding the x-axis, y-axis and color. Here we set the `type` to `ordinal` for a discrete ordered size scale.


To learn more about scales, refer to https://altair-viz.github.io/user_guide/generated/core/altair.Scale.html


#### Point scale

For the `size` encoding channel with field type `ordinal`, `altair` uses a **point** scale by default. A point scale maps discrete values to a continuous range. 

To learn more about scale types, refer to: https://vega.github.io/vega-lite/docs/scale.html#type

Let us customize this scale by specifying its domain and range. The domain of ordinal fields can be specified with an array of valid input values. We get the valid input values from the dataset as shown below: 

In [9]:
data['cylinders'].unique()

array([8, 4, 6, 3, 5])

These values are passed as `domain` to the `Scale` class. The range can be specified using a numerical extent `[a,b]`.
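
Because `unique()` returns values in order of first appearance, it can help to sort them before using them as an ordinal domain. The following is a minimal sketch on a hypothetical series named `cylinders`, standing in for `data['cylinders']`.

```python
import pandas as pd

# Hypothetical stand-in for data['cylinders'].
cylinders = pd.Series([8, 4, 6, 3, 5, 4, 8])

# Sort the distinct values and convert them to plain Python ints
# so they can be passed as the `domain` of a scale.
domain = sorted(int(c) for c in cylinders.unique())
print(domain)
```

The resulting list can be passed as `domain` in place of the hand-typed values.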

In [10]:
(alt.Chart(data)
    .mark_circle(
                 size=150,                                        # note: the size specified within mark_circle is overridden                             
                 color='salmon',                                                        
                 opacity=0.5,                          
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     
                    type='quantitative',                   
                    scale=alt.Scale(domain=[40, 240]),
                    ),     
            y=alt.Y(
                    field='mpg',                           
                    type='quantitative',                   
                    title="mpg (mileage)",
                    ),
            color=alt.Color(
                            'origin:N',                           
                            scale=alt.Scale(scheme='tableau10'),  
                            ),
            size=alt.Size(
                          'cylinders:O',                           # encode size channel with data column "cylinders" of type "ordinal"
                           scale=alt.Scale(
                                           domain=[3,4, 5, 6, 8], 
                                           range=[25,300],         # map cylinder counts [3,4, 5, 6, 8] to circle sizes in [25,300]
                                           ),
                         ) 
            )                 
    .properties(
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower", 
    )
)

On doing so, we notice that the plot has an interesting story to tell -- most low mileage cars seem to be of high horsepower, originate from the geographical location `1`, and contain a relatively high number of cylinders.

### Adding annotations

Sometimes, to explain vital parts of the data, it may be desirable to draw attention to a few facts. Altair does not provide a direct feature for creating annotations with arrows. Instead, annotations can be created by layering a separate text chart for each label over the base chart. The text charts are created using the `Chart.mark_text()` method. 

Charts can be layered on top of one another using the `+` sign to create compound plots.  

Let us draw attention to the automobiles with the highest and lowest mileage. 

In [11]:
plot = (alt.Chart(data)
           .mark_circle(
                        size=150,                                                         
                        color='salmon',                                           
                        opacity=0.5,                          
                        stroke='white',
                        strokeWidth=1,
           )
           .encode(
                  x=alt.X(
                          "horsepower:Q",                   
                          scale=alt.Scale(domain=[40, 240])
                          ),     
                  y=alt.Y(
                          "mpg:Q",                   
                          title="mpg (mileage)",
                          ),
                  color=alt.Color(
                                  'origin:N',               
                                   scale=alt.Scale(scheme='tableau10'),
                                  ),
                  size=alt.Size(
                                'cylinders:O',                            
                                 scale=alt.Scale(
                                                 domain=[3,4, 5, 6, 8],
                                                 range=[25,300]),
                                ),
                )
)


#------------ annotate minimum mpg ------------- 


min_text=alt.Chart(data[(data.mpg == data.mpg.min())]).\
             mark_text(
                        size=12,
                        text = "minimum mpg",
                        align='center',                     # center align to x and y positions
                        dy=20,                              # offset the text 
                       ).\
             encode(
                     x="horsepower:Q", 
                     y="mpg:Q"
                     )                                      # encode x and y positions 
        
         

# add circle around the point to draw attention
min_dot = min_text.mark_circle(
                               size=300,
                               strokeWidth=2,
                               stroke='black',
                               fillOpacity=0,
                               )


#------------ annotate maximum mpg -------------

max_text=alt.Chart(data[(data.mpg == data.mpg.max())]).\
             mark_text(
                        size=12,
                        text = "maximum mpg",
                        align='center',                     # center align to x and y positions
                        dy=-20,                             # offset the text 
                       ).\
             encode(
                     x="horsepower:Q", 
                     y="mpg:Q")                             # encode x and y positions 
                    
        

# add circle around the point to draw attention
max_dot = max_text.mark_circle(
                               size=300,
                               strokeWidth=2,
                               stroke='black',
                               fillOpacity=0,
                               )


#------------ compose charts -------------  

((plot + min_text + max_text + min_dot + max_dot)           # layer charts 
    .properties( 
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower", 
                )
)
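
The annotation layers select their rows with a boolean mask, which keeps every row that ties for the extreme value, so several labels may be drawn. If a single label per extreme is preferred, `idxmin` and `idxmax` are one alternative. The sketch below uses a hypothetical frame named `cars` with a deliberate tie; the values are made up.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data, with a tie for the minimum mpg.
cars = pd.DataFrame({
    "mpg": [9.0, 46.6, 9.0, 23.0],
    "horsepower": [193.0, 48.0, 215.0, 97.0],
})

# Boolean mask: keeps every row that ties for the extreme value.
all_min = cars[cars.mpg == cars.mpg.min()]

# idxmin / idxmax: keep only the first row holding the extreme value.
first_min = cars.loc[[cars.mpg.idxmin()]]
first_max = cars.loc[[cars.mpg.idxmax()]]

print(len(all_min), len(first_min), len(first_max))
```

Either result can be passed to `alt.Chart(...)` as the data for a text layer.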


### Add Tooltip 

To add more information to the plot, the tooltip feature can be used. A tooltip shows selected column values when the cursor hovers over a point.

In [12]:
(alt.Chart(data)
    .mark_circle(
                 size=150,                                                                    
                 color='salmon',                                                        
                 opacity=0.5,                          
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     
                    type='quantitative',                   
                    scale=alt.Scale(domain=[40, 240]),
                    ),     
            y=alt.Y(
                    field='mpg',                           
                    type='quantitative',                   
                    title="mpg (mileage)",
                    ),
            color=alt.Color(
                            'origin:N',                           
                            scale=alt.Scale(scheme='tableau10'),  
                            ),
            size=alt.Size(
                          'cylinders:O',                           
                           scale=alt.Scale(
                                           domain=[3,4, 5, 6, 8], 
                                           range=[25,300],         
                                           ),
                         ),
            tooltip=['mpg', 'horsepower', 'origin', 'cylinders']   # add tooltip to show more information
            )                 
    .properties(
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower", 
    )
)

## Interactive charts

Make a chart interactive by calling its `interactive()` method, which enables panning and zooming.

In [13]:
(alt.Chart(data)
    .mark_circle(
                 size=150,                                                                    
                 color='salmon',                                                        
                 opacity=0.5,                          
                 stroke='white',
                 strokeWidth=1,
                )
    .encode(
            x=alt.X(
                    field='horsepower',                     
                    type='quantitative',                   
                    scale=alt.Scale(domain=[40, 240]),
                    ),     
            y=alt.Y(
                    field='mpg',                           
                    type='quantitative',                   
                    title="mpg (mileage)",
                    ),
            color=alt.Color(
                            'origin:N',                           
                            scale=alt.Scale(scheme='tableau10'),  
                            ),
            size=alt.Size(
                          'cylinders:O',                           
                           scale=alt.Scale(
                                           domain=[3,4, 5, 6, 8], 
                                           range=[25,300],         
                                           ),
                         ),
            tooltip=['mpg', 'horsepower', 'origin', 'cylinders']   
            )                 
    .properties(
                width=800,                                     
                height=400,                                    
                title="Automobile's mileage versus horsepower", 
    )
).interactive()                                                      # make interactive

### Multiple interactive panels with interval selection

This is an example of using an interval selection to control the scale of the right chart based on the selection from the left chart (by dragging across the chart). 

Steps followed:
 * Create selection objects to store the selected x and y ranges
 * Bind the selection objects to the chart
 * Use the selection object to set the `domain` of the axes scale of the right chart

Refer to "Bindings, Selections, Conditions: Making Charts Interactive":
https://altair-viz.github.io/user_guide/interactions.html#bindings-selections-conditions-making-charts-interactive

#### Compound charts
To display multiple charts in the same figure, the charts need to be composed to form compound charts. Charts can be vertically concatenated using `&` and horizontally concatenated using `|`. Charts can be layered on top of each other using `+`.

Here we horizontally concatenate the charts using `|`.

In [14]:
#------------ selection -------------

x_extent = alt.selection_interval(encodings=['x'])    # create a selection object to capture the x-range selected
y_extent = alt.selection_interval(encodings=['y'])    # create a selection object to capture the y-range selected

#------------ base -------------

base = (alt.Chart(data)
           .properties(
                        width=400, 
                        height=400,
                      )
           .add_selection(
                          x_extent,                # bind selection object to the base chart
                          y_extent,
                         )
       )                   

#------------ left chart ------------- 

main = (base.mark_circle(
                         size=150,
                         opacity=0.5,                          
                         stroke='white',
                         strokeWidth=1,
                        )
            .encode(
                    x=alt.X(
                            'horsepower:Q',                  
                             scale=alt.Scale(domain=[30, 240]),
                            ),     
                    y=alt.Y(
                            'mpg:Q',                   
                            title="mpg (mileage)",
                            scale=alt.Scale(domain=[0,50])
                            ),
                    color=alt.Color(
                                    'origin:N',               
                                     scale=alt.Scale(scheme='tableau10')
                                    )
                    )
       )

#------------ right chart -------------


zoom = (base.mark_circle(
                         size=150,
                         opacity=0.5,                          
                         stroke='white',
                         strokeWidth=1,
                        )
            .encode(
                    x=alt.X(
                            'horsepower:Q',                  
                             scale=alt.Scale(domain=x_extent),   # set scale to the selected x-range
                            ),     
                    y=alt.Y(
                            'mpg:Q',                   
                            title="mpg (mileage)",
                            scale=alt.Scale(domain=y_extent)     # set scale to the selected y-range
                            ),
                    color=alt.Color(
                                    'origin:N',               
                                     scale=alt.Scale(scheme='tableau10')
                                    )
                    )
       )

#------------ compose charts -------------

(main | zoom).properties(
                         title={
                                "text":"Automobile's mileage versus horsepower",
                                "anchor":"middle",
                               }
                        )

In [15]:
#------------ selection -------------

x_extent = alt.selection_interval(encodings=['x'])    # create a selection object to capture the x-range selected
y_extent = alt.selection_interval(encodings=['y'])    # create a selection object to capture the y-range selected

#------------ base -------------

base = alt.Chart(data)
                   
#------------ top chart ------------- 

scatter = (base.mark_circle(
                         size=150,
                         opacity=0.5,                          
                         stroke='white',
                         strokeWidth=1,
                        )
            .encode(
                    x=alt.X(
                            'horsepower:Q',                  
                             scale=alt.Scale(domain=[30, 240]),
                            ),     
                    y=alt.Y(
                            'mpg:Q',                   
                            title="mpg (mileage)",
                            scale=alt.Scale(domain=[0,50])
                            ),
                    color=alt.Color(
                                    'origin:N',               
                                     scale=alt.Scale(scheme='tableau10')
                                    )
                    )
           .add_selection(x_extent)                # bind selection object to the scatter chart
           .properties(
                       width=800,
                       height=400
           )
           
       )
       

#------------ bottom chart -------------


hist = (base.mark_bar()
            .encode(
                    x=alt.X(
                            'count(origin):Q',  
                            scale=alt.Scale(domain=[0,250]),
                            ),     
                    y=alt.Y(
                            'origin:N',                   
                            title="origin",
                            ),
                    color=alt.Color(
                                    'origin:N',               
                                     scale=alt.Scale(scheme='tableau10')
                                    )
                    )
            .transform_filter(x_extent)             # filter data based on selection
            .properties(
                        width=800,
                        height=100
                       )
       )

#------------ compose charts -------------

(scatter & hist).properties(
                         title={
                                "text":"Automobile's mileage versus horsepower",
                                "anchor":"middle",
                               },
                        )

## Facetted charts

One can create compound plots as **facetted charts** that place various statistical plots of the data adjacent to, say, the scatter plot of the data itself. This is done by composing the charts using the different stacking options discussed previously.

To dive deeper into plotting histograms with `altair`, refer to the `visualization-univariate-altair.ipynb` notebook.

#### Customize legend
By default, the legend is placed at the top right corner, after the rightmost chart. We move it by using the `alt.Legend` class options to manually set the x-position of the legend.

In [16]:
base = alt.Chart(data)                                        # create base chart

xscale = alt.Scale(domain=[30, 240])                          # scales whose domains are reused as the histogram bin extents
yscale = alt.Scale(domain=(0,50))

bar_args = {'opacity': .7, 'binSpacing': 0}

#------------ scatter plot -------------

scatter = (base.mark_circle(
                         size=150,
                         opacity=0.5,                          
                         stroke='white',
                         strokeWidth=1,
                        )
                .encode(
                        x=alt.X(
                                'horsepower:Q',                  
                                 scale=alt.Scale(domain=[30, 240]),
                                ),     
                        y=alt.Y(
                                'mpg:Q',                   
                                title="mpg (mileage)",
                                scale=alt.Scale(domain=[0,50])
                                ),
                        color=alt.Color(
                                        'origin:N',               
                                         scale=alt.Scale(scheme='tableau10')
                                        )
                        )
                .properties(
                            width=800,
                            height=400,
                            )
          )

#------------ top histogram ------------- 

top_hist = (base.mark_bar(**bar_args)
                .encode(
                        x=alt.X(
                                'horsepower:Q',
                                 bin=alt.Bin(
                                             maxbins=20,
                                             extent=xscale.domain
                                             ),
                                title='',
                               ),
                        y=alt.Y(
                                'count()',                                    # shorthand for aggregate 
                                 title=''
                                ),
                        color=alt.Color(
                                        'origin:N',               
                                         scale=alt.Scale(scheme='tableau10'),
                                         legend=alt.Legend(
                                                           orient='none',     # Change default orientation from "right" to "none"
                                                           legendX=850)       # custom x-position for the legend when orient is "none"
                                       )                       
                        )
               .properties(
                          width=800, 
                          height=100,
                          )
)

#------------ right histogram -------------

right_hist = (base.mark_bar(**bar_args)
                  .encode(
                          y=alt.Y(
                                  'mpg:Q',
                                   bin=alt.Bin(
                                               maxbins=20,
                                               extent=yscale.domain
                                               ),
                                  title='',
                                 ),
                          x=alt.X(
                                  'count()',                                  # shorthand for aggregate
                                   stack=True,
                                   title=''
                                  ),
                          color=alt.Color(
                                          'origin:N',
                                           scale=alt.Scale(scheme='tableau10')
                                         )
                          )
                  .properties(
                              width=100,
                              height=400,
                              )
)

#------------ compose charts -------------

top_hist & (scatter | right_hist)