# Seaborn Part 2

COMP 4304 / 6934\
Terrence Tricco

Seaborn provides a high-level interface for creating Matplotlib plots.

Our learning objectives for this notebook are to become familiar with Seaborn's functions for creating bar, box and violin plots and histograms. We will look at a mix of categorical and distribution plots.
- ``barplot()``
- ``countplot()``
- ``histplot()``
- ``boxplot()``
- ``violinplot()``

We will also introduce Seaborn's colour palette system.



Seaborn has excellent API documentation and tutorials: https://seaborn.pydata.org/


## Import Libraries

Seaborn is commonly loaded with the name ``sns``.

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns

## Load Data

We will work with the data set of trending YouTube videos that we have used several times in previous lectures.

In [7]:
df = pd.read_csv('CA_videos.csv', parse_dates=['trending_date', 'publish_time'])

ParserError: Error tokenizing data. C error: EOF inside string starting at row 37895

In [5]:
df['trending_month'] = df.trending_date.dt.month

NameError: name 'df' is not defined

In [None]:
df['trending_quarter'] = df.trending_date.dt.quarter

We will find the top 4 channels based on number of trending videos and filter our data set to just those channels.

In [None]:
top4_channels = (
    df[['video_id', 'channel_title']]
    .groupby('channel_title')
    .nunique()
    .reset_index()
    .sort_values(by='video_id', ascending=False)
    .head(4)
    .channel_title
)

In [None]:
top4_channels

In [None]:
df = df[df.channel_title.isin(top4_channels)]
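
The same select-and-filter pattern can be tried on a small made-up table. The sketch below uses hypothetical channel names and video IDs (not the YouTube data) and keeps the top 2 channels rather than the top 4.

```python
import pandas as pd

# Toy stand-in for the trending data: a video can trend on several days,
# so the same video_id may appear in more than one row.
toy = pd.DataFrame({
    'video_id':      ['a1', 'a1', 'a2', 'a3', 'b1', 'b1', 'b2', 'c1'],
    'channel_title': ['A',  'A',  'A',  'A',  'B',  'B',  'B',  'C'],
})

# Count distinct videos per channel, then keep the names of the top 2.
top2 = (
    toy.groupby('channel_title')['video_id']
       .nunique()
       .sort_values(ascending=False)
       .head(2)
       .index
)

# Keep only the rows that belong to those channels.
toy_top = toy[toy.channel_title.isin(top2)]
print(list(top2))      # ['A', 'B']
print(len(toy_top))    # 7
```

Counting with ``nunique()`` rather than counting rows matters here, because a video that trends on several days would otherwise be counted several times.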

## Bar Plots

Let's investigate the top channel, NBA.

We can make a bar plot, using ``barplot()``, that shows the average number of views per month. With Matplotlib, we would have had to create that data (average views per month) ourselves using groupby. As we saw with lineplot, Seaborn will do that work for us.

In [None]:
df_nba = df[df.channel_title == 'NBA']

In [None]:
sns.barplot(x='trending_month', y='views', data=df_nba)

The height of each bar represents the average number of views for videos in that month. Just like with ``lineplot()``, the error bars reflect the 95% confidence interval for the average. 
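
To see what ``barplot()`` is computing, here is a minimal sketch of the equivalent groupby on made-up numbers (the values are invented for illustration). It reproduces only the bar heights. Seaborn estimates the confidence interval by bootstrapping, which is not reproduced here.

```python
import pandas as pd

# Made-up view counts for two months (not the real NBA data).
toy = pd.DataFrame({
    'trending_month': [1, 1, 1, 2, 2],
    'views':          [100, 200, 300, 400, 600],
})

# barplot() uses the mean as its default estimator, so each bar height
# corresponds to one entry of this grouped average.
mean_views = toy.groupby('trending_month')['views'].mean()
print(mean_views.to_dict())   # {1: 200.0, 2: 500.0}
```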

Bar plots are often used to show the number of items in a category. Seaborn offers a specific function for that - ``countplot()``.

Let's make a plot that shows the number of videos that went trending each month for the NBA.

In [None]:
sns.countplot(x='trending_month', data=df_nba)

Notice that ``countplot()`` has done the work for us of calculating the number of items in each category (the categories are the months, in this case).
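
For comparison, the counting that ``countplot()`` does can be sketched by hand with pandas. The months below are made-up values, not the NBA data.

```python
import pandas as pd

# One row per trending video; only the month matters for counting.
toy = pd.DataFrame({'trending_month': [1, 1, 1, 2, 2, 5]})

# Number of rows in each category, ordered by month.
counts = toy['trending_month'].value_counts().sort_index()
print(counts.to_dict())   # {1: 3, 2: 2, 5: 1}
```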

``barplot()`` and ``countplot()`` can also work with multiple categories at a time. For example, let's create the previous bar plot showing average views per month, but for two of the top channels. We can represent them in our plot by using the ``hue=`` parameter.

In [None]:
df_two = df[df.channel_title.isin(['NBA', 'FORMULA 1'])]

In [None]:
sns.barplot(x='trending_month', y='views', data=df_two, hue='channel_title')

Also notice how Seaborn has changed the colour scheme to something sensible. We will discuss Seaborn's colour options further below.

We could also represent all 4 top channels.

In [None]:
sns.barplot(x='trending_month', y='views', data=df, hue='channel_title')

This shows exactly the same information as a line plot of the same data. The average number of views for each of the 4 channels is present, and the error bars for the averages are included.

In [None]:
fig, ax = plt.subplots()

sns.lineplot(x='trending_month', y='views', data=df, hue='channel_title', ax=ax)

Which is the better representation? Is there an advantage of one over the other?

In my opinion, the line plot is clearer than the bar plot and is easier to perceive and interpret. The bar plot has a lot of visual complexity, and it is difficult to follow the trend of one channel. The bar plot is cluttered with so many bars. It may be best for comparing channels for only one month, not across many months.

The line plot does a much better job at showing how the average views change month over month, and the shading for the error bars is much more intuitive.

# Seaborn Distribution Plots

Seaborn offers a variety of plot types for distributions, some of which we have previously studied and some new.

The distribution of a quantitative variable is its spread of values.

## Histograms

Histograms show one distribution at a time. This is different from box or violin plots, which show multiple distributions at once.

In this example, the histogram is the distribution of views for all videos for the NBA.

In [None]:
sns.histplot(x='views', data=df_nba)

By default, the histogram is the count of items in each bin. This can be changed using the ``stat=`` parameter, which takes the following values:
    
- **count:** show the number of observations in each bin
- **frequency:** show the number of observations divided by the bin width
- **probability** (or **proportion**): normalize such that bar heights sum to 1
- **percent:** normalize such that bar heights sum to 100
- **density:** normalize such that the total area of the histogram equals 1

In [None]:
sns.histplot(x='views', data=df_nba, stat='percent')

Visually, the shape of the distribution is the same -- it is the quantitative scale on the y-axis that is different.
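
The relationship between the ``stat=`` options can be sketched with NumPy on a handful of made-up values. This is an illustration of the definitions listed above, not Seaborn's internal code.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 7, 8])   # made-up data

# Raw counts per bin and the width of each bin.
counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

frequency = counts / widths                   # observations per unit of x
probability = counts / counts.sum()           # heights sum to 1
percent = 100 * probability                   # heights sum to 100
density = counts / (counts.sum() * widths)    # total area equals 1

print(probability.sum())          # sum of bar heights
print((density * widths).sum())   # total area of the bars
```

All five versions are rescalings of the same counts. With equal-width bins, as here, that is why the shape of the histogram does not change.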

The bin sizes have been automatically set in the preceding examples. Using the ``bins=`` parameter, there are a number of built-in ways to automatically create the bins, or the number of bins can be specified, or a list of bin edges can be used.

In [None]:
sns.histplot(x='views', data=df_nba, bins=40)
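
Seaborn's documentation states that ``bins=`` is passed to ``numpy.histogram_bin_edges()``, so the three kinds of input can be explored directly with NumPy. The sketch below uses randomly generated values rather than the YouTube data.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=500)   # made-up data

# 1. A number: that many equal-width bins between the min and max.
edges_n = np.histogram_bin_edges(values, bins=10)

# 2. A string: the name of a built-in rule that chooses the bin width.
edges_rule = np.histogram_bin_edges(values, bins='auto')

# 3. A list: explicit bin edges, which may be unevenly spaced.
edges_list = np.histogram_bin_edges(values, bins=[-4, -1, 0, 1, 4])

print(len(edges_n) - 1)      # 10
print(len(edges_rule) - 1)   # number of bins chosen by the rule
print(edges_list)
```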

Multiple distributions can be plotted at once using the ``hue=`` parameter. In this case, let's plot the distribution of views for each of the two channels.

In [None]:
sns.histplot(x='views', data=df_two, hue='channel_title', bins=40)

By default, the two histograms are layered on top of each other. You can specify to Seaborn how multiple distributions should be handled by the ``multiple=`` keyword, which accepts either ``layer`` (default), ``stack``, ``dodge`` or ``fill``.

In [None]:
sns.histplot(x='views', data=df_two, hue='channel_title', bins=40, multiple='stack')

## Box Plots

Box plots show the median, the inter-quartile range (IQR), and individual points beyond the whiskers that are treated as outliers to the distribution.

The box plot below shows the distribution of views for the videos in each month. Unlike histograms, which are ideal for studying a single distribution, box plots are excellent for comparing distributions to one another.
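
As a rough sketch of these quantities, the median, IQR and outliers can be computed by hand. The numbers are made up, and the 1.5 × IQR rule used here matches the default whisker length in Seaborn and Matplotlib.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])   # made-up data

# Quartiles and the inter-quartile range (the height of the box).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points further than 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(median, iqr)   # 5.5 4.5
print(outliers)      # [100]
```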

In [None]:
sns.boxplot(x='trending_month', y='views', data=df_nba)

We can see that February has a wide range, and that March and December have very narrow distributions. We can connect these two months to the `countplot()` we made earlier, where we found that these months have the smallest number of trending videos.

We can also add nested grouping to our box plot. Let's plot the view distribution per month, but use two categories for the two channels we investigated earlier.

In [None]:
sns.boxplot(x='trending_month', y='views', data=df_two, hue='channel_title')

Each month now shows two box plots -- one for the NBA and the second for Formula 1.

## Violin Plots

We can represent the same plots using violin plots.

First is the distribution of views per quarter.

In [None]:
sns.violinplot(x='trending_quarter', y='views', data=df_nba)

The width of the violins can be set using ``density_norm=``.
- ``area``: each violin will have the same area. 
- ``count``: the width of the violins will be scaled by the number of observations in that bin. 
- ``width``: each violin will have the same width.

In [None]:
sns.violinplot(x='trending_quarter', y='views', data=df_nba, density_norm='count')

The interior of the violin has a miniature box plot overlaid on it.

The mini box plot can be changed to lines at the position of the quartiles instead with the ``inner=`` parameter. In effect, this combines the best elements of the box plot and violin plot together.

In [None]:
sns.violinplot(x='trending_quarter', y='views', data=df_nba, density_norm='count', inner='quartiles')

We can also recreate the nested grouping from the second box plot in the previous section, here by quarter, by using the ``hue=`` parameter.

In [None]:
sns.violinplot(x='trending_quarter', y='views', data=df_two, hue='channel_title')

This is much easier to create than with Matplotlib!

One option that may work well in some cases is to use ``split=True``. Rather than drawing the two violins side-by-side, they are drawn as two halves of a single violin.

In [None]:
sns.violinplot(x='trending_quarter', y='views', data=df_two, hue='channel_title', split=True)

Overall, violin plots created through Seaborn are actually useful and easy to work with, unlike those created directly with Matplotlib.

## Colour

We have not specified any colour choices thus far, and instead have relied on Seaborn to choose colours on its own.

There are a number of colour palettes available with Seaborn. The current palette can be obtained with ``sns.color_palette()``.

In [None]:
sns.color_palette()

To set a palette, the ``sns.set_palette()`` function can be used. Palettes can often be specified in plot function calls as well, and individual colours can be specified if that fine-grained customization is needed.

In [None]:
sns.set_palette("Set3")

In [None]:
palette = sns.color_palette()

In [None]:
palette

There are many options.
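
As a small sample of those options, the sketch below requests four colours from a few named palettes and prints them as hex codes. The palette names chosen here are just examples.

```python
import seaborn as sns

# A few example palette names: Seaborn's default, a ColorBrewer
# qualitative palette, a hue-based palette, and a Matplotlib colormap.
names = ['deep', 'Set2', 'husl', 'viridis']

# color_palette(name, n) returns n colours as RGB tuples.
palettes = {name: sns.color_palette(name, 4) for name in names}

for name, colours in palettes.items():
    print(name, colours.as_hex())
```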

The latest versions of Seaborn have changed how palettes are used. Previously, plots would use the palette set globally by `set_palette()`. This does not seem to be the case anymore. Instead, Seaborn advises the following.

In [None]:
sns.set_palette("Set3")
sns.boxplot(x='trending_month', y='views', data=df_two, hue='channel_title')

In [None]:
sns.set_palette("Set1")
sns.boxplot(x='trending_month', y='views', data=df_two, hue='channel_title')

The colour palette can also be specified in the plot function call using ``palette=``. This accepts either the name of the palette, a palette object, or a list of colours.

It seems that Seaborn will change the behaviour of ``palette=`` in a future version. It will require `hue=` to be specified as well. See the FutureWarning below and the official advised way to work around this.

In [None]:
sns.boxplot(x='trending_month', y='views', data=df_nba, palette='hls')

In [None]:
sns.boxplot(x='trending_month', y='views', data=df_nba, palette='hls', hue='trending_month', legend=False)

Instead of using a named palette, a list of specific colours can be used. The colours can be specified by name or by RGB values, such as hex codes.

In [None]:
colours = ['red', 'blue', 'orange', '#880088']

sns.boxplot(x='trending_month', y='views', data=df_nba, palette=colours, hue='trending_month', legend=False)
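
Colour names and hex codes are two spellings of the same thing. As a quick check, Matplotlib's ``to_rgb()`` and ``to_hex()`` helpers convert the colours from the list above between the two forms.

```python
from matplotlib.colors import to_hex, to_rgb

colours = ['red', 'blue', 'orange', '#880088']

# Every entry resolves to an (r, g, b) tuple with components from 0 to 1.
rgb = [to_rgb(c) for c in colours]

# The same colours written as hex codes.
hex_codes = [to_hex(c) for c in colours]
print(hex_codes)   # ['#ff0000', '#0000ff', '#ffa500', '#880088']
```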

All of the plots in this notebook accept the ``palette=`` parameter.

There is a much deeper conversation around colour -- it is vitally important to your visualization, after all. In a future lecture, we will discuss colour in more detail, examining the structure of colours, colour systems, sensible colour choices, and more.

# Summary

Seaborn offers many easy-to-use functions to create nice-looking plots. Most functions use the same syntax, with ``x=``, ``y=``, ``data=``, ``hue=`` and ``palette=``.

There are a number of built-in options to customize the visual look of your plot. And, if needed, they can be further customized using Matplotlib.
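
As one possible illustration of that last point, the functions used in this notebook draw onto a Matplotlib ``Axes``, which can then be adjusted with the usual Matplotlib methods. The data and labels below are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data standing in for the trending videos.
toy = pd.DataFrame({
    'trending_month': [1, 1, 2, 2, 3, 3],
    'views':          [100, 300, 200, 400, 150, 250],
})

# Draw the Seaborn plot onto an Axes that we created ourselves.
fig, ax = plt.subplots(figsize=(6, 3))
sns.boxplot(x='trending_month', y='views', data=toy, ax=ax)

# Further customization uses ordinary Matplotlib calls.
ax.set_title('Views per month (toy data)')
ax.set_xlabel('Month')
ax.set_ylabel('Views')
ax.set_ylim(0, 500)
```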