ARTI308 - Machine Learning
# Seaborn Overview

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.




## Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* distplot
* jointplot
* pairplot
* rugplot
* kdeplot

## Imports

In [1]:
import seaborn as sns  # Import seaborn and enable inline plotting for Jupyter.
%matplotlib inline    

## Data
Seaborn comes with built-in data sets!

In [2]:
tips = sns.load_dataset('tips') # Load the built-in "tips" dataset from Seaborn.

In [3]:
tips.head() # Display the first 5 rows of the "tips" dataset.

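| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |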


## distplot

The distplot shows the distribution of a univariate set of observations.

In [4]:
sns.distplot(tips['total_bill'])
# Safe to ignore warnings


`distplot` is a deprecated function and will be removed in seaborn v0.14.0.

Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).

For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751

  sns.distplot(tips['total_bill'])


<Axes: xlabel='total_bill', ylabel='Density'>

To remove the kde layer and just have the histogram use:

In [5]:
sns.distplot(tips['total_bill'],kde=False,bins=30) # Plotting a histogram of the 'total_bill' column without KDE (Kernel Density Estimate) and setting 30 bins.





<Axes: xlabel='total_bill', ylabel='Density'>

## jointplot

jointplot() pairs the univariate distribution plots of two variables with a plot of their joint (bivariate) relationship. The **kind** parameter selects how the joint relationship is drawn:
* “scatter” 
* “reg” 
* “resid” 
* “kde” 
* “hex”

In [6]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter') # Creating a scatter plot to show the relationship between 'total_bill' and 'tip' using a jointplot.

<seaborn.axisgrid.JointGrid at 0x22e22b897f0>

In [7]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex') # Creating a hexbin plot to show the density of the relationship between 'total_bill' and 'tip'.

<seaborn.axisgrid.JointGrid at 0x22e22c28a50>

In [8]:
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg') # Creating a scatter plot with a regression line to show the relationship between 'total_bill' and 'tip'.

<seaborn.axisgrid.JointGrid at 0x22e2518b4d0>

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns). 

In [9]:
sns.pairplot(tips) # Creating a pairplot to visualize the relationships between all pairs of columns in the 'tips' dataset.

<seaborn.axisgrid.PairGrid at 0x22e22b896a0>

In [10]:
sns.pairplot(tips,hue='sex',palette='coolwarm') # Creating a pairplot with the data, colored by 'sex' column and using the 'coolwarm' color palette.

<seaborn.axisgrid.PairGrid at 0x22e26c6b110>

## rugplot

rugplots are a very simple concept: they just draw a dash mark for every point on a univariate distribution. They are the building block of a KDE plot:

In [11]:
sns.rugplot(tips['total_bill']) # Creating a rug plot to visualize the distribution of the 'total_bill' column with small ticks.

<Axes: xlabel='total_bill'>

## kdeplot

kdeplots are [Kernel Density Estimation plots](http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth). These KDE plots replace every single observation with a Gaussian (Normal) distribution centered around that value. For example:

In [12]:
# Don't worry about understanding this code!
# It's just for the diagram below
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

#Create dataset
dataset = np.random.randn(25)

# Create another rugplot
sns.rugplot(dataset);

# Set up the x-axis for the plot
x_min = dataset.min() - 2
x_max = dataset.max() + 2

# 100 equally spaced points from x_min to x_max
x_axis = np.linspace(x_min,x_max,100)

# Set up the bandwidth, for info on this:
url = 'http://en.wikipedia.org/wiki/Kernel_density_estimation#Practical_estimation_of_the_bandwidth'

bandwidth = ((4*dataset.std()**5)/(3*len(dataset)))**.2


# Create an empty kernel list
kernel_list = []

# Plot each basis function
for data_point in dataset:
    
    # Create a kernel for each point and append to list
    kernel = stats.norm(data_point,bandwidth).pdf(x_axis)
    kernel_list.append(kernel)
    
    #Scale for plotting
    kernel = kernel / kernel.max()
    kernel = kernel * .4
    plt.plot(x_axis,kernel,color = 'grey',alpha=0.5)

plt.ylim(0,1)

(0.0, 1.0)

In [13]:
# To get the kde plot we can sum these basis functions.

# Plot the sum of the basis function
sum_of_kde = np.sum(kernel_list,axis=0)

# Plot figure
fig = plt.plot(x_axis,sum_of_kde,color='indianred')

# Add the initial rugplot
sns.rugplot(dataset,c = 'indianred')

# Get rid of y-tick marks
plt.yticks([])

# Set title
plt.suptitle("Sum of the Basis Functions")

Text(0.5, 0.98, 'Sum of the Basis Functions')
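
The plain sum of kernels has the right shape but not the right scale. Dividing the sum by the number of observations turns it into a density whose area is 1. The sketch below checks this numerically. It reuses the bandwidth formula from the cell above, while the helper `gaussian_pdf` and the grid size are choices made for this example:

```python
import numpy as np

rng = np.random.default_rng(42)
dataset = rng.normal(size=25)

# Same rule-of-thumb bandwidth as in the cell above.
bandwidth = ((4 * dataset.std() ** 5) / (3 * len(dataset))) ** 0.2

# A fine grid that extends well past the outermost observations.
x_axis = np.linspace(dataset.min() - 5 * bandwidth,
                     dataset.max() + 5 * bandwidth, 2000)

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# One kernel per observation; rows are kernels, columns are grid points.
kernels = np.array([gaussian_pdf(x_axis, point, bandwidth) for point in dataset])

# Averaging (sum divided by n) turns the sum of kernels into a density.
density = kernels.sum(axis=0) / len(dataset)

# Riemann-sum approximation of the area under the curve.
dx = x_axis[1] - x_axis[0]
area = density.sum() * dx
print(round(area, 3))
```

The printed area should be very close to 1, which is what makes the curve a density estimate rather than just a smoothed count.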

So with our tips dataset:

In [14]:
# Plotting a KDE (Kernel Density Estimate) plot for the 'total_bill' column.
sns.kdeplot(tips['total_bill']) 
sns.rugplot(tips['total_bill'])

<Axes: xlabel='total_bill', ylabel='Density'>

In [15]:
# Plotting the KDE and rug plot for the 'tip' column.
sns.kdeplot(tips['tip'])
sns.rugplot(tips['tip'])

<Axes: xlabel='tip', ylabel='Density'>

# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data! There are a few main plot types for this:

* catplot (formerly factorplot)
* boxplot
* violinplot
* stripplot
* swarmplot
* barplot
* countplot

Let's go through examples of each!

In [16]:
# Import Seaborn for statistical plots and enable inline plotting for Jupyter.
import seaborn as sns
%matplotlib inline

In [17]:
tips = sns.load_dataset('tips') # Load the "tips" dataset from Seaborn and display the first 5 rows.
tips.head()

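| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |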


## barplot and countplot

These very similar plots let you aggregate data by a categorical feature. **barplot** is a general plot that aggregates the values in each category with some function, by default the mean:

In [18]:
sns.barplot(x='sex',y='total_bill',data=tips) # Bar plot of the mean 'total_bill' for each 'sex' (the default estimator is the mean).

<Axes: xlabel='sex', ylabel='total_bill'>

In [19]:
import numpy as np # Importing the NumPy library, commonly used for numerical operations.

You can change the estimator to your own function, as long as it converts a vector to a scalar:

In [20]:
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std) # Create a bar plot that shows the standard deviation of the total_bill amounts by sex.

<Axes: xlabel='sex', ylabel='total_bill'>
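
To see what the estimator does, compare the bar heights with the same aggregation done in pandas. This is a sketch under stated assumptions: `demo` is a small hand-made stand-in for `tips`, and `value_range` is a hypothetical custom estimator written for this example:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Small hand-made stand-in for the tips data (illustrative values only).
demo = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "total_bill": [16.99, 10.34, 21.01, 24.59, 23.68, 35.26],
})

def value_range(values):
    """Custom estimator: reduce a vector to one number (max minus min)."""
    return np.max(values) - np.min(values)

# The same aggregation computed directly with pandas.
expected = demo.groupby("sex")["total_bill"].agg(value_range)
print(expected)

# Bar heights drawn by seaborn for the same estimator.
fig, ax = plt.subplots()
sns.barplot(x="sex", y="total_bill", data=demo, estimator=value_range, ax=ax)
heights = [bar.get_height() for bar in ax.patches]
print(heights)
```

The two printouts should contain the same numbers, one per category.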

### countplot

This is essentially the same as barplot, except the estimator explicitly counts the number of occurrences, which is why we only pass the x value:

In [21]:
sns.countplot(x='sex',data=tips) # Create a count plot to visualize the distribution of values in the 'sex' column.

<Axes: xlabel='sex', ylabel='count'>
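
The counts that countplot draws are the same numbers pandas reports with `value_counts()`. A quick sketch on a small hand-made column (the `demo` values are an assumption, not the real data):

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small hand-made stand-in for tips['sex'] (illustrative values only).
demo = pd.DataFrame({"sex": ["Female", "Male", "Male", "Male", "Female", "Male"]})

# Occurrences of each category, computed with pandas.
counts = demo["sex"].value_counts()
print(counts)

# The same counts drawn as bars.
fig, ax = plt.subplots()
sns.countplot(x="sex", data=demo, ax=ax)
heights = [bar.get_height() for bar in ax.patches]
```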

## boxplot and violinplot

boxplots and violinplots are used to show the distribution of quantitative data across the levels of a categorical variable. A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

In [22]:
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow') # Create a boxplot to visualize the distribution of 'total_bill' across different days, using the 'rainbow' color palette.


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow') # Create a boxplot to visualize the distribution of 'total_bill' across different days, using the 'rainbow' color palette.


<Axes: xlabel='day', ylabel='total_bill'>

In [23]:
# Can do entire dataframe with orient='h'
sns.boxplot(data=tips,palette='rainbow',orient='h')


<Axes: >

In [24]:
sns.boxplot(x="day", y="total_bill", hue="smoker",data=tips, palette="coolwarm") # Create a boxplot comparing the total bill for smokers vs non-smokers across different days, using the 'coolwarm' color palette.

<Axes: xlabel='day', ylabel='total_bill'>

### violinplot
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

In [25]:
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow') # Create a violin plot to visualize the distribution of total bills across different days, using the 'rainbow' color palette.




<Axes: xlabel='day', ylabel='total_bill'>

In [26]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',palette='Set1') # Create a violin plot to visualize the distribution of total bills across different days, differentiated by gender ('sex'), using the 'Set1' color palette.

<Axes: xlabel='day', ylabel='total_bill'>

In [27]:
sns.violinplot(x="day", y="total_bill", data=tips,hue='sex',split=True,palette='Set1') # Create a split violin plot to compare the distribution of total bills across days, with a breakdown by gender ('sex') using the 'Set1' color palette.

<Axes: xlabel='day', ylabel='total_bill'>

## stripplot and swarmplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.

The swarmplot is similar to stripplot(), but the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values, although it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).

In [28]:
# Create a strip plot to visualize the distribution of total bills across days, with color variations using the 'rainbow' palette.
sns.stripplot(x="day", y="total_bill", data=tips, palette='rainbow')




<Axes: xlabel='day', ylabel='total_bill'>

In [29]:
# Create a strip plot to visualize the distribution of total bills across days, adding some random noise (jitter) to the points for better visualization. The color variation is controlled using the 'rainbow' palette.
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True, palette='rainbow')




<Axes: xlabel='day', ylabel='total_bill'>

In [30]:
# Create a strip plot to visualize the distribution of total bills across days, with different colors for each gender ('sex') using the 'Set1' color palette. Random noise (jitter) is added to the data for better clarity.
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1')

<Axes: xlabel='day', ylabel='total_bill'>

In [31]:
# Plot the distribution of total bills for different days of the week. This visualization adds jitter for clarity and separates data based on gender using the 'sex' column with the 'Set1' color palette. The dodge parameter ensures that the data points for each gender are displayed separately.
sns.stripplot(x="day", y="total_bill", data=tips,jitter=True,hue='sex',palette='Set1',dodge=True)

<Axes: xlabel='day', ylabel='total_bill'>

In [32]:
sns.swarmplot(x="day", y="total_bill", data=tips) # Visualize 'total_bill' by 'day'



<Axes: xlabel='day', ylabel='total_bill'>

### Combining Categorical Plots

In [33]:
# Violin plot of 'tip' by 'day' using the 'rainbow' palette.
sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')
# Overlay the individual observations as small black points.
sns.swarmplot(x="tip", y="day", data=tips,color='black',size=3)


Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.violinplot(x="tip", y="day", data=tips,palette='rainbow')


<Axes: xlabel='tip', ylabel='day'>

## catplot

catplot (called factorplot in older seaborn releases) is the most general form of a categorical plot. It can take in a kind parameter to adjust the plot type:

In [34]:
sns.catplot(x='sex',y='total_bill',data=tips,kind='bar') # Create a bar plot of 'total_bill' by 'sex'

<seaborn.axisgrid.FacetGrid at 0x22e22b8acf0>
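
Because catplot returns a `FacetGrid`, the same call can also split the data into panels. A sketch with `kind='box'` and the `col` parameter, using a synthetic `demo` DataFrame as a stand-in for `tips` (an assumption for a self-contained example):

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (illustrative values only).
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "day": np.tile(["Thur", "Fri", "Sat", "Sun"], 10),
    "time": np.repeat(["Lunch", "Dinner"], 20),
    "total_bill": rng.uniform(5, 50, size=40),
})

# kind picks the plot type; col draws one panel per value of 'time'.
g = sns.catplot(x="day", y="total_bill", col="time", data=demo, kind="box")

# One row of axes, one column per 'time' value.
print(g.axes.shape)
```

Other values of `kind`, such as `'violin'`, `'strip'`, `'swarm'`, and `'count'`, select the other plot types covered in this section.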

# Matrix Plots

Matrix plots allow you to plot data as color-encoded matrices and can also be used to indicate clusters within the data (later in the machine learning section we will learn how to formally cluster data).

Let's begin by exploring seaborn's heatmap and clustermap:

In [35]:
import seaborn as sns  # Import necessary libraries and load the dataset
%matplotlib inline

In [36]:
flights = sns.load_dataset('flights')

In [37]:
tips = sns.load_dataset('tips')

In [38]:
tips.head() # View first 5 rows of the 'tips' dataset

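| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |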


In [39]:
flights.head() # View first 5 rows of the 'flights' dataset

| | year | month | passengers |
|---|---|---|---|
| 0 | 1949 | Jan | 112 |
| 1 | 1949 | Feb | 118 |
| 2 | 1949 | Mar | 132 |
| 3 | 1949 | Apr | 129 |
| 4 | 1949 | May | 121 |


## Heatmap

For a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:

In [40]:
tips.head() # Check the first few rows of the 'tips' dataset

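| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |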


In [41]:
# Matrix form for correlation data (numeric columns only; text columns such as 'smoker' cannot be correlated)
tips.corr(numeric_only=True)

In [None]:
sns.heatmap(tips.corr(numeric_only=True)) # Display correlation of the numeric columns

In [None]:
sns.heatmap(tips.corr(numeric_only=True),cmap='coolwarm',annot=True) # Add color and annotations to the heatmap
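
Correlations are only defined for numeric columns, and recent pandas versions raise a `ValueError` when `corr()` meets text columns. One way to make the intent explicit is to select the numeric columns first. The sketch below uses the five rows of `tips` shown above as a tiny stand-in, so the numbers themselves mean little:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# The first five rows of tips, as printed above (a tiny stand-in for the full data).
demo = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Keep only numeric columns, then build the square correlation matrix.
corr = demo.select_dtypes(include="number").corr()
print(corr.round(2))

# The matrix is already in the form heatmap expects: labeled rows and columns.
fig, ax = plt.subplots()
sns.heatmap(corr, cmap="coolwarm", annot=True, ax=ax)
```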

Or for the flights data:

In [None]:
flights.pivot_table(values='passengers',index='month',columns='year') # Pivot data for month vs year

In [None]:
pvflights = flights.pivot_table(values='passengers',index='month',columns='year') # Store pivot table
sns.heatmap(pvflights)

In [None]:
sns.heatmap(pvflights,cmap='magma',linecolor='white',linewidths=1) # Display the heatmap with customized color scheme
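
The pivot step is what turns the long table (one row per month and year) into a matrix. A small pandas-only sketch of that reshaping is below. The `long_form` values are illustrative, and `month` is made an ordered categorical (an assumption mirroring the flights data) so that the rows keep calendar order:

```python
import pandas as pd

# Long form: one row per (year, month) observation. Values are illustrative.
long_form = pd.DataFrame({
    "year": [1949, 1949, 1949, 1950, 1950, 1950],
    "month": pd.Categorical(
        ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
        categories=["Jan", "Feb", "Mar"], ordered=True,
    ),
    "passengers": [112, 118, 132, 115, 126, 141],
})

# Wide (matrix) form: months as rows, years as columns, passengers as cell values.
wide = long_form.pivot_table(values="passengers", index="month",
                             columns="year", observed=False)
print(wide)
```

Each cell of `wide` now holds the passenger count for one month and year, which is exactly the shape heatmap colors in.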

## clustermap

The clustermap uses hierarchical clustering to produce a clustered version of the heatmap. For example:

In [None]:
sns.clustermap(pvflights) # Group months and years based on similar values (passenger count)

Notice how the years and months are no longer in order; instead they are grouped by similarity in value (passenger count). That means we can begin to infer things from this plot, such as August and July being similar (which makes sense, since both are summer travel months).

In [None]:
# More options to get the information a little clearer like normalization
sns.clustermap(pvflights,standard_scale=1)
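
As the seaborn documentation describes it, `standard_scale=1` rescales each column by subtracting its minimum and dividing by its range, so every column runs from 0 to 1. A pandas-only sketch of that scaling on a small illustrative matrix (the `wide` values are an assumption, not the full flights data):

```python
import pandas as pd

# Small matrix in the same layout as pvflights (illustrative values only).
wide = pd.DataFrame(
    {1949: [112, 118, 132], 1950: [115, 126, 141]},
    index=["Jan", "Feb", "Mar"],
)

# Column-wise min-max scaling: (value - column min) / (column max - column min).
scaled = (wide - wide.min()) / (wide.max() - wide.min())
print(scaled)
```

After scaling, years with many passengers no longer dominate the color range, so the clustering compares the seasonal shape of each year rather than its overall level.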

# Regression Plots

Seaborn has many built-in capabilities for regression plots. However, we won't really discuss regression until the machine learning section of the course, so we will only cover the **lmplot()** function for now.

**lmplot** allows you to display linear models, and it also conveniently lets you split those plots into separate panels, or color them with hue, based on features in the data.

Let's explore how this works:

In [None]:
import seaborn as sns # Import seaborn and enable inline plotting for Jupyter.
%matplotlib inline

In [None]:
tips = sns.load_dataset('tips')

In [None]:
tips.head()

## lmplot()

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips) # Basic scatter plot with linear regression line

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex') # Scatter plot with color distinction for 'sex'

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,hue='sex',palette='coolwarm') # Scatter plot with custom color palette
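
lmplot draws the fitted line but does not hand back its coefficients. If you want the numbers, one option is to fit the same kind of straight line yourself. The sketch below uses `np.polyfit` for an ordinary least-squares fit on synthetic data. The data, and its true slope of 0.1 and intercept of 1.0, are assumptions made up for the example:

```python
import numpy as np

# Synthetic bill and tip values with a known linear relationship plus noise.
rng = np.random.default_rng(5)
total_bill = rng.uniform(5, 50, size=200)
tip = 1.0 + 0.1 * total_bill + rng.normal(0, 0.5, size=200)

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(total_bill, tip, deg=1)
print(f"tip is roughly {intercept:.2f} + {slope:.2f} * total_bill")
```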

## Using a Grid

We can add more variable separation through columns and rows with the use of a grid. Just indicate this with the col or row arguments:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='sex') # Plot total_bill vs tip, separated by 'sex' in different columns

In [None]:
sns.lmplot(x="total_bill", y="tip", row="sex", col="time",data=tips) # Plot total_bill vs tip, separated by 'sex' in rows and 'time' in columns

In [None]:
# Plot total_bill vs tip, separated by 'day' and hue by 'sex' with custom color palette
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm')

## Aspect and Size

Seaborn figures can have their size and aspect ratio adjusted with the **height** and **aspect** parameters:

In [None]:
sns.lmplot(x='total_bill',y='tip',data=tips,col='day',hue='sex',palette='coolwarm',
          aspect=0.6,height=8) # Adjust figure size and aspect ratio using height and aspect parameters
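
`height` is the height of each panel in inches and `aspect` is the panel's width divided by its height, so the overall figure size follows from the grid shape. A sketch that checks this on a synthetic `demo` DataFrame (an assumption standing in for `tips`), with no `hue` so that no legend space is added:

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the tips data (illustrative values only).
rng = np.random.default_rng(6)
total_bill = rng.uniform(5, 50, size=80)
demo = pd.DataFrame({
    "total_bill": total_bill,
    "tip": 1.0 + 0.1 * total_bill + rng.normal(0, 0.5, size=80),
    "day": np.tile(["Thur", "Fri", "Sat", "Sun"], 20),
})

# Four panels (one per day), each 4 inches tall and 4 * 0.6 = 2.4 inches wide.
g = sns.lmplot(x="total_bill", y="tip", data=demo, col="day",
               height=4, aspect=0.6)

# Expected figure size: width = 4 panels * 2.4 in, height = 1 row * 4 in.
fig_width, fig_height = g.figure.get_size_inches()
print(fig_width, fig_height)
```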

### Reference:

* https://seaborn.pydata.org/ - Seaborn: statistical data visualization


* https://seaborn.pydata.org/tutorial/color_palettes.html - Color palettes