# Seaborn

Seaborn is a nice library for making statistical plots easily.  
Seaborn works great with pandas DataFrames.

**Note:** you still need to call `plt.show()` at the end of a cell if you don't want the text representation of the returned plot object (the weird bracket stuff) showing up in the output.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

#### Seaborn Data

Seaborn can easily load some classic/toy datasets to illustrate plotting features.  An internet connection is required to load the data.  

In [None]:
iris = sns.load_dataset('iris')
tips = sns.load_dataset('tips')
flights = sns.load_dataset('flights')

In [None]:
iris.head()

In [None]:
tips.head()

In [None]:
flights.head()

## Line Plot

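`lineplot` plots a line with a 95% confidence interval. When an x value has several observations (here, the twelve months of each year), the line shows their mean and the band shows the confidence interval around that mean.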

In [None]:
sns.lineplot(data = flights, x = 'year', y = 'passengers')
plt.show()

In [None]:
#you can also draw error bars instead of a band
sns.lineplot(data = flights, x = 'year', y = 'passengers', err_style = 'bars')
plt.show()

In [None]:
#you can plot the standard deviation or another 'error' metric rather than the default confidence interval. The options are:
#'ci' (confidence interval), 'pi' (percentile interval), 'sd' (standard deviation), and 'se' (standard error)
sns.lineplot(data = flights, x = 'year', y = 'passengers', errorbar = 'sd')
plt.show()

#you can change the size of the interval using a tuple such as ('ci', 68). For example, I can remove the interval altogether by setting the level to 0
sns.lineplot(data = flights, x = 'year', y = 'passengers', errorbar = ('ci', 0))
plt.show()


# Distribution Plots

Let's discuss some plots that allow us to visualize the distribution of a data set. These plots are:

* histplot
* displot
* jointplot
* pairplot
* rugplot

You can use the dataset and call variables by name, or specifically designate a series as input.

In [None]:
# Pass in full dataset
sns.histplot(data = iris, x = 'sepal_length', bins = 30)
plt.show()

In [None]:
# Pass in a series
sns.histplot(x = iris['sepal_length'])
plt.show()

Although the seaborn histogram may look prettier than matplotlib's, I always do it in matplotlib. One advantage of matplotlib is that you can set density = True, which is nice if you want to compare the histogram to an actual probability density function. Seaborn can do the same thing with `sns.histplot(..., stat = 'density')`, although that is easy to miss when skimming the documentation.

In [None]:
plt.hist(iris.sepal_length.values, density = True)
plt.show()

displot sorta combines histograms and KDEs into one function. You choose the plot type with the kind argument: by default, displot plots a histogram, but with kind = 'kde' it plots a KDE. Overall, displot is kind of silly because you can also just call sns.kdeplot or sns.histplot. I guess the advantage is that you only have to remember one function.

In [None]:
sns.displot(data=iris, x='petal_length')
plt.title('Hist with displot')
plt.show()

sns.displot(data=iris, x='petal_length', kind = 'kde')
plt.title('KDE with displot')
plt.show()

sns.kdeplot(data=iris, x='petal_length')
plt.title('KDE with kdeplot')
plt.show()

In [None]:
#displot also lets you plot the "rug", which kdeplot doesn't allow
sns.displot(data = iris, x = 'petal_length', kind = 'kde', rug = True)
plt.show()

In [None]:
#you can also combine a histogram, kde, and rug plot by setting kind = 'hist' and then kde=True and rug = True
sns.displot(data = iris, x = 'petal_length', kind = 'hist', rug = True, kde = True)
plt.show()

More on displot can be found here: [Displot](https://seaborn.pydata.org/generated/seaborn.displot.html)

## jointplot

jointplot() lets you pair two univariate distribution plots with a plot of **bivariate** data (two variables). The **kind** parameter chooses how the joint relationship is drawn:
* “scatter”
* “reg”
* “resid”
* “kde”
* “hex”

`sns.jointplot(x='',y='',data=,kind='')`

In [None]:
sns.jointplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species')
plt.show()

In [None]:
sns.jointplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species', kind = 'kde')
plt.show()

In [None]:
#kind = 'reg' fits a linear regression line between the two variables
sns.jointplot(data = tips, x = 'total_bill', y = 'tip', kind = 'reg')
plt.show()

In [None]:
sns.jointplot(data = tips, x = 'total_bill', y = 'tip', kind = 'hex')
plt.show()

In [None]:
#when you plot resid, you do the linear regression model using x to predict y, then you plot the residuals.
#so this is useful for checking the constant variance assumption of linear regression.
sns.jointplot(data = tips, x = 'total_bill', y = 'tip', kind = 'resid')
plt.show()

## Updating figure attributes

seaborn is built on matplotlib, so you can define titles and axis labels using the same functions as in matplotlib.

In [None]:
g = sns.jointplot(data=iris, x='petal_length', y='petal_width', hue='species')
plt.xlabel('Petal Length (cm)')
plt.ylabel('Petal Width (cm)')
plt.suptitle('Scatterplot of Petal Length and Petal Width', y = 1) #y sets the vertical position of the title in figure coordinates (1 is the top edge). If you don't do that it'll intersect with the blue kde.
plt.show()

## pairplot

pairplot will plot pairwise relationships across an entire dataframe (for the numerical columns) and supports a color hue argument (for categorical columns).

``sns.pairplot(data)``  
``sns.pairplot(data, hue= , palette="")``

This is useful if you want to check for linear relationships between variables, or for multicollinearity, before performing linear regression.

In [None]:
sns.pairplot(data = iris, hue = 'species')
plt.show()

Note that the default option gives a symmetrical matrix of plots, so each pair of variables appears twice. I can customize the grid to make better use of the space.

In [None]:
# Customize the variables and mapped functions:
g = sns.PairGrid(data = iris, hue = 'species', vars = ['petal_length', 'sepal_length', 'petal_width'])
g.map_diag(sns.histplot)
g.map_upper(sns.scatterplot)
g.map_lower(sns.kdeplot)
plt.show()

## boxplot and violinplot

#### boxplot
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.

``sns.boxplot(x="", y="", data= ,palette="")``

``sns.boxplot(data= ,palette=' ',orient=' ')``

``sns.boxplot(x=" ", y=" ", hue=" ", data= , palette=" ")``
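
The outlier rule can be worked out by hand. With the default `whis = 1.5`, a point is flagged when it lies more than 1.5 times the inter-quartile range beyond the box. The sketch below computes those fences with numpy on a made-up column (`demo` and its values are hypothetical) and then draws the box plot.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data with one extreme value
demo = pd.DataFrame({"value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]})

# Quartiles and inter-quartile range
q1, q3 = np.percentile(demo["value"], [25, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (demo["value"] < lower_fence) | (demo["value"] > upper_fence)
outliers = demo.loc[is_outlier, "value"]
print(q1, q3, iqr, outliers.tolist())

sns.boxplot(data=demo, x="value")
plt.show()
```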

#### violinplot
A violin plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

`sns.violinplot(x="<cat>", y="<num>", data= , palette=' ')`

`sns.violinplot(x="<cat>", y="<num>", data= , hue=' ', palette=' ')`

`sns.violinplot(x="<cat>", y="<num>", data= , hue=' ', split=(boolean), palette=' ')`

**Aren't violinplots way cooler than boxplots??**

One advantage of a boxplot is that it clearly shows where the median and interquartile range are, as well as where the outliers begin. So they do serve their purpose.

In [None]:
sns.boxplot(data = tips, y = 'sex', x = 'tip', orient = 'h')
plt.show()

In [None]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [None]:
sns.boxplot(data = tips, x = 'sex', y = 'tip_pct')
plt.show()

In [None]:
sns.violinplot(data = tips, x = 'sex', y = 'tip_pct',  hue = 'sex', palette = 'Dark2')
plt.show()

## stripplot
The stripplot will draw a scatterplot where one variable is categorical. A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.


``sns.stripplot(x="<cat>", y="<num>", data= , jitter=<bool>)``

``sns.stripplot(x="<cat>", y="<num>", data= ,jitter=<bool>,hue='<cat>',palette=' ',dodge=<bool>)``

In [None]:
sns.stripplot(data = tips, x = 'day', y = 'total_bill', hue = 'tip_pct')
plt.show()

In [None]:
sns.stripplot(data = tips, x = 'day', y = 'total_bill', jitter = False)
plt.show()
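
The description above mentions using a strip plot as a complement to a violin plot. A sketch of that overlay follows, drawing both on the same `Axes`. The `demo` values are invented, and the colors and `inner = None` are just one reasonable styling choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: six bills for each of two days
demo = pd.DataFrame({
    "day": ["Thur"] * 6 + ["Fri"] * 6,
    "total_bill": [10.5, 12.0, 15.2, 18.9, 22.4, 30.1,
                   9.8, 14.3, 16.7, 20.0, 25.5, 28.2],
})

# Draw the density outline first, without the inner box
ax = sns.violinplot(data=demo, x="day", y="total_bill",
                    inner=None, color="lightgray")

# Then draw every observation on top of it, on the same Axes
sns.stripplot(data=demo, x="day", y="total_bill", color="black", ax=ax)
plt.show()
```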

**Note from Will Melville**: stripplots seem dumb to me. There's nothing they do that histograms or kde plots don't do better.

# Categorical Data Plots

Now let's discuss using seaborn to plot categorical data!

* barplot
* countplot

``sns.barplot(x='<cat>',y='<num>',data= , estimator= <default is mean>)``

``sns.countplot(x='<cat>',data= )``

In [None]:
sns.barplot(data = tips, x = 'sex', y = 'tip')
plt.show()

In [None]:
sns.countplot(data = tips, x = 'sex', hue = 'sex')
plt.show()

# 2-D Plots: Heatmaps/clustermaps

Heatmaps allow you to plot 2-d data as color-encoded matrices.

`sns.heatmap(matrix)`

`sns.heatmap(matrix, cmap=' ', annot=<bool>)`

In [None]:
from scipy.spatial.distance import pdist, squareform

In [None]:
iris_dists = squareform(pdist(iris.loc[:, iris.columns != 'species']))

In [None]:
sns.heatmap(iris_dists, cmap = 'cubehelix')
plt.show()

In [None]:
sns.heatmap(iris_dists, cmap = 'coolwarm')
plt.show()
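
To make it clearer what the distance matrix holds, here is a tiny version built with numpy instead of scipy, drawn with `annot = True` so each cell shows its value. The three points are invented, and the broadcasting expression is just one way to get pairwise Euclidean distances.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical toy data: three points in the plane
points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Pairwise Euclidean distances: entry (i, j) is the distance
# between point i and point j, so the diagonal is zero
diffs = points[:, None, :] - points[None, :, :]
dists = np.linalg.norm(diffs, axis=-1)
print(dists)

# annot=True writes each value inside its cell
ax = sns.heatmap(dists, cmap="coolwarm", annot=True)
plt.show()
```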

**Note to Will Melville** Show them the wOBA Cube!!

Seaborn colormaps: [colormaps](https://seaborn.pydata.org/tutorial/color_palettes.html)

In [None]:
#matplotlib can also do heatmaps using imshow
plt.imshow(iris_dists, cmap = 'coolwarm')
plt.show()

## Facet Grid

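FacetGrid is the general way to create a grid of plots split on one or more features: each panel shows the same kind of plot for a different subset of the data. It is the grid that figure-level functions such as displot and lmplot return.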

In [None]:
g = sns.FacetGrid(tips, col = "time",  row = "smoker")
g = g.map(sns.histplot, "total_bill")
plt.show()

In [None]:
# Can also use matplotlib plots!
g = sns.FacetGrid(tips, col = "time",  row = "smoker")
g = g.map(plt.hist, "total_bill")
plt.show()

In [None]:
g = sns.FacetGrid(tips, col = "time",  row = "smoker", hue = 'sex')
# Notice how the arguments come after plt.scatter call: x, y
g = g.map(plt.scatter, "total_bill", "tip").add_legend()
plt.show()

# Regression Plots

Seaborn has many built-in capabilities for regression plots.

**lmplot** allows you to display linear models, but it also conveniently allows you to split up those plots based off of features, as well as coloring the hue based off of features.

Let's explore how this works:

`sns.lmplot(x='<num>',y='<num>', data= )`

`sns.lmplot(x='<num>',y='<num>',data= , hue='<cat>', palette=" ")`



In [None]:
sns.lmplot(data = tips, x = 'total_bill', y = 'tip')
plt.show()

In [None]:
sns.lmplot(data = iris, x = 'petal_width', y = 'petal_length', hue = 'species')
plt.show()

In [None]:
sns.lmplot(data = iris, x = 'petal_width', y = 'petal_length', hue = 'species', palette = 'coolwarm')
plt.show()
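
By default the line in these plots is an ordinary least-squares fit of y on x. lmplot does not report the coefficients, so a sketch of getting them separately with `np.polyfit` is shown below. The data are noise-free and invented (`tip = 0.15 * total_bill + 1`), chosen so the fitted values are easy to check.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data lying exactly on a line
bills = np.arange(10, 60, 5, dtype=float)
demo = pd.DataFrame({"total_bill": bills, "tip": 0.15 * bills + 1.0})

# Slope and intercept of the least-squares line
slope, intercept = np.polyfit(demo["total_bill"], demo["tip"], deg=1)
print(slope, intercept)

# ci=None turns off the confidence band around the fitted line
g = sns.lmplot(data=demo, x="total_bill", y="tip", ci=None)
plt.show()
```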

## Using a Grid

`lmplot` can easily create facets. Just indicate this with the `col` or `row` arguments. **Note from Will Melville**: I kind of prefer using colors (`hue`) in one plot, though.

In [None]:
sns.lmplot(x = "total_bill", y = "tip", row = "sex", col = "time", data = tips)
plt.show()

In [None]:
sns.lmplot(x = "total_bill", y = "tip", hue = 'sex', row = 'time', data = tips)
plt.show()

## Selecting a palette (color scheme)
We can get the current palette by calling sns.color_palette(). Passing a palette name, such as sns.color_palette('Dark2'), returns that palette instead.

Seaborn color palettes [here.](https://seaborn.pydata.org/tutorial/color_palettes.html)

Color advice can be found here: [ColorBrewer2](https://colorbrewer2.org/)

In [None]:
# Example of using hue
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='species', palette='Dark2')
plt.show()
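
A palette object is just a list of RGB tuples, which can help when you need the actual color values. A small sketch follows; the choice of `'Dark2'` and of three colors is arbitrary.

```python
import seaborn as sns

# With no arguments, color_palette returns the palette currently in use
current = sns.color_palette()
print(len(current))

# With a name and a count, it returns that many colors from the named palette
dark2 = sns.color_palette("Dark2", 3)

# Each entry is an (r, g, b) tuple with values between 0 and 1
print(list(dark2))

# as_hex converts the same colors to hex strings
print(dark2.as_hex())
```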

You can also generate your own color palette using hex codes:

In [None]:
custom_palette = sns.color_palette(["#A39382", "#002E5D", "#0047BA"]) # Go BYU!
# sns.set_palette(custom_palette) # Set globally

# Example usage with a scatter plot
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='species', palette = custom_palette)
plt.show()

Colormaps (cmap) can be set for continuous variables.

In [None]:
# Example using colormaps for continuous variables
sns.scatterplot(data=iris, x='petal_length', y='petal_width', hue='sepal_length', palette='viridis')
plt.show()

# Style and Context

`sns.set_style('whitegrid')` can take the following styles:  darkgrid, whitegrid, dark, white, or ticks.  Advanced users can customize further.

`sns.set_context('notebook', font_scale = 1)` can take the following contexts: notebook (default), paper, talk, poster.  The font can also be adjusted.  Advanced users can customize further.

In [None]:
sns.set_style('whitegrid')
#sns.set_style('darkgrid')
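
`set_style` changes the style for everything drawn afterwards. If you only want a style for one figure, `sns.axes_style` can be used as a context manager instead. The sketch below uses it on made-up data (`demo` is hypothetical) and also prints one of the settings a style controls.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data
demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2, 1, 4, 3]})

# axes_style returns the matplotlib settings that a style would apply
print(sns.axes_style("whitegrid")["axes.grid"])
print(sns.axes_style("white")["axes.grid"])

# Inside the with-block the 'ticks' style is active; afterwards the
# previous style is restored
with sns.axes_style("ticks"):
    ax = sns.scatterplot(data=demo, x="x", y="y")
    plt.show()
```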

In [None]:
fig, axes = plt.subplots(1, 2, figsize = (12, 6))

sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'species', ax = axes[0])
sns.scatterplot(data = iris, x = 'petal_length', y = 'petal_width', hue = 'sepal_length', ax = axes[1])
plt.show()

In [None]:
sns.histplot(data = iris, x = 'petal_length', hue = 'species')
plt.show()

# Your turn!

In this exercise, you will work with the `tips` dataset to explore various aspects of the data using Seaborn.

## Questions:

1. **Distribution of Total Bill:**
    - Plot the distribution of the `total_bill` column using a histogram. Include a rug plot.
    - What is the most common range of total bills?

2. **Tip Percentage by Day:**
    - Create a boxplot (or violinplot) to visualize the distribution of `tip_pct` for each day of the week.
    - Which day has the highest median tip percentage?

3. **Total Bill vs Tip:**
    - Create a scatter plot to visualize the relationship between `total_bill` and `tip`.
    - Add a regression line to the scatter plot.
    - Is there a correlation between total bill and tip?

4. **Tip Percentage by Gender and Smoker Status:**
    - Create a violin plot to visualize the distribution of `tip_pct` by gender and smoker status.
    - Are there any noticeable differences in tip percentages between smokers and non-smokers?

5. **Facet Grid of Total Bill by Time and Day:**
    - Create a FacetGrid to plot the distribution of `total_bill` for each combination of `time` (Lunch/Dinner) and `day`.
    - Which combination of time and day has the highest total bills?


# Your turn!
## Practice Questions for the Flights Dataset:

Start by getting familiar with the data. Then answer the questions below.

1. **Monthly Passengers Over Time:**
    - Plot the number of passengers for each month over the years.
    - Which month has the highest number of passengers on average?

2. **Yearly Trend:**
    - Create a line plot to show the trend of passengers over the years.
    - Is there a noticeable trend in the number of passengers over the years?

3. **Heatmap of Passengers:**
    - Create a heatmap to visualize the number of passengers for each month and year.
    - Which year and month combination had the highest number of passengers?

4. **Monthly Distribution:**
    - Create a boxplot to visualize the distribution of passengers for each month.
    - Which month shows the most variation in the number of passengers?

5. **Seasonal Trend:**
    - Create a FacetGrid to plot the distribution of passengers for each season (Winter, Spring, Summer, Fall). (Hint: You will need to create a season variable!)
    - Which season has the highest average number of passengers?

6. **Additional Question:**
    - Come up with one additional question and a plot to answer it.

In [None]:
flights.head()