## Introduction to Seaborn

Seaborn is a library for making attractive and informative statistical graphics in Python. It is built on top of matplotlib and tightly integrated with the PyData stack, including support for numpy and pandas data structures and statistical routines from scipy and statsmodels.

http://seaborn.pydata.org/introduction.html

https://seaborn.pydata.org/tutorial.html

In [None]:
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
sns.set(color_codes=True)

#### Plotting univariate distributions
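The most convenient way to take a quick look at a univariate distribution in this notebook is the distplot() function. By default, it draws a histogram and fits a kernel density estimate (KDE). Note that distplot() is deprecated as of seaborn 0.11 in favor of histplot() and displot().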

In [None]:
x = np.random.normal(size=100)
sns.distplot(x);
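To make the idea of a kernel density estimate more concrete, the following is a minimal sketch of a Gaussian KDE written directly in NumPy. It is not seaborn's implementation: the helper name `gaussian_kde_1d` and the choice of Scott's rule for the bandwidth are assumptions made for illustration.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` (hypothetical helper, not a seaborn API)."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # One Gaussian bump per sample, then average the bumps at each grid point
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
x = rng.normal(size=100)

# Scott's rule of thumb for the bandwidth (an assumption; other rules exist)
bandwidth = x.std(ddof=1) * len(x) ** (-1 / 5)

grid = np.linspace(x.min() - 4 * bandwidth, x.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(x, grid, bandwidth)

# A density should integrate to roughly 1 (simple Riemann sum over the grid)
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that may hide structure.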

#### Plotting bivariate distributions
It can also be useful to visualize the bivariate distribution of two variables. The easiest way to do this in seaborn is the jointplot() function, which creates a multi-panel figure showing both the bivariate (or joint) relationship between two variables and the univariate (or marginal) distribution of each on separate axes.

In [None]:
# Generate multivariate data
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)

In [None]:
df = pd.DataFrame(data, columns=["x", "y"])
df.head()

In [None]:
sns.jointplot(x="x", y="y", data=df);
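The relationship between the joint and marginal panels can be sketched numerically. The example below is an illustration under assumptions, not what jointplot() does internally: it bins the data on a rectangular grid with NumPy and shows that summing the joint counts over one variable recovers the marginal histogram of the other.

```python
import numpy as np

rng = np.random.default_rng(0)
mean, cov = [0, 1], [(1, .5), (.5, 1)]
x, y = rng.multivariate_normal(mean, cov, size=200).T

# Joint counts on a 10 x 10 rectangular grid (bin count chosen arbitrarily)
joint_counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

# Summing over y leaves the marginal counts of x, and vice versa
marginal_x = joint_counts.sum(axis=1)
marginal_y = joint_counts.sum(axis=0)

# The same marginal obtained directly from a 1-D histogram with the same edges
direct_x, _ = np.histogram(x, bins=x_edges)
print(np.array_equal(marginal_x, direct_x))
```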

#### Hexbin plots
The bivariate analogue of a histogram is known as a “hexbin” plot, because it shows the counts of observations that fall within hexagonal bins. 

In [None]:
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"):
    sns.jointplot(x=x, y=y, kind="hex", color="k");

#### Kernel density estimation
It is also possible to use the kernel density estimation procedure described above to visualize a bivariate distribution. In seaborn, this kind of plot is drawn as a contour plot and is available as a style in jointplot():

In [None]:
sns.jointplot(x="x", y="y", data=df, kind="kde")

#### Categorical scatterplots
A simple way to show the values of a quantitative variable across the levels of a categorical variable is stripplot(), which generalizes a scatterplot to the case where one of the variables is categorical:

In [None]:
tips = sns.load_dataset("tips")

In [None]:
tips.head()

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips);

In a strip plot, the scatterplot points will usually overlap. This makes it difficult to see the full distribution of data. One easy solution is to adjust the positions (only along the categorical axis) using some random “jitter”:

In [None]:
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True);
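As a rough sketch of what jitter does, the example below maps each category to an integer position and adds a small uniform random offset along that axis only. The data values and the jitter width of 0.1 are assumptions for illustration, and the code is not taken from seaborn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up categorical observations (not the real tips data)
days = np.array(["Thur", "Fri", "Sat", "Sun", "Sat", "Sun", "Thur", "Sat"])

# Each category gets an integer position on the categorical axis
categories, positions = np.unique(days, return_inverse=True)

# Shift each point by a small random amount; the quantitative axis is untouched
jitter_width = 0.1  # assumed half-width of the jitter band
jittered = positions + rng.uniform(-jitter_width, jitter_width, size=len(days))

for day, pos in zip(days, jittered):
    print(day, round(float(pos), 3))
```

Because the offsets are bounded by the jitter width, points stay visually attached to their own category while overlapping less.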

In [None]:
iris = sns.load_dataset("iris")
iris.head()

Exercise: Create four strip plots with random jitter, one for each numerical column of the iris dataset. Use the species column as the categorical variable. For each plot, comment on how the standard deviation differs across species.

#### Boxplots
This kind of plot shows the three quartile values of the distribution along with extreme values. The “whiskers” extend to points that lie within 1.5 interquartile ranges (IQRs) of the lower and upper quartiles, and observations that fall outside this range are displayed individually.

In [None]:
sns.boxplot(x="day", y="total_bill", data=tips);
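The quantities a boxplot displays can be computed by hand. The following is a small sketch on made-up numbers (not the tips data) that finds the quartiles, the 1.5 IQR limits, the whisker ends, and the points that would be drawn individually.

```python
import numpy as np

# Made-up sample with one obviously extreme value
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits at 1.5 IQRs beyond the lower and upper quartiles
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations still inside the limits
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is shown as an individual point
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

The same filtering idea applies to a DataFrame column, for example when listing the outliers asked for in the exercises that follow.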

Exercise: Values beyond the whiskers can be considered outliers. For the boxplot above, list all outliers of total_bill for day="Thur". You can use a total_bill value of 30 as the threshold.

Exercise: Using the titanic dataset, create a boxplot of age for the different values of pclass. Load the titanic dataset and use the head() function to view its columns:

titanic = sns.load_dataset("titanic")

titanic.head()

What can you conclude from the plots?

Exercise: For the previous boxplot exercise on the titanic dataset, add hue="sex". What can you conclude from the plot? List all outliers of age for pclass=3 and sex="male".

#### Violinplots
A different approach is violinplot(), which combines a boxplot with a kernel density estimate:

In [None]:
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True);

Exercise: Create a violin plot of age in the titanic dataset for the different pclass values, with hue="sex". What can you say about distribution parameters such as the mean and standard deviation?

### Plot linear regression models

The lmplot() function draws a scatterplot of two variables, x and y, then fits the regression model y ~ x and plots the resulting regression line with a 95% confidence interval for that regression:

In [None]:
sns.lmplot(x="total_bill", y="tip", data=tips)
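To illustrate what such a plot summarizes, here is a minimal sketch that fits y ~ x by least squares and estimates a 95% band for the fitted line by bootstrap resampling. The synthetic data, the true slope of 0.1, and the 1000 resamples are assumptions chosen for the example; this is not seaborn's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a known linear relationship plus noise
x = rng.uniform(3, 50, size=200)
y = 1.0 + 0.1 * x + rng.normal(scale=1.0, size=200)

# Ordinary least-squares fit of a straight line
slope, intercept = np.polyfit(x, y, deg=1)

# Bootstrap: refit on resampled (x, y) pairs and evaluate each line on a grid
grid = np.linspace(x.min(), x.max(), 50)
n_boot = 1000
boot_lines = np.empty((n_boot, grid.size))
for i in range(n_boot):
    idx = rng.integers(0, len(x), size=len(x))
    b_slope, b_intercept = np.polyfit(x[idx], y[idx], deg=1)
    boot_lines[i] = b_intercept + b_slope * grid

# Pointwise 95% band for the regression line
band_low, band_high = np.percentile(boot_lines, [2.5, 97.5], axis=0)

print(round(float(slope), 3), round(float(intercept), 3))
```

The band is usually narrowest near the mean of x and widens toward the ends, where the fitted line is less constrained by the data.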

What if one of the variables has discrete values?

In [None]:
sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean)
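With a discrete x variable, an estimator such as the mean collapses all observations at each x level into a single value. The sketch below shows that aggregation step on made-up numbers, assuming the mean as the estimator; the plotting and confidence intervals are omitted.

```python
import numpy as np

# Made-up party sizes and tips (not the real tips data)
size = np.array([1, 2, 2, 3, 3, 3, 4, 4])
tip = np.array([1.0, 2.0, 3.0, 3.0, 3.5, 4.0, 4.0, 5.0])

# One estimate per distinct x value
levels = np.unique(size)
mean_tip = np.array([tip[size == level].mean() for level in levels])

for level, value in zip(levels, mean_tip):
    print(level, value)
```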

Exercise: Use the lmplot() function to draw a linear regression model for the iris dataset with y=sepal_length and x=petal_width. Repeat the plot with y=sepal_width and x=petal_width. What can you conclude from the two plots?

#### Conditioning on other variables

How does the relationship between two variables change as a function of a third variable (typically categorical)?

In [None]:
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips)

### Heatmap


In [None]:
flights = sns.load_dataset("flights")
flights.head()

In [None]:
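# Reshape from long to wide: one row per month, one column per year.
# Keyword arguments are used because recent pandas versions no longer accept positional ones here.
flightsN = flights.pivot(index="month", columns="year", values="passengers")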
flightsN.head()
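The reshaping step can be easier to follow on a tiny table. The sketch below uses a hand-made DataFrame with invented passenger counts (not the real flights data) to show how pivot() turns long-format rows into the month-by-year grid that a heatmap expects.

```python
import pandas as pd

# Long format: one row per (year, month) observation; the numbers are made up
long_df = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [10, 20, 30, 40],
})

# Wide format: months become rows, years become columns
wide_df = long_df.pivot(index="month", columns="year", values="passengers")
print(wide_df)
```

Each cell of the wide table corresponds to one colored cell of the heatmap.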

In [None]:
ax = sns.heatmap(flightsN)

In [None]:
ax = sns.heatmap(flightsN, annot=True, fmt="d")

#### Facet Grid
FacetGrid is useful for visualizing the distribution of a variable separately within subsets of a dataset. These subsets can be defined by a categorical variable in the dataset.


In [None]:
tips.head()

In [None]:
# Split the data into two subsets by time (Lunch and Dinner) and set up the grid
g = sns.FacetGrid(tips, col="time")
# visualize data on the grid: distribution of tips in the two subsets
g.map(plt.hist, "tip")
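Conceptually, the grid splits the data by the faceting variable and applies the same plotting function to each subset. The following sketch imitates that split-then-apply pattern with NumPy on made-up values, computing one histogram per panel instead of drawing it; the shared bin edges are an assumption made so the panels are comparable.

```python
import numpy as np

# Made-up observations (not the real tips data)
time = np.array(["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Lunch"])
tip = np.array([1.5, 2.0, 3.0, 5.5, 4.0, 2.5])

# Bin edges shared by every panel: 0, 2, 4, 6, 8, 10
bins = np.linspace(0, 10, 6)

# One "panel" per level of the faceting variable
panels = {
    str(level): np.histogram(tip[time == level], bins=bins)[0]
    for level in np.unique(time)
}

for level, counts in panels.items():
    print(level, counts.tolist())
```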

In [None]:
iris.head()

Exercise: Use FacetGrid to plot a histogram of sepal_length in the iris dataset for each species.

Add one more dimension using hue:

In [None]:
g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(plt.hist, "tip")
g.add_legend()

In [None]:
g = sns.FacetGrid(tips, col="sex")
g.map(plt.hist, "tip")
g.add_legend()

In [None]:
g = sns.FacetGrid(tips, col="smoker")
g.map(plt.hist, "tip")
g.add_legend()

In [None]:
# Using one grid
g = sns.FacetGrid(tips, row = "sex", col="smoker")
#g = sns.FacetGrid(tips, row = "sex", col="smoker", margin_titles=True)
g.map(plt.hist, "tip")

In [None]:
g = sns.FacetGrid(tips, col="sex", hue="smoker")
g.map(plt.scatter, "total_bill", "tip", alpha=.7)
g.add_legend()

In [None]:
g = sns.FacetGrid(tips, col="smoker")
g.map(sns.regplot, "total_bill", "tip")

#g = sns.FacetGrid(tips, row="smoker", col="time", margin_titles=True)
#g.map(sns.regplot, "size", "total_bill", color=".3", fit_reg=False, x_jitter=.1);

Exercise: Use FacetGrid to plot a linear regression model of petal_width (y) against sepal_length (x) for each species.

#### Visualizing pairwise relationships in a dataset
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes:

In [None]:
sns.pairplot(iris)
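The layout of that matrix can be sketched without any plotting. The example below only enumerates which content goes in which cell for the four numeric iris columns, assuming rows map to the y variable and columns to the x variable; the `layout` dictionary is a hypothetical helper structure, not part of seaborn.

```python
from itertools import product

columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Map each (row, col) cell of the grid to a description of what it shows
layout = {}
for (row, y_var), (col, x_var) in product(enumerate(columns), repeat=2):
    if row == col:
        # Diagonal cells show the univariate distribution of one variable
        layout[(row, col)] = f"distribution of {x_var}"
    else:
        # Off-diagonal cells show one variable against another
        layout[(row, col)] = f"{y_var} vs {x_var}"

print(len(layout))
print(layout[(0, 0)])
print(layout[(1, 0)])
```

With n numeric columns the grid has n × n cells, so the figure grows quickly as columns are added.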