# Data Visualization using Seaborn (Tips dataset)

We need to visualize data properly to understand how variables in a dataset relate to each other and how those relationships depend on other variables.

**Seaborn** 
- statistical plotting library built on matplotlib
- works very well with pandas DataFrame objects

**Structure of this Notebook:**

**1. Dataset - at a glance**

**2. Data Visualization**

    a. Frequency Distribution - Categorical Variables 
        * countplot 
        * catplot
        
    b. Distribution of the Numerical Variable
        * distplot (histogram)
        * kdeplot
        * boxplot
        * violinplot
        
    c. Relationship between 2 Numerical Variables
        * lineplot
        * scatterplot
        * relplot
        * jointplot
        * kdeplot
        * lmplot
        * heatmap
        * pairplot
        * facetgrid
        
    d. Relationship between Numerical and Categorical Variables 
        * pointplot
        * barplot
        * boxplot
        * violinplot
        * swarmplot
        * catplot
        * facetgrid
        
 **Remark**: *I used subplot to plot two or more plots in one figure.* 

## 1. Dataset - at a glance

In [None]:
# importing required libraries
import os #provides functions for interacting with the operating system
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# loading dataset
tips_dataset='../input/seaborn-tips-dataset/tips.csv'
tips_raw = pd.read_csv(tips_dataset)

In [None]:
# make a copy of data 
tips=tips_raw.copy()

In [None]:
# the whole DataFrame (not displayed here: a cell shows only its last expression)
tips
# show the first 5 rows
tips.head()

In [None]:
# get number of rows and columns
tips.shape

In [None]:
# get attribute names 
tips.columns

In [None]:
# get information about a dataset (dtype, non-null values, memory usage)
tips.info()

In [None]:
# detect labels in categorical variables
for col in tips.columns[2:6]:
    print(col, np.unique(tips[col]))

In [None]:
# detect missing values
tips.isna().sum()

In [None]:
# Summary statistics
tips.describe() # only for numerical variables 
tips.describe().T # transpose
#tips.describe(include='all') # for all variables

In [None]:
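# correlation matrix of the numerical columns
# (selecting them first avoids an error on the text columns in recent pandas)
tips.select_dtypes(include='number').corr()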

####  So, what do I know about the dataset?

- The *tips* dataset has 7 columns (features) and 244 rows (observations, samples).\
  *Numerical* columns are:
    - *total_bill* (continuous) - the amount of the total bill   
    - *tip* (continuous) - the amount of the tip paid on the bill
    - *size* (discrete) - the number of total people served

  *Categorical* columns are:
    - *sex* (Male/Female) - the gender of the person who paid the bill
    - *smoker* (Yes/No) - whether or not the person who paid the bill is a smoker 
    - *day* (Thur/Fri/Sat/Sun) - the day when the person paid the bill
    - *time* (Lunch/Dinner) - the time of the day i.e. lunch or dinner

- There are no missing values.
- tip lies in the interval [1, 10], with a mean of approx. 3 (standard deviation 1.4) and a median of 2.9
- total_bill lies approx. between 3 and 51, with a mean of approx. 20 (standard deviation 9) and a median of approx. 18
- all numerical variables are positively correlated (every coefficient is > 0)
- the strongest relationship is between total_bill and tip (approx. 0.7)
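
The figures in this summary can be read off the DataFrame directly. The following is a minimal sketch, assuming a small made-up frame called `demo` in place of `tips` (the name and the values are illustrative only, not the real data):

```python
import pandas as pd

# Hypothetical stand-in for the tips DataFrame (made-up values)
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 40.0, 50.0],
    'tip': [1.5, 3.0, 4.0, 6.5, 7.0],
    'size': [1, 2, 2, 4, 3],
    'sex': ['Male', 'Female', 'Male', 'Male', 'Female'],
})

# Keep only the numeric columns so that corr() also works on recent pandas
numeric = demo.select_dtypes(include='number')

# Range, center and spread of every numeric column
summary = numeric.agg(['min', 'max', 'mean', 'median', 'std'])

# Pearson correlation by default; every value lies in [-1, 1]
corr = numeric.corr()

print(summary)
print(corr.round(2))
```

Comparing the `mean` and `median` rows is a quick way to spot skewed columns: in a right-skewed column the mean sits above the median.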

## 2. Visualization 

## **a. Frequency Distribution - Categorical Variables**

###  * COUNT PLOT

- show value counts for a single categorical variable
- can be thought of as a histogram across a categorical, instead of quantitative, variable.

In [None]:
# Absolute values - the number of records 

sns.countplot(x='sex', data=tips)
sns.despine() # no top and right axes spine

print(tips.sex.value_counts())

In [None]:
# change orientation, use the same color for both labels
sns.countplot(y='smoker', data=tips, color='b') 

In [None]:
# show value counts for two categorical variables
sns.countplot(x='sex', data=tips, hue='smoker', palette='viridis')

So, we see that:
- Men paid the bill more often than women
- Non-smokers paid more often than smokers
- Male non-smokers are the largest group
- Male non-smokers paid more often than male smokers
- Female smokers paid least often
- ...
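
The bars of a count plot are plain frequency counts, so the same observations can be checked as a table. A minimal sketch, assuming a small made-up frame `demo` (hypothetical values, not the real *tips* data):

```python
import pandas as pd

# Hypothetical stand-in for two categorical columns of tips
demo = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
})

# Bar heights of countplot(x='sex')
counts = demo['sex'].value_counts()

# The same counts as shares of all records
shares = demo['sex'].value_counts(normalize=True)

# Bar heights of countplot(x='sex', hue='smoker')
table = pd.crosstab(demo['sex'], demo['smoker'])

print(counts, shares.round(2), table, sep='\n\n')
```

The cross-tabulation has one row per `x` level and one column per `hue` level, which matches the grouped bars one to one.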

### * CATPLOT

- does the same as a count plot when the parameter kind='count' is set

In [None]:
# show value counts for two categorical variables
sns.catplot(x='day', data=tips, hue='sex', palette='ch:.25', kind='count')

In [None]:
# facet along the columns to show a third categorical variable
sns.catplot(x='sex', hue='smoker', col='day', data=tips, kind='count')

### * BAR PLOTS

In [None]:
# Relative values - the percentage of records
perc = tips['sex'].value_counts(normalize=True)*100
print(perc)
sns.barplot(x=perc.index, y=perc.values) # the vectors are passed directly, so data= is not needed
sns.despine(left=True) # no top, left and right axes spine

Men paid most of the bills. <br>
About 64% of the bills were paid by men, compared to about 36% paid by women. 

## **b. Distribution of the Numerical Variable**

### * DISTPLOT

- plots a univariate distribution of observations
- combines a histogram with a plot of the estimated probability density function
- chooses the bin size automatically unless `bins` is set
- deprecated since seaborn 0.11 in favor of histplot and displot

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,12)) # plot 4 graphs 

# histogram and density function, set title
sns.distplot(tips.total_bill, ax=axes[0,0]).set_title('Total_bill distribution')

#set number of bins and color, set title
sns.distplot(tips.total_bill, bins=50, color='r', ax=axes[0,1]).set_title('Total_bill distribution') 

# only histogram, without density function, set title
sns.distplot(tips.total_bill, kde=False, ax=axes[1,0]).set_title('Histogram') 

# only density function, without histogram, set title
sns.distplot(tips.total_bill, hist=False, ax=axes[1,1]).set_title('PDF of Total_bill')
sns.despine() # no top and right axes spine

Most of the bill values fall in the range 10 - 20.

### * KDE PLOT

- plots a smooth estimate of the probability density of a variable
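
To make the idea of a kernel density estimate concrete, here is a hand-rolled sketch in NumPy: one Gaussian bump is placed on every observation and the bumps are averaged. The sample values are made up, and the bandwidth from Scott's rule of thumb is an assumption of this sketch rather than a statement about what `kdeplot` does internally.

```python
import numpy as np

# Made-up sample standing in for a numeric column
x = np.array([1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 5.0, 6.5])

# Bandwidth from Scott's rule of thumb (an assumption of this sketch)
h = x.std(ddof=1) * len(x) ** (-1 / 5)

# Evaluate the estimate on a grid that extends past the data
grid = np.linspace(x.min() - 5 * h, x.max() + 5 * h, 2000)

# One Gaussian bump per observation, averaged over all observations
z = (grid[:, None] - x[None, :]) / h
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# A density should integrate to about 1 (simple Riemann sum)
area = (density * (grid[1] - grid[0])).sum()
print(round(area, 3))
```

A larger `h` gives a smoother, flatter curve; a smaller `h` follows the individual observations more closely.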

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6)) # plot 2 graphs

# simple density function
sns.kdeplot(tips.total_bill, ax=axes[0])

# filled area under the curve, set color, remove legend, set title
sns.kdeplot(tips.tip, shade=True, color='purple', legend=False, ax=axes[1]).set_title('PDF of Tip') 

### * BOX-PLOT

- the box shows the quartiles of the dataset 
- the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers"
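
The rule that decides which points count as "outliers" can be reproduced numerically. A minimal sketch with made-up values (the array `values` is hypothetical), using the 1.5 x IQR rule that `boxplot` applies by default through its `whis` parameter:

```python
import numpy as np

# Made-up values standing in for tips.tip
values = np.array([1.0, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 10.0])

# The box spans the first and third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Whiskers reach at most 1.5 * IQR beyond the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Everything outside the fences is drawn as an individual point
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, q3, outliers)
```

Points flagged this way are only candidates: the rule depends on the spread of the middle half of the data, not on whether a value is actually wrong.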

In [None]:
# detect the outliers

fig, axes = plt.subplots(1, 2,figsize=(15,6)) # plot 2 graphs

# use red color, set title
sns.boxplot(x='total_bill', data=tips, color='red', ax=axes[0]).set(title='Total_bill outliers') 

# vertical orientation (the column is passed as y), set title
sns.boxplot(y='tip', data=tips, ax=axes[1]).set_title('Tip outliers') 

Some total_bill and tip values lie far away from the rest of the values. <br>
Both variables contain outlier candidates. <br>
"To drop or not to drop?" isn't the topic now, but it’s important to investigate the nature of the outliers before deciding.

In [None]:
tips[tips.total_bill>=40]

In [None]:
tips[tips.tip>=6]

### * VIOLIN PLOT

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15,6)) # plot 2 graphs 

# single horizontal violinplot (the values are passed as x)
sns.violinplot(x=tips.total_bill, ax=axes[0])

# change orientation (the values are passed as y), set color
sns.violinplot(y=tips.tip, color='red', ax=axes[1])

## **c. Relationship between 2 Numerical Variables**

### * LINE PLOT

- usually shows trends over time
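
When several observations share the same x value, `lineplot` draws their mean and an interval around it. The aggregation can be sketched with a plain group-by; the frame `demo` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the size and tip columns of tips
demo = pd.DataFrame({
    'size': [1, 2, 2, 2, 3, 3, 4, 4],
    'tip':  [1.0, 2.0, 3.0, 4.0, 3.0, 5.0, 4.0, 6.0],
})

# mean: the height of the line at each x value
# sem:  the standard error of the mean, roughly what a 68% interval shows
per_size = demo.groupby('size')['tip'].agg(['mean', 'sem', 'count'])
print(per_size)
```

A group with a single observation has no standard error (it shows up as `NaN`), which is why the interval can collapse at sparsely populated x values.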

In [None]:
# line plot with confidence interval, set axes and title
sns.lineplot(x='size', y='tip', data=tips).set(xlabel='X axis- Size', ylabel='Y axis - Tip', title='Line plot - Size vs. Tip')

In [None]:
# show error bars and plot the standard error 
sns.lineplot(x='size', y='tip', hue='sex', data=tips, err_style='bars', ci=68)

### * SCATTER PLOT 

- interaction between the two numeric columns

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,12)) # plot 4 graphs 

# simple scatter plot between two variables 
sns.scatterplot(x='total_bill', y='tip', data=tips, ax=axes[0,0])

# group by time and show the groups with different colors
sns.scatterplot(x ='total_bill', y ='tip', data = tips, hue= 'time', ax=axes[0,1])

# show the variable time by varying both color and marker
sns.scatterplot(x ='total_bill', y ='tip', data = tips, hue='time', style= 'time', ax=axes[1,0])

# vary colors and markers to show two different grouping variables
sns.scatterplot(x = 'total_bill', y = 'tip', hue= 'time', style= 'sex', data = tips, ax=axes[1,1])

In [None]:
sns.set(style='white') #set background

fig, axes = plt.subplots(2, 2, figsize=(15,12)) # plot 4 graphs 

# vary colors to show one grouping variable-size
sns.scatterplot(x='total_bill', y='tip', data=tips, hue='size', ax=axes[0,0])

# quantitative variable-size by varying the size of the points
sns.scatterplot(x='total_bill', y='tip', data=tips, hue='size', size='size', ax=axes[0,1])

# set the minimum and maximum point size and show all sizes in legend
sns.scatterplot(x='total_bill', y='tip', data=tips, hue='size', size='size', sizes=(10,200), ax=axes[1,0])

# vary colors and markers to show two different grouping variables -size,sex
sns.scatterplot(x='total_bill', y='tip', data=tips, hue='size', size='size', style='sex', sizes=(10,200), ax=axes[1,1])
sns.despine() 

### * RELPLOT

- show the relationship between two variables with semantic mappings of subsets
- could be used instead of scatter plot 

In [None]:
# how could we use relplot instead of scatter plot

#sns.scatterplot(x='total_bill', y='tip', data=tips, hue='size', size='size', style='sex', sizes=(10,200))
sns.relplot(x='total_bill', y='tip', data=tips, hue='size', size='size', style='sex', sizes=(10,200))

In [None]:
sns.set(style='whitegrid') # set background for following graphs

In [None]:
# draw a single facet, set axes 
sns.relplot(x='total_bill', y='tip', hue='day', data = tips).set(xlabel='X - total_bill', ylabel='Y - tip')

In [None]:
# facet on the columns with another variable
sns.relplot(x='total_bill', y='tip', hue='day', col='time', data = tips)

In [None]:
# facet on the columns and rows
sns.relplot(x='total_bill', y='tip', hue='day', col='time', row='sex', data = tips)

### * JOINT PLOT

- takes two variables and draws their scatterplot together with a histogram of each variable in the margins

In [None]:
sns.set(style='white') # set background for following graphs

In [None]:
# scatterplot with marginal histograms
sns.jointplot(x='total_bill', y='tip', data=tips)

In [None]:
# add regression line and density function:
sns.jointplot(x='total_bill', y='tip', data=tips, kind='reg')

In [None]:
# replace the scatterplot with “hexbin” plot - shows the counts of observations that fall within hexagonal bins
sns.jointplot(x='total_bill', y='tip', data=tips, kind='hex', color='purple')

In [None]:
# replace the scatterplot and histograms with density estimates
sns.jointplot(x='total_bill', y='tip', data=tips, kind='kde', color='pink')

### * KDE PLOT

- fit and plot a univariate or bivariate density estimate

In [None]:
# bivariate density, more contour levels and a different color palette
# (the keyword arguments x=, y= and levels= need seaborn 0.11 or newer)
sns.kdeplot(x=tips.total_bill, y=tips.tip, levels=30, cmap='Purples_d')

In [None]:
# 2 density functions on the same graph
# shade under the density curve, use a different color
sns.kdeplot(tips.total_bill,shade=True, color='b')
sns.kdeplot(tips.tip, shade=True, color='r')

In [None]:
# don't shade under the density curve, use a different color
female_tip=tips[tips['sex'] == 'Female'].tip.values
male_tip=tips[tips['sex'] == 'Male'].tip.values

sns.kdeplot(female_tip, color='red')
sns.kdeplot(male_tip, color='blue')

### * LM PLOT

- plots the data together with a fitted linear regression model
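
The straight line in such a plot is an ordinary least-squares fit, and its slope and intercept can be computed directly. A minimal sketch with NumPy and made-up bill/tip pairs (the arrays are hypothetical, not the real data):

```python
import numpy as np

# Made-up bill/tip pairs
total_bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.5, 3.0, 3.5, 4.5, 6.0])

# Ordinary least-squares line: tip is approximated by slope * total_bill + intercept
slope, intercept = np.polyfit(total_bill, tip, deg=1)

# Fitted values and what the line leaves unexplained
predicted = slope * total_bill + intercept
residuals = tip - predicted

print(round(slope, 3), round(intercept, 3))
```

The slope reads as "additional tip per additional unit of bill"; the residuals show how far each observation is from the line.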

In [None]:
sns.set(style='ticks')

# simple linear relationship between two variables
sns.lmplot(x='total_bill', y='tip', data=tips)

In [None]:
# regression line without confidence interval
sns.lmplot(x='total_bill', y='tip', data=tips, ci=None)

In [None]:
# third variable, levels in different colors with markers
sns.lmplot(x='total_bill', y='tip', data=tips, hue='smoker', markers=['o','x'])

In [None]:
# facet on the columns and rows, set name of axes
sns.lmplot(x='total_bill', y='tip', data=tips, col='smoker',row='time').set_axis_labels('Total bill (in $ )', ' Tip ( in $ )')

### * HEATMAP

In [None]:
# visualize the correlation matrix of the numerical columns
sns.heatmap(data=tips.select_dtypes(include='number').corr(), annot=True) # values of Pearson coefficient

### * PAIR PLOT

- relationship between numeric columns in the form of multiple scatter plots

In [None]:
# simple pairplot - all numerical variables, scatter plots and histograms on diagonal
sns.pairplot(tips)

In [None]:
# select wanted variables
sns.pairplot(tips[['total_bill','tip']])  

In [None]:
# fit linear regression models to the scatter plots, show density plots on diagonal
sns.pairplot(tips, kind='reg', diag_kind="kde")  

In [None]:
# set markers for different levels
sns.pairplot(tips, hue='sex', markers=["+", "o"])  

### * FACETGRID

- multi-plot grid for plotting conditional relationships.
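
What a facet grid automates can be sketched by hand: split the data by a conditioning variable and draw the same plot once per subset on shared axes. The frame `demo` below is made up for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for three columns of tips
demo = pd.DataFrame({
    'total_bill': [10.0, 20.0, 30.0, 15.0, 25.0, 35.0],
    'tip': [1.5, 3.0, 4.0, 2.0, 3.5, 5.0],
    'time': ['Lunch', 'Lunch', 'Dinner', 'Dinner', 'Lunch', 'Dinner'],
})

# One (level, subset) pair per value of the conditioning variable
groups = list(demo.groupby('time'))

# One panel per level, with shared axes so the panels are comparable
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, figsize=(8, 4))
for ax, (level, subset) in zip(axes, groups):
    ax.scatter(subset['total_bill'], subset['tip'])
    ax.set_title(f'time = {level}')
```

`FacetGrid` does this bookkeeping itself, including a second conditioning variable on the rows and a consistent legend for `hue`.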

In [None]:
# facets on column and row
g = sns.FacetGrid(tips, col='time',  row='smoker')
g = g.map(plt.scatter, 'total_bill', 'tip', edgecolor='w').set_titles('Scatter-plot')

In [None]:
# facets on column, with hue represent levels of a variable in different colors
g=sns.FacetGrid(tips, col='time',  hue='smoker')
g = (g.map(plt.scatter, 'total_bill', 'tip', edgecolor='w').add_legend())

In [None]:
pal = dict(Lunch='seagreen', Dinner='gray')

g = sns.FacetGrid(tips, col='sex', hue='time', palette=pal, hue_order=['Dinner', 'Lunch'])
g = (g.map(plt.scatter, 'total_bill', 'tip').add_legend())

## **d. Relationship between Numerical and Categorical Variables**

### * POINT PLOT

- show point estimates and confidence intervals using scatter plot glyphs
- can be more useful than bar plots for focusing comparisons between different levels of one or more categorical variables. 
- show how the relationship between levels of one categorical variable changes across levels of a second categorical variable
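
The confidence interval around each point estimate comes from bootstrapping: the data are resampled with replacement many times and the estimate is recomputed each time. A minimal NumPy sketch with made-up tip values (the array, the seed and the number of resamples are assumptions of this example):

```python
import numpy as np

# Made-up tips for one level of a categorical variable
tips_on_a_day = np.array([2.0, 3.0, 3.5, 1.5, 4.0, 2.5, 5.0, 3.0])

rng = np.random.default_rng(42)
n_boot = 1000

# Resample with replacement and recompute the mean each time
boot_means = np.array([
    rng.choice(tips_on_a_day, size=len(tips_on_a_day), replace=True).mean()
    for _ in range(n_boot)
])

# Point estimate and the middle 95% of the bootstrap distribution
estimate = tips_on_a_day.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(estimate, ci_low, ci_high)
```

With few observations per level the interval gets wide, which is worth keeping in mind when comparing levels that contain very different numbers of records.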

In [None]:
sns.set(style='darkgrid',palette='Set2') # set background and palette

# grouped by a categorical variable
sns.pointplot(x='day', y='tip', data=tips)

In [None]:
# grouped by two variables, separate lines
sns.pointplot(x='day', y='tip', data=tips, hue='smoker', dodge=True)

In [None]:
# separate the points for different hue levels with different marker and line style
sns.pointplot(x='day', y='tip', data=tips, hue='smoker', dodge=True, markers=['o','x'], linestyles=['dotted','--'])

In [None]:
# show standard deviation of observations instead of a confidence interval
sns.pointplot(x='tip', y='day', data=tips, ci='sd')

### * BAR PLOT

- represents an estimate of central tendency for a numeric variable
- shows only the mean (or other estimator) value, not the distribution of values at each level of the categorical variables (like a box or violin plot)
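
The estimator can be swapped through the `estimator` parameter, which matters when a few large values pull the mean upwards. A minimal sketch, assuming a small made-up frame `demo` and using the median instead of the mean:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical data: one large tip on Saturday inflates the Saturday mean
demo = pd.DataFrame({
    'day': ['Sat', 'Sat', 'Sat', 'Sun', 'Sun', 'Sun'],
    'tip': [1.0, 2.0, 9.0, 3.0, 4.0, 5.0],
})

fig, ax = plt.subplots()

# Bars show the median tip per day instead of the mean
sns.barplot(x='day', y='tip', data=demo, estimator=np.median, ax=ax)

# The same numbers as a table, next to the means for comparison
medians = demo.groupby('day')['tip'].median()
means = demo.groupby('day')['tip'].mean()
print(medians, means, sep='\n\n')
```

In this made-up data the two days have the same mean but different medians, so the choice of estimator changes the picture.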

In [None]:
sns.set(style='whitegrid')# set background

fig, axes = plt.subplots(2, 2, figsize=(15,8)) # plot 4 graphs

# grouped by a categorical variable
sns.barplot(x='day', y='tip', data=tips, ax=axes[0,0])

# all bars in a single color
sns.barplot(x='day', y='tip', data=tips, color='salmon', saturation=.8, ax=axes[0,1])

# grouped by two variables, show standard deviation of observations instead of a confidence interval
sns.barplot(x='day', y='tip', data=tips, hue='sex', ci='sd', ax=axes[1,0])

# grouped by new variable 
tips['weekend'] = tips['day'].isin(['Sat', 'Sun'])
sns.barplot(x='day', y='total_bill', hue="weekend", data=tips, dodge=False, ax=axes[1,1]) 

###  * BOX-PLOT

- show distributions with respect to categories
- box shows the quartiles 
- whiskers show the rest of the distribution, except for points that are determined to be 'outliers'

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,8)) # plot 4 graphs

# the quartile information for a numerical column grouped by categorical column
sns.boxplot(x='time', y='total_bill', data = tips, ax=axes[0,0])

# set unique categorical label in different order
sns.boxplot(x='time', y='total_bill', data=tips,order=['Dinner', 'Lunch'], ax=axes[0,1])

# boxplot with swarmplot
sns.boxplot(x = 'time', y = 'total_bill', data = tips, ax=axes[1,0])
sns.swarmplot(x = 'time', y = 'total_bill', data = tips, color='.25', ax=axes[1,0])

# boxplot with nested grouping by two categorical variables and thicker lines
sns.boxplot(x='time', y='total_bill', data = tips, hue='day', linewidth=2.5, ax=axes[1,1])

###  * VIOLIN PLOT

- combination of boxplot and KDE plot.
- shows the distribution of quantitative data across several levels of one (or more) categorical variables 
- show multiple distributions of data at once

In [None]:
fig, axes = plt.subplots(3, 2, figsize=(15,12)) # plot 6 graphs

# vertical violinplot grouped by a categorical variable
sns.violinplot(x='day', y='total_bill', data=tips, ax=axes[0,0])

# vertical violinplot grouped by a categorical variable, set palette
sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='muted', ax=axes[0,1])

# split violins to compare the distributions across the hue variable
sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='muted', split=True, ax=axes[1,0]) 

# show each observation with a stick inside the violin
sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='muted', split=True, inner='stick', ax=axes[1,1])

# scale the violin width by the number of observations in each bin
sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='muted', split=True, scale='count', ax=axes[2,0]) 

# scale the density relative to the counts across all bins (scale_hue=False)
sns.violinplot(x='day', y='total_bill', data=tips, hue='sex', palette='muted', split=True, scale='count', inner='stick', scale_hue=False, ax=axes[2,1])

### * SWARM PLOT

- draw a categorical scatterplot with non-overlapping points
- good complement to a box or violin plot

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(15,12)) # plot 4 graphs

sns.swarmplot(x='time', y='tip', data=tips, ax=axes[0,0])

sns.swarmplot(x='time', y='tip', data=tips, hue='sex', ax=axes[0,1])

sns.swarmplot(x='time', y='tip', data=tips, hue='sex', dodge=True, ax=axes[1,0]) 

# combine swarm and violin plot
sns.swarmplot(x='time', y='tip', data=tips, color='k', ax=axes[1,1])
sns.violinplot(x='time', y='tip', data=tips, inner=None, ax=axes[1,1])

### * CAT PLOT

- show the relationship between a numerical and one or more categorical variables 
- can use one of several visual representations (boxplot, swarmplot, violinplot,...)

In [None]:
# simple relationship between a numerical and a categorical variable  
sns.catplot(x='day', y='total_bill', data=tips)

In [None]:
# facet on column, grouped by 2 variables
sns.catplot(x='day', y='tip', data=tips, hue='size', col='sex')

In [None]:
# facet on column and row, grouped by 2 variables
sns.catplot(x='day', y='total_bill', data=tips, hue='size', col='time', row='sex' )

In [None]:
# no jitter: the observations of each category are drawn on one line
sns.catplot(x='day', y='tip', data=tips, hue='sex', jitter=False, alpha=.4)

In [None]:
# use a different plot kind to visualize the same data
sns.catplot(x='sex', y='total_bill',hue='smoker', col='time', data=tips, kind='point', dodge=True, height=4, aspect=.7)

In [None]:
# use a different plot kind to visualize the same data
sns.catplot(x='sex', y='total_bill', data=tips, hue='smoker', col='day', kind='bar')

In [None]:
# use a different plot kind to visualize the same data
sns.catplot(x='time', y='tip', data=tips, color='k', height=3, kind='swarm')

In [None]:
# use a different plot kind to visualize the same data
sns.catplot(x='time', y='tip', data=tips, kind='boxen')

### * FACETGRID

- multi-plot grid for plotting conditional relationships

In [None]:
# univariate plot on each facet
g = sns.FacetGrid(tips, col='time',  row='smoker')
g = g.map(plt.hist, 'total_bill', color='green').set_titles('Histogram')

In [None]:
# specify the order, change the height and aspect ratio of each facet
bins = np.arange(0, 65, 5)
g = sns.FacetGrid(tips, col='smoker', col_order=['Yes', 'No'], height=4, aspect=.5)
g = g.map(plt.hist, 'total_bill', bins=bins, color='m').add_legend()