# Categorical data
Plotting when one of the main variables is categorical.

There are three main families of plots:
- Categorical scatterplots:

    - stripplot() (with kind="strip"; the default)
    - swarmplot() (with kind="swarm")

- Categorical distribution plots:

    - boxplot() (with kind="box")
    - violinplot() (with kind="violin")
    - boxenplot() (with kind="boxen")

- Categorical estimate plots:

    - pointplot() (with kind="point")
    - barplot() (with kind="bar")
    - countplot() (with kind="count")

## Categorical scatterplots

We'll use the higher-level interface **catplot**, which selects the plot through its `kind` parameter.

In [None]:
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
tips = sns.load_dataset("tips")
tips.head(10)

In [None]:
# catplot() draws a scatterplot by default; jitter=False switches off the random jitter.
sns.catplot(x="day", y="total_bill", jitter=False, data=tips);

As you can see, all of the points belonging to one category fall at the same position along the categorical axis.

stripplot(), the default kind in catplot(), adjusts the positions of the points along the categorical axis with a small amount of random “jitter”.

In [None]:

sns.catplot(x="day", y="total_bill", data=tips);
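The effect of jitter can also be checked numerically. The sketch below is only an illustration: it uses a small synthetic frame (the values are made up, not the tips data) and the axes-level `stripplot()`, then reads back the x coordinate of every plotted point. The helper `x_positions` is a name introduced here for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the tips data (made-up values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "day": np.repeat(["Thur", "Fri", "Sat"], 20),
    "total_bill": rng.uniform(5, 50, size=60),
})

fig, (ax_none, ax_jitter) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
sns.stripplot(x="day", y="total_bill", data=demo, jitter=False, ax=ax_none)
sns.stripplot(x="day", y="total_bill", data=demo, jitter=0.2, ax=ax_jitter)

def x_positions(ax):
    """Collect the x coordinate of every point drawn on an axes."""
    return np.concatenate(
        [np.asarray(c.get_offsets())[:, 0] for c in ax.collections]
    )

x_none = x_positions(ax_none)
x_jitter = x_positions(ax_jitter)
print(len(np.unique(x_none)), "distinct x positions without jitter")
print(len(np.unique(x_jitter)), "distinct x positions with jitter")
```

Without jitter every point sits exactly on its category position; with `jitter=0.2` each point is shifted by a random amount of at most 0.2 on either side.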


Another approach, swarmplot(), adjusts the points along the categorical axis so that they do not overlap.

In [None]:
sns.catplot(x="day", y="total_bill", kind="swarm", data=tips);

Again, we can plot more than two variables by using **hue**.

In [None]:
sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips);

Numeric data has an inherent order defined by the real line $\mathbb{R}$. For categorical data the order is less clear. Seaborn makes a best guess, but you can always control the order with the `order` parameter.



# Categorical (discrete) variables
- **nominal variable**: has no intrinsic ordering to its categories, like color
- **ordinal variable**: has a clear ordering, like low, medium, high




In [None]:
sns.catplot(x="smoker", y="tip", kind="swarm", order=["No", "Yes"], data=tips);
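For an ordinal variable there are two common ways to fix the order, sketched below on a small made-up frame. The column names `size` and `score` are hypothetical. The first option passes `order` for a single plot; the second stores the order in the data as an ordered pandas `Categorical`, which seaborn then follows by default.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up ordinal data: "size" has a natural order, but the rows are shuffled
demo = pd.DataFrame({
    "size": ["medium", "low", "high", "low", "high", "medium"],
    "score": [5.0, 2.0, 9.0, 3.0, 8.0, 6.0],
})
levels = ["low", "medium", "high"]

# Option 1: pass the order explicitly for a single plot
fig, ax = plt.subplots()
sns.stripplot(x="size", y="score", order=levels, data=demo, ax=ax)
fig.canvas.draw()  # make sure the tick labels are populated before reading them
tick_labels = [t.get_text() for t in ax.get_xticklabels()]
print(tick_labels)

# Option 2: store the order in the data as an ordered Categorical
demo["size"] = pd.Categorical(demo["size"], categories=levels, ordered=True)
sorted_sizes = demo.sort_values("size")["size"].tolist()
print(sorted_sizes)
```

The second option is convenient when several plots and tables should all use the same order.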

If the category names are relatively long or there are many categories, swap x and y so that the categories run along the vertical axis.

In [None]:
sns.catplot(x="total_bill", y="day", hue="sex", kind="swarm", data=tips);

# Distributions of observations within categories

Scatter plots are limited in how well they show the distribution of values when there are many data points.

In these scenarios one can visualize a summary of the distribution instead.

# Box plot (a well-known graphical representation of a distribution)



In [None]:
# Note: outliers are displayed as individual points
sns.catplot(x="day", y="total_bill", kind="box", data=tips);
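Which points count as outliers follows from the quartiles. By default the whiskers reach the most extreme observations that lie within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an individual point. The sketch below recomputes those limits with NumPy on made-up numbers; `outlier_fences` is a helper name introduced for this example, and the quartile interpolation may differ slightly from what the plotting code uses.

```python
import numpy as np

def outlier_fences(values, whis=1.5):
    """Return the (lower, upper) limits at whis * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

bills = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # made-up values
lower, upper = outlier_fences(bills)
outliers = bills[(bills < lower) | (bills > upper)]
print(lower, upper, outliers)
```

Here only the value 100 falls outside the limits, so it is the only point that would be drawn separately.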

In [None]:
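# Let's overlay the points. catplot() is a figure-level function and does not
# accept an `ax` argument, so use the axes-level boxplot() and stripplot() here.
fig, ax = plt.subplots()

sns.boxplot(x="day", y="total_bill", data=tips, ax=ax)
sns.stripplot(x="day", y="total_bill", color="black", data=tips, ax=ax);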

## Again we can play with *hue* semantics

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker", kind="box", data=tips);

The default behavior above is called “dodging”. It is on by default because seaborn assumes that the semantic variable is nested within the main categorical variable. If that is not the case, you can disable the dodging.

In the example below, `weekend` is determined entirely by `day`, so nesting is not possible and dodging would only shift each box away from its tick.



In [None]:
tips["weekend"] = tips["day"].isin(["Sat", "Sun"])
sns.catplot(x="day", y="total_bill", hue="weekend",
            kind="box", dodge=False, data=tips);

For large data sets, a **boxen** plot (also called a letter-value plot) shows more of the shape of the distribution.

Here is the paper for more information:

https://vita.had.co.nz/papers/letter-value-plot.html
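The idea behind letter values can be sketched with plain quantiles: after the median come the fourths (tail area 1/4 on each side), then the eighths (1/8), and so on, and each pair becomes one box. The function below is a rough approximation for illustration only. It uses interpolated quantiles rather than the depth-based definition in the paper, and `letter_values` is a name introduced here, not a seaborn function.

```python
import numpy as np

def letter_values(values, depth=3):
    """Return [(lower, upper), ...] quantile pairs at tail areas 1/2, 1/4, 1/8, ..."""
    pairs = []
    for i in range(1, depth + 1):
        tail = 0.5 ** i
        lower, upper = np.quantile(values, [tail, 1 - tail])
        pairs.append((float(lower), float(upper)))
    return pairs

data = np.arange(101)  # made-up data: 0, 1, ..., 100
pairs = letter_values(data)
for name, (low, high) in zip(["median", "fourths", "eighths"], pairs):
    print(name, low, high)
```

With more data, more levels can be estimated reliably, which is why this plot suits large data sets.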

In [None]:
sns.catplot(x='total_bill', kind="boxen", data=tips)

In [None]:
# Exercise: try plotting by smoker
sns.catplot(x='day', y='total_bill', kind="boxen", data=tips)

# Violin plots (box plot + kernel density estimate)

In [None]:
sns.catplot(x="total_bill", y="day", hue="time",
            kind="violin", data=tips)

In [None]:
sns.catplot(x="day", y="total_bill", hue="smoker",
            kind="violin", split=True, data=tips)

In [None]:
g = sns.catplot(x="day", y="total_bill", kind="violin", inner=None, data=tips)
sns.swarmplot(x="day", y="total_bill", color="k", size=3, data=tips, ax=g.ax);

# Statistical estimation (central tendency) within categories

In [None]:
sns.catplot(x="smoker", y="tip", kind="bar", data=tips)

In [None]:
# Exercise: from the tips data set, display a bar plot of the median tip by smoker (Yes/No)

???
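One possible approach to this exercise, shown as a sketch on a small made-up frame rather than the real tips data: the bar plot takes an `estimator` argument, and passing `np.median` replaces the default mean. The axes-level `barplot()` is used here so the bar heights can be read back and compared with a pandas `groupby`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Small made-up stand-in for the tips data
demo = pd.DataFrame({
    "smoker": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "tip": [1.0, 2.0, 6.0, 2.0, 3.0, 10.0],
})

fig, ax = plt.subplots()
sns.barplot(x="smoker", y="tip", order=["No", "Yes"],
            estimator=np.median, data=demo, ax=ax)

# Each bar is a rectangle patch; its height is the estimate for that category
bar_heights = [bar.get_height() for bar in ax.patches]
medians = demo.groupby("smoker")["tip"].median()
print(bar_heights)
print(medians.tolist())
```

The same `estimator=np.median` argument can be passed to `sns.catplot(..., kind="bar")`.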

In [None]:
titanic = sns.load_dataset("titanic")
titanic.head()

In [None]:
sns.catplot(x="sex", y="survived", hue="class", kind="bar", data=titanic)

In [None]:
# Exercise: can you display the percentage that survived?
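A simple way to get percentages, sketched on made-up rows rather than the titanic data: since `survived` is coded as 0/1, its mean is the survival fraction, so scaling the column by 100 before plotting turns the bar heights into percentages. The column name `survived_pct` is introduced here for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up rows with the same 0/1 coding as the titanic "survived" column
demo = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "survived": [0, 0, 0, 1, 1, 1, 1, 0],
})
demo["survived_pct"] = demo["survived"] * 100

fig, ax = plt.subplots()
sns.barplot(x="sex", y="survived_pct", order=["male", "female"], data=demo, ax=ax)
ax.set_ylabel("survived (%)")

pct_heights = [bar.get_height() for bar in ax.patches]
print(pct_heights)
```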

In [None]:
# Exercise: display the counts by class too
sns.catplot(x="deck", kind="count", data=titanic)
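A count plot needs no `y` variable because each bar height is just the number of rows in that category. The sketch below, on a made-up column, compares the bar heights with pandas `value_counts()` to make that explicit.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up deck labels
demo = pd.DataFrame({"deck": ["C", "B", "C", "A", "C", "B"]})
deck_order = ["A", "B", "C"]

fig, ax = plt.subplots()
sns.countplot(x="deck", order=deck_order, data=demo, ax=ax)

bar_counts = [int(bar.get_height()) for bar in ax.patches]
table_counts = demo["deck"].value_counts().reindex(deck_order).tolist()
print(bar_counts)
print(table_counts)
```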

# Point plot (same estimate as a bar plot, drawn as points instead of bars)

In [None]:
sns.catplot(x="smoker", y="tip", hue="smoker", kind="point", data=tips)

In [None]:
# For black-and-white compatibility, vary the line styles
sns.catplot(x="sex", y="survived", hue="class", kind="point",
            linestyles=["-", "--", "-."], data=titanic);

In [None]:
# Add distinct markers as well
sns.catplot(x="sex", y="survived", hue="class", kind="point",
            markers=["^", "o", "s"], linestyles=["-", "--", "-."], data=titanic);

# Multiple relationships with facets (grid or panel charts)

In [None]:
titanic.head()

In [None]:
sns.catplot(x="sex", y="fare", hue="deck",
            col="class",
            kind="swarm", data=titanic);
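The `col` argument splits the figure into one panel per level of that variable, and `catplot()` returns a grid object whose `axes` attribute holds the panels. The sketch below shows this on synthetic data (made-up values, with `kind="strip"` to keep it light) and prints the grid shape and panel titles.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the titanic columns used above (made-up values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "sex": np.tile(["male", "female"], 30),
    "fare": rng.uniform(5, 100, size=60),
    "class": np.repeat(["First", "Second", "Third"], 20),
})

g = sns.catplot(x="sex", y="fare", col="class", kind="strip", data=demo)

panel_titles = [ax.get_title() for ax in g.axes.flat]
print(g.axes.shape)   # one row, one column per class
print(panel_titles)
```

Individual panels can then be adjusted through `g.axes`, for example to change one title or axis limit.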

# Visualizing the distribution

In [None]:
import numpy as np
x = np.random.binomial(30, 0.6, size=500)
sns.distplot(x)

In [None]:
sns.distplot(x, kde=False, rug=True)

# Kernel density estimate (KDE)

In [None]:
sns.distplot(x, hist=False)

In [None]:
from scipy.stats import expon

In [None]:
x = np.random.exponential(size=10)
# fit=expon overlays an exponential density fitted to the sample
sns.distplot(x, kde=False, fit=expon)
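The fit itself can be done directly with SciPy, which also gives access to the estimated parameters. The sketch below makes two assumptions that are not in the cell above: a larger sample (1000 draws with scale 2.0, chosen arbitrarily) and a location fixed at 0 via `floc=0`, so that only the scale is estimated.

```python
import numpy as np
from scipy.stats import expon

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=1000)

# Pin the location at 0 so that only the scale parameter is estimated;
# the maximum-likelihood scale is then the sample mean
loc, scale = expon.fit(sample, floc=0)
print(loc, scale, sample.mean())

# Evaluate the fitted density on a grid, e.g. to draw it over a histogram
grid = np.linspace(0, sample.max(), 200)
density = expon.pdf(grid, loc=loc, scale=scale)
```

With only 10 observations, as in the cell above, the same fit works but the estimated scale is much noisier.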

# Visualizing the distribution of two variables

In [None]:
from scipy.stats import multivariate_normal

In [None]:
rv = multivariate_normal([1, 1], 1)

In [None]:
import pandas as pd
x2d= rv.rvs(size=100, random_state=1)
df = pd.DataFrame(x2d, columns=["x1", "x2"])


In [None]:
# Try changing the default kind from "scatter" to "hex" or "kde"
sns.jointplot(x="x1", y="x2", data=df);

# Pairwise relationships

In [None]:
df_mpg = sns.load_dataset('mpg')
df_mpg.head()

In [None]:
# Try changing the diagonal plots to KDE (diag_kind="kde")
pair_grid_inst = sns.pairplot(df_mpg.iloc[:, 0:-3])

In [None]:
# pairplot is built on top of the PairGrid object
type(pair_grid_inst)

In [None]:
g = sns.PairGrid(df_mpg.iloc[:, 0:-3].dropna())
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, n_levels=6);

# Relationships (linear) among variables

In [None]:
iris_df = sns.load_dataset('iris')
iris_df.head()

In [None]:
sns.regplot(x="sepal_length", y="petal_length", data=iris_df)
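`regplot()` draws the fitted line but does not return its coefficients. For the default straight-line fit, the slope and intercept can be obtained separately with an ordinary least-squares fit, sketched here with `np.polyfit` on made-up, exactly linear data.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is an ordinary least-squares line
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)
```

For the iris data the same call would take `iris_df["sepal_length"]` and `iris_df["petal_length"]` as `x` and `y`.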

In [None]:
sns.pairplot(iris_df)

In [None]:
# Create a PairGrid and map kdeplot onto the diagonal and regplot onto the off-diagonal
g = sns.PairGrid(iris_df)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.regplot)

# Heatmap

In [None]:

# corr() needs numeric columns, so drop the string column "species" first
sns.heatmap(iris_df.drop(columns="species").corr(), annot=True, fmt=".2f")

# Geo map
## [Folium](https://python-visualization.github.io/folium/) uses Leaflet, a JavaScript library for interactive maps

We are only scratching the surface here; mapping is a more involved subject.

In [None]:
!pip install folium

In [None]:
import folium
m = folium.Map(location=[39.6766, -104.9619])
m

# One can choose different tiles

In [None]:
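# Recent folium releases no longer bundle the Stamen tiles;
# 'cartodbpositron' is a built-in alternative to 'Stamen Toner'
folium.Map(
    location=[39.675938, -104.960721],
    tiles='cartodbpositron',
    zoom_start=13
)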

# Markers

In [None]:
# Recent folium releases no longer bundle the Stamen tiles, so use the default tileset
m = folium.Map(
    location=[40.211209, -105.821088],
    zoom_start=11,
    tiles='OpenStreetMap'
)

tooltip = 'Click me!'

folium.Marker([40.249466, -105.827617], popup='<i>Grand Lake</i>', tooltip=tooltip).add_to(m)
folium.Marker([40.144075, -105.844817], popup='<b>Lake Granby</b>', tooltip=tooltip).add_to(m)

m

In [None]:


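# Recent folium releases no longer bundle the Stamen tiles, so use the default tileset
m = folium.Map(
    location=[-0.760488, -90.331771],
    zoom_start=14,
    tiles='OpenStreetMap'
)

# Draw an unfilled circle with a 100 m radius around the location
folium.Circle(
    radius=100,
    location=[-0.760488, -90.331771],
    popup='Tortuga Bay',
    color='crimson',
    fill=False,
).add_to(m)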
m

In [None]:
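# Click anywhere on the map to show a latitude/longitude popup
# (recent folium releases no longer bundle the Stamen tiles, so use the default tileset)
m = folium.Map(
    location=[-0.307781, -90.691985],
    zoom_start=8,
    tiles='OpenStreetMap'
)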
m.add_child(folium.LatLngPopup())


m


# GeoJSON support

http://geojson.org/


In [None]:
!curl -O https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/antarctic_ice_edge.json

In [None]:
!head -n 10 antarctic_ice_edge.json

In [None]:
!curl -O https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/antarctic_ice_shelf_topo.json

In [None]:
!head antarctic_ice_shelf_topo.json

In [None]:

antarctic_ice_edge = 'antarctic_ice_edge.json'
antarctic_ice_shelf_topo = 'antarctic_ice_shelf_topo.json'

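# 'Mapbox Bright' is not available in recent folium releases;
# 'cartodbpositron' is a built-in light tileset
m = folium.Map(
    location=[-59.1759, -11.6016],
    tiles='cartodbpositron',
    zoom_start=2
)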

folium.GeoJson(
    antarctic_ice_edge,
    name='geojson'
).add_to(m)

folium.TopoJson(
    open(antarctic_ice_shelf_topo),
    'objects.antarctic_ice_shelf',
    name='topojson'
).add_to(m)

folium.LayerControl().add_to(m)


m

# Choropleth maps

Visualizing a quantity (such as population density or per-capita income) on a map.

In [None]:
!curl -O https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/US_Unemployment_Oct2012.csv

In [None]:
!head -n 10 US_Unemployment_Oct2012.csv

In [None]:
!curl -O https://raw.githubusercontent.com/python-visualization/folium/master/examples/data/us-states.json

In [None]:
!head -n 10 us-states.json

In [None]:
import pandas as pd

state_geo = 'us-states.json'

state_unemployment = 'US_Unemployment_Oct2012.csv'
state_data = pd.read_csv(state_unemployment)

m = folium.Map(location=[48, -102], zoom_start=3)

folium.Choropleth(
    geo_data=state_geo,
    name='choropleth',
    data=state_data,
    columns=['State', 'Unemployment'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Unemployment Rate (%)'
).add_to(m)

folium.LayerControl().add_to(m)

m
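In the `Choropleth` call, `columns` names the key column and the value column of the data, and `key_on` names the GeoJSON field that the key column is matched against. A mismatch between the two is a common reason for regions that are not colored as expected. The sketch below is a plain pandas sanity check on a tiny made-up GeoJSON and table (none of the values come from the files above); it lists the keys that fail to match in either direction.

```python
import pandas as pd

# Tiny made-up GeoJSON with two features; real files have the same layout
geo = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "CO", "properties": {"name": "Colorado"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]}},
        {"type": "Feature", "id": "WY", "properties": {"name": "Wyoming"},
         "geometry": {"type": "Polygon",
                      "coordinates": [[[1, 1], [2, 1], [2, 2], [1, 1]]]}},
    ],
}

# Made-up table; "State" plays the role of the key column
data = pd.DataFrame({"State": ["CO", "NM"], "Unemployment": [5.0, 6.0]})

# key_on='feature.id' means: match each feature's "id" against the key column
feature_ids = {feature["id"] for feature in geo["features"]}
data_keys = set(data["State"])

features_without_data = sorted(feature_ids - data_keys)
rows_without_feature = sorted(data_keys - feature_ids)
print("features without data:", features_without_data)
print("rows without a feature:", rows_without_feature)
```

For a file on disk, the same check works after loading it with `json.load`.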

# Resources
Look at this gallery:

https://residentmario.github.io/geoplot/gallery.html

## Geoplot and geopandas
- https://github.com/ResidentMario/geoplot
- http://geopandas.org/gallery/plotting_with_geoplot.html