## Kaggle Data Visualization Course

[Course Link](https://www.kaggle.com/learn/data-visualization)

### Types of Data Measurement

The course works with three types of data: nominal, ordinal, and interval. (The references below also cover a fourth level of measurement, ratio.)

Reference - 
[Difference between different types](https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/)

[Types of data](https://www.mymarketresearchmethods.com/data-types-in-statistics/)

[Types of Data Measurement](https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)
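
As a rough illustration of the three types, the sketch below encodes each one with pandas. The values are made up for the example and are not taken from the course datasets.

```python
import pandas as pd

# Nominal: labels with no inherent order (e.g. wine-producing provinces).
province = pd.Categorical(["Tuscany", "Oregon", "Tuscany"], ordered=False)

# Ordinal: ordered labels, but the gaps between levels are not defined.
quality = pd.Categorical(
    ["good", "poor", "great"],
    categories=["poor", "good", "great"],
    ordered=True,
)

# Interval: numeric values where differences are meaningful.
price = pd.Series([15.0, 40.0, 22.5])

print(province.ordered)               # False: comparing provinces has no meaning
print(quality.min(), quality.max())   # ordering works for ordinal data
print(price.max() - price.min())      # differences work for interval data
```

Only the ordered categorical supports `min` and `max`, and only the numeric series supports subtraction, which mirrors what each type of data allows.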


### Basic Charts

- Bar Chart
    - Good for nominal and small ordinal categorical data
- Line Chart
    - Good for Ordinal Categorical and Interval Data
- Area Chart
    - Good for Ordinal Categorical and interval Data
- Histogram
    - Good for interval data
    
    
Q: Why are line and area charts not suitable for nominal data?

In [None]:
# imports

import pandas as pd
import numpy as np
import os
import urllib
import seaborn as sns
import matplotlib.pyplot as plt

# List the dataset folders available to the notebook
print(os.listdir("../input"))

In [None]:
# Was required for Colab

# !pip install -q kaggle

In [None]:
# Was required for Colab

# Download Dataset from kaggle

# os.environ['KAGGLE_USERNAME'] = 'tj2807'
# os.environ['KAGGLE_KEY'] = '<your-kaggle-api-key>'  # never commit a real key

# !kaggle datasets download -d zynicide/wine-reviews

In [None]:
# Was required for Colab
# importing required modules 
# from zipfile import ZipFile 
  
# # specifying the zip file name 
# file_name = "/content/wine-reviews.zip"
  
# # opening the zip file in READ mode 
# with ZipFile(file_name, 'r') as zip: 
#     # printing all the contents of the zip file 
#     zip.printdir() 
  
#     # extracting all the files 
#     print('Extracting all the files now...') 
#     zip.extractall() 
#     print('Done!') 

In [None]:
reviews = pd.read_csv('../input/wine-reviews/winemag-data_first150k.csv', index_col = 0)

In [None]:
reviews.head(3)

## UniVariate Plotting

### Bar Charts and Categorical Data

In [None]:
# Get absolute numbers for each province
reviews.province.value_counts().head(10).plot.bar()

In [None]:
# what percentage though

# Each bar is a category - Thus suitable for representing stats about
# categorical data.

# Bar chart height can represent anything, as long as it is a number.

# Here province is a nominal variable.

(reviews.province.value_counts().head()/len(reviews)*100).plot.bar()

In [None]:
# Bar Charts however can also be used for ordinal categories. In our case
# scores for each wine are between 80-100, thus even though it's a number, 
# they are ordinal category variables.

reviews.points.value_counts().sort_index().plot.bar()

### Line Charts

In [None]:
# Line charts are typically useful when there are too many categories for a bar chart.
# They do not make sense for nominal data, though: a line chart
# runs the values together, so an order between neighbours is implied.

reviews.points.value_counts().sort_index().plot.line()

### Area Charts

In [None]:
# Area charts are just line charts with area shaded in case of single variable plot.

reviews.points.value_counts().sort_index().plot.area()

### Histogram

Histograms are essentially bar charts over a continuous axis. However, there are some differences:
- Each bar represents a range of values instead of a single value.
- The x axis becomes continuous.
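
To make the "each bar is a range" idea concrete, here is a small sketch that computes the bars of a histogram with NumPy instead of drawing them. The prices are invented for the example.

```python
import numpy as np

# Made-up prices; the single expensive bottle stretches the range.
prices = np.array([4, 8, 12, 15, 18, 22, 25, 31, 38, 95])

# Five equal-width bins spanning 0 to 100; each bar covers a range of values.
counts, edges = np.histogram(prices, bins=5, range=(0, 100))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {n} wines")
```

Because the bins are uniform, one outlier leaves two bins empty and squeezes most of the data into the first two bars, which is the skew problem discussed in the next cells.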

In [None]:
reviews[reviews.price < 200]['price'].plot.hist()

In [None]:
# Histograms, however, struggle with skewed data. Because a histogram divides
# the input range into uniform intervals, a skewed distribution may
# not fit well without some normalization.

# This also makes histograms a very good way to check whether data is skewed and to decide
# how to normalize it.

reviews.price.plot.hist()

In [None]:
# Histograms work really well with ordinal variables as well.

reviews.points.plot.hist()

## BiVariate Plotting

There are different types of plots when it comes to bivariate plotting.
- Scatter Plot
  - Helpful for interval data and some nominal data
- Hex Plot
  - Helpful when there's too much overlapping data to be plotted on scatter plots, otherwise same function as scatter plot
- Stacked Bar Chart
  - good for nominal and ordinal categorical data
- Bivariate Line Chart
  - Good for ordinal categorical and interval data

### Scatter Plots

In [None]:
reviews[reviews.price<100].sample(100).plot.scatter(x='price',y='points')

In [None]:
# From above plot, we can estimate that there is a weak correlation between
# price and points. We had to sample 100 values because there's too much 
# overlapping data

reviews[reviews.price<100].plot.scatter(x='price', y= 'points')

In [None]:
# We took price < 100 because there are too many outliers, which would make the 
# scale problematic

reviews.plot.scatter(x='price',y='points')

### Hex bin Plot

The overplotting problem of the scatter plot above is solved by a hex bin plot.

In [None]:
reviews[reviews.price < 100].plot.hexbin(x='price',y='points', gridsize = 15)

The x axis is price and the y axis is points. A hexbin plot is used when there is too much overlapping data: the color of each hexagon denotes how many data points fall inside it.

This plot also gives us information that the scatter plot did not: many of the reviewed bottles are concentrated around 87.5 points and a price of around $20.
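
The counting behind such a plot can be sketched with NumPy. Note the assumptions: the data is synthetic, and rectangular bins from `np.histogram2d` stand in for hexagons, so this shows the idea rather than what pandas does internally.

```python
import numpy as np

# Synthetic stand-ins for price and points, clustered near $20 and 87.5 points.
rng = np.random.default_rng(0)
price = rng.normal(20, 5, size=1000)
points = rng.normal(87.5, 2, size=1000)

# Count how many observations fall in each cell of a 10 x 10 grid.
counts, price_edges, point_edges = np.histogram2d(price, points, bins=10)

# Locate the densest cell; a hexbin plot would give it the strongest color.
i, j = np.unravel_index(counts.argmax(), counts.shape)
print("densest price range: ", price_edges[i], "to", price_edges[i + 1])
print("densest points range:", point_edges[j], "to", point_edges[j + 1])
```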

In [None]:
# Converting the data to 2 dimensions and counts. This is a standard format to
# many multivariate plot functions in pandas.

filtered = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]
wine_counts = filtered.groupby(['points','variety']).country.count().unstack()
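
The reshaping step above is easier to follow on a tiny frame. The sketch below uses a made-up six-row table with the same column names, so the long and wide results can be read at a glance.

```python
import pandas as pd

# Toy data with the same columns as the cell above (values are invented).
toy = pd.DataFrame({
    "points":  [87, 87, 87, 88, 88, 90],
    "variety": ["Pinot Noir", "Chardonnay", "Pinot Noir",
                "Chardonnay", "Chardonnay", "Pinot Noir"],
    "country": ["US", "France", "US", "US", "Italy", "France"],
})

# Long format: one count per (points, variety) pair.
long_counts = toy.groupby(["points", "variety"]).country.count()

# Wide format: points as rows, one column per variety, counts as cells.
wide_counts = long_counts.unstack()
print(wide_counts)
```

Combinations that never occur (for example Pinot Noir with 88 points here) become NaN in the wide table.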

### Stacked Bar chart


Limitations of stacked charts - 

The first limitation is that the second variable in a stacked plot must be a variable with a very limited number of possible values (probably an ordinal categorical, as here). Five different types of wine is a good number because it keeps the result interpretable; eight is sometimes mentioned as a suggested upper bound. Many dataset fields will not fit this criterion naturally, so you have to "make do", as here, by selecting a group of interest.

The second limitation is one of interpretability. As easy as they are to make, and as pretty as they look, stacked plots make it really hard to distinguish concrete values. For example, looking at the plot below, can you tell which wine got a score of 87 more often: Red Blends (in purple), Pinot Noir (in red), or Chardonnay (in green)? It's actually really hard to tell!

In [None]:
wine_counts.plot.bar(stacked=True)

### Area plot

In [None]:
wine_counts.plot.area()

### Bivariate Line Chart

In [None]:
wine_counts.plot.line()

## Styling your plots

In [None]:
# regular bar chart

reviews.points.value_counts().sort_index().plot.bar()

In [None]:
# figsize for overall plot size

reviews.points.value_counts().sort_index().plot.bar(figsize = (12,6))

# figsize takes the image size in inches, it takes (width,height) values.

In [None]:
# color and legend font size



reviews.points.value_counts().sort_index().plot.bar(figsize = (12,6),
                                                    color='mediumvioletred',
                                                    fontsize = 16)

In [None]:
# Add a title

reviews.points.value_counts().sort_index().plot.bar(figsize = (12,6),
                                                    color='mediumvioletred',
                                                    fontsize = 16,
                                                   title = "No. of reviews for each score")

In [None]:
# Pandas plot functions and their parameters are built on matplotlib and act as 
# an abstraction layer. A plot can also be modified using matplotlib directly.


myPlot = reviews.points.value_counts().sort_index().plot.bar(figsize = (12,6),
                                                    color='mediumvioletred',
                                                    fontsize = 16)

myPlot.set_title('No. of reviews for each score', fontsize = 16)

# This is useful since pandas does not expose all the customization
# functionality that matplotlib provides. For example, with pandas alone we cannot 
# set the font size of the title.

In [None]:
# seaborn works along with these libraries

myPlot = reviews.points.value_counts().sort_index().plot.bar(figsize = (12,6),
                                                    color='mediumvioletred',
                                                    fontsize = 16)

myPlot.set_title('No. of reviews for each score', fontsize = 16)
sns.despine(bottom=True, left=True)

## SubPlots

Subplots are used to combine information about multiple related things into one figure. 

When pandas works with matplotlib to create a figure, it follows these steps:
- Make a matplotlib Figure object.
- Make a matplotlib AxesSubplot object and assign it to the Figure.
- Use AxesSubplot methods to draw the plot.
- Return the result to the user.

We use a similar flow below to create subplots.
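
The four steps can be sketched directly with matplotlib's object-oriented API. This is a simplified illustration of the flow, not the code pandas actually runs, and the counts are made up.

```python
from matplotlib.figure import Figure
import pandas as pd

scores = pd.Series([3, 7, 12, 5], index=[85, 86, 87, 88])  # made-up counts

fig = Figure(figsize=(6, 3))           # 1. make the Figure
ax = fig.add_subplot(1, 1, 1)          # 2. make an Axes attached to the Figure
ax.bar(scores.index, scores.values)    # 3. draw using Axes methods
ax.set_title("Reviews per score")
result = ax                            # 4. hand the Axes back to the caller
```

Having the Axes object in hand is what makes later tweaks such as `set_title` possible, and it is also what the `ax` parameter of the pandas plotting methods expects.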

In [None]:
fig, axrr = plt.subplots(2,1, figsize = (12,8))

# The subplots method takes the number of rows and columns as arguments.
# fig now holds the Figure object.
# axrr is an array of AxesSubplot objects, one per subplot.


In [None]:
# axrr is an array of both axes subplot objects

axrr

In [None]:
# To tell pandas where to plot, we pass the ax parameter, which
# takes an AxesSubplot object.
fig, axrr = plt.subplots(2,1, figsize = (12,8))
reviews.points.value_counts().sort_index().plot.bar(ax = axrr[0])
axrr[0].set_title('No. of reviews with given points')

In [None]:
# Each subplot is addressed as axarr[row][col], with the top-left subplot as the origin.

fig, axarr = plt.subplots(2,2, figsize = (12,10))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0], fontsize=12, color='mediumvioletred'
)
axarr[0][0].set_title("Wine Scores", fontsize=18)

reviews['variety'].value_counts().head(20).plot.bar(
    ax=axarr[1][0], fontsize=12, color='mediumvioletred'
)
axarr[1][0].set_title("Wine Varieties", fontsize=18)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1], fontsize=12, color='mediumvioletred'
)
axarr[1][1].set_title("Wine Origins", fontsize=18)

# Histogram of the prices themselves (not of their value counts)
reviews['price'].plot.hist(
    ax=axarr[0][1], fontsize=12, color='mediumvioletred'
)
axarr[0][1].set_title("Wine Prices", fontsize=18)

plt.subplots_adjust(hspace=0.3) # Gap between rows

sns.despine()

## Plotting with Seaborn

Seaborn provides the following types of charts:

- Count plot (basically a bar plot)
  - Good for nominal categorical and small ordinal categorical data
- KDE plot (a smoothed line plot, or a contour plot in 2D)
  - Good for interval data but, unlike the line plot, very bad for ordinal categorical data
- Joint plot (basically a scatter or hex plot)
  - Good for interval data and some nominal categorical data
- Violin plot
  - Good for interval and nominal categorical data

### Count Plot

In [None]:
# The pandas bar plot corresponds to the count plot in seaborn.

sns.countplot(x=reviews.points)

# Note that you do not need to pass value counts to the plot; seaborn
# computes them for you. Very intuitive.

### KDE plot

KDE is short for Kernel Density Estimate. It is based on a statistical technique of the same name that smooths the data. A KDE plot addresses an important problem with the line plot: it smooths out outliers, so there are no sudden jumps in the curve.

However, this smoothing makes it a poor choice for ordinal categorical variables, even though a line plot can be used for them. To smooth the curve, a KDE plot assigns density to intermediate values, which makes no sense for categorical variables.
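
A hand-rolled Gaussian KDE shows where the intermediate values come from. This is a minimal sketch with a fixed, arbitrarily chosen bandwidth and invented scores; seaborn chooses the bandwidth for you and may differ in details.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

scores = np.array([85, 85, 86, 88, 88, 88, 90])  # made-up integer scores
grid = np.linspace(75, 100, 501)
density = gaussian_kde_1d(scores, grid, bandwidth=1.0)

# The curve is a proper density: the area under it is close to 1.
area = density.sum() * (grid[1] - grid[0])

# No wine scored 87, yet the smoothed curve assigns it a clearly nonzero density.
density_at_87 = gaussian_kde_1d(scores, [87.0], bandwidth=1.0)[0]
print(area, density_at_87)
```

The nonzero density at 87 is harmless for interval data such as price, but it is exactly the invented "in-between" value that is meaningless for a categorical variable.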

In [None]:
sns.kdeplot(reviews.query('price < 200').price)

In [None]:
# alternative line plot :

reviews[reviews.price < 200].price.value_counts().sort_index().plot.line()

# Compared with this jagged line chart, the KDE plot shows the underlying shape of the data more clearly.

In [None]:
# Kde plot can also be plotted for 2d data

# Note that bivariate KDE plots are very computationally expensive. 
# This is the reason why we sample 5000 points here.
# sns.kdeplot(reviews[reviews['price'] < 200].loc[:, ['price', 'points']].dropna().sample(5000))

# Better way to do the bivariate plot (x and y are passed by keyword,
# which recent seaborn versions require)
tempFrame = reviews[reviews.price < 200].dropna().sample(5000)
sns.kdeplot(x=tempFrame.price, y=tempFrame.points)

### Dist Plot

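Dist plot is essentially the seaborn counterpart of the pandas histogram. Note that distplot has been deprecated since seaborn 0.11 in favor of histplot and displot.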

In [None]:
sns.distplot(reviews['points'], bins=10, kde=False)

### Scatter plot and Hex Plot

Seaborn equivalent of these two plots is jointplot.

In [None]:
sns.jointplot(x='price', y='points', data = reviews[reviews.price < 100])

In [None]:
sns.jointplot(x='price', y='points', data = reviews[reviews.price < 100], kind='hex', gridsize=20)

### Box Plot and Violin Plot

Suppose we want to know the spread of values for some variable in the dataset, for example the spread of points for the top 5 wine varieties.
A box plot or violin plot can be very useful in such situations.

These plots do not provide any information about individual values, but they do give interesting insights into the spread of the data.

These plots work well only with interval data and with nominal data that has a very large number of possible values.
They also expect the data to be roughly normally distributed; otherwise the various markers in a box plot will not make much sense.
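
The markers of a box plot are just summary statistics, which can be computed by hand. The sketch below uses invented values and assumes the common 1.5 x IQR rule for the whiskers.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])  # made-up data with one outlier

# The box spans the first to third quartile, with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker limits: points beyond them are drawn individually.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)   # 3.25 5.5 7.75
print(outliers)         # [100]
```

The box plot reports only these few numbers, which is why it says nothing about individual values and can mislead when the distribution is far from normal.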


In [None]:
myData = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]
myData.variety.value_counts()

In [None]:
sns.boxplot(x = 'variety', y = 'points', data = myData)

In [None]:
# A violin plot shows the same data that a boxplot does, but it replaces the box with a kernel density estimate of the data.
myData = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]
sns.violinplot(x = 'variety', y= 'points', data = myData)

### Exercise with Pokemons

In [None]:
pokemons = pd.read_csv('../input/pokemon/Pokemon.csv', index_col = 0)
pokemons.head()

In [None]:
sns.countplot(x=pokemons.Generation)

In [None]:
sns.distplot(pokemons.HP, kde = True)

In [None]:
sns.jointplot(x = 'Attack', y = 'Defense', data = pokemons)

In [None]:
sns.jointplot(x = 'Attack', y = 'Defense', data = pokemons, kind = 'hex', gridsize = 20)

In [None]:
sns.kdeplot(x=pokemons['HP'], y=pokemons['Attack'])

In [None]:
sns.boxplot(x = 'Legendary', y= 'Attack', data = pokemons)

In [None]:
sns.violinplot(x = 'Legendary', y= 'Attack', data = pokemons)

## Faceting With Seaborn

There are two important tools here:

- FacetGrid
    - Good for data with at least two categorical variables
- pairplot()
    - Good for exploring most kinds of data



In [None]:
# Loading footballers stats

df = pd.read_csv('../input/fifa-18-demo-player-dataset/CompleteDataset.csv', index_col=0)

footballers = df.copy()
footballers['Unit'] = df['Value'].str[-1]
footballers['Value (M)'] = np.where(footballers['Unit'] == '0', 0, 
                                    footballers['Value'].str[1:-1].replace(r'[a-zA-Z]',''))
footballers['Value (M)'] = footballers['Value (M)'].astype(float)
footballers['Value (M)'] = np.where(footballers['Unit'] == 'M', 
                                    footballers['Value (M)'], 
                                    footballers['Value (M)']/1000)
footballers = footballers.assign(Value=footballers['Value (M)'],
                                 Position=footballers['Preferred Positions'].str.split().str[0])
footballers.head()

### Facet Grid

Suppose we have a plot for all players (for example, the KDE plot of overall score below) and want the same plot once per player position.
With plain matplotlib, we might have to group by position, create subplots, and then plot each group.

Seaborn makes this very easy using faceting.

Two variables are involved in making such a plot. The first is the variable on the x axis that we want to plot (for example, with kdeplot).
The second is the categorical variable by which we want to split the facets.

In short, the use case is: draw our usual plot for some variable on the x axis, but split it into multiple facets based on a categorical variable. This is helpful surprisingly often.

The core seaborn utility for faceting is the FacetGrid. A FacetGrid is an object which stores some information on how you want to break up your data visualization.

For example, suppose that we're interested in (as in the previous notebook) comparing strikers and goalkeepers in some way. To do this, we can create a FacetGrid with our data, telling it that we want to break the Position variable down by col (column).
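
For comparison, the manual matplotlib route mentioned above (group, create subplots, plot each group) might look like the following sketch. The player data is invented, and a histogram stands in for the KDE plot to keep the example dependency-free.

```python
from matplotlib.figure import Figure
import pandas as pd

# Toy player table (invented values) with the two columns we facet on.
players = pd.DataFrame({
    "Position": ["ST", "ST", "ST", "GK", "GK", "GK"],
    "Overall":  [82, 77, 90, 70, 85, 79],
})

# Step 1: split the data by the faceting variable.
groups = list(players.groupby("Position"))

# Step 2: one subplot per group, laid out in a single row.
fig = Figure(figsize=(4 * len(groups), 3))
axes = fig.subplots(1, len(groups), sharey=True, squeeze=False)[0]

# Step 3: draw the same kind of plot in every facet.
for ax, (position, group) in zip(axes, groups):
    ax.hist(group["Overall"], bins=5)
    ax.set_title(f"Position = {position}")
```

FacetGrid wraps these three steps into the constructor and a single `map` call.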

In [None]:
# Suppose we want to get a kde plot for overall score. This will basically show us the count (actually KDE of probability mass function) of players getting a particular score.

sns.kdeplot(footballers.Overall)

In [None]:
# in this example we are gonna plot multiple facets for overall variable split based on position of the player.

# Let us take only two positions for now

df = footballers[footballers.Position.isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col = 'Position')

# This makes a FacetGrid object which keeps blank facets ready for any particular dataframe to plot any x axis variable.


In [None]:
# Now we use FacetGrid map method to map facetGrid object with plotting function and x axis variable.
df = footballers[footballers.Position.isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col = 'Position')
g.map(sns.kdeplot, 'Overall')

In [None]:
# That's super simple and super useful! We can plot all the positions as well, with wrap on.

g = sns.FacetGrid(footballers, col = 'Position', col_wrap=6)
g.map(sns.kdeplot, 'Overall')

In [None]:
# FacetGrid also lets us split the facets by combinations of two
# categorical variables. These variables are passed as the row and col parameters when
# creating the FacetGrid object.
df = footballers[footballers.Position.isin(['ST', 'GK'])]
df = df[df.Club.isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]

# We can specify row and column order as well.

g = sns.FacetGrid(df, row = 'Position', col = 'Club', row_order=['GK', 'ST'],
                  col_order=['Atlético Madrid', 'FC Barcelona', 'Real Madrid CF'])
g.map(sns.violinplot, 'Overall')

# Only problem with faceting is that in order to avoid plots becoming too small, we can split facets
# only for about 2 categorical variables with limited no. of categories each.

### Pairplot

Pairplot is generated by taking as input some columns as a dataframe, and seaborn outputs subplots for each combination of those columns. Since there are 2 axes to choose, if we provide n columns -> we get n\*n subplots.

In [None]:
sns.pairplot(footballers.loc[:, ['Overall', 'Potential', 'Value']])

# On the diagonal, each variable would be plotted against itself, so a histogram is shown instead.
# Every other cell holds a scatter plot.

# Pairplot is often the first tool data scientists reach for when exploring a dataset.

### Exercises

In [None]:
g = sns.FacetGrid(pokemons, row = 'Legendary')
g.map(sns.kdeplot, 'Attack')

In [None]:
g = sns.FacetGrid(pokemons, row = 'Generation', col = 'Legendary')
g.map(sns.kdeplot, 'Attack')

In [None]:
sns.pairplot(pokemons.loc[:, ['Defense', 'Attack', 'HP']])

## Multivariate Plotting (>2)

Following are the options:

- Multivariate Scatter Plot
- Grouped Box Plot
- Heat Map
- Parallel Coordinates


We still plot on 2D axes, but we make use of visual variables. Visual variables are attributes
that can be layered onto the same plot and still be told apart, e.g. size, color etc.

In [None]:
footballers.head()

### Multivariate Scatter Plot

We can use seaborn's lmplot along with a third variable encoded as color or as marker shape (not recommended).

In [None]:
# Suppose we want to see how different kinds of offensive players are paid [Value] 
# relative to their overall rating [Overall]

# x, y and data are passed by keyword, which recent seaborn versions require.
sns.lmplot(x='Value', y='Overall',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False, hue='Position')

In [None]:
# We can also use different markers to distinguish the groups

sns.lmplot(x='Value', y='Overall',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False, hue='Position', markers=['*', 'x', 'o'])

### Grouped Box Plot

In [None]:
# Suppose we want to see how goalkeepers are scored on aggression compared to strikers.
# We also want this information for players with different overall scores.

f = (footballers
        .loc[ footballers.Position.isin(['GK', 'ST'])]
        .loc[:, ['Overall', 'Aggression', 'Position']])
f = f.loc[f.Overall >= 80]
f = f.loc[f.Overall < 85]
f['Aggression'] = f['Aggression'].astype(float)
sns.boxplot(x = 'Overall', y = 'Aggression', hue = 'Position', data = f)

# This is using basically another visual variable "Grouping"

## Summarization Techniques

### Heatmaps

The most heavily used summarization technique is the correlation plot, which calculates the correlation
for every pair of columns and shows the result as color.
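
The table behind such a plot comes from `DataFrame.corr`. The sketch below builds synthetic columns whose relationships are known in advance, so the resulting matrix is easy to sanity-check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)

# Synthetic columns with known relationships to x.
stats = pd.DataFrame({
    "x": x,
    "double_x": 2 * x,                 # perfectly correlated with x
    "minus_x": -x,                     # perfectly anti-correlated with x
    "noise": rng.normal(size=200),     # unrelated to x
})

# Pairwise Pearson correlation: a symmetric 4 x 4 table with 1s on the diagonal.
corr = stats.corr()
print(corr.round(2))
```

Passing `corr` to `sns.heatmap` would turn each cell of this table into a colored square.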

In [None]:
# Let's find correlation between certain columns

f = (footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
                .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
                .dropna()).corr()
sns.heatmap(f, annot = True)

### Parallel Coordinates Plot

Parallel coordinate plots are good for seeing whether two classes in the data are clearly distinguishable
or not. We choose certain columns to place on the x axis; these are the variables we want to judge our data on.

The plot then takes the categorical variable (class) we want to distinguish and draws one line for each record in the dataset.

In [None]:
from pandas.plotting import parallel_coordinates

f = (
    footballers.iloc[:, 12:17]
        .loc[footballers['Position'].isin(['ST', 'GK'])]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
)
f['Position'] = footballers['Position']
f = f.sample(200)

parallel_coordinates(f, 'Position')

### Exercise

In [None]:
sns.lmplot(x='Attack', y='Defense', data=pokemons, hue='Legendary', markers=['x', 'o'], fit_reg=False)

In [None]:
sns.boxplot(x = 'Generation', y = 'Total', hue = 'Legendary', data = pokemons)

In [None]:
pokemons.head()
sns.heatmap(pokemons.loc[:, ['HP', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'Speed']].corr(), annot = True)

In [None]:
from pandas.plotting import parallel_coordinates

p = (pokemons.loc[pokemons['Type 1'].isin(['Fighting', 'Psychic'])]
     .loc[:,['Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'Type 1']])

parallel_coordinates(p, 'Type 1', color = ['Green', 'orange'])

## Introduction to Plotly

Plotly is an open source library for interactive graphs and animations.

Plotly has an online and an offline mode. The offline mode injects the library source directly into the notebook; the online mode does not, and the processing happens online.

Following are the types of graphs: 
- Scatter plot
- Choropleth
- Heatmap

In [None]:
reviews.head()

In [None]:
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected = True)
import plotly.graph_objs as go

### Scatter plot

In [None]:
# graph_objs go makes the graph object while iplot composes these objects and generates the plot.
# The only issue with plotly like interactive libraries is the amount of data it can plot,
# since they are very resource intensive.
iplot([go.Scatter(x=reviews.head(1000)['points'], y=reviews.head(1000)['price'], mode='markers')])

### Plotly 2 plots KDE + Scatter

In [None]:
# Kde plot + Scatter

# The graph_objs - iplot design also makes it easy to plot multiple plots on each other.

iplot([go.Histogram2dContour(x=reviews.head(500)['points'], 
                             y=reviews.head(500)['price'], 
                             contours=dict(coloring='heatmap')),
       go.Scatter(x=reviews.head(1000)['points'], y=reviews.head(1000)['price'], mode='markers')])

### Surface Plot

In [None]:
# Plotly surface is one of the best applications
df = reviews.assign(n=0).groupby(['points', 'price'])['n'].count().reset_index()
df = df[df["price"] < 100]
v = df.pivot(index='price', columns='points', values='n').fillna(0).values.tolist()
iplot([go.Surface(z=v)])

### Choropleths

In [None]:
# On Kaggle, plotly is mostly used to make choropleths. A choropleth is a kind of map in which every 
# region is colored according to some variable.
df = reviews['country'].replace("US", "United States").value_counts()

iplot([go.Choropleth(
    locationmode='country names',
    locations=df.index.values,
    text=df.index,
    z=df.values
)])


# It is important to decide when to use plotly and when not to. While plotly is extremely attractive,
# it has less documentation and is sometimes overly complex. It is rarely more useful than
# an equivalent plot in pandas or matplotlib.

## Grammar of Graphics with Plotnine

The Grammar of Graphics is an important concept in data visualization. Usually, when we create a plot, we follow these steps:
- Create the figure
- Adjust the geometry of the figure
- Adjust the aesthetics of the figure

This can be really confusing from a developer's point of view: it is difficult to know when to use an object and when to use a particular function.

The grammar of graphics breaks this flow into multiple parts, each handled by a function. As we add more functions with the simple + operator, increasingly complex plots can be generated.
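
The `+` composition can be imitated in a few lines of plain Python. The `Plot` class below is a hypothetical toy written for this note; it only records parts as strings and is not how plotnine is implemented.

```python
class Plot:
    """Toy stand-in for a grammar-of-graphics plot: an immutable list of parts."""

    def __init__(self, data, parts=()):
        self.data = data
        self.parts = tuple(parts)

    def __add__(self, part):
        # "+" returns a new Plot with one more part; nothing is drawn yet.
        return Plot(self.data, self.parts + (part,))

    def describe(self):
        return " + ".join(["plot(data)"] + list(self.parts))


base = Plot(data=[(87, 20), (90, 45)]) + "aes(points, price)" + "geom_point"
with_trend = base + "stat_smooth"

print(base.describe())        # plot(data) + aes(points, price) + geom_point
print(with_trend.describe())  # plot(data) + aes(points, price) + geom_point + stat_smooth
```

Because every `+` returns a new object, adding or removing one part never forces a change to the rest of the expression, which is the property the plotnine cells below rely on.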

In [None]:
# Get the data ready

top_wines = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]
top_wines.head()

In [None]:
# simple scatter plot
from plotnine import *

df = top_wines.head(1000).dropna()

( ggplot(df)
    + aes('points', 'price')
    + geom_point() )

# as it can be seen, it's super simple! ggplot takes in the data, aes takes in the aesthetics related
# details including the axes information. Finally a function is added to see what kind of plot it is.


In [None]:
# it is also super easy to just add another plot to this, add a function!

( ggplot(df)
    + aes('points', 'price')
    + geom_point()
    + stat_smooth())


In [None]:
# to add color, just add one more aes with color!

( ggplot(df)
    + aes('points', 'price')
     + aes(color = 'points')
    + geom_point()
    + stat_smooth())


In [None]:
# To add faceting to this, just add facet wrap!

( ggplot(df)
    + aes('points', 'price')
     + aes(color = 'points')
    + geom_point()
    + stat_smooth()
    + facet_wrap('~variety')
)


# If we wanted to add or remove faceting in seaborn or matplotlib, we would have to change the
# entire structure of the code! With grammar-of-graphics based libraries, it's super easy.

In [None]:
# Bar plot

( ggplot(df)
+ aes('points')
 + geom_bar()
)

In [None]:
# 2D bin plot (rectangular bins; plays the role of the hex plot)

( ggplot(top_wines)
 + aes('points', 'variety')
 + geom_bin2d(bins = 20)
)

In [None]:
# Non geometric functions can be mixed to make changes

(ggplot(top_wines)
         + aes('points', 'variety')
         + geom_bin2d(bins=20)
         + coord_fixed(ratio=1)
         + ggtitle("Top Five Most Common Wine Variety Points Awarded")
)

## Time Series Plotting

Most visualizations deal with interval, nominal categorical, or ordinal categorical variables. Another class of variables, time series variables, needs special treatment.

Time series variables are values indexed by time. Time is linear and infinitely fine grained, so it can be considered a special case of an interval variable.

There are two ways in which time information may be included in the data. 
In the 'Strong' case (for example, stock prices), the data is collected by date, and the dates are integral to it: changing any date changes the meaning of the data.

In the 'Weak' case (for example, shelter outcomes), dates act merely as descriptions of records. They are less intertwined with the data than in the 'Strong' case.

Time is generally measured in periods, which can be of varying length. Pandas has a built-in data type for this called Period.
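
A short sketch of the Period type, using arbitrary dates chosen for illustration:

```python
import pandas as pd

# A Period represents a whole span of time, here the month of March 2014.
march = pd.Period("2014-03", freq="M")
print(march.start_time)   # first instant of the month
print(march.end_time)     # last instant of the month
print(march + 1)          # arithmetic moves in whole periods: 2014-04

# Timestamps can be mapped onto the period that contains them.
stamps = pd.Series(pd.to_datetime(["2014-03-05", "2014-03-28", "2014-04-02"]))
months = stamps.dt.to_period("M")
print(months.tolist())
```

The first two timestamps land in the same monthly period, which is the kind of bucketing that resampling performs later in this section.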

In [None]:
pd.set_option('display.max_columns', None)

stocks = pd.read_csv("../input/nyse/prices.csv", parse_dates=['date'])
stocks = stocks[stocks['symbol'] == "GOOG"].set_index('date')
stocks.head()

In [None]:
shelter_outcomes = pd.read_csv(
    "../input/austin-animal-center-shelter-outcomes-and/aac_shelter_outcomes.csv", 
    parse_dates=['date_of_birth', 'datetime']
)
shelter_outcomes = shelter_outcomes[
    ['outcome_type', 'age_upon_outcome', 'datetime', 'animal_type', 'breed', 
     'color', 'sex_upon_outcome', 'date_of_birth']
]
shelter_outcomes.head()

### Line plots, Bar plots and Resampling

In [None]:
# Line plot across time

shelter_outcomes.date_of_birth.value_counts().sort_index().plot.line()

In [None]:
# In above plot it looks like data peaked around 2014, but we can't say for sure since data is rather noisy.
# To solve this problem, we resample the data so that it is yearly rather than daily.

# In pandas, resampling works in similar way like groupby.

shelter_outcomes.date_of_birth.value_counts().resample('Y').sum().plot.line()

# Pandas finds the x axis labels automatically; it is date-time aware.

In [None]:
stocks.volume.resample('Y').mean().plot.bar()
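
The similarity between resampling and groupby can be checked on a tiny series. In this sketch the daily counts are invented, and the `'YS'` (year start) alias is used in place of `'Y'` only because it is accepted across pandas versions; the yearly buckets are the same.

```python
import pandas as pd

# Six made-up daily counts straddling a year boundary.
days = pd.date_range("2013-12-30", periods=6, freq="D")
births = pd.Series([2, 1, 4, 3, 5, 1], index=days)

# Resampling buckets the rows by calendar year and aggregates each bucket...
yearly = births.resample("YS").sum()

# ...which matches grouping by the year of the index by hand.
by_year = births.groupby(births.index.year).sum()

print(yearly)
print(by_year)
```

The two results hold the same totals; the difference is that resampling keeps a date-time index, so plots stay date-aware.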

### Lag Plot

A lag plot compares each record with the record that precedes it. This is especially useful for determining the periodicity of time series data, which is often periodic: for example, bars see more customers on Fridays, and matches draw bigger crowds on Sundays.

Note that this plot makes sense only for 'Strong' case time series data.
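
The points of a lag plot are simply pairs of consecutive values. The sketch below builds those pairs by hand for a made-up series, assuming a lag of one step.

```python
import pandas as pd

volume = pd.Series([10, 12, 11, 15, 14, 18, 17, 21])  # invented daily volumes

# Pair every value with the one that follows it; the last value has no partner.
pairs = pd.DataFrame({"y(t)": volume, "y(t + 1)": volume.shift(-1)}).dropna()
print(pairs)

# The correlation of these pairs is the lag-1 autocorrelation of the series.
lag1_corr = pairs["y(t)"].corr(pairs["y(t + 1)"])
print(lag1_corr, volume.autocorr(lag=1))
```

Scattering the two columns of `pairs` against each other reproduces what `lag_plot` draws; a tight diagonal cloud corresponds to a high correlation.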

In [None]:
from pandas.plotting import lag_plot

lag_plot(stocks.volume.tail(250))

# The plot shows that two consecutive days of trading may be highly correlated: high volume on one 
# day corresponds to higher volume on the next day. However, this phenomenon seems 
# to be more prevalent for lower first-day values. 

### AutoCorrelation Plot

An autocorrelation plot takes the above concept even further. It plots the correlation for different lags, which makes it easy to determine periodicity in the data.
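
To see how lags reveal periodicity, the sketch below computes the correlation at several lags for an artificial series that repeats every seven steps. It uses `Series.autocorr`, which is not necessarily the exact estimator that `autocorrelation_plot` draws, so treat it as an illustration of the idea.

```python
import numpy as np
import pandas as pd

weekly_pattern = [5, 6, 5, 7, 20, 9, 4]          # made-up values with a spike on day 5
series = pd.Series(np.tile(weekly_pattern, 8))   # eight identical "weeks"

# Correlation of the series with itself shifted by 1 to 13 steps.
autocorr_by_lag = {lag: series.autocorr(lag=lag) for lag in range(1, 14)}

# The lag with the highest correlation reveals the period.
best_lag = max(autocorr_by_lag, key=autocorr_by_lag.get)
print(best_lag)
```

Real data such as trading volume is noisier, so the peaks in an autocorrelation plot are lower and broader than in this idealized case.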

In [None]:
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(stocks.volume)