<a href="https://colab.research.google.com/github/tonytan4ever/visualization-for-fun/blob/main/visualization_notebooks/Pandas%2BPlotting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Visualization Using Pandas

# Where is my data?

In [3]:
import numpy as np
import pandas as pd
import seaborn as sns

## 1. Importing a .csv File

In [4]:
mc12 = pd.read_csv('/content/Two+Machines.csv')

## 2. Datasets Available in the Seaborn Library

In [5]:
sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

In [6]:
sns.load_dataset('tips')

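| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |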


## 3. Datasets in Pydataset

In [7]:
!pip install pydataset

Collecting pydataset
  Downloading pydataset-0.2.0.tar.gz (15.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pydataset
  Building wheel for pydataset (setup.py) ... done
  Created wheel for pydataset: filename=pydataset-0.2.0-py3-none-any.whl size=15939415 sha256=5b1112b527f8ab2fc76c25979ad22866757396ba4d0bde8ecd6020675d28027b
  Stored in directory: /root/.cache/pip/wheels/29/93/3f/af54c413cecaac292940342c61882d2a8848674175d0bb0889
Successfully built pydataset
Installing collected packages: pydataset
Successfully installed pydataset-0.2.0


In [None]:
from pydataset import data

In [None]:
import pandas as pd
pd.set_option('display.max_rows', 10)  # Change the 10 to `None` to see everything
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [None]:
data()

### a. A Diamond Is Forever

In [None]:
# You can look up a dataset by name. If you do not know the exact name, try something close.
data('dimond')

In [None]:
diamond = data('Diamond')

In [None]:
diamond.head()

### b. Let's Solve the Housing Problem

In [None]:
data('Housing', show_doc=True)

In [None]:
housing = data('Housing')
housing.head()

### c. Let the river flow (Nile)

In [None]:
data('Nile')

### d. Don't be a Chicken

In [None]:
data('chickwts')

### e. The Number of Breaks in Yarn during Weaving

In [None]:
data('warpbreaks')

### f. How to Survive a Crash Like the Titanic?

In [None]:
data('titanic')

# Some Anatomy First

Before we get into the various APIs available to you, we need to go over the anatomy of a plot. You'll hear this terminology a lot in the rest of this course, and learning it now will make everything that follows substantially easier.

![Anatomy of a figure](https://matplotlib.org/3.3.3/_images/sphx_glr_anatomy_001.png)

source: https://matplotlib.org/3.3.3/gallery/showcase/anatomy.html

Here are some of the major terms you should know about:

* **Figure**: This is *the whole figure*. It contains everything you see in a plot. In the picture above, the whole thing is "the figure".
* **Axes**: This is what you usually think of as "the plot". Everything in the picture above belongs to a single "axes" object, which can make it hard to tell an "axes" apart from a figure. That is only because the figure above happens to contain one "axes" object. The figure below contains 4 "axes" objects arranged in a 2x2 grid, but the whole thing is still just one figure:

![4 axes objects](https://matplotlib.org/3.1.1/_images/sphx_glr_usage_002.png)

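* **Axis**: An "axes" object contains two (or three in the case of 3D) "axis" objects. The two terms are spelled almost identically, which often leads to confusion. In everyday English "axes" is simply the plural of "axis", but in Matplotlib a single "axes" object means one whole plot area, and its "axis" objects are the individual x- and y-axes (with their ticks and labels) inside it.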
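
To make the Figure / Axes / Axis distinction concrete, here is a minimal sketch that uses Matplotlib directly (it is covered properly in the next section). The variable names are my own and only there for illustration.

```python
import matplotlib.pyplot as plt

# One Figure that holds a 2x2 grid of Axes objects
fig, axes = plt.subplots(nrows=2, ncols=2)

print(axes.shape)     # (2, 2): four separate plots
print(len(fig.axes))  # 4: the single Figure keeps track of all of them

# Every Axes owns its own pair of Axis objects
top_left = axes[0, 0]
print(type(top_left.xaxis).__name__)  # XAxis
print(type(top_left.yaxis).__name__)  # YAxis
```

So the nesting goes Figure, then Axes, then Axis, from the outside in.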

> Go over the rest of the terms in the lecture. They are not as confusing.

# Using Pandas for plotting

If you're doing any reasonably complex data manipulation, chances are that you're already using Pandas. It just so happens that Pandas already includes a very easy to use plotting API that can be used to do many of the common tasks that you might encounter.

See [pandas.DataFrame.plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) for more details.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

For this section, we'll mostly use the housing data to learn about basic plotting.

In [None]:
housing = data('Housing')

In [None]:
housing.head()

Take a look at the table above and familiarize yourself with the data in there.

In [None]:
# Start with the simplest possible plot
housing.plot()

The call above passes no arguments, so Pandas has to figure everything out on its own. In this case things happen to work out, but that is certainly not guaranteed.

The point of the plot above is to show how simple the API is. In real life, you'll likely need to be more explicit about the details of the plot.

When you don't specify anything, Pandas chooses the index of your dataframe as the X-axis. In our case, that's the increasing range of integers.

The default kind of plot is called a "line plot", which this is. If you don't specify anything, Pandas will automatically plot every "numerical" column, which is what happened here.
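
You can check these defaults on a tiny DataFrame. The sketch below uses made-up values and hypothetical column names that only mimic the housing table. It shows that the text column is skipped and that the index ends up on the X-axis.

```python
import pandas as pd

# Toy data: two numeric columns and one text column (values are made up)
df = pd.DataFrame(
    {
        "price": [42000.0, 38500.0, 49500.0],
        "lotsize": [5850, 4000, 3060],
        "driveway": ["yes", "yes", "no"],
    },
    index=[1, 2, 3],
)

ax = df.plot()  # no arguments: line plot of every numeric column

# One line per numeric column; the text column is left out
plotted = [line.get_label() for line in ax.get_lines()]
print(plotted)

# The x-values of each line come from the DataFrame index
x_values = list(ax.get_lines()[0].get_xdata())
print(x_values)
```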

Let's simplify things to start things off. We'll only plot the price for now and add other plots later.

In [None]:
# Passing a dataframe column name as the `y` argument chooses which column to plot
housing.plot(y='price')

Note all that text above the plot. That's because Jupyter notebooks print the return value of the last line of the cell, before doing any fancy stuff like showing the plot. To turn that off, end the last line of the cell with a semicolon (`;`).

In [None]:
housing.plot(y='price');

Note that the plot isn't very useful if it's missing important stuff like a title and labels.

In [None]:
housing.plot(
    y='price',
    title='Housing Prices',
    xlabel='House index',
    ylabel='Price',
);

In [None]:
# You can make the plot larger using figsize. The argument to figsize is a tuple (width, height) in inches
housing.plot(
    y='price',
    title='Housing Prices',
    xlabel='House index',
    ylabel='Price',
    legend=False,
    figsize=(20, 10),
);

We talked earlier about plotting multiple columns from the dataframe simultaneously. Let's do that now and plot both price and lot size together.

In [None]:
# The `y` parameter can also be given a list of column names
housing.plot(
    y=['price', 'lotsize'],
    title='Housing Prices',
    xlabel='House Index',
    figsize=(20, 10),
);

Note that I've removed the label on the Y axis. That's because we're plotting 2 quantities simultaneously that are completely different and have different scales and units.

In fact, it might be better to allocate 2 separate subplots for both of these. If you recall, a `figure` can contain multiple `axes` objects. That's exactly what we're going to do here.


In [None]:
housing.plot(
    y=['price', 'lotsize'],
    title='Housing Prices and Lot Size',
    xlabel='House Index',
    subplots=True,
);
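
With `subplots=True`, the call hands back an array of Axes (one per column) instead of a single Axes. That gives you a way to label each subplot separately. A minimal sketch on made-up data, assuming the default vertical layout:

```python
import pandas as pd

# Toy stand-in for the housing data; the numbers are made up
df = pd.DataFrame(
    {
        "price": [42000, 38500, 49500, 60500],
        "lotsize": [5850, 4000, 3060, 6650],
    }
)

# One Axes per column comes back as an array
axes = df.plot(y=["price", "lotsize"], subplots=True, xlabel="House index")
print(len(axes))  # 2

# Each subplot can now carry its own Y label
axes[0].set_ylabel("Price")
axes[1].set_ylabel("Lot size")
```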

Another thing you might want to do is to try changing the layout of these subplots. That can be done with the `layout` parameter, which is a tuple of the form `(rows, columns)`:

In [None]:
housing.plot(
    y=['price', 'lotsize'],
    title='Housing price vs. lot size',
    xlabel='House Index',
    subplots=True,
    layout=(1, 2)
);

Note that I removed the `figsize` option. That's to illustrate a situation that you might encounter from time to time when plotting subplots: labels can overlap other things and cause crowding. This situation can be fixed by using a matplotlib method called `tight_layout`.


In [None]:
housing.plot(
    y=['price', 'lotsize'],
    title='Housing price vs. lot size',
    xlabel='House index $\\rightarrow$',
    subplots=True,
    layout=(1, 2)
)
plt.tight_layout()

Let's go back to using a larger figure and not needing `tight_layout`. Notice that the scales of the Y-values are completely different. That's misleading because it makes it look like both subplots show comparable values, when they don't.

We can share the Y-axis for both the plots to essentially match the levels. The parameter needed for this is `sharey`.

In [None]:
housing.plot(
    y=['price', 'lotsize'],
    title='Housing Price and Lot Size',
    xlabel='House index',
    subplots=True,
    layout=(1, 2),
    sharey=True,
    figsize=(20, 10),  # back to the larger figure, as described above
);
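
A quick way to confirm what `sharey=True` does is to compare the y-limits of the two subplots. This sketch uses a made-up two-column DataFrame rather than the real housing data.

```python
import pandas as pd

# Toy stand-in for the housing data; the numbers are made up
df = pd.DataFrame(
    {
        "price": [42000, 38500, 49500, 60500],
        "lotsize": [5850, 4000, 3060, 6650],
    }
)

axes = df.plot(subplots=True, layout=(1, 2), sharey=True)

# With a (1, 2) layout the Axes come back as a 1x2 array
left, right = axes[0, 0], axes[0, 1]

# Both panels report the same y-limits because the axis is shared
same_limits = left.get_ylim() == right.get_ylim()
print(same_limits)
```

Because both panels now use one scale, the much smaller lot-size values end up squeezed near the bottom of their panel. That is the honest picture when two quantities share an axis.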

## Line Plot for Nile flow

In [None]:
nile = data('Nile')

In [None]:
nile.head()

In [None]:
nile.plot();

In [None]:
nile.plot(
    x='time',
    y='Nile',
    xlabel='Year',
    ylabel='Flow',
    title='Flow of Nile River',
    legend=False,
);


# Bar plots

So far we've only created line plots using the Pandas plotting interface. But creating other types of plots is just as easy. Let's plot a bar plot for the number of bedrooms across the houses.

Before we get into plotting, let's do some basic data manipulation to create our data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

In [None]:
housing = data('Housing')
housing.head()

In [None]:
bedroom_counts = housing.groupby('bedrooms')['bedrooms'].count()
bedroom_counts

In [None]:
bedroom_counts.plot(
    kind='bar',
    title='Bedroom Count',
    xlabel='Number of Bedrooms',
    ylabel='Count',
);

That's it! It's that simple.
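
If the `groupby(...).count()` step feels roundabout, `value_counts()` followed by `sort_index()` should give the same numbers. Here is a small sketch on made-up bedroom counts:

```python
import pandas as pd

# Made-up bedroom counts for six houses
df = pd.DataFrame({"bedrooms": [3, 2, 3, 3, 2, 4]})

# The approach used above: group by the value, then count rows per group
by_group = df.groupby("bedrooms")["bedrooms"].count()

# Alternative: count each distinct value, then order by the value itself
by_value = df["bedrooms"].value_counts().sort_index()

print(by_group.tolist())  # [2, 3, 1]
print(by_value.tolist())  # [2, 3, 1]
```

Without `sort_index()`, `value_counts()` orders the result by frequency, so the bars would not run from the fewest to the most bedrooms.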

# Box and whisker plots

What if we wanted to know about house prices for various numbers of bedrooms? We can use a box-and-whisker plot just as easily. While there is a `kind='box'` option available with the usual `.plot(...)` method, the dedicated `.boxplot(...)` method on the dataframe is more convenient for grouped box plots.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

In [None]:
housing = data('Housing')
housing.head()

In [None]:
housing.plot(
    y='price',
    kind='box');

In [None]:
# With `by`, name the columns to draw via `column` (needs pandas 1.4 or newer)
housing.plot(
    column=['price', 'lotsize'],
    by='bedrooms',
    kind='box');

In [None]:
housing.boxplot(
    by='bedrooms',
    column='price',
    grid=False,
);
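
One thing you will notice with `.boxplot(by=...)` is that Pandas adds its own "Boxplot grouped by ..." heading on top of the figure. A sketch of one way to replace it, using made-up data and an Axes created up front so we have a handle on both the figure and the plot:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the housing data; the numbers are made up
df = pd.DataFrame(
    {
        "price": [42000, 38500, 49500, 60500, 61000, 66000],
        "bedrooms": [3, 2, 3, 3, 2, 3],
    }
)

fig, ax = plt.subplots()
df.boxplot(column="price", by="bedrooms", grid=False, ax=ax)

fig.suptitle("")  # drop the automatic "Boxplot grouped by ..." heading
ax.set_title("Price by number of bedrooms")
ax.set_xlabel("Bedrooms")
ax.set_ylabel("Price")
```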

# Histogram
## Plotting a histogram of prices

There's similar functionality available for plotting histograms. The first plot below uses the default of 10 bins. The second one passes `bins=20`, which splits the data into 20 bins instead.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

In [None]:
housing = data('Housing')
housing.head()

In [None]:
housing.plot(
    y='price',
    kind='hist');

In [None]:
housing.plot(
    y='price',
    kind='hist',
    bins=20,
);
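
To see what `bins` actually controls, you can compute the bins yourself with `numpy.histogram`, which performs the same kind of equal-width binning. The prices below are made up for illustration.

```python
import numpy as np

# Ten made-up house prices
prices = np.array(
    [25000, 38500, 42000, 49500, 60500, 61000, 66000, 82000, 120000, 190000]
)

# Split the range from the smallest to the largest price into 20 equal-width bins
counts, edges = np.histogram(prices, bins=20)

print(len(counts))                  # 20 bins
print(len(edges))                   # 21 edges (one more than the bins)
print(counts.sum() == len(prices))  # True: every price lands in exactly one bin
```

More bins give finer detail but noisier bars; fewer bins smooth things out at the cost of detail.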

# Plotting KDE or density plot

In [None]:
housing.plot(
    y='price',
    kind='kde',
);

In [None]:
housing.plot(
    y='price',
    kind='kde',
    xlim=(0, 250000),
);

# Scatter plot

If your objective is to look at the correlation between the lot-size and the price, a scatter plot should help a lot.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

In [None]:
housing = data('Housing')
housing.head()

In [None]:
housing.plot(
    x='lotsize',
    y='price',
    kind='scatter',
);
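
A scatter plot shows the relationship visually. If you also want a single number for it, `Series.corr` computes the Pearson correlation coefficient by default. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy stand-in for the housing data; the numbers are made up
df = pd.DataFrame(
    {
        "lotsize": [3060, 4000, 5850, 6360, 6650],
        "price": [49500, 38500, 42000, 61000, 60500],
    }
)

# Pearson correlation: +1 is a perfect increasing line, -1 a perfect decreasing one
r = df["lotsize"].corr(df["price"])
print(round(r, 2))
```

The number only summarizes the linear trend, so it is worth looking at the scatter plot as well.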

# Pie Charts

In general, pie charts are not a good way to communicate relative ratios. But they are quite common. And if you wanted to make one, Pandas makes that easy too.

Let's say you wanted to plot a pie chart for the number of stories.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pydataset import data

In [None]:
housing = data('Housing')
housing.head()

In [None]:
story_counts = housing['stories'].value_counts()
story_counts

In [None]:
story_counts.plot(
    kind='pie',
    title='Number of Stories'
);
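
Part of what makes pie charts hard to read is that the slices carry no numbers. One option is to print the percentage on each slice with Matplotlib's `autopct` argument, which Pandas passes through to the underlying pie call. The story counts below are made up.

```python
import pandas as pd

# Made-up number of stories for eight houses
stories = pd.Series([1, 2, 2, 1, 3, 2, 1, 2], name="stories")

counts = stories.value_counts().sort_index()
shares = stories.value_counts(normalize=True).sort_index()  # fractions that sum to 1
print(shares)

ax = counts.plot(
    kind="pie",
    autopct="%1.1f%%",  # write each slice's percentage on the chart
    title="Share of houses by number of stories",
)
ax.set_ylabel("")  # remove the series name that Pandas puts on the Y axis
```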

The wide array of plots you can generate using just this interface provided by Pandas is impressive. And given how easy it is to use, you should consider it one of your first options when deciding which data visualization tools to use.

Unfortunately, as you use this interface for your day-to-day data visualization needs, you will encounter its shortcomings. It turns out that this Pandas feature is built on top of another library called Matplotlib, which is the topic of the next section. You'll need to use this when you encounter situations where this Pandas interface alone cannot help you.
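
You can already see that connection in the return value: a Pandas plotting call hands back a Matplotlib Axes object, and anything Matplotlib can do to an Axes is available from there. A small sketch with made-up prices (the mean line is just an example of something `.plot(...)` has no argument for):

```python
import matplotlib.axes
import pandas as pd

# Made-up prices for four houses
df = pd.DataFrame({"price": [42000, 38500, 49500, 60500]})

ax = df.plot(y="price", legend=False)
print(isinstance(ax, matplotlib.axes.Axes))  # True

# From here on we are calling Matplotlib methods directly
ax.axhline(df["price"].mean(), linestyle="--", color="gray")
ax.set_title("Price with mean line")
```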