# Data Visualization in Python

Python offers multiple great graphing libraries packed with lots of different features. Whether you want to create interactive or highly customized plots, Python has an excellent library for you.

To get a little overview, here are a few popular plotting libraries:

- `Matplotlib`: the most widely used plotting library in the Python community, with well over a decade of development behind it. Because it was one of the earliest Python data visualization libraries, many other libraries are built on top of it; `Seaborn` and the `pandas` plotting API, for example, are wrappers around `matplotlib`.
- `Seaborn`: leverages matplotlib's ability to create aesthetic graphics in a few lines of code. The most palpable difference is Seaborn's default styles and color palettes, which are designed to be more aesthetically pleasing and modern.
- `Pandas` plotting: allows data visualization through adaptations of the `matplotlib` library, facilitating the data aggregation and manipulation in a few lines of code.
- `Plotly`: creates interactive plots and offers additional chart types such as contour plots, dendrograms, and 3D charts.
- `ggplot`: is based on *ggplot2*, the R plotting system. `ggplot` operates differently from `matplotlib` and `seaborn`: it builds a complete plot by layering its components.
- `Bokeh`: creates interactive, web-ready plots, which can be easily output as JSON objects, HTML documents, or interactive web applications, supporting streaming and real-time data.
- `AstroPy`: is a collection of software packages written in Python and designed for use in astronomy.
- `Gleam`: is inspired by R's Shiny package. It lets you turn analyses into interactive web applications using only Python scripts, avoiding the use of other languages like HTML, CSS, or JavaScript.
- `Geoplotlib`: is a toolbox for creating maps and plotting geographical data by creating a variety of map-types, like choropleths, heatmaps, and dot density maps.
- `Missingno`: lets you quickly gauge the completeness of a dataset with a visual summary, instead of trudging through a table.

## Matplotlib

The **matplotlib** Python library, developed by **John Hunter** and many other contributors, is used to create high-quality graphs, charts, and figures. The library is extensive and capable of changing very minute details of a figure. 

__Some basic concepts and functions provided in matplotlib are__

### Figure and axes 
The entire illustration is called a figure, and each plot on it is an axes (do not confuse Axes with Axis). The figure can be thought of as a canvas on which several plots can be drawn. We obtain the figure and the axes using the `subplots()` function.
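
As a minimal sketch of how the two objects relate, `subplots()` returns one figure and the axes attached to it. The 1x2 grid below is an arbitrary choice made only for illustration:

```python
import matplotlib.pyplot as plt

# One figure (the canvas) holding a 1x2 grid of axes (the individual plots)
fig, axes = plt.subplots(nrows=1, ncols=2)

figure_type = type(fig).__name__   # class name of the canvas object
n_axes = len(fig.axes)             # number of axes attached to the figure
grid_shape = axes.shape            # shape of the array of Axes objects

print(figure_type, n_axes, grid_shape)
```

Each element of `axes` is a separate plot area with its own x-axis and y-axis.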

### Plotting
The very first thing required to plot a graph is data. A dictionary of key-value pairs can be declared, with keys and values as the x and y values. After that, `scatter()`, `bar()`, and `pie()`, along with tons of other functions, can be used to create the plot. 
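
A small sketch of this idea follows. The `sales` dictionary is invented for illustration and is not part of the dataset used later in this notebook:

```python
import matplotlib.pyplot as plt

# Illustrative data (made up): keys become x values, values become y values
sales = {"Mon": 12, "Tue": 17, "Wed": 9, "Thu": 21}

x = list(sales.keys())
y = list(sales.values())

fig, ax = plt.subplots()
bars = ax.bar(x, y)                 # one bar per dictionary key
bar_heights = [bar.get_height() for bar in bars]
ax.set_title("Values from a dictionary")
plt.show()
```

Swapping `ax.bar` for `ax.scatter` would draw the same pairs as points instead of bars.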

### Axis
The figure and axes obtained using `subplots()` can be used for modification. Properties of the x-axis and y-axis (labels, minimum and maximum values, etc.) can be changed using `Axes.set()`.
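
For instance, several axis properties can be set in a single `Axes.set()` call. This is only a sketch; the data and the label text are placeholders chosen for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

# Set labels, limits and title in one call
ax.set(
    xlabel="x",
    ylabel="x squared",
    xlim=(0, 4),
    ylim=(0, 10),
    title="Configuring axes with Axes.set()",
)

x_limits = tuple(ax.get_xlim())
plt.show()
```

The same properties can also be set one at a time with methods such as `ax.set_xlabel()` and `ax.set_xlim()`.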

### Anatomy of a Figure 

A matplotlib visualization is a figure onto which is attached one or more axes. Each axes has a horizontal (x) axis and vertical (y) axis, and the data is encoded using color and glyphs such as markers (for example circles) or lines or polygons (called patches). The figure below annotates these parts of a visualization and was created by Nicolas P. Rougier using matplotlib. The source code can be found in the [matplotlib documentation](https://matplotlib.org/gallery/showcase/anatomy.html#sphx-glr-gallery-showcase-anatomy-py).
![anatomy](https://matplotlib.org/stable/_images/sphx_glr_anatomy_001.png)

### Importing packages

Just as we use the `np` shorthand for `NumPy` and the `pd` shorthand for `Pandas`, we will use the standard shorthand `plt` for the Matplotlib imports:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

### Loading Dataset


In [None]:
df = pd.read_csv('./data/2_1_metropolitan_areas.csv')

In [None]:
df.head()

### Scatter Plot using `Matplotlib`

- A scatter plot (aka scatter chart, scatter graph) uses dots to represent values for two different numeric variables. 
- The position of each dot on the horizontal and vertical axis indicates values for an individual data point. 
- Scatter plots are used to observe relationships between variables.

To create a scatter plot in `Matplotlib` we can use the `.scatter()` method:

In [None]:
plt.scatter(df.crime_rate, df.percent_senior) # Plotting the scatter plot

plt.show() # Showing the figure

Both structures that store the values, `df.crime_rate` and `df.percent_senior`, must have the same size, so that each value of the variable on the horizontal axis ($x$) is paired with exactly one value of the variable on the vertical axis ($y$).
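
The sketch below shows what happens when the sizes differ. The two short lists are made up for illustration and stand in for the dataframe columns:

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y_short = [10, 20, 30]   # one value is missing on purpose

fig, ax = plt.subplots()
try:
    ax.scatter(x, y_short)
    error_message = None
except ValueError as err:
    # Matplotlib refuses to pair 4 x values with 3 y values
    error_message = str(err)

print(error_message)
```

Columns taken from the same dataframe always have the same length, so this error usually appears only when the two inputs come from different sources.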

#### Is `plt.show()` always required?

1. If `Matplotlib` is used in a terminal, in scripts, or in specialized IDEs such as Spyder, PyCharm, or VS Code, `plt.show()` is a must.

2. If `Matplotlib` is used in an IPython shell or a notebook such as Jupyter Notebook or Colab Notebook, `plt.show()` is usually unnecessary.

In the following cell we are executing the same script as above, removing the `plt.show()` instruction:

In [None]:
# The same code block without plt.show() gives the same result in Jupyter Notebook
plt.scatter(df.crime_rate, df.percent_senior)

The only difference is the inclusion of the figure output object:
```
<matplotlib.collections.PathCollection at 0x2589719d490>
``` 
If you want to prevent this from being included as a cell output, use `plt.show()` at the end of each plotting instruction.

The `plt.show()` command does a lot under the hood, as it must interact with your system's interactive graphical backend. The details of this operation can vary greatly from system to system and even installation to installation, but matplotlib does its best to hide all these details from you. 
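
If you are curious which backend your installation is using, `matplotlib.get_backend()` reports its name. This is just a quick check, and the value printed will differ between systems:

```python
import matplotlib

# Name of the backend Matplotlib uses to render and display figures
backend_name = matplotlib.get_backend()
print(backend_name)   # the result depends on your system and installation
```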

#### Adding titles and labels

Visualizing the data through plotting for better interpretation is an indispensable practice. However, the graphics must express the full story. To express this story, it is necessary to add labels that indicate what is being plotted, where and what it represents. We do this by including data labels, axis labels, and titles.

In [None]:
plt.scatter(df.percent_senior, df.crime_rate)

plt.title('Plot of Crime Rate vs Percent Senior') # Adding a title to the plot
plt.xlabel("Percent Senior") # Adding the label for the horizontal axis
plt.ylabel("Crime Rate") # Adding the label for the vertical axis
plt.show()

### Line Chart using `Matplotlib`

A line chart is typically used to represent data over a continuous time span, showing the trend of a measure (or a variable) over time. Data values are plotted as points that are connected using line segments.

In Matplotlib we can create a line chart by calling the plot method.

`plot()` is a versatile command, and will take an arbitrary number of arguments.

In [None]:
plt.plot(df.work_force, df.income) # 2 arguments: X and Y points
plt.xlabel("Work Force") # Adding the label for the horizontal axis
plt.ylabel("Income")
plt.show()

Because it is a line chart, `matplotlib` automatically draws a line to connect each pair of consecutive points that represent Cartesian coordinates on the graph. We can also make a graph with a single input argument:

In [None]:
plt.plot([1, 2, 3, 4]) # 1 argument
plt.show()

You may be wondering why the x-axis ranges from 0-3 and the y-axis from 1-4. If you provide a single list or array of $n$ elements to the `.plot()` function, `matplotlib` will assume it is a sequence of $y$ values, and **automatically generates the $x$ values for you** as a range of $n$ elements starting from 0. Since Python ranges start with 0, the default $x$ vector has the same length as $y$. Hence the $x$ data will be [0,1,2,3].
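
One way to confirm this is to inspect the line object that `plot()` returns. The sketch below uses the object-oriented form (`ax.plot`) so that the line can be captured in a variable:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([1, 2, 3, 4])   # only y values are given

# Read back the coordinates Matplotlib actually stored for the line
x_values = [float(v) for v in line.get_xdata()]
y_values = [float(v) for v in line.get_ydata()]

print(x_values)
print(y_values)
```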

#### Changing the size of the plot

You may have noticed that the notebook has plenty of horizontal space available, yet `matplotlib` has drawn the graph in a small area. The size of the figure that contains the graph can be changed with the `figsize` argument as follows:

```
plt.figure(figsize=(width_inches, height_inches))
```

Let's look at this example:

In [None]:
plt.figure(figsize=(12,5)) # 12 x 5 inch figure

plt.plot(df.work_force, df.income) 
plt.xlabel("Work Force") 
plt.ylabel("Income")
plt.show()
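
Note that `figsize` is measured in inches. The size in pixels also depends on the figure's resolution (`dpi`). The sketch below works through the arithmetic, with `dpi=100` chosen only to keep the numbers simple:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(12, 5), dpi=100)

width_in, height_in = fig.get_size_inches()
width_px = width_in * fig.dpi     # inches times dots per inch
height_px = height_in * fig.dpi

print(width_px, height_px)
```

Raising `dpi` while keeping `figsize` fixed produces a sharper image with the same proportions.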

#### Formatting the style of your plot

Specify the keyword arguments `linestyle` and/or `marker` in your call to `plot()`.

For example, using a dashed line and red circle markers:

In [None]:
plt.plot(df.work_force, df.income, linestyle='--', marker='o', color='r')
plt.xlabel("Work Force") 
plt.ylabel("Income")
plt.show()

#### A shortcut for the above

For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot. The letters and symbols of the format string are from MATLAB, and you concatenate a color string with a line style string. The default format string is 'b-', which is a solid blue line. 

In [None]:
plt.plot(df.work_force, df.income, '--ro')  # --ro = dashed line with red circles
plt.xlabel("Work Force") 
plt.ylabel("Income")
plt.show()

It turns out that the `plot()` function can produce scatter plots as well:

In [None]:
plt.plot(df.work_force, df.income, 'ro') # ro = red circles
plt.show()

In [None]:
plt.plot(df.work_force, df.income, "gx") # gx = green x
plt.show()

There are plenty of other options. You can try the following:

```
plt.plot(df.work_force, df.income, "go") # green circles
plt.plot(df.work_force, df.income, "g^") # green triangles
plt.plot(df.work_force, df.income, "ro") # red circles
plt.plot(df.work_force, df.income, "rx") # red x symbol
plt.plot(df.work_force, df.income, "b^") # blue triangles
plt.plot(df.work_force, df.income, "go--", linewidth=3) # green circles and dashed lines of width 3.
```

### Plotting consecutive plots using `Matplotlib` 

Plotting data on multiple consecutive figures can be done by calling the corresponding graphing functions and displaying each figure consecutively:

In [None]:
plt.plot(df.work_force, df.income, color="r") 
plt.show()
plt.plot(df.physicians, df.income) 
plt.show()

What happens if you don't use `plt.show()` after the first figure? Both variables will be plotted in the same figure:

#### Adding a Legend

A legend is an area describing the elements of the graph. In the `matplotlib` library, there’s a function called `.legend()` which is used to place a legend on the axes, as follows:

In [None]:
# Without plt.show() between the two calls, both lines are drawn in the same figure
# (both share the same axes)
plt.plot(df.work_force, df.income, color="r", label = 'work_force') 
plt.plot(df.physicians, df.income, label='physicians') 

# Adding a legend
plt.legend()

plt.show()

### Multiple plots in one figure using `Matplotlib`
We can place multiple plots in one figure. This is useful for comparing charts or for sharing several types of charts in a single image.

The `.subplot()` method is used to add multiple plots in one figure. It takes three arguments: 
1. `nrows`: number of rows in the figure
2. `ncols`: number of columns in the figure
3. `index`: index of the plot
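
The `index` starts at 1 in the top-left corner and increases from left to right, row by row. The sketch below illustrates this by writing each panel's index inside it. The 2x3 grid is an arbitrary choice for illustration:

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(6, 4))

for index in range(1, 7):            # indices run from 1 to nrows * ncols
    ax = plt.subplot(2, 3, index)    # 2 rows, 3 columns
    ax.text(0.5, 0.5, str(index), ha="center", va="center", fontsize=16)
    ax.set_xticks([])                # hide ticks to keep the panels clean
    ax.set_yticks([])

n_panels = len(fig.axes)
plt.show()
```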

Let's see how variables are plotted with different row and column configurations in the figures.

#### 1 row and 2 columns

In [None]:
plt.subplot(1,2,1)  # row, column, index
plt.plot(df.work_force, df.income, "go")
plt.title("Income vs Work Force")

# label_outer() keeps tick labels only on the outer edge of the grid,
# so the y tick labels of this second panel are hidden
plt.subplot(1,2,2).label_outer()

plt.plot(df.hospital_beds, df.income, "r^")
plt.title("Income vs Hospital Beds")

plt.suptitle("Sub Plots") # Add a centered title to the figure.
plt.show()

#### 2 rows and 1 column

In [None]:
plt.subplot(2,1,1) # row, column, index
plt.plot(df.work_force, df.income, "go")

plt.subplot(2,1,2) # row, column, index
plt.plot(df.hospital_beds, df.income, "r^")

plt.suptitle("Sub Plots")
plt.show()

#### 2 rows and 2 columns

In [None]:
fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(8,8)) #creating a grid of 2 rows, 2 columns and 8x8 figure size
ax[0,0].plot(df.work_force, df.income, "go") # The top-left axes
ax[0,1].plot(df.work_force, df.income, "bo") # The top-right axes
ax[1,0].plot(df.work_force, df.income, "yo") # The bottom-left axes
ax[1,1].plot(df.work_force, df.income, "ro") # The bottom-right axes

plt.suptitle("Sub Plots")

plt.show()

### Saving Figures

As the plot is gone as soon as we close the figure window, we usually want to save our figure, e.g., as a PDF:

In [None]:
fig.savefig("my_plot.pdf", dpi=120)
# or, if you didn't use the fig, ax = plt.subplots() command:
# plt.savefig("my_plot.pdf", dpi=120)

In `savefig()`, the file format is inferred from the extension of the given filename. Depending on what backends you have installed, many different file formats are available. The list of supported file types can be found for your system by using the following method of the figure canvas object:

In [None]:
fig.canvas.get_supported_filetypes()
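
To see the extension-based format selection in action, the sketch below saves the same figure twice, changing only the file extension. The temporary folder and the file names are choices made for this example:

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

out_dir = tempfile.mkdtemp()   # temporary folder, used only for this sketch
png_path = os.path.join(out_dir, "my_plot.png")
svg_path = os.path.join(out_dir, "my_plot.svg")

fig.savefig(png_path, dpi=120)   # .png extension: raster image
fig.savefig(svg_path)            # .svg extension: vector image

# Read the first bytes of the PNG file to check its format signature
with open(png_path, "rb") as f:
    png_header = f.read(8)

print(png_header)
```

Vector formats such as SVG and PDF scale without loss of quality, which is why `dpi` matters mostly for raster formats such as PNG.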

### Two Interfaces for the Price of One

A potentially confusing feature of Matplotlib is its dual interfaces: a convenient MATLAB-style state-based interface, and a more powerful object-oriented interface. We'll quickly highlight the differences between the two here.

#### MATLAB-style Interface

Matplotlib was originally written as a Python alternative for MATLAB users, and much of its syntax reflects that fact.
The MATLAB-style tools are contained in the pyplot (``plt``) interface.
For example, the following code will probably look quite familiar to MATLAB users:

In [None]:
x = np.linspace(0, 10, 100)

plt.figure()  # create a plot figure

# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))

# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x))

It is important to note that this interface is *stateful*: it keeps track of the "current" figure and axes, which are where all ``plt`` commands are applied.
You can get a reference to these using the ``plt.gcf()`` (get current figure) and ``plt.gca()`` (get current axes) routines.

While this stateful interface is fast and convenient for simple plots, it is easy to run into problems.
For example, once the second panel is created, how can we go back and add something to the first?
This is possible within the MATLAB-style interface, but a bit clunky.
Fortunately, there is a better way.
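
For reference, this is roughly what the "clunky" route looks like in the state-based interface: keep a reference to the first panel and make it current again with `plt.sca()`. The extra dashed curve is made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)

plt.figure()
top = plt.subplot(2, 1, 1)       # keep a reference to the first panel
plt.plot(x, np.sin(x))

plt.subplot(2, 1, 2)             # the second panel is now "current"
plt.plot(x, np.cos(x))

plt.sca(top)                     # make the first panel current again
plt.plot(x, 0.5 * np.sin(x), "r--")

current_is_top = plt.gca() is top
lines_in_top = len(top.get_lines())
plt.show()
```

It works, but every `plt` call depends on which axes happens to be current at that moment, which becomes hard to follow as a figure grows.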

#### Object-oriented interface

The object-oriented interface is available for these more complicated situations, and for when you want more control over your figure.
Rather than depending on some notion of an "active" figure or axes, in the object-oriented interface the plotting functions are *methods* of explicit ``Figure`` and ``Axes`` objects.
To re-create the previous plot using this style of plotting, you might do the following:

In [None]:
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)

# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))

## Seaborn

`Seaborn` is a Python data visualization library based on `Matplotlib`. It provides a high-level interface for creating attractive graphs, and statistical data visualization. 

Seaborn has a lot to offer. You can create graphs in one line that would take dozens of lines in `Matplotlib`. Its default styles look good, and it has a convenient interface for working with pandas dataframes.

In the following sections we will review how to graph with `seaborn`.

### Importing seaborn

In [None]:
import seaborn as sns

In [None]:
# Load iris data
df_iris = sns.load_dataset("iris")

df_iris.sample(10)

### Scatter Plot using `Seaborn`

In [None]:
sns.scatterplot(x='sepal_length', y='sepal_width', data=df_iris)
plt.show()

### Swarm Plot using `Seaborn`

In [None]:
# Construct swarm plot for species vs petal_length
sns.swarmplot(x="species", y="petal_length", data=df_iris, size=4)

# Show plot
plt.show()

### Heatmap using `Seaborn`

A Heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colors.

Seaborn heatmaps are appealing to the eyes, and they tend to send clear messages about data almost immediately. This is why heatmaps are widely used by data analysts and data scientists to visualize correlation matrices (that is, to explore the correlation between the features in a dataset).

Correlation is a measure of the strength of a linear relationship between two quantitative variables. 
(We'll study more about correlation in the upcoming week)

To get the correlation of the features inside a dataset we can call `<dataset>.corr()`, which is a Pandas dataframe method. This will give us the correlation matrix.

In [None]:
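# numeric_only=True makes the correlation matrix skip non-numeric columns such as species
# (recent pandas versions raise an error on non-numeric columns without it)
sns.heatmap(df_iris.corr(numeric_only=True), annot=True)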
plt.show()

- Each square shows the correlation between the variables on each axis. 
- Correlation ranges from -1 to +1. 
- Values close to zero mean there is little or no linear trend between the two variables. 
- The closer the correlation is to 1, the more positively correlated the variables are; that is, as one increases, so does the other. 
- A correlation close to -1 is similar, but instead of both increasing, one variable decreases as the other increases.
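
These three cases can be checked numerically with `np.corrcoef`. The small arrays below are made up so that the results are easy to verify by hand:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y rises in a straight line with x: perfect positive correlation
r_up = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls in a straight line as x rises: perfect negative correlation
r_down = np.corrcoef(x, -3 * x + 10)[0, 1]

# y follows a symmetric pattern around the middle of x: no linear trend
r_flat = np.corrcoef(x, np.array([1.0, -1.0, 0.0, -1.0, 1.0]))[0, 1]

print(round(r_up, 3), round(r_down, 3), round(r_flat, 3))
```

The last case is a reminder that a correlation near zero rules out a linear trend only; the variables may still be related in some other way.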

## Plotly

The `plotly.express` module (usually imported as px) contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures.

Plotly Express provides more than 30 functions for creating different types of figures. The API for these functions was carefully designed to be as consistent and easy to learn as possible, making it easy to switch from a scatter plot to a bar chart to a histogram to a sunburst chart throughout a data exploration session.

In [None]:
import plotly.express as px

df_iris = px.data.iris()
fig = px.scatter_matrix(df_iris, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"], color="species")
fig.show()

In [None]:
fig = px.scatter_3d(df_iris, x='sepal_length', y='sepal_width', z='petal_width',
              color='species')
fig.show()

## Appendix

### Bar Chart using `Matplotlib` 

Bar charts are one of the most common types of graphs and are used to show data associated with categorical variables.

`Pyplot` provides a `bar()` function to make bar graphs. It takes the following arguments: the categories, their values, and a color (if you want to specify one). Let's see some ways to display a bar graph with `matplotlib`:

####  Vertical Bar Charts

In [None]:
plt.bar(df.region, df.crime_rate, color="green")

plt.title("Bar Graph")
plt.xlabel("Region")
plt.ylabel("Crime Rate")
plt.show()

#### Horizontal Bar Charts

It’s also really simple to make a horizontal bar chart using the `plt.barh()` function.

In [None]:
plt.barh(df.region, df.crime_rate, color="green")
plt.title("Bar Graph")
plt.show()

####  Bar Charts with multiple quantities

When comparing several quantities across the same categories, we might want a grouped bar chart, with bars of one color for each quantity.

We can plot multiple bar charts by playing with the thickness and the positions of the bars.

In [None]:
divisions = ["A", "B", "C", "D", "E"]
division_avg = [70, 82, 73, 65, 68]
boys_avg = [68, 67, 77, 61, 70]

# Using the NumPy arange function to generate values for index between 0-4.
# Here,stop is 5, start is 0, and step is 1
index = np.arange(5) 
width = 0.30

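plt.bar(index, division_avg, width, color="green", label="Division Marks")
plt.bar(index+width, boys_avg, width, color="blue", label="Boys Marks")

plt.xticks(index + width / 2, divisions) # Placing the division names under each pair of bars
plt.title("Bar Graph")
plt.xlabel("Divisions")
plt.ylabel("Marks")
plt.legend() # Showing the labels defined in the bar() calls
plt.show()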

####  Stacked Bar Chart

The stacked bar chart stacks bars that represent different groups on top of each other. The height of the resulting bar shows the combined result of the groups.

The optional `bottom` parameter of the `pyplot.bar()` function allows you to specify a starting value for a bar. Instead of running from zero to a value, it will go from the bottom to the value. The first call to `pyplot.bar()` plots the green bars. The second call to `pyplot.bar()` plots the blue bars, with the bottom of the blue bars being at the top of the green bars.

In [None]:
divisions = ["A", "B", "C", "D", "E"]
girls_avg = [72, 97, 69, 69, 66]
boys_avg = [68, 67, 77, 61, 70]

index = np.arange(5)
width = 0.50

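plt.bar(index, boys_avg, width, color="green", label="Boys Marks")
plt.bar(index, girls_avg, width, color="blue", label="Girls Marks", bottom=boys_avg)

plt.xticks(index, divisions) # Placing the division names under each stacked bar
plt.title("Bar Graph")
plt.xlabel("Divisions")
plt.ylabel("Marks")
plt.legend() # Showing the labels defined in the bar() calls
plt.show()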

### Pie Chart using `Matplotlib`
One more basic type of chart is the pie chart, which can be made using the `pie()` function.

Pie charts show the size of items (called wedges) in one data series, proportional to the sum of the items. The data points in a pie chart are shown as a percentage of the whole pie.

Parameters of a pie chart:
- `x`: array-like. The wedge sizes.
- `labels`: list. A sequence of strings providing the labels for each wedge.
- `colors`: A sequence of matplotlib color arguments through which the pie chart will cycle. If `None`, will use the colors in the currently active cycle.
- `autopct`: string, used to label the wedges with their numeric value. The label will be placed inside the wedge. The format string will be `fmt % pct`.
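
Because each wedge is proportional to the sum of the items, the percentages shown by `autopct` are not necessarily the raw values. The sketch below works through the arithmetic for a series that sums to 90 rather than 100, assuming Matplotlib's default behaviour of normalizing the values by their sum:

```python
import matplotlib.pyplot as plt

market_share = [20, 25, 15, 10, 20]   # sums to 90, not 100
total = sum(market_share)

# Percentage of the whole pie taken by each wedge
percentages = [100 * value / total for value in market_share]

# The same format string passed to autopct, applied by hand
wedge_labels = ["%1.1f%%" % pct for pct in percentages]
print(wedge_labels)

fig, ax = plt.subplots()
ax.pie(market_share, autopct="%1.1f%%")
plt.show()
```

A raw value of 20 is therefore labeled as about 22.2% of the pie.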

We can also pass in arguments to customize our Pie chart to show shadow, explode a part of it, tilt it at an angle as follows:

In [None]:
firms = ["Firm A", "Firm B", "Firm C", "Firm D", "Firm E"]
market_share = [20,25,15,10,20]

# Explode the pie chart to emphasize one or more wedges (Firm B in this case).
# It is useful because it makes the highlighted portion more visible.
explode = [0,0.1,0,0,0] 

plt.pie(market_share, explode=explode, labels=firms, autopct='%1.1f%%', startangle=45)

plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.

plt.legend(title="List of Firms")

plt.show()