# Introduction to Data Visualization

Data visualization (or *dataviz*) is an essential tool
for understanding data and highlighting the phenomena
within it, as well as for communicating analysis results
effectively. However, this is an area that goes far beyond
technical skills alone: the best visualizations are those
that are tailored to the data they represent and that manage
to tell a story from it (*data storytelling*). This tutorial
therefore does not aim to cover the subject in detail, but
offers an introduction to the main existing tools in `Python`
for producing data visualizations.

We will begin our exploration with the plotting tools built into
`Pandas`, which are very simple and therefore well suited to a quick
analysis of data. Then, we will discover `Seaborn`, a library that
lets you create attractive visualizations in very few lines of code.
Both libraries are built on `Matplotlib`, the comprehensive reference
library for visualization in `Python`. Matplotlib allows very advanced
levels of customization, but it is more complex to use and will
therefore not be directly covered in this practical session.

## Pandas

As we saw in the dedicated practical session, the Pandas library
offers numerous powerful tools for manipulating tabular data.
It also comes with built-in tools for visualizing them.
In particular, the `.plot()` method makes it easy to produce
quick visualizations of the data being analyzed.

### The `.plot()` method

The method
[.plot()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html),
available on Series and DataFrames, simplifies the creation of
graphs by generating standard visualizations in a single line of
code, directly from the data structure.
Behind the scenes, `.plot()` calls on Matplotlib to render the
graph, which means that any graph generated by Pandas can
be further customized with Matplotlib functions. This
integration provides a balance between convenience for quick
visualization tasks and the power of Matplotlib for more advanced
customization needs, making `.plot()` an ideal starting point
for data visualization in Python.
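
As a minimal sketch of this integration, the example below uses a small
hypothetical DataFrame (`df_demo`, which is not part of this tutorial's
dataset) to show that `.plot()` returns a Matplotlib `Axes` object, and
that `kind='scatter'` and the `.plot.scatter()` accessor are two ways of
writing the same request.

```python
import matplotlib.axes
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.0, 4.1, 5.9, 8.2]})

# Two equivalent ways of asking for a scatter plot
ax_kind = df_demo.plot(x="x", y="y", kind="scatter")
ax_accessor = df_demo.plot.scatter(x="x", y="y")

# Both calls return a Matplotlib Axes, which can be customized afterwards
print(isinstance(ax_kind, matplotlib.axes.Axes))
ax_kind.set_title("Toy scatter plot")
```

Because the returned object is a regular Matplotlib `Axes`, any method
available on it (titles, labels, limits) can be applied to a Pandas plot.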

### Examples of graphs

Although the `.plot()` method makes it simple and quick to produce
graphs, the possibilities are very numerous and depend on the
input data. In this section, we provide a few standard examples
to show how the method works. To discover more possibilities,
you can draw inspiration from the many examples in the
[official
documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Let's generate synthetic data that mimics cash register data, which we
will use as a basis for the graphs.

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Setup for reproducibility
np.random.seed(0)

# Generate a date range over a month
dates = pd.date_range(start='2023-01-01', end='2023-01-31', freq='D')

# Simulate cash data for the month
N_POINTS = 1000
mean_price = 10
std_dev_price = 4
prices = np.random.normal(mean_price, std_dev_price, N_POINTS)
quantities = 10 - 0.5 * prices + np.random.normal(0, 1.5, N_POINTS)
data = {
    'Date': np.random.choice(dates, N_POINTS),
    'Transaction_ID': np.arange(N_POINTS) + 1,
    'COICOP': np.random.choice(['01.1.1', '02.1.1', '03.1.1', '04.1.1'], N_POINTS),
    'Enseigne': np.random.choice(['Carrefour', 'Casino', 'Lidl', 'Monoprix'], N_POINTS),
    'Prix': prices,
    'Quantité': quantities
}

# Create the DataFrame
df_caisse = pd.DataFrame(data)

# Sort by date for consistency
df_caisse = df_caisse.sort_values(by='Date').reset_index(drop=True)

# Show first lines of cash register data
print(df_caisse.head())

#### Scatter plot

Scatter plots let you visualize the relationship between two
continuous numerical variables. Let us illustrate this with the
relationship between the price and the quantity of transactions.

In [None]:
df_caisse.plot(x='Quantité', y='Prix', kind='scatter')

#### Bar charts

Bar charts are ideal for visually comparing different
categories. Here we use the `.value_counts()` method
to retrieve the frequency of each category as a `Series`,
to which we apply the `.plot()` method to draw a bar
chart.

In [None]:
df_caisse['Enseigne'].value_counts().plot(kind='bar')
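
To make the intermediate step explicit, here is a small sketch on a
hypothetical toy `Series` (`stores`, not the tutorial's data). It shows
what `.value_counts()` returns before plotting, and how the
`normalize=True` option yields shares instead of raw counts.

```python
import pandas as pd

# Hypothetical toy Series standing in for a categorical column
stores = pd.Series(["Lidl", "Casino", "Lidl", "Carrefour", "Lidl", "Casino"])

# Absolute counts, sorted from most to least frequent
counts = stores.value_counts()
print(counts)

# Relative frequencies (shares summing to 1) instead of raw counts
shares = stores.value_counts(normalize=True)
ax = shares.plot(kind="bar")
ax.set_ylabel("Share of transactions")
```

The index of the resulting `Series` supplies the bar labels and its
values supply the bar heights.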

#### Box plot

The box plot lets you quickly visualize the dispersion
statistics of a statistical series (median, quartiles, min, max),
as well as the possible presence of outliers.

In [None]:
df_caisse['Prix'].plot(kind="box")
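
The statistics drawn by a box plot can also be computed directly. The
sketch below does so on hypothetical toy prices (`prices_demo`),
assuming Matplotlib's default whisker rule (`whis=1.5`), under which
points lying more than 1.5 times the interquartile range beyond the box
are drawn as outliers.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
prices_demo = pd.Series(rng.normal(10, 4, 500))  # hypothetical toy prices

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = prices_demo.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed default rule: points beyond these fences are shown as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices_demo[(prices_demo < lower_fence) | (prices_demo > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(f"{len(outliers)} point(s) outside [{lower_fence:.2f}, {upper_fence:.2f}]")
```

Under this rule, the whiskers stop at the most extreme observations
inside the fences, so they coincide with the min and max only when
there are no outliers.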

#### Histogram

Histograms help to understand the distribution of a numerical
variable. Let us plot the histogram of transaction prices over the
period studied.

In [None]:
df_caisse['Prix'].plot(kind='hist', bins=20)
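
To see what the `bins` argument controls, the following sketch computes
the bin counts numerically with `np.histogram` on hypothetical toy
prices (`prices_demo`). It assumes the default behavior of equal-width
bins spanning the minimum to the maximum of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
prices_demo = rng.normal(10, 4, 1000)  # hypothetical toy prices

# 20 equal-width bins between the smallest and largest value
counts, bin_edges = np.histogram(prices_demo, bins=20)

print(len(counts))     # 20: one count per bin
print(len(bin_edges))  # 21: one more edge than bins
print(counts.sum())    # 1000: every observation falls in exactly one bin
```

Increasing `bins` gives a finer but noisier picture of the distribution,
while decreasing it smooths out detail.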

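#### Line plot

Line plots are suited to showing how a quantity evolves over time.
Here we plot the total quantity sold per day.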

In [None]:
df_caisse.groupby('Date')['Quantité'].sum().plot(kind='line')

### Customization

As mentioned earlier, the charting functionality built into
Pandas actually relies on the Matplotlib library: the
Pandas `.plot()` method is essentially a wrapper around
Matplotlib plotting functions such as
[plot()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html).
In theory, all the customization possibilities allowed by
Matplotlib are therefore available for graphs created with
Pandas. To access them, you need to import Matplotlib in
addition to Pandas.

In [None]:
import matplotlib.pyplot as plt

Let's illustrate some customization possibilities by taking one of the
previous graphs.

In [None]:
df_caisse.plot(x='Quantité', y='Prix', kind='scatter', color="green", alpha=0.6)
plt.title('Relation entre le prix et la quantité des produits')
plt.xlabel('Quantité vendue')
plt.ylabel('Prix (en €)')
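
The same kind of customization can also be written against the `Axes`
object returned by `.plot()`, rather than through the `plt` functions.
The sketch below is one possible way to do it, using a hypothetical toy
DataFrame (`df_demo`) with the same column names as the tutorial's data.

```python
import pandas as pd

# Hypothetical toy data reusing the tutorial's column names
df_demo = pd.DataFrame({
    "Quantité": [2, 4, 5, 7],
    "Prix": [15.0, 11.5, 9.0, 6.5],
})

# Keep a reference to the Axes returned by .plot()
ax = df_demo.plot(x="Quantité", y="Prix", kind="scatter", color="green", alpha=0.6)

# Customize through the Axes methods instead of plt.title / plt.xlabel / plt.ylabel
ax.set_title("Price versus quantity sold")
ax.set_xlabel("Quantity sold")
ax.set_ylabel("Price (in €)")
```

Working with the `Axes` object makes it explicit which plot is being
modified, which helps when several figures are open at once.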

### Going further

Again, many other possibilities are described in the
[documentation](https://pandas.pydata.org/docs/user_guide/visualization.html#basic-plotting-plot).
However, the plotting features built into Pandas remain
primarily intended for quick visualization of the data being analyzed.
For more attractive visualizations that do not require much more
code, we will prefer the `Seaborn` library.

## Seaborn

Seaborn is a data visualization library that offers a
high-level interface for creating aesthetically pleasing
statistical graphs. It is also built on Matplotlib and integrates
well with Pandas data structures, allowing more sophisticated
visualizations than those natively offered by Pandas
without requiring a significant amount of code. This makes it
a great choice for going beyond Pandas' plotting capabilities
while avoiding the complexity of Matplotlib.

Let's import the Seaborn package. The convention is to give it the alias
`sns` to keep the code concise.

In [None]:
import seaborn as sns

### Examples of graphs

For the same graphs as those made previously with Pandas,
Seaborn offers much more visually pleasing representations.
We present some of them in the rest of this tutorial.

#### Scatter plot

Information can easily be added to a scatter plot, for example
via the color of the points or their style (size, marker, etc.).
Let's analyze the scatter plot of quantity as a function of price,
according to the store in which the transaction took place.

In [None]:
sns.scatterplot(data=df_caisse, x='Prix', y='Quantité', hue='Enseigne', alpha=0.6)
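
As an illustration of encoding information through marker style as well
as color, the sketch below passes the same column to both `hue` and
`style`. It runs on a hypothetical toy DataFrame (`df_demo`) generated
along the same lines as the tutorial's data.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data mimicking the tutorial's columns
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "Prix": rng.normal(10, 4, n),
    "Enseigne": rng.choice(["Carrefour", "Casino", "Lidl", "Monoprix"], n),
})
df_demo["Quantité"] = 10 - 0.5 * df_demo["Prix"] + rng.normal(0, 1.5, n)

ax = sns.scatterplot(
    data=df_demo,
    x="Prix",
    y="Quantité",
    hue="Enseigne",    # color encodes the store
    style="Enseigne",  # marker shape also encodes the store
    alpha=0.6,
)
```

Using both color and marker shape for the same variable keeps the plot
readable when printed in grayscale.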

#### Histogram

With Seaborn, one can easily add a density estimation curve
to a histogram. This gives a quick visual check of whether the
data look approximately normal.

In [None]:
sns.histplot(df_caisse['Prix'], kde=True, color='skyblue')

#### Pair plot

The *pair plot* lets you analyze the pairwise relationships between
continuous variables by combining scatter plots and density curves.

In [None]:
subset = df_caisse[['Prix', 'Quantité', 'Enseigne']]
sns.pairplot(subset, hue='Enseigne')

#### Violin plot

Similar to the box plot, the *violin plot* adds a density
estimation curve in order to better show where the mass of the
distribution lies.

In [None]:
sns.violinplot(data=df_caisse, x='Enseigne', y='Prix', hue="Enseigne")

### Customization

Like Pandas, Seaborn's graphing features are based on
those of Matplotlib. Here again, we can therefore customize the
graphs using Matplotlib's `plt.xxx` functions.

In [None]:
sns.scatterplot(data=df_caisse, x='Prix', y='Quantité', hue='Enseigne', alpha=0.6)
plt.title('Relation entre prix et quantité selon les enseignes')
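
Axes-level Seaborn functions such as `sns.scatterplot` also return the
Matplotlib `Axes` they draw on, so the customization can be applied to
that object directly. The sketch below shows one way to do this on a
hypothetical toy DataFrame (`df_demo`).

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data reusing the tutorial's column names
df_demo = pd.DataFrame({
    "Prix": [5.0, 8.0, 12.0, 15.0],
    "Quantité": [7.5, 6.0, 4.0, 2.5],
    "Enseigne": ["Lidl", "Casino", "Lidl", "Casino"],
})

ax = sns.scatterplot(data=df_demo, x="Prix", y="Quantité", hue="Enseigne")

# Set several properties at once on the returned Axes
ax.set(title="Price and quantity by store", xlabel="Price (in €)", ylabel="Quantity sold")

# The legend created by Seaborn is a regular Matplotlib legend
ax.get_legend().set_title("Store")
```

Note that figure-level functions such as `sns.pairplot` return a grid
object (a `PairGrid`) rather than a single `Axes`.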

### Going further

Seaborn's possibilities are very broad, and its example
[gallery](https://seaborn.pydata.org/examples/index.html)
illustrates many visually pleasing graphs that are
easy to reproduce. For more advanced needs, you can,
depending on the case, turn to other plotting libraries:

- for maximum customization possibilities (at the cost of a
steeper learning curve):
[Matplotlib](https://matplotlib.org/stable/tutorials/pyplot.html),
the fundamental visualization library in Python;

- for R users:
[plotnine](https://plotnine.readthedocs.io/en/v0.12.4/), a
library that implements the "grammar of graphics" used by
[ggplot2](https://ggplot2.tidyverse.org/);

- for interactive visualization: [plotly](https://plotly.com/)
and [bokeh](http://bokeh.org/) are the most widely used.