# AICE1006 - Data Analytics

## Lecture 7 - Data Plotting (Advanced)
**Interactive data visualization with plotly**


**Zhiwu Huang**  <br/>
Lecturer (Assistant Professor) <br/>
Vision, Learning and Control (VLC) Research Group <br/>
School of Electronics and Computer Science (ECS) <br/>
University of Southampton<br/>

*Office hour: Wed 2PM-3PM, please book in advance.* <br/>
``Zhiwu.Huang@soton.ac.uk``

<br/>
<br/>

Credit: Marco Forgione, Researcher, USI-SUPSI

## Plotly in a nutshell

Plotly is a modern plotting library for Python, R, MATLAB, Julia, etc.

For Python, the reference documentation is available at https://plotly.com/python/


## Plotly vs matplotlib
 
You can build high-quality visualizations with good old matplotlib. However,

* A lot of low-level code is required
* The visualizations are generally *static*


Plotly is a modern and powerful alternative. It provides:

* Concise high-level syntax for common data visualization 
* Tight integration with pandas
* Interactive plots



Other alternatives exist: for instance [seaborn](https://seaborn.pydata.org/)

* Also concise and high-level
* Also integrated with pandas
* Not interactive

### Plotly Express

The **plotly express** sub-module of plotly provides a high-level API that covers many common visualization use cases.

In [1]:
import plotly.express as px 
# import plotly # contains more advanced low-level functionalities for custom visualizations

Plotly express provides methods to load well-known datasets. Let us load the iris dataset

In [2]:
df_iris = px.data.iris() # several classic dataframes are included in plotly for demonstration purposes
df_iris.sample(5)

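|     | sepal_length | sepal_width | petal_length | petal_width | species    | species_id |
|-----|--------------|-------------|--------------|-------------|------------|------------|
| 114 | 5.8          | 2.8         | 5.1          | 2.4         | virginica  | 3          |
| 70  | 5.9          | 3.2         | 4.8          | 1.8         | versicolor | 2          |
| 80  | 5.5          | 2.4         | 3.8          | 1.1         | versicolor | 2          |
| 141 | 6.9          | 3.1         | 5.1          | 2.3         | virginica  | 3          |
| 137 | 6.4          | 3.1         | 5.5          | 1.8         | virginica  | 3          |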


### Scatterplot

A scatterplot is the most common visualization for 2 numeric variables

In [32]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", width=1600, height=800) # specify dataframe and columns for x/y
fig.update_layout(font_size=20);
fig.show()

* Syntax: ``px.scatter(df_iris, x="petal_width", y="petal_length", ...)``
* Axes labels automatically set to the column names
* Interactive!

### Scatterplot cont'd

The **marker color** is commonly used as another dimension of visual analysis

In [4]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", width=1600, height=800) # specify dataframe and columns for x/y
fig.update_layout(font_size=20);
fig.show()

* Implemented with ``color="species"``
* Legend automatically added 

### Scatterplot cont'd

The **marker size** provides yet another dimension of visual analysis

In [33]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", size="petal_width", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()

* Implemented with ``size="petal_width"``

### Scatterplot cont'd

The **interactive text** displayed when hovering over a point may also be modified

In [35]:
fig = px.scatter(df_iris, x="petal_width", y="petal_length", color="species", 
                 size="sepal_width", hover_data=["sepal_length"], width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()

* Implemented with ``hover_data=["sepal_length"]``

### Scatterplot matrix

The scatterplot matrix is a useful visualization for **several numeric variables**. It is the collection of scatterplots for all possible pairs of variables.

In [7]:
fig = px.scatter_matrix(df_iris, dimensions=["sepal_width", "sepal_length", "petal_width", "petal_length"], 
                        color="species", width=1600, height=800)
fig.update_layout(font_size=20);
fig.show()

* Implemented with ``px.scatter_matrix(...)``
* The variables to be analyzed correspond to the ``dimensions`` argument
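As a rough sketch of what "all possible combinations" means for the four ``dimensions`` above, the panels of the grid can be enumerated with the standard library alone. This only counts panels; it is not how plotly builds the figure internally.

```python
from itertools import combinations, product

dimensions = ["sepal_width", "sepal_length", "petal_width", "petal_length"]

# Every (row, column) pair of variables gets a panel in the matrix
panels = list(product(dimensions, repeat=2))

# Off-diagonal panels come in mirrored pairs, so only these combinations are distinct
distinct_pairs = list(combinations(dimensions, 2))

print(len(panels))          # 16
print(len(distinct_pairs))  # 6
```

The grid therefore grows quadratically with the number of variables, which is worth keeping in mind before passing many columns to ``dimensions``.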

## Histograms & Box plots

Histograms & box plots may be used to represent the distribution of a single numerical variable

In [8]:
fig = px.histogram(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show()

In [9]:
fig = px.box(df_iris, x="sepal_width", width=800, height=400)
fig.update_layout(font_size=20);
fig.show() 
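The two plots summarize the same data in different ways. The sketch below computes, with the standard library only, the quantities each plot displays: bin counts for the histogram, and quartiles plus outliers for the box plot. The sample values and the bin width are made up for illustration, and plotly's default quartile interpolation may give slightly different numbers on real data.

```python
from collections import Counter
from statistics import quantiles

# Made-up sample (hypothetical values, not taken from the iris data)
values = [10, 12, 15, 16, 18, 20, 21, 24, 45]

# Histogram: count the values falling in each bin of width 10
bin_width = 10
hist = Counter((v // bin_width) * bin_width for v in values)
# hist maps the left edge of each bin to its count

# Box plot: the box spans the first to third quartile, with a line at the median
q1, med, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally drawn as outliers
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(dict(hist))   # {10: 5, 20: 3, 40: 1}
print(q1, med, q3)  # 15.0 18.0 21.0
print(outliers)     # [45]
```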

## Multiple box plots

Multiple box plots may be constructed by specifying a categorical variable for ``y``...

In [10]:
fig = px.box(df_iris, x="sepal_width", y="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

... or for ``color``

In [11]:
fig = px.box(df_iris, x="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

## Multiple box plots cont'd

Note: the role of ``x`` and ``y`` may be interchanged

In [12]:
fig = px.box(df_iris, y="sepal_width", x="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

In [13]:
fig = px.box(df_iris, y="sepal_width", color="species", width=800, height=400); fig.update_layout(font_size=20); fig.show() 

### Bar plot

Bar plots are commonly used to represent a numeric variable vs. a categorical one. Example: aggregated group statistics

In [14]:
df_iris_mean = df_iris.groupby("species", as_index=False).mean()
df_iris_mean

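|   | species    | sepal_length | sepal_width | petal_length | petal_width | species_id |
|---|------------|--------------|-------------|--------------|-------------|------------|
| 0 | setosa     | 5.006        | 3.418       | 1.464        | 0.244       | 1.0        |
| 1 | versicolor | 5.936        | 2.77        | 4.26         | 1.326       | 2.0        |
| 2 | virginica  | 6.588        | 2.974       | 5.552        | 2.026       | 3.0        |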


In [15]:
fig = px.bar(df_iris_mean, x="species", y="petal_length", title="Average petal_length, by species"); fig.update_layout(font_size=20); fig.show() 

### Bar plot

Another example where a bar plot looks nice: data for different years

In [16]:
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 
#data_canada_it

### Bar plot

The same data with side-by-side bars (``barmode="group"``), which makes the two countries easier to compare year by year

In [36]:
import plotly.express as px
data_canada_it = px.data.gapminder().query("country == 'Canada' or country == 'Italy'")
fig = px.bar(data_canada_it, x='year', y='pop', color="country", barmode="group", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

The Italian population has been stable since the 1980s, while the Canadian population is still increasing.

### Pie charts

Pie charts give an intuitive representation of percentages.

In [18]:
df = px.data.gapminder().query("year == 2007").query("continent == 'Europe'")
df.loc[df['pop'] < 5.e6, 'country'] = 'Other countries' # Represent only large countries
df.sample(3)

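|      | country         | continent | year | lifeExp | pop      | gdpPercap   | iso_alpha | iso_num |
|------|-----------------|-----------|------|---------|----------|-------------|-----------|---------|
| 1391 | Other countries | Europe    | 2007 | 77.926  | 2009245  | 25768.25759 | SVN       | 705     |
| 419  | Denmark         | Europe    | 2007 | 78.332  | 5468120  | 35278.41874 | DNK       | 208     |
| 407  | Czech Republic  | Europe    | 2007 | 76.486  | 10228744 | 22833.30851 | CZE       | 203     |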


In [19]:
fig = px.pie(df, values='pop', names='country', title='Population of European continent', width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 
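The pie chart shows each label's share of the total, with rows that share the label 'Other countries' ending up in a single slice. The arithmetic behind it can be sketched in plain Python. The populations and the threshold below are made-up values for illustration, not the gapminder data.

```python
# Made-up populations in millions (hypothetical values, not the gapminder data)
population = {"A": 60.0, "B": 25.0, "C": 10.0, "D": 3.0, "E": 2.0}
threshold = 5.0  # countries below this population are merged into one slice

# Merge small countries into a single "Other countries" entry
slices = {}
for country, pop in population.items():
    name = country if pop >= threshold else "Other countries"
    slices[name] = slices.get(name, 0.0) + pop

# Each slice's angle is proportional to its share of the total
total = sum(slices.values())
shares = {name: 100 * pop / total for name, pop in slices.items()}
print(shares)
# {'A': 60.0, 'B': 25.0, 'C': 10.0, 'Other countries': 5.0}
```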

### Faceting

Faceting allows dealing with up to two categorical variables by **repeating** the same base plot on different rows/columns.


Back to the tips dataset:

In [20]:
df_tip = px.data.tips()
df_tip.head(5)

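|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |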


* **total_bill** and **tip** are numeric quantities
* **day** and **time** are ordered categorical variables
* **sex** and **smoker** are categorical variables with unspecified order

### Simple scatterplot

How does the relation between **tip** and **total_bill** vary across days? We may use a scatterplot of **tip** vs **total_bill**, colored by **day**.

In [37]:
fig = px.scatter(df_tip, x="total_bill", y="tip", color="day", width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

The result is not very clear...

### Faceted Scatterplots

**Facet columns** may be used instead: they generate a separate plot for each **day**

In [22]:
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]}, width= 1800, height=500)
fig.update_layout(font_size=20); fig.show() 

* ``facet_col="day"``: repeat the scatterplot for the different values of the categorical variable **day** on columns
* The ``category_orders`` dictionary specifies the order to be used for the categorical variables

### Faceted Scatterplots cont'd

Using **facet rows and columns** we may handle 2 categorical variables

In [23]:
fig = px.scatter(df_tip, x="total_bill", y="tip", facet_col="day", facet_row="time", 
                 category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
                 width= 1600, height=700)
fig.update_layout(font_size=20); fig.show() 

* ``facet_col="day"``: day on columns
* ``facet_row="time"``: time on rows 
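Conceptually, faceting partitions the rows of the table by the two categorical variables and draws one small plot per (row, column) cell. The following standard-library sketch mimics that partition on a few made-up records (the values are assumptions, not the tips data), and walks through the cells in an explicit order as ``category_orders`` does.

```python
from collections import defaultdict

# Made-up records (hypothetical values, not the tips data): (day, time, total_bill, tip)
records = [
    ("Thur", "Lunch", 12.0, 2.0),
    ("Thur", "Lunch", 20.0, 3.0),
    ("Fri", "Dinner", 25.0, 4.0),
    ("Sat", "Dinner", 30.0, 5.0),
    ("Sat", "Dinner", 18.0, 3.5),
]

# Each (row, column) cell of the facet grid receives the points of one category pair
cells = defaultdict(list)
for day, time, total_bill, tip in records:
    cells[(time, day)].append((total_bill, tip))

# Lay the cells out following an explicit category order
for time in ["Lunch", "Dinner"]:
    for day in ["Thur", "Fri", "Sat", "Sun"]:
        print(time, day, cells.get((time, day), []))
```

Cells with no matching rows simply stay empty, which is also what happens in the faceted figure when a combination such as a given day and time never occurs in the data.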

### Faceted Histograms


Histograms may also be faceted

In [24]:
fig = px.histogram(df_tip, x="total_bill",  facet_col="day", facet_row="smoker", color="sex",                 
                   category_orders={"day": ["Thur", "Fri", "Sat", "Sun"]},
                   width= 1600, height=800, )
fig.update_layout(font_size=20); fig.show() 

* 1 categorical variable (sex) handled with the ``color`` option
* 2 categorical variables (day/smoker) handled with rows/columns

### Faceted Boxplot


In [25]:
fig = px.box(df_tip, x="day", y="total_bill",
             facet_col="smoker",
             category_orders={"day": ["Thur", "Fri", "Sat", "Sun"], "time": ["Lunch", "Dinner"]},
             color="day",
             width= 1600, height=800)
fig.update_layout(font_size=20); fig.show() 

### Animation: time as an extra dimension

In the following scatterplot, we visualize 4 properties for different countries in 2007:
 * gdpPercap (x position)
 * lifeExp (y position)
 * continent (marker color) 
 * population (marker size)


In [26]:
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df.query("year==2007"), x="gdpPercap", y="lifeExp", size="pop", color="continent", hover_name="country", log_x=True,
           title="GDP, life expectancy, continent, and population of countries in 2007", size_max=60, width=1400, height=600)
fig.update_layout(font_size=20); fig.show() 

What if we want to see the evolution over time? An *animation* could be used! 

### Animation: time as an extra dimension

In [27]:
import plotly.express as px
df = px.data.gapminder()
fig = px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", 
           size="pop", color="continent", hover_name="country", 
           log_x=True, size_max=45, range_x=[100,100000], range_y=[25,90],
           width=1400, height=600)
fig.update_layout(font_size=20); fig.show() 

Animation is well suited to representing the *year* dimension!

In [28]:
import plotly.io as pio
from PIL import Image

frames = []

# Loop through animation frames
for frame in fig.frames:
    # Update the figure with each frame's data
    fig.update(data=frame.data)
    
    # Export each frame as a PNG image (static image export requires the kaleido package)
    img_bytes = pio.to_image(fig, format='png')

    # Save each frame as a PNG
    frame_name = frame.name  # Typically the year
    with open(f"frame_{frame_name}.png", "wb") as f:
        f.write(img_bytes)
    
    # Collect frames for GIF conversion
    frames.append(Image.open(f"frame_{frame_name}.png"))

# Save the frames as an animated GIF (once, after all frames have been collected)
frames[0].save(
    'gapminder_animation.gif',
    save_all=True,
    append_images=frames[1:],
    duration=300,  # Duration of each frame (ms)
    loop=0  # Infinite loop
)

### Maps

Maps are the obvious representation of geographical data. They are similar to scatterplots

In [29]:
import pandas as pd
# covid-19 italian data downloaded from https://github.com/pcm-dpc/COVID-19/blob/master/dati-regioni/dpc-covid19-ita-regioni.csv on 27-08-2020
data_latest = pd.read_csv("dpc-covid19-ita-regioni.csv") 

In [30]:
center = {"lat": 43.1, "lon": 12.3} # coordinates of central Italy (Perugia)
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat",
                        center=center,
                        size="totale_casi", # total cases
                        hover_data= ["denominazione_regione"], # region name
                        zoom=4)
fig.update_traces(textposition='top center')
fig.update_layout(
    width=800,
    height=800,
    title_text='Italian COVID-19 total cases, updated on 27-08-2020',
    #center=center
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account 
fig.show()

### Maps

Maps can also be animated, like many other ``plotly`` visualizations.

In [31]:
center = {"lat": 43.1, "lon": 12.3}
fig = px.scatter_mapbox(data_latest, lon="long", lat="lat", # longitude, latitude
                        center=center,
                        size="totale_casi", # total cases
                        hover_data= ["denominazione_regione"], # region name
                        animation_frame="data", # date
                        zoom=4)

fig.update_traces(textposition='top center')
fig.update_layout(
    width=800,
    height=800,
    title_text='Cases-Regions',
)
fig.update_layout(mapbox_style="carto-darkmatter") # warning! some styles require an account 
fig.show()