In [2]:
import pandas as pd
import numpy as np
import seaborn as sns 
import plotly.express as px 
import plotly.graph_objects as go 
import matplotlib.pyplot as plt 



# Plotly Plots: Part 01
In this topic we will explore and learn multiple use cases in Plotly to deliver:
- Line plot
- Area Plot
- Histogram
- Boxplot


## Line plot

Consider the gapminder dataset. It has records on population, Gross Domestic Product, and life expectancy for more than 140 countries from 1952 to 2007. Each row represents a country in a given year.

In [45]:
df = px.data.gapminder().query("country=='Canada'")
df.head(3)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
240,Canada,Americas,1952,68.75,14785584,11367.16112,CAN,124
241,Canada,Americas,1957,69.96,17010154,12489.95006,CAN,124
242,Canada,Americas,1962,71.3,18985849,13462.48555,CAN,124


We are already familiar with line plots. We can draw an interactive line plot using px.line(); see its documentation for the full argument list. The main arguments are data_frame, x, y and title. Other arguments will be introduced progressively
- We will plot life expectancy in Canada, so we pass the DataFrame queried for Canada to the data_frame argument
- As a recap, we render a Plotly plot with fig.show()

In [46]:
fig = px.line(data_frame=df, x="year", y="lifeExp", title='Life expectancy in Canada')
fig.show()


Now we want to plot multiple lines, each related to a country. First, let's subset some data; we will subset data from Oceania

In [47]:
df_oceania = px.data.gapminder().query("continent=='Oceania'")
df_oceania.head(3)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
60,Australia,Oceania,1952,69.12,8691212,10039.59564,AUS,36
61,Australia,Oceania,1957,70.33,9712569,10949.64959,AUS,36
62,Australia,Oceania,1962,70.93,10794968,12217.22686,AUS,36


There are two countries

In [48]:
df_oceania['country'].unique()


array(['Australia', 'New Zealand'], dtype=object)

We add the color argument to px.line() and set it to 'country', so it plots one line per country

In the legend, you can click on a country to show or hide it in the plot. The plot updates automatically when you do; this is a practical benefit of an interactive graph

- Note that both countries have similar life expectancy levels and change at similar rates


In [49]:
fig = px.line(data_frame= df_oceania,
             x="year", y="lifeExp", color='country')
fig.show()


When dealing with multiple categorical variables, you can facet (or split) the plot by column and row
- First, let's subset a part of the data where the continent is Oceania or the Americas


In [50]:
df_facet = px.data.gapminder().query("continent in ['Oceania','Americas']")
df_facet.head(3)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
48,Argentina,Americas,1952,62.485,17876956,5911.315053,ARG,32
49,Argentina,Americas,1957,64.399,19610538,6856.856212,ARG,32
50,Argentina,Americas,1962,65.142,21283783,7133.166023,ARG,32


We add the parameter facet_col='continent' to create faceted column subplots
- Note that in the Americas there is a disparity in life expectancy: a few countries have levels below 65 or above 75, and the majority fall between these levels. Also check whether all countries changed at the same rate across the years. For example, El Salvador changes at a different rate than the others.


In [51]:
fig = px.line(data_frame=df_facet,
             x="year", y="lifeExp", color='country', facet_col ='continent')
fig.show()


We add the parameter facet_row='continent' to create faceted row subplots
- The interpretation is similar to the previous plot. You may ask yourself which option visualises the data best. The answer typically is: try both and see which works best.


In [52]:
fig = px.line(data_frame=df_facet, x="year", y="lifeExp", color='country', facet_row ='continent')
fig.show()


You can add more information to your plot, for example when you hover over a given data point. Add the hover_name and hover_data arguments:
- With hover_name, the value of the iso_alpha column is displayed in bold at the top of the hover tooltip.
- With hover_data, the values of the 'pop' and 'gdpPercap' columns appear at the bottom of the hover tooltip as extra data


In [53]:
fig = px.line(data_frame=df_facet, x="year", y="lifeExp", color='country', facet_col='continent',
             hover_name='iso_alpha',
             hover_data=['pop','gdpPercap'])
fig.show() 


You can add a range slider when visualising time series in a line plot
Consider the Apple stock price dataset, loaded from a CSV file hosted on GitHub


In [54]:
df = pd.read_csv('https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/finance-charts-apple.csv')
print(df.shape)
df.head(3)

(506, 11)


Unnamed: 0,Date,AAPL.Open,AAPL.High,AAPL.Low,AAPL.Close,AAPL.Volume,AAPL.Adjusted,dn,mavg,up,direction
0,2015-02-17,127.489998,128.880005,126.919998,127.830002,63152400,122.905254,106.741052,117.927667,129.114281,Increasing
1,2015-02-18,127.629997,128.779999,127.449997,128.720001,44891700,123.760965,107.842423,118.940333,130.038244,Increasing
2,2015-02-19,128.479996,129.029999,128.330002,128.449997,37362400,123.501363,108.894245,119.889167,130.884089,Decreasing


You will create the line plot as usual and update the x-axis with rangeslider_visible=True

In [55]:
fig = px.line(df, x='Date', y='AAPL.High', title='Time Series with Rangeslider')
fig.update_xaxes(rangeslider_visible=True)  # add a range slider for time series data
fig.show()


Note that you can drag the handles at either end of the range slider, shown below the plot, to change the visible date range.
- The range slider is useful when you are looking at time series data in a line plot


PRACTICE: We will use the DataFrame below, which uses the 'UK MacroData' dataset.

In [56]:
df = pd.read_csv('UK MacroData.csv', encoding='latin1')
df['Date'] = pd.to_datetime(df['Date'].str[:4] + '-' + df['Date'].str[-2:] )
print(df.shape)
df.head()


(999, 5)



Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



Unnamed: 0,Date,GDP (£ m),CPI,Bank Rate,Gross Fixed Capital Formation (Investments)
0,2000-01-01,401242.0,1.1,5.875,69114.0
1,2000-04-01,404196.0,1.0,6.0,73074.0
2,2000-07-01,406795.0,1.2,6.0,68011.0
3,2000-10-01,409411.0,1.4,6.0,70115.0
4,2001-01-01,413054.0,1.3,5.75,70186.0
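
The warning above appears because pandas has to guess the date format. The parsed dates fall on quarter starts, so one possible way to make the parsing explicit is to treat the labels as quarterly periods. This is only a sketch: the raw labels below are hypothetical, since the actual format of the Date column in the file is an assumption here.

```python
import pandas as pd

# Hypothetical raw labels; the real file's format is an assumption
raw = pd.Series(["2000 Q1", "2000 Q2", "2000 Q3", "2000 Q4"])

# Build labels such as "2000Q1", then parse them as quarterly periods
quarter_labels = raw.str[:4] + "Q" + raw.str[-1]
dates = pd.PeriodIndex(quarter_labels, freq="Q").to_timestamp()

print(dates)
```

to_timestamp() returns the first day of each quarter, which matches the dates shown in the output above.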


You are interested in using a Plotly line plot to visualise the GDP (£ m) values over time

In [57]:
# df = df.sort_values('Date')
fig = px.line(df, x='Date', y='GDP (£ m)', title='GDP over time',
              labels={'Date': 'Year', 'GDP (£ m)': 'GDP (in £ millions)'})
#fig.update_traces(line=dict(color='royalblue', width=3), mode='lines+markers')
# fig.update_layout(
#     plot_bgcolor='white',
#     xaxis=dict(showgrid=True, gridcolor='lightgrey'),
#     yaxis=dict(showgrid=True, gridcolor='lightgrey'),
#     font=dict(size=14),
# )
fig.update_traces(hovertemplate='Date: %{x|%Y-%m}<br>GDP: £%{y:,.0f}m')
fig.show()

## Area plot

We are already familiar with area plots. You can visualise time series data with an area plot, which can be considered a type of line plot where the area under the line is filled with a colour to facilitate visualisation
Consider the stocks dataset. It has records of stock prices for multiple technology companies from the beginning of 2018 until the end of 2019.

In [58]:
df = px.data.stocks().head(50)
df.head(3)

Unnamed: 0,date,GOOG,AAPL,AMZN,FB,NFLX,MSFT
0,2018-01-01,1.0,1.0,1.0,1.0,1.0,1.0
1,2018-01-08,1.018172,1.011943,1.061881,0.959968,1.053526,1.015988
2,2018-01-15,1.032008,1.019771,1.05324,0.970243,1.04986,1.020524


Note the data is in a wide format. We will convert it to a long format using pd.melt(); if you want a refresher from the Pandas lesson, check the pd.melt() documentation

In [59]:
df = (pd.melt(frame=df, var_name='Company', value_name='Price',
            value_vars=['GOOG', 'AAPL', 'AMZN', 'FB', 'NFLX', 'MSFT'],
            id_vars='date')
)

df.head(10)

Unnamed: 0,date,Company,Price
0,2018-01-01,GOOG,1.0
1,2018-01-08,GOOG,1.018172
2,2018-01-15,GOOG,1.032008
3,2018-01-22,GOOG,1.066783
4,2018-01-29,GOOG,1.008773
5,2018-02-05,GOOG,0.941528
6,2018-02-12,GOOG,0.993259
7,2018-02-19,GOOG,1.022282
8,2018-02-26,GOOG,0.978852
9,2018-03-05,GOOG,1.052448
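
To see the effect of pd.melt() on the shape of the data, here is a minimal sketch on a three-row, two-company excerpt of the table above. The names wide and long are illustrative.

```python
import pandas as pd

# Small wide-format excerpt: one column per company
wide = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-08", "2018-01-15"],
    "GOOG": [1.0, 1.018172, 1.032008],
    "AAPL": [1.0, 1.011943, 1.019771],
})

# Long format: one row per (date, company) pair
long = pd.melt(frame=wide, id_vars="date",
               value_vars=["GOOG", "AAPL"],
               var_name="Company", value_name="Price")

print(wide.shape, long.shape)
```

The long table has one row for every combination of date and company, so 3 dates and 2 companies give 6 rows.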


We plot an area chart with px.area(); see its documentation for the full argument list.
- Note the plot is stacked, meaning that for each value of x, the y values of the companies are added on top of each other. We are not interested in that here since we want to see the variation of each company separately.


In [60]:
fig = px.area(df,x='date', y='Price', color='Company')
fig.show()
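
To make the idea of stacking concrete, the sketch below reproduces it numerically with pandas on a tiny excerpt: the upper edge of each band is the running total of the prices below it. The column order (GOOG first, then AAPL) is an assumption chosen for illustration.

```python
import pandas as pd

long = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-01", "2018-01-08", "2018-01-08"],
    "Company": ["GOOG", "AAPL", "GOOG", "AAPL"],
    "Price": [1.0, 1.0, 1.018172, 1.011943],
})

# One column per company, in the order GOOG then AAPL
wide = long.pivot(index="date", columns="Company", values="Price")[["GOOG", "AAPL"]]

# In a stacked chart, each band's upper edge is the running total
stacked_edges = wide.cumsum(axis=1)
print(stacked_edges)
```

The last column of stacked_edges is the total across companies, which is why a stacked chart makes it hard to read an individual company's level.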


We will facet the plot to get the effect we want: unstack the prices and see each company's evolution. We also set the facet_col_wrap argument, which defines the maximum number of facet columns.

In [61]:
fig = px.area(df,x='date',y='Price', color='Company',
             facet_col="Company", facet_col_wrap=2)
fig.show()


Let's see an example where stacked data is useful
Consider the gapminder dataset; it has the population, Gross Domestic Product, and life expectancy records for more than 140 countries. Each row represents a country in a given year
- We will subset data for Oceania in this exercise


In [63]:
df = px.data.gapminder().query("continent == 'Oceania'")
df.head(10)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
60,Australia,Oceania,1952,69.12,8691212,10039.59564,AUS,36
61,Australia,Oceania,1957,70.33,9712569,10949.64959,AUS,36
62,Australia,Oceania,1962,70.93,10794968,12217.22686,AUS,36
63,Australia,Oceania,1967,71.1,11872264,14526.12465,AUS,36
64,Australia,Oceania,1972,71.93,13177000,16788.62948,AUS,36
65,Australia,Oceania,1977,73.49,14074100,18334.19751,AUS,36
66,Australia,Oceania,1982,74.74,15184200,19477.00928,AUS,36
67,Australia,Oceania,1987,76.32,16257249,21888.88903,AUS,36
68,Australia,Oceania,1992,77.56,17481977,23424.76683,AUS,36
69,Australia,Oceania,1997,78.83,18565243,26997.93657,AUS,36


We are interested in plotting how each country's population evolved across the years. In this context, a stacked plot is useful since it lets us compare each country's share of the total population.
- For example, we see that Australia has a much larger population than New Zealand


In [64]:
fig = px.area(df,x='year',y='pop', color='country')
fig.show()


We have to be careful when using a stacked area plot. The example above had two countries. When we have more countries, two concerns arise:
- colours are repeated across countries
- the lines become cluttered

We will subset the countries from the Americas


In [65]:
df = px.data.gapminder().query("continent == 'Americas'")
df.head(3)

Unnamed: 0,country,continent,year,lifeExp,pop,gdpPercap,iso_alpha,iso_num
48,Argentina,Americas,1952,62.485,17876956,5911.315053,ARG,32
49,Argentina,Americas,1957,64.399,19610538,6856.856212,ARG,32
50,Argentina,Americas,1962,65.142,21283783,7133.166023,ARG,32


In [66]:
fig = px.area(df,x='year',y='pop', color='country')
fig.show()


The colours can be managed with the color_discrete_sequence argument. You can preview the available colour sequences with the command px.colors.qualitative.swatches()

In [67]:
px.colors.qualitative.swatches()

You select a sequence by appending its name to px.colors.qualitative; for example, if you want Alphabet, it will be px.colors.qualitative.Alphabet. We picked Alphabet since there are many countries and Alphabet has a wide range of colours
- The graph will improve since there will be no more repeated colours; however, the "cluttered" effect may remain
- The insight here is to grasp which countries contributed most to the total population over the years

Naturally, you can set the color_discrete_sequence argument in other plot types where you are colouring the plot with a categorical variable

In [68]:
fig = px.area(df,x='year',y='pop', color='country',
             color_discrete_sequence=px.colors.qualitative.Alphabet,
             )
fig.show()


## Histogram

Consider the tips dataset. Yes, the one from Seaborn. It holds records of waiter tips together with the day of the week, time of day, total bill, gender, whether it is a table of smokers or not, and how many people were at the table.


In [69]:
df = px.data.tips().sample(n=150, random_state=1)
print(df.shape)
df.head()

(150, 7)


Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
67,3.07,1.0,Female,Yes,Sat,Dinner,1
243,18.78,3.0,Female,No,Thur,Dinner,2
206,26.59,3.41,Male,Yes,Sat,Dinner,3
122,14.26,2.5,Male,No,Thur,Lunch,2
89,21.16,3.0,Male,No,Thur,Lunch,2


You can create a histogram using px.histogram(); see its documentation for the full argument list. The main arguments are:
- data_frame
- x for the variable whose distribution you want to see

Use cases for additional parameters are discussed below.


In [70]:
fig = px.histogram(data_frame=df, x="tip")
fig.show()


You can create a histogram with a marginal box plot by adding marginal='box'
- It helps you to quickly see outliers, the median, Q1 and Q3


In [71]:
fig = px.histogram(df, x="tip", marginal="box")
fig.show()
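
The quantities summarised by a box plot can also be computed by hand. The sketch below uses a small made-up series of tips and the common 1.5 × IQR rule for outliers; Plotly's own quartile method may differ slightly, so treat the numbers as an approximation of what the plot shows.

```python
import pandas as pd

# Made-up tip values for illustration
tips = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 10], name="tip")

q1, median, q3 = tips.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = tips[(tips < lower_fence) | (tips > upper_fence)]

print(q1, median, q3, outliers.tolist())
```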


You can add more information to your histogram. For example, the color argument takes a variable, typically categorical, and assigns a colour to each of its values.

In [72]:
fig = px.histogram(df, x="tip", color="sex", marginal="box")
fig.show()


We can use facet_col and facet_row, which we learned in a previous section, to divide the plot further and look for more insights
- Note, for example, that in this sample tips greater than 6 occur at dinner on weekends


In [73]:
fig = px.histogram(df, x="tip", color="sex", facet_col='day', facet_row='time')
fig.show()
