# Visualizations with `plotly.express`
### by [Jason DeBacker](http://jasondebacker.com), September 2025

This Jupyter Notebook will walk you through examples of plots using [`plotly.express`](https://plotly.com/python/plotly-express/) in Python.  We'll cover the types of plots commonly used in economics applications: line and bar graphs, histograms, scatter plots, and whisker plots.  We'll also look at representing geospatial data on shaded maps.

As an application, we'll look at Chetty, Friedman, and Hendren's (and others') [Opportunity Insights Project](https://opportunityinsights.org).  This project, which has produced a series of [papers](https://opportunityinsights.org/paper/), seeks to document and understand variation in economic, health, and other outcomes across geography and time in the United States.  To do this, the project contributors have used administrative data from a number of sources, along with survey data, to consider childhood exposure to various factors and link that exposure to adult outcomes.

The website that hosts this project makes a number of datasets available to other researchers.  We'll use data from ["Where is the Land of Opportunity? The Geography of Intergenerational Mobility in the United States"](https://eml.berkeley.edu/~saez/chetty-friedman-kline-saezQJE14mobility.pdf) by Chetty, Hendren, Kline, and Saez (*Quarterly Journal of Economics*, 2014).  Specifically, we'll download the Excel workbook with the online data tables [here](https://opportunityinsights.org/data/).


## Line Plots

One of the major contributions of this project is a transition matrix that gives the nationwide probability of a child ending up at a different point in the income distribution than her parents.  The transitions in this matrix are thus intergenerational.  They tell us the probability that a child born in 1980-82 moves from a household with parents at a certain percentile in the distribution of family income to another percentile as an adult (family income of the child is measured in 2011-2012, when the child is approximately 30 years old).

One measure of mobility is what is called the "immobility rate" or "staying probability".  The staying probability at a given percentile is the corresponding diagonal element of the transition matrix, and averaging the diagonal elements gives a single summary measure.  To see how mobility varies over the distribution of family income, let's plot the staying probabilities at each percentile in the distribution of family income.
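
To make the role of the diagonal concrete, here is a minimal sketch using a made-up 3-by-3 transition matrix.  The numbers and the names `toy_transmat`, `toy_staying`, and `toy_immobility` are illustrative assumptions, not values from the Opportunity Insights data.  Rows index the parents' income group and columns index the child's, so each row sums to one.

```python
import numpy as np

# Hypothetical 3x3 transition matrix (illustrative numbers only):
# entry [i, j] is the probability that a child whose parents are in
# income group i ends up in income group j as an adult.
toy_transmat = np.array([
    [0.60, 0.30, 0.10],
    [0.25, 0.50, 0.25],
    [0.10, 0.20, 0.70],
])

# Each row is a probability distribution, so it should sum to one
row_sums = toy_transmat.sum(axis=1)

# Staying probabilities are the diagonal elements
toy_staying = np.diag(toy_transmat)

# Averaging them gives a single summary of immobility
toy_immobility = toy_staying.mean()
print(toy_staying, toy_immobility)
```

The real matrix has one row and column per percentile, but the diagonal is read off in the same way.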

In [1]:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.io as pio
# pio.renderers.default = 'jupyterlab'
pio.renderers.default = 'notebook'
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Read in data from Excel workbook directly from URL
transmat = pd.read_excel('http://www.equality-of-opportunity.org/data/descriptive/table1/online_data_tables.xls',
                         sheet_name="Online Data Table 1", header=0, skiprows=8, index_col=0)
# Note how to select a specific worksheet with sheet_name
# Note how to skip the header rows with skiprows


In [None]:
# Line plot by centile
# select the diagonal - the probability of staying in the same percentile
# first, drop columns that don't have a corresponding row
# transmat.drop([2, 3, 5, 6], axis=1, inplace=True)
staying_probs = np.diag(transmat)
# range 1 to 100 for centile
x = range(1, len(staying_probs) + 1)
# plotly express wants data in a dataframe, so put x and staying_probs in a df
df = pd.DataFrame({"staying_probs": staying_probs, "centile": x})
# simple plot
px.line(df, x='centile', y='staying_probs')

In [None]:
# Formatting options
fig = px.line(df, x='centile', y='staying_probs',
        title='Mobility by Percentile of Family Income', # add title
        labels={'staying_probs': 'Staying Probabilities',
               'centile': 'Percentile of Family Income'}, #change labels
        template="simple_white", # set a template/theme
        color_discrete_sequence=['red'])  # change color of line
fig.update_xaxes(range=[3, 97]) # set axis range

In [None]:
# adding a vertical line and annotation to plot
fig = px.line(df, x='centile', y='staying_probs',
        title='Mobility by Percentile of Family Income', # add title
        labels={'staying_probs': 'Staying Probabilities',
               'centile': 'Percentile of Family Income'}, #change labels
        template="simple_white", # set a template/theme
        color_discrete_sequence=['red'])  # change color of line
fig.add_vline(x=3, line_width=2, line_dash="dash", line_color="green",
              annotation_text="This line denotes where family<br>income is positive",
              annotation_position="top right")

## Bar Plots

We can represent these data as a bar plot as well.

In [None]:
# Bar plot by group (here, percentile)
px.bar(df, x='centile', y='staying_probs',
       title='Mobility',
       labels={'staying_probs': 'Staying Probabilities', })

But bar graphs are more interesting with data by category.  Let's look at measures of mobility by state.

In [None]:
# Read in data by county
county_df = pd.read_excel('http://www.equality-of-opportunity.org/data/descriptive/table1/online_data_tables.xls',
                         sheet_name="Online Data Table 3", header=0, skiprows=list(range(0, 29)) + [30], index_col=0)

# replace missing values with zeros
county_df.fillna(value=0, inplace=True)

# create dataframe with state-level data
# compute state-level values using a weighted average of counties
state_df = pd.DataFrame({'Absolute Upward Mobility' : county_df.groupby('State').
                         apply(lambda x: np.average(x['Absolute Upward Mobility'],
                                                    weights=x['Number of Children in Core Sample'])),
                         'Gini' : county_df.groupby('State').
                         apply(lambda x: np.average(x['Gini'],
                                                    weights=x['Number of Children in Core Sample']))})

state_df.head(n=10)
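
The weighted average used above can be checked on a toy dataset.  This is only a sketch: the dataframe `toy_counties`, its column names, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical county-level data (illustrative values only)
toy_counties = pd.DataFrame({
    "State": ["A", "A", "B"],
    "mobility": [40.0, 50.0, 45.0],
    "n_children": [100, 300, 200],
})

# Weighted average of mobility within each state, weighting by children
toy_state = (
    toy_counties.groupby("State")[["mobility", "n_children"]]
    .apply(lambda g: np.average(g["mobility"], weights=g["n_children"]))
)

# The unweighted mean, for comparison
toy_unweighted = toy_counties.groupby("State")["mobility"].mean()
print(toy_state)
print(toy_unweighted)
```

For state A, the weighted average (47.5) is pulled toward the larger county, while the unweighted mean (45.0) treats both counties equally.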

In [None]:
# Bar plot of mobility by state
px.bar(state_df, y="Absolute Upward Mobility",
       title="Mobility by State")

In [None]:
# Sorting the bar plot
# just sort dataframe first
state_df.sort_values(by='Absolute Upward Mobility', inplace=True)
px.bar(state_df, y="Absolute Upward Mobility",
       title="Mobility by State")

## Maps

The bar graph above gives us a good sense of the variation across states in terms of upward mobility.  We can see that there is a range of mobility outcomes - relatively low mobility in Washington D.C. and relatively high mobility in Wyoming and North Dakota.  By rank ordering the mobility outcomes we can generally see that states in the southeastern U.S. tend to have low upward mobility and states in the upper midwest seem to have the highest mobility.

But to really visualize the geographic variation, we might want to see these mobility outcomes represented on a map.

In [None]:
from urllib.request import urlopen
import json
# read in a geojson file with information on the borders of states
with urlopen('https://eric.clst.org/assets/wiki/uploads/Stuff/gz_2010_us_040_00_500k.json') as response:
    state_map_data = json.load(response)
# in state data, make state a column, not index
state_df.reset_index(inplace=True)
# create the map with px
fig = px.choropleth_mapbox(state_df, geojson=state_map_data,
                           locations="State",  # name of location in the dataframe
                           featureidkey="properties.NAME",   # name of the location in the geojson
                           color='Absolute Upward Mobility',
                           color_continuous_scale="Viridis",
                           mapbox_style="carto-positron",
                           zoom=2, center = {"lat": 37.0902, "lon": -95.7129},
                           opacity=0.5,
                           title='Absolute Upward Mobility Across States'
                          )
fig.show()
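
One thing that can go wrong with a choropleth is that the values in the `locations` column don't match the GeoJSON property named in `featureidkey`, in which case those regions are not shaded.  The sketch below shows one way to check for mismatches.  It uses a tiny hand-written GeoJSON-like dictionary and a made-up list of state names, both hypothetical stand-ins for `state_map_data` and `state_df['State']`.

```python
# Hypothetical stand-in for the GeoJSON loaded above: each feature
# stores the state name under properties -> NAME
toy_geojson = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NAME": "Georgia"}, "geometry": None},
        {"type": "Feature", "properties": {"NAME": "Wyoming"}, "geometry": None},
    ],
}

# Hypothetical stand-in for the 'State' column of the dataframe
toy_states = ["Georgia", "Wyoming", "Washington DC"]

# Names the map knows about
geo_names = {f["properties"]["NAME"] for f in toy_geojson["features"]}

# Names in the data with no matching feature will not be drawn
unmatched = sorted(set(toy_states) - geo_names)
print(unmatched)
```

Running the same set difference on the real objects is a quick way to spot spelling or abbreviation differences before plotting.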

## Histograms

Another way to see how mobility varies would be to look at the distribution across counties through a histogram.

In [None]:
px.histogram(county_df[county_df['Absolute Upward Mobility'] > 0]['Absolute Upward Mobility'],
             title='Distribution of Upward Mobility Across Counties',
             opacity=0.5)

## Scatter Plots


To visualize correlations between variables, scatter plots can be quite useful.  

Let's use a scatter plot to explore the correlation between mobility and teenage childbirth.  [Kearney and Levine (*JHR*, 2014)](http://jhr.uwpress.org/content/49/1/1.refs) find that increases in inequality drive increases in teenage childbearing.  But their story largely relies on mobility - that teenagers from areas of high inequality view their prospects of economic mobility as low and so do not postpone pregnancy.  Of course, causation can go in the other direction - pregnancy may make mobility difficult - and this is something Kearney and Levine grapple with.

In any case, let's see if there is a relationship in the data by looking at a scatter plot.

In [None]:
county_df['99/25 Ratio'] = county_df['Parent Income P99'] / county_df['Parent Income P25']
plot_data = county_df[county_df['Absolute Upward Mobility'] > 0]
px.scatter(plot_data, x='Teenage Birth Rate', y='Absolute Upward Mobility',
           title="Mobility and Teenage Births",
           opacity=0.5)

In [None]:
# add a line of best fit with the 'trendline' keyword (requires statsmodels)
# and color the points by the county's Gini coefficient
px.scatter(plot_data, x='Teenage Birth Rate', y='Absolute Upward Mobility',
           title="Mobility and Teenage Births", color='Gini', color_continuous_scale="Viridis",
           opacity=0.5, trendline="ols")
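
The `trendline="ols"` option fits an ordinary least squares line of `y` on `x`.  As a rough sketch of what that line represents, the same kind of slope and intercept can be computed with `numpy.polyfit`.  The arrays `toy_x` and `toy_y` below are made up for illustration and are not the county data.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 50 - 20x
toy_x = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
toy_y = np.array([48.0, 46.0, 44.0, 42.0, 40.0])

# A degree-1 polynomial fit is an OLS regression of y on x with an intercept;
# polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(toy_x, toy_y, deg=1)

# Fitted values along the line of best fit
toy_fitted = intercept + slope * toy_x
print(slope, intercept)
```

A negative slope in a fit like this would correspond to a downward-sloping trendline in the scatter plot.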

## 3D Plots

Sometimes it's useful to see your data in more than two dimensions.  `plotly.express` allows for some nice 3D plotting.  A typical application is to look at a 3D meshgrid of a surface.  But one can plot histograms or scatter plots in 3D as well.

Since we have the data loaded, let's just look at a 3D scatter plot to illustrate these capabilities.  We considered the relationship between mobility and teen pregnancy above.  Now let's add another covariate, parents' income.

In [None]:
# 3D plot
plot_data2 = plot_data[plot_data['Mean Parent Income'] < 100000]
fig_3d = px.scatter_3d(plot_data2, x='Teenage Birth Rate', y='Mean Parent Income',
           z='Absolute Upward Mobility', opacity=0.1)
fig_3d.show()

## Saving plotly figures

`plotly` figures are dynamic by default.  You can save a dynamic/interactive version (e.g., as html to load in a browser) or you can save a static image (e.g., as a png file to put in a paper).  The latter will require an additional installation.

To save static images from `plotly`, you'll need to install [Kaleido](https://github.com/plotly/Kaleido).  Install with:
```
conda install -c conda-forge python-kaleido
```

or 

```
pip install -U kaleido
```

In [None]:
# 1, save a dynamic plot that you can open in a web browser
fig_3d.write_html("../Python/scatter_3d_plot.html")

In [None]:
# 2, save a static image
fig_3d.write_image("scatter_3d_plot.png")