# CSS 201.5 - CSS MA Bootcamp

## Python for Data Analysis

# Plot.ly

## Plot.ly

- Plotly is an interactive graph library.

- It was built for the web and is implemented in JavaScript.

- But there are great wrappers for R, Python, and Julia, so we do not need to learn JavaScript.

## Plot.ly

The advantages of Plotly are significant:

1. Fast

2. (relatively) Easy

3. Interactive

4. Customizable: If you are patient, you can find a way to plot almost whatever you want.

## Plot.ly

Ways to find how to build your plot quickly:

- Plotly has some of the best documentation of any free software I have seen.
    - Here it is: https://plotly.com/python/

- Unfortunately, there are no good books on it.
    + Makes sense: Plotly is very customizable. I am sure that even people who know it well have yet to learn most of what it can do.

- I follow the documentation closely, but adapt the examples to social science problems.

- I have never tried asking ChatGPT to write Plotly code. If any of you do, please let me know about the quality of the results.

## Plot.ly

Let us install it first.

Go to the terminal and do the following:

```
pip install plotly
```

And by the way, this is how you install Python packages in general: `pip install` followed by the package name.

**Exercise:** Go to your terminal and install plotly.

## Plot.ly

Plotly has two components:

1. **Data**: the data itself

2. **Layout**: the details of your plot

And if you happen to do animated plots (yes, we are going to do some of those!), then:

3. **Frames**: The step-by-step animation for the plot.

A simple static plot, such as a barplot, has only the first two.

## Plot.ly

Let's load some data:

In [None]:
## Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn (static plots)
import plotly.express as px # Plotly express!
import plotly.graph_objects as go # Plotly graph objects!
from plotly.subplots import make_subplots

## Datasets
# Political and Economic Risk
perisk = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/PErisk.csv')
perisk = perisk.set_index('country')

# Tips
tips = pd.read_csv('https://raw.githubusercontent.com/umbertomig/qtm151/main/datasets/tips.csv')
tips = tips.set_index('obs')

## Plot.ly

There are two families of plots in general:

1. Plots that we do for one single variable

2. Plots that we do for interactions between variables

A `how to plot` class has to start with the first and develop the second.

## Plot.ly

### Single variable plots

- **Why?** We want to learn about our dataset. Very exploratory.

- **What** to know before starting? The type of the variable determines the type of plot.
    1. Quantitative Continuous (e.g., `income`, `height`, `debt`, `gdp`): Histogram, Boxplots, Violinplots, etc
    2. Quantitative Discrete (few categories) and qualitative: Barplots, Piecharts
    
- **How?** The plot.ly commands!
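
The rule of thumb above can be sketched as a small helper. The function `suggest_plot` and its `max_categories` threshold are hypothetical (they are not part of Plotly or pandas), and the toy data is made up:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def suggest_plot(series, max_categories=10):
    """Hypothetical helper: suggest a single-variable plot from the variable type."""
    # Numeric with many distinct values: treat as quantitative continuous
    if is_numeric_dtype(series) and series.nunique() > max_categories:
        return 'histogram'
    # Few distinct values or non-numeric: treat as discrete / qualitative
    return 'barplot'

# Toy data, made up for illustration only
toy = pd.DataFrame({
    'income': [float(i) * 1.5 for i in range(50)],
    'rating': [i % 5 for i in range(50)],
    'region': ['north', 'south'] * 25,
})
suggestions = {col: suggest_plot(toy[col]) for col in toy.columns}
print(suggestions)
```

The threshold of 10 categories is an arbitrary choice; in practice you should look at the variable and decide for yourself.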

## Plot.ly

### Single variable plots -- Histograms

- Great for visualizing quantitative data.

- Syntax (Plotly express): `px.histogram(data_frame = ..., x = ..., nbins = ...)`

In [None]:
## Histogram
g = px.histogram(data_frame = perisk, x = 'gdpw2', nbins = 10)
g.show()

**Exercise:** Build a histogram of total bills in the tips dataset.

In [None]:
## Your answers here

## Plot.ly

### Single variable plots -- Histograms

Customizations:

1. Orientation: 'h' or 'v'
2. Marginal: 'violin', 'box', or 'rug'
3. Adding title: `g.update_layout({'title':{'text':'My title'}})`
4. Update x-axis label: `g.update_xaxes(title_text='X Label')`
5. Update y-axis label: `g.update_yaxes(title_text='Y Label')`

In [None]:
## Histogram (more at https://plotly.com/python/histograms/)
g = px.histogram(
    data_frame = perisk,
    x = 'gdpw2',
    nbins = 10,
    marginal = 'rug')
g.update_layout({'title':{'text':'Log of GDP per capita'}})
g.show()

**Exercise**: Build a histogram of `tip` in the `tips` dataset.

In [None]:
## Your answers here

## Plot.ly

### Single variable plots -- Barplots

- Barplots are good for displaying a count of a discrete variable.

- Syntax (Plotly express): `px.bar(data_frame = ..., x = ..., y = ...)`
    + Need to create a table!

In [None]:
tab = perisk.groupby(["prscorr2"]).size().reset_index(name = "counts")

g = px.bar(
    data_frame = tab,
    x = 'prscorr2',
    y = 'counts',
    title = "Corruption Barplot")

g.show()

**Exercise**: Build a barplot of the week days in the `tips` dataset.

In [None]:
## Your answers here

## Plot.ly for multiple variables

## Plot.ly

- When we have two or more variables, there are several plots we can use to explore the relationships in the data.

- The packages we need were already loaded at the beginning of this lecture.

## Plot.ly

Suppose a researcher wants to investigate whether women and men implement different policies.

She has data on Indian villages, where seats were randomly reserved for women.

Let us look at the data:

In [None]:
india = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30public/main/data/india.csv')
india.head(2)

## Plot.ly

Let us first explore each of the variables. For `female`:

In [None]:
# Descriptive of female
fig = px.bar(india.female.value_counts())
fig.show()

## Plot.ly

For `water`:

In [None]:
# Descriptive of water
fig = px.histogram(india, x = 'water')
fig.show()

## Plot.ly

For `irrigation`:

In [None]:
# Descriptive of irrigation
fig = px.histogram(india, x = 'irrigation')
fig.show()

## Plot.ly

The hypothesis here is that women care more about drinking water while men care more about irrigation. Is this true?

In [None]:
print(india.head())

# Grouped bar: places the bars for each y-variable side by side.
fig = px.bar(india, 
             x = "female", 
             y = ["irrigation", 'water'], 
             hover_data = ['village'],
             barmode = "group") # Exercise: change this to 'stack'
fig.show()

## Plot.ly

A better way to visualize it is to use plots that are aligned with the underlying data variation.

In this case, `female` is qualitative and `irrigation` / `water` are quantitative.

A plot for Qualitative x Quantitative is the `boxplot`:

In [None]:
# Well, we tried... Exercise: try for water
fig = px.box(india, x = "female", y = "irrigation", points="all")
fig.show()

**Exercise:** Check if countries with high levels of corruption have lower GDP per capita.

1. Create a variable for high level of corruption

2. Create a variable for GDP per capita (not Log GDPpc).

3. Do the plot.

In [None]:
## Your answers here

## Plot.ly

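And if you want to visualize the difference-in-means estimator, we can build the plot from scratch with graph objects. The error bars below show 95% confidence intervals (1.96 times the standard error):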

```
fig = go.Figure()

fig.add_trace(go.Bar(
    name = 'Male',
    x = ['Irrigation', 'Water'], 
    y = [3.39, 14.74],
    error_y = dict(type = 'data', array = [1.96 * 0.73, 1.96 * 1.3])
))

fig.add_trace(go.Bar(
    name = 'Female',
    x = ['Irrigation', 'Water'], 
    y = [3.02, 23.99],
    error_y = dict(type = 'data', array = [1.96 * 0.64, 1.96 * 4.93])
))

fig.update_layout(barmode = 'group')

fig.show()
```

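With the means and standard errors computed as follows (shown here for `female == 1`):
```
india.loc[india.female == 1, ['irrigation', 'water']].mean()
india.loc[india.female == 1, ['irrigation', 'water']].std() / np.sqrt((india.female == 1).sum())
```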

In [None]:
# After computing the means and standard errors
# (the error bars here show one standard error, not 1.96 times it):
fig = go.Figure()

fig.add_trace(go.Bar(
    name = 'Male',
    x = ['Irrigation', 'Water'], 
    y = [3.39, 14.74],
    error_y = dict(type = 'data', array = [0.73, 1.3])
))

fig.add_trace(go.Bar(
    name = 'Female',
    x = ['Irrigation', 'Water'], 
    y = [3.02, 23.99],
    error_y = dict(type = 'data', array = [0.64, 4.93])
))

fig.update_layout(barmode = 'group')

fig.show()

## Plot.ly

Now, suppose that we want to study education expenditure. How does income affect it? Let's see:

In [None]:
educ = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30/main/datasets/educexp.csv')
educ.head()

## Plot.ly

Both variables are quantitative. A plot that is especially good for this case is the scatterplot.

Remember: `scatterplot` is good for Quantitative x Quantitative relations.

In [None]:
# Education x income
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    hover_data = ['states']
)
fig.show()

**Exercise:** Create a plot of Informal Markets x GDP per capita. Make the user able to hover the mouse over the data points and see the names of the countries.

In [None]:
## Your answers here

## Plot.ly

We can add `histogram`s at the side (or `rug`s, `violin`s, `box`plots). For other marginal plots see [here](https://plotly.com/python/marginal-plots/).

In [None]:
# Education x income
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    hover_data = ['states'],
    marginal_x = 'histogram',
    marginal_y = 'histogram'
)
fig.show()

## Plot.ly

To be honest, these grids annoy me a bit. You can change the template. The options are [here](https://plotly.com/python/templates/), but I like `simple_white` best.

If you are patient enough, you can (and should, if you do lots of presentations for your startup or job) create your own.

In [None]:
# Education x income
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    hover_data = ['states'],
    marginal_x = 'histogram',
    marginal_y = 'histogram',
    template = 'simple_white'
)
fig.show()

## Plot.ly

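And add a trendline (more [here](https://plotly.com/python/linear-fits/)). Options: `ols`, `lowess`, `rolling` (for moving averages), `ewm` (for exponentially weighted moving averages), and `expanding` (for expanding means, medians, maxima...). Note that `ols` and `lowess` require the `statsmodels` package.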

In [None]:
# Education x income
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    hover_data = ['states'],
    marginal_x = 'histogram',
    marginal_y = 'histogram',
    template = 'simple_white',
    trendline = 'ols'
)
fig.show()

## Plot.ly

You can even `log` the axes to compress large values, so that far-apart data points look closer together:

In [None]:
# Education x income
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    hover_data = ['states'],
    template = 'simple_white',
    trendline = 'ols',
    log_x = True, log_y = True
)
fig.show()

## Plot.ly

But let's say you have good reasons to also account for the urban population of each state.

We can map it to the size of the markers:

In [None]:
# Education x income + urban
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    size = 'urban',
    template = 'simple_white',
    hover_data = ['states']
)
fig.show()

## Plot.ly

Or you may change the color of the dots, if you prefer:

In [None]:
# Education x income + urban
fig = px.scatter(
    data_frame = educ,
    x = "income",
    y = "education",
    title = "My Education x Income data (from the 70s!)",
    color = 'urban',
    template = 'simple_white',
    hover_data = ['states']
)
fig.show()

## Plot.ly

Now let us say that you want to analyze all quantitative variables, one against the other, in pairs.

These plots are called [scatterplot matrices](https://plotly.com/python/splom/) (or pairplots in other software).

In [None]:
## Very cool plot!
fig = px.scatter_matrix(
    educ, 
    dimensions = ['education', 'income', 'young', 'urban'],
    template = 'seaborn',
    hover_data = ['states']
)
fig.update_traces(diagonal_visible = False)
fig.update_layout(width = 700, height = 700)
fig.show()

**Exercise:** Create a scatterplot matrix of the profession prestige data with type of profession coloring data points and the type of profession showing when you hover the mouse over the observation.

In [None]:
duncan = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI175public/main/data/Duncan.csv')
## Your answers here

## Plot.ly

If you only care about the overall trends, then use a [heatmap](https://plotly.com/python/heatmaps/) of the correlation matrix.

And for some color scales, see [here](https://plotly.com/python/colorscales/).

In [None]:
# Can be also 'kendall' and 'spearman'
corm = educ.corr(method = 'pearson', numeric_only = True)
## We are tricking the system to make it think the matrix is an image!
fig = px.imshow(corm, 
                color_continuous_scale = 'RdBu_r', 
                zmin = -1, 
                zmax = 1)
fig.show()

## Plot.ly

Now, suppose you want to investigate discrimination in the job market.

Do you think that women and people of color are treated fairly?

Let's check this experimental data:

In [None]:
resumes = pd.read_csv('https://raw.githubusercontent.com/umbertomig/POLI30/main/datasets/resumes.csv')
resumes.head()

## Plot.ly

For this, we need to test a relationship between two Qualitative variables.

In [None]:
fig = px.histogram(
    resumes,
    x = 'race', 
    y = 'call', 
    color = 'sex', 
    barmode = 'group')
fig.show()

## Plot.ly

Or you can stack one on top of the other:

In [None]:
fig = px.histogram(
    resumes,
    x = 'race', 
    y = 'call', 
    color = 'sex', 
    barmode = 'relative') # Can also 'overlay' the categories.
fig.show()

## Plot.ly

Finally, if you want to look at a dataset that you know nothing about, plotly can help you out with an interactive table:

In [None]:
fig = go.Figure(
    data=[
        go.Table(
            header = dict(
                values = list(educ.columns),
                fill_color = 'paleturquoise',
                align = 'left'
            ),
            cells = dict(
                values = [educ[i] for i in educ.columns],
                fill_color = 'lavender',
                align = 'left'
            )
        )
    ]
)

fig.show()

## Plot.ly

Now, suppose that we want to build a figure with multiple plots. Plotly express can only split one type of plot into facets (with `facet_row` and `facet_col`); for arbitrary subplots, we use graph objects with `make_subplots`.

It is going to be harder, but it is still doable.

In [None]:
# Multiple histograms
fig = make_subplots(rows = 2, cols = 2, subplot_titles=['Education', 'Income', 'Young', 'Urban'])
fig.add_trace(go.Histogram(x = educ.education, nbinsx = 10, name='Education'), row = 1, col = 1)
fig.add_trace(go.Histogram(x = educ.income, nbinsx = 10, name='Income'), row = 1, col = 2)
fig.add_trace(go.Histogram(x = educ.young, nbinsx = 10, name='Young'), row = 2, col = 1)
fig.add_trace(go.Histogram(x = educ.urban, nbinsx = 10, name='Urban'), row = 2, col = 2)
fig.show()
educ.head()

## Plot.ly

We can also have buttons. The first view does not look good (all four histograms overlap), but it improves once you pick an option from the dropdown:

In [None]:
# Multiple histograms
fig = go.Figure()
# With method 'update', the first dict in 'args' changes the traces
# and the second one changes the layout (here, the title).
dropdown_buttons = [
    {'label': 'education', 'method': 'update',
     'args': [{'visible': [True, False, False, False]},
              {'title': {'text': 'Education'}}]},
    {'label': 'income', 'method': 'update',
     'args': [{'visible': [False, True, False, False]},
              {'title': {'text': 'Income'}}]},
    {'label': 'young', 'method': 'update',
     'args': [{'visible': [False, False, True, False]},
              {'title': {'text': 'Young'}}]},
    {'label': 'urban', 'method': 'update',
     'args': [{'visible': [False, False, False, True]},
              {'title': {'text': 'Urban'}}]}
]
fig.update_layout({
    'updatemenus':[{
        'type': "dropdown",
        'x': 1.3,
        'y': 0.5,
        'showactive': True,
        'active': 0,
        'buttons': dropdown_buttons}]
})
for var in ['education', 'income', 'young', 'urban']:
    fig.add_trace(go.Histogram(x = educ[var], nbinsx = 10, name = var))
fig.show()

## Plot.ly

And sliders (here we drop the play and stop buttons and keep only the year slider):

In [None]:
gapminder = px.data.gapminder()
fig = px.scatter(
    gapminder, x = 'gdpPercap', y = 'lifeExp', color = 'continent', size = 'pop',
    animation_frame = "year", 
    animation_group = "country", 
    log_x = True, size_max = 45, range_x = [100, 100000], range_y = [25, 90])
fig['layout'].pop('updatemenus')
fig.show()

In [None]:
gapminder.head()

## Plot.ly

Or you can do a full-on animation:

In [None]:
fig = px.scatter(
    gapminder, x = 'gdpPercap', y = 'lifeExp', color = 'continent', size = 'pop',
    animation_frame = "year", 
    animation_group = "country", 
    log_x = True, size_max = 45, range_x = [100, 100000], range_y = [25, 90])
#fig['layout'].pop('updatemenus')
fig.show()

## Plot.ly

You learned in the past three classes how to build interactive plots for one or more variables.

More about that on the amazing [plotly documentation](https://plotly.com/python/).

# Great work!

## Exercise

You now have to explore a more complicated dataset: [QOG Environmental Indicators Codebook](https://www.qogdata.pol.gu.se/data/codebook_ei_sept21.pdf)

In [None]:
envind = pd.read_csv('https://www.qogdata.pol.gu.se/data/qog_ei_sept21.csv', encoding='ISO-8859-1')

## Exercise

Your job:

1. Start a new data report (like we did yesterday)

2. Describe the variables individually

3. Describe relevant variable interactions in the data

4. Use **a few** numeric descriptive stats and **a lot** of plots.

Hint: Build the analysis around the year 2019.

In [None]:
## I suggest you focus on two or three of the following environmental indicators

epivars = ['epi_tbn', 'epi_tbg', 'epi_ghp', 'epi_uwd', 'epi_usd', 'epi_pmd', 
           'epi_bhv', 'epi_pbd', 'epi_par', 'epi_ozd', 'epi_msw', 'epi_had', 
           'epi_wwt', 'epi_snm', 'epi_gib', 'epi_cha', 'epi_cda', 'epi_noa', 
           'epi_bca', 'epi_sda', 'epi_nxa', 'epi_wrs', 'epi_wmg', 'epi_agr', 
           'epi_air', 'epi_ape', 'epi_bdh', 'epi_cch', 'epi_hmt', 'epi_eh', 
           'epi_h2o', 'epi_ev', 'epi_tcl']

## Look at the codebook to learn what they are
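
A possible first step is to keep only the 2019 rows and the indicators you picked. This is a sketch on made-up toy data; I am assuming the dataset has columns named `year` and `cname` (country name), so please check the codebook before relying on these names:

```python
import pandas as pd

# Toy data, made up for illustration only; column names are assumptions
toy = pd.DataFrame({
    'cname': ['A', 'A', 'B', 'B'],
    'year': [2018, 2019, 2018, 2019],
    'epi_air': [40.0, 42.0, 70.0, None],
})

indicators = ['epi_air']

# Keep the 2019 rows and only the columns we need
subset = toy.loc[toy['year'] == 2019, ['cname'] + indicators]

# Count the non-missing values per indicator before plotting
non_missing = subset[indicators].notna().sum()
print(non_missing)
```

Checking for missing values first helps you pick indicators that actually have data for 2019.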