# Highlighting Differences in Data Visualizations

> This notebook explores the Vega-Lite datasets and experiments with them. It serves as the backbone of a bachelor thesis of the same title.

In [1]:
import os
import json

import pandas as pd
import altair as alt
from vega import VegaLite

### Offline Access to Vega Datasets

In [2]:
# pip install vega_datasets
from vega_datasets import data
vega_datasets = data

In [3]:
datasets_ = data.list_datasets()
print("List of all datasets:\n\n", datasets_)

List of all datasets:

 ['7zip', 'airports', 'annual-precip', 'anscombe', 'barley', 'birdstrikes', 'budget', 'budgets', 'burtin', 'cars', 'climate', 'co2-concentration', 'countries', 'crimea', 'disasters', 'driving', 'earthquakes', 'ffox', 'flare', 'flare-dependencies', 'flights-10k', 'flights-200k', 'flights-20k', 'flights-2k', 'flights-3m', 'flights-5k', 'flights-airport', 'gapminder', 'gapminder-health-income', 'gimp', 'github', 'graticule', 'income', 'iowa-electricity', 'iris', 'jobs', 'la-riots', 'londonBoroughs', 'londonCentroids', 'londonTubeLines', 'lookup_groups', 'lookup_people', 'miserables', 'monarchs', 'movies', 'normal-2d', 'obesity', 'ohlc', 'points', 'population', 'population_engineers_hurricanes', 'seattle-temps', 'seattle-weather', 'sf-temps', 'sp500', 'stocks', 'udistrict', 'unemployment', 'unemployment-across-industries', 'uniform-2d', 'us-10m', 'us-employment', 'us-state-capitals', 'volcano', 'weather', 'weball26', 'wheat', 'windvectors', 'world-110m', 'zipcodes']


In [4]:
# Only the .json datasets are of interest here, so datasets of other file types are filtered out.

datasets = []
for dataset in datasets_:
    try:
        path = getattr(data, dataset).filepath
        if path.endswith(".json"):
            datasets.append(dataset)
    except (AttributeError, ValueError):
        continue
        
# anscombe and ohlc are .json datasets too, but they are not suitable for this project
datasets.remove('anscombe')
datasets.remove('ohlc')
print("Available datasets:", datasets)

Available datasets: ['barley', 'burtin', 'cars', 'crimea', 'driving', 'iris', 'wheat']


Two of the .json datasets (anscombe and ohlc) were not suitable and are excluded. All remaining datasets with the file extension .json have been copied to a folder called "datasets".

## randomize.py

Helper functions:<br>
```getNotXYColor(file)```: returns a list of the values that are NOT encoded in the x or y field of the Vega-Lite spec.<br>
```modify(point, file)```: performs two actions: one skews the x and y values of a point, the other adds a new point based on statistical information about the dataset.<br><br>
Main function:<br>
```randomize(json_file, p)```: generates a new JSON file in the altered directory. It scans through each datapoint of the file; with probability p, the datapoint is altered, meaning it is either removed (1/3) or sent to ```modify``` (2/3).
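
The randomization described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of randomize.py: it works on an in-memory list of records instead of reading and writing JSON files, the parameters `x_field`, `y_field` and `seed` are hypothetical, and both the noise scale (10% of the standard deviation) and the equal split between the two `modify` actions are assumptions.

```python
import random
import statistics


def modify(point, records, x_field, y_field, rng):
    """Return the list of points that replaces `point` (hypothetical logic)."""
    xs = [r[x_field] for r in records]
    ys = [r[y_field] for r in records]
    if rng.random() < 0.5:
        # Action 1: skew x and y with Gaussian noise
        # (assumed scale: 10% of each field's standard deviation).
        skewed = dict(point)
        skewed[x_field] = point[x_field] + rng.gauss(0, 0.1 * statistics.pstdev(xs))
        skewed[y_field] = point[y_field] + rng.gauss(0, 0.1 * statistics.pstdev(ys))
        return [skewed]
    # Action 2: keep the point and add a new one drawn around the dataset mean.
    new_point = dict(point)
    new_point[x_field] = rng.gauss(statistics.mean(xs), statistics.pstdev(xs))
    new_point[y_field] = rng.gauss(statistics.mean(ys), statistics.pstdev(ys))
    return [point, new_point]


def randomize(records, p, x_field, y_field, seed=None):
    """Alter each datapoint with probability p: remove (1/3) or modify (2/3)."""
    rng = random.Random(seed)
    altered = []
    for point in records:
        if rng.random() >= p:
            altered.append(point)      # datapoint stays untouched
        elif rng.random() < 1 / 3:
            continue                   # datapoint is removed
        else:
            altered.extend(modify(point, records, x_field, y_field, rng))
    return altered
```

Passing a fixed `seed` makes an altered dataset reproducible, which helps when the same alteration has to be plotted more than once.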

## compare.py

Main function:<br>
```compare(a_json, b_json)```: if the spec is a bar plot, it iterates through the datapoints in a and b. Datapoints that are identical apart from their y values are copied to a new file, and their y value is adjusted.
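
A possible shape of this comparison is sketched below. It is not the actual compare.py: it takes lists of records rather than JSON file names, the `y_field` parameter is hypothetical, and it assumes that the "adjusted" y value is the difference between b and a, which is what a diverging bar plot would display. The source does not state how the y value is adjusted.

```python
def compare(a_records, b_records, y_field):
    """Match bar-plot datapoints of a and b on every field except y_field.

    Returns one record per matched pair; its y value is b minus a
    (an assumed choice of adjustment).
    """
    def key(record):
        # Identity of a datapoint: all fields except the y value.
        return tuple(sorted((k, v) for k, v in record.items() if k != y_field))

    b_by_key = {key(r): r for r in b_records}
    differences = []
    for a in a_records:
        b = b_by_key.get(key(a))
        if b is None:
            continue  # no counterpart in b, e.g. the datapoint was removed
        diff = dict(a)
        diff[y_field] = b[y_field] - a[y_field]
        differences.append(diff)
    return differences
```

Positive and negative differences then map directly onto the two colors of a diverging bar plot.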

# The Datasets

> The following Altair visualizations offer a "View Source" option. <br>
> The source code of each visualization was saved manually.

In [5]:
os.chdir("datasets/datasets_altered")

### Set the modification percentage for the altered plots

In [6]:
percent = 5
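
The altered files used in this notebook are named after the dataset followed by the percentage, for example barley5.json. The load step that is repeated for each dataset could be wrapped in a small helper. The function names `altered_path` and `load_records` are hypothetical and not part of the notebook.

```python
import json
from pathlib import Path


def altered_path(name, percent, directory="."):
    """Build the path of an altered dataset, e.g. barley5.json."""
    return Path(directory) / f"{name}{percent}.json"


def load_records(name, percent, directory="."):
    """Load an altered dataset as a list of records (dicts)."""
    with open(altered_path(name, percent, directory), "r") as f:
        return json.load(f)
```

The returned list can be passed to `pd.DataFrame(...)` exactly as the individual cells do.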

## (anscombe.json)

Anscombe’s Quartet is a famous dataset constructed by Francis Anscombe. The common summary statistics of each series are identical, despite the subsets' different characteristics. <br> It makes sense to plot each series as a subplot. Variations of this dataset do not make much sense; for demonstration purposes in this project, however, we generate them anyway.

In [7]:
# source = vega_datasets.anscombe()

# x = "X"
# y = "Y"
# color = "Series"

In [8]:
# alt.Chart(source).mark_point().encode(
#     x=x,
#     y=y,
#     color=color,
#     tooltip=(x,y)
# ).interactive().facet("Series", columns=2)

In [9]:
# alternation = None
# with open(f"anscombe{percent}.json", 'r') as f:
#     data = json.load(f)
#     alternation = pd.DataFrame(data)

In [10]:
# alt.Chart(alternation).mark_point().encode(
#     x=x,
#     y=y,
#     color=color,
#     tooltip=(x,y)
# ).interactive().facet("Series", columns=2)

## barley.json

Becker’s Barley Trellis charts identify an anomaly in a widely used agricultural dataset, which is called “The Morris Mistake”. They show that "Morris" is the only site whose panel is the reverse of the other panels. The data is usually displayed split by site; here, however, I aggregate the sites into one chart.

In [11]:
source = vega_datasets.barley()

alt.Chart(source).mark_point().encode(
    x = "yield",
    y = "site",
    color = "year:N",
    tooltip=("yield", "site")
)

In [12]:
alternation = None
with open(f"barley{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [13]:
alt.Chart(alternation).mark_point().encode(
    x = "yield",
    y = "site",
    color = "year:N",
    tooltip=("yield", "site")
)

## burtin.json

This dataset was gathered by Will Burtin and is used to explore the effectiveness of various antibiotics in treating a variety of bacterial infections.

In [14]:
source = vega_datasets.burtin()

alt.Chart(source).mark_bar().encode(
    x = "Bacteria",
    y = "Streptomycin",
    color = "Penicillin:N",
    tooltip=("Bacteria", "Penicillin")
)

In [15]:
alternation = None
with open(f"burtin{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [16]:
alt.Chart(alternation).mark_bar().encode(
    x = "Bacteria",
    y = "Streptomycin",
    color = "Penicillin:N",
    tooltip=("Bacteria", "Penicillin")
)

## cars.json

This dataset contains acceleration, horsepower, fuel efficiency, weight, and other characteristics of different makes and models of cars.

In [17]:
source = vega_datasets.cars()

alt.Chart(source).mark_point().encode(
    x = "Horsepower",
    y = "Miles_per_Gallon",
    color = "Cylinders:N",
    tooltip=("Horsepower", "Miles_per_Gallon")
).interactive()

In [18]:
alternation = None
with open(f"cars{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [19]:
alt.Chart(alternation).mark_point().encode(
    x = "Horsepower",
    y = "Miles_per_Gallon",
    color = "Cylinders:N",
    tooltip=("Horsepower", "Miles_per_Gallon")
).interactive()

## crimea.json

This is a dataset containing monthly casualty counts from the Crimean war. 

In [20]:
source = None
with open("../crimea.json", 'r') as f:
    data = json.load(f)
    source = pd.DataFrame(data)
    
alt.Chart(source).mark_bar().encode(
    x = "date:N",
    y = "disease",
    color = "other:O",
    tooltip=("disease", "other")
)

In [21]:
alternation = None
with open(f"crimea{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [22]:
alt.Chart(alternation).mark_bar().encode(
    x = "date:N",
    y = "disease",
    color = "other:O",
    tooltip=("disease", "other")
)

## driving.json

This dataset tracks miles driven per capita along with gas prices annually from 1956 to 2010.

In [23]:
source = vega_datasets.driving()

alt.Chart(source).mark_point().encode(
    x = "miles",
    y = "gas",
    color = "year",
    tooltip=("miles", "gas")
).interactive()

In [24]:
alternation = None
with open(f"driving{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [25]:
alt.Chart(alternation).mark_point().encode(
    x = "miles",
    y = "gas",
    color = "year",
    tooltip=("miles", "gas")
).interactive()

## iris.json

This classic dataset contains lengths and widths of petals and sepals for 150 iris flowers, drawn from three species. It was introduced by R.A. Fisher in 1936.

In [26]:
x = "sepalLength"
y = "sepalWidth"

In [27]:
source = vega_datasets.iris()

alt.Chart(source).mark_point().encode(
    x=x,
    y=y,
    color='species',
    tooltip=(x,y)
).interactive()

In [28]:
# modified iris datasets are available for percent = 5, 10, 15, 20

iris_modified = None
with open(f"iris{percent}.json", 'r') as f:
    data = json.load(f)
    iris_modified = pd.DataFrame(data)

In [29]:
iris_mod_plot = alt.Chart(iris_modified).mark_point().encode(
    x=x,
    y=y,
    color='species',
    tooltip=(x,y)
).interactive()

iris_mod_plot

## (ohlc.json)

OHLC stands for open, high, low and close prices. This dataset contains the performance of the Chicago Board Options Exchange Volatility Index (VIX).

In [30]:
# source = vega_datasets.ohlc()

# alt.Chart(source).mark_bar().encode(
#     x='date:T',
#     color=alt.condition('datum.open <= datum.close',
#                         alt.value('#1a9850'), alt.value('#d73027')),
#     y='low:Q',
#     y2='high:Q',
#     tooltip=('date:T', 'open:Q', 'high:Q', 'low:Q', 'close:Q')
# )

In [31]:
# alternation = None
# with open(f"ohlc{percent}.json", 'r') as f:
#     data = json.load(f)
#     alternation = pd.DataFrame(data)

In [32]:
# alt.Chart(alternation).mark_bar().encode(
#     x='date:T',
#     color=alt.condition('datum.open <= datum.close',
#                         alt.value('#1a9850'), alt.value('#d73027')),
#     y='low:Q',
#     y2='high:Q',
#     tooltip=('date:T', 'open:Q', 'high:Q', 'low:Q', 'close:Q')
# )

## wheat.json

This dataset is based on William Playfair's 1822 chart of the price of wheat. It covers 250 years of wheat prices alongside weekly wages and the reigning monarch.

In [33]:
source = vega_datasets.wheat()

alt.Chart(source).mark_point().encode(
    x = "wages",
    y = "wheat",
    color = "year:Q",
    tooltip=("wages", "wheat")
).interactive()

In [34]:
alternation = None
with open(f"wheat{percent}.json", 'r') as f:
    data = json.load(f)
    alternation = pd.DataFrame(data)

In [35]:
alt.Chart(alternation).mark_point().encode(
    x = "wages",
    y = "wheat",
    color = "year:Q",
    tooltip=("wages", "wheat")
).interactive()

# Diverging Bar Plots

In [36]:
# example dataframe
data = pd.DataFrame({
    'Category': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I'],
    'Value': [10, 8, 6, 4, 2, -2, -4, -6, -8]
})

alt.Chart(data).mark_bar().encode(
    x='Category:N',
    y='Value:Q',
    color=alt.condition(
        alt.datum.Value >= 0,
        alt.value('steelblue'),  # Positive values
        alt.value('orange')  # Negative values
    )
)

In [37]:
# source = vega_datasets.crimea()

# alt.Chart(source).mark_bar().encode(
#     x = "date:N",
#     y = "disease",
#     color = "other:O",
#     tooltip=("disease", "other")
# )

In [38]:
# alternation = None
# with open(f"crimea{percent}.json", 'r') as f:
#     data = json.load(f)
#     alternation = pd.DataFrame(data)

In [39]:
# alt.Chart(alternation).mark_bar().encode(
#     x = "date:N",
#     y = "disease",
#     color = "other:O",
#     tooltip=("disease", "other")
# )

# Datasets info

In [40]:
vega_datasets.wheat.description

"In an 1822 letter to Parliament, William Playfair[1]_, a Scottish engineer who is often credited as the founder of statistical graphics, published an elegant chart on the price of wheat[2]_. It plots 250 years of prices alongside weekly wages and the reigning monarch. He intended to demonstrate that 'never at any former period was wheat so cheap, in proportion to mechanical labour, as it is at the present time.' The electronic dataset was created by Mike Bostock and released into the public domain."

In [41]:
# vega_datasets.wheat()