# Information Presentation & Data Visualisation

## 03 Lab Exercise sheet: Data abstraction

This lab focuses on practising the selection of appropriate data visualisations in Python by matching them to specific task abstractions. For examples, see: https://altair-viz.github.io/gallery/index.html

## Task 1: Import libraries and load datasets

In [38]:
import pandas as pd
from vega_datasets import data as vega_data
import altair as alt

## Task 2: Create a scatter plot

a) Create a scatter plot using the "cars" dataset to visualise the relationship between "Horsepower" and "Miles per Gallon". Set the colour of the points to a specific RGB triplet (e.g., 'rgb(255, 0, 0)' for red).

In [4]:
data_cars = vega_data.cars()
data_cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


In [276]:
# https://altair-viz.github.io/gallery/scatter_tooltips.html

scatter_chart_data_cars = alt.Chart(data_cars).mark_point(color='rgb(255, 0, 0)', filled=True, size=80).encode(
    x=alt.X('Miles_per_Gallon:Q', title='Miles per Gallon'),
    y=alt.Y('Horsepower:Q', title='Horsepower'),
    tooltip=['Name', 'Origin', 'Horsepower', 'Miles_per_Gallon'],
).properties(
    title='Miles per Gallon vs Horsepower',
    width=1000,
    height=400
).interactive()

scatter_chart_data_cars
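
The colour passed to `mark_point` above is a CSS-style `rgb(r, g, b)` string. As a small sketch, a helper can build and validate such strings from integer channels. The name `rgb_string` is hypothetical and not part of Altair.

```python
def rgb_string(red, green, blue):
    """Return a CSS-style 'rgb(r, g, b)' string; each channel must be 0-255."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError(f"channel out of range: {channel}")
    return f"rgb({red}, {green}, {blue})"


pure_red = rgb_string(255, 0, 0)
teal = rgb_string(0, 128, 128)
```

Either string could then be supplied as the `color` argument of `mark_point`.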

## Task 3: Create a line plot

a) Create a line plot to visualise the trend of maximum temperature over time in the "seattle_weather" dataset. Set the background to "grey" and the lines to "black".

In [118]:
data_weather = vega_data.seattle_weather()
data_weather.head()

Unnamed: 0,date,precipitation,temp_max,temp_min,wind,weather
0,2012-01-01,0.0,12.8,5.0,4.7,drizzle
1,2012-01-02,10.9,10.6,2.8,4.5,rain
2,2012-01-03,0.8,11.7,7.2,2.3,rain
3,2012-01-04,20.3,12.2,5.6,4.7,rain
4,2012-01-05,1.3,8.9,2.8,6.1,rain


In [274]:
line_chart_data_weather = alt.Chart(data_weather).mark_line(color="black").encode(
    x=alt.X('date:T', title='Date'),
    y=alt.Y('temp_max:Q', title='Maximum Temperature'),
    tooltip=['date', 'temp_max'],
).properties(
    title='Date vs Maximum Temperature',
    width=1000,
    height=400,
).configure_view(
    fill="grey"  # grey fill behind the plotting area, as the task asks
).interactive()

line_chart_data_weather

## Task 4: Create a bar chart

Create a bar chart to visualise the distribution of population across different age groups in the "population" dataset.

In [15]:
data_population = vega_data.population()
data_population.head()

Unnamed: 0,year,age,sex,people
0,1850,0,1,1483789
1,1850,0,2,1450376
2,1850,5,1,1411067
3,1850,5,2,1359668
4,1850,10,1,1260099


In [272]:
chart_data_population = alt.Chart(data_population).transform_filter(
    alt.datum.year == 2000  # keep only the year named in the title
).mark_bar().encode(
    x=alt.X('age:O', title="Age"),
    y=alt.Y('sum(people):Q', title="People"),  # add both sexes together per age group
    color=alt.Color("age:Q", title="Age"),
    tooltip=[
        alt.Tooltip('age:O', title='Age'),
        alt.Tooltip('sum(people):Q', title='People')
    ]
).properties(
    title='Population by Age Group in 2000',
    width=900,
    height=600
)

chart_data_population
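
A bar chart of population by age group needs one value per age group, so the rows have to be restricted to a single year and summed over both sexes. The sketch below shows that aggregation in pandas on a made-up table with the same columns as the "population" dataset. The numbers are illustrative only.

```python
import pandas as pd

# Toy stand-in for the population table (values are made up for illustration)
toy_population = pd.DataFrame({
    "year":   [1990, 1990, 2000, 2000, 2000, 2000],
    "age":    [0, 0, 0, 0, 5, 5],
    "sex":    [1, 2, 1, 2, 1, 2],
    "people": [90, 85, 100, 95, 110, 105],
})

# Keep one year, then add the two sexes together within each age group
people_by_age = (
    toy_population[toy_population["year"] == 2000]
    .groupby("age")["people"]
    .sum()
)
```

Each entry of `people_by_age` corresponds to the height of one bar.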

## Task 5: Create a stacked bar chart

Create a stacked bar chart to visualise the contribution of different energy sources (e.g., fossil, renewables) to the total net electricity generation in Iowa (dataset: "iowa_electricity") over time.


In [157]:
data_iowa_electricity = vega_data.iowa_electricity()
data_iowa_electricity.head()

Unnamed: 0,year,source,net_generation
0,2001-01-01,Fossil Fuels,35361
1,2002-01-01,Fossil Fuels,35991
2,2003-01-01,Fossil Fuels,36234
3,2004-01-01,Fossil Fuels,36205
4,2005-01-01,Fossil Fuels,36883


In [266]:
chart_data_iowa_electricity = alt.Chart(data_iowa_electricity).mark_bar().encode(
    x=alt.X('year(year):O', title="Year"),  # extract the calendar year from the datetime column
    y=alt.Y('net_generation:Q', title="Net generation"),
    color=alt.Color('source:N', title='Source'),  # keep the legend so each stacked segment can be identified
    tooltip=[
        alt.Tooltip('source:N', title='Source'),
        alt.Tooltip('net_generation:Q', title='Net generation')
    ]
).properties(
    title='Contribution of Different Energy Sources to Total Net Electricity',
    width=1000,
    height=600
)

chart_data_iowa_electricity
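
In a stacked bar chart, the full bar height is the yearly total and each segment is one source's contribution to it. As a rough sketch, the same quantities can be checked numerically by pivoting to one column per source. The table below is made up and only mimics the column names of "iowa_electricity".

```python
import pandas as pd

# Toy stand-in for the electricity table (values are made up for illustration)
toy_generation = pd.DataFrame({
    "year": [2001, 2001, 2002, 2002],
    "source": ["Fossil Fuels", "Renewables", "Fossil Fuels", "Renewables"],
    "net_generation": [30, 10, 20, 20],
})

# One row per year, one column per source
wide = toy_generation.pivot(index="year", columns="source", values="net_generation")

totals = wide.sum(axis=1)          # full bar height per year
shares = wide.div(totals, axis=0)  # fraction of each bar taken by each source
```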

## Task 6: Create a streamgraph

Create a streamgraph to visualise the changing proportions of unemployment across the "Education and Health" and "Information" sectors over time using the "unemployment_across_industries" dataset.

In [192]:
data_unemployment_across_industries = vega_data.unemployment_across_industries()
data_unemployment_across_industries.head()

Unnamed: 0,series,year,month,count,rate,date
0,Government,2000,1,430,2.1,2000-01-01 08:00:00+00:00
1,Government,2000,2,409,2.0,2000-02-01 08:00:00+00:00
2,Government,2000,3,311,1.5,2000-03-01 08:00:00+00:00
3,Government,2000,4,269,1.3,2000-04-01 08:00:00+00:00
4,Government,2000,5,370,1.9,2000-05-01 07:00:00+00:00


In [196]:
# Filter to only 'Education and Health' and 'Information'
filtered_data = data_unemployment_across_industries[
    data_unemployment_across_industries['series'].isin(['Education and Health', 'Information'])
]
filtered_data.head()

Unnamed: 0,series,year,month,count,rate,date
732,Information,2000,1,125,3.4,2000-01-01 08:00:00+00:00
733,Information,2000,2,112,2.9,2000-02-01 08:00:00+00:00
734,Information,2000,3,140,3.6,2000-03-01 08:00:00+00:00
735,Information,2000,4,95,2.4,2000-04-01 08:00:00+00:00
736,Information,2000,5,131,3.5,2000-05-01 07:00:00+00:00


In [264]:
streamgraph_data_unemployment_across_industries = alt.Chart(filtered_data).mark_area().encode(
    x=alt.X('yearmonth(date):T', title='Date', axis=alt.Axis(format='%Y', domain=False, tickSize=0)),
    y=alt.Y('sum(count):Q', stack='center', title='Unemployment Count'),
    color=alt.Color('series:N', title='Sector'),
    tooltip=[
        alt.Tooltip('series:N', title='Sector'),
        alt.Tooltip('sum(count):Q', title='Count'),
        alt.Tooltip('yearmonth(date):T', title='Date')
    ]
).properties(
    title='Streamgraph of Unemployment: Education and Health vs Information',
    width=1000,
    height=400
).interactive()

streamgraph_data_unemployment_across_industries
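
What distinguishes a streamgraph from an ordinary stacked area chart is the baseline: instead of starting every stack at zero, each stack is shifted so that it sits symmetrically around a shared midline. The sketch below shows one way to compute such bands in plain Python. The function name `centered_stack` is hypothetical, and the exact offsets Vega-Lite uses for `stack='center'` may differ in detail.

```python
def centered_stack(layers):
    """Return, for each series, a list of (lower, upper) bands centred on a common midline.

    layers: list of equal-length lists of non-negative values, one list per series.
    """
    n_points = len(layers[0])
    totals = [sum(layer[i] for layer in layers) for i in range(n_points)]
    max_total = max(totals)

    bands = [[] for _ in layers]
    for i in range(n_points):
        # Shift the stack so that it is centred on max_total / 2
        lower = (max_total - totals[i]) / 2
        for k, layer in enumerate(layers):
            bands[k].append((lower, lower + layer[i]))
            lower += layer[i]
    return bands


bands = centered_stack([[1, 3], [1, 1]])
```

At the first time point the total is 2 and at the second it is 4, so the first stack starts at 1 and the second at 0, and both are centred on 2.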


## Task 7: Create an indexed line chart

Create an indexed line chart to visualise the relative price movements of different stocks over time using the "stocks" dataset.

In [216]:
data_stocks = vega_data.stocks()
data_stocks.head()

Unnamed: 0,symbol,date,price
0,MSFT,2000-01-01,39.81
1,MSFT,2000-02-01,36.35
2,MSFT,2000-03-01,43.22
3,MSFT,2000-04-01,28.37
4,MSFT,2000-05-01,25.45


In [262]:
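# Index each stock to 100 at its own first observation so that relative movements are comparable
data_stocks = data_stocks.sort_values(['symbol', 'date'])
data_stocks['price_indexed'] = (
    data_stocks['price'] / data_stocks.groupby('symbol')['price'].transform('first') * 100
)

chart_indexed_stocks = alt.Chart(data_stocks).mark_line().encode(
    x=alt.X('date:T', title='Date'),
    y=alt.Y('price_indexed:Q', title='Indexed Price (first observation = 100)'),
    color=alt.Color('symbol:N', title='Stock Symbol'),
    tooltip=[
        alt.Tooltip('symbol:N', title='Stock'),
        alt.Tooltip('date:T', title='Date'),
        alt.Tooltip('price_indexed:Q', title='Indexed Price', format=".2f")
    ]
).properties(
    title='Indexed Price Movements of Stocks Over Time',
    width=1000,
    height=400
).interactive()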

chart_indexed_stocks
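
An indexed chart rescales each series by its value at a base date, so that stocks with very different price levels can be compared by relative change. A minimal sketch on made-up prices, assuming the base is each stock's first observation and the index starts at 100:

```python
import pandas as pd

# Toy stand-in for the stocks table (symbols and prices are made up)
toy_stocks = pd.DataFrame({
    "symbol": ["A", "A", "A", "B", "B", "B"],
    "date": pd.to_datetime(["2000-01-01", "2000-02-01", "2000-03-01"] * 2),
    "price": [10.0, 15.0, 5.0, 200.0, 250.0, 300.0],
})

# Sort so that "first" really is the earliest observation of each stock
toy_stocks = toy_stocks.sort_values(["symbol", "date"])

# Divide every price by the stock's first price and scale to 100
base_price = toy_stocks.groupby("symbol")["price"].transform("first")
toy_stocks["price_indexed"] = toy_stocks["price"] / base_price * 100
```

Both toy stocks now start at 100, although their raw prices differ by a factor of 20.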

## Task 7: Create a heatmap

Create a heatmap to visualise the relationship between flight distance and average arrival delay in the "flights_5k" dataset. Try out different colour schemes.

In [244]:
data_flights_5k = vega_data.flights_5k()
data_flights_5k.head()

Unnamed: 0,date,delay,distance,origin,destination
0,2001-01-10 18:20:00,25,192,SAT,HOU
1,2001-01-31 16:45:00,17,371,SNA,OAK
2,2001-02-16 12:07:00,21,417,SJC,SAN
3,2001-02-03 17:00:00,-5,480,SMF,SAN
4,2001-01-02 12:16:00,5,833,OKC,PHX


In [258]:
# Heatmap
heatmap_data_flights_5k = alt.Chart(data_flights_5k).mark_rect().encode(
    x=alt.X('distance:Q', bin=alt.Bin(maxbins=50), title='Flight Distance'),
    y=alt.Y('delay:Q', bin=alt.Bin(maxbins=50), title='Arrival Delay (minutes)'),
    color=alt.Color('mean(delay):Q',
                    title='Average Arrival Delay',
                    scale=alt.Scale(scheme='inferno')),  # try other schemes such as 'magma', 'plasma', 'viridis'
    tooltip=[
        alt.Tooltip('mean(delay):Q', title='Avg Arrival Delay', format=".2f"),
        alt.Tooltip('count():Q', title='Number of Flights')
    ]
).properties(
    width=800,
    height=500,
    title='Heatmap of Flight Distance vs Arrival Delay'
).interactive()

heatmap_data_flights_5k
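
The colour of each heatmap cell is an aggregate over the flights that fall into that bin. The sketch below reproduces the idea in pandas for distance bins only, using made-up flights and a fixed bin width of 500. Both the data and the bin width are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the flights table (values are made up for illustration)
toy_flights = pd.DataFrame({
    "distance": [100, 400, 600, 900],
    "delay": [10, 20, -5, 15],
})

# Assign each flight to the lower edge of its distance bin
bin_width = 500
toy_flights["distance_bin"] = (toy_flights["distance"] // bin_width) * bin_width

# Average arrival delay per distance bin
mean_delay = toy_flights.groupby("distance_bin")["delay"].mean()
```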

## Task 8: Create a parallel coordinates plot

Create a parallel coordinates plot to visualise the relationships between sepal length, sepal width, petal length, and petal width for different species of Iris flowers in the "iris" dataset.

In [281]:
data_iris = vega_data.iris()
data_iris.head()

Unnamed: 0,sepalLength,sepalWidth,petalLength,petalWidth,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [285]:
# Parallel coordinates plot
parallel_coords = alt.Chart(data_iris).transform_window(
    index='count()'  # running row number: one identifier per flower
).transform_fold(
    ['sepalLength', 'sepalWidth', 'petalLength', 'petalWidth'],
    as_=['Measurement', 'Value']
).mark_line(opacity=0.7).encode(
    x=alt.X('Measurement:N', title='Feature'),
    y=alt.Y('Value:Q', title='Value'),
    color=alt.Color('species:N', title='Species'),
    detail='index:N',  # draw one line per flower rather than one per species
    tooltip=['species:N', 'sepalLength:Q', 'sepalWidth:Q', 'petalLength:Q', 'petalWidth:Q']
).properties(
    width=1000,
    height=400,
    title='Parallel Coordinates Plot of Iris Flower Measurements'
).interactive()

parallel_coords
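
A parallel coordinates plot needs the data in long form: one row per flower and measurement, plus an identifier that says which rows belong to the same flower. The fold step is roughly equivalent to `melt` in pandas, as this sketch on two made-up flowers shows.

```python
import pandas as pd

# Toy stand-in for the iris table (two flowers, two measurements)
toy_iris = pd.DataFrame({
    "sepalLength": [5.1, 7.0],
    "petalLength": [1.4, 4.7],
    "species": ["setosa", "versicolor"],
})

# reset_index() adds an "index" column that identifies each flower
toy_long = toy_iris.reset_index().melt(
    id_vars=["index", "species"],
    value_vars=["sepalLength", "petalLength"],
    var_name="Measurement",
    value_name="Value",
)
```

Each flower now occupies one row per measurement, and all rows that share an `index` value form one line in the plot.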

## Task 9: Create a pie chart

Create a pie chart to visualise the distribution of cars based on their country of origin in the "cars" dataset.

In [293]:
data_cars = vega_data.cars()
data_cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


In [305]:
# Aggregate count of cars by Origin
origin_counts = data_cars.groupby('Origin').size().reset_index(name='count')

pie_chart_origin_counts = alt.Chart(origin_counts).mark_arc().encode(
    theta=alt.Theta(field='count', type='quantitative'),
    color=alt.Color(field='Origin', type='nominal', legend=alt.Legend(title="Country of Origin")),
    tooltip=[alt.Tooltip('Origin:N', title='Origin'), alt.Tooltip('count:Q', title='Number of Cars')]
).properties(
    title='Distribution of Cars by Country of Origin',
    width=600,
    height=600
)

pie_chart_origin_counts
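
Each pie slice is proportional to its category's share of the total, so the counts can be sanity-checked as fractions before plotting. A small sketch with a made-up list of origins:

```python
import pandas as pd

# Toy stand-in for the Origin column (values are made up for illustration)
toy_origins = pd.Series(["USA", "USA", "Japan", "Europe"], name="Origin")

# Share of each origin, and the angle in degrees its slice would span
shares = toy_origins.value_counts(normalize=True)
angles = shares * 360
```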

### Worksheets should be submitted on Canvas by the Monday after the lab at 23:59.