# Create graphics

Creating graphics usually takes a relatively long time. In this notebook we discuss the most common types of visualisations and how visualisations can be saved. We use the `Matplotlib` library for this purpose. See [here](https://matplotlib.org/stable/tutorials/pyplot.html) for a basic introduction to Matplotlib. The last two sections use `Plotly` ([documentation](https://plotly.com/python/plotly-fundamentals/)).

Since we have just discussed pandas, we assume that the data is available in pandas. The sample data is read from a CSV file that can be found in the same Ilias folder as the notebook and that has to be stored in the same directory as the notebook on your computer.

The libraries must first be imported:

In [None]:
#%pip install matplotlib pandas plotly scipy

In [None]:
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

import plotly.express as px
import plotly.figure_factory as ff

import pandas as pd

## Load data

First we load the data we want to visualize:

- On the one hand, we load the regional portraits of the Swiss municipalities from 2019 (see [here](https://opendata.swiss/de/dataset/regionalportrats-2021-kennzahlen-aller-gemeinden)). Each Swiss municipality (approx. 2000) is described with key figures on population, economy and politics.

- We also load bicycle data from the city of Zurich ([here](https://data.stadt-zuerich.ch/dataset/ted_taz_verkehrszaehlungen_werte_fussgaenger_velo)). We know the number of passing bicycles per hour at one location in the city, i.e. Langstrasse Unterführung, in January 2021.

In [None]:
gem = pd.read_csv("gemeindeportraits.csv", sep=";", encoding="utf-8")

print(gem.info())
display(gem.head(5))

In [None]:
velo = pd.read_csv("velozaehldaten_aufbereitet.csv", sep=",", encoding="utf-8")

print(velo.info())
display(velo.head(5))

## Bar and pie plot

What does the age structure in Zurich look like?

- First we filter the data for Zurich
- Then we select only the desired columns with the age structure
- Finally, we make life easy for ourselves by taking the mean of this one row. Statistically this makes no sense, since the mean of a single row just returns its values, but it gives us a ready-to-use data structure (a pandas Series) for Matplotlib

In [None]:
gem_zh = gem[gem["gemeinde"] == "Zürich"]
gem_zh_alt = gem_zh[["0_19_y", "20_64_y", "65_y"]].mean()

print(gem_zh_alt)
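
The mean over a single row is only a trick to turn a one-row dataframe into a Series. As a minimal sketch, a more direct alternative is to pick the row with `iloc[0]`. The dataframe `demo` and its values are made up for illustration and are not taken from the CSV file.

```python
import pandas as pd

# Made-up example data (not the real municipality figures)
demo = pd.DataFrame({
    "gemeinde": ["A-Dorf", "B-Stadt"],
    "0_19_y": [20.0, 18.0],
    "20_64_y": [60.0, 65.0],
    "65_y": [20.0, 17.0],
})

# Filter one municipality and keep the age columns: still a DataFrame with one row
demo_row = demo[demo["gemeinde"] == "B-Stadt"][["0_19_y", "20_64_y", "65_y"]]

# Variant 1: mean over the single row (the trick used above)
via_mean = demo_row.mean()

# Variant 2: pick the first (and only) row directly
via_iloc = demo_row.iloc[0]

print(via_iloc)
```

Both variants give a Series whose index holds the column names and whose values hold the percentages, which is the shape `plt.bar()` and `plt.pie()` are given below.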

- We select the `bar()` plot
- We pass the `index` of the data (i.e. the labels) as X and the percentage values of the age structure (`values`) as Y.
- Each bar is assigned a fill color (`color`).
- All available colors can be found in this [list](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors)

In [None]:
# Size of the plot
plt.figure(figsize=(8, 6))

# Plot type and input data
plt.bar(gem_zh_alt.index, gem_zh_alt.values, color=['skyblue', 'salmon', 'lightgreen'])

# The y-axis should run from 0 to 100%
plt.ylim(0, 100)

# Labels
plt.xlabel('Age groups')
plt.ylabel('Percent of the population')
plt.title('Age distribution in Zurich')

# Display the plot
plt.show()

- We can also display the same data as a pie chart (`pie()`)
- For pie charts, it makes sense to label the slices with their percentage (`autopct`)

In [None]:
# Size of the plot
plt.figure(figsize=(6, 6))

# Pie chart
plt.pie(gem_zh_alt.values, labels=gem_zh_alt.index, autopct='%1.1f%%', colors=['skyblue', 'salmon', 'lightgreen'])

# Labels
plt.title('Age distribution in Zurich')
plt.show()

## Scatterplot

Scatterplots can be used to compare pairs of values as point distributions. How about an obligatory comparison of the % of foreigners with the % of people who depend on social welfare per municipality?

- We use the `scatter()` function
- The two measurements per municipality can be passed directly as columns of the Pandas dataframe
- Does it make sense to scale the two axes from 0-100?

In [None]:
# Size
plt.figure(figsize=(6, 4))

# Scatterplot
plt.scatter(gem["auslaender_proz"], gem["sozialhilfequote"], color='skyblue', alpha=0.6)

# Axis limits
plt.xlim(0, 100)
plt.ylim(0, 100)

# Labels
plt.xlabel('Percent foreigners')
plt.ylabel('Social welfare rate')
plt.title('Social welfare rate compared with the share of foreigners')
plt.show()

Test whether a relationship becomes visible when you change the `xlim()` and `ylim()` of the plot.
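
One possible way to choose the limits is to derive them from the data instead of fixing them at 0 to 100. The following sketch uses made-up values (not the real municipality data) and also computes the Pearson correlation as a numerical check. The 10 % margin is an arbitrary assumption.

```python
import pandas as pd

# Made-up values in percent (not the real municipality data)
demo = pd.DataFrame({
    "auslaender_proz": [5.0, 10.0, 20.0, 30.0, 40.0],
    "sozialhilfequote": [0.5, 1.0, 2.5, 3.0, 5.0],
})

# Axis limits derived from the data: 0 up to the maximum plus a 10 % margin
x_max = demo["auslaender_proz"].max() * 1.1
y_max = demo["sozialhilfequote"].max() * 1.1

# Pearson correlation between the two columns
r = demo["auslaender_proz"].corr(demo["sozialhilfequote"])

print(f"xlim: (0, {x_max:.1f}), ylim: (0, {y_max:.1f}), r = {r:.2f}")
```

The computed values could then be passed on as `plt.xlim(0, x_max)` and `plt.ylim(0, y_max)`.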

## Boxplot

Boxplots are very useful for comparing several numerical columns. As a reminder, here is how to read a boxplot (this is also an example of how to show images in markdown):

<img src="boxplot.png" width=500/>

We compare the voter shares of the largest parties across all municipalities.

- We can apply the `boxplot()` directly to the (filtered) dataframe

In [None]:
# Size
plt.figure(figsize=(8, 4))

# Boxplot
gem[['svp', 'fdp', 'cvp', 'sp']].boxplot()

# Labels
plt.xlabel('Parties')
plt.ylabel('Voter share')
plt.title('Voter share of the largest parties per municipality')
plt.show()
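
The elements of a boxplot can also be computed by hand, which may help when reading the plot. The sketch below uses made-up voter shares and assumes Matplotlib's default whisker rule of 1.5 times the interquartile range. The reminder image above may follow a slightly different convention.

```python
import pandas as pd

# Made-up voter shares in percent (not the real data)
shares = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 60])

q1 = shares.quantile(0.25)      # lower edge of the box
median = shares.quantile(0.50)  # line inside the box
q3 = shares.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                   # height of the box

# Points outside these fences are drawn as individual outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = shares[(shares < lower_fence) | (shares > upper_fence)]

print(q1, median, q3, outliers.tolist())
```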

## Line and time series plot

Line plots are most commonly used to visualize phenomena that have been measured multiple times. A typical example is time series, i.e. measurements of the same phenomenon that have been taken repeatedly over time. 

We work with the Zürich bicycle count data.

The simplest visualization shows all measurements in the order in which they were registered.

In [None]:
# Size
plt.figure(figsize=(14, 4))

# Simple line plot
plt.plot(velo["y"], color="darkorange")

# Labels
plt.xlabel('Measurement')
plt.ylabel('Number of bicycles (per hour)')
plt.title('Bicycle counts at Langstrasse Unterführung, Zurich (January 2021)')
plt.show()

It gets a little more complicated if you want to display the date format appropriately (x-axis).

- First we have to convert the time field in the data into a datetime format (see [here](https://pandas.pydata.org/docs/user_guide/timeseries.html) for some examples of working with datetime in pandas)
- When plotting, we pass both the time field and the measurements
- Now comes a lot of axis formatting and labeling

In [None]:
velo['ds'] = pd.to_datetime(velo['ds'], format="%Y-%m-%d %H:%M:%S") # the time field looks like this: "2021-01-01 01:01:15"
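
Once a column has a datetime type, pandas offers date-aware operations on it. A minimal sketch with made-up hourly counts (not the real Zurich data), reusing the column names `ds` and `y`:

```python
import pandas as pd

# Made-up hourly counts (not the real Zurich data)
demo = pd.DataFrame({
    "ds": ["2021-01-01 00:00:00", "2021-01-01 01:00:00", "2021-01-02 00:00:00"],
    "y": [3, 5, 7],
})

# Strings become real timestamps
demo["ds"] = pd.to_datetime(demo["ds"], format="%Y-%m-%d %H:%M:%S")

# Date parts are available through the .dt accessor
demo["weekday"] = demo["ds"].dt.day_name()

# Sum the hourly counts per calendar day
daily = demo.set_index("ds")["y"].resample("D").sum()

print(daily)
```

Daily sums like `daily` could be plotted in the same way as the hourly values if the hourly curve turns out to be too noisy.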

In [None]:
# Size
fig = plt.figure(figsize=(14, 4))

# Line plot
plt.plot(velo['ds'], velo['y'], label='Bicycle counts', color="darkorange")

# Tick positions: major ticks once per week, minor ticks every day
ax = plt.gca()
ax.xaxis.set_major_locator(mdates.WeekdayLocator(interval=1))
ax.xaxis.set_minor_locator(mdates.DayLocator(interval=1))

# Format of the date labels on the major ticks
ax.xaxis.set_major_formatter(mdates.DateFormatter('%y-%m-%d'))

# Style of the grid lines
ax.xaxis.grid(True, which='major', linestyle='-', linewidth=0.5, color='black')
ax.xaxis.grid(True, which='minor', linestyle=':', linewidth=0.5, color='gray')

# Other labels
plt.xlabel('Date')
plt.ylabel('Number of bicycles (per hour)')
plt.title('Bicycle counts at Langstrasse Unterführung, Zurich (January 2021)')

plt.show()

## Saving plots

We like the last plot and therefore save it as an image file. It is important that we have saved the plot above in the *fig* object (`fig = plt.figure(figsize=(14, 4))`).

Saving as an image is now very simple:

In [None]:
fig.savefig("velozaehlung.png", format="png", dpi=300)

You can find the image file in the same directory as the notebook.
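
`savefig()` can also write other formats. The sketch below saves a small stand-in figure as PNG and as SVG. The figure content and the temporary output directory are assumptions for illustration. In the notebook you would call `savefig()` on the `fig` object from above with a plain file name.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# Small stand-in figure with made-up values
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot([1, 2, 3], [4, 1, 5], color="darkorange")
ax.set_title("Demo")

# Hypothetical output directory, used here only to keep the example self-contained
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "demo.png")
svg_path = os.path.join(out_dir, "demo.svg")

# Raster image; bbox_inches="tight" trims surplus white space around the plot
fig.savefig(png_path, dpi=300, bbox_inches="tight")

# Vector image; the format is inferred from the file extension
fig.savefig(svg_path, bbox_inches="tight")

plt.close(fig)
print(os.listdir(out_dir))
```

Vector formats such as SVG or PDF scale without loss, which can be useful for reports, while PNG is convenient for slides and web pages.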

## Interactive plots

*Matplotlib* is good for static plots. However, if you want to explore the data interactively, you can use libraries like *Plotly*.

In the following we create a heatmap from the correlations between features in the municipality table.

In [None]:
# Select features in a list
features = [
    "sozialhilfequote",
    "auslaender_proz",
    "geburtenziffer", 
    "sterbeziffer",
    "heiratsziffer", 
    "scheidungsziffer",
    "leerwohnungsziffer", 
    "siedlung_proz",
    "landw_proz",
    "sp", 
    "svp",
    "fdp",
]

# Compute pair-wise correlations between all features
correlation_matrix = gem[features].corr()

# plot the correlations as a heatmap
fig = px.imshow(correlation_matrix, 
                text_auto=True, # Display the correlation coefficients on the heatmap
                title='Heatmap of municipality correlations',
                color_continuous_scale='Viridis',  # Choose a color scale
                width=1200,  # Specify width
                height=1200  # Specify height
                )

fig.show()
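
To make the values in the heatmap easier to interpret, here is a small sketch of what `corr()` returns. The three columns are made up so that the result is easy to predict: `b` rises with `a`, and `c` falls as `a` rises.

```python
import pandas as pd

# Made-up columns with a known relationship
demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # rises with a
    "c": [4, 3, 2, 1],   # falls as a rises
})

# Pairwise Pearson correlations: a symmetric table with 1.0 on the diagonal
corr = demo.corr()

print(corr.round(2))
```

Values close to 1 indicate that two features rise together, values close to -1 that one falls as the other rises, and values near 0 that there is no linear relationship.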

## Histogram and Density Plot

Two widely used plots that we are still missing are histograms and density plots (the latter being a variant of the former).

In histograms, the occurrences of individual "things" are counted. Histograms are often applied to categorical data, which we don't have. We therefore create one histogram from synthetic categorical data and another from numerical data in the municipality table.

In [None]:
# synthetic data
data = {
    'categories': ['A', 'B', 'A', 'C', 'B', 'A', 'D', 'B', 'B', 'A', 'D', 'B', 'B', 'B', 'D']
}
df = pd.DataFrame(data)

# Create a histogram using Plotly
fig = px.histogram(df, 
                   x='categories',          # Specify the column to plot
                   title='Histogram of Categories', 
                   labels={'categories': 'Category'},  # Label for x-axis
                   color_discrete_sequence=['lightblue'],  # Set color of the bars
                   category_orders={'categories': ['A', 'B', 'C', 'D']}  # Order categories if desired
)

# Show the figure
fig.show()
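
The bar heights of such a histogram are simply the counts per category. As a sketch, the same counts can be computed directly in pandas with `value_counts()`, using the synthetic categories from above:

```python
import pandas as pd

# Same synthetic categories as in the histogram above
categories = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'D', 'B', 'B', 'A', 'D', 'B', 'B', 'B', 'D'])

# Count the occurrences of each category and sort alphabetically
counts = categories.value_counts().sort_index()

print(counts)
```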

In [None]:
# Create a histogram from numerical data in a dataframe
fig = px.histogram(
    gem[["einwohner"]], 
    x='einwohner',
    nbins=10,  # upper bound for the number of bins (groups) in the histogram
    title='Histogram of the number of inhabitants per municipality',
    opacity=0.75, 
    text_auto=True 
)

# Show the figure
fig.show()

OK, this is not very helpful. Almost all municipalities have fewer than 50k inhabitants. In such cases it might make sense to use a more advanced plot, such as a density plot from [figure_factory](https://plotly.com/python/figure-factories/).

In [None]:
fig = ff.create_distplot([gem["einwohner"].to_list()], ["einwohner"])
fig.show()
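
Another option for such skewed data, not used in this notebook, is to bin the values on a logarithmic scale. The sketch below uses made-up population numbers and NumPy. The bin edges are an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Made-up population numbers (not the real municipality data)
population = pd.Series([150, 800, 1200, 3500, 9000, 40000, 420000])

# The base-10 logarithm compresses the long right tail
log_pop = np.log10(population)

# Bins from 10^2 to 10^6 inhabitants, one bin per order of magnitude
counts, edges = np.histogram(log_pop, bins=[2, 3, 4, 5, 6])

print(counts)
```

On the logarithmic scale the small municipalities are spread over several bins instead of being squeezed into the first one.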