**Data Visualization course - winter semester 2023/24 - FU Berlin**

*Tutorials adapted from the [Information Visualization](https://infovis.fh-potsdam.de/tutorials/) course at the FH Potsdam*

# Tutorial 1: Getting started

During the tutorials you will be reading and writing **Python** code in **Jupyter** notebooks.
Phew… Let's unpack this a bit!

* 🐍 [Python](https://www.python.org) is a programming language that has gained considerable traction in recent years in various contexts, including data science and the digital humanities. If you have never written any Python before, it would be useful to familiarize yourself with the language, its basic constructs, and its conventions. It is popular for its versatility and readability. Speaking of which…

* 🪐 [Jupyter](https://jupyter.org) notebooks are hybrid documents that contain both code and markup, which makes it easy to mix programming and documentation. What you are looking at now is a text cell written in the markup language Markdown. Further below you will see code cells written in the programming language Python (note the light grey background), which contain executable code! When viewing a notebook in Jupyter, you can double-click on any text cell to see its source.

In this tutorial you will get a bit acquainted with Python and Jupyter, and get to know a few handy libraries for working with data.

## 🌍 Hello world 

Okay, enough words. Let's dive right into it and start with a classic:

In [1]:
print("Hello world")

Hello world


The code cell above can be executed (i.e., run) by pressing **Shift + Enter**.

Of course we can set variables and extend them. Feel free to change the message:

In [2]:
hello = "hello world"
hello = hello + " how are you!"
hello

'hello world how are you!'

Now that we have our first variable `hello`, we can perform some string tricks. For example, we could change the capitalization:

In [3]:
hello.title()

'Hello World How Are You!'

In [4]:
hello.upper()

'HELLO WORLD HOW ARE YOU!'

✏️ *Now it's your turn! (The pencil stands for a small hands-on activity!). Try some string manipulations yourself. To get some inspiration, have a look at the [string methods](https://docs.python.org/3/library/stdtypes.html?#string-methods) that Python has built-in:*

In [5]:
hello.capitalize()

'Hello world how are you!'
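Beyond capitalization, a few other built-in string methods are worth trying. The following is a minimal sketch; the variable `greeting` and its value are made up for illustration:

```python
# Illustrative only: `greeting` is a made-up example string.
greeting = "hello world how are you!"

swapped = greeting.replace("world", "Berlin")  # substitute a substring
words = greeting.split(" ")                    # break the string into a list of words
joined = "-".join(words)                       # glue the words back together with dashes
starts = greeting.startswith("hello")          # test how the string begins
count_o = greeting.count("o")                  # count occurrences of a character

print(swapped)
print(words)
print(joined)
print(starts, count_o)
```

None of these methods changes `greeting` itself. Python strings are immutable, so each call returns a new value.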

## 📦 Let's get some packages

Python itself provides only limited methods for working with more complex data. One of the main reasons for Python's (and  Jupyter's) popularity is the wide availability of software packages that provide powerful means for preparing, processing, presenting, and probing data. Throughout the tutorials you will get to know a few packages, some of them highly specific tools and others more general-purpose libraries. 

To use packages in a notebook, you simply `import` them and assign an abbreviation after `as` to keep your code succinct. This is how you do it:

In [6]:
import pandas as pd

Now the powerful `pandas` package is loaded and will answer to its nickname `pd`.

🐼 [Pandas](https://pandas.pydata.org) really is a data analysis workhorse with the DataFrame data structure being one of its main muscles. You will learn to love it! With pandas you can do simple and sophisticated operations over small and sizable datasets. 

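Let's load a real-world dataset to give you a sense of how it works. The call below reads the COVID-19 dataset published by Our World in Data straight from a URL into a DataFrame: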


In [7]:
covid_data = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")

To check whether the DataFrame was created successfully, we can simply type the variable name `covid_data` to display its content as an output:

In [8]:
covid_data

| | iso_code | continent | location | date | total_cases | new_cases | new_cases_smoothed | total_deaths | new_deaths | new_deaths_smoothed | ... | male_smokers | handwashing_facilities | hospital_beds_per_thousand | life_expectancy | human_development_index | population | excess_mortality_cumulative_absolute | excess_mortality_cumulative | excess_mortality | excess_mortality_cumulative_per_million |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Asia | Afghanistan | 2020-01-03 | | 0.0 | | | 0.0 | | ... | | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | | | | |
| 1 | AFG | Asia | Afghanistan | 2020-01-04 | | 0.0 | | | 0.0 | | ... | | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | | | | |
| 2 | AFG | Asia | Afghanistan | 2020-01-05 | | 0.0 | | | 0.0 | | ... | | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | | | | |
| 3 | AFG | Asia | Afghanistan | 2020-01-06 | | 0.0 | | | 0.0 | | ... | | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | | | | |
| 4 | AFG | Asia | Afghanistan | 2020-01-07 | | 0.0 | | | 0.0 | | ... | | 37.746 | 0.5 | 64.83 | 0.511 | 41128772.0 | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 349994 | ZWE | Africa | Zimbabwe | 2023-10-14 | 265808.0 | 0.0 | 5.286 | 5718.0 | 0.0 | 0.0 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | | | | |
| 349995 | ZWE | Africa | Zimbabwe | 2023-10-15 | 265808.0 | 0.0 | 5.286 | 5718.0 | 0.0 | 0.0 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | | | | |
| 349996 | ZWE | Africa | Zimbabwe | 2023-10-16 | 265808.0 | 0.0 | 5.286 | 5718.0 | 0.0 | 0.0 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | | | | |
| 349997 | ZWE | Africa | Zimbabwe | 2023-10-17 | 265808.0 | 0.0 | 0.000 | 5718.0 | 0.0 | 0.0 | ... | 30.7 | 36.791 | 1.7 | 61.49 | 0.571 | 16320539.0 | | | | |


In [9]:
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349999 entries, 0 to 349998
Data columns (total 67 columns):
 #   Column                                      Non-Null Count   Dtype  
---  ------                                      --------------   -----  
 0   iso_code                                    349999 non-null  object 
 1   continent                                   333370 non-null  object 
 2   location                                    349999 non-null  object 
 3   date                                        349999 non-null  object 
 4   total_cases                                 312088 non-null  float64
 5   new_cases                                   340457 non-null  float64
 6   new_cases_smoothed                          339198 non-null  float64
 7   total_deaths                                290501 non-null  float64
 8   new_deaths                                  340511 non-null  float64
 9   new_deaths_smoothed                         339281 non-null  float64
 

The output generated by a code cell is printed right below it. In the case of a DataFrame we get a table. By convention, the rows are the data entries and the columns are the data dimensions. The first column on the left side is the index.
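To make the roles of rows, columns, and index more tangible, here is a minimal sketch on a tiny hand-made DataFrame. The name `toy` and all of its values are invented for illustration and have nothing to do with the real dataset:

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "new_cases": [10.0, 25.0, 5.0],
    "new_deaths": [1.0, 0.0, 2.0],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column names, i.e. the data dimensions
print(list(toy.index))    # the default index simply counts rows from 0
print(toy.head(2))        # only the first two rows
```

On a DataFrame as large as `covid_data`, `head()` is a handy way to peek at the first few rows without printing the whole table.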

Now let's do something with our newly created DataFrame. For example, we could get the largest number of total cases using the `max` method.

In [10]:
covid_data.total_cases.max()

771407061.0

✏️ *What would it take to get the highest positive rate?*

In [11]:
covid_data.positive_rate.max()

1.0

In [12]:
covid_data.new_cases.max()

8401961.0

To get the entry with the largest number of total cases, we need to **loc**ate it via its index. The `idxmax` method returns the index label of that row:

In [13]:
covid_data.loc[ covid_data.total_cases.idxmax() ]

iso_code                                       OWID_WRL
continent                                           NaN
location                                          World
date                                         2023-10-15
total_cases                                 771407061.0
                                               ...     
population                                 7975105024.0
excess_mortality_cumulative_absolute                NaN
excess_mortality_cumulative                         NaN
excess_mortality                                    NaN
excess_mortality_cumulative_per_million             NaN
Name: 345839, Length: 67, dtype: object
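The interplay of `max`, `idxmax`, and `loc` may be easier to follow on a small example. This is a sketch on invented data (the DataFrame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [100.0, 250.0, 50.0],
})

largest = toy["total_cases"].max()       # the largest value in the column
row_label = toy["total_cases"].idxmax()  # the index label of the row holding that value
row = toy.loc[row_label]                 # the complete row, returned as a Series

print(largest, row_label)
print(row["location"])
```

Here `max` answers "how large?", while `idxmax` combined with `loc` answers "which entry?".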

We can also calculate averages for several numeric columns by selecting them first and then calculating the mean:

In [14]:
covid_data[['total_cases', 'new_cases', 'new_deaths']].mean(axis=0)

total_cases    6.683354e+06
new_cases      9.601634e+03
new_deaths     8.551106e+01
dtype: float64
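One detail worth knowing is that `mean` skips missing values by default, which matters for a dataset with as many gaps as this one. A minimal sketch, again on invented data (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one missing value (None becomes NaN).
toy = pd.DataFrame({
    "new_cases": [10.0, 20.0, None],
    "new_deaths": [1.0, 3.0, 5.0],
})

# Select two columns with a list of names, then average down the rows (axis=0).
means = toy[["new_cases", "new_deaths"]].mean(axis=0)

print(means)
```

The average of `new_cases` is computed from the two available values only, so the missing entry does not turn the result into `NaN`.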

There is so much more to discover, some of which you will do over the course of the tutorials. The [DataFrame page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in the pandas reference gives a complete (i.e., long) list of all methods provided by the data structure. 

If you want to do something specific but do not know the particular method name, a well-formulated query in a search engine can work wonders. In particular, the discussions on Stack Overflow contain many helpful entries. Quite often somebody else has already had a problem similar to the one you are trying to solve. The key then is to formulate your query precisely. For this it is good to understand the basic terminology of Python, pandas, etc.

## 🌠 Let's reach for the stars

Altair is the brightest star in the Aquila constellation and it is also the name of a versatile [visualization library](https://altair-viz.github.io/)  specifically created for Python based on the popular Vega-Lite visualization grammar. 

With 📊Altair we can create charts and visualizations in little time. 

In order to put Altair to use, we first have to import it and give it a short name:


In [15]:
import altair as alt

First, let's prepare the data. By default, Altair refuses to plot DataFrames with more than 5000 rows, so we need to do a bit of work to get our data into shape. Let's start by aggregating it.

In [16]:
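# numeric_only=True sums only the numeric columns and skips text columns such as `date`
data = covid_data.groupby('continent').sum(numeric_only=True).reset_index()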


First we call `groupby` to group our data by the `continent` column, then we sum the values in each group. The result of this computation has the grouped-by values in its index. Since Altair cannot use a DataFrame's index for an axis, we turn the index back into a regular column by calling `reset_index` on the resulting DataFrame.
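The grouping steps can be traced on a tiny example. This is a sketch on invented data (`toy` and its values are hypothetical). The `numeric_only=True` argument is an addition here that restricts the sum to numeric columns, so the text column `location` is left out:

```python
import pandas as pd

# Hypothetical toy data: names and values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "location": ["X", "Y", "Z"],
    "new_cases": [10.0, 5.0, 7.0],
})

# Step 1: group by continent and sum the numeric columns within each group.
grouped = toy.groupby("continent").sum(numeric_only=True)
print(list(grouped.index))   # the continents now live in the index

# Step 2: move the continents from the index back into a regular column.
flat = grouped.reset_index()
print(list(flat.columns))
print(flat)
```

After `reset_index`, `continent` is an ordinary column again and can be referenced by name in a chart encoding.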

In [17]:
alt.Chart(data).mark_bar().encode(x='continent', y='new_cases')

✏️ *Change the chart above into a horizontal bar chart of new cases:*

In [18]:
alt.Chart(data).mark_bar().encode(y='continent', x='new_cases')

In [19]:
alt.Chart(data).mark_bar().encode(y='continent', x='new_cases').properties(height=alt.Step(40), width=500)

With a few more specifications, we can give this bar chart some tooltips and a square aspect ratio:

In [20]:
alt.Chart(data).mark_bar().encode(
    x='continent', 
    y='new_cases',
    tooltip=['new_deaths', 'new_cases_per_million']
).properties(
    width=200,
    height=200
)

This is admittedly still a very simple chart, but it gets the job done.

Altair can be used to create a wide range of static and interactive visualizations. Have a look at its [gallery](https://altair-viz.github.io/gallery/index.html) for some inspiration!

## Sources
- [Pandas Tutorial: DataFrames in Python - DataCamp](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python)
- [Examining Data Using Pandas | Linux Journal](https://www.linuxjournal.com/content/examining-data-using-pandas)