# Tutorial 1: Getting started

During the tutorials you will be reading and writing **Python** code in **Jupyter** notebooks on the **Deepnote** platform. 

Phew… Let's unpack this a bit!

* 🐍 [Python](https://www.python.org) is a programming language that has gained considerable traction in recent years in various contexts, including data science and the digital humanities. If you have never written any Python before, it would be useful to familiarize yourself with the language, its basic constructs, and its conventions. It is popular for its versatility and readability. Speaking of which…

* 🪐 [Jupyter](https://jupyter.org) notebooks are hybrid documents that contain both code and markup. This makes it possible to mix programming with visualizations, interact with them, and add documentation. What you are looking at now is a text cell written in the markup language Markdown. Further below you will see code cells (note the light grey background), which contain executable code written in the programming language Python! When viewing the notebooks on Deepnote or in Jupyter, you can double-click a text cell to see its source. 

* 📝 [Deepnote](https://deepnote.com) is a web-based environment to view, edit, run and share Jupyter notebooks from the comfort of your web browser. This also means you can use their computing power to do some number crunching and it makes it easy to share, discuss and run tutorials. When you open a notebook in edit mode, you are able to execute and edit any cell.

In this tutorial you will get a bit acquainted with Python, Jupyter and Deepnote, and get to know a few handy libraries for working with data.

## 🌍 Hello world 

Okay, enough words. Let's dive right into it and start with a classic:

In [1]:
print("hello world")

hello world


To execute (i.e., run) the code cell above, click into it and then click ▶️ to the right of the block. Can you see it? Depending on the environment you're in, it might look different.

To be able to edit any cell for yourself, you need to make a copy of this notebook. If you do not want to use Deepnote, you can also download this notebook and run it on your own computer with a local installation of Jupyter Notebook.

Of course we can set variables and extend them. Feel free to change the message:

In [2]:
hello = "hello world!"
hello = hello + " how are you?"
hello

'hello world! how are you?'

Now that we have our first variable `hello`, we can perform some string tricks. For example, we could change the capitalization:

In [3]:
hello.title()

'Hello World! How Are You?'

In [4]:
hello.upper()

'HELLO WORLD! HOW ARE YOU?'

✏️ *Now it's your turn! (The pencil stands for a small hands-on activity.) Try some string manipulations yourself. To get some inspiration, have a look at the [string methods](https://docs.python.org/3/library/stdtypes.html?#string-methods) that Python has built in:*
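As a starting point for the activity, here is a minimal sketch of a few more built-in string methods. The variable name `message` is only an illustration, and the selection of methods is arbitrary; the linked documentation lists many more.

```python
# A few built-in string methods, applied to a sample message
message = "hello world! how are you?"

print(message.capitalize())                # upper-case only the first character
print(message.replace("world", "there"))   # swap one substring for another
print(message.split())                     # break the text into a list of words
print(message.count("o"))                  # count how often the letter "o" occurs
print(len(message))                        # number of characters, spaces included
```

Note that none of these calls change `message` itself: Python strings are immutable, so each method returns a new value.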

## 📦 Let's get some packages

Python itself provides only limited methods for working with more complex data. One of the main reasons for Python's (and Jupyter's) popularity is the wide availability of software packages that provide powerful means for 🛒 preparing, 🥒 processing, and 🥗 presenting data. The tutorials will help you work through these steps with the help of several packages, some of them highly specific tools and others more general-purpose libraries. 

The Deepnote platform already has many packages ready to go. To use them in a notebook, you simply `import` them and assign an abbreviation after `as` to keep your code succinct. This is how you do it:

In [5]:
import pandas as pd

Now the powerful `pandas` package is loaded and will answer to its nickname `pd`.

🐼 [Pandas](https://pandas.pydata.org) is really a data analysis workhorse with the DataFrame data structure being one of its main muscles. You will learn to love it! With pandas you can do simple and sophisticated operations over small and sizable datasets. 

Let's create a little toy dataset to give you a sense of how it looks and works:


In [6]:
cities = pd.DataFrame({
  "name": ["Athens", "Bratislava", "Copenhagen", "Dublin"],
  "area": [39, 367.6, 86.2, 115],
  "elevation": [170, 152, 14, 20],
  "population": [664046, 429564, 602481, 553165]
  }
)

To check whether the DataFrame was created successfully, we can simply type the variable name `cities` to display its content as output:

In [7]:
cities

| | name | area | elevation | population |
|---|---|---|---|---|
| 0 | Athens | 39.0 | 170 | 664046 |
| 1 | Bratislava | 367.6 | 152 | 429564 |
| 2 | Copenhagen | 86.2 | 14 | 602481 |
| 3 | Dublin | 115.0 | 20 | 553165 |


The output generated by a code cell is printed right below it. In the case of a DataFrame we get a table. By convention, the rows are the data entries and the columns are the data dimensions. The first column on the left side is the index.
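To make the terms rows, columns, and index more concrete, the following sketch inspects the structure of the toy dataset. It rebuilds `cities` so that it can run on its own; the attributes used (`shape`, `columns`, `index`, `dtypes`) are standard pandas, but which ones to look at first is just a suggestion.

```python
import pandas as pd

# Same toy dataset as above, repeated so this cell runs on its own
cities = pd.DataFrame({
    "name": ["Athens", "Bratislava", "Copenhagen", "Dublin"],
    "area": [39, 367.6, 86.2, 115],
    "elevation": [170, 152, 14, 20],
    "population": [664046, 429564, 602481, 553165],
})

print(cities.shape)           # (number of rows, number of columns)
print(list(cities.columns))   # the data dimensions
print(list(cities.index))     # the row labels, here simply 0 to 3
print(cities.dtypes)          # the data type pandas chose for each column
```

Because the `area` list mixes whole and decimal numbers, pandas stores the entire column as floating-point values, which is why Athens shows up as `39.0` in the table.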

Now let's do something with our newly created DataFrame. For example, we could get the largest area using the `max()` method.

In [8]:
cities["area"].max()

367.6

✏️ *What would it take to get the highest elevation?*

To get the entry belonging to the largest area, one needs to locate it via its index. `idxmax()` returns the index of the row with the maximum value in column `area`, and with `loc` we can retrieve the row via that index:

In [9]:
cities.loc[ cities['area'].idxmax() ]

name          Bratislava
area               367.6
elevation            152
population        429564
Name: 1, dtype: object
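A related approach, not used in the rest of this tutorial, is `nlargest()`, which returns the top rows for a given column as a DataFrame. The sketch below assumes you want the two largest cities by area; the variable name `top_two` is made up for illustration.

```python
import pandas as pd

# Same toy dataset as above, repeated so this cell runs on its own
cities = pd.DataFrame({
    "name": ["Athens", "Bratislava", "Copenhagen", "Dublin"],
    "area": [39, 367.6, 86.2, 115],
    "elevation": [170, 152, 14, 20],
    "population": [664046, 429564, 602481, 553165],
})

# Keep the two rows with the highest values in the "area" column
top_two = cities.nlargest(2, "area")
print(top_two)
```

Unlike the `loc` and `idxmax()` combination, which returns a single row as a Series, this returns a DataFrame that keeps the original index labels.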

We can also calculate averages for each numeric column:

In [10]:
cities.mean(numeric_only=True)

area             151.95
elevation         89.00
population    562314.00
dtype: float64
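If you want more than one statistic at a time, one option is the `agg()` method, which accepts a list of function names. This is a small sketch rather than a required step; the choice of `min` and `mean` and the variable name `summary` are only for illustration.

```python
import pandas as pd

# Same toy dataset as above, repeated so this cell runs on its own
cities = pd.DataFrame({
    "name": ["Athens", "Bratislava", "Copenhagen", "Dublin"],
    "area": [39, 367.6, 86.2, 115],
    "elevation": [170, 152, 14, 20],
    "population": [664046, 429564, 602481, 553165],
})

# Select the numeric columns, then compute two statistics for each of them
numeric_columns = cities[["area", "elevation", "population"]]
summary = numeric_columns.agg(["min", "mean"])
print(summary)
```

The result is itself a DataFrame, with one row per statistic and one column per numeric dimension.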

There is so much more to discover, some of which you will do over the course of the tutorials. The [DataFrame page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in the pandas reference gives a complete (i.e., long) list of all methods provided by the data structure. 

If you want to do something specific, but do not know the particular method name, a well-formulated Google search query can work wonders. In particular, the discussions on Stack Overflow contain many helpful entries. Quite often somebody else has already had a problem similar to the one you're trying to solve. The key then is to formulate your query precisely. For this it is good to understand the basic terminology of Python, pandas, etc.

## 🌠 Let's reach for the stars 

Altair is the brightest star in the Aquila constellation and it is also the name of a versatile [visualization library](https://altair-viz.github.io/) specifically created for Python and based on the popular Vega-Lite [visualization grammar](https://vega.github.io/vega-lite/). 

With 📊Altair we can create charts and visualizations in little time. 

In order to put Altair to use, we first have to import it and give it a short name:


In [11]:
import altair as alt

If you were to run this notebook from a local Jupyter installation, you might not have the library installed. To install it you would use either Python's package manager `pip` or Anaconda's package management tool `conda`. There is a convention to list all required packages in a `requirements.txt` file. On Deepnote they are installed automatically. Depending on your environment, you could enter the line `!pip install -r requirements.txt` in a code cell to install all packages needed for the tutorials in one go. Alternatively, you can also run pip from the command line (then without the preceding exclamation mark).
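Before installing anything, you may want to check which packages your environment already provides. The following is a minimal sketch using Python's standard `importlib` module; the helper name `is_installed` is made up for this example and is not part of any library.

```python
import importlib.util

def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return importlib.util.find_spec(package_name) is not None

# Report the status of the two packages used in this tutorial
for name in ["pandas", "altair"]:
    status = "available" if is_installed(name) else "missing (install it with pip or conda)"
    print(f"{name}: {status}")
```

This only checks that a package can be found, not which version is installed.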

Now let's make a bar chart representing the `area` in the cities dataset—in one line!

In [12]:
alt.Chart(cities).mark_bar().encode(x='name', y='area')

✏️ *Change the chart above into a horizontal bar chart of populations:* 

With a few more specifications, we can give this bar chart some tooltips and a square aspect ratio:

In [13]:
alt.Chart(cities).mark_bar().encode(
  x='name',
  y='area',
  tooltip=['name', 'area', 'elevation', 'population'],
).properties(
  width=200,
  height=200
)

This is admittedly still a very simple chart, but it gets the job done and might make you wonder why Athens is actually so small (it turns out that this number refers to the municipality and not the metropolitan area of Athens… Anyways).

Altair can be used to create a wide range of static and interactive visualizations—have a look at their [gallery](https://altair-viz.github.io/gallery/index.html) for some inspiration!

As a last step in this tutorial, we load some country data from the geographical database Geonames and create an interactive scatterplot of the areas and populations of all countries. I included some comments (note the # and the green color) to explain what is going on:

In [14]:
# we load the dataset from Geonames using a URL, a web address
countries = pd.read_csv("http://www.geonames.org/countryInfoCSV", sep='\t', keep_default_na=False)
# we pass two further parameters: first we specify that the file is tab-separated,
# and then we ask it not to translate NA into 'not a number' as it refers to North America

# let's take a look at the first rows in the dataset:
countries.head()

| | iso alpha2 | iso alpha3 | iso numeric | fips code | name | capital | areaInSqKm | population | continent | languages | currency | geonameId |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 77006 | EU | ca | EUR | 3041565 |
| 1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 9630959 | AS | ar-AE,fa,en,hi,ur | AED | 290557 |
| 2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 37172386 | AS | fa-AF,ps,uz-AF,tk | AFN | 1149361 |
| 3 | AG | ATG | 28 | AC | Antigua and Barbuda | St John's | 443.0 | 96286 | NA | en-AG | XCD | 3576396 |
| 4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA | en-AI | XCD | 3573511 |
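To see what `keep_default_na=False` does without downloading anything, here is a small sketch that reads a two-row, tab-separated sample from memory. The sample text and the variable names are made up for illustration; only the `read_csv` parameters match the cell above.

```python
import io
import pandas as pd

# A tiny tab-separated sample standing in for the Geonames file
sample_tsv = "name\tcontinent\nCanada\tNA\nFrance\tEU\n"

# Default behavior: the text "NA" is interpreted as a missing value
with_default = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# With keep_default_na=False, "NA" stays an ordinary string
without_default = pd.read_csv(io.StringIO(sample_tsv), sep="\t", keep_default_na=False)

print(with_default["continent"].isna().tolist())   # [True, False]
print(without_default["continent"].tolist())       # ['NA', 'EU']
```

Without this parameter, every North American country would end up with a missing continent.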


Note that the values in the column `continent` are abbreviated. To make them meaningful we replace them with their full names:

In [15]:
# string replace in continent column with a dictionary of find-replace pairs:
countries = countries.replace( { "continent": {
  "AF": "Africa",
  "AN": "Antarctica",
  "AS": "Asia",
  "EU": "Europe",
  "OC": "Oceania",
  "NA": "North America",
  "SA": "South America"
}})
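The nested dictionary matters here: the outer key restricts the replacement to one column. The sketch below uses a made-up three-row sample (the names `sample` and `renamed` are illustrative) to show that the code `NA` in another column is left alone, which is relevant because `NA` is also the ISO code for Namibia.

```python
import pandas as pd

# Made-up sample in which "NA" appears in two different columns
sample = pd.DataFrame({
    "name": ["Andorra", "Namibia", "Canada"],
    "iso alpha2": ["AD", "NA", "CA"],
    "continent": ["EU", "AF", "NA"],
})

# The outer key "continent" limits the find-replace pairs to that column
renamed = sample.replace({"continent": {
    "EU": "Europe",
    "AF": "Africa",
    "NA": "North America",
}})

print(renamed)
```

Passing the inner dictionary on its own, without the `"continent"` key, would apply the replacements to every column and turn Namibia's country code into "North America" as well.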

Now the data is ready to be visualized. Note: We will spend considerable time in the coming tutorials to get data into shape, so that we can start visualizing them. This is an important and often laborious step in any visualization project, which is why we also spend a bit of time on it here.

Because there are many countries with small populations, and few very large ones, we use [logarithmic scales](https://en.wikipedia.org/wiki/Logarithmic_scale) on both axes by adding `scale(type='log')`. And because logarithmic scales cannot contain zero values, we skip the territories with a population of 0 (such as Antarctica). 

In [16]:
alt.Chart(countries).transform_filter(alt.datum.population>0).mark_circle().encode(
  alt.X('areaInSqKm').scale(type='log'),
  alt.Y('population').scale(type='log'),
  color='continent',
  tooltip = ["name", "capital", "areaInSqKm", "population", "continent"]
).interactive().properties(
  width=600,
  height=400
)

✏️ *Replace the `log` with `linear` to see how most countries would cluster close together making it hard to see any patterns.*

This visualization is interactive: you can hover over each entry, zoom into the scatterplot (by scrolling in and out), and drag it around to adjust the axis segments in view. Double-clicking resets the zoom and panning.
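The filtering that `transform_filter` performs inside the chart can also be done in pandas before plotting, and taking the base-10 logarithm by hand shows what a log scale does to the values. This is a sketch on a made-up three-row sample: the Andorra and Afghanistan figures are taken from the table above, while the Antarctica area is an approximate value added for illustration.

```python
import numpy as np
import pandas as pd

# Small sample; the Antarctica area is approximate and only illustrative
sample = pd.DataFrame({
    "name": ["Andorra", "Afghanistan", "Antarctica"],
    "areaInSqKm": [468.0, 647500.0, 14000000.0],
    "population": [77006, 37172386, 0],
})

# Keep only territories with a population above zero, as in the chart
populated = sample[sample["population"] > 0].copy()

# On a log scale, positions depend on the order of magnitude of a value
populated["log10_population"] = np.log10(populated["population"])

print(populated[["name", "population", "log10_population"]].round(2))
```

The two remaining populations differ by a factor of several hundred, but their logarithms differ by less than three, which is why a log scale spreads small and large countries more evenly across the chart.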

**If you have so far just read the notebook, it's time to make it work, i.e., to actually run these code cells and do the pencil activities. For this you need to either log into Deepnote or download the notebook and run it in a local Jupyter installation.**

## 📚 Let's go to the library

Here are a few places where you can dive deeper into the things you learned about in this first tutorial:

* [An Informal Introduction to Python](https://docs.python.org/3/tutorial/introduction.html)
* [Tutorials for getting started with Pandas](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html)
* [Altair Website](https://altair-viz.github.io/index.html)
* [Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet)

