*Part 2: Python for Data Analysis II*
# Working with Pandas

In the last tutorial we got to know the main object types of the pandas module: *Series* and *Dataframes*. We also learned how to access (and change) different parts of a dataframe using the different *indexers* that are available in pandas (i.e. ``[]``, ``loc[]`` and ``iloc[]``). In this tutorial, we will start working with real data and get to know some of the **methods and functions** the pandas module provides. We will only be able to cover a small fraction of them (for documentation on all available functions and methods, see https://pandas.pydata.org/docs/reference/index.html).

When you start working with data, you may encounter problems you cannot easily solve with the methods and functions we discuss in this tutorial. Before you start writing complicated code, it is usually a good idea to Google for an easy solution first. Pandas has a large number of useful functions and methods -- and most (if not all) of them are discussed on Stack Overflow.

## Getting help

In this class we will not be able to cover all aspects of Python. If you want more details, you can consult, for example, the **Python Standard Library Reference** at https://docs.python.org/3/library/ or the **Language Reference** at https://docs.python.org/3/reference/. But be warned: the amount of detail in these sources can be overwhelming. For **quick and easy-to-understand overviews** of different topics see, for example, https://www.w3schools.com/python/.

For an introduction to pandas, see:

* https://www.w3schools.com/python/pandas/default.asp
* https://pandas.pydata.org/docs/user_guide/index.html#user-guide

If you get stuck or don't remember how to do something, it is usually a good idea to **Google** your problem. Python has a large (and fast-growing) community and you will probably find answers to most of your questions online (e.g. on **Stack Overflow** or in a **YouTube tutorial**).

## Importing data

The data we will work with in this tutorial can be found here:
https://drive.google.com/drive/folders/1QnHTDQ0tb8_Ex6dMgNCwqJuL3PxzEKIv

The relevant file is called ``countries_life_satisfaction.csv``.

### Copying the data to your Google Drive

You first need to make a copy of the data. Right-click the file and then do one of the following:

(a) If you work with Colab, select **"Make a copy"** to store a copy of the file in your Google Drive. We suggest that you **create a folder ``MyData`` in your Google Drive and copy the file into this folder**. The file name of the copy will be ``Copy of countries_life_satisfaction.csv``, so change the name back to ``countries_life_satisfaction.csv``. Placing the file in the folder ``MyData`` and using the file name ``countries_life_satisfaction.csv`` ensures that the path in the example below will work.

(b) If you work locally, select "Download" to store a copy of the file on your hard drive. In this case, you can skip "Mounting your Google Drive", but you will need to modify the path in "Setting the working directory".

**If you opened this tutorial in Colab, then you need to copy the file to Google Drive (option a) because Colab cannot access your hard drive.**

Alternatively, you can read in the data via a direct URL.


### Mounting your Google Drive

Assuming you work on Colab, you will now need to mount your Google Drive, so Colab can access the file (you can skip this section if you work locally). To mount the drive, run the following code:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


A link will appear that will take you to a page where you can grant access to your drive. After clicking "Allow", a code will be shown that you have to copy into the displayed field. After that, your Google Drive will be available at ``/content/drive/MyDrive``.

### Setting the working directory

Specifying full file paths when reading in the data is a bit tedious so you may want to set the working directory to the folder that contains the data. **You can use the ``os`` module to specify your working directory**. Assuming you copied the file into a folder called ``MyData`` in your drive, this would go as follows:

In [None]:
import os
os.getcwd()  # display path of current working directory

'/content'

In [None]:
os.chdir("/content/drive/MyDrive/MyData")  # change working directory
print(os.getcwd())  # display path of current working directory
os.listdir()        # list contents of current working directory

You can now refer to your file simply as ``countries_life_satisfaction.csv`` without having to type the full path.
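If you would rather not change the working directory, one possible alternative is to build the full path once with the standard-library ``pathlib`` module and reuse it. This is only a sketch: the folder below is the Colab path assumed in this tutorial, and the names ``data_dir`` and ``csv_path`` are made up for illustration.

```python
from pathlib import Path

# Assumed location of the data folder (the Colab path used in this tutorial);
# adjust it if you work locally.
data_dir = Path("/content/drive/MyDrive/MyData")

# The "/" operator joins a folder and a file name into one path
csv_path = data_dir / "countries_life_satisfaction.csv"

print(csv_path.as_posix())  # full path written with forward slashes
print(csv_path.exists())    # True only if the file is really there
```

A ``Path`` object like ``csv_path`` can be passed to ``pd.read_csv()`` in place of a string.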

### Loading data into a Pandas dataframe

First import the ``pandas`` module:

In [None]:
import pandas as pd


To import data from a CSV (comma-separated values) file, you can use the pandas ``read_csv()`` function:

In [None]:
df = pd.read_csv("countries_life_satisfaction.csv")

---


>  <font color='teal'> **In-class exercise**: Copy ``countries_life_satisfaction.csv`` to a folder in your Google Drive (e.g. ``MyData``) and mount your Google Drive. Then, import the dataset ``countries_life_satisfaction.csv`` and assign it to a variable ``df``. You do not have to write additional code to do this. If you can run the code above, you succeeded!



---



### Importing data from a URL

If you did not succeed in mounting your Google Drive and importing the data from there, you can also load the data from the following URL (so you can follow along with the remainder of the tutorial):

In [None]:
df = pd.read_csv("http://farys.org/daten/countries_life_satisfaction.csv")

Alternatively, you can read the CSV directly from the Google Drive of this course if you convert the share link of the file into a direct-download URL:

In [None]:
url = 'https://drive.google.com/file/d/1fURALIPF9jWwwqZHcFPIUfnxwZCmrQ_k/view?usp=share_link'
url = 'https://drive.google.com/uc?id=' + url.split('/')[-2]  # extracts the second-last part
                                                              # between the "/" which holds the file id

print(url)
df = pd.read_csv(url)

https://drive.google.com/uc?id=1fURALIPF9jWwwqZHcFPIUfnxwZCmrQ_k
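The URL rewriting above can be wrapped in a small helper so it is easy to reuse for other shared files. This is a sketch under one assumption: the share link has the same shape as the one in this tutorial, with the file id sitting directly before the final ``view?...`` part. The function name ``drive_download_url`` is made up for illustration.

```python
def drive_download_url(share_url):
    """Turn a Google Drive share link into a direct-download link.

    Assumes the link looks like .../file/d/<file id>/view?..., so the
    file id is the second-to-last piece when splitting on "/".
    """
    file_id = share_url.split("/")[-2]
    return "https://drive.google.com/uc?id=" + file_id


share_url = "https://drive.google.com/file/d/1fURALIPF9jWwwqZHcFPIUfnxwZCmrQ_k/view?usp=share_link"
direct_url = drive_download_url(share_url)
print(direct_url)
```

The resulting ``direct_url`` can then be passed to ``pd.read_csv()`` as in the cell above.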


### Encodings and separators

You will often run into problems when reading files that are related to the encoding of the file. Different operating systems use different defaults (utf8, latin1, applemac), which matters as soon as the data contains special characters such as umlauts.

In [None]:
df_utf8 = pd.read_csv("http://farys.org/daten/kantone_utf8.csv")  # works
df_utf8

Unnamed: 0,kn,kanton,hauptort
0,1,Zürich,Zürich
1,2,Bern,Bern


In [None]:
# df_ansi = pd.read_csv("http://farys.org/daten/kantone_ansi.csv")  # does not work
# df_ansi

In [None]:
df_ansi = pd.read_csv("http://farys.org/daten/kantone_ansi.csv",
                      encoding = "latin1")  # works
df_ansi

Unnamed: 0,kn,kanton,hauptort
0,1,Zürich,Zürich
1,2,Bern,Bern


``csv`` (and ``txt``) files can be saved with different separators (e.g. ``,`` or ``;``).

In [None]:
df_sc = pd.read_csv("http://farys.org/daten/kantone_semicolon.csv")
df_sc

Unnamed: 0,kn; kanton; hauptort; einwohner; fläche
1; Zürich; Zürich; 1'564'662; 1728,94
2; Bern; Bern; 1'047'473; 5958,51


In [None]:
df_sc = pd.read_csv("http://farys.org/daten/kantone_semicolon.csv", sep=";")
df_sc

Unnamed: 0,kn,kanton,hauptort,einwohner,fläche
0,1,Zürich,Zürich,1'564'662,172894
1,2,Bern,Bern,1'047'473,595851


Furthermore - especially in Europe - you might encounter differences in the specification of decimal-characters and thousands separators.

In [None]:
df_sc = pd.read_csv("http://farys.org/daten/kantone_semicolon.csv", sep=";",
                    thousands="'",
                    decimal=",")
df_sc

Unnamed: 0,kn,kanton,hauptort,einwohner,fläche
0,1,Zürich,Zürich,1564662,1728.94
1,2,Bern,Bern,1047473,5958.51
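To experiment with these options without downloading anything, you can feed ``read_csv`` a small text sample through ``io.StringIO``. The sketch below reuses two rows from the output above with simplified column names (``flaeche`` instead of ``fläche``); the variable names are made up for illustration.

```python
import io

import pandas as pd

# Two rows in the same layout as the semicolon-separated file above
raw = (
    "kn;kanton;einwohner;flaeche\n"
    "1;Zürich;1'564'662;1728,94\n"
    "2;Bern;1'047'473;5958,51\n"
)

# Only the separator is specified: the numbers stay text
df_plain = pd.read_csv(io.StringIO(raw), sep=";")

# Separator, thousands separator and decimal character are all specified
df_clean = pd.read_csv(io.StringIO(raw), sep=";", thousands="'", decimal=",")

print(df_plain.dtypes)
print(df_clean.dtypes)
print(df_clean)
```

Comparing the two ``dtypes`` outputs shows why the extra arguments matter: without them, ``einwohner`` and ``flaeche`` cannot be used for calculations.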


There are also ways to roughly guess which encoding a file has:

In [None]:
import chardet
import requests

# URL of the CSV file
url_utf8 = "http://farys.org/daten/kantone_utf8.csv"
url_ansi = "http://farys.org/daten/kantone_ansi.csv"

# Download the file and check the encoding
response = requests.get(url_utf8)
encoding = chardet.detect(response.content)['encoding']
print(encoding)

response = requests.get(url_ansi)
encoding = chardet.detect(response.content)['encoding']
print(encoding)

utf-8
ISO-8859-1
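If ``chardet`` is not installed, a rough fallback is to try a short list of candidate encodings in order and keep the first one that decodes without an error. This is a sketch of a different, cruder technique than the statistical guess above, and the helper name ``decode_with_fallback`` is made up. Note that ``latin1`` can decode any byte sequence, so it must come last in the list.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "latin1")):
    """Return (text, encoding) for the first candidate encoding that works."""
    for enc in encodings:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this encoding does not fit, try the next one
    raise ValueError("None of the candidate encodings could decode the data")


utf8_bytes = "Zürich".encode("utf-8")   # how a UTF-8 file stores the word
ansi_bytes = "Zürich".encode("latin1")  # how a latin1 (ANSI) file stores it

print(decode_with_fallback(utf8_bytes))
print(decode_with_fallback(ansi_bytes))
```

The encoding found this way can be passed to the ``encoding`` argument of ``read_csv``.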


We recommend using these options so that you start with a dataset that is as clean as possible. Avoid cleaning data manually after reading it in a messy way.
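The same idea applies to placeholder values. The sketch below uses a made-up three-row sample and assumes that the text ``no data`` and the number ``-99`` stand for missing values in the respective columns; whether that holds for your data is something you have to check. The ``na_values`` and ``index_col`` arguments of ``read_csv`` then handle this at import time.

```python
import io

import pandas as pd

# Made-up sample; "no data" and -99 are ASSUMED to be missing-value codes
raw = (
    "country,fertility,working hours per year\n"
    "Georgia,2.05,no data\n"
    "Norway,1.741,1422.5608\n"
    "Sint Maarten (Dutch part),-99,no data\n"
)

df_demo = pd.read_csv(
    io.StringIO(raw),
    index_col="country",  # use the country column as row index
    na_values={"fertility": [-99], "working hours per year": ["no data"]},
)

print(df_demo)
print(df_demo.dtypes)
```

Because the placeholders are converted to ``NaN`` while reading, both columns end up as floats.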

><font color = 4e1585> SIDENOTE: The ``read_csv`` function takes many more arguments that can be useful. For example, you can specify what should be used as the index, what values should be interpreted as missing values, what datatypes to use for the different columns, how many rows should be skipped etc. You can use similar functions to import other file types such as Excel (``read_excel``), Stata (``read_stata``), SPSS (``read_spss``) etc.

## Inspecting data

### Getting started

We have imported the data and assigned it to a dataframe called ``df``. Let's take a look at it now. You can use the **``head()`` method** to look at the first observations (i.e. rows) in a dataset:

In [None]:
df.head()  # print first 5 rows

Unnamed: 0.1,Unnamed: 0,index,code,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,2,Georgia,GEO,4410.0,4.659097,2.05,4024000.0,Asia,no data
3,3,Norway,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
4,4,Bulgaria,BGR,7450.0,5.098814,1.534,7200000.0,Europe,1644.3853


Similarly, you can use the **``tail()`` method** to look at the last rows:

In [None]:
df.tail(100)  # Look at the last 100 rows

Unnamed: 0.1,Unnamed: 0,index,code,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population,continent,working hours per year
103,103,Malawi,MWI,350.0,3.334634,4.527,16745000.0,Afrika,no data
104,104,Japan,JPN,38840.0,5.793575,1.395,127985000.0,Asia,1750.9
105,105,Philippines,PHL,3380.0,5.869173,2.805,102113000.0,Asia,2148.5645
106,106,Guyana,GUY,5470.0,,2.534,767000.0,South America,no data
...,...,...,...,...,...,...,...,...,...
199,199,Saudi Arabia,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
200,200,Latvia,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
201,201,Sint Maarten (Dutch part),SXM,29560.0,,-99.000,40000.0,North America,no data
202,202,,,,,,,,


As you may have noticed, the output will get shortened if you try to print many rows (or columns) at once. You can **change how many rows or columns should be displayed using the ``set_option()`` function**:

In [None]:
pd.set_option('display.max_rows', 8)  # Display 8 rows
pd.set_option('display.max_columns', None)  # Display all columns
df

Unnamed: 0.1,Unnamed: 0,index,code,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,2,Georgia,GEO,4410.0,4.659097,2.050,4024000.0,Asia,no data
3,3,Norway,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
...,...,...,...,...,...,...,...,...,...
199,199,Saudi Arabia,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
200,200,Latvia,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
201,201,Sint Maarten (Dutch part),SXM,29560.0,,-99.000,40000.0,North America,no data
202,202,,,,,,,,
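``set_option()`` changes the display for the rest of the session. If you only want a different display for a single output, pandas also offers ``pd.option_context()``, which applies a setting inside a ``with`` block and restores the previous value afterwards. A minimal sketch with a made-up dataframe:

```python
import pandas as pd

df_demo = pd.DataFrame({"value": range(30)})  # made-up data with 30 rows

before = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 6):
    inside = pd.get_option("display.max_rows")
    print(df_demo)  # shortened display, only inside this block

after = pd.get_option("display.max_rows")
print(before, inside, after)  # the setting is restored after the block
```

To undo a permanent change made with ``set_option()``, you can call ``pd.reset_option("display.max_rows")``.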


The data still looks a bit messy; we will learn later how to tidy it up.

### Data types

It may also be useful to take a look at the **data types** of each of your columns:

In [None]:
df.dtypes

Unnamed: 0                  int64
index                      object
code                       object
gni_per_capita            float64
                           ...   
fertility                 float64
total population          float64
continent                  object
working hours per year     object
Length: 9, dtype: object

You already know the ``float`` and the ``int`` types, but what is the ``object`` type? It is used for string columns or mixed columns (e.g. strings and floats).
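A column often ends up as ``object`` because a few text entries are mixed in with numbers, as with ``no data`` in the working hours column. One possible way to turn such a column into numbers is ``pd.to_numeric()`` with ``errors="coerce"``, which converts everything it cannot parse into ``NaN``. The sketch below uses a made-up Series with values taken from the output above.

```python
import pandas as pd

# Values as they appear in the working hours column (numbers stored as text)
hours = pd.Series(["1754.0863", "1747.009", "no data", "1422.5608"])
print(hours.dtype)

# errors="coerce": entries that are not numbers become NaN
hours_num = pd.to_numeric(hours, errors="coerce")
print(hours_num)
```

Be aware that ``errors="coerce"`` silently turns every unparseable entry into a missing value, so it is worth checking which entries were affected.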


><font color = 4e1585> SIDENOTE: If you want to convert a column to a different datatype, you can use the ``astype`` method. For example, ``df["Unnamed: 0"].astype("float")`` would return the "Unnamed: 0" column as floats.
>
><font color = 4e1585>If you are interested in data types in pandas, see, for example:
* https://pbpython.com/pandas_dtypes.html


### Summary statistics

You can **use the ``describe()`` method to get summary statistics** for all (numeric) variables in your dataset:

In [None]:
df.describe()

Unnamed: 0.1,Unnamed: 0,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population
count,203.0,184.0,126.0,202.0,202.0
mean,101.0,14601.304348,5.533525,-8.313262,34447130.0
std,58.745213,19758.81596,1.103122,31.807608,139718100.0
min,0.0,260.0,2.694303,-99.0,10000.0
25%,50.5,2112.5,4.748194,1.564,769500.0
50%,101.0,6010.0,5.468088,2.101,6274000.0
75%,151.5,17505.0,6.280186,3.214,19471500.0
max,202.0,101120.0,7.858107,7.169,1406848000.0


You can do the same for **one specific statistic**:

In [None]:
df.max(numeric_only=True)  # Print maximum of each column in dataframe

code                                ZWE
gni_per_capita                 101120.0
life_satisfaction              7.858107
fertility                         7.169
total_population           1406848000.0
continent                 South America
working_hours_per_year        2455.5508
dtype: object

Usually it makes more sense to do this **for a specific column**:

In [None]:
print(df["gni_per_capita"].count())
print(df["gni_per_capita"].mean())
print(df["gni_per_capita"].median())
print(df["gni_per_capita"].min())
print(df["gni_per_capita"].max())

184
14601.304347826086
6010.0
260.0
101120.0


You can use the **``agg()`` method to specify a list of statistics** that should be computed:

In [None]:
print(df["gni_per_capita"].agg(["mean", "median", "min", "max"]))

mean       14601.304348
median      6010.000000
min          260.000000
max       101120.000000
Name: gni_per_capita, dtype: float64
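``agg()`` also exists for dataframes. If you pass a dictionary, you can request different statistics for different columns in a single call. A small sketch with made-up data (three rows taken from the output above):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "gni_per_capita": [17670.0, 60500.0, 4410.0],
    "fertility": [1.442, 1.858, 2.05],
})

# Keys are column names, values are lists of statistics for that column
summary = df_demo.agg({"gni_per_capita": ["min", "max"],
                       "fertility": ["mean"]})
print(summary)
```

Statistics that were not requested for a column show up as ``NaN`` in the result.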


These methods only work for numeric variables (i.e. floats, integers or booleans). For categorical (often string) variables, you may want to know the **different categories** and the **number of observations in each category**. This can be done using the **``unique()`` method** and the **``value_counts()`` method** respectively:

In [None]:
print(df["continent"].unique())  # Get categories (unique values)

['Europe' 'Oceania' 'Asia' 'North America' 'Africa' 'South America'
 'Afrika' 'asia' nan]


In [None]:
print(df["continent"].value_counts())  # Get number of observations per category

Europe           47
Afrika           35
Asia             34
North America    33
Oceania          18
Africa           15
South America    11
asia              9
Name: continent, dtype: int64
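By default, ``value_counts()`` leaves out missing values and reports absolute counts. Two optional arguments change this: ``dropna=False`` also counts missing values, and ``normalize=True`` returns shares instead of counts. A sketch with a made-up Series:

```python
import pandas as pd

continent = pd.Series(["Europe", "Asia", "asia", "Europe", None])

counts = continent.value_counts(dropna=False)   # missing values get their own row
shares = continent.value_counts(normalize=True) # shares among non-missing values

print(counts)
print(shares)
```

Counting missing values explicitly is a quick way to see how complete a categorical column is.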




---


>  <font color='teal'> **In-class exercise**:
Import the life satisfaction dataset and assign it to a variable called ``ls_data``. Print (1) the first 3 rows of the dataset, (2) the minimum, the maximum and the median population size and (3) the number of countries per continent.



---



## Cleaning data

In [None]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,index,code,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,2,Georgia,GEO,4410.0,4.659097,2.05,4024000.0,Asia,no data


You may already have noticed that our dataset has some inconsistencies, redundancies and errors that should be cleaned before we can start analyzing it.

### Changing column and row indices

The first thing we might want to tidy up is the column names. In the last tutorial we saw how to change all column names at once by assigning a list of names to ``df.columns``. But what if we only want to **change some of the column names**? We can use the **``rename`` method**:

In [None]:
df.rename(columns={'life satisfaction in cantril ladder (world happiness report 2019)': 'life_satisfaction',
                   'index': 'country'})  # Provide a dictionary as argument
                                         # ('old name':'new name', 'old name':'new name', ...)

Unnamed: 0.1,Unnamed: 0,country,code,gni_per_capita,life_satisfaction,fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,2,Georgia,GEO,4410.0,4.659097,2.050,4024000.0,Asia,no data
3,3,Norway,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
...,...,...,...,...,...,...,...,...,...
199,199,Saudi Arabia,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
200,200,Latvia,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
201,201,Sint Maarten (Dutch part),SXM,29560.0,,-99.000,40000.0,North America,no data
202,202,,,,,,,,


So far, so good. But if we look at our dataframe, we see that the column names were not modified:

In [None]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,index,code,gni_per_capita,life satisfaction in cantril ladder (world happiness report 2019),fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009


Why? The rename method did not change anything in our dataframe; it only returned a new dataframe with the renamed columns. What could we do to make the changes carry over to ``df``? One possibility is to reassign the result to ``df``. Moreover, **many pandas methods have an ``inplace`` parameter. If you set it to ``True``, changes will be done "in place", meaning that the object will be modified**:

In [None]:
df.rename(columns={'life satisfaction in cantril ladder (world happiness report 2019)': 'life_satisfaction',
                   'index': 'country'},
          inplace=True)

df.head(2)

Unnamed: 0.1,Unnamed: 0,country,code,gni_per_capita,life_satisfaction,fertility,total population,continent,working hours per year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
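The reassignment option mentioned above can be sketched as follows, using a made-up one-row dataframe. It produces the same result as ``inplace=True``.

```python
import pandas as pd

df_demo = pd.DataFrame({"index": ["Norway"], "code": ["NOR"]})

# rename() returns a new dataframe; df_demo itself is not touched
renamed = df_demo.rename(columns={"index": "country"})
cols_before = list(df_demo.columns)
print(cols_before)            # original names are still there
print(list(renamed.columns))  # new names only in the returned copy

# Reassigning the result makes the change carry over to df_demo
df_demo = df_demo.rename(columns={"index": "country"})
print(list(df_demo.columns))
```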


Another thing we might want to do is to make column labels more consistent, e.g. by **replacing all spaces with underscores**. How could we do this? We could, for example, use a list comprehension:

In [None]:
df.columns

Index(['Unnamed: 0', 'country', 'code', 'gni_per_capita', 'life_satisfaction',
       'fertility', 'total population', 'continent', 'working hours per year'],
      dtype='object')

In [None]:
column_names = [x.replace(" ", "_") for x in list(df.columns)]
print(column_names)
df.columns = column_names
df.head()

['Unnamed:_0', 'country', 'code', 'gni_per_capita', 'life_satisfaction', 'fertility', 'total_population', 'continent', 'working_hours_per_year']


Unnamed: 0,Unnamed:_0,country,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year
0,0,Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,1,Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,2,Georgia,GEO,4410.0,4.659097,2.05,4024000.0,Asia,no data
3,3,Norway,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
4,4,Bulgaria,BGR,7450.0,5.098814,1.534,7200000.0,Europe,1644.3853


><font color = 4e1585> SIDENOTE: As often with pandas, there are easier ways to do this. Pandas provides specific string methods that allow you to modify string values (in indices or series) more directly:
>
>```
># Replace empty spaces with underscores
>df.columns = df.columns.str.replace(" ", "_")
>```
>
><font color = 4e1585> We will take a closer look at string methods in pandas in the last part of the tutorial when we will learn how to work with text data.

Another thing we may want to do is to change the **row indices**.

In [None]:
df.set_index("country", inplace=True)  # Or (without dropping country column):
                                       # df.index = df["country"]
df

Unnamed: 0_level_0,Unnamed:_0,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Slovak Republic,0,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
Australia,1,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
Georgia,2,GEO,4410.0,4.659097,2.050,4024000.0,Asia,no data
Norway,3,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
...,...,...,...,...,...,...,...,...
Saudi Arabia,199,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
Latvia,200,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
Sint Maarten (Dutch part),201,SXM,29560.0,,-99.000,40000.0,North America,no data
,202,,,,,,,


We can also **revert to numeric indices using ``reset_index``** at any time:

In [None]:
df.reset_index()  # If drop=True, the old index is not added as a column
                  # Specify inplace = True to modify index in df

Unnamed: 0,country,Unnamed:_0,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year
0,Slovak Republic,0,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
1,Australia,1,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
2,Georgia,2,GEO,4410.0,4.659097,2.050,4024000.0,Asia,no data
3,Norway,3,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
...,...,...,...,...,...,...,...,...,...
199,Saudi Arabia,199,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
200,Latvia,200,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
201,Sint Maarten (Dutch part),201,SXM,29560.0,,-99.000,40000.0,North America,no data
202,,202,,,,,,,
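The effect of the ``drop`` argument mentioned in the comment above is easiest to see on a small made-up dataframe:

```python
import pandas as pd

df_demo = pd.DataFrame({"country": ["Norway", "Latvia"],
                        "code": ["NOR", "LVA"]})
by_country = df_demo.set_index("country")  # country names become the row index

kept = by_country.reset_index()              # old index returns as a column
dropped = by_country.reset_index(drop=True)  # old index is discarded

print(kept)
print(dropped)
```

In both cases the result has a fresh numeric index starting at 0; the only difference is whether the ``country`` column survives.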


### Removing, modifying and adding columns

In [None]:
df.head(1)

Unnamed: 0_level_0,Unnamed:_0,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Slovak Republic,0,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863


The ``Unnamed:_0`` column is redundant and we would like to remove it. How could this be done? We could write something like ``df = df.loc[:,"code":]``, but there is an easier solution: the **``drop()`` method**:

In [None]:
df.drop("Unnamed:_0", axis="columns", inplace=True)  # Provide a list to drop several columns,
                                                     # e.g. ["Unnamed:_0", "code"]
df

Unnamed: 0_level_0,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863
Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009
Georgia,GEO,4410.0,4.659097,2.050,4024000.0,Asia,no data
Norway,NOR,93110.0,7.444262,1.741,5200000.0,Europe,1422.5608
...,...,...,...,...,...,...,...
Saudi Arabia,SAU,23710.0,6.356393,2.507,31718000.0,Asia,no data
Latvia,LVA,15100.0,5.901154,1.606,1998000.0,Europe,1901.7413
Sint Maarten (Dutch part),SXM,29560.0,,-99.000,40000.0,North America,no data
,,,,,,,


Many methods in pandas have an optional **``axis`` parameter**. It allows you to **specify whether you are referring to rows (``axis=0`` or ``axis="rows"``) or to columns (``axis=1`` or ``axis="columns"``)**. The default value is ``axis=0``, i.e. rows. This is why we need to set the axis to columns -- ``df.drop("Unnamed:_0", inplace=True)`` would raise a ``KeyError`` because there is no row named ``Unnamed:_0``!
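A small sketch with a made-up dataframe shows the ``axis`` parameter at work, both for ``drop()`` and for a statistic such as ``mean()``. It also uses the ``columns=`` keyword of ``drop()``, which is an alternative to writing ``axis="columns"``.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]},
                       index=["x", "y", "z"])

no_row = df_demo.drop("x")                  # default axis=0: drops the row "x"
no_col = df_demo.drop("a", axis="columns")  # drops the column "a"
same = df_demo.drop(columns="a")            # keyword alternative, same result

col_means = df_demo.mean(axis=0)  # computed down the rows: one value per column
row_means = df_demo.mean(axis=1)  # computed across the columns: one value per row

print(no_row)
print(no_col)
print(col_means)
print(row_means)
```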

We have already seen how we can **change and add columns**. Let's make use of this knowledge to clean the ``continent`` column. Have a look at the values in the column:

In [None]:
df["continent"].value_counts()

Europe           47
Afrika           35
Asia             34
North America    33
Oceania          18
Africa           15
South America    11
asia              9
Name: continent, dtype: int64

We see that Africa and Asia appear in different spellings. How could we correct the ``asia`` and the ``Afrika`` values? One way to do this is by **filtering the values using the ``loc`` indexer** (i.e. by doing Boolean indexing):

In [None]:
df.loc[df["continent"] == "asia", "continent"] = "Asia"  # Or: df["continent"] = df["continent"].str.replace("asia", "Asia")
df.loc[df["continent"] == "Afrika", "continent"] = "Africa"  # Or: df["continent"] = df["continent"].str.replace("Afrika", "Africa")
df["continent"].value_counts()

Africa           50
Europe           47
Asia             43
North America    33
Oceania          18
South America    11
Name: continent, dtype: int64
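The two variants shown in the cell above are not fully equivalent: the ``loc`` approach only changes cells whose whole value matches, whereas ``str.replace`` also replaces matches inside longer strings. The sketch below illustrates the difference with a made-up Series (``Eurasia`` does not occur in our data).

```python
import pandas as pd

labels = pd.Series(["asia", "Eurasia"])  # made-up values for illustration

# str.replace works on substrings, so "Eurasia" is changed as well
substring = labels.str.replace("asia", "Asia")

# Boolean indexing only touches cells that are exactly "asia"
exact = labels.copy()
exact.loc[exact == "asia"] = "Asia"

print(substring.tolist())
print(exact.tolist())
```

For the continent column both approaches give the same result, but the exact match is the safer habit.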

Now suppose we would like to add a further column indicating if a country is a *high income economy* (12,536 USD or more), an *upper middle economy* (4,046 USD - 12,535 USD), a *lower middle economy* (1,036 USD - 4,045 USD) or a *low income economy* (1,035 USD or less). Just as in the example with continents, you could use the ``loc`` indexer to do a series of replacements:

In [None]:
df.loc[df["gni_per_capita"] >= 12536, "income_level"] = "High income"

df.loc[(df["gni_per_capita"] >= 4046) & (df["gni_per_capita"] < 12536),
       "income_level"] = "Upper middle income"

df.loc[(df["gni_per_capita"] >= 1036) & (df["gni_per_capita"] < 4046),
       "income_level"] = "Lower middle income"

df.loc[df["gni_per_capita"] < 1036, "income_level"] = "Low income"

df["income_level"].value_counts()

High income            60
Upper middle income    52
Lower middle income    52
Low income             20
Name: income_level, dtype: int64

><font color = 4e1585> SIDENOTE: If you have to specify multiple conditions, each condition has to be put in parentheses! Note also that the ``and``, ``or``, ``not`` and ``in`` operators we got to know in the first tutorial do not work element-wise on pandas Series. You can use the following alternatives:
* and:  ``&``
* or:  ``|``
* not:  ``~``
* in: ``isin()`` (method)
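The four alternatives can be tried out on a small made-up dataframe (three rows with values taken from our data, continent spellings already corrected):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"gni_per_capita": [17670.0, 4410.0, 350.0],
     "continent": ["Europe", "Asia", "Africa"]},
    index=["Slovak Republic", "Georgia", "Malawi"],
)
gni = df_demo["gni_per_capita"]
cont = df_demo["continent"]

middle = df_demo[(gni >= 1036) & (gni < 12536)]           # and
rich_or_asia = df_demo[(gni >= 12536) | (cont == "Asia")]  # or
not_europe = df_demo[~(cont == "Europe")]                  # not
asia_africa = df_demo[cont.isin(["Asia", "Africa"])]       # in

print(middle)
print(rich_or_asia)
print(not_europe)
print(asia_africa)
```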

This is a lot of error-prone coding. Let's drop the column and then try a different approach.

In [None]:
df.drop("income_level", axis=1, inplace=True)


The **``apply()`` method** offers a more readable way to modify data or generate new columns. It allows you **to apply a function to each element of a pandas Series** (e.g. a column in a dataframe). Let's first define the function we would like to apply:

In [None]:
# Define function
def income_group(x):
    if x >= 12536:
        return "High income"
    elif x >= 4046:
        return "Upper middle income"
    elif x >= 1036:
        return "Lower middle income"
    elif x < 1036:
        return "Low income"


# Check if it works
income_group(500)

'Low income'

Now we can insert our function into the apply method and check the result:

In [None]:
df["income_level"] = df["gni_per_capita"].apply(income_group)
df.head(3)

Unnamed: 0_level_0,code,gni_per_capita,life_satisfaction,fertility,total_population,continent,working_hours_per_year,income_level
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Slovak Republic,SVK,17670.0,6.235111,1.442,5436000.0,Europe,1754.0863,High income
Australia,AUS,60500.0,7.176993,1.858,23932000.0,Oceania,1747.009,High income
Georgia,GEO,4410.0,4.659097,2.05,4024000.0,Asia,no data,Upper middle income


How does this work? Our ``income_group`` function is applied to each element ``x`` of the pandas Series ``df["gni_per_capita"]`` and the result is assigned to the new column ``df["income_level"]``. You can think of it in terms of a loop through all elements ``x`` of your Series.
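The loop analogy can be made concrete. The sketch below repeats the ``income_group`` function in a slightly shortened form and compares ``apply()`` with an explicit list comprehension on a made-up Series; both produce the same values.

```python
import pandas as pd


def income_group(x):
    if x >= 12536:
        return "High income"
    elif x >= 4046:
        return "Upper middle income"
    elif x >= 1036:
        return "Lower middle income"
    elif x < 1036:
        return "Low income"


gni = pd.Series([17670.0, 4410.0, 350.0],
                index=["Slovak Republic", "Georgia", "Malawi"])

via_apply = gni.apply(income_group)

# The same thing written as an explicit loop over all elements
via_loop = pd.Series([income_group(x) for x in gni], index=gni.index)

print(via_apply)
print(via_apply.equals(via_loop))
```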

The apply method is often used **in combination with a lambda function**. Suppose we would like to add a column with the flag of each country. The pycountry module allows you to retrieve information (including the flag) on all countries:

In [None]:
%pip install pycountry
import pycountry

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
print(pycountry.countries.get(alpha_3="CHE"))  # Retrieve information on
                                               # Switzerland based on ISO-3 code
pycountry.countries.get(alpha_3="CHE").flag

Country(alpha_2='CH', alpha_3='CHE', flag='🇨🇭', name='Switzerland', numeric='756', official_name='Swiss Confederation')


'🇨🇭'

Let's pass this to a lambda function.

In [None]:
df["code"]

country
Slovak Republic              SVK
Australia                    AUS
Georgia                      GEO
Norway                       NOR
                            ... 
Saudi Arabia                 SAU
Latvia                       LVA
Sint Maarten (Dutch part)    SXM
NaN                          NaN
Name: code, Length: 203, dtype: object

In [None]:
df["code"][:-1].apply(
    lambda x: pycountry.countries.get(alpha_3=x).flag
)  # Apply lambda function (excluding last value)

country
Slovak Republic              🇸🇰
Australia                    🇦🇺
Georgia                      🇬🇪
Norway                       🇳🇴
                             ..
Benin                        🇧🇯
Saudi Arabia                 🇸🇦
Latvia                       🇱🇻
Sint Maarten (Dutch part)    🇸🇽
Name: code, Length: 202, dtype: object

As the last value in the column is not a country code, we also have to decide how to handle such missing values if we want to add the flag as a new column. Let's define a lambda function that returns the flag if the value is a string and "Not found" otherwise:

In [None]:
get_flag = (lambda x: pycountry.countries.get(alpha_3=x).flag
            if isinstance(x, str) else "Not found")  # Simple if-else conditions
                                                     # can be defined on one line
print(get_flag("CHE"))
get_flag(0)

🇨🇭


'Not found'

In [None]:
df["flag"] = df["code"].apply(get_flag)
df["flag"]

country
Slovak Republic                     🇸🇰
Australia                           🇦🇺
Georgia                             🇬🇪
Norway                              🇳🇴
                               ...    
Saudi Arabia                        🇸🇦
Latvia                              🇱🇻
Sint Maarten (Dutch part)           🇸🇽
NaN                          Not found
Name: flag, Length: 203, dtype: object

But what if we wanted to apply a function that is **based on the values of several columns**? For example, we might want to identify the world leaders (large population and high income) in our data. The apply method does not only exist for pandas Series, but also for **pandas Dataframes**. If you apply a (reducing) function to a dataframe, you further need to specify whether this is to be done **along columns or rows**.

In [None]:
df.apply(lambda x: "world leader" if (x["gni_per_capita"] > 6000)
         & (x["total_population"] > 10000000)
         else "no world leader",
         axis="columns")  # Along columns, x refers to rows

country
Slovak Republic              no world leader
Australia                       world leader
Georgia                      no world leader
Norway                       no world leader
                                  ...       
Saudi Arabia                    world leader
Latvia                       no world leader
Sint Maarten (Dutch part)    no world leader
NaN                          no world leader
Length: 203, dtype: object

In [None]:
df.apply(lambda x: x.mean() if x.dtype != "object" else "-",
         axis="rows")  # Along rows, x refers to columns

code                                 -
gni_per_capita            14601.304348
life_satisfaction             5.533525
fertility                    -8.313262
                              ...     
continent                            -
working_hours_per_year               -
income_level                         -
flag                                 -
Length: 9, dtype: object

What does your local variable ``x`` refer to in each case?

If ``apply`` is executed **along columns (first example), ``x`` refers to a row in the dataframe** (so, ``x["gni_per_capita"]`` will refer to a different element of the ``gni_per_capita`` column in each "iteration"). Typically, the outcome will be a column.

Analogously, if you use ``apply`` **along rows (second example), ``x`` will point to a column** in your dataframe (allowing us to iterate through all columns and compute the mean across all rows in each of them). Typically, the outcome will be a row.
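
To make the two cases more tangible, here is a minimal sketch on a made-up dataframe (``toy`` and its columns are hypothetical, not part of the tutorial's data). It uses ``axis="index"``, the documented spelling of what is written as ``axis="rows"`` above:

```python
import pandas as pd

# Hypothetical toy data, not the tutorial's dataset
toy = pd.DataFrame(
    {"income": [100, 200, 300], "population": [10, 20, 30]},
    index=["A", "B", "C"],
)

# axis="columns": the function receives one row at a time,
# so the result has one entry per row (indexed by A, B, C)
per_row = toy.apply(lambda row: row["income"] / row["population"],
                    axis="columns")

# axis="index": the function receives one column at a time,
# so the result has one entry per column
per_column = toy.apply(lambda col: col.max(), axis="index")

print(per_row)
print(per_column)
```

The index of each result shows what was iterated over: ``per_row`` is labelled by the row labels, ``per_column`` by the column names.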

Other useful methods to change column values are **``map()``** and **``replace()``**:

In [None]:
continent_codes = {"Africa": "AF",
                   "Europe": "EU",
                   "North America": "NA",
                   "South America": "SA",
                   "Asia": "AS"}

In [None]:
df.continent.map(continent_codes)

country
Slovak Republic               EU
Australia                    NaN
Georgia                       AS
Norway                        EU
                            ... 
Saudi Arabia                  AS
Latvia                        EU
Sint Maarten (Dutch part)     NA
NaN                          NaN
Name: continent, Length: 203, dtype: object

In [None]:
df.continent.replace(continent_codes)

country
Slovak Republic                   EU
Australia                    Oceania
Georgia                           AS
Norway                            EU
                              ...   
Saudi Arabia                      AS
Latvia                            EU
Sint Maarten (Dutch part)         NA
NaN                              NaN
Name: continent, Length: 203, dtype: object

Can you spot how they differ? <font color = FF10F0> Answer: <font color = white> While map creates missings whenever a value is not contained in our dictionary, replace keeps the original values.
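
The difference can be reproduced on a small, made-up Series (the values and the shortened dictionary below are illustrative assumptions). The last line sketches one possible way to make ``map`` keep unmatched values:

```python
import pandas as pd

continents = pd.Series(["Europe", "Oceania", "Asia"])  # toy values
codes = {"Europe": "EU", "Asia": "AS"}  # "Oceania" is deliberately left out

mapped = continents.map(codes)        # unmatched value becomes NaN
replaced = continents.replace(codes)  # unmatched value is kept

# One way to get replace-like behaviour from map:
# fill the gaps with the original values
mapped_filled = continents.map(codes).fillna(continents)

print(mapped)
print(replaced)
print(mapped_filled)
```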

><font color = 4e1585> SIDENOTE: The ``apply()`` method allows you to write compact code to perform sophisticated operations, but, as it is basically a loop in disguise (``apply`` is still much faster than manually programming a loop), it can be a bit slow on very large datasets. Pandas offers a wide range of (vectorized) methods that allow you to perform operations on rows or columns more efficiently. So, if what you would like to do is not very complex, it makes sense to look for a method first before using ``apply``. For example, instead of:



In [None]:
import numpy as np
df["code"].apply(lambda x: x.lower() if type(x) == str else "")

country
Slovak Republic              svk
Australia                    aus
Georgia                      geo
Norway                       nor
                            ... 
Saudi Arabia                 sau
Latvia                       lva
Sint Maarten (Dutch part)    sxm
NaN                             
Name: code, Length: 203, dtype: object

><font color = 4e1585> ...you could use:

In [None]:
df["code"].str.lower()

country
Slovak Republic              svk
Australia                    aus
Georgia                      geo
Norway                       nor
                            ... 
Saudi Arabia                 sau
Latvia                       lva
Sint Maarten (Dutch part)    sxm
NaN                          NaN
Name: code, Length: 203, dtype: object

><font color = 4e1585> In some cases, you can also use a function from numpy when none is implemented in pandas:

In [None]:
np.log(df["total_population"])  # Returns a pandas Series

country
Slovak Republic              15.508554
Australia                    16.990727
Georgia                      15.207787
Norway                       15.464169
                               ...    
Saudi Arabia                 17.272395
Latvia                       14.507657
Sint Maarten (Dutch part)    10.596635
NaN                                NaN
Name: total_population, Length: 203, dtype: float64


><font color = 4e1585> To find out more about ``apply()`` and similar methods such as ``map()`` or ``applymap()`` (deprecated since pandas 2.1 in favour of ``DataFrame.map()``) see, for example, here:
- https://towardsdatascience.com/introduction-to-pandas-apply-applymap-and-map-5d3e044e93ff
- https://www.digitalocean.com/community/tutorials/pandas-dataframe-apply-examples
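
As a further illustration of the vectorized approach, the "world leader" classification from above could also be written without ``apply``. This is only a sketch on made-up data: the dataframe ``toy`` and its values are hypothetical, while the column names and thresholds follow the tutorial's example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data using the tutorial's column names
toy = pd.DataFrame(
    {"gni_per_capita": [50000.0, 800.0, np.nan],
     "total_population": [25_000_000.0, 30_000_000.0, 5_000_000.0]},
    index=["A", "B", "C"],
)

# Build one boolean Series for the whole column at once;
# comparisons involving NaN evaluate to False
is_leader = ((toy["gni_per_capita"] > 6000)
             & (toy["total_population"] > 10_000_000))

# np.where picks the first label where the condition is True,
# and the second label everywhere else
toy["status"] = np.where(is_leader, "world leader", "no world leader")
print(toy["status"])
```

Here ``&`` combines two whole columns element by element, which is why no explicit loop over rows is needed.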

---

>  <font color='teal'> **In-class exercise**:
Set the country code as the index of your ``ls_data`` dataframe and rename the column ``fertility`` to ``fertility_rate``.

 >  <font color='teal'>Drop the ``Unnamed: 0`` and the ``total population`` columns.

>  <font color='teal'> Can you find a way to drop the last row in the dataset?

>  <font color='teal'> Can you create a new column called ``code2`` containing a two-letter code for each country using the apply method?

>  <font color='teal'> Pandas offers simpler solutions if you need to extract a substring from a column. Can you find one of them to create ``code2`` more efficiently?

>  <font color='teal'> *Extra task*: The correct two-letter country code (ISO-2) will not always correspond to the first two letters of the three-letter code (ISO-3). Use the pycountry module to retrieve the correct ISO-2 code (alpha_2). Print out the name and codes for all countries where the correct two-letter code does not correspond to the first two letters of the three-letter code.



---



### Sorting

You can **sort your dataset by the index or by the values of a column (or several columns)**. This can be done with the **``sort_index()`` and ``sort_values()``** methods, respectively. Let's sort our dataset alphabetically by country name:

In [None]:
df.sort_index(inplace=True)
df

| country | code | gni_per_capita | life_satisfaction | fertility | total_population | continent | working_hours_per_year | income_level | flag |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | AFG | 600.0 | 2.694303 | 4.976 | 34414000.0 | Asia | no data | Low income | 🇦🇫 |
| Albania | ALB | 4390.0 | 5.004403 | 1.677 | 2891000.0 | Europe | no data | Upper middle income | 🇦🇱 |
| Algeria | DZA | 4850.0 | 5.043086 | 3.043 | 39728000.0 | Africa | no data | Upper middle income | 🇩🇿 |
| American Samoa | ASM | NaN | NaN | -99.000 | 56000.0 | Oceania | no data | NaN | 🇦🇸 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| West Bank and Gaza | PSE | 3670.0 | 4.553922 | 3.955 | 4529000.0 | Asia | no data | Lower middle income | 🇵🇸 |
| Zambia | ZMB | 1580.0 | 4.041488 | 4.918 | 15879000.0 | Africa | no data | Lower middle income | 🇿🇲 |
| Zimbabwe | ZWE | 1280.0 | 3.616480 | 3.896 | 13815000.0 | Africa | no data | Lower middle income | 🇿🇼 |
| NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Not found |


 Let's now take a look at the countries with the lowest life satisfaction:

In [None]:
# Sort by life satisfaction and show the first rows of the dataset
df.sort_values("life_satisfaction").head()

| country | code | gni_per_capita | life_satisfaction | fertility | total_population | continent | working_hours_per_year | income_level | flag |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | AFG | 600.0 | 2.694303 | 4.976 | 34414000.0 | Asia | no data | Low income | 🇦🇫 |
| Malawi | MWI | 350.0 | 3.334634 | 4.527 | 16745000.0 | Africa | no data | Low income | 🇲🇼 |
| Tanzania | TZA | 980.0 | 3.445023 | 5.079 | 51483000.0 | Africa | no data | Low income | 🇹🇿 |
| Botswana | BWA | 6840.0 | 3.461366 | 2.968 | 2121000.0 | Africa | no data | Upper middle income | 🇧🇼 |
| Rwanda | RWA | 750.0 | 3.561047 | 4.157 | 11369000.0 | Africa | no data | Low income | 🇷🇼 |


><font color = 4e1585> SIDENOTE: If you want to apply several methods in succession, you can chain them one after the other as in the example above. The statement will be evaluated from left to right. In our case, the ``sort_values()`` method will be applied first and the ``head()`` method second. However, this only works if the object returned by the first method supports the second method. In our example, ``sort_values()`` returns a pandas dataframe, which is why we can apply the ``head()`` method next. Can you guess why ``df.sort_values("life_satisfaction", inplace=True).head()`` would not work?
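
The chaining behaviour described in the sidenote can be checked on a small example. The sketch below uses a made-up dataframe (``toy`` and its ``score`` column are hypothetical, not part of the tutorial's data) and shows what each call returns:

```python
import pandas as pd

toy = pd.DataFrame({"score": [3, 1, 2]}, index=["A", "B", "C"])

# Without inplace, sort_values returns a new dataframe, so chaining works
lowest = toy.sort_values("score").head(1)
print(lowest)

# With inplace=True the dataframe is modified directly
# and the call itself returns None
result = toy.sort_values("score", inplace=True)
print(result)           # None, so result.head() would raise an AttributeError
print(list(toy.index))  # toy itself is now sorted: ['B', 'C', 'A']
```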

In [None]:
# If you want to sort descending instead of ascending, you would need to
# change the default value of the 'ascending' option:
df.sort_values("life_satisfaction", ascending=False).head()

| country | code | gni_per_capita | life_satisfaction | fertility | total_population | continent | working_hours_per_year | income_level | flag |
|---|---|---|---|---|---|---|---|---|---|
| Finland | FIN | 47150.0 | 7.858107 | 1.653 | 5481000.0 | Europe | 1637.1949 | High income | 🇫🇮 |
| Denmark | DNK | 60510.0 | 7.648786 | 1.737 | 5689000.0 | Europe | 1412.2715 | High income | 🇩🇰 |
| Switzerland | CHE | 85670.0 | 7.508587 | 1.533 | 8297000.0 | Europe | 1589.4751 | High income | 🇨🇭 |
| Netherlands | NLD | 49810.0 | 7.463097 | 1.693 | 16938000.0 | Europe | 1423.9202 | High income | 🇳🇱 |
| Norway | NOR | 93110.0 | 7.444262 | 1.741 | 5200000.0 | Europe | 1422.5608 | High income | 🇳🇴 |
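
Sorting by several columns works the same way: ``sort_values()`` accepts a list of column names, and ``ascending`` can be a list with one entry per column. A minimal sketch on made-up data (``toy`` and its values are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame(
    {"continent": ["Europe", "Africa", "Europe", "Africa"],
     "life_satisfaction": [7.5, 4.0, 6.1, 4.9]},
    index=["A", "B", "C", "D"],
)

# Sort by continent (A to Z) and, within each continent,
# by life satisfaction from highest to lowest
ordered = toy.sort_values(["continent", "life_satisfaction"],
                          ascending=[True, False])
print(ordered)
```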


### Handling missing values

Let's take a look at our dataset:

In [None]:
df.drop(["income_level", "flag"], axis=1, inplace=True)
df.head()

| country | code | gni_per_capita | life_satisfaction | fertility | total_population | continent | working_hours_per_year |
|---|---|---|---|---|---|---|---|
| Afghanistan | AFG | 600.0 | 2.694303 | 4.976 | 34414000.0 | Asia | no data |
| Albania | ALB | 4390.0 | 5.004403 | 1.677 | 2891000.0 | Europe | no data |
| Algeria | DZA | 4850.0 | 5.043086 | 3.043 | 39728000.0 | Africa | no data |
| American Samoa | ASM | NaN | NaN | -99.0 | 56000.0 | Oceania | no data |
| Andorra | AND | NaN | NaN | -99.0 | 78000.0 | Europe | no data |


**Missing values in pandas are coded as ``NaN``** (Not a Number). In our dataset, some of the missing values are properly coded as ``NaN`` (Pandas generated ``NaN`` for empty cells when reading the CSV), while others are not:

In [None]:
# Missings in gni_per_capita are properly coded as NaN
print(df["gni_per_capita"].head())
print(df["gni_per_capita"].dtypes)

country
Afghanistan        600.0
Albania           4390.0
Algeria           4850.0
American Samoa       NaN
Andorra              NaN
Name: gni_per_capita, dtype: float64
float64


In [None]:
# Missings in working_hours_per_year are coded as "no data"
print(df["working_hours_per_year"].head())
print(df["working_hours_per_year"].dtypes)

country
Afghanistan       no data
Albania           no data
Algeria           no data
American Samoa    no data
Andorra           no data
Name: working_hours_per_year, dtype: object
object


 When missing values are not properly specified, some operations may not work (or yield incorrect results):

In [None]:
# This will return an error:
df["working_hours_per_year"].mean()

TypeError: ignored

How can we change these "no data" string values to proper missings? **Missing values can be created using numpy**:

In [None]:
import numpy as np  # Import numpy
np.nan              # Create a missing value (the alias np.NaN was removed in NumPy 2.0)

nan

Now we can replace the "no data" occurrences with missings:

In [None]:
df.loc[df["working_hours_per_year"] == "no data",
       "working_hours_per_year"] = np.nan

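## Note that these would not work:
# NaN is not a defined name, and the assignment would target entire rows:
#df[df["working_hours_per_year"]=="no data"] = NaN    # Not correct!!
# "NaN" is an ordinary string, not a missing value:
#df[df["working_hours_per_year"]=="no data"] = "NaN"  # Not correct!!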

But we still have the problem that ``working_hours_per_year`` is not a numeric column:

In [None]:
df.dtypes

code                       object
gni_per_capita            float64
life_satisfaction         float64
fertility                 float64
total_population          float64
continent                  object
working_hours_per_year     object
dtype: object

We have to **recast it to float** before computing the mean:

In [None]:
df["working_hours_per_year"] = df["working_hours_per_year"].astype("float")
df["working_hours_per_year"].mean()

1848.6521682539683
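
As an aside, the two steps above (replacing the placeholder, then recasting) can often be combined with ``pd.to_numeric``. This is an alternative to the approach shown in this tutorial, sketched here on made-up values:

```python
import pandas as pd

# Toy values stored as strings, similar to a column read from a CSV
hours = pd.Series(["1637.19", "no data", "1412.27"], dtype="object")

# errors="coerce" turns anything that cannot be parsed as a number into NaN
hours_numeric = pd.to_numeric(hours, errors="coerce")

print(hours_numeric)
print(hours_numeric.mean())  # NaN is skipped when computing the mean
```

Note that ``errors="coerce"`` silently converts every unparseable entry, so it is worth checking the number of missings afterwards.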

It is often important to **know whether certain values are missing or not** (e.g. to count the number of missings). We can do this using the ``isna`` (or the ``notna``) method:

In [None]:
print(df["working_hours_per_year"].isna().head(10))   # checks if a value is missing
print(df["working_hours_per_year"].notna().head(10))  # checks if a value is not missing

## Note that these would not work:
#print(df["working_hours_per_year"] == np.nan)  # Not correct!!
#print(df["working_hours_per_year"] == "NaN")   # Not correct!!

country
Afghanistan             True
Albania                 True
Algeria                 True
American Samoa          True
                       ...  
Antigua and Barbuda     True
Argentina              False
Armenia                 True
Aruba                   True
Name: working_hours_per_year, Length: 10, dtype: bool
country
Afghanistan            False
Albania                False
Algeria                False
American Samoa         False
                       ...  
Antigua and Barbuda    False
Argentina               True
Armenia                False
Aruba                  False
Name: working_hours_per_year, Length: 10, dtype: bool
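
The reason why a comparison with ``np.nan`` does not work is that ``NaN`` never compares equal to anything, not even to itself. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False

values = pd.Series([1.5, np.nan, 3.0])  # toy values

# An equality test therefore never finds the missing entry...
found_by_equality = (values == np.nan).sum()

# ...whereas isna() does
found_by_isna = values.isna().sum()

print(found_by_equality, found_by_isna)
```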


You can use these functions to get the **number (or share) of (non-)missing values for each column**:

In [None]:
# Number of missings per column
print(df.isna().sum())

# Share of non-missing values per column
print(df.notna().mean())

code                        1
gni_per_capita             19
life_satisfaction          77
fertility                   1
total_population            1
continent                   1
working_hours_per_year    140
dtype: int64
code                      0.995074
gni_per_capita            0.906404
life_satisfaction         0.620690
fertility                 0.995074
total_population          0.995074
continent                 0.995074
working_hours_per_year    0.310345
dtype: float64


><font color = 4e1585> SIDENOTE: Why does this work? ``df.isna()`` returns a dataframe full of booleans. When we apply the ``sum`` method, ``True`` values are counted as 1 and ``False`` values as 0. The same applies to the second example: the ``mean`` of these 0/1 values is the share of ``True`` values.
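
A tiny example with a hand-made boolean Series (hypothetical values) shows the arithmetic:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])

print(flags.sum())   # 3 (number of True values)
print(flags.mean())  # 0.75 (share of True values)
```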

In [None]:
df.head()

| country | code | gni_per_capita | life_satisfaction | fertility | total_population | continent | working_hours_per_year |
|---|---|---|---|---|---|---|---|
| Afghanistan | AFG | 600.0 | 2.694303 | 4.976 | 34414000.0 | Asia | NaN |
| Albania | ALB | 4390.0 | 5.004403 | 1.677 | 2891000.0 | Europe | NaN |
| Algeria | DZA | 4850.0 | 5.043086 | 3.043 | 39728000.0 | Africa | NaN |
| American Samoa | ASM | NaN | NaN | -99.0 | 56000.0 | Oceania | NaN |
| Andorra | AND | NaN | NaN | -99.0 | 78000.0 | Europe | NaN |


We may also want to **delete rows (or columns) with missing data**. This can be done with the **``dropna`` method**:

In [None]:
# Drop rows with all missings
df.dropna(how="all", inplace=True)  # how="any" would drop rows with at least one missing

# Drop columns with all missings
df.dropna(axis=1, how="all", inplace=True)  # how="any" would drop columns
                                            # with at least one missing
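
To see how the ``how`` option changes the result, here is a small sketch on a made-up dataframe (``toy`` and its columns are hypothetical). It also uses the ``subset`` argument, which restricts the check to selected columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"gni": [600.0, np.nan, np.nan],
     "hours": [np.nan, 1600.0, np.nan]},
    index=["A", "B", "C"],
)

rows_all = toy.dropna(how="all")          # drops only C (every value missing)
rows_any = toy.dropna(how="any")          # drops A, B and C (each has a missing)
rows_subset = toy.dropna(subset=["gni"])  # drops B and C (gni is missing)

print(rows_all)
print(rows_any)
print(rows_subset)
```

Without ``inplace=True``, each call returns a new dataframe and leaves ``toy`` unchanged.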

---

>  <font color='teal'> **In-class exercise**:
We will continue to work with your ``ls_data`` dataframe. Can you sort your dataframe by continent and fertility rate?

>  <font color='teal'> Missing fertility values were coded as -99. Can you convert them to real missings? Print the number of missing values for the fertility column.



---



## Exporting data

After we have cleaned our data we might like to export it to some folder on our computer or on Google Drive. We can do this using the **``to_csv()`` method**:

In [None]:
import os  # needed for os.listdir() (if not already imported)
df.to_csv("countries_life_satisfaction_cleaned.csv")  # export file to current working directory
os.listdir()  # list contents of current working directory

['.config', 'drive', 'countries_life_satisfaction_cleaned.csv', 'sample_data']

><font color = 4e1585> SIDENOTE: The ``to_csv()`` method also accepts a number of arguments that allow you to specify what should be done with the index, which delimiter to use etc. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html. You can use similar methods to export the data to other file formats.
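
As an illustration of such arguments, the sketch below writes a made-up dataframe with a semicolon as delimiter and reads it back. To avoid creating a file, it writes to an in-memory buffer (``io.StringIO``) instead of a path; the dataframe ``toy`` and its values are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"code": ["AFG", "ALB"], "gni_per_capita": [600.0, 4390.0]},
    index=pd.Index(["Afghanistan", "Albania"], name="country"),
)

# sep=";" changes the delimiter, index_label names the index column
buffer = io.StringIO()
toy.to_csv(buffer, sep=";", index_label="country")
print(buffer.getvalue())

# Read the data back with the same delimiter and restore the index
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", index_col="country")
print(restored)
```

When reading a file back in, the delimiter and index column have to match what was used for the export.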

## Next week

In the next tutorial we will have a look at some more special topics for `pandas` as well as plotting with `matplotlib`. If you would like to prepare in advance, we have the following recommendations:
  
* Grouping data with pandas:
  + https://youtu.be/txMdrV1Ut64  (49 min)
  + Or: https://youtu.be/qy0fDqoMJx8 (8 min)
* Combining dataframes with pandas:
  + https://youtu.be/wzN1UyfRSWI (13 min)
  + Or: https://youtu.be/iYWKfUOtGaw (21 min)
* Plotting with matplotlib:
  + https://youtu.be/DAQNHzOcO5A (32 min)
  + Or: https://youtu.be/UO98lJQ3QGI (very extensive, parts 1-4, 6 and 7)

If you prefer working with text instead of videos:

* https://www.tutorialspoint.com/python_pandas/python_pandas_groupby.htm
* https://pandas.pydata.org/docs/user_guide/merging.html
* https://matplotlib.org/stable/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py