*Part 2: Python for Data Analysis III*
### Data Wrangling with Pandas
# Exercises



---


<font color='violet'>
Hints are written in white, so you do not see them immediately. If you highlight them (or double-click on them), they will appear! 
<font color='white'> I am a hint! :-)


---


For this exercise, we will use the following datasets: 
* ``homicide.csv`` (homicide and other data for all countries)
* ``real-gdp-per-capita.csv`` (country-level GDP data from 1950 to 2017)
* ``life_satisfaction_clean.csv`` (life satisfaction data by country)

You can find them here: https://drive.google.com/drive/folders/1QnHTDQ0tb8_Ex6dMgNCwqJuL3PxzEKIv 

Copy them to your drive or to a folder on your computer. Import ``pandas``, ``numpy`` and ``os`` and change your directory to the folder where you placed your data. If you work with Google Drive, mount your drive!

In [1]:
import pandas as pd
import numpy as np
import os

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
os.chdir("/content/drive/MyDrive/MyData")

If you do not manage to mount your drive, you can also read the data using `pd.read_csv()` from the following URLs:

* `"http://farys.org/daten/homicide.csv"`
* `"http://farys.org/daten/real-gdp-per-capita.csv"`
* `"http://farys.org/daten/life_satisfaction_clean.csv"`
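If you switch between a local folder and the URLs, a small helper can save some typing. The following is only a sketch: `read_data` and `BASE_URL` are hypothetical names introduced here, and the base URL is simply the common prefix of the three addresses listed above.

```python
import os
import pandas as pd

# Common prefix of the URLs listed above
BASE_URL = "http://farys.org/daten/"


def read_data(filename, folder=".", **kwargs):
    """Read a CSV from `folder` if the file exists there, otherwise from BASE_URL.

    Extra keyword arguments are passed on to pd.read_csv.
    """
    local_path = os.path.join(folder, filename)
    source = local_path if os.path.exists(local_path) else BASE_URL + filename
    return pd.read_csv(source, **kwargs)
```

A call such as `homicide = read_data("homicide.csv")` would then use the local copy when it exists and fall back to the URL otherwise.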

## Exercise 1

Consider the following two datasets:

In [None]:
income1 = pd.DataFrame({"income": [4300, 8600, 5200], 
                        "sex": ["m", "m", "f"]},
                       index=["Max", "Peter", "Mary" ])
income1

Unnamed: 0,income,sex
Max,4300,m
Peter,8600,m
Mary,5200,f


In [None]:
income2 = pd.DataFrame({"income": [2300, 9600], "sex": ["f", "f"]},
                       index=["Annina", "Petra"])
income2

Unnamed: 0,income,sex
Annina,2300,f
Petra,9600,f


Can you combine them into one dataset called ``income`` using the ``concat`` function? 

You also have information on people's age:

In [None]:
age = pd.DataFrame({"age": [23, 34, 61, 19, 56]},
                    index=["Max", "Peter", "Mary", "Annina", "Petra"])

 Can you add it to your ``income`` data using ``concat``? And can you do the same using ``merge``? Which is usually preferable?
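As a reminder of the mechanics, here is a minimal sketch on separate, made-up toy data (the names `scores_a`, `scores_b` and `heights` are hypothetical and not part of the exercise). It shows `concat` stacking rows, `concat` placing frames side by side, and `merge` joining on the index.

```python
import pandas as pd

# Hypothetical toy data, deliberately different from the exercise data
scores_a = pd.DataFrame({"score": [10, 20]}, index=["Ann", "Bob"])
scores_b = pd.DataFrame({"score": [30]}, index=["Cem"])
# Note that the rows are in a different order than in the scores
heights = pd.DataFrame({"height": [180, 165, 172]}, index=["Cem", "Ann", "Bob"])

# axis=0 (the default) stacks the rows of both frames
scores = pd.concat([scores_a, scores_b])

# axis=1 puts frames side by side and aligns rows by index label, not by position
side_by_side = pd.concat([scores, heights], axis=1)

# merge on the index pairs the same rows, with an explicit join type
merged = scores.merge(heights, left_index=True, right_index=True, how="left")

print(side_by_side)
print(merged)
```

Both approaches match rows by label here, so the different row order in `heights` does no harm. With `merge`, the join type and the key are stated explicitly in the call.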


Now consider the following two datasets: 

In [None]:
df1 = pd.DataFrame({"name":["Max", "Peter", "Mary" ],
                   "income": [4300, 8600, 5200],
                    "sex": ["m", "m", "f"]})
df1

Unnamed: 0,name,income,sex
0,Max,4300,m
1,Peter,8600,m
2,Mary,5200,f


In [None]:
df2 = pd.DataFrame({"name": ["Max", "Peter", "Annina", "Petra"],
                    "age": [23, 34, 19, 56]})
df2

Unnamed: 0,name,age
0,Max,23
1,Peter,34
2,Annina,19
3,Petra,56


Perform (1) an ``inner``, (2) an ``outer`` and (3) a ``left`` merge and print the resulting dataframes. Which observations are kept in each case?

In [None]:
# Inner merge

In [None]:
# Outer merge

In [None]:
# Left merge
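To check which observations survive a merge, the `indicator=True` option of `merge` can help: it adds a `_merge` column that records whether a row came from the left frame, the right frame, or both. The sketch below uses made-up frames (`left`, `right` and the column `key` are hypothetical), so the exercise itself stays open.

```python
import pandas as pd

# Hypothetical toy data: keys "b" and "c" occur in both frames
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

results = {}
for how in ["inner", "outer", "left"]:
    # indicator=True adds a "_merge" column with left_only / right_only / both
    results[how] = left.merge(right, on="key", how=how, indicator=True)
    print(how, "->", len(results[how]), "rows")
    print(results[how], end="\n\n")
```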

## Exercise 2

Consider the following dataset:

In [None]:
df_long = pd.DataFrame({"first_name": ["Max", "Max", "Annina", "Annina", "Annina"],
                        "year": [2021, 2022, 2020, 2021, 2022],
                        "salary": [4200, 4300, 0, 5700, 5800],
                        "occupation": ["phd student", "phd student", "student", "banker", "banker"]})
df_long

Unnamed: 0,first_name,year,salary,occupation
0,Max,2021,4200,phd student
1,Max,2022,4300,phd student
2,Annina,2020,0,student
3,Annina,2021,5700,banker
4,Annina,2022,5800,banker


Can you reshape it to a *wide* data format? Assign your new dataset to the variable ``df_wide``.

Print out the column names of your new dataset. 

You should get a MultiIndex object (https://pandas.pydata.org/docs/user_guide/advanced.html). Can you find out how to print out Annina's salary in 2022 using this MultiIndex?
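For orientation, this is how selection on MultiIndex columns works in general. The sketch assumes a wide table built with `pivot` from made-up data (the frame `long_df` and the columns `city`, `temp` and `rain` are hypothetical and the numbers are invented).

```python
import pandas as pd

# Hypothetical long-format toy data
long_df = pd.DataFrame({
    "city": ["Bern", "Bern", "Basel", "Basel"],
    "year": [2021, 2022, 2021, 2022],
    "temp": [9.1, 9.8, 10.4, 11.0],
    "rain": [1050, 980, 840, 790],
})

# Pivoting several value columns yields MultiIndex columns of (variable, year)
wide = long_df.pivot(index="city", columns="year", values=["temp", "rain"])
print(wide.columns)

# A single cell: row label plus a (level 0, level 1) column tuple
basel_temp_2022 = wide.loc["Basel", ("temp", 2022)]

# A sub-table: selecting on the first column level only
temps = wide["temp"]
print(basel_temp_2022)
print(temps)
```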

You don't like the MultiIndex and would like to have the following column names:
'name',
 'salary2020',
 'salary2021',
 'salary2022',
 'occupation2020',
 'occupation2021',
 'occupation2022'. Try to achieve this (e.g. by using a list comprehension).
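The general idea behind the list comprehension: iterating over MultiIndex columns yields one tuple per column, and the parts of each tuple can be joined into a single string. A minimal sketch on a made-up frame (all names and numbers are hypothetical):

```python
import pandas as pd

# Hypothetical frame with two-level column labels such as ("temp", 2021)
columns = pd.MultiIndex.from_product([["temp", "rain"], [2021, 2022]])
wide = pd.DataFrame([[9.1, 9.8, 1050, 980]], index=["Bern"], columns=columns)

# Each column label is a tuple; join its parts into one flat string
wide.columns = [f"{variable}{year}" for variable, year in wide.columns]
print(list(wide.columns))  # ['temp2021', 'temp2022', 'rain2021', 'rain2022']
```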


Now look at the (row) index of your dataframe:

You will see that this index has a name (``first_name``), which is why ``first_name`` appears above "Annina" when you print out the data. You can also access it by typing ``df_wide.index.name``. Can you find a way to rename the index to ``name``?
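Two common ways to change an index name are sketched below on a made-up frame (the frame and the names `city`, `place` and `town` are hypothetical). `rename_axis` returns a new dataframe, whereas assigning to `index.name` changes the existing one.

```python
import pandas as pd

# Hypothetical frame whose index is named "city"
df = pd.DataFrame({"temp": [9.8, 11.0]},
                  index=pd.Index(["Bern", "Basel"], name="city"))

# Option 1: rename_axis returns a new frame and leaves df untouched
renamed = df.rename_axis("place")

# Option 2: assign the attribute directly, which modifies df itself
df.index.name = "town"

print(renamed.index.name, df.index.name)
```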

Check out the documentation of the ``wide_to_long`` function (which is similar to ``melt``): https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html. Can you use it to convert your data back to *long* format?
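As a rough guide to the arguments, the sketch below applies `wide_to_long` to a made-up table (the columns `city`, `temp…` and `rain…` are hypothetical). `stubnames` lists the column prefixes, `i` names the identifier column, and `j` names the new column that receives the numeric suffix.

```python
import pandas as pd

# Hypothetical wide table: one column per variable and year
wide = pd.DataFrame({
    "city": ["Bern", "Basel"],
    "temp2021": [9.1, 10.4],
    "temp2022": [9.8, 11.0],
    "rain2021": [1050, 840],
    "rain2022": [980, 790],
})

# The result is indexed by (city, year) with one column per stub name
long_df = pd.wide_to_long(wide, stubnames=["temp", "rain"], i="city", j="year")

# Turn the index back into columns and sort for readability
long_df = long_df.reset_index().sort_values(["city", "year"]).reset_index(drop=True)
print(long_df)
```

Note that `i` must refer to a column, so a wide table that carries the identifier in its index needs `reset_index()` first.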

## Exercise 3

Read in the homicide data.

You would like to analyze homicide patterns across the globe. As a first step, you would like to know if homicide rates differ between continents. Use ``groupby`` to create an aggregated dataset called ``homicide_continents`` with the homicide rate of the average country as well as the number of countries on each continent. It should look as follows:


| continent | mean | count |
| :- | -: | :-: |
| Africa | 0.888916 | 50 |
| Asia | 0.603732 | 43 |
| Europe | 0.218499 | 41 |
| North America | 2.922844 | 25 |
| Oceania | 0.568807 | 13 |
| South America | 2.172048 | 11 |
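Several statistics per group can be requested in one step by passing a list of function names to `agg`. A minimal sketch on made-up data (the columns `region` and `rate` are hypothetical and are not the column names of the homicide file):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "rate": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# One row per group, one column per requested statistic
summary = df.groupby("region")["rate"].agg(["mean", "count"])
print(summary)
```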



## Exercise 4

Now, you would like to find out if there is a correlation between homicide rates and life satisfaction. Import the dataset ``life_satisfaction_clean.csv`` and merge it with your homicide data (keeping only the life satisfaction column). Make sure no observations in your homicide data are dropped.

In [None]:
df = homicide.merge(satisfaction[["code", "life_satisfaction"]], on="code", how="left")

Group your countries into five quantile groups (quintiles) according to their life satisfaction and print out the mean, median, minimum and maximum homicide rate for each group.
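One possible route is `pd.qcut`, which bins a numeric column into groups of roughly equal size. It is sketched here on made-up numbers (the columns `satisfaction` and `rate` and all values are hypothetical). Rows with a missing value in the binned column receive no group and are left out by `groupby`.

```python
import pandas as pd

# Hypothetical toy data: ten units with a score and an outcome
df = pd.DataFrame({
    "satisfaction": [3.1, 4.0, 4.8, 5.5, 5.9, 6.3, 6.8, 7.0, 7.4, 7.8],
    "rate": [5.0, 4.0, 6.0, 2.0, 3.0, 1.5, 1.0, 0.8, 0.5, 0.4],
})

# q=5 splits at the 20th, 40th, 60th and 80th percentiles; labels name the bins
df["quintile"] = pd.qcut(df["satisfaction"], q=5, labels=[1, 2, 3, 4, 5])

# Summary statistics of the outcome within each bin
stats = df.groupby("quintile", observed=True)["rate"].agg(["mean", "median", "min", "max"])
print(stats)
```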

## Exercise 5

Read in the file ``real-gdp-per-capita.csv``. You will have to specify a few parameters to make sure this works properly. Try to read in the data in a way that requires only minimal (or no) additional cleaning afterwards.
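Which parameters are needed depends on the file, so inspect it in a text editor first. The sketch below only illustrates a few `read_csv` parameters that often matter. The raw text is entirely made up and is not meant to mirror the structure of the GDP file.

```python
import io
import pandas as pd

# Hypothetical raw text standing in for a messy CSV file
raw = io.StringIO(
    "Source: made-up example\n"
    "country;year;gdp\n"
    "A;1950;1234,5\n"
    "A;2017;..\n"
)

df = pd.read_csv(
    raw,
    sep=";",            # column separator used in the file
    decimal=",",        # decimal mark used in the numbers
    skiprows=1,         # skip the leading note line
    na_values=[".."],   # extra strings to treat as missing
)
print(df)
print(df.dtypes)
```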

You would like to look at GDP growth by continent. Merge your GDP data with the ``continent`` column from the dataset in the previous exercise. What kind of join did you perform?

Create a dataframe with the per-capita income growth (between 1950 and 2017) of the median country on each continent. The dataframe should have a row for each continent and a column with the respective growth rate. On which continent did the typical country fare best/worst?
<font color='violet'>
Hints: <font color='white'> You may need to group and reshape your data. If you keep only the years 1950 and 2017 before reshaping, your  dataset will be more compact.