# Preparation for class in week 5


Our example is inspired by the Dutch LISS data, specifically the Time Use and Consumption waves from November 2019 and April 2020.

The data contain the following variables, sorted alphabetically:

| Variable     | Content                                                           |
|:-------------|:------------------------------------------------------------      |
| geslacht     | Gender (Man: Male, Vrouw: Female)                                 |
| nohouse_encr | Household identifier                                              |
| nomem_encr   | Member identifier                                                 |
| v1q1_v1col1  | Working hours (Nov) / Working hours at workplace (Apr)            |
| v1q1a_v1col1 | Working hours in home office, no kids in HH (Apr)                 |
| v1q1b_v1col1 | Working hours in home office while responsible for kids (Apr)     |
| v1q1c_v1col1 | Working hours in home office while not responsible for kids (Apr) |
| v1q5_v1col1  | Childcare hours (all in Nov, residual in Apr)                     |
| v1q5a_v1col1 | Homeschooling hours (Apr)                                         |


In this exercise, we will only work with the November data.

We start by importing pandas and the other modules used below.

In [62]:
import pandas as pd
import numpy as np
from functools import reduce
from statistics import mean

## Read in the data

The November data is stored in Stata format. Replace the `XXXXX` in the next cell by the appropriate Pandas function.

In [48]:
nov_2019 = pd.read_stata("time_use_2019-11.dta", convert_categoricals=True)

## Browse the data

You can browse through the data by just typing the name of a DataFrame as the last thing in a cell.

This will yield nicely formatted HTML output. Use this to find out the difference between setting `convert_categoricals` to `True` or `False` in the call above. In case you know some Stata: Can you explain what is happening? If not, don't worry about the reasons behind what is going on.


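With `convert_categoricals=True`, the `geslacht` column displays the labels `Man` and `Vrouw`. With `False`, it displays the underlying numeric codes (1 and 0).
The reason is that Stata stores such variables as numbers with attached value labels; `convert_categoricals=True` applies those labels and returns a pandas categorical column.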
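
The effect of `convert_categoricals` can be reproduced without the survey file. The following is a minimal sketch on a made-up one-column frame (`toy` and `toy.dta` are hypothetical names, not part of the exercise data): it writes a categorical column to a temporary Stata file and reads it back with both settings. The numeric codes in this toy file are whatever pandas assigns when writing, so they need not match the codes in the LISS file.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the survey file: one labelled (categorical) column.
toy = pd.DataFrame({"geslacht": pd.Categorical(["Man", "Vrouw", "Man"])})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.dta")
    # Categorical columns are written as numeric codes plus value labels.
    toy.to_stata(path, write_index=False)

    # True: value labels are applied, the column comes back as a categorical.
    labelled = pd.read_stata(path, convert_categoricals=True)
    # False: value labels are ignored, the column holds the raw numeric codes.
    coded = pd.read_stata(path, convert_categoricals=False)

print(labelled["geslacht"].tolist(), labelled["geslacht"].dtype)
print(coded["geslacht"].tolist(), coded["geslacht"].dtype)
```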

In [49]:
nov_2019.head()

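|   | nomem_encr | geslacht | v1q1_v1col1 | v1q5_v1col1 | nohouse_encr |
|:--|:-----------|:---------|:------------|:------------|:-------------|
| 0 | 1687033.0  | Man      | 0.0         | 2.0         | 1049420.0    |
| 1 | 1662353.0  | Vrouw    | 9.0         | 0.0         | 1049420.0    |
| 2 | 1631191.0  | Man      | 67.0        | 1.0         | 1011033.0    |
| 3 | 1687630.0  | Vrouw    | 17.0        | 6.0         | 1011033.0    |
| 4 | 1746405.0  | Man      | 40.0        | 12.0        | 1047651.0    |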


## Rename variables

We give the Dutch and partly cryptic variable names sensible identifiers. The `rename` method on DataFrames returns a new DataFrame, so we need to assign it to `nov_2019` again if we want to continue working with it.

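Note that this is a very stateful transformation if you assign to the same variable: once the cell has run, the original column names are gone. Executing the cell a second time does nothing, because `rename` silently ignores labels it cannot find unless you pass `errors="raise"`. To get the original names back, you need to re-load the data above.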

In [50]:
nov_2019 = nov_2019.rename(
    columns={
        "geslacht": "gender",
        "v1q1_v1col1": "working_hours",
        "v1q5_v1col1": "childcare_hours",
        "nohouse_encr": "household_identifier",
        "nomem_encr": "member_identifier"
    }
)
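
As a small illustration of how `rename` behaves, here is a sketch on a made-up two-row frame (`raw`, `mapping`, and `second_run_failed` are hypothetical names used only here). It shows that the original frame is left untouched and what happens when the same mapping is applied a second time.

```python
import pandas as pd

raw = pd.DataFrame({"geslacht": ["Man", "Vrouw"], "v1q1_v1col1": [0.0, 9.0]})
mapping = {"geslacht": "gender", "v1q1_v1col1": "working_hours"}

# rename returns a new DataFrame; raw keeps its original column names.
renamed = raw.rename(columns=mapping)

# Applying the mapping again changes nothing: by default (errors="ignore"),
# labels that are not found are skipped silently.
again = renamed.rename(columns=mapping)

# With errors="raise", the missing old names trigger a KeyError instead.
try:
    renamed.rename(columns=mapping, errors="raise")
    second_run_failed = False
except KeyError:
    second_run_failed = True

print(list(raw.columns), list(renamed.columns), second_run_failed)
```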

View the DataFrame again to check that the changes were applied.

In [51]:
nov_2019.head()

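|   | member_identifier | gender | working_hours | childcare_hours | household_identifier |
|:--|:------------------|:-------|:--------------|:----------------|:---------------------|
| 0 | 1687033.0         | Man    | 0.0           | 2.0             | 1049420.0            |
| 1 | 1662353.0         | Vrouw  | 9.0           | 0.0             | 1049420.0            |
| 2 | 1631191.0         | Man    | 67.0          | 1.0             | 1011033.0            |
| 3 | 1687630.0         | Vrouw  | 17.0          | 6.0             | 1011033.0            |
| 4 | 1746405.0         | Man    | 40.0          | 12.0            | 1047651.0            |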


## Convert data types

`nomem_encr` and `nohouse_encr` (renamed above to `member_identifier` and `household_identifier`) are classic identifiers. They can be used to identify a particular observation, but they do not carry any meaning beyond that.

You have heard in the screencast that you should never use floating point numbers for identifiers. It is common that this happens in Stata, though (e.g., through use of `compress` or mathematical operations on integers, which do implicit type conversions unless you request an integer back).

Add new columns `hh_id` and `ind_id` with sensible data types. 


In [52]:
# Convert each pandas Series to an integer dtype with astype():
nov_2019["hh_id"] = nov_2019["household_identifier"].astype(int)
nov_2019["ind_id"] = nov_2019["member_identifier"].astype(int)
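
Before converting float identifiers, it can be worth checking that nothing is lost. The following is a sketch under assumptions, not part of the exercise: `float_ids` and `with_gap` are made-up Series, and the nullable `"Int64"` dtype is shown only as one option for columns that contain missing values, which a plain integer dtype cannot represent.

```python
import pandas as pd

float_ids = pd.Series([1049420.0, 1011033.0, 1047651.0], name="household_identifier")

# astype to an integer dtype would truncate fractional parts silently,
# so verify first that every value is a whole number.
if not (float_ids % 1 == 0).all():
    raise ValueError("identifier column contains non-integer values")

# Requesting int64 explicitly avoids depending on the platform default.
hh_id = float_ids.astype("int64")

# A column with a missing value: the nullable "Int64" dtype keeps the gap
# as <NA> instead of failing on the conversion.
with_gap = pd.Series([1687033.0, None], name="member_identifier")
ind_id = with_gap.astype("Int64")

print(hh_id.dtype, hh_id.tolist())
print(ind_id.dtype, ind_id.isna().tolist())
```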

In [53]:
# Again view the dataframe
nov_2019.head()

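|   | member_identifier | gender | working_hours | childcare_hours | household_identifier | hh_id   | ind_id  |
|:--|:------------------|:-------|:--------------|:----------------|:---------------------|:--------|:--------|
| 0 | 1687033.0         | Man    | 0.0           | 2.0             | 1049420.0            | 1049420 | 1687033 |
| 1 | 1662353.0         | Vrouw  | 9.0           | 0.0             | 1049420.0            | 1049420 | 1662353 |
| 2 | 1631191.0         | Man    | 67.0          | 1.0             | 1011033.0            | 1011033 | 1631191 |
| 3 | 1687630.0         | Vrouw  | 17.0          | 6.0             | 1011033.0            | 1011033 | 1687630 |
| 4 | 1746405.0         | Man    | 40.0          | 12.0            | 1047651.0            | 1047651 | 1746405 |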


## Select columns

We do not need to keep the old identifiers anymore.

You can select a subset of columns by including a list of column names in the standard square brackets used for indexing in Python.

Replace the XXXX and YYYY appropriately and include all other variables that you want to keep.

In [54]:
# Display the column names:
nov_2019.columns.values

array(['member_identifier', 'gender', 'working_hours', 'childcare_hours',
       'household_identifier', 'hh_id', 'ind_id'], dtype=object)

In [55]:
nov_2019 = nov_2019[["hh_id", "ind_id", "gender", "working_hours", "childcare_hours"]]
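
The difference between indexing with a single name and with a list of names is easy to miss. A minimal sketch on a made-up frame (`df` and the result names are hypothetical) shows what each form returns:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hh_id": [1049420, 1049420],
        "ind_id": [1687033, 1662353],
        "gender": ["Man", "Vrouw"],
        "working_hours": [0.0, 9.0],
    }
)

# A single column name returns a Series.
one_column = df["gender"]

# A list of names returns a DataFrame with the columns in the listed order.
sub_frame = df[["gender", "hh_id"]]

# A one-element list still returns a DataFrame, not a Series.
still_a_frame = df[["gender"]]

print(type(one_column).__name__, type(sub_frame).__name__, type(still_a_frame).__name__)
```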

In [56]:
nov_2019.head()

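|   | hh_id   | ind_id  | gender | working_hours | childcare_hours |
|:--|:--------|:--------|:-------|:--------------|:----------------|
| 0 | 1049420 | 1687033 | Man    | 0.0           | 2.0             |
| 1 | 1049420 | 1662353 | Vrouw  | 9.0           | 0.0             |
| 2 | 1011033 | 1631191 | Man    | 67.0          | 1.0             |
| 3 | 1011033 | 1687630 | Vrouw  | 17.0          | 6.0             |
| 4 | 1047651 | 1746405 | Man    | 40.0          | 12.0            |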


## Set index

The default index created by Pandas (a DataFrame always has an index) does not make much sense.

Create an index based on one or several columns that make sense to you by replacing the XXXX with an appropriate construct.

In [57]:
nov_2019_with_index = nov_2019.set_index(['hh_id', 'ind_id'])
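
One benefit of a meaningful index is direct lookup by household or by person. The sketch below uses the first four rows shown earlier; the names `df`, `indexed`, `household`, and `person` are made up for illustration, and the `sort_index()` call is an optional extra that is not part of the exercise.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hh_id": [1049420, 1049420, 1011033, 1011033],
        "ind_id": [1687033, 1662353, 1631191, 1687630],
        "gender": ["Man", "Vrouw", "Man", "Vrouw"],
        "working_hours": [0.0, 9.0, 67.0, 17.0],
        "childcare_hours": [2.0, 0.0, 1.0, 6.0],
    }
)

# Household and individual together identify one observation.
indexed = df.set_index(["hh_id", "ind_id"]).sort_index()

# A sensible index should not contain duplicates.
print(indexed.index.is_unique)

# Selecting on the first level returns all members of one household.
household = indexed.loc[1049420]

# A full (hh_id, ind_id) tuple returns a single row as a Series.
person = indexed.loc[(1011033, 1687630)]

print(household)
print(person["working_hours"])
```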

In [58]:
nov_2019_with_index.head(10)

| hh_id   | ind_id  | gender | working_hours | childcare_hours |
|:--------|:--------|:-------|:--------------|:----------------|
| 1049420 | 1687033 | Man    | 0.0           | 2.0             |
| 1049420 | 1662353 | Vrouw  | 9.0           | 0.0             |
| 1011033 | 1631191 | Man    | 67.0          | 1.0             |
| 1011033 | 1687630 | Vrouw  | 17.0          | 6.0             |
| 1047651 | 1746405 | Man    | 40.0          | 12.0            |
| 1047651 | 1712561 | Vrouw  | 0.0           | 40.0            |
| 1059082 | 1755058 | Man    | 32.0          | 8.0             |
| 1059082 | 1703625 | Vrouw  | 0.0           | 40.0            |
| 1032991 | 1660073 | Man    | 40.0          | 20.0            |
| 1032991 | 1679690 | Vrouw  | 0.0           | 0.0             |


## Our first reduction operation

Calculate the mean hours spent on different activities.

In [70]:
# Define an average() function by hand, for comparison with statistics.mean
def average(values):
    """Return the arithmetic mean of a sequence of numbers.

    Args:
        values (sequence): the numbers to average; must support len()

    Returns:
        float: the arithmetic mean of the values
    """
    return sum(values) / len(values)

In [74]:
# Use the built-in one from statistics:
print("Mean working hours in Nov 2019 is: ", end="")
print(mean(nov_2019['working_hours']))

# And from the self-defined one:
print("and it should be the same as the results from the self-defined function. Indeed, here's the output: ")
print(average(nov_2019['working_hours']))

Mean working hours in Nov 2019 is: 17.988636363636363
and it should be the same as the results from the self-defined function. Indeed, here's the output: 
17.988636363636363
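
pandas also provides a `mean` method directly on a Series. The two approaches agree on complete data but differ when values are missing. A small sketch on made-up numbers (`hours` and `with_gap` are hypothetical):

```python
import math
from statistics import mean

import pandas as pd

hours = pd.Series([0.0, 9.0, 67.0, 17.0, 40.0])

# On complete data, both give the same result.
print(mean(hours), hours.mean())

# With a missing value, statistics.mean propagates NaN,
# while Series.mean skips missing values by default (skipna=True).
with_gap = pd.Series([0.0, 9.0, float("nan")])
stdlib_result = mean(with_gap)
pandas_result = with_gap.mean()

print(stdlib_result, pandas_result)
```

Which behaviour is appropriate depends on how missing answers should be treated in the analysis.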


## Our second reduction operation

Calculate the mean hours spent on different activities by gender by appending the appropriate method for calculating a mean to the following expression.

In [76]:
nov_2019.groupby("gender")[["working_hours", "childcare_hours"]].mean()

| gender | working_hours | childcare_hours |
|:-------|:--------------|:----------------|
| Man    | 23.434783     | 7.315789        |
| Vrouw  | 12.02381      | 18.263158       |
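
To see what the split-apply-combine step does, the same expression can be run on just the ten rows displayed earlier. This is only a sketch: `sample`, `by_gender`, and `summary` are made-up names, and the numbers differ from the full-sample result above because only ten observations are used.

```python
import pandas as pd

# The ten rows shown in the indexed view, alternating Man / Vrouw.
sample = pd.DataFrame(
    {
        "gender": ["Man", "Vrouw"] * 5,
        "working_hours": [0.0, 9.0, 67.0, 17.0, 40.0, 0.0, 32.0, 0.0, 40.0, 0.0],
        "childcare_hours": [2.0, 0.0, 1.0, 6.0, 12.0, 40.0, 8.0, 40.0, 20.0, 0.0],
    }
)

# Split by gender, take the mean of each selected column, combine the results.
by_gender = sample.groupby("gender")[["working_hours", "childcare_hours"]].mean()
print(by_gender)

# agg can report several statistics at once, here the mean next to the group size.
summary = sample.groupby("gender")["working_hours"].agg(["mean", "count"])
print(summary)
```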
