## Data Management Exercise for EPP
The exercise is inspired by the Dutch LISS data, specifically the Time Use and Consumption waves from November 2019 and April 2020.
The data contain the following variables, sorted alphabetically:

| Variable     | Content                                                           |
|:-------------|:------------------------------------------------------------      |
| geslacht     | Gender (Man: Male, Vrouw: Female)                                 |
| nohouse_encr | Household identifier                                              |
| nomem_encr   | Member identifier                                                 |
| v1q1_v1col1  | Working hours (Nov) / Working hours at workplace (Apr)            |
| v1q1a_v1col1 | Working hours in home office, no kids in HH (Apr)                 |
| v1q1b_v1col1 | Working hours in home office while responsible for kids (Apr)     |
| v1q1c_v1col1 | Working hours in home office while not responsible for kids (Apr) |
| v1q5_v1col1  | Childcare hours (all in Nov, residual in Apr)                     |
| v1q5a_v1col1 | Homeschooling hours (Apr)                                         |

In [1]:
import pandas as pd

In [2]:
## Read in the data
nov_2019 = pd.read_stata("time_use_2019-11.dta",convert_categoricals=True)
apr_2020 = pd.read_csv("time_use_2020-04.csv")

## November Dataset

### Browse the data types

In [3]:
nov_2019.dtypes

nomem_encr       float64
geslacht        category
v1q1_v1col1      float64
v1q5_v1col1      float64
nohouse_encr     float64
dtype: object

### Browse the data
You can browse through the data by just typing the name of a DataFrame as the last thing in a cell. This will yield a nice html-formatted output.

In [4]:
nov_2019

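|     | nomem_encr | geslacht | v1q1_v1col1 | v1q5_v1col1 | nohouse_encr |
|:----|:-----------|:---------|:------------|:------------|:-------------|
| 0   | 1687033.0  | Man      | 0.0         | 2.0         | 1049420.0    |
| 1   | 1662353.0  | Vrouw    | 9.0         | 0.0         | 1049420.0    |
| 2   | 1631191.0  | Man      | 67.0        | 1.0         | 1011033.0    |
| 3   | 1687630.0  | Vrouw    | 17.0        | 6.0         | 1011033.0    |
| 4   | 1746405.0  | Man      | 40.0        | 12.0        | 1047651.0    |
| ... | ...        | ...      | ...         | ...         | ...          |
| 83  | 1700319.0  | Man      | 40.0        | NaN         | 1132053.0    |
| 84  | 1696174.0  | Man      | 0.0         | NaN         | 1156059.0    |
| 85  | 1678816.0  | Vrouw    | 24.0        | 25.0        | 1144468.0    |
| 86  | 1668177.0  | Man      | 36.0        | 4.0         | 1144468.0    |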


### Rename variables
We give the Dutch and partly cryptic variable names sensible identifiers. The `rename` method on DataFrames returns a new DataFrame by default; the cell below passes `inplace=True`, which modifies `nov_2019` directly instead. Either way, this is a stateful transformation: once the cell has run, `nov_2019` no longer has the original column names, so anything that relies on them will not work until you reload the data above.

In [5]:
nov_2019.rename(columns={"geslacht": "gender","v1q1_v1col1": "work_hrs",
                         "v1q5_v1col1": "childcare_hrs"},inplace=True)
nov_2019

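|     | nomem_encr | gender | work_hrs | childcare_hrs | nohouse_encr |
|:----|:-----------|:-------|:---------|:--------------|:-------------|
| 0   | 1687033.0  | Man    | 0.0      | 2.0           | 1049420.0    |
| 1   | 1662353.0  | Vrouw  | 9.0      | 0.0           | 1049420.0    |
| 2   | 1631191.0  | Man    | 67.0     | 1.0           | 1011033.0    |
| 3   | 1687630.0  | Vrouw  | 17.0     | 6.0           | 1011033.0    |
| 4   | 1746405.0  | Man    | 40.0     | 12.0          | 1047651.0    |
| ... | ...        | ...    | ...      | ...           | ...          |
| 83  | 1700319.0  | Man    | 40.0     | NaN           | 1132053.0    |
| 84  | 1696174.0  | Man    | 0.0      | NaN           | 1156059.0    |
| 85  | 1678816.0  | Vrouw  | 24.0     | 25.0          | 1144468.0    |
| 86  | 1668177.0  | Man    | 36.0     | 4.0           | 1144468.0    |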
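
The difference between getting a new DataFrame back and leaving the original untouched can be checked on a toy frame. This is a minimal sketch: `toy` and its values are made up, and only the column names mirror the exercise.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the exercise.
toy = pd.DataFrame({"geslacht": ["Man", "Vrouw"], "v1q1_v1col1": [40.0, 24.0]})

# Without inplace=True, rename returns a new DataFrame and leaves `toy` untouched.
renamed = toy.rename(columns={"geslacht": "gender", "v1q1_v1col1": "work_hrs"})

print(list(toy.columns))      # original names are still there
print(list(renamed.columns))  # new names live on the returned object

# errors="raise" makes rename complain when a key is missing,
# for example when the data have already been renamed.
try:
    renamed.rename(columns={"geslacht": "gender"}, errors="raise")
except KeyError as err:
    print("second rename failed:", err)
```

By default `rename` silently ignores keys that are not present, so `errors="raise"` is a cheap way to notice that a cell ran on data in an unexpected state.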


### Convert data types
`nomem_encr` and `nohouse_encr` are classical identifiers. They can be used to identify a particular observation, but they do not carry any meaning beyond that. You have heard in the screencast that you should never use floating point numbers for identifiers. Add new columns `hh_id` and `ind_id` with sensible data types.

In [6]:
nov_2019["hh_id"] = nov_2019["nohouse_encr"].astype('int')
nov_2019["ind_id"] = nov_2019["nomem_encr"].astype('int')
nov_2019.dtypes

nomem_encr        float64
gender           category
work_hrs          float64
childcare_hrs     float64
nohouse_encr      float64
hh_id               int64
ind_id              int64
dtype: object
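
One reason floats make poor identifiers is that they cannot represent every integer exactly. The sketch below illustrates this and also shows pandas' nullable `Int64` dtype, which may be useful if an identifier column contains missing values; the `ids` Series is hypothetical and not part of the exercise data.

```python
import pandas as pd

# Floats represent integers exactly only up to 2**53; beyond that, neighbours collide.
big_id = 2**53 + 1
print(float(big_id) == float(big_id - 1))  # two different ids become the same float

# Hypothetical identifier column with one missing value.
ids = pd.Series([1687033.0, 1662353.0, None])

# astype("int") would raise here because NaN has no integer representation;
# the nullable "Int64" dtype keeps the gap as <NA>.
ids_nullable = ids.astype("Int64")
print(ids_nullable.dtype)
print(ids_nullable.isna().sum())
```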

### Select columns
We do not need to keep the old identifiers anymore. You can select a subset of columns by passing a list of column names inside the standard square brackets used for indexing in Python.

In [7]:
nov_2019.columns

Index(['nomem_encr', 'gender', 'work_hrs', 'childcare_hrs', 'nohouse_encr',
       'hh_id', 'ind_id'],
      dtype='object')

In [8]:
nov_2019 = nov_2019[['gender', 'work_hrs', 'childcare_hrs','hh_id', 'ind_id']]
nov_2019.dtypes

gender           category
work_hrs          float64
childcare_hrs     float64
hh_id               int64
ind_id              int64
dtype: object
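
What the square brackets return depends on what you pass in. The following sketch uses a made-up `toy` frame to show the difference, and adds an explicit `.copy()`, which is one common way to make sure later assignments to the subset do not trigger a `SettingWithCopyWarning`.

```python
import pandas as pd

# Made-up values for illustration.
toy = pd.DataFrame({"gender": ["Man", "Vrouw"], "work_hrs": [40.0, 24.0], "hh_id": [1, 1]})

as_series = toy["work_hrs"]                  # a single label returns a Series
as_frame = toy[["work_hrs"]]                 # a list of labels returns a DataFrame
subset = toy[["gender", "work_hrs"]].copy()  # explicit copy, safe to modify later

subset["month"] = "November"                 # does not touch `toy`
print(type(as_series).__name__, type(as_frame).__name__)
print("month" in toy.columns)
```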

### Set index
The default index created by Pandas (a DataFrame always has an index) does not make much sense. Create an index based on a column / several columns that makes sense to you.

In [9]:
nov_2019_with_index = nov_2019.set_index(['hh_id','ind_id'])
nov_2019_with_index

| hh_id   | ind_id  | gender | work_hrs | childcare_hrs |
|:--------|:--------|:-------|:---------|:--------------|
| 1049420 | 1687033 | Man    | 0.0      | 2.0           |
| 1049420 | 1662353 | Vrouw  | 9.0      | 0.0           |
| 1011033 | 1631191 | Man    | 67.0     | 1.0           |
| 1011033 | 1687630 | Vrouw  | 17.0     | 6.0           |
| 1047651 | 1746405 | Man    | 40.0     | 12.0          |
| ...     | ...     | ...    | ...      | ...           |
| 1132053 | 1700319 | Man    | 40.0     | NaN           |
| 1156059 | 1696174 | Man    | 0.0      | NaN           |
| 1144468 | 1678816 | Vrouw  | 24.0     | 25.0          |
| 1144468 | 1668177 | Man    | 36.0     | 4.0           |
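
A meaningful index pays off when you look things up. The sketch below, on a made-up three-row frame, checks that the chosen index identifies each row uniquely and then selects by household and by person.

```python
import pandas as pd

# Made-up identifiers and hours for illustration.
toy = pd.DataFrame(
    {"hh_id": [10, 10, 20], "ind_id": [1, 2, 3], "work_hrs": [0.0, 9.0, 67.0]}
)
indexed = toy.set_index(["hh_id", "ind_id"])

print(indexed.index.is_unique)            # every (hh_id, ind_id) pair occurs once
print(indexed.loc[10])                    # all members of household 10
print(indexed.loc[(20, 3), "work_hrs"])   # one person's working hours
```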


### Our first reduction operation
Calculate the mean hours spent on different activities.

In [10]:
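# numeric_only=True skips the categorical gender column (required in recent pandas versions)
nov_2019_with_index.mean(axis=0, numeric_only=True)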

work_hrs         17.988636
childcare_hrs    12.789474
dtype: float64

### Our second reduction operation
Calculate the mean hours spent on different activities by gender.

In [11]:
nov_2019_with_index.groupby(['gender']).mean()

| gender | work_hrs  | childcare_hrs |
|:-------|:----------|:--------------|
| Man    | 23.434783 | 7.315789      |
| Vrouw  | 12.02381  | 18.263158     |
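
Group means can hide how many observations stand behind them, because missing values are dropped silently. A small sketch on made-up data that reports the mean together with the number of non-missing values per group; `observed=True` is an optional argument that limits the result to categories actually present in the data.

```python
import pandas as pd

# Made-up values; one missing entry for illustration.
toy = pd.DataFrame(
    {
        "gender": pd.Categorical(["Man", "Vrouw", "Man", "Vrouw"]),
        "work_hrs": [40.0, 20.0, 30.0, None],
    }
)

# agg takes a list of reductions; count ignores missing values.
summary = toy.groupby("gender", observed=True)["work_hrs"].agg(["mean", "count"])
print(summary)
```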


## April Dataset

### Browse the data types

In [12]:
apr_2020.dtypes

geslacht         object
v1q1_v1col1     float64
v1q1a_v1col1    float64
v1q1b_v1col1    float64
v1q1c_v1col1    float64
v1q5_v1col1     float64
v1q5a_v1col1    float64
nohouse_encr      int64
nomem_encr        int64
dtype: object

### Browse the data

In [13]:
apr_2020

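|     | geslacht | v1q1_v1col1 | v1q1a_v1col1 | v1q1b_v1col1 | v1q1c_v1col1 | v1q5_v1col1 | v1q5a_v1col1 | nohouse_encr | nomem_encr |
|:----|:---------|:------------|:-------------|:-------------|:-------------|:------------|:-------------|:-------------|:-----------|
| 0   | Man      | 0.0         | NaN          | 0.0          | 0.0          | 0.0         | 0.0          | 1049420      | 1687033    |
| 1   | Vrouw    | 3.0         | NaN          | 0.0          | 0.0          | 0.0         | 0.0          | 1049420      | 1662353    |
| 2   | Man      | 47.0        | NaN          | 4.0          | 0.0          | 4.0         | 0.0          | 1011033      | 1631191    |
| 3   | Vrouw    | 8.0         | NaN          | 0.0          | 8.0          | 0.0         | 0.0          | 1011033      | 1687630    |
| 4   | Man      | 36.0        | NaN          | 0.0          | 0.0          | 0.0         | 0.0          | 1047651      | 1746405    |
| ... | ...      | ...         | ...          | ...          | ...          | ...         | ...          | ...          | ...        |
| 101 | Man      | 0.0         | 0.0          | NaN          | NaN          | NaN         | NaN          | 1156059      | 1696174    |
| 102 | Vrouw    | 16.0        | NaN          | 0.0          | 0.0          | 15.0        | 20.0         | 1144468      | 1678816    |
| 103 | Man      | 0.0         | NaN          | 8.0          | 24.0         | 10.0        | 10.0         | 1144468      | 1668177    |
| 104 | Man      | 0.0         | 0.0          | NaN          | NaN          | NaN         | NaN          | 1159704      | 1655142    |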


### Convert data types
Make sure that all columns have a sensible data type (try `apr_2020.dtypes`).

In [14]:
apr_2020["geslacht"]=apr_2020["geslacht"].astype('category')
apr_2020.dtypes

geslacht        category
v1q1_v1col1      float64
v1q1a_v1col1     float64
v1q1b_v1col1     float64
v1q1c_v1col1     float64
v1q5_v1col1      float64
v1q5a_v1col1     float64
nohouse_encr       int64
nomem_encr         int64
dtype: object
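
To see what the conversion to `category` changes, it can help to look at the two pieces a categorical column consists of: the distinct labels and an integer code per row. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Made-up values for illustration.
gender = pd.Series(["Man", "Vrouw", "Man", "Man"])
gender_cat = gender.astype("category")

print(gender_cat.cat.categories.tolist())  # the distinct labels
print(gender_cat.cat.codes.tolist())       # integer code stored per row
```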

### Rename the variables
Give the columns of the April data sensible names.

In [15]:
apr_2020.rename(columns={"geslacht": "gender","v1q1_v1col1": "work_hrs_workplace","v1q5_v1col1": "childcare_hrs_res",
                         "v1q1a_v1col1":"work_hrs_home_no_kids","v1q1b_v1col1":"work_hrs_home_kids_responsible",
                         "v1q1c_v1col1":"work_hrs_home_kids_not_responsible","v1q5a_v1col1":"homeschool_hrs",
                         "nomem_encr":"ind_id","nohouse_encr":"hh_id"},inplace=True)
apr_2020

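|     | gender | work_hrs_workplace | work_hrs_home_no_kids | work_hrs_home_kids_responsible | work_hrs_home_kids_not_responsible | childcare_hrs_res | homeschool_hrs | hh_id   | ind_id  |
|:----|:-------|:-------------------|:----------------------|:-------------------------------|:-----------------------------------|:------------------|:---------------|:--------|:--------|
| 0   | Man    | 0.0                | NaN                   | 0.0                            | 0.0                                | 0.0               | 0.0            | 1049420 | 1687033 |
| 1   | Vrouw  | 3.0                | NaN                   | 0.0                            | 0.0                                | 0.0               | 0.0            | 1049420 | 1662353 |
| 2   | Man    | 47.0               | NaN                   | 4.0                            | 0.0                                | 4.0               | 0.0            | 1011033 | 1631191 |
| 3   | Vrouw  | 8.0                | NaN                   | 0.0                            | 8.0                                | 0.0               | 0.0            | 1011033 | 1687630 |
| 4   | Man    | 36.0               | NaN                   | 0.0                            | 0.0                                | 0.0               | 0.0            | 1047651 | 1746405 |
| ... | ...    | ...                | ...                   | ...                            | ...                                | ...               | ...            | ...     | ...     |
| 101 | Man    | 0.0                | 0.0                   | NaN                            | NaN                                | NaN               | NaN            | 1156059 | 1696174 |
| 102 | Vrouw  | 16.0               | NaN                   | 0.0                            | 0.0                                | 15.0              | 20.0           | 1144468 | 1678816 |
| 103 | Man    | 0.0                | NaN                   | 8.0                            | 24.0                               | 10.0              | 10.0           | 1144468 | 1668177 |
| 104 | Man    | 0.0                | 0.0                   | NaN                            | NaN                                | NaN               | NaN            | 1159704 | 1655142 |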


### Append the April data to the November data
For the November data, use the version with the default index set by Pandas. You may need to add another variable beforehand.

In [16]:
# assign returns new DataFrames with the extra column, which avoids the
# SettingWithCopyWarning raised by assigning to the column-subset nov_2019
df = pd.concat([nov_2019.assign(month='November'), apr_2020.assign(month='April')])
df


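|     | gender | work_hrs | childcare_hrs | hh_id   | ind_id  | month    | work_hrs_workplace | work_hrs_home_no_kids | work_hrs_home_kids_responsible | work_hrs_home_kids_not_responsible | childcare_hrs_res | homeschool_hrs |
|:----|:-------|:---------|:--------------|:--------|:--------|:---------|:-------------------|:----------------------|:-------------------------------|:-----------------------------------|:------------------|:---------------|
| 0   | Man    | 0.0      | 2.0           | 1049420 | 1687033 | November | NaN                | NaN                   | NaN                            | NaN                                | NaN               | NaN            |
| 1   | Vrouw  | 9.0      | 0.0           | 1049420 | 1662353 | November | NaN                | NaN                   | NaN                            | NaN                                | NaN               | NaN            |
| 2   | Man    | 67.0     | 1.0           | 1011033 | 1631191 | November | NaN                | NaN                   | NaN                            | NaN                                | NaN               | NaN            |
| 3   | Vrouw  | 17.0     | 6.0           | 1011033 | 1687630 | November | NaN                | NaN                   | NaN                            | NaN                                | NaN               | NaN            |
| 4   | Man    | 40.0     | 12.0          | 1047651 | 1746405 | November | NaN                | NaN                   | NaN                            | NaN                                | NaN               | NaN            |
| ... | ...    | ...      | ...           | ...     | ...     | ...      | ...                | ...                   | ...                            | ...                                | ...               | ...            |
| 101 | Man    | NaN      | NaN           | 1156059 | 1696174 | April    | 0.0                | 0.0                   | NaN                            | NaN                                | NaN               | NaN            |
| 102 | Vrouw  | NaN      | NaN           | 1144468 | 1678816 | April    | 16.0               | NaN                   | 0.0                            | 0.0                                | 15.0              | 20.0           |
| 103 | Man    | NaN      | NaN           | 1144468 | 1668177 | April    | 0.0                | NaN                   | 8.0                            | 24.0                               | 10.0              | 10.0           |
| 104 | Man    | NaN      | NaN           | 1159704 | 1655142 | April    | 0.0                | 0.0                   | NaN                            | NaN                                | NaN               | NaN            |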
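
Note that the row labels in the combined table repeat, because each input keeps its own default index. The sketch below, on two made-up frames (`nov` and `apr` are hypothetical stand-ins), shows how `pd.concat` fills columns that exist in only one input with `NaN` and how `ignore_index=True` produces a fresh, unique row index if you prefer one.

```python
import pandas as pd

# Hypothetical stand-ins for the two waves, with made-up values.
nov = pd.DataFrame({"ind_id": [1, 2], "work_hrs": [40.0, 9.0]})
apr = pd.DataFrame({"ind_id": [1, 2], "work_hrs_workplace": [36.0, 3.0]})

stacked = pd.concat(
    [nov.assign(month="November"), apr.assign(month="April")],
    ignore_index=True,  # build a fresh 0..n-1 index instead of repeating 0, 1
)
print(stacked)
```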


### Make the data respect the three requirements we set out in the screencast
- Values do not have internal structure
- Tables do not contain redundant information
- Variable names do not have any structure

In [17]:
df.columns

Index(['gender', 'work_hrs', 'childcare_hrs', 'hh_id', 'ind_id', 'month',
       'work_hrs_workplace', 'work_hrs_home_no_kids',
       'work_hrs_home_kids_responsible', 'work_hrs_home_kids_not_responsible',
       'childcare_hrs_res', 'homeschool_hrs'],
      dtype='object')

In [18]:
background = df[['gender', 'hh_id', 'ind_id']]
background

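|     | gender | hh_id   | ind_id  |
|:----|:-------|:--------|:--------|
| 0   | Man    | 1049420 | 1687033 |
| 1   | Vrouw  | 1049420 | 1662353 |
| 2   | Man    | 1011033 | 1631191 |
| 3   | Vrouw  | 1011033 | 1687630 |
| 4   | Man    | 1047651 | 1746405 |
| ... | ...    | ...     | ...     |
| 101 | Man    | 1156059 | 1696174 |
| 102 | Vrouw  | 1144468 | 1678816 |
| 103 | Man    | 1144468 | 1668177 |
| 104 | Man    | 1159704 | 1655142 |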
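
Because the combined data hold one row per person and month, a background table selected this way lists people who took part in both waves twice. One possible way to remove that redundancy is `drop_duplicates`; the sketch below uses made-up data and assumes that gender and household do not change between waves, which you would want to verify on the real data first.

```python
import pandas as pd

# Made-up example: two people, each observed in both waves.
stacked_background = pd.DataFrame(
    {
        "gender": ["Man", "Vrouw", "Man", "Vrouw"],
        "hh_id": [10, 10, 10, 10],
        "ind_id": [1, 2, 1, 2],
    }
)

# Keep the first row per person (assumes time-constant characteristics).
background_unique = stacked_background.drop_duplicates(subset="ind_id")
print(len(stacked_background), len(background_unique))
```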


In [19]:
hours = df[['work_hrs', 'childcare_hrs','ind_id', 'month','work_hrs_workplace', 'work_hrs_home_no_kids',
            'work_hrs_home_kids_responsible', 'homeschool_hrs','work_hrs_home_kids_not_responsible','childcare_hrs_res']]
hours

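|     | work_hrs | childcare_hrs | ind_id  | month    | work_hrs_workplace | work_hrs_home_no_kids | work_hrs_home_kids_responsible | homeschool_hrs | work_hrs_home_kids_not_responsible | childcare_hrs_res |
|:----|:---------|:--------------|:--------|:---------|:-------------------|:----------------------|:-------------------------------|:---------------|:-----------------------------------|:------------------|
| 0   | 0.0      | 2.0           | 1687033 | November | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 1   | 9.0      | 0.0           | 1662353 | November | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 2   | 67.0     | 1.0           | 1631191 | November | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 3   | 17.0     | 6.0           | 1687630 | November | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 4   | 40.0     | 12.0          | 1746405 | November | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| ... | ...      | ...           | ...     | ...      | ...                | ...                   | ...                            | ...            | ...                                | ...               |
| 101 | NaN      | NaN           | 1696174 | April    | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               |
| 102 | NaN      | NaN           | 1678816 | April    | 16.0               | NaN                   | 0.0                            | 20.0           | 0.0                                | 15.0              |
| 103 | NaN      | NaN           | 1668177 | April    | 0.0                | NaN                   | 8.0                            | 10.0           | 24.0                               | 10.0              |
| 104 | NaN      | NaN           | 1655142 | April    | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               |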
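
Column names such as `work_hrs_home_kids_responsible` still encode several pieces of information. One possible way to move that structure out of the names is to reshape to long format and describe each original column in a small lookup table. This is only a sketch on made-up data: the `meaning` table and its `location` and `responsible_for_kids` columns are hypothetical choices, not part of the exercise.

```python
import pandas as pd

# Made-up hours for two people.
wide = pd.DataFrame(
    {
        "ind_id": [1, 2],
        "work_hrs_workplace": [36.0, 3.0],
        "work_hrs_home_kids_responsible": [4.0, 0.0],
    }
)

# One row per person and original column.
long = wide.melt(id_vars="ind_id", var_name="variable", value_name="hours")

# Hypothetical lookup table that spells out what each column name encodes.
meaning = pd.DataFrame(
    {
        "variable": ["work_hrs_workplace", "work_hrs_home_kids_responsible"],
        "location": ["workplace", "home"],
        "responsible_for_kids": ["n/a", "yes"],
    }
)
tidy = long.merge(meaning, on="variable").drop(columns="variable")
print(tidy)
```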


### Give the DataFrames sensible indices

In [20]:
background_with_index = background.set_index(['ind_id'])
background_with_index

| ind_id  | gender | hh_id   |
|:--------|:-------|:--------|
| 1687033 | Man    | 1049420 |
| 1662353 | Vrouw  | 1049420 |
| 1631191 | Man    | 1011033 |
| 1687630 | Vrouw  | 1011033 |
| 1746405 | Man    | 1047651 |
| ...     | ...    | ...     |
| 1696174 | Man    | 1156059 |
| 1678816 | Vrouw  | 1144468 |
| 1668177 | Man    | 1144468 |
| 1655142 | Man    | 1159704 |


In [21]:
hours_with_index = hours.set_index(['ind_id','month'])
hours_with_index

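| ind_id  | month    | work_hrs | childcare_hrs | work_hrs_workplace | work_hrs_home_no_kids | work_hrs_home_kids_responsible | homeschool_hrs | work_hrs_home_kids_not_responsible | childcare_hrs_res |
|:--------|:---------|:---------|:--------------|:-------------------|:----------------------|:-------------------------------|:---------------|:-----------------------------------|:------------------|
| 1687033 | November | 0.0      | 2.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 1662353 | November | 9.0      | 0.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 1631191 | November | 67.0     | 1.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 1687630 | November | 17.0     | 6.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| 1746405 | November | 40.0     | 12.0          | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               |
| ...     | ...      | ...      | ...           | ...                | ...                   | ...                            | ...            | ...                                | ...               |
| 1696174 | April    | NaN      | NaN           | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               |
| 1678816 | April    | NaN      | NaN           | 16.0               | NaN                   | 0.0                            | 20.0           | 0.0                                | 15.0              |
| 1668177 | April    | NaN      | NaN           | 0.0                | NaN                   | 8.0                            | 10.0           | 24.0                               | 10.0              |
| 1655142 | April    | NaN      | NaN           | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               |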
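
With `ind_id` as a shared index level, the two tables can be recombined whenever an analysis needs both. A minimal sketch with made-up data; the frames here are small hypothetical stand-ins for the indexed tables above.

```python
import pandas as pd

# Hypothetical stand-in for the background table, indexed by person.
background = pd.DataFrame(
    {"gender": ["male", "female"], "hh_id": [10, 10]},
    index=pd.Index([1, 2], name="ind_id"),
)
# Hypothetical stand-in for the hours table, indexed by person and month.
hours = pd.DataFrame(
    {"work_hrs": [40.0, 9.0, 36.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [(1, "November"), (2, "November"), (1, "April"), (2, "April")],
        names=["ind_id", "month"],
    ),
)

# join matches the "ind_id" level of `hours` against the index of `background`.
combined = hours.join(background, on="ind_id")
print(combined)
```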


### Modify the gender variable so that it has the categories “male” and “female”

In [22]:
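# An explicit mapping does not depend on the (alphabetical) order of the categories
gender_labels = {'Man': 'male', 'Vrouw': 'female'}
background_with_index['gender'] = background_with_index['gender'].astype('category').cat.rename_categories(gender_labels)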
background_with_index

| ind_id  | gender | hh_id   |
|:--------|:-------|:--------|
| 1687033 | male   | 1049420 |
| 1662353 | female | 1049420 |
| 1631191 | male   | 1011033 |
| 1687630 | female | 1011033 |
| 1746405 | male   | 1047651 |
| ...     | ...    | ...     |
| 1696174 | male   | 1156059 |
| 1678816 | female | 1144468 |
| 1668177 | male   | 1144468 |
| 1655142 | male   | 1159704 |
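
Renaming categories with a list relies on the order of the existing categories, whereas a dictionary states the mapping explicitly. A small sketch on a made-up Series:

```python
import pandas as pd

# Made-up values for illustration.
gender = pd.Series(["Man", "Vrouw", "Man"], dtype="category")

# The dict maps old labels to new ones, independent of category order.
gender_en = gender.cat.rename_categories({"Man": "male", "Vrouw": "female"})
print(gender_en.tolist())
print(gender_en.cat.categories.tolist())
```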


### Total hours worked
Say you are interested in total hours worked for the purpose of thinking about what happens to GDP (a silly example on many dimensions, but disregard that). Create a variable `working_hrs_total` that makes this reasonably comparable across time. Note: There is no correct answer. This is a typical judgement call you have to make, and it is meant to illustrate that setting up the data can be very hard and can have a direct impact on your results.

In [23]:
hours_with_index['working_hrs_total']=hours_with_index[['work_hrs_workplace','work_hrs_home_no_kids','work_hrs_home_kids_responsible',
                                                        'work_hrs_home_kids_not_responsible','work_hrs']].sum(axis=1, skipna=True)
hours_with_index

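| ind_id  | month    | work_hrs | childcare_hrs | work_hrs_workplace | work_hrs_home_no_kids | work_hrs_home_kids_responsible | homeschool_hrs | work_hrs_home_kids_not_responsible | childcare_hrs_res | working_hrs_total |
|:--------|:---------|:---------|:--------------|:-------------------|:----------------------|:-------------------------------|:---------------|:-----------------------------------|:------------------|:------------------|
| 1687033 | November | 0.0      | 2.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               | 0.0               |
| 1662353 | November | 9.0      | 0.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               | 9.0               |
| 1631191 | November | 67.0     | 1.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               | 67.0              |
| 1687630 | November | 17.0     | 6.0           | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               | 17.0              |
| 1746405 | November | 40.0     | 12.0          | NaN                | NaN                   | NaN                            | NaN            | NaN                                | NaN               | 40.0              |
| ...     | ...      | ...      | ...           | ...                | ...                   | ...                            | ...            | ...                                | ...               | ...               |
| 1696174 | April    | NaN      | NaN           | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               | 0.0               |
| 1678816 | April    | NaN      | NaN           | 16.0               | NaN                   | 0.0                            | 20.0           | 0.0                                | 15.0              | 16.0              |
| 1668177 | April    | NaN      | NaN           | 0.0                | NaN                   | 8.0                            | 10.0           | 24.0                               | 10.0              | 32.0              |
| 1655142 | April    | NaN      | NaN           | 0.0                | 0.0                   | NaN                            | NaN            | NaN                                | NaN               | 0.0               |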
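
One detail of this judgement call is how rows without any reported hours are treated. By default, a row-wise `sum` turns an all-missing row into `0.0`; passing `min_count=1` keeps it missing instead. The sketch below shows both on made-up data (the column `work_hrs_home` is a simplified, hypothetical name); which behaviour is appropriate depends on what a missing answer means in the survey.

```python
import numpy as np
import pandas as pd

# Made-up data: the second person reported nothing at all.
toy = pd.DataFrame(
    {"work_hrs_workplace": [16.0, np.nan], "work_hrs_home": [np.nan, np.nan]}
)

total_default = toy.sum(axis=1)                 # all-missing row becomes 0.0
total_min_count = toy.sum(axis=1, min_count=1)  # all-missing row stays NaN

print(total_default.tolist())
print(total_min_count.tolist())
```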


### Calculate the average of total working hours in each month

In [24]:
hours_with_index.groupby(['month'])['working_hrs_total'].mean().round(1)

month
April       13.3
November    18.0
Name: working_hrs_total, dtype: float64
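
The months appear in alphabetical rather than chronological order because `month` is a plain string. If the order matters, one option is an ordered categorical; the sketch below uses made-up numbers and is only meant to show the mechanics.

```python
import pandas as pd

# Made-up totals for illustration.
toy = pd.DataFrame(
    {
        "month": ["April", "November", "April", "November"],
        "working_hrs_total": [10.0, 20.0, 16.0, 16.0],
    }
)

# Declare the chronological order of the two survey waves explicitly.
toy["month"] = pd.Categorical(
    toy["month"], categories=["November", "April"], ordered=True
)
monthly = toy.groupby("month", observed=True)["working_hrs_total"].mean()
print(monthly.index.tolist())
```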