# The Empirical Evaluation of Management Practices
## Examiner: Prof. Dr. Dirk Sliwka
## Date: 2024-03-13

## Instructions:

Follow the instructions below so that we will be able to correctly identify your solutions to the exam.

**1. Make a copy of this notebook and save it as a separate file in the following format:**

*WS2324_EEMP_exam_PT2_matriculationnumber_initials.ipynb*

- i.e., the final file name should look like this: *WS2324_EEMP_exam_PT2_1234567_MM.ipynb*

**2. Share this notebook with eemp.exams@gmail.com.**

**3. Also enter your matriculation number and your initials in the following cell:**

### Matriculation number:
### Initials:

The code cell below imports the standard modules. Execute this cell before you start your work.

In [1]:
# Run this cell:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
!pip install stargazer
from stargazer.stargazer import Stargazer
!pip install linearmodels
from linearmodels import PanelOLS

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score


## Assignment 1

Your task is to study determinants of individual productivity with a (simulated) data set. The data set contains a *standardized* measure of productivity (**productivity**) as well as several employee characteristics such as **age**, the **region** where the employee works, whether the employee had a feedback meeting in a given year (**feedbackmeet**) and measures of the big 5 personality traits (**b5cons**, etc). The data set tracks the productivity of employees (indexed with the variable **employeeID**) across four time periods (variable **year** running from 2018 through 2021). However, you will start working with data from only the year 2021.

The employees work in teams and some teams receive a performance bonus (**teamBonus**).

In the first part, we will use regressions to analyze the data.

Start by using the cross-sectional data frame `df` (which includes only the year 2021 and not the other years). You can import and inspect this data frame by running the cell directly below.

In [2]:
# Run this cell:
df = pd.read_csv('https://raw.githubusercontent.com/ryebreadgoat/general-squirrel/main/data.csv')
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,employeeID,year,team,region,age,experience,workSampleTest,b5cons,b5extra,b5open,b5emos,b5agree,feedbackmeet,teamBonus,productivity
0,0,2021,0,2,41,14,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,1,0,-0.053513
1,1,2021,0,2,57,28,69.090243,-0.1,1.2,0.6,-2.1,-1.3,1,0,-1.909388
2,2,2021,0,2,27,5,106.327070,2.4,-0.1,-1.2,-1.0,1.4,1,0,1.842034
3,3,2021,0,1,61,34,122.873481,1.9,1.0,0.5,1.1,-0.5,1,0,0.282908
4,4,2021,0,2,36,11,92.794113,0.6,-0.4,0.8,-1.7,0.9,1,0,-1.618877
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,3995,2021,399,2,53,27,88.833529,1.3,-0.2,0.6,1.0,-1.2,0,1,-0.456262
3996,3996,2021,399,2,45,22,72.790986,1.2,-0.1,1.2,3.0,1.0,0,1,2.269619
3997,3997,2021,399,2,64,42,93.029969,-0.8,-0.3,-1.0,-0.0,0.6,1,1,0.780240
3998,3998,2021,399,1,50,22,103.814012,0.2,1.4,0.1,0.8,-0.3,0,1,1.143187


a) Show a histogram of productivity. What is the interpretation of a productivity value of 0? And of -2?

**(4 points)**
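As a reminder of what standardization does, here is a minimal sketch on made-up numbers. The list `raw` is hypothetical and not part of the exam data, and the sketch assumes the usual z-score definition; this sheet does not state how **productivity** was actually standardized. The last line additionally assumes a standard normal distribution, which the data need not follow.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical raw output figures (NOT from the exam data set).
raw = [12.0, 15.0, 9.0, 20.0, 14.0]

# z-score: subtract the mean and divide by the standard deviation.
mu, sigma = mean(raw), pstdev(raw)
z = [(v - mu) / sigma for v in raw]

# After standardization the values have mean 0 and standard deviation 1.
print(round(mean(z), 10), round(pstdev(z), 10))

# Under a standard normal (an assumption), the share of values below -2:
share_below_minus2 = NormalDist().cdf(-2)
print(share_below_minus2)
```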

In [3]:
# Write your code below:


*Give the verbal answer here:*



b)  Investigate how productivity varies with conscientiousness (**b5cons**) in three simple bivariate regressions. Use **productivity** as the dependent variable. The models should be the same except for the standard errors, which should be as follows:
- `model1`: use the default standard errors
- `model2`: use heteroscedasticity-robust standard errors
- `model3`: use standard errors clustered at the **team** level.

Tip: use `Stargazer([model1, model2, model3])` to display the results side-by-side.

**(5 points)**
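The three standard-error variants can be sketched on synthetic data as follows. The data frame `toy` and its columns `y`, `x`, and `g` are made up for illustration, and `HC1` is only one of several heteroscedasticity-robust options that statsmodels offers. The point estimates are identical across the three fits; only the covariance estimator changes.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (hypothetical): 50 groups of 20 observations with a
# shared group-level shock, so errors are correlated within groups.
rng = np.random.default_rng(1)
n_groups, per_group = 50, 20
g = np.repeat(np.arange(n_groups), per_group)
group_shock = np.repeat(rng.normal(size=n_groups), per_group)
x = rng.normal(size=g.size)
y = 0.5 * x + group_shock + rng.normal(size=g.size)
toy = pd.DataFrame({"y": y, "x": x, "g": g})

spec = smf.ols("y ~ x", data=toy)
fit_default = spec.fit()                      # classical standard errors
fit_robust = spec.fit(cov_type="HC1")         # heteroscedasticity-robust
fit_cluster = spec.fit(cov_type="cluster",    # clustered by group
                       cov_kwds={"groups": toy["g"]})

for name, fit in [("default", fit_default),
                  ("robust", fit_robust),
                  ("cluster", fit_cluster)]:
    print(name, fit.params["x"], fit.bse["x"])
```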

In [4]:
# Write your code below:


c) Explain in your own words the difference in the way that the standard errors are computed between Models 1 and 2, and what assumption is involved.

**(3 points)**

*Give the verbal answer here:*



d) Why is the third specification the most appropriate here? In other words, why is it best to cluster the standard errors at the team level?

**(3 points)**

*Give the verbal answer here:*



e) The firm ran an A/B test in the year 2021 where it randomly picked several teams which got a team bonus. The dummy variable **teamBonus** takes the value 1 if the employee's team got the bonus.

Regress the employee's (individual) productivity on **teamBonus**. Do not add any control variables. Use standard errors clustered at the **team** level.

Do you believe that this model estimates the causal effect of the team bonus on productivity? Why or why not?

**(6 points)**

In [5]:
# Write your code below:


*Give the verbal answer here:*



f) Interpret the sign, magnitude, and significance of the **teamBonus** coefficient.

**(4 points)**

*Give the verbal answer here:*



g) The firm suspects that feedback meetings of employees with their supervisors are another important driver of productivity. Regress productivity on the use of feedback meetings in an OLS regression without control variables. Do you think this regression may suffer from omitted variable bias?

If so, explain why using an example of a potential omitted variable not observed in this data set.

If not, explain why not.

**(7 points)**

In [6]:
# Write your code below:


*Give the verbal answer here:*



### Attention: panel data

Now use the data frame `df_panel`, which covers all years and not only 2021. You can load and inspect this data frame by running the cell below.

In [7]:
# Run this cell:
df_panel = pd.read_csv('https://raw.githubusercontent.com/ryebreadgoat/general-squirrel/main/panel-data.csv')
df_panel = df_panel.drop(columns=['Unnamed: 0'])
df_panel

Unnamed: 0,employeeID,year,team,region,age,experience,workSampleTest,b5cons,b5extra,b5open,b5emos,b5agree,feedbackmeet,teamBonus,productivity
0,0,2018,0,2,38,11,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,1,0,0.729210
1,0,2019,0,2,39,12,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,1,0,-0.780119
2,0,2020,0,2,40,13,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,0,0,-0.305490
3,0,2021,0,2,41,14,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,1,0,-0.053513
4,1,2018,0,2,54,25,69.090243,-0.1,1.2,0.6,-2.1,-1.3,1,0,-1.266936
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15995,3998,2021,399,1,50,22,103.814012,0.2,1.4,0.1,0.8,-0.3,0,1,1.143187
15996,3999,2018,399,2,37,14,101.326148,1.7,1.9,-0.4,-0.3,1.4,0,1,0.464175
15997,3999,2019,399,2,38,15,101.326148,1.7,1.9,-0.4,-0.3,1.4,1,1,2.398882
15998,3999,2020,399,2,39,16,101.326148,1.7,1.9,-0.4,-0.3,1.4,0,1,0.935645


h) Perform two panel regressions. Both should be without control variables and with the standard errors clustered at the employee level. First, include only time fixed effects. Second, include time fixed effects and employee fixed effects. Explain why these two regressions yield different estimates of the **feedbackmeet** coefficient by pointing out what the employee fixed effects do.

Tip: You can print both regression outputs from the same code cell using `print(panel_reg1, '\n', panel_reg2)`.

**(10 points)**
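For intuition only, the following sketch uses synthetic data to show the mechanics of fixed effects as demeaning. All names (`toy`, `ability`, `true_effect`) are hypothetical, and the slopes are computed by hand with pandas rather than with `PanelOLS`. It relies on the fact that, in a balanced panel, time fixed effects amount to demeaning by year, and employee plus time fixed effects amount to two-way demeaning.

```python
import numpy as np
import pandas as pd

# Synthetic balanced panel (hypothetical): 500 employees, 4 years.
rng = np.random.default_rng(0)
n_emp, years = 500, [2018, 2019, 2020, 2021]
true_effect = 0.3

emp = np.repeat(np.arange(n_emp), len(years))
year = np.tile(years, n_emp)
# Unobserved, time-constant trait that also drives the regressor x.
ability = np.repeat(rng.normal(size=n_emp), len(years))
x = (0.8 * ability + rng.normal(size=emp.size) > 0).astype(float)
y = (true_effect * x + ability + 0.1 * (year - 2018)
     + rng.normal(scale=0.5, size=emp.size))
toy = pd.DataFrame({"emp": emp, "year": year, "x": x, "y": y})

def slope(xv, yv):
    """OLS slope for already-demeaned variables."""
    return float((xv * yv).sum() / (xv ** 2).sum())

def demean_by_year(col):
    return toy[col] - toy.groupby("year")[col].transform("mean")

def demean_two_way(col):
    return (toy[col]
            - toy.groupby("emp")[col].transform("mean")
            - toy.groupby("year")[col].transform("mean")
            + toy[col].mean())

b_time = slope(demean_by_year("x"), demean_by_year("y"))   # time FE only
b_twfe = slope(demean_two_way("x"), demean_two_way("y"))   # employee + time FE
print(b_time, b_twfe)
```

In this toy setup the first slope absorbs the time-constant trait, while the second uses only variation within an employee over time.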

In [8]:
# Write your code below:


*Give the verbal answer here:*



i) Do you believe that your second model with employee and time fixed effects estimates the causal effect of **feedbackmeet**? Why or why not?

**(3 points)**

*Give the verbal answer here:*



## Assignment 2

Here your task is to train an algorithm to predict **productivity**. The necessary packages have already been imported above.

Use the data from 2021 again. Run the cell below to reload the data, because you may have modified the data frame in the previous assignment.



In [9]:
# Run this cell:
df = pd.read_csv('https://raw.githubusercontent.com/ryebreadgoat/general-squirrel/main/data.csv')
df = df.drop(columns=['Unnamed: 0'])
df

Unnamed: 0,employeeID,year,team,region,age,experience,workSampleTest,b5cons,b5extra,b5open,b5emos,b5agree,feedbackmeet,teamBonus,productivity
0,0,2021,0,2,41,14,96.714352,-1.3,-0.4,-0.8,-1.2,0.6,1,0,-0.053513
1,1,2021,0,2,57,28,69.090243,-0.1,1.2,0.6,-2.1,-1.3,1,0,-1.909388
2,2,2021,0,2,27,5,106.327070,2.4,-0.1,-1.2,-1.0,1.4,1,0,1.842034
3,3,2021,0,1,61,34,122.873481,1.9,1.0,0.5,1.1,-0.5,1,0,0.282908
4,4,2021,0,2,36,11,92.794113,0.6,-0.4,0.8,-1.7,0.9,1,0,-1.618877
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3995,3995,2021,399,2,53,27,88.833529,1.3,-0.2,0.6,1.0,-1.2,0,1,-0.456262
3996,3996,2021,399,2,45,22,72.790986,1.2,-0.1,1.2,3.0,1.0,0,1,2.269619
3997,3997,2021,399,2,64,42,93.029969,-0.8,-0.3,-1.0,-0.0,0.6,1,1,0.780240
3998,3998,2021,399,1,50,22,103.814012,0.2,1.4,0.1,0.8,-0.3,0,1,1.143187


a) Generate the y vector and the feature matrix, excluding the variables **employeeID**, **year**, and **team** from the feature matrix. Then split your data into a training set and a test set, with 60% of the data in the training set, and set the random state to 142.

**(3 points)**

In [10]:
# Write your code below:


b) Train a Random Forest with 500 trees on the training sample. Set the maximum number of features equal to the square root of the number of features. Set the maximum depth equal to 10. Set the random state equal to 142.

**(4 points)**

In [11]:
# Write your code below:


c) Compute the training and test $R^2$.

**(2 points)**

In [12]:
# Write your code below:


d) Train the same random forest but with one difference: The maximum depth should now be 5.

Compute the training and test $R^2$ again. Compare these to $R^2$ values from your previous random forest. Discuss how changing the maximum depth produces potentially different values of $R^2$ here.

**(6 points)**
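The workflow of splitting, fitting, and scoring at two depths can be sketched on synthetic data as follows. The arrays `X_toy` and `y_toy` are made up, and the sketch uses 50 trees instead of 500 purely to keep it fast; it is an illustration of the scikit-learn calls, not a solution for the exam data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (hypothetical): 600 rows, 6 features.
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(600, 6))
y_toy = X_toy[:, 0] + 0.5 * X_toy[:, 1] ** 2 + rng.normal(size=600)

# 60% of the rows go to the training set.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.6, random_state=142
)

scores = {}
for depth in (10, 5):
    forest = RandomForestRegressor(
        n_estimators=50,        # fewer trees than in the task, for speed
        max_features="sqrt",    # square root of the number of features
        max_depth=depth,
        random_state=142,
    )
    forest.fit(X_tr, y_tr)
    scores[depth] = (
        r2_score(y_tr, forest.predict(X_tr)),   # training R^2
        r2_score(y_te, forest.predict(X_te)),   # test R^2
    )
    print(depth, scores[depth])
```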

In [13]:
# Write your code below:


*Give the verbal answer here:*



**This is the end of the exam. If you haven't already done so, share this notebook with eemp.exams@gmail.com.**