# Ec 143 - Problem Set 5
# Quantile Regression
Due by 5PM on April 23rd. The GSI, Nadav Tadelis (ntadelis@berkeley.edu), will handle the logistics of problem set collection.

Working with peers on the problem set is actively encouraged, but everyone needs to turn in their own Jupyter Notebook and any other accompanying materials. 

This problem set reviews the material on quantile regression developed in lecture. Any "pencil and paper" and/or narrative answers may be placed in markdown boxes in this Jupyter notebook (preferred). Alternatively, you can handwrite your answers and turn in a PDF scan of them. This problem set is deliberately more open-ended than the first four, and consequently you may find it challenging (but I hope also rewarding).

Any computational questions should be answered by writing the required code and executing it. This should be included in this notebook.

In [2]:
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
data = '/Users/bgraham/Dropbox/Teaching/Berkeley_Courses/Ec143/Ec143_Spring2023/Datasets/'
graphics = '/Users/bgraham/Dropbox/Teaching/Berkeley_Courses/Ec143/Ec143_Spring2023/Graphics/'

The file brazil_pnad96.out contains 65,801 comma-delimited records drawn from the 1996 round of the Brazilian Pesquisa Nacional por Amostra de Domicílios (PNAD96). An overview of education, earnings and inequality in Brazil is provided by Blom et al. (2001). This is the same dataset you used in Problem Set #3.

**References**
Blom, Andreas, Holm-Nielsen, Lauritz, and Verner, Dorte, "Education, earnings, and inequality in Brazil, 1982-1998: implications for education policy", _Peabody Journal of Education_ 76, 3-4 (2001), pp. 180 - 221.

In [7]:
pnad96 = pd.read_csv(data + 'Brazil_1996PNAD.out', header = 0, sep='\t+', engine='python')

# Find relevant estimation subsample
sample = pnad96.loc[(pnad96['MONTHLY_EARNINGS'] > 0) & (pnad96['AgeInDays'] >= 20)  & (pnad96['AgeInDays'] <= 60)]

# Display summary statistics for the estimation subsample
sample.describe()

| | AgeInDays | YRSSCH | MONTHLY_EARNINGS | Father_NoSchool | Father_Incomplete1stPrimary | Father_Complete1stPrimary | Father_Incomplete2ndPrimary | Father_Complete2ndPrimary | Father_IncompleteSecondary | Father_CompleteSecondary | ... | Mother_NoSchool | Mother_Incomplete1stPrimary | Mother_Complete1stPrimary | Mother_Incomplete2ndPrimary | Mother_Complete2ndPrimary | Mother_IncompleteSecondary | Mother_CompleteSecondary | Mother_IncompleteHigher | Mother_CompleteHigher | Mother_DontKnow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | ... | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 | 55551.0 |
| mean | 37.055054 | 5.830462 | 634.184245 | 0.282569 | 0.205037 | 0.139691 | 0.034311 | 0.054832 | 0.051808 | 0.039531 | ... | 0.334125 | 0.189015 | 0.136199 | 0.041493 | 0.065093 | 0.032691 | 0.045148 | 0.016921 | 0.05476 | 0.084553 |
| std | 10.262022 | 4.217958 | 1104.788945 | 0.450253 | 0.403732 | 0.34667 | 0.182028 | 0.227655 | 0.221642 | 0.194857 | ... | 0.471688 | 0.391524 | 0.343003 | 0.19943 | 0.246693 | 0.177827 | 0.20763 | 0.128978 | 0.227514 | 0.278218 |
| min | 20.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 28.6078 | 3.0 | 180.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 36.18891 | 5.0 | 320.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 44.68172 | 8.0 | 602.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| max | 60.0 | 15.0 | 50000.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Warm-Up ##

1. Compute the least squares fit of ln(MONTHLY_EARNINGS) onto a constant, YRSSCH, AgeInDays, and AgeInDays squared.

2. Create a dummy variable for each of the $16$ possible schooling levels. Compute the least squares fit of ln(MONTHLY_EARNINGS) onto each of the $16$ dummy variables, AgeInDays, and AgeInDays squared (exclude a constant from this regression).    

3. Construct a plot with the regression fits from parts (1) and (2) above on the same figure, holding AgeInDays fixed at $40$ but varying YRSSCH. Comment on your findings.
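
As a rough illustration of parts (1) to (3), the sketch below fits both specifications with `numpy.linalg.lstsq` and evaluates them at age 40. It runs on simulated data rather than the PNAD sample; the column names `LogEarnings` and `AgeSq`, the helper `ols_fit`, and the coefficients used to simulate the data are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def ols_fit(X, y):
    """Return the least squares coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated stand-in for the estimation sample (NOT the PNAD data)
rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "YRSSCH": rng.integers(0, 16, size=n),        # schooling levels 0,...,15
    "AgeInDays": rng.uniform(20, 60, size=n),
})
df["AgeSq"] = df["AgeInDays"] ** 2
df["LogEarnings"] = (4.0 + 0.12 * df["YRSSCH"] + 0.08 * df["AgeInDays"]
                     - 0.0008 * df["AgeSq"] + rng.normal(0.0, 0.5, size=n))
y = df["LogEarnings"].to_numpy()

# (1) Constant, schooling entering linearly, age, and age squared
X_lin = np.column_stack([np.ones(n), df["YRSSCH"].to_numpy(),
                         df["AgeInDays"].to_numpy(), df["AgeSq"].to_numpy()])
b_lin = ols_fit(X_lin, y)

# (2) One dummy per schooling level, age, and age squared; no constant
dummies = pd.get_dummies(df["YRSSCH"], prefix="S").to_numpy(dtype=float)
X_dum = np.column_stack([dummies, df["AgeInDays"].to_numpy(),
                         df["AgeSq"].to_numpy()])
b_dum = ols_fit(X_dum, y)

# (3) Fitted values at age 40 for each schooling level
s = np.arange(16)
fit_lin = b_lin[0] + b_lin[1] * s + b_lin[2] * 40 + b_lin[3] * 40 ** 2
fit_dum = b_dum[:16] + b_dum[16] * 40 + b_dum[17] * 40 ** 2
```

The two fitted curves can then be drawn on one figure, for example with `plt.plot(s, fit_lin)` and `plt.plot(s, fit_dum, 'o')`.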

## Exploring the conditional distribution of earnings given schooling ##

In this part of the problem set you will explore the conditional distribution of earnings given schooling and age using quantile regression. There are a variety of ways to undertake the computations described below. Tools you may need include a "for loop" and a Pandas dataframe for organizing your results and setting up your regressor matrix; numpy.quantile and numpy.sort (to find order statistics) will be useful for finding quantile point estimates and constructing standard errors. To construct standard errors for your minimum distance estimates you will need to do some basic matrix multiplication. This is best done using NumPy.
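
One common way to turn order statistics into a confidence interval and a standard error for a single quantile is sketched below. It is an illustration under stated assumptions and may differ in detail from the construction given in lecture: the ranks come from the normal approximation to the binomial, they are clipped to the sample range, and the helper name `quantile_order_stat_ci` is hypothetical.

```python
import numpy as np

Z_975 = 1.959963984540054  # 0.975 quantile of the standard normal

def quantile_order_stat_ci(y, tau, z=Z_975):
    """Point estimate, order-statistic interval, and implied standard error."""
    y_sorted = np.sort(np.asarray(y, dtype=float))
    n = y_sorted.size
    q_hat = np.quantile(y_sorted, tau)

    # Ranks (1-based) of the order statistics that bracket the quantile
    half_width = z * np.sqrt(n * tau * (1.0 - tau))
    lo_rank = int(np.clip(np.floor(n * tau - half_width), 1, n))
    hi_rank = int(np.clip(np.ceil(n * tau + half_width), 1, n))
    lower, upper = y_sorted[lo_rank - 1], y_sorted[hi_rank - 1]

    # Back out a standard error from the interval length;
    # n * se**2 then estimates the asymptotic variance of sqrt(n) * (q_hat - q)
    se = (upper - lower) / (2.0 * z)
    return q_hat, lower, upper, se

# Example on simulated data (not the PNAD sample)
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
q_hat, lower, upper, se = quantile_order_stat_ci(y, 0.5)
```

In the problem set, a function like this would be called once per schooling-by-age cell and quantile, with the results collected in a dataframe.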

4. Construct two histograms of the logarithm of monthly earnings: one for the distribution given YRSSCH = 0 and another given YRSSCH = 8. Comment on any differences.

5. Consider the following $L=8$ age ranges: $\left[20,25\right),\left[25,30\right),\left[30,35\right),\left[35,40\right),\left[40,45\right),\left[45,50\right),\left[50,55\right),\left[55,60\right]$. Let $K=16$ be the number of distinct schooling values. For each of the $K\times L=16\times8=128$ years of schooling and age range combinations with at least $30$ observations in the dataset, estimate the 10th, 25th, 50th, 75th, and 90th quantiles of the distribution of log earnings. For each conditional quantile, construct a confidence interval using order statistics as described in lecture. Using this confidence interval, construct an asymptotic variance estimate.

6. Inspect your standard error estimates. Are any of them zero? Why? Inspect the distribution of MONTHLY_EARNINGS. Is MONTHLY_EARNINGS a continuously-valued random variable? Relate what you find to the phenomenon of standard error estimates of zero [1 paragraph].

7. Assume that, for the five estimated quantiles, the conditional quantile function of the logarithm of monthly earnings given schooling and age is a linear function of YRSSCH, AgeInDays, and AgeInDays squared (you may use the mid-point of each of the age ranges as your measure of “age”). Estimate the parameters indexing each of the five conditional quantile functions by minimum distance as described in lecture. You should exclude all cells with fewer than 30 observations and/or where the estimated standard error is zero. How does the coefficient on schooling vary with the quantile under consideration? How does it compare to that computed in question (2) above?

8. Summarize, in words, your analysis. How do earnings vary with education in Brazil? [4 to 6 paragraphs]

9. Repeat your analysis in part (7) for all “centiles” 5, 6, 7, ..., 94, 95. Plot “centile” on the x-axis and the corresponding coefficient on schooling on the y-axis. Also plot the corresponding point-wise 95 percent confidence band. Comment on your graph [1 to 3 paragraphs].
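
The minimum distance step in parts (7) and (9) can be sketched as a weighted least squares fit of the cell-level quantile estimates on the cell-level regressors. The version below assumes the cell estimates are independent and weights each cell by the inverse of its squared standard error; the lecture's estimator may differ. The function name `minimum_distance` and all simulated inputs (`n_cell`, `se`, `q_hat`) are hypothetical.

```python
import numpy as np

def minimum_distance(X, q_hat, se):
    """Weighted least squares of cell quantiles on cell regressors.

    Weights are 1 / se**2, which assumes independent cell estimates.
    Returns the coefficient vector and its standard errors.
    """
    X = np.asarray(X, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    w = 1.0 / np.asarray(se, dtype=float) ** 2
    XtWX = X.T @ (w[:, None] * X)        # X' W X
    XtWq = X.T @ (w * q_hat)             # X' W q_hat
    beta = np.linalg.solve(XtWX, XtWq)
    cov = np.linalg.inv(XtWX)            # variance under inverse-variance weights
    return beta, np.sqrt(np.diag(cov))

# Simulated cell-level inputs (NOT estimates from the PNAD data)
rng = np.random.default_rng(1)
school, age = np.meshgrid(np.arange(16), np.arange(22.5, 60, 5.0), indexing="ij")
school, age = school.ravel().astype(float), age.ravel()   # 128 cells, age mid-points
n_cell = rng.integers(10, 500, size=school.size)          # cell sample sizes
se = rng.uniform(0.02, 0.10, size=school.size)            # cell standard errors
q_hat = (4.0 + 0.10 * school + 0.08 * age - 0.0008 * age ** 2
         + rng.normal(0.0, se))

# Drop small cells and cells with a zero standard error
keep = (n_cell >= 30) & (se > 0)
X = np.column_stack([np.ones(keep.sum()), school[keep], age[keep], age[keep] ** 2])
beta, beta_se = minimum_distance(X, q_hat[keep], se[keep])

# Point-wise 95 percent interval for the schooling coefficient
lower = beta[1] - 1.96 * beta_se[1]
upper = beta[1] + 1.96 * beta_se[1]
```

For part (9), the same function could be called inside a loop over centiles, storing `beta[1]`, `lower`, and `upper` at each step for plotting.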