# Population and Net Migration

This file computes estimates of net migration by age over the sample period 1900 to 2100. These estimates are fed to the model in a way that reconciles the Census Bureau's historical population estimates and projections with the population generated by our model through births, deaths, and migration.

The file begins by combining several Census Bureau products into a consistent resident population series by single year of age over the 1900-2060 period. In addition to the series for the total population, the notebook creates separate series for the male and female populations by single year of age, which we use to inspect trends in labor force participation rates. These latter series are read by the notebook `Employment.ipynb`.

All data sources measure the resident population as of July 1 of each calendar year. We use intercensal population statistics up to 2010, meaning that Census staff adjusted the data to reconcile discrepancies between decennial censuses and the intermediate estimates obtained by growing the latest census figures through estimates of births, deaths, and migration. For the 2010-2013 data, we use the 2015 vintage of postcensal population estimates, which adjust the 2010 Census figures using estimates of births, deaths, and migration. Similarly, from 2014 onward we use the latest set of population projections by age based on the 2010 Census. The data sources are summarized below, along with their age coverage. We saved the raw data as CSV files and perform the data cleaning below.

* [1900 to 1939] We downloaded the "Historical National Population Estimates/National Estimates by Age, Sex, Race: 1900-1979." The annual data cover years of age 0 through 74 years old and 75+ years old. 
* [1940 to 1979] The source is the same as above. The annual data cover years of age 0 through 84 years old and 85+ years old.
* [1980 to 1989] We downloaded the "Quarterly Intercensal Resident Population" estimates by single years of age and retained estimates as of July 1 of each year. The data cover years of age 0 through 99 years old and 100+ years old.
* [1990 to 1999] We downloaded the "Intercensal Estimates of the United States Population by Age and Sex, 1990-2000: All Months." The data cover years of age 0 through 99 years old and 100+ years old.
* [2000 to 2009] We downloaded the "2000-2010 Intercensal Estimates April 1, 2000 to July 1, 2010." The data cover years of age 0 through 84 years old and 85+ years old.
* [2010 to 2013] We use the 2015 vintage of postcensal population estimates. The data cover years of age 0 through 99 years old and 100+ years old.
* [2014 to 2060] We downloaded the "2014 to 2060 Populations Projections based on Census 2010 (released 2014)." The data cover years of age 0 through 99 years old and 100+ years old.
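
The sources above differ in their top-coded age, which matters for how the counts are stored. The sketch below illustrates the indexing convention used by the code cells in this notebook: age `a` sits in row `a + 1` and year `y` in column `y - 1899`. The helper names `age_row` and `year_col` and the counts in `toy_counts` are hypothetical and serve only as an illustration.

```julia
# Hypothetical helpers; the notebook inlines this arithmetic instead.
age_row(age)   = age + 1        # age 0 is stored in row 1
year_col(year) = year - 1899    # year 1900 is stored in column 1

population = zeros(120, 161)    # ages 0..119 by years 1900..2060

# A source top-coded at 75+ only provides single ages 0..74. The open-ended
# 75+ bin is not a single year of age, so the rows for ages 75 and above
# are left at zero.
toy_counts = fill(1000.0, 75)   # made-up counts for ages 0..74
population[age_row(0):age_row(74), year_col(1900)] = toy_counts
```

Under this convention the 2014-2060 projections occupy columns 115 through 161.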

The derivation of the net migration estimates relies on our interpolation of mortality rates from Bell and Campbell (2005). Per our timing assumptions, births, deaths, and migration take place at the beginning of the period whereas population measurement takes place at the end. Thus, we treat a population estimate for July 1 as corresponding to the population at the end of the second quarter of the calendar year. By definition,

$$Population_{age,t} = \big( Population_{age-1,t-1} + net\_migration_{age,t} \big) \times (1-\gamma_{age,t}),$$

where $\gamma_{age,t}$ is the mortality rate of an individual aged $age$ in period $t$. As is apparent, we assume that mortality rates are the same for new immigrants as for existing residents. For infants, the term $Population_{age-1,t-1}$ is replaced by our estimate of the number of live births for the period.
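
As a minimal sketch of how this identity can be inverted to back out net migration as a residual, consider the toy example below. The arrays `population` and `mortality` hold made-up numbers rather than Census data, and the variable names are illustrative only.

```julia
# Hypothetical toy inputs: rows are ages 0..2, columns are two consecutive years.
population = [100.0 102.0;
               98.0  99.5;
               95.0  96.0]
mortality  = [0.010 0.010;
              0.002 0.002;
              0.003 0.003]     # γ[age, t]

# Invert the identity for ages 1 and 2 in the second year:
# net_migration[age,t] = Population[age,t] / (1 - γ[age,t]) - Population[age-1,t-1]
net_migration = population[2:end, 2] ./ (1.0 .- mortality[2:end, 2]) .- population[1:end-1, 1]

# Plugging the residual back into the identity reproduces the observed population.
rebuilt = (population[1:end-1, 1] .+ net_migration) .* (1.0 .- mortality[2:end, 2])
```

Because the residual is defined to make the identity hold exactly, any measurement error in the population counts or the mortality rates ends up in `net_migration`.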

Because the net migration statistics are computed as residuals, they may absorb sources of variation in population statistics other than migration. These sources include some year-to-year measurement errors and imprecision in the imputation of mortality rates by age and birth cohort.

For the data beyond 2060, we assume that our estimates of net migration converge linearly to those published by the Census for 2060. These Census estimates are consistent with a stabilization in the absolute level of net migration around 2060. Our estimates and those of the Census in 2060 are similar, with two minor exceptions. First, our estimate of the net migration of infants (that is, those less than a year old) is mildly negative whereas that of the Census is slightly positive, likely reflecting differences between the assumptions in the UN projections that we used for projecting live births through 2100 and those underlying the Census population projections. Second, we have slightly negative net migration for older people, likely capturing differences in assumptions about mortality late in life. Both elements are unlikely to affect our macroeconomic results because they have little if any effect on the aggregate capital-labor ratio.
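
The linear convergence can be sketched as a weighted average whose weight on the Census level rises in equal steps. The age profiles `ours_2060` and `census_2060` below are made-up numbers, and the step count of 39 is taken from the 39 yearly columns that follow 2060 in this notebook's arrays.

```julia
# Hypothetical toy profiles over three ages.
ours_2060   = [-10.0, 50.0, -5.0]   # residual-based estimate for 2060
census_2060 = [  5.0, 45.0,  0.0]   # Census-published level for 2060

n_steps = 39                         # yearly steps after 2060
path = zeros(length(ours_2060), n_steps)
for step in 1:n_steps
    w = step / n_steps               # weight on the Census level
    path[:, step] = w .* census_2060 .+ (1 - w) .* ours_2060
end
```

The last column of `path` equals the Census profile, and each earlier column moves the same fraction of the gap toward it.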

Because we interpolate both age and calendar period to a quarterly frequency, our population statistics need not correspond exactly to those published by the Census Bureau. The difference is about 1/4 percent in the level of the population by 2060, indicating that the cumulative discrepancy is very small.
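
To see why interpolation alone can open a gap with the published figures, the sketch below interpolates a made-up annual flow to quarterly points and compares the sum of the quarterly flows with the annual value. It uses plain linear interpolation as a stand-in for the cubic spline used in this notebook, and the helper `lininterp` is hypothetical.

```julia
# Hypothetical annual flows, dated at mid-year.
years  = [0.5, 1.5, 2.5]
annual = [100.0, 120.0, 90.0]

# Plain linear interpolation between neighboring knots (stand-in for a cubic spline).
function lininterp(x, y, xq)
    i = clamp(searchsortedlast(x, xq), 1, length(x) - 1)
    w = (xq - x[i]) / (x[i+1] - x[i])
    return (1 - w) * y[i] + w * y[i+1]
end

# Four quarterly points inside the middle year.
quarters  = [1.125, 1.375, 1.625, 1.875]
quarterly = [lininterp(years, annual, q) for q in quarters] ./ 4   # remove annualization

# The quarterly flows need not add up to the annual figure.
gap = sum(quarterly) - annual[2]
```

In this toy case the middle year is a local peak, so the interpolated quarters sum to less than the annual value.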

This file also exports an estimate of the population at the end of 1899 using the official population counts. These official data are smoothed because the raw Census data show significant over-reporting of ages ending in "0" or "5" and under-reporting of the neighboring ages. The information is used as an input in the notebook `PopulationTurn1900.ipynb`, where we use the raw 1900 Census data to fill in the missing smoothed age counts. The migration counts for the elderly with missing population counts are effectively zeroed out.

Historical population estimates were downloaded from https://www.census.gov/popest/data/historical/. Population projections were downloaded from https://www.census.gov/population/projections/data/national/2014.html. The pages were last accessed on August 1, 2016.

## Assembling the population data from various Census publications

In [None]:
# Creating the array that will contain the population numbers by single year of age (0 to 119)
population_Y_Total = zeros(120,161)
population_Y_Men = zeros(120,161)
population_Y_Women = zeros(120,161)
per_year = 1900:2060
age_year = 0:119

# Adding the 1900-1979 data
using CSV, DataFrames;
for yy = 1900:1939 # up to 75+
    temppop = CSV.read("RawData/census_pop_estimates_age_sex_" * string(yy) * ".csv", DataFrame; header=false, skipto=9, delim=',')
    population_Y_Total[1:75,yy-1899] = temppop[1:75,3]
    population_Y_Men[1:75,yy-1899] = temppop[1:75,4]
    population_Y_Women[1:75,yy-1899] = temppop[1:75,5]
end
for yy = 1940:1959 # up to 85+
    temppop = CSV.read("RawData/census_pop_estimates_age_sex_" * string(yy) * ".csv", DataFrame; header=false, skipto=9, delim=',')
    population_Y_Total[1:85,yy-1899] = temppop[1:85,3]
    population_Y_Men[1:85,yy-1899] = temppop[1:85,4]
    population_Y_Women[1:85,yy-1899] = temppop[1:85,5]
end
for yy = 1960:1979 # change in number of header rows
    temppop = CSV.read("RawData/census_pop_estimates_age_sex_" * string(yy) *".csv", DataFrame; header=false, skipto=8, delim=',')
    population_Y_Total[1:85,yy-1899] = temppop[1:85,3]
    population_Y_Men[1:85,yy-1899] = temppop[1:85,4]
    population_Y_Women[1:85,yy-1899] = temppop[1:85,5]
end

In [None]:
# Adding the 1980s data
for yy = 1980:1989 # one semicolon-delimited file per year
    temppop = CSV.read("RawData/pop_" * string(yy) *"_to_" * string(yy+1) * "_clean.csv", DataFrame; header=false,delim=';')
    temppop = temppop[temppop[:,2].==(yy-1200),:] # keep July 1 data (yy-1200 gives 780 for 1980, ..., 789 for 1989)
    temppop = temppop[temppop[:,3].!=999,:] # drop the rows coded 999 in column 3
    population_Y_Total[1:100,yy-1899] = temppop[1:100,4]
    population_Y_Men[1:100,yy-1899] = temppop[1:100,5]
    population_Y_Women[1:100,yy-1899] = temppop[1:100,6]
end
# Adding the 1990s data (Intercensal)
#Pkg.add("Missings")
using DataFrames, Missings
for yy = 1990:1999 # one pass per year over the 1990-2000 intercensal file
    temppop = CSV.read("RawData/intercensal_1990_2000_clean.csv", DataFrame,skipto=4,delim=';')
    temppop = dropmissing(temppop)
    temppop = temppop[(temppop[:, 1] .=="July 1") .& (temppop[:, 2] .==yy),:] # keep July 1 data
    temppop = temppop[temppop[:, 3].!="100+",:] # keep anything that is not 100+
    temppop = temppop[temppop[:, 3].!="All Age",:] # keep anything that is not "All Age"
    population_Y_Total[1:100, yy-1899] = temppop[1:100, 4]
    population_Y_Men[1:100, yy-1899] = temppop[1:100, 5]
    population_Y_Women[1:100, yy-1899] = temppop[1:100, 6]
end
# Adding the 2000s data (Intercensal)
temppop = CSV.read("RawData/Population_2000_2010.csv", DataFrame,skipto=3,delim=',')
temppop = temppop[temppop[:,1].==7,:]
temppop = temppop[temppop[:,3].!=85,:]
temppop = temppop[temppop[:,3].!=999,:]
for yy = 2000:2010 # ages 0 to 84 only, since the rows coded 85 and 999 were dropped above
    population_Y_Total[1:85,yy-1899] = temppop[temppop[:,2].==yy,4]
    population_Y_Men[1:85,yy-1899] = temppop[temppop[:,2].==yy,5]
    population_Y_Women[1:85,yy-1899] = temppop[temppop[:,2].==yy,6]
end
# Adding the 2010s data
for yy = 2010:2013 # one file per year; 2010 overwrites the intercensal values loaded above
    temppop = CSV.read("RawData/pop_7_" * string(yy) * "_to_12_" * string(yy) * ".csv", DataFrame,skipto=3,delim=',')
    temppop = temppop[temppop[:,2].==7,:]
    temppop = temppop[temppop[:,4].!=100,:]
    temppop = temppop[temppop[:,4].!=999,:]
    population_Y_Total[1:100,yy-1899] = temppop[:,5]
    population_Y_Men[1:100,yy-1899] = temppop[:,6]
    population_Y_Women[1:100,yy-1899] = temppop[:,7]
end

In [None]:
# Adding 2014:2060 projection data
pop_2014_2060 = CSV.read("RawData/CensusBureau_Population_projections_2014_2060.csv", DataFrame,skipto=2,delim=',')
temppop=pop_2014_2060[pop_2014_2060[:,1].==0,:]
temppop=temppop[temppop[:,2].==0,:]
temppop=temppop[temppop[:,3].==0,:]
temppop=transpose(Matrix(temppop[:,6:105]))
population_Y_Total[1:100,115:161] = Array{Float64,2}(temppop)

temppop=pop_2014_2060[pop_2014_2060[:,1].==0,:]
temppop=temppop[temppop[:,2].==0,:]
temppop=temppop[temppop[:,3].==1,:]
temppop=transpose(Matrix(temppop[:,6:105]))
population_Y_Men[1:100,115:161] = Array{Float64,2}(temppop)

temppop=pop_2014_2060[pop_2014_2060[:,1].==0,:]
temppop=temppop[temppop[:,2].==0,:]
temppop=temppop[temppop[:,3].==2,:]
temppop=transpose(Matrix(temppop[:,6:105]))
population_Y_Women[1:100,115:161] = Array{Float64,2}(temppop);

In [None]:
# Saving the population data to CSVs
using DelimitedFiles
writedlm("CleanData/population_1900_2060_Total.csv",hcat(["Age" ; age_year],[per_year' ; population_Y_Total]), ',')
writedlm("CleanData/population_1900_2060_Men.csv",hcat(["Age" ; age_year],[per_year' ; population_Y_Men]), ',')
writedlm("CleanData/population_1900_2060_Women.csv",hcat(["Age" ; age_year],[per_year' ; population_Y_Women]), ',')

# Padding population with zeros through 2099 (39 extra columns) for use below
population_Y_Total = [population_Y_Total zeros(120,39)]
population_Y_Men   = [population_Y_Men zeros(120,39)]
population_Y_Women = [population_Y_Women zeros(120,39)];

## Estimating net migration flows

In [None]:
# Reading the annual mortality rates 
Γ_AGEy_PERy = Matrix(CSV.read("CleanData/interp_death_rate_1900_2220_Y.csv", DataFrame,delim=',',header=false));

# Computing the net migration flows...
netmigration = zeros(120,200);
netmigration[2:end,2:end] = population_Y_Total[2:end,2:end] ./ (1.0.-Γ_AGEy_PERy[2:end,2:size(netmigration,2)]) - population_Y_Total[1:end-1,1:end-1]

# And adding zeros when change in population cannot be computed
# (row = age + 1, column = year - 1899)
netmigration[101,:] .= 0.0;       # All years  age 100
netmigration[76,1:40] .= 0.0;     # 1900-1939  age 75 (data top-coded at 75+)
netmigration[86,41:80] .= 0.0;    # 1940-1979  age 85 (data top-coded at 85+)
netmigration[77:85,41] .= 0.0;    # 1940       ages 76 to 84
netmigration[87:end,81] .= 0.0;   # 1980       ages 86+
netmigration[86,101:110] .= 0.0;  # 2000-2009  age 85 (data top-coded at 85+)
netmigration[86:end,101] .= 0.0;  # 2000       ages 85+
netmigration[87:end,111] .= 0.0;  # 2010       ages 86+
netmigration[:,162] .= 0.0;       # 2061       All ages

In [None]:
# Extracting infant migration conditional on our birth series
births = Matrix(CSV.read("CleanData/births_Y.csv", DataFrame,delim=',',header=false))'
netmigration[1,1:161]=population_Y_Total[1,1:161]./(1.0 .- Γ_AGEy_PERy[1,1:161])-births[1,1:161]

# Extending the net migration series by assuming convergence to the levels that the Census estimates will prevail in 2060
# Here we simply converge linearly from our own 2060 estimates to the migration statistics for 2060, as estimated by the Census Bureau.
# Beyond that year, we hold net migration constant.
# That way, the only population growth by 2100 comes from net migration.
census_net_migration = Matrix(CSV.read("RawData/Census_net_migration_projections_2014_2060.csv", DataFrame,skipto=2,delim=',',header=false))
for yy=162:200
    netmigration[:,yy]=(yy-161)/39*[census_net_migration[1:85,end] ; zeros(35,1)] + (1-(yy-161)/39)*netmigration[:,161]
end

# Because the official population data is truncated, it makes sense to truncate the net migration data as well.
# The conservative thing to do is set net migration to zero for 75+.
# Note: row = age + 1, so rows 75:120 also zero out age 74.
netmigration[75:120,:] .= 0.0

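# Saving the net migration data to a CSV
# A labeled version (ages in the first column, years in the first row) was previously written to the
# same file and then immediately overwritten by the call below, so it is kept here only for reference:
#   per_year=1900:2099
#   writedlm("CleanData/netmigration_Y.csv",hcat(["Age" ; age_year],[per_year' ; netmigration]), header = true, ',')
# The file kept on disk is the bare 120x200 matrix, without age or year labels.
writedlm("CleanData/netmigration_Y.csv",netmigration, header = true, ',')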

In [None]:
# Converting the yearly net migration data to quarterly levels
include("ordernorep.jl")
include("spline_cubic.jl")
netmigration_perQ_ageY = zeros(120,800)   # Converting periods...
netmigration_perQ_ageQ = zeros(480,800)   # ...then converting ages
per_quarters = 1900.25:.25:2100  # End of period so that 1900:Q2 coincides with July 1 Census population count
per_years = 1900.5:1.0:2100
age_years = 0.5:1:120
age_quarters = 0.125:.25:120

# Converting yearly periods to quarters and smoothing net migration statistics
for aa=1:120
    netmigration_perQ_ageY[aa,2:end-2] = spline_cubic(per_years,netmigration[aa,:],per_quarters[2:end-2])
    netmigration_perQ_ageY[aa,1]=netmigration_perQ_ageY[aa,3]
    netmigration_perQ_ageY[aa,end-1]=netmigration_perQ_ageY[aa,end-2]
    netmigration_perQ_ageY[aa,end]=netmigration_perQ_ageY[aa,end-2]
end

# Removing the annualization (each annual flow is spread over four quarterly periods)
netmigration_perQ_ageY=netmigration_perQ_ageY/4.0

# Saving for inspection
writedlm("CleanData/netmigration_perQ_ageY.csv",netmigration_perQ_ageY, header = false, ',')

In [None]:
# Converting ages to quarters
for qq=1:800
    netmigration_perQ_ageQ[3:end-2,qq] = spline_cubic(age_years,netmigration_perQ_ageY[:,qq],age_quarters[3:end-2])
end
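# The first two quarterly ages are set so that the four quarters of age 0 average to the yearly value
# (mirrors the treatment of the last two quarterly ages below)
netmigration_perQ_ageQ[1,:]=0.5*(4*netmigration_perQ_ageY[1,:]-netmigration_perQ_ageQ[3,:]-netmigration_perQ_ageQ[4,:])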
netmigration_perQ_ageQ[2,:]=netmigration_perQ_ageQ[1,:]
netmigration_perQ_ageQ[end-1,:]=0.5*(4*netmigration_perQ_ageY[end,:]-netmigration_perQ_ageQ[end-2,:]-netmigration_perQ_ageQ[end-3,:])
netmigration_perQ_ageQ[end,:]=netmigration_perQ_ageQ[end-1,:]

# Converting data to non-annualized values (each yearly age bin is spread over four quarterly age bins)
netmigration_perQ_ageQ=netmigration_perQ_ageQ/4

# Saving to CSV
writedlm("CleanData/netmigration_Q.csv",netmigration_perQ_ageQ, header = false, ',')

## Smoothed population counts at the turn of 1900

This next set of instructions computes the population at the end of 1899 up to age 74 years using the smoothed population estimates. The information is used as an input in the notebook `PopulationTurn1900.ipynb`, where it is merged with raw 1900 Census data to fill in the missing age counts.

In [None]:
# Computing the population at the end of 1899 (annual) in a way
# consistent with our net migration and mortality rate estimates in 1900
population_1899_Y = [population_Y_Total[2:end,1]./(1.0.-Γ_AGEy_PERy[2:end,1]) ; 0]

# Reading the quarterly mortality rates
Γ_AGEq_PERq = Matrix(CSV.read("CleanData/interp_death_rate_1900_2220_Q.csv", DataFrame,delim=',',header=false))

# Computing the population at the end of 1899 (quarterly)
population_1899_Q = zeros(480,1)
age_quarters=0.25:0.25:120
population_1899_Q[1:298]=spline_cubic(age_years,population_Y_Total[:,1],age_quarters[2:299])./(1.0 .- Γ_AGEq_PERq[2:299,2]).^0.25./(1.0 .- Γ_AGEq_PERq[1:298,1]).^0.25
population_1899_Q[1:298]=population_1899_Q[1:298]-netmigration_perQ_ageQ[1:298,1] - netmigration_perQ_ageQ[2:299,2]./(1.0 .- Γ_AGEq_PERq[2:299,2]).^0.25
population_1899_Q=population_1899_Q/4.0 # Removing annualization
population_1899_Y[isnan.(population_1899_Y)] .= 0.0;

# Saving the population to CSV (note: the 75+ years old tails need to be added)
writedlm("CleanData/population_1899_smoothed_Y.csv",hcat(age_year,population_1899_Y), header = false, ',')
writedlm("CleanData/population_1899_smoothed_Q.csv",hcat(age_quarters,population_1899_Q), header = false, ',');

In [None]:
# Changing permissions
run(`chmod 664 CleanData/population_1900_2060_Total.csv`);
run(`chmod 664 CleanData/population_1900_2060_Men.csv`);
run(`chmod 664 CleanData/population_1900_2060_Women.csv`);
run(`chmod 664 CleanData/netmigration_Y.csv`);
run(`chmod 664 CleanData/netmigration_perQ_ageY.csv`);
run(`chmod 664 CleanData/netmigration_Q.csv`);
run(`chmod 664 CleanData/population_1899_smoothed_Y.csv`);
run(`chmod 664 CleanData/population_1899_smoothed_Q.csv`);