# Measuring US Industry Level Productivity for 1947-2023
## [Juan Ignacio Vizcaino](https://www.jivizcaino.com/) and [Selim Elbadri](https://www.selimelbadri.com/) 

We combine data from the US KLEMS March 2017 Release, the BEA-BLS Integrated Industry-Level Production Account for 1947–2016, and the BEA-BLS Integrated Industry-Level Production Account for 1997–2023 to produce industry-level measures of Gross Output (**GO**), Value Added (**VA**), Capital (**CAP**), Labor (**LAB**), and Intermediate Inputs (**II**) in nominal terms. We also provide quantity indices for **GO**, **VA**, **CAP**, **LAB**, **II**, and total hours worked (**HRS**), and use these indices to compute Total Labor Productivity (**LP**) and Total Factor Productivity (**TFP**) over the same period, following the methodology of the US KLEMS April 2013 Release.
See our [GitHub repo](https://github.com/selbadri/Measuring-US-Industry-Level-Productivity-1947-to-2023) for details on data processing.



# Introduction

We construct the final dataset in two stages.
Step 1 merges the raw datasets and standardizes all variables. This step, implemented in clean.ipynb, produces [clean_data.xlsx](../Output/clean_data.xlsx) — a unified dataset containing the core variables for 44 industries (1947–2023) and 63 industries (1963–2023). It includes nominal series for **GO**, **CAP**, **LAB**, **II**, and **VA**, as well as nominal compensation values and quantity indices for each input type.

Step 2, implemented by [analysis.ipynb](./analysis.ipynb), uses clean_data.xlsx to aggregate the growth rates of the various **CAP**, **LAB**, and **II** types, weighting them by their nominal compensation shares to generate the corresponding quantity indices. Combined with the **GO** quantity index, we use these to produce the **VA** quantity index and the productivity measures. The final output is [EV_production_accounts_1947to2023.xlsx](../Output/EV_production_accounts_1947to2023.xlsx).
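
As a rough illustration of the share-weighted aggregation described in Step 2, the sketch below combines the log growth rates of two capital types into one aggregate growth rate and chains it into an index. The toy numbers are invented, and the use of two-period average shares (a Törnqvist-style weight) is an assumption made for illustration; the exact weighting used in analysis.ipynb may differ.

```python
import numpy as np
import pandas as pd

# Toy data for a single industry (hypothetical values).
toy = pd.DataFrame({
    "yr": [2000, 2001, 2002],
    "CAPIT": [10.0, 12.0, 15.0],          # nominal compensation, IT capital
    "CAPOTH": [90.0, 88.0, 85.0],         # nominal compensation, other capital
    "CAPIT_QI_g": [np.nan, 0.10, 0.08],   # log growth of the quantity index
    "CAPOTH_QI_g": [np.nan, 0.01, 0.02],
})

comp = toy[["CAPIT", "CAPOTH"]]
shares = comp.div(comp.sum(axis=1), axis=0)   # nominal compensation shares
avg_shares = (shares + shares.shift(1)) / 2   # assumed: average of t and t-1 shares
growth = toy[["CAPIT_QI_g", "CAPOTH_QI_g"]].to_numpy()

# Share-weighted sum of component growth rates (NaN in the first year).
toy["CAP_QI_g"] = (avg_shares.to_numpy() * growth).sum(axis=1)

# Chain the aggregate growth rate into an index equal to 1 in the first year.
toy["CAP_QI"] = np.exp(toy["CAP_QI_g"].fillna(0).cumsum())
```

With these numbers the aggregate grows more slowly than IT capital alone, because the low-growth component carries most of the compensation weight.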

In addition to constructing the dataset, we provide code that validates it against well-established, comparable datasets. This is implemented in [validate.ipynb](./validate.ipynb). Specifically, we compare industry-level and broad-sector-level **GO** and **VA** shares, as well as US economy-wide quantity and productivity indices (**GO**, **CAP**, **LAB**, **II**, **VA**, **TFPVA** and **LPVA**).

Running the notebook [run_all.ipynb](./run_all.ipynb) automatically executes [clean.ipynb](./clean.ipynb), followed by [analysis.ipynb](./analysis.ipynb) and then [validate.ipynb](./validate.ipynb), so there is no need to open the three notebooks individually.

# Table of Contents
1. [Preliminaries](#1-preliminaries)
   - [Set-up working directory](#11-set-up-working-directory)
   - [Import required libraries](#12-import-required-libraries)
2. [Functions & Lists](#2-functions--lists)
   - [Define widely used functions](#21-define-widely-used-functions)
   - [Define widely used lists / aggregate_groups](#22-define-widely-used-lists--aggregate_groups)
3. [Cleaning BEA-BLS (Experimental) 1947-2016](#3-cleaning-bea-bls-experimental-1947-2016)
   - [Extracting the required data](#31-extracting-the-required-data)
   - [Log differencing variables](#32-log-differencing-variables)
   - [Renaming variables](#33-renaming-variables)
   - [Restrict nominal variables to 1997 and growth rates to 1998](#34-restrict-nominal-variables-to-1997-and-growth-rates-to-1998)
4. [Cleaning US KLEMS, March 2017 Release](#4-cleaning-us-klems-march-2017-release)
   - [Extracting the required data](#41-extracting-the-required-data)
   - [Merge government industries](#42-merge-government-industries)
   - [Build panel datasets](#43-build-panel-datasets)
5. [Cleaning BEA-BLS Capital Dataset (1997-2023)](#5-cleaning-bea-bls-capital-dataset-1997-2023)
   - [Extracting the required data](#51-extracting-the-required-data)
   - [Creating industry identifiers](#52-creating-industry-identifiers)
   - [Log differencing the quantity indices](#53-log-differencing-the-quantity-indices)
   - [Renaming variables](#54-renaming-variables)
   - [Extend nominal compensation 2015-2023](#55-extend-nominal-compensation-2015-2023)
6. [Merging the Datasets for 1963-2023](#6-merging-the-datasets-for-1963-2023)
   - [Extend KLEMS nominal data to 2023](#61-extend-klems-nominal-data-to-2023)
   - [Merge quantity indices and compensation](#62-merge-quantity-indices-and-compensation)
   - [Chain growth rates to quantity indices](#63-chain-growth-rates-to-quantity-indices)
   - [Merge factor compensation and quantity indices with nominal series](#64-merge-factor-compensation-and-quantity-indices-with-nominal-series)
   - [Order variables](#65-order-variables)
7. [Merging the Datasets for 1947-2023](#7-merging-the-datasets-for-1947-2023)
   - [Extend KLEMS nominal data beyond 2014](#71-extend-klems-nominal-data-beyond-2014)
   - [Order variables](#72-order-variables)
   - [Turn quantity growth rates into indices](#73-turn-quantity-growth-rates-into-indices)
   - [Merge nominal series with factor compensation and quantity growth rates](#74-merge-nominal-series-with-factor-compensation-and-quantity-growth-rates)
   - [Reorder variables](#75-reorder-variables)
8. [Combine & Export](#8-combine--export)
   - [Format datasets for export](#81-format-datasets-for-export)
   - [Variable definitions and industry key](#82-variable-definitions-and-industry-key)
   - [Export as excel](#83-export-as-excel)

<a id="1-preliminaries"></a>
# 1. Preliminaries

<a id="11-set-up-working-directory"></a>
#### 1.1 Set-up working directory

In [1]:
# Forward slashes are accepted on Windows, macOS and Linux alike
import_file_path = "../Input"
export_file_path = "../Output"

<a id="12-import-required-libraries"></a>
#### 1.2 Import required libraries

In [2]:
import pandas as pd
import numpy as np
import os 
from functools import reduce

<a id="2-functions--lists"></a>
# 2. Functions & Lists

<a id="21-define-widely-used-functions"></a>
#### 2.1. Define widely used functions

In [3]:
def dlog(df, var, group_col='indnum', time_col='yr', base_year=None):
    """Compute the log-difference of a column per group (industry).
    
    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe containing the variable to be differenced
    var : str
        Column name of the variable to take log-differences of
    group_col : str, optional
        Column name used to group the panel (default 'indnum').
    time_col : str, optional
        Time column used for ordering within each group (default 'yr').
    base_year : int or None, optional
        If provided, the row(s) corresponding to base_year will be set to NaN
        for the resulting growth column. If None, the first observation in
        each group is set to NaN.
    
    Returns
    -------
    pandas.DataFrame
        A copy of the input dataframe with a new column named 'dlog_<var>'
        (spaces in <var> replaced with underscores) and the original column
        <var> dropped.
    """
    df = df.copy()
    df = df.sort_values([group_col, time_col])
    newvar = f'dlog_{var}'.replace(' ', '_')

    df[newvar] = df.groupby(group_col)[var].transform(lambda x: np.log(x).diff())

    if base_year is not None:
        df.loc[df[time_col] == base_year, newvar] = np.nan
    else:
        min_years = df.groupby(group_col)[time_col].transform('min')
        df.loc[df[time_col] == min_years, newvar] = np.nan

    return df.drop(columns=[var])
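
To make the log-differencing step concrete, the sketch below applies the same core operation as `dlog` (log, then first difference within each industry) to a small hypothetical panel. The data values are invented, and the snippet is a condensed stand-in for illustration rather than a replacement for the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical two-industry panel with one quantity index column.
panel = pd.DataFrame({
    "indnum": [1, 1, 1, 2, 2, 2],
    "yr": [1947, 1948, 1949, 1947, 1948, 1949],
    "goqi.": [1.00, 1.10, 1.21, 2.0, 2.0, 4.0],
})

# Sort so that diff() compares consecutive years within each industry.
panel = panel.sort_values(["indnum", "yr"]).reset_index(drop=True)

# Log growth rate: ln(x_t) - ln(x_{t-1}), computed separately per industry.
panel["dlog_goqi."] = panel.groupby("indnum")["goqi."].transform(
    lambda x: np.log(x).diff()
)
```

The first year of each industry has no predecessor, so its growth rate is missing; this is the row that `dlog` also sets to NaN.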

In [4]:
def index_generation(df, variables):
    """Convert growth-rate columns into chained quantity index levels.

    This function looks for columns representing growth rates (e.g. 'X_g')
    and creates corresponding level index columns named 'X' by chaining the
    growth rates within each industry group. For each group, the index is
    initialized such that the period immediately prior to the first available
    growth observation is treated as the base (log-index = 0). The returned
    dataframe drops the input growth-rate columns (variables).

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe containing grouped time series with growth columns.
    variables : list of str
        List of column names containing growth rates (each expected to end
        with '_g').

    Returns
    -------
    pandas.DataFrame
        Copy of the input dataframe with new level index columns (one per
        growth variable) and the original growth columns removed.
    """
    df = df.copy()
    
    for var_g in variables:
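        # Strip only the trailing '_g' so names containing '_g' elsewhere are left intact
        index_col_name = var_g[:-2] if var_g.endswith('_g') else var_g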
        df[index_col_name] = np.nan

        for indnum in df['indnum'].unique():
            mask = df['indnum'] == indnum
            group_data = df.loc[mask].copy()

            if group_data.empty:
                continue

            group_data = group_data.sort_values('yr')

            growth_values = group_data[var_g].dropna()
            if growth_values.empty:
                continue

            first_valid_idx = growth_values.index[0]
            first_valid_year = group_data.loc[first_valid_idx, 'yr']
            base_year = first_valid_year - 1

            log_X = pd.Series(np.nan, index=group_data.index)

            base_year_mask = group_data['yr'] == base_year
            if base_year_mask.any():
                log_X[base_year_mask] = 0

            future_mask = group_data['yr'] >= first_valid_year
            if future_mask.any():
                growth_rates = group_data.loc[future_mask, var_g]
                cumulative_sum = growth_rates.cumsum()
                log_X.loc[future_mask] = cumulative_sum

            df.loc[mask, index_col_name] = np.exp(log_X)
    
    return df.drop(columns=variables)
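
The chaining performed by `index_generation` amounts to exponentiating the cumulative sum of log growth rates. The sketch below shows this on a hypothetical single-industry series. It is a vectorized shortcut that assumes the only missing growth value is the base-year row at the start of the series, so it does not reproduce the function's handling of years before the base year.

```python
import numpy as np
import pandas as pd

# Hypothetical growth rates; the base year (1963) has no growth observation.
growth = pd.DataFrame({
    "indnum": [1, 1, 1, 1],
    "yr": [1963, 1964, 1965, 1966],
    "GO_QI_g": [np.nan, 0.02, 0.03, -0.01],
})

# Treat the missing base-year growth as zero so the log-index starts at 0.
log_index = growth.groupby("indnum")["GO_QI_g"].transform(
    lambda g: g.fillna(0).cumsum()
)

# Exponentiate to obtain the index level (equal to 1 in the base year).
growth["GO_QI"] = np.exp(log_index)
```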

In [5]:
def rebase_indices(df, vars_to_rebase, base_year, id_var='indnum', time_var='yr'):
    """Rebase selected index columns so their value equals 1 in base_year.

    For each variable listed in `vars_to_rebase`, this function finds the
    value of that variable in `base_year` for each group (identified by
    `id_var`) and divides the full series by that base-year value, producing
    an index normalized to 1 in `base_year`. If a group's base-year value is
    missing, the result will be NaN for that group's rows.

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe containing the variables to rebase
    vars_to_rebase : list of str
        Column names to rebase (these columns must be present in df)
    base_year : int or str
        The year used as the base (rows where time_var == base_year are used
        to extract base values per group)
    id_var : str, optional
        Column name used to identify groups (default 'indnum')
    time_var : str, optional
        Column name representing time (default 'yr')

    Returns
    -------
    pandas.DataFrame
        A copy of df with the specified variables rebased (original columns
        overwritten).
"""
    df_new = df.copy()
    
    for var in vars_to_rebase:
        base_values = df_new.loc[df_new[time_var] == base_year, [id_var, var]]
        base_values = base_values.rename(columns={var: f'{var}_base'})

        df_new = df_new.merge(base_values, on=id_var, how='left')

        df_new[var] = df_new[var] / df_new[f'{var}_base']

        df_new = df_new.drop(columns=[f'{var}_base'])
    
    return df_new
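
As a small worked example of rebasing, the sketch below normalizes a hypothetical index to 1 in the year 2000 for each industry. It uses a lookup with `Series.map` instead of the merge in `rebase_indices`; this is an alternative formulation chosen for brevity, and the data values are invented.

```python
import pandas as pd

# Hypothetical index levels for two industries.
idx = pd.DataFrame({
    "indnum": [1, 1, 1, 2, 2, 2],
    "yr": [1999, 2000, 2001] * 2,
    "IIEN_QI": [0.9, 1.2, 1.5, 2.0, 4.0, 5.0],
})

# Base-year value per industry, indexed by industry code.
base = idx.loc[idx["yr"] == 2000].set_index("indnum")["IIEN_QI"]

# Divide every observation by its own industry's base-year value.
idx["IIEN_QI_rebased"] = idx["IIEN_QI"] / idx["indnum"].map(base)
```

An industry with no observation in the base year would receive NaN throughout, which matches the behavior described in the docstring above.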

<a id="22-define-widely-used-lists--aggregate_groups"></a>
#### 2.2 Define widely used lists

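The dictionary *aggregate_groups* maps each aggregated industry code to the detailed industry codes it combines. It is used to aggregate industries to the same level as the BEA-BLS (Experimental) Integrated Industry-Level Production Account for the period 1947-1963.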

In [6]:
aggregate_groups = {
    2936: list(range(29, 37)),
    3740: list(range(37, 41)),
    4144: list(range(41, 45)),
    4749: list(range(47, 50)),
    5152: list(range(51, 53)),
    5456: list(range(54, 57)),
    5758: list(range(57, 59))
}
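
The sketch below illustrates how a dictionary of this shape can be used: it is inverted into a detailed-code to aggregate-code lookup, and nominal values are then summed within each aggregate. The toy dataframe is hypothetical, and only one entry of `aggregate_groups` is copied to keep the example self-contained.

```python
import pandas as pd

# One entry copied from aggregate_groups: codes 29 to 36 collapse into 2936.
groups = {2936: list(range(29, 37))}

# Invert to a lookup from detailed code to aggregate code.
code_map = {old: new for new, olds in groups.items() for old in olds}

# Hypothetical nominal gross output for three detailed industries.
toy = pd.DataFrame({
    "yr": [1950, 1950, 1950],
    "indnum": [29, 30, 5],
    "GO": [10.0, 15.0, 7.0],
})

# Codes not in the lookup (here, 5) are left unchanged by replace().
toy["indnum"] = toy["indnum"].replace(code_map)
agg = toy.groupby(["yr", "indnum"], as_index=False)["GO"].sum()
```

Summing is appropriate for nominal series only; quantity indices cannot be aggregated this way.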

The lists *order_1963to2023* and *order_1947to2023* set the final column order of the two dataframes.

In [7]:
#Define the desired column order
order_1963to2023 = ['indnum','yr', 'GO', 'VA', 'CAP', 'LAB', 'II','CAPIT','CAPSOFT', 'CAPRD','CAPART','CAPOTH',
                    'LABCOL','LABNCOL', 'IIEN', 'IIMT','IISERV',
                    'GO_QI','CAPIT_QI','CAPSOFT_QI','CAPRD_QI','CAPART_QI','CAPOTH_QI',
                    'LABCOL_QI','LABNCOL_QI','II_QI', 'IIEN_QI', 'IIMT_QI', 'IISERV_QI','HRS_QI']

order_1947to2023 = ['indnum','yr', 'GO', 'VA', 'CAP', 'LAB', 'II','CAPIT','CAPSOFT', 'CAPRD','CAPART','CAPOTH',
                    'LABCOL','LABNCOL', 'IIEN','IIMT','IISERV',
                    'GO_QI','CAPIT_QI','CAPSOFT_QI','CAPRD_QI','CAPART_QI','CAPOTH_QI',
                    'LABCOL_QI','LABNCOL_QI','II_QI', 'IIEN_QI', 'IIMT_QI', 'IISERV_QI','HRS_QI']

<a id="3-cleaning-bea-bls-experimental-1947-2016"></a>
# 3. Cleaning BEA-BLS (Experimental) Integrated Industry-Level Production Account 1947-2016 Dataset

This section extracts and cleans the data required from the BEA-BLS (Experimental) Integrated Industry-Level Production Account for 1947-2016. Each cleaning step is performed twice: once for the 44 industries running from 1947 to 1963 and once for the 63 industries running from 1963 to 2016. We obtain nominal compensation and quantity indices for various types of CAP and LAB from this dataset, in addition to nominal GO values and GO and II quantity indices.

<a id="31-extracting-the-required-data"></a>
#### 3.1 Extracting the required data

In [8]:
experimental_varlist           = ['yr','indnum','goqi.','iiqi.','vlcol.','vln.','vkit.','vksoft.','vkRD.',
    'vkart.','vkoth.','qkit.','qks.','qkrd.','qka.','qko.','hrs','qlindexcol_merge.', 'qlindexn_merge.']

df_experimental_1947to1963     = pd.read_excel(os.path.join(import_file_path, 'industry-production-account-experimental.xlsx'), 
    sheet_name='1947-1963', skiprows=1, usecols=experimental_varlist)
df_experimental_1963to2016     = pd.read_excel(os.path.join(import_file_path, 'industry-production-account-experimental.xlsx'),
    sheet_name='1963-2016', skiprows=1, usecols=experimental_varlist)

<a id="32-log-differencing-variables"></a>
#### 3.2 Log differencing variables

In [9]:
q_indices                      = ['goqi.','iiqi.','qkit.','qks.','qkrd.','qka.','qko.','qlindexcol_merge.','qlindexn_merge.','hrs']
for v in q_indices:
    df_experimental_1947to1963 = dlog(df_experimental_1947to1963, v, base_year=1947)
for v in q_indices:
    df_experimental_1963to2016 = dlog(df_experimental_1963to2016, v, base_year=1963)

<a id="33-renaming-variables"></a>
#### 3.3 Renaming variables

In [10]:
experimental_renaming                      = {
    'vkit.': 'CAPIT','vksoft.': 'CAPSOFT', 'vkRD.': 'CAPRD','vkart.': 'CAPART','vkoth.': 'CAPOTH','vlcol.': 'LABCOL','vln.': 'LABNCOL',
    'dlog_goqi.': 'GO_QI_g','dlog_iiqi.': 'II_QI_g','dlog_qkit.': 'CAPIT_QI_g','dlog_qks.': 'CAPSOFT_QI_g','dlog_qkrd.': 'CAPRD_QI_g',
    'dlog_qka.': 'CAPART_QI_g','dlog_qko.': 'CAPOTH_QI_g','dlog_qlindexcol_merge.': 'LABCOL_QI_g','dlog_qlindexn_merge.': 'LABNCOL_QI_g',
    'dlog_hrs': 'HRS_QI_g'
}
df_experimental_1963to2016                 = df_experimental_1963to2016.rename(columns=experimental_renaming)
df_experimental_1947to1963                 = df_experimental_1947to1963.rename(columns=experimental_renaming)

<a id="34-restrict-nominal-variables-to-1997-and-growth-rates-to-1998"></a>
#### 3.4 Restrict nominal variables to 1997 and growth rates to 1998

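Nominal variables from 1997 onward and growth rates from 1998 onward are taken from the BEA-BLS Integrated Industry-Level Production Account for 1997-2023. The experimental series are therefore set to missing after 1996 (nominal variables) and after 1997 (growth rates).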

In [11]:
nom_var                    = ['CAPIT','CAPSOFT', 'CAPRD', 'CAPART', 'CAPOTH', 'LABCOL', 'LABNCOL']
growth_var                 = ['CAPIT_QI_g','CAPSOFT_QI_g', 'CAPRD_QI_g', 'CAPART_QI_g', 'CAPOTH_QI_g', 'LABCOL_QI_g', 'LABNCOL_QI_g', 'HRS_QI_g','GO_QI_g','II_QI_g']
               
df_experimental_1963to2016 = df_experimental_1963to2016.copy()
for v in nom_var:
    df_experimental_1963to2016.loc[(df_experimental_1963to2016['yr'] < 1963) | (df_experimental_1963to2016['yr'] > 1996),v] = np.nan
for v in growth_var:
    df_experimental_1963to2016.loc[(df_experimental_1963to2016['yr'] < 1963) | (df_experimental_1963to2016['yr'] > 1997),v] = np.nan

<a id="4-cleaning-us-klems-march-2017-release"></a>
# 4. Cleaning US KLEMS, March 2017 Release

This section extracts and cleans the data required from US KLEMS, March 2017 Release. We obtain nominal variables from 1947 to 2014 from this dataset for the 63 industries. 

<a id="41-extracting-the-required-data"></a>
#### 4.1 Extracting the required data

Rename the panel identifiers (the industry and year variables) to be consistent with those in the BEA-BLS (Experimental) Integrated Industry-Level Production Account for 1947-2016.

In [12]:
klems_vars = ['year', 'industry', 'gross output', 'capital', 'labor', 'intermediate']
df_klems   = pd.read_excel(os.path.join(import_file_path, 'usa_wk_mar_2017.xlsx'), sheet_name='KLEMdata', skiprows=1, usecols=klems_vars)
df_klems.rename(columns={'industry': 'indnum','year': 'yr'}, inplace=True)
df_klems.rename(columns={'gross output': 'GO','capital': 'CAP','labor': 'LAB','intermediate': 'II'}, inplace=True)

<a id="42-merge-government-industries"></a>
#### 4.2 Merge government industries into Federal Government and State & Local Government

Industries 62/63 are Federal General Government and Federal Government Enterprises while Industries 64/65 are State & Local Government Enterprises and State & Local General Government, respectively.  

In [13]:
federal_inds            = [62, 63]
state_local_inds        = [64, 65]
federal                 = df_klems[df_klems['indnum'].isin(federal_inds)].groupby('yr', as_index=False)[['GO', 'CAP', 'LAB', 'II']].sum()
state_local             = df_klems[df_klems['indnum'].isin(state_local_inds)].groupby('yr', as_index=False)[['GO', 'CAP', 'LAB', 'II']].sum()
federal['indnum']       = 62          
state_local['indnum']   = 63          

df_klems                = df_klems[~df_klems['indnum'].isin(federal_inds + state_local_inds)]
df_klems                = pd.concat([df_klems, federal, state_local], ignore_index=True)

<a id="43-build-panel-datasets"></a>
#### 4.3 Build two separate panel datasets of nominal variables for each industry-level aggregation

In [14]:
df_klems_1947to2014                       = (
    df_klems.assign(indnum=df_klems['indnum'].replace({i:new for new, olds in aggregate_groups.items() for i in olds}))
    .groupby(['yr','indnum'], as_index=False)[['GO', 'CAP', 'LAB', 'II']].sum())
df_klems_1947to2014                       = df_klems_1947to2014.sort_values(['indnum','yr']).reset_index(drop=True)
df_klems_1963to2014                       = df_klems[df_klems['yr'] >= 1963].copy()

<a id="5-cleaning-bea-bls-capital-dataset-1997-2023"></a>
# 5. Cleaning BEA-BLS Capital Dataset

This section extracts and cleans the data required from the BEA-BLS Integrated Industry-Level Production Account for 1997-2023. From this dataset we obtain nominal compensation and quantity indices for the various types of CAP, LAB and II, as well as nominal GO and its quantity index, all from 1997 to 2023.

<a id="51-extracting-the-required-data"></a>
#### 5.1 Extracting the required data

In [15]:
capital_sheets = [
    'Capital_Art_Quantity','Capital_R&D_Quantity','Capital_IT_Quantity','Capital_Other_Quantity','Capital_Software_Quantity',
    'Capital_Art Compensation', 'Capital_R&D Compensation','Capital_IT Compensation','Capital_Other Compensation','Capital_Software Compensation',
    'Labor_Col_Quantity','Labor_NoCol_Quantity','Labor_Col Compensation','Labor_NoCol Compensation',
    'Energy_Quantity','Materials_Quantity','Services_Quantity','Energy Compensation', 
    'Materials Compensation','Service Compensation','Gross Output', 'Gross Output_Quantity', 'Labor Hours_Quantity'
]
long_data = []
for sheet in capital_sheets:
    df_tmp = pd.read_excel(os.path.join(import_file_path, 'industry-production-account-capital.xlsx'), sheet_name=sheet, header=1).dropna(how='all')
    df_tmp = df_tmp.rename(columns={df_tmp.columns[0]: 'industry_description'})
    df_tmp = df_tmp.melt(id_vars='industry_description', var_name='year', value_name=sheet)
    long_data.append(df_tmp)
df_capital_1997to2023 = reduce(lambda l, r: pd.merge(l, r, on=['industry_description','year'], how='outer'), long_data)
df_capital_1997to2023 = df_capital_1997to2023.rename(columns={'year':'yr','industry_description':'Description'})
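
The reshape-and-merge pattern used in this cell can be seen on a small scale in the sketch below. Two hypothetical wide sheets (one row per industry, one column per year) are melted to long format and merged on industry and year. The industry names, values and the `sheets` dictionary are invented for illustration and stand in for the Excel sheets read above.

```python
from functools import reduce

import pandas as pd

# Hypothetical wide sheets keyed by sheet name.
sheets = {
    "Gross Output": pd.DataFrame({
        "industry": ["Farms", "Utilities"],
        "1997": [1.0, 2.0],
        "1998": [1.5, 2.5],
    }),
    "Energy_Quantity": pd.DataFrame({
        "industry": ["Farms", "Utilities"],
        "1997": [100.0, 100.0],
        "1998": [103.0, 98.0],
    }),
}

# Wide to long: one row per (industry, year), one value column per sheet.
long_frames = [
    df.melt(id_vars="industry", var_name="year", value_name=name)
    for name, df in sheets.items()
]

# Merge all long frames on the panel keys.
panel = reduce(
    lambda l, r: pd.merge(l, r, on=["industry", "year"], how="outer"),
    long_frames,
)
panel["year"] = pd.to_numeric(panel["year"], errors="coerce")
```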

<a id="52-creating-industry-identifiers"></a>
#### 5.2 Creating industry identifiers consistent with first two datasets

Unlike US KLEMS and the BEA-BLS (Experimental) Integrated Industry-Level Production Account 1947-2016, the BEA-BLS Integrated Industry-Level Production Account 1997-2023 dataset does not contain numeric industry identifiers. We therefore create them, numbering industries in the order in which their descriptions first appear in the dataframe.

In [16]:
order                           = df_capital_1997to2023['Description'].drop_duplicates().tolist()
mapping                         = {desc: i+1 for i, desc in enumerate(order)}
df_capital_1997to2023['indnum'] = df_capital_1997to2023['Description'].map(mapping).astype('Int64')
df_capital_1997to2023['yr']     = pd.to_numeric(df_capital_1997to2023['yr'], errors='coerce')
df_capital_1997to2023           = df_capital_1997to2023.drop(columns='Description')

<a id="53-log-differencing-the-quantity-indices"></a>
#### 5.3 Log differencing the quantity indices

In [17]:
q_indices                 = ['Gross Output_Quantity','Capital_IT_Quantity','Capital_Software_Quantity','Capital_R&D_Quantity',
                         'Capital_Art_Quantity','Capital_Other_Quantity','Labor_Col_Quantity','Labor_NoCol_Quantity',
                         'Labor Hours_Quantity','Energy_Quantity','Materials_Quantity','Services_Quantity']
for v in q_indices:
    df_capital_1997to2023 = dlog(df_capital_1997to2023, v, base_year=1997)

<a id="54-renaming-variables"></a>
#### 5.4 Renaming variables

In [18]:
capital_dictionary = {
    'Gross Output': 'GO','Capital_IT Compensation': 'CAPIT','Capital_Software Compensation': 'CAPSOFT',
    'Capital_R&D Compensation': 'CAPRD','Capital_Art Compensation': 'CAPART','Capital_Other Compensation': 'CAPOTH',
    'Labor_Col Compensation': 'LABCOL','Labor_NoCol Compensation': 'LABNCOL','dlog_Gross_Output_Quantity': 'GO_QI_g',
    'dlog_Capital_IT_Quantity': 'CAPIT_QI_g','dlog_Capital_Software_Quantity': 'CAPSOFT_QI_g','dlog_Capital_R&D_Quantity': 'CAPRD_QI_g',
    'dlog_Capital_Art_Quantity': 'CAPART_QI_g','dlog_Capital_Other_Quantity': 'CAPOTH_QI_g','dlog_Labor_Col_Quantity': 'LABCOL_QI_g',
    'dlog_Labor_NoCol_Quantity': 'LABNCOL_QI_g','dlog_Labor_Hours_Quantity': 'HRS_QI_g','dlog_Energy_Quantity': 'IIEN_QI_g',
    'dlog_Materials_Quantity': 'IIMT_QI_g','dlog_Services_Quantity': 'IISERV_QI_g','Service Compensation': 'IISERV',
    'Materials Compensation': 'IIMT','Energy Compensation': 'IIEN'
}
rename_dict                = {k: v for k, v in capital_dictionary.items() if k in df_capital_1997to2023.columns}
df_capital_1997to2023.rename(columns=rename_dict, inplace=True)

<a id="55-extend-nominal-compensation-2015-2023"></a>
#### 5.5 Extend nominal compensation data from 2015 to 2023 

The US KLEMS, March 2017 Release offers nominal CAP, LAB and II compensation from 1947 to 2014 for all 63 industries. We obtain the nominal CAP, LAB, II and GO values from 2015 onwards from the BEA-BLS Integrated Industry-Level Production Account 1997-2023. For CAP, LAB and II, these are simply the sum of the nominal compensation paid to each type of input. This aggregation must be done for the 63 industries as well as for the broader 44 industries of interest.

In [19]:
df_capital_nominal                  = (df_capital_1997to2023[df_capital_1997to2023['yr'] >= 2015].copy())
nominal_agg_map                     = {
    'GO' : ['GO'],
    'CAP': ['CAPIT', 'CAPSOFT', 'CAPRD', 'CAPART', 'CAPOTH'],
    'LAB': ['LABNCOL', 'LABCOL'],
    'II' : ['IISERV', 'IIMT', 'IIEN']
}
for newvar, cols in nominal_agg_map.items():
    cols_present                   = [c for c in cols if c in df_capital_nominal.columns]
    df_capital_nominal[newvar]     = df_capital_nominal[cols_present].apply(pd.to_numeric, errors='coerce').sum(axis=1, min_count=1)
df_capital_nominal_start1963       = df_capital_nominal[['yr','indnum','GO','II','LAB','CAP']].reset_index(drop=True)
df_capital_nominal_start1947       = (df_capital_nominal_start1963.copy()
    .assign(indnum=lambda d: d['indnum'].map({i:new for new, old in aggregate_groups.items() for i in old}).fillna(d['indnum']).astype('Int64'))
    .groupby(['yr','indnum'], as_index=False)[['GO','II','LAB','CAP']].sum())
df_capital_nominal_start1947       = df_capital_nominal_start1947.sort_values(by=['indnum', 'yr']).reset_index(drop=True)
df_capital_1997to2023              = df_capital_1997to2023.sort_values(['indnum', 'yr']).reset_index(drop=True)
df_capital_1997to2023_exGO         = df_capital_1997to2023.drop('GO', axis=1).round(5)
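
The `min_count=1` argument in the aggregation above matters when every component is missing. The sketch below, on invented values, contrasts it with the default behavior of `DataFrame.sum`.

```python
import numpy as np
import pandas as pd

# Hypothetical labor compensation by type, with missing values.
comp = pd.DataFrame({
    "LABCOL": [5.0, np.nan, np.nan],
    "LABNCOL": [3.0, 2.0, np.nan],
})

# With min_count=1, a row with no valid component stays missing.
comp["LAB"] = comp[["LABCOL", "LABNCOL"]].sum(axis=1, min_count=1)

# The default (min_count=0) would report zero for that row instead.
comp["LAB_default"] = comp[["LABCOL", "LABNCOL"]].sum(axis=1)
```

Keeping the all-missing row as NaN avoids recording a spurious zero as nominal compensation.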

<a id="6-merging-the-datasets-for-1963-2023"></a>
# 6. Merging the Datasets for 1963-2023

This section merges the data extracted and cleaned above from the three data sources to create a dataset with all required variables to compute the relevant quantity indices and productivity measures for *63* industries from *1963 to 2023*.

<a id="61-extend-klems-nominal-data-to-2023"></a>
#### 6.1 Extend US KLEMS nominal data beyond 2014 using BEA-BLS data

In [21]:
df_nom_1963to2023       = pd.concat([df_klems_1963to2014, df_capital_nominal_start1963], ignore_index=True)
df_nom_1963to2023       = df_nom_1963to2023.sort_values(by=['indnum', 'yr']).reset_index(drop=True)
df_nom_1963to2023["VA"] = df_nom_1963to2023["GO"] - df_nom_1963to2023["II"]

<a id="62-merge-quantity-indices-and-compensation"></a>
#### 6.2 Merge quantity indices and compensation for factor components from both BEA-BLS data sources

The BEA-BLS (Experimental) Integrated Industry-Level Production Account 1947-2016 dataset covers 63 industries from 1963 to 2016, while the BEA-BLS Integrated Industry-Level Production Account covers the same 63 industries from 1997 to 2023. Given the experimental nature of the former, we rely on the latter whenever possible, and use the former only to extend the latter back in time.
 

In [22]:
all_cols                   = list(set(df_experimental_1963to2016.columns).union(df_capital_1997to2023_exGO.columns))
df_experimental_1963to2016 = df_experimental_1963to2016.reindex(columns=all_cols)
df_capital_1997to2023_exGO = df_capital_1997to2023_exGO.reindex(columns=all_cols)
df_extended                = pd.merge(df_experimental_1963to2016,df_capital_1997to2023_exGO,on=["indnum", "yr"], how="outer", suffixes=("_exp", "_cap"))
for col in all_cols:
    if col + "_exp" in df_extended and col + "_cap" in df_extended:
        df_extended[col] = df_extended[col + "_exp"].combine_first(df_extended[col + "_cap"])
        df_extended      = df_extended.drop(columns=[col + "_exp", col + "_cap"])
df_extended              = df_extended.sort_values(by=["indnum", "yr"]).reset_index(drop=True)
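
The splice performed in this cell can be illustrated on a toy series. In the sketch below, an earlier source that has already been blanked in its final year is merged with a later source, and `combine_first` fills the gap from the later one. The two small dataframes are hypothetical stand-ins for the experimental and 1997-2023 datasets.

```python
import numpy as np
import pandas as pd

# Earlier source: growth rates already set to NaN where the later source takes over.
early = pd.DataFrame({
    "indnum": [1, 1, 1],
    "yr": [1996, 1997, 1998],
    "GO_QI_g": [0.02, 0.03, np.nan],
})

# Later source: starts in 1998.
late = pd.DataFrame({
    "indnum": [1, 1],
    "yr": [1998, 1999],
    "GO_QI_g": [0.04, 0.05],
})

merged = pd.merge(early, late, on=["indnum", "yr"], how="outer",
                  suffixes=("_exp", "_cap"))

# Take the earlier value where present, otherwise fall back to the later source.
merged["GO_QI_g"] = merged["GO_QI_g_exp"].combine_first(merged["GO_QI_g_cap"])
merged = (merged.drop(columns=["GO_QI_g_exp", "GO_QI_g_cap"])
          .sort_values(["indnum", "yr"])
          .reset_index(drop=True))
```

Because the earlier source was blanked beforehand, the later source effectively takes precedence in every overlapping year.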

<a id="63-chain-growth-rates-to-quantity-indices"></a>
#### 6.3 Chain growth rates to quantity indices

In [23]:
growth_vars = [
    "GO_QI_g", "CAPIT_QI_g", "CAPSOFT_QI_g", "CAPRD_QI_g", "CAPART_QI_g", "CAPOTH_QI_g",
    "LABCOL_QI_g", "LABNCOL_QI_g", "II_QI_g", "IISERV_QI_g", "IIMT_QI_g", "IIEN_QI_g", 
    "HRS_QI_g"
]
df_qindices_1963to2023     = index_generation(df_extended, growth_vars)

<a id="64-merge-factor-compensation-and-quantity-indices-with-nominal-series"></a>
#### 6.4 Merge the factor compensation and quantity indices with the nominal series

In [24]:
df_1963to2023 = pd.merge(df_nom_1963to2023,df_qindices_1963to2023,on=['indnum', 'yr'],how='inner') 
df_1963to2023 = df_1963to2023.sort_values(by=['indnum', 'yr']).reset_index(drop=True)
iiqi_vars     = ["IIEN_QI","IIMT_QI","IISERV_QI"]
df_1963to2023 = rebase_indices(df_1963to2023, vars_to_rebase=iiqi_vars, base_year=2000)

<a id="65-order-variables"></a>
#### 6.5 Order all variables

In [25]:
df_1963to2023          = df_1963to2023[order_1963to2023]

<a id="7-merging-the-datasets-for-1947-2023"></a>
# 7. Merging the Datasets for 1947-2023

This section merges the data extracted and cleaned above to extend industry-level data back to 1947 instead of 1963 for *44* industries. 

<a id="71-extend-klems-nominal-data-beyond-2014"></a>
#### 7.1 Extend US KLEMS nominal data beyond 2014 using BEA-BLS data

In [26]:
all_cols                     = list(set(df_klems_1947to2014.columns).union(df_capital_nominal_start1947.columns))
df_klems_1947to2014          = df_klems_1947to2014.reindex(columns=all_cols)
df_capital_nominal_start1947 = df_capital_nominal_start1947.reindex(columns=all_cols)
df_nom_1947to2023            = pd.concat([df_klems_1947to2014, df_capital_nominal_start1947], ignore_index=True)
df_nom_1947to2023["VA"]      = df_nom_1947to2023["GO"] - df_nom_1947to2023["II"]

<a id="72-order-variables"></a>
#### 7.2 Order variables

In [27]:
df_nom_1947to2023            = df_nom_1947to2023.sort_values(by=['indnum', 'yr']).reset_index(drop=True)
other_cols                   = [c for c in df_nom_1947to2023.columns if c not in ["indnum", "yr"]]
qi_cols                      = [col for col in df_nom_1947to2023.columns if col.endswith('QI')]
df_nom_1947to2023[qi_cols]   = df_nom_1947to2023[qi_cols].round(5)
df_nom_1947to2023            = df_nom_1947to2023[["indnum", "yr"] + other_cols]

<a id="73-turn-quantity-growth-rates-into-indices"></a>
#### 7.3 Turn quantity growth rates of relevant variables into indices

In [28]:
available_growth_vars    = [col for col in df_experimental_1947to1963.columns if col.endswith('_g')]
growth_vars = [
    "GO_QI_g", "CAPIT_QI_g", "CAPSOFT_QI_g", "CAPRD_QI_g", "CAPART_QI_g", "CAPOTH_QI_g",
    "LABCOL_QI_g", "LABNCOL_QI_g", "II_QI_g", "HRS_QI_g"
]
growth_vars_filtered    = [var for var in growth_vars if var in df_experimental_1947to1963.columns]
df_qindices_1947to2023  = index_generation(df_experimental_1947to1963, growth_vars_filtered)
 

<a id="74-merge-nominal-series-with-factor-compensation-and-quantity-growth-rates"></a>
#### 7.4 Merge the nominal series with factor compensation and quantity growth rates

In [29]:
df_1947to2023   = pd.merge(df_nom_1947to2023,df_qindices_1947to2023,on=["indnum", "yr"],how="outer",suffixes=("_exp63", "_full"))  # keeps all rows from both
df_1947to2023   = df_1947to2023.sort_values(by=["indnum", "yr"]).reset_index(drop=True)
other_cols      = [c for c in df_1947to2023.columns if c not in ["indnum", "yr"]]
df_1947to2023   = df_1947to2023[["indnum", "yr"] + other_cols]

<a id="75-reorder-variables"></a>
#### 7.5 Reorder variables

In [30]:
available_cols_1947to2023 = [col for col in order_1947to2023 if col in df_1947to2023.columns]
df_1947to2023             = df_1947to2023[available_cols_1947to2023]

<a id="8-combine--export"></a>
# 8. Combine 1947-2023 and 1963-2023 Datasets and Export

This section exports both datasets to an Excel file, which is saved in the export directory set in Section 1.1.

<a id="81-format-datasets-for-export"></a>
#### 8.1 Format both datasets

In [31]:
def format_dataframe_for_export(df):
    """Round nominal variables to 2 decimals and all remaining variables to 7 decimals."""
    df_formatted = df.copy()
    two_decimal_vars = ['GO', 'VA', 'CAP', 'LAB', 'II','CAPIT','CAPSOFT', 'CAPRD','CAPART','CAPOTH',
                        'LABCOL','LABNCOL','IIEN', 'IIMT','IISERV']
    # Every remaining column except the panel identifiers (the quantity indices)
    seven_decimal_vars = [col for col in df.columns if col not in ['indnum', 'yr'] + two_decimal_vars]
    for col in two_decimal_vars:
        if col in df_formatted.columns:
            df_formatted[col] = df_formatted[col].round(2)
    
    for col in seven_decimal_vars:
        if col in df_formatted.columns:
            df_formatted[col] = df_formatted[col].round(7)
    
    return df_formatted
df_1963to2023 = format_dataframe_for_export(df_1963to2023)
df_1947to2023 = format_dataframe_for_export(df_1947to2023)

<a id="82-variable-definitions-and-industry-key"></a>
#### 8.2 Create variable definitions and an industry key

In [32]:
var_defs = pd.DataFrame({
"Variable" : ['indnum','yr', 'GO', 'VA', 'CAP', 'LAB', 'II',
              'CAPIT','CAPSOFT', 'CAPRD','CAPART','CAPOTH',
              'LABCOL','LABNCOL',
              'IIEN', 'IIMT','IISERV',
              'GO_QI','II_QI',
              'CAPIT_QI','CAPSOFT_QI','CAPRD_QI','CAPART_QI','CAPOTH_QI',
              'LABCOL_QI','LABNCOL_QI',
              'IIEN_QI', 'IIMT_QI', 'IISERV_QI','HRS_QI'],

    # Descriptions must follow the same order as "Variable" above
    "Description": [
        # Identifiers
        "Industry Identifier",
        "Year",
        # Nominal aggregates
        "Nominal Gross Output",
        "Nominal Value Added",
        "Nominal Capital Compensation",
        "Nominal Labor Compensation",
        "Nominal Intermediate Inputs",
        # Nominal capital compensation by type
        "Nominal IT Equipment Capital Compensation",
        "Nominal Software Capital Compensation",
        "Nominal R&D Capital Compensation",
        "Nominal Entertainment Originals Capital Compensation",
        "Nominal Other Capital Compensation",
        # Nominal labor compensation by type
        "Nominal College Labor Compensation",
        "Nominal Non-College Labor Compensation",
        # Nominal intermediate inputs by type
        "Nominal Energy Intermediate Inputs",
        "Nominal Materials Intermediate Inputs",
        "Nominal Services Intermediate Inputs",
        # Quantity indices: gross output and intermediate inputs
        "Gross Output Quantity Index",
        "Intermediate Inputs Quantity Index",
        # Quantity indices: capital types
        "IT Equipment Capital Quantity Index",
        "Software Capital Quantity Index",
        "R&D Capital Quantity Index",
        "Entertainment Originals Capital Quantity Index",
        "Other Capital Quantity Index",
        # Quantity indices: labor types
        "College Labor Quantity Index",
        "Non-College Labor Quantity Index",
        # Quantity indices: intermediate input types
        "Energy Intermediate Input Quantity Index",
        "Materials Intermediate Input Quantity Index",
        "Services Intermediate Input Quantity Index",
        # Hours
        "Total Hours Index"
    ]
})

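# 1947-2023 dataset industry list
# Aggregated industries carry concatenated IDs, e.g. 2936 covers industries 29 to 36 of the 63-industry list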
industry_44 = pd.DataFrame({
    "industry_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,2936,3740,4144,45,46,4749,50,5152,53,5456,5758,59,60,61,62,63],
    "Industry Description": [
        "Farms", "Forestry, fishing, and related activities", "Oil and gas extraction", "Mining, except oil and gas",
        "Support activities for mining", "Utilities", "Construction", "Wood products", "Nonmetallic mineral products",
        "Primary metals", "Fabricated metal products", "Machinery", "Computer and electronic products",
        "Electrical equipment, appliances, and components", "Motor vehicles, bodies and trailers, and parts",
        "Other transportation equipment", "Furniture and related products", "Miscellaneous manufacturing",
        "Food and beverage and tobacco products", "Textile mills and textile product mills",
        "Apparel and leather and allied products", "Paper products", "Printing and related support activities",
        "Petroleum and coal products", "Chemical products", "Plastics and rubber products", "Wholesale trade",
        "Retail trade", "Transportation and warehousing", "Information", "Finance and insurance", "Real estate",
        "Rental and leasing services and lessors of intangible assets", "Professional, scientific, and technical services",
        "Management of companies and enterprises", "Administrative and waste management services", "Educational services",
        "Health care and social assistance", "Arts, entertainment, and recreation", "Accommodation",
        "Food services and drinking places", "Other services, except government", "Federal", "State and local"
    ]
})

#1963-2023 dataset industry list
industry_63 = pd.DataFrame({
    "industry_id": list(range(1, 64)),
    "Industry Description": [
        "Farms", "Forestry, fishing, and related activities", "Oil and gas extraction", "Mining, except oil and gas",
        "Support activities for mining", "Utilities", "Construction", "Wood products", "Nonmetallic mineral products",
        "Primary metals", "Fabricated metal products", "Machinery", "Computer and electronic products",
        "Electrical equipment, appliances, and components", "Motor vehicles, bodies and trailers, and parts",
        "Other transportation equipment", "Furniture and related products", "Miscellaneous manufacturing",
        "Food and beverage and tobacco products", "Textile mills and textile product mills",
        "Apparel and leather and allied products", "Paper products", "Printing and related support activities",
        "Petroleum and coal products", "Chemical products", "Plastics and rubber products", "Wholesale trade",
        "Retail trade", "Air transportation", "Rail transportation", "Water transportation", "Truck transportation",
        "Transit and ground passenger transportation", "Pipeline transportation", "Other transportation and support activities",
        "Warehousing and storage", "Publishing industries, except internet (includes software)",
        "Motion picture and sound recording industries", "Broadcasting and telecommunications",
        "Data processing, internet publishing, and other information services",
        "Federal Reserve banks, credit intermediation, and related activities", "Securities, commodity contracts, and investments",
        "Insurance carriers and related activities", "Funds, trusts, and other financial vehicles", "Real estate",
        "Rental and leasing services and lessors of intangible assets", "Legal services", "Computer systems design and related services",
        "Miscellaneous professional, scientific, and technical services", "Management of companies and enterprises",
        "Administrative and support services", "Waste management and remediation services", "Educational services",
        "Ambulatory health care services", "Hospitals and Nursing and residential care", "Social assistance",
        "Performing arts, spectator sports, museums, and related activities", "Amusements, gambling, and recreation industries",
        "Accommodation", "Food services and drinking places", "Other services, except government", "Federal", "State and local"
    ]
})
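
Two parallel lists of names and descriptions are easy to get out of step when a variable is added or moved. One possible safeguard, sketched below, is to store each variable next to its description and then check which dataset columns have no entry. The pairs reuse wording from the cell above but cover only a few variables, and `var_defs_sketch` and `data_columns` are hypothetical names introduced for this example.

```python
import pandas as pd

# A handful of (variable, description) pairs; the full table would list every variable.
pairs = [
    ("indnum", "Industry Identifier"),
    ("yr", "Year"),
    ("GO", "Nominal Gross Output"),
    ("LABCOL", "Nominal College Labor Compensation"),
    ("IIEN", "Nominal Energy Intermediate Inputs"),
]
var_defs_sketch = pd.DataFrame(pairs, columns=["Variable", "Description"])

# Hypothetical column list of a dataset that should be fully documented.
data_columns = ["indnum", "yr", "GO", "LABCOL", "IIEN", "HRS_QI"]

documented = set(var_defs_sketch["Variable"])
undocumented = [c for c in data_columns if c not in documented]
print("columns without a definition:", undocumented)
```

Keeping each name and description on the same line means a reordering cannot pair a variable with the wrong label.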

<a id="83-export-as-excel"></a>
#### 8.3 Export to an Excel file in the specified directory

In [33]:
os.makedirs(export_file_path, exist_ok=True)
output_file = os.path.join(export_file_path, "clean_data.xlsx")

with pd.ExcelWriter(output_file, engine="xlsxwriter") as writer:
    workbook                   = writer.book
    worksheet                  = workbook.add_worksheet("Info")
    writer.sheets["Info"] = worksheet
    var_defs.to_excel(writer, sheet_name="Info", index=False, startrow=0, startcol=0)
    industry_44.to_excel(writer, sheet_name="Info", index=False, startrow=0, startcol=len(var_defs.columns)+2)
    industry_63.to_excel(writer, sheet_name="Info", index=False,
                         startrow=0, startcol=len(var_defs.columns) + 2 + len(industry_44.columns) + 2)
    df_1963to2023.to_excel(writer, sheet_name="Data_63Ind_1963to2023", index=False)
    df_1947to2023.to_excel(writer, sheet_name="Data_44Ind_1947to2023", index=False)
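
The Info sheet places three tables side by side by passing a different `startcol` to each `to_excel` call, leaving two empty columns between tables. The sketch below shows how those offsets could be computed generically instead of being written out by hand. `side_by_side_startcols` is a hypothetical helper, and the three small frames only mimic the two-column shape of `var_defs`, `industry_44`, and `industry_63`.

```python
import pandas as pd

def side_by_side_startcols(tables, gap=2):
    """Return the first column of each table when laid out left to right with `gap` empty columns."""
    startcols = []
    col = 0
    for table in tables:
        startcols.append(col)
        col += len(table.columns) + gap  # table width (index not written) plus the spacing
    return startcols

# Two-column toy tables standing in for var_defs, industry_44 and industry_63.
defs = pd.DataFrame({"Variable": ["GO"], "Description": ["Nominal Gross Output"]})
ind_a = pd.DataFrame({"industry_id": [1], "Industry Description": ["Farms"]})
ind_b = pd.DataFrame({"industry_id": [1], "Industry Description": ["Farms"]})

offsets = side_by_side_startcols([defs, ind_a, ind_b])
print(offsets)
```

Each offset would then be passed as `startcol` in the corresponding `to_excel(..., index=False)` call, which reproduces the hand-written offsets used above.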