# Measuring US Industry Level Productivity for 1947-2023
## [Juan Ignacio Vizcaino](https://www.jivizcaino.com/) and [Selim Elbadri](https://www.selimelbadri.com/) 

We combine data from the US KLEMS March 2017 release, the BEA-BLS Integrated Industry-Level Production Account for 1947-2016, and the BEA-BLS Integrated Industry-Level Production Account for 1997-2023 to produce industry-level measures of Gross Output (**GO**), Value Added (**VA**), Capital (**CAP**), Labor (**LAB**), and Intermediate Inputs (**II**) in nominal terms. We also provide quantity indices for **GO**, **VA**, **CAP**, **LAB**, **II**, and total hours worked (**HRS**), and use these indices to compute Labor Productivity (**LP**) and Total Factor Productivity (**TFP**) for the same period, following the methodology of the US KLEMS April 2013 release.
See our [GitHub repo](https://github.com/selbadri/Measuring-US-Industry-Level-Productivity-1947-to-2023) for details on data processing.



# Introduction

We construct the final dataset in two stages.
Step 1 merges the raw datasets and standardizes all variables. This step, implemented in clean.ipynb, produces [clean_data.xlsx](../Output/clean_data.xlsx) — a unified dataset containing the core variables for 44 industries (1947–2023) and 63 industries (1963–2023). It includes nominal series for **GO**, **CAP**, **LAB**, **II**, and **VA**, as well as nominal compensation values and quantity indices for each input type.

Step 2, implemented in [analysis.ipynb](./analysis.ipynb), uses clean_data.xlsx to aggregate the growth rates of the various **CAP**, **LAB**, and **II** types, weighting them by their nominal compensation shares to generate the corresponding quantity indices. We then combine these with the **GO** quantity index to produce the **VA** quantity index and the productivity measures. The final output is [EV_production_accounts_1947to2023.xlsx](../Output/EV_production_accounts_1947to2023.xlsx).

In addition to constructing the dataset, we provide a notebook ([validation.ipynb](./validation.ipynb)) that checks it against well-established, comparable datasets. Specifically, we compare industry- and broad-sector-level **GO** and **VA** shares as well as US economy-wide quantity and productivity indices (**GO**, **CAP**, **LAB**, **II**, **VA**, **TFPVA** and **LPVA**).

Running the notebook [run_all.ipynb](./run_all.ipynb) automatically executes [clean.ipynb](./clean.ipynb), followed by [analysis.ipynb](./analysis.ipynb) and then validation.ipynb, so there is no need to open the three notebooks. 

1. [Preliminaries](#1-preliminaries)
   - [1.1 Set-up working directory](#11-set-up-working-directory)
   - [1.2 Import required libraries](#12-import-required-libraries)

2. [Functions & Lists](#2-functions--lists)
   - [2.1 Define widely used functions](#21-define-widely-used-functions)
   - [2.2 Define widely used lists / aggregate_groups](#22-define-widely-used-lists--aggregate_groups)

3. [Import clean dataset](#3-import-clean-dataset)

4. [Compute GO, CAP, LAB, II and HRS quantity growth rates for 1963-2023 dataset](#4-compute-go-cap-lab-ii-and-hrs-quantity-growth-rates-for-1963-2023-dataset)
   - [4.1 Compute the growth rate of II quantity between 1998 and 2023](#41-compute-the-growth-rate-of-ii-quantity-between-1998-and-2023)
   - [4.2 Compute growth rate of GO & HRS quantity indices](#42-compute-growth-rate-of-go--hrs-quantity-indices)
   - [4.3 Compute growth rate of CAP & LAB quantities](#43-compute-growth-rate-of-cap--lab-quantities)

5. [Compute GO, CAP, LAB, II and HRS quantity growth rates for 1947-2023 dataset](#5-compute-go-cap-lab-ii-and-hrs-quantity-growth-rates-for-1947-2023-dataset)
   - [5.1 Compute GO, II & HRS growth rates from 1947-1963](#51-compute-go-ii--hrs-growth-rates-from-1947-1963)
   - [5.2 Compute growth rate of CAP & LAB quantities](#52-compute-growth-rate-of-cap--lab-quantities)
   - [5.3 Extend the growth rates to 2023 using 1963-2023 dataset](#53-extend-the-growth-rates-to-2023-using-1963-2023-dataset)

7. [Compute productivity indices](#7-compute-productivity-indices)
   - [7.1 Compute value-added quantity growth rate](#71-compute-value-added-quantity-growth-rate)
   - [7.2 Compute labor productivity (VA)](#72-compute-labor-productivity-va)
   - [7.3 Compute labor productivity (GO)](#73-compute-labor-productivity-go)
   - [7.4 Compute total factor productivity (GO)](#74-compute-total-factor-productivity-go)
   - [7.5 Compute total factor productivity (VA)](#75-compute-total-factor-productivity-va)
   - [7.6 Chain growth rates into indices](#76-chain-growth-rates-into-indices)

8. [Export final dataset](#8-export-final-dataset)
   - [8.1 Format both datasets](#81-format-both-datasets)
   - [8.2 Create variable definitions and industry key](#82-create-variable-definitions-and-industry-key)
   - [8.3 Export as an excel to specified directory](#83-export-as-an-excel-to-specified-directory)


<a id="1-preliminaries"></a>
# 1. Preliminaries

<a id="11-set-up-working-directory"></a>
#### 1.1 Set-up working directory

In [1]:
# Relative paths with forward slashes resolve on Windows, macOS, and Linux.
import_file_path = "../Output"
export_file_path = "../Output"

<a id="12-import-required-libraries"></a>
#### 1.2 Import required libraries

In [2]:
import pandas as pd
import numpy as np
import os

<a id="2-functions--lists"></a>
# 2. Functions & Lists

<a id="21-define-widely-used-functions"></a>
#### 2.1 Define widely used functions

In [3]:
def tornqvist_component_aggregation(df, group_vars, base_vars, result_col="tornq_g"):
    
    """Compute a Tornqvist-style aggregate growth rate across components.

    Parameters
    ----------
    df : pandas.DataFrame
        Input dataframe containing component level series and their quantity indices (expected columns: '<COMP>' and '<COMP>_QI' for each component in base_vars),
        plus grouping columns and a time column named 'yr'.
    group_vars : list
        List of column names used to group the panel (e.g. ['indnum']).
    base_vars : list
        List of component base names (e.g. ['IIEN', 'IISERV']). For quantity calculations the function expects '<var>_QI' columns to exist in df.
    result_col : str, optional
        Name of the output column storing the aggregated growth rate (default 'tornq_g').

    Returns
    -------
    pandas.DataFrame
        Copy of the input dataframe with a new column `result_col` containing the Tornqvist-weighted growth rate for each row.

    Notes
    -----
    The function computes safe log-differences of each component's quantity index (variable named '<var>_QI'),
    computes current row-wise shares across the supplied base_vars, lags those shares by group, and then computes:
        tornq_g_t = sum_i 0.5*(s_i,t + s_i,t-1) * dlog(q_i,t)
    where dlog(q_i,t) = ln(q_i,t) - ln(q_i,t-1).

    """

    df = df.sort_values(group_vars + ["yr"]).copy()

    dl_cols = {}
    for var in base_vars:
        qi = f"{var}_QI"
        dl_cols[var] = df.groupby(group_vars)[qi].transform(lambda x: (np.log(x) - np.log(x.shift(1))).fillna(0))

    total_size = df[base_vars].sum(axis=1)
    shares = df[base_vars].div(total_size, axis=0)

    # Lag the shares within every grouping column, not only the first one
    lagged_shares = shares.groupby([df[g] for g in group_vars]).shift(1)

    weighted_sum = 0
    for var in base_vars:
        weight = 0.5 * (shares[var] + lagged_shares[var])
        weighted_sum += weight * dl_cols[var]

    df[result_col] = weighted_sum

    return df
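
As a quick sanity check on the formula in the docstring, the following minimal sketch reproduces the Tornqvist aggregation by hand for one industry and two made-up components. The column names `A`, `B`, `A_QI`, `B_QI` and all numbers are hypothetical, and the sketch does not call the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: one industry, two components, two years (made-up values)
toy = pd.DataFrame({
    "yr":   [2000, 2001],
    "A":    [40.0, 50.0],   # nominal compensation of component A
    "B":    [60.0, 50.0],   # nominal compensation of component B
    "A_QI": [1.00, 1.10],   # quantity index of component A
    "B_QI": [1.00, 1.02],   # quantity index of component B
})

# Row-wise nominal shares, averaged with their one-year lag
shares = toy[["A", "B"]].div(toy[["A", "B"]].sum(axis=1), axis=0)
avg_shares = 0.5 * (shares + shares.shift(1))

# Log-differences of the component quantity indices
dlog_q = np.log(toy[["A_QI", "B_QI"]]).diff()
dlog_q.columns = ["A", "B"]

# Tornqvist aggregate: sum_i 0.5*(s_i,t + s_i,t-1) * dlog(q_i,t)
toy["tornq_g"] = (avg_shares * dlog_q).sum(axis=1, skipna=False)
print(toy[["yr", "tornq_g"]])
```

The shares move from 0.4/0.6 to 0.5/0.5, so the averaged weights are 0.45 and 0.55 and the 2001 growth rate equals 0.45 * ln(1.10) + 0.55 * ln(1.02). The first year has no lagged share, so its growth rate is missing.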

In [4]:
def growth_to_index(df, group_var, growth_vars, base_year=1963):
    """
    Convert growth rates to level indices, starting at 1 in the specified base year.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing the growth rates and year column ('yr').
    group_var : str
        Column to group by (e.g., 'indnum').
    growth_vars : list of str
        Columns containing growth rates (e.g., ['GO_g', 'VA_g']).
    base_year : int
        Year where the index is set to 1.

    Returns
    -------
    pd.DataFrame
        DataFrame with new columns for indices (e.g., 'GO', 'VA').
    """

    df = df.sort_values([group_var, "yr"]).copy()

    for var in growth_vars:
        # Drop the trailing "_g" to name the index variable
        if var.endswith("_g"):
            idx_var = var[:-2]
        else:
            raise ValueError(f"Growth variable '{var}' must end with '_g'.")

        df[idx_var] = np.nan

        for ind in df[group_var].unique():
            mask = df[group_var] == ind
            years = df.loc[mask, "yr"].values
            growth = df.loc[mask, var].values

            idx_vals = np.empty(len(years))
            idx_vals[:] = np.nan

            if base_year in years:
                base_pos = np.where(years == base_year)[0][0]
                idx_vals[base_pos] = 1.0

                # Forward fill
                for t in range(base_pos + 1, len(years)):
                    if not np.isnan(growth[t]):
                        idx_vals[t] = idx_vals[t-1] * np.exp(growth[t])
                    else:
                        idx_vals[t] = idx_vals[t-1]

                # Backward fill
                for t in range(base_pos - 1, -1, -1):
                    if not np.isnan(growth[t+1]):
                        idx_vals[t] = idx_vals[t+1] / np.exp(growth[t+1])
                    else:
                        idx_vals[t] = idx_vals[t+1]
            else:
                # If base year not present, just normalize first observation to 1
                idx_vals[0] = 1.0
                for t in range(1, len(years)):
                    if not np.isnan(growth[t]):
                        idx_vals[t] = idx_vals[t-1] * np.exp(growth[t])
                    else:
                        idx_vals[t] = idx_vals[t-1]

            df.loc[mask, idx_var] = idx_vals

    return df
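
For intuition, the chaining logic can be sketched in vectorized form for a single series. This is an illustrative equivalent, assuming (as in the loops above) that a missing growth rate is treated as zero growth; the arrays are made up.

```python
import numpy as np

years = np.array([1962, 1963, 1964, 1965])
growth = np.array([np.nan, 0.02, 0.03, -0.01])  # log growth from t-1 to t
base_year = 1963

# Cumulative log growth; a missing growth rate counts as zero growth
cum = np.cumsum(np.nan_to_num(growth))

# Re-centre so that the index equals 1 in the base year
base_pos = int(np.where(years == base_year)[0][0])
index = np.exp(cum - cum[base_pos])
print(dict(zip(years.tolist(), index.round(4).tolist())))
```

Years before the base year are obtained by dividing out the growth rates, which is what the backward loop in `growth_to_index` does step by step.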


In [5]:
def tornqvist_industry_aggregation(df_results, aggregate_groups, vars_to_agg):
    
    """Aggregate growth rates for industries using Tornqvist VA shares.

    Parameters
    ----------
    df_results : pd.DataFrame
        Industry-level data with columns ['yr', 'indnum', 'VA'] + vars_to_agg.
    aggregate_groups : dict
        Mapping from aggregated industry codes (keys) to list of component industries (values).
        Industries not listed will be kept as-is.
    vars_to_agg : list of str
        Growth rate variables to aggregate, e.g. ["CAP_QI_g", "II_QI_g", "LAB_QI_g", "GO_QI_g", "HRS_QI_g"].

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame with columns ['yr', 'indnum'] + vars_to_agg.

    Notes
    -----
    For each aggregation key (agg_ind) the function selects component industries, computes
    lagged VA to form Tornqvist weights (using VA and VA_lag), constructs current and lagged
    VA shares across the components in each year, and computes the Tornqvist-weighted
    aggregate growth rate for each variable in vars_to_agg as:
        agg_v_t = sum_i 0.5*(s_i,t + s_i,t-1) * v_i,t
    where v_i,t are the component growth rates. Single-component aggregations are passed
    through and assigned the agg_ind code. Industries not listed in aggregate_groups are
    retained unchanged.
    """

    out = []

    mapped_inds = set(ind for inds in aggregate_groups.values() for ind in inds)

    for agg_ind, comp_inds in aggregate_groups.items():
        df_comp = df_results[df_results["indnum"].isin(comp_inds)].copy()
        df_comp = df_comp.sort_values(["yr", "indnum"])

        df_comp["VA_lag"] = df_comp.groupby("indnum")["VA"].shift(1)
        df_comp["VA_lag"] = df_comp["VA_lag"].fillna(df_comp["VA"])

        df_comp["VA_tot_t"] = df_comp.groupby("yr")["VA"].transform("sum")
        df_comp["VA_tot_tm1"] = df_comp.groupby("yr")["VA_lag"].transform("sum")

        df_comp["s_t"] = df_comp["VA"] / df_comp["VA_tot_t"]
        df_comp["s_tm1"] = df_comp["VA_lag"] / df_comp["VA_tot_tm1"]

        if len(comp_inds) == 1:
            df_tmp = df_comp[["yr", "indnum"] + vars_to_agg].copy()
            df_tmp["indnum"] = int(agg_ind)
            out.append(df_tmp)
        else:
            records = []
            for yr, group in df_comp.groupby("yr"):
                rec = {"yr": yr, "indnum": int(agg_ind)}
                for v in vars_to_agg:
                    rec[v] = np.sum(0.5 * (group["s_t"] + group["s_tm1"]) * group[v])
                records.append(rec)
            out.append(pd.DataFrame(records))

    unmapped_inds = [ind for ind in df_results["indnum"].unique() if ind not in mapped_inds]
    for ind in unmapped_inds:
        df_tmp = df_results[df_results["indnum"] == ind][["yr", "indnum"] + vars_to_agg].copy()
        out.append(df_tmp)

    df_out = pd.concat(out, ignore_index=True)
    df_out = df_out.sort_values(["indnum", "yr"]).reset_index(drop=True)

    return df_out
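
The weighting scheme described in the Notes can be illustrated with a small hand computation. The sketch below uses two hypothetical component industries (57 and 58) with made-up value added and growth rates, and computes the aggregate growth rate for one year only; it is not a substitute for the function above.

```python
import numpy as np
import pandas as pd

# Made-up data for two component industries over two years
toy = pd.DataFrame({
    "yr":      [2000, 2000, 2001, 2001],
    "indnum":  [57, 58, 57, 58],
    "VA":      [30.0, 70.0, 40.0, 60.0],
    "GO_QI_g": [np.nan, np.nan, 0.04, 0.01],
})
toy = toy.sort_values(["indnum", "yr"])

# Current and lagged value-added shares within each year
toy["VA_lag"] = toy.groupby("indnum")["VA"].shift(1)
toy["s_t"] = toy["VA"] / toy.groupby("yr")["VA"].transform("sum")
toy["s_tm1"] = toy["VA_lag"] / toy.groupby("yr")["VA_lag"].transform("sum")

# Tornqvist-weighted aggregate growth for 2001
row = toy[toy["yr"] == 2001]
agg_g = float((0.5 * (row["s_t"] + row["s_tm1"]) * row["GO_QI_g"]).sum())
print(agg_g)
```

Here the averaged weights are 0.35 and 0.65, so the aggregate growth rate is 0.35 * 0.04 + 0.65 * 0.01 = 0.0205.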


<a id="22-define-widely-used-lists--aggregate_groups"></a>
#### 2.2 Define widely used lists

The dictionary *aggregate_groups* maps each aggregated industry code to the detailed industries it combines. We use it to aggregate industries to the same level as the BEA-BLS Industry Production Account Experimental for 1947-1963.

In [6]:
aggregate_groups = {
    2936: list(range(29, 37)),
    3740: list(range(37, 41)),
    4144: list(range(41, 45)),
    4749: list(range(47, 50)),
    5152: list(range(51, 53)),
    5456: list(range(54, 57)),
    5758: list(range(57, 59))
}
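
A quick way to check the mapping is to invert it and count the resulting industries. The sketch below repeats the dictionary so that it runs on its own; the names `detail_to_agg`, `n_detailed`, and `n_broad` are introduced here for illustration only.

```python
aggregate_groups = {
    2936: list(range(29, 37)),
    3740: list(range(37, 41)),
    4144: list(range(41, 45)),
    4749: list(range(47, 50)),
    5152: list(range(51, 53)),
    5456: list(range(54, 57)),
    5758: list(range(57, 59)),
}

# Invert the mapping: detailed industry code -> aggregated industry code
detail_to_agg = {ind: agg for agg, inds in aggregate_groups.items() for ind in inds}

# Detailed industries that are merged are replaced by one code per group
n_detailed = 63
n_broad = n_detailed - len(detail_to_agg) + len(aggregate_groups)
print(len(detail_to_agg), n_broad)
```

The 26 detailed industries covered by the seven groups collapse into seven codes, which takes the 63-industry classification down to 44 industries.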

<a id="3-import-clean-dataset"></a>
# 3. Import clean dataset

This section imports the two data sheets of the clean_data.xlsx file in the Output folder, which is created by the clean.ipynb notebook. The first sheet contains the industry-level data needed to produce our nominal compensation, quantity, and productivity indices for 63 industries from 1963 to 2023. The second sheet contains the industry-level data needed to extend these indices back to 1947 for the coarser set of 44 industries.

In [7]:
df_1963to2023 = pd.read_excel(os.path.join(import_file_path, 'clean_data.xlsx'), 
    sheet_name='Data_63Ind_1963to2023', skiprows=0)
df_1947to2023 = pd.read_excel(os.path.join(import_file_path, 'clean_data.xlsx'), 
    sheet_name='Data_44Ind_1947to2023', skiprows=0)

<a id="4-compute-go-cap-lab-ii-and-hrs-quantity-growth-rates-for-1963-2023-dataset"></a>
# 4. Compute GO, CAP, LAB, II and HRS quantity growth rates for 1963-2023 dataset

<a id="41-compute-the-growth-rate-of-ii-quantity-between-1998-and-2023"></a>
#### 4.1 Compute the growth rate of II quantity between 1998 and 2023

In contrast to the BEA-BLS Integrated Industry Production Accounts 1947-2016 dataset, the BEA-BLS Integrated Industry-Level Production Accounts 1997-2023 dataset does not offer a single II quantity index. Instead, it provides quantity indices for the different types of intermediate inputs (energy, materials, and services). To compute the II quantity index growth rate for 1998-2023, we must therefore aggregate the quantity growth rates of these input types. We opt for Tornqvist aggregation weighted by nominal compensation.

In [8]:
df_1963to2023                  = tornqvist_component_aggregation(df_1963to2023,group_vars=["indnum"],base_vars=["IIEN", "IISERV", "IIMT"],result_col="II_tornq_g")
df_1963to2023["II_QI_logdiff"] = df_1963to2023.groupby("indnum")["II_QI"].transform(lambda x: np.where(x > 0, np.log(x) - np.log(x.shift(1)), np.nan))
df_1963to2023["II_QI_g"]       = df_1963to2023["II_QI_logdiff"].fillna(df_1963to2023["II_tornq_g"])
df_1963to2023.drop(columns=["II_QI_logdiff", "II_tornq_g"], inplace=True)
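
The fallback pattern in this cell can be seen on a toy series. The sketch assumes, for illustration only, that the published aggregate `II_QI` is missing in the later years and that the component-based growth rate is available there; all values are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "yr":         [1996, 1997, 1998, 1999],
    "II_QI":      [0.90, 0.95, np.nan, np.nan],   # assumed: aggregate index missing later
    "II_tornq_g": [np.nan, np.nan, 0.03, 0.02],   # growth aggregated from components
})

# Use the log-difference of the aggregate index where it exists,
# and fall back to the component-based growth rate otherwise
logdiff = np.log(toy["II_QI"]).diff()
toy["II_QI_g"] = logdiff.fillna(toy["II_tornq_g"])
print(toy[["yr", "II_QI_g"]])
```

Because `fillna` only replaces missing values, years with a published aggregate index keep its growth rate and are not overwritten by the component-based series.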

<a id="42-compute-growth-rate-of-go--hrs-quantity-indices"></a>
#### 4.2 Compute growth rate of GO & HRS quantity indices

This amounts to a log-difference of the GO & HRS quantity indices. 

In [9]:
df_1963to2023["GO_QI_g"]  = df_1963to2023.groupby("indnum")["GO_QI"].transform(lambda x: np.log(x) - np.log(x.shift(1)))
df_1963to2023["HRS_QI_g"] = df_1963to2023.groupby("indnum")["HRS_QI"].transform(lambda x: np.log(x) - np.log(x.shift(1)))
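
The grouped log-difference can be illustrated on a two-industry toy panel with made-up index values. Grouping by `indnum` keeps the first observation of each industry missing instead of differencing across industries.

```python
import numpy as np
import pandas as pd

# Made-up quantity indices for two industries, sorted by industry and year
toy = pd.DataFrame({
    "indnum": [1, 1, 2, 2],
    "yr":     [2000, 2001, 2000, 2001],
    "GO_QI":  [1.00, 1.05, 2.00, 2.20],
})

# Log growth within each industry
toy["GO_QI_g"] = toy.groupby("indnum")["GO_QI"].transform(
    lambda x: np.log(x) - np.log(x.shift(1))
)
print(toy)
```

The computation relies on rows being ordered by year within each industry, since `shift(1)` takes the previous row of the group rather than the previous calendar year.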

<a id="43-compute-growth-rate-of-cap--lab-quantities"></a>
#### 4.3 Compute growth rate of CAP & LAB quantities

Neither of the BEA-BLS Integrated Industry Production Accounts datasets provides an aggregate capital quantity index or an aggregate labor quantity index. Instead, they provide the quantity indices and nominal compensation values of various types of capital and labor. We must therefore aggregate the growth rates of the different types of capital and labor to compute aggregate capital and labor quantity growth rates. We opt for Tornqvist aggregation weighted by nominal compensation.

In [10]:
cap_vars      = ["CAPIT", "CAPSOFT", "CAPRD", "CAPART", "CAPOTH"]
df_1963to2023 = tornqvist_component_aggregation(
    df=df_1963to2023,            
    group_vars=["indnum"],   
    base_vars=cap_vars,      
    result_col="CAP_QI_g"    
)

lab_vars      = ["LABCOL", "LABNCOL"]
df_1963to2023 = tornqvist_component_aggregation(
    df=df_1963to2023,           
    group_vars=["indnum"],  
    base_vars=lab_vars,     
    result_col="LAB_QI_g"    
)
df_1963to2023=df_1963to2023[["indnum", "yr", "CAP","II", "GO", "VA", "LAB","II_QI_g", "CAP_QI_g", "LAB_QI_g", "GO_QI_g", "HRS_QI_g"]]

<a id="5-compute-go-cap-lab-ii-and-hrs-quantity-growth-rates-for-1947-2023-dataset"></a>
# 5. Compute GO, CAP, LAB, II and HRS quantity growth rates for 1947-2023 dataset

The 1947-2023 dataset covers 44 industries, which aggregate the 63 industries into a coarser industry classification (see the *aggregate_groups* dictionary).

<a id="51-compute-go-ii--hrs-growth-rates-from-1947-1963"></a>
#### 5.1 Compute GO, II & HRS growth rates from 1947-1963

The 1947-1963 dataset has the quantity indices of GO, II & HRS from 1947 to 1963. The growth rates are the log difference of these quantity indices. 

In [11]:
df_1947to2023["GO_QI_g"]  = df_1947to2023.groupby("indnum")["GO_QI"].transform(lambda x: np.log(x)  - np.log(x.shift(1)))
df_1947to2023["II_QI_g"]  = df_1947to2023.groupby("indnum")["II_QI"].transform(lambda x: np.log(x)  - np.log(x.shift(1)))
df_1947to2023["HRS_QI_g"] = df_1947to2023.groupby("indnum")["HRS_QI"].transform(lambda x: np.log(x) - np.log(x.shift(1)))

<a id="52-compute-growth-rate-of-cap--lab-quantities"></a>
#### 5.2 Compute growth rate of CAP & LAB quantities

Neither of the BEA-BLS Integrated Industry Production Accounts datasets provides an aggregate capital quantity index or an aggregate labor quantity index. Instead, they provide the quantity indices and nominal compensation values of various *types* of capital and labor for the 44 industries between 1947 and 1963. We must therefore aggregate the growth rates of the different types of capital and labor to compute aggregate capital and labor quantity growth rates. We opt for Tornqvist aggregation weighted by nominal compensation.

In [12]:
cap_vars      = ["CAPIT", "CAPSOFT", "CAPRD", "CAPART", "CAPOTH"]
df_1947to2023 = tornqvist_component_aggregation(
    df=df_1947to2023,           
    group_vars=["indnum"],  
    base_vars=cap_vars,     
    result_col="CAP_QI_g"   
)

lab_vars      = ["LABCOL", "LABNCOL"]
df_1947to2023 = tornqvist_component_aggregation(
    df=df_1947to2023,        
    group_vars=["indnum"],  
    base_vars=lab_vars,   
    result_col="LAB_QI_g"  
)
df_1947to2023 = df_1947to2023[["indnum","yr","GO", "VA", "CAP", "II", "LAB", "GO_QI_g", "CAP_QI_g", "II_QI_g", "LAB_QI_g", "HRS_QI_g"]]

<a id="53-extend-the-growth-rates-to-2023-using-1963-2023-dataset"></a>
#### 5.3 Extend the growth rates to 2023 using 1963-2023 dataset

To extend the quantity growth rates of GO, CAP, LAB, II, and HRS beyond 1963, we draw on the 1963-2023 dataset. As described in Section 4, this dataset provides growth rates for 63 detailed industries over 1963-2023. To construct consistent series for the 44 broader industries, we aggregate the detailed growth rates to the broader level of industry aggregation, using Tornqvist weights based on nominal value added. We then merge the aggregated growth rates into the 1947-2023 dataset, filling only the years for which it has no growth rate of its own.

In [13]:
vars_to_agg                          = ["CAP_QI_g", "II_QI_g", "LAB_QI_g", "GO_QI_g", "HRS_QI_g"]
df_agg                               = tornqvist_industry_aggregation(df_1963to2023, aggregate_groups, vars_to_agg)
df_1947to2023_merged                 = df_1947to2023.merge(df_agg,on=["yr", "indnum"],how="left",suffixes=("", "_agg"))
vars_to_fill = ["CAP_QI_g", "II_QI_g", "LAB_QI_g", "GO_QI_g", "HRS_QI_g"]
for var in vars_to_fill:
    df_1947to2023_merged[var]        = df_1947to2023_merged[var].fillna(df_1947to2023_merged[var + "_agg"])
    df_1947to2023_merged.drop(columns=[var + "_agg"], inplace=True)
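
The merge-and-fill step can be sketched on a single aggregated industry. The frames `left` and `right` and their values are hypothetical stand-ins for the historical series and the aggregated 1963-2023 series.

```python
import numpy as np
import pandas as pd

# Historical series: growth rates available up to 1963 only (made-up values)
left = pd.DataFrame({
    "yr":      [1962, 1963, 1964],
    "indnum":  [2936, 2936, 2936],
    "GO_QI_g": [0.010, 0.020, np.nan],
})

# Growth rates aggregated from the detailed industries (made-up values)
right = pd.DataFrame({
    "yr":      [1963, 1964],
    "indnum":  [2936, 2936],
    "GO_QI_g": [0.000, 0.035],
})

# Left merge keeps every historical row; the suffix marks the aggregated column
merged = left.merge(right, on=["yr", "indnum"], how="left", suffixes=("", "_agg"))
merged["GO_QI_g"] = merged["GO_QI_g"].fillna(merged["GO_QI_g_agg"])
merged = merged.drop(columns="GO_QI_g_agg")
print(merged)
```

The 1963 value from the historical series is kept, and only the missing 1964 value is taken from the aggregated series.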

<a id="7-compute-productivity-indices"></a>
# 7. Compute productivity indices

This section computes the value added quantity, TFP and labor productivity growth rates as per the [manual.pdf](https://github.com/selbadri/Measuring-US-Industry-Level-Productivity-1947-to-2023/blob/main/manual.pdf). Each computation will be done twice - once for the 1947-2023 dataset containing data on the 44 broad industries and once for the 1963-2023 dataset containing data on the 63 industries. 

<a id="71-compute-value-added-quantity-growth-rate"></a>
#### 7.1 Compute value-added quantity growth rate

To compute a value added quantity index, we start from the definition of a Tornqvist Quantity Index for total output $Y$

$$
\Delta \ln Q_Y (t) = \bar{\nu}_{V A} (t) \Delta \ln Q_{VA}(t)   + \bar{\nu}_{II} (t) \Delta \ln Q_{II} (t),
$$

where

$$
\Delta \ln Q_{j}(t) = \ln Q_{j}(t) - \ln Q_{j}(t-1),
$$

and

$$
\bar{\nu}_{j}(t) = 0.5 \times \left( \frac{P_{j}(t) Q_{j}(t)}{P_Y (t) Q_Y (t)} + \frac{P_{j}(t - 1) Q_{j}(t - 1)}{P_Y (t - 1) Q_Y (t - 1)} \right)
$$

representing the Tornqvist weight for the component $j \in \left\lbrace VA,II \right\rbrace$ of $GO$.

Rearranging terms, we get:

$$
\Delta \ln Q_{VA}(t) = \frac{\Delta \ln Q_Y (t) - \bar{\nu}_{II} (t) \Delta \ln Q_{II} (t)}{\bar{\nu}_{V A} (t)}
$$

In [14]:
for df in [df_1963to2023, df_1947to2023_merged]:
    df["IIGO_SHARE_curr"] = df["II"] / df["GO"]
    df["VAGO_SHARE_curr"] = df["VA"] / df["GO"]
    df["IIGO_SHARE_lag"]  = df.groupby("indnum")["IIGO_SHARE_curr"].shift(1)
    df["VAGO_SHARE_lag"]  = df.groupby("indnum")["VAGO_SHARE_curr"].shift(1)
    df["IIGO_SHARE"]      = 0.5 * (df["IIGO_SHARE_curr"] + df["IIGO_SHARE_lag"])
    df["VAGO_SHARE"]      = 0.5 * (df["VAGO_SHARE_curr"] + df["VAGO_SHARE_lag"])
    df["VA_QI_g"]         = (df["GO_QI_g"] - df["IIGO_SHARE"] * df["II_QI_g"]) / df["VAGO_SHARE"]
    df.drop(columns       =["IIGO_SHARE_curr", "VAGO_SHARE_curr",
                            "IIGO_SHARE_lag", "VAGO_SHARE_lag",
                            "IIGO_SHARE", "VAGO_SHARE"], inplace=True)
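
A small numerical example may help fix ideas. The numbers below are made up, and the sketch assumes that nominal VA and II add up to nominal GO, so that the two Tornqvist weights sum to one; the notebook itself computes both shares directly from the data.

```python
# Made-up growth rates (log points)
go_g = 0.03   # gross output quantity growth
ii_g = 0.02   # intermediate inputs quantity growth

# Made-up II shares of GO in the current and previous year
ii_share_t, ii_share_tm1 = 0.62, 0.58
nu_ii = 0.5 * (ii_share_t + ii_share_tm1)   # averaged II weight (0.60)
nu_va = 1.0 - nu_ii                         # assumption: VA + II = GO

# Implied value-added quantity growth
va_g = (go_g - nu_ii * ii_g) / nu_va
print(round(va_g, 4))
```

With these inputs the value-added quantity grows by about 4.5 log points, and plugging `va_g` back into the Tornqvist index for gross output recovers `go_g`.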

<a id="72-compute-labor-productivity-va"></a>
#### 7.2 Compute labor productivity (VA)

We follow KLEMS and use total hours as the measure of labor input. Therefore:

$$
\Delta \ln LP_{VA}(t) = \Delta \ln Q_{VA}(t) - \Delta \ln Q_{HRS}(t),
$$

where $Q_{HRS}$ is the total hours quantity index.

In [15]:
for df in [df_1963to2023, df_1947to2023_merged]:
    df["LPVA_g"] = df["VA_QI_g"]-df["HRS_QI_g"]
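
Because labor productivity growth is a difference of log growth rates, the chained labor productivity index equals the ratio of the chained output and hours indices. The following sketch checks this on made-up growth rates, with all indices normalized to 1 in the first year.

```python
import numpy as np

# Made-up log growth rates; the first entry is the base year
va_g = np.array([0.00, 0.03, 0.02])
hrs_g = np.array([0.00, 0.01, -0.01])

# Labor productivity growth is output growth minus hours growth
lp_g = va_g - hrs_g

# Chain each growth series into an index with base value 1
va_idx = np.exp(np.cumsum(va_g))
hrs_idx = np.exp(np.cumsum(hrs_g))
lp_idx = np.exp(np.cumsum(lp_g))

print(np.allclose(lp_idx, va_idx / hrs_idx))
```

The same identity applies to the gross-output-based measure when the value-added index is replaced by the gross output index.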

<a id="73-compute-labor-productivity-go"></a>
#### 7.3 Compute labor productivity (GO)

We follow KLEMS and use total hours as the measure of labor input. Therefore:

$$
\Delta \ln LP_{GO}(t) = \Delta \ln Q_{GO}(t) - \Delta \ln Q_{HRS}(t),
$$

where $Q_{HRS}$ is the total hours quantity index.

In [16]:
for df in [df_1963to2023, df_1947to2023_merged]:
    df["LPGO_g"] = df["GO_QI_g"]-df["HRS_QI_g"]

<a id="74-compute-total-factor-productivity-go"></a>
#### 7.4 Compute total factor productivity (GO)

Start with the Tornqvist Index for GO
$$
\Delta \ln Q_{GO}(t) = \Delta \ln TFP_{GO}(t) + \bar{\psi}_L(t) \Delta \ln Q_L(t) + \bar{\psi}_K(t) \Delta \ln Q_K(t) + \bar{\psi}_{II}(t) \Delta \ln Q_{II}(t)
$$

Re-arranging terms, we get
$$
\Delta \ln TFP_{GO}(t) = \Delta \ln Q_{GO}(t) - \bar{\psi}_L(t) \Delta \ln Q_L(t) - \bar{\psi}_K(t) \Delta \ln Q_K(t) -  \bar{\psi}_{II}(t) \Delta \ln Q_{II}(t),
$$

where

$$
\bar{\psi}_{j}(t) = 0.5 \times \left( \frac{P_{j}(t) Q_{j}(t)}{P_{GO}(t) Q_{GO}(t)} + \frac{P_{j}(t-1) Q_{j}(t-1)}{P_{GO}(t-1) Q_{GO}(t-1)} \right),
$$

for $j \in \left\lbrace L, K, II \right\rbrace$, where $P_{j} Q_{j}$ is nominal labor compensation (LAB, $L$), nominal capital compensation (CAP, $K$), or nominal intermediate inputs ($II$).

In [17]:
for df in [df_1963to2023, df_1947to2023_merged]:
    df["CAPGO_SHARE_curr"] = df["CAP"] / df["GO"]
    df["LABGO_SHARE_curr"] = df["LAB"] / df["GO"]
    df["IIGO_SHARE_curr"] = df["II"] / df["GO"]

    df["CAPGO_SHARE_lag"] = df.groupby("indnum")["CAPGO_SHARE_curr"].shift(1)
    df["LABGO_SHARE_lag"] = df.groupby("indnum")["LABGO_SHARE_curr"].shift(1)
    df["IIGO_SHARE_lag"] = df.groupby("indnum")["IIGO_SHARE_curr"].shift(1)

    df["CAP_SHARE"] = 0.5 * (df["CAPGO_SHARE_curr"] + df["CAPGO_SHARE_lag"])
    df["LAB_SHARE"] = 0.5 * (df["LABGO_SHARE_curr"] + df["LABGO_SHARE_lag"])
    df["II_SHARE"] = 0.5 * (df["IIGO_SHARE_curr"] + df["IIGO_SHARE_lag"])

    df["TFPGO_g"] = (df["GO_QI_g"] - df["II_SHARE"]*df["II_QI_g"]
                        - df["CAP_SHARE"]*df["CAP_QI_g"]
                        - df["LAB_SHARE"]*df["LAB_QI_g"])
    
    df.drop(columns=["II_SHARE", "CAP_SHARE", "LAB_SHARE",
                     "CAPGO_SHARE_curr", "LABGO_SHARE_curr", "IIGO_SHARE_curr",
                     "CAPGO_SHARE_lag", "LABGO_SHARE_lag", "IIGO_SHARE_lag"],
            inplace=True)
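
The gross-output TFP residual can be traced through with a few made-up numbers. The helper `avg_share` and all values below are hypothetical and only illustrate the arithmetic of the cell above.

```python
def avg_share(curr, lag):
    """Average of the current and lagged nominal share (Tornqvist weight)."""
    return 0.5 * (curr + lag)

# Made-up input shares of nominal gross output in years t and t-1
psi_l = avg_share(0.31, 0.29)    # labor
psi_k = avg_share(0.19, 0.21)    # capital
psi_ii = avg_share(0.50, 0.50)   # intermediate inputs

# Made-up quantity growth rates (log points)
go_g, lab_g, cap_g, ii_g = 0.03, 0.01, 0.04, 0.02

# TFP growth is output growth not accounted for by share-weighted input growth
tfp_go_g = go_g - psi_l * lab_g - psi_k * cap_g - psi_ii * ii_g
print(round(tfp_go_g, 4))
```

Input growth accounts for 0.003 + 0.008 + 0.010 = 0.021 of the 0.03 growth in gross output, which leaves a TFP residual of about 0.009.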

<a id="75-compute-total-factor-productivity-va"></a>
#### 7.5 Compute total factor productivity (VA)

Start with the Tornqvist Index for VA
$$
\Delta \ln Q_{VA}(t) = \Delta \ln TFP_{VA}(t) + \bar{\psi}_L(t) \Delta \ln Q_L(t) + \bar{\psi}_K(t) \Delta \ln Q_K(t).
$$

Re-arranging terms, we get
$$
\Delta \ln TFP_{VA}(t) = \Delta \ln Q_{VA}(t) - \bar{\psi}_L(t) \Delta \ln Q_L(t) - \bar{\psi}_K(t) \Delta \ln Q_K(t),
$$

where

$$
\bar{\psi}_{j}(t) = 0.5 \times \left( \frac{P_{j}(t) Q_{j}(t)}{P_{VA}(t) Q_{VA}(t)} + \frac{P_{j}(t-1) Q_{j}(t-1)}{P_{VA}(t-1) Q_{VA}(t-1)} \right),
$$

for $j \in \left\lbrace L, K \right\rbrace$, where $P_{j} Q_{j}$ is nominal labor compensation (LAB, $L$) or nominal capital compensation (CAP, $K$).

In [18]:
for df in [df_1963to2023, df_1947to2023_merged]:
    df["CAPVA_SHARE_curr"] = df["CAP"] / df["VA"]
    df["LABVA_SHARE_curr"] = df["LAB"] / df["VA"]

    df["CAPVA_SHARE_lag"] = df.groupby("indnum")["CAPVA_SHARE_curr"].shift(1)
    df["LABVA_SHARE_lag"] = df.groupby("indnum")["LABVA_SHARE_curr"].shift(1)

    df["CAP_SHARE"] = 0.5 * (df["CAPVA_SHARE_curr"] + df["CAPVA_SHARE_lag"])
    df["LAB_SHARE"] = 0.5 * (df["LABVA_SHARE_curr"] + df["LABVA_SHARE_lag"])

    df["TFPVA_g"] = (df["VA_QI_g"] - df["CAP_SHARE"]*df["CAP_QI_g"] - df["LAB_SHARE"]*df["LAB_QI_g"])

    df.drop(columns=["CAP_SHARE", "LAB_SHARE",
                     "CAPVA_SHARE_curr", "LABVA_SHARE_curr",
                     "CAPVA_SHARE_lag", "LABVA_SHARE_lag"],
            inplace=True)
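
As a consistency check, the two TFP measures can be compared in a stylized case. The sketch assumes made-up nominal values with VA = LAB + CAP and GO = VA + II, and nominal shares that are constant between the two years; under these assumptions, value-added TFP growth scaled by the VA share of GO equals gross-output TFP growth. With time-varying shares the relation holds only approximately.

```python
# Made-up nominal values, assumed identical in years t and t-1
lab, cap, ii = 30.0, 20.0, 50.0
va = lab + cap          # assumption: VA = LAB + CAP
go = va + ii            # assumption: GO = VA + II

# Made-up quantity growth rates (log points)
go_g, lab_g, cap_g, ii_g = 0.03, 0.01, 0.04, 0.02

# Value-added quantity growth implied by the GO Tornqvist index
va_g = (go_g - (ii / go) * ii_g) / (va / go)

# TFP growth on a value-added and on a gross-output basis
tfp_va_g = va_g - (lab / va) * lab_g - (cap / va) * cap_g
tfp_go_g = go_g - (lab / go) * lab_g - (cap / go) * cap_g - (ii / go) * ii_g

print(round(tfp_va_g, 4), round(tfp_go_g, 4))
```

In this example value-added TFP growth is about 0.018, twice the gross-output figure of about 0.009, because value added is half of gross output.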

<a id="76-chain-growth-rates-into-indices"></a>
#### 7.6 Chain growth rates into indices

In [19]:
growth_vars1          = ["CAP_QI_g", "LAB_QI_g", "GO_QI_g", "HRS_QI_g", "II_QI_g", "VA_QI_g", "LPGO_g", "LPVA_g", "TFPGO_g", "TFPVA_g"]
df_1947to2023_merged  = growth_to_index(df_1947to2023_merged, group_var="indnum", growth_vars=growth_vars1, base_year=1947)
df_1963to2023         = growth_to_index(df_1963to2023, group_var="indnum", growth_vars=growth_vars1, base_year=1963)

<a id="8-export-final-dataset"></a>
# 8. Export final dataset

<a id="81-format-both-datasets"></a>
#### 8.1 Format both datasets

In [20]:
df_1963to2023        = df_1963to2023[["indnum", "yr", "GO", "CAP", "LAB", "II", "VA","GO_QI","CAP_QI", "LAB_QI", "II_QI", "HRS_QI", "VA_QI", "TFPGO", "TFPVA", "LPGO", "LPVA"]]
df_1947to2023_merged = df_1947to2023_merged[["indnum", "yr", "GO", "CAP", "LAB", "II", "VA","GO_QI","CAP_QI", "LAB_QI", "II_QI","HRS_QI", "VA_QI", "TFPGO", "TFPVA", "LPGO", "LPVA"]]

def format_dataframe_for_export(df):
    """Format dataframe with specific decimal places for different variable groups"""
    df_formatted = df.copy()
    two_decimal_vars = ['GO', 'VA', 'CAP', 'LAB', 'II']
    four_decimal_vars = [col for col in df.columns if col not in ['indnum', 'yr'] + two_decimal_vars]
    for col in two_decimal_vars:
        if col in df_formatted.columns:
            df_formatted[col] = df_formatted[col].round(2)
    for col in four_decimal_vars:
        if col in df_formatted.columns:
            df_formatted[col] = df_formatted[col].round(4)
    return df_formatted

df_1963to2023        = format_dataframe_for_export(df_1963to2023)
df_1947to2023_merged = format_dataframe_for_export(df_1947to2023_merged)

df_1947to2023_merged = df_1947to2023_merged.rename(columns={"indnum": "industry_id", "yr": "year"})
df_1963to2023        = df_1963to2023.rename(columns={"indnum": "industry_id", "yr": "year"})

<a id="82-create-variable-definitions-and-industry-key"></a>
#### 8.2 Create variable definitions and industry key

In [21]:
#variable definitions
var_defs = pd.DataFrame({
    "Variable" : ['industry_id','year', 'GO', 'VA', 'CAP', 'LAB', 'II',
                  'GO_QI','CAP_QI', 'LAB_QI','II_QI','HRS_QI', 'VA_QI',
                  'TFPGO','TFPVA',
                  'LPGO', 'LPVA'],
    "Description": [
        "Industry Identifier",                                  
        "Year",                                                 
        "Nominal Gross Output",                                 
        "Nominal Value Added",                                  
        "Nominal Capital Compensation",                         
        "Nominal Labor Compensation",                           
        "Nominal Intermediate Inputs", 
        "Gross Output Quantity Index",            
        "Capital Quantity Index",                
        "Labor Quantity Index",                      
        "Intermediate Inputs Quantity Index",  
        "Total Hours Index",
        "Value Added Quantity Index",        
        "Total Factor Productivity Index (GO)",   
        "Total Factor Productivity Index (VA)", 
        "Labor Productivity Index (GO)",         
        "Labor Productivity Index (VA)",    
    ]
})

#1947-2023 dataset industry list
industry_44 = pd.DataFrame({
    "industry_id": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,2936,3740,4144,45,46,4749,50,5152,53,5456,5758,59,60,61,62,63],
    "Industry Description": [
        "Farms", "Forestry, fishing, and related activities", "Oil and gas extraction", "Mining, except oil and gas",
        "Support activities for mining", "Utilities", "Construction", "Wood products", "Nonmetallic mineral products",
        "Primary metals", "Fabricated metal products", "Machinery", "Computer and electronic products",
        "Electrical equipment, appliances, and components", "Motor vehicles, bodies and trailers, and parts",
        "Other transportation equipment", "Furniture and related products", "Miscellaneous manufacturing",
        "Food and beverage and tobacco products", "Textile mills and textile product mills",
        "Apparel and leather and allied products", "Paper products", "Printing and related support activities",
        "Petroleum and coal products", "Chemical products", "Plastics and rubber products", "Wholesale trade",
        "Retail trade", "Transportation and warehousing", "Information", "Finance and insurance", "Real estate",
        "Rental and leasing services and lessors of intangible assets", "Professional, scientific, and technical services",
        "Management of companies and enterprises", "Administrative and waste management services", "Educational services",
        "Health care and social assistance", "Arts, entertainment, and recreation", "Accommodation",
        "Food services and drinking places", "Other services, except government", "Federal", "State and local"
    ]
})

#1963-2023 dataset industry list
industry_63 = pd.DataFrame({
    "industry_id": list(range(1, 64)),
    "Industry Description": [
        "Farms", "Forestry, fishing, and related activities", "Oil and gas extraction", "Mining, except oil and gas",
        "Support activities for mining", "Utilities", "Construction", "Wood products", "Nonmetallic mineral products",
        "Primary metals", "Fabricated metal products", "Machinery", "Computer and electronic products",
        "Electrical equipment, appliances, and components", "Motor vehicles, bodies and trailers, and parts",
        "Other transportation equipment", "Furniture and related products", "Miscellaneous manufacturing",
        "Food and beverage and tobacco products", "Textile mills and textile product mills",
        "Apparel and leather and allied products", "Paper products", "Printing and related support activities",
        "Petroleum and coal products", "Chemical products", "Plastics and rubber products", "Wholesale trade",
        "Retail trade", "Air transportation", "Rail transportation", "Water transportation", "Truck transportation",
        "Transit and ground passenger transportation", "Pipeline transportation", "Other transportation and support activities",
        "Warehousing and storage", "Publishing industries, except internet (includes software)",
        "Motion picture and sound recording industries", "Broadcasting and telecommunications",
        "Data processing, internet publishing, and other information services",
        "Federal Reserve banks, credit intermediation, and related activities", "Securities, commodity contracts, and investments",
        "Insurance carriers and related activities", "Funds, trusts, and other financial vehicles", "Real estate",
        "Rental and leasing services and lessors of intangible assets", "Legal services", "Computer systems design and related services",
        "Miscellaneous professional, scientific, and technical services", "Management of companies and enterprises",
        "Administrative and support services", "Waste management and remediation services", "Educational services",
        "Ambulatory health care services", "Hospitals and Nursing and residential care", "Social assistance",
        "Performing arts, spectator sports, museums, and related activities", "Amusements, gambling, and recreation industries",
        "Accommodation", "Food services and drinking places", "Other services, except government", "Federal", "State and local"
    ]
})

<a id="83-export-as-an-excel-to-specified-directory"></a>
#### 8.3 Export as an excel to specified directory

In [22]:
os.makedirs(export_file_path, exist_ok=True)
output_file   = os.path.join(export_file_path, "EV_production_accounts_1947to2023.xlsx")

with pd.ExcelWriter(output_file, engine="xlsxwriter") as writer:
    workbook                   = writer.book
    worksheet                  = workbook.add_worksheet("Info")
    writer.sheets["Info"] = worksheet
    var_defs.to_excel(writer, sheet_name="Info", index=False, startrow=0, startcol=0)
    industry_44.to_excel(writer, sheet_name="Info", index=False, startrow=0, startcol=len(var_defs.columns)+2)
    industry_63.to_excel(writer, sheet_name="Info", index=False,
                         startrow=0, startcol=len(var_defs.columns) + 2 + len(industry_44.columns) + 2)
    df_1963to2023.to_excel(writer, sheet_name="Data_63Ind_1963to2023", index=False)
    df_1947to2023_merged.to_excel(writer, sheet_name="Data_44Ind_1947to2023", index=False)