## **Exploratory analysis of dynamism from the BSD**

**Research objectives**
- RQ1: How has the composition of UK firms evolved over the past decade according to the BSD?
- RQ2: To what extent has the rate of creative destruction in the UK declined between 1997 and 2023? 
- RQ3: How have gaps between the most productive ‘frontier’ firms and ‘laggard’ firms evolved? 
- RQ4: How are changes in business dynamism and productivity dispersion related?

**Data source**: Business Structure Database (1998-2023). Aggregated data tables have been exported from the UK Data Service SecureLab.

### Executive Summary
- The business population has grown.
- Entry and exit rates have remained relatively stable. Entry was particularly strong between 2012 and 2018.

In [14]:
# Import packages and set filepaths
import pandas as pd
import numpy as np
import altair as alt
from pathlib import Path
from pandas.api.types import CategoricalDtype
import os


In [23]:
script_dir = Path.cwd()
import_path = script_dir.parent / "Data"

In [57]:
whole_economy_df = pd.read_excel(import_path / 'BSD/business_dynamism_BSD_1997_2023.xlsx', sheet_name='whole_economy')
firm_size_df = pd.read_excel(import_path / 'BSD/business_dynamism_BSD_1997_2023.xlsx', sheet_name='firm_size')
firm_age_df = pd.read_excel(import_path / 'BSD/business_dynamism_BSD_1997_2023.xlsx', sheet_name='firm_age')
industry_df = pd.read_excel(import_path / 'BSD/business_dynamism_BSD_1997_2023.xlsx', sheet_name='industry')
region_df = pd.read_excel(import_path / 'BSD/business_dynamism_BSD_1997_2023.xlsx', sheet_name='region')

#### **Summary of data tables**
##### *Table 1 - Population and job flows*
This table provides information on the business population and job flows in each year. The index is the year. The following dimensions are provided:
- Total
- Firm size (employment)
- Firm age
- Sector
- Region
- Within-industry productivity decile

|Year|Dimension|Category|Number of firms|Employment|Turnover|Entrants|Exits|JC|JD|Multi-site firms|Multi-site emp|Site expansion|Site contraction|
|----|---------|--------|---------------|----------|--------|--------|-----|--|--|----------------|--------------|--------------|----------------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|

##### *Table 2 - Cohort analysis*
This table follows the cohort of firms that start in each year and tracks the whole cohort as it ages. The following dimensions are provided:
- Total
- Sector
- Region
- Firm size (employment)

|Cohort|Age|Dimension|Category|Number of firms|Avg size|Survival rate|KM rate|Share of employment|Share of turnover|High growth firms|Stagnant firms|
|------|---|---------|--------|---------------|--------|-------------|-------|-------------------|-----------------|-----------------|--------------|
|2000|0|Total|All|
|2000|1|Total|All|
|2000|2|Total|All|
|2000|3|Total|All|
|2000|4|Total|All|

##### *Table 3 - Growth rates*

|Year|Dimension|Category|Number of firms|Employment|Turnover|Entrants|Exits|JC|JD|Multi-site firms|Multi-site emp|Site expansion|Site contraction|
|----|---------|--------|---------------|----------|--------|--------|-----|--|--|----------------|--------------|--------------|----------------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|

##### *Table 4 - Productivity dispersion*

|Year|Dimension|Category|Number of firms|P10_Prod|P25_Prod|P50_Prod|Mean_Prod|P75_Prod|P90_Prod|SD_Prod|
|----|---------|--------|---------------|--------|--------|--------|---------|--------|--------|-------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|

In [None]:
#-------------------
#  Load data tables
#--------------------

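# Placeholders: the sources for these tables have not been filled in yet,
# so the assignments are commented out to keep the cell syntactically valid.
# population_df = 
# cohort_df = 
# growth_df =
# prod_df = 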

In [58]:
# OG RATE FUNCTION

# Write function to calculate rates for dynamism measures, apply this across dataframes
def calculate_dynamism_rates(df, group_by_cols=None):
    # Make a copy to avoid modifying the original
    df = df.copy()
    
    # Sort data
    sort_cols = group_by_cols + ['year'] if group_by_cols else ['year']
    df = df.sort_values(sort_cols)
    
    # Create lagged employment (with or without grouping)
    if group_by_cols is None:
        df['total_employment_lagged'] = df['employment'].shift(1)
    else:
        df['total_employment_lagged'] = df.groupby(group_by_cols)['employment'].shift(1)
    
    # Calculate rates (same regardless of grouping)
    df['Entry rate'] = (df['n_entrants'] + df['n_entry_and_exit']) / df['n_firms']
    df['Exit rate'] = (df['n_exiters'] + df['n_entry_and_exit']) / df['n_firms']
    df['Job creation rate'] = (df['jc_incumbents'] + df['jc_entrants']) / df['total_employment_lagged']
    df['Job destruction rate'] = (df['jd_incumbents'] + df['jd_exiters']) / df['total_employment_lagged']
    df['Entry job creation rate'] = (df['jc_entrants']) / df['total_employment_lagged']
    df['Incumbent job creation rate'] = (df['jc_incumbents']) / df['total_employment_lagged']
    df['Exit job destruction rate'] = (df['jd_exiters']) / df['total_employment_lagged']
    df['Incumbent job destruction rate'] = (df['jd_incumbents']) / df['total_employment_lagged']


    # We can't use the first/last year for dynamic variables because there are no backward/forward-looking observations
    years = df['year'].unique()
    df = df[~df['year'].isin([years.min(), years.max()])]

    return df

# Apply function to dataframes
test_rates = calculate_dynamism_rates(test_file)
firm_size_dynamism = calculate_dynamism_rates(firm_size_df, group_by_cols=['emp_sizeband'])
firm_age_dynamism = calculate_dynamism_rates(firm_age_df, group_by_cols=['age_group'])
industry_dynamism = calculate_dynamism_rates(industry_df, group_by_cols=['industry_name'])
region_dynamism = calculate_dynamism_rates(region_df, group_by_cols=['region'])

In [52]:
#-------------------------
# Create rate columns
#---------------------------

def calculate_dynamism_rates(df):
    # Make a copy to avoid modifying the original
    df = df.copy()
    
    # Sort data
    sort_cols = ['category','year']
    df = df.sort_values(sort_cols)
    
    # Create lagged employment within each category, so that the lag never crosses categories
    df['total_employment_lagged'] = df.groupby('category')['employment'].shift(1)
    
    # Calculate rates (same regardless of grouping)
    df['Entry rate'] = (df['n_entrants'] + df['n_entry_and_exit']) / df['n_firms']
    df['Exit rate'] = (df['n_exiters'] + df['n_entry_and_exit']) / df['n_firms']
    df['Job creation rate'] = (df['jc_incumbents'] + df['jc_entrants']) / df['total_employment_lagged']
    df['Job destruction rate'] = (df['jd_incumbents'] + df['jd_exiters']) / df['total_employment_lagged']
    df['Entry job creation rate'] = (df['jc_entrants']) / df['total_employment_lagged']
    df['Incumbent job creation rate'] = (df['jc_incumbents']) / df['total_employment_lagged']
    df['Exit job destruction rate'] = (df['jd_exiters']) / df['total_employment_lagged']
    df['Incumbent job destruction rate'] = (df['jd_incumbents']) / df['total_employment_lagged']


    # We can't use the first/last year for dynamic variables because there are no backward/forward-looking observations
    years = df['year'].unique()
    df = df[~df['year'].isin([years.min(), years.max()])]

    return df



<details>
<summary> View data preprocessing code</summary>

hi

</details>

#### **1. The composition of the UK business population**

First, we want to assess what types of firms make up the business population in 2023: large or small, young or old. Which types of firms contribute the most to economic activity?

How has this changed over the last 20 years? Can we learn anything about structural change in the economy?

**Overall section findings**

In [None]:
# BSD facts - how has the total number of firms, employment and turnover changed over time?

total_population_df = population_df[population_df['dimension']=='Total']  # lowercase column name, as in the later cells

n_firm_chart = alt.Chart(total_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('n_firms:Q',title='Total number of firms in BSD', scale=alt.Scale(domainMin=1500000, domainMax=2500000),axis=alt.Axis(format=".2s"))
)

emp_chart = alt.Chart(total_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('employment:Q',title='Total employment in BSD', scale=alt.Scale(domainMin=15000000, domainMax=22000000), axis=alt.Axis(format=".2s"))
)

turnover_chart = alt.Chart(total_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('turnover:Q',title='Total turnover in BSD', axis=alt.Axis(format=".2s"))  # axis domain left at Altair's default until bounds are chosen
)

productivity_chart = alt.Chart(total_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('turnover_per_employee:Q',title='Average turnover per employee in BSD', scale=alt.Scale(domainMin=0), axis=alt.Axis(format=".2s"))  # upper bound not yet chosen
)

basic_facts_chart = n_firm_chart | emp_chart | turnover_chart | productivity_chart

**Key Findings**
- The business population has expanded over the last 20 years, with substantial growth taking place between 2011 and 2018.

**Questions to explore**

### **Assessing the decline in business dynamism**
- Entry and exit rates
- Survival rates
- Growth rates
- Job reallocation rates



#### **2. Entry and exit rates**

This section examines firm entry and exit dynamics over time. Entry rates measure the flow of new firms into the market relative to the total population, while exit rates capture firms leaving the market. These metrics reveal the intensity of business turnover and provide insights into entrepreneurial activity, market competitiveness, and structural changes in the business environment.

**Headline findings**
- Entry and exit rates have remained relatively stable; unlike in the US, there is no prominent decline.

##### **2.1 Aggregate entry/exit rates**
- Annual birth and death rates (firm counts)
- Entry/exit rates weighted by employment
- Net entry rates and churn rates
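
As a rough illustration of the measures listed above, the sketch below derives entry, exit, net entry and churn rates from firm counts. The column names `n_firms`, `n_entrants` and `n_exiters` follow those used in this notebook; the helper `add_entry_exit_measures`, the output column names and the toy figures are made up for illustration and are not BSD values. For simplicity it leaves out the `n_entry_and_exit` term that the notebook's own rate function adds to both numerators.

```python
import pandas as pd

def add_entry_exit_measures(df):
    """Add entry, exit, net entry and churn rates (hypothetical helper)."""
    out = df.copy()
    out["entry_rate"] = out["n_entrants"] / out["n_firms"]
    out["exit_rate"] = out["n_exiters"] / out["n_firms"]
    # Net entry: how much entry exceeds exit
    out["net_entry_rate"] = out["entry_rate"] - out["exit_rate"]
    # Churn: total turnover of firms in either direction
    out["churn_rate"] = out["entry_rate"] + out["exit_rate"]
    return out

# Toy counts, not BSD data
toy = pd.DataFrame({
    "year": [2001, 2002],
    "n_firms": [1000, 1100],
    "n_entrants": [150, 110],
    "n_exiters": [100, 132],
})
toy_rates = add_entry_exit_measures(toy)
print(toy_rates[["year", "entry_rate", "exit_rate", "net_entry_rate", "churn_rate"]])
```

A stable churn rate can hide offsetting movements in entry and exit, which is one reason to look at net entry alongside it.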

In [63]:
# AGGREGATE: entry and exit rates

# Process entry and exit rates from dataframe
#total_population_df = population_df[population_df['dimension'] == 'Total']

#total_entry_exit_df = total_population_df.melt(id_vars='year',value_vars=['Entry rate','Exit rate'])

# Create chart of entry and exit rates in notebook
entry_exit_df = test_rates.melt(id_vars='year',value_vars=['Entry rate','Exit rate'])

entry_exit_chart = alt.Chart(entry_exit_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N')
)

display(entry_exit_chart)

# Create table of entry and exit rates in notebook
entry_exit_df = test_rates[['year','Entry rate','Exit rate','n_entrants','n_exiters']]

total_entry_exit_table = entry_exit_df.style.format({
    'Entry rate': '{:.2%}',
    'Exit rate': '{:.2%}',
    'n_entrants': '{:,.0f}',
    'n_exiters': '{:,.0f}'
}).background_gradient(cmap='YlOrRd', subset=['Entry rate', 'Exit rate'])

display(total_entry_exit_table)


| |year|Entry rate|Exit rate|n_entrants|n_exiters|
|-|----|----------|---------|----------|---------|
|1|1998|15.34%|11.34%|234733|160782|
|2|1999|12.57%|13.92%|187154|212458|
|3|2000|13.19%|11.92%|199710|176270|
|4|2001|12.90%|12.92%|196912|197356|
|5|2002|12.97%|13.39%|200563|208480|
|6|2003|13.40%|13.95%|203886|214205|
|7|2004|15.89%|13.58%|247150|202611|
|8|2005|15.03%|12.82%|237378|194044|
|9|2006|14.37%|11.96%|235783|187612|
|10|2007|15.01%|15.74%|234317|249264|


In [64]:
# AGGREGATE: total churn rate (entry rate + exit rate)

# Process entry and exit rates from dataframe
#total_population_df = population_df[population_df['dimension'] == 'Total']

#total_entry_exit_df = total_population_df.melt(id_vars='year',value_vars=['Entry rate','Exit rate'])

# Total churn rate is the sum of the entry and exit rates

test_rates['Total churn rate']  = test_rates['Entry rate'] + test_rates['Exit rate']

total_churn_chart = alt.Chart(test_rates).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('Total churn rate:Q', axis=alt.Axis(format='%')),
)

display(total_churn_chart)

# Create table of total churn rates in notebook
total_churn_df = test_rates[['year','Total churn rate','n_entrants','n_exiters']]

total_churn_table = total_churn_df.style.format({
    'Total churn rate': '{:.2%}',
    'n_entrants': '{:,.0f}',
    'n_exiters': '{:,.0f}'
}).background_gradient(cmap='YlOrRd', subset=['Total churn rate'])

display(total_churn_table)


| |year|Total churn rate|n_entrants|n_exiters|
|-|----|----------------|----------|---------|
|1|1998|26.68%|234733|160782|
|2|1999|26.49%|187154|212458|
|3|2000|25.11%|199710|176270|
|4|2001|25.82%|196912|197356|
|5|2002|26.36%|200563|208480|
|6|2003|27.35%|203886|214205|
|7|2004|29.47%|247150|202611|
|8|2005|27.85%|237378|194044|
|9|2006|26.33%|235783|187612|
|10|2007|30.75%|234317|249264|


##### **2.2 Entry and exit rate by size**

We should anticipate most entering firms to be Micro or Small. It would be unusual for a firm to start up with lots of employees. In some cases this may arise as a result of M&A activity.

We should also expect a high share of exiting firms to be small due to high rates of experimentation, entrepreneurship and thus failure.

**Key findings**
- Micro firms continue to drive exit. Most firms are micro, so this sustains the aggregate exit rate.
- The exit rate for firms with over 10 employees has fallen significantly since the early 2000s.

**Questions**
- Is the fall in exit in larger firms consistent across sectors?

In [None]:
#  FIRM SIZE: entry and exit rates

size_df = population_df[population_df['dimension'] == 'Size']

# Create chart of entry and exit rates by size in notebook
# Two side-by-side plots for entry and exit as different mechanisms

# Define entry and exit dataframes (assumes the rate columns are already present in population_df)
size_entry = size_df[['year', 'category', 'Entry rate']]
size_exit = size_df[['year', 'category', 'Exit rate']]

size_entry_chart = alt.Chart(size_entry).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('Entry rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
)

size_exit_chart = alt.Chart(size_exit).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('Exit rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
)

size_entry_exit = size_entry_chart | size_exit_chart
display(size_entry_exit)



In [None]:
# Average employment at entry over time

# Use cohort table for this and plot avg_size at age 0 across cohorts

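# Keep the whole-economy rows only; at age 0 the cohort year is the entry year
size_at_entry = cohort_df[(cohort_df['age']==0) & (cohort_df['dimension']=='Total')]

size_at_entry_chart = alt.Chart(size_at_entry).mark_line().encode(
    x=alt.X('cohort:O', axis=alt.Axis(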
                labelExpr="datum.value % 2 == 0 ? datum.label : ''", 
            labelAngle=0)),
    y=alt.Y('avg_size:Q')
)

##### **2.3 Entry and exit rate by sector**


In [None]:
#  SECTOR: entry and exit rates

sectoral_df = population_df[population_df['dimension'] == 'Sector']

sectoral_entry_exit_df = sectoral_df.melt(id_vars=['year','category'],value_vars=['Entry rate','Exit rate'])

# Display facet charts of entry and exit across sectors

sector_entry_exit_chart = alt.Chart(sectoral_entry_exit_df).mark_line().encode(
    x=alt.X('year:O'),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N'),  # separate lines for entry and exit
    facet=alt.Facet('category:N', columns=2)
)

sector_entry_exit_chart

##### **2.4 Entry and exit rate by region**

In [None]:
#  REGION: entry and exit rates
region_df = population_df[population_df['dimension'] == 'Region']

# Melt the regional table (not the sectoral one) and keep both rates
region_entry_exit_df = region_df.melt(id_vars=['year','category'],value_vars=['Entry rate','Exit rate'])

# Display facet charts of entry and exit across regions

region_entry_exit_chart = alt.Chart(region_entry_exit_df).mark_line().encode(
    x=alt.X('year:O'),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N'),  # separate lines for entry and exit
    facet=alt.Facet('category:N', columns=2)
)

region_entry_exit_chart

##### **2.5 Exit rates by age**

All entering firms are new (0-2 years) by definition, but the rate of exit differs across firm age. This measure provides insight into the ability of firms to survive and the persistence of old incumbent firms. For example, a falling exit rate for old firms over time might indicate a lack of competitive pressure, or the use of anti-competitive practices by existing firms to maintain dominant positions.

In [None]:
#  FIRM AGE: exit rates

age_df = population_df[population_df['dimension'] == 'Age']


# Melt the age table (not the sectoral one); only exit rates are relevant here
age_exit_df = age_df.melt(id_vars=['year','category'],value_vars=['Exit rate'])

# Display facet charts of exit rates across age groups

age_exit_chart = alt.Chart(age_exit_df).mark_line().encode(
    x=alt.X('year:O'),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    facet=alt.Facet('category:N', columns=2)
)

age_exit_chart

##### **2.6 Reactivations analysis**

A reactivating firm is one that was active in previous years, ceased operating, and then started up again. If reactivations are not accounted for in the definition of firm entry, they can inflate entry counts.

This is primarily a consistency check and is likely to feature in the annex. In the early stages of the analysis I realised that one way a firm can appear as an entrant is by reactivating.
Here is an example of how a reactivating firm appears in the BSD panel.

|Year|Entref|Status|
|----|------|-------|
|2000|ENTREF1|Entrant|
|2001|ENTREF1|Incumbent|
|2002|ENTREF1|Incumbent|
|2003|ENTREF1|Exit|
|----|-------|----|
|2010|ENTREF1|Reactivation|
|2011|ENTREF1|Incumbent|
|2012|ENTREF1|Exit|

The average number of reactivations across the panel is __%, increasing to __% post-2010.

To assess the prevalence of firms reactivating over time, we plot the entry rate both with and without reactivations.
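
One possible way to separate reactivations from true entrants in a firm-year panel is sketched below. It assumes a panel with one row per firm per active year; the column names `entref` and `year`, the flag names and the toy data are illustrative only, and the BSD's own status variables may make a different rule more appropriate.

```python
import pandas as pd

# Toy panel: one row per firm-year in which the firm is active (not BSD data)
panel = pd.DataFrame({
    "entref": ["A"] * 7 + ["B"] * 3,
    "year": [2000, 2001, 2002, 2003, 2010, 2011, 2012, 2005, 2006, 2007],
})

panel = panel.sort_values(["entref", "year"]).reset_index(drop=True)

# Previous year in which the same firm was observed (NaN on first appearance)
prev_year = panel.groupby("entref")["year"].shift(1)

# First appearance of the firm: a true entrant
panel["is_entrant"] = prev_year.isna()
# Reappearance after a gap of more than one year: a reactivation
panel["is_reactivation"] = prev_year.notna() & ((panel["year"] - prev_year) > 1)

# Annual counts of true entrants and reactivations
flows = panel.groupby("year")[["is_entrant", "is_reactivation"]].sum()
print(flows)
```

Counting the two flags separately makes it possible to report the entry rate both with and without reactivations from the same table.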


In [None]:
# ENTRY RATE WITH AND WITHOUT REACTIVATIONS

df['entry_w_reactivations'] = df['n_entrants'] + df['n_reactivations']
df['entry_w_reactivation_rate'] = df['entry_w_reactivations'] / df['n_firms']

# melt takes value_vars (not vars); the chart should plot the melted frame
reactivation_df = df.melt(id_vars='year', value_vars=['entry_rate','entry_w_reactivation_rate'])
reactivation_chart = alt.Chart(reactivation_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N')
)

reactivation_chart

#### **3. Survival and growth (cohort analysis)**
- Are firms surviving at the same rate over time?
- Are firms growing at the same rate over time?
- Cross-sectional differences: which sectors/regions perform better?


##### **3.1 Survival rates (cohorts)**

- Survival probability.
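
To make the survival measures concrete, here is a minimal sketch that derives a simple survival rate and a Kaplan-Meier style cumulative survival rate from cohort firm counts. The column names `cohort`, `age` and `n_firms` mirror the cohort table described earlier; `survival_rate`, `cond_survival`, `km` and the toy counts are assumptions for illustration. With no censoring and no reactivations the two measures coincide, so any difference between them in the real tables would come from how those cases are handled.

```python
import pandas as pd

# Toy cohort counts: number of surviving firms at each age (not BSD data)
cohort_counts = pd.DataFrame({
    "cohort": [2000] * 4 + [2001] * 4,
    "age": [0, 1, 2, 3] * 2,
    "n_firms": [1000, 800, 600, 480, 1200, 900, 720, 540],
})
cohort_counts = cohort_counts.sort_values(["cohort", "age"]).reset_index(drop=True)

# Simple survival rate: firms alive at each age relative to the cohort at entry
n_at_entry = cohort_counts.groupby("cohort")["n_firms"].transform("first")
cohort_counts["survival_rate"] = cohort_counts["n_firms"] / n_at_entry

# Kaplan-Meier style rate: cumulative product of year-on-year conditional survival
prev_n = cohort_counts.groupby("cohort")["n_firms"].shift(1)
cohort_counts["cond_survival"] = (cohort_counts["n_firms"] / prev_n).fillna(1.0)
cohort_counts["km"] = cohort_counts.groupby("cohort")["cond_survival"].cumprod()

print(cohort_counts)
```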

In [None]:
# 3.1 Survival by Cohort

total_cohort_df = cohort_df[cohort_df['dimension']=='Total']

cohort_survival = total_cohort_df[['cohort','age','km']]  # select columns with a list, from the Total rows only

# Plot the survival rates of each cohort
chart = alt.Chart(cohort_survival).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('km:Q'),
    color=alt.Color('cohort')
)
chart

# Probability that a firm reaches three and five years, by cohort.
threeyr_survival = cohort_survival[cohort_survival['age']==3]
fiveyr_survival = cohort_survival[cohort_survival['age']==5]


# Does the average survival probability change before/after GFC?


>>> HEADLINE: Overall Survival by Cohort



##### **3.2 Growth rates (cohorts)**

In [None]:
# 3.2 Growth rates by cohort

total_cohort_df = cohort_df[cohort_df['dimension']=='Total']

cohort_growth = total_cohort_df[['cohort','age','mean_dhs_growth']]  # select columns with a list, from the Total rows only

# Plot the growth rates of each cohort
chart = alt.Chart(cohort_growth).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('mean_dhs_growth:Q'),
    color=alt.Color('cohort')
)


##### **3.3 Average size of cohorts at each age**

#### **4. Job reallocation rates**

##### **4.1 Aggregate job reallocation rates**


In [None]:
# AGGREGATE: Total job reallocation rate

In [None]:
# AGGREGATE: Job reallocation by margin (entry/exit vs incumbents)

##### **4.2 Aggregate job flows by margin**

In [None]:
# AGGREGATE: Job creation from entrants vs expanding incumbents

In [None]:
# AGGREGATE: Job destruction from exiters vs contracting incumbents

##### **4.3 Job reallocation by size**

In [None]:
# SIZE: Total job reallocation

In [None]:
# SIZE: Reallocation by margin

##### **4.4 Job reallocation by sector & region**
- Is total job reallocation declining across all sectors?
- How big is variation in reallocation across regions?
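
For reference, the reallocation measures in this section can be built from the job creation and job destruction rates. The sketch below assumes both rates are stored as positive shares of lagged employment under the column names `Job creation rate` and `Job destruction rate` used earlier in the notebook; the helper `add_reallocation_rates`, the output column names and the toy numbers are illustrative assumptions.

```python
import pandas as pd

def add_reallocation_rates(df):
    """Add total, net and excess job reallocation rates (hypothetical helper)."""
    out = df.copy()
    jc = out["Job creation rate"]
    jd = out["Job destruction rate"]
    # Total reallocation: all jobs created plus all jobs destroyed
    out["job_reallocation_rate"] = jc + jd
    # Net employment growth: creation minus destruction
    out["net_employment_growth"] = jc - jd
    # Excess reallocation: churn over and above what net growth requires
    out["excess_reallocation_rate"] = (
        out["job_reallocation_rate"] - out["net_employment_growth"].abs()
    )
    return out

# Toy rates, not BSD data
toy_flows = pd.DataFrame({
    "year": [2001, 2002],
    "Job creation rate": [0.12, 0.08],
    "Job destruction rate": [0.10, 0.11],
})
toy_realloc = add_reallocation_rates(toy_flows)
print(toy_realloc)
```

The same helper could be applied to the sector or region tables, since the calculation is row-wise and does not depend on grouping.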

In [None]:
# SECTOR: Total job reallocation

# Facet plot of all sectors

#### **5. Annual growth rates**

Here we are interested in how existing firms grow or shrink each year, not just entering cohorts. This analysis focuses exclusively on incumbent firms.

At the firm-level, we calculate DHS growth rates.
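
The DHS (Davis-Haltiwanger-Schuh) growth rate divides the change in employment by the average of current and previous employment, which bounds it between -2 (exit) and +2 (entry) and treats expansions and contractions symmetrically. A minimal sketch follows; the function name `dhs_growth` and the toy employment figures are illustrative and are not taken from the BSD processing code.

```python
import numpy as np

def dhs_growth(current, previous):
    """DHS growth rate: (x_t - x_{t-1}) / (0.5 * (x_t + x_{t-1}))."""
    current = np.asarray(current, dtype=float)
    previous = np.asarray(previous, dtype=float)
    denom = 0.5 * (current + previous)
    # The rate is undefined when employment is zero in both periods: return NaN
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, (current - previous) / denom, np.nan)

# Toy firms: an expanding incumbent, an exiter, an entrant and a stable firm
emp_now = [12, 0, 10, 5]
emp_before = [10, 10, 0, 5]
growth = dhs_growth(emp_now, emp_before)
print(growth)
```

Because this section covers incumbents only, the boundary values of -2 and +2 should not appear in practice; they are shown here to illustrate the bounds.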

##### **5.1 Growth rate distributions**

In [None]:
# AGGREGATE: DHS growth rate distributions over time

##### **5.2 Growth by firm characteristics**
- Size-growth patterns
- Growth rates by productivity decile
- Sector-specific growth patterns

##### **5.3 High-growth firm analysis**
- HGF counts and employment shares over time
- Characteristics of HGFs (age, size, sector)

##### **5.4 Stagnation analysis**
- Rising share of stagnant firms
- Characteristics of persistently stagnant firms
- Employment trapped in low-growth firms

##### **5.5 Growth-productivity relationship**
- Do high-productivity firms grow faster?
- Changes in this relationship over time

#### **6. Firm-level productivity dispersion**

##### **6.1 Dispersion metrics**
- Standard deviation, IQR, 90/10 ratio, 90/50 ratio
- Time trends in dispersion 
- Has dispersion increased?
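
The dispersion metrics listed above can be computed from firm-level productivity as in the sketch below. The function name `dispersion_metrics`, the output labels and the toy data are assumptions for illustration. In practice the same ratios could be taken directly from the exported percentile columns (`P10_Prod`, `P50_Prod`, `P90_Prod` and so on), since firm-level data stay inside the SecureLab.

```python
import numpy as np
import pandas as pd

def dispersion_metrics(productivity):
    """Standard deviation, IQR and percentile ratios (hypothetical helper)."""
    x = pd.Series(productivity, dtype=float).dropna()
    p10, p25, p50, p75, p90 = np.percentile(x, [10, 25, 50, 75, 90])
    return pd.Series({
        "sd": x.std(),          # sample standard deviation
        "iqr": p75 - p25,       # interquartile range
        "p90_p10": p90 / p10,   # frontier relative to laggards
        "p90_p50": p90 / p50,   # frontier relative to the median firm
    })

# Toy productivity values 1..101, not BSD data
toy_prod = np.arange(1, 102)
toy_dispersion = dispersion_metrics(toy_prod)
print(toy_dispersion)
```

Ratios are only meaningful when the lower percentile is positive; for skewed or partly negative productivity measures, differences in logs are a common alternative.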

#### **7. Site expansion dynamics**


##### **7.1 Multi-site expansion trends**
- Site expansion rates over time
- Decomposition: entrants vs incumbents

##### **7.2 Site dynamics by sector**