# **Exploratory analysis of dynamism from the BSD**

**Research objectives**
- RQ1: How has the composition of UK firms evolved over the past decade according to the BSD?
- RQ2: To what extent has the rate of creative destruction in the UK declined between 1997 and 2023? 
- RQ3: How have gaps between the most productive ‘frontier’ firms and ‘laggard’ firms evolved? 
- RQ4: How are changes in business dynamism and productivity dispersion related?

**Data source**: Business Structure Database (1998-2023). Aggregated data tables have been exported from the UK Data Service SecureLab.

## **Executive Summary**
- The business population has grown.
- Entry and exit have been stable across the whole economy over time, with entry exceeding exit between 2012 and 2017.
- Micro firms have sustained exit rates, whilst firms with over 10 employees are less likely to exit.
- Firms starting up after the GFC are more likely to survive.
- Total job reallocation fell after the GFC and has not recovered.

In [123]:
# Import packages and set filepaths
import pandas as pd
import numpy as np
import altair as alt
from pathlib import Path
from pandas.api.types import CategoricalDtype
import os
import eco_style 
alt.themes.enable("report")

ThemeRegistry.enable('report')

In [None]:
# Set relative file paths
script_dir = Path.cwd()
import_path = script_dir.parent / "Data"
chart_path = script_dir.parent / "Charts"

In [None]:
# Import datasets
stats_path = import_path / 'BSD/29_02_2026_BSD_Dynamism_Stats.xlsx'  # all sheets live in one workbook
population_df = pd.read_excel(stats_path, sheet_name='population')
firm_dynamics_df = pd.read_excel(stats_path, sheet_name='firm_dynamics')
job_flows_df = pd.read_excel(stats_path, sheet_name='job_flows')
site_dynamics_df = pd.read_excel(stats_path, sheet_name='site_dynamics')
cohort_df = pd.read_excel(stats_path, sheet_name='cohort_analysis')
growth_rates_df = pd.read_excel(stats_path, sheet_name='growth_rates')
growth_cats_df = pd.read_excel(stats_path, sheet_name='growth_cats')
prod_df = pd.read_excel(stats_path, sheet_name='prod')


In [126]:
# Force data types to numeric where possible; '*' markers for disclosive values cause a column to be read in as strings
def convert_numeric_columns(df):
    """
    Convert all numeric columns to numeric type, replacing '*' with NaN.
    Preserves non-numeric columns (like year, category names) as they are.
    """
    df_converted = df.copy()
    
    for col in df_converted.columns:
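        # Treat '*' (suppressed disclosive values) as missing, then convert only
        # if every remaining value is numeric; errors='ignore' is deprecated
        cleaned = df_converted[col].replace('*', np.nan)
        try:
            df_converted[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            # Non-numeric column (e.g. category names): keep it, minus any '*'
            df_converted[col] = cleaned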
    
    return df_converted

# Apply to all dataframes
population_df = convert_numeric_columns(population_df)
firm_dynamics_df = convert_numeric_columns(firm_dynamics_df)
job_flows_df = convert_numeric_columns(job_flows_df)
site_dynamics_df = convert_numeric_columns(site_dynamics_df)
cohort_df = convert_numeric_columns(cohort_df)
growth_rates_df = convert_numeric_columns(growth_rates_df)
growth_cats_df = convert_numeric_columns(growth_cats_df)
prod_df = convert_numeric_columns(prod_df)

## **Summary of data tables**
##### *Table 1 - Population and job flows*
This table provides information on the business population and job flows each year. The index is the year. The following dimensions are provided:
- Total
- Firm size (employment)
- Firm age
- Sector
- Region
- Within-industry productivity decile

|Year|Dimension|Category|Number of firms|Employment|Turnover|Entrants|Exits|JC|JD|Multi-site firms|Multi-site emp|Site expansion|Site contraction|
|----|---------|--------|---------------|----------|--------|--------|-----|--|--|----------------|--------------|--------------|----------------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|
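
The JC and JD columns are levels. A common way to compare them across years is to convert them to rates by dividing by average employment over the two years. The sketch below assumes that convention; the function name `job_flow_rates` and the numbers are illustrative and are not taken from the BSD export.

```python
def job_flow_rates(jc, jd, emp_now, emp_prev):
    """Return job creation, destruction, reallocation and net growth rates.

    All rates are expressed relative to average employment over the two years.
    """
    avg_emp = 0.5 * (emp_now + emp_prev)
    jc_rate = jc / avg_emp
    jd_rate = jd / avg_emp
    return {
        "jc_rate": jc_rate,
        "jd_rate": jd_rate,
        # Total reallocation counts every job created or destroyed
        "reallocation_rate": jc_rate + jd_rate,
        # Net growth is creation minus destruction
        "net_growth_rate": jc_rate - jd_rate,
    }

# Illustrative numbers only (not BSD values)
rates = job_flow_rates(jc=150, jd=100, emp_now=1050, emp_prev=1000)
print(rates)
```

With this definition, net employment growth can be flat while reallocation is high, which is why the two are worth tracking separately.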

##### *Table 2 - Cohort analysis*
This table looks at cohorts of firms starting in each year and tracks the entire cohort by age. The following dimensions are provided:
- Total
- Sector
- Region
- Firm size (employment)

|Cohort|Age|Dimension|Category|Number of firms|Avg size|Survival rate|KM rate|Share of employment|Share of turnover|High growth firms|Stagnant firms|
|------|---|---------|--------|---------------|--------|-------------|-------|-------------------|-----------------|-----------------|--------------|
|2000|0|Total|All|
|2000|1|Total|All|
|2000|2|Total|All|
|2000|3|Total|All|
|2000|4|Total|All|
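
Table 2 reports both a survival rate and a KM rate. The table does not state how the KM rate is constructed; assuming it is the standard Kaplan-Meier product-limit estimator, a minimal sketch looks like the following. The function name `kaplan_meier` and the cohort counts are hypothetical.

```python
import numpy as np

def kaplan_meier(at_risk, exits):
    """Kaplan-Meier survival estimates from counts at each age.

    at_risk[i] is the number of firms alive at the start of age i,
    exits[i] is the number of those firms that exit during age i.
    """
    at_risk = np.asarray(at_risk, dtype=float)
    exits = np.asarray(exits, dtype=float)
    # Survival is the running product of per-period survival probabilities
    return np.cumprod(1.0 - exits / at_risk)

# Hypothetical cohort of 100 firms with no censoring
survival = kaplan_meier(at_risk=[100, 90, 72], exits=[10, 18, 12])
print(survival)
```

Without censoring, the estimate coincides with the simple share of the cohort still alive; the two measures only diverge when some firms drop out of observation for reasons other than exit.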

##### *Table 3 - Growth rates*

|Year|Dimension|Category|Number of incumbents|Mean DHS growth rate|Median DHS growth rate|P10 DHS|P90 DHS|Number of high-growth firms|Number of stagnant firms|Number of shrinking firms|Incumbent employment|Emp HGF|Emp Stagnant|Emp shrinking|
|----|---------|--------|--------------------|--------------------|----------------------|-------|-------|---------------------------|------------------------|-------------------------|--------------------|-------|------------|-------------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|
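
The growth columns in Table 3 use the DHS (Davis-Haltiwanger-Schuh) growth rate, which divides the change in employment by average employment across the two years, so it is bounded between -2 (exit) and +2 (entry). A minimal sketch, with a hypothetical function name and made-up employment values:

```python
import numpy as np

def dhs_growth(emp_now, emp_prev):
    """DHS growth rate: change in employment over two-year average employment."""
    emp_now = np.asarray(emp_now, dtype=float)
    emp_prev = np.asarray(emp_prev, dtype=float)
    avg = 0.5 * (emp_now + emp_prev)
    # The rate is undefined when employment is zero in both years
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(avg > 0, (emp_now - emp_prev) / avg, np.nan)

# Made-up firms: an entrant, an exiter and a growing incumbent
growth = dhs_growth(emp_now=[10, 0, 15], emp_prev=[0, 10, 10])
print(growth)
```

The symmetric denominator means that a firm growing from 10 to 15 employees and one shrinking from 15 to 10 have rates of equal size and opposite sign.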

##### *Table 4 - Productivity dispersion*

|Year|Dimension|Category|Number of firms|P10_Prod|P25_Prod|P50_Prod|Mean_Prod|P75_Prod|P90_Prod|SD_Prod|
|----|---------|--------|---------------|--------|--------|--------|---------|--------|--------|-------|
|2000|Total|All|
|2000|Size|Micro|
|2000|Size|Small|
|2000|Size|Medium|
|2000|Size|Large|
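
One simple way to summarise the gap between frontier and laggard firms from this table is the ratio of the 90th to the 10th percentile of productivity. The sketch below assumes column names that match the Table 4 header (`P10_Prod`, `P90_Prod`); the actual names in the exported sheet may differ, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows shaped like Table 4 (values are made up)
example_prod = pd.DataFrame({
    "Year": [2000, 2001],
    "P10_Prod": [20.0, 20.0],
    "P50_Prod": [60.0, 66.0],
    "P90_Prod": [180.0, 220.0],
})

# Ratio of frontier (P90) to laggard (P10) productivity
example_prod["p90_p10_ratio"] = example_prod["P90_Prod"] / example_prod["P10_Prod"]

# The same gap in logs, which is easier to compare across years
example_prod["log_p90_p10"] = np.log(example_prod["p90_p10_ratio"])

print(example_prod[["Year", "p90_p10_ratio", "log_p90_p10"]])
```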

<details>
<summary> View data preprocessing code</summary>

hi

</details>

## **1. The composition of the UK business population**

First, we assess what types of firms make up the business population in 2023: big or small, young or old. Which types of firms contribute the most to economic activity?

How has this changed over the last 20 years? What do we learn about structural change in the economy?

### **Overall section findings**

**Total number of firms, employment and turnover findings**
- The business population has expanded over the last 20 years, with substantial growth taking place between 2011 and 2018.
- Turnover has more than doubled in nominal terms, rising from £1.8tn to £4.5tn. A significant portion of this growth may simply reflect price increases rather than real economic expansion.
- Productivity (turnover per employee) has grown from £90k to £150k in nominal terms. Again, adjusting for inflation would likely show much flatter real productivity growth.

**Questions to explore**
- What's driving the surge in firm numbers post-2010? Is it genuine entrepreneurship, or growth in shell companies, single-person LTDs, and contractor structures (IR35 responses?)? What is the size distribution of these new firms - are they predominantly zero-employee or micro businesses?
- How much of turnover growth is real vs nominal? 
- Why did employment jump so sharply after 2012?
- What's happening at the sectoral level?
- What's the relationship between firm growth and productivity? If the new firms are predominantly very small and low-productivity, this could be diluting the aggregate productivity figures.
- How has firm survival changed over time? Are firms lasting longer or failing faster?
- What role does geography play?
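
Separating real from nominal turnover growth requires a price index, which this notebook does not yet load. The sketch below shows the arithmetic only, assuming a deflator indexed to 100 in the base year; both series are made up and the column names are hypothetical.

```python
import pandas as pd

# Made-up nominal turnover and a made-up price index (2003 = 100)
example = pd.DataFrame({
    "year": [2003, 2013, 2023],
    "turnover_nominal": [1000.0, 1500.0, 2400.0],
    "deflator": [100.0, 125.0, 160.0],
})

# Express every year in base-year prices
base = example.loc[example["year"] == 2003, "deflator"].iloc[0]
example["turnover_real"] = example["turnover_nominal"] * base / example["deflator"]

# Compare growth between the first and last year in nominal and real terms
nominal_growth = example["turnover_nominal"].iloc[-1] / example["turnover_nominal"].iloc[0] - 1
real_growth = example["turnover_real"].iloc[-1] / example["turnover_real"].iloc[0] - 1

print(f"Nominal growth: {nominal_growth:.0%}, real growth: {real_growth:.0%}")
```

The same adjustment applies to turnover per employee, since employment is not affected by prices.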

In [183]:
# BSD facts - how has the total number of firms, employment and turnover changed over time?

total_population_df = population_df[population_df['dimension']=='Total']

n_firm_chart = alt.Chart(total_population_df.assign(n_firms_m=total_population_df['n_firms'] / 1e6)).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 != 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('n_firms_m:Q', title='Total number of firms in BSD', scale=alt.Scale(domainMin=1.5, domainMax=2.5), axis=alt.Axis(format=".1f", labelExpr="datum.value + 'm'"))
)

emp_chart = alt.Chart(total_population_df.assign(employment_m=total_population_df['employment'] / 1e6)).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 != 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('employment_m:Q', title='Total employment in BSD', scale=alt.Scale(domainMin=15, domainMax=24), axis=alt.Axis(format=".0f", labelExpr="datum.value + 'm'"))
)

turnover_chart = alt.Chart(total_population_df.assign(turnover_bn=total_population_df['turnover'] / 1e9)).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 != 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('turnover_bn:Q',title='Total turnover in BSD', axis=alt.Axis(format=".0f", labelExpr="'£' + datum.value + 'tn'"))
)

productivity_chart = alt.Chart(total_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 != 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('avg_turnover_per_employee:Q',title='Average turnover per employee in BSD', axis=alt.Axis(format=".0f", labelExpr="'£' + datum.value + 'k'"))
)

basic_facts_chart = (n_firm_chart | emp_chart) & (turnover_chart | productivity_chart)

display(basic_facts_chart)
basic_facts_chart.save(chart_path / 'Descriptive paper/Data/BSD_basic_facts.png', scale_factor=2.0)

#### **Firm size analysis**

**Economic contribution by firm size findings**
- Most firms (90%) are micro firms with 0 to 9 employees, but these firms only employ __% of the workforce.
- Large firms with 250 or more employees make up just 3% of the business population but employ over 40% of the workforce. Changes in the number of people that these firms employ will have a large effect on employment dynamics.

In [None]:
# Economic contribution by firm size 2023

size_df = population_df[population_df['dimension']=='Size']
# Take an explicit copy so the new columns below are not set on a slice
size_df_2023 = size_df[size_df['year']==2023].copy()


size_df_2023['Total_employment'] = size_df_2023['employment'].sum()
size_df_2023['Total_firms'] = size_df_2023['n_firms'].sum()
size_df_2023['Total_turnover'] = size_df_2023['turnover'].sum()

size_df_2023['Share_of_employment'] = size_df_2023['employment']/size_df_2023['Total_employment']
size_df_2023['Share_of_firms'] = size_df_2023['n_firms']/size_df_2023['Total_firms']
size_df_2023['Share_of_turnover'] = size_df_2023['turnover']/size_df_2023['Total_turnover']

firmsize_share_of_activity = size_df_2023.melt(id_vars='category',
                                                value_vars=['Share_of_employment','Share_of_firms','Share_of_turnover'],
                                                value_name='Share of activity')

label_map = {
    'Share_of_employment': 'Employment',
    'Share_of_firms': 'Firms',
    'Share_of_turnover': 'Turnover'
}

firmsize_share_of_activity['variable'] = firmsize_share_of_activity['variable'].map(label_map)

sizeband_order = ['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)']
variable_order = ['Firms', 'Employment', 'Turnover']

chart = alt.Chart(firmsize_share_of_activity).mark_bar().encode(
    x=alt.X('category:O', sort=sizeband_order),
    y=alt.Y('Share of activity:Q', axis=alt.Axis(format='%'), scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('variable:N', sort=variable_order).legend(title=None, orient='bottom', 
        direction='horizontal'),
    xOffset=alt.XOffset('variable:N', sort=variable_order)
)

display(chart)
chart.save(chart_path / 'Descriptive paper/Composition/BSD_firmsize_contribution.png', scale_factor=2.0)

In [None]:
# Contribution of young firms (firm share and employment share)
age_population_df = population_df[population_df['dimension']=='Age']

# Take an explicit copy so the new column below is not set on a slice
age_population_df = age_population_df[age_population_df['year'] >=2004].copy()

# Combine new and young firms to create new 0-5 category
age_population_df['age_category'] = age_population_df['category'].replace({'New (0-2 years)': 'Young (0-5 years)', 'Young (3-5 years)': 'Young (0-5 years)'})
age_population_df = age_population_df.groupby(['year', 'dimension', 'age_category']).agg({'n_firms':'sum', 'employment':'sum', 'turnover':'sum'}).reset_index()

# Calculate share of total firms, employment, and turnover for each age category
age_population_df['total_firms'] = age_population_df.groupby('year')['n_firms'].transform('sum')        
age_population_df['total_employment'] = age_population_df.groupby('year')['employment'].transform('sum')
age_population_df['total_turnover'] = age_population_df.groupby('year')['turnover'].transform('sum')

age_population_df['share_of_firms'] = age_population_df['n_firms'] / age_population_df['total_firms']
age_population_df['share_of_employment'] = age_population_df['employment'] / age_population_df['total_employment']
age_population_df['share_of_turnover'] = age_population_df['turnover'] / age_population_df['total_turnover']

# Plot just the young firm share
young_firm_contribution_df = age_population_df[age_population_df['age_category']=='Young (0-5 years)']
young_firm_chart = alt.Chart(young_firm_contribution_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('share_of_firms:Q', axis=alt.Axis(format='%'), title='Share of firms that are young (0-5 years)'),
)       

young_emp_chart = alt.Chart(young_firm_contribution_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('share_of_employment:Q', axis=alt.Axis(format='%'), title='Share of employment in young firms (0-5 years)'),
) 

young_activity_chart = young_firm_chart | young_emp_chart
display(young_activity_chart)

#young_activity_chart.save(chart_path / 'Descriptive paper/Composition/BSD_young_firm_contribution.png', scale_factor=2.0)

In [None]:
age_population_df[age_population_df['age_category']=='Young (0-5 years)']

| |year|dimension|age_category|n_firms|employment|turnover|total_firms|total_employment|total_turnover|share_of_firms|share_of_employment|share_of_turnover|
|-|----|---------|------------|-------|----------|--------|-----------|----------------|--------------|--------------|-------------------|-----------------|
|2|2003|Age|Young (0-5 years)|992637|5476946|562659041|1955494|19895382|2170231099|0.507614|0.275287|0.259262|
|5|2004|Age|Young (0-5 years)|977540|4484150|466846632|2001900|19771175|2225910253|0.488306|0.226802|0.209733|
|8|2005|Age|Young (0-5 years)|1014917|4500934|459449456|2039140|19961266|2325863150|0.497718|0.225483|0.197539|
|11|2006|Age|Young (0-5 years)|1042338|4417219|445258186|2078660|20165912|2418159388|0.501447|0.219044|0.184131|
|14|2007|Age|Young (0-5 years)|1095704|4291660|453353289|2154012|20335449|2580670446|0.508681|0.211043|0.175673|
|17|2008|Age|Young (0-5 years)|1093721|4300139|497446009|2146995|20405414|2739614188|0.509419|0.210735|0.181575|
|20|2009|Age|Young (0-5 years)|1020355|4008262|460032664|2096709|20472211|2910066095|0.486646|0.19579|0.158083|
|23|2010|Age|Young (0-5 years)|927773|3638428|378707702|2029958|20009830|3142202910|0.45704|0.181832|0.120523|
|26|2011|Age|Young (0-5 years)|866722|3321091|336620257|1986939|19673490|2931514856|0.43621|0.16881|0.114828|
|29|2012|Age|Young (0-5 years)|923705|3543796|360469856|2063744|20158683|3026439519|0.447587|0.175795|0.119107|


In [None]:
# 2023 economic contribution by firm age

age_df = population_df[population_df['dimension']=='Age']
# Take an explicit copy so the new columns below are not set on a slice
age_df_2023 = age_df[age_df['year']==2023].copy()


age_df_2023['Total_employment'] = age_df_2023['employment'].sum()
age_df_2023['Total_firms'] = age_df_2023['n_firms'].sum()
age_df_2023['Total_turnover'] = age_df_2023['turnover'].sum()

age_df_2023['Share_of_employment'] = age_df_2023['employment']/age_df_2023['Total_employment']
age_df_2023['Share_of_firms'] = age_df_2023['n_firms']/age_df_2023['Total_firms']
age_df_2023['Share_of_turnover'] = age_df_2023['turnover']/age_df_2023['Total_turnover']

firmage_share_of_activity = age_df_2023.melt(id_vars='category',
                                                value_vars=['Share_of_employment','Share_of_firms','Share_of_turnover'],
                                                value_name='Share of activity')

label_map = {
    'Share_of_employment': 'Employment',
    'Share_of_firms': 'Firms',
    'Share_of_turnover': 'Turnover'
}

firmage_share_of_activity['variable'] = firmage_share_of_activity['variable'].map(label_map)

ageband_order = ['New (0-2 years)', 'Young (3-5 years)', 'Mature (6-10 years)', 'Old (11+ years)']
variable_order = ['Firms', 'Employment', 'Turnover']

chart = alt.Chart(firmage_share_of_activity).mark_bar().encode(
    x=alt.X('category:O', sort=ageband_order),
    y=alt.Y('Share of activity:Q', axis=alt.Axis(format='%'), scale=alt.Scale(domain=[0, 1])),
    color=alt.Color('variable:N', sort=variable_order).legend(title=None, orient='bottom', 
        direction='horizontal'),
    xOffset=alt.XOffset('variable:N', sort=variable_order)
)

display(chart)
#chart.save(chart_path / 'Descriptive paper/Composition/BSD_firmsize_contribution.png', scale_factor=2.0)

In [186]:
# STACKED BAR CHARTS OF FIRM AGE SHARES OF FIRMS AND EMPLOYMENT, BY YEAR

# Take an explicit copy so the new columns below are not set on a slice
age_population_df = population_df[population_df['dimension']=='Age'].copy()

# Calculate share of total firms, employment, and turnover for each age category
age_population_df['total_firms'] = age_population_df.groupby('year')['n_firms'].transform('sum')        
age_population_df['total_employment'] = age_population_df.groupby('year')['employment'].transform('sum')
age_population_df['total_turnover'] = age_population_df.groupby('year')['turnover'].transform('sum')

age_population_df['share_of_firms'] = age_population_df['n_firms'] / age_population_df['total_firms']
age_population_df['share_of_employment'] = age_population_df['employment'] / age_population_df['total_employment']
age_population_df['share_of_turnover'] = age_population_df['turnover'] / age_population_df['total_turnover']

age_firm_share_chart = alt.Chart(age_population_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('share_of_firms:Q', axis=alt.Axis(format='%'), title='Share of firms by age category', scale=alt.Scale(
            domain=[0, 1], 
            clamp=True
        )),
    color=alt.Color('category:N', legend=alt.Legend(title='Firm age category')),
)

age_emp_share_chart = alt.Chart(age_population_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('share_of_employment:Q', axis=alt.Axis(format='%'), title='Share of employment by age category'),
    color=alt.Color('category:N', legend=alt.Legend(title='Firm age category')),
)

display(age_firm_share_chart)
display(age_emp_share_chart)

In [112]:
# CONTRIBUTION TO EMPLOYMENT BY AGE EACH YEAR

# Take an explicit copy so the new columns below are not set on a slice
age_population_df = population_df[population_df['dimension']=='Age'].copy()

age_population_df['total_employment'] = age_population_df.groupby('year')['employment'].transform('sum')
age_population_df['share_of_employment'] = age_population_df['employment'] / age_population_df['total_employment']  

age_emp_share_chart = alt.Chart(age_population_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('share_of_employment:Q', axis=alt.Axis(format='%'), title='Share of employment by age category'),
    color=alt.Color('category:N', legend=alt.Legend(title='Firm age category')),
)

age_emp_share_chart

#### **Sectoral analysis**

First, what sectors are most important in 2023? 

Second, how has the importance of sectors changed over the last twenty years?

In [None]:
# SECTOR SUMMARY IN 2023

In [None]:
# CHANGE IN RELATIVE SHARE OF FIRMS AND EMPLOYMENT ACROSS SECTORS
START_YEAR = 2000
END_YEAR = 2023

# --- Filter to sector dimension and endpoint years ---
df_sector = population_df[
    (population_df['dimension'] == 'Sector') &
    (population_df['year'].isin([START_YEAR, END_YEAR]))
].copy()

# --- Compute shares for both measures ---
records = []
for measure, label in [('n_firms', 'Firm share'), ('employment', 'Employment share')]:
    temp = df_sector[['year', 'category', measure]].copy()
    totals = temp.groupby('year')[measure].sum().reset_index()
    totals.columns = ['year', 'total']
    temp = temp.merge(totals, on='year')
    temp['share'] = (temp[measure] / temp['total']) * 100
    pivoted = temp.pivot(index='category', columns='year', values='share').reset_index()
    pivoted.columns = ['category', 'start_share', 'end_share']
    pivoted['change'] = pivoted['end_share'] - pivoted['start_share']
    pivoted['measure'] = label
    records.append(pivoted[['category', 'change', 'measure']])

plot_df = pd.concat(records, ignore_index=True)

# --- Sort industries by firm share change ---
sort_order = (
    plot_df[plot_df['measure'] == 'Firm share']
    .sort_values('change')['category']
    .tolist()
)

# --- Grouped bar chart ---
bars = alt.Chart(plot_df).mark_bar().encode(
    x=alt.X('change:Q',
             axis=alt.Axis(title='Change in share (pp)', format='+.1f',
                           grid=True, gridOpacity=0.3),
             scale=alt.Scale(domain=[
                 plot_df['change'].min(),
                 plot_df['change'].max()
             ])),
    y=alt.Y('category:N',
             sort=sort_order,
             axis=alt.Axis(title=None, labelFontSize=11)),
    color=alt.Color('measure:N',
                     scale=alt.Scale(
                         domain=['Firm share','Employment share'],
                         range=["#179fdb","#e6224b"]
                     ),
                     legend=alt.Legend(title=None, orient='none',
                                       legendX=420,   
                                       legendY=10,  
                                       direction='vertical')),
    yOffset=alt.YOffset('measure:N', sort=['Firm share', 'Employment share'])
)

# --- Zero line ---
rule = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(
    color='#374151', strokeWidth=1
).encode(x='x:Q')

# --- Combine ---
chart = (bars + rule).properties(
    width=500,
    height=400
).configure_view(
    strokeWidth=0
)

display(chart)
chart.save(chart_path / 'Descriptive paper/Composition/BSD_sector_change.png', scale_factor=2.0)

In [None]:
# CHANGE IN ABSOLUTE FIRM COUNT AND EMPLOYMENT ACROSS SECTORS
START_YEAR = 2000
END_YEAR = 2023

# --- Filter to sector dimension and endpoint years ---
df_sector = population_df[
    (population_df['dimension'] == 'Sector') &
    (population_df['year'].isin([START_YEAR, END_YEAR]))
].copy()

# --- Compute % change for both measures ---
records = []
for measure, label in [('n_firms', '% change in firms'), ('employment', '% change in employment')]:
    temp = df_sector[['year', 'category', measure]].copy()
    pivoted = temp.pivot(index='category', columns='year', values=measure).reset_index()
    pivoted.columns = ['category', 'start_val', 'end_val']
    pivoted['change'] = ((pivoted['end_val'] - pivoted['start_val']) / pivoted['start_val']) * 100
    pivoted['measure'] = label
    records.append(pivoted[['category', 'change', 'measure']])

plot_df = pd.concat(records, ignore_index=True)

# --- Sort industries by firm % change ---
sort_order = (
    plot_df[plot_df['measure'] == '% change in firms']
    .sort_values('change')['category']
    .tolist()
)

# --- Grouped bar chart ---
bars = alt.Chart(plot_df).mark_bar().encode(
    x=alt.X('change:Q',
             axis=alt.Axis(title='Change (%)', format='+.1f',
                           grid=True, gridOpacity=0.3),
             scale=alt.Scale(domain=[
                 plot_df['change'].min(),
                 plot_df['change'].max()
             ])),
    y=alt.Y('category:N',
             sort=sort_order,
             axis=alt.Axis(title=None, labelFontSize=11)),
    color=alt.Color('measure:N',
                     scale=alt.Scale(
                         domain=['% change in firms','% change in employment'],
                         range=["#179fdb","#e6224b"]
                     ),
                     legend=alt.Legend(title=None, orient='none',
                                       legendX=420,   
                                       legendY=10,  
                                       direction='vertical')),
    yOffset=alt.YOffset('measure:N', sort=['% change in firms', '% change in employment'])
)

# --- Zero line ---
rule = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(
    color='#374151', strokeWidth=1
).encode(x='x:Q')

# --- Combine ---
chart = (bars + rule).properties(
    width=500,
    height=400
).configure_view(
    strokeWidth=0
)

display(chart)
#chart.save(chart_path / 'Descriptive paper/Composition/BSD_sector_change.png', scale_factor=2.0)

In [None]:
# Employment shares by sector, all years
sector_population_df = population_df[population_df['dimension'] == 'Sector'].copy()

totals = sector_population_df.groupby('year')['employment'].sum().reset_index()
totals.columns = ['year', 'total_employment']

sector_population_df = sector_population_df.merge(totals, on='year')
sector_population_df['employment_share'] = sector_population_df['employment'] / sector_population_df['total_employment']

sector_emp_chart = alt.Chart(sector_population_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 != 0 ? datum.label : ''",
             labelAngle=0)),
    y=alt.Y('employment_share:Q', axis=alt.Axis(format='%'), title='Share of employment by sector in BSD'),
    color=alt.Color('category:N', legend=alt.Legend(title='Sector')),
)

display(sector_emp_chart)

In [None]:
scatter_input_df = sector_population_df[sector_population_df['year'].isin([2003, 2023])]
scatter_input_df = scatter_input_df.pivot(index='category', columns='year', values='employment_share').reset_index()
scatter_input_df.columns = ['category', 'share_2003', 'share_2023']

max_val = max(scatter_input_df['share_2003'].max(), scatter_input_df['share_2023'].max())
line_df = pd.DataFrame({'share_2003': [0, max_val], 'share_2023': [0, max_val]})

scatter_plot = alt.Chart(scatter_input_df).mark_point(size=100).encode(
    x=alt.X('share_2003:Q', axis=alt.Axis(format='%'), title='Employment share by sector in 2003'),
    y=alt.Y('share_2023:Q', axis=alt.Axis(format='%'), title='Employment share by sector in 2023'),
    color=alt.Color('category:N'),
    tooltip=['category:N', alt.Tooltip('share_2003:Q', format='.2%'), alt.Tooltip('share_2023:Q', format='.2%')]
).properties(height=400, width=400)

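# Dashed 45-degree reference line: sectors above it gained employment share
diagonal = alt.Chart(line_df).mark_line(
    strokeDash=[4, 4],
    color='#374151',
    opacity=0.5
).encode(
    # line_df uses the same column names as the scatter data
    x='share_2003:Q',
    y='share_2023:Q'
)

# Layer the reference line over the scatter and display the combined chart
display(scatter_plot + diagonal)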

#### **Multi-site analysis**
- How many firms in the BSD operate multiple sites?
- Are these multi-site firms responsible for a disproportionate amount of economic activity?
- Is this a sectoral phenomenon?

In [None]:
# PREVALENCE OF MULTI-SITE FIRMS IN 2023
sector_population_df = population_df[population_df['dimension']=='Sector']
sector_population_df_2023 = sector_population_df[sector_population_df['year']==2023]



In [117]:

# Load sector-level site counts and compute multi-site shares of firms and employment
# (columns used: n_multi_site, n_firms, emp_multi_site, emp, year, final_sector_name)
multisite_df = pd.read_excel(import_path / 'BSD/29_02_2026_BSD_Dynamism_Stats.xlsx', sheet_name='sector_sites')
multisite_df['multi_site_pct'] = multisite_df['n_multi_site'] / multisite_df['n_firms']
multisite_df['emp_multi_site_pct'] = multisite_df['emp_multi_site'] / multisite_df['emp']
multisite_df = multisite_df[multisite_df['year']==2023]

# Reshape to long format
plot_df = multisite_df.melt(
    id_vars='final_sector_name',
    value_vars=['multi_site_pct', 'emp_multi_site_pct'],
    var_name='measure',
    value_name='value'
)

plot_df['measure'] = plot_df['measure'].map({
    'multi_site_pct': 'Share of multi-site firms (%)',
    'emp_multi_site_pct': 'Share of employment (%)'
})

sort_order = (
    multisite_df.sort_values('emp_multi_site_pct', ascending=False)['final_sector_name'].tolist()
)

# Connecting lines
lines = alt.Chart(multisite_df).mark_rule(color='#d1d5db', strokeWidth=1.5).encode(
    x='multi_site_pct:Q',
    x2='emp_multi_site_pct:Q',
    y=alt.Y('final_sector_name:N', sort=sort_order, axis=alt.Axis(title=None, labelFontSize=11))
)

# Dots
dots = alt.Chart(plot_df).mark_circle(size=80).encode(
    x=alt.X('value:Q', axis=alt.Axis(format='%')),
    y=alt.Y('final_sector_name:N', sort=sort_order, axis=alt.Axis(title=None)),
    color=alt.Color('measure:N',
                     scale=alt.Scale(
                         domain=['Share of multi-site firms (%)', 'Share of employment (%)'],
                         range=["#179fdb","#e6224b"]
                     ),
                     legend=alt.Legend(title=None, orient='top'))
)

chart = (lines + dots).properties(
    width=450,
    height=400).configure_view(strokeWidth=0)

display(chart)
chart.save(chart_path / 'Descriptive paper/Composition/BSD_multisite_prevalence.png', scale_factor=2.0)

In [None]:
# MULTI-SITE FIRMS TABLE 2003 - 2023 CHANGE

multi_site_sectors = pd.read_excel(import_path / 'BSD/29_02_2026_BSD_Dynamism_Stats.xlsx', sheet_name='sector_sites')

df = multi_site_sectors.copy()

df['multi_site_firms_pct'] = df['n_multi_site'] / df['n_firms']
df['emp_multi_site_pct'] = df['emp_multi_site'] / df['emp']
# avg_sites is already present in the sheet, so no calculation is needed

# Pivot for 2003 and 2023
result = df[df['year'].isin([2003, 2023])].pivot_table(
    index='final_sector_name',
    columns='year',
    values=['multi_site_firms_pct', 'emp_multi_site_pct', 'avg_sites']
)

# Flatten column names
result.columns = [f'{col[0]}_{col[1]}' for col in result.columns]

# Rename columns to match desired output
result = result.rename(columns={
    'multi_site_firms_pct_2003': 'Multi-site firms (%) 2003',
    'multi_site_firms_pct_2023': 'Multi-site firms (%) 2023',
    'emp_multi_site_pct_2003': 'Emp in multi-site (%) 2003',
    'emp_multi_site_pct_2023': 'Emp in multi-site (%) 2023',
    'avg_sites_2003': 'Avg sites 2003',
    'avg_sites_2023': 'Avg sites 2023'
})

# Reorder columns
result = result[[
    'Multi-site firms (%) 2003', 'Multi-site firms (%) 2023',
    'Emp in multi-site (%) 2003', 'Emp in multi-site (%) 2023',
    'Avg sites 2003', 'Avg sites 2023'
]]

# Add total row
totals = df[df['year'].isin([2003, 2023])].groupby('year').agg(
    n_firms=('n_firms', 'sum'),
    n_multi_site=('n_multi_site', 'sum'),
    emp=('emp', 'sum'),
    emp_multi_site=('emp_multi_site', 'sum'),
    avg_sites=('avg_sites', 'mean')
).reset_index()

total_row = pd.DataFrame({
    'Multi-site firms (%) 2003': [totals.loc[totals['year']==2003, 'n_multi_site'].values[0] / totals.loc[totals['year']==2003, 'n_firms'].values[0]],
    'Multi-site firms (%) 2023': [totals.loc[totals['year']==2023, 'n_multi_site'].values[0] / totals.loc[totals['year']==2023, 'n_firms'].values[0]],
    'Emp in multi-site (%) 2003': [totals.loc[totals['year']==2003, 'emp_multi_site'].values[0] / totals.loc[totals['year']==2003, 'emp'].values[0]],
    'Emp in multi-site (%) 2023': [totals.loc[totals['year']==2023, 'emp_multi_site'].values[0] / totals.loc[totals['year']==2023, 'emp'].values[0]],
    'Avg sites 2003': [totals.loc[totals['year']==2003, 'avg_sites'].values[0]],
    'Avg sites 2023': [totals.loc[totals['year']==2023, 'avg_sites'].values[0]]
}, index=['Total'])

result = result.sort_values(by='Emp in multi-site (%) 2023', ascending=False)
result = pd.concat([result, total_row])
result[['Multi-site firms (%) 2003', 'Multi-site firms (%) 2023',
        'Emp in multi-site (%) 2003', 'Emp in multi-site (%) 2023']] *= 100
result = result.round(1)

result


Sector,Multi-site firms (%) 2003,Multi-site firms (%) 2023,Emp in multi-site (%) 2003,Emp in multi-site (%) 2023,Avg sites 2003,Avg sites 2023
Retail Trade,5.4,2.9,73.1,71.1,9.0,14.0
Transport & Logistics,3.0,1.7,67.9,66.1,7.0,8.0
Utilities,6.7,3.3,85.0,64.8,16.0,9.0
Social care,10.6,9.9,51.1,56.1,7.0,9.0
Other Information Services,2.5,1.0,70.7,54.8,8.0,10.0
Other Primary Industries,3.3,2.4,60.3,53.9,8.0,7.0
Manufacturing,5.2,3.9,51.2,49.5,4.0,4.0
Wholesale Trade,4.2,3.1,43.6,46.4,5.0,6.0
Business Support Services,2.1,1.3,46.0,44.2,9.0,15.0
Hospitality,2.4,2.1,48.2,42.6,11.0,13.0


#### **Regional firm composition**

All regions have seen an increase in the number of firms.
However, only London has increased its share of the total; every other region's share has fallen.
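
A region can add firms and still lose share of the national total if other regions grow faster. A minimal sketch with hypothetical numbers (the regions `X` and `Y` and their counts are invented for illustration):

```python
import pandas as pd

# Two hypothetical regions that both grow, at different speeds
toy = pd.DataFrame({
    "region": ["X", "Y"],
    "firms_start": [100, 100],
    "firms_end": [200, 120],
})

# Percentage growth in the number of firms
toy["growth_pct"] = (toy["firms_end"] / toy["firms_start"] - 1) * 100
# Share of the total in each endpoint year
toy["share_start"] = toy["firms_start"] / toy["firms_start"].sum() * 100
toy["share_end"] = toy["firms_end"] / toy["firms_end"].sum() * 100
# Change in share, in percentage points
toy["share_change_pp"] = toy["share_end"] - toy["share_start"]

print(toy[["region", "growth_pct", "share_change_pp"]])
```

Region `Y` grows by 20% yet loses 12.5 percentage points of share, which is why growth in counts and change in share are worth reporting separately.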


In [102]:
# 2023 regional composition table
region_population_df = population_df[population_df['dimension'] == 'Region']
region_population_2023 = region_population_df[region_population_df['year']==2023].copy()

# Calculate share of total firms
totals = region_population_2023['n_firms'].sum()
region_population_2023['share_of_firms'] = (region_population_2023['n_firms'] / totals) * 100

# Build display table
table = region_population_2023[['category', 'n_firms', 'share_of_firms', 'avg_age', 'avg_turnover_per_employee']].copy()
table = table.sort_values('share_of_firms', ascending=False)

# Format columns
table.columns = ['dimension', 'n_firms', 'share_of_firms', 'avg_age', 'avg_turnover_per_employee']
table['n_firms'] = table['n_firms'].apply(lambda x: f"{x/1000:,.0f}")
table['share_of_firms'] = table['share_of_firms'].apply(lambda x: f"{x:.1f}")
table['avg_age'] = table['avg_age'].apply(lambda x: f"{x:.0f}")
table['avg_turnover_per_employee'] = table['avg_turnover_per_employee'].apply(lambda x: f"{x:,.0f}")

table = table.rename(columns={
    'dimension': 'Region',
    'n_firms': 'Firms (000s)',
    'share_of_firms': 'Share of firms (%)',
    'avg_age': 'Average age',
    'avg_turnover_per_employee': 'Avg turnover per employee (£k)'
})

totals_row = pd.DataFrame([{
    'Region': 'Total',
    'Firms (000s)': f"{region_population_2023['n_firms'].sum()/1000:,.0f}",
    'Share of firms (%)': '100.0',
    'Average age': f"{region_population_2023['avg_age'].mean():.0f}",
    'Avg turnover per employee (£k)': f"{region_population_2023['avg_turnover_per_employee'].mean():,.0f}"
}])

table = pd.concat([table, totals_row], ignore_index=True)

table = table.reset_index(drop=True)
table.style.hide(axis='index')

Region,Firms (000s),Share of firms (%),Average age,Avg turnover per employee (£k)
London,478,19.1,8,212
South East,372,14.9,11,150
East Of England,247,9.9,12,145
North West,242,9.7,11,142
South West,217,8.7,13,116
West Midlands,201,8.0,12,132
Yorkshire and The Humber,177,7.1,12,127
East Midlands,167,6.7,12,197
Scotland,160,6.4,13,121
Wales,100,4.0,15,100


In [90]:
# 2023 REGIONAL FIRM SHARES
region_population_df = population_df[population_df['dimension'] == 'Region']
region_population_2023 = region_population_df[region_population_df['year']==2023].copy()

# Calculate share of total firms per year
totals = region_population_2023.groupby('year')['n_firms'].sum().reset_index()
totals.columns = ['year', 'total']
region_population_2023 = region_population_2023.merge(totals, on='year')
region_population_2023['share_of_firms'] = (region_population_2023['n_firms'] / region_population_2023['total']) * 100

# Sort order: highest share in 2023 at the top
sort_order = (
    region_population_2023[region_population_2023['year'] == 2023]
    .sort_values('share_of_firms', ascending=False)['category']
    .tolist()
)

region_chart = alt.Chart(region_population_2023).mark_bar().encode(
    x=alt.X('share_of_firms:Q', axis=alt.Axis(title='Share of firms (%)', labelFontSize=11)),
    y=alt.Y('category:N', sort=sort_order, axis=alt.Axis(title=None)),
    color=alt.Color('year:N', legend=alt.Legend(title=None)),
    yOffset='year:N',
).properties(
    width=500,
    height=300
)

region_chart

In [88]:
region_df = population_df[
    (population_df['dimension'] == 'Region') &
    (population_df['year'].isin([2003, 2023]))
].copy()

# Compute absolute % change and share change
totals = region_df.groupby('year')['n_firms'].sum().reset_index()
totals.columns = ['year', 'total']
region_df = region_df.merge(totals, on='year')
region_df['share'] = (region_df['n_firms'] / region_df['total']) * 100

pivoted = region_df.pivot(index='category', columns='year', values=['n_firms', 'share']).reset_index()
pivoted.columns = ['category', 'firms_2003', 'firms_2023', 'share_2003', 'share_2023']
pivoted['abs_change'] = ((pivoted['firms_2023'] - pivoted['firms_2003']) / pivoted['firms_2003']) * 100
pivoted['share_change'] = pivoted['share_2023'] - pivoted['share_2003']

# Sort by absolute growth
sort_order = pivoted.sort_values('abs_change', ascending=True)['category'].tolist()

# Left panel: absolute % growth (all positive)
left = alt.Chart(pivoted).mark_bar(color='#179fdb').encode(
    x=alt.X('abs_change:Q', axis=alt.Axis(title='Growth in number of firms (%)', format='+.0f')),
    y=alt.Y('category:N', sort=sort_order, axis=alt.Axis(title=None, labelFontSize=11))
).properties(width=250, height=350, title='Absolute growth')

# Right panel: share change (London positive, rest negative)
right = alt.Chart(pivoted).mark_bar().encode(
    x=alt.X('share_change:Q', axis=alt.Axis(title='Change in share (pp)', format='+.1f')),
    y=alt.Y('category:N', sort=sort_order, axis=alt.Axis(title=None, labels=False, ticks=False)),
    color=alt.condition(
        alt.datum.share_change > 0,
        alt.value('#179fdb'),
        alt.value('#e6224b')
    )
).properties(width=250, height=350, title='Change in share of total firms')

# Zero line for right panel
rule = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(
    color='#374151', strokeWidth=1
).encode(x='x:Q')

chart = (left | (right + rule)).configure_view(strokeWidth=0)
chart

In [75]:
# CHANGE IN REGIONAL FIRM AND EMPLOYMENT SHARES BETWEEN START_YEAR AND END_YEAR

START_YEAR = 2000  # note: other cells in this section compare 2003 with 2023
END_YEAR = 2023

# --- Filter to region dimension and endpoint years ---
df_sector = population_df[
    (population_df['dimension'] == 'Region') &
    (population_df['year'].isin([START_YEAR, END_YEAR]))
].copy()

# --- Compute shares for both measures ---
records = []
for measure, label in [('n_firms', 'Firm share'), ('employment', 'Employment share')]:
    temp = df_sector[['year', 'category', measure]].copy()
    totals = temp.groupby('year')[measure].sum().reset_index()
    totals.columns = ['year', 'total']
    temp = temp.merge(totals, on='year')
    temp['share'] = (temp[measure] / temp['total']) * 100
    pivoted = temp.pivot(index='category', columns='year', values='share').reset_index()
    pivoted.columns = ['category', 'start_share', 'end_share']
    pivoted['change'] = pivoted['end_share'] - pivoted['start_share']
    pivoted['measure'] = label
    records.append(pivoted[['category', 'change', 'measure']])

plot_df = pd.concat(records, ignore_index=True)

# --- Sort regions by firm share change ---
sort_order = (
    plot_df[plot_df['measure'] == 'Firm share']
    .sort_values('change')['category']
    .tolist()
)

# --- Grouped bar chart ---
bars = alt.Chart(plot_df).mark_bar().encode(
    x=alt.X('change:Q',
             axis=alt.Axis(title='Change in share (pp)', format='+.1f',
                           grid=True, gridOpacity=0.3),
             scale=alt.Scale(domain=[
                 plot_df['change'].min(),
                 plot_df['change'].max()
             ])),
    y=alt.Y('category:N',
             sort=sort_order,
             axis=alt.Axis(title=None, labelFontSize=11)),
    color=alt.Color('measure:N',
                     scale=alt.Scale(
                         domain=['Firm share','Employment share'],
                         range=["#179fdb","#e6224b"]
                     ),
                     legend=alt.Legend(title=None, orient='none',
                                       legendX=420,   
                                       legendY=10,  
                                       direction='vertical')),
    yOffset=alt.YOffset('measure:N', sort=['Firm share', 'Employment share'])
)

# --- Zero line ---
rule = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(
    color='#374151', strokeWidth=1
).encode(x='x:Q')

# --- Combine ---
chart = (bars + rule).properties(
    width=500,
    height=400
).configure_view(
    strokeWidth=0
)

display(chart)
#chart.save(chart_path / 'Descriptive paper/Composition/BSD_sector_change.png', scale_factor=2.0)

## **2. Entry and exit rates**

This section examines firm entry and exit dynamics over time. Entry rates measure the flow of new firms into the market relative to the total population, while exit rates capture firms leaving the market. These metrics reveal the intensity of business turnover and provide insights into entrepreneurial activity, market competitiveness, and structural changes in the business environment.

**Headline findings**
- Entry and exit rates have remained relatively stable; unlike in the US, there is no pronounced decline.
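
As a worked illustration of the definitions above, the sketch below computes entry, exit and net entry rates against the previous year's firm population. The counts are hypothetical, and the toy data omits firms that enter and exit within the same year, which the notebook adds to both numerators.

```python
import pandas as pd

# Hypothetical counts for a single category over three years (not BSD figures)
toy = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "category": ["All", "All", "All"],
    "n_firms": [1000, 1020, 1010],
    "n_entrants": [0, 120, 90],
    "n_exiters": [0, 100, 100],
})

toy = toy.sort_values(["category", "year"])
# Denominator: the previous year's firm population within each category
toy["n_firms_lag"] = toy.groupby("category")["n_firms"].shift(1)
toy["entry_rate"] = toy["n_entrants"] / toy["n_firms_lag"]
toy["exit_rate"] = toy["n_exiters"] / toy["n_firms_lag"]
toy["net_entry_rate"] = toy["entry_rate"] - toy["exit_rate"]

print(toy[["year", "entry_rate", "exit_rate", "net_entry_rate"]])
```

The first year has no lagged population, so its rates are missing; this is why the rate series starts one year after the data does.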

In [110]:
# Calculate entry and exit rates

firm_dynamics_df = firm_dynamics_df.sort_values(['category','dimension','year'])

firm_dynamics_df['total_firms_lag'] = firm_dynamics_df.groupby(['category','dimension'])['n_firms'].shift(1)

firm_dynamics_df['entry_rate'] = (firm_dynamics_df['n_entrants'] + firm_dynamics_df['n_entry_and_exit']) / firm_dynamics_df['total_firms_lag']
firm_dynamics_df['exit_rate'] = (firm_dynamics_df['n_exiters'] + firm_dynamics_df['n_entry_and_exit']) / firm_dynamics_df['total_firms_lag']

firm_dynamics_df

Unnamed: 0,year,dimension,category,n_firms,employment,n_entrants,n_exiters,n_entry_and_exit,n_reactivations,n_incumbents,total_firms_lag,entry_rate,exit_rate
20,1998,Sector,Agriculture,165454,489985,5872,8469,735.0,0.0,150378,,,
60,1999,Sector,Agriculture,160383,466239,3531,8270,529.0,161.0,147892,165454.0,0.024539,0.053181
100,2000,Sector,Agriculture,157821,442802,4720,7180,642.0,927.0,144352,160383.0,0.033432,0.048771
140,2001,Sector,Agriculture,155900,423667,4152,7843,642.0,950.0,142313,157821.0,0.030376,0.053763
180,2002,Sector,Agriculture,151321,409434,3002,12499,587.0,300.0,134933,155900.0,0.023021,0.083938
...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,2018,Age,Young (3-5 years),430386,1941054,0,54284,2449.0,6878.0,366775,408333.0,0.005998,0.138938
843,2019,Age,Young (3-5 years),460895,2022007,0,59165,2980.0,7983.0,390767,430386.0,0.006924,0.144394
883,2020,Age,Young (3-5 years),453040,1989838,0,59317,3032.0,8333.0,382358,460895.0,0.006579,0.135278
923,2021,Age,Young (3-5 years),434851,1932015,0,61812,3200.0,7738.0,362101,453040.0,0.007063,0.143502


##### **2.1 Aggregate entry/exit rates**
Looking across the whole economy
- Apart from a sharp fall in the entry rate between 2008 and 2012, there is no sign of a long-term decline in entry.
- The rate of entry remained consistently above the rate of exit between 2014 and 2017, leading to an expansion in the business population.

In [111]:
# AGGREGATE: entry and exit rates

# Process entry and exit rates from dataframe
total_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Total']

total_entry_exit_df = total_firm_dynamics_df.melt(id_vars='year',value_vars=['entry_rate','exit_rate'])

# Create chart of entry and exit rates in notebook

entry_exit_chart = alt.Chart(total_entry_exit_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0), title=None),
    y=alt.Y('value:Q', axis=alt.Axis(format='%'), title='% of total firms'),
    color=alt.Color('variable:N')
).properties(
    width=800,
    height=500
)

display(entry_exit_chart)

# Create table of entry and exit rates in notebook
total_entry_exit_df = total_firm_dynamics_df[['year','entry_rate','exit_rate','n_entrants','n_exiters']]

total_entry_exit_table = total_entry_exit_df.style.format({
    'entry_rate': '{:.2%}',
    'exit_rate': '{:.2%}',
    'n_entrants': '{:,.0f}',
    'n_exiters': '{:,.0f}'
}).background_gradient(cmap='YlOrRd', subset=['entry_rate', 'exit_rate'])

display(total_entry_exit_table)


Unnamed: 0,year,entry_rate,exit_rate,n_entrants,n_exiters
39,1998,nan%,nan%,312837,159218
79,1999,12.34%,13.95%,187497,218462
119,2000,12.56%,11.69%,197861,180852
159,2001,12.46%,12.93%,193751,202753
199,2002,12.42%,13.25%,198153,214281
239,2003,12.93%,13.83%,202702,220156
279,2004,15.47%,13.74%,242693,208922
319,2005,14.47%,12.92%,230920,200064
359,2006,13.73%,12.13%,227519,194973
399,2007,14.45%,16.68%,221538,267836


In [114]:
# AVERAGE ENTRY AND EXIT RATES
total_firm_dynamics_df.agg({
    'entry_rate': 'mean',
    'exit_rate': 'mean'})


entry_rate    0.132469
exit_rate     0.132840
dtype: float64

In [None]:
# AGGREGATE: total churn rate (entry rate + exit rate)

# Work on a copy so the new column is not assigned to a slice of firm_dynamics_df
total_firm_dynamics_df = total_firm_dynamics_df.copy()
total_firm_dynamics_df['Total churn rate'] = total_firm_dynamics_df['entry_rate'] + total_firm_dynamics_df['exit_rate']

total_churn_chart = alt.Chart(total_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0), title=None),
    y=alt.Y('Total churn rate:Q', axis=alt.Axis(format='%')),
).properties(title='Total churn rate (entry + exit) in the whole economy',
             width=1000,
             height=600)

display(total_churn_chart)

# Create table of total churn rates in notebook
total_churn_df = total_firm_dynamics_df[['year','Total churn rate','n_entrants','n_exiters']]

total_churn_table = total_churn_df.style.format({
    'Total churn rate': '{:.2%}',
    'n_entrants': '{:,.0f}',
    'n_exiters': '{:,.0f}'
}).background_gradient(cmap='YlOrRd', subset=['Total churn rate'])

display(total_churn_table)



Unnamed: 0,year,Total churn rate,n_entrants,n_exiters
39,1998,nan%,312837,159218
79,1999,26.29%,187497,218462
119,2000,24.25%,197861,180852
159,2001,25.39%,193751,202753
199,2002,25.67%,198153,214281
239,2003,26.76%,202702,220156
279,2004,29.21%,242693,208922
319,2005,27.39%,230920,200064
359,2006,25.86%,227519,194973
399,2007,31.12%,221538,267836


##### **2.2 Entry and exit rate by size**

We expect most entrants to be micro or small firms: it would be unusual for a firm to start up with many employees, although this can happen when a new entity is created through M&A activity.

We also expect small firms to account for a high share of exits, because experimentation and entrepreneurship are concentrated among them, and so is failure.

**Key findings**
- Micro firms continue to drive exit. Most firms are micro, so this sustains the aggregate exit rate.
- The exit rate for firms with over 10 employees has fallen significantly since the early 2000s.

**Questions**
- Is the fall in exit in larger firms consistent across sectors?
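
The claim that micro firms sustain the aggregate exit rate can be checked with a simple decomposition: the aggregate rate is the sum of each size band's exit rate weighted by its share of last year's firms. The sketch below uses hypothetical counts (not BSD figures) and assumes each firm is assigned to one size band in the lagged year.

```python
import pandas as pd

# Hypothetical lagged populations and exits by size band (illustration only)
toy = pd.DataFrame({
    "size_band": ["Micro (0-9)", "Small (10-49)", "Medium (50-249)", "Large (250+)"],
    "n_firms_lag": [9000, 800, 150, 50],
    "n_exiters": [1260, 56, 6, 1],
})

# Exit rate within each band
toy["exit_rate"] = toy["n_exiters"] / toy["n_firms_lag"]
# Weight: the band's share of last year's firm population
toy["weight"] = toy["n_firms_lag"] / toy["n_firms_lag"].sum()
# Contribution of each band to the aggregate exit rate
toy["contribution"] = toy["weight"] * toy["exit_rate"]

aggregate_exit_rate = toy["n_exiters"].sum() / toy["n_firms_lag"].sum()
micro_share = toy.loc[0, "contribution"] / aggregate_exit_rate

print(toy[["size_band", "exit_rate", "weight", "contribution"]])
print(f"Aggregate exit rate: {aggregate_exit_rate:.2%}, micro share of it: {micro_share:.1%}")
```

Because micro firms carry most of the weight, a fall in exit among larger firms barely moves the aggregate rate.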

In [None]:
#  FIRM SIZE: entry and exit rates

size_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Size']


# Create chart of entry and exit rates by size in notebook
# Two side-by-side plots for entry and exit as different mechanisms

# Define entry and exit dataframes
size_entry = size_firm_dynamics_df[['year','category','entry_rate']]
size_exit = size_firm_dynamics_df[['year','category','exit_rate']]

size_entry_chart = alt.Chart(size_entry).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('entry_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
)

size_exit_chart = alt.Chart(size_exit).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
)

size_entry_exit = size_entry_chart | size_exit_chart
display(size_entry_exit)



In [57]:
size_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Size']
size_firm_dynamics_df = size_firm_dynamics_df[size_firm_dynamics_df['year'] >= 1999]

size_exit = size_firm_dynamics_df[['year','category','exit_rate']]

# Get last year for end labels
last_year = size_exit['year'].max()
end_labels = size_exit[size_exit['year'] == last_year].copy()

# Nudge the Large and Medium labels apart so they do not overlap
end_labels.loc[end_labels['category'] == 'Large (250+)', 'exit_rate'] -= 0.0025
end_labels.loc[end_labels['category'] == 'Medium (50-249)', 'exit_rate'] += 0.0025

lines = alt.Chart(size_exit).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N',
                     scale=alt.Scale(
                         domain=['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)'],
                         range=['#eb5c2e', 'rgba(24, 42, 56, 0.4)', 'rgba(24, 42, 56, 0.7)', '#122b39']
                     ),
                     legend=None)
).properties(width=600, height=400)

labels = alt.Chart(end_labels).mark_text(align='left', dx=5, fontSize=11).encode(
    x=alt.X('year:O'),
    y=alt.Y('exit_rate:Q'),
    text='category:N',
    color=alt.Color('category:N',
                     scale=alt.Scale(
                         domain=['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)'],
                         range=['#eb5c2e', 'rgba(24, 42, 56, 0.9)', 'rgba(24, 42, 56, 0.7)', '#122b39']
                     ))
)

size_exit_chart = (lines + labels)

display(size_exit_chart)

size_exit_chart.save(chart_path / 'Descriptive paper/Dynamism/BSD_exit_rates_by_size.png', scale_factor=2.0)



In [60]:
# Average employment at entry over time

# Use cohort table for this and plot avg_size at age 0 across cohorts

size_at_entry = cohort_df[cohort_df['age']==0]

size_at_entry_chart = alt.Chart(size_at_entry).mark_line().encode(
    x=alt.X('cohort:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''", 
            labelAngle=0)),
    y=alt.Y('avg_size:Q')
).properties(height=400, width=600, title='Average employment at entry (age 0) across cohorts')

display(size_at_entry_chart)
size_at_entry_chart.save(chart_path / 'average_size_at_entry.png')


##### **2.3 Entry and exit rate by sector**

- Which sectors have the highest/lowest entry rates? Exit rates?
- How do net entry rates (entry minus exit) vary across sectors?
- Are there sectors with high churn (both high entry and exit)?
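
The two measures in these questions separate direction from intensity: net entry is entry minus exit, while churn is entry plus exit. A minimal sketch with hypothetical sector rates follows; the above-median cut-off for "high churn" is an arbitrary assumption, not a definition used in the notebook.

```python
import pandas as pd

# Hypothetical average rates for three sectors (illustration only)
toy = pd.DataFrame({
    "sector": ["A", "B", "C"],
    "entry_rate": [0.18, 0.10, 0.12],
    "exit_rate": [0.17, 0.09, 0.15],
})

# Direction: is the sector's firm population expanding or shrinking?
toy["net_entry_rate"] = toy["entry_rate"] - toy["exit_rate"]
# Intensity: how much turnover is there in total?
toy["churn_rate"] = toy["entry_rate"] + toy["exit_rate"]
# Flag sectors whose churn is above the cross-sector median (arbitrary cut-off)
toy["high_churn"] = toy["churn_rate"] > toy["churn_rate"].median()

print(toy)
```

Sectors `A` and `B` have almost the same net entry rate but very different churn, so ranking by net entry alone hides how much turnover lies behind it.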

**Average entry/exit rates 1998-2022**



In [None]:
# SECTOR: average entry and exit rates across whole time period, ranked by net entry rate

sector_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Sector']

sector_entry_exit_avg = sector_firm_dynamics_df.groupby('category').agg({'entry_rate':'mean',
                                       'exit_rate':'mean'
})

sector_entry_exit_avg['Net entry rate'] = sector_entry_exit_avg['entry_rate'] - sector_entry_exit_avg['exit_rate']
sector_entry_exit_avg = sector_entry_exit_avg.sort_values(by='Net entry rate', ascending=False)

sector_entry_exit_avg_table = sector_entry_exit_avg.style.format({
    'entry_rate': '{:.2%}',
    'exit_rate': '{:.2%}',
    'Net entry rate': '{:.2%}'
}).background_gradient(cmap='YlOrRd', subset=['Net entry rate'])

display(sector_entry_exit_avg_table)

category,entry_rate,exit_rate,Net entry rate
Utilities,22.48%,13.91%,8.57%
Professional Services,15.45%,13.84%,1.61%
Other Information Services,14.92%,13.51%,1.41%
Business Support Services,19.09%,17.69%,1.40%
IT & Computer Services,17.96%,16.95%,1.01%
Transport & Logistics,17.33%,16.48%,0.84%
Construction,13.83%,13.29%,0.54%
Social care,9.63%,9.41%,0.22%
Hospitality,17.28%,17.59%,-0.30%
Other Services,10.71%,11.18%,-0.47%


In [35]:
#  SECTOR: entry and exit rates on facet plot

sector_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Sector']

sectoral_entry_exit_df = sector_firm_dynamics_df.melt(id_vars=['year','category'],value_vars=['entry_rate','exit_rate'])

# Display facet charts of entry and exit across sectors

sector_entry_exit_chart = alt.Chart(sectoral_entry_exit_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N'),
    facet=alt.Facet('category', columns=3)
).resolve_scale(y='independent')   

sector_entry_exit_chart

In [36]:
# SECTOR: net entry rate of all sectors on a single line chart

# Copy the slice so the new column does not trigger a SettingWithCopyWarning
sector_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Sector'].copy()

sector_firm_dynamics_df['Net entry rate'] = sector_firm_dynamics_df['entry_rate'] - sector_firm_dynamics_df['exit_rate']

# Exclude Utilities for now because it is an outlier
sector_firm_dynamics_df = sector_firm_dynamics_df[sector_firm_dynamics_df['category'] != 'Utilities']

# Clicking a sector in the legend highlights it and fades the others
selection = alt.selection_point(fields=['category'], bind='legend')

sector_net_entry_chart = alt.Chart(sector_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''", 
            labelAngle=0)),
    y=alt.Y('Net entry rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N'),
    opacity=alt.condition(selection, alt.value(1), alt.value(0.15)),
    tooltip=[
        alt.Tooltip('year:Q', title='Year'),
        alt.Tooltip('Net entry rate:Q', title='Net entry rate', format='.2%'),
        alt.Tooltip('entry_rate:Q', title='Entry rate', format='.2%'),
        alt.Tooltip('category:N', title='Sector')
    ]
).add_params(selection).properties(height=600, width=1000)

sector_net_entry_chart



In [20]:
# SECTOR: scatter plot of entry and exit rates across sectors in two time periods
sector_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Sector']
df = sector_firm_dynamics_df.copy()


df['period'] = df['year'].apply(lambda x: 'Pre-2008' if x < 2008 else 'Post-2008')

# Calculate average entry and exit rates for each sector in each period
sector_averages = df.groupby(['category', 'period']).agg({
    'entry_rate': 'mean',
    'exit_rate': 'mean'
}).reset_index()

# Create faceted scatter plot
scatter = alt.Chart(sector_averages).mark_circle(size=200).encode(
    x=alt.X('entry_rate:Q', 
            title='Average Entry Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    y=alt.Y('exit_rate:Q', 
            title='Average Exit Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    color=alt.Color('category:N', 
                    title='Sector',
                    legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'entry_rate:Q', 'exit_rate:Q']
).properties(
    width=500,
    height=500
).facet(
    column=alt.Column('period:N', title=None, header=alt.Header(labelFontSize=14),
                      sort=['Pre-2008', 'Post-2008'])
)

scatter.display()


In [32]:
# Define period bins
bins = [1997, 2007, 2016, 2022]
labels = ['1998-2007', '2008-2016', '2017-2022']

df_sector = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Sector'].copy()
df_sector['period'] = pd.cut(df_sector['year'], bins=bins, labels=labels)

table = (
    df_sector
    .groupby(['category', 'period'], observed=False)['entry_rate']
    .mean()
    .mul(100)
    .round(1)
    .unstack('period')
    .reset_index()
)

table.columns.name = None
table = table.rename(columns={'category': 'Industry'})
table['avg'] = table[labels].mean(axis=1)
table = table.sort_values('avg', ascending=False).drop(columns='avg').reset_index(drop=True)
table.style.hide(axis='index').format('{:.1f}', subset=labels)


Industry,1998-2007,2008-2016,2017-2022
Utilities,33.3,20.8,12.3
Business Support Services,22.5,16.3,18.1
Transport & Logistics,13.6,15.0,26.5
IT & Computer Services,22.2,17.5,12.3
Hospitality,19.2,15.3,17.3
Professional Services,15.8,17.2,12.3
Other Information Services,15.1,16.1,12.9
Construction,14.4,13.0,14.3
Retail Trade,11.2,11.7,15.7
Other Services,11.7,10.0,10.3


##### **2.4 Entry and exit rate by region**

In [None]:
# REGION: average entry and exit rates across whole time period, ranked by net entry rate

regional_entry_exit_avg = region_dynamism.groupby('region').agg({'Entry rate':'mean',
                                       'Exit rate':'mean'
})

regional_entry_exit_avg['Net entry rate'] = regional_entry_exit_avg['Entry rate'] - regional_entry_exit_avg['Exit rate']
regional_entry_exit_avg = regional_entry_exit_avg.sort_values(by='Net entry rate', ascending=False)

regional_entry_exit_avg_table = regional_entry_exit_avg.style.format({
    'Entry rate': '{:.2%}',
    'Exit rate': '{:.2%}',
    'Net entry rate': '{:.2%}'
}).background_gradient(cmap='YlOrRd', subset=['Net entry rate'])

display(regional_entry_exit_avg_table)

region,Entry rate,Exit rate,Net entry rate
London,17.92%,15.48%,2.44%
North West,14.97%,13.99%,0.98%
North East,14.05%,13.13%,0.93%
South East,14.02%,13.18%,0.84%
Northern Ireland,10.04%,9.24%,0.80%
East Of England,13.42%,12.65%,0.77%
West Midlands,13.70%,12.97%,0.74%
Yorkshire and The Humber,13.52%,12.79%,0.74%
East Midlands,13.38%,12.66%,0.72%
South West,12.17%,11.67%,0.50%


In [None]:
# REGION: average entry and exit rates across whole time period, ranked by net entry rate

region_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Region']

regional_entry_exit_avg = region_firm_dynamics_df.groupby('category').agg({'entry_rate':'mean',
                                       'exit_rate':'mean'
})

regional_entry_exit_avg['Net entry rate'] = regional_entry_exit_avg['entry_rate'] - regional_entry_exit_avg['exit_rate']
regional_entry_exit_avg = regional_entry_exit_avg.sort_values(by='Net entry rate', ascending=False)

regional_entry_exit_avg_table = regional_entry_exit_avg.style.format({
    'entry_rate': '{:.2%}',
    'exit_rate': '{:.2%}',
    'Net entry rate': '{:.2%}'
}).background_gradient(cmap='YlOrRd', subset=['Net entry rate'])

display(regional_entry_exit_avg_table)

category,entry_rate,exit_rate,Net entry rate
London,17.13%,15.74%,1.40%
North West,14.17%,14.20%,-0.02%
North East,13.24%,13.29%,-0.05%
South East,13.13%,13.31%,-0.18%
West Midlands,12.92%,13.12%,-0.20%
East Of England,12.62%,12.82%,-0.20%
Yorkshire and The Humber,12.70%,12.93%,-0.23%
East Midlands,12.56%,12.80%,-0.24%
South West,11.19%,11.77%,-0.58%
Scotland,11.59%,12.21%,-0.62%


In [None]:
#  REGION: entry and exit rates
region_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Region']

region_entry_exit_df = region_firm_dynamics_df.melt(id_vars=['year','category'],value_vars=['entry_rate','exit_rate'])

# Display facet charts of entry and exit across regions

region_entry_exit_chart = alt.Chart(region_entry_exit_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''", 
            labelAngle=0)),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N'),
    facet=alt.Facet('category', columns=3)
).resolve_scale(y='independent')

region_entry_exit_chart

In [38]:
# REGION: net entry rate of all regions on a single line chart

# Copy the filtered slice so the new column can be added without a SettingWithCopyWarning
region_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Region'].copy()

region_firm_dynamics_df['Net entry rate'] = region_firm_dynamics_df['entry_rate'] - region_firm_dynamics_df['exit_rate']

# Perhaps highlight certain regions and make others transparent for better visualisation?

region_net_entry_chart = alt.Chart(region_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''", 
            labelAngle=0)),
    y=alt.Y('Net entry rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
).properties(height=600, width=1000)

rule = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(
    color='#374151', strokeWidth=1
).encode(y='y:Q')

region_net_entry_chart = region_net_entry_chart + rule

region_net_entry_chart



In [23]:
# REGION: scatter plot of entry and exit rates across regions pre-GFC and post-GFC
region_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension']=='Region']
df = region_firm_dynamics_df.copy()

df['period'] = df['year'].apply(lambda x: 'Pre-2008' if x < 2008 else 'Post-2008')

# Calculate average entry and exit rates for each region in each period
region_averages = df.groupby(['category', 'period']).agg({
    'entry_rate': 'mean',
    'exit_rate': 'mean'
}).reset_index()

# Create faceted scatter plot
scatter = alt.Chart(region_averages).mark_circle(size=200).encode(
    x=alt.X('entry_rate:Q', 
            title='Average Entry Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    y=alt.Y('exit_rate:Q', 
            title='Average Exit Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    color=alt.Color('category:N', 
                    title='Region',
                    legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'entry_rate:Q', 'exit_rate:Q']
).properties(
    width=400,
    height=400
).facet(
    column=alt.Column('period:N', title=None, header=alt.Header(labelFontSize=14),
                      sort=['Pre-2008', 'Post-2008'])
)

scatter.display()


##### **2.5 Exit rates by age**

All entering firms are new (0-2 years) by definition, but the rate of exit differs across firm age. This measure provides insight into the ability of firms to survive and the persistence of old incumbent firms. For example, a falling exit rate for old firms over time might indicate a lack of competitive pressure, or the use of anti-competitive practices by existing firms to maintain dominant positions.

In [None]:
#  FIRM AGE: exit rates

age_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Age']

age_exit_chart = alt.Chart(age_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
).properties(height=600, width=1000)

display(age_exit_chart)

# Display table of exit rates by age group

age_exit_table = age_firm_dynamics_df[['year','category','exit_rate']]
age_exit_table = age_exit_table.pivot(index='year', columns='category', values='exit_rate').reset_index()

# Apply styling - need to format all columns except 'year'
exit_rate_columns = [col for col in age_exit_table.columns if col != 'year']

age_exit_table_styled = age_exit_table.style.format(
    {col: '{:.2%}' for col in exit_rate_columns}  # Format fractions as percentages, e.g. 0.1381 -> 13.81%
).background_gradient(
    cmap='YlOrRd', 
    subset=exit_rate_columns,
    axis=0 
)

display(age_exit_table_styled)

category,year,Mature (6-10 years),New (0-2 years),Old (11+ years),Young (3-5 years)
0,1998,nan%,nan%,nan%,nan%
1,1999,13.81%,20.64%,6.72%,16.65%
2,2000,9.78%,20.95%,6.32%,8.96%
3,2001,11.46%,17.04%,6.94%,21.25%
4,2002,12.36%,19.46%,7.87%,14.78%
5,2003,9.18%,20.26%,8.33%,18.85%
6,2004,15.86%,21.39%,7.47%,12.89%
7,2005,10.68%,19.46%,7.09%,15.07%
8,2006,10.35%,17.76%,6.89%,13.78%
9,2007,15.40%,24.27%,8.15%,20.97%


##### **2.5 Exit rates by productivity**

Are the least productive firms within industries more likely to exit? This can provide insight into the extent to which market selection is driving productivity growth. A falling exit rate for low productivity firms over time might indicate a weakening of market selection forces, which could be a factor behind slowing productivity growth.
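
One simple way to summarise the strength of market selection is the gap between the exit rate of laggards and the exit rate of frontier firms in each year. The sketch below is illustrative only: `toy_df` is a hypothetical table with made-up values that mirrors the `year`, `category` and `exit_rate` columns of `firm_dynamics_df`, and the `selection_gap` helper is an assumption rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data mirroring the layout of firm_dynamics_df (values are made up)
toy_df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'category': ['Laggards (<P10)', 'Frontier (P90+)'] * 2,
    'exit_rate': [0.20, 0.08, 0.15, 0.09],
})

def selection_gap(df, low='Laggards (<P10)', high='Frontier (P90+)'):
    """Return the laggard-minus-frontier exit rate gap for each year."""
    wide = df.pivot(index='year', columns='category', values='exit_rate')
    return (wide[low] - wide[high]).rename('exit_rate_gap')

gap = selection_gap(toy_df)
print(gap)
```

A gap that narrows over time would be consistent with weakening selection, although it says nothing about the cause.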

In [63]:
#  FIRM PRODUCTIVITY: exit rates

prod_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Productivity']
prod_firm_dynamics_df = prod_firm_dynamics_df[prod_firm_dynamics_df['year'] >=1999]

line = alt.Chart(prod_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%'), title='Firm exit rate by productivity group'),
    color=alt.Color('category:N')
).properties(height=600, width=800)


# End labels - filter to last year for each category (copy so label_y can be added safely)
last_points = prod_firm_dynamics_df.loc[
    prod_firm_dynamics_df.groupby('category')['year'].idxmax()
].copy()

# Adjust y-values directly for label positioning
last_points['label_y'] = last_points.apply(
    lambda row: row['exit_rate'] + 0.003 if row['category'] == 'High-Median (P50-P90)' 
                else row['exit_rate'] - 0.003 if row['category'] == 'Low-Median (P10-P50)'
                else row['exit_rate'],
    axis=1
)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('label_y:Q'),  # Use adjusted y position
    text='category:N',
    color=alt.Color('category:N', legend=None)
)
# Combine
prod_exit_chart = (line + labels).properties(
    height=400, 
    width=600
)

display(prod_exit_chart)
prod_exit_chart.save(chart_path / 'Exploratory/exit_rates_by_productivity.png')

# Display table of exit rates by productivity group

prod_exit_table = prod_firm_dynamics_df[['year','category','exit_rate']]
prod_exit_table = prod_exit_table.pivot(index='year', columns='category', values='exit_rate').reset_index()

# Apply styling - need to format all columns except 'year'
exit_rate_columns = [col for col in prod_exit_table.columns if col != 'year']

prod_exit_table_styled = prod_exit_table.style.format(
    {col: '{:.2%}' for col in exit_rate_columns}  # Format fractions as percentages, e.g. 0.0825 -> 8.25%
).background_gradient(
    cmap='YlOrRd', 
    subset=exit_rate_columns,
    axis=0 
)

display(prod_exit_table_styled)

category,year,Frontier (P90+),High-Median (P50-P90),Laggards (<P10),Low-Median (P10-P50)
0,1999,8.25%,15.54%,19.80%,12.42%
1,2000,7.13%,10.95%,14.54%,12.68%
2,2001,8.23%,12.60%,19.96%,12.61%
3,2002,8.64%,12.25%,22.73%,12.75%
4,2003,9.22%,13.39%,19.47%,13.81%
5,2004,8.32%,14.42%,15.24%,14.02%
6,2005,8.30%,12.98%,15.28%,13.32%
7,2006,8.00%,11.37%,15.69%,12.79%
8,2007,10.07%,16.14%,21.10%,17.48%
9,2008,9.09%,14.60%,18.32%,15.34%


In [64]:
prod_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Productivity']
prod_firm_dynamics_df = prod_firm_dynamics_df[prod_firm_dynamics_df['year'] >=1999]

color_scale = alt.Scale(
    domain=['Laggards (<P10)', 'Low-Median (P10-P50)', 'High-Median (P50-P90)', 'Frontier (P90+)'],
    range=['#e6224b', '#eb5c2e', '#122b39', '#36b7b4']
)

line = alt.Chart(prod_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%'), title='Firm exit rate by productivity group'),
    color=alt.Color('category:N', scale=color_scale, legend=None)
).properties(height=600, width=800)

# Copy the selected rows so label_y can be added without a SettingWithCopyWarning
last_points = prod_firm_dynamics_df.loc[
    prod_firm_dynamics_df.groupby('category')['year'].idxmax()
].copy()

last_points['label_y'] = last_points.apply(
    lambda row: row['exit_rate'] + 0.003 if row['category'] == 'High-Median (P50-P90)' 
                else row['exit_rate'] - 0.003 if row['category'] == 'Low-Median (P10-P50)'
                else row['exit_rate'],
    axis=1
)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('label_y:Q'),
    text='category:N',
    color=alt.Color('category:N', scale=color_scale, legend=None)
)

prod_exit_chart = (line + labels).properties(
    height=400, 
    width=600
)

display(prod_exit_chart)
#prod_exit_chart.save(chart_path / 'Exploratory/exit_rates_by_productivity.png')

In [None]:
# FIRM EXIT BY PRODUCTIVITY WITH MIDDLE GROUPS COMBINED
prod_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Productivity']
prod_firm_dynamics_df = prod_firm_dynamics_df[prod_firm_dynamics_df['year'] >= 1999].copy()

# Combine middle categories
prod_firm_dynamics_df['category'] = prod_firm_dynamics_df['category'].replace({
    'Low-Median (P10-P50)': 'Middle (P10-P90)',
    'High-Median (P50-P90)': 'Middle (P10-P90)'
})

# Average the exit rates for the combined middle group
prod_firm_dynamics_df = (
    prod_firm_dynamics_df
    .groupby(['year', 'category'])['exit_rate']
    .mean()
    .reset_index()
)

color_scale = alt.Scale(
    domain=['Laggards (<P10)', 'Middle (P10-P90)', 'Frontier (P90+)'],
    range=['#e6224b', '#122b39', "#179fdb"]
)

line = alt.Chart(prod_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('exit_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N', scale=color_scale, legend=None)
).properties(height=600, width=800)

last_points = prod_firm_dynamics_df.loc[
    prod_firm_dynamics_df.groupby('category')['year'].idxmax()
].copy()

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('exit_rate:Q'),
    text='category:N',
    color=alt.Color('category:N', scale=color_scale, legend=None)
)

prod_exit_chart = (line + labels).properties(
    height=400,
    width=600
)

display(prod_exit_chart)
prod_exit_chart.save(chart_path / 'Descriptive paper/Dynamism/BSD_exit_rates_by_productivity.png')

##### **2.6 Reactivations analysis**

A reactivating firm is one which was active in previous years, ceased operation, and then started up again. If reactivations are not accounted for in the definition of firm entry, they can inflate entry counts.

This is primarily a consistency check, likely to feature in the annex. In the early stages of the analysis I realised that one way a firm can appear as an entrant is by reactivating.
Here is an example of how a reactivating firm appears in the BSD panel.

|Year|Entref|Status|
|----|------|-------|
|2000|ENTREF1|Entrant|
|2001|ENTREF1|Incumbent|
|2002|ENTREF1|Incumbent|
|2003|ENTREF1|Exit|
|...|...|...|
|2010|ENTREF1|Reactivation|
|2011|ENTREF1|Incumbent|
|2012|ENTREF1|Exit|

The average number of reactivations across the panel is __%, increasing to __% post-2010.

To assess the prevalence of firms reactivating over time, we plot the entry rate both with and without reactivations.
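
The status labels in the example table above can be reproduced from the years in which a firm is observed as active. The following is a minimal sketch under stated assumptions: `panel` is a hypothetical firm-year table with one row per active year, `classify_status` and `last_panel_year` are illustrative names, and the actual classification applied to the BSD in SecureLab may differ.

```python
import pandas as pd

# Hypothetical firm-year panel: one row per year in which the firm is observed as active
panel = pd.DataFrame({
    'entref': ['ENTREF1'] * 7,
    'year': [2000, 2001, 2002, 2003, 2010, 2011, 2012],
})

def classify_status(panel, last_panel_year):
    """Label each firm-year as Entrant, Incumbent, Exit or Reactivation."""
    df = panel.sort_values(['entref', 'year']).copy()
    prev_year = df.groupby('entref')['year'].shift(1)
    next_year = df.groupby('entref')['year'].shift(-1)

    df['status'] = 'Incumbent'
    # Last active year before a gap, or before leaving the panel for good
    is_exit = (next_year != df['year'] + 1) & (df['year'] < last_panel_year)
    df.loc[is_exit, 'status'] = 'Exit'
    # Active again after a gap of at least one year
    is_reactivation = prev_year.notna() & (prev_year != df['year'] - 1)
    df.loc[is_reactivation, 'status'] = 'Reactivation'
    # First ever appearance in the panel
    df.loc[prev_year.isna(), 'status'] = 'Entrant'
    return df

result = classify_status(panel, last_panel_year=2023)
print(result)
```

Because reactivations are labelled separately from first appearances, they can be added to or excluded from the entry count explicitly.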


In [None]:
# AVERAGE REACTIVATIONS OVER TIME (PCT OF FIRMS)
total_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Total'].copy()  # copy avoids SettingWithCopyWarning
total_firm_dynamics_df['reactivation_share'] = total_firm_dynamics_df['n_reactivations'] / total_firm_dynamics_df['n_firms']

avg_reactivation_share = total_firm_dynamics_df['reactivation_share'].mean()
print(f"Average Reactivation Share: {avg_reactivation_share:.2%}")

# Average for years strictly before 2008
avg_before = total_firm_dynamics_df[total_firm_dynamics_df['year'] < 2008]['reactivation_share'].mean()

# Average for 2008 and everything after
avg_after = total_firm_dynamics_df[total_firm_dynamics_df['year'] >= 2008]['reactivation_share'].mean()

print(f"Pre-2008: {avg_before:.2%}") 
print(f"Post-2008: {avg_after:.2%}")

# Plot reactivation share each year
chart = alt.Chart(total_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('reactivation_share:Q', axis=alt.Axis(format='%'), title='Reactivation share (% of firms)'))

chart

Average Reactivation Share: 0.90%
Pre-2008: 0.56%
Post-2008: 1.13%




In [72]:
# ENTRY RATE WITH AND WITHOUT REACTIVATIONS
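# Copy the slice and sort by year so the one-period lag refers to the previous year
total_firm_dynamics_df = firm_dynamics_df[firm_dynamics_df['dimension'] == 'Total'].copy()
total_firm_dynamics_df = total_firm_dynamics_df.sort_values('year')

total_firm_dynamics_df['total_firms_lag'] = total_firm_dynamics_df.groupby(['category','dimension'])['n_firms'].shift(1)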


total_firm_dynamics_df['entry_w_reactivations'] = total_firm_dynamics_df['n_entrants'] + total_firm_dynamics_df['n_entry_and_exit'] + total_firm_dynamics_df['n_reactivations']
total_firm_dynamics_df['entry_w_reactivation_rate'] = total_firm_dynamics_df['entry_w_reactivations'] / total_firm_dynamics_df['total_firms_lag']

total_firm_dynamics_df = total_firm_dynamics_df.melt(id_vars='year', value_vars=['entry_rate','entry_w_reactivation_rate'])

total_firm_dynamics_df = total_firm_dynamics_df[total_firm_dynamics_df['year']>=1999]

reactivation_chart = alt.Chart(total_firm_dynamics_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('value:Q', axis=alt.Axis(format='%')),
    color=alt.Color('variable:N')
).properties(height=600,width=1000, title='Entry rates with and without reactivations')

reactivation_chart


#### **3. Survival and growth (cohort analysis)**
- Are firms surviving at the same rate over time?
- Are firms growing at the same rate over time?
- Cross-sectional differences across industries: which sectors/regions perform better?

In this section we focus on firms that entered the market after 1998, assessing the performance of these new entering firms.

As a useful summary statistic, can I assess the share of activity from firms active prior to 1998 (i.e. already in the panel) versus new entrants each year? This might need to be computed in SecureLab.
- Define group with age > 0 in 1998.
- Sum turnover and employment for this group, grouped by year
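
The two steps above could be sketched as follows. This is a hypothetical illustration only: the `firms` table, its column names and its values are made up, and the real calculation would need firm-level records inside SecureLab. Only employment is shown; turnover would follow the same pattern.

```python
import pandas as pd

# Hypothetical firm-level panel (made-up values); column names are assumptions
firms = pd.DataFrame({
    'entref':     ['A', 'A', 'B', 'B', 'C'],
    'year':       [1998, 1999, 1998, 1999, 1999],
    'age':        [5, 6, 0, 1, 0],
    'employment': [60, 50, 40, 30, 20],
})

# Step 1: firms already operating before the panel starts (age > 0 in 1998)
pre_panel_ids = firms.loc[(firms['year'] == 1998) & (firms['age'] > 0), 'entref']
firms['group'] = firms['entref'].isin(pre_panel_ids).map(
    {True: 'Pre-1998 firms', False: 'Panel entrants'}
)

# Step 2: sum employment by year and group, then take the pre-1998 share
totals = firms.groupby(['year', 'group'])['employment'].sum().unstack(fill_value=0)
pre_1998_share = totals['Pre-1998 firms'] / totals.sum(axis=1)
print(pre_1998_share)
```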



##### **3.1 Survival rates (cohorts)**

- Survival probability.

Each cohort of firms that set up each year in the UK will face unique, time-specific challenges.
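
The `kaplan_meier_rate` column used below comes from the exported tables. As a rough guide to what it measures, a Kaplan-Meier survival curve multiplies together the conditional probabilities of surviving each successive age. The sketch below uses made-up counts for a single cohort; the column names `n_at_risk` and `n_exits` are assumptions and may not match how the rate was computed in SecureLab.

```python
import pandas as pd

# Hypothetical counts for one cohort (made-up values)
cohort = pd.DataFrame({
    'age':       [1, 2, 3],
    'n_at_risk': [1000, 800, 600],   # firms still active at the start of each age
    'n_exits':   [200, 200, 150],    # firms exiting during that age
})

# Kaplan-Meier estimator: cumulative product of the conditional survival probabilities
cohort['conditional_survival'] = 1 - cohort['n_exits'] / cohort['n_at_risk']
cohort['km_survival'] = cohort['conditional_survival'].cumprod()
print(cohort[['age', 'km_survival']])
```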

In [83]:
# 3.1 Survival by Cohort


# Plot the survival rates of each cohort
chart = alt.Chart(cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('kaplan_meier_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('cohort:N')
)
display(chart)

# Select cohorts to plot
first_ten_cohort_df = cohort_df[cohort_df['age'] <=10]

selected_cohorts = first_ten_cohort_df[first_ten_cohort_df['cohort'].isin([2000,2005,2010,2015])]

selected_cohort_survival_chart = alt.Chart(selected_cohorts).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('kaplan_meier_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('cohort:N')
).properties(width=600, height=500)
display(selected_cohort_survival_chart)

# Survival probability at ages three and five across cohorts
threeyr_survival = cohort_df[cohort_df['age']==3]
fiveyr_survival = cohort_df[cohort_df['age']==5]






In [108]:
first_ten_cohort_df = cohort_df[cohort_df['age'] <= 10]
selected_cohorts = first_ten_cohort_df[first_ten_cohort_df['cohort'].isin([2000, 2005, 2010, 2015])]

color_scale = alt.Scale(
    domain=[2000, 2005, 2010, 2015],
    range=['#36b7b4', '#eb5c2e', '#179fdb', '#122b39']
)

last_points = selected_cohorts.loc[
    selected_cohorts.groupby('cohort')['age'].idxmax()
].copy()

last_points.loc[last_points['cohort'] == 2015, 'kaplan_meier_rate'] += 0.022
last_points.loc[last_points['cohort'] == 2005, 'kaplan_meier_rate'] -= 0.03


lines = alt.Chart(selected_cohorts).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('kaplan_meier_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('cohort:N', scale=color_scale, legend=None)
).properties(width=500, height=400)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('age:O'),
    y=alt.Y('kaplan_meier_rate:Q'),
    text='cohort:N',
    color=alt.Color('cohort:N', scale=color_scale, legend=None)
)

rules = alt.Chart(pd.DataFrame({'age': [3, 5]})).mark_rule(
    strokeDash=[4, 4],
    color='#374151',
    strokeWidth=1,
    opacity=0.5
).encode(
    x='age:O'
)

selected_cohort_survival_chart = (lines + labels + rules)
display(selected_cohort_survival_chart)
selected_cohort_survival_chart.save(chart_path / 'Descriptive paper/Dynamism/BSD_survival_by_cohort.png', scale_factor=2.0)

In [None]:
# Does the average survival probability change before/after GFC?

# Let's start with 3 year survival
threeyr_survival = cohort_df[cohort_df['age']==3]

avg_threeyr_survival_rate = threeyr_survival['kaplan_meier_rate'].mean()
print(f"Average 3 yr KM rate: {avg_threeyr_survival_rate:.2%}")

# Average for years strictly before 2008
avg_before = threeyr_survival[threeyr_survival['cohort'] < 2008]['kaplan_meier_rate'].mean()

# Average for 2008 and everything after
avg_after = threeyr_survival[threeyr_survival['cohort'] >= 2008]['kaplan_meier_rate'].mean()

print(f"Pre-2008: {avg_before:.2%}") 
print(f"Post-2008: {avg_after:.2%}")

# Now with 5 year survival
fiveyr_survival = cohort_df[cohort_df['age']==5]

avg_fiveyr_survival_rate = fiveyr_survival['kaplan_meier_rate'].mean()
print(f"Average 5 yr KM rate: {avg_fiveyr_survival_rate:.2%}")

# Average for years strictly before 2008
avg_before = fiveyr_survival[fiveyr_survival['cohort'] < 2008]['kaplan_meier_rate'].mean()

# Average for 2008 and everything after
avg_after = fiveyr_survival[fiveyr_survival['cohort'] >= 2008]['kaplan_meier_rate'].mean()

print(f"Pre-2008: {avg_before:.2%}") 
print(f"Post-2008: {avg_after:.2%}")

# Now with 7 year survival
sevenyr_survival = cohort_df[cohort_df['age']==7]

avg_sevenyr_survival_rate = sevenyr_survival['kaplan_meier_rate'].mean()
print(f"Average 7 yr KM rate: {avg_sevenyr_survival_rate:.2%}")

# Average for years strictly before 2008
avg_before = sevenyr_survival[sevenyr_survival['cohort'] < 2008]['kaplan_meier_rate'].mean()

# Average for 2008 and everything after
avg_after = sevenyr_survival[sevenyr_survival['cohort'] >= 2008]['kaplan_meier_rate'].mean()

print(f"Pre-2008: {avg_before:.2%}") 
print(f"Post-2008: {avg_after:.2%}")

Average 3 yr KM rate: 43.18%
Pre-2008: 42.70%
Post-2008: 43.58%
Average 5 yr KM rate: 32.25%
Pre-2008: 31.10%
Post-2008: 33.40%
Average 7 yr KM rate: 25.33%
Pre-2008: 24.10%
Post-2008: 26.88%


##### **3.2 Growth rates (cohorts)**

Having established that firms entering post-GFC are more likely to survive - by a small margin - the next natural step is to consider what happens to these firms following entry. 

We have calculated DHS (Davis-Haltiwanger-Schuh) growth rates for each cohort. Under the standard DHS definition, the growth rate is the change in employment between the previous and current year divided by the average employment across those two years, which yields a value bounded between -2 and 2. The average of these firm-level growth rates is taken for all firms born in a specific year, at each age.
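
For reference, the standard DHS growth rate is $g_t = (E_t - E_{t-1}) / \big(0.5\,(E_t + E_{t-1})\big)$, where $E_t$ is employment in year $t$. The sketch below applies this formula to made-up employment pairs to show where the bounds of -2 and 2 come from. It assumes the exported `mean_dhs_growth` column follows the standard definition; `dhs_growth` and the column names are illustrative.

```python
import pandas as pd

def dhs_growth(current, previous):
    """DHS growth rate: change in employment over the two-period average."""
    return (current - previous) / (0.5 * (current + previous))

# Made-up employment pairs: growth, exit (current = 0) and entry (previous = 0)
examples = pd.DataFrame({
    'emp_prev': [10, 10, 0],
    'emp_curr': [15, 0, 10],
})
examples['dhs_growth'] = dhs_growth(examples['emp_curr'], examples['emp_prev'])
print(examples)
```

An exiting firm takes the value -2 and an entering firm takes the value 2, which is why the measure can handle entry and exit where a conventional percentage change cannot.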

**Average growth over first five years across cohorts**
- The 2010 cohort stands out dramatically as having the highest average growth rate over the first five years (~0.064), suggesting firms born during the recovery from the Great Recession experienced exceptionally strong early growth.
- There's a clear structural break around the financial crisis period. Pre-crisis cohorts (1998-2006) show relatively modest and stable growth rates, mostly clustering between 0.040-0.052. Post-crisis cohorts (2007-2013) generally show elevated growth rates, with most exceeding 0.052.
- The most recent cohorts (2014-2017) show a declining trend, returning to levels closer to the late 1990s/early 2000s baseline. This could suggest a normalization after the post-crisis rebound period.

This higher growth rate among firms that started up after the GFC seems counter-intuitive. It could reflect survivor bias: the firms that did enter were the most viable, resilient or necessary ventures, and so grew more quickly.

In [None]:
# Average growth rate over the first five years of each cohort (cohorts up to 2017)
first_five_yr_growth = cohort_df[cohort_df['age']<=5]
first_five_yr_growth = first_five_yr_growth[first_five_yr_growth['cohort']<=2017]

avg_first_five_yr_growth = first_five_yr_growth.groupby('cohort').agg({'mean_dhs_growth':'mean'}).reset_index()

chart = alt.Chart(avg_first_five_yr_growth).mark_bar().encode(
    x=alt.X('mean_dhs_growth:Q'),
    y=alt.Y('cohort:O')
).properties(width=1000, height=600, title='Average growth rate over first five years of life across cohorts')

display(chart)
chart.save(chart_path / 'cohort_first_five_yr_growth.png')

**PROBLEM WITH DHS GROWTH RATES**


In [None]:
# 3.2 Growth rates by cohort 

chart = alt.Chart(cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('mean_dhs_growth:Q'),
    color=alt.Color('cohort:N')
)
display(chart)
chart.save(chart_path / 'Exploratory/cohort_growth_rates.png')

# Select cohorts to plot
selected_cohorts = cohort_df[cohort_df['cohort'].isin([2000,2003,2005,2007,2010])]

# Named for growth so it does not overwrite the survival chart variable
selected_cohort_growth_chart = alt.Chart(selected_cohorts).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('mean_dhs_growth:Q'),
    color=alt.Color('cohort:N')
)
display(selected_cohort_growth_chart)


In [None]:
# Dispersion of growth rates by cohort

chart = alt.Chart(cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('sd_dhs_growth:Q'),
    color=alt.Color('cohort:N')
)
display(chart)


# Select cohorts to plot
selected_cohorts = cohort_df[cohort_df['cohort'].isin([2000,2003,2005,2007,2010,2013,2015])]

# Named for dispersion so it does not overwrite the survival chart variable
selected_cohort_dispersion_chart = alt.Chart(selected_cohorts).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('sd_dhs_growth:Q'),
    color=alt.Color('cohort:N')
)
display(selected_cohort_dispersion_chart)

In [None]:
# Share of high growth firms in each cohort at each age

cohort_df['high_growth_firm_share'] = cohort_df['hgf_count'] / cohort_df['n_firms']

chart = alt.Chart(cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('high_growth_firm_share:Q', axis=alt.Axis(format='%')),
    color=alt.Color('cohort:N')
)
display(chart)

In [None]:
# Share of stagnant firms at each age
cohort_df['stagnant_firm_share'] = cohort_df['stagnant_count'] / cohort_df['n_firms']
cohort_df_excluding_age_zero = cohort_df[cohort_df['age'] > 0] 
cohort_df_excluding_age_zero = cohort_df_excluding_age_zero.groupby('age').agg({'stagnant_firm_share':'mean'}).reset_index()

chart = alt.Chart(cohort_df_excluding_age_zero).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('stagnant_firm_share:Q', axis=alt.Axis(format='%')),
)
display(chart)

In [None]:
# Share of stagnant firms in each cohort at each age

cohort_df['stagnant_firm_share'] = cohort_df['stagnant_count'] / cohort_df['n_firms']
cohort_df_excluding_age_zero = cohort_df[cohort_df['age'] > 0] 

chart = alt.Chart(cohort_df_excluding_age_zero).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('stagnant_firm_share:Q', axis=alt.Axis(format='%')),
    color=alt.Color('cohort:N')
)
display(chart)

##### **3.3 Average size of cohorts at each age**
This measure tells us how the size of a firm, measured in employees, changes over its lifecycle. Typically a firm will start with a small number of employees and expand over time.

In [None]:
# Average size of a firm by age
age_df = cohort_df.groupby('age').agg({'avg_size':'mean'}).reset_index()

chart = alt.Chart(age_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('avg_size:Q')
).properties(title='Average size of a firm by age')

chart

The early cohorts seem to have a consistently higher number of employees.

In [None]:
# AVERAGE FIRM SIZE: by age and cohort

chart = alt.Chart(cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('avg_size:Q'),
    color=alt.Color('cohort:N')
).properties(title='Average size of a firm by age', height=500, width=800)

chart
chart.save(chart_path / 'Exploratory/average_firm_size_by_age_and_cohort.png')

In [None]:
# AVERAGE FIRM SIZE: by age, split into pre-2008 and post-2008 cohorts

cohort_df['period'] = cohort_df['cohort'].apply(lambda x: 'Pre-2008' if x < 2008 else 'Post-2008')
age_size_period_split_df = cohort_df.groupby(['age','period']).agg({'avg_size':'mean'}).reset_index()

line = alt.Chart(age_size_period_split_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('avg_size:Q'),
    color=alt.Color('period:N')
).properties(title='Average size of a firm by age',
             height=400, width=600)

# End labels - filter to last age point for each period
last_points = age_size_period_split_df.loc[
    age_size_period_split_df.groupby('period')['age'].idxmax()
]

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,  # Offset to the right
    fontSize=12,
    fontWeight='bold'
).encode(
    x=alt.X('age:O'),
    y=alt.Y('avg_size:Q'),
    text='period:N',
    color=alt.Color('period:N', legend=None)
)

# Combine
chart = (line + labels).properties(
    title='Average Size of a Firm by Age',
    height=400, 
    width=600
)


display(chart)
chart.save(chart_path / 'Exploratory/average_firm_size_by_age_pre_post_gfc.png')

In [None]:
# AVERAGE FIRM SIZE: by age and cohort (post 2005 cohorts only)

post_2005_cohort_df = cohort_df[cohort_df['cohort'] >= 2005]

chart = alt.Chart(post_2005_cohort_df).mark_line().encode(
    x=alt.X('age:O'),
    y=alt.Y('avg_size:Q'),
    color=alt.Color('cohort:N')
).properties(width=1000,height=600,title='Average size of a firm by age')

chart

#### **4. Job reallocation rates**

In [127]:
# Calculate job reallocation and job creation/destruction rates

job_flows_df = job_flows_df.sort_values(['category','dimension','year'])

job_flows_df['total_employment_lag'] = job_flows_df.groupby(['category','dimension'])['employment'].shift(1)

job_flows_df['job_reallocation_rate'] = (job_flows_df['job_creation_entrants'] + job_flows_df['job_creation_incumbents'] + job_flows_df['job_destruction_exiters'] + job_flows_df['job_destruction_incumbents']) / job_flows_df['total_employment_lag']
job_flows_df['jr_rate_entry_exit'] = (job_flows_df['job_creation_entrants'] + job_flows_df['job_destruction_exiters']) / job_flows_df['total_employment_lag']
job_flows_df['jr_rate_incumbents'] = (job_flows_df['job_creation_incumbents'] + job_flows_df['job_destruction_incumbents']) / job_flows_df['total_employment_lag']

job_flows_df['job_creation_rate'] = (job_flows_df['job_creation_entrants'] + job_flows_df['job_creation_incumbents']) / job_flows_df['total_employment_lag']
job_flows_df['jc_rate_entrants'] = job_flows_df['job_creation_entrants']  / job_flows_df['total_employment_lag']
job_flows_df['jc_rate_incumbents'] = job_flows_df['job_creation_incumbents']  / job_flows_df['total_employment_lag']

job_flows_df['job_destruction_rate'] = (job_flows_df['job_destruction_exiters'] + job_flows_df['job_destruction_incumbents']) / job_flows_df['total_employment_lag']
job_flows_df['jd_rate_exiters'] = job_flows_df['job_destruction_exiters']  / job_flows_df['total_employment_lag']
job_flows_df['jd_rate_incumbents'] = job_flows_df['job_destruction_incumbents']  / job_flows_df['total_employment_lag']

job_flows_df

Unnamed: 0,year,dimension,category,n_firms,employment,job_creation_entrants,job_creation_incumbents,job_destruction_exiters,job_destruction_incumbents,total_employment_lag,job_reallocation_rate,jr_rate_entry_exit,jr_rate_incumbents,job_creation_rate,jc_rate_entrants,jc_rate_incumbents,job_destruction_rate,jd_rate_exiters,jd_rate_incumbents
20,1998,Sector,Agriculture,165454,489985,15552,24821,26070,27860,,,,,,,,,,
60,1999,Sector,Agriculture,160383,466239,8790,26704,23995,29958,489985.0,0.182550,0.066910,0.115640,0.072439,0.017939,0.054500,0.110112,0.048971,0.061141
100,2000,Sector,Agriculture,157821,442802,9460,19495,19202,28805,466239.0,0.165070,0.061475,0.103595,0.062103,0.020290,0.041813,0.102967,0.041185,0.061782
140,2001,Sector,Agriculture,155900,423667,9763,18245,19500,29691,442802.0,0.174342,0.066086,0.108256,0.063252,0.022048,0.041204,0.111090,0.044038,0.067053
180,2002,Sector,Agriculture,151321,409434,7707,23116,24402,26622,423667.0,0.193187,0.075788,0.117399,0.072753,0.018191,0.054562,0.120434,0.057597,0.062837
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,2018,Age,Young (3-5 years),430386,1941054,0,232433,170408,144410,1840529.0,0.297334,0.092586,0.204747,0.126286,0.000000,0.126286,0.171048,0.092586,0.078461
843,2019,Age,Young (3-5 years),460895,2022007,0,232556,167546,166897,1941054.0,0.292109,0.086317,0.205792,0.119809,0.000000,0.119809,0.172300,0.086317,0.085983
883,2020,Age,Young (3-5 years),453040,1989838,0,235170,143090,152330,2022007.0,0.262408,0.070766,0.191641,0.116305,0.000000,0.116305,0.146102,0.070766,0.075336
923,2021,Age,Young (3-5 years),434851,1932015,0,194980,131312,172201,1989838.0,0.250519,0.065991,0.184528,0.097988,0.000000,0.097988,0.152532,0.065991,0.086540
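
As a quick sanity check on the rate definitions above, the sketch below recomputes the 1999 Agriculture row from the raw flows shown in the table. The single-row frame `check_df` is built by hand purely for illustration; its column names follow `job_flows_df`.

```python
import pandas as pd

# One row copied from the output above (Agriculture, 1999); built by hand for illustration
check_df = pd.DataFrame({
    'job_creation_entrants': [8790],
    'job_creation_incumbents': [26704],
    'job_destruction_exiters': [23995],
    'job_destruction_incumbents': [29958],
    'total_employment_lag': [489985.0],
})

lag = check_df['total_employment_lag']
job_creation_rate = (check_df['job_creation_entrants'] + check_df['job_creation_incumbents']) / lag
job_destruction_rate = (check_df['job_destruction_exiters'] + check_df['job_destruction_incumbents']) / lag
jr_rate_entry_exit = (check_df['job_creation_entrants'] + check_df['job_destruction_exiters']) / lag
jr_rate_incumbents = (check_df['job_creation_incumbents'] + check_df['job_destruction_incumbents']) / lag

# Total reallocation decomposes two ways:
# creation + destruction, or entry/exit margin + incumbent margin
job_reallocation_rate = job_creation_rate + job_destruction_rate
by_margin = jr_rate_entry_exit + jr_rate_incumbents

print(job_reallocation_rate.iloc[0], by_margin.iloc[0])
```

Both decompositions should agree with the `job_reallocation_rate` column for that row, up to rounding in the displayed table.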


##### **4.1 Aggregate job reallocation rates**


In [117]:
# AGGREGATE: Total job reallocation rate

total_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Total']
total_job_flows_df = total_job_flows_df[total_job_flows_df['year']>=1999]

total_reallocation_chart = alt.Chart(total_job_flows_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%'))
).properties(height=400, width=600)

total_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')

In [118]:
# SIZE: Total job reallocation rate

size_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Size']
size_job_flows_df = size_job_flows_df[size_job_flows_df['year']>=1999]

total_reallocation_chart = alt.Chart(size_job_flows_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N', title='Size category')
).properties(height=400, width=600)

total_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')

In [121]:
# AGE: Total job reallocation rate

age_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Age']
age_job_flows_df = age_job_flows_df[age_job_flows_df['year']>=1999]

total_reallocation_chart = alt.Chart(age_job_flows_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%')),
    facet=alt.Facet('category:N', title='Age category', columns=2)
).properties(height=400, width=600)

total_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')

In [118]:
# AGGREGATE: Job reallocation by margin (entry/exit vs incumbents)
total_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Total']
total_job_flows_df = total_job_flows_df[total_job_flows_df['year'] >= 1999]

total_job_flows_df = total_job_flows_df.melt(
    id_vars=['year'], 
    value_vars=['jr_rate_entry_exit', 'jr_rate_incumbents'], 
    var_name='margin', 
    value_name='reallocation_rate'
)

# Map variable names
label_map = {
    'jr_rate_entry_exit': 'Entry & Exit',
    'jr_rate_incumbents': 'Incumbents'
}
total_job_flows_df['margin'] = total_job_flows_df['margin'].map(label_map)

color_scale = alt.Scale(
    domain=['Entry & Exit', 'Incumbents'],
    range=['#179fdb', '#e6224b']
)

last_points = total_job_flows_df.loc[
    total_job_flows_df.groupby('margin')['year'].idxmax()
].copy()

lines = alt.Chart(total_job_flows_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('reallocation_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('margin:N', scale=color_scale, legend=None)
)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('reallocation_rate:Q'),
    text='margin:N',
    color=alt.Color('margin:N', scale=color_scale, legend=None)
)

total_reallocation_chart = (lines + labels)
total_reallocation_chart
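
The end-of-line labels in the chart above are positioned by selecting, for each series, the row with the latest year. The sketch below isolates that step on a small made-up frame (`long_df` and its values are hypothetical). It assumes the frame has a unique index, which is the case after `melt`.

```python
import pandas as pd

# Hypothetical long-format data for illustration; values are made up
long_df = pd.DataFrame({
    'year': [2019, 2020, 2021, 2019, 2020, 2021],
    'margin': ['Entry & Exit'] * 3 + ['Incumbents'] * 3,
    'reallocation_rate': [0.10, 0.09, 0.08, 0.14, 0.13, 0.12],
})

# idxmax returns the index label of the latest year within each series,
# so .loc picks exactly one row per series (this relies on a unique index)
last_points = long_df.loc[long_df.groupby('margin')['year'].idxmax()].copy()
print(last_points)
```

If the index were not unique, `.loc` could return more than one row per label, so resetting the index first would be the safer choice.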

In [156]:
# Rates in percent; `year` is moved to the index so it is not multiplied by 100
total_job_flows_df.set_index('year')[['jr_rate_entry_exit', 'jr_rate_incumbents']].multiply(100)

#total_job_flows_df[total_job_flows_df['year'] >= 2010][['jr_rate_entry_exit', 'jr_rate_incumbents']].multiply(100).mean()


year,jr_rate_entry_exit,jr_rate_incumbents
1999,11.187104,11.180888
2000,10.238355,13.214238
2001,10.883097,14.504773
2002,11.035734,18.863361
2003,10.949286,14.097373
2004,10.679926,12.16715
2005,9.530076,14.4744
2006,9.034131,13.705744
2007,10.321641,14.289535
2008,9.292709,12.796073


##### **4.2 Aggregate job flows by margin**

In [138]:
# AGGREGATE: Job creation vs job destruction among incumbents
total_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Total']
total_job_flows_df = total_job_flows_df[total_job_flows_df['year'] >= 1999]

incumbent_job_flows_df = total_job_flows_df.melt(
    id_vars=['year'], 
    value_vars=['jc_rate_incumbents', 'jd_rate_incumbents'], 
    var_name='job_flow_type', 
    value_name='job_flow_rate'
)

# Map variable names
label_map = {
    'jc_rate_incumbents': 'Job creation',
    'jd_rate_incumbents': 'Job destruction'
}
incumbent_job_flows_df['job_flow_type'] = incumbent_job_flows_df['job_flow_type'].map(label_map)

color_scale = alt.Scale(
    domain=['Job creation', 'Job destruction'],
    range=['#179fdb', '#e6224b']
)

last_points = incumbent_job_flows_df.loc[
    incumbent_job_flows_df.groupby('job_flow_type')['year'].idxmax()
].copy()

lines = alt.Chart(incumbent_job_flows_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('job_flow_rate:Q', axis=alt.Axis(format='%'), title='Incumbents'),
    color=alt.Color('job_flow_type:N', scale=color_scale, legend=None)
).properties(height=500, width=600)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('job_flow_rate:Q'),
    text='job_flow_type:N',
    color=alt.Color('job_flow_type:N', scale=color_scale, legend=None)
)

incumbent_job_flow_chart = (lines + labels)
incumbent_job_flow_chart

In [139]:
# AGGREGATE: Job creation by entrants vs job destruction by exiters

total_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Total']
total_job_flows_df = total_job_flows_df[total_job_flows_df['year'] >= 1999]

entry_exit_job_flows_df = total_job_flows_df.melt(
    id_vars=['year'], 
    value_vars=['jc_rate_entrants', 'jd_rate_exiters'], 
    var_name='job_flow_type', 
    value_name='job_flow_rate'
)

# Map variable names
label_map = {
    'jc_rate_entrants': 'Job creation',
    'jd_rate_exiters': 'Job destruction'
}
entry_exit_job_flows_df['job_flow_type'] = entry_exit_job_flows_df['job_flow_type'].map(label_map)

color_scale = alt.Scale(
    domain=['Job creation', 'Job destruction'],
    range=['#179fdb', '#e6224b']
)

last_points = entry_exit_job_flows_df.loc[
    entry_exit_job_flows_df.groupby('job_flow_type')['year'].idxmax()
].copy()

lines = alt.Chart(entry_exit_job_flows_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('job_flow_rate:Q', axis=alt.Axis(format='%'), title='Entry and exit'),
    color=alt.Color('job_flow_type:N', scale=color_scale, legend=None)
).properties(height=500, width=600)

labels = alt.Chart(last_points).mark_text(
    align='left',
    dx=5,
    fontSize=11,
    fontWeight='bold'
).encode(
    x=alt.X('year:O'),
    y=alt.Y('job_flow_rate:Q'),
    text='job_flow_type:N',
    color=alt.Color('job_flow_type:N', scale=color_scale, legend=None)
)

entry_exit_job_flow_chart = (lines + labels)
entry_exit_job_flow_chart

In [140]:
# JOB FLOW COMPARISON BY MARGIN
job_flow_comparison = entry_exit_job_flow_chart | incumbent_job_flow_chart
job_flow_comparison

job_flow_comparison.save(chart_path / 'Descriptive paper/Dynamism/BSD_job_flow_comparison.png', scale_factor=2.0)

In [None]:
# TODO: FACET THIS LOWEST-GRANULARITY BREAKDOWN BY MARGIN

##### **4.3 Job reallocation by size**

In [131]:
# SIZE: Total job reallocation
size_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Size']
size_job_flows_df = size_job_flows_df[size_job_flows_df['year']>=1999]

size_reallocation_chart = alt.Chart(size_job_flows_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N')
).properties(height=400, width=600).resolve_scale(y='independent')

size_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')

In [87]:
# SIZE: Reallocation by margin

size_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Size']
size_job_flows_df = size_job_flows_df[size_job_flows_df['year']>=1999]

size_job_flows_margin_df = size_job_flows_df.melt(
    id_vars=['year', 'category'], 
    value_vars=['jr_rate_entry_exit', 'jr_rate_incumbents'], 
    var_name='margin', 
    value_name='reallocation_rate'
)

size_reallocation_chart = alt.Chart(size_job_flows_margin_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('reallocation_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N'),
    facet=alt.Facet('margin:N', columns=1, title=None)
).properties(height=400, width=600).resolve_scale(y='independent')

size_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')

##### **4.4 Job reallocation by sector & region**
- Is total job reallocation declining across all sectors?
- How big is variation in reallocation across regions?
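
One way to approach the first question is to compare each sector's average reallocation rate before and after 2010 and flag the sectors where it fell. The sketch below does this on a small made-up frame. `toy_df` and its values are hypothetical and only mirror the `year`, `category` and `job_reallocation_rate` columns of `job_flows_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not taken from the BSD export
toy_df = pd.DataFrame({
    'year': [2008, 2009, 2011, 2012, 2008, 2009, 2011, 2012],
    'category': ['Sector A'] * 4 + ['Sector B'] * 4,
    'job_reallocation_rate': [0.30, 0.28, 0.22, 0.20, 0.18, 0.20, 0.21, 0.23],
})

# Same pre/post-2010 split as used elsewhere in this notebook
toy_df['period'] = np.where(toy_df['year'] < 2010, 'Pre-2010', 'Post-2010')

# One row per sector, one column per period
period_means = toy_df.pivot_table(
    index='category', columns='period', values='job_reallocation_rate', aggfunc='mean'
)
period_means['change'] = period_means['Post-2010'] - period_means['Pre-2010']
period_means['declined'] = period_means['change'] < 0

print(period_means)
```

Applied to the real sector rows, the share of sectors with `declined == True` would give a first answer on how broad-based the fall in reallocation is.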

In [137]:
# SECTOR: Total job reallocation

# Facet plot of all sectors

sector_job_flows_df = job_flows_df[job_flows_df['dimension'] == 'Sector']
sector_job_flows_df = sector_job_flows_df[sector_job_flows_df['year']>=1999]

sector_reallocation_chart = alt.Chart(sector_job_flows_df).mark_line().encode(
    x=  alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",  # Show every 2nd year
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%')),
    facet=alt.Facet('category:N', columns=2)
).properties(height=400, width=600).resolve_scale(y='independent')

sector_reallocation_chart
#total_reallocation_chart.save(chart_path / 'Exploratory/total_job_reallocation_rate.png')


In [146]:
# SECTOR: scatter plot of average job creation and destruction rates across sectors in two time periods
sector_job_flows_df = job_flows_df[job_flows_df['dimension']=='Sector']
df = sector_job_flows_df.copy()


df['period'] = df['year'].apply(lambda x: 'Pre-2010' if x < 2010 else 'Post-2010')

# Calculate average job creation and destruction rates for each sector in each period
sector_averages = df.groupby(['category', 'period']).agg({
    'job_creation_rate': 'mean',
    'job_destruction_rate': 'mean'
}).reset_index()

# Create faceted scatter plot
scatter = alt.Chart(sector_averages).mark_circle(size=200).encode(
    x=alt.X('job_creation_rate:Q', 
            title='Average Job Creation Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    y=alt.Y('job_destruction_rate:Q', 
            title='Average Job Destruction Rate (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    color=alt.Color('category:N', 
                    title='Sector',
                    legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'job_creation_rate:Q', 'job_destruction_rate:Q']
).properties(
    width=500,
    height=500
).facet(
    column=alt.Column('period:N', title=None, header=alt.Header(labelFontSize=14),
                      sort=['Pre-2010', 'Post-2010'])
)

scatter.display()


In [158]:
job_flows_df.columns

Index(['year', 'dimension', 'category', 'n_firms', 'employment',
       'job_creation_entrants', 'job_creation_incumbents',
       'job_destruction_exiters', 'job_destruction_incumbents',
       'total_employment_lag', 'job_reallocation_rate', 'jr_rate_entry_exit',
       'jr_rate_incumbents', 'job_creation_rate', 'jc_rate_entrants',
       'jc_rate_incumbents', 'job_destruction_rate', 'jd_rate_exiters',
       'jd_rate_incumbents'],
      dtype='object')

In [None]:
# Calculate average reallocation and employment by sector and period
sector_job_flows_df = job_flows_df[job_flows_df['dimension']=='Sector']
df = sector_job_flows_df.copy()

df['period'] = df['year'].apply(lambda x: 'Pre-2010' if x < 2010 else 'Post-2010')

sector_averages = df.groupby(['category', 'period']).agg({
    'job_reallocation_rate': 'mean',
    'employment': 'mean'
}).reset_index()

sector_pivot = sector_averages.pivot(index='category', columns='period')
sector_pivot.columns = ['_'.join(col) for col in sector_pivot.columns]
sector_pivot = sector_pivot.reset_index()

scatter = alt.Chart(sector_pivot).mark_circle().encode(
    x=alt.X('job_reallocation_rate_Pre-2010:Q', 
            title='Average Job Reallocation Rate, pre-2010 (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    y=alt.Y('job_reallocation_rate_Post-2010:Q', 
            title='Average Job Reallocation Rate 2010+ (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    size=alt.Size('employment_Post-2010:Q', 
                  title='Avg Employment (Post-2010)',
                  scale=alt.Scale(range=[50, 1000])),
    color=alt.Color('category:N', title='Sector', legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'job_reallocation_rate_Pre-2010:Q', 'job_reallocation_rate_Post-2010:Q', 'employment_Post-2010:Q']
).properties(
    width=500,
    height=500
)

line = alt.Chart(sector_pivot).mark_line(strokeDash=[5,5], color='grey').encode(
    x='job_reallocation_rate_Pre-2010:Q',
    y='job_reallocation_rate_Pre-2010:Q'
)

(scatter + line).display()

In [173]:
# Calculate average reallocation by margin for each sector and period
sector_margin = df.groupby(['category', 'period']).agg({
    'jr_rate_entry_exit': 'mean',
    'jr_rate_incumbents': 'mean',
    'employment': 'mean'
}).reset_index()

sector_margin_pivot = sector_margin.pivot(index='category', columns='period')
sector_margin_pivot.columns = ['_'.join(col) for col in sector_margin_pivot.columns]
sector_margin_pivot = sector_margin_pivot.reset_index()

# Calculate change in each margin
sector_margin_pivot['change_entry_exit'] = (
    sector_margin_pivot['jr_rate_entry_exit_Post-2010'] - sector_margin_pivot['jr_rate_entry_exit_Pre-2010']
)
sector_margin_pivot['change_incumbents'] = (
    sector_margin_pivot['jr_rate_incumbents_Post-2010'] - sector_margin_pivot['jr_rate_incumbents_Pre-2010']
)

scatter2 = alt.Chart(sector_margin_pivot).mark_circle().encode(
    x=alt.X('change_entry_exit:Q', 
            title='Change in Entry/Exit Reallocation Rate (pp)',
            axis=alt.Axis(format='.1%')),
    y=alt.Y('change_incumbents:Q', 
            title='Change in Incumbent Reallocation Rate (pp)',
            axis=alt.Axis(format='.1%')),
    size=alt.Size('employment_Post-2010:Q', 
                  title='Avg Employment (Post-2010)',
                  scale=alt.Scale(range=[50, 1000])),
    color=alt.Color('category:N', title='Sector', legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'change_entry_exit:Q', 'change_incumbents:Q']
).properties(
    width=500,
    height=500
)

# Add reference lines at zero
hline = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(strokeDash=[5,5], color='grey').encode(y='y:Q')
vline = alt.Chart(pd.DataFrame({'x': [0]})).mark_rule(strokeDash=[5,5], color='grey').encode(x='x:Q')

(scatter2 + hline + vline).display()

In [177]:
# Calculate average job destruction rate from exiters and employment by sector and period
sector_job_flows_df = job_flows_df[job_flows_df['dimension']=='Sector']
df = sector_job_flows_df.copy()

df['period'] = df['year'].apply(lambda x: 'Pre-2010' if x < 2010 else 'Post-2010')

sector_averages = df.groupby(['category', 'period']).agg({
    'jd_rate_exiters': 'mean',
    'employment': 'mean'
}).reset_index()

sector_pivot = sector_averages.pivot(index='category', columns='period')
sector_pivot.columns = ['_'.join(col) for col in sector_pivot.columns]
sector_pivot = sector_pivot.reset_index()

scatter = alt.Chart(sector_pivot).mark_circle().encode(
    x=alt.X('jd_rate_exiters_Pre-2010:Q', 
            title='Average Job Destruction Rate from Exiters, pre-2010 (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    y=alt.Y('jd_rate_exiters_Post-2010:Q', 
            title='Average Job Destruction Rate from Exiters, 2010+ (%)',
            scale=alt.Scale(zero=False),
            axis=alt.Axis(format='%')),
    size=alt.Size('employment_Post-2010:Q', 
                  title='Avg Employment (Post-2010)',
                  scale=alt.Scale(range=[50, 1000])),
    color=alt.Color('category:N', title='Sector', legend=alt.Legend(orient='right')),
    tooltip=['category:N', 'jd_rate_exiters_Pre-2010:Q', 'jd_rate_exiters_Post-2010:Q', 'employment_Post-2010:Q']
).properties(
    width=500,
    height=500
)

line = alt.Chart(sector_pivot).mark_line(strokeDash=[5,5], color='grey').encode(
    x='jd_rate_exiters_Pre-2010:Q',
    y='jd_rate_exiters_Pre-2010:Q'
)

(scatter + line).display()

In [169]:
# Reallocation rates by productivity group
# FIRST THE ANNUAL LINE SERIES TO SPOT ANOMALIES AND THEN AVERAGE OVER TWO PERIODS

prod_reallocation_df = job_flows_df[job_flows_df['dimension']=='Productivity']

annual_prod_reallocation_chart = alt.Chart(prod_reallocation_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(
                labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('job_reallocation_rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('category:N', title='Productivity group')
).properties(height=400, width=600)

annual_prod_reallocation_chart

In [182]:
# PRODUCTIVITY STATUS: Job reallocation by productivity group - TWO PERIOD AVERAGE BAR CHART

prod_reallocation_df = job_flows_df[job_flows_df['dimension']=='Productivity']
df = prod_reallocation_df.copy()

df['period'] = df['year'].apply(lambda x: 'Pre-2010' if x < 2010 else 'Post-2010')

prod_averages = df.groupby(['category', 'period']).agg({
    'job_reallocation_rate': 'mean',
    'employment': 'mean'
}).reset_index()
prod_averages

prod_order = ['Frontier (P90+)', 'High-Median (P50-P90)', 'Low-Median (P10-P50)', 'Laggards (<P10)']


prod_reallocation_bar_chart = alt.Chart(prod_averages).mark_bar().encode(
    x=alt.X('job_reallocation_rate:Q',
            title='Average Job Reallocation Rate (%)',
            axis=alt.Axis(format='%')), 
    y=alt.Y('category:N', title='Productivity group', sort=prod_order),
    color=alt.Color('period:N', title='Period'),
    yOffset=alt.YOffset('period:N', sort=['Pre-2010', 'Post-2010']),
    tooltip=['category:N', 'period:N', 'job_reallocation_rate:Q']
).properties(height=400, width=600)

prod_reallocation_bar_chart

In [None]:
# SECTOR REALLOCATION BY MARGIN

In [None]:
# SECTOR REALLOCATION BY JOB FLOW TYPE

#### **5. Annual growth rates**

Here we are interested in how existing firms grow or shrink each year, not just in entering cohorts. This analysis focuses exclusively on incumbent firms.

At the firm level, we calculate DHS growth rates.
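
For reference, the Davis-Haltiwanger-Schuh (DHS) growth rate divides the change in employment by the average of current and lagged employment. This treats expansions and contractions symmetrically and bounds the rate between -2 (exit) and +2 (entry). The function below is a minimal sketch of that standard formula. The name `dhs_growth` and its arguments are illustrative, and the firm-level calculation behind the exported tables is not reproduced here.

```python
def dhs_growth(emp_now: float, emp_lag: float) -> float:
    """DHS growth rate: change in employment over the two-period average."""
    avg_emp = 0.5 * (emp_now + emp_lag)
    if avg_emp == 0:
        raise ValueError("DHS growth is undefined when employment is zero in both periods")
    return (emp_now - emp_lag) / avg_emp


# An incumbent growing from 90 to 110 employees, and the mirror-image contraction
print(dhs_growth(110, 90))
print(dhs_growth(90, 110))
```

Because this section covers incumbents only, the two bounds are never reached: both current and lagged employment are positive.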

##### **5.1 Growth rate distributions**

In [24]:
# HEADLINE: DHS growth rate distributions annually
total_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Total'].copy()
total_growth_df = total_growth_df.melt(id_vars='year', value_vars=['mean_dhs_growth','p10_dhs_growth','p90_dhs_growth'], var_name='statistic', value_name='growth_rate')


label_map = {'mean_dhs_growth': 'Mean', 'p10_dhs_growth': 'P10', 'p90_dhs_growth': 'P90'}
total_growth_df['label'] = total_growth_df['statistic'].map(label_map)

max_year = total_growth_df['year'].max()
label_df = total_growth_df[total_growth_df['year'] == max_year]

# Use a shared base to avoid the facet-layer data conflict
base = alt.Chart(total_growth_df).encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0)),
    y=alt.Y('growth_rate:Q'),
    color=alt.Color('statistic:N', legend=None)
)

lines = base.mark_line()

end_labels = base.transform_filter(
    alt.datum.year == max_year
).mark_text(align='left', dx=5).encode(
    text=alt.Text('label:N')
)

total_dhs_growth_chart = (lines + end_labels)

display(total_dhs_growth_chart)
#total_dhs_growth_chart.save(chart_path / 'Exploratory/total_dhs_growth_distribution.png')

In [38]:
size_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Size'].copy()
size_growth_df = size_growth_df.melt(
    id_vars=['year', 'category'],
    value_vars=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    var_name='statistic',
    value_name='growth_rate'
)

label_map = {'mean_dhs_growth': 'Mean', 'p10_dhs_growth': 'P10', 'p90_dhs_growth': 'P90'}
size_growth_df['label'] = size_growth_df['statistic'].map(label_map)

size_order = ['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)']
size_growth_df['category'] = pd.Categorical(size_growth_df['category'], categories=size_order, ordered=True)

max_year = size_growth_df['year'].max()
label_df = size_growth_df[size_growth_df['year'] == max_year]

# Colour scheme: dark navy for mean, muted tones for P10/P90
color_scale = alt.Scale(
    domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    range=['#1b3a4b', '#a3b8c8', '#a3b8c8']
)

# Use a shared base to avoid the facet-layer data conflict
base = alt.Chart(size_growth_df).encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0)),
    y=alt.Y('growth_rate:Q'),
    color=alt.Color('statistic:N', scale=color_scale, legend=None)
)

lines = base.mark_line()

end_labels = base.transform_filter(
    alt.datum.year == max_year
).mark_text(align='left', dx=5).encode(
    text=alt.Text('label:N')
)

size_growth_chart = (lines + end_labels).facet(
    facet=alt.Facet('category:N', sort=size_order, title='Firm Size'),
    columns=2
).resolve_scale(
    y='independent'
)

display(size_growth_chart)
size_growth_chart.save(chart_path / 'Exploratory/size_dhs_growth_distribution.png')

In [36]:
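# DHS growth rate distribution (excluding Micro firms)
# Note: each statistic is a simple (unweighted) average across the remaining size bands, not the pooled distribution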
total_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Size'].copy()
total_growth_df = total_growth_df[total_growth_df['category'] != 'Micro (0-9)']
total_growth_df = total_growth_df[total_growth_df['year'] >= 2003]
total_growth_df = total_growth_df.groupby('year').agg({
    'mean_dhs_growth': 'mean',
    'p10_dhs_growth': 'mean',
    'p90_dhs_growth': 'mean'
}).reset_index()

# Keep a wide version for the shaded band
band_df = total_growth_df[['year', 'p10_dhs_growth', 'p90_dhs_growth']].copy()

# Melt for the lines
total_growth_df = total_growth_df.melt(
    id_vars='year',
    value_vars=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    var_name='statistic', value_name='growth_rate'
)

label_map = {'mean_dhs_growth': 'Mean', 'p10_dhs_growth': 'P10', 'p90_dhs_growth': 'P90'}
total_growth_df['label'] = total_growth_df['statistic'].map(label_map)

max_year = total_growth_df['year'].max()

# Colour scheme: dark navy for mean, muted tones for P10/P90
color_scale = alt.Scale(
    domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    range=['#1b3a4b', '#a3b8c8', '#a3b8c8']
)

# Shaded band between P10 and P90
band = alt.Chart(band_df).mark_area(opacity=0.08, color='#1b3a4b').encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('p10_dhs_growth:Q'),
    y2=alt.Y2('p90_dhs_growth:Q')
)

# Lines
lines = alt.Chart(total_growth_df).mark_line(strokeWidth=2).encode(
    x=alt.X('year:O'),
    y=alt.Y('growth_rate:Q'),
    color=alt.Color('statistic:N', scale=color_scale, legend=None),
    strokeDash=alt.StrokeDash(
        'statistic:N',
        scale=alt.Scale(
            domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
            range=[[0], [4, 2], [4, 2]]  # solid for mean, dashed for P10/P90
        ),
        legend=None
    )
)

# End labels
end_labels = alt.Chart(total_growth_df).transform_filter(
    alt.datum.year == max_year
).mark_text(align='left', dx=5, fontSize=11).encode(
    x=alt.X('year:O'),
    y=alt.Y('growth_rate:Q'),
    text=alt.Text('label:N'),
    color=alt.Color('statistic:N', scale=color_scale, legend=None)
)

total_dhs_growth_chart = (band + lines + end_labels).properties(
    width=500,
    height=300,
)

display(total_dhs_growth_chart)
total_dhs_growth_chart.save(chart_path / 'Descriptive paper/Dynamism/total_dhs_growth_distribution_excluding_micro.png', scale_factor=2.0)

In [29]:
# DISPERSION OF GROWTH RATES: Total economy (excluding Micro firms)
# Simple (unweighted) average of the SD of DHS growth rates across the non-Micro size bands
total_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Size'].copy()
total_growth_df = total_growth_df[total_growth_df['category'] != 'Micro (0-9)']
total_growth_df = total_growth_df[total_growth_df['year'] >=2003]
total_growth_df = total_growth_df.groupby('year').agg({'sd_dhs_growth': 'mean'}).reset_index()

max_year = total_growth_df['year'].max()
label_df = total_growth_df[total_growth_df['year'] == max_year]

# Use a shared base to avoid the facet-layer data conflict
base = alt.Chart(total_growth_df).encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0)),
    y=alt.Y('sd_dhs_growth:Q'),
)

lines = base.mark_line()

end_labels = base.transform_filter(
    alt.datum.year == max_year
).mark_text(align='left', dx=5).encode(
    text=alt.Text('sd_dhs_growth')
)

total_dhs_growth_chart = (lines + end_labels)

display(total_dhs_growth_chart)
#total_dhs_growth_chart.save(chart_path / 'Exploratory/total_dhs_growth_distribution_excluding_micro.png')

##### **5.2 Growth by firm characteristics**
- Size-growth patterns
- Growth rates by productivity decile
- Sector-specific growth patterns

As identified above, Micro firms introduce a lot of volatility into the distribution of DHS growth rates. This data export contains no cross-category breakdowns (for example, sector by size), so we cannot filter Micro firms out of the sector breakdown.

>**_QUESTION:_** Do we want to re-export growth rates across categories excluding Micro firms?

In [42]:
# DHS growth rate distribution by sector
sector_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Sector'].copy()
sector_growth_df = sector_growth_df[sector_growth_df['year'] >= 2003]

# Wide version for the shaded band
band_df = sector_growth_df[['year', 'category', 'p10_dhs_growth', 'p90_dhs_growth']].copy()

# Melt for the lines
line_df = sector_growth_df.melt(
    id_vars=['year', 'category'],
    value_vars=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    var_name='statistic', value_name='growth_rate'
)

label_map = {'mean_dhs_growth': 'Mean', 'p10_dhs_growth': 'P10', 'p90_dhs_growth': 'P90'}
line_df['label'] = line_df['statistic'].map(label_map)

max_year = line_df['year'].max()

color_scale = alt.Scale(
    domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    range=['#1b3a4b', '#a3b8c8', '#a3b8c8']
)

sector_order = sorted(sector_growth_df['category'].unique())

# Band — uses wide data, so separate faceted chart
band = alt.Chart(band_df).mark_area(opacity=0.08, color='#1b3a4b').encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 4 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('p10_dhs_growth:Q', title='DHS growth rate'),
    y2=alt.Y2('p90_dhs_growth:Q')
).properties(width=500, height=400)

# Lines — shared base from melted data
base = alt.Chart(line_df).encode(
    x=alt.X('year:O'),
    y=alt.Y('growth_rate:Q'),
    color=alt.Color('statistic:N', scale=color_scale, legend=None)
).properties(width=500, height=400)

lines = base.mark_line(strokeWidth=1.5).encode(
    strokeDash=alt.StrokeDash(
        'statistic:N',
        scale=alt.Scale(
            domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
            range=[[0], [4, 2], [4, 2]]
        ),
        legend=None
    )
)

end_labels = base.transform_filter(
    alt.datum.year == max_year
).mark_text(align='left', dx=3, fontSize=8).encode(
    text=alt.Text('label:N')
)

# Facet each layer separately, then layer the faceted charts
band_facet = band.facet(
    facet=alt.Facet('category:N', sort=sector_order, title=None,
                    header=alt.Header(labelFontSize=10)),
    columns=3
).resolve_scale(y='independent')

lines_facet = (lines + end_labels).facet(
    facet=alt.Facet('category:N', sort=sector_order, title=None,
                    header=alt.Header(labelFontSize=10)),
    columns=3
).resolve_scale(y='independent')

# Note: Altair can't directly layer two faceted charts, so `band_facet` and
# `lines_facet` above are not used in the final chart.
# Workaround: build each sector panel individually and concatenate.

panels = []
for sector in sector_order:
    s_band = band_df[band_df['category'] == sector]
    s_line = line_df[line_df['category'] == sector]

    b = alt.Chart(s_band).mark_area(opacity=0.08, color='#1b3a4b').encode(
        x=alt.X('year:O', axis=alt.Axis(
            labelExpr="datum.value % 4 == 0 ? datum.label : ''",
            labelAngle=0, title=None)),
        y=alt.Y('p10_dhs_growth:Q', title=None),
        y2=alt.Y2('p90_dhs_growth:Q')
    )

    l = alt.Chart(s_line).mark_line(strokeWidth=1.5).encode(
        x=alt.X('year:O'),
        y=alt.Y('growth_rate:Q', title=None),
        color=alt.Color('statistic:N', scale=color_scale, legend=None),
        strokeDash=alt.StrokeDash('statistic:N',
            scale=alt.Scale(
                domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
                range=[[0], [4, 2], [4, 2]]),
            legend=None)
    )

    e = alt.Chart(s_line).transform_filter(
        alt.datum.year == max_year
    ).mark_text(align='left', dx=3, fontSize=8).encode(
        x=alt.X('year:O'),
        y=alt.Y('growth_rate:Q'),
        text=alt.Text('label:N'),
        color=alt.Color('statistic:N', scale=color_scale, legend=None)
    )

    panel = (b + l + e).properties(width=500, height=400, title=sector)
    panels.append(panel)

# Arrange in rows of 3
n_cols = 3
rows = [alt.hconcat(*panels[i:i+n_cols]) for i in range(0, len(panels), n_cols)]
sector_chart = alt.vconcat(*rows).resolve_scale(y='independent')

display(sector_chart)
sector_chart.save(chart_path / 'Descriptive paper/Dynamism/sector_dhs_growth_distribution.png', scale_factor=2.0)


In [45]:
# PRODUCTIVITY: growth rates by productivity band

productivity_growth_df = growth_rates_df[growth_rates_df['dimension'] == 'Productivity'].copy()
productivity_growth_df = productivity_growth_df[productivity_growth_df['year'] >= 2003]

# Wide version for the shaded band
band_df = productivity_growth_df[['year', 'category', 'p10_dhs_growth', 'p90_dhs_growth']].copy()

# Melt for the lines
line_df = productivity_growth_df.melt(
    id_vars=['year', 'category'],
    value_vars=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    var_name='statistic', value_name='growth_rate'
)

label_map = {'mean_dhs_growth': 'Mean', 'p10_dhs_growth': 'P10', 'p90_dhs_growth': 'P90'}
line_df['label'] = line_df['statistic'].map(label_map)

# Cast to a plain int so the value serialises cleanly inside the Altair filter expression
max_year = int(line_df['year'].max())

color_scale = alt.Scale(
    domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
    range=['#1b3a4b', '#a3b8c8', '#a3b8c8']
)

sector_order = sorted(productivity_growth_df['category'].unique())

# Altair can't directly layer two faceted charts, so each productivity-band panel
# (shaded P10-P90 band, lines and end-of-line labels) is built individually and concatenated.

panels = []
for sector in sector_order:
    s_band = band_df[band_df['category'] == sector]
    s_line = line_df[line_df['category'] == sector]

    b = alt.Chart(s_band).mark_area(opacity=0.08, color='#1b3a4b').encode(
        x=alt.X('year:O', axis=alt.Axis(
            labelExpr="datum.value % 4 == 0 ? datum.label : ''",
            labelAngle=0, title=None)),
        y=alt.Y('p10_dhs_growth:Q', title=None),
        y2=alt.Y2('p90_dhs_growth:Q')
    )

    l = alt.Chart(s_line).mark_line(strokeWidth=1.5).encode(
        x=alt.X('year:O'),
        y=alt.Y('growth_rate:Q', title=None),
        color=alt.Color('statistic:N', scale=color_scale, legend=None),
        strokeDash=alt.StrokeDash('statistic:N',
            scale=alt.Scale(
                domain=['mean_dhs_growth', 'p10_dhs_growth', 'p90_dhs_growth'],
                range=[[0], [4, 2], [4, 2]]),
            legend=None)
    )

    e = alt.Chart(s_line).transform_filter(
        alt.datum.year == max_year
    ).mark_text(align='left', dx=3, fontSize=8).encode(
        x=alt.X('year:O'),
        y=alt.Y('growth_rate:Q'),
        text=alt.Text('label:N'),
        color=alt.Color('statistic:N', scale=color_scale, legend=None)
    )

    panel = (b + l + e).properties(width=500, height=400, title=sector)
    panels.append(panel)

# Arrange in rows of 3
n_cols = 3
rows = [alt.hconcat(*panels[i:i+n_cols]) for i in range(0, len(panels), n_cols)]
sector_chart = alt.vconcat(*rows).resolve_scale(y='independent')

display(sector_chart)
# Separate filename so the sector chart saved by the previous cell is not overwritten
#sector_chart.save(chart_path / 'Descriptive paper/Dynamism/productivity_dhs_growth_distribution.png', scale_factor=2.0)
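
For reference, the `*_dhs_growth` columns plotted above are presumably Davis-Haltiwanger-Schuh (DHS) growth rates: the change between two years divided by the average of the two levels, which bounds the rate between -2 (exit) and +2 (entry). The sketch below is a minimal illustration that assumes employment is the size measure; `dhs_growth` is a hypothetical helper, not a function from the code that produced the exported tables.

```python
def dhs_growth(current: float, previous: float) -> float:
    """DHS growth rate: (x_t - x_{t-1}) / (0.5 * (x_t + x_{t-1})).

    Hypothetical helper for illustration; assumes non-negative levels
    such as employment headcounts.
    """
    average = 0.5 * (current + previous)
    if average == 0:
        # Zero in both years: the rate is undefined
        return float("nan")
    return (current - previous) / average


# A firm growing from 2 to 3 employees
print(dhs_growth(3, 2))    # 0.4
# An exiting firm (10 employees to 0)
print(dhs_growth(0, 10))   # -2.0
```

Because headcounts are integers, small firms produce lumpy rates (1 to 2 employees gives roughly 0.667), which may explain the round P90 values such as 0.400 and 0.667 in `growth_rates_df`.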


##### **5.3 High-growth firm analysis**
- HGF counts and employment shares over time
- Characteristics of HGFs (age, size, sector)

In [46]:
# Calculate share of high-growth firms

growth_cats_df['high_growth_firm_share'] = growth_cats_df['n_hgf'] / growth_cats_df['n_incumbents']

In [47]:
# AGGREGATE: share of high growth firms over time
total_growth_cats_df = growth_cats_df[growth_cats_df['dimension']=='Total']
total_growth_cats_df = total_growth_cats_df[total_growth_cats_df['year']>=1999]

hgf_chart = alt.Chart(total_growth_cats_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('high_growth_firm_share:Q', axis=alt.Axis(format='%'), title='Share of high-growth firms')
)

hgf_chart

In [50]:
# FIRM SIZE: share of high growth firms over time
size_growth_cats_df = growth_cats_df[growth_cats_df['dimension']=='Size']
size_growth_cats_df = size_growth_cats_df[size_growth_cats_df['year']>=1999]

hgf_chart = alt.Chart(size_growth_cats_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('high_growth_firm_share:Q', axis=alt.Axis(format='%'), title='Share of high-growth firms'),
    color=alt.Color('category:N', title='Size band')
)

hgf_chart

In [None]:
# FIRM AGE: share of high growth firms over time
age_growth_cats_df = growth_cats_df[growth_cats_df['dimension']=='Age']
age_growth_cats_df = age_growth_cats_df[age_growth_cats_df['year']>=1999]

hgf_chart = alt.Chart(age_growth_cats_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('high_growth_firm_share:Q', axis=alt.Axis(format='%'), title='Share of high-growth firms'),
    color=alt.Color('category:N', title='Age group')
)

hgf_chart

##### **5.4 Stagnation analysis**
- Rising share of stagnant firms
- Characteristics of persistently stagnant firms
- Employment trapped in low-growth firms

In [None]:
growth_cats_df['stagnant_firm_share'] = growth_cats_df['n_stagnant'] / growth_cats_df['n_incumbents']

In [None]:
# AGGREGATE: share of stagnant firms over time
total_growth_cats_df = growth_cats_df[growth_cats_df['dimension']=='Total']
total_growth_cats_df = total_growth_cats_df[total_growth_cats_df['year']>=1999]

stagnation_chart = alt.Chart(total_growth_cats_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('stagnant_firm_share:Q', axis=alt.Axis(format='%'), title='Share of stagnant firms')
)

stagnation_chart

In [None]:
# FIRM SIZE: share of stagnant firms over time
size_growth_cats_df = growth_cats_df[growth_cats_df['dimension']=='Size']
size_growth_cats_df = size_growth_cats_df[size_growth_cats_df['year']>=1999]

stagnation_chart = alt.Chart(size_growth_cats_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''",
            labelAngle=0)),
    y=alt.Y('stagnant_firm_share:Q', axis=alt.Axis(format='%'), title='Share of stagnant firms'),
    color=alt.Color('category:N', title='Size band')
)

stagnation_chart

### Table of high-growth, stagnant and shrinking firm shares, averaged by period

In [49]:
# Filter to Total and relevant years
df = growth_cats_df[
    (growth_cats_df['dimension'] == 'Total') &
    (growth_cats_df['year'] >= 2000)
].copy()

# Assign periods
df['period'] = pd.cut(
    df['year'],
    bins=[1999, 2007, 2015, 2022],
    labels=['2000-2007', '2008-2015', '2016-2022']
)

# Calculate shares within each year first, then average across period
df['hgf_firm_share'] = df['n_hgf'] / df['n_incumbents']
df['stagnant_firm_share'] = df['n_stagnant'] / df['n_incumbents']
df['shrinking_firm_share'] = df['n_shrinkng'] / df['n_incumbents']

df['hgf_emp_share'] = df['hgf_emp'] / df['employment']
df['stagnant_emp_share'] = df['stagnant_emp'] / df['employment']
df['shrinking_emp_share'] = df['shrinking_emp'] / df['employment']

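# Average shares across years within each period
# (observed=False makes the current default explicit for the categorical 'period' key)
period_avg = df.groupby('period', observed=False).agg({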
    'hgf_firm_share': 'mean',
    'stagnant_firm_share': 'mean',
    'shrinking_firm_share': 'mean',
    'hgf_emp_share': 'mean',
    'stagnant_emp_share': 'mean',
    'shrinking_emp_share': 'mean'
}).reset_index()

# Reshape into the target format
records = []
for _, row in period_avg.iterrows():
    records.append({
        'Growth Category': 'High-growth firms',
        'Period': row['period'],
        '% of Firms': f"{row['hgf_firm_share']:.1%}",
        '% of Employment': f"{row['hgf_emp_share']:.1%}"
    })
    records.append({
        'Growth Category': 'Stagnant firms',
        'Period': row['period'],
        '% of Firms': f"{row['stagnant_firm_share']:.1%}",
        '% of Employment': f"{row['stagnant_emp_share']:.1%}"
    })
    records.append({
        'Growth Category': 'Shrinking firms',
        'Period': row['period'],
        '% of Firms': f"{row['shrinking_firm_share']:.1%}",
        '% of Employment': f"{row['shrinking_emp_share']:.1%}"
    })

summary_table = pd.DataFrame(records)
summary_table = summary_table.sort_values(['Growth Category', 'Period']).reset_index(drop=True)

display(summary_table)


Unnamed: 0,Growth Category,Period,% of Firms,% of Employment
0,High-growth firms,2000-2007,10.2%,16.7%
1,High-growth firms,2008-2015,11.6%,14.5%
2,High-growth firms,2016-2022,8.6%,11.5%
3,Shrinking firms,2000-2007,7.6%,6.4%
4,Shrinking firms,2008-2015,8.1%,5.9%
5,Shrinking firms,2016-2022,7.3%,5.3%
6,Stagnant firms,2000-2007,66.2%,38.9%
7,Stagnant firms,2008-2015,62.7%,37.3%
8,Stagnant firms,2016-2022,67.7%,46.2%


In [53]:
# Filter to sector dimension
df = growth_cats_df[
    (growth_cats_df['dimension'] == 'Sector') &
    (growth_cats_df['year'] >= 2000)
].copy()

# Calculate annual shares
df['hgf_firm_share'] = df['n_hgf'] / df['n_incumbents']
df['stagnant_firm_share'] = df['n_stagnant'] / df['n_incumbents']
df['hgf_emp_share'] = df['hgf_emp'] / df['employment']
df['stagnant_emp_share'] = df['stagnant_emp'] / df['employment']

# Split into 'pre' (2000-2015) and 'post' (2016 onwards) and average
df['period'] = df['year'].apply(lambda y: 'pre' if y < 2016 else 'post')

period_avg = df.groupby(['category', 'period']).agg({
    'hgf_firm_share': 'mean',
    'stagnant_firm_share': 'mean',
    'hgf_emp_share': 'mean',
    'stagnant_emp_share': 'mean'
}).reset_index()

# Pivot to get pre and post side by side
pre = period_avg[period_avg['period'] == 'pre'].drop(columns='period')
post = period_avg[period_avg['period'] == 'post'].drop(columns='period')
merged = pre.merge(post, on='category', suffixes=('_pre', '_post'))

# Calculate changes (post minus pre)
merged['hgf_firm_change'] = merged['hgf_firm_share_post'] - merged['hgf_firm_share_pre']
merged['hgf_emp_change'] = merged['hgf_emp_share_post'] - merged['hgf_emp_share_pre']
merged['stagnant_firm_change'] = merged['stagnant_firm_share_post'] - merged['stagnant_firm_share_pre']
merged['stagnant_emp_change'] = merged['stagnant_emp_share_post'] - merged['stagnant_emp_share_pre']

# Reshape for plotting: one row per industry × growth category × measure
records = []
for _, row in merged.iterrows():
    for gc, fc, ec in [
        ('High-growth', 'hgf_firm_change', 'hgf_emp_change'),
        ('Stagnant', 'stagnant_firm_change', 'stagnant_emp_change')
    ]:
        records.append({
            'industry': row['category'],
            'growth_category': gc,
            'measure': '% of firms',
            'change': row[fc]
        })
        records.append({
            'industry': row['category'],
            'growth_category': gc,
            'measure': '% of employment',
            'change': row[ec]
        })

plot_df = pd.DataFrame(records)

# Sort industries by HGF employment change for a meaningful order
hgf_emp_order = (
    plot_df[(plot_df['growth_category'] == 'High-growth') & (plot_df['measure'] == '% of employment')]
    .sort_values('change')['industry'].tolist()
)

# Shared base from single dataframe
base = alt.Chart(plot_df).encode(
    y=alt.Y('industry:N', sort=hgf_emp_order, title=None)
)

# Connecting line between the two measures per industry
line = base.mark_rule(color='#cccccc', strokeWidth=1).encode(
    x=alt.X('min(change):Q'),
    x2=alt.X2('max(change):Q')
)

# Dots
dots = base.mark_circle(size=60).encode(
    x=alt.X('change:Q', title='Change in share (post-2016 minus pre-2016)',
            axis=alt.Axis(format='.0%')),
    color=alt.Color('measure:N',
        scale=alt.Scale(
            domain=['% of firms', '% of employment'],
            range=['#a3b8c8', '#1b3a4b']
        ),
        legend=alt.Legend(title=None, orient='bottom')
    )
)

# Zero reference line drawn at a constant x = 0
zero = base.mark_rule(
    strokeDash=[3, 3], color='grey', strokeWidth=0.5
).encode(x=alt.datum(0))

chart = (line + zero + dots).facet(
    column=alt.Column('growth_category:N',
        sort=['High-growth', 'Stagnant'],
        header=alt.Header(labelFontSize=12, titleFontSize=0))
).properties(
    title='Change in share of high-growth and stagnant firms by industry (post-2016 vs pre-2016)'
)

display(chart)
#chart.save(chart_path / 'Descriptive paper/Dynamism/industry_hgf_stagnant_shift.png', scale_factor=2.0)

##### **5.5 Growth-productivity relationship**
- Do high-productivity firms grow faster?
- Changes in this relationship over time

In [None]:
growth_rates_df

Unnamed: 0,year,dimension,category,n_incumbents,employment,mean_dhs_growth,median_dhs_growth,sd_dhs_growth,p10_dhs_growth,p90_dhs_growth
0,1999,Age,Mature (6-10 years),286568,2947611,0.020,0.0,0.247,0.000,0.095
1,1999,Age,New (0-2 years),265915,1502113,0.069,0.0,0.373,0.000,0.667
2,1999,Age,Old (11+ years),645324,10706355,0.004,0.0,0.210,0.000,0.000
3,1999,Age,Young (3-5 years),283415,1280400,0.059,0.0,0.326,0.000,0.400
4,2000,Age,Mature (6-10 years),275795,2562087,0.025,0.0,0.269,0.000,0.222
...,...,...,...,...,...,...,...,...,...,...
955,2018,Total,All,1870075,21274005,0.012,0.0,0.262,-0.043,0.194
956,2019,Total,All,1890780,21592637,0.010,0.0,0.262,-0.057,0.182
957,2020,Total,All,1922555,21932337,0.006,0.0,0.260,-0.080,0.162
958,2021,Total,All,1929683,21791606,0.002,0.0,0.265,-0.133,0.154
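
One way to approach the first question above with the aggregated tables is to compare mean DHS growth across productivity bands year by year. The sketch below uses a toy frame with the same column names as `growth_rates_df`; the values are made up, and the `'Laggard'` label is a placeholder for whatever band names the export actually uses.

```python
import pandas as pd

# Toy rows mimicking growth_rates_df (values are illustrative only)
toy_growth = pd.DataFrame({
    'year': [2005, 2005, 2015, 2015],
    'dimension': ['Productivity'] * 4,
    'category': ['Frontier (P90+)', 'Laggard', 'Frontier (P90+)', 'Laggard'],
    'mean_dhs_growth': [0.05, 0.01, 0.03, 0.02],
})

# Keep the productivity dimension and put one band per column
prod = toy_growth[toy_growth['dimension'] == 'Productivity']
wide = prod.pivot(index='year', columns='category', values='mean_dhs_growth')

# Positive gap: frontier firms grow faster on average than laggards
wide['frontier_minus_laggard'] = wide['Frontier (P90+)'] - wide['Laggard']

print(wide)
```

A gap that shrinks over time would be consistent with a weakening link between productivity and growth, although a comparison of band means cannot by itself establish that.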


#### **6. Firm-level productivity dispersion**

##### **6.1 Dispersion metrics**
- Standard deviation, IQR, 90/10 ratio, 90/50 ratio
- Time trends in dispersion 
- Has dispersion increased?
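
The metrics listed above could be computed along the following lines. This is a minimal sketch that assumes access to firm-level data with hypothetical columns `year` and `productivity`; the exported tables used in this notebook are aggregated, so the calculation would presumably have to run on the underlying microdata, with only the yearly summary exported. The data below are made up.

```python
import pandas as pd


def dispersion_metrics(productivity: pd.Series) -> pd.Series:
    """Summary dispersion statistics for one year's productivity distribution."""
    p = productivity.dropna()
    p10, p25, p50, p75, p90 = (p.quantile(q) for q in (0.10, 0.25, 0.50, 0.75, 0.90))
    return pd.Series({
        'sd': p.std(),
        'iqr': p75 - p25,
        'p90_p10_ratio': p90 / p10,
        'p90_p50_ratio': p90 / p50,
    })


# Toy firm-level data (made-up values): 11 firms per year
firm_df = pd.DataFrame({
    'year': [2010] * 11 + [2020] * 11,
    'productivity': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110]
                    + [5, 20, 30, 40, 50, 60, 70, 80, 90, 200, 400],
})

# One row per year, one column per metric
dispersion_by_year = pd.DataFrame({
    year: dispersion_metrics(group['productivity'])
    for year, group in firm_df.groupby('year')
}).T.rename_axis('year')

print(dispersion_by_year.round(2))
```

The ratios are computed on levels, so they are only meaningful when productivity at P10 and P50 is strictly positive.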

#### **7. Site expansion dynamics**


##### **7.1 Multi-site expansion trends**
- Site expansion rates over time
- Decomposition: entrants vs incumbents

In [54]:
site_dynamics_df.columns

Index(['year', 'dimension', 'category', 'n_firms', 'total_sites',
       'site_exp_entrants', 'site_exp_incumbents', 'site_closure_exit',
       'site_closure_incumbents'],
      dtype='object')

In [67]:
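# Define the 'Total' filter here so the cell also runs in top-to-bottom order
total_sites_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Total']
total_sites_df[['year','total_sites']]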


Unnamed: 0,year,total_sites
39,1998,914525
79,1999,892848
119,2000,873737
159,2001,901342
199,2002,913095
239,2003,917723
279,2004,948556
319,2005,950935
359,2006,951852
399,2007,960846


In [78]:
# Quickly plot total sites to visually inspect - looking out for BRES volatility
total_sites_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Total']

site_chart = alt.Chart(total_sites_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''")),
    y=alt.Y('total_sites:Q', title='Total sites')
)

display(site_chart)

size_sites_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Size']

size_site_chart = alt.Chart(size_sites_df).mark_line().encode(
    x=alt.X('year:O', axis=alt.Axis(labelExpr="datum.value % 2 == 0 ? datum.label : ''")),
    y=alt.Y('total_sites:Q', title='Total sites'),
    color=alt.Color('category:N', title='Size band')
)

display(size_site_chart)
#size_site_chart.save(chart_path / 'Exploratory/total_sites_by_size.png', scale_factor=2.0)


In [55]:
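# Calculate site expansion and closure rates relative to the previous year's site stock.
# Sort by year before shifting so the lag does not depend on row order;
# the assignment realigns on the index, so the original row order is kept.
site_dynamics_df['total_sites_lag'] = (
    site_dynamics_df.sort_values('year')
    .groupby(['category', 'dimension'])['total_sites']
    .shift(1)
)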

site_dynamics_df['site_exp_rate_entrants'] = site_dynamics_df['site_exp_entrants'] / site_dynamics_df['total_sites_lag']
site_dynamics_df['site_exp_rate_incumbents'] = site_dynamics_df['site_exp_incumbents'] / site_dynamics_df['total_sites_lag']
site_dynamics_df['site_closure_rate_exit'] = site_dynamics_df['site_closure_exit'] / site_dynamics_df['total_sites_lag']
site_dynamics_df['site_closure_rate_incumbents'] = site_dynamics_df['site_closure_incumbents'] / site_dynamics_df['total_sites_lag']

site_dynamics_df

Unnamed: 0,year,dimension,category,n_firms,total_sites,site_exp_entrants,site_exp_incumbents,site_closure_exit,site_closure_incumbents,total_sites_lag,site_exp_rate_entrants,site_exp_rate_incumbents,site_closure_rate_exit,site_closure_rate_incumbents
0,1998,Age,Mature (6-10 years),291285,175638,0.0,9404.0,13079,3491.0,,,,,
1,1998,Age,New (0-2 years),566620,70840,53211.0,4678.0,1446,298.0,,,,,
2,1998,Age,Old (11+ years),664527,550832,0.0,27816.0,27724,16311.0,,,,,
3,1998,Age,Young (3-5 years),394013,117215,0.0,9836.0,11407,1866.0,,,,,
4,1998,Productivity,Frontier (P90+),187265,131632,6609.0,8365.0,5787,4736.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,2022,Size,Large (250+),7854,252479,326.0,10865.0,4806,10623.0,255377.0,0.001277,0.042545,0.018819,0.041597
996,2022,Size,Medium (50-249),38611,86999,195.0,4198.0,2291,2574.0,87203.0,0.002236,0.048141,0.026272,0.029517
997,2022,Size,Micro (0-9),2267041,426590,3643.0,16766.0,31633,3278.0,428937.0,0.008493,0.039087,0.073747,0.007642
998,2022,Size,Small (10-49),221189,152091,361.0,5651.0,4653,2117.0,153688.0,0.002349,0.036769,0.030276,0.013775


In [70]:
# Site expansion and contraction rates over time
total_site_dynamics_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Total'].copy()

site_dynamics_rates_df = total_site_dynamics_df.melt(
    id_vars=['year'],
    value_vars=['site_exp_rate_entrants', 'site_exp_rate_incumbents',
                'site_closure_rate_exit', 'site_closure_rate_incumbents'],
    var_name='flow_type', value_name='rate'
)

# Flip closure rates to negative
closure_flows = ['site_closure_rate_exit', 'site_closure_rate_incumbents']
site_dynamics_rates_df.loc[
    site_dynamics_rates_df['flow_type'].isin(closure_flows), 'rate'
] *= -1

flow_type_map = {
    'site_exp_rate_entrants': 'Expansion - entrants',
    'site_exp_rate_incumbents': 'Expansion - incumbents',
    'site_closure_rate_exit': 'Closure - exits',
    'site_closure_rate_incumbents': 'Closure - incumbents'
}
site_dynamics_rates_df['flow_type'] = site_dynamics_rates_df['flow_type'].map(flow_type_map)

color_scale = alt.Scale(
    domain=['Expansion - entrants', 'Expansion - incumbents',
            'Closure - exits', 'Closure - incumbents'],
    range=['#179fdb', '#122b39',   # blue + dark navy for expansion
           '#eb5c2e', '#e6224b']    # reds/oranges for closure
)

# Zero line
zero = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(
    color='black', strokeWidth=0.5
).encode(y='y:Q')

bars = alt.Chart(site_dynamics_rates_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('rate:Q', axis=alt.Axis(format='%'), title='Rate'),
    color=alt.Color('flow_type:N', scale=color_scale,
                    legend=alt.Legend(title=None, orient='bottom',
                                     columns=2)),
    order=alt.Order('flow_type:N')
)

site_dynamics_chart = (zero + bars).properties(width=600, height=400)

display(site_dynamics_chart)
# site_dynamics_chart.save(chart_path / 'Descriptive paper/Dynamism/total_site_dynamics_rates.png', scale_factor=2.0)
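
The four flows in the chart above can be netted into a single rate, which gives a quick consistency check against the change in the site stock. The sketch below uses toy rows with the same column names as `site_dynamics_df`; the values are made up and chosen so that the flows reconcile exactly, which need not hold in the real data.

```python
import pandas as pd

# Toy 'Total' rows mimicking site_dynamics_df (values are illustrative only)
toy_sites = pd.DataFrame({
    'year': [2020, 2021, 2022],
    'total_sites': [1000, 1020, 1010],
    'site_exp_entrants': [30, 40, 20],
    'site_exp_incumbents': [50, 60, 50],
    'site_closure_exit': [40, 50, 60],
    'site_closure_incumbents': [20, 30, 20],
})

# Previous year's stock is the denominator, as in the rates calculated earlier
toy_sites['total_sites_lag'] = toy_sites['total_sites'].shift(1)

gross_expansion = toy_sites['site_exp_entrants'] + toy_sites['site_exp_incumbents']
gross_closure = toy_sites['site_closure_exit'] + toy_sites['site_closure_incumbents']

# Net rate from the flows, and the rate implied by the change in the stock
toy_sites['net_site_growth_rate'] = (gross_expansion - gross_closure) / toy_sites['total_sites_lag']
toy_sites['stock_change_rate'] = toy_sites['total_sites'].diff() / toy_sites['total_sites_lag']

print(toy_sites[['year', 'net_site_growth_rate', 'stock_change_rate']])
```

A persistent gap between the two columns in the real tables would suggest that some changes in the site stock are not captured by the four flow measures.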

In [94]:
# HEADLINE: site expansion rates over time (entrants vs incumbents only)
total_site_dynamics_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Total'].copy()

site_dynamics_rates_df = total_site_dynamics_df.melt(
    id_vars=['year'],
    value_vars=['site_exp_rate_entrants', 'site_exp_rate_incumbents'],
    var_name='flow_type', value_name='rate'
)

flow_type_map = {
    'site_exp_rate_entrants': 'Expansion - entrants',
    'site_exp_rate_incumbents': 'Expansion - incumbents'
}
site_dynamics_rates_df['flow_type'] = site_dynamics_rates_df['flow_type'].map(flow_type_map)

color_scale = alt.Scale(
    domain=['Expansion - incumbents', 'Expansion - entrants'],
    range=['#179fdb', '#122b39']
)

# Zero line
zero = alt.Chart(pd.DataFrame({'y': [0]})).mark_rule(
    color='black', strokeWidth=0.5
).encode(y='y:Q')

bars = alt.Chart(site_dynamics_rates_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('flow_type:N', scale=color_scale,
                    legend=alt.Legend(title=None, orient='none',
                                       legendX=450,   
                                       legendY=12,  
                                       direction='vertical')),
    order=alt.Order('flow_type:N')
)

site_dynamics_chart = (zero + bars).properties(width=600, height=400)

display(site_dynamics_chart)
site_dynamics_chart.save(chart_path / 'Descriptive paper/Dynamism/total_site_expansion_rates.png', scale_factor=2.0)

In [102]:
# FIRM SIZE: site expansion rates over time (entrants vs incumbents only)
size_site_dynamics_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Size'].copy()

site_dynamics_rates_df = size_site_dynamics_df.melt(
    id_vars=['year','category'],
    value_vars=['site_exp_rate_entrants', 'site_exp_rate_incumbents'],
    var_name='flow_type', value_name='rate'
)

size_order = ['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)']
site_dynamics_rates_df['category'] = pd.Categorical(site_dynamics_rates_df['category'], categories=size_order, ordered=True)

flow_type_map = {
    'site_exp_rate_entrants': 'Expansion - entrants',
    'site_exp_rate_incumbents': 'Expansion - incumbents'
}
site_dynamics_rates_df['flow_type'] = site_dynamics_rates_df['flow_type'].map(flow_type_map)

color_scale = alt.Scale(
    domain=['Expansion - incumbents', 'Expansion - entrants'],
    range=['#179fdb', '#122b39']
)

bars = alt.Chart(site_dynamics_rates_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('flow_type:N', scale=color_scale,
                    legend=alt.Legend(title=None, orient='top')),
    order=alt.Order('flow_type:N'),
    facet=alt.Facet('category:N', columns=2, sort=size_order, title=None, header=alt.Header(labelFontSize=10))
)

site_dynamics_chart = bars.properties(width=500, height=400)

display(site_dynamics_chart)
site_dynamics_chart.save(chart_path / 'Exploratory/size_site_expansion_rates.png', scale_factor=2.0)

In [104]:
# SECTOR: site expansion rates over time (entrants vs incumbents only)
sector_site_dynamics_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Sector'].copy()

site_dynamics_rates_df = sector_site_dynamics_df.melt(
    id_vars=['year','category'],
    value_vars=['site_exp_rate_entrants', 'site_exp_rate_incumbents'],
    var_name='flow_type', value_name='rate'
)

flow_type_map = {
    'site_exp_rate_entrants': 'Expansion - entrants',
    'site_exp_rate_incumbents': 'Expansion - incumbents'
}
site_dynamics_rates_df['flow_type'] = site_dynamics_rates_df['flow_type'].map(flow_type_map)

color_scale = alt.Scale(
    domain=['Expansion - incumbents', 'Expansion - entrants'],
    range=['#179fdb', '#122b39']
)

bars = alt.Chart(site_dynamics_rates_df).mark_bar().encode(
    x=alt.X('year:O', axis=alt.Axis(
        labelExpr="datum.value % 2 == 0 ? datum.label : ''",
        labelAngle=0, title=None)),
    y=alt.Y('rate:Q', axis=alt.Axis(format='%')),
    color=alt.Color('flow_type:N', scale=color_scale,
                    legend=alt.Legend(title=None, orient='top')),
    order=alt.Order('flow_type:N'),
    # No custom sort: size_order lists size bands, not sectors, so sectors fall back to the default order
    facet=alt.Facet('category:N', columns=2, title=None, header=alt.Header(labelFontSize=10))
).resolve_scale(y='independent')

site_dynamics_chart = bars.properties(width=500, height=400)

display(site_dynamics_chart)
# Separate filename so the size chart saved by the previous cell is not overwritten
#site_dynamics_chart.save(chart_path / 'Exploratory/sector_site_expansion_rates.png', scale_factor=2.0)

In [82]:
# Site expansion and contraction rates by firm size
size_site_df = site_dynamics_df[site_dynamics_df['dimension'] == 'Size'].copy()

size_site_rates = size_site_df.melt(
    id_vars=['year', 'category'],
    value_vars=['site_exp_rate_entrants', 'site_exp_rate_incumbents',
                'site_closure_rate_exit', 'site_closure_rate_incumbents'],
    var_name='flow_type', value_name='rate'
)

# Flip closure rates to negative
closure_flows = ['site_closure_rate_exit', 'site_closure_rate_incumbents']
size_site_rates.loc[
    size_site_rates['flow_type'].isin(closure_flows), 'rate'
] *= -1

flow_type_map = {
    'site_exp_rate_entrants': 'Expansion - entrants',
    'site_exp_rate_incumbents': 'Expansion - incumbents',
    'site_closure_rate_exit': 'Closure - exits',
    'site_closure_rate_incumbents': 'Closure - incumbents'
}
size_site_rates['flow_type'] = size_site_rates['flow_type'].map(flow_type_map)

color_scale = alt.Scale(
    domain=['Expansion - entrants', 'Expansion - incumbents',
            'Closure - exits', 'Closure - incumbents'],
    range=['#179fdb', '#122b39',
           '#eb5c2e', '#e6224b']
)

size_order = ['Micro (0-9)', 'Small (10-49)', 'Medium (50-249)', 'Large (250+)']

# Build panels individually to allow zero line layering
panels = []
for size in size_order:
    s_df = size_site_rates[size_site_rates['category'] == size]

    zero = alt.Chart(s_df).mark_rule(
        color='black', strokeWidth=0.5
    ).encode(y=alt.datum(0))

    show_legend = (size == size_order[-1])
    bars = alt.Chart(s_df).mark_bar().encode(
        x=alt.X('year:O', axis=alt.Axis(
            labelExpr="datum.value % 4 == 0 ? datum.label : ''",
            labelAngle=0, title=None)),
        y=alt.Y('rate:Q', axis=alt.Axis(format='%'), title=None),
        color=alt.Color('flow_type:N', scale=color_scale,
                        legend=alt.Legend(title=None, orient='bottom', columns=2)
                        if show_legend else None),
        order=alt.Order('flow_type:N')
    )

    panel = (zero + bars).properties(width=250, height=180, title=size)
    panels.append(panel)

# Invisible chart that carries a shared legend for the whole grid
legend_chart = alt.Chart(size_site_rates).mark_point(size=0, opacity=0).encode(
    color=alt.Color('flow_type:N', scale=color_scale,
                    legend=alt.Legend(title=None, orient='bottom', columns=2,
                                     symbolType='square', symbolSize=100))
).properties(width=0, height=0)

# Arrange in 2x2 grid
grid = alt.vconcat(
    alt.hconcat(panels[0], panels[1]),
    alt.hconcat(panels[2], panels[3]),
    legend_chart
).resolve_scale(y='independent')

display(grid)
# grid.save(chart_path / 'Descriptive paper/Dynamism/size_site_dynamics_rates.png', scale_factor=2.0)

##### **7.2 Site dynamics by sector**