![ESRC Logo](../logos/Economic_and_Social_Research_Council_logo.svg)

# Using Administrative Data to Understand UK Civil Society Organisations

We are an ESRC-funded group of collaborators seeking to improve the data infrastructure surrounding the UK’s third sector and civil society more generally. The grant is entitled “Improving Access to and Use of Organisation-Level Data on the Third Sector and Civil Society” (Project Reference: ES/X000524/1).

In this notebook we demonstrate how to use some of the data resources we constructed during the project. Feedback on these training materials, and the data resources more generally, is very welcome - you can find contact information at the end of this file.

## Civil Society Organisation Spine

The initial data resource of our project is a list or "spine" of all formally registered UK third sector and civil society organisations, their names, addresses and dates of registration (and dissolution where relevant). Additionally, where organisations are found in more than one register, we provide a file listing these linkages. Data relating to alternative organisation names and addresses is also collated.

### Construction of the Spine

For general information on how the spine was constructed, including which raw datasets were used and how datasets were linked through organisation IDs and names, please see our short presentation:
* [LINK AVAILABLE AFTER RECORDING]()

In addition, our project website contains a number of blog posts on conceptual and definitional issues that arise when analysing UK civil society:
* [Motivations for mapping third sector organisations in the UK](https://uk-third-sector-database.github.io/_posts/2024/001/blog-post-2)
* [Building a foundational dataset of third sector organisations in the UK](https://uk-third-sector-database.github.io/_posts/2024/001/blog-post-3/)

### Downloading the Data

The latest versions of the dataset are found on the project webpage: https://uk-third-sector-database.github.io/data/

For this lesson we download and unzip the data directly from source - for your purposes you will likely download it to your machine first.

Our first task is to ensure we have the functionality we need to handle the data in Python (the programming language this notebook uses).

In [None]:
import pandas as pd
import os
import zipfile
import urllib.request
from IPython.display import display

In [None]:
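# Download the zipped spine files to a temporary location on disk
url = "https://github.com/uk-third-sector-database/tso-database-builder/raw/new-build-spine/tso-spine-files.March2025.zip?download="
local_path, _ = urllib.request.urlretrieve(url)

# Extract the archive contents into the current working directory
with zipfile.ZipFile(local_path, 'r') as zf:
    zf.extractall()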
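
If you prefer to download the archive to your machine first, the extraction step can be pointed at a named folder rather than the current working directory. The following is a minimal sketch under that assumption. The helper `extract_archive`, the folder name `data`, and the tiny stand-in archive (with made-up file names and contents) are ours for illustration and are not part of the project's files.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the sorted file names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(target)
    return sorted(p.name for p in target.iterdir() if p.is_file())


# Build a tiny stand-in archive so the sketch runs without a network connection.
# The file names and contents here are invented for illustration only.
work_dir = Path(tempfile.mkdtemp())
demo_zip = work_dir / "demo-spine-files.zip"
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("demo_spine.csv", "uid,name\nGB-CHC-0000001,Example Trust\n")
    zf.writestr("demo_matches.csv", "uid\nGB-CHC-0000001\n")

extracted = extract_archive(demo_zip, work_dir / "data")
print(extracted)
```

To use this with the real download, you would pass the path of the zip file saved on your machine in place of `demo_zip`.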

### Importing the Data

The main file in the spine download is a dataset containing one row per civil society organisation and some core information about each entity.

In [None]:
spine_df = pd.read_csv('public_spine.spine.csv')

In [None]:
print('The top 5 rows of the spine data are:')
display(spine_df.head())
print('\nThe summary statistics of the spine data are:')
display(spine_df.describe())
print(f'\nThe spine data has {spine_df.shape[0]} rows and {spine_df.shape[1]} columns.\n')
print(f'Spine data columns are: {list(spine_df.columns)}.\n')
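
Because the spine is intended to hold one row per organisation, a quick sanity check is to compare the number of rows with the number of distinct `uid` values. The sketch below shows the idea on a small made-up table. The identifiers, the `name` column and the deliberate duplicate are invented for illustration, and we have not assumed anything about the real file beyond the existence of a `uid` column.

```python
import pandas as pd

# Made-up example with one deliberately duplicated identifier
demo_spine = pd.DataFrame({
    "uid": ["GB-CHC-0000001", "GB-CHC-0000002", "GB-CHC-0000002"],
    "name": ["Example Trust", "Sample Fund", "Sample Fund (copy)"],
})

n_rows = len(demo_spine)
n_unique = demo_spine["uid"].nunique()

# keep=False flags every row that shares its uid with another row
duplicated_rows = demo_spine[demo_spine["uid"].duplicated(keep=False)]

print(f"{n_rows} rows, {n_unique} unique uids")
print(duplicated_rows)
```

On the real data you would replace `demo_spine` with `spine_df`; an empty `duplicated_rows` table would indicate one row per identifier.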


### Exploring the Data

The previous section provides some basic metadata about the spine: it contains over 750,000 unique organisations and a small number of fields / variables describing these entities. As part of these training materials we demonstrate how to combine the spine with census and geographic administrative data to answer important social science research questions.

However, there are some interesting features of the spine that we can explore just now.

For example, how many organisations are registered as Community Interest Companies (CICs)?

In [None]:
spine_df["is_cic"].value_counts()
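
Counts are often easier to interpret alongside proportions. The following sketch shows one way to obtain both, using a small made-up table. It assumes that `is_cic` is stored as a boolean column, which we have not confirmed for the real file; if it is stored as text or as 0/1 values, the code would need adapting.

```python
import pandas as pd

# Made-up example: four organisations, one of which is a CIC
demo = pd.DataFrame({
    "uid": ["GB-COH-0000001", "GB-CHC-0000002", "GB-CHC-0000003", "GB-CHC-0000004"],
    "is_cic": [True, False, False, False],
})

# Counts per category, keeping any missing values visible
print(demo["is_cic"].value_counts(dropna=False))

# For a boolean column, the sum is the number of True values
# and the mean is the share of True values
n_cic = int(demo["is_cic"].sum())
share_cic = float(demo["is_cic"].mean())

print(f"{n_cic} CIC(s), {share_cic:.0%} of organisations")
```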

Or what data sources provide the most instances of organisations in the spine?

In [None]:
spine_df['source'] = spine_df['uid'].str.split('-').str[1]
spine_df['source'].value_counts()

We can see that the Charity Commission for England and Wales provided the most records for organisations in the spine, followed by Companies House and the Scottish Charity Regulator.
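
The `source` field used above is the middle part of each `uid`. A plain-Python sketch of the same idea is shown below, applied to the two identifiers that appear later in this notebook. The function name `source_code` is hypothetical, and we assume every `uid` follows a `country-register-identifier` pattern; anything that does not is returned as `None`.

```python
def source_code(uid):
    """Return the register code from a uid such as 'GB-CHC-200009', or None if malformed."""
    # Split at most twice so that hyphens inside the identifier itself are left alone
    parts = uid.split("-", 2)
    return parts[1] if len(parts) == 3 else None


for uid in ["GB-CHC-200009", "GB-COH-00686799", "malformed"]:
    print(uid, "->", source_code(uid))
```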

We can examine when individual organisations were registered with their respective source.

In [None]:
spine_df['registerdate'] = pd.to_datetime(spine_df['registerdate'], dayfirst=True)
spine_df['regy'] = spine_df['registerdate'].dt.year

In [None]:
summary = (
    spine_df
    .groupby('source')['regy']
    .agg(earliest_year='min', typical_year='median', latest_year='max')
    .reset_index()
)

print(summary)
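
Date fields in administrative data sometimes contain values that cannot be parsed. The sketch below shows one defensive approach on made-up data: unparseable values are converted to missing (`NaT`) and then counted. The `%d/%m/%Y` format is an assumption made for this example and may not match the real `registerdate` field.

```python
import pandas as pd

# Made-up example: two day-first dates and one unparseable value
demo = pd.DataFrame({
    "source": ["CHC", "CHC", "COH"],
    "registerdate": ["03/04/2001", "25/12/1995", "not a date"],
})

# errors="coerce" turns values that do not fit the format into NaT
demo["registerdate"] = pd.to_datetime(
    demo["registerdate"], format="%d/%m/%Y", errors="coerce"
)
demo["regy"] = demo["registerdate"].dt.year

n_unparsed = int(demo["registerdate"].isna().sum())
print(demo)
print(f"{n_unparsed} date(s) could not be parsed")
```

Counting the missing dates before summarising makes it clear how many organisations are silently excluded from the minimum, median and maximum.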

Finally, we can examine how many organisations have been deregistered with their respective source.

In [None]:
count_removed = (
    spine_df
    .groupby('source')
    .agg(removed_organisations=('removeddate', 'count'))
    .reset_index()
)

print(count_removed)
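
The cell above works because the `count` aggregation only counts non-missing values, so organisations without a removal date are skipped. The sketch below extends this on made-up data to give the share of organisations removed per source. The identifiers and dates are invented, and we assume that a missing `removeddate` means the organisation is still registered.

```python
import pandas as pd

# Made-up example: None marks organisations with no removal date
demo = pd.DataFrame({
    "uid": ["GB-CHC-0000001", "GB-CHC-0000002", "GB-CHC-0000003",
            "GB-COH-0000004", "GB-COH-0000005"],
    "source": ["CHC", "CHC", "CHC", "COH", "COH"],
    "removeddate": ["2010-01-01", None, None, "2015-06-30", "2020-02-02"],
})

removal_summary = (
    demo
    .groupby("source")
    .agg(
        organisations=("uid", "count"),                 # every row has a uid
        removed_organisations=("removeddate", "count")  # non-missing dates only
    )
    .reset_index()
)
removal_summary["share_removed"] = (
    removal_summary["removed_organisations"] / removal_summary["organisations"]
)

print(removal_summary)
```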

### Examining Matches Between Organisations

One of the challenges in constructing the spine was implementing a process for determining which record to list for each organisation, as many entities are registered with multiple sources. For example, a charity registered in England and Wales can operate in Scotland (and thus be registered with the Scottish Charity Regulator) and be registered with the Care Quality Commission (CQC). Other examples include charities that change legal form and thus require a new registration with the Charity Commission.

To address this issue we created a data resource that allows users to look up all of the matches for an organisation listed on the spine.

In [None]:
matches_df = pd.read_csv('public_spine.matches.csv')

In [None]:
print(f'\nThe matches data has {matches_df.shape[0]} rows and {matches_df.shape[1]} columns.\n')
print(f'matches data columns are: {list(matches_df.columns)}.\n')
print('The top 5 rows of the matches data are:')
display(matches_df.head())
print('\nThe summary statistics of the matches data are:')
display(matches_df.describe())

If a match is assured under the rules described in the associated documents, the 'uid' in the matches file is the same as that of a row in the spine. Take uid GB-CHC-200009 as an example:

In [None]:
display(matches_df[matches_df['uid'] == 'GB-CHC-200009'])

In [None]:
display(spine_df[spine_df['uid']=='GB-CHC-200009'])
display(spine_df[spine_df['uid']=='GB-COH-00686799'])

So 'The Ralph Levy Charitable Company Limited' is in the spine, but the matched organisation 'GB-COH-00686799' is not, because the two records have been found to refer to the same organisation.

The 'match_type' field in public_spine.matches.csv shows how the match was determined; in this case 'companyid - id_in_source' indicates that the Charity Commission register records the associated Companies House entry.
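
Checking whether several identifiers appear in the spine can be done in one step. The sketch below uses a made-up stand-in for the spine that contains the charity record from the example plus one invented placeholder identifier (`GB-CHC-0000001`); it is not the real data.

```python
import pandas as pd

# Stand-in for the spine: the charity record from the example above
# plus one invented placeholder identifier
demo_spine = pd.DataFrame({"uid": ["GB-CHC-200009", "GB-CHC-0000001"]})

uids_to_check = ["GB-CHC-200009", "GB-COH-00686799"]

# A set gives fast membership tests when checking many identifiers
spine_uids = set(demo_spine["uid"])
in_spine = {uid: uid in spine_uids for uid in uids_to_check}

print(in_spine)
```

With the real data you would build `spine_uids` from `spine_df["uid"]` instead.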

While there are some obvious matches (the Companies House record of an incorporated charity, for example), others are of more substantive interest. For example, how many charities are also registered with the Care regulators?

In [None]:
mask = (
    matches_df['orgA_source'].isin(['CCEW', 'OSCR']) &
    matches_df['orgB_source'].isin(['CareInspectorateScot', 'CareQualityCommission'])
)

ct = pd.crosstab(
    matches_df.loc[mask, 'orgA_source'],
    matches_df.loc[mask, 'orgB_source'],
    margins=True
)

print(ct)

There are more Scottish charities registered with the Care Inspectorate than there are English and Welsh charities registered with the Care Quality Commission.
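
The filter above looks for charity regulators in `orgA_source` and care regulators in `orgB_source`. We have not verified here whether pairs can also be recorded in the opposite order, so the sketch below shows how one might check for that on made-up data. The rows and counts are invented purely to illustrate the technique and say nothing about the real matches file.

```python
import pandas as pd

# Made-up pairs, including one recorded in the opposite direction
demo_matches = pd.DataFrame({
    "orgA_source": ["CCEW", "CCEW", "OSCR", "CCEW", "CareQualityCommission"],
    "orgB_source": ["CareQualityCommission", "CareQualityCommission",
                    "CareInspectorateScot", "CH", "CCEW"],
})

charity_sources = ["CCEW", "OSCR"]
care_sources = ["CareInspectorateScot", "CareQualityCommission"]

# Charity in column A, care regulator in column B (as in the cell above)
forward = (
    demo_matches["orgA_source"].isin(charity_sources)
    & demo_matches["orgB_source"].isin(care_sources)
)
# The same pairing recorded the other way round
reverse = (
    demo_matches["orgA_source"].isin(care_sources)
    & demo_matches["orgB_source"].isin(charity_sources)
)

n_forward = int(forward.sum())
n_reverse = int(reverse.sum())
print(f"forward pairs: {n_forward}, reverse pairs: {n_reverse}")

demo_ct = pd.crosstab(
    demo_matches.loc[forward, "orgA_source"],
    demo_matches.loc[forward, "orgB_source"],
    margins=True,
)
print(demo_ct)
```

If the reverse count is non-zero on the real data, those rows would need to be included before comparing regulators.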

### Gathering Supplementary Information for Organisations

A consequence of the matching process is the generation of additional information about each organisation. For example, a charity may have a different registration or removal date with Companies House, or a different address listed with the Care Quality Commission. We can examine this supplementary information in the third and final dataset associated with the spine.

In [None]:
supplementary_df = pd.read_csv('public_spine.supplementary.csv')

In [None]:
print(f'\nThe supplementary data has {supplementary_df.shape[0]} rows and {supplementary_df.shape[1]} columns.\n')
print(f'supplementary data columns are: {list(supplementary_df.columns)}.\n')
print('The top 5 rows of the supplementary data are:')
display(supplementary_df.head())
print('\nThe summary statistics of the supplementary data are:')
display(supplementary_df.describe())

Using the same example of the Ralph Levy Charitable Company, we find the following data in the supplementary dataset for the organisation, from both the Charity Commission (CCEW) and Companies House (CH):

In [None]:
display(supplementary_df[supplementary_df['uid']=='GB-CHC-200009'])
display(supplementary_df[supplementary_df['uid']=='GB-COH-00686799'])

These rows show four previous addresses for the organisation, and two alternative registration dates.

### Conclusion

The spine was designed to be flexible to user needs. We defined civil society organisations broadly to allow users to select sub-samples or exclude certain types of organisations depending on the analysis. We wanted a file containing one row per organisation to aid usability and interpretability, while not losing the richness and variation that comes with having some organisations registered with multiple sources.

To see how the spine can be combined with other sources of data to answer social science research questions, please see our short recording: 
* [LINK AVAILABLE AFTER RECORDING]()

### Feedback

We welcome feedback - critical and complimentary - on the training materials and the data resources more generally. Please get in contact with:
* Professor Alasdair Rutherford, University of Stirling (alasdair.rutherford@stir.ac.uk)
* Dr Diarmuid McDonnell, University of the West of Scotland (diarmuid.mcdonnell@uws.ac.uk)

We are particularly interested in ideas for improving the data resources or research questions that involve the use of the data.