In [3]:
# Import the display helpers from the public IPython.display module
# (importing display from IPython.core.display is deprecated).
from IPython.display import HTML, IFrame, Image, display
import os

# Global AI Data Policy Observatory

## Purpose
Data plays a critical role in AI development, yet it has often been overlooked. There is now increasing recognition of the importance of data quality and engineering in building effective AI systems, a perspective termed ‘data-centric AI’ by leading practitioners like Ng (2021).

Governments influence AI development through laws, regulations, and the framing of public discourse. Therefore, it is essential for those involved in AI governance and policy to understand how national policies worldwide address data in AI, as this shapes the technology’s impact.

Through this observatory, we aim to support future capacity-building efforts by policymakers who are pursuing, or wish to pursue, data-centric approaches to AI governance. To do this, we explore the extent to which ‘data’ and data-related topics are mentioned in national AI policies across the world, and why these trends might exist. This spotlights areas that may need improvement and gives some indication of how progress can be made.

To explore these questions, we scanned **512** policy documents from the ‘OECD.ai Policy Observatory’, spanning **64** countries and **2** supranational organisations. In terms of policy type, the largest share (**21.7%**) was classified as “National strategies, agendas and plans”, with the next four largest categories being “Emerging AI-related regulation” (**16.0%**), “Networking and collaborative platforms” (**9.2%**), “Regulatory oversight and ethical advice bodies” (**7.6%**) and “Project grants for public research” (**5.9%**).

Based on our findings, we recommend the following:
1. **Several crucial and popular topics need greater attention from policymakers, particularly ‘data licensing’ and ‘responsible data’.**

2. **There is an urgent need to re-prioritise declining topics like ‘data privacy’, ‘data security’, and ‘responsible data’ to ensure that increasing calls for access to AI data are executed safely, transparently and ethically.**

3. **To effectively integrate data-centric AI principles, policymakers should consider strengthening digital infrastructure in key areas: fostering open data initiatives and online services, as well as e-participation and R&D investment.**

4. **Greater efforts should be made to engage governments that are not currently prioritising open and shared data in general.**

## Data-centric AI Policy Discourse

Starting wide, we see that a minority of countries show notable interest in “data” as an Artificial Intelligence (AI) policy topic, and these countries are often concentrated in certain regions. Countries in Asia-Pacific, MENA and Europe exhibit higher standard deviation scores (**33.3**, **27.4** and **19.2**, respectively), indicating that interest in 'data' is concentrated in a small number of countries within these regions. By contrast, the standard deviation scores in N. America, S. America and Sub-Saharan Africa are lower (**11.2**, **7.0** and **2.2**, respectively), demonstrating greater regional consistency.
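As a rough illustration of the regional comparison above, the sketch below groups per-country counts of 'data' mentions by region and computes the standard deviation of each group. The region names, country names and counts are hypothetical placeholders, and the use of the population standard deviation (`pstdev`) is an assumption, since the source does not state how its scores were calculated.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical (region, country, mention count) rows, for illustration only.
# These are NOT values from the observatory's dataset.
mentions = [
    ("Region A", "Country 1", 120),
    ("Region A", "Country 2", 15),
    ("Region A", "Country 3", 10),
    ("Region B", "Country 4", 12),
    ("Region B", "Country 5", 14),
    ("Region B", "Country 6", 10),
]


def regional_spread(rows):
    """Return the mean and population standard deviation of counts per region."""
    by_region = defaultdict(list)
    for region, _country, count in rows:
        by_region[region].append(count)
    return {
        region: {"mean": mean(counts), "std": pstdev(counts)}
        for region, counts in by_region.items()
    }


spread = regional_spread(mentions)
for region, stats in spread.items():
    print(f"{region}: mean={stats['mean']:.1f}, std={stats['std']:.1f}")
```

In this toy data, Region A has one country far above the others, so its standard deviation is large, while Region B's countries sit close together and its standard deviation is small.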

### Global Levels of Interest in Data

In [5]:
display(HTML(filename=os.path.join('static', 'globaldatapolicymap.html')))

In [6]:
display(HTML(filename=os.path.join('static', 'globaldatastandarddeviation.html')))

Next, to gain greater insight into how data is being mentioned in these policy documents, we mapped several topics associated with data-centric AI and we found that several are highly influential in policy discourse (their degree of influence in brackets):
<li>Data protection (0.431017)
<li>Open data (0.400188)
<li>Data sharing (0.350491)
<li>Data quality (0.330102)
<li>Data privacy (0.326594)

### Global Levels of Interest in Data-centric AI Topics

In [7]:
display(HTML(filename=os.path.join('static', 'sankey_diagram.html')))

But, importantly, the following keywords are the least central to the network:
<li>Data licensing (0.011033)
<li>Data transparency (0.023948)
<li>Data lineage (0.026415)
<li>Data provenance (0.051985)
<li>Responsible data (0.137291)
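The scores in the two lists above describe how central each keyword is to a network of topics. The source does not say which centrality measure or network construction was used, so the following is only a minimal sketch of one common option: degree centrality on a keyword co-occurrence network, where two keywords are linked if they appear in the same document. The `documents` list and the `degree_centrality` helper are hypothetical.

```python
from itertools import combinations

# Hypothetical documents, each reduced to the set of keywords it mentions.
documents = [
    {"data protection", "open data", "data sharing"},
    {"data protection", "data quality"},
    {"open data", "data sharing", "data privacy"},
    {"data protection", "data licensing"},
]


def degree_centrality(docs):
    """Share of the other keywords that each keyword co-occurs with."""
    neighbours = {}
    for keywords in docs:
        for keyword in keywords:
            neighbours.setdefault(keyword, set())
        # Link every pair of keywords found in the same document.
        for a, b in combinations(sorted(keywords), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {keyword: 0.0 for keyword in neighbours}
    return {keyword: len(linked) / (n - 1) for keyword, linked in neighbours.items()}


centrality = degree_centrality(documents)
for keyword, score in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{keyword}: {score:.2f}")
```

Under this measure, a keyword that co-occurs with many others scores close to 1, and a keyword that appears alongside only one other topic scores close to 0.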

According to Google Trends data, the keywords most central to the network have generally been more popular over time than those at the bottom of the list. The exceptions are 'responsible data' and 'data license' (the latter being the most used variant of 'data licensing').

### General Popularity of Data-centric AI Topics

In [8]:
display(HTML(filename=os.path.join('static', 'influential_topics.html')))

In [9]:
display(HTML(filename=os.path.join('static', 'marginal_topics.html')))

What we are seeing is likely a general lack of awareness of, or interest in, these topics, which in turn may affect how they are emphasised in policy circles. Interestingly, the gap between ***general*** emphasis and emphasis in ***policy*** documents for 'responsible data' and 'data license' suggests that policymakers have been slow to pick up on these topics despite the broader attention paid to them by the ~5 billion users of Google web search. Given their broader (at least cultural) relevance, these areas call for special attention from policymakers.

Next, we look to the future by asking whether these topics are increasing or decreasing over time (2010-2024). Alarmingly, we see that most topics are consistently decreasing, whilst a few are consistently increasing, with interesting implications for both categories.

### Consistency and Trend in Data-centric Topics (2010 - 2024)

The most consistently declining topics are ‘data provenance’, ‘responsible data’, ‘data privacy’, ‘data standards’ and ‘data security’. These areas are critical for ensuring safe, ethical, and transparent AI systems, so their decline in AI policy discussions signals a potential oversight. Policymakers should give these areas more attention, since they relate to the mitigation of risks such as data breaches and biased AI systems, both of which can have significant social and political repercussions.

In [11]:
display(HTML(filename=os.path.join('static', 'styled_dataframe.html')))

| Keyword | Consistency | Trend |
|---|---|---|
| data provenance | 0.678371 | decreasing |
| responsible data | 0.984958 | decreasing |
| data quality | 1.066357 | increasing |
| data privacy | 1.192401 | decreasing |
| data access | 1.282878 | increasing |
| data standards | 1.361918 | decreasing |
| data sharing | 1.435957 | increasing |
| data security | 1.632472 | decreasing |
| data governance | 1.753847 | decreasing |
| open data | 1.954773 | decreasing |


However, countries worldwide are increasingly mentioning ‘data quality’, ‘data access’ and ‘data sharing’ in their national AI policies, with ‘data quality’ being the most consistently increasing topic. Over the last few years, a range of governments have called for access to data for regulatory purposes, which could explain their increasing focus on access to high-quality data.

This trend highlights the need for nations to invest in data infrastructure and governance, ensuring that AI systems are built on reliable, accurate, and accessible data. The need is reinforced by the decline of topics such as ‘data privacy’, ‘data security’ and ‘data provenance’, which indicates a risk that governments will facilitate data access with less regard for these essential enabling areas.
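The source does not define how the consistency scores and trend labels in the table above were computed. Purely as an illustration of one way such values could be derived, the sketch below fits a least-squares line to a keyword's yearly frequency, takes the sign of the slope as the trend, and uses the spread of the residuals as a consistency score (lower meaning steadier, which matches the direction implied by the table). The function name and both yearly series are hypothetical.

```python
from statistics import mean, pstdev


def trend_and_consistency(years, values):
    """Return (trend label, residual spread) for one keyword's yearly series.

    Assumed method, not the source's: the trend is the sign of the
    least-squares slope, and consistency is the population standard
    deviation of the residuals around the fitted line.
    """
    x_bar, y_bar = mean(years), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x) for x, y in zip(years, values)]
    if slope > 0:
        trend = "increasing"
    elif slope < 0:
        trend = "decreasing"
    else:
        trend = "flat"
    return trend, pstdev(residuals)


# Hypothetical yearly mention frequencies, for illustration only.
years = list(range(2010, 2015))
steady_decline = [10.0, 9.0, 8.0, 7.0, 6.0]
noisy_rise = [2.0, 6.0, 3.0, 8.0, 5.0]

decline_result = trend_and_consistency(years, steady_decline)
rise_result = trend_and_consistency(years, noisy_rise)
print(decline_result)
print(rise_result)
```

A series that falls by the same amount every year gets a residual spread near zero, while a series that rises overall but jumps around from year to year gets a larger one.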

## Drivers of Data-centric AI in Global Policies

**What might be driving these differences?** Using the highly robust ***Network Readiness Index (NRI)***, we find a strong correlation between how often countries mention these topics in national AI policy documents and certain digital readiness indicators. In particular, countries that actively publish and promote the use of open data and invest in government online services are significantly more likely to focus on data-centric AI policies.

This suggests that a focus on data in the context of AI often sits within a more general trend of publishing and using data. In turn, this suggests there may be more fundamental prerequisites for focusing on data in domains like AI, such as actively running open data initiatives and a focus on government online services.
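The results in this section are reported as Spearman correlations with Strong, Moderate or Weak labels. As a minimal sketch of that statistic, the code below ranks both variables (averaging tied ranks) and takes the Pearson correlation of the ranks. The input lists are hypothetical, the p-value is not computed, and the 0.5 and 0.3 cut-offs in `interpret` are an assumption chosen only because they are consistent with the labels reported here; the source does not state its thresholds.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def interpret(rho):
    """Label a coefficient using assumed cut-offs of 0.5 and 0.3."""
    strength = abs(rho)
    if strength >= 0.5:
        return "Strong"
    if strength >= 0.3:
        return "Moderate"
    return "Weak"


# Hypothetical per-country values, for illustration only.
open_data_score = [10, 20, 30, 40, 50]
data_mentions = [1, 3, 2, 5, 4]

rho = spearman(open_data_score, data_mentions)
print(round(rho, 3), interpret(rho))
```

Because the statistic works on ranks rather than raw values, it captures any monotonic relationship and is not distorted by a handful of countries with very high mention counts.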

### Relationship Between Interest in Data and Digital Readiness

In [12]:
display(HTML(filename=os.path.join('static', 'spearman_correlation_results.html')))

| NRI Indicator | Spearman Correlation | Spearman P-value | Sample Size (N) | Interpretation |
|---|---|---|---|---|
| Publication And Use Of Open Data | 0.545 | < 0.0001 | 63 | Strong |
| Government Online Services | 0.532 | < 0.0001 | 63 | Strong |
| E-Participation | 0.476 | 0.0001 | 63 | Moderate |
| Use Of Virtual Social Networks | 0.467 | 0.0001 | 63 | Moderate |
| PCT Patent Applications | 0.466 | 0.0001 | 63 | Moderate |
| Regulation Of Emerging Technologies | 0.444 | 0.0003 | 63 | Moderate |
| GERD Performed By Business Enterprise | 0.426 | 0.0005 | 63 | Moderate |
| Investment In Emerging Technologies | 0.42 | 0.0006 | 63 | Moderate |
| R&D Expenditure By Governments And Higher Education | 0.416 | 0.0007 | 63 | Moderate |
| Healthy Life Expectancy At Birth | 0.414 | 0.0007 | 63 | Moderate |


Given that we are looking at trends globally, we used the **KOF Globalisation Index (KOFGI)** to see whether different forms of globalisation made a difference. Unsurprisingly, de facto “informational globalisation” is by far the most strongly correlated with a focus on data-related topics.

### Relationship Between Interest in Data and Forms of Globalisation

In [13]:
display(HTML(filename=os.path.join('static', 'spearman_correlation_results2.html')))

| Form of Globalisation | Spearman Correlation | Spearman P-value | Sample Size (N) | Interpretation |
|---|---|---|---|---|
| Informational Globalisation, de facto | 0.515 | < 0.0001 | 64 | Strong |
| Cultural Globalisation, de facto | 0.373 | 0.0024 | 64 | Moderate |
| Cultural Globalisation, de jure | 0.318 | 0.0106 | 64 | Moderate |
| Political Globalisation, de facto | 0.315 | 0.0113 | 64 | Moderate |
| Financial Globalisation, de jure | 0.313 | 0.0117 | 64 | Moderate |
| Informational Globalisation, de jure | 0.259 | 0.039 | 64 | Weak |


In [14]:
display(HTML(filename=os.path.join('static', 'spearman_correlation_results3.html')))

| Indicator | Spearman Correlation | Spearman P-value | Sample Size (N) | Interpretation |
|---|---|---|---|---|
| Patent applications, nonresidents | 0.451 | 0.0002 | 64 | Moderate |
| High-technology exports (current US$) | 0.401 | 0.001 | 64 | Moderate |
| Bandwidth, % of pop | 0.281 | 0.0246 | 64 | Weak |


The informational globalisation category is measured in terms of “patent applications, nonresidents”, “high-technology exports” and “internet bandwidth”. In line with some of the most strongly correlated digital readiness indicators from the NRI, the first two are the most strongly correlated with a focus on data-centric AI topics. This suggests that a strong, open innovation ecosystem is also important to (or a driver of) a focus on data-centric AI topics.