### **Fibre optic data - European statistical data.**
## Load data and clean

This notebook loads and cleans data published by the European Commission on broadband connectivity and availability in the 27 EU member states and the UK.
The data will be used to compare fibre optic coverage in the UK with that in key European countries. It covers several years and also provides a split between rural and urban areas.


1 Load datafile



1.1 Load libraries and mount Google Drive

In [1]:
# Import and alias the necessary libraries
import pandas as pd
import os.path
import errno

#mount google drive
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


1.2 Read the data file - update the file location to match your environment
The EC "Broadband coverage in Europe" data file is available from https://digital-strategy.ec.europa.eu/en/library/digital-decade-2024-broadband-coverage-europe-2023

The code checks that the file exists in the specified location.

NB - at present I download the Excel spreadsheet from the link above by hand and save its last sheet as a CSV file. This step is not yet automated.
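One possible way to automate the manual export, sketched under assumptions: pandas can open a workbook and list its sheets in order, so the last sheet can be picked by position without knowing its name. The function name `read_last_sheet` and the workbook path in the usage comment are hypothetical, and reading `.xlsx` files needs the `openpyxl` package. The sheet layout has not been checked here, so extra arguments such as `header=` may be needed.

```python
import pandas as pd

def read_last_sheet(xlsx_path):
    """Return the last worksheet of an Excel workbook as a DataFrame."""
    # ExcelFile lists sheet names in workbook order (needs openpyxl for .xlsx)
    with pd.ExcelFile(xlsx_path) as workbook:
        last_sheet = workbook.sheet_names[-1]
        return workbook.parse(last_sheet)

# Hypothetical usage; the workbook file name is an assumption:
# dfEurope = read_last_sheet('/content/drive/MyDrive/Colab/broadband_coverage.xlsx')
# dfEurope.to_csv('/content/drive/MyDrive/Colab/EUROPE_FIBRE.csv', index=False)
```

Values read straight from the workbook may arrive as numbers and not as strings ending in %, so the cleaning step further down might behave differently from the CSV route.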

In [11]:
#############
# EC broadband data file available from https://digital-strategy.ec.europa.eu/en/library/digital-decade-2024-broadband-coverage-europe-2023
# save the last sheet in the spreadsheet as .csv and name EUROPE_FIBRE.csv
# This path should be set to the location of the file
#############
fileEuropeData = '/content/drive/MyDrive/Colab/EUROPE_FIBRE.csv'
#############
# Check we can find the file required, and read it into a pandas dataframe
# show the shape of the dataframe
#############
if os.path.exists(fileEuropeData) :
    print("Reading Europe fibre data.....")
    dfEurope = pd.read_csv(fileEuropeData)
else:
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), fileEuropeData)
print("Europe fibre file data shape:", dfEurope.shape)



Reading Europe fibre data.....
Europe fibre file data shape: (1650, 16)


1.3 Define functions that will clean the data
This code strips the % sign from the fields that hold percentage values
and changes the datatype of each field from string to float if required.
This allows numerical operations to be carried out on the data.
We also need to rename the column Geography level to URClass.

In [12]:
###DEFINE FUNCTIONS FOR CLEANING THE FILE
#fn_clean_years removes the % sign and changes the datatype to float
#fn_change_col_name will change any column name in any dataframe
#both functions modify the dataframe passed in and also return it


def fn_clean_years(thisdf, thiscol):
  #only convert if the column still holds strings
  if thisdf[thiscol].dtype == 'object':
    #remove the % sign, then convert the column to float in one step
    thisdf[thiscol] = thisdf[thiscol].str.replace('%', '', regex=False).astype(float)
  return thisdf

def fn_change_col_name(thisdf, oldname, newname):
  #rename column oldname to newname (used here to rename Geography level to URClass)
  thisdf.rename(columns={oldname: newname}, inplace=True)
  return thisdf
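
A quick way to see what the cleaning step does is to run the same string operation on a small made-up frame. The sketch below is illustrative only: the toy values are invented, `strip_percent` is a hypothetical helper, and `pd.to_numeric(..., errors='coerce')` is swapped in for `astype(float)` as an assumption, so that any placeholder text in a year column would become NaN instead of raising an error.

```python
import pandas as pd

# Toy data, invented for illustration (not taken from the EC file)
toy = pd.DataFrame({'Country': ['A', 'B', 'C'],
                    '2023': ['41.0%', '25.0%', 'n/a']})

def strip_percent(series):
    """Remove '%' signs and convert to float; unparseable entries become NaN."""
    cleaned = series.str.replace('%', '', regex=False)
    return pd.to_numeric(cleaned, errors='coerce')

toy['2023'] = strip_percent(toy['2023'])
print(toy.dtypes)
print(toy)
```

Whether silent coercion to NaN is preferable to a hard failure depends on how much the input file is trusted. Raising an error, as `astype(float)` does, makes unexpected values visible sooner.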

1.4 Clean data - call the functions defined above. We only clean the years 2018 - 2023 as we are only going to be looking at historical data from 2018 onwards

In [13]:
#clean and prepare data
#rename the 'Geography level' column to URClass, because the space in the name caused problems
#then for each of the year columns strip the % sign from the value and change it from a string to a float
dfEuropeClean = fn_change_col_name(dfEurope, 'Geography level', 'URClass')
for year in ['2023', '2022', '2021', '2020', '2019', '2018']:
    dfEuropeClean = fn_clean_years(dfEuropeClean, year)
print(dfEuropeClean.head())

   Country                        Metric URClass             Unit  \
0  Austria                     Land area   Total       km squared   
1  Austria                    Population   Total              ###   
2  Austria                    Households   Total              ###   
3  Austria   Broadband coverage (>2Mbps)   Total  % of Households   
4  Austria  Broadband coverage (>30Mbps)   Total  % of Households   

           2013          2014          2015          2016          2017  \
0    8387900.0%    8387900.0%    8387900.0%    8387900.0%    8387900.0%   
1  844301800.0%  845186000.0%  850688900.0%  857626100.0%  869007600.0%   
2  367087600.0%  373890627.2%  381326121.0%  385514998.2%  390333256.0%   
3         98.6%         98.4%         98.0%         98.1%         97.9%   
4         55.6%         60.3%         65.2%         67.3%         71.3%   

          2018         2019         2020         2021         2022  \
0    8387900.0    8387900.0    8387900.0    8392700.0    8392700
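
The count rows in the output above look 100 times too large: Austria's land area is roughly 83,900 km², yet the table shows 8387900.0. This suggests the CSV export formatted every value as a percentage, which is an inference from the printed values and not something the source confirms. The FTTP rows used later are percentages and are not affected. If the count rows are needed, a sketch like the following could rescale them. The helper name `rescale_counts` is hypothetical, the factor of 100 should be verified against the original spreadsheet, and the code assumes every percentage row carries the unit `% of Households`.

```python
import pandas as pd

def rescale_counts(df, year_cols, pct_unit='% of Households', factor=100.0):
    """Divide non-percentage rows by `factor`, leaving percentage rows alone."""
    out = df.copy()
    is_count = out['Unit'] != pct_unit
    out.loc[is_count, year_cols] = out.loc[is_count, year_cols] / factor
    return out

# Toy frame mirroring two of the rows printed in this notebook
demo = pd.DataFrame({
    'Metric': ['Land area', 'Broadband coverage (>2Mbps)'],
    'Unit': ['km squared', '% of Households'],
    '2018': [8387900.0, 98.1],
})
print(rescale_counts(demo, ['2018']))
```

This has to run while the Unit column is still present in the dataframe.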

1.5 Drop columns from before 2018

In [14]:
europeCols = ['Country','Metric','URClass','2018','2019','2020','2021','2022','2023']
dfEuropeClean = dfEuropeClean[europeCols]
print(dfEuropeClean.head())

   Country                        Metric URClass         2018         2019  \
0  Austria                     Land area   Total    8387900.0    8387900.0   
1  Austria                    Population   Total  877286500.0  885877500.0   
2  Austria                    Households   Total  393553380.2  388331200.0   
3  Austria   Broadband coverage (>2Mbps)   Total         98.1         98.2   
4  Austria  Broadband coverage (>30Mbps)   Total         72.4         78.8   

          2020         2021         2022         2023  
0    8387900.0    8392700.0    8392700.0    8392700.0  
1  890106400.0  893266400.0  897892900.0  910477200.0  
2  391892900.0  395914300.0  399505000.0  403308000.0  
3         98.6          NaN          NaN          NaN  
4         86.6         93.3         94.8         94.2  


1.6 Now we can filter the data to just provide data for full fibre availability. This is where Metric='FTTP'
To make this flexible, I've written a function that will filter by a specified metric, so we could include other metrics if necessary

In [15]:
###DEFINE FUNCTION TO FILTER DATASET BY METRIC

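def fn_filter_by_metric(thisdf, thismetric):
  #return a new dataframe which only contains rows for the specified metric
  #a boolean mask avoids building a query string, which would break if the metric name contained a quote
  return thisdf[thisdf['Metric'] == thismetric]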

#now run this function to get only those rows where Metric = FTTP
#this will give us all the rows for full fibre availability
dfEuropeCleanFTTP = fn_filter_by_metric(dfEuropeClean, 'FTTP')
print(dfEuropeCleanFTTP.head())

      Country Metric URClass  2018  2019  2020  2021  2022  2023
15    Austria   FTTP   Total  13.0  13.8  20.5  26.6  36.6  41.0
44    Belgium   FTTP   Total   1.4   3.6   6.5  10.1  17.2  25.0
73   Bulgaria   FTTP   Total  54.2  65.2  75.2  81.4  85.6  88.6
102   Croatia   FTTP   Total  23.4  31.0  35.6  38.7  53.9  62.1
131    Cyprus   FTTP   Total   0.5  10.1  26.2  41.4  60.0  77.1
