<a href="https://colab.research.google.com/github/yuliiabosher/Fiber-optic-project/blob/europe_stats_analysis/european_fibre_optic_data_clean_and_load.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fibre optic data - European statistical data
Load data and clean

This notebook contains code to load data provided by the European Commission about broadband connectivity and availability in the 27 EU member states and the UK. The data will be used to compare fibre optic coverage in the UK with that in key European countries. The data covers several years and also provides a split between rural and urban areas.


Read the data files. The location of the files should be updated to match your environment. The EC Broadband coverage in Europe data file is available from https://digital-strategy.ec.europa.eu/en/library/digital-decade-2024-broadband-coverage-europe-2023

The code checks that the files exist in the specified location.

NB - currently I have downloaded the Excel spreadsheet from the link above and saved the last 2 sheets as CSV files:

sheet **data** - save as EUROPE_FIBRE_HH.csv

sheet **data%** - save as EUROPE_FIBRE.csv

In [None]:
# Import and alias the necessary libraries
import pandas as pd
import os.path
import errno

#mount google drive
from google.colab import drive
drive.mount('/content/drive')

#############
# EC broadband data file available from https://digital-strategy.ec.europa.eu/en/library/digital-decade-2024-broadband-coverage-europe-2023
# save the last two sheets of the spreadsheet as .csv files:
#   sheet "data"  -> EUROPE_FIBRE_HH.csv
#   sheet "data%" -> EUROPE_FIBRE.csv
# These paths should be set to the location of the files
#############
fileEuropeData = '/content/drive/MyDrive/Colab/EUROPE_FIBRE.csv'
fileEuropeHouseholds = '/content/drive/MyDrive/Colab/EUROPE_FIBRE_HH.csv'

#fileEuropeData = 'd:/Users/Sharon/Documents/College/data/EUROPE_FIBRE.csv'
#fileEuropeHouseholds = 'd:/Users/Sharon/Documents/College/data/EUROPE_FIBRE_HH.csv'
#############
# Check we can find the file required, and read it into a pandas dataframe
# show the shape of the dataframe
#############
if os.path.exists(fileEuropeData) :
    print("Reading Europe fibre data.....")
    dfEurope = pd.read_csv(fileEuropeData)
else:
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), fileEuropeData)
print("Europe fibre file data shape:", dfEurope.shape)

if os.path.exists(fileEuropeHouseholds) :
    print("Reading Europe fibre data (households).....")
    dfEuropeHH = pd.read_csv(fileEuropeHouseholds)
else:
    raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), fileEuropeHouseholds)
print("Europe fibre file data (households) shape:", dfEuropeHH.shape)

Mounted at /content/drive
Reading Europe fibre data.....
Europe fibre file data shape: (1650, 16)
Reading Europe fibre data (households).....
Europe fibre file data (households) shape: (1650, 15)


The data fields holding the values for each year require some cleaning.

This is done for both dataframes, dfEurope and dfEuropeHH:

1. The % and absolute values are loaded as strings because they contain commas and % signs, as well as some dashes (-) and spaces. These characters are removed and the values are converted to floats.
2. The column called Geography level is renamed to URClass to reflect its Urban/Rural values.

The resulting cleaned dataframes are named

dfEuropeClean

dfEuropeCleanHH
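
The cleaning described above can be illustrated on a small made-up sample. This is only a sketch, not the notebook's own function: the helper name `clean_numeric_strings` and the sample values are hypothetical. It strips the % signs, commas and spaces, then relies on `pd.to_numeric(..., errors='coerce')` so that anything still non-numeric (a lone dash, an empty string) becomes NaN.

```python
import pandas as pd

def clean_numeric_strings(series):
    """Strip %, commas and spaces, then convert to float (non-numeric -> NaN)."""
    stripped = (
        series.astype(str)
        .str.replace('%', '', regex=False)
        .str.replace(',', '', regex=False)
        .str.replace(' ', '', regex=False)
    )
    # errors='coerce' turns leftovers such as '-' or '' into NaN
    return pd.to_numeric(stripped, errors='coerce')

# Hypothetical sample: a percentage, a count with commas, a dash and a blank
sample = pd.DataFrame({'2023': ['98.6%', '1,652,409', ' - ', '']})
sample['2023'] = clean_numeric_strings(sample['2023'])
print(sample['2023'].tolist())
```

Coercing with `pd.to_numeric` means a lone dash does not need to be removed explicitly, and a leading minus sign on a genuine negative number would be preserved.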

In [None]:
###DEFINE FUNCTIONS FOR CLEANING THE FILE
#fn_clean_years removes dashes, % signs, commas and spaces, then changes the datatype to float
#fn_change_col_name will change any column name in any dataframe


def fn_clean_years(thisdf, thiscol):
  #only clean the column if it is still a string (object) column
  if thisdf[thiscol].dtype == 'object':
    #remove dashes (-)
    thisdf[thiscol] = thisdf[thiscol].str.replace('-', '', regex=False)
    #remove the % sign
    thisdf[thiscol] = thisdf[thiscol].str.replace('%', '', regex=False)
    #remove commas
    thisdf[thiscol] = thisdf[thiscol].str.replace(',', '', regex=False)
    #remove spaces
    thisdf[thiscol] = thisdf[thiscol].str.replace(' ', '', regex=False)
    #finally replace empty strings with None
    thisdf[thiscol] = thisdf[thiscol].replace('', None)
    #now change the datatype of the column to float
    thisdf[thiscol] = thisdf[thiscol].astype(float)
  return thisdf

def fn_change_col_name(thisdf, oldname, newname):
  #rename the column oldname to newname (this modifies thisdf in place)
  thisdf.rename(columns={oldname: newname}, inplace=True)
  return thisdf

In [None]:
#clean and prepare data
#rename the Geography level column to URClass, because the space in the name caused problems
#then for each of the year columns strip the % sign from the value and change it from a string to a float
dfEuropeClean = fn_change_col_name(dfEurope, 'Geography level', 'URClass')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2023')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2022')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2021')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2020')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2019')
dfEuropeClean = fn_clean_years(dfEuropeClean, '2018')
print(dfEuropeClean.head())

#now do the same for the households data
#in this case, we need to strip the commas from the year columns and change them to floats
dfEuropeCleanHH = fn_change_col_name(dfEuropeHH, 'Geography level', 'URClass')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2023')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2022')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2021')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2020')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2019')
dfEuropeCleanHH = fn_clean_years(dfEuropeCleanHH, '2018')
print(dfEuropeCleanHH.head())

   Country                        Metric URClass             Unit  \
0  Austria                     Land area   Total       km squared   
1  Austria                    Population   Total              ###   
2  Austria                    Households   Total              ###   
3  Austria   Broadband coverage (>2Mbps)   Total  % of Households   
4  Austria  Broadband coverage (>30Mbps)   Total  % of Households   

           2013          2014          2015          2016          2017  \
0    8387900.0%    8387900.0%    8387900.0%    8387900.0%    8387900.0%   
1  844301800.0%  845186000.0%  850688900.0%  857626100.0%  869007600.0%   
2  367087600.0%  373890627.2%  381326121.0%  385514998.2%  390333256.0%   
3         98.6%         98.4%         98.0%         98.1%         97.9%   
4         55.6%         60.3%         65.2%         67.3%         71.3%   

          2018         2019         2020         2021         2022  \
0    8387900.0    8387900.0    8387900.0    8392700.0    8392700

Reduce the columns to just those for 2018 onwards, together with the country, the metric (FTTP etc.) and the URClass (Rural or Total).

In [None]:
europeCols = ['Country','Metric','URClass','2018','2019','2020','2021','2022','2023']
dfEuropeClean = dfEuropeClean[europeCols]
print(dfEuropeClean.head())

europeHHCols = ['Country','Metric','URClass','2018','2019','2020','2021','2022','2023']
dfEuropeCleanHH = dfEuropeCleanHH[europeHHCols]
print(dfEuropeCleanHH.head())

   Country                        Metric URClass         2018         2019  \
0  Austria                     Land area   Total    8387900.0    8387900.0   
1  Austria                    Population   Total  877286500.0  885877500.0   
2  Austria                    Households   Total  393553380.2  388331200.0   
3  Austria   Broadband coverage (>2Mbps)   Total         98.1         98.2   
4  Austria  Broadband coverage (>30Mbps)   Total         72.4         78.8   

          2020         2021         2022         2023  
0    8387900.0    8392700.0    8392700.0    8392700.0  
1  890106400.0  893266400.0  897892900.0  910477200.0  
2  391892900.0  395914300.0  399505000.0  403308000.0  
3         98.6          NaN          NaN          NaN  
4         86.6         93.3         94.8         94.2  
   Country                        Metric URClass       2018       2019  \
0  Austria                     Land area   Total    83879.0    83879.0   
1  Austria                    Population   Tota

Now rename the value fields in each dataset so that they have sensible names when merged.

In the percentage dataframe, 2018 becomes 2018% etc.

In the number of households dataframe, 2018 becomes 2018HH etc.

Then merge the two dataframes on Country/Metric/URClass.
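
The renaming and merge described above can be sketched on a two-row example. This is an illustration rather than the notebook's code: the variable names `pct`, `hh` and `merged` are hypothetical, the values are the Austria and Belgium FTTP figures for 2022 and 2023 that appear in this notebook's output, and `validate='one_to_one'` is an optional safeguard that the original merge does not use.

```python
import pandas as pd

years = ['2022', '2023']
keys = ['Country', 'Metric', 'URClass']

# Percentage of households covered
pct = pd.DataFrame({'Country': ['Austria', 'Belgium'],
                    'Metric': ['FTTP', 'FTTP'],
                    'URClass': ['Total', 'Total'],
                    '2022': [36.6, 17.2], '2023': [41.0, 25.0]})
# Absolute number of households covered
hh = pd.DataFrame({'Country': ['Austria', 'Belgium'],
                   'Metric': ['FTTP', 'FTTP'],
                   'URClass': ['Total', 'Total'],
                   '2022': [1463133.0, 861948.0], '2023': [1652409.0, 1204619.0]})

# rename() without inplace=True returns a new dataframe
pct = pct.rename(columns={y: y + '%' for y in years})
hh = hh.rename(columns={y: y + 'HH' for y in years})

# validate raises an error if the key columns do not identify rows uniquely
merged = pd.merge(pct, hh, on=keys, how='inner', validate='one_to_one')
print(merged.columns.tolist())
```

Renaming all the year columns with one dictionary keeps the suffix logic in a single place, which may be easier to maintain than one call per year.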

In [None]:
#rename year columns in each dataset
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2018', '2018%')
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2019', '2019%')
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2020', '2020%')
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2021', '2021%')
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2022', '2022%')
dfEuropeClean = fn_change_col_name(dfEuropeClean, '2023', '2023%')

dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2018', '2018HH')
dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2019', '2019HH')
dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2020', '2020HH')
dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2021', '2021HH')
dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2022', '2022HH')
dfEuropeCleanHH = fn_change_col_name(dfEuropeCleanHH, '2023', '2023HH')

##merge the two datasets
colslist = ['Country', 'Metric','URClass']
dfFinal = pd.merge(dfEuropeClean,dfEuropeCleanHH,on=colslist, how='inner')



A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  thisdf.rename(columns={oldname: newname}, inplace=True)
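
The message above is pandas' SettingWithCopyWarning. It is raised here because dfEuropeClean and dfEuropeCleanHH were created by selecting a subset of columns from another dataframe, and fn_change_col_name then renames their columns with `inplace=True`. One common way to avoid it, sketched below on made-up data, is to take an explicit `.copy()` of the subset before modifying it. The names `df` and `subset` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up dataframe with one column we want to drop and one we want to rename
df = pd.DataFrame({'Country': ['Austria'], 'Metric': ['FTTP'],
                   '2017': [1.0], '2023': [41.0]})

# An explicit copy makes the subset an independent dataframe
subset = df[['Country', 'Metric', '2023']].copy()

# Renaming in place now only affects the copy
subset.rename(columns={'2023': '2023%'}, inplace=True)

print(subset.columns.tolist())
print(df.columns.tolist())
```

The original dataframe keeps its column names, and the rename applies only to the copied subset.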


Finally, let's keep only those rows for the metric FTTP, so that we have values only for fibre to the premises availability.

The resulting dataframe, dfEuropeCleanFTTP, has the following structure:

Country - name of the country

Metric - now always FTTP

URClass - Total (i.e. rural and urban) or Rural

2018% - percentage of households with access to FTTP in 2018

2018HH - absolute number of households with access to FTTP in 2018

2019% - percentage of households with access to FTTP in 2019

2019HH - absolute number of households with access to FTTP in 2019

2020% - percentage of households with access to FTTP in 2020

2020HH - absolute number of households with access to FTTP in 2020

2021% - percentage of households with access to FTTP in 2021

2021HH - absolute number of households with access to FTTP in 2021

2022% - percentage of households with access to FTTP in 2022

2022HH - absolute number of households with access to FTTP in 2022

2023% - percentage of households with access to FTTP in 2023

2023HH - absolute number of households with access to FTTP in 2023
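
To show how each pair of % and HH columns relates, the sketch below takes the 2023 FTTP values for Austria and Belgium from this notebook's output and derives the total number of households that each pair implies. The column name `implied_total_HH` is hypothetical, and the result is only approximate because the percentages are rounded to one decimal place.

```python
import pandas as pd

# 2023 FTTP values for two countries, in the structure described above
fttp = pd.DataFrame({
    'Country': ['Austria', 'Belgium'],
    'Metric': ['FTTP', 'FTTP'],
    'URClass': ['Total', 'Total'],
    '2023%': [41.0, 25.0],
    '2023HH': [1652409.0, 1204619.0],
})

# households covered / share of households covered = implied total households
fttp['implied_total_HH'] = fttp['2023HH'] / (fttp['2023%'] / 100)
print(fttp[['Country', 'implied_total_HH']].round(0))
```

A check like this can help confirm that the percentage and household columns were merged against the same country and URClass rows.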



In [None]:
###DEFINE FUNCTION TO FILTER DATASET BY METRIC

def fn_filter_by_metric(thisdf, thismetric):
  #return a new dataset which only contains the specified metric
  #a boolean mask avoids building the query string by hand
  return thisdf[thisdf['Metric'] == thismetric]

#now run this function to get only those rows where Metric = FTTP
#this will give us all the rows for full fibre availability
dfEuropeCleanFTTP = fn_filter_by_metric(dfFinal, 'FTTP')
dfEuropeCleanFTTP.head()

| | Country | Metric | URClass | 2018% | 2019% | 2020% | 2021% | 2022% | 2023% | 2018HH | 2019HH | 2020HH | 2021HH | 2022HH | 2023HH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Austria | FTTP | Total | 13.0 | 13.8 | 20.5 | 26.6 | 36.6 | 41.0 | 512932.0 | 534791.0 | 805015.0 | 1054017.0 | 1463133.0 | 1652409.0 |
| 44 | Belgium | FTTP | Total | 1.4 | 3.6 | 6.5 | 10.1 | 17.2 | 25.0 | 68689.0 | 174923.0 | 309472.0 | 503257.0 | 861948.0 | 1204619.0 |
| 73 | Bulgaria | FTTP | Total | 54.2 | 65.2 | 75.2 | 81.4 | 85.6 | 88.6 | 1589576.0 | 1877295.0 | 2173209.0 | 2345497.0 | 2439632.0 | 2484082.0 |
| 102 | Croatia | FTTP | Total | 23.4 | 31.0 | 35.6 | 38.7 | 53.9 | 62.1 | 350771.0 | 450768.0 | 506428.0 | 547711.0 | 760941.0 | 891208.0 |
| 131 | Cyprus | FTTP | Total | 0.5 | 10.1 | 26.2 | 41.4 | 60.0 | 77.1 | 1526.0 | 31661.0 | 81913.0 | 130178.0 | 190216.0 | 246704.0 |
