
The required Python libraries were imported.

In [16]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

The 'Data' sheet was loaded from the Excel workbook available at https://ec.europa.eu/newsroom/dae/redirection/document/106734, which is linked from https://digital-strategy.ec.europa.eu/en/library/digital-decade-2024-broadband-coverage-europe-2023. The first six rows of the sheet were skipped, and only the columns of interest were kept.

In [2]:
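# Read the 'Data' sheet, skipping the first six rows so that the seventh row is used as the header
eu_broadband = pd.read_excel('https://ec.europa.eu/newsroom/dae/redirection/document/106734', sheet_name='Data', skiprows=6)
# Keep the identifying columns and the yearly values for 2018 to 2023
eu_columns = ['Country', 'Metric', 'Geography level', 2018, 2019, 2020, 2021, 2022, 2023]
eu_broadband = eu_broadband[eu_columns]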
display(eu_broadband.head())

,Country,Metric,Geography level,2018,2019,2020,2021,2022,2023
0,Austria,Land area,Total,83879.0,83879.0,83879.0,83927.0,83927.0,83927.0
1,Austria,Population,Total,8772865.0,8858775.0,8901064.0,8932664.0,8978929.0,9104772.0
2,Austria,Households,Total,3935534.0,3883312.0,3918929.0,3959143.0,3995050.0,4033080.0
3,Austria,Broadband coverage (>2Mbps),Total,3858862.0,3813412.384,3863423.0,,,
4,Austria,Broadband coverage (>30Mbps),Total,2847375.0,3058873.0,3394576.0,3694166.0,3787714.0,3797226.0


The column data types and non-null counts of the base dataframe, from which the number of missing values per column follows, are shown below.

In [3]:
# info() prints its report directly and returns None, so it is not wrapped in display()
eu_broadband.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Country          1650 non-null   object 
 1   Metric           1650 non-null   object 
 2   Geography level  1650 non-null   object 
 3   2018             1122 non-null   float64
 4   2019             1091 non-null   float64
 5   2020             1157 non-null   float64
 6   2021             1138 non-null   float64
 7   2022             1170 non-null   float64
 8   2023             1138 non-null   float64
dtypes: float64(6), object(3)
memory usage: 116.1+ KB
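
`info()` lists non-null counts rather than missing values. As a minimal sketch on a made-up frame (the column values are illustrative and not taken from the dataset), the missing values per column could be counted directly with `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for eu_broadband; the numbers are illustrative only.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B'],
    2018: [1.0, np.nan, 3.0],
    2019: [np.nan, np.nan, 6.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on `eu_broadband` should give, for each year column, 1650 minus the non-null count reported by `info()`.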

The distinct values in the 'Geography level' column are as follows

In [4]:
display(eu_broadband['Geography level'].unique())

array(['Total', 'Rural'], dtype=object)

The EU broadband dataframe was filtered to keep only the rows whose 'Geography level' is 'Total', discarding the 'Rural' rows.

In [5]:
eu_broadband = eu_broadband.query('`Geography level` == "Total"')
display(eu_broadband['Geography level'].unique())
display(eu_broadband.head())

array(['Total'], dtype=object)

,Country,Metric,Geography level,2018,2019,2020,2021,2022,2023
0,Austria,Land area,Total,83879.0,83879.0,83879.0,83927.0,83927.0,83927.0
1,Austria,Population,Total,8772865.0,8858775.0,8901064.0,8932664.0,8978929.0,9104772.0
2,Austria,Households,Total,3935534.0,3883312.0,3918929.0,3959143.0,3995050.0,4033080.0
3,Austria,Broadband coverage (>2Mbps),Total,3858862.0,3813412.384,3863423.0,,,
4,Austria,Broadband coverage (>30Mbps),Total,2847375.0,3058873.0,3394576.0,3694166.0,3787714.0,3797226.0
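
The backticks in the query string are needed because the column name contains a space. A minimal sketch on made-up data (the frame `toy` and its values are illustrative) suggesting that the `query` call selects the same rows as a plain boolean mask:

```python
import pandas as pd

# Toy frame with a column name that contains a space, as in the dataset.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Geography level': ['Total', 'Rural', 'Total', 'Rural'],
    'Value': [10.0, 4.0, 20.0, 7.0],
})

# Backticks let query() refer to a column whose name is not a valid identifier.
via_query = toy.query('`Geography level` == "Total"')

# Equivalent selection with a boolean mask.
via_mask = toy[toy['Geography level'] == 'Total']

print(via_query.equals(via_mask))
```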


Because every remaining row now has the same 'Geography level' value, the column was dropped.

In [6]:
eu_broadband = eu_broadband.drop(columns=['Geography level'])
display(eu_broadband.head())

,Country,Metric,2018,2019,2020,2021,2022,2023
0,Austria,Land area,83879.0,83879.0,83879.0,83927.0,83927.0,83927.0
1,Austria,Population,8772865.0,8858775.0,8901064.0,8932664.0,8978929.0,9104772.0
2,Austria,Households,3935534.0,3883312.0,3918929.0,3959143.0,3995050.0,4033080.0
3,Austria,Broadband coverage (>2Mbps),3858862.0,3813412.384,3863423.0,,,
4,Austria,Broadband coverage (>30Mbps),2847375.0,3058873.0,3394576.0,3694166.0,3787714.0,3797226.0


The distinct metrics in the base dataframe are as follows

In [7]:
display(eu_broadband['Metric'].unique())

array(['Land area', 'Population', 'Households',
       'Broadband coverage (>2Mbps)', 'Broadband coverage (>30Mbps)',
       'Broadband coverage (>100Mbps)', 'Broadband coverage (>1Gbps)',
       'Broadband coverage (>1Gbps upload and download)',
       'Fixed broadband coverage', 'NGA coverage',
       'Fixed VHCN coverage (FTTP & DOCSIS 3.1)',
       'VHCN coverage (as defined by BEREC)', 'DSL', 'VDSL',
       'VDSL 2 Vectoring', 'FTTP', 'Cable modem DOCSIS 3.0',
       'Cable modem DOCSIS 3.1', 'FWA', 'LTE', 'Average LTE coverage',
       '5G', '5G in the 3.4–3.8\xa0GHz band', 'Satellite',
       'Overall broadband coverage', 'DOCSIS 3.0 & FTTP coverage',
       'Cable modem', 'WiMAX', 'HSPA'], dtype=object)

The dataframe was filtered to keep only the rows whose 'Metric' is 'FTTP' or 'Households'.

In [8]:
eu_broadband = eu_broadband.query('`Metric` == "FTTP" | `Metric` == "Households"')
display(eu_broadband['Metric'].unique())
display(eu_broadband.head())

array(['Households', 'FTTP'], dtype=object)

,Country,Metric,2018,2019,2020,2021,2022,2023
2,Austria,Households,3935534.0,3883312.0,3918929.0,3959143.0,3995050.0,4033080.0
15,Austria,FTTP,512932.4,534791.0,805015.0,1054017.0,1463133.0,1652409.0
31,Belgium,Households,4914168.0,4899404.0,4751936.0,4989764.0,5022036.0,4818475.0
44,Belgium,FTTP,68688.76,174923.2,309472.3,503256.6,861948.1,1204619.0
60,Bulgaria,Households,2930380.0,2877314.0,2888188.0,2881895.0,2849557.0,2803352.0
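
When more than a couple of metrics need to be kept, chaining `|` conditions becomes unwieldy. As a hedged alternative (not the notebook's approach), `Series.isin` expresses the same filter with a list; the frame `toy` and the list `wanted` below are illustrative names:

```python
import pandas as pd

# Toy frame with three metrics for a single made-up country.
toy = pd.DataFrame({
    'Country': ['A', 'A', 'A'],
    'Metric': ['Households', 'FTTP', 'DSL'],
    'Value': [100.0, 10.0, 80.0],
})

# Filter written as in the notebook, with two conditions joined by |.
via_query = toy.query('`Metric` == "FTTP" | `Metric` == "Households"')

# Same filter written with isin() and a list of wanted metrics.
wanted = ['FTTP', 'Households']
via_isin = toy[toy['Metric'].isin(wanted)]

print(via_query.equals(via_isin))
```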


The year columns were melted into a single 'Year' column, changing the dataframe from wide to long format. The values were placed in a column named 'Number of households'.

In [9]:
eu_broadband = eu_broadband.melt(id_vars=['Country', 'Metric'], var_name='Year', value_name='Number of households')
display(eu_broadband.head())

,Country,Metric,Year,Number of households
0,Austria,Households,2018,3935534.0
1,Austria,FTTP,2018,512932.4
2,Belgium,Households,2018,4914168.0
3,Belgium,FTTP,2018,68688.76
4,Bulgaria,Households,2018,2930380.0
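
To make the wide-to-long step concrete, here is a minimal sketch of the same `melt` call on a made-up two-row frame (country 'A' and its numbers are illustrative). Each year column becomes a block of rows, with the former column label stored in 'Year':

```python
import pandas as pd

# Wide toy frame: one row per metric, one column per year.
wide = pd.DataFrame({
    'Country': ['A', 'A'],
    'Metric': ['Households', 'FTTP'],
    2018: [100.0, 10.0],
    2019: [110.0, 25.0],
})

# Collapse the year columns into 'Year' / 'Number of households' pairs.
long = wide.melt(
    id_vars=['Country', 'Metric'],
    var_name='Year',
    value_name='Number of households',
)
print(long)
```

The two rows and two year columns become four rows, one per (metric, year) combination.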


The 'Metric' column was transformed from long to wide format.

In [10]:
eu_broadband = eu_broadband.pivot(index=['Country', 'Year'], columns='Metric', values='Number of households')
display(eu_broadband)

Country,Year,FTTP,Households
Austria,2018,5.129324e+05,3.935534e+06
Austria,2019,5.347910e+05,3.883312e+06
Austria,2020,8.050150e+05,3.918929e+06
Austria,2021,1.054017e+06,3.959143e+06
Austria,2022,1.463133e+06,3.995050e+06
...,...,...,...
United Kingdom,2019,2.654659e+06,3.117797e+07
United Kingdom,2020,4.566558e+06,3.157650e+07
United Kingdom,2021,6.791722e+06,2.919635e+07
United Kingdom,2022,1.079281e+07,2.969939e+07
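
A minimal sketch of the same reshape on a made-up long frame (country 'A' and its numbers are illustrative). Note that `pivot` requires every (index, column) pair to be unique and raises a `ValueError` otherwise:

```python
import pandas as pd

# Long toy frame: one row per (country, metric, year).
long = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'A'],
    'Metric': ['Households', 'FTTP', 'Households', 'FTTP'],
    'Year': [2018, 2018, 2019, 2019],
    'Number of households': [100.0, 10.0, 110.0, 25.0],
})

# Spread the 'Metric' values into separate columns, indexed by country and year.
wide = long.pivot(
    index=['Country', 'Year'],
    columns='Metric',
    values='Number of households',
)
print(wide)
```

The result has one row per (country, year) pair and one column per metric, which is the layout needed to divide one metric by the other.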


The 'Country' and 'Year' index levels were reset to ordinary columns.

In [11]:
eu_broadband = eu_broadband.reset_index()
eu_broadband

Metric,Country,Year,FTTP,Households
0,Austria,2018,5.129324e+05,3.935534e+06
1,Austria,2019,5.347910e+05,3.883312e+06
2,Austria,2020,8.050150e+05,3.918929e+06
3,Austria,2021,1.054017e+06,3.959143e+06
4,Austria,2022,1.463133e+06,3.995050e+06
...,...,...,...,...
193,United Kingdom,2019,2.654659e+06,3.117797e+07
194,United Kingdom,2020,4.566558e+06,3.157650e+07
195,United Kingdom,2021,6.791722e+06,2.919635e+07
196,United Kingdom,2022,1.079281e+07,2.969939e+07


The leftover 'Metric' label on the column axis was removed by renaming the axis to None.

In [12]:
eu_broadband = eu_broadband.rename_axis(columns=None)
display(eu_broadband.head())

,Country,Year,FTTP,Households
0,Austria,2018,512932.4,3935534.0
1,Austria,2019,534791.0,3883312.0
2,Austria,2020,805015.0,3918929.0
3,Austria,2021,1054017.0,3959143.0
4,Austria,2022,1463133.0,3995050.0
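
The 'Metric' label is a side effect of `pivot`, which names the column axis after the pivoted column. A minimal sketch on made-up data (the frame `long` and its values are illustrative) showing the label before and after `rename_axis`:

```python
import pandas as pd

# Long toy frame for a single made-up country and year.
long = pd.DataFrame({
    'Country': ['A', 'A'],
    'Metric': ['Households', 'FTTP'],
    'Year': [2018, 2018],
    'Number of households': [100.0, 10.0],
})

flat = long.pivot(index=['Country', 'Year'], columns='Metric',
                  values='Number of households').reset_index()
name_before = flat.columns.name  # the column axis still carries the name 'Metric'

# Clear the column-axis name; the column labels themselves are unchanged.
flat = flat.rename_axis(columns=None)
name_after = flat.columns.name

print(name_before, name_after)
```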


The 'Year' column was set as index.

In [13]:
eu_broadband = eu_broadband.set_index('Year')
display(eu_broadband.head())

Year,Country,FTTP,Households
2018,Austria,512932.4,3935534.0
2019,Austria,534791.0,3883312.0
2020,Austria,805015.0,3918929.0
2021,Austria,1054017.0,3959143.0
2022,Austria,1463133.0,3995050.0


The 'FTTP' column was divided by the 'Households' column and multiplied by 100 to create a new column titled 'Percentage of households with FTTP availability'.

In [14]:
eu_broadband['Percentage of households with FTTP availability'] = eu_broadband['FTTP'] / eu_broadband['Households'] * 100
display(eu_broadband.head())

Year,Country,FTTP,Households,Percentage of households with FTTP availability
2018,Austria,512932.4,3935534.0,13.033364
2019,Austria,534791.0,3883312.0,13.771518
2020,Austria,805015.0,3918929.0,20.541709
2021,Austria,1054017.0,3959143.0,26.622352
2022,Austria,1463133.0,3995050.0,36.623647
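
As a worked check of the calculation, the sketch below recomputes the percentage for Austria in 2018 and 2019 from the figures displayed above. The frame `sample` and the range check at the end are illustrative additions, not part of the original notebook:

```python
import pandas as pd

# Austria's 2018 and 2019 figures as displayed in the output above.
sample = pd.DataFrame(
    {
        'Country': ['Austria', 'Austria'],
        'FTTP': [512932.4, 534791.0],
        'Households': [3935534.0, 3883312.0],
    },
    index=pd.Index([2018, 2019], name='Year'),
)

col = 'Percentage of households with FTTP availability'
sample[col] = sample['FTTP'] / sample['Households'] * 100
print(sample[col].round(2))

# Optional sanity check: a share of households should lie between 0 and 100.
assert sample[col].between(0, 100).all()
```

The printed values round to 13.03 and 13.77, in line with the first two rows of the output above.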


The country boundaries shapefile (SHP) was downloaded from https://ec.europa.eu/eurostat/web/gisco/geodata/administrative-units/countries with the following parameters: Year - 2024, File Format - SHP, Geometry Type - Polygons (RG), Scale - 01M, Coordinate Reference System - EPSG 4326. A zipped copy stored in the project's GitHub repository was then read with GeoPandas.

In [17]:
country_polygons = gpd.read_file('https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/european_historical_data/CNTR_RG_01M_2024_4326.shp.zip')
display(country_polygons.head())

,CNTR_ID,CNTR_NAME,NAME_ENGL,NAME_FREN,ISO3_CODE,SVRG_UN,CAPT,EU_STAT,EFTA_STAT,CC_STAT,NAME_GERM,geometry
0,AD,Andorra,Andorra,Andorre,AND,UN Member State,Andorra la Vella,F,F,F,Andorra,"POLYGON ((1.57472 42.64779, 1.58054 42.63804, ..."
1,AE,الإمارات العربية المتحدة,United Arab Emirates,Émirats arabes unis,ARE,UN Member State,Abu Dhabi,F,F,F,Vereinigten Arabischen Emirate,"MULTIPOLYGON (((56.38171 25.3521, 56.377 25.35..."
2,AF,افغانستان-افغانستان,Afghanistan,Afghanistan,AFG,UN Member State,Kabul,F,F,F,Afghanistan,"POLYGON ((71.03061 38.45177, 71.04486 38.42181..."
3,AG,Antigua and Barbuda,Antigua and Barbuda,Antigua-et-Barbuda,ATG,UN Member State,St John's,F,F,F,Antigua und Barbuda,"MULTIPOLYGON (((-61.65911 17.0691, -61.6621 17..."
4,AI,Anguilla,Anguilla,Anguilla,AIA,UK Non-Self-Governing Territory,The Valley,F,F,F,Anguilla,"MULTIPOLYGON (((-62.95206 18.27833, -62.95537 ..."


The 'NAME_ENGL' and 'geometry' columns were selected to create a new, cleaned dataframe.

In [18]:
country_polygons = country_polygons[['NAME_ENGL', 'geometry']]
display(country_polygons.head())

,NAME_ENGL,geometry
0,Andorra,"POLYGON ((1.57472 42.64779, 1.58054 42.63804, ..."
1,United Arab Emirates,"MULTIPOLYGON (((56.38171 25.3521, 56.377 25.35..."
2,Afghanistan,"POLYGON ((71.03061 38.45177, 71.04486 38.42181..."
3,Antigua and Barbuda,"MULTIPOLYGON (((-61.65911 17.0691, -61.6621 17..."
4,Anguilla,"MULTIPOLYGON (((-62.95206 18.27833, -62.95537 ..."
