<a href="https://colab.research.google.com/github/yuliiabosher/Fiber-optic-project/blob/europe_stats_analysis/5year_predictive_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Fibre optic data - predicting 100% coverage
This notebook contains a short function that predicts when fibre optic availability will reach 100%,
given a set of 5 values: the % availability in each of the last 5 years for a given geographical entity (country, parliamentary constituency, postcode area, etc.).

This function takes in five values and calculates a line of best fit,
assuming they represent % availability values over a 5 year period. The values are fitted at year positions 1 to 5, so startyear is the year before the first value.

It will then do one of two things:

Where input parameter yearOrValueFlag = 0, it will return the year at which availability = 100%.

Where input parameter yearOrValueFlag > 0, it will return the predicted availability percentage for the year (startyear + yearOrValueFlag).

So if startyear = 2018 and you want to know what the value will be in 2025, set yearOrValueFlag to 7.

The function then returns the year or percentage value.
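
As a worked illustration of the fit described above, the sketch below runs `np.polyfit` on the UK figures used later in this notebook. The variable names (`slope`, `intercept`, `years_to_full`) are illustrative only and are not part of the function itself.

```python
import numpy as np

# Year positions 1..5 stand for the five observed years; position 0 is startyear.
year_index = np.array([1, 2, 3, 4, 5])
availability = np.array([8.5, 14.5, 23.3, 36.3, 51.6])  # UK, % of premises
startyear = 2018

# A degree-1 polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(year_index, availability, 1)

# Solve 100 = intercept + slope * x for x, then convert the position to a calendar year.
years_to_full = float((100 - intercept) / slope)
print(round(startyear + years_to_full))

# Predicted availability for 2025, i.e. startyear + 7.
print(float(intercept + slope * 7))
```

This reproduces the 2028 and roughly 70.04% figures that appear in the example output further down.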



In [1]:
import numpy as np

#############
# This function takes in five values and calculates a line of best fit,
# assuming they represent % availability values over a 5 year period.
# The values are fitted at year positions 1 to 5, so startyear is the year before the first value.
# It will then do one of two things:
# where input parameter yearOrValueFlag = 0, it will return the year at which availability = 100%
# where input parameter yearOrValueFlag > 0, it will return the predicted availability percentage at time startyear + yearOrValueFlag
# So if startyear = 2018 and you want to know what the value will be in 2025, set yearOrValueFlag to 7.
# For the 100% case, the line of best fit gives the number of years needed for the value to reach 100,
# which is added to the startyear specified by the user to determine the year 100 will be achieved.
# The function then returns the year or percentage value.

def fn_predict_progress(year1Value, year2Value, year3Value, year4Value, year5Value, startyear, yearOrValueFlag):

  # define two arrays to hold the years and the corresponding values
  year = np.array([1,2,3,4,5])
  value = np.array([year1Value,year2Value,year3Value,year4Value,year5Value])

  # find line equation, beta1 (coefficient) and beta0 (y-intercept)
  beta1, beta0 = np.polyfit(year, value, 1)

  if yearOrValueFlag == 0:
    # find the year position at which the fitted line reaches 100% coverage
    y_value = 100
    x_year = (y_value - beta0)/beta1

    # x_year can be an array if the inputs were array-valued, so handle both an array and a scalar
    if isinstance(x_year, np.ndarray):
      thisyear = round(x_year[0] + startyear, 0)
    else:
      thisyear = round(x_year + startyear, 0) # use x_year directly if it's a scalar

    return thisyear
  else:
    # find the predicted coverage at year position yearOrValueFlag
    x_year = yearOrValueFlag
    y_value = beta0 + beta1*x_year

    return y_value


I've included below some examples of how to call it. The hardcoded values will be derived from dataframes when it's in use; here they are taken from actual figures in the EC dataset.

In [2]:
print("UK will reach 100% full fibre in ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,0))
print("Rural areas of the UK will reach 100% full fibre in ", fn_predict_progress(8.1,11.9,20.2,29.5,39.4,2018,0))
print("UK fibre availability in 2024 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,6))
print("UK fibre availability in 2025 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,7))
print("UK fibre availability in 2026 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,8))
print("UK fibre availability in 2027 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,9))
print("UK fibre availability in 2028 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,10))
print("UK fibre availability in 2029 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,11))
print("UK fibre availability in 2030 will be  ", fn_predict_progress(8.5,14.5,23.3,36.3,51.6,2018,12))

print("France will reach 100% full fibre in ", fn_predict_progress(43.8,52.6,63.4,73.4,81.4,2018,0))
print("Rural areas of France will reach 100% full fibre in ", fn_predict_progress(12.4,18.4,28.8,45.9,64.6,2018,0))

print("Germany will reach 100% full fibre in ", fn_predict_progress(10.5,13.8,15.4,19.3,29.8,2018,0))
print("Rural areas of Germany will reach 100% full fibre in ", fn_predict_progress(5.6,10.6,11.3,16.9,25.6,2018,0))

print("Germany availability in 2027 will be ", fn_predict_progress(10.5,13.8,15.4,19.3,29.8,2018,9))
print("Germany availability in 2040 will be ", fn_predict_progress(10.5,13.8,15.4,19.3,29.8,2018,22))


UK will reach 100% full fibre in  2028.0
Rural areas of the UK will reach 100% full fibre in  2031.0
UK fibre availability in 2024 will be   59.24000000000001
UK fibre availability in 2025 will be   70.04000000000002
UK fibre availability in 2026 will be   80.84000000000002
UK fibre availability in 2027 will be   91.64000000000001
UK fibre availability in 2028 will be   102.44000000000003
UK fibre availability in 2029 will be   113.24000000000002
UK fibre availability in 2030 will be   124.04000000000002
France will reach 100% full fibre in  2025.0
Rural areas of France will reach 100% full fibre in  2026.0
Germany will reach 100% full fibre in  2040.0
Rural areas of Germany will reach 100% full fibre in  2040.0
Germany availability in 2027 will be  44.22
Germany availability in 2040 will be  101.55000000000001
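
The straight-line fit is unbounded, so the UK projections above exceed 100% from 2028 onwards, and a flat or falling trend would never reach 100% at all. A minimal sketch of one way to guard against both cases follows. `predict_capped` and `year_of_full_coverage` are hypothetical helpers, not part of the original notebook, and capping with `np.clip` is an assumption about the desired behaviour rather than a modelling improvement.

```python
import numpy as np

def _fit(values):
    """Fit a straight line to five yearly values at year positions 1..5."""
    slope, intercept = np.polyfit(np.arange(1, 6), np.asarray(values, dtype=float), 1)
    return float(slope), float(intercept)

def predict_capped(values, years_ahead):
    """Predicted % availability at startyear + years_ahead, limited to 0..100."""
    slope, intercept = _fit(values)
    return float(np.clip(intercept + slope * years_ahead, 0.0, 100.0))

def year_of_full_coverage(values, startyear):
    """Year at which the fitted line reaches 100%, or None if it never does."""
    slope, intercept = _fit(values)
    if slope <= 0:
        # A flat or falling line never climbs to 100%.
        return None
    return int(round(startyear + (100.0 - intercept) / slope))

uk = [8.5, 14.5, 23.3, 36.3, 51.6]
print(predict_capped(uk, 7))             # 2025, still below the cap
print(predict_capped(uk, 10))            # 2028, held at the cap
print(year_of_full_coverage(uk, 2018))
print(year_of_full_coverage([50, 48, 46, 44, 42], 2018))  # falling trend
```

Returning `None` for a non-positive slope avoids reporting a year in the past or dividing by zero. Whether a cap or a saturating curve (for example a logistic fit) is more appropriate depends on how the predictions will be used.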


Here's an example of using the function to predict % coverage per parliamentary constituency for 2024, 2025 and 2026.

I copied the code Yuliia wrote to load the historical data.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

def clean_and_load_parldf(link_to_file, year):
    ofcom_df = pd.read_csv(link_to_file, encoding="latin")
    # the 2018 column names are a bit different
    if year == 2018:
        # take a copy so the assignments below modify this frame rather than a slice of ofcom_df
        ofcom_full_fibre_df = ofcom_df[['parl_const_name','FTTP availability (% premises)']].copy()
        # repair the mis-decoded constituency name
        if ofcom_full_fibre_df['parl_const_name'].str.contains('YNYS MÃ”N').any():
            index = ofcom_full_fibre_df[ofcom_full_fibre_df['parl_const_name'].str.contains('YNYS MÃ”N')].index[0]
            ofcom_full_fibre_df.loc[index,'parl_const_name'] = 'YNYS MÔN'
        ofcom_full_fibre_df['parl_const_name'] = ofcom_full_fibre_df['parl_const_name'].str.upper().str.strip()
        ofcom_full_fibre_df.rename(columns={"parl_const_name": "parliamentary_constituency_name"}, inplace=True)
    else:
        # take a copy so the assignments below modify this frame rather than a slice of ofcom_df
        ofcom_full_fibre_df = ofcom_df[['parliamentary_constituency_name','Full Fibre availability (% premises)']].copy()
        # repair the mis-decoded constituency name
        if ofcom_full_fibre_df['parliamentary_constituency_name'].str.contains('YNYS MÃ”N').any():
            index = ofcom_full_fibre_df[ofcom_full_fibre_df['parliamentary_constituency_name'].str.contains('YNYS MÃ”N')].index[0]
            ofcom_full_fibre_df.loc[index,'parliamentary_constituency_name'] = 'YNYS MÔN'

        ofcom_full_fibre_df['parliamentary_constituency_name'] = ofcom_full_fibre_df['parliamentary_constituency_name'].str.upper().str.strip()

    return ofcom_full_fibre_df

df2023 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/202401_fixed_pcon_coverage_r01.csv", \
2024)
df2022 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/202209_fixed_pcon_coverage_r02.csv", \
2022)
df2021 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/202109_fixed_pcon_coverage_r01.csv", \
2021)
df2020 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/202009_fixed_pcon_coverage_r01.csv", \
2020)
df2019 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/201909_fixed_pcon_coverage_r01.csv", \
2019)
df2018 = clean_and_load_parldf("https://raw.githubusercontent.com/yuliiabosher/Fiber-optic-project/refs/heads/parliamentary-constituencies/201809_fixed_pcon_coverage_r01.csv", \
2018)

#rename the value cols so they have different names
df2023.rename(columns={"Full Fibre availability (% premises)": "FTTP2023"}, inplace=True)
df2022.rename(columns={"Full Fibre availability (% premises)": "FTTP2022"}, inplace=True)
df2021.rename(columns={"Full Fibre availability (% premises)": "FTTP2021"}, inplace=True)
df2020.rename(columns={"Full Fibre availability (% premises)": "FTTP2020"}, inplace=True)
df2019.rename(columns={"Full Fibre availability (% premises)": "FTTP2019"}, inplace=True)

#merge them all together
#possibly a better way to do this?
df_once = df2023.merge(df2022, on=['parliamentary_constituency_name'],how='inner')
df_again = df_once.merge(df2021, on=['parliamentary_constituency_name'],how='inner')
df_andagain =  df_again.merge(df2020, on=['parliamentary_constituency_name'],how='inner')
df_final =  df_andagain.merge(df2019, on=['parliamentary_constituency_name'],how='inner')
df_final.head()

#loop through merged dataframe
#for each row, get the predicted coverage for 2024, 2025 and 2026

vals2024= []
vals2025=[]
vals2026=[]
for _,row in df_final.iterrows():
    thisval2024 = fn_predict_progress(row['FTTP2019'], row['FTTP2020'], row['FTTP2021'], row['FTTP2022'], row['FTTP2023'], 2018, 6)
    vals2024.append(thisval2024)
    thisval2025 = fn_predict_progress(row['FTTP2019'], row['FTTP2020'], row['FTTP2021'], row['FTTP2022'], row['FTTP2023'], 2018, 7)
    vals2025.append(thisval2025)
    thisval2026 = fn_predict_progress(row['FTTP2019'], row['FTTP2020'], row['FTTP2021'], row['FTTP2022'], row['FTTP2023'], 2018, 8)
    vals2026.append(thisval2026)

#create new columns in the dataframe
df_final['FTTP2024'] = vals2024
df_final['FTTP2025'] = vals2025
df_final['FTTP2026'] = vals2026
df_final.head()


| | parliamentary_constituency_name | FTTP2023 | FTTP2022 | FTTP2021 | FTTP2020 | FTTP2019 | FTTP2024 | FTTP2025 | FTTP2026 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | ABERAVON | 39.9 | 7.3 | 4.0 | 1.5 | 1.6 | 35.58 | 43.82 | 52.06 |
| 1 | ABERCONWY | 77.1 | 45.7 | 11.4 | 8.0 | 5.9 | 83.65 | 101.66 | 119.67 |
| 2 | ABERDEEN NORTH | 88.8 | 75.6 | 58.7 | 20.5 | 8.0 | 115.33 | 137.0 | 158.67 |
| 3 | ABERDEEN SOUTH | 85.9 | 74.8 | 70.6 | 58.3 | 22.1 | 105.57 | 119.98 | 134.39 |
| 4 | AIRDRIE AND SHOTTS | 41.6 | 18.2 | 8.5 | 4.3 | 1.2 | 43.17 | 52.64 | 62.11 |
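
The code comments above ask whether there is a better way to chain the merges. One possible alternative, sketched here on two rows taken from the output table rather than on the full Ofcom files, is to fold the merges with `functools.reduce` and to fit every row at once by passing a 2-D array to `np.polyfit`. The names `yearly`, `frames` and `merged` are illustrative and do not appear in the original notebook.

```python
from functools import reduce

import numpy as np
import pandas as pd

key = "parliamentary_constituency_name"

# One small frame per year, using the ABERAVON and ABERCONWY rows shown above.
yearly = {
    2019: [1.6, 5.9],
    2020: [1.5, 8.0],
    2021: [4.0, 11.4],
    2022: [7.3, 45.7],
    2023: [39.9, 77.1],
}
frames = [
    pd.DataFrame({key: ["ABERAVON", "ABERCONWY"], f"FTTP{year}": values})
    for year, values in yearly.items()
]

# Fold the list of frames into one frame with repeated inner merges.
merged = reduce(lambda left, right: left.merge(right, on=key, how="inner"), frames)

# np.polyfit accepts a 2-D y of shape (n_points, n_series): one column per constituency.
history_cols = [f"FTTP{year}" for year in range(2019, 2024)]
y = merged[history_cols].to_numpy(dtype=float).T       # shape (5, n_rows)
slope, intercept = np.polyfit(np.arange(1, 6), y, 1)   # each has shape (n_rows,)

# Year positions 6, 7 and 8 correspond to 2024, 2025 and 2026 when position 1 is 2019.
for years_ahead, year in [(6, 2024), (7, 2025), (8, 2026)]:
    merged[f"FTTP{year}"] = (intercept + slope * years_ahead).round(2)

print(merged)
```

Fitting all rows in one call replaces the `iterrows` loop and its three separate fits per row, which may matter once the full set of constituencies is loaded.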
