# Predicting Project Costs in the UK Public Sector

## 1. Motivation & Background Knowledge about Projects in the UK Public Sector

this is text

## 2. Getting the Data & EDA

The data comes from https://www.gov.uk/government/collections/major-projects-data, which publishes yearly data from 2012 to 2022 on the progress of projects in the Government Major Projects Portfolio. Each year is available as a separate `.csv` file. After downloading all of them into the folder `raw_data`, we compare their columns so that the files can be merged into a single Dataframe.

### 2.1 Importing Libraries

In [7]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, accuracy_score

### 2.2 Preparing & Merging `.csv` files into a Dataframe
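
The column comparison and merge described at the start of section 2 could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's final code: the two small in-memory frames stand in for yearly `.csv` files, and the helper `merge_yearly_frames` as well as the columns `sro_name` and `department` are hypothetical.

```python
import pandas as pd


def merge_yearly_frames(frames_by_year):
    """Stack yearly frames on their shared columns and tag each row with its year.

    Hypothetical helper: `frames_by_year` maps a year to the DataFrame loaded
    from that year's file (for example via pd.read_csv).
    """
    # Columns present in every yearly file, kept in the first file's order
    first = next(iter(frames_by_year.values()))
    shared = [c for c in first.columns
              if all(c in f.columns for f in frames_by_year.values())]

    parts = []
    for year, frame in sorted(frames_by_year.items()):
        part = frame[shared].copy()
        part["year"] = year
        parts.append(part)
    return pd.concat(parts, ignore_index=True), shared


# Toy stand-ins for two yearly files with partly different columns
df_2021 = pd.DataFrame({"project_name": ["A", "B"],
                        "yearly_forecast": [10.0, 20.0],
                        "sro_name": ["x", "y"]})
df_2022 = pd.DataFrame({"project_name": ["A", "B"],
                        "yearly_forecast": [12.0, 18.0],
                        "department": ["d1", "d2"]})

merged, shared_columns = merge_yearly_frames({2021: df_2021, 2022: df_2022})
print(shared_columns)
print(merged)
```

Keeping only the shared columns is one possible choice; columns that exist in some years only could instead be kept and left as NaN for the other years.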

## 3. Baseline Models

### 3.0 Preparation (needs to be updated with the final Dataframe)

In [3]:
# Import .csv file as Dataframe
#df = pd.read_csv('../data/raw_data/2021_2023.csv')

In [15]:
# Adding a duration column to the Dataframe
#df['duration'] = (df['end_date'] - df['start_date']).dt.days

In [None]:
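# Adding a variance column to the Dataframe
# TODO: the line below still repeats the duration calculation; the variance columns are computed further down
#df['duration'] = (df['end_date'] - df['start_date']).dt.days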

In [18]:
# Sorting the DataFrame by 'project_name' and 'year'
df.sort_values(by=['project_name', 'year'], inplace=True)

In [20]:
# Calculating the forecast variance: percentage change of each project's yearly forecast against its previous year
df['forecast_variance_prev_year'] = df.groupby('project_name')['yearly_forecast'].pct_change() * 100

# Calculating the budget variance: percentage deviation of the yearly forecast from the yearly budget
df['forecast_variance_budget'] = ((df['yearly_forecast'] - df['yearly_budget']) / df['yearly_budget']) * 100

In [28]:
# Forward-filling missing percentage changes within each project (carries the last known change forward)
df['forecast_percentage_change_prev_year_filled'] = df.groupby('project_name')['forecast_variance_prev_year'].ffill()

In [45]:
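# Missing percentage changes (e.g. the first year of each project) are set to 0
df['forecast_variance_prev_year'] = df['forecast_variance_prev_year'].fillna(0)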
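
To make the two variance columns concrete, here is a small worked example on made-up numbers. The toy frame and its values are assumptions for illustration only; the column names follow the cells above.

```python
import pandas as pd

# Made-up data: two projects with forecasts and budgets per year
toy = pd.DataFrame({
    "project_name": ["A", "A", "A", "B", "B"],
    "year": [2021, 2022, 2023, 2022, 2023],
    "yearly_forecast": [100.0, 110.0, 99.0, 50.0, 75.0],
    "yearly_budget": [100.0, 100.0, 110.0, 40.0, 75.0],
})
toy = toy.sort_values(["project_name", "year"]).reset_index(drop=True)

# Percentage change against the same project's previous year
toy["forecast_variance_prev_year"] = (
    toy.groupby("project_name")["yearly_forecast"].pct_change() * 100
)

# Percentage deviation of the forecast from the budget of the same year
toy["forecast_variance_budget"] = (
    (toy["yearly_forecast"] - toy["yearly_budget"]) / toy["yearly_budget"] * 100
)

print(toy)
```

The first year of each project has no previous forecast, so its `forecast_variance_prev_year` is NaN. Project A rises by about 10 % and then falls by about 10 %, while project B rises by 50 %.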

### 3.1 Last Observation Carried Forward (LOCF)

In [39]:
# LOCF baseline model for costs (forecast) in absolute terms (£)
# Approach: each prediction is the forecast from the previous row

# Creating the predicted values based on the LOCF method
# Note: shift(1) runs over the whole sorted Dataframe, so the first row of each project
# is predicted from the last row of the preceding project
df['forecast_pred'] = df['yearly_forecast'].shift(1)

# There is no previous value for the very first row, so its predicted value remains NaN
# The error calculation below skips that row

# Calculating the errors for the baseline model
mse_forecast = mean_squared_error(df['yearly_forecast'].iloc[1:], df['forecast_pred'].iloc[1:])

print(f"LOCF forecast MSE: {mse_forecast}")


LOCF forecast MSE: 495769.72430543805
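
The cell above applies `shift(1)` to the whole Dataframe. A variant that carries values forward only within a project is sketched below on made-up numbers, so its MSE is not comparable to the value printed above. The toy data is an assumption; the squared-error line computes the same quantity as `mean_squared_error`.

```python
import pandas as pd

# Made-up data: two projects with yearly forecasts
toy = pd.DataFrame({
    "project_name": ["A", "A", "A", "B", "B"],
    "year": [2021, 2022, 2023, 2022, 2023],
    "yearly_forecast": [100.0, 110.0, 99.0, 50.0, 75.0],
}).sort_values(["project_name", "year"])

# LOCF within each project: the prediction is that project's previous forecast
toy["forecast_pred"] = toy.groupby("project_name")["yearly_forecast"].shift(1)

# The first year of every project has no predecessor, so it is left out
scored = toy.dropna(subset=["forecast_pred"])

# Mean squared error over the rows that have a prediction
mse_locf = ((scored["yearly_forecast"] - scored["forecast_pred"]) ** 2).mean()
print(f"Per-project LOCF MSE on toy data: {mse_locf}")
```

Dropping the first row per project, rather than filling it with 0, keeps rows without a real prediction from influencing the error.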


In [47]:
# LOCF baseline model for costs (forecast) in %
# Approach: each prediction is the percentage change from the previous row

# Creating the predicted values based on the LOCF method
df['forecast_pred_percentage'] = df['forecast_variance_prev_year'].shift(1)

# There is no previous value for the very first row, so its prediction is set to 0
df['forecast_pred_percentage'] = df['forecast_pred_percentage'].fillna(0)

# Replacing infinite values (from divisions by zero) and all remaining NaN with 0
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(0)

# Calculating the errors for the baseline model
mse_forecast = mean_squared_error(df['forecast_variance_prev_year'].iloc[1:], df['forecast_pred_percentage'].iloc[1:])

print(f"LOCF forecast MSE: {mse_forecast}")

LOCF forecast MSE: 1084528.7616024816


### 3.2 Calculating Forecast based on Variance Percentage Trend

In [48]:
# Calculate the last known forecast value and the trend of the percentage change
last_forecast = df.groupby('project_name').last()['yearly_forecast']
last_percentage_change = df.groupby('project_name').last()['forecast_percentage_change_prev_year_filled']

# Calculating forecast for 2024
predicted_forecast_2024 = last_forecast * (1 + last_percentage_change / 100)

# Creating new DataFrame including 2024 data
df_2024 = pd.DataFrame({
    'project_name': last_forecast.index,
    'year': 2024,
    'predicted_forecast': predicted_forecast_2024
})
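
The extrapolation above multiplies each project's last forecast by one plus its last percentage change. The sketch below runs the same arithmetic on made-up numbers so the result can be checked by hand; the toy data and the column name `pct_change_prev_year` are assumptions.

```python
import pandas as pd

# Made-up data: project A grows by 10 %, project B shrinks by 20 %
toy = pd.DataFrame({
    "project_name": ["A", "A", "B", "B"],
    "year": [2022, 2023, 2022, 2023],
    "yearly_forecast": [100.0, 110.0, 50.0, 40.0],
}).sort_values(["project_name", "year"])

toy["pct_change_prev_year"] = (
    toy.groupby("project_name")["yearly_forecast"].pct_change() * 100
)

# Last known forecast and last known percentage change per project
last = toy.groupby("project_name").last()

# Apply the last percentage change once more to get the next year's value
predicted_next = last["yearly_forecast"] * (1 + last["pct_change_prev_year"] / 100)

next_year = pd.DataFrame({
    "project_name": predicted_next.index,
    "year": 2024,
    "predicted_forecast": predicted_next.to_numpy(),
})
print(next_year)
```

For project A this gives roughly 110 × 1.10 = 121, and for project B roughly 40 × 0.80 = 32. Passing the values with `to_numpy()` gives the new frame a plain integer index instead of reusing `project_name` as both index and column.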

### 3.3 Mean/Median Predictor

In [49]:
# Calculating mean of 'forecast_variance_prev_year'
mean_forecast = df['forecast_variance_prev_year'].mean()

# Calculating median of 'forecast_variance_prev_year'
median_forecast = df['forecast_variance_prev_year'].median()

# Creating predictions for a new year (e.g. 2024) based on the mean and median
df['mean_forecast_prediction'] = mean_forecast
df['median_forecast_prediction'] = median_forecast

# Optional: results for the next year in a new DataFrame
df_2024_1 = pd.DataFrame({
    'project_name': df['project_name'].unique(),
    'year': 2024,
    'predicted_forecast_mean': mean_forecast,
    'predicted_forecast_median': median_forecast
})

# Displaying the results
print(df_2024_1)

                                          project_name  year   
0                      10,000 Additional Prison Places  2024  \
1    10,000 Additional Prison Places Programme - Es...  2024   
2    10K Additional Prison Places Estate Expansion ...  2024   
3    10K additional Prison places Women's Estate Ex...  2024   
4             10k Additional Prison Places - New Build  2024   
..                                                 ...   ...   
335                Workplace and Facilities Management  2024   
336                    YOI Education Services Retender  2024   
337                     YOUTH JUSTICE REFORM PROGRAMME  2024   
338                              Youth Investment Fund  2024   
339                     Youth Justice Reform Programme  2024   

     predicted_forecast_mean  predicted_forecast_median  
0                  80.923483                        0.0  
1                  80.923483                        0.0  
2                  80.923483                        0.0  

In [None]:
# Define train and test data
y_train = 
y_test = 

# Mean Predictor
mean_pred = np.mean(y_train)
y_pred_mean = np.full(y_test.shape, mean_pred)

# Median Predictor (optional)
median_pred = np.median(y_train)
y_pred_median = np.full(y_test.shape, median_pred)

# Calculating the error metric (e.g. Mean Squared Error)
mse_mean = mean_squared_error(y_test, y_pred_mean)
mse_median = mean_squared_error(y_test, y_pred_median)

print(f"Mean Predictor MSE: {mse_mean}")
print(f"Median Predictor MSE: {mse_median}")
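
Since the train and test data are not defined yet, the following sketch shows how the cell above would behave on hypothetical values. The arrays are made up for illustration and do not come from the project data; `np.mean((y_test - y_pred) ** 2)` computes the same quantity as `mean_squared_error`.

```python
import numpy as np

# Hypothetical target values; in the notebook these would come from the final Dataframe
y_train = np.array([5.0, -10.0, 0.0, 20.0, 0.0])
y_test = np.array([10.0, 0.0, -5.0])

# Both predictors are fitted on the training data only
mean_pred = np.mean(y_train)
median_pred = np.median(y_train)

# Every test row receives the same constant prediction
y_pred_mean = np.full(y_test.shape, mean_pred)
y_pred_median = np.full(y_test.shape, median_pred)

# Mean squared error of each constant predictor on the test data
mse_mean = np.mean((y_test - y_pred_mean) ** 2)
mse_median = np.mean((y_test - y_pred_median) ** 2)

print(f"Mean Predictor MSE: {mse_mean}")
print(f"Median Predictor MSE: {mse_median}")
```

For yearly project data, one plausible split would be chronological, with earlier years as training data and the latest year as test data, so that the baselines are evaluated on values they have not seen.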