
# Project Portfolio Analysis

This notebook provides an exploratory and predictive analysis of a synthetic project portfolio dataset. The dataset simulates records for projects managed by program and data analysts and includes planning and performance metrics such as budgets, timelines, team sizes, and satisfaction scores. The goal of this analysis is to demonstrate skills in data manipulation, visualization, and machine learning modelling suitable for roles such as business analyst, program manager, or data analyst.



## Dataset Description

The dataset (`project_portfolio_data.csv`) contains **200** synthetic project records with the following fields:

- **Project_ID**: Unique identifier for each project.
- **Project_Name**: A simple name for the project.
- **Start_Date**: The date when the project started.
- **Planned_End_Date**: The planned completion date.
- **Actual_End_Date**: The actual completion date.
- **Planned_Duration_Days**: Duration between the start date and planned end date (days).
- **Actual_Duration_Days**: Duration between the start date and actual end date (days).
- **Planned_Budget**: Initial budget allocated to the project.
- **Actual_Budget**: Actual spend for the project.
- **Team_Size**: Number of people assigned to the project.
- **Complexity**: Categorical measure of project complexity (`Low`, `Medium`, `High`).
- **Client_Satisfaction_Score**: Satisfaction rating from clients (1–10).
- **Risk_Level**: Categorical risk assessment (`Low`, `Medium`, `High`).
- **Manager_Experience_Years**: Years of experience for the program manager.
- **Completed**: Binary flag indicating whether the project was completed before 2024-01-01 (1) or not (0).
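
The two duration fields are described above as differences between date columns, so they can be checked against those columns. The following is a minimal sketch of such a check, assuming the durations are whole-day differences; the two-row `sample` frame and the helper name `duration_mismatches` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical two-row sample that reuses the column names listed above.
sample = pd.DataFrame({
    "Start_Date": pd.to_datetime(["2023-01-01", "2023-03-15"]),
    "Planned_End_Date": pd.to_datetime(["2023-04-01", "2023-06-15"]),
    "Actual_End_Date": pd.to_datetime(["2023-04-11", "2023-06-10"]),
    "Planned_Duration_Days": [90, 92],
    "Actual_Duration_Days": [100, 87],
})


def duration_mismatches(frame: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose stored durations disagree with the date columns."""
    planned = (frame["Planned_End_Date"] - frame["Start_Date"]).dt.days
    actual = (frame["Actual_End_Date"] - frame["Start_Date"]).dt.days
    bad = (planned != frame["Planned_Duration_Days"]) | (
        actual != frame["Actual_Duration_Days"]
    )
    return frame[bad]


mismatches = duration_mismatches(sample)
print(f"Rows with inconsistent durations: {len(mismatches)}")
```

Running the same helper on the loaded dataset would show whether the stored durations follow this definition.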

We will load the dataset, perform exploratory data analysis (EDA), visualize key relationships, and build simple predictive models.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, accuracy_score
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
%matplotlib inline


In [None]:

# Load dataset
file_path = 'project_portfolio_data.csv'
df = pd.read_csv(file_path, parse_dates=['Start_Date', 'Planned_End_Date', 'Actual_End_Date'])
# Preview the first five rows
df.head()


In [None]:

# Summary statistics
df.describe(include='all')


In [None]:

# Data types and missing values
df.info()
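
`df.info()` reports non-null counts per column, which leaves the number of missing values to be worked out by subtraction. A direct per-column count is often easier to read. The sketch below shows one way to do it on a small hypothetical frame (`toy` and its values are illustrative and not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few deliberately missing entries.
toy = pd.DataFrame({
    "Planned_Budget": [100000.0, 250000.0, np.nan],
    "Risk_Level": ["Low", None, "High"],
    "Team_Size": [5, 8, 12],
})

missing = toy.isna().sum()                # count of missing values per column
missing_share = toy.isna().mean().round(2)  # fraction of rows missing per column

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```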


In [None]:

# Visualizations
sns.set_theme(style='whitegrid', palette='deep')  # set_theme is the current name for sns.set

# Budget comparison
plt.figure(figsize=(8,5))
sns.scatterplot(data=df, x='Planned_Budget', y='Actual_Budget', hue='Complexity')
plt.title('Planned vs Actual Budget by Project Complexity')
plt.xlabel('Planned Budget')
plt.ylabel('Actual Budget')
plt.legend(title='Complexity')
plt.tight_layout()
plt.show()

# Actual Duration by Complexity
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='Complexity', y='Actual_Duration_Days')
plt.title('Distribution of Actual Duration by Complexity')
plt.xlabel('Project Complexity')
plt.ylabel('Actual Duration (days)')
plt.tight_layout()
plt.show()

# Completion counts
plt.figure(figsize=(6,4))
sns.countplot(x='Completed', data=df)
plt.title('Completed vs Not Completed Projects')
plt.xlabel('Completed')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
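
The scatter plot shows budget differences only visually. A numeric summary can complement it. The sketch below computes a percentage overrun and aggregates it by complexity; the `Budget_Overrun_Pct` column name and the four-row `toy` frame are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's column names.
toy = pd.DataFrame({
    "Complexity": ["Low", "Low", "High", "High"],
    "Planned_Budget": [100_000, 200_000, 300_000, 400_000],
    "Actual_Budget": [105_000, 190_000, 360_000, 440_000],
})

# Positive values mean the project spent more than planned.
toy["Budget_Overrun_Pct"] = (
    (toy["Actual_Budget"] - toy["Planned_Budget"]) / toy["Planned_Budget"] * 100
)

summary = toy.groupby("Complexity")["Budget_Overrun_Pct"].agg(
    ["mean", "median", "count"]
)
print(summary.round(1))
```

Applying the same two steps to `df` would give one row per complexity level.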


In [None]:

# Prepare data for modelling
features = ['Planned_Duration_Days', 'Planned_Budget', 'Team_Size', 'Manager_Experience_Years',
            'Complexity', 'Risk_Level']

# Predicting Client Satisfaction Score (regression)
X_reg = df[features]
y_reg = df['Client_Satisfaction_Score']

# Predicting Completed flag (classification)
X_clf = df[features]
y_clf = df['Completed']

# Define categorical and numerical columns
categorical_cols = ['Complexity', 'Risk_Level']
numerical_cols = ['Planned_Duration_Days', 'Planned_Budget', 'Team_Size', 'Manager_Experience_Years']

# Preprocessor: one-hot encode categorical features, pass through numerical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numerical_cols)
    ])

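# Give each pipeline its own unfitted copy of the preprocessor, so that
# fitting one pipeline cannot overwrite the encoder fitted by the other
from sklearn.base import clone

# Regression pipeline
reg_model = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('model', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Classification pipeline
clf_model = Pipeline([
    ('preprocessor', clone(preprocessor)),
    ('model', RandomForestClassifier(n_estimators=100, random_state=42))
])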

# Train-test split for regression
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
reg_model.fit(X_train_reg, y_train_reg)
y_pred_reg = reg_model.predict(X_test_reg)
rmse = np.sqrt(mean_squared_error(y_test_reg, y_pred_reg))

# Train-test split for classification
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42)
clf_model.fit(X_train_clf, y_train_clf)
y_pred_clf = clf_model.predict(X_test_clf)
accuracy = accuracy_score(y_test_clf, y_pred_clf)

print(f"Regression model RMSE: {rmse:.2f}")
print(f"Classification model Accuracy: {accuracy:.2f}")
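
An RMSE or accuracy value is hard to interpret on its own, so it may help to compare it with a naive baseline. The sketch below uses scikit-learn's `DummyRegressor` and `DummyClassifier` on randomly generated stand-in data; the arrays `X`, `y_score`, and `y_done` are assumptions and do not come from the project dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))            # stand-in numeric features (assumption)
y_score = rng.integers(1, 11, size=200)  # stand-in satisfaction scores, 1 to 10
y_done = rng.integers(0, 2, size=200)    # stand-in Completed flag

# Split features and both targets together so the rows stay aligned.
X_tr, X_te, ys_tr, ys_te, yd_tr, yd_te = train_test_split(
    X, y_score, y_done, test_size=0.2, random_state=42
)

# Regression baseline: always predict the training mean.
baseline_reg = DummyRegressor(strategy="mean").fit(X_tr, ys_tr)
baseline_rmse = np.sqrt(mean_squared_error(ys_te, baseline_reg.predict(X_te)))

# Classification baseline: always predict the most frequent training class.
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X_tr, yd_tr)
baseline_acc = accuracy_score(yd_te, baseline_clf.predict(X_te))

print(f"Baseline RMSE: {baseline_rmse:.2f}")
print(f"Baseline accuracy: {baseline_acc:.2f}")
```

A Random Forest that does not clearly beat these numbers on the real data has learned little from the features.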



## Conclusions and Next Steps

In this synthetic project portfolio analysis, we explored the relationship between planned and actual project budgets, examined how actual durations are distributed across complexity levels, and built simple predictive models to estimate client satisfaction and project completion likelihood.

Key takeaways:

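- **Planned vs actual budgets**: The scatter plot of actual against planned budget, coloured by complexity, shows how budget overruns vary with project complexity; points above the line y = x are projects that overspent.
- **Duration differences**: In the box plot, high-complexity projects show greater variation in actual durations.
- **Predictive modelling**: Basic Random Forest models achieved reasonable performance on a 20% hold-out set when predicting client satisfaction (measured by RMSE) and project completion (measured by accuracy), although with only 40 test records both scores are rough estimates.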

**Next steps** could include hyperparameter tuning, more sophisticated modelling (e.g., gradient boosting, time-series analysis), and integration with real project management data.
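
As an illustration of the hyperparameter tuning step, the sketch below runs a small grid search over a Random Forest pipeline with `GridSearchCV`. The generated `toy` data, the random `target`, and the grid values are assumptions chosen to keep the example fast; they are not tuned recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data that reuses a few of the dataset's column names.
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Planned_Budget": rng.uniform(50_000, 500_000, n),
    "Team_Size": rng.integers(3, 20, n),
    "Complexity": rng.choice(["Low", "Medium", "High"], n),
})
target = rng.integers(0, 2, n)  # stand-in Completed flag

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Complexity"]),
        ("num", "passthrough", ["Planned_Budget", "Team_Size"]),
    ])),
    ("model", RandomForestClassifier(random_state=42)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "model__n_estimators": [25, 50],
    "model__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy, target)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

Because the search cross-validates inside the training data, it also gives a less split-dependent performance estimate than a single train-test split.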
