
# Project Portfolio Analysis

This notebook demonstrates an end-to-end analysis of a synthetic project portfolio dataset. The dataset contains information about projects, including dates, budgets, costs, team sizes, complexity, risk levels, status, success outcomes, and customer satisfaction scores. The analysis includes exploratory data visualizations and predictive modeling to assess factors influencing project success.

Once the dataset file `project_portfolio_dataset.csv` is in the same directory as the notebook, it runs as-is. It uses only the publicly available Python packages listed in `requirements.txt`.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Configure plots
sns.set_theme(style='whitegrid', context='notebook')

# Load dataset
file_path = 'project_portfolio_dataset.csv'
df = pd.read_csv(file_path, parse_dates=['start_date','planned_end_date','actual_end_date'])

# Display first few rows
df.head()



## Dataset Overview

The dataset contains the following columns:

| Column | Description |
|---|---|
| `project_id` | Unique identifier for each project | 
| `start_date` | Date the project started | 
| `planned_end_date` | Planned project completion date | 
| `actual_end_date` | Actual project completion date | 
| `planned_duration_days` | Planned duration in days | 
| `actual_duration_days` | Actual duration in days | 
| `budget_usd` | Planned budget in USD | 
| `actual_cost_usd` | Actual cost in USD | 
| `team_size` | Number of team members on the project | 
| `complexity` | Project complexity score (1–10) | 
| `risk_level` | Project risk level (1–5) | 
| `status` | Project status (Completed, Ongoing, Cancelled) | 
| `success` | Binary indicator of project success (1=successful, 0=not successful) | 
| `satisfaction_score` | Customer satisfaction score (1–5) |
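
The column list and value ranges in the table above can be turned into a quick sanity check before any analysis. The following is a minimal sketch, not part of the original notebook: `check_schema`, `EXPECTED_COLUMNS`, and `EXPECTED_RANGES` are hypothetical names, and the ranges are assumed to be inclusive.

```python
import pandas as pd

# Column names copied from the dataset overview table.
EXPECTED_COLUMNS = [
    "project_id", "start_date", "planned_end_date", "actual_end_date",
    "planned_duration_days", "actual_duration_days", "budget_usd",
    "actual_cost_usd", "team_size", "complexity", "risk_level",
    "status", "success", "satisfaction_score",
]

# Value ranges stated in the table (assumed inclusive on both ends).
EXPECTED_RANGES = {
    "complexity": (1, 10),
    "risk_level": (1, 5),
    "satisfaction_score": (1, 5),
}


def check_schema(frame):
    """Return a list of problem descriptions; an empty list means nothing was flagged."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col, (low, high) in EXPECTED_RANGES.items():
        if col in frame.columns:
            values = frame[col].dropna()  # missing values are not range errors
            n_bad = int(((values < low) | (values > high)).sum())
            if n_bad:
                problems.append(f"{col}: {n_bad} value(s) outside {low}-{high}")
    if "success" in frame.columns:
        n_bad = int((~frame["success"].dropna().isin([0, 1])).sum())
        if n_bad:
            problems.append(f"success: {n_bad} value(s) not in {{0, 1}}")
    return problems
```

Calling `check_schema(df)` right after loading returns an empty list when no column is missing and no value falls outside the documented ranges.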


In [None]:

# Summary statistics
summary = df.describe(include='all').T
summary
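
`describe` reports a `count` per column but does not show where gaps concentrate. As a hedged companion to the summary, the sketch below tabulates the share of missing values per column for each `status`. It assumes, without this having been confirmed in the data, that ongoing or cancelled projects may lack fields such as `actual_end_date`; `missing_share_by_status` is a hypothetical helper name.

```python
import pandas as pd


def missing_share_by_status(frame):
    """Fraction of missing values in each column, computed separately per status."""
    # Boolean frame: True where a value is missing (status itself is the grouping key).
    is_missing = frame.drop(columns="status").isna()
    # The mean of booleans within each group is the share of missing entries.
    return is_missing.groupby(frame["status"]).mean()
```

If `missing_share_by_status(df)` shows non-zero shares for any modeling feature, those rows need to be dropped or imputed before fitting a model.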


In [None]:

# Distribution of project budgets
plt.figure(figsize=(8,4))
sns.histplot(df['budget_usd']/1000, bins=30, kde=True, color='skyblue')
plt.title('Distribution of Project Budgets (thousand USD)')
plt.xlabel('Budget (thousand USD)')
plt.ylabel('Frequency')
plt.show()

# Project status counts
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='status', palette='pastel')
plt.title('Project Status Counts')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

# Success vs Complexity
plt.figure(figsize=(7,4))
sns.boxplot(data=df, x='success', y='complexity', palette='viridis')
plt.title('Complexity vs Success')
plt.xlabel('Success')
plt.ylabel('Complexity')
plt.show()
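
A box plot shows how complexity is distributed within each outcome. The reverse view, the success rate at each complexity level, is often easier to read as a table. The helper below is a small sketch with a hypothetical name, `success_rate_by`, and it assumes `success` is coded 0/1 as described in the dataset overview.

```python
import pandas as pd


def success_rate_by(frame, column):
    """Project count and mean of the 0/1 `success` flag for each value of `column`."""
    grouped = frame.groupby(column)["success"]
    return pd.DataFrame({
        "projects": grouped.size(),      # number of projects in the group
        "success_rate": grouped.mean(),  # share of projects with success == 1
    })
```

For example, `success_rate_by(df, 'complexity')` or `success_rate_by(df, 'risk_level')` gives one row per level.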


In [None]:

# Correlation matrix for numerical features
numeric_cols = ['planned_duration_days', 'actual_duration_days', 'budget_usd', 'actual_cost_usd', 'team_size', 'complexity', 'risk_level', 'satisfaction_score']
corr = df[numeric_cols].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
plt.title('Correlation Matrix')
plt.show()
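
With eight features the heatmap holds 28 distinct pairs, so it can help to list the strongest ones explicitly. The following sketch ranks pairs by absolute correlation; `strongest_pairs` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by magnitude while keeping the sign in the returned values.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)
```

Calling `strongest_pairs(corr)` on the matrix computed above returns a Series indexed by feature pairs.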



## Predictive Modeling

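We build a logistic regression model to predict project success from the numeric features. The target variable is `success`, and the feature set comprises planned and actual duration, budget, actual cost, team size, complexity, risk level, and satisfaction score. Because actual duration, actual cost, and satisfaction score are typically known only once a project has finished, the model describes which factors accompany success in hindsight rather than forecasting it at kickoff.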


In [None]:

# Select features and target
features = ['planned_duration_days','actual_duration_days','budget_usd','actual_cost_usd','team_size','complexity','risk_level','satisfaction_score']
X = df[features]
y = df['success']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Build logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predict on test set
y_pred = log_reg.predict(X_test_scaled)

# Evaluation
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
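
The cell above scales and fits in separate steps, and `LogisticRegression` raises an error if any feature contains missing values. One common alternative, sketched here under assumptions rather than taken from the notebook, wraps median imputation, scaling, and the classifier in a scikit-learn pipeline and scores it with stratified cross-validation. `cross_validated_accuracy` is a hypothetical helper, and median imputation is an assumed choice, not something the dataset prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_accuracy(X, y, n_splits=5, seed=42):
    """Mean and standard deviation of accuracy across stratified folds."""
    model = make_pipeline(
        SimpleImputer(strategy="median"),  # assumed strategy for missing values
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    )
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The whole pipeline is refit per fold, so imputation and scaling
    # statistics come from the training portion of each fold only.
    scores = cross_val_score(model, X, y, cv=folds, scoring="accuracy")
    return float(np.mean(scores)), float(np.std(scores))
```

With the notebook's variables this would be called as `cross_validated_accuracy(df[features], df['success'])`. Averaging over several folds gives a steadier estimate than a single 70/30 split.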



## Conclusion

This analysis walked through a synthetic project portfolio dataset designed for aspiring business analysts, program managers, and data analysts. The exploratory visualizations covered budget distributions, project statuses, and the relationship between project complexity and success, and a correlation matrix summarized how the numeric features relate to one another.

A logistic regression model was built to predict project success, demonstrating a straightforward predictive modeling workflow. This project highlights data preparation, visualization, and modeling techniques relevant to roles in business analysis and program management. Feel free to extend the analysis by experimenting with different models (e.g., random forests, gradient boosting) or additional feature engineering.
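
As one way to act on that suggestion, the sketch below compares logistic regression with a random forest on the same cross-validation folds. It is illustrative only: `compare_models` is a hypothetical helper, hyperparameters are left at scikit-learn defaults except `max_iter` and the random seed, and the feature matrix is assumed to contain no missing values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, n_splits=5, seed=42):
    """Return the mean cross-validated accuracy for each candidate model."""
    candidates = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        # Tree ensembles split on thresholds, so no scaling step is added.
        "random_forest": RandomForestClassifier(random_state=seed),
    }
    # Shared folds keep the comparison like-for-like.
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: float(cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean())
        for name, model in candidates.items()
    }
```

With the notebook's variables, `compare_models(df[features], df['success'])` returns a dictionary mapping each model name to its mean accuracy.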
