# Project Management Analytics

This notebook demonstrates exploratory data analysis (EDA) and predictive modeling on a synthetic project management dataset. The goal is to explore relationships between project attributes (such as budget, team size, complexity and risk) and project success, and to build a model that predicts whether a project is likely to succeed.

The dataset is synthetic but draws inspiration from commonly monitored project management metrics such as schedule adherence, budget control, resource utilization and stakeholder engagement.

## Dataset Overview

The dataset (`synthetic_project_data.csv`) contains records of simulated projects with the following columns:

- **project_id**: Unique identifier for each project.
- **start_date** and **end_date**: Project start and end dates.
- **budget_usd**: Estimated project budget in US dollars. Budget adherence is a core metric for project management.
- **team_size**: Number of people assigned to the project.
- **complexity**: An integer from 1 to 10 representing the project's technical or organizational complexity.
- **risk_level**: An integer from 1 to 5 indicating the level of perceived risk associated with the project. Monitoring risk is vital for proactive mitigation.
- **stakeholder_engagement_score**: Score from 1 to 10 summarizing how engaged stakeholders are throughout the project life cycle.
- **business_value**: Estimate of the value delivered by the project (in arbitrary units).
- **duration_days**: Duration of the project in days.
- **success**: Binary indicator (1 for success, 0 for failure) derived from the other attributes.

We will read this dataset, perform exploratory data analysis, visualize distributions and relationships, and build predictive models to estimate project success.
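
Before any analysis, it can help to confirm that a loaded file actually matches the column list above. The following is a minimal sketch, not part of the original notebook: the helper `validate_projects`, the constants, and the two-row `sample` frame are hypothetical names and stand-in data introduced only for illustration.

```python
import pandas as pd

# Columns and value ranges taken from the dataset description above.
REQUIRED_COLUMNS = [
    'project_id', 'start_date', 'end_date', 'budget_usd', 'team_size',
    'complexity', 'risk_level', 'stakeholder_engagement_score',
    'business_value', 'duration_days', 'success',
]
EXPECTED_RANGES = {
    'complexity': (1, 10),
    'risk_level': (1, 5),
    'stakeholder_engagement_score': (1, 10),
}


def validate_projects(df):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    for col, (lo, hi) in EXPECTED_RANGES.items():
        # between() is inclusive; NaN values also count as out of range here.
        if col in df.columns and not df[col].between(lo, hi).all():
            problems.append(f'{col} has values outside [{lo}, {hi}]')
    if 'success' in df.columns and not df['success'].isin([0, 1]).all():
        problems.append('success contains values other than 0 and 1')
    return problems


# Hypothetical two-row stand-in for synthetic_project_data.csv.
sample = pd.DataFrame({
    'project_id': ['P001', 'P002'],
    'start_date': ['2023-01-01', '2023-02-01'],
    'end_date': ['2023-04-01', '2023-08-01'],
    'budget_usd': [120000.0, 450000.0],
    'team_size': [5, 12],
    'complexity': [3, 8],
    'risk_level': [2, 4],
    'stakeholder_engagement_score': [7, 5],
    'business_value': [55.0, 80.0],
    'duration_days': [90, 181],
    'success': [1, 0],
})

print(validate_projects(sample))
```

The same helper could be called on the real frame once it has been read from disk; an empty list means none of these checks failed.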

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Configure plotting
sns.set_theme(style='whitegrid')  # set_theme is the preferred name; sns.set is an alias
plt.rcParams['figure.figsize'] = (8, 5)

In [None]:
# Load synthetic dataset
data = pd.read_csv('synthetic_project_data.csv')

# Display first few rows and dataset info
print('Data shape:', data.shape)
display(data.head())
print('Data types:\n', data.dtypes, sep='')

In [None]:
# Summary statistics
display(data.describe())

# Check for missing values
print('Missing values per column:\n', data.isnull().sum(), sep='')

In [None]:
# Compute correlation matrix (numeric columns only)
num_cols = ['budget_usd','team_size','complexity','risk_level','stakeholder_engagement_score','business_value','duration_days','success']
correlation_matrix = data[num_cols].corr()

# Plot heatmap
plt.figure(figsize=(10, 7))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Histogram of project budget
sns.histplot(data['budget_usd'], bins=20, kde=True, color='skyblue')
plt.xlabel('Budget (USD)')
plt.ylabel('Count')
plt.title('Distribution of Project Budgets')
plt.show()

# Count plot of project complexity (one bar per integer level)
sns.countplot(x='complexity', data=data, palette='viridis')
plt.xlabel('Complexity')
plt.ylabel('Count')
plt.title('Distribution of Project Complexity')
plt.show()

# Scatter plot: complexity vs. business value
sns.scatterplot(x='complexity', y='business_value', hue='success', data=data, palette=['orange','green'])
plt.xlabel('Complexity')
plt.ylabel('Business Value')
plt.title('Complexity vs Business Value by Success')
plt.show()

# Boxplot of business value grouped by success
sns.boxplot(x='success', y='business_value', data=data, palette=['salmon','lightgreen'])
plt.xlabel('Success (0 = No, 1 = Yes)')
plt.ylabel('Business Value')
plt.title('Business Value Distribution by Success')
plt.show()

# Bar chart of risk level vs average success rate
avg_success = data.groupby('risk_level')['success'].mean().reset_index()
sns.barplot(x='risk_level', y='success', data=avg_success, palette='magma')
plt.xlabel('Risk Level')
plt.ylabel('Average Success Rate')
plt.title('Success Rate by Risk Level')
plt.ylim(0,1)
plt.show()

In [None]:
# Features and target
features = ['budget_usd','team_size','complexity','risk_level','stakeholder_engagement_score','duration_days']
X = data[features]
y = data['success']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
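
The scaling step above fits the scaler on the training split only and then reuses those statistics for the test split. The sketch below illustrates what that means numerically. It runs on hypothetical random arrays (`X_train_demo`, `X_test_demo`), not on the project data, and compares a manual z-score against `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in arrays with two features on different scales.
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=[100.0, 5.0], scale=[20.0, 2.0], size=(80, 2))
X_test_demo = rng.normal(loc=[100.0, 5.0], scale=[20.0, 2.0], size=(20, 2))

# Fit on the training split only, as in the cell above.
scaler_demo = StandardScaler().fit(X_train_demo)

# Manual z-score using the *training* mean and standard deviation.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses
manual_test = (X_test_demo - train_mean) / train_std

# The test split is transformed with training statistics, never its own.
matches = np.allclose(manual_test, scaler_demo.transform(X_test_demo))
print('Manual z-score matches StandardScaler:', matches)
```

Reusing the training statistics keeps information about the test split out of preprocessing, which is why `fit_transform` is applied to `X_train` and only `transform` to `X_test`.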

In [None]:
# Logistic Regression Model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)

# Evaluation metrics
print('Logistic Regression Results')
print('Accuracy:', accuracy_score(y_test, y_pred_lr))
# Start multi-line outputs on their own line so rows and columns stay aligned
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_lr), sep='')
print('Classification Report:\n', classification_report(y_test, y_pred_lr), sep='')

In [None]:
# Random Forest Model
rf_model = RandomForestClassifier(n_estimators=200, random_state=42)
rf_model.fit(X_train, y_train)

y_pred_rf = rf_model.predict(X_test)

print('Random Forest Results')
print('Accuracy:', accuracy_score(y_test, y_pred_rf))
# Start multi-line outputs on their own line so rows and columns stay aligned
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred_rf), sep='')
print('Classification Report:\n', classification_report(y_test, y_pred_rf), sep='')
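
An accuracy figure is easier to interpret next to a trivial baseline. The following is a minimal sketch of that comparison, assuming a majority-class baseline is an acceptable reference point. It uses hypothetical generated data (`X_demo`, `y_demo`) with an assumed non-linear labeling rule, not the notebook's dataset, so the printed numbers say nothing about the real results.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: the label depends on an interaction of two features.
rng = np.random.default_rng(42)
n = 400
X_demo = rng.normal(size=(n, 4))
y_demo = ((X_demo[:, 0] * X_demo[:, 1] + 0.3 * rng.normal(size=n)) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
forest_acc = accuracy_score(y_te, forest.predict(X_te))
print(f'Baseline accuracy:      {baseline_acc:.3f}')
print(f'Random forest accuracy: {forest_acc:.3f}')
```

On the notebook's data, the same pattern would apply to `X_train`, `X_test`, `y_train` and `y_test`; a model is only informative to the extent that it beats the baseline.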

## Conclusions

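- **Data Exploration:** The dataset exhibits diverse budgets, team sizes, complexity levels and risk profiles. As expected, higher-complexity projects tend to deliver lower business value, while projects with lower risk levels achieve higher success rates.

- **Predictive Modeling:** The logistic regression model achieved moderate accuracy, which suggests that a linear decision boundary over these six features does not fully explain project success. The random forest model performed slightly better, consistent with it capturing non-linear relationships and interactions between variables. Because `success` is derived from the other attributes of a synthetic dataset, these scores describe how well each model recovers the generating rule rather than real-world predictive power.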

- **Next Steps:**
  * Experiment with other algorithms (e.g., gradient boosting, SVM).
  * Perform feature engineering (e.g., compute cost per team member).
  * Incorporate additional project management KPIs such as schedule variance, cost performance index, and stakeholder satisfaction metrics.
  * Apply cross‑validation and hyperparameter tuning to improve model performance.
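
Two of the steps above, feature engineering and cross-validated tuning, can be sketched together. This is a minimal illustration under stated assumptions, not a tuned solution: the `demo` frame and its labeling rule are hypothetical stand-ins for the CSV, `budget_per_member` is an illustrative feature name, and the grid of `C` values and the five folds are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for synthetic_project_data.csv.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'budget_usd': rng.uniform(50_000, 2_000_000, n),
    'team_size': rng.integers(3, 40, n),
    'complexity': rng.integers(1, 11, n),
    'risk_level': rng.integers(1, 6, n),
    'stakeholder_engagement_score': rng.integers(1, 11, n),
    'duration_days': rng.integers(30, 720, n),
})
# Assumed labeling rule, used only to give the stand-in data a target.
score = (demo['stakeholder_engagement_score'] - demo['risk_level']
         - 0.5 * demo['complexity'] + rng.normal(0, 1.5, n))
demo['success'] = (score > score.median()).astype(int)

# Feature engineering: cost per team member.
demo['budget_per_member'] = demo['budget_usd'] / demo['team_size']

features = ['budget_usd', 'team_size', 'complexity', 'risk_level',
            'stakeholder_engagement_score', 'duration_days', 'budget_per_member']
X_demo, y_demo = demo[features], demo['success']

# Keeping the scaler inside the pipeline means it is refit on each training fold.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, {'clf__C': [0.01, 0.1, 1, 10]}, cv=cv, scoring='accuracy')
search.fit(X_demo, y_demo)

print('Best parameters:', search.best_params_)
print(f'Mean cross-validated accuracy: {search.best_score_:.3f}')
```

Placing the scaler inside the `Pipeline` is the design choice that matters here: scaling the full dataset before cross-validation would let each validation fold influence the preprocessing.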

This notebook provides a foundation for exploring project management data and building predictive models to inform decision‑making. Feel free to modify and extend it for your own analysis.