# Project Management Success Prediction

This notebook demonstrates an end-to-end analysis and predictive modeling workflow for a synthetic project management dataset.

We generate a dataset that includes project characteristics—such as team size, project duration (in days), budget (k USD), complexity rating, stakeholder count, number of scope changes and risk score—and outcomes such as whether the project stayed on schedule and budget, customer satisfaction, and overall success.

The goals of this notebook are:

* **Exploratory data analysis (EDA):** compute summary statistics and visualize feature distributions and relationships.
* **Correlation analysis:** explore relationships between project variables and outcomes.
* **Predictive modeling:** build classification models to estimate the probability that a project will be successful. We try logistic regression and random forest algorithms, evaluate their performance on a held‑out test set, and inspect classification metrics.

Feel free to modify the data generation or modeling steps to explore different scenarios!
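
The generation script itself is not part of this notebook. As a rough sketch of what such a generator could look like, the function below draws each feature from an assumed distribution and derives the outcomes from a made-up "strain" score. The name `make_synthetic_projects`, the distributions, the coefficients, and the success rule are all illustrative assumptions, not the process that produced `project_management_dataset.csv`.

```python
import numpy as np
import pandas as pd


def make_synthetic_projects(n_projects=500, seed=0):
    """Hypothetical generator; NOT the script behind project_management_dataset.csv."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'project_id': [f'P{i:04d}' for i in range(1, n_projects + 1)],
        'team_size': rng.integers(3, 31, size=n_projects),        # assumed 3 to 30 people
        'duration_days': rng.integers(30, 721, size=n_projects),  # assumed 30 to 720 days
        'budget_k_usd': rng.uniform(50, 5000, size=n_projects).round(1),
        'complexity': rng.integers(1, 11, size=n_projects),       # 1 to 10 scale
        'stakeholder_count': rng.integers(1, 21, size=n_projects),
        'scope_changes': rng.poisson(3, size=n_projects),
        'risk_score': rng.uniform(0, 100, size=n_projects).round(1),
    })

    # Assumed relationship: complexity, risk, and scope churn all add "strain",
    # and higher strain lowers the probability of a good outcome.
    strain = (
        0.25 * (df['complexity'] - 5.5)
        + 0.03 * (df['risk_score'] - 50)
        + 0.20 * (df['scope_changes'] - 3)
    ).to_numpy()
    p_ok = 1.0 / (1.0 + np.exp(strain))

    df['on_schedule'] = rng.binomial(1, p_ok)
    df['on_budget'] = rng.binomial(1, p_ok)
    satisfaction = 1 + 4 * p_ok + rng.normal(0, 0.7, size=n_projects)
    df['customer_satisfaction'] = np.clip(np.rint(satisfaction), 1, 5).astype(int)

    # Assumed success rule: schedule or budget met, and satisfaction of at least 3.
    met_any = (df['on_schedule'] + df['on_budget']) >= 1
    df['overall_success'] = (met_any & (df['customer_satisfaction'] >= 3)).astype(int)
    return df


sample = make_synthetic_projects(n_projects=200, seed=1)
print(sample.head())
```

Fixing `seed` makes the output reproducible, which helps when comparing scenarios after changing a coefficient or the success rule.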


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score

# Configure plotting style
sns.set(style='whitegrid')
%matplotlib inline


In [None]:
# Load the synthetic dataset
# The CSV file is generated separately; make sure project_management_dataset.csv is in the same directory

df = pd.read_csv('project_management_dataset.csv')
df.head()


## Overview of the dataset

The dataset contains the following columns:

| Column | Description |
|-------|-------------|
| `project_id` | Unique identifier for each project |
| `team_size` | Number of people on the project team |
| `duration_days` | Duration of the project in days |
| `budget_k_usd` | Budget allocated to the project (thousands of USD) |
| `complexity` | Project complexity score on a 1–10 scale (higher means more complex) |
| `stakeholder_count` | Number of stakeholders associated with the project |
| `scope_changes` | Number of scope change requests during the project |
| `risk_score` | Risk score from 0 to 100 (higher means higher risk) |
| `on_schedule` | Binary indicator (1 if on schedule, 0 otherwise) |
| `on_budget` | Binary indicator (1 if on budget, 0 otherwise) |
| `customer_satisfaction` | Customer satisfaction rating on a 1–5 scale |
| `overall_success` | Target variable indicating overall success (1 for success, 0 for failure) |

We will perform exploratory data analysis and predictive modeling on this data.
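
Before analysis, it can help to confirm that a loaded frame actually matches the table above. The sketch below checks for missing columns, missing values, and out-of-range values. The helper name `check_schema` and the lower bounds for the count-like columns are assumptions; only the 1–10, 0–100, 0/1, and 1–5 ranges come from the table.

```python
import pandas as pd

# Inclusive (min, max) bounds; None means no upper bound.
# Bounds for team_size, duration_days, budget_k_usd, stakeholder_count, and
# scope_changes are assumptions; the others follow the column table.
EXPECTED_BOUNDS = {
    'team_size': (1, None),
    'duration_days': (1, None),
    'budget_k_usd': (0, None),
    'complexity': (1, 10),
    'stakeholder_count': (0, None),
    'scope_changes': (0, None),
    'risk_score': (0, 100),
    'on_schedule': (0, 1),
    'on_budget': (0, 1),
    'customer_satisfaction': (1, 5),
    'overall_success': (0, 1),
}


def check_schema(frame):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for column, (low, high) in EXPECTED_BOUNDS.items():
        if column not in frame.columns:
            problems.append(f'missing column: {column}')
            continue
        values = frame[column]
        if values.isna().any():
            problems.append(f'{column}: contains missing values')
        if low is not None and (values < low).any():
            problems.append(f'{column}: values below {low}')
        if high is not None and (values > high).any():
            problems.append(f'{column}: values above {high}')
    if 'project_id' in frame.columns and frame['project_id'].duplicated().any():
        problems.append('project_id: duplicate identifiers')
    return problems


# Tiny hand-made frame for illustration only.
good = pd.DataFrame({
    'project_id': ['P1', 'P2'],
    'team_size': [5, 12], 'duration_days': [90, 200], 'budget_k_usd': [120.0, 800.0],
    'complexity': [3, 8], 'stakeholder_count': [4, 9], 'scope_changes': [1, 5],
    'risk_score': [20.0, 75.0], 'on_schedule': [1, 0], 'on_budget': [1, 0],
    'customer_satisfaction': [5, 2], 'overall_success': [1, 0],
})
bad = good.assign(complexity=[3, 11])  # 11 is outside the 1-10 scale

print(check_schema(good))
print(check_schema(bad))
```

On the real data, the call would be `check_schema(df)` after loading the CSV.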

In [None]:
# Summary statistics for numeric features
df.describe().T


In [None]:
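# Class distribution of the target variable
# sort_index() fixes the order as Failure, Success so the bar colors line up with the labels
y_counts = df['overall_success'].value_counts().sort_index().rename(index={0:'Failure', 1:'Success'})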
print('Class distribution (Success vs Failure):')
print(y_counts)
y_counts.plot(kind='bar', color=['salmon','skyblue'])
plt.title('Distribution of project success')
plt.ylabel('Number of projects')
plt.show()


In [None]:
# Plot distributions of key numeric features
features = ['team_size', 'duration_days', 'budget_k_usd', 'complexity', 'stakeholder_count', 'scope_changes', 'risk_score', 'customer_satisfaction']
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(12, 16))
for idx, feature in enumerate(features):
    ax = axes[idx // 2, idx % 2]
    sns.histplot(df[feature], ax=ax, kde=True, color='steelblue')
    ax.set_title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()


In [None]:
# Correlation matrix and heatmap
corr = df.drop(columns=['project_id']).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Correlation matrix of project features and outcomes')
plt.show()
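
A heatmap of this size can be hard to scan for the one column of interest. As a complementary sketch, the helper below pulls out only the correlations with the target and orders them by absolute strength. The name `rank_target_correlations` is hypothetical, and the tiny frame exists only to make the example runnable on its own.

```python
import pandas as pd


def rank_target_correlations(frame, target='overall_success'):
    """Pearson correlation of each numeric column with the target, strongest first."""
    # select_dtypes drops non-numeric columns such as a string project_id
    corr = frame.select_dtypes('number').corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Illustrative data only: 'a' tracks the target, 'b' is unrelated to it.
toy = pd.DataFrame({
    'project_id': ['P1', 'P2', 'P3', 'P4'],
    'a': [1, 2, 3, 4],
    'b': [1, 0, 1, 0],
    'overall_success': [0, 0, 1, 1],
})
print(rank_target_correlations(toy))
```

With the notebook's data, `rank_target_correlations(df)` would list every numeric feature and outcome against `overall_success`. Correlation with a binary target captures only linear association, so a weak value does not rule out a useful feature.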


In [None]:
# Explore how complexity relates to project success
plt.figure(figsize=(6,4))
sns.boxplot(x='overall_success', y='complexity', data=df, palette='Set2')
plt.title('Complexity vs. Overall Success')
plt.xlabel('Overall Success (0 = Failure, 1 = Success)')
plt.ylabel('Complexity score')
plt.show()


In [None]:
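# Prepare features (X) and target (y)
# Note: on_schedule, on_budget, and customer_satisfaction are themselves project outcomes,
# so including them as predictors may make the scores below look better than a forecast
# made before the project finishes.
X = df[['team_size', 'duration_days', 'budget_k_usd', 'complexity', 'stakeholder_count', 'scope_changes', 'risk_score', 'on_schedule', 'on_budget', 'customer_satisfaction']]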
y = df['overall_success']

# Train‑test split with stratification
test_size = 0.25
random_state = 42
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)

# Standardize the features for logistic regression; the scaler is fit on the training split only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
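
A single train/test split gives one estimate of performance. One possible alternative, sketched here under the assumption that cross-validation is acceptable for this dataset, is to wrap the scaler and the model in a scikit-learn pipeline and score it with stratified k-fold. The arrays `X_demo` and `y_demo` are stand-in data so the block runs on its own; they are not the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data so the sketch is self-contained; replace with X and y from the cell above.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

# The pipeline refits the scaler inside every training fold, so no statistics
# from the validation fold reach the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
auc_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring='roc_auc')

print('Cross-validated AUC per fold:', np.round(auc_scores, 3))
print('Mean AUC:', round(auc_scores.mean(), 3))
```

The spread across folds gives a sense of how much a single split's score might vary.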


In [None]:
# Logistic Regression model
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_scaled, y_train)

# Predictions and evaluation
y_pred_lr = lr.predict(X_test_scaled)
y_prob_lr = lr.predict_proba(X_test_scaled)[:, 1]

print('Logistic Regression Accuracy:', round(accuracy_score(y_test, y_pred_lr), 3))
print('\nClassification Report:\n', classification_report(y_test, y_pred_lr))

# Confusion matrix
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix: Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC curve
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr)
auc_lr = roc_auc_score(y_test, y_prob_lr)
plt.figure(figsize=(6,4))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {auc_lr:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Logistic Regression')
plt.legend(loc='lower right')
plt.show()


In [None]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
rf.fit(X_train, y_train)

# Predictions and evaluation
y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

print('Random Forest Accuracy:', round(accuracy_score(y_test, y_pred_rf), 3))
print('\nClassification Report:\n', classification_report(y_test, y_pred_rf))

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens')
plt.title('Confusion Matrix: Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

# ROC curve
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_prob_rf)
auc_rf = roc_auc_score(y_test, y_prob_rf)
plt.figure(figsize=(6,4))
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Random Forest')
plt.legend(loc='lower right')
plt.show()
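
Accuracy and AUC say how well the forest predicts, not which inputs it relies on. One way to probe that, offered as a sketch rather than part of the original analysis, is permutation importance: shuffle one feature at a time on the test set and measure how much the score drops. The stand-in data below uses an assumed rule (low risk means success) purely so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): only risk_score drives the label.
rng = np.random.default_rng(0)
n = 400
X_demo = pd.DataFrame({
    'risk_score': rng.uniform(0, 100, size=n),
    'complexity': rng.integers(1, 11, size=n),
    'team_size': rng.integers(3, 31, size=n),
})
y_demo = ((X_demo['risk_score'] + rng.normal(0, 5, size=n)) < 50).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)

# Mean drop in test accuracy when each column is shuffled, averaged over 10 repeats
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)
importances = pd.Series(result.importances_mean, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's objects, the equivalent call would be `permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)`.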


## Conclusion

In this notebook we explored a synthetic project management dataset. After performing basic exploratory data analysis and visualizing the distributions and correlations of various project features, we trained logistic regression and random forest classifiers to predict whether a project would be successful. Both models achieved high accuracy and AUC scores, with the random forest slightly outperforming the logistic regression model in this synthetic setting.

This analysis illustrates how project characteristics (such as team size, complexity, risk score, and budget) relate to project outcomes in this synthetic dataset. In real-world scenarios, additional domain knowledge and feature engineering would be essential to build robust predictive models. Feel free to experiment with different algorithms, parameter settings, or additional features to further improve predictive performance.
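
One experiment along these lines is to compare a model that sees every column with one restricted to information available at planning time. The sketch below assumes that `on_schedule`, `on_budget`, and `customer_satisfaction` would be unknown when a forecast is made. The stand-in data, the `holdout_auc` helper, and the success rule are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Columns assumed to be known only after a project finishes
OUTCOME_COLUMNS = ['on_schedule', 'on_budget', 'customer_satisfaction']

# Stand-in data (assumption): success means both schedule and budget were met.
rng = np.random.default_rng(7)
n = 600
demo = pd.DataFrame({
    'complexity': rng.integers(1, 11, size=n),
    'risk_score': rng.uniform(0, 100, size=n),
})
p_ok = 1 / (1 + np.exp(0.04 * (demo['risk_score'].to_numpy() - 50)))
demo['on_schedule'] = rng.binomial(1, p_ok)
demo['on_budget'] = rng.binomial(1, p_ok)
demo['customer_satisfaction'] = rng.integers(1, 6, size=n)
demo['overall_success'] = demo['on_schedule'] * demo['on_budget']


def holdout_auc(frame, feature_columns, target='overall_success'):
    """Train a random forest on the given columns and return its test-set AUC."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        frame[feature_columns], frame[target],
        test_size=0.25, random_state=42, stratify=frame[target],
    )
    model = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
    return roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])


all_features = [c for c in demo.columns if c != 'overall_success']
planning_features = [c for c in all_features if c not in OUTCOME_COLUMNS]

auc_all = holdout_auc(demo, all_features)
auc_planning = holdout_auc(demo, planning_features)
print(f'AUC with outcome columns:    {auc_all:.3f}')
print(f'AUC with planning-time only: {auc_planning:.3f}')
```

A large gap between the two scores would suggest that much of the apparent predictive power comes from the outcome columns rather than from characteristics known up front.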
