# Project Management Analytics

This notebook accompanies a synthetic dataset of project management metrics. The goal is to explore the data, build basic visualizations, and train a simple predictive model that classifies project outcomes (success vs. failure).

The dataset contains the following columns:

- **project_id**: Unique identifier for each project.
- **planned_start**, **planned_end**, **actual_end**: Dates representing the planned and actual schedules.
- **planned_duration_days**: Number of days planned for the project.
- **delay_days**: Actual end date minus planned end date, in days (positive means the project finished late, negative means it finished early).
- **budget**: Planned budget in USD.
- **actual_cost**: Actual cost incurred in USD.
- **team_size**: Number of people assigned to the project.
- **complexity**: Ordinal complexity score (1–5).
- **risk_score**: Continuous risk score (0–1).
- **success**: Binary outcome (1 for successful projects, 0 for failed).

We'll perform exploratory data analysis (EDA), then fit a simple logistic regression model.
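
The sign convention for `delay_days` can be checked with a small sketch. The two-row frame below is hypothetical (it is not taken from the dataset), and it assumes the date columns are stored as ISO-formatted strings, which may not match the actual CSV.

```python
import pandas as pd

# Hypothetical two-row sample; real values come from synthetic_project_data.csv.
sample = pd.DataFrame({
    "planned_end": ["2024-03-01", "2024-06-15"],
    "actual_end": ["2024-03-06", "2024-06-12"],
})

# Parse the date strings, then subtract to get a signed delay in whole days.
planned_end = pd.to_datetime(sample["planned_end"])
actual_end = pd.to_datetime(sample["actual_end"])
sample["delay_days"] = (actual_end - planned_end).dt.days

print(sample)
```

The first row finishes after its planned end and gets a positive delay; the second finishes early and gets a negative one.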

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Set plotting style
sns.set_theme(style="whitegrid", palette="tab10")

In [None]:
# Load the synthetic project dataset
data = pd.read_csv("synthetic_project_data.csv")

# Display the first few rows
data.head()

In [None]:
# Summary statistics for numerical columns
# (printed explicitly: a notebook cell only auto-displays its last expression)
print(data.describe())

# Check for missing values
print(data.isnull().sum())
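
The missing-value check above can be extended into a compact summary that also reports percentages and the class balance of the target. This is a minimal sketch: `summarize_missing` is a hypothetical helper, and the toy frame stands in for the real dataset.

```python
import numpy as np
import pandas as pd

def summarize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    missing = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": missing,
        "missing_pct": (100 * missing / len(df)).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy frame with one missing budget value.
toy = pd.DataFrame({
    "budget": [100.0, np.nan, 250.0, 300.0],
    "team_size": [5, 8, 3, 6],
    "success": [1, 0, 1, 1],
})

print(summarize_missing(toy))

# Share of each outcome class; a strong imbalance would affect model evaluation.
print(toy["success"].value_counts(normalize=True))
```

On the real data, the same calls would be `summarize_missing(data)` and `data["success"].value_counts(normalize=True)`.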

In [None]:
# Histogram of project durations and delays
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.histplot(data['planned_duration_days'], ax=axes[0], bins=20, kde=False)
axes[0].set_title('Planned Duration (days)')
axes[0].set_xlabel('Days')
axes[0].set_ylabel('Count')

sns.histplot(data['delay_days'], ax=axes[1], bins=20, kde=False, color='orange')
axes[1].set_title('Delay (days)')
axes[1].set_xlabel('Days late/early')
axes[1].set_ylabel('Count')
plt.tight_layout()
plt.show()

# Boxplots to compare cost and budget by success
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(x='success', y='budget', data=data, ax=axes[0])
axes[0].set_title('Budget Distribution by Success')
axes[0].set_xlabel('Success (1=yes, 0=no)')
axes[0].set_ylabel('Budget (USD)')

sns.boxplot(x='success', y='actual_cost', data=data, ax=axes[1], color='salmon')
axes[1].set_title('Actual Cost Distribution by Success')
axes[1].set_xlabel('Success (1=yes, 0=no)')
axes[1].set_ylabel('Actual Cost (USD)')
plt.tight_layout()
plt.show()

# Correlation heatmap
numeric_cols = ['planned_duration_days', 'delay_days', 'budget', 'actual_cost', 'team_size', 'complexity', 'risk_score']
corr = data[numeric_cols + ['success']].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
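
A correlation matrix is symmetric, so half of the heatmap repeats the other half. One optional refinement is to hide the cells above the diagonal. The sketch below builds such a mask on hypothetical toy data; the same `mask` array could be passed to `sns.heatmap(corr, mask=mask, ...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the numeric project columns.
toy = pd.DataFrame({
    "budget": [100, 200, 300, 400],
    "actual_cost": [110, 190, 330, 420],
    "risk_score": [0.9, 0.4, 0.6, 0.1],
})
corr = toy.corr()

# True strictly above the diagonal: those cells duplicate the ones below it.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# DataFrame.mask replaces cells where the condition is True with NaN,
# which shows what the heatmap would leave blank.
lower = corr.mask(mask)
print(lower.round(2))
```

Using `k=1` keeps the diagonal visible; `k=0` would hide it as well.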

In [None]:
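# Prepare feature matrix X and target vector y
# Note: delay_days and actual_cost are only known once a project has finished,
# so this model describes outcomes after the fact rather than forecasting them.
feature_cols = ['planned_duration_days', 'delay_days', 'budget', 'actual_cost', 'team_size', 'complexity', 'risk_score']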
X = data[feature_cols]
y = data['success']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit logistic regression model
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)

# Predictions and evaluation
y_pred = clf.predict(X_test_scaled)
# Put each label on its own line so the multi-line outputs stay aligned
print("Classification Report:\n", classification_report(y_test, y_pred), sep="")
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred), sep="")
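
To make the link between the two outputs concrete, the sketch below recomputes the headline metrics for class 1 directly from a confusion matrix. The counts are hypothetical, not results from this dataset. The layout follows scikit-learn's convention for labels `[0, 1]`: rows are true classes, columns are predicted classes.

```python
# Hypothetical counts in sklearn's confusion_matrix layout for labels [0, 1].
cm = [
    [40, 10],  # true 0: 40 predicted 0 (TN), 10 predicted 1 (FP)
    [5, 45],   # true 1: 5 predicted 0 (FN), 45 predicted 1 (TP)
]

(tn, fp), (fn, tp) = cm

# Metrics for the positive class (success = 1).
precision = tp / (tp + fp)   # of projects predicted successful, share that were
recall = tp / (tp + fn)      # of truly successful projects, share that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

These are the values `classification_report` would show in the row for class 1, along with the overall accuracy.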