# Project Performance Analysis

This notebook explores a synthetic dataset of project performance metrics tailored for roles such as Business Analyst, Program Manager, and Data Analyst. The analysis includes descriptive statistics, visualizations, and predictive modeling to demonstrate how data-driven approaches can inform decision-making in project management.


## Load Dataset

We start by loading the dataset stored in `synthetic_project_data.csv`.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Read dataset
df = pd.read_csv('synthetic_project_data.csv')
df.head()
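
Before going further, it can help to confirm that the loaded file has the columns the later cells rely on. The sketch below is a hypothetical helper, not part of the original notebook: `REQUIRED_COLUMNS` is collected from the column names used in the cells that follow, and `sample` is a made-up stand-in for the real CSV that deliberately lacks `Success`.

```python
import pandas as pd

# Column names referenced by later cells in this notebook.
REQUIRED_COLUMNS = [
    'StartDate', 'EndDate', 'DurationDays', 'Budget', 'ActualCost',
    'TeamSize', 'ScopeComplexity', 'RiskLevel', 'Success',
]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame`."""
    return [col for col in required if col not in frame.columns]

# Hypothetical two-row stand-in for the real CSV, without a 'Success' column.
sample = pd.DataFrame({
    'StartDate': ['2023-01-01', '2023-02-01'],
    'EndDate': ['2023-03-01', '2023-05-01'],
    'DurationDays': [60, 90],
    'Budget': [100000, 250000],
    'ActualCost': [110000, 240000],
    'TeamSize': [5, 8],
    'ScopeComplexity': [2, 4],
    'RiskLevel': ['Low', 'High'],
})

missing = missing_columns(sample)
print('Missing columns:', missing)
```

On the real data, `missing_columns(df)` should return an empty list if the file matches what the rest of the notebook expects.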

## Descriptive Statistics

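Next, we summarize every column with `describe(include='all')`, then plot the distribution of budgets, the number of projects per risk level, and the correlations among the numeric features.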


In [None]:
# Summary statistics
df.describe(include='all')


In [None]:
# Convert StartDate and EndDate to datetime for duration calculations
df['StartDate'] = pd.to_datetime(df['StartDate'])
df['EndDate'] = pd.to_datetime(df['EndDate'])
df['DurationActual'] = (df['EndDate'] - df['StartDate']).dt.days

# Plot distribution of budgets
plt.figure(figsize=(8,4))
sns.histplot(df['Budget'], bins=20, kde=True)
plt.title('Distribution of Project Budgets')
plt.xlabel('Budget')
plt.ylabel('Frequency')
plt.show()

# Plot risk levels count
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='RiskLevel', order=['Low','Medium','High','Critical'])
plt.title('Count of Projects by Risk Level')
plt.xlabel('Risk Level')
plt.ylabel('Count')
plt.show()

# Correlation heatmap for numeric features
numeric_cols = ['DurationDays','Budget','ActualCost','TeamSize','ScopeComplexity','DurationActual']
plt.figure(figsize=(8,6))
sns.heatmap(df[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
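
The derived `DurationActual` column is the number of whole days between `StartDate` and `EndDate`. The sketch below reproduces that calculation on two made-up rows and compares it with `DurationDays`. It assumes `DurationDays` holds the planned duration, which the dataset itself does not confirm, and `ScheduleGap` is a hypothetical column name used only for illustration.

```python
import pandas as pd

# Hypothetical rows; the real values come from synthetic_project_data.csv.
demo = pd.DataFrame({
    'StartDate': ['2023-01-01', '2023-03-01'],
    'EndDate': ['2023-01-31', '2023-03-11'],
    'DurationDays': [30, 14],  # assumed to be the planned duration
})

# Same conversion as in the cell above: parse dates, then subtract.
demo['StartDate'] = pd.to_datetime(demo['StartDate'])
demo['EndDate'] = pd.to_datetime(demo['EndDate'])
demo['DurationActual'] = (demo['EndDate'] - demo['StartDate']).dt.days

# Positive values mean the project ran longer than DurationDays.
demo['ScheduleGap'] = demo['DurationActual'] - demo['DurationDays']
print(demo[['DurationDays', 'DurationActual', 'ScheduleGap']])
```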

## Predictive Modeling

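We build two classifiers, logistic regression and random forest, to predict project success (the `Success` column) from five numeric features. Both models are trained on 80% of the rows and evaluated on the remaining 20%.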


In [None]:
# Prepare features and target
X = df[['DurationDays','Budget','ActualCost','TeamSize','ScopeComplexity']].copy()
y = df['Success']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)

# Random Forest
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

# Evaluation function
def evaluate(model_name, y_true, y_pred):
    print(model_name)
    print('Accuracy:', accuracy_score(y_true, y_pred))
    print('Confusion Matrix:\n', confusion_matrix(y_true, y_pred))
    print('Classification Report:\n', classification_report(y_true, y_pred))
    print('-'*50)

# Evaluate models
evaluate('Logistic Regression', y_test, y_pred_lr)
evaluate('Random Forest', y_test, y_pred_rf)
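
A single train/test split can give a noisy accuracy estimate, particularly on a small dataset. One possible refinement, sketched below, is to compare the same two models with stratified 5-fold cross-validation and to wrap the scaler and logistic regression in a pipeline so scaling is refit on each training fold. This is not part of the original analysis: the data comes from scikit-learn's `make_classification` as a stand-in for the project features, and the names `X_cv`, `y_cv`, `models`, and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data with five features; replace with X and y from the notebook.
X_cv, y_cv = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=42,
)

models = {
    # The pipeline fits the scaler on each training fold only.
    'Logistic Regression': make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class balance similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {
    name: cross_val_score(model, X_cv, y_cv, cv=cv, scoring='accuracy')
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f'{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})')
```

The numbers printed here describe the stand-in data only and say nothing about the project dataset.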

## Conclusion

This synthetic dataset and accompanying analysis demonstrate how a combination of descriptive analytics and predictive modeling can be used to assess project performance. You can extend this notebook by exploring feature importance, hyperparameter tuning, or incorporating additional project management variables.
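
As a starting point for the feature importance extension, the sketch below reads the impurity-based importances from a fitted random forest. It is illustrative only: the data is generated with `make_classification` and merely labeled with the notebook's feature names, so the resulting ranking has no bearing on the real dataset. With the notebook's own model, `rf.feature_importances_` and `X.columns` would take the place of `rf_demo` and `feature_names`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ['DurationDays', 'Budget', 'ActualCost', 'TeamSize', 'ScopeComplexity']

# Stand-in data; the columns borrow the notebook's names for readability only.
X_imp, y_imp = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=42,
)
X_imp = pd.DataFrame(X_imp, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_imp, y_imp)

# Impurity-based importances, one value per feature, sorted for readability.
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` on the held-out test set is worth considering as a cross-check.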
