# Project Performance Analysis

This notebook analyzes a synthetic project performance dataset generated for demonstration purposes. The goal is to explore how project characteristics relate to actual delivery performance and build simple predictive models.

The dataset includes project-specific features such as team size, estimated duration, budget, complexity score, scope changes, client priority, methodology, and risk level. It also contains calculated fields for the actual duration and whether the project was delivered on time (within 10% of the estimated duration).
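
The on-time flag can be sketched as follows. The generating logic is not shown in this notebook, so the interpretation here is an assumption: "within 10%" is read as "actual duration does not exceed the estimate by more than 10%". The helper name `flag_on_time` and the toy values are hypothetical.

```python
import pandas as pd

def flag_on_time(frame: pd.DataFrame, tolerance: float = 0.10) -> pd.Series:
    """Return 1 where the actual duration is at most (1 + tolerance) times the estimate.

    Assumption: early deliveries count as on time; only overruns are penalized.
    """
    limit = frame["estimated_duration_weeks"] * (1 + tolerance)
    return (frame["actual_duration_weeks"] <= limit).astype(int)

# Hypothetical toy rows, not taken from project_data.csv
toy = pd.DataFrame({
    "estimated_duration_weeks": [10, 20, 30],
    "actual_duration_weeks": [10.5, 23.0, 32.0],
})
toy["delivery_on_time"] = flag_on_time(toy)
print(toy)
```

If the dataset instead treats large underruns as "not on time", the comparison would need an absolute difference rather than a one-sided limit.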


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error, mean_squared_error

%matplotlib inline
sns.set(style='whitegrid')

# Load dataset
df = pd.read_csv('project_data.csv')

# Display the first few rows and summary statistics
display(df.head())
df.describe(include='all').T


## Exploratory Data Analysis

Let's start by exploring the distribution of key numeric variables (team size, estimated duration, budget, complexity score, scope changes, risk level) and the category counts for client priority, methodology, and the target variable `delivery_on_time`.


In [None]:
# Histograms for numeric variables
numeric_cols = ['team_size', 'estimated_duration_weeks', 'budget_kusd', 'complexity_score', 'scope_changes', 'risk_level']
df[numeric_cols].hist(figsize=(12, 8), bins=15)
plt.suptitle('Distribution of Numeric Features', fontsize=16)
plt.show()

# Countplot for categorical variables
categorical_cols = ['client_priority', 'methodology', 'delivery_on_time']
for col in categorical_cols:
    plt.figure(figsize=(6,4))
    sns.countplot(data=df, x=col)
    plt.title(f"Countplot of {col}")
    plt.show()


### Correlations

Next, we evaluate pairwise correlations between numeric features and visualize them using a heatmap.


In [None]:
# Compute correlation matrix for numeric variables
corr_cols = [
    'team_size', 'estimated_duration_weeks', 'budget_kusd', 'complexity_score',
    'scope_changes', 'risk_level', 'actual_duration_weeks',
]
corr_matrix = df[corr_cols].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', square=True)
plt.title('Correlation Matrix')
plt.show()


### Relationship between Features and Delivery Outcome

We'll examine how certain features differ across projects that were delivered on time versus late.
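
A numeric summary can complement the plots. The sketch below groups a small hypothetical frame by the outcome column and compares medians; the values are made up for illustration, and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical toy data using the dataset's column names
toy_projects = pd.DataFrame({
    "delivery_on_time": [1, 1, 0, 0, 0],
    "scope_changes": [1, 2, 4, 5, 6],
    "complexity_score": [3, 4, 7, 8, 6],
})

# Median of each feature within the late (0) and on-time (1) groups
summary = (
    toy_projects
    .groupby("delivery_on_time")[["scope_changes", "complexity_score"]]
    .median()
)
print(summary)
```

Medians are used here because they are less sensitive to outliers than means, which matches what the boxplots display.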


In [None]:
# Boxplots of numeric features by delivery outcome
for col in ['team_size', 'estimated_duration_weeks', 'budget_kusd', 'complexity_score', 'scope_changes', 'risk_level']:
    plt.figure(figsize=(6,4))
    sns.boxplot(data=df, x='delivery_on_time', y=col)
    plt.title(f"{col} by Delivery Outcome")
    plt.xlabel('Delivered On Time (1 = Yes, 0 = No)')
    plt.show()


## Predictive Modeling

We'll build two simple predictive models:

1. **Logistic Regression** to predict whether a project will be delivered on time based on features.
2. **Linear Regression** to predict the actual duration of a project.

These models provide a baseline for understanding relationships in the dataset.
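
To judge whether the logistic regression adds anything, it may help to compare its accuracy against a trivial majority-class predictor. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels; the 70/30 class split is an assumption chosen for illustration, not a property of the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly 70% positives (an illustrative assumption)
rng = np.random.default_rng(42)
y_toy = (rng.random(200) < 0.7).astype(int)
X_toy = np.zeros((200, 1))  # the dummy model ignores feature values

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"Majority-class accuracy: {baseline_acc:.3f}")
```

A model whose test accuracy is close to this figure has learned little beyond the class balance.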


In [None]:
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, columns=['client_priority', 'methodology'], drop_first=True)

# Logistic Regression
X_log = df_encoded.drop(columns=['project_id', 'actual_duration_weeks', 'delivery_on_time'])
y_log = df_encoded['delivery_on_time']

X_train_log, X_test_log, y_train_log, y_test_log = train_test_split(X_log, y_log, test_size=0.3, random_state=42)

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_log, y_train_log)
y_pred_log = log_reg.predict(X_test_log)
accuracy = accuracy_score(y_test_log, y_pred_log)
conf_mat = confusion_matrix(y_test_log, y_pred_log)

print(f"Logistic Regression Accuracy: {accuracy:.3f}")
print("Confusion Matrix:\n", conf_mat)

# Linear Regression
X_lin = df_encoded.drop(columns=['project_id', 'actual_duration_weeks', 'delivery_on_time'])
y_lin = df_encoded['actual_duration_weeks']

X_train_lin, X_test_lin, y_train_lin, y_test_lin = train_test_split(X_lin, y_lin, test_size=0.3, random_state=42)

lin_reg = LinearRegression()
lin_reg.fit(X_train_lin, y_train_lin)
y_pred_lin = lin_reg.predict(X_test_lin)
mae = mean_absolute_error(y_test_lin, y_pred_lin)
rmse = np.sqrt(mean_squared_error(y_test_lin, y_pred_lin))

print(f"Linear Regression MAE: {mae:.2f} weeks")
print(f"Linear Regression RMSE: {rmse:.2f} weeks")


## Conclusions and Next Steps

This notebook demonstrates an end-to-end workflow for analyzing project performance data:

- **Data Exploration**: We explored distributions, relationships, and correlations between various project attributes.
- **Predictive Modeling**: We built basic logistic and linear regression models to predict delivery outcomes and actual project duration.

Potential next steps include evaluating more sophisticated models (e.g., decision trees, random forests), tuning hyperparameters, and exploring interactions between features. Additionally, one might simulate different project scenarios or incorporate external data (such as team experience or stakeholder engagement metrics) to enhance the analysis.
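
As one possible starting point for the first of these steps, the sketch below evaluates a random forest with 5-fold cross-validation. It runs on data from `make_classification` rather than the project dataset, and the hyperparameters are untuned defaults chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the encoded project features (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)

# Accuracy on each of 5 folds; the spread indicates how stable the estimate is
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Cross-validation gives a less noisy estimate than the single 70/30 split used above, which matters when comparing models on a small dataset.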
