# Project Performance Analysis

This notebook explores a synthetic dataset of program and project management metrics. We will perform exploratory data analysis, visualize key distributions, and build predictive models to identify factors associated with project delays.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline

# Increase figure size for better readability
plt.rcParams['figure.figsize'] = (10, 6)


## Load the dataset

In [None]:
# Load the synthetic project dataset
project_df = pd.read_csv('project_data.csv')

# Display the first few rows
project_df.head()

## Summary statistics

In [None]:
# Describe the numeric columns
project_df.describe()
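
Beyond `describe()`, it can help to count missing values and check how balanced the target is before modeling. The sketch below uses a small hand-made frame as a stand-in for `project_df`. The values, and the `'On Time'` label, are illustrative assumptions rather than contents of `project_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for project_df; values are illustrative only
sample_df = pd.DataFrame({
    'budget_usd': [120000.0, np.nan, 95000.0, 300000.0],
    'team_size': [5, 8, 3, 12],
    'status': ['Delayed', 'On Time', 'On Time', 'Delayed'],
})

# Count missing values per column
missing_counts = sample_df.isna().sum()

# Share of each status label, to spot class imbalance
status_share = sample_df['status'].value_counts(normalize=True)

print(missing_counts)
print(status_share)
```

On the real data, the same two calls can be run against `project_df` directly.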

## Data preprocessing

We need to encode the categorical variables and prepare the dataset for modeling.

In [None]:
# Copy dataframe and encode categorical features
data_encoded = project_df.copy()

# Encode the project_type categorical variable
le = LabelEncoder()
data_encoded['project_type_encoded'] = le.fit_transform(data_encoded['project_type'])

# Encode the status variable as the target (1 for delayed, 0 otherwise)
data_encoded['status_flag'] = (data_encoded['status'] == 'Delayed').astype(int)

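# Select the numeric features and the target used for modeling.
# Note: actual_duration_days is only known once a project has finished, so
# including it may leak the outcome; consider dropping it for forecasting.
feature_cols = ['planned_duration_days', 'actual_duration_days', 'budget_usd', 'team_size',
                'complexity_level', 'risk_score', 'project_type_encoded']
model_df = data_encoded[feature_cols + ['status_flag']].copy()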

model_df.head()
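
`LabelEncoder` maps each category to an arbitrary integer, which can suggest an ordering that does not exist. One common alternative is one-hot encoding, sketched here with `pd.get_dummies`. The project types shown are hypothetical; the actual categories in the dataset may differ.

```python
import pandas as pd

# Hypothetical project types; the real categories in project_data.csv may differ
sample_df = pd.DataFrame({
    'project_type': ['IT', 'Construction', 'IT', 'Marketing'],
    'team_size': [5, 12, 7, 4],
})

# One indicator column per category instead of a single integer code
one_hot_df = pd.get_dummies(
    sample_df, columns=['project_type'], prefix='project_type', dtype=int
)

print(one_hot_df.columns.tolist())
```

Tree-based models often tolerate integer codes reasonably well, so this is an option to compare rather than a required change.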

## Exploratory data analysis (EDA)

In [None]:
# Distribution of project types
sns.countplot(x='project_type', data=project_df)
plt.title('Project Type Distribution')
plt.xlabel('Project Type')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()

In [None]:
# Histogram of budget
plt.hist(project_df['budget_usd'], bins=30, color='skyblue', edgecolor='black')
plt.title('Budget Distribution')
plt.xlabel('Budget (USD)')
plt.ylabel('Frequency')
plt.tight_layout()

In [None]:
# Correlation matrix
correlation = model_df.drop(columns='status_flag').corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.tight_layout()

## Predictive modeling

We will train a Random Forest classifier to predict whether a project is delayed based on project characteristics.

In [None]:
# Define features and target
X = model_df.drop('status_flag', axis=1)
y = model_df['status_flag']

# Train-test split (70/30), stratified so both sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred = rf_model.predict(X_test)

# Evaluate model
acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['On Time', 'Delayed'])
conf_matrix = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {acc:.2f}')
print('Classification Report:\n', report)
print('Confusion Matrix:\n', conf_matrix)
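
A single train-test split gives one accuracy estimate, which can vary with the random seed. As a rough sketch, stratified k-fold cross-validation averages the score over several splits. The data here comes from `make_classification` as a synthetic stand-in and is not the project dataset; the fold count and forest size are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y (roughly 70/30 class balance); not the project data
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4,
    weights=[0.7, 0.3], random_state=42
)

# Five stratified folds, so each fold keeps a similar class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# One accuracy score per fold
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
print(f'Mean accuracy: {scores.mean():.2f} (std: {scores.std():.2f})')
```

With the notebook's own data, `X` and `y` would take the place of `X_demo` and `y_demo`.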

In [None]:
# Feature importance
importances = pd.Series(rf_model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Plot feature importances
importances.plot(kind='bar')
plt.title('Feature Importance')
plt.ylabel('Importance Score')
plt.xlabel('Features')
plt.tight_layout()
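
The impurity-based scores in `feature_importances_` are computed on the training data and can favor features with many distinct values. Permutation importance on held-out data is one way to cross-check them. The sketch below runs on synthetic data with hypothetical feature names, not on the project dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; feature names are hypothetical
X_arr, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f'feature_{i}' for i in range(5)]
X_demo = pd.DataFrame(X_arr, columns=feature_names)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the test set and measure the mean drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
perm_importances = pd.Series(
    result.importances_mean, index=feature_names
).sort_values(ascending=False)

print(perm_importances)
```

If the two rankings disagree noticeably, the permutation scores are usually the safer guide to what the model relies on for unseen projects.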

## Conclusion

This notebook demonstrated how to perform exploratory data analysis and build a predictive model on a synthetic project management dataset. The random forest's feature importances suggest which factors are most strongly associated with project delays, pointing to the gap between planned and actual duration and to risk scores. The workflow covers data wrangling, visualization, and machine learning modeling, skills relevant to business analysis and program management roles.