# Project Portfolio Analysis

This notebook provides an exploratory analysis and predictive modeling exercise on a synthetic project portfolio dataset. 

The dataset describes each project by its type, budget, duration, team size, complexity, risk level, stakeholder engagement, and historical success rate. The goal is to explore these features and build predictive models that estimate the likelihood of project success. The notebook is structured as follows:

1. Data loading and overview
2. Exploratory Data Analysis (EDA)
3. Data preprocessing
4. Predictive modeling (Logistic Regression and Random Forest)
5. Model evaluation and comparison

Feel free to run the cells and explore the dataset and models further!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set plot style
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias
%matplotlib inline

In [None]:
# Load the dataset
file_path = 'project_data.csv'
data = pd.read_csv(file_path)

# Preview the first five rows
data.head()

In [None]:
# Data Overview

# Display summary statistics
data.describe()
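
Summary statistics alone do not reveal missing values, duplicated identifiers, or an imbalanced target, all of which affect the modeling steps later on. The following is a minimal sketch of those checks. It runs on a small hypothetical stand-in frame named `demo` (the project type labels and value ranges are invented for illustration); in the notebook, the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; values and type labels are invented.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "project_id": np.arange(1, n + 1),
    "project_type": rng.choice(["IT", "Construction", "Research"], size=n),
    "budget_kUSD": rng.uniform(50, 5000, size=n).round(1),
    "team_size": rng.integers(3, 50, size=n),
    "project_success": rng.integers(0, 2, size=n),
})

# Inject a few missing values so the check has something to report.
demo.loc[[3, 17, 42], "budget_kUSD"] = np.nan

# Count missing values per column.
missing_counts = demo.isna().sum()

# Count repeated project identifiers.
duplicate_ids = int(demo["project_id"].duplicated().sum())

# Share of each class in the target.
class_balance = demo["project_success"].value_counts(normalize=True)

print(missing_counts[missing_counts > 0])
print(f"Duplicate project IDs: {duplicate_ids}")
print(class_balance)
```

If the target turns out to be heavily imbalanced, accuracy alone becomes a weak metric, and the per-class precision and recall in the classification reports deserve more attention.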

In [None]:
# Histogram of numerical features
numeric_cols = ['budget_kUSD', 'duration_months', 'team_size', 'complexity', 'risk_level', 'stakeholder_engagement', 'prev_success_rate']

data[numeric_cols].hist(bins=20, figsize=(12, 10))
plt.suptitle('Distribution of Numerical Features', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Correlation Matrix
# Reuse numeric_cols from the histogram cell and add the target column
corr = data[numeric_cols + ['project_success']].corr()

plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Feature matrix and target vector
X = data.drop(columns=['project_id', 'project_success'])
# One-hot encode categorical variable 'project_type'
X = pd.get_dummies(X, columns=['project_type'], drop_first=True)
y = data['project_success']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

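# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()

# Standardize numerical features (fit on the training set only to avoid leakage)
scaler = StandardScaler()
numeric_features = ['budget_kUSD','duration_months','team_size','complexity','risk_level','stakeholder_engagement','prev_success_rate']
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])
X_test[numeric_features] = scaler.transform(X_test[numeric_features])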
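
The encoding and scaling steps above can also be bundled with the model in a scikit-learn `Pipeline`, so that the preprocessing is fit on the training split only and re-applied automatically at prediction time. This is a sketch of that alternative, not the approach the notebook itself uses. The frame `demo`, the category labels `A`/`B`/`C`, and the rule that generates the target are all hypothetical, and only a subset of the notebook's columns is included to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data using a subset of the notebook's column names.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "project_type": rng.choice(["A", "B", "C"], size=n),
    "budget_kUSD": rng.uniform(50, 5000, size=n),
    "duration_months": rng.integers(1, 37, size=n),
    "risk_level": rng.integers(1, 6, size=n),
})
# Invented target rule: lower risk makes success more likely.
p_success = 1 / (1 + np.exp(demo["risk_level"].to_numpy() - 3))
demo["project_success"] = (rng.random(n) < p_success).astype(int)

numeric_features = ["budget_kUSD", "duration_months", "risk_level"]
categorical_features = ["project_type"]

# Scale numeric columns; one-hot encode the categorical column
# (drop="first" mirrors drop_first=True in pd.get_dummies).
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = demo.drop(columns="project_success")
y = demo["project_success"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fitting the pipeline fits the scaler and encoder on X_train only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

A side benefit of this design is that `X_train` and `X_test` are never modified in place, so the raw feature values remain available for inspection.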

In [None]:
# Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

# Predictions
y_pred_log = log_reg.predict(X_test)

# Evaluation
acc_log = accuracy_score(y_test, y_pred_log)
print(f"Logistic Regression Accuracy: {acc_log:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_log))

# Confusion Matrix
cm_log = confusion_matrix(y_test, y_pred_log)
sns.heatmap(cm_log, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

In [None]:
# Random Forest model
rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
rf_clf.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_clf.predict(X_test)

# Evaluation
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

# Confusion Matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Greens')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Feature importance plot
importances = pd.Series(rf_clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

plt.figure(figsize=(10,6))
importances.head(15).plot(kind='bar')
plt.title('Top Feature Importances (Random Forest)')
plt.xlabel('Features')
plt.ylabel('Importance Score')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
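
A single train/test split gives one accuracy figure per model, which can shift noticeably with a different random seed. One way to compare the two models more robustly is stratified k-fold cross-validation. The sketch below uses data from `make_classification` as a hypothetical stand-in for the project features, so the scores it prints say nothing about the real dataset; the names `X_demo`, `y_demo`, and `comparison` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the project features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42
)

models = {
    # Scaling lives inside the pipeline so it is refit on each training fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Same folds for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    rows.append({
        "model": name,
        "mean_accuracy": scores.mean(),
        "std_accuracy": scores.std(),
    })

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

When the difference in mean accuracy is smaller than the fold-to-fold standard deviation, the data do not clearly favor either model.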

## Conclusion

This notebook demonstrates a full workflow for analyzing a synthetic project portfolio dataset. Through EDA, we explored the distributions and correlations of features that may influence project success. We then built two predictive models: Logistic Regression and Random Forest. 

Random Forest can capture non-linear relationships and interactions among variables that Logistic Regression cannot model without engineered features, so it may score higher; compare the accuracy scores and classification reports above to see whether that holds on this dataset. The project can be extended by experimenting with additional models, performing hyperparameter tuning, or incorporating new synthetic features such as time-series metrics or resource allocation details. The analysis offers a starting point for business analysts and data enthusiasts who want to practice data-driven decision making and predictive modeling.
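
As one illustration of the hyperparameter tuning mentioned above, the sketch below runs a small grid search over a Random Forest. The grid values are arbitrary starting points rather than recommendations, and `X_demo` and `y_demo` are hypothetical stand-ins generated with `make_classification`; in the notebook, `X_train` and `y_train` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for the training features and target.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation per candidate
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The tuned model is available as `search.best_estimator_` and should still be scored once on the held-out test set, which plays no part in the search.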

Feel free to modify, extend, and refine this analysis to suit your needs!