# Project Management Dataset Analysis

This notebook explores a synthetic project management dataset and builds a classification model to predict project success.

The data is generated for demonstration purposes and includes variables such as team size, budget, duration, complexity score, methodology used, manager's experience, number of changes during the project, risk score, domain, and the project outcome (`status`).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Load dataset
data = pd.read_csv('synthetic_project_data.csv')

# Display first few rows
data.head()
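
Before going further, it can help to confirm that the loaded frame has the columns the rest of the notebook relies on. The sketch below is one way to do that. `check_expected_columns` is a hypothetical helper, the expected names are collected from the cells in this notebook, and the two-row frame (including its `status` labels) is a made-up stand-in for `data`.

```python
import pandas as pd

# Column names referenced elsewhere in this notebook.
EXPECTED_COLUMNS = [
    'project_id', 'team_size', 'budget_k', 'duration_months', 'complexity_score',
    'manager_experience_yrs', 'num_changes', 'risk_score',
    'methodology', 'domain', 'status',
]

def check_expected_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in expected if col not in df.columns]

# Made-up stand-in for `data`; it deliberately lacks most columns.
toy = pd.DataFrame({
    'project_id': [1, 2],
    'team_size': [5, 8],
    'status': ['Success', 'Failure'],  # assumed labels, not taken from the CSV
})

missing = check_expected_columns(toy)
print('Missing columns:', missing)

# Count missing values per column as a second quick sanity check.
print(toy.isna().sum())
```

With the real dataset, calling `check_expected_columns(data)` should return an empty list if the CSV matches what the later cells assume.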

## Summary Statistics

Let's look at summary statistics for numeric variables and the distribution of categorical variables.

In [None]:
# Summary statistics for numeric features
data.describe()

In [None]:
# Distribution of categorical variables
for col in ['methodology', 'domain', 'status']:
    print(f"\n{col} distribution:")
    print(data[col].value_counts())

## Exploratory Visualizations

Visualize distributions and relationships between features and the project outcome.

In [None]:
# Histograms for numeric features
numeric_cols = ['team_size', 'budget_k', 'duration_months', 'complexity_score', 'manager_experience_yrs', 'num_changes', 'risk_score']
fig, axs = plt.subplots(len(numeric_cols), 1, figsize=(8, 3*len(numeric_cols)))
for i, col in enumerate(numeric_cols):
    sns.histplot(data[col], kde=True, ax=axs[i], color='skyblue')
    axs[i].set_title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap for numeric features
plt.figure(figsize=(8,6))
corr = data[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap of Numeric Features')
plt.show()

In [None]:
# Relationship between complexity_score and risk_score, colored by outcome
plt.figure(figsize=(8,6))
sns.scatterplot(x='complexity_score', y='risk_score', hue='status', data=data)
plt.title('Complexity vs Risk by Project Outcome')
plt.show()

## Predictive Modeling

We'll build a classification model to predict whether a project will be successful. We'll use a Random Forest classifier and evaluate its performance.

In [None]:
# Separate features and target
X = data.drop('status', axis=1)
y = data['status']

# Identify categorical and numeric columns
cat_cols = ['methodology', 'domain']
num_cols = [col for col in X.columns if col not in cat_cols + ['project_id']]

# Preprocess: one-hot encode categorical variables; remainder='passthrough' forwards
# every other column unchanged (this includes project_id, which num_cols leaves out)
preprocessor = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), cat_cols)
], remainder='passthrough')

# Build the model pipeline
model = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=42))
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate performance
print('Accuracy:', accuracy_score(y_test, y_pred))
print('\nClassification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
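
A single 80/20 split yields one accuracy estimate, which can vary with the split. As a rough sketch of a more stable check, the same kind of pipeline can be scored with stratified k-fold cross-validation. Everything about the data here is an assumption: `make_toy_projects` is a hypothetical generator with invented category values and an invented labelling rule, used only so the block runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def make_toy_projects(n=200, seed=0):
    """Hypothetical stand-in for synthetic_project_data.csv (subset of columns)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        'team_size': rng.integers(3, 30, n),
        'complexity_score': rng.uniform(1, 10, n),
        'risk_score': rng.uniform(0, 1, n),
        'methodology': rng.choice(['Agile', 'Waterfall', 'Hybrid'], n),  # invented values
        'domain': rng.choice(['IT', 'Finance', 'Healthcare'], n),        # invented values
    })
    # Invented labelling rule so the target is learnable; not the real generator.
    score = (0.05 * df['team_size'] - 0.3 * df['complexity_score']
             - 2.0 * df['risk_score'] + rng.normal(0, 1, n))
    df['status'] = np.where(score > score.median(), 'Success', 'Failure')
    return df

def cross_validate_pipeline(df, cat_cols, n_splits=5, seed=42):
    """Score a one-hot + random forest pipeline with stratified k-fold CV."""
    X = df.drop(columns='status')
    y = df['status']
    preprocessor = ColumnTransformer(
        [('cat', OneHotEncoder(drop='first'), cat_cols)],
        remainder='passthrough',
    )
    pipe = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(n_estimators=50, random_state=seed)),
    ])
    # Shuffle before splitting and keep class proportions equal across folds.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')

toy = make_toy_projects()
cv_scores = cross_validate_pipeline(toy, ['methodology', 'domain'])
print(f'CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
```

Because the preprocessing lives inside the pipeline, the encoder is refit on each training fold, so no information from the held-out fold leaks into it.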

In [None]:
# Feature importances
# To label the importances, we need the feature names as the classifier saw them
# Extract the trained RandomForestClassifier and the fitted preprocessor
rf = model.named_steps['classifier']
fitted_preprocessor = model.named_steps['preprocessor']
# Ask the fitted ColumnTransformer for its output names. Passthrough columns keep their
# order in X, so project_id is not necessarily last and a hand-built list can mislabel bars.
# Names come back prefixed (e.g. 'cat__...' or 'remainder__...'); strip the prefix.
feature_names = [name.split('__', 1)[-1] for name in fitted_preprocessor.get_feature_names_out()]
# Get importances
importances = rf.feature_importances_
# Combine into a DataFrame and sort
importances_df = pd.DataFrame({'feature': feature_names, 'importance': importances})
importances_df = importances_df.sort_values(by='importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10,6))
sns.barplot(x='importance', y='feature', data=importances_df.head(10), palette='viridis')
plt.title('Top 10 Feature Importances from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()
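
The importances plotted above are impurity-based: they are computed from the training data and reported per one-hot column. As a complementary view, permutation importance measures how much the held-out score drops when one original column is shuffled. The block below is a minimal sketch of that technique using scikit-learn's `permutation_importance`; the toy frame, its category values, and the rule that `status` depends only on `risk_score` are all assumptions made so the example is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up data: status depends only on risk_score, the other columns are noise.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    'complexity_score': rng.uniform(1, 10, n),
    'risk_score': rng.uniform(0, 1, n),
    'noise': rng.normal(size=n),
    'methodology': rng.choice(['Agile', 'Waterfall'], n),  # invented values
})
toy['status'] = np.where(toy['risk_score'] < 0.5, 'Success', 'Failure')

X = toy.drop(columns='status')
y = toy['status']
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

pipe = Pipeline([
    ('preprocessor', ColumnTransformer(
        [('cat', OneHotEncoder(drop='first'), ['methodology'])],
        remainder='passthrough',
    )),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X_tr, y_tr)

# Shuffle each raw input column in turn and record the drop in test accuracy.
result = permutation_importance(pipe, X_te, y_te, n_repeats=10, random_state=0)
perm_df = pd.DataFrame({
    'feature': X_te.columns,
    'importance_mean': result.importances_mean,
    'importance_std': result.importances_std,
}).sort_values('importance_mean', ascending=False)
print(perm_df)
```

Since the whole pipeline is passed in, each categorical variable gets a single score instead of one per dummy column, which can make the ranking easier to read.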

## Conclusion

In this analysis, we explored a synthetic project management dataset and built a classification model to predict project success.

Key takeaways:

- Team size and manager experience were positively correlated with project success.
- Higher complexity, risk, budget, and number of changes tended to decrease the probability of success.
- The Random Forest model achieved good accuracy on the synthetic dataset, with feature importances highlighting the most influential factors.
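
One simple way to check the direction of the relationships listed above is to compare the mean of each numeric feature across outcomes. The sketch below assumes `status` holds two labels (written here as `'Success'` and `'Failure'`, which is an assumption) and uses a four-row made-up frame; `mean_by_outcome` is a hypothetical helper.

```python
import pandas as pd

def mean_by_outcome(df, numeric_cols, target='status'):
    """Mean of each numeric column per outcome; rows are features, columns are outcomes."""
    return df.groupby(target)[numeric_cols].mean().T

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'team_size': [10, 12, 4, 6],
    'risk_score': [0.25, 0.35, 0.75, 0.65],
    'status': ['Success', 'Success', 'Failure', 'Failure'],
})

summary = mean_by_outcome(toy, ['team_size', 'risk_score'])
print(summary)
```

On the real data, `mean_by_outcome(data, numeric_cols)` would show, for example, whether successful projects have a higher average `team_size` and a lower average `risk_score`.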

Feel free to experiment with the code and adjust model parameters to see how the results change. Since the dataset is synthetic, the analysis is meant to showcase analytical techniques rather than produce actionable real-world insights.