
# Project Success Analysis

This notebook walks through an exploratory and predictive analysis of a synthetic project management dataset. The dataset contains 200 hypothetical projects with attributes such as budget, actual cost, estimated and actual duration, team size, complexity, customer satisfaction, and success indicators.

We will:

1. Perform exploratory data analysis (EDA) with summary statistics and visualizations to understand the relationships between variables.
2. Build a predictive model to estimate whether a project will be successful (on time and on budget) based on its characteristics.
3. Evaluate the model on a held-out test set using accuracy and a per-class classification report (precision, recall, F1).
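
Item 2 defines success as "on time and on budget". The dataset already ships an `overall_success` column, and the notebook does not state the exact rule behind it. As a minimal sketch, assuming "on time" means `actual_duration_m <= estimated_duration_m` and "on budget" means `actual_cost_k <= budget_k`, such a label could be derived as follows. The helper name `derive_success` and the toy rows are hypothetical.

```python
import pandas as pd


def derive_success(df: pd.DataFrame) -> pd.Series:
    """Return 1 where a project finished on time and on budget, else 0.

    Assumption: ties count as success (<=). The dataset's own
    overall_success column may follow a different rule.
    """
    on_budget = df['actual_cost_k'] <= df['budget_k']
    on_time = df['actual_duration_m'] <= df['estimated_duration_m']
    return (on_budget & on_time).astype(int)


# Three made-up projects: on target, over budget, and late
toy = pd.DataFrame({
    'budget_k': [100, 200, 150],
    'actual_cost_k': [90, 250, 150],
    'estimated_duration_m': [6, 12, 9],
    'actual_duration_m': [6, 11, 10],
})
print(derive_success(toy).tolist())  # [1, 0, 0]
```

Comparing `derive_success(data)` with `data['overall_success']` after loading would show whether the shipped label follows this rule.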


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

# Set visual style (set_theme is the current name; sns.set is an older alias)
sns.set_theme(style='whitegrid')


In [None]:
# Load the synthetic dataset
data = pd.read_csv('synthetic_project_data.csv')

# Display first five rows
data.head()
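
Before plotting, it can help to confirm that the file loaded as expected. The sketch below gathers a few basic facts: row count, missing values per column, non-numeric columns, and the balance of the target. The helper name `quick_checks` is hypothetical; the default target column name is taken from the cells in this notebook.

```python
import pandas as pd


def quick_checks(df: pd.DataFrame, target: str = 'overall_success') -> dict:
    """Collect basic data-quality facts before any plotting or modeling."""
    return {
        'n_rows': len(df),
        # Number of missing values in each column
        'missing_per_column': df.isna().sum().to_dict(),
        # Columns that would need encoding (or dropping) before modeling
        'non_numeric_columns': df.select_dtypes(exclude='number').columns.tolist(),
        # Share of each class in the target
        'class_balance': df[target].value_counts(normalize=True).to_dict(),
    }
```

Calling `quick_checks(data)` on the loaded frame returns a plain dictionary. A heavily skewed `class_balance` would be a reason to read the accuracy score later in the notebook with caution.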


In [None]:
# Summary statistics
summary = data.describe()
print(summary)

# Histogram of budget and actual cost
plt.figure(figsize=(8,4))
plt.hist(data['budget_k'], bins=20, alpha=0.5, label='Budget (k)')
plt.hist(data['actual_cost_k'], bins=20, alpha=0.5, label='Actual Cost (k)')
plt.xlabel('Amount (thousand dollars)')
plt.ylabel('Frequency')
plt.title('Distribution of Budget vs Actual Cost')
plt.legend()
plt.show()

# Scatter plot: Estimated vs Actual Duration colored by success
plt.figure(figsize=(6,5))
sns.scatterplot(x='estimated_duration_m', y='actual_duration_m', hue='overall_success', data=data, palette='Set1')
# Dashed y = x reference line: points above it took longer than estimated
d_min = data['estimated_duration_m'].min()
d_max = data['estimated_duration_m'].max()
plt.plot([d_min, d_max], [d_min, d_max], 'k--', linewidth=1)
plt.xlabel('Estimated Duration (months)')
plt.ylabel('Actual Duration (months)')
plt.title('Estimated vs Actual Duration by Overall Success')
plt.show()

# Correlation heatmap
plt.figure(figsize=(8,6))
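# numeric_only=True skips any non-numeric columns, which would otherwise raise in pandas >= 2.0
corr = data.drop(columns=['project_id']).corr(numeric_only=True)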
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()


In [None]:
# Prepare features (X) and target (y)
features = ['budget_k', 'estimated_duration_m', 'team_size', 'complexity']
X = data[features]
y = data['overall_success']

# Split into training and test sets (stratify keeps the success rate similar in both)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize and train logistic regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=['Fail','Success'])
print(f'Accuracy: {acc:.2f}')
print('Classification Report:\n', report)
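
A single 80/20 split of 200 projects leaves only 40 test rows, so one accuracy figure can shift noticeably with the random seed. The sketch below is one possible way to get a steadier estimate, not part of the original analysis: it runs stratified 5-fold cross-validation and compares logistic regression against a majority-class baseline. Standardizing the features is an added assumption here, since budget and team size are on very different scales. The helper name `cv_accuracy` is hypothetical.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cv_accuracy(X, y, n_splits=5, seed=42):
    """Return mean cross-validated accuracy for a baseline and a scaled logistic regression."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    candidates = {
        # Always predicts the most common class; any useful model should beat this
        'baseline': DummyClassifier(strategy='most_frequent'),
        # The scaler is fit inside each training fold, so no test-fold statistics leak in
        'logreg': make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    }
    return {
        name: cross_val_score(est, X, y, cv=cv, scoring='accuracy').mean()
        for name, est in candidates.items()
    }
```

With the variables defined above, `cv_accuracy(X, y)` returns a dictionary with one mean accuracy per candidate. If `logreg` does not clearly exceed `baseline`, the four features carry little signal about `overall_success`.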
