# Project Management Dataset Analysis

This project presents a synthetic **project management** dataset designed to mimic real-world scenarios encountered by business analysts, program managers, and data analysts. The dataset contains 500 projects with features like planned and actual durations, budgets, team sizes, complexity, stakeholder engagement, risk levels, and whether the project was successfully delivered (on time and within budget).

The goal is to perform exploratory data analysis (EDA) to uncover insights into project performance, and then to build two predictive models:

* **Classification**: Predict whether a project will be successful based on its characteristics.
* **Regression**: Predict the actual project duration based on project features.

This notebook loads the dataset, explores its structure, visualizes relationships between features, and fits two scikit-learn models: a logistic regression for the classification task and a linear regression for the regression task.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, mean_absolute_error, r2_score

# Load the dataset
file_path = 'project_management_dataset.csv'
df = pd.read_csv(file_path)

# Convert date columns to datetime
for col in ['Start_Date', 'End_Date']:
    df[col] = pd.to_datetime(df[col])

# Display first few rows
df.head()


In [None]:
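from IPython.display import display

# Summary statistics (display() is needed because a cell only renders its last expression)
display(df.describe(include='all'))

# Check for missing values
df.isnull().sum()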
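
Since the introduction defines success as on time and within budget, one further sanity check is to recompute that label from the duration and budget columns and compare it with `Success`. The sketch below uses a small hypothetical stand-in frame (`demo`) so it runs without the CSV. The exact rule (`<=` rather than `<`) is an assumption, not something the dataset description confirms.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, use the loaded `df` instead.
demo = pd.DataFrame({
    'Planned_Duration': [100, 120, 90, 200],
    'Actual_Duration':  [95, 150, 90, 180],
    'Budget_kUSD':      [500, 300, 250, 800],
    'Spent_kUSD':       [480, 290, 260, 700],
    'Success':          [1, 0, 0, 1],
})

# Assumed rule: finish no later than planned and spend no more than budgeted.
on_time = demo['Actual_Duration'] <= demo['Planned_Duration']
within_budget = demo['Spent_kUSD'] <= demo['Budget_kUSD']
derived = (on_time & within_budget).astype(int)

# Rows where the recomputed label disagrees with the stored one.
mismatches = demo[derived != demo['Success']]
print(f"{len(mismatches)} of {len(demo)} rows disagree with the derived label")
```

If many rows disagree on the real data, the label was probably generated with a different rule (for example, with a tolerance), which is worth knowing before modeling.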


In [None]:
# Histograms of numeric features
numeric_cols = ['Planned_Duration', 'Actual_Duration', 'Budget_kUSD', 'Spent_kUSD', 'Team_Size', 'Complexity', 'Stakeholder_Engagement', 'Risk_Level']
plt.figure(figsize=(14, 10))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(3, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 8))
corr = df[numeric_cols + ['Success']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

# Boxplot of Actual Duration by Success
plt.figure(figsize=(6, 4))
sns.boxplot(x='Success', y='Actual_Duration', data=df)
plt.title('Actual Duration by Success')
plt.xlabel('Success (1=Yes, 0=No)')
plt.ylabel('Actual Duration (days)')
plt.show()
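
The boxplot can be complemented with a numeric summary per group, which makes the difference in medians easy to read off. This is a minimal sketch on a hypothetical stand-in frame (`demo`). In the notebook the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in values, not taken from the dataset.
demo = pd.DataFrame({
    'Success': [1, 1, 1, 0, 0, 0],
    'Actual_Duration': [90, 110, 100, 150, 170, 160],
})

# Count, median, and mean of actual duration for each outcome.
summary = demo.groupby('Success')['Actual_Duration'].agg(['count', 'median', 'mean'])
print(summary)
```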


In [None]:
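# Prepare data for classification
y = df['Success']
# Note: Success is defined as on time and within budget, so Budget_kUSD together with
# Spent_kUSD reveals part of the label. Drop Spent_kUSD to predict before a project finishes.
X = df[['Planned_Duration', 'Budget_kUSD', 'Spent_kUSD', 'Team_Size', 'Complexity', 'Stakeholder_Engagement', 'Risk_Level']]

# Train-test split (stratified so both sets keep the same success rate)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)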

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predictions
y_pred = log_reg.predict(X_test_scaled)

# Evaluation
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

acc, cm
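
A single 80/20 split on 500 rows gives a fairly noisy accuracy estimate. One way to get a steadier number is to wrap the scaler and the model in a scikit-learn pipeline and cross-validate it, so the scaler is refit on each training fold. The sketch below runs on hypothetical stand-in data. The frame `demo`, its made-up success rule, and the reduced column set are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data so the sketch runs without the CSV.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'Planned_Duration': rng.integers(30, 365, n),
    'Team_Size': rng.integers(3, 30, n),
    'Complexity': rng.integers(1, 6, n),
    'Risk_Level': rng.integers(1, 6, n),
})
# Made-up rule: lower complexity and risk make success more likely.
logit = 3.0 - 0.5 * demo['Complexity'] - 0.5 * demo['Risk_Level']
demo['Success'] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_demo = demo.drop(columns='Success')
y_demo = demo['Success']

# Scaling happens inside the pipeline, so each fold fits its own scaler.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring='accuracy')

# Accuracy of always predicting the more common class, as a reference point.
majority_rate = max(y_demo.mean(), 1 - y_demo.mean())
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Majority-class baseline: {majority_rate:.3f}")
```

Comparing the cross-validated accuracy with the majority-class rate shows how much the model adds over a trivial guess.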


In [None]:
# Prepare data for regression (target: Actual_Duration)
y_reg = df['Actual_Duration']
X_reg = df[['Planned_Duration', 'Budget_kUSD', 'Spent_kUSD', 'Team_Size', 'Complexity', 'Stakeholder_Engagement', 'Risk_Level']]

# Train-test split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_r, y_train_r)

# Predictions
y_pred_r = lin_reg.predict(X_test_r)

# Evaluation
mae = mean_absolute_error(y_test_r, y_pred_r)
r2 = r2_score(y_test_r, y_pred_r)

mae, r2
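
The MAE is easier to interpret next to a naive baseline. For duration, a natural one is to predict that every project takes exactly as long as planned. The sketch below compares the two on hypothetical stand-in data. The frame `demo` and its made-up overrun rule are assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data so the sketch runs without the CSV.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({
    'Planned_Duration': rng.integers(30, 365, n).astype(float),
    'Complexity': rng.integers(1, 6, n),
    'Risk_Level': rng.integers(1, 6, n),
})
# Made-up rule: more complex projects overrun their plan by a larger share.
demo['Actual_Duration'] = (
    demo['Planned_Duration'] * (1 + 0.05 * demo['Complexity'])
    + rng.normal(0, 10, n)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo.drop(columns='Actual_Duration'), demo['Actual_Duration'],
    test_size=0.2, random_state=42,
)

model = LinearRegression().fit(X_tr, y_tr)
mae_model = mean_absolute_error(y_te, model.predict(X_te))

# Naive baseline: assume the project takes exactly as long as planned.
mae_naive = mean_absolute_error(y_te, X_te['Planned_Duration'])
print(f"Model MAE: {mae_model:.1f} days, planned-duration baseline MAE: {mae_naive:.1f} days")
```

If the model's MAE on the real data is not clearly below the baseline, the extra features are adding little beyond the plan itself.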
