# Project Risk Analysis

This project uses a **synthetic dataset** to analyze factors that contribute to project risk. The dataset simulates 1,000 projects with variables such as manager experience, team size, budget, duration, domain and priority.

The dataset includes:

- `Project_ID`: Unique identifier for each project
- `Manager_Experience`: Years of experience of the project manager
- `Team_Size`: Number of people on the project team
- `Budget_kUSD`: Budget allocated to the project (in thousands of USD)
- `Duration_days`: Planned duration of the project in days
- `Domain`: Business domain (IT, Finance, Marketing, Operations or HR)
- `Priority`: Project priority (High, Medium or Low)
- `Risk_Score`: Calculated risk score between 0 and 100
- `Status`: Project status classification derived from the risk score

We'll explore the data, visualize relationships and build predictive models to estimate the `Risk_Score` and classify project status.
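
The `Status` column is described above as derived from `Risk_Score`, but the cut-offs are not stated. As a rough sketch of how such a derivation could look, the snippet below buckets scores with `pd.cut`. The thresholds of 40 and 70 and the helper name `derive_status` are assumptions for illustration only; the status labels are the ones used later in the count plot.

```python
import pandas as pd

# Hypothetical cut-offs: the thresholds actually used to build the dataset are not given.
STATUS_BINS = [0, 40, 70, 100]
STATUS_LABELS = ['On Schedule', 'At Risk', 'Delayed']


def derive_status(risk_scores: pd.Series) -> pd.Series:
    """Bucket risk scores in [0, 100] into ordered status labels.

    Bins are right-inclusive, so a score of exactly 40 falls in the first bucket.
    """
    return pd.cut(
        risk_scores,
        bins=STATUS_BINS,
        labels=STATUS_LABELS,
        include_lowest=True,  # make the first bin include a score of 0
    )


example_scores = pd.Series([12.5, 40.0, 55.0, 70.0, 93.0])
example_status = derive_status(example_scores)
print(example_status.tolist())
```

If the real thresholds are known, only `STATUS_BINS` needs to change.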


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import classification_report, accuracy_score, f1_score
from sklearn.ensemble import RandomForestClassifier

# Display plots inline
%matplotlib inline

# Set seaborn style
sns.set_theme(style='whitegrid')


In [None]:
# Load the dataset
df = pd.read_csv('synthetic_project_data.csv')

# Show the first few rows
df.head()

In [None]:
# Data summary
print('Dataset dimensions:', df.shape)

# Data types
print('\nData types:')
print(df.dtypes)

# Statistical summary of numeric columns
# Printed explicitly: only the last expression in a cell is displayed automatically
print(df.describe())

# Check for missing values
print('\nMissing values per column:')
print(df.isnull().sum())

In [None]:
# Distribution of risk score
plt.figure(figsize=(6,4))
sns.histplot(df['Risk_Score'], bins=30, kde=True)
plt.title('Distribution of Risk Score')
plt.xlabel('Risk Score')
plt.ylabel('Frequency')
plt.show()

# Count of statuses
plt.figure(figsize=(6,4))
sns.countplot(x='Status', data=df, order=['On Schedule','At Risk','Delayed'])
plt.title('Project Status Counts')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

# Scatter plot: Manager experience vs Risk score colored by priority
plt.figure(figsize=(6,4))
sns.scatterplot(x='Manager_Experience', y='Risk_Score', data=df, hue='Priority')
plt.title('Manager Experience vs Risk Score')
plt.xlabel('Manager Experience (years)')
plt.ylabel('Risk Score')
plt.show()

# Correlation heatmap for numeric variables
numeric_cols = ['Manager_Experience','Team_Size','Budget_kUSD','Duration_days','Risk_Score']
corr = df[numeric_cols].corr()
plt.figure(figsize=(6,5))
sns.heatmap(corr, annot=True, cmap='Blues')
plt.title('Correlation Heatmap')
plt.show()

In [None]:
# Prepare features and target for regression
X_reg = df.drop(['Risk_Score','Status','Project_ID'], axis=1)
y_reg = df['Risk_Score']

# Identify categorical and numeric columns
categorical_cols = ['Domain','Priority']
numeric_cols = X_reg.columns.difference(categorical_cols)

# Preprocessor: one-hot encode categorical variables and pass through numeric variables unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Baseline model: Linear Regression
lin_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

lin_model.fit(X_train, y_train)
y_pred_lin = lin_model.predict(X_test)

# Evaluate baseline model
mae_lin = mean_absolute_error(y_test, y_pred_lin)
# Take the square root explicitly; the `squared` argument is deprecated in newer scikit-learn releases
rmse_lin = np.sqrt(mean_squared_error(y_test, y_pred_lin))
r2_lin = r2_score(y_test, y_pred_lin)
print('Linear Regression Results')
print('Mean Absolute Error (MAE):', round(mae_lin, 2))
print('Root Mean Squared Error (RMSE):', round(rmse_lin, 2))
print('R-squared (R2):', round(r2_lin, 3))

In [None]:
# Improved model: Random Forest Regressor
rf_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(n_estimators=200, random_state=42))
])

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Evaluate Random Forest
mae_rf = mean_absolute_error(y_test, y_pred_rf)
# Take the square root explicitly; the `squared` argument is deprecated in newer scikit-learn releases
rmse_rf = np.sqrt(mean_squared_error(y_test, y_pred_rf))
r2_rf = r2_score(y_test, y_pred_rf)
print('Random Forest Regressor Results')
print('Mean Absolute Error (MAE):', round(mae_rf, 2))
print('Root Mean Squared Error (RMSE):', round(rmse_rf, 2))
print('R-squared (R2):', round(r2_rf, 3))

In [None]:
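# Classification: predict project status
# Risk_Score is excluded because Status is derived from it; keeping it would leak the target
X_clf = df.drop(['Status','Risk_Score','Project_ID'], axis=1)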
y_clf = df['Status']

# Categorical and numeric columns for classification
categorical_cols_clf = ['Domain','Priority']
numeric_cols_clf = X_clf.columns.difference(categorical_cols_clf)

preprocessor_clf = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_cols_clf),
        ('num', 'passthrough', numeric_cols_clf)
    ]
)

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf)

rf_clf = Pipeline(steps=[
    ('preprocessor', preprocessor_clf),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

rf_clf.fit(X_train_c, y_train_c)
y_pred_c = rf_clf.predict(X_test_c)

# Evaluate classification model
acc = accuracy_score(y_test_c, y_pred_c)
f1 = f1_score(y_test_c, y_pred_c, average='weighted')
print('Random Forest Classifier Results')
print('Accuracy:', round(acc, 3))
print('Weighted F1 Score:', round(f1, 3))
print('Classification Report:\n', classification_report(y_test_c, y_pred_c))

## Conclusion

In this notebook, we explored a synthetic project management dataset and built predictive models to estimate the project risk score and classify project status.

Key points:

- **Exploratory analysis** showed that risk score tends to increase with larger budgets, longer durations and bigger teams, while more experienced managers and high‑priority projects typically have lower risk.
- **Linear regression** provides a simple, interpretable baseline; its R² measures the share of variance in risk scores that the features explain.
- **Random forest regression** captures non‑linear relationships and generally achieves better predictive performance than the linear model.
- **Random forest classification** predicts project status (On Schedule, At Risk, Delayed); accuracy and weighted F1 are reported on a stratified 20% test set.
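
The regression comparison above rests on a single 80/20 split. One way to check whether the gap between the two models is stable is k-fold cross-validation, sketched below. The stand-in data frame `demo` and its target formula are made up so the snippet runs without `synthetic_project_data.csv`; they are assumptions, not the notebook's data, and the helper `make_pipeline_for` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Stand-in data (assumption): random values under the notebook's column names.
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    'Manager_Experience': rng.integers(1, 21, n),
    'Team_Size': rng.integers(3, 31, n),
    'Budget_kUSD': rng.uniform(50, 1000, n),
    'Duration_days': rng.integers(30, 366, n),
    'Domain': rng.choice(['IT', 'Finance', 'Marketing', 'Operations', 'HR'], n),
    'Priority': rng.choice(['High', 'Medium', 'Low'], n),
})
# Made-up target formula, clipped to the documented 0-100 range.
demo['Risk_Score'] = np.clip(
    50 + 0.02 * demo['Budget_kUSD'] + 0.05 * demo['Duration_days']
    - 1.5 * demo['Manager_Experience'] + rng.normal(0, 5, n),
    0, 100,
)

categorical = ['Domain', 'Priority']
numeric = ['Manager_Experience', 'Team_Size', 'Budget_kUSD', 'Duration_days']


def make_pipeline_for(model):
    """Wrap a regressor in the same preprocessing used in the notebook."""
    pre = ColumnTransformer([
        ('cat', OneHotEncoder(drop='first'), categorical),
        ('num', 'passthrough', numeric),
    ])
    return Pipeline([('preprocessor', pre), ('model', model)])


candidates = {
    'linear': LinearRegression(),
    'random_forest': RandomForestRegressor(n_estimators=50, random_state=42),
}

# 5-fold cross-validated R² for each candidate model.
cv_r2 = {}
for name, model in candidates.items():
    cv_r2[name] = cross_val_score(
        make_pipeline_for(model),
        demo[numeric + categorical],
        demo['Risk_Score'],
        cv=5,
        scoring='r2',
    )
    print(f'{name}: mean R2 = {cv_r2[name].mean():.3f} (std {cv_r2[name].std():.3f})')
```

To run this on the real data, replace `demo` with `df`. Which model scores higher depends on the data, so the per-fold spread is worth reading alongside the mean.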

Feel free to experiment with other models or feature engineering techniques to further improve performance or gain additional insights.
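
As one possible starting point for feature engineering, the sketch below adds two ratio features built from existing columns. The column names `Budget_per_Day` and `Budget_per_Member` and the helper `add_ratio_features` are hypothetical, and whether these features help is something to check against the test metrics rather than assume.

```python
import pandas as pd


def add_ratio_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with two illustrative ratio columns added."""
    out = frame.copy()  # leave the caller's frame untouched
    out['Budget_per_Day'] = out['Budget_kUSD'] / out['Duration_days']
    out['Budget_per_Member'] = out['Budget_kUSD'] / out['Team_Size']
    return out


# Tiny hand-made sample using the notebook's column names.
sample = pd.DataFrame({
    'Budget_kUSD': [100.0, 450.0],
    'Duration_days': [50, 90],
    'Team_Size': [5, 9],
})
enriched = add_ratio_features(sample)
print(enriched)
```

Applying `add_ratio_features(df)` before building `X_reg` would pass the new columns through the existing preprocessor as numeric features, since the numeric columns are computed from whatever is not categorical.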
