# Project Resource Allocation Analytics

This Jupyter notebook provides an end-to-end analysis of a synthetic project-resource dataset. The dataset simulates tasks associated with different projects, departments and phases within an organization. It includes start and due dates, budget allocations, estimated and actual effort, complexity, risk levels and whether the task was completed on time.

In this notebook we will:

* Perform exploratory data analysis (EDA) with summary statistics for the numerical features.
* Visualize the task complexity distribution and the correlations between numerical features.
* Build a classification model to predict whether a task will experience a delay beyond its due date.
* Build a regression model to predict the actual effort hours needed for a task.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error
from sklearn.linear_model import LogisticRegression, LinearRegression

# Load the dataset
df = pd.read_csv('resource_allocation_dataset.csv')

# Preview the data
df.head()

## Exploratory Data Analysis

We begin by exploring the distribution of numerical features and visualizing the relationships between them. The dataset also includes categorical features such as project phase, department, risk level and task type, which will be encoded later for modeling.
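
As a small illustration of the encoding step mentioned above, the sketch below inspects and one-hot encodes a toy frame with `pd.get_dummies`. The rows and category values are made up for illustration and are not taken from `resource_allocation_dataset.csv`; the modeling cells in this notebook use scikit-learn's `OneHotEncoder` instead.

```python
import pandas as pd

# Toy rows with hypothetical category values (not from the real dataset)
toy = pd.DataFrame({
    'phase': ['Planning', 'Execution', 'Planning', 'Closure'],
    'risk_level': ['Low', 'High', 'Medium', 'Low'],
    'complexity': [2, 5, 3, 1],
})

# Frequency of each category, useful for spotting rare levels
phase_counts = toy['phase'].value_counts()

# One-hot encode the categorical columns; drop_first plays the same role
# as OneHotEncoder(drop='first') by dropping one level per feature
encoded = pd.get_dummies(toy, columns=['phase', 'risk_level'], drop_first=True)

print(phase_counts)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear dummy columns, which matters for the linear models used later.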



In [None]:
# Convert date columns to datetime
for col in ['start_date', 'due_date', 'finish_date']:
    df[col] = pd.to_datetime(df[col])

# Numerical features used for the summary statistics and the correlation heatmap
numeric_features = ['estimated_effort_hours','actual_effort_hours','budget_allocated','budget_used','complexity','satisfaction_rating','delay_days']

# Summary statistics for numerical features
print(df[numeric_features].describe())

# Count plot of complexity (one bar per discrete complexity level)
plt.figure(figsize=(6,4))
sns.countplot(x='complexity', data=df)
plt.title('Task Complexity Distribution')
plt.show()

# Correlation heatmap
plt.figure(figsize=(8,6))
corr = df[numeric_features].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

## Classification: Predicting Task Delays

We will model whether a task will be delayed beyond its due date (`delay_days > 0`). A delay greater than zero means the finish date was after the due date. We'll use a simple logistic regression model with categorical features encoded via one-hot encoding.
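
To make the target definition concrete, here is a minimal sketch on toy data. It assumes `delay_days` is the signed difference in days between `finish_date` and `due_date`; the real dataset may compute or clip the column differently, and the dates below are hypothetical.

```python
import pandas as pd

# Toy tasks with hypothetical dates (not from the real dataset)
toy = pd.DataFrame({
    'due_date': pd.to_datetime(['2024-01-10', '2024-02-01', '2024-03-15']),
    'finish_date': pd.to_datetime(['2024-01-08', '2024-02-05', '2024-03-15']),
})

# Assumption: delay_days is the signed day difference between finish and due dates
toy['delay_days'] = (toy['finish_date'] - toy['due_date']).dt.days

# A task counts as delayed only when it finished strictly after its due date
toy['delayed'] = (toy['delay_days'] > 0).astype(int)

print(toy)
```

Under this definition, a task finished exactly on its due date is labeled as not delayed.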


In [None]:
# Create binary target for delay
df['delayed'] = (df['delay_days'] > 0).astype(int)

# Features and target
X = df[['task_type','phase','department','complexity','risk_level','estimated_effort_hours','budget_allocated','satisfaction_rating']]
y = df['delayed']

# Identify categorical and numeric columns
cat_cols = ['task_type','phase','department','risk_level']
num_cols = ['complexity','estimated_effort_hours','budget_allocated','satisfaction_rating']

# Preprocess: one-hot encode categorical features and pass numeric features through unchanged
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(drop='first'), cat_cols),
    ('num', 'passthrough', num_cols),
])

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model
clf = LogisticRegression(max_iter=200)

# Create pipeline
pipe = Pipeline(steps=[('preprocessor', preprocessor), ('model', clf)])

# Fit model
pipe.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipe.predict(X_test)
print('Classification Report:\n', classification_report(y_test, y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

## Regression: Predicting Actual Effort Hours

Here we predict the actual effort hours required for a task using linear regression. Again we encode categorical variables and use a simple linear model for illustrative purposes.
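
The regression cell is evaluated with root mean squared error (RMSE). As a point of reference, the sketch below writes out the metric by hand and applies it to a naive baseline that simply takes the up-front estimate as the prediction. The helper `rmse` and the effort values are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical effort values in hours (not from the real dataset)
actual = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 36.0]

# Naive baseline: take the up-front estimate as the prediction
baseline_rmse = rmse(actual, estimated)
print('Baseline RMSE:', baseline_rmse)
```

A fitted model is only useful here if its RMSE on held-out data is lower than that of such a baseline.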


In [None]:
# Features and target for regression
X_reg = df[['task_type','phase','department','complexity','risk_level','estimated_effort_hours','budget_allocated','satisfaction_rating']]
y_reg = df['actual_effort_hours']

# Split the data
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Preprocess and model
preprocessor_r = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(drop='first'), cat_cols),
    ('num', 'passthrough', num_cols),
])
model_r = LinearRegression()
pipe_r = Pipeline(steps=[('preprocessor', preprocessor_r), ('model', model_r)])

# Fit model
pipe_r.fit(X_train_r, y_train_r)

# Predict
y_pred_r = pipe_r.predict(X_test_r)

# Evaluate with RMSE (square root of the MSE; the `squared` argument is no longer available in recent scikit-learn releases)
rmse = np.sqrt(mean_squared_error(y_test_r, y_pred_r))
print('Root Mean Squared Error:', rmse)

## Conclusion

This synthetic project-resource dataset illustrates how basic analytics and machine learning can inform project management decisions. Summary statistics and a correlation heatmap give a first look at how budget, effort and complexity relate. A logistic regression model provides a baseline classifier for delayed tasks, while a linear regression provides a baseline prediction of actual effort hours from the selected features.

Further improvements might include more sophisticated models (e.g., random forests or gradient boosting), cross-validation for better generalization, and domain-specific feature engineering.
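
As a rough sketch of two of these suggestions combined, the block below runs 5-fold cross-validation on a random forest classifier. It uses synthetic data from `make_classification` as a stand-in for the preprocessed task features, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; in the notebook this would be the preprocessed task features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Random forest with a fixed seed so repeated runs are reproducible
forest = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validated accuracy: one score per held-out fold
scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='accuracy')
print('Fold accuracies:', scores)
print('Mean accuracy:', scores.mean())
```

Reporting the spread across folds alongside the mean gives a better sense of generalization than the single train/test split used above.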
