
# Project Analytics Case Study

This notebook demonstrates an end‑to‑end analytics workflow on a **synthetic project management dataset**. The goal is to showcase skills that are relevant for roles like business analyst, program manager and data analyst. We'll perform exploratory data analysis (EDA), visualize key metrics and build predictive models.

The dataset contains 500 simulated projects with the following columns:

- **project_id** – unique identifier for each project.
- **start_date** and **end_date** – project start and completion dates.
- **duration_days** – project duration in days.
- **budget** – planned budget (in currency units).
- **spend** – actual spend.
- **team_size** – number of people assigned to the project.
- **risk_score** – risk assessment on a scale from 1 (low) to 5 (high).
- **satisfaction_rating** – client satisfaction on a scale from 1 to 10.
- **high_satisfaction** – binary indicator equal to 1 if satisfaction_rating ≥ 7, else 0.

We'll explore the data, visualize relationships and build models to predict whether a project will be rated highly by the client.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import mean_squared_error, r2_score

# set plotting style
sns.set(style='whitegrid')
%matplotlib inline


In [None]:

# Load the synthetic dataset
df = pd.read_csv('synthetic_project_data.csv', parse_dates=['start_date', 'end_date'])

print(f"Dataset shape: {df.shape}")
df.head()


In [None]:

# Show basic descriptive statistics
df.describe()


In [None]:

# Distribution of satisfaction_rating
plt.figure(figsize=(8,4))
sns.histplot(df['satisfaction_rating'], bins=10, kde=True)
plt.title('Distribution of Satisfaction Ratings')
plt.xlabel('Satisfaction Rating (1-10)')
plt.ylabel('Count')
plt.show()

# Scatter plot: budget vs spend colored by satisfaction
plt.figure(figsize=(8,5))
sns.scatterplot(data=df, x='budget', y='spend', hue='satisfaction_rating', palette='viridis')
plt.title('Budget vs. Spend')
plt.xlabel('Budget')
plt.ylabel('Spend')
plt.show()

# Correlation heatmap for numeric features
numeric_cols = ['duration_days','budget','spend','team_size','risk_score','satisfaction_rating']
plt.figure(figsize=(8,6))
corr = df[numeric_cols].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()


In [None]:

# Features and target for classification
feature_cols = ['duration_days', 'budget', 'spend', 'team_size', 'risk_score']
X = df[feature_cols]
y = df['high_satisfaction']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit logistic regression model
clf = LogisticRegression(max_iter=200)
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Evaluation metrics
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")
print("Confusion Matrix:
", cm)
print("Classification Report:
", classification_report(y_test, y_pred))


In [None]:

# Predicting satisfaction rating (continuous) using linear regression
X_reg = df[feature_cols]
y_reg = df['satisfaction_rating']

# Train-test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Fit linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

# Predict on test set
y_pred_reg = lin_reg.predict(X_test_reg)

# Evaluation metrics
rmse = mean_squared_error(y_test_reg, y_pred_reg, squared=False)
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"RMSE: {rmse:.2f}")
print(f"R^2 Score: {r2:.2f}")



## Conclusions

This analysis demonstrates a step‑by‑step workflow for project analytics:

- We loaded and explored a synthetic dataset of 500 projects.  Summary statistics highlighted variation in budgets, spending and team sizes.
- Visualizations revealed the distribution of client satisfaction, as well as the relationship between budget and spend.  A correlation heatmap identified moderate relationships between risk score, spending and satisfaction.
- A **logistic regression** model was trained to predict high vs low client satisfaction.  The model achieved an accuracy of around 70% on the held‑out test set.
- A **linear regression** model was also used to predict the continuous satisfaction rating.  Model performance metrics (RMSE and R²) suggest there is room to improve accuracy, perhaps by engineering additional features (e.g., schedule variance, overspend percentage, categorical risk levels) or experimenting with more complex algorithms like random forests.

This project can be extended by:

- Adding more features such as project type, region or manager experience.
- Trying other classification algorithms (e.g., decision trees, gradient boosting) and comparing their performance.
- Deploying the trained model as a microservice or dashboard.

Feel free to fork this repository, explore the data further and experiment with your own models!
