
# Project Analysis: Synthetic Project Tasks Dataset

This notebook performs exploratory data analysis and builds a predictive model on a synthetic dataset of project tasks. The dataset simulates project management tasks with features such as project ID, task duration, budget, actual cost, status, priority, and whether the task went over budget. The goal is to demonstrate skills relevant to roles such as **Business Analyst**, **Program Manager**, and **Data Analyst**.


In [None]:

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression

# Configure plots
sns.set_theme(style='whitegrid')

# Load dataset
file_path = 'synthetic_project_tasks.csv'
df = pd.read_csv(file_path)

# Preview the dataset
print('Shape:', df.shape)
df.head()



## Dataset Overview

The dataset contains **synthetic** project task records with the following columns:

- `ProjectID`: Identifier for each project.
- `TaskID`: Unique identifier for each task.
- `TaskName`: A short description of the task.
- `AssignedTo`: Team member assigned to the task.
- `StartDate`, `EndDate`: Task start and end dates.
- `DurationDays`: Duration of the task in days.
- `Budget`: Allocated budget for the task.
- `ActualCost`: Actual cost incurred.
- `TaskStatus`: Current status of the task (Not Started, In Progress, Completed, Cancelled, On Hold).
- `Priority`: Priority level (Low, Medium, High, Critical).
- `CompletionPercentage`: Percent completion.
- `OverBudget`: **Target variable** indicating whether the actual cost exceeded the budget (1) or not (0).

We will analyze the distribution of variables, relationships between budget and actual cost, and build a predictive model for budget overruns.
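Since `OverBudget` is described as a flag for actual cost exceeding budget, it can be recomputed from `Budget` and `ActualCost` and compared with the stored column before any modeling. The sketch below does this on a tiny made-up sample rather than the real CSV. The `sample` values and the helper columns `OverBudgetCheck` and `CostVariancePct` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny hypothetical sample using the same column names as the dataset.
sample = pd.DataFrame({
    "TaskID": [1, 2, 3, 4],
    "Budget": [1000.0, 2500.0, 400.0, 1200.0],
    "ActualCost": [1100.0, 2400.0, 400.0, 1500.0],
    "OverBudget": [1, 0, 0, 1],
})

# Recompute the flag from its stated definition: actual cost exceeded budget.
sample["OverBudgetCheck"] = (sample["ActualCost"] > sample["Budget"]).astype(int)

# Rows where the stored flag disagrees with the recomputed one.
mismatches = sample[sample["OverBudget"] != sample["OverBudgetCheck"]]
print("Mismatched rows:", len(mismatches))

# Cost variance as a percentage of budget (positive means over budget).
sample["CostVariancePct"] = (
    (sample["ActualCost"] - sample["Budget"]) / sample["Budget"] * 100
)
print(sample[["TaskID", "CostVariancePct"]])
```

Running the same check on `df` would show whether the target column follows this definition exactly or includes exceptions worth investigating.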


In [None]:

# Summary statistics for numeric columns
numeric_cols = ['DurationDays', 'Budget', 'ActualCost', 'CompletionPercentage']
summary = df[numeric_cols].describe()
print(summary)

# Plot distribution of Budget and ActualCost
plt.figure(figsize=(10,5))
sns.histplot(df['Budget'], color='skyblue', label='Budget', kde=True)
sns.histplot(df['ActualCost'], color='salmon', label='Actual Cost', kde=True)
plt.title('Distribution of Budget vs Actual Cost')
plt.xlabel('Amount')  # shared axis: both Budget and ActualCost are plotted here
plt.legend()
plt.show()

# Bar chart of TaskStatus
plt.figure(figsize=(8,4))
df['TaskStatus'].value_counts().plot(kind='bar', color='lightgreen')
plt.title('Task Status Distribution')
plt.ylabel('Count')
plt.show()

# Scatter plot of Budget vs ActualCost
plt.figure(figsize=(8,5))
sns.scatterplot(x='Budget', y='ActualCost', hue='OverBudget', data=df, palette='coolwarm')
plt.title('Budget vs Actual Cost')
plt.show()



## Predictive Modeling

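To predict whether a task will go over budget (`OverBudget`), we'll build a logistic regression model. The categorical features (`ProjectID`, `TaskStatus`, `Priority`) are one-hot encoded, and the numeric features (`DurationDays`, `Budget`, `ActualCost`) are passed through unchanged. We split the data into stratified training (80%) and test (20%) sets, fit the model, and evaluate it with a classification report and a confusion matrix.

One caveat: `OverBudget` is defined by comparing `ActualCost` with `Budget`, and both are included as features. The model can therefore recover the target almost directly from its inputs, so the scores below are best read as a check of the pipeline rather than as evidence of forecasting ability.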


In [None]:

# Define features and target
X = df[['ProjectID', 'DurationDays', 'Budget', 'ActualCost', 'TaskStatus', 'Priority']]
y = df['OverBudget']

# One-hot encode categorical features; pass numeric features through unchanged
categorical_features = ['ProjectID', 'TaskStatus', 'Priority']
numeric_features = ['DurationDays', 'Budget', 'ActualCost']

preprocess = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('num', 'passthrough', numeric_features)
    ]
)

# Create pipeline with logistic regression
model = Pipeline([
    ('preprocess', preprocess),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Fit the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Within Budget', 'Over Budget']).plot()
plt.title('Confusion Matrix')
plt.show()
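
A single train-test split gives one estimate of performance. One option, not part of the original notebook, is stratified k-fold cross-validation, which reports a score per fold. The sketch below is an assumption-laden illustration: the `demo` frame is randomly generated stand-in data with a made-up link between duration and overruns, the feature set deliberately omits `ActualCost`, and `StandardScaler` is swapped in for the passthrough step.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in data; in the notebook, df would be used instead.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "DurationDays": rng.integers(1, 60, size=n),
    "Budget": rng.uniform(500, 10_000, size=n),
    "Priority": rng.choice(["Low", "Medium", "High", "Critical"], size=n),
})
# Made-up target for the demo: longer tasks overrun more often.
demo["OverBudget"] = (rng.random(n) < demo["DurationDays"] / 60).astype(int)

features = ["DurationDays", "Budget", "Priority"]
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Priority"]),
    ("num", StandardScaler(), ["DurationDays", "Budget"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Five stratified folds keep the class balance similar in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    pipeline, demo[features], demo["OverBudget"], cv=cv, scoring="accuracy"
)
print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", round(scores.mean(), 3))
```

Because the preprocessing lives inside the pipeline, it is refitted on each training fold, which keeps information from the held-out fold out of the encoder and scaler.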



## Conclusion

This notebook uses the synthetic project tasks dataset to demonstrate common data analysis tasks:

- Exploring distributions and relationships between key variables like budget, actual cost, and task status.
- Creating visualizations to communicate patterns and insights.
- Building a predictive model to identify tasks likely to exceed their budget.

Together, these steps show how to explore and analyze data, visualize results, and apply machine learning, which are valuable skills for Business Analyst, Program Manager, and Data Analyst roles.
