
# Business Analytics Portfolio Project

This notebook is part of a portfolio project aimed at roles such as **Business Analyst**, **Program Manager**, and **Data Analyst**. It uses a synthetic dataset of projects described by duration, budget, cost, team size, complexity, risk, and status. The goals are to explore the data, visualize key patterns, and build predictive models that identify factors associated with project success.

The dataset is completely synthetic and generated for demonstration purposes only.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline


In [None]:

# Load the synthetic dataset
# If running this notebook after cloning the repo, ensure that the CSV file is in the same directory or adjust the path accordingly.
df = pd.read_csv('synthetic_project_data.csv', parse_dates=['StartDate','EndDate','ActualEndDate'])

# Show the dataset shape and the first few rows
print("Dataset shape:", df.shape)
df.head()


In [None]:

# Check data types and missing values
df.info()  # prints its report directly and returns None, so no print() wrapper is needed
print("\nMissing values per column:\n", df.isnull().sum())

# Summary statistics for numeric columns
df.describe()


In [None]:

# Distribution of planned vs actual durations
plt.figure(figsize=(10,5))
sns.histplot(df['PlannedDurationDays'], kde=True, color='skyblue', label='Planned Duration', bins=20)
sns.histplot(df['ActualDurationDays'], kde=True, color='orange', label='Actual Duration', bins=20)
plt.title('Distribution of Planned vs Actual Durations')
plt.xlabel('Duration (days)')
plt.legend()
plt.show()

# Budget distribution
plt.figure(figsize=(8,4))
sns.histplot(df['Budget'], kde=True, color='green', bins=20)
plt.title('Distribution of Project Budgets')
plt.xlabel('Budget')
plt.show()

# Complexity counts
plt.figure(figsize=(6,4))
sns.countplot(x='Complexity', data=df, order=['Low','Medium','High'])
plt.title('Count of Projects by Complexity')
plt.show()

# Cost variance by risk
plt.figure(figsize=(8,4))
sns.boxplot(x='Risk', y='CostVariance', data=df, order=['Low','Medium','High'])
plt.title('Cost Variance by Risk Level')
plt.show()

# Correlation heatmap for numeric variables
numeric_cols = ['PlannedDurationDays', 'ActualDurationDays', 'Budget', 'ActualCost', 'TeamSize', 'ScheduleVarianceDays', 'CostVariance', 'Success']
plt.figure(figsize=(10,8))
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
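
The plots above look at one or two variables at a time. A small pivot table is one way to see how the success rate varies across complexity and risk together. The sketch below is only illustrative: it assumes `Success` is coded as 0/1 and uses a made-up stand-in frame (`toy`) with the notebook's column names rather than the real CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's df: same column names, made-up values.
rng = np.random.default_rng(0)
levels = ['Low', 'Medium', 'High']
toy = pd.DataFrame({
    'Complexity': rng.choice(levels, size=200),
    'Risk': rng.choice(levels, size=200),
    'Success': rng.integers(0, 2, size=200),  # assumed 0/1 coding
})

# Mean of a 0/1 column is the share of successful projects in each cell.
success_rate = (
    toy.pivot_table(index='Complexity', columns='Risk',
                    values='Success', aggfunc='mean')
       .reindex(index=levels, columns=levels)  # keep Low < Medium < High order
)
print(success_rate.round(2))
```

Replacing `toy` with `df` should give the same table for the real data, provided the column names and coding match these assumptions.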


In [None]:

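# Prepare data for modeling
# Features: numeric columns plus the categorical columns (one-hot encoded in the pipeline below)
# Note: the Actual* and variance columns are only known once a project has finished,
# so this model describes completed projects rather than forecasting new ones.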
X = df[['PlannedDurationDays','ActualDurationDays','Budget','ActualCost','TeamSize','Complexity','Risk','ScheduleVarianceDays','CostVariance']]
y = df['Success']

# Split the data, stratifying on the target so both sets keep the same success ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

# Identify categorical and numeric columns
categorical_cols = ['Complexity', 'Risk']
numeric_cols = ['PlannedDurationDays','ActualDurationDays','Budget','ActualCost','TeamSize','ScheduleVarianceDays','CostVariance']

# Preprocess: OneHotEncode categorical variables
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)

# Create pipeline with logistic regression model
clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(max_iter=1000))
])

# Fit the model
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Evaluate the model
acc = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {acc:.2f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


In [None]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Define features and target for regression
X_reg = df[['PlannedDurationDays','ActualDurationDays','Budget','TeamSize','Complexity','Risk','ScheduleVarianceDays']]
y_reg = df['ActualCost']

# Train-test split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.25, random_state=42)

# Preprocess: same approach as the classifier, minus the cost columns (ActualCost is now the target)
preprocessor_reg = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), ['Complexity', 'Risk']),
        ('num', 'passthrough', ['PlannedDurationDays','ActualDurationDays','Budget','TeamSize','ScheduleVarianceDays'])
    ]
)

# Pipeline with RandomForestRegressor
reg_model = Pipeline(steps=[
    ('preprocessor', preprocessor_reg),
    ('model', RandomForestRegressor(n_estimators=200, random_state=42))
])

# Fit
reg_model.fit(X_train_reg, y_train_reg)

# Predict
y_pred_reg = reg_model.predict(X_test_reg)

# Evaluate
mae = mean_absolute_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)
print(f"Random Forest Regression MAE: {mae:.2f}")
print(f"R^2 Score: {r2:.2f}")



## Conclusion

This analysis demonstrated how to explore a project dataset, visualize key patterns, and build predictive models.

**Key takeaways:**

- The synthetic dataset contained information about project durations, budgets, costs, team sizes, complexity levels, risk levels, and project outcomes.
- Exploratory analysis covered planned versus actual durations, the budget distribution, project counts by complexity level, cost variance across risk levels, and correlations among the numeric variables.
- A logistic regression model was built to predict project success and evaluated on a held-out 25% test set using accuracy, a classification report, and a confusion matrix. It can be improved further through feature engineering and more sophisticated algorithms.
- A random forest regression model was used to predict actual project costs and evaluated with mean absolute error and R^2, demonstrating how regression techniques can be applied to project budgeting.
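
As one possible refinement of the logistic regression step, the sketch below standardizes the numeric features (budget-scale columns can slow convergence when left unscaled) and replaces the single train/test split with stratified 5-fold cross-validation. The data frame `toy` and the rule that generates `Success` are made up for illustration; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with the notebook's column names and made-up values.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    'Budget': rng.uniform(10_000, 100_000, size=n),
    'TeamSize': rng.integers(2, 20, size=n),
    'Complexity': rng.choice(['Low', 'Medium', 'High'], size=n),
})
# Made-up rule: smaller budgets succeed more often, with added noise.
signal = -(toy['Budget'] - 55_000) / 20_000 + rng.normal(0, 1, size=n)
toy['Success'] = (signal > 0).astype(int)

pre = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), ['Complexity']),
    ('num', StandardScaler(), ['Budget', 'TeamSize']),  # scale instead of passthrough
])
clf = Pipeline([
    ('preprocessor', pre),
    ('model', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the success ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    clf, toy[['Budget', 'TeamSize', 'Complexity']], toy['Success'],
    cv=cv, scoring='accuracy',
)
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no information from the validation fold leaks into preprocessing.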

Feel free to build upon this project by:

- Exploring additional visualizations (e.g., time-series analysis by year).
- Trying other classification algorithms (e.g., Random Forest, XGBoost) and comparing performance.
- Performing hyperparameter tuning for improved model performance.
- Incorporating other synthetic or real-world datasets for a more comprehensive portfolio.
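
The classifier comparison and tuning ideas in the list above could start from something like the following sketch. It runs a small grid search over a `RandomForestClassifier` on data from `make_classification`, which stands in for the project features here. The grid values are arbitrary illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; swap in the notebook's X and y (after encoding) to reuse this.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Arbitrary small grid, chosen only to keep the example fast.
param_grid = {'n_estimators': [50, 100], 'max_depth': [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid, cv=3, scoring='accuracy',
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
# The test set is used only once, after the grid search has picked a model.
print(f"Held-out accuracy: {search.score(X_te, y_te):.2f}")
```

To compare against the logistic regression model, the same split and scoring metric should be used for both so the numbers are directly comparable.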

This notebook and dataset serve as a solid foundation for demonstrating analytical and modeling skills relevant to business analysis, program management, and data analysis roles.
