
# Business Analytics Project: Synthetic Program Data

This project showcases an end‑to‑end analytics workflow suitable for business analysts, program managers, and data analysts.

The dataset simulates information about company programs/projects, including budget, duration, team size, complexity, region, stakeholder satisfaction, return on investment (ROI), and a binary indicator for whether the program was deemed a success.

In this notebook we'll:

1. Explore the data and visualize key relationships.
2. Prepare features and build predictive models to estimate program success.
3. Discuss next steps for future refinement.

The dataset is provided in `synthetic_project_data.csv` located in the `data/` folder of this repository.


In [None]:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'data/synthetic_project_data.csv'
data = pd.read_csv(file_path)

# Display basic information
data.head()



## Dataset Description

| Column                     | Description                                             |
|----------------------------|---------------------------------------------------------|
| `Budget_Million`           | Total program budget in millions of currency units      |
| `Duration_Months`          | Program duration in months                              |
| `Team_Size`                | Number of people involved in the program                |
| `Complexity`               | Categorical indicator of complexity (Low, Medium, High) |
| `Region`                   | Geographic region (North, South, East, West)            |
| `Stakeholder_Satisfaction` | Stakeholder satisfaction rating on a scale of 1–5       |
| `Success`                  | Binary indicator (1 for success, 0 for failure)         |
| `ROI_Million`              | Return on investment in millions of currency units      |

This synthetic dataset is generated for illustrative purposes. The relationships between features and the success indicator have been modeled using a logistic function with added noise to reflect real‑world variability.
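To make the "logistic function with added noise" idea concrete, here is a minimal sketch of how such a success indicator could be simulated. It is not the script that produced `synthetic_project_data.csv`: the feature ranges, the coefficients, the noise level, and the name `demo` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 500

# Hypothetical feature ranges, chosen only for illustration
demo = pd.DataFrame({
    "Budget_Million": rng.uniform(1, 50, n),
    "Duration_Months": rng.integers(3, 37, n),
    "Team_Size": rng.integers(5, 101, n),
    "Stakeholder_Satisfaction": rng.integers(1, 6, n),
})

# Linear score with made-up coefficients, plus Gaussian noise
score = (
    0.8 * (demo["Stakeholder_Satisfaction"] - 3)
    - 0.03 * (demo["Duration_Months"] - 18)
    + rng.normal(0, 1, n)
)

# The logistic function maps the noisy score to a probability in (0, 1)
prob = 1 / (1 + np.exp(-score))

# Draw the binary outcome from that probability
demo["Success"] = rng.binomial(1, prob.to_numpy())

print(demo["Success"].value_counts())
```

Because the outcome is drawn from a probability rather than set by a hard threshold, two programs with identical features can still end up with different outcomes, which is the kind of variability the description above refers to.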


In [None]:

# Summary statistics
summary = data.describe(include='all')
print('Summary Statistics:')
print(summary)

# Histograms for numeric variables
numeric_cols = ['Budget_Million', 'Duration_Months', 'Team_Size', 'Stakeholder_Satisfaction', 'ROI_Million']
fig, axs = plt.subplots(len(numeric_cols), 1, figsize=(8, 16))

for i, col in enumerate(numeric_cols):
    sns.histplot(data[col], kde=True, ax=axs[i])
    axs[i].set_title(f'Distribution of {col}')

plt.tight_layout()
plt.show()

# Success distribution
plt.figure(figsize=(4, 3))
sns.countplot(x='Success', data=data)
plt.title('Success vs Failure Counts')
plt.show()



The histograms show how each key metric is distributed. Budgets and ROI values span a wide range, while stakeholder satisfaction scores are spread fairly evenly between 1 and 5. The count plot shows that the two outcome classes are balanced enough to allow for meaningful modeling.
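Class balance can also be checked numerically rather than read off the count plot. The sketch below uses a small made-up series in place of `data['Success']`; the values and the names `success`, `class_share`, and `majority_baseline` are illustrative only.

```python
import pandas as pd

# Toy stand-in for data['Success']; the values are made up
success = pd.Series([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], name="Success")

# Share of each class; values near 0.5 indicate a balanced target
class_share = success.value_counts(normalize=True)
print(class_share)

# Accuracy of always predicting the majority class (a floor for any model)
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

The majority-class share is a useful reference point: a model's accuracy only means something if it clearly exceeds this figure.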

The next step is to prepare the data for modeling. We'll start with a simple logistic regression model that uses only the numeric features, then move to a random forest classifier that also includes the categorical variables via one‑hot encoding.
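For readers new to one‑hot encoding, the sketch below shows what it does to the two categorical columns. It uses `pandas.get_dummies` on a tiny made-up frame purely for illustration; the notebook's pipeline uses scikit-learn's `OneHotEncoder` instead, which produces the same kind of indicator columns.

```python
import pandas as pd

# Tiny made-up sample of the two categorical columns
toy = pd.DataFrame({
    "Complexity": ["Low", "High", "Medium", "Low"],
    "Region": ["North", "South", "East", "North"],
})

# Each category becomes its own 0/1 indicator column
encoded = pd.get_dummies(toy, columns=["Complexity", "Region"], dtype=int)
print(encoded)
```

Every row has exactly one `1` among the `Complexity_*` columns and one among the `Region_*` columns, so the model sees categories as separate indicators rather than as an arbitrary numeric ordering.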


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler

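# Select numeric features for a baseline model.
# Caution: ROI_Million is an outcome measure. If it would only be known after a
# program finishes, using it as a predictor may leak information about Success.
X = data[['Budget_Million', 'Duration_Months', 'Team_Size', 'Stakeholder_Satisfaction', 'ROI_Million']]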
y = data['Success']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit logistic regression model
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)

# Predict and evaluate
preds = log_reg.predict(X_test_scaled)
acc = accuracy_score(y_test, preds)
print(f"Logistic Regression Accuracy: {acc:.2f}")
print(classification_report(y_test, preds))


In [None]:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Features and target
X_full = data.drop('Success', axis=1)
y_full = data['Success']

# Identify numeric and categorical columns
numeric_features = ['Budget_Million', 'Duration_Months', 'Team_Size', 'Stakeholder_Satisfaction', 'ROI_Million']
categorical_features = ['Complexity', 'Region']

# Preprocess: scale numeric and one‑hot encode categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
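        # handle_unknown='ignore' avoids an error if the test split contains a category unseen in training
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)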
    ]
)

# Define model
rf = RandomForestClassifier(n_estimators=200, random_state=42)

# Create the pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', rf)
                     ])

# Split data
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(X_full, y_full, test_size=0.3, random_state=42)

# Fit model
clf.fit(X_train_full, y_train_full)

# Predict and evaluate
rf_preds = clf.predict(X_test_full)
rf_acc = accuracy_score(y_test_full, rf_preds)
print(f"Random Forest Accuracy: {rf_acc:.2f}")
print(classification_report(y_test_full, rf_preds))

# Confusion Matrix
conf_mat = confusion_matrix(y_test_full, rf_preds)
print('Confusion Matrix:')
print(conf_mat)



## Conclusions and Next Steps

In this notebook we explored a synthetic dataset and built two predictive models:

- **Logistic Regression** using only numeric features achieved moderate performance. Scaling the numeric features matters for this model, and its results serve as a baseline.
- **Random Forest Classifier** using both numeric and categorical features (via one‑hot encoding) achieved higher accuracy and better overall classification metrics, likely because it can capture nonlinear relationships and interactions.
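A comparison based on a single train/test split can shift with the random seed, so one way to check the ranking of the two models is k-fold cross-validation. The sketch below is an illustration under assumptions: it uses `make_classification` to generate stand-in data rather than the project CSV, and the names `X_demo`, `y_demo`, and `cv_scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; in the notebook this would be the program features and Success
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=42)

models = {
    # Scaling lives inside the pipeline so it is re-fit on each training fold
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each model
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.2f} (std {scores.std():.2f})")
```

If the gap between the mean scores is small relative to their standard deviations, the difference between the models may not be meaningful.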

### Potential Enhancements

- **Hyperparameter Tuning**: Explore grid search or randomized search to find optimal model parameters.
- **Feature Engineering**: Create additional features, such as budget per team member or a program's duration relative to the typical duration at its complexity level, which might improve model performance.
- **Regression Task**: Predict continuous ROI or satisfaction scores using regression algorithms to provide more granular insights.
- **Dashboard Creation**: Build an interactive dashboard (e.g., using Plotly Dash or Streamlit) to allow stakeholders to explore the data and model predictions.
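Two of the enhancements listed above, feature engineering and hyperparameter tuning, can be sketched together. This is a minimal illustration, not a tuned setup: the data is random stand-in data (so the scores carry no meaning), and the column name `Budget_Per_Member` and the parameter grid are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)
n = 200

# Toy stand-in for the project data; values are random, not real results
toy = pd.DataFrame({
    "Budget_Million": rng.uniform(1, 50, n),
    "Team_Size": rng.integers(5, 101, n),
    "Duration_Months": rng.integers(3, 37, n),
    "Success": rng.integers(0, 2, n),
})

# Feature engineering: budget per team member (hypothetical new column)
toy["Budget_Per_Member"] = toy["Budget_Million"] / toy["Team_Size"]

X_toy = toy.drop(columns="Success")
y_toy = toy["Success"]

# Small, hypothetical grid; a real search would likely cover more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# Grid search with 3-fold cross-validation over the random forest settings
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
```

With the notebook's own pipeline, the same search could be run over `clf` by prefixing parameter names with the step name, for example `model__n_estimators`.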

This project builds in complexity, moving from basic descriptive analysis to predictive modeling. It can serve as a solid example in your portfolio for roles such as business analyst, program manager, or data analyst.

If you'd like to expand the project further or adapt it to specific industries or questions, feel free to modify the data generation process or analysis steps.
