# Business Analytics Synthetic Project

This project provides a synthetic dataset simulating business projects and their outcomes. The goal is to analyze project characteristics and predict project success. The repository contains a dataset (`business_analytics_dataset.csv`) and a Jupyter notebook for exploratory data analysis (EDA) and predictive modeling.

## Dataset Overview

The dataset includes 500 synthetic project entries with the following columns:

- **Project_ID**: Unique identifier for each project.
- **Start_Date**: Project start date.
- **End_Date**: Project end date.
- **Duration_Days**: Duration of the project in days.
- **Team_Size**: Number of people assigned to the project.
- **Budget_USD**: Budget allocated to the project (USD).
- **Complexity**: Categorical feature indicating project complexity (`Low`, `Medium`, `High`).
- **Client_Satisfaction**: Satisfaction score from 1 to 10.
- **Project_Success**: Target variable indicating whether the project was successful (1) or not (0).
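
The column descriptions above imply a few sanity checks that can be run before any analysis. The following is a minimal sketch on a hand-made three-row frame; the values are invented for illustration and do not come from `business_analytics_dataset.csv`. The helper `check_schema` is hypothetical, and the check that `Duration_Days` equals `End_Date - Start_Date` is an assumption that the descriptions suggest but do not state.

```python
import pandas as pd

# Hypothetical sample rows that follow the documented schema; all values are invented.
sample = pd.DataFrame({
    "Project_ID": [1, 2, 3],
    "Start_Date": pd.to_datetime(["2023-01-01", "2023-02-15", "2023-03-10"]),
    "End_Date": pd.to_datetime(["2023-01-31", "2023-04-16", "2023-03-25"]),
    "Duration_Days": [30, 60, 15],
    "Team_Size": [5, 12, 3],
    "Budget_USD": [50000.0, 250000.0, 12000.0],
    "Complexity": ["Low", "High", "Medium"],
    "Client_Satisfaction": [8, 6, 9],
    "Project_Success": [1, 0, 1],
})


def check_schema(df: pd.DataFrame) -> dict:
    """Return named boolean checks derived from the column descriptions."""
    # Assumption: duration is the calendar difference between end and start dates.
    computed_days = (df["End_Date"] - df["Start_Date"]).dt.days
    return {
        "unique_ids": bool(df["Project_ID"].is_unique),
        "duration_matches_dates": bool((computed_days == df["Duration_Days"]).all()),
        "complexity_levels_valid": bool(df["Complexity"].isin(["Low", "Medium", "High"]).all()),
        "satisfaction_in_range": bool(df["Client_Satisfaction"].between(1, 10).all()),
        "target_is_binary": bool(df["Project_Success"].isin([0, 1]).all()),
    }


checks = check_schema(sample)
print(checks)
```

Running the same function on the loaded `df` would flag any rows that deviate from the documented schema before they reach the models.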

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Load dataset
df = pd.read_csv('business_analytics_dataset.csv', parse_dates=['Start_Date', 'End_Date'])

# Display first few rows
df.head()

## Exploratory Data Analysis

We start by exploring the dataset to understand distributions and relationships between variables.

In [None]:
# Basic descriptive statistics
print(df.describe(include='all'))

In [None]:
# Histogram of project durations
plt.figure(figsize=(6,4))
sns.histplot(df['Duration_Days'], bins=20, kde=False)
plt.title('Distribution of Project Durations')
plt.xlabel('Duration (days)')
plt.ylabel('Frequency')
plt.show()

In [None]:
# Relationship between complexity and project success
plt.figure(figsize=(6,4))
sns.countplot(x='Complexity', hue='Project_Success', data=df)
plt.title('Complexity vs Project Success')
plt.xlabel('Complexity')
plt.ylabel('Count')
plt.show()
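
Raw counts can hide differences in success rate when the complexity groups differ in size. As a complement to the count plot, this sketch computes the share of successful projects per complexity level with `groupby`. It uses a small invented frame (`toy`) in place of `df`, so the numbers are illustrative only.

```python
import pandas as pd

# Invented rows standing in for the real dataset.
toy = pd.DataFrame({
    "Complexity": ["Low", "Low", "Low", "Low", "Medium", "Medium",
                   "High", "High", "High", "High"],
    "Project_Success": [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 target is the success rate; count gives the group size.
summary = (
    toy.groupby("Complexity")["Project_Success"]
    .agg(n_projects="count", success_rate="mean")
    .reindex(["Low", "Medium", "High"])  # order levels by severity, not alphabetically
)
print(summary)
```

Applied to the real data, the same expression with `df` in place of `toy` gives a per-level success rate to read alongside the plot.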

In [None]:
# Correlation heatmap for numerical features
corr = df[['Duration_Days','Team_Size','Budget_USD','Client_Satisfaction','Project_Success']].corr()
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

## Predictive Modeling

Next, we build predictive models to classify project success based on project features.

In [None]:
# Prepare features and target
X = df[['Duration_Days','Team_Size','Budget_USD','Complexity','Client_Satisfaction']]
y = df['Project_Success']

# One-hot encode categorical features
categorical_features = ['Complexity']
numerical_features = ['Duration_Days','Team_Size','Budget_USD','Client_Satisfaction']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features),
        ('num', 'passthrough', numerical_features)
    ]
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Define models
log_reg_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression(max_iter=1000))])
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# Fit models
log_reg_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train, y_train)

# Evaluate models
for name, model in [('Logistic Regression', log_reg_pipeline), ('Random Forest', rf_pipeline)]:
    y_pred = model.predict(X_test)
    print(f"\n{name} Accuracy: {accuracy_score(y_test, y_pred):.2f}")
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
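
To make the printed confusion matrix easier to read, the sketch below unpacks one into its four cells and derives precision and recall by hand. The label vectors are invented for illustration and are not model output from this notebook.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented true labels and predictions (1 = successful project, 0 = not).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels sorted as [0, 1], rows are true classes and columns are
# predicted classes, so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the projects predicted successful, how many were
recall = tp / (tp + fn)     # of the truly successful projects, how many were found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

These hand-computed values correspond to the class `1` row of `classification_report`.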

## Conclusion

In this notebook, we analyzed a synthetic dataset of business projects. We conducted exploratory data analysis to understand the distributions of project features and their relationships to project success. We then built logistic regression and random forest models to predict project success, demonstrating a basic workflow for modeling classification problems in business analytics. You can extend this analysis with more sophisticated models, cross-validation, and hyperparameter tuning.
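
As one possible starting point for the cross-validation and hyperparameter tuning mentioned above, here is a minimal sketch using `GridSearchCV` with stratified 5-fold splits. It runs on randomly generated data with the same column names, and the rule that generates `Project_Success` is invented purely so the example is self-contained. Two choices here are assumptions rather than part of the notebook: the numerical features are standardized with `StandardScaler` instead of passed through, and the grid over `C` is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in for the real dataset (not the CSV contents).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Duration_Days": rng.integers(10, 365, size=n),
    "Team_Size": rng.integers(2, 30, size=n),
    "Budget_USD": rng.uniform(1e4, 1e6, size=n),
    "Complexity": rng.choice(["Low", "Medium", "High"], size=n),
    "Client_Satisfaction": rng.integers(1, 11, size=n),
})
# Invented labeling rule: noisy threshold on client satisfaction.
noise = rng.normal(0, 1.5, size=n)
toy["Project_Success"] = (toy["Client_Satisfaction"] + noise > 5.5).astype(int)

numerical = ["Duration_Days", "Team_Size", "Budget_USD", "Client_Satisfaction"]
preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(drop="first"), ["Complexity"]),
    ("num", StandardScaler(), numerical),  # assumption: scale instead of passthrough
])
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    pipe,
    param_grid={"classifier__C": [0.01, 0.1, 1.0, 10.0]},
    cv=cv,
    scoring="accuracy",
)
search.fit(toy.drop(columns="Project_Success"), toy["Project_Success"])

print("Best parameters:", search.best_params_)
print("Mean CV accuracy:", round(search.best_score_, 3))
```

Because the preprocessing sits inside the pipeline, the encoder and scaler are refit on each training fold, which avoids leaking information from the validation fold.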