# Exploratory Data Analysis and Predictive Modeling

This notebook performs exploratory data analysis (EDA) on the synthetic business dataset and builds a predictive model to forecast profit based on several features.

## Objectives

1. Load and inspect the dataset
2. Visualize distributions and relationships
3. Prepare features for modeling (one-hot encode the categorical columns)
4. Train a linear regression model to predict profit
5. Evaluate model performance

Let's begin!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Load dataset
df = pd.read_csv('synthetic_business_data.csv')

# Display first few rows
df.head()

In [None]:
# Summary statistics
df.describe(include='all')

In [None]:
# Histograms for numerical features
num_cols = ['Marketing_Spend', 'Price', 'Quantity_Sold', 'Revenue', 'Cost', 'Profit', 'Customer_Satisfaction']
df[num_cols].hist(figsize=(15, 10), bins=20)
plt.suptitle('Distribution of Numeric Features')
plt.show()

# Boxplot of Profit by Region and Product Category
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='Region', y='Profit', hue='Product_Category')
plt.title('Profit by Region and Product Category')
plt.show()

In [None]:
# Correlation matrix
plt.figure(figsize=(10, 8))
correlation = df[['Marketing_Spend','Price','Quantity_Sold','Revenue','Cost','Profit','Customer_Satisfaction']].corr()
sns.heatmap(correlation, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

In [None]:
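# Prepare data for modeling
# Revenue and Cost are excluded along with the target, presumably because
# Profit is derived from them and keeping them would leak the answer.
X = df.drop(columns=['Profit', 'Revenue', 'Cost'])
y = df['Profit']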

# Identify categorical and numeric columns
cat_features = ['Region', 'Product_Category']
num_features = ['Month', 'Marketing_Spend', 'Price', 'Quantity_Sold', 'Customer_Satisfaction']

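# Preprocessing pipeline: one-hot encode categoricals, pass numerics through unchanged.
# Columns not listed in cat_features or num_features are dropped (remainder='drop' is the default).
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
        ('num', 'passthrough', num_features)
    ])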

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define and train model
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
# Take the square root explicitly; the `squared` argument is deprecated in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

rmse, r2

## Model Evaluation

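The Root Mean Squared Error (RMSE) and R-squared values summarize how well the model predicts profit on the held-out test set. RMSE is expressed in the same units as `Profit` and reflects the typical size of a prediction error, so lower is better. R-squared is the proportion of variance in `Profit` that the model explains; values closer to 1 indicate a better fit, and a value near 0 means the model does little better than always predicting the mean.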
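
To make the two metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The helper `rmse_and_r2` and the toy numbers are illustrative assumptions, not part of the dataset or of scikit-learn.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R-squared from their definitions (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    # RMSE: square root of the mean squared residual
    rmse = np.sqrt(np.mean(residuals ** 2))

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, r2

# Toy values, invented for illustration only
toy_true = [100.0, 200.0, 300.0, 400.0]
toy_pred = [110.0, 190.0, 310.0, 390.0]

toy_rmse, toy_r2 = rmse_and_r2(toy_true, toy_pred)
print(toy_rmse, toy_r2)
```

Every toy prediction is off by 10, so the RMSE is 10.0, and the R-squared is about 0.992 because those errors are small relative to the spread of the true values.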

Feel free to explore different algorithms, such as Random Forest or Gradient Boosting, and experiment with hyperparameters to improve performance.
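
As one possible starting point, the sketch below swaps `LinearRegression` for `RandomForestRegressor` inside the same kind of pipeline. So that it runs on its own, it generates a small stand-in DataFrame: the column names match this notebook, but the values, the category labels, and the profit rule are invented assumptions. In practice you would reuse `X_train`, `X_test`, `y_train`, and `y_test` from the cell above instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data; replace with the real df when running the notebook
rng = np.random.default_rng(42)
n = 500
demo = pd.DataFrame({
    'Region': rng.choice(['North', 'South', 'East', 'West'], size=n),
    'Product_Category': rng.choice(['A', 'B', 'C'], size=n),
    'Month': rng.integers(1, 13, size=n),
    'Marketing_Spend': rng.uniform(1000, 10000, size=n),
    'Price': rng.uniform(10, 100, size=n),
    'Quantity_Sold': rng.integers(10, 500, size=n),
    'Customer_Satisfaction': rng.uniform(1, 5, size=n),
})
# Invented profit rule, used only so the example has a target to learn
demo['Profit'] = (0.3 * demo['Price'] * demo['Quantity_Sold']
                  - 0.5 * demo['Marketing_Spend']
                  + rng.normal(0, 200, size=n))

cat_features = ['Region', 'Product_Category']
num_features = ['Month', 'Marketing_Spend', 'Price', 'Quantity_Sold', 'Customer_Satisfaction']

X_demo = demo[cat_features + num_features]
y_demo = demo['Profit']
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Same preprocessing as the linear model; only the final step changes
rf_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_features),
        ('num', 'passthrough', num_features),
    ])),
    ('regressor', RandomForestRegressor(n_estimators=200, random_state=42)),
])
rf_model.fit(X_tr, y_tr)

rf_pred = rf_model.predict(X_te)
rf_rmse = np.sqrt(mean_squared_error(y_te, rf_pred))
rf_r2 = r2_score(y_te, rf_pred)
print(rf_rmse, rf_r2)
```

Because only the `'regressor'` step differs, the resulting scores can be compared directly with the linear baseline. `n_estimators=200` is an arbitrary choice here; tuning it along with settings such as `max_depth` is where the hyperparameter experiments would begin.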