# Sales Prediction Analytics Project

This notebook provides a full walkthrough of a synthetic retail sales dataset, including exploratory data analysis (EDA), data visualization, and predictive modeling. The project is designed to demonstrate business analytics and data science skills for roles such as **business analyst**, **program manager**, and **data analyst**.

The synthetic dataset contains 1,000 records of retail transactions with features like product categories, quantities, prices, discounts, regions, customer demographics, and revenue/profit. We engineer a binary target variable (`high_profit`) indicating whether a transaction's profit is above the median.
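The notebook reads `high_profit` directly from the CSV, so the derivation itself is not shown. As a minimal sketch of how such a target could be built, assuming the data has a numeric `profit` column (the toy values and the helper name `add_high_profit` are hypothetical, not taken from the dataset):

```python
import pandas as pd

def add_high_profit(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a binary high_profit column (1 if profit > median)."""
    out = df.copy()
    median_profit = out["profit"].median()
    out["high_profit"] = (out["profit"] > median_profit).astype(int)
    return out

# Toy stand-in for the real transactions (hypothetical values)
toy = pd.DataFrame({"profit": [5.0, 12.5, 8.0, 30.0, 1.5, 20.0]})
labeled = add_high_profit(toy)
print(labeled)
```

Because the comparison is a strict `>`, transactions whose profit equals the median are labeled 0. Splitting at the median keeps the two classes roughly balanced.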



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Set plotting style (sns.set_theme is the current name for the older sns.set alias; requires seaborn >= 0.11)
sns.set_theme(style='whitegrid', palette='muted')

# Load dataset
data = pd.read_csv('synthetic_sales_data.csv')

# Display first few rows
data.head()


## Exploratory Data Analysis

We begin by exploring the dataset's basic structure and summary statistics. We inspect distributions of numerical variables like **revenue** and **profit**, and evaluate categorical distributions such as **product_category** and **region**. Visualizations help us identify trends and potential relationships in the data.
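Beyond simple counts, one way to look for relationships is to summarize a numerical variable within each category. The sketch below assumes columns named `product_category` and `profit`, as in the dataset description; the category labels and values are hypothetical stand-ins.

```python
import pandas as pd

# Toy stand-in for the real transactions (hypothetical values)
toy = pd.DataFrame({
    "product_category": ["Electronics", "Furniture", "Electronics",
                         "Clothing", "Furniture", "Clothing"],
    "profit": [40.0, 10.0, 60.0, 5.0, 30.0, 15.0],
})

# Count, mean, and median profit per category, sorted by mean profit
profit_by_category = (
    toy.groupby("product_category")["profit"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(profit_by_category)
```

Applied to the real `data` frame, the same pattern should also work for `region` or `customer_age_group` as the grouping column.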



In [None]:
# Summary statistics
print(data.describe(include='all'))

# Plot distribution of revenue and profit
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].hist(data['revenue'], bins=30, color='skyblue', edgecolor='black')
axes[0].set_title('Revenue Distribution')
axes[0].set_xlabel('Revenue')
axes[0].set_ylabel('Frequency')

axes[1].hist(data['profit'], bins=30, color='salmon', edgecolor='black')
axes[1].set_title('Profit Distribution')
axes[1].set_xlabel('Profit')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

# Bar plot for product categories
plt.figure(figsize=(8, 4))
data['product_category'].value_counts().plot(kind='bar', color='purple')
plt.title('Transaction Counts by Product Category')
plt.xlabel('Product Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# Bar plot for regions
plt.figure(figsize=(6, 4))
data['region'].value_counts().plot(kind='bar', color='green')
plt.title('Transaction Counts by Region')
plt.xlabel('Region')
plt.ylabel('Count')
plt.show()


## Feature Engineering & Preprocessing

To build a predictive model, we'll prepare our dataset by:

1. Selecting relevant features and the target variable (`high_profit`).
2. Splitting the data into training and testing sets.
3. Applying one-hot encoding to categorical variables.
4. Building pipelines for machine learning models.

We will evaluate two models:

- **Logistic Regression**: A baseline linear classifier.
- **Random Forest Classifier**: An ensemble model that can capture non-linear relationships.



In [None]:
# Select features and target
features = ['region', 'product_category', 'product_subcategory', 'quantity', 'unit_price', 'discount',
            'shipping_cost', 'customer_age_group', 'payment_method', 'returned']
target = 'high_profit'

X = data[features]
y = data[target]

# Identify categorical and numerical columns
categorical_cols = ['region', 'product_category', 'product_subcategory', 'customer_age_group', 'payment_method']
numerical_cols = ['quantity', 'unit_price', 'discount', 'shipping_cost', 'returned']

# Preprocess: one-hot encode categorical variables and pass through numerical variables unchanged
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ], remainder='passthrough'
)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Define models
log_reg = LogisticRegression(max_iter=1000)
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Create pipelines
log_reg_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', log_reg)])
rf_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', rf_clf)])

# Train models
log_reg_pipeline.fit(X_train, y_train)
rf_pipeline.fit(X_train, y_train)

# Evaluate models
def evaluate_model(model, X_t, y_t, name):
    """Print a classification report and confusion matrix for a fitted model."""
    y_pred = model.predict(X_t)
    print(f"\n=== {name} ===")
    print(classification_report(y_t, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_t, y_pred))

# Evaluation on test set
evaluate_model(log_reg_pipeline, X_test, y_test, 'Logistic Regression')
evaluate_model(rf_pipeline, X_test, y_test, 'Random Forest Classifier')


## Conclusion

In this project, we created a **synthetic retail sales dataset** and performed exploratory data analysis to uncover distribution patterns across revenue, profit, product categories, and regions. We then engineered a binary target variable (`high_profit`) and trained two classification models—**Logistic Regression** and **Random Forest**—to predict high-profit transactions.

The Random Forest model often scores higher than Logistic Regression because it can capture non-linear relationships and interactions between features, but this is not guaranteed, so compare the two classification reports above before drawing conclusions. To improve performance, you can experiment with hyperparameter tuning, cross-validation, and other modeling techniques (e.g., gradient boosting).
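As one possible starting point for cross-validation, the sketch below compares the two model types with 5-fold stratified cross-validation. It uses `make_classification` to generate hypothetical stand-in data, so the scores it prints say nothing about the sales dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the encoded sales features (hypothetical data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

In the notebook itself, the same call should accept `log_reg_pipeline` or `rf_pipeline` together with `X` and `y`. Passing the whole pipeline means the one-hot encoder is refit inside each fold, which avoids leaking information from the held-out fold.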

This project demonstrates essential steps in a data analytics workflow: data generation, EDA, visualization, feature engineering, model building, and evaluation. Feel free to expand upon this foundation by exploring additional insights (e.g., segmenting by customer age group, analyzing return rates) or deploying the model in a production environment.
