
# Sales Data Analysis and Prediction

This Jupyter notebook explores a synthetic sales dataset generated for practicing data analysis and predictive modeling skills. The dataset simulates sales transactions across various product categories, regions, customer demographics, and time periods. The analysis includes exploratory data analysis (EDA), visualizations, and building predictive models for both regression and classification tasks.

## Objectives

- Understand the structure of the synthetic sales dataset.
- Perform exploratory data analysis to uncover insights about sales patterns, customer demographics, and regional performance.
- Visualize relationships between variables using charts and plots.
- Build a regression model to predict **Total_Sales** based on available features.
- Build a classification model to predict whether **Sales_Quantity** is above the median.
- Evaluate model performance using appropriate metrics.



In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, classification_report
from sklearn.linear_model import LinearRegression, LogisticRegression

# Visualization settings
sns.set_theme(style="whitegrid")  # set_theme is the preferred name; sns.set is an alias

# Load dataset
file_path = 'synthetic_sales_data.csv'
data = pd.read_csv(file_path)

print(f"Dataset shape: {data.shape}")
data.head()


In [None]:

# Summary statistics
data.describe(include='all')


In [None]:

# Distribution of Total_Sales
plt.figure(figsize=(8,4))
sns.histplot(data['Total_Sales'], bins=30, kde=True)
plt.title('Distribution of Total Sales')
plt.xlabel('Total Sales')
plt.ylabel('Frequency')
plt.show()

# Sales quantity by product category
plt.figure(figsize=(10,4))
sns.boxplot(x='Product_Category', y='Sales_Quantity', data=data)
plt.title('Sales Quantity by Product Category')
plt.xticks(rotation=45)
plt.show()

# Correlation heatmap for numerical features
numerical_cols = ['Customer_Age','Sales_Quantity','Unit_Price','Discount','Total_Sales','Advertising_Spend','Cost_per_unit','Profit','Profit_Margin']
plt.figure(figsize=(10,8))
correlation_matrix = data[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Heatmap')
plt.show()
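
A heatmap shows every pairwise correlation at once, which can make it hard to read off how each feature relates to a single target. As a rough sketch, the same correlation matrix can be reduced to one sorted column for **Total_Sales**. The toy data below is an assumption made so the example runs on its own: the values are random and `Total_Sales` is simply defined as quantity times price, which may not match how the real dataset was generated.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sales data; values and the Total_Sales formula are assumptions
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Sales_Quantity": rng.integers(1, 20, size=200).astype(float),
    "Unit_Price": rng.uniform(5, 100, size=200),
})
toy["Total_Sales"] = toy["Sales_Quantity"] * toy["Unit_Price"]

# Keep only the target's column of the correlation matrix,
# drop the trivial self-correlation, and sort by absolute strength
target_corr = (
    toy.corr()["Total_Sales"]
    .drop("Total_Sales")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

On the real data, replacing `toy` with `data[numerical_cols]` should give the equivalent ranking for the columns used in the heatmap.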


In [None]:

# Prepare data for regression
X = data.drop(['Total_Sales', 'Date', 'Profit', 'Profit_Margin'], axis=1)
y = data['Total_Sales']

# Identify numeric and categorical columns
numeric_features = X.select_dtypes(include=['int64','float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Preprocess: One-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Build pipeline
reg_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LinearRegression())
])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
reg_model.fit(X_train, y_train)

# Predict
y_pred = reg_model.predict(X_test)

# Evaluate
# RMSE is the square root of MSE; the `squared=False` argument was removed in scikit-learn 1.6
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Regression RMSE: {rmse:.2f}")
print(f"Regression R^2 Score: {r2:.2f}")
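
RMSE is the square root of the mean squared error, so it can be computed with `np.sqrt` without relying on any version-specific keyword argument. A minimal sketch on made-up numbers (the arrays below are illustrative and are not taken from the dataset):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only; not drawn from synthetic_sales_data.csv
y_true_demo = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# Squared errors are 0.25, 0, 2.25 and 1, so their mean is 0.875
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)
rmse_demo = np.sqrt(mse_demo)

print(f"MSE:  {mse_demo:.4f}")
print(f"RMSE: {rmse_demo:.4f}")
```

Because RMSE is expressed in the same units as **Total_Sales**, it is usually easier to interpret than the raw MSE.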


In [None]:

# Create binary target: 1 if Sales_Quantity above median, else 0
median_sales_qty = data['Sales_Quantity'].median()
data['High_Sales'] = (data['Sales_Quantity'] > median_sales_qty).astype(int)

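# Drop Sales_Quantity too: High_Sales is derived from it, so keeping it would leak the target
X_cls = data.drop(['High_Sales','Sales_Quantity','Total_Sales','Date','Profit','Profit_Margin'], axis=1)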
y_cls = data['High_Sales']

numeric_features_cls = X_cls.select_dtypes(include=['int64','float64']).columns
categorical_features_cls = X_cls.select_dtypes(include=['object']).columns

preprocessor_cls = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features_cls),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features_cls)
    ]
)

cls_model = Pipeline(steps=[
    ('preprocessor', preprocessor_cls),
    ('model', LogisticRegression(max_iter=1000))
])

# Train-test split
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X_cls, y_cls, test_size=0.2, random_state=42)

cls_model.fit(X_train_cls, y_train_cls)

# Predict
y_pred_cls = cls_model.predict(X_test_cls)

# Evaluate
accuracy = accuracy_score(y_test_cls, y_pred_cls)
report = classification_report(y_test_cls, y_pred_cls, target_names=['Low Sales','High Sales'])

print(f"Classification Accuracy: {accuracy:.2f}\n")
print("Classification Report:\n", report)
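
Because `High_Sales` is computed directly from `Sales_Quantity`, any feature set that contains `Sales_Quantity` lets a classifier recover the label almost trivially, and the reported accuracy then says little about predictive skill. The sketch below illustrates the effect on toy data. Only the `Sales_Quantity` and `High_Sales` names mirror the notebook; the `Noise` column, the random values, and the helper `holdout_accuracy` are hypothetical, and a one-split decision tree is used here in place of logistic regression purely to keep the demonstration small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data: Noise is a hypothetical feature unrelated to the label
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Sales_Quantity": rng.integers(1, 51, size=500),
    "Noise": rng.normal(size=500),
})
toy["High_Sales"] = (toy["Sales_Quantity"] > toy["Sales_Quantity"].median()).astype(int)

def holdout_accuracy(feature_cols):
    """Fit a one-split decision tree on the given columns and return test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy[feature_cols], toy["High_Sales"], test_size=0.25, random_state=42
    )
    stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
    return stump.score(X_te, y_te)

leaky_acc = holdout_accuracy(["Sales_Quantity", "Noise"])  # label source included
clean_acc = holdout_accuracy(["Noise"])                    # label source excluded
print(f"With Sales_Quantity:    {leaky_acc:.2f}")
print(f"Without Sales_Quantity: {clean_acc:.2f}")
```

A large gap between the two scores is a sign that the first model is reading the answer from its inputs rather than learning from the other features.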



## Conclusion

In this notebook, we explored a synthetic sales dataset representing transactions across products, regions, and customer demographics. Exploratory analysis revealed distributions and relationships among features, while correlation analysis highlighted associations between numeric variables. We built a linear regression model to predict Total Sales and achieved a moderate coefficient of determination (R²) and reasonable RMSE. We also constructed a logistic regression classifier to predict whether sales quantity is above the median, evaluating its performance via accuracy and classification report.

Future improvements could include testing additional models (e.g., RandomForest or Gradient Boosting), tuning hyperparameters, and incorporating additional derived features or time-series components. This project serves as a starting point for showcasing data analysis and modeling skills for business and data analyst roles.
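
As one possible starting point for the first of these improvements, a tree-based model can be swapped into the same pipeline structure by changing only the final step. The sketch below uses `RandomForestRegressor` on toy data so that it runs without the CSV file. The random values, the category labels, and the assumption that `Total_Sales` equals quantity times price are all illustrative, and the hyperparameters are untuned defaults rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the sales data; values and the Total_Sales formula are assumptions
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Product_Category": rng.choice(["A", "B", "C"], size=n),
    "Sales_Quantity": rng.integers(1, 20, size=n),
    "Unit_Price": rng.uniform(5, 100, size=n),
})
toy["Total_Sales"] = toy["Sales_Quantity"] * toy["Unit_Price"]

X_toy = toy.drop(columns="Total_Sales")
y_toy = toy["Total_Sales"]

# Same preprocessing pattern as the linear model: pass numbers through, one-hot encode categories
rf_preprocessor = ColumnTransformer(transformers=[
    ("num", "passthrough", ["Sales_Quantity", "Unit_Price"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Product_Category"]),
])
rf_model = Pipeline(steps=[
    ("preprocessor", rf_preprocessor),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
rf_model.fit(X_tr, y_tr)
rf_r2 = rf_model.score(X_te, y_te)  # Pipeline.score delegates to the regressor's R^2
print(f"Random forest R^2 on toy data: {rf_r2:.2f}")
```

Keeping preprocessing inside the pipeline means the comparison with the linear model changes only the estimator, so any difference in test scores can be attributed to the model rather than to the data preparation.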
