### 👋 **SalesNexus: Business Understanding**

![SalesNexus Banner](bg.jpg)

---

## 🎯 Goal

Predict **daily sales** for each item in each store, enabling:

* ✅ **Dynamic Pricing**: Adjust prices based on demand and promotions
* ✅ **Inventory Optimization**: Reduce waste and prevent stockouts
* ✅ **Better Profitability**: Minimize surplus and maximize sales efficiency

---

## 🧐 Why This Matters

Modern retailers operate in highly competitive environments. Manual forecasting often fails to:

* Anticipate **seasonal fluctuations** (holidays, promotions)
* Respond to **regional differences** across stores
* Adjust to **external influences** (economic changes, events)

**Resulting Problems:**

* Overstocking ➔ Increased waste and holding costs
* Understocking ➔ Missed sales and dissatisfied customers
* Suboptimal pricing ➔ Reduced margins and competitive disadvantage

---

## 💡 Objective

Develop a robust sales forecasting system that predicts future sales per item and per store, and supports:

* Automated, data-driven pricing adjustments
* Smarter inventory planning
* Insights into sales drivers (trend, seasonality, promotions)

---

## 🎯 Stakeholders

* **Retail Chains**: Streamline supply chain and pricing strategies.
* **Pricing Teams**: Identify revenue optimization opportunities.
* **Store Operations**: Maintain optimal inventory levels.
* **Customers**: Enjoy better availability and pricing.

---

## 📊 Evaluation Metrics

* **RMSLE** (Root Mean Squared Logarithmic Error): Primary metric for this Kaggle-style prediction task.
* **Additional metrics**: MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error) for deeper error analysis.
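
For reference, RMSLE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(1+\hat{y}_i)-\log(1+y_i)\right)^2}$, where $y_i$ is the actual and $\hat{y}_i$ the predicted value. The sketch below computes all three metrics with NumPy. The function names and sample values are illustrative and not taken from the project code, and it assumes sales and predictions are non-negative.

```python
import numpy as np


def rmsle(y_true, y_pred) -> float:
    """Root Mean Squared Logarithmic Error; inputs are assumed to be >= 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero sales well defined
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def mae(y_true, y_pred) -> float:
    """Mean Absolute Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


# Illustrative values only (not real sales data)
actual = [0.0, 10.0, 100.0]
predicted = [0.0, 12.0, 90.0]

print("RMSLE:", rmsle(actual, predicted))
print("MAE:  ", mae(actual, predicted))
print("RMSE: ", rmse(actual, predicted))
```

Because RMSLE compares logarithms, it penalizes relative rather than absolute errors, which suits items whose daily sales range from zero to the thousands.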

---

## 🚀 Impact

* **Increase Profitability**: Minimize waste and surplus inventory.
* **Optimize Operations**: Reduce storage and logistics costs.
* **Better Experience**: Improve product availability and pricing precision for customers.

---



In [None]:
import os

In [None]:
%pwd

In [None]:
os.chdir('../')

In [None]:
%pwd

In [None]:
from dataclasses import dataclass
from pathlib import Path
from typing import Dict

@dataclass(frozen=True)
class DataAcquisitionConfig:
    """Config for downloading and accessing raw data files."""
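    root_dir: Path              # directory for data-acquisition artifacts
    source: str                 # Kaggle dataset URL
    dataset_name: str           # folder name created by the download
    local_dir: Path             # directory where the raw files reside
    data_files: Dict[str, str]  # logical dataset name -> filename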


In [None]:
from ml_service.constants import *
from ml_service.utils.main_utils import read_yaml, create_directories
from ml_service.logging.logger import logging
from pathlib import Path

logging.info(f"Current working directory: {Path.cwd()}")


In [None]:
class ConfigurationManager:
    def __init__(self, config_filepath: str):
        """Initialize the configuration manager.

        Args:
            config_filepath (str): Path to the main configuration file (YAML).
        """
        self.config = read_yaml(config_filepath)
        # logging.info(f"yaml file: {config_filepath} loaded successfully")
        create_directories([self.config.artifacts_root])

    def get_data_acquisition_config(self) -> DataAcquisitionConfig:
        """Get the configuration for data acquisition.

        Returns:
            DataAcquisitionConfig: Paths and source details for data acquisition.
        """
        config = self.config.data_acquisition
        create_directories([config.root_dir])

        return DataAcquisitionConfig(
            root_dir=Path(config.root_dir),
            source=config.source,
            dataset_name=config.dataset_name,
            local_dir=Path(config.local_dir),
            data_files=dict(config.data_files)  
        )
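
To make the expected shape of the configuration concrete, here is a minimal sketch that builds the same `DataAcquisitionConfig` from a plain dictionary instead of `read_yaml`. The key names mirror the attributes accessed above. Every value (directory names, URL, file names) is an assumption for illustration and may differ from the project's actual `config.yaml`. `build_data_acquisition_config` is a hypothetical helper.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class DataAcquisitionConfig:
    root_dir: Path
    source: str
    dataset_name: str
    local_dir: Path
    data_files: Dict[str, str]


# Hypothetical stand-in for the parsed YAML; all values are assumptions.
raw_config = {
    "artifacts_root": "artifacts",
    "data_acquisition": {
        "root_dir": "artifacts/data_acquisition",
        "source": "https://www.kaggle.com/competitions/store-sales-time-series-forecasting",
        "dataset_name": "store-sales-time-series-forecasting",
        "local_dir": "artifacts/data_acquisition",
        "data_files": {
            "train": "train.csv",
            "test": "test.csv",
            "oil": "oil.csv",
            "stores": "stores.csv",
            "transactions": "transactions.csv",
            "holidays_events": "holidays_events.csv",
        },
    },
}


def build_data_acquisition_config(raw: dict) -> DataAcquisitionConfig:
    """Convert the raw mapping into the typed, immutable config object."""
    section = raw["data_acquisition"]
    return DataAcquisitionConfig(
        root_dir=Path(section["root_dir"]),
        source=section["source"],
        dataset_name=section["dataset_name"],
        local_dir=Path(section["local_dir"]),
        data_files=dict(section["data_files"]),
    )


example_config = build_data_acquisition_config(raw_config)
print(example_config.local_dir / example_config.data_files["train"])
```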


### Data Acquisition 

In [None]:
# from ml_service.logging import logger
from ml_service.logging.logger import logging
from pathlib import Path

logging.info(f"Current working directory: {Path.cwd()}")


In [None]:
print(CONFIG_FILE_PATH)

In [16]:
import opendatasets as od


# Load configuration
config_manager = ConfigurationManager(CONFIG_FILE_PATH)
data_acquisition_config = config_manager.get_data_acquisition_config()

# Download dataset from Kaggle
od.download(data_acquisition_config.source, data_acquisition_config.root_dir)

[2025-11-29 16:25:40,559: INFO: main_utils: yaml file: config\config.yaml loaded successfully]
[2025-11-29 16:25:40,562: INFO: main_utils: created directory at: artifacts]
[2025-11-29 16:25:40,564: INFO: main_utils: created directory at: artifacts/data_acquisition]
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username:
Your Kaggle Key:
Downloading store-sales-time-series-forecasting.zip to artifacts\data_acquisition\store-sales-time-series-forecasting

100%|██████████| 21.4M/21.4M [00:02<00:00, 8.14MB/s]

Extracting archive artifacts\data_acquisition\store-sales-time-series-forecasting/store-sales-time-series-forecasting.zip to artifacts\data_acquisition\store-sales-time-series-forecasting


### Data Loading

In [17]:
import pandas as pd

# Check available files in the data directory
print("Files in data directory:", list(data_acquisition_config.local_dir.glob("*")))

# Extract the file names from the configuration
train_file = data_acquisition_config.data_files["train"]
test_file = data_acquisition_config.data_files["test"]
oil_file = data_acquisition_config.data_files["oil"]
stores_file = data_acquisition_config.data_files["stores"]
transactions_file = data_acquisition_config.data_files["transactions"]
holiday_events_file = data_acquisition_config.data_files["holidays_events"]



# Build full paths (assumes the CSV files sit directly in local_dir)
train_path = data_acquisition_config.local_dir / train_file
test_path = data_acquisition_config.local_dir / test_file
oil_path = data_acquisition_config.local_dir / oil_file
stores_path = data_acquisition_config.local_dir / stores_file
transactions_path = data_acquisition_config.local_dir / transactions_file
holidays_events_path = data_acquisition_config.local_dir / holiday_events_file

# Load the dataset
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
oil_df = pd.read_csv(oil_path)
stores_df = pd.read_csv(stores_path)
transactions_df = pd.read_csv(transactions_path)
holidays_events_df = pd.read_csv(holidays_events_path)


Files in data directory: [WindowsPath('artifacts/data_acquisition/store-sales-time-series-forecasting')]
Unexpected exception formatting exception. Falling back to standard exception


Traceback (most recent call last):
  File "c:\Users\rosha\anaconda3\envs\salesforecast\Lib\site-packages\IPython\core\interactiveshell.py", line 3699, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\rosha\AppData\Local\Temp\ipykernel_16288\2848325831.py", line 25, in <module>
    train_df = pd.read_csv(train_path)
               ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\rosha\anaconda3\envs\salesforecast\Lib\site-packages\pandas\io\parsers\readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\rosha\anaconda3\envs\salesforecast\Lib\site-packages\pandas\io\parsers\readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\rosha\anaconda3\envs\salesforecast\Lib\site-packages\pandas\io\parsers\readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self
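
The directory listing above shows that `local_dir` contains only the `store-sales-time-series-forecasting` subfolder, so the CSV files are most likely one level deeper than the paths built in this cell. A minimal sketch of a lookup that tolerates both layouts follows. `resolve_data_file` is a hypothetical helper, not part of `ml_service`, and it assumes `dataset_name` matches the name of the extracted folder.

```python
from pathlib import Path


def resolve_data_file(local_dir: Path, dataset_name: str, filename: str) -> Path:
    """Return the first existing location of `filename`.

    Checks the flat layout (local_dir/filename) first, then the nested
    layout (local_dir/dataset_name/filename).
    """
    candidates = [local_dir / filename, local_dir / dataset_name / filename]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    searched = ", ".join(str(c) for c in candidates)
    raise FileNotFoundError(f"{filename} not found; searched: {searched}")
```

With this helper, `train_path` could be obtained as `resolve_data_file(data_acquisition_config.local_dir, data_acquisition_config.dataset_name, train_file)`, and likewise for the other files.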

In [None]:
train_df.head()

In [None]:
test_df.head()

In [None]:
oil_df.head()

In [None]:
transactions_df.head()

In [None]:
holidays_events_df.head()

In [None]:
stores_df.head()

In [None]:
from pathlib import Path
from typing import Dict
import pandas as pd
import opendatasets as od
import shutil

class DataLoader:
    """
    Handles downloading, flattening, and loading raw data files from Kaggle.
    """
    def __init__(self, data_dir: Path, source: str, data_files: Dict[str, str], dataset_name: str) -> None:
        """
        Args:
            data_dir (Path): Directory where raw files reside.
            source (str): Kaggle dataset URL.
            data_files (dict): Mapping of dataset names to filenames.
            dataset_name (str): Name of the Kaggle dataset (folder name after download).
        """
        self.data_dir = data_dir
        self.source = source
        self.data_files = data_files
        self.dataset_name = dataset_name

    def download(self) -> None:
        """Check if files already exist. If not, download and flatten."""
        if self._all_files_exist():
            print(f"✅ All files already present in: {self.data_dir}. Skipping download.")
            return

        # Otherwise, proceed with downloading
        self.data_dir.mkdir(parents=True, exist_ok=True)
        print("🚀 Downloading dataset from Kaggle...")
        od.download(self.source, str(self.data_dir))
        print("✅ Download complete.")

        self._flatten_download()

    def _all_files_exist(self) -> bool:
        """Check if all required files already exist in the data_dir."""
        return all((self.data_dir / filename).exists() for filename in self.data_files.values())

    def _flatten_download(self) -> None:
        """Move files from Kaggle's subfolder into data_dir root."""
        kaggle_subdir = self.data_dir / self.dataset_name
        if kaggle_subdir.exists():
            for file in kaggle_subdir.iterdir():
                shutil.move(str(file), str(self.data_dir))
            kaggle_subdir.rmdir()
            print(f"📂 Flattened downloaded files into: {self.data_dir}")

    def load(self, name: str) -> pd.DataFrame:
        """Load a dataset by its configured name."""
        if name not in self.data_files:
            raise ValueError(f"Dataset '{name}' not found in configuration.")
        path = self.data_dir / self.data_files[name]
        if not path.exists():
            raise FileNotFoundError(f"File not found at {path}")
        print(f"📥 Loading: {path}")
        return pd.read_csv(path)

    def load_all(self) -> Dict[str, pd.DataFrame]:
        """Load all configured datasets as a dict of DataFrames."""
        return {name: self.load(name) for name in self.data_files.keys()}
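
The flattening step can be tried without a Kaggle download. The sketch below reproduces the same move-then-remove logic as a standalone function (`flatten_subdir` is a hypothetical name) and runs it on a temporary directory containing a dummy file.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_subdir(data_dir: Path, subdir_name: str) -> None:
    """Move every entry of data_dir/subdir_name up into data_dir,
    then remove the now-empty subfolder."""
    subdir = data_dir / subdir_name
    if not subdir.exists():
        return
    # Materialize the listing first so entries are not moved mid-iteration
    for entry in list(subdir.iterdir()):
        shutil.move(str(entry), str(data_dir))
    subdir.rmdir()


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    nested = data_dir / "store-sales-time-series-forecasting"
    nested.mkdir()
    (nested / "train.csv").write_text("id,sales\n0,0.0\n")  # dummy content

    flatten_subdir(data_dir, nested.name)
    flattened_ok = (data_dir / "train.csv").exists() and not nested.exists()

print("Flattened:", flattened_ok)
```

Note that `shutil.move` raises an error if a file with the same name already exists in the destination directory, so a partially flattened download may need manual cleanup before retrying.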


In [None]:
from pathlib import Path

# Assuming data_acquisition_config is already obtained from ConfigurationManager
loader = DataLoader(
    data_dir=Path(data_acquisition_config.local_dir),
    source=data_acquisition_config.source,
    data_files=data_acquisition_config.data_files,
    dataset_name=data_acquisition_config.dataset_name
)



In [None]:

# 1️⃣ Download + flatten files
loader.download()

# 2️⃣ Load individual files
train_df = loader.load("train")
test_df = loader.load("test")

# 3️⃣ Load all at once
all_data = loader.load_all()

# Example: see loaded train data
print(train_df.head())

In [None]:
train_df.head()