To get started with demand forecasting, click [here](https://colab.research.google.com/github/saschaschworm/big-data-and-data-science/blob/master/labs/demand-forecasting.ipynb) to open the Jupyter Notebook in your Google Colab account. This notebook contains the exercises and code you can work through directly in Colab. If you’d like to check your work or explore the completed solution, click [here](https://colab.research.google.com/github/saschaschworm/big-data-and-data-science/blob/master/labs/demand-forecasting-solution.ipynb) to import the solution notebook into Google Colab. Both resources are designed to help you deepen your understanding of demand forecasting and apply data science techniques hands-on.

## Business Understanding

FreshMart, a local supermarket, struggles with managing the inventory of freshly prepared salads, which must be sold the same day they are delivered due to a one-day shelf life. The current ordering method, based on demand from exactly seven days ago plus a 10 % buffer, often results in either overstocking, leading to waste, or understocking, causing missed sales. To address these issues, FreshMart has initiated a data-driven project to improve inventory management.
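
The current ordering rule is simple enough to write down in a few lines. The sketch below is only an illustration of that rule, not FreshMart's actual system: the function name `baseline_order_quantity`, the sample demand values, and the decision to round up to whole salads are all assumptions.

```python
def baseline_order_quantity(demand_seven_days_ago: int, buffer_percent: int = 10) -> int:
    """Return the demand from seven days ago plus a percentage buffer."""
    # Integer arithmetic avoids floating-point surprises. Rounding up to whole
    # units is an assumption; the rule as described does not specify rounding.
    numerator = demand_seven_days_ago * (100 + buffer_percent)
    return -(-numerator // 100)  # ceiling division


# Hypothetical demand values observed seven days before each target date.
demand_last_week = [120, 95, 150]
orders = [baseline_order_quantity(demand) for demand in demand_last_week]
print(orders)  # [132, 105, 165]
```

Because the rule looks at a single past day only, any unusual spike or dip from last week carries straight into today's order.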

### Business Objectives

The project's main objective is to **reduce salad waste while ensuring sufficient stock to meet demand**, balancing waste reduction with product availability to improve inventory efficiency, profitability, and customer satisfaction.

### Situation Assessment

Currently, FreshMart uses a simple but flawed method to order salads, which doesn't account for daily demand fluctuations. The market has valuable resources like historical sales data but faces challenges such as the **salads' short shelf life**, **seasonal demand variations**, and **limited storage**. The risks include financial loss from **waste due to overestimation** and missed sales from **stockouts due to underestimation**.

### Data Mining Goals

The goal of the data mining process is to **develop a predictive model to accurately forecast daily salad demand**, enabling more informed ordering decisions. This model aims to create a dynamic system that provides **real-time order recommendations**, **reducing waste** and **improving stock availability**.

### Project Plan

The project is divided into five phases: data collection and preparation, exploratory data analysis and feature engineering, model development and testing, implementation and integration, and monitoring and optimization. These phases will ensure the chosen model is integrated into the inventory management system, with a focus on continuous performance tracking and adjustment.

### Initial Assessment of Tools and Techniques

Linear regression will be the primary modeling technique, with Python used for data analysis, modeling, visualization, and reporting. The data is available in CSV or Feather formats, eliminating the need for SQL databases. The model's performance will be evaluated using Root Mean Square Error (RMSE) and business metrics like waste reduction and stock turn rate improvement.
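
To make the technical metric concrete, the following library-free sketch computes the RMSE by hand. It should agree with scikit-learn's `root_mean_squared_error` for one-dimensional inputs; the helper name `rmse` and the sample numbers are hypothetical.

```python
import math


def rmse(actual: list[float], predicted: list[float]) -> float:
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("Inputs must be non-empty and of equal length.")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical daily demand versus ordered quantity.
demand = [100, 120, 90, 110]
ordered = [110, 115, 90, 95]
print(f"{rmse(demand, ordered):,.2f} UNITS")
```

Squaring penalizes large misses more than small ones, so a single day with a big over- or under-order moves the RMSE noticeably.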

### Success Criteria Determination

Success will be measured by both business and technical criteria. Business success means **reducing salad waste by at least 70 %**, **increasing profitability by 80 %**, and **maintaining an in-stock rate of at least 50 %**. Technical success means achieving an RMSE of **20** units or lower and providing real-time order recommendations.

## Setup, Configuration, and Utilities

### Package Installation

In [1]:
# This installs the necessary libraries. If you're not working in Google Colab, alternatives like Poetry may also work 
# for managing dependencies. You can refer to the pyproject.toml in the GitHub repository for more information.
#
# !pip install -q feature_engine pyampute

### Package Import

In [2]:
# Third-Party Packages
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

from feature_engine.pipeline import make_pipeline
from feature_engine.wrappers import SklearnTransformerWrapper

from holidays import country_holidays

from matplotlib.dates import DateFormatter
from matplotlib.ticker import FixedFormatter, FixedLocator, MultipleLocator, PercentFormatter

from pandas import DataFrame

from scipy import stats
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import root_mean_squared_error
from sklearn.model_selection import KFold, RandomizedSearchCV, TimeSeriesSplit, cross_validate, train_test_split
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures, StandardScaler

In [3]:
try:
    from google.colab import data_table
    data_table.enable_dataframe_formatter()
except ModuleNotFoundError:
    pass

### Package Configuration

In [4]:
# Configuration for Matplotlib and Seaborn
sns.set_theme(style="darkgrid")

# Configuration for Pandas
pd.set_option("display.max_rows", 11)
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)
pd.set_option("display.float_format", lambda x: f"{x:,.4f}")

# Configuration for Scikit-Learn
sklearn.set_config(transform_output="pandas")

### Custom Helper Methods

In [5]:
def get_business_metrics(transactions: DataFrame):
    """Calculate key business metrics from transaction data.

    This function computes various key performance indicators (KPIs) related to inventory and sales based on the 
    provided transaction data.

    Args:
        transactions (DataFrame): A data frame with the following required columns: actual demand for the product 
            (`DEMAND`), purchased quantity of the product (`PQTY`), cost of excess inventory (`CEI`), cost of lost 
            sales (`CLS`), sales revenue (`SR`), purchase cost (`PC`), gross profit (`GP`), net profit (`NP`), and 
            inventory levels (`INVENTORY`).
    
    Returns:
        DataFrame: A data frame containing the calculated KPIs with the following columns: the name of the key 
            performance indicator (`KPI`), a brief description of the KPI (`DESCRIPTION`), and the calculated value of 
            the KPI, formatted as a string (`VALUE`).

    Raises:
        KeyError: If required columns are missing in the input data frame.
    """

    metrics = DataFrame(columns=["KPI", "DESCRIPTION", "VALUE"])
    metrics.loc[len(metrics), :] = ["Root Mean Squared Error", "Measures the average difference between actual demand and purchased quantity.", f"{root_mean_squared_error(transactions['DEMAND'], transactions['PQTY']):,.2f} UNITS"]
    metrics.loc[len(metrics), :] = ["Total Cost of Inefficiency", "The combined cost of excess inventory and lost sales.", f"{(transactions['CEI'] + transactions['CLS']).sum():,.2f} EUR"]
    metrics.loc[len(metrics), :] = ["Total Sales Revenue", "Total revenue generated from sales.", f"{transactions['SR'].sum():,.2f} EUR"]
    metrics.loc[len(metrics), :] = ["Total Purchase Cost", "Total cost incurred from purchasing goods.", f"{transactions['PC'].sum():,.2f} EUR"]
    metrics.loc[len(metrics), :] = ["Total Gross Profit", "Total profit before accounting for overhead costs.", f"{transactions['GP'].sum():,.2f} EUR"]
    metrics.loc[len(metrics), :] = ["Total Net Profit", "Total profit after all expenses have been deducted.", f"{transactions['NP'].sum():,.2f} EUR"]
    metrics.loc[len(metrics), :] = ["Total Waste Rate", "Percentage of inventory left over (waste) compared to total purchased quantity.", f"{np.abs(transactions[transactions['INVENTORY'] > 0]['INVENTORY'].sum()) / sum(transactions['PQTY']) * 100:,.2f} %"]
    metrics.loc[len(metrics), :] = ["Total Waste Quantity", "Total number of units left as waste.", f"{np.abs(transactions[transactions['INVENTORY'] > 0]['INVENTORY'].sum())} UNITS"]
    metrics.loc[len(metrics), :] = ["Overstock Rate", "Percentage of records where there is excess inventory.", f"{len(transactions[transactions['INVENTORY'] > 0]) / len(transactions) * 100:,.2f} %"]
    metrics.loc[len(metrics), :] = ["Understock Rate", "Percentage of records where there is a shortage of inventory.", f"{len(transactions[transactions['INVENTORY'] < 0]) / len(transactions) * 100:,.2f} %"]
    metrics.loc[len(metrics), :] = ["Perfect Order Rate", "Percentage of records where the inventory matched demand perfectly.", f"{len(transactions[transactions['INVENTORY'] == 0]) / len(transactions) * 100:,.2f} %"]
    return metrics
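
The rate-based KPIs in `get_business_metrics` reduce to counting days by the sign of the inventory balance (purchased quantity minus demand). The sketch below reproduces that arithmetic without pandas; the function name `inventory_rates` and the sample values are hypothetical.

```python
def inventory_rates(purchased: list[int], demand: list[int]) -> dict[str, float]:
    """Compute waste, overstock, understock, and perfect order rates in percent."""
    inventory = [p - d for p, d in zip(purchased, demand)]
    days = len(inventory)
    # Only positive balances are leftovers; negative balances are shortages.
    waste_units = sum(balance for balance in inventory if balance > 0)
    return {
        "waste_rate": waste_units * 100 / sum(purchased),
        "overstock_rate": sum(balance > 0 for balance in inventory) * 100 / days,
        "understock_rate": sum(balance < 0 for balance in inventory) * 100 / days,
        "perfect_order_rate": sum(balance == 0 for balance in inventory) * 100 / days,
    }


# Hypothetical four-day history.
rates = inventory_rates(purchased=[100, 100, 100, 100], demand=[90, 110, 100, 80])
print(rates)
```

Here the balances are 10, -10, 0, and 20 units, so 30 of the 400 purchased units end up as waste.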

In [6]:
def get_inventory_metrics(transactions: DataFrame, target = "DEMAND"):
    """Calculate detailed inventory and financial metrics from transaction data.

    This function computes key financial metrics related to inventory, such as purchase cost, gross profit, inventory 
    levels, cost of excess inventory, cost of lost sales, and net profit, based on the provided transaction data.

    Args:
        transactions (DataFrame): A data frame containing the following required columns: purchase price per unit of
            the product (`PPRC`), retail price per unit of the product (`RPRC`), purchased quantity of the product
            (`PQTY`), actual demand for the product (`DEMAND`), and unit gross profit, i.e., the profit lost per unit
            of unsatisfied demand (`UGP`).
        target (str): Name of the column that holds the actual demand. Defaults to `DEMAND`.

    Returns:
        DataFrame: A data frame containing the calculated metrics with the following columns: total purchase cost 
            (`PC`), gross profit (`GP`), inventory levels (`INVENTORY`), cost of excess inventory (`CEI`), cost of lost 
            sales (`CLS`), net profit (`NP`), sales revenue for the product (`SR`), sales quantity (`SQTY`).

    Raises:
        KeyError: If required columns are missing in the input data frame.
    """

    metrics = DataFrame(columns=["PC", "GP", "INVENTORY", "CEI", "CLS", "NP", "SQTY", "SR"], index=transactions.index)
    metrics["PC"] = (transactions["PPRC"] * transactions["PQTY"]).round(2)
    # Units sold are capped by the purchased quantity.
    metrics["SQTY"] = transactions[target].case_when([(lambda demand: demand >= transactions["PQTY"], transactions["PQTY"]), (lambda demand: demand < transactions["PQTY"], transactions[target])])
    metrics["SR"] = (transactions["RPRC"] * metrics["SQTY"]).round(2)
    metrics["GP"] = (metrics["SR"] - metrics["PC"]).round(2)
    metrics["INVENTORY"] = transactions["PQTY"] - transactions[target]
    metrics["CEI"] = metrics["INVENTORY"].case_when([(lambda x: x > 0, metrics["INVENTORY"] * transactions["PPRC"]), (lambda x: x <= 0, 0)]).round(2)
    metrics["CLS"] = metrics["INVENTORY"].case_when([(lambda x: x < 0,  -1 * metrics["INVENTORY"] * transactions["UGP"]), (lambda x: x >= 0, 0)]).round(2)
    metrics["NP"] = (metrics["GP"] - metrics["CLS"]).round(2)
    return metrics

### Custom Data Transformers

In [7]:
ForwardFillTransformer = FunctionTransformer(lambda x: x.copy().ffill())
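
Forward filling replaces each missing value with the most recent observed value, and leaves leading gaps untouched because nothing precedes them. The plain-Python sketch below illustrates that behavior on a list; the helper `forward_fill` and the temperature readings are hypothetical and stand in for what `ffill()` does on a pandas column.

```python
import math


def forward_fill(values: list) -> list:
    """Replace missing entries (None or NaN) with the last observed value."""
    filled = []
    last_seen = None
    for value in values:
        is_missing = value is None or (isinstance(value, float) and math.isnan(value))
        if is_missing:
            # Leading gaps stay as they are because there is nothing to carry forward.
            filled.append(value if last_seen is None else last_seen)
        else:
            last_seen = value
            filled.append(value)
    return filled


# Hypothetical temperature forecasts with gaps.
print(forward_fill([21.5, None, None, 19.0, None]))  # [21.5, 21.5, 21.5, 19.0, 19.0]
```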

## Data Understanding

This phase involves getting acquainted with the data, identifying quality issues, and uncovering insights to guide the project's next steps. It typically starts with collecting relevant data from databases, files, or external sources. The data is then summarized to understand its structure, including variable types (categorical or numerical), record counts, and basic statistics like mean, median, standard deviation, and distributions. Exploratory Data Analysis (EDA) follows, using visualizations like histograms, scatter plots, and box plots to reveal patterns, trends, and relationships. Finally, data quality is assessed by identifying issues such as missing values, outliers, and inconsistencies, which are crucial to address in the Data Preparation phase.

### Historic Retail Transactions Dataset

In [8]:
transactions = pd.read_feather("https://github.com/saschaschworm/big-data-and-data-science/raw/refs/heads/master/datasets/demand-forecasting.feather")

#### Data Dictionary

| **Field Name**  | **Data Type** | **Description**                                                                                                     |
|-----------------|---------------|---------------------------------------------------------------------------------------------------------------------|
| `ODATE`         | `str`         | The order date when all relevant information becomes available to predict the demand on the target date (`TDATE`). |
| `TDATE`         | `str`         | The target date for which the demand and other metrics are being predicted. This is the actual date for which the forecast is applicable. |
| `SEASON`        | `str`         | The season (e.g., Winter, Spring, Summer, Fall) at the target date (`TDATE`). |
| `DEMAND`        | `int`         | The actual number of units demanded or sold on the target date (`TDATE`). This is the realized demand and serves as the ground truth for evaluating the accuracy of the forecast. |
| `DEMAND7CD`     | `int`         | The actual number of units demanded or sold 7 calendar days before the target date (`TDATE`). This metric reflects the realized demand one week prior to the target date. |
| `MARKETING`     | `str`         | The level of marketing effort deployed on the target date (`TDATE`). This is typically categorized as `LOW`, `MEDIUM`, or `HIGH`, and represents the intensity of marketing campaigns, advertisements, or promotions aimed at driving sales. |
| `PROMOTION`     | `str`         | The type of promotion applied on the target date (`TDATE`), such as `NONE`, `DISCOUNT`, or `BOGO`. Promotions can have a significant impact on demand by incentivizing purchases. |
| `CAF`           | `str`         | An indicator of whether a competitor is currently running an active marketing campaign on the target date (`TDATE`). This is a binary field with values `YES` or `NO`, where `YES` indicates that a competitor has an active campaign, which could potentially impact demand. |
| `TEMPERATURE`   | `float`       | The forecasted temperature (in degrees Celsius) expected on the target date (`TDATE`), as predicted on the order date (`ODATE`). Weather conditions can influence consumer behavior, especially for certain seasonal products. |
| `PRECIPITATION` | `float`       | The forecasted probability of precipitation expected on the target date (`TDATE`), as predicted on the order date (`ODATE`). This metric represents the likelihood of rain or other forms of precipitation, which can affect foot traffic and overall demand. |
| `PPRC`          | `float`       | The purchase price per unit on the target date (`TDATE`). This is the cost at which the supermarket acquires each unit of the product from suppliers, and it directly influences the cost of goods sold (COGS). |
| `RPRC`          | `float`       | The sale price per unit on the target date (`TDATE`). This is the price at which the supermarket sells each unit to customers. It is a crucial factor in determining revenue and profitability. |
| `PQTY`          | `int`         | The quantity of units purchased on the target date (`TDATE`). This reflects the inventory replenishment made by the supermarket to meet anticipated demand. |
| `SQTY`          | `int`         | The quantity of units sold on the target date (`TDATE`). This is the actual sales volume achieved on the target date and is used to calculate revenue and inventory levels. |
| `INVENTORY`     | `int`         | The stock level on the target date (`TDATE`) after accounting for sales and purchases. A positive value indicates surplus inventory (overstocking), while a negative value indicates a shortage (understocking). Inventory management is critical for minimizing holding costs and avoiding lost sales. |
| `UGP`           | `float`       | The unit gross profit on the target date (`TDATE`), calculated as the difference between the sale price (`RPRC`) and the purchase price (`PPRC`). This metric indicates the profit margin per unit sold. |
| `SR`            | `float`       | The total sales revenue on the target date (`TDATE`), calculated as the product of the quantity sold (`SQTY`) and the sale price (`RPRC`). Sales revenue is a key indicator of the supermarket's top-line performance. |
| `PC`            | `float`       | The total purchase cost on the target date (`TDATE`), calculated as the product of the quantity purchased (`PQTY`) and the purchase price (`PPRC`). This represents the total expenditure on acquiring inventory. |
| `GP`            | `float`       | The gross profit on the target date (`TDATE`), calculated as the difference between total sales revenue (`SR`) and total purchase cost (`PC`). Gross profit indicates the profitability before accounting for operating expenses. |
| `CEI`           | `float`       | The cost of excessive inventory on the target date (`TDATE`). This represents the holding costs incurred due to overstocking. Holding costs can include storage, insurance, spoilage, and depreciation of unsold goods. It is calculated based on the excess units in inventory that exceed expected demand. High CEI indicates inefficient inventory management and can negatively impact profitability. |
| `CLS`           | `float`       | The cost of lost sales due to understocking on the target date (`TDATE`). This cost represents the opportunity loss when demand exceeds available inventory, leading to missed sales opportunities. Understocking can result in dissatisfied customers and potential loss of market share. The cost is typically estimated based on the profit that could have been earned if sufficient inventory had been available. |
| `NP`            | `float`       | The net profit on the target date (`TDATE`), calculated as the difference between gross profit (`GP`) and the cost of lost sales (`CLS`). Net profit is a key measure of overall profitability after accounting for all relevant costs, including inefficiencies in inventory management. |
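
The financial fields in the table build on one another, which is easiest to see for a single day. The sketch below follows the definitions above and the logic of `get_inventory_metrics`; the function name `daily_metrics` and all prices and quantities are hypothetical.

```python
def daily_metrics(demand: int, pqty: int, pprc: float, rprc: float) -> dict[str, float]:
    """Derive the financial fields for one day from demand, order quantity, and prices."""
    ugp = rprc - pprc                    # unit gross profit
    sqty = min(demand, pqty)             # sales are capped by what was purchased
    sr = rprc * sqty                     # sales revenue
    pc = pprc * pqty                     # purchase cost
    gp = sr - pc                         # gross profit
    inventory = pqty - demand            # > 0 overstock, < 0 understock
    cei = inventory * pprc if inventory > 0 else 0.0   # cost of excess inventory
    cls = -inventory * ugp if inventory < 0 else 0.0   # cost of lost sales
    net_profit = gp - cls
    return {"SQTY": sqty, "SR": sr, "PC": pc, "GP": gp, "INVENTORY": inventory,
            "CEI": cei, "CLS": cls, "NP": net_profit}


# Hypothetical understocked day: 110 salads demanded, 100 purchased.
short_day = daily_metrics(demand=110, pqty=100, pprc=2.0, rprc=5.0)
# Hypothetical overstocked day: 90 salads demanded, 100 purchased.
surplus_day = daily_metrics(demand=90, pqty=100, pprc=2.0, rprc=5.0)
print(short_day)
print(surplus_day)
```

On the understocked day the ten missed sales cost 30.00 EUR in lost profit; on the overstocked day the ten unsold salads tie up 20.00 EUR of purchase cost.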

#### Preliminary Data Analysis

In [None]:
transactions

In [None]:
transactions.dtypes.to_frame().transpose()

In [None]:
transactions.describe(include="all")

In [None]:
get_business_metrics(transactions)

#### Exploratory Data Analysis

##### Line Plot: Transaction Date and Demand

##### Scatter Plot: Retail Price and Demand

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="RPRC", y="DEMAND", data=transactions)

ax.set_xlabel("Retail Price")
ax.set_ylabel("Demand (Units)")
# ax.set_ylim(0, 500)

# ax.xaxis.set_major_locator(MultipleLocator(0.5))
# ax.xaxis.set_major_formatter("{x:.2f} €")

fig.tight_layout()

##### Scatter Plot: Precipitation, Competitor Activity, and Demand

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="PRECIPITATION", y="DEMAND", hue="CAF", data=transactions, linewidth=1, alpha=1.0)

ax.set_xlabel("Precipitation Probability")
ax.set_ylabel("Demand (Units)")

# ax.xaxis.set_major_formatter(PercentFormatter(xmax=1))

fig.tight_layout()

##### Box and Whisker Plot: Season and Demand

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.boxplot(x="SEASON", y="DEMAND", data=transactions)

ax.set_xlabel("Seasonal Cycle")
ax.set_ylabel("Demand (Units)")

# ax.xaxis.set_major_locator(FixedLocator([0, 1, 2, 3]))
# ax.xaxis.set_major_formatter(FixedFormatter(["Fall", "Spring", "Summer", "Winter"]))

fig.tight_layout()

##### Scatter Plot: Forecasted Temperature and Demand

**Exercise:** Create a scatter plot to analyze the relationship between temperature and demand. What trends can you identify in how temperature affects demand?

In [None]:
fig = plt.figure(figsize=(10, 6))

# <PLACE YOUR CODE HERE>

fig.tight_layout()

##### Box and Whisker Plot: Promotion and Demand

**Exercise:** Create a box plot to compare the effectiveness of different promotion types on demand, and identify which promotion type leads to the highest demand.

In [None]:
fig = plt.figure(figsize=(10, 6))

# <PLACE YOUR CODE HERE>

fig.tight_layout()

##### Box and Whisker Plot: Competitor Activity, Marketing, and Demand

**Exercise:** Create a box plot to visualize demand based on competitor activity, using marketing intensity as the hue. When should marketing strategies be adjusted based on competitor activity?

In [None]:
fig = plt.figure(figsize=(10, 6))

# <PLACE YOUR CODE HERE>

fig.tight_layout()

### Nutritional Trends Index Dataset

In [19]:
nti = pd.read_csv("https://github.com/saschaschworm/big-data-and-data-science/raw/refs/heads/master/datasets/nutritional-trends-index.csv", parse_dates=["RELEASE", "VALID_FROM", "VALID_TO"])

#### Data Dictionary

| **Field Name** | **Data Type** | **Description**                                                                                                                                  |
|----------------|---------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
| `RELEASE`      | `str`         | The release date of the Nutritional Trends Index (NTI) data. This is the date on which the NTI value is published, indicating when the data becomes available. |
| `TARGET`       | `str`         | The month and year for which the NTI is applicable. This field typically contains the abbreviated month name (e.g., "SEP" for September) and refers to the time period that the NTI value represents. |
| `TYPE`         | `str`         | The type of NTI value provided, which is always `ACTUAL`, and refers to the final NTI value observed or recorded for the target month.           |
| `VALUE`        | `float`       | The NTI for the specified target month (`TARGET`). This index tracks shifts in consumer preferences, dietary habits, and nutritional awareness over time. |
| `VALID_FROM`   | `str`         | The start date from which the NTI value is considered valid.                                                                                     |
| `VALID_TO`     | `str`         | The end date up to which the NTI value is valid.                                                                                                 |

#### Preliminary Data Analysis

In [None]:
nti

In [None]:
nti.dtypes.to_frame().transpose()

In [None]:
nti.describe(include="all")

#### Exploratory Data Analysis

##### Histogram: Nutritional Trends Index

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.histplot(x="VALUE", binwidth=1, data=nti)

ax.set_ylabel("Frequency")
ax.set_xlabel("Nutritional Trends Index")

fig.tight_layout()

##### Line Plot: Nutritional Trends Index

**Exercise:** Create a line plot to examine the Nutritional Trends Index over time by date. At what time do unusual observations occur?

In [None]:
fig = plt.figure(figsize=(10, 6))

# <PLACE YOUR CODE HERE>

fig.tight_layout()

### Consumer Climate Index Dataset

In [25]:
cci = pd.read_csv("https://github.com/saschaschworm/big-data-and-data-science/raw/refs/heads/master/datasets/consumer-climate-index.csv", parse_dates=["RELEASE", "VALID_FROM", "VALID_TO"])

#### Data Dictionary

| **Field Name** | **Data Type** | **Description**                                                                                                     |
|----------------|---------------|---------------------------------------------------------------------------------------------------------------------|
| `RELEASE`      | `str`         | The release date of the Consumer Climate Index (CCI) data. This is the date on which the forecast or actual value of the CCI is published. It indicates when the data becomes available to the public or businesses for decision-making. |
| `TARGET`       | `str`         | The month and year for which the CCI is applicable. This field typically contains the abbreviated month name (e.g., "SEP" for September) and refers to the time period that the CCI value represents. |
| `TYPE`         | `str`         | The type of CCI value provided, which can be either `FORECAST` or `ACTUAL`. `FORECAST` indicates a predicted CCI value released before or at the beginning of the target month. `ACTUAL` refers to the final CCI value observed or recorded for the target month. |
| `VALUE`        | `float`       | The value of the CCI for the specified target month (`TARGET`). The CCI is a measure of consumer confidence and sentiment, with negative values indicating pessimism (reduced consumer spending and economic activity) and positive values indicating optimism (increased consumer spending and economic activity). The value is typically expressed as an index relative to a base period. |
| `VALID_FROM`   | `str`         | The start date from which the CCI value is considered valid. For `FORECAST` data, this is typically the first day of the target month (`TARGET`). For `ACTUAL` data, it is the release date (`RELEASE`). |
| `VALID_TO`     | `str`         | The end date up to which the CCI value is valid. For `FORECAST` data, this is typically the last day before the next forecast or actual value is released. For `ACTUAL` data, it is the last day of the target month (`TARGET`). This period defines the coverage of the CCI value in terms of its relevance to economic conditions. |

#### Preliminary Data Analysis

In [None]:
cci

In [None]:
cci.dtypes.to_frame().transpose()

In [None]:
cci.describe(include="all")

### German Public Holidays Dataset

In [29]:
holidays = DataFrame(country_holidays(country="DE", subdiv="NW", years=[2023, 2024]).items(), columns=["DATE", "HOLIDAY"])

#### Data Dictionary

| **Field Name** | **Data Type** | **Description**                                                                                                     |
|----------------|---------------|---------------------------------------------------------------------------------------------------------------------|
| `DATE`         | `str`         | The date of the public holiday in Germany. This field represents the specific day on which the holiday is observed. The date is formatted as `YYYY-MM-DD`. |
| `HOLIDAY`      | `str`         | The name of the public holiday in Germany. This field contains the official name of the holiday in German, such as "Neujahr" (New Year's Day) or "Tag der Deutschen Einheit" (German Unity Day). |

#### Preliminary Data Analysis

In [None]:
holidays

In [None]:
holidays.dtypes.to_frame().transpose()

In [None]:
holidays.describe(include="all")

## Data Preparation and Evaluation for Univariate Modeling

### Data Preparation

This phase transforms raw data into a format suitable for modeling, often making it the most time-consuming part of the process. It begins with data integration, where data from multiple sources are combined, such as merging datasets or joining tables. Next, data cleaning addresses issues identified during the previous phase. This includes handling missing values through imputation or removal, correcting outliers, and standardizing formats to resolve inconsistencies. Data transformation follows, performing feature engineering to create new, potentially more predictive features, and modifying the data for modeling by normalizing or standardizing numerical features and encoding categorical variables. Finally, data formatting ensures the data is properly structured and formatted for the modeling tools, including converting data types and ensuring consistent column names.
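
As an example of the transformation step, standardization rescales a numerical feature to zero mean and unit variance. The sketch below computes z-scores by hand using the population standard deviation, which is also what scikit-learn's `StandardScaler` uses; the helper name `standardize` and the sample values are hypothetical.

```python
import math


def standardize(values: list[float]) -> list[float]:
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((value - mean) ** 2 for value in values) / len(values)
    std = math.sqrt(variance)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Hypothetical feature values with mean 5 and standard deviation 2.
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

In a real workflow the mean and standard deviation should be estimated on the training data only and then reused for the test data, which is what fitting the scaler inside a pipeline takes care of.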

In [33]:
transformations = make_pipeline(
    SklearnTransformerWrapper(ForwardFillTransformer, variables=["TEMPERATURE"]),
    verbose=False,
)

In [None]:
transactions[["TEMPERATURE"]].transpose()

In [None]:
transactions[["TEMPERATURE"]].describe().transpose()

In [None]:
transformations.fit_transform(transactions[["TEMPERATURE"]]).transpose()

In [None]:
transformations.fit_transform(transactions[["TEMPERATURE"]]).describe().transpose()

### Modeling

This phase involves selecting and applying algorithms to the prepared data to create predictive models. It requires careful selection of model types and parameters, as different models may be better suited to the specific data and problem. Multiple models are typically built and tested to identify the best performer. The process includes splitting the data into training and testing sets, training the models on the training data, and fine-tuning them by adjusting hyperparameters to optimize performance. The goal is to develop a model that accurately captures underlying patterns and generalizes well to new, unseen data.
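
To show what training by stochastic gradient descent means for a simple linear regression, the sketch below fits an intercept and a slope one observation at a time. It is a simplified illustration and not scikit-learn's implementation: it uses squared-error loss without a penalty and an inverse-scaling step size `eta0 / t ** power_t`, but omits the stopping criterion. The function name `fit_sgd_line`, its default values, and the noise-free data are assumptions.

```python
import random


def fit_sgd_line(xs, ys, eta0=0.1, power_t=0.25, epochs=200, seed=1207):
    """Fit y = intercept + slope * x with plain stochastic gradient descent."""
    rng = random.Random(seed)
    intercept, slope = 0.0, 0.0
    indices = list(range(len(xs)))
    step = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the observations in a new order each epoch
        for i in indices:
            step += 1
            eta = eta0 / step ** power_t  # step size shrinks as training progresses
            error = intercept + slope * xs[i] - ys[i]
            # Gradient of 0.5 * error**2 with respect to intercept and slope.
            intercept -= eta * error
            slope -= eta * error * xs[i]
    return intercept, slope


# Hypothetical centered feature and noise-free target y = 1 + 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.0, -1.0, 1.0, 3.0, 5.0]
intercept, slope = fit_sgd_line(xs, ys)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The initial step size `eta0` and the decay exponent `power_t` are exactly the kind of hyperparameters that are tuned during this phase: too large a step can make the updates diverge, too small a step makes learning slow.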

In [38]:
model = make_pipeline(
    transformations,
    SGDRegressor(penalty=None, alpha=0.0001, max_iter=1000, tol=0.001, shuffle=True, verbose=0, random_state=1207, 
        learning_rate="invscaling", eta0=0.001, power_t=0.25, n_iter_no_change=5,
    ),
    verbose=False,
).fit(transactions[["TEMPERATURE"]], transactions["DEMAND"])

In [None]:
DataFrame([[model[-1].intercept_[0], *model[-1].coef_]], columns=["INTERCEPT", "TEMPERATURE"])

In [None]:
predictions = transactions[["ODATE", "TDATE", "DEMAND", "TEMPERATURE", "PPRC", "RPRC", "UGP"]].copy()
predictions["YHAT"] = model.predict(transactions[["TEMPERATURE"]])
predictions["PQTY"] = np.ceil(predictions["YHAT"])
predictions = predictions.merge(get_inventory_metrics(predictions), left_index=True, right_index=True)
predictions

In [None]:
forecast = DataFrame({"TDATE": pd.date_range("2024-09-01", periods=7), "TEMPERATURE": [27, 28, 29, 28, 28, 27, 27]})
forecast["FORECAST"] = model.predict(forecast[["TEMPERATURE"]])
forecast

### Evaluation

This phase assesses the performance of the models developed during the Modeling phase to ensure they meet the project's objectives. This involves testing the models on the test data and evaluating them with relevant metrics, such as accuracy, precision, recall, F1-score, or mean squared error, depending on the problem type (classification, regression, etc.). The process also includes validating the model's robustness and consistency across different data subsets, often using cross-validation techniques to ensure reliability. The goal is to identify the most effective model that strikes the best balance between performance and generalization.

In [None]:
# Single quotes inside the f-strings keep the code valid on Python versions before 3.12.
print(f"RMSE (History): {root_mean_squared_error(transactions['DEMAND'], transactions['PQTY']):,.2f} UNITS")
print(f"RMSE (Model): {root_mean_squared_error(transactions['DEMAND'], predictions['PQTY']):,.2f} UNITS")

In [None]:
metrics = pd.merge(get_business_metrics(transactions), get_business_metrics(predictions), on=["KPI", "DESCRIPTION"])
metrics = metrics.rename({"VALUE_x": "BENCHMARK", "VALUE_y": "SLR"}, axis=1)
metrics

#### Holdout Method

In [44]:
X_train, X_test, y_train, y_test = train_test_split(
    transactions[["TEMPERATURE"]], transactions["DEMAND"], test_size=0.25, shuffle=True, random_state=1207,
)

In [None]:
X_train.merge(y_train, left_index=True, right_index=True).transpose()

In [None]:
X_test.merge(y_test, left_index=True, right_index=True).transpose()

In [47]:
model = model.fit(X_train, y_train)

In [None]:
print(f"RMSE (Training Set): {root_mean_squared_error(y_train, model.predict(X_train)):,.2f} UNITS")
print(f"RMSE (Test Set): {root_mean_squared_error(y_test, model.predict(X_test)):,.2f} UNITS")

#### K-Fold Cross-Validation

In [49]:
kfold = KFold(n_splits=10, shuffle=False)
evaluation = cross_validate(
    estimator=model, X=transactions[["TEMPERATURE"]], y=transactions["DEMAND"], cv=kfold, n_jobs=-1, 
    scoring="neg_root_mean_squared_error", verbose=0, return_train_score=True,
)
cv = DataFrame(evaluation)

In [None]:
cv.transpose()

In [None]:
cv.agg("mean").to_frame().transpose()
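
With `shuffle=False`, k-fold cross-validation cuts the rows into consecutive blocks and uses each block once as the test set. The sketch below generates such index splits in plain Python to make the mechanics visible; the generator name `kfold_indices` is hypothetical, and the handling of uneven fold sizes (earlier folds get the extra rows) follows scikit-learn's documented behavior.

```python
def kfold_indices(n_samples: int, n_splits: int):
    """Yield (train, test) index lists for unshuffled k-fold cross-validation."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional row.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in kfold_indices(n_samples=10, n_splits=3):
    print("train:", train, "test:", test)
```

Note that most training sets contain rows that come after the test rows, which is why this scheme can be optimistic for time-ordered data.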

**Exercise:** Conduct a $k$-fold cross-validation analysis while varying the value of $k$. What observations can you make regarding the model’s performance as you change $k$?

#### Time-Series Cross-Validation

In [52]:
tscv = TimeSeriesSplit(n_splits=10, gap=1)
evaluation = cross_validate(
    estimator=model, X=transactions[["TEMPERATURE"]], y=transactions["DEMAND"], cv=tscv, n_jobs=-1, 
    scoring="neg_root_mean_squared_error", verbose=0, return_train_score=True,
)
cv = DataFrame(evaluation)

In [None]:
cv.transpose()

In [None]:
cv.agg("mean").to_frame().transpose()
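
Time-series cross-validation only ever trains on rows that precede the test rows, and the `gap` leaves out rows directly before each test block. The sketch below generates expanding-window splits in plain Python. It is written to mirror our understanding of how `TimeSeriesSplit` behaves with default settings, so treat the exact fold boundaries as an assumption; the generator name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples: int, n_splits: int, gap: int = 0):
    """Yield (train, test) index lists where training data always precedes test data."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_end = test_start - gap  # drop `gap` rows right before the test block
        train = list(range(0, train_end))
        test = list(range(test_start, test_start + test_size))
        yield train, test


for train, test in expanding_window_splits(n_samples=12, n_splits=3, gap=1):
    print("train:", train, "test:", test)
```

Each successive split trains on a longer history, so early folds see much less data than late folds, and their scores tend to vary more.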

### Data Preparation, Modeling, and Evaluation (Part II)

In [55]:
model = make_pipeline(
    SklearnTransformerWrapper(ForwardFillTransformer, variables=["TEMPERATURE"]),
    SklearnTransformerWrapper(PolynomialFeatures(degree=2, include_bias=False)),
    SGDRegressor(penalty=None, alpha=0.0001, max_iter=1000, tol=0.001, shuffle=True, verbose=0, random_state=1207, 
        learning_rate="invscaling", eta0=0.001, power_t=0.25, n_iter_no_change=5,
    ),
    verbose=False,
).fit(transactions[["TEMPERATURE"]], transactions["DEMAND"])

In [None]:
DataFrame([[model[-1].intercept_[0], *model[-1].coef_]], columns=["INTERCEPT", *model[:-1].get_feature_names_out()])

### Data Preparation, Modeling, and Evaluation (Part III)

In [57]:
model = make_pipeline(
    SklearnTransformerWrapper(ForwardFillTransformer, variables=["TEMPERATURE"]),
    SklearnTransformerWrapper(PolynomialFeatures(degree=2, include_bias=False)),
    SklearnTransformerWrapper(StandardScaler()),
    SGDRegressor(penalty=None, alpha=0.0001, max_iter=1000, tol=0.001, shuffle=True, verbose=0, random_state=1207, 
        learning_rate="invscaling", eta0=0.1, power_t=0.25, n_iter_no_change=5,
    ),
    verbose=False,
).fit(transactions[["TEMPERATURE"]], transactions["DEMAND"])

In [None]:
DataFrame(
    data=[[model[-1].intercept_[0], *model[-1].coef_]],
    columns=["INTERCEPT", *model[:-1].get_feature_names_out()],
)

In [59]:
tscv = TimeSeriesSplit(n_splits=10, gap=1)
evaluation = cross_validate(
    estimator=model, X=transactions[["TEMPERATURE"]], y=transactions["DEMAND"], cv=tscv, n_jobs=-1, 
    scoring="neg_root_mean_squared_error", verbose=0, return_train_score=True,
)
cv = DataFrame(evaluation)

In [None]:
cv.transpose()

In [None]:
cv.agg("mean").to_frame().transpose()

**Exercise:** Experiment with different polynomial degrees to determine the best fit for your model. Then apply a high polynomial degree while varying the regularization strength (`alpha`) under L1 (`penalty="l1"`) and L2 (`penalty="l2"`) regularization. What do you notice about the coefficients as you change the regularization strength and the learning rate (`eta0`)?
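
As a point of reference for the exercise above, the sketch below shows how L2 and L1 penalties affect a single slope coefficient in closed form. It assumes one centered feature and the objectives $\tfrac{1}{2}\sum_i (y_i - a - b x_i)^2 + \tfrac{\lambda}{2} b^2$ (L2) and $\tfrac{1}{2}\sum_i (y_i - a - b x_i)^2 + \lambda |b|$ (L1). The toy data and the function names are assumptions, and `SGDRegressor` scales its objective differently, so the values of $\lambda$ here do not map directly onto `alpha`.

```python
import math


def ridge_slope(sxy, sxx, lam):
    """Slope under an L2 penalty: shrinks proportionally, never exactly zero."""
    return sxy / (sxx + lam)


def lasso_slope(sxy, sxx, lam):
    """Slope under an L1 penalty: soft-thresholding, can become exactly zero."""
    shrunk = max(abs(sxy) - lam, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Toy data (assumption): a centered feature, so sum(x) == 0.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.2, 2.1, 3.9]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

penalties = (0.0, 10.0, 25.0)
ridge = {lam: ridge_slope(sxy, sxx, lam) for lam in penalties}
lasso = {lam: lasso_slope(sxy, sxx, lam) for lam in penalties}

for lam in penalties:
    print(f"lambda={lam:>5}: L2 slope={ridge[lam]:.3f}  L1 slope={lasso[lam]:.3f}")
```

Both penalties pull the slope toward zero as $\lambda$ grows, but only the L1 penalty sets it to exactly zero once $\lambda$ exceeds $|\sum_i x_i y_i|$.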

### Data Preparation, Modeling, and Evaluation (Part IV)

In [None]:
stats.randint(1, 11).rvs(size=1)

In [63]:
params = {
    "sklearntransformerwrapper-2__transformer__degree": stats.randint(1, 11),
    "sgdregressor__alpha": stats.loguniform(0.0001, 0.1),
    "sgdregressor__max_iter": stats.randint(1000, 3001),
    "sgdregressor__eta0": stats.loguniform(0.0001, 0.1),
    "sgdregressor__penalty": ["l2", "l1", None],
}
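
The search space above draws `alpha` and `eta0` from a log-uniform distribution. As a minimal sketch of what that means, the plain-Python snippet below samples uniformly on the logarithmic scale and checks how the draws spread across orders of magnitude. The helper name `sample_loguniform`, the seed, and the sample size are assumptions for illustration; the lab itself uses `stats.loguniform`.

```python
import math
import random


def sample_loguniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(1207)
draws = [sample_loguniform(rng, 0.0001, 0.1) for _ in range(10_000)]

# The range 0.0001 to 0.1 spans three decades; each should receive about a third.
share_first_decade = sum(d < 0.001 for d in draws) / len(draws)
share_last_decade = sum(d >= 0.01 for d in draws) / len(draws)

print(f"share in [0.0001, 0.001): {share_first_decade:.2f}")
print(f"share in [0.01, 0.1]:     {share_last_decade:.2f}")
```

A plain uniform distribution over the same range would place about 90 % of its draws in the last decade, which is why a log-uniform distribution is a common choice for parameters such as regularization strength and learning rate.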

In [None]:
X_train, X_test, y_train, y_test = train_test_split(transactions[["TEMPERATURE"]], transactions["DEMAND"], shuffle=False)
tscv = TimeSeriesSplit(n_splits=10, gap=1)

evaluation = RandomizedSearchCV(
    estimator=model, param_distributions=params, n_iter=10, scoring="neg_root_mean_squared_error", cv=tscv, 
    random_state=1207, n_jobs=-1, return_train_score=True, verbose=0,
)

evaluation = evaluation.fit(X_train, y_train)
cv = DataFrame(evaluation.cv_results_)

In [None]:
cv.drop("params", axis=1)

In [None]:
cv[["mean_train_score", "std_train_score", "mean_test_score", "std_test_score"]].agg("mean").to_frame().transpose()

In [None]:
DataFrame([evaluation.best_params_])

In [None]:
evaluation.best_estimator_[:-1]

In [None]:
DataFrame(
    data=[[evaluation.best_estimator_[-1].intercept_[0], *evaluation.best_estimator_[-1].coef_]], 
    columns=["INTERCEPT", *evaluation.best_estimator_[:-1].get_feature_names_out()]
)

In [None]:
predictions = transactions[["ODATE", "TDATE", "DEMAND", "TEMPERATURE", "PPRC", "RPRC", "UGP"]].copy()
predictions["YHAT"] = evaluation.best_estimator_.predict(transactions[["TEMPERATURE"]])
predictions["PQTY"] = np.ceil(predictions["YHAT"])
predictions = predictions.merge(get_inventory_metrics(predictions), left_index=True, right_index=True)
predictions

In [None]:
metrics = pd.merge(get_business_metrics(transactions), get_business_metrics(predictions), on=["KPI", "DESCRIPTION"])
metrics = metrics.rename({"VALUE_x": "BENCHMARK", "VALUE_y": "SLR"}, axis=1)
metrics

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="TEMPERATURE", y="DEMAND", data=predictions, c="C0", alpha=0.5)
ax = sns.lineplot(x="TEMPERATURE", y="YHAT", data=predictions, c="C1")

ax.set_xlabel("Forecasted Temperature (°C)")
ax.set_ylabel("Demand (Units)")

fig.tight_layout()

In [None]:
fig = plt.figure(figsize=(10, 6))
ax = sns.lineplot(x="TDATE", y="DEMAND", data=predictions, c="C0", alpha=0.5)
ax = sns.lineplot(x="TDATE", y="YHAT", data=predictions, c="C1")

ax.set_xlabel("Transaction Date")
ax.set_ylabel("Demand (Units)")

ax.xaxis.set_major_formatter(DateFormatter("%b %d, %Y"))

fig.tight_layout()

## Data Preparation and Evaluation for Multivariate Modeling

With the univariate analysis complete, you can continue with the next notebook, which focuses on multivariate analysis and introduces more advanced methods, including feature engineering and outlier handling. Click [here](https://colab.research.google.com/github/saschaschworm/big-data-and-data-science/blob/master/labs/demand-forecasting-advanced.ipynb) to open it in Google Colab.