# 📈 Sales Forecasting

---

## Background
One of the key advantages of Amazon SageMaker Unified Studio is its ability to integrate data from multiple sources. In this notebook, we'll walk through an example of bringing in data from a lakehouse to train a sales forecasting model.

We'll go through the following steps:

- 🔍 Querying historical sales data
- 🔄 Transforming the data to feed into Amazon SageMaker algorithms
- 🤖 Training a model to forecast sales
- 📈 Predicting sales for the next year to identify our top 5 products


## 🛠️ Preparation

Let's start by bringing in the Python libraries that we'll use throughout the notebook and retrieving the name of the project's database through the default catalog.

In [None]:
# Import required libraries
from sagemaker_studio import Project
import lab_utilities

# Get database names and find the project database
project = Project()
databases = project.connection().catalog().databases
project_db = next(db.name for db in databases if db.name.startswith('project_ml_'))
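
The `next(...)` call above raises a bare `StopIteration` when no database name starts with `project_ml_`. The following is a minimal sketch of a more defensive lookup that works on plain name strings. `find_project_db` is a hypothetical helper written for illustration; it is not part of `sagemaker_studio` or `lab_utilities`.

```python
def find_project_db(names, prefix="project_ml_"):
    """Return the first name starting with `prefix`, or raise a clear error."""
    names = list(names)
    match = next((name for name in names if name.startswith(prefix)), None)
    if match is None:
        raise LookupError(f"No database starting with {prefix!r} found in {names}")
    return match


# Stand-in names for illustration; in the notebook they would come from
# [db.name for db in databases].
print(find_project_db(["default", "project_ml_example"]))
```

Passing a default to `next` avoids the unhandled `StopIteration`, and the explicit error message lists what was actually found.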

<div class="alert alert-block alert-info">
<center><b>‼️️ IMPORTANT:</b> Copy the database name <code>project_ml_...</code> from the output of the next cell, as we'll use it in the SQL cells of this notebook ‼️️</center>
</div>

In [None]:
print(project_db)

## 🔍 Data Exploration

Here we'll be using SQL to query our historical sales data and inspect it.

Let's query our data (replace `[REPLACE-VALUE-HERE]` with the `project_ml_...` value retrieved above):


In [None]:
%%sql project.athena
select * from "awsdatacatalog"."[REPLACE-VALUE-HERE]"."sales_table_enriched_w_campaigns"

Convert the query results to a pandas DataFrame:

In [None]:
data = _.to_pandas()

***

## 🔄 Data Preparation

Our data preparation process transforms raw sales data into a format that enables accurate sales forecasting. This critical step:

1. 📅 **Organizes Sales by Time**: Arranges sales chronologically and extracts time patterns (year, month, day of week) to identify seasonal trends and weekly purchasing behaviors.

2. 📈 **Creates Sales Memory**: Calculates recent performance metrics for each product and channel, helping the system understand sales momentum and trends.

3. 🔢  **Translates Categories to Numbers**: Converts business categories (regions, product types, sales channels) into numerical formats while preserving their distinct characteristics.

4. ⏳ **Establishes Learning Timeline**: Divides data into historical training data, recent validation data, and the most current test data to ensure forecasts are built on solid patterns while being tested against recent trends.

5. ⚖️ **Balances Numerical Scales**: Adjusts values like prices and costs to comparable scales, preventing larger numbers from receiving disproportionate importance.

**Business Value**: This process enables our system to predict sales up to a year ahead with high accuracy, account for seasonal and regional patterns, adapt to market changes, and provide insights for inventory planning and marketing timing. It transforms raw data into actionable intelligence for strategic decision-making about product focus, market expansion, and resource investment.
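
To make some of these steps concrete, here is a minimal, standard-library-only sketch of three of them: calendar features with a trailing "sales memory", a chronological split, and min-max scaling. It is an illustration under assumptions, not the code inside `lab_utilities.process_data`; the function names, the 70/15/15 split, and the 3-day window are all hypothetical choices.

```python
from datetime import date, timedelta


def add_time_and_lag_features(rows, window=3):
    """Sort rows by date, then add calendar fields and a trailing mean of prior sales."""
    rows = sorted(rows, key=lambda r: r["date"])
    out = []
    for i, row in enumerate(rows):
        previous = [r["units"] for r in rows[max(0, i - window):i]]
        out.append({
            **row,
            "year": row["date"].year,
            "month": row["date"].month,
            "day_of_week": row["date"].weekday(),  # Monday = 0
            # Only earlier rows are used, so the feature never sees the target.
            "recent_mean": sum(previous) / len(previous) if previous else 0.0,
        })
    return out


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split already-sorted rows into oldest (train), newer (val), newest (test)."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_min_max(values):
    """Return a scaling function fitted on `values` (use training data only)."""
    low, high = min(values), max(values)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return lambda v: (v - low) / span


# Synthetic daily sales for one product, for illustration only.
rows = [{"date": date(2024, 1, 1) + timedelta(days=i), "units": 10 + i} for i in range(20)]
features = add_time_and_lag_features(rows)
train_rows, val_rows, test_rows = chronological_split(features)
scale = fit_min_max([r["recent_mean"] for r in train_rows])
print(len(train_rows), len(val_rows), len(test_rows))
```

Splitting by time rather than at random keeps future information out of the training set, and fitting the scaler on the training rows only serves the same purpose.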


Let's start by defining our forecast horizon, for example 1 year.

Horizon format:
- year: example "1y"
- months: example "3m"
- days: example "15d"

In [None]:
horizon = "1y"
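
How `lab_utilities` interprets this string is not shown in the notebook. As a rough sketch, a horizon in this format could be turned into a number of days and a list of future dates as follows. The helper names and the 30-day month and 365-day year are assumptions made for illustration.

```python
import re
from datetime import date, timedelta

# Assumed day counts per unit; the lab's own conversion may differ.
DAYS_PER_UNIT = {"d": 1, "m": 30, "y": 365}


def horizon_to_days(horizon):
    """Convert a string such as '1y', '3m' or '15d' into a number of days."""
    match = re.fullmatch(r"(\d+)([dmy])", horizon.strip().lower())
    if match is None:
        raise ValueError(f"Unsupported horizon format: {horizon!r}")
    count, unit = int(match.group(1)), match.group(2)
    return count * DAYS_PER_UNIT[unit]


def future_dates_from(start, horizon):
    """List the calendar days that follow `start` for the given horizon."""
    return [start + timedelta(days=i) for i in range(1, horizon_to_days(horizon) + 1)]


print(horizon_to_days("1y"), horizon_to_days("3m"), horizon_to_days("15d"))
print(future_dates_from(date(2024, 1, 1), "15d")[0])
```

Rejecting anything that does not match the pattern makes a typo such as `"1 year"` fail immediately instead of producing an unexpected horizon.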

Let's now prepare our dataset:

In [None]:
# Process and split the data for machine learning training and evaluation
(
    X_train,
    X_val,
    X_test,
    y_train,
    y_val,
    y_test,
    feature_columns,
    scaler,
    df_encoded,
    future_dates,
) = lab_utilities.process_data(data, prediction_horizon=horizon)

---

## 🤖 Model Training: Building Our Sales Prediction Engine

The model training process takes our prepared data and builds a powerful prediction engine that can forecast future sales. This critical step:

1. **Creates a Learning System**: Establishes a specialized algorithm (XGBoost) that excels at finding complex patterns in sales data and making accurate predictions based on those patterns.

2. **Teaches the System with Historical Data**: The system learns by analyzing thousands of past sales transactions, identifying which factors most strongly influence sales volumes.

3. **Tests Prediction Accuracy**: The system makes predictions on recent data where we already know the actual sales figures, allowing us to measure how accurate it is before using it for future forecasting.

4. **Identifies Key Sales Drivers**: Reveals which factors have the greatest impact on sales, such as seasonality, pricing, or regional differences, providing valuable business insights.

**Business Value**: This trained model becomes our sales forecasting engine, enabling leadership to anticipate market demand, optimize inventory, plan marketing campaigns, and allocate resources more effectively. By identifying the most influential sales factors, it also provides strategic insights for business growth.
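
Step 3 compares predictions against known sales on the validation set. Which metrics `lab_utilities.model_train_eval` reports is not shown here; as an illustration, two common choices for this kind of regression are mean absolute error (MAE) and root mean squared error (RMSE), sketched below with made-up numbers.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but penalizes large misses more heavily."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Made-up validation values, for illustration only.
y_val_example = [100, 120, 80, 150]
y_pred_example = [110, 115, 70, 160]
print(mean_absolute_error(y_val_example, y_pred_example))
print(root_mean_squared_error(y_val_example, y_pred_example))
```

Both metrics are expressed in units sold, which makes them easy to relate to inventory decisions. RMSE above MAE by a wide margin suggests a few large misses.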


In [None]:
# Train the model using our prepared data
model = lab_utilities.model_train_eval(X_train, y_train, X_val, y_val, feature_columns)

---

## 📈 Sales Prediction: Forecasting Future Business Performance

The sales prediction process takes our trained model and uses it to forecast future sales across products, regions, and time periods. This critical step:

1. **Creates a Future Scenario**: Builds a template based on our most recent sales data, establishing a realistic starting point for predictions.

2. **Simulates Time Progression**: Updates all time-related factors (day of week, month, season) to reflect future dates, allowing the system to account for seasonal patterns and trends.

3. **Applies Market Intelligence**: Uses our trained model to predict how sales will respond to these changing conditions based on historical patterns.

4. **Adds Real-World Variability**: Incorporates natural variations like weekend effects (+15% sales) and seasonal factors (summer months +10%), making predictions more realistic by mimicking actual market fluctuations.

5. **Organizes Results for Decision-Making**: Arranges predictions by date, region, and product category, highlighting top-performing products and time periods to guide strategic planning.

**Business Value**: This prediction process delivers actionable sales forecasts that enable leadership to identify future top-selling products, optimize inventory levels, plan targeted marketing campaigns, and allocate resources effectively across regions and product lines. The realistic variability built into the predictions helps prepare for best-case and worst-case scenarios.
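
The variability adjustments in step 4 can be sketched as simple multipliers applied to a raw prediction. This is an illustration, not the code in `lab_utilities.predict_sales`: the function name, the choice of June to August as summer, and the decision to let the two effects compound are assumptions.

```python
from datetime import date

WEEKEND_UPLIFT = 1.15  # +15% on Saturdays and Sundays
SUMMER_UPLIFT = 1.10  # +10% in summer months
SUMMER_MONTHS = {6, 7, 8}  # assumption: June to August


def adjust_prediction(base_units, day):
    """Apply the weekend and summer uplifts to a raw model prediction."""
    units = base_units
    if day.weekday() >= 5:  # Saturday = 5, Sunday = 6
        units *= WEEKEND_UPLIFT
    if day.month in SUMMER_MONTHS:
        units *= SUMMER_UPLIFT  # assumption: the two effects compound
    return units


# A summer Saturday receives both uplifts; a weekday in March receives none.
print(adjust_prediction(100, date(2025, 7, 5)))
print(adjust_prediction(100, date(2025, 3, 4)))
```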

Let's predict the sales for the next year for the Europe region.

In [None]:
# Set target region for analysis
region = "Europe"

# Generate sales predictions for the region
product_sales = lab_utilities.predict_sales(
    model, df_encoded, future_dates, target_region=region
)

# Create summary of total predicted sales by product
product_summary = (
    product_sales.groupby(["region", "item_type", "product_category"])[
        "predicted_units"
    ]
    .sum()
    .reset_index()
    .sort_values("predicted_units", ascending=False)
)

# Get top 5 products
top_5_products_predictions = product_summary.head(5)

# Display results
print(f"\nTotal Predicted Sales by Product for the {len(future_dates)}-day period:")
product_summary[["region", "item_type", "predicted_units"]] 

Here are the top 5 products that we predict will sell the most units:

In [None]:
for item in top_5_products_predictions["item_type"]:
    print(f"• {item}")

## 📊 Visualize the forecasted sales

This graph shows the average weekly predicted sales for each product over time.

In [None]:
lab_utilities.plot_forescast(top_5_products_predictions, product_sales, region)
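
How the plotting helper aggregates daily predictions is not shown. One plausible way to obtain weekly averages, sketched here with the standard library only, is to group daily values by ISO week. `weekly_average` is a hypothetical helper and the data is made up.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekly_average(daily):
    """Average (day, units) pairs per ISO week; returns {(iso_year, iso_week): mean}."""
    buckets = defaultdict(list)
    for day, units in daily:
        iso = day.isocalendar()
        buckets[(iso[0], iso[1])].append(units)
    return {week: sum(values) / len(values) for week, values in sorted(buckets.items())}


# Two made-up weeks of daily predictions, starting on a Monday.
daily = [(date(2025, 1, 6) + timedelta(days=i), 10 if i < 7 else 20) for i in range(14)]
print(weekly_average(daily))
```

Averaging by week smooths out day-of-week swings such as weekend peaks, which makes longer-term trends easier to read on a chart.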

***

## 🤖💡 Preparing for the GenAI Marketing Campaign Generator

Now that we know the top 5 items that we expect to sell the most units, let's extract some of their positive reviews to reinforce the marketing campaign.

We query the reviews table (replace `[REPLACE-VALUE-HERE]` with the `project_ml_...` value printed by the next cell):

In [None]:
print(project_db)

In [None]:
%%sql project.athena
select * from "awsdatacatalog"."[REPLACE-VALUE-HERE]"."product_reviews"

Convert the query results to a pandas DataFrame:

In [None]:
reviews = _.to_pandas()

Here is the base GenAI prompt that we'll use in the next section of the workshop to generate our marketing campaigns:

In [None]:
base_text = """
Using this product data:
Product Name: {product_name}
Description: {product_description}
Reviews: {product_reviews}
Target Persona: Working people

Create personalized marketing content for [target persona]:

1. Personal Hook: Create emotional connection using customer pain points/desires
2. Generate:

* Custom headline addressing [target persona] needs
* 2 personalized social posts highlighting how product fits their lifestyle
* Tailored benefits matching customer profile
* Specific use cases for this customer type
* Personal testimonials from similar customers

Tone: Match [target persona] preferences
Focus: Connect product benefits to personal needs
Include: Targeted solutions and relatable experiences
"""

Here we generate one prompt for each product campaign. Copy the generated output into a notepad, as you will need it for the next lab.

In [None]:
# Generate the prompts
prompts = lab_utilities.generate_marketing_prompts(base_text, top_5_products_predictions, reviews)

# Print the prompts
lab_utilities.print_marketing_prompts(prompts)
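
The internals of `generate_marketing_prompts` are not shown. As a rough idea of what filling such a template involves, the sketch below uses `str.format` with the same three placeholders as `base_text`. The `build_prompt` helper, the review limit, and the sample product are hypothetical.

```python
# Shortened stand-in for base_text, using the same three placeholders.
template = (
    "Product Name: {product_name}\n"
    "Description: {product_description}\n"
    "Reviews: {product_reviews}\n"
    "Create personalized marketing content for [target persona]."
)


def build_prompt(template, name, description, reviews, max_reviews=3):
    """Fill the template, joining a handful of reviews into one bulleted field."""
    review_text = "\n".join(f"- {review}" for review in reviews[:max_reviews])
    return template.format(
        product_name=name,
        product_description=description,
        product_reviews=review_text,
    )


# Made-up product and reviews, for illustration only.
prompt = build_prompt(
    template,
    "Desk Lamp",
    "Adjustable LED lamp",
    ["Bright and sturdy.", "Great value."],
)
print(prompt)
```

Only the curly-brace fields are substituted. The square-bracket `[target persona]` markers pass through unchanged and are left for the language model to resolve from the `Target Persona` line.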