# Sales Forecasting Model: Predicting Future Sales from Historical Data

This notebook builds a sales forecasting model from historical sales data. It enriches the input features with external data found by the **Upgini** platform and trains a **CatBoost** regressor to predict future sales.

## Installing Libraries

Run this cell to install the required packages if they are not already installed.

In [None]:
!pip install -Uq upgini catboost matplotlib seaborn

## Load Data

In [None]:
from os.path import exists
import pandas as pd

# Load dataset: local if exists, else download
df_path = 'train.csv.zip' if exists('train.csv.zip') else 'https://github.com/upgini/upgini/raw/main/notebooks/train.csv.zip'
df = pd.read_csv(df_path)

# Sample 19,000 rows for faster prototyping
df = df.sample(n=19000, random_state=0)

# Convert identifiers to strings
df['store'] = df['store'].astype(str)
df['item'] = df['item'].astype(str)

# Convert date column to datetime
df['date'] = pd.to_datetime(df['date'])

# Sort chronologically
df.sort_values('date', inplace=True)
df.reset_index(inplace=True, drop=True)

df.head()

## Data Preparation

Split the data chronologically: rows dated before 2017-01-01 form the training set, and rows dated 2017-01-01 or later form the test set.

In [None]:
train = df[df['date'] < '2017-01-01']
test = df[df['date'] >= '2017-01-01']

train_features = train.drop(columns=['sales'])
train_target = train['sales']
test_features = test.drop(columns=['sales'])
test_target = test['sales']
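Because the split is chronological, it can be worth confirming that every row lands in exactly one split and that training ends before testing starts. The following is a minimal sketch on a toy frame; the helper name `split_by_date` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def split_by_date(frame, cutoff, date_col="date"):
    """Return (train, test): rows strictly before `cutoff`, and rows on or after it."""
    cutoff = pd.Timestamp(cutoff)
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test


# Toy data standing in for the real sales table (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "sales": [10, 12, 9, 11],
})

toy_train, toy_test = split_by_date(toy, "2017-01-01")

# Every row lands in exactly one split, and training ends before testing starts.
assert len(toy_train) + len(toy_test) == len(toy)
assert toy_train["date"].max() < toy_test["date"].min()
```

The same two assertions could be run on the notebook's `train` and `test` frames to guard against accidental leakage.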

### Feature Enrichment

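Use Upgini to search for external features keyed on the `date` column. The `CVType.time_series` setting tells the enricher to use time-series cross-validation when evaluating candidate features.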

In [None]:
from upgini import FeaturesEnricher, SearchKey
from upgini.metadata import CVType

enricher = FeaturesEnricher(
    search_keys={
        'date': SearchKey.DATE,
    },
    cv=CVType.time_series
)

enricher.fit(train_features,
             train_target,
             eval_set=[(test_features, test_target)])

## Model Training and Evaluation

In [None]:
from catboost import CatBoostRegressor
from catboost.utils import eval_metric

model = CatBoostRegressor(verbose=False, allow_writing_files=False, random_state=0)

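# Report model quality before and after enrichment.
# Note: this report is scored with MAPE, while the manual check below uses SMAPE.
enricher.calculate_metrics(
    train_features, train_target,
    eval_set=[(test_features, test_target)],
    estimator=model,
    scoring='mean_absolute_percentage_error'
)

# Add the selected external features to both splits, keeping the original columns
enriched_train_features = enricher.transform(train_features, keep_input=True)
enriched_test_features = enricher.transform(test_features, keep_input=True)

# Baseline: fit and evaluate on the original features only
model.fit(train_features, train_target)
preds = model.predict(test_features)
# eval_metric returns a list of values, so take the first element
print('SMAPE without enrichment:', eval_metric(test_target.values, preds, 'SMAPE')[0])

# Enriched: refit the same model on the enriched features
model.fit(enriched_train_features, train_target)
enriched_preds = model.predict(enriched_test_features)
print('SMAPE with enrichment:', eval_metric(test_target.values, enriched_preds, 'SMAPE')[0])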
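For readers unfamiliar with SMAPE, here is a minimal pure-Python sketch of one common definition, expressed in percent on a 0 to 200 scale. The function name `smape` and the handling of pairs where both values are zero are assumptions made for illustration; CatBoost's own implementation may differ in such edge cases, so treat this as a way to build intuition rather than a replacement for `eval_metric`.

```python
def smape(actual, predicted):
    """Symmetric MAPE in percent (0 to 200), skipping pairs where both values are 0."""
    total = 0.0
    count = 0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        if denom == 0:
            # Assumption: ignore 0/0 pairs; libraries may treat them differently.
            continue
        total += abs(a - p) / denom
        count += 1
    return 100 * total / count if count else 0.0


# Toy values (made up): each prediction is off by roughly 10 percent.
print(smape([100, 200], [110, 180]))
```

Unlike MAPE, the denominator averages the actual and predicted magnitudes, so over- and under-predictions of the same size are penalized more evenly.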

## Visualization and Analysis

Let's visualize the actual vs predicted sales to get more insight into model performance.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot actual vs predicted sales
plt.figure(figsize=(12,6))
sns.scatterplot(x=test_target, y=preds, alpha=0.6)
plt.plot([test_target.min(), test_target.max()], [test_target.min(), test_target.max()], 'r--')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales (Without Enrichment)')
plt.title('Actual vs Predicted Sales - Base Model')
plt.show()

In [None]:
# Plot actual vs enriched predictions
plt.figure(figsize=(12,6))
sns.scatterplot(x=test_target, y=enriched_preds, alpha=0.6, color='green')
plt.plot([test_target.min(), test_target.max()], [test_target.min(), test_target.max()], 'r--')
plt.xlabel('Actual Sales')
plt.ylabel('Predicted Sales (With Enrichment)')
plt.title('Actual vs Predicted Sales - Enriched Model')
plt.show()

### Insight:

- If the enriched model's points cluster more tightly around the diagonal than the base model's, its predictions are more accurate; confirm this against the SMAPE values printed earlier.
- Points closer to the dashed red diagonal represent better predictions.
- A lower SMAPE with enrichment would indicate that the external features from Upgini carry information that helps the model generalize to the test period.
- Outliers far from the diagonal suggest that further tuning or more features might improve results.
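How tightly the points hug the diagonal can also be put into numbers rather than judged by eye. The sketch below computes the mean absolute error and the Pearson correlation for two sets of predictions; the helper names and the toy arrays are illustrative assumptions, and in the notebook one would pass `test_target.values` with `preds` or `enriched_preds` instead.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute distance between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy values (made up) standing in for actual sales and two models' predictions.
actual = [10, 20, 30, 40]
base_preds = [14, 15, 36, 33]
enriched_toy_preds = [11, 19, 32, 39]

for name, predicted in [("base", base_preds), ("enriched", enriched_toy_preds)]:
    print(name, mean_absolute_error(actual, predicted), pearson_r(actual, predicted))
```

A lower error together with a higher correlation is what a tighter scatter around the diagonal looks like numerically.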

### Next steps:

- Explore additional features or external datasets.
- Experiment with hyperparameter tuning of CatBoost.
- Consider other advanced time-series models or ensembling techniques.
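As one possible starting point for the first item, simple calendar features can be derived from the existing `date` column with pandas alone. This is a sketch under assumptions: the helper name `add_calendar_features` and the chosen columns are hypothetical, and whether they help beyond what Upgini already supplies would need to be checked on the test set.

```python
import pandas as pd


def add_calendar_features(frame, date_col="date"):
    """Return a copy of `frame` with basic calendar columns derived from `date_col`."""
    out = frame.copy()
    dates = out[date_col].dt
    out["day_of_week"] = dates.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = dates.month
    out["day_of_year"] = dates.dayofyear
    out["is_weekend"] = dates.dayofweek >= 5
    return out


# Toy frame (made up): 2017-01-01 was a Sunday, 2017-01-02 a Monday.
toy = pd.DataFrame({"date": pd.to_datetime(["2017-01-01", "2017-01-02"])})
toy_features = add_calendar_features(toy)
print(toy_features)
```

The function copies its input, so the original frame is left unchanged and the new columns can be compared against the baseline features.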