# Sales Forecasting with the Coptic Library

This notebook demonstrates how to use the Coptic forecasting library for time series prediction. We'll walk through the complete process, from loading the data to saving the best model.

## What we'll cover:
1. Loading and exploring sample sales data
2. Assessing data quality and cleaning
3. Splitting the data chronologically into training and test sets
4. Training three forecasting models and generating forecasts
5. Evaluating model performance
6. Comparing the models
7. Visualizing all forecasts together
8. Forecasting beyond the test period
9. Saving the best model

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
# Suppress all warnings to keep the output readable; remove this line when debugging
warnings.filterwarnings('ignore')

# Import coptic library
from coptic import CopticForecaster
from coptic.preprocessing import DataCleaner, FeatureGenerator
from coptic.utils.metrics import calculate_metrics, forecast_accuracy_summary
from coptic.utils.plot import plot_forecast, plot_multiple_forecasts

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 6)

In [None]:
# Load sample sales data
df = pd.read_csv('../datasets/sample_data.csv')
df['date'] = pd.to_datetime(df['date'])

print(f"Dataset shape: {df.shape}")
print(f"Date range: {df['date'].min()} to {df['date'].max()}")
print("\nFirst few rows:")
df.head()

In [None]:
# Visualize the raw data
plt.figure(figsize=(15, 6))
plt.plot(df['date'], df['sales'], linewidth=1.5)
plt.title('Daily Sales Data')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Basic statistics
print("Sales Statistics:")
print(df['sales'].describe())

## 2. Data Cleaning and Quality Assessment

In [None]:
# Initialize data cleaner
cleaner = DataCleaner(
    remove_outliers=False,  # We'll check for outliers first
    fill_method='interpolate'
)

# Get data quality report
quality_report = cleaner.get_data_quality_report(df, 'date', 'sales')

print("Data Quality Report:")
print(f"Total rows: {quality_report['dataset_info']['total_rows']}")
print(f"Missing values in sales: {quality_report['missing_values']['target_missing']}")
print(f"Duplicate dates: {quality_report['duplicates']['duplicate_dates']}")
print("\nOutlier detection (IQR method):")
print(f"Number of outliers: {quality_report['outliers']['iqr_outliers']['count']}")
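
The report above counts outliers with the IQR method. The notebook does not show how Coptic computes that count, so the following is only a minimal sketch of the conventional rule: a value is flagged when it falls more than 1.5 × IQR below the first quartile or above the third. The function name `iqr_outliers`, the multiplier `k`, and the sample values are illustrative assumptions, not part of the library.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between observed data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily sales with one obvious spike
sample_sales = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(sample_sales))
```

Different quartile conventions can shift the bounds slightly on small samples, so a count produced this way may not match the library's report exactly.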

In [None]:
# Clean the data (minimal cleaning since our sample data is already clean)
df_clean = cleaner.clean(df, 'date', 'sales')

print(f"Original data shape: {df.shape}")
print(f"Cleaned data shape: {df_clean.shape}")

## 3. Train-Test Split

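For time series, the split must be chronological: the model trains on the earliest 80% of the data and is tested on the most recent 20%, so no information from the future leaks into training.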

In [None]:
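# Split data: 80% for training, 20% for testing
# Sort by date first so the split is chronological even if the rows are unordered
df_clean = df_clean.sort_values('date').reset_index(drop=True)
split_idx = int(len(df_clean) * 0.8)

train_df = df_clean.iloc[:split_idx].copy()
test_df = df_clean.iloc[split_idx:].copy()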

print(f"Training data: {len(train_df)} samples")
print(f"Test data: {len(test_df)} samples")
print(f"Training period: {train_df['date'].min()} to {train_df['date'].max()}")
print(f"Test period: {test_df['date'].min()} to {test_df['date'].max()}")

In [None]:
# Visualize train-test split
plt.figure(figsize=(15, 6))
plt.plot(train_df['date'], train_df['sales'], label='Training', color='blue')
plt.plot(test_df['date'], test_df['sales'], label='Test', color='orange')
plt.axvline(x=train_df['date'].max(), color='red', linestyle='--', alpha=0.7, label='Split Point')
plt.title('Train-Test Split')
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 4. Model Training and Forecasting

We'll train multiple models and compare their performance.

### 4.1 Random Forest Model

In [None]:
# Train Random Forest model
rf_forecaster = CopticForecaster(
    model_type="randomforest",
    n_estimators=100,
    max_depth=10,
    random_state=42
)

print("Training Random Forest model...")
rf_forecaster.fit(train_df, date_col='date', target_col='sales')

# Generate forecasts for the test period
rf_forecast = rf_forecaster.predict(periods=len(test_df), freq='D')

print(f"Generated {len(rf_forecast)} forecasts")
rf_forecast.head()

In [None]:
# Plot Random Forest results
fig = rf_forecaster.plot(figsize=(15, 8))
plt.title('Random Forest Forecast')
plt.show()

# Plot feature importance
rf_forecaster.plot_feature_importance(top_n=10)
plt.show()

### 4.2 XGBoost Model

In [None]:
# Train XGBoost model
xgb_forecaster = CopticForecaster(
    model_type="xgboost",
    n_estimators=100,
    learning_rate=0.1,
    max_depth=6,
    random_state=42
)

print("Training XGBoost model...")
xgb_forecaster.fit(train_df, date_col='date', target_col='sales')

# Generate forecasts
xgb_forecast = xgb_forecaster.predict(periods=len(test_df), freq='D')

print(f"Generated {len(xgb_forecast)} forecasts")
xgb_forecast.head()

In [None]:
# Plot XGBoost results
fig = xgb_forecaster.plot(figsize=(15, 8))
plt.title('XGBoost Forecast')
plt.show()

# Plot feature importance
xgb_forecaster.plot_feature_importance(top_n=10)
plt.show()

### 4.3 Prophet Model

In [None]:
# Train Prophet model
prophet_forecaster = CopticForecaster(
    model_type="prophet",
    seasonality_mode='additive',
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False
)

print("Training Prophet model...")
prophet_forecaster.fit(train_df, date_col='date', target_col='sales')

# Generate forecasts
prophet_forecast = prophet_forecaster.predict(periods=len(test_df), freq='D')

print(f"Generated {len(prophet_forecast)} forecasts")
prophet_forecast.head()

In [None]:
# Plot Prophet results
fig = prophet_forecaster.plot(figsize=(15, 8))
plt.title('Prophet Forecast')
plt.show()

# Plot Prophet components
prophet_forecaster.plot_components(figsize=(15, 10))
plt.show()

## 5. Model Evaluation

Let's evaluate each model's performance on the test set.

In [None]:
# Evaluate Random Forest
rf_metrics = rf_forecaster.evaluate(test_df)
print("Random Forest Performance:")
print(forecast_accuracy_summary(rf_metrics))
print("\n" + "="*50 + "\n")

In [None]:
# Evaluate XGBoost
xgb_metrics = xgb_forecaster.evaluate(test_df)
print("XGBoost Performance:")
print(forecast_accuracy_summary(xgb_metrics))
print("\n" + "="*50 + "\n")

In [None]:
# Evaluate Prophet
prophet_metrics = prophet_forecaster.evaluate(test_df)
print("Prophet Performance:")
print(forecast_accuracy_summary(prophet_metrics))
print("\n" + "="*50 + "\n")
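
To make the reported numbers easier to interpret, here is a small sketch of how MAE, RMSE, MAPE, and R² are conventionally defined. It is not Coptic's implementation: the function name `regression_metrics` and the sample values are assumptions, and the library may differ in details such as whether MAPE is expressed as a percentage or how zero actual values are handled.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, MAPE (in percent), and R^2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # MAPE is undefined where the actual value is zero; this sketch skips those points
    pct = [abs(e / a) for e, a in zip(errors, actual) if a != 0]
    mape = 100 * sum(pct) / len(pct) if pct else float("nan")

    # R^2 compares the model's squared error with that of always predicting the mean
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}


# Hypothetical actual and forecast values
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(regression_metrics(actual, predicted))
```

Lower is better for MAE, RMSE, and MAPE, while higher is better for R². RMSE penalizes large misses more heavily than MAE because errors are squared before averaging.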

## 6. Model Comparison

In [None]:
# Create comparison dataframe
comparison_data = {
    'Random Forest': rf_metrics,
    'XGBoost': xgb_metrics,
    'Prophet': prophet_metrics
}

comparison_df = pd.DataFrame(comparison_data).T

# Select key metrics for comparison
key_metrics = ['mae', 'rmse', 'mape', 'r2']
comparison_summary = comparison_df[key_metrics].round(4)

print("Model Comparison Summary:")
print(comparison_summary)

In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

metrics_to_plot = ['mae', 'rmse', 'mape', 'r2']
titles = ['Mean Absolute Error (MAE)', 'Root Mean Square Error (RMSE)', 
          'Mean Absolute Percentage Error (MAPE)', 'R-squared']

for i, (metric, title) in enumerate(zip(metrics_to_plot, titles)):
    ax = axes[i//2, i%2]
    comparison_summary[metric].plot(kind='bar', ax=ax, color=['skyblue', 'lightgreen', 'coral'])
    ax.set_title(title)
    ax.set_ylabel(metric.upper())
    ax.tick_params(axis='x', rotation=45)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Visualize All Forecasts Together

In [None]:
# Compare all forecasts visually
forecasts_dict = {
    'Random Forest': rf_forecast,
    'XGBoost': xgb_forecast,
    'Prophet': prophet_forecast
}

fig = plot_multiple_forecasts(
    train_df, 
    forecasts_dict, 
    date_col='date', 
    target_col='sales',
    title='Model Comparison - All Forecasts',
    figsize=(18, 8)
)

# Add actual test values for comparison
plt.plot(test_df['date'], test_df['sales'], 
         label='Actual Test', color='black', linewidth=2, linestyle='--')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

## 8. Future Forecasting

Now let's use the best-performing model, chosen by lowest RMSE on the test set, to generate forecasts beyond the end of our data.

In [None]:
# Determine best model based on RMSE
best_model_name = comparison_summary['rmse'].idxmin()
print(f"Best performing model: {best_model_name} (RMSE: {comparison_summary.loc[best_model_name, 'rmse']:.2f})")

# Select the best model
forecasters = {
    'Random Forest': rf_forecaster,
    'XGBoost': xgb_forecaster,
    'Prophet': prophet_forecaster
}
best_model = forecasters[best_model_name]

In [None]:
# Retrain best model on full dataset
print(f"Retraining {best_model_name} on full dataset...")
best_model.fit(df_clean, date_col='date', target_col='sales')

# Generate 30-day future forecast
future_forecast = best_model.predict(periods=30, freq='D')

print(f"Generated {len(future_forecast)}-day future forecast")
print(f"Forecast period: {future_forecast['date'].min()} to {future_forecast['date'].max()}")
future_forecast.head()

In [None]:
# Plot future forecast
fig = best_model.plot(figsize=(18, 8))
plt.title(f'{best_model_name} - 30-Day Future Forecast')
plt.show()

# Print forecast statistics
print("Future Forecast Statistics:")
print(f"Mean forecasted sales: ${future_forecast['yhat'].mean():.2f}")
print(f"Min forecasted sales: ${future_forecast['yhat'].min():.2f}")
print(f"Max forecasted sales: ${future_forecast['yhat'].max():.2f}")

## 9. Save the Best Model

In [None]:
# Save the best model for future use
model_filename = f'best_sales_forecaster_{best_model_name.lower().replace(" ", "_")}.pkl'
best_model.save(model_filename)

print(f"Best model saved as: {model_filename}")

# Show model information
model_info = best_model.get_model_info()
print("\nModel Information:")
for key, value in model_info.items():
    print(f"{key}: {value}")

## 10. Summary and Conclusions

In this notebook, we demonstrated the complete workflow for time series forecasting using the Coptic library:

1. **Data Loading and Exploration**: Loaded sample sales data and explored its characteristics.
2. **Data Quality Assessment**: Checked for missing values, duplicate dates, and outliers with the built-in quality report.
3. **Model Training**: Trained three forecasting models (Random Forest, XGBoost, Prophet) on a chronological 80/20 split.
4. **Model Evaluation**: Compared the models on the test set using MAE, RMSE, MAPE, and R².
5. **Future Forecasting**: Retrained the model with the lowest test RMSE on the full dataset and generated a 30-day forecast.
6. **Model Persistence**: Saved the trained model for future use.

### Key Takeaways:
- The Coptic library provides a unified interface for multiple forecasting algorithms
- Automatic feature engineering saves time and can improve model performance
- Built-in evaluation metrics make model comparison straightforward
- Visualization tools help understand model behavior and forecast quality
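
The notebook never shows which features Coptic derives for the tree-based models, so the sketch below only illustrates the general idea of automatic feature engineering for a daily series: calendar fields plus lagged and rolling values of the target. The function name `add_time_features`, the chosen lags, and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def add_time_features(frame, date_col="date", target_col="sales", lags=(1, 7)):
    """Add calendar, lag, and rolling-mean columns; drop rows without full history."""
    out = frame.sort_values(date_col).reset_index(drop=True)
    out["day_of_week"] = out[date_col].dt.dayofweek  # Monday = 0
    out["month"] = out[date_col].dt.month
    for lag in lags:
        out[f"lag_{lag}"] = out[target_col].shift(lag)
    # Shift by one day first so the rolling mean only uses past values
    out["rolling_mean_7"] = out[target_col].shift(1).rolling(7).mean()
    return out.dropna().reset_index(drop=True)


# Hypothetical two-week series
toy = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=14, freq="D"),
    "sales": [float(v) for v in range(14)],
})
features = add_time_features(toy)
print(features.head())
```

Shifting before computing the rolling mean matters: without it, each row's feature would include that same day's sales, which leaks the target into the inputs.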

### Next Steps:
- Try different model parameters for optimization
- Experiment with custom feature engineering
- Use cross-validation for more robust model evaluation
- Implement automated model selection and hyperparameter tuning
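
For the cross-validation step, ordinary shuffled k-fold is unsuitable for time series because it would train on data that comes after the test rows. A common alternative is an expanding-window (rolling-origin) scheme, sketched below in plain Python. The function name `expanding_window_splits` and its parameters are hypothetical; the notebook does not indicate whether Coptic offers a built-in equivalent.

```python
def expanding_window_splits(n_samples, n_splits=3, test_size=2):
    """Yield (train_indices, test_indices); each training window ends before its test window."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# Ten observations, three folds of two test points each
for train_idx, test_idx in expanding_window_splits(10):
    print(f"train 0..{train_idx[-1]}  test {test_idx}")
```

In this notebook, each fold would fit a fresh forecaster on `df_clean.iloc[train_idx]` and evaluate it on `df_clean.iloc[test_idx]`, and the per-fold metrics would then be averaged.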