1. 数据清洗和处理：对历史数据进行清洗和处理，包括去除异常值、缺失值处理等。此外，还需要将数据按照时间序列的方式进行排序。
2. 时间序列分解：将时间序列数据分解为趋势、季节和随机成分。这可以通过拟合加法模型或乘法模型来实现。其中加法模型假定季节成分与趋势成分之和等于原始数据，而乘法模型假定季节成分与趋势成分的乘积等于原始数据。
3. 模型选择和拟合：选择合适的时间序列模型对趋势、季节和随机成分进行拟合。一般常用的模型包括ARIMA模型、指数平滑模型等。
4. 模型诊断：对拟合好的模型进行诊断，检验其残差是否符合正态分布、是否存在自相关性等。
5. 模型预测：使用已经拟合好的模型进行未来需求量的预测，并计算预测精度。

针对本问题，建议分别采用日、周、月三种时间粒度进行预测，通过比较预测结果的误差，得出不同粒度对预测精度的影响。
我们将以月为粒度对训练数据进行聚合，并提取出需要预测的销售区域、产品、产品品类和产品细品的所有组合。对于每个组合，我们将月销售量作为标签，其他特征包括销售区域、产品、产品品类、产品细品以及月份，我们将这些特征进行独热编码处理。我们将最后 3 个月的数据作为验证集，其余数据作为训练集。

接下来，我们将使用ARIMAX来建立模型并进行预测。

# 更新时间：2023-4-21
注意：以下程序我使用1000条数据跑的，请自己训练的使用全部训练数据
以下是以月为时间粒度，去掉我以下的注释，就可以实现周和日为粒度

In [7]:

import pandas as pd
import numpy as np
import xgboost as xgb

# 读取训练集
order_train_df = pd.read_csv('data/order_train1.csv')

# 定义目标函数和训练参数
def rmspe(y_pred, y_true):
    return 'RMSPE', np.sqrt(np.mean(np.square((y_true - y_pred) / y_true))), False
# 提取训练集中需要的列
train_df = order_train_df[['order_date', 'sales_region_code', 'item_code','first_cate_code','second_cate_code','ord_qty']]

# 按时间粒度和产品分组，并计算月销售量
train_df['order_date'] = pd.to_datetime(train_df['order_date']) # 解决时间格式问题
# 以月为时间粒度
train_df['month'] = train_df['order_date'].dt.month
train_df = train_df.groupby(['sales_region_code', 'item_code', 'first_cate_code','second_cate_code','month'], as_index=False).agg({'ord_qty': 'sum'})

# 以日为时间粒度
# train_df['day'] = train_df['order_date'].dt.day
# train_df = train_df.groupby(['sales_region_code', 'item_code', 'first_cate_code','second_cate_code','day'], as_index=False).agg({'ord_qty': 'sum'})

# # 以周为时间粒度
# train_df['week'] = train_df['order_date'].dt.week
# train_df = train_df.groupby(['sales_region_code', 'item_code', 'first_cate_code','second_cate_code','week'], as_index=False).agg({'ord_qty': 'sum'})


# 将月销售量转换为目标变量，即下月销售量
train_df['label'] = train_df['ord_qty'].shift(-1)

# 将数据集拆分为训练集和验证集
train_size = int(len(train_df) * 0.8)
train_data = train_df[:train_size]
valid_data = train_df[train_size:]

# 100条训练数据（以下程序我使用1000条数据跑的，请自己训练的使用全部训练数据）
# x_train = train_data.drop(['ord_qty', 'label'], axis=1)[0:100]
# y_train = train_data['label'][0:100]

# 全部训练数据
x_train = train_data.drop(['ord_qty', 'label'], axis=1)
y_train = train_data['label']


x_val = valid_data.drop(['ord_qty', 'label'], axis=1)[0:-1]
y_val = valid_data['label'][0:-1]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df['order_date'] = pd.to_datetime(train_df['order_date']) # 解决时间格式问题
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df['month'] = train_df['order_date'].dt.month


In [13]:
x_train

Unnamed: 0,sales_region_code,item_code,first_cate_code,second_cate_code,month
0,101,20001,302,408,3
1,101,20001,302,408,5
2,101,20002,303,406,3
3,101,20002,303,406,4
4,101,20002,303,406,5
...,...,...,...,...,...
95,101,20014,307,403,8
96,101,20014,307,403,9
97,101,20014,307,403,10
98,101,20014,307,403,11


In [10]:
params = {'booster': 'gbtree',
          'objective': 'reg:squarederror', # 使用回归损失函数
          'eval_metric': 'rmspe',
          'gamma': 0.1,
          'min_child_weight': 1,
          'max_depth': 10,
          'lambda': 10,
          'subsample': 0.7,
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'eta': 0.03,
          'tree_method': 'exact',
          'seed': 0}

model = xgb.XGBClassifier(params)
model.fit(x_train, y_train, verbose=False)





In [27]:
test_data = pd.read_csv('data/predict_sku1.csv')
test_data_res = test_data.copy()
test_data['month'] = [1]*len(test_data)
test_data_res['2019年1月预测需求量'] = model.predict(test_data)
test_data = pd.read_csv('data/predict_sku1.csv')
test_data['month'] = [2]*len(test_data)
test_data_res['2019年2月预测需求量'] = model.predict(test_data)
test_data = pd.read_csv('data/predict_sku1.csv')
test_data['month'] = [3]*len(test_data)
test_data_res['2019年3月预测需求量'] = model.predict(test_data)
test_data_res



Unnamed: 0,sales_region_code,item_code,first_cate_code,second_cate_code,2019年1月预测需求量,2019年2月预测需求量,2019年3月预测需求量
0,101,20002,303,406,19.0,19.0,19.0
1,101,20003,301,405,396.0,396.0,396.0
2,101,20006,307,403,402.0,402.0,402.0
3,101,20011,303,401,19.0,19.0,19.0
4,101,20014,307,403,19.0,19.0,19.0
...,...,...,...,...,...,...,...
2614,105,22066,307,403,19.0,19.0,19.0
2615,105,22072,305,412,2.0,2.0,2.0
2616,105,22075,307,403,19.0,19.0,19.0
2617,105,22083,303,401,45.0,45.0,45.0


In [28]:

# 保存结果
test_data_res.to_excel('data/result1.xlsx', index=False)

# 二、使用ARIMAX 模型

In [7]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# 读取训练集
order_train_df = pd.read_csv('data/order_train1.csv')

# 定义目标函数和训练参数
def rmspe(y_pred, y_true):
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

# 提取训练集中需要的列
train_df = order_train_df[['order_date', 'sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code', 'ord_qty']]

# 按时间粒度和产品分组，并计算月销售量
train_df['order_date'] = pd.to_datetime(train_df['order_date']) # 解决时间格式问题
train_df['month'] = train_df['order_date'].dt.month
train_df = train_df.groupby(['sales_region_code', 'item_code', 'first_cate_code', 'second_cate_code', 'month'], as_index=False).agg({'ord_qty': 'sum'})

# 将月销售量转换为目标变量，即下月销售量
train_df['label'] = train_df['ord_qty'].shift(-1)

# 将数据集拆分为训练集和验证集
train_size = int(len(train_df) * 0.8)
# 100条数据
train_data = train_df[:train_size][0:100]
# 全部数据
# train_data = train_df[:train_size]
valid_data = train_df[train_size:]

# 将特征编码为哑变量
x_train = train_data.drop(['ord_qty', 'label'], axis=1)
y_train = train_data['label']

x_val = valid_data.drop(['ord_qty', 'label'], axis=1)
y_val = valid_data['label'][:-1]

# 训练ARIMAX模型
model = sm.tsa.statespace.SARIMAX(endog=y_train, exog=x_train, order=(1, 0, 1), seasonal_order=(1, 1, 1, 12), enforce_stationarity=False)
model_fit = model.fit()





A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df['order_date'] = pd.to_datetime(train_df['order_date']) # 解决时间格式问题
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  train_df['month'] = train_df['order_date'].dt.month


In [9]:

# 预测验证集数据
y_pred = model_fit.predict(exog=x_val)

# 计算RMSPE
print('RMSPE:', rmspe(y_pred, y_val))

RMSPE: nan


In [14]:
test_data = pd.read_csv('data/predict_sku1.csv')
test_data_res = test_data.copy()
test_data['month'] = [1]*len(test_data)
test_data
test_data_res['2019年1月预测需求量'] = model_fit.predict(exog=test_data)
test_data = pd.read_csv('data/predict_sku1.csv')
test_data['month'] = [2]*len(test_data)
test_data_res['2019年2月预测需求量'] = model_fit.predict(exog=test_data)
test_data = pd.read_csv('data/predict_sku1.csv')
test_data['month'] = [3]*len(test_data)
test_data_res['2019年3月预测需求量'] = model_fit.predict(exog=test_data)
test_data_res



Unnamed: 0,sales_region_code,item_code,first_cate_code,second_cate_code,2019年1月预测需求量,2019年2月预测需求量,2019年3月预测需求量
0,101,20002,303,406,4.137609e+06,4.137609e+06,4.137609e+06
1,101,20003,301,405,2.302688e+06,2.302688e+06,2.302688e+06
2,101,20006,307,403,1.728433e+06,1.728433e+06,1.728433e+06
3,101,20011,303,401,1.398033e+06,1.398033e+06,1.398033e+06
4,101,20014,307,403,1.216422e+06,1.216422e+06,1.216422e+06
...,...,...,...,...,...,...,...
2614,105,22066,307,403,,,
2615,105,22072,305,412,,,
2616,105,22075,307,403,,,
2617,105,22083,303,401,,,
