## Created by <a href="https://github.com/yunsuxiaozi">yunsuxiaozi</a> 2025/03/09

## Competition:<a href="https://www.kaggle.com/competitions/rohlik-sales-forecasting-challenge-v2/overview">Rohlik Sales Forecasting Challenge</a>

## 原始代码 <a href="https://www.kaggle.com/code/hardyxu52/simplified-3rd-place-solution-rohlik-sales/notebook?scriptVersionId=222614435">Simplified 3rd Place Solution - Rohlik Sales</a>

## 这里默认学习这个top方案的人都是有一定基础渴望提高的人,所以这里不会讲特别基础的东西。读者可以自行补充基础知识。

#### 更多数据挖掘比赛的top方案可以关注<a href="https://github.com/yunsuxiaozi/AI-and-competition">这里</a>。

## 方案亮点分析(我学到了什么?)

1.使用2022年1月1日以后的数据来训练模型。

2.Z-score处理(x-mean)/std

3.指数加权移动平均

4.模型训练前后的开方和平方处理。

5.其他一些离散的特征工程。

导入一些常用的库.

In [1]:
import pandas as pd#读取csv文件
import numpy as np#进行矩阵运算的库
from copy import deepcopy#浅拷贝,改copy数据,原始数据也会改,深拷贝,改copy数据原始数据不会改.
from sklearn.model_selection import KFold#k折交叉验证.
from xgboost import XGBRegressor, DMatrix#导入xgboost模型
import warnings#avoid some negligible errors
#The filterwarnings () method is used to set warning filters, which can control the output method and level of warning information.
warnings.filterwarnings('ignore')

import random#provide some function to generate random_seed.
#set random seed,to make sure model can be recurrented.
def seed_everything(seed):
    np.random.seed(seed)#numpy's random seed
    random.seed(seed)#python built-in random seed
seed_everything(seed=2025)

导入库存的表格。

观察表格可以发现,product_unique_id和name是一一对应的,unique_id和(warehouse和product_unique_id的组合)是一一对应的,所以这里在读取的时候drop warehouse和product_unique_id,因为这些已经可以用unique_id表示了.

## 这里需要澄清的是有2种产品。一种是细分的产品,如:'Pastry_196',在一种产品的基础上还要具体到某种种类。一种是粗略的产品,就是后面的'common_name',也就是取name的前半部分.

In [2]:
inventory = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/inventory.csv').drop(['warehouse','product_unique_id'],axis=1)
inventory.head()

Unnamed: 0,unique_id,name,L1_category_name_en,L2_category_name_en,L3_category_name_en,L4_category_name_en
0,5255,Pastry_196,Bakery,Bakery_L2_14,Bakery_L3_26,Bakery_L4_1
1,4948,Herb_19,Fruit and vegetable,Fruit and vegetable_L2_30,Fruit and vegetable_L3_86,Fruit and vegetable_L4_1
2,2146,Beet_2,Fruit and vegetable,Fruit and vegetable_L2_3,Fruit and vegetable_L3_65,Fruit and vegetable_L4_34
3,501,Chicken_13,Meat and fish,Meat and fish_L2_13,Meat and fish_L3_27,Meat and fish_L4_5
4,4461,Chicory_1,Fruit and vegetable,Fruit and vegetable_L2_17,Fruit and vegetable_L3_33,Fruit and vegetable_L4_1


这里是针对日历或者说节日构造的特征。

作者最后只保留了holiday,day_before_holiday和day_after_holiday这3个二元分类(bool)变量,也就是这个日期是节日当天、前一天和后一天的特征。

['last_holiday_date','next_holiday_date']是字符串所以删掉。

['days_since_last_holiday','days_to_next_holiday'],我觉得有用,但可能作者做过实验,发现效果不好,所以删了?

['shops_closed','winter_school_holidays','school_holidays','holiday_name']比赛方提供的特征,可能没用?

In [3]:
calendar = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/calendar.csv', parse_dates=['date'])
calendar.loc[calendar['holiday_name'].isna(), 'holiday'] = 0 
calendar['last_holiday_date'] = calendar['date']
calendar['next_holiday_date'] = calendar['date']
calendar.loc[calendar['holiday'] == 0, ['last_holiday_date','next_holiday_date']] = np.nan
calendar['last_holiday_date'] = calendar.sort_values('date').groupby('warehouse')['last_holiday_date'].ffill()
calendar['next_holiday_date'] = calendar.sort_values('date').groupby('warehouse')['next_holiday_date'].bfill()
calendar['days_since_last_holiday'] = ((calendar['date'] - calendar['last_holiday_date']).dt.days)
calendar['days_to_next_holiday'] = ((calendar['next_holiday_date'] - calendar['date']).dt.days)
calendar['day_before_holiday'] = calendar['days_to_next_holiday'] == 1
calendar['day_after_holiday'] = calendar['days_since_last_holiday'] == 1
calendar.drop(['last_holiday_date','next_holiday_date'],axis=1,inplace=True)
calendar.drop(['days_since_last_holiday','days_to_next_holiday'],axis=1,inplace=True)
calendar.drop(['shops_closed','winter_school_holidays','school_holidays','holiday_name'],axis=1,inplace=True)
calendar.head()

Unnamed: 0,date,holiday,warehouse,day_before_holiday,day_after_holiday
0,2022-03-16,0,Frankfurt_1,False,False
1,2020-03-22,0,Frankfurt_1,False,False
2,2018-02-07,0,Frankfurt_1,False,False
3,2018-08-10,0,Frankfurt_1,False,False
4,2017-10-26,0,Prague_2,False,False


这里构造的是常见的时间特征,没什么好说的.

In [4]:
def fe_date(df):
    df['year'] = df['date'].dt.year
    df['day_of_week'] = df['date'].dt.dayofweek
    df['days_since_2020'] = (df['date'] - pd.to_datetime('2020-01-01')).dt.days.astype('int')
    df['day_of_year'] = df['date'].dt.dayofyear
    df['cos_day'] = np.cos(df['day_of_year']*2*np.pi/365)
    df['sin_day'] = np.sin(df['day_of_year']*2*np.pi/365)

折扣一般不可能是负数,所以clip是在去除异常值,统计最大折扣的时候没有type_6,这个我没有问过原作者。

sell_price_main是长尾分布,所以取log。common_name是粗略的产品。

然后是一些离散的特征:

每天每个仓库每种粗略的产品中有几种细分的产品(比如土豆有土豆1号,土豆2号,土豆3号3种)

每天每个仓库每种粗略的产品中细分产品的最大折扣的均值

每天每个细分的产品在几家商店有。

In [5]:
def fe_other(df):
    discount_cols = ['type_0_discount','type_1_discount','type_2_discount','type_3_discount','type_4_discount','type_5_discount','type_6_discount']
    df[discount_cols] = df[discount_cols].clip(0)
    df['max_discount'] = df[['type_0_discount','type_1_discount','type_2_discount',
                             'type_3_discount','type_4_discount','type_5_discount']].max(axis=1)
    
    df['sell_price_main'] = np.log(df['sell_price_main']) 

    df['common_name'] = df['name'].apply(lambda x: x[:x.find('_')])
    df['CN_total_products'] = df.groupby(['date','warehouse','common_name'])['unique_id'].transform('nunique')
    df['CN_discount_avg'] = df.groupby(['date','warehouse','common_name'])['max_discount'].transform('mean')
    df['name_num_warehouses'] = df.groupby(['date','name'])['unique_id'].transform('nunique')

这里对sell_price_main和total_orders这2个特征做了重点的特征工程,可能是作者观察到这2个特征特别重要吧.具体做了什么,可以看注释。

普通的窗口平均就是设置一个窗口,然后对这个窗口内的数求平均值。指数加权移动平均则是

$EWMA_{t}=αx_{t}+(1−α)⋅EWMA_{t−1}$

比如一组数据是[1,2,3],α=0.1,则 $EWMA_{0}=1$

$EWMA_{1}=0.1*2+0.9*1=1.1$   $EWMA_{2}=0.1*3+0.9*1.1=1.29$

In [6]:
def fe_combined(df):
    #这里没仔细看代码,根据作者的命名,应该是过去28天这个商店营业了几天
    df['num_sales_days_28D'] = pd.MultiIndex.from_frame(df[['unique_id','date']]).map(df.sort_values('date').groupby('unique_id').rolling(
        window='28D', on='date', closed='left')['date'].count().fillna(0))

    print("< sell_price_main features >")
    #这里首先对sell_price_main特征得到了标准化以后的数值(x-mean)/std
    #然后减去每天每个仓库的 price-scaled的均值,看的是这个sell_price_main在当天这个仓库的相对的位置.
    mean_prices = df.groupby(df['unique_id'])['sell_price_main'].mean()
    std_prices = df.groupby(df['unique_id'])['sell_price_main'].std()
    df['price_scaled'] = np.where(df['unique_id'].map(std_prices) == 0, 0, 
                                  (df['sell_price_main'] - df['unique_id'].map(mean_prices))/df['unique_id'].map(std_prices))
    #days_since_2020其实和date代表的意思是相同的,只是表现的形式不同.
    df['price_detrended'] = df['price_scaled'] - df.groupby(['days_since_2020','warehouse'])['price_scaled'].transform('mean')
    df.drop('price_scaled',axis=1,inplace=True)

    print("< total orders features >")
    #每天每个商店total_orders的中位数
    warehouse_stats = df.groupby(['date','warehouse'])['total_orders'].median().rename('med_total_orders').reset_index().sort_values('date')
    #对每个商店的total_orders的中位数进行了指数移动平均
    warehouse_stats['ewmean_orders_56'] = warehouse_stats.groupby('warehouse')['med_total_orders'].transform(lambda x:x.ewm(alpha=1/56).mean())
    df['ewmean_orders_56'] = pd.MultiIndex.from_frame(df[['warehouse','date']]).map(
        warehouse_stats.set_index(['warehouse','date'])['ewmean_orders_56'])
    #每个商店在14天的窗口的med_total_orders的普通平均.
    df['mean_orders_14d'] = pd.MultiIndex.from_frame(df[['warehouse','date']]).map(
        warehouse_stats.groupby('warehouse').rolling(on='date',window='14D')['med_total_orders'].mean())
    return df

这里主要就是读取训练数据和测试数据,然后把fe_date,fe_other和fe_combined过一遍。

In [7]:
train = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_train.csv', parse_dates=['date'])
train['id'] = train['unique_id'].astype('str') + '_' + train['date'].astype('str')
train.set_index('id',inplace=True)
train = train[~train['sales'].isna()]
train = train.reset_index().merge(inventory, on='unique_id').set_index('id').loc[train.index]
train = train.reset_index().merge(calendar, on=['date','warehouse']).set_index('id').loc[train.index]
fe_date(train)
fe_other(train)

test = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/sales_test.csv', parse_dates=['date'])
test['id'] = test['unique_id'].astype('str') + '_' + test['date'].astype('str')
test.set_index('id',inplace=True)
test = test.reset_index().merge(inventory, on='unique_id').set_index('id').loc[test.index]
test = test.reset_index().merge(calendar, on=['date','warehouse']).set_index('id')
fe_date(test)
fe_other(test)

all_data = pd.concat([train,test])
all_data = fe_combined(all_data)
train = all_data.loc[train.index]
test = all_data.loc[test.index].drop(['sales','availability'],axis=1)

< sell_price_main features >
< total orders features >


去除训练数据:X_train,y_train,X_train_weights. sales是需要预测的变量,availability是测试数据没有,所以删除掉。

In [8]:
X_train = train.drop(['sales','availability'],axis=1)
y_train = train['sales']
weights = pd.read_csv('/kaggle/input/rohlik-sales-forecasting-challenge-v2/test_weights.csv').set_index('unique_id')
X_train_weights = X_train['unique_id'].map(weights['weight'])

这里其实增加的就是add_cols这3个特征.

last_sales_ema005:每个仓库每种细分的商品的sales的指数加权移动平均。

CN_sales_sum:每个仓库每天对每个粗略的商品的last_sales_ema005求和.

last_sales_zs:对last_sales_ema005进行了z-score处理.

这里选择了2022年1月1日以后的数据进行了训练.

In [9]:
cat_cols = ['unique_id'] + list(X_train.columns[X_train.dtypes == 'object'])
all_data = pd.concat([X_train, test])
add_cols = ['last_sales_ema005','CN_sales_sum','last_sales_zs']

train_cp = train.groupby('unique_id')['date'].apply(lambda s: pd.date_range(s.min(), test.date.max())).explode().reset_index()
train_cp = train_cp.merge(
    pd.concat([train[['unique_id','date','sales','warehouse']], 
               test[['unique_id','date','warehouse']]]),
    on=['unique_id','date'],how='left')
train_cp = train_cp.merge(inventory, left_on='unique_id', right_index=True)
train_cp['common_name'] = train_cp['name'].apply(lambda x: x[:x.find('_')])
train_cp.sort_values('date',inplace=True)
train_cp['last_sales_ema005'] = train_cp.groupby(['unique_id'])['sales'].transform(lambda x: x.shift(1).ewm(alpha=.005).mean()).fillna(0)
train_cp['CN_sales_sum'] = train_cp.groupby(['common_name','warehouse','date'])['last_sales_ema005'].transform('sum')
all_data = all_data.merge(train_cp.set_index(['unique_id','date'])[[
    'last_sales_ema005','CN_sales_sum'
]], left_on=['unique_id','date'],right_index=True,how='left')
sales_stats = train_cp.groupby(['common_name','warehouse'])['sales'].agg(['mean','std'])
all_data['last_sales_zs'] = (all_data['last_sales_ema005'] - pd.MultiIndex.from_frame(all_data[['common_name','warehouse']]).map(
    sales_stats['mean']))/ pd.MultiIndex.from_frame(all_data[['common_name','warehouse']]).map(sales_stats['std'])

X_train = X_train[X_train['date'] >= '2022-01-01']
y_train = y_train.loc[X_train.index]
X_train_weights = X_train_weights.loc[X_train.index]

X_train[add_cols] = all_data[add_cols]
test[add_cols] = all_data[add_cols]
all_data[cat_cols] = all_data[cat_cols].astype('str').astype('category')

这里就是普通的k折交叉验证,唯一的亮点就是在模型训练之前开平方,预测之后再把平方复原回去。

In [10]:
#xgb模型的参数.
xgb_params = {
    'n_estimators':50000
    ,'learning_rate':0.1
    ,'verbosity':0
    ,'enable_categorical':True
    ,'early_stopping_rounds':10
    ,'random_state':2025
    ,'objective':'reg:squarederror'
    ,'eval_metric':'rmse'
    ,'device':'cuda'
    ,'reg_lambda':0
    ,'min_child_weight':1
}

drop_cols = ['date','name','L1_category_name_en']#一些字符串,去掉
oof_preds = []
test_preds = []
n_splits=5
kf = KFold(n_splits=n_splits,shuffle=True,random_state=2025)
X,y = deepcopy(X_train),deepcopy(y_train)
X[cat_cols] = all_data[cat_cols]
X.drop(drop_cols,axis=1,inplace=True)
test_copy = deepcopy(test)
test_copy[cat_cols] = all_data[cat_cols]
test_copy.drop(drop_cols,axis=1,inplace=True)
oof_pred_df = pd.DataFrame(index=X.index, columns=['Pred_0'])
for i, (idx_t, idx_v) in enumerate(kf.split(X)):
    X_t, X_v = X.iloc[idx_t], X.iloc[idx_v]        
    y_t, y_v = y.loc[X_t.index], y.loc[X_v.index]
 
    y_t, y_v = np.power(y_t,0.5), np.power(y_v,0.5)
    xgb = XGBRegressor(**xgb_params)
    xgb.fit(X_t, y_t, eval_set=[(X_v, y_v)], verbose=1000)
    model_test_preds = np.power(xgb.predict(test_copy).clip(0), 2) 
    test_preds.append(model_test_preds)
    model_oof_preds = np.power(xgb.predict(X_v).clip(0), 2)
    oof_pred_df.iloc[idx_v,int(i/n_splits)] = model_oof_preds
oof_preds.append(oof_pred_df)

[0]	validation_0-rmse:5.86580
[1000]	validation_0-rmse:1.26942
[2000]	validation_0-rmse:1.23773
[2292]	validation_0-rmse:1.23398
[0]	validation_0-rmse:5.91743
[1000]	validation_0-rmse:1.26903
[2000]	validation_0-rmse:1.23575
[2377]	validation_0-rmse:1.23093
[0]	validation_0-rmse:5.88980
[1000]	validation_0-rmse:1.26632
[1645]	validation_0-rmse:1.24039
[0]	validation_0-rmse:5.85311
[1000]	validation_0-rmse:1.26425
[2000]	validation_0-rmse:1.23110
[2106]	validation_0-rmse:1.22932
[0]	validation_0-rmse:5.90635
[1000]	validation_0-rmse:1.27021
[2000]	validation_0-rmse:1.23754
[2119]	validation_0-rmse:1.23524


计算线下的WMAE评估指标,不过由于使用的是普通的kfold,所以线下CV肯定是偏低的.

In [11]:
#评估指标mae
from sklearn.metrics import mean_absolute_error
oof_pred_df = pd.concat(oof_preds,axis=1)
oof_pred_vals = oof_pred_df.mean(axis=1)
print(f'WMAE:{np.round(mean_absolute_error(y_train, oof_pred_vals, sample_weight=X_train_weights), 3)}')

WMAE:13.327


将测试数据的预测结果保存到submission.

In [12]:
test_pred_df = pd.DataFrame(np.transpose(test_preds), index=test.index)
test_sub = test_pred_df.mean(axis=1)
test_sub.name = 'sales_hat'
test_sub.to_csv('submission.csv')