## Created by <a href="https://github.com/yunsuxiaozi/">yunsuxiaozi </a> 2024/9/2

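#### Competition link: <a href="https://www.kaggle.com/competitions/rohlik-orders-forecasting-challenge/overview">Rohlik Orders Forecasting Challenge</a>

#### This notebook ranked 2nd on the public leaderboard and 1st on the private leaderboard. It is mainly a study of the <a href="https://www.kaggle.com/code/darkswordmg/rohlik-2024-2nd-place-solution-single-lgbm">top2 solution</a>. I asked the original author about the parts I did not understand and received helpful explanations.

#### A few notes up front:

#### 1. The original author used many if/else branches to keep the code general. To simplify it, I kept only the code that is actually executed and deleted the branches that never run.

#### 2. I do not know much about European holidays, so many comments only describe what the code does, not why.

#### 3. The comments mostly reflect my own understanding; please point out any mistakes.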


### 1. Import the required Python libraries.

In [1]:
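import pandas as pd  # reading CSV files and working with DataFrames
import numpy as np  # array operations and scientific computing
import re  # regular expressions
from datetime import datetime, timedelta  # datetime handles dates and times; timedelta adds or subtracts days
from sklearn.feature_extraction.text import TfidfVectorizer  # converts text into TF-IDF features
# LightGBM regressor
from lightgbm import LGBMRegressor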

### 2. Read the competition files and convert the 'date' column.

In [2]:
data_path = '/kaggle/input/rohlik-orders-forecasting-challenge'  # path to the competition dataset

# Read the train/test data and the calendar data
df = pd.read_csv(f'{data_path}/train.csv')
test = pd.read_csv(f'{data_path}/test.csv')
df_cld = pd.read_csv(f'{data_path}/train_calendar.csv')
test_cld = pd.read_csv(f'{data_path}/test_calendar.csv')

# pd.to_datetime converts the column to datetime; with errors='coerce', values that cannot be parsed become NaT (Not a Time)
df['date'] = pd.to_datetime(df['date'], errors='coerce')
test['date'] = pd.to_datetime(test['date'], errors='coerce')
df_cld['date'] = pd.to_datetime(df_cld['date'], errors='coerce')
test_cld['date'] = pd.to_datetime(test_cld['date'], errors='coerce')
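
A minimal sketch of what `errors='coerce'` does, using a made-up three-row Series rather than the competition data: the unparseable value becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Toy input (illustrative only): two valid dates and one invalid string.
raw = pd.Series(["2020-12-05", "not a date", "2020-12-07"])

# Invalid entries are coerced to NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.isna().tolist())  # [False, True, False]
```

Counting `parsed.isna().sum()` right after the conversion is a cheap way to see whether any rows were silently coerced.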

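### 3. Rename holidays.

#### 'Memorial Day' appears in several holiday names, so the first two holidays are renamed to make them easier to tell apart. The last holiday name is long and would contribute many tokens to the text-vectorization vocabulary (['Den','vzniku','samostatneho','ceskoslovenskeho','statu']), so it is shortened.

In [3]:
# Rename specific holidays
# "Memorial Day for the Victims of the Holocaust" -> "Victims of the Holocaust"
# "Memorial Day for the Victims of the Communist Dictatorships" -> "Victims of the Communist"
# "Den vzniku samostatneho ceskoslovenskeho statu" (day the independent Czechoslovak state was founded) -> "Den vzniku"
# The first two are renamed because several holidays contain "Memorial Day"; this makes them easier to tell apart.
# "Den vzniku samostatneho ceskoslovenskeho statu" -> "Den vzniku" reduces the number of words, which makes the TF-IDF features simpler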
rename_dict = {
    "Memorial Day for the Victims of the Holocaust": "Victims of the Holocaust",
    "Memorial Day for the Victims of the Communist Dictatorships": "Victims of the Communist",
    "Den vzniku samostatneho ceskoslovenskeho statu": "Den vzniku"
}
df['holiday_name'] = df['holiday_name'].replace(rename_dict)
test['holiday_name'] = test['holiday_name'].replace(rename_dict)
df.head()

Unnamed: 0,warehouse,date,orders,holiday_name,holiday,shutdown,mini_shutdown,shops_closed,winter_school_holidays,school_holidays,blackout,mov_change,frankfurt_shutdown,precipitation,snow,user_activity_1,user_activity_2,id
0,Prague_1,2020-12-05,6895.0,,0,0,0,0,0,0,0,0.0,0,0.0,0.0,1722.0,32575.0,Prague_1_2020-12-05
1,Prague_1,2020-12-06,6584.0,,0,0,0,0,0,0,0,0.0,0,0.0,0.0,1688.0,32507.0,Prague_1_2020-12-06
2,Prague_1,2020-12-07,7030.0,,0,0,0,0,0,0,0,0.0,0,0.0,0.0,1696.0,32552.0,Prague_1_2020-12-07
3,Prague_1,2020-12-08,6550.0,,0,0,0,0,0,0,0,0.0,0,0.8,0.0,1681.0,32423.0,Prague_1_2020-12-08
4,Prague_1,2020-12-09,6910.0,,0,0,0,0,0,0,0,0.0,0,0.5,0.0,1704.0,32410.0,Prague_1_2020-12-09
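
As a small illustration of the renaming step (a sketch on a toy Series, not the competition data): `Series.replace` with a dict swaps only values that match a key exactly, and the shortened names contain fewer whitespace-separated words for the vectorizer to turn into features.

```python
import pandas as pd

rename_dict = {
    "Memorial Day for the Victims of the Holocaust": "Victims of the Holocaust",
    "Den vzniku samostatneho ceskoslovenskeho statu": "Den vzniku",
}

# Toy values; "Memorial Day" is an illustrative entry that is not a key in the dict.
names = pd.Series([
    "Memorial Day for the Victims of the Holocaust",
    "Den vzniku samostatneho ceskoslovenskeho statu",
    "Memorial Day",
])

# Only exact matches are replaced; other values pass through unchanged.
renamed = names.replace(rename_dict)

words_before = [len(n.split()) for n in names]
words_after = [len(n.split()) for n in renamed]
print(renamed.tolist())
print(words_before, words_after)  # [8, 5, 2] [4, 2, 2]
```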


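### 4. Add missing holidays.

#### By consulting online sources, the author found that some holidays are not labelled in the dataset, and filled them in manually. The calendar is then reduced to the rows with 'holiday'==1; rows with 'holiday'==1 but a missing holiday_name are treated as Easter Monday.

In [4]:
# As far as I can tell, the author looked up holidays online and added the ones that are not marked in the data.
# Holidays get from https://www.holidays-info.com/
# https://www.holidays-info.com/czech-republic/calendar/prague/2024/

# Keep only the columns that are needed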
df_cld = df_cld[['warehouse', 'date', 'holiday', 'holiday_name']]
test_cld = test_cld[['warehouse', 'date', 'holiday', 'holiday_name']].reset_index()

czech_holiday = [ # Prague
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),#loss
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"), #loss
]
brno_holiday = [ # Brno
    (['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020'], 'Easter Day'),#loss
    (['05/12/2024', '05/10/2020', '05/09/2021', '05/08/2022', '05/14/2023'], "Mother Day"), #loss
]

budapest_holidays = []
# Bavaria - Munich
munich_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),#loss
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),#loss
]

# Hesse - Frankfurt
frank_holidays = [
    (['03/30/2024', '04/08/2023', '04/16/2022', '04/03/2021'], 'Holy Saturday'),#loss
    (['05/12/2024', '05/14/2023', '05/08/2022', '05/09/2021'], 'Mother Day'),#loss
]

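# df_fill is the calendar table, warehouses is the list of warehouses, holidays is the list of holidays to add for those warehouses
def fill_loss_holidays(df_fill, warehouses, holidays):
    df = df_fill.copy()
    for item in holidays:  # each item is ([list of dates], holiday name)
        dates, holiday_name = item  # the list of dates and the holiday name
        # Reformat the dates, e.g. 12/29/2019 becomes 2019-12-29
        generated_dates = [datetime.strptime(date, '%m/%d/%Y').strftime('%Y-%m-%d') for date in dates]
        # For the given warehouses and dates, set holiday to 1 and holiday_name to the holiday name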
        for generated_date in generated_dates:
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday'] = 1
            df.loc[(df['warehouse'].isin(warehouses)) & (df['date'] == generated_date), 'holiday_name'] = holiday_name
    return df

# Add the holidays for each warehouse
df_cld = fill_loss_holidays(df_fill=df_cld, warehouses=['Prague_1', 'Prague_2', 'Prague_3'], holidays=czech_holiday)
test_cld = fill_loss_holidays(df_fill=test_cld, warehouses=['Prague_1', 'Prague_2', 'Prague_3'], holidays=czech_holiday)
df_cld = fill_loss_holidays(df_fill=df_cld, warehouses=['Brno_1'], holidays=brno_holiday)
test_cld = fill_loss_holidays(df_fill=test_cld, warehouses=['Brno_1'], holidays=brno_holiday)
df_cld = fill_loss_holidays(df_fill=df_cld, warehouses=['Munich_1'], holidays=munich_holidays)
test_cld = fill_loss_holidays(df_fill=test_cld, warehouses=['Munich_1'], holidays=munich_holidays)
df_cld = fill_loss_holidays(df_fill=df_cld, warehouses=['Frankfurt_1'], holidays=frank_holidays)
test_cld = fill_loss_holidays(df_fill=test_cld, warehouses=['Frankfurt_1'], holidays=frank_holidays)
df_cld = fill_loss_holidays(df_fill=df_cld, warehouses=['Budapest_1'], holidays=budapest_holidays)
test_cld = fill_loss_holidays(df_fill=test_cld, warehouses=['Budapest_1'], holidays=budapest_holidays)

# Keep only the dates where holiday is 1
df_cld = df_cld[df_cld['holiday']==1].reset_index(drop=True)
test_cld = test_cld[test_cld['holiday']==1].reset_index(drop=True)
# Rows with holiday == 1 but a missing holiday_name are filled with 'Easter Monday' (I am not familiar with these holidays; this only explains the code)
df_cld = df_cld.fillna('Easter Monday')
test_cld = test_cld.fillna('Easter Monday')
# Sort by date within each warehouse
df_cld  = df_cld.sort_values(by=['warehouse', 'date'])
test_cld  = test_cld.sort_values(by=['warehouse', 'date'])
df_cld.head()

Unnamed: 0,warehouse,date,holiday,holiday_name
122,Brno_1,2019-01-01,1,New Years Day
116,Brno_1,2019-04-19,1,Good Friday
136,Brno_1,2019-04-20,1,Easter Monday
91,Brno_1,2019-04-21,1,Easter Monday
90,Brno_1,2019-04-22,1,Easter Monday
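
The core of `fill_loss_holidays` can be sketched on a toy calendar (two made-up rows, not the real calendar file): the `MM/DD/YYYY` string is reformatted to ISO form, and the boolean mask restricts the update to the listed warehouses on that date.

```python
from datetime import datetime

import pandas as pd

# Toy calendar (illustrative only): the same date for two warehouses.
cal = pd.DataFrame({
    "warehouse": ["Prague_1", "Munich_1"],
    "date": pd.to_datetime(["2024-03-31", "2024-03-31"]),
    "holiday": [0, 0],
    "holiday_name": [None, None],
})

# 03/31/2024 -> 2024-03-31
iso_date = datetime.strptime("03/31/2024", "%m/%d/%Y").strftime("%Y-%m-%d")

# A datetime64 column can be compared with an ISO date string directly.
mask = cal["warehouse"].isin(["Prague_1"]) & (cal["date"] == iso_date)
cal.loc[mask, "holiday"] = 1
cal.loc[mask, "holiday_name"] = "Easter Day"

print(cal)
```

Only the `Prague_1` row is updated; the `Munich_1` row keeps `holiday == 0` because it is not in the warehouse list.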


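### 5. Fill the calendar holidays into the main df. Labour Day and Easter Monday are treated as especially important holidays, so the day two days before each of them is marked with 1.5.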

In [5]:
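# Fill the calendar information into df
def fill_calendar2df(df, df_cld):
    # df_cld already contains only rows with holiday == 1
    for _, row in df_cld.iterrows():
        # warehouse, holiday date, holiday name
        warehouse,holiday_date,holiday_name = row['warehouse'],row['date'],row['holiday_name']
        is_spec = False  # whether this is a special holiday; defaults to False
        # Labour Day and Easter Monday are special holidays with a window of [-2, 1] days; ordinary holidays use [-1, 1]
        # My guess at the reason: the Easter break usually lasts four days (as in the UK), and Labour Day in Europe seems to be one day off plus the weekend
        if (holiday_name in ['Labour Day','Easter Monday']):
            date_range = pd.date_range(start=holiday_date - pd.Timedelta(days=2), end=holiday_date + pd.Timedelta(days=1))
            is_spec = True
        else:  # possibly a shifted day off: one day of holiday plus the weekend
            date_range = pd.date_range(start=holiday_date - pd.Timedelta(days=1), end=holiday_date + pd.Timedelta(days=1))
        # date_range offsets: [-2, -1, 0, 1] for special holidays, [-1, 0, 1] otherwise
        for i, date in enumerate(date_range):
            mask = (df['warehouse'] == warehouse) & (df['date'] == date)
            if is_spec and i==0:  # two days before a special holiday
                df.loc[mask, 'holiday'] = 1.5  # give holiday a larger value
            else:  # every day of an ordinary holiday window, and offsets [-1, 0, 1] of a special one, get the smaller value
                df.loc[mask, 'holiday'] = 1
            # Every day except the last one in date_range (offset +1) also gets the holiday name
            if i+1!=len(date_range):
                df.loc[mask, 'holiday_name'] = holiday_name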
    return df

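# snow and precipitation are absent from the test set and will be dropped later, so holiday_name is the only remaining column with missing values; fill it with 'Not'
df = df.fillna('Not')
test = test.fillna('Not')
# Cast to float so that the special-holiday value 1.5 can be stored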
df['holiday'] = df['holiday'].astype(float)
test['holiday'] = test['holiday'].astype(float)
df = fill_calendar2df(df, df_cld)
test = fill_calendar2df(test, test_cld)
df.head()

Unnamed: 0,warehouse,date,orders,holiday_name,holiday,shutdown,mini_shutdown,shops_closed,winter_school_holidays,school_holidays,blackout,mov_change,frankfurt_shutdown,precipitation,snow,user_activity_1,user_activity_2,id
0,Prague_1,2020-12-05,6895.0,Not,0.0,0,0,0,0,0,0,0.0,0,0.0,0.0,1722.0,32575.0,Prague_1_2020-12-05
1,Prague_1,2020-12-06,6584.0,Not,0.0,0,0,0,0,0,0,0.0,0,0.0,0.0,1688.0,32507.0,Prague_1_2020-12-06
2,Prague_1,2020-12-07,7030.0,Not,0.0,0,0,0,0,0,0,0.0,0,0.0,0.0,1696.0,32552.0,Prague_1_2020-12-07
3,Prague_1,2020-12-08,6550.0,Not,0.0,0,0,0,0,0,0,0.0,0,0.8,0.0,1681.0,32423.0,Prague_1_2020-12-08
4,Prague_1,2020-12-09,6910.0,Not,0.0,0,0,0,0,0,0,0.0,0,0.5,0.0,1704.0,32410.0,Prague_1_2020-12-09
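
To make the window logic of `fill_calendar2df` easier to follow, here is a sketch that returns the per-day values for a single holiday instead of writing them into a DataFrame. `holiday_window` is a hypothetical helper introduced only for this illustration, and the two dates are arbitrary examples.

```python
import pandas as pd

def holiday_window(holiday_date, holiday_name):
    """Hypothetical helper: list (date, holiday value, holiday_name or None) per day in the window."""
    special = holiday_name in ["Labour Day", "Easter Monday"]
    days_before = 2 if special else 1
    days = pd.date_range(start=holiday_date - pd.Timedelta(days=days_before),
                         end=holiday_date + pd.Timedelta(days=1))
    rows = []
    for i, day in enumerate(days):
        # Only the first day of a special window (offset -2) gets 1.5.
        value = 1.5 if (special and i == 0) else 1.0
        # The last day (offset +1) is flagged as a holiday but its name is left untouched.
        name = holiday_name if i + 1 != len(days) else None
        rows.append((day.strftime("%Y-%m-%d"), value, name))
    return rows

special_rows = holiday_window(pd.Timestamp("2024-05-01"), "Labour Day")
ordinary_rows = holiday_window(pd.Timestamp("2024-03-29"), "Good Friday")
for row in special_rows + ordinary_rows:
    print(row)
```

The special holiday produces four rows (offsets -2 to +1) and the ordinary one produces three (offsets -1 to +1).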


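### 6. Mark the day before Easter as a holiday.

In [6]:
# List of Easter (Sunday) dates
datesx = ['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020']
# Parse the date strings and subtract one day to get the day before Easter
holidaysx = [datetime.strptime(date, '%m/%d/%Y') - timedelta(days=1) for date in datesx]
# For these three warehouses, mark the day before Easter as a holiday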
warehouses = ['Prague_1', 'Prague_2', 'Prague_3']
df.loc[(df['date'].isin(holidaysx)) & (df['warehouse'].isin(warehouses)), 'holiday'] = 1
test.loc[(test['date'].isin(holidaysx)) & (test['warehouse'].isin(warehouses)), 'holiday'] = 1
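
A quick check of the date arithmetic, using only the standard library: subtracting one day from each Easter date in the list gives a Saturday every time. The variable names here are illustrative.

```python
from datetime import datetime, timedelta

easter_dates = ['03/31/2024', '04/09/2023', '04/17/2022', '04/04/2021', '04/12/2020']

# Parse MM/DD/YYYY and step back one day.
day_before = [datetime.strptime(d, '%m/%d/%Y') - timedelta(days=1) for d in easter_dates]

# weekday(): Monday is 0, so Saturday is 5.
print([(d.strftime('%Y-%m-%d'), d.weekday()) for d in day_before])
```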

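### 7. Build features for the day before and the day after a holiday.

In [7]:
# day_before_holiday: whether the next row is a holiday; day_after_holiday: whether the previous row is a holiday. fillna(-1) distinguishes the edge rows from an ordinary 0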
df['day_before_holiday'] = df['holiday'].shift(-1).fillna(-1)
df['day_after_holiday'] = df['holiday'].shift().fillna(-1)
test['day_before_holiday'] = test['holiday'].shift(-1).fillna(-1)
test['day_after_holiday'] = test['holiday'].shift().fillna(-1)

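# Casting to int8 truncates 1.5 to 1. Whether the neighbouring day is a special holiday does not matter here, since the current day itself is not special, so 1.5 does not need to be kept apart from 1.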
df['day_before_holiday'] = df['day_before_holiday'].astype(np.int8)
df['day_after_holiday'] = df['day_after_holiday'].astype(np.int8)
test['day_before_holiday'] = test['day_before_holiday'].astype(np.int8)
test['day_after_holiday'] = test['day_after_holiday'].astype(np.int8)
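
The sketch below replays the shift logic on a toy frame with two made-up warehouses. Because the cell above shifts the whole column without grouping, the last row of one warehouse takes its "next day" value from the first row of the following warehouse. The grouped variant is only an alternative shown for comparison; it is not what the notebook uses, and I have not measured whether it changes the score.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): rows are ordered by warehouse, then by date.
toy = pd.DataFrame({
    "warehouse": ["A", "A", "A", "B", "B"],
    "holiday": [0.0, 1.5, 0.0, 1.0, 0.0],
})

# Same logic as the notebook cell: shift over the whole column.
toy["day_before_holiday"] = toy["holiday"].shift(-1).fillna(-1).astype(np.int8)
toy["day_after_holiday"] = toy["holiday"].shift().fillna(-1).astype(np.int8)

# Hypothetical alternative: shift within each warehouse.
toy["day_before_grouped"] = (
    toy.groupby("warehouse")["holiday"].shift(-1).fillna(-1).astype(np.int8)
)

print(toy)
```

In the toy output, the 1.5 is truncated to 1 by the `int8` cast, and row 2 (the last row of warehouse `A`) gets `1` in the ungrouped column but `-1` in the grouped one.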

### 8. Fairly standard feature engineering.

In [8]:
# Data processing function
def data_process(df:pd.DataFrame, is_test:bool, tfidfvectorizer=None):
    testids = test['id']

    # id is just warehouse + date and carries no new information
    # 'shutdown','mini_shutdown', 'blackout','mov_change', 'frankfurt_shutdown', 'precipitation', 'snow','user_activity_1', 'user_activity_2'
    # are columns that exist in the training set but not in the test set
    ignore_columns = ['id','shutdown','mini_shutdown', 'blackout',
                      'mov_change', 'frankfurt_shutdown', 'precipitation', 'snow',
                      'user_activity_1', 'user_activity_2',
                      ]
    # errors="ignore" skips listed columns that do not exist instead of raising an error; this matters for the test set
    df = df.drop(ignore_columns, axis=1, errors="ignore")

    # Build year, month, day and day-of-week features from the date
    print("date feature")
    for _col in ['date']:
        train_date_col = pd.to_datetime(df[_col], errors='coerce')
        df[_col + "_year"] = train_date_col.dt.year#.fillna(-1)
        df[_col + "_month"] = train_date_col.dt.month#.fillna(-1)
        df[_col + "_day"] = train_date_col.dt.day#.fillna(-1)
        df[_col + "_day_of_week"] = train_date_col.dt.dayofweek#.fillna(-1)
        df.drop(_col, axis=1, inplace=True)

    print("deal with text(holiday_name)")
    # Some simple text preprocessing
    def process_text(df):
        for _col in ['holiday_name']:
            # Convert the text to lowercase
            process_text = [t.lower() for t in df[_col]]
            # Remove all punctuation
            table = str.maketrans('', '', '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
            process_text = [t.translate(table) for t in process_text]
            # Replace every run of digits with 'num' using a regular expression
            process_text = [re.sub(r'\d+', 'num', t) for t in process_text]
            df[_col] = process_text
        return df
    df = process_text(df)
    
    TARGET_COLUMNS = ['orders']
    # If ['orders'] is a subset of df.columns, i.e. df contains the orders column, then df is the training set
    if set(TARGET_COLUMNS).issubset(df.columns.tolist()):
        #train_X
        feature_train = df.drop(TARGET_COLUMNS, axis=1)
        #train_y
        target_train = df[TARGET_COLUMNS].copy()
    else:#test_X
        feature_train = df

    print("one_hot_encoder")
    for u in [ 'Brno_1','Budapest_1','Frankfurt_1','Munich_1','Prague_1', 'Prague_2', 'Prague_3']:
        feature_train[f'warehouse_{u}']=(feature_train['warehouse']==u).astype(float)#np.int8
    feature_train.drop(['warehouse'], axis=1, inplace=True)
    
    
    print("tf-idf_feature")  # build TF-IDF features from holiday_name
    TEXT_COLUMNS = ['holiday_name']
    # Take out the holiday_name text data
    temp_train_data = feature_train[TEXT_COLUMNS]
    # Convert all other columns to sparse float64 with 0 as the fill value (zeros are not stored explicitly)
    feature_train = feature_train.drop(TEXT_COLUMNS, axis=1).astype(pd.SparseDtype('float64', 0))
    for _col in ['holiday_name']:
        if is_test:
            vector_train = tfidfvectorizer.transform(temp_train_data[_col])
        else:
            vector_train = tfidfvectorizer.fit_transform(temp_train_data[_col])
        feature_names = ['_'.join([_col, name]) for name in tfidfvectorizer.get_feature_names_out()]
        vector_train = pd.DataFrame.sparse.from_spmatrix(vector_train, columns=feature_names, index=temp_train_data.index)
        feature_train = pd.concat([feature_train, vector_train], axis=1)
    print("-"*30)
    if is_test:
        return feature_train, testids
    else:
        return feature_train, target_train, tfidfvectorizer
    
# TF-IDF model that keeps at most the 100 most frequent words
tfidfvectorizer = TfidfVectorizer(max_features=100)
feature_train, target_train, tfidfvectorizer = data_process(df, is_test=False,
                                                                tfidfvectorizer=tfidfvectorizer)
feature_test, testids = data_process(test, is_test=True,
                                     tfidfvectorizer=tfidfvectorizer)
feature_train

date feature
deal with text(holiday_name)
one_hot_encoder
tf-idf_feature
------------------------------
date feature
deal with text(holiday_name)
one_hot_encoder
tf-idf_feature
------------------------------


Unnamed: 0,holiday,shops_closed,winter_school_holidays,school_holidays,day_before_holiday,day_after_holiday,date_year,date_month,date_day,date_day_of_week,...,holiday_name_svobodu,holiday_name_the,holiday_name_unity,holiday_name_victims,holiday_name_virgin,holiday_name_vzniku,holiday_name_whit,holiday_name_womens,holiday_name_years,holiday_name_za
0,0,0,0,0,0,-1.0,2020.0,12.0,5.0,5.0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,2020.0,12.0,6.0,6.0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,2020.0,12.0,7.0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,2020.0,12.0,8.0,1.0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,2020.0,12.0,9.0,2.0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7335,0,0,0,0,0,0,2024.0,3.0,10.0,6.0,...,0,0,0,0,0,0,0,0,0,0
7336,0,0,0,0,0,0,2024.0,3.0,11.0,0,...,0,0,0,0,0,0,0,0,0,0
7337,0,0,0,0,0,0,2024.0,3.0,12.0,1.0,...,0,0,0,0,0,0,0,0,0,0
7338,0,0,0,0,1.0,0,2024.0,3.0,13.0,2.0,...,0,0,0,0,0,0,0,0,0,0


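### 9. Model training and inference.

#### The target is transformed with log1p here, reportedly so that outliers are predicted better. In my experiments this also scored better than training without log1p.

In [9]:
# Parameters found with optuna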
lgb_params = {'objective':'regression_l1',
              'n_estimators':600,
             'reg_alpha': 0.22395576225297806,
             'reg_lambda': 0.013055491064310818,
             'learning_rate': 0.48284825276236043,
             'colsample_bytree': 0.7922000123536603,
             'min_child_weight': 0.00010297333065138669,
             'num_leaves': 14,
             'min_child_samples': 10}
model = LGBMRegressor( **lgb_params)

# log1p is used here to handle outliers
def pred_test(algo, X_train, y_train, X_test):
    model = algo.fit(X_train, np.log1p(y_train))
    y_pred = np.expm1(model.predict(X_test))
    return y_pred

y_pred = pred_test(model, feature_train, target_train.values, feature_test)



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003943 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 188
[LightGBM] [Info] Number of data points in the train set: 7340, number of used features: 51
[LightGBM] [Info] Start training from score 8.588769
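
A short sketch of the target transform on made-up order counts: `np.log1p` compresses large values before fitting, and `np.expm1` maps predictions back to the original scale. Since the model is fitted on `log1p(orders)`, the start score in the log above is on that log scale as well.

```python
import numpy as np

# Illustrative order counts, including one unusually large value.
orders = np.array([0.0, 6895.0, 10740.0, 50000.0])

log_orders = np.log1p(orders)    # log(1 + y); defined at y = 0
restored = np.expm1(log_orders)  # exp(z) - 1; inverse of log1p

print(log_orders.round(3))
print(np.allclose(restored, orders))

# The logged start score 8.588769 corresponds to roughly 5,370 orders.
start_score_in_orders = np.expm1(8.588769)
print(round(float(start_score_in_orders)))
```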


### 10. Submission

#### Because the data shows an increasing trend, the predictions are multiplied by 1.015. This is not in the original notebook.

In [10]:
submission=pd.read_csv("/kaggle/input/rohlik-orders-forecasting-challenge/solution_example.csv")
submission['orders']=y_pred*1.015
submission.to_csv("submission.csv",index=None)
submission.head()

Unnamed: 0,id,orders
0,Prague_1_2024-03-16,10740.443952
1,Prague_1_2024-03-17,10251.168705
2,Prague_1_2024-03-18,10036.754932
3,Prague_1_2024-03-19,9674.753361
4,Prague_1_2024-03-20,9741.424527
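
The cell above assigns `y_pred` by position, which relies on the sample file listing its rows in the same order as `test.csv`. If that cannot be taken for granted, one alternative is to join on `id`; the sketch below does this with made-up ids and predictions. It is a variation for illustration, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (illustrative only) for testids and y_pred.
ids = pd.Series(["Prague_1_2024-03-16", "Prague_1_2024-03-17"], name="id")
toy_pred = np.array([10000.0, 9000.0])

# Apply the same 1.015 trend factor as the notebook.
pred = pd.DataFrame({"id": ids, "orders": toy_pred * 1.015})

# A template whose rows are in a different order than the predictions.
template = pd.DataFrame({"id": ["Prague_1_2024-03-17", "Prague_1_2024-03-16"]})

# A left merge keeps the template's row order and matches predictions by id.
submission_by_id = template.merge(pred, on="id", how="left")
print(submission_by_id)
```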
