**<font size=9>Mercari Price Suggestion Challenge</font>**
TING-LE, LIN, SCU FEAM

**<font size=6>Introduction</font>**

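Have you ever sold something on an online marketplace? The hardest decision when listing an item is its price: **set it too high and nobody is interested; set it too low and you leave money on the table**. Most sellers set a price by checking what similar items sell for, the original retail price, the item's condition, and so on, **but these steps are tedious and inefficient, and even after a lot of time you may still not arrive at a suitable price.**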

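This notebook walks through one approach to that problem. It comes from a Kaggle competition hosted by Mercari whose goal is to **suggest a selling price to marketplace users**. The competition provides the **item name, who pays for shipping, item category, brand, item condition, and item description**, and asks us to predict the selling price from them. The data consists mostly of **text**, which makes the prediction comparatively complex. Submissions are scored by **RMSLE** (root mean squared logarithmic error); the lower score wins.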
        
**<font size=6>Solution</font>**

**<font size=3>1. Data loading and preprocessing</font>**

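The training set has 1,482,353 rows and the test set (stage 2) has 3,460,725 rows. To make cross-validation scoring easier, the training prices are first transformed with log(price+1), so that the RMSE computed on the transformed target is the competition's RMSLE.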
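
As a quick check of why the log(price+1) transform is convenient, the sketch below compares RMSLE on raw prices with plain RMSE on transformed prices. The prices and the helper names `rmsle` and `rmse` are made up for illustration and are not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(a, b):
    """Plain root mean squared error."""
    return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

# Hypothetical prices, for illustration only.
price_true = np.array([10.0, 25.0, 3.0, 120.0])
price_pred = np.array([12.0, 20.0, 4.0, 100.0])

score_raw = rmsle(price_true, price_pred)                     # metric on raw prices
score_log = rmse(np.log1p(price_true), np.log1p(price_pred))  # RMSE in log space

# Predictions made in log space are mapped back to prices with expm1.
roundtrip = np.expm1(np.log1p(price_true))

print(score_raw, score_log)
```

The two scores are the same quantity, which is why the later cells can call `mean_squared_error` directly on the log-transformed target and only apply `np.expm1` when writing the submission.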

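Besides filling in missing values, this step also caps the number of brands. There are so many distinct brands that keeping all of them uses too much memory and the program fails to run, **so only the 2,500 most frequent brands are kept**; those 2,500 brands already account for more than 70% of the items.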
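
The brand cap can be tried out on a toy column first. This is a minimal sketch under my own assumptions: the helper `keep_top_brands` is hypothetical, the brand values are invented, and coverage is measured here as the share of rows with a named brand that fall in the top group, which may not be exactly how the 70% figure was computed.

```python
import pandas as pd

def keep_top_brands(brands, n_top, other="Other"):
    """Keep the n_top most frequent named brands; map the rest (and missing) to `other`."""
    counts = brands.dropna().value_counts()
    top = counts.index[:n_top]
    # Share of rows with a named brand that belong to one of the kept brands.
    coverage = counts.iloc[:n_top].sum() / counts.sum()
    capped = brands.where(brands.isin(top), other)
    return capped, coverage

# Toy data, for illustration only.
toy = pd.Series(["Nike", "Nike", "Nike", "Apple", "Apple", "Sony", None, "Zara"])
capped, coverage = keep_top_brands(toy, n_top=2)

print(capped.tolist())
print(round(coverage, 3))
```

Running the same kind of check on the real `brand_name` column for a few values of `n_top` is one way to pick a cap that fits in memory while still covering most items.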
        
        
        

In [None]:
#-----------------Load packages-------------------------
import pandas as pd
import numpy as np
import scipy

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

from sklearn.linear_model import Ridge
import xgboost as xgb
import lightgbm as lgb

from sklearn.model_selection import KFold

from sklearn.metrics import mean_squared_error

from scipy.optimize import minimize

import matplotlib.pyplot as plt

import gc


In [None]:
#-------------------Data preparation-----------------------
print("Reading in Data")
df_train = pd.read_csv('../input/train.tsv', sep='\t')
df_test = pd.read_csv('../input/test_stg2.tsv', sep='\t')

nrow_train=df_train.shape[0]
nrow_test=df_test.shape[0]
Y_train=np.log1p(df_train["price"])

print('Size of train set',nrow_train)
print('Size of test set',nrow_test)

df=pd.concat([df_train,df_test],axis=0,sort=True)
df_test=pd.DataFrame(df_test["test_id"])


#----------------Handle missing values--------------------
NUM_BRAND=2500
df["category_name"]=df["category_name"].fillna("Other").astype("category")

df["brand_name"]=df["brand_name"].fillna("Unknown")
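# Index 0 is expected to be the "Unknown" placeholder (the most frequent value),
# so it is skipped and the next NUM_BRAND brands are kept.
pop_brands=df["brand_name"].value_counts().index[1:NUM_BRAND+1]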
df.loc[~df["brand_name"].isin(pop_brands),"brand_name"]="Other"
df["brand_name"]=df["brand_name"].astype("category")

df["item_description"]=df["item_description"].fillna("None")

df["item_condition_id"]=df["item_condition_id"].astype("category")
df_train.head()

In [None]:
del df_train
gc.collect()

**<font size=3>2. Feature processing</font>**

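Think about listing an item online: which field do you put the most effort into? I expect most people would say the item description, and the feature processing here starts from that idea. **Most fields are vectorized with CountVectorizer**, **shipping and item condition are encoded with LabelBinarizer**, and, most importantly, **the item description is turned into TF-IDF weighted features** (TfidfVectorizer). Finally, the features are combined and converted to a sparse matrix.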
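
To make the encoding step concrete, here is a small sketch on three invented listings. The data and variable names are illustrative only; the real cells below use the full dataset and different settings (for example `min_df`, `max_features`, and trigrams).

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer

# Toy listings, for illustration only.
names = ["nike running shoes", "apple iphone case", "nike socks"]
descriptions = ["brand new never worn", "used but works great", "new with tags"]
shipping = [1, 0, 1]

# Raw token counts for short fields such as the item name.
X_name = CountVectorizer().fit_transform(names)
# TF-IDF weights (unigrams and bigrams) for the free-text description.
X_desc = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(descriptions)
# A binary flag becomes a single sparse indicator column.
X_ship = LabelBinarizer(sparse_output=True).fit_transform(shipping)

# Stack all blocks side by side into one CSR matrix.
X_toy = sp.hstack([X_ship, X_name, X_desc]).tocsr()
print(X_ship.shape, X_name.shape, X_desc.shape, X_toy.shape)
```

Each block keeps one row per listing, so stacking horizontally simply concatenates the feature columns without ever building a dense matrix.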

In [None]:
#--------------------encode------------------------
NAME_MIN_DF=10
MAX_FEAT_DESCP=50000
print("Encodings")

print("Condition Encoders")
vect_condition=LabelBinarizer(sparse_output=True)
X_Condition=vect_condition.fit_transform(df["item_condition_id"])

print("Shipping Encoders")
vect_shipping=LabelBinarizer(sparse_output=True)
X_Shipping=vect_shipping.fit_transform(df["shipping"])

print("Name Encoders")
count_name=CountVectorizer(min_df=NAME_MIN_DF)
X_name=count_name.fit_transform(df["name"])

print("Category Encoders")
count_category=CountVectorizer()
X_category=count_category.fit_transform(df["category_name"])

print("Brand Encoders")
count_brand=CountVectorizer()
X_brand=count_brand.fit_transform(df["brand_name"])


print("Descp Encoders")
count_descp=TfidfVectorizer(max_features=MAX_FEAT_DESCP,
                            ngram_range=(1,3),
                            stop_words="english")
X_descp=count_descp.fit_transform(df["item_description"])



In [None]:
del df
gc.collect()


In [None]:
X=scipy.sparse.hstack([X_Shipping,X_Condition,X_brand,
                       X_category,X_descp,X_name]).tocsr()


In [None]:
del X_descp
del X_brand
del X_category
del X_name
del X_Shipping
del X_Condition
gc.collect()


In [None]:
X_train=X[:nrow_train]
X_test=X[nrow_train:]
dtest=xgb.DMatrix(X_test)

In [None]:
del X
gc.collect()

**<font size=3>3. Cross-validation and prediction</font>**

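Time permitting, I use **KFold for cross-validation**: the training data is split into three folds and a model is trained for each. **The validation predictions from the three folds are stored and combined into one complete set of predictions for the training data**, whose RMSLE is computed and used as input to the next step. At the same time, **each of the three models predicts the test data and the predictions are averaged**. This process is run with three different algorithms, **Ridge Regression, LightGBM, and XGBoost**; the reason is explained in the next step.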

![](https://i.imgur.com/8svMOQl.jpg)
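
The out-of-fold bookkeeping is easier to see on a small problem. The sketch below uses synthetic data and a single Ridge model; the array names, the `alpha` value, and the use of `shuffle=True` are my own choices for the demo, not the settings of the full cell that follows.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))   # synthetic features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=300)
X_new = rng.normal(size=(50, 5))     # stands in for the test set

n_folds = 3
oof_pred = np.zeros(len(y_demo))     # one validation prediction per training row
test_pred = np.zeros(len(X_new))     # average of the fold models' test predictions

kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_demo):
    model = Ridge(alpha=1.0).fit(X_demo[train_idx], y_demo[train_idx])
    # Each row is predicted only by the model that did not see it in training.
    oof_pred[val_idx] = model.predict(X_demo[val_idx])
    test_pred += model.predict(X_new) / n_folds

oof_rmse = np.sqrt(np.mean((oof_pred - y_demo) ** 2))
print(round(oof_rmse, 4))
```

Because every training row receives exactly one held-out prediction, `oof_pred` can later be used to fit blending weights without leaking the training labels.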

In [None]:
#--------------------Cross-Validation------------------------
print("Cross-Validation")
xgb_pred_val_index = np.zeros(X_train.shape[0])
ridge_pred_val_index = np.zeros(X_train.shape[0])
lgb_pred_val_index = np.zeros(X_train.shape[0])
xgb_pred_all_sum=[]
ridge_pred_all_sum=[]
lgb_pred_all_sum=[]
xgb_cv_RMSLE_sum=0
ridge_cv_RMSLE_sum=0
lgb_cv_RMSLE_sum=0


folds=3
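# random_state only takes effect with shuffle=True, and recent scikit-learn versions
# raise an error if it is set without shuffling, so it is dropped here.
# The folds stay contiguous and unshuffled, as before.
kf = KFold(n_splits=folds)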
for i, (train_index, val_index) in enumerate(kf.split(X_train, Y_train)):
    x_train, x_val = X_train[train_index], X_train[val_index]
    y_train, y_val = Y_train[train_index], Y_train[val_index]

    dtrain=xgb.DMatrix(x_train,y_train)
    dval=xgb.DMatrix(x_val)
    deval=xgb.DMatrix(x_val,y_val)
    
    
    params = {
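    # Note: gblinear is XGBoost's linear booster, so the tree-only settings below
    # (gamma, max_depth, subsample, colsample_bytree, min_child_weight) do not apply to it.
    'booster': 'gblinear',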
    'objective': 'reg:linear', 
    'gamma': 0,                
    'max_depth': 10,           
    'lambda': 0,                   
    'subsample': 0.85,             
    'colsample_bytree': 0.9,      
    'min_child_weight': 17,
    'silent': 1,                  
    'eta': 0.4,                 
    'seed': 1001,
    'nthread': 4,                 
    'eval_metric':'rmse'
    }
    
    plst = params.items()
    evallist = [(deval, 'eval'), (dtrain, 'train')]
    num_round=300
    
    model=xgb.train(plst,dtrain,num_round,evallist, verbose_eval=100,early_stopping_rounds=100)
    xgb_pred_val=model.predict(dval)
    xgb_RMSLE=np.sqrt(mean_squared_error(xgb_pred_val,y_val))
    print('\n Fold %02d XGBoost RMSLE: %.6f' % ((i + 1), xgb_RMSLE))
    xgb_pred_all=model.predict(dtest)
    
    del dtrain
    del dval
    del deval
    gc.collect()
    
    params = {
        'boosting': 'gbdt',
        'max_depth': 7,
        'min_data_in_leaf': 80,
        'num_leaves': 30,
        'learning_rate': 0.4,
        'objective': 'regression',
        'metric': 'rmse',
        'nthread': 4,
        'bagging_freq': 1,
        'subsample': 0.9,
        'colsample_bytree': 0.7,
        'min_child_weight': 17,
        'is_unbalance': False,
        'verbose': -1,
        'seed': 1001,
        'max_bin':511,
        'num_threads':4
    }
    
    dtrain = lgb.Dataset(x_train, label=y_train)
    deval = lgb.Dataset(x_val, label=y_val)
    watchlist = [dtrain, deval]
    watchlist_names = ['train', 'val']

    model = lgb.train(params,
    train_set=dtrain,
    num_boost_round=3000,
    valid_sets=watchlist,
    valid_names=watchlist_names,
    early_stopping_rounds=100,
    verbose_eval=300)
    lgb_pred_val = model.predict(x_val)
    lgb_RMSLE = np.sqrt(mean_squared_error(lgb_pred_val,y_val))
    print(' Fold %02d LightGBM RMSLE: %.6f' % ((i + 1), lgb_RMSLE))
    lgb_pred_all = model.predict(X_test)
    
    del dtrain
    del deval
    gc.collect()
    
    
    
    model=Ridge(solver='sag',alpha=4.75)
    model.fit(x_train,y_train)
    ridge_pred_val=model.predict(x_val)
    ridge_RMSLE=np.sqrt(mean_squared_error(ridge_pred_val,y_val))
    print('\n Fold %02d Ridge RMSLE: %.6f' % ((i + 1), ridge_RMSLE))
    ridge_pred_all=model.predict(X_test)
    
    del x_train
    del y_train
    del x_val
    del y_val
    gc.collect()
    
    xgb_pred_val_index[val_index] = xgb_pred_val
    ridge_pred_val_index[val_index] = ridge_pred_val
    lgb_pred_val_index[val_index] = lgb_pred_val
    
    if i > 0:
        xgb_pred_all_sum = xgb_pred_all_sum + xgb_pred_all
        ridge_pred_all_sum = ridge_pred_all_sum + ridge_pred_all
        lgb_pred_all_sum = lgb_pred_all_sum + lgb_pred_all
    else:
        xgb_pred_all_sum = xgb_pred_all
        ridge_pred_all_sum = ridge_pred_all
        lgb_pred_all_sum = lgb_pred_all
    
    xgb_cv_RMSLE_sum = xgb_cv_RMSLE_sum + xgb_RMSLE
    ridge_cv_RMSLE_sum = ridge_cv_RMSLE_sum + ridge_RMSLE
    lgb_cv_RMSLE_sum = lgb_cv_RMSLE_sum + lgb_RMSLE

xgb_cv_avg_score=xgb_cv_RMSLE_sum/folds
ridge_cv_avg_score=ridge_cv_RMSLE_sum/folds
lgb_cv_avg_score=lgb_cv_RMSLE_sum/folds

xgb_val_real_RMSLE=np.sqrt(mean_squared_error(xgb_pred_val_index,Y_train))
ridge_val_real_RMSLE=np.sqrt(mean_squared_error(ridge_pred_val_index,Y_train))
lgb_val_real_RMSLE=np.sqrt(mean_squared_error(lgb_pred_val_index,Y_train))

print('\n Average XGBoost RMSLE(cv):\t%.6f' % xgb_cv_avg_score)
print(' Out-of-fold XGBoost RMSLE:\t%.6f' % xgb_val_real_RMSLE)
print('\n Average LightGBM RMSLE(cv):\t%.6f' % lgb_cv_avg_score)
print(' Out-of-fold LightGBM RMSLE:\t%.6f' % lgb_val_real_RMSLE)
print('\n Average Ridge RMSLE(cv):\t%.6f' % ridge_cv_avg_score)
print(' Out-of-fold Ridge RMSLE:\t%.6f' % ridge_val_real_RMSLE)

xgb_pred_all_avg=xgb_pred_all_sum/folds
ridge_pred_all_avg=ridge_pred_all_sum/folds
lgb_pred_all_avg=lgb_pred_all_sum/folds


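Cross-validation shows that LightGBM scores best, followed by Ridge Regression, with XGBoost last. The LightGBM model could still be improved, but because of time constraints the number of boosting rounds is capped at 3000. For XGBoost, the training error kept falling while the validation error kept rising, so I had to give up on using more rounds.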

**<font size=3>4. Model blending and final prediction</font>**

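After step 3 we have predictions on the training data from Ridge, XGBoost, and LightGBM. **The three sets of predictions are blended with a linear combination, the error of each blend is computed to find the best model weights, and the final answer for the test data is produced with those weights**.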

![](https://i.imgur.com/HzuS1XH.jpg)

In [None]:
#------------blend-------------------
def rmse_min_func(weights):
    final_prediction=0
    for weight,prediction in zip(weights,blend_train):
        final_prediction+=weight*prediction
    return np.sqrt(mean_squared_error(Y_train,final_prediction))

blend_train = []
blend_test = []

blend_train.append(xgb_pred_val_index) 
blend_train.append(lgb_pred_val_index)
blend_train.append(ridge_pred_val_index)
blend_train=np.array(blend_train)

blend_test.append(xgb_pred_all_avg)
blend_test.append(lgb_pred_all_avg)
blend_test.append(ridge_pred_all_avg)
blend_test=np.array(blend_test)

print('\n Finding Blending Weights ...')

res_list=[]
weight_list=[]

for k in range(20):
    starting_value=np.random.uniform(-1,1,len(blend_train))
    bounds=[(-1,1)]*len(blend_train)
    
    res=minimize(
        rmse_min_func,
        starting_value,
        method='L-BFGS-B',
        bounds=bounds,
        options={'disp':False,
        'maxiter':100000})
    
    res_list.append(res['fun'])
    weight_list.append(res['x'])
    print('{iter}\tScore: {score}\tWeights: {weights}'.format(
        iter=(k+1),
        score=round(res['fun'],6),
        weights='\t'.join([str(round(item,10)) for item in res['x']])))

bestSC=np.min(res_list)
bestweight=weight_list[np.argmin(res_list)]

print('\n Ensemble Score:{best_score}'.format(best_score=bestSC))
print('\n Best Weights:{weight}'.format(weight=bestweight))


test_price=np.zeros(len(blend_test[0]))
train_price =np.zeros(len(blend_train[0]))

print('\n Your final model:')
for k in range(len(blend_test)):
    print('%.6f * model-%d'%(bestweight[k],(k+1)))
    test_price+=blend_test[k]*bestweight[k]
    train_price+= blend_train[k] * bestweight[k]

df_test["price"]=np.expm1(test_price)
print("Generating File")
df_test[["test_id","price"]].to_csv("submission.csv",index=False)

The final model is roughly **XGBoost 30%, LightGBM 60%, Ridge Regression 10%**. The weights do not directly track the validation errors from cross-validation, but LightGBM does take the largest share.

**<font size=3>5. Model inspection</font>**

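From the plots, **LightGBM and Ridge Regression underestimate prices more often, while XGBoost overestimates more often**. This can explain the final weights: **LightGBM and XGBoost complement each other**, and the more accurate LightGBM takes the largest share. The final blended model improves considerably on the underestimation.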
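
The idea that models with opposite biases can complement each other in a blend can be illustrated on synthetic data. In this sketch the two "models" are just the true values plus a fixed bias and noise; the bias sizes, names, and starting weights are assumptions made for the demo, not measurements from this competition.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
y = rng.normal(loc=3.0, scale=1.0, size=2000)             # synthetic log-prices
pred_high = y + 0.3 + rng.normal(scale=0.2, size=y.size)  # tends to overestimate
pred_low = y - 0.3 + rng.normal(scale=0.2, size=y.size)   # tends to underestimate
preds = np.vstack([pred_high, pred_low])

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def blend_loss(w):
    # Error of the weighted sum of the two prediction vectors.
    return rmse(y, w @ preds)

res = minimize(blend_loss, x0=[0.5, 0.5], method="L-BFGS-B", bounds=[(-1, 1)] * 2)
blend_rmse = res.fun
single_rmse = [rmse(y, p) for p in preds]

print(single_rmse, blend_rmse, res.x)
```

In this toy setting the blend scores better than either input because the opposite biases largely cancel; on real predictions the gain depends on how different the models' errors actually are.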

In [None]:
#------------------Model inspection---------------------------
print('\n Making scatter plots of actual vs. predicted prices ...')
x_true = np.expm1(Y_train)
x_pred = np.expm1(xgb_pred_val_index)
cm = plt.get_cmap('RdYlBu')
# Normalized prediction error clipped so the color-coding covers -75% to 75% range
x_diff = np.clip(100 * ((x_pred - x_true) / x_true), -75, 75)
plt.figure(1, figsize=(12, 10))
plt.title('Actual vs. Predicted Prices - XGBoost')
plt.scatter(x_true, x_pred, c=x_diff, s=10, cmap=cm)
plt.colorbar()
plt.plot([x_true.min() - 50, x_true.max() + 50],
         [x_true.min() - 50, x_true.max() + 50],
         'k--',lw=1)
plt.xlabel('Prices')
plt.ylabel('Predicted Prices')
plt.xlim(-50, 2050)
plt.ylim(-50, 2050)
plt.tight_layout()
plt.show()


x_pred = np.expm1(lgb_pred_val_index)
# Normalized prediction error clipped so the color-coding covers -75% to 75% range
x_diff = np.clip(100 * ((x_pred - x_true) / x_true), -75, 75)
plt.figure(2, figsize=(12, 10))
plt.title('Actual vs. Predicted Prices - LightGBM')
plt.scatter(x_true, x_pred, c=x_diff, s=10, cmap=cm)
plt.colorbar()
plt.plot([x_true.min() - 50, x_true.max() + 50],
         [x_true.min() - 50, x_true.max() + 50],
         'k--',lw=1)
plt.xlabel('Prices')
plt.ylabel('Predicted Prices')
plt.xlim(-50, 2050)
plt.ylim(-50, 2050)
plt.tight_layout()
plt.show()


x_pred = np.expm1(ridge_pred_val_index)
# Normalized prediction error clipped so the color-coding covers -75% to 75% range
x_diff = np.clip(100 * ((x_pred - x_true) / x_true), -75, 75)
plt.figure(3, figsize=(12, 10))
plt.title('Actual vs. Predicted Prices - Ridge Regression')
plt.scatter(x_true, x_pred, c=x_diff, s=10, cmap=cm)
plt.colorbar()
plt.plot([x_true.min() - 50, x_true.max() + 50],
         [x_true.min() - 50, x_true.max() + 50],
         'k--',lw=1)
plt.xlabel('Prices')
plt.ylabel('Predicted Prices')
plt.xlim(-50, 2050)
plt.ylim(-50, 2050)
plt.tight_layout()
plt.show()


x_pred = np.expm1(train_price)
# Normalized prediction error clipped so the color-coding covers -75% to 75% range
x_diff = np.clip(100 * ((x_pred - x_true) / x_true), -75, 75)
plt.figure(4, figsize=(12, 10))
plt.title('Actual vs. Predicted Prices - Blend')
plt.scatter(x_true, x_pred, c=x_diff, s=10, cmap=cm)
plt.colorbar()
plt.plot([x_true.min() - 50, x_true.max() + 50],
         [x_true.min() - 50, x_true.max() + 50],
         'k--',lw=1)
plt.xlabel('Prices')
plt.ylabel('Predicted Prices')
plt.xlim(-50, 2050)
plt.ylim(-50, 2050)
plt.tight_layout()
plt.show()

**<font size=6>Afterword</font>**

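For feature processing, I also tried **stemming** the item descriptions to capture their key content more precisely, but it exceeded the available memory and took too long, so I dropped the idea.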
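
For readers unfamiliar with stemming, the sketch below shows its effect on vocabulary size. The `crude_stem` function is a toy suffix stripper that I wrote as a stand-in for a real stemmer (it is not the Porter algorithm and not what I ran on the full data), and the two sentences are invented.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemmer, not the Porter algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_text(text):
    # Lowercase, split into alphabetic tokens, stem each one, and rejoin.
    return " ".join(crude_stem(w) for w in re.findall(r"[a-z]+", text.lower()))

# Toy descriptions, for illustration only.
docs = ["Worked great, works perfectly", "Working charger, charged once"]

plain_vocab = CountVectorizer().fit(docs).vocabulary_
stemmed_vocab = CountVectorizer().fit([stem_text(d) for d in docs]).vocabulary_

print(sorted(plain_vocab))
print(sorted(stemmed_vocab))
```

Merging word variants shrinks the vocabulary, but the stemming pass itself touches every token of every description, which is costly when run over millions of rows.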

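There is still a lot of room to improve this approach, for example **studying how well different algorithms fit together and blending the models in more sophisticated ways** to find a better combination. That is beyond my current ability, so I leave it as a pointer for anyone who wants to optimize further. **This solution scored 0.44327, which would rank around 288th** (it was finished after the competition ended, so there is no official rank). The first-place score was 0.37758, and 2,384 teams took part.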
![](https://i.imgur.com/s2qtX6w.jpg)

<font size=1>Competition page: https://www.kaggle.com/c/mercari-price-suggestion-challenge</font>