# Background
- **Author**: `<林宜萱>`
- **Created At**: `<2025-10-26>`
- **Path to Training Data**: `extent-of-discount-rate-DE_train.csv`
- **Path to Testing Data**: `extent-of-discount-rate-DE_test.csv`
- **Model Specification**
    - Method: Random Forest Regression
    - Variables:
      - Dependent Variable (y): `DiscountRate`
      - Independent Variables (X): `Age`, `AccumulatedPositiveRate`, `SalePeriod`, `PlayerGrowthRate1W`, `PlayerGrowthRate2W`, `PlayerGrowthRate1M`, `FollowersGrowthRate1W`, `FollowersGrowthRate2W`, `FollowersGrowthRate1M`, `PositiveRateGrowthRate1W`, `PositiveRateGrowthRate2W`, `PositiveRateGrowthRate1M`, `DLC_since_last_discount`, `Sequel_since_last_discount`. Each of the three models below uses the growth-rate features of a single window (1W, 2W, or 1M) together with the other five variables.
    - Tuning Parameters:
      - `test_size = 0.2`
      - `random_state = 42`
      - `n_estimators = 200`
      - `n_jobs = -1`
    - Optimization Method:
    Random Forest is an ensemble method: it averages many decision trees, each trained on a bootstrap sample of the data, which reduces prediction variance and captures non-linear relationships among variables.
- **Main Findings and Takeaways:**
    - In-sample `<R², RMSE>`:
    1W (0.9640, 0.0361), 2W (0.9671, 0.0345), 1M (0.9668, 0.0346)
    - Out-of-sample `<R², RMSE>`:
    1W (0.7199, 0.0814), 2W (0.6291, 0.0937), 1M (0.6683, 0.0886)
    - Feature Importance Ranking (same order in all three windows):

      | Rank | Feature |
      | --- | --- |
      | 1 | AccumulatedPositiveRate |
      | 2 | Age |
      | 3 | FollowersGrowthRate |
      | 4 | PositiveRateGrowthRate |
      | 5 | PlayerGrowthRate |
      | 6 | SalePeriod |
      | 7 | DLC_since_last_discount |
      | 8 | Sequel_since_last_discount |

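    - Interpretation:
      - The model fits the training data very well (R² ≈ 0.96) and retains reasonable explanatory power on the test data (R² between 0.63 and 0.72, depending on the window). Random Forest therefore captures the main feature patterns without severe overfitting, although the gap between training and test R² leaves room to improve generalization.
      - Among all variables, **AccumulatedPositiveRate**, **Age**, and **FollowersGrowthRate** have the largest influence on the discount rate.
      - **Long-term user sentiment** (positive review accumulation) and **community engagement** (follower and player growth) are the main factors behind the depth of a discount.
      - By contrast, variables tied to **recent discount history** (DLC and sequel timing) carry little weight, which suggests that pricing decisions are driven more by game performance and reputation than by past promotions.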
- **Future Direction: Apply cross-validation and hyperparameter tuning (e.g., Grid Search, Random Search) to improve the model's robustness and generalization.**
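
The future direction above could be sketched as follows. This is a minimal illustration, not the author's tuning setup: it runs on synthetic stand-in data, and the grid values (`max_depth`, `min_samples_leaf`) and the 5-fold split are assumptions chosen only to show the mechanics of `GridSearchCV`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (NOT the notebook's dataset): 200 rows, 4 features.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 0.5 * X_demo[:, 0] - 0.3 * X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Hypothetical search grid; the values are illustrative, not recommendations.
param_grid = {
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}

# Every parameter combination is scored by 5-fold cross-validated R².
search = GridSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="r2",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated R²: {search.best_score_:.4f}")
```

Swapping `GridSearchCV` for `RandomizedSearchCV` (with `param_distributions` and `n_iter`) would give the random-search variant mentioned above.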

In [45]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error

### - 1 week

In [46]:
# Load the data
df = pd.read_csv('../data/processed/extent-of-discount-rate-DE.csv')


# Define the features (1-week growth rates) and the target variable
X = df[["Age", "AccumulatedPositiveRate", "SalePeriod",
        "PlayerGrowthRate1W", "FollowersGrowthRate1W",
        "PositiveRateGrowthRate1W", "DLC_since_last_discount",
        "Sequel_since_last_discount"]]
y = df["DiscountRate"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the first 10 rows of the training data
print("Training Data Preview:")
display(X_train.head(10))

Training Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate1W,FollowersGrowthRate1W,PositiveRateGrowthRate1W,DLC_since_last_discount,Sequel_since_last_discount
18,11.060274,0.941803,1,-0.083322,0.002566,3.8e-05,0,0
203,18.378082,0.967921,1,0.006607,0.000635,5e-06,0,0
351,5.252055,0.851617,0,0.024654,0.000248,1.5e-05,0,0
275,8.356164,0.94703,1,-0.097654,0.0002,1.6e-05,0,0
63,4.564384,0.951567,1,-0.095179,0.001978,7.4e-05,0,0
249,4.791781,0.967286,0,-0.064399,0.000422,5e-06,0,0
302,6.254795,0.749023,0,-0.080332,0.001233,0.000263,0,0
108,9.884932,0.92499,0,-0.067359,0.00281,4e-05,0,0
90,6.079452,0.955034,0,0.015896,0.000419,-5e-06,0,0
234,8.857534,0.444776,1,0.0,-0.001718,0.0,0,0


In [47]:
print("Testing Data Preview:")
display(X_test.head(10))

Testing Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate1W,FollowersGrowthRate1W,PositiveRateGrowthRate1W,DLC_since_last_discount,Sequel_since_last_discount
285,3.526027,0.969657,1,-0.07758,0.001286,6.2e-05,0,0
281,3.090411,0.970458,0,0.0997,0.001003,-1e-05,0,0
33,3.260274,0.815636,1,-0.030299,0.000837,0.000512,0,0
211,8.178082,0.982537,0,-0.097956,0.00331,0.001163,0,0
93,6.583562,0.955034,1,-0.293778,0.00062,6e-06,0,0
84,5.167123,0.954422,1,-0.023813,0.000412,1.7e-05,0,0
391,3.10137,0.953009,1,-0.09884,0.001803,3.7e-05,0,0
94,6.643836,0.955118,1,-0.416542,0.000617,6e-06,0,0
225,7.824658,0.45122,0,5921.714286,0.0,0.003049,0,0
126,5.750685,0.884745,1,-0.076967,0.00052,2.8e-05,0,0


#### The actual modeling starts below

In [48]:
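# Build the random forest regression model
model = RandomForestRegressor(
    n_estimators=200,       # number of trees
    max_depth=None,         # no depth limit: trees grow until the leaves are pure or too small to split
    random_state=42,
    n_jobs=-1               # use all CPU cores to speed up training
)

# Fit the model on the training data
model.fit(X_train, y_train)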
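
For regression, a fitted forest predicts the average of its individual trees' predictions, which is the aggregation step described in the model specification. The sketch below checks this on synthetic data (an assumption; it does not use the notebook's dataset) through the fitted `estimators_` attribute.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 3))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=150)

forest = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Each fitted tree lives in `estimators_`; stack their predictions row-wise.
per_tree = np.stack([tree.predict(X_demo) for tree in forest.estimators_])

# Averaging over the tree axis reproduces the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X_demo)

print("Per-tree prediction matrix shape:", per_tree.shape)
print("Matches forest.predict:", np.allclose(manual_average, forest_prediction))
```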


In [None]:
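# Predict on the training data
y_train_pred = model.predict(X_train)

# Evaluate model performance
r2_train = r2_score(y_train, y_train_pred)
# np.sqrt(MSE) avoids the `squared=False` argument, which recent scikit-learn releases no longer accept
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

print("Training Data Performance:")
print(f"  R²: {r2_train:.4f}")
print(f"  RMSE: {rmse_train:.4f}")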

Training Data Performance:
  R²: 0.9640
  RMSE: 0.0361
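
Both reported metrics follow directly from the residuals: R² = 1 − SS_res / SS_tot and RMSE = sqrt(mean((y − ŷ)²)). The sketch below recomputes them by hand and compares the results with scikit-learn; the `y_true` and `y_pred` values are made up for illustration and are not taken from the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only (not the notebook's data).
y_true = np.array([0.10, 0.25, 0.50, 0.75, 0.33])
y_pred = np.array([0.12, 0.20, 0.55, 0.70, 0.30])

# R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
residual_ss = np.sum((y_true - y_pred) ** 2)
total_ss = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - residual_ss / total_ss

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

print("R² matches sklearn:  ", np.isclose(r2_manual, r2_score(y_true, y_pred)))
print("RMSE matches sklearn:", np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred))))
```

Because RMSE is in the units of `DiscountRate`, it can be read as a typical prediction error, whereas R² is relative to the variance of the target.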




In [50]:
# Predict on the test data
y_test_pred = model.predict(X_test)

# Evaluate model performance
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("Testing Performance:")
print(f"  R²: {r2_test:.4f}")
print(f"  RMSE: {rmse_test:.4f}")

Testing Performance:
  R²: 0.7199
  RMSE: 0.0814
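
Training R² on a fully grown forest is optimistic because each tree has seen most of the rows it is scored on. One way to get a second out-of-sample estimate without touching the test set is the out-of-bag (OOB) score, which scores each row using only the trees that did not include it in their bootstrap sample. The sketch below is an assumption-laden illustration on synthetic data; the notebook itself does not enable `oob_score`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the notebook's dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] * X_demo[:, 1] + rng.normal(scale=0.2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# oob_score=True makes the forest compute R² on out-of-bag rows during fitting.
forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_tr, y_tr)

train_r2 = r2_score(y_tr, forest.predict(X_tr))
test_r2 = r2_score(y_te, forest.predict(X_te))

print(f"Train R²: {train_r2:.4f}")
print(f"OOB R²:   {forest.oob_score_:.4f}")
print(f"Test R²:  {test_r2:.4f}")
```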


In [51]:
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
display(importances)


Feature Importances:


Unnamed: 0,Feature,Importance
1,AccumulatedPositiveRate,0.554992
0,Age,0.146904
4,FollowersGrowthRate1W,0.145339
5,PositiveRateGrowthRate1W,0.110676
3,PlayerGrowthRate1W,0.031105
2,SalePeriod,0.008671
6,DLC_since_last_discount,0.002074
7,Sequel_since_last_discount,0.000238
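
`feature_importances_` is impurity-based: it is computed from the training data and sums to 1 across features. A common complementary check is permutation importance on held-out data, which measures how much the score drops when one column is shuffled. The sketch below compares the two on synthetic data; the column names `signal`, `weak`, and `noise` are hypothetical and the notebook does not run this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical column names (NOT the notebook's dataset).
rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["signal", "weak", "noise"])
y_demo = 2.0 * X_demo["signal"] + 0.3 * X_demo["weak"] + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set 10 times and record the mean drop in R².
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=42)

comparison = pd.DataFrame({
    "Feature": X_demo.columns,
    "ImpurityImportance": forest.feature_importances_,
    "PermutationImportance": result.importances_mean,
}).sort_values(by="PermutationImportance", ascending=False)

print(comparison)
```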


### - 2 weeks

In [52]:
# Load the data
df = pd.read_csv('../data/processed/extent-of-discount-rate-DE.csv')


# Define the features (2-week growth rates) and the target variable
X = df[["Age", "AccumulatedPositiveRate", "SalePeriod",
        "PlayerGrowthRate2W", "FollowersGrowthRate2W",
        "PositiveRateGrowthRate2W", "DLC_since_last_discount",
        "Sequel_since_last_discount"]]
y = df["DiscountRate"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the first 10 rows of the training data
print("Training Data Preview:")
display(X_train.head(10))

Training Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate2W,FollowersGrowthRate2W,PositiveRateGrowthRate2W,DLC_since_last_discount,Sequel_since_last_discount
18,11.060274,0.941803,1,-0.12154,0.005281,6.3e-05,0,0
203,18.378082,0.967921,1,-0.04237,0.001301,1.8e-05,0,0
351,5.252055,0.851617,0,0.059666,0.000512,-3e-05,0,0
275,8.356164,0.94703,1,-0.423671,0.000458,1.4e-05,0,0
63,4.564384,0.951567,1,0.262779,0.00443,-7.5e-05,0,0
249,4.791781,0.967286,0,-0.048033,0.001024,3.6e-05,0,0
302,6.254795,0.749023,0,-0.166326,0.002349,0.000467,0,0
108,9.884932,0.92499,0,-0.049449,0.005543,0.000171,0,0
90,6.079452,0.955034,0,-0.10132,0.0007,4e-06,0,0
234,8.857534,0.444776,1,-0.176471,-0.001718,0.0,0,0


In [53]:
print("Testing Data Preview:")
display(X_test.head(10))

Testing Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate2W,FollowersGrowthRate2W,PositiveRateGrowthRate2W,DLC_since_last_discount,Sequel_since_last_discount
285,3.526027,0.969657,1,0.045287,0.006364,3.9e-05,0,0
281,3.090411,0.970458,0,0.051584,0.002086,-5.7e-05,0,0
33,3.260274,0.815636,1,-0.210916,0.001681,0.00124,0,0
211,8.178082,0.982537,0,-0.176385,0.006844,-0.000321,0,0
93,6.583562,0.955034,1,-0.721785,0.001203,2.4e-05,0,0
84,5.167123,0.954422,1,0.028463,0.000875,3.3e-05,0,0
391,3.10137,0.953009,1,-0.107192,0.004042,7.5e-05,0,0
94,6.643836,0.955118,1,-0.348077,0.00118,6e-05,0,0
225,7.824658,0.45122,0,-0.324636,0.0,0.003049,0,0
126,5.750685,0.884745,1,-0.18492,0.001139,6.3e-05,0,0


#### The actual modeling starts below

In [54]:
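# Build the random forest regression model
model = RandomForestRegressor(
    n_estimators=200,       # number of trees
    max_depth=None,         # no depth limit: trees grow until the leaves are pure or too small to split
    random_state=42,
    n_jobs=-1               # use all CPU cores to speed up training
)

# Fit the model on the training data
model.fit(X_train, y_train)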


In [55]:
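# Predict on the training data
y_train_pred = model.predict(X_train)

# Evaluate model performance
r2_train = r2_score(y_train, y_train_pred)
# np.sqrt(MSE) avoids the `squared=False` argument, which recent scikit-learn releases no longer accept
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

print("Training Data Performance:")
print(f"  R²: {r2_train:.4f}")
print(f"  RMSE: {rmse_train:.4f}")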

Training Data Performance:
  R²: 0.9671
  RMSE: 0.0345




In [56]:
# Predict on the test data
y_test_pred = model.predict(X_test)

# Evaluate model performance
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("Testing Performance:")
print(f"  R²: {r2_test:.4f}")
print(f"  RMSE: {rmse_test:.4f}")

Testing Performance:
  R²: 0.6291
  RMSE: 0.0937


In [57]:
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
display(importances)


Feature Importances:


Unnamed: 0,Feature,Importance
1,AccumulatedPositiveRate,0.562901
0,Age,0.169201
4,FollowersGrowthRate2W,0.125582
5,PositiveRateGrowthRate2W,0.089377
3,PlayerGrowthRate2W,0.037721
2,SalePeriod,0.01286
6,DLC_since_last_discount,0.002215
7,Sequel_since_last_discount,0.000143


### - 1 month

In [58]:
# Load the data
df = pd.read_csv('../data/processed/extent-of-discount-rate-DE.csv')


# Define the features (1-month growth rates) and the target variable
X = df[["Age", "AccumulatedPositiveRate", "SalePeriod",
        "PlayerGrowthRate1M", "FollowersGrowthRate1M",
        "PositiveRateGrowthRate1M", "DLC_since_last_discount",
        "Sequel_since_last_discount"]]
y = df["DiscountRate"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Show the first 10 rows of the training data
print("Training Data Preview:")
display(X_train.head(10))

Training Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate1M,FollowersGrowthRate1M,PositiveRateGrowthRate1M,DLC_since_last_discount,Sequel_since_last_discount
18,11.060274,0.941803,1,0.102063,0.012877,-1.6e-05,0,0
203,18.378082,0.967921,1,-0.04557,0.003205,3.9e-05,0,0
351,5.252055,0.851617,0,-0.088322,0.000775,8.6e-05,0,0
275,8.356164,0.94703,1,-0.021211,0.001693,-2.1e-05,0,0
63,4.564384,0.951567,1,-0.088474,0.013172,-3.3e-05,0,0
249,4.791781,0.967286,0,-0.165792,0.002303,0.000106,0,0
302,6.254795,0.749023,0,-0.09625,0.005604,0.000558,0,0
108,9.884932,0.92499,0,0.019425,0.011309,0.000313,0,0
90,6.079452,0.955034,0,-0.295558,0.001365,3e-05,0,0
234,8.857534,0.444776,1,-0.326531,-0.003431,-0.002985,0,0


In [59]:
print("Testing Data Preview:")
display(X_test.head(10))

Testing Data Preview:


Unnamed: 0,Age,AccumulatedPositiveRate,SalePeriod,PlayerGrowthRate1M,FollowersGrowthRate1M,PositiveRateGrowthRate1M,DLC_since_last_discount,Sequel_since_last_discount
285,3.526027,0.969657,1,-0.029158,0.012,5.2e-05,0,0
281,3.090411,0.970458,0,-0.16869,0.004481,4.1e-05,0,0
33,3.260274,0.815636,1,-0.046926,0.003939,0.002761,0,0
211,8.178082,0.982537,0,0.473536,0.01649,-0.000212,0,0
93,6.583562,0.955034,1,4.533607,0.007444,-3.2e-05,0,0
84,5.167123,0.954422,1,0.052761,0.001671,7.4e-05,0,0
391,3.10137,0.953009,1,-0.177216,0.007496,0.000187,0,0
94,6.643836,0.955118,1,-0.362682,0.005238,9.3e-05,0,0
225,7.824658,0.45122,0,0.066555,-0.001718,0.006803,0,0
126,5.750685,0.884745,1,-0.02664,0.002766,0.000235,0,0


#### The actual modeling starts below

In [60]:
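# Build the random forest regression model
model = RandomForestRegressor(
    n_estimators=200,       # number of trees
    max_depth=None,         # no depth limit: trees grow until the leaves are pure or too small to split
    random_state=42,
    n_jobs=-1               # use all CPU cores to speed up training
)

# Fit the model on the training data
model.fit(X_train, y_train)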


In [61]:
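# Predict on the training data
y_train_pred = model.predict(X_train)

# Evaluate model performance
r2_train = r2_score(y_train, y_train_pred)
# np.sqrt(MSE) avoids the `squared=False` argument, which recent scikit-learn releases no longer accept
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))

print("Training Data Performance:")
print(f"  R²: {r2_train:.4f}")
print(f"  RMSE: {rmse_train:.4f}")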

Training Data Performance:
  R²: 0.9668
  RMSE: 0.0346




In [62]:
# Predict on the test data
y_test_pred = model.predict(X_test)

# Evaluate model performance
r2_test = r2_score(y_test, y_test_pred)
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))

print("Testing Performance:")
print(f"  R²: {r2_test:.4f}")
print(f"  RMSE: {rmse_test:.4f}")

Testing Performance:
  R²: 0.6683
  RMSE: 0.0886


In [63]:
importances = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nFeature Importances:")
display(importances)


Feature Importances:


Unnamed: 0,Feature,Importance
1,AccumulatedPositiveRate,0.542242
0,Age,0.180833
4,FollowersGrowthRate1M,0.140783
5,PositiveRateGrowthRate1M,0.091171
3,PlayerGrowthRate1M,0.035393
2,SalePeriod,0.008263
6,DLC_since_last_discount,0.00113
7,Sequel_since_last_discount,0.000185
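
The three sections above differ only in the growth-rate window, so the same steps could be wrapped in one function and looped over `1W`, `2W`, and `1M`. The sketch below is one possible refactoring, not the notebook's code: `evaluate_window` is a hypothetical helper, and it is demonstrated on a randomly generated DataFrame that merely reuses the notebook's column names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def evaluate_window(df, suffix, n_estimators=200):
    """Hypothetical helper: fit and score the model for one window ('1W', '2W', '1M')."""
    features = ["Age", "AccumulatedPositiveRate", "SalePeriod",
                f"PlayerGrowthRate{suffix}", f"FollowersGrowthRate{suffix}",
                f"PositiveRateGrowthRate{suffix}", "DLC_since_last_discount",
                "Sequel_since_last_discount"]
    X, y = df[features], df["DiscountRate"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=42, n_jobs=-1)
    model.fit(X_train, y_train)

    # Collect R² and RMSE for both splits in one row.
    row = {"Window": suffix}
    for split, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        row[f"R2_{split}"] = r2_score(y_part, pred)
        row[f"RMSE_{split}"] = np.sqrt(mean_squared_error(y_part, pred))
    return row


# Random stand-in data with the notebook's column names; the values are NOT real.
rng = np.random.default_rng(42)
columns = ["Age", "AccumulatedPositiveRate", "SalePeriod",
           "DLC_since_last_discount", "Sequel_since_last_discount"]
columns += [f"{name}{s}" for s in ("1W", "2W", "1M")
            for name in ("PlayerGrowthRate", "FollowersGrowthRate", "PositiveRateGrowthRate")]
demo_df = pd.DataFrame(rng.random((120, len(columns))), columns=columns)
demo_df["DiscountRate"] = 0.5 * demo_df["AccumulatedPositiveRate"] + 0.1 * rng.random(120)

summary = pd.DataFrame(
    [evaluate_window(demo_df, s, n_estimators=20) for s in ("1W", "2W", "1M")])
print(summary.round(4))
```

With the real data, passing the loaded `df` and the default `n_estimators=200` should yield a summary table comparable to the in-sample and out-of-sample figures listed in the Background section.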
