# Background
- **Author**: `<郭伊軒>`
- **Created At**: `<2025-11-15>`
- **Path to Training Data**: `discount-timing-DE.csv`
- **Path to Testing Data**: `discount-timing-DE.csv`
- **Model Specification**
    - Method: random forest
    - Variables:  
    ['Age', 'AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M', 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 'SalePeriod', 'DLC_sum_1W', 'Sequel_sum_1W']
    - Tuning Parameters:  
    ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'class_weight']
    - Optimization Method: grid search (`GridSearchCV`) with 5-fold stratified cross-validation
    'n_estimators' = , 
    'max_depth' = ,
    'min_samples_split' = ,
    'min_samples_leaf' = ,
    'class_weight' = 
- **Main Findings and Takeaways:**
    - In-sample (Accuracy, F1 score, `<AUC>`):  
    DiscountOrNot(0.6403, 0.0482, 0.8383), DiscountDuringSale(0.7925, 0.1337, 0.8611), DiscountOutOfSale(0.6403, 0.0482, 0.8383)
    - Out-sample (Accuracy, F1 score, `<AUC>`):  
    DiscountOrNot(0.5767, 0.0494, 0.7555), DiscountDuringSale(0.7515, 0.0833, 0.7542), DiscountOutOfSale(0.5767, 0.0494, 0.7555)
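    - Feature Importance Ranking (this order matches the baseline model's importances for `DiscountDuringSale`):

  | Rank | Feature |
  | --- | --- |
  | 1 | PlayerGrowthRate1W |
  | 2 | FollowersGrowthRate1W |
  | 3 | SalePeriod |
  | 4 | AccumulatedPositiveRate |
  | 5 | Age |
  | 6 | DiscountFreq3M |
  | 7 | PositiveRateGrowthRate1W |
  | 8 | DLC_sum_1W |
  | 9 | MultiPlayer |
  | 10 | Sequel_sum_1W |
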
- **Future Direction:**

In [1]:
# Load packages here
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, make_scorer


In [2]:
# Load the TRAINING data here and please finish all the data manipulation here.
input_data_file = "/Users/10610/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
#input_data_file = "/Users/user/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
df = pd.read_csv(input_data_file)

df_dummies = pd.get_dummies(df, columns=['GameID'], drop_first=True)

train = df_dummies[df_dummies['Date'] < '2025-01-01']
test = df_dummies[df_dummies['Date'] >= '2025-01-01']

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
    # Convert bool columns to int
    X = X.astype({col: 'int' for col in X.select_dtypes(bool).columns})
    return X, y
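
The date split and `prepare_xy` can be illustrated on a toy frame. This is a minimal sketch, assuming `Date` is stored as ISO-formatted strings (`YYYY-MM-DD`), so that string comparison agrees with chronological order. The toy rows are invented for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Invented toy rows; column names follow the notebook
toy = pd.DataFrame({
    "Date": ["2024-12-30", "2024-12-31", "2025-01-01", "2025-01-02"],
    "MultiPlayer": [True, False, True, False],  # a bool column
    "Age": [3.2, 5.1, 3.2, 5.1],
    "DiscountOrNot": [0, 1, 0, 0],
})

# ISO date strings sort lexicographically in chronological order
toy_train = toy[toy["Date"] < "2025-01-01"]
toy_test = toy[toy["Date"] >= "2025-01-01"]

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
    # Cast bool columns to int so every feature is numeric
    X = X.astype({col: "int" for col in X.select_dtypes(bool).columns})
    return X, y

X_toy, y_toy = prepare_xy(toy_train, ["Age", "MultiPlayer"], "DiscountOrNot")
print(len(toy_train), len(toy_test))
print(X_toy["MultiPlayer"].tolist())
```

If `Date` were stored in another format (for example `DD/MM/YYYY`), the comparison would need `pd.to_datetime` first.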


In [3]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GameID | 23938.0 | 461376.742 | 298559.181056 | 10.0 | 244850.0 | 431730.0 | 644930.0 | 1145360.0 |
| MultiPlayer | 23938.0 | 0.464241 | 0.49873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ConstantDiscount | 23938.0 | 0.214387 | 0.410405 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountOrNot | 23938.0 | 0.019885 | 0.139607 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuration | 23938.0 | 0.221196 | 1.715483 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| DiscountFreq3M | 23938.0 | 1.797644 | 1.043279 | 0.0 | 1.0 | 2.0 | 3.0 | 6.0 |
| Age | 23938.0 | 7.634427 | 4.458471 | 2.389041 | 4.95137 | 6.323288 | 8.479452 | 24.84658 |
| AccumulatedPositiveRate | 23938.0 | 0.928061 | 0.064186 | 0.738751 | 0.905517 | 0.953165 | 0.972651 | 0.9929734 |
| SalePeriod | 23938.0 | 0.14642 | 0.353534 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuringSale | 23938.0 | 0.008647 | 0.09259 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### The actual modeling starts below
For the remaining blocks, make sure you have followed the guidelines specified in [Project Folder Structure, File Naming, and Documentation Guidelines (專案資料夾結構、檔案命名與文件規範)](https://docs.google.com/document/d/1sl6gEFMdmiGsiNjLe17UmZ30xKxq15U0Mb2B-Jvusxg/edit?tab=t.33iie8ybx7s4).


## Functions

In [13]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)

    importances = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': model.feature_importances_
    }).sort_values(by='Importance', ascending=False)

    print("\nFeature Importances:")
    display(importances)


    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    y_prob_train = model.predict_proba(X_train)[:, 1]
    y_prob_test = model.predict_proba(X_test)[:, 1]


    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)
    cm = confusion_matrix(y_test, y_pred_test)

    results = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
    }

    row_names = ['train', 'test']

    result = pd.DataFrame(results, index=row_names)


    print(f"\n=== {name} ===")
    print("Confusion matrix:\n", cm)
    return result


In [14]:
def find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid):

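    # 1. Initialize the classifier
    rf_clf = RandomForestClassifier(
        random_state=71, 
        max_features='sqrt', 
    )
    
    # 2. Set the cross-validation strategy
    # StratifiedKFold keeps the class ratio in every fold, which matters for imbalanced data
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=71)
    
    # 3. Define the scoring metric
    # For imbalanced data, AUC or F1 score is usually more informative than plain accuracy.
    # Note: make_scorer(roc_auc_score) scores the hard labels returned by predict();
    # scoring='roc_auc' would score the predicted probabilities instead.
    scorer = make_scorer(roc_auc_score)
    
    # 4. Initialize GridSearchCV
    grid_search = GridSearchCV(
        estimator=rf_clf,
        param_grid=param_grid,
        scoring=scorer,       # scoring metric defined above
        cv=skf,               # stratified cross-validation
        verbose=1,            # print progress
        n_jobs=-1             # use all available CPU cores in parallel
    )
    
    # 5. Run the grid search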
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    y_pred_train = best_model.predict(X_train)
    y_pred_test = best_model.predict(X_test)

    y_prob_train = best_model.predict_proba(X_train)[:, 1]
    y_prob_test = best_model.predict_proba(X_test)[:, 1]


    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)


    result = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
    }

    row_names = ['train', 'test']

    df = pd.DataFrame(result, index=row_names)
        
    # Return the best parameters and the train/test metrics table
    return grid_search.best_params_, df
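
The grid search above ranks candidates with `make_scorer(roc_auc_score)`, which feeds hard class labels from `predict()` into `roc_auc_score`. The reported train/test AUC values are computed from `predict_proba`. The two can differ noticeably. The sketch below uses invented toy scores (not values from the notebook) to show the gap; the built-in string scorer `scoring='roc_auc'` is the scikit-learn option that scores continuous outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Invented toy example: 3 negatives, 2 positives
y_true = np.array([0, 0, 0, 1, 1])
y_prob = np.array([0.10, 0.20, 0.45, 0.40, 0.30])  # positive-class probabilities

# Hard labels at the 0.5 cut-off, as a binary classifier's predict() would return
y_label = (y_prob >= 0.5).astype(int)

auc_from_prob = roc_auc_score(y_true, y_prob)    # uses the ranking of probabilities
auc_from_label = roc_auc_score(y_true, y_label)  # every label is 0, so all pairs tie

print(round(auc_from_prob, 4), auc_from_label)
```

Here the probabilities rank 4 of the 6 positive/negative pairs correctly (AUC about 0.667), while the hard labels carry no ranking information (AUC 0.5). Model selection based on the label version may therefore pick different hyperparameters than selection based on probabilities.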




# 1W

In [None]:
feature_cols = [
    'Age', 'AccumulatedPositiveRate', 'MultiPlayer', 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 
    'SalePeriod', 'DiscountFreq3M', 'DLC_sum_1W', 'Sequel_sum_1W'
]

baseline_model = RandomForestClassifier(      
    n_estimators=200,
    max_features='sqrt',
    max_depth=6,
    min_samples_split=2,
    random_state=71
)

### All discounts

In [11]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot')

In [21]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}

best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
       Accuracy  F1 score     AUC
train    0.6403    0.0482  0.8383
test     0.5767    0.0494  0.7555
{'class_weight': 'balanced', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
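
The "60 candidates, totalling 300 fits" line in the log follows directly from the grid: 3 × 5 × 2 × 2 × 1 = 60 parameter combinations, each fitted on 5 folds. A short check with scikit-learn's `ParameterGrid`, using the same grid as the cell above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced'],
}

n_splits = 5  # matches StratifiedKFold(n_splits=5) in the search function
n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)
```

GridSearchCV then refits the best candidate once more on the full training set (the default `refit=True`), which is not counted in the 300.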


In [22]:
model = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight='balanced',
    random_state=71
)
result1 = evaluate_model('baseline', baseline_model, X_train, y_train, X_test, y_test)
result2 = evaluate_model('selection', model, X_train, y_train, X_test, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison results:")
print(combined_results)


Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 4 | FollowersGrowthRate1W | 0.205538 |
| 1 | AccumulatedPositiveRate | 0.193841 |
| 0 | Age | 0.158503 |
| 3 | PlayerGrowthRate1W | 0.144213 |
| 5 | PositiveRateGrowthRate1W | 0.120536 |
| 7 | DiscountFreq3M | 0.11977 |
| 6 | SalePeriod | 0.031071 |
| 8 | DLC_sum_1W | 0.020442 |
| 2 | MultiPlayer | 0.006076 |
| 9 | Sequel_sum_1W | 1e-05 |


=== baseline ===
Confusion matrix:
 [[6729    0]
 [  93    0]]

Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 7 | DiscountFreq3M | 0.3978763 |
| 6 | SalePeriod | 0.2326626 |
| 4 | FollowersGrowthRate1W | 0.1598095 |
| 1 | AccumulatedPositiveRate | 0.06256267 |
| 0 | Age | 0.05146141 |
| 3 | PlayerGrowthRate1W | 0.04940945 |
| 5 | PositiveRateGrowthRate1W | 0.04148597 |
| 2 | MultiPlayer | 0.002978712 |
| 8 | DLC_sum_1W | 0.001753452 |
| 9 | Sequel_sum_1W | 8.332833e-16 |


=== selection ===
Confusion matrix:
 [[3859 2870]
 [  18   75]]

Model comparison results:
                 Accuracy  F1 score     AUC
baseline  train    0.9897    0.0000  0.9774
          test     0.9864    0.0000  0.7313
selection train    0.6403    0.0482  0.8383
          test     0.5767    0.0494  0.7555
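
The accuracy and F1 figures in the comparison table can be recomputed from the test confusion matrices printed above, which also shows why the baseline's 0.9864 accuracy is uninformative: it never predicts the positive class. The sketch below assumes scikit-learn's layout, with rows as true classes and columns as predicted classes.

```python
def metrics_from_cm(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return accuracy, precision, recall, f1

# Test-set confusion matrices from the output above: [[tn, fp], [fn, tp]]
baseline = metrics_from_cm(tn=6729, fp=0, fn=93, tp=0)
selection = metrics_from_cm(tn=3859, fp=2870, fn=18, tp=75)

print([round(v, 4) for v in baseline])   # [0.9864, 0.0, 0.0, 0.0]
print([round(v, 4) for v in selection])  # [0.5767, 0.0255, 0.8065, 0.0494]
```

The tuned model recovers 75 of the 93 positives (recall about 0.81) at the cost of 2870 false positives, so precision is about 0.025 and F1 stays low even though AUC improves.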


### Seasonal discounts

In [None]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

In [16]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
       Accuracy  F1 score     AUC
train    0.7925    0.1337  0.8611
test     0.7515    0.0833  0.7542
{'class_weight': 'balanced', 'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 4, 'n_estimators': 300}
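
Every candidate in the grid uses `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * count_of_class)`. A minimal sketch with an invented label vector at roughly the 2% positive rate seen for `DiscountOrNot` in the `describe()` output:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Invented labels: 98 negatives and 2 positives (about a 2% positive rate)
y = np.array([0] * 98 + [1] * 2)

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights computed by hand from the formula
manual = len(y) / (2 * np.bincount(y))

print(weights)
```

With these counts the positive class receives a weight of 25 and the negative class about 0.51, so each positive sample counts roughly 49 times as much as a negative one. This is one reason the tuned models predict the positive class far more often than the unweighted baseline.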


In [17]:
model = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=4,
    min_samples_split=4,
    min_samples_leaf=1,
    class_weight='balanced',
    random_state=71
)
result1 = evaluate_model('baseline', baseline_model, X_train, y_train, X_test, y_test)
result2 = evaluate_model('selection', model, X_train, y_train, X_test, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison results:")
print(combined_results)


Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 3 | PlayerGrowthRate1W | 0.208431 |
| 4 | FollowersGrowthRate1W | 0.157588 |
| 6 | SalePeriod | 0.156854 |
| 1 | AccumulatedPositiveRate | 0.133996 |
| 0 | Age | 0.1264 |
| 7 | DiscountFreq3M | 0.11032 |
| 5 | PositiveRateGrowthRate1W | 0.089854 |
| 8 | DLC_sum_1W | 0.008649 |
| 2 | MultiPlayer | 0.007899 |
| 9 | Sequel_sum_1W | 8e-06 |


=== baseline ===
Confusion matrix:
 [[6701    0]
 [ 121    0]]

Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 6 | SalePeriod | 0.338822 |
| 7 | DiscountFreq3M | 0.334691 |
| 4 | FollowersGrowthRate1W | 0.110376 |
| 3 | PlayerGrowthRate1W | 0.087969 |
| 5 | PositiveRateGrowthRate1W | 0.045619 |
| 1 | AccumulatedPositiveRate | 0.045051 |
| 0 | Age | 0.031834 |
| 2 | MultiPlayer | 0.004494 |
| 8 | DLC_sum_1W | 0.001133 |
| 9 | Sequel_sum_1W | 1e-05 |


=== selection ===
Confusion matrix:
 [[5050 1651]
 [  44   77]]

Model comparison results:
                 Accuracy  F1 score     AUC
baseline  train    0.9793    0.0000  0.9322
          test     0.9823    0.0000  0.7333
selection train    0.7925    0.1337  0.8611
          test     0.7515    0.0833  0.7542


### Non-seasonal discounts

In [18]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')

In [19]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
       Accuracy  F1 score     AUC
train    0.6403    0.0482  0.8383
test     0.5767    0.0494  0.7555
{'class_weight': 'balanced', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}


In [20]:
model = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight='balanced',
    random_state=71
)
result1 = evaluate_model('baseline', baseline_model, X_train, y_train, X_test, y_test)
result2 = evaluate_model('selection', model, X_train, y_train, X_test, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison results:")
print(combined_results)


Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 4 | FollowersGrowthRate1W | 0.205538 |
| 1 | AccumulatedPositiveRate | 0.193841 |
| 0 | Age | 0.158503 |
| 3 | PlayerGrowthRate1W | 0.144213 |
| 5 | PositiveRateGrowthRate1W | 0.120536 |
| 7 | DiscountFreq3M | 0.11977 |
| 6 | SalePeriod | 0.031071 |
| 8 | DLC_sum_1W | 0.020442 |
| 2 | MultiPlayer | 0.006076 |
| 9 | Sequel_sum_1W | 1e-05 |


=== baseline ===
Confusion matrix:
 [[6729    0]
 [  93    0]]

Feature Importances:

| | Feature | Importance |
| --- | --- | --- |
| 7 | DiscountFreq3M | 0.3978763 |
| 6 | SalePeriod | 0.2326626 |
| 4 | FollowersGrowthRate1W | 0.1598095 |
| 1 | AccumulatedPositiveRate | 0.06256267 |
| 0 | Age | 0.05146141 |
| 3 | PlayerGrowthRate1W | 0.04940945 |
| 5 | PositiveRateGrowthRate1W | 0.04148597 |
| 2 | MultiPlayer | 0.002978712 |
| 8 | DLC_sum_1W | 0.001753452 |
| 9 | Sequel_sum_1W | 8.332833e-16 |


=== selection ===
Confusion matrix:
 [[3859 2870]
 [  18   75]]

Model comparison results:
                 Accuracy  F1 score     AUC
baseline  train    0.9897    0.0000  0.9774
          test     0.9864    0.0000  0.7313
selection train    0.6403    0.0482  0.8383
          test     0.5767    0.0494  0.7555
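
The metrics, feature importances, and confusion matrices in this section are identical to those shown under the all-discounts section. The all-discounts search and evaluation cells carry execution counts `In [21]` and `In [22]`, later than the `In [18]` cell that builds the `DiscountOutOfSale` target, which would be consistent with both sets of outputs having been computed on the same `y_train`. One quick way to rule this out is to confirm that the target columns differ before rerunning. The sketch below uses an invented toy frame, and the derivation of `DiscountOutOfSale` in it is an assumption for illustration, not the notebook's definition.

```python
import pandas as pd

# Invented toy rows; only the column names come from the notebook
toy = pd.DataFrame({
    "DiscountOrNot":      [0, 1, 1, 0, 1],
    "DiscountDuringSale": [0, 1, 0, 0, 0],
})
# Assumption for illustration: an out-of-sale discount is a discount outside a sale period
toy["DiscountOutOfSale"] = toy["DiscountOrNot"] - toy["DiscountDuringSale"]

targets = ["DiscountOrNot", "DiscountDuringSale", "DiscountOutOfSale"]

# Distinct targets should not be element-wise identical
same_target = toy["DiscountOrNot"].equals(toy["DiscountOutOfSale"])

# Positive counts per target; differing counts imply differing confusion matrices
positives = toy[targets].sum().to_dict()

print(same_target, positives)
```

In the notebook itself, comparing `y_train.name` and `y_train.sum()` right before each `find_best_params_grid_searchCV` call would show which target a given run actually used.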
