# Background
- **Author**: `<郭伊軒>`
- **Created At**: `<2025-11-15>`
- **Path to Training Data:** `discount-timing-DE.csv` (rows with `Date` before 2025-01-01)
- **Path to Testing Data:** `discount-timing-DE.csv` (rows with `Date` on or after 2025-01-01)
- **Model Specification:** 
    - Method: random forest
    - Variables (each set is combined with the `GameID_*` dummy columns):  
        - Calculation method: one-week growth rate  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',   
            'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',   
            'PositiveRateGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',   
            'DLC_sum_1W_lag0', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',   
            'Sequel_sum_1W_lag0', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14']  
        - Calculation method: two-week growth rate  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7',    
            'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7',   
            'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7',   
            'DLC_sum_2W_lag0', 'DLC_sum_2W_lag7',   
            'Sequel_sum_2W_lag0', 'Sequel_sum_2W_lag7']
    - Tuning Parameters:  
        ['n_estimators', 'max_depth', 'min_samples_split', 'min_samples_leaf', 'class_weight']
    - Optimization Method: grid search (`GridSearchCV`) with a 5-fold `TimeSeriesSplit`, scored by ROC AUC. Best parameters found:
        - Calculation method: one-week growth rate
            - Non-seasonal discount  
                {'class_weight': 'balanced', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
            - Seasonal discount  
                {'class_weight': 'balanced', 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}   
        - Calculation method: two-week growth rate
            - Non-seasonal discount  
                {'class_weight': 'balanced', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
            - Seasonal discount  
                {'class_weight': 'balanced', 'max_depth': 7, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}   
- **Main Findings and Takeaways:**
    - Calculation method: one-week growth rate
        - In-sample `<AUC>`:  
            DiscountOutOfSale(`0.9273`), DiscountDuringSale(`0.9673`)
        - Out-sample `<AUC>`:  
            DiscountOutOfSale(`0.7628`), DiscountDuringSale(`0.9710`)
        - Feature Importance Ranking (permutation importance on the training set, top 5):
            - Non-seasonal discount
                1. DiscountFreq3M
                2. SalePeriod
                3. PlayerGrowthRate1W_lag0
                4. PlayerGrowthRate1W_lag14
                5. FollowersGrowthRate1W_lag0
            - Seasonal discount
                1. SalePeriod
                2. PlayerGrowthRate1W_lag0
                3. DiscountFreq3M
                4. FollowersGrowthRate1W_lag0
                5. PlayerGrowthRate1W_lag7

    - Calculation method: two-week growth rate
        - In-sample `<AUC>`:  
            DiscountOutOfSale(`0.9298`), DiscountDuringSale(`0.9806`)
        - Out-sample `<AUC>`:  
            DiscountOutOfSale(`0.7534`), DiscountDuringSale(`0.9652`)
        - Feature Importance Ranking (permutation importance on the training set, top 5):
            - Non-seasonal discount
                1. DiscountFreq3M
                2. SalePeriod
                3. PlayerGrowthRate2W_lag7
                4. FollowersGrowthRate2W_lag0
                5. PositiveRateGrowthRate2W_lag0
            - Seasonal discount
                1. SalePeriod
                2. DiscountFreq3M
                3. PlayerGrowthRate2W_lag0
                4. PlayerGrowthRate2W_lag7
                5. FollowersGrowthRate2W_lag0
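- **Future Direction:** The one-week growth-rate features give slightly better out-of-sample AUC than the two-week features (0.7628 vs. 0.7534 for non-seasonal discounts, 0.9710 vs. 0.9652 for seasonal discounts). A next step is to try XGBoost.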

### Pre-processing

In [152]:
# Load packages here
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix, make_scorer, average_precision_score
from sklearn.inspection import permutation_importance

In [153]:
# Load the TRAINING data here and please finish all the data manipulation here.
#input_data_file = "/Users/10610/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
input_data_file = "/Users/user/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
df = pd.read_csv(input_data_file)
df_dummies = pd.get_dummies(df, columns=['GameID'], drop_first=True)
df_dummies.dropna(inplace=True)

train = df_dummies[df_dummies['Date'] < '2025-01-01']
test = df_dummies[df_dummies['Date'] >= '2025-01-01']

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
    # Cast boolean columns (e.g. the GameID dummies) to int
    X = X.astype({col: 'int' for col in X.select_dtypes(bool).columns})
    return X, y
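
The date-based split and the boolean cast above can be checked on a toy frame. This is a minimal sketch: the rows are made up, only the column names mirror the notebook, and it assumes `Date` is stored as ISO `YYYY-MM-DD` strings so that string comparison agrees with chronological order.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Date": ["2024-12-30", "2024-12-31", "2025-01-01", "2025-01-02"],
    "GameID_3590": [True, False, True, False],  # boolean dummy, as produced by get_dummies
    "Age": [3.1, 3.1, 3.2, 3.2],
    "Target": [0, 1, 0, 1],
})

# ISO-formatted date strings sort lexicographically in chronological order,
# so a plain string comparison yields a clean time-based split.
toy_train = toy[toy["Date"] < "2025-01-01"]
toy_test = toy[toy["Date"] >= "2025-01-01"]

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
    # Cast boolean columns to int so every feature is numeric
    X = X.astype({col: "int" for col in X.select_dtypes(bool).columns})
    return X, y

X_toy, y_toy = prepare_xy(toy_train, ["GameID_3590", "Age"], "Target")
print(X_toy.dtypes)
```

If the real `Date` column used another format (for example `DD/MM/YYYY`), the comparison would no longer be chronological and converting with `pd.to_datetime` first would be the safer choice.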


In [154]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| GameID | 23938.0 | 461376.742 | 298559.181056 | 10.0 | 244850.0 | 431730.0 | 644930.0 | 1145360.0 |
| MultiPlayer | 23938.0 | 0.464241 | 0.49873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ConstantDiscount | 23938.0 | 0.214387 | 0.410405 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountOrNot | 23938.0 | 0.019885 | 0.139607 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuration | 23938.0 | 0.221196 | 1.715483 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| DiscountFreq3M | 23938.0 | 1.797644 | 1.043279 | 0.0 | 1.0 | 2.0 | 3.0 | 6.0 |
| Age | 23938.0 | 7.634427 | 4.458471 | 2.389041 | 4.95137 | 6.323288 | 8.479452 | 24.84658 |
| AccumulatedPositiveRate | 23938.0 | 0.928061 | 0.064186 | 0.738751 | 0.905517 | 0.953165 | 0.972651 | 0.9929734 |
| SalePeriod | 23938.0 | 0.14642 | 0.353534 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuringSale | 23938.0 | 0.008647 | 0.09259 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


## Functions

In [155]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)

    '''importances = pd.DataFrame({
        'Feature': X_train.columns,
        'Importance': model.feature_importances_
    }).sort_values(by='Importance', ascending=False)

    print("\nFeature Importances:")
    display(importances)'''


    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    y_prob_train = model.predict_proba(X_train)[:, 1]
    y_prob_test = model.predict_proba(X_test)[:, 1]


    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)
    #pr_auc_train = average_precision_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)
    #pr_auc_test = average_precision_score(y_test, y_prob_test)
    cm = confusion_matrix(y_test, y_pred_test)

    results = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
        #'pr_AUC':[round(pr_auc_train, 4), round(pr_auc_test, 4)]
    }

    row_names = ['train', 'test']

    result = pd.DataFrame(results, index=row_names)


    print(f"\n=== {name} ===")
    print("Confusion matrix:\n", cm)
    return result
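
The commented-out `average_precision_score` lines in `evaluate_model` suggest that PR-AUC was considered as an extra metric. With positives as rare as they are here, ROC AUC and PR-AUC can tell quite different stories, so it may be worth reporting both. The sketch below uses synthetic data from `make_classification` (not the Steam dataset) and illustrative hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic, imbalanced data (about 5% positives); purely illustrative.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

model = RandomForestClassifier(
    n_estimators=100, max_depth=6, class_weight="balanced", random_state=71
)
model.fit(X_tr, y_tr)
prob = model.predict_proba(X_te)[:, 1]

roc = roc_auc_score(y_te, prob)
pr = average_precision_score(y_te, prob)
prevalence = y_te.mean()  # PR-AUC of a random ranking is roughly the positive rate

print(f"ROC AUC: {roc:.4f}  PR-AUC: {pr:.4f}  positive rate: {prevalence:.4f}")
```

A useful reference point is that a random ranking scores about 0.5 on ROC AUC but only about the positive rate on PR-AUC, which makes PR-AUC the stricter of the two on rare-event targets.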


In [156]:
def find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid):

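    # 1. Initialise the base estimator
    rf_clf = RandomForestClassifier(
        random_state=71, 
        max_features='sqrt', 
    )
    
    # 2. Time-series cross-validation (each fold trains on earlier rows and validates on later rows)
    tscv = TimeSeriesSplit(n_splits=5)

    # 3. Scoring rule: ROC AUC computed from predicted probabilities.
    #    The built-in 'roc_auc' scorer is used because
    #    make_scorer(..., needs_proba=True) is deprecated in recent scikit-learn releases.
    scorer = 'roc_auc'
    
    # 4. Set up GridSearchCV
    grid_search = GridSearchCV(
        estimator=rf_clf,
        param_grid=param_grid,
        scoring=scorer,       # scoring rule defined above
        cv=tscv,              # time-series split (not stratified)
        verbose=1,            # print progress
        n_jobs=-1             # use all available CPU cores in parallel
    )
    
    # 5. Run the grid search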
    grid_search.fit(X_train, y_train)
    best_model = grid_search.best_estimator_

    y_pred_train = best_model.predict(X_train)
    y_pred_test = best_model.predict(X_test)

    y_prob_train = best_model.predict_proba(X_train)[:, 1]
    y_prob_test = best_model.predict_proba(X_test)[:, 1]

    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)
    #pr_auc_train = average_precision_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)
    #pr_auc_test = average_precision_score(y_test, y_prob_test)


    result = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
        #'pr_AUC':[round(pr_auc_train, 4), round(pr_auc_test, 4)]
    }

    row_names = ['train', 'test']

    df = pd.DataFrame(result, index=row_names)
        
    # Return the best parameters and the train/test metrics table
    return grid_search.best_params_, df
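
To make the cross-validation scheme concrete, the sketch below prints the folds that `TimeSeriesSplit(n_splits=5)` produces for 12 rows. The 12-row array is a stand-in, not the project data. Note that `TimeSeriesSplit` splits purely by row position, so it only behaves as a time-based split if the rows are sorted chronologically; whether the CSV is already sorted by `Date` is not shown in this notebook and is an assumption worth checking.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Stand-in data: 12 rows assumed to be in chronological order.
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = [(tr.tolist(), te.tolist()) for tr, te in tscv.split(X_demo)]

for i, (tr, te) in enumerate(folds, start=1):
    # The training window grows each fold; the validation window always comes after it.
    print(f"fold {i}: train={tr} validate={te}")
```

Each fold validates on rows that come strictly after its training rows, which is what keeps future information out of the tuning step.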




# 1W (one-week growth rate)

In [188]:
feature_cols = [
    'Age', 'SalePeriod', 'AccumulatedPositiveRate', "MultiPlayer", 'DiscountFreq3M', 
    'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',
    'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',
    'PositiveRateGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',
    'DLC_sum_1W_lag0', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',
    'Sequel_sum_1W_lag0', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]


baseline_model = RandomForestClassifier(      
    n_estimators=200,
    max_features='sqrt',
    max_depth=6,
    min_samples_split=2,
    random_state=71
)
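
The 1W and 2W feature lists follow the same naming pattern, so they could be generated rather than typed out twice. The helper `build_feature_cols` below is hypothetical (it is not part of the notebook) and only reproduces the non-dummy part of `feature_cols`; the `GameID_*` columns would still be appended as in the cell above.

```python
def build_feature_cols(window, lags):
    """Return the static features plus every lagged feature for one window size."""
    static = ['Age', 'SalePeriod', 'AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M']
    prefixes = [
        'PlayerGrowthRate', 'FollowersGrowthRate', 'PositiveRateGrowthRate',
        'DLC_sum_', 'Sequel_sum_',
    ]
    # Names look like PlayerGrowthRate1W_lag7 or DLC_sum_2W_lag0
    lagged = [f"{prefix}{window}_lag{lag}" for prefix in prefixes for lag in lags]
    return static + lagged

cols_1w = build_feature_cols('1W', [0, 7, 14])
cols_2w = build_feature_cols('2W', [0, 7])
print(len(cols_1w), len(cols_2w))
```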

### Non-seasonal discounts

In [189]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')

#### Hyperparameter tuning

In [191]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits




       Accuracy  F1 score     AUC
train    0.7990    0.0832  0.9210
test     0.7603    0.0662  0.7521
{'class_weight': 'balanced', 'max_depth': 6, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
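
The "60 candidates, totalling 300 fits" line in the log follows directly from the grid: 3 × 5 × 2 × 2 × 1 combinations, each fitted once per cross-validation fold. A quick check with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2],
    'class_weight': ['balanced'],
}

n_candidates = len(ParameterGrid(param_grid))  # number of parameter combinations
n_fits = n_candidates * 5                      # one fit per fold with n_splits=5
print(n_candidates, n_fits)
```

The count excludes the final refit of the best candidate on the full training set, which `GridSearchCV` performs by default.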


#### Ranking feature importance

In [192]:
best_rf = RandomForestClassifier(
    n_estimators=200,
    max_features='sqrt',
    max_depth=6,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight='balanced',
    random_state=71
)

best_rf.fit(X_train, y_train)

perm = permutation_importance(
    best_rf,
    X_train,
    y_train,
    n_repeats=10,
    scoring='roc_auc',
    random_state=42
)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values(by='importance_mean', ascending=False)

importance_df

| | feature | importance_mean | importance_std |
|---|---|---|---|
| 4 | DiscountFreq3M | 0.111986 | 0.01162842 |
| 1 | SalePeriod | 0.07432985 | 0.01719432 |
| 5 | PlayerGrowthRate1W_lag0 | 0.03211592 | 0.004303669 |
| 7 | PlayerGrowthRate1W_lag14 | 0.03004914 | 0.003478126 |
| 8 | FollowersGrowthRate1W_lag0 | 0.02271097 | 0.002236324 |
| 2 | AccumulatedPositiveRate | 0.01677686 | 0.00292198 |
| 10 | FollowersGrowthRate1W_lag14 | 0.01615434 | 0.001343006 |
| 6 | PlayerGrowthRate1W_lag7 | 0.0157494 | 0.001694323 |
| 12 | PositiveRateGrowthRate1W_lag7 | 0.01536331 | 0.001301625 |
| 11 | PositiveRateGrowthRate1W_lag0 | 0.01521701 | 0.0008810836 |
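
The importances above are computed on `X_train`, so they describe what the fitted forest relies on rather than what generalises. One possible cross-check is to compute the same permutation importance on a held-out slice and compare the two columns. The sketch below does this on synthetic data; the feature names `f0`..`f5`, the 400/200 split, and the hyperparameters are illustrative assumptions.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic stand-in data with three informative features out of six.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
cols = [f"f{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=cols)
X_fit, X_hold = X.iloc[:400], X.iloc[400:]
y_fit, y_hold = y[:400], y[400:]

rf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=71)
rf.fit(X_fit, y_fit)

def perm_table(X_part, y_part):
    """Mean drop in ROC AUC when each column is shuffled."""
    perm = permutation_importance(
        rf, X_part, y_part, n_repeats=5, scoring="roc_auc", random_state=42
    )
    return pd.Series(perm.importances_mean, index=cols)

comparison = pd.DataFrame({
    "train": perm_table(X_fit, y_fit),
    "held_out": perm_table(X_hold, y_hold),
})
print(comparison.sort_values("held_out", ascending=False))
```

Features that rank highly on the training slice but near zero on the held-out slice are candidates for overfitting rather than genuine signal.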


#### Removing redundant features

In [193]:
threshold = 0.001

to_delete = importance_df.loc[
    importance_df['importance_mean'] < threshold, 'feature'
].tolist()

tscv = TimeSeriesSplit(n_splits=5)

def cv_auc(model, X, y):
    aucs = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict_proba(X.iloc[test_idx])[:, 1]
        aucs.append(roc_auc_score(y.iloc[test_idx], pred))
    return np.mean(aucs)

# Baseline AUC (using all features)
auc_base = cv_auc(best_rf, X_train, y_train)
print("Baseline AUC:", auc_base)

retain = []
for col in X_train.columns:
    if col in to_delete:
        # Try dropping this feature and re-run the time-series CV
        reduced_features = [c for c in X_train.columns if c != col]
        auc_new = cv_auc(best_rf, X_train[reduced_features], y_train)

        # Keep the feature only if dropping it costs at least 0.003 AUC
        if (auc_base - auc_new) >= 0.003:
            retain.append(col)
    else:
        retain.append(col)

print("Features retained:")
print(retain)


Baseline AUC: 0.7550367633432082
Features retained:
['Age', 'SalePeriod', 'AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M', 'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14', 'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14', 'PositiveRateGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14', 'GameID_233860', 'GameID_242760', 'GameID_244210', 'GameID_244850', 'GameID_413150', 'GameID_477160', 'GameID_582660']
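
The hand-written `cv_auc` loop is essentially what `cross_val_score` does with `scoring="roc_auc"` and the same `TimeSeriesSplit`. The sketch below checks that equivalence on synthetic data (the dataset and the small forest are illustrative, not the notebook's settings). One practical difference is that `cross_val_score` clones the estimator for every fold, whereas the manual loop refits the shared `model` object in place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in data; row order plays the role of time.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=30, max_depth=4, random_state=71)
tscv = TimeSeriesSplit(n_splits=5)

# Manual loop, mirroring cv_auc
manual = []
for train_idx, test_idx in tscv.split(X):
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict_proba(X[test_idx])[:, 1]
    manual.append(roc_auc_score(y[test_idx], pred))

# Library equivalent
auto = cross_val_score(model, X, y, cv=tscv, scoring="roc_auc")

print(np.mean(manual), auto.mean())
```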


#### Model performance

In [194]:
X_train_final = X_train[retain]
X_test_final = X_test[retain]
result1 = evaluate_model('baseline', baseline_model, X_train_final, y_train, X_test_final, y_test)
result2 = evaluate_model('selection', best_rf, X_train_final, y_train, X_test_final, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison:")
print(combined_results)


=== baseline ===
Confusion matrix:
 [[6729    0]
 [  93    0]]

=== selection ===
Confusion matrix:
 [[4966 1763]
 [  33   60]]

Model comparison:
                 Accuracy  F1 score     AUC
baseline  train    0.9897    0.0000  0.9845
          test     0.9864    0.0000  0.7067
selection train    0.8005    0.0872  0.9273
          test     0.7367    0.0626  0.7628
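
The test-set rows of the table can be reproduced from the two confusion matrices printed above, which also shows why the baseline's 0.9864 accuracy is misleading: it comes entirely from predicting the majority class. The arithmetic, using only the numbers reported in this notebook:

```python
import numpy as np

# Test-set confusion matrix of the tuned ("selection") model, rows = actual, cols = predicted
cm = np.array([[4966, 1763],
               [  33,   60]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted discounts that were real
recall = tp / (tp + fn)      # share of real discounts that were caught
f1 = 2 * precision * recall / (precision + recall)

# The baseline predicts "no discount" for every row: 6729 correct, 93 missed
baseline_accuracy = 6729 / (6729 + 93)

print(f"selection: acc={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"baseline:  acc={baseline_accuracy:.4f} recall=0.0000")
```

The tuned model catches 60 of the 93 discounts (recall about 0.65) at the cost of 1763 false alarms, which is why precision and F1 stay low even though AUC improves.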


### Seasonal discounts

In [195]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

#### Hyperparameter tuning

In [196]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits




       Accuracy  F1 score     AUC
train    0.8862    0.1545  0.9678
test     0.9134    0.0837  0.9698
{'class_weight': 'balanced', 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
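
Every selected configuration uses `class_weight='balanced'`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below shows the resulting weights for a made-up label vector with 1% positives, roughly the order of magnitude of `DiscountDuringSale` in the summary table; the exact counts are an assumption for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 990 negatives and 10 positives (1% positive rate)
y_demo = np.array([0] * 990 + [1] * 10)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print(dict(zip([0, 1], weights)))
```

With these counts a positive row weighs about 99 times as much as a negative row, which explains why the tuned models trade many false positives for higher recall.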


#### Ranking feature importance

In [197]:
best_rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=2,
    class_weight='balanced',
    random_state=71
)

best_rf.fit(X_train, y_train)

perm = permutation_importance(
    best_rf,
    X_train,
    y_train,
    n_repeats=10,
    scoring='roc_auc',
    random_state=42
)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values(by='importance_mean', ascending=False)

importance_df

| | feature | importance_mean | importance_std |
|---|---|---|---|
| 1 | SalePeriod | 0.2468629 | 0.00887 |
| 5 | PlayerGrowthRate1W_lag0 | 0.01763637 | 0.001718 |
| 4 | DiscountFreq3M | 0.00486997 | 0.002277 |
| 8 | FollowersGrowthRate1W_lag0 | 0.002604245 | 0.000967 |
| 6 | PlayerGrowthRate1W_lag7 | 0.002033695 | 0.000704 |
| 10 | FollowersGrowthRate1W_lag14 | 0.001533122 | 0.000309 |
| 13 | PositiveRateGrowthRate1W_lag14 | 0.001238982 | 0.000329 |
| 12 | PositiveRateGrowthRate1W_lag7 | 0.001111909 | 0.000267 |
| 7 | PlayerGrowthRate1W_lag14 | 0.001061558 | 0.000118 |
| 11 | PositiveRateGrowthRate1W_lag0 | 0.0009748747 | 0.000172 |


#### Removing redundant features

In [198]:
threshold = 0.001

to_delete = importance_df.loc[
    importance_df['importance_mean'] < threshold, 'feature'
].tolist()

tscv = TimeSeriesSplit(n_splits=5)

def cv_auc(model, X, y):
    aucs = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict_proba(X.iloc[test_idx])[:, 1]
        aucs.append(roc_auc_score(y.iloc[test_idx], pred))
    return np.mean(aucs)

# Baseline AUC (using all features)
auc_base = cv_auc(best_rf, X_train, y_train)
print("Baseline AUC:", auc_base)

retain = []
for col in X_train.columns:
    if col in to_delete:
        # Try dropping this feature and re-run the time-series CV
        reduced_features = [c for c in X_train.columns if c != col]
        auc_new = cv_auc(best_rf, X_train[reduced_features], y_train)

        # Keep the feature only if dropping it costs at least 0.003 AUC
        if (auc_base - auc_new) >= 0.003:
            retain.append(col)
    else:
        retain.append(col)

print("Features retained:")
print(retain)


Baseline AUC: 0.953083547122017
Features retained:
['SalePeriod', 'DiscountFreq3M', 'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14', 'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag14', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14']


#### Model performance

In [199]:
X_train_final = X_train[retain]
X_test_final = X_test[retain]
result1 = evaluate_model('baseline', baseline_model, X_train_final, y_train, X_test_final, y_test)
result2 = evaluate_model('selection', best_rf, X_train_final, y_train, X_test_final, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison:")
print(combined_results)


=== baseline ===
Confusion matrix:
 [[6794    0]
 [  28    0]]

=== selection ===
Confusion matrix:
 [[6143  651]
 [   0   28]]

Model comparison:
                 Accuracy  F1 score     AUC
baseline  train    0.9896    0.0111  0.9867
          test     0.9959    0.0000  0.9752
selection train    0.8654    0.1345  0.9673
          test     0.9046    0.0792  0.9710


# 2W (two-week growth rate)

In [200]:
feature_cols = [
    'Age', 'SalePeriod', 'AccumulatedPositiveRate', "MultiPlayer", 'DiscountFreq3M', 
    'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7',
    'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7',
    'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7',
    'DLC_sum_2W_lag0', 'DLC_sum_2W_lag7',
    'Sequel_sum_2W_lag0', 'Sequel_sum_2W_lag7'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]


baseline_model = RandomForestClassifier(      
    n_estimators=200,
    max_features='sqrt',
    max_depth=6,
    min_samples_split=2,
    random_state=71
)

### Non-seasonal discounts

In [201]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')

#### Hyperparameter tuning

In [203]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits




       Accuracy  F1 score     AUC
train    0.8184    0.0912  0.9298
test     0.7832    0.0645  0.7534
{'class_weight': 'balanced', 'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}


#### Ranking feature importance

In [204]:
best_rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=7,
    min_samples_split=2,
    min_samples_leaf=1,
    class_weight='balanced',
    random_state=71
)

best_rf.fit(X_train, y_train)

perm = permutation_importance(
    best_rf,
    X_train,
    y_train,
    n_repeats=10,
    scoring='roc_auc',
    random_state=42
)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values(by='importance_mean', ascending=False)

importance_df

| | feature | importance_mean | importance_std |
|---|---|---|---|
| 4 | DiscountFreq3M | 0.137159 | 0.013257 |
| 1 | SalePeriod | 0.082393 | 0.018656 |
| 6 | PlayerGrowthRate2W_lag7 | 0.043162 | 0.00396 |
| 7 | FollowersGrowthRate2W_lag0 | 0.032665 | 0.001944 |
| 9 | PositiveRateGrowthRate2W_lag0 | 0.025905 | 0.00159 |
| 5 | PlayerGrowthRate2W_lag0 | 0.022817 | 0.002978 |
| 8 | FollowersGrowthRate2W_lag7 | 0.021045 | 0.002275 |
| 0 | Age | 0.020958 | 0.003618 |
| 10 | PositiveRateGrowthRate2W_lag7 | 0.020692 | 0.002519 |
| 2 | AccumulatedPositiveRate | 0.017744 | 0.002032 |


#### Removing redundant features

In [205]:
threshold = 0.001

to_delete = importance_df.loc[
    importance_df['importance_mean'] < threshold, 'feature'
].tolist()

tscv = TimeSeriesSplit(n_splits=5)

def cv_auc(model, X, y):
    aucs = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict_proba(X.iloc[test_idx])[:, 1]
        aucs.append(roc_auc_score(y.iloc[test_idx], pred))
    return np.mean(aucs)

# Baseline AUC (using all features)
auc_base = cv_auc(best_rf, X_train, y_train)
print("Baseline AUC:", auc_base)

retain = []
for col in X_train.columns:
    if col in to_delete:
        # Try dropping this feature and re-run the time-series CV
        reduced_features = [c for c in X_train.columns if c != col]
        auc_new = cv_auc(best_rf, X_train[reduced_features], y_train)

        # Keep the feature only if dropping it costs at least 0.003 AUC
        if (auc_base - auc_new) >= 0.003:
            retain.append(col)
    else:
        retain.append(col)

print("Features retained:")
print(retain)


Baseline AUC: 0.7659520134689123
Features retained:
['Age', 'SalePeriod', 'AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M', 'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7', 'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7', 'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7', 'DLC_sum_2W_lag0', 'DLC_sum_2W_lag7', 'Sequel_sum_2W_lag0', 'Sequel_sum_2W_lag7', 'GameID_3590', 'GameID_4000', 'GameID_108600', 'GameID_233860', 'GameID_242760', 'GameID_244210', 'GameID_244850', 'GameID_294100', 'GameID_323190', 'GameID_367520', 'GameID_376210', 'GameID_381210', 'GameID_413150', 'GameID_431730', 'GameID_431960', 'GameID_457140', 'GameID_477160', 'GameID_548430', 'GameID_582660', 'GameID_588650', 'GameID_644930', 'GameID_703080', 'GameID_814380', 'GameID_880940', 'GameID_881100', 'GameID_1091500', 'GameID_1145360']


#### Model performance

In [206]:
X_train_final = X_train[retain]
X_test_final = X_test[retain]
# Evaluate on the retained features, consistent with the other sections.
# Every feature was retained in this run, so the results match the full feature set.
result1 = evaluate_model('baseline', baseline_model, X_train_final, y_train, X_test_final, y_test)
result2 = evaluate_model('selection', best_rf, X_train_final, y_train, X_test_final, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison:")
print(combined_results)


=== baseline ===
Confusion matrix:
 [[6729    0]
 [  93    0]]

=== selection ===
Confusion matrix:
 [[5292 1437]
 [  42   51]]

Model comparison:
                 Accuracy  F1 score     AUC
baseline  train    0.9897    0.0000  0.9770
          test     0.9864    0.0000  0.7361
selection train    0.8184    0.0912  0.9298
          test     0.7832    0.0645  0.7534


### Seasonal discounts

In [207]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

#### Hyperparameter tuning

In [208]:
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 4, 5, 6, 7],          
    'min_samples_split': [2, 4],       
    'min_samples_leaf': [1, 2],        
    'class_weight': ['balanced'] 
}
best_param, result = find_best_params_grid_searchCV(X_train, y_train, X_test, y_test, param_grid)

print(result)
print(best_param)

Fitting 5 folds for each of 60 candidates, totalling 300 fits




       Accuracy  F1 score     AUC
train    0.8883    0.1577  0.9821
test     0.9160    0.0803  0.9616
{'class_weight': 'balanced', 'max_depth': 7, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}


#### Ranking feature importance

In [209]:
best_rf = RandomForestClassifier(
    n_estimators=300,
    max_features='sqrt',
    max_depth=7,
    min_samples_split=2,
    min_samples_leaf=2,
    class_weight='balanced',
    random_state=71
)

best_rf.fit(X_train, y_train)

perm = permutation_importance(
    best_rf,
    X_train,
    y_train,
    n_repeats=10,
    scoring='roc_auc',
    random_state=42
)

importance_df = pd.DataFrame({
    'feature': X_train.columns,
    'importance_mean': perm.importances_mean,
    'importance_std': perm.importances_std
}).sort_values(by='importance_mean', ascending=False)

importance_df

| | feature | importance_mean | importance_std |
|---|---|---|---|
| 1 | SalePeriod | 0.1712001 | 0.006367 |
| 4 | DiscountFreq3M | 0.02284183 | 0.003081 |
| 5 | PlayerGrowthRate2W_lag0 | 0.02051701 | 0.002 |
| 6 | PlayerGrowthRate2W_lag7 | 0.01174768 | 0.001132 |
| 7 | FollowersGrowthRate2W_lag0 | 0.01109537 | 0.001424 |
| 9 | PositiveRateGrowthRate2W_lag0 | 0.007735436 | 0.000799 |
| 8 | FollowersGrowthRate2W_lag7 | 0.007250662 | 0.000621 |
| 10 | PositiveRateGrowthRate2W_lag7 | 0.00722533 | 0.000679 |
| 2 | AccumulatedPositiveRate | 0.004720022 | 0.00042 |
| 0 | Age | 0.004209652 | 0.00035 |


#### Removing redundant features

In [210]:
threshold = 0.001

to_delete = importance_df.loc[
    importance_df['importance_mean'] < threshold, 'feature'
].tolist()

tscv = TimeSeriesSplit(n_splits=5)

def cv_auc(model, X, y):
    aucs = []
    for train_idx, test_idx in tscv.split(X):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        pred = model.predict_proba(X.iloc[test_idx])[:, 1]
        aucs.append(roc_auc_score(y.iloc[test_idx], pred))
    return np.mean(aucs)

# Baseline AUC (using all features)
auc_base = cv_auc(best_rf, X_train, y_train)
print("Baseline AUC:", auc_base)

retain = []
for col in X_train.columns:
    if col in to_delete:
        # Try dropping this feature and re-run the time-series CV
        reduced_features = [c for c in X_train.columns if c != col]
        auc_new = cv_auc(best_rf, X_train[reduced_features], y_train)

        # Keep the feature only if dropping it costs at least 0.003 AUC
        if (auc_base - auc_new) >= 0.003:
            retain.append(col)
    else:
        retain.append(col)

print("Features retained:")
print(retain)


Baseline AUC: 0.9482386308284255
Features retained:
['Age', 'SalePeriod', 'AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M', 'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7', 'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7', 'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7', 'GameID_3590', 'GameID_431960', 'GameID_477160']


#### Model performance

In [211]:
X_train_final = X_train[retain]
X_test_final = X_test[retain]
result1 = evaluate_model('baseline', baseline_model, X_train_final, y_train, X_test_final, y_test)
result2 = evaluate_model('selection', best_rf, X_train_final, y_train, X_test_final, y_test)


combined_results = pd.concat([result1, result2], keys=['baseline', 'selection'])
print("\nModel comparison:")
print(combined_results)


=== baseline ===
Confusion matrix:
 [[6794    0]
 [  28    0]]

=== selection ===
Confusion matrix:
 [[6247  547]
 [   3   25]]

Model comparison:
                 Accuracy  F1 score     AUC
baseline  train    0.9895    0.0000  0.9862
          test     0.9959    0.0000  0.9455
selection train    0.8968    0.1685  0.9806
          test     0.9194    0.0833  0.9652
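
To compare the two window sizes side by side, the AUC values of the tuned models reported in this notebook can be collected into one table. The sketch below only re-tabulates those reported numbers; the column names `in_sample`, `out_of_sample`, and `gap` are chosen here for illustration.

```python
import pandas as pd

# AUC of the tuned ("selection") models, copied from the results above
auc = pd.DataFrame(
    {
        "in_sample": [0.9273, 0.9673, 0.9298, 0.9806],
        "out_of_sample": [0.7628, 0.9710, 0.7534, 0.9652],
    },
    index=pd.MultiIndex.from_tuples(
        [
            ("1W", "DiscountOutOfSale"), ("1W", "DiscountDuringSale"),
            ("2W", "DiscountOutOfSale"), ("2W", "DiscountDuringSale"),
        ],
        names=["window", "target"],
    ),
)

# Train/test gap: a rough indicator of overfitting
auc["gap"] = (auc["in_sample"] - auc["out_of_sample"]).round(4)

# Out-of-sample advantage of the one-week features over the two-week features
diff = (auc.xs("1W")["out_of_sample"] - auc.xs("2W")["out_of_sample"]).round(4)

print(auc)
print(diff)
```

The one-week features lead by under 0.01 AUC on both targets. The larger pattern is the train/test gap on `DiscountOutOfSale` (about 0.16 to 0.18), which suggests the non-seasonal models overfit regardless of the window.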
