# Background
- **Author**: `<郭伊軒>`
- **Created At**: `<2025-11-1>`
- **Path to Training Data： discount-timing-DE.csv**
- **Path to Testing Data： discount-timing-DE.csv**
- **Model Specification:**
    - Method：logistic regression
    - Variables：  
        - 計算方法:一週成長率  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',   
            'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',   
            'PositiveRateGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',   
            'DLC_sum_1W_lag0', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',   
            'Sequel_sum_1W_lag0', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14']  
        - 計算方法:兩週成長率:  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7',    
            'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7',   
            'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7',   
            'DLC_sum_2W_lag0', 'DLC_sum_2W_lag7',   
            'Sequel_sum_2W_lag0', 'Sequel_sum_2W_lag7']
    - Tuning Parameters： 
        - p_value = 0.05   
          判斷是否顯著  
        - auc_threshold = 0.003 (AUC 的自然隨機波動 ≈ ±0.002 ～ ±0.005)    
          判斷此變數是否重要，如果刪除此變數AUC會上升的值如果大於 auc_threshold，將此變數刪除。 
        - feature_col += [lag變數]   
          是否增加滯後性變數。
    - Optimization Method：增加滯後性變數模型
- **Main Findings and Takeaways：**   
    - 計算方法:一週成長率
        - In-sample `<AUC>`:  
          DiscountOutOfSale(`0.7711`), DiscountDuringSale(`0.7334`)
        - Out-sample `<AUC>`:  
          DiscountOutOfSale(`0.7189`), DiscountDuringSale(`0.7654`)
        - 非季節性折扣:  
          與近期是否喜歡打折(DiscountFreq3M, coef=0.830)、遊玩人數成長率(PlayerGrowthRate1W_lag0, coef=-0.305)、追蹤人數成長率(FollowersGrowthRate1W_lag0, coef=-0.323)相關。                
        - 季節性折扣:  
          與近期是否喜歡打折(DiscountFreq3M, coef=0.783)、遊玩人數成長率(PlayerGrowthRate1W_lag0, coef=-0.594)相關。
    - 計算方法:兩週成長率
        - In-sample `<AUC>`:  
          DiscountOutOfSale(`0.7718`), DiscountDuringSale(`0.7312`)
        - Out-sample `<AUC>`:  
          DiscountOutOfSale(`0.7262`), DiscountDuringSale(`0.7663`)
        - 非季節性折扣:  
          與近期是否喜歡打折(DiscountFreq3M, coef=0.869)、遊玩人數成長率(PlayerGrowthRate2W_lag0, coef=-0.403)相關。                
        - 季節性折扣:  
          與近期是否喜歡打折(DiscountFreq3M, coef=0.767)、遊玩人數成長率(PlayerGrowthRate1W_lag0, coef=-0.489)相關。
    - 個體差異:  
      GameID_548430喜歡在非季節性期間打折  
      GameID_367520喜歡在季節期間打折  
      GameID_431960不喜歡在季節期間打折
- **Future Direciton：嘗試使用非線性模型（如 Random Forest、XGBoost）。**



### Pre-processing

In [843]:
# Load packages here
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt



In [845]:
# Load the TRAINING data here and please finish all the data manipulation here.
#input_data_file = "/Users/10610/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
input_data_file = "/Users/user/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
df = pd.read_csv(input_data_file)
df_dummies = pd.get_dummies(df, columns=['GameID'], drop_first=True)
df_dummies.dropna(inplace=True)


train = df_dummies[df_dummies['Date'] < '2025-01-01']
test = df_dummies[df_dummies['Date'] >= '2025-01-01']

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
     
    # 將 bool 欄轉成 int
    X = X.astype({col: 'int' for col in X.select_dtypes(bool).columns})
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X) 
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
    X_scaled_df = sm.add_constant(X_scaled_df)
    
    return X_scaled_df, y



In [908]:
train.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
MultiPlayer,17116.0,0.464244,0.498734,0.0,0.0,0.0,1.0,1.0
ConstantDiscount,17116.0,0.214361,0.41039,0.0,0.0,0.0,0.0,1.0
DiscountOrNot,17116.0,0.020741,0.14252,0.0,0.0,0.0,0.0,1.0
DiscountDuration,17116.0,0.226572,1.709819,0.0,0.0,0.0,0.0,28.0
DiscountFreq3M,17116.0,1.741061,1.049641,0.0,1.0,2.0,2.0,6.0
Age,17116.0,7.301154,4.434039,2.389041,4.715068,5.876712,8.091096,24.180822
AccumulatedPositiveRate,17116.0,0.927902,0.064157,0.744039,0.90552,0.953165,0.972584,0.992973
SalePeriod,17116.0,0.163882,0.370179,0.0,0.0,0.0,0.0,1.0
DiscountDuringSale,17116.0,0.010458,0.101731,0.0,0.0,0.0,0.0,1.0
DiscountOutOfSale,17116.0,0.010283,0.100884,0.0,0.0,0.0,0.0,1.0


### function


In [909]:
auc_threshold = 0.003

In [910]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    y_prob_train = model.predict(X_train)
    y_pred_train = (y_prob_train >= 0.02).astype(int)

    y_prob_test = model.predict(X_test)
    y_pred_test = (y_prob_test >= 0.02).astype(int)

    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)
    cm = confusion_matrix(y_test, y_pred_test)

    results = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
    }

    row_names = ['train', 'test']

    result = pd.DataFrame(results, index=row_names)

    print(f"\n=== {name} ===")
    print("Confusion matrix:\n", cm)
    return result

# 1W

In [911]:
feature_cols_gameid = [
    'Age', 'PlayerGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag0', 
    'SalePeriod', 'DLC_sum_1W_lag0', 'Sequel_sum_1W_lag0'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]


## 非季節性折扣

#### 刪除separation變數

In [912]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountOutOfSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountOutOfSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: Sequel_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [913]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.5988703773217298
Age 0.6262878568251388
PlayerGrowthRate1W_lag0 0.6000525753646384
PositiveRateGrowthRate1W_lag0 0.5990861844338757
GameID_3590 0.6145878566865626
GameID_4000 0.6196269374220233
GameID_108600 0.6213866050083339
GameID_233860 0.6137302179616555
GameID_242760 0.6310213029809174
GameID_244850 0.6309307115841851
GameID_294100 0.6258833842733882
GameID_323190 0.6267991347978606
GameID_367520 0.6155497456656619
GameID_376210 0.6187275952684523
GameID_381210 0.615943517439584
GameID_431730 0.5998694785960945
GameID_431960 0.5974213014202254
GameID_457140 0.6032734819502613
GameID_548430 0.5976553165344743
GameID_582660 0.6087316743555077
GameID_588650 0.6001048767885976
GameID_644930 0.5995331184571377
GameID_814380 0.5999384103902986
GameID_880940 0.6120506385472905
GameID_881100 0.6004305176915009
GameID_1091500 0.624163314956698
GameID_1145360 0.6271613021766826


#### model summary

In [914]:
X_train, y_train = prepare_xy(train, retain, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.054921
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17110
Method:                           MLE   Df Model:                            5
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.04146
Time:                        02:32:44   Log-Likelihood:                -940.03
converged:                       True   LL-Null:                       -980.69
Covariance Type:                  HAC   LLR p-value:                 4.436e-16
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
const                         -4.9321      0.114    -43.245      0.000     

#### 顯著變數

In [915]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                        -4.932
FollowersGrowthRate1W_lag0   -1.219
GameID_244210                 0.486
GameID_413150                 0.207
GameID_477160                 0.111
GameID_703080                 0.172
dtype: float64


#### 模型效果

In [916]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6062  667]
 [  74   19]]
       Accuracy  F1 score     AUC
train    0.9327    0.0400  0.6885
test     0.8914    0.0488  0.6297


#### 共線性

In [917]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                      feature       VIF
0  FollowersGrowthRate1W_lag0  1.202521
1               GameID_244210  1.161039
2               GameID_413150  1.029177
3               GameID_477160  1.024704
4               GameID_703080  1.004808


## 季節性折扣

In [918]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountDuringSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountDuringSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountDuringSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: Sequel_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 2
Possible separation: GameID_3590 uniq_when_1= 1 uniq_when_0= 2


### 刪除多餘變數

In [919]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.5886976609170477
GameID_4000 0.5921531846337451
GameID_431960 0.5921894514921615
GameID_457140 0.5948930009337181
GameID_477160 0.5900465836895247
GameID_548430 0.5885864791242922
GameID_582660 0.5910240384794921
GameID_703080 0.5902177929371936
GameID_814380 0.5893950727241377
GameID_880940 0.5912868126643602
GameID_881100 0.5892603407337349


#### model summary

In [920]:
X_train, y_train = prepare_xy(train, retain, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.057065
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17095
Method:                           MLE   Df Model:                           20
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.01775
Time:                        02:32:47   Log-Likelihood:                -976.72
converged:                       True   LL-Null:                       -994.37
Covariance Type:                  HAC   LLR p-value:                   0.01859
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -4.7007      0.087    -53.762      0

#### 顯著變數

In [921]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                        -4.701
PlayerGrowthRate1W_lag0      -0.462
FollowersGrowthRate1W_lag0   -0.525
GameID_108600                 0.221
GameID_431730                 0.201
dtype: float64


#### 模型效果

In [922]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6490  304]
 [  26    2]]
       Accuracy  F1 score     AUC
train    0.9753    0.0186  0.6193
test     0.9516    0.0120  0.6585


##### 共線性

In [923]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                          feature       VIF
0                             Age  1.199941
1         PlayerGrowthRate1W_lag0  1.147694
2      FollowersGrowthRate1W_lag0  1.757922
3   PositiveRateGrowthRate1W_lag0  1.460298
4                   GameID_108600  1.158849
5                   GameID_233860  1.067848
6                   GameID_242760  1.109104
7                   GameID_244210  1.273892
8                   GameID_244850  1.100509
9                   GameID_294100  1.071063
10                  GameID_323190  1.092158
11                  GameID_367520  1.049219
12                  GameID_376210  1.093518
13                  GameID_381210  1.071681
14                  GameID_413150  1.077876
15                  GameID_431730  1.141123
16                  GameID_588650  1.065817
17                  GameID_644930  1.122609
18                 GameID_1091500  1.500356
19                 GameID_1145360  1.105313


# 1W lag

In [930]:
feature_cols_gameid = [
    'Age', 'SalePeriod', 'AccumulatedPositiveRate', "MultiPlayer", 'DiscountFreq3M', 
    'PlayerGrowthRate1W_lag0', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',
    'FollowersGrowthRate1W_lag0', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',
    'PositiveRateGrowthRate1W_lag0', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',
    'DLC_sum_1W_lag0', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',
    'Sequel_sum_1W_lag0', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]


## 非季節性折扣

#### 刪除separation變數

In [928]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountOutOfSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountOutOfSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)


Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: DLC_sum_1W_lag14 uniq_when_1= 1 uniq_when_0= 3
Possible separation: Sequel_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 2
Possible separation: Sequel_sum_1W_lag7 uniq_when_1= 1 uniq_when_0= 2
Possible separation: Sequel_sum_1W_lag14 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [929]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.5961949039409755


KeyError: "['AccumulatedPositiveRate', 'MultiPlayer', 'DiscountFreq3M', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14', 'DLC_sum_1W_lag7', 'GameID_3590', 'GameID_4000', 'GameID_431960', 'GameID_457140', 'GameID_477160', 'GameID_548430', 'GameID_582660', 'GameID_703080', 'GameID_814380', 'GameID_880940', 'GameID_881100'] not in index"

#### model summary

In [None]:
X_train, y_train = prepare_xy(train, retain, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.052256
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17101
Method:                           MLE   Df Model:                           14
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.08798
Time:                        02:22:51   Log-Likelihood:                -894.41
converged:                      False   LL-Null:                       -980.69
Covariance Type:                  HAC   LLR p-value:                 2.082e-29
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -5.6550      0.099    -57.151      0.000      -5.849      -5.461
Discount



#### 顯著變數

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                        -5.655
DiscountFreq3M                0.830
PlayerGrowthRate1W_lag0      -0.305
FollowersGrowthRate1W_lag0   -0.323
GameID_548430                 0.116
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5776  953]
 [  53   40]]
       Accuracy  F1 score     AUC
train    0.8453    0.0597  0.7711
test     0.8525    0.0737  0.7189


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                          feature       VIF
0                  DiscountFreq3M  1.333002
1         PlayerGrowthRate1W_lag0  1.207913
2      FollowersGrowthRate1W_lag0  1.899726
3     FollowersGrowthRate1W_lag14  1.792447
4   PositiveRateGrowthRate1W_lag0  1.581365
5   PositiveRateGrowthRate1W_lag7  1.568329
6                     GameID_3590  1.135944
7                   GameID_233860  1.043066
8                   GameID_294100  1.017981
9                   GameID_323190  1.064815
10                  GameID_367520  1.030184
11                  GameID_431960  1.164201
12                  GameID_548430  1.041323
13                 GameID_1145360  1.018715


## 季節性折扣

#### 刪除separation變數

In [936]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountDuringSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountDuringSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountDuringSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: DLC_sum_1W_lag7 uniq_when_1= 1 uniq_when_0= 3
Possible separation: Sequel_sum_1W_lag0 uniq_when_1= 1 uniq_when_0= 2
Possible separation: GameID_3590 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [937]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.6639127418005673
Age 0.6904752502239638
AccumulatedPositiveRate 0.6644598130352308
MultiPlayer 0.711946765122994
PlayerGrowthRate1W_lag7 0.6632771638738997
PlayerGrowthRate1W_lag14 0.6621930511594284
FollowersGrowthRate1W_lag0 0.6644309025217187
FollowersGrowthRate1W_lag7 0.6612834789693538
FollowersGrowthRate1W_lag14 0.6620039112823332
PositiveRateGrowthRate1W_lag0 0.6656699529158048
PositiveRateGrowthRate1W_lag7 0.6632074345815088
PositiveRateGrowthRate1W_lag14 0.6648054869019424
DLC_sum_1W_lag14 0.6651937676796422
Sequel_sum_1W_lag7 0.6644578387673923
Sequel_sum_1W_lag14 0.6656543733695748
GameID_4000 0.6736729371572251
GameID_108600 0.6848937185977496
GameID_242760 0.6947271388612191
GameID_244210 0.6906740455445608
GameID_244850 0.6900698636686471
GameID_294100 0.6615660951269866
GameID_376210 0.6739697557105445
GameID_381210 0.6818965813682033
GameID_413150 0.6751617516087892
GameID_431730 0.6660532443696444
GameID_457140 0.6623193483011672
GameID_477160 0.668564581516

#### model summary

In [938]:
X_train, y_train = prepare_xy(train, retain, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.054375
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17107
Method:                           MLE   Df Model:                            8
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.06405
Time:                        02:40:57   Log-Likelihood:                -930.68
converged:                      False   LL-Null:                       -994.37
Covariance Type:                  HAC   LLR p-value:                 9.890e-24
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                      -5.5738      0.108    -51.390      0.000      -5.786      -5.361
DiscountFreq3M            



#### 顯著變數

In [939]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                     -5.574
DiscountFreq3M             0.783
PlayerGrowthRate1W_lag0   -0.594
GameID_367520              0.147
GameID_431960             -3.575
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5957  837]
 [  16   12]]
       Accuracy  F1 score     AUC
train    0.8682    0.0529  0.7334
test     0.8750    0.0274  0.7654


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                   feature       VIF
0           DiscountFreq3M  1.240236
1  PlayerGrowthRate1W_lag0  1.002530
2            GameID_233860  1.039413
3            GameID_323190  1.048277
4            GameID_367520  1.020927
5            GameID_431960  1.122085
6           GameID_1091500  1.039611
7           GameID_1145360  1.009259


# 2W

In [None]:
feature_cols_gameid = [
    'Age', 'AccumulatedPositiveRate', "MultiPlayer", 'SalePeriod', 'DiscountFreq3M', 
    'PlayerGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag0', 
    'DLC_sum_2W_lag0', 'Sequel_sum_2W_lag0'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]

## 非季節性折扣

#### 刪除separation變數

In [None]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountOutOfSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountOutOfSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: Sequel_sum_2W_lag0 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [None]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.6652620029839553
Age 0.73053093492303
MultiPlayer 0.7113279232603417
DLC_sum_2W_lag0 0.6653024410658894
GameID_4000 0.6784935760245572
GameID_108600 0.7068199124584673
GameID_242760 0.7313005076245018
GameID_244210 0.6887230956626258
GameID_244850 0.7196388246097696
GameID_376210 0.686328634483502
GameID_381210 0.6815323707553654
GameID_413150 0.6703838826615328
GameID_431960 0.6688156249543091
GameID_477160 0.6647952101959052


#### model summary

In [None]:
X_train, y_train = prepare_xy(train, retain, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.052076
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17093
Method:                           MLE   Df Model:                           22
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.09112
Time:                        02:23:06   Log-Likelihood:                -891.33
converged:                      False   LL-Null:                       -980.69
Covariance Type:                  HAC   LLR p-value:                 1.563e-26
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -5.8056      0.100    -57.945      0.000      -6.002      -5.609
Accumula



#### 顯著變數

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                     -5.806
DiscountFreq3M             0.869
PlayerGrowthRate2W_lag0   -0.403
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5838  891]
 [  55   38]]
       Accuracy  F1 score     AUC
train    0.8489    0.0582  0.7718
test     0.8613    0.0744  0.7262


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                          feature       VIF
0         AccumulatedPositiveRate  3.080532
1                  DiscountFreq3M  1.442615
2         PlayerGrowthRate2W_lag0  1.096308
3      FollowersGrowthRate2W_lag0  1.382289
4   PositiveRateGrowthRate2W_lag0  1.557887
5                     GameID_3590  1.186041
6                   GameID_233860  1.129892
7                   GameID_294100  1.153953
8                   GameID_323190  1.143113
9                   GameID_367520  1.109102
10                  GameID_431730  1.214672
11                  GameID_457140  1.122490
12                  GameID_548430  1.146554
13                  GameID_582660  1.847767
14                  GameID_588650  1.181655
15                  GameID_644930  1.231818
16                  GameID_703080  1.081754
17                  GameID_814380  1.088496
18                  GameID_880940  1.117804
19                  GameID_881100  1.072963
20                 GameID_1091500  1.977432
21                 GameID_114536

## 季節性折扣

#### 刪除separation變數

In [None]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountDuringSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountDuringSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountDuringSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_2W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: GameID_3590 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [None]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.6531699151478619
Age 0.6857326079842883
AccumulatedPositiveRate 0.6573179736309432
MultiPlayer 0.7149245762093193
FollowersGrowthRate2W_lag0 0.6651524475110573
PositiveRateGrowthRate2W_lag0 0.6555756643834518
Sequel_sum_2W_lag0 0.6547052401762723
GameID_4000 0.666869183431757
GameID_108600 0.6784851018115439
GameID_233860 0.6521015825972573
GameID_242760 0.6863853049392761
GameID_244210 0.6903720907110291
GameID_244850 0.6941359744470209
GameID_294100 0.65813470462198
GameID_323190 0.6572705218808685
GameID_376210 0.6704855366790863
GameID_381210 0.6758252887867837
GameID_413150 0.6687006454289929
GameID_431730 0.6606355226247396
GameID_457140 0.6569829879511857
GameID_477160 0.6612609556127365
GameID_548430 0.6586162792541501
GameID_582660 0.6589579505236154
GameID_588650 0.6573269768566316
GameID_644930 0.6565598227594741
GameID_703080 0.6540251955490598
GameID_814380 0.6504556285997385
GameID_880940 0.6601735803555916
GameID_881100 0.6531894533714646
GameID_1145360 0.6547

#### model summary

In [None]:
X_train, y_train = prepare_xy(train, retain, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.054568
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17110
Method:                           MLE   Df Model:                            5
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.06073
Time:                        02:23:11   Log-Likelihood:                -933.98
converged:                      False   LL-Null:                       -994.37
Covariance Type:                  HAC   LLR p-value:                 2.142e-24
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                      -5.5827        nan        nan        nan         nan         nan
DiscountFreq3M            



#### 顯著變數

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M             0.767
PlayerGrowthRate2W_lag0   -0.489
GameID_367520              0.147
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6014  780]
 [  14   14]]
       Accuracy  F1 score     AUC
train    0.8722    0.0569  0.7312
test     0.8836    0.0341  0.7663


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                   feature       VIF
0           DiscountFreq3M  1.159785
1  PlayerGrowthRate2W_lag0  1.001962
2            GameID_367520  1.018592
3            GameID_431960  1.120266
4           GameID_1091500  1.028298


# 2W lag

In [None]:
feature_cols = [
    'Age', 'SalePeriod', 'AccumulatedPositiveRate', "MultiPlayer", 'SalePeriod', 'DiscountFreq3M', 
    'PlayerGrowthRate2W_lag0', 'PlayerGrowthRate2W_lag7',
    'FollowersGrowthRate2W_lag0', 'FollowersGrowthRate2W_lag7',
    'PositiveRateGrowthRate2W_lag0', 'PositiveRateGrowthRate2W_lag7',
    'DLC_sum_2W_lag0', 'DLC_sum_2W_lag7',
    'Sequel_sum_2W_lag0', 'Sequel_sum_2W_lag7'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]

## 非季節性折扣

#### 刪除separation變數

In [None]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountOutOfSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountOutOfSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: Sequel_sum_2W_lag0 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [None]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.6652620029839553
Age 0.73053093492303
MultiPlayer 0.7113279232603417
DLC_sum_2W_lag0 0.6653024410658894
GameID_4000 0.6784935760245572
GameID_108600 0.7068199124584673
GameID_242760 0.7313005076245018
GameID_244210 0.6887230956626258
GameID_244850 0.7196388246097696
GameID_376210 0.686328634483502
GameID_381210 0.6815323707553654
GameID_413150 0.6703838826615328
GameID_431960 0.6688156249543091
GameID_477160 0.6647952101959052


#### model summary

In [None]:
X_train, y_train = prepare_xy(train, retain, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.052076
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17093
Method:                           MLE   Df Model:                           22
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.09112
Time:                        02:23:17   Log-Likelihood:                -891.33
converged:                      False   LL-Null:                       -980.69
Covariance Type:                  HAC   LLR p-value:                 1.563e-26
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -5.8056      0.100    -57.945      0.000      -6.002      -5.609
Accumula



#### 顯著變數

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                     -5.806
DiscountFreq3M             0.869
PlayerGrowthRate2W_lag0   -0.403
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5838  891]
 [  55   38]]
       Accuracy  F1 score     AUC
train    0.8489    0.0582  0.7718
test     0.8613    0.0744  0.7262


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                          feature       VIF
0         AccumulatedPositiveRate  3.080532
1                  DiscountFreq3M  1.442615
2         PlayerGrowthRate2W_lag0  1.096308
3      FollowersGrowthRate2W_lag0  1.382289
4   PositiveRateGrowthRate2W_lag0  1.557887
5                     GameID_3590  1.186041
6                   GameID_233860  1.129892
7                   GameID_294100  1.153953
8                   GameID_323190  1.143113
9                   GameID_367520  1.109102
10                  GameID_431730  1.214672
11                  GameID_457140  1.122490
12                  GameID_548430  1.146554
13                  GameID_582660  1.847767
14                  GameID_588650  1.181655
15                  GameID_644930  1.231818
16                  GameID_703080  1.081754
17                  GameID_814380  1.088496
18                  GameID_880940  1.117804
19                  GameID_881100  1.072963
20                 GameID_1091500  1.977432
21                 GameID_114536

## 季節性折扣

#### 刪除separation變數

In [None]:
# 完美分離檢查（簡單檢測：某個變數在 y=1 或 y=0 時只有一個 unique）
selected = []
for col in feature_cols_gameid:
    u1 = df_dummies.loc[df_dummies['DiscountDuringSale']==1, col].nunique()
    u0 = df_dummies.loc[df_dummies['DiscountDuringSale']==0, col].nunique()
    if u1==1 or u0==1:
        print("Possible separation:", col, "uniq_when_1=", u1, "uniq_when_0=", u0)
    else:
        selected.append(col)
X_train, y_train = prepare_xy(train, selected, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountDuringSale')

Possible separation: SalePeriod uniq_when_1= 1 uniq_when_0= 2
Possible separation: DLC_sum_2W_lag0 uniq_when_1= 1 uniq_when_0= 3
Possible separation: GameID_3590 uniq_when_1= 1 uniq_when_0= 2


#### 刪除多餘變數

In [None]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

retain = []

for col in selected:
    reduced = [c for c in selected if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc_base-auc < auc_threshold:
        print(col, auc)
    else:
        retain.append(col)


baseline 0.6531699151478619
Age 0.6857326079842883
AccumulatedPositiveRate 0.6573179736309432
MultiPlayer 0.7149245762093193
FollowersGrowthRate2W_lag0 0.6651524475110573
PositiveRateGrowthRate2W_lag0 0.6555756643834518
Sequel_sum_2W_lag0 0.6547052401762723
GameID_4000 0.666869183431757
GameID_108600 0.6784851018115439
GameID_233860 0.6521015825972573
GameID_242760 0.6863853049392761
GameID_244210 0.6903720907110291
GameID_244850 0.6941359744470209
GameID_294100 0.65813470462198
GameID_323190 0.6572705218808685
GameID_376210 0.6704855366790863
GameID_381210 0.6758252887867837
GameID_413150 0.6687006454289929
GameID_431730 0.6606355226247396
GameID_457140 0.6569829879511857
GameID_477160 0.6612609556127365
GameID_548430 0.6586162792541501
GameID_582660 0.6589579505236154
GameID_588650 0.6573269768566316
GameID_644930 0.6565598227594741
GameID_703080 0.6540251955490598
GameID_814380 0.6504556285997385
GameID_880940 0.6601735803555916
GameID_881100 0.6531894533714646
GameID_1145360 0.6547

#### model summary

In [None]:
X_train, y_train = prepare_xy(train, retain, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, retain, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(cov_type="HAC", cov_kwds={"maxlags": 7})
print(logit_model.summary())

         Current function value: 0.054568
         Iterations: 35
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17110
Method:                           MLE   Df Model:                            5
Date:                Sat, 29 Nov 2025   Pseudo R-squ.:                 0.06073
Time:                        02:23:22   Log-Likelihood:                -933.98
converged:                      False   LL-Null:                       -994.37
Covariance Type:                  HAC   LLR p-value:                 2.142e-24
                              coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------
const                      -5.5827        nan        nan        nan         nan         nan
DiscountFreq3M            



#### 顯著變數

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M             0.767
PlayerGrowthRate2W_lag0   -0.489
GameID_367520              0.147
dtype: float64


#### 模型效果

In [None]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6014  780]
 [  14   14]]
       Accuracy  F1 score     AUC
train    0.8722    0.0569  0.7312
test     0.8836    0.0341  0.7663


#### 共線性

In [None]:
#檢查共線性 AccumulatedPositiveRate 和 Age 有共線性問題
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # 跳過常數項 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                   feature       VIF
0           DiscountFreq3M  1.159785
1  PlayerGrowthRate2W_lag0  1.001962
2            GameID_367520  1.018592
3            GameID_431960  1.120266
4           GameID_1091500  1.028298
