# Background
- **Author**: `<郭伊軒>`
- **Created At**: `<2025-11-1>`
- **Path to Training Data**: discount-timing-DE.csv
- **Path to Testing Data**: discount-timing-DE.csv
- **Model Specification**
    - Method: logistic regression
    - Variables:  
        - Growth rates computed over one week:  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate1W', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',   
            'FollowersGrowthRate1W', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',   
            'PositiveRateGrowthRate1W', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',   
            'DLC_sum_1W', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',   
            'Sequel_sum_1W', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14']  
        - Growth rates computed over two weeks:  
           ['Age','AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M',    
            'PlayerGrowthRate2W', 'PlayerGrowthRate2W_lag7',    
            'FollowersGrowthRate2W', 'FollowersGrowthRate2W_lag7',   
            'PositiveRateGrowthRate2W', 'PositiveRateGrowthRate2W_lag7',   
            'DLC_sum_2W', 'DLC_sum_2W_lag7',   
            'Sequel_sum_2W', 'Sequel_sum_2W_lag7']
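    - Tuning Parameters:   
        - auc_threshold = 0.005 (the natural random fluctuation of AUC is roughly ±0.002 to ±0.005)    
          Decides whether a variable is worth keeping: if removing it raises the cross-validated AUC by more than auc_threshold, the variable is dropped. 
        - feature_col += [lag variables]   
          Whether to add lagged variables.
    - Optimization Method: adding lagged variables to the model
- **Main Findings and Takeaways:**   
    Lagged-variable model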
    - In-sample `<AUC>`:  
    DiscountOrNot1W(`0.8170`), DiscountDuringSale1W(`0.9627`), DiscountOutOfSale1W(`0.8360`),    
    DiscountOrNot2W(`0.8149`), DiscountDuringSale2W(`0.9548`), DiscountOutOfSale2W(`0.8348`)
    - Out-of-sample `<AUC>`:  
    DiscountOrNot1W(`0.7493`), DiscountDuringSale1W(`0.9774`), DiscountOutOfSale1W(`0.7764`),    
    DiscountOrNot2W(`0.7486`), DiscountDuringSale2W(`0.9724`), DiscountOutOfSale2W(`0.7810`)
    - Differences between individual games are not significant
- **Future Direction: try nonlinear models (e.g. Random Forest, XGBoost).**

In [305]:
# Load packages here
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt



In [306]:
# Load the TRAINING data here and please finish all the data manipulation here.
#input_data_file = "/Users/10610/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
input_data_file = "/Users/user/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
df = pd.read_csv(input_data_file)

# Shift every time-varying predictor down by one row, then derive the lag columns
# from the already-shifted series (shift(8) and shift(15) on top of the one-row shift).
# Note: shift() runs over the whole frame in row order, not per GameID.
lag_prefixes = ['PlayerGrowthRate', 'FollowersGrowthRate', 'PositiveRateGrowthRate',
                'DLC_sum_', 'Sequel_sum_']

for prefix in lag_prefixes:
    col_1w, col_2w = f'{prefix}1W', f'{prefix}2W'

    df[col_1w] = df[col_1w].shift(1)
    df[f'{col_1w}_lag7'] = df[col_1w].shift(8)
    df[f'{col_1w}_lag14'] = df[col_1w].shift(15)

    df[col_2w] = df[col_2w].shift(1)
    df[f'{col_2w}_lag7'] = df[col_2w].shift(8)

df_dummies = pd.get_dummies(df, columns=['GameID'], drop_first=True)
df_dummies.dropna(inplace=True)


train = df_dummies[df_dummies['Date'] < '2025-01-01']
test = df_dummies[df_dummies['Date'] >= '2025-01-01']

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
     
    # Convert bool columns to int
    X = X.astype({col: 'int' for col in X.select_dtypes(bool).columns})
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X) 
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
    X_scaled_df = sm.add_constant(X_scaled_df)
    
    return X_scaled_df, y
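
`prepare_xy` calls `fit_transform` on whichever frame it receives, so the test set ends up standardized with its own mean and variance rather than those of the training set. The following is a minimal sketch of one alternative, not the notebook's method: fit the scaler on the training rows only and reuse it for the test rows. The function name `prepare_xy_shared_scaler` and the toy frames are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_xy_shared_scaler(train_df, test_df, feature_cols, target_col):
    """Standardize train and test with statistics learned from train only."""
    # astype(float) also turns bool columns into 0.0 / 1.0.
    scaler = StandardScaler().fit(train_df[feature_cols].astype(float))
    prepared = []
    for part in (train_df, test_df):
        scaled = scaler.transform(part[feature_cols].astype(float))
        X = pd.DataFrame(scaled, columns=feature_cols, index=part.index)
        X.insert(0, "const", 1.0)  # explicit intercept column, as sm.add_constant does
        prepared.append((X, part[target_col].copy()))
    return prepared


# Toy data (hypothetical), only to show the behaviour.
toy_train = pd.DataFrame({"x": [0.0, 2.0, 4.0], "flag": [True, False, True], "y": [0, 1, 0]})
toy_test = pd.DataFrame({"x": [2.0, 6.0], "flag": [False, True], "y": [1, 0]})

(X_tr, y_tr), (X_te, y_te) = prepare_xy_shared_scaler(toy_train, toy_test, ["x", "flag"], "y")
print(X_te)
```

With a shared scaler, a test value equal to the training mean maps to 0, which keeps the fitted coefficients on the same scale for both sets.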


In [None]:
train.describe().T
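
The data-loading cell builds its lag columns with `shift()` on the whole frame. Whether the CSV is ordered by game is not stated here; if it is, a whole-frame shift carries values across game boundaries. The sketch below, on a hypothetical two-game panel, contrasts that with a per-game shift through `groupby`. The column names `global_lag1` and `grouped_lag1` are illustrative.

```python
import pandas as pd

# Hypothetical two-game panel, sorted by GameID and then Date.
panel = pd.DataFrame({
    "GameID": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"] * 2),
    "PlayerGrowthRate1W": [0.10, 0.11, 0.12, 0.20, 0.21, 0.22],
})

# Whole-frame shift: the first row of game 2 receives the last value of game 1.
panel["global_lag1"] = panel["PlayerGrowthRate1W"].shift(1)

# Per-game shift: each game's first row has no earlier observation, so it is NaN.
panel["grouped_lag1"] = panel.groupby("GameID")["PlayerGrowthRate1W"].shift(1)

print(panel)
```

The per-game version loses one row per game and lag to `NaN`, which the later `dropna` would then remove.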

### The actual modeling starts below
For the remaining blocks, make sure you have followed the guidelines as specified in [專案資料夾結構、檔案命名與文件規範](https://docs.google.com/document/d/1sl6gEFMdmiGsiNjLe17UmZ30xKxq15U0Mb2B-Jvusxg/edit?tab=t.33iie8ybx7s4).


In [418]:
auc_threshold = 0.005

In [308]:
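def evaluate_model(name, model, X_train, y_train, X_test, y_test, threshold=0.02):
    """Print the test confusion matrix and return train/test accuracy, F1 and AUC."""
    # model.predict() returns probabilities; class labels are assigned with
    # the cut-off `threshold` (0.02 by default).
    y_prob_train = model.predict(X_train)
    y_pred_train = (y_prob_train >= threshold).astype(int)

    y_prob_test = model.predict(X_test)
    y_pred_test = (y_prob_test >= threshold).astype(int)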

    acc_train = accuracy_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_prob_train)

    acc_test = accuracy_score(y_test, y_pred_test)
    f1_test = f1_score(y_test, y_pred_test)
    auc_test = roc_auc_score(y_test, y_prob_test)
    cm = confusion_matrix(y_test, y_pred_test)

    results = {
        'Accuracy': [round(acc_train, 4), round(acc_test, 4)],
        'F1 score': [round(f1_train, 4), round(f1_test, 4)],
        'AUC': [round(auc_train, 4), round(auc_test, 4)]
    }

    row_names = ['train', 'test']

    result = pd.DataFrame(results, index=row_names)

    print(f"\n=== {name} ===")
    print("Confusion matrix:\n", cm)
    return result
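
`evaluate_model` reports two kinds of metrics: accuracy and F1 depend on the 0.02 cut-off, while AUC is computed from the predicted probabilities alone. As a minimal sketch that is not part of the original notebook, the pairwise definition of AUC (the share of positive/negative pairs in which the positive case gets the higher score, with ties counting one half) can be written as follows. `pairwise_auc` and the toy data are illustrative.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives, two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75
```

Because only the ordering of the scores matters, changing the cut-off moves accuracy and F1 but leaves AUC unchanged.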

# 1W

In [473]:
feature_cols_gameid = [
    'Age', 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 
    'SalePeriod', 'DLC_sum_1W', 'Sequel_sum_1W'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]

feature_cols = [
    'Age','AccumulatedPositiveRate', "MultiPlayer", 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 
    'SalePeriod', 'DiscountFreq3M', 'DLC_sum_1W', 'Sequel_sum_1W'
]


## All discounts

#### model summary

In [474]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot')

logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.087557
         Iterations: 78
         Function evaluations: 80
         Gradient evaluations: 80
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17089
Method:                           MLE   Df Model:                           10
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1330
Time:                        18:51:10   Log-Likelihood:                -1497.2
converged:                       True   LL-Null:                       -1726.8
Covariance Type:            nonrobust   LLR p-value:                 2.290e-92
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

#### Significant variables

In [475]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                   -4.489
PlayerGrowthRate1W      -0.722
FollowersGrowthRate1W   -0.298
SalePeriod               0.595
DiscountFreq3M           0.641
dtype: float64
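
Because `prepare_xy` standardizes every predictor, each coefficient is the change in log-odds for a one-standard-deviation increase in that variable. A small sketch of the conversion to odds ratios and probabilities, using the rounded coefficients printed above (the helper name `sigmoid` is illustrative):

```python
import math


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


const = -4.489            # intercept printed above
coef_sale_period = 0.595  # SalePeriod coefficient printed above

# Predicted probability when every standardized predictor sits at its mean (0).
p_baseline = sigmoid(const)
# Multiplicative change in the odds for +1 standard deviation of SalePeriod.
odds_ratio = math.exp(coef_sale_period)
# Probability after that change, with the other predictors held at their means.
p_sale = sigmoid(const + coef_sale_period)

print(round(p_baseline, 4), round(odds_ratio, 2), round(p_sale, 4))
```

The baseline probability of roughly 1% shows how rare the positive class is, which gives some context for the low 0.02 cut-off used in `evaluate_model`.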


#### Collinearity

In [421]:
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # skip the constant term 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                    feature       VIF
0                       Age  1.393812
1   AccumulatedPositiveRate  1.317154
2               MultiPlayer  1.426298
3        PlayerGrowthRate1W  1.098304
4     FollowersGrowthRate1W  1.152741
5  PositiveRateGrowthRate1W  1.069350
6                SalePeriod  1.070197
7            DiscountFreq3M  1.192268
8                DLC_sum_1W  1.038748
9             Sequel_sum_1W  1.006119


#### Model performance

In [422]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[4847 1854]
 [  38   83]]
       Accuracy  F1 score     AUC
train    0.7509    0.1066  0.8114
test     0.7227    0.0807  0.7334
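
The test-set accuracy and F1 in the table can be recovered from the confusion matrix. scikit-learn's `confusion_matrix` puts true labels on the rows and predictions on the columns, so for 0/1 labels the layout is `[[TN, FP], [FN, TP]]`. A short sketch of the arithmetic with the numbers printed above:

```python
# Test-set confusion matrix printed above: rows are true labels, columns are predictions.
tn, fp = 4847, 1854
fn, tp = 38, 83

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of predicted discounts that were real
recall = tp / (tp + fn)      # share of real discounts that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The low F1 next to a fairly high AUC comes from precision: about two thirds of the real discounts are caught, but only around 4% of the predicted discounts are real.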


## Seasonal discounts

#### model summary

In [424]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

logit_model = sm.Logit(y_train, X_train).fit_regularized(alpha=1)
print(logit_model.summary())

Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.036169286098874
            Iterations: 148
            Function evaluations: 148
            Gradient evaluations: 148
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17092
Method:                           MLE   Df Model:                            7
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.3917
Time:                        18:47:35   Log-Likelihood:                -604.76
converged:                       True   LL-Null:                       -994.20
Covariance Type:            nonrobust   LLR p-value:                6.656e-164
                               coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------

#### Significant variables

In [None]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

FollowersGrowthRate1W   -0.370
DiscountFreq3M           0.995
dtype: float64


#### Model performance

In [425]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)




=== unbalance ===
Confusion matrix:
 [[6171  623]
 [   0   28]]
       Accuracy  F1 score     AUC
train    0.8758    0.1415  0.9611
test     0.9087    0.0825  0.9771


## Non-seasonal discounts

#### Showing that individual games do not differ noticeably

In [476]:
X_train, y_train = prepare_xy(train, feature_cols_gameid, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols_gameid, 'DiscountOutOfSale') 
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.051657
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17065
Method:                           MLE   Df Model:                           34
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                 0.09913
Time:                        18:53:10   Log-Likelihood:                -883.33
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 2.219e-24
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -6.0472      0.54

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [477]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                   -6.047
FollowersGrowthRate1W   -1.229
SalePeriod              -1.945
dtype: float64


DLC_sum_1W (0.010), DiscountFreq3M (0.000), FollowersGrowthRate1W (0.006)

##### Collinearity

In [428]:
# Check collinearity: AccumulatedPositiveRate and Age show a collinearity problem
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # skip the constant term 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                     feature        VIF
0                        Age  86.406953
1         PlayerGrowthRate1W   1.219388
2      FollowersGrowthRate1W   2.334885
3   PositiveRateGrowthRate1W   1.504960
4                 SalePeriod   1.030019
5                 DLC_sum_1W   1.111778
6              Sequel_sum_1W   1.015277
7                GameID_3590  12.771527
8                GameID_4000   7.614570
9              GameID_108600  27.351742
10             GameID_233860  51.821231
11             GameID_242760  49.194477
12             GameID_244210  31.782275
13             GameID_244850  53.646254
14             GameID_294100  51.366321
15             GameID_323190  48.997498
16             GameID_367520  42.540862
17             GameID_376210  36.640742
18             GameID_381210  39.615619
19             GameID_413150  37.403214
20             GameID_431730  37.178571
21             GameID_431960  51.171005
22             GameID_457140  55.855453
23             GameID_477160  40.250964
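
For reference, the VIF of predictor j is 1 / (1 - R_j²), where R_j² comes from regressing predictor j on all the other predictors. Inverting that formula shows how much of a column's variance the other columns explain. A brief sketch with values copied from the table above (the helper name `r_squared_from_vif` is illustrative):

```python
def r_squared_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2) to get the auxiliary-regression R^2."""
    return 1.0 - 1.0 / vif


# VIF values copied from the table above.
for feature, vif in [("Age", 86.406953), ("GameID_457140", 55.855453), ("SalePeriod", 1.030019)]:
    print(f"{feature}: VIF={vif:.2f}, R^2={r_squared_from_vif(vif):.4f}")
```

A VIF of about 86 for `Age` therefore means that close to 99% of its variance is explained by the remaining columns in this specification.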


##### Wald test

In [429]:
# 1. Collect the names of all GameID dummy variables
game_cols = [col for col in df_dummies.columns if col.startswith('GameID_')]
game_cnt = len(game_cols)
variable_cnt = len(feature_cols_gameid) + 1 # total number of parameters: all features plus the constant

# 2. Initialize the restriction matrix R
R_matrix = np.zeros([game_cnt, variable_cnt])

# 3. Locate each dummy in the model's parameter list and set the matching entry of R
for i, var_name in enumerate(game_cols):
    # Position of this variable in model.params
    param_index = logit_model.params.index.get_loc(var_name)
    R_matrix[i, param_index] = 1


print('\n unbalance')
print(logit_model.wald_test(R_matrix))


 unbalance
<Wald test (chi2): statistic=[[64.82788876]], p-value=5.9376326260946565e-05, df_denom=27>
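
The Wald test evaluates the joint null hypothesis Rβ = 0, where each row of R selects one coefficient. As a minimal sketch with made-up coefficients (the values and the reduced set of names below are hypothetical, not the fitted model), this is what such a restriction matrix does:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients: a constant, one covariate and two game dummies.
params = pd.Series(
    [-6.0, 0.5, 0.3, -0.2],
    index=["const", "Age", "GameID_3590", "GameID_4000"],
)
game_cols = [name for name in params.index if name.startswith("GameID_")]

# One restriction row per dummy, with a single 1 in that dummy's column.
R = np.zeros((len(game_cols), len(params)))
for row, name in enumerate(game_cols):
    R[row, params.index.get_loc(name)] = 1.0

# R @ params picks out exactly the dummy coefficients that the null sets to zero.
restricted = R @ params.to_numpy()
print(R)
print(restricted)
```

The number of rows in R equals the number of restrictions, which is the degrees of freedom of the chi-square statistic reported by `wald_test`.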




### Before removal

#### model summary

In [430]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.049743
         Iterations: 93
         Function evaluations: 94
         Gradient evaluations: 94
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17089
Method:                           MLE   Df Model:                           10
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1325
Time:                        18:47:37   Log-Likelihood:                -850.61
converged:                       True   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 4.631e-50
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

#### Significant variables

In [431]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

FollowersGrowthRate1W   -0.370
DiscountFreq3M           0.995
dtype: float64


#### Model performance

In [432]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)



=== unbalance ===
Confusion matrix:
 [[6418  311]
 [  72   21]]
       Accuracy  F1 score     AUC
train    0.8562    0.0703  0.8244
test     0.9439    0.0988  0.7636


### Removing redundant variables (only variables with no effect are removed)

In [433]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

selected = []

for col in feature_cols:
    reduced = [c for c in feature_cols if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc-auc_base > auc_threshold:
        print(col, auc)
    else:
        selected.append(col)


baseline 0.7908669664063481
Age 0.8008244132424366
AccumulatedPositiveRate 0.7961837196622371
MultiPlayer 0.799270324075
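
The rule applied in the loop above can be isolated as a small function: a feature is dropped when the cross-validated AUC without it exceeds the baseline by more than `auc_threshold`. The sketch below reuses the three AUC values printed above; the `DiscountFreq3M` entry is a hypothetical value added only to show a feature that is kept, and `split_by_auc_gain` is an illustrative name.

```python
def split_by_auc_gain(auc_base, auc_without, auc_threshold=0.005):
    """Return (kept, dropped): drop a feature if removing it lifts AUC by more than the threshold."""
    kept, dropped = [], []
    for feature, auc in auc_without.items():
        if auc - auc_base > auc_threshold:
            dropped.append(feature)
        else:
            kept.append(feature)
    return kept, dropped


auc_base = 0.7908669664063481
auc_without = {
    "Age": 0.8008244132424366,                      # printed above
    "AccumulatedPositiveRate": 0.7961837196622371,  # printed above
    "MultiPlayer": 0.799270324075,                  # printed above
    "DiscountFreq3M": 0.7400,                       # hypothetical: not shown in the output
}
kept, dropped = split_by_auc_gain(auc_base, auc_without)
print("dropped:", dropped)
print("kept:", kept)
```

Each comparison is made against the full model, so the three dropped features are judged one at a time rather than jointly.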


#### model summary

In [434]:
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.049771
         Iterations: 64
         Function evaluations: 65
         Gradient evaluations: 65
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17092
Method:                           MLE   Df Model:                            7
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1320
Time:                        18:47:38   Log-Likelihood:                -851.08
converged:                       True   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 3.526e-52
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

#### Significant variables

In [435]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

FollowersGrowthRate1W   -0.383
DiscountFreq3M           1.002
dtype: float64


#### Model performance

In [436]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6357  372]
 [  69   24]]
       Accuracy  F1 score     AUC
train    0.8562    0.0703  0.8248
test     0.9354    0.0982  0.7617


# 1W lag

In [521]:
feature_cols = [
    'Age','AccumulatedPositiveRate', "MultiPlayer", 'SalePeriod', 'DiscountFreq3M', 
    'PlayerGrowthRate1W', 'PlayerGrowthRate1W_lag7', 'PlayerGrowthRate1W_lag14',
    'FollowersGrowthRate1W', 'FollowersGrowthRate1W_lag7', 'FollowersGrowthRate1W_lag14',
    'PositiveRateGrowthRate1W', 'PositiveRateGrowthRate1W_lag7', 'PositiveRateGrowthRate1W_lag14',
    'DLC_sum_1W', 'DLC_sum_1W_lag7', 'DLC_sum_1W_lag14',
    'Sequel_sum_1W', 'Sequel_sum_1W_lag7', 'Sequel_sum_1W_lag14'
]
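
The lagged feature list follows a regular pattern: five static columns, then each one-week series followed by its 7- and 14-row lags. As an optional sketch, the same list can be generated instead of typed out. The names `static_cols`, `base_cols_1w`, `with_lags` and `feature_cols_lagged` are illustrative and do not appear in the notebook.

```python
static_cols = ['Age', 'AccumulatedPositiveRate', 'MultiPlayer', 'SalePeriod', 'DiscountFreq3M']
base_cols_1w = ['PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W',
                'DLC_sum_1W', 'Sequel_sum_1W']
lags = [7, 14]


def with_lags(base_cols, lags):
    """Return each base column followed by its lagged versions, in order."""
    names = []
    for col in base_cols:
        names.append(col)
        names.extend(f"{col}_lag{lag}" for lag in lags)
    return names


feature_cols_lagged = static_cols + with_lags(base_cols_1w, lags)
print(len(feature_cols_lagged))  # 20
print(feature_cols_lagged[5:8])
```

Passing `lags=[7]` with the two-week column names would give the two-week specification listed in the header.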


## All discounts

#### model summary

In [479]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot') 

logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.086834
         Iterations: 100
         Function evaluations: 102
         Gradient evaluations: 102
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17079
Method:                           MLE   Df Model:                           20
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1401
Time:                        18:54:39   Log-Likelihood:                -1484.9
converged:                      False   LL-Null:                       -1726.8
Covariance Type:            nonrobust   LLR p-value:                 6.750e-90
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                            

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [480]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                      -4.548
SalePeriod                  0.601
DiscountFreq3M              0.662
PlayerGrowthRate1W         -0.805
PlayerGrowthRate1W_lag7    -0.225
PlayerGrowthRate1W_lag14   -0.295
FollowersGrowthRate1W      -0.439
dtype: float64


#### Collinearity

In [439]:
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # skip the constant term 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                           feature       VIF
0                              Age  1.400210
1          AccumulatedPositiveRate  1.359391
2                      MultiPlayer  1.452015
3                       SalePeriod  1.076177
4                   DiscountFreq3M  1.210835
5               PlayerGrowthRate1W  1.255601
6          PlayerGrowthRate1W_lag7  1.230177
7         PlayerGrowthRate1W_lag14  1.154126
8            FollowersGrowthRate1W  2.445849
9       FollowersGrowthRate1W_lag7  3.319988
10     FollowersGrowthRate1W_lag14  2.629376
11        PositiveRateGrowthRate1W  1.636391
12   PositiveRateGrowthRate1W_lag7  1.935200
13  PositiveRateGrowthRate1W_lag14  1.691669
14                      DLC_sum_1W  1.042755
15                 DLC_sum_1W_lag7  1.042222
16                DLC_sum_1W_lag14  1.041970
17                   Sequel_sum_1W  1.008160
18              Sequel_sum_1W_lag7  1.009327
19             Sequel_sum_1W_lag14  1.006333


#### Model performance

In [440]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[4900 1801]
 [  38   83]]
       Accuracy  F1 score     AUC
train    0.7506    0.1087  0.8170
test     0.7304    0.0828  0.7393


## Seasonal discounts

#### model summary

In [442]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

logit_model = sm.Logit(y_train, X_train).fit_regularized(alpha=1)
print(logit_model.summary())

Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.03580419099445553
            Iterations: 219
            Function evaluations: 219
            Gradient evaluations: 219
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17079
Method:                           MLE   Df Model:                           20
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.3994
Time:                        18:47:41   Log-Likelihood:                -597.08
converged:                       True   LL-Null:                       -994.20
Covariance Type:            nonrobust   LLR p-value:                2.367e-155
                                     coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------

Try increasing solver accuracy or number of iterations, decreasing alpha, or switch solvers


#### Significant variables

In [482]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                      -4.548
SalePeriod                  0.601
DiscountFreq3M              0.662
PlayerGrowthRate1W         -0.805
PlayerGrowthRate1W_lag7    -0.225
PlayerGrowthRate1W_lag14   -0.295
FollowersGrowthRate1W      -0.439
dtype: float64


#### Model performance

In [443]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6171  623]
 [   0   28]]
       Accuracy  F1 score     AUC
train    0.8789    0.1446  0.9627
test     0.9087    0.0825  0.9774


## Non-seasonal discounts

### Before removal

#### model summary

In [530]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.048989
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17079
Method:                           MLE   Df Model:                           20
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1457
Time:                        19:12:06   Log-Likelihood:                -837.70
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 6.815e-49
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                            

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [523]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                      -6.223
DiscountFreq3M              1.026
PlayerGrowthRate1W_lag14   -0.617
FollowersGrowthRate1W      -0.443
dtype: float64


#### Model performance

In [524]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6015  714]
 [  51   42]]
       Accuracy  F1 score     AUC
train    0.8637    0.0768  0.8364
test     0.8879    0.0989  0.7769


### Removing redundant variables (only variables with no effect are removed)

In [531]:
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

selected = []

for col in feature_cols:
    reduced = [c for c in feature_cols if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    if auc-auc_base > auc_threshold:
        print(col, auc)
    else:
        selected.append(col)


baseline 0.7999903725427109
MultiPlayer 0.8054319793244566
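
The cross-validation above relies on `TimeSeriesSplit`, which splits purely by row position: every fold trains on an initial block of rows and validates on the block that follows. The sketch below shows the folds for a tiny hypothetical array. It assumes the rows are in chronological order, which the notebook does not verify for the multi-game training frame.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)  # eight rows, assumed to be in chronological order
splits = list(TimeSeriesSplit(n_splits=3).split(X))

for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, train_idx.tolist(), test_idx.tolist())
```

If the training frame is ordered by game rather than by date, the folds would separate games instead of time periods, so sorting by `Date` before cross-validation may be worth considering.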


#### model summary

In [532]:
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.049007
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17080
Method:                           MLE   Df Model:                           19
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1453
Time:                        19:12:14   Log-Likelihood:                -838.01
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 2.310e-49
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                            

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [533]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                      -6.151
DiscountFreq3M              1.037
PlayerGrowthRate1W_lag14   -0.615
FollowersGrowthRate1W      -0.440
dtype: float64


#### Model performance

In [534]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5933  796]
 [  50   43]]
       Accuracy  F1 score     AUC
train     0.864    0.0763  0.8360
test      0.876    0.0923  0.7764


# 2W

In [484]:
feature_cols = [
    'Age', 'AccumulatedPositiveRate', "MultiPlayer", 'SalePeriod', 'DiscountFreq3M', 
    'PlayerGrowthRate2W', 'FollowersGrowthRate2W', 'PositiveRateGrowthRate2W', 
    'DLC_sum_2W', 'Sequel_sum_2W'
]

## All discounts

#### model summary

In [485]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())


Optimization terminated successfully.
         Current function value: 0.088420
         Iterations: 77
         Function evaluations: 79
         Gradient evaluations: 79
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17089
Method:                           MLE   Df Model:                           10
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1244
Time:                        18:58:11   Log-Likelihood:                -1512.0
converged:                       True   LL-Null:                       -1726.8
Covariance Type:            nonrobust   LLR p-value:                 4.509e-86
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

#### Significant variables

In [486]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                   -4.430
SalePeriod               0.528
DiscountFreq3M           0.661
PlayerGrowthRate2W      -0.516
FollowersGrowthRate2W   -0.235
dtype: float64


DLC_sum_1W (0.010), DiscountFreq3M (0.000), FollowersGrowthRate1W (0.006)

#### Model performance

In [487]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)



=== unbalance ===
Confusion matrix:
 [[4860 1841]
 [  41   80]]
       Accuracy  F1 score     AUC
train    0.7502    0.1089  0.8126
test     0.7241    0.0784  0.7421
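
The accuracy and F1 values in the `test` row can be reproduced from the confusion matrix, which also exposes the precision and recall that the table does not print. This sketch assumes the matrix follows the scikit-learn layout (rows are actual classes, columns are predicted classes); `evaluate_model` itself is not shown in this section.

```python
# Test-set confusion matrix copied from the output above.
# Assumed layout (scikit-learn convention): rows = actual, columns = predicted.
tn, fp = 4860, 1841
fn, tp = 41, 80

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted discounts that were real
recall = tp / (tp + fn)      # share of real discounts that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Under that assumption the low F1 score comes from low precision rather than low recall: most actual discounts are flagged, but so are many non-discount observations.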


## Seasonal discounts

#### model summary

In [489]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.037257
         Iterations: 88
         Function evaluations: 89
         Gradient evaluations: 89
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17089
Method:                           MLE   Df Model:                           10
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.3592
Time:                        18:58:50   Log-Likelihood:                -637.09
converged:                       True   LL-Null:                       -994.20
Covariance Type:            nonrobust   LLR p-value:                5.566e-147
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

#### Significant variables

In [490]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M           0.256
PlayerGrowthRate2W      -0.694
FollowersGrowthRate2W   -0.289
dtype: float64


#### Model performance

In [491]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6124  670]
 [   0   28]]
       Accuracy  F1 score     AUC
train    0.8591    0.1274  0.9466
test     0.9018    0.0771  0.9690


## Non-seasonal discounts

### Before removal

#### model summary

In [498]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.049669
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17089
Method:                           MLE   Df Model:                           10
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1338
Time:                        19:00:28   Log-Likelihood:                -849.34
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 1.354e-50
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -7.4323     45.21

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [499]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M           1.013
PlayerGrowthRate2W      -0.369
FollowersGrowthRate2W   -0.242
dtype: float64


#### Model performance

In [500]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6410  319]
 [  70   23]]
       Accuracy  F1 score     AUC
train    0.8591    0.0717  0.8268
test     0.9430    0.1057  0.7771


### Removing redundant variables (only variables with no effect are removed)

In [501]:
# Baseline: mean ROC AUC over 5 time-ordered folds using every column of X_train.
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

selected = []

# Drop one feature at a time and re-score the remaining features.
for col in feature_cols:
    reduced = [c for c in feature_cols if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    # Removing `col` raises AUC by more than auc_threshold: report and discard it.
    if auc - auc_base > auc_threshold:
        print(col, auc)
    else:
        selected.append(col)


baseline 0.7913384038899796
Age 0.8010883110553829
MultiPlayer 0.7989950374651652
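
The printed AUC values can be checked against the removal rule directly. The sketch below recomputes the AUC gain for each printed variable, using `auc_threshold = 0.005` as stated in the notebook header; the dictionary names are illustrative.

```python
auc_threshold = 0.005  # value stated in the notebook's header

baseline_auc = 0.7913384038899796
auc_without = {
    "Age": 0.8010883110553829,
    "MultiPlayer": 0.7989950374651652,
}

# A variable is dropped when removing it raises the CV AUC by more than the threshold.
gains = {col: auc - baseline_auc for col, auc in auc_without.items()}
dropped = [col for col, gain in gains.items() if gain > auc_threshold]

for col, gain in gains.items():
    print(f"{col:<12s} AUC gain when removed = {gain:+.4f}")
print("dropped:", dropped)
```

Each variable is compared against the same baseline one at a time, so the loop does not measure the AUC after removing both variables together.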


#### model summary

In [502]:
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.049686
         Iterations: 82
         Function evaluations: 83
         Gradient evaluations: 83
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17091
Method:                           MLE   Df Model:                            8
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1335
Time:                        19:00:40   Log-Likelihood:                -849.62
converged:                       True   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 5.379e-52
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     
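
Because the reduced model is nested in the full one, a likelihood-ratio test gives a rough check on whether dropping `Age` and `MultiPlayer` costs any fit. The sketch below uses the two log-likelihoods printed in the summaries. Treat it as indicative only, since the full model reported `converged: False` and its log-likelihood may not be the true maximum.

```python
import math

ll_full = -849.34      # 10 variables, fit reported converged: False
ll_reduced = -849.62   # 8 variables (Age and MultiPlayer removed)
df_diff = 10 - 8

# Likelihood-ratio statistic: twice the drop in log-likelihood.
lr_stat = 2 * (ll_full - ll_reduced)

# For exactly 2 degrees of freedom the chi-square survival function
# has the closed form exp(-x / 2), so no SciPy call is needed here.
assert df_diff == 2
p_value = math.exp(-lr_stat / 2)

print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.3f}")
```

A large p-value here would be consistent with the two removed variables adding little explanatory power in-sample.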

#### Significant variables

In [503]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M           1.025
PlayerGrowthRate2W      -0.374
FollowersGrowthRate2W   -0.237
dtype: float64


#### Model performance

In [504]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6421  308]
 [  69   24]]
       Accuracy  F1 score     AUC
train    0.8607    0.0724  0.8252
test     0.9447    0.1129  0.7766


# 2W lag

In [505]:
feature_cols = [
    'Age','AccumulatedPositiveRate', "MultiPlayer", 'SalePeriod', 'DiscountFreq3M', 
    'PlayerGrowthRate2W', 'PlayerGrowthRate2W_lag7',
    'FollowersGrowthRate2W', 'FollowersGrowthRate2W_lag7',
    'PositiveRateGrowthRate2W', 'PositiveRateGrowthRate2W_lag7',
    'DLC_sum_2W', 'DLC_sum_2W_lag7',
    'Sequel_sum_2W', 'Sequel_sum_2W_lag7'
]

## All discounts

#### model summary

In [506]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())


         Current function value: 0.087807
         Iterations: 100
         Function evaluations: 102
         Gradient evaluations: 102
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17084
Method:                           MLE   Df Model:                           15
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1305
Time:                        19:01:02   Log-Likelihood:                -1501.5
converged:                      False   LL-Null:                       -1726.8
Covariance Type:            nonrobust   LLR p-value:                 1.522e-86
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -4

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
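
One way to judge whether the five extra lag variables justify themselves is to compare AIC between this model and the 2W model without lags fitted earlier for `DiscountOrNot`. The sketch below uses the log-likelihoods and `Df Model` values printed in the two summaries, and counts the constant as one extra parameter. The lag model did not converge, so its log-likelihood (and hence its AIC) is approximate.

```python
def aic(log_likelihood, df_model):
    """Akaike information criterion; +1 accounts for the intercept."""
    k = df_model + 1
    return 2 * k - 2 * log_likelihood


aic_no_lag = aic(log_likelihood=-1512.0, df_model=10)  # 2W model without lags
aic_lag = aic(log_likelihood=-1501.5, df_model=15)     # 2W model with lag7 terms

print(f"AIC without lags = {aic_no_lag:.1f}")
print(f"AIC with lags    = {aic_lag:.1f}")
print(f"difference       = {aic_lag - aic_no_lag:+.1f}")
```

A lower AIC favors the lag model in-sample; the out-of-sample AUC reported under model performance remains the more direct comparison.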


#### Significant variables

In [507]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                           -4.465
SalePeriod                       0.545
DiscountFreq3M                   0.661
PlayerGrowthRate2W              -0.290
PlayerGrowthRate2W_lag7         -0.307
FollowersGrowthRate2W           -0.790
FollowersGrowthRate2W_lag7       0.567
PositiveRateGrowthRate2W        -0.161
PositiveRateGrowthRate2W_lag7    0.163
dtype: float64


#### Model performance

In [508]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)



=== unbalance ===
Confusion matrix:
 [[4927 1774]
 [  40   81]]
       Accuracy  F1 score     AUC
train    0.7450    0.1058  0.8149
test     0.7341    0.0820  0.7486


## Seasonal discounts

#### model summary

In [509]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.036140
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17084
Method:                           MLE   Df Model:                           15
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.3784
Time:                        19:01:44   Log-Likelihood:                -617.99
converged:                      False   LL-Null:                       -994.20
Covariance Type:            nonrobust   LLR p-value:                1.234e-150
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                           -13

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [510]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

DiscountFreq3M                   0.227
PlayerGrowthRate2W              -0.553
FollowersGrowthRate2W           -1.271
FollowersGrowthRate2W_lag7       0.948
PositiveRateGrowthRate2W        -0.358
PositiveRateGrowthRate2W_lag7    0.357
dtype: float64


#### Model performance

In [511]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6129  665]
 [   0   28]]
       Accuracy  F1 score     AUC
train    0.8701    0.1361  0.9548
test     0.9025    0.0777  0.9724


## Non-seasonal discounts

### Before removal

#### model summary

In [512]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.048980
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17084
Method:                           MLE   Df Model:                           15
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1458
Time:                        19:02:17   Log-Likelihood:                -837.55
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 4.612e-52
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -6

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [513]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                     -6.081
SalePeriod                -2.396
DiscountFreq3M             1.021
PlayerGrowthRate2W_lag7   -0.881
FollowersGrowthRate2W     -0.564
dtype: float64


#### Model performance

In [514]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[6005  724]
 [  47   46]]
       Accuracy  F1 score     AUC
train    0.8599    0.0778  0.8351
test     0.8870    0.1066  0.7811
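
The test AUC barely moves between the 2W model without lags (0.7771) and this lag model (0.7811), yet the two confusion matrices differ considerably. The sketch below compares recall, false-positive rate, and precision for `DiscountOutOfSale`, again assuming the scikit-learn layout (rows are actual classes, columns are predicted classes). The helper `rates` is illustrative.

```python
def rates(tn, fp, fn, tp):
    """Return (recall, false_positive_rate, precision) from confusion-matrix counts."""
    return tp / (tp + fn), fp / (fp + tn), tp / (tp + fp)


# Counts copied from the two "before removal" test-set confusion matrices.
no_lag = rates(tn=6410, fp=319, fn=70, tp=23)
with_lag = rates(tn=6005, fp=724, fn=47, tp=46)

for name, (recall, fpr, precision) in [("no lag", no_lag), ("with lag", with_lag)]:
    print(f"{name:<9s} recall={recall:.3f} fpr={fpr:.3f} precision={precision:.3f}")
```

Under that layout the lag model catches twice as many actual discounts, at the cost of more than twice as many false alarms, which also explains its lower test accuracy.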


### Removing redundant variables (only variables with no effect are removed)

In [515]:
# Baseline: mean ROC AUC over 5 time-ordered folds using every column of X_train.
tscv = TimeSeriesSplit(n_splits=5)
auc_base = cross_val_score(
    LogisticRegression(max_iter=500, class_weight="balanced"),
    X_train, y_train,
    cv=tscv, scoring='roc_auc'
).mean()
print('baseline', auc_base)

selected = []

# Drop one feature at a time and re-score the remaining features.
for col in feature_cols:
    reduced = [c for c in feature_cols if c != col]
    auc = cross_val_score(
        LogisticRegression(max_iter=500, class_weight="balanced"),
        X_train[reduced], y_train,
        cv=tscv, scoring='roc_auc'
    ).mean()

    # Removing `col` raises AUC by more than auc_threshold: report and discard it.
    if auc - auc_base > auc_threshold:
        print(col, auc)
    else:
        selected.append(col)


baseline 0.8019735267884803
Age 0.8074124382856829


#### model summary

In [516]:
X_train, y_train = prepare_xy(train, selected, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, selected, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.048982
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17100
Model:                          Logit   Df Residuals:                    17085
Method:                           MLE   Df Model:                           14
Date:                Fri, 28 Nov 2025   Pseudo R-squ.:                  0.1458
Time:                        19:02:40   Log-Likelihood:                -837.60
converged:                      False   LL-Null:                       -980.53
Covariance Type:            nonrobust   LLR p-value:                 1.044e-52
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const                            -6

  res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)


#### Significant variables

In [517]:
p_values = logit_model.pvalues
coefficients = logit_model.params

significant_coefficients = coefficients[p_values < 0.05]
print(round(significant_coefficients, 3))

const                     -6.085
DiscountFreq3M             1.028
PlayerGrowthRate2W_lag7   -0.879
FollowersGrowthRate2W     -0.573
dtype: float64


#### Model performance

In [518]:
result = evaluate_model('unbalance', logit_model, X_train, y_train, X_test, y_test)

print(result)


=== unbalance ===
Confusion matrix:
 [[5974  755]
 [  47   46]]
       Accuracy  F1 score     AUC
train    0.8608    0.0782  0.8348
test     0.8824    0.1029  0.7810
