# Background
- **Author**: `<郭伊軒>`
- **Created At**: `<2025-11-1>`
- **Path to Training Data**: `discount-timing-DE.csv`
- **Path to Testing Data**: `discount-timing-DE.csv`
- **Model Specification**
    - Method: logistic regression
    - Variables:  
    ['Age', 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 'SalePeriod', 'AccumulatedPositiveRate', 'DLC_sum_1W', 'Sequel_sum_1W']
    - Tuning Parameters:
    - Optimization Method:
- **Main Findings and Takeaways**:
    - In-sample `<metric>`:
    - Out-of-sample `<metric>`:
- **Future Direction**:

In [154]:
# Load packages here
import pandas as pd
import numpy as np
from imblearn.over_sampling import SMOTE
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.preprocessing import StandardScaler



In [155]:
# Load the TRAINING data here and please finish all the data manipulation here.
#input_data_file = "/Users/10610/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
input_data_file = "/Users/user/Desktop/114-1 資料/steam-project/discount-timing-DE.csv"
df = pd.read_csv(input_data_file)
df_dummies = pd.get_dummies(df, columns=['GameID'], drop_first=True)

train = df_dummies[df_dummies['Date'] < '2025-01-01']
test = df_dummies[df_dummies['Date'] >= '2025-01-01']

def prepare_xy(df, feature_cols, target_col):
    X = df[feature_cols].copy()
    y = df[target_col].copy()
     
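    # Convert bool columns to int
    X = X.astype({col: 'int' for col in X.select_dtypes(bool).columns})
    # Note: the scaler is refit on whichever frame is passed in, so the train
    # and test sets are each standardized with their own mean and std.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns, index=X.index)
    X_scaled_df = sm.add_constant(X_scaled_df)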
    
    return X_scaled_df, y
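
`prepare_xy` calls `fit_transform` on whatever frame it receives, so the test set ends up standardized with its own statistics rather than those of the training set. The sketch below shows one possible alternative that estimates the mean and standard deviation on the training split only and reuses them for the test split. The helper name `scale_with_train_stats` is hypothetical and not part of the notebook, and the intercept column is added by hand here instead of through `sm.add_constant`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_with_train_stats(train_df, test_df, feature_cols):
    """Standardize both splits using statistics estimated on the training split only."""
    X_train = train_df[feature_cols].copy()
    X_test = test_df[feature_cols].copy()

    # Cast bool columns to int, mirroring prepare_xy.
    bool_cols = X_train.select_dtypes(bool).columns
    X_train = X_train.astype({c: "int" for c in bool_cols})
    X_test = X_test.astype({c: "int" for c in bool_cols})

    # Fit on train, then apply the same transformation to test.
    scaler = StandardScaler().fit(X_train)
    X_train_s = pd.DataFrame(scaler.transform(X_train), columns=feature_cols, index=X_train.index)
    X_test_s = pd.DataFrame(scaler.transform(X_test), columns=feature_cols, index=X_test.index)

    # Add the intercept column explicitly (assumption: equivalent in intent to sm.add_constant).
    X_train_s.insert(0, "const", 1.0)
    X_test_s.insert(0, "const", 1.0)
    return X_train_s, X_test_s
```

Switching to this approach would change the reported test metrics, so the outputs in this notebook would need to be regenerated.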


In [156]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| GameID | 23938.0 | 461376.742 | 298559.181056 | 10.0 | 244850.0 | 431730.0 | 644930.0 | 1145360.0 |
| MultiPlayer | 23938.0 | 0.464241 | 0.49873 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| ConstantDiscount | 23938.0 | 0.214387 | 0.410405 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountOrNot | 23938.0 | 0.019885 | 0.139607 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuration | 23938.0 | 0.221196 | 1.715483 | 0.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| DiscountFreq3M | 23938.0 | 1.797644 | 1.043279 | 0.0 | 1.0 | 2.0 | 3.0 | 6.0 |
| Age | 23938.0 | 7.634427 | 4.458471 | 2.389041 | 4.95137 | 6.323288 | 8.479452 | 24.84658 |
| AccumulatedPositiveRate | 23938.0 | 0.928061 | 0.064186 | 0.738751 | 0.905517 | 0.953165 | 0.972651 | 0.9929734 |
| SalePeriod | 23938.0 | 0.14642 | 0.353534 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| DiscountDuringSale | 23938.0 | 0.008647 | 0.09259 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


### The actual modeling starts below
For the remaining blocks, make sure you have followed the guidelines specified in [Project Folder Structure, File Naming, and Documentation Guidelines](https://docs.google.com/document/d/1sl6gEFMdmiGsiNjLe17UmZ30xKxq15U0Mb2B-Jvusxg/edit?tab=t.33iie8ybx7s4).


In [157]:
def evaluate_model(name, model, X_test, y_test):
    y_prob = model.predict(X_test)
    y_pred = (y_prob >= 0.5).astype(int)

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_prob)
    cm = confusion_matrix(y_test, y_pred)

    print(f"\n=== {name} ===")
    print(f"Accuracy: {acc:.4f}")
    print(f"F1-score: {f1:.4f}")
    print(f"AUC: {auc:.4f}")
    print("Confusion matrix:\n", cm)
    return {"Model": name, "Accuracy": acc, "F1": f1, "AUC": auc}
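
`evaluate_model` classifies at a fixed 0.5 cutoff. With a positive rate of roughly 2%, a model fit on the raw data may never predict a probability that high, which yields an F1 of zero even when the ranking (AUC) is reasonable. The sketch below scans candidate thresholds and reports the one with the highest F1. Both function names are hypothetical, and in practice the threshold should be chosen on a validation split rather than on the test set.

```python
import numpy as np


def f1_at_threshold(y_true, y_prob, threshold):
    """F1 score when probabilities >= threshold are classified as positive."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, F1) for the candidate threshold with the highest F1."""
    if thresholds is None:
        thresholds = np.linspace(0.01, 0.99, 99)  # assumed grid; adjust as needed
    scores = [f1_at_threshold(y_true, y_prob, t) for t in thresholds]
    best = int(np.argmax(scores))
    return float(thresholds[best]), float(scores[best])
```

A call such as `best_f1_threshold(y_val, model.predict(X_val))` would return the selected cutoff, where `y_val` and `X_val` stand for a held-out validation split that this notebook does not currently create.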

# 1W

### All discounts

In [158]:
feature_cols_gameid = [
    'Age', 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 
    'SalePeriod', 'DLC_sum_1W', 'Sequel_sum_1W'
] + [col for col in df_dummies.columns if col.startswith('GameID_')]

feature_cols = [
    'Age', "MultiPlayer", 'PlayerGrowthRate1W', 'FollowersGrowthRate1W', 'PositiveRateGrowthRate1W', 
    'SalePeriod', 'DiscountFreq3M', 'DLC_sum_1W', 'Sequel_sum_1W'
]


#### Testing for individual (per-game) differences

In [159]:
X_train, y_train = prepare_xy(train, feature_cols_gameid, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols_gameid, 'DiscountOrNot') 
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

         Current function value: 0.087856
         Iterations: 100
         Function evaluations: 101
         Gradient evaluations: 101
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17081
Method:                           MLE   Df Model:                           34
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.1293
Time:                        16:52:32   Log-Likelihood:                -1503.7
converged:                      False   LL-Null:                       -1727.1
Covariance Type:            nonrobust   LLR p-value:                 1.875e-73
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const                       -4.6824      0.16



##### Multicollinearity

In [160]:
# Check multicollinearity: AccumulatedPositiveRate and Age have a multicollinearity problem
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # skip the constant term 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                     feature        VIF
0                        Age  87.198322
1         PlayerGrowthRate1W   1.219776
2      FollowersGrowthRate1W   2.350926
3   PositiveRateGrowthRate1W   1.510173
4                 SalePeriod   1.037660
5                 DLC_sum_1W   1.111978
6              Sequel_sum_1W   1.015323
7                GameID_3590  12.664379
8                GameID_4000   7.536524
9              GameID_108600  27.180471
10             GameID_233860  51.572844
11             GameID_242760  48.961063
12             GameID_244210  31.592986
13             GameID_244850  53.400790
14             GameID_294100  51.125701
15             GameID_323190  48.764740
16             GameID_367520  42.324516
17             GameID_376210  36.451853
18             GameID_381210  39.412920
19             GameID_413150  37.198470
20             GameID_431730  36.973992
21             GameID_431960  50.921457
22             GameID_457140  55.600975
23             GameID_477160  40.048350
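
For reference, the VIF of feature i is 1 / (1 - R_i^2), where R_i^2 comes from regressing feature i on all the other features. The sketch below implements this textbook definition with NumPy only. It adds an intercept to the auxiliary regression, which is an assumption of this sketch; `variance_inflation_factor` in statsmodels regresses on exactly the columns it is given, so the two agree only when the columns are centered or a constant is included. The function name `vif_from_r2` is hypothetical.

```python
import numpy as np


def vif_from_r2(X, i):
    """VIF of column i: regress it on the remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])

    # Ordinary least squares fit of column i on the other columns.
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef

    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)
```

A VIF near 1 means the feature is nearly uncorrelated with the others; a value such as the 87.2 reported for `Age` above corresponds to an R^2 of about 0.989 in the auxiliary regression.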


##### Wald test

In [161]:
# 1. Collect the names of all dummy variables
game_cols = [col for col in df_dummies.columns if col.startswith('GameID_')]
game_cnt = len(game_cols)
variable_cnt = len(feature_cols_gameid) + 1 # total number of parameters, including the constant term

# 2. Initialize the R matrix
R_matrix = np.zeros([game_cnt, variable_cnt])

# 3. Locate each dummy in the model's parameter list and set the matching R entry
for i, var_name in enumerate(game_cols):
    # Find the position of this variable in model.params
    param_index = logit_model.params.index.get_loc(var_name)
    R_matrix[i, param_index] = 1


print('\n unbalance')
print(logit_model.wald_test(R_matrix))


 unbalance
<Wald test (chi2): statistic=[[74.56627923]], p-value=2.445621130019125e-06, df_denom=27>




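Note: the Wald test above rejects the joint null hypothesis that all GameID dummy coefficients are zero (chi2 = 74.57, df = 27, p ≈ 2.4e-06), so on its own it does not support the claim that there are no clear individual differences. The fit did not converge (`converged: False`) and the GameID dummies and `Age` show high VIFs, so this result should be read with caution. The models below are fit without the GameID dummies.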
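
To make the decision rule explicit, the p-value printed by the Wald test can be reproduced from the statistic and the number of restrictions using the chi-square survival function. The sketch below assumes a 5% significance level, which the notebook does not state.

```python
from scipy.stats import chi2

wald_stat = 74.56627923  # statistic printed by logit_model.wald_test above
df = 27                  # number of restrictions (one per GameID dummy)

# Upper-tail probability of a chi-square(df) variable exceeding the statistic.
p_value = chi2.sf(wald_stat, df)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"p-value = {p_value:.3e}, reject H0 at alpha={alpha}: {reject_null}")
```

Rejecting the null here means the GameID dummy coefficients are jointly different from zero under this model.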

#### model summary

In [162]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot') 

logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.087305
         Iterations: 76
         Function evaluations: 78
         Gradient evaluations: 78
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17106
Method:                           MLE   Df Model:                            9
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.1348
Time:                        16:52:33   Log-Likelihood:                -1494.3
converged:                       True   LL-Null:                       -1727.1
Covariance Type:            nonrobust   LLR p-value:                 1.266e-94
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

PlayerGrowthRate1W, FollowersGrowthRate1W, SalePeriod, and DiscountFreq3M are significant.

##### Multicollinearity

In [163]:
# Check multicollinearity: AccumulatedPositiveRate and Age have a multicollinearity problem
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns[1:]  # skip the constant term 'const'
vif_data["VIF"] = [
    variance_inflation_factor(X_train.iloc[:, 1:].values, i)
    for i in range(X_train.shape[1] - 1)
]
print(vif_data)

                    feature       VIF
0                       Age  1.283340
1               MultiPlayer  1.202316
2        PlayerGrowthRate1W  1.099560
3     FollowersGrowthRate1W  1.119896
4  PositiveRateGrowthRate1W  1.049941
5                SalePeriod  1.075511
6            DiscountFreq3M  1.182648
7                DLC_sum_1W  1.032767
8             Sequel_sum_1W  1.006161


#### Model performance

In [164]:
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit(method='bfgs', maxiter=100)
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


Optimization terminated successfully.
         Current function value: 0.518770
         Iterations: 48
         Function evaluations: 49
         Gradient evaluations: 49

=== unbalance ===
Accuracy: 0.9823
F1-score: 0.0000
AUC: 0.7332
Confusion matrix:
 [[6701    0]
 [ 121    0]]

=== balance ===
Accuracy: 0.7347
F1-score: 0.0803
AUC: 0.7398
Confusion matrix:
 [[4933 1768]
 [  42   79]]

模型比較結果:
       Model  Accuracy        F1       AUC
1    balance  0.734682  0.080285  0.739787
0  unbalance  0.982263  0.000000  0.733206


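The model trained on SMOTE-balanced data performs better on F1 (0.0803 vs. 0.0000) with similar AUC (0.7398 vs. 0.7332); the unbalanced model predicts no positives at the 0.5 threshold.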
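
The "balance" models are trained on data resampled with SMOTE, which creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea with NumPy; it is not imblearn's implementation, and the function name `interpolate_minority` and its parameters are hypothetical. It needs at least two minority rows.

```python
import numpy as np


def interpolate_minority(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic rows on segments between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances; exclude each point from its own neighbour list.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base row, one of its k neighbours, and a random position between them.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because the resampled training set has far more positives than the real data, predicted probabilities from the balanced model are shifted upward, which is why it produces many more positive predictions at the 0.5 cutoff.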

### Seasonal discounts

#### model summary

In [165]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')

logit_model = sm.Logit(y_train, X_train).fit_regularized(alpha=1)
print(logit_model.summary())

Optimization terminated successfully    (Exit mode 0)
            Current function value: 0.035459003845110276
            Iterations: 142
            Function evaluations: 142
            Gradient evaluations: 142
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17108
Method:                           MLE   Df Model:                            7
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.4040
Time:                        16:52:33   Log-Likelihood:                -592.69
converged:                       True   LL-Null:                       -994.37
Covariance Type:            nonrobust   LLR p-value:                3.487e-169
                               coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------

PlayerGrowthRate1W, PositiveRateGrowthRate1W, and SalePeriod are significant; DiscountFreq3M (0.018).

#### Model performance

In [166]:
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit()
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


         Current function value: 0.184403
         Iterations: 35

=== unbalance ===
Accuracy: 0.9815
F1-score: 0.1711
AUC: 0.9760
Confusion matrix:
 [[6683  111]
 [  15   13]]

=== balance ===
Accuracy: 0.9044
F1-score: 0.0791
AUC: 0.9755
Confusion matrix:
 [[6142  652]
 [   0   28]]

模型比較結果:
       Model  Accuracy        F1       AUC
0  unbalance  0.981530  0.171053  0.976008
1    balance  0.904427  0.079096  0.975462




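Here the unbalanced model performs better on F1 (0.1711 vs. 0.0791) and accuracy (0.9815 vs. 0.9044), with nearly identical AUC (0.9760 vs. 0.9755). The balanced model recovers all 28 positives, but at the cost of 652 false positives.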
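
The F1 scores printed above follow directly from the confusion matrices, which sklearn lays out as `[[TN, FP], [FN, TP]]`. The sketch below recomputes precision, recall, and F1 from the two matrices in this section; the helper name `f1_from_confusion` is hypothetical.

```python
def f1_from_confusion(cm):
    """Return (precision, recall, F1) from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Confusion matrices printed above for the seasonal-discount models (1W features).
unbalanced_cm = [[6683, 111], [15, 13]]
balanced_cm = [[6142, 652], [0, 28]]

for name, cm in [("unbalance", unbalanced_cm), ("balance", balanced_cm)]:
    p, r, f1 = f1_from_confusion(cm)
    print(f"{name}: precision={p:.4f}, recall={r:.4f}, F1={f1:.4f}")
```

Reading precision and recall separately shows the trade-off that F1 hides: the balanced model reaches full recall with very low precision, while the unbalanced model misses about half of the positives but raises far fewer false alarms.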

### Non-seasonal discounts

#### model summary

In [167]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.049663
         Iterations: 76
         Function evaluations: 77
         Gradient evaluations: 77
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17106
Method:                           MLE   Df Model:                            9
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.1332
Time:                        16:52:33   Log-Likelihood:                -850.04
converged:                       True   LL-Null:                       -980.69
Covariance Type:            nonrobust   LLR p-value:                 4.062e-51
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

DLC_sum_1W (0.010), DiscountFreq3M (0.000), FollowersGrowthRate1W (0.006)

#### Model performance

In [168]:
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit(method='bfgs', maxiter=100)
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


Optimization terminated successfully.
         Current function value: 0.492254
         Iterations: 51
         Function evaluations: 52
         Gradient evaluations: 52

=== unbalance ===
Accuracy: 0.9864
F1-score: 0.0000
AUC: 0.7608
Confusion matrix:
 [[6729    0]
 [  93    0]]

=== balance ===
Accuracy: 0.7888
F1-score: 0.0709
AUC: 0.7602
Confusion matrix:
 [[5326 1403]
 [  38   55]]

模型比較結果:
       Model  Accuracy        F1       AUC
1    balance  0.788772  0.070922  0.760200
0  unbalance  0.986368  0.000000  0.760777


# 2W

In [169]:
feature_cols = [
    'Age', "MultiPlayer", 'PlayerGrowthRate2W', 'FollowersGrowthRate2W', 'PositiveRateGrowthRate2W', 
    'SalePeriod', 'DiscountFreq3M', 'DLC_sum_2W', 'Sequel_sum_2W'
]
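
The 2W feature list differs from the 1W list only in the window suffix, so both can be generated from one helper rather than maintained by hand. The function name `make_feature_cols` below is hypothetical and not part of the notebook.

```python
def make_feature_cols(window):
    """Return the feature list for a growth-rate window such as '1W' or '2W'."""
    return [
        'Age', 'MultiPlayer',
        f'PlayerGrowthRate{window}',
        f'FollowersGrowthRate{window}',
        f'PositiveRateGrowthRate{window}',
        'SalePeriod', 'DiscountFreq3M',
        f'DLC_sum_{window}', f'Sequel_sum_{window}',
    ]


feature_cols_2w = make_feature_cols('2W')
```

This keeps the two specifications aligned if a variable is later added or removed.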

### All discounts

#### model summary

In [170]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOrNot')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOrNot')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())


Optimization terminated successfully.
         Current function value: 0.088255
         Iterations: 75
         Function evaluations: 77
         Gradient evaluations: 77
                           Logit Regression Results                           
Dep. Variable:          DiscountOrNot   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17106
Method:                           MLE   Df Model:                            9
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.1254
Time:                        16:52:34   Log-Likelihood:                -1510.6
converged:                       True   LL-Null:                       -1727.1
Covariance Type:            nonrobust   LLR p-value:                 1.153e-87
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

PlayerGrowthRate2W, FollowersGrowthRate2W, and SalePeriod are significant.

#### Model performance

In [171]:
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit()
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


         Current function value: 0.522157
         Iterations: 35

=== unbalance ===
Accuracy: 0.9823
F1-score: 0.0000
AUC: 0.7403
Confusion matrix:
 [[6701    0]
 [ 121    0]]

=== balance ===
Accuracy: 0.7284
F1-score: 0.0804
AUC: 0.7410
Confusion matrix:
 [[4888 1813]
 [  40   81]]

模型比較結果:
       Model  Accuracy        F1       AUC
1    balance  0.728379  0.080397  0.740979
0  unbalance  0.982263  0.000000  0.740350




### Seasonal discounts

#### model summary

In [172]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountDuringSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountDuringSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.036801
         Iterations: 78
         Function evaluations: 79
         Gradient evaluations: 79
                           Logit Regression Results                           
Dep. Variable:     DiscountDuringSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17106
Method:                           MLE   Df Model:                            9
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.3665
Time:                        16:52:34   Log-Likelihood:                -629.89
converged:                       True   LL-Null:                       -994.37
Covariance Type:            nonrobust   LLR p-value:                4.115e-151
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

PlayerGrowthRate2W is significant; FollowersGrowthRate2W (0.004), DiscountFreq3M (0.006).

#### Model performance

In [173]:
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit()
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


         Current function value: 0.205498
         Iterations: 35

=== unbalance ===
Accuracy: 0.9285
F1-score: 0.0896
AUC: 0.9720
Confusion matrix:
 [[6310  484]
 [   4   24]]

=== balance ===
Accuracy: 0.9018
F1-score: 0.0771
AUC: 0.9716
Confusion matrix:
 [[6124  670]
 [   0   28]]

模型比較結果:
       Model  Accuracy        F1       AUC
0  unbalance  0.928467  0.089552  0.972008
1    balance  0.901788  0.077135  0.971582




### Non-seasonal discounts

#### model summary

In [174]:
X_train, y_train = prepare_xy(train, feature_cols, 'DiscountOutOfSale')
X_test, y_test = prepare_xy(test, feature_cols, 'DiscountOutOfSale')
logit_model = sm.Logit(y_train, X_train).fit(method='bfgs', maxiter=100)
print(logit_model.summary())

Optimization terminated successfully.
         Current function value: 0.049617
         Iterations: 89
         Function evaluations: 90
         Gradient evaluations: 90
                           Logit Regression Results                           
Dep. Variable:      DiscountOutOfSale   No. Observations:                17116
Model:                          Logit   Df Residuals:                    17106
Method:                           MLE   Df Model:                            9
Date:                Fri, 14 Nov 2025   Pseudo R-squ.:                  0.1340
Time:                        16:52:34   Log-Likelihood:                -849.24
converged:                       True   LL-Null:                       -980.69
Covariance Type:            nonrobust   LLR p-value:                 1.865e-51
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
const     

PlayerGrowthRate2W (0.039), DiscountFreq3M (0.000), DLC_sum_2W (0.033)

#### Model performance

In [175]:
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
logit_model_sm = sm.Logit(y_train_sm, X_train_sm).fit(method='bfgs', maxiter=100)
result1 = evaluate_model('unbalance', logit_model, X_test, y_test)
result2 = evaluate_model('balance', logit_model_sm, X_test, y_test)

results = pd.DataFrame([result1, result2])
print("\n模型比較結果:")
print(results.sort_values(by="F1", ascending=False))


Optimization terminated successfully.
         Current function value: 0.488811
         Iterations: 51
         Function evaluations: 52
         Gradient evaluations: 52

=== unbalance ===
Accuracy: 0.9864
F1-score: 0.0000
AUC: 0.7736
Confusion matrix:
 [[6729    0]
 [  93    0]]

=== balance ===
Accuracy: 0.7952
F1-score: 0.0718
AUC: 0.7744
Confusion matrix:
 [[5371 1358]
 [  39   54]]

模型比較結果:
       Model  Accuracy        F1       AUC
1    balance  0.795221  0.071761  0.774355
0  unbalance  0.986368  0.000000  0.773578
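
To compare the 1W and 2W feature windows side by side, the out-of-sample AUC values printed in the comparison tables above can be collected in one place. The sketch below uses only those reported numbers; the helper `best_window` is hypothetical, and the differences between windows are small, so it says nothing about statistical significance.

```python
# Out-of-sample AUC values copied from the comparison tables above.
reported_auc = {
    ("1W", "DiscountOrNot"): {"unbalance": 0.733206, "balance": 0.739787},
    ("1W", "DiscountDuringSale"): {"unbalance": 0.976008, "balance": 0.975462},
    ("1W", "DiscountOutOfSale"): {"unbalance": 0.760777, "balance": 0.760200},
    ("2W", "DiscountOrNot"): {"unbalance": 0.740350, "balance": 0.740979},
    ("2W", "DiscountDuringSale"): {"unbalance": 0.972008, "balance": 0.971582},
    ("2W", "DiscountOutOfSale"): {"unbalance": 0.773578, "balance": 0.774355},
}


def best_window(target):
    """Return the window whose best model (by AUC) scores highest for this target."""
    windows = [w for (w, t) in reported_auc if t == target]
    return max(windows, key=lambda w: max(reported_auc[(w, target)].values()))


for target in ["DiscountOrNot", "DiscountDuringSale", "DiscountOutOfSale"]:
    print(target, "->", best_window(target))
```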
