### Introduction to Preprocessing: Decay-Time Weighting Experiments, Comparing SOS-Adjusted and Non-SOS-Adjusted Versions of Optimized Recent Past Performance

This preprocessing step compares linear prediction quality for the non-'Strength of Schedule' (SOS) adjusted and SOS-adjusted forms ('IS_pds_l10_ndw' and 'IS_pds_l10_ndw_SOS_adj', respectively) of the IS2 recent past performance feature identified in the previous preprocessing step (03a_IS2_Preprocessing_DTW_Experiments).

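CONCLUSION: The SOS-adjusted version yields a modestly better prediction than the non-SOS version. Values are RMSE in minutes; training values are the 5-fold cross-validation (mean, standard deviation), and the test RMSE is about 0.12 minutes lower with the SOS adjustment: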

l10 non-SOS adjusted version
Training: (9.467881123742533, 1.1936856196221093), Testing: 10.253582416968488

l10 SOS adjusted version
Training: (9.323487546635262, 1.2065005232644828), Testing: 10.135272303754874
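
To put the gap between the two variants in context, the reported values can be compared directly. The snippet below is a minimal sketch that uses only the numbers listed above; the dictionary and variable names are illustrative and do not come from the notebook.

```python
# RMSE values (minutes) copied from the results above.
# Training entries are (CV mean, CV std); testing entries are single values.
non_sos = {"train": (9.467881123742533, 1.1936856196221093), "test": 10.253582416968488}
sos_adj = {"train": (9.323487546635262, 1.2065005232644828), "test": 10.135272303754874}

# Absolute and relative reduction in test RMSE from the SOS adjustment.
test_gain = non_sos["test"] - sos_adj["test"]
test_gain_pct = 100 * test_gain / non_sos["test"]

# Same comparison for the cross-validated training RMSE means.
train_gain = non_sos["train"][0] - sos_adj["train"][0]

print(f"Test RMSE reduction: {test_gain:.3f} min ({test_gain_pct:.2f}%)")
print(f"CV RMSE reduction:   {train_gain:.3f} min")
```

Both reductions are well below the cross-validation standard deviation of roughly 1.2 minutes, which is worth keeping in mind when weighing the size of the improvement.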

In [1]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
#from library.sb_utils import save_file

### Load Data

In [2]:
df = pd.read_csv('../../data/df3.csv')
df.head()

Unnamed: 0,P_Date,P_Date_str,IS2_Completed,Comp_Date,Comp_Date_str,DOW,DOW_num,Grid Size,IS2_ST(m),IS_pds_l10_ndw,...,Circle_Count,Shade_Count,Unusual_Sym,Black_Square_Fill,Outside_Grid,Unchecked_Sq,Uniclue,Duplicate_Answers,Quantum,Wordplay
0,2022-05-11 00:00:00,2022-05-11,1.0,2024-02-29 19:58:44,2024-02-29,Wednesday,4.0,1,8.166667,10.995,...,0,0,0,0,0,0,0,0,0,3.0
1,2022-05-18 00:00:00,2022-05-18,1.0,2024-02-29 17:34:25,2024-02-29,Wednesday,4.0,1,6.783333,11.678333,...,0,0,0,0,0,0,0,0,0,3.0
2,2024-02-28 00:00:00,2024-02-28,1.0,2024-02-28 18:02:10,2024-02-28,Wednesday,4.0,1,7.033333,11.625,...,0,0,0,0,0,0,0,0,0,3.0
3,2022-05-25 00:00:00,2022-05-25,1.0,2024-02-27 20:57:43,2024-02-27,Wednesday,4.0,1,11.75,11.531667,...,0,0,0,0,0,0,0,0,0,4.0
4,2022-06-01 00:00:00,2022-06-01,1.0,2024-02-24 21:13:46,2024-02-24,Wednesday,4.0,1,16.2,11.161667,...,16,0,0,0,0,0,0,0,0,1.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1230 entries, 0 to 1229
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   P_Date                  1230 non-null   object 
 1   P_Date_str              1230 non-null   object 
 2   IS2_Completed           1228 non-null   float64
 3   Comp_Date               1230 non-null   object 
 4   Comp_Date_str           1230 non-null   object 
 5   DOW                     1230 non-null   object 
 6   DOW_num                 1230 non-null   float64
 7   Grid Size               1230 non-null   int64  
 8   IS2_ST(m)               1230 non-null   float64
 9   IS_pds_l10_ndw          1223 non-null   float64
 10  IS_pds_l10_stdev        1216 non-null   float64
 11  IS_pds_l10_ndw_SOS_adj  1223 non-null   float64
 12  GMST(m)                 1230 non-null   float64
 13  Constructors            1230 non-null   object 
 14  Words                   1230 non-null   

### Create Feature Variants for Testing

### Filter Data

In [4]:
# strip down df to just the columns we need to evaluate SOS and non-SOS adj versions of IS2 RPB
df1 = df[['DOW', 'Comp_Date', 'Comp_Date_str', 'IS2_ST(m)', 'IS_pds_l10_ndw', 'IS_pds_l10_ndw_SOS_adj']]

In [5]:
# Filter out Sunday puzzles
df1 = df1[df1["DOW"] != "Sunday"]

In [6]:
#Remove the first solve period (2018-2019) to calculate sample averages by day
df1 = df1[df1['Comp_Date_str'].str.contains("2020|2021|2022|2023|2024")]
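
The year filter above matches year substrings in the completion-date string. An alternative is to parse the dates and compare the year numerically. The following is a minimal sketch on a toy frame (`toy` and its values are made up for illustration); it assumes the intent is "completed in 2020 or later", which matches the regex only as long as the data ends in 2024.

```python
import pandas as pd

# Toy frame standing in for df1; only the column used by the filter is included.
toy = pd.DataFrame({
    "Comp_Date_str": ["2018-07-04", "2019-12-31", "2020-01-01", "2023-06-15", "2024-02-29"]
})

# Parse the date strings and keep solves completed in 2020 or later.
comp_year = pd.to_datetime(toy["Comp_Date_str"]).dt.year
toy_recent = toy[comp_year >= 2020]

print(toy_recent["Comp_Date_str"].tolist())
```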

Creating df variants with only the columns we will need to generate the benchmark models 

In [61]:
df_filter=df1.copy()

In [62]:
#df_model1 = df_filter[["IS2_ST(m)", "IS_pds_l10_ndw"]]
df_model1 = df_filter[["IS2_ST(m)", "IS_pds_l10_ndw_SOS_adj"]]

In [63]:
df_model1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 979 entries, 0 to 1219
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IS2_ST(m)               979 non-null    float64
 1   IS_pds_l10_ndw_SOS_adj  979 non-null    float64
dtypes: float64(2)
memory usage: 22.9 KB


### Train Test Split

In [64]:
len(df_model1) * .80, len(df_model1) * .20

(783.2, 195.8)

In [65]:
X_train, X_test, y_train, y_test = train_test_split(df_model1.drop(columns='IS2_ST(m)'), 
                                                    df_model1["IS2_ST(m)"], test_size=0.20, 
                                                    random_state=2)

In [66]:
y_train.shape, y_test.shape

((783,), (196,))
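
The fractional sizes printed earlier (783.2 and 195.8) resolve to 783 and 196 rows. The sketch below reproduces that arithmetic, assuming scikit-learn rounds the test count up when `test_size` is a fraction and `train_size` is left unset; the variable names are illustrative.

```python
import math

n_samples = 979   # rows in df_model1
test_size = 0.20

# With a float test_size, the test count is rounded up and the
# remaining rows go to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```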

In [67]:
y_train

1016     6.483333
379     18.783333
499     34.400000
152     35.233333
1056    16.100000
          ...    
750     23.733333
800     25.066667
709     19.550000
743     23.016667
186      7.283333
Name: IS2_ST(m), Length: 783, dtype: float64

In [68]:
X_train.shape, X_test.shape

((783, 1), (196, 1))

In [69]:
y_train.mean()

18.32660706683695
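
The training mean is the value a naive baseline would predict for every puzzle. `DummyRegressor` is imported above but not used in this step; the sketch below shows how such a baseline could be scored for comparison with the linear model. The arrays are synthetic stand-ins, not the notebook's data, and all `*_toy` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for solve times in minutes (not the notebook's data).
rng = np.random.default_rng(0)
y_train_toy = rng.normal(loc=18.0, scale=10.0, size=200)
y_test_toy = rng.normal(loc=18.0, scale=10.0, size=50)

# Features are ignored by DummyRegressor, so placeholders are enough.
X_train_toy = np.zeros((200, 1))
X_test_toy = np.zeros((50, 1))

# Baseline that always predicts the training mean.
dummy = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_pred = dummy.predict(X_test_toy)

# RMSE of the baseline, on the same scale as the linear model's RMSE.
baseline_rmse = np.sqrt(mean_squared_error(y_test_toy, baseline_pred))
print(baseline_rmse)
```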

### Benchmark Linear Model Based on Last N Day-Specific Puzzles With X Decay Weighting

In [70]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 783 entries, 1016 to 186
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IS_pds_l10_ndw_SOS_adj  783 non-null    float64
dtypes: float64(1)
memory usage: 12.2 KB


In [44]:
lr_pipe = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression),
    LinearRegression()
)

In [45]:
#Dict of available parameters for linear regression pipe
lr_pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])

In [46]:
#Define search grid parameters
k = [i + 1 for i in range(len(X_train.columns))]

grid_params = {
    'standardscaler': [StandardScaler(), None],
    'simpleimputer__strategy': ['mean', 'median'],
    'selectkbest__k': k
}

In [47]:
#Call `GridSearchCV` with linear regression pipeline, passing in the above `grid_params`
#dict for parameters to evaluate with 5-fold cross-validation
lr_grid_cv = GridSearchCV(lr_pipe, param_grid=grid_params, cv=5)

In [48]:
#Conduct grid search for this model
lr_grid_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('simpleimputer',
                                        SimpleImputer(strategy='median')),
                                       ('standardscaler', StandardScaler()),
                                       ('selectkbest',
                                        SelectKBest(score_func=<function f_regression at 0x0000016172DE3310>)),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'selectkbest__k': [1],
                         'simpleimputer__strategy': ['mean', 'median'],
                         'standardscaler': [StandardScaler(), None]})

In [49]:
#Best params from grid search for this model
lr_grid_cv.best_params_

{'selectkbest__k': 1,
 'simpleimputer__strategy': 'mean',
 'standardscaler': StandardScaler()}
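
With a single input column and no missing values in `X_train`, the grid has little to choose between: `selectkbest__k` can only be 1, the imputer has nothing to fill, and standardizing one feature does not change the fitted values of ordinary least squares with an intercept. The sketch below illustrates the last point on synthetic data (the `*_toy` arrays are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (illustrative only).
rng = np.random.default_rng(1)
X_toy = rng.uniform(5, 40, size=(100, 1))
y_toy = 0.9 * X_toy[:, 0] + rng.normal(0, 5, size=100)

# Same linear model with and without standardization.
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_toy, y_toy)
unscaled = LinearRegression().fit(X_toy, y_toy)

# Predictions agree up to floating-point error.
max_diff = np.max(np.abs(scaled.predict(X_toy) - unscaled.predict(X_toy)))
print(max_diff)
```

Under these conditions the selected `standardscaler` and `simpleimputer__strategy` values likely reflect ties or floating-point noise rather than a meaningful preference.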

### Linear Model Metrics From RPB Variant

#### R-squared (COD)

In [50]:
#Cross-validation defaults to R^2 metric for scoring regression
lr_best_cv_results = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, cv=5)
lr_best_scores = lr_best_cv_results['test_score']
lr_best_scores

array([0.63279746, 0.56704739, 0.52951961, 0.59047458, 0.3963074 ])

In [51]:
#Training set CV mean and std
np.mean(lr_best_scores), np.std(lr_best_scores)

(0.543229288606841, 0.08074140564426967)
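
Several metrics can also be collected in a single `cross_validate` call by passing a list of scorer names, which keeps the folds identical across metrics. This is a minimal sketch on synthetic data, not the notebook's pipeline; `summary` and the `*_toy` arrays are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic single-feature data (illustrative only).
rng = np.random.default_rng(2)
X_toy = rng.uniform(5, 40, size=(200, 1))
y_toy = 0.9 * X_toy[:, 0] + rng.normal(0, 5, size=200)

# Request several scorers at once; results are keyed as 'test_<name>'.
scoring = ["r2", "neg_mean_absolute_error",
           "neg_mean_squared_error", "neg_root_mean_squared_error"]
cv_results = cross_validate(LinearRegression(), X_toy, y_toy, scoring=scoring, cv=5)

# Error scorers are negated by scikit-learn, so flip the sign for reporting.
summary = {
    "r2": cv_results["test_r2"].mean(),
    "mae": -cv_results["test_neg_mean_absolute_error"].mean(),
    "mse": -cv_results["test_neg_mean_squared_error"].mean(),
    "rmse": -cv_results["test_neg_root_mean_squared_error"].mean(),
}
print(summary)
```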

#### Mean Absolute Error (MAE)

In [52]:
lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

In [53]:
# Training set MAE and STD 
lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])
lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])
MAE_LR_train = lr_mae_mean, lr_mae_std
MAE_LR_train

(5.7172098290131785, 0.44907010240202266)

In [54]:
# Test set MAE
MAE_LR_test = mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
MAE_LR_test

6.165079615172858

#### Mean Squared Error (MSE)

In [55]:
lr_neg_mse = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_squared_error', cv=5)

In [56]:
#Training set CV mean and std
lr_mse_mean = np.mean(-1 * lr_neg_mse['test_score'])
lr_mse_std = np.std(-1 * lr_neg_mse['test_score'])
MSE_LR_train = lr_mse_mean, lr_mse_std
MSE_LR_train

(88.38306354490024, 22.468985879832516)

In [57]:
# Test set MSE
MSE_LR_test = mean_squared_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
MSE_LR_test

102.72374467126065

#### Root Mean Square Error (RMSE)

In [58]:
lr_neg_rmse = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_root_mean_squared_error', cv=5)

In [59]:
#Training set CV mean and std
lr_rmse_mean = np.mean(-1 * lr_neg_rmse['test_score'])
lr_rmse_std = np.std(-1 * lr_neg_rmse['test_score'])
RMSE_LR_train = lr_rmse_mean, lr_rmse_std
RMSE_LR_train

(9.323487546635262, 1.2065005232644828)

In [60]:
# Test set RMSE
RMSE_LR_test = np.sqrt(mean_squared_error(y_test, lr_grid_cv.best_estimator_.predict(X_test)))
RMSE_LR_test

10.135272303754874
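
The reported MSE and RMSE values relate differently on the test set and in cross-validation. The check below is a small sketch that uses only the numbers printed above: on the test set, RMSE is the square root of MSE, while the mean of per-fold RMSEs is smaller than the square root of the mean per-fold MSE. Variable names are illustrative.

```python
import math

# Values reported above (minutes for RMSE, squared minutes for MSE).
mse_test, rmse_test = 102.72374467126065, 10.135272303754874
mse_cv_mean, rmse_cv_mean = 88.38306354490024, 9.323487546635262

# On the test set, RMSE is the square root of MSE.
test_gap = abs(math.sqrt(mse_test) - rmse_test)

# For CV, the mean of per-fold RMSEs is not the root of the mean per-fold MSE;
# the root of the mean is the larger of the two.
root_of_mean_mse = math.sqrt(mse_cv_mean)

print(test_gap, root_of_mean_mse, rmse_cv_mean)
```

This is why the cross-validated RMSE should be compared with other cross-validated RMSEs, not with a square root taken from the MSE summary.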

New Run: 03/02/2024
Note: Random state is now 12 across model variants for IS2. Good mean balance between training and testing at that state.
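
The notebook does not show how the train/test mean balance was assessed. One way to check it is to compare target means across candidate random states, sketched here on synthetic data; `split_mean_gap`, the seed range, and the `*_toy` arrays are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target column (not the notebook's data).
rng = np.random.default_rng(3)
y_toy = rng.gamma(shape=3.0, scale=6.0, size=979)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

def split_mean_gap(seed):
    """Absolute difference between train and test target means for one seed."""
    _, _, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=seed)
    return abs(y_tr.mean() - y_te.mean())

# Compare a handful of candidate seeds and pick the most balanced split.
gaps = {seed: split_mean_gap(seed) for seed in range(20)}
best_seed = min(gaps, key=gaps.get)
print(best_seed, gaps[best_seed])
```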

l10 non-SOS adjusted version
Training: (9.467881123742533, 1.1936856196221093), Testing: 10.253582416968488

l10 SOS adjusted version
Training: (9.323487546635262, 1.2065005232644828), Testing: 10.135272303754874

Results for NO DECAY WEIGHT variants: RMSE in minutes (gradual decay weighting is used for GMS, which impacts the SOS calculation).
In this series, the 2018-2019 solves HAVE been removed upfront.

l10 non-SOS adjusted version
Training: (9.198646446096195, 0.6999090426716712), Testing: 10.815823998794091

l10 SOS adjusted version
Training: (9.15468258486095, 0.6532299354795762), Testing: 10.398201272345949