### Introduction to Preprocessing: Decay-Time Weighting Experiments Comparing SOS-Adjusted and Non-SOS-Adjusted Versions of Optimized Recent Past Performance

This preprocessing step compares linear prediction quality for the non-'Strength of Schedule' (SOS) adjusted and SOS-adjusted forms ('IS_pds_l10_ndw' and 'IS_pds_l10_ndw_SOS_adj', respectively) of the IS2 recent past performance feature identified in the previous preprocessing step (03a_IS2_Preprocessing_DTW_Experiments).

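CONCLUSION: The SOS-adjusted version yields a better prediction than the non-SOS version: test RMSE falls from about 10.82 to about 10.40 minutes, and cross-validated training RMSE is slightly lower as well. Values below are RMSE in minutes; training figures are the 5-fold cross-validation (mean, std).

l10 non-SOS-adjusted version
Training: (9.198646446096195, 0.6999090426716712), Testing: 10.815823998794091

l10 SOS-adjusted version
Training: (9.15468258486095, 0.6532299354795762), Testing: 10.398201272345949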
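
To put the gap between the two variants in perspective, the change in test RMSE can be computed directly from the figures reported above. This is a minimal sketch that uses only those reported numbers; the variable names are illustrative and do not appear in the notebook.

```python
# Reported test RMSE values (minutes), copied from the results above
rmse_test_non_sos = 10.815823998794091
rmse_test_sos = 10.398201272345949

# Absolute and relative reduction in test RMSE from the SOS adjustment
abs_reduction = rmse_test_non_sos - rmse_test_sos
rel_reduction = abs_reduction / rmse_test_non_sos

print(f"Test RMSE reduction: {abs_reduction:.3f} min ({rel_reduction:.1%})")
```

The reduction works out to roughly 0.42 minutes, or a little under 4% of the non-SOS test RMSE.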

In [2]:
import pandas as pd
import numpy as np
import os
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import __version__ as sklearn_version
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression
import datetime
#from library.sb_utils import save_file

### Load Data

In [3]:
df = pd.read_csv('../../data/df3.csv')
df.head()

Unnamed: 0,P_Date,P_Date_str,IS2_Completed,Comp_Date,Comp_Date_str,DOW,DOW_num,Grid Size,IS2_ST(m),IS_pds_l10_ndw,...,Circle_Count,Shade_Count,Unusual_Sym,Black_Square_Fill,Outside_Grid,Unchecked_Sq,Uniclue,Duplicate_Answers,Quantum,Wordplay
0,2024-02-21 00:00:00,2024-02-21,1.0,2024-02-21 10:33:24,2024-02-21,Wednesday,4.0,1,7.55,11.815,...,0,0,0,0,0,0,0,0,0,2.0
1,2022-06-08 00:00:00,2022-06-08,1.0,2024-02-20 19:27:34,2024-02-20,Wednesday,4.0,1,19.516667,10.83,...,0,0,0,0,0,0,0,0,0,10.0
2,2022-06-15 00:00:00,2022-06-15,1.0,2024-02-18 23:00:33,2024-02-18,Wednesday,4.0,1,7.5,10.893333,...,1,0,0,0,0,0,0,0,0,3.0
3,2022-06-22 00:00:00,2022-06-22,1.0,2024-02-18 21:52:33,2024-02-18,Wednesday,4.0,1,6.6,11.273333,...,0,0,0,0,0,0,0,0,0,0.0
4,2022-06-29 00:00:00,2022-06-29,1.0,2024-02-18 19:43:50,2024-02-18,Wednesday,4.0,1,9.866667,11.04,...,0,0,0,0,0,0,0,0,0,3.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1189 entries, 0 to 1188
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   P_Date                  1189 non-null   object 
 1   P_Date_str              1189 non-null   object 
 2   IS2_Completed           1189 non-null   float64
 3   Comp_Date               1189 non-null   object 
 4   Comp_Date_str           1189 non-null   object 
 5   DOW                     1189 non-null   object 
 6   DOW_num                 1189 non-null   float64
 7   Grid Size               1189 non-null   int64  
 8   IS2_ST(m)               1189 non-null   float64
 9   IS_pds_l10_ndw          1182 non-null   float64
 10  IS_pds_l10_stdev        1175 non-null   float64
 11  IS_pds_l10_ndw_SOS_adj  1182 non-null   float64
 12  GMST(m)                 1189 non-null   float64
 13  Constructors            1189 non-null   object 
 14  Words                   1189 non-null   

### Create Feature Variants for Testing

### Filter Data

In [6]:
# strip down df to just the columns we need to evaluate SOS and non-SOS adj versions of IS2 RPB
df1 = df[['DOW', 'Comp_Date', 'Comp_Date_str', 'IS2_ST(m)', 'IS_pds_l10_ndw', 'IS_pds_l10_ndw_SOS_adj']]

In [7]:
# Filter out Sunday puzzles
df1 = df1[df1["DOW"] != "Sunday"]

In [8]:
# Remove the first solve period (2018-2019) to calculate sample averages by day; only solves completed in 2020-2024 are kept
df1 = df1[df1['Comp_Date_str'].str.contains("2020|2021|2022|2023|2024")]
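
The substring match above relies on `Comp_Date_str` being in `YYYY-MM-DD` form, so that a year can only match at the start of the string. An alternative that parses the year explicitly might look like the sketch below. The toy frame and the names `toy`, `comp_year`, and `kept` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df1['Comp_Date_str']
toy = pd.DataFrame(
    {"Comp_Date_str": ["2018-05-01", "2019-12-31", "2020-01-01", "2024-02-21"]}
)

# Parse the completion date and keep only solves completed in 2020-2024
comp_year = pd.to_datetime(toy["Comp_Date_str"]).dt.year
kept = toy[comp_year.between(2020, 2024)]  # between() is inclusive on both ends

print(kept)
```

Filtering on a parsed year keeps the intent (a closed range of completion years) visible in the code and does not depend on the string layout.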

Create df variants with only the columns needed to generate the benchmark models.

In [38]:
df_filter=df1.copy()

In [41]:
# Select the variant to evaluate; swap the commented line to run the non-SOS-adjusted version
#df_model1 = df_filter[["IS2_ST(m)", "IS_pds_l10_ndw"]]
df_model1 = df_filter[["IS2_ST(m)", "IS_pds_l10_ndw_SOS_adj"]]

In [42]:
df_model1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 943 entries, 0 to 1178
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IS2_ST(m)               943 non-null    float64
 1   IS_pds_l10_ndw_SOS_adj  943 non-null    float64
dtypes: float64(2)
memory usage: 22.1 KB


### Train Test Split

In [43]:
len(df_model1) * .80, len(df_model1) * .20

(754.4000000000001, 188.60000000000002)

In [44]:
X_train, X_test, y_train, y_test = train_test_split(df_model1.drop(columns='IS2_ST(m)'), 
                                                    df_model1["IS2_ST(m)"], test_size=0.20, 
                                                    random_state=47)

In [45]:
y_train.shape, y_test.shape

((754,), (189,))
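
The split sizes above (754 and 189) differ from the raw 80/20 products (754.4 and 188.6) because the counts must be whole numbers. The sketch below reproduces the arithmetic under the assumption that, for a fractional `test_size` with no `train_size` given, scikit-learn rounds the test count up and assigns the remainder to training; the variable names are illustrative.

```python
import math

n_samples = 943   # rows in df_model1
test_size = 0.20  # fraction passed to train_test_split

# Assumed rounding rule: test count is rounded up, training takes the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the counts come out to 754 training rows and 189 test rows, matching the shapes shown above.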

In [46]:
y_train

730     17.850000
271      8.266667
754     43.233333
975      9.950000
309      9.616667
          ...    
795     27.433333
282     12.783333
363      8.516667
1138    15.683333
135     17.866667
Name: IS2_ST(m), Length: 754, dtype: float64

In [47]:
X_train.shape, X_test.shape

((754, 1), (189, 1))

In [48]:
y_train.mean()

18.53567639257294

### Benchmark Linear Model Based on Last 10 Day-Specific Puzzles With No Decay Weighting

In [49]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 754 entries, 730 to 135
Data columns (total 1 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IS_pds_l10_ndw_SOS_adj  754 non-null    float64
dtypes: float64(1)
memory usage: 11.8 KB


In [50]:
lr_pipe = make_pipeline(
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    SelectKBest(f_regression),
    LinearRegression()
)

In [51]:
#Dict of available parameters for linear regression pipe
lr_pipe.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'simpleimputer', 'standardscaler', 'selectkbest', 'linearregression', 'simpleimputer__add_indicator', 'simpleimputer__copy', 'simpleimputer__fill_value', 'simpleimputer__missing_values', 'simpleimputer__strategy', 'simpleimputer__verbose', 'standardscaler__copy', 'standardscaler__with_mean', 'standardscaler__with_std', 'selectkbest__k', 'selectkbest__score_func', 'linearregression__copy_X', 'linearregression__fit_intercept', 'linearregression__n_jobs', 'linearregression__normalize', 'linearregression__positive'])

In [52]:
#Define search grid parameters
k = [k+1 for k in range(len(X_train.columns))]

grid_params = {
    'standardscaler': [StandardScaler(), None],
    'simpleimputer__strategy': ['mean', 'median'],
    'selectkbest__k': k
}

In [53]:
#Call `GridSearchCV` with linear regression pipeline, passing in the above `grid_params`
#dict for parameters to evaluate with 5-fold cross-validation
lr_grid_cv = GridSearchCV(lr_pipe, param_grid=grid_params, cv=5)

In [54]:
#Conduct grid search for this model
lr_grid_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('simpleimputer',
                                        SimpleImputer(strategy='median')),
                                       ('standardscaler', StandardScaler()),
                                       ('selectkbest',
                                        SelectKBest(score_func=<function f_regression at 0x00000167F40223A0>)),
                                       ('linearregression',
                                        LinearRegression())]),
             param_grid={'selectkbest__k': [1],
                         'simpleimputer__strategy': ['mean', 'median'],
                         'standardscaler': [StandardScaler(), None]})

In [55]:
#Best params from grid search for this model
lr_grid_cv.best_params_

{'selectkbest__k': 1,
 'simpleimputer__strategy': 'mean',
 'standardscaler': StandardScaler()}
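
Because `X_train` has a single column, `selectkbest__k` can only take the value 1, so the search covers just four candidate settings. The sketch below enumerates them with the standard library as a stand-in for what the grid search iterates over; the scaler options are represented as plain strings here, and all names are illustrative.

```python
from itertools import product

n_features = 1  # X_train has one column
k_values = [k + 1 for k in range(n_features)]

# String placeholders for the two scaler options in grid_params
scaler_options = ["StandardScaler()", "None"]
imputer_strategies = ["mean", "median"]

# Cartesian product of all parameter values = candidate settings
candidates = list(product(scaler_options, imputer_strategies, k_values))
n_folds = 5
n_cv_fits = len(candidates) * n_folds  # fits during cross-validation only

for scaler, strategy, k in candidates:
    print(f"scaler={scaler}, imputer={strategy}, k={k}")
print(f"{len(candidates)} candidates x {n_folds} folds = {n_cv_fits} CV fits")
```

Since the selected feature has no missing values in the 943 filtered rows, the imputation strategy cannot change the fitted model, so the `'mean'` entry in the best parameters should not be read as a meaningful preference over `'median'`.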

### Linear Model Metrics From RPB Variant

#### R-squared (COD)

In [56]:
#Cross-validation defaults to R^2 metric for scoring regression
lr_best_cv_results = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, cv=5)
lr_best_scores = lr_best_cv_results['test_score']
lr_best_scores

array([0.59044631, 0.51156237, 0.51262659, 0.62085945, 0.57322609])

In [57]:
#Training set CV mean and std
np.mean(lr_best_scores), np.std(lr_best_scores)

(0.561744162964956, 0.04331515539548766)
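
The mean and spread above can be checked by hand from the five fold scores. The sketch below uses only the standard library and the printed (rounded) scores; note that `np.std` defaults to the population standard deviation (`ddof=0`), which corresponds to `statistics.pstdev` rather than `statistics.stdev`.

```python
import statistics

# Per-fold R^2 scores as printed above (rounded to 8 decimals)
fold_r2 = [0.59044631, 0.51156237, 0.51262659, 0.62085945, 0.57322609]

r2_mean = statistics.fmean(fold_r2)
r2_pstd = statistics.pstdev(fold_r2)  # population std, same convention as np.std default

print(r2_mean, r2_pstd)
```

With only five folds, the spread is a rough indication of fold-to-fold variability rather than a precise uncertainty estimate.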

#### Mean Absolute Error (MAE)

In [58]:
lr_neg_mae = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_absolute_error', cv=5, n_jobs=-1)

In [59]:
# Training set MAE and STD 
lr_mae_mean = np.mean(-1 * lr_neg_mae['test_score'])
lr_mae_std = np.std(-1 * lr_neg_mae['test_score'])
MAE_LR_train = lr_mae_mean, lr_mae_std
MAE_LR_train

(5.802114008314906, 0.6006407510621464)

In [60]:
# Test set MAE
MAE_LR_test = mean_absolute_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
MAE_LR_test

6.286140705949224

#### Mean Squared Error (MSE)

In [61]:
lr_neg_mse = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_mean_squared_error', cv=5)

In [62]:
#Training set CV mean and std
lr_mse_mean = np.mean(-1 * lr_neg_mse['test_score'])
lr_mse_std = np.std(-1 * lr_neg_mse['test_score'])
MSE_LR_train = lr_mse_mean, lr_mse_std
MSE_LR_train

(84.234922578163, 11.713973607049617)

In [63]:
# Test set MSE
MSE_LR_test = mean_squared_error(y_test, lr_grid_cv.best_estimator_.predict(X_test))
MSE_LR_test

108.12258970021692

#### Root Mean Square Error (RMSE)

In [64]:
lr_neg_rmse = cross_validate(lr_grid_cv.best_estimator_, X_train, y_train, 
                            scoring='neg_root_mean_squared_error', cv=5)

In [65]:
#Training set CV mean and std
lr_rmse_mean = np.mean(-1 * lr_neg_rmse['test_score'])
lr_rmse_std = np.std(-1 * lr_neg_rmse['test_score'])
RMSE_LR_train = lr_rmse_mean, lr_rmse_std
RMSE_LR_train

(9.15468258486095, 0.6532299354795762)

In [66]:
# Test set RMSE
RMSE_LR_test = np.sqrt(mean_squared_error(y_test, lr_grid_cv.best_estimator_.predict(X_test)))
RMSE_LR_test

10.398201272345949
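
The test RMSE above is exactly the square root of the test MSE, but the cross-validated training RMSE is not the square root of the cross-validated training MSE, because averaging per-fold RMSE values is not the same as taking the root of the averaged MSE. The sketch below checks both relationships using only the numbers reported in this notebook; the variable names are illustrative.

```python
import math

# Values copied from the cell outputs above
mse_test = 108.12258970021692
rmse_test = 10.398201272345949
mse_cv_mean = 84.234922578163
rmse_cv_mean = 9.15468258486095

# Single test set: RMSE is the square root of MSE
test_matches = math.isclose(math.sqrt(mse_test), rmse_test, rel_tol=1e-9)

# Cross-validation: root of the mean MSE is at least the mean of per-fold RMSEs
root_of_mean_mse = math.sqrt(mse_cv_mean)
gap = root_of_mean_mse - rmse_cv_mean

print(test_matches, root_of_mean_mse, gap)
```

The small positive gap is expected: the square root is concave, so the mean of per-fold roots cannot exceed the root of the mean.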

Results (RMSE, in minutes) for the NO DECAY WEIGHT variants. Gradual decay weighting was used for GMS, which affects the SOS calculation.
For this series, the 2018-2019 solves HAVE been removed upfront.

l10 non-SOS adjusted version
Training: (9.198646446096195, 0.6999090426716712), Testing: 10.815823998794091

l10 SOS-adjusted version
Training: (9.15468258486095, 0.6532299354795762), Testing: 10.398201272345949