## Predicting Freezing of Gait episodes for Parkinson's disease

This notebook was created to train a machine learning model for the Kaggle competition "Parkinson's Freezing of Gait Prediction"
(event detection from wearable sensor data), launched by The Michael J. Fox Foundation in March 2023.

A detailed description of the competition, as well as the data itself, can be found at
https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction

Credit to 此般浅薄 for the initial inspiration

Thanks to NICHOLAS GRAY for sharing the notebook "Gait Prediction", which was the basis for this solution.

The goal was to predict the probability of 3 FOG (freezing of gait) types, Turn, Walking and StartHesitation, based on three acceleration signals (AccV, AccML and AccAP) from a motion sensor worn on the lower back by patients with Parkinson's disease. Besides the motion sensor recordings, there were also datasets with patient metadata such as age, sex, medication status and many others.

The initial notebook scored 0.289 on submission evaluation.
MultiOutputRegressor was chosen as the model and was trained on time series statistical features from the seglearn library. 2 models were trained, and the maximum of their predictions was chosen as the final result.
In this notebook several improvements were applied, scoring 0.328 (public score) on submission evaluation (0.26385 private score):
- multiply tdcsfog values by g (9.81)
- use a second model with a smaller window of 600 timesteps (the initial model has window = 5K); for prediction, the max value of the 2 models is chosen
- use the notype data, assigning Turn=1 wherever an event occurred
- clear the first and last 1300 timesteps (in the charts there appears to be some noise at the beginning and end of a recording, maybe related to the sensor starting up)
- remove the first 85000 values from notype data (these sessions are relatively long and usually events don't occur at the beginning)
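
Two of the improvements above, taking the maximum of the two models and clearing the edges of a recording, can be illustrated on toy arrays. This is only a minimal sketch under stated assumptions: `combine_and_trim` and its `edge` parameter are hypothetical names, and the prediction values are made up rather than taken from the notebook.

```python
import numpy as np

def combine_and_trim(pred_long, pred_short, edge):
    """Element-wise max of two prediction arrays, with both ends zeroed.

    pred_long / pred_short: arrays of shape (n_timesteps, n_targets).
    edge: number of timesteps to clear at the start and at the end.
    """
    combined = np.maximum(pred_long, pred_short)  # new array, inputs stay untouched
    if edge > 0:  # guard: combined[-0:] would select the whole array
        combined[:edge] = 0.0
        combined[-edge:] = 0.0
    return combined

# toy example: 6 timesteps, 1 target, edge of 2 instead of 1300
long_window = np.array([[0.9], [0.1], [0.2], [0.7], [0.3], [0.8]])
short_window = np.array([[0.5], [0.4], [0.6], [0.1], [0.9], [0.2]])
result = combine_and_trim(long_window, short_window, edge=2)
print(result.ravel())
```

Taking the maximum favours recall: an event flagged by either window size is kept, at the cost of also keeping false alarms from both models.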

In [1]:
# Install tsflex and seglearn
!pip install tsflex 
!pip install seglearn

Looking in links: file:///kaggle/input/time-series-tools
Processing /kaggle/input/time-series-tools/tsflex-0.3.0-py3-none-any.whl
Installing collected packages: tsflex
Successfully installed tsflex-0.3.0
Looking in links: file:///kaggle/input/time-series-tools
Processing /kaggle/input/time-series-tools/seglearn-1.2.5-py3-none-any.whl
Installing collected packages: seglearn
Successfully installed seglearn-1.2.5

In [None]:
import os

!pip install kaggle
!kaggle competitions download -c tlvmc-parkinsons-freezing-gait-prediction

path = 'input/tlvmc-parkinsons-freezing-gait-prediction'
os.makedirs(path, exist_ok=True)
!unzip tlvmc-parkinsons-freezing-gait-prediction.zip -d input/tlvmc-parkinsons-freezing-gait-prediction

In [2]:
# Import libraries:
# NumPy and pandas for data aggregation, filtering and other preprocessing operations
# tsflex for extracting window-based statistical features for time series
# matplotlib and seaborn to visualize data
# from scikit-learn: GroupKFold for cross-validation and MultiOutputRegressor as a regression model predicting the probability of each of the 3 event types
# tqdm to show loop progress when preparing data for training and submission
import numpy as np
import pandas as pd
from sklearn import cluster, metrics  # only these two submodules are used via the sklearn namespace
import glob
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from pathlib import Path
from seglearn.feature_functions import base_features, emg_features
from tsflex.features import FeatureCollection, MultipleFeatureDescriptors
from tsflex.features.integrations import seglearn_feature_dict_wrapper
from sklearn.model_selection import GroupKFold
import lightgbm as lgb
from sklearn.multioutput import MultiOutputRegressor
from sklearn.base import clone
from sklearn.metrics import average_precision_score

## Grab important files

In [3]:
# collect the paths of all training and test files and load the metadata tables into data frames

root = 'input/tlvmc-parkinsons-freezing-gait-prediction/'

train = glob.glob(path.join(root, 'train/**/**'))
test = glob.glob(path.join(root, 'test/**/**'))

subjects = pd.read_csv(path.join(root, 'subjects.csv'))
tasksBase = pd.read_csv(path.join(root, 'tasks.csv'))
events = pd.read_csv(path.join(root, 'events.csv'))

tdcsfog_metadata = pd.read_csv(path.join(root, 'tdcsfog_metadata.csv'))
defog_metadata = pd.read_csv(path.join(root, 'defog_metadata.csv')) 

tdcsfog_metadata['Module'] = 'tdcsfog'
defog_metadata['Module'] = 'defog'

full_metadata = pd.concat([tdcsfog_metadata, defog_metadata])

## Perform data analysis and display some charts

In [4]:
#check beginning and end of recording
cols = ['Time', 'AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn' , 'Walking']
df = pd.read_csv(train[15], index_col='Time', usecols=cols)

def highlight(indices, ax):
    # shade a one-unit-wide band around every given index value
    for idx in indices:
        ax.axvspan(idx-0.5, idx+0.5, facecolor='pink', edgecolor='none', alpha=.2)



ax = df[['AccV', 'AccML', 'AccAP']].iloc[0:2000].plot()
highlight(df.iloc[0:1000].index, ax)

ax2 = df[['AccV', 'AccML', 'AccAP']].iloc[-2000:-1].plot()
highlight(df.iloc[-1100:-1].index, ax2)


In [5]:
# plot histograms and correlations for subjects

subjects[['Age', 'YearsSinceDx', 'Sex', 'Visit', 'UPDRSIII_On', 'UPDRSIII_Off', 'NFOGQ']].hist()

subjects.Age.plot.box()

plt.figure(figsize=(16, 6))
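# numeric_only=True skips the string columns (Subject, Sex), which newer pandas versions no longer drop silently
heatmap = sns.heatmap(subjects.corr(numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='BrBG')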
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':18}, pad=12);


In [6]:
seed = 100
cluster_size = 8

In [7]:
# encode categorical values as numbers for Sex
subjects['Sex'] = subjects['Sex'].factorize()[0]
subjects = subjects.fillna(0).groupby('Subject').median()
subjects['s_group'] = cluster.KMeans(n_clusters = cluster_size, random_state = seed).fit_predict(subjects[subjects.columns[1:]])
#simplify column names
new_names = {'Visit':'s_visit','Age':'s_age','YearsSinceDx':'s_years','UPDRSIII_On':'s_on','UPDRSIII_Off':'s_off','NFOGQ':'s_NFOGQ', 'Sex': 's_sex'}
subjects = subjects.rename(columns = new_names)
subjects

Subject,s_visit,s_age,s_sex,s_years,s_on,s_off,s_NFOGQ,s_group
00f674,1.5,63.0,0.0,27.0,37.0,39.5,25.0,5
02bc69,0.0,69.0,0.0,4.0,21.0,0.0,22.0,7
040587,1.5,75.0,0.0,26.0,49.5,72.0,22.5,6
056372,2.0,69.0,0.0,13.0,44.0,50.0,22.0,4
07285e,0.0,58.0,0.0,1.0,18.0,26.0,10.0,0
...,...,...,...,...,...,...,...,...
f686f0,0.0,61.0,0.0,7.0,44.0,0.0,24.0,7
f80507,1.0,57.0,0.0,2.0,12.0,0.0,0.0,2
fa8764,0.0,60.0,1.0,7.0,30.0,0.0,19.0,7
fba3a3,1.0,65.0,1.0,8.0,28.0,0.0,0.0,2
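
The fractional values in the table (for example `s_visit` = 1.5) come from taking the median over several rows of the same subject. A minimal sketch of the encode-then-aggregate step on made-up data; the `visits` frame and its values are hypothetical:

```python
import pandas as pd

# hypothetical data: subject 'a' has two visits, subject 'b' has one
visits = pd.DataFrame({
    'Subject': ['a', 'a', 'b'],
    'Sex': ['F', 'F', 'M'],
    'Age': [60, 62, 70],
})

# factorize assigns integer codes in order of first appearance: 'F' -> 0, 'M' -> 1
visits['Sex'] = visits['Sex'].factorize()[0]

# one row per subject; numeric columns are reduced with the median
per_subject = visits.groupby('Subject').median()
print(per_subject)
```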


In [8]:
# add a distinct column for each task type (predefined tasks performed by patients during sessions)
tasksBase['Duration'] = tasksBase['End'] - tasksBase['Begin']
tasks = pd.pivot_table(tasksBase, values=['Duration'], index=['Id'], columns=['Task'], aggfunc='sum', fill_value=0)
tasks.columns = [c[1] for c in tasks.columns]
tasks = tasks.reset_index()
tasks['t_group'] = cluster.KMeans(n_clusters = cluster_size, random_state = seed).fit_predict(tasks[tasks.columns[1:]])
tasks

,Id,4MW,4MW-C,Hotspot1,Hotspot1-C,Hotspot2,Hotspot2-C,MB1,MB10,MB11,...,MB9,Rest1,Rest2,TUG-C,TUG-DT,TUG-ST,Turning-C,Turning-DT,Turning-ST,t_group
0,02ab235146,16.520,16.680,16.760,16.240,53.920,64.600,13.960,17.960,17.400,...,30.800,180.48,60.32,38.440,47.920,36.240,21.920,46.400,23.320,5
1,02ea782681,11.618,11.796,11.525,11.692,8.329,9.032,3.469,6.624,6.230,...,30.650,0.00,0.00,18.343,19.932,20.130,18.042,21.588,18.698,1
2,06414383cf,24.860,41.584,25.885,0.000,38.642,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.000,44.832,33.867,0.000,83.837,124.299,0
3,092b4c1819,13.664,0.000,15.409,0.000,34.834,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.000,30.166,23.347,0.000,45.025,39.351,1
4,0a900ed8a2,11.720,11.840,10.600,10.720,43.171,41.160,1.760,21.040,10.560,...,30.520,180.88,60.32,18.083,25.000,19.920,18.699,18.200,16.880,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
132,f3a921edee,16.722,0.000,16.383,0.000,76.200,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.000,29.043,29.820,0.000,136.840,76.960,0
133,f40e8c6ebe,12.867,0.000,27.906,0.000,152.333,0.000,0.000,0.000,0.000,...,0.000,0.00,0.00,0.000,36.805,28.232,0.000,98.579,69.889,0
134,f8ddbdd98d,23.787,0.000,32.556,0.000,38.557,0.000,6.347,10.356,10.655,...,33.289,0.00,0.00,0.000,44.132,54.993,0.000,100.433,72.152,0
135,f9efef91fb,16.351,17.552,16.155,15.166,67.059,51.715,0.000,0.000,0.000,...,0.000,0.00,0.00,23.483,23.541,22.137,36.898,38.632,30.831,1
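
The wide table above is the result of pivoting one row per task into one column per task type. A small sketch of the same `pivot_table` call on made-up rows (the session ids and times in `tasks_long` are hypothetical):

```python
import pandas as pd

# hypothetical long-format task log: one row per performed task
tasks_long = pd.DataFrame({
    'Id': ['s1', 's1', 's1', 's2'],
    'Task': ['TUG-ST', 'Rest1', 'TUG-ST', 'Rest1'],
    'Begin': [0.0, 10.0, 40.0, 5.0],
    'End': [8.0, 40.0, 47.5, 25.0],
})
tasks_long['Duration'] = tasks_long['End'] - tasks_long['Begin']

# one row per session, one column per task; repeated tasks are summed,
# tasks a session never performed get a duration of 0
wide = pd.pivot_table(tasks_long, values='Duration', index='Id',
                      columns='Task', aggfunc='sum', fill_value=0)
print(wide)
```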


In [9]:
# merge the subjects with the metadata
metadata_w_subjects = full_metadata.merge(subjects, how='left', on='Subject').copy()
features = metadata_w_subjects.columns

In [10]:
metadata_w_subjects['Medication'] = metadata_w_subjects['Medication'].factorize()[0]

## Train first of 2 models with window = 5K timesteps

In [11]:
basic_feats = MultipleFeatureDescriptors(
    functions=seglearn_feature_dict_wrapper(base_features()),
    series_names=['AccV', 'AccML', 'AccAP'],
    windows=[5000],
    strides=[5000],
)

emg_feats = emg_features()
del emg_feats['simple square integral'] # is same as abs_energy (which is in base_features)

emg_feats = MultipleFeatureDescriptors(
    functions=seglearn_feature_dict_wrapper(emg_feats),
    series_names=['AccV', 'AccML', 'AccAP'],
    windows=[5000],
    strides=[5000],
)

fc = FeatureCollection([basic_feats, emg_feats])
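
For intuition, statistics over non-overlapping windows (window equal to stride, as configured above) can be approximated in plain pandas. This is a sketch only: `window_stats` is a hypothetical helper, it computes just three statistics, and it does not reproduce the tsflex output format or the full seglearn feature set.

```python
import numpy as np
import pandas as pd

def window_stats(series, window):
    """Mean, std and sum of absolute values per non-overlapping window."""
    block = np.arange(len(series)) // window   # window number of every sample
    stats = pd.DataFrame({
        'mean': series.groupby(block).mean(),
        'std': series.groupby(block).std(ddof=0),
        'abs_sum': series.abs().groupby(block).sum(),
    })
    stats.index = stats.index * window          # label each row by the window's first sample
    return stats

signal = pd.Series([1.0, -1.0, 2.0, -2.0, 3.0, 3.0, 4.0])
feats = window_stats(signal, window=3)

# spread the per-window values over every timestep: rows without a value
# take the value of the window they belong to (forward fill)
per_step = feats.reindex(range(len(signal))).ffill()
print(per_step)
```

Because each feature is constant within a window, a larger window gives smoother but coarser features, which is one plausible reason to train a second model with a smaller window.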

In [12]:
pd.set_option('display.max_columns', None)

def reader(file):
    try:
        path_split = file.split('/')
        session = path_split[-1].split('.')[0]
        dataset = Path(file).parts[-2]
        cols = ['Time', 'AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn' , 'Walking']
        if dataset == 'notype':
             cols = ['Time', 'AccV', 'AccML', 'AccAP', 'Event', 'Valid']
        if dataset == 'defog':        
            cols = ['Time', 'AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn' , 'Walking', 'Valid']
        df = pd.read_csv(file, index_col='Time', usecols=cols)
        df['Id'] = session
        df['Module'] = dataset
        
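        if dataset == 'defog':
            # fraction of rows flagged as valid; it lies in [0, 1], so the
            # check below is always true, and df.truncate() without arguments
            # returns an unchanged copy that is discarded: no rows are dropped here
            c = df['Valid'].value_counts()
            validRatio = c[True]/(c[True] + c[False])
            if validRatio < 25:
                df.truncate()
            del df['Valid']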

        if dataset == 'notype':
            df['Walking'] = 0
            df['Turn'] =  df['Event']
            df['StartHesitation'] = 0
            del df['Valid']
            del df['Event']

        # acceleration is recorded in different units across the datasets, so tdcsfog values are multiplied by g (9.80665)
        if dataset == 'tdcsfog':
            df.AccV = df.AccV * 9.80665
            df.AccML = df.AccML * 9.80665
            df.AccAP = df.AccAP * 9.80665

        df['Time_frac']=(df.index/df.index.max()).values

        df = pd.merge(df, tasks[['Id','t_group']], how='left', on='Id').fillna(-1)
        
        if dataset == 'tdcsfog':
            factor = 128
        else:
            factor = 100

        rows = tasksBase[tasksBase['Id'] == session]

        for index,row in rows.iterrows():
            t = df.iloc[0]['t_group']
            df.loc[:].t_group = -1
            start = int(row['Begin'] * factor)
            end = int(row['End'] * factor)
            df.loc[start:end].t_group = t
        
        df = pd.merge(df, metadata_w_subjects[['Id','Subject', 'Visit','Test','Medication','s_group']], how='left', on='Id').fillna(-1) # 's_off', 's_on', 's_age', 

        df_feats = fc.calculate(df, return_df=True, include_final_window=True, approve_sparsity=True, window_idx="begin").astype(np.float32)
        df = df.merge(df_feats, how="left", left_index=True, right_index=True)
        
        
        if dataset == 'notype':
            df.drop(index=df.iloc[0:85000].index, inplace=True)
        
        df.ffill(inplace=True)

        return df
    except Exception:
        # skip files that fail to load; pd.concat silently drops None entries
        return None
   
train = pd.concat([reader(f) for f in tqdm(train)]).fillna(0)
cols = [c for c in train.columns if c not in ['Id','Subject','Module', 'Time', 'StartHesitation', 'Turn' , 'Walking', 'Valid', 'Task','Event']]
cols = [x for x in cols if str(x) != 'nan']
pcols = ['StartHesitation', 'Turn' , 'Walking']
scols = ['Id', 'StartHesitation', 'Turn' , 'Walking']
train=train.reset_index(drop=True)

train.head()
train.columns

  0%|          | 0/970 [00:00<?, ?it/s]

0.2872802220913603


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


0.27139136653757406
0.23187170334061485
0.2706150527962056
0.2317585447547142
0.16730967368333974
0.36898148148148147
0.25449558873129424
0.50050758240381
0.23884228893373696
0.20869229496442557
0.35040312729049594
0.23930655142925095
0.2622312632124766
0.36991616194307025
0.1946524064171123
0.20972404013961607
0.2502844701196963
0.34512612227447625
0.23618820315137432
0.21181819665975543
0.272018514963418
0.22883340538982563
0.26500246493075785
0.3377400150926521
0.3617582525947292
0.45963451306962755
0.33167371436366117
0.3434618768889399
0.4101077438276099
0.38737749020354145
0.22634386490451608
0.20879908013044052
0.3102763643619675
0.2612965907083554
0.3146956232871024
0.26817142390461024
0.21796361369304013
0.19510730670936355
0.42988064791133845
0.26202902547485263
0.343317572967924
0.4573842901366622
0.39883519364112974
0.3579421894652702
0.33547879727596724
0.24105087705373443
0.2250540584429459
0.30185548117793787
0.11045658600678857
0.24064322518543685
0.2960927727145082
0.3

Index(['AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn', 'Walking', 'Id',
       'Module', 'Time_frac', 't_group', 'Subject', 'Visit', 'Test',
       'Medication', 's_group', 'AccAP__abs_energy__w=5000',
       'AccAP__abs_sum__w=5000', 'AccAP__emg_var__w=5000',
       'AccAP__kurt__w=5000', 'AccAP__maximum__w=5000', 'AccAP__mean__w=5000',
       'AccAP__mean_abs__w=5000', 'AccAP__mean_crossings__w=5000',
       'AccAP__median__w=5000', 'AccAP__minimum__w=5000', 'AccAP__mse__w=5000',
       'AccAP__root_mean_square__w=5000', 'AccAP__skew__w=5000',
       'AccAP__slope_sign_changes__w=5000', 'AccAP__std__w=5000',
       'AccAP__var__w=5000', 'AccAP__waveform_length__w=5000',
       'AccAP__willison_amplitude__w=5000', 'AccAP__zero_crossing__w=5000',
       'AccML__abs_energy__w=5000', 'AccML__abs_sum__w=5000',
       'AccML__emg_var__w=5000', 'AccML__kurt__w=5000',
       'AccML__maximum__w=5000', 'AccML__mean__w=5000',
       'AccML__mean_abs__w=5000', 'AccML__mean_crossings__w=5000
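
In `reader`, the task boundaries `Begin` and `End` are multiplied by a per-dataset factor (128 for tdcsfog, 100 otherwise) to obtain row positions. Assuming these factors are sampling rates in Hz and the boundaries are given in seconds, the conversion can be sketched as follows; `SAMPLE_RATE_HZ` and `seconds_to_rows` are hypothetical names introduced for illustration.

```python
# assumed sampling rates, taken from the factors used in the reader function
SAMPLE_RATE_HZ = {'tdcsfog': 128, 'defog': 100, 'notype': 100}

def seconds_to_rows(begin_s, end_s, dataset):
    """Convert a task interval in seconds to integer row positions."""
    rate = SAMPLE_RATE_HZ[dataset]
    return int(begin_s * rate), int(end_s * rate)

print(seconds_to_rows(1.5, 2.25, 'tdcsfog'))
print(seconds_to_rows(10.0, 12.5, 'defog'))
```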

In [13]:
best_params_ = {'colsample_bytree': 0.5282057895135501,
 'learning_rate': 0.22659963168004743,
 'max_depth': 8,
 'min_child_weight': 3.1233911067827616,
 'n_estimators': 291,
 'subsample': 0.9961057796456088,
 }

def custom_average_precision(y_true, y_pred):
    score = average_precision_score(y_true, y_pred)
    return 'average_precision', score, True

class LGBMMultiOutputRegressor(MultiOutputRegressor):
    def fit(self, X, y, eval_set=None, **fit_params):
        self.estimators_ = [clone(self.estimator) for _ in range(y.shape[1])]
        
        for i, estimator in enumerate(self.estimators_):
            if eval_set:
                fit_params['eval_set'] = [(eval_set[0], eval_set[1][:, i])]
            estimator.fit(X, y[:, i], **fit_params)
        
        return self
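
The early-stopping metric above is average precision. For intuition, here is a minimal NumPy sketch for a single binary target; `average_precision` is a hypothetical helper, not scikit-learn's implementation, and it assumes there are no tied scores and at least one positive label.

```python
import numpy as np

def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each positive sample."""
    order = np.argsort(-np.asarray(y_score, dtype=float), kind='stable')
    hits = np.asarray(y_true)[order]                      # labels sorted by descending score
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits == 1].mean())

y_true = [0, 1, 1, 0]
y_score = [0.1, 0.9, 0.4, 0.6]
ap = average_precision(y_true, y_score)
print(ap)
```

The metric depends only on the ranking of the scores, which is why a regressor whose outputs are clipped to [0, 1] can be evaluated with it.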

In [14]:
kfold = GroupKFold(5)
groups=kfold.split(train, groups=train.Subject)

regs = []
cvs = []

for _, (tr_idx, te_idx) in enumerate(tqdm(groups, total=5, desc="Folds")):
    
    tr_idx = pd.Series(tr_idx).sample(n=2000000,random_state=42).values

    multioutput_regressor = LGBMMultiOutputRegressor(lgb.LGBMRegressor(**best_params_))

    x_train = train.loc[tr_idx, cols].to_numpy()
    y_train = train.loc[tr_idx, pcols].to_numpy()
    
    x_test = train.loc[te_idx, cols].to_numpy()
    y_test = train.loc[te_idx, pcols].to_numpy()

    multioutput_regressor.fit(
        x_train, y_train,
        eval_set=(x_test, y_test),
        eval_metric=custom_average_precision,
        early_stopping_rounds=15,
        verbose = 0,
    )
    
    regs.append(multioutput_regressor)
    
    cv = metrics.average_precision_score(y_test, multioutput_regressor.predict(x_test).clip(0.0,1.0))
    
    cvs.append(cv)
    
print(cvs)
print(np.mean(cvs))

Folds:   0%|          | 0/5 [00:00<?, ?it/s]



[0.08803883207602375, 0.3169366581870938, 0.09907104895730152, 0.16838343107070394, 0.16803207193030667]
0.16809240844428594
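
The folds above are split by `Subject`, so that no patient contributes rows to both the training and the validation part of a fold. A simplified sketch of that idea in NumPy; `group_folds` is a hypothetical helper that assigns groups round-robin and, unlike scikit-learn's `GroupKFold`, makes no attempt to balance fold sizes.

```python
import numpy as np

def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group is split across both sides."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    for fold in range(n_splits):
        test_groups = unique[fold::n_splits]      # every n_splits-th group goes to this fold
        test_mask = np.isin(groups, test_groups)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

subject = ['a', 'a', 'b', 'c', 'c', 'c', 'd']
splits = list(group_folds(subject, n_splits=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Splitting by subject gives a more honest estimate for unseen patients than a random row split, since neighbouring timesteps of one recording are highly correlated.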


## Train second of 2 models with window = 600 timesteps

In [15]:
basic_featsS = MultipleFeatureDescriptors(
    functions=seglearn_feature_dict_wrapper(base_features()),
    series_names=['AccV', 'AccML', 'AccAP'],
    windows=[600],
    strides=[600],
)

emg_featsS = emg_features()
del emg_featsS['simple square integral'] # is same as abs_energy (which is in base_features)

emg_featsS = MultipleFeatureDescriptors(
    functions=seglearn_feature_dict_wrapper(emg_featsS),
    series_names=['AccV', 'AccML', 'AccAP'],
    windows=[600],
    strides=[600],
)

fcS = FeatureCollection([basic_featsS, emg_featsS])

In [16]:

def reader(file):
#     try:
    path_split = file.split('/')
    session = path_split[-1].split('.')[0]
    dataset = Path(file).parts[-2]
    cols = ['Time', 'AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn' , 'Walking']
    if dataset == 'notype':
         cols = ['Time', 'AccV', 'AccML', 'AccAP', 'Event', 'Valid']
    if dataset == 'defog':        
        cols = ['Time', 'AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn' , 'Walking', 'Valid']
    df = pd.read_csv(file, index_col='Time', usecols=cols)
    df['Id'] = session
    df['Module'] = dataset

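    if dataset == 'defog':
        # fraction of rows flagged as valid; it lies in [0, 1], so the
        # check below is always true, and df.truncate() without arguments
        # returns an unchanged copy that is discarded: no rows are dropped here
        c = df['Valid'].value_counts()
        validRatio = c[True]/(c[True] + c[False])
        if validRatio < 25:
            df.truncate()
        del df['Valid']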
    
    if dataset == 'notype':
        df['Walking'] = 0
        df['Turn'] =  df['Event']
        df['StartHesitation'] = 0
        del df['Valid']
        del df['Event']

    # acceleration is recorded in different units across the datasets, so tdcsfog values are multiplied by g (9.80665)
    if dataset == 'tdcsfog':
        df.AccV = df.AccV * 9.80665
        df.AccML = df.AccML * 9.80665
        df.AccAP = df.AccAP * 9.80665

    df['Time_frac']=(df.index/df.index.max()).values

    df = pd.merge(df, tasks[['Id','t_group']], how='left', on='Id').fillna(-1)
    
    if dataset == 'tdcsfog':
        factor = 128
    else:
        factor = 100

    rows = tasksBase[tasksBase['Id'] == session]

    for index,row in rows.iterrows():
        t = df.iloc[0]['t_group']
        df.loc[:].t_group = -1
        start = int(row['Begin'] * factor)
        end = int(row['End'] * factor)
        df.loc[start:end].t_group = t

    df = pd.merge(df, metadata_w_subjects[['Id','Subject', 'Visit','Test','Medication','s_group']], how='left', on='Id').fillna(-1) #'s_off', 's_on', 's_age',

    
    df_featsS = fcS.calculate(df, return_df=True, include_final_window=True, approve_sparsity=True, window_idx="begin").astype(np.float32)
    df = df.merge(df_featsS, how="left", left_index=True, right_index=True)
    
    if dataset == 'notype':
        df.drop(index=df.iloc[0:85000].index, inplace=True)
    

    df.ffill(inplace=True)

    return df
#     except: pass

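trainFiles = glob.glob(path.join(root, 'train/**/**'))
# train the second model on a subset of the files:
# drop every third file, then the last 70 (970 -> 577 files)
del trainFiles[2::3]
del trainFiles[-70:]
train = pd.concat([reader(f) for f in tqdm(trainFiles)]).fillna(0)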
print(train.shape)



  0%|          | 0/577 [00:00<?, ?it/s]

0.2872802220913603


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


0.27139136653757406
0.2706150527962056
0.2317585447547142
0.36898148148148147
0.25449558873129424
0.23884228893373696
0.20869229496442557
0.23930655142925095
0.2622312632124766
0.1946524064171123
0.20972404013961607
0.34512612227447625
0.23618820315137432
0.272018514963418
0.22883340538982563
0.3377400150926521
0.3617582525947292
0.33167371436366117
0.3434618768889399
0.38737749020354145
0.22634386490451608
0.3102763643619675
0.2612965907083554
0.26817142390461024
0.21796361369304013
0.42988064791133845
0.26202902547485263
0.4573842901366622
0.39883519364112974
0.33547879727596724
0.24105087705373443
0.30185548117793787
0.11045658600678857
0.2960927727145082
0.32669363452491873
0.3268132697843424
0.31395640355738524
0.3799793898030206
0.28241043906022906
0.30395214479267857
0.31445165812707576
0.3004814114926671
0.3095592229493779
0.2604545161359678
0.26637749280214007
0.3300177141765513
0.27236654090895074
0.2304919204351451
0.23696435398107066
0.28408831960085873
0.32115851806863044


In [17]:
print(train.columns)

Index(['AccV', 'AccML', 'AccAP', 'StartHesitation', 'Turn', 'Walking', 'Id',
       'Module', 'Time_frac', 't_group', 'Subject', 'Visit', 'Test',
       'Medication', 's_group', 'AccAP__abs_energy__w=600',
       'AccAP__abs_sum__w=600', 'AccAP__emg_var__w=600', 'AccAP__kurt__w=600',
       'AccAP__maximum__w=600', 'AccAP__mean__w=600', 'AccAP__mean_abs__w=600',
       'AccAP__mean_crossings__w=600', 'AccAP__median__w=600',
       'AccAP__minimum__w=600', 'AccAP__mse__w=600',
       'AccAP__root_mean_square__w=600', 'AccAP__skew__w=600',
       'AccAP__slope_sign_changes__w=600', 'AccAP__std__w=600',
       'AccAP__var__w=600', 'AccAP__waveform_length__w=600',
       'AccAP__willison_amplitude__w=600', 'AccAP__zero_crossing__w=600',
       'AccML__abs_energy__w=600', 'AccML__abs_sum__w=600',
       'AccML__emg_var__w=600', 'AccML__kurt__w=600', 'AccML__maximum__w=600',
       'AccML__mean__w=600', 'AccML__mean_abs__w=600',
       'AccML__mean_crossings__w=600', 'AccML__median__w=600',


In [18]:
colsS = [c for c in train.columns if c not in ['Id','Subject','Module', 'Time', 'StartHesitation', 'Turn' , 'Walking', 'Valid', 'Task','Event']]
colsS = [x for x in colsS if str(x) != 'nan']
pcolsS = ['StartHesitation', 'Turn' , 'Walking']
train=train.reset_index(drop=True)
print(colsS)

['AccV', 'AccML', 'AccAP', 'Time_frac', 't_group', 'Visit', 'Test', 'Medication', 's_group', 'AccAP__abs_energy__w=600', 'AccAP__abs_sum__w=600', 'AccAP__emg_var__w=600', 'AccAP__kurt__w=600', 'AccAP__maximum__w=600', 'AccAP__mean__w=600', 'AccAP__mean_abs__w=600', 'AccAP__mean_crossings__w=600', 'AccAP__median__w=600', 'AccAP__minimum__w=600', 'AccAP__mse__w=600', 'AccAP__root_mean_square__w=600', 'AccAP__skew__w=600', 'AccAP__slope_sign_changes__w=600', 'AccAP__std__w=600', 'AccAP__var__w=600', 'AccAP__waveform_length__w=600', 'AccAP__willison_amplitude__w=600', 'AccAP__zero_crossing__w=600', 'AccML__abs_energy__w=600', 'AccML__abs_sum__w=600', 'AccML__emg_var__w=600', 'AccML__kurt__w=600', 'AccML__maximum__w=600', 'AccML__mean__w=600', 'AccML__mean_abs__w=600', 'AccML__mean_crossings__w=600', 'AccML__median__w=600', 'AccML__minimum__w=600', 'AccML__mse__w=600', 'AccML__root_mean_square__w=600', 'AccML__skew__w=600', 'AccML__slope_sign_changes__w=600', 'AccML__std__w=600', 'AccML__va

In [19]:
kfold = GroupKFold(5)
groups=kfold.split(train, groups=train.Subject)

regsS = []
cvsS = []

for _, (tr_idx, te_idx) in enumerate(tqdm(groups, total=5, desc="Folds")):
    
    tr_idx = pd.Series(tr_idx).sample(n=2000000,random_state=42).values

    multioutput_regressor = LGBMMultiOutputRegressor(lgb.LGBMRegressor(**best_params_))

    x_train = train.loc[tr_idx, colsS].to_numpy()
    y_train = train.loc[tr_idx, pcolsS].to_numpy()
    
    x_test = train.loc[te_idx, colsS].to_numpy()
    y_test = train.loc[te_idx, pcolsS].to_numpy()

    multioutput_regressor.fit(
        x_train, y_train,
        eval_set=(x_test, y_test),
        eval_metric=custom_average_precision,
        early_stopping_rounds=15,
        verbose = 0,
    )
    
    regsS.append(multioutput_regressor)
    
    cv = metrics.average_precision_score(y_test, multioutput_regressor.predict(x_test).clip(0.0,1.0))
    
    cvsS.append(cv)
    
print(cvsS)
print(np.mean(cvsS))

Folds:   0%|          | 0/5 [00:00<?, ?it/s]



[0.22531802109605414, 0.28099788899498684, 0.1799663193070661, 0.19646221960833196, 0.20141040505674024]
0.21683097081263586


## Generate submission csv file

In [20]:
sub = pd.read_csv(path.join(root, 'sample_submission.csv'))
submission = []


for f in test:
    df = pd.read_csv(f)
    df.set_index('Time', drop=True, inplace=True)

    session = f.split('/')[-1].split('.')[0]
    df['Id'] = session

    dataset = Path(f).parts[-2]

    if dataset == 'tdcsfog':
        df.AccV = df.AccV * 9.80665
        df.AccML = df.AccML * 9.80665
        df.AccAP = df.AccAP * 9.80665


    df['Time_frac']=(df.index/df.index.max()).values
    df = pd.merge(df, tasks[['Id','t_group']], how='left', on='Id').fillna(-1)
    
    if dataset == 'tdcsfog':
        factor = 128
    else:
        factor = 100

    rows = tasksBase[tasksBase['Id'] == session]

    for index,row in rows.iterrows():
        t = df.iloc[0]['t_group']
        df.loc[:].t_group = -1
        start = int(row['Begin'] * factor)
        end = int(row['End'] * factor)
        df.loc[start:end].t_group = t
    

    df = pd.merge(df, metadata_w_subjects[['Id','Subject', 'Visit','Test','Medication','s_group']], how='left', on='Id').fillna(-1) # 's_off', 's_on', 's_age',

    dfS = df.copy()

    df_feats = fc.calculate(df, return_df=True, include_final_window=True, approve_sparsity=True, window_idx="begin")
    df = df.merge(df_feats, how="left", left_index=True, right_index=True)
    df.ffill(inplace=True)


    res_vals = []

    for i_fold in range(5):
        pred = regs[i_fold].predict(df[cols]).clip(0.0,1.0)
        res_vals.append(np.expand_dims(np.round(pred, 3), axis = 2))

    res_vals = np.mean(np.concatenate(res_vals, axis = 2), axis = 2)
    res = pd.DataFrame(res_vals, columns=pcols)    


    df_featsS = fcS.calculate(dfS, return_df=True, include_final_window=True, approve_sparsity=True, window_idx="begin")
    dfS = dfS.merge(df_featsS, how="left", left_index=True, right_index=True)
    dfS.ffill(inplace=True)
    
    
    res_valsS = []    

    for i_fold in range(5):
        predS = regsS[i_fold].predict(dfS[colsS]).clip(0.0,1.0)
        res_valsS.append(np.expand_dims(np.round(predS, 3), axis = 2))

    res_valsS = np.mean(np.concatenate(res_valsS, axis = 2), axis = 2)
    resS = pd.DataFrame(res_valsS, columns=pcolsS)

    #get predicted values from 2 used models and choose maximum one for final submission
    for col in pcols:
        res[col] = pd.concat([res[col], resS[col]], axis=1).max(axis=1)

    df = pd.concat([df,res], axis=1)
    df['Id'] = df['Id'].astype(str) + '_' + df.index.astype(str)

    #cut off some strange chaotic recordings at the beginning and end of session
    cutOnStart = 1300
    cutOnEnd = -1300
    # assign through a single .iloc call to avoid chained assignment (the source of SettingWithCopyWarning)
    edge_cols = df.columns.get_indexer(pcols)
    df.iloc[:cutOnStart, edge_cols] = 0
    df.iloc[cutOnEnd:, edge_cols] = 0


    submission.append(df[scols])
    
submission = pd.concat(submission)
submission = pd.merge(sub[['Id']], submission, how='left', on='Id').fillna(0.0)
submission[scols].to_csv('submission.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)
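
The last steps of the cell build row ids of the form `<session>_<timestep>` and align the predictions with the sample submission through a left merge. A minimal sketch on made-up data; the frames `sample` and `preds` and their values are hypothetical:

```python
import pandas as pd

# hypothetical sample submission: defines which rows are expected, and in which order
sample = pd.DataFrame({'Id': ['s1_0', 's1_1', 's1_2', 's2_0']})

# hypothetical predictions for session 's1' only, one row per timestep
preds = pd.DataFrame({'Id': ['s1'] * 3, 'Turn': [0.2, 0.5, 0.1]})
preds['Id'] = preds['Id'] + '_' + preds.index.astype(str)

# left merge keeps every expected row; rows without a prediction become 0.0
merged = sample.merge(preds, how='left', on='Id').fillna(0.0)
print(merged)
```

The left merge guarantees that the output has exactly the rows of the sample submission, even if a session is missing from the predictions.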


In [21]:
submission

,Id,StartHesitation,Turn,Walking
0,003f117e14_0,0.0,0.0,0.0
1,003f117e14_1,0.0,0.0,0.0
2,003f117e14_2,0.0,0.0,0.0
3,003f117e14_3,0.0,0.0,0.0
4,003f117e14_4,0.0,0.0,0.0
...,...,...,...,...
286365,02ab235146_281683,0.0,0.0,0.0
286366,02ab235146_281684,0.0,0.0,0.0
286367,02ab235146_281685,0.0,0.0,0.0
286368,02ab235146_281686,0.0,0.0,0.0
