This script first trains an initial random forest with default settings.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

# Input files: extracted features and the train/validation split definition
feature_file = "./features/features.hdf5"
data_split_file = "./data_split.yaml"
# Basic statistical features only
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile']
# Timestamped output path; outer double quotes keep the f-string valid before Python 3.12
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

# Default hyperparameters (100 trees), using all available cores
clf = RandomForestClassifier(n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:05<00:00, 49.93it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of 100 | elapsed:  1.0min remaining:  9.2min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.6min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:01<00:00, 166.93it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    1.4s remaining:   13.0s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    2.5s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.91    971776
     Artefakt       0.61      0.56      0.58    210350

     accuracy                           0.86   1182126
    macro avg       0.76      0.74      0.75   1182126
 weighted avg       0.85      0.86      0.86   1182126



The experiment worked. I am not yet sure how good the results really are; then again, only basic features were used. Only about 56% of the artifacts were detected (recall of 0.56 for the `Artefakt` class). Next, I will train another model that uses all features.
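To make the 56% figure concrete: it is the recall of the `Artefakt` class, i.e. the share of true artifact samples that the model flags. Below is a minimal sketch of how precision, recall, and F1 follow from the confusion counts. It uses made-up toy labels rather than the notebook's data, and the function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (1 = artifact, 0 = no artifact), purely illustrative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case two of the four artifacts are found, so recall is 0.5 even though most samples are classified correctly. The same effect explains how the report can show 0.86 accuracy next to 0.56 artifact recall.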

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile', 'pow_freq_bands', 'hurst_exp', 'decorr_time']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:09<00:00, 28.26it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of 100 | elapsed:  1.3min remaining: 12.1min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  2.1min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:02<00:00, 103.88it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    1.0s remaining:    9.4s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    1.8s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.92    971776
     Artefakt       0.62      0.58      0.60    210350

     accuracy                           0.86   1182126
    macro avg       0.76      0.75      0.76   1182126
 weighted avg       0.86      0.86      0.86   1182126



The results improved slightly once more: for `Artefakt`, recall rose from 0.56 to 0.58 and the F1 score from 0.58 to 0.60. Next, I will examine the effect of using more trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile', 'pow_freq_bands', 'hurst_exp', 'decorr_time']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:09<00:00, 27.86it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 258 tasks      | elapsed:  4.1min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  7.5min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:02<00:00, 107.28it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done   8 tasks      | elapsed:    1.1s
[Parallel(n_jobs=96)]: Done 258 tasks      | elapsed:    3.7s
[Parallel(n_jobs=96)]: Done 500 out of 500 | elapsed:    6.2s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.92    971776
     Artefakt       0.62      0.58      0.60    210350

     accuracy                           0.86   1182126
    macro avg       0.77      0.75      0.76   1182126
 weighted avg       0.86      0.86      0.86   1182126



The results barely improved, if at all: only the macro-averaged precision changed, from 0.76 to 0.77.
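One way to check where additional trees stop helping, without retraining from scratch for every forest size, is scikit-learn's `warm_start` option. The sketch below runs on synthetic stand-in data from `make_classification`, not on the notebook's features, and the tree counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80/20); not the notebook's features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# warm_start=True keeps the already fitted trees and only adds new ones on refit.
clf = RandomForestClassifier(n_estimators=25, warm_start=True, random_state=42)
scores = {}
for n_trees in (25, 50, 100, 200):
    clf.set_params(n_estimators=n_trees)
    clf.fit(X_train, y_train)
    scores[n_trees] = f1_score(y_val, clf.predict(X_val))

for n_trees, score in scores.items():
    print(f"{n_trees:>4} trees: validation F1 = {score:.3f}")
```

If the validation F1 flattens early, as it appears to between 100 and 500 trees above, the extra training time buys little.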

Next, the same results are generated for the data without the high-pass filter so that the two variants can be compared.

In [1]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features_no_hp.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:05<00:00, 48.20it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of 100 | elapsed:   57.7s remaining:  8.7min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  1.5min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:01<00:00, 170.44it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    1.3s remaining:   12.0s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    2.2s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.91    971776
     Artefakt       0.61      0.56      0.58    210350

     accuracy                           0.86   1182126
    macro avg       0.76      0.74      0.75   1182126
 weighted avg       0.85      0.86      0.86   1182126



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features_no_hp.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile', 'pow_freq_bands', 'hurst_exp', 'decorr_time']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:07<00:00, 36.64it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of 100 | elapsed:  1.3min remaining: 11.3min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:  2.0min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:02<00:00, 125.93it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    1.0s remaining:    9.2s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    1.6s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.92    971776
     Artefakt       0.62      0.58      0.60    210350

     accuracy                           0.86   1182126
    macro avg       0.76      0.75      0.76   1182126
 weighted avg       0.86      0.86      0.86   1182126



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features_no_hp.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile', 'pow_freq_bands', 'hurst_exp', 'decorr_time']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, verbose=1, random_state=random_state)

clf = train_random_forest(clf, feature_file, features, data_split_file)
y_true, y_pred = predict_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)
print(classification_report(y_true, y_pred, target_names=["Kein Artefakt", "Artefakt"]))

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:07<00:00, 36.57it/s]
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 258 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:  7.0min finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:02<00:00, 126.53it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done   8 tasks      | elapsed:    1.0s
[Parallel(n_jobs=96)]: Done 258 tasks      | elapsed:    3.8s
[Parallel(n_jobs=96)]: Done 500 out of 500 | elapsed:    6.4s finished


               precision    recall  f1-score   support

Kein Artefakt       0.91      0.92      0.92    971776
     Artefakt       0.62      0.58      0.60    210350

     accuracy                           0.86   1182126
    macro avg       0.76      0.75      0.76   1182126
 weighted avg       0.86      0.86      0.86   1182126
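The comparison announced above can be done by lining up the `Artefakt` rows of the six reports. The sketch below copies the precision, recall, and F1 values from the outputs in this notebook; the dictionary layout and the run names are only illustrative.

```python
# (precision, recall, f1) of the "Artefakt" class, copied from the reports above.
with_hp = {
    "basic features": (0.61, 0.56, 0.58),
    "all features": (0.62, 0.58, 0.60),
    "all features, 500 trees": (0.62, 0.58, 0.60),
}
without_hp = {
    "basic features": (0.61, 0.56, 0.58),
    "all features": (0.62, 0.58, 0.60),
    "all features, 500 trees": (0.62, 0.58, 0.60),
}

# Per-run difference (with high-pass minus without), metric by metric.
diffs = {
    run: tuple(round(a - b, 2) for a, b in zip(with_hp[run], without_hp[run]))
    for run in with_hp
}
for run, (d_prec, d_rec, d_f1) in diffs.items():
    print(f"{run}: precision {d_prec:+.2f}, recall {d_rec:+.2f}, f1 {d_f1:+.2f}")
```

At the two-decimal precision of the reports, the runs with and without the high-pass filter are indistinguishable for this class. Smaller differences could only be seen by comparing unrounded scores.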



Next, training returns to the high-pass-filtered data. This time, however, the class imbalance is addressed when training the random forest.
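For reference, scikit-learn documents `class_weight='balanced'` as weighting each class by `n_samples / (n_classes * np.bincount(y))`. Below is a small sketch of that arithmetic, using the training-set supports that appear in the training report of this notebook (3367153 and 811909). The helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights n_samples / (n_classes * count), one per class."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set supports taken from the classification report in this notebook.
counts = {"Kein Artefakt": 3367153, "Artefakt": 811909}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

With these counts, an artifact sample is weighted roughly four times as heavily as a non-artifact sample.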

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from utils.training import train_random_forest, predict_random_forest
import joblib
from datetime import datetime

feature_file = "./features/features.hdf5"
data_split_file = "./data_split.yaml"
features = ['mean', 'variance', 'std', 'ptp_amp', 'skewness', 'kurtosis', 'quantile', 'pow_freq_bands', 'hurst_exp', 'decorr_time']
model_save_path = f"./models/{datetime.now().strftime('%d-%m-%y %H-%M-%S')}.joblib"
random_state = 42

clf = RandomForestClassifier(n_jobs=-1, verbose=1, random_state=random_state, class_weight='balanced')

clf = train_random_forest(clf, feature_file, features, data_split_file)
joblib.dump(clf, model_save_path)

y_true_train, y_pred_train = predict_random_forest(clf, feature_file, features, data_split_file, 'train')
y_true_val, y_pred_val = predict_random_forest(clf, feature_file, features, data_split_file, 'val')
print(f'Classification report training set \n\n{classification_report(y_true_train, y_pred_train, target_names=["Kein Artefakt", "Artefakt"])}\n\n')
print(f'Classification report validation set \n\n{classification_report(y_true_val, y_pred_val, target_names=["Kein Artefakt", "Artefakt"])}')

Extracting features and labels for sessions:   0%|          | 0/268 [00:00<?, ?it/s]

Extracting features and labels for sessions: 100%|██████████| 268/268 [00:07<00:00, 36.84it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    3.6s remaining:   32.1s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    5.5s finished
Extracting features and labels for sessions: 100%|██████████| 268/268 [00:02<00:00, 121.52it/s]
[Parallel(n_jobs=96)]: Using backend ThreadingBackend with 96 concurrent workers.
[Parallel(n_jobs=96)]: Done  10 out of 100 | elapsed:    1.0s remaining:    8.7s
[Parallel(n_jobs=96)]: Done 100 out of 100 | elapsed:    1.9s finished


Classification report training set 

               precision    recall  f1-score   support

Kein Artefakt       1.00      1.00      1.00   3367153
     Artefakt       1.00      1.00      1.00    811909

     accuracy                           1.00   4179062
    macro avg       1.00      1.00      1.00   4179062
 weighted avg       1.00      1.00      1.00   4179062



Classification report validation set 

               precision    recall  f1-score   support

Kein Artefakt       0.91      0.93      0.92    971776
     Artefakt       0.63      0.55      0.59    210350

     accuracy                           0.86   1182126
    macro avg       0.77      0.74      0.75   1182126
 weighted avg       0.86      0.86      0.86   1182126



The results are nearly unchanged compared with the run without `class_weight`: on the validation set, precision for `Artefakt` rises from 0.62 to 0.63, while recall drops from 0.58 to 0.55 and the F1 score from 0.60 to 0.59. However, the F1 score of 1.00 on the training set points to overfitting. Going forward, a grid search should therefore be run, for example over `max_depth`, `max_features` (SB), `min_samples_leaf`, and `min_samples_split` (Gemini).
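A grid search over these four parameters could look roughly like the sketch below, using scikit-learn's `GridSearchCV`. It runs on synthetic stand-in data rather than the notebook's features, and the grid values, tree count, and fold count are hypothetical placeholders, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in data; not the notebook's features.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Hypothetical grid values; the ranges worth searching depend on the real data.
param_grid = {
    "max_depth": [5, None],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [1, 10],
    "min_samples_split": [2, 20],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # F1 of the positive (artifact) class
    cv=3,
    return_train_score=True,  # exposes the train/validation gap per setting
)
search.fit(X, y)

best = search.best_index_
print("best params:", search.best_params_)
print(f"mean train F1:      {search.cv_results_['mean_train_score'][best]:.3f}")
print(f"mean validation F1: {search.cv_results_['mean_test_score'][best]:.3f}")
```

Comparing the mean train and validation scores per setting shows directly whether a parameter choice narrows the overfitting gap. Because the notebook's data is organized by session, a group-aware splitter such as `GroupKFold` may be a better fit than plain k-fold; that is an assumption about the data, not something the notebook confirms.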