# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>1 |</span></b> <b>INTRODUCTION</b></div>

👋 Welcome to "🧠Exploring EEG: A Beginner's Guide"! 

If you're fascinated by the wonders of the human brain and the intricate patterns of brainwaves, but find the world of Electroencephalography (EEG) analysis daunting, you're in the right place. 

This notebook is designed for beginners like me & you, aiming to demystify the complexities of EEG data and make your learning journey both enjoyable and informative.

人間の脳の驚異や脳波の複雑なパターンに魅了されているが、脳波 (EEG) 分析の世界には気が遠くなるという方には、ここが正しい場所です。

このノートブックは私やあなたのような初心者向けに設計されており、EEG データの複雑さをわかりやすくし、学習を楽しく有益なものにすることを目的としています。


### <b><span style='color:#FFCE30'> 1.1 |</span> Intention of the notebook</b>
In this notebook, we will embark on an exploratory journey into the realm of EEG data analysis. Our goal is to provide a clear, step-by-step guide to understanding and analyzing EEG signals, which are crucial in detecting and classifying brain activities, such as seizures. We aim to:

* Break down complex concepts into easily digestible sections.
* Illustrate each step with practical code examples.
* Reference public notebooks and discussions to enhance your learning experience.

このノートブックでは、EEG データ分析の領域への探索的な旅に乗り出します。 私たちの目標は、発作などの脳活動の検出と分類に重要な EEG 信号を理解し、分析するための明確な段階的なガイドを提供することです。 私たちは次のことを目指しています。

* 複雑な概念を理解しやすいセクションに分割します。
* 実際のコード例を使用して各ステップを説明します。
* 公開ノートやディスカッションを参照して、学習体験を強化します。


### <b><span style='color:#FFCE30'> 1.2 |</span> Learning Objective</b>
By the end of this notebook, you will have a foundational understanding of:

* The basics of EEG signals and their significance in medical research and neurology.
* How to preprocess and analyze EEG data.
* Run through the basic code to build a machine learning model for EEG data classification.


このノートブックを読み終えるまでに、以下についての基礎を理解できるようになります。

* EEG 信号の基礎と、医学研究および神経学におけるその重要性。
* 脳波データを前処理して分析する方法。
* 基本コードを実行して、EEG データ分類のための機械学習モデルを構築します。

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>2 |</span></b> <b>REFERENCE & ACKNOWLEDGEMENT</b></div>

This notebook wouldn't be possible without the valuable insights and contributions from the Kaggle community. I've leveraged several resources to compile the most effective learning path for us:

このノートブックは、Kaggle コミュニティからの貴重な洞察と貢献がなければ不可能でした。 私はいくつかのリソースを活用して、最も効果的な学習パスを作成しました。

* https://www.kaggle.com/code/cdeotte/catboost-starter-lb-0-8
* https://www.kaggle.com/code/mvvppp/hms-eda-and-domain-journey
* https://www.kaggle.com/code/ksooklall/hms-banana-montage
* https://www.kaggle.com/code/mpwolke/seizures-classification-parquet


Feel free to explore these resources alongside this notebook to deepen your understanding.

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>3 |</span></b> <b>LOAD LIBARIES</b></div>

In [1]:
import os
import pandas as pd, numpy as np
from glob import glob
import matplotlib.pyplot as plt
VER = 1

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>4 |</span></b> <b>INTRODUCTION TO EEG AND SEIZURE DETECTION</b></div>

<b><span style='color:#FFCE30'> 4.1 |</span> Electroencephalography (EEG) - The Window into Brain Activity</b>

* Electroencephalography, commonly known as EEG, is a non-invasive method used by medical professionals to record electrical activity in the brain. 
* This is done using electrodes placed along the scalp. 
* EEG is a crucial tool in diagnosing neurological disorders, especially epilepsy, which is characterized by recurrent seizures.

* 一般にEEGとして知られる脳波検査は、脳内の電気活動を記録するために医療専門家によって使用される非侵襲的な方法です。
* 頭皮に沿って電極を設置して行います。
* EEG は、神経疾患、特に反復発作を特徴とするてんかんの診断において重要なツールです


<img src="https://www.researchgate.net/profile/Sebastian-Nagel-4/publication/338423585/figure/fig1/AS:844668573073409@1578396089381/Sketch-of-how-to-record-an-Electroencephalogram-An-EEG-allows-measuring-the-electrical.png" alt="EEG" width="600" height="400">



In [3]:
# check the reading of one parquet for understanding

BASE_PATH = '../input/hms-harmful-brain-activity-classification/'

df = pd.DataFrame({'path': glob(BASE_PATH + '**/*.parquet')})

df['test_type'] = df['path'].str.split('/').str.get(-2).str.split('_').str.get(-1)
df['id'] = df['path'].str.split('/').str.get(-1).str.split('.').str.get(0)

df_eeg = pd.read_parquet(BASE_PATH + 'train_eegs/1000913311.parquet')
df_eeg.head()

Unnamed: 0,Fp1,F3,C3,P3,F7,T3,T5,O1,Fz,Cz,Pz,Fp2,F4,C4,P4,F8,T4,T6,O2,EKG
0,-105.849998,-89.230003,-79.459999,-49.23,-99.730003,-87.769997,-53.330002,-50.740002,-32.25,-42.099998,-43.27,-88.730003,-74.410004,-92.459999,-58.93,-75.739998,-59.470001,8.21,66.489998,1404.930054
1,-85.470001,-75.07,-60.259998,-38.919998,-73.080002,-87.510002,-39.68,-35.630001,-76.839996,-62.740002,-43.040001,-68.629997,-61.689999,-69.32,-35.790001,-58.900002,-41.66,196.190002,230.669998,3402.669922
2,8.84,34.849998,56.43,67.970001,48.099998,25.35,80.25,48.060001,6.72,37.880001,61.0,16.58,55.060001,45.02,70.529999,47.82,72.029999,-67.18,-171.309998,-3565.800049
3,-56.32,-37.279999,-28.1,-2.82,-43.43,-35.049999,3.91,-12.66,8.65,3.83,4.18,-51.900002,-21.889999,-41.330002,-11.58,-27.040001,-11.73,-91.0,-81.190002,-1280.930054
4,-110.139999,-104.519997,-96.879997,-70.25,-111.660004,-114.43,-71.830002,-61.919998,-76.150002,-79.779999,-67.480003,-99.029999,-93.610001,-104.410004,-70.07,-89.25,-77.260002,155.729996,264.850006,4325.370117


In [5]:
# Determine the number of channels
# Assuming each row is a time point and each column is a channel
n_channels = df_eeg.shape[1]
n_channels

20

* The headers in the dataset (Fp1, F3, C3, P3, F7, T3, T5, O1, Fz, Cz, Pz, Fp2, F4, C4, P4, F8, T4, T6, O2, EKG) are standard electrode placement labels used in electroencephalography (EEG). 
* These labels correspond to specific positions on the scalp where EEG electrodes are placed to record brain activity. 
* Here's a brief overview of what they represent:

1. **Fp1, Fp2:** Frontopolar electrodes, located on the forehead, left and right side.
2. **F3, F4:** Frontal electrodes, on the left and right side of the forehead.
3. **C3, C4:** Central electrodes, placed above the left and right hemispheres of the brain.
4. **P3, P4:** Parietal electrodes, located on the upper back portion of the head, left and right sides.
5. **O1, O2:** Occipital electrodes, positioned at the back of the head near the visual cortex.
6. **T3, T4, T5, T6:** Temporal electrodes, situated on the left and right sides of the head near the ears. They are often involved in monitoring auditory functions.
7. **F7, F8:** Frontal-temporal electrodes, located at the front of the temporal lobes.
8. **Fz, Cz, Pz:** Midline electrodes, located at the frontal (Fz), central (Cz), and parietal (Pz) positions on the midline of the head.
9. **EKG:** Electrocardiogram electrode, which records the heart’s electrical activity. It's not directly related to brain activity but can be important in some EEG analyses.



* データセット内のヘッダー (Fp1、F3、C3、P3、F7、T3、T5、O1、Fz、Cz、Pz、Fp2、F4、C4、P4、F8、T4、T6、O2、EKG) は標準電極です 脳波検査 (EEG) で使用される配置ラベル。
* これらのラベルは、脳活動を記録するために EEG 電極が配置される頭皮上の特定の位置に対応します。
* それらが何を表すかについての簡単な概要は次のとおりです。

1. **Fp1、Fp2:** 前頭極電極。額の左側と右側にあります。
2. **F3、F4:** 額の左側と右側にある前頭部電極。
3. **C3、C4:** 中心電極。脳の左半球と右半球の上に配置されます。
4. **P3、P4:** 頭頂部電極。頭の後部上部の左側と右側にあります。
5. **O1、O2:** 後頭電極。後頭部の視覚野近くに配置されます。
6. **T3、T4、T5、T6:** 側頭電極。耳の近くの頭の左側と右側にあります。 彼らは多くの場合、聴覚機能の監視に関与します。
7. **F7、F8:** 側頭葉の前部に位置する前頭側頭電極。
8. **Fz、Cz、Pz:** 正中線電極。頭の正中線上の前頭部 (Fz)、中央 (Cz)、および頭頂部 (Pz) の位置にあります。
9. **EKG:** 心臓の電気活動を記録する心電図電極。 これは脳の活動とは直接関係しませんが、一部の EEG 分析では重要になる可能性があります。









<img src="https://www.researchgate.net/profile/Danny-Plass-Oude-Bos/publication/237777779/figure/fig3/AS:669556259434497@1536646060035/10-20-system-of-electrode-placement.png" alt="10-20-system-of-electrode-placement" width="300" height="150">

<b><span style='color:#FFCE30'> 4.2 |</span> Seizures and Their Impact</b>
* Seizures are sudden, uncontrolled electrical disturbances in the brain that can cause changes in behavior, feelings, movements, and levels of consciousness. 
* Detecting and classifying seizures accurately is vital for appropriate treatment and care, especially in critically ill patients.


* 発作は、行動、感情、動き、意識レベルの変化を引き起こす可能性がある、脳内の突然の制御不能な電気的障害です。
* 発作を正確に検出して分類することは、特に重症患者の場合、適切な治療とケアに不可欠です。


<b><span style='color:#FFCE30'> 4.3 |</span> The Challenge of Manual EEG Analysis</b>

* Traditionally, EEG data analysis relies on visual inspection by trained neurologists. 
* This process is not only time-consuming and labor-intensive but also prone to errors due to fatigue and subjective interpretation.


* 従来、EEG データ分析は訓練を受けた神経内科医による目視検査に依存していました。
* このプロセスは時間と労力がかかるだけでなく、疲労や主観的な解釈によりエラーが発生しやすくなります。

<img src="https://slideplayer.com/slide/12925171/78/images/2/Manual+Interpretation+of+EEGs.jpg" alt="Manual Interpretation of EEG" width="700" height="300">
Source: Automated Identification of Abnormal Adult EEG, S. López, G. Suarez, D. Jungreis, I. Obeid and J. Picone, Neural Engineering Data Consortium, Temple University


<b><span style='color:#FFCE30'> 4.4 |</span> The Role of Data Science in EEG Analysis</b>

* Automating EEG Interpretation
The advent of machine learning and data science offers an opportunity to automate the interpretation of EEG data. By developing algorithms that can detect and classify different patterns in EEG signals, we can aid neurologists in making faster, more accurate diagnoses.

* The Data Science Approach
Data scientists approach this challenge by first preprocessing the EEG data, which involves filtering out noise and extracting relevant features. Machine learning models are then trained on these features to distinguish between different types of brain activity.


* EEG解釈の自動化
機械学習とデータ サイエンスの出現により、EEG データの解釈を自動化する機会が生まれました。 EEG信号のさまざまなパターンを検出して分類できるアルゴリズムを開発することで、神経内科医がより迅速かつ正確な診断を行えるように支援できます。

* データサイエンスのアプローチ
データ サイエンティストは、まず EEG データを前処理することでこの課題に取り組みます。これには、ノイズのフィルタリングと関連する特徴の抽出が含まれます。 次に、機械学習モデルはこれらの特徴に基づいてトレーニングされ、さまざまな種類の脳活動を区別します。

<img src="https://www.researchgate.net/profile/Huiguang-He/publication/336336651/figure/fig1/AS:834361356197888@1575938657076/The-flow-chart-of-EEG-emotion-classification-with-similarity-learning-network.png" alt="flowchart for EEG classification" width="700" height="300">


<b><span style='color:#FFCE30'> 4.5 |</span> Understanding EEG Patterns</b>

In the realm of EEG analysis for seizure detection, certain patterns are of particular interest:

1. **Seizure (SZ):** Characterized by abnormal rhythmic activity, indicative of a seizure.
2. **Generalized Periodic Discharges (GPD):** Patterns that may be seen in various encephalopathies.
3. **Lateralized Periodic Discharges (LPD):** Often associated with focal brain lesions.
4. **Lateralized Rhythmic Delta Activity (LRDA):** Can be observed in focal brain dysfunction.
5. **Generalized Rhythmic Delta Activity (GRDA):** Typically related to diffuse brain dysfunction.
6. **"Other" Patterns:** Any other type of activity not falling into the above categories.


発作検出のための EEG 解析の分野では、特定のパターンが特に重要です。

1. **発作 (SZ):** 発作を示す異常なリズム活動を特徴とします。
2. **全般性周期放電 (GPD):** さまざまな脳症で見られるパターン。
3. **側方化周期放電 (LPD):** 多くの場合、限局性脳病変に関連します。
4. **側方化リズムデルタ活動 (LRDA):** 局所性脳機能障害で観察される可能性があります。
5. **一般化リズムデルタ活動 (GRDA):** 通常、びまん性脳機能障害に関連します。
6. **「その他」パターン:** 上記のカテゴリに当てはまらないその他の種類のアクティビティ。


<b><span style='color:#FFCE30'> 4.6 |</span> Interpreting Complex EEG Data</b>

EEG data interpretation can be complex, especially in edge cases where expert neurologists may not agree on a classification. This is where machine learning models can particularly shine by providing an additional layer of analysis.


EEG データの解釈は、特に専門の神経内科医が分類に同意しない可能性がある特殊なケースでは、複雑になる可能性があります。 ここでは、追加の分析レイヤーを提供することで、機械学習モデルが特に威力を発揮します。


<img src="https://www.neurology.org/cms/10.1212/WNL.0000000000207127/asset/bd84c182-712c-41ab-8742-cecf9d49a322/assets/images/large/5ff2.jpg" alt="flowchart for EEG classification" width="700" height="300">

Source: Development of Expert-Level Classification of Seizures and Rhythmic and Periodic Patterns During EEG Interpretation https://www.neurology.org/doi/10.1212/WNL.0000000000207127


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>5 |</span></b> <b>LOAD TRAIN DATA</b></div>

In [6]:
df = pd.read_csv('..//input/hms-harmful-brain-activity-classification/train.csv')
TARGETS = df.columns[-6:]
print('Train shape:', df.shape )
print('Targets', list(TARGETS))
df.head()

Train shape: (106800, 15)
Targets ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']


Unnamed: 0,eeg_id,eeg_sub_id,eeg_label_offset_seconds,spectrogram_id,spectrogram_sub_id,spectrogram_label_offset_seconds,label_id,patient_id,expert_consensus,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote
0,1628180742,0,0.0,353733,0,0.0,127492639,42516,Seizure,3,0,0,0,0,0
1,1628180742,1,6.0,353733,1,6.0,3887563113,42516,Seizure,3,0,0,0,0,0
2,1628180742,2,8.0,353733,2,8.0,1142670488,42516,Seizure,3,0,0,0,0,0
3,1628180742,3,18.0,353733,3,18.0,2718991173,42516,Seizure,3,0,0,0,0,0
4,1628180742,4,24.0,353733,4,24.0,3080632009,42516,Seizure,3,0,0,0,0,0


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>6 |</span></b> <b>CREATE NON-OVERLAPPING EEG ID TRAIN DATA</b></div>

Following the notebook from Chris Deotte: https://www.kaggle.com/code/cdeotte/catboost-starter-lb-0-8,
Initial discussion found here https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/467021

We perform the following because:

* **Match Training Data with Test Data Format:** The competition states that the test data does not have multiple segments from the same eeg_id. To make the training data similar to the test data, we also use only one segment per eeg_id in the training data.

* **Remove Redundancies:** This approach ensures that the training data does not have overlapping or redundant information, which can lead to a more accurate and generalizable machine learning model.

* **Consistency in Data:** By standardizing how we handle the EEG segments in training, we ensure that our model learns from data that is consistent in format with the data it will be tested on.

* **Data Preparation for Machine Learning:** The normalization of target variables and inclusion of relevant features like patient_id and expert_consensus prepare the dataset for effective machine learning modeling.


次の理由から、次のことを実行します。

* **トレーニング データをテスト データ形式と一致させる:** コンテストでは、テスト データには同じ eeg_id からの複数のセグメントが含まれていないと記載されています。 トレーニング データをテスト データと同様にするために、トレーニング データの eeg_id ごとに 1 つのセグメントのみを使用します。

* **冗長性の削除:** このアプローチにより、トレーニング データに重複または冗長な情報が含まれないことが保証され、より正確で一般化可能な機械学習モデルが得られます。

* **データの一貫性:** トレーニングでの EEG セグメントの処理方法を標準化することで、テスト対象のデータと形式が一貫しているデータからモデルが学習することを保証します。

* **機械学習のためのデータ準備:** ターゲット変数の正規化と、patient_id や Expert_consensus などの関連機能の組み込みにより、効果的な機械学習モデリングのためのデータセットが準備されます。

In [7]:
# Creating a Unique EEG Segment per eeg_id:
# The code groups (groupby) the EEG data (df) by eeg_id. Each eeg_id represents a different EEG recording.
# It then picks the first spectrogram_id and the earliest (min) spectrogram_label_offset_seconds for each eeg_id. This helps in identifying the starting point of each EEG segment.
# The resulting DataFrame train has columns spec_id (first spectrogram_id) and min (earliest spectrogram_label_offset_seconds).

# eeg_id ごとに一意の EEG セグメントを作成:
# このコードは、eeg_id によって EEG データ (df) をグループ化 (groupby) します。 各 eeg_id は異なる EEG 記録を表します。
# 次に、各 eeg_id の最初の spectrogram_id と最も早い (最小) spectrogram_label_offset_seconds を選択します。 これは、各 EEG セグメントの開始点を特定するのに役立ちます。
# 結果として得られる DataFrame トレインには、列 spec_id (最初の spectrogram_id) と min (最も古い spectrogram_label_offset_seconds) があります。

train = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_id':'first','spectrogram_label_offset_seconds':'min'})
train.columns = ['spec_id','min']


# Finding the Latest Point in Each EEG Segment:
# The code again groups the data by eeg_id and finds the latest (max) spectrogram_label_offset_seconds for each segment.
# This max value is added to the train DataFrame, representing the end point of each EEG segment.

# 各 EEG セグメントの最新ポイントを見つける:
# コードは再び eeg_id によってデータをグループ化し、各セグメントの最新 (最大) spectrogram_label_offset_seconds を見つけます。
# この最大値はトレイン データフレームに追加され、各 EEG セグメントの終了点を表します。

tmp = df.groupby('eeg_id')[['spectrogram_id','spectrogram_label_offset_seconds']].agg(
    {'spectrogram_label_offset_seconds':'max'})
train['max'] = tmp

# このコードは、各 eeg_id のpatient_id をトレイン DataFrame に追加します。 これにより、各 EEG セグメントが特定の患者に関連付けられます。
tmp = df.groupby('eeg_id')[['patient_id']].agg('first') # The code adds the patient_id for each eeg_id to the train DataFrame. This links each EEG segment to a specific patient.
train['patient_id'] = tmp

# コードは、各 eeg_id のターゲット変数カウント (発作、LPD などの投票など) を合計します。
tmp = df.groupby('eeg_id')[TARGETS].agg('sum') # The code sums up the target variable counts (like votes for seizure, LPD, etc.) for each eeg_id.
for t in TARGETS:
    train[t] = tmp[t].values

# その後、合計が 1 になるようにこれらのカウントを正規化します。このステップでは、カウントを確率に変換します。これは、分類タスクでは一般的な方法です。    
y_data = train[TARGETS].values # It then normalizes these counts so that they sum up to 1. This step converts the counts into probabilities, which is a common practice in classification tasks.
y_data = y_data / y_data.sum(axis=1,keepdims=True)
train[TARGETS] = y_data

# 各 eeg_id について、コードには EEG セグメントの分類に関する Expert_consensus が含まれます。
tmp = df.groupby('eeg_id')[['expert_consensus']].agg('first') # For each eeg_id, the code includes the expert_consensus on the EEG segment's classification.
train['target'] = tmp

# これにより、eeg_id が通常の列になり、DataFrame の操作が容易になります。
train = train.reset_index() # This makes eeg_id a regular column, making the DataFrame easier to work with.
print('Train non-overlapp eeg_id shape:', train.shape )
train.head()

Train non-overlapp eeg_id shape: (17089, 12)


Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,other_vote,target
0,568657,789577333,0.0,16.0,20654,0.0,0.0,0.25,0.0,0.166667,0.583333,Other
1,582999,1552638400,0.0,38.0,20230,0.0,0.857143,0.0,0.071429,0.0,0.071429,LPD
2,642382,14960202,1008.0,1032.0,5955,0.0,0.0,0.0,0.0,0.0,1.0,Other
3,751790,618728447,908.0,908.0,38549,0.0,0.0,1.0,0.0,0.0,0.0,GPD
4,778705,52296320,0.0,0.0,40955,0.0,0.0,0.0,0.0,0.0,1.0,Other


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>7 |</span></b> <b>FEATURE ENGINEERING</b></div>


<b><span style='color:#FFCE30'> 7.1 |</span> 10 min and 20 sec windows</b>

* The code belows efficiently reads spectrogram data, from a single combined file, based on the set variable. We relied on the dataset by Chris Deotte to save time. https://www.kaggle.com/datasets/cdeotte/brain-spectrograms
* It then performs feature engineering by calculating mean and minimum values over two different time windows for each frequency in the spectrogram.
It produce produces in 1600 features (400 features × 4 calculations) for each EEG ID.
* The new features are intended to help the model better understand and classify the EEG data.
* This approach is designed to enhance the model's performance by providing it with more detailed information derived from the spectrogram data.


* 以下のコードは、設定された変数に基づいて、単一の結合されたファイルからスペクトログラム データを効率的に読み取ります。 時間を節約するために、Chris Deotte によるデータセットに依存しました。 https://www.kaggle.com/datasets/cdeotte/brain-spectrograms
* 次に、スペクトログラム内の各周波数の 2 つの異なる時間ウィンドウにわたって平均値と最小値を計算することにより、特徴エンジニアリングを実行します。
EEG ID ごとに 1600 個の特徴 (400 個の特徴 × 4 回の計算) が生成されます。
* 新しい機能は、モデルが EEG データをよりよく理解して分類できるようにすることを目的としています。
* このアプローチは、スペクトログラム データから得られるより詳細な情報をモデルに提供することでモデルのパフォーマンスを向上させるように設計されています。

In [9]:
READ_SPEC_FILES = False # If READ_SPEC_FILES is False, the code reads the combined file instead of individual files.
FEATURE_ENGINEER = True

In [11]:
%%time
# READ ALL SPECTROGRAMS
PATH = '../input/hms-harmful-brain-activity-classification/train_spectrograms/'
files = os.listdir(PATH)
print(f'There are {len(files)} spectrogram parquets')

if READ_SPEC_FILES:    
    spectrograms = {}
    for i,f in enumerate(files):
        if i%100==0: print(i,', ',end='')
        tmp = pd.read_parquet(f'{PATH}{f}')
        name = int(f.split('.')[0])
        spectrograms[name] = tmp.iloc[:,1:].values
else:
    spectrograms = np.load('../input/hms-harmful-brain-activity-classification/brain-spectrograms/specs.npy',allow_pickle=True).item()

There are 11138 spectrogram parquets
CPU times: total: 750 ms
Wall time: 7.26 s


In [12]:
%time
# ENGINEER FEATURES
import warnings
warnings.filterwarnings('ignore')

# The code generates features from the spectrogram data for use in a model 
# The features are derived by calculating the mean and minimum values over time for each of the 400 spectrogram frequencies.
# Two types of windows are used for these calculations:
# A 10-minute window (_mean_10m, _min_10m).
# A 20-second window (_mean_20s, _min_20s).
# This process results in 1600 features (400 features × 4 calculations) for each EEG ID.

# このコードは、モデルで使用するためにスペクトログラム データから特徴を生成します。
# 特徴は、400 のスペクトログラム周波数ごとに経時的な平均値と最小値を計算することによって導出されます。
# これらの計算には 2 種類のウィンドウが使用されます。
# 10 分のウィンドウ (_mean_10m、_min_10m)。
# 20 秒のウィンドウ (_mean_20s、_min_20s)。
# このプロセスにより、EEG ID ごとに 1600 個の特徴 (400 個の特徴 × 4 回の計算) が生成されます。


SPEC_COLS = pd.read_parquet(f'{PATH}1000086677.parquet').columns[1:]
FEATURES = [f'{c}_mean_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_min_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_mean_20s' for c in SPEC_COLS]
FEATURES += [f'{c}_min_20s' for c in SPEC_COLS]
print(f'We are creating {len(FEATURES)} features for {len(train)} rows... ',end='')


# A data matrix data is initialized to store the new features for each eeg_id in the train DataFrame.
# For each row in train, the code calculates the mean and minimum values within the specified 10-minute and 20-second windows.
# These calculated values are then stored in the data matrix.
# Finally, the matrix is added to the train DataFrame as new columns.

# データ行列データは、トレイン データフレーム内の各 eeg_id の新しい特徴を格納するために初期化されます。
# トレイン内の各行について、コードは指定された 10 分および 20 秒のウィンドウ内の平均値と最小値を計算します。
# これらの計算された値はデータ マトリックスに保存されます。
# 最後に、行列が新しい列としてトレイン データフレームに追加されます。

if FEATURE_ENGINEER:
    data = np.zeros((len(train),len(FEATURES)))
    for k in range(len(train)):
        if k%100==0: print(k,', ',end='')
        row = train.iloc[k]
        r = int( (row['min'] + row['max'])//4 ) 
        
        # 10 MINUTE WINDOW FEATURES (MEANS and MINS)
        x = np.nanmean(spectrograms[row.spec_id][r:r+300,:],axis=0)
        data[k,:400] = x
        x = np.nanmin(spectrograms[row.spec_id][r:r+300,:],axis=0)
        data[k,400:800] = x
        
        # 20 SECOND WINDOW FEATURES (MEANS and MINS)
        x = np.nanmean(spectrograms[row.spec_id][r+145:r+155,:],axis=0)
        data[k,800:1200] = x
        x = np.nanmin(spectrograms[row.spec_id][r+145:r+155,:],axis=0)
        data[k,1200:1600] = x

    train[FEATURES] = data
else:
    train = pd.read_parquet('../input/hms-harmful-brain-activity-classification/brain-spectrograms/train.pqt')
print()
print('New train shape:',train.shape)

CPU times: total: 0 ns
Wall time: 0 ns
We are creating 1600 features for 17089 rows... 0 , 100 , 200 , 300 , 400 , 500 , 600 , 700 , 800 , 900 , 1000 , 1100 , 1200 , 1300 , 1400 , 1500 , 1600 , 1700 , 1800 , 1900 , 2000 , 2100 , 2200 , 2300 , 2400 , 2500 , 2600 , 2700 , 2800 , 2900 , 3000 , 3100 , 3200 , 3300 , 3400 , 3500 , 3600 , 3700 , 3800 , 3900 , 4000 , 4100 , 4200 , 4300 , 4400 , 4500 , 4600 , 4700 , 4800 , 4900 , 5000 , 5100 , 5200 , 5300 , 5400 , 5500 , 5600 , 5700 , 5800 , 5900 , 6000 , 6100 , 6200 , 6300 , 6400 , 6500 , 6600 , 6700 , 6800 , 6900 , 7000 , 7100 , 7200 , 7300 , 7400 , 7500 , 7600 , 7700 , 7800 , 7900 , 8000 , 8100 , 8200 , 8300 , 8400 , 8500 , 8600 , 8700 , 8800 , 8900 , 9000 , 9100 , 9200 , 9300 , 9400 , 9500 , 9600 , 9700 , 9800 , 9900 , 10000 , 10100 , 10200 , 10300 , 10400 , 10500 , 10600 , 10700 , 10800 , 10900 , 11000 , 11100 , 11200 , 11300 , 11400 , 11500 , 11600 , 11700 , 11800 , 11900 , 12000 , 12100 , 12200 , 12300 , 12400 , 12500 , 12600 , 12700 , 1

<b><span style='color:#FFCE30'> 7.2 |</span>  Frequency Band Analysis</b>

#### Frequency Band Feature Extraction:

* The function extract_frequency_band_features is designed to process a segment of EEG data. EEG data is a complex signal that represents the electrical activity of the brain.
* This function divides the EEG signal into different frequency bands: Delta, Theta, Alpha, Beta, and Gamma. These bands are significant in neuroscientific studies as they are associated with different brain states and activities.


* 関数 extract_frequency_band_features は、EEG データのセグメントを処理するように設計されています。 EEG データは、脳の電気活動を表す複雑な信号です。
* この機能は、EEG 信号をさまざまな周波数帯域 (デルタ、シータ、アルファ、ベータ、ガンマ) に分割します。 これらのバンドは、さまざまな脳の状態や活動に関連しているため、神経科学研究において重要です。

![](https://ars.els-cdn.com/content/image/3-s2.0-B9780128044902000026-f02-01-9780128044902.jpg)


1. **Delta (0.5 – 4 Hz):**
Delta waves are the slowest brainwaves and are typically associated with deep sleep and restorative processes in the body. They are most prominent during dreamless sleep and play a role in healing and regeneration.
2. **Theta (4 – 8 Hz):**
Theta waves occur during light sleep, deep meditation, and REM (Rapid Eye Movement) sleep. They are linked to creativity, intuition, daydreaming, and fantasizing. Theta states are often associated with subconscious mind activities.
3. **Alpha (8 – 12 Hz):**
Alpha waves are present during physically and mentally relaxed states but still alert. They are typical in wakeful states that involve a relaxed and effortless alertness. Alpha waves aid in mental coordination, calmness, alertness, and learning.
4. **Beta (12 – 30 Hz):**
Beta waves dominate our normal waking state of consciousness when attention is directed towards cognitive tasks and the outside world. They are associated with active, busy or anxious thinking and active concentration.
5. **Gamma (30 – 45 Hz):**
Gamma waves are involved in higher mental activity and consolidation of information. They are important for learning, memory, and information processing. Gamma waves are thought to be the fastest brainwave frequency and relate to simultaneous processing of information from different brain areas.

- 各周波数帯域


1. **デルタ (0.5 – 4 Hz):**
デルタ波は最も遅い脳波であり、通常は深い睡眠と体内の回復プロセスに関連しています。 それらは夢のない睡眠中に最も顕著であり、治癒と再生に役割を果たします。
2. **シータ (4 – 8 Hz):**
シータ波は、浅い睡眠、深い瞑想、レム睡眠（急速眼球運動）中に発生します。 それらは創造性、直感、空想、空想と結びついています。 シータ状態は、多くの場合、潜在意識の活動に関連しています。
3. **アルファ (8 – 12 Hz):**
アルファ波は、肉体的および精神的にリラックスしている状態でも、まだ警戒しているときに存在します。 これらは、リラックスした楽な覚醒状態を伴う覚醒状態に典型的に見られます。 アルファ波は、精神的な調整、落ち着き、注意力、学習を助けます。
4. **ベータ (12 – 30 Hz):**
認知作業や外界に注意が向けられているとき、ベータ波は通常の覚醒意識状態を支配します。 これらは、活動的、多忙、または不安な思考と活発な集中力に関連しています。
5. **ガンマ (30 – 45 Hz):**
ガンマ波は高次の精神活動と情報の統合に関与します。 これらは学習、記憶、情報処理にとって重要です。 ガンマ波は最も速い脳波周波数であると考えられており、脳のさまざまな領域からの情報の同時処理に関係しています。

* For each frequency band, the function applies a bandpass filter to isolate that band's signal. It then computes statistical features (mean, standard deviation, maximum, and minimum) for each band, effectively capturing the characteristics of the EEG signal in these different frequency ranges.
* The use of np.nanmean, np.nanstd, np.nanmax, and np.nanmin ensures that the calculations are robust to NaN (Not a Number) values in the data, which might occur due to various reasons like signal loss or artifacts.

* 各周波数帯域に対して、この関数はバンドパス フィルターを適用して、その帯域の信号を分離します。 次に、各帯域の統計的特徴 (平均、標準偏差、最大、最小) を計算し、これらの異なる周波数範囲の EEG 信号の特性を効果的に捕捉します。
* np.nanmean、np.nanstd、np.nanmax、および np.nanmin を使用すると、信号損失やアーティファクトなどのさまざまな理由によって発生する可能性のあるデータ内の NaN (非数値) 値に対して計算が確実に堅牢になります。


#### Feature Aggregation and PCA:

* The main script initializes a Principal Component Analysis (PCA) model with the intention of reducing the dimensionality of the extracted features. PCA is a common technique used to transform high-dimensional datasets into a lower-dimensional space while retaining most of the variance in the data.
* The script iterates over rows in the train dataset, extracting EEG segments and applying the extract_frequency_band_features function to each channel in these segments. The extracted features from all channels are aggregated.
* However, before applying PCA, any NaN values in the aggregated data (data_original) are handled using mean imputation. This step ensures that the PCA algorithm, which cannot handle NaN values, receives a clean dataset.
* After imputation, PCA is applied to transform the features into a principal component space, and these transformed features are added back into the train DataFrame.
* This process ultimately results in a feature set that's potentially more informative and concise for machine learning models, helping in tasks like classification or anomaly detection in EEG data.


* メイン スクリプトは、抽出された特徴の次元を削減することを目的として、主成分分析 (PCA) モデルを初期化します。 PCA は、データ内の分散の大部分を保持しながら、高次元のデータセットを低次元の空間に変換するために使用される一般的な手法です。
* スクリプトはトレイン データセット内の行を繰り返し、EEG セグメントを抽出し、これらのセグメント内の各チャネルに extract_frequency_band_features 関数を適用します。 すべてのチャネルから抽出された特徴が集約されます。
* ただし、PCA を適用する前に、集計データ (data_original) 内の NaN 値は平均値補完を使用して処理されます。 このステップにより、NaN 値を処理できない PCA アルゴリズムがクリーンなデータセットを受け取ることが保証されます。
* 代入後、PCA を適用して特徴を主成分空間に変換し、これらの変換された特徴をトレイン データフレームに追加し直します。
* このプロセスにより、最終的に機械学習モデルにとってより有益で簡潔な機能セットが生成され、EEG データの分類や異常検出などのタスクに役立ちます。

In [13]:
from scipy import signal
from sklearn.decomposition import PCA

In [14]:
def extract_frequency_band_features(segment):
    # Define EEG frequency bands
    eeg_bands = {'Delta': (0.5, 4), 'Theta': (4, 8), 'Alpha': (8, 12), 'Beta': (12, 30), 'Gamma': (30, 45)}
    
    band_features = []
    for band in eeg_bands:
        low, high = eeg_bands[band]
        # Filter signal for the specific band
        band_pass_filter = signal.butter(3, [low, high], btype='bandpass', fs=200, output='sos')
        filtered = signal.sosfilt(band_pass_filter, segment)
        # Extract features like mean, standard deviation, etc.
        band_features.extend([np.nanmean(filtered), np.nanstd(filtered), np.nanmax(filtered), np.nanmin(filtered)])
    
    return band_features

In [15]:
from sklearn.preprocessing import StandardScaler

# Columns to be excluded from scaling
excluded_columns = ['eeg_id', 'spec_id', 'min', 'max', 'patient_id', 'seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote','target']

# Save the columns to be excluded
excluded_data = train[excluded_columns]

# DataFrame with only the columns to be scaled
features = train.drop(columns=excluded_columns)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features and transform them
features_scaled = scaler.fit_transform(features)

# Create a DataFrame from the scaled features
features_scaled_df = pd.DataFrame(features_scaled, columns=features.columns)

# Concatenate the scaled features with the excluded columns
train_scaled_df = pd.concat([excluded_data.reset_index(drop=True),features_scaled_df,], axis=1)
train_scaled_df 


Unnamed: 0,eeg_id,spec_id,min,max,patient_id,seizure_vote,lpd_vote,gpd_vote,lrda_vote,grda_vote,...,RP_18.16_min_20s,RP_18.36_min_20s,RP_18.55_min_20s,RP_18.75_min_20s,RP_18.95_min_20s,RP_19.14_min_20s,RP_19.34_min_20s,RP_19.53_min_20s,RP_19.73_min_20s,RP_19.92_min_20s
0,568657,789577333,0.0,16.0,20654,0.0,0.000000,0.25,0.000000,0.166667,...,-0.034223,-0.034364,-0.033143,-0.033080,-0.031165,-0.025752,-0.026022,-0.026978,-0.024814,-0.026100
1,582999,1552638400,0.0,38.0,20230,0.0,0.857143,0.00,0.071429,0.000000,...,-0.034279,-0.034417,-0.033181,-0.033113,-0.031205,-0.025768,-0.026030,-0.026982,-0.024815,-0.026101
2,642382,14960202,1008.0,1032.0,5955,0.0,0.000000,0.00,0.000000,0.000000,...,-0.034270,-0.034411,-0.033168,-0.033106,-0.031201,-0.025765,-0.026030,-0.026981,-0.024815,-0.026101
3,751790,618728447,908.0,908.0,38549,0.0,0.000000,1.00,0.000000,0.000000,...,-0.034267,-0.034408,-0.033173,-0.033103,-0.031198,-0.025765,-0.026027,-0.026982,-0.024815,-0.026101
4,778705,52296320,0.0,0.0,40955,0.0,0.000000,0.00,0.000000,0.000000,...,-0.034239,-0.034376,-0.033153,-0.033087,-0.031194,-0.025765,-0.026028,-0.026980,-0.024814,-0.026101
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17084,4293354003,1188113564,0.0,0.0,16610,0.0,0.000000,0.00,0.000000,0.500000,...,-0.034282,-0.034420,-0.033183,-0.033115,-0.031206,-0.025769,-0.026031,-0.026982,-0.024815,-0.026101
17085,4293843368,1549502620,0.0,0.0,15065,0.0,0.000000,0.00,0.000000,0.500000,...,-0.034214,-0.034326,-0.033102,-0.033006,-0.031130,-0.025725,-0.026012,-0.026975,-0.024812,-0.026098
17086,4294455489,2105480289,0.0,0.0,56,0.0,0.000000,0.00,0.000000,0.000000,...,-0.034285,-0.034424,-0.033186,-0.033117,-0.031208,-0.025770,-0.026032,-0.026983,-0.024815,-0.026101
17087,4294858825,657299228,0.0,12.0,4312,0.0,0.000000,0.00,0.000000,0.066667,...,-0.034279,-0.034417,-0.033178,-0.033110,-0.031201,-0.025767,-0.026030,-0.026982,-0.024815,-0.026101


In [16]:
train_scaled_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17089 entries, 0 to 17088
Columns: 1612 entries, eeg_id to RP_19.92_min_20s
dtypes: float64(1608), int64(3), object(1)
memory usage: 210.2+ MB


# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>8 |</span></b> <b>TRAIN MODEL</b></div>

* Original work uses catboost, let's try with XGBoost in this version to see the difference in model performance.

In [18]:
import xgboost as xgb
import gc
from sklearn.model_selection import KFold, GroupKFold

print('XGBoost version', xgb.__version__)

XGBoost version 2.0.2


In [19]:
all_oof = []
all_true = []
TARS = {'Seizure':0, 'LPD':1, 'GPD':2, 'LRDA':3, 'GRDA':4, 'Other':5}

gkf = GroupKFold(n_splits=5)
for i, (train_index, valid_index) in enumerate(gkf.split(train , train .target, train .patient_id)):   
    
    print('#'*25)
    print(f'### Fold {i+1}')
    print(f'### train size {len(train_index)}, valid size {len(valid_index)}')
    print('#'*25)
    
    model = xgb.XGBClassifier(
        objective='multi:softprob', 
        num_class=len(TARS),
        learning_rate = 0.1, 
                      
#         tree_method='gpu_hist',  #skip GPU acceleration
    )
    
    # Prepare training and validation data
    X_train = train.loc[train_index, FEATURES]
    y_train = train.loc[train_index, 'target'].map(TARS)
    X_valid = train.loc[valid_index, FEATURES]
    y_valid = train.loc[valid_index, 'target'].map(TARS)
    
    model.fit(X_train, y_train, 
              eval_set=[(X_valid, y_valid)], 
              verbose=True, 
              early_stopping_rounds=10)
    model.save_model(f'XGB_v{VER}_f{i}.model')
    
    oof = model.predict_proba(X_valid)
    all_oof.append(oof)
    all_true.append(train.loc[valid_index, TARGETS].values)
    
    del X_train, y_train, X_valid, y_valid, oof
    gc.collect()
    
all_oof = np.concatenate(all_oof)
all_true = np.concatenate(all_true)

#########################
### Fold 1
### train size 13671, valid size 3418
#########################
[0]	validation_0-mlogloss:1.71525
[1]	validation_0-mlogloss:1.65074
[2]	validation_0-mlogloss:1.59639
[3]	validation_0-mlogloss:1.54921
[4]	validation_0-mlogloss:1.51127
[5]	validation_0-mlogloss:1.47471
[6]	validation_0-mlogloss:1.44130
[7]	validation_0-mlogloss:1.41417
[8]	validation_0-mlogloss:1.38801
[9]	validation_0-mlogloss:1.36416
[10]	validation_0-mlogloss:1.34616
[11]	validation_0-mlogloss:1.32763
[12]	validation_0-mlogloss:1.30994
[13]	validation_0-mlogloss:1.29378
[14]	validation_0-mlogloss:1.28079
[15]	validation_0-mlogloss:1.26874
[16]	validation_0-mlogloss:1.25603
[17]	validation_0-mlogloss:1.24528
[18]	validation_0-mlogloss:1.23556
[19]	validation_0-mlogloss:1.22740
[20]	validation_0-mlogloss:1.22049
[21]	validation_0-mlogloss:1.21151
[22]	validation_0-mlogloss:1.20529
[23]	validation_0-mlogloss:1.19835
[24]	validation_0-mlogloss:1.19216
[25]	validation_0-mlogloss:1.18574

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>9 |</span></b> <b>HYPERPARAMETER TUNING</b></div>

### <b><span style='color:#FFCE30'> 9.1 |</span> Import Libraries and Set Up Optuna</b>
* First, you import necessary libraries: optuna for hyperparameter optimization, xgboost for the machine learning model, log_loss from scikit-learn for the evaluation metric, and GroupKFold for cross-validation.
* optuna.create_study(direction='minimize') creates a new optimization study. The direction='minimize' means you want to minimize the value returned by the objective function, which in this case is the log loss.


* まず、必要なライブラリをインポートします。ハイパーパラメータ最適化には optuna、機械学習モデルには xgboost、評価メトリックには scikit-learn の log_loss、相互検証には GroupKFold です。
* optuna.create_study(direction='minimize') は新しい最適化スタディを作成します。 direct='minimize' は、目的関数によって返される値 (この場合は対数損失) を最小化することを意味します。

### <b><span style='color:#FFCE30'> 9.2 |</span> Define the Objective Function</b>
* The objective function is what Optuna will optimize. This function takes a trial object, which is used to suggest values for the hyperparameters.
* Inside this function, you set up the hyperparameter space. Optuna will test different combinations of these parameters:
1. lambda, alpha: Regularization parameters.
2. colsample_bytree, subsample: Ratios for column and row sampling.
3. learning_rate: Step size shrinkage used to prevent overfitting.
4. n_estimators: Number of gradient boosted trees.
5. max_depth: Maximum depth of a tree.
6. min_child_weight: Minimum sum of instance weight needed in a child.


* 目的関数は Optuna が最適化するものです。 この関数は、ハイパーパラメータの値を提案するために使用されるトライアル オブジェクトを受け取ります。
※この関数内でハイパーパラメータ空間を設定します。 Optuna は、次のパラメータのさまざまな組み合わせをテストします。
1. ラムダ、アルファ: 正則化パラメータ。
2.colsample_bytree、subsample: 列と行のサンプリングの比率。
3. learning_rate: 過学習を防ぐために使用されるステップ サイズの縮小。
4. n_estimators: 勾配ブーストされたツリーの数。
5. max_ Depth: ツリーの最大の深さ。
6. min_child_weight: 子に必要なインスタンスの重みの最小合計。

### <b><span style='color:#FFCE30'> 9.3 |</span> Cross-Validation Loop</b>

* The function uses GroupKFold for splitting the data. This method is suitable when you have groups in your data (like patient IDs) that should not be split across the training and validation sets.
* For each fold in the cross-validation, the function:
1. Splits the data into training and validation sets.
2. Trains an XGBoost model using the parameters suggested by Optuna.
3. Computes the log loss on the validation set.
4. The average log loss across all folds is returned. Optuna will use this value to decide which hyperparameters are best.

* この関数はデータの分割にGroupKFoldを使用します。 この方法は、トレーニング セットと検証セットに分割すべきでないグループ (患者 ID など) がデータ内にある場合に適しています。
* 相互検証の各フォールドについて、関数は次のとおりです。
1. データをトレーニング セットと検証セットに分割します。
2. Optuna が提案するパラメーターを使用して XGBoost モデルをトレーニングします。
3. 検証セットのログ損失を計算します。
4. すべてのフォールドにわたる平均対数損失が返されます。 Optuna はこの値を使用して、どのハイパーパラメータが最適かを決定します。


### <b><span style='color:#FFCE30'> 9.4 |</span> Running the Optuna Study</b>

* study.optimize(objective, n_trials=100) tells Optuna to optimize the objective function. It will try 100 different combinations of hyperparameters (n_trials=100) to find the best ones.
* It is best to start with small trials before investing time to run on more trials to manage time invested
* Once the optimization is complete, the best hyperparameters found are printed.

* Study.optimize(objective, n_trials=100) は、Optuna に目的関数を最適化するように指示します。 ハイパーパラメータの 100 種類の異なる組み合わせ (n_trials=100) を試して、最適なものを見つけます。
* 投資時間を管理するために、より多くのトライアルを実行するために時間を投資する前に、小規模なトライアルから始めることが最善です。
* 最適化が完了すると、見つかった最適なハイパーパラメータが出力されます。


In [20]:
import optuna
from sklearn.metrics import log_loss


def objective(trial):
    # Hyperparameters to be tuned by Optuna
    param = {
        'objective': 'multi:softprob',
        'num_class': len(TARS),
        'tree_method': 'gpu_hist',  # use 'gpu_hist' for GPU
        'lambda': trial.suggest_loguniform('lambda', 1e-4, 10.0),
        'alpha': trial.suggest_loguniform('alpha', 1e-4, 10.0),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
        'subsample': trial.suggest_categorical('subsample', [0.6, 0.7, 0.8, 0.9, 1.0]),
        'learning_rate': trial.suggest_categorical('learning_rate', [0.008, 0.01, 0.02, 0.05, 0.1]),
        'n_estimators': 1000,
        'max_depth': trial.suggest_categorical('max_depth', [5, 7, 9, 11, 13]),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 300),
    }

    gkf = GroupKFold(n_splits=5)
    cv_scores = []

    for train_index, valid_index in gkf.split(train, train.target, train.patient_id):
        X_train, X_valid = train.loc[train_index, FEATURES], train.loc[valid_index, FEATURES]
        y_train, y_valid = train.loc[train_index, 'target'].map(TARS), train.loc[valid_index, 'target'].map(TARS)

        model = xgb.XGBClassifier(**param)
        model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False, early_stopping_rounds=10)
        preds = model.predict_proba(X_valid)
        cv_scores.append(log_loss(y_valid, preds))

    return np.mean(cv_scores)

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=10)  # Increase n_trials for more extensive search

print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

[I 2024-03-13 18:20:57,122] A new study created in memory with name: no-name-99e933e5-802a-4f91-9de3-09c4540ee6e4
[W 2024-03-13 18:23:46,524] Trial 0 failed with parameters: {'lambda': 0.07252433430783983, 'alpha': 7.947744467850382, 'colsample_bytree': 1.0, 'subsample': 0.7, 'learning_rate': 0.008, 'max_depth': 13, 'min_child_weight': 160} because of the following error: KeyboardInterrupt().
Traceback (most recent call last):
  File "c:\Users\takashi\AppData\Local\Programs\Python\Python311\Lib\site-packages\optuna\study\_optimize.py", line 200, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "C:\Users\takashi\AppData\Local\Temp\ipykernel_28144\4235841052.py", line 29, in objective
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False, early_stopping_rounds=10)
  File "c:\Users\takashi\AppData\Local\Programs\Python\Python311\Lib\site-packages\xgboost\core.py", line 729, in inner_f
    return func(**kwargs)
           ^^^^^

KeyboardInterrupt: 

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>10 |</span></b> <b>FEATURE IMPORTANCE</b></div>

In [None]:
TOP = 30

# Assuming 'model' is your trained model
feature_importance = model.feature_importances_

# Get the feature names from 'train'
feature_names = train.columns

# Sort the feature importances and get the indices of the sorted array
sorted_idx = np.argsort(feature_importance)

# Plot only the top 'TOP' features
fig = plt.figure(figsize=(10, 8))
plt.barh(np.arange(len(sorted_idx))[-TOP:], feature_importance[sorted_idx][-TOP:], align='center')
plt.yticks(np.arange(len(sorted_idx))[-TOP:], feature_names[sorted_idx][-TOP:])
plt.title(f'Feature Importance - Top {TOP}')
plt.show()

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>11 |</span></b> <b>INFER TEST</b></div>

In [None]:
test = pd.read_csv('../input/hms-harmful-brain-activity-classification/test.csv')
print('Test shape',test.shape)
test.head()

In [None]:
PATH2 = '../input/hms-harmful-brain-activity-classification/test_spectrograms/'
s = "853520"
spec = pd.read_parquet(f'{PATH2}{s}.parquet')
spec

In [None]:
%%time
# READ ALL TEST SPECTROGRAMS
PATH2 = '../input/hms-harmful-brain-activity-classification/test_spectrograms/'
files = os.listdir(PATH2)
print(f'There are {len(files)} spectrogram parquets')

spectrograms = {}
for i,f in enumerate(files):
    if i%100==0: print(i,', ',end='')
    tmp = pd.read_parquet(f'{PATH2}{f}')
    name = int(f.split('.')[0])
    spectrograms_test[name] = tmp.iloc[:,1:].values


In [None]:
%time
# ENGINEER FEATURES
import warnings
warnings.filterwarnings('ignore')

# The code generates features from the spectrogram data for use in a model 
# The features are derived by calculating the mean and minimum values over time for each of the 400 spectrogram frequencies.
# Two types of windows are used for these calculations:
# A 10-minute window (_mean_10m, _min_10m).
# A 20-second window (_mean_20s, _min_20s).
# This process results in 1600 features (400 features × 4 calculations) for each EEG ID.

SPEC_COLS = pd.read_parquet(f'{PATH}1000086677.parquet').columns[1:]
FEATURES = [f'{c}_mean_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_min_10m' for c in SPEC_COLS]
FEATURES += [f'{c}_mean_20s' for c in SPEC_COLS]
FEATURES += [f'{c}_min_20s' for c in SPEC_COLS]
print(f'We are creating {len(FEATURES)} features for {len(test)} rows... ',end='')


# A data matrix data is initialized to store the new features for each eeg_id in the train DataFrame.
# For each row in train, the code calculates the mean and minimum values within the specified 10-minute and 20-second windows.
# These calculated values are then stored in the data matrix.
# Finally, the matrix is added to the train DataFrame as new columns.

data = np.zeros((len(test),len(FEATURES)))
for k in range(len(test)):
    if k%100==0: print(k,', ',end='')
    row = test.iloc[k]
            
    # 10 MINUTE WINDOW FEATURES
    x = np.nanmean( spec.iloc[:,1:].values, axis=0)
    data[k,:400] = x
    x = np.nanmin( spec.iloc[:,1:].values, axis=0)
    data[k,400:800] = x

    # 20 SECOND WINDOW FEATURES
    x = np.nanmean( spec.iloc[145:155,1:].values, axis=0)
    data[k,800:1200] = x
    x = np.nanmin( spec.iloc[145:155,1:].values, axis=0)
    data[k,1200:1600] = x

    test[FEATURES] = data

    
print()
print('New test shape:',test.shape)

In [None]:
from sklearn.impute import SimpleImputer

# Initialize a PCA model
pca = PCA(n_components=0.95)
print("PCA model initialized.")

# Initialize an array for original features
num_rows = len(test)
num_features = 20 * n_channels  # 20 features per channel
data_original = np.zeros((num_rows, num_features))

print("Starting feature extraction and PCA processing...")
start_time = time.time()

for k in range(num_rows):
    if k % 1000 == 0:
        print(f"Processing row {k} of {num_rows}...")

    row = train.iloc[k]
    eeg_segment = spectrograms_test[853520][r:r+300, :]

    # Apply the feature extraction function to each EEG channel
    all_channel_features = []
    for i in range(n_channels):
        channel_features = extract_frequency_band_features(eeg_segment[:, i])
        all_channel_features.extend(channel_features)
    
    data_original[k, :] = all_channel_features

print("Data matrix constructed")

# Impute NaN values in the data matrix
imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data_original)

print(f"NaN values handled. Imputed data matrix shape: {data_imputed.shape}")

# Apply PCA on the imputed data
pca.fit(data_imputed)
print("PCA fitting completed.")

# Transform data using PCA
data_pca = pca.transform(data_imputed)

# Add PCA features to DataFrame
pca_feature_columns = [f'pca_feature_{i}' for i in range(data_pca.shape[1])]
test[pca_feature_columns] = data_pca

# Measure total processing time
total_time = time.time() - start_time
print(f"Total processing time: {total_time:.2f} seconds.")

test.head()

In [None]:
# Columns to be excluded from scaling
excluded_columns = ['eeg_id', 'spectrogram_id', 'patient_id']

# Save the columns to be excluded
excluded_data = test[excluded_columns]

# DataFrame with only the columns to be scaled
features = test.drop(columns=excluded_columns)

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the features and transform them
features_scaled = scaler.fit_transform(features)

# Create a DataFrame from the scaled features
features_scaled_df = pd.DataFrame(features_scaled, columns=features.columns)

# Concatenate the scaled features with the excluded columns
test_scaled_df = pd.concat([excluded_data.reset_index(drop=True),features_scaled_df,], axis=1)
test_scaled_df 


In [None]:
# FEATURE ENGINEER TEST
PATH2 = '../input/hms-harmful-brain-activity-classification/test_spectrograms/'
data = np.zeros((len(test),len(FEATURES)))
    
for k in range(len(test)):
    row = test.iloc[k]
    s = int( row.spectrogram_id )
    spec = pd.read_parquet(f'{PATH2}{s}.parquet')
    
    # 10 MINUTE WINDOW FEATURES
    x = np.nanmean( spec.iloc[:,1:].values, axis=0)
    data[k,:400] = x
    x = np.nanmin( spec.iloc[:,1:].values, axis=0)
    data[k,400:800] = x

    # 20 SECOND WINDOW FEATURES
    x = np.nanmean( spec.iloc[145:155,1:].values, axis=0)
    data[k,800:1200] = x
    x = np.nanmin( spec.iloc[145:155,1:].values, axis=0)
    data[k,1200:1600] = x

test[FEATURES] = data
print('New test shape',test.shape)

In [None]:
# INFER XGBOOST ON TEST
preds = []

for i in range(5):
    print(i, ', ', end='')
    
    # Load the XGBoost model
    model = xgb.XGBClassifier()
    model.load_model(f'XGB_v{VER}_f{i}.model')
    
    # Make predictions
    pred = model.predict_proba(test[FEATURES])
    preds.append(pred)

# Average the predictions from each fold
pred = np.mean(preds, axis=0)
print()
print('Test preds shape', pred.shape)

# <div style="padding: 30px;color:white;margin:10;font-size:60%;text-align:left;display:fill;border-radius:10px;background-color:#FFFFFF;overflow:hidden;background-color:#FFCE30"><b><span style='color:#FFFFFF'>12 |</span></b> <b>SUBMISSION</b></div>

In [None]:
sub = pd.DataFrame({'eeg_id':test.eeg_id.values})
sub[TARGETS] = pred
# sub.to_csv('submission.csv',index=False)
print('Submission shape',sub.shape)
sub.head()

In [None]:
# SANITY CHECK TO CONFIRM PREDICTIONS SUM TO ONE
sub.iloc[:,-6:].sum(axis=1)