In [None]:

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


# 1. Overview

Key points about the competition include:

* **Objective**: The primary goal is to detect and classify seizures as well as other harmful brain activities using machine learning models.

* **Data Source**: The data used for training the model comes from EEG signals recorded from critically ill patients in a hospital. EEG is a technique that measures electrical activity in the brain through electrodes placed on the scalp.

* **Potential Impact**: The competition emphasizes the potential transformative benefits for various medical fields, particularly in neurocritical care, epilepsy treatment, and drug development.

* **Medical Application**: Successful advancements in EEG pattern classification accuracy have the potential to significantly impact neurocritical care. This could lead to quicker and more precise treatments for patients experiencing seizures or other forms of brain damage.

* **Broader Implications**: The competition suggests that improvements in this area could have far-reaching effects, enabling doctors and brain researchers to rapidly detect and respond to abnormal brain activities. This, in turn, may enhance the overall understanding of neurological disorders and contribute to the development of more effective treatment strategies.

In summary, the competition aims to harness machine learning and EEG data to enhance the accuracy of detecting and classifying seizures and harmful brain activities. The ultimate goal is to bring about positive changes in neurocritical care, epilepsy treatment, and drug development by providing faster and more accurate diagnostic tools for medical professionals.






# 2. Description

**1. Current Medical Practice with EEG:**

Doctors currently use electroencephalography (EEG) as a diagnostic tool for critically ill patients to detect seizures and other forms of brain activity that could lead to brain damage.
EEG signals are interpreted by specialized neurologists through a manual analysis process.

**2. Challenges in Manual EEG Analysis:**

Manual analysis of EEG recordings is a time-consuming and expensive process.
It is also prone to fatigue-induced errors, and even expert reviewers may disagree with one another.

**3. Objective of the Competition:**

The competition aims to automate EEG analysis, which would significantly alleviate the challenges associated with manual analysis.
The goal is to develop algorithms that can accurately detect six patterns of interest: seizure (SZ), generalized periodic discharges (GPD), lateralized periodic discharges (LPD), lateralized rhythmic delta activity (LRDA), generalized rhythmic delta activity (GRDA), and "other."

**4. Competition Host - Sunstella Foundation:**

The Sunstella Foundation was established in 2021 during the COVID pandemic with a focus on supporting minority graduate students in technology.
The foundation aims to help these students overcome challenges and celebrate their achievements through workshops, forums, and competitions.

**5. Partners and Collaborators:**

Sunstella Foundation collaborates with Persyst, Jazz Pharmaceuticals, and the Clinical Data Animation Center (CDAC).
The partners share a common research goal of preserving and enhancing brain health.

**6. Significance of Automated EEG Analysis:**

Automating EEG analysis is expected to expedite the detection of seizures and other harmful brain activities.
This, in turn, will facilitate quicker and more accurate treatments for patients.
The developed algorithms may also aid researchers in drug development for the treatment and prevention of seizures.

**7. Patterns of Interest:**

There are six patterns of interest, each representing different types of brain activity.
Detailed explanations of these patterns are available for reference.
**8. Annotation of EEG Segments:**

EEG segments used in the competition have been annotated or classified by a group of experts.
Agreement levels among experts vary. Some cases have high agreement ("idealized" patterns). In others, about half of the experts label the segment as "other" and about half label it as one of the remaining five patterns ("proto patterns"). In still others, experts are split between two of the five named patterns ("edge cases").

In summary, the competition seeks to leverage machine learning to automate the analysis of EEG signals, with the ultimate goal of improving the speed and accuracy of detecting various patterns of brain activity, especially seizures, in critically ill patients. This has the potential to revolutionize neurocritical care, epilepsy treatment, and drug development.
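
The three agreement levels described above can be made concrete with a small sketch. It assumes each segment comes with per-class expert vote counts. The function name `agreement_level`, the class keys, the thresholds (`high=0.9`, `split=0.4`), and the example counts are all illustrative assumptions, not definitions taken from the competition.

```python
# Class keys are assumed names for the six patterns of interest.
CLASSES = ["seizure", "lpd", "gpd", "lrda", "grda", "other"]


def agreement_level(votes, high=0.9, split=0.4):
    """Roughly bucket one segment by how its expert votes are distributed.

    votes: dict mapping a class key to a vote count (missing keys count as 0).
    high / split: hypothetical thresholds chosen only for illustration.
    """
    total = sum(votes.get(name, 0) for name in CLASSES)
    if total == 0:
        return "unlabeled"

    # Convert counts to fractions so segments with different rater counts compare.
    fractions = {name: votes.get(name, 0) / total for name in CLASSES}
    ranked = sorted(CLASSES, key=fractions.get, reverse=True)
    top, second = ranked[0], ranked[1]

    if fractions[top] >= high:
        return "idealized"  # near-uniform agreement on one label
    if fractions[top] >= split and fractions[second] >= split:
        # Roughly half/half: "proto" if one side is "other", else "edge".
        return "proto" if "other" in (top, second) else "edge"
    return "mixed"  # votes spread over three or more labels


print(agreement_level({"gpd": 10}))
print(agreement_level({"seizure": 5, "other": 5}))
print(agreement_level({"lpd": 6, "gpd": 5}))
```

The "mixed" bucket is an extra catch-all for segments that fit none of the three described levels.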


![](https://storage.googleapis.com/kaggle-media/competitions/Harvard%20Medical%20School/eFig2.png)

**Figure Structure:**

**Rows**:

1. Seizure (SZ)
2. Lateralized Periodic Discharges (LPDs)
3. Generalized Periodic Discharges (GPDs)
4. Lateralized Rhythmic Delta Activity (LRDA)
5. Generalized Rhythmic Delta Activity (GRDA)

**Columns**:

1. Idealized forms of patterns (A) - Patterns with uniform expert agreement.
2. Proto or partially formed patterns (B) - About half of raters labeled these as one pattern and the other half labeled as "Other."
3. Edge cases (C) - About half of raters labeled these as one pattern and half labeled them as another pattern.
4. More edge cases (D) - Similar to column C.

**Explanation of Examples:**

Column B (Proto or Partially Formed Patterns):

* B-1: Rhythmic delta activity with some sharp discharges, potentially the tail end of a seizure, causing disagreement between SZ and "Other."
* B-2: Frontal lateralized sharp transients with reversed polarity, suggesting a non-cerebral source, leading to a split between LPD and "Other."
* B-3: Diffuse semi-rhythmic delta background with poorly formed low-amplitude periodic discharges, a proto-GPD.
* B-4: Semi-rhythmic delta activity with unstable morphology over the right hemisphere, a proto-LRDA.
* B-5: Waves of rhythmic delta activity with unstable morphology, a proto-GRDA.

Columns C and D (Edge Cases):

Examples with features straddling two patterns:

* C-1: LPDs evolving into a seizure, an edge-case.
* D-1: GPDs on a suppressed background, suggesting a seizure, another edge case.
* C-2: Split between LPDs and GPDs.
* D-2: Tied between LPDs and LRDA, sharing features of both.
* C-3: Split between GPDs and LRDA.
* D-3: Split between GPDs and GRDA, showing asymmetry in slope.
* C-4: Split between LRDA and seizure.
* D-4: Split between LRDA and GRDA, asymmetry in delta wave.
* C-5: Split between GRDA and seizure.
* D-5: Split between GRDA and LPDs, showing generalized rhythmic delta activity with features suggestive of LPDs.

**Note:**
* EEG electrode recording regions are abbreviated as LL (left lateral), RL (right lateral), LP (left parasagittal), and RP (right parasagittal).

This detailed explanation provides insights into the complexities of EEG pattern classification, showcasing examples that challenge clear categorization and highlight the nuances involved in expert interpretation.
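
The region abbreviations in the note also appear as column-name prefixes in the spectrogram files used in this notebook (for example `LL_0.59` and `RP_18.16`). The sketch below groups such column names by region. The helper name `group_columns_by_region` and the non-region `time` column in the example are assumptions for illustration.

```python
from collections import defaultdict

# Region prefixes and their meanings, as listed in the note above.
REGIONS = {
    "LL": "left lateral",
    "RL": "right lateral",
    "LP": "left parasagittal",
    "RP": "right parasagittal",
}


def group_columns_by_region(columns):
    """Map each region prefix to the column names that start with '<prefix>_'."""
    groups = defaultdict(list)
    for name in columns:
        prefix, sep, _ = name.partition("_")
        if sep and prefix in REGIONS:
            groups[prefix].append(name)
    return dict(groups)


# 'time' is a hypothetical column that belongs to no region and is skipped.
example_columns = ["time", "LL_0.59", "LL_0.78", "RP_18.16"]
for prefix, names in group_columns_by_region(example_columns).items():
    print(f"{prefix} ({REGIONS[prefix]}): {names}")
```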









# 3. Data Analysis 

**1. Files and Folders**

**2. Input data**

* train.csv
* test.csv
* XXXXX.parquet

**3. Output Data**

* sample_submission.csv


In [None]:
# Import
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [None]:
def get_folder_info(folder_path):
    # Initialize lists to store information
    files_list = []
    folders_list = []
    sizes_list = []
    types_list = []

    # Iterate over each item in the folder
    for item in os.listdir(folder_path):
        item_path = os.path.join(folder_path, item)

        # Check if it's a file or a folder
        if os.path.isfile(item_path):
            files_list.append(item)
            sizes_list.append(os.path.getsize(item_path))
            # Always append exactly one type so all lists stay the same length
            types_list.append(item.split('.')[-1].lower() if '.' in item else 'Unknown')
            folders_list.append('')
        elif os.path.isdir(item_path):
            files_list.append('')
            sizes_list.append('')
            types_list.append('')
            folders_list.append(item)

    # Create a DataFrame for tabular representation
    df = pd.DataFrame({
        'File': files_list,
        'Folder': folders_list,
        'Size (bytes)': sizes_list,
        'Type': types_list
    })

    return df

# Provide the path to the folder you want to analyze
folder_path = '/kaggle/input/hms-harmful-brain-activity-classification'

# Get information about the folder
folder_info = get_folder_info(folder_path)

# Display the information
print(folder_info)


In [None]:
train_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/train.csv")
train_df

In [None]:
test_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/test.csv")
test_df

In [None]:
sample_submission_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")
sample_submission_df


In [None]:
import os
import pandas as pd

def get_folder_info(folder_path):
    # Initialize lists to store information
    files_list = []
    folders_list = []
    sizes_list = []
    types_list = []

    # Recursive function to explore subfolders
    def explore_folder(current_path):
        for item in os.listdir(current_path):
            item_path = os.path.join(current_path, item)

            if os.path.isfile(item_path):
                files_list.append(item)
                sizes_list.append(os.path.getsize(item_path))
                # Always append exactly one type so all lists stay the same length
                types_list.append(item.split('.')[-1].lower() if '.' in item else 'Unknown')
                folders_list.append(current_path[len(folder_path) + 1:])  # Relative path to the main folder
            elif os.path.isdir(item_path):
                explore_folder(item_path)

    # Call the recursive function
    explore_folder(folder_path)

    # Create a DataFrame for tabular representation
    df = pd.DataFrame({
        'File': files_list,
        'Folder': folders_list,
        'Size (bytes)': sizes_list,
        'Type': types_list
    })

    return df

# Provide the path to the main folder you want to analyze
folder_path = '/kaggle/input/hms-harmful-brain-activity-classification'

# Get information about the folder and its subfolders
folder_info = get_folder_info(folder_path)

# Display the information
print(folder_info)


In [None]:
train_eeg_0001 = pd.read_parquet("/kaggle/input/hms-harmful-brain-activity-classification/train_eegs/2208063991.parquet")
train_eeg_0001

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Read the parquet file
file_path = "/kaggle/input/hms-harmful-brain-activity-classification/train_eegs/2208063991.parquet"
train_eeg_0001 = pd.read_parquet(file_path)

# List of EEG channels
eeg_channels = train_eeg_0001.columns[:-1]  # Exclude the last column (EKG)

# Plotting each EEG channel
for channel in eeg_channels:
    plt.figure(figsize=(10, 5))
    plt.plot(train_eeg_0001[channel], label=channel)
    plt.title(f'EEG Channel: {channel}')
    plt.xlabel('Time Steps')
    plt.ylabel('EEG Signal Value')
    plt.legend()
    plt.show()


In [None]:
train_spectrogram_0001 = pd.read_parquet("/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/1000086677.parquet")

In [None]:
train_spectrogram_0001

In [None]:
import pandas as pd
import matplotlib.pyplot as plt  # needed for the histogram plot at the end of this cell

# Read the data
train_spectrogram_0001 = pd.read_parquet("/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/1000086677.parquet")

# Display basic information about the DataFrame
# (info() prints directly and returns None, so it is not wrapped in print)
train_spectrogram_0001.info()

# Display summary statistics for numeric columns
print(train_spectrogram_0001.describe())

# Count missing values in each column
missing_values = train_spectrogram_0001.isnull().sum()
print("Missing Values:\n", missing_values[missing_values > 0])

# Calculate correlation matrix for a subset of columns
subset_columns = ['LL_0.59', 'LL_0.78', 'LL_0.98', 'RP_18.16', 'RP_18.36', 'RP_18.55']
correlation_matrix = train_spectrogram_0001[subset_columns].corr()
print("Correlation Matrix:\n", correlation_matrix)

# Plot histograms for a subset of columns
train_spectrogram_0001[subset_columns].hist(bins=20, figsize=(15, 8))
plt.suptitle("Histograms of Selected Columns")
plt.show()


# 4. Reference
 
https://www.acns.org/UserFiles/file/ACNSStandardizedCriticalCareEEGTerminology_rev2021.pdf

https://media.journals.elsevier.com/content/files/clinph-chapter1-5-14083047.pdf

http://ulae.org.ua/attachments/article/123/ICU%20EEG%20Terminology%20SB2018_SHORT.pdf