Neural networks tend to train better when the input data is roughly Gaussian (i.e. a normal bell-shaped histogram with mean=0 and std=1). Therefore we transform the inputs to bring their distribution closer to that shape.

The most common transformation is standardization: new data = (old data - mean) / std. Standardization only shifts and rescales the data; it does not change the shape of the histogram. So when the original data has a skewed distribution (i.e. the histogram has a tail extending on only one side), we use a log transform to remove the skew. First we log transform, then we standardize.
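The order of operations described above can be sketched as follows. The data here is synthetic (a log-normal sample chosen only because it is strictly positive and right-skewed), and `standardize` is a hypothetical helper name, not something from the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, strictly positive, right-skewed data (illustration only).
x = rng.lognormal(mean=1.0, sigma=1.5, size=10_000)

def standardize(a, eps=1e-6):
    """Return (a - mean) / std; eps guards against a zero std."""
    return (a - a.mean()) / (a.std() + eps)

x_std = standardize(x)              # standardize only: the skew remains
x_log_std = standardize(np.log(x))  # log first, then standardize

# For a symmetric distribution the median sits close to the mean (0 here).
print(f"standardize only : median = {np.median(x_std):.2f}")
print(f"log + standardize: median = {np.median(x_log_std):.2f}")
```

In the first case the median stays well below the mean, which is the signature of a tail on the right; after the log transform the two roughly coincide.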

Log transforms can only handle positive numbers (i.e. x > 0), so we must shift, clip, and/or flip the data so that every value is > 0 before we apply the log transform.
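A minimal sketch of the three options on toy data. The offset `eps` and the clip `floor` are assumed values picked for illustration; suitable values depend on the data.

```python
import numpy as np

x = np.array([-3.0, 0.0, 0.5, 2.0, 50.0])  # toy data with non-positive values

# Option 1: shift so that the minimum becomes a small positive number.
eps = 1e-3  # assumed offset
log_shift = np.log(x - x.min() + eps)

# Option 2: clip everything below a small positive floor.
floor = 1e-3  # assumed floor
log_clip = np.log(np.clip(x, floor, None))

# Option 3: flip (useful when the tail is on the left), then shift and log.
log_flip = np.log(x.max() - x + eps)

for name, arr in [("shift", log_shift), ("clip", log_clip), ("flip", log_flip)]:
    print(name, np.round(arr, 2))
```

Note that clipping discards information below the floor, while shifting and flipping keep the ordering of the values (flipping reverses it).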

Another trick is the Gauss Rank Transform ([here](https://medium.com/rapids-ai/gauss-rank-transformation-is-100x-faster-with-rapids-and-cupy-7c947e3397da)) or a Quantile Transform ([here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html)). Both can transform almost any distribution into a nearly perfect Gaussian distribution.
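To illustrate the idea behind a rank-based Gaussian transform, here is a small NumPy sketch. It is not the RAPIDS/CuPy implementation from the linked article, and `rank_gauss` is a hypothetical name; it simply replaces each value by the standard normal quantile of its rank. In scikit-learn, `QuantileTransformer(output_distribution="normal")` serves a similar purpose.

```python
import numpy as np
from statistics import NormalDist

def rank_gauss(x):
    """Map values to approximately standard-normal scores via their ranks."""
    x = np.asarray(x, dtype=float)
    n = x.size
    ranks = np.argsort(np.argsort(x))   # 0 .. n-1, ties broken by position
    quantiles = (ranks + 0.5) / n       # strictly inside (0, 1)
    inv_cdf = NormalDist().inv_cdf      # standard normal quantile function
    return np.array([inv_cdf(q) for q in quantiles])

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=5_000)  # synthetic, heavily skewed data
gauss = rank_gauss(skewed)

print(f"mean = {gauss.mean():.3f}, std = {gauss.std():.3f}")
```

Because only the ranks are used, the output distribution is the same whatever the input distribution was, as long as there are not many tied values.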

By @cdeotte, from the discussion [Magic Formula to Convert EEG to Spectrograms!](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/469760)



# Motivation
```python
np.clip(img,np.exp(-4),np.exp(8)) # clip introduced by Chris; improved the LB score by 0.10
```
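Assuming this clip is followed by `np.log`, clipping the raw values to `[exp(-4), exp(8)]` is the same as clipping the log values to `[-4, 8]`, because the log is monotonic. The sketch below checks this on a synthetic array standing in for a spectrogram; the array shape and distribution are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a spectrogram: positive values spanning many orders of magnitude.
img = rng.lognormal(mean=0.7, sigma=3.0, size=(128, 256))

clip_then_log = np.log(np.clip(img, np.exp(-4), np.exp(8)))
log_then_clip = np.clip(np.log(img), -4, 8)

print("same result:", np.allclose(clip_then_log, log_then_clip))
print("log range:", clip_then_log.min(), clip_then_log.max())
```

Clipping before the log has the practical advantage that zeros in the raw data never reach `np.log`.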

# Back to Basics

![](https://www.googleapis.com/download/storage/v1/b/kaggle-forum-message-attachments/o/inbox%2F761268%2F2e5159c656d78753e70132867945a7a8%2Fposter.jpeg?generation=1706062656796456&alt=media)
- **Source - bing**

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
%%time
spectrograms = np.load('/kaggle/input/brain-spectrograms/specs.npy',allow_pickle=True).item()
eeg_spectrograms = np.load('/kaggle/input/brain-eeg-spectrograms/eeg_specs.npy',allow_pickle=True).item()

In [None]:
len(spectrograms), len(eeg_spectrograms)

In [None]:
sample_spectrograms = spectrograms[next(iter(spectrograms))].reshape(-1)
sample_eeg_spectrograms = eeg_spectrograms[next(iter(eeg_spectrograms))].reshape(-1)

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(sample_spectrograms, kde=True, bins=50, color='green')
plt.title(f'Distribution of the Sample Spectrogram Values, Mean: {round(np.nanmean(sample_spectrograms),1)} Std: {round(np.nanstd(sample_spectrograms),1)}')
plt.xlabel('Spectrogram Value')
plt.ylabel('Freq')
plt.grid(True)
plt.show()

In [None]:
plt.figure(figsize=(12, 6))
sns.histplot(sample_eeg_spectrograms, kde=True, bins=50, color='blue')
plt.title(f"Distribution of the Sample EEG's Spectrogram Values, Mean: {round(np.nanmean(sample_eeg_spectrograms),1)} Std: {round(np.nanstd(sample_eeg_spectrograms),1)}")
plt.xlabel("EEG's Spectrogram Value")
plt.ylabel('Freq')
plt.grid(True)
plt.show()
"mean", np.mean(sample_eeg_spectrograms), "std", np.std(sample_eeg_spectrograms)

# Spectrogram: a right-skewed distribution (long tail to the right)
# EEG's Spectrogram: already close to a Gaussian distribution, as Chris already normalised it
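Skew can also be quantified instead of judged by eye. A rough sketch, using a hypothetical helper `sample_skewness` (the third standardized moment) on synthetic data: a positive value means the tail extends to the right, a negative value means it extends to the left, and a value near 0 means the histogram is roughly symmetric.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment of the finite entries of `values`."""
    v = np.asarray(values, dtype=float)
    v = v[np.isfinite(v)]            # drop NaN and inf
    z = (v - v.mean()) / v.std()
    return float(np.mean(z ** 3))

rng = np.random.default_rng(0)
power_like = rng.lognormal(mean=0.7, sigma=1.7, size=50_000)  # synthetic, positive values

print("raw skewness:", sample_skewness(power_like))
print("log skewness:", sample_skewness(np.log(power_like)))
```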

In [None]:
sample_spectrograms_log = np.log(sample_spectrograms)
plt.figure(figsize=(12, 6))
sns.histplot(sample_spectrograms_log, kde=True, bins=50, color='green')
plt.title(f'Distribution of the Sample Spectrogram Log Values, Mean: {round(np.nanmean(sample_spectrograms_log),1)} Std: {round(np.nanstd(sample_spectrograms_log),1)}')
plt.xlabel('Spectrogram Value')
plt.ylabel('Freq')
plt.grid(True)
plt.show()

# Not bad, similar to a Gaussian distribution (but the mean is not 0) - let's normalise

In [None]:
ep = 1e-6
m = np.nanmean(sample_spectrograms_log)
s = np.nanstd(sample_spectrograms_log)
sample_spectrograms_log_norm = (sample_spectrograms_log-m)/(s+ep)
sample_spectrograms_log_norm = np.nan_to_num(sample_spectrograms_log_norm, nan=0.0)
plt.figure(figsize=(12, 6))
sns.histplot(sample_spectrograms_log_norm, kde=True, bins=50, color='green')
plt.title(f'Distribution of the Sample Spectrogram Log Norm Values, Mean: {round(np.mean(sample_spectrograms_log_norm),2)} Std: {round(np.std(sample_spectrograms_log_norm),2)}')
plt.xlabel('Spectrogram Value')
plt.ylabel('Freq')
plt.grid(True)
plt.show()

# Mean and std now match a standard Gaussian distribution (mean = 0, std = 1)

# Let's reverse engineer np.clip(img,np.exp(-4),np.exp(8))
> As we have seen, the log values range from about -4 to 10 with mean ~ 0.69 and std ~ 1.7. After normalisation, the values range from about -2 to 6 with mean = 0 and std = 1.

# Outliers (3 sigma)
![](https://news.mit.edu/sites/default/files/styles/news_article__image_gallery/public/images/201202/20120208160239-1_0.jpg?itok=1X1a_HCs)

### 3-sigma range: 0.69 - 3 * 1.7 ≈ -4.41 to 0.69 + 3 * 1.7 ≈ 5.79, close to Chris's clip range of -4 to 8
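The 3-sigma arithmetic above can be wrapped in a small helper. This is a sketch: `sigma_bounds` is a hypothetical name, and the log values are synthetic, drawn to match the mean (0.69) and std (1.7) quoted above rather than taken from the real spectrograms.

```python
import numpy as np

def sigma_bounds(values, k=3.0):
    """Return (mean - k*std, mean + k*std), ignoring NaNs."""
    m, s = np.nanmean(values), np.nanstd(values)
    return m - k * s, m + k * s

# Synthetic log values matching the quoted mean and std.
rng = np.random.default_rng(0)
log_values = rng.normal(loc=0.69, scale=1.7, size=100_000)

low, high = sigma_bounds(log_values)
print(f"3-sigma range in log space: {low:.2f} to {high:.2f}")

# Equivalent clip limits on the raw (pre-log) values.
print(f"raw-scale clip limits: {np.exp(low):.4f} to {np.exp(high):.1f}")

clipped = np.clip(log_values, low, high)
```

Computing the limits from the data this way makes the clip range adapt to a different dataset, whereas fixed limits such as -4 and 8 are tied to this one.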

In [None]:
# Clip the log values (not the normalised ones) to Chris's range; NaNs pass through np.clip
sample_spectrograms_log_clip = np.clip(sample_spectrograms_log, -4, 8)
plt.figure(figsize=(12, 6))
sns.histplot(sample_spectrograms_log_clip, kde=True, bins=50, color='orange')
plt.title(f'Distribution of the Sample Spectrogram Log Chris Clip Values, Mean: {round(np.nanmean(sample_spectrograms_log_clip),2)} Std: {round(np.nanstd(sample_spectrograms_log_clip),2)}')
plt.xlabel('Spectrogram Value')
plt.ylabel('Freq')
plt.grid(True)
plt.show()

In [None]:
# Clip the log values to the 3-sigma range computed above; NaNs pass through np.clip
sample_spectrograms_log_clip = np.clip(sample_spectrograms_log, -4.41, 5.79)
plt.figure(figsize=(12, 6))
sns.histplot(sample_spectrograms_log_clip, kde=True, bins=50, color='red')
plt.title(f'Distribution of the Sample Spectrogram Log 3-Sigma Clip Values, Mean: {round(np.nanmean(sample_spectrograms_log_clip),2)} Std: {round(np.nanstd(sample_spectrograms_log_clip),2)}')
plt.xlabel('Spectrogram Value')
plt.ylabel('Freq')
plt.grid(True)
plt.show()