NOTE: This notebook assumes that you have downloaded the competition data and saved it in `./data/speed-and-structure-train-data` and `./data/speed-and-structure-train-data-extended` directories.

# Speed and Structure Competition

## Part 1: EDA, K-Fold Preparation, and Simple Statistics

---

This is hardly an EDA. We only prepare the dataset for k-fold cross-validation and collect simple statistics that are later used to normalize the data during training.
We create two k-fold files, one for the original dataset and one for the extended dataset, and then combine them into one file. This is not ideal, but the extended dataset came out later and I already had a k-fold file for the original dataset that I had used for preliminary experiments, so I kept it for a fair comparison.


In [None]:
import os
import random

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm.auto import tqdm

In [None]:
TRAIN_DIR = "./data/speed-and-structure-train-data"
TRAIN_DIR_EXTENDED = "./data/speed-and-structure-train-data-extended"
RECEIVER_IDS = [1, 75, 150, 225, 300]
FOLDS = 20

### K-Fold Preparation

Get the dataset file list for the original data

In [None]:
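# os.listdir returns entries in arbitrary order; sort so the seeded shuffle below is reproducible
train_sample_dirs = sorted(os.listdir(TRAIN_DIR))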
train_vel_files = []
train_receiver_files = []
for dir_id in train_sample_dirs:
    train_vel_files.append(os.path.join(TRAIN_DIR, dir_id, "vp_model.npy"))
    dir_id_receivers = []
    for rec_id in RECEIVER_IDS:
        dir_id_receivers.append(os.path.join(TRAIN_DIR, dir_id, f"receiver_data_src_{rec_id}.npy"))
    train_receiver_files.append(dir_id_receivers)

Uniformly assign a fold to each sample and save the fold information

In [None]:
# create a dataframe for fold information (with FOLDS = 20, each fold holds ~5% of the samples)
fold_info = pd.DataFrame()
fold_info['dir_id'] = train_sample_dirs
# assign a random fold to each sample uniformly
folds = [i % FOLDS for i in range(len(train_sample_dirs))]
random.seed(42)
random.shuffle(folds)
fold_info['fold'] = folds
fold_info['vel_file'] = train_vel_files
for i, rec_id in enumerate(RECEIVER_IDS):
    fold_info[f'rec_{rec_id}'] = [rec_files[i] for rec_files in train_receiver_files]

print(fold_info['fold'].value_counts())
fold_info.to_csv('fold_info.csv', index=False)

fold_info.head()
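
The `i % FOLDS` list followed by a shuffle is what keeps the folds balanced: fold sizes can differ by at most one sample. Here is a minimal sketch of that idea on its own. The function name `assign_folds` and the sample count are made up for illustration, and it uses a local `random.Random` instance rather than the global seed used in the cell above.

```python
import random
from collections import Counter


def assign_folds(n_samples, n_folds, seed=42):
    """Return a shuffled list of fold ids with near-equal fold sizes."""
    folds = [i % n_folds for i in range(n_samples)]
    # A dedicated generator avoids touching the global random state.
    random.Random(seed).shuffle(folds)
    return folds


# Hypothetical sample count, chosen so it does not divide evenly.
folds = assign_folds(n_samples=103, n_folds=20)
sizes = Counter(folds)
# 103 samples over 20 folds: three folds get 6 samples, the rest get 5.
print(sorted(sizes.values()))
```

Shuffling only changes which sample lands in which fold, not the fold sizes, so the `value_counts()` output is the same for any seed.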

The same operations are repeated for the extended data; the code below is essentially a duplicate of the code above.

Get the dataset file list for the extended data

In [None]:
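# os.listdir returns entries in arbitrary order; sort so the seeded shuffle below is reproducible
train_sample_dirs = sorted(os.listdir(TRAIN_DIR_EXTENDED))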
train_vel_files = []
train_receiver_files = []
for dir_id in train_sample_dirs:
    train_vel_files.append(os.path.join(TRAIN_DIR_EXTENDED, dir_id, "vp_model.npy"))
    dir_id_receivers = []
    for rec_id in RECEIVER_IDS:
        dir_id_receivers.append(os.path.join(TRAIN_DIR_EXTENDED, dir_id, f"receiver_data_src_{rec_id}.npy"))
    train_receiver_files.append(dir_id_receivers)

Uniformly assign a fold to each sample and save the fold information

In [None]:
# create a dataframe for fold information (with FOLDS = 20, each fold holds ~5% of the samples)
fold_info_extended = pd.DataFrame()
fold_info_extended['dir_id'] = train_sample_dirs
# assign a random fold to each sample uniformly
folds = [i % FOLDS for i in range(len(train_sample_dirs))]
random.seed(42)
random.shuffle(folds)
fold_info_extended['fold'] = folds
fold_info_extended['vel_file'] = train_vel_files
for i, rec_id in enumerate(RECEIVER_IDS):
    fold_info_extended[f'rec_{rec_id}'] = [rec_files[i] for rec_files in train_receiver_files]

print(fold_info_extended['fold'].value_counts())
fold_info_extended.to_csv('fold_info_extended.csv', index=False)

fold_info_extended.head()
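
Since the file-listing code is duplicated for the two datasets, it could be factored into a helper. The sketch below is one possible way to do that, not what this notebook actually runs. `list_sample_files` is a hypothetical name, and the demo builds two empty sample directories in a temporary folder instead of reading the competition data.

```python
import os
import tempfile


def list_sample_files(root_dir, receiver_ids):
    """Build one record (dict of file paths) per sample directory under root_dir."""
    records = []
    # Sorting makes the order independent of the filesystem.
    for dir_id in sorted(os.listdir(root_dir)):
        record = {
            "dir_id": dir_id,
            "vel_file": os.path.join(root_dir, dir_id, "vp_model.npy"),
        }
        for rec_id in receiver_ids:
            record[f"rec_{rec_id}"] = os.path.join(
                root_dir, dir_id, f"receiver_data_src_{rec_id}.npy"
            )
        records.append(record)
    return records


# Demo on a throwaway directory tree (hypothetical sample names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample_b", "sample_a"):
        os.makedirs(os.path.join(tmp, name))
    records = list_sample_files(tmp, receiver_ids=[1, 75])

print([r["dir_id"] for r in records])
```

Passing the result to `pd.DataFrame(...)` would give the same columns as the fold files apart from `fold`, which can then be added in a single place for both datasets.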

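We now have two k-fold files, one for the original data and one for the extended data, and can combine them into one file. Both files use the same fold ids, so each fold of the combined file mixes original and extended samples.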

In [None]:
fold_info_all = pd.concat([fold_info, fold_info_extended], ignore_index=True)
print(fold_info_all['fold'].value_counts())
fold_info_all.to_csv('fold_info_all.csv', index=False)
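
For reference, a fold file like this is typically consumed by holding out one fold for validation and training on the rest. The snippet below is a rough sketch of that step on a toy dataframe. `split_by_fold` is a hypothetical helper, not part of the training code, and the toy data only mimics the `dir_id` and `fold` columns.

```python
import pandas as pd


def split_by_fold(fold_df, val_fold):
    """Return (train_df, val_df), where val_df holds the rows of one fold."""
    is_val = fold_df["fold"] == val_fold
    train_df = fold_df[~is_val].reset_index(drop=True)
    val_df = fold_df[is_val].reset_index(drop=True)
    return train_df, val_df


# Toy stand-in for fold_info_all with 3 folds.
toy = pd.DataFrame({"dir_id": list("abcdef"), "fold": [0, 1, 2, 0, 1, 2]})
train_df, val_df = split_by_fold(toy, val_fold=1)
print(len(train_df), len(val_df))
```

With the real file this would be something like `split_by_fold(pd.read_csv('fold_info_all.csv'), val_fold=0)`.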

### Collection of Simple Statistics
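We calculate the statistics from 500 samples. Note that `train_vel_files` and `train_receiver_files` were overwritten by the extended-data cells above, so these are the first 500 samples of the extended dataset in listing order. More samples give more precise estimates, but overall it shouldn't matter much.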
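
If you do want to use more samples, the statistics can be accumulated one file at a time instead of concatenating everything in memory, as the cells below do. The following is a minimal sketch of that alternative, assuming float64 accumulation is precise enough. `RunningStats` is a hypothetical helper and the demo uses random arrays rather than the competition files. The median cannot be accumulated this way, so it is left out.

```python
import numpy as np


class RunningStats:
    """Accumulate count, sum, sum of squares, min and max over many arrays."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0
        self.min = np.inf
        self.max = -np.inf

    def update(self, array):
        values = np.asarray(array, dtype=np.float64)
        self.count += values.size
        self.total += values.sum()
        self.total_sq += np.square(values).sum()
        self.min = min(self.min, values.min())
        self.max = max(self.max, values.max())

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std(self):
        # Population std (ddof=0), matching numpy's default for .std().
        variance = self.total_sq / self.count - self.mean ** 2
        return float(np.sqrt(max(variance, 0.0)))


# Synthetic chunks standing in for arrays loaded with np.load.
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(4, 8)) for _ in range(10)]
stats = RunningStats()
for chunk in chunks:
    stats.update(chunk)

print(stats.mean, stats.std, stats.min, stats.max)
```

In this notebook the loop would become `stats.update(np.load(vel_file))` over as many files as desired. The sum-of-squares formula can lose precision when the mean is very large relative to the std, so treat it as a convenience rather than a numerically robust estimator.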

#### Velocity
We will use Min-Max normalization on the velocity data during training, since the velocity values appear to have strict minimum and maximum bounds. <br>

See [Min-Max normalization](https://en.wikipedia.org/wiki/Feature_scaling#Rescaling_(min-max_normalization))
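
As a quick illustration, Min-Max normalization maps values from `[min, max]` to `[0, 1]`, and the inverse mapping turns normalized predictions back into velocities. This is only a sketch with hypothetical function names and a tiny synthetic array. In practice the bounds would be the min and max printed by the cell below.

```python
import numpy as np


def minmax_normalize(x, x_min, x_max):
    """Rescale x from [x_min, x_max] to [0, 1]."""
    return (x - x_min) / (x_max - x_min)


def minmax_denormalize(x_norm, x_min, x_max):
    """Invert minmax_normalize, e.g. to map predictions back to velocities."""
    return x_norm * (x_max - x_min) + x_min


# Synthetic stand-in for a velocity model (not real competition values).
vel = np.array([[1.0, 2.0], [3.0, 5.0]])
vel_min, vel_max = vel.min(), vel.max()
vel_norm = minmax_normalize(vel, vel_min, vel_max)
vel_back = minmax_denormalize(vel_norm, vel_min, vel_max)
print(vel_norm)
```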

In [None]:
sample_n = 500  # number of velocity files used to estimate the statistics
vel_list = []
for vel_file in tqdm(train_vel_files[:sample_n]):
    vel_list.append(np.load(vel_file))

vel_array = np.concatenate(vel_list, axis=0)
print(vel_array.shape)

print(f"Mean: {vel_array.mean()}")
print(f"Std: {vel_array.std()}")
print(f"Min: {vel_array.min()}")
print(f"Max: {vel_array.max()}")
print(f"Median: {np.median(vel_array)}")

#### Receiver
Unlike the velocity data, the receiver data is normalized with Z-score normalization, using the mean and std calculated here. <br>
See [Z-score normalization](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization))
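
For completeness, here is a small sketch of Z-score normalization with a single global mean and std, which is the kind of statistic the cell below computes. The function name, the `eps` guard against division by zero, and the random test array are assumptions for illustration.

```python
import numpy as np


def zscore_normalize(x, mean, std, eps=1e-8):
    """Shift and scale x to roughly zero mean and unit std."""
    return (x - mean) / (std + eps)


# Synthetic stand-in for receiver data (not real competition values).
rng = np.random.default_rng(0)
rec = rng.normal(loc=3.0, scale=2.0, size=(100, 16))
rec_norm = zscore_normalize(rec, rec.mean(), rec.std())
print(rec_norm.mean(), rec_norm.std())
```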

In [None]:
sample_n = 500
receiver_list = []
for receiver_files in tqdm(train_receiver_files[:sample_n]):
    for receiver_file in receiver_files:
        receiver_list.append(np.load(receiver_file))
receiver_array = np.concatenate(receiver_list, axis=0)
print(receiver_array.shape)

print(f"Mean: {receiver_array.mean()}")
print(f"Std: {receiver_array.std()}")
print(f"Min: {receiver_array.min()}")
print(f"Max: {receiver_array.max()}")
print(f"Median: {np.median(receiver_array)}")