# Intro

The training files are spread across a deeply nested system of subfolders, whereas `training_labels.csv` references each data point only by an ID, which corresponds to the file name without its extension.

In this notebook I map each ID to its full file path. I store both the dictionary that maps each ID to its full file path and the enriched file `training_labels_with_paths.csv`.
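
To make the idea concrete, here is a minimal sketch of the ID-to-path mapping on a toy directory tree. The helper name `build_id_to_path`, the example IDs, and the three-level folder layout are assumptions for illustration only; the real mapping is built in the next section.

```python
import os
import tempfile


def build_id_to_path(root):
    """Map each file's base name (the ID) to its full path under `root`."""
    id_to_path = {}
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            file_id = os.path.splitext(filename)[0]  # drop the extension
            id_to_path[file_id] = os.path.join(dirname, filename)
    return id_to_path


# Toy tree standing in for the nested train/ folder (hypothetical IDs and layout).
with tempfile.TemporaryDirectory() as root:
    for file_id in ("00000e74ad", "00001f4945"):
        subdir = os.path.join(root, file_id[0], file_id[1], file_id[2])
        os.makedirs(subdir, exist_ok=True)
        open(os.path.join(subdir, file_id + ".npy"), "wb").close()

    toy_mapping = build_id_to_path(root)
    # Paths relative to the root, so they stay readable after the temp dir is gone.
    toy_relative = {k: os.path.relpath(v, root) for k, v in toy_mapping.items()}
```

Because the lookup is keyed on the bare file name, it only works if IDs are unique across all subfolders; a duplicate name would silently overwrite the earlier entry.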

# Mapping

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import pickle
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline

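# Walk the nested train/ folder and map each ID (file name without extension) to its full path.
# total=4369 is the hardcoded number of directories os.walk visits; it only sizes the progress bar.
id_2_path = dict()
for dirname, _, filenames in tqdm(os.walk('/kaggle/input/g2net-gravitational-wave-detection/train/'), total=4369, desc='Checking filepath for each ID...'):
    # os.walk already separates directories from files, so `filenames` needs no isdir check.
    for filename in filenames:
        id_2_path[os.path.splitext(filename)[0]] = os.path.join(dirname, filename)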

# Storing results

In [None]:
with open('id_2_path.pkl', 'wb') as f:
    pickle.dump(id_2_path, f)
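
For reference, a minimal sketch of how the pickled dictionary could be read back in another notebook. The toy dictionary and the temporary directory are assumptions so that the example runs on its own; in practice the path would point to wherever `id_2_path.pkl` was saved.

```python
import os
import pickle
import tempfile

# Hypothetical one-entry mapping standing in for the real id_2_path.
toy_id_2_path = {"abc123": "/data/train/a/b/c/abc123.npy"}

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "id_2_path.pkl")

    # Write the dictionary in binary mode, as in the cell above.
    with open(pkl_path, "wb") as f:
        pickle.dump(toy_id_2_path, f)

    # Read it back; note the 'rb' mode, which must match the binary write.
    with open(pkl_path, "rb") as f:
        restored = pickle.load(f)
```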

In [None]:
training_labels = pd.read_csv('../input/g2net-gravitational-wave-detection/training_labels.csv')
training_labels['filepath'] = training_labels['id'].map(id_2_path)
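
`Series.map` with a dictionary leaves `NaN` for any ID that is missing from the dictionary, so a quick check for unmatched IDs can be worthwhile. The sketch below uses a toy frame and a toy mapping (both hypothetical) to show the pattern:

```python
import pandas as pd

# Toy labels and mapping; "def456" is deliberately absent from the mapping.
toy_labels = pd.DataFrame({"id": ["abc123", "def456"], "target": [1, 0]})
toy_id_2_path = {"abc123": "/data/train/a/b/c/abc123.npy"}

toy_labels["filepath"] = toy_labels["id"].map(toy_id_2_path)

# IDs without a matching file end up with NaN in the filepath column.
missing_ids = toy_labels.loc[toy_labels["filepath"].isna(), "id"].tolist()
```

An empty `missing_ids` list on the real data would indicate that every labelled ID was found on disk.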

In [None]:
# index=False keeps the row index out of the CSV file.
training_labels.to_csv('training_labels_with_paths.csv', index=False)

# Demonstration

## Getting subsamples of positive and negative data points

## Raw time series from LIGO Hanford

In [None]:
# One row per sampled file: positives in the left column, negatives in the right.
_, axs = plt.subplots(10, 2, figsize=(12, 30), sharex=True, sharey=True)

# Draw 10 random file paths per class (no fixed seed, so each run shows different samples).
pos_subsample = training_labels.loc[training_labels['target'] == 1, 'filepath'].sample(10)
neg_subsample = training_labels.loc[training_labels['target'] == 0, 'filepath'].sample(10)

for row_i, pos_filepath in enumerate(pos_subsample):
    pos_data = np.load(pos_filepath)
    axs[row_i, 0].plot(pos_data[0], c='r')  # index 0: LIGO Hanford series

for row_i, neg_filepath in enumerate(neg_subsample):
    neg_data = np.load(neg_filepath)
    axs[row_i, 1].plot(neg_data[0], c='b')  # index 0: LIGO Hanford series

axs[0, 0].set_title('Positives', fontsize=15)
axs[0, 1].set_title('Negatives', fontsize=15)
plt.suptitle('Visual comparison of randomly sampled positive and negative samples', fontsize=19)
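
The plotting cell picks a single series out of each file with `[0]`. Below is a minimal sketch of that access pattern on a synthetic file. The helper `load_detector_series`, the `(3, 16)` array shape, and the random data are assumptions for illustration; the only thing taken from this notebook is that row 0 is the LIGO Hanford series.

```python
import os
import tempfile

import numpy as np


def load_detector_series(filepath, detector_index=0):
    """Load a .npy file and return one row of it (row 0 by default)."""
    data = np.load(filepath)
    return data[detector_index]


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.npy")

    # Synthetic stand-in for a real sample: one row per detector (assumed layout).
    rng = np.random.default_rng(0)
    toy_array = rng.normal(size=(3, 16))
    np.save(toy_path, toy_array)

    first_row = load_detector_series(toy_path)
```

Passing a different `detector_index` would select another row of the same file, which makes it easy to repeat the comparison plot for the other series.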