In [1]:
import pandas as pd
from pathlib import Path

In this notebook, we determine the patterns followed by the names of the CSVs in
`data/raw/swan_sf/partition[1-5]/(FL|NF)/`. The names encode important
information, such as flare classes and active region numbers, that needs to be
extracted into data frame columns. We begin by collecting the names of all of
the CSVs.

In [2]:
ROOT_DIR = "../raw/swan_sf/"
FL_CSV_PATH_PATTERN = "partition*/FL/*.csv"
NF_CSV_PATH_PATTERN = "partition*/NF/*.csv"
fl_file_names = pd.Series(
    (p.name for p in Path(ROOT_DIR).glob(FL_CSV_PATH_PATTERN))
)
nf_file_names = pd.Series(
    (p.name for p in Path(ROOT_DIR).glob(NF_CSV_PATH_PATTERN))
)
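
To make the effect of the glob patterns concrete, here is a minimal sketch that runs them against a throwaway directory. The directory layout and the placeholder file names (`a.csv`, `notes.txt`, and so on) are invented for illustration and are not taken from the data set.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical layout mirroring partition*/(FL|NF)/; the file names are placeholders.
    for rel in [
        "partition1/FL/a.csv",
        "partition2/FL/b.csv",
        "partition1/NF/c.csv",
        "partition1/FL/notes.txt",
    ]:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # Path.glob yields paths lazily and in no guaranteed order, so sort for a stable result.
    fl_names = sorted(p.name for p in root.glob("partition*/FL/*.csv"))
    nf_names = sorted(p.name for p in root.glob("partition*/NF/*.csv"))

print(fl_names, nf_names)
```

Only the `.csv` files under an `FL` (or `NF`) directory are picked up, across all partitions, and `p.name` drops the directory part so that only the file name remains.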

Below are the names of some of the `FL` files.

In [3]:
fl_file_names.head()

0    M6.5@10636:Primary_ar5692_s2015-06-21T18:36:00...
1    M5.5@11174:Primary_ar5983_s2015-10-01T00:00:00...
2    M1.0@11848:Primary_ar6327_s2016-02-11T11:24:00...
3    M1.0@10911:Primary_ar5885_s2015-08-24T05:24:00...
4    M1.1@11251:Primary_ar6015_s2015-10-15T06:48:00...
dtype: object

Below are the names of some of the `NF` files.

In [4]:
nf_file_names.head()

0    FQ_ar5742_s2015-07-05T08:36:00_e2015-07-05T20:...
1    FQ_ar7015_s2017-05-19T07:24:00_e2017-05-19T19:...
2    FQ_ar5453_s2015-04-14T03:24:00_e2015-04-14T15:...
3    FQ_ar6174_s2015-12-16T10:36:00_e2015-12-16T22:...
4    FQ_ar5342_s2015-03-22T15:12:00_e2015-03-23T03:...
dtype: object

Based on the above, we construct regular expressions that the file names seem to
match.

In [5]:
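# Flare label, e.g. "M6.5@10636:Primary": class letter, magnitude, "@<number>", and role.
flare_re = r"[ABCMX][0-9]+\.[0-9]+@[0-9]+:(Primary|Secondary)"
# Active region number, e.g. "ar5692".
ar_num_re = r"ar[0-9]+"
# Timestamp such as "2015-06-21T18:36:00"; the seconds are always "00".
datetime_re = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:00"
# Start and end timestamps, prefixed with "s" and "e".
timeframe_re = rf"s{datetime_re}_e{datetime_re}"

# FL names always begin with a flare label; NF names begin with "FQ" or a flare label.
fl_re = rf"{flare_re}_{ar_num_re}_{timeframe_re}\.csv"
nf_re = rf"(FQ|{flare_re})_{ar_num_re}_{timeframe_re}\.csv"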
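
As a quick sanity check of how the pieces compose, the sketch below rebuilds the same expressions as the cell above and tries them on two file names. Both names are hypothetical, written by hand to follow the visible pattern; they are not real files from the data set (the real names are truncated in the output above).

```python
import re

# Same building blocks as in the cell above.
flare_re = r"[ABCMX][0-9]+\.[0-9]+@[0-9]+:(Primary|Secondary)"
ar_num_re = r"ar[0-9]+"
datetime_re = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:00"
timeframe_re = rf"s{datetime_re}_e{datetime_re}"
fl_re = rf"{flare_re}_{ar_num_re}_{timeframe_re}\.csv"
nf_re = rf"(FQ|{flare_re})_{ar_num_re}_{timeframe_re}\.csv"

# Hypothetical names, invented for illustration only.
fl_example = "M1.0@123:Primary_ar456_s2015-01-01T00:00:00_e2015-01-01T12:00:00.csv"
nf_example = "FQ_ar456_s2015-01-01T00:00:00_e2015-01-01T12:00:00.csv"

fl_ok = re.fullmatch(fl_re, fl_example) is not None
nf_ok = re.fullmatch(nf_re, nf_example) is not None
# An "FQ" name carries no flare label, so the FL pattern should reject it.
fq_as_fl = re.fullmatch(fl_re, nf_example) is not None
# A flare-labelled name is also accepted by the NF pattern.
fl_as_nf = re.fullmatch(nf_re, fl_example) is not None

print(fl_ok, nf_ok, fq_as_fl, fl_as_nf)
```

Note that `nf_re` is a superset of `fl_re`: every name accepted by `fl_re` is also accepted by `nf_re`, but not the other way around.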

The assertions below confirm that every file name fully matches its regular
expression. Thus, in `data/processed/make_df.py`, the flare class (or `FQ`, for
some `NF` files) can be extracted from the beginning of the name, and the digits
immediately following `"ar"` can be extracted to obtain the active region number.

In [6]:
assert fl_file_names.str.fullmatch(fl_re).all()
assert nf_file_names.str.fullmatch(nf_re).all()
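
The extraction described above could be sketched with named capture groups and `Series.str.extract`, which turns each named group into a column. This is only one possible approach, not necessarily what `make_df.py` does: the file names, the combined pattern `name_re`, and the column names `label`, `ar_num`, `start`, and `end` are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical file names, invented for illustration only.
names = pd.Series(
    [
        "M1.0@123:Primary_ar456_s2015-01-01T00:00:00_e2015-01-01T12:00:00.csv",
        "FQ_ar789_s2015-02-03T04:12:00_e2015-02-03T16:12:00.csv",
    ]
)

datetime_re = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:00"
# Named groups become the column names of the extracted data frame.
name_re = (
    r"^(?P<label>FQ|[ABCMX][0-9]+\.[0-9]+)"  # flare class and magnitude, or "FQ"
    r"(?:@[0-9]+:(?:Primary|Secondary))?"  # matched but not captured in this sketch
    r"_ar(?P<ar_num>[0-9]+)"  # digits of the active region number
    rf"_s(?P<start>{datetime_re})_e(?P<end>{datetime_re})\.csv$"
)

parts = names.str.extract(name_re)
# Convert the extracted strings to more useful dtypes.
parts["ar_num"] = parts["ar_num"].astype(int)
parts["start"] = pd.to_datetime(parts["start"])
parts["end"] = pd.to_datetime(parts["end"])

print(parts)
```

The combined pattern is slightly more permissive than `fl_re` and `nf_re` (it would also accept an `FQ` prefix followed by an `@...` suffix), so it is meant to run after the validation above rather than replace it.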