In [1]:
import pandas as pd

In [2]:
harp_data = pd.read_parquet("../harp_data/data/processed/hmi_sharp_cea_720s.parquet")
harp_data.head()

Unnamed: 0,HARPNUM,T_REC,USFLUX,MEANGAM,MEANGBT,MEANGBZ,MEANGBH,MEANJZD,TOTUSJZ,MEANALP,...,LON_FWT,LAT_FWTPOS,LON_FWTPOS,LAT_FWTNEG,LON_FWTNEG,T_FRST1,T_LAST1,NOAA_AR,NOAA_NUM,NOAA_ARS
0,1,2010-05-01 00:00:00+00:00,6.510776e+21,28.337,66.808,84.497,32.193,-0.131873,5777592000000.0,0.00933,...,-78.194817,23.822844,-78.326813,23.677998,-75.876213,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067
1,1,2010-05-01 00:12:00+00:00,6.521054e+21,29.678,68.349,90.781,32.345,-0.113589,5654726000000.0,-0.004021,...,-78.183884,23.76306,-78.388466,23.785194,-75.429527,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067
2,1,2010-05-01 00:24:00+00:00,6.917875e+21,28.441,67.682,89.127,32.411,0.061197,6488687000000.0,0.0034,...,-77.894882,23.770275,-78.056015,23.708254,-75.365669,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067
3,1,2010-05-01 00:36:00+00:00,6.973706e+21,28.031,67.166,85.321,31.966,0.053302,6193157000000.0,0.00515,...,-77.822472,23.789299,-78.000526,23.62512,-75.309296,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067
4,1,2010-05-01 00:48:00+00:00,7.228647e+21,26.98,64.805,76.349,32.647,0.011571,5797055000000.0,0.000902,...,-77.759651,23.775604,-77.954346,23.754055,-74.332764,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067


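The most-observed HARP has 2,126 records, and only three HARPs exceed 2,000. At the 12-minute cadence of the 720 s series, 2,126 records amount to about 25,500 minutes (roughly 425 hours, or just under 18 days) of observation.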

In [3]:
harp_data["HARPNUM"].value_counts()

HARPNUM
3784     2126
9374     2050
10727    2001
10413    1995
2040     1986
         ... 
10832       3
3127        3
4892        3
10460       3
6713        1
Name: count, Length: 6544, dtype: Int64
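
The record counts can be converted into observation time. The following is a minimal sketch, assuming one record every 12 minutes (as the `720s` file name and the `T_REC` spacing suggest) and no gaps between records; gaps would make the true time span longer. `counts` is a hypothetical stand-in that holds three of the counts shown above.

```python
import pandas as pd

# Hypothetical stand-in for harp_data["HARPNUM"].value_counts(); the values
# are copied from the output above.
counts = pd.Series({3784: 2126, 9374: 2050, 6713: 1}, name="count")

# Assumption: one record every 12 minutes (720 s), with no gaps.
cadence = pd.Timedelta(minutes=12)
durations = counts * cadence

longest = durations.max()
print(longest)                          # longest span as a Timedelta
print(longest / pd.Timedelta(days=1))   # the same span expressed in days
```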

In [80]:
# harp_data.groupby(harp_data["HARPNUM"])["NOAA_AR"].nunique().value_counts()

For some HARP records, there is no matching NOAA active region (AR); for these records, `NOAA_AR` equals zero. We replace the zeros with `pd.NA` so that `nunique`, which ignores missing values by default, counts only real ARs for each HARP. About two-thirds of HARPs (4,409 of 6,544) don't correspond to any AR; six correspond to two ARs.

In [4]:
noaa_ars = harp_data["NOAA_AR"].replace(0, pd.NA)
noaa_ars_by_harp = noaa_ars.groupby(harp_data["HARPNUM"]).nunique()
noaa_ars_by_harp.value_counts()

NOAA_AR
0    4409
1    2129
2       6
Name: count, dtype: int64
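
The replacement of zeros matters because `nunique` would otherwise count zero as if it were an AR number. A small sketch on made-up data (the HARP and AR numbers below are illustrative only, and the nullable `Int64` dtype is an assumption about how the real column is stored):

```python
import pandas as pd

toy = pd.DataFrame({
    "HARPNUM": [1, 1, 2, 2, 3, 3],
    "NOAA_AR": pd.array([11067, 11067, 0, 0, 11429, 11430], dtype="Int64"),
})

# Counting unique values directly treats 0 as if it were an AR number.
naive = toy.groupby("HARPNUM")["NOAA_AR"].nunique()

# After replacing 0 with pd.NA, nunique (dropna=True by default) ignores it.
ars = toy["NOAA_AR"].replace(0, pd.NA)
fixed = ars.groupby(toy["HARPNUM"]).nunique()

print(naive.to_dict())
print(fixed.to_dict())
```

In this toy frame, HARP 2 has no AR, so it is the only group whose count changes between the two versions.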

We only retain the flare data columns that we need. Ke Hu suggested using `noaa_ar_5min` instead of `noaa_ar_5s` since the latter is based on a rule that's too restrictive. We discard flare records that lack `end time`s. See `notebooks/process_flare_data.ipynb` for more information on those records.

In [5]:
flare_data = pd.read_parquet("../flare_data/flare_data.parquet")
flare_data = flare_data[["noaa_ar_5min", "start time", "peak time", "end time", "flare_class", "peak_intensity"]]
flare_data = flare_data[~flare_data["end time"].isna()]
flare_data.head()

Unnamed: 0,noaa_ar_5min,start time,peak time,end time,flare_class,peak_intensity
0,,2010-01-01 06:02:00+00:00,2010-01-01 06:09:00+00:00,2010-01-01 06:13:00+00:00,B,1.1e-07
1,0.0,2010-01-01 12:00:00+00:00,2010-01-01 12:09:00+00:00,2010-01-01 12:19:00+00:00,B,2.7e-07
2,0.0,2010-01-01 12:27:00+00:00,2010-01-01 12:43:00+00:00,2010-01-01 13:09:00+00:00,B,3.3e-07
3,,2010-01-01 15:58:00+00:00,2010-01-01 16:20:00+00:00,2010-01-01 16:31:00+00:00,B,2.5e-07
4,,2010-01-01 18:20:00+00:00,2010-01-01 18:27:00+00:00,2010-01-01 18:31:00+00:00,B,1.3e-07


The following function calculates the proportion of flares in a given data frame that have a matching AR and the proportion that don't.

In [6]:
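def calc_prop_w_ar(flare_df: pd.DataFrame) -> pd.Series:
    """Return the proportions of flares with (True) and without (False) a matching AR."""
    # A flare has a matching AR when `noaa_ar_5min` is neither missing nor zero.
    is_non_na_nonzero = ~flare_df["noaa_ar_5min"].isna() & (flare_df["noaa_ar_5min"] > 0)
    props = is_non_na_nonzero.value_counts(normalize=True, sort=False)
    # Make sure both outcomes appear, even if one of them never occurs.
    if True not in props.index:
        missing_entry = pd.Series([0], index=[True])
        props = pd.concat([props, missing_entry])
    if False not in props.index:
        missing_entry = pd.Series([0], index=[False])
        props = pd.concat([props, missing_entry])
    # Put the True entry first.
    props.sort_index(ascending=False, inplace=True)
    return props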
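
A more compact formulation is possible with `reindex`, which fills in whichever of `True`/`False` is absent. This is a sketch rather than a drop-in replacement: `prop_with_ar` is a hypothetical name, and it is exercised here on a made-up, plain `float64` column rather than on the real flare data.

```python
import pandas as pd

def prop_with_ar(flare_df: pd.DataFrame) -> pd.Series:
    # NaN > 0 evaluates to False, so missing AR numbers count as "no AR".
    has_ar = flare_df["noaa_ar_5min"] > 0
    props = has_ar.value_counts(normalize=True)
    # Guarantee that both outcomes are present, in a fixed order.
    return props.reindex([True, False], fill_value=0.0)

# Made-up AR numbers: two real ones, one zero, one missing.
toy = pd.DataFrame({"noaa_ar_5min": [11067.0, 0.0, float("nan"), 11078.0]})

mixed = prop_with_ar(toy)
none_matched = prop_with_ar(toy.iloc[1:3])  # only the zero and the missing value
print(mixed)
print(none_matched)
```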

About 62% of flares have a matching AR, slightly less than two-thirds.

In [7]:
calc_prop_w_ar(flare_data)

noaa_ar_5min
True     0.62391
False    0.37609
Name: proportion, dtype: Float64

The proportion of flares with a matching AR increases with flare strength, from none of the A-class flares to about 99% of the X-class flares.

In [8]:
flare_data.groupby("flare_class").apply(calc_prop_w_ar, include_groups=False)

flare_class,True,False
A,0.0,1.0
B,0.437824,0.562176
C,0.705131,0.294869
M,0.914847,0.085153
X,0.991304,0.008696


`noaa_ar_5min` is missing for 7,044 of the 27,294 flares (about 26%). These flares should be dropped before matching flares to HARPs because, unlike SQL, pandas `merge` matches rows whose join keys are both null.

In [9]:
flare_data["noaa_ar_5min"].isna().value_counts()

noaa_ar_5min
False    20250
True      7044
Name: count, dtype: int64
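
The null-key behaviour can be reproduced on a toy example. The two frames below are made up for illustration; only the column names follow the real data.

```python
import numpy as np
import pandas as pd

harps = pd.DataFrame({"HARPNUM": [1, 2], "NOAA_AR": [11067.0, np.nan]})
flares = pd.DataFrame({
    "noaa_ar_5min": [11067.0, np.nan],
    "flare_class": ["C", "B"],
})

# pandas pairs the HARP without an AR with the flare without an AR.
naive = harps.merge(flares, how="left", left_on="NOAA_AR", right_on="noaa_ar_5min")

# Dropping flares that lack an AR number first avoids the spurious match.
clean = harps.merge(
    flares.dropna(subset=["noaa_ar_5min"]),
    how="left", left_on="NOAA_AR", right_on="noaa_ar_5min",
)

print(naive[["HARPNUM", "flare_class"]])
print(clean[["HARPNUM", "flare_class"]])
```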

The data frame below was created by matching flares to HARPs using `scripts/match_flares_to_harps.py`.

In [10]:
harp_flare_data = pd.read_parquet("../combined_data/harp_flare_data.parquet")
harp_flare_data.head()

Unnamed: 0,HARPNUM,T_REC,USFLUX,MEANGAM,MEANGBT,MEANGBZ,MEANGBH,MEANJZD,TOTUSJZ,MEANALP,...,T_FRST1,T_LAST1,NOAA_AR,NOAA_NUM,NOAA_ARS,start time,peak time,end time,flare_class,peak_intensity
0,1,2010-05-01 00:00:00+00:00,6.510776e+21,28.337,66.808,84.497,32.193,-0.131873,5777592000000.0,0.00933,...,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067,NaT,NaT,NaT,,
1,1,2010-05-01 00:12:00+00:00,6.521054e+21,29.678,68.349,90.781,32.345,-0.113589,5654726000000.0,-0.004021,...,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067,NaT,NaT,NaT,,
2,1,2010-05-01 00:24:00+00:00,6.917875e+21,28.441,67.682,89.127,32.411,0.061197,6488687000000.0,0.0034,...,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067,NaT,NaT,NaT,,
3,1,2010-05-01 00:36:00+00:00,6.973706e+21,28.031,67.166,85.321,31.966,0.053302,6193157000000.0,0.00515,...,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067,NaT,NaT,NaT,,
4,1,2010-05-01 00:48:00+00:00,7.228647e+21,26.98,64.805,76.349,32.647,0.011571,5797055000000.0,0.000902,...,2010-05-01 00:00:00+00:00,2010-05-11 16:12:00+00:00,11067,1,11067,NaT,NaT,NaT,,
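
The matching script is not reproduced in this notebook. As a rough illustration only, the sketch below assumes that a HARP record matches a flare when the AR numbers agree and `T_REC` lies between the flare's start and end times (inclusive); this rule is an assumption, not a description of `scripts/match_flares_to_harps.py`. The two flares are taken from the HARP 46 rows shown later in this notebook; the first and last `T_REC` values are made up.

```python
import pandas as pd

def ts(text: str) -> pd.Timestamp:
    # Helper (hypothetical) for building UTC timestamps.
    return pd.Timestamp(text, tz="UTC")

harps = pd.DataFrame({
    "HARPNUM": [46, 46, 46],
    "T_REC": [ts("2010-06-09 13:24"), ts("2010-06-09 13:36"), ts("2010-06-09 14:00")],
    "NOAA_AR": [11078, 11078, 11078],
})
flares = pd.DataFrame({
    "noaa_ar_5min": [11078, 11078],
    "start time": [ts("2010-06-09 13:25"), ts("2010-06-09 13:35")],
    "end time": [ts("2010-06-09 13:36"), ts("2010-06-09 13:51")],
    "flare_class": ["B", "B"],
})

# Assumed rule: same AR number, and T_REC within [start time, end time].
pairs = harps.merge(flares, left_on="NOAA_AR", right_on="noaa_ar_5min")
in_window = pairs["T_REC"].between(pairs["start time"], pairs["end time"])
matches = pairs.loc[in_window, ["HARPNUM", "T_REC", "start time", "end time", "flare_class"]]

# Left-join back so that records without a flare are kept with missing values.
result = harps.merge(matches, how="left", on=["HARPNUM", "T_REC"])
print(result)
```

Under this rule, a record that falls inside two overlapping flares appears twice in the result, once per flare.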


In [104]:
# import numpy as np
# shouldnt_have_matches = harp_flare_data["NOAA_AR"].isna() & ~harp_flare_data["start time"].isna()
# harp_flare_data.loc[shouldnt_have_matches, ["start time", "peak time", "end time", "flare_class"]] = pd.NA
# harp_flare_data.loc[shouldnt_have_matches, "peak_intensity"] = np.nan
# harp_flare_data.drop_duplicates(inplace=True)
# harp_flare_data.to_parquet("../combined_data/harp_flare_data.parquet")
# (harp_flare_data["NOAA_AR"].isna() & ~harp_flare_data["start time"].isna()).any()

A matching flare was found for fewer than 1% of HARP records (about 0.88%).

In [11]:
(~harp_flare_data["start time"].isna()).value_counts(normalize=True)

start time
False    0.991231
True     0.008769
Name: proportion, dtype: float64

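Only four HARP records matched more than one flare (two flares each). In each case, `T_REC` coincides with the end time of the first flare and lies at or just after the start of the second.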

In [12]:
recs_w_mult_flares = harp_flare_data.groupby(["HARPNUM", "T_REC"]).size().reset_index(name="num_recs")
recs_w_mult_flares = recs_w_mult_flares[recs_w_mult_flares["num_recs"] > 1]
pd.merge(harp_flare_data, recs_w_mult_flares, how="inner", on=["HARPNUM", "T_REC"])

Unnamed: 0,HARPNUM,T_REC,USFLUX,MEANGAM,MEANGBT,MEANGBZ,MEANGBH,MEANJZD,TOTUSJZ,MEANALP,...,T_LAST1,NOAA_AR,NOAA_NUM,NOAA_ARS,start time,peak time,end time,flare_class,peak_intensity,num_recs
0,46,2010-06-09 13:36:00+00:00,1.233112e+22,41.343,98.798,104.587,55.63,-0.181483,14525880000000.0,-0.004718,...,2010-06-12 03:00:00+00:00,11078,1,11078,2010-06-09 13:25:00+00:00,2010-06-09 13:32:00+00:00,2010-06-09 13:36:00+00:00,B,1.2e-07,2
1,46,2010-06-09 13:36:00+00:00,1.233112e+22,41.343,98.798,104.587,55.63,-0.181483,14525880000000.0,-0.004718,...,2010-06-12 03:00:00+00:00,11078,1,11078,2010-06-09 13:35:00+00:00,2010-06-09 13:43:00+00:00,2010-06-09 13:51:00+00:00,B,2.8e-07,2
2,1449,2012-03-06 07:48:00+00:00,5.325347e+22,51.939,89.239,105.911,60.161,0.013598,80792130000000.0,-0.047258,...,2012-03-16 00:12:00+00:00,11429,2,1142911430,2012-03-06 07:31:00+00:00,2012-03-06 07:39:00+00:00,2012-03-06 07:48:00+00:00,C,6.8e-06,2
3,1449,2012-03-06 07:48:00+00:00,5.325347e+22,51.939,89.239,105.911,60.161,0.013598,80792130000000.0,-0.047258,...,2012-03-16 00:12:00+00:00,11429,2,1142911430,2012-03-06 07:48:00+00:00,2012-03-06 07:55:00+00:00,2012-03-06 08:00:00+00:00,M,1.4e-05,2
4,3688,2014-02-02 16:24:00+00:00,2.396364e+22,51.913,107.564,110.253,67.562,0.292932,51099510000000.0,-0.027853,...,2014-02-10 08:48:00+00:00,11968,1,11968,2014-02-02 16:06:00+00:00,2014-02-02 16:18:00+00:00,2014-02-02 16:24:00+00:00,C,6e-06,2
5,3688,2014-02-02 16:24:00+00:00,2.396364e+22,51.913,107.564,110.253,67.562,0.292932,51099510000000.0,-0.027853,...,2014-02-10 08:48:00+00:00,11968,1,11968,2014-02-02 16:24:00+00:00,2014-02-02 16:29:00+00:00,2014-02-02 16:36:00+00:00,M,1.5e-05,2
6,7670,2021-07-17 16:12:00+00:00,2.437989e+21,58.979,123.708,126.734,79.19,0.795026,6774378000000.0,-0.036468,...,2021-07-23 15:36:00+00:00,12845,1,12845,2021-07-17 16:03:00+00:00,2021-07-17 16:08:00+00:00,2021-07-17 16:12:00+00:00,B,1.2e-07,2
7,7670,2021-07-17 16:12:00+00:00,2.437989e+21,58.979,123.708,126.734,79.19,0.795026,6774378000000.0,-0.036468,...,2021-07-23 15:36:00+00:00,12845,1,12845,2021-07-17 16:12:00+00:00,2021-07-17 16:16:00+00:00,2021-07-17 16:20:00+00:00,B,4.3e-07,2
