## Making a Representative Sample of Stamp File URIs

We want to run the script that identifies single-antenna interference in a stamp file on every stamp file, but that would take far too long. On blpc1 it took about 6 seconds to process 49 stamps, so processing all ~32 million stamps in the >24 GHz dataset would take almost 6.5 weeks. Instead, I'll take a representative sample of the stamp files from above 24 GHz (covering all pointing directions, coherent hits, incoherent hits, etc.) and run the detection on those, to hopefully get some statistics about what the single-antenna RFI looks like. This notebook creates that representative sample of stamp URIs for the script to run on.
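The runtime estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the 6 seconds per 49 stamps measured on blpc1 scales linearly, and the helper name `estimated_runtime_seconds` is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def estimated_runtime_seconds(n_stamps, stamps_per_batch=49, seconds_per_batch=6.0):
    """Linear extrapolation from the timing measured on blpc1 (an assumption)."""
    return n_stamps / stamps_per_batch * seconds_per_batch


# Roughly 32 million stamps in the full dataset
full_weeks = estimated_runtime_seconds(32_000_000) / SECONDS_PER_WEEK

# Size of the 1-in-25 sample built later in this notebook
sample_days = estimated_runtime_seconds(1_248_312) / SECONDS_PER_DAY

print(f"full dataset: {full_weeks:.1f} weeks")
print(f"1/25 sample:  {sample_days:.1f} days")
```

Under the same linear assumption, the 1-in-25 sample would finish in under two days.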

### Setup

Use cosmic (Python 3.8.0) conda env

In [9]:
# Import useful packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.sparse import csr_array, lil_array, load_npz, save_npz
import os
import random

In [3]:
# Read in the data

# Check which server we're on (in case the data is in different places on different servers)
import socket
hostname = socket.gethostname()

# Get paths to data
if hostname in ("blpc1", "blpc2"):
    full_dataset_path = "/datax/scratch/nstieg/25GHz_higher.pkl"
    coherent_dataset_path = "/datax/scratch/nstieg/25GHz_higher_coherent.pkl"
    incoherent_dataset_path = "/datax/scratch/nstieg/25GHz_higher_incoherent.pkl"
else:
    # Fail early, and say which host we are on, if the data location is unknown
    raise RuntimeError(f"Data path not known for host {hostname!r}")

# Read in data
# coherent = pd.read_pickle(coherent_dataset_path)
# incoherent = pd.read_pickle(incoherent_dataset_path)
df = pd.read_pickle(full_dataset_path)

In [31]:
print(df.shape)

(31208910, 29)


### Take representative sample

In [5]:
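# Group the hits by observation start time (tstart) so that every recorded FOV is represented
by_day = df.groupby("tstart")
print(by_day.ngroups)  # number of unique start times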

4703


In [33]:
# Create a new sample: keep 1/decimation of the hits from each unique start time
decimation = 25
samples_from_unique_times = [
    group.sample(frac=1/decimation, random_state=1)  # fixed seed so the sample is reproducible
    for _, group in by_day
]
reduced_df = pd.concat(samples_from_unique_times).reset_index(drop=True)
print(reduced_df.shape)

(1248312, 29)
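One thing to watch with per-group fractional sampling: a group with only a handful of rows can round down to zero sampled rows, so that start time would drop out of the sample entirely. Below is a minimal sketch of a variant that keeps at least one row per group. The helper `sample_per_group`, its `min_rows` parameter, and the toy data are all hypothetical and are not what produced `reduced_df` above.

```python
import pandas as pd


def sample_per_group(df, key, decimation, min_rows=1, seed=1):
    """Sample about 1/decimation of each group, keeping at least min_rows rows."""
    parts = []
    for _, group in df.groupby(key):
        n = max(min_rows, round(len(group) / decimation))
        n = min(n, len(group))  # never ask for more rows than the group has
        parts.append(group.sample(n=n, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)


# Toy data: three start times with 100, 50 and 5 hits
toy = pd.DataFrame({
    "tstart": [1.0] * 100 + [2.0] * 50 + [3.0] * 5,
    "id": range(155),
})

sampled = sample_per_group(toy, "tstart", decimation=25)
counts = sampled.groupby("tstart").size()
print(counts)
```

The trade-off is that very small groups end up slightly over-represented relative to a strict 1/25 fraction.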


In [48]:
# Save the file URIs so the script can run on them on the cosmic server
# Start with just the first 100 rows as a test
test_df = reduced_df[:100]
print(test_df.shape)
cols_to_save = ["id", "file_uri", "file_local_enumeration"]

# Save data
path = "/home/nstieg/BL-COSMIC-2024-proj/stamps/"
test_df[cols_to_save].to_csv(os.path.join(path, "file_info.csv"), index=False)

(100, 29)
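Before handing the CSV to the detection script, it may be worth checking that it round-trips cleanly. The sketch below writes a toy table to a temporary directory and reads it back. The helper `save_uri_table` and the toy file paths are hypothetical; only the three column names come from the cell above.

```python
import os
import tempfile

import pandas as pd


def save_uri_table(df, out_dir, filename="file_info.csv",
                   cols=("id", "file_uri", "file_local_enumeration")):
    """Write only the columns the detection script needs; return the output path."""
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    out_path = os.path.join(out_dir, filename)
    df[list(cols)].to_csv(out_path, index=False)
    return out_path


# Toy stand-in for test_df (the paths are made up)
toy = pd.DataFrame({
    "id": [1, 2],
    "file_uri": ["/hypothetical/a.stamps", "/hypothetical/b.stamps"],
    "file_local_enumeration": [0, 3],
    "signal_snr": [10.0, 12.5],
})

with tempfile.TemporaryDirectory() as tmp:
    saved = save_uri_table(toy, tmp)
    round_trip = pd.read_csv(saved)

print(round_trip)
```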


In [47]:
test_df.columns

Index(['id', 'beam_id', 'observation_id', 'tuning', 'subband_offset',
       'file_uri', 'file_local_enumeration', 'signal_frequency',
       'signal_index', 'signal_drift_steps', 'signal_drift_rate', 'signal_snr',
       'signal_coarse_channel', 'signal_beam', 'signal_num_timesteps',
       'signal_power', 'signal_incoherent_power', 'source_name', 'fch1_mhz',
       'foff_mhz', 'tstart', 'tsamp', 'ra_hours', 'dec_degrees',
       'telescope_id', 'num_timesteps', 'num_channels', 'coarse_channel',
       'start_channel'],
      dtype='object')
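With these columns available, one rough way to check that a sample is representative is to compare how often each value of a categorical column (for example `source_name`) appears in the full table versus the sample. The sketch below uses toy data, and the helper `compare_proportions` is hypothetical; it was not run on the real dataset.

```python
import pandas as pd


def compare_proportions(full, sample, column):
    """Per-value fractions in the full table and in the sample, with their difference."""
    full_frac = full[column].value_counts(normalize=True).rename("full")
    sample_frac = sample[column].value_counts(normalize=True).rename("sample")
    out = pd.concat([full_frac, sample_frac], axis=1).fillna(0.0)
    out["abs_diff"] = (out["full"] - out["sample"]).abs()
    return out.sort_values("abs_diff", ascending=False)


# Toy data: three sources with 60, 30 and 10 hits
full = pd.DataFrame({"source_name": ["A"] * 60 + ["B"] * 30 + ["C"] * 10})

# Take 10% of each source, mirroring the per-group sampling idea
sample = full.groupby("source_name").sample(frac=0.1, random_state=1)

report = compare_proportions(full, sample, "source_name")
print(report)
```

Large values in `abs_diff` would point to values that are over- or under-represented in the sample.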