# Prototyping variability_selection

Goal: figure out precisely which criteria we will use to select "valid" stars (Q2, Q1, etc) and 
One change I am considering: expanding the "Q1" criterion to include stars that may not have any 100%-good bands, but that are roughly 90% good in each band. (I'd like to compute variability stats on ONLY their `good` data in these cases, which might require adding some columns to spreadsheet_maker.)
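
A minimal sketch of what that expanded criterion might look like as a boolean mask. Everything here is an assumption for illustration: the column names (`N_J`, `N_J_good`, etc.) and the toy table are hypothetical and not the actual spreadsheet_maker schema, the "strict" Q1 is my reading of "at least one 100% band", and 0.9 is just the "90%" figure floated above.

```python
import pandas as pd

# Toy per-star table. Column names are hypothetical placeholders:
# N_<band> = total observations, N_<band>_good = observations with clean flags.
toy = pd.DataFrame(
    {
        "N_J": [100, 100, 100],
        "N_J_good": [100, 95, 60],
        "N_H": [100, 100, 100],
        "N_H_good": [80, 92, 95],
        "N_K": [100, 100, 100],
        "N_K_good": [70, 91, 99],
    },
    index=["star_a", "star_b", "star_c"],
)

bands = ["J", "H", "K"]

# Fraction of good observations in each band, one column per band.
good_frac = pd.DataFrame(
    {band: toy[f"N_{band}_good"] / toy[f"N_{band}"] for band in bands}
)

# Assumed existing-style Q1: at least one band is 100% good.
q1_strict = (good_frac == 1.0).any(axis=1)

# Proposed addition: no perfect band required, but every band is >= 90% good.
q1_mostly_good = (good_frac >= 0.9).all(axis=1)

q1_expanded = q1_strict | q1_mostly_good
print(q1_expanded)
```

In this toy table, `star_a` passes via the strict rule, `star_b` passes only via the proposed 90% rule, and `star_c` fails both.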

One consideration: backwards compatibility, at least with the 2015 Orion paper. At the very least, let's confirm that we can reproduce those results to some degree.

I'm looking at ["official_star_counter" from wuvars-orion](https://github.com/tomr-stargazer/wuvars-orion/blob/master/official_star_counter.py).

(We'll also, someday, be interested in splitting off WSERV5-SE and treating it as its own thing, too.)

### First: Can we, like, re-run wuvars-orion's official star counter?

official star counter lives here:
/Users/tsrice/Documents/Code/wuvars-orion/official_star_counter.py


In [2]:
%run /Users/tsrice/Documents/Code/wuvars-orion/official_star_counter.py

Auto-detected table type: fits
Auto-detected table type: fits
Auto-detected table type: fits
Auto-detected table type: fits
Number of detected sources in the dataset:
40630
Number of stars that meet absolute minimum considerations for valid data:
(i.e., have at least 50 recorded observations in at least one band)
14728
Maximum possible number of variables: 3141
Number of stars automatically classed as variables: 868
Number of stars that have the data quality for auto-classification: 3592
Auto-detected table type: fits

Number of probably-variable stars requiring subjective verification due to imperfect data quality: 2273
Number of new subjectives: 94

Number of STRICT autovariables: 553
Number of STRICT autocandidates: 2348

 Q: Statistically, what fraction of our stars are variables?
 A: 23.55%, drawn from the tightest-controlled sample;
    24.16%, drawn from a looser sample.

Number of possible variables with detected periods: 585
Number of autovariables that are periodic: 354
Numbe

### Stats from "old" official star counter:

- Q0 stars (at least 50 observations in at least one band): 14728
- Total detected sources: 40630
- Q2 stars: 2348
- Q1 + Q2 stars: 3592

# Question 1: 

Given that we've shifted away from old "summary spreadsheet" code from ~2012 (which used ATpy internally) to new code which uses Pandas internally (for a huge boost in performance, maintainability/clarity, and compatibility with Python 3), can we reproduce the numbers from Table 1 of Rice et al 2015? In other words, **can we verify that the new code produces the same output as the old code**, given the same photometric data and the same definitions for "quality bins"?

In [1]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from astropy.table import Table


In [2]:
# Let's re-implement the above for WSERV5, in my new reduction.

spreadsheet_root = "/Users/tsrice/Documents/Variability_Project_2020/wuvars/Data/analysis_artifacts"
wserv_ids = [5]

for wserv in wserv_ids[::-1]:

    print(f"\n   WSERV{wserv}: \n")

    # Default location of the summary spreadsheet for a given WSERV field.
    spreadsheet_path = os.path.join(
        spreadsheet_root,
        f"wserv{wserv}",
        f"WSERV{wserv}_graded_clipped0.95_scrubbed0.1_dusted0.5_summary_spreadsheet.h5",
    )
    # WSERV5: use the spreadsheet built from the 2012 ('80% graded') data instead.
    if wserv == 5:
        spreadsheet_path = os.path.join(
            spreadsheet_root,
            "wserv5_v2012",
            f"WSERV{wserv}_fdece_graded_clipped0.8_scrubbed0.1_dusted0.5_summary_spreadsheet.h5",
        )
        print(f"WSERV5: {spreadsheet_path}")

    ds = pd.read_hdf(spreadsheet_path, key="table")

    # Q0: at least 50 recorded observations in at least one band.
    q0 = (
        (ds["count"]["N_J"] >= 50)
        | (ds["count"]["N_H"] >= 50)
        | (ds["count"]["N_K"] >= 50)
    )

    print("Total detected sources:", len(ds))
    print("Total sources with at least 50 obs in one band:", len(ds[q0]))



   WSERV5: 

WSERV5: /Users/tsrice/Documents/Variability_Project_2020/wuvars/Data/analysis_artifacts/wserv5_v2012/WSERV5_fdece_graded_clipped0.8_scrubbed0.1_dusted0.5_summary_spreadsheet.h5
Total detected sources: 40630
Total sources with at least 50 obs in one band: 14728


# Answer to Question 1 (updated!! as of 12/9/20):

Okay, the good: we are picking up **exactly** the same number of detected sources for WSERV5 as before. (40630)

(Context: this is the version of the spreadsheet which uses the old, '80% graded' data, as an exact copy from 2012.)

~~The mostly-good: we are picking up very nearly the same number of Q0 sources (15,101 versus the old 14,728). I'm not sure where these 373 newcomers came from, actually, we should find out.~~

## UPDATE

**Now we have 14728 Q0 sources in this dataset**. This means we are reproducing, exactly, the output of the older code, and can be confident that, as we apply this code to the other datasets, we are carrying forward our experience-tested criteria.
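
One way to keep this agreement from silently drifting is to wrap the two counts in a small check. The sketch below is only an illustration: `count_sources` and `matches_reference` are hypothetical helper names (not part of wuvars), and the toy table just mimics the two-level `("count", "N_J")` column layout used in the cell above rather than reading the real HDF5 file.

```python
import pandas as pd

# Reference counts from the old official_star_counter output for WSERV5.
EXPECTED_TOTAL = 40630
EXPECTED_Q0 = 14728


def count_sources(ds, min_obs=50):
    """Return (total sources, Q0 sources) for a summary spreadsheet."""
    counts = ds["count"]
    q0 = (
        (counts["N_J"] >= min_obs)
        | (counts["N_H"] >= min_obs)
        | (counts["N_K"] >= min_obs)
    )
    return len(ds), int(q0.sum())


def matches_reference(ds, expected_total, expected_q0):
    """True if both counts agree exactly with the reference values."""
    return count_sources(ds) == (expected_total, expected_q0)


# Tiny synthetic table with the same two-level column layout, for illustration.
toy = pd.DataFrame(
    {
        ("count", "N_J"): [120, 10, 0],
        ("count", "N_H"): [115, 49, 0],
        ("count", "N_K"): [118, 50, 3],
    }
)
print(count_sources(toy))
```

With the real spreadsheet loaded as `ds`, `matches_reference(ds, EXPECTED_TOTAL, EXPECTED_Q0)` would be expected to return `True`, given the counts reported above.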