# Explanation


The HSC data is too large to store as a single SQLite database file on GitHub, so the user needs to fetch it separately from cloning the repository. This notebook is a work in progress to help automate that process and to make sure that the final schema is correct.

The one complication is that the database is also too large to fetch all at once, even if you only want ~10 columns rather than the full ~1000 columns. So you need to download it piecemeal and then combine the pieces into a single database.


**Remember to set your credentials in `hsc_credentials.py`!**

In [None]:
from __future__ import division, print_function

In [None]:
from hsc_credentials import credential

In [None]:
from hscReleaseQuery import query_wrapper

# Build the query
Right now the query only gets the *fluxes*, not the magnitudes. So far, I haven't needed the zeropoint, but this is a good starting point if you need to build a query that gets the magnitudes.

In [None]:
sql_base = """
SELECT 
    object_id, 
    ra, dec, 
    detect_is_patch_inner, detect_is_tract_inner, detect_is_primary,
    gcmodel_flux, gcmodel_flux_err, gcmodel_flux_flags,
    rcmodel_flux, rcmodel_flux_err, rcmodel_flux_flags,
    icmodel_flux, icmodel_flux_err, icmodel_flux_flags,
    zcmodel_flux, zcmodel_flux_err, zcmodel_flux_flags,
    ycmodel_flux, ycmodel_flux_err, ycmodel_flux_flags
FROM 
    pdr1_cosmos_widedepth_median.forced
LIMIT 
    {}
OFFSET 
    {}
"""

# Make the query

**The total number of objects is currently hardcoded! Make sure this hasn't changed!**
The cleaner way to do this would be to make a simple query to the database, then count the number of records. But for now, hardcoding it is simpler.
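
A minimal sketch of such a counting query is below. The query string uses the table name from `sql_base`; everything else is an assumption for illustration: the in-memory SQLite database is only a local stand-in so the snippet runs without credentials, and `n_objects_demo` is a hypothetical name. Against the real server the string would presumably be passed to `query_wrapper` like the block queries, and how the count comes back would depend on the output format.

```python
import sqlite3

# The counting query; the table name is the one used in sql_base.
sql_count = "SELECT COUNT(*) FROM pdr1_cosmos_widedepth_median.forced"

# Local stand-in for the remote table (an assumption, for illustration only):
# an in-memory SQLite database attached under the same schema name.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS pdr1_cosmos_widedepth_median")
conn.execute("CREATE TABLE pdr1_cosmos_widedepth_median.forced (object_id INTEGER)")
conn.executemany("INSERT INTO pdr1_cosmos_widedepth_median.forced VALUES (?)",
                 [(i,) for i in range(5)])

# COUNT(*) returns a single row with a single value.
n_objects_demo = conn.execute(sql_count).fetchone()[0]
conn.close()
```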

In [None]:
n_objects = 1263503

In [None]:
block_size = 250000
# Ceiling division, so an exact multiple of block_size doesn't add an empty block
n_blocks = (n_objects + block_size - 1) // block_size

In [None]:
limit = block_size

preview_results = False
delete_job = True
out_format = "sqlite3"

for i in range(n_blocks):
    offset = i*block_size
    
    sql = sql_base.format(limit, offset)
    
    output_filename = "tmp_{}.sqlite3".format(i)
    
    print(" ---------------- QUERY {} -------------------- ".format(i+1))
    print(sql)

    with open(output_filename, mode="wb") as output_file:
        query_wrapper(credential, sql, preview_results, delete_job, 
                      out_format, output_file,
                      nomail=True)
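
One caveat about paging with `LIMIT`/`OFFSET`: SQL does not guarantee a row order unless the query has an `ORDER BY`, so consecutive pages are only guaranteed to be disjoint and complete when the ordering is fixed. Whether the HSC server happens to return a stable order for the query above is not checked in this notebook, and a total row count alone would not catch a row that is duplicated in one block and missing from another. The sketch below illustrates ordered paging on a small local table. The `forced` table here, `block_ranges`, and `sql_page` are hypothetical stand-ins, and adding `ORDER BY object_id` to `sql_base` may make the server-side query slower.

```python
import sqlite3

def block_ranges(n_rows, block_size):
    """Yield (limit, offset) pairs that cover n_rows with no empty trailing block."""
    for offset in range(0, n_rows, block_size):
        yield block_size, offset

# Hypothetical local table standing in for the remote catalog.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forced (object_id INTEGER)")
conn.executemany("INSERT INTO forced VALUES (?)",
                 [(i,) for i in (7, 3, 9, 1, 5, 8, 2)])

# ORDER BY makes every page a slice of the same, well-defined ordering.
sql_page = "SELECT object_id FROM forced ORDER BY object_id LIMIT {} OFFSET {}"

fetched = []
for limit, offset in block_ranges(7, 3):
    rows = conn.execute(sql_page.format(limit, offset)).fetchall()
    fetched.extend(row[0] for row in rows)
conn.close()
```

With the ordering fixed, `fetched` contains every `object_id` exactly once, regardless of how the rows were inserted.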

# Check if it worked

In [None]:
import glob  # needed here; this cell runs before the imports further down

database_filenames = sorted(glob.glob("tmp_*.sqlite3"))
database_filenames
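
Listing the files only shows that they exist. A slightly stronger check is to count the rows in each one, sketched here with the standard-library `sqlite3` module. The helper `count_rows` is a hypothetical name; the default table name `table_1` is the one this notebook reads from the downloaded files.

```python
import sqlite3

def count_rows(filename, table="table_1"):
    """Return the number of rows in `table` of the SQLite file `filename`."""
    conn = sqlite3.connect(filename)
    try:
        # Table names cannot be bound as parameters, so the name is formatted in.
        query = 'SELECT COUNT(*) FROM "{}"'.format(table)
        return conn.execute(query).fetchone()[0]
    finally:
        conn.close()
```

If every block downloaded completely, `sum(count_rows(f) for f in database_filenames)` should equal `n_objects`.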

# Combine databases

In [None]:
import os, shutil
import glob
import pandas as pd

In [None]:
dfs = [pd.read_sql_table("table_1", "sqlite:///{}".format(database_filename),
                         index_col="object_id")
       for database_filename in database_filenames]
assert sum(df.shape[0] for df in dfs) == n_objects

combined = pd.concat(dfs)
assert combined.shape[0] == n_objects

del dfs
combined.head()


In [None]:
for filename in database_filenames:
    os.remove(filename)

In [None]:
combined.keys()

In [None]:
hsc_database_filename = "../HSC_COSMOS_median_forced.sqlite3"
hsc_database_filename_old = hsc_database_filename + ".old"

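# Back up any existing database so it can be restored if the write fails.
had_old_database = os.path.exists(hsc_database_filename)
if had_old_database:
    shutil.move(hsc_database_filename, hsc_database_filename_old)

try:
    combined.to_sql("hsc", "sqlite:///{}".format(hsc_database_filename))
except BaseException:
    # Roll back on any failure (including KeyboardInterrupt), then re-raise.
    if os.path.exists(hsc_database_filename):
        os.remove(hsc_database_filename)  # discard the partial file
    if had_old_database:
        shutil.move(hsc_database_filename_old, hsc_database_filename)
    raise
else:
    if had_old_database:
        os.remove(hsc_database_filename_old)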
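
Since one goal of this notebook is to make sure the final schema is correct, a quick check of the written file can help. The sketch below lists a table's columns with SQLite's `PRAGMA table_info`. The helper `table_columns` is a hypothetical name, and the default table name `hsc` is the one passed to `to_sql` above.

```python
import sqlite3

def table_columns(filename, table="hsc"):
    """Return the column names of `table` in the SQLite file `filename`."""
    conn = sqlite3.connect(filename)
    try:
        # PRAGMA table_info yields one row per column; the name is field 1.
        rows = conn.execute('PRAGMA table_info("{}")'.format(table)).fetchall()
    finally:
        conn.close()
    return [row[1] for row in rows]
```

Calling `table_columns(hsc_database_filename)` should return `object_id` (written from the DataFrame index) together with the columns selected in `sql_base`.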
