# Example notebook: Search data with CasJobs

SciServer Compute can talk to other components of SciServer through a series of <em>modules</em>, one for each component. This example notebook shows how to use the <strong><code>SciServer.CasJobs</code></strong> module to search for data from within your Python scripts. CasJobs is SciServer's data search tool; it lets you select data from any of our large hosted datasets, from your own uploaded datasets, or both.

You are welcome (encouraged!) to copy these examples into another folder and modify them to meet your needs. You can use them as a starting point to create your own scripts. Please do not edit this notebook directly, because your edits may be overwritten if changes to the SciServer modules require changes to these example notebooks.

To run the example Python scripts in this notebook, click in any of the Code cells below (the ones with the gray backgrounds). Click the play button at the top of the window (just below the menubar) to run the script, or press Shift-Enter. When you run a cell, its output will appear directly below the cell.

## Import modules

Like any Python modules, the SciServer modules must be imported before being used. The next code block first imports the SciServer modules you will need for this example notebook, then imports some other required modules. Comments in the code block explain what each module does. To learn how to import other modules, see the Python 3.5 import documentation (https://docs.python.org/3.5/reference/import.html), or the documentation of the module(s) you are trying to import.

In [2]:
import SciServer
from SciServer import CasJobs     # Communicate between SciServer Compute and CasJobs
print('Imported SciServer modules')

import pandas                                # data analysis tools
import numpy as np                           # numerical tools
from datetime import datetime, timedelta     # date and timestamp tools
from pprint import pprint                    # print human-readable output
print('Imported other needed modules')

Imported SciServer modules
Imported other needed modules


## Get help

At any point after the modules are imported, you can type "help (<em>name of module</em>)" to read the documentation for that module. This is true for all SciServer modules and most other modules as well. Try it below.

In [None]:
# Read the help document for the entire SciServer package
help(SciServer)

In [None]:
# Read the help document for the CasJobs module
help(CasJobs)

### Optional: create functions

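The next code block defines two optional convenience functions that are used later in this notebook: <code>better_createdate</code> converts a CasJobs table creation date into a Python datetime object, and <code>tables_formatted</code> prints a list of tables in a human-readable format.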

In [43]:
# These convenience functions will print your list of tables in human-readable format

def better_createdate(cjCreateDate):     # returns table create date as a datetime object
    createsec = cjCreateDate / 10000000  # Divide by 10 million to get seconds elapsed since 1 AD
    firstday = datetime(1, 1, 1, 0, 0)   # Save 1 AD as "firstday"
    created = firstday + timedelta(seconds=createsec)  # Get calendar date on which table was created
    return (created)

def tables_formatted(tables):   # better formatted printing of a list of table dictionaries (output of getTables)
# Prints the following information about each table in your MyDB (each table is a Python dictionary):
### Size: size of the table (in kB)
### Name: the name of the table
### Rows: the number of rows the table contains
### Date: the date of the table's creation, stored as the number of 100-nanosecond intervals elapsed since 1 AD

    tables = sorted(tables, key=lambda k: k['Name']) # alphabetize by table name
    for thistable in tables:
        print('Table name:\t',thistable['Name'])
        print('Rows:\t\t {:,.0f}'.format(thistable['Rows']))
        print('Size (kB):\t {:,.0f} '.format(thistable['Size']))
        betterdate = better_createdate(thistable['Date'])
        print('Created time:\t',betterdate.strftime('%Y-%m-%d %H:%M:%S'))
        print('\n')
        
print('Created functions')

Created functions
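
To see the arithmetic behind <code>better_createdate</code> in isolation, here is a minimal sketch that needs no CasJobs connection. It assumes, as the function above does, that a CasJobs date counts 100-nanosecond intervals since the start of 1 AD. The names <code>ticks_to_datetime</code> and <code>datetime_to_ticks</code> are hypothetical helpers written for this illustration; they are not part of the SciServer modules.

```python
from datetime import datetime, timedelta

TICKS_PER_SECOND = 10000000          # 10 million 100-nanosecond intervals per second
FIRST_DAY = datetime(1, 1, 1, 0, 0)  # start of 1 AD, the assumed zero point

def ticks_to_datetime(ticks):
    # Integer division by 10 converts 100-ns ticks to whole microseconds,
    # which avoids floating-point rounding on very large tick counts.
    return FIRST_DAY + timedelta(microseconds=ticks // 10)

def datetime_to_ticks(moment):
    # Inverse conversion, used here only to build a test value
    elapsed = moment - FIRST_DAY
    whole_seconds = elapsed.days * 86400 + elapsed.seconds
    return whole_seconds * TICKS_PER_SECOND + elapsed.microseconds * 10

example_time = datetime(2017, 6, 21, 13, 1, 20)   # an arbitrary example timestamp
example_ticks = datetime_to_ticks(example_time)
round_trip = ticks_to_datetime(example_ticks)
print(example_ticks, '->', round_trip.strftime('%Y-%m-%d %H:%M:%S'))
```

Converting a timestamp to ticks and back returns the original value, which is a quick way to check that the unit and the zero point are consistent.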


## What datasets can I search?

CasJobs allows you to search many different datasets, referred to as <strong>contexts</strong> (they are known as contexts so they can be described independently of the databases in which they are stored). Each context consists of one or more tables containing data or metadata related to a single aspect of the full dataset.

### Get a list of contexts

At the moment, the SciServer.CasJobs module does not have a function to list available contexts. The best way to see what contexts are available to you is to log in to <a href="http://skyserver.sdss.org/casjobs/" target="_blank">CasJobs</a> (link opens in a new window). Once you are logged in, you should see the Query page. Look for the <strong>Contexts</strong> dropdown menu toward the top left of the page, just above the big textbox. The values in that dropdown list show the contexts you can search, both directly in CasJobs and in Compute.

### Show data tables in a context

Once you know what context you want to search, you can use the <strong>CasJobs.getTables(context)</strong> function to show the data tables in that context. The Code cell below gives commands to list all tables in a context. Set the value of <em>this_context</em> to be the context you want to see. The function CasJobs.getTables(context) returns a list of Python dictionaries, one dictionary per table.

Each dictionary in the list contains the following information about one table:
<ul>
<li><em>Date:</em> the table's creation time, expressed as the number of 100-nanosecond intervals elapsed since the start of 1 AD</li>
<li><em>Name:</em> the name of the table</li>
<li><em>Rows:</em> the number of rows in the table</li>
<li><em>Size:</em> the size of the table in kilobytes</li>
</ul>
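
Because the result is an ordinary list of dictionaries, standard Python tools can summarize it. The sketch below uses a hand-written sample list rather than a live call to CasJobs; <code>sample_tables</code> and the 500-row threshold are illustrative assumptions, and the Date key is left out for brevity.

```python
# Hand-written sample in the same shape as the list described above.
# The Date key is omitted here for brevity; real results include it.
sample_tables = [
    {'Name': 'boss_martin', 'Rows': 2332836, 'Size': 66952},
    {'Name': 'GalaxyThumbs', 'Rows': 16, 'Size': 16},
    {'Name': 'apogeesegue', 'Rows': 794, 'Size': 64},
]

# Total space used by all tables, in kB
total_size_kb = sum(table['Size'] for table in sample_tables)

# Names of tables with at least 500 rows, largest first
big_tables = sorted(
    (table for table in sample_tables if table['Rows'] >= 500),
    key=lambda table: table['Rows'],
    reverse=True,
)
big_names = [table['Name'] for table in big_tables]

print('Total size (kB): {:,}'.format(total_size_kb))
print('Tables with 500+ rows:', big_names)
```

The same expressions should work on the list returned by CasJobs.getTables, since they rely only on the Name, Rows, and Size keys.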

The code cell gives two options for printing the list of tables: using Python's pprint library or using the tables_formatted(tables) convenience function defined above. The convenience function sorts the list of tables alphabetically by name and converts the dates to readable datetime values. Try uncommenting and commenting those lines to see both options.

In [53]:
this_context = "MyDB"    # Your MyDB
#this_context = 'dr14'   # SDSS Data Release 14

tables = CasJobs.getTables(context=this_context)
print('Tables in '+this_context+':\n')


#pprint(tables)   # Standard human-readable printing using Python's pprint module
tables_formatted(tables)  # Sorting and better printing using a convenience function

Tables in MyDB:

Table name:	 GalaxyThumbs
Rows:		 16
Size (kB):	 16 
Created time:	 2017-06-21 13:01:20


Table name:	 GalaxyThumbs2
Rows:		 48
Size (kB):	 16 
Created time:	 2017-06-16 08:45:05


Table name:	 GalaxyThumbs3
Rows:		 16
Size (kB):	 16 
Created time:	 2017-06-16 10:43:58


Table name:	 MyNewtable22
Rows:		 1
Size (kB):	 16 
Created time:	 2017-08-02 15:30:43


Table name:	 MyNewtable55
Rows:		 1
Size (kB):	 16 
Created time:	 2017-08-02 16:23:49


Table name:	 QuickResults
Rows:		 1
Size (kB):	 80 
Created time:	 2017-09-13 12:31:38


Table name:	 SkyServer_Book
Rows:		 2
Size (kB):	 16 
Created time:	 2013-06-19 14:32:14


Table name:	 apogeesegue
Rows:		 794
Size (kB):	 64 
Created time:	 2013-07-18 10:57:09


Table name:	 boss_martin
Rows:		 2,332,836
Size (kB):	 66,952 
Created time:	 2017-05-23 08:47:02


Table name:	 danadr12withgz
Rows:		 27,868
Size (kB):	 2,312 
Created time:	 2015-01-13 09:09:47


Table name:	 danadr12withgzall
Rows:		 188,473
Size (kB):	 15,43

## Submit a query

In [None]:
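# Execute a quick SQL query and store the results in a pandas dataframe.
# CasJobs_TestQuery and CasJobs_TestDatabase are defined in the test-case cell at the end of this notebook; run that cell first.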


df = CasJobs.executeQuery(sql=CasJobs_TestQuery, context=CasJobs_TestDatabase, format="pandas")
df

# Other options for return format (format=...):
# 'json': a JSON string containing the query results
# 'dict': a dictionary created from the JSON string containing the query results
# 'csv': a csv string
# 'readable': an object of type io.StringIO, which has the .read() method and wraps a csv string that can be passed into pandas.read_csv for example
# 'StringIO': an object of type io.StringIO, which has the .read() method and wraps a csv string that can be passed into pandas.read_csv for example
# 'fits': an object of type io.BytesIO, which has the .read() method and wraps the result in fits format
# 'BytesIO': an object of type io.BytesIO, which has the .read() method and wraps the result in fits format

# CasJobs.executeQuery(sql, context, format)
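
The non-pandas formats leave the parsing to you. As a rough sketch of what handling a CSV result could look like, the block below parses a hand-written string with Python's standard csv module. The contents of <code>csv_text</code> are an assumption standing in for what the toy query might return; the exact text produced by CasJobs may differ.

```python
import csv
import io

# Stand-in for a 'csv'-format result; the exact text CasJobs returns is an assumption
csv_text = "Column1,Column2\n4,5\n"

# Wrapping the string in io.StringIO gives a file-like object with .read(),
# similar in spirit to the 'StringIO' / 'readable' return formats
readable = io.StringIO(csv_text)

rows = [dict(row) for row in csv.DictReader(readable)]
print(rows)

# csv values arrive as strings; convert them explicitly when numbers are needed
numeric_rows = [{name: int(value) for name, value in row.items()} for row in rows]
print(numeric_rows)
```

For anything beyond a handful of rows, the pandas format used in the cell above is usually the more convenient choice, because the column types are inferred for you.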

In [None]:
#submit a job, which inserts the query results into a table in the MyDB database context. 
#Wait until the job is done and get its status.

def translate_status(status):
    # Map a numeric CasJobs job status code to a readable word
    status_words = {
        0: 'Ready',
        1: 'Started',
        2: 'Cancelling',
        3: 'Cancelled',
        4: 'Failed',
        5: 'Finished',
    }
    return status_words.get(status, 'Status not found!!!!!!!!!')

def jobDescriber(jobDescription):
    print('JobID: ',jobDescription['JobID'])
    print('Status: ',jobDescription['Status'],' (',translate_status(jobDescription['Status']),')')
    print('Message: ',jobDescription['Message'])
    print('Created_Table: ',jobDescription['Created_Table'])
    print('Rows: ',jobDescription['Rows'])
    wait = pandas.to_datetime(jobDescription['TimeStart']) - pandas.to_datetime(jobDescription['TimeSubmit'])
    duration = pandas.to_datetime(jobDescription['TimeEnd']) - pandas.to_datetime(jobDescription['TimeStart'])
    print('Wait time: ',wait.total_seconds(),' seconds')          # total_seconds() also counts whole days, unlike .seconds
    print('Query duration: ',duration.total_seconds(),' seconds')

    
thequery = CasJobs_TestQuery + 'into MyDB.' + CasJobs_TestTableName2

jobId = CasJobs.submitJob(sql=thequery, context="MyDB")
print('Submitting query:\n',thequery)
print('\n')
print('Job submitted with jobId = ',jobId)
print('\n')
jobDescription = CasJobs.waitForJob(jobId=jobId, verbose=True)
print('\n')
print('Information about the job:')
jobDescriber(jobDescription)
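
Waiting for a job amounts to checking its status repeatedly until it reaches a final state. The sketch below illustrates that general pattern with a simulated status source; it is not the implementation of CasJobs.waitForJob. The function <code>poll_until_done</code>, its parameters, and the fake status sequence are hypothetical, and treating codes 3, 4, and 5 as final is an assumption based on the status words in the cell above.

```python
import time

# Status codes assumed to mean the job will not change any further
TERMINAL_STATUSES = {3: 'Cancelled', 4: 'Failed', 5: 'Finished'}

def poll_until_done(get_status, interval=0.0, max_polls=100):
    # Call get_status() until it reports a terminal status or max_polls is reached
    for attempt in range(1, max_polls + 1):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status, attempt
        time.sleep(interval)   # pause between checks to avoid hammering the server
    raise TimeoutError('job did not finish within max_polls checks')

# Simulated job: Ready, Started, Started, then Finished
fake_statuses = iter([0, 1, 1, 5])
final_status, polls = poll_until_done(lambda: next(fake_statuses))
print(TERMINAL_STATUSES[final_status], 'after', polls, 'polls')
```

The cap on the number of checks keeps a stuck job from blocking the notebook indefinitely.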

In [None]:
bigtablename = 'hugetable'

verylongquery = 'select top 150000 *\n'
verylongquery += 'into ' + CasJobs_TestDatabase + '.'+ bigtablename + '\n'
verylongquery += 'from photoobjall'

#thequery = CasJobs_TestQuery + 'into MyDB.' + CasJobs_TestTableName2
thequery = verylongquery

print('Submitting query:\n',thequery)
print('\n')

jobId = CasJobs.submitJob(sql=thequery, context="DR14")

print('Job submitted with jobId = ',jobId)
print('\n')

jobDescription = CasJobs.waitForJob(jobId=jobId, verbose=True)
print('\n')

print('Information about the job:')
jobDescriber(jobDescription)
#pprint(jobDescription)
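
Building a long query by repeated string concatenation is easy to get wrong, for example by dropping a space or a line break. One alternative is a small helper that assembles the query from its parts. The function <code>build_select_into</code> below is a hypothetical helper written for this illustration, not part of the SciServer modules, and it inserts its arguments without any validation.

```python
def build_select_into(row_limit, source_table, dest_context, dest_table):
    # Assemble a "select top N * into <context>.<table> from <source>" query.
    # Inputs are inserted as-is, so only pass trusted values.
    lines = [
        'select top {:d} *'.format(row_limit),
        'into {}.{}'.format(dest_context, dest_table),
        'from {}'.format(source_table),
    ]
    return '\n'.join(lines)

query = build_select_into(150000, 'photoobjall', 'MyDB', 'hugetable')
print(query)
```

With these arguments the helper produces the same text as the concatenated query in the cell above.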

## Thank you!

Thanks for reviewing this SciServer example notebook. You can use this notebook as a template to develop your own notebooks, but please do so in a copy rather than in the original example notebook.
As you begin to use any of our SciServer modules in your own notebooks, consult the SciServer scripting documentation at http://www.sciserver.org/docs/sciscript-python/SciServer.html.

If you have questions, please email the SciServer helpdesk at sciserver-helpdesk@jhu.edu.

In [None]:
#get user schema info

casJobsId = CasJobs.getSchemaName()
print(casJobsId)

In [None]:
# Define some very simple test cases for the purposes of this example notebook

CasJobs_TestDatabase = "MyDB" # use your MyDB - normally you won't need to change this
CasJobs_TestQuery = "select 4 as Column1, 5 as Column2 " # toy query, simply returns two columns of constants: 4, 5.
CasJobs_TestTableName1 = "MyNewtable1"    # the name of a new table to be created (doesn't yet exist)
CasJobs_TestTableName2 = "MyNewtable37"    # the name of a new table to be created (doesn't yet exist)
CasJobs_TestTableCSV = u"Column1,Column2\n6,7\n"    # toy CSV file: 1 row x 2 columns of constants: 6, 7.
CasJobs_TestCSVFile = "SciScriptTestFile.csv"       # a more realistic CSV file; it must be present in this directory in the current container
CasJobs_TestFitsFile = "SciScriptTestFile.fits"     # a FITS file to upload

print('Test case variables set.')