<!--

    Gaia Data Processing and Analysis Consortium (DPAC) 
    Co-ordination Unit 9 Work Package 930, based on 
    original scripts provided by the Apache SW Foundation
    
    (c) 2005-2025 Gaia DPAC
    
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program.  If not, see <https://www.gnu.org/licenses/>.
    -->
    
This notebook illustrates a few tips and tricks that should aid users in interacting with the platform via the Zeppelin user interface.

Users should be aware that the platform runs on a shared compute cluster. Depending on activity elsewhere, cell execution may leave jobs in the "Pending" state for a while, and we kindly request that users be patient.

If, however, a running notebook cell becomes unresponsive (e.g. it starts running and never comes back) or behaves in unexpected ways (the Python interpreter can occasionally get tied in knots), as a last resort you can reset the interpreter using the "Interpreter binding" drop-down, available from the top-level cog icon at the head of the notebook (N.B. _not_ the individual cog icon to the immediate upper-right of this cell). Click on this top-level cog, then click on the circular arrows icon in the drop-down: this is the "Restart" button. Note that this will kill all currently executing jobs in your context and free up all memory, so you must re-establish the platform set-up in the resulting fresh Spark context by executing in PySpark

    import gaiadmpsetup

before doing anything else. (This is why we recommend that you include this line at the top of each notebook workflow: if the platform is already set up then this import does nothing, so there is no harm in including it and no penalty for importing it again when re-running the notebook from the top.)
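
Why a repeated import costs nothing can be seen from a small sketch of Python's module cache. The module name `demo_setup` and the log file are hypothetical stand-ins invented for this illustration; how `gaiadmpsetup` itself detects an existing set-up is not shown here and may differ.

```python
import importlib
import sys
import tempfile
from pathlib import Path

work_dir = Path(tempfile.mkdtemp())
log_file = work_dir / "runs.log"

# Hypothetical stand-in for a set-up module: its body appends one line
# to a log file every time the module body is executed.
module_source = (
    f"with open({str(log_file)!r}, 'a') as fh:\n"
    "    fh.write('set-up executed\\n')\n"
)
(work_dir / "demo_setup.py").write_text(module_source)
sys.path.insert(0, str(work_dir))

first = importlib.import_module("demo_setup")   # module body runs once here
second = importlib.import_module("demo_setup")  # served from sys.modules, body is not re-run

executions = log_file.read_text().splitlines()
print(len(executions), first is second)
```

The log holds a single entry even though the module was imported twice. After an interpreter restart the cache is empty, which is why the import has to be executed again in the fresh context.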

You can export data from the platform to your local desktop via your Zeppelin account home directory. In order to do this you need to have your public ssh key installed on the system. The best way to do this is to send us a copy of your public ssh key in an email and we will add it to the system in the correct location. If you are unsure how to find your public ssh key, send us an email and we will talk you through it.

To transfer a file out of the system, first copy the data file you wish to export into your /home or /user directory on the system. For example:

    %sh
    echo "my data" > /user/{YOUR-USERNAME}/data.txt

From your local machine (e.g. your own desktop or laptop) you can now either ssh into Zeppelin

    ssh {YOUR-USERNAME}@dmp.gaia.ac.uk
    
or you can copy your files from Zeppelin using scp (replace `data.txt` with the name of the file you want to download, and `/tmp/data` with the path on your local machine where you want to store it)

    scp {YOUR-USERNAME}@dmp.gaia.ac.uk:/user/{YOUR-USERNAME}/data.txt /tmp/data
    
In-memory data on the platform (for example, a results DataFrame) can be saved to a file for export, but be aware that a Spark DataFrame is a distributed data set. If you save such an object to disk directly you will get a large set of partition files reflecting the natural distribution of the underlying source data, which is neither convenient nor particularly friendly. Provided the data size is not too large, it is better to collect the distributed data into a non-distributed object on the driver. The easiest way to do this is to call the `toPandas` method of the DataFrame; the resulting Pandas object can then be saved in a convenient format (e.g. comma-separated values). In the following simple example, a DataFrame of the positions, parallaxes and magnitudes of all sources in the [Gaia Andromeda Photometric Survey (GAPS)](https://gea.esac.esa.int/archive/documentation/GDR3/Data_processing/chap_cu5pho/sec_cu5pho_gaps/) is created, collected to an intermediate non-distributed Pandas object, then saved to CSV:

In [3]:
%pyspark

# simple example of saving a results file to disk, e.g. prior to transfer off the platform to a user's local file system (see final paragraph in the description in the cell above)

# standard set-up
import gaiadmpsetup

# create an example data set - in this case a simple GAPS selection from gaia source
data_frame = spark.sql('SELECT ra, dec, parallax, parallax_error, phot_g_mean_mag, phot_bp_mean_mag, phot_rp_mean_mag FROM gaiadr3.gaia_source WHERE in_andromeda_survey')

# collect results to Pandas and save to csv in the user's home directory - substitute your username as appropriate
# data_frame.toPandas().to_csv(path_or_buf = '/user/{YOUR-USERNAME}/gapscat.csv', index = False)
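
If a DataFrame has already been written out in distributed form, the partition files can still be combined into one file before export. The following is a minimal sketch using only the Python standard library, not a platform-provided tool. It assumes Spark's usual `part-*` file naming, that each part file was written with a header row, and that the files sit on a file system readable from ordinary Python. The function name `merge_part_files` and the tiny fabricated part files are invented for the demonstration.

```python
import csv
import tempfile
from pathlib import Path

def merge_part_files(part_dir, merged_path):
    """Concatenate part-*.csv files into one CSV, keeping a single header row."""
    rows_written = 0
    header_written = False
    with open(merged_path, "w", newline="") as out:
        writer = csv.writer(out)
        for part in sorted(Path(part_dir).glob("part-*.csv")):
            with open(part, newline="") as src:
                reader = csv.reader(src)
                header = next(reader, None)
                if header is None:
                    continue  # empty partition file
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written

# Demonstration on fabricated data (made-up values, not real Gaia rows)
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "part-00000.csv").write_text("ra,dec\n10.1,41.2\n10.3,41.0\n")
(demo_dir / "part-00001.csv").write_text("ra,dec\n10.7,41.5\n")

merged = demo_dir / "merged.csv"
n_rows = merge_part_files(demo_dir, merged)
print(n_rows)
print(merged.read_text())
```

Rows are streamed one at a time, so memory use stays small even when the combined file is too large to collect comfortably with `toPandas`.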


## Interpreters

We recommend using the PySpark interpreter, since it gives access to large-scale distributed computing via the DataFrame application programming interface (API). Other Python interpreters are available, however: for light, non-distributed processing of relatively small data sets collected to the driver process in Zeppelin, it is possible to specify the plain Python or IPython interpreters (the latter is perhaps more familiar to Jupyter notebook users).

There are some differences in functionality between the interpreters available - these are illustrated in the following cells.

In [5]:
%spark.pyspark

import sys

help(sys)
# ... available also in python.ipython interpreter

In [6]:
%python.ipython
# ... this facility is not available in the pyspark interpreter

import sys

sys?
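
In interpreters where the `?` suffix is not available, much of the same information can be retrieved with the standard library. The sketch below applies `inspect` to `json.dumps`, chosen arbitrarily as an example target; it is one possible substitute rather than an exact equivalent of the IPython feature.

```python
import inspect
import json

# Call signature of the object, as introspected at run time
signature = inspect.signature(json.dumps)

# Cleaned-up docstring of the same object
docstring = inspect.getdoc(json.dumps)

print(f"json.dumps{signature}")
print(docstring.splitlines()[0])
```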

All the IPython magic functions are available in Zeppelin's IPython interpreter. Here is one example, `%timeit`; for the complete list of IPython magic functions, see the [IPython documentation](http://ipython.readthedocs.io/en/stable/interactive/magics.html).



In [8]:
%python.ipython
# ... available only in IPython interpreter

%timeit range(1000)
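
Outside the IPython interpreter, a comparable measurement can be made with the standard-library `timeit` module. This is a minimal sketch: unlike `%timeit`, which chooses its loop counts itself, the number of loops and repeats here are fixed, arbitrary choices.

```python
import timeit

loops = 100_000

# Repeat the measurement a few times and keep the fastest run,
# which is least affected by other activity on the machine.
timings = timeit.repeat("range(1000)", number=loops, repeat=5)
best_per_loop = min(timings) / loops

print(f"best of 5: {best_per_loop * 1e9:.1f} ns per loop")
```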


## Tab completion

Tab completion, especially for attributes, is a convenient way to explore the structure of any object you’re dealing with. Simply type `object_name.<TAB>` to view the object’s attributes. The following screenshot illustrates how tab completion works in the IPython interpreter; it also works in the PySpark interpreter.
![Tab completion in the IPython interpreter](https://user-images.githubusercontent.com/164491/34858941-3f28105a-f78e-11e7-8341-2fbfd306ba5b.gif "Tab completion")
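
The attribute listing that tab completion offers interactively can also be produced in code, which is handy when you want to filter or search it. A minimal sketch using the built-in `dir()` on a `pathlib.Path` object, chosen only as a familiar example:

```python
from pathlib import Path

obj = Path("/tmp/data")

# Public attribute names (those not starting with an underscore)
public_names = [name for name in dir(obj) if not name.startswith("_")]

# Narrow the list down, much like typing a few characters before pressing <TAB>
read_like = [name for name in public_names if name.startswith("read")]

print(read_like)
```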





## Use of the ZeppelinContext 

`ZeppelinContext` is a utility class which provides the following features:

* Dynamic forms
* Show DataFrame via built-in visualisation

The ZeppelinContext is addressed via the pre-loaded object instance `z` in the IPython and PySpark interpreters.


In [11]:
%python.ipython

# dynamic form
z.input(name='my_name', defaultValue='hello')

In [12]:
%python.ipython

import pandas as pd
df = pd.DataFrame({'name':['a','b','c'], 'count':[12,24,18]})

# visualise the data frame via the context built-in
z.show(df)

## Visualisation options

One big advantage of notebooks is that you can visualise data alongside your code and mark-down cells. [Matplotlib](https://matplotlib.org) is the premier Python plotting module available on this platform and it works in much the same way as in other familiar Python environments (but note that an explicit call to `show()` is not necessary: plot rendering is accomplished via a post-execute hook which tells Zeppelin to plot all currently open matplotlib figures after executing the rest of the paragraph). Saving a plot to a file on the platform is as simple as calling the pyplot `savefig()` function (see above for instructions on transferring files off the platform).



In [14]:
%pyspark

import matplotlib.pyplot as plt

plt.plot([1,2,3,4])
plt.ylabel('some numbers')

# to save the plot file use the savefig method, substituting your username as appropriate:
# plt.savefig('/user/{YOUR-USERNAME}/somenums.png')

To iteratively update a single plot, we can leverage Zeppelin's built-in Angular Display System. Currently this feature is only available in the `pyspark` interpreter, and only for raster (png and jpg) formats. To enable it, we must set a special `angular` flag to `True` in our configuration:


In [16]:
%pyspark

import matplotlib.pyplot as plt
plt.close() # Added here to reset the plot when rerunning the paragraph
z.configure_mpl(angular=True, close=False)
plt.plot([1, 2, 3], label=r'$y=x$')

# ... the following related cells are placed by the side of this one by adjusting their width via the cog icon in the top right of each.

In [17]:
%pyspark

plt.plot([3, 2, 1], label=r'$y=3-x$')


In [18]:
%pyspark

plt.xlabel(r'$x$', fontsize=20)
plt.ylabel(r'$y$', fontsize=20)

In [19]:
%pyspark

plt.legend(loc='upper center', fontsize=20)

In [20]:
%pyspark

plt.title('Inline plotting example', fontsize=20)

Pandas provides a high-level API for visualising Pandas data frames. It uses Matplotlib under the hood, so usage closely follows that of Matplotlib.

In [22]:
%python.ipython

import pandas as pd
import numpy as np

ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()

## Pandas User Defined Functions (a.k.a. vectorized UDFs)

A convenient feature of the PySpark SQL data frame API is the programmability afforded by user defined functions. Users who have found themselves limited by the small number of aggregate functions typically available in ADQL will find this feature particularly useful in scale-out usage scenarios. There are illustrations of the use of UDFs in the tutorial notebooks provided on this platform - see for example notebook 5, "Working with Gaia XP spectra". 

For further details see the [Apache Spark documentation for Pandas UDFs](https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html#pandas-udfs-a-k-a-vectorized-udfs).
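
To give a flavour of what such a function looks like, here is a hedged sketch (not taken from the tutorial notebooks). A Pandas UDF wraps an ordinary Python function that receives and returns `pandas.Series` batches, so the function can be written and checked on a small Pandas sample before it is applied at scale. The function name `absolute_g_mag` and the sample values are invented for illustration; the formula is the standard distance-modulus relation for a parallax in milliarcseconds, ignoring extinction.

```python
import numpy as np
import pandas as pd

def absolute_g_mag(phot_g_mean_mag: pd.Series, parallax: pd.Series) -> pd.Series:
    """Absolute G magnitude from apparent magnitude and parallax (mas), no extinction."""
    # Non-positive parallaxes have no defined logarithm: mask them so they yield NaN
    positive_parallax = parallax.where(parallax > 0)
    return phot_g_mean_mag + 5.0 * np.log10(positive_parallax) - 10.0

# Check the plain function on a small, made-up sample
sample = pd.DataFrame({
    "phot_g_mean_mag": [10.0, 15.0, 18.0],
    "parallax": [10.0, 1.0, -0.5],
})
result = absolute_g_mag(sample["phot_g_mean_mag"], sample["parallax"])
print(result.tolist())

# On the platform the same function could then be registered with Spark, e.g.
#   from pyspark.sql.functions import pandas_udf
#   absolute_g_mag_udf = pandas_udf(absolute_g_mag, returnType="double")
#   data_frame.withColumn("abs_g", absolute_g_mag_udf("phot_g_mean_mag", "parallax"))
```

Keeping the numerical logic in a plain function like this means it can be tested in the IPython interpreter on a collected sample, with the Spark registration added only once the results look right.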

