From 55e97d3d1416ccc80bf75fafeb1be33e41e3dc27 Mon Sep 17 00:00:00 2001 From: Dennis Orzikh Date: Thu, 20 Jul 2017 16:56:44 -0700 Subject: [PATCH] adding cosmo demo notebook --- ipnb examples/Cosmo8+Demo+Notebook.ipynb | 612 +++++++++++++++++++++++ 1 file changed, 612 insertions(+) create mode 100644 ipnb examples/Cosmo8+Demo+Notebook.ipynb diff --git a/ipnb examples/Cosmo8+Demo+Notebook.ipynb b/ipnb examples/Cosmo8+Demo+Notebook.ipynb new file mode 100644 index 0000000..a92d157 --- /dev/null +++ b/ipnb examples/Cosmo8+Demo+Notebook.ipynb @@ -0,0 +1,612 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For more info on Myria, access to the demo cluster, and setting up a cluster: http://myria.cs.washington.edu/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Connecting to Myria" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# myria-python functionality\n", + "from myria import *\n", + "\n", + "# myriaL cell functionality\n", + "%load_ext myria\n", + "\n", + "# connection for myria-python functionality\n", + "connection = MyriaConnection(rest_url='http://localhost:8753')\n", + "# same as: http://ec2-52-36-55-94.us-west-2.compute.amazonaws.com:8753\n", + "\n", + "# connection for myriaL cell functionality\n", + "%connect http://localhost:8753" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## MyriaL Basics" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First you would scan in whatever tables you want to use in that cell. \n", + "These are the tables visible in the Myria-Web datasets tab.\n", + "\n", + "```\n", + "R1 = scan(cosmo8_000970);\n", + "```\n", + "This put all of the data in the `cosmo8_000970` table into the relation `R1` which can now be queried with MyriaL\n", + "\n", + "```\n", + "R2 = select * from R1 limit 5;\n", + "```\n", + "Once we have a relation, we can query it, and store the result in a new relation.\n", + "Sometimes you just want to see the output of the cell you're running, or sometimes you want to store the result for later use. In either case, you have to `store` the relation that you want to see or store the output of, because otherwise Myria will optimize the query into an empty query.\n", + "```\n", + "store(R2, MyInterestingResult);\n", + "```\n", + "This will add `MyInterestingResult` to the list of datasets on Myria-Web. If you are running multiple queries and want to just see their results without storing multiple new tables, you can pick a name and overwrite it repeatedly:\n", + "```\n", + "%%query\n", + " ...\n", + " store(R2, temp);\n", + "...\n", + "query%%\n", + " ...\n", + " store(R50, temp);\n", + "```\n", + "\n", + "All statements need to be ended with a semicolon!\n", + "\n", + "Also, note that a MyriaL cell cannot contain any Python.\n", + "These cells are Python by default, but a MyriaL cell starts with `%%query` and can only contain MyriaL syntax." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%query\n", + "-- comments in MyriaL look like this \n", + "-- notice that the notebook highlighting still thinks we are writing python: in, from, for, range, return\n", + "\n", + "R1 = scan(cosmo8_000970);\n", + "R2 = select * from R1 limit 5;\n", + "R3 = select iOrder from R1 limit 5;\n", + "store(R2, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%query\n", + "-- there are some built in functions that are useful, just like regular SQL:\n", + "cosmo8 = scan(cosmo8_000970);\n", + "countRows = select count(*) as countRows from cosmo8;\n", + "store(countRows, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- lets say we want just the number of gas particles\n", + "cosmo8 = scan(cosmo8_000970);\n", + "c = select count(*) as numGas from cosmo8 where type = 'gas';\n", + "store(c, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- some stats about the positions of star particles\n", + "cosmo8 = scan(cosmo8_000970);\n", + "positionStats = select min(x) as min_x\n", + " , max(x) as max_x\n", + " , avg(x) as avg_x\n", + " , stdev(x) as stdev_x\n", + " , min(y) as min_y\n", + " , max(y) as max_y\n", + " , avg(y) as avg_y\n", + " , stdev(y) as stdev_y\n", + " , min(z) as min_z\n", + " , max(z) as max_z\n", + " , avg(z) as avg_z\n", + " , stdev(z) as stdev_z\n", + " from cosmo8\n", + " where type = 'star';\n", + "store(positionStats, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# we can also create constants in Python and reference them in MyriaL\n", + "low = 50000\n", + "high = 100000\n", + "destination = 'tempRangeCosmo8'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- we can reference Python constants with '@'\n", + "cosmo8 = scan(cosmo8_000970);\n", + "temps = select iOrder, mass, type, temp\n", + " from cosmo8\n", + " where temp > @low and temp < @high\n", + " limit 10;\n", + "\n", + "store(temps, @destination);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## User Defined Functions" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In MyriaL we can define our own functions that will then be applied to the results of a query. These can either be written in Python and registered with Myria or they can be written directly within a MyriaL cell (but not in Python).\n", + "\n", + "When registering a Python function as a UDF, we need to specify the type of the return value. The possible types are the `INTERNAL_TYPES` defined in `raco.types` as seen here\n", + "\n", + "Currently a function signature can't be registered more than once. In order to overwrite an existing registered function of the same signature, you have to use the Restart Kernel button in the Jupyter Notebook toolbar.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from raco.types import DOUBLE_TYPE\n", + "from myria.udf import MyriaPythonFunction\n", + "\n", + "# each row is passed in as a tupl within a list\n", + "def sillyUDF(tuplList):\n", + " row = tuplList[0]\n", + " x = row[0]\n", + " y = row[1]\n", + " z = row[2]\n", + " \n", + " if (x > y):\n", + " return x + y + z\n", + " else:\n", + " return z\n", + "\n", + "# A python function needs to be registered to be able to \n", + "# call it from a MyriaL cell\n", + "MyriaPythonFunction(sillyUDF, DOUBLE_TYPE).register()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# To see all functions currently registered\n", + "connection.get_functions()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- for your queries to run faster, its better to push the UDF to the smallest possible set of data\n", + "cosmo8 = scan(cosmo8_000970);\n", + "small = select * from cosmo8 limit 10;\n", + "res = select sillyUDF(x,y,z) as sillyPyRes from small;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- same thing but as a MyriaL UDF\n", + "def silly(x,y,z):\n", + " case\n", + " when x > y\n", + " then x + y + z\n", + " else z\n", + " end;\n", + " \n", + "cosmo8 = scan(cosmo8_000970);\n", + "res = select silly(x,y,z) as sillyMyRes from cosmo8 limit 10;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from raco.types import DOUBLE_TYPE\n", + "\n", + "def distance(tuplList):\n", + " # note that libraries used inside the UDF need to be imported inside the UDF\n", + " import math\n", + " row = tuplList[0]\n", + " x1 = row[0]\n", + " y1 = row[1]\n", + " z1 = row[2]\n", + " x2 = row[3]\n", + " y2 = row[4]\n", + " z2 = row[5]\n", + " \n", + " return math.sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2)\n", + "\n", + "MyriaPythonFunction(distance, DOUBLE_TYPE).register()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "print distance([(.1, .1, .1, .2, .2, .2)])\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eps = .0042" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- here I am trying to find all points within eps distance of a given point\n", + "-- in order to avoid the expensive UDF distance() call on every point in the data,\n", + "-- I first filter the points by a simpler range query that immitates a bounding box\n", + "cosmo8 = scan(cosmo8_000970);\n", + "point = select * from cosmo8 where iOrder = 68649;\n", + "cube = select c.* from cosmo8 as c, point as p\n", + " where abs(c.x - p.x) < @eps\n", + " and abs(c.y - p.y) < @eps\n", + " and abs(c.z - p.z) < @eps;\n", + "distances = select c.*, distance(c.x, c.y, c.z, p.x, p.y, p.z) as dist from cube as c, point as p;\n", + "res = select * from distances where dist < @eps;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "cosmo8 = scan(cosmo8_000970);\n", + "point = select * from cosmo8 where iOrder = 68649;\n", + "cube = select c.* from cosmo8 as c, point as p\n", + " where abs(c.x - p.x) < @eps\n", + " and abs(c.y - p.y) < @eps\n", + " and abs(c.z - p.z) < @eps;\n", + "onlyGases = select * from cube where type = 'gas';\n", + "distances = select c.*, distance(c.x, c.y, c.z, p.x, p.y, p.z) as dist from onlyGases as c, point as p;\n", + "res = select * from distances where dist < @eps;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There is also special syntax for user defined aggregate functions, which use all of the rows to produce a single output, like a Reduce or Fold function pattern:\n", + "\n", + "```\n", + "uda func-name(args) {\n", + " initialization-expr(s);\n", + " update-expr(s);\n", + " result-expr(s);\n", + "};\n", + "```\n", + "\n", + "Where each of the inner lines is a bracketed statement with an entry for each expression that you want to output." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- UDA example using MyriaL functions inside the UDA update line\n", + "def pickBasedOnValue2(val1, arg1, val2, arg2):\n", + " case\n", + " when val1 >= val2\n", + " then arg1\n", + " else arg2\n", + " end;\n", + " \n", + "def maxValue2(val1, val2):\n", + " case\n", + " when val1 >= val2\n", + " then val1\n", + " else val2\n", + " end;\n", + "\n", + "uda argMaxAndMax(arg, val) {\n", + " [-1 as argAcc, -1.0 as valAcc];\n", + " \n", + " [pickBasedOnValue2(val, arg, valAcc, argAcc),\n", + " maxValue2(val, valAcc)];\n", + " \n", + " [argAcc, valAcc];\n", + "};\n", + "\n", + "cosmo8 = scan(cosmo8_000970);\n", + "res = select argMaxAndMax(iOrder, vx) from cosmo8;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Previously when we wrote a UDF we expected the tuplList to only hold one row\n", + "# but UDFs that are used in a UDA could be given many rows at a time, so it is\n", + "# important to loop over all of them and keep track of the state/accumulator outside\n", + "# the loop, and then return the value that is expected by the update-expr line in the UDA.\n", + "from raco.types import LONG_TYPE\n", + "def pickBasedOnValue(tuplList):\n", + " maxArg = -1\n", + " maxVal = -1.0\n", + " \n", + " for tupl in tuplList:\n", + " value1 = tupl[0]\n", + " arg1 = tupl[1]\n", + " value2 = tupl[2]\n", + " arg2 = tupl[3]\n", + " if (value1 >= value2):\n", + " if (value1 >= maxVal):\n", + " maxArg = arg1\n", + " maxVal = value1\n", + " else:\n", + " if (value2 >= maxVal):\n", + " maxArg = arg2\n", + " maxVal = value2\n", + " return maxArg\n", + "\n", + "MyriaPythonFunction(pickBasedOnValue, LONG_TYPE).register()\n", + "\n", + "from raco.types import DOUBLE_TYPE\n", + "def maxValue(tuplList):\n", + " maxVal = -1.0\n", + " \n", + " for tupl in tuplList:\n", + " value1 = tupl[0]\n", + " value2 = tupl[1]\n", + " if (value1 >= value2):\n", + " if (value1 >= maxVal):\n", + " maxVal = value1\n", + " else:\n", + " if (value2 >= maxVal):\n", + " maxVal = value2\n", + " return maxVal\n", + "\n", + "MyriaPythonFunction(maxValue, DOUBLE_TYPE).register() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%query\n", + "-- UDA example using Python functions inside the UDA update line\n", + "uda argMaxAndMax(arg, val) {\n", + " [-1 as argAcc, -1.0 as valAcc];\n", + " \n", + " [pickBasedOnValue(val, arg, valAcc, argAcc),\n", + " maxValue(val, valAcc)];\n", + " \n", + " [argAcc, valAcc];\n", + "};\n", + "t = scan(cosmo8_000970);\n", + "s = select argMaxAndMax(iOrder, vx) from t;\n", + "store(s, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "-- of course, argMaxAndMax can be done much more simply:\n", + "c = scan(cosmo8_000970);\n", + "m = select max(vx) as mvx from c;\n", + "res = select iOrder, mvx from m,c where vx = mvx;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Working with multiple snapshots" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "On the Myria demo cluster we only provide cosmo8_000970, but on a private cluster we could load in any number of snapshots to look for how things change over time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "c8_000970 = scan(cosmo8_000970);\n", + "c8_000962 = scan(cosmo8_000962);\n", + "\n", + "-- finding all gas particles that were destroyed between step 000962 and 000970\n", + "c1Gases = select iOrder from c8_000962 where type = 'gas';\n", + "c2Gases = select iOrder from c8_000970 where type = 'gas';\n", + "\n", + "exist = select c1.iOrder from c1Gases as c1, c2Gases as c2 where c1.iOrder = c2.iOrder;\n", + "\n", + "destroyed = diff(c1Gases, exist);\n", + "store(destroyed, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%%query\n", + "c8_000970 = scan(cosmo8_000970);\n", + "c8_000962 = scan(cosmo8_000962);\n", + "\n", + "-- finding all particles where some property changed between step 000962 and 000970\n", + "res = select c1.iOrder\n", + " from c8_000962 as c1, c8_000970 as c2\n", + " where c1.iOrder = c2.iOrder \n", + " and c1.metals = 0.0 and c2.metals > 0.0;\n", + "store(res, garbage);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from IPython.display import HTML\n", + "HTML('''\n", + "To toggle on/off output_stderr, click here.''')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}