# Myria-Python & IPython

<img src="overview.png" style="height: 300px"/>

### To install `Myria-Python`:

```
git clone https://github.com/uwescience/myria-python
cd myria-python
sudo python setup.py install
```

### Or, with `pip`:

```
pip install myria-python
```



## 1. Connecting to Myria

In [90]:
from myria import *
import numpy

# Create a connection to a Myria EC2 cluster
connection = MyriaConnection(
    rest_url='http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8753',
    execution_url='http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8080')

In [48]:
connection

<myria.connection.MyriaConnection at 0x114177c50>

## 2. Myria: Connections, Relations, and Queries (and Schemas and Plans)

In [91]:
# How many datasets are there on the server?
print len(connection.datasets())

14


In [92]:
# Let's look at the datasets...
print connection.datasets()

[{u'created': u'2016-01-26T18:26:20.158Z', u'numTuples': 1, u'uri': u'http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8753/dataset/user-public/program-adhoc/relation-JustX', u'howPartitioned': {u'workers': [1, 2], u'pf': None}, u'queryId': 15, u'relationKey': {u'userName': u'public', u'relationName': u'JustX', u'programName': u'adhoc'}, u'schema': {u'columnNames': [u'a'], u'columnTypes': [u'LONG_TYPE']}}, {u'created': u'2016-01-26T22:00:47.032Z', u'numTuples': 17, u'uri': u'http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8753/dataset/user-public/program-adhoc/relation-TwitterCC', u'howPartitioned': {u'workers': [1, 2], u'pf': {u'numPartitions': None, u'index': 0, u'seedIndex': 0, u'type': u'SingleFieldHash'}}, u'queryId': 29, u'relationKey': {u'userName': u'public', u'relationName': u'TwitterCC', u'programName': u'adhoc'}, u'schema': {u'columnNames': [u'id', u'cnt'], u'columnTypes': [u'LONG_TYPE', u'LONG_TYPE']}}, {u'created': u'2016-01-26T21:56:37.551Z', u'numTuples': 
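The raw listing is hard to scan. As a minimal sketch, the metadata dictionaries can be condensed to one line per relation. The helper name `summarize` is hypothetical and not part of Myria-Python; the sample entries are copied from the output above and trimmed to the fields the helper reads.

```python
def summarize(datasets):
    """Return one 'user:program:relation (n tuples) [col:type, ...]' line per dataset."""
    lines = []
    for d in datasets:
        key = d['relationKey']
        schema = d['schema']
        columns = ', '.join(
            '{}:{}'.format(name, ctype)
            for name, ctype in zip(schema['columnNames'], schema['columnTypes']))
        lines.append('{}:{}:{} ({} tuples) [{}]'.format(
            key['userName'], key['programName'], key['relationName'],
            d['numTuples'], columns))
    return lines


# Two entries from the listing above, trimmed to the fields used here.
sample = [
    {'numTuples': 1,
     'relationKey': {'userName': 'public', 'programName': 'adhoc',
                     'relationName': 'JustX'},
     'schema': {'columnNames': ['a'], 'columnTypes': ['LONG_TYPE']}},
    {'numTuples': 17,
     'relationKey': {'userName': 'public', 'programName': 'adhoc',
                     'relationName': 'TwitterCC'},
     'schema': {'columnNames': ['id', 'cnt'],
                'columnTypes': ['LONG_TYPE', 'LONG_TYPE']}},
]

for line in summarize(sample):
    print(line)
```

With a live connection, `summarize(connection.datasets())` should work the same way, assuming every entry carries the keys shown in the output.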

### Three parts to a relation name:

In [93]:
# What's the name of the first relation?
name = connection.datasets()[0]['relationKey']
name

{u'programName': u'adhoc', u'relationName': u'JustX', u'userName': u'public'}
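The three parts are often written together as `user:program:relation`, which is how the compiled plans later in this notebook print relation keys (for example `public:adhoc:JustX`). The sketch below converts between the two forms. Both function names are hypothetical, and the `public`/`adhoc` defaults for a bare name are an assumption inferred from the outputs in this notebook, not a documented rule.

```python
def parse_relation_name(name, default_user='public', default_program='adhoc'):
    """Turn 'relation' or 'user:program:relation' into a relationKey dict.

    Assumption: a bare relation name belongs to public:adhoc, as the
    outputs in this notebook suggest.
    """
    parts = name.split(':')
    if len(parts) == 1:
        user, program, relation = default_user, default_program, parts[0]
    elif len(parts) == 3:
        user, program, relation = parts
    else:
        # This sketch deliberately rejects any other shape.
        raise ValueError('expected 1 or 3 colon-separated parts: %r' % name)
    return {'userName': user, 'programName': program, 'relationName': relation}


def qualified_name(key):
    """Format a relationKey dict as 'user:program:relation'."""
    return '{userName}:{programName}:{relationName}'.format(**key)


key = parse_relation_name('JustX')
print(qualified_name(key))
```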

In [95]:
# Let's upload a dataset...
query = MyriaQuery.submit(
'''T1 = load("https://goo.gl/YqKALA",csv(schema(a:int, b:int),skip=0));store(T1, TwitterK3, [a, b]);''', connection=connection)
query.status

u'SUCCESS'
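The one-line program above is easier to read when it is assembled from its parts. The following is a rough sketch of a string builder; `load_program` is a hypothetical helper, and treating the third argument of `store` as a list of partitioning columns is an assumption of this sketch.

```python
def load_program(url, columns, relation, partition_by=None, skip=0):
    """Build a MyriaL program that loads a CSV file and stores it as a relation.

    columns      -- list of (name, type) pairs, e.g. [('a', 'int')]
    partition_by -- optional column names passed as the third argument of
                    store(); assumed here to be partitioning columns
    """
    schema = ', '.join('{}:{}'.format(name, ctype) for name, ctype in columns)
    store_args = ['T1', relation]
    if partition_by:
        store_args.append('[{}]'.format(', '.join(partition_by)))
    return ('T1 = load("{url}", csv(schema({schema}), skip={skip}));\n'
            'store({args});').format(
                url=url, schema=schema, skip=skip, args=', '.join(store_args))


program = load_program('https://goo.gl/YqKALA',
                       [('a', 'int'), ('b', 'int')],
                       'TwitterK3',
                       partition_by=['a', 'b'])
print(program)
```

The resulting string can then be passed to `MyriaQuery.submit(program, connection=connection)` exactly as in the cell above.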

In [96]:
print query.status
print len(connection.datasets())
print connection.datasets()[-1]

SUCCESS
15
{u'created': u'2016-01-26T17:49:07.159Z', u'numTuples': -1, u'uri': u'http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8753/dataset/user-public/program-logs/relation-Sending', u'howPartitioned': {u'workers': [1, 2], u'pf': None}, u'queryId': 2, u'relationKey': {u'userName': u'public', u'relationName': u'Sending', u'programName': u'logs'}, u'schema': {u'columnNames': [u'queryId', u'subQueryId', u'fragmentId', u'nanoTime', u'numTuples', u'destWorkerId'], u'columnTypes': [u'LONG_TYPE', u'INT_TYPE', u'INT_TYPE', u'LONG_TYPE', u'LONG_TYPE', u'INT_TYPE']}}
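Note that the last entry printed above is `public:logs:Sending`, not the relation that was just stored, so list position does not appear to reflect recency. Looking a relation up by its key is more reliable. A minimal sketch follows; `find_dataset` is a hypothetical helper, and the sample entries are trimmed stand-ins for the dictionaries returned by `connection.datasets()`.

```python
def find_dataset(datasets, relation, user='public', program='adhoc'):
    """Return the metadata dict whose relationKey matches, or None."""
    wanted = {'userName': user, 'programName': program, 'relationName': relation}
    for d in datasets:
        if d['relationKey'] == wanted:
            return d
    return None


# Trimmed stand-ins for entries of connection.datasets().
sample = [
    {'numTuples': 1,
     'relationKey': {'userName': 'public', 'programName': 'adhoc',
                     'relationName': 'JustX'}},
    {'numTuples': -1,
     'relationKey': {'userName': 'public', 'programName': 'logs',
                     'relationName': 'Sending'}},
]

print(find_dataset(sample, 'JustX') is not None)
print(find_dataset(sample, 'Sending'))             # wrong program, so no match
print(find_dataset(sample, 'Sending', program='logs')['numTuples'])
```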


In [97]:
# Let's try another query...
query = MyriaQuery.submit(
'''T1 = scan(TwitterK);
  T2 = scan(TwitterK);
  Joined = [from T1, T2
            where T1.$1 = T2.$0
            emit T1.$0 as src, T1.$1 as link, T2.$1 as dst];
  store(Joined, TwoHopsInTwitter);
''', connection=connection)
query.status

u'SUCCESS'
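A two-hop path pairs an edge with a second edge that starts where the first one ends, in other words a self-join on `T1.$1 = T2.$0`. The plain-Python sketch below illustrates those semantics on a handful of hypothetical edges. It is not how Myria executes the join.

```python
from collections import defaultdict


def two_hops(edges):
    """Return (src, link, dst) for every pair of edges src->link and link->dst."""
    by_src = defaultdict(list)          # index the second copy of the relation
    for src, dst in edges:
        by_src[src].append(dst)
    return [(src, link, dst)
            for src, link in edges
            for dst in by_src.get(link, [])]


# Hypothetical edge list standing in for TwitterK.
edges = [(1, 2), (2, 3), (2, 4), (3, 1)]
print(two_hops(edges))
```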

### Setting a default connection

In [98]:
# Set the default connection for the session
MyriaRelation.DefaultConnection = connection

# Myria IPython Extensions

## 1. Loading the Extension

In [99]:
%load_ext myria

The myria extension is already loaded. To reload it, use:
  %reload_ext myria


## 2. Configuration Options

In [80]:
%config MyriaExtension

MyriaExtension options
--------------------
MyriaExtension.execution_url=<Unicode>
    Current: u'https://demo.myria.cs.washington.edu'
    Myria web API endpoint URL
MyriaExtension.language=<Unicode>
    Current: u'MyriaL'
    Language for Myria queries
MyriaExtension.rest_url=<Unicode>
    Current: u'https://rest.myria.cs.washington.edu:1776'
    Myria REST API endpoint URL
MyriaExtension.timeout=<Int>
    Current: 60
    Query timeout (in seconds)


The most important option is the query timeout:

In [100]:
%config MyriaExtension.timeout=120

## 3. Ambient Connection to Myria

View `connect` arguments:

In [101]:
%connect?

Connect to the EC2 cluster:

In [79]:
%connect http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8753 http://ec2-52-26-242-67.us-west-2.compute.amazonaws.com:8080
            
# This is just the IPython equivalent of setting the default MyriaConnection!

<myria.connection.MyriaConnection at 0x21206c250>

## 4. Executing Queries

In [102]:
%%query
-- Cross-match two relations based on coordinates.
-- Example in 2 dimensions of the input data.

-- Points in one relation are the same as in the other relation,
-- but perturbed by Gaussian noise of sigma = .00001. Using a matching
-- criterion (epsilon) of 2*sigma, we'd expect to recover around 95% of
-- the correct matches. A much higher criterion, say .001, would start
-- to match points incorrectly.

const sigma_noise: 0.00001;
const partition: 0.4;
const epsilon: 2 * sigma_noise;

def mod(x, n): x - int(x/n)*n;
def cell(v): int((v - mod(v, partition)) * (1/partition));
def is_ghost(xoffset, yoffset):
  case when xoffset = 0 and
            yoffset = 0 then 0 else 1 end;
def is_replicated(x, y, xoffset, yoffset):
  is_ghost(xoffset, yoffset) = 0 or
  cell(x + epsilon*xoffset) != cell(x) or
  cell(y + epsilon*yoffset) != cell(y);
def distance(x1, x2, y1, y2): sqrt((x1-x2)*(x1-x2) +
                                           (y1-y2)*(y1-y2));

pointsleft = load("https://s3-us-west-2.amazonaws.com/myria-sdss/crossmatch/pointsleft.txt",
              csv(schema(id:int,
                         x:float,
                         y:float,
                         z:float), skip=0));

pointsright = load("https://s3-us-west-2.amazonaws.com/myria-sdss/crossmatch/pointsright.txt",
              csv(schema(id:int,
                         x:float,
                         y:float,
                         z:float), skip=0));

permutations = load("https://s3-us-west-2.amazonaws.com/myria-sdss/crossmatch/permutations.txt",
                    csv(schema(xoffset:int,
                               yoffset:int), skip=0));

store(pointsleft, pointsleft);
store(pointsright, pointsright);

-- Partition into a grid with edges of size partition
-- Replicate any point that falls within epsilon of a partition boundary

partitionsleft = [from pointsleft, permutations
              where is_replicated(x, y, xoffset, yoffset)
              emit id, x, y,
                   cell(x) + xoffset as px,
                   cell(y) + yoffset as py,
                   is_ghost(xoffset, yoffset) as ghost];

store(partitionsleft, partitionsleft, [px, py]);

partitionsright = [from pointsright
              emit id, x, y,
                   cell(x) as px,
                   cell(y) as py,
                   0 as ghost];

store(partitionsright, partitionsright, [px, py]);


Unnamed: 0,ghost,id,px,py,x,y
0,1,82,14,13,6.201653,5.600008
1,1,82,15,13,6.201653,5.600008
2,0,3,15,13,6.357476,5.555078
3,0,4,22,22,8.881786,9.080743
4,0,7,15,18,6.300177,7.327647
5,0,10,13,14,5.221197,5.675361
6,0,11,11,6,4.688087,2.787459
7,0,13,14,15,5.922075,6.229038
8,0,14,11,19,4.64335,7.613819
9,0,17,18,19,7.388158,7.814237
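To see why point 82 appears twice with `ghost = 1`, the helper functions from the query can be mirrored in plain Python. This is only a sketch: it assumes that MyriaL's `int()` truncates like Python's, and that `permutations.txt` contains every offset pair in {-1, 0, 1} x {-1, 0, 1}. Neither is confirmed here.

```python
import itertools

SIGMA_NOISE = 0.00001
PARTITION = 0.4
EPSILON = 2 * SIGMA_NOISE

# Assumption: the permutations relation holds all nine offset pairs.
OFFSETS = list(itertools.product((-1, 0, 1), repeat=2))


def mod(x, n):
    # Assumes int() truncates toward zero, as in Python.
    return x - int(x / n) * n


def cell(v):
    """Index of the grid cell (edge length PARTITION) containing v."""
    return int((v - mod(v, PARTITION)) * (1 / PARTITION))


def is_ghost(xoffset, yoffset):
    return 0 if xoffset == 0 and yoffset == 0 else 1


def is_replicated(x, y, xoffset, yoffset):
    # Keep the home cell, plus any neighbor reached by moving epsilon.
    return (is_ghost(xoffset, yoffset) == 0
            or cell(x + EPSILON * xoffset) != cell(x)
            or cell(y + EPSILON * yoffset) != cell(y))


def replicas(x, y):
    """Return (px, py, ghost) for every grid cell that receives the point."""
    return [(cell(x) + dx, cell(y) + dy, is_ghost(dx, dy))
            for dx, dy in OFFSETS
            if is_replicated(x, y, dx, dy)]


print(replicas(6.357476, 5.555078))   # point 3 from the output above
print(replicas(6.201653, 5.600008))   # point 82 from the output above
```

Point 82 lies within epsilon of the boundary between rows 13 and 14, so it receives ghost copies in row 13, including the `(14, 13)` and `(15, 13)` rows visible in the output. Point 3 is far from any boundary and lands only in its home cell `(15, 13)`.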


In [103]:
%%query
-- calculate all pairs and filter by distance threshold
const sigma_noise: 0.00001;
const epsilon: 2 * sigma_noise;
def distance(x1, x2, y1, y2): sqrt((x1-x2)*(x1-x2) +
                                           (y1-y2)*(y1-y2));

partitionsleft = scan(partitionsleft);
partitionsright = scan(partitionsright);

-- Cross product on partition + ghost cells; no shuffle required
local = [from partitionsleft left,
              partitionsright right
         where left.px = right.px and
               left.py = right.py
         emit *];

store(local, local);

-- Calculate distances within each local pair and filter outliers
distances = [from local
             where ghost1 = 0 and -- The stable points must be ghost==0
                   distance(x, x1, y, y1) <= epsilon
             emit id as id1,
                  id1 as id2, -- ghost, ghost1, for debugging if necessary
                  distance(x, x1, y, y1) as distance];

store(distances, distances);


Unnamed: 0,distance,id1,id2
0,2.080836e-06,3,3
1,8.230525e-06,4,4
2,6.399734e-07,7,7
3,1.612142e-05,10,10
4,3.706306e-06,11,11
5,1.076368e-06,13,13
6,1.727197e-05,14,14
7,4.884389e-06,17,17
8,6.200257e-06,19,19
9,3.769274e-06,23,23
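The second query is a join on the grid cell followed by a distance filter. The sketch below expresses the same idea in plain Python, with a dictionary of buckets standing in for the co-partitioned join. The rows are hypothetical, and reading `ghost1` as the ghost flag of the right-hand relation is an assumption about how `emit *` renames duplicate columns.

```python
import math
from collections import defaultdict

EPSILON = 2 * 0.00001


def distance(x1, x2, y1, y2):
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)


def crossmatch(left_rows, right_rows):
    """Return (left id, right id, distance) for pairs within EPSILON."""
    # Bucket the right side by grid cell so only same-cell pairs are compared.
    buckets = defaultdict(list)
    for row in right_rows:
        buckets[(row['px'], row['py'])].append(row)

    matches = []
    for l in left_rows:
        for r in buckets.get((l['px'], l['py']), []):
            d = distance(l['x'], r['x'], l['y'], r['y'])
            # Assumption: ghost1 in the query is the right-hand ghost flag.
            if r['ghost'] == 0 and d <= EPSILON:
                matches.append((l['id'], r['id'], d))
    return matches


# Hypothetical rows: one true match and one distant point in the same cell.
left = [{'id': 1, 'x': 6.0, 'y': 5.0, 'px': 15, 'py': 12, 'ghost': 0}]
right = [{'id': 1, 'x': 6.00001, 'y': 5.0, 'px': 15, 'py': 12, 'ghost': 0},
         {'id': 2, 'x': 6.1, 'y': 5.0, 'px': 15, 'py': 12, 'ghost': 0}]

print(crossmatch(left, right))
```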


In [82]:
%%query
E = scan(TwitterK);
V = select distinct E.$0 from E;
CC = [from V emit V.$0 as node_id, V.$0 as component_id];
do
  new_CC = [from E, CC where E.$0 = CC.$0 emit E.$1, CC.$1] + CC;
  new_CC = [from new_CC emit new_CC.$0, MIN(new_CC.$1)];
  delta = diff(CC, new_CC);
  CC = new_CC;
while [from delta emit count(*) > 0];
comp = [from CC emit CC.$1 as id, count(CC.$0) as cnt];
store(comp, TwitterCC);


Unnamed: 0,cnt,id
0,3,498
1,2,443
2,378,12
3,1,724
4,1,877
5,1,975
6,1,419
7,3,395
8,5,510
9,1,507
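The `do ... while` loop above repeatedly pushes the smallest known component id along each edge until nothing changes. A plain-Python sketch of the same iteration follows. It assumes that `diff(CC, new_CC)` is a set difference, and, like the query, it propagates labels only from source to destination. The edge list is hypothetical.

```python
from collections import Counter


def connected_components(edges):
    """Map each node to the smallest component id that reaches it."""
    # V = distinct source nodes, each starting in its own component.
    cc = {src: src for src, _ in edges}
    while True:
        # Labels pushed along every edge, plus the existing labels.
        candidates = list(cc.items())
        candidates += [(dst, cc[src]) for src, dst in edges]
        # Group by node and keep the minimum component id.
        new_cc = {}
        for node, comp in candidates:
            new_cc[node] = min(comp, new_cc.get(node, comp))
        # Assumption: diff() yields the tuples of CC missing from new_CC.
        delta = set(cc.items()) - set(new_cc.items())
        cc = new_cc
        if not delta:
            return cc


edges = [(1, 2), (2, 3), (4, 5)]      # hypothetical stand-in for TwitterK
cc = connected_components(edges)
sizes = Counter(cc.values())          # component id -> number of nodes
print(cc)
print(sizes)
```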


In [104]:
# Grab the results of the most recent execution
query = _

In [73]:
query

Unnamed: 0,cnt,id
0,3,498
1,378,12
2,2,443
3,1,724
4,1,877
5,1,975
6,1,419
7,3,395
8,5,510
9,1,220


## 5. Plans and Delayed Execution

You can use the `%%plan` cell magic to compile a plan without executing it immediately:

In [74]:
%%plan 
T1 = scan(TwitterK);
T2 = [from T1 where $0 >= 999 emit $0];
store(T2, JustX);

{u'language': u'MyriaL',
 u'logicalRa': u'Store(public:adhoc:JustX)[Apply(a=$0)[Select(($0 >= 999))[Scan(public:adhoc:TwitterK)]]]',
 u'plan': {u'fragments': [{u'operators': [{u'opId': 0,
      u'opName': u'MyriaScan(public:adhoc:TwitterK)',
      u'opType': u'TableScan',
      u'relationKey': {u'programName': u'adhoc',
       u'relationName': u'TwitterK',
       u'userName': u'public'}},
     {u'argChild': 0,
      u'argPredicate': {u'rootExpressionOperator': {u'left': {u'columnIdx': 0,
         u'type': u'VARIABLE'},
        u'right': {u'type': u'CONSTANT',
         u'value': u'999',
         u'valueType': u'LONG_TYPE'},
        u'type': u'GTEQ'}},
      u'opId': 1,
      u'opName': u'MyriaSelect(($0 >= 999))',
      u'opType': u'Filter'},
     {u'argChild': 1,
      u'emitExpressions': [{u'outputName': u'a',
        u'rootExpressionOperator': {u'columnIdx': 0, u'type': u'VARIABLE'}}],
      u'opId': 2,
      u'opName': u'MyriaApply(a=$0)',
      u'opType': u'Apply'},
     {u'argChil

In [75]:
plan = _
result = MyriaQuery.submit_plan(plan).to_dataframe()
result

Unnamed: 0,a
0,999


# Myria in your own EC2 Cluster!

# Where to find more information:

#### Documentation
[Myria Website](http://myria.cs.washington.edu/)<br /> 
[Myria Python](http://myria.cs.washington.edu/docs/myriapython.html)<br /> 
[Additional Language Documentation](http://myria.cs.washington.edu/docs/myriaql.html)<br /> 
[This Notebook](https://github.com/uwescience/myria-python/blob/master/ipnb%20examples/myria%20examples.ipynb) 

#### Repositories
[Myria](https://github.com/uwescience/myria)<br /> 
[Myria-Python](https://github.com/uwescience/myria-python)<br /> 
[Myria-EC2](https://github.com/uwescience/myria-ec2)

#### Mailing List
[myria-users@cs.washington.edu](mailto:myria-users@cs.washington.edu)

#### IPython
[Homepage](http://ipython.org/)

#### Pandas/DataFrames
[Homepage](http://pandas.pydata.org/)