# Myria Python & IPython & StarCluster

<img src="overview.png" style="height: 300px"/>

### To install `Myria-Python`:

```
git clone https://github.com/uwescience/myria-python
cd myria-python
sudo python setup.py install
```

### Or:

```
pip install myria-python
```



## 1. Connecting to Myria

In [1]:
from myria import *
import numpy

# Create a connection to the Myria *Production* cluster
connection = MyriaConnection(
    rest_url='https://rest.myria.cs.washington.edu:1776',
    execution_url='https://myria-web.appspot.com')

In [2]:
# ... or create a connection to the Myria *demo* cluster
connection = MyriaConnection(
    rest_url='http://demo.myria.cs.washington.edu:8753')

## 2. Myria: Connections, Relations, and Queries (and Schemas and Plans)

In [3]:
# How many datasets are there on the server?
print len(connection.datasets())

25


In [4]:
# Let's look at the first dataset...
print connection.datasets()[0]

{u'created': u'2015-05-07T22:00:48.057Z', u'numTuples': 3, u'uri': u'http://demo.myria.cs.washington.edu:8753/dataset/user-Brandon/program-Demo/relation-Books', u'queryId': 141, u'relationKey': {u'userName': u'Brandon', u'relationName': u'Books', u'programName': u'Demo'}, u'schema': {u'columnNames': [u'name', u'pages'], u'columnTypes': [u'STRING_TYPE', u'LONG_TYPE']}}


### Three parts to a relation name:

In [4]:
# What's the name of the first relation?
name = connection.datasets()[0]['relationKey']
name

{u'programName': u'Demo', u'relationName': u'Books', u'userName': u'Brandon'}

In [None]:
# Compact form
MyriaRelation("Brandon:Demo:Books", connection=connection)

In [3]:
# Expanded form (used with many lower-level Myria utilities)
MyriaRelation({'userName':     'Brandon', 
               'programName':  'Demo',
               'relationName': 'Books'}, 
              connection=connection)

Unnamed: 0,name,pages
0,Brave New World,288
1,We,256
2,Nineteen Eighty-Four,376
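
The two forms carry the same three fields, so converting between them is mechanical. The sketch below is illustrative only: `parse_relation_name` and `compact_relation_name` are hypothetical helpers, not part of `myria-python`, and the `public`/`adhoc` defaults are an assumption based on how unqualified names such as `TwitterK` appear as `public:adhoc:TwitterK` in the plan output later in this notebook.

```python
def parse_relation_name(name, default_user='public', default_program='adhoc'):
    """Split 'user:program:relation' into the expanded dictionary form.

    Hypothetical helper; the defaults for a bare relation name are an assumption.
    """
    parts = name.split(':')
    if len(parts) == 3:
        user, program, relation = parts
    elif len(parts) == 1:
        user, program, relation = default_user, default_program, parts[0]
    else:
        raise ValueError('expected "user:program:relation" or "relation", got %r' % name)
    return {'userName': user, 'programName': program, 'relationName': relation}


def compact_relation_name(key):
    """Join an expanded relation key back into the compact form."""
    return ':'.join([key['userName'], key['programName'], key['relationName']])


expanded = parse_relation_name('Brandon:Demo:Books')
print(expanded['userName'])
print(compact_relation_name(expanded))
```

The first `print` shows `Brandon` and the second shows `Brandon:Demo:Books`.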


### Uploading data

In [None]:
# Uploading from a Python string
name = {'userName': 'Brandon', 'programName': 'Demo', 'relationName': 'Books'}
schema = { "columnNames" : ["name", "pages"],
           "columnTypes" : ["STRING_TYPE","LONG_TYPE"] }

data = """Brave New World,288
Nineteen Eighty-Four,376
We,256"""

result = connection.upload_file(
    name, schema, data, delimiter=',', overwrite=True)

MyriaRelation(result['relationKey'], connection=connection)
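
When the data starts out as Python tuples rather than a string, the schema dictionary and the delimited text can be built first. This is a minimal sketch: `infer_schema` and `rows_to_delimited` are hypothetical helpers, only the two column types that appear above (`STRING_TYPE`, `LONG_TYPE`) are handled, and fields containing the delimiter are rejected because how the server treats quoted fields is not covered here.

```python
def infer_schema(column_names, sample_row):
    """Build a schema dict from one sample row (ints -> LONG_TYPE, else STRING_TYPE)."""
    column_types = ['LONG_TYPE' if isinstance(value, int) else 'STRING_TYPE'
                    for value in sample_row]
    return {'columnNames': list(column_names), 'columnTypes': column_types}


def rows_to_delimited(rows, delimiter=','):
    """Serialize rows to delimited text, one tuple per line."""
    lines = []
    for row in rows:
        fields = [str(value) for value in row]
        for field in fields:
            if delimiter in field or '\n' in field:
                raise ValueError('field %r contains a delimiter or newline' % field)
        lines.append(delimiter.join(fields))
    return '\n'.join(lines)


books = [('Brave New World', 288),
         ('Nineteen Eighty-Four', 376),
         ('We', 256)]

inferred_schema = infer_schema(['name', 'pages'], books[0])
payload = rows_to_delimited(books)
print(inferred_schema['columnTypes'])
print(payload)
```

The resulting `inferred_schema` and `payload` have the same shape as the `schema` and `data` values passed to `connection.upload_file` in the cell above.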

In [None]:
# Want to upload a local file?  No problem...
another_name = {'userName': 'Brandon', 
                'programName': 'Demo', 
                'relationName': 'MoreBooks3'} # Name must be unique!

with open('books.csv') as f:
    connection.upload_fp(another_name, schema, f)
    
MyriaRelation(another_name, connection=connection)

In [None]:
# Or, load using the myria_upload command-line utility
!myria_upload --hostname demo.myria.cs.washington.edu --port 8753 --no-ssl --user Brandon --program Demo --relation Demo --overwrite books.csv

In [None]:
# Myria also supports uploading large datasets in parallel, by mapping one URL to one worker as follows:

demo_connection = MyriaConnection(
    rest_url='http://demo.myria.cs.washington.edu:8753')
schema = MyriaSchema(
    {"columnNames": ['Title','Url','BeginVolume','BeginYear',
                     'EndVolume','EndYear','Subject','Publisher'],
     "columnTypes": ["STRING_TYPE"] * 8 })
destination = MyriaRelation("Brandon:Demo:SomeJournals", 
                            schema=schema,
                            connection=demo_connection)


# CAVEAT: currently must have *exactly one URL per database worker*.  
#         This will change soon!
query = MyriaQuery.parallel_import(
            destination,
            [(1, 'http://nlist.inflibnet.ac.in/ejournals/American%20Institute%20of%20Physics.csv'),
             (2, 'http://nlist.inflibnet.ac.in/ejournals/American%20Physical%20Society.csv'),
             (3, 'http://nlist.inflibnet.ac.in/ejournals/Cambridge%20University%20Press.csv'),
             (4, 'http://nlist.inflibnet.ac.in/ejournals/Royal%20Society%20of%20Chemistry.csv')],
            )

query.to_dataframe()
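
Because of the one-URL-per-worker caveat, it can help to validate the pairing before submitting the import. The following is a sketch under that assumption; `assign_urls_to_workers` is a hypothetical helper, and the worker IDs 1 through 4 are simply the ones used in the cell above.

```python
def assign_urls_to_workers(worker_ids, urls):
    """Pair each worker with exactly one URL, or raise if the counts differ."""
    if len(set(worker_ids)) != len(worker_ids):
        raise ValueError('worker IDs must be unique')
    if len(worker_ids) != len(urls):
        raise ValueError('need exactly one URL per worker: %d workers, %d URLs'
                         % (len(worker_ids), len(urls)))
    return list(zip(worker_ids, urls))


base = 'http://nlist.inflibnet.ac.in/ejournals/'
urls = [base + 'American%20Institute%20of%20Physics.csv',
        base + 'American%20Physical%20Society.csv',
        base + 'Cambridge%20University%20Press.csv',
        base + 'Royal%20Society%20of%20Chemistry.csv']

pairs = assign_urls_to_workers([1, 2, 3, 4], urls)
for worker_id, url in pairs:
    print('%d -> %s' % (worker_id, url))
```

The returned list of `(worker, url)` tuples has the same structure as the second argument to `MyriaQuery.parallel_import` above.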

### Working with relations

In [4]:
# Using MyriaConnection to retrieve data:
MyriaRelation("Brandon:Demo:Books", connection=connection)

Unnamed: 0,name,pages
0,Nineteen Eighty-Four,376
1,Brave New World,288
2,We,256


In [5]:
relation = MyriaRelation("Brandon:Demo:Books", connection=connection)
print len(relation)
print relation.created_date
print relation.schema.names

3
2015-05-07 22:00:48.057000+00:00
[u'name', u'pages']


### Setting a default connection

In [None]:
# Who wants to have to type "connection=connection" all the time?!
MyriaRelation("Brandon:Demo:MoreBooks", connection=connection)

Instead, just set the default connection for the session:

In [6]:
# Set the default connection for the session
MyriaRelation.DefaultConnection = connection

# Now, anything we do will use that connection, and we don't have to specify it
MyriaRelation("Brandon:Demo:MoreBooks")

Unnamed: 0,name,pages
0,We,256
1,The Iron Heel,354
2,Nineteen Eighty-Four,376
3,Brave New World,288


### Working Locally with Relations

In [7]:
# We've seen this, which displays a relation:
MyriaRelation("Brandon:Demo:MoreBooks")

# But how do we actually USE the relation locally?

Unnamed: 0,name,pages
0,Nineteen Eighty-Four,376
1,Brave New World,288
2,The Iron Heel,354
3,We,256


In [8]:
# 1: Download as a list of Python dictionaries (one per tuple)
d = MyriaRelation("Brandon:Demo:MoreBooks").to_dict()
print 'First book returned: %s' % d[0]['name']
print d

First book returned: Brave New World
[{u'name': u'Brave New World', u'pages': 288}, {u'name': u'We', u'pages': 256}, {u'name': u'Nineteen Eighty-Four', u'pages': 376}, {u'name': u'The Iron Heel', u'pages': 354}]


In [9]:
# 2: Download as a Pandas DataFrame
df = MyriaRelation("Brandon:Demo:MoreBooks").to_dataframe()
print '%d books more than 300 pages' % len(df[df.pages > 300]) 

2 books more than 300 pages


In [10]:
# 3: Download as a DataFrame and convert to a numpy array
array = MyriaRelation("Brandon:Demo:MoreBooks").to_dataframe().as_matrix()
print 'Mean number of pages = %d' % array[:,1].mean()

Mean number of pages = 318
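
For small relations, the list-of-dictionaries form is enough to reproduce the DataFrame and numpy results without either library. The sketch below works on a literal copy of the records printed by `to_dict()` above, so it does not need a connection.

```python
# Literal copy of the records returned by to_dict() above
records = [
    {'name': 'Brave New World', 'pages': 288},
    {'name': 'We', 'pages': 256},
    {'name': 'Nineteen Eighty-Four', 'pages': 376},
    {'name': 'The Iron Heel', 'pages': 354},
]

# Equivalent of df[df.pages > 300]
long_books = [r['name'] for r in records if r['pages'] > 300]

# Equivalent of array[:,1].mean()
mean_pages = sum(r['pages'] for r in records) / float(len(records))

print('%d books more than 300 pages' % len(long_books))
print('Mean number of pages = %d' % mean_pages)
```

The exact mean is 318.5; the `%d` format truncates it to 318, which matches the numpy output above.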


## Working with queries

In [11]:
# Let's execute a query:

query = MyriaQuery.submit(
    """books = scan(Brandon:Demo:MoreBooks);
       longerBooks = [from books where pages > 300 emit name];
       store(longerBooks, Brandon:Demo:LongerBooks);""")

print query.status

SUCCESS


In [12]:
query.to_dataframe()

Unnamed: 0,name
0,The Iron Heel
1,Nineteen Eighty-Four


In [None]:
MyriaRelation("Brandon:Demo:LongerBooks")

# Myria IPython Extensions

## 1. Loading the Extension

In [13]:
%load_ext myria

## 2. Configuration Options

In [14]:
%config MyriaExtension

MyriaExtension options
--------------------
MyriaExtension.execution_url=<Unicode>
    Current: u'https://demo.myria.cs.washington.edu'
    Myria web API endpoint URL
MyriaExtension.language=<Unicode>
    Current: u'MyriaL'
    Language for Myria queries
MyriaExtension.rest_url=<Unicode>
    Current: u'https://rest.myria.cs.washington.edu:1776'
    Myria REST API endpoint URL
MyriaExtension.timeout=<Int>
    Current: 60
    Query timeout (in seconds)


The most important option is the query timeout (in seconds):

In [None]:
%config MyriaExtension.timeout = 120

## 3. Ambient Connection to Myria

View `connect` arguments:

In [None]:
%connect?

Connect to the production server:

In [15]:
%connect https://rest.myria.cs.washington.edu:1776 https://myria-web.appspot.com
            
# This is just the IPython equivalent of setting the default MyriaConnection!

<myria.connection.MyriaConnection at 0x7fac7ab6d2d0>

## 4. Executing Queries

In [18]:
%%query
OppData = scan(all_opp_v3);
VctData = scan(all_vct);

OppWithPop = select opp.*, vct.pop
             from OppData as opp,
                  VctData as vct
             where opp.Cruise = vct.Cruise
               and opp.Day = vct.Day
               and opp.File_Id = vct.File_Id
               and opp.Cell_Id = vct.Cell_Id;

PlanktonCount = select Cruise, count(*) as Phytoplankton
                from OppWithPop
                where pop != "beads" and pop != "noise"
                  and fsc_small > 10000;

store(PlanktonCount, public:demo:PlanktonCount);

Unnamed: 0,Cruise,Phytoplankton
0,Tokyo_3,6088514
1,Tokyo_2,1824845
2,Tokyo_4,1353062
3,Tokyo_1,316049


In [19]:
# Grab the results of the most recent execution
query = _
or_this_works_too = _18

In [20]:
query

Unnamed: 0,Cruise,Phytoplankton
0,Tokyo_2,1824845
1,Tokyo_3,6088514
2,Tokyo_1,316049
3,Tokyo_4,1353062


### Single-line queries may be treated like Python expressions

In [21]:
query = %datalog Just500(column0, 500) :- TwitterK(column0, 500)%
print query.status
query

SUCCESS


Unnamed: 0,_COLUMN1_,column0
0,500,499
1,500,498


## 5. Variable Binding

In [22]:
low, high, destination = 543, 550, 'BoundRelation'

Inside a query, the tokens `@low`, `@high`, and `@destination` are replaced by the values of the Python variables with the same names:

In [23]:
%%query
T1 = scan(TwitterK);
T2 = [from T1 where $0 > @low and $0 < @high emit $1 as x];
store(T2, @destination);

Unnamed: 0,x
0,610
1,16
2,53
3,20
4,21
5,989
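
The effect of the binding can be pictured as a textual substitution performed before the query is submitted. The sketch below only illustrates that effect: `bind_variables` is a hypothetical function, and the extension's actual mechanism is not shown in this notebook and may differ.

```python
import re


def bind_variables(query_text, namespace):
    """Replace each @name token with str(namespace[name]).

    Illustration only; not the extension's actual implementation.
    """
    def replace(match):
        name = match.group(1)
        if name not in namespace:
            raise KeyError('no value bound for @%s' % name)
        return str(namespace[name])

    return re.sub(r'@([A-Za-z_][A-Za-z0-9_]*)', replace, query_text)


template = """T1 = scan(TwitterK);
T2 = [from T1 where $0 > @low and $0 < @high emit $1 as x];
store(T2, @destination);"""

bound = bind_variables(
    template, {'low': 543, 'high': 550, 'destination': 'BoundRelation'})
print(bound)
```

The positional references `$0` and `$1` are left untouched because only tokens starting with `@` are matched.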


## 6. Plans and Delayed Execution

You can use the `%plan` magic to compile a plan without immediately executing it:

In [24]:
%%plan 
T1 = scan(TwitterK);
T2 = [from T1 where $0 >= 999 emit $0];
store(T2, JustX);

{u'language': u'MyriaL',
 u'logicalRa': u'Store(public:adhoc:JustX)[Apply(column0=$0)[Select(($0 >= 999))[Scan(public:adhoc:TwitterK)]]]',
 u'plan': {u'fragments': [{u'operators': [{u'opId': 0,
      u'opName': u'MyriaScan(public:adhoc:TwitterK)',
      u'opType': u'TableScan',
      u'relationKey': {u'programName': u'adhoc',
       u'relationName': u'TwitterK',
       u'userName': u'public'}},
     {u'argChild': 0,
      u'argPredicate': {u'rootExpressionOperator': {u'left': {u'columnIdx': 0,
         u'type': u'VARIABLE'},
        u'right': {u'type': u'CONSTANT',
         u'value': u'999',
         u'valueType': u'LONG_TYPE'},
        u'type': u'GTEQ'}},
      u'opId': 1,
      u'opName': u'MyriaSelect(($0 >= 999))',
      u'opType': u'Filter'},
     {u'argChild': 1,
      u'emitExpressions': [{u'outputName': u'column0',
        u'rootExpressionOperator': {u'columnIdx': 0, u'type': u'VARIABLE'}}],
      u'opId': 2,
      u'opName': u'MyriaApply(column0=$0)',
      u'opType': u'Apply'

In [26]:
plan = _
# Submit the compiled plan five times and sum the resulting DataFrames
result = 0
for i in xrange(5):
    result += MyriaQuery.submit_plan(plan).to_dataframe()
result

Unnamed: 0,column0
0,4995
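
A compiled plan is plain nested dictionaries and lists, so it can be inspected before it is submitted. The sketch below walks an abbreviated copy of the plan printed above (only the `opId`, `opName`, `opType`, and `argChild` keys of the three operators that are visible are kept); `list_operators` is a hypothetical helper.

```python
# Abbreviated copy of the plan output above; other keys are omitted
sample_plan = {
    'language': 'MyriaL',
    'plan': {'fragments': [{'operators': [
        {'opId': 0, 'opType': 'TableScan',
         'opName': 'MyriaScan(public:adhoc:TwitterK)'},
        {'opId': 1, 'opType': 'Filter', 'argChild': 0,
         'opName': 'MyriaSelect(($0 >= 999))'},
        {'opId': 2, 'opType': 'Apply', 'argChild': 1,
         'opName': 'MyriaApply(column0=$0)'},
    ]}]},
}


def list_operators(plan):
    """Return (opId, opType, argChild) for every operator in every fragment."""
    operators = []
    for fragment in plan['plan']['fragments']:
        for op in fragment['operators']:
            operators.append((op['opId'], op['opType'], op.get('argChild')))
    return operators


for op_id, op_type, child in list_operators(sample_plan):
    print('%d %s (input: %s)' % (op_id, op_type, child))
```

Each operator names its input through `argChild`, so the scan feeds the filter, which in turn feeds the apply.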


# Myria in your own Amazon Cluster!

## 1. Installing Myria-EC2 & StarCluster

You'll need AWS API keys before installing:

```
git clone https://github.com/uwescience/myria-ec2.git
cd myria-ec2
sudo python setup.py install
```



## 2. Cluster Configuration

In [27]:
!cat ~/.starcluster/myriacluster.config

[cluster myriacluster]
KEYNAME = bhaynesKey
CLUSTER_SIZE = 3
NODE_INSTANCE_TYPE = m1.large
#SPOT_BID = 0.08
DISABLE_QUEUE=True

PLUGINS = postgresplugin, myriaplugin
CLUSTER_USER = myriaadmin
DNS_PREFIX = True
NODE_IMAGE_ID = ami-765b3e1f
PERMISSIONS = rest, http

[plugin postgresplugin]
SETUP_CLASS = postgresplugin.PostgresInstaller
PORT = 5432

[plugin myriaplugin]
SETUP_CLASS = myriaplugin.MyriaInstaller
POSTGRES_PORT = 5432

#DBMS=sqlite
#PATH=/tmp/myria
#HEAP=-Xmx2g
#REST_PORT=8753
#MASTER_PORT=8001
#WORKER_PORT=9001
#ADDITIONAL_PACKAGES=yum
#REPOSITORY=https://github.com/uwescience/myria.git
#INSTALL_DIRECTORY=~/myria
#DATABASE_PASSWORD=myriaisawesome

[permission rest]
# this has to be the same as REST_PORT
IP_PROTOCOL = tcp
FROM_PORT = 8753
TO_PORT = 8753

[permission http]
IP_PROTOCOL = tcp
FROM_PORT = 80
TO_PORT = 80


### Important keys to consider modifying:
```
    CLUSTER_SIZE = 2
    NODE_INSTANCE_TYPE = m1.large
    SPOT_BID = 0.2
```
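
Before launching, the configuration can be sanity-checked from Python. The sketch below uses Python 3's standard `configparser` on a trimmed copy of the file shown above. `check_cluster_config` is a hypothetical helper, and treating 8753 as the default when `REST_PORT` is commented out is an assumption taken from the commented line in the file.

```python
import configparser

# Trimmed copy of ~/.starcluster/myriacluster.config from above
CONFIG_TEXT = """
[cluster myriacluster]
KEYNAME = bhaynesKey
CLUSTER_SIZE = 3
NODE_INSTANCE_TYPE = m1.large
#SPOT_BID = 0.08
PERMISSIONS = rest, http

[plugin myriaplugin]
SETUP_CLASS = myriaplugin.MyriaInstaller
#REST_PORT=8753

[permission rest]
IP_PROTOCOL = tcp
FROM_PORT = 8753
TO_PORT = 8753
"""


def check_cluster_config(text, cluster='myriacluster', default_rest_port=8753):
    """Summarize the cluster section and verify the REST port is opened."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = 'cluster %s' % cluster
    # Assumption: the commented-out REST_PORT value is the plugin default
    rest_port = parser.getint('plugin myriaplugin', 'REST_PORT',
                              fallback=default_rest_port)
    low = parser.getint('permission rest', 'FROM_PORT')
    high = parser.getint('permission rest', 'TO_PORT')
    if not low <= rest_port <= high:
        raise ValueError('REST port %d is outside the opened range %d-%d'
                         % (rest_port, low, high))
    return {'size': parser.getint(section, 'CLUSTER_SIZE'),
            'instance_type': parser.get(section, 'NODE_INSTANCE_TYPE'),
            'spot': parser.has_option(section, 'SPOT_BID'),
            'rest_port': rest_port}


summary = check_cluster_config(CONFIG_TEXT)
print(summary)
```

With `SPOT_BID` commented out, `spot` is `False`, meaning on-demand instances would be requested.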

### How much should I bid?

In [None]:
!starcluster spothistory -p m1.large

## 2. Launching Clusters

In [None]:
!starcluster start -c myriacluster MYCLUSTERNAME

## 3. Connecting to the Cluster via Python

In [28]:
!starcluster listclusters | grep MYCLUSTERNAME-master

StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

    myriademo-master running i-b564ef63 ec2-52-1-38-182.compute-1.amazonaws.com
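
The last field of the master line is the public hostname, and the REST API listens on the `REST_PORT` from the configuration (8753 here). A small sketch of turning that listing into a connection URL follows; `master_rest_url` is a hypothetical helper, and it assumes the four-column line layout shown in the output above.

```python
def master_rest_url(listing, port=8753):
    """Find the '<cluster>-master' line and build the REST URL from its hostname."""
    for line in listing.splitlines():
        fields = line.split()
        # Assumed layout: alias, state, instance ID, public DNS name
        if len(fields) == 4 and fields[0].endswith('-master'):
            return 'http://%s:%d' % (fields[3], port)
    raise ValueError('no master node found in listing')


listing = ('    myriademo-master running i-b564ef63 '
           'ec2-52-1-38-182.compute-1.amazonaws.com')
rest_url = master_rest_url(listing)
print(rest_url)
```

The printed URL, `http://ec2-52-1-38-182.compute-1.amazonaws.com:8753`, is the one used with `%connect` in the next cell.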


In [29]:
%connect http://ec2-52-1-38-182.compute-1.amazonaws.com:8753

<myria.connection.MyriaConnection at 0x7fac90a9ff10>

In [30]:
MyriaRelation("Just500")

Unnamed: 0,_COLUMN1_,column0
0,500,499
1,500,498
2,500,499
3,500,498
4,500,499
5,500,498
6,500,499
7,500,498


## 4. What about the Web Interface?

Go to <a href="http://ec2-52-1-38-182.compute-1.amazonaws.com" target="_blank">http://ec2-52-1-38-182.compute-1.amazonaws.com</a> in your browser (**substitute the EC2 domain name returned by `grep` above!**)

## 5. SSH into a Cluster

In [None]:
!starcluster sshmaster MYCLUSTERNAME

## 6. Terminating Clusters

In [None]:
!starcluster terminate MYCLUSTERNAME

# Where to find more information:

#### Documentation
[Myria Website](http://myria.cs.washington.edu/)<br /> 
[Myria Python](http://myria.cs.washington.edu/docs/myriapython.html)<br /> 
[Additional Language Documentation](http://myria.cs.washington.edu/docs/myriaql.html)<br /> 
[This Notebook](https://github.com/uwescience/myria-python/blob/master/ipnb%20examples/myria%20examples.ipynb) 

#### Repositories
[Myria](https://github.com/uwescience/myria)<br /> 
[Myria-Python](https://github.com/uwescience/myria-python)<br /> 
[Myria-EC2](https://github.com/uwescience/myria-ec2)

#### Mailing List
[myria-users@cs.washington.edu](mailto:myria-users@cs.washington.edu)

#### StarCluster
[Homepage](http://star.mit.edu/cluster/)

#### IPython
[Homepage](http://ipython.org/)

#### Pandas/Dataframes
[Homepage](http://pandas.pydata.org/)