PythonQL Intro and Tutorial

pavelvelikhov edited this page Apr 7, 2018 · 68 revisions

Welcome to PythonQL

Contents:

1. Installing and running PythonQL

2. First Queries

3. PythonQL vs Python's comprehensions

4. PythonQL vs SQL

5. The Query Expression

6. Joins

7. Grouping

8. Sorting

9. Path Expressions

10. Try/Except Expressions

11. Subqueries and Nesting

12. Pattern Matching

13. Using Windows

14. PythonQL and Pandas

15. PythonQL and SQL Databases

16. PythonQL and NoSQL Databases

17. PythonQL and Apache Spark & Hadoop

18. Query Hints

19. Debugging query plans

20. Help / Bugs / Suggestions


1. Installing and running PythonQL

The easiest way to install PythonQL is through Python's standard package manager. Just run pip install pythonql for Python2.7 and pip install pythonql3 for Python 3.x and you're set to go.

Once you have PythonQL installed, you can try running a simple query. Create a regular python script file test.py:

#coding: pythonql

print([ select y for x in [1,2,3] let y = x**2 ])

And now simply run it as you would run a regular python script:

python test.py

You should get an answer with squares of 1, 2, 3 :

[1,4,9]

2. First queries

Okay, once you got it all running, lets start making some queries. In PythonQL we took Python's comprehensions and extended them to do full-scale queries over whatever you give them. Let's do a recap on Python's comprehensions. The most common is the list comprehension and it goes something like this:

 [ f(x,y) for x in gen_1 for y in gen_2 if cond_1 ]

In this comprehension variable x iterates over some iterable gen_1, and for each value of x, y iterates over gen_2. Then the pairs of values of x and y are filtered with cond_1 in the if clause, and whatever remains is fed into a function f(x,y). The results of this function, applied to the x,y pairs is returned as a list to the user.

So let's build a list of tuples (x,y), where x is an even number from 1 to 8, y is an odd number from 1 to 7 and x>y:

result = [ (x,y) 
           for x in range(1,8) 
           for y in range(1,7) 
           if x % 2 == 0 and 
              y % 2 != 0 and 
              x > y ]

So this is a common Python comprehension. What we have done with PythonQL is we extended this concept with a number of powerful and convenient constructs that turn comprehensions into a full scale query language.

As you can see in the above comprehension - we already have the beginning of a query language: we construct the output tuple, we have two iterators (like the two FROM clauses in SQL), and we have a condition in the end (the WHERE clause).

Let's now write a PythonQL query, where x,y are again iterating over integers, but we will compute a sum of ys for each and we will return tuples with names in them. This query in PythonQL will be:

#coding: pythonql
result = [ (x, sum(y) as sum)
           for x in range(1,8), 
               y in range(1,7)
           if x % 2 == 0 and y % 2 != 0 and x > y
           group by x ]

As you can see we just a few slight modifications to the comprehension and the resulting query is still quite Pythonic in style. We have added a group by clause at the end of the query, where we group by variable x. All other variables (y in this case) are turned into lists, which contain all the values of these variables for the specific value of x. Then, in the result, we apply the sum function to y. And we make use of the simple notation for named tuples in the tuple constructor. The first tuple element is called x by default and we explicitly name the second element.

Let's do another simple case, before we jump into details of PythonQL. In this case we'll showcase the let clause. Let's again compute the sum of ys, but let's only output the tuples where this sum is odd. Let's do the group by as usual in this case, but let's add a let clause after it to compute the sum and another if clause to filter the sum. So our query becomes:

#coding: pythonql
result = [ select (x, sum_y)
           for x in range(1,8), 
               y in range(1,7)
           where x % 2 == 0 and y % 2 != 0 and x > y
           group by x
           let sum_y = sum(y)
           where sum_y % 2 != 0
           ]

We changed the if into where for the looks, if you wish you can keep using the if notation. We've added a let clause after the group by, which computes a new variable sum_y and assigns it the sum of ys. We added a filter on this sum with another where clause. Finally, we added the select keyword in the beginning: this is optional, you can construct the results without the select.

3. PythonQL vs. Python comprehensions

So, as we've seen above, PythonQL is a strict extension of Python comprehensions. Some queries that include grouping, sorting and other operations going on are impossible to do with pure comprehensions, but are easily done with PythonQL. Also, PythonQL includes other goodies, such as a try/except expression and path expressions.

The try/except is described later on, but the gist is that you can include a try/except in your query, which is very convenient when working with dirty data or using computations that can crash.

Path expressions are very convenient when working with nested JSON or XML data. They are especially useful when dealing with semi-structured data. They're also described later, after the main query language.

4. PythonQL vs. SQL

SQL is the standard and most well-known query language for databases. PythonQL has a lot of similarities to SQL, but we believe it is much easier to work with and cleaner than SQL.

The basics of the two languages are very similar:

  • select clause in SQL -> the constructor in PythonQL (also has an optional select keyword)
  • from clause in SQL -> the for clause in PythonQL
  • where clause in SQL -> the if or where clause of PythonQL
  • order by in SQL -> order by in PythonQL
  • group by in SQL -> group by in PythonQL
  • having clause in SQL -> this is just a regular if or where clause in PythonQL
  • window in SQL -> there is a window clause in PythonQL

There are of course a number of differences:

  • PythonQL is composable, the clauses can come in any order, as long as the first clause is a select and the next clause is some generator clause (for, let or window).
  • Expressions and subqueries can occur anywhere PythonQL.
  • The group by clause in PythonQL creates nested structures, while SQL's only creates entities for aggregation functions. So PythonQL's group by's can be nested.
  • The window clause in PythonQL is quite different from SQL's, our definition is equivalent to XQuery's definition which we believe is cleaner and easier to understand than SQL's.
  • PythonQL can operate on top of nested data and produce nested data, while SQL works on top of flat tables and SQL queries produce a list of tuples in the result.

So let's continue our PythonQL walkthrough!

5. The Query expression

Okay, we've seen the basics of the query expression in the examples above already. Let's dig a bit into the details. First of all, let's see what kind of query expressions there are. Since we modelled our queries after Python's comprehensions, the queries can occur exactly where the comprehension can be:

  • List comprehensions: This is what we've seen in the examples
  • Set comprehensions: Instead of a list, our queries will produce sets if we wrap them in {}
  • Map comprehensions: We can build queries that produce sets, just like Python's map comprehensions
  • Generators: Our queries can produce a lazy generator, we just wrap the query expression in ()

So let's just into what clauses are there in our query language and how they are combined to build queries:

  1. The first expression in the query, similar to the comprehensions, has to define what the output is. For example the (x, sum_y) as in the previous query.
  2. Then we need some generator, which can be a for, let or window clause. We have seen the for and let examples already, we'll cover window later in the Tutorial.
  3. After this, we can have a mix of for, let, where, group by, order by, window, and count clauses. This is a bit different from classical query languages like SQL, but this arbitrary composition makes the language a lot more user-friendly.

6. Joins

Join in a very common operation in data wrangling and Python's list comprehensions already take care of the inner join case. In fact the two previous examples had joins in them: we have two variable iterating over different generators, producing all possible pairs of values, and then we had a filter condition.

Let's look at the case of left outer join. In this case we are iterating over left object, and there might be zero matches with the right objects. But in the case of no match we still want to keep the left object, and bind the right variable to a None. We have a special helper function that does exactly that:

#coding: pythonql
from pythonql.helpers import outer

result = [ (x,y)
           for x in range(1,10)
           for y in outer([ select y 
                            for y in range(1,10) 
                            where x%2 == 0 and x > y ])
         ]

7. Grouping

Grouping in PythonQL is more natural than in SQL and follows the paradigms of OQL and XQuery. When the group by clause is run, all variables in the group by clause form the key of the group by. All the other variables will also appear in the results, but they will now hold the lists of values of these variables that occur with the group by key.

For example, the results of the following query is:

#coding: pythonql

res = [ select (x,y)
        for x in range(1,5),
            y in range(1,5)
        where x > y
        group by x ]

Will be:

[{"x": 2, "y": [1]},
 {"x": 3, "y": [1,2]},
 {"x": 4, "y": [1,2,3]}]

So there is nothing special in the group by clause, you can run other clauses right after it. We have already seen an example, where we had a let clause after group by, which computed a sum of values that were grouped under the key.

You can also group by expressions in the group by. And we have added a way to name these expressions, so that you can refer to the later in the query. For example let's take the query above and group by the squares of x. We can write such query as follows:

#coding: pythonql

res = [ select (x_squared, y)
        for x in range(1,5),
            y in range(1,5)
        where x > y
        group by x**2 as x_squared ]

8. Sorting

The order by clause in PythonQL works as expected, you can also order by expressions (but you can't give them names as in the group by clause). Here's a simple example:

#coding: pythonql

res = [ select (x,y)
        for x in range(1,5),
            y in range(1,5)
        where x > y
        order by abs(x-y) asc, y desc ]

9. Path Expressions

Path expressions are really handy for processing deeply nested and semi-structured data. We have taken only the most useful path expressions from XQuery and JSONiq. Basically all we have are:

  • Child path step: source ./ label
  • Descendant path step: source .// label

Where source is some expression that produces a nested structure and label is an expression that must evaluate to a string - it functions like a name test. Here's how they work.

Child step can be applied to either an xml element, a list or a dict structure.

  • In the case of an xml element it will select the children of the element whose tag matches the label.
  • In the case of a dict object (think of JSON's hashtable) it will select the value of a key that matches the label.
  • In the case of a list it will iterate over all objects in the list (if there are nested lists, it will recursively iterate over all of them) and match the objects as in the first two cases.

Here's as example of a small JSON structure and a child step:

#coding: pythonql

db = [ {"region": [{"box": [1,2], "label":"lake" },
                   {"box": {"box": [2,3]} },
                   {"region": {"box":[ 1,2], "label":"lake"} },
                   {"region": {"circle": [0.5,0.5,45], "label":"pond" }}] },
       {"region": {"box": [1,2], "label":"lake" }}]

print (list(db ./ 'region' ./ 'region' ))

The result of this query is a list of values of the regions at the second nesting level:

[{'box': [1, 2], 'label': 'lake'}, 
 {'circle': [0.5, 0.5, 45], 'label': 'pond'}]

Descendents step similarly can be applied to elements, maps and lists. They have exactly the same function as the child steps, except they traverse the whole structures, reaching all the descendants. Here is an example of a descendant step for the above dataset:

#coding: pythonql

db = [ {"region": [{"box": [1,2], "label":"lake" },
                   {"box": {"box": [2,3]} },
                   {"region": {"box":[ 1,2], "label":"lake"} },
                   {"region": {"circle": [0.5,0.5,45], "label":"pond" }}] },
       {"region": {"box": [1,2], "label":"lake" }}]

print (list(db .// 'box'))

The result of this step are all the values of key 'box', including one that contains a nested 'box':

[
 [1, 2], 
 [1, 2], 
 [1, 2], 
 {'box': [2, 3]}, 
 [2, 3]]

10. Try / Except Expressions

When working with semi-structured or nested data it really useful to have try/except expressions. Python only has try/except statement, hence Python's comprehensions can't throw and catch exceptions inside (they can only throw an exception, terminating the whole comprehension). So suppose you're processing data and most of the time (99%) its good, but the rest of the time it will break your query in many different ways. This is the perfect case for the try/except expression.

Let's see a very simple example: just summing up values in a list, and if some values are bad, treat them as zeros.

#coding: pythonql

data = [ 15, 25, None, 80, 34, "34", "twenty", [12], 54, 12]

res = sum([ select num 
        for item in data
        let num = try int(item) except 0 ])

11. Subqueries and nesting

As you might have already noticed, arbitrary expressions can be used in many places in PythonQL, including the binding sequences in the for, let, window clauses and in the select clause. So, naturally, anywhere an expression is used a nested query can be used as well. We have already seen some nested queries in examples above, for instance in the join examples.

But you can also create nested structures quite easily with PythonQL. For example, its quite easy to construct nested JSON objects. Lets iterate over some integers and build up sequence objects, where the sequence starts at each integer, and for each sequence we create nested item objects:

#coding: pythonql

res = [ select {"sequence_start": i,
                "sequence": [ select {"item":k}
                              for k in range(i,i+5) ]}
        for i in [1,3,5]]
print(res)

The output of this script will be:

[{'sequence': [{'item': 1}, {'item': 2}, {'item': 3}, {'item': 4}, {'item': 5}], 
  'sequence_start': 1}, 
 {'sequence': [{'item': 3}, {'item': 4}, {'item': 5}, {'item': 6}, {'item': 7}], 
  'sequence_start': 3}, 
 {'sequence': [{'item': 5}, {'item': 6}, {'item': 7}, {'item': 8}, {'item': 9}], 
  'sequence_start': 5}]

12. Pattern Matching

When working with nested and semi-structured data, sometimes its very convenient to match whole structures against input data. PythonQL provides a pattern matching mechanism with exact match (when you want the whole pattern to match exactly with the data) and partial match (say you just need to rip out a specific structure, but don't care if other fields are present in the data).

The pattern matching syntax also allows for extracting data into variables and specifying filter conditions right in the pattern. Let's create a simple example of pattern matching. We have a collection of people:

people = [ {'first':'John', 'last':'Doe', 'age':30,
             'address' : {'city':'San Jose', 'zipcode':92122}},
           {'first':'Daniela', 'last':'Kleene', 'age':40, 
             'address' : {'city':'San Jose', 'street':'Main', zipcode:92314}},
           {'first':'Jerry', 'last':'Lee', 'age':50, 'occupation':'Engineer', 
             'address' : {'city':'San Jose', zipcode:93213}}
]

There is some variation in structure among these records, Daniela has an extra street field, that other records don't contain. Jerry has an occupation field. Let's start with a partial match pattern first. The query above will match all the records and extract the last name and zipcode of all folks over 25 year old living in San Jose:

res = [ select (l, z)
        match {
         "last_name" : as l,
         "age" : as a where a > 25,
         "address" : {
             "city" : "San Jose",
             "zipcode" : as z
          }
        } as x in people ]

As you can see such pattern matching is pretty convenient, when working with nested JSON structures.

Now suppose we need to find records that have exactly the same fields as the pattern, the following pattern matching query will do the job:

res = [  select x
          match {
            "last_name" : as _,
            "first_name" : as _,
            "age" : as a where a > 25,
            "address" : {
               "city" : as _,
               "zipcode" : as _
            }
          } as x in people ]
]

This query will only match the first record in the people list. These types of queries can be very useful in a scenario, when you have a lot of incomplete records in your JSON database, and you need to query only the complete ones.

13. Using windows

PythonQL has a powerful window clause, its quite a bit different from SQL's window, but it has the same power and we believe its much cleaner to understand. On the top level, we have two types of windows in PythonQL: the sliding window and tumbling window. Sliding window is the usual window folks think of all the time - the overlapping windows. But sometimes you want the opposite - you want to operate on windows that don't share any elements with other windows. PythonQL uses tumbling windows for this case.

Lets start with a simple example of a window. Let's run through a list of integers with a fixed sliding window of size five or less (when approaching the end of the sequence) and just compute the sum of the window at each step:

#coding: pythonql

x = [1,2,3,4,5,6,7]

res = [ select (y,sum(w) as sum)
        for sliding window w in x
        start y at s when True
        end at e when e-s == 2 ]

print(res)

will result in:

[{"y":1,"sum ( w )":6},
 {"y":2,"sum ( w )":9}, 
 {"y":3,"sum ( w )":12}, 
 {"y":4,"sum ( w )":15}, 
 {"y":5,"sum ( w )":18}, 
 {"y":6,"sum ( w )":13},
 {"y":7,"sum ( w )":7}]

In this example we created a window variable w which binds to windows: each window is a list of objects (we could put tuples there for example, which is very convenient, and we'll show how to do it below). We control where the windows start and end using start and end window variables and conditions.

The start variables and condition are pretty trivial: y binds to the first object in the window, s binds to the position of this object in the original list and the condition on when the window should be started is True. This means we will start the window at each position.

The end variables and condition is used to figure out when to end the window. Again, variable e binds to the position of the end object, and we finish the current window when e-s==2, that is when the window reaches the size of 2.

Here we have used a simple condition on the positional variables of window's start and end. But we can place arbitrary conditions on the first, previous or following objects in the window as well. Let's start the windows only at even integers and end the window when the difference between the elements at the start and end of the window is greater than 2:

#coding: pythonql

x = [1,2,3,4,5,6,7]

res = [ select (y,sum(w) as sum)
        for sliding window w in x
        start y when y % 2 == 0
        end z when z-y > 2 ]

print(res)

The result will be:

[{"y":2,"sum":14},
 {"y":4,"sum":22},
 {"y":6,"sum":13}]

So PythonQL allows for current, previous, and following objects to be used in the start/end conditions, as well as variables that bind to the current position of the start/end of the window. Finally, in the end condition, the only keyword signifies that the window should only be created when the condition holds (in the other case, the window will be created when the end of the original sequence is reached).

Let's do a final example with a window over tuples, in this case the window really produces a sequence of tables. This could be useful for example in machine learning, where you want to train a model on a bunch of overlapping subsets of the original dataset:

#coding: pythonql

res = [ select (s, x)
        for sliding window x in
          (select (y,z)
           for (y,z) in [(1,1), (2,2), (3,3),
                         (4,4), (5,5), (6,6)])
        start s when s.y % 2 == 1
        only end e when e.y-s.y >= 2 ]

print(res)

Will produce:

[{"s":{"y":1,"z":1},"x":[ {"y":1,"z":1}, {"y":2,"z":2}, {"y":3,"z":3} ]}, 
 {"s":{"y":3,"z":3},"x":[ {"y":3,"z":3}, {"y":4,"z":4}, {"y":5,"z":5} ]}]

14. PythonQL and Pandas

So far we have minimal integration of PythonQL with pandas, but it's already quite convenient.

We have defined two functions in the pythonql.pandas_lib package:

to_df( query_result, columns=[] )

Converts the result of a PythonQL query to a pandas DataFrame. If the result of the query is a list of named tuples, the DataFrame is created with the corresponding column names.

wrap_df( dataFrame )

Wraps a pandas DataFrame, so that PythonQL queries can operate upon it and use convenient dot notation to access fields of the records. Here is an example:

#coding: pythonql
from pythonql.pandas_lib import wrap_df

df = someFunctionToCreateDataFrame()

res = [ select (t.a as a, t.b as b)
        for t in wrap_df(df)
        where t.c > 10 ]

15. PythonQL and SQL Databases

Currently we only have a PostgreSQL wrapper, if you want to query other databases, you can use SQLAlchemy to get data from the database into an iterator, and then PythonQL can run on top of such iterator. We have a convenience function wrap_tuples(tuples,columns) that wraps a tuple iterator and provides a list of tuples with names. So you can do queries like:

from pythonql.helpers import wrap_tuples

res = [ select (t.name, t.salary) 
        from wrap_tuples(tuple_source, ['name','salary']) ]

If you're using PostgreSQL, we have a much more powerful solution for you. You need to use SQL Alchemy to specify the engine and create a PostgreSQL source (a table). Then you can use this source in your PythonQL query and parts of the query that can be handled by PostgreSQL will be pushed into it. Here is an example:

#coding: pythonql

# Import the create_engine functions from SQL Alchemy and our PostgreSQL source
from sqlalchemy import create_engine
from pythonql.sources.postgresql import PostgresTable

# Create the database engine
engine = create_engine('postgresql://localhost:5432/test')

# Specify the tables that will be used in the query. You can also specify which
# schema the tables are in (public by default).

table1 = PostgresTable(engine,table_name='client')
table2 = PostgresTable(engine,table_name='product',schema_name='products')

# Use the tables in your query:

res = [select (c,p,pr)
       for c in table1
       for p in table2
       let pr = p.price
       where c.id == p.client_id
     ]

In the case above the entire join will be pushed to the PostgreSQL database and only tuple construction will happen in the Python runtime of PythonQL.

16. PythonQL and NoSQL Databases

Again, we haven't built wrappers for NoSQL databases yet, so you need to get the data out yourself, then PythonQL will run on top of the data.

17. PythonQL and Apache Spark & Hadoop

We're planning to support Apache Spark in the future, including running queries on top of HDFS files. Actually, we're considering Apache Spark to be the only real way to massively parallelize PythonQL queries in the future.

18. Query Hints

We're planning to build a full optimizer for PythonQL, but this effort will take quite some time to materialize. In the meanwhile (and possible as a future workaround for plans that the optimizer won't get right) we are adding query hints that let you control how the PythonQL query plans will be executed. Currently we provide join implementation and direction hints. Since joins are specified via conditions on tuple variables, so are hints. Here is an example on how to specify an index join, in which the left relation is scanned and the right one is indexed:

#coding: pythonql

res = [select (x.id,y.id)
       for x in table1
       for y in table2
       where x.id == y.id and hint('index','x','y')
     ]

If we want to scan over the right side and build an index on the left side, we reverse the order of join variables in the hint:

#coding: pythonql

res = [select (x.id,y.id)
       for x in table1
       for y in table2
       where x.id == y.id and hint('index','y','x')
     ]

If we don't want to use an index join, but want a nested loops join with right-hand side being scanned once and left-hand side being scanned once for each tuple of the right relation, we can write the following query:

#coding: pythonql

res = [select (x.id,y.id)
       for x in table1
       for y in table2
       where x.id == y.id and hint('nl','y','x')
     ]

We'll be introducing more hints as we build more implementation of operators in PythonQL.

19. Debugging query plans

When you are not getting the desired behaviour from your queries, it might be useful to look at the original and optimized query plans to see what you're actually running. PythonQL has the facilities to do just that:

#coding: pythonql

from pythonql.debug import Debug
Debug().print_program()  # prints the original plans with the entire PythonQL script
Debug().print_optimized=True # prints optimized plans

res = [select (x.id,y.id)
       for x in table1
       for y in table2
       where x.id == y.id
     ]

20. Help/Bugs/Suggestions

We have a mailing list running on Google groups where you can ask any questions, report bugs or make suggestions: https://groups.google.com/forum/#!forum/pythonql

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.
Press h to open a hovercard with more details.