## Joins

The movie dataset comes in multiple parts (users, movies, and ratings), and many natural questions span more than one of them:

- What is the mean rating per genre?
- What is the favorite movie for each occupation?
- What genres are most preferred by women vs men?
 
Joins let us combine multiple datasets to answer questions like these.

In [None]:
import hail as hl
import seaborn

hl.utils.get_movie_lens('data/')

users = hl.read_table('data/users.ht')
movies = hl.read_table('data/movies.ht')
ratings = hl.read_table('data/ratings.ht')

To understand joins in Hail, we need to revisit one of the crucial properties of `Table`s: the key.

A `Table` has an ordered list of fields known as the key. `describe` shows the key alongside the table's fields.

In [None]:
users.describe()

`key` is a struct expression of all of the key fields.

In [None]:
users.key.describe()

Keys need not be unique or non-missing, although in many applications they will be both.


Hail’s join syntax is most easily understood through an example.

In [None]:
t1 = hl.Table.parallelize([
    {'a': 'foo', 'b': 1},
    {'a': 'bar', 'b': 2},
    {'a': 'bar', 'b': 2}],
    hl.tstruct(a=hl.tstr, b=hl.tint32),
    key='a')
t2 = hl.Table.parallelize([
    {'t': 'foo', 'x': 3.14},
    {'t': 'bar', 'x': 2.78},
    {'t': 'bar', 'x': -1},
    {'t': 'quam', 'x': 0}],
    hl.tstruct(t=hl.tstr, x=hl.tfloat64),
    key='t')

In [None]:
t1.show()

In [None]:
t2.show()

In [None]:
j = t1.annotate(t2_x = t2[t1.a].x)
j.show()

The magic of keys is that they turn tables into maps: `table[expr]` refers to the row of `table` whose key equals the value of `expr`. Note: if more than one row has that key, one of them is chosen arbitrarily.

Here's a subtle bit: if `expr` is an expression indexed by row of `table2`, then `table[expr]` is also an expression indexed by row of `table2`.

Also note that while they look similar, `table['field1']` and `table[table2.key]` do very different things: the first selects a field of `table`, while the second looks up rows of `table` by the key of `table2`.
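To make the "tables as maps" idea concrete, here is a plain-Python sketch (not Hail code) of what `t1.annotate(t2_x = t2[t1.a].x)` computes for the two small tables above. The names `t1_rows`, `t2_rows`, `t2_by_key`, and `lookup_x` are made up for this illustration, and keeping the first row for a duplicated key is an assumption of the sketch: Hail only says that one matching row is chosen arbitrarily.

```python
# Plain-Python model (not Hail code) of t1.annotate(t2_x = t2[t1.a].x).
t1_rows = [
    {'a': 'foo', 'b': 1},
    {'a': 'bar', 'b': 2},
    {'a': 'bar', 'b': 2},
]
t2_rows = [
    {'t': 'foo', 'x': 3.14},
    {'t': 'bar', 'x': 2.78},
    {'t': 'bar', 'x': -1.0},
    {'t': 'quam', 'x': 0.0},
]

# Index t2 by its key field. For a duplicated key this sketch keeps the
# first row it sees (an assumption; Hail picks one arbitrarily).
t2_by_key = {}
for row in t2_rows:
    t2_by_key.setdefault(row['t'], row)


def lookup_x(key):
    """Return x from the t2 row with this key, or None (missing) if absent."""
    row = t2_by_key.get(key)
    return None if row is None else row['x']


# One output row per row of t1: the result is indexed by t1, not by t2,
# so the 'quam' row of t2 never appears.
joined = [{**row, 't2_x': lookup_x(row['a'])} for row in t1_rows]

for row in joined:
    print(row)
```

The result always has exactly as many rows as `t1`, which is why `t2[t1.a]` can be used inside `t1.annotate`.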

In [None]:
t1

In [None]:
t2[t1.a].describe()

Now let's use joins to compute the average movie rating per genre.

In [None]:
t = (ratings.group_by(ratings.movie_id) 
     .aggregate(rating = hl.agg.mean(ratings.rating)))
t.describe()

In [None]:
# now join in the movie genre
t = t.annotate(genres = movies[t.movie_id].genres)
t.describe()

In [None]:
t.show()

## Explode

Now we want to group by genres, but they're packed up in an array.  To unpack the genres, we can use [explode](https://hail.is/docs/devel/hail.Table.html#hail.Table.explode).  `explode` creates a new row for each element in the value of the field, which must be a collection (array or set).

In [None]:
t = t.explode(t.genres)
t.show()
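As a rough mental model of `explode`, the following plain-Python sketch (not Hail code) copies each row once per element of a collection field. The helper `explode` and the three sample rows are hypothetical, invented for this illustration. Rows whose collection is empty or missing produce no output rows, which to my understanding matches Hail's behavior.

```python
def explode(rows, field):
    """Yield one copy of each row per element of row[field].

    Empty or missing (None) collections yield no rows.
    """
    for row in rows:
        for element in row[field] or []:
            # Replace the collection with a single element of it.
            yield {**row, field: element}


# Hypothetical sample data, shaped like the table above.
rows = [
    {'movie_id': 1, 'rating': 4.0, 'genres': ['Comedy', 'Drama']},
    {'movie_id': 2, 'rating': 3.0, 'genres': ['Drama']},
    {'movie_id': 3, 'rating': 5.0, 'genres': []},
]

exploded = list(explode(rows, 'genres'))
for row in exploded:
    print(row)
```

After exploding, `genres` holds a single genre per row, so it can be used directly as a grouping key.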

In [None]:
t = (t.group_by(t.genres)
     .aggregate(rating = hl.agg.mean(t.rating)))
# save the intermediate result
t = t.cache()
t.show(n=100)

## Ordering

We can sort tables using [order_by](https://hail.is/docs/devel/hail.Table.html#hail.Table.order_by).  The default order is ascending; wrap a field in `hl.asc` or `hl.desc` to control the direction.

In [None]:
t = t.order_by(hl.desc(t.rating))
t.show(n=100)

`Table`s also have a SQL-style inner/left/right/outer [join](https://hail.is/docs/devel/hail.Table.html#hail.Table.join) method.

SQL-style joins for `MatrixTable` are coming soon.
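To see how a SQL-style join differs from the `table[expr]` lookup used earlier, here is a plain-Python sketch of inner and left join semantics (not Hail code). The helper `sql_join` and the sample rows are hypothetical. The sketch follows standard SQL behavior, where duplicate keys on the right produce one output row per match instead of one arbitrarily chosen row.

```python
def sql_join(left, right, left_key, right_key, how='inner'):
    """Join two lists of dict rows on a key; `how` is 'inner' or 'left'."""
    if how not in ('inner', 'left'):
        raise ValueError("how must be 'inner' or 'left'")

    # Non-key fields contributed by the right side.
    right_fields = sorted({k for r in right for k in r if k != right_key})

    # Group right rows by key, keeping every duplicate.
    by_key = {}
    for rrow in right:
        by_key.setdefault(rrow[right_key], []).append(rrow)

    out = []
    for lrow in left:
        hits = by_key.get(lrow[left_key], [])
        if hits:
            # One output row per matching right row.
            for rrow in hits:
                out.append({**lrow, **{f: rrow.get(f) for f in right_fields}})
        elif how == 'left':
            # Keep the unmatched left row, with missing right fields.
            out.append({**lrow, **{f: None for f in right_fields}})
    return out


left = [{'a': 'foo', 'b': 1}, {'a': 'bar', 'b': 2}, {'a': 'zip', 'b': 3}]
right = [
    {'t': 'foo', 'x': 3.14},
    {'t': 'bar', 'x': 2.78},
    {'t': 'bar', 'x': -1.0},
    {'t': 'quam', 'x': 0.0},
]

inner = sql_join(left, right, 'a', 't', how='inner')
left_joined = sql_join(left, right, 'a', 't', how='left')
```

Here the inner join drops `'zip'` and emits two rows for `'bar'`, while the left join additionally keeps `'zip'` with a missing `x`. The `table[expr]` form would instead return exactly one row per row of `left`.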

## Exercises

- What is the favorite movie for each occupation?
- What genres are rated most differently by men and women?
 