# Using RAW: joining data in S3 and a PostgreSQL relational database

## Enabling RAW <-> Jupyter integration

In [1]:
%load_ext raw_magic

### Making sure all is well

In [2]:
%%query
1+1

int
2


## Looking at data in the PostgreSQL relational database

In [4]:
%%rdbms_register demos 
type: postgresql 
host: test-psql.raw-labs.com
database: demos
username: rawtest
password: f55ba92415

RDBMS server "demos" replaced

This demo uses data from IMDB, the "Internet Movie Database".

In [5]:
%%query
SELECT *
FROM READ_PGSQL("demos", "public", "movies")
LIMIT 10

tconst,startyear,primarytitle,originaltitle,primaryname
tt0027125,1935,Top Hat,Top Hat,Fred Astaire
tt0028333,1936,Swing Time,Swing Time,Fred Astaire
tt0034862,1942,Holiday Inn,Holiday Inn,Fred Astaire
tt0050419,1957,Funny Face,Funny Face,Fred Astaire
tt0053137,1959,On the Beach,On the Beach,Fred Astaire
tt0037382,1944,To Have and Have Not,To Have and Have Not,Lauren Bacall
tt0038355,1946,The Big Sleep,The Big Sleep,Lauren Bacall
tt0039302,1947,Dark Passage,Dark Passage,Lauren Bacall
tt0040506,1948,Key Largo,Key Largo,Lauren Bacall
tt0045891,1953,How to Marry a Millionaire,How to Marry a Millionaire,Lauren Bacall


### Let's declare a VIEW to use this source

In [6]:
%%view movies
READ_PGSQL("demos", "public", "movies")

View "movies" replaced

Now we can use it in a query more succinctly:

In [7]:
%%query
SELECT * FROM movies
LIMIT 10

tconst,startyear,primarytitle,originaltitle,primaryname
tt0027125,1935,Top Hat,Top Hat,Fred Astaire
tt0028333,1936,Swing Time,Swing Time,Fred Astaire
tt0034862,1942,Holiday Inn,Holiday Inn,Fred Astaire
tt0050419,1957,Funny Face,Funny Face,Fred Astaire
tt0053137,1959,On the Beach,On the Beach,Fred Astaire
tt0037382,1944,To Have and Have Not,To Have and Have Not,Lauren Bacall
tt0038355,1946,The Big Sleep,The Big Sleep,Lauren Bacall
tt0039302,1947,Dark Passage,Dark Passage,Lauren Bacall
tt0040506,1948,Key Largo,Key Largo,Lauren Bacall
tt0045891,1953,How to Marry a Millionaire,How to Marry a Millionaire,Lauren Bacall


### What's Tom Cruise been up to?

In [8]:
%%query
SELECT startyear, primarytitle
FROM movies
WHERE primaryname = "Tom Cruise"
ORDER BY startyear DESC
LIMIT 10

startyear,primarytitle
2017,American Made
2017,The Mummy
2016,Jack Reacher: Never Go Back
2015,Mission: Impossible - Rogue Nation
2014,Edge of Tomorrow
2013,Oblivion
2012,Rock of Ages
2012,Jack Reacher
2011,Mission: Impossible - Ghost Protocol
2010,Knight and Day


Note this is just like regular SQL (so far!).

Let's reshape this data a bit so that it is more readable. We'll do this by creating a new view, "actors_in_movies".

In [9]:
%%view actors_in_movies
SELECT primarytitle + " (" + originaltitle + ", " + startyear + ")" AS title, primaryname AS name
FROM movies

View "actors_in_movies" replaced

Let's see how that looks:

In [10]:
%%query
SELECT *
FROM  actors_in_movies
LIMIT 10

title,name
"Top Hat (Top Hat, 1935)",Fred Astaire
"Swing Time (Swing Time, 1936)",Fred Astaire
"Holiday Inn (Holiday Inn, 1942)",Fred Astaire
"Funny Face (Funny Face, 1957)",Fred Astaire
"On the Beach (On the Beach, 1959)",Fred Astaire
"To Have and Have Not (To Have and Have Not, 1944)",Lauren Bacall
"The Big Sleep (The Big Sleep, 1946)",Lauren Bacall
"Dark Passage (Dark Passage, 1947)",Lauren Bacall
"Key Largo (Key Largo, 1948)",Lauren Bacall
"How to Marry a Millionaire (How to Marry a Millionaire, 1953)",Lauren Bacall


## Another dataset: a list of TV series, stored in a file on S3

Let's read the data directly from S3!

In [11]:
%buckets_register raw-tutorial

"s3://raw-tutorial" replaced

In [13]:
%%query
SELECT *
FROM READ("s3://raw-tutorial/ipython-demos/series.hjson")
LIMIT 3

title,ep,starring
ER,S1E23,Deezer D
ER,S1E23,Laura Ceron
ER,S1E23,Laura Innes
ER,S1E23,Noah Wyle
ER,S1E23,Eriq La Salle
ER,S1E23,Julianna Margulies
ER,S1E23,George Clooney
ER,S1E23,Anthony Edwards
ER,S1E23,Maura Tierney
ALF,S3E4,Liz Sheridan


As before, let's now create a view to access this data more easily.

In [14]:
%%view series
SELECT *
FROM READ("s3://raw-tutorial/ipython-demos/series.hjson")

View "series" replaced

Let's also reshape it a bit. Note that this goes beyond standard SQL: we are "unnesting" the nested "starring" data in the FROM clause.

In [15]:
%%view actors_in_series
SELECT s.title + " (" + s.ep + ")" as title, p AS name
from series s, s.starring p

View "actors_in_series" replaced

Let's see how that looks:

In [16]:
%%query
SELECT *
FROM actors_in_series
LIMIT 20

title,name
ER (S1E23),Deezer D
ER (S1E23),Laura Ceron
ER (S1E23),Laura Innes
ER (S1E23),Noah Wyle
ER (S1E23),Eriq La Salle
ER (S1E23),Julianna Margulies
ER (S1E23),George Clooney
ER (S1E23),Anthony Edwards
ER (S1E23),Maura Tierney
ALF (S3E4),Liz Sheridan


The two views we created, actors in movies and actors in series, can now be trivially combined, since they are "structurally compatible": both expose the same "title" and "name" fields, as if they came from a single source. An actor appearing in both a movie and a TV series will show up in the results of a single query.

In [17]:
%%view actors_in_all
SELECT *
FROM
actors_in_movies
UNION ALL
actors_in_series

View "actors_in_all" replaced
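
As a rough illustration of what "structurally compatible" means here, the following Python sketch concatenates two row sources only if every row has the same set of fields. It is not RQL; the `union_all` helper and the sample rows are hypothetical and only mirror the shape of the two views.

```python
from itertools import chain

# Hypothetical sample rows shaped like the two views above.
actors_in_movies = [{"title": "Top Hat (Top Hat, 1935)", "name": "Fred Astaire"}]
actors_in_series = [{"title": "ER (S1E23)", "name": "Noah Wyle"}]


def union_all(*sources):
    """Concatenate row sources, keeping duplicates, if their fields match."""
    rows = list(chain.from_iterable(sources))
    field_sets = {frozenset(row) for row in rows}
    if len(field_sets) > 1:
        raise ValueError("sources are not structurally compatible")
    return rows


actors_in_all = union_all(actors_in_movies, actors_in_series)
print(len(actors_in_all))
```

Because both inputs carry exactly the fields `title` and `name`, the combined list can be filtered by `name` without knowing which source a row came from.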

So now let's query that view. Note that we are querying *BOTH* S3 and a PostgreSQL relational database system. And it's all transparent!

In [18]:
%%query
SELECT *
FROM actors_in_all
WHERE name = "Sullivan Stapleton"
LIMIT 10

title,name
"300: Rise of an Empire (300: Rise of an Empire, 2014)",Sullivan Stapleton
"Strike Back (Strike Back, 2010)",Sullivan Stapleton
"Blindspot (Blindspot, 2015)",Sullivan Stapleton
Blindspot (S3E4),Sullivan Stapleton
Blindspot (S3E15),Sullivan Stapleton
Blindspot (S2E3),Sullivan Stapleton
Blindspot (S3E3),Sullivan Stapleton
Blindspot (S2E11),Sullivan Stapleton
Blindspot (S1E6),Sullivan Stapleton
Blindspot (S2E8),Sullivan Stapleton


Note that the "Blindspot" episodes come from the TV series dataset (on S3), while "300: Rise of an Empire" is a movie (from PostgreSQL).

## Cleaning the TV Series dataset

A problem: in the S3 dataset (TV series), some names are spelled slightly wrong.

In [19]:
%%query
SELECT V FROM (
    SELECT DISTINCT name
    FROM actors_in_series
    WHERE name LIKE "%Stapleton%" OR name LIKE "%Benj%" OR name LIKE "%Laura Ce%"
) V  ORDER BY V LIMIT 10

string
Benj Gregory
Benji Gregory
Laura Ceron
Laura Cerón
Sulivan Stapleton
Sullivan Stapleton


To fix this we need a reference dataset. Fortunately we have one: the IMDB database (PostgreSQL), which has a curated list of actor names. So let's replace the misspelled names with the most similar ones found in the curated IMDB list.

We build a view with "primary names" coming from IMDB (PostgreSQL).

In [20]:
%%view ref_names
SELECT primaryname
FROM READ_PGSQL("demos", "public", "names")

View "ref_names" replaced

There are quite a few of them!

In [21]:
%%query
SELECT COUNT(*) FROM ref_names

long
252320


So let's script a cleaning tool for our case in the RAW language (RQL) itself. RQL is what we've been using all along. It looks and feels like SQL, but it's a lot more powerful.

In [22]:
%%view fixed_actors_in_series

candidates(x: string, vs: collection(string nullable)) := {
    select
        levenshtein_distance(n, x) AS score,
        n AS value
    from vs n
    where len(x) == len(n) and substr(x, 1, 1) = substr(n, 1, 1)
};

bestmatch(s: string, vs: collection(string nullable)) :=
    cfirst(
        select x.value
        from candidates(s, vs) x
        where x.score < 4
        order by x.score asc
        limit 3
    )
;

SELECT title, name, bestmatch(name, ref_names) AS fixed_name
FROM actors_in_series

View "fixed_actors_in_series" replaced
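
To make the matching logic easier to follow, here is a minimal Python sketch of the same idea: keep only reference names with the same length and first character, score them by Levenshtein distance, and take the closest one with a score below 4. It is an illustration under assumptions, not RAW's implementation: the `levenshtein_distance` below is a plain dynamic-programming version written for this sketch, and the reference list is a hypothetical sample.

```python
def levenshtein_distance(a, b):
    """Classic edit distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def candidates(x, vs):
    """Score reference names with the same length and first character as x."""
    return [
        (levenshtein_distance(n, x), n)
        for n in vs
        if n is not None and len(n) == len(x) and n[:1] == x[:1]
    ]


def bestmatch(s, vs):
    """Return the closest candidate with score < 4, or None if there is none."""
    close = sorted((c for c in candidates(s, vs) if c[0] < 4), key=lambda c: c[0])
    return close[0][1] if close else None


# Hypothetical sample of reference names; \u00f3 is the accented "o" in "Cerón".
reference_names = ["Laura Cer\u00f3n", "Laura Innes", "Eriq La Salle",
                   "Sullivan Stapleton", None]

print(bestmatch("Laura Ceron", reference_names))
print(bestmatch("Eric La Salle", reference_names))
```

In this sketch the equal-length filter keeps the candidate list small, but it also means that a misspelling which adds or drops a letter (for example "Sulivan Stapleton") has no candidates and yields `None`.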

Let's now see the names that are different from our reference names.

In [24]:
%%query
SELECT *
FROM fixed_actors_in_series
WHERE name != fixed_name
LIMIT 30

title,name,fixed_name
ER (S1E23),Laura Ceron,Laura Cerón
ER (S10E20),Laura Ceron,Laura Cerón
ER (S5E16),Laura Ceron,Laura Cerón
ER (S12E12),Laura Ceron,Laura Cerón
ER (S8E4),Eric La Salle,Eriq La Salle
ER (S2E11),Eric La Salle,Eriq La Salle
ER (S3E4),Eric La Salle,Eriq La Salle
ER (S12E14),Eric La Salle,Eriq La Salle
ER (S12E8),Eric La Salle,Eriq La Salle
ER (S4E16),Eric La Salle,Eriq La Salle


So let's now rebuild our list of actors in movies + series with the correct names.

In [None]:
%%view fixed_all_actors
SELECT *
FROM actors_in_movies
UNION ALL
(SELECT title, fixed_name AS name FROM fixed_actors_in_series)

Let's see if the names look good! (Sullivan Stapleton had two different spellings before.)

In [None]:
%%query
SELECT *
FROM fixed_all_actors
WHERE name = "Sullivan Stapleton" LIMIT 20

## And we're done! We merged data from PostgreSQL and S3 while cleaning it!