In [1]:
%load_ext raw_magic

# Parsing log files

The log file `log_example.log` is obtained from an HTTP server. 

In [2]:
%buckets_register raw-tutorial

API error: S3 credentials already exists


In [3]:
%%query
read_lines("s3://raw-tutorial/ipython-demos/log_example.log")

Showing only 100 values...


string
"199.72.81.55 - [01/Jul/2015:00:00:01 -0400] ""GET /history/apollo/ HTTP/1.0"" 200 6245"
"unicomp6.unicomp.net - [01/Jul/2015:00:00:06 -0400] ""GET /shuttle/countdown/ HTTP/1.0"" 200 3985"
"199.120.110.21 - [01/Jul/2015:00:00:09 -0400] ""GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0"" 200 4085"
"burger.letters.com - [01/Jul/2015:00:00:11 -0400] ""GET /shuttle/countdown/liftoff.html HTTP/1.0"" 304 0"
"199.120.110.21 - [01/Jul/2015:00:00:11 -0400] ""GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0"" 200 4179"
"burger.letters.com - [01/Jul/2015:00:00:12 -0400] ""GET /images/NASA-logosmall.gif HTTP/1.0"" 304 0"
"burger.letters.com - [01/Jul/2015:00:00:12 -0400] ""GET /shuttle/countdown/video/livevideo.gif HTTP/1.0"" 200 0"
"205.212.115.106 - [01/Jul/2015:00:00:12 -0400] ""GET /shuttle/countdown/countdown.html HTTP/1.0"" 200 3985"
"d104.aa.net - [01/Jul/2015:00:00:13 -0400] ""GET /shuttle/countdown/ HTTP/1.0"" 200 3985"
"129.94.144.152 - [01/Jul/2015:00:00:13 -0400] ""GET / HTTP/1.0"" 200 7074"


A regular expression turns each line of the file into a structured record.

Note that the keyword `PARSE AS` maps each capture group of the regular expression to an anonymous field name: `_1, _2, ...`. The keyword `INTO` creates a new record from those fields.

In [4]:
%%view logs
log_file := read_lines("s3://raw-tutorial/ipython-demos/log_example.log");

SELECT * FROM log_file
PARSE AS r"""(.+) - \[(.+)\] "(\w+) (.+) .+" (\d+) (\d+)"""
INTO (host: _1, timestamp: _2, method: _3, url: _4, return_code: _5, size: _6)

View "logs" was replaced


In [5]:
%query SELECT * FROM logs LIMIT 5

host,timestamp,method,url,return_code,size
199.72.81.55,01/Jul/2015:00:00:01 -0400,GET,/history/apollo/,200,6245
unicomp6.unicomp.net,01/Jul/2015:00:00:06 -0400,GET,/shuttle/countdown/,200,3985
199.120.110.21,01/Jul/2015:00:00:09 -0400,GET,/shuttle/missions/sts-73/mission-sts-73.html,200,4085
burger.letters.com,01/Jul/2015:00:00:11 -0400,GET,/shuttle/countdown/liftoff.html,304,0
199.120.110.21,01/Jul/2015:00:00:11 -0400,GET,/shuttle/missions/sts-73/sts-73-patch-small.gif,200,4179
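
To see what the regular expression captures, the same pattern can be tried outside RAW. The following is a minimal Python sketch, not RAW's implementation: `parse_line` and `FIELDS` are names invented for this illustration, and using `fullmatch` is an assumption about how the whole line is matched. The sample line is the first line of the log, with the doubled quotes of the display output written as single quotes.

```python
import re

# Same pattern as in the RAW view above; the six capture groups
# correspond to the anonymous fields _1 .. _6.
LOG_PATTERN = re.compile(r'(.+) - \[(.+)\] "(\w+) (.+) .+" (\d+) (\d+)')

# Field names mirror the INTO clause of the view.
FIELDS = ("host", "timestamp", "method", "url", "return_code", "size")


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


sample = '199.72.81.55 - [01/Jul/2015:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
record = parse_line(sample)
print(record)
```

All values come back as strings here, including `return_code` and `size`; any conversion to numbers would be a separate step.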


# Performing a query

The following query returns, in hierarchical form, the top 10 hostnames for each HTTP return code.

Note that `*` in a `GROUP BY` query refers to the elements in each group. Also note that, in RAW, a SELECT statement can be used as a column and can refer to all columns/variables in scope.

In [6]:
%%query

SELECT return_code,
       (SELECT host, count(*) as requests FROM * GROUP BY host ORDER BY requests DESC LIMIT 10) AS hosts
FROM logs
GROUP BY return_code


return_code,hosts,hosts
return_code,host,requests
200,kristina.az.com,109
200,piweba1y.prodigy.com,55
200,www-a1.proxy.aol.com,46
200,teleman.pr.mcs.net,45
200,pm-1-25.connectnet.com,41
200,piweba3y.prodigy.com,41
200,129.188.154.200,41
200,news.ti.com,41
200,palona1.cns.hp.com,38
200,slip-5.io.com,34
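
For readers more familiar with Python, the nested aggregation can be approximated as follows. This is only a sketch of the idea (group by return code, then count and rank hosts inside each group), not how RAW evaluates the query. The function name `top_hosts_by_code` is invented for the illustration, and the input pairs are taken from the ten sample log lines shown earlier rather than from the full file.

```python
from collections import Counter, defaultdict

# (host, return_code) pairs from the ten sample log lines shown earlier.
records = [
    ("199.72.81.55", "200"),
    ("unicomp6.unicomp.net", "200"),
    ("199.120.110.21", "200"),
    ("burger.letters.com", "304"),
    ("199.120.110.21", "200"),
    ("burger.letters.com", "304"),
    ("burger.letters.com", "200"),
    ("205.212.115.106", "200"),
    ("d104.aa.net", "200"),
    ("129.94.144.152", "200"),
]


def top_hosts_by_code(rows, limit=10):
    """Group rows by return code, then rank hosts by request count per group."""
    per_code = defaultdict(Counter)
    for host, code in rows:
        per_code[code][host] += 1
    # most_common plays the role of ORDER BY requests DESC LIMIT n.
    return {code: counts.most_common(limit) for code, counts in per_code.items()}


result = top_hosts_by_code(records, limit=3)
for code, hosts in result.items():
    print(code, hosts)
```

The result is a mapping from each return code to a list of `(host, requests)` pairs, which has the same nested shape as the `hosts` column in the query output.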


# Displaying locations

This query shows the locations of the IP addresses that issued requests.

The query uses a regular expression to filter out hosts that are not IP addresses and then joins the result with `locations`, a JSON file taken from a location database.

Note that no schemas were created, no data was explicitly loaded, and no separate ETL process or scripts were needed: RAW handles these steps internally, and they are transparent to the user.

In [7]:
%%query
ips := (SELECT DISTINCT host FROM logs) PARSE AS r"""(\d+\.\d+\.\d+\.\d+)""" SKIP ON FAIL;

locations := read("s3://raw-tutorial/ipython-demos/ip_locations.json");

SELECT ip, country_name AS country, latitude, longitude FROM ips p, locations
WHERE p = ip AND country_name IS NOT NULL
ORDER BY country


Showing only 100 values...


ip,country,latitude,longitude
198.142.12.2,Australia,-33.494,143.2104
203.12.152.51,Australia,-37.8103,144.9544
149.171.160.183,Australia,-33.8427,151.1936
129.94.144.152,Australia,-33.8928,151.2472
103.224.182.240,Australia,-33.494,143.2104
149.171.160.182,Australia,-33.8427,151.1936
200.10.239.205,Brazil,-22.8305,-43.2192
200.10.239.195,Brazil,-22.8305,-43.2192
205.189.154.54,Canada,48.6393,-93.4469
158.69.158.186,Canada,45.5,-73.5833
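
The filter-then-join logic can be sketched in plain Python as follows. This is an illustration under stated assumptions, not RAW's behavior: matching the whole host string with `fullmatch` is an assumption about `PARSE AS ... SKIP ON FAIL`, the function name `locate_ips` is invented, and `locations` is a small in-memory stand-in for `ip_locations.json`. Its first row is copied from the output above; the second row (the documentation-reserved address `192.0.2.1`) is made up purely to exercise the `IS NOT NULL` condition.

```python
import re

# Same pattern as in the query above.
IP_PATTERN = re.compile(r"\d+\.\d+\.\d+\.\d+")

# Distinct hosts: a few from the sample log lines, plus one invented address.
hosts = {
    "199.72.81.55",
    "unicomp6.unicomp.net",
    "burger.letters.com",
    "129.94.144.152",
    "192.0.2.1",  # invented for this sketch
}

# Stand-in for ip_locations.json; the field names follow the query above.
locations = [
    {"ip": "129.94.144.152", "country_name": "Australia",
     "latitude": -33.8928, "longitude": 151.2472},
    {"ip": "192.0.2.1", "country_name": None,  # invented row, no country
     "latitude": None, "longitude": None},
]


def locate_ips(hosts, locations):
    """Keep IP-like hosts, join them with locations, and sort by country."""
    ips = {h for h in hosts if IP_PATTERN.fullmatch(h)}  # skip non-IP hosts
    rows = [
        {"ip": loc["ip"], "country": loc["country_name"],
         "latitude": loc["latitude"], "longitude": loc["longitude"]}
        for loc in locations
        if loc["ip"] in ips and loc["country_name"] is not None
    ]
    return sorted(rows, key=lambda row: row["country"])


rows = locate_ips(hosts, locations)
print(rows)
```

Hosts without a matching location row, such as `199.72.81.55` in this stand-in data, simply drop out of the result, as they would in an inner join.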
