Experimenting with Solr SQL
===========================

This notebook experiments with Solr's Parallel SQL interface, accessed through an SQLAlchemy plugin (`sqlalchemy-solr`) so that a Solr collection can be queried much like an ordinary SQL database.

Unfortunately, as things stand, Solr 6 cannot handle SQL queries against _aliases_, so it is not much use for text analysis right now.

The SQL interface does work with Solr 8, but the `SELECT *` logic seems to be a bit brittle (at least via the SQLAlchemy module). Queries work pretty reliably when the fields are explicitly enumerated.
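
Since explicit field lists are the reliable path, a small helper can keep queries away from `SELECT *`. The following is a minimal sketch, not part of `sqlalchemy-solr`: the name `build_select` and its parameters are hypothetical, and it only assembles the query string in the same shape as the queries used in this notebook.

```python
def build_select(table, fields, where=None, limit=None):
    """Build a SELECT statement that names every field explicitly."""
    if not fields:
        raise ValueError("at least one field is required; SELECT * is deliberately not supported")
    # The collection name is backtick-quoted, as in the queries in this notebook.
    # Field names are passed through unchanged, so a caller can pre-quote
    # a reserved word such as `timestamp` where needed.
    sql = "SELECT %s FROM `%s`" % (",".join(fields), table)
    if where:
        sql += " WHERE %s" % where
    if limit is not None:
        sql += " LIMIT %d" % int(limit)
    return sql


query = build_select("NPLD-FC2017-20190228", ["id", "url", "wayback_date"], limit=1)
print(query)
```

The resulting string can be passed to `engine.execute(...)` or `pd.read_sql_query(...)` in the same way as the hand-written queries.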

In [1]:
!pip install sqlalchemy-solr



In [42]:
from sqlalchemy import create_engine

fc = 'NPLD-FC2017-20190228'

engine = create_engine('solr://solr.api.wa.bl.uk:80/solr/%s' % fc)

# Unfortunately, on Solr 6 we can't query aliases, and selecting all fields (*) leads to class cast exceptions!
# (java.lang.Long cannot be cast to java.lang.String)
rows = engine.execute("SELECT id,url,wayback_date FROM `%s` LIMIT 1" % fc)

for r in rows:
    for column, value in r.items():
        print(column, value)


************************************
Query: SELECT id,url,wayback_date FROM `NPLD-FC2017-20190228` LIMIT 1
************************************
id 20171225120530/r1upsuMttEfpjRI2R4rN7Q==
url http://www.newquayvoice.co.uk/news/5/article/2920/
wayback_date 20171225120530
************************************
Catched StopIteration in fetchone
************************************
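
The `wayback_date` value above looks like a 14-digit `YYYYMMDDhhmmss` timestamp. Assuming that reading is right (the notebook does not state the format or the time zone), it can be turned into a Python `datetime` with the standard library. The helper name `parse_wayback_date` is hypothetical.

```python
from datetime import datetime


def parse_wayback_date(value):
    """Parse a 14-digit YYYYMMDDhhmmss value into a naive datetime.

    The value is converted with str() first, so it works whether the
    driver returns it as a number or as text.
    """
    return datetime.strptime(str(value), "%Y%m%d%H%M%S")


ts = parse_wayback_date(20171225120530)
print(ts.isoformat())
```

The result is deliberately left without a time zone, since the source does not say which one applies.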


In [21]:
#rows = engine.execute("SELECT id,url FROM `NPLD-FC2017-20190228` WHERE (host:'theguardian.com' OR host:'independent.co.uk' OR host:'dailymail.co.uk' OR host:'express.co.uk' OR host:'thesun.co.uk' OR host:'mirror.co.uk' OR host:'dailystar.co.uk') AND ((title:meghan AND title:harry) OR (title:meghan AND title:markle)) LIMIT 1")
rows = engine.execute("SELECT id,url,title,host FROM `NPLD-FC2017-20190228` WHERE host = '(theguardian.com independent.co.uk dailymail.co.uk express.co.uk thesun.co.uk mirror.co.uk dailystar.co.uk)' AND ((title = 'meghan' AND title = 'markle') OR (title = 'meghan' AND title = 'harry')) ORDER BY crawl_date LIMIT 1")

for r in rows:
    for column, value in r.items():
        print(column, value)


************************************
Query: SELECT id,url,title,host FROM `NPLD-FC2017-20190228` WHERE host = '(theguardian.com independent.co.uk dailymail.co.uk express.co.uk thesun.co.uk mirror.co.uk dailystar.co.uk)' AND ((title = 'meghan' AND title = 'markle') OR (title = 'meghan' AND title = 'harry')) ORDER BY crawl_date LIMIT 1
************************************
id 20170101102033/lhGw4C4wmG+0iFG0t7JrFw==
url http://www.dailymail.co.uk/tvshowbiz/article-3976578/amp/Meghan-Markle-enjoys-cocktails-Quantico-star-Priyanka-Chopra-Prince-Harry-tours-Caribbean.html
title Meghan Markle enjoys cocktails with Quantico's Priyanka Chopra while Prince Harry tours
host dailymail.co.uk
************************************
Catched StopIteration in fetchone
************************************
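
The query above relies on the `field = '(term1 term2 ...)'` form. As far as I understand the Solr Parallel SQL documentation, a parenthesised value is handed to Solr as a non-phrase query rather than matched as a literal string, which would explain why a `dailymail.co.uk` row comes back; treat that as an assumption here. The sketch below rebuilds the same `WHERE` clause from Python lists. The helper names `any_of` and `all_of` are hypothetical.

```python
def _check(term):
    # Keep the sketch simple: refuse characters that would break out of
    # the quoted value instead of trying to escape them.
    if any(ch in term for ch in "'()"):
        raise ValueError("unsupported character in term: %r" % term)
    return term


def any_of(field, terms):
    """Render field = '(t1 t2 ...)', the parenthesised form used above."""
    return "%s = '(%s)'" % (field, " ".join(_check(t) for t in terms))


def all_of(field, terms):
    """Render (field = 't1' AND field = 't2' ...)."""
    return "(" + " AND ".join("%s = '%s'" % (field, _check(t)) for t in terms) + ")"


hosts = ["theguardian.com", "independent.co.uk"]
where = "%s AND (%s OR %s)" % (
    any_of("host", hosts),
    all_of("title", ["meghan", "markle"]),
    all_of("title", ["meghan", "harry"]),
)
print(where)
```

Building the clause from lists makes it easier to change the set of hosts or title terms without editing one very long string.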


In [27]:
import pandas as pd

sql_df = pd.read_sql_query(
    "SELECT id,url,title,host,crawl_date FROM `NPLD-FC2017-20190228` WHERE host = '(theguardian.com independent.co.uk dailymail.co.uk express.co.uk thesun.co.uk mirror.co.uk dailystar.co.uk)' AND ((title = 'meghan' AND title = 'markle') OR (title = 'meghan' AND title = 'harry')) ORDER BY crawl_date LIMIT 100000",
    con=engine
)

sql_df

************************************
Query: SELECT id,url,title,host,crawl_date FROM `NPLD-FC2017-20190228` WHERE host = '(theguardian.com independent.co.uk dailymail.co.uk express.co.uk thesun.co.uk mirror.co.uk dailystar.co.uk)' AND ((title = 'meghan' AND title = 'markle') OR (title = 'meghan' AND title = 'harry')) ORDER BY crawl_date LIMIT 100000
************************************


Unnamed: 0,id,url,title,host,crawl_date
0,20170101102033/lhGw4C4wmG+0iFG0t7JrFw==,http://www.dailymail.co.uk/tvshowbiz/article-3...,Meghan Markle enjoys cocktails with Quantico's...,dailymail.co.uk,2017-01-01 10:20:33
1,20170101102049//M3a8PfNIu5zNH26wfStXA==,http://www.dailymail.co.uk/news/article-392542...,Prince Harry's girlfriend Meghan Markle spotte...,dailymail.co.uk,2017-01-01 10:20:49
2,20170101102057/Erluxdjjqr/jNL4hNzha4A==,http://www.dailymail.co.uk/news/article-402914...,Prince Harry and girlfriend Meghan Markle 'buy...,dailymail.co.uk,2017-01-01 10:20:57
3,20170101102101/z6CUHQC6PS+SWkzOnV6hLw==,http://www.dailymail.co.uk/news/article-402914...,Prince Harry and girlfriend Meghan Markle 'buy...,dailymail.co.uk,2017-01-01 10:21:01
4,20170101102123/Vs9UbyshtSk1D9ZNIyRFCA==,http://www.dailymail.co.uk/news/article-396227...,Prince Harry's girlfriend Meghan Markle says s...,dailymail.co.uk,2017-01-01 10:21:23
...,...,...,...,...,...
21051,20171227215813/rt9ttovXyXjBwGOzBKCWgg==,http://www.mirror.co.uk/news/uk-news/prince-ha...,Prince Harry WON'T take part in the traditiona...,mirror.co.uk,2017-12-27 21:58:13
21052,20171227215854/ZxKG1KVJAy95739ySnIbxg==,http://www.mirror.co.uk/3am/style/celebrity-fa...,"Meghan Markle is Hollywood perfection in £56,0...",mirror.co.uk,2017-12-27 21:58:54
21053,20171227231231/V2VEliskxTDJbIqlCcsF7w==,http://www.mirror.co.uk/3am/celebrity-news/por...,Porn searches for Meghan Markle go through the...,mirror.co.uk,2017-12-27 23:12:31
21054,20171228022515/n/RUSr7G6FbUZuSljxsDew==,http://www.mirror.co.uk/news/uk-news/meghan-ma...,Prince Harry reveals Meghan Markle's first Chr...,mirror.co.uk,2017-12-28 02:25:15


In [50]:
import pandas as pd
from sqlalchemy import create_engine, inspect

# Solr 8, works fine:
engine = create_engine('solr://solr8.api.wa.bl.uk:80/solr/tracking', echo=True)
# Solr 6, this doesn't work:
#engine = create_engine('solr://solr.api.wa.bl.uk:80/solr/NPLD-FC2017-20190228', echo=True)

inspector = inspect(engine)

for table_name in inspector.get_table_names():
    for column in inspector.get_columns(table_name):
        print("Table: %s Column: %s" % (table_name, column['name']))


Table: trackdb-20200402 Column: _root_
Table: trackdb-20200402 Column: _version_
Table: trackdb-20200402 Column: cdx_index_ss
Table: trackdb-20200402 Column: cdx_records_checked_i
Table: trackdb-20200402 Column: cdx_records_found_i
Table: trackdb-20200402 Column: collection_s
Table: trackdb-20200402 Column: file_ext_s
Table: trackdb-20200402 Column: file_name_s
Table: trackdb-20200402 Column: file_path_s
Table: trackdb-20200402 Column: file_size_l
Table: trackdb-20200402 Column: hdfs_group_s
Table: trackdb-20200402 Column: hdfs_replicas_i
Table: trackdb-20200402 Column: hdfs_user_s
Table: trackdb-20200402 Column: id
Table: trackdb-20200402 Column: job_s
Table: trackdb-20200402 Column: kind_s
Table: trackdb-20200402 Column: layout_s
Table: trackdb-20200402 Column: modified_at_dt
Table: trackdb-20200402 Column: permissions_s
Table: trackdb-20200402 Column: recognised_b
Table: trackdb-20200402 Column: refresh_date_dt
Table: trackdb-20200402 Column: stream_s
Table: trackdb-20200402 Column:
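
Most of the column names listed above end in suffixes such as `_s`, `_i`, `_l`, `_dt`, `_b` and `_ss`. These look like the dynamic-field conventions from Solr's default configset, but that is an assumption: this collection's schema may define them differently. Under that assumption, a rough guess at each column's type can be sketched as follows (the names `SUFFIX_TYPES` and `guess_type` are hypothetical).

```python
# Assumed mapping, based on the dynamic-field suffixes in Solr's default
# configset; the actual schema for this collection may differ.
SUFFIX_TYPES = {
    "_s": "string",
    "_ss": "strings (multi-valued)",
    "_i": "int",
    "_l": "long",
    "_b": "boolean",
    "_dt": "date",
}


def guess_type(column):
    """Return the assumed type for a column name, or None if no suffix matches."""
    for suffix, type_name in SUFFIX_TYPES.items():
        if column.endswith(suffix):
            return type_name
    return None


for name in ["file_size_l", "cdx_index_ss", "modified_at_dt", "id"]:
    print(name, guess_type(name))
```

Columns without a recognised suffix, such as `id` or `_version_`, return `None` and would need to be checked against the schema itself.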

In [11]:
#rows = engine.execute("SELECT * FROM tracking LIMIT 10")

# Note that SELECT * can be unreliable when requesting multiple documents; it is best to list the fields explicitly:
sql_df = pd.read_sql_query(
    "SELECT id,timestamp_dt FROM tracking WHERE kind_s = 'warcs' LIMIT 2",
    con=engine
)

sql_df

************************************
Query: SELECT id,timestamp_dt FROM tracking WHERE kind_s = 'warcs' LIMIT 2
************************************


Unnamed: 0,id,timestamp_dt
0,hdfs://hdfs:54310/1_data/npld/webrecorder/bl-y...,2016-12-30 11:59:00
1,hdfs://hdfs:54310/1_data/npld/webrecorder/bl-y...,2016-12-30 11:59:00


In [76]:
# Solr 8, works fine:
engine = create_engine('solr://dev1.n45.wa.bl.uk:8913/solr/crawl_log_fc',echo=True)
sql_df = pd.read_sql_query(
    "SELECT url,`timestamp` FROM crawl_log_fc WHERE annotations = 'Q:serverMaxSuccessKb' LIMIT 2",
    con=engine
)

sql_df

2021-02-06 22:07:37,827 INFO sqlalchemy.engine.base.Engine SELECT url,`timestamp` FROM crawl_log_fc WHERE annotations = 'Q:serverMaxSuccessKb' LIMIT 2


2021-02-06 22:07:37,827 - SELECT url,`timestamp` FROM crawl_log_fc WHERE annotations = 'Q:serverMaxSuccessKb' LIMIT 2


2021-02-06 22:07:37,828 INFO sqlalchemy.engine.base.Engine ()


2021-02-06 22:07:37,828 - ()


************************************
Query: SELECT url,`timestamp` FROM crawl_log_fc WHERE annotations = 'Q:serverMaxSuccessKb' LIMIT 2
************************************


Unnamed: 0,url,timestamp
0,https://i0.wp.com/www.theoffsideline.com/wp-co...,2021-01-28 16:25:08.791
1,https://fi.sportsdirect.com/under-armour/under...,2021-01-28 16:25:09.128
