Experimenting with Solr SQL
===========================

This notebook experiments with Solr's Parallel SQL interface, primarily through an SQLAlchemy plugin (`sqlalchemy-solr`), so that Solr can be queried much like an ordinary SQL database.

Unfortunately, Solr 6 does not currently cope with SQL queries on _aliases_, so it's not much use for text analysis right now.

The SQL system does work okay for Solr 8, but the `SELECT *` logic seems to be a bit brittle (at least via the SQLAlchemy module).  It works pretty reliably if the fields are explicitly enumerated.
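Since explicitly enumerated fields behave more reliably than `SELECT *`, a small helper can make that the default. This is only a sketch: `build_select` is a hypothetical helper, not part of `sqlalchemy-solr`, and it assumes that collection names containing characters such as hyphens need backtick quoting, as in the queries used in this notebook. It does not escape or validate the `where` clause.

```python
import re


def build_select(table, fields, where=None, limit=None):
    """Build a Solr SQL SELECT statement with an explicit field list.

    Hypothetical helper for illustration; it only assembles a string.
    """
    if not fields:
        raise ValueError("List at least one field; the point is to avoid SELECT *")
    # Backtick-quote the table name unless it is a plain identifier
    # (collection names such as NPLD-FC2017-20190228 contain hyphens).
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        table = "`{}`".format(table)
    sql = "SELECT {} FROM {}".format(",".join(fields), table)
    if where:
        sql += " WHERE " + where
    if limit is not None:
        sql += " LIMIT {:d}".format(int(limit))
    return sql


query = build_select("tracking", ["id", "timestamp_dt"],
                     where="kind_s = 'warcs'", limit=2)
print(query)
```

The resulting string can be passed to `engine.execute(...)` or `pd.read_sql_query(...)` in the same way as a hand-written query.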

In [1]:
!pip install sqlalchemy-solr



In [29]:
from sqlalchemy import create_engine

engine = create_engine('solr://solr.api.wa.bl.uk:80/solr/fc')

# Unfortunately, for Solr 6, we can't query aliases, and selecting all fields (*) leads to class cast exceptions!
# (java.lang.Long cannot be cast to java.lang.String)
rows = engine.execute("SELECT id,url FROM `NPLD-FC2017-20190228` LIMIT 1")

for r in rows:
    for column, value in r.items():
        print(column, value)


************************************
Query: SELECT id,url FROM `NPLD-FC2017-20190228` LIMIT 1
************************************
id 20171225120530/r1upsuMttEfpjRI2R4rN7Q==
url http://www.newquayvoice.co.uk/news/5/article/2920/
************************************
Catched StopIteration in fetchone
************************************
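
Because Solr 6 cannot run SQL against an alias, the query above names a concrete collection instead. One possible way to find the collections behind an alias is the Collections API `CLUSTERSTATUS` action. The sketch below assumes the response carries an alias map under `cluster.aliases`, with comma-separated collection names as values; check this against the Solr version in use. The function names, the `solr_base_url` parameter, and the sample alias and collection names are all hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(solr_base_url):
    """Fetch CLUSTERSTATUS from a SolrCloud node.

    solr_base_url is a placeholder such as "http://host:port/solr".
    """
    url = solr_base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url) as response:
        return json.load(response)


def collections_for_alias(cluster_status, alias):
    """Return the collection names an alias points at, or [] if unknown.

    Assumes aliases appear as {"alias": "coll1,coll2"} under cluster.aliases.
    """
    aliases = cluster_status.get("cluster", {}).get("aliases", {})
    target = aliases.get(alias, "")
    return [name.strip() for name in target.split(",") if name.strip()]


# Made-up response fragment for illustration only (not real server output).
sample_status = {
    "cluster": {
        "aliases": {"example-alias": "example-collection-a,example-collection-b"}
    }
}
resolved = collections_for_alias(sample_status, "example-alias")
print(resolved)
```

Each resolved name could then be used as the table in a Solr SQL query, in place of the alias itself.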


In [1]:
import pandas as pd
from sqlalchemy import create_engine

# Only create this engine once per kernel session, or things get confused (?)
engine = create_engine('solr://solr8.api.wa.bl.uk:80/solr/tracking',echo=False)



In [11]:
#rows = engine.execute("SELECT * FROM tracking LIMIT 10")

# Note that SELECT * can get a bit wonky when requesting multiple documents, so it's best to be explicit:
sql_df = pd.read_sql_query(
    "SELECT id,timestamp_dt FROM tracking WHERE kind_s = 'warcs' LIMIT 2",
    con=engine
)

sql_df

************************************
Query: SELECT id,timestamp_dt FROM tracking WHERE kind_s = 'warcs' LIMIT 2
************************************


Unnamed: 0,id,timestamp_dt
0,hdfs://hdfs:54310/1_data/npld/webrecorder/bl-y...,2016-12-30 11:59:00
1,hdfs://hdfs:54310/1_data/npld/webrecorder/bl-y...,2016-12-30 11:59:00
