## Solr Practice Session

1.  Installing SOLR
1. Starting the service
  1.  Standalone versus cloud
1.  Basic administration
1.  Example collections
1.  The admin GUI
1.  Defining a new collection
  1.  Configuration files
  1.  Defining the schema
1.  Indexing documents
1.  Queries and query responses



### Installing SOLR

https://lucene.apache.org/solr/guide/7_1/installing-solr.html

1. Verify Java version
1. Download SOLR archive
1. Expand it to a location of your choice
1. Remember that location, as you will be running commands out of its bin directory

My install is at C:\solr-8.0.0
1.  Run bin\solr start --help to see options
1.  Run bin\solr start -e techproducts
1.  Run the browser and look at general admin and also the core


### Starting and stopping the service

1.  <pre>bin\solr start [-c] [-e example]</pre>
2. <pre>bin\solr stop -all</pre>
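
One way to confirm that the service actually came up is to ask it for its status over HTTP. The sketch below only builds the URLs. It assumes the default port 8983 on localhost, and the helper name `admin_url` is made up for this example; `admin/info/system` and `admin/cores?action=STATUS` are standard Solr admin endpoints.

```python
from urllib.parse import urlencode

def admin_url(path, params=None, host="localhost", port=8983):
    """Build a URL under /solr/admin (hypothetical helper; default host and port assumed)."""
    url = f"http://{host}:{port}/solr/admin/{path}"
    if params:
        url += "?" + urlencode(params)
    return url

# System information: only answers while the server is running
print(admin_url("info/system", {"wt": "json"}))
# Status of every core loaded by a standalone server
print(admin_url("cores", {"action": "STATUS", "wt": "json"}))
```

Opening either URL in a browser, or fetching it with `requests.get`, should return JSON once `bin\solr start` has finished.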

#### A Note on Cloud Mode vs. Standalone Mode

  <img src="basic-cloud3.png"/>

#### Look at the *Techproducts* example
1. Start the server and have it load the example "collection"
2. Use the admin GUI to see what constitutes a "collection"
3. Operations from the GUI relevant to collections
  1. Analysis -- see what analyzers are being run on indexing and on querying
    1.  Notice synonyms on query
  2. Query -- run queries and examine results
  3. Schema -- more on this later

### Constructing Our Own Collection for SUFaculty

1. Understand our data and retrieval requirements
2. Understand the SOLR schema syntax
3. Build a schema.xml file

#### Build the Core Using Admin Interface
1.  Create a config directory for the new core
2.  Edit the schema.xml file to our needs
3.  Create the core using the Admin Interface
4.  Add a few documents
5.  Analyze a few terms
6.  Do a few queries

#### Remember our schema
1.  Look at a data file
1.  Fields:  name, email, phone, interests, joined
1.  Make some decisions
  1. *name* -- is it truly a text field?
  1. *email* -- special structure
  1. *phone* -- special structure
  1. *interests* -- really a text field!
  1. *joined* -- year / integer
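
The decisions above eventually become `<field>` entries in schema.xml. The sketch below shows one possible set of answers, not necessarily the schema used in this lab. The type names `text_general`, `string`, and `pint` come from Solr's default configset; the mapping itself and the helper `field_xml` are assumptions made for illustration.

```python
from xml.sax.saxutils import quoteattr

# Assumed type choices for each field (one possible answer to the questions above)
FIELD_TYPES = {
    "name": "text_general",       # tokenized, so a single last name can match
    "email": "text_general",      # tokenized; an assumption, "string" would force exact match
    "phone": "string",            # kept whole, exact match only
    "interests": "text_general",  # free text
    "joined": "pint",             # integer year, supports range queries
}

def field_xml(name, type_name, indexed=True, stored=True):
    """Render one schema.xml field definition."""
    return (f"<field name={quoteattr(name)} type={quoteattr(type_name)} "
            f'indexed="{str(indexed).lower()}" stored="{str(stored).lower()}"/>')

for name, type_name in FIELD_TYPES.items():
    print(field_xml(name, type_name))
```

Writing the choices down as data first makes it easy to change one type and regenerate the lines to paste into the schema file.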


------------------------------------------------------

### Moving out of the Admin GUI

https://lucene.apache.org/solr/guide/6_6/introduction-to-client-apis.html#introduction-to-client-apis

>Clients use Solr’s five fundamental operations to work with Solr. The operations are query, index, delete, commit, and optimize.

Things we want to do
1.  Start and stop the server -- command line / OS
1.  Create / delete a collection -- either command line or admin interfaces
1.  Index documents -- API / wrapper
1.  Query -- API / wrapper


#### SOLR Command Line Interface
<pre>
(base) C:\solr-8.0.0>bin\solr --help

Usage: solr COMMAND OPTIONS
       where COMMAND is one of: 
       start, stop, restart, healthcheck, create, create_core, create_collection, 
       delete, version, zk, auth, assert, config, autoscaling
</pre>

We want start, stop, create_core (create_collection requires SolrCloud mode), and delete.
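
As a rough sketch, the argument lists for those commands can be written down as small helpers before wiring them to `subprocess`. The flags are the ones used elsewhere in these notes (`-e`, `-all`, `-c`, `-d`); the helper names and the `sufaculty\conf` path are placeholders for this example.

```python
def start_args(example=None):
    """Arguments for `solr start`, optionally loading a bundled example."""
    return ["start"] + (["-e", example] if example else [])

def stop_args():
    """Arguments for stopping every running Solr instance."""
    return ["stop", "-all"]

def create_core_args(name, config_dir):
    """Arguments for creating a standalone core from a config directory."""
    return ["create_core", "-c", name, "-d", config_dir]

def delete_args(name):
    """Arguments for deleting a core or collection by name."""
    return ["delete", "-c", name]

# Print the equivalent command lines (placeholder config path)
for args in (start_args("techproducts"), stop_args(),
             create_core_args("sufaculty", "sufaculty\\conf"),
             delete_args("sufaculty")):
    print("bin\\solr " + " ".join(args))
```

Each list can be passed to a wrapper around `subprocess.check_output`, with the Solr executable path prepended.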

In [1]:
# Run Solr shell commands from Python. The path below is for Windows; change it on other platforms.
import subprocess

SOLR_EXECUTABLE = 'C:\\solr-8.0.0\\bin\\solr.cmd'

def solr_command(*args):
    return subprocess.check_output([SOLR_EXECUTABLE] + list(args))
                                   

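Create a core from the command line. It would probably be preferable to do this with an API, but the Collections API is only available in SolrCloud mode, and the CoreAdmin API's CREATE action expects the core's configuration to already be in place on the server.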

In [2]:
config_loc = 'C:\\Users\\hanks\\Documents\\GitHub\\cpsc5340-w21\\solrLab\\sufaculty\\conf'
solr_command('create_core', '-c', 'sufaculty', '-d', config_loc)

b"\nCreated new core 'sufaculty'\r\n"

#### Pysolr interface
https://github.com/django-haystack/pysolr

In [3]:
import pysolr
import json
solr = pysolr.Solr('http://localhost:8983/solr/sufaculty')

In [4]:
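import os
docs = []
DIR = '..\\scrapeLab\\json\\'
# Load every scraped JSON file into a list of dicts, one per faculty member
for filename in os.listdir(DIR):
    with open(DIR + filename) as f:
        print(f"{DIR + filename}")
        docs.append(json.loads(f.read()))
# Send all documents in one request and commit so they are searchable right away
solr.add(docs, commit=True)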

..\scrapeLab\json\dingle-adair.json
..\scrapeLab\json\hanks-steven.json
..\scrapeLab\json\khadivi-pejman.json
..\scrapeLab\json\koenig-michael.json
..\scrapeLab\json\kong-hidy.json
..\scrapeLab\json\larson-eric.json
..\scrapeLab\json\leblanc-richard.json
..\scrapeLab\json\li-lin.json
..\scrapeLab\json\lundeen-kevin.json
..\scrapeLab\json\mckee-michael.json
..\scrapeLab\json\mishra-aditya.json
..\scrapeLab\json\obare-james.json
..\scrapeLab\json\oh-sheila.json
..\scrapeLab\json\reeder-susan.json
..\scrapeLab\json\wong-jason.json
..\scrapeLab\json\zhu-yingwu-.json


'<?xml version="1.0" encoding="UTF-8"?>\n<response>\n\n<lst name="responseHeader">\n  <int name="status">0</int>\n  <int name="QTime">149</int>\n</lst>\n</response>\n'

#### Queries

- All documents, using both requests and pysolr
- Exact match on handle
- Keyword match on bio/name
- Range match on year joined



In [5]:
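import requests
# Example from another collection: http://localhost:8983/solr/tweeters/select?df=location&q=washington
# Note: *:* is Solr's match-all query. The *.* used here is parsed as a wildcard
# term on the default field, which still returns all 16 documents in this collection.
list(requests.get("http://localhost:8983/solr/sufaculty/select?q=*.*"))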

[b'{\n  "responseHeader":{\n    "status":0,\n    "QTime":21,\n    "params":{\n      "q":"*.*"}},\n  "response":{"numFound":16,"start":0,"',
 b'docs":[\n      {\n        "name":"Adair Dingle, Ph.D.",\n        "email":"dingle@seattleu.edu",\n        "phone":"(206) 296-5516",\n ',
 b'       "bio":"Dr. Dingle\'s Personal Webpage\\n\xc2\xa0\\nTeaching Interests:\\n\\nData Structures\\nFoundations of Computer Science\\nObject',
 b'-Oriented Software Development\\nLanguages and Computation\\nDesign Patterns and Refactoring\\n\\n\xc2\xa0\\nResearch Interests:\\nReclaimin',
 b'g Garbage and Education: Java Memory Leaks, Tracking the Design of Objects: Encapsulation Through Polymorphism, The Maintainabil',
 b'ity Gap, Assessing the Ripple Effect of Language Choice in CS1, Improving C++ Performance Using Temporaries, The Object-Ownershi',
 b'p Model: A Case Study for Inheritance and Operator Overloading",\n        "joined":1996,\n        "_version_":1689618743003447296,',
 b'\n        "handle":"d

In [6]:
list(solr.search("*"))

[{'name': 'Adair Dingle, Ph.D.',
  'email': 'dingle@seattleu.edu',
  'phone': '(206) 296-5516',
  'bio': "Dr. Dingle's Personal Webpage\n\xa0\nTeaching Interests:\n\nData Structures\nFoundations of Computer Science\nObject-Oriented Software Development\nLanguages and Computation\nDesign Patterns and Refactoring\n\n\xa0\nResearch Interests:\nReclaiming Garbage and Education: Java Memory Leaks, Tracking the Design of Objects: Encapsulation Through Polymorphism, The Maintainability Gap, Assessing the Ripple Effect of Language Choice in CS1, Improving C++ Performance Using Temporaries, The Object-Ownership Model: A Case Study for Inheritance and Operator Overloading",
  'joined': 1996,
  '_version_': 1689618743003447296,
  'handle': 'dingle-adair'},
 {'name': 'Steven Hanks, Ph.D.',
  'email': 'hankssteven@seattleu.edu',
  'bio': '\xa0\nTeaching interests:\n\nData science\nArtificial intelligence\nSoftware design\nText and natural language processing, and search\n\nResearch interests:\n\nAp

In [7]:
list(solr.search('intelligence'))

[{'name': 'Steven Hanks, Ph.D.',
  'email': 'hankssteven@seattleu.edu',
  'bio': '\xa0\nTeaching interests:\n\nData science\nArtificial intelligence\nSoftware design\nText and natural language processing, and search\n\nResearch interests:\n\nApplication of data-science methodologiesConnecting machine learning and AI – how ML algorithms can learn representations useful for “general commonsense intelligence” (whatever that means)\xa0\n',
  'joined': 2008,
  '_version_': 1689618743007641600,
  'handle': 'hanks-steven'},
 {'name': 'Pejman Khadivi, Ph.D.',
  'email': 'khadivip@seattleu.edu',
  'phone': '(206) 296-2567',
  'bio': "Dr. Khadivi's Personal Web Page\n\xa0\nTeaching Interests:\n\nDesign & Analysis of Algorithm\nComp Systems Principles\nArtificial Intelligence\n\n\xa0\nResearch Interests:\nMy primary research interests are in the field of artificial intelligence, machine learning, and data analytics using large scale datasets, with emphasis on time series analytics, which is consi

In [8]:
list(solr.search('joined:[2010 TO *]'))

[{'name': 'Michael McKee',
  'email': 'mckeem@seattleu.edu',
  'bio': '\xa0\nTeaching Interests:\n\nProgramming & Problem Solving\nData Structures And Algorithms\nDatabases\nSoftware Economics\nSoftware Testing\nData Analytics\n\nResearch Interests:\n\nCS Education, Databases, Data Warehousing, Computer Languages, Economics, STEM.\n',
  'joined': 2012,
  '_version_': 1689618743021273088,
  'handle': 'mckee-michael'},
 {'name': 'James Obare',
  'email': 'obarej@seattleu.edu',
  'phone': '(206) 296-2837',
  'bio': '\xa0\nTeaching Interests:\n\nIntro to Computer Science\nIntro Computers & Applications\nComp Systems Principles\nCyber Security\nComputer Organization and Architecture\n',
  'joined': 2013,
  '_version_': 1689618743023370240,
  'handle': 'obare-james'},
 {'name': 'Jason Wong',
  'email': 'wongja@seattleu.edu',
  'phone': '(206) 296-5949',
  'bio': '\xa0\nTeaching Interests:\n\nFundamentals of Databases\nSoftware Engineering & Project Development\nSecurity in Computing\n\n\xa0\

In [9]:
list(solr.search('email:"hankssteven"'))

[{'name': 'Steven Hanks, Ph.D.',
  'email': 'hankssteven@seattleu.edu',
  'bio': '\xa0\nTeaching interests:\n\nData science\nArtificial intelligence\nSoftware design\nText and natural language processing, and search\n\nResearch interests:\n\nApplication of data-science methodologiesConnecting machine learning and AI – how ML algorithms can learn representations useful for “general commonsense intelligence” (whatever that means)\xa0\n',
  'joined': 2008,
  '_version_': 1689618743007641600,
  'handle': 'hanks-steven'}]

In [11]:
list(solr.search('email:"hanks"'))

[]

In [10]:
list(solr.search('handle:"hanks"'))

[]

In [12]:
list(solr.search('handle:"hanks-steven"'))

[{'name': 'Steven Hanks, Ph.D.',
  'email': 'hankssteven@seattleu.edu',
  'bio': '\xa0\nTeaching interests:\n\nData science\nArtificial intelligence\nSoftware design\nText and natural language processing, and search\n\nResearch interests:\n\nApplication of data-science methodologiesConnecting machine learning and AI – how ML algorithms can learn representations useful for “general commonsense intelligence” (whatever that means)\xa0\n',
  'joined': 2008,
  '_version_': 1689618743007641600,
  'handle': 'hanks-steven'}]
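
The four results above are what one would expect if `email` is a tokenized text field and `handle` is an untokenized string field. The schema is not shown here, so that is an assumption. The toy model below is not Solr's analyzer; it only illustrates the difference between matching a whole token and matching the whole stored value.

```python
import re

def tokens(value):
    """Toy analyzer: lowercase, then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^a-z0-9]+", value.lower()) if t]

def matches_text(field_value, term):
    """Tokenized field: the term must equal one whole token."""
    return term.lower() in tokens(field_value)

def matches_string(field_value, term):
    """String field: the term must equal the entire stored value."""
    return field_value == term

email = "hankssteven@seattleu.edu"
handle = "hanks-steven"
print(matches_text(email, "hankssteven"))       # True
print(matches_text(email, "hanks"))             # False: "hanks" is only part of a token
print(matches_string(handle, "hanks"))          # False: not the whole value
print(matches_string(handle, "hanks-steven"))   # True
```

Matching on part of a token would need a wildcard query such as `email:hanks*`.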

### Other parameters for the query
* df -- default search field
* fl -- fields to return
* start and rows
* sort

In [13]:
# Here's an abstraction that lets us experiment with query parameters
import requests

def solr_select(params, port="8983", collection="sufaculty"):
    # Join the parameters as key=value pairs; requests quotes unsafe characters such as spaces
    param_arg = "&".join(f"{key}={value}" for key, value in params.items())
    query_string = f"http://localhost:{port}/solr/{collection}/select?"
    rs = query_string + param_arg
    print(rs)
    r = requests.get(rs)
    if r.status_code == 200:
        # Return only the "response" section (numFound, start, docs)
        return r.json()['response']
    else:
        raise Exception(f"Request Error: {r.status_code}")


In [14]:
# Example with additional parameters: search bios for "computer", oldest hires first, five results

solr_select({"df": "bio", "q": "computer", "rows":5, "start":0, "sort": "joined asc", "fl": "name,email,joined"})

http://localhost:8983/solr/sufaculty/select?df=bio&q=computer&rows=5&start=0&sort=joined asc&fl=name,email,joined


{'numFound': 12,
 'start': 0,
 'docs': [{'name': 'Adair Dingle, Ph.D.',
   'email': 'dingle@seattleu.edu',
   'joined': 1996},
  {'name': 'Kevin Lundeen\n',
   'email': 'lundeenk@seattleu.edu',
   'joined': 1997},
  {'name': 'Hidy Kong, Ph.D.', 'email': 'hkong@seattleu.edu', 'joined': 1998},
  {'name': 'Yingwu Zhu, Ph.D.', 'email': 'zhuy@seattleu.edu', 'joined': 1999},
  {'name': 'Aditya Mishra, Ph.D.',
   'email': 'mishraa@seattleu.edu',
   'joined': 2000}]}
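
The URL printed above contains a raw space in `sort=joined asc`. It works because `requests` quotes it before sending, but a value containing `&` or `=` would break the hand-built string. A minimal alternative, assuming the same host, port, and collection defaults, is to let `urllib.parse.urlencode` encode every value; `select_url` is a made-up name for this sketch.

```python
from urllib.parse import urlencode

def select_url(params, port="8983", collection="sufaculty"):
    """Build a /select URL with every parameter value percent-encoded."""
    base = f"http://localhost:{port}/solr/{collection}/select?"
    return base + urlencode(params)

print(select_url({"df": "bio", "q": "computer", "sort": "joined asc",
                  "fl": "name,email,joined"}))
# http://localhost:8983/solr/sufaculty/select?df=bio&q=computer&sort=joined+asc&fl=name%2Cemail%2Cjoined
```

Passing the dictionary directly as `requests.get(url, params=params)` has the same effect.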

### A Few More Details on SOLR Search and Response

<pre>
request  ->  SOLR service -> dispatch to core -> Search Handler -> Query Parser -> Lookup
  -> Response Writer
</pre>

Response can contain
* Results (fields of documents)
* Facets -- for example, for tweeters, facet on tweet count.  For products, facet on product categories
* More like this -- documents similar to the search results (not necessarily based on the query terms)
* Highlight -- choose snippets of documents with matching terms
* Stats -- statistics on numeric fields
* Debug -- parsed query string plus info on how documents were scored
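
Most of these sections are switched on with extra request parameters. The sketch below adds the standard Solr parameters for facets, highlighting, stats, and debug output to an existing parameter dictionary; the helper name `with_extras` and the choice of fields are assumptions for illustration.

```python
def with_extras(params, facet_field=None, highlight_field=None,
                stats_field=None, debug=False):
    """Return a copy of params with optional response sections enabled."""
    extra = dict(params)
    if facet_field:
        extra.update({"facet": "true", "facet.field": facet_field})
    if highlight_field:
        extra.update({"hl": "true", "hl.fl": highlight_field})
    if stats_field:
        extra.update({"stats": "true", "stats.field": stats_field})
    if debug:
        extra["debugQuery"] = "true"
    return extra

print(with_extras({"q": "bio:computer"}, facet_field="joined",
                  highlight_field="bio", debug=True))
```

In the JSON reply these sections arrive as top-level keys next to `response` (for example `facet_counts`, `highlighting`, `stats`, `debug`), so code that returns only `response` will not see them.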

#### Query Parsers
Lucene query parser vs dismax vs edismax.  

>In general, the DisMax query parser’s interface is more like that of Google than the interface of the 'standard' Solr request handler. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages.

> The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. 
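
The parser is chosen per request with the `defType` parameter. As a hedged sketch, the helper below builds parameters for an eDisMax query, where the user's text stays free of field syntax and `qf` lists the fields to search with their boosts. The helper name and the field boosts are assumptions, not settings from this lab.

```python
def edismax_params(user_text, field_boosts):
    """Build request parameters for an eDisMax query over several boosted fields."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "edismax", "q": user_text, "qf": qf}

print(edismax_params("machine learning", {"name": 2, "bio": 1}))
# {'defType': 'edismax', 'q': 'machine learning', 'qf': 'name^2 bio^1'}
```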

#### Filters vs Queries

* Filter (fq):  reduce number of "qualifying" documents without ranking
* Query (q): both reduce and rank
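
A small sketch of the distinction: the ranked part of the request goes in `q`, and the yes/no restriction goes in `fq`. The helper name `filtered_query` is made up, and the field names are taken from the faculty documents shown earlier.

```python
def filtered_query(q, fq=None):
    """Build parameters where q ranks documents and fq only restricts them."""
    params = {"q": q}
    if fq:
        params["fq"] = fq
    return params

# Rank by how well the bio matches, but only among faculty who joined in 2010 or later
print(filtered_query("bio:intelligence", "joined:[2010 TO *]"))
# {'q': 'bio:intelligence', 'fq': 'joined:[2010 TO *]'}
```

With pysolr, the same request can be written as `solr.search("bio:intelligence", fq="joined:[2010 TO *]")`, since extra keyword arguments are passed through as request parameters.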

##### General form is 
<pre>  fieldName:queryString </pre>

##### Decorations
*  +solr
*  apache AND solr
*  apache OR solr
* "apache solr"   -- but remember stopword removal and stemming
* "apache solr"~3
* solr AND NOT (panel OR electricity)
* number: \[0 TO *\]
* string: \[ape TO apple\]
* date: \[* TO NOW-1DAY\]
* hel* w?rld

##### Boosts
* apache^10 solr^100
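
Building these strings by hand gets error-prone once the pieces come from user input. The helpers below are a minimal sketch that reproduces a few of the forms listed above; the function names are made up for this example.

```python
def phrase(text, slop=None):
    """Quote a phrase; an optional slop turns it into a proximity search."""
    quoted = '"' + text.replace('"', '\\"') + '"'
    return quoted if slop is None else f"{quoted}~{slop}"

def in_range(field, low="*", high="*"):
    """Inclusive range on a field; * leaves that end open."""
    return f"{field}:[{low} TO {high}]"

def boosted(term_boosts):
    """Attach a boost to each term."""
    return " ".join(f"{term}^{boost}" for term, boost in term_boosts.items())

print(phrase("apache solr", slop=3))          # "apache solr"~3
print(in_range("joined", 2010))               # joined:[2010 TO *]
print(boosted({"apache": 10, "solr": 100}))   # apache^10 solr^100
```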

### Responses
* Which document fields to return (fl)
* Computed values such as the relevancy score (add `score` to fl)
* Pagination (start and rows)
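
These three choices map onto the `fl`, `start`, and `rows` parameters. A minimal sketch, assuming 1-based page numbers and a hypothetical helper name `page_params`:

```python
def page_params(page, page_size=10, fields=("name", "email")):
    """Translate a 1-based page number into start/rows, and request the score."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "start": (page - 1) * page_size,          # number of results to skip
        "rows": page_size,                        # number of results to return
        "fl": ",".join(fields) + ",score",        # stored fields plus relevancy score
    }

print(page_params(3, page_size=5))
# {'start': 10, 'rows': 5, 'fl': 'name,email,score'}
```

The returned dictionary can be merged with a `q` parameter before the request is sent.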



### Larger example

1. OR keyword search on keywords in the name field
2. Optional phrase search on the bio
3. Optional threshold on year joined
4. Return name and email address only



In [None]:
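def standard_query(name_keywords, bio_phrase='', joined_threshold=''):
    # OR the keywords together and restrict them to the name field
    qvalue = "name:(" + " OR ".join(name_keywords.split()) + ")"
    if bio_phrase:
        # Optional phrase match against the bio field
        qvalue += f' AND bio:"{bio_phrase}"'
    if joined_threshold:
        # Optional lower bound on the year joined
        qvalue += f" AND joined:[{joined_threshold} TO *]"
    # Return only the name and email address
    return {"q": qvalue, "fl": "name,email"}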



In [None]:
standard_query("richard adair", "computer science", "100")

In [None]:
p = standard_query("richard adair", "computer science", "100")
solr_select(p)