###### CSCE 670 Information Storage and Retrieval Spring 2020
---
# Spotlight - Guideline for PyLucene
--Handong Hao

## 1. Introduction
PyLucene is a Python extension for accessing Java Lucene. Its goal is to let you use Lucene's text indexing and searching capabilities from Python. PyLucene is not a port of Lucene but a Python wrapper around Java Lucene: it embeds a Java VM with Lucene into a Python process.

## 2. Requirements
PyLucene is a Python extension built with JCC, so JCC must be built first. Building PyLucene also requires a Java Development Kit (JDK) and Ant; using the resulting PyLucene binaries requires only a Java Runtime Environment (JRE).
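
Before starting a build, it can help to confirm that the Java tools are reachable. The snippet below is a minimal sketch in Python 3 using only the standard library; the executable names `java`, `javac`, and `ant` are assumptions about a typical installation, and finding them on `PATH` does not guarantee that the versions are suitable.

```python
import shutil


def find_build_tools(names=("java", "javac", "ant")):
    """Return {tool: path or None} for each tool looked up on PATH."""
    # The default tool names are assumptions; adjust them for your system.
    return {name: shutil.which(name) for name in names}


for tool, path in find_build_tools().items():
    print(tool, "->", path or "not found on PATH")
```

A missing `javac` usually means that only a JRE is installed, which is enough to run PyLucene but not to build it.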

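## 3. Implementation
Here we use PyLucene to implement a simple search system. The work is split into two parts: creating a Lucene index with PyLucene's classes, and searching that index. The scripts below are written for Python 2 and the legacy `PyLucene` module; more recent PyLucene releases are imported as `lucene` and require a call to `lucene.initVM()` before any Lucene class is used.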

### 3.1 Creating a Lucene index (IndexFiles.py)
This part is loosely based on the Lucene (Java implementation) demo class *org.apache.lucene.demo.IndexFiles*. It takes a directory as an argument and recursively indexes every directory and file beneath it. The script shown here indexes the path and the name of each entry; it does not read the file contents. The resulting Lucene index is placed in the current directory and called 'index'.

First, let me introduce the main classes we will use in PyLucene.
>**Directory**: A Directory provides an abstraction layer for storing a list of files. A directory contains only files (no sub-folder hierarchy).

>**Analyzer**: An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

>**IndexWriter**: An IndexWriter creates and maintains an index.

>**Document**: Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value. A field may be stored with the document, in which case it is returned with search hits on the document. Thus each document should typically contain one or more stored fields that uniquely identify it.

>**Field**: A field is a section of a Document. Each field has three parts: name, type, and value. 
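
To make the relationship between these classes concrete, here is a toy model in Python 3 with no dependencies. It is only a sketch of the idea, not how Lucene is implemented: `ToyIndexWriter` and its `(name, value, tokenized)` tuples are hypothetical names invented for this illustration. Each document keeps its stored field values, and each field contributes terms to an inverted index.

```python
from collections import defaultdict


class ToyIndexWriter:
    """Toy stand-in for the IndexWriter idea; not Lucene's implementation."""

    def __init__(self):
        self.stored = []                   # doc id -> stored field values
        self.postings = defaultdict(set)   # (field, term) -> set of doc ids

    def add_document(self, fields):
        """fields: list of (name, value, tokenized) tuples, one per field."""
        doc_id = len(self.stored)
        self.stored.append({name: value for name, value, _ in fields})
        for name, value, tokenized in fields:
            # A tokenized field is split into lowercase terms; an
            # untokenized field is indexed as one exact term.
            terms = value.lower().split() if tokenized else [value]
            for term in terms:
                self.postings[(name, term)].add(doc_id)
        return doc_id


writer = ToyIndexWriter()
writer.add_document([("path", "/docs/Intro Notes.txt", False),
                     ("name", "Intro Notes.txt", True)])
print(sorted(writer.postings))
```

The untokenized `path` field can only be matched by its exact value, while the tokenized `name` field can be matched by any of its words. This mirrors the `UN_TOKENIZED` and `TOKENIZED` options used in the indexing script.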

In [None]:
import sys, os, PyLucene, threading, time  
from datetime import datetime  
class Ticker(object):  
  
    def __init__(self):  
        self.tick = True  
  
    def run(self):  
        while self.tick:  
            sys.stdout.write('.')  
            sys.stdout.flush()  
            time.sleep(1.0)  
  
class IndexFiles(object):
    """Usage: python IndexFiles.py <doc_directory>"""
  
    def __init__(self, root, storeDir, analyzer):  
  
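        if not os.path.exists(storeDir):
            os.mkdir(storeDir)
        # Open the on-disk directory that will hold the index files.
        store = PyLucene.FSDirectory.getDirectory(storeDir, False)
        # create=True builds a fresh index; False only opens an existing one
        # and fails on the first run, when 'index' is still empty.
        writer = PyLucene.IndexWriter(store, analyzer, True)
        # Index at most 1,048,576 terms per field.
        writer.setMaxFieldLength(1048576)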
        self.indexDocs(root, writer)  
        ticker = Ticker()  
        print 'optimizing index',  
        threading.Thread(target=ticker.run).start()  
        writer.optimize()  
        writer.close()  
        ticker.tick = False  
        print 'done'  
  
    def indexDocs(self, root, writer):  
        for root, dirnames, filenames in os.walk(root):  
            print root  
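            try:
                # Decode the byte-string path. 'GBK' assumes a Chinese-locale
                # Windows system; change it to match your file system.
                sroot = unicode(root, 'GBK')
                print sroot
            except UnicodeDecodeError:
                print "***************************unicode error**********************************"
                print root
                continue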
  
            #add dir  
            doc = PyLucene.Document()  
            doc.add(PyLucene.Field("path", sroot,  
                                   PyLucene.Field.Store.YES,  
                                   PyLucene.Field.Index.UN_TOKENIZED))  
  
            doc.add(PyLucene.Field("name", sroot,  
                                   PyLucene.Field.Store.YES,  
                                   PyLucene.Field.Index.TOKENIZED))  
            writer.addDocument(doc)  
              
            for filename in filenames:  
                try:
                    # Same encoding assumption as for the directory path.
                    filename = unicode(filename, 'GBK')
                except UnicodeDecodeError:
                    print "***************************unicode error******************************"
                    print filename
                    continue
                print "adding", filename  
                try: 
                    path =os.path.join(sroot, filename)  
                    doc = PyLucene.Document()  
                    doc.add(PyLucene.Field("path", path,  
                                           PyLucene.Field.Store.YES,  
                                           PyLucene.Field.Index.UN_TOKENIZED))  
                    doc.add(PyLucene.Field("name", filename,  
                                           PyLucene.Field.Store.YES,  
                                           PyLucene.Field.Index.TOKENIZED))  
                    writer.addDocument(doc)  
                except Exception, e:  
                    print "Failed in indexDocs:", e  
__debug = 0  
if __name__ == '__main__':  
    if __debug != 1:  
        if len(sys.argv) < 2:  
            print IndexFiles.__doc__  
            sys.exit(1)  
  
    print 'PyLucene', PyLucene.VERSION, 'Lucene', PyLucene.LUCENE_VERSION  
    start = datetime.now()  
    try:  
        if __debug != 1:  
            IndexFiles(sys.argv[1], "index", PyLucene.StandardAnalyzer())  
        else:  
            IndexFiles(r'c:/test', "index", PyLucene.StandardAnalyzer())  
        end = datetime.now()  
        print end - start  
    except Exception, e:  
        print "Failed: ", e 

Let's take a look at how it works. 

The main block reads the command-line argument and instantiates *StandardAnalyzer*; the *IndexFiles* constructor then opens a *Directory* and instantiates *IndexWriter*. The IndexWriter uses the Lucene *Directory* to store the index. In addition to the *FSDirectory* implementation used here, there are several other Directory subclasses that can write to RAM, to a database, etc.

Lucene *Analyzers* are processing pipelines that break text up into indexed tokens, also known as terms, and optionally perform other operations on these tokens, for example lowercasing, synonym insertion, or filtering out unwanted tokens. The Analyzer used here is *StandardAnalyzer*, which creates tokens using the Word Break rules from the Unicode Text Segmentation algorithm specified in Unicode Standard Annex #29, converts the tokens to lowercase, and then filters out stopwords. Stopwords are common words such as articles (a, an, the) and other tokens that may have less value for searching.
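
The three stages described above (tokenize, lowercase, drop stopwords) can be sketched in a few lines of Python 3. This is a simplified illustration only: the regular expression `\w+` is a rough stand-in for the Unicode word-break rules, the stopword list is a small made-up sample, and `toy_analyze` is a hypothetical name, not part of PyLucene.

```python
import re

# A tiny, illustrative stopword list; Lucene's own list differs.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in"}


def toy_analyze(text):
    """Split on non-word characters, lowercase, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(toy_analyze("The Quick Brown Fox jumps over an old Log"))
# prints ['quick', 'brown', 'fox', 'jumps', 'over', 'old', 'log']
```

Because the same analysis is applied to queries, a search for "Fox" and a search for "fox" end up looking for the same term.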

After *IndexWriter* is instantiated, let us look at the *indexDocs()* code. This function walks the directory tree with *os.walk* and creates one *Document* object for each directory and each file it finds. In this script a *Document* is simply a data object with two stored fields: *path*, which is indexed as a single untokenized term, and *name*, which is tokenized so that individual words can be searched. Each *Document* is then added to the IndexWriter.
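
The traversal itself is independent of Lucene. The following Python 3 sketch reproduces just the walking step and yields plain dictionaries where the script builds *Document* objects; `collect_entries` is a hypothetical helper name. As in the script, a directory entry uses its path for both fields.

```python
import os


def collect_entries(root):
    """Yield one {'path', 'name'} record per directory and per file under root."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Directory entry: the script stores the directory path in both fields.
        yield {"path": dirpath, "name": dirpath}
        for filename in filenames:
            yield {"path": os.path.join(dirpath, filename), "name": filename}


if __name__ == "__main__":
    for entry in collect_entries("."):
        print(entry["path"], "|", entry["name"])
```

Python 3 returns file names as text strings, so the explicit `unicode(..., 'GBK')` decoding step of the Python 2 script has no counterpart here.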

### 3.2 Searching Files (SearchFiles.py)
This script is loosely based on the Lucene (Java implementation) demo class *org.apache.lucene.demo.SearchFiles*. It prompts for a search query, then searches the Lucene index called 'index' in the current directory, matching the query against the 'name' field. It then displays the 'path' and 'name' fields of each hit it finds in the index.

In [None]:
from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory  
from PyLucene import VERSION, LUCENE_VERSION  
  
def run(searcher, analyzer):  
    while True:  
        print  
        print "Hit enter with no input to quit."  
        command = raw_input("Query:")  
        command = unicode(command, 'GBK')  
        if command == '':  
            return  
  
        print  
        print "Searching for:", command  
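        # Parse the input against the "name" field, using the same analyzer
        # that was used at index time.
        query = QueryParser("name", analyzer).parse(command)
        hits = searcher.search(query)
        print "%s total matching documents." % hits.length()

        # Print the stored fields of every hit.
        for i, doc in hits:
            print 'path:', doc.get("path"), 'name:', doc.get("name")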
  
  
if __name__ == '__main__':  
    STORE_DIR = "index"  
    print 'PyLucene', VERSION, 'Lucene', LUCENE_VERSION  
    directory = FSDirectory.getDirectory(STORE_DIR, False)  
    searcher = IndexSearcher(directory)  
    analyzer = StandardAnalyzer()  
    run(searcher, analyzer)  
    searcher.close()  

The script relies mainly on an *IndexSearcher*, a *StandardAnalyzer* (also used in 3.1), and a *QueryParser*. The query parser is constructed with an analyzer so that query text is interpreted in the same way as the indexed documents: finding word boundaries, lowercasing, and removing stopwords. The parser returns a *Query* object, which is then passed to the searcher. Note that it is also possible to construct a rich *Query* object programmatically without using the query parser; the parser simply translates the *Lucene query syntax* into the corresponding *Query* object.

Now run the two scripts from the command line, indexing first and then searching:

In [None]:
python IndexFiles.py c:/
python SearchFiles.py

To search for a single keyword, enter it directly. To find entries that contain two keywords at the same time, such as "Python" and "network", enter: "Python AND network". To find entries that contain either "Python" or "network", enter: "Python network" (or "Python OR network"), since OR is the default operator.
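
The difference between AND and OR can be pictured as set intersection versus set union over an inverted index. The Python 3 sketch below uses a toy index built from three made-up names; `build_index` and `search_terms` are hypothetical helpers for illustration, and they do not parse Lucene query syntax.

```python
def build_index(names):
    """Map each lowercase term to the set of names containing it."""
    index = {}
    for name in names:
        for term in name.lower().split():
            index.setdefault(term, set()).add(name)
    return index


def search_terms(index, terms, operator="OR"):
    """Combine per-term matches with AND (intersection) or OR (union)."""
    matches = [index.get(t.lower(), set()) for t in terms]
    if not matches:
        return set()
    combine = set.intersection if operator == "AND" else set.union
    return combine(*matches)


index = build_index(["Python network guide", "Python basics", "network tools"])
print(sorted(search_terms(index, ["Python", "network"], "AND")))
# prints ['Python network guide']
print(sorted(search_terms(index, ["Python", "network"], "OR")))
# prints ['Python basics', 'Python network guide', 'network tools']
```

The AND query returns only the entry that contains both words, while the OR query returns every entry that contains at least one of them.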


## References
[https://freethreads.net/2012/09/17/pylucene-part-i-creating-index/](https://freethreads.net/2012/09/17/pylucene-part-i-creating-index/)

[https://medium.com/@michaelaalcorn/how-to-use-pylucene-e2e2f540024c](https://medium.com/@michaelaalcorn/how-to-use-pylucene-e2e2f540024c)

[https://svn.apache.org/viewvc/lucene/pylucene/trunk/test3/](https://svn.apache.org/viewvc/lucene/pylucene/trunk/test3/)

[https://lucene.apache.org/pylucene/features.html](https://lucene.apache.org/pylucene/features.html)
