# Session 2 - Programming with Elasticsearch

## 1 Modifying Elasticsearch index behavior

In the previous session we had to manually clean the list of words in order to compute Zipf's and Heaps' laws.

Elasticsearch can pass the text through a pipeline of processes before indexing it, which cleans the text and discards anything that is not useful.

We are going to work with three of the usual processes:

* Tokenization
* Normalization
* Token filtering (stopwords and stemming)

The next cells configure the default analyzer for an index and analyze an example text. We are going to experiment with the available options and see which tokens result from the analysis.
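
Before touching the index, it may help to see the idea of an analysis pipeline in plain Python. The sketch below is not Elasticsearch code: the function names and the tiny stopword set are made up for illustration. It only shows that a tokenizer produces tokens and that the filters are then applied one after another, in the order in which they are listed.

```python
# Illustrative subset only, not the stopword list used by Elasticsearch
STOPWORDS = {"the", "a", "was"}


def whitespace_tokenizer(text):
    """Split the text on runs of whitespace."""
    return text.split()


def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop(tokens):
    """Token filter: drop tokens that are in STOPWORDS (case-sensitive)."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, tokenizer, filters):
    """Run the tokenizer, then each filter in the given order."""
    tokens = tokenizer(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


text = "The printer was printing"
print(analyze(text, whitespace_tokenizer, [lowercase, stop]))
print(analyze(text, whitespace_tokenizer, [stop, lowercase]))
```

In this toy version the order of the filters changes the result: when `stop` runs before `lowercase`, the capitalized "The" is not recognized as a stopword and survives.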


In [1]:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Index, analyzer, tokenizer

client = Elasticsearch()

## Tokenizer whitespace, filter lowercase

In [2]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('whitespace'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [3]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'word', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'word', u'start_offset': 3, u'position': 1}
{u'end_offset': 13, u'token': u'4\xeds', u'type': u'word', u'start_offset': 10, u'position': 2}
{u'end_offset': 18, u'token': u'was%', u'type': u'word', u'start_offset': 14, u'position': 3}
{u'end_offset': 28, u'token': u'&printing', u'type': u'word', u'start_offset': 19, u'position': 4}
{u'end_offset': 36, u'token': u'printed', u'type': u'word', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'word', u'start_offset': 37, u'position': 6}
{u'end_offset': 46, u'token': u'the.', u'type': u'word', u'start_offset': 42, u'position': 7}


## Tokenizer standard

In [4]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('standard'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [5]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'<ALPHANUM>', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'<ALPHANUM>', u'start_offset': 3, u'position': 1}
{u'end_offset': 13, u'token': u'4\xeds', u'type': u'<ALPHANUM>', u'start_offset': 10, u'position': 2}
{u'end_offset': 17, u'token': u'was', u'type': u'<ALPHANUM>', u'start_offset': 14, u'position': 3}
{u'end_offset': 28, u'token': u'printing', u'type': u'<ALPHANUM>', u'start_offset': 20, u'position': 4}
{u'end_offset': 36, u'token': u'printed', u'type': u'<ALPHANUM>', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'<ALPHANUM>', u'start_offset': 37, u'position': 6}
{u'end_offset': 45, u'token': u'the', u'type': u'<ALPHANUM>', u'start_offset': 42, u'position': 7}


## Tokenizer letter

In [6]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [7]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'word', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'word', u'start_offset': 3, u'position': 1}
{u'end_offset': 13, u'token': u'\xeds', u'type': u'word', u'start_offset': 11, u'position': 2}
{u'end_offset': 17, u'token': u'was', u'type': u'word', u'start_offset': 14, u'position': 3}
{u'end_offset': 28, u'token': u'printing', u'type': u'word', u'start_offset': 20, u'position': 4}
{u'end_offset': 36, u'token': u'printed', u'type': u'word', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'word', u'start_offset': 37, u'position': 6}
{u'end_offset': 45, u'token': u'the', u'type': u'word', u'start_offset': 42, u'position': 7}
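
The differences between the three tokenizers seen so far can be approximated with plain Python. This is only a rough sketch: the regular expressions below are stand-ins chosen for this example (the real `standard` tokenizer follows Unicode text segmentation rules and handles many more cases), but on the sample text they produce the same tokens as the outputs above.

```python
import re

TEXT = "my taylor 4ís was% &printing printed rich the."


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to the words."""
    return text.split()


def word_tokens(text):
    """Rough stand-in for 'standard': runs of letters, digits or underscores."""
    return re.findall(r"\w+", text)


def letter_tokens(text):
    """Rough stand-in for 'letter': runs of letters only, digits split tokens."""
    return re.findall(r"[^\W\d_]+", text)


for name, fn in [("whitespace", whitespace_tokens),
                 ("standard (approx.)", word_tokens),
                 ("letter (approx.)", letter_tokens)]:
    print(name, fn(TEXT))
```

Note how `was%` and `&printing` keep their symbols with the whitespace rule, while the letter rule also drops the leading digit of `4ís`.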


## Filter asciifolding

In [8]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [9]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'word', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'word', u'start_offset': 3, u'position': 1}
{u'end_offset': 13, u'token': u'is', u'type': u'word', u'start_offset': 11, u'position': 2}
{u'end_offset': 17, u'token': u'was', u'type': u'word', u'start_offset': 14, u'position': 3}
{u'end_offset': 28, u'token': u'printing', u'type': u'word', u'start_offset': 20, u'position': 4}
{u'end_offset': 36, u'token': u'printed', u'type': u'word', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'word', u'start_offset': 37, u'position': 6}
{u'end_offset': 45, u'token': u'the', u'type': u'word', u'start_offset': 42, u'position': 7}
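
The effect of folding `ís` into `is` can be imitated with the standard `unicodedata` module. This is a sketch of the idea rather than the filter's actual implementation: it decomposes each character and drops the combining accents, which covers accented Latin letters but, as far as this approach goes, not characters without a decomposition such as `ø`. The helper name `ascii_fold` is made up for this example.

```python
import unicodedata


def ascii_fold(token):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print([ascii_fold(t) for t in ["ís", "café", "niño"]])
```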


## Filter asciifolding + stop

In [10]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding', 'stop']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [11]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'word', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'word', u'start_offset': 3, u'position': 1}
{u'end_offset': 28, u'token': u'printing', u'type': u'word', u'start_offset': 20, u'position': 4}
{u'end_offset': 36, u'token': u'printed', u'type': u'word', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'word', u'start_offset': 37, u'position': 6}
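
In the output above the positions jump from 1 to 4 and the last token has disappeared: the removed stopwords still count as positions in the original text. A minimal sketch of that behavior, assuming a small illustrative stopword set (not the full list used by the `stop` filter):

```python
# Illustrative subset only, chosen to match the sample text
STOPWORDS = {"is", "was", "the"}


def stop_filter(tokens):
    """Drop stopwords but keep the original position of each surviving token."""
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOPWORDS]


tokens = ["my", "taylor", "is", "was", "printing", "printed", "rich", "the"]
for pos, tok in stop_filter(tokens):
    print(pos, tok)
```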


## Filter asciifolding + stop + snowball

In [12]:
# Index analyzer configuration
# Change the configuration and run this cell and the next to see the changes

# Tokenizers: whitespace, standard, classic, letter
# Filters: lowercase, asciifolding, stop, porter_stem, kstem, snowball
my_analyzer = analyzer('default',
    type='custom',
    tokenizer=tokenizer('letter'),
    filter=['lowercase','asciifolding','stop', 'snowball']
)
   
ind = Index('news', using=client)
ind.close()
ind.analyzer(my_analyzer)    
ind.save()
ind.open()

{u'acknowledged': True, u'shards_acknowledged': True}

Now you can ask the index to analyze any text. Feel free to change the text.

In [13]:
res = ind.analyze(body={'analyzer':'default', 'text':u'my taylor 4ís was% &printing printed rich the.'})
for r in res['tokens']:
    print(r)

{u'end_offset': 2, u'token': u'my', u'type': u'word', u'start_offset': 0, u'position': 0}
{u'end_offset': 9, u'token': u'taylor', u'type': u'word', u'start_offset': 3, u'position': 1}
{u'end_offset': 28, u'token': u'print', u'type': u'word', u'start_offset': 20, u'position': 4}
{u'end_offset': 36, u'token': u'print', u'type': u'word', u'start_offset': 29, u'position': 5}
{u'end_offset': 41, u'token': u'rich', u'type': u'word', u'start_offset': 37, u'position': 6}
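
Stemming is what maps both `printing` and `printed` to `print` in the output above. The toy function below only strips two suffixes and is in no way the Snowball algorithm, which applies many more rules and conditions. It is included just to show why different surface forms end up as the same index term. The name `naive_stem` and its parameters are invented for this example.

```python
def naive_stem(token, suffixes=("ing", "ed"), min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


print([naive_stem(t) for t in ["my", "taylor", "printing", "printed", "rich"]])
```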


Now **follow the instructions** in the documentation: index the documents from the previous session using the script 'IndexFilesPreprocess.py', then use the script 'CountWords.py' from the previous session to see how the set of tokens changes.

***

## 2 The index reloaded

You can use the modified indexer script ```IndexFilesPreprocess.py``` to experiment with the different options for the preprocessing pipeline.

You can change the **tokenizer** and apply different processes to the tokens, such as lowercasing, ASCII folding, stopword removal and different stemming algorithms.

***

## 3 Computing Tf-Idf and Cosine similarity

Now it is your turn to work on the session task.

The idea is to program a script that, given two document paths, obtains their ids, computes the Tf-Idf representation of the documents, and then computes and prints their cosine similarity.
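
As a reference for the computation itself, here is a small sketch on toy data that does not touch the index. It assumes one common weighting, tf = f / max_f and idf = log2(N / df); the session documentation may define the weights differently, and in that case its definition is the one to follow. The term counts, document frequencies and function names are all made up for this example.

```python
import math


def tf_idf(term_freqs, doc_freqs, n_docs):
    """Weight each term as (f / max_f) * log2(N / df). Assumed weighting."""
    max_f = max(term_freqs.values())
    return {t: (f / max_f) * math.log2(n_docs / doc_freqs[t])
            for t, f in term_freqs.items()}


def cosine(v1, v2):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v2[t] for t, w in v1.items() if t in v2)
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Toy data: number of documents containing each term, out of n_docs documents
doc_freqs = {"print": 2, "rich": 5, "taylor": 1}
n_docs = 10

# Toy term counts for two documents
d1 = tf_idf({"print": 2, "rich": 1}, doc_freqs, n_docs)
d2 = tf_idf({"print": 1, "taylor": 1}, doc_freqs, n_docs)

print("similarity = %.4f" % cosine(d1, d2))
```

Storing the vectors as dictionaries keeps only the terms that occur in each document, so the dot product iterates over shared terms instead of the whole vocabulary.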

**Follow the instructions** in the documentation and **pay attention** to the deliverables that you have to hand in for this session.