## Information Retrieval - Assignment 3
Group 3: Hooshyar Hosna, Lima Rachel, Lorefice Alessandra

### Import packages

In [1]:
from whoosh.fields import Schema, TEXT
import os.path
from whoosh.index import create_in
from whoosh.qparser import QueryParser
from whoosh import reading
from collections import defaultdict
from whoosh import scoring, searching
from whoosh.scoring import FunctionWeighting
from whoosh.analysis import RegexTokenizer, LowercaseFilter, StopFilter, StemmingAnalyzer, StandardAnalyzer, CharsetFilter, Filter
from whoosh.support.charset import accent_map
from whoosh.analysis import NgramFilter, NgramTokenizer
from whoosh.reading import IndexReader
from whoosh import qparser

In [2]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [3]:
import pandas as pd

## Create an index and import the corpus


In [4]:
def createIndex(index_name, schema = Schema(content = TEXT)):

    '''
    Given an index name and a schema, create the index directory and index the corpus
    (one document per line of the corpus file).
    
    '''
    
    # Create an index and import the corpus
    if not os.path.exists(index_name): #to check whether the specified path exists or not
        os.mkdir(index_name) #to create a directory
    ix = create_in(index_name, schema)
    
    writer = ix.writer()
    
    # Read textual documents from file
    documents_path = 'AssociatedPress.txt'
    with open(documents_path, 'r', encoding='utf-8') as doc_f:
        corpus_list = doc_f.readlines()
    
    # Index documents
    for x in corpus_list:
        writer.add_document(content= x)
    
    writer.commit()

    return ix


### Query and results


In [5]:
def Search(ix, schema, scoring_option, queryTerms, limit = None):
    
    '''
    Given an index, a schema, a scoring option, a query string and an optional limit on the
    number of results, search for the query in the "content" field and return the results
    (for each result we are interested in score, document id and rank).
    
    '''

    with ix.searcher(weighting = scoring_option) as searcher:
        query = QueryParser("content", ix.schema).parse(queryTerms)
        results = searcher.search(query, limit = limit, terms = True) 
        
    return results

In [6]:
def printRanking(results):
    
    '''
    Given the results from the Search function, print how many documents were scored
    and return the ranking as a table of document ids and scores.
    
    '''
    
    doc_id = []
    scores = []
    
    
    for result in results:
        doc_id.append(result.docnum)
        scores.append(result.score)
    
    temp = dict()
    temp["Document ID"] = doc_id
    temp["Score"] = scores
    table = pd.DataFrame(temp)
    
    # How many scored and sorted documents in this Results object?
    print('Scored and sorted documents:', results.scored_length()) 
    print('')
       
    return table

## Scoring

We start by creating the default schema, which uses the standard analyzer (tokenizer, lowercase filter and optional stop-word filter). We compare three scoring options:
- Okapi BM25F
- TF-IDF
- Natural frequency





In [7]:
schema_default = Schema(content = TEXT(phrase=True, stored=True)) 
ix_default = createIndex(index_name="index_default", schema=schema_default)

### Okapi BM25F

This is the default weight used by the library to compute the score of a collection of documents with respect to a query.

In [8]:
# BM25F is the default ranking function used by Whoosh. BM stands for "best matching".
# It builds on tf-idf and adds factors such as the document length in words and the
# average document length in the collection. Its free parameters default to K1 = 1.2 and B = 0.75.

okapi_BM25F = scoring.BM25F() 
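
To make the role of the two free parameters more concrete, here is a minimal sketch of the textbook BM25 contribution of a single term. It is not taken from the Whoosh source: the function name `bm25_term_score` and all numbers are assumptions chosen for illustration.

```python
def bm25_term_score(tf, idf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one query term to one document (a sketch)."""
    # Length normalisation: equals 1.0 for an average-length document,
    # grows for longer documents and shrinks for shorter ones.
    length_norm = (1.0 - b) + b * (doc_len / avg_doc_len)
    # The term frequency saturates: the fraction never exceeds k1 + 1.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * length_norm)


# Hypothetical values: same term frequency, different document lengths.
average_doc = bm25_term_score(tf=1, idf=2.0, doc_len=100, avg_doc_len=100)
long_doc = bm25_term_score(tf=1, idf=2.0, doc_len=400, avg_doc_len=100)
many_hits = bm25_term_score(tf=1000, idf=2.0, doc_len=100, avg_doc_len=100)

print(average_doc, long_doc, many_hits)
```

With the same term frequency, the longer document receives the lower score, and even a very large term frequency cannot push the score above `idf * (k1 + 1)`. This is one plausible reason why the BM25F ranking differs from the raw-frequency ranking.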

In [9]:
printRanking(Search(ix_default, schema_default, okapi_BM25F, "Michael Dukakis"))

Scored and sorted documents: 70



Unnamed: 0,Document ID,Score
0,1662,13.514338
1,1250,13.086147
2,394,13.070377
3,2032,12.757304
4,1350,12.643483
...,...,...
65,1627,6.899361
66,1747,6.899361
67,1007,6.109321
68,670,6.086949


In [10]:
printRanking(Search(ix_default, schema_default, okapi_BM25F, "Dukakis OR  Bush"))

Scored and sorted documents: 274



Unnamed: 0,Document ID,Score
0,1475,15.022824
1,1350,14.740524
2,2220,14.739377
3,1237,14.658561
4,155,14.605196
...,...,...
269,2003,2.289851
270,30,2.244085
271,1027,2.244085
272,2143,2.244085


In [11]:
printRanking(Search(ix_default, schema_default, okapi_BM25F, "graduate of Syracuse University"))

Scored and sorted documents: 2



Unnamed: 0,Document ID,Score
0,215,19.41169
1,2003,11.412986


### TF-IDF

This weight combines the frequency of a term in the document (tf) with the inverse document frequency of the term in the collection (idf).

In [12]:
# It returns tf * idf scores of each document.
TF_IDF = scoring.TF_IDF() 
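
As a rough illustration of the idea, the sketch below computes tf * idf with one common smoothed idf variant. The exact idf formula is an assumption made for this example, and the collection size and document frequencies are hypothetical.

```python
import math


def idf(total_docs, doc_freq):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    return math.log(total_docs / (doc_freq + 1)) + 1.0


def tf_idf(term_freq, total_docs, doc_freq):
    """Score of one term in one document: term frequency times idf."""
    return term_freq * idf(total_docs, doc_freq)


# Hypothetical collection of 1000 documents; both terms occur twice in the document.
rare_term = tf_idf(term_freq=2, total_docs=1000, doc_freq=9)      # found in 9 documents
common_term = tf_idf(term_freq=2, total_docs=1000, doc_freq=899)  # found in 899 documents

print(rare_term, common_term)
```

The rare term contributes far more to the score than the common one, although both occur equally often in the document.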

In [13]:
printRanking(Search(ix_default, schema_default, TF_IDF, "Michael Dukakis"))

Scored and sorted documents: 70



Unnamed: 0,Document ID,Score
0,1856,95.986556
1,1475,95.156550
2,1662,89.925791
3,420,82.784298
4,865,73.982792
...,...,...
65,1747,7.971499
66,2069,7.971499
67,2073,7.971499
68,2158,7.971499


In [14]:
printRanking(Search(ix_default, schema_default, TF_IDF, "Dukakis OR Bush"))

Scored and sorted documents: 274



Unnamed: 0,Document ID,Score
0,1475,154.731582
1,155,118.301784
2,2220,112.442178
3,990,110.488976
4,1856,108.300697
...,...,...
269,2096,3.176977
270,2112,3.176977
271,2143,3.176977
272,2228,3.176977


In [15]:
printRanking(Search(ix_default, schema_default, TF_IDF, "graduate of Syracuse University"))

Scored and sorted documents: 2



Unnamed: 0,Document ID,Score
0,215,24.896802
1,2003,15.834567


### Frequency

This weight only considers how often the query's terms occur in each document.

In [16]:
# It simply returns the number of times the query terms occur in the document.
# It does not perform any normalization or weighting.

# In particular, it does not consider the length of the documents.

frequency = scoring.Frequency()

In [17]:
printRanking(Search(ix_default, schema_default, frequency, "Michael Dukakis"))

Scored and sorted documents: 70



Unnamed: 0,Document ID,Score
0,1475,22.0
1,1856,22.0
2,1662,21.0
3,420,19.0
4,865,17.0
...,...,...
65,1747,2.0
66,2069,2.0
67,2073,2.0
68,2158,2.0


In [18]:
printRanking(Search(ix_default, schema_default, frequency, "graduate of Syracuse University"))

Scored and sorted documents: 2



Unnamed: 0,Document ID,Score
0,215,5.0
1,2003,3.0


In [19]:
printRanking(Search(ix_default, schema_default, frequency, "Dukakis OR Bush"))

Scored and sorted documents: 274



Unnamed: 0,Document ID,Score
0,1475,41.0
1,155,33.0
2,76,30.0
3,2220,30.0
4,990,29.0
...,...,...
269,2112,1.0
270,2143,1.0
271,2158,1.0
272,2228,1.0


### Observations

When we compare the top 10 documents for the query "Michael Dukakis", we observe that the rankings produced by natural frequency (tf) and TF-IDF are very similar, while Okapi BM25F ranks the documents quite differently (see table below).

In [20]:
compare_scoring = dict()

In [21]:
results_okapi = Search(ix_default, schema_default, okapi_BM25F, "Michael Dukakis", limit = 10)
results_tfidf = Search(ix_default, schema_default, TF_IDF, "Michael Dukakis", limit = 10)
results_freq = Search(ix_default, schema_default, frequency, "Michael Dukakis", limit = 10)

In [22]:
id_okapi = []
id_tfidf = []
id_freq = []

for i in range(10):
    id_okapi.append(results_okapi[i].docnum)
    id_tfidf.append(results_tfidf[i].docnum)
    id_freq.append(results_freq[i].docnum)

In [23]:
compare_scoring["Okapi BM25F"] = id_okapi
compare_scoring["TF-IDF"] = id_tfidf
compare_scoring["Natural Frequency"] = id_freq

In [24]:
print("Rankings using different scorings for the query <<Michael Dukakis>>.")
print("The numbers in the table represent the documents' ids")

pd.DataFrame(compare_scoring)

Rankings using different scorings for the query <<Michael Dukakis>>.
The numbers in the table represent the documents' ids


Unnamed: 0,Okapi BM25F,TF-IDF,Natural Frequency
0,1662,1856,1475
1,1250,1475,1856
2,394,1662,1662
3,2032,420,420
4,1350,865,865
5,528,31,31
6,969,990,990
7,652,1963,1963
8,1475,545,545
9,656,776,776
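
The similarity between the three columns of the table can be quantified with a small sketch. The document ids are copied from the table above; the helper names `overlap` and `same_position` are introduced here only for illustration.

```python
# Top-10 document ids, copied from the comparison table
okapi = [1662, 1250, 394, 2032, 1350, 528, 969, 652, 1475, 656]
tfidf = [1856, 1475, 1662, 420, 865, 31, 990, 1963, 545, 776]
freq = [1475, 1856, 1662, 420, 865, 31, 990, 1963, 545, 776]


def overlap(a, b):
    """Number of document ids shared by two top-k lists."""
    return len(set(a) & set(b))


def same_position(a, b):
    """Number of ranks at which two lists hold the same document id."""
    return sum(x == y for x, y in zip(a, b))


print("TF-IDF vs frequency:", overlap(tfidf, freq), same_position(tfidf, freq))
print("BM25F vs TF-IDF:", overlap(okapi, tfidf), same_position(okapi, tfidf))
```

TF-IDF and natural frequency share all 10 documents and agree on 8 ranks (only the first two are swapped), whereas BM25F shares just 2 documents with TF-IDF and agrees on no rank.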


Other comments:

We noticed that the query "Michael Dukakis" returns 70 results, while "Dukakis" alone returns 74 documents. Since the query parser combines the two terms with AND by default, this means that 4 documents contain only the surname and cannot be retrieved with the query "Michael Dukakis".
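
The difference between the two result counts comes down to set operations on postings lists. The sketch below uses a hypothetical toy index (the postings are invented, not taken from the corpus) to show AND and OR semantics. In Whoosh itself, OR semantics can be requested by passing `group=qparser.OrGroup` to `QueryParser`.

```python
# Hypothetical postings: term -> set of document ids (toy data)
postings = {
    "michael": {1, 2, 3, 7},
    "dukakis": {2, 3, 4, 5},
}


def match_all(terms):
    """AND semantics: documents that contain every term."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def match_any(terms):
    """OR semantics: documents that contain at least one term."""
    return set().union(*(postings.get(term, set()) for term in terms))


both = match_all(["michael", "dukakis"])
either = match_any(["michael", "dukakis"])
surname_only = postings["dukakis"] - both

print(sorted(both), sorted(either), sorted(surname_only))
```

In this toy index, documents 4 and 5 play the role of the 4 documents that mention only the surname: they match "dukakis" but not the AND query.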

It seems that "Dukakis" is a distinctive term with a high tf*idf. We can confirm this by listing the most distinctive terms of the collection (below), where it appears among the top 30.

In [25]:
# most distinctive terms:
reader = ix_default.reader() 
reader.most_distinctive_terms('content', number=30, prefix='')   

[(2844.849142675824, b'he'),
 (2796.5082422326022, b'percent'),
 (2696.998412178119, b'his'),
 (2451.3391138734355, b'bush'),
 (2315.9425510257033, b'soviet'),
 (2240.135717976341, b'she'),
 (2104.22029764667, b'her'),
 (2036.5207684689315, b'police'),
 (2016.3449929525473, b'would'),
 (2014.4609222483714, b'million'),
 (1963.7794703697907, b'u.s'),
 (1909.064593190822, b'government'),
 (1866.3140113605505, b'were'),
 (1862.012251749837, b'they'),
 (1778.1480529972948, b'new'),
 (1769.0883046146644, b'000'),
 (1749.313180456082, b'billion'),
 (1733.8495348767626, b'people'),
 (1709.354769245226, b'has'),
 (1702.1423255507952, b'said'),
 (1680.5226238820896, b'who'),
 (1676.18342218846, b'had'),
 (1668.4946875682845, b'president'),
 (1661.658802509878, b'their'),
 (1659.9663831062758, b'year'),
 (1624.1508755482414, b'its'),
 (1593.0647915662814, b'state'),
 (1570.520895027367, b'dukakis'),
 (1553.0118791899372, b'but'),
 (1547.0679549405504, b'court')]

## Other queries

Example with phrases:

In [26]:
printRanking(Search(ix_default, schema_default, TF_IDF, "to be a problem until scientists"))

Scored and sorted documents: 5



Unnamed: 0,Document ID,Score
0,19,57.13686
1,290,49.453291
2,584,35.588536
3,1600,16.811555
4,813,11.756875


In [27]:
printRanking(Search(ix_default, schema_default, TF_IDF, "problem until scientists"))

Scored and sorted documents: 5



Unnamed: 0,Document ID,Score
0,19,57.13686
1,290,49.453291
2,584,35.588536
3,1600,16.811555
4,813,11.756875


Since "to", "be" and "a" are stop words, they are removed from the query. As a result, the two queries return the same results.
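
A minimal sketch of why both queries end up identical after analysis. The stop list is a small assumed subset for illustration, not the full list used by the library, and `analyze` is a hypothetical helper.

```python
# Small illustrative stop list (an assumed subset, not the library's full list)
STOP_WORDS = {"a", "an", "be", "of", "the", "to"}


def analyze(text):
    """Lowercase the text, split on whitespace and drop stop words."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]


long_query = analyze("to be a problem until scientists")
short_query = analyze("problem until scientists")

print(long_query)
print(short_query)
```

Both calls yield the same three terms, so the searcher receives the same query in both cases.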

In [28]:
printRanking(Search(ix_default, schema_default, TF_IDF, "problem NOT scientist"))

Scored and sorted documents: 137



Unnamed: 0,Document ID,Score
0,1439,16.021585
1,1516,16.021585
2,250,12.266189
3,587,12.266189
4,812,12.266189
...,...,...
132,2190,4.755396
133,2193,4.755396
134,2202,4.755396
135,2214,4.755396


In [29]:
printRanking(Search(ix_default, schema_default, TF_IDF, "pro* scienti?t")).head()

Scored and sorted documents: 23



Unnamed: 0,Document ID,Score
0,19,93.597726
1,2031,65.029262
2,1234,53.26461
3,290,46.054239
4,584,45.293532


## Different Analyzers for indexing 

In the following cells we apply different analyzers to the schema. To do so, we create different schemas and indexes and run different queries to compare the results.

In [30]:
class LemmatizationFilter(Filter):
    
    def __call__ (self, tokens):
        lemmatizer = WordNetLemmatizer()
        for t in tokens:
            t.text = lemmatizer.lemmatize(t.text)
            yield t

In [31]:
#just tokens
schema1 = Schema(content=TEXT(phrase=True, stored=True, analyzer=RegexTokenizer()))
ix1 = createIndex(index_name="ix1", schema=schema1)

#with lowercase filter
schema2 = Schema(content=TEXT(phrase=True, stored=True, analyzer=RegexTokenizer() | LowercaseFilter()))
ix2 = createIndex(index_name="ix2", schema=schema2)

#with stop filter
schema3 = Schema(content=TEXT(phrase=True, stored=True, analyzer=RegexTokenizer() | LowercaseFilter() | StopFilter()))
ix3 = createIndex(index_name="ix3", schema=schema3)

#with stemming 
schema4 = Schema(content=TEXT(phrase=True, stored=True, spelling=True, analyzer=StemmingAnalyzer()))
ix4 = createIndex(index_name="ix4", schema=schema4)

#with lemmatization
schema5 = Schema(content=TEXT(phrase=True, stored=True, analyzer=StandardAnalyzer() | LemmatizationFilter()))
ix5 = createIndex(index_name="ix5", schema=schema5)

### Schema1 - Tokenize

With schema1, the index distinguishes between upper- and lower-case words. The searches below confirm this. There is also no stop-word filter, so searching for a stop word returns results.
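
The effect of the missing lowercase filter can be reproduced with a toy inverted index. The three documents and the helper `build_index` are hypothetical and serve only to illustrate the behaviour.

```python
from collections import defaultdict

# Hypothetical three-document corpus
docs = {0: "The report", 1: "the THE report", 2: "Michael said the"}


def build_index(docs, lowercase):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.lower() if lowercase else token
            index[term].add(doc_id)
    return index


case_sensitive = build_index(docs, lowercase=False)
case_folded = build_index(docs, lowercase=True)

# Without lowercasing, "The", "the" and "THE" are three different terms
print(case_sensitive.get("The"), case_sensitive.get("the"), case_sensitive.get("THE"))
print(case_sensitive.get("michael"))  # None: only "Michael" was indexed
print(case_folded.get("the"))         # all three documents
```

With lowercasing, the query term has to be lowercased as well, which is what happens when the same analyzer is applied to both the documents and the query.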

In [32]:
printRanking(Search(ix1, schema1, TF_IDF, "michael"))

Scored and sorted documents: 0



Unnamed: 0,Document ID,Score


In [33]:
printRanking(Search(ix1, schema1, TF_IDF, "Michael"))

Scored and sorted documents: 171



Unnamed: 0,Document ID,Score
0,687,28.565972
1,233,14.282986
2,801,10.712239
3,1662,10.712239
4,1688,10.712239
...,...,...
166,2206,3.570746
167,2213,3.570746
168,2220,3.570746
169,2225,3.570746


In [34]:
printRanking(Search(ix1, schema1, TF_IDF, "MICHAEL"))

Scored and sorted documents: 0



Unnamed: 0,Document ID,Score


In [35]:
printRanking(Search(ix1, schema1, TF_IDF, "the"))

Scored and sorted documents: 2227



Unnamed: 0,Document ID,Score
0,439,86.806796
1,1746,82.769271
2,837,81.759889
3,782,80.750508
4,1293,79.741127
...,...,...
2222,2057,1.009381
2223,2082,1.009381
2224,2097,1.009381
2225,2101,1.009381


In [36]:
printRanking(Search(ix1, schema1, TF_IDF, "The"))

Scored and sorted documents: 2035



Unnamed: 0,Document ID,Score
0,408,25.288467
1,1007,21.989971
2,784,19.790974
3,1223,17.591977
4,1695,17.591977
...,...,...
2030,2228,1.099499
2031,2235,1.099499
2032,2237,1.099499
2033,2242,1.099499


In [37]:
printRanking(Search(ix1, schema1, TF_IDF, "THE"))

Scored and sorted documents: 1



Unnamed: 0,Document ID,Score
0,1985,8.025094


### Schema2 - Tokenize, Lowercase Filter

With schema2, the index contains only lower-case words, and the case of the query terms no longer matters. Again, there is no stop-word filter, so stop words are still indexed.

In [38]:
printRanking(Search(ix2, schema2, TF_IDF, "michael"))

Scored and sorted documents: 171



Unnamed: 0,Document ID,Score
0,687,28.565972
1,233,14.282986
2,801,10.712239
3,1662,10.712239
4,1688,10.712239
...,...,...
166,2206,3.570746
167,2213,3.570746
168,2220,3.570746
169,2225,3.570746


In [39]:
printRanking(Search(ix2, schema2, TF_IDF, "Michael"))

Scored and sorted documents: 171



Unnamed: 0,Document ID,Score
0,687,28.565972
1,233,14.282986
2,801,10.712239
3,1662,10.712239
4,1688,10.712239
...,...,...
166,2206,3.570746
167,2213,3.570746
168,2220,3.570746
169,2225,3.570746


In [40]:
printRanking(Search(ix2, schema2, TF_IDF, "MICHAEL"))

Scored and sorted documents: 171



Unnamed: 0,Document ID,Score
0,687,28.565972
1,233,14.282986
2,801,10.712239
3,1662,10.712239
4,1688,10.712239
...,...,...
166,2206,3.570746
167,2213,3.570746
168,2220,3.570746
169,2225,3.570746


In [41]:
printRanking(Search(ix2, schema2, TF_IDF, "the"))

Scored and sorted documents: 2231



Unnamed: 0,Document ID,Score
0,782,95.720825
1,439,90.682887
2,837,87.660124
3,1585,87.660124
4,1746,87.660124
...,...,...
2226,1699,1.007588
2227,2082,1.007588
2228,2097,1.007588
2229,2101,1.007588


In [42]:
printRanking(Search(ix2, schema2, TF_IDF, "The"))

Scored and sorted documents: 2231



Unnamed: 0,Document ID,Score
0,782,95.720825
1,439,90.682887
2,837,87.660124
3,1585,87.660124
4,1746,87.660124
...,...,...
2226,1699,1.007588
2227,2082,1.007588
2228,2097,1.007588
2229,2101,1.007588


In [43]:
printRanking(Search(ix2, schema2, TF_IDF, "THE"))

Scored and sorted documents: 2231



Unnamed: 0,Document ID,Score
0,782,95.720825
1,439,90.682887
2,837,87.660124
3,1585,87.660124
4,1746,87.660124
...,...,...
2226,1699,1.007588
2227,2082,1.007588
2228,2097,1.007588
2229,2101,1.007588


### Schema3 - Tokenize, Lowercase Filter, Stopwords Filter

With schema3, stop words are also removed from the index:

In [44]:
printRanking(Search(ix3, schema3, TF_IDF, "the"))

Scored and sorted documents: 0



Unnamed: 0,Document ID,Score


### Schema4 -  Tokenize, Lowercase Filter, Stopwords Filter, Stemming 

With schema4, words are stemmed. To test the new schema, we search both schema3 and the stemming schema for two words that stemming should map to the same term: *rock* and *rocks*.

In [45]:
printRanking(Search(ix3, schema3, TF_IDF, "rock")).head()

Scored and sorted documents: 47



Unnamed: 0,Document ID,Score
0,1221,24.2352
1,289,19.38816
2,108,14.54112
3,167,9.69408
4,324,9.69408


In [46]:
printRanking(Search(ix3, schema3, TF_IDF, "rocks")).head()

Scored and sorted documents: 18



Unnamed: 0,Document ID,Score
0,2015,23.095208
1,364,11.547604
2,770,11.547604
3,800,11.547604
4,1119,11.547604


Now, with the stemming index, the results for the same queries change.

In [47]:
printRanking(Search(ix4, schema4, TF_IDF, "rock"))

Scored and sorted documents: 72



Unnamed: 0,Document ID,Score
0,1221,22.138908
1,2015,22.138908
2,289,17.711126
3,108,13.283345
4,1119,13.283345
...,...,...
67,1982,4.427782
68,1985,4.427782
69,2040,4.427782
70,2096,4.427782


In [48]:
printRanking(Search(ix4, schema4, TF_IDF, "rocks"))

Scored and sorted documents: 72



Unnamed: 0,Document ID,Score
0,1221,22.138908
1,2015,22.138908
2,289,17.711126
3,108,13.283345
4,1119,13.283345
...,...,...
67,1982,4.427782
68,1985,4.427782
69,2040,4.427782
70,2096,4.427782


Indeed, *rock* and *rocks* have the same results since, by using stemming, *rocks* became *rock*.

### Schema5 - StandardAnalyzer, Lemmatizer Filter

In this section we focus on developing a filter that performs lemmatization. To do so, we use **WordNetLemmatizer()** from the nltk library.

We search for the words *meet* and *meeting*. We should get different results with stemming and with lemmatization. Stemming reduces both words to *meet*. The lemmatizer instead maps a word to its dictionary form; as called in our filter (without a part-of-speech tag) it treats every token as a noun, so *meeting* is kept as a separate term from *meet*.
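
The contrast between the two approaches can be sketched without nltk. Both helpers below are deliberately crude and hypothetical: `toy_stem` is not the Porter stemmer, and `NOUN_LEMMAS` is a tiny stand-in for WordNet. The lookup treats every word as a noun, which mirrors the default behaviour of `WordNetLemmatizer.lemmatize` when no part-of-speech tag is passed.

```python
# Toy noun dictionary standing in for WordNet (hypothetical entries)
NOUN_LEMMAS = {"meetings": "meeting", "rocks": "rock"}


def toy_stem(word):
    """Crude suffix stripping, for illustration only (not the Porter algorithm)."""
    for suffix in ("ings", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Dictionary lookup that treats every word as a noun."""
    return NOUN_LEMMAS.get(word, word)


for word in ["meet", "meeting", "meetings", "rocks"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer collapses *meet*, *meeting* and *meetings* into one term, while the noun lookup keeps *meet* and *meeting* apart. This is consistent with the stemmed index returning the same results for both queries and the lemmatized index returning different ones.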

In [49]:
printRanking(Search(ix4, schema4, TF_IDF, "meet"))

Scored and sorted documents: 366



Unnamed: 0,Document ID,Score
0,1231,33.754549
1,291,30.941670
2,400,30.941670
3,415,28.128791
4,1644,22.503033
...,...,...
361,2167,2.812879
362,2197,2.812879
363,2201,2.812879
364,2208,2.812879


In [50]:
printRanking(Search(ix4, schema4, TF_IDF, "meeting"))

Scored and sorted documents: 366



Unnamed: 0,Document ID,Score
0,1231,33.754549
1,291,30.941670
2,400,30.941670
3,415,28.128791
4,1644,22.503033
...,...,...
361,2167,2.812879
362,2197,2.812879
363,2201,2.812879
364,2208,2.812879


Selecting the schema5, the content will lemmatize words.

In [51]:
printRanking(Search(ix5, schema5, TF_IDF, "meet"))

Scored and sorted documents: 150



Unnamed: 0,Document ID,Score
0,269,11.102883
1,1120,11.102883
2,1410,11.102883
3,94,7.401922
4,96,7.401922
...,...,...
145,2117,3.700961
146,2124,3.700961
147,2190,3.700961
148,2197,3.700961


In [52]:
printRanking(Search(ix5, schema5, TF_IDF, "meeting"))

Scored and sorted documents: 276



Unnamed: 0,Document ID,Score
0,1231,37.130681
1,291,30.942234
2,400,27.848011
3,415,27.848011
4,139,21.659564
...,...,...
271,2167,3.094223
272,2190,3.094223
273,2201,3.094223
274,2208,3.094223


### 'N-grams'

NGram filter: Breaks individual tokens into N-grams as part of an analysis pipeline. This is more useful for languages with word separation.

In [53]:
my_analyzer = StandardAnalyzer() | NgramFilter(minsize=4, maxsize=7)
[token.text for token in my_analyzer(u"rendering shaders")]

['rend',
 'rende',
 'render',
 'renderi',
 'ende',
 'ender',
 'enderi',
 'enderin',
 'nder',
 'nderi',
 'nderin',
 'ndering',
 'deri',
 'derin',
 'dering',
 'erin',
 'ering',
 'ring',
 'shad',
 'shade',
 'shader',
 'shaders',
 'hade',
 'hader',
 'haders',
 'ader',
 'aders',
 'ders']
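
The token list above can be reproduced with a few lines of plain Python. This is a sketch of the idea, not the library's implementation; `char_ngrams` is a hypothetical helper.

```python
def char_ngrams(word, minsize, maxsize):
    """All substrings of `word` whose length lies between minsize and maxsize."""
    grams = []
    for start in range(len(word)):
        for size in range(minsize, maxsize + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


# Apply the n-gram step to each token separately, as a filter in a pipeline would
tokens = []
for word in "rendering shaders".split():
    tokens.extend(char_ngrams(word, 4, 7))

print(tokens)
```

Because the n-grams are built per token, none of them spans the space between the two words.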

In [54]:
schema_for_ngrams = Schema(content=TEXT(phrase=True, stored=True, analyzer=my_analyzer))

In [None]:
ix_for_ngrams = createIndex(index_name="index_ngrams", schema=schema_for_ngrams)

Example 1: most frequent n-grams in the text that start with "tion":

In [None]:
reader = ix_for_ngrams.reader()
reader.most_frequent_terms('content', number=15, prefix='tion') 

NgramTokenizer: tokenizes the entire field into N-grams. This is more useful for Chinese/Japanese/Korean languages, where it’s useful to index bigrams of characters rather than individual characters. Using this tokenizer with roman languages leads to spaces in the tokens.

In [None]:
ngt = NgramTokenizer(minsize=2, maxsize=4)
[token.text for token in ngt(u"hi there")]

## Suggestions for spell correction

Below we report a simple correction of mistyped words in a query. To do so we used the schema with stemming (schema4) that we introduced before.

In [None]:
#a simple example for correction

corrector = ix4.searcher().corrector("content")

#example
corrector.suggest("micheal")
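
Suggestions of this kind are typically based on edit distance. The sketch below is a plain Levenshtein implementation with a hypothetical vocabulary; it illustrates the principle and makes no claim about how the library's corrector is implemented. The `maxdist` threshold is an assumed parameter of this sketch.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def suggest(word, vocabulary, maxdist=2):
    """Vocabulary words within maxdist edits of `word`, closest first."""
    scored = sorted((levenshtein(word, entry), entry) for entry in vocabulary)
    return [entry for distance, entry in scored if distance <= maxdist]


# Hypothetical vocabulary; a real corrector would draw its terms from the index
vocabulary = ["michael", "michigan", "dukakis", "bush"]

print(suggest("micheal", vocabulary))
print(suggest("dukkakis", vocabulary))
```

Here "micheal" is two substitutions away from "michael", and "dukkakis" is one deletion away from "dukakis", so each misspelling gets exactly one suggestion.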

In [None]:
print("Enter terms to search in the collection")
#query=str(input())
query = "Micheal Dukkakis"

parser = qparser.QueryParser("content", schema4)
parsed_query = parser.parse(query)

# Run the query once and reuse the results
results = Search(ix4, schema4, TF_IDF, query)

with ix4.searcher() as searcher:
    # With no limit every match is scored, so scored_length() counts all hits
    if results.scored_length() == 0:
        correction = searcher.correct_query(parsed_query, query)
        # Compare query objects (not a query object with the raw string)
        if correction.query != parsed_query:
            print("Maybe you wanted to search for:")
            print(correction.string)
            print("Try again!")
    else:
        print(printRanking(results))

## Conclusions

This assignment was very useful for understanding how filters and queries work.

We had some difficulty understanding how the schema and the index structures work, and how to extract information from them. In addition, when we used the scoring functions to rank query results, we had some trouble understanding how the scores are computed.

We also wanted to try cosine similarity but, as far as we could tell from the documentation, it is not available in the library.

We implemented a lemmatization filter, which is not available in the library. Implementing it ourselves helped us understand in depth how it works.

Unfortunately, we had no time to implement a more thorough spelling correction. It remains an interesting challenge for future work.