# Build `pyserini` Index for SemEval 2014 Task 4

- We first convert the original `xml` dataset files into a document collection in the `jsonl` format that `pyserini` can index.

- We build one `pyserini` index per domain (`laptop` and `restaurant`) from the documents in the original test sets.

- Lastly, we generate `test_queries_{laptop, restaurant}.txt` and `test_qrels_{laptop, restaurant}.txt` from the original test dataset by treating each unique aspect term as a query.
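
The steps above can be sketched on a tiny, made-up XML snippet. This is only an illustration, not the notebook's actual data: the sample sentences, the `SAMPLE_XML` name, and the `collection_id` value are assumptions, while the element and attribute names (`sentence`, `text`, `aspectTerm`, `term`) are the ones the parsing code in this notebook reads.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape the notebook's parsing code expects
SAMPLE_XML = """
<sentences>
  <sentence id="s1">
    <text>Boot time is super fast.</text>
    <aspectTerms>
      <aspectTerm term="Boot time" polarity="positive"/>
    </aspectTerms>
  </sentence>
  <sentence id="s2">
    <text>I returned it after a week.</text>
  </sentence>
</sentences>
"""

collection_id = 2  # hypothetical value standing in for a per-file collection id

root = ET.fromstring(SAMPLE_XML)
docs = []        # one document per sentence (step 1)
queries = set()  # unique aspect terms become queries (step 3)

for sentence_id, s in enumerate(root.iter('sentence')):
    docs.append({
        'id': f'doc{collection_id}{sentence_id}',
        'contents': s.find('text').text,
    })
    for aspect in s.iter('aspectTerm'):
        queries.add(aspect.get('term'))

print(docs)
print(sorted(queries))
```

Sentences without any `aspectTerm` still become documents; they simply contribute no query.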

## Google Colab setups

This part only runs when the notebook is executed in Google Colab. **Please change the working directory path below in advance!**

In [1]:
# Use Google Colab
use_colab = True

# Is this notebook running on Colab?
# If so, then google.colab package (github.com/googlecolab/colabtools)
# should be available in this environment

# Previous version used importlib, but we could do the same thing with
# just attempting to import google.colab
try:
    from google.colab import drive
    colab_available = True
except ImportError:
    colab_available = False

if use_colab and colab_available:
    # If there are packages I need to install separately, do it here
    !pip install pyserini==0.9.4.0 jsonlines==1.2.0

    # Mount Google Drive
    drive.mount('/content/drive')

    # cd to the appropriate working directory under my Google Drive
    # (IMPORTANT: THIS PATH MUST EXACTLY MATCH THE LOCATION OF THIS NOTEBOOK
    # IN YOUR GOOGLE DRIVE!)
    %cd '/content/drive/My Drive/CS646_Final_Project/BM25'

    # List the directory contents
    !ls

## Import packages

In [2]:
import os
import json 
import random
import xml.etree.ElementTree as ET
import pathlib
import math

import jsonlines 
from pyserini.search import SimpleSearcher

## Path setups

In [3]:
semeval_path = os.path.join('..', 'data', 'SemEval2014_Task4')
collection_path = 'collection'
index_path = 'index'

In [4]:
# This is used to assign unique document ids across the different collections.
collection_ids = {
    'Laptop_Train_v2.xml': 1,
    'Laptops_Test_Gold.xml': 2,
    'Restaurants_Test_Gold.xml': 3,
    'Restaurants_Train_v2.xml': 4,
}
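
The collection ids act as a prefix in the document ids built later in this notebook: `'doc'`, then the collection id, then the position of the sentence in its file, with no separator. A minimal sketch of that scheme follows; the helper names `make_doc_id` and `split_doc_id` are hypothetical and do not appear elsewhere in the notebook.

```python
collection_ids = {
    'Laptop_Train_v2.xml': 1,
    'Laptops_Test_Gold.xml': 2,
    'Restaurants_Test_Gold.xml': 3,
    'Restaurants_Train_v2.xml': 4,
}

def make_doc_id(file_name, sentence_id):
    """Hypothetical helper: 'doc' + collection id + sentence index."""
    return 'doc' + str(collection_ids[file_name]) + str(sentence_id)

def split_doc_id(doc_id):
    """Hypothetical inverse; valid only while collection ids are one digit."""
    digits = doc_id[len('doc'):]
    return int(digits[0]), int(digits[1:])

print(make_doc_id('Laptops_Test_Gold.xml', 0))
print(make_doc_id('Restaurants_Test_Gold.xml', 560))
print(split_doc_id('doc3560'))
```

Because there is no separator, the ids stay unambiguous only as long as every collection id is a single digit; a two-digit collection id would need a delimiter or zero padding.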

In [5]:
# For a fair comparison, generate collections for the original SemEval test datasets only
new_collection_files = {
   'Laptops_Test_Gold.xml': 'laptop_test/laptop_test.jsonl',
   'Restaurants_Test_Gold.xml': 'restaurant_test/restaurant_test.jsonl',
}

In [6]:
new_query_files = {
   'Laptops_Test_Gold.xml': 'test_queries_laptop.txt',
   'Restaurants_Test_Gold.xml': 'test_queries_restaurant.txt',
}

In [7]:
new_qrels_files = {
   'Laptops_Test_Gold.xml': {
       'qrels_filename': 'test_qrels_laptop.txt',
       'original_dataset_files': ['Laptops_Test_Gold.xml']
   },
   'Restaurants_Test_Gold.xml': {
       'qrels_filename': 'test_qrels_restaurant.txt',
       'original_dataset_files': ['Restaurants_Test_Gold.xml'] 
   },
}

## Create collection (`jsonl` files)

In [8]:
for file_name in new_collection_files.keys():
    
    file_path = os.path.join(semeval_path, file_name)

    save_path = os.path.join(collection_path, new_collection_files[file_name])
  
    print(save_path)
    
    # Make sure the output directory exists
    pathlib.Path(save_path).parent.mkdir(parents=True, exist_ok=True)

    with open(file_path) as semeval_file:
        sentence_elements = ET.parse(semeval_file).getroot().iter('sentence')

        # mode='w' truncates any file left over from a previous run, and the
        # writer is opened once instead of once per document
        with jsonlines.open(save_path, mode='w') as writer:
            for sentence_id, s in enumerate(sentence_elements):
                doc = {
                    # 'doc' + collection id + position of the sentence in the file
                    'id': 'doc' + str(collection_ids[file_name]) + str(sentence_id),
                    'contents': s.find('text').text,
                }
                writer.write(doc)

collection/laptop_test/laptop_test.jsonl
collection/restaurant_test/restaurant_test.jsonl
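
Before indexing, it can help to read a collection file back and confirm that every line has an `id` and a `contents` field and that no id repeats. The sketch below does this with the standard library only, on made-up documents written to a temporary file; `load_jsonl` and `check_collection` are hypothetical helper names, not part of `jsonlines` or `pyserini`.

```python
import json
import os
import tempfile

# Made-up documents in the same shape as the collection written above
sample_docs = [
    {'id': 'doc20', 'contents': 'first made-up sentence'},
    {'id': 'doc21', 'contents': 'second made-up sentence'},
]

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]

def check_collection(docs):
    """True if every doc has 'id' and 'contents' and all ids are unique."""
    has_fields = all(set(d) >= {'id', 'contents'} for d in docs)
    ids = [d.get('id') for d in docs]
    return has_fields and len(ids) == len(set(ids))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.jsonl')
    with open(path, 'w', encoding='utf-8') as fh:
        for d in sample_docs:
            fh.write(json.dumps(d) + '\n')
    loaded = load_jsonl(path)

print(check_collection(loaded))
```

To check a real file, `load_jsonl` could be pointed at one of the paths printed above, for example `collection/laptop_test/laptop_test.jsonl`.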


## Create `pyserini` index for laptop

In [9]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 1 -input collection/laptop_test \
 -index index/laptop_test -storePositions -storeDocvectors -storeRaw

2021-01-14 12:53:16,516 INFO  [main] index.IndexCollection (IndexCollection.java:636) - Setting log level to INFO
2021-01-14 12:53:16,527 INFO  [main] index.IndexCollection (IndexCollection.java:639) - Starting indexer...
2021-01-14 12:53:16,530 INFO  [main] index.IndexCollection (IndexCollection.java:641) - DocumentCollection path: collection/laptop_test
2021-01-14 12:53:16,532 INFO  [main] index.IndexCollection (IndexCollection.java:642) - CollectionClass: JsonCollection
2021-01-14 12:53:16,534 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Generator: DefaultLuceneDocumentGenerator
2021-01-14 12:53:16,536 INFO  [main] index.IndexCollection (IndexCollection.java:644) - Threads: 1
2021-01-14 12:53:16,538 INFO  [main] index.IndexCollection (IndexCollection.java:645) - Stemmer: porter
2021-01-14 12:53:16,540 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Keep stopwords? false
2021-01-14 12:53:16,542 INFO  [main] index.IndexCollection (IndexCollection.jav

## Create `pyserini` index for restaurant

In [10]:
!python -m pyserini.index -collection JsonCollection -generator DefaultLuceneDocumentGenerator \
 -threads 1 -input collection/restaurant_test \
 -index index/restaurant_test -storePositions -storeDocvectors -storeRaw

2021-01-14 12:56:42,371 INFO  [main] index.IndexCollection (IndexCollection.java:636) - Setting log level to INFO
2021-01-14 12:56:42,381 INFO  [main] index.IndexCollection (IndexCollection.java:639) - Starting indexer...
2021-01-14 12:56:42,383 INFO  [main] index.IndexCollection (IndexCollection.java:641) - DocumentCollection path: collection/restaurant_test
2021-01-14 12:56:42,384 INFO  [main] index.IndexCollection (IndexCollection.java:642) - CollectionClass: JsonCollection
2021-01-14 12:56:42,385 INFO  [main] index.IndexCollection (IndexCollection.java:643) - Generator: DefaultLuceneDocumentGenerator
2021-01-14 12:56:42,386 INFO  [main] index.IndexCollection (IndexCollection.java:644) - Threads: 1
2021-01-14 12:56:42,389 INFO  [main] index.IndexCollection (IndexCollection.java:645) - Stemmer: porter
2021-01-14 12:56:42,390 INFO  [main] index.IndexCollection (IndexCollection.java:646) - Keep stopwords? false
2021-01-14 12:56:42,391 INFO  [main] index.IndexCollection (IndexCollection

## Test the new indexes

### Laptop

In [12]:
from pyserini.search import SimpleSearcher

idx_path = os.path.join('index', 'laptop_test')

searcher = SimpleSearcher(idx_path)
hits = searcher.search('Boot time')

for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:15} {hit.score:.5f}')

 1 doc20           4.30850
 2 doc2523         3.11450
 3 doc2568         2.65380
 4 doc2273         2.54620
 5 doc2346         1.90690
 6 doc2672         1.90690
 7 doc2130         1.86390
 8 doc2721         1.86390
 9 doc2358         1.82280
10 doc2540         1.78340


### Restaurants

In [17]:
from pyserini.search import SimpleSearcher

idx_path = os.path.join('index', 'restaurant_test')

searcher = SimpleSearcher(idx_path)
hits = searcher.search('quality')

for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:15} {hit.score:.5f}')

 1 doc3560         2.62770
 2 doc3457         2.57050
 3 doc3625         2.57050
 4 doc3506         2.46310
 5 doc3330         2.31800
 6 doc3556         1.81790
 7 doc3710         1.73760


## Create test queries from the test dataset

In [15]:
for f in new_query_files.keys():
    
    file_path = os.path.join(semeval_path, f)
  
    save_path = new_query_files[f]
  
    print(save_path)
    
    queries = []

    with open(file_path) as semeval_file:
        sentence_elements = ET.parse(semeval_file).getroot().iter('sentence')

        for s in sentence_elements:
            # Every aspect term annotated on a sentence becomes a candidate query
            for o in s.iter('aspectTerm'):
                queries.append(o.get('term'))
                    
    print("Total number of queries:", len(queries))

    # Deduplicate, then sort so that query order (and thus query IDs) is reproducible
    unique_queries = sorted(set(queries))

    print("Total number of unique queries:", len(unique_queries))
    
    print()

    with open(save_path, 'w') as new_file:
        for q in unique_queries:
            new_file.write("%s\n" % q)

test_queries_laptop.txt
Total number of queries: 654
Total number of unique queries: 419

test_queries_restaurant.txt
Total number of queries: 1134
Total number of unique queries: 555
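
The query files carry no explicit ids: the qrels cell later in this notebook uses the 1-based position of each query in the sorted file as its id. A small sketch of that mapping, using made-up aspect terms (`raw_terms` and `query_ids` are illustrative names only):

```python
# Made-up aspect terms, including a repeated one
raw_terms = ['screen', 'Boot time', 'screen', 'battery life']

# Same deduplicate-then-sort step as the cell above
unique_queries = sorted(set(raw_terms))

# 1-based position in the sorted list serves as the query id
query_ids = {q: i for i, q in enumerate(unique_queries, start=1)}

print(unique_queries)
print(query_ids)
```

Both the deduplication and the sort are case-sensitive, so terms that differ only in capitalization count as separate queries, and capitalized terms sort before lowercase ones.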



## Create qrels.txt (ground truth) for test queries

In [16]:
for f in new_qrels_files.keys():
    
    query_path = new_query_files[f]
  
    save_path = new_qrels_files[f]['qrels_filename']
  
    print(save_path)
    
    with open(query_path, 'r') as test_query_file:
        unique_queries = test_query_file.readlines()

    rel_docIDs = {}
    
    for j in range(len(unique_queries)):
        rel_docIDs[j] = []
        
    for semeval_file_name in new_qrels_files[f]['original_dataset_files']:
        
        semeval_file_path = os.path.join(semeval_path, semeval_file_name)

        with open(semeval_file_path) as semeval_file:
            sentence_elements = ET.parse(semeval_file).getroot().iter('sentence')

            for id_, s in enumerate(sentence_elements):
                # doc_id used in our Pyserini index
                doc_id = 'doc' + str(collection_ids[semeval_file_name]) + str(id_)
            
                for o in s.iter('aspectTerm'):
                    aspect_term = o.get('term')

                    for i, query in enumerate(unique_queries):
                        if query.rstrip('\n') == aspect_term:
                            rel_docIDs[i].append(doc_id)

    # Write query/relevant-doc pairs to the qrels file, one per line:
    # query_id <TAB> 0 <TAB> doc_id <TAB> 1
    # (the handle is not named `f`, to avoid shadowing the loop variable)
    with open(save_path, 'w') as qrels_file:
        for i in rel_docIDs.keys():
            # A document may carry the same aspect term more than once
            # because the term occurs multiple times within the sentence,
            # so deduplicate before writing
            rel_docIDs[i] = sorted(set(rel_docIDs[i]))

            for rd in rel_docIDs[i]:
                line = str(i+1) + '\t' + '0' + '\t' + rd + '\t' + '1'
                qrels_file.write("%s\n" % line)
                
    # Get the stats for the # of relevant documents per query
    # (the query IDs printed below are 0-based indices; the qrels file uses i+1)
    total_rel_documents = 0
    min_rel_documents = math.inf
    min_rel_query_id = -1
    max_rel_documents = -math.inf
    max_rel_query_id = -1
    
    for i in rel_docIDs.keys():
        total_rel_documents = total_rel_documents + len(rel_docIDs[i])
        
        if len(rel_docIDs[i]) < min_rel_documents:
            min_rel_documents = len(rel_docIDs[i])
            min_rel_query_id = i
            
        if len(rel_docIDs[i]) > max_rel_documents:
            max_rel_documents = len(rel_docIDs[i])
            max_rel_query_id = i
        
    print("Average # of relevant documents per query:", total_rel_documents/len(unique_queries))
    
    print("Min # of relevant documents:", min_rel_documents)
    print("The query ID with min #:", min_rel_query_id)

    print("Max # of relevant documents:", max_rel_documents)
    print("The query ID with max #:", max_rel_query_id)
    
    print()

test_qrels_laptop.txt
Average # of relevant documents per query: 1.5536992840095465
Min # of relevant documents: 1
The query ID with min #: 0
Max # of relevant documents: 17
The query ID with max #: 313

test_qrels_restaurant.txt
Average # of relevant documents per query: 2.036036036036036
Min # of relevant documents: 1
The query ID with min #: 0
Max # of relevant documents: 117
The query ID with max #: 262
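
As a sanity check, the qrels files can be read back and the per-query counts recomputed. The sketch below parses the four tab-separated columns written above (query id, `0`, document id, relevance) from a made-up sample string; `SAMPLE_QRELS` and `parse_qrels` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Made-up qrels content in the layout written by the cell above
SAMPLE_QRELS = (
    "1\t0\tdoc20\t1\n"
    "1\t0\tdoc2523\t1\n"
    "2\t0\tdoc2130\t1\n"
)

def parse_qrels(text):
    """Map each query id to the set of its relevant document ids."""
    relevant = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue
        query_id, _unused, doc_id, relevance = line.split('\t')
        if int(relevance) > 0:
            relevant[query_id].add(doc_id)
    return dict(relevant)

rel = parse_qrels(SAMPLE_QRELS)
counts = {query_id: len(doc_ids) for query_id, doc_ids in rel.items()}
average = sum(counts.values()) / len(counts)

print(counts)
print(average)
```

The query ids here are the 1-based ids stored in the file, so they are one higher than the 0-based indices printed in the statistics above.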

