# Smart Document Retrieval System

The goal of this project is to build an information retrieval system that uses Elasticsearch for document indexing and retrieval. The system extracts temporal expressions and georeferences from documents so that users can combine spatiotemporal criteria (when and where) with traditional full-text queries.

Required libraries 

In [27]:
import os
import re
import zipfile
from bs4 import BeautifulSoup
from elasticsearch import Elasticsearch

## Elasticsearch Connection
Connect to the server

In [None]:
elasticsearch_host = 'localhost'
elasticsearch_port = 9200

es = Elasticsearch([f'http://{elasticsearch_host}:{elasticsearch_port}'])

Test Server Connection 

In [None]:
if es.ping():
    print("Connected to Elasticsearch")
else:
    print("Connection to Elasticsearch failed")

## Collecting & Cleaning Data
### Data Collecting
Assign the zip file path to `zip_path` and the directory for the extracted files to `extract_files_path`.

In [17]:
# Raw strings keep Windows backslashes literal, so no doubling is needed.
zip_path = r'C:\Users\yasee\Downloads\archive (1).zip'
extract_files_path = r'C:\Users\yasee\Downloads\extracted_data'

The `unzip_data_file` function takes the path to the zip archive and a destination folder, creates the folder if it does not exist, and extracts all files from the archive into it.

In [18]:
def unzip_data_file(zip_path, extract_path):
    try:
        os.makedirs(extract_path, exist_ok=True)

        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(extract_path)

        print(f"Successfully extracted files to {extract_path}")
    except Exception as e:
        print(f"Error during extraction: {e}")

unzip_data_file(zip_path, extract_files_path)

Successfully extracted files to C:\Users\yasee\Downloads\extracted_data
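
The extraction step can be tried without the real dataset. The following is a minimal sketch, assuming a throwaway archive built in a temporary directory; the helper name `unzip_and_list` and the file names in the demo archive are hypothetical and chosen only for illustration.

```python
import os
import tempfile
import zipfile

def unzip_and_list(zip_path, extract_path, suffix=".sgm"):
    """Extract the archive and return the sorted names of files ending in `suffix`."""
    os.makedirs(extract_path, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(extract_path)
    return sorted(name for name in os.listdir(extract_path) if name.endswith(suffix))

# Build a tiny throwaway archive so the sketch runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as zf:
        zf.writestr("reut2-000.sgm", "<REUTERS></REUTERS>")  # hypothetical file name
        zf.writestr("README.txt", "not an article file")
    found = unzip_and_list(demo_zip, os.path.join(tmp, "out"))
    print(found)
```

Returning the list of `.sgm` files gives a quick check that the archive held the expected article files before any parsing starts.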


The `extract_reuters` function takes the path to the extracted files of type `sgm`, extracts all the Reuters elements, and then returns them as a list.

In [23]:
def extract_reuters(extract_files_path):
    reuters = []
    try:
        for file in os.listdir(extract_files_path):
            if file.endswith(".sgm"):
                filename = os.path.join(extract_files_path, file)
                
                with open(filename, 'r', encoding='utf-8', errors='ignore') as f:
                    data_file = f.read()

                soup = BeautifulSoup(data_file, 'html.parser')
                reuters.extend(soup.find_all('reuters'))
            
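        print("Successfully extracted Reuters articles.")
        return reuters
    except Exception as e:
        print(f"Error while extracting Reuters articles: {e}")
        # Return whatever was parsed so far so that callers can still use len().
        return reuters

reuters = extract_reuters(extract_files_path)
print(f"We have {len(reuters)} Reuters articles.")

Successfully extracted Reuters articles.
We have 21578 Reuters articles.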


### Data Cleaning
The `split_authors` function takes the author line as a string and splits it on the words `by` and `and` (case-insensitive). It strips leading and trailing whitespace from each piece, discards any text after a comma, and stores each author as a dictionary with `Firstname` and `Surname` keys.

In [24]:
def split_authors(authors):
    unclean_author_list = re.split(r'\b(?:BY|AND)\b', authors, flags=re.IGNORECASE)
    clean_author_list = [author.strip() for author in unclean_author_list if author.strip()]
    
    authors_list = []
    for author in clean_author_list:
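        # split() without arguments collapses repeated spaces and drops empty tokens.
        author_parts = author.split(',')[0].split()
        if not author_parts:
            continue
        # Guard against single-word names, which would otherwise raise IndexError.
        surname = author_parts[1] if len(author_parts) > 1 else ""
        authors_list.append({"Firstname": author_parts[0], "Surname": surname})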
        
    return authors_list
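
To make the splitting behaviour concrete, here is a self-contained sketch that applies the same regular expression to a made-up byline. The name `split_authors_sketch` and the sample bylines are assumptions for illustration; the sketch also guards against single-word names, which is a defensive choice rather than something the dataset is known to require.

```python
import re

def split_authors_sketch(authors):
    """Split a byline into a list of {"Firstname", "Surname"} dictionaries."""
    # Split on the standalone words "by" and "and", ignoring case.
    pieces = re.split(r'\b(?:BY|AND)\b', authors, flags=re.IGNORECASE)
    authors_list = []
    for piece in pieces:
        # Keep only the text before the first comma, then split on whitespace.
        parts = piece.split(',')[0].split()
        if not parts:
            continue  # skip empty fragments left over from the split
        authors_list.append({
            "Firstname": parts[0],
            "Surname": parts[1] if len(parts) > 1 else "",
        })
    return authors_list

byline = "    By John Smith and Jane Doe, Reuters"  # made-up byline
result = split_authors_sketch(byline)
print(result)
```

Because the pattern uses word boundaries (`\b`), names that merely contain the letters "by" or "and", such as "Anderson", are not split.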

The `extreact_entitys` function takes all Reuters elements, extracts the needed fields from each one, and stores the result as a dictionary in the `articles` list. Missing fields are set to `"N/A"`.

In [30]:
def extreact_entitys(reuters):
    articles = []
    for reuter in reuters:
        
        article = {
            'date': reuter.find('date').text if reuter.find('date') else "N/A",
            'topics': reuter.find('topics').text if reuter.find('topics') else "N/A",
            'places': reuter.find('places').text if reuter.find('places') else "N/A",
            'title': reuter.find('title').text if reuter.find('title') else "N/A",
            'author': split_authors(reuter.find('author').text) if reuter.find('author') else "N/A",
            'dateline': reuter.find('dateline').text if reuter.find('dateline') else "N/A",
            'body': reuter.find('body').text if reuter.find('body') else "N/A"
        }
        
        articles.append(article)
        
    return articles

articles = extreact_entitys(reuters)
print(articles[0])

{'date': '26-FEB-1987 15:01:01.79', 'topics': 'cocoa', 'places': 'el-salvadorusauruguay', 'title': 'BAHIA COCOA REVIEW', 'author': 'N/A', 'dateline': '    SALVADOR, Feb 26 - ', 'body': 'Showers continued throughout the week in\nthe Bahia cocoa zone, alleviating the drought since early\nJanuary and improving prospects for the coming temporao,\nalthough normal humidity levels have not been restored,\nComissaria Smith said in its weekly review.\n    The dry period means the temporao will be late this year.\n    Arrivals for the week ended February 22 were 155,221 bags\nof 60 kilos making a cumulative total for the season of 5.93\nmln against 5.81 at the same stage last year. Again it seems\nthat cocoa delivered earlier on consignment was included in the\narrivals figures.\n    Comissaria Smith said there is still some doubt as to how\nmuch old crop cocoa is still available as harvesting has\npractically come to an end. With total Bahia crop estimates\naround 6.4 mln bags and sales standin
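
In the printed article, `places` comes out as `'el-salvadorusauruguay'` because `.text` concatenates the individual items. The following standard-library sketch shows one way to keep the items separate, assuming each item is wrapped in its own `<D>` tag inside `<PLACES>` or `<TOPICS>`. The helper name `split_d_items` and the sample markup (reconstructed by hand from the output above) are assumptions.

```python
import re

def split_d_items(markup, tag):
    """Return the text of each <D> item inside the first <tag> element."""
    container = re.search(
        rf"<{tag}>(.*?)</{tag}>", markup, flags=re.IGNORECASE | re.DOTALL
    )
    if container is None:
        return []
    # Each place or topic is assumed to sit in its own <D>...</D> pair.
    return re.findall(
        r"<d>(.*?)</d>", container.group(1), flags=re.IGNORECASE | re.DOTALL
    )

# Hand-written sample, not copied from the dataset files.
sample = (
    "<REUTERS><TOPICS><D>cocoa</D></TOPICS>"
    "<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES></REUTERS>"
)
places = split_d_items(sample, "places")
topics = split_d_items(sample, "topics")
print(places)
print(topics)
```

With the BeautifulSoup objects already in memory, a similar result should be obtainable with `[d.text for d in reuter.find('places').find_all('d')]`; a list of separate place names is easier to georeference than one concatenated string.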

In [None]:

from dateutil import parser

# Normalize a Reuters timestamp to the "YYYY-MM-DD HH:MM:SS" layout.
original_date_str = "26-FEB-1987 15:02:20.00"
parsed_date = parser.parse(original_date_str)
formatted_date_str = parsed_date.strftime("%Y-%m-%d %H:%M:%S")  # no trailing space
print(formatted_date_str)
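
As an alternative that needs only the standard library, the timestamp can be parsed with an explicit format string. This is a sketch under two assumptions: every `date` value follows the `DD-MON-YYYY HH:MM:SS.ff` layout seen in the first article, and the process runs under an English locale so that `%b` matches `FEB`. The function name `reuters_date_to_iso` is hypothetical.

```python
from datetime import datetime

def reuters_date_to_iso(raw):
    """Convert e.g. '26-FEB-1987 15:01:01.79' to '1987-02-26T15:01:01'."""
    # %b matches the abbreviated month name case-insensitively;
    # %f consumes the fractional seconds, which are dropped on output.
    parsed = datetime.strptime(raw.strip(), "%d-%b-%Y %H:%M:%S.%f")
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")

iso_date = reuters_date_to_iso("26-FEB-1987 15:01:01.79")
print(iso_date)
```

An explicit format fails loudly with a `ValueError` on any value that does not match, which can help surface malformed dates before they are indexed; whether the resulting ISO 8601 string suits the index depends on the date format configured in the Elasticsearch mapping.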