<img src="../../images/banners/python-modular.png" width="600"/>

# <img src="../../images/logos/python.png" width="23"/> Project: Search Engine


## <img src="../../images/logos/toc.png" width="20"/> Table of Contents 
* [What is a search engine?](#search-engine)
* [Data](#data)
* [Text PreProcessing](#text-preprocessing)
* [Indexing](#indexing)
* [Search](#search)

---

<a class="anchor" id="search-engine"></a>

## What is a search engine?

A search engine is a software program that helps users locate information on the World Wide Web. It typically works in three stages: crawling, indexing, and ranking. In the crawling stage, the program traverses the web in a predefined manner and collects data such as pages, images, and links. In the indexing stage, the collected data is stored in a data structure that supports fast lookup. Finally, in the ranking stage, the matching results are ordered by relevance to the query, so that the most relevant results appear first.

We will build a very simple search engine using a Python dictionary. Because the documents are already stored locally, there is no crawling step; we only index and rank.
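
As a rough illustration of the idea, the sketch below builds a tiny inverted index with a plain dictionary: each word maps to the set of documents that contain it. The document names and texts (`toy_docs`) are made up for this example and are not part of the project data.

```python
# Hypothetical toy documents (not part of the project data)
toy_docs = {
    'Doc A': 'python is fun',
    'Doc B': 'search engines index text',
    'Doc C': 'python can index text',
}

# Map every word to the set of documents that contain it
toy_index = {}
for name, text in toy_docs.items():
    for word in text.split():
        toy_index.setdefault(word, set()).add(name)

print(sorted(toy_index['python']))  # ['Doc A', 'Doc C']
print(sorted(toy_index['index']))   # ['Doc B', 'Doc C']
```

Looking up a word is then a single dictionary access, which is what makes this structure a reasonable fit for a small search engine.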

<a class="anchor" id="data"></a>

## Data

Text documents are located in `data/documents`.

In [1]:
from pathlib import Path

In [2]:
data = {}

for doc_path in Path('data/documents').iterdir():
    # Skip anything that is not a plain-text document
    if doc_path.suffix != '.txt':
        continue

    # Build a readable title from the file name, e.g. "my_doc.txt" -> "My Doc"
    doc_name = ' '.join(doc_path.stem.split('_')).title()
    data[doc_name] = doc_path.read_text()
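
To make the naming step concrete, here is a small sketch of how a file name becomes a document title. The helper `to_doc_name` and the file name are hypothetical and used only for illustration.

```python
from pathlib import Path

def to_doc_name(path):
    """Turn a path such as 'some_file_name.txt' into a display title."""
    # .stem drops the directory and the extension; underscores become spaces
    return ' '.join(Path(path).stem.split('_')).title()

# Hypothetical file name, used only for illustration
print(to_doc_name('data/documents/the_old_man_and_the_sea.txt'))
# The Old Man And The Sea
```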

<a class="anchor" id="text-preprocessing"></a>

## Text PreProcessing

In [3]:
from abc import ABC, abstractmethod
import string

In [4]:
class TextProcessor(ABC):
    """Common interface: every processor turns one string into another."""

    @abstractmethod
    def transform(self, text):
        pass

In [5]:
class ConvertCase(TextProcessor):
    def __init__(self, casing='lower'):
        self.casing = casing

    def transform(self, text):
        if self.casing == 'lower':
            return text.lower()
        elif self.casing == 'upper':
            return text.upper()
        elif self.casing == 'title':
            return text.title()
        # Fail loudly instead of silently returning None
        raise ValueError(f'Unknown casing: {self.casing!r}')

In [6]:
class RemoveDigit(TextProcessor):
    def transform(self, text):
        # Replace every digit with a space so neighbouring words stay separated
        return ''.join(' ' if char.isdigit() else char for char in text)

In [7]:
class RemovePunkt(TextProcessor):
    def transform(self, text):
        # Replace every ASCII punctuation character with a space
        return ''.join(' ' if char in string.punctuation else char for char in text)

In [8]:
class RemoveSpace(TextProcessor):
    def transform(self, text):
        # Collapse runs of whitespace and trim both ends
        return ' '.join(text.split())

In [9]:
class TextPipeline:
    def __init__(self, *args):
        self.transformers = args
        
    def transform(self, text):
        for tf in self.transformers:
            text = tf.transform(text)
        return text
        
    def __str__(self):
        transformers = ' -> '.join([tf.__class__.__name__ for tf in self.transformers])
        return f'Pipeline: [{transformers}]'

In [10]:
pipe = TextPipeline(ConvertCase(), RemoveDigit(), RemovePunkt(), RemoveSpace())

In [11]:
pipe.transform('1. Here is my Text!!!')

'here is my text'
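
To see what each stage contributes, the sketch below traces the same sample string through the four steps one at a time. It is written with plain functions so that it runs on its own; the `steps` list is an illustrative stand-in for the transformer classes, not a replacement for them.

```python
import string

# Each step mirrors one transformer from the pipeline, written as a plain function
steps = [
    ('ConvertCase', str.lower),
    ('RemoveDigit', lambda t: ''.join(' ' if c.isdigit() else c for c in t)),
    ('RemovePunkt', lambda t: ''.join(' ' if c in string.punctuation else c for c in t)),
    ('RemoveSpace', lambda t: ' '.join(t.split())),
]

text = '1. Here is my Text!!!'
trace = []
for name, step in steps:
    text = step(text)
    trace.append((name, text))
    # repr() makes leading and trailing spaces visible
    print(f'{name:<12} {text!r}')
```

The order matters: `RemoveSpace` runs last because the digit and punctuation steps leave extra spaces behind.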

In [12]:
processed_data = {
    doc_name: pipe.transform(content) for doc_name, content in data.items()
}

<a class="anchor" id="indexing"></a>

## Indexing

In [13]:
# Normalize stop words with the same pipeline and keep them in a set for fast lookup
with open('data/stop_words.txt') as f:
    stop_words = set()
    for line in f:
        stop_words.update(pipe.transform(line).split())
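
As a small illustration of why stop words are filtered out, the sketch below removes them from a tokenized sentence. The stop word set and the sentence are hypothetical; the real list comes from `data/stop_words.txt`.

```python
# Hypothetical stop words, used only for illustration
toy_stop_words = {'the', 'is', 'a'}

tokens = 'the index is a dictionary'.split()

# Keep only the tokens that carry meaning for search
kept = [token for token in tokens if token not in toy_stop_words]
print(kept)  # ['index', 'dictionary']
```

Very common words appear in almost every document, so indexing them would add entries that do little to distinguish one document from another.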

In [14]:
index = {}

for doc_name, content in processed_data.items():
    # The text is already normalized, so splitting on whitespace yields clean tokens
    for word in content.split():
        # Ignore stop words
        if word in stop_words:
            continue

        # Add the document to the set of documents containing this word
        index.setdefault(word, set()).add(doc_name)

<a class="anchor" id="search"></a>

## Search

In [15]:
from termcolor import colored
from collections import Counter

In [16]:
def print_success(text):
    print(colored(text, 'green'))

In [17]:
TOP_N = 3
while True:
    # Get user input
    search_input = input('Search to find a doc (q to quit): ')
    if search_input.strip().lower() == 'q':
        break

    # Normalize the query with the same pipeline used for indexing, then tokenize
    search_tokens = pipe.transform(search_input).split()

    # Collect every document that contains at least one query token
    docs = []
    for token in search_tokens:
        docs.extend(index.get(token, []))

    # Rank documents by the number of matching query tokens
    top_docs = [doc for doc, _ in Counter(docs).most_common(TOP_N)]

    # Print the results
    for doc in top_docs:
        print_success(f'- {doc}')

Search to find a doc (q to quit): q
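
For experimenting with the ranking without typing queries interactively, the lookup can be wrapped in a function. The sketch below is one possible variation, not the loop above: the `search` helper and `toy_index` are hypothetical, the query is only lowercased rather than run through the full pipeline, and repeated query words are counted once.

```python
from collections import Counter

def search(query, index, top_n=3):
    """Rank documents by how many distinct query tokens they contain."""
    hits = Counter()
    for token in set(query.lower().split()):
        # Tokens missing from the index contribute nothing
        hits.update(index.get(token, ()))
    return hits.most_common(top_n)

# Hypothetical toy index: word -> set of document names
toy_index = {
    'python': {'Doc A', 'Doc C'},
    'index': {'Doc B', 'Doc C'},
    'text': {'Doc B', 'Doc C'},
}

# 'Doc C' contains both query words, so it ranks first
print(search('python index', toy_index)[0])  # ('Doc C', 2)
```

Documents with the same number of matches come back in no particular order, so only the position of a clear winner is predictable.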
