In this lab you are going to implement a standard document processing pipeline and then build a simple search engine based on it:
- starting from crawling documents, 
- then building an inverted index,
- answering queries using this index,
- and organizing it as a simple web server.

Second part is devoted to spellchecking.

# 1. [45] Building inverted index and answering queries

## 1.1. [5] Preprocessing

First, we need a unified approach to documents preprocessing. Implement a class responsible for that. Complete the code for given functions (most of them are just one-liners) and make sure you pass the tests. Make use of `nltk` library or any other you know.

In [6]:
import nltk
nltk.download('punkt')

class Preprocessor:
    
    def __init__(self):
        self.stop_words = {'a', 'an', 'and', 'are', 'as', 'at', 'be', 'by', 'for', 'from', 'has', 'he', 'in', 'is', 'it', 'its',
                      'of', 'on', 'that', 'the', 'to', 'was', 'were', 'will', 'with'}
        self.ps = nltk.stem.PorterStemmer()

    
    def tokenize(self, text):
        #TODO word tokenize text using nltk lib
        return nltk.word_tokenize(text)

    
    def stem(self, word, stemmer):
        #TODO stem word using provided stemmer
        return stemmer.stem(word)

    
    def is_apt_word(self, word):
        #TODO check if word is appropriate - not a stop word and isalpha, 
        # i.e consists of letters, not punctuation, numbers, dates
        return word not in self.stop_words and str.isalpha(word)

    
    def preprocess(self, text):
        #TODO combine all previous methods together: tokenize lowercased text 
        # and stem it, ignoring not appropriate words
        return [self.stem(word, self.ps) 
         for word in self.tokenize(text.lower())
         if self.is_apt_word(word)]

[nltk_data] Downloading package punkt to /Users/osmiyg/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 1.1.1. Tests

In [7]:
prep = Preprocessor()
text = 'To be, or not to be, that is the question'

assert prep.tokenize(text) == ['To', 'be', ',', 'or', 'not', 'to', 'be', ',', 'that', 'is', 'the', 'question']
assert prep.stem('retrieval', prep.ps) == 'retriev'
assert prep.is_apt_word('qwerty123') is False
assert prep.preprocess(text) == ['or', 'not', 'question']

## 1.2. [25] Crawling and Indexing

### 1.2.1. [5] Base classes

Here are some base classes we will need for writing our indexer. The code from the first lab's solution is given, but note that you will need to change some of it, namely, the `parse` function (it is also possible to use your own implementation from the first homework, but make sure that it is correct). The reason is it always makes complete parsing, which we want to avoid when we only need links, for example, or a specific portion of text.

In [8]:
import requests
from urllib.parse import quote
from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse
from pathlib import Path
import os


class Document:

    def __init__(self, url):
        self.url = url
        self.download()

    def download(self):
        try:
            response = requests.get(self.url)
            if response.status_code == 200:
                self.content = response.content
                return True
            else:
                return False
        except:
            return False

    def persist(self, path):
        Path(path).mkdir(parents=True, exist_ok=True)
        with open(os.path.join(path, quote(self.url).replace('/', '_')), 'wb') as f:
            f.write(self.content)


class HtmlDocument(Document):

    def normalize(self, href):
        if href is not None and href[:4] != 'http':
            href = urllib.parse.urljoin(self.url, href)
        return href

    def parse(self, parse_img=False, parse_text=False, parse_anchors=False):
        #TODO change this method
        
        def parse_images():
            images = []
            i = model.find_all('img')
            for img in i:
                href = self.normalize(img.get('src'))
                images.append(href)
            return images
            
        def parse_links():
            anchors = []
            a = model.find_all('a')
            for anchor in a:
                href = self.normalize(anchor.get('href'))
                text = anchor.text
                anchors.append((text, href))
            return anchors
        
        def parse_text():
            texts = model.findAll(text=True)
            visible_texts = filter(tag_visible, texts)  
            return u" ".join(t.strip() for t in visible_texts)
        
        def tag_visible(element):
            if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
                return False
            if isinstance(element, Comment):
                return False
            return True
            
        
        model = BeautifulSoup(self.content)
        
        if parse_anchors:
            self.anchors = parse_links() 
        if parse_img:
            self.images = parse_images()
        if parse_text:
             self.text = parse_text() 

### 1.2.2. [15] Main class

The main indexer logic is here. We organize it as a crawler generator that adds certain visited pages to inverted index and saves them on disk. 

- `crawl_generator_for_index` method crawles the given website doing BFS, starting from `source` within given `depth`. Considers only inner pages (of a form https://www.reuters.com/...) for visiting. To speed up, doesn't consider for visiting pages with content type other than html: '.pdf', '.mp3', '.avi', '.mp4', '.txt'. If encounters an article page (of a form https://www.reuters.com/article/...), saves its content in a file in `collection_path` folder and populates the inverted index calling `index_doc` method. When done, saves on disk three resulting dictionaries:
    - `doc_urls`, `doc_id:url`
    - `index`, `term:[collection_frequency, (doc_id_1, doc_freq_1), (doc_id_2, doc_freq_2), ...]`
    - `doc_lengths`, `doc_id:doc_length` 

    `limit` parameter is given for testing - if not `None`, break the loop when number of saved articles exceeds the `limit` and return without writing dictionaries to disk.
    
    
- `index_doc` method parses and preprocesses the content of a `doc` and adds it to the inverted index. Also keeps track of document lengths in a `doc_lengths` dictionary.


**Extra task \* (no penalty to skip)** In real industrial systems a crawler would pass the links to the dedicated service that would load their contents in a bunch of parallel threads. Implement such a service - get urls as inputs, load page contents in parallel and return filenames on disk, which are then processed by indexer.

In [9]:
from collections import Counter, defaultdict
from urllib.parse import urlparse
from queue import Queue
import pickle
import os

class Indexer:

    def __init__(self):      
        # dictionaries to populate
        self.doc_urls = {}        
        self.index = {}
        self.doc_lengths = {}
        self.id_counter = 0
        # preprocessor
        self.prep = Preprocessor()
        self.filter_types = ['.pdf', '.mp3', '.avi', '.mp4', '.txt']
        
    
    def crawl_generator_for_index(self, source, depth, collection_path="collection", limit=None):        
        #TODO generate url-s for visiting
        
        def check_limit_less():
            if limit:
                return len(self.doc_urls) < limit
            else: 
                return True
        
        def process_link(link):
            src = HtmlDocument(link)
            checked.add(link)
            src.parse(parse_anchors=True)
            return src
        
        def check_if_html(link):
            url_parsed = urlparse(link)
            for ftype in self.filter_types:
                if ftype in url_parsed:
                    return False
            return True
        
        def check_if_article(link):
            return "https://www.reuters.com/article" in link
        
        def check_if_reuters(link):
            return "https://www.reuters.com" in link
        
        def index_doc(doc):
            #TODO add documents to index
            doc.parse(parse_text=True)
            words = self.prep.preprocess(doc.text)
            # Add url to dict
            self.doc_urls[self.id_counter] = doc.url
            
            #Add doc_lengths to dict
            self.doc_lengths[self.id_counter] = len(words)
            
            # Adding word to index
            for term in words:
                if term in self.index:
                    term_freq, docs_list = self.index[term][0], self.index[term][1:]
                    docs_dict = dict(docs_list)

                    term_freq += 1
                    if self.id_counter in docs_dict:
                        docs_dict[self.id_counter] += 1
                    else:
                        docs_dict[self.id_counter] = 1
                        
                    new_list = [term_freq]
                    new_list.extend(list(docs_dict.items()))
                    self.index[term] = new_list
                else:
                    docs_dict = {self.id_counter: 1}
                    new_list = [1]
                    new_list.extend(list(docs_dict.items()))
                    self.index[term] = new_list
            
            self.id_counter += 1
        
        queue = []
        checked = set()
        tmp_anchor_list = []
        self.id_counter = 0
        
        # Put initial links
        src = process_link(source)
        if check_if_article(src.url):
            src.persist(collection_path)
        index_doc(src)
        yield src
        for _, url in src.anchors:
            if check_if_reuters(url) and check_if_html(url) and url not in checked and url not in queue:
                queue.append(url)
        
            
        # Going until depth is satisified
        for i in range(depth-1):
            print(f"Reached depth {i+1}. Number of links to crawl {len(set(queue))}")
            while (queue):
                try:
                    link = queue.pop(0)
                    # We procceed if the link was not crawled before
                    if link not in checked:
                        src = process_link(link)
                        for _, url in src.anchors:
                            if check_if_reuters(url) and check_if_html(url) and url not in checked and url not in queue:
                                tmp_anchor_list.append(url)
                        if check_limit_less():
                            if check_if_article(src.url):
                                src.persist(collection_path)
                            index_doc(src)
                            yield src  
                        else:
                            return
                except Exception as e:
                    print(e)
            queue.extend(tmp_anchor_list)
            tmp_anchor_list = []

        # save results for later use
        with open('doc_urls.p', 'wb') as fp:
            pickle.dump(self.doc_urls, fp, protocol=pickle.HIGHEST_PROTOCOL)
        with open('inverted_index.p', 'wb') as fp:
            pickle.dump(self.index, fp, protocol=pickle.HIGHEST_PROTOCOL)
        with open('doc_lengths.p', 'wb') as fp:
            pickle.dump(self.doc_lengths, fp, protocol=pickle.HIGHEST_PROTOCOL)

### 1.2.3. Tests

In [6]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/technology", 2, "test_collection", 5):
    print(k, c.url)
    k+=1

assert type(indexer.index) is dict
assert type(indexer.index['reuter']) is list
assert type(indexer.index['reuter'][0]) is int
assert type(indexer.index['reuter'][1]) is tuple

1 https://www.reuters.com/technology
Reached depth 1. Number of links to crawl 71
2 https://www.reuters.com/
3 https://www.reuters.com/home
4 https://www.reuters.com/world
5 https://www.reuters.com/world/us-politics


### 1.2.4. Building an index

In [7]:
indexer = Indexer()
k = 1
for c in indexer.crawl_generator_for_index("https://www.reuters.com/", 3, "docs_collection"):
    print(k, c.url)
    k+=1

1 https://www.reuters.com/
Reached depth 1. Number of links to crawl 122
2 https://www.reuters.com/home
3 https://www.reuters.com/world
4 https://www.reuters.com/world/us-politics
5 https://www.reuters.com/world/us
6 https://www.reuters.com/world/uk
7 https://www.reuters.com/world/china
8 https://www.reuters.com/world/india
9 https://www.reuters.com/world/americas
10 https://www.reuters.com/world/asia-pacific
11 https://www.reuters.com/world/europe
12 https://www.reuters.com/world/middle-east-africa
argument of type 'NoneType' is not iterable
13 https://www.reuters.com/business
14 https://www.reuters.com/business/sustainable-business
argument of type 'NoneType' is not iterable
15 https://www.reuters.com/business/energy
16 https://www.reuters.com/business/environment
17 https://www.reuters.com/business/finance
18 https://www.reuters.com/business/media-telecom
19 https://www.reuters.com/business/healthcare-pharmaceuticals
20 https://www.reuters.com/business/autos-transportation
21 https:

91 https://www.reuters.com/finance
92 https://www.reuters.com/article/us-retail-trading-yellen/u-s-treasury-secretary-yellen-too-soon-to-say-if-changes-needed-to-address-market-volatility-idUSKBN2A70J4
93 https://www.reuters.com/article/us-g4s-m-a-garda-world/g4s-to-hold-talks-for-head-to-head-takeover-auction-the-telegraph-idUSKBN2A60RB
94 https://www.reuters.com/article/us-retail-trading-bonds-analysis/analysis-the-other-winners-of-the-reddit-fueled-rallies-convertible-debt-idUSKBN2A522A
95 https://www.reuters.com/article/us-israel-cenbank-reserves/bank-of-israel-buys-6-8-billion-of-forex-in-january-reserves-jump-to-new-record-idUSKBN2A70G5
96 https://www.reuters.com/article/us-taiwan-forex/taiwan-punishes-deutsche-bank-others-in-currency-speculation-case-idUSKBN2A705J
97 https://www.reuters.com/article/us-health-coronavirus-britain-budget/uk-plans-to-tax-firms-that-profited-from-pandemic-sunday-times-idUSKBN2A60TE
98 https://www.reuters.com/article/us-ecuador-election/ecuadoreans-vo

156 https://www.reuters.com/video/2017/08/22/over-100-feared-dead-after-himalayan-gla?videoId=725479592&videoChannel=117760
157 https://www.reuters.com/video/2017/08/22/conjoined-yemeni-twins-flown-to-amman-fo?videoId=725478771&videoChannel=117760
158 https://www.reuters.com/video/2017/02/07/tens-of-thousands-rally-against-myanmar?videoId=725473451&videoChannel=117760
159 https://www.reuters.com/video/2017/08/22/oxford-covid-shot-less-effective-on-safr?videoId=725472630&videoChannel=117760
160 https://www.reuters.com/article/us-usa-china/blinken-presses-china-on-xinjiang-hong-kong-in-call-with-beijings-top-diplomat-idUSKBN2A604Y
161 https://www.reuters.com/article/us-health-coronavirus-indonesia-vaccine/exclusive-indonesia-approves-chinas-sinovac-vaccine-for-the-elderly-idUSKBN2A60I0
162 https://www.reuters.com/article/us-health-coronavirus-vaccine-sinovac/china-approves-sinovac-biotech-covid-19-vaccine-for-general-public-use-idUSKBN2A60AY
163 https://www.reuters.com/article/us-china-p

216 https://www.reuters.com/news/archive/rcom-apac?view=page&page=2&pageSize=10
217 https://www.reuters.com/article/us-health-coronavirus-britain-vaccinatio/more-than-12-million-britons-have-received-first-covid-19-vaccine-dose-idUSKBN2A70LD
218 https://www.reuters.com/article/us-italy-politics-salvini/italys-5-star-league-open-up-to-draghi-government-await-policy-plans-idUSKBN2A60G6
219 https://www.reuters.com/article/us-iran-nuclear-europe-usa/france-says-held-in-depth-talks-with-u-s-britain-germany-on-iran-idUSKBN2A52L5
220 https://www.reuters.com/article/us-iran-nuclear-europe/blinken-discusses-iran-with-uk-french-german-ministers-idUSKBN2A526T
221 https://www.reuters.com/article/us-libya-security-un-britain/western-powers-welcome-libyan-interim-government-idUSKBN2A52OJ
222 https://www.reuters.com/article/us-global-markets/world-shares-scale-new-peak-on-stimulus-hopes-oil-gains-idUSKBN2A501F
223 https://www.reuters.com/article/us-health-coronavirus-britain-france/collaboration-need

282 https://www.reuters.com/article/us-twitter-content-hate/twitter-expands-hate-speech-rules-to-include-race-ethnicity-idUSKBN28D03U
283 https://www.reuters.com/article/us-tech-sap/sap-pledges-to-spend-more-on-social-diverse-businesses-idUSKBN26R0C8
284 https://www.reuters.com/subjects/sustainable-business/responsible-investing
285 https://www.reuters.com/article/us-ecb-climate-lagarde/ecb-should-think-green-when-picking-assets-policymakers-idUSKBN29U0TA
286 https://www.reuters.com/article/us-britain-boe-environment/bank-of-england-told-to-stop-buying-high-carbon-bonds-idUSKBN29U01A
287 https://www.reuters.com/article/us-greenbonds-issuance/global-green-bond-issuance-hit-new-record-high-last-year-idUSKBN29U013
288 https://www.reuters.com/article/us-exxon-mobil-climate/exxon-unveils-carbon-removal-tech-venture-in-green-image-push-idUSKBN2A13S2
289 https://www.reuters.com/article/us-shell-strategy-insight/shell-targets-power-trading-and-hydrogen-in-climate-drive-idUSKBN2A10ZZ
290 https:

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


cannot unpack non-iterable NoneType object
'HtmlDocument' object has no attribute 'content'
294 https://www.reuters.com/article/idUSKBN2A52JG
295 https://www.reuters.com/article/idUSKBN2A52SH
296 https://www.reuters.com/article/idUSKBN2A52S4
297 https://www.reuters.com/article/us-usa-economy-imf/u-s-faces-risk-of-bankruptcies-unemployment-if-fiscal-support-not-maintained-imf-idUSKBN2A52JG
298 https://www.reuters.com/article/us-ford-emissions/ford-says-u-s-justice-dept-california-end-probe-into-emissions-issue-idUSKBN2A527G
299 https://www.reuters.com/article/us-usa-harvard-lawreview/harvard-law-review-elects-first-muslim-president-idUSKBN2A52BX
300 https://www.reuters.com/article/us-people-nygard/canadian-judge-denies-fashion-designer-nygard-bail-in-u-s-extradition-case-idUSKBN2A5220
301 https://www.reuters.com/article/us-people-ghislaine-maxwell/maxwell-says-u-s-is-prosecuting-her-only-because-jeffrey-epstein-is-dead-idUSKBN2A529J
302 https://www.reuters.com/article/us-amazon-com-labo

357 https://www.reuters.com/article/us-volkswagen-chips-audi/hit-by-shortage-volkswagen-demands-boost-to-europe-chip-sector-idUSKBN2A41RS
358 https://www.reuters.com/article/us-autos-europe-electric/eu-electric-and-plug-in-hybrid-car-sales-jump-to-over-1-million-in-2020-idUSKBN2A41HR
359 https://www.reuters.com/article/us-mazda-semiconductors/mazda-expects-chip-shortage-to-affect-about-7000-vehicles-in-february-idUSKBN2A40WO
361 https://www.reuters.com/article/us-fordmotor-china-zotye/ford-motor-terminates-electric-vehicle-plans-with-chinas-zotye-idUSKBN2A409A
362 https://www.reuters.com/article/us-ford-motor-results-preview/fords-forecast-in-focus-as-it-wraps-up-year-with-quarterly-loss-idUSKBN2A32YE
363 https://www.reuters.com/article/us-daimler-trucks-divestiture/daimler-to-spin-off-truck-unit-sharpen-investor-focus-on-mercedes-benz-idUSKBN2A329T
364 https://www.reuters.com/article/us-gm-semiconductors-exclusive/gm-hit-by-chip-shortage-to-cut-production-at-four-plants-idUSKBN2A32LL


417 https://www.reuters.com/article/taiwan-forex/taiwan-punishes-deutsche-bank-others-in-currency-speculation-case-idUSL1N2KD02M
418 https://www.reuters.com/article/britain-boe-bailey/boes-bailey-says-could-shun-fossil-fuel-firms-in-bond-buys-observer-idUSL8N2KC0KZ
419 https://www.reuters.com/article/italy-economy-visco/update-3-bank-of-italy-says-country-needs-cohesion-to-grow-and-cut-debt-idUSL1N2KC089
420 https://www.reuters.com/article/italy-politics-salvini/italys-salvini-says-no-vetoes-after-meeting-with-draghi-idUSL8N2KC07N
421 https://www.reuters.com/news/archive/marketsNews?view=page&page=2&pageSize=10
422 https://www.reuters.com/article/us-europe-stocks/european-shares-flat-after-u-s-jobs-data-pound-weighs-on-ftse-100-idUSKBN2A50SU
423 https://www.reuters.com/article/europe-stocks/update-2-european-shares-flat-after-u-s-jobs-data-pound-weighs-on-ftse-100-idUSL4N2KB2I5
424 https://www.reuters.com/article/europe-stocks/european-shares-rise-in-early-trading-germany-lags-broader-

499 https://www.reuters.com/article/emerging-markets-asia/emerging-markets-indian-bond-yields-rise-rupee-strengthens-as-c-bank-keeps-rates-steady-idUSL4N2KB241
500 https://www.reuters.com/article/sharp-hon-hai/sharp-corp-may-delay-q3-earnings-release-as-it-probes-accounting-at-subsidiary-idUSL4N2KB21Y
501 https://www.reuters.com/article/southkorea-markets-close/s-korea-shares-rise-boosted-by-samsung-electronics-post-best-week-in-a-month-idUSAZN21W300
502 https://www.reuters.com/article/india-stocks-bonds/indian-bond-yields-rise-as-rbi-leaves-rates-steady-idUSKBN2A50L4
503 https://www.reuters.com/article/japan-stocks-close/japanese-shares-rally-on-upbeat-earnings-u-s-stimulus-hopes-idUSL4N2KB1V3
504 https://www.reuters.com/article/india-stocks/indian-bond-yields-rise-as-cenbank-leaves-rates-steady-idUSL4N2KB1PN
505 https://www.reuters.com/article/retail-trading-china/the-police-will-visit-you-why-gamestonk-wont-come-to-china-idUSL4N2KA32N
506 https://www.reuters.com/markets/asia?view=pa

601 https://www.reuters.com/article/usa-stocks-weekahead/wall-st-week-ahead-gamestop-frenzy-reveals-potential-for-broader-market-stress-idUSL1N2KB2Y3
602 https://www.reuters.com/article/usa-stocks-weekahead/wall-st-week-ahead-gamestop-frenzy-reveals-potential-for-broader-market-stress-idUSL1N2K92JX
603 https://www.reuters.com/article/usa-results-sp500/corrected-update-1-u-s-companies-set-to-post-profit-growth-for-q4-which-would-defy-forecasts-idUSL1N2K924P
604 https://www.reuters.com/news/archive/fundsfundsnews?view=page&page=2&pageSize=10
605 https://www.reuters.com/video/2021/02/05/big-correction-could-come-in-spring-geor?videoId=725406636&videoChannel=-10208
606 https://www.reuters.com/video/2021/02/05/sp-nasdaq-post-best-weekly-gains-in-thre?videoId=725411483&videoChannel=-10208
607 https://www.reuters.com/video/2021/02/05/gamestop-shares-jump-after-robinhood-lif?videoId=725400550&videoChannel=-10208
608 https://www.reuters.com/quote/.BADI
609 https://www.reuters.com/quote/.SPGSCIT

683 https://www.reuters.com/article/us-retail-trading-breakingviews/breakingviews-small-stock-mania-calls-for-big-picture-thinking-idUSKBN2A42TV
684 https://www.reuters.com/article/us-health-coronavirus-finance-breakingvi/breakingviews-corona-capital-mckinsey-opioids-bumble-ipo-idUSKBN2A42NW
685 https://www.reuters.com/article/us-amazon-com-ceo-breakingviews/breakingviews-viewsroom-bezos-takes-step-back-draghi-steps-up-idUSKBN2A420H
686 https://www.reuters.com/article/us-deutsche-bank-results-breakingviews/breakingviews-deutsches-trim-investment-bank-is-model-for-peers-idUSKBN2A41C8
687 https://www.reuters.com/article/us-hsbc-hldg-hong-kong-breakingviews/breakingviews-its-time-hsbcs-big-bosses-followed-the-money-idUSKBN2A40G6
688 https://www.reuters.com/news/archive/mcbreakingviews?view=page&page=2&pageSize=10
689 https://www.reuters.com/video/2021/02/05/breakingviews-tv-sewing-grace?videoId=725377288&videoChannel=117766
690 https://www.reuters.com/video/2021/02/04/breakingviews-tv-23a

768 https://www.reuters.com/article/us-tennis-ausopen-preview/australian-open-ready-to-launch-after-pandemic-palpitations-idUSKBN2A706R
769 https://www.reuters.com/article/us-football-nfl-hall-of-fame/peyton-manning-charles-woodson-highlight-eight-member-hof-class-idUSKBN2A706L
770 https://www.reuters.com/article/us-golf-european/golf-johnson-holds-on-to-clinch-second-saudi-international-title-idUSKBN2A70GS
771 https://www.reuters.com/article/us-tennis-tennis-roundup/wta-roundup-no-1-ashleigh-barty-earns-ninth-title-idUSKBN2A70KR
772 https://www.reuters.com/article/us-icehockey-nhl-roundup/nhl-roundup-flames-edge-oilers-in-scoring-frenzy-idUSKBN2A709O
773 https://www.reuters.com/article/us-football-nfl-superbowl-masks/mask-wearing-slips-in-tampa-as-fans-celebrate-super-bowl-weekend-idUSKBN2A7007
774 https://www.reuters.com/video/sports
775 https://www.reuters.com/video/2021/02/06/super-bowl-festivities-dampened-by-covid?videoId=725447438&videoChannel=45
776 https://www.reuters.com/vide

837 https://www.reuters.com/article/us-health-coronavirus-science/some-lingering-covid-19-issues-seen-in-children-patients-antibodies-attack-multiple-virus-targets-idUSKBN2A13I6
838 https://www.reuters.com/article/us-space-exploration-starship/musks-spacex-violated-its-launch-license-in-explosive-starship-test-the-verge-idUSKBN29Z06R
839 https://www.reuters.com/article/us-health-coronavirus-science/jj-vaccine-effective-in-preventing-severe-disease-a-mothers-covid-19-antibodies-may-protect-newborns-idUSKBN29Y2PF
840 https://www.reuters.com/news/archive/sciencenews?view=page&page=2&pageSize=10
841 https://www.reuters.com/article/us-ubs-investment-banking-bonus/ubs-to-lift-investment-bank-bonus-pool-by-20-bloomberg-news-idUSKBN2A600H
842 https://www.reuters.com/finance/investing
843 https://www.reuters.com/news/archive/wealth-taxes
844 https://www.reuters.com/news/archive/wealth-family
845 https://www.reuters.com/article/us-change-suite-scrivner/starting-a-job-dream-big-and-dont-be-afraid

927 https://www.reuters.com/journalists/costas-pitas
928 https://www.reuters.com/journalists/ruma-paul
929 https://www.reuters.com/news/archive/sportsNews
930 https://www.reuters.com/journalists/amy-tennery
931 https://www.reuters.com/journalists/rory-carroll
932 https://www.reuters.com/journalists/gabriella-borter
933 https://www.reuters.com/journalists/nathan-frandino
934 https://www.reuters.com/news/archive/greatreboot
935 https://www.reuters.com/journalists/timothy-aeppel
936 https://www.reuters.com/journalists/gavin-jones
937 https://www.reuters.com/journalists/emma-thomasson
938 https://www.reuters.com/news/archive/aerospace-defense
939 https://www.reuters.com/journalists/parisa-hafezi
940 https://www.reuters.com/news/archive/esg-environment
941 https://www.reuters.com/news/archive/americas-test-2
942 https://www.reuters.com/news/archive/media-industry
'HtmlDocument' object has no attribute 'content'
943 https://www.reuters.com/news/picture/top-photos-of-the-day-idUSRTX8Y855
944 

### 1.2.5. [5] Index statistics

Load an index and print the statistics.

In [10]:
# load index, doc_lengths and doc_urls
with open('inverted_index.p', 'rb') as fp:
    index = pickle.load(fp)
with open('doc_lengths.p', 'rb') as fp:
    doc_lengths = pickle.load(fp)
with open('doc_urls.p', 'rb') as fp:
    doc_urls = pickle.load(fp)
    
    
print('Total index length', len(index))
print('\nTop terms by number of documents they apperared in:')
sorted_by_n_docs = sorted(index.items(), key=lambda kv: (len(kv[1]), kv[0]), reverse=True)
print([(sorted_by_n_docs[i][0], len(sorted_by_n_docs[i][1])) for i in range(20)])
print('\nTop terms by overall frequency:')
sorted_by_freq = sorted(index.items(), key=lambda kv: (kv[1][0], kv[0]), reverse=True)
print([(sorted_by_freq[i][0], sorted_by_freq[i][1][0]) for i in range(20)])

Total index length 14247

Top terms by number of documents they apperared in:
[('use', 978), ('term', 978), ('privaci', 978), ('reuter', 977), ('not', 948), ('see', 947), ('all', 947), ('do', 945), ('here', 944), ('us', 942), ('my', 942), ('person', 938), ('list', 938), ('inform', 938), ('exchang', 938), ('site', 937), ('sell', 935), ('minut', 933), ('delay', 933), ('app', 933)]

Top terms by overall frequency:
[('reuter', 6613), ('s', 3969), ('said', 2961), ('market', 2831), ('all', 2805), ('us', 2231), ('delay', 2180), ('news', 2093), ('not', 2084), ('more', 2073), ('advertis', 2008), ('thomson', 1920), ('minut', 1909), ('new', 1868), ('after', 1827), ('busi', 1782), ('follow', 1650), ('world', 1583), ('inform', 1532), ('biden', 1484)]


## 1.3. [15] Answering query

Now, given that we already have built the inverted index, it's time to utilize it for answering user queries. In this class there are two methods you need to implement:
- `boolean_retrieval`, the simplest form of document retrieval which returns a set of documents such that each one contains all query terms. Returns a set of document ids. Refer to *ch.1* of the book for details;
- `okapi_scoring`, Okapi BM25 ranking function - assigns scores to documents in the collection that are relevant to the user query. Returns a dictionary of scores, `doc_id:score`. Read about it in [Wikipedia](https://en.wikipedia.org/wiki/Okapi_BM25#The_ranking_function) and implement accordingly.

Both methods accept `query` parameter in a form of a dictionary, `term:frequency`

In [12]:
from collections import Counter
import math

class QueryProcessing:
    
    @staticmethod
    def prepare_query(raw_query):
        prep = Preprocessor()
        # pre-process query the same way as documents
        query = prep.preprocess(raw_query)
        # count frequency
        return Counter(query)
    
    @staticmethod
    def boolean_retrieval(query, index):
        #TODO retrieve a set of documents containing all query terms
        set_docs = None
        for word in query:
            if set_docs:
                set_docs = set_docs.intersection(set([x[0] for x in index[word][1:]]))
            else:
                set_docs = set([x[0] for x in index[word][1:]])
        
        return set_docs

    
    @staticmethod
    def okapi_scoring(query, doc_lengths, index, k1=1.2, b=0.75):
        #TODO retrieve relevant documents with scores
        scores = dict()
        for doc in doc_lengths.keys():
            scores[doc] = 0

        avg = sum(doc_lengths.values()) / len(doc_lengths)

        for word, query_freq in query.items():
            if word not in index:
                continue
            idf = math.log10(len(doc_lengths) / len(index[word][1:]))

            for doc, dtf in index[word][1:]:
                if doc in scores:
                    scores[doc] += idf * dtf * (k1 + 1) / (dtf + k1 * (1 - b + b * doc_lengths[doc] / avg))
        scores = dict([(doc, score) for doc, score in scores.items() if score > 0])
        return scores

### 1.3.1. Tests

In [12]:
test_doc_lengths = {1: 20, 2: 15, 3: 10, 4:20, 5:30}
test_index = {'x': [2, (1, 1), (2, 1)], 'y': [2, (1, 1), (3, 1)], 'z': [3, (2, 1), (4,2)]}


test_query1 = QueryProcessing.prepare_query('x z')
test_query2 = QueryProcessing.prepare_query('x y')


assert QueryProcessing.boolean_retrieval(test_query1, test_index) == {2}
assert QueryProcessing.boolean_retrieval(test_query2, test_index) == {1}
okapi_res = QueryProcessing.okapi_scoring(test_query2, test_doc_lengths, test_index)
assert all(k in okapi_res for k in (1, 2, 3))
assert not any(k in okapi_res for k in (4, 5))
assert okapi_res[1] > okapi_res[3] > okapi_res[2]

## 1.4. Setting up a server

**Extra task \* (no penaly if skipped)** Organize the resulting search engine as a web-service that gets a query from get-parameters and returns urls with scores as a `json` dictionary. Check its work in a browser of with curl, should look smth like this:
 
`> curl localhost:8080/?q=some_query_text
{ "url1" : 0.9, "url2": 0.8 }`

You can use one of the following tools for this task: 
- https://www.acmesystems.it/python_http, 
- [http.server.ThreadingHTTPServer (3.7+)](https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler)
- [Flask](https://pypi.org/project/Flask/)

In [27]:
#TODO write a web-service that answers queries using inverted index
from flask import Flask, request
app = Flask(__name__)

with open('inverted_index.p', 'rb') as fp:
    index = pickle.load(fp)
with open('doc_lengths.p', 'rb') as fp:
    doc_lengths = pickle.load(fp)
with open('doc_urls.p', 'rb') as fp:
    doc_urls = pickle.load(fp)

@app.route("/")
def hello():
    return "I am search engine. If you want okapi scoring got to /okapi, if you want boolean retrieval - got to /bool"

@app.route("/bool", methods=["GET"])
def boolean():
    query = request.args.get("q")
    prep_query = QueryProcessing.prepare_query(query)
    
    try:
        bool_res = QueryProcessing.boolean_retrieval(prep_query, index)
    except KeyError:
        return {"msg":"Could not find word in dict"}
    
    result = dict([(doc_id, doc_url) for doc_id, doc_url in doc_urls.items() if doc_id in bool_res])
    
    return result

@app.route("/okapi", methods=["GET"])
def okapi():
    query = request.args.get("q")
    prep_query = QueryProcessing.prepare_query(query)
    
    okapi_res = QueryProcessing.okapi_scoring(prep_query, doc_lengths, index)
    
    result = dict([(doc_url, okapi_res[doc_id]) for doc_id, doc_url in doc_urls.items() if doc_id in okapi_res])
    return result



app.run(host='0.0.0.0',port=8080)

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
[2m   Use a production WSGI server instead.[0m
 * Debug mode: off


 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
127.0.0.1 - - [08/Feb/2021 11:24:34] "[37mGET /okapi?q=%22reuters%22 HTTP/1.1[0m" 200 -


# 2. [55] Sorri not veri gud in inglish

Have you ever googled someone's name without knowing exactly how it should be written? Were you ever reluctant to look up the correct spelling of a query you typed? Or just unable to type properly because of being in a rush? Modern search engines usually do a pretty good job in deciphering defective user input. In order to be able to do that, a good spell-checking mechanism should be incorporated into a search procedure. Today we will take one step further towards building a good search engine and work on tolerant retrieval with respect to user queries. We will consider two cases:

1. User knows that he doesn't know the correct spelling OR he wants to get the results that follow some known pattern, so he uses so called wildcards - queries like `retr*val`;
2. User doesn't know the correct spelling OR he doesn't care OR he's in a rush OR he expects his mistakes will be corrected OR your option, so he makes mistakes and we need to handle them using:

    2.1. Simple spellchecker by Peter Norvig;
    
    2.2. Phonetic correction by means of Soundex algorithm;
    
    2.3. Trigrams with Jaccard coefficient.

## 2.1. [20] Handling wildcards

We will handle wildcard queries using k-grams. K-grams is a list of consecutive k chars in a string - i.e., for the word *'star'*, it will be '*\$st*', '*sta*', '*tar*', and '*ar$*', if we take k=3. Take a look at [book](https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf) *chapter 3.2.2* to understand how k-grams can help efficiently match a wildcard against dictionary words. Here we will only consider wildcards with star symbols (may be multiple).

Notice that for building k-grams index, **we will need a vocabulary of original word forms** to compare words in user input to the vocabulary of "correct" words (think why inverted index which we built for stemmed words doesn't work here).   

You need to implement the following:

- `build_inverted_index_orig_forms` - creates inverted index of original world forms from `facts` list, which is already given to you.  
    Output format: `term:[collection_frequency, (doc_id_1, doc_freq_1), (doc_id_2, doc_freq_2), ...]`
    

- `build_k_gram_index` - creates k-gram index which maps every k-gram encountered in facts collection to a list of words containing this k-gram. Use the abovementioned inverted index of original words to construct this index.  
    Output format: `'k_gram': ['word1_with_k_gram', 'word2_with_k_gram', ...]`
    
    
- `generate_wildcard_options` - produce a list of vocabulary words matching given wildcard by intersecting postings of k-grams present in the wildcard (refer to *ch 3.2.2*). 

- `search_wildcard` - return list of facts that contain the words matching a wildcard query.


We will use the dataset with curious facts for testing.

### 2.1.1. Downloading the dataset

In [3]:
import urllib.request
data_url = "https://raw.githubusercontent.com/hsu-ai-course/hsu.ai/master/code/datasets/nlp/facts.txt"
local_filename, headers = urllib.request.urlretrieve(data_url)

facts = []
with open(local_filename) as fp:
    for cnt, line in enumerate(fp):
        facts.append(line.strip('\n'))
        
print(*facts[-5:], sep='\n')

151. Women have twice as many pain receptors on their body than men. But a much higher pain tolerance.
152. There are more stars in space than there are grains of sand on every beach in the world.
153. For every human on Earth there are 1.6 million ants.
154. The total weight of all those ants, however, is about the same as all the humans.
155. On Jupiter and Saturn it rains diamonds.


### 2.1.2. [20] Implementation of search

In [4]:
import re


def build_inverted_index_orig_forms(documents):
    #TODO build an inverted index of original word forms 
    # (without stemming, just word tokenized and lowercased)
    
    def tokenize(text):
        #TODO word tokenize text using nltk lib
        return nltk.word_tokenize(text)
    
    inverted_index = {}
    
    for doc_id, docs in enumerate(documents):
        tokenized = tokenize(docs.lower())
        
        # Adding word to index
        for term in tokenized:
            if term in inverted_index:
                term_freq, docs_list = inverted_index[term][0], inverted_index[term][1:]
                docs_dict = dict(docs_list)

                term_freq += 1
                if doc_id in docs_dict:
                    docs_dict[doc_id] += 1
                else:
                    docs_dict[doc_id] = 1

                new_list = [term_freq]
                new_list.extend(list(docs_dict.items()))
                inverted_index[term] = new_list
            else:
                docs_dict = {doc_id: 1}
                new_list = [1]
                new_list.extend(list(docs_dict.items()))
                inverted_index[term] = new_list
    return inverted_index


def build_k_gram_index(inverted_index, k):
    #TODO build index of k-grams for dictionary words. 
    # Padd with '$' ($word$) before splitting to k-grams    

    k_gram_index = {}
    for word in inverted_index:
        new_word = "$" + word + "$"
        gram_list = [new_word[i:i + k] for i in range(0, len(new_word))]
        # truncating last not k-words
        if len(new_word) % k != 0:
            gram_list = gram_list[:-(len(new_word) % k)]

        for gram in gram_list:
            if gram in k_gram_index:
                k_gram_index[gram].append(word)
            else:
                k_gram_index[gram] = [word]
    
    return k_gram_index


def generate_wildcard_options(wildcard, k_gram_index, inverted_index):
    #TODO for a given wildcard return all words matching it using k-grams
    # refer to book chapter 3.2.2
    # don't forget to pad wildcard with '$', when appropriate
    new_wildcard = "$" + wildcard + "$"
    k = len(list(k_gram_index.keys())[0])
    gram_list = [new_wildcard[i:i + k] for i in range(0, len(new_wildcard))]
    # truncating last not k-words
    if len(new_wildcard) % k != 0:
        gram_list = gram_list[:-(len(new_wildcard) % k)]

    gram_list_trunc = list(filter(lambda x: x != "$" and "*" not in x, gram_list))

    option_set = set()
    for word in gram_list_trunc:
        if word in k_gram_index:
            if not option_set:
                option_set.update(k_gram_index[word])
            else:
                option_set.intersection_update(k_gram_index[word])
    return list(option_set)


def search_wildcard(wildcard, k_gram_index, index, docs):
    #TODO retrive list of documnets (facts) that contain words matching wildcard
    def union_retrieval():
        #TODO retrieve a set of documents containing all query terms
        set_docs = None
        for word in query:
            if set_docs:
                set_docs = set_docs.union(set([x[0] for x in index[word][1:]]))
            else:
                set_docs = set([x[0] for x in index[word][1:]])
        
        return set_docs

    query = generate_wildcard_options(wildcard, k_gram_index, index)
    doc_ids = union_retrieval()
    if doc_ids:
        return [doc for doc_id, doc in enumerate(docs) if doc_id in doc_ids]
    return []

### 2.1.3. Tests

In [7]:
index_orig_forms = build_inverted_index_orig_forms(facts)
k_gram_index = build_k_gram_index(index_orig_forms, 3)

wildcard = "re*ed"

wildcard_options = generate_wildcard_options(wildcard, k_gram_index, index_orig_forms)
print(wildcard_options)
assert(len(wildcard_options) >= 3)

wildcard_results = search_wildcard(wildcard, k_gram_index, index_orig_forms, facts)
# some pretty printing
for r in wildcard_results:
    # highlight terms for visual evaluation
    for term in wildcard_options:
        r = re.sub(r'(' + term + ')', r'\033[1m\033[91m\1\033[0m', r, flags=re.I)
    print(r)

assert(len(wildcard_results) >=3)

assert "13. James Buchanan, the 15th U.S. president continuously bought slaves with his own money in order to free them." in search_wildcard("pres*dent", k_gram_index, index_orig_forms, facts)
assert "40. 9 out of 10 Americans are deficient in Potassium." in search_wildcard("p*tas*um", k_gram_index, index_orig_forms, facts)
assert "61. A man from Britain changed his name to Tim Pppppppppprice to make it harder for telemarketers to pronounce." in search_wildcard("*price", k_gram_index, index_orig_forms, facts)

['received', 'recorded', 'reduced']
4. The largest [1m[91mrecorded[0m snowflake was in Keogh, MT during year 1887, and was 15 inches wide.
102. More than 50% of the people in the world have never made or [1m[91mreceived[0m a telephone call.
134. A person can live without food for about a month, but only about a week without water. If the amount of water in your body is [1m[91mreduced[0m by just 1%, you'll feel thirsty. If it's [1m[91mreduced[0m by 10%, you'll die.


AssertionError: 

## 2.2. [35] Handling typos

### 2.2.1 Dataset 

Download github typo dataset from [here](https://github.com/mhagiwara/github-typo-corpus).
Load it with this code:

In [None]:
!pip install jsonlines

In [8]:
import jsonlines

dataset_file = "github-typo-corpus.v1.0.0.jsonl"

dataset = []
other_langs = set()

with jsonlines.open(dataset_file) as reader:
    for obj in reader:
        for edit in obj['edits']:
            if edit['src']['lang'] != 'eng':
                other_langs.add(edit['src']['lang'])
                continue

            if edit['is_typo']:
                src, tgt = edit['src']['text'], edit['tgt']['text']
                if src.lower() != tgt.lower():
                    dataset.append((edit['src']['text'], edit['tgt']['text']))
                
print(f"Dataset size = {len(dataset)}")

Dataset size = 245909


#### Explore sample typos
Please, explore the dataset. You may see, that this is
- mostly markdown
- some common mistakes with do/does
- some just refer to punctuation typos (which we do not consider)

In [13]:
for pair in dataset[1010:1020]:
    print(f"{pair[0]} => {pair[1]}")

        """Make am instance. =>         """Make an instance.
* travis: test agains Node.js 11 => * travis: test against Node.js 11
The parser receive a string and returns an array inside a user-provided  => The parser receives a string and returns an array inside a user-provided 
CSV data is send through the `write` function and the resulted data is obtained => CSV data is sent through the `write` function and the resulting data is obtained
One useful function part of the Stream API is `pipe` to interact between  => One useful function of the Stream API is `pipe` to interact between 
source to a `stream.Writable` object destination. This example available as  => source to a `stream.Writable` object destination. This example is available as 
`node samples/pipe.js` read the file, parse its content and transform it. => `node samples/pipe.js` and reads the file, parses its content and transforms it.
Most of the generator is imported from its parent project [CSV][csv] in a effort  => Most o

### 2.2.2. [5] Build a dataset vocabulary
We will need it for Norvig's spellchecker as well as for estimating overall correction quality. Consider only word-level. Be carefull, there is markdown (e.g. \`name\`. \[url\]\(http://url)) and comment symbols (\#, //, \*).

In [14]:
def sent_to_words(sent):
    # splits sentence to words, filtering out non-alphabetical terms
    words = nltk.word_tokenize(sent)    
    words_filtered = filter(lambda x: x.isalpha(), words)
    return words_filtered

In [15]:
vocabulary = Counter()
for pair in dataset:
    for word in sent_to_words(pair[1].lower()):
        vocabulary[word] += 1
len(vocabulary)

63724

In [16]:
from itertools import islice
print(list(islice(vocabulary.items(), 10)))

[('function', 6193), ('de', 82), ('deutsch', 4), ('nocomments', 2), ('you', 42075), ('can', 26027), ('disable', 532), ('comments', 360), ('for', 44756), ('the', 207017)]


### 2.2.3. [25] Implement context-independent spellcheker ##

0) Write code to compute editorial distance

1) [Norvig's corrector](https://norvig.com/spell-correct.html)

2) [Soundex](https://en.wikipedia.org/wiki/Soundex)

3) Trigrams with Jaccard coefficient.

#### 2.2.3.0. [5] Editorial distance

Frequently used distance measure between two character sequences. We will use this distance to sort Soundex search results.

In [17]:
def edit_dist(s1, s2) -> int:
    # TODO compute the Damerau-Levenshtein distance between two given strings (s1 and s2)
        d = {}
        lenstr1 = len(s1)
        lenstr2 = len(s2)
        for i in range(-1,lenstr1+1):
            d[(i,-1)] = i+1
        for j in range(-1,lenstr2+1):
            d[(-1,j)] = j+1

        for i in range(lenstr1):
            for j in range(lenstr2):
                if s1[i] == s2[j]:
                    cost = 0
                else:
                    cost = 1
                d[(i,j)] = min(
                               d[(i-1,j)] + 1, # deletion
                               d[(i,j-1)] + 1, # insertion
                               d[(i-1,j-1)] + cost, # substitution
                              )
                if i and j and s1[i]==s2[j-1] and s1[i-1] == s2[j]:
                    d[(i,j)] = min (d[(i,j)], d[i-2,j-2] + cost) # transposition

        return d[lenstr1-1,lenstr2-1]

In [18]:
# tests

assert edit_dist("korrectud", "corrected") == 2, "Edit distance is computed incorrectly"
assert edit_dist("soem", "some") == 1, "Edit distance is computed incorrectly"
assert edit_dist("one", "one") == 0, "Edit distance is computed incorrectly"

#### 2.2.3.1. [5] Norvig's spellchecker

In [19]:
# Thanks to Peter Norvig :)

def fix_typo_norvig(word) -> str:
    def P(word, N=sum(vocabulary.values())): 
        "Probability of `word`."
        return vocabulary[word] / N

    def candidates(word): 
        "Generate possible spelling corrections for word."
        return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

    def known(words): 
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in vocabulary)

    def edits1(word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(word): 
        "All edits that are two edits away from `word`."
        return (e2 for e1 in edits1(word) for e2 in edits1(e1))
    
    return max(candidates(word), key=P)

In [20]:
# tests

assert fix_typo_norvig("korrectud") == "corrected", "Norvig's correcter doesn't work"
assert fix_typo_norvig("speling") == "spelling", "Norvig's correcter doesn't work"

#### 2.2.3.2. [10] Soundex 

For cases when the exact spelling is unknown, phonetic algorithms such as Soundex can be very helpful - they allow user to type a word the way he thinks it should sound, and then suggest the corrrect version. Go through *chapter 3.4* to understand how Soundex algorithm works.

In [41]:
def produce_soundex_code(word):
    #TODO implement Soundex algorithm, version from book chapter 3.4
    # input word is already lowercased
    # return Soundex 4-character code, like 'k450'
    letter_dict = {
        'a': '0', 'e': '0', 'i': '0', 'o': '0', 'u': '0', 'h': '0', 'w': '0', 'y': '0',
        'b': '1', 'f': '1', 'p': '1', 'v': '1',
        'c': '2', 'g': '2', 'j': '2', 'k': '2', 'q': '2', 's': '2', 'x': '2', 'z': '2',
        'd': '3', 't': '3',
        'l': '4',
        'm': '5', 'n': '5',
        'r': '6'
    }
    # Changing letters to digits
    if all([True if letter in letter_dict else False for letter in word]):
        #print([True for letter in word if (letter in letter_dict) else False])
        soundex_list = [word[0]]
        for letter in word[1:]:
            soundex_list.append(letter_dict[letter])

        # Removing pairs of consecutive digits
        for i in range(1, len(soundex_list) - 1):
            if soundex_list[i] == soundex_list[i + 1]:
                soundex_list[i] = '-'

        # Deleting all zeros and adding trailing zeros
        soundex_word = "".join(list(filter(lambda x: x != '0' and x != '-', soundex_list))) + "0000"

        return soundex_word[:4]
    return None


def build_soundex_index(dictionary):
    #TODO build soundex index for dictionary words.
    # dictionary is a vocabulary of original words
    # output format: 'code1': ['word1_with_code1', 'word2_with_code1', ...]    
    soundex_index = {}
    for word in dictionary:
        code = produce_soundex_code(word)
        if code:
            if code in soundex_index:
                soundex_index[code].append(word)
            else:
                soundex_index[code] = [word]

    return soundex_index


def fix_typo_soundex(word, soundex_index) -> list:
    #TODO return words from vocabulary that match with result by soundex fingerprint
    # ordered results by editorial distance
    
    sound_word = produce_soundex_code(word)
    matched_words = soundex_index.get(sound_word, [])
    
    return matched_words or [word]

In [42]:
# tests

soundex_index = build_soundex_index(vocabulary)

code1 = produce_soundex_code("britney")
code2 = produce_soundex_code("breatany")
print(code1, code2)
assert code1 == code2

print(fix_typo_soundex("enouhg", soundex_index))
assert "enough" in fix_typo_soundex("enouhg", soundex_index), "Assert soundex failed"

b635 b635
['enjoy', 'enough', 'emacs', 'emoji', 'emc', 'emas', 'enqueue', 'euank', 'ensue', 'eng', 'emq', 'enmasse', 'emac', 'ens', 'enc', 'emojii', 'enki', 'enso', 'enzo', 'enwiki', 'emmc', 'emesh', 'emg', 'emgo']


#### 2.2.3.3. [5] Trigrams with Jaccard coefficient

In [43]:
def fix_typo_kgram(word, k_gram_index) -> list:
    #TODO return best matches with respect to Jaccard index   
    k = max(len(x) for x in k_gram_index)

    def make_grams(word):
        return [
            word[i:i+ki]
            for ki in range(2, k + 1)
            for i in range(len(word) - ki + 1)
        ]

    def calc_jaccard(word):
        curr_grams = make_grams(word)
        inter = grams.intersection(curr_grams)
        union = grams.union(curr_grams)
        return len(inter) / len(union)

    grams = set(make_grams(word))
    candidates = set(word for gram in grams for word in k_gram_index.get(gram, [])) 
    return sorted(candidates, key=calc_jaccard, reverse=True)[:10] or [word]

In [44]:
# tests

k_gram_index_github = build_k_gram_index(vocabulary, 3)
print(fix_typo_kgram("enouh", k_gram_index_github)[:20])
assert "enough" in fix_typo_kgram("enouh", k_gram_index_github), "Assert k-gram failed"

['enough', 'eno', 'enought', 'nous', 'renounce', 'noun', 'deno', 'exogenous', 'endogenous', 'menoh']


## 2.2.4. [5] Estimate quality

In [45]:
norvig, soundex, kgram = 0, 0, 0
limit = 10000
counter = limit
for i, (src, target) in enumerate(dataset):
    if i == limit:
        break
    words = sent_to_words(src.lower())
    # word suspected for typos
    sn, ss, sk = src.lower(), src.lower(), src.lower()
    for word in words:
        if word not in vocabulary and word.isalpha():
            # top-1 accuracy
            wn, ws, wk = fix_typo_norvig(word), \
                         fix_typo_soundex(word, soundex_index)[0], \
                         fix_typo_kgram(word, k_gram_index_github)[0]
            sn = sn.replace(word, wn)
            ss = ss.replace(word, ws)
            sk = sk.replace(word, wk)
    norvig += int(sn == target.lower())
    soundex += int(ss == target.lower())
    kgram += int(sk == target.lower())

print(f"Norvig accuracy ({norvig}) = {norvig / limit}")
print(f"Soundex accuracy ({soundex}) = {soundex / limit}")
print(f"k-gram accuracy ({kgram}) = {kgram / limit}")

Norvig accuracy (2429) = 0.2429
Soundex accuracy (648) = 0.0648
k-gram accuracy (1316) = 0.1316
