# Wikipedia-based vocabulary

Part of text cleaning is recovering misspelled tokens in documents. The spell-checking component is built on the [pyenchant](https://github.com/pyenchant/pyenchant) library.

While the existing solution works, this implementation faces some issues, the most important of which is the handling of emerging or novel words. For example, "Covid" has recently become a common term, but the dictionary we use doesn't contain it. This means that when a document containing this term is processed, the term will be classified as misspelled and the pipeline will try to "fix" it.

To remedy this, we extend the standard vocabulary with the vocabulary of a regularly updated corpus. In this case, we choose the [Wikipedia corpus](https://dumps.wikimedia.org/enwiki/latest/) as the source of our updated vocabulary.



The approach is summarized as follows:

1. Download the latest Wikipedia corpus from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
2. Use gensim to process and collect the tokens in the corpus.

        ```python
        import os

        from gensim.corpora import WikiCorpus
        from wb_nlp.dir_manager import get_data_dir

        # Input dump and output directory for the processed dictionary.
        wiki_dump = get_data_dir('raw', 'wiki', 'enwiki-latest-pages-articles.xml.bz2')
        wiki_dict = get_data_dir('processed', 'wiki')
        os.makedirs(wiki_dict, exist_ok=True)

        # Constructing the corpus scans the whole dump to build its dictionary.
        # Note: gensim 4.x no longer accepts `lemmatize`; drop that argument there.
        wiki = WikiCorpus(
                wiki_dump, processes=max(1, os.cpu_count() - 4),
                lemmatize=False,
                article_min_tokens=50, token_min_len=2,
                token_max_len=50, lower=True)

        wiki.dictionary.save(os.path.join(wiki_dict, 'wiki_en.gensim.dict.pickle'))
        ```

3. Filter the tokens using the `.cfs` (collection frequency) and `.dfs` (document frequency) attributes of `wiki.dictionary`.
4. Load the filtered tokens as a personal word list alongside the standard dictionary.

        ```python
        import enchant
        en_dict = enchant.DictWithPWL("en_US", "wiki_en.txt")
        ```
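Step 3 is described without code, so here is a minimal sketch of one way the filtering could look. The function names and the thresholds `min_cf` and `min_df` are hypothetical and not taken from the original pipeline. The sketch works on plain mappings shaped like a gensim `Dictionary`'s `token2id`, `cfs`, and `dfs` attributes, and writes one token per line, which is the layout of a personal word list such as `wiki_en.txt`.

```python
from pathlib import Path


def select_vocabulary(token2id, cfs, dfs, min_cf=50, min_df=10):
    """Return the sorted tokens that pass both frequency thresholds.

    token2id maps token -> id; cfs and dfs map id -> collection and
    document frequency. The default thresholds are illustrative only.
    """
    keep = []
    for token, token_id in token2id.items():
        frequent_enough = cfs.get(token_id, 0) >= min_cf
        widespread_enough = dfs.get(token_id, 0) >= min_df
        # Keep purely alphabetic tokens; numbers add nothing to a spell checker.
        if frequent_enough and widespread_enough and token.isalpha():
            keep.append(token)
    return sorted(keep)


def write_word_list(tokens, path):
    """Write one token per line to `path`."""
    Path(path).write_text("\n".join(tokens) + "\n", encoding="utf-8")
```

With the corpus from step 2, this might be called as `select_vocabulary(wiki.dictionary.token2id, wiki.dictionary.cfs, wiki.dictionary.dfs)`, with the result passed to `write_word_list(..., "wiki_en.txt")`. Suitable thresholds depend on the dump and would need tuning, since misspellings also occur in Wikipedia, usually at low frequency.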

In [29]:
import os
import urllib.request

import requests
from bs4 import BeautifulSoup
from wb_nlp.dir_manager import get_data_dir

wiki_meta_url = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml'

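# The RSS feed's first item embeds an HTML link to the latest dump in its description.
soup = BeautifulSoup(requests.get(wiki_meta_url).content, 'html.parser')
description_html = soup.find('item').find('description').text
wiki_latest_url = BeautifulSoup(description_html, 'html.parser').find('a', href=True)['href']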

wiki_latest = wiki_latest_url.split('/')[-1]
wiki_data_path = get_data_dir('raw', 'wiki')
wiki_data_file = os.path.join(wiki_data_path, wiki_latest)

if not os.path.isdir(wiki_data_path):
    os.makedirs(wiki_data_path)

if not os.path.isfile(wiki_data_file):
    local_filename, headers = urllib.request.urlretrieve(wiki_latest_url, wiki_data_file)


In [1]:
import os
from gensim.corpora import WikiCorpus

wiki_dict = get_data_dir('processed', 'wiki')

if not os.path.isdir(wiki_dict):
    os.makedirs(wiki_dict)

# Note: gensim 4.x no longer accepts `lemmatize`; drop that argument there.
wiki = WikiCorpus(
        wiki_data_file, processes=max(1, os.cpu_count() - 4),
        lemmatize=False,
        article_min_tokens=50, token_min_len=2,
        token_max_len=50, lower=True)

wiki.dictionary.save(os.path.join(wiki_dict, 'wiki_en.gensim.dict.pickle'))