# Wikipedia-based vocabulary

Part of the text-cleaning pipeline recovers misspelled tokens in documents. The spell-checking component is built on the [pyenchant](https://github.com/pyenchant/pyenchant) library.

While the existing solution works, this implementation faces some issues. The most important is the handling of emerging or novel words. For example, "Covid" has recently become a common term, but the dictionary we use does not contain it. This means that when a document containing this term is processed, the term is classified as misspelled and the pipeline tries to "fix" it.

To remedy this, we modify the solution by updating the standard vocabulary with the vocabulary from a dynamically updating corpus. In this case, we choose the [Wikipedia corpus](https://dumps.wikimedia.org/enwiki/latest/) as the source of our updated vocabulary.
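
The idea can be illustrated with a minimal sketch that uses plain Python sets in place of the enchant dictionary and the Wikipedia vocabulary. The class `VocabularyChecker` and the word lists are hypothetical and only meant to show why a supplementary vocabulary stops novel terms from being flagged; this is not the pipeline's actual implementation.

```python
from typing import Iterable, List, Set


class VocabularyChecker:
    """Toy spell checker: a word is known if it appears in either word set."""

    def __init__(self, base_words: Iterable[str], extra_words: Iterable[str] = ()):
        # base_words stands in for the standard dictionary,
        # extra_words for the vocabulary collected from Wikipedia.
        self.base_words: Set[str] = {w.lower() for w in base_words}
        self.extra_words: Set[str] = {w.lower() for w in extra_words}

    def check(self, word: str) -> bool:
        word = word.lower()
        return word in self.base_words or word in self.extra_words

    def flag_misspelled(self, tokens: Iterable[str]) -> List[str]:
        # Return the tokens that neither vocabulary recognizes.
        return [t for t in tokens if not self.check(t)]


base = ["the", "pandemic", "spread", "quickly"]
tokens = ["the", "covid", "pandemic", "spred", "quickly"]

standard = VocabularyChecker(base)
augmented = VocabularyChecker(base, extra_words=["covid"])

flagged_standard = standard.flag_misspelled(tokens)    # "covid" is flagged too
flagged_augmented = augmented.flag_misspelled(tokens)  # only the real typo remains
```

With the standard vocabulary alone, both "covid" and "spred" are flagged; once the extra vocabulary is added, only "spred" is.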



The solution for this is summarized as follows:

1. Download the latest Wikipedia corpus from https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2.
2. Use gensim to process and collect the tokens in the corpus.

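        ```python
        import os

        from gensim.corpora import WikiCorpus
        from wb_nlp.dir_manager import get_data_dir

        wiki_dump = get_data_dir('raw', 'wiki', 'enwiki-latest-pages-articles.xml.bz2')
        wiki_dict = get_data_dir('processed', 'wiki')
        os.makedirs(wiki_dict, exist_ok=True)

        # Leave a few cores free; os.cpu_count() can return None.
        n_procs = max(1, (os.cpu_count() or 1) - 4)

        # Without a prebuilt dictionary, WikiCorpus scans the whole dump to
        # collect the vocabulary, so this step is slow.
        # Note: `lemmatize` appears to be accepted only by gensim < 4.0;
        # drop the argument on newer versions.
        wiki = WikiCorpus(
                wiki_dump, processes=n_procs,
                lemmatize=False,
                article_min_tokens=50, token_min_len=2,
                token_max_len=50, lower=True)

        wiki.dictionary.save(os.path.join(wiki_dict, 'wiki_en.gensim.dict.pickle'))
        ```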

3. Filter the tokens using the `.cfs` (collection frequency) and `.dfs` (document frequency) attributes of `wiki.dictionary`.
4. Use the updated dictionary: load the filtered Wikipedia tokens as a personal word list alongside the standard dictionary.

        ```python
        import enchant
        en_dict = enchant.DictWithPWL("en_US", "wiki_en.txt")
        ```
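
The filtering in step 3 could look like the sketch below. It works on plain mappings so it runs without gensim; with a real gensim `Dictionary`, the `token2id`, `cfs`, and `dfs` attributes would be passed in instead. The function name `select_vocabulary`, the thresholds, and the alphabetic-only rule are assumptions made for illustration, not the values used in the pipeline.

```python
from typing import Dict, List


def select_vocabulary(
    token2id: Dict[str, int],
    cfs: Dict[int, int],
    dfs: Dict[int, int],
    min_cf: int = 100,  # hypothetical: minimum total occurrences in the corpus
    min_df: int = 20,   # hypothetical: minimum number of articles containing the token
) -> List[str]:
    """Keep tokens that are frequent enough to be trusted as real words."""
    vocab = []
    for token, token_id in token2id.items():
        # Assumption: skip tokens with digits or symbols, which are
        # unlikely to be useful spell-checker entries.
        if not token.isalpha():
            continue
        if cfs.get(token_id, 0) >= min_cf and dfs.get(token_id, 0) >= min_df:
            vocab.append(token)
    return sorted(vocab)


# Toy stand-ins for wiki.dictionary.token2id, .cfs and .dfs.
token2id = {"covid": 0, "teh": 1, "economy": 2, "x1": 3}
cfs = {0: 5000, 1: 40, 2: 90000, 3: 700}
dfs = {0: 1200, 1: 35, 2: 40000, 3: 300}

vocab = select_vocabulary(token2id, cfs, dfs)
```

Writing `vocab` to a text file with one word per line would produce a personal word list of the kind that `enchant.DictWithPWL` reads in step 4. Raising the thresholds trades coverage of rare terms for fewer common misspellings leaking into the vocabulary.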

Deduplicate documents based on hash similarity.

In [1]:
import spacy
import re

In [2]:
from wb_nlp.cleaning import cleaner
from wb_nlp.extraction import phrase
from wb_nlp import dir_manager
from joblib import Parallel, delayed

In [3]:
phrase.get_phrases?

Signature:
phrase.get_phrases(
    doc: spacy.tokens.doc.Doc,
    min_token_length: int = 3,
    token_func: Union[Callable, NoneType] = None,
    token_container: Union[list, NoneType] = None,
) -> list
Docstring: <no docstring>
File:      /workspace/src/wb_nlp/extraction/phrase.py
Type:      function


In [4]:
nlp = spacy.load('en_core_web_sm')

In [5]:
txt_path = dir_manager.get_path_from_root('notebooks/archive/SCRIPTS/acronyms/imf_00ae75cce82e5c915d5bead7a7bb2165e9ef215a.txt')

with open(txt_path) as fl:
    text = fl.read()

In [6]:
lda_cleaner = cleaner.LDACleaner()

In [7]:
%%time
tokens = lda_cleaner.get_tokens(text)

CPU times: user 3.78 s, sys: 2.66 s, total: 6.44 s
Wall time: 6.69 s


In [8]:
%%time
tokens_and_phrases = lda_cleaner.get_tokens_and_phrases(text)

CPU times: user 3.3 s, sys: 2.82 s, total: 6.12 s
Wall time: 6.25 s


In [9]:
len(tokens), len(tokens_and_phrases['tokens'])

(7831, 7831)

In [11]:
len(tokens_and_phrases['phrases'])

2298

In [6]:
e = phrase.get_phrases(nlp(re.sub(r'\s+', ' ', text[:20000])))

In [7]:
e

['second_review',
 'year_arrangement',
 'performance_criterion',
 'staff_report',
 'staff_team',
 'economic_development',
 'information_available',
 'staff_report',
 'staff_report',
 'staff_team',
 'staff_report',
 'staff_report',
 'other_document',
 'market_sensitive_information',
 'publication_policy',
 'reader_comment',
 'monetary_fund',
 'other_department',
 'second_review',
 'mission_team',
 'other_senior_government_official',
 'donor_representative',
 'amount_equivalent',
 'first_review',
 'time_director',
 'additional_resource',
 'food_relief_program',
 'second_review',
 'debt_relief',
 'debt_relief',
 'quantitative_performance_criterion',
 'structural_performance_criterion',
 'recent_development',
 'exchange_rate',
 'external_tariff',
 'other_member',
 'exchange_rate_policy',
 'regional_level',
 'proposed_schedule',
 'common_indicator',
 'executive_summary',
 'severe_drought',
 'strong_recovery',
 'agriculture_sector',
 'macroeconomic_performance',
 'real_growth',
 'record_agri

In [8]:
doc = nlp('This is the University of in the Philippines, Diliman.')

In [9]:
[(t.lemma_, t.pos_, t.ent_type_) for t in doc]

[('this', 'DET', ''),
 ('be', 'AUX', ''),
 ('the', 'DET', 'ORG'),
 ('University', 'PROPN', 'ORG'),
 ('of', 'ADP', 'ORG'),
 ('in', 'ADP', 'ORG'),
 ('the', 'DET', ''),
 ('Philippines', 'PROPN', 'GPE'),
 (',', 'PUNCT', ''),
 ('Diliman', 'PROPN', 'PERSON'),
 ('.', 'PUNCT', '')]

In [10]:
phrase.get_phrases(doc)

[]