# Building vocabulary from Wikipedia articles

Pre-processing steps we take:
1. All text is lowercased.
2. We replace '-' with a space, because words containing '-' are most likely composed of words already present in the vocabulary.
3. We replace numbers and digits with '#NUMBER' to avoid an unnecessary increase in the vocabulary size.
4. We replace hyperlinks with '#HLINK', because hyperlinks are not useful for the purposes of our study.

Building vocabulary:
1. Initially we used NLTK's English vocabulary, which contains approximately 230k words. While working with the Wikipedia articles we realized that 230k words are not enough. For example, the word 'marxism' is not present in NLTK's vocabulary. One might argue that if we are filtering sentences based on words spoken by six-year-olds, then 'marxism' in particular is not that important. But we do not know what else is missing from NLTK's vocabulary, and we do not want to miss a simple sentence because one of its words is absent from NLTK.
2. We first go through the Wikipedia articles sequentially, building the vocabulary until its size reaches 2 million. Many sources indicate that the actual vocabulary of English is much smaller; the 2 million cap is just to be safe. It took 260k articles to reach a vocabulary of 2 million.
3. We check that all words from NLTK and all words from AO-CHILDES are present in the 2 million-word vocabulary, and add any that are missing.
4. We sample 260k previously unseen documents, build a new vocabulary from them, and merge it with the existing vocabulary.
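
The steps above can be illustrated with a minimal sketch on toy data. The function name `build_vocab`, the `check_every` parameter, and the toy documents are hypothetical and only stand in for the real Wikipedia articles and reference word lists; this is not the notebook's exact code.

```python
import re

WORD = re.compile(r'\w+')


def build_vocab(docs, max_size, check_every=2):
    """Collect unique lowercase tokens until the vocabulary reaches max_size.

    The size is only checked every `check_every` documents, mirroring a
    periodic checkpoint. Returns the vocabulary and the number of documents read.
    """
    vocab = set()
    n_read = 0
    for doc in docs:
        n_read += 1
        vocab.update(WORD.findall(doc.lower()))
        if n_read % check_every == 0 and len(vocab) >= max_size:
            break
    return vocab, n_read


# Toy stand-ins for Wikipedia articles and a reference word list (hypothetical data).
toy_docs = [
    "Marxism is a method of analysis",
    "A cat sat on the mat",
    "The dog chased the cat",
    "Unread article about birds",
]
reference_words = ["cat", "dog", "marxism", "zebra"]

vocab, n_read = build_vocab(toy_docs, max_size=10, check_every=2)

# Which reference words did the corpus pass fail to cover?
missing = set(reference_words) - vocab

# Merge the reference list so that no reference word is lost.
vocab |= set(reference_words)
```

Keeping the vocabulary in a `set` avoids repeated list-to-set conversions, and the difference `set(reference_words) - vocab` is the same coverage check applied later to the NLTK and CHILDES word lists.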

In [1]:
import scipy as sp
import numpy as np
import pandas as pd
import json
import os
import glob
from datasets import load_dataset
from tqdm import tqdm
from english_words import english_words_set
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zipf
from scipy.optimize import curve_fit
from nltk.corpus import words
import vocab_utils as utils_
from nltk import FreqDist, word_tokenize, wordpunct_tokenize
import re



In [2]:
# reference vocabularies: NLTK's English word list and the unique words in the CHILDES CSV
nltk_vocab = words.words()
aoc_vocab = pd.read_csv('CHILDES_vocab_age.csv').loc[:, 'word'].unique().tolist()

In [3]:
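# regexps for text preprocessing
ONLY_ALPHA = re.compile(r'([^\s\w]|_)+')   # runs of punctuation or underscores (not used in active code)
NUMBERS = re.compile(r'\b\d+')             # digit runs that start at a word boundary
MULTISPACE = re.compile(r'[^\S\r\n]{2,}')  # two or more non-newline whitespace characters (not used in active code)
AT_DIGIT = re.compile(r'@[,.]@')           # '@,@' and '@.@' separators inside numbers
AT_HYPHEN = re.compile(r'@-@')             # '@-@' hyphen marker (not used in active code)
WORD = re.compile(r'\w+')                  # word tokens: letters, digits and underscores
HLINK = re.compile(r'http\S+')             # URLs starting with 'http'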


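def preprocess(line):
    """Replace hyphens, numbers and hyperlinks with spaces or placeholders.

    Lowercasing is done by the caller, after this function returns.
    """
    #line = line.replace('``', '"')
    #line = line.replace("''", '"')
    #line = AT_HYPHEN.sub('-', line)
    line = line.replace('-', ' ')
    line = AT_DIGIT.sub('#NUMBER', line)
    line = NUMBERS.sub('#NUMBER ', line)
    # NOTE: hyperlinks are replaced last, so URLs that contain '-' or digits
    # have already been split by the steps above
    line = HLINK.sub('#HLINK', line)
    #line = MULTISPACE.sub(' ', line)
    #line = line.lstrip(' ')
    return line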
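
To see what the pre-processing produces on a concrete line, here is a small self-contained sketch. It uses the same regular expressions, but the function `preprocess_links_first` is a hypothetical variant that replaces hyperlinks before touching hyphens and digits; that ordering is an assumption made for the illustration, not the notebook's own order.

```python
import re

NUMBERS = re.compile(r'\b\d+')
HLINK = re.compile(r'http\S+')
WORD = re.compile(r'\w+')


def preprocess_links_first(line):
    """Hypothetical variant: replace URLs first so they are not split apart."""
    line = HLINK.sub('#HLINK', line)
    line = line.replace('-', ' ')
    line = NUMBERS.sub('#NUMBER ', line)
    return line


sample = "A well-known book from 1867, see http://example.org/karl-marx-1867"
cleaned = preprocess_links_first(sample)
# cleaned == "A well known book from #NUMBER , see #HLINK"

tokens = WORD.findall(cleaned.lower())
# tokens == ['a', 'well', 'known', 'book', 'from', 'number', 'see', 'hlink']
```

Because `WORD` matches only `\w+`, the leading `#` of each placeholder is dropped during tokenization, so the placeholders enter the vocabulary as the plain tokens `number` and `hlink`.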

In [4]:
wiki_data = load_dataset("wikipedia", "20220301.en")
wiki_docs = wiki_data['train']['text']
wiki_vocab = pd.read_csv('wikipedia_vocab_regex_based.csv').iloc[:, 1].values.tolist()

Reusing dataset wikipedia (/home/hf_cache/datasets_cache/wikipedia/20220301.en/2.0.0/aa542ed919df55cc5d3347f42dd4521d05ca68751f50dbc32bae2a7f1e167559)
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 29.03it/s]


In [5]:
english_vocab = list(wiki_vocab)  # start from the previously saved vocabulary
article_idx = 0
for article in tqdm(wiki_docs):
    article_idx += 1
    # uncomment to skip the first 250k articles when resuming:
    # if article_idx < 250000:
    #     continue
    article = preprocess(article)
    english_vocab += WORD.findall(article.lower())  # alternative: word_tokenize(article.lower())

    # every 10k articles: deduplicate, save a checkpoint, and stop once we reach 2 million words
    if article_idx % 10000 == 0:
        english_vocab = list(set(english_vocab))
        pd.DataFrame(english_vocab).to_csv('wikipedia_vocab_regex_based.csv')

        if len(english_vocab) >= 2e6:
            break


print(len(set(english_vocab)))

  4%|█▏                             | 249999/6458670 [00:03<01:19, 78425.71it/s]

2180879





In [6]:
english_vocab = [str(i) for i in english_vocab]
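
One likely reason the cast to `str` is needed (an assumption about this notebook, not something it states) is that `pd.read_csv` by default parses tokens such as `null` and `nan` as missing values, so a vocabulary reloaded from CSV can contain float `NaN` entries. A minimal sketch of the effect, using an in-memory buffer and a three-word toy vocabulary:

```python
import io

import pandas as pd

words_in = ['marxism', 'null', 'nan']

# Write the vocabulary the same way the notebook does, but to an in-memory buffer.
buffer = io.StringIO()
pd.DataFrame(words_in).to_csv(buffer)

# Default parsing: 'null' and 'nan' are interpreted as missing values (float NaN).
buffer.seek(0)
default_read = pd.read_csv(buffer).iloc[:, 1].tolist()

# keep_default_na=False keeps those tokens as ordinary strings.
buffer.seek(0)
safe_read = pd.read_csv(buffer, keep_default_na=False).iloc[:, 1].tolist()

# Casting after a default read maps every NaN to the string 'nan',
# so the original word 'null' is not recovered.
cast_read = [str(w) for w in default_read]
```

With the default settings, a word like `null` cannot be recovered by the cast alone, which may explain a reference word being reported as missing; reading with `keep_default_na=False` avoids the problem at the source.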

In [7]:
missing_nltk = set(nltk_vocab) - set(english_vocab)
missing_aoc = set(aoc_vocab) - set(english_vocab)

In [8]:
print('='*10)
print(f'number of words in NLTK but not in Wiki: {len(missing_nltk)}')
#print(missing_nltk)

print('='*10)
print(f'number of words in AOC but not in Wiki: {len(missing_aoc)}')
#print(missing_aoc)

number of words in NLTK but not in Wiki: 1
number of words in AOC but not in Wiki: 1


In [9]:
english_vocab += list(nltk_vocab)
english_vocab += list(aoc_vocab)
english_vocab = list(set(english_vocab))
pd.DataFrame(english_vocab).to_csv('wikipedia_vocab_regex_based.csv')

In [12]:
# sample 260k article indices, without replacement, from the articles after the first 260k
random_articles = np.random.choice(np.arange(260000, len(wiki_data['train'])), size=260000, replace=False)
new_articles = [wiki_docs[i] for i in random_articles]

In [13]:
not_in_vocab = []
new_vocab = []
article_idx = 0
for article in tqdm(new_articles):
    article_idx += 1
    article = preprocess(article)
    new_vocab += WORD.findall(article.lower())#word_tokenize(article.lower())
        
    # every 50k articles: deduplicate and count the words missing from the existing vocabulary
    if article_idx%50000 == 0:
        new_vocab = list(set(new_vocab))
        not_in_vocab += list(set(new_vocab) - set(english_vocab))
        not_in_vocab = list(set(not_in_vocab))
        print(f'There are {len(not_in_vocab)} number of words found that are not present in the vocabulary')
        
        if len(new_vocab) >= 2e6:
            break


 20%|██████▋                           | 50950/260000 [00:10<02:58, 1172.19it/s]

There are 128548 number of words found that are not present in the vocabulary


 39%|████████████▊                    | 101038/260000 [00:21<02:25, 1095.51it/s]

There are 253135 number of words found that are not present in the vocabulary


 58%|███████████████████▏             | 151139/260000 [00:31<01:40, 1078.11it/s]

There are 371033 number of words found that are not present in the vocabulary


 77%|█████████████████████████▌       | 201272/260000 [00:41<00:54, 1069.62it/s]

There are 481238 number of words found that are not present in the vocabulary


 97%|████████████████████████████████▊ | 250904/260000 [00:52<00:09, 938.29it/s]

There are 586063 number of words found that are not present in the vocabulary


100%|█████████████████████████████████| 260000/260000 [00:53<00:00, 4816.95it/s]
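
The counts printed above are cumulative. A short sketch of the arithmetic for turning them into the number of new out-of-vocabulary words found in each block of 50k articles (the variable names are illustrative only):

```python
# Cumulative out-of-vocabulary counts reported after 50k, 100k, 150k, 200k and 250k articles.
cumulative = [128548, 253135, 371033, 481238, 586063]

# New out-of-vocabulary words contributed by each block of 50k articles.
increments = [b - a for a, b in zip([0] + cumulative[:-1], cumulative)]
# increments == [128548, 124587, 117898, 110205, 104825]

# Size of the last block's contribution relative to the first.
ratio_last_to_first = increments[-1] / increments[0]
```

Each block still contributes more than 100k unseen words, and the per-block count falls only slowly, which suggests the vocabulary is far from saturated after the first pass.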


In [14]:
print(len(new_vocab))

5651913


In [15]:
# merge both vocabularies, deduplicate, and save
english_vocab += list(new_vocab)
english_vocab = list(set(english_vocab))
pd.DataFrame(english_vocab).to_csv('wikipedia_vocab_regex_based.csv')