# Preprocessing the English Wikipedia corpus

Several corrections to the raw text will improve the quality of our word vector representations. We'll iterate on the preprocessing step to produce better word vectors.

Common preprocessing steps are:
* dropping numbers
* dropping punctuation
* coercing to lower case
* stemming
* lemmatization

However, some of these operations can be undesirably lossy, e.g.:
 * dropping hyphens can destroy certain word combinations
     * the word "email" was once written "e-mail"
 * coercing to lower case erases the distinction between entities (capitalized nouns and proper names) and ordinary words
     * "Edward Said" becomes "edward said"
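
As a small illustration of the losses listed above, the sketch below applies lower-casing and hyphen handling to a handful of made-up tokens. The token list is hypothetical and is not drawn from the corpus.

```python
# Hypothetical tokens chosen to show each kind of loss.
tokens = ["Said", "said", "e-mail", "email"]

# Lower-casing merges the surname "Said" with the verb "said".
lowered = {t.lower() for t in tokens}

# Deleting hyphens merges "e-mail" with "email" ...
dehyphenated = {t.replace("-", "") for t in tokens}

# ... while replacing hyphens with spaces splits a compound in two.
split_on_hyphen = "beauty-obsessed".replace("-", " ").split()

print(sorted(lowered))
print(sorted(dehyphenated))
print(split_on_hyphen)
```

Four distinct tokens shrink to three under either transformation, which is why the cells below keep hyphens and lower-case only selectively.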

In [1]:
import re

from cltk.prosody.latin.string_utils import punctuation_for_spaces_dict
from nltk.corpus import stopwords
from nltk.corpus import words as unix_dictionary

In [2]:
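# Translation table keyed by code point; punctuation is replaced with spaces
replacer = punctuation_for_spaces_dict()
print(f"Hyphen has same ascii and unicode value: {ord('-')}")
# Remove the hyphen entry so that compounds such as "e-mail" survive
del replacer[45]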

Hyphen has same ascii and unicode value: 45


In [3]:
print("Madam, I'm a Adam!".translate(replacer))
print("The word email, used to be e-mail.".translate(replacer))
print("But hyphen assimilation isn't always possible based on sense: beauty-obsessed".translate(replacer))

Madam  I m a Adam 
The word email  used to be e-mail 
But hyphen assimilation isn t always possible based on sense  beauty-obsessed
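
For readers without `cltk` installed, a comparable table can be sketched with the standard library alone. This assumes the `cltk` table simply maps punctuation code points to spaces, as the output above suggests. The name `ascii_replacer` is hypothetical, and the sketch covers only ASCII punctuation, so it may differ from the `cltk` table on other Unicode punctuation.

```python
import string

# Hypothetical stand-in for the cltk table: map every ASCII punctuation
# character except the hyphen to a single space.
ascii_replacer = {ord(ch): " " for ch in string.punctuation if ch != "-"}

# str.translate accepts a dict from code points to replacement strings.
print("The word email, used to be e-mail.".translate(ascii_replacer))
```

Leaving the hyphen out of the comprehension has the same effect as the `del replacer[45]` step.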


In [4]:
# Matches any token that contains at least one digit
has_numeric = re.compile(r'.*[0-9]+.*')

In [5]:
stop_words = set(stopwords.words('english'))

In [6]:
# Dictionary entries that are written entirely in lower case
lower_words = {tmp for tmp in unix_dictionary.words() if tmp.lower() == tmp}

In [9]:
with open('wikimedia.en.processed.cor', 'wt') as writer:
    with open('wikimedia.en.cor', 'rt') as reader:
        for line in reader:
            line = line.strip()
            # Replace punctuation (except hyphens) with spaces, then tokenize
            words = line.translate(replacer).split()
            # Check the first word of each sentence; if it typically occurs in lower case, coerce to lower.
            # lower_words holds only lower-case entries, so the lowered form must be looked up.
            if words and words[0][0].upper() == words[0][0]:
                if words[0].lower() in lower_words:
                    words[0] = words[0].lower()
            # Drop tokens containing digits, single-character tokens, and stop words
            cleaned_words = [tmp for tmp in words
                             if not has_numeric.match(tmp)
                             and len(tmp) > 1
                             and tmp not in stop_words]
            writer.write(' '.join(cleaned_words))
            writer.write('\n')
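
To try the cleaning logic without the full corpus, the per-line steps can be wrapped in a function and run on an in-memory string. This is a minimal sketch rather than the notebook's own code. The function name `clean_line` and the toy tables are hypothetical. The first word is lower-cased before the dictionary lookup, which is an assumption about the intended behaviour given that `lower_words` contains only lower-case entries.

```python
import re

has_numeric = re.compile(r".*[0-9]+.*")


def clean_line(line, replacer, lower_words, stop_words):
    """Clean one line with the per-line steps of the file loop; return the kept tokens."""
    words = line.strip().translate(replacer).split()
    # Coerce a capitalized first word to lower case if its lowered form is a known word.
    if words and words[0][0].upper() == words[0][0]:
        if words[0].lower() in lower_words:
            words[0] = words[0].lower()
    # Drop tokens with digits, single-character tokens, and stop words.
    return [w for w in words
            if not has_numeric.match(w) and len(w) > 1 and w not in stop_words]


# Tiny hypothetical stand-ins for the real tables, for illustration only.
toy_replacer = {ord(","): " ", ord("."): " "}
toy_lower_words = {"the", "word"}
toy_stop_words = {"the", "to", "be", "in"}

cleaned = clean_line("The word email, used to be e-mail in 1995.",
                     toy_replacer, toy_lower_words, toy_stop_words)
print(cleaned)
```

With the real `replacer`, `lower_words`, and `stop_words` passed in, the same function could be called once per line inside the file loop.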

# Now we're ready to create a quality word vector from the corpus

## Appendix: notebook testing of the cleaning functions

In [None]:
# Why has_numeric: str.isalpha() returns False for hyphenated words that we want to keep
'salt-making'.isalpha()

In [None]:
# Alpha and non-alpha matching are a bit problematic;
# if truly needed, we should scan the corpus and pull out a set of all the characters we want.
non_alpha = re.compile(r'.*[^a-zA-Z\-]+')
non_alpha.match('salt-Making')

In [None]:
non_alpha.match('abc120')

In [None]:
non_alpha.match('120abc')

In [None]:
has_numeric.match('abc120')

In [None]:
has_numeric.match('120abc')

In [None]:
has_numeric.match('1')

In [None]:
has_numeric.match('13')
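
The individual checks above can also be gathered into one cell of assertions, which makes the behaviour of the two patterns explicit. This is a sketch that redefines both patterns so it runs on its own. The accented example `café` is a hypothetical token added to show why an ASCII-only `non_alpha` pattern is problematic.

```python
import re

has_numeric = re.compile(r".*[0-9]+.*")
non_alpha = re.compile(r".*[^a-zA-Z\-]+")

# str.isalpha() is too strict: the hyphen makes it return False.
assert not "salt-making".isalpha()

# non_alpha lets hyphenated words through ...
assert non_alpha.match("salt-Making") is None
# ... and flags digits wherever they occur in the token ...
for token in ("abc120", "120abc"):
    assert non_alpha.match(token) is not None
# ... but it also flags legitimate non-ASCII letters.
assert non_alpha.match("café") is not None

# has_numeric flags only tokens that contain a digit.
for token in ("abc120", "120abc", "1", "13"):
    assert has_numeric.match(token) is not None
for token in ("salt-making", "café"):
    assert has_numeric.match(token) is None

print("all checks passed")
```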