# Preprocessing: Clean Up & Tokenize Questions

Break question titles into tokens, and perform token-level normalization: expand shortened words, correct spelling, etc.

## Imports

This utility package imports `numpy`, `pandas`, `matplotlib` and a helper `kg` module into the root namespace.

In [None]:
from pygoose import *



In [10]:
import nltk

## Config

Automatically discover the paths to various data folders and compose the project structure.

In [11]:
project = kg.Project.discover()

## Load Data

Original question datasets.

In [12]:
df_train = pd.read_csv(project.data_dir + 'train.csv').fillna('none')
df_test = pd.read_csv(project.data_dir + 'test.csv').fillna('none')

In [13]:
df_all = pd.concat([df_train, df_test])

In [14]:
df_all.head()

Unnamed: 0,lvl1,lvl2,titles,descriptions,price
0,110,267,Seeking Vaccancy In A Mnufacturing Company,"when working with people one need to be kind, sincere, and loyal in all that he is doing.",none
1,5,55,Slippers For Men,its made of good qualities and good for all men's casuals,3000.000000
2,27,257,Afro By Nature Hydrating Leave In Conditioner,"Dry, brittle hair? Damaged hair or split ends? No problem. \r\nThis product is formulated to restore moisture into the hair. it also repairs the hair and leaves it full, shiny and bouncy. Suitable for all hair texture. Recommended for both natural and relaxed hair.",2500.000000
3,5,168,Porshe Design Wristwatches,Porshe design new wristwatch is now available at my store,175000.000000
4,3,17,"Brand New Samsung 20"" 20J4003 TV LED - Black","KEY FEATURES\r\nBrand: Samsung\r\nModel: 20J4003\r\nDesign: LED\r\nVideo: 23.6"" Measured Diagonally\r\nWireless Connectivity: YES\r\nInputs & Outputs: HDMI, USB\r\nDimensions (W X H X D): 22.1"" x 13.7"" x 1.9""\r\nPower: AC110-120V 60Hz\r\nProduct warranty: 2years warranty.\r\nTo place your order,chat me up.\r\nDelivery available nationwide with discount. Also available in bulk.",38000.000000


Stopwords customized for the Quora dataset, merged below with NLTK's English stopword list.

In [15]:
stopwords = set(kg.io.load_lines(project.aux_dir + 'stopwords.vocab'))

In [16]:
from nltk.corpus import stopwords as st
stop = set(st.words('english'))

In [17]:
stopwords = stopwords | stop

Pre-composed spelling correction dictionary.

In [18]:
spelling_corrections = kg.io.load_json(project.aux_dir + 'spelling_corrections.json')

## Load Tools

In [19]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
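
To see what the `\w+` pattern keeps and drops, here is a small sketch that uses the standard-library `re` module as a stand-in. For this pattern `re.findall` should yield the same tokens as `RegexpTokenizer`, but treat that equivalence as an assumption; the helper name `tokenize` is hypothetical. The sample strings are taken from the data preview above.

```python
import re

def tokenize(text):
    # Stand-in for RegexpTokenizer(r'\w+'): keep runs of letters, digits and underscores.
    return re.findall(r'\w+', text)

print(tokenize("good for all men's casuals"))  # ['good', 'for', 'all', 'men', 's', 'casuals']
print(tokenize('22.1" x 13.7"'))               # ['22', '1', 'x', '13', '7']
```

Apostrophes and decimal points act as separators, so contractions and decimal numbers are split into pieces.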

## Preprocess and tokenize questions

In [20]:
def translate(text, translation):
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    text = text.replace('  ', ' ')
    return text

In [21]:
def spell_digits(text):
    translation = {
        '0': 'zero',
        '1': 'one',
        '2': 'two',
        '3': 'three',
        '4': 'four',
        '5': 'five',
        '6': 'six',
        '7': 'seven',
        '8': 'eight',
        '9': 'nine',
    }
    return translate(text, translation)
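
As a quick illustration of the digit spelling step, the sketch below reimplements the two helpers in a self-contained form and applies them to a model number and a warranty string from the data preview. The `DIGITS` constant is a name introduced here for brevity, not part of the notebook.

```python
def translate(text, translation):
    # Same replace-and-pad approach as the notebook's helper.
    for token, replacement in translation.items():
        text = text.replace(token, ' ' + replacement + ' ')
    return text.replace('  ', ' ')

# Hypothetical constant: maps each digit character to its English name.
DIGITS = dict(zip('0123456789', [
    'zero', 'one', 'two', 'three', 'four',
    'five', 'six', 'seven', 'eight', 'nine',
]))

def spell_digits(text):
    return translate(text, DIGITS)

print(spell_digits('20j4003').split())
# ['two', 'zero', 'j', 'four', 'zero', 'zero', 'three']
print(spell_digits('2years warranty').split())
# ['two', 'years', 'warranty']
```

Every digit becomes its own word, so multi-digit numbers and model codes expand into several tokens.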

In [22]:
def expand_negations(text):
    # Irregular contractions must be expanded before the generic "n't" rule.
    translation = {
        "can't": 'can not',
        "won't": 'will not',
        "shan't": 'shall not',
    }
    text = translate(text, translation)
    return text.replace("n't", " not")
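
The order of the two steps matters: irregular contractions have to be handled before the generic `n't` suffix. The following is a reduced sketch with a smaller, hypothetical lookup table and plain `str.replace` (no space padding), just to show the effect.

```python
def expand_negations(text):
    # Irregular contractions first, then the generic "n't" suffix.
    irregular = {"can't": 'can not', "shan't": 'shall not'}
    for contraction, expansion in irregular.items():
        text = text.replace(contraction, expansion)
    return text.replace("n't", ' not')

print(expand_negations("i can't and she doesn't"))  # i can not and she does not

# Applying only the generic rule mangles the irregular form:
print("can't".replace("n't", ' not'))               # ca not
```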

In [23]:
_our_bad_words = []

In [24]:
def correct_spelling(text):
    # Tokenize once, record every token that has a known correction,
    # and return the text re-joined with the corrections applied.
    tokens = tokenizer.tokenize(text)
    _our_bad_words.extend(token for token in tokens if token in spelling_corrections)
    return ' '.join(spelling_corrections.get(token, token) for token in tokens)

In [25]:
def get_question_tokens(question, lowercase=True, spellcheck=True, remove_stopwords=True):
    if lowercase:
        question = question.lower()

    if spellcheck:
        # Note: correct_spelling() re-joins `\w+` tokens with spaces, so apostrophes
        # are dropped here and expand_negations() mostly matters when spellcheck=False.
        question = correct_spelling(question)

    question = spell_digits(question)
    question = expand_negations(question)

    # Lowercase again in case the spelling corrections reintroduced capitals.
    tokens = tokenizer.tokenize(question.lower() if lowercase else question)
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stopwords]

    return tokens
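
For intuition, here is a simplified end-to-end sketch of the pipeline on the first title from the data preview. The correction and stopword tables are hypothetical stand-ins (the real ones live in the auxiliary files), `re.findall` replaces the NLTK tokenizer, and the digit and negation steps are left out for brevity.

```python
import re

# Hypothetical stand-ins for spelling_corrections.json and stopwords.vocab.
TOY_CORRECTIONS = {'vaccancy': 'vacancy', 'mnufacturing': 'manufacturing'}
TOY_STOPWORDS = {'in', 'a'}

def toy_question_tokens(text):
    text = text.lower()                                     # 1. lowercase
    tokens = re.findall(r'\w+', text)                       # 2. tokenize on word characters
    tokens = [TOY_CORRECTIONS.get(t, t) for t in tokens]    # 3. apply known corrections
    return [t for t in tokens if t not in TOY_STOPWORDS]    # 4. drop stopwords

print(toy_question_tokens('Seeking Vaccancy In A Mnufacturing Company'))
# ['seeking', 'vacancy', 'manufacturing', 'company']
```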

In [26]:
def get_question_pair_tokens_lowercase_spellcheck_remove_stopwords(pair):
    return [
        get_question_tokens(pair[0], lowercase=True, spellcheck=True, remove_stopwords=True),
        get_question_tokens(pair[1], lowercase=True, spellcheck=True, remove_stopwords=True),
    ]

In [27]:
def get_question_pair_tokens_lowercase_spellcheck_remove_stopwords_descriptions_only(pair):
    # `pair` is a one-element row; despite the name, this mapper is reused for titles below.
    return get_question_tokens(pair[0], lowercase=True, spellcheck=True, remove_stopwords=True)

In [22]:
len(set(df_all.lvl1))

14

Tokenize each (description, title) pair: convert to lowercase, correct spelling, and remove stopwords.

In [203]:
tokens_spellcheck = kg.jobs.map_batch_parallel(
    df_all[['descriptions', 'titles']].values,  # as_matrix() was removed in pandas 1.0
    item_mapper=get_question_pair_tokens_lowercase_spellcheck_remove_stopwords,
    batch_size=1000,
)


Batches: 100%|██████████| 717/717 [00:28<00:00, 24.91it/s]
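
The 717 batches follow from the batch size: the 1,433,350 flattened entries counted in the vocabulary step correspond to 716,675 rows of two columns each, and ceil(716675 / 1000) = 717. The sketch below shows one way a batched map could work. It is a hypothetical stand-in (`map_in_batches` is not a pygoose function) that uses threads for simplicity; the real `kg.jobs.map_batch_parallel` may use processes and differ in detail.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def map_in_batches(items, item_mapper, batch_size=1000, max_workers=4):
    # Split the input into consecutive fixed-size batches.
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

    def run_batch(batch):
        return [item_mapper(item) for item in batch]

    # Executor.map returns results in input order, so the output stays aligned with `items`.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batch_results = list(pool.map(run_batch, batches))

    # Flatten the per-batch lists back into one list.
    return [result for batch in batch_results for result in batch]

print(math.ceil(716675 / 1000))  # 717
print(map_in_batches(list(range(5)), lambda x: x * 2, batch_size=2))  # [0, 2, 4, 6, 8]
```

Keeping the output order aligned with the input matters here, because the saved token lists are later split back into train and test by position.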


In [23]:
tokens_spellcheck_descriptions = kg.jobs.map_batch_parallel(
    df_all[['descriptions']].values,  # as_matrix() was removed in pandas 1.0
    item_mapper=get_question_pair_tokens_lowercase_spellcheck_remove_stopwords_descriptions_only,
    batch_size=1000,
)

Batches: 100%|██████████| 717/717 [00:15<00:00, 45.09it/s]


In [28]:
tokens_spellcheck_titles = kg.jobs.map_batch_parallel(
    df_all[['titles']].values,  # as_matrix() was removed in pandas 1.0
    item_mapper=get_question_pair_tokens_lowercase_spellcheck_remove_stopwords_descriptions_only,
    batch_size=1000,
)

Batches: 100%|██████████| 717/717 [00:02<00:00, 253.67it/s]


Tokenize the questions, convert to lowercase and correct spelling, keep the stopwords (useful for neural models).

## Extract question vocabulary

In [204]:
vocab = set()
for question in progressbar(np.array(tokens_spellcheck).ravel()):
    for token in question:
        vocab.add(token)

100%|██████████| 1433350/1433350 [00:05<00:00, 249630.12it/s]


In [205]:
len(vocab)

226268
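
The same vocabulary can be collected without going through a NumPy array, which may be useful because newer NumPy releases can reject ragged nested lists. This is a minimal sketch on toy data; `token_pairs` is a made-up stand-in for `tokens_spellcheck`.

```python
from itertools import chain

# Toy stand-in: one [description_tokens, title_tokens] pair per row.
token_pairs = [
    [['good', 'slippers'], ['slippers', 'men']],
    [['new', 'wristwatch'], ['wristwatches']],
]

# Flatten rows -> token lists -> tokens, then deduplicate with a set.
vocab = set(chain.from_iterable(chain.from_iterable(token_pairs)))

print(sorted(vocab))
# ['good', 'men', 'new', 'slippers', 'wristwatch', 'wristwatches']
```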

## Save preprocessed data

Tokenized questions.

In [206]:
kg.io.save(
    tokens_spellcheck[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_train.pickle'
)
kg.io.save(
    tokens_spellcheck[len(df_train):],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_test.pickle'
)
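
The split relies on `df_all` listing all training rows first, so slicing at `len(df_train)` recovers the two parts. The sketch below mimics the split and a save/load round trip with the standard `pickle` module. It assumes `kg.io.save` writes ordinary pickle files, which the `.pickle` extension suggests but which is not confirmed here; the toy data and file names are hypothetical.

```python
import os
import pickle
import tempfile

all_tokens = [['a'], ['b', 'c'], ['d']]  # toy stand-in for tokens_spellcheck
n_train = 2                               # plays the role of len(df_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = os.path.join(tmp_dir, 'tokens_train.pickle')
    test_path = os.path.join(tmp_dir, 'tokens_test.pickle')

    # Training rows come first, test rows follow.
    with open(train_path, 'wb') as f:
        pickle.dump(all_tokens[:n_train], f)
    with open(test_path, 'wb') as f:
        pickle.dump(all_tokens[n_train:], f)

    # Load both parts back, as a downstream notebook might.
    with open(train_path, 'rb') as f:
        train_tokens = pickle.load(f)
    with open(test_path, 'rb') as f:
        test_tokens = pickle.load(f)

print(train_tokens)  # [['a'], ['b', 'c']]
print(test_tokens)   # [['d']]
```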

## Descriptions only

In [26]:
kg.io.save(
    tokens_spellcheck_descriptions[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_descriptions_train.pickle'
)
kg.io.save(
    tokens_spellcheck_descriptions[len(df_train):],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_descriptions_test.pickle'
)

## Titles only

In [29]:
kg.io.save(
    tokens_spellcheck_titles[:len(df_train)],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_titles_train.pickle'
)
kg.io.save(
    tokens_spellcheck_titles[len(df_train):],
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords_titles_test.pickle'
)

Question vocabulary.

In [207]:
kg.io.save_lines(
    sorted(list(vocab)),
    project.preprocessed_data_dir + 'tokens_lowercase_spellcheck_no_stopwords.vocab'
)

Ground truth.

In [208]:
kg.io.save(df_train['lvl1'].values, project.features_dir + 'y_train.pickle')

In [7]:
kg.io.save(df_test['lvl1'].values, project.features_dir + 'y_test.pickle')