# Fall 2020: DS-GA 1011 NLP with Representation Learning
## Lab 2: 11-Sep-2020, Friday
## Text Pre-processing

In this lab, we cover how to clean and process text data before feeding it to NLP models.

---
### Data
We are using [movie review data](https://ai.stanford.edu/~amaas/data/sentiment/) from IMDB, which is for *binary sentiment classification*. There are 25,000 reviews for training and 25,000 for testing.

### Download and unzip the data

The command `wget` downloads the data from the URL below.

macOS/Linux: if `wget` is not available, install it with `brew install wget` (Homebrew) or with your distribution's package manager.

In [1]:
# !brew install wget
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2021-07-29 15:58:58--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2021-07-29 15:59:01 (37.3 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



Windows: install a download utility with `conda` or `pip`. See also the tip in Campuswire post #28. The archive can also be extracted with a utility such as 7-Zip.

In [None]:
# !pip install wget 
# import wget
# wget.download('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz', 'aclImdb_v1.tar.gz')
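
On any platform, the archive can also be fetched with Python's standard library, without installing `wget`. The following is a minimal sketch; the helper name `download_if_missing` is hypothetical and not part of the lab.

```python
import os
import urllib.request

def download_if_missing(url, filename):
    """Download `url` to `filename` unless the file already exists (hypothetical helper)."""
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
    return filename

# Usage (downloads about 80 MB on the first call):
# download_if_missing('https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz',
#                     'aclImdb_v1.tar.gz')
```

Skipping the download when the file is already present makes the cell safe to re-run.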

The command `tar` creates archives and extracts files from them. In the call below, `x` extracts, `z` decompresses with gzip, and `f` names the archive file.

In [2]:
# Unzip
!tar xzf aclImdb_v1.tar.gz
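
Where the `tar` command is not available (for example on some Windows setups), the standard-library `tarfile` module can do the same extraction. This is a minimal sketch; `extract_archive` is a hypothetical helper name.

```python
import tarfile

def extract_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive into `dest_dir` (hypothetical helper)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)

# Usage, once the archive has been downloaded:
# extract_archive("aclImdb_v1.tar.gz")
```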

### Read data

In [3]:
import os
from tqdm import tqdm

cf. 
  
> [`tqdm`](https://pypi.org/project/tqdm/) makes your loops show a smart progress meter. Just wrap any iterable with *tqdm(iterable)*, and you're done!

> `os.listdir(path)` returns a list containing the names of the entries in the directory given by path.

In [4]:
# Set path strings
train_path = "aclImdb/train/"
test_path = "aclImdb/test/"

def read_reviews(directory):
    # input: path of a directory containing one review per file
    # output: list of review strings
    reviews = []
    for filename in tqdm(os.listdir(directory)):
        # Use one explicit encoding for every file (the dataset is assumed to be UTF-8)
        # and close each file after reading
        with open(os.path.join(directory, filename), 'rt', encoding='utf-8') as f:
            reviews.append(f.read())
    return reviews

# Positive reviews first, then negative reviews
train_corpus = read_reviews(train_path + "pos") + read_reviews(train_path + "neg")
test_corpus = read_reviews(test_path + "pos") + read_reviews(test_path + "neg")

100%|██████████| 12500/12500 [00:00<00:00, 30264.85it/s]
100%|██████████| 12500/12500 [00:00<00:00, 28032.67it/s]
100%|██████████| 12500/12500 [00:00<00:00, 29507.05it/s]
100%|██████████| 12500/12500 [00:00<00:00, 30411.86it/s]


In [5]:
print(len(train_corpus), len(test_corpus))

25000 25000
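
`os.listdir` returns entries in arbitrary order, so an index such as `train_corpus[1]` may point to a different review on another machine. If a reproducible order matters, one option is to sort the file names before reading. A minimal sketch, with `list_review_files` as a hypothetical helper:

```python
import os

def list_review_files(directory):
    """Return the paths of the .txt files in `directory`, sorted by name (hypothetical helper)."""
    names = sorted(name for name in os.listdir(directory) if name.endswith(".txt"))
    return [os.path.join(directory, name) for name in names]
```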


In [6]:
# Reducing corpus size for faster processing
train_corpus = train_corpus[:500]
test_corpus = test_corpus[:500]
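
Because the positive reviews are read first, the first 500 entries of each corpus are all positive. That is fine for demonstrating pre-processing, but a subset meant to contain both classes could be drawn at random instead. A minimal sketch with a fixed seed; `subsample` and the toy corpus are hypothetical:

```python
import random

def subsample(corpus, k, seed=0):
    """Return `k` items drawn without replacement, reproducibly (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(corpus, k)

# Toy stand-in for the corpus: 5 "positive" then 5 "negative" reviews
toy_corpus = [f"pos review {i}" for i in range(5)] + [f"neg review {i}" for i in range(5)]
print(subsample(toy_corpus, 4))
```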

In [7]:
train_corpus[1]

'In Rosenstrasse, Margarethe von Trotta blends two stories to create a vibrant tapestry of love and courage. The film depicts a family drama of estrangement between a mother and her daughter, and the story of German women who staged a protest on Rosenstrasse to free their Jewish husbands from certain extermination. In addition to the dramatization of historical events, the focus of the film is on the saving of a child from the Holocaust by a German and the result of the child\'s experience of losing her mother. While Ms. von Trotta shows that the courage of a small number of Germans made a difference, she does not use it to excuse German society. Indeed, she shows how in the midst of torture and extermination, the wealthy artists and intellectuals of German high society went on about their lives and parties, oblivious to the suffering.<br /><br />Rosenstrasse opens in New York as a Jewish widow Ruth Weinstein (Jutta Lampe) decides to sit Shiva, a seven-day period of mourning that takes

In [8]:
train_corpus[3]

'While many unfortunately passed on, the ballroom scene is still very much alive and carrying on their legacy. Some are still very much alive and quite well, Octavia is more radiant and beautiful than ever, Willi Ninja is very accomplished and gives a great deal of support to the gay community as a whole, Pepper Labeija just passed on last year of natural cause, may she rest in peace. After Anji\'s passing Carmen became the mother of the house of Xtravaganza (she was in the beach scene) and she is looking more and more lovely as well. Some balls have categories dedicated to those who have passed, may they all rest in peace. There is currently another project underway known as "How Do I Look?", you can check out the website at www.howdoilooknyc.org.'

---
### Pre-processing
#### Remove white space and punctuation

cf.
> A regular expression is a sequence of characters that defines a search pattern for strings. The functions in the [`re`](https://docs.python.org/3/library/re.html) module let you test whether a string matches a pattern, and find or replace the parts that match.

In [9]:
import re

def remove_space_punctuation(data):
    # input: list of raw review strings
    # output: list of strings with HTML tags and punctuation replaced by spaces,
    #         and runs of spaces collapsed into one

    result = [re.sub(r'<.*?>', ' ', s) for s in tqdm(data)]     # HTML tags
    result = [re.sub(r'[^\w\s]', ' ', s) for s in tqdm(result)] # punctuation
    result = [re.sub(r' +', ' ', s) for s in tqdm(result)]      # repeated spaces
    return result

train = remove_space_punctuation(train_corpus)
test = remove_space_punctuation(test_corpus)

100%|██████████| 500/500 [00:00<00:00, 80062.30it/s]
100%|██████████| 500/500 [00:00<00:00, 20011.95it/s]
100%|██████████| 500/500 [00:00<00:00, 13916.26it/s]
100%|██████████| 500/500 [00:00<00:00, 134856.41it/s]
100%|██████████| 500/500 [00:00<00:00, 29922.55it/s]
100%|██████████| 500/500 [00:00<00:00, 13622.65it/s]
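
To see what each of the three substitutions does, here is a minimal sketch on a made-up string (the sample text is not from the dataset):

```python
import re

text = "Great film!<br /><br />Loved   it, really."

no_tags = re.sub(r'<.*?>', ' ', text)         # non-greedy match of each HTML tag
no_punct = re.sub(r'[^\w\s]', ' ', no_tags)   # anything that is neither a word character nor whitespace
collapsed = re.sub(r' +', ' ', no_punct)      # runs of spaces become a single space

print(repr(no_tags))
print(repr(no_punct))
print(repr(collapsed))
```

Note that the final period turns into a trailing space, which is why some cleaned reviews end with one.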


In [10]:
train[1]

'In Rosenstrasse Margarethe von Trotta blends two stories to create a vibrant tapestry of love and courage The film depicts a family drama of estrangement between a mother and her daughter and the story of German women who staged a protest on Rosenstrasse to free their Jewish husbands from certain extermination In addition to the dramatization of historical events the focus of the film is on the saving of a child from the Holocaust by a German and the result of the child s experience of losing her mother While Ms von Trotta shows that the courage of a small number of Germans made a difference she does not use it to excuse German society Indeed she shows how in the midst of torture and extermination the wealthy artists and intellectuals of German high society went on about their lives and parties oblivious to the suffering Rosenstrasse opens in New York as a Jewish widow Ruth Weinstein Jutta Lampe decides to sit Shiva a seven day period of mourning that takes place following a funeral i

In [11]:
train[3]

'While many unfortunately passed on the ballroom scene is still very much alive and carrying on their legacy Some are still very much alive and quite well Octavia is more radiant and beautiful than ever Willi Ninja is very accomplished and gives a great deal of support to the gay community as a whole Pepper Labeija just passed on last year of natural cause may she rest in peace After Anji s passing Carmen became the mother of the house of Xtravaganza she was in the beach scene and she is looking more and more lovely as well Some balls have categories dedicated to those who have passed may they all rest in peace There is currently another project underway known as How Do I Look you can check out the website at www howdoilooknyc org '
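
The cleaned text above shows a side effect of replacing every punctuation mark: "child's" becomes "child s" and "Anji's" becomes "Anji s", leaving a stray "s". One possible variation, not what this lab does, is to keep apostrophes that sit inside a word. A minimal sketch with a hypothetical function name:

```python
import re

def remove_punctuation_keep_apostrophes(text):
    """Replace punctuation with spaces but keep apostrophes inside words (hypothetical variant)."""
    text = re.sub(r"[^\w\s']", ' ', text)          # drop punctuation except apostrophes
    text = re.sub(r"(?<!\w)'|'(?!\w)", ' ', text)  # drop apostrophes that lack a word character on either side
    return re.sub(r' +', ' ', text)

print(remove_punctuation_keep_apostrophes("the child's experience, 'quoted' words."))
```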

#### Lowercasing, tokenization and lemmatization 

*Tokenization* 
The task of splitting the input text into pieces called *tokens*. Tokens are the basic units of NLP: sequences of characters grouped together and treated as a single unit. They can be words, subwords, or individual characters.
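
The three granularities can be illustrated with plain Python. This is only a sketch: the fixed-size chunks are a crude stand-in for subwords, since real subword tokenizers (e.g. BPE) learn their units from data.

```python
text = "unbelievable movie"

word_tokens = text.split()                 # whitespace tokenization
char_tokens = list(text.replace(" ", ""))  # one token per character

# Crude stand-in for subwords: split each word into chunks of 4 characters
chunk = 4
subword_tokens = [w[i:i + chunk] for w in word_tokens for i in range(0, len(w), chunk)]

print(word_tokens)
print(subword_tokens)
print(char_tokens)
```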

**Token Normalization**

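*Stemming*
Reduces a word to a stem by stripping affixes with heuristic rules. The result need not be a dictionary word; for example, the Porter stemmer maps "studies" to "studi".

*Lemmatization*
Maps a word to its dictionary form (*lemma*) using vocabulary and morphological analysis; for example, "is" becomes "be" and "women" becomes "woman".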
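
The difference can be sketched with two toy functions. Both are hypothetical illustrations: `toy_stem` is a handful of suffix rules (not the Porter algorithm), and `toy_lemmatize` replaces real vocabulary knowledge with a tiny hand-written lookup table.

```python
def toy_stem(word):
    """Strip a few common suffixes; a toy rule-based stemmer, not the Porter algorithm."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Only strip when at least 3 characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# A lemmatizer needs vocabulary knowledge; this small table stands in for it
TOY_LEMMAS = {"was": "be", "is": "be", "studies": "study", "better": "good", "running": "run"}

def toy_lemmatize(word):
    """Look the word up in the toy table, falling back to the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "running", "was", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer produces non-words such as "runn", while the lookup returns valid dictionary forms but only for words it knows.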


cf.

> [spaCy](https://spacy.io): for app developers

> [NLTK](https://www.nltk.org): for researchers and scholars

Install `spacy` & load model

In [12]:
#!conda install -c conda-forge spacy
#!pip install spacy
#!python -m spacy download en_core_web_sm

import spacy

# load model for en
nlp = spacy.load("en_core_web_sm") 

Example

In [13]:
# Example
doc = nlp('1. 09/11/2020 - This is a test string. You\'re good')
print(doc, type(doc))

1. 09/11/2020 - This is a test string. You're good <class 'spacy.tokens.doc.Doc'>


In [14]:
for token in doc:
    print([token, token.text, token.lemma_, token.pos_, token.is_stop, type(token)])

[1, '1', '1', 'X', False, <class 'spacy.tokens.token.Token'>]
[., '.', '.', 'PUNCT', False, <class 'spacy.tokens.token.Token'>]
[09/11/2020, '09/11/2020', '09/11/2020', 'NUM', False, <class 'spacy.tokens.token.Token'>]
[-, '-', '-', 'PUNCT', False, <class 'spacy.tokens.token.Token'>]
[This, 'This', 'this', 'DET', True, <class 'spacy.tokens.token.Token'>]
[is, 'is', 'be', 'AUX', True, <class 'spacy.tokens.token.Token'>]
[a, 'a', 'a', 'DET', True, <class 'spacy.tokens.token.Token'>]
[test, 'test', 'test', 'NOUN', False, <class 'spacy.tokens.token.Token'>]
[string, 'string', 'string', 'NOUN', False, <class 'spacy.tokens.token.Token'>]
[., '.', '.', 'PUNCT', False, <class 'spacy.tokens.token.Token'>]
[You, 'You', '-PRON-', 'PRON', True, <class 'spacy.tokens.token.Token'>]
['re, "'re", 'be', 'AUX', True, <class 'spacy.tokens.token.Token'>]
[good, 'good', 'good', 'ADJ', False, <class 'spacy.tokens.token.Token'>]


In [None]:
#nlp.Defaults.stop_words
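
The filtering that `token.is_stop` enables can be sketched without spaCy. The stop list below is a tiny made-up sample (spaCy's English list is much longer), and `filter_tokens` is a hypothetical helper.

```python
import string

# Tiny illustrative stop list, not spaCy's
TOY_STOP_WORDS = {"this", "is", "a", "you", "the", "of"}

def filter_tokens(tokens, stop_words=TOY_STOP_WORDS):
    """Lowercase tokens and drop stop words and punctuation characters."""
    kept = []
    for token in tokens:
        lowered = token.lower()
        # `token in string.punctuation` is a substring test on the punctuation string
        if lowered in stop_words or token in string.punctuation:
            continue
        kept.append(lowered)
    return kept

print(filter_tokens(["This", "is", "a", "test", "string", ".", "You", "good"]))
```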

Implementation

In [15]:
import string

def tokenize(data):
    # input: list of cleaned review strings (output of remove_space_punctuation)
    # output: list of lists of lower-cased lemmas, with stop words and punctuation removed

    tokenized_data = []
    for review in tqdm(data):
        result = nlp(review) # tokenized document
        tokenized_data.append([token.lemma_.lower() for token in result
                               if not token.is_stop
                               and token.text not in string.punctuation])
    return tokenized_data

train_tokenized = tokenize(train)
test_tokenized = tokenize(test)

100%|██████████| 500/500 [00:23<00:00, 21.00it/s]
100%|██████████| 500/500 [00:21<00:00, 23.19it/s]


In [16]:
print(train_tokenized[1])

['rosenstrasse', 'margarethe', 'von', 'trotta', 'blend', 'story', 'create', 'vibrant', 'tapestry', 'love', 'courage', 'film', 'depict', 'family', 'drama', 'estrangement', 'mother', 'daughter', 'story', 'german', 'woman', 'stage', 'protest', 'rosenstrasse', 'free', 'jewish', 'husband', 'certain', 'extermination', 'addition', 'dramatization', 'historical', 'event', 'focus', 'film', 'saving', 'child', 'holocaust', 'german', 'result', 'child', 's', 'experience', 'lose', 'mother', 'ms', 'von', 'trotta', 'show', 'courage', 'small', 'number', 'germans', 'difference', 'use', 'excuse', 'german', 'society', 'show', 'midst', 'torture', 'extermination', 'wealthy', 'artist', 'intellectual', 'german', 'high', 'society', 'go', 'life', 'party', 'oblivious', 'suffer', 'rosenstrasse', 'open', 'new', 'york', 'jewish', 'widow', 'ruth', 'weinstein', 'jutta', 'lampe', 'decide', 'sit', 'shiva', 'seven', 'day', 'period', 'mourning', 'take', 'place', 'follow', 'funeral', 'jewish', 'family', 'member', 'devote',

In [17]:
print(train_tokenized[3])

['unfortunately', 'pass', 'ballroom', 'scene', 'alive', 'carry', 'legacy', 'alive', 'octavia', 'radiant', 'beautiful', 'willi', 'ninja', 'accomplished', 'give', 'great', 'deal', 'support', 'gay', 'community', 'pepper', 'labeija', 'pass', 'year', 'natural', 'cause', 'rest', 'peace', 'anji', 's', 'pass', 'carmen', 'mother', 'house', 'xtravaganza', 'beach', 'scene', 'look', 'lovely', 'ball', 'category', 'dedicate', 'pass', 'rest', 'peace', 'currently', 'project', 'underway', 'know', 'look', 'check', 'website', 'www', 'howdoilooknyc', 'org']


---
### Explore

Find most common words and build vocabulary.

cf.
> [`collections`](https://docs.python.org/3/library/collections.html) provides specialized container datatypes like `OrderedDict` & `Counter`

> `Counter` is a `dict` subclass for counting hashable items: elements are stored as dictionary keys and their counts are stored as dictionary values

Example

In [18]:
from collections import Counter

animals = ['dogs', 'cats', 'fish', 'monkey', 'cats', 'elephant', 'dogs', 'cats', 'lion', 'monkey']
c = Counter(animals)
c

Counter({'cats': 3,
         'dogs': 2,
         'elephant': 1,
         'fish': 1,
         'lion': 1,
         'monkey': 2})

In [19]:
c.most_common(3)

[('cats', 3), ('dogs', 2), ('monkey', 2)]

In [20]:
vocab, count = zip(*c.most_common(5))
print('vocab: ', vocab, '\ncount: ', count)

vocab:  ('cats', 'dogs', 'monkey', 'fish', 'elephant') 
count:  (3, 2, 2, 1, 1)


#### Implementation

In [21]:
def build_vocab(tokenized_data, max_vocab=10000):
    # input: list of lists of tokens
    # output: token2id: dict, id2token: list

    PAD_IDX = 0
    UNK_IDX = 1

    all_tokens = [token for tokens in tokenized_data for token in tokens]

    token_counter = Counter(all_tokens)
    vocab, count = zip(*token_counter.most_common(max_vocab))
    token2id = dict(zip(vocab, range(2, 2 + len(vocab))))
    token2id["<PAD>"] = PAD_IDX
    token2id["<UNK>"] = UNK_IDX
    id2token = ["<PAD>", "<UNK>"] + list(vocab)

    return token2id, id2token

token2id, id2token = build_vocab(train_tokenized)

In [None]:
#print(token2id)
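
A variant of `build_vocab` can count one document at a time with `Counter.update` instead of flattening the whole corpus first. It also avoids unpacking `zip(*...)`, which raises a `ValueError` when the corpus is empty. This is a sketch under those assumptions; `build_vocab_incremental` and the toy data are hypothetical.

```python
from collections import Counter

def build_vocab_incremental(tokenized_data, max_vocab=10000):
    """Variant of build_vocab that counts one document at a time (hypothetical sketch)."""
    counter = Counter()
    for tokens in tokenized_data:
        counter.update(tokens)  # add this document's counts
    id2token = ["<PAD>", "<UNK>"] + [token for token, _ in counter.most_common(max_vocab)]
    token2id = {token: idx for idx, token in enumerate(id2token)}
    return token2id, id2token

toy_docs = [["good", "film", "good"], ["bad", "film"]]
toy_token2id, toy_id2token = build_vocab_incremental(toy_docs)

# The two mappings should be inverses of each other
assert all(toy_id2token[idx] == token for token, idx in toy_token2id.items())
print(toy_token2id)
```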

#### Transform tokens into indices according to token2id

In [22]:
def transform(tokenized_data, token2id):
    # input: list of lists of tokens, token2id: dict mapping each token to its index
    # output: list of lists of indices; tokens missing from token2id map to <UNK>

    unk_idx = token2id["<UNK>"]
    text_indices = []
    for tokens in tqdm(tokenized_data):
        indices = [token2id.get(token, unk_idx) for token in tokens]
        text_indices.append(indices)
    return text_indices

train_transformed = transform(train_tokenized, token2id)
test_transformed = transform(test_tokenized, token2id)

100%|██████████| 500/500 [00:00<00:00, 30821.89it/s]
100%|██████████| 500/500 [00:00<00:00, 31298.44it/s]


In [23]:
print(train_transformed[1])
print([id2token[i] for i in train_transformed[1]])

[1406, 3632, 1583, 1854, 2218, 8, 122, 3633, 5352, 13, 1855, 3, 716, 45, 224, 5353, 225, 297, 8, 902, 63, 531, 3634, 1406, 368, 1856, 351, 599, 3635, 997, 5354, 903, 310, 382, 3, 5355, 127, 3636, 902, 322, 127, 2, 298, 118, 225, 1407, 1583, 1854, 42, 1855, 195, 237, 3637, 1104, 154, 2756, 902, 417, 42, 2219, 1408, 3635, 3638, 1584, 2220, 902, 151, 417, 27, 25, 631, 5356, 632, 1406, 383, 58, 532, 1856, 1585, 1857, 5357, 5358, 5359, 206, 418, 5360, 1239, 49, 533, 5361, 43, 95, 155, 2221, 1856, 45, 334, 5362, 384, 196, 2222, 2223, 297, 2757, 1858, 5363, 2758, 674, 1240, 265, 998, 3639, 5364, 1241, 5365, 633, 5366, 2757, 403, 225, 904, 206, 155, 2759, 905, 3640, 2760, 1857, 5367, 2760, 717, 2757, 403, 266, 63, 226, 1409, 43, 1857, 127, 2, 225, 5368, 227, 2761, 2224, 33, 1409, 352, 675, 225, 2, 216, 1105, 43, 5369, 33, 1409, 5370, 5371, 5372, 23, 47, 1242, 5373, 1586, 2762, 599, 445, 3636, 5374, 634, 1409, 55, 8, 51, 3641, 23, 47, 63, 5375, 3642, 906, 351, 1856, 5376, 3643, 2763, 5377, 1410

In [24]:
# Example
tokens = ["oh", "lot", "skhjdaasdsa"]
print([token2id.get(token, 1) for token in tokens])

[914, 50, 1]
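
`build_vocab` reserves index 0 for `<PAD>`, but the lab does not use it yet. A common use, assumed here, is to pad the index lists to a common length so that several reviews can be stacked into one batch. A minimal sketch; `pad_sequences` is a hypothetical helper:

```python
def pad_sequences(sequences, max_len, pad_idx=0):
    """Truncate or right-pad each list of indices to exactly `max_len` (hypothetical helper)."""
    padded = []
    for seq in sequences:
        seq = seq[:max_len]                                    # truncate long sequences
        padded.append(seq + [pad_idx] * (max_len - len(seq)))  # pad short ones on the right
    return padded

print(pad_sequences([[5, 8, 2], [7], [4, 4, 4, 4, 4]], max_len=4))
```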


---
## References
DS-GA 1011 NLP with Representation Learning Fall 2019