### P1: Get Data

* We will work with (**a part of**) English Wikipedia. 
* English Wikipedia has **more than 55M** sentences! 
* For today's class, we will work with 1M sentences.
* If you are interested in how to extract sentences from English Wikipedia, take a look at this [code](https://github.com/vineetm/tf-similar-sentences/blob/master/code/extract_sentences.py), which uses gensim's `WikiCorpus` to extract the data from the raw 15 GB dump.

Let us begin by extracting the zip file with 1M sentences.

Run the following command in your terminal/shell:
```bash
cd ../data
unzip wiki.1M.txt.zip
cd -
```


### P2: Explore the data
The first thing you should **always** do when you start working with *any* data is to explore it!

* Let us count the lines in the file:
```bash
wc -l ../data/wiki.1M.txt
```

* Let us now look at the first few lines of the file:
```bash
head ../data/wiki.1M.txt
```
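
The same two checks can also be done from Python, which is handy if you are working inside the notebook. This is a minimal sketch; the helper names `count_lines` and `peek` are made up for illustration and are not part of the original notebook.

```python
from itertools import islice


def count_lines(path):
    """Count lines by streaming the file, so it is never fully loaded into memory."""
    with open(path, encoding='utf-8') as fr:
        return sum(1 for _ in fr)


def peek(path, n=10):
    """Return the first n lines without their trailing newline, similar to `head`."""
    with open(path, encoding='utf-8') as fr:
        return [line.rstrip('\n') for line in islice(fr, n)]
```

For example, `count_lines('../data/wiki.1M.txt')` mirrors `wc -l`, and `peek('../data/wiki.1M.txt')` mirrors `head`.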

### P3: Tokenization
Tokenization is the process of splitting text into its constituent parts (tokens, which are generally words). For example, *Delhi's* is split into two tokens: *Delhi* and *'s*.

**QS**: Why do we care if **Delhi's** is split into two tokens?

**Note**: We demonstrate how to tokenize using [nltk](https://www.nltk.org/) here. You can also use alternatives such as [spaCy](https://spacy.io/).

If the code below does not work, install the required nltk data first:
```bash
python -m nltk.downloader 'punkt'
```

In [1]:
#Let us define some handles for input, output and vocab files
input_file = '../data/wiki.1M.txt'
output_file = '../data/wiki.1M.txt.tokenized'
vocab_file = '../data/vocab.txt'

In [16]:
from nltk.tokenize import word_tokenize
from collections import OrderedDict
import time

In [3]:
word_tokenize("Delhi's")

['Delhi', "'s"]
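
One way to see the effect of tokenization is to count words with and without it. The sketch below uses a deliberately simplified regular expression as a stand-in tokenizer so that it runs without any downloads; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and the pattern is not how nltk's `word_tokenize` works internally.

```python
import re
from collections import Counter

# Simplified stand-in tokenizer (an assumption for illustration):
# split off a trailing 's, keep runs of word characters together,
# and treat every other non-space character as its own token.
TOKEN_PATTERN = re.compile(r"'s\b|\w+|[^\w\s]")


def simple_tokenize(text):
    return TOKEN_PATTERN.findall(text)


text = "Delhi is big. Delhi's metro is fast."

# Whitespace splitting keeps "Delhi" and "Delhi's" as two unrelated entries
naive_counts = Counter(text.split())

# After tokenization, both occurrences contribute to the count of "Delhi"
token_counts = Counter(simple_tokenize(text))

print(naive_counts['Delhi'], token_counts['Delhi'])
```

With whitespace splitting alone, *Delhi* and *Delhi's* end up as separate vocabulary entries, so the counts for *Delhi* are fragmented.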

### P4: Tokenize all sentences in a file

Now that we know how to tokenize a single sentence, let us create a **tokenized** version of `wiki.1M.txt`. 

More concretely:
1. Lowercase the sentence
2. Tokenize the sentence
3. Join the tokens with a space ' ' and write the **tokenized** sentence to the output file
4. Print progress, say every 100K sentences

In [4]:
#TODO: Fill in the method below:
def convert_to_tokens(input_file, output_file):
    with open(input_file) as fr, open(output_file, 'w') as fw:
        for index, sentence in enumerate(fr):
            words = word_tokenize(sentence.strip().lower())
            fw.write(f"{' '.join(words)}\n")
            if index % 100000 == 0:
                print(index)

In [5]:
# This should take about 2 minutes. Does that seem long? Think again: you processed 1M sentences in under 2 minutes!
start = time.time()
convert_to_tokens(input_file, output_file)
print(f'Time Taken: {time.time()-start}s')

0
100000
200000
300000
400000
500000
600000
700000
800000
900000
Time Taken: 109.22984576225281s


### P5: Peek at the new file
Now, let us take a look at our newly created file:
1. Check that it has the same number of lines as the raw file:
```bash
wc -l ../data/wiki.1M.txt.tokenized
```

2. Look at a few lines and compare them with the raw file:
```bash
head ../data/wiki.1M.txt
head ../data/wiki.1M.txt.tokenized
```
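
Comparing two separate `head` outputs by eye can be tedious. As an optional alternative, the sketch below pairs each raw line with its tokenized counterpart; `paired_head` is a hypothetical helper, and it assumes both files have their lines in the same order.

```python
from itertools import islice


def paired_head(raw_path, tokenized_path, n=5):
    """Return the first n (raw, tokenized) line pairs from the two files."""
    with open(raw_path, encoding='utf-8') as fr, \
            open(tokenized_path, encoding='utf-8') as ft:
        # zip walks both files in lockstep; islice stops after n pairs
        return [(raw.rstrip('\n'), tok.rstrip('\n'))
                for raw, tok in islice(zip(fr, ft), n)]
```

Calling `paired_head('../data/wiki.1M.txt', '../data/wiki.1M.txt.tokenized')` and printing each pair shows the raw and tokenized versions next to each other.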

### P6: Building Vocabulary
* The vocabulary is the set of words that appear in your corpus.
* To make learning tractable, we **restrict our vocabulary**.
* You can restrict it using any measure, but since we are dealing with data-driven methods, it is intuitive to restrict using **word counts**.
* All words that appear with at least a **minimum frequency** are retained, and all remaining words are mapped to a special unknown symbol, **UNK**.
* Another popular method is to keep only the top-$K$ most frequent words.
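
The two restriction strategies can be sketched on a toy corpus as follows. The sentences and the helper names (`restrict_by_min_freq`, `restrict_by_top_k`, `replace_oov`) are made up for illustration and are not part of the notebook's own code.

```python
from collections import Counter

# Toy corpus, invented for illustration only
sentences = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat saw a dog",
]
counts = Counter(word for s in sentences for word in s.split())


def restrict_by_min_freq(counts, min_freq):
    """Keep every word that occurs at least min_freq times."""
    return {word for word, count in counts.items() if count >= min_freq}


def restrict_by_top_k(counts, k):
    """Keep only the k most frequent words."""
    return {word for word, _ in counts.most_common(k)}


def replace_oov(sentence, vocab_words, unk='UNK'):
    """Replace every word outside the vocabulary with the unknown symbol."""
    return ' '.join(w if w in vocab_words else unk for w in sentence.split())


vocab_words = restrict_by_min_freq(counts, min_freq=2)
print(replace_oov("the cat sat on the mat", vocab_words))
```

Here *mat* occurs only once, so with `min_freq=2` it falls outside the vocabulary and is replaced by `UNK`. A minimum frequency gives a vocabulary whose size depends on the data, while top-$K$ fixes the size in advance.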

Now, we will use our freshly minted `wiki.1M.txt.tokenized` to count the words in the corpus.

In [6]:
#TODO: Count words and return words in a sorted order
def count_words(sentences_file):
    counter = dict()
    # Use a context manager so that the file is closed once we are done
    with open(sentences_file) as fr:
        for sentence in fr:
            for word in sentence.strip().split():
                counter[word] = counter.get(word, 0) + 1
    # Most frequent words first
    return sorted(counter.items(), key=lambda pair: pair[1], reverse=True)

In [7]:
word_counts = count_words(output_file)
word_counts[:10]

[('the', 929180),
 (',', 822110),
 ('.', 580671),
 ('of', 477522),
 ('and', 376517),
 ('in', 337457),
 ('to', 280731),
 ('a', 259190),
 (')', 137015),
 ('(', 135461)]

### P7: Assign word index

1. Next, we want to create a map that returns an **integer** for each word
2. As we have restricted our vocabulary, we want to return index 0 for the words we removed
3. These words are also known as Out of Vocabulary (OOV) words
4. Assign a unique integer to each word in the vocabulary, starting from 1
5. For consistency, we will assign lower indexes to higher-frequency words

In [17]:
#TODO: Create a word->integer mapping
# Discard words which have count less than min freq
# Assign index 0 to `UNK`
def build_vocab(word_counts, min_freq):
    vocab = OrderedDict()
    vocab['UNK'] = 0

    # word_counts is sorted by decreasing frequency, so we can stop at the
    # first word that falls below min_freq
    for word, freq in word_counts:
        if freq < min_freq:
            break
        # 'UNK' already occupies index 0, so len(vocab) is the next free index: 1, 2, ...
        vocab[word] = len(vocab)
    return vocab

In [9]:
vocab = build_vocab(word_counts, 10)
print(f'V[UNK]: {vocab["UNK"]} V["learning"]: {vocab["learning"]}')

V[UNK]: 0 V["learning"]: 1551


In [10]:
#Return a list of word indexes for all words in sentence
def assign_word_indexes(sentence, vocab):
    # Words missing from the vocabulary map to the index of 'UNK'
    return [vocab.get(word, vocab['UNK']) for word in sentence.split()]

In [11]:
word_indexes = assign_word_indexes('i am learning to build vocab', vocab)

In [12]:
print(word_indexes)

[116, 1789, 1551, 7, 1784, 0]


In [13]:
len(vocab)

46014

### P8: Write Vocabulary file
* Write each word in the vocabulary on its own line

In [14]:
def write_vocab_file(vocab_file, vocab):
    with open(vocab_file, 'w') as fw:
        for word in vocab:
            fw.write(f'{word}\n')

In [15]:
write_vocab_file(vocab_file, vocab)

How many lines are in our vocab file?
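
To check your answer from Python, you can read the file back into a dictionary. This is a minimal sketch: `read_vocab_file` is a hypothetical helper, and it assumes that the word on line $i$ (counting from 0, with `UNK` first) has index $i$, which only holds if the indexes were assigned consecutively starting at 0.

```python
def read_vocab_file(vocab_file):
    """Rebuild the word -> index map, assuming line i (0-based) holds word i."""
    with open(vocab_file, encoding='utf-8') as fr:
        return {line.rstrip('\n'): index for index, line in enumerate(fr)}
```

The length of the returned dictionary is the number of lines in the file, so `len(read_vocab_file(vocab_file))` should match `len(vocab)`.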