<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/natural-language-processing-with-pytorch/05-word-embeddings/munging_frankenstein_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Munging Frankenstein Dataset

We will build a text dataset from a digitized version of Mary Shelley’s
novel Frankenstein, available via [Project Gutenberg](https://www.gutenberg.org/files/84/84-h/84-h.htm). This section walks through
the preprocessing; building a PyTorch `Dataset` class for this text dataset; and finally
splitting the dataset into training, validation, and test sets.

Starting with the raw text file that Project Gutenberg distributes, the preprocessing is
minimal: we use [NLTK’s Punkt tokenizer](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) to split the text into separate sentences,
then convert each sentence to lowercase, separate the punctuation marks `.`, `,`, `!`, and `?`
into their own tokens, and replace every other non-letter character with a space. This preprocessing lets us later split the strings on whitespace to
retrieve a list of tokens.
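
As a rough illustration of this cleaning step, the sketch below applies the same two regular expressions that the `preprocess_text` function in this notebook uses. The sample sentence is only an illustrative input, and `str.split()` is used here (an assumption of this sketch) so that a trailing space does not produce an empty token.

```python
import re

def preprocess_text(text):
    # Lowercase the whole sentence
    text = text.lower()
    # Surround . , ! ? with spaces so each becomes its own token
    text = re.sub(r"([.,!?])", r" \1 ", text)
    # Replace every run of other non-letter characters with a single space
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
    return text

# Illustrative input, not read from the dataset file
sample = "I pitied Frankenstein; my pity amounted to horror!"
cleaned = preprocess_text(sample)

# Splitting on whitespace yields the token list
tokens = cleaned.split()
print(tokens)
```

The semicolon disappears, while the exclamation mark survives as a separate token.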

The next step is to enumerate the dataset as a sequence of windows so that the
CBOW model can be optimized. To do this, we iterate over the list of tokens in each sentence and group them into windows of a specified window size.

<img src='https://github.com/rahiakela/natural-language-processing-research-and-practice/blob/main/natural-language-processing-with-pytorch/05-word-embeddings/images/slide-window.png?raw=1' width='800'/>

The CBOW task: predict a word using the left and the right context. In the figure, the
context windows are of length 2 on either side (the code below uses `window_size=5`).

A sliding window over the text produces
many supervised examples, each with its target word in the middle. Windows near the start or end of a sentence, which have fewer than 2 context words on one side, are padded with a mask token.

For example, for window #3, given
the contexts `i pitied` and `my pity`, the CBOW classifier is set up to predict `frankenstein`.
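
The windowing can be sketched in plain Python as follows. This is a minimal stand-in for the `nltk.ngrams` call used later in this notebook, not the notebook's own code. The helper name `cbow_examples` and the token list are hypothetical, and the window size of 2 matches the figure rather than the notebook's configuration.

```python
MASK_TOKEN = "<MASK>"

def cbow_examples(tokens, window_size):
    """Return one (context, target) pair per token of a tokenized sentence."""
    # Pad both ends so that every real token can sit at the center of a window
    padded = [MASK_TOKEN] * window_size + tokens + [MASK_TOKEN] * window_size
    width = 2 * window_size + 1
    examples = []
    for start in range(len(padded) - width + 1):
        window = padded[start:start + width]
        target = window[window_size]
        # Context: every token except the center, with padding dropped
        context = [tok for i, tok in enumerate(window)
                   if i != window_size and tok != MASK_TOKEN]
        examples.append((" ".join(context), target))
    return examples

# Illustrative token list
tokens = "i pitied frankenstein my pity amounted to horror".split()
examples = cbow_examples(tokens, window_size=2)
for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

The third pair is the window from the example above: context `i pitied my pity`, target `frankenstein`. Each sentence yields exactly as many windows as it has tokens.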

The final step in constructing the dataset is to split the data into three sets: the training, validation, and test sets.
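
One way to make such a split is by row position, sketched below under the 70/15/15 proportions set in this notebook's `args`. The function name `assign_split` and the toy size `n = 100` are hypothetical, and the strict `<` comparisons are a choice of this sketch so that the split sizes come out exact.

```python
def assign_split(row_num, n, train_proportion=0.7, val_proportion=0.15):
    """Label a row by position: first rows train, next rows val, rest test."""
    train_end = n * train_proportion
    val_end = train_end + n * val_proportion
    if row_num < train_end:
        return "train"
    elif row_num < val_end:
        return "val"
    return "test"

n = 100  # toy dataset size, for illustration only
labels = [assign_split(i, n) for i in range(n)]
counts = {name: labels.count(name) for name in ("train", "val", "test")}
print(counts)
```

Because rows are assigned in order rather than shuffled, the test set in this sketch comes entirely from the end of the data.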


##Setup

In [22]:
import collections
import numpy as np
import pandas as pd
import re
import string
import os

import nltk.data

from argparse import Namespace
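# tqdm.tqdm_notebook is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook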

In [2]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!wget https://github.com/rahiakela/natural-language-processing-research-and-practice/raw/main/natural-language-processing-with-pytorch/05-word-embeddings/dataset/frankenstein.txt

##Dataset preprocessing

In [4]:
args = Namespace(
    raw_dataset_txt="frankenstein.txt",
    window_size=5,
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="frankenstein_with_splits.csv",
    seed=1337
)

In [5]:
# Split the raw text book into sentences
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")

In [6]:
with open(args.raw_dataset_txt) as fp:
  book = fp.read()
sentences = tokenizer.tokenize(book)

In [7]:
print(len(sentences), " sentences")
print(f"Sample: {sentences[100]}")

3430  sentences
Sample: This letter will reach England by a merchantman now on
its homeward voyage from Archangel; more fortunate than I, who may not
see my native land, perhaps, for many years.


In [8]:
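# Clean sentences
def preprocess_text(text):
  # Lowercase every word
  text = " ".join(word.lower() for word in text.split(" "))
  # Surround . , ! ? with spaces so each becomes its own token
  text = re.sub(r"([.,!?])", r" \1 ", text)
  # Collapse every run of other non-letter characters (including newlines) into one space
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text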

In [9]:
cleaned_sentences = [preprocess_text(sentence) for sentence in sentences]

In [10]:
# Global vars
MASK_TOKEN = "<MASK>"

In [20]:
# Create sliding windows: pad each sentence with window_size MASK tokens on both sides,
# then take every n-gram of length 2 * window_size + 1
def flatten(outer_list):
  return [item for inner_list in outer_list for item in inner_list]

padding = [MASK_TOKEN] * args.window_size
windows = flatten([
    list(nltk.ngrams(padding + sentence.split(" ") + padding, args.window_size * 2 + 1))
    for sentence in tqdm_notebook(cleaned_sentences)
])

  0%|          | 0/3430 [00:00<?, ?it/s]

In [23]:
# Create cbow data
data = []
for window in tqdm_notebook(windows):
  # The target is the token at the center of the window
  target_token = window[args.window_size]
  # The context is every other token in the window, with padding dropped
  context = [token for i, token in enumerate(window)
             if token != MASK_TOKEN and i != args.window_size]

  data.append([" ".join(context), target_token])

# Convert to dataframe
cbow_data = pd.DataFrame(data, columns=["context", "target"])

  0%|          | 0/90808 [00:00<?, ?it/s]

In [24]:
cbow_data.head()

| | context | target |
|---|---|---|
| 0 | gutenberg s frankenstein , by | project |
| 1 | project s frankenstein , by mary | gutenberg |
| 2 | project gutenberg frankenstein , by mary wolls... | s |
| 3 | project gutenberg s , by mary wollstonecraft g... | frankenstein |
| 4 | project gutenberg s frankenstein by mary wolls... | , |


In [25]:
# Create split data
# Rows are assigned by position (no shuffling): the first train_proportion of rows
# go to train, the next val_proportion to val, and the remainder to test
n = len(cbow_data)

def get_split(row_num):
  if row_num <= n * args.train_proportion:
    return "train"
  elif row_num <= n * args.train_proportion + n * args.val_proportion:
    return "val"
  else:
    return "test"

cbow_data["split"] = cbow_data.apply(lambda row: get_split(row.name), axis=1)

In [26]:
cbow_data.head()

| | context | target | split |
|---|---|---|---|
| 0 | gutenberg s frankenstein , by | project | train |
| 1 | project s frankenstein , by mary | gutenberg | train |
| 2 | project gutenberg frankenstein , by mary wolls... | s | train |
| 3 | project gutenberg s , by mary wollstonecraft g... | frankenstein | train |
| 4 | project gutenberg s frankenstein by mary wolls... | , | train |


In [27]:
# Write split data to file
cbow_data.to_csv(args.output_munged_csv, index=False)