# Natural Language Processing
Classifying the sentiment of a text takes several processing steps. First, we convert the raw words into tokens: integer values that are simply indices into a list of the entire vocabulary. Next, we convert these integer tokens into embeddings: real-valued vectors whose mapping is trained along with the neural network, so that words with similar meanings map to similar embedding vectors. We then feed the embedding vectors to a Recurrent Neural Network (RNN), which can take sequences of arbitrary length as input and outputs a kind of summary of what it has seen. Finally, this output is squashed with a sigmoid function to a value between 0.0 and 1.0, where 0.0 means negative sentiment and 1.0 means positive sentiment. The whole pipeline lets us classify input text as having either negative or positive sentiment.

The flowchart of the algorithm is roughly:

<div class="imgcap">
<img src="images/natural_language.png" style="border:none;width:60%;">
</div>
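
To make the four stages concrete, here is a minimal sketch of a single forward pass in plain Python. Everything in it is an assumption made for illustration: the five-word vocabulary, the tiny dimensions, the random (untrained) weights, and the helper names `rand_matrix`, `matvec` and `predict`. It is not the model trained in this notebook, only the shape of the computation.

```python
import math
import random

random.seed(0)  # toy weights only; nothing here is trained

# Hypothetical five-word vocabulary; index 0 is reserved for unknown words.
vocab = {"<unk>": 0, "this": 1, "film": 2, "is": 3, "great": 4}
EMBED_DIM, HIDDEN_DIM = 3, 4

def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random numbers."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embedding = rand_matrix(len(vocab), EMBED_DIM)   # one vector per token index
W_xh = rand_matrix(HIDDEN_DIM, EMBED_DIM)        # input-to-hidden weights
W_hh = rand_matrix(HIDDEN_DIM, HIDDEN_DIM)       # hidden-to-hidden weights
w_out = [random.uniform(-0.5, 0.5) for _ in range(HIDDEN_DIM)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def predict(words):
    # 1. words -> integer tokens (unknown words fall back to index 0)
    tokens = [vocab.get(word, vocab["<unk>"]) for word in words]
    # 2. tokens -> embedding vectors
    vectors = [embedding[token] for token in tokens]
    # 3. recurrent update: the hidden state summarises the sequence so far
    h = [0.0] * HIDDEN_DIM
    for x in vectors:
        h = [math.tanh(a + b) for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
    # 4. the sigmoid squashes the final summary into the range (0, 1)
    logit = sum(w * value for w, value in zip(w_out, h))
    return 1.0 / (1.0 + math.exp(-logit))

score = predict("this film is great".split())
print(f"sentiment score: {score:.3f}")
```

Because the weights are random, the score itself is meaningless; training is what would push it towards 0.0 for negative reviews and 1.0 for positive ones.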

In [1]:
import torch
import spacy
import random
import re
import urllib
import pandas as pd
import six
import requests
import csv
from torchtext import data
from torchtext import datasets
from tqdm import tqdm

SEED = 1234

nlp = spacy.load('en')

### Raw Text

In [2]:
text = 'The quick fox jumped over a lazy dog.'
text

'The quick fox jumped over a lazy dog.'

### Tokenizer

In [3]:
MAX_CHARS = 20000

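def tokenizer(sentence):
    """Clean a raw string and split it into a list of spaCy token strings."""
    # Replace quotes, brackets and other punctuation (including "!") with spaces
    sentence = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(sentence))
    # Collapse runs of spaces into a single space
    sentence = re.sub(r"[ ]+", " ", sentence)
    # Collapse repeated punctuation; the "!" rule has no effect here
    # because every "!" was already replaced by a space above
    sentence = re.sub(r"\!+", "!", sentence)
    sentence = re.sub(r"\,+", ",", sentence)
    sentence = re.sub(r"\?+", "?", sentence)

    # Truncate very long inputs before handing them to spaCy
    if len(sentence) > MAX_CHARS:
        sentence = sentence[:MAX_CHARS]

    return [x.text for x in nlp.tokenizer(sentence) if x.text != " "]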

tokenizer(text)

['The', 'quick', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog', '.']

### Torchtext Data

In [4]:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

In [5]:
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:19<00:00, 4.24MB/s]


In [8]:
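# random.seed() returns None, so seed first and pass the resulting state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())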

In [9]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 12250
Number of validation examples: 5250
Number of testing examples: 25000


Next, we have to build a vocabulary. This is effectively a lookup table in which every unique word in the data set has a corresponding index (an integer).

<img src="images/one_hot.png" width="350">

We do this because our machine learning model cannot operate on strings, only on numbers. Each index is used to construct a one-hot vector for its word: a vector in which every element is 0 except for a single 1 at the word's index. Its dimensionality equals the total number of unique words in the vocabulary, commonly denoted by $V$.
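
As a small illustration of the idea, the sketch below builds one-hot vectors for a made-up four-word vocabulary. The `stoi` dictionary and the `one_hot` helper are assumptions for this example, not part of torchtext.

```python
def one_hot(index, vocab_size):
    """Return a list of vocab_size zeros with a single 1 at position index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

# Hypothetical toy vocabulary, so V = 4
stoi = {"this": 0, "film": 1, "is": 2, "great": 3}
V = len(stoi)

for word in ["film", "great"]:
    print(word, one_hot(stoi[word], V))
# film [0, 1, 0, 0]
# great [0, 0, 0, 1]
```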



The number of unique words in our training set is over 100,000, which means that our one-hot vectors would have over 100,000 dimensions! This would make training slow, and the vectors might not fit in GPU memory (if you're using a GPU).
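
A rough back-of-the-envelope calculation shows the scale of the problem. The figures below are assumptions chosen for illustration (a batch of 64 reviews, 200 tokens per review, dense `float32` one-hot vectors); only the vocabulary size of 100,000 comes from the text above.

```python
V = 100_000          # vocabulary size, the lower bound quoted above
BATCH_SIZE = 64      # assumed number of reviews per batch
SEQ_LEN = 200        # assumed number of tokens per review
BYTES_PER_FLOAT = 4  # float32

# Every token becomes a dense vector with V entries
one_hot_bytes = BATCH_SIZE * SEQ_LEN * V * BYTES_PER_FLOAT
# Alternative: keep just one 64-bit integer index per token
index_bytes = BATCH_SIZE * SEQ_LEN * 8

print(f"one-hot batch: {one_hot_bytes / 1e9:.2f} GB")  # 5.12 GB
print(f"index batch:   {index_bytes / 1e3:.1f} KB")    # 102.4 KB
```

Under these assumptions a single batch of dense one-hot vectors needs several gigabytes, while the same batch stored as integer indices needs about a hundred kilobytes.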

There are two ways to cut down our vocabulary effectively: we can either keep only the top $n$ most common words, or ignore words that appear fewer than $m$ times. We'll do the former, keeping only the top 25,000 words.

What do we do with words that appear in examples but have been cut from the vocabulary? We replace them with a special unknown token, `<unk>`. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
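
The sketch below imitates this behaviour with the standard library only. It is not how torchtext is implemented; the two training sentences, the `MAX_SIZE` of 7, and the helpers `replace_unknown` and `numericalize` are assumptions chosen so that "love" falls outside the vocabulary. The two special tokens at the front mirror the `<unk>` and `<pad>` entries that appear in the vocabulary printed further down.

```python
from collections import Counter

# Hypothetical tokenized training examples
train_examples = [
    "This film is great and I liked it".split(),
    "This film is great and I love it".split(),
]

MAX_SIZE = 7  # keep only the 7 most common words (25,000 in the text above)

counts = Counter(token for example in train_examples for token in example)
# Special tokens come first, followed by the most frequent words
itos = ["<unk>", "<pad>"] + [word for word, _ in counts.most_common(MAX_SIZE)]
stoi = {word: index for index, word in enumerate(itos)}

def replace_unknown(tokens):
    """Show the sentence with out-of-vocabulary words replaced by <unk>."""
    return [token if token in stoi else "<unk>" for token in tokens]

def numericalize(tokens):
    """Map tokens to indices; out-of-vocabulary words map to the <unk> index."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

sentence = "This film is great and I love it".split()
print(replace_unknown(sentence))
# ['This', 'film', 'is', 'great', 'and', 'I', '<unk>', 'it']
print(numericalize(sentence))
```

The vocabulary ends up with `MAX_SIZE + 2` entries, which is why a `max_size` of 25,000 yields 25,002 tokens.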



In [10]:
TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)

In [11]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2


In [12]:
print(vars(train_data.examples[0]))

{'text': ['I', 'really', 'hate', 'this', 'retarded', 'show', ',', 'it', 'SUCKS', '!', 'big', 'time', ',', 'and', 'personally', 'I', 'think', 'it', 'is', 'insulting', 'to', 'fairy', 'kind', '(', 'if', 'you', 'believe', 'in', 'fairies', 'that', 'is', ')', ';', 'I', 'mean', 'the', 'people', 'who', 'had', 'come', 'up', 'with', 'such', 'crap', "'", 'ought', 'to', 'have', 'their', 'heads', 'examine', 'huh', '?', 'and', 'also', 'there', 'is', 'a', 'LOT', 'of', 'craziness', '(', 'the', 'evil', 'school', 'teacher', ',', 'which', 'I', 'think', 'is', 'getting', 'really', 'old', ')', 'and', 'also', 'stupidity', '(', 'the', 'boy', "'s", 'parents', 'and', 'fairy', 'godfather', ')', 'in', 'this', 'show', '-', 'two', 'of', 'the', 'things', 'that', 'I', 'dispised', 'and', 'loathe', 'in', 'the', 'WHOLE', 'world', '(', 'especially', 'stupidity).<br', '/><br', '/>Overall', ',', 'I', 'say', 'that', 'this', 'show', 'is', 'so', 'f', '*', '*', '*', '*', '*', "'", 'annoying', 'and', 'should', 'not', 'be', 'see

In [13]:
print(TEXT.vocab.freqs.most_common(20))

[('the', 141858), (',', 134951), ('.', 115669), ('a', 76382), ('and', 76313), ('of', 70587), ('to', 65470), ('is', 53198), ('in', 43016), ('I', 37616), ('it', 37363), ('that', 34248), ('"', 30880), ("'s", 30449), ('this', 29550), ('-', 25925), ('/><br', 25282), ('was', 24437), ('as', 20879), ('with', 20789)]


In [14]:
print(TEXT.vocab.itos[:10])

['<unk>', '<pad>', 'the', ',', '.', 'a', 'and', 'of', 'to', 'is']


In [15]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x114eb2c80>, {'neg': 0, 'pos': 1})


The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator`, a special type of iterator that returns batches in which every example is of similar length, minimizing the amount of padding per example.
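
To see why grouping by length helps, the sketch below counts pad tokens with and without sorting. It only illustrates the idea and is not the torchtext implementation, which does more than sort (shuffling, for example, is omitted here). The helpers `pad_batches` and `count_padding` and the toy examples are assumptions for this illustration.

```python
PAD = "<pad>"

def pad_batches(examples, batch_size):
    """Split examples into consecutive batches and pad each batch to its longest example."""
    batches = []
    for start in range(0, len(examples), batch_size):
        batch = examples[start:start + batch_size]
        width = max(len(example) for example in batch)
        batches.append([example + [PAD] * (width - len(example)) for example in batch])
    return batches

def count_padding(batches):
    """Total number of pad tokens across all batches."""
    return sum(example.count(PAD) for batch in batches for example in batch)

# Hypothetical tokenized examples with lengths 2, 9, 3 and 8
examples = [["w"] * n for n in (2, 9, 3, 8)]

naive = pad_batches(examples, batch_size=2)                      # original order
bucketed = pad_batches(sorted(examples, key=len), batch_size=2)  # similar lengths together

print("padding without bucketing:", count_padding(naive))     # 12
print("padding with bucketing:   ", count_padding(bucketed))  # 2
```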

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this with `torch.device`; we create a device and then pass it to the iterator.

In [16]:
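BATCH_SIZE = 64

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits((train_data, valid_data, test_data),
                                                                           batch_size=BATCH_SIZE,
                                                                           device=device)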

## Custom data

In [17]:
def download_from_url(url, path):
    """Download file, with logic (from tensor2tensor) for Google Drive"""
    def process_response(r):
        chunk_size = 16 * 1024
        total_size = int(r.headers.get('Content-length', 0))
        with open(path, "wb") as file:
            with tqdm(total=total_size, unit='B',
                      unit_scale=1, desc=path.split('/')[-1]) as t:
                for chunk in r.iter_content(chunk_size):
                    if chunk:
                        file.write(chunk)
                        t.update(len(chunk))

    if 'drive.google.com' not in url:
        response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, stream=True)
        process_response(response)
        return

    print('downloading from Google Drive; may take a few minutes')
    confirm_token = None
    session = requests.Session()
    response = session.get(url, stream=True)
    for k, v in response.cookies.items():
        if k.startswith("download_warning"):
            confirm_token = v

    if confirm_token:
        url = url + "&confirm=" + confirm_token
        response = session.get(url, stream=True)

    process_response(response)

In [18]:
# Download data from GitHub to the notebook's local drive
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/news.csv"
file_name = 'news.csv'
download_from_url(url, file_name)

news.csv: 6.13MB [00:00, 8.07MB/s]                            


In [19]:
# Raw data
df = pd.read_csv(file_name, header=0)
df.head()

| | category | title |
|---|---|---|
| 0 | Business | Wall St. Bears Claw Back Into the Black (Reuters) |
| 1 | Business | Carlyle Looks Toward Commercial Aerospace (Reu... |
| 2 | Business | Oil and Economy Cloud Stocks' Outlook (Reuters) |
| 3 | Business | Iraq Halts Oil Exports from Main Southern Pipe... |
| 4 | Business | Oil prices soar to all-time record, posing new... |


In [20]:
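TEXT = data.Field(tokenize='spacy')
# A plain Field treats each label as a token sequence and adds <unk> and <pad>
# to the label vocabulary; data.LabelField (used for IMDB above) adds neither.
LABELS = data.Field()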

news_data = data.TabularDataset(
    path=file_name, format='csv',skip_header=True,
    fields=[('category', LABELS), ('title', TEXT)])

In [21]:
TEXT.build_vocab(news_data, max_size=25000)
LABELS.build_vocab(news_data)

In [22]:
len(news_data)

120000

In [23]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABELS.vocab)}")

Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 6


In [24]:
print(vars(news_data.examples[0]))

{'category': ['Business'], 'title': ['Wall', 'St.', 'Bears', 'Claw', 'Back', 'Into', 'the', 'Black', '(', 'Reuters', ')']}


In [25]:
print(TEXT.vocab.freqs.most_common(20))

[('to', 22793), ('(', 17132), (')', 17130), ('in', 16767), (',', 16321), ('-', 13503), ('#', 12950), ('for', 11660), (':', 9629), ('on', 8986), ('of', 8736), (';', 7778), ('AP', 7777), ('39;s', 6079), ('the', 4990), ("'", 4328), ('Reuters', 4261), ('US', 3956), ("'s", 3860), ('a', 3745)]


In [26]:
print(LABELS.vocab.stoi)

defaultdict(<function _default_unk_index at 0x114eb2c80>, {'<unk>': 0, '<pad>': 1, 'Business': 2, 'Sci/Tech': 3, 'Sports': 4, 'World': 5})
