# Natural Language Processing
To classify the sentiment of a text we need several processing steps. First, we convert the raw words into so-called tokens, which are integer values; each token is simply an index into a list of the entire vocabulary. Next, we convert these integer tokens into so-called embeddings: real-valued vectors whose mapping is trained along with the neural network, so that words with similar meanings map to similar embedding vectors. We then feed the embedding vectors to a Recurrent Neural Network, which can take sequences of arbitrary length as input and outputs a kind of summary of what it has seen. Finally, this output is squashed with a sigmoid function to give a value between 0.0 and 1.0, where 0.0 is taken to mean a negative sentiment and 1.0 a positive sentiment. The whole process lets us classify input text as having either a negative or a positive sentiment.
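
As a rough illustration of these steps, here is a minimal PyTorch sketch. The class name `SentimentRNN`, the toy vocabulary, the layer sizes, and the choice of a GRU as the recurrent layer are all assumptions made for this example. The model is untrained, so the printed score only shows that the output lies between 0 and 1.

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    """Hypothetical model: token ids -> embeddings -> GRU -> sigmoid score."""

    def __init__(self, vocab_size, embed_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        _, last_hidden = self.rnn(embedded)         # (1, batch, hidden_dim)
        logits = self.out(last_hidden.squeeze(0))   # (batch, 1)
        return torch.sigmoid(logits).squeeze(1)     # (batch,), values in (0, 1)

# Toy vocabulary (an assumption): a token is the index of a word in this list.
vocab = ['<unk>', 'the', 'movie', 'was', 'excellent', 'awful']
stoi = {word: index for index, word in enumerate(vocab)}

def encode(text):
    # Unknown words map to index 0 ('<unk>').
    return [stoi.get(word, 0) for word in text.lower().split()]

torch.manual_seed(0)
model = SentimentRNN(len(vocab))
token_ids = torch.tensor([encode('The movie was excellent')])
score = model(token_ids)
print(token_ids.tolist(), score.shape)
```

Taking the last hidden state of the recurrent layer as the "summary" is one common design choice; pooling over all time steps would be another.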

The flowchart of the algorithm is roughly:

<div class="imgcap">
<img src="images/natural_language.png" width="40%">
</div>

In [2]:
import re
import os
import torch
import spacy
import pandas as pd
import matplotlib.pyplot as plt
import torchtext
from torchtext import vocab
try:
    # torchtext < 0.9
    from torchtext.data import Field, BucketIterator, TabularDataset, BPTTIterator
except ImportError:
    # torchtext 0.9-0.11 moved these classes to torchtext.legacy (removed in 0.12)
    from torchtext.legacy.data import Field, BucketIterator, TabularDataset, BPTTIterator
from download_files import download_from_url
from tqdm import tqdm, tqdm_notebook, tnrange
tqdm.pandas(desc='Progress')
from sklearn.model_selection import train_test_split
from collections import Counter

SEED = 1234

nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])

In [None]:
text = 'The quick fox jumped over a lazy dog.'
text

In [69]:
MAX_CHARS = 20000

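def tokenizer(sentence):
    # Replace punctuation and special characters with a space
    sentence = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", str(sentence))
    # Collapse runs of spaces and of repeated punctuation marks
    sentence = re.sub(r"[ ]+", " ", sentence)
    sentence = re.sub(r"\!+", "!", sentence)  # no-op: "!" was already replaced above
    sentence = re.sub(r"\,+", ",", sentence)
    sentence = re.sub(r"\?+", "?", sentence)

    # Truncate overly long inputs
    if len(sentence) > MAX_CHARS:
        sentence = sentence[:MAX_CHARS]

    # Tokenize with spaCy, dropping single-space tokens
    return [x.text for x in nlp.tokenizer(sentence) if x.text != " "]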

tokenizer(text)

['The', 'quick', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog', '.']

In [70]:
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

## Torchtext

<img src="https://i0.wp.com/mlexplained.com/wp-content/uploads/2018/02/%E3%82%B9%E3%82%AF%E3%83%AA%E3%83%BC%E3%83%B3%E3%82%B7%E3%83%A7%E3%83%83%E3%83%88-2018-02-07-10.32.59.png?resize=1024%2C481" width="80%">

In [1]:
# Download the data from GitHub to the notebook's local drive
import os
from download_files import download_from_url

url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/news.csv"
file_name = '.data/news.csv'
if not os.path.isfile(file_name):
    download_from_url(url, file_name)

In [72]:
file_name = '.data/news.csv'
df = pd.read_csv(file_name, header=0)
df.head()

Unnamed: 0,category,title
0,Business,Wall St. Bears Claw Back Into the Black (Reuters)
1,Business,Carlyle Looks Toward Commercial Aerospace (Reu...
2,Business,Oil and Economy Cloud Stocks' Outlook (Reuters)
3,Business,Iraq Halts Oil Exports from Main Southern Pipe...
4,Business,"Oil prices soar to all-time record, posing new..."


In [73]:
train, val = train_test_split(df, test_size=0.25)

In [74]:
train.to_csv(".data/train.csv", index=False)
val.to_csv(".data/val.csv", index=False)

In [75]:
TEXT = Field(init_token='<s>', eos_token='</s>', lower=True, tokenize='spacy', include_lengths=True)
LABEL = Field(pad_token=None, unk_token=None)

data_fields = [('category', LABEL), ('title', TEXT)]

train, val = TabularDataset.splits(path='.data/', train='train.csv', validation='val.csv', format='csv', skip_header=True, fields=data_fields)

In [76]:
train.fields

{'category': <torchtext.data.field.Field at 0x170858c88>,
 'title': <torchtext.data.field.Field at 0x17086cda0>}

In [77]:
len(train), len(val)

(90000, 30000)

In [78]:
train.fields.items()

dict_items([('category', <torchtext.data.field.Field object at 0x170858c88>), ('title', <torchtext.data.field.Field object at 0x17086cda0>)])

In [79]:
ex = train[0]
ex.category

['Sports']

In [80]:
ex.title

['busch', ',', 'truex', 'garner', 'wet', 'poles', 'at', 'darlington']

In [81]:
# specify the path to the locally saved vectors
vec = vocab.Vectors('.data/numberbatch-en-17.06.txt', '.data/cached_vectors/')

In [83]:
TEXT.build_vocab(train, val, vectors=vec)
LABEL.build_vocab(train, val)

In [84]:
TEXT.vocab.vectors[TEXT.vocab.stoi['the']]

tensor([ 0.1242,  0.1674,  0.0332, -0.0748, -0.0949,  0.0375,  0.1173,  0.0417,
        -0.0555,  0.0879,  0.0022, -0.0468, -0.0807, -0.0585,  0.0150, -0.0920,
         0.1271,  0.0427,  0.0620, -0.0173,  0.0926, -0.0594,  0.0598,  0.0100,
         0.0155,  0.0383,  0.0163,  0.1254, -0.0468,  0.0004,  0.0943, -0.0575,
        -0.0242,  0.0752, -0.1304,  0.0479, -0.0497, -0.0374,  0.0215, -0.0180,
         0.0609, -0.0007, -0.0978, -0.0085, -0.0386,  0.0116,  0.0443, -0.0374,
         0.0439, -0.0452,  0.0377,  0.0720,  0.0405,  0.0212,  0.0781,  0.0728,
        -0.0364, -0.0924, -0.0307,  0.0720,  0.0887,  0.0687,  0.0253,  0.1093,
        -0.0389, -0.0521,  0.0624,  0.0678,  0.0103, -0.0738,  0.0250,  0.0523,
         0.0072, -0.0300,  0.0142,  0.0265, -0.0210,  0.0078, -0.0398, -0.0607,
        -0.0003,  0.0169,  0.0239,  0.0328, -0.0895, -0.0195,  0.0574,  0.0884,
        -0.0789,  0.0518,  0.0406,  0.0407,  0.0459,  0.0452, -0.0202, -0.0383,
        -0.0792,  0.0765, -0.0440,  0.01

In [85]:
print(TEXT.vocab.itos[:20])

['<unk>', '<pad>', '<s>', '</s>', 'to', 'in', '(', ')', ',', '-', '#', 'for', ':', 'on', 'of', ';', 'ap', 'the', '39;s', 'a']


In [86]:
print(LABEL.vocab.stoi)

defaultdict(<function _default_unk_index at 0x12352d048>, {'Business': 0, 'Sci/Tech': 1, 'Sports': 2, 'World': 3})


In [87]:
print(TEXT.vocab.stoi['and'])

31
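
The `stoi` and `itos` lookups used above can be imitated in plain Python, which may help when the legacy `Field` API is not available. This is only a sketch: the function name `build_vocab`, the toy titles, and the tie-breaking rule are assumptions, and torchtext's own vocabulary builder has more options (minimum frequency, maximum size, pretrained vectors).

```python
from collections import Counter

def build_vocab(token_lists, specials=('<unk>', '<pad>', '<s>', '</s>')):
    """Return (itos, stoi): special tokens first, then words by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # Most frequent words first; ties are broken alphabetically (an assumption).
    words = sorted(counts, key=lambda word: (-counts[word], word))
    itos = list(specials) + words
    stoi = {word: index for index, word in enumerate(itos)}
    return itos, stoi

# Toy tokenized titles (hypothetical data)
titles = [['oil', 'prices', 'soar'], ['oil', 'exports', 'halt']]
itos, stoi = build_vocab(titles)

# Words missing from the vocabulary fall back to the '<unk>' index.
ids = [stoi.get(token, stoi['<unk>']) for token in ['oil', 'prices', 'fall']]
print(itos)
print(ids)
```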


In [88]:
train_iter = BucketIterator(train, batch_size=10, shuffle=False)

In [89]:
batch = next(iter(train_iter))

In [90]:
print(batch.title)

(tensor([[    2,     2,     2,     2,     2,     2,     2,     2,     2,     2],
        [ 1935,   410,   230,  5214,  2153,    52,  2024,   396,    92,  5359],
        [    8,   841,  1555,    91,  1415,     4,  1518, 13196,  6667,   568],
        [12893,   321,  2043,  5144,   790,   767,   172,     8,  1301,  4413],
        [10103,  1093,  4742,  7299,   908,  2187,     5,  1364,  1064,  1695],
        [ 4170,   493,     3,     3,  3769,     3,   285,    46,    26,  1171],
        [ 6674, 13524,     1,     1,     5,     1,  1178,    17, 11452,     3],
        [   22,     5,     1,     1,  1502,     1,   255,  3097,  2518,     1],
        [ 9238,  6299,     1,     1,  1503,     1,     6,     3,     3,     1],
        [    3,  2726,     1,     1,   234,     1,    45,     1,     1,     1],
        [    1,     3,     1,     1,    14,     1,     7,     1,     1,     1],
        [    1,     1,     1,     1,  1217,     1,     3,     1,     1,     1],
        [    1,     1,     1,     1,   

In [91]:
print(batch.category)

tensor([[2, 2, 0, 1, 3, 0, 3, 1, 3, 2]])


In [92]:
traindl, valdl = BucketIterator.splits(datasets=(train, val), # specify train and validation TabularDataset
                                            batch_sizes=(3,3),  # batch size of train and validation
                                            sort_key=lambda x: len(x.title), # on what attribute the text should be sorted
                                            sort_within_batch=True, 
                                            repeat=False)
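
The idea behind bucketing can be sketched in plain Python: sort the sequences by length, cut them into batches, and pad each batch only up to its own longest sequence. The function name `bucket_batches` and the toy sequences are assumptions. The real `BucketIterator` also shuffles, and it returns tensors laid out as `[sequence length, batch size]` rather than nested lists.

```python
def bucket_batches(sequences, batch_size, pad_id=1):
    """Yield batches of similar-length sequences, padded to the longest one in the batch."""
    ordered = sorted(sequences, key=len)
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(seq) for seq in chunk)
        yield [seq + [pad_id] * (width - len(seq)) for seq in chunk]

# Toy token-id sequences (hypothetical); 2 = <s>, 3 = </s>, 1 = <pad>
seqs = [[2, 7, 3], [2, 5, 6, 8, 9, 3], [2, 4, 3], [2, 5, 6, 8, 3]]
batches = list(bucket_batches(seqs, batch_size=2))
for batch in batches:
    print(batch)
```

Grouping sequences of similar length keeps the amount of padding small: here only one padding token is needed.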

In [93]:
class BatchGenerator:
    def __init__(self, dl, x_field, y_field):
        self.dl, self.x_field, self.y_field = dl, x_field, y_field
        
    def __len__(self):
        return len(self.dl)
    
    def __iter__(self):
        for batch in self.dl:
            X = getattr(batch, self.x_field)
            y = getattr(batch, self.y_field)
            yield (X,y)

In [94]:
train_batch_it = BatchGenerator(traindl, 'title', 'category')  # X = titles, y = categories
print(next(iter(train_batch_it))[1])

tensor([[1, 1, 2]])


In [95]:
print(next(iter(train_batch_it))[0])

(tensor([[   2,    2,    2],
        [4534, 1504,  295],
        [ 519,   13, 5674],
        [8253,   17,   28],
        [ 700,  812, 3135],
        [   3,    3,    3]]), tensor([6, 6, 6]))


### Built-in data

In [47]:
TEXT = Field()

In [59]:
train, val, test = torchtext.datasets.LanguageModelingDataset.splits(
    path=".data/ptb", 
    train="train.txt", validation="valid.txt", test="test.txt", text_field=TEXT)

In [49]:
# we train on the entire corpus, modeled as a single sentence
print('len(train)', len(train))

len(train) 1


In [50]:
# build the vocabulary. 10001 because the vocab has <unk> but then torchtext adds its own
TEXT.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))

len(TEXT.vocab) 10001


In [51]:
# for debugging, reduce vocabulary.
if False:
    TEXT.build_vocab(train, max_size=1000)
    print(len(TEXT.vocab))

In [53]:
# make batch iterators
train_iter, val_iter, test_iter = BPTTIterator.splits((train, val, test), batch_size=10, bptt_len=32, repeat=False)

In [54]:
# each batch is a string of length 32 and sentences are ended with a special <eos> token
it = iter(train_iter)
batch = next(it) 

In [55]:
print("Size of text batch [max bptt length, batch size]", batch.text.size())

Size of text batch [max bptt length, batch size] torch.Size([32, 10])


In [56]:
print("Third in batch", batch.text[:, 2])

Third in batch tensor([   8,  202,   77,    5,  183,  561, 3837,   18,  975,  976,    7,  943,
           5,  157,   78, 1571,  289,  645,    3,   30,  132,    0,   20,    2,
         273, 7821,   17,    9,  117, 2815,  969,    6])


In [57]:
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Converted back to string:  in part because of buy programs generated by stock-index arbitrage a form of program trading involving futures contracts <eos> but interest <unk> as the day wore on and investors looked ahead to


In [64]:
# each consecutive batch is a continuation of the previous one. there are no separate labels
batch = next(it)
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 2].data]))

Converted back to string:  for bonds and bad for stocks <eos> in major market activity stock prices rose in light trading <eos> but declining issues on the new york stock exchange outnumbered gainers N to N
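
The batching scheme seen above can be sketched in plain Python: the corpus is treated as one long token stream, cut into `batch_size` parallel columns, and each column is then read in consecutive windows of `bptt_len` tokens. The function name `bptt_batches` is an assumption, and this simplified version drops leftover tokens and uses a `[batch, time]` layout, whereas `BPTTIterator` returns `[time, batch]` tensors.

```python
def bptt_batches(tokens, batch_size, bptt_len):
    """Yield (text, target) windows; target is text shifted by one token."""
    col_len = len(tokens) // batch_size
    columns = [tokens[i * col_len:(i + 1) * col_len] for i in range(batch_size)]
    for start in range(0, col_len - 1, bptt_len):
        end = min(start + bptt_len, col_len - 1)
        text = [column[start:end] for column in columns]
        target = [column[start + 1:end + 1] for column in columns]
        yield text, target

# A toy "corpus" of 20 token ids (hypothetical)
batches = list(bptt_batches(list(range(20)), batch_size=2, bptt_len=4))
for text, target in batches:
    print(text, target)
```

Each window continues exactly where the previous one stopped, and the only "labels" are the next tokens of the same stream.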


## Bag of words

How should we represent the text? The easiest way is a bag of words.

Let's build a big dictionary: a list of all the words in the training set. Each sentence can then be represented as a vector that records how many times each of the possible words occurs in it:

<img src="images/BOW.png" width="50%">


A simple and pleasant way to do this is to feed the texts to `CountVectorizer`.

It has the following signature:

```python
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```


To begin with, pay attention to the parameters `lowercase=True` and `max_df=1.0, min_df=1, max_features=None`. They mean that, by default, all words are converted to lower case and no word is dropped from the dictionary for being too rare or too frequent.

If desired, words that are too rare or too frequent could be removed, but we will not do this for now.

Let's look at a simple example of how it works:

In [96]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dummy_data = ['The movie was excellent', 'the movie was awful']

dummy_matrix = vectorizer.fit_transform(dummy_data)

print(dummy_matrix.toarray())

[[0 1 1 1 1]
 [1 0 1 1 1]]


In [97]:
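# scikit-learn >= 1.0: use vectorizer.get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())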

['awful', 'excellent', 'movie', 'the', 'was']


How exactly does the vectorizer define word boundaries? Note the parameter `token_pattern=r'(?u)\b\w\w+\b'`. How does it work?
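
One way to explore this is to apply the same regular expression directly with Python's `re` module. The sample sentence is made up for this illustration; the pattern itself is the default shown in the signature above.

```python
import re

# The default token pattern: runs of two or more word characters
token_pattern = re.compile(r'(?u)\b\w\w+\b')

sentence = "I didn't like it: a 2-star film, at best!"
tokens = token_pattern.findall(sentence.lower())
print(tokens)
```

Single-character tokens such as `i`, `a`, and `2` are dropped, punctuation acts as a separator, and `didn't` is split so that only `didn` survives.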

What we get is a vector with the BoW (bag-of-words) representation of the source text.

How can this information help? Well, some words carry a positive sentiment and some a negative one, while most are neutral.
<img src="images/BOW_weights.png" width="50%">

We would like to choose coefficients that determine how strongly each word is colored, right? They have to be selected using the training sample, not the way we did before.

For example, for the sample

```
1 The movie was excellent
0 the movie was awful
```

It's easy to pick the coefficients by eye: something like `+1` for `excellent`, `-1` for `awful`, and zeros for everything else.

Let's build a linear model that does this. It will learn a separating hyperplane in the space of BoW vectors.

Let's check how logistic regression handles our tiny sample of two sentences.

In [98]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dummy_data = ['The movie was excellent',
              'the movie was awful']
dummy_labels = [1, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)

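# scikit-learn >= 1.0: use vectorizer.get_feature_names_out(); get_feature_names() was removed in 1.2
print(vectorizer.get_feature_names())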
print(classifier.coef_)

['awful', 'excellent', 'movie', 'the', 'was']
[[-0.40104279  0.40104279  0.          0.          0.        ]]


## Tf-idf

So far we have treated all words with the same weight, although some of them are rarer and some more frequent, and this frequency is, generally speaking, useful information.

The easiest way to add statistical information about frequencies is *tf-idf* weighting:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* (term frequency) is the frequency of the word `t` in a specific document `d` (a review, in our case). This is exactly what we have already been counting.

*idf* (inverse document frequency) is a coefficient that grows as the number of documents containing the word shrinks. It is computed roughly like this:

$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$

where $n_d$ is the total number of documents and $n_{d(t)}$ is the number of documents containing the word `t`.
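
To make the formula concrete, here is a small hand computation on the two dummy sentences used earlier. The helper names `idf` and `tf_idf` are assumptions, and the logarithm is taken to be the natural one. Note also that `TfidfVectorizer` by default normalizes each document vector to unit length afterwards, which this sketch leaves out.

```python
import math

docs = [['the', 'movie', 'was', 'excellent'],
        ['the', 'movie', 'was', 'awful']]

def idf(term, docs):
    n_d = len(docs)                              # number of all documents
    n_dt = sum(term in doc for doc in docs)      # documents containing the term
    return math.log((1 + n_d) / (1 + n_dt)) + 1

def tf_idf(term, doc, docs):
    # tf is the raw count of the term in the document
    return doc.count(term) * idf(term, docs)

print(idf('movie', docs))              # in both documents: log(3/3) + 1
print(idf('awful', docs))              # in one document:   log(3/2) + 1
print(tf_idf('awful', docs[1], docs))
```

The word `movie`, which occurs in every document, gets the minimum weight of 1, while the rarer `awful` is weighted up.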

Using it is easy: just replace `CountVectorizer` with `TfidfVectorizer`.

**Task** Try running `TfidfVectorizer`. Compared with `CountVectorizer`, look at the mistakes it has learned to correct and the new mistakes it has started to make.

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

### N-gram

Until now we have treated texts as a bag of words, but there is obviously a difference between `good movie` and `not good movie`.

Let's add (at least some) information about word order by also extracting word bigrams.

The vectorizers have the option `ngram_range=(n_1, n_2)` for this: it says that we want all n-grams from n_1-grams up to n_2-grams.
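
A small sketch of what the option does, using the two phrases from above as a made-up corpus. `CountVectorizer` is used here instead of `TfidfVectorizer` only to keep the numbers easy to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['good movie', 'not good movie']

# ngram_range=(1, 2): extract both single words and word bigrams
vectorizer = CountVectorizer(ngram_range=(1, 2))
matrix = vectorizer.fit_transform(texts)

# vocabulary_ maps each n-gram to its column index; list the n-grams in column order
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
print(matrix.toarray())
```

With unigrams alone the two texts would differ only in the `not` column; the bigram `not good` gives the classifier a feature that ties the negation to the word it modifies.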

**Task** Try an increased range and interpret the result.

In [101]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

### Character n-grams

Character n-grams provide an easy way to learn useful roots and suffixes without getting involved in any of that linguistics of yours: just statistics, pure and simple.

For example, the word `badass` can be represented as the following sequence of trigrams:

`##b #ba bad ada das ass ss# s##`

Quite interpretable, isn't it?

It is just as easy to implement: pass `analyzer='char'` to your favorite vectorizer and choose an `ngram_range`.
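
The trigram sequence for `badass` can be reproduced with a few lines of plain Python. The function name `char_ngrams` and the `#` padding character are assumptions that mirror the example above. scikit-learn's `analyzer='char'` does not add such padding, while `analyzer='char_wb'` pads word edges with spaces.

```python
def char_ngrams(word, n=3, pad='#'):
    """Return the character n-grams of a word padded with n-1 pad symbols on each side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

trigrams = char_ngrams('badass')
print(' '.join(trigrams))
```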

**Task** Build a classifier on character n-grams and visualize it.

In [102]:
vectorizer = TfidfVectorizer(ngram_range=(2, 6), max_features=20000, analyzer='char')
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

## Lemmatization and stemming

If you look closely, you may find that the classifier assigns different sentiment to different forms of the same word. Or not?

**Task** Find word forms with different sentiment.

Assuming such forms exist, let's try to do something about it.

For example, lemmatization reduces every word to its base form. The spaCy library will help with this.

In [103]:
import spacy
from spacy import displacy

nlp = spacy.load('en', disable=['parser'])

docs = [doc for doc in nlp.pipe(train_df.review.values[:50])]

NameError: name 'train_df' is not defined

In [None]:
for token in docs[0]:
    print(token.text, token.lemma_)

**Task** Make a classifier on lemmatized texts.

A simpler way to normalize words is stemming. It is a bit crude and does not take context into account, but sometimes it turns out to be even more effective than lemmatization and, most importantly, faster.

In essence, it is just a set of rules for cutting a word down to its stem:

In [104]:
from nltk import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('become'))
print(stemmer.stem('becomes'))
print(stemmer.stem('became'))

becom
becom
becam


**Task** Try classifying stems instead of lemmas.

### NER

There are a lot of named entities in review texts. For example:

In [49]:
displacy.render(docs[0], style='ent', jupyter=True)

Generally speaking, why should a name like Depp carry any sentiment? However, it turns out that the classifier learns that some names occur more often in positive reviews, or vice versa. This looks like overfitting, so why not try cutting the entities out?

**Task** Remove some of the entities from the texts, using their coordinates from the pickled files. The description of the entity types can be viewed [here](https://spacy.io/api/annotation#named-entities). Run the classifier.
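
As a possible starting point for this task, here is a sketch that cuts character spans out of a text. The function name `remove_spans`, the sample sentence, and the hand-written span are assumptions. With spaCy, the spans could be taken from `ent.start_char` and `ent.end_char` for each entity in `doc.ents`.

```python
def remove_spans(text, spans):
    """Remove the given (start, end) character spans and tidy up the whitespace."""
    pieces, cursor = [], 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])   # keep the text before the span
        cursor = max(cursor, end)           # skip the span itself
    pieces.append(text[cursor:])
    # Collapse the double spaces left behind by removed entities
    return ' '.join(''.join(pieces).split())

review = 'Johnny Depp is great in this film.'
entity_spans = [(0, 11)]   # hypothetical span covering "Johnny Depp"
cleaned = remove_spans(review, entity_spans)
print(cleaned)
```

Instead of deleting an entity, one could replace it with a placeholder such as its entity label, which keeps the sentence structure intact.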