# Natural Language Processing
Classifying the sentiment of a text takes several processing steps. First, we convert the raw words into so-called tokens, which are integer values. These tokens are simply indices into a list of the entire vocabulary. Next, we convert the integer tokens into so-called embeddings, which are real-valued vectors. The embedding mapping is trained along with the neural network so that words with similar meanings map to similar embedding vectors. We then feed the embedding vectors to a Recurrent Neural Network, which can take sequences of arbitrary length as input and outputs a kind of summary of what it has seen. This output is squashed with a sigmoid function to give a value between 0.0 and 1.0, where 0.0 is taken to mean negative sentiment and 1.0 positive sentiment. The whole process lets us classify input text as having either negative or positive sentiment.

The flowchart of the algorithm is roughly:

<center>
<img src="./images/natural_language.png" width="20%">
</center>
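
The first and last steps of this pipeline can be sketched in plain Python. This is only a minimal illustration, not the preprocessing used later: the helper names `build_vocab`, `encode` and `sigmoid`, the whitespace tokenization, and the reserved `<unk>` index for unknown words are all assumptions made for the example.

```python
import math


def build_vocab(texts):
    """Assign an integer index to every distinct lowercased word.

    Index 0 is reserved for words that are not in the vocabulary (assumption).
    """
    vocab = {"<unk>": 0}
    for text in texts:
        for word in text.lower().split():
            # setdefault only adds the word if it has not been seen yet
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert a text into a list of integer tokens."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def sigmoid(x):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


vocab = build_vocab(["the movie was excellent", "the movie was awful"])
tokens = encode("The movie was great", vocab)

print(tokens)        # [1, 2, 3, 0]  ("great" is unknown, so it maps to 0)
print(sigmoid(0.0))  # 0.5
```

The embedding and recurrent layers sit between these two steps and are learned during training, so they are not shown here.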

In [1]:
import re
import nltk
import datasets
import torch
import matplotlib.pyplot as plt
import torchtext
from torchtext.vocab import build_vocab_from_iterator
from torch import nn

In [None]:
text = 'The quick fox jumped over a lazy dog.'

In [70]:
# spaCy was not imported above, so import its English stop-word set directly
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords

## Torchtext

### Bag of words

How will we represent the text? The easiest way is with a bag of words.

Let's build a large dictionary: a list of all the words in the training set. Each sentence can then be represented as a vector that records how many times each dictionary word occurs in it:

<center>
<img src="./images/BOW.png" width="15%">
</center>
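
The idea can be sketched by hand before reaching for a library. The function name `bag_of_words` and the plain whitespace tokenization are assumptions for this illustration; a real vectorizer tokenizes more carefully.

```python
from collections import Counter


def bag_of_words(texts):
    """Return the sorted vocabulary and one count vector per text."""
    tokenized = [text.lower().split() for text in texts]
    # The dictionary: every distinct word in the corpus, in sorted order
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        # One entry per dictionary word; missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["The movie was excellent",
                                    "the movie was awful"])
print(vocabulary)  # ['awful', 'excellent', 'movie', 'the', 'was']
print(vectors)     # [[0, 1, 1, 1, 1], [1, 0, 1, 1, 1]]
```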

A simple and convenient way to do this is to pass the texts to `CountVectorizer`.

It has the following signature:

```python
CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
```


To begin with, pay attention to the parameters `lowercase=True` and `max_df=1.0, min_df=1, max_features=None`. They mean that by default all words are converted to lower case and every token found in the texts is included in the dictionary.

If desired, we could remove words that are too rare or too frequent, but for now we will not do this.
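
For reference, here is a small sketch of what such filtering looks like. The two-sentence corpus and the variable names are made up for the illustration; `min_df=2` asks the vectorizer to keep only words that occur in at least two documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['The movie was excellent', 'the movie was awful']

# Default settings: every token that matches token_pattern is kept
full = CountVectorizer().fit(texts)

# min_df=2: keep only words that appear in at least two documents
frequent_only = CountVectorizer(min_df=2).fit(texts)

full_vocab = sorted(full.vocabulary_)
frequent_vocab = sorted(frequent_only.vocabulary_)

print(full_vocab)      # ['awful', 'excellent', 'movie', 'the', 'was']
print(frequent_vocab)  # ['movie', 'the', 'was']
```

Note that on this tiny corpus the filter throws away exactly the two words that carry the sentiment, which is one reason to leave the defaults alone for now.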

Let's look at a simple example of how it will work:

In [96]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()

dummy_data = ['The movie was excellent', 'the movie was awful']

dummy_matrix = vectorizer.fit_transform(dummy_data)

print(dummy_matrix.toarray())

[[0 1 1 1 1]
 [1 0 1 1 1]]


In [97]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['awful', 'excellent', 'movie', 'the', 'was']


How exactly does the vectorizer define word boundaries? Note the parameter `token_pattern=r'(?u)\b\w\w+\b'`. How does it work?
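
One way to answer this is to apply the same regular expression directly with the standard `re` module. The sample sentence and the helper name `tokenize` are made up for the illustration. The pattern requires at least two word characters between word boundaries, so single-character tokens are dropped and apostrophes split words.

```python
import re

# The default CountVectorizer pattern: two or more word characters
# surrounded by word boundaries
token_pattern = re.compile(r'(?u)\b\w\w+\b')


def tokenize(text):
    """Lowercase the text and extract tokens the way the pattern dictates."""
    return token_pattern.findall(text.lower())


tokens = tokenize("I can't say it's a bad movie!")
print(tokens)  # ['can', 'say', 'it', 'bad', 'movie']
```

The words `I` and `a` disappear, and `can't` and `it's` lose everything after the apostrophe because the leftover `t` and `s` are only one character long.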

The result is what we wanted: a vector with a bag-of-words (BoW) representation of the source text.

How can this information help? Some words carry a positive connotation and some a negative one. Most are neutral.

<center>
<img src="./images/BOW_weights.png" width="15%">
</center>

We would like to choose coefficients that determine how positive or negative each word is. They should be fitted on the training sample rather than chosen the way we did before.

For example, for the sample

```
1 The movie was excellent
0 the movie was awful
```

It is easy to pick the coefficients by eye: something like `+1` for `excellent`, `-1` for `awful`, and zero for everything else.
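
To make the hand-picked coefficients concrete, here is a small sketch that scores both sentences. The names `weights`, `score` and `sigmoid` are assumptions for this illustration, and passing the score through a sigmoid is borrowed from logistic regression rather than prescribed by the text above.

```python
import math

# Hand-picked coefficients; every other word implicitly gets weight 0
weights = {'excellent': 1.0, 'awful': -1.0}


def score(text, weights, bias=0.0):
    """Sum the weights of all words in the text (a linear model on BoW counts)."""
    words = text.lower().split()
    return bias + sum(weights.get(word, 0.0) for word in words)


def sigmoid(x):
    """Map a real-valued score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


positive = sigmoid(score('The movie was excellent', weights))
negative = sigmoid(score('the movie was awful', weights))

print(round(positive, 3), round(negative, 3))  # 0.731 0.269
```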

Let's build a linear model that will do this. It will learn to build a separating hyperplane in the space of BoW vectors.

Let's see how logistic regression handles our tiny sample of two sentences.

In [98]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

dummy_data = ['The movie was excellent',
              'the movie was awful']
dummy_labels = [1, 0]

vectorizer = CountVectorizer()
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

model.fit(dummy_data, dummy_labels)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())
print(classifier.coef_)

['awful', 'excellent', 'movie', 'the', 'was']
[[-0.40104279  0.40104279  0.          0.          0.        ]]


## Tf-idf

So far we have given every word the same weight, although some words are rarer and some more frequent, and this frequency is, generally speaking, useful information.

The easiest way to add statistical information about frequencies is to use *tf-idf* weighting:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

*tf* (term frequency) is the frequency of the word `t` in a specific document `d` (a review, in our case). This is exactly what we have been counting so far.

*idf* (inverse document frequency) is a coefficient that grows as the number of documents containing the word shrinks. It is computed roughly like this:

$$\text{idf}(t) = \log\frac{1 + n_d}{1 + n_{d(t)}} + 1$$

where $n_d$ is the total number of documents and $n_{d(t)}$ is the number of documents containing the word `t`.
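
The formulas can be checked by hand on the two-sentence example. This is a sketch that assumes whitespace tokenization and the natural logarithm; the names `doc_freq`, `idf` and `tf_idf` are made up for the illustration. `TfidfVectorizer` additionally L2-normalizes each document vector by default, which is not done here.

```python
import math
from collections import Counter

docs = ['The movie was excellent', 'the movie was awful']
tokenized = [doc.lower().split() for doc in docs]

n_d = len(tokenized)  # number of all documents

# n_{d(t)}: in how many documents each word occurs (set() counts a word once per document)
doc_freq = Counter(word for words in tokenized for word in set(words))

# idf(t) = log((1 + n_d) / (1 + n_{d(t)})) + 1
idf = {word: math.log((1 + n_d) / (1 + df)) + 1 for word, df in doc_freq.items()}


def tf_idf(words):
    """tf-idf(t, d) = tf(t, d) * idf(t) for every word in one document."""
    tf = Counter(words)
    return {word: count * idf[word] for word, count in tf.items()}


first_doc_weights = tf_idf(tokenized[0])
print(round(idf['movie'], 3), round(idf['excellent'], 3))  # 1.0 1.405
```

Words that occur in every document, such as `movie`, get the minimum weight of 1, while words that occur in only one document get a larger weight.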

Using it is easy: just replace `CountVectorizer` with `TfidfVectorizer`.

**Task** Try running `TfidfVectorizer`. Compare it with `CountVectorizer`: look at the mistakes it has learned to fix and the new mistakes it has started to make.

In [100]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

### N-gram

Until now we have treated texts as a bag of words, but there is obviously a difference between `good movie` and `not good movie`.

Let's add at least some information about word order by also extracting word bigrams.

Vectorizers have the option `ngram_range=(n_1, n_2)` for this: it says that we want all n-grams from n_1-grams up to n_2-grams.
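
As a rough sketch of what a range of `(1, 2)` produces, the helper below extracts word n-grams by hand. The function name `word_ngrams` and the whitespace tokenization are assumptions for the illustration.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of the text for n_min <= n <= n_max."""
    words = text.lower().split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words across the sentence
        for i in range(len(words) - n + 1):
            ngrams.append(' '.join(words[i:i + n]))
    return ngrams


features = word_ngrams('not good movie', 1, 2)
print(features)  # ['not', 'good', 'movie', 'not good', 'good movie']
```

With bigrams in the feature set, `not good` becomes a feature of its own and can receive a negative weight even though `good` alone is positive.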

**Task** Try an increased range and interpret the result.

In [101]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

### Character n-grams

Character n-grams provide an easy way to learn useful roots and suffixes without relying on any linguistic knowledge, using statistics alone.

For example, we can represent the word `badass` as the following sequence of trigrams:

`##b #ba bad ada das ass ss# s##`

Quite interpretable, isn't it?
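
The sequence above can be reproduced with a few lines of Python. The helper name `char_ngrams` and the `#` padding symbol are taken from the example, not from scikit-learn: with `analyzer='char'` the vectorizer adds no padding, and `analyzer='char_wb'` pads word edges with spaces instead.

```python
def char_ngrams(word, n=3, pad='#'):
    """Return the character n-grams of a word padded with n - 1 pad symbols per side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    # Slide a window of n characters across the padded word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


trigrams = char_ngrams('badass')
print(' '.join(trigrams))  # ##b #ba bad ada das ass ss# s##
```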

It is just as easy to implement: set `analyzer='char'` in your favorite vectorizer and choose an `ngram_range`.

**Task** Train a classifier on character n-grams and visualize it.

In [102]:
vectorizer = TfidfVectorizer(ngram_range=(2, 6), max_features=20000, analyzer='char')
classifier = LogisticRegression(solver='lbfgs')

model = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', classifier)
])

# model.fit(train_df['review'], train_df['is_positive'])

# eval_model(model, test_df)

## Lemmatization and stemming

If you look closely, you may find different forms of the same word that the classifier treats as having different sentiment. Or not?

**Task** Find word forms with different sentiment according to the classifier.

Assuming such forms exist, let's try to do something about them.

One option is lemmatization: we reduce every word to its base (dictionary) form. The spaCy library can help with this.

**Task** Build a classifier on lemmatized texts.

A simpler way to normalize words is stemming. It is cruder and does not take context into account, but it sometimes turns out to be even more effective than lemmatization and, most importantly, it is faster.

In essence, a stemmer is just a set of rules for trimming a word down to its stem:

In [104]:
from nltk import PorterStemmer

stemmer = PorterStemmer()

print(stemmer.stem('become'))
print(stemmer.stem('becomes'))
print(stemmer.stem('became'))

becom
becom
becam
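
To show what "a set of rules" means in practice, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the minimum stem length of three characters, and the function name `toy_stem` are all assumptions made for the illustration.

```python
# (suffix, replacement) pairs, tried in order; the first match wins
SUFFIX_RULES = [('sses', 'ss'), ('ies', 'i'), ('ing', ''), ('ed', ''), ('s', '')]


def toy_stem(word, min_stem_length=3):
    """Strip the first matching suffix, keeping at least min_stem_length characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_length = len(word) - len(suffix)
        if word.endswith(suffix) and stem_length >= min_stem_length:
            return word[:stem_length] + replacement
    return word  # no rule applied


for word in ['watching', 'watched', 'movies', 'became']:
    print(word, '->', toy_stem(word))
```

Like the real stemmer, these rules look only at the spelling of the word, so an irregular form such as `became` passes through unchanged and is never merged with `become`.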
