# Concept - Natural Language Processing 101
> Understand the basic process of pattern recognition needed to work with text data. This approach underlies many basic operations on text data, such as Document Classification (e.g. topic of an article), Sequence to Sequence Learning (e.g. translations), and Sentiment Analysis.

- toc: true
- badges: true
- comments: true
- categories: [Concept, TFIDF, NLP, Visualization, Altair]
- image:

Let's understand the basic process of pattern recognition that is needed to work with text data. This approach can be used for a large number of basic operations on text data, such as:
- Document Classification (e.g. topic of an article)
- Sequence to Sequence Learning (e.g. translations)
- Sentiment Analysis

Let's start by seeing how a text sequence can be encoded into numbers and prepared for these tasks.

## Tokenisation

Let's take the sentence: **The quick brown fox jumped over the lazy dog**

We first need to break this sentence into smaller constituents called **tokens**. There are three common ways to create tokens:

- **Individual character** - create a token for each character
- **Individual word** - create a token for each word in the sentence
- **N-gram** - create a token for each run of n consecutive words in the sentence
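
To make the three options concrete, here is a minimal sketch in plain Python. The helper name `word_ngrams` is made up for illustration and is not part of any library used in this post.

```python
sentence = "The quick brown fox jumped over the lazy dog"

# Character tokens: every character (spaces included) becomes a token
char_tokens = list(sentence)

# Word tokens: split the sentence on whitespace
word_tokens = sentence.split()

def word_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Bigrams (n=2): pairs of neighbouring words
bigrams = word_ngrams(word_tokens, 2)

print(char_tokens[:5])   # ['T', 'h', 'e', ' ', 'q']
print(word_tokens[:3])   # ['The', 'quick', 'brown']
print(bigrams[:2])       # [('The', 'quick'), ('quick', 'brown')]
```

A sentence of 9 words gives 9 word tokens but only 8 bigrams, since each bigram needs a right-hand neighbour.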

### Create word tokens

**Pre-processing - split, punctuation & case**

Creating word tokens involves some basic **pre-processing**:
- Split the sentence on **whitespace**
- Filter out **punctuation**
- Change the text to **lower case**

After this pre-processing, each remaining piece is a token.
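
The three steps can be sketched with the standard library alone. This is only an approximation of what a library tokenizer does: the function name `simple_word_tokens` is hypothetical, and it assumes `string.punctuation` is a reasonable set of characters to filter.

```python
import string

def simple_word_tokens(text):
    """Lower-case the text, replace punctuation with spaces, split on whitespace."""
    lowered = text.lower()
    # Map every punctuation character to a space so it cannot glue words together
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return lowered.translate(table).split()

print(simple_word_tokens("The quick brown fox jumped over the lazy dog."))
# ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
```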

In [39]:
import numpy as np
import pandas as pd
import altair as alt

import spacy
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.text import one_hot, hashing_trick

In [None]:
! python -m spacy download en_core_web_sm

In [4]:
sentence = 'The quick brown fox jumped over the lazy dog.'

In [5]:
text_to_word_sequence(sentence)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

In [8]:
nlp = spacy.load('en_core_web_sm')

In [9]:
sentence = 'The quick brown fox jumped over the lazy dog'
doc = nlp(sentence)

In [10]:
for token in doc:
    print(token)

The
quick
brown
fox
jumped
over
the
lazy
dog


## Vectorisation

Once we have tokens, we need a way to represent them as vectors. Let's look at two traditional families of representations:

- Frequency Based
    - Binary
    - Count 
    - TF-IDF
    - Co-occurrence (Skipgram)
- Prediction Based
    - Pre-trained Vectors
    - Learning Vectors
    - Learning vectors with the task
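
Of the frequency-based options, co-occurrence is the one not demonstrated later in this post, so here is a rough sketch of the idea: count how often two words appear within a fixed distance of each other. The window size of 1 is an arbitrary choice for illustration.

```python
from collections import Counter

words = "the quick brown fox jumped over the lazy dog".split()
window = 1  # assumed window size: only immediate neighbours count

cooccurrence = Counter()
for i, word in enumerate(words):
    lo = max(0, i - window)
    hi = min(len(words), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            # Count the ordered pair (centre word, neighbouring word)
            cooccurrence[(word, words[j])] += 1

print(cooccurrence[("the", "quick")])   # "quick" follows the first "the"
print(cooccurrence[("the", "lazy")])    # "lazy" follows the second "the"
print(cooccurrence[("quick", "dog")])   # never neighbours, so the count is 0
```

Each row of the resulting word-by-word count table can then serve as a vector for that word.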

### One-Hot Encoding 

In [12]:
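# Despite its name, one_hot returns one hashed integer index per word (in 1..n-1), not a one-hot vector
# With a small n, different words can collide on the same index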
one_hot(sentence, n=10)

[5, 3, 3, 7, 8, 8, 5, 9, 9]
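
Note that the result above is a list of integer indices, and with `n=10` several different words share an index. For comparison, here is a minimal sketch of a literal one-hot encoding, where every distinct word gets its own position in a vector. The names `vocab` and `one_hot_vector` are made up for this example.

```python
words = "the quick brown fox jumped over the lazy dog".split()

# Vocabulary: each distinct word gets its own position, in order of first appearance
vocab = {}
for word in words:
    vocab.setdefault(word, len(vocab))

def one_hot_vector(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

for word in words[:3]:
    print(word, one_hot_vector(word))
```

With 8 distinct words, each vector has length 8, and no two words share a position.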

In [13]:
# Given a size of vocabulary, do hash encoding (to save space)
hashing_trick(sentence, n=100, hash_function="md5")

[51, 13, 19, 11, 7, 95, 51, 74, 33]
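
The idea behind the hashing trick can be sketched with `hashlib`: hash each word and fold the result into a fixed range, so no vocabulary has to be stored. The formula below (MD5 digest modulo `n - 1`, plus 1) reflects my understanding of what `hashing_trick` does with `hash_function="md5"`; treat it as an assumption rather than the library's exact implementation.

```python
import hashlib

def md5_hash_index(word, n):
    """Map a word to a stable index in 1..n-1 using its MD5 digest."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    # Index 0 is left unused, so fold the hash into n - 1 buckets and shift by 1
    return int(digest, 16) % (n - 1) + 1

words = "the quick brown fox jumped over the lazy dog".split()
indices = [md5_hash_index(w, n=100) for w in words]
print(indices)
```

The same word always maps to the same index (both occurrences of "the" agree), while two different words may collide if `n` is small.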

> Tip: Using the Tokenizer API

In [14]:
from keras.preprocessing.text import Tokenizer

In [15]:
# Instantiate the Tokenizer
simple_tokenizer = Tokenizer()

In [16]:
# Fit the Tokenizer
simple_tokenizer.fit_on_texts([sentence])

In [30]:
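def get_sentence_vectors(sentences, tokenizer, mode="binary"):
    """Return one row per sentence and one named column per vocabulary word."""
    matrix = tokenizer.texts_to_matrix(sentences, mode=mode)
    df = pd.DataFrame(matrix)
    # Column 0 is never assigned to a word (indices start at 1), so drop it
    df = df.drop(columns=0)
    # word_index is ordered by index (1, 2, ...), matching the remaining columns
    df.columns = list(tokenizer.word_index)
    return df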

In [17]:
# See the word-to-index mapping
simple_tokenizer.word_index

{'brown': 3,
 'dog': 8,
 'fox': 4,
 'jumped': 5,
 'lazy': 7,
 'over': 6,
 'quick': 2,
 'the': 1}

Normally we work with a collection of texts (such as several sentences), so it is more convenient to use the Tokenizer API.

In [31]:
sentences = ['The quick brown fox jumped over the lazy dog', 
             'The dog woke up lazily and barked at the fox',
             'The fox looked back and just ignored the dog']

In [32]:
# Instantiate and Fit
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

In [33]:
tokenizer.word_index

{'and': 4,
 'at': 14,
 'back': 16,
 'barked': 13,
 'brown': 6,
 'dog': 3,
 'fox': 2,
 'ignored': 18,
 'jumped': 7,
 'just': 17,
 'lazily': 12,
 'lazy': 9,
 'looked': 15,
 'over': 8,
 'quick': 5,
 'the': 1,
 'up': 11,
 'woke': 10}

In [34]:
tokenizer.texts_to_sequences(sentences)

[[1, 5, 6, 2, 7, 8, 1, 9, 3],
 [1, 3, 10, 11, 12, 4, 13, 14, 1, 2],
 [1, 2, 15, 16, 4, 17, 18, 1, 3]]
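
To see where these numbers come from, here is a plain-Python sketch that ranks words by how often they occur (ties keep their order of first appearance) and then replaces each word with its rank. It is meant as an illustration of the idea, not as the Tokenizer's actual code, and it skips punctuation filtering because the example sentences contain none.

```python
from collections import Counter

sentences = [
    "The quick brown fox jumped over the lazy dog",
    "The dog woke up lazily and barked at the fox",
    "The fox looked back and just ignored the dog",
]

# Lower-case and split each sentence into word tokens
tokenised = [s.lower().split() for s in sentences]

# Count every word across all sentences, then rank from most to least frequent
counts = Counter(word for words in tokenised for word in words)
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace each word by its index
sequences = [[word_index[w] for w in words] for words in tokenised]
print(sequences[0])  # [1, 5, 6, 2, 7, 8, 1, 9, 3]
```

Frequent words such as "the" get small indices, which is why index 1 appears twice in every sequence.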

In [35]:
tokenizer.texts_to_matrix(sentences, mode="binary")

array([[0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        1., 1., 1.]])
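
Each row of this matrix is one sentence and each column is one word index, with column 0 left unused. A rough sketch of how the `binary` and `count` modes could be built by hand follows; the helper `to_matrix` is hypothetical and uses plain lists instead of NumPy arrays.

```python
from collections import Counter

sentences = [
    "The quick brown fox jumped over the lazy dog",
    "The dog woke up lazily and barked at the fox",
    "The fox looked back and just ignored the dog",
]
tokenised = [s.lower().split() for s in sentences]
counts = Counter(word for words in tokenised for word in words)
word_index = {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_matrix(docs, mode="binary"):
    """Build one row per document; column j describes the word with index j."""
    rows = []
    for words in docs:
        row = [0.0] * (len(word_index) + 1)  # column 0 stays empty
        for word, c in Counter(words).items():
            # binary: was the word present? count: how many times?
            row[word_index[word]] = 1.0 if mode == "binary" else float(c)
        rows.append(row)
    return rows

binary = to_matrix(tokenised, "binary")
count = to_matrix(tokenised, "count")
print(binary[0][:6])  # [0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
print(count[0][:6])   # [0.0, 2.0, 1.0, 1.0, 0.0, 1.0]
```

The only difference between the two modes in this example is the column for "the", which occurs twice in every sentence.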

In [52]:
_x = get_sentence_vectors(sentences, tokenizer, mode="tfidf")
_x

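| | the | fox | dog | and | quick | brown | jumped | over | lazy | woke | up | lazily | barked | at | looked | back | just | ignored |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.947512 | 0.559616 | 0.559616 | 0.0 | 0.916291 | 0.916291 | 0.916291 | 0.916291 | 0.916291 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.947512 | 0.559616 | 0.559616 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.916291 | 0.916291 | 0.916291 | 0.916291 | 0.916291 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.947512 | 0.559616 | 0.559616 | 0.693147 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.916291 | 0.916291 | 0.916291 | 0.916291 |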
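
The TF-IDF values shown above can be reproduced by hand. The sketch below assumes the weighting is `tf = 1 + ln(count)` and `idf = ln(1 + N / (1 + df))`, where `N` is the number of sentences and `df` is the number of sentences containing the word. That formula is my reading of the Keras Tokenizer and should be treated as an assumption; the function name is hypothetical.

```python
import math

def keras_style_tfidf(term_count, doc_freq, n_docs):
    """TF-IDF weight for a word occurring term_count times in one sentence
    and appearing in doc_freq of the n_docs sentences."""
    if term_count == 0:
        return 0.0
    tf = 1 + math.log(term_count)
    idf = math.log(1 + n_docs / (1 + doc_freq))
    return tf * idf

n_docs = 3
tfidf_the = keras_style_tfidf(2, 3, n_docs)    # twice per sentence, in all 3 sentences
tfidf_fox = keras_style_tfidf(1, 3, n_docs)    # once per sentence, in all 3 sentences
tfidf_and = keras_style_tfidf(1, 2, n_docs)    # once, in 2 of the 3 sentences
tfidf_quick = keras_style_tfidf(1, 1, n_docs)  # once, in 1 sentence only

print(round(tfidf_the, 6), round(tfidf_fox, 6), round(tfidf_and, 6), round(tfidf_quick, 6))
```

Words that appear in fewer sentences get a larger `idf`, which is why "quick" outweighs "fox" even though both occur once in the first sentence.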


In [53]:
_x = _x.rename_axis('sentence').reset_index().melt(id_vars=['sentence'])
_x.head(10)

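| | sentence | variable | value |
|---|---|---|---|
| 0 | 0 | the | 0.947512 |
| 1 | 1 | the | 0.947512 |
| 2 | 2 | the | 0.947512 |
| 3 | 0 | fox | 0.559616 |
| 4 | 1 | fox | 0.559616 |
| 5 | 2 | fox | 0.559616 |
| 6 | 0 | dog | 0.559616 |
| 7 | 1 | dog | 0.559616 |
| 8 | 2 | dog | 0.559616 |
| 9 | 0 | and | 0.0 |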


In [55]:
alt.Chart(_x).mark_rect().encode(
    x=alt.X('variable:N', title="word"),
    y=alt.Y('sentence:N', title="sentence"),
    color=alt.Color('value:Q', title="tfidf")
).properties(
    width=700
).interactive()

In [56]:
# Pass the list of sentences; a single string would be iterated character by character
one_hot_results = tokenizer.texts_to_matrix(sentences, mode='binary')

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 18 unique tokens.
