# PyData Piraeus - Sep, 2019

## NLU Data Science


### Language

- Python 3

### Tools

- pandas
- nltk
- gensim
- scikit-learn
- pytorch

In [1]:
# libraries
from gensim.models.keyedvectors import KeyedVectors
from nltk import word_tokenize
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from typing import List

from utils.deep_learning import train, evaluate

In [2]:
# inputs
fname_snips_train = './data/snips_train.csv'
fname_snips_test = './data/snips_test.csv'
fname_quora_train = './data/quora_train.csv'
fname_quora_test = './data/quora_test.csv'
fname_embeddings = '/workspace/deepNLU/mLearningNLU/embeddings/pretrained/GloVe/glove.840B.300d.bin'
cache_dir = '/workspace/deepNLU/mLearningNLU/models/pytorch-transformers'
epochs = 20

In [3]:
print(f'Loading static embeddings: {fname_embeddings}')
embeddings = KeyedVectors.load_word2vec_format(fname_embeddings, binary=True)

Loading static embeddings: /workspace/deepNLU/mLearningNLU/embeddings/pretrained/GloVe/glove.840B.300d.bin


# Text Classification

- Integral component of NLU.
- Broad range of applications:
    - Chatbot
    - Spam detection

## 1. Chatbot - Intent Classification
- Software that conducts human-like conversations.
- Our chatbot handles intent classification only.
- Why Python?
    - Powerful computational frameworks (NumPy, TensorFlow, PyTorch)
    - Active, ever-growing community
    - Open source and research-friendly

### 1.1 Data Loading
SNIPS dataset
- 7 intents
- 70 utterances in the training set (10 per intent)
- 700 utterances in the test set (100 per intent)

**pandas** library
- ease of use
- optimized for tabular data

In [4]:
print(f'Loading datasets...\n{fname_snips_train}\n{fname_snips_test}\n')
train_set = pd.read_csv(fname_snips_train, sep=';', header=None, names=['sample', 'class'])
test_set = pd.read_csv(fname_snips_test, sep=';', header=None, names=['sample', 'class'])

for the_class in train_set['class'].unique():
    print(f"{the_class}:\t{train_set[train_set['class'] == the_class]['sample'].iloc[0]}")

print(f"\nTraining set distribution:\n{train_set['class'].value_counts().sort_index()}\n"
      f"\nTest set distribution:\n{test_set['class'].value_counts().sort_index()}")

Loading datasets...
./data/snips_train.csv
./data/snips_test.csv

GetWeather:	what do the cloud indicate in East Aurora
BookRestaurant:	I need a table at a restaurant serving carne pizzaiola for tamra davis, viola and dorothea
AddToPlaylist:	Add artist Matt Noveskey to journey
RateBook:	Rate the Under the Sign of Saturn 0 of 6
PlayMusic:	Play me a top-ten song by Phil Ochs on Groove Shark
SearchScreeningEvent:	Find movie schedule for animated movies in the area
SearchCreativeWork:	Please look up the television show, Noel Hill & Tony Linnane.

Training set distribution:
AddToPlaylist           10
BookRestaurant          10
GetWeather              10
PlayMusic               10
RateBook                10
SearchCreativeWork      10
SearchScreeningEvent    10
Name: class, dtype: int64

Test set distribution:
AddToPlaylist           100
BookRestaurant          100
GetWeather              100
PlayMusic               100
RateBook                100
SearchCreativeWork      100
SearchScreeningEv

### 1.2 Text to vectors
How do we feed textual data into a machine learning model? &rightarrow; Transform it into a numerical representation!

**scikit-learn** bag-of-words:
- Each word in the vocabulary &rightarrow; One dimension in the vector space
- Every sentence &rightarrow; Vector of the same, constant dimensionality

A vector in ML:
- Series of numbers
- Each position &rightarrow; Intrinsic property of the data

In [5]:
vectorizer = CountVectorizer()
# learn the vocabulary from the training utterances only
vectorizer.fit(train_set['sample'])
# one dense count vector per utterance
train_set['vectors'] = train_set['sample'].apply(
    lambda row: vectorizer.transform([row]).toarray()[0, :])
test_set['vectors'] = test_set['sample'].apply(
    lambda row: vectorizer.transform([row]).toarray()[0, :])

print(f"{train_set['sample'].iloc[0]}:\n{train_set['vectors'].iloc[0]}")

what do the cloud indicate in East Aurora:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
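
To make the bag-of-words idea concrete, here is a minimal sketch in plain Python of what happens conceptually: one dimension per vocabulary word, and a count in each position. The toy sentences and the helper names `build_vocabulary` and `to_count_vector` are illustrative assumptions, not part of scikit-learn, and the whitespace tokenization is a simplification (`CountVectorizer` tokenizes differently in its details).

```python
from typing import Dict, List


def build_vocabulary(sentences: List[str]) -> Dict[str, int]:
    """Map every distinct lowercase word to one dimension (column index)."""
    words = sorted({word for sentence in sentences for word in sentence.lower().split()})
    return {word: index for index, word in enumerate(words)}


def to_count_vector(sentence: str, vocabulary: Dict[str, int]) -> List[int]:
    """Count vocabulary words in a sentence; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical toy utterances, not taken from SNIPS
toy_sentences = ["play some music", "play the next song", "book a table"]
toy_vocabulary = build_vocabulary(toy_sentences)
toy_vectors = [to_count_vector(s, toy_vocabulary) for s in toy_sentences]

for sentence, vector in zip(toy_sentences, toy_vectors):
    print(f"{sentence}: {vector}")
```

Every sentence ends up with the same length (the vocabulary size), regardless of how many words it contains, which is what lets a classifier consume it.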


### 1.3 Machine Learning
**scikit-learn** logistic regression

In [6]:
logit = LogisticRegression(solver='lbfgs', multi_class='auto', n_jobs=-1)

print('Training...')
logit.fit(X=train_set['vectors'].tolist(), y=train_set['class'].tolist())

print('Evaluating...')
predictions = logit.predict(X=test_set['vectors'].tolist())
accuracy = accuracy_score(y_true=test_set['class'].tolist(), y_pred=predictions)
report = classification_report(y_true=test_set['class'].tolist(), y_pred=predictions)
print(f'Accuracy: {100 * accuracy:.2f}%\n{report}')

Training...
Evaluating...
Accuracy: 86.86%
                      precision    recall  f1-score   support

       AddToPlaylist       0.90      0.88      0.89       100
      BookRestaurant       0.90      0.98      0.94       100
          GetWeather       0.83      0.96      0.89       100
           PlayMusic       0.97      0.91      0.94       100
            RateBook       0.83      0.85      0.84       100
  SearchCreativeWork       0.76      0.86      0.81       100
SearchScreeningEvent       0.96      0.64      0.77       100

            accuracy                           0.87       700
           macro avg       0.88      0.87      0.87       700
        weighted avg       0.88      0.87      0.87       700
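
As a reading aid for the report above: the f1-score column is the harmonic mean of precision and recall. A minimal sketch, using the rounded values printed for `SearchScreeningEvent` (the helper name `f1_score_from` is an illustrative assumption, and because the inputs are already rounded the result only approximates what scikit-learn computed internally):

```python
def f1_score_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the SearchScreeningEvent row of the report
f1 = f1_score_from(precision=0.96, recall=0.64)
print(f"{f1:.2f}")  # prints 0.77
```

High precision with low recall means the model is rarely wrong when it predicts this intent, but it misses many utterances that belong to it.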



How does our chatbot compare to the most popular intent detection tools?
![SNIPS_comparison](https://i.ibb.co/3mhfHYF/SNIPS-comparison.png)

Ready for a **harder NLU problem**?

## 2. Spam detection
**Quora** - Kaggle spam detection competition

- Two (2) types of questions:
    - sincere
    - insincere
- Difficult to distinguish with a rule-based approach
- Original training set split into training/test subsets
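
The notebook does not show how the original training set was split, so the following is only a sketch of one common approach: shuffle with a fixed seed, then cut. The function name `shuffled_split`, the seed, and the toy rows are all assumptions for illustration.

```python
import random
from typing import List, Sequence, Tuple


def shuffled_split(rows: Sequence, test_fraction: float, seed: int = 0) -> Tuple[List, List]:
    """Shuffle a copy of the rows reproducibly and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (question, label) pairs standing in for the Quora data
toy_rows = [(f"question {i}", i % 2) for i in range(10)]
toy_train, toy_test = shuffled_split(toy_rows, test_fraction=0.5)
print(len(toy_train), len(toy_test))
```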

### 2.1 Data Loading

In [15]:
print(f'Loading datasets...\n{fname_quora_train}\n{fname_quora_test}\n')
train_set = pd.read_csv(fname_quora_train, sep=';', header=None, names=['sample', 'class'])
test_set = pd.read_csv(fname_quora_test, sep=';', header=None, names=['sample', 'class'])

for the_class in train_set['class'].unique():
    print(f"{the_class}:\n{train_set[train_set['class'] == the_class]['sample'].iloc[10]}\n")

print(f"\nTraining set distribution:\n{train_set['class'].value_counts().sort_index()}\n"
      f"\nTest set distribution:\n{test_set['class'].value_counts().sort_index()}")

Loading datasets...
./data/quora_train.csv
./data/quora_test.csv

0:
How do I prepare for a software engineering job internship interview?

1:
Should China always walk three steps behind the US so that the US maintains its status in the world?


Training set distribution:
0    495
1    505
Name: class, dtype: int64

Test set distribution:
0    512
1    488
Name: class, dtype: int64


### 2.2 Text to vectors
**scikit-learn** bag-of-words

In [16]:
# re-fit the vocabulary on the Quora training set
vectorizer.fit(train_set['sample'])
train_set['vectors'] = train_set['sample'].apply(
    lambda row: vectorizer.transform([row]).toarray()[0, :])
test_set['vectors'] = test_set['sample'].apply(
    lambda row: vectorizer.transform([row]).toarray()[0, :])

### 2.3 Machine Learning
**scikit-learn** logistic regression

In [9]:
print('Training...')
logit.fit(X=train_set['vectors'].tolist(), y=train_set['class'].tolist())

print('Evaluating...')
predictions = logit.predict(X=test_set['vectors'].tolist())
accuracy = accuracy_score(y_true=test_set['class'].tolist(), y_pred=predictions)
report = classification_report(y_true=test_set['class'].tolist(), y_pred=predictions)
print(f'Accuracy: {100 * accuracy:.2f}%\n{report}')

Training...
Evaluating...
Accuracy: 81.20%
              precision    recall  f1-score   support

           0       0.79      0.85      0.82       512
           1       0.83      0.77      0.80       488

    accuracy                           0.81      1000
   macro avg       0.81      0.81      0.81      1000
weighted avg       0.81      0.81      0.81      1000



71% (F1 score) for the winning Kaggle team &rightarrow; no direct comparison (different metric and test set)

But can we **do better**?

## 3. GloVe Embeddings

- Statistics on a large corpus &rightarrow; Word co-occurrence matrix within a fixed window
- Semantic similarity translates to proximity in the vector space
- **gensim** library

In [10]:
# nearest neighbors
embeddings.most_similar('man', topn=5)

[('woman', 0.7401745319366455),
 ('guy', 0.7067893743515015),
 ('boy', 0.7045701742172241),
 ('he', 0.6831111907958984),
 ('men', 0.6729365587234497)]
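
The scores returned by `most_similar` are cosine similarities. A minimal NumPy sketch of that measure, using hypothetical 3-dimensional vectors (these are made-up numbers, not real GloVe values, and `cosine_similarity` is an illustrative helper rather than a gensim function):

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy embeddings for illustration only
toy = {
    "man": np.array([1.0, 0.9, 0.1]),
    "woman": np.array([0.9, 1.0, 0.1]),
    "table": np.array([0.1, 0.0, 1.0]),
}

sim_woman = cosine_similarity(toy["man"], toy["woman"])
sim_table = cosine_similarity(toy["man"], toy["table"])
print(f"man~woman: {sim_woman:.3f}, man~table: {sim_table:.3f}")
```

Words whose vectors point in similar directions get a score close to 1, which is how "proximity in the vector space" is measured here.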

### 3.1 Text to vectors
- Tokenize sentences
- Assign embeddings to tokens
- Sentence &rightarrow; **array of embeddings**

In [11]:
print('Tokenizing...')
train_set['tokens'] = train_set['sample'].apply(lambda row: word_tokenize(row.lower()))
test_set['tokens'] = test_set['sample'].apply(lambda row: word_tokenize(row.lower()))

print('Assigning embeddings to tokens...')
def token_to_vector(tokens: List[str], embeddings: KeyedVectors):
    # out-of-vocabulary tokens are mapped to a zero vector
    return np.array([embeddings[token] if token in embeddings
                     else np.zeros(embeddings.vector_size) for token in tokens])

train_set['token_vectors'] = train_set['tokens'].apply(lambda row: token_to_vector(row, embeddings))
test_set['token_vectors'] = test_set['tokens'].apply(lambda row: token_to_vector(row, embeddings))

print(f"{train_set['sample'].iloc[0]} -> {train_set['tokens'].iloc[0]}\n"
      f"{train_set['token_vectors'].iloc[0]}")

Tokenizing...
Assigning embeddings to tokens...
How can Thanos be killed? -> ['how', 'can', 'thanos', 'be', 'killed', '?']
[[-0.23205   0.47468  -0.38264  ...  0.33178   0.31545   0.37972 ]
 [-0.23857   0.35457  -0.30219  ... -0.35283   0.41888   0.13168 ]
 [ 0.27719  -0.08278   0.38441  ... -0.20608   0.24344  -0.12833 ]
 [-0.059177  0.10653  -0.21613  ... -0.42755   0.024956  0.02466 ]
 [-0.60482   0.37399   0.28446  ... -0.13601   0.30378  -0.084864]
 [-0.086864  0.19161   0.10915  ... -0.01516   0.11108   0.2065  ]]


### 3.2 Sentence vectors
- Sentences &rightarrow; **variable size** arrays
- Average-pooling &rightarrow; Fixed-size vectors
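
The two steps (token lookup, then average-pooling) can be sketched on a toy scale. Everything below is an assumption for illustration: the 2-dimensional `toy_embeddings` dictionary stands in for the GloVe model, and `tokens_to_matrix` and `average_pool` are hypothetical helper names.

```python
from typing import Dict, List

import numpy as np

# Hypothetical 2-dimensional embeddings standing in for GloVe
toy_embeddings: Dict[str, np.ndarray] = {
    "how": np.array([1.0, 0.0]),
    "are": np.array([0.0, 1.0]),
    "you": np.array([1.0, 1.0]),
}
VECTOR_SIZE = 2


def tokens_to_matrix(tokens: List[str]) -> np.ndarray:
    """Stack one embedding per token; unknown tokens become zero vectors."""
    return np.array([toy_embeddings.get(token, np.zeros(VECTOR_SIZE)) for token in tokens])


def average_pool(matrix: np.ndarray) -> np.ndarray:
    """Collapse a (num_tokens, vector_size) array into one fixed-size vector."""
    return matrix.mean(axis=0)


short_matrix = tokens_to_matrix(["how", "are", "you"])
long_matrix = tokens_to_matrix(["how", "are", "you", "thanos"])  # last token is unknown

print(short_matrix.shape, long_matrix.shape)  # different numbers of rows
print(average_pool(short_matrix).shape, average_pool(long_matrix).shape)  # same size
```

One design consequence worth noting: a zero vector for an unknown token still counts in the mean, so it pulls the sentence vector toward the origin.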

In [12]:
print('Sentence encoding...')
train_set['sentence_vectors'] = train_set['token_vectors'].apply(np.mean, axis=0)
test_set['sentence_vectors'] = test_set['token_vectors'].apply(np.mean, axis=0)

print(f"{train_set['sample'].iloc[0]}\n{train_set['sentence_vectors'].iloc[0]}")

Sentence encoding...
How can Thanos be killed?
[-1.57381833e-01  2.36433327e-01 -2.04900000e-02  2.51661334e-02
 -9.79803782e-03 -2.41743326e-01  1.59131680e-02 -3.55949998e-02
 -1.11650735e-01  1.94314992e+00 -5.77013381e-02 -1.72157988e-01
  4.69354987e-02  1.33230910e-01 -2.00982168e-01 -1.24464661e-01
 -3.41710038e-02  6.82393372e-01 -2.31642172e-01 -1.06247507e-01
  5.78600056e-02 -1.63971677e-01  3.13919969e-02 -2.08065853e-01
 -5.97727261e-02  1.09268343e-02 -3.06570321e-01 -1.30958661e-01
  1.37621328e-01 -5.04691713e-02 -2.55629331e-01 -1.11335002e-01
 -8.62485096e-02  4.54606600e-02 -1.57813337e-02  1.22769969e-02
  1.18856668e-01  5.33842444e-02 -8.99899378e-02  7.60633424e-02
 -7.48885944e-02 -2.38888338e-01 -3.25999409e-03 -2.11189330e-01
  7.31329992e-02  3.64403278e-02 -1.24055006e-01 -1.29284158e-01
 -1.09186850e-01  1.51953176e-01 -5.15733426e-03  3.20708267e-02
  5.78449154e-03 -7.80333802e-02  3.40656415e-02  7.40098357e-02
 -1.38135150e-01 -3.84733342e-02  1.7090784

### 3.3 Machine Learning
**scikit-learn** logistic regression

In [13]:
print('Training...')
logit.fit(X=train_set['sentence_vectors'].tolist(), y=train_set['class'].tolist())

print('Evaluating...')
predictions = logit.predict(X=test_set['sentence_vectors'].tolist())
accuracy = accuracy_score(y_true=test_set['class'].tolist(), y_pred=predictions)
report = classification_report(y_true=test_set['class'].tolist(), y_pred=predictions)
print(f'Accuracy: {100 * accuracy:.2f}%\n{report}')

Training...
Evaluating...
Accuracy: 86.80%
              precision    recall  f1-score   support

           0       0.86      0.88      0.87       512
           1       0.87      0.85      0.86       488

    accuracy                           0.87      1000
   macro avg       0.87      0.87      0.87      1000
weighted avg       0.87      0.87      0.87      1000



Embeddings **>** bag-of-words

*Note:* Neither approach accounts for word order.
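
A quick way to see the word-order limitation: two sentences with opposite meanings but the same words produce identical word counts and identical averaged vectors. The 2-dimensional `toy_embeddings` below are made-up values for illustration, and `mean_vector` is a hypothetical helper.

```python
from collections import Counter

first = "dog bites man".split()
second = "man bites dog".split()

# Bag-of-words only keeps counts, so the two sentences look the same
print(Counter(first) == Counter(second))  # prints True

# Hypothetical toy embeddings; averaging is order-independent as well
toy_embeddings = {"dog": (1.0, 0.0), "bites": (0.0, 1.0), "man": (1.0, 1.0)}


def mean_vector(tokens):
    """Average the per-token vectors dimension by dimension."""
    columns = zip(*(toy_embeddings[token] for token in tokens))
    return tuple(sum(column) / len(tokens) for column in columns)


print(mean_vector(first) == mean_vector(second))  # prints True
```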

## 4. Transfer Learning
- Deep learning requires huge amounts of data
- Transfer Learning
    - Deep learning model pre-trained on an enormous corpus
    - State-of-the-art &rightarrow; **BERT** (340 million parameters)
    - HuggingFace **pytorch-transformers** library
    - Task-specific classifier attached on top of the large model
    - Fine-tuning on the downstream task
- Needs more computational resources than the statistical models above

In [14]:
print('Training deep learning model...')
deep_model = train(features=train_set['sample'], labels=train_set['class'], cache_dir=cache_dir,
                   epochs=epochs)

print('Evaluating deep learning model...')
predictions = evaluate(model=deep_model, features=test_set['sample'], cache_dir=cache_dir)
accuracy = accuracy_score(y_true=test_set['class'].tolist(), y_pred=predictions)
report = classification_report(y_true=test_set['class'].tolist(), y_pred=predictions)
print(f'Accuracy: {100 * accuracy:.2f}%\n{report}')

Training deep learning model...


Epoch: 01/20 | Loss: 0.5174 | LR: 4.75E-05: 100%|██████████| 125/125 [00:30<00:00,  4.14it/s]
Epoch: 02/20 | Loss: 0.4667 | LR: 4.50E-05: 100%|██████████| 125/125 [00:29<00:00,  4.19it/s]
Epoch: 03/20 | Loss: 0.3983 | LR: 4.25E-05: 100%|██████████| 125/125 [00:29<00:00,  4.22it/s]
Epoch: 04/20 | Loss: 0.3539 | LR: 4.00E-05: 100%|██████████| 125/125 [00:29<00:00,  4.23it/s]
Epoch: 05/20 | Loss: 0.3054 | LR: 3.75E-05: 100%|██████████| 125/125 [00:29<00:00,  4.24it/s]
Epoch: 06/20 | Loss: 0.2729 | LR: 3.50E-05: 100%|██████████| 125/125 [00:29<00:00,  4.24it/s]
Epoch: 07/20 | Loss: 0.2451 | LR: 3.25E-05: 100%|██████████| 125/125 [00:29<00:00,  4.25it/s]
Epoch: 08/20 | Loss: 0.2199 | LR: 3.00E-05: 100%|██████████| 125/125 [00:29<00:00,  4.25it/s]
Epoch: 09/20 | Loss: 0.2017 | LR: 2.75E-05: 100%|██████████| 125/125 [00:29<00:00,  4.24it/s]
Epoch: 10/20 | Loss: 0.1857 | LR: 2.50E-05: 100%|██████████| 125/125 [00:29<00:00,  4.24it/s]
Epoch: 11/20 | Loss: 0.1719 | LR: 2.25E-05: 100%|██████████|

Evaluating deep learning model...


100%|██████████| 8/8 [00:06<00:00,  1.19it/s]

Accuracy: 88.90%
              precision    recall  f1-score   support

           0       0.89      0.90      0.89       512
           1       0.89      0.88      0.89       488

    accuracy                           0.89      1000
   macro avg       0.89      0.89      0.89      1000
weighted avg       0.89      0.89      0.89      1000
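
A side note on the training log: the learning rates it prints (4.75E-05, 4.50E-05, 4.25E-05, ...) are consistent with a linear decay from 5e-5 down to zero over the 20 epochs. The scheduler inside `utils.deep_learning` is not shown, so the initial rate and the per-epoch formula below are inferred assumptions, and `linear_decay` is a hypothetical helper.

```python
def linear_decay(initial_lr: float, epoch: int, total_epochs: int) -> float:
    """Learning rate after `epoch` completed epochs of a linear decay to zero."""
    return initial_lr * (1 - epoch / total_epochs)


# Assumed values: 5e-5 is inferred from the log, 20 matches `epochs`
schedule = [linear_decay(5e-5, epoch, 20) for epoch in range(1, 21)]
print(", ".join(f"{lr:.2E}" for lr in schedule[:3]))  # prints 4.75E-05, 4.50E-05, 4.25E-05
```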






# Recap
- Started from a simple baseline (bag-of-words): 81.20% accuracy on the Quora subset
- Enriched the representation with word embeddings: 86.80%
- Fine-tuning BERT on the downstream task gave the best result: 88.90%