# Natural Language Processing (NLP) with TensorFlow

Inspired by the Stanford University course CS224N, we explore some of the techniques currently common in Natural Language Processing and Deep Learning.

In this notebook we classify Twitter data: we build a model that predicts the author of a tweet from its text.

## Table of Contents
* **[Getting the data](#getting-the-data)**
* **[Preprocessing](#preprocessing)**
  * [Concatenate dataframes](#concatenate)
  * [Format the text](#format-text)
* **[Training and validation data](#training-validation-data)**
  * [Create feature set](#feature-set)
  * [Create label set](#label-set)
  * [Create train and test sets](#train-test-split)
  * [Create validation set](#validation-set)
* **[Build the model](#build-model)**
  * [Loss function and optimizer](#loss-function-and-optimizer)
* **[Train the model](#train-model)**
* **[Evaluate the model](#evaluate-model)**
  * [Plot accuracy and loss over time](#plot-accuracy-and-loss)

In [1]:
# Header
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

In [2]:
# The usual suspects ...
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns

from numpy import array
from numpy import argmax
from nltk.tokenize import wordpunct_tokenize
from gensim import corpora
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

<a id='getting-the-data'></a>
### Getting the data

In [3]:
# Params
ENCODING = 'ISO-8859-1'
USECOLS = ['text', 'user']

# Trump
trump = pd.read_csv('../data/realDonaldTrump.csv', encoding=ENCODING, usecols=USECOLS)
# Obama
obama = pd.read_csv('../data/BarackObama.csv', encoding=ENCODING, usecols=USECOLS)
# Senators
senators = pd.read_csv('../data/senators.csv', encoding=ENCODING, usecols=USECOLS)

In [None]:
# Save the data in 'utf-8' encoding (index=False keeps the row index out of the files)
trump.to_csv('../data/realDonaldTrump.csv', encoding='utf-8', index=False)
obama.to_csv('../data/BarackObama.csv', encoding='utf-8', index=False)

<a id='preprocessing'></a>
### Preprocessing


<a id='concatenate'></a>
#### Concatenate dataframes

In [4]:
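# Stack the three dataframes; ignore_index=True gives the result a fresh 0..n-1 index
df = pd.concat([trump, obama, senators], ignore_index=True)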

<a id='format-text'></a>
#### Format the text

In [5]:
# Text corpus
def create_document_corpus(df, column):
    '''Returns the values of `column` as a list of documents.'''
    return df[column].tolist()

# Lowercase and tokenize on whitespace (no stop-word removal is applied)
def tokenize(document_corpus):
    '''Lowercases each document and splits it on whitespace.'''
    return [doc.lower().split() for doc in document_corpus]

# Formatting
document = create_document_corpus(df, 'text')
tokenised_doc = tokenize(document)
dictionary = corpora.Dictionary(tokenised_doc)

In [6]:
print(dictionary)

Dictionary(458047 unique tokens: ['a', 'and', 'beginning,', 'conversation', 'david']...)
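To make the dictionary step concrete, here is a minimal plain-Python sketch of the idea behind `corpora.Dictionary` and `doc2idx`: every distinct token gets an integer id, and each document becomes a list of ids. The toy corpus and the names `toy_corpus`, `token2id` and `toy_ids` are made up for illustration, and gensim may assign ids in a different order than this sketch does.

```python
# Plain-Python sketch of a token-to-id dictionary (illustrative only;
# gensim's corpora.Dictionary may assign ids in a different order).
toy_corpus = ["Make it great", "Yes we can", "We can make it"]

# Same lowercase + whitespace tokenisation as tokenize() in the notebook.
toy_tokens = [doc.lower().split() for doc in toy_corpus]

token2id = {}
for doc in toy_tokens:
    for token in doc:
        # Give each unseen token the next free integer id.
        token2id.setdefault(token, len(token2id))

# Similar in spirit to dictionary.doc2idx(doc): tokens -> ids.
toy_ids = [[token2id[token] for token in doc] for doc in toy_tokens]

print(token2id)
print(toy_ids)
```

Because the text is split on whitespace only, punctuation stays attached to words, which is why a token such as `'beginning,'` shows up in the dictionary printed above.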


<a id='training-validation-data'></a>
### Training & validation data

<a id='feature-set'></a>
#### Create feature set

In [7]:
# Features: map every tokenised tweet to its list of dictionary indices
data = [dictionary.doc2idx(doc) for doc in tokenised_doc]

<a id='label-set'></a>
#### Create label set

In [8]:
# Labels
def encode(dataset):
    print(f'Shape before encoding: {len(dataset)}')
    encoded = tf.keras.utils.to_categorical(dataset)
    print(f'Shape after encoding: {encoded.shape}')
    return encoded

values = array(df['user'].values)
label_encoder = LabelEncoder()
integer_encodings = label_encoder.fit_transform(values)
labels = encode(integer_encodings)

Shape before encoding: 295054
Shape after encoding: (295054, 102)
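The two encoding steps can be illustrated without any library. The following is a minimal sketch, assuming a hypothetical list of user names: it mimics what `LabelEncoder` (sorted unique classes mapped to integers) and `to_categorical` (one row per sample with a single 1.0) do, and shows how a one-hot row is decoded back to a name.

```python
# Plain-Python sketch of label encoding followed by one-hot encoding.
# The user names below are hypothetical stand-ins for df['user'].
toy_users = ["obama", "trump", "obama", "senator_a"]

# Like LabelEncoder: sorted unique classes -> integer codes.
classes = sorted(set(toy_users))
class_to_int = {name: i for i, name in enumerate(classes)}
int_codes = [class_to_int[name] for name in toy_users]

# Like to_categorical: one row per sample, a single 1.0 at the class index.
one_hot = [[1.0 if j == code else 0.0 for j in range(len(classes))]
           for code in int_codes]

# Decoding: position of the largest value -> class name.
decoded = [classes[row.index(max(row))] for row in one_hot]

print(int_codes)
print(one_hot)
print(decoded)
```

In the notebook itself the same decoding would presumably go through an argmax over a prediction row followed by `label_encoder.inverse_transform`.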


<a id='train-test-split'></a>
#### Create train and test sets

In [9]:
# Train-test split
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=.25, shuffle=True)
print(f'X-train: {len(x_train)}, Y-train: {len(y_train)}')
print(f'X-test: {len(x_test)}, Y-test: {len(y_test)}')

X-train: 221290, Y-train: 221290
X-test: 73764, Y-test: 73764


In [10]:
# Pad the data
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train,
                                                        value=0,
                                                        padding='post',
                                                        maxlen=45)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=45)
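As a rough illustration of the padding step, the sketch below reproduces in plain Python what `pad_sequences` does with `padding='post'` and its default `truncating='pre'`: short sequences get the fill value appended, and sequences longer than `maxlen` keep only their last `maxlen` items. The helper `pad_post` is a hypothetical name and not part of Keras.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end; if the sequence is too long, keep its last `maxlen` items."""
    trimmed = sequence[-maxlen:]          # 'pre' truncation drops items from the front
    return trimmed + [value] * (maxlen - len(trimmed))

short = pad_post([5, 3, 8], maxlen=5)
long_ = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)
print(long_)
```

Note that the fill value 0 is also a regular dictionary id (the printed dictionary suggests it belongs to the token `'a'`), so the model as written cannot tell padding from that token.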

<a id='validation-set'></a>
#### Create validation set

In [11]:
# Feature set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]

# Label set 
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

<a id='build-model'></a>
### Build the model

The model takes as input an array of word indices (one padded sequence of length 45 per tweet) and predicts which of the 102 Twitter accounts in the label set wrote the tweet.
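The core of this architecture, an embedding lookup followed by averaging over the sequence, can be sketched in a few lines of plain Python. The tiny embedding table and the function `average_embedding` are hypothetical; in the real model the table has one 64-dimensional row per vocabulary entry and its values are learned during training.

```python
# Hypothetical 4-token vocabulary with 2-dimensional embeddings.
embedding_table = [
    [2.0, 2.0],   # id 0 (also used as the padding value; it has its own vector)
    [1.0, 2.0],   # id 1
    [3.0, 4.0],   # id 2
    [5.0, 6.0],   # id 3
]

def average_embedding(token_ids):
    """Look up one vector per token id and average them per dimension."""
    vectors = [embedding_table[i] for i in token_ids]
    dim = len(embedding_table[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

# A padded sequence: two real tokens followed by two padding zeros.
pooled = average_embedding([1, 3, 0, 0])
print(pooled)
```

The average runs over every position, padding included, which matches an `Embedding` layer without masking followed by `GlobalAveragePooling1D`. The result is a fixed-size vector regardless of tweet length, which the dense layers then classify.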

In [16]:
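# Vocabulary size: number of unique tokens in the dictionary (458047)
vocab_size = len(dictionary)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(vocab_size, 64))             # word index -> 64-d vector
model.add(tf.keras.layers.GlobalAveragePooling1D())              # average over the sequence
model.add(tf.keras.layers.Dense(32, activation=tf.nn.relu))
model.add(tf.keras.layers.Flatten())                             # no-op here: input is already 2-D
model.add(tf.keras.layers.Dense(102, activation=tf.nn.softmax))  # one unit per class
model.summary()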

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 64)          29315008  
_________________________________________________________________
global_average_pooling1d_1 ( (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 102)               3366      
Total params: 29,320,454
Trainable params: 29,320,454
Non-trainable params: 0
_________________________________________________________________
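The parameter counts in the summary can be checked by hand. The short sketch below redoes the arithmetic, using only the layer sizes from the model definition; the variable names are illustrative.

```python
# Layer sizes taken from the model definition.
vocab_size, embed_dim, hidden, n_classes = 458047, 64, 32, 102

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# Dense layers: weights (inputs * units) plus one bias per unit.
dense_hidden_params = embed_dim * hidden + hidden
dense_output_params = hidden * n_classes + n_classes

total_params = embedding_params + dense_hidden_params + dense_output_params

print(embedding_params, dense_hidden_params, dense_output_params, total_params)
```

Almost all of the parameters sit in the embedding table, so the size of the vocabulary dominates the size of the model.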


<a id='loss-function-and-optimizer'></a>
#### Loss function and optimizer

In [17]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['acc'])
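For intuition about the loss, here is a minimal plain-Python sketch of softmax followed by categorical cross-entropy for a single sample. The function names and the clipping constant `eps` are assumptions made for this illustration and are not the Keras implementation.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))

probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1.0, 0.0, 0.0], probs)

# Loss of a model that guesses uniformly over 102 classes.
uniform = [1.0 / 102] * 102
chance_loss = categorical_crossentropy([1.0] + [0.0] * 101, uniform)

print(probs, loss, chance_loss)
```

The uniform case gives ln(102), about 4.62, which is a useful reference point: a training loss near that value means the model is doing no better than guessing among the 102 classes.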

<a id='train-model'></a>
### Train the model

In [18]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=1)

Train on 211290 samples, validate on 10000 samples
Epoch 1/40
Epoch 2/40

KeyboardInterrupt: 

<a id='evaluate-model'></a>
### Evaluate the model

In [None]:
results = model.evaluate(x_test, y_test)
print(results)

<a id='plot-accuracy-and-loss'></a>
#### Plot accuracy and loss over time

In [None]:
history_dict = history.history
history_dict.keys()

In [None]:
acc = history_dict['acc']
val_acc = history_dict['val_acc']
loss = history_dict['loss']
val_loss = history_dict['val_loss']

epochs = range(1, len(acc) + 1)

In [None]:
# Plot training and validation accuracy and loss per epoch
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(20, 20), dpi=80, sharex=True)
x = list(epochs)  # pass x and y by keyword; newer seaborn versions require it
sns.lineplot(x=x, y=acc, label='Training Accuracy', color='navy', ax=ax1)
sns.lineplot(x=x, y=val_acc, label='Validation Accuracy', color='brown', ax=ax1)
sns.lineplot(x=x, y=loss, label='Training Loss', color='navy', ax=ax2)
sns.lineplot(x=x, y=val_loss, label='Validation Loss', color='brown', ax=ax2)
ax1.set_title('Training & Validation Accuracy', fontsize=22)
ax2.set_title('Training & Validation Loss', fontsize=22)
ax1.set_ylabel('Accuracy', fontsize=14)
ax2.set_ylabel('Loss', fontsize=14)
ax1.legend()
ax2.legend()
ax1.grid()
ax2.grid()
ax2.set_xlabel('Epochs')