# Text classification using Deep Learning - TensorFlow v2 & Keras - Bi-directional GRU

## Intro
I've been playing with a text classification problem on a movie reviews dataset. First I tried bag-of-words methods: tokenize with nltk, lemmatize, remove stop words, apply TF-IDF, and then train a Random Forest classifier, which reached **56%** accuracy on validation data. That was not as good as I expected, so I wanted to see what deep nets could get out of the data. My best result was **68%** accuracy on validation data and 65% accuracy on test data (submission).

What I really loved here is that there was no need to preprocess and clean the data (remove stop words, lemmatize, stem, ...) as with the BOW method. However, the strings still have to be transformed into integer sequences of uniform length.

## Load libraries and data

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GRU, Dropout, Bidirectional, SpatialDropout1D
from tensorflow.keras.utils import to_categorical

In [None]:
df_train = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/train.tsv.zip', sep='\t', usecols=['Phrase', 'Sentiment'])
df_submission = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/test.tsv.zip', sep='\t', usecols=['Phrase'])

## Split data into training and testing set
It's worth mentioning that we do not strictly need this step: Keras can hold out part of the training data itself and use it as a validation set (the `validation_split` argument of `model.fit`).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_train['Phrase'].values, df_train['Sentiment'].values, test_size=0.1)
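
As a rough illustration of the alternative mentioned above: to my understanding, `validation_split` holds out the last share of the rows, before any shuffling, which differs from the shuffled split made by `train_test_split`. Below is a minimal numpy sketch of that behaviour. The helper name `holdout_last_fraction` and the toy arrays are made up for illustration and are not part of Keras.

```python
import numpy as np


def holdout_last_fraction(x, y, validation_split=0.1):
    """Hold out the last `validation_split` share of rows, without shuffling."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])


# toy data (hypothetical): 8 rows, hold out the last 25%
x_toy = np.arange(8)
y_toy = np.arange(8) % 5
(x_tr, y_tr), (x_val, y_val) = holdout_last_fraction(x_toy, y_toy, validation_split=0.25)
print(len(x_tr), len(x_val))
```

Because the held-out rows are simply the last ones, shuffling the data first is advisable if the rows are ordered in some way.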

## Data preparation/preprocessing
Several simple steps are performed in data preparation:
* initialize the tokenizer and calculate the size of the dataset
* fit the tokenizer to create the vocabulary mapping (words -> integers)
* calculate the vocabulary size
* convert words to integers (texts to sequences)
* pad the sequences with zeros to the same length, since each sentence has a different length

In [None]:
# initialize Tokenizer to encode strings into integers
tokenizer = Tokenizer()

# calculate number of rows in our dataset
num_rows = df_train.shape[0]

# create vocabulary from all words in our dataset for encoding
tokenizer.fit_on_texts(df_train['Phrase'].values)

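# max length of 1 row (approximate: counts whitespace-separated words,
# which can differ slightly from the tokenizer's own token count)
row_max_length = max(len(x.split()) for x in df_train['Phrase'].values)

# vocabulary size: number of unique words + 1, because index 0 is reserved for padding
vocabulary_size = len(tokenizer.word_index) + 1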

# convert words into integers
X_train_tokens = tokenizer.texts_to_sequences(X_train)
X_test_tokens = tokenizer.texts_to_sequences(X_test)
X_sub_tokens = tokenizer.texts_to_sequences(df_submission['Phrase'].values)

# ensure every row has same size - pad missing with zeros
X_train_pad = pad_sequences(X_train_tokens, maxlen=row_max_length, padding='post')
X_test_pad = pad_sequences(X_test_tokens, maxlen=row_max_length, padding='post')
X_sub_pad = pad_sequences(X_sub_tokens, maxlen=row_max_length, padding='post')
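
Conceptually, the tokenizer and padding calls above do something like the following. This is a simplified pure-Python stand-in, not the Keras implementation: the helper names `build_vocab`, `encode` and `pad_post` and the toy phrases are made up for illustration, and the real `Tokenizer` additionally filters punctuation.

```python
from collections import Counter


def build_vocab(texts):
    """Map each word to an integer, most frequent first; 0 is kept free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, vocab):
    """Replace every known word with its integer index; unknown words are skipped."""
    return [[vocab[w] for w in text.lower().split() if w in vocab] for text in texts]


def pad_post(seqs, maxlen):
    """Append zeros so every row has `maxlen` entries (assumes no row is longer)."""
    return [s + [0] * (maxlen - len(s)) for s in seqs]


# toy phrases (hypothetical)
texts = ["a good movie", "a bad movie", "good"]
vocab = build_vocab(texts)
tokens = encode(texts, vocab)
padded = pad_post(tokens, maxlen=max(len(s) for s in tokens))
vocab_size = len(vocab) + 1  # +1 for the padding index 0
print(padded)
```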

## Labels preprocessing
With the `categorical_crossentropy` loss used below, a multi-class target must be converted from a vector into a matrix with as many columns as there are classes. For example, target values 0-4 become a matrix with 5 columns. This is the same as one-hot encoding. (The `sparse_categorical_crossentropy` loss would accept the integer labels directly.)

In [None]:
y_train_cat = to_categorical(y_train)
y_test_cat = to_categorical(y_test)

target_length = y_train_cat.shape[1]
print('Original vector size: {}'.format(y_train.shape))
print('Converted vector size: {}'.format(y_train_cat.shape))
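
To make the conversion concrete, here is a small numpy sketch of one-hot encoding that produces the same kind of matrix as `to_categorical`. The label values are made up for illustration.

```python
import numpy as np

labels = np.array([0, 2, 4, 1])  # hypothetical sentiment labels in the range 0-4
num_classes = 5

# pick one row of the identity matrix per label
one_hot = np.eye(num_classes, dtype="float32")[labels]
print(one_hot.shape)

# argmax over the columns recovers the original labels
recovered = one_hot.argmax(axis=1)
```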

## Modeling - train deep nets
**Embedding** - works much like dimensionality reduction. The embedding dimension can be 100, 500 or even 1000. Roughly speaking, if EMBEDDING_DIM == vocabulary_size, every word could get its own axis, as in a one-hot / bag-of-words encoding, but that is not feasible with 300 000 words: it would be a sparse matrix with 99.99% of the values equal to zero.
Instead, the embedding layer is dense and has a much smaller dimension.
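
One way to see the link to one-hot encoding: an embedding lookup gives the same result as multiplying one-hot rows by a weight matrix, only without ever building the sparse rows. Below is a small numpy sketch with made-up sizes and random weights (in the real model the weights are learned during training).

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3  # hypothetical, tiny on purpose

# one dense vector of length embedding_dim per word index
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = np.array([2, 5, 0])

# sparse route: one-hot rows multiplied by the weight matrix
one_hot_rows = np.eye(vocab_size)[token_ids]
via_matmul = one_hot_rows @ embedding_matrix

# dense route: plain row lookup, which is what an embedding layer amounts to
via_lookup = embedding_matrix[token_ids]

print(np.allclose(via_matmul, via_lookup))
```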


For modelling, we will use a sequential model with spatial dropout, a bidirectional GRU with 128 units and 2 dense layers. This combination gave me the best results (I also tested single- and multi-layer LSTM, GRU, convolutional nets, ...).

In [None]:
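EMBEDDING_DIM = 256

model = Sequential()
# map each word index to a dense vector of size EMBEDDING_DIM
model.add(Embedding(vocabulary_size, EMBEDDING_DIM, input_length=row_max_length))
# drop entire embedding channels rather than single values
model.add(SpatialDropout1D(0.2))
# read each sequence in both directions; the two GRU outputs are concatenated
model.add(Bidirectional(GRU(128)))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.2))
# one output probability per sentiment class
model.add(Dense(target_length, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# stop training after the first epoch in which validation loss does not improve
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=1)
history = model.fit(X_train_pad, y_train_cat, epochs=5, validation_data=(X_test_pad, y_test_cat), batch_size=128, callbacks=[callback])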
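
The `EarlyStopping` callback with `patience=1` stops training after the first epoch in which `val_loss` does not improve. The rule can be sketched in plain Python as follows. The function name `stopping_epoch` and the loss values are made up for illustration, and this is a simplification of the callback (it ignores options such as `min_delta`).

```python
def stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch after which training would stop."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # improvement: remember it and reset the counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # never triggered: training runs for all epochs
    return len(val_losses)


# hypothetical validation losses for 4 epochs
losses = [1.00, 0.90, 0.95, 0.80]
stop_strict = stopping_epoch(losses, patience=1)
stop_lenient = stopping_epoch(losses, patience=2)
print(stop_strict, stop_lenient)
```

Note that, unless `restore_best_weights=True` is passed to the callback, the model keeps the weights from the last epoch that ran rather than those of the best one.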

## Predict test data
The model is built and trained, so it's time to predict the test data for submission and save it to CSV.

In [None]:
# predict test data
y_sub_hat_ = model.predict(X_sub_pad)
y_sub_hat = np.argmax(y_sub_hat_, axis=1)

# save to csv
df_save = pd.read_csv('/kaggle/input/sentiment-analysis-on-movie-reviews/sampleSubmission.csv')
df_save['Sentiment'] = y_sub_hat
df_save.to_csv('Submission.csv', index=False)
print('Submission saved!')

## Conclusion
My solution uses a very simple model. However, more complex variants (convolutional nets, double/triple GRU, combinations of bidirectional layers, more dense layers, different dropout, LSTM, more units and a larger embedding size) had only a minimal effect on the final results.

If you have an idea how to improve the score (a different model, a different way of preprocessing), feel free to **leave a comment**!