# Deep Learning Models

### Trained from scratch:
1. Deep network formed from **Dense Layers**. For this neural net, we vectorize the text with TF-IDF, feed the vectors to the Dense layers, and compute the loss between the network's output and the one-hot encoding of the true flair.

2. **Deep Recurrent network**: We tokenize our data and feed it to a model consisting of an Embedding layer followed by an LSTM layer and a Dense layer that performs the classification.

### Pre-Trained Models:
1. **Text Embedding GNews Swivel**: This is a pre-trained model from Google that embeds our input text. We follow this embedding with Dense layers of our own to perform the classification.

2. **BERT**: This is Google's state-of-the-art language model, designed to perform multiple NLP tasks. Because BERT is an extremely heavy model, we will instead use DistilBERT, a much lighter model that preserves most of BERT's accuracy.

## Trained from scratch

In this notebook, we will perform the classifications by training our own models from scratch. We will begin with the **Deep Feedforward** network.

In [0]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Activation, Dense, Dropout, Embedding, LSTM, SpatialDropout1D
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [0]:
df = pd.read_csv('drive/My Drive/dataset_final.csv', engine='python')
df.drop(['Unnamed: 0'], axis=1, inplace=True)
# Drop rows with a missing title or missing post text, then renumber the rows.
# (The original run dropped the null-title row by hand, at index 83369.)
df = df.dropna(subset=['title', 'full_text']).reset_index(drop=True)

In [0]:
# Distinct flair labels (the classes we want to predict).
target_flairs = df['flair'].unique().tolist()
X_text = list(df['full_text'])
Y = list(df['flair'])
x_train, x_test, y_train, y_test = train_test_split(X_text, Y, test_size=0.3)

Our dataset is imbalanced, so we will oversample the flairs that have fewer posts. Several methods exist for this, such as random oversampling, SMOTE and ADASYN. SMOTE performed best on our data, so we use it to bring each under-represented flair up to the number of training samples given in the `sampling_strategy` dictionary.
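
To make the oversampling step concrete, here is a minimal sketch of the interpolation idea behind SMOTE: a synthetic point is placed at a random position on the line segment between a minority-class sample and one of its nearest same-class neighbours. The function name `smote_point` and the toy vectors are made up for illustration; the notebook itself relies on `imblearn`'s `SMOTE`, which also takes care of the neighbour search.

```python
import random

def smote_point(sample, neighbour, rng):
    """Return a synthetic point on the segment between two same-class points."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)        # seeded only to make the sketch repeatable
sample = [0.0, 0.2, 0.0]      # toy TF-IDF-like vector (hypothetical)
neighbour = [0.4, 0.0, 0.0]   # one of its nearest same-class neighbours
synthetic = smote_point(sample, neighbour, rng)
print(synthetic)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two real points, which is why SMOTE adds variety rather than exact duplicates.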

In [13]:
from imblearn.over_sampling import SMOTE

# Unigram TF-IDF features; the vocabulary is fitted on the training split only.
vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
x_train_tokenized = vect.fit_transform(x_train)
x_test_tokenized = vect.transform(x_test)

# Oversample each under-represented flair to 6000 training samples.
minority_flairs = ['Coronavirus', 'Science/Technology', 'Business/Finance',
                   '[R]eddiquette', 'Sports', 'Photography']
smote = SMOTE(sampling_strategy={flair: 6000 for flair in minority_flairs})
x_train_tokenized, y_train = smote.fit_resample(x_train_tokenized, y_train)


encoder = LabelBinarizer()
encoder.fit(y_train)
y_train_bin = encoder.transform(y_train)
y_test_bin = encoder.transform(y_test)
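
As a rough illustration of what the label encoding feeds into, the sketch below builds a one-hot target by hand and computes the categorical cross-entropy against a made-up softmax output. The helper names, the three-flair subset and the probabilities are assumptions for the example; in the notebook `LabelBinarizer` and Keras do this work.

```python
import math

def one_hot(label, classes):
    """Encode a label as a 0/1 vector over the sorted class list."""
    return [1 if c == label else 0 for c in classes]

def categorical_crossentropy(y_true, y_pred):
    """Cross-entropy between a one-hot target and a probability vector."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t)

# Toy subset of the flairs, sorted alphabetically as LabelBinarizer does.
classes = sorted(["Sports", "Coronavirus", "Photography"])
y_true = one_hot("Sports", classes)
y_pred = [0.2, 0.1, 0.7]  # hypothetical softmax output
loss = categorical_crossentropy(y_true, y_pred)
print(y_true, round(loss, 4))
```

Only the probability assigned to the true class enters the loss, so the loss shrinks towards zero as that probability approaches 1.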



In [14]:
model = Sequential()
model.add(Dense(512, input_shape=(x_train_tokenized.shape[1],)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(10))
model.add(Activation('softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               32223744  
_________________________________________________________________
activation (Activation)      (None, 512)               0         
_________________________________________________________________
dropout (Dropout)            (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               65664     
_________________________________________________________________
activation_1 (Activation)    (None, 128)               0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 10)               
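
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below (the helper `dense_params` is just for illustration) also works backwards from the first layer's count to the size of the TF-IDF vocabulary.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

# 32,223,744 parameters in the first layer imply this many TF-IDF features.
n_features = 32223744 // 512 - 1
print(n_features)                     # 62936
print(dense_params(n_features, 512))  # 32223744
print(dense_params(512, 128))         # 65664
```

Almost all of the model's parameters sit in the first layer, because its input is as wide as the TF-IDF vocabulary.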

In [0]:
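# Sort the column indices of the sparse matrix in place;
# Keras may reject a sparse input whose indices are out of order.
x_train_tokenized.sort_indices()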

In [16]:
num_epochs = 4
batch_size = 128
model.fit(x_train_tokenized, y_train_bin,
                    batch_size=batch_size,
                    epochs=num_epochs)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<tensorflow.python.keras.callbacks.History at 0x7fbbb7833e10>

In [17]:
x_test_tokenized.sort_indices()
model.evaluate(x_test_tokenized,  y_test_bin, verbose=2)

994/994 - 1s - loss: 1.7118 - accuracy: 0.5759


[1.7118370532989502, 0.5759423971176147]

We will now train an LSTM model, which takes the order of the words into account, and see whether it increases the accuracy. The final model trained here does not use oversampling, because oversampling produced worse results.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(X_text, Y, test_size=0.3)

In [4]:
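vocab_size = 100000   # keep at most this many of the most frequent words
maxlen = 500          # pad or truncate every post to this many tokens
EMBEDDING_DIM = 100   # size of each word embedding vector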

tokenizer = Tokenizer(num_words=vocab_size, filters=r'!"#$%&()*+,-./:;<=>?@[\]^_`{|}~', lower=True)
tokenizer.fit_on_texts(x_train)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 76131 unique tokens.


In [5]:
x_train_tokenized = tokenizer.texts_to_sequences(x_train)
x_train_tokenized = pad_sequences(x_train_tokenized, maxlen=maxlen)
x_test_tokenized = tokenizer.texts_to_sequences(x_test)
x_test_tokenized = pad_sequences(x_test_tokenized, maxlen=maxlen)
print('Shape of data tensor:', x_train_tokenized.shape)

Shape of data tensor: (74215, 500)
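
For intuition, here is a plain-Python sketch of what padding does to a tokenized post, assuming the Keras defaults of padding and truncating at the front of the sequence. The helper `pad_pre` and the four-word vocabulary are hypothetical.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate a sequence to exactly maxlen (> 0) items."""
    seq = seq[-maxlen:]  # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq

word_index = {"the": 1, "lockdown": 2, "was": 3, "extended": 4}  # toy vocabulary
tokens = [word_index[w] for w in "the lockdown was extended".split()]

print(pad_pre(tokens, 6))  # [0, 0, 1, 2, 3, 4]
print(pad_pre(tokens, 3))  # [2, 3, 4]
```

With `maxlen = 500`, short posts get leading zeros and long posts keep only their last 500 tokens, which is how every row ends up with the same length.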


In [6]:
encoder = LabelBinarizer()
encoder.fit(y_train)
y_train_bin = encoder.transform(y_train)
y_test_bin = encoder.transform(y_test)
y_train_bin.shape

(74215, 10)

In [7]:
model = Sequential()
model.add(Embedding(vocab_size, EMBEDDING_DIM, input_length=x_train_tokenized.shape[1]))
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 100)          10000000  
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 500, 100)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 10)                1010      
Total params: 10,081,410
Trainable params: 10,081,410
Non-trainable params: 0
_________________________________________________________________
None
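
The counts in this summary can be reproduced with a few lines of arithmetic. The helper names below are made up for the sketch; the formulas are the standard ones for an embedding table, an LSTM with four gates, and a Dense layer.

```python
def embedding_params(vocab_size, dim):
    """One vector of length dim per vocabulary slot."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

counts = [embedding_params(100000, 100), lstm_params(100, 100), dense_params(100, 10)]
print(counts, sum(counts))  # [10000000, 80400, 1010] 10081410
```

The embedding table is sized by `vocab_size` (100,000 rows) rather than by the 76,131 tokens actually found, so some of its rows are never used.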


In [8]:
num_epochs = 3
batch_size = 128
model.fit(x_train_tokenized, y_train_bin,
                    batch_size=batch_size,
                    epochs=num_epochs)


Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7fbbb7e1d588>

In [9]:
model.evaluate(x_test_tokenized,  y_test_bin, verbose=2)

994/994 - 111s - loss: 1.1561 - accuracy: 0.6181


[1.1561249494552612, 0.6181343793869019]

Great! Taking word order into account improved the test accuracy by about 4 percentage points over the purely Dense model (0.618 versus 0.576). This is what you would expect: the Dense model works on TF-IDF features, so it loses the information stored in the order of the words in a post.
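
A small illustration of that last point, using toy sentences that are not taken from the dataset: two posts with the same words in a different order produce identical bag-of-words counts (and therefore identical unigram TF-IDF vectors), while their token sequences differ.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase word, ignoring where it appears."""
    return Counter(text.lower().split())

a = "government blames opposition"
b = "opposition blames government"

print(bag_of_words(a) == bag_of_words(b))  # True: the Dense model sees no difference
print(a.split() == b.split())              # False: the LSTM sees a different sequence
```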