In [1]:
!wget -P /kaggle/working/data/ http://nlp.stanford.edu/data/glove.6B.zip
!unzip /kaggle/working/data/glove.6B.zip -d /kaggle/working/data/
!head ./data/glove.6B.50d.txt

In [1]:
import re

import tensorflow as tf
import keras
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

In [1]:
EPOCHS = 2
BATCH_SIZE = 128

MAX_LEN = 128
NUM_WORDS = 10000

EMBEDDING_DIM = 50
H1 = 32

THRESH = 0.5

## Data
This section loads and preprocesses the dataset.

In [1]:
train_data=pd.read_csv('/kaggle/input/fake-news/train.csv')
test_data=pd.read_csv('/kaggle/input/fake-news/test.csv')
train_data.loc[train_data["text"].isnull(), "text"] = ""
test_data.loc[test_data["text"].isnull(), "text"] = ""
print(train_data.shape, test_data.shape)
train_data.head()

In [1]:
# Put spaces around every punctuation character, then collapse runs of whitespace.
# Raw strings avoid invalid-escape warnings for "\s".
train_data["text"] = train_data["text"].map(lambda x: re.sub(r"([^a-zA-Z0-9\s])", r" \1 ", x))
train_data["text"] = train_data["text"].map(lambda x: re.sub(r"\s+", " ", x))
test_data["text"] = test_data["text"].map(lambda x: re.sub(r"([^a-zA-Z0-9\s])", r" \1 ", x))
test_data["text"] = test_data["text"].map(lambda x: re.sub(r"\s+", " ", x))
train_data.head()
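To see what the two substitutions do, here is a small sketch on a made-up sentence. The helper name `separate_punctuation` is hypothetical and only wraps the same two regular expressions used above.

```python
import re

def separate_punctuation(text):
    # Surround every character that is not a letter, digit or whitespace with spaces.
    spaced = re.sub(r"([^a-zA-Z0-9\s])", r" \1 ", text)
    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", spaced)

sample = "Hello, world! It's 2021."
cleaned = separate_punctuation(sample)
print(repr(cleaned))
```

Each punctuation mark becomes its own space-separated token (for example `It's` turns into `It ' s`), so the tokenizer later treats it as a separate word.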

In [1]:
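# GloVe files are space-separated. quoting=3 (csv.QUOTE_NONE) keeps the '"' token literal,
# and keep_default_na=False stops words such as "null" or "nan" from being read as NaN.
word_vec = pd.read_table("./data/glove.6B.50d.txt", sep=" ", header=None, quoting=3, keep_default_na=False)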
word_vec.set_index(0, inplace=True)
word_vec.head()

The model needs numbers rather than strings, so we map every word to an integer, e.g. `{"the": 1, "i": 2, "am": 3, ...}`. `tf.keras.preprocessing.text.Tokenizer` builds this mapping for us.

In [1]:
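tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters="",           # keep every character; punctuation was already split off above
    num_words=NUM_WORDS,  # only indices below NUM_WORDS are kept by texts_to_sequences
    lower=True,
    document_count=5,
)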
tokenizer.fit_on_texts(train_data["text"].values)

Note that `fit_on_texts` only builds the word-to-integer mapping, while `texts_to_sequences` below converts each text into a list of integer tokens.

In [1]:
train_tokens = tokenizer.texts_to_sequences(train_data["text"])
test_tokens = tokenizer.texts_to_sequences(test_data["text"])
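As a rough illustration of what the two tokenizer calls do, here is a minimal pure-Python sketch. The helper names `build_word_index` and `to_sequences` are made up for this example, and the real `Tokenizer` supports more options (filters, OOV tokens, character mode). The sketch only assumes lowercasing, whitespace splitting, and frequency-ranked indices starting at 1.

```python
from collections import Counter

def build_word_index(texts):
    # Count words over all texts; the most frequent word gets index 1.
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    # Replace each known word by its index; drop words ranked at or beyond num_words.
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.lower().split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

toy_texts = ["the cat sat", "the dog sat on the mat"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

With `num_words=4` only indices 1 to 3 survive, which mirrors why the notebook later works with indices `1` to `NUM_WORDS - 1`.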

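Sequences of different lengths cannot be stacked into a single batch, so every sequence must be padded with zeros (or truncated) to length `MAX_LEN`. Index 0 is reserved for this padding, which is why word indices start at 1; the cell below checks this. The padding itself is applied further down, once the tokens without a GloVe vector have been removed.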

In [1]:
tokenizer.index_word[9999], min(tokenizer.index_word.keys())

Find the words among the `NUM_WORDS` most frequent ones that have no GloVe vector.

In [1]:
words_used = [tokenizer.index_word[i] for i in range(1, NUM_WORDS)]
missing_words = set(words_used) - set(word_vec.index.values)
print(len(missing_words))
# A set makes the membership tests in the filtering step below fast.
missing_word_index = {tokenizer.word_index[word] for word in missing_words}

Remove the tokens of these missing words from every sequence.

In [1]:
train_tokens = [[word for word in sentence if word not in missing_word_index] for sentence in train_tokens]
test_tokens = [[word for word in sentence if word not in missing_word_index] for sentence in test_tokens]

In [1]:
train_X = keras.preprocessing.sequence.pad_sequences(train_tokens, maxlen=MAX_LEN)
test_X = keras.preprocessing.sequence.pad_sequences(test_tokens, maxlen=MAX_LEN)
train_Y = train_data["label"].values
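By default `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The following NumPy sketch imitates that default behaviour; the function name `pad_pre` is hypothetical and it is not the Keras implementation.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    # Start from a matrix filled with the padding value.
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]            # keep the last maxlen tokens
        out[row, -len(kept):] = kept    # right-align them, padding on the left
    return out

toy_padded = pad_pre([[1, 2, 3], [4], [5, 6, 7, 8, 9], []], maxlen=4)
print(toy_padded)
```

Short sequences get leading zeros and long ones lose their earliest tokens, so the end of each article is what reaches the model.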

In [1]:
# Row i holds the GloVe vector of the word with tokenizer index i; row 0 (padding) stays zero.
embedding_weights = np.zeros((NUM_WORDS, EMBEDDING_DIM))
index_n_word = [(i, tokenizer.index_word[i]) for i in range(1, len(embedding_weights)) if tokenizer.index_word[i] in word_vec.index]
idx, word = (list(t) for t in zip(*index_n_word))
embedding_weights[idx, :] = word_vec.loc[word, :].values
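The cell above aligns the pretrained vectors with the tokenizer's indices. A toy version with a made-up three-word vocabulary and two-dimensional vectors (all `toy_*` names are hypothetical) shows the idea:

```python
import numpy as np
import pandas as pd

# Pretend "pretrained" vectors, indexed by word like word_vec above.
toy_vectors = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=["the", "cat"])
# Pretend tokenizer mapping; "zzz" has no pretrained vector.
toy_index_word = {1: "the", 2: "zzz", 3: "cat"}

toy_weights = np.zeros((4, 2))  # row 0 is reserved for padding
pairs = [(i, w) for i, w in toy_index_word.items() if w in toy_vectors.index]
rows, words = (list(t) for t in zip(*pairs))
toy_weights[rows, :] = toy_vectors.loc[words, :].values
print(toy_weights)
```

Rows for padding and for words without a pretrained vector stay at zero, while every other row is a copy of the matching vector.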

## Model
The integer tokens above are effectively category labels, so they first pass through an embedding layer, which represents each word by `EMBEDDING_DIM` numbers. The embedded sequence then goes through an LSTM layer, and a final dense layer with a sigmoid activation turns the LSTM output into the probability that the article is fake.

If you wish to understand LSTMs at a mathematical level, this blog post is an excellent introduction: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

In [1]:
model = keras.Sequential()
model.add(keras.layers.Embedding(tokenizer.num_words,
                                 EMBEDDING_DIM,
                                 weights=[embedding_weights],
                                 trainable=False
                                ))
model.add(keras.layers.LSTM(H1))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
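To make the first layer concrete: an embedding layer is essentially a row lookup into its weight matrix. The NumPy sketch below uses made-up sizes and random weights rather than the GloVe matrix, and it is not how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 3          # toy sizes, not NUM_WORDS / EMBEDDING_DIM
weights = rng.normal(size=(vocab_size, embedding_dim))
weights[0] = 0.0                          # row 0 belongs to the padding token

# Two padded sequences of length 4, as produced by pad_sequences.
batch = np.array([[0, 0, 4, 2],
                  [1, 5, 3, 3]])

# Indexing with an integer array replaces every token id by its weight row.
embedded = weights[batch]
print(embedded.shape)
```

The result has shape `(batch, sequence length, embedding dimension)`, which is the three-dimensional input an LSTM layer expects.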

In [1]:
model.fit(train_X, train_Y, batch_size=BATCH_SIZE, epochs=EPOCHS, validation_split=0.1)

## Submission

In [1]:
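# predict returns shape (n, 1); flatten it before storing it as a column.
test_y = (model.predict(test_X) > THRESH).ravel()
# np.int was removed in NumPy 1.24; the built-in int does the same job.
test_data["label"] = test_y.astype(int)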
test_data[["id", "label"]].to_csv("submission.csv", index=False)
