


# Deep Learning NLP

**Fake news classifier**: Train a text classification model to detect fake news articles!

**Summary**

1. Further fine-tuning with hyper-parameter optimization techniques might improve the result. Also, although I didn't apply any text cleaning, the evaluation result, as can be seen from the accuracy and loss, is fairly satisfactory.

2. Even a simpler model using CBOW or TF-IDF features with MLPs might give a satisfactory result, but I didn't try them out.

3. I couldn't work out how the embedding weights of each word affect the category assigned to a given text, so I couldn't identify the words with the highest impact. To be honest, I'm sure I would have found a solution had I worked on it for a few more hours. (I tried to complete the exercise in about 6 hours, as mentioned in the directions of the challenge.) Nonetheless, with TF-IDF, this task would have been just a matter of comparing the weights of each word.
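
One simple reading of the TF-IDF idea in point 3 can be sketched as follows. This is only an illustration under stated assumptions: the four toy documents are invented, the smoothed idf formula is one common variant, and comparing per-class mean weights is my interpretation of "comparing weights", not something that was run on the actual dataset.

```python
import math
from collections import Counter

# Toy corpus of (text, label) pairs; label 1 = fake, 0 = real. Invented for illustration.
docs = [
    ("shocking secret cure doctors hate", 1),
    ("you will not believe this shocking trick", 1),
    ("senate passes budget bill after debate", 0),
    ("central bank holds interest rates steady", 0),
]

tokenized = [text.split() for text, _ in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(tokens):
    """Return {word: tf * idf} for one document, using a smoothed idf."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word, count in counts.items()
    }

# Sum the TF-IDF weight of every word within each class.
class_totals = {0: Counter(), 1: Counter()}
class_sizes = Counter(label for _, label in docs)
for tokens, (_, label) in zip(tokenized, docs):
    for word, weight in tfidf(tokens).items():
        class_totals[label][word] += weight

def mean_weight(word, label):
    """Average TF-IDF weight of `word` over all documents of one class."""
    return class_totals[label][word] / class_sizes[label]

# Positive score: the word weighs more in fake texts; negative: more in real ones.
scores = {w: mean_weight(w, 1) - mean_weight(w, 0) for w in sorted(doc_freq)}
top_fake = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_fake)
```

Words with the largest positive scores lean towards the fake class and those with the most negative scores towards the real class, which is the kind of ranking the embedding-based model does not expose directly.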

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
dataframes = []
for dirname, _, filenames in os.walk('/kaggle/input/fake-and-real-news-dataset'):
    for filename in filenames:
        # Files whose name starts with "True" get label 0; all other files get label 1.
        label = 0 if filename.startswith("True") else 1
        df = (
            pd.read_csv(os.path.join(dirname, filename))
            .assign(category=label)
            .astype({"category": "int32"})
        )
        # Equivalent alternative (I didn't check which of the two performs better):
        # df = pd.read_csv(os.path.join(dirname, filename))
        # df["category"] = np.full(len(df), label, dtype="int32")
        dataframes.append(df)

In [None]:
for i, df in enumerate(dataframes, start=1):
    print("Length of dataframe {}:".format(i), len(df))

In [None]:
combined_df = pd.concat(dataframes)
combined_df = combined_df.sample(frac=1, random_state=132).reset_index(drop=True)
print(len(combined_df))

In [None]:
# combined_df[44890:]

In [None]:

def median_of_words_per_texts(texts) -> float:
    """Take an iterable of texts (e.g. all texts of one category)
    and return the median number of whitespace-separated words per text.
    """
    word_counts = [len(txt.split()) for txt in texts]
    return float(np.median(word_counts))

In [None]:

train_df = combined_df.sample(frac=0.8, random_state=100)
test_df = combined_df[~combined_df.index.isin(train_df.index)]
processed_df = pd.DataFrame()
processed_df['texts'] = combined_df["title"].str.lower()  + " "*10 + combined_df["text"].str.lower()

# train_df[:6]

In [None]:
median_of_words_per_texts(processed_df['texts'])
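
The median word count is one way to sanity-check the `max_length` used for padding further down. A minimal sketch, on made-up texts, of how the share of texts that would be truncated at a given length could be measured. `truncated_share` and `word_counts` are hypothetical helpers, and whitespace word counts only approximate the token counts the Keras `Tokenizer` produces:

```python
from statistics import median

def word_counts(texts):
    """Number of whitespace-separated words in each text."""
    return [len(text.split()) for text in texts]

def truncated_share(texts, max_length):
    """Fraction of texts that contain more than `max_length` words."""
    counts = word_counts(texts)
    return sum(count > max_length for count in counts) / len(counts)

# Made-up examples; in the notebook this would be processed_df["texts"].
sample_texts = ["one two three", "one two three four five", "one"]

print(median(word_counts(sample_texts)))
print(truncated_share(sample_texts, max_length=4))
```

A high truncated share at `max_length = 120` would mean the model only ever sees the title and the opening of most articles.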

In [None]:

# train_df.head()

In [None]:
texts_training = train_df["title"].str.lower() + " "*10 + train_df["text"].str.lower()
texts_testing = test_df["title"].str.lower() + " "*10 + test_df["text"].str.lower()
categories_training = train_df["category"]
categories_testing = test_df["category"]

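vocab_size = 100000      # keep at most the 100,000 most frequent words
embedding_dim = 16       # size of each word vector
max_length = 120         # tokens kept per text after padding/truncation
trunc_type = 'post'      # cut overly long texts at the end
padding_type = 'post'    # pad short texts with zeros at the end
oov_tok = "<OOV>"        # placeholder for words outside the vocabulary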

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(texts_training)

word_index = tokenizer.word_index
seq_training = tokenizer.texts_to_sequences(texts_training)
padded_training_seq = pad_sequences(seq_training, maxlen=max_length, padding=padding_type, truncating=trunc_type)


seq_testing = tokenizer.texts_to_sequences(texts_testing)
padded_testing_seq = pad_sequences(seq_testing, maxlen=max_length, padding=padding_type, truncating=trunc_type)
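
To make the effect of `padding='post'` and `truncating='post'` concrete, here is a small pure-Python stand-in. `pad_post` is a hypothetical helper written for illustration and is not the Keras implementation; it only mirrors the behaviour for these two settings:

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to `maxlen` items from the end, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                             # truncating='post'
        padded.append(kept + [value] * (maxlen - len(kept)))  # padding='post'
    return padded

print(pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4))
# [[5, 8, 2, 0], [7, 1, 9, 4]]
```

With `max_length = 120`, every text therefore becomes exactly 120 token ids: the first 120 tokens of long texts, and zero-filled tails for short ones.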


In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length, name="embedding"),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam(1e-5), metrics=['accuracy'])
model.summary()
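
As a cross-check on what `model.summary()` reports, the parameter counts can be worked out by hand from the standard formulas for these layer types. A sketch, assuming the default Keras LSTM (four gates, each with an input kernel, a recurrent kernel and one bias vector); the helper names are my own:

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per vocabulary entry.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size, embedding_dim, lstm_units = 100000, 16, 32

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    # Bidirectional wraps two independent LSTMs (forward and backward).
    "bidirectional_lstm": 2 * lstm_params(embedding_dim, lstm_units),
    # The two LSTM outputs are concatenated, so the next layer sees 64 inputs.
    "dense_32": dense_params(2 * lstm_units, 32),
    "dense_1": dense_params(32, 1),
}
total = sum(counts.values())
print(counts)
print(total)
```

Under these assumptions the embedding table accounts for 1,600,000 of the 1,614,657 parameters, so nearly all of the model's capacity sits in the word vectors.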


In [None]:
num_epochs = 10
padded_training_seq = np.array(padded_training_seq)
categories_training = np.array(categories_training)
padded_testing_seq = np.array(padded_testing_seq)
categories_testing = np.array(categories_testing)


In [None]:

history = model.fit(padded_training_seq, categories_training, epochs=num_epochs, validation_data=(padded_testing_seq, categories_testing), verbose=1)

In [None]:
def plot_graphs(history, metric):
    """Plot the training and validation curves of `metric` per epoch."""
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric])
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()

plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

In [None]:
model.save("model_fake_real_news.h5")

In [None]:
# The embedding layer holds one vector for each entry of the 100,000-word vocabulary:
embedding_weights = model.get_layer("embedding").get_weights()[0]
print("Embedding layer / shape of weights: {} x {}".format(*embedding_weights.shape))
print(embedding_weights[:6])

*I couldn't figure out which words have the greatest impact on the category assigned to a given text.*
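
One model-agnostic way to approach this question is occlusion: remove a word from the text and see how much the predicted score changes. The sketch below was not run against the trained model. `word_importance` is a hypothetical helper, and `toy_predict` is a stand-in scorer; in the notebook, `predict` would have to wrap the tokenizer, the padding step and `model.predict`, and that wiring is assumed rather than shown.

```python
def word_importance(text, predict):
    """Score each distinct word by how much removing it lowers predict(text)."""
    words = text.split()
    baseline = predict(text)
    importance = {}
    for word in set(words):
        # Rebuild the text without any occurrence of this word.
        reduced = " ".join(w for w in words if w != word)
        importance[word] = baseline - predict(reduced)
    return importance

# Stand-in scorer (hypothetical): NOT the trained model, just something callable.
def toy_predict(text):
    words = text.split()
    return sum(w == "shocking" for w in words) / max(len(words), 1)

scores = word_importance("shocking claim about shocking cure", toy_predict)
print(max(scores, key=scores.get))
```

Words with a large positive score push the prediction towards the fake class; words with a negative score pull it away. Averaging such scores over many texts would give a rough ranking of influential words.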