Hello Kagglers!

In this notebook, news articles are classified as fake or real using NLP techniques. The approach combines word vectors with a TensorFlow deep learning model. The NLTK Python package is used to clean the data, and exploratory data analysis (EDA) is also performed.

Let's get started...

![Fake News Image](https://upload.wikimedia.org/wikipedia/commons/f/f7/The_fin_de_si%C3%A8cle_newspaper_proprietor_%28cropped%29.jpg)

In [None]:
import nltk
import pandas as pd
import numpy as np
import plotly.express as plot
from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as mplot
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from tqdm.notebook import tqdm as tqdm
import gensim
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import tensorflow.keras.layers as Layers
import tensorflow.keras.models as Models
import tensorflow.keras.initializers as Init
from sklearn.metrics import classification_report

Stopwords are words that contribute little to the classification model, so they are removed. For example, "the", "is", "i", "am", etc. carry almost no information about whether a text is fake or real, so we will remove them from the data. In addition, the tqdm package is initialized here; it provides progress bars during bulk operations.

In [None]:
#starting essential code
nltk_STOPWORDS = set(stopwords.words("english"))
tqdm.pandas()

Reading the data from the CSV file with the help of Pandas.

In [None]:
df = pd.read_csv("../input/fake-news/fake_train.csv")

# EDA

In [None]:
df.head()

There are about 20k instances and 5 columns.

In [None]:
df.shape

The most important column in the dataset is "text", so all rows where the text is NULL are removed.

In [None]:
df.dropna(subset=["text"], axis=0, inplace=True)

The 'id' column is not useful for classification, so it is dropped too.

In [None]:
df.drop(["id"], axis=1, inplace=True)

After removing 1 column and 49 rows, the new shape is:

In [None]:
df.shape

There are about 4k distinct news authors in the dataset.

In [None]:
print("Unique Authors:", len(df.author.unique()))

In the figure below, 'Pam Key' is the top author with 243 articles. The author in second place, with 193 articles, appears to be an admin account, possibly shared by the admins of several different blogs.

In [None]:
topAuthors = df.author.value_counts()[:10]
plot.bar(x=topAuthors.keys(), y=topAuthors.values, title="Top Authors", labels={"x":"Authors","y":"Number of News"})

Here, the top authors are computed for fake news only. The figure below shows the authors who published the most fake news.

In [None]:
topFakeAuthors = df[df.label == 1].author.value_counts()[:10]
plot.bar(x=topFakeAuthors.keys(), y=topFakeAuthors.values, title="Top Fake News Authors", labels={"x":"Authors","y":"Number of Fake News"})

Here, the top authors are computed for real news only. The figure below shows the authors who published the most real news.

In [None]:
topRealAuthors = df[df.label == 0].author.value_counts()[:10]
plot.bar(x=topRealAuthors.keys(), y=topRealAuthors.values, title="Top Real News Authors", labels={"x":"Authors","y":"Number of Real News"})

The dataset is evenly balanced, i.e. there are about 50% fake news and 50% real news.

In [None]:
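labelCount = df.label.value_counts()

# value_counts() orders by frequency, so derive the names from the label values (1 = fake, 0 = real)
labelNames = labelCount.index.map({0: "Real", 1: "Fake"})
plot.pie(values=labelCount.values, names=labelNames, title="Fake Vs Real")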

A word cloud is a quick way to visualize the frequency of words in a text. The figure below shows that "New York", "York Times", and "Trump" are the most frequently used terms in the titles.

In [None]:
text = " ".join("" if pd.isnull(t) else t for t in df.title)
wc = WordCloud(width=800, height=400,stopwords=STOPWORDS, background_color="white").generate(text)
fig = mplot.figure(figsize=(16, 16))
mplot.imshow(wc, interpolation='bilinear')
mplot.axis("off")
mplot.title("Word Cloud")
mplot.show()

The word cloud is now computed for fake news only; "Trump", "Hillary", and "Video" are the most common words.

In [None]:
text = " ".join("" if pd.isnull(t) else t for t in df[df.label == 1].title)
wc = WordCloud(width=800, height=400,stopwords=STOPWORDS, background_color="white").generate(text)
fig = mplot.figure(figsize=(16, 16))
mplot.imshow(wc, interpolation='bilinear')
mplot.axis("off")
mplot.title("Fake News Word Cloud")
mplot.show()

The word cloud is now computed for real news only; "Trump", "York Times", and "New York" are the most common terms.

In [None]:
text = " ".join("" if pd.isnull(t) else t for t in df[df.label == 0].title)
wc = WordCloud(width=800, height=400,stopwords=STOPWORDS, background_color="white").generate(text)
fig = mplot.figure(figsize=(16, 16))
mplot.imshow(wc, interpolation='bilinear')
mplot.axis("off")
mplot.title("Real News Word Cloud")
mplot.show()

# Classification

**1. Pre Processing**

It is assumed that any news text of 20 characters or fewer is junk rather than a real article, so those rows are removed.

In [None]:
df = df[df.text.map(len) > 20]
df.shape

The sentences are now split into tokens, e.g. "This is a cat and 2 Dogs." is converted to ["This", "is", "a", "cat", "and", "2", "Dogs", "."].

In [None]:
df["tokens"] = df.text.progress_apply(word_tokenize)

After tokenization, all tokens are converted to lowercase, e.g. ["This", "is", "a", "cat", "and", "2", "Dogs", "."] will become ["this", "is", "a", "cat", "and", "2", "dogs", "."].

In [None]:
def toLower(tokens)->list:
    return [t.lower() for t in tokens]

df["tokens"] = df.tokens.progress_apply(toLower)

After lowercasing, all punctuation and numbers are removed from the tokens because they are not useful to the model, e.g. ["this", "is", "a", "cat", "and", "2", "dogs", "."] will become ["this", "is", "a", "cat", "and", "dogs"].

In [None]:
def removePunctuation(tokens)->list:
    return [t for t in tokens if t.isalpha()]

df["tokens"] = df.tokens.progress_apply(removePunctuation)

In this cell, the stopwords discussed above are removed, which makes the token lists shorter and more compact, e.g. ["this", "is", "a", "cat", "and", "dogs"] becomes ["cat", "dogs"].

In [None]:
def removeStopwords(tokens) -> list:
    return [t for t in tokens if t not in nltk_STOPWORDS]

df["tokens"] = df.tokens.progress_apply(removeStopwords)

After stopword removal, some single-character tokens remain in the dataset. These rarely carry any meaning, so every token shorter than 2 characters is removed too. The example token list is not affected, i.e. it stays ["cat","dogs"].

In [None]:
def removeSingleLenWords(tokens)->list:
    return [t for t in tokens if len(t) >= 2]
        

df["tokens"] = df.tokens.progress_apply(removeSingleLenWords)
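To make the cleaning steps so far concrete, here is a minimal sketch that traces the example sentence through lowercasing, the alphabetic filter, stopword removal, and the length filter. It uses plain Python only. `TOY_STOPWORDS` is a tiny hand-written stand-in for the NLTK stopword list, and the input tokens are written out by hand, assuming the tokenizer splits the trailing period into its own token.

```python
# Toy walk-through of the cleaning steps (illustrative only).
# TOY_STOPWORDS is a hypothetical stand-in for NLTK's English stopword list.
TOY_STOPWORDS = {"this", "is", "a", "and"}

# Hand-written tokens, assumed to match the tokenizer output for
# "This is a cat and 2 Dogs."
tokens = ["This", "is", "a", "cat", "and", "2", "Dogs", "."]

lowered = [t.lower() for t in tokens]                               # lowercase every token
alpha_only = [t for t in lowered if t.isalpha()]                    # drop "2" and "."
no_stopwords = [t for t in alpha_only if t not in TOY_STOPWORDS]    # drop stopwords
cleaned = [t for t in no_stopwords if len(t) >= 2]                  # drop single characters

print(cleaned)  # ['cat', 'dogs']
```

Each step is a plain list comprehension, which is why the notebook can apply them one after another with `progress_apply`.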

The last step is lemmatization, which normalizes words to their base form, e.g. "goats" becomes "goat", and "working" or "works" becomes "work". The tokens ["cat","dogs"] become ["cat","dog"].

In [None]:
def lematize(tokens, lematizer)->list:
    tokens = [lematizer.lemmatize(t, pos = "v") for t in tokens]
    return [lematizer.lemmatize(t, pos = "n") for t in tokens]
lemme = WordNetLemmatizer()
df["tokens"] = df.tokens.progress_apply(lambda tokens: lematize(tokens,lemme))

**2. Word Vector Creation**

Here we define some constants. MAX_LEN stores the length, in tokens, of the longest text. DIMENSION is the size of each word vector, which is described below.

In [None]:
DIMENSION = 100
MAX_LEN = max([len(x) for x in df.tokens])

Gensim is the Python library used to create the word vectors. We could also train word vectors directly in the Keras model, but Gensim provides a mechanism to save the vectors and offers more configuration options.
1. Sentences are the token lists that were prepared above.
2. Size is the length of each vector, also called the dimension. E.g. with a dimension of 3, the vector for the word "dog" could be [0.344,0.233,-3.33] (values are made up). The vector means that the word "dog" now lives at that point in 3D space.
3. Window is the maximum distance within which words influence each other. E.g. for ["dog","cat","goat"] with a window of 1, "dog" is directly affected by "cat" but not by "goat". With a window of 2, "dog" is affected by "goat" too.
4. Workers is the number of threads used to train the vectors; 1 means no parallelism.
5. Min_count is a frequency threshold: words that occur fewer times than this value are ignored.

In [None]:
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`.
model_word2vec = gensim.models.Word2Vec(sentences=df.tokens, size=DIMENSION, window=7, workers=2, min_count = 1)
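The effect of the window parameter can be sketched in a few lines of plain Python. The helper `context_pairs` below is hypothetical and is not part of Gensim; it only lists which (center, context) word pairs fall inside a given window, which is a simplification of what the trainer does internally.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for all tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = ["dog", "cat", "goat"]
pairs_w1 = context_pairs(sentence, window=1)
pairs_w2 = context_pairs(sentence, window=2)

print(("dog", "goat") in pairs_w1)  # False: "goat" is two positions away from "dog"
print(("dog", "goat") in pairs_w2)  # True
```

A larger window therefore produces more training pairs per sentence and tends to capture broader, more topical relations between words.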


VOCAB_SIZE is the number of words in the word-vector vocabulary.

In [None]:
# Note: gensim >= 4.0 removes `wv.vocab`; there, use len(model_word2vec.wv.key_to_index).
VOCAB_SIZE = len(model_word2vec.wv.vocab)
VOCAB_SIZE

The three cells below display the words that are most similar to the given word.

In [None]:
model_word2vec.wv.most_similar("bad")

In [None]:
model_word2vec.wv.most_similar("man")

In [None]:
model_word2vec.wv.most_similar("woman")
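`most_similar` ranks words by the cosine similarity of their vectors. The sketch below shows that ranking on three made-up 3-dimensional vectors in plain Python. The vectors in `toy_vectors` are assumptions chosen for illustration and do not come from the trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors: "dog" and "cat" point in a similar direction, "car" does not.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}

query = "dog"
ranked = sorted(
    (w for w in toy_vectors if w != query),
    key=lambda w: cosine_similarity(toy_vectors[query], toy_vectors[w]),
    reverse=True,
)
print(ranked)  # ['cat', 'car']
```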

**3. Processing Word Vectors to Feed into the Neural Network**

After training the word vectors, the tokens are converted to sequences of integers, e.g. ["dogs","cats"] is converted to [1,2]. Note: 0 is reserved for padding and is never assigned to a word.

In [None]:
token = Tokenizer()
token.fit_on_texts(df.tokens)
df["tokens"] = token.texts_to_sequences(df.tokens)
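The index assignment can be sketched in plain Python: words are numbered from 1 in order of descending frequency, and 0 is left free for padding. The helpers `build_word_index` and `to_sequences` are hypothetical and are only a simplified stand-in for what the Keras `Tokenizer` does.

```python
from collections import Counter

def build_word_index(docs):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(tok for doc in docs for tok in doc)
    # most_common() sorts by descending count; index 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(docs, word_index):
    """Replace every token with its integer index."""
    return [[word_index[tok] for tok in doc] for doc in docs]

toy_docs = [["dog", "cat"], ["dog", "goat", "dog", "cat"]]
toy_word_index = build_word_index(toy_docs)
toy_sequences = to_sequences(toy_docs, toy_word_index)

print(toy_word_index)  # {'dog': 1, 'cat': 2, 'goat': 3}
print(toy_sequences)   # [[1, 2], [1, 3, 1, 2]]
```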

This is the most crucial part: preparing the word-vector matrix to feed into the NN. First, the matrix is allocated with shape (VOCAB_SIZE+1, DIMENSION); the extra row belongs to the padding index 0. Then the vector for each word is taken from the Word2Vec model and placed in the row given by the word's index. With 3-dimensional vectors, for example, the sequence [1,2] is looked up as [[0.3,0.2,-3.33],[0.3,4.5,3.2]], and its shape will be (2,3).

In [None]:
words2vec_matrix = np.zeros((VOCAB_SIZE+1,DIMENSION))

for word, index in tqdm(token.word_index.items()):
    words2vec_matrix[index] = model_word2vec.wv[word]


In [None]:
words2vec_matrix.shape
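The lookup that the embedding layer performs on this matrix is just row indexing. Here is a minimal sketch with plain Python lists instead of NumPy; the two 3-dimensional vectors and the names `toy_vectors`, `toy_index`, and `toy_matrix` are made up for illustration.

```python
DIM = 3  # toy dimension; the notebook uses DIMENSION = 100

# Made-up word vectors and word indices (illustrative assumptions).
toy_vectors = {"dog": [0.3, 0.2, -3.33], "cat": [0.3, 4.5, 3.2]}
toy_index = {"dog": 1, "cat": 2}

# One row per index; row 0 stays all zeros and represents padding.
toy_matrix = [[0.0] * DIM for _ in range(len(toy_index) + 1)]
for word, index in toy_index.items():
    toy_matrix[index] = toy_vectors[word]

# Embedding a padded sequence means picking one row per index.
sequence = [1, 2, 0]
embedded = [toy_matrix[i] for i in sequence]

print(embedded)  # [[0.3, 0.2, -3.33], [0.3, 4.5, 3.2], [0.0, 0.0, 0.0]]
```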

After the matrix is built, the token sequences are padded with zeros, because the NN takes a fixed-size NumPy array as input so that it can build tensors from it. E.g. the two sequences [2, 3, 4] and [1, 2] become [2,3,4] and [1,2,0]. The 0 is appended to make both sequences the same length.

In [None]:
data = pad_sequences(df.tokens, padding="post", maxlen=MAX_LEN, dtype=int)
labels = df.label
del df
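Post-padding itself is simple enough to sketch in plain Python. The helper `pad_post` below is hypothetical and covers only the case used in this notebook, where no sequence is longer than the target length; longer sequences are left unchanged in this sketch.

```python
def pad_post(sequences, maxlen, value=0):
    """Append `value` to each sequence until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        # A negative count yields an empty list, so long sequences stay as they are.
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

toy_padded = pad_post([[2, 3, 4], [1, 2]], maxlen=3)
print(toy_padded)  # [[2, 3, 4], [1, 2, 0]]
```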

In [None]:
data.shape

In [None]:
labels.shape

**4. Deep Learning Model**

Splitting the data into 70% train and 30% test data.

In [None]:
(train_data, test_data, train_labels, test_labels) = train_test_split(data, labels, test_size=0.3)

A sequential model with Conv1D layers is used to extract features from the data. The most important layer is the embedding layer, which maps each word index to its word vector. Because the word vectors are already trained, we set trainable to False and initialize the layer with the matrix created above.

In [None]:
model = Models.Sequential()
model.add(Layers.Embedding(VOCAB_SIZE+1,DIMENSION, embeddings_initializer = Init.Constant(words2vec_matrix), input_length=MAX_LEN, trainable=False ))
model.add(Layers.Conv1D(16,3, activation="relu"))
model.add(Layers.MaxPooling1D(5))
model.add(Layers.Conv1D(8,3, activation="relu"))
model.add(Layers.Flatten())
model.add(Layers.Dense(8,activation='relu'))
model.add(Layers.Dropout(0.5))
model.add(Layers.Dense(4,activation='relu'))
model.add(Layers.Dense(1,activation='sigmoid'))

model.compile(optimizer="adam",loss='binary_crossentropy',metrics=['accuracy'])

In [None]:
model.summary()
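The sequence lengths reported in the summary can be checked by hand. The sketch below assumes the Keras defaults used in this model (no padding, stride 1 for the convolutions, pooling stride equal to the pool size) and a hypothetical input length of 1000 tokens, since the real MAX_LEN depends on the data.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a convolution with no padding and stride 1."""
    return length - kernel_size + 1

def maxpool1d_out_len(length, pool_size):
    """Output length of max pooling with no padding and stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1

seq_len = 1000                                   # hypothetical MAX_LEN
after_conv1 = conv1d_out_len(seq_len, 3)         # Conv1D(16, 3)
after_pool = maxpool1d_out_len(after_conv1, 5)   # MaxPooling1D(5)
after_conv2 = conv1d_out_len(after_pool, 3)      # Conv1D(8, 3)
flat_features = after_conv2 * 8                  # Flatten over 8 filters

print(after_conv1, after_pool, after_conv2, flat_features)  # 998 199 197 1576
```

Because Flatten multiplies the remaining length by the number of filters, a very long MAX_LEN directly increases the size of the first Dense layer.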

In [None]:
trained = model.fit(train_data,train_labels, batch_size=64,epochs=10,validation_split=0.30)
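Note that `validation_split=0.30` is applied to the training portion only, so the three subsets are not 70/30 of the whole dataset. The arithmetic below uses a hypothetical total of 10,000 rows to keep the numbers round; the exact counts in the notebook depend on rounding inside the libraries.

```python
n_total = 10_000                 # hypothetical dataset size
n_test = n_total * 30 // 100     # held out by train_test_split(test_size=0.3)
n_train = n_total - n_test       # passed to model.fit
n_val = n_train * 30 // 100      # held out again by validation_split=0.30
n_fit = n_train - n_val          # rows that actually update the weights

print(n_fit, n_val, n_test)      # 4900 2100 3000
print(n_fit / n_total)           # 0.49
```

So under these assumptions only about 49% of the data is used to fit the weights, 21% is used for validation during training, and 30% is kept for the final test.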

In [None]:
del train_data
del train_labels

In [None]:
mplot.plot(trained.history['accuracy'])
mplot.plot(trained.history['val_accuracy'])
mplot.title('Model accuracy')
mplot.ylabel('Accuracy')
mplot.xlabel('Epoch')
mplot.legend(['Train', 'Validation'], loc='upper left')
mplot.show()

mplot.plot(trained.history['loss'])
mplot.plot(trained.history['val_loss'])
mplot.title('Model loss')
mplot.ylabel('Loss')
mplot.xlabel('Epoch')
mplot.legend(['Train', 'Validation'], loc='upper left')
mplot.show()

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(test_labels, model.predict(test_data).round()))
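The report lists precision, recall, and F1 for each class. As a rough illustration of how these numbers are derived, here is a plain-Python sketch on eight made-up labels. The helper `precision_recall_f1` is hypothetical and handles a single positive class only.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, computed from raw label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

Precision and recall swap roles if the true labels and the predictions are passed in the wrong order, which is why the argument order matters.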

Here, an accuracy of about **93%** is achieved. The accuracy could be increased with other techniques and tricks, but for this kernel, I hope this is enough :)

Thank you for viewing my kernel. If you have any suggestions or questions, please start a discussion below.