## EDA | Modelling | Embedding
![](https://beconnected.esafety.gov.au/pluginfile.php/52815/mod_resource/content/12/fake-news-hero-img.jpg)
<br>
We come across different types of news: some of it is bullshit, some is fake, some is written to express hate, and some is published to make fun of others. Given an article's content, we as humans can usually sort it into one of these categories, but can computers do it?

We first explore the data in depth and draw some conclusions from it. We then turn the text into features with TF-IDF and apply a Random Forest for modelling. Along the way we will see a drawback of TF-IDF, which leads us to embeddings with TensorFlow.

**Do upvote if you liked it!**

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import cufflinks as cf
import plotly
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sns
from math import pi

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
sr_ = Style.RESET_ALL

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings("ignore")



In [None]:
df = pd.read_csv('/kaggle/input/source-based-news-classification/news_articles.csv')
df.head()

In [None]:
print("Number of rows and columns present in the dataset are: ", df.shape)

In [None]:
print("The number and categories of unique type of articles are: ", len(df['type'].unique()), df['type'].unique())

In [None]:
fig = px.pie(df,names='type',title='Types of Articles')
fig.show()

**These are the 8 different types of articles:**
- bias
- conspiracy
- fake
- bs (i.e. bullshit)
- satire
- hate
- junksci (i.e. junk science)
- state

### Null values

Let's now check for Null Values in this dataset

In [None]:
def msv(data, thresh = 20, color = 'black', edgecolor = 'black', height = 3, width = 15):
    
    plt.figure(figsize = (width, height))
    percentage = (data.isnull().mean()) * 100
    percentage.sort_values(ascending = False).plot.bar(color = color, edgecolor = edgecolor)
    plt.axhline(y = thresh, color = 'r', linestyle = '-')
    
    plt.title('Missing values percentage per column', fontsize=20, weight='bold' )
    
    plt.text(len(data.isnull().sum()/len(data))/1.7, thresh+2.5, f'Columns with more than {thresh}% missing values', fontsize=10, color='crimson',
         ha='left' ,va='top')
    plt.text(len(data.isnull().sum()/len(data))/1.7, thresh - 0.5, f'Columns with less than {thresh}% missing values', fontsize=10, color='green',
         ha='left' ,va='top')
    plt.xlabel('Columns', size=15, weight='bold')
    plt.ylabel('Missing values percentage')
    
    return plt.show()
msv(df, 10, color=sns.color_palette('Oranges',10))

Fewer than **2.5% of values** are missing in the columns 'text_without_stopwords' and 'text', and close to 0.5% are missing in columns like 'title_without_stopwords' and 'language'. We will drop the rows containing these null values, since removing them won't have much effect on our model.

In [None]:
df_orig = df.copy()
df.dropna(inplace = True)
msv(df, thresh = 2, color=sns.color_palette('Reds',15))

Thus, we have **removed all Null Values**.

## Visualization and EDA

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n = None):
    """
    A function that returns the top 'n' unigrams used in the corpus
    """
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    freq_sorted = sorted(words_freq, key = lambda x: x[1], reverse = True)
    return freq_sorted[:n]


def get_top_n_bigram(corpus, n = None):
    """
    A function that returns the top 'n' bigrams used in the corpus
    """
    vec = CountVectorizer(ngram_range = (2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis = 0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    freq_sorted = sorted(words_freq, key = lambda x: x[1], reverse=True)
    return freq_sorted[:n]

top_unigram = get_top_n_words(df['text_without_stopwords'], 20)
top_bigram = get_top_n_bigram(df['text_without_stopwords'], 20)
words = [i[0] for i in top_unigram]
count = [i[1] for i in top_unigram]

plt.figure(figsize=(15,10))
plt.bar(words, count,align='center')
plt.xticks(rotation=90)
plt.ylabel('Number of Occurrences')
plt.show()

In [None]:
from wordcloud import WordCloud 
wc = WordCloud(background_color="white", max_words=100,
               max_font_size=256,
               random_state=42, width=1000, height=1000)
wc.generate(' '.join(df['text_without_stopwords']))
plt.imshow(wc)
plt.axis('off')
plt.show()

### Language

In [None]:
fig = px.pie(df,names='language',title='Languages of Articles')
fig.show()

### Articles Including Images vs. Label

In [None]:
fig = px.bar(df, x='hasImage', y='label',title='Articles Including Images vs Label')
fig.show()

**Using HTML to view the images, which are given as image URLs**

In [None]:
from IPython.display import HTML  # public import path; IPython.core.display is deprecated for this

def convert(path):
    return '<img src="'+ path + '" width="80">'
df_sources = df[['site_url','label','main_img_url', 'title_without_stopwords']]
df_r = df_sources.loc[df['label']== 'Real'].iloc[6 : 10,:]
df_f = df_sources.loc[df['label']== 'Fake'].head(6)

In [None]:
HTML(df_r.to_html(escape = False, formatters = dict(main_img_url = convert)))

In [None]:
HTML(df_f.to_html(escape = False, formatters = dict(main_img_url = convert)))

Thus, in the rows shown above, most of the fake news comes from **21stcenturywire.com**.

### Real vs. Fake News

In [None]:
fig = px.pie(df,names='label',title='Proportion of Real vs. Fake News',color_discrete_sequence=px.colors.sequential.Viridis_r)
fig.show()

**Let's check which sites publish fake news**

In [None]:
print(f"Sites printing Fake news are: {r_}{df[df['label'] == 'Fake']['site_url'].unique()}")

In [None]:
print(f"Sites printing Real news are: {g_}{df[df['label'] == 'Real']['site_url'].unique()}")

**Let's also check whether any websites publish both real and fake news**

In [None]:
real = set(df[df['label'] == 'Real']['site_url'].unique())
fake = set(df[df['label'] == 'Fake']['site_url'].unique())
print(f"Websites publishing both real & fake news are {m_}{real & fake}")

Knowing which websites publish fake news is useful, but some sites may have published only 1 or 2 fake articles. Let's take these into consideration as well.

In [None]:
df[df['label'] == 'Fake']['site_url'].value_counts().tail(10)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder() ## Converting the type column from object datatype to numerical datatype
df['type'] = le.fit_transform(df['type'])
df.head()

In [None]:
le.classes_
le.transform(['bias', 'bs', 'conspiracy', 'fake', 'hate', 'junksci', 'satire' , 'state'])
mapping = {}
for i in le.classes_:
    mapping[i] = le.transform([i])[0]
print(mapping)

fig = px.sunburst(df, path=['label', 'type'])
fig.show()

**LabelEncoder has labeled bias as 0, bs as 1, and conspiracy as 2 etc.**
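
The mapping above follows from how LabelEncoder assigns codes: it sorts the unique labels and uses each label's position as its code. The sketch below mimics that behaviour in plain Python on a toy list of labels (the list is made up for illustration and is not taken from the dataset); it is not the library's implementation.

```python
# Toy labels for illustration; the notebook itself encodes df['type'].
labels = ["fake", "bs", "bias", "bs", "satire", "hate", "state", "junksci", "conspiracy"]

# Sort the unique labels; each label's position in this list is its integer code.
classes = sorted(set(labels))
mapping = {name: code for code, name in enumerate(classes)}

# Analogue of transform(): label -> code.
encoded = [mapping[name] for name in labels]

# Analogue of inverse_transform(): code -> label.
decoded = [classes[code] for code in encoded]

print(mapping)
print(encoded)
```

Because the codes come from alphabetical order, they carry no meaning beyond identifying the class.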

## Websites and the Types of Stories They Publish

In [None]:
def sites_type(df):
    types = df['type'].unique()
    for type in types:
        df_type = df[df['type'] == type]
        type = le.inverse_transform([type])
        print(f"{r_}The unique sites publishing article of type {type[0]} are: {g_}{df_type['site_url'].unique()}")
        print()
        
sites_type(df)

## Modelling

**The rows in the dataset are ordered. Therefore, we need to shuffle them.**

In [None]:
df = df.sample(frac = 1)
df.head()

**Let's keep only the part of each site URL before the first dot, which removes suffixes like '.com'.**

In [None]:
urls = []
for url in df['site_url']:
    urls.append(url.split('.')[0])
df['site_url'] = urls

**Let's combine the site name and the article text to form a new column, url_text.**

In [None]:
features = df[['site_url', 'text_without_stopwords']].copy()  # copy so the edits below do not act on a view of df
features['url_text'] = features["site_url"].astype(str) + " " + features["text_without_stopwords"]
features.drop(['site_url', 'text_without_stopwords'], axis = 1, inplace = True)
features.head()

In [None]:
X = features
y = df['type']
y = y.tolist()

## TF-IDF

**TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a technique for quantifying words in documents: each word gets a weight that signifies its importance in the document and in the corpus. This method is widely used in Information Retrieval and Text Mining.**

TF-IDF for a word in a document is calculated by multiplying two different metrics:

- The term frequency of a word in a document. There are several ways of calculating this frequency, with the simplest being a raw count of instances a word appears in a document. Then, there are ways to adjust the frequency, by length of a document, or by the raw frequency of the most frequent word in a document.

- The inverse document frequency of the word across a set of documents. This means, how common or rare a word is in the entire document set. The closer it is to 0, the more common a word is. This metric can be calculated by taking the total number of documents, dividing it by the number of documents that contain a word, and calculating the logarithm.

So, if the word is very common and appears in many documents, this number will approach 0. Otherwise it grows, reaching the logarithm of the total number of documents for a word that appears in only one document.

**The importance increases proportionally to the number of times a word appears in the document but is balanced by the frequency of the word in the corpus.**

Typically, the tf-idf weight is composed of two terms: **Term Frequency (TF)**, the number of times a word appears in a document divided by the total number of words in that document, and **Inverse Document Frequency (IDF)**, computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
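
To make the two terms concrete, here is a minimal sketch that computes TF, IDF and their product by hand on a made-up three-document corpus (the documents and helper names are assumptions for illustration). Note that scikit-learn's TfidfVectorizer, used below, applies a smoothed IDF and L2-normalises each row by default, so its numbers will differ from this textbook version.

```python
import math

# Toy corpus, already tokenised; not taken from the dataset.
docs = [
    "election fraud claim".split(),
    "election results announced".split(),
    "aliens rigged election".split(),
]

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(total documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

common = tfidf("election", docs[0], docs)  # "election" is in every document
rare = tfidf("fraud", docs[0], docs)       # "fraud" is in only one document

print(common, rare)
```

The word that appears in every document gets a weight of 0, while the word that appears in a single document gets the largest weight available in this corpus.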

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 42)

tfidf_vectorizer = TfidfVectorizer(use_idf = True, stop_words = 'english')

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train['url_text'])
X_test_tfidf = tfidf_vectorizer.transform(X_test['url_text'])

In [None]:
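# toarray() and get_feature_names_out() replace the older .A and get_feature_names() (removed in scikit-learn 1.2)
tfidf_train = pd.DataFrame(X_train_tfidf.toarray(), columns = tfidf_vectorizer.get_feature_names_out())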
tfidf_train.head()

Above is the representation of the **tf-idf matrix**. The first row represents the **first 'url_text'** document, and each column holds the tf-idf value of one vocabulary term for that document. One point to note here is the presence of a very large number of zeros. We will be dealing with that in the next section.

In [None]:
rfc = RandomForestClassifier(n_estimators=100,random_state=42)
rfc.fit(X_train_tfidf, y_train)  # fit on the same sparse representation that is used for prediction
y_pred = rfc.predict(X_test_tfidf)
RFscore = metrics.accuracy_score(y_test, y_pred)
print("The accuracy is : ", RFscore)

In [None]:
print("The Weighted F1 score is: ", metrics.f1_score(y_test, y_pred, average = 'weighted'))
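
For reference, the weighted F1 score printed above is the per-class F1 averaged with weights equal to each class's number of true samples. A minimal pure-Python sketch of that calculation, using toy labels that are not taken from the dataset (the function name is hypothetical):

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Support = number of true samples per class; it is used as the averaging weight.
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
score = weighted_f1(y_true, y_pred)
print(score)
```

With imbalanced classes, as in this dataset, the weighted average is dominated by the largest classes, so it can look healthy even when small classes are predicted poorly.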

Let's try to improve on this using a different approach called **Embedding**.

## Embedding

By applying the tf-idf method above, we got a lot of zeros in each document's representation, i.e. we got a sparse matrix. A sparse matrix is not a faithful representation of the corpus, because it doesn't take into account the similarity of the words. That is where embeddings come to our rescue.

A **word embedding is a class of approaches for representing words and documents using a dense vector representation.**

It is an improvement over the more traditional bag-of-words encoding schemes, where large sparse vectors were used to represent each word or to score each word within a vector to represent an entire vocabulary. These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values.
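
As a rough illustration of the difference, the sketch below contrasts a one-hot (sparse) encoding with an embedding lookup. The sizes are tiny and the embedding values are random placeholders; in a real model, such as the Keras Embedding layer used later, the values are learned during training.

```python
import random

random.seed(0)
vocab_size, embedding_dim = 6, 4  # toy sizes, chosen only for illustration

# One dense row per token id. Random here; learned in a real model.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def one_hot(token_id, size):
    # Sparse representation: a single 1 and zeros everywhere else.
    return [1.0 if i == token_id else 0.0 for i in range(size)]

def embed(sequence):
    # Dense representation: look up the row for each token id.
    return [embedding_table[token_id] for token_id in sequence]

sequence = [2, 5, 2, 0]
sparse = [one_hot(token_id, vocab_size) for token_id in sequence]
dense = embed(sequence)

print(len(sparse[0]), len(dense[0]))
```

The one-hot vector grows with the vocabulary, while the dense vector stays at the chosen embedding dimension, and the same token id always maps to the same row.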

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
y_train = np.array(y_train)
y_test = np.array(y_test)

- We will set the maximum number of words in the vocabulary to 10000.

- The test dataset can contain words that are Out of Vocabulary (OOV); we will encode those words with a special OOV token.

- The embedding dimension has been set to 32.

- A lot of documents might be very long, so we will keep the maximum length at 120 tokens.

- Not all documents have the same length. To tackle that we use padding, i.e. adding zeros before or after the sequence to keep the length uniform.

In [None]:
vocab_size = 10000
oov_token = "<OOV>"
embedding_dim = 32
max_length = 120
padding = 'post'  # pad at the end of each sequence
trunc_type = 'post'

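**The Tokenizer's fit_on_texts method updates the internal vocabulary based on the given list of texts and creates the vocabulary index based on word frequency. So if you give it something like "The cat sat on the mat.", it will create a dictionary such that word_index["the"] = 1 and word_index["cat"] = 2.** When an oov_token is set, as in the cell below, index 1 is reserved for that token and every other word shifts up by one.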
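
The idea can be sketched in plain Python. This is a simplified stand-in, not the Keras Tokenizer: it only lowercases and splits on whitespace (no punctuation filtering, no num_words limit), and the function names are made up for illustration.

```python
from collections import Counter

def build_word_index(texts, oov_token=None):
    # Count words, then order them from most to least frequent.
    counts = Counter(word for text in texts for word in text.lower().split())
    ordered = [word for word, _ in counts.most_common()]
    # If an OOV token is given, it takes the first index.
    if oov_token is not None:
        ordered = [oov_token] + ordered
    # Indices start at 1; 0 is left free for padding.
    return {word: index for index, word in enumerate(ordered, start=1)}

def texts_to_sequences(texts, word_index, oov_token=None):
    oov_id = word_index.get(oov_token)
    sequences = []
    for text in texts:
        ids = [word_index.get(word, oov_id) for word in text.lower().split()]
        # Without an OOV token, unknown words are simply dropped.
        sequences.append([i for i in ids if i is not None])
    return sequences

texts = ["the cat sat on the mat"]
plain = build_word_index(texts)
with_oov = build_word_index(texts, oov_token="<OOV>")
seq = texts_to_sequences(["the dog sat"], with_oov, oov_token="<OOV>")

print(plain)
print(with_oov)
print(seq)
```

Here "dog" never appeared in the fitted texts, so it is encoded with the OOV index rather than being dropped.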

In [None]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token = oov_token)
tokenizer.fit_on_texts(X_train['url_text'])
# tokenizer.word_index # Mapping of words to numbers

**Once we have the dictionary of word indexes, we need to convert each text into a numerical representation. For that we use 'texts_to_sequences'.**

In [None]:
training_sequences = tokenizer.texts_to_sequences(X_train['url_text'])
testing_sequences = tokenizer.texts_to_sequences(X_test['url_text']) # Converting the test data to sequences

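**pad_sequences** is used to ensure that all sequences in a list have the same length. **By default this is done by padding with 0 at the beginning of each sequence until each sequence has the same length as the longest sequence.** Here we pass padding = 'post' and maxlen = max_length instead, so zeros are added at the end and longer sequences are cut to 120 tokens.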
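
A minimal pure-Python sketch of the padding and truncating behaviour described above. The helper name `pad` is hypothetical and the function is only an illustration of the 'pre' and 'post' options, not the Keras implementation.

```python
def pad(sequences, maxlen, padding="pre", truncating="pre", value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)
        # Truncate: 'pre' drops tokens from the start, 'post' drops them from the end.
        if len(seq) > maxlen:
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Pad: 'pre' puts the fill values before the tokens, 'post' puts them after.
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

seqs = [[5, 3], [7, 1, 4, 9, 2]]
pre = pad(seqs, maxlen=4)
post = pad(seqs, maxlen=4, padding="post", truncating="post")

print(pre)
print(post)
```

With 'post' for both options, as used in this notebook, each document keeps its first tokens and the zeros sit at the end.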

In [None]:
train_padded = pad_sequences(training_sequences, maxlen = max_length, padding = padding, truncating = trunc_type)
train_padded.shape

There are **1533 sentences** and each sentence has **length of 120 words**.

In [None]:
testing_padded = pad_sequences(testing_sequences, maxlen = max_length, padding = padding, truncating = trunc_type)
testing_padded.shape

There are **512 sentences** and each sentence has **length of 120 words**.

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length = max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation = 'relu'),
    tf.keras.layers.Dense(8, activation = 'softmax')
])

model.compile(loss = 'sparse_categorical_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

In [None]:
num_epochs = 15
history = model.fit(train_padded, y_train, epochs = num_epochs, validation_data = (testing_padded, y_test))

In [None]:
# Index -1 picks the last epoch, so this keeps working if num_epochs changes
print("The Training Accuracy we get is: ", history.history['accuracy'][-1])
print("The Testing Accuracy we get is: ", history.history['val_accuracy'][-1])

**Hope you liked the notebook, any suggestions would be highly appreciated.**
