<a href="https://colab.research.google.com/github/singularity014/BERT_FakeNews_Detection_Challenge/blob/master/SImilarity_News.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem Statement

- **Given** - A dataset of fake and real news.

- **Objective** - To develop a solution that  
  detects whether a given news article is fake or real.

- **Methodology used** - There are multiple ways of approaching this problem. I have tried two popular clustering approaches:
the Contextual Clustering approach and an LDA-based approach.

# Importing Dependencies
We will add and install all the dependencies
required here.



In [11]:
# !pip install textacy
# !pip install ktrain
# !pip install torch
# !pip install spacy
# !pip install neuralcoref
# !pip install nltk




In [12]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import urllib.request
import json
from ktrain import text
import nltk
from nltk.tokenize import sent_tokenize

from google.colab import drive

In [18]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

# Data Loader

In [None]:
# Mounting Drive
drive.mount("/content/drive")

In [70]:
def load_convert_data(url):
    """
    Downloads the JSON file from the given URL and converts it into a pandas DataFrame.
    """
    # Use a distinct name for the response so the `url` argument is not shadowed
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode('utf-8'))
    return pd.DataFrame.from_dict(data)

In [71]:
df_data = load_convert_data("https://storage.googleapis.com/public-resources/dataset/clusters.json")

In [72]:
df_data['text'][0]

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.\nAt the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.\nIf there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.\nThe Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but wo

In [73]:
# checking data features
print(df_data.columns )

# checking number of data points
print(df_data.shape)

# checking distinct clusters and their numbers
cluster_names = list(zip(df_data.cluster_name.unique(), df_data.cluster.unique()))
print(cluster_names)

Index(['id', 'text', 'title', 'lang', 'date', 'cluster', 'cluster_name'], dtype='object')
(181, 7)
[('MS fails to respond', '0'), ('Anti-Russia', '1'), ('Claims about China', '2'), ('Collapse', '3'), ('Coronavirus is not serious', '4'), ('Cure', '5'), ('EU fails to respond', '6'), ('Miscellaneous', '7'), ('Origins', '8'), ('Properties', '9'), ('Was predicted', '10'), ('Secret plan of the global elite', '11'), ('Ukraine fails to respond', '12'), ('USA created COVID-2019', '13')]


**Observations**:

- There are 181 news articles in total.
- Each article has the features 'id', 'text', 'title', 'lang', 'date', 'cluster', 'cluster_name'.
- We will use 'title' and 'text' to get a contextual representation of each news article.
- There are 14 distinct clusters in total.

# II - Data Preprocessing and Cleaning

In [74]:
# data cleaner function
def clean_txt(sentence):
    # strip the listed special characters and return the cleaned string
    res = re.sub('[!*)@#%(&$_^]', '', sentence)
    return res

In [75]:
# Text cleaning
df_data['text'] = df_data['text'].apply(clean_txt)
# Title cleaning
df_data['title'] = df_data['title'].apply(clean_txt)
df_data['text'][0]

'The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a “bright European future”.\nAt the very beginning of the ХХl century, the optimization of medicine and health care was carried out in Lithuania, as a result of which the number of medical institutions was sharply reduced - all small ones were closed and only large ones were left. Today in the country there are only five medical centers in the largest cities - in Kaunas, Klaipeda, Siauliai, Panevezys and Vilnius. In the districts, something like paramedic points remained for emergency assistance.\nIf there are few hospitals, then there are few doctors - and today this problem is one of the main ones, we have to involve senior students of medical universities in the fight against the epidemic, but all the same, specialists are sorely lacking.\nThe Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors, but wo
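
To see what the cleaning step removes, here is a small standalone sketch that applies the same character class to a made-up sentence. The helper name `strip_special_chars` and the sample string are hypothetical and only serve as an illustration; periods, commas and quotes are not in the character class, so they are kept.

```python
import re

# Same character class as in clean_txt; compiled once so it can be reused.
SPECIAL_CHARS = re.compile(r"[!*)@#%(&$_^]")

def strip_special_chars(sentence: str) -> str:
    """Return `sentence` with the listed special characters removed."""
    return SPECIAL_CHARS.sub("", sentence)

# Made-up sample text (not taken from the dataset).
sample = "Breaking!!! (COVID_19) cases rise by 5% & more..."
cleaned = strip_special_chars(sample)
print(cleaned)
```

Note that removed characters leave their surrounding spaces behind, so a follow-up whitespace normalization may be useful depending on the downstream model.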

#### Text Summarization

MODEL USED - BERT Extractive Summarizer
Reason - It is trained on a generic dataset that includes news articles such as CNN and Daily Mail. Hence we do not need to fine-tune it (although we can, if the text is domain-heavy).

![alt text](https://iq.opengenus.org/content/images/2020/01/pic3.png)


**Why Summarisation?**
(The idea behind my approach)
- The idea is to extract the important information and represent it in a shorter, more concise form.

- A strong argument for this approach is that, as humans, when we try to decide whether one news article is similar to another, we first form a summary of the two articles in our heads.
- We then compare the context and information conveyed in both news articles.
- By following this approach, we try to mimic that process.
- One may argue that we might lose some information, but a big advantage is that we end up focusing only on the important parts of the news.
- We want to focus on the principal information in the news article.
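
The idea of extractive summarization can be illustrated with a deliberately simple stand-in. The sketch below is not the BERT Extractive Summarizer named above: it is a word-frequency scorer that keeps the highest-scoring sentences in their original order. The function name `summarize`, the naive regex sentence splitter and the sample article are all assumptions made for illustration.

```python
import re
from collections import Counter

def summarize(text: str, n_sentences: int = 2) -> str:
    """Keep the n highest-scoring sentences, in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Document-wide frequencies of lower-cased words with 4 or more letters.
    freqs = Counter(re.findall(r"[a-z]{4,}", text.lower()))

    def score(sentence: str) -> float:
        # Average document frequency of the sentence's words.
        words = re.findall(r"[a-z]{4,}", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)

# Made-up article (not taken from the dataset).
article = ("Hospitals in the region are short of doctors. "
           "The weather was mild on Tuesday. "
           "Officials said more doctors will be sent to the hospitals.")
summary = summarize(article, n_sentences=2)
print(summary)
```

In this toy input the off-topic weather sentence scores lowest and is dropped, which mirrors the intent of focusing only on the principal information.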


# III -  Contextual Clustering Approach

In the Contextual Clustering Approach -

- The idea is to capture what the news paragraph is trying to convey.
- We can use pretrained sentence-embedding approaches such as the Transformer-based "BERT Sentence Transformer".
- I will be using a multilingual Sentence Transformer model to develop the solution.
- We will pose the problem as assigning each news item <paragraph + Title> a vector that serves as its contextual representation.
- If a new news instance arrives, it will be encoded in the same way.
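
Once every article has a vector, one possible way to place a new article is to compare its vector with the mean vector (centroid) of each cluster using cosine similarity. The sketch below assumes this nearest-centroid rule, which is an illustrative choice and not necessarily the method used later in this notebook. The 3-dimensional vectors are made up and stand in for real sentence-encoder outputs; the helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest_cluster(new_vec, clusters):
    """Return the cluster name whose centroid is most similar to new_vec."""
    centroids = {name: centroid(vecs) for name, vecs in clusters.items()}
    return max(centroids, key=lambda name: cosine(new_vec, centroids[name]))

# Toy vectors (assumption): placeholders for encoded <paragraph + Title> inputs.
toy_clusters = {
    "Cure": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Origins": [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]],
}
new_article_vec = [0.7, 0.3, 0.1]
predicted = nearest_cluster(new_article_vec, toy_clusters)
print(predicted)
```

With real embeddings the vectors would have several hundred dimensions, but the comparison step stays the same.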




#### Feeding News Articles to Sentence Tokenizers

In [19]:
def sent_tokens(news_article):
  """
  Accepts a document/news article and performs
  sentence tokenization.
  * returns - list of tokenized sentences
  """
  return sent_tokenize(news_article)

In [21]:
df_data['news_tokenized'] = df_data['text'].apply(sent_tokens)

In [24]:
df_data.news_tokenized[0]

['The coronavirus epidemic in Lithuania has clearly demonstrated the results of democratic reforms on the way to a bright European future At the very beginning of the ХХl century the optimization of medicine and health care was carried out in Lithuania as a result of which the number of medical institutions was sharply reduced all small ones were closed and only large ones were left Today in the country there are only five medical centers in the largest cities in Kaunas Klaipeda Siauliai Panevezys and Vilnius In the districts something like paramedic points remained for emergency assistance If there are few hospitals then there are few doctors and today this problem is one of the main ones we have to involve senior students of medical universities in the fight against the epidemic but all the same specialists are sorely lacking The Minister of Defense of Lithuania Raimundas Karoblis has already promised that the military will be sent to help the doctors but would anyone really want him