Installing all the Libraries required

In [None]:
!pip install datasets
!pip install requests
!pip install html5lib
!pip install bs4
!pip install lxml
!pip install wordcloud

Importing Libraries as needed

In [47]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup

Importing **Dataset**

In [None]:
from datasets import load_dataset
 
dataset = load_dataset("datacommons_factcheck", "fctchk_politifact_wapo")

Preview of the Dataset

In [None]:
print(dataset)

Loading the dataset as Pandas Dataframe

In [None]:
df=pd.DataFrame(dataset["train"])

Data Preview 

In [None]:
df.head(5)

Word Cloud for the most frequent words in review ratings

In [None]:
from wordcloud import WordCloud

# Join all ratings into one string; str() on a large array is truncated with "..."
text = " ".join(df['review_rating'].astype(str))
wordcloud = WordCloud(width=1500, height=1500, max_words=1000).generate(text)
fig = plt.gcf()
fig.set_size_inches(18.5, 10.5)
plt.imshow(wordcloud)
plt.axis("off")
plt.show()

A new DataFrame with the frequency of each review_rating

In [None]:
df_review_rating = df.groupby('review_rating').review_rating.count() \
                               .reset_index(name='count') \
                               .sort_values(['count'], ascending=False) \
                               .head(100).reset_index(drop=True)

Computing each rating's share of the count (in %) and the combined share of the top 16

In [None]:
df_review_rating['percent'] = (df_review_rating['count'] / df_review_rating['count'].sum()) * 100
df_review_rating['percent'][0:16].sum()

List of the 16 most frequent review ratings

In [None]:
top_16_reviews = df_review_rating['review_rating'][0:16].values.tolist()
top_16_reviews
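
The coverage figure can be cross-checked without the intermediate DataFrame. The sketch below uses a made-up list of ratings (not taken from the dataset) and the standard library's `Counter`; the names `ratings`, `top_k` and `coverage` are illustrative only. Note that it divides by the number of all rows, which can give a different figure than dividing by the sum over the 100 most frequent ratings, as the cell above does.

```python
from collections import Counter

# Made-up ratings for illustration only; not taken from the dataset
ratings = ["False", "False", "False", "True", "True",
           "Half True", "Misleading", "Unproven"]

counts = Counter(ratings)
top_k = counts.most_common(3)  # the 3 most frequent ratings with their counts

# Share of all rows covered by the top-k ratings, in percent
coverage = 100 * sum(n for _, n in top_k) / len(ratings)
print(top_k, coverage)
```

On the real data, the same idea would use `df['review_rating']` as the input list and `k = 16`.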

I went through the dataset and added a new column, Label, with the values TRUE NEWS and FAKE NEWS. To choose the labels, I checked the frequency of each review_rating. The 16 most frequent ratings are ['False', 'Mostly False', 'Pants on Fire', 'Half True', 'Mostly True', 'True', 'Four Pinocchios', 'Three Pinocchios', 'Two Pinocchios', 'Distorts the Facts', 'Misleading', 'No Evidence', 'Not the Whole Story', 'Spins the Facts', 'Needs context', 'Not the whole story'].

These 16 ratings account for 94.57% of the data. I assign TRUE NEWS to the ratings 'Mostly True' and 'True'; all other ratings are treated as FAKE NEWS.

In [None]:
def categorise(row):
    # Only 'Mostly True' and 'True' count as true news; everything else is fake news
    if row['review_rating'] in ('Mostly True', 'True'):
        return 'TRUE NEWS'
    return 'FAKE NEWS'

df_review_rating['Label'] = df_review_rating.apply(categorise, axis=1)

Labelling **df** and keeping only the rows whose rating is among the top 16

In [None]:
df['Label'] = df.apply(categorise, axis=1)
print(df.shape)
# Keep only rows whose rating is among the 16 most frequent ones
df = df[df['review_rating'].isin(top_16_reviews)]
print(df.shape)

Now, we have to scrape the article text behind **df['review_url']** and train a model on it against **df['Label']**. The steps below scrape one page using Beautiful Soup.

In [49]:
URL='https://www.politifact.com/texas/statements/2016/sep/01/james-woods/james-woods-says-dallas-cowboys-cant-honor-dead-of/'
r=requests.get(URL)
print(r.content)

b'\n<!DOCTYPE html>\n<html lang="en-US" dir="ltr">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="x-ua-compatible" content="ie=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>PolitiFact | James Woods says Dallas Cowboys can&#39;t honor dead officers with helmet decals</title>\n<meta name="description" content="After San Francisco 49ers quarterback Colin Kaepernick didn\xe2\x80\x99t stand during the playing of the national anthem before a g" />\n<meta property="og:url" content="https://www.politifact.com/factchecks/2016/sep/01/james-woods/james-woods-says-dallas-cowboys-cant-honor-dead-of/" />\n<meta property="og:image" content="https://static.politifact.com/politifact/rulings/meter-true.jpg" />\n<meta property="og:image:secure_url" content="https://static.politifact.com/politifact/rulings/meter-true.jpg" />\n<meta property="og:title" content="PolitiFact - James Woods says Dallas Cowboys can&#39;t honor dead officers with helmet decals" />\n<meta

In [None]:
soup=BeautifulSoup(r.content,'html5lib')
print(soup.prettify())



*   On politifact.com the article text sits in a `<div class="m-textblock">`, but on factcheck.org the class is dynamic, so scraping the data with a single fixed class caused problems.
*   After scraping the data, my plan is to fine-tune BERT (a pretrained model) for classification.
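
One possible way to cope with a container class that differs between sites is to try a short list of candidate classes and fall back to the text of all paragraphs. The following is a minimal sketch under stated assumptions, not a tested scraper for either site: only `m-textblock` comes from the notes above, `entry-content` is a hypothetical placeholder that has not been verified on factcheck.org, and the helper name `extract_article_text` is made up here. It uses the built-in `html.parser` so that it runs on inline sample HTML without a network request.

```python
from bs4 import BeautifulSoup

# Candidate container classes; only "m-textblock" comes from the notes above,
# "entry-content" is a hypothetical placeholder
CANDIDATE_CLASSES = ["m-textblock", "entry-content"]

def extract_article_text(html):
    """Return the text of the first known container, else all <p> text."""
    soup = BeautifulSoup(html, "html.parser")
    for css_class in CANDIDATE_CLASSES:
        block = soup.find(class_=css_class)
        if block is not None:
            return block.get_text(" ", strip=True)
    # No known container found: join the text of every paragraph on the page
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

# Inline sample pages (made up) instead of live requests
sample_a = ('<html><body><div class="m-textblock"><p>Claim text.</p>'
            '<p>Ruling text.</p></div><p>Footer</p></body></html>')
sample_b = '<html><body><div class="x123"><p>Other site.</p></div></body></html>'

text_a = extract_article_text(sample_a)
text_b = extract_article_text(sample_b)
print(text_a)
print(text_b)
```

For a real page, `r.content` from the request above would take the place of the sample strings. The paragraph fallback is likely to pick up navigation or footer text as well, so its output would need cleaning before training.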

