## Exploratory analysis of the [Fake-News-Detection-dataset](https://huggingface.co/datasets/Pulk17/Fake-News-Detection-dataset)

In [None]:
import pandas as pd

# Reading an hf:// path requires the huggingface_hub package to be installed.
# https://huggingface.co/datasets/Pulk17/Fake-News-Detection-dataset
df = pd.read_csv("hf://datasets/Pulk17/Fake-News-Detection-dataset/train.tsv", sep="\t")

### Information about the dataset

First 5 rows of the dataset

In [None]:
df.head(5)

,Unnamed: 0,title,text,subject,date,label
0,2619,Ex-CIA head says Trump remarks on Russia inter...,Former CIA director John Brennan on Friday cri...,politicsNews,"July 22, 2017",1
1,16043,YOU WON’T BELIEVE HIS PUNISHMENT! HISPANIC STO...,How did this man come to OWN this store? There...,Government News,"Jun 19, 2017",0
2,876,Federal Reserve governor Powell's policy views...,President Donald Trump on Thursday tapped Fede...,politicsNews,"November 2, 2017",1
3,19963,SCOUNDREL HILLARY SUPPORTER STARTS “TrumpLeaks...,Hillary Clinton ally David Brock is offering t...,left-news,"Sep 17, 2016",0
4,10783,NANCY PELOSI ARROGANTLY DISMISSES Questions on...,Pleading ignorance is a perfect ploy for Nancy...,politics,"May 26, 2017",0


Last 5 rows of the dataset

In [174]:
df.tail(5)

,Unnamed: 0,title,text,subject,date,label
29995,6880,U.S. aerospace industry urges Trump to help Ex...,The chief executive of the U.S. Aerospace Indu...,politicsNews,"December 6, 2016",1
29996,17818,Highlights: Hong Kong leader Carrie Lam delive...,The following are highlights of the maiden pol...,worldnews,"October 11, 2017",1
29997,5689,Obama Literally LAUGHS At Claims That Brexit M...,If there s one thing President Barack Obama is...,News,"June 28, 2016",0
29998,15805,Syrian army takes full control of Deir al-Zor ...,The Syrian army and its allies have taken full...,worldnews,"November 2, 2017",1
29999,8143,"U.S., Israel sign $38 billion military aid pac...",The United States will give Israel $38 billion...,politicsNews,"September 14, 2016",1


### Types of data

In [175]:
df.dtypes

Unnamed: 0     int64
title         object
text          object
subject       object
date          object
label          int64
dtype: object
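
The `date` column is stored as `object`, meaning plain strings, and the sample rows show at least two spellings: full month names (`July 22, 2017`) and abbreviated ones (`Jun 19, 2017`). If real dates are needed later, one possible way to parse them is sketched below on a few hand-written values. The helper name `parse_article_dates` is hypothetical, and the assumption that only these two formats occur in the full dataset has not been verified.

```python
import pandas as pd


def parse_article_dates(dates: pd.Series) -> pd.Series:
    """Parse 'July 22, 2017' and 'Jun 19, 2017' style strings.

    Assumption: only these two formats occur; anything else becomes NaT.
    """
    full_month = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")
    short_month = pd.to_datetime(dates, format="%b %d, %Y", errors="coerce")
    # Use the abbreviated-month result wherever the full-month parse failed.
    return full_month.fillna(short_month)


# Hand-written sample values, not read from the dataset.
sample_dates = pd.Series(
    ["July 22, 2017", "Jun 19, 2017", "November 2, 2017", "not a date"]
)
parsed_dates = parse_article_dates(sample_dates)
print(parsed_dates)
```

Counting `parsed_dates.isna().sum()` on the real column would show how many values match neither format.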

### Size of the dataset (rows and columns)

In [176]:
print("Size:", df.shape)

Size: (30000, 6)


### Description of the columns

In [177]:
descriptions = [
    "The unique identifier for each news article.",
    "The title of the news article.",
    "The content of the news article.",
    "The subject indicates the category of the news article.",
    "The publication date of the news article.",
    "The label indicating whether the news article is real (1) or fake (0)."
]
    
df_description = pd.DataFrame({
    "Column name": df.columns,
    "Description": descriptions
})

df_description.index = df_description.index + 1  # Start index at 1

df_description.style.set_table_styles([
    {'selector': 'th',
     'props': [('text-align', 'left')]},
    {'selector': 'td',
     'props': [('text-align', 'left')]}
])

,Column name,Description
1,Unnamed: 0,The unique identifier for each news article.
2,title,The title of the news article.
3,text,The content of the news article.
4,subject,The subject indicates the category of the news article.
5,date,The publication date of the news article.
6,label,The label indicating whether the news article is real (1) or fake (0).
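
The table above pairs `df.columns` with the list of descriptions purely by position, so a reordered or additional column would shift the descriptions without any error. A name-keyed variant is sketched below. The dictionary `column_descriptions`, the empty stand-in frame `toy_df` and the result name `toy_description` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Empty stand-in frame that only carries the column names shown above.
toy_df = pd.DataFrame(
    columns=["Unnamed: 0", "title", "text", "subject", "date", "label"]
)

column_descriptions = {
    "Unnamed: 0": "The unique identifier for each news article.",
    "title": "The title of the news article.",
    "text": "The content of the news article.",
    "subject": "The subject indicates the category of the news article.",
    "date": "The publication date of the news article.",
    "label": "The label indicating whether the news article is real (1) or fake (0).",
}

# Fail loudly if a column has no description instead of misaligning the table.
undescribed = [col for col in toy_df.columns if col not in column_descriptions]
if undescribed:
    raise KeyError(f"No description for columns: {undescribed}")

toy_description = pd.DataFrame({
    "Column name": list(toy_df.columns),
    "Description": [column_descriptions[col] for col in toy_df.columns],
})
toy_description.index = toy_description.index + 1  # Start index at 1
print(toy_description)
```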


To improve readability, we rename the columns whose names are unclear and recode the label:
- The column ```Unnamed: 0``` is renamed to ```id```.
- The column ```label``` is renamed to ```is_fake_news```.
- Values in the column ```is_fake_news``` are recoded from ```0``` and ```1``` to ```True``` and ```False``` respectively, because the original label uses 0 for fake news and 1 for real news.

In [178]:
df = df.rename(columns={"Unnamed: 0": "id", "label": "is_fake_news"})
# The original label uses 1 for real and 0 for fake, so 0 maps to True.
df["is_fake_news"] = df["is_fake_news"].map({0: True, 1: False})
df.head(5)

,id,title,text,subject,date,is_fake_news
0,2619,Ex-CIA head says Trump remarks on Russia inter...,Former CIA director John Brennan on Friday cri...,politicsNews,"July 22, 2017",False
1,16043,YOU WON’T BELIEVE HIS PUNISHMENT! HISPANIC STO...,How did this man come to OWN this store? There...,Government News,"Jun 19, 2017",True
2,876,Federal Reserve governor Powell's policy views...,President Donald Trump on Thursday tapped Fede...,politicsNews,"November 2, 2017",False
3,19963,SCOUNDREL HILLARY SUPPORTER STARTS “TrumpLeaks...,Hillary Clinton ally David Brock is offering t...,left-news,"Sep 17, 2016",True
4,10783,NANCY PELOSI ARROGANTLY DISMISSES Questions on...,Pleading ignorance is a perfect ploy for Nancy...,politics,"May 26, 2017",True
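
`Series.map` with a dictionary turns every value that is not a key into `NaN` instead of raising an error, so an unexpected label would pass unnoticed. The sketch below shows this behaviour on made-up values, along with one possible guard. `toy_labels` and `valid_labels` are illustrative data, not taken from the dataset.

```python
import pandas as pd

label_to_is_fake = {0: True, 1: False}

# Made-up labels; the value 2 is deliberately invalid.
toy_labels = pd.Series([1, 0, 0, 2])
mapped = toy_labels.map(label_to_is_fake)
unmapped_count = int(mapped.isna().sum())
print("Unmapped labels:", unmapped_count)

# Guard: only convert to bool once every label has been mapped.
valid_labels = pd.Series([1, 0, 0, 1])
is_fake = valid_labels.map(label_to_is_fake)
if is_fake.isna().any():
    raise ValueError("Found labels other than 0 and 1")
is_fake = is_fake.astype(bool)
print(is_fake.tolist())
```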


### Deletion of duplicate rows

The dataset contains duplicated entries. We remove them to avoid bias in the training and evaluation of the models.
Because the column ```id``` is unique for each row, we exclude it when looking for duplicates.

In [179]:
duplicate_row_df = df[df.duplicated(subset=df.columns.difference(['id']))]
print("Number of duplicate rows:", duplicate_row_df.shape[0])
print("Size before removing duplicates:", df.shape)

Number of duplicate rows: 92
Size before removing duplicates: (30000, 6)


The dataset contains 92 duplicate rows. We will remove them now.

In [180]:
df = df.drop_duplicates(subset=df.columns.difference(['id']))
print("Size after removing duplicates:", df.shape)

Size after removing duplicates: (29908, 6)
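
By default, `duplicated` flags only the second and later copies of a row, which is why 92 rows are reported and the first copy of each article survives `drop_duplicates`. To inspect every member of a duplicate group side by side, `keep=False` can be passed instead. The sketch below uses made-up rows; `toy_df` is illustrative data.

```python
import pandas as pd

# Made-up rows: ids 1 and 2 share the same title and text.
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", "A", "B", "C"],
    "text": ["same body", "same body", "other body", "third body"],
})
content_columns = list(toy_df.columns.difference(["id"]))

# Default keep="first": only the later copies are flagged.
later_copies = toy_df[toy_df.duplicated(subset=content_columns)]
# keep=False: every row that has at least one twin is flagged.
all_copies = toy_df[toy_df.duplicated(subset=content_columns, keep=False)]
deduplicated = toy_df.drop_duplicates(subset=content_columns)

print(len(later_copies), len(all_copies), len(deduplicated))
```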


### Check for missing values in the dataset

Let's check if there are any missing values in the dataset.

In [181]:
print(df.isnull().sum())

id              0
title           0
text            0
subject         0
date            0
is_fake_news    0
dtype: int64


No missing values were found in the dataset.
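
Note that `isnull` only detects `NaN` and `None`; an empty or whitespace-only string still counts as present. A possible complementary check is sketched below on made-up rows. Whether the real dataset contains such blank values is not established here, and `toy_df` is illustrative data.

```python
import pandas as pd

# Made-up rows: one whitespace-only text and one empty text.
toy_df = pd.DataFrame({
    "title": ["Headline", "Another headline", "Third headline"],
    "text": ["Some body text", "   ", ""],
})

null_counts = toy_df.isnull().sum()
# Count the values that are empty once surrounding whitespace is stripped.
blank_counts = toy_df.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```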

### Deletion of unnecessary news articles

After deduplication, the dataset contains 29,908 articles.
However, we only want to keep articles whose text length (in characters) is greater than or equal to the average length of all articles. This is to avoid giving the chatbots articles that are too easy to classify as real or fake news.

In [182]:
df_average_length = df['text'].str.len().mean()
print("The average length of articles is:", df_average_length.astype(int))

The average length of articles is: 2484


We now remove every article whose text is shorter than the average length, which is about ```2484``` characters.

In [183]:
df = df[df['text'].str.len() >= df_average_length]
print("Size after removing short articles:", df.shape)

Size after removing short articles: (12001, 6)
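
Filtering at the mean keeps fewer than half of the rows here (12,001 of 29,908), which is what one would expect if a minority of very long articles pulls the mean above the median. The sketch below illustrates that effect on made-up texts and compares the mean threshold with a median one. The toy data and the median alternative are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Made-up texts: four short articles and one very long one.
toy_df = pd.DataFrame({"text": ["a" * n for n in (100, 200, 300, 400, 5000)]})
lengths = toy_df["text"].str.len()

mean_length = lengths.mean()      # pulled up by the single long article
median_length = lengths.median()  # middle value, robust to the outlier

kept_by_mean = toy_df[lengths >= mean_length]
kept_by_median = toy_df[lengths >= median_length]

print("Mean:", mean_length, "Median:", median_length)
print("Kept by mean:", len(kept_by_mean), "Kept by median:", len(kept_by_median))
```

With a skewed length distribution, the mean threshold is the stricter of the two, so it keeps fewer and longer articles.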


### Number of real vs fake news

We count the real and fake news articles to see whether the dataset is balanced.

In [187]:
real_news_count = df['is_fake_news'].value_counts()[False]
fake_news_count = df['is_fake_news'].value_counts()[True]
real_news_percent = (real_news_count / (real_news_count + fake_news_count)) * 100
fake_news_percent = (fake_news_count / (real_news_count + fake_news_count)) * 100
print("Number of real news articles:", real_news_count, "({:.2f}%)".format(real_news_percent))
print("Number of fake news articles:", fake_news_count, "({:.2f}%)".format(fake_news_percent))


Number of real news articles: 5851 (48.75%)
Number of fake news articles: 6150 (51.25%)


The difference between real and fake news articles is small (299 articles, or 2.5 percentage points), so the dataset is roughly balanced.
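
The same counts and percentages can also be read directly from `value_counts`, with `normalize=True` giving the shares. A short sketch on a made-up boolean column follows; `toy_flags` is illustrative data.

```python
import pandas as pd

# Made-up flags: three fake articles and one real article.
toy_flags = pd.Series([True, True, False, True], name="is_fake_news")

# Convert to plain dicts so the boolean keys are looked up unambiguously.
counts_by_class = toy_flags.value_counts().to_dict()
share_by_class = (toy_flags.value_counts(normalize=True) * 100).to_dict()

for is_fake, name in [(False, "real"), (True, "fake")]:
    count = counts_by_class.get(is_fake, 0)
    share = share_by_class.get(is_fake, 0.0)
    print(f"Number of {name} news articles: {count} ({share:.2f}%)")
```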

### Conclusion
In this notebook, we have explored the Fake-News-Detection-dataset. We performed the following operations to clean the dataset:
- Renamed unclear columns and recoded the label as a boolean
- Removed duplicate rows
- Checked for missing values
- Removed articles with a text length smaller than the average length of all articles

The resulting dataset of 12,001 articles is now ready to be used for the evaluation of the chatbots.