In [1]:
import pandas as pd
import seaborn as sns

In [2]:
## Load the dataset and show its first five records
data = pd.read_csv("fake_news_dataset.csv")
data.head()

Unnamed: 0,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake


In [3]:
## Check the data type of each column in the dataset

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20000 non-null  object
 1   text      20000 non-null  object
 2   date      20000 non-null  object
 3   source    19000 non-null  object
 4   author    19000 non-null  object
 5   category  20000 non-null  object
 6   label     20000 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB
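
The `info()` output lists every column as `object`, including `date`. A minimal sketch of converting such a column to a datetime dtype follows; the frame `sample` and its three values are made up for illustration, and the `%Y-%m-%d` format is an assumption based on the rows shown by `head()`.

```python
import pandas as pd

# Hypothetical three-row frame mimicking the string format of the `date` column.
sample = pd.DataFrame({"date": ["2023-03-10", "2022-05-25", "2022-09-01"]})

# Parse the strings; errors="coerce" turns unparseable values into NaT
# instead of raising, so bad rows can be inspected afterwards.
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d", errors="coerce")

print(sample.dtypes)
print(sample["date"].dt.year.tolist())
```

Once the column is a datetime, accessors such as `.dt.year` or `.dt.month` become available for later feature engineering.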


In [5]:
## Get the number of non-null values in each column

In [6]:
data.count()

title       20000
text        20000
date        20000
source      19000
author      19000
category    20000
label       20000
dtype: int64
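
`count()` reports non-null values per column, which is why `source` and `author` show 19000 rather than 20000. A small sketch of the difference between the row count and the per-column non-null count, using a made-up frame (`sample` is illustrative only):

```python
import pandas as pd

# Hypothetical frame with one missing value in `source`.
sample = pd.DataFrame({
    "title": ["a", "b", "c"],
    "source": ["CNN", None, "BBC"],
})

n_rows = len(sample)       # total rows, including rows with missing values
non_null = sample.count()  # per-column count of non-null entries

print(n_rows)
print(non_null)
```

For the total number of rows, `len(data)` or `data.shape[0]` is the more direct choice.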

In [7]:
## Check whether there are any null values

In [8]:
data.isnull().sum()

title          0
text           0
date           0
source      1000
author      1000
category       0
label          0
dtype: int64
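
The counts above correspond to 1000 of 20000 rows (5%) missing in each of `source` and `author`. A sketch of computing the missing share directly, on a made-up frame (`sample` and `missing_pct` are illustrative names):

```python
import pandas as pd

# Hypothetical four-row frame with one gap in each of two columns.
sample = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "source": ["CNN", None, "BBC", "CNN"],
    "author": [None, "x", "y", "z"],
})

# isnull() yields booleans; their column-wise mean is the missing fraction.
missing_pct = sample.isnull().mean() * 100

print(missing_pct)
```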

In [9]:
## Check for duplicate rows

In [10]:
data.duplicated().sum()

0
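
`duplicated()` with no arguments flags a row only when every column matches an earlier row. As a sketch, passing `subset` restricts the comparison to chosen columns; the frame below is made up, and choosing `title` and `text` as the key is an assumption, not something the notebook does.

```python
import pandas as pd

# Hypothetical frame: rows 0 and 1 share title and text but differ in author.
sample = pd.DataFrame({
    "title": ["A", "A", "B"],
    "text": ["same body", "same body", "other body"],
    "author": ["x", "y", "z"],
})

# All columns must match for a row to count as a duplicate.
full_dupes = sample.duplicated().sum()

# Only title and text are compared here.
content_dupes = sample.duplicated(subset=["title", "text"]).sum()

print(full_dupes, content_dupes)
```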

In [11]:
## Split the whole dataset into train and test set

In [12]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(data, test_size=0.2, random_state=42)

In [13]:
print(train_set.shape,test_set.shape)

(16000, 7) (4000, 7)
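
The split above is purely random. If the `real`/`fake` proportions should be preserved in both parts, `train_test_split` accepts a `stratify` argument. A minimal sketch on a made-up frame follows; the names `sample`, `train_part`, and `test_part` are illustrative, and whether stratification is needed for this dataset is not established here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced frame: 50 "real" and 50 "fake" rows.
sample = pd.DataFrame({
    "text": [f"doc {i}" for i in range(100)],
    "label": ["real", "fake"] * 50,
})

# stratify keeps the label proportions the same in both parts.
train_part, test_part = train_test_split(
    sample, test_size=0.2, random_state=42, stratify=sample["label"]
)

print(train_part["label"].value_counts())
print(test_part["label"].value_counts())
```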


## Exploratory Data Analysis (EDA)

In [14]:
## Check the null values in train_set

In [15]:
train_set.isnull().sum()

title         0
text          0
date          0
source      782
author      811
category      0
label         0
dtype: int64

In [16]:
## Check for duplicate rows in train_set

In [17]:
train_set.duplicated().sum()

0

In [18]:
## Check the value counts of the source and author columns

In [19]:
train_set['source'].value_counts()

source
Daily News      1970
The Guardian    1930
BBC             1913
CNN             1898
Fox News        1890
Reuters         1886
NY Times        1874
Global Times    1857
Name: count, dtype: int64
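
By default `value_counts()` leaves missing values out, so the 782 null `source` entries do not appear in the table above. A sketch of including them, and of viewing shares instead of counts, on a made-up series (`sample`, `counts`, and `shares` are illustrative names):

```python
import pandas as pd

# Hypothetical series with one missing entry.
sample = pd.Series(["CNN", "BBC", None, "CNN"], name="source")

# dropna=False adds a row for missing values.
counts = sample.value_counts(dropna=False)

# normalize=True reports fractions of the total instead of raw counts.
shares = sample.value_counts(normalize=True, dropna=False)

print(counts)
print(shares)
```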

In [20]:
train_set['author'].value_counts()

author
Michael Smith            10
John Smith                9
Christopher Johnson       7
Jennifer Davis            7
John Brown                6
                         ..
Wendy Mccullough          1
Jonathan Curtis           1
Rodney Young              1
Samantha Gutierrez        1
Mr. Ernest Harris Jr.     1
Name: count, Length: 13847, dtype: int64

In [21]:
## Fill the null values

In [33]:
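# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated
train_set['source'] = train_set['source'].fillna('Global Times')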

In [34]:
train_set.isnull().sum()

title       0
text        0
date        0
source      0
author      0
category    0
label       0
dtype: int64

In [31]:
# Assign the result back instead of using inplace=True on a selected column,
# which pandas warns will stop working in pandas 3.0
train_set['author'] = train_set['author'].fillna('Wendy Mccullough')


In [32]:
train_set.isnull().sum()

title       0
text        0
date        0
source      0
author      0
category    0
label       0
dtype: int64
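
pandas recommends filling on the frame itself rather than calling `fillna(..., inplace=True)` on a selected column. A minimal sketch of that pattern follows, filling two columns in one call; the frame is made up, and the `"Unknown"` placeholder is an assumption for illustration, not the value this notebook uses.

```python
import pandas as pd

# Hypothetical frame with one gap in each column.
sample = pd.DataFrame({
    "source": ["CNN", None, "BBC"],
    "author": ["Paula George", "Joseph Hill", None],
})

# A dict maps each column to its fill value; the result is assigned back,
# so no chained inplace operation is involved.
sample = sample.fillna({"source": "Unknown", "author": "Unknown"})

print(sample.isnull().sum())
```

An explicit placeholder keeps imputed rows distinguishable from rows that genuinely carry a given source or author name.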

In [37]:
## Check for duplicate rows again

In [39]:
train_set.duplicated().sum()

0

#### Text Preprocessing

In [40]:
import re
import string

In [41]:
train_set['text'].head()

5894    image morning whether thought seven office kit...
3728    recent item success plant dark however however...
8958    yourself religious point hour Mrs cover case s...
7671    road listen add question main head worker gene...
5999    low indicate education support brother suffer ...
Name: text, dtype: object

In [42]:
## Convert the text to lowercase

In [44]:
# Lowercase the text and collapse repeated whitespace into single spaces
train_set['text'] = train_set['text'].apply(lambda s: " ".join(s.lower().split()))

In [45]:
train_set['text']

5894     image morning whether thought seven office kit...
3728     recent item success plant dark however however...
8958     yourself religious point hour mrs cover case s...
7671     road listen add question main head worker gene...
5999     low indicate education support brother suffer ...
                               ...                        
11284    middle form imagine item company good town for...
11964    collection short section yourself involve real...
5390     card compare magazine education evening energy...
860      fire ask lose institution field candidate age ...
15795    book easy morning report kind better start fig...
Name: text, Length: 16000, dtype: object

In [46]:
# Apply the same lowercasing and whitespace cleanup to the author names
train_set['author'] = train_set['author'].apply(lambda s: " ".join(s.lower().split()))

In [47]:
train_set['author']

5894         kimberly martinez
3728     mr. daniel bailey dds
8958           kimberly wagner
7671       charlene harrington
5999            robert gardner
                 ...          
11284              cathy woods
11964      nicholas williamson
5390               david mcgee
860           stephanie austin
15795    mr. ernest harris jr.
Name: author, Length: 16000, dtype: object
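
The same lowercasing is applied to both `text` and `author`, so it can be factored into one helper and reused, for example on the test set later. A sketch using pandas' vectorized string methods follows; `lowercase_and_squeeze` is a hypothetical helper name, and the sample strings are made up.

```python
import pandas as pd


def lowercase_and_squeeze(series: pd.Series) -> pd.Series:
    """Lowercase each string and collapse runs of whitespace to one space."""
    return series.str.lower().str.split().str.join(" ")


# Hypothetical strings with mixed case and irregular spacing.
sample = pd.Series(["Hour  Mrs cover", " Republican safe\twhere "])
cleaned = lowercase_and_squeeze(sample)

print(cleaned.tolist())
```

Usage in this notebook would look like `train_set['text'] = lowercase_and_squeeze(train_set['text'])`, assuming the column has no missing values at that point.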