#### About the dataset

---

## 📄 WELFake Dataset Overview

**WELFake** is a dataset containing **72,134 news articles**, with:

* 📰 **35,028 real** news articles
* 🕵️ **37,106 fake** news articles

---

### 📚 Dataset Composition

To enhance model robustness and avoid overfitting, the authors **merged four well-known news datasets**:

* Kaggle
* McIntire
* Reuters
* BuzzFeed Political

Combining the four sources yields a larger and more varied corpus than any one of them alone, which gives a classifier more diverse text to train and evaluate on.

---

### 🧾 Dataset Columns

| Column            | Description                                  |
| ----------------- | -------------------------------------------- |
| **Serial number** | Row index (starting from 0)                  |
| **Title**         | Headline of the news article                 |
| **Text**          | Full content of the news article             |
| **Label**         | Classification label: `0` = Fake, `1` = Real |

---

### 📖 Reference

Published in:
*IEEE Transactions on Computational Social Systems*, pp. 1–13
DOI: [10.1109/TCSS.2021.3068519](https://doi.org/10.1109/TCSS.2021.3068519)

---

#### Importing dependencies

In [24]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from google.colab import drive


In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
print('printing the stop words in English')
print(stopwords.words('english'))

printing the stop words in English
['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 

#### Importing datasets

In [5]:
drive.mount('/content/drive')
file_path = '/content/drive/My Drive/Colab Notebooks/WELFake_Dataset.csv'

# Loading the dataset in a pandas dataframe
news_data = pd.read_csv(file_path, index_col=0)

news_data.index.name = 'serial_number'

Mounted at /content/drive


In [6]:
# Checking the number of rows and columns
news_data.shape

(72134, 3)
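A quick look at the label distribution is a useful sanity check at this point. The sketch below uses a tiny hypothetical frame (`toy` and its rows are made up for illustration, not taken from WELFake) to show the `value_counts` calls involved.

```python
import pandas as pd

# Toy stand-in for the WELFake frame (hypothetical rows, not real data).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "text": ["t1", "t2", "t3", "t4", "t5"],
    "label": [1, 1, 0, 1, 0],
})

# Absolute and relative frequency of each label value.
counts = toy["label"].value_counts().sort_index()
shares = toy["label"].value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```

On the real frame, `news_data['label'].value_counts()` should reproduce the 35,028 / 37,106 split listed in the overview, which also makes it easy to confirm which integer carries which class.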

In [7]:
news_data.head(5)

| serial_number | title | text | label |
| ------------- | ----- | ---- | ----- |
| 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... | No comment is expected from Barack Obama Membe... | 1 |
| 1 | NaN | Did they post their votes for Hillary already? | 1 |
| 2 | UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO... | Now, most of the demonstrators gathered last ... | 1 |
| 3 | Bobby Jindal, raised Hindu, uses story of Chri... | A dozen politically active pastors came here f... | 0 |
| 4 | SATAN 2: Russia unvelis an image of its terrif... | The RS-28 Sarmat missile, dubbed Satan 2, will... | 1 |


In [8]:
# Checking for missing values
news_data.isnull().sum()

| Column | Missing values |
| ------ | -------------- |
| title  | 558            |
| text   | 39             |
| label  | 0              |


In [9]:
# Replacing the null values with a single space
news_data = news_data.fillna(' ')
news_data.isnull().sum()

| Column | Missing values |
| ------ | -------------- |
| title  | 0              |
| text   | 0              |
| label  | 0              |


In [10]:
# Merging the title column with text column
news_data['news'] = news_data['title']+" "+news_data['text']
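The fill step matters for this merge: concatenating a missing value with a string gives a missing value, so rows without a title would lose their text as well. A minimal sketch on a made-up two-row frame (the `toy` data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame; the second row has no title.
toy = pd.DataFrame({
    "title": ["Headline one", np.nan],
    "text": ["Body one", "Body two"],
})

# Without filling, a missing title makes the whole concatenation missing.
unfilled = toy["title"] + " " + toy["text"]

# Filling first keeps the text of rows whose title is absent.
filled = toy.fillna(" ")
merged = filled["title"] + " " + filled["text"]

print(unfilled.isna().tolist())
print(merged.tolist())
```

The filled row keeps its body text, at the cost of some leading whitespace that later tokenization discards.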

#### Separating data and label


In [11]:
news_data['news']

| serial_number | news |
| ------------- | ---- |
| 0 | LAW ENFORCEMENT ON HIGH ALERT Following Threat... |
| 1 | Did they post their votes for Hillary already? |
| 2 | UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO... |
| 3 | Bobby Jindal, raised Hindu, uses story of Chri... |
| 4 | SATAN 2: Russia unvelis an image of its terrif... |
| ... | ... |
| 72129 | Russians steal research on Trump in hack of U.... |
| 72130 | WATCH: Giuliani Demands That Democrats Apolog... |
| 72131 | Migrants Refuse To Leave Train At Refugee Camp... |
| 72132 | Trump tussle gives unpopular Mexican leader mu... |


#### Stemming

In [12]:
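port_stem = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Build the set once for fast membership tests

def stemming(news):
    # Replace every non-letter character with a space
    stemmed_content = re.sub(r'[^a-zA-Z]', ' ', news)
    # Lower-case so that stop-word matching is case-insensitive
    stemmed_content = stemmed_content.lower()
    # Tokenize on whitespace
    stemmed_content = stemmed_content.split()
    # Drop stop words and reduce the remaining tokens to their Porter stems
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    return ' '.join(stemmed_content)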
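To see what the function does to a single article, here is a small sketch applied to the text of row 1. It mirrors the steps above but takes the stop-word set as a parameter; `stem_text` and `toy_stop_words` are hypothetical names, and the four-word stop list is an assumption chosen so the example runs without downloading the NLTK corpus.

```python
import re
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Hypothetical mini stop-word set (the notebook uses NLTK's full English list).
toy_stop_words = {"did", "they", "their", "for"}

def stem_text(text, stop_words):
    # Keep letters only, lower-case, split on whitespace.
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    # Drop stop words and stem what remains.
    return " ".join(port_stem.stem(t) for t in tokens if t not in stop_words)

result = stem_text("Did they post their votes for Hillary already?", toy_stop_words)
print(result)
```

Note that stems such as `hillari` and `alreadi` are not dictionary words; the Porter algorithm only aims to map related word forms to the same token.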

In [13]:
# Apply the stemming function to the news column
news_data['news'] = news_data['news'].apply(stemming)


In [14]:
print(news_data['news'])

serial_number
0        law enforc high alert follow threat cop white ...
1                                post vote hillari alreadi
2        unbeliev obama attorney gener say charlott rio...
3        bobbi jindal rais hindu use stori christian co...
4        satan russia unv imag terrifi new supernuk wes...
                               ...                        
72129    russian steal research trump hack u democrat p...
72130    watch giuliani demand democrat apolog trump ra...
72131    migrant refus leav train refuge camp hungari m...
72132    trump tussl give unpopular mexican leader much...
72133    goldman sach endors hillari clinton presid gol...
Name: news, Length: 72134, dtype: object


#### Separating the data and label


In [15]:
# Separating data and label
X = news_data['news'].values

Y = news_data['label'].values



In [16]:
print(X)

['law enforc high alert follow threat cop white blacklivesmatt fyf terrorist video comment expect barack obama member fyf fukyoflag blacklivesmatt movement call lynch hang white peopl cop encourag other radio show tuesday night turn tide kill white peopl cop send messag kill black peopl america one f yoflag organ call sunshin radio blog show host texa call sunshin f ing opinion radio show snapshot fyf lolatwhitefear twitter page p show urg support call fyf tonight continu dismantl illus white snapshot twitter radio call invit fyf radio show air p eastern standard time show caller clearli call lynch kill white peopl minut clip radio show heard provid breitbart texa someon would like refer hannib alreadi receiv death threat result interrupt fyf confer call unidentifi black man said mother f ker start f ing like us bunch ni er takin one us roll said caus alreadi roll gang anyway six seven black mother f cker see white person lynch ass let turn tabl conspir cop start lose peopl state emerg

In [17]:
print(Y)

[1 1 1 ... 0 0 1]


#### Converting textual data to numerical data

In [18]:
# Converting to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)

In [19]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 13656667 stored elements and shape (72134, 162203)>
  Coords	Values
  (0, 938)	0.019104619426517897
  (0, 1282)	0.017363778513914716
  (0, 2131)	0.052457780993620334
  (0, 2783)	0.020231302394732035
  (0, 3614)	0.029904475345647965
  (0, 3999)	0.02747310458208844
  (0, 4264)	0.023865946576073604
  (0, 4335)	0.05055646943154232
  (0, 4846)	0.01513932938772633
  (0, 4862)	0.02486341752399553
  (0, 6013)	0.014596932940161228
  (0, 6507)	0.057303304534120254
  (0, 6845)	0.01589099748716943
  (0, 8437)	0.12657603668480968
  (0, 8976)	0.015516676506767142
  (0, 10478)	0.06692816334064453
  (0, 11430)	0.018962618491219916
  (0, 12727)	0.01580162854760987
  (0, 14072)	0.018345912143817013
  (0, 14679)	0.01785037970922704
  (0, 15442)	0.19279395985841352
  (0, 15499)	0.08125624068348719
  (0, 15611)	0.0888918729364855
  (0, 15886)	0.029332963149593518
  (0, 18063)	0.10843561013885229
  :	:
  (72133, 132638)	0.031715743461707
  (72133

#### Splitting the data into training and test data


In [20]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=13)
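The `stratify=Y` argument keeps the class proportions the same in both partitions. A small sketch with synthetic labels (the 40/60 split and the `toy_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic samples: 40 of class 0 and 60 of class 1.
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 40 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=13
)

# The share of class 1 is preserved in both partitions.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```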

#### Training the Model: Logistic regression

In [21]:
model = LogisticRegression()

In [22]:
model.fit(X_train, Y_train)
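One variation worth considering: the notebook fits the TF-IDF vocabulary on the full corpus before splitting, so the test articles influence the feature space. A possible alternative, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline so that both are fitted on the training portion only. The texts, labels, and `max_iter` value below are assumptions for illustration, not the notebook's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical mini corpus with arbitrary 0/1 labels.
toy_texts = [
    "senat pass budget bill", "court rule tax case",
    "minist visit trade summit", "bank report quarterli profit",
    "shock secret plot expos", "unbeliev video goe viral",
    "watch celebr destroy rival", "must see scandal reveal",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, then fit everything on the training part only.
texts_train, texts_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=13
)

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts_train, y_train)

predictions = pipeline.predict(texts_test)
print(predictions)
```

The pipeline also accepts raw strings at prediction time, which removes the need to call the vectorizer separately.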

#### Evaluation


In [25]:
# Evaluating the model on the training data
X_train_prediction = model.predict(X_train)
# Accuracy
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
# Precision
training_data_precision = precision_score(Y_train, X_train_prediction)
# Recall
training_data_recall = recall_score(Y_train, X_train_prediction)
# F1
training_data_f1 = f1_score(Y_train, X_train_prediction)

In [26]:
print('Accuracy score of the training data :', training_data_accuracy)
print('Precision score of the training data :', training_data_precision)
print('Recall score of the training data :', training_data_recall)
print('F1 score of the training data :', training_data_f1)

Accuracy score of the training data : 0.961651099519989
Precision score of the training data : 0.9579277236964928
Recall score of the training data : 0.9679636179888833
F1 score of the training data : 0.9629195221259698
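The F1 score is the harmonic mean of precision and recall, so it can be cross-checked by hand from the two values printed above:

```python
# Precision and recall as printed for the training data above.
precision = 0.9579277236964928
recall = 0.9679636179888833

# F1 = 2PR / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The result agrees with the F1 value reported by `f1_score` up to floating-point rounding.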


In [27]:
# Evaluating the model on the test data
X_test_prediction = model.predict(X_test)
# Accuracy
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
# Precision
test_data_precision = precision_score(Y_test, X_test_prediction)
# Recall
test_data_recall = recall_score(Y_test, X_test_prediction)
# f1
test_data_f1 = f1_score(Y_test, X_test_prediction)

In [28]:
print('Accuracy score of the test data :', test_data_accuracy)
print('Precision score of the test data :', test_data_precision)
print('Recall score of the test data :', test_data_recall)
print('F1 score of the test data :', test_data_f1)

Accuracy score of the test data : 0.9482913980730575
Precision score of the test data : 0.9426979705531238
Recall score of the test data : 0.9576876431747743
F1 score of the test data : ≈ 0.9501 (harmonic mean of the precision and recall above)


#### Making a Prediction System

In [29]:
X_new = X_test[300]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Fake')
else:
    print('The news is Real')

[1]
The news is Real


In [30]:
print(Y_test[300])

1
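The cells above predict on a row that is already vectorized. For a new, raw article the same preprocessing and the same fitted vectorizer have to be applied first. The sketch below shows one way to wrap that in a helper; `predict_news`, `toy_preprocess`, and the toy corpus are hypothetical, and the simplified preprocessing stands in for the notebook's `stemming` function.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def toy_preprocess(text):
    # Stand-in for stemming(): letters only, lower-cased, no stemming.
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())

def predict_news(raw_text, preprocess, vectorizer, model):
    # Clean the text, reuse the fitted vocabulary, return the class as an int.
    features = vectorizer.transform([preprocess(raw_text)])
    return int(model.predict(features)[0])

# Hypothetical mini corpus with arbitrary 0/1 labels.
toy_texts = [
    "Senate passes budget bill", "Court rules on tax case",
    "Shocking secret plot exposed!", "Unbelievable video goes viral",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform([toy_preprocess(t) for t in toy_texts])
toy_model = LogisticRegression().fit(toy_features, toy_labels)

label = predict_news("Senate passes new tax bill", toy_preprocess, toy_vectorizer, toy_model)
print(label)
```

With the notebook's objects, the equivalent call would pass `stemming`, `vectorizer`, and `model`; the key point is to call `transform`, not `fit`, on new text.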
