About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake:
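           1: fake news
           0: real news

Note: the file loaded below has no author column, and its Reuters wire stories carry label 1, so the label encoding for this file may be the reverse of the one listed here.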





In [1]:
# Install the required libraries (uncomment the next line if needed)
# !pip install numpy pandas nltk scikit-learn


import numpy as np
import pandas as pd

# Step 1: Loading the dataset

In [2]:
df = pd.read_csv("../datasets/train.csv", sep=';', engine='python', encoding='utf-8', on_bad_lines='skip')
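
Because `on_bad_lines='skip'` discards malformed rows silently, it can help to know how many were dropped. The following is a minimal sketch, assuming pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine='python'`. The toy CSV string and the `record_bad_line` helper are hypothetical stand-ins and are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for train.csv:
# the third data row has one field too many.
raw = "id;title;text;label\n0;a;b;1\n1;c;d;0\n2;e;f;g;1\n"

skipped = []  # collects the fields of every rejected row

def record_bad_line(fields):
    """Hypothetical helper: remember the bad row, then drop it."""
    skipped.append(fields)
    return None  # returning None tells pandas to skip the row

toy = pd.read_csv(
    io.StringIO(raw),
    sep=';',
    engine='python',
    on_bad_lines=record_bad_line,
)

print(toy.shape)
print(f"skipped rows: {len(skipped)}")
```

Passing the same callable when reading the real file would reveal whether the 24353 loaded rows are the whole dataset or only the rows that parsed cleanly.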


In [3]:
df.head()

Unnamed: 0,id,title,text,label
0,0,Palestinians switch off Christmas lights in Be...,"RAMALLAH, West Bank (Reuters) - Palestinians s...",1
1,1,China says Trump call with Taiwan president wo...,BEIJING (Reuters) - U.S. President-elect Donal...,1
2,2,FAIL! The Trump Organization’s Credit Score W...,While the controversy over Trump s personal ta...,0
3,3,Zimbabwe military chief's China trip was norma...,BEIJING (Reuters) - A trip to Beijing last wee...,1
4,4,THE MOST UNCOURAGEOUS PRESIDENT EVER Receives ...,There has never been a more UNCOURAGEOUS perso...,0


In [4]:
df.shape

(24353, 4)

In [5]:
# dropping the extra column
df = df.drop("id", axis=1)

In [6]:
df

Unnamed: 0,title,text,label
0,Palestinians switch off Christmas lights in Be...,"RAMALLAH, West Bank (Reuters) - Palestinians s...",1
1,China says Trump call with Taiwan president wo...,BEIJING (Reuters) - U.S. President-elect Donal...,1
2,FAIL! The Trump Organization’s Credit Score W...,While the controversy over Trump s personal ta...,0
3,Zimbabwe military chief's China trip was norma...,BEIJING (Reuters) - A trip to Beijing last wee...,1
4,THE MOST UNCOURAGEOUS PRESIDENT EVER Receives ...,There has never been a more UNCOURAGEOUS perso...,0
...,...,...,...
24348,Mexico Senate committee OK's air transport dea...,MEXICO CITY (Reuters) - A key committee in Mex...,1
24349,BREAKING: HILLARY CLINTON’S STATE DEPARTMENT G...,IF SHE S NOT TOAST NOW THEN WE RE IN BIGGER TR...,0
24350,trump breaks from stump speech to admire beaut...,kremlin nato was created for agression \r\nru...,0
24351,NFL PLAYER Delivers Courageous Message: Stop B...,Dallas Cowboys star wide receiver Dez Bryant t...,0


# Step 2: Data preprocessing

In [7]:
# dropping the null values
df = df.dropna()
# checking the null values
df.isnull().sum()

title    0
text     0
label    0
dtype: int64
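
The cell above checks for nulls only after dropping them, so the number of rows lost is never shown. A small sketch of counting first and dropping second, using a hypothetical toy frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing title and one missing text
toy = pd.DataFrame({
    "title": ["a", None, "c"],
    "text": ["x", "y", np.nan],
    "label": [1, 0, 1],
})

null_counts = toy.isnull().sum()   # nulls per column, before dropping
cleaned = toy.dropna()             # removes every row with at least one null
rows_lost = len(toy) - len(cleaned)

print(null_counts)
print(f"rows lost: {rows_lost}")
```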

In [8]:
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/no0ne/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [9]:
# Initializing the PorterStemmer, which strips common suffixes from words
ps = PorterStemmer()
# Getting the English stopwords as a set for fast membership tests
stop_words = set(stopwords.words('english'))

In [10]:
def preprocess_text(text):
    """Keep letters only, lowercase, drop stopwords, and stem each word."""
    # Replacing non-alphabetic characters with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = text.split()
    # Stemming and removing stopwords
    processed_words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(processed_words)
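
To see what the function does to a single string, here is a self-contained sketch. It assumes a tiny hand-written stopword set (`toy_stop_words`, hypothetical) in place of NLTK's English list, so that it runs without downloading any corpus; the cleaning steps are otherwise the same.

```python
import re
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
# Assumption: a small stand-in for stopwords.words('english')
toy_stop_words = {"off", "in", "the"}

def preprocess_text(text):
    # Replace non-letters with spaces, then lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    # Drop stopwords and stem what remains
    return ' '.join(ps.stem(w) for w in text.split() if w not in toy_stop_words)

sample = "Palestinians switch off Christmas lights!"
result = preprocess_text(sample)
print(result)  # palestinian switch christma light
```

The stems are not always dictionary words ("christma"). This does not matter for TF-IDF, which only needs variants of a word to map to the same token.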

In [21]:
df["content"] = df["title"] + " " + df["text"]
df


Unnamed: 0,title,text,label,content
0,Palestinians switch off Christmas lights in Be...,"RAMALLAH, West Bank (Reuters) - Palestinians s...",1,Palestinians switch off Christmas lights in Be...
1,China says Trump call with Taiwan president wo...,BEIJING (Reuters) - U.S. President-elect Donal...,1,China says Trump call with Taiwan president wo...
2,FAIL! The Trump Organization’s Credit Score W...,While the controversy over Trump s personal ta...,0,FAIL! The Trump Organization’s Credit Score W...
3,Zimbabwe military chief's China trip was norma...,BEIJING (Reuters) - A trip to Beijing last wee...,1,Zimbabwe military chief's China trip was norma...
4,THE MOST UNCOURAGEOUS PRESIDENT EVER Receives ...,There has never been a more UNCOURAGEOUS perso...,0,THE MOST UNCOURAGEOUS PRESIDENT EVER Receives ...
...,...,...,...,...
24348,Mexico Senate committee OK's air transport dea...,MEXICO CITY (Reuters) - A key committee in Mex...,1,Mexico Senate committee OK's air transport dea...
24349,BREAKING: HILLARY CLINTON’S STATE DEPARTMENT G...,IF SHE S NOT TOAST NOW THEN WE RE IN BIGGER TR...,0,BREAKING: HILLARY CLINTON’S STATE DEPARTMENT G...
24350,trump breaks from stump speech to admire beaut...,kremlin nato was created for agression \r\nru...,0,trump breaks from stump speech to admire beaut...
24351,NFL PLAYER Delivers Courageous Message: Stop B...,Dallas Cowboys star wide receiver Dez Bryant t...,0,NFL PLAYER Delivers Courageous Message: Stop B...


In [20]:
df.shape

(24353, 4)

In [12]:
# applying the preprocess_text function
df["content"] = df["content"].apply(preprocess_text)

In [13]:
#checking the processed text
df["content"].head(10)

0    palestinian switch christma light bethlehem an...
1    china say trump call taiwan presid chang islan...
2    fail trump organ credit score make laugh contr...
3    zimbabw militari chief china trip normal visit...
4    uncourag presid ever receiv courag award proce...
5    suspect boko haram suicid bomber kill least ni...
6    watch john oliv present gop debat clowntown f ...
7    senat democrat ask trump attorney gener pick r...
8    trump humili republican latest hissi fit side ...
9    maci get boot loyal custom fire trump know pat...
Name: content, dtype: object

# Step 3: Feature Extraction

In [14]:
# Feature extraction with a TF-IDF vectorizer: keep the 5000 most frequent
# unigrams, bigrams, and trigrams across the corpus
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))

In [None]:
# in this step we will convert the text data into feature vectors
X = vectorizer.fit_transform(df["content"]).toarray()
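
Calling `.toarray()` turns the sparse TF-IDF matrix into a dense one, which is costly at this size. The sketch below estimates the dense footprint for the shape reported by this notebook, (24353, 5000), assuming the vectorizer's default `float64` dtype. It then compares a hypothetical random sparse matrix with its dense equivalent; the toy matrix and its 1% density are assumptions for illustration only.

```python
import numpy as np
from scipy import sparse

# Shape reported by the notebook for X
n_docs, n_features = 24353, 5000
dense_bytes = n_docs * n_features * np.dtype(np.float64).itemsize
print(f"dense float64 array: {dense_bytes / 1e9:.2f} GB")

# Hypothetical toy matrix in CSR format with 1% non-zero entries
toy = sparse.random(1000, 5000, density=0.01, format="csr", random_state=0)
sparse_bytes = toy.data.nbytes + toy.indices.nbytes + toy.indptr.nbytes
toy_dense_bytes = toy.shape[0] * toy.shape[1] * 8

print(f"toy CSR: {sparse_bytes / 1e6:.2f} MB, dense: {toy_dense_bytes / 1e6:.2f} MB")
```

Many scikit-learn estimators accept sparse input directly, so dropping `.toarray()` is an option if memory becomes a problem.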


In [26]:
Y = df["label"].values

In [27]:
print(X.shape)
print(Y.shape)

(24353, 5000)
(24353,)


In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
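
The split above is purely random. If the two classes are imbalanced, passing `stratify` keeps the class proportions equal in both parts. A minimal sketch on hypothetical toy arrays (not the notebook's data), reusing the same `test_size` and `random_state`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 5 per class
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 5 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=2, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_tr), np.bincount(y_te))  # class counts in train and test
```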