# Create a fake news classifier

The rise of social media and the proliferation of digital news sources have made it easier than ever to spread misinformation and fake news. This creates a need to distinguish between real and fake news stories. Machine learning offers a promising approach, because it allows news articles to be classified based on their content and other features.

In this project, I will explore the use of machine learning classification methods to classify news articles as either real or fake. I will process the text of the articles using natural language processing (NLP) methods, and evaluate the performance of the classifier using several metrics.

##### Part 1: Data Preprocessing using NLP Techniques

In this first section, I will perform some exploratory data analysis on the data provided and then preprocess it.

1. Import the necessary libraries and methods.

In [1]:
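import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# One-time downloads of the data used by word_tokenize and stopwords
# (newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')
nltk.download('stopwords')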

2. Read the dataset from the CSV file.

In [2]:
df = pd.read_csv('fake_news_train.csv')

3. After loading the dataset, perform some preliminary exploratory data analysis to understand it. Simple pandas methods and attributes such as `head()`, `shape` and `info()` are enough for this.

In [3]:
# Exploratory data analysis to familiarize yourself with the data
df.head(10)

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,6,Life: Life Of Luxury: Elton John’s 6 Favorite ...,,Ever wonder how Britain’s most iconic pop pian...,1
7,7,Benoît Hamon Wins French Socialist Party’s Pre...,Alissa J. Rubin,"PARIS — France chose an idealistic, traditi...",0
8,8,Excerpts From a Draft Script for Donald Trump’...,,Donald J. Trump is scheduled to make a highly ...,0
9,9,"A Back-Channel Plan for Ukraine and Russia, Co...",Megan Twohey and Scott Shane,A week before Michael T. Flynn resigned as nat...,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [5]:
df.describe()

Unnamed: 0,id,label
count,20800.0,20800.0
mean,10399.5,0.500625
std,6004.587135,0.500012
min,0.0,0.0
25%,5199.75,0.0
50%,10399.5,1.0
75%,15599.25,1.0
max,20799.0,1.0


4. Before proceeding, it is good practice to check whether null values are present in the dataset.

The `info()` output above shows that the `title`, `author` and `text` columns contain null values, so I will fill them with an empty string.

In [6]:
# Count the null values in each column (printed, because only the last expression of a cell is displayed)
print(df.isnull().sum())
# Use the `fillna` method to fill the null values with an empty string
df.fillna("", axis=1, inplace=True)
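
The same check can be tried on a small frame first. The sketch below uses hypothetical toy data, and it fills only the text-like columns rather than the whole frame. That is an alternative I am assuming here, not what the cell above does; it may be preferable when numeric columns such as `label` should keep their integer dtype.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "title": ["A headline", None, "Another headline"],
    "author": [None, "Jane Doe", None],
    "text": ["Some body text", "More text", None],
    "label": [1, 0, 1],
})

# Count the missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)

# Fill only the text-like columns so that 'label' is left untouched
text_columns = ["title", "author", "text"]
toy[text_columns] = toy[text_columns].fillna("")

print(toy.isnull().sum().sum())  # total number of remaining nulls
print(toy["label"].dtype)        # still an integer dtype
```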

5. For data preprocessing, I will focus on the 'text' column of the DataFrame, which contains the content of each news article. The first step is tokenization, which splits each article into a list of word and punctuation tokens.

In [7]:
# Define a function to tokenize the text given
def tokenize_text(text):
    tokens = word_tokenize(text)
    return tokens

# Apply the tokenize_text function to the 'text' column of the DataFrame and create a new column 'tokenized_text'
df['tokenized_text'] = df['text'].apply(tokenize_text)
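
To get a feel for what a token list looks like without downloading any NLTK data, here is a minimal sketch that uses a regular expression as a rough stand-in for `word_tokenize`. The function `simple_tokenize` and the sample sentence are hypothetical, and the splitting rules are much cruder than NLTK's (contractions and abbreviations, for example, are handled differently).

```python
import re

def simple_tokenize(text):
    """Rough stand-in for word_tokenize: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "An Iranian woman has been sentenced, reports say."
tokens = simple_tokenize(sample)
print(tokens)

# Articles whose text was null and filled with "" simply produce an empty token list
print(simple_tokenize(""))
```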


6. With the new column containing the tokens of the text, I can move on to the second preprocessing step, which is to remove the stop words from the tokens.

I will create a set, `stop_words`, that contains the NLTK predefined English stopwords.

In [8]:
# Use a separate name so that the imported `stopwords` corpus is not overwritten
stop_words = set(stopwords.words('english'))

7. Define a function that removes stop words from a list of tokens. The NLTK predefined stopwords are all lowercase, while some of the tokens in the DataFrame contain uppercase letters, so each token is lowercased before it is compared against the stop word set.

In [9]:
# Define a function to remove stopwords from a list of tokens
def remove_stopwords(tokens):
    tokens_without_stopwords = [i for i in tokens if i.lower() not in stop_words]
    return tokens_without_stopwords

# Apply the remove_stopwords function to the 'tokenized_text' column
df['tokenized_text'] = df['tokenized_text'].apply(remove_stopwords)
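
The case-insensitive comparison is easiest to see on a small example. This sketch uses a hypothetical five-word stop word set instead of NLTK's full English list, and a hypothetical helper `remove_stopwords_demo` that takes the set as an argument.

```python
# Small hypothetical stop word set; the notebook uses NLTK's English list instead
toy_stop_words = {"the", "is", "a", "of", "and"}

def remove_stopwords_demo(tokens, stop_words):
    # Compare in lowercase, but keep the original casing of the tokens that survive
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "truth", "is", "A", "matter", "of", "Record"]
filtered = remove_stopwords_demo(tokens, toy_stop_words)
print(filtered)
```

Note that only the comparison is lowercased: "Record" keeps its capital letter, so "Record" and "record" would still count as different tokens later on.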

##### Part 2: Separating the dataset and Vectorization

8. Before proceeding, I will separate the dataset into features and targets. This clearly defines the inputs and outputs of the model.

In [10]:
# Separate the data into features and targets
X_df = df['tokenized_text']
y_df = df['label']
# Analyse the `y_df` data
type(y_df)

pandas.core.series.Series

9. `type(y_df)` only shows that `y_df` is a pandas Series; the data type of its values is given by `y_df.dtype`, and it appears to be `object` rather than a numeric type.

The model needs a binary target that contains only numerical values, so I will convert the label column to integers before training.

In [11]:
y_df = y_df.astype(int)
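
The difference between the container type and the value dtype can be checked directly. The sketch below uses a hypothetical Series of mixed integer and string labels, which is an assumption made for illustration and not a claim about what the real column holds.

```python
import pandas as pd

# Hypothetical labels stored with the generic object dtype
labels = pd.Series([1, 0, "1", "0", 1], dtype=object)

print(type(labels))   # the container: a pandas Series
print(labels.dtype)   # the values: object

# Convert to integers and confirm that only the two classes remain
converted = labels.astype(int)
print(converted.dtype)
print(sorted(converted.unique()))
```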

10. Machine learning models take numerical inputs, so the feature data has to be converted into a numerical format as well. This is where vectorization comes in.

Import the `TfidfVectorizer` and create a `TfidfVectorizer` object. Because the features are already lists of tokens, and the vectorizer expects raw strings by default, I need to tell it to skip its own tokenization.

I will set the `tokenizer` parameter to a lambda function that simply returns each document as-is. I will also set `lowercase=False`, because the default lowercasing step expects a string and would fail on a list of tokens.

After this, I can fit and transform the vectorizer on the tokenized documents `X_df`. This produces the TF-IDF matrix.

In [12]:
# Perform vectorization using the TFIDF Vectorizer and fit and transform the tokenized documents
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False)
tfidf = tfidf_vectorizer.fit_transform(X_df)
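
Here is a minimal sketch of the same configuration on three hypothetical pre-tokenized documents. It uses a named `identity` function instead of a lambda and passes `token_pattern=None`; both are my own choices, made so that the vectorizer can be pickled and so that scikit-learn does not warn about an unused token pattern.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def identity(doc):
    # The documents are already token lists, so return them unchanged
    return doc

# Hypothetical pre-tokenized documents
toy_docs = [
    ["Senate", "passes", "budget", "bill"],
    ["Aliens", "endorse", "budget", "bill"],
    ["senate", "votes", "today"],
]

vectorizer = TfidfVectorizer(tokenizer=identity, lowercase=False, token_pattern=None)
matrix = vectorizer.fit_transform(toy_docs)

print(matrix.shape)                    # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the distinct tokens that became features
```

One consequence of `lowercase=False` shows up in the vocabulary: "Senate" and "senate" become two separate features. If that is not wanted, the tokens can be lowercased during preprocessing instead.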



##### Part 3: Training and testing the model

11. I will use a `LogisticRegression` model to create the fake news classifier.

To evaluate the performance of the model on data it has not seen, I will first split the TF-IDF matrix and the target data into a training set and a testing set using `train_test_split`. With the default settings, 25% of the rows are held out for testing.

In [13]:
# Split the data into test and train
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, y_df, random_state=0)
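
A variation worth knowing about is to split the tokenized documents first and fit the vectorizer on the training portion only, so that the vocabulary and IDF weights are learned without looking at the test articles. The sketch below is one way to do that under my own assumptions (hypothetical toy documents, a stratified split); it is not the ordering used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

def identity(doc):
    # The documents are already token lists, so return them unchanged
    return doc

# Hypothetical pre-tokenized documents with alternating labels
docs = [
    ["fake", "claim", str(i)] if i % 2 else ["official", "report", str(i)]
    for i in range(20)
]
labels = [i % 2 for i in range(20)]

# Default test_size is 0.25; stratify keeps the class balance in both parts
docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer(tokenizer=identity, lowercase=False, token_pattern=None)
X_train = vectorizer.fit_transform(docs_train)  # learn vocabulary and IDF on training data
X_test = vectorizer.transform(docs_test)        # reuse them for the test data

print(X_train.shape, X_test.shape)
```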

12. Import the necessary module and create a `LogisticRegression` object. Then fit the model to the X and y training data produced above.

In [14]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
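
To see what the fitted model has learned, its coefficients can be matched back to the vocabulary. This is a small sketch on hypothetical toy documents; the names `weights` and `identity` and the `max_iter=1000` setting are assumptions of mine. A positive weight pushes a prediction towards label 1, and a negative weight towards label 0.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def identity(doc):
    # The documents are already token lists, so return them unchanged
    return doc

# Hypothetical pre-tokenized documents: label 1 for the sensational ones
docs = [
    ["shocking", "secret", "exposed"],
    ["senate", "passes", "bill"],
    ["shocking", "miracle", "cure"],
    ["court", "upholds", "bill"],
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(tokenizer=identity, lowercase=False, token_pattern=None)
X = vectorizer.fit_transform(docs)

model = LogisticRegression(max_iter=1000)
model.fit(X, labels)

# Order the tokens by their column index so they line up with model.coef_[0]
ordered_tokens = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = dict(zip(ordered_tokens, model.coef_[0]))

print(weights["shocking"] > 0)  # appears only in label-1 documents
print(weights["bill"] < 0)      # appears only in label-0 documents
```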

13. Now that the model has been trained, I will calculate the predictions of the model using the test data set.

In [15]:
y_pred = logreg.predict(X_test)

14. The next step is to evaluate how well the model did. I will use three evaluation metrics to assess its performance: accuracy, precision and recall.

I will import these metrics and calculate the scores by comparing the true targets of the test set to the predicted targets.

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
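
For reference, the three metrics can be computed by hand from the counts of true and false positives and negatives. The sketch below uses a hypothetical helper `binary_metrics` and made-up labels, and treats label 1 as the positive class, which is also the default for `precision_score` and `recall_score`.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical true and predicted labels
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 1]

acc, prec, rec = binary_metrics(y_true_demo, y_pred_demo)
print(acc, prec, rec)
```

Here there are 3 true positives, 2 true negatives, 2 false positives and 1 false negative, so accuracy is 5/8, precision is 3/5 and recall is 3/4.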

Using multiple metrics for evaluation helps to identify the strengths and weaknesses of the model, and to make informed decisions about how to improve its performance.

15. Print out each of the scores for the model below.

In [17]:
print("The accuracy score is ", accuracy)
print("The precision score is ", precision)
print("The recall score is ", recall)

The accuracy score is  0.9651923076923077
The precision score is  0.9661982529434106
The recall score is  0.9650986342943855
