
# News Article Validity

Being able to distinguish between fake news and real news is a key skill in today's world. People who fall for misinformation or disinformation can cause real harm to society. The goal of this notebook is to use natural language processing to differentiate real news from fake news, and to help understand the common techniques that fake news outlets use when writing articles.



## Read in Data Files

In [1]:
# Loading the Pandas package
import pandas as pd

In [2]:
# Reading in the data in .csv format from my GitHub
# Using low_memory=False so each file is parsed in one pass and every column gets a single inferred dtype
fake_news_1 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Fake_News_part0.csv', low_memory=False)
fake_news_2 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Fake_News_part1.csv', low_memory=False)
fake_news_3 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Fake_News_part2.csv', low_memory=False)
fake_news_4 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Fake_News_part3.csv', low_memory=False)
real_news_1 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Real_News_part0.csv', low_memory=False)
real_news_2 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Real_News_part1.csv', low_memory=False)
real_news_3 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Real_News_part2.csv', low_memory=False)
real_news_4 = pd.read_csv('https://raw.githubusercontent.com/ryaltic/News-Article-Validity/refs/heads/main/Real_News_part3.csv', low_memory=False)

CSV can be a tricky format for text data: article text often contains commas, quotation marks, and line breaks, which a parser can mistake for field or row boundaries. A format such as JSON would have been preferable, but the data was provided as CSV.
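To illustrate why delimiters inside text are a problem, here is a minimal sketch on made-up data (the `sample` frame and `unquoted_row` string are illustrative assumptions, not part of the dataset). It shows that splitting an unquoted row on commas breaks the text apart, while a fully quoted CSV round-trips through pandas intact.

```python
import csv
import io

import pandas as pd

# Made-up example row; not taken from the real dataset
sample = pd.DataFrame({
    "title": ["Example headline, with a comma"],
    "text": ['First line of the "article".\nSecond line, with more commas.'],
})

# Without quoting, commas inside the text look like field separators
unquoted_row = "Example headline, with a comma,Some text, with commas"
naive_fields = unquoted_row.split(",")
print(len(naive_fields))  # more pieces than the 2 intended fields

# With every field quoted, commas, quotes, and newlines survive a round trip
buffer = io.StringIO()
sample.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.shape)
print(round_trip.loc[0, "text"] == sample.loc[0, "text"])
```

Whether the source files were written without consistent quoting is not known here; this only shows the general failure mode.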

## Preprocessing

In [3]:
# Seeing the head of fake_news_1
fake_news_1.head()

Unnamed: 0,title,text,subject,date,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 119,Unnamed: 120,Unnamed: 121,Unnamed: 122,Unnamed: 123,Unnamed: 124,Unnamed: 125,Unnamed: 126,Unnamed: 127,Unnamed: 128
0,LOL: Putin Is Angry Now Because Trump Gets Mo...,Vladimir Putin is super pissed because Russian...,News,16-Feb-17,,,,,,,...,,,,,,,,,,
1,WATCH: CNN Host Fareed Zakaria ROASTS Trump F...,Donald Trump embarrassed himself and the natio...,News,16-Feb-17,,,,,,,...,,,,,,,,,,
2,Prominent Psychiatrist Gives One DAMNING Reas...,Given Donald Trump s disastrous presidential c...,News,16-Feb-17,,,,,,,...,,,,,,,,,,
3,Steve Bannon Insulted Reporters To Their Face...,Just about every journalist who was waiting to...,News,16-Feb-17,,,,,,,...,,,,,,,,,,
4,Twitter Beautifully Fact Checks Trump In Real...,According to what White House press secretary ...,News,16-Feb-17,,,,,,,...,,,,,,,,,,


The head of the first fake news df shows many columns named `Unnamed: N` that should not be there. In the rows displayed they contain only NaN values.
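Before dropping such columns, it may be worth confirming that they hold no real data. The following is a rough sketch on a toy frame (the `toy` DataFrame and its values are hypothetical stand-ins, not the loaded files); it counts non-null values in every column whose name starts with `Unnamed`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one of the loaded files (hypothetical values)
toy = pd.DataFrame({
    "title": ["a", "b"],
    "text": ["x", "y"],
    "Unnamed: 4": [np.nan, np.nan],
    "Unnamed: 5": [np.nan, "stray text"],
})

# Select the columns whose names start with 'Unnamed'
unnamed_cols = toy.columns[toy.columns.str.startswith("Unnamed")]

# Count the non-null values in each of those columns
non_null_counts = toy[unnamed_cols].notna().sum()

# Any column listed here holds data that dropping would discard
print(non_null_counts[non_null_counts > 0])
```

On the real files, a column like `Unnamed: 0` would be expected to show up here because it holds the saved row index, while stray text in the higher-numbered columns could indicate rows that were split incorrectly during parsing.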

In [4]:
# Creating a function that uses .loc to keep only the columns whose names do not start with 'Unnamed'.
# .copy() returns an independent DataFrame, so adding columns later does not modify a slice of the original.
def drop_unnamed(df):
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')].copy()
    return df


fake_news_1 = drop_unnamed(fake_news_1)
fake_news_1.head()
fake_news_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9732 entries, 0 to 9731
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    7501 non-null   object
 1   text     7501 non-null   object
 2   subject  7500 non-null   object
 3   date     7500 non-null   object
dtypes: object(4)
memory usage: 304.3+ KB


After dropping the unnamed columns, the first fake news df is ready to be appended to the others.

In [5]:
# Repeat process for all other dfs read in
fake_news_2 = drop_unnamed(fake_news_2)
fake_news_3 = drop_unnamed(fake_news_3)
fake_news_4 = drop_unnamed(fake_news_4)
real_news_1 = drop_unnamed(real_news_1)
real_news_2 = drop_unnamed(real_news_2)
real_news_3 = drop_unnamed(real_news_3)
real_news_4 = drop_unnamed(real_news_4)

In [6]:
# Creating a function that adds a truth column that is fake
def add_fakenews_column(df):
    df['truth'] = 'fake'
    return df

# Applying the add_fakenews_column function to the fake_news_dfs
fake_news_1 = add_fakenews_column(fake_news_1)
fake_news_2 = add_fakenews_column(fake_news_2)
fake_news_3 = add_fakenews_column(fake_news_3)
fake_news_4 = add_fakenews_column(fake_news_4)

# Creating a function that adds a truth column that is real
def add_realnews_column(df):
    df['truth'] = 'real'
    return df

# Applying the add_realnews_column function to the real_news_dfs
real_news_1 = add_realnews_column(real_news_1)
real_news_2 = add_realnews_column(real_news_2)
real_news_3 = add_realnews_column(real_news_3)
real_news_4 = add_realnews_column(real_news_4)
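The two labeling functions differ only in the label they assign, so one possible alternative is a single function that takes the label as a parameter. This is a sketch, not the notebook's method: `add_truth_column` and the toy frames are hypothetical names, and `DataFrame.assign` is used so the input frame is left unmodified.

```python
import pandas as pd


def add_truth_column(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Return a copy of df with a constant 'truth' column set to label."""
    return df.assign(truth=label)


# Toy frames standing in for the real fake/real news dfs
toy_fake = pd.DataFrame({"title": ["a"], "text": ["x"]})
toy_real = pd.DataFrame({"title": ["b"], "text": ["y"]})

labeled_fake = add_truth_column(toy_fake, "fake")
labeled_real = add_truth_column(toy_real, "real")

print(labeled_fake)
print(labeled_real)
```

Because `assign` returns a new DataFrame, the original inputs keep their columns unchanged.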

In [7]:
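# Display info of all dfs to see if columns match
# Note: info() prints its report directly and returns None, so print() adds a "None" line after each report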
print(fake_news_1.info())
print(fake_news_2.info())
print(fake_news_3.info())
print(fake_news_4.info())
print(real_news_1.info())
print(real_news_2.info())
print(real_news_3.info())
print(real_news_4.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9732 entries, 0 to 9731
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    7501 non-null   object
 1   text     7501 non-null   object
 2   subject  7500 non-null   object
 3   date     7500 non-null   object
 4   truth    9732 non-null   object
dtypes: object(5)
memory usage: 380.3+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    10000 non-null  object
 1   text     10000 non-null  object
 2   subject  9996 non-null   object
 3   date     9996 non-null   object
 4   truth    10000 non-null  object
dtypes: object(5)
memory usage: 390.8+ KB
None
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3499 entries, 0 to 3498
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  --

All of the dfs have the same columns, so they can now be combined into one big df by concatenating them.
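Reading eight `info()` reports by eye is error-prone, so the column comparison could also be done programmatically. A minimal sketch, assuming the frames are collected in a dictionary (the `toy_frames` dictionary and its contents are hypothetical):

```python
import pandas as pd

# Toy frames; "toy_c" is deliberately missing the 'date' column
toy_frames = {
    "toy_a": pd.DataFrame(columns=["title", "text", "subject", "date", "truth"]),
    "toy_b": pd.DataFrame(columns=["title", "text", "subject", "date", "truth"]),
    "toy_c": pd.DataFrame(columns=["title", "text", "subject", "truth"]),
}

# Use the first frame's columns as the reference layout
reference = list(next(iter(toy_frames.values())).columns)

# Collect every frame whose columns differ from the reference
mismatched = {
    name: list(df.columns)
    for name, df in toy_frames.items()
    if list(df.columns) != reference
}

print(mismatched)
```

An empty dictionary would mean every frame has the same columns in the same order.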

In [8]:
# Using the concat function to append all of the dfs together into one
final_df = pd.concat([fake_news_1, fake_news_2, fake_news_3, fake_news_4, real_news_1, real_news_2, real_news_3, real_news_4])
final_df.head()

Unnamed: 0,title,text,subject,date,truth
0,LOL: Putin Is Angry Now Because Trump Gets Mo...,Vladimir Putin is super pissed because Russian...,News,16-Feb-17,fake
1,WATCH: CNN Host Fareed Zakaria ROASTS Trump F...,Donald Trump embarrassed himself and the natio...,News,16-Feb-17,fake
2,Prominent Psychiatrist Gives One DAMNING Reas...,Given Donald Trump s disastrous presidential c...,News,16-Feb-17,fake
3,Steve Bannon Insulted Reporters To Their Face...,Just about every journalist who was waiting to...,News,16-Feb-17,fake
4,Twitter Beautifully Fact Checks Trump In Real...,According to what White House press secretary ...,News,16-Feb-17,fake


In [9]:
# Displaying the shape and info of the final df
print(final_df.shape)
final_df.info()

(47147, 5)
<class 'pandas.core.frame.DataFrame'>
Index: 47147 entries, 0 to 1416
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    44916 non-null  object
 1   text     44916 non-null  object
 2   subject  44895 non-null  object
 3   date     44895 non-null  object
 4   truth    47147 non-null  object
dtypes: object(5)
memory usage: 2.2+ MB
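The index line above (`47147 entries, 0 to 1416`) indicates that each source df kept its own row labels, so index values repeat in the combined df. If a unique index is wanted, `pd.concat` accepts `ignore_index=True`. A small sketch on toy frames (`a` and `b` are illustrative only):

```python
import pandas as pd

# Two toy frames, each with its own default 0-based index
a = pd.DataFrame({"title": ["a", "b"]})
b = pd.DataFrame({"title": ["c"]})

# Default behaviour keeps the original labels, so 0 appears twice
stacked = pd.concat([a, b])
print(stacked.index.tolist())    # [0, 1, 0]

# ignore_index=True builds a fresh 0..n-1 index
reindexed = pd.concat([a, b], ignore_index=True)
print(reindexed.index.tolist())  # [0, 1, 2]
```

With repeated labels, a lookup such as `.loc[0]` returns several rows, which can be surprising in later steps.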


The final df has 47,147 rows and 5 columns: the title of the news article, the text of the article, the subject of the article, the date the article was published, and whether the article is real or fake news. The title and text columns each have 2,231 nulls, and the subject and date columns each have 2,252, so these need to be investigated. A few more engineered variables may also be added.
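One way to start investigating the nulls is to count them per column and flag rows where every content column is empty. The sketch below uses a toy frame with the same column names (the values are made up), and dropping the fully empty rows is only one possible choice, not necessarily the right one for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the final df's columns (made-up values)
toy = pd.DataFrame({
    "title": ["a", np.nan, "c"],
    "text": ["x", np.nan, "z"],
    "subject": ["News", np.nan, np.nan],
    "date": ["16-Feb-17", np.nan, np.nan],
    "truth": ["fake", "fake", "real"],
})

# 'truth' is excluded because it was filled in for every row
content_cols = ["title", "text", "subject", "date"]

# Nulls per content column
null_counts = toy[content_cols].isna().sum()
print(null_counts)

# Rows where every content column is null carry no article at all
empty_rows = toy[content_cols].isna().all(axis=1)
print(empty_rows.sum())

# Assumption: fully empty rows are safe to drop; partially null rows are kept
cleaned = toy.loc[~empty_rows].reset_index(drop=True)
print(cleaned.shape)
```

Rows that are only partially null (for example, text present but date missing) are kept here so they can be inspected separately.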