# From Detection to Credibility: A Machine Learning Framework for Assessing News Source Reliability



**Motivation**

As the volume of media continues to grow, it is becoming increasingly difficult to tell real news from fake news. Better ways of identifying fake news are therefore needed, and in this project we approach the problem with data mining and machine learning.

In the first part of our project, we will focus on experimenting with different data processing techniques and predictive models, optimising our final pipeline and model to accurately identify fake news.

For the second part, we want to apply our trained model to scraped news data from popular US media outlets and assess the credibility of these outlets. This way we can help the public make more informed decisions about which media outlets they can trust.


## 2nd Part: Fake News Classification Use Case

For the second part, we scraped articles from 10 different news sites, split into two categories: five sites we treat as unreliable and five we treat as reliable.

The dimensions are shown below:
- **Index:** Index.
- **title:** Title of news article.
- **text:** Text content of news article.
- **label:** Whether news article is real (0) or fake (1).

The Fake News Dataset is split into 3 `csv` files (`part1.csv`, `part2.csv`, `part3.csv`) so that no single file exceeds GitHub's size limit for pushed files.

## Import Libraries

Please uncomment the code box below to pip install relevant dependencies for this notebook.

In [2]:
# Import necessary libraries

# Data manipulation
import pandas as pd
import numpy as np

# Statistical functions
from scipy.stats import zscore

# For concurrency (running functions in parallel)
from concurrent.futures import ThreadPoolExecutor

# For caching (to speed up repeated function calls)
from functools import lru_cache

# For progress tracking
from tqdm import tqdm

# Plotting and Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Language Detection packages
# `langdetect` for detecting language
from langdetect import detect as langdetect_detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException
# `langid` for an alternative language detection method
from langid import classify as langid_classify

# Text Preprocessing and NLP
# Stopwords (common words to ignore) from NLTK
from nltk.corpus import stopwords
# Tokenizing sentences/words
from nltk.tokenize import word_tokenize
# Part-of-speech tagging
from nltk import pos_tag
# Lemmatization (converting words to their base form)
from nltk.stem import WordNetLemmatizer
import nltk
# Regular expressions for text pattern matching
import re

# Word Cloud generation
from wordcloud import WordCloud

In [34]:
## Unreliable websites
breitbart_df = pd.read_csv('./unreliable websites/breitbart_articles.csv')
dailycaller_df = pd.read_csv('./unreliable websites/dailycaller_articles.csv')
naturalnews_df = pd.read_csv('./unreliable websites/naturalnews_articles.csv')
newsmax_df = pd.read_csv('./unreliable websites/newsmax_articles.csv')
zerohedge_df = pd.read_csv('./unreliable websites/zerohedge_articles.csv')

# Reliable websites
cnn_df = pd.read_csv('./reliable websites/cnn_articles.csv')
ap_df = pd.read_csv('./reliable websites/ap_articles.csv')
bbc_df = pd.read_csv('./reliable websites/bbc_articles.csv')
npr_df = pd.read_csv('./reliable websites/npr_articles.csv')
guardian_df = pd.read_csv('./reliable websites/guardian_articles.csv')

## Combine all 10 dataframes into 1
Here we combine all 10 dataframes into 1 dataframe (`data_raw`) by concatenating along rows.

We also reset the `index` and drop the old `index` column.

In [35]:
# Combine all 10 dataframes into 1
data_raw = pd.concat(
    [breitbart_df, dailycaller_df, naturalnews_df, newsmax_df, zerohedge_df,
     cnn_df, ap_df, bbc_df, npr_df, guardian_df],
    axis=0,
)

# Reset index and drop old index column
data_raw = data_raw.reset_index(drop=True)

data_raw.info()
print("Dataframe Shape: ", data_raw.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 955 entries, 0 to 954
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   source   955 non-null    object
 1   title    955 non-null    object
 2   content  955 non-null    object
 3   date     955 non-null    object
dtypes: object(4)
memory usage: 30.0+ KB
Dataframe Shape:  (955, 4)


In [36]:
data_raw.head()

Unnamed: 0,source,title,content,date
0,Breitbart,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,30/10/2024
1,Breitbart,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,29/10/2024
2,Breitbart,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,29/10/2024
3,Breitbart,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,30/10/2024
4,Breitbart,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,30/10/2024


In [37]:
data_raw['source'].value_counts()

source
The Daily Caller    100
Natural News        100
Zerohedge           100
Breitbart            99
CNN                  99
Guardian             99
AP                   97
News Max             95
NPR                  86
BBC                  80
Name: count, dtype: int64

## Selection Criteria
- Language: Only use articles written in English (using language detection if necessary).
- Date Range: Focus on articles published in 2024 or, at most, 2023 to ensure credibility remains up-to-date.
- Content Focus: Only articles from the politics/election sections of each news website, targeting US election-related topics.
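
The date-range criterion above could be checked along the following lines. This is only a sketch: `published_in_range` is a hypothetical helper that does not appear elsewhere in this notebook, and it assumes the `date` column holds strings in `dd/mm/yyyy` format, as in the scraped data shown earlier.

```python
from datetime import datetime

def published_in_range(date_str, min_year=2023, max_year=2024, fmt="%d/%m/%Y"):
    """Return True if date_str parses with fmt and its year lies in [min_year, max_year]."""
    try:
        parsed = datetime.strptime(date_str.strip(), fmt)
    except (ValueError, AttributeError):
        # Unparseable or non-string dates are treated as out of range
        return False
    return min_year <= parsed.year <= max_year

dates = ["30/10/2024", "15/06/2022", "not a date"]
print([published_in_range(d) for d in dates])  # [True, False, False]
```

On the DataFrame, the same helper could be used as a row mask, for example `data_raw[data_raw['date'].apply(published_in_range)]`.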

### Additional Filtering
- Define Political Content: Articles should contain keywords like “election,” “government,” “policy,” or “candidate” to qualify as political.
- Exclude Advertisements: Filter out content with commercial keywords or phrases related to advertisements.
- Exclude Opinion Pieces: Identify and exclude articles labeled as “opinion,” “editorial,” or similar terms, or those appearing in “Opinion” sections.
- Remove Outliers by Length: Filter out articles with fewer than 100 words or more than 5,000 words to focus on substantial content.
- Regex Filtering: Use regular expressions to remove boilerplate or irrelevant sections, such as “All rights reserved,” “Read more,” bylines, and embedded links to other articles or advertisements.
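
The length criterion above (100 to 5,000 words) can be expressed as a simple word-count check. The sketch below is an illustration only: `within_word_limits` is a hypothetical helper, and it counts whitespace-separated words, which is an assumption about how "words" should be measured.

```python
def within_word_limits(text, min_words=100, max_words=5000):
    """Return True if text has between min_words and max_words whitespace-separated words."""
    if not isinstance(text, str):
        # Missing values (e.g. NaN) are rejected
        return False
    n_words = len(text.split())
    return min_words <= n_words <= max_words

short = "too short"
typical = " ".join(["word"] * 800)
very_long = " ".join(["word"] * 6000)
print([within_word_limits(t) for t in (short, typical, very_long)])  # [False, True, False]
```

Note that this is a different rule from a Z-score filter on character length: it uses fixed bounds rather than bounds derived from the data.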

#### Step 1: Language Detection

In [38]:
# 1) Set a seed for langdetect to ensure reproducibility
DetectorFactory.seed = 0

# 2a) Simplified preprocessing: only remove non-alphabetic characters
def preprocess_text_simple(text):
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic characters
    text = re.sub(r'\s+', ' ', text)  # Normalize whitespace
    return text.strip()

# 2b) Check if the text is non-language (e.g., numbers, symbols only)
def is_non_language_text(text):
    if re.match(r'^[^a-zA-Z]*$', text):  # Check if text has no alphabetic characters
        return True
    return False

# 3a) Function to get langdetect prediction
def get_langdetect_prediction(text):
    try:
        # Directly use text without preprocessing for efficiency
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang = langdetect_detect(text)
        return lang
    except LangDetectException:
        return "unknown"

# 3b) Function to get langid prediction
def get_langid_prediction(text):
    try:
        # Skip very short or non-alphabetic text before calling the classifier
        if len(text) < 10 or is_non_language_text(text):
            return "unknown"
        lang, _ = langid_classify(text)
        return lang
    except Exception:
        return "unknown"

# 4) Function to calculate majority vote for each language
def calculate_majority_vote(predictions):
    vote_counts = {}
    for lang in predictions:
        if lang in vote_counts:
            vote_counts[lang] += 1
        else:
            vote_counts[lang] = 1
    return vote_counts

# 5) Parallel processing for efficiency with limited workers
def parallel_detection(text):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(lambda func: func(text), 
                                    [get_langdetect_prediction, get_langid_prediction]))
    return results

# 6) Caching function for repeated inputs
@lru_cache(maxsize=500)
def get_cached_language(text):
    return combined_language_detection(text)

# 7) Combined majority voting language detection function
def combined_language_detection(text):
    # Check if the text is non-language (e.g., numbers, symbols only)
    if is_non_language_text(text):
        return "unknown"
    
    # Run the detectors in parallel for efficiency
    predictions = parallel_detection(text)
    
    # Calculate majority vote for each language based on predictions
    vote_counts = calculate_majority_vote(predictions)
    
    # Determine the language with the highest majority vote
    final_language = max(vote_counts, key=vote_counts.get)
    
    # If "unknown" wins, or no language gets more than one vote (the two detectors disagree), return "unknown"
    if final_language == "unknown" or vote_counts[final_language] <= 1:
        return "unknown"
    
    return final_language

# 8) Apply the cached function to each text in the DataFrame with a progress bar
data_raw['language'] = [get_cached_language(text) for text in tqdm(data_raw['content'], desc="Language Detection")]

# 9) Display the DataFrame with detected languages
data_raw

Language Detection: 100%|██████████| 955/955 [00:17<00:00, 54.20it/s] 


Unnamed: 0,source,title,content,date,language
0,Breitbart,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,30/10/2024,en
1,Breitbart,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,29/10/2024,en
2,Breitbart,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,29/10/2024,en
3,Breitbart,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,30/10/2024,en
4,Breitbart,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,30/10/2024,en
...,...,...,...,...,...
950,Guardian,How can the candidate with most votes lose? Th...,Even though the United States touts its status...,19/10/2024,en
951,Guardian,The RBA will likely hold interest rates – but ...,When the Reserve Bank board meet next week to ...,30/10/2024,en
952,Guardian,Trump’s mass deportation plan would be ‘econom...,"If elected, Donald Trump plans to carry out “t...",30/10/2024,en
953,Guardian,Sun belt to October surprise: US election term...,"On Tuesday 5 November, Americans will vote aft...",10/10/2024,en
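
With only two detectors, the voting rule in `combined_language_detection` effectively requires both detectors to agree. The sketch below isolates that rule; `vote` is a hypothetical helper that uses `collections.Counter` in place of the manual dictionary, and the label lists stand in for real detector outputs.

```python
from collections import Counter

def vote(predictions):
    """Return the winning language, or "unknown" if no label gets at least two votes."""
    counts = Counter(predictions)
    winner, n_votes = counts.most_common(1)[0]
    if winner == "unknown" or n_votes <= 1:
        return "unknown"
    return winner

print(vote(["en", "en"]))       # both detectors agree -> en
print(vote(["en", "fr"]))       # detectors disagree -> unknown
print(vote(["unknown", "en"]))  # one detector failed -> unknown
```

A practical consequence is that an article is kept as English only when both `langdetect` and `langid` label it `en`.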


In [39]:
# Drop rows where the detected language is not English and reset the index
data_raw = data_raw[data_raw['language'] == 'en'].reset_index(drop=True)
data_raw

Unnamed: 0,source,title,content,date,language
0,Breitbart,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,30/10/2024,en
1,Breitbart,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,29/10/2024,en
2,Breitbart,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,29/10/2024,en
3,Breitbart,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,30/10/2024,en
4,Breitbart,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,30/10/2024,en
...,...,...,...,...,...
949,Guardian,How can the candidate with most votes lose? Th...,Even though the United States touts its status...,19/10/2024,en
950,Guardian,The RBA will likely hold interest rates – but ...,When the Reserve Bank board meet next week to ...,30/10/2024,en
951,Guardian,Trump’s mass deportation plan would be ‘econom...,"If elected, Donald Trump plans to carry out “t...",30/10/2024,en
952,Guardian,Sun belt to October surprise: US election term...,"On Tuesday 5 November, Americans will vote aft...",10/10/2024,en


In [40]:
data_raw['language'].value_counts()

# Drop the  'language' column
data_raw = data_raw.drop(columns=['language'])

data_raw.head()

Unnamed: 0,source,title,content,date
0,Breitbart,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,30/10/2024
1,Breitbart,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,29/10/2024
2,Breitbart,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,29/10/2024
3,Breitbart,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,30/10/2024
4,Breitbart,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,30/10/2024


In [41]:
#  Copy data for safety
data_filtered = data_raw.copy()
print("Number of articles before filtering:", len(data_filtered))

# Step 2: Political Keywords Filter
us_politics_keywords = [
    # Core Keywords
    "election", "elections", "2024 election", "presidential election", "campaign", "campaigning",
    "primary", "primaries", "polling", "polls",

    # Key Political Figures
    "Kamala Harris", "Donald Trump", "Joe Biden", "Ron DeSantis", "Gavin Newsom", 
    "Mike Pence", "Vivek Ramaswamy", "Tim Walz",

    # Political Parties & Groups
    "Democrat", "Democratic Party", "Republican", "GOP", "Republican Party",
    "Independent", "Third party", "PAC", "Super PAC",

    # Political Issues & Controversies
    "voting rights", "voter suppression", "absentee ballot", "mail-in ballot",
    "Electoral College", "Supreme Court", "abortion", "Roe v. Wade", "gun control",
    "Second Amendment", "immigration", "border security", "healthcare", 
    "Medicare", "Affordable Care Act", "climate change", "Green New Deal", 
    "inflation", "economic policy", "tax cuts", "tax reform", "foreign policy", 
    "foreign relations",

    # U.S. Government Bodies & Offices
    "Congress", "Senate", "Senators", "House of Representatives", "White House",
    "Supreme Court", "Federal government", "State government",

    # Policies & Bills
    "voting reform", "healthcare reform", "climate policy", "gun legislation", 
    "economic recovery", "infrastructure bill", "Social Security", "student loan forgiveness",

    # Social & Cultural Issues
    "social justice", "racial equality", "police reform", "civil rights",
    "freedom of speech", "religious freedom",

    # Election Processes
    "debates", "presidential debate", "swing state", "battleground state",
    "electoral votes",

    # Additional Relevant Terms
    "approval rating", "national convention", "lobbying", "lobbyist",
    "scandal", "investigation", "political rally", "rally"
]
data_filtered = data_filtered[data_filtered['content'].str.contains('|'.join(us_politics_keywords), case=False, na=False)]
print("Number of articles after political keyword filtering:", len(data_filtered))

# Step 3: Exclude Opinion Pieces (assuming titles or content contain specific indicators of opinion)
opinion_keywords = ['opinion', 'editorial', 'op-ed']
data_filtered = data_filtered[~data_filtered['title'].str.contains('|'.join(opinion_keywords), case=False, na=False)]
data_filtered = data_filtered[~data_filtered['content'].str.contains('|'.join(opinion_keywords), case=False, na=False)]
print("Number of articles after opinion piece filtering:", len(data_filtered))

# Step 4: Remove Irrelevant Content Using Regex
# Define patterns for irrelevant content
irrelevant_patterns = [
    r'All rights reserved', r'Read more', r'For more information', r'Follow us', 
    r'Find us on', r'Contact the author', r'Subscribe for updates', r'\bWATCH\b',
    r'Advertisement', r'^[\W_]+$'  # Matches lines made up only of symbols or whitespace
]

def clean_irrelevant_content(text):
    for pattern in irrelevant_patterns:
        # re.MULTILINE lets ^ and $ match at each line, which the symbol-only pattern relies on
        text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.MULTILINE)
    return text

data_filtered['content'] = data_filtered['content'].apply(clean_irrelevant_content)
print("Number of articles after irrelevant content cleaning:", len(data_filtered))

# Final DataFrame after filtering
print("Number of articles after all filtering:", len(data_filtered))
data_filtered.head()
data_filtered.info()


Number of articles before filtering: 954
Number of articles after political keyword filtering: 917
Number of articles after opinion piece filtering: 818
Number of articles after irrelevant content cleaning: 818
Number of articles after all filtering: 818
<class 'pandas.core.frame.DataFrame'>
Index: 818 entries, 0 to 952
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   source   818 non-null    object
 1   title    818 non-null    object
 2   content  818 non-null    object
 3   date     818 non-null    object
dtypes: object(4)
memory usage: 32.0+ KB
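
A plain substring match on the keyword list is permissive: a short keyword such as "PAC" also matches inside words like "impact" when case is ignored. One possible refinement, sketched here under the assumption that whole-word matching is what is wanted, is to escape each keyword and wrap the alternation in word boundaries. `build_keyword_pattern` is a hypothetical helper and the four keywords are only a sample of the full list.

```python
import re

def build_keyword_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    # Longest first so multi-word phrases are tried before shorter alternatives
    escaped = [re.escape(k) for k in sorted(set(keywords), key=len, reverse=True)]
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", flags=re.IGNORECASE)

pattern = build_keyword_pattern(["PAC", "polls", "Roe v. Wade", "election"])

print(bool(pattern.search("The Super PAC spent heavily on ads.")))     # True
print(bool(pattern.search("The impact on space policy is unclear.")))  # False
print(bool(pattern.search("The ruling overturned Roe v. Wade.")))      # True
```

The compiled pattern could then be applied row by row, for example with `data_filtered['content'].apply(lambda t: bool(pattern.search(str(t))))`.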


In [42]:
data_filtered['source'].value_counts()

source
The Daily Caller    94
Breitbart           91
CNN                 85
AP                  85
News Max            83
Zerohedge           83
Guardian            82
Natural News        78
NPR                 77
BBC                 60
Name: count, dtype: int64
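
The symbol-only pattern `^[\W_]+$` in `irrelevant_patterns` behaves differently depending on whether `re.MULTILINE` is set, which decides if it acts on each line or only on the article as a whole. The short illustration below uses a made-up three-line string rather than real article text.

```python
import re

symbol_line = r'^[\W_]+$'
article = "First paragraph.\n*****\nSecond paragraph."

# Without re.MULTILINE, ^ and $ anchor to the whole string, so nothing is removed
print(re.sub(symbol_line, '', article) == article)  # True

# With re.MULTILINE, ^ and $ anchor to each line, so the separator line is emptied
print(re.sub(symbol_line, '', article, flags=re.MULTILINE).split("\n"))
# ['First paragraph.', '', 'Second paragraph.']
```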

# Feature Selection
Here we select the relevant features for fake news classification.
- `title`, `content`, `source`
- Create a new DataFrame (`data`) by selecting the specific columns mentioned above from the filtered DataFrame `data_filtered`.

In [43]:
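# Take an explicit copy so that later column assignments modify `data` rather than a view of `data_filtered`
data = data_filtered[['title', 'content', 'source']].copy()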
print(type(data))
print(data.head())

# Shape before dropping duplicates
print("\nThe old shape is: ", data.shape)

<class 'pandas.core.frame.DataFrame'>
                                               title  \
0  Russia Practices 'Massive Nuclear Strike' in A...   
1  Nolte: Trump and Republican Senate Challenger ...   
2  Sen. Tom Cotton Barnstorms Nation Campaigning ...   
3  U.S. Marines Successfully Test Israel's Iron D...   
4  Seoul, Japan Warn North Korea Preparing Nuclea...   

                                             content     source  
0  “Important to have modern and constantly ready...  Breitbart  
1  The final Insider Advantage poll out of Wiscon...  Breitbart  
2  Sen. Tom Cotton (R-AR) is putting in long hour...  Breitbart  
3  The U.S. Marines have successfully tested a mo...  Breitbart  
4  The South Korean Defense Intelligence Agency (...  Breitbart  

The old shape is:  (818, 3)


# Data Cleaning

## Remove Duplicate Rows
- Drop duplicate rows from the dataframe (`data`).

In [44]:
data = data.drop_duplicates()

# Display the new dataframe shape
print("The new shape is: ", data.shape)

The new shape is:  (818, 3)


## Remove Outliers

### `content`

The `content` column of `data`, which is of string type, may contain values with unusually long lengths, indicating the presence of outliers. We identify the outliers using the Z-score method.

1. Create a new column `text_length` in the DataFrame `data` by calculating the length (in characters) of each article. (Set the value to 0 if the corresponding `content` value is NaN.)

2. Check the statistics of `text_length` using the `describe()` method.

3. Calculate the mean and standard deviation of the `text_length` column.

4. Set the Z-score threshold for identifying outliers to 3.

5. Identify outliers in the `text_length` column and set the corresponding `content` to `np.nan`.

6. Drop the `text_length` column from the DataFrame.
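
The Z-score rule in the steps above can be sketched without any third-party libraries. `zscore_outlier_flags` is a hypothetical helper and the lengths are made-up values; the population standard deviation is used because that is the default (`ddof=0`) of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev

def zscore_outlier_flags(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    if sigma == 0:
        # All values identical: nothing can be an outlier
        return [False] * len(values)
    return [abs((v - mu) / sigma) > threshold for v in values]

# Twenty typical article lengths and one very long article
lengths = [3000] * 20 + [25000]
flags = zscore_outlier_flags(lengths)
print(flags.count(True), flags[-1])  # 1 True
```

Because the absolute value is taken, the rule flags values far below the mean as well as far above it.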

In [45]:
data['text_length'] = data['content'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["text_length"]
stats_TL = TL.describe()
print(stats_TL)

                                               title  \
0  Russia Practices 'Massive Nuclear Strike' in A...   
1  Nolte: Trump and Republican Senate Challenger ...   
2  Sen. Tom Cotton Barnstorms Nation Campaigning ...   

                                             content     source  text_length  
0  “Important to have modern and constantly ready...  Breitbart         3755  
1  The final Insider Advantage poll out of Wiscon...  Breitbart         2713  
2  Sen. Tom Cotton (R-AR) is putting in long hour...  Breitbart         4010  
count      818.000000
mean      4741.350856
std       3330.378252
min        115.000000
25%       2647.750000
50%       3763.000000
75%       6180.500000
max      24327.000000
Name: text_length, dtype: float64


In [46]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'content' to NaN where its length is more than 3 standard deviations from the mean
data.loc[abs(z_score) > threshold, 'content'] = np.nan
# print(data.head(3))

data = data.drop("text_length", axis=1)
data.head()

Unnamed: 0,title,content,source
0,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,Breitbart
1,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,Breitbart
2,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,Breitbart
3,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,Breitbart
4,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,Breitbart


### `title`

Similarly, the `title` column of `data` (of type `str`) may contain values with unusually long lengths, indicating the presence of outliers.

1. Create a new column `title_length` in the DataFrame `data` by calculating the length of each title. (Set the value to 0 if the corresponding `title` value is NaN.)

2. Check the statistics of `title_length` using the `describe()` method.

3. Identify outliers as titles whose length has an absolute Z-score above 3, and set the corresponding value of `title` to `np.nan`.

4. Drop the `title_length` column from the DataFrame.

In [47]:
data['title_length'] = data['title'].apply(lambda x: len(x) if pd.notna(x) else 0)
print(data.head(3))

TL = data["title_length"]
stats_TL = TL.describe()
print(stats_TL)

                                               title  \
0  Russia Practices 'Massive Nuclear Strike' in A...   
1  Nolte: Trump and Republican Senate Challenger ...   
2  Sen. Tom Cotton Barnstorms Nation Campaigning ...   

                                             content     source  title_length  
0  “Important to have modern and constantly ready...  Breitbart            72  
1  The final Insider Advantage poll out of Wiscon...  Breitbart            63  
2  Sen. Tom Cotton (R-AR) is putting in long hour...  Breitbart            89  
count    818.000000
mean      79.930318
std       23.854029
min        9.000000
25%       63.000000
50%       79.000000
75%       94.000000
max      186.000000
Name: title_length, dtype: float64


In [48]:
mean_TL = TL.mean()
# print(mean_TL)

sd_TL = TL.std()
# print(sd_TL)

threshold = 3

z_score = zscore(TL)
# print(z_score)

# Set 'title' to NaN where its length is more than 3 standard deviations from the mean
data.loc[abs(z_score) > threshold, 'title'] = np.nan
# print(data.head(3))

data = data.drop("title_length", axis=1)
data.head()

Unnamed: 0,title,content,source
0,Russia Practices 'Massive Nuclear Strike' in A...,“Important to have modern and constantly ready...,Breitbart
1,Nolte: Trump and Republican Senate Challenger ...,The final Insider Advantage poll out of Wiscon...,Breitbart
2,Sen. Tom Cotton Barnstorms Nation Campaigning ...,Sen. Tom Cotton (R-AR) is putting in long hour...,Breitbart
3,U.S. Marines Successfully Test Israel's Iron D...,The U.S. Marines have successfully tested a mo...,Breitbart
4,"Seoul, Japan Warn North Korea Preparing Nuclea...",The South Korean Defense Intelligence Agency (...,Breitbart


In [49]:
data.isnull().sum()

title       5
content    15
source      0
dtype: int64

# Feature Engineering

### Create new column `full_content`
Since some rows now have an empty `content` or `title`, we concatenate both columns into a new column `full_content`.
1. Replace `NaN` values in `content` and `title` with an empty string.

2. Combine `content` and `title` into `full_content`.

3. Strip any leading/trailing whitespace in `full_content`.

4. Drop the `content` and `title` columns.

In [50]:
# 1) Fill NaN values in 'content' and 'title' with an empty string
data['title'] = data['title'].fillna('')
data['content'] = data['content'].fillna('')

# 2) Combine 'content' and 'title' into 'full_content'
data['full_content'] = data['content'] + " " + data['title']

# 3) Strip any leading/trailing whitespace
data['full_content'] = data['full_content'].str.strip()

# 4) Drop `content` and `title` columns
data = data.drop(columns = ['content', 'title'])

# Check that the 'full_content' column was added and that the 'content' and 'title' columns have been dropped
print(data.head())
print("\nThe old shape is:",data.shape)

      source                                       full_content
0  Breitbart  “Important to have modern and constantly ready...
1  Breitbart  The final Insider Advantage poll out of Wiscon...
2  Breitbart  Sen. Tom Cotton (R-AR) is putting in long hour...
3  Breitbart  The U.S. Marines have successfully tested a mo...
4  Breitbart  The South Korean Defense Intelligence Agency (...

The old shape is: (818, 2)


### Handle Missing Values
1. Drop rows where `full_content` is an empty string and reset the index.

2. Check if there are no more null values in `data`.

In [51]:
# 1) Drop rows where `full_content` is an empty string and reset the index
data = data[data['full_content'] != ""].reset_index(drop=True)
print("The new shape is:",data.shape)

# 2) Check if there are no more null values in `data`
data.isnull().sum()

The new shape is: (818, 2)


source          0
full_content    0
dtype: int64

In [52]:
data

Unnamed: 0,source,full_content
0,Breitbart,“Important to have modern and constantly ready...
1,Breitbart,The final Insider Advantage poll out of Wiscon...
2,Breitbart,Sen. Tom Cotton (R-AR) is putting in long hour...
3,Breitbart,The U.S. Marines have successfully tested a mo...
4,Breitbart,The South Korean Defense Intelligence Agency (...
...,...,...
813,Guardian,A California “home” for rent is shining a ligh...
814,Guardian,Even though the United States touts its status...
815,Guardian,When the Reserve Bank board meet next week to ...
816,Guardian,"If elected, Donald Trump plans to carry out “t..."


# Text Preprocessing for NLP

Here we define a function `process_full_review` that takes a text value as input and applies the following processing steps in sequence:

1. Convert the input text to lowercase using the `lower()` method.

2. Tokenize the lowercase text using the `word_tokenize` function from the NLTK library.

3. Remove punctuation tokens by checking each token against `string.punctuation`.

4. Remove stopwords
-   Obtain the English stopwords using `stopwords.words('english')`.
-   Define the `allowed_words` that should not be removed.
-   Remove the stopwords (excluding those that should not be removed).

5. Apply stemming to each remaining token using the `stem` method of NLTK's `PorterStemmer`.

6. Join the stemmed tokens into a single processed text using the `join` method and return the processed text.

Create a new column (`processed_full_content`) in `data` by applying the `process_full_review` function to the `full_content` column.

In [53]:
# Ensure the required NLTK data is downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('all')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\User\AppData\Roaming\nltk

True

In [54]:
# Define function to process text
import string
from nltk.stem import PorterStemmer

# Build these once rather than on every call
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
# Words to keep even if they appear in the stopword list (negations and contrast words)
allowed_words = {"no", "not", "don't", "dont", "don", "but",
                 "however", "never", "wasn't", "wasnt", "shouldn't",
                 "shouldnt", "mustn't", "musnt"}

def process_full_review(text):
    # Convert to lowercase and tokenize
    tokens = word_tokenize(text.lower())
    # Drop punctuation tokens (anything that appears within string.punctuation)
    tokens = [word for word in tokens if word not in string.punctuation]
    # Remove stopwords (except the allowed ones) and stem what remains
    stemmed = [stemmer.stem(word) for word in tokens if word not in stop_words or word in allowed_words]
    return ' '.join(stemmed)

### <span style="color:red">The code below may take a while to run on large datasets (about 9 seconds for the 818 articles here).</span>

In [55]:
# Enable tqdm for pandas (progress bar)
tqdm.pandas(desc="Processing News Articles")

# Apply process_full_review to every article with a tqdm progress bar
data['processed_full_content'] = data['full_content'].progress_apply(process_full_review)

data

data_copy = data.copy()

Processing News Articles: 100%|██████████| 818/818 [00:09<00:00, 84.53it/s] 


In [56]:
data['source'].value_counts() 

source
The Daily Caller    94
Breitbart           91
CNN                 85
AP                  85
News Max            83
Zerohedge           83
Guardian            82
Natural News        78
NPR                 77
BBC                 60
Name: count, dtype: int64

In [57]:
# Sample 60 articles from each source (the size of the smallest group, BBC) so that every source is equally represented
sample_size = 60

data_sampled = (
    data.groupby('source', group_keys=False)
    .apply(lambda x: x.sample(n=sample_size, random_state=42))
    .reset_index(drop=True)
)

# Display the result
print("Number of articles after sampling:", len(data_sampled))
print(data_sampled['source'].value_counts())  # Check the counts per source
data_sampled.head()

data_sampled = data_sampled.drop(columns=['full_content'])

Number of articles after sampling: 600
source
AP                  60
BBC                 60
Breitbart           60
CNN                 60
Guardian            60
NPR                 60
Natural News        60
News Max            60
The Daily Caller    60
Zerohedge           60
Name: count, dtype: int64




In [58]:
data_sampled

Unnamed: 0,source,processed_full_content
0,AP,abort law motiv women north carolina vote elec...
1,AP,vote us elect step must take vote us elect ste...
2,AP,voter drown ad ‘ obscen ’ amount cash flood mo...
3,AP,don ’ count recount chang winner close elect f...
4,AP,battleground georgia poor peopl see no reason ...
...,...,...
595,Zerohedge,mani peopl might think attorney gener ken paxt...
596,Zerohedge,latest vivid demonstr once-dur democrat consti...
597,Zerohedge,former vatican roman cathol archbishop carlo m...
598,Zerohedge,author sam dorman via epoch time emphasi steve...


# Applying our best model (LSTM + GloVe 300D) on the scraped data

In [28]:
data = pd.read_csv("../processed_data.csv")

In [29]:
import numpy as np
import pandas as pd
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score


# Load and process GloVe embeddings
def load_glove_embeddings(glove_file, word_index, embedding_dim=300):
    embeddings_index = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefficients = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefficients
    embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
    for word, i in word_index.items():
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix


# Define LSTM model with Dropout and L2 regularization
def create_lstm_model(vocab_size, embedding_matrix, input_length, learning_rate=0.001, l2_lambda=0.01):
    model = Sequential()
    model.add(Embedding(vocab_size, embedding_matrix.shape[1],
                        weights=[embedding_matrix],
                        input_length=input_length,
                        trainable=True))
    model.add(LSTM(units=64, return_sequences=False, dropout=0.2))
    model.add(Dropout(0.2))  # Added Dropout layer here
    model.add(Dense(1, activation='sigmoid', kernel_regularizer=l2(l2_lambda)))
    model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate), metrics=['accuracy'])

    return model

# K-Fold Cross-Validation with additional metrics
def k_fold_cross_validation(X, y, embedding_matrix, vocab_size, max_len, n_splits=5, batch_size=128):
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    metrics = {'accuracy': [], 'precision': [], 'recall': [], 'f1': [], 'roc_auc': []}

    # Define early stopping
    early_stopping = EarlyStopping(
        monitor='val_loss',  # monitor validation loss
        patience=3,          # number of epochs with no improvement
        restore_best_weights=True  # revert to the best model weights
    )

    for train_idx, val_idx in kfold.split(X):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model = create_lstm_model(vocab_size, embedding_matrix, max_len, learning_rate=0.001, l2_lambda=0.01)
        
        # Fit model with early stopping
        model.fit(
            X_train, y_train,
            validation_data=(X_val, y_val),
            epochs=10,             # maximum number of epochs
            batch_size=batch_size,
            verbose=1,
            callbacks=[early_stopping]  # add early stopping
        )
        
        # Predict once, then threshold the probabilities at 0.5
        y_pred_prob = model.predict(X_val).ravel()
        y_pred = (y_pred_prob > 0.5).astype("int32")

        accuracy = accuracy_score(y_val, y_pred)
        precision = precision_score(y_val, y_pred)
        recall = recall_score(y_val, y_pred)
        f1 = f1_score(y_val, y_pred)
        roc_auc = roc_auc_score(y_val, y_pred_prob)

        metrics['accuracy'].append(accuracy)
        metrics['precision'].append(precision)
        metrics['recall'].append(recall)
        metrics['f1'].append(f1)
        metrics['roc_auc'].append(roc_auc)

        print(f"Fold Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1 Score: {f1}, ROC AUC: {roc_auc}")

    avg_metrics = {metric: np.mean(scores) for metric, scores in metrics.items()}
    print("\nAverage Metrics Across Folds:")
    for metric, avg_score in avg_metrics.items():
        print(f"{metric.capitalize()}: {avg_score:.4f}")
    
    # Note: `model` is the model trained on the last fold only
    return avg_metrics, model

# Load the processed training data and tokenize it
texts = data["processed_full_content"]
num_samples = len(texts)
labels = data["label"].values  # 0 = real, 1 = fake

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_sequences(texts)
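max_len = 100
# pad_sequences pads and truncates at the front by default,
# so articles longer than max_len keep only their last 100 tokens
X = pad_sequences(X, maxlen=max_len)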

# Load GloVe embeddings
embedding_dim = 300
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = load_glove_embeddings("../model_experiements/glove.6B.300d.txt", tokenizer.word_index, embedding_dim)

# Perform K-Fold cross-validation
avg_metrics, trained_model = k_fold_cross_validation(X, labels, embedding_matrix, vocab_size, max_len)
print("Final Average Metrics:", avg_metrics)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Fold Accuracy: 0.963827121829001, Precision: 0.9644607843137255, Recall: 0.9550970873786407, F1 Score: 0.9597560975609756, ROC AUC: 0.9927637818017205
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Fold Accuracy: 0.9650015659254619, Precision: 0.9573026092849881, Recall: 0.9666381522668948, F1 Score: 0.9619477313356601, ROC AUC: 0.9938413218727427
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Fold Accuracy: 0.9658628249295333, Precision: 0.9608454608454609, Recall: 0.9635163307852675, F1 Score: 0.9621790423317141, ROC AUC: 0.9935927747398381
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Fold Accuracy: 0.9648449733792671, Precision: 0.9667186687467498, Recall: 0.9559478916695234, F1 Score: 0.9613031112643282, ROC AUC: 0.9944265645677227
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Fold Accuracy: 0.9681334168493579, Precision: 0.9676867840656522, Recall: 0.962912555
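
Because the text has been stemmed (tokens such as `motiv`, `obscen` and `peopl` appear in the processed articles), some tokens may have no matching GloVe vector and would start from an all-zero row in `embedding_matrix`. A minimal sketch of a coverage check follows, assuming the GloVe vocabulary is available as a set of words. `embedding_coverage` and the toy inputs are hypothetical names used only for illustration.

```python
def embedding_coverage(word_index, glove_vocab):
    """Return the share of tokenizer words that have a pretrained vector."""
    if not word_index:
        return 0.0
    found = sum(1 for word in word_index if word in glove_vocab)
    return found / len(word_index)


# Toy inputs (hypothetical): a Keras-style word_index and a tiny GloVe vocabulary
toy_word_index = {"vote": 1, "elect": 2, "peopl": 3, "obscen": 4}
toy_glove_vocab = {"vote", "elect", "people", "obscene"}

# Only "vote" and "elect" match; the stemmed forms do not
coverage = embedding_coverage(toy_word_index, toy_glove_vocab)
print(f"Coverage: {coverage:.2%}")
```

With the real data, `glove_vocab` could be the keys of the `embeddings_index` dictionary built inside `load_glove_embeddings`, which would need to be returned from that function as well.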

In [30]:
scraped_texts = data_sampled['processed_full_content']
scraped_sequences = tokenizer.texts_to_sequences(scraped_texts)
scraped_padded_sequences = pad_sequences(scraped_sequences, maxlen=max_len)

# Get predictions from the trained model
new_predictions = trained_model.predict(scraped_padded_sequences).ravel()

# Convert predicted probabilities to class labels
# For binary classification with sigmoid activation, threshold at 0.5
predicted_classes = (new_predictions > 0.5).astype("int32")
data_sampled['predicted_label'] = predicted_classes



In [31]:
predicted_classes

array([0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0,

In [32]:
data_sampled

Unnamed: 0,source,processed_full_content,predicted_label
0,AP,abort law motiv women north carolina vote elec...,0
1,AP,vote us elect step must take vote us elect ste...,1
2,AP,voter drown ad ‘ obscen ’ amount cash flood mo...,0
3,AP,don ’ count recount chang winner close elect f...,0
4,AP,battleground georgia poor peopl see no reason ...,0
...,...,...,...
595,Zerohedge,mani peopl might think attorney gener ken paxt...,1
596,Zerohedge,latest vivid demonstr once-dur democrat consti...,0
597,Zerohedge,former vatican roman cathol archbishop carlo m...,1
598,Zerohedge,author sam dorman via epoch time emphasi steve...,0


In [33]:
# Count the articles predicted as fake (label 1) for each source
grouped_data = data_sampled.groupby('source')['predicted_label'].sum()

# Sort the grouped data in descending order based on the sum of 'predicted_label'
sorted_data = grouped_data.sort_values(ascending=False)

# Convert to DataFrame if needed
sorted_data = sorted_data.reset_index()

# Display the result
print(sorted_data)

             source  predicted_label
0      Natural News               57
1  The Daily Caller               39
2         Zerohedge               39
3          Guardian               31
4          News Max               23
5               CNN               21
6               BBC               20
7         Breitbart               15
8               NPR               11
9                AP               10


In [61]:
scraped_texts_test = data_copy['processed_full_content']
scraped_sequences_test = tokenizer.texts_to_sequences(scraped_texts_test)
scraped_padded_sequences_test = pad_sequences(scraped_sequences_test, maxlen=max_len)

# Get predictions from the trained model
new_predictions_test = trained_model.predict(scraped_padded_sequences_test).ravel()

# Convert predicted probabilities to class labels
# For binary classification with sigmoid activation, threshold at 0.5
predicted_classes_test = (new_predictions_test > 0.5).astype("int32")
data_copy['predicted_label'] = predicted_classes_test



In [63]:
# Group by 'source' and calculate the sum of 'predicted_label' and the count of rows
grouped_data_test = data_copy.groupby('source').agg(
    predicted_label_sum=('predicted_label', 'sum'),
    row_count=('predicted_label', 'size')
)

# Calculate the ratio
grouped_data_test['predicted_label_ratio'] = grouped_data_test['predicted_label_sum'] / grouped_data_test['row_count']

# Sort the data in descending order based on the ratio
sorted_data_test = grouped_data_test.sort_values(by='predicted_label_ratio', ascending=False).reset_index()

# Display the result
print(sorted_data_test)


             source  predicted_label_sum  row_count  predicted_label_ratio
0      Natural News                   71         78               0.910256
1  The Daily Caller                   66         94               0.702128
2         Zerohedge                   56         83               0.674699
3          Guardian                   44         82               0.536585
4          News Max                   32         83               0.385542
5               BBC                   20         60               0.333333
6               CNN                   25         85               0.294118
7         Breitbart                   19         91               0.208791
8                AP                   17         85               0.200000
9               NPR                   15         77               0.194805
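
Each ratio in the table above is estimated from fewer than 100 articles, so the ranking carries sampling uncertainty. A minimal sketch follows, using the counts from that table, of a 95% Wilson score interval per source. The choice of the Wilson interval is an assumption made here for illustration, and `wilson_interval` is a hypothetical helper, not part of the analysis above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# (predicted fake count, article count) taken from the table above
counts = {
    "Natural News": (71, 78),
    "The Daily Caller": (66, 94),
    "Zerohedge": (56, 83),
    "Guardian": (44, 82),
    "News Max": (32, 83),
    "BBC": (20, 60),
    "CNN": (25, 85),
    "Breitbart": (19, 91),
    "AP": (17, 85),
    "NPR": (15, 77),
}

# Print the point estimate and its interval for each source
for source, (fake, total) in counts.items():
    low, high = wilson_interval(fake, total)
    print(f"{source:<17} {fake / total:.3f}  [{low:.3f}, {high:.3f}]")
```

Where the intervals of two sources overlap heavily, their relative order should be read with caution. The intervals also describe only sampling variation in the model's predictions, not whether those predictions are correct.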
