# Cleaning the FakeNewsCorpus
- *Author*: Juan Cabanela
- *Start Date*: November 1, 2021

## Requirements

Requires the following Python libraries (`shutil` ships with the standard library; the others must be installed):
- pandas
- numpy
- shutil
- fasttext

To run `fasttext` you must install the package and download the 128MB `lid.176.bin` model datafile (`https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin`) and place it in a `fasttext/` subdirectory with this code.

This notebook should only be executed after you have downloaded and assembled the FakeNewsCorpus files (https://github.com/several27/FakeNewsCorpus). You should have the following files in the `data_dir` directory:
- news_sample.csv
- websites.csv
- news_cleaned_2018_02_13.csv.zip (or news_cleaned_2018_02_13.csv)

The `data_dir` defined toward the end of the first block of code is where the notebook will look for the FakeNewsCorpus files (my choice was `./FakeNewsCorpus/`).
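
Before running anything, it can help to confirm that the expected files are in place. The following is a minimal sketch and not part of the original notebook: the helper name `missing_inputs` is hypothetical, and the file names are the ones used by the code cells in this notebook (`news_sample.csv`, `websites.csv`, the main corpus file, and the `fasttext/lid.176.bin` model).

```python
from pathlib import Path


def missing_inputs(data_dir="./FakeNewsCorpus/", model_path="fasttext/lid.176.bin"):
    """Return a list of required input files that could not be found."""
    data_dir = Path(data_dir)
    missing = []

    # Files that must be present exactly as named
    for name in ("news_sample.csv", "websites.csv"):
        if not (data_dir / name).is_file():
            missing.append(str(data_dir / name))

    # The main corpus may be present either zipped or unzipped
    corpus = "news_cleaned_2018_02_13.csv"
    if not any((data_dir / f).is_file() for f in (corpus, corpus + ".zip")):
        missing.append(str(data_dir / corpus) + "[.zip]")

    # Language identification model loaded by fasttext
    if not Path(model_path).is_file():
        missing.append(str(model_path))

    return missing
```

Calling `missing_inputs()` with no arguments checks the default locations and returns an empty list when everything is where the notebook expects it.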

## History 
**November 2, 2021**: Initial exploratory version just used the `clean_Dataframe` code but made no attempt to reduce the information in the dataframe except by dropping columns and merging in `websites.csv` information.

**November 3, 2021**: After further development, I moved some of the data cleaning code from the Tokenization notebook to this one. This includes code to reduce the `content` column to the first 800 characters (10 lines) and code to clean both the `content` and `title` by removing punctuation and capitalization, making them easier to process. Even though this tosses out data, the opening of a news article is often touted as where the writer has to make the most important statements, so I am hoping it will contain enough information to determine if there are any journalistic flaws.

**November 4, 2021**: Examination of some of the dropped entries revealed that the FakeNewsCorpus contained a lot of additional `reliable` (aka mainstream media) news sources that I was tossing out because they were not in the `websites.csv` file, which seems to list only sites with some sort of flaw. I modified the merge logic to keep these sources, which pushed the number of articles we keep up to 8.1 million! I also fixed a stupid error in the punctuation I was removing so that closing quotation marks are removed as well.

The final step I performed in the cleaning was purging articles that were not in English. This turned out to be close to unfeasible with some libraries. I tried several language detection libraries (`langdetect`, `langid`, and `fasttext`) and settled on `fasttext` as the fastest. `langdetect` handled 250 strings in about 3 seconds, which I extrapolated to about 20-25 minutes for each 250,000-article 'chunk' of data, so purging all the non-English articles would take about 2 days! `fasttext` seems to process each chunk of 250,000 articles in about 60-80 seconds, so that is much more feasible.
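
The kind of timing comparison described above can be sketched as follows. This is an illustration only, not the benchmark that was actually run: `seconds_per_chunk` and `dummy_detect` are hypothetical names, the sample strings are made up, and the extrapolation assumes the cost grows linearly with the number of strings.

```python
import time


def seconds_per_chunk(detect, samples, chunk_size=250_000):
    """Time `detect` on `samples` and linearly extrapolate to `chunk_size` strings."""
    start = time.perf_counter()
    for text in samples:
        detect(text)
    elapsed = time.perf_counter() - start
    return elapsed * chunk_size / len(samples)


def dummy_detect(text):
    # Hypothetical placeholder: swap in a call to the library being tested
    return "en"


samples = ["This is a short English sentence."] * 250
estimate = seconds_per_chunk(dummy_detect, samples)
print(f"Estimated {estimate:0.1f} s per 250,000-article chunk")
```

Replacing `dummy_detect` with a thin wrapper around each candidate library gives comparable per-chunk estimates from the same small sample.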

**November 14, 2021:** As it looked likely that we would want to have the full context of the text in some cases, I separated the removal of
the stop words and dumped out two versions of the data files for later use.

**November 24, 2021:** Fixed an issue where some of the `1st_type` entries were blank. The error was partly due to `amazon.com` entries (weird) and other domains not matching the websites list. `blogspot.ie` is now set up to match `blogspot.com`. However, the following domains (not in `websites.csv`) still end up dropped: `hugedomains.com`, `21wire.tv`, and `amazon.com`.

In [None]:
import os
import pandas as pd
import numpy as np
import shutil
import re
import fasttext

##
## Define functions
##

# Define function to create directories (wiping out existing directories, possibly dangerous!!!)
def create_dir(dir, clobber=False):
    if not os.path.exists(dir):
        os.makedirs(dir)
        print("Created Directory : ", dir)
    else:
        if (not clobber):
            ans = input(f"Should we remove the existing {dir} directory (y/n)?")
        else:
            ans = 'y'
        # startswith() avoids an IndexError when the user just presses Enter
        if ans.strip().lower().startswith('y'):
            try:
                shutil.rmtree(dir)
            except OSError as e:
                print("Error: %s : %s" % (dir, e.strerror))
            os.makedirs(dir)
    return dir


# Print information on the Pandas dataframe
def print_FrameInfo(df):
    print(f"Number of records: {df.shape[0]}")
    print(f"Columns: {df.columns}")
    print(f"Data types: {df.dtypes}")


# Check for nulls in Pandas dataframe and report it
def check_nulls(df, verbose=False):
    if df.isnull().values.any():
        if (verbose):
            print("WARNING: You have null values somewhere in your Pandas dataframe!!!")
            for col in df.columns:
                print(f" - Found {df[col].isnull().sum()} null '{col}' entries.")

    return df.isnull().values.any()


# Function that attempts to identify the language
# Based on code at https://ricardoanderegg.com/posts/python-fast-language-identification-fasttext/
def lang(text):
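    # Non-string content (e.g. NaN from an empty field) has no detectable language
    if not isinstance(text, str):
        return ""
    # Fasttext can only handle one line at a time, so only process the first line
    text = re.sub(r'\n.*',r'',text)
    # return empty string if there is no text (empty or whitespace only)
    if not text.strip():
        return ""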
    else:
        # get first item of the prediction tuple, then split by "__label__" and return only language code
        return lid_model.predict(text)[0][0].split("__label__")[1]


# Clean Articles dataframe by purging records with no type or unknown type and also matching up
# website dataframe information to get additional types as needed.
def clean_Dataframe(articles_df, classedweb_df, verbose = False):
    # Make deep copies to work with (and to avoid triggering SettingWithCopyWarning)
    art_df = articles_df.copy(deep=True)
    types_df = classedweb_df.copy(deep=True)

    # Convert the domains to lower case and strip out any leading "www." (to allow merging with websites.csv data)
    art_df['domain'] = art_df['domain'].str.lower()
    art_df['domain'] = art_df['domain'].str.replace(r'^www\.', '', regex=True)  # Anchored so only a leading "www." is removed

    # Clean up alternative domain name issue with blogspot.
    art_df['domain'] = art_df['domain'].replace(regex={r'blogspot\.ie$': 'blogspot.com'})

    # merge website information with article information to get updated types
    merged_df = pd.merge(art_df, types_df, on = "domain", how = "left")  # Keep all original records, but match to website where possible

    # Remove "unknown" 1st_type
    merged_df = merged_df[merged_df['1st_type'] != 'unknown'].copy(deep=True)

    # Determine the language of the article text and remove non-English articles.
    merged_df['lang'] = merged_df['content'].apply(lang)
    merged_df = merged_df[merged_df['lang'] == 'en'].copy(deep=True)
    merged_df = merged_df.drop(columns=['lang'])

    # Address those cases where there was not a match in the merge above by setting '1st_type' to 'type' (and trusting the classifications).
    # Uses .loc to avoid chained assignment, which can silently fail to update merged_df.
    noweb = merged_df['1st_type'].isna()
    merged_df.loc[noweb, '1st_type'] = merged_df.loc[noweb, 'type']

    # remove entries where '1st_type' is NaN
    merged_df = merged_df.dropna(subset=['1st_type'])

    # Reset comparison after addressing mismatched domains.
    merged_df['compare'] = (merged_df['type'] == merged_df['1st_type'])
    mismatch = (merged_df['compare'] == False)
    notnull_mismatch = (merged_df['type'].notna()) & (merged_df['compare'] == False) & (merged_df['domain'] != 'patriotnewsdaily.com') & (merged_df['domain'] != 'madworldnews.com')
    n_mismatch = merged_df[mismatch].shape[0]
    n_mismatch_notnull = merged_df[notnull_mismatch].shape[0]
    n_updated = n_mismatch-n_mismatch_notnull
    if (n_mismatch_notnull > 0):
        print(f"WARNING: Found {n_mismatch_notnull} entries in which a non-null 'type' from sample doesn't match '1st_type' from websites.")
        print(merged_df[['domain', 'type','1st_type','2nd_type','3rd_type']][notnull_mismatch])

    if (n_updated>0) and (verbose):
        print(f"NOTE: Updated types for {n_updated} entries in which 'type' was previously NaN.")

    # Drop unused columns and redundant type columns and return
    merged_df.drop(columns=['compare', 'type'], inplace=True)

    return merged_df


def content_cleaner(row):
    # Processes the row content through the cleaner
    content = row.content
    return string_cleaner(content)


def title_cleaner(row):
    # Processes the row content through the cleaner
    title = row.title
    return string_cleaner(str(title)) # This became necessary because some titles ended up as floats!?!


def string_cleaner(stuff):
    # This function takes the input string, removes line feeds and space runs,
    # and trims the last word

    # Remove line feeds and space runs
    stuff = stuff.replace('\n',' ')
    stuff = re.sub(r"\s+", " ", stuff)  # Remove multiple space runs

    # Remove last word since it is likely to be a partial word anyway
    # (rfind returns -1 when there is no space; slicing with -1 would chop a character, so skip that case)
    last_space_idx = stuff.rfind(" ")
    if last_space_idx != -1:
        stuff = stuff[:last_space_idx]
    return stuff.strip()


def content_second_scrub(row):
    # Processes the row content through stop word remover
    content = row.content
    return second_scrub(content)


def title_second_scrub(row):
    # Processes the row content through stop word remover
    title = row.title
    return second_scrub(str(title)) # This became necessary because some titles ended up as floats!?!


def second_scrub(stuff):
    # Remove stop words and punctuation and make entire text lowercase
    stuff = stuff.lower()
    stuff = ''.join(filter(lambda c: c not in punctuation, stuff))

    # Remove Stop Words
    # (split() with no argument also discards the empty strings left behind where
    # punctuation was removed, so the output has no doubled spaces)
    kept = [word for word in stuff.split() if word not in ENGLISH_STOP_WORDS]
    return ' '.join(kept)


##
## Define constants
##

# List of English stopwords (grabbed from https://gist.github.com/ethen8181/d57e762f81aa643744c2ffba5688d33a and used in scikit-learn
# and nltk)
ENGLISH_STOP_WORDS=['a','about','above','across','after','afterwards','again','against',
	'ain','all','almost','alone','along','already','also','although','always','am',
	'among','amongst','amoungst','amount','an','and','another','any','anyhow',
	'anyone','anything','anyway','anywhere','are','aren','around','as','at','back',
	'be','became','because','become','becomes','becoming','been','before','beforehand',
	'behind','being','below','beside','besides','between','beyond','bill','both',
	'bottom','but','by','call','can','cannot','cant','co','con','could','couldn',
	'couldnt','cry','d','de','describe','detail','did','didn','do','does','doesn',
	'doing','don','done','down','due','during','each','eg','eight','either','eleven',
	'else','elsewhere','empty','enough','etc','even','ever','every','everyone',
	'everything','everywhere','except','few','fifteen','fify','fill','find','fire',
	'first','five','for','former','formerly','forty','found','four','from','front',
	'full','further','get','give','go','had','hadn','has','hasn','hasnt','have',
	'haven','having','he','hence','her','here','hereafter','hereby','herein','hereupon',
	'hers','herself','him','himself','his','how','however','hundred','i','ie','if','in',
	'inc','indeed','interest','into','is','isn','it','its','itself','just','keep','last',
	'latter','latterly','least','less','ll','ltd','m','ma','made','many','may','me',
	'meanwhile','might','mightn','mill','mine','more','moreover','most','mostly','move',
	'much','must','mustn','my','myself','name','namely','needn','neither','never',
	'nevertheless','next','nine','no','nobody','none','noone','nor','not','nothing',
	'now','nowhere','o','of','off','often','on','once','one','only','onto','or','other',
	'others','otherwise','our','ours','ourselves','out','over','own','part','per',
	'perhaps','please','put','rather','re','s','same','see','seem','seemed','seeming',
	'seems','serious','several','shan','she','should','shouldn','show','side','since',
	'sincere','six','sixty','so','some','somehow','someone','something','sometime',
	'sometimes','somewhere','still','such','system','t','take','ten','than','that',
	'the','their','theirs','them','themselves','then','thence','there','thereafter',
	'thereby','therefore','therein','thereupon','these','they','thick','thin','third',
	'this','those','though','three','through','throughout','thru','thus','to',
	'together','too','top','toward','towards','twelve','twenty','two','un','under',
	'until','up','upon','us','ve','very','via','was','wasn','we','well','were',
	'weren','what','whatever','when','whence','whenever','where','whereafter',
	'whereas','whereby','wherein','whereupon','wherever','whether','which','while',
	'whither','who','whoever','whole','whom','whose','why','will','with','within',
	'without','won','would','wouldn','y','yet','you','your','yours','yourself',
	'yourselves']

# Define punctuation to purge
punctuation = '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~‘’–“”'

# Define data directory
data_dir = "./FakeNewsCorpus/"

# Set up language classifier model
fasttext.FastText.eprint = lambda x: None   # Suppresses stupid warning about deprecation
lid_model = fasttext.load_model('fasttext/lid.176.bin')

In [None]:
# Load the website classifications from OpenSources.co for the purpose of allowing updating/correcting
# of 'type' assignments in main corpus (including tracking all three known types).
WebClass = f"{data_dir}websites.csv"
orig_websites_df = pd.read_csv(WebClass)[['url','type','2nd_type','3rd_type']]

# Convert website data to have all lowercase urls and fix column names for later
websites_df = orig_websites_df.copy(deep=True)
websites_df['url'] = websites_df['url'].str.lower()

websites_df = websites_df.rename(columns={"type": "1st_type", "url": "domain"})  # Rename columns for easier joining

# # Print information on websites csv
# print(f"Read in {WebClass} file and found:")
# print_FrameInfo(websites_df)

## Exploring the Sample Corpus (OPTIONAL)

### You do NOT need to run the following cell to clean the data.

I started by exploring the sample news corpus file and looking at how to reduce it down to less data and see if there were any issues.  This cell summarizes that work.  The key finding was that in some cases, the `type` column was set to `NaN` (not a number, so essentially a 'null' value).

It looks like in a couple cases, the 'domain' didn't exactly match one of the URLs in the `websites.csv` file, so a 'type' was NOT assigned. I ended up fixing this by converting all the 'domain' values to be lowercase and removing any leading 'www.'.  This allowed me to perform a join with data from the websites.csv file to recover the correct 'type' (and, in fact, I match the [up to] three 'types' listed in the websites.csv file).

All this is done in the `clean_Dataframe()` function. In the process of developing that function I also discovered that the guy who built the FakeNewsCorpus occasionally included articles from websites that were not known, so those articles were typed as 'unknown'. If I was unable to match them to the `websites.csv` file, I dropped those articles while cleaning. All the entries should now have at least the '1st_type' assigned.
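
The matching idea described above (normalize the domain, join against the website list, and fall back on the corpus `type` when nothing matches) can be sketched on toy data. This is a simplified illustration under my own assumptions, not the `clean_Dataframe()` implementation: every row is made up, and the order of the steps is chosen for readability.

```python
import pandas as pd

# Toy stand-ins for the corpus and websites.csv (all values are made up)
articles = pd.DataFrame({
    "domain": ["WWW.Example.com", "reliable-news.org", "mystery.net"],
    "type": [None, "reliable", "unknown"],
})
websites = pd.DataFrame({
    "domain": ["example.com"],
    "1st_type": ["fake"],
})

# Lowercase the domain and strip only a leading "www."
articles["domain"] = (
    articles["domain"].str.lower().str.replace(r"^www\.", "", regex=True)
)

# A left join keeps every article, matched to the website list where possible
merged = pd.merge(articles, websites, on="domain", how="left")

# Where no website matched, fall back on the corpus 'type'
no_match = merged["1st_type"].isna()
merged.loc[no_match, "1st_type"] = merged.loc[no_match, "type"]

# Drop entries that still have no usable type
merged = merged[merged["1st_type"].notna() & (merged["1st_type"] != "unknown")]
print(merged[["domain", "1st_type"]])
```

In this toy case the `WWW.Example.com` row recovers its type from the website list, the `reliable-news.org` row keeps its corpus type, and the `mystery.net` row is dropped.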

In [None]:
# Load the sample corpus
SampleCorpus = f"{data_dir}news_sample.csv"
orig_sample_df = pd.read_csv(SampleCorpus)

# Subset just the part of the pandas data frame I want to keep.
# The idea here is to JUST keep the title, article text, and tagging information.
sample_df = orig_sample_df[['domain','type','title','content']]

# Print information
print(f"Read in {SampleCorpus} file and found:")
print_FrameInfo(sample_df)

# I discovered a few of the entries had NaNs, this reports they are in the type column
if (check_nulls(sample_df, verbose=True)):
    print("Well crap, you need to investigate these nulls...")

# Retrieve a cleaned dataframe
revised_df = clean_Dataframe(sample_df, websites_df, verbose=True)
revised_df

## Read and Chunk the full FakeNewsCorpus file

Read the main compressed FakeNewsCorpus file using the pandas CSV importer (which supports ZIP-compressed CSV files). We will "chunk" it into subfiles of 250,000 records each. In the process I will remove bad entries with no type or where the '1st_type' is 'unknown'.
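
As a small illustration of how chunked reading behaves, the sketch below feeds a made-up ten-row CSV through `pd.read_csv` with `chunksize`. The in-memory data and the chunk size of 4 are assumptions chosen to keep the example tiny.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real corpus file (made-up rows)
csv_text = "domain,type\n" + "".join(f"site{i}.com,fake\n" for i in range(10))

# chunksize makes read_csv return an iterator of DataFrames instead of one big frame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

chunk_sizes = []
for i, frame in enumerate(reader):
    # Each frame could be cleaned and written out here, e.g. with frame.to_csv(...)
    chunk_sizes.append(len(frame))
    print(f"chunk #{i + 1:03d}: {len(frame)} rows")
```

Only one chunk is held in memory at a time, which is what makes it possible to work through a file far larger than available RAM.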

### Problem discovered while processing
It looks like a few websites were coded differently in the FakeNewsCorpus vs. the `websites.csv` file.
1. madworldnews.com coded as "unreliable" where websites.csv says "fake".
2. patriotnewsdaily.com coded as "satire" where websites.csv says "bias".

I coded the `clean_Dataframe()` function to ignore these cases but still look for other mismatches; no other ones were found.
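
A mismatch check of this kind can be sketched as a boolean mask. The rows below are made up and `known_exceptions` is a hypothetical name; the idea is simply that a disagreement only counts when the corpus `type` is present and the domain is not one of the two known cases.

```python
import pandas as pd

# Made-up rows: 'type' comes from the corpus, '1st_type' from the website list
df = pd.DataFrame({
    "domain": ["madworldnews.com", "example.com", "other.org", "nomatch.net"],
    "type": ["unreliable", "fake", "bias", None],
    "1st_type": ["fake", "fake", "satire", "conspiracy"],
})

known_exceptions = ["madworldnews.com", "patriotnewsdaily.com"]

# Flag rows where a non-null corpus type disagrees with the website type,
# skipping the domains already known to be coded differently
mismatch = (
    df["type"].notna()
    & (df["type"] != df["1st_type"])
    & ~df["domain"].isin(known_exceptions)
)
print(df.loc[mismatch, ["domain", "type", "1st_type"]])
```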

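This also runs the first 800 characters of the `content` column through `content_cleaner`, which removes newlines, collapses runs of whitespace, and drops the trailing (probably partial) word. That version is written to `fullcontextnews_chunked/`. A second pass (`content_second_scrub`) then removes punctuation and English stop words and converts the text to lowercase before the result is written to `news_chunked/`. The `title` column goes through the same two passes.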
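
The cleaning steps (truncate, collapse whitespace, drop the trailing partial word, then lowercase and remove punctuation and stop words) can be walked through on a made-up string. This is a standalone sketch rather than the notebook's own functions: it truncates at 60 characters instead of 800 for brevity, uses `string.punctuation` rather than the notebook's `punctuation` constant, and uses a tiny illustrative stop word set.

```python
import re
import string

raw = "BREAKING:\n\nThe  Senate voted on the bill today, and it passed with a narrow margin"

# Step 1: keep only the opening of the article (60 characters here, 800 in the notebook)
text = raw[:60]

# Step 2: collapse newlines and runs of whitespace into single spaces
text = re.sub(r"\s+", " ", text).strip()

# Step 3: drop the last word, which truncation has probably cut in half
text = text[: text.rfind(" ")]
full_context = text

# Step 4: lowercase, strip punctuation, and remove stop words
stop_words = {"the", "on", "and", "it", "with", "a"}  # tiny illustrative subset
text = "".join(c for c in text.lower() if c not in string.punctuation)
scrubbed = " ".join(w for w in text.split() if w not in stop_words)

print(full_context)
print(scrubbed)
```

The value held in `full_context` still reads like a sentence, while `scrubbed` keeps only the content-bearing words.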

**KEY RESULT:** This routine takes about 3min30sec to run per chunk (about 3 hours total), but reduces 29GB of raw data to about 4.3GB!

In [None]:
# Set DEBUG = 1 to activate the debugging code (dumps pre-cleaning and mid-cleaning copies of each chunk)
DEBUG = 0

# Determine size of original Corpus file
OriginalCorpus = f"{data_dir}news_cleaned_2018_02_13.csv.zip"
b = os.path.getsize(OriginalCorpus)
print(f"FakeNewsCorpus file {OriginalCorpus} has size {b/1024**3:0.3f} GB.")

# Write out the chunked frames into separate files, but maintain the keys in the first line of the csv when writing frame to csv.

# Load data using dataframes of limited length...
chunklines = 250000  # Number of entries per chunk
df = pd.read_csv(OriginalCorpus, iterator=True, chunksize=chunklines, low_memory=False, compression='zip', lineterminator='\n')

# Create directory to place chunks in
if (DEBUG):
    preclean_dir = f"{data_dir}news_precleaned/"
    dir_created = create_dir(preclean_dir, clobber=True)
    print(f"Created {dir_created} directory to store precleaned datafiles (clobbering previous version).")

fullcontext_dir = f"{data_dir}fullcontextnews_chunked/"
dir_created = create_dir(fullcontext_dir, clobber=True)
print(f"Created {dir_created} directory to store full context chunked datafiles (clobbering previous version).")

chunked_dir = f"{data_dir}news_chunked/"
dir_created = create_dir(chunked_dir, clobber=True)
print(f"Created {dir_created} directory to store chunked datafiles (clobbering previous version).")

print(f"Chunking {OriginalCorpus} into chunks with {chunklines} entries.")

# Load each frame into memory and then process it
n_tot = 0
for i, frame in enumerate(df):
    print(f"Processing chunk #{i+1:03d}:")
    # Dump output before any cleaning processing
    if (DEBUG):
        fname2 = f"{preclean_dir}precleaned_news_{i+1:03d}.csv"
        print(f"Creating {fname2} (DEBUG) ... ")
        frame.to_csv(fname2, index=False)

    # Run code to re-match website classes to entries and remove non-English content
    revised_frame = clean_Dataframe(frame[['domain','type','title','content']], websites_df)

    # Dump output before string cleaning processing
    if (DEBUG):
        fname1 = f"{preclean_dir}midcleaned_news_{i+1:03d}.csv"
        print(f"Creating {fname1} (DEBUG) ... ")
        revised_frame.to_csv(fname1, index=False)

    n_entries = revised_frame.shape[0]
    if (n_entries>0):
        # Check for NaN here that sneak through, just in case
        bad_df = revised_frame[revised_frame['1st_type'].isnull()]
        bad_domains = bad_df.drop_duplicates(subset = ["domain"])
        if bad_domains.shape[0] > 0:
            print("Empty '1st_type' for following domains: ")
            print(bad_domains['domain'])

        # Reduce content to first 800 characters
        revised_frame["content"] = revised_frame["content"].str[:800]
        # clean all the article content
        revised_frame['content'] = revised_frame.apply(content_cleaner, axis=1)
        # clean all the title content
        revised_frame['title'] = revised_frame.apply(title_cleaner, axis=1)

        # Dump articles with context before stripping the stop words (and thus wiping context)
        fname3 = f"{fullcontext_dir}news_{i+1:03d}.csv"
        print(f"Creating {fname3} ... ")
        revised_frame.to_csv(fname3, index=False)

        # clean all the article stopwords, remove punctuation, and convert to lowercase
        revised_frame['content'] = revised_frame.apply(content_second_scrub, axis=1)
        # clean all the title stopwords, remove punctuation, and convert to lowercase
        revised_frame['title'] = revised_frame.apply(title_second_scrub, axis=1)

        fname = f"{chunked_dir}news_{i+1:03d}.csv"
        print(f"Creating {fname} ... ", end='')
        revised_frame.to_csv(fname, index=False)
        n_tot = n_tot + n_entries
        print(f"{n_entries} entries exported post cleaning.")
    else:
        print("no entries survived cleaning.")
    del revised_frame   # Release the memory (or mark it as releasable)
n_files = i+1
print(f"- A total of {n_tot} entries exported across {n_files} datafiles.")

## Test reading a chunked news file

Read just the first file and see what you can do with it
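
One thing that can be done with the chunked files is tallying how many articles of each type survived cleaning. The sketch below is an assumption-laden illustration: `count_types` is a hypothetical helper, and it assumes the files follow the `news_NNN.csv` naming and contain a `1st_type` column.

```python
import glob
import os
from collections import Counter

import pandas as pd


def count_types(chunked_dir):
    """Sum the '1st_type' counts over every news_*.csv file in chunked_dir."""
    totals = Counter()
    for fname in sorted(glob.glob(os.path.join(chunked_dir, "news_*.csv"))):
        # Only load the one column needed, to keep memory use low
        counts = pd.read_csv(fname, usecols=["1st_type"])["1st_type"].value_counts()
        totals.update(counts.to_dict())
    return dict(totals)
```

Calling `count_types("FakeNewsCorpus/news_chunked/")` would return a dictionary mapping each type label to its total article count across all chunks.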

In [None]:
chunked_dir = f"{data_dir}news_chunked/"
chunked_df = pd.read_csv(f"{chunked_dir}news_001.csv")

print(f"Columns are: {chunked_df.columns}")
print(f"There are {chunked_df.shape[0]} entries in this file.")
print("First entry:")
print(f"Title: {chunked_df['title'].iloc[0]}")
print(f"Content: {chunked_df['content'].iloc[0]}")
print(f"1st Type: {chunked_df['1st_type'].iloc[0]}")
print(f"2nd Type: {chunked_df['2nd_type'].iloc[0]}")
print(f"3rd Type: {chunked_df['3rd_type'].iloc[0]}")