# Tokenizing the FakeNewsCorpus
- *Author*: Juan Cabanela
- *Start Date*: November 1, 2021

### Requirements

Requires the following python libraries:
- pandas
- numpy
- tensorflow
- scikit-learn (sklearn)

This notebook should only be executed using the cleaned version of the FakeNewsCorpus that has been "chunked" into smaller portions, mostly because that is the way the data will be loaded.

The `chunked_dir` defined at the end of the first block of code is where the notebook will look for the cleaned and chunked FakeNewsCorpus files (I defaulted to `./FakeNewsCorpus/news_chunked/`).

This notebook tokenizes the words in the corpus (that is, it represents each word with an integer ID). To save memory, I made a few decisions inspired by the Chapter 16 discussion in *Hands-On Machine Learning with Scikit-Learn, Keras, & TensorFlow* by Aurélien Géron on tokenizing text for sentiment analysis:

- I will tokenize both the titles and the content.
- I will only keep the 10,000 most common words.
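
As a rough illustration of the two decisions above, here is a minimal sketch in plain Python. The helper names (`build_vocabulary`, `encode`) and the toy sentences are made up for this example. It also simplifies things by sending every unknown word to a single shared ID, whereas the notebook itself hashes unknown words into 1000 out-of-vocabulary buckets with a TensorFlow lookup table.

```python
from collections import Counter

def build_vocabulary(texts, vocab_size):
    """Return the vocab_size most frequent lowercase words across texts."""
    counts = Counter()
    for text in texts:
        counts.update(str(text).lower().split())
    return [word for word, _ in counts.most_common(vocab_size)]

def encode(text, word_to_id, oov_id):
    """Map each word to its ID, or to oov_id if it is not in the vocabulary."""
    return [word_to_id.get(word, oov_id) for word in str(text).lower().split()]

# Toy corpus (hypothetical data, not taken from the FakeNewsCorpus)
toy_texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_vocab = build_vocabulary(toy_texts, vocab_size=2)   # keeps "the" and "sat"
word_to_id = {word: i for i, word in enumerate(toy_vocab)}
oov_id = len(toy_vocab)  # assumption: one shared ID for every unknown word

# The same vocabulary is applied to both titles and content
toy_title_ids = encode("The cat sat", word_to_id, oov_id)
toy_content_ids = encode("a bird flew", word_to_id, oov_id)
print(toy_vocab, toy_title_ids, toy_content_ids)
```

Capping the vocabulary keeps the lookup table small, at the cost of collapsing rare words into out-of-vocabulary IDs.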

### History 
**November 4, 2021**: First complete version of this code is done. It takes the cleaned corpus and tokenizes its text, saving the final dataframe as both a pickle file (to keep the structure intact) and a CSV file (which would have to be parsed carefully to recover the structure). This pickle file is only 2.49GB in size!

**November 14, 2021**: Performed some statistics on the numbers of articles in each class. Also added boolean columns for each class, since an article can have up to three classes assigned. I also added a TF-IDF tokenizer from Scikit-Learn to the arsenal. I didn't save the results of running that tokenizer, but it is considerably faster (taking only 9 minutes to run).

**November 15, 2021**: Moved the code for dealing with data that was cleaned using TF-IDF tokenization to a separate notebook. Also had the code automatically detect if pickle files exist from previous processing to avoid re-running everything.

In [None]:
import pandas as pd
import numpy as np
import pathlib
import tensorflow as tf
import pickle
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

##
## Custom functions
##

# Set up tokenizing function
def tokenize(ele):
    # Text must be a lowercase string to be tokenized properly.
    # Requires `table` (the vocabulary lookup table) to be defined as a global before this is called.
    return np.array(table.lookup(tf.constant(str(ele).lower().split())), dtype=np.int32)


##
## Define constants
##

# Directory containing chunked data
data_dir = "./FakeNewsCorpus/"
chunked_dir = f"{data_dir}news_chunked/"

## Loading the chunked data files

Read the files which have already been cleaned (should be only about 3.5GB) and build a single dataframe containing the entire cleaned dataset. Loading all the data into memory takes about 50 seconds, and the full dataset occupies about 5.5GB of memory once unpacked.
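
The loading step can be sketched on its own as follows. This is a minimal sketch, not the notebook's exact code: `load_chunks` is a hypothetical helper, the two tiny CSV files are invented stand-ins for the real chunks, and every column is read as a string for simplicity.

```python
import pathlib
import tempfile

import pandas as pd

def load_chunks(chunk_dir):
    """Read every *.csv file in chunk_dir (sorted by name) into one DataFrame."""
    paths = sorted(pathlib.Path(chunk_dir).glob("*.csv"))
    frames = [pd.read_csv(path, dtype=str) for path in paths]
    # Concatenate once at the end; pd.concat raises ValueError if no files were found
    return pd.concat(frames, ignore_index=True)

# Demo on two made-up chunk files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"title": ["a", "b"], "content": ["x y", "z"]}).to_csv(
        f"{tmp}/chunk_000.csv", index=False)
    pd.DataFrame({"title": ["c"], "content": ["w"]}).to_csv(
        f"{tmp}/chunk_001.csv", index=False)
    demo_df = load_chunks(tmp)

# deep=True counts the actual string payloads, not just the object pointers
demo_gb = demo_df.memory_usage(deep=True).sum() / 1024**3
print(f"{len(demo_df)} entries occupying {demo_gb:0.6f} GB")
```

Building a list of frames and concatenating once avoids copying the growing dataframe on every iteration.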

In [None]:
# Set up path to chunked CSV files
chunked_path = pathlib.Path(chunked_dir)
corpus_pickle = f"{data_dir}cleaned_corpus.p"
corpus_csv = f"{data_dir}cleaned_corpus.csv"
vocab_pickle = "vocabulary.p"
corpus_pickle_path = pathlib.Path(corpus_pickle)
vocab_pickle_path = pathlib.Path(vocab_pickle)

# Check if the pickle files exist; if not, process everything
if corpus_pickle_path.is_file() and vocab_pickle_path.is_file():
    print(f"Loading previously pickled final_df (about {corpus_pickle_path.stat().st_size/1024**3:0.2f} GB)")
    with open(corpus_pickle, "rb") as f:
        final_df = pickle.load(f)
    print(f"Loading previously pickled vocabulary (about {vocab_pickle_path.stat().st_size/1024**3:0.2f} GB)")
    with open(vocab_pickle, "rb") as f:
        vocabulary = pickle.load(f)

    print(f"Loaded final_df which is occupying {final_df.memory_usage(deep=True).sum()/1024**3:0.3f} GB of memory.")

    # Get all the categories as a list
    rawcategories = np.array(final_df["1st_type"].unique()).astype('str')
    categories = rawcategories[rawcategories != 'nan'].tolist()
else:
    # Collect each chunk in a list and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0 and was slow inside a loop)
    chunk_frames = []

    # Iterate through chunked csv files alphabetically
    n_tot = 0

    print("Loading records: ")
    for csvname in sorted(chunked_path.iterdir()):
        if csvname.suffix == ".csv":
            print(f"   {csvname.name}: ", end="")
            chunked_df = pd.read_csv(csvname, dtype={'domain': str, 'title': str, 'content': str, '1st_type': str, '2nd_type': str, '3rd_type': str})
            n_tot = n_tot + chunked_df.shape[0]
            chunk_frames.append(chunked_df)
            print(f" {chunked_df.shape[0]} entries loaded.")

    # Create master dataframe
    master_df = pd.concat(chunk_frames, ignore_index=True)

    print(f"\n{n_tot} entries loaded into master_df which is occupying {master_df.memory_usage(deep=True).sum()/1024**3:0.3f} GB of memory.")

    # Get all the categories as a list
    rawcategories = np.array(master_df["1st_type"].unique()).astype('str')
    categories = rawcategories[rawcategories != 'nan'].tolist()

    # Collect 1st_type counts
    summary = master_df["1st_type"].value_counts()

    # Add a column for each category, flagging the entry as an example, collect stats
    for cat in categories:
        master_df[cat] = (master_df['1st_type'] == cat) | (master_df['2nd_type']  == cat) | (master_df['3rd_type']  == cat)
        print(f"- '{cat}': {len(master_df[master_df[cat]])} total articles (only {summary[cat]} as 1st_type).")

    # Build a vocabulary and save it to a pickle file for later use (takes about 2 min to process the entire dataset)
    vocabulary = Counter()

    print(f"Determining vocabulary for {n_tot} entries (expect this to take a few minutes):")
    for i, content in enumerate(master_df['content']):
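        # Add vocabulary (force content to be string, in case it was typed as something else).
        # Lowercase and split on any whitespace so the counts match what tokenize() looks up
        # (split(" ") would also count empty strings wherever two spaces are adjacent).
        vocabulary.update(str(content).lower().split())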
        if i%250000 == 0:
            print(f"{i:07d}...", end='')
    print("\nDONE!")

    # Write vocab to file to avoid having to reprocess
    with open(vocab_pickle, "wb") as f:
        pickle.dump(vocabulary, f)

    # Check size of vocabulary
    print (f" There are {len(vocabulary)} words in the complete vocabulary (sans English stop words)!")

    # Let's select only the top 10,000 words
    vocab_size = 10000
    truncated_vocabulary = [word for word, count in vocabulary.most_common(vocab_size)]
    print(f" There are {len(truncated_vocabulary)} words in the truncated vocabulary (sans English stop words)!")

    # Set up TensorFlow for tokenizing (a la Chapter 16 of Hands-on Machine Learning with Scikit-Learn, Keras, & TensorFlow by Aurélien Géron)
    words = tf.constant(truncated_vocabulary)
    wordIDs = tf.range(len(truncated_vocabulary), dtype=tf.int64)
    vocab_init = tf.lookup.KeyValueTensorInitializer(words, wordIDs)
    # Create 1000 out-of-vocabulary (OOV) buckets for words not in the truncated vocabulary
    num_OOV_buckets = 1000
    table = tf.lookup.StaticVocabularyTable(vocab_init, num_OOV_buckets)

    # Test it (print the result, since a bare expression inside this block displays nothing)
    print(np.array(table.lookup(tf.constant(b"President Obama is not Trump".lower().split()))).tolist())

    # Apply the lookup to every 'title' and 'content' and save the results (takes about 45 min to run)
    print(f"Tokenizing 'title' for {n_tot} entries ... ", end='')
    master_df['title_tokens'] = master_df['title'].apply(tokenize)
    print(f"and 'content' for {n_tot} entries ... ", end='')
    master_df['content_tokens'] = master_df['content'].apply(tokenize)
    print(" DONE!")

    # Extract only desired columns
    final_df = master_df[['domain', '1st_type', '2nd_type', '3rd_type', 'title_tokens', 'content_tokens', 'rumor', 'hate', 'unreliable',
                            'conspiracy', 'clickbait', 'satire', 'fake', 'bias', 'political', 'junksci', 'reliable']]

    # Save the results both as pickle and as CSV
    print(f"Creating {corpus_pickle} ... ")
    with open(corpus_pickle, "wb") as f:
        pickle.dump(final_df, f)
    print(f"Creating {corpus_csv} ... ")
    final_df.to_csv(corpus_csv, index=False)


In [None]:
final_df