## 01 - Loading data

This notebook has two objectives:
1. Give a general outline of the project.
2. Load the dataset and extract useful features.

#### Objectives

The goal is to build a simple neural network that classifies news
articles as fake or reliable. While the data presented here can be used
with top-of-the-line algorithms, most of the work here is a proof of concept
for deep learning. The major roadblock is hardware: cleaning the data and
running these models takes a massive amount of processing power.

With that in mind, only a small subset of the data is used.
Moreover, only two categories of news are retained: fake and reliable.

#### Pipelines

1. Load the data and extract the desired features and rows.
2. Take a small sample of the data (2000 articles):
    - Clean and tokenize.
    - Generate a sparse term-document matrix.
3. Fit a neural network model.
4. Evaluate the model.
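
To make the "clean and tokenize" and "term-document matrix" steps concrete, here is a minimal pure-Python sketch. It is not the code used later in this project: the two documents, the `tokenize` helper, and the dict-based sparse rows are all hypothetical stand-ins chosen for illustration. In practice a library such as scikit-learn's `CountVectorizer` would typically build this matrix.

```python
import re
from collections import Counter

# Hypothetical toy documents standing in for sampled articles (not real data)
docs = [
    "Breaking news: the moon is made of cheese!",
    "The city council approved the new budget.",
]


def tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Vocabulary: each distinct token gets one column index, in order of first use
vocab = {}
for doc in docs:
    for token in tokenize(doc):
        vocab.setdefault(token, len(vocab))

# Sparse rows: one dict per document, storing only non-zero counts
# as {column_index: count}
rows = []
for doc in docs:
    counts = Counter(tokenize(doc))
    rows.append({vocab[token]: count for token, count in counts.items()})

print(len(vocab), "distinct terms")
print(rows[1][vocab["the"]], "occurrences of 'the' in the second document")
```

Each row holds only the terms that actually occur in that document, which is what keeps the matrix small when the vocabulary grows to tens of thousands of terms.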

#### Dataset

Credits: several27 renatosc

The following [dataset](https://github.com/several27/FakeNewsCorpus)
will be used. This is an open-source dataset comprising millions
of news articles with labels ranging from fake, religious, satire, ... to reliable.

Note: the dataset is over 3.5 GB in size.

Label definitions:
1. fake: Sources that entirely fabricate information, disseminate deceptive
content, or grossly distort actual news reports.
2. reliable: Sources that circulate news and information in a manner consistent with traditional and ethical practices in journalism (Remember: even credible sources sometimes rely on clickbait-style headlines or occasionally make mistakes. No news organization is perfect, which is why a healthy news diet consists of multiple sources of information).

#### Loading data

Article content is the feature used to predict the news type (fake or
reliable). Titles are not loaded in this notebook: only the `type` and
`content` columns are read.

Reading the whole file into memory at once is impractical given its size.
Instead, the file is read through an iterator, one chunk at a time. The
chosen chunk size is 5000 rows, which worked best with the available hardware.

For each chunk, only the content and type columns are extracted. The rows
are then filtered to keep those labeled fake or reliable. All filtered
chunks are concatenated into one dataset, which is saved for later use.

In [7]:
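import csv

import pandas as pd

# Raise the csv field size limit so that long article bodies do not trigger
# a field-size error (the python parser engine used below relies on the csv module)
csv.field_size_limit(2000000)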

# Data path for all saved data
data_path = 'D:\\PycharmProjects\\springboard\\data\\'
file_name = 'news_cleaned_2018_02_13.csv'

# Chunk size
chunksize = 5000

# Only extract content and type columns
use_cols = ['type', 'content']
use_rows = ['fake', 'reliable']

# Read data in chunk
data_iterator = pd.read_csv(f'{data_path}{file_name}',
                    usecols=use_cols,
                    chunksize=chunksize,
                    encoding='ISO-8859-1',
                    index_col=False,
                    engine='python')

# Read each chunk and extract desired features and rows
chunk_list = []
for chunk in data_iterator:
    filtered_chunk = chunk[chunk.type.isin(use_rows)]
    chunk_list.append(filtered_chunk)

# Join the filtered chunks and save the result for later use
filtered_data = pd.concat(chunk_list)
filtered_data.to_csv(f'{data_path}news_clean_1.csv')
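
The same chunk-and-filter logic can be checked without the full file. The sketch below runs it on a tiny in-memory CSV. The rows are made-up placeholders, not taken from the dataset, and the `id` column is an assumption added only to show that `usecols` drops unwanted columns.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real CSV (made-up placeholder rows)
sample_csv = io.StringIO(
    "id,type,content\n"
    "1,fake,first article\n"
    "2,satire,second article\n"
    "3,reliable,third article\n"
    "4,bias,fourth article\n"
    "5,fake,fifth article\n"
)

use_cols = ['type', 'content']
use_rows = ['fake', 'reliable']

# chunksize=2 forces several chunks so the loop is actually exercised
chunks = pd.read_csv(sample_csv, usecols=use_cols, chunksize=2)

# Keep only the rows whose type is in use_rows, chunk by chunk
kept = [chunk[chunk['type'].isin(use_rows)] for chunk in chunks]
demo = pd.concat(kept, ignore_index=True)

print(demo)
```

Only the three rows labeled fake or reliable survive, and only the `type` and `content` columns are kept. Passing `ignore_index=True` renumbers the rows from zero, which the cell above does not do.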
