### Read in the data file

Note that this is a very large data set (over 3.5 GB). However, we will analyze
only the fake and reliable news articles from it.

First, we will work with a small chunk of the data (the first 1,000 rows).

In [87]:
import pandas as pd
import numpy as np
from datetime import datetime
import re
import string
import unicodedata

import nltk
import contractions
import inflect
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer


In [88]:
# Data path for all saved data
data_path = 'D:\\PycharmProjects\\springboard\\data\\'
file_name = 'news_cleaned_2018_02_13.csv'

# Chunk size
nrow = 1000

# Load data
chunk = pd.read_csv(f'{data_path}{file_name}',
                    nrows=nrow,
                    encoding='ISO-8859-1',
                    index_col=False)

In [89]:
# Retain these columns and rows only
columns = ['type', 'content', 'title', 'authors']
rows = ['fake', 'reliable']

# Filter chunk to get df
df = chunk[columns]
df = df[df.type.isin(rows)]

# Head of df
df.head()

Unnamed: 0,type,content,title,authors
27,fake,Headline: Bitcoin & Blockchain Searches Exceed...,Surprise: Socialist Hotbed Of Venezuela Has Lo...,The Pirate'S Cove
28,fake,Water Cooler 1/25/18 Open Thread; Fake News ? ...,Water Cooler 1/25/18 Open Thread; Fake News ? ...,
29,fake,Veteran Commentator Calls Out the Growing âE...,Veteran Commentator Calls Out the Growing âE...,
30,fake,"Lost Words, Hidden Words, Otters, Banks and Bo...","Lost Words, Hidden Words, Otters, Banks and Books",Jackie Morris Artist
31,fake,Red Alert: Bond Yields Are SCREAMING âInflat...,Red Alert: Bond Yields Are SCREAMING âInflat...,Phoenix Capital Research


In [90]:
df.type.value_counts()

fake        124
reliable     58
Name: type, dtype: int64
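To apply the same filter to the full file without loading all of it into memory, one option is to read it in chunks. The following is a minimal sketch, not the notebook's actual loading code: the in-memory CSV and its rows are made up for illustration, and `chunksize=2` is only small so the toy input spans several chunks.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file (rows are invented)
csv_text = """type,content,title,authors
fake,a,t1,x
bias,b,t2,y
reliable,c,t3,z
fake,d,t4,
satire,e,t5,w
"""

columns = ['type', 'content', 'title', 'authors']
rows = ['fake', 'reliable']

# Read a few rows at a time and keep only the wanted article types
filtered = []
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=columns, chunksize=2):
    filtered.append(chunk[chunk['type'].isin(rows)])

# Combine the filtered pieces into one DataFrame
df_all = pd.concat(filtered, ignore_index=True)
print(df_all['type'].value_counts())
```

For the real file, the `io.StringIO(...)` argument would be replaced by the file path and `chunksize` by something much larger.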

### Pipeline

Goal: Transform the content and title columns into numeric vectors.

Cleaning
1. De-noise
2. Tokens
3. Stop words
4. Lexicon Normalization - stemming and lemmatization

Feature Engineering
1. Bag of Words
2. TF-IDF
3. Word Embedding

Further Improvement
1. POS tagging
2. Sentiment analysis
3. Using word2vec somehow!

Deep Learning
1. Layers config
2. Loss config
3. Optimizer config
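The first two feature engineering steps in the outline above can be sketched with the standard library alone. This is an illustration on made-up token lists, using the plain `tf * log(N / df)` weighting; libraries such as scikit-learn use smoothed variants by default, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical tokenized documents (not taken from the data set)
docs = [
    ['bitcoin', 'blockchain', 'stocks'],
    ['venezuela', 'economy', 'inflation'],
    ['bitcoin', 'inflation', 'inflation'],
]

# Bag of words: raw term counts per document
bags = [Counter(doc) for doc in docs]

# Document frequency: number of documents containing each term
df_counts = Counter(term for doc in docs for term in set(doc))
n_docs = len(docs)

def tf_idf(bag):
    """Weight each term by its relative frequency times log(N / df)."""
    total = sum(bag.values())
    return {term: (count / total) * math.log(n_docs / df_counts[term])
            for term, count in bag.items()}

weights = [tf_idf(bag) for bag in bags]
print(bags[2])
print(weights[0])
```

Terms that appear in fewer documents (such as `stocks` here) receive a higher weight than terms shared across documents (such as `bitcoin`).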

Let's work with the very first fake news article in the df.

### Cleaning

In [91]:
# Get the article from content series
article = df.content.iloc[0]

# This article has multiple headings inside it.
# Remove text enclosed in square brackets
def remove_between_square_brackets(text):
    """Remove anything between square brackets, including the brackets"""
    return re.sub(r'\[[^]]*\]', '', text)

# Remove http links
def remove_links(text):
    """Remove http and https links in the text"""
    return re.sub(r'https?\S+', '', text)

# Replace contraction at this point will save us quite a bit of time later on
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)

# De-noise the text
def denoise_text(text):
    text = remove_between_square_brackets(text)
    text = remove_links(text)
    text = replace_contractions(text)
    return text

article = denoise_text(article)
print(article)

Headline: Bitcoin & Blockchain Searches Exceed Trump! Blockchain Stocks Are Next!

Quite frankly, Iâm surprised it has half left. This is a country that cannot even produce toilet paper and beer. And theyâre stealing zoo animals for food. Here we have a Progressive/Marxist/Socialist station (CNN) telling us about how bad things are in a PMS nation

Half the Venezuelan economy has disappeared

Itâs getting worse. Unemployment will reach 30% and prices on all types of goods in the country will rise 13,000% this year, according to new figures published Thursday by the International Monetary Fund.

The IMFâs economist for the Western Hemisphere, Alejandro Werner, put Venezuelaâs future bluntly.

âIn Venezuela, the crisis continues,â Werner said in a blog post. He added that inflation is skyrocketing this year because of âthe loss of confidence in the nationâs currency.â

This year will mark the third consecutive year of double-digit contractions in Venezuelaâs gross d
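
As a quick check of what the two regular expressions in the de-noising step do, here is a minimal sketch on a made-up sentence. The sample string and URL are illustrative and not from the data set, and the contraction step is left out so that the snippet needs only the standard library.

```python
import re

def remove_between_square_brackets(text):
    """Remove anything between square brackets, including the brackets."""
    return re.sub(r'\[[^]]*\]', '', text)

def remove_links(text):
    """Remove http and https links."""
    return re.sub(r'https?\S+', '', text)

# Hypothetical input sentence
sample = "Read more [source] at https://example.com/story today."

cleaned = remove_links(remove_between_square_brackets(sample))
print(repr(cleaned))

# The removals leave doubled spaces; splitting and re-joining collapses them
collapsed = ' '.join(cleaned.split())
print(collapsed)
```

Leftover whitespace is harmless here because the tokenizer that runs next ignores it.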

Now we tokenize the text and clean it again.

### Tokenize

In [92]:
words = nltk.word_tokenize(article)

In [93]:
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)
    return new_words

def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words

def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r'[^\w\s]', '', word)
        if new_word != '':
            new_words.append(new_word)
    return new_words

def replace_numbers(words):
    """Replace all integer occurrences in list of tokenized words with textual representation"""
    p = inflect.engine()
    new_words = []
    for word in words:
        if word.isdigit():
            new_word = p.number_to_words(word)
            new_words.append(new_word)
        else:
            new_words.append(word)
    return new_words

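def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    # Build the stop word set once; calling stopwords.words() for every token is slow
    stop_words = set(stopwords.words('english'))
    new_words = []
    for word in words:
        if word not in stop_words:
            new_words.append(word)
    return new_words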

def stem_words(words):
    """Stem words in list of tokenized words"""
    stemmer = LancasterStemmer()
    stems = []
    for word in words:
        stem = stemmer.stem(word)
        stems.append(stem)
    return stems

def lemmatize_verbs(words):
    """Lemmatize verbs in list of tokenized words"""
    lemmatizer = WordNetLemmatizer()
    lemmas = []
    for word in words:
        lemma = lemmatizer.lemmatize(word, pos='v')
        lemmas.append(lemma)
    return lemmas

def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = replace_numbers(words)
    words = remove_stopwords(words)
    return words

words = normalize(words)

### Stemming and Lemmatization

Is this step necessary, and how do we compare the two? One option is to make two datasets (one stemmed, one lemmatized) and run each through the network.

In [94]:
def stem_and_lemmatize(words):
    stems = stem_words(words)
    lemmas = lemmatize_verbs(words)
    return stems, lemmas

In [95]:
# Clean and tokenize the content column
df.content = df.content.map(denoise_text)
df.content = df.content.map(nltk.word_tokenize)
df.content = df.content.map(normalize)

df.head()

Unnamed: 0,type,content,title,authors
27,fake,"[headline, bitcoin, blockchain, searches, exce...",Surprise: Socialist Hotbed Of Venezuela Has Lo...,The Pirate'S Cove
28,fake,"[water, cooler, twelve thousand, five hundred ...",Water Cooler 1/25/18 Open Thread; Fake News ? ...,
29,fake,"[veteran, commentator, calls, growing, aethnon...",Veteran Commentator Calls Out the Growing âE...,
30,fake,"[lost, words, hidden, words, otters, banks, bo...","Lost Words, Hidden Words, Otters, Banks and Books",Jackie Morris Artist
31,fake,"[red, alert, bond, yields, screaming, ainflati...",Red Alert: Bond Yields Are SCREAMING âInflat...,Phoenix Capital Research
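
The stray leading `a` in tokens such as `ainflati...` looks like an encoding side effect rather than part of the text. The sketch below reproduces it under one assumption that the source does not confirm: that the file is stored as UTF-8 but read with `encoding='ISO-8859-1'`. The helper `to_ascii` is a hypothetical name that mirrors the NFKD and ASCII-ignore step inside `remove_non_ascii`.

```python
import unicodedata

def to_ascii(text):
    """Apply the same NFKD + ASCII-ignore step used in remove_non_ascii."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

original = '\u201cInflation'          # left curly quote followed by the word
raw_bytes = original.encode('utf-8')  # assumption: the file stores UTF-8 bytes

as_latin1 = raw_bytes.decode('ISO-8859-1')  # what encoding='ISO-8859-1' yields
as_utf8 = raw_bytes.decode('utf-8')         # what a UTF-8 read would yield

token_latin1 = to_ascii(as_latin1).lower()
token_utf8 = to_ascii(as_utf8).lower()
print(token_latin1)  # ainflation
print(token_utf8)    # inflation
```

Read as ISO-8859-1, the first byte of the curly quote becomes `â`, which NFKD splits into `a` plus a combining accent, and only the `a` survives the ASCII step. If the assumption holds, reading the file with `encoding='utf-8'` may be worth trying.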
