# Introduction To NLP using TensorFlow

Building a model on text data is one of the harder tasks in machine learning and deep learning. Texts are just strings, while deep learning models expect a numeric representation. The most common text representations are:

1. Bag of Words Model
2. Word-Embeddings (Context Based)
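
To make the first option concrete, here is a minimal bag-of-words sketch using only the standard library. The helper name `bag_of_words` and the two sample sentences are made up for illustration and are not part of the case study below.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenised documents."""
    # A sorted vocabulary gives every word a stable column index
    vocab = sorted({word for doc in docs for word in doc.split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical toy documents
docs = ["fake news spreads fast", "real news spreads slowly"]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each document becomes a vector of word counts, so word order is lost. Word embeddings, used in the rest of this notebook, instead map every word to a dense vector.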

## Case Study: Fake News Data

I have picked a fake-news dataset from Kaggle: https://www.kaggle.com/mrisdal/fake-news

The dataset contains text and metadata from 244 websites and represents 12,999 posts in total from the past 30 days. The data was pulled using the webhose.io API; because it's coming from their crawler, not all websites identified by the BS Detector are present in this dataset. Each website was labeled according to the BS Detector as documented here. Data sources that were missing a label were simply assigned a label of "bs". There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read.


### Imports

Let's start with our imports. We use TensorFlow to build the deep learning model, gensim for text cleaning and for loading pre-trained word embeddings, pandas for reading and analysing the data file, and NumPy for the vector arithmetic.

In [1]:
import tensorflow as tf
import pandas as pd
import gensim
import numpy as np

from gensim.utils import simple_preprocess
from gensim.models import KeyedVectors
from gensim.models.word2vec import Text8Corpus


### Loading Data

Let's start by loading the data and taking a first look at it:

In [2]:
fake_news = pd.read_csv("fake.csv")
fake_news.head()

Unnamed: 0,uuid,ord_in_thread,author,published,title,text,language,crawled,site_url,country,domain_rank,thread_title,spam_score,main_img_url,replies_count,participants_count,likes,comments,shares,type
0,6a175f46bcd24d39b3e962ad0f29936721db70db,0,Barracuda Brigade,2016-10-26T21:41:00.000+03:00,Muslims BUSTED: They Stole Millions In Gov’t B...,Print They should pay all the back all the mon...,english,2016-10-27T01:49:27.168+03:00,100percentfedup.com,US,25689.0,Muslims BUSTED: They Stole Millions In Gov’t B...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
1,2bdc29d12605ef9cf3f09f9875040a7113be5d5b,0,reasoning with facts,2016-10-29T08:47:11.259+03:00,Re: Why Did Attorney General Loretta Lynch Ple...,Why Did Attorney General Loretta Lynch Plead T...,english,2016-10-29T08:47:11.259+03:00,100percentfedup.com,US,25689.0,Re: Why Did Attorney General Loretta Lynch Ple...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
2,c70e149fdd53de5e61c29281100b9de0ed268bc3,0,Barracuda Brigade,2016-10-31T01:41:49.479+02:00,BREAKING: Weiner Cooperating With FBI On Hilla...,Red State : \nFox News Sunday reported this mo...,english,2016-10-31T01:41:49.479+02:00,100percentfedup.com,US,25689.0,BREAKING: Weiner Cooperating With FBI On Hilla...,0.0,http://bb4sp.com/wp-content/uploads/2016/10/Fu...,0,1,0,0,0,bias
3,7cf7c15731ac2a116dd7f629bd57ea468ed70284,0,Fed Up,2016-11-01T05:22:00.000+02:00,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,Email Kayla Mueller was a prisoner and torture...,english,2016-11-01T15:46:26.304+02:00,100percentfedup.com,US,25689.0,PIN DROP SPEECH BY FATHER OF DAUGHTER Kidnappe...,0.068,http://100percentfedup.com/wp-content/uploads/...,0,0,0,0,0,bias
4,0206b54719c7e241ffe0ad4315b808290dbe6c0f,0,Fed Up,2016-11-01T21:56:00.000+02:00,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,Email HEALTHCARE REFORM TO MAKE AMERICA GREAT ...,english,2016-11-01T23:59:42.266+02:00,100percentfedup.com,US,25689.0,FANTASTIC! TRUMP'S 7 POINT PLAN To Reform Heal...,0.865,http://100percentfedup.com/wp-content/uploads/...,0,0,0,0,0,bias


In [3]:
fake_news = fake_news[["text", "spam_score"]]

We keep only the text and spam_score columns, then remove rows with very little or no text at all (500 characters or fewer):

In [4]:
fake_news = fake_news[fake_news.text.str.len() > 500]
fake_news.head()
len(fake_news)

11114

### Shuffling and Reducing the Dataset

In [5]:
fake_news = fake_news.sample(frac=1)
fake_news = fake_news[:5000]

fake_news.head()

Unnamed: 0,text,spam_score
12137,"Posted on October 31, 2016 by Michael Collins ...",0.0
12553,"In a medium stock pot, heat the coconut oil fo...",0.0
3190,DCG | 7 Comments \nBut it’s perfectly accept...,0.0
4748,"\nIn November 2014, a horrific gang rape alleg...",0.035
3638,Waking Times – by Nathaniel Mauka \nCongress o...,0.0


### Text Cleaning

Let's clean the text before converting it into a numerical representation:

In [6]:
def clean_text(text):
    """Preprocess the text"""
    return " ".join(simple_preprocess(text, deacc=True, max_len=50, min_len=1))


In [7]:
fake_news["text"] = fake_news["text"].apply(clean_text)
fake_news.head()

Unnamed: 0,text,spam_score
12137,posted on october by michael collins fbi direc...,0.0
12553,in a medium stock pot heat the coconut oil for...,0.0
3190,dcg comments but it s perfectly acceptable whe...,0.0
4748,in november a horrific gang rape allegedly too...,0.035
3638,waking times by nathaniel mauka congress overw...,0.0
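
To see roughly what the cleaning step does without installing gensim, the sketch below approximates it with the standard library. `simple_clean` is a hypothetical stand-in, not gensim's implementation, and may differ from simple_preprocess on edge cases.

```python
import re
import unicodedata

def simple_clean(text, min_len=1, max_len=50):
    """Rough stand-in for clean_text: lowercase, strip accents, keep alphabetic tokens."""
    text = unicodedata.normalize("NFD", text.lower())
    # Drop combining marks (the accents split off by NFD normalisation)
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Keep runs of word characters that contain no digits
    tokens = re.findall(r"(?:(?!\d)\w)+", text)
    return " ".join(t for t in tokens if min_len <= len(t) <= max_len)

print(simple_clean("Posted on October 31, 2016 by Michael Collins"))
print(simple_clean("But it’s perfectly acceptable, café!"))
```

Numbers and punctuation disappear, which is why "it’s" turns into "it s" in the cleaned output above.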


### Convert text to Vectors

Google has provided pre-trained word vectors that were trained on the GoogleNews dataset. Each word maps to a 300-dimensional vector, and we will use these vectors to convert our text to vectors.

![title](word_2_vec_example.png)

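You can download the GoogleNews vectors with the following shell commands. brew is the Homebrew package manager, so skip the first command if wget is already installed. The archive has to be decompressed before the next cell can load the .bin file:
brew install wget

wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

gunzip GoogleNews-vectors-negative300.bin.gz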

In [8]:
word_vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

In [9]:
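def text_to_vec(text):
    """Sum the word vectors of every in-vocabulary word in the text."""
    # `word in word_vectors` works across gensim versions (`.vocab` was removed in gensim 4)
    vectors = [word_vectors[word] for word in text.split() if word in word_vectors]
    if not vectors:
        # No known words: fall back to a 300-dimensional zero vector
        return pd.Series(np.zeros(300))

    return pd.Series(np.sum(vectors, axis=0))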

In [10]:
vec_cols = [str(val) for val in range(300)]
for v_c in vec_cols:
    fake_news[v_c] = 0
fake_news[vec_cols] = fake_news.text.apply(text_to_vec)

In [11]:
fake_news.head()

Unnamed: 0,text,spam_score,0,1,2,3,4,5,6,7,...,290,291,292,293,294,295,296,297,298,299
12137,posted on october by michael collins fbi direc...,0.0,23.979471,41.08075,39.793293,63.547409,-79.524353,-36.17207,32.098293,-71.761749,...,-72.831383,6.583115,-39.315556,32.650078,-38.071568,-17.666512,20.569752,-77.732941,20.854851,-8.143642
12553,in a medium stock pot heat the coconut oil for...,0.0,0.138863,6.734994,4.418945,8.55011,-4.185699,-2.443207,3.410614,-11.071533,...,-3.584503,-0.244019,-8.380287,3.024475,1.07309,-4.251247,-0.5802,-2.402649,4.597954,-1.368103
3190,dcg comments but it s perfectly acceptable whe...,0.0,0.771248,11.038279,14.744694,30.432737,-12.249817,-12.213135,13.83741,-20.676102,...,-16.46233,8.284546,-27.597057,17.28891,-7.918175,3.98605,10.741259,-4.590454,10.181051,0.838394
4748,in november a horrific gang rape allegedly too...,0.035,5.112835,16.648228,18.000587,11.637341,-22.66515,-20.700853,9.488131,-36.828888,...,-26.671478,3.887695,-36.037209,14.009432,-13.020554,7.855839,-1.228943,-27.397923,13.862194,9.280304
3638,waking times by nathaniel mauka congress overw...,0.0,14.188129,22.273739,24.253025,49.824341,-52.462204,-14.53571,9.924065,-27.772154,...,-44.106712,7.897926,-28.674641,18.110649,-22.668493,5.363058,2.367294,-31.841084,19.144749,-10.600079
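
Because text_to_vec sums the word vectors, longer texts tend to produce larger values, which is probably part of why the rows above differ so much in magnitude. A common alternative is to average instead. The sketch below illustrates this with made-up 3-dimensional embeddings; `toy_vectors` and `text_to_mean_vec` are hypothetical names and not part of the notebook.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional GoogleNews vectors
toy_vectors = {
    "fake": np.array([1.0, 0.0, 2.0]),
    "news": np.array([0.0, 1.0, 1.0]),
}

def text_to_mean_vec(text, vectors, dim=3):
    """Average the vectors of known words; return zeros if no word is known."""
    known = [vectors[word] for word in text.split() if word in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

short_doc = text_to_mean_vec("fake news", toy_vectors)
long_doc = text_to_mean_vec("fake news fake news fake news", toy_vectors)
print(short_doc, long_doc)
```

Both documents map to the same vector here, so the representation no longer depends on text length. Whether that helps for this dataset would need to be checked experimentally.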


### Defining and Training the Neural Network

Let's define a fully connected network with three hidden layers and a single sigmoid output to solve this problem:

In [12]:
model = tf.keras.models.Sequential([tf.keras.layers.Dense(500, activation=tf.nn.tanh),
                                    tf.keras.layers.Dense(500, activation=tf.nn.tanh),
                                    tf.keras.layers.Dense(250, activation=tf.nn.tanh),
                                    tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)])
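
As a quick sanity check on model size, the number of trainable parameters in a stack of dense layers can be worked out by hand. The helper `dense_param_count` below is a hypothetical illustration and assumes 300 input features, one per embedding dimension.

```python
def dense_param_count(input_dim, layer_units):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # weight matrix + bias vector
        input_dim = units
    return total

# 300 input features, then the layer sizes used in the model above
total = dense_param_count(300, [500, 500, 250, 1])
print(total)
```

That is roughly half a million parameters for a few thousand training rows, so overfitting is worth keeping in mind.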

In [13]:
model.compile(optimizer="sgd", loss="mean_squared_error", metrics=['accuracy'])

In [14]:
model.fit(fake_news[:4000][vec_cols].values, fake_news[:4000]["spam_score"].values, epochs=5)

Train on 4000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7fa3c1ff9550>

### Evaluating your Model

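We trained on the first 4,000 rows, so the remaining 1,000 serve as a test set. We call model.evaluate with the test vectors and their spam_score targets; it reports the loss (mean squared error) followed by the accuracy metric set in compile.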

In [15]:
model.evaluate(fake_news[4000:][vec_cols].values, fake_news[4000:]["spam_score"].values)



[0.013673257540911437, 0.881]
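
Since spam_score is a continuous value, the accuracy figure is hard to interpret on its own. One way to put the loss in context is to compare it with a trivial baseline that always predicts the mean target. The sketch below uses made-up numbers, not values from the run above.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical targets and predictions, not taken from the notebook run
y_true = [0.0, 0.0, 0.0, 0.5, 1.0]
y_pred = [0.1, 0.0, 0.2, 0.4, 0.7]

# Baseline: always predict the mean of the targets
mean_target = sum(y_true) / len(y_true)
baseline_mse = mse(y_true, [mean_target] * len(y_true))
model_mse = mse(y_true, y_pred)
print(model_mse, baseline_mse)
```

A model is only useful if its error is clearly below this baseline. The same comparison could be run on the real test rows by passing the spam_score values and the output of model.predict.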