# Natural Language Processing: Analyzing Text for Fake or Real News Prediction

Natural language processing (NLP) is the ability of a computer program to understand human language as it is spoken or written. NLP is a component of artificial intelligence (AI).

Much of the applied work on NLP revolves around search, especially enterprise search. Here users query a data set by asking a question as they might pose it to another person. The machine interprets the important elements of the sentence, such as those that correspond to specific features in the data set, and returns an answer. NLP can also be used to interpret free text and make it analyzable; a tremendous amount of information is stored in free-text files, such as patients' medical records. Sentiment analysis is another primary use case: data scientists can assess comments on social media to see how their business's brand is performing.

This project demonstrates the use of NLP libraries in Python to clean text, build NLP models, and carry out predictive analytics. The dataset consists of text extracts from news articles, each labeled FAKE or REAL. It is available at: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv

Begin by importing the libraries, loading the dataset, and previewing it.

In [1]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Importing the dataset
dataset = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/blog_assets/fake_or_real_news.csv")
dataset.head(5)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [3]:
dataset.shape

(6335, 4)

The dataset has 6335 rows and 4 columns. We use the `text` column as the feature for analysis and the `label` column as the target.

## Text Cleaning

Let's move on to some text cleaning. This is done using the `re` and `nltk` libraries in Python.

In [5]:
import re
import nltk
# nltk.download('stopwords')  # uncomment on first run to download the stopwords package
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

We need to download the `stopwords` package once and import it. The stopword list contains common words (such as "the" or "is") that carry little information for text classification. We also apply stemming, which reduces each word to its root form, using the `PorterStemmer` class; this shrinks the vocabulary and simplifies the model.

We collect the cleaned texts in the list `corpus`, using the `for` loop below.
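To make the effect of stemming concrete, here is a minimal sketch that applies `PorterStemmer` to a few words. The sample words are illustrative and are not taken from the dataset.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Illustrative words (an assumption for this sketch, not from the dataset)
sample_words = ["running", "runs", "elections", "election"]

# Different inflections of the same word collapse to one stem
stems = [ps.stem(word) for word in sample_words]

for word, stem in zip(sample_words, stems):
    print(word, "->", stem)
```

Note that a stem is not always a dictionary word; it only needs to be the same for related word forms.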

In [6]:
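ps = PorterStemmer()                           # create the stemmer once
stop_words = set(stopwords.words('english'))   # build the stopword set once
corpus = []
for text in dataset['text']:
    newsitem = re.sub('[^a-zA-Z]', ' ', text)  # replace non-letters with spaces
    newsitem = newsitem.lower().split()        # lowercase, then split into words
    # stem every word that is not a stopword
    newsitem = [ps.stem(word) for word in newsitem if word not in stop_words]
    corpus.append(' '.join(newsitem))          # rejoin the stems into one string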

We loop through every row of the dataset and use `re.sub()` to replace every character that is not a letter with a space. We then convert the text to lowercase with `lower()` and split it into a list of words with `split()`. Next, we create the stemmer object `ps` and apply its `stem()` method to each word that is not in the stopword list. Finally, the stemmed words are joined back into a single string separated by ' ' and appended to the list `corpus`.
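The regex, lowercase, split, and filter steps can be traced on a single sentence. This is only a sketch: `SAMPLE_STOPWORDS` is a small hand-picked set standing in for the NLTK list, `clean_text` is a hypothetical helper, and stemming is left out so the example needs only the standard library.

```python
import re

# Hypothetical mini stopword set, standing in for stopwords.words('english')
SAMPLE_STOPWORDS = {"the", "is", "in", "a", "to", "and", "of"}

def clean_text(text, stop_words=SAMPLE_STOPWORDS):
    """Keep letters only, lowercase, split, and drop stopwords (no stemming)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)   # digits and punctuation become spaces
    tokens = letters_only.lower().split()           # split() also collapses repeated spaces
    return ' '.join(t for t in tokens if t not in stop_words)

cleaned = clean_text("The U.S. Secretary of State is in Paris, 2016!")
print(cleaned)  # u s secretary state paris
```

The example also shows a side effect of the regex: abbreviations such as "U.S." are broken into single letters.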

## Tokenization: Creating a Matrix of Unique Words

Let's do tokenization using two methods:

### 1. Using Count Vectorizer / Bag of Words

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)

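We set `max_features=1500` so that only the 1500 most frequent terms across the corpus are kept, which keeps the feature matrix small.

Define the matrix of features, `X`, by fitting the `CountVectorizer` object to `corpus` and transforming it, then convert the sparse result to a dense array. The target `y` is the `label` column.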

In [11]:
X = cv.fit_transform(corpus).toarray() 
y = dataset.iloc[:, 3].values
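To see what the resulting matrix looks like, the following sketch runs `CountVectorizer` on a three-document toy corpus. The corpus and the names `toy_corpus`, `toy_cv`, and `toy_X` are assumptions made for illustration; `max_features=3` plays the role of the 1500 used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-cleaned toy documents
toy_corpus = [
    "kerri pari day",
    "pari primari matter",
    "primari primari day",
]

# Keep only the 3 most frequent terms across the toy corpus
toy_cv = CountVectorizer(max_features=3)
toy_X = toy_cv.fit_transform(toy_corpus).toarray()

# Column order follows the indices stored in vocabulary_
feature_names = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(feature_names)  # ['day', 'pari', 'primari']
print(toy_X)          # one row per document, one column per kept term
```

Each row counts how often each kept term occurs in that document; the rarer terms "kerri" and "matter" are dropped by `max_features`.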

Now split into training and test sets

In [12]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
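As a quick check of what this split produces, the sketch below applies the same `test_size` and `random_state` to placeholder arrays with 6335 rows. The arrays are stand-ins, not the real features or labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 6335  # number of rows in the dataset

# Placeholder data (assumption): row ids as the only feature, alternating labels
placeholder_X = np.arange(n_rows).reshape(-1, 1)
placeholder_y = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    placeholder_X, placeholder_y, test_size=0.20, random_state=0
)

print(X_tr.shape, X_te.shape)  # (5068, 1) (1267, 1)
```

So the models are trained on 5068 articles and evaluated on the remaining 1267.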

For text classification, Naive Bayes models and ensemble methods are commonly used in practice. Let's develop Naive Bayes models on these features.

### Machine Learning

In [13]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [14]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [17]:
# Evaluating accuracy on test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model1 using GNB :",accuracy)

Accuracy of model1 using GNB : 0.794790844515
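The accuracy above is the fraction of test articles classified correctly. It does not show which class the errors fall in, so a confusion matrix can be a useful companion. The sketch below uses six made-up labels, not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels, for illustration only
toy_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
toy_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]

toy_accuracy = accuracy_score(toy_true, toy_pred)  # 4 of 6 correct

# Rows are true classes, columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(toy_true, toy_pred, labels=["FAKE", "REAL"])

print(toy_accuracy)
print(toy_cm)
```

On the real data, the same two calls with `y_test` and `y_pred` would give the equivalent breakdown.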


Let's try another Naive Bayes model, MultinomialNB, which should be a better choice for word-count features.

In [18]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB(alpha=0.1)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Evaluating accuracy on test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model1 using MNB :",accuracy)

Accuracy of model1 using MNB : 0.83820047356


Observe that MultinomialNB improves accuracy by about 4 percentage points, from roughly 79.5% to 83.8%.

### 2. Using Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer

In [20]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
X = tfidf.fit_transform(corpus).toarray() 
y = dataset.iloc[:, 3].values
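The following sketch shows two properties of `TfidfVectorizer` on a toy corpus: `max_df=0.7` drops terms that occur in more than 70% of the documents, and each row is scaled to unit length by default. The toy corpus and the `toy_` names are assumptions; `stop_words` is omitted here to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned documents; "news" occurs in every one of them
toy_corpus = [
    "news kerri pari",
    "news primari day",
    "news pari primari",
]

toy_tfidf = TfidfVectorizer(max_df=0.7)
toy_X = toy_tfidf.fit_transform(toy_corpus).toarray()

# "news" appears in 100% of documents, above the 70% cutoff, so it is removed
print(sorted(toy_tfidf.vocabulary_))

# With the default norm='l2', every row has Euclidean length 1
row_norms = np.linalg.norm(toy_X, axis=1)
print(row_norms)

# In the first document, the rarer term "kerri" is weighted above "pari"
kerri_col = toy_tfidf.vocabulary_["kerri"]
pari_col = toy_tfidf.vocabulary_["pari"]
print(toy_X[0, kerri_col] > toy_X[0, pari_col])
```

Unlike raw counts, these weights favor terms that are frequent in a document but rare across the corpus.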

Split into training and test sets and move to machine learning

In [21]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)

### Machine Learning

In [22]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)

GaussianNB(priors=None)

In [23]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Evaluating accuracy on test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model2 using GNB :",accuracy)

Accuracy of model2 using GNB : 0.792423046567


In [24]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB(alpha=0.1)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Evaluating accuracy on test set
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of model2 using MNB :",accuracy)

Accuracy of model2 using MNB : 0.888713496448


## Concluding Remarks

1. For the fake-or-real news data, we built predictive models on features from both `CountVectorizer` and `TfidfVectorizer`.
2. Multinomial Naive Bayes gave the highest accuracy with both vectorizers: about 83.8% on the count features and 88.9% on the TF-IDF features, against roughly 79% for Gaussian Naive Bayes. This is consistent with the models' assumptions: Multinomial NB models the features of each class with a multinomial distribution, which suits data that can easily be turned into counts, such as word counts in text, whereas Gaussian NB assumes each feature is normally distributed.
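One possible way to package the best-performing combination is a scikit-learn `Pipeline`, which keeps vectorization and classification in a single object. This is a sketch rather than the notebook's method: the toy texts and labels are made up, and on the real data the inputs would be `corpus` and the `label` column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical cleaned texts and labels, for illustration only
toy_texts = [
    "shock secret exposed",
    "secret plot shock",
    "senate vote budget",
    "budget senate hearing",
]
toy_labels = ["FAKE", "FAKE", "REAL", "REAL"]

# TF-IDF features feeding a Multinomial Naive Bayes classifier (alpha as above)
toy_model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB(alpha=0.1)),
])
toy_model.fit(toy_texts, toy_labels)

# New texts are vectorized with the vocabulary learned during fit
toy_predictions = toy_model.predict(["secret shock", "senate budget"])
print(list(toy_predictions))  # ['FAKE', 'REAL']
```

A pipeline also makes it harder to fit the vectorizer on test data by accident, since `fit` is only called on the training texts.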