# Fake News 

__Diondra Stubbs__

__Assignment 12__

__2022 December 1__

## Data

The data includes variables:

__‘text’__: contents of an article

__‘label’__: whether it is real or fake news

__‘title’__: title of the article

## Objective

The task is to perform __feature extraction__ on the provided data to answer the following questions:

1. Is the text or the title of an article more predictive of whether it is real or fake?
1. Are titles of real or fake news more similar to one another?

In [2]:
# imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.feature_extraction.text import CountVectorizer

## Reading in the Data

I'm using the pandas library to read in the data.

In [5]:
news_url = 'https://raw.githubusercontent.com/rhodyprog4ds/12-fake-news-stubbsdiondra/main/fake_or_real_news.csv?token=GHSAT0AAAAAAB3L5S7D6QRBKDNJ5M2YCZWYY4SEZ3A'
news_data = pd.read_csv(news_url)

In [6]:
news_data.head()

| | id | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


## Representing Text Data

Natural Language Processing (NLP) is a branch of AI that works with human language so that a system can understand and respond to it. Data should be represented in a way that makes it easy to understand and model.

__Feature Extraction__ is the process of turning unstructured data into a structured (tabular) format. Here, that means converting text into numbers. Machines can only work with numbers, so to make a machine identify language or text, we need to change its representation to a numeric form.

The goal of changing the representation of this data is so that we can analyze it and perform prediction modeling.

### Is the text or the title of an article more predictive of whether it is real or fake? 

To go about this, I'm going to fit two models to predict the label: one with text and one with title. The label tells us whether the article is real or fake news. X is going to be the text or the title, and Y is going to be the label, FAKE or REAL.

#### Text Data

The information in the __text__ column is the actual content of the article. Here we are defining that as __X__. The __label__ tells us whether the article is REAL or FAKE news, and we're going to define it as __Y__.

For now, we want to take the __text__ of an article and predict whether the news is real or fake.

In [16]:
news_data.columns

Index(['id', 'title', 'text', 'label'], dtype='object')

In [17]:
news_text_X = news_data['text']
news_text_y = news_data['label']

In [18]:
news_text_y[:5]

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [19]:
news_text_X[0]



### Vectorization

**Count Vectorizer** is a way to convert a given set of strings into a frequency representation.

To make our text data appropriate for modeling, I need to instantiate a vectorizer. I want to fit it to the whole dataset, then do a train/test split and transform each split into the new, vectorized representation.

**Vectorization** is simply converting the text into numbers, in vector form.

In [20]:
count_vec = text.CountVectorizer()
news_vec = count_vec.fit_transform(news_text_X)

In [21]:
news_X_train, news_X_test, news_y_train, news_y_test = train_test_split(
                                        news_text_X, news_text_y, random_state=0)

In [22]:
news_vec_train = count_vec.transform(news_X_train)
news_vec_test = count_vec.transform(news_X_test)

The __multinomial Naive Bayes (MultinomialNB)__ classifier is suitable for classification with discrete features (e.g., word counts for text classification). I'm going to use it to tell me how well fake news can be distinguished from real news based on the word counts.

In [23]:
clf = MultinomialNB()

In [24]:
clf.fit(news_vec_train,news_y_train).score(news_vec_test,news_y_test)

0.8800505050505051

The validation set score is about 88%, which is pretty good. From word counts alone, we're able to distinguish between fake news and real news articles fairly well.

Let's try with the **title**.

#### Title Data

The information in the title column is the title of each article. Here we are defining that as X. Once again, the label tells us whether the article is REAL or FAKE news, and we're going to define it as Y.

I want to take the title of an article and predict whether the news is real or fake.

In [25]:
news_title_X = news_data['title']
news_title_y = news_data['label']

In [26]:
news_title_y[:5]

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

In [27]:
news_title_X[0]

'You Can Smell Hillary’s Fear'

Let's vectorize again.

In [28]:
count_vec = text.CountVectorizer()
new_vec = count_vec.fit_transform(news_title_X)

In [29]:
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(
                                        news_title_X, news_title_y, random_state=0)

In [30]:
new_vec_train = count_vec.transform(new_X_train)
new_vec_test = count_vec.transform(new_X_test)

In [31]:
clf = MultinomialNB()

In [32]:
clf.fit(new_vec_train,new_y_train).score(new_vec_test,new_y_test)

0.803030303030303

The validation set score is 80%, which is still pretty good but not as good as the model using text.

So, **Is the text or the title of an article more predictive of whether it is real or fake?** The text of an article is more predictive of whether it is real or fake, since it gives a higher validation score (88% versus 80%).
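The text and title experiments follow the same steps, so they could be wrapped in one function. The sketch below is one possible way to do that, not the approach used in the cells above: `score_counts_nb` is a hypothetical helper, and it differs in that it splits first and lets a pipeline learn the vocabulary from the training rows only. The toy sentences are invented so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def score_counts_nb(texts, labels, random_state=0):
    """Hypothetical helper: word counts + MultinomialNB accuracy on a held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, random_state=random_state)
    # The pipeline fits the vectorizer on the training rows only,
    # so words that appear only in the test rows are ignored at scoring time.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Toy data, invented for illustration (far too small for a meaningful score)
toy_texts = [
    "you will not believe this shocking secret",
    "shocking truth they hide from you",
    "secret plot exposed in shocking video",
    "you must see this secret video",
    "senate passes annual budget bill",
    "governor signs new budget bill",
    "senate committee reviews trade bill",
    "court reviews new trade ruling",
]
toy_labels = ["FAKE"] * 4 + ["REAL"] * 4

toy_score = score_counts_nb(toy_texts, toy_labels)
print("Toy accuracy:", toy_score)
```

With the notebook's data, the same helper could be called once with `news_data['text']` and once with `news_data['title']`, each time with `news_data['label']`, to compare the two columns under identical settings.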

### Are titles of real or fake news more similar to one another?

Yes. While we get a fairly good validation score when using the title to predict whether an article is real or fake, the score is not great: the model correctly classifies only about 80% of articles as real or fake.

This suggests that the titles of real and fake news are fairly similar to one another. If they were more distinct, our validation score would be higher.

In [33]:
print("Validation Score using title",clf.fit(new_vec_train,new_y_train).score(new_vec_test,new_y_test))

Validation Score using title 0.803030303030303
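A more direct way to look at similarity would be to measure distances between title vectors within each label. The sketch below is only an illustration under assumptions: `mean_pairwise_distance` is a hypothetical helper, the toy titles are invented, and Euclidean distance on raw counts is just one possible similarity measure (it is sensitive to title length). A smaller average distance would mean the titles in that group are more similar to one another.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances


def mean_pairwise_distance(count_matrix):
    """Hypothetical helper: average Euclidean distance over all distinct row pairs."""
    dists = euclidean_distances(count_matrix)
    # Upper triangle without the diagonal: each pair once, no self-distances
    upper = np.triu_indices_from(dists, k=1)
    return dists[upper].mean()


# Toy titles, invented for illustration (not from the news data)
toy_titles = [
    "senate passes budget bill",
    "senate passes trade bill",
    "house passes budget bill",
    "you will not believe this",
    "shocking secret exposed",
    "aliens endorse candidate",
]
toy_labels = np.array(["REAL"] * 3 + ["FAKE"] * 3)

title_counts = CountVectorizer().fit_transform(toy_titles)

mean_dist = {}
for label in ["REAL", "FAKE"]:
    idx = np.flatnonzero(toy_labels == label)  # row numbers for this label
    mean_dist[label] = mean_pairwise_distance(title_counts[idx])
    print(label, round(mean_dist[label], 3))
```

The toy numbers say nothing about the real dataset. Applied to the vectorized `news_data['title']` column split by `news_data['label']`, the same comparison would show which group of titles sits closer together.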


## Considering Differences

There are two different ways you can represent the data. One is **Count Vectorizer**, which converts a given set of strings into a frequency representation. The other is **Term Frequency - Inverse Document Frequency (TF-IDF)**, which looks at how often a word appears in a document and weights it by how rare the word is across the corpus, giving a measure of the word's importance. TF-IDF is often considered better than plain counts because it accounts for the importance of the words.

We want to consider the differences in how we represent the data and how that might impact our model performance.

Recall the validation score from fitting a Multinomial NB model for counts on the text data.

In [39]:
clf.fit(news_vec_train,news_y_train).score(news_vec_test,news_y_test)

0.8800505050505051

Let's represent the data using TF-IDF and print the validation score.

In [34]:
tfidf = text.TfidfVectorizer()

tfidf.fit(news_text_X)

TfidfVectorizer()

In [None]:
news_X_train, news_X_test, news_y_train, news_y_test = train_test_split(
                                        news_text_X, news_text_y, random_state=0)

In [36]:
news_tfidf_train = tfidf.transform(news_X_train)
news_tfidf_test = tfidf.transform(news_X_test)

In [38]:
clf.fit(news_tfidf_train,news_y_train).score(news_tfidf_test,news_y_test)

0.8049242424242424

The validation score is about 80%. However, a Multinomial NB model was used to score the TF-IDF representation. Multinomial NB is designed for counts, not for continuous values like TF-IDF weights, even though it will still run on them, as it did here.

So let's apply a more appropriate classifier for TF-IDF.

In [40]:
news_tfidf_train

<4751x67659 sparse matrix of type '<class 'numpy.float64'>'
	with 1611413 stored elements in Compressed Sparse Row format>

In [41]:
news_tfidf_test

<1584x67659 sparse matrix of type '<class 'numpy.float64'>'
	with 546869 stored elements in Compressed Sparse Row format>

Logistic regression, despite its name, is a classification algorithm: it takes continuous numerical features, such as TF-IDF weights, and predicts a class label. Let's try a Logistic Regression model.
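As a quick illustration of how logistic regression handles continuous inputs, the sketch below fits it on a single made-up numeric feature and asks for predicted labels and class probabilities. The toy values and `toy_`-prefixed names are assumptions for illustration, not part of the news data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy continuous feature, invented for illustration: two separated groups
toy_X = np.array([[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]])
toy_y = np.array(["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"])

toy_clf = LogisticRegression(max_iter=1000).fit(toy_X, toy_y)

new_points = np.array([[-2.5], [2.5]])
toy_pred = toy_clf.predict(new_points)         # discrete class labels
toy_proba = toy_clf.predict_proba(new_points)  # one probability per class

print(toy_clf.classes_)
print(toy_pred)
print(toy_proba)
```

The inputs are continuous numbers, but the output is a label (or a probability for each label), which is what we need for the REAL versus FAKE task.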

In [44]:
from sklearn.linear_model import LogisticRegression

In [45]:
log_reg = LogisticRegression(max_iter=1000)

In [46]:
log_reg.fit(news_tfidf_train, news_y_train)
y_pred = log_reg.predict(news_tfidf_test)

In [48]:
log_reg.score(news_tfidf_test,news_y_test)

0.9090909090909091

The validation score is now about 91%. This means that from the TF-IDF representation we are able to distinguish between fake and real news articles well.

This is better than the 88% we got from the word counts with a Multinomial NB model. By representing the data as TF-IDF and pairing it with a suitable classifier, our model performance improves.