# Analyzing, Visualizing & Classifying Fake & Real News Data with Logistic Regression

In this project, we analyze and visualize news data and then predict whether an article is fake or real. We use Logistic Regression for classification because it is a simple and widely used baseline for text classification. The data can be found [here](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset).

The data is split into two datasets, containing 44898 articles in total. The columns include:

* title: Contains the title of each article.
* text: Contains the body text of each article.
* subject: Contains the subject of each article.
* date: Contains the date each article was posted in a Month DD, YYYY format.
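
As a small sketch of what the "Month DD, YYYY" format means in practice, the snippet below parses a few hypothetical sample values (not taken from the dataset) with an explicit format string. It assumes the month is written out in full and in English.

```python
import pandas as pd

# Hypothetical sample values; the real column may contain other formats too.
sample_dates = pd.Series(["December 31, 2017", "January 5, 2016", "not a date"])

# %B = full month name, %d = day of month, %Y = four-digit year.
# errors="coerce" turns anything that does not fit the format into NaT.
parsed = pd.to_datetime(sample_dates, format="%B %d, %Y", errors="coerce")
print(parsed)
```

Values that do not follow the format become `NaT` instead of raising an error, which is convenient when a column mixes dates with other text.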

## 1. Reading the data in

We will not be using the text column for this project. In this step, we will familiarize ourselves with the remaining columns. Let's import, concatenate and explore the datasets.

In [None]:
import pandas as pd
import numpy as np
import re
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

warnings.simplefilter(action='ignore', category=Warning)
pd.options.mode.chained_assignment = None

In [None]:
fake = pd.read_csv("../input/fake-and-real-news-dataset/Fake.csv", usecols=["title", "subject", "date"]).copy()
real = pd.read_csv("../input/fake-and-real-news-dataset/True.csv", usecols=["title", "subject", "date"]).copy()

In [None]:
fake["label"] = "fake"
real["label"] = "real"
fake.head()

In [None]:
real.head()

Now that the data is labelled, we can combine the two datasets. After concatenating them, we shuffle the rows by sampling the entire dataset. We pass the seed 1 to the random number generator for reproducible results.

In [None]:
news = pd.concat([fake, real], axis=0).sample(frac=1, random_state=1).reset_index(drop=True)
news.head()
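
To illustrate why the seed matters, here is a minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration). Sampling with `frac=1` returns every row in a random order, and fixing `random_state` makes that order repeatable.

```python
import pandas as pd

# Hypothetical toy data standing in for the combined news frame.
toy = pd.DataFrame({"title": ["a", "b", "c", "d", "e"]})

# frac=1 draws every row without replacement, i.e. a full shuffle.
first = toy.sample(frac=1, random_state=1).reset_index(drop=True)
second = toy.sample(frac=1, random_state=1).reset_index(drop=True)

# Same seed, same order; the set of rows itself is unchanged.
print(first.equals(second))
print(sorted(first["title"]) == sorted(toy["title"]))
```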

In [None]:
news.label.value_counts(dropna=False)

Now that we have combined the data, we will first clean and prepare it.

## 2. Preprocessing the data

We will first take a look at the columns subject and date.

In [None]:
news.subject.value_counts(dropna=False)

As we can see, there are no missing values in this column. However, the subject names are formatted inconsistently: some are written in camel case, some in snake case, and others with hyphens or spaces. We will replace these values with consistent names to fix this problem.

In [None]:
news["subject"] = news["subject"].replace({"politicsNews": "Politics",
                                           "worldnews": "World",
                                           "politics": "Politics",
                                           "News": "All",
                                           "left-news": "Left", 
                                           "Government News": "Government",
                                           "US_News": "US",
                                           "Middle-east": "Middle East"})
news.subject.value_counts()
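
A short sketch of how `replace` behaves with a dictionary, using a hypothetical series (the value "Unmapped" is made up for illustration). Only values that match a key exactly are changed, so mapping "News" does not touch "politicsNews".

```python
import pandas as pd

# Hypothetical sample of raw subject values.
subjects = pd.Series(["politicsNews", "worldnews", "News", "Unmapped"])

mapping = {"politicsNews": "Politics", "worldnews": "World", "News": "All"}

# Whole-value matching: keys are compared against entire cell values,
# and anything without a matching key is left as it is.
cleaned = subjects.replace(mapping)
print(cleaned.tolist())
```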

We will now take a look at the dates.

In [None]:
news.date.value_counts(dropna=False)

As we can see, there are some values which are not dates. We will explore these values.

In [None]:
# Select rows whose date does not start with a "Month DD, YYYY"-style value
news[~news.date.str.match(r"\w+ \d+, \d+", na=True)]

Here are some observations from this exploration:

* The articles with incorrectly formatted dates or non-date values in the date column are all labelled fake.
* They are mostly about politics.

We will replace these values with null values.

In [None]:
# Set every value that does not start with a "Month DD, YYYY"-style date to NaN
news.loc[~news.date.str.match(r"\w+ \d+, \d+", na=True), "date"] = np.nan

In [None]:
news[news.date.isnull()].shape[0]
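
To make the date check more concrete, the sketch below applies the same pattern, `\w+ \d+, \d+`, to a few hypothetical values (none are taken from the dataset). It shows which kinds of strings the pattern accepts and which it flags.

```python
import pandas as pd

# Hypothetical values: two date-like strings in the expected layout,
# one date in a different layout, and one URL.
dates = pd.Series([
    "December 31, 2017",
    "Dec 31, 2017",
    "19-Feb-18",
    "https://example.com/story",
])

# str.match anchors the pattern at the start of each string.
looks_like_date = dates.str.match(r"\w+ \d+, \d+")
print(looks_like_date.tolist())
```

Note that a real date written in another layout, such as the hypothetical "19-Feb-18" here, is also flagged, so this check is stricter than "is this a date at all".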

Before classification, we will visualize our data to see if we're missing anything so far. We will do some further preprocessing later on.

## 3. Visualizing the data for gaining further insights

We will now visualize and analyze the data we have to see if there are any patterns we might be missing. We will begin with visualizing the time series data for frequencies of fake vs. real news articles.

In [None]:
news.info()

In [None]:
news.date = pd.to_datetime(news.date, errors="coerce")
news_grouped = news[["date", "subject", "label"]].groupby(["date", "label"]).count().reset_index()

fig, ax = plt.subplots(figsize=(16,10))
sns.lineplot(x="date", y="subject", hue="label", data=news_grouped, palette="Set2", ax=ax)
plt.title("News Articles Labelled Fake vs. Real")
plt.xlabel("Time")
plt.ylabel("Count")

As the line chart above shows, we have no real news articles prior to the beginning of 2016. Even after that, fake news articles outnumber real ones for a while. We see a sudden increase in the number of real news articles in the last quarter of 2016, followed by another unexpected peak around May 2017. Afterwards, the number of real articles increases suddenly and massively, while the number of fake articles decreases more gradually over time.

Let's take a look at the subjects.

In [None]:
news_group_by_subj_and_label = news.groupby(by=["label", "subject"]).count().reset_index()

fig1, ax1 = plt.subplots(figsize=(16, 8))
sns.barplot(x="subject", y="title", hue="label", data=news_group_by_subj_and_label, palette="Set2", saturation=0.5, ax=ax1)
plt.title("Fake vs. Real News Articles by Subjects")
plt.xticks(rotation=45, horizontalalignment='center', fontweight='light', fontsize='x-large')
plt.yticks(horizontalalignment='center', fontweight='light', fontsize='large')
plt.xlabel("Subjects", fontsize="large")
plt.ylabel("Amount", fontsize="large")
plt.legend(fontsize="large")

As we can see, real articles fall under only two subjects, "Politics" and "World", while fake articles cover a variety of subjects. Fake news articles most often belong to the "All" subject, followed by "Politics", and none of them belong to "World". "Politics" is the only subject with many articles from both groups.

## 4. Further Preprocessing

Before we move on to the classification, we will be working on the data further.

Our first step will be dropping some data. We will drop articles that were released before 2016 because there are no real news articles in that period, so the data there is not representative. For the same reason, we will also drop articles with the subjects "Government", "Left", "Middle East" and "US", which appear only among fake articles. The only subject that fake and real articles share is "Politics", but we keep "All" and "World" as well, because dropping them would remove too much data. In addition, their frequencies and names suggest that the "All" subject for fake articles might be the equivalent of the "World" subject for real articles.

In [None]:
# Keep articles dated January 1, 2016 or later; rows with a missing date are dropped as well
news_clean = news[news.date >= dt.datetime(2016,1,1)]
news_clean.date.value_counts().sort_index()

In [None]:
news_clean = news_clean[news_clean.subject.isin(["All", "Politics", "World"])]
news_clean.subject.value_counts()

In [None]:
news_clean.shape[0]

As we can see, we still ended up with a decent amount of data! As a next step, we will be working on our features. 

## 5. Feature Weighting

As our features, we are going to use the words in the article titles. We will weight these words by their term frequencies, i.e. how often each word occurs in a title. We will begin by splitting the data into training and testing sets.

In [None]:
training_data, testing_data = train_test_split(news_clean, random_state=1) #seed for reproducibility

Y_train = training_data["label"].values
Y_test = testing_data["label"].values

def word_counter(column, training_set, testing_set):
    # Count how often each word occurs; ignore words found in over 95% of titles
    cv = CountVectorizer(binary=False, max_df=0.95)
    # Learn the vocabulary from the training set only, so no test data leaks in
    cv.fit(training_set[column].values)

    train_feature_set = cv.transform(training_set[column].values)
    test_feature_set = cv.transform(testing_set[column].values)

    return train_feature_set, test_feature_set, cv

X_train, X_test, feature_transformer = word_counter("title", training_data, testing_data)

## 6. Training the model, prediction & accuracy

Now that we have prepared our features, we will train our model and then predict labels for our testing set. We will evaluate these predictions with the accuracy metric, which is measured using the formula below:

\begin{equation}
\text{Accuracy} = \frac{\text{number of correctly classified articles}}{\text{total number of classified articles}}
\end{equation}
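
As a quick worked example of this formula, the sketch below computes accuracy by hand for five hypothetical labels (made up for illustration) and compares the result with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; one of the five predictions is wrong.
y_true = ["fake", "real", "fake", "real", "fake"]
y_pred = ["fake", "real", "real", "real", "fake"]

# Number of correctly classified articles divided by the total number.
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```

Four of the five predictions match, so both values come out to 0.8.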

In [None]:
classifier = LogisticRegression(solver="newton-cg", C=5, penalty="l2", multi_class="multinomial", max_iter=1000)
model = classifier.fit(X_train, Y_train)

In [None]:
predictions = model.predict(X_test)
accuracy = accuracy_score(Y_test, predictions, normalize=True)
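# Format as a whole-number percentage; multiplying a rounded float by 100 can print values like 56.99999999999999
print("Our model has {:.0f}% prediction accuracy.".format(accuracy * 100))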

## 7. Conclusion

In this project, we cleaned, analyzed, visualized and classified fake and real news articles.

* We cleaned our data to prepare it for further analysis.
* We then visualized and analyzed the data before fitting the model to make sure that it was ready.
* We used the term frequencies of the words in the title column as feature weights.
* We used Multinomial Logistic Regression for classification and accuracy as our metric. In the end, our model correctly classified 96% of the testing data.