# Analysis of fakenews project - CS 371
Author: Alex Nguyen and Hao Lin | Gettysburg College

## Structure of the repository

* The [`README.md`](./README.md) is the main written answer file. Read it alongside the code in [`notebook.py`](./notebook.py) and [`notebook.ipynb`](./notebook.ipynb).

* The [analysis folder](./analysis) contains csv files that analyze the nature of the data.

* The [data folder](./data) contains the data files.

* The file [`notebook.py`](./notebook.py) contains the main analysis code as a Python script.

* The file [`notebook.ipynb`](./notebook.ipynb) contains the main analysis code as a Jupyter notebook.

* [Here](https://colab.research.google.com/drive/1CniVqlrgH_wxul13CTXURzmqkxM3zVH8?usp=sharing) is the editable link to the Google Colab Jupyter notebook. <b>Note:</b> Please upload the required folders and data files so that the notebook can run.

In [1]:
from typing import List, Dict, Tuple

import random
import numpy as np
from pathlib import Path
import collections
import pandas as pd

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split


from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# CONFIGURATIONs

DATA_FOLDER = Path("./data")
FAKE_DATA_NAME = "clean_fake.txt"
FAKE_DATA_PATH = DATA_FOLDER / FAKE_DATA_NAME
REAL_DATA_NAME = "clean_real.txt"
REAL_DATA_PATH = DATA_FOLDER / REAL_DATA_NAME

First we define our methods:

In [2]:
def read_data(real_path, fake_path, stop_words=()):
    """
        Read the real and fake headline files and return two lists of
        tokenized headlines (one list of words per headline), skipping
        any word found in stop_words.
    """
    def read_tokens(path) -> List[List[str]]:
        lines = []
        with open(path, "r") as f:
            for line in f:
                lines.append([w for w in line.strip().split() if w not in stop_words])
        return lines

    # Headlines differ in length, so plain lists are returned; calling
    # np.asarray on ragged lists raises an error in recent NumPy versions.
    return read_tokens(real_path), read_tokens(fake_path)

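# Print the relative frequencies of the most and least frequent words in the
# real headlines, then export the word counts of both classes to CSV.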
def export_most_presence_words(real_x: List[str], fake_x: List[str]):
    total_real = 0
    cnt_real = collections.Counter()
    for line in real_x:
        words = line.split()
        for w in words:
            cnt_real[w] += 1
            total_real += 1

    cnt_fake = collections.Counter()
    for line in fake_x:
        words = line.split()
        for w in words:
            cnt_fake[w] += 1

    most_presence = np.asarray(cnt_real.most_common(10))
    # print(most_presence.shape)
    most_presence_probs = [int(data) / total_real for data in most_presence[:,1]]
    print("Most presence words probability:\n" + str(most_presence_probs))
    least_presence = np.asarray(cnt_real.most_common()[:-10-1:-1])
    least_presence_probs = [int(data) / total_real for data in least_presence[:,1]]
    print("Least presence words probability:\n" + str(least_presence_probs))

    # Write csv
    df = pd.DataFrame(cnt_fake.items())
    df.to_csv("./analysis/fake_words.csv")

    df = pd.DataFrame(cnt_real.items())
    df.to_csv("./analysis/real_words.csv")

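# Export the word counts of both classes to CSV with English stop words removed.
# Note: despite the function name, stop words are excluded from these counts.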
def export_most_presence_words_with_stop_words(real_x: List[str], fake_x: List[str]):
    cnt_real_non_stop = collections.Counter()
    for line in real_x:
        words = line.split()
        for w in words:
            if w not in ENGLISH_STOP_WORDS:
                cnt_real_non_stop[w] += 1

    cnt_fake_non_stop = collections.Counter()
    for line in fake_x:
        words = line.split()
        for w in words:
            if w not in ENGLISH_STOP_WORDS:
                cnt_fake_non_stop[w] += 1

    # Write csv
    df = pd.DataFrame(cnt_fake_non_stop.items())
    df.to_csv("./analysis/fake_words_non_stop.csv")

    df = pd.DataFrame(cnt_real_non_stop.items())
    df.to_csv("./analysis/real_words_non_stop.csv")

## Part 1: 
- Describe the datasets. You will be predicting whether a headline is real or fake news from words that appear in the headline. Is that feasible? Give 3 examples of specific keywords that may be useful, together with statistics on how often they appear in real and fake headlines.
- For the rest of the project, you should split your dataset into ~70% training, ~15% validation, and ~15% test.

<b>Answer:</b>
* According to [`fake_words_non_stop.csv`](./analysis/fake_words_non_stop.csv) (every distinct word in the fake headlines with its number of occurrences, stop words excluded), the most frequent keywords are "Trump", "Donald", and "Hillary".
* The data was split into training, validation, and test sets in [`notebook.py`](./notebook.py).
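
Part 1 also asks for statistics on how often keywords appear in each class. The following is a minimal sketch of one way to compute them, assuming each headline is a whitespace-separated string. The helper name `keyword_stats` and the toy headlines are hypothetical and are not part of the notebook or the dataset.

```python
from typing import Dict, Iterable, List


def keyword_stats(headlines: Iterable[str], keywords: List[str]) -> Dict[str, float]:
    """Return, for each keyword, the fraction of headlines that contain it."""
    rows = [set(line.lower().split()) for line in headlines]
    total = len(rows)
    if total == 0:
        return {k: 0.0 for k in keywords}
    return {k: sum(k in row for row in rows) / total for k in keywords}


# Toy headlines for illustration only; they are NOT taken from the dataset.
toy_real = ["trump meets korea leader", "senate votes on budget"]
toy_fake = ["hillary secret exposed", "trump trump wins again", "hillary and donald"]

keywords = ["trump", "donald", "hillary"]
real_stats = keyword_stats(toy_real, keywords)
fake_stats = keyword_stats(toy_fake, keywords)
for k in keywords:
    print(f"{k}: real={real_stats[k]:.2f}, fake={fake_stats[k]:.2f}")
```

With the notebook's data, the same function could be called on `real_lines` and `fake_lines` to obtain the per-class fractions for any three keywords of interest.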

## Part 2: Create Naive Bayes classifier
## Part 5 and 6: Create and analyze Logistic Regression.
## Part 7: Create and analyze Decision Tree

### Note: All these models are trained on data CONTAINING stop words.

In [10]:
real_x, fake_x = read_data(REAL_DATA_PATH, FAKE_DATA_PATH)

real_lines = np.asarray([" ".join(row) for row in real_x])
fake_lines = np.asarray([" ".join(row) for row in fake_x])
real_y = np.asarray(len(real_lines) * [1])
fake_y = np.asarray(len(fake_lines) * [0])

data_lines = np.append(real_lines, fake_lines, axis=0)
data_label = np.append(real_y, fake_y, axis=0)

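# Re-seeding before each shuffle applies the same permutation to both arrays,
# which keeps every headline aligned with its label.
random.seed(0)
random.shuffle(data_lines)
random.seed(0)
random.shuffle(data_label)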

# Split off 70% for training; the remaining 30% is divided below.
x_train, x_, y_train, y_ = train_test_split(data_lines, data_label, test_size=0.3)
# Split the remaining 30% evenly into test and validation sets (15% each).
x_test, x_val, y_test, y_val = train_test_split(x_, y_, test_size=0.5)

# Vectorize train test
vectorizer = CountVectorizer()
counts_train = vectorizer.fit_transform(x_train)
counts_test = vectorizer.transform(x_test)

# Naive Bayes
classifier = MultinomialNB()
classifier.fit(counts_train, y_train)
predictions = classifier.predict(counts_test)
print('Testing accuracy for Naive Bayes =', sum(predictions == y_test) / len(y_test))

# Logistic Regression
regression = LogisticRegression()
regression.fit(counts_train, y_train)
predictions = regression.predict(counts_test)
print('Testing accuracy Logistic Regression =', sum(predictions == y_test) / len(y_test))

coef = np.asarray(regression.coef_)
coef = coef.flatten()
# Getting the n (10) largest coefficients of the logistic regression.
max_args = (-coef).argsort()[:10]
max_probs = [coef[arg] for arg in max_args]
# Getting the n (10) smallest coefficients of the logistic regression.
min_args = (coef).argsort()[:10]
min_probs = [coef[arg] for arg in min_args]

print()
print("Largest 10 coefficient for logistic regression: " + str(max_probs) + "\n")
print("Smallest 10 coefficient for logistic regression: " + str(min_probs))
print()

# Decision tree classifier with normal data
dtree = DecisionTreeClassifier()
dtree.fit(counts_train, y_train)
predictions = dtree.predict(counts_test)
print('Testing accuracy Decision Tree classifier =', sum(predictions == y_test) / len(y_test))

Testing accuracy for Naive Bayes = 0.8591836734693877
Testing accuracy Logistic Regression = 0.8673469387755102

Largest 10 coefficient for logistic regression: [1.6945726991574033, 1.5952871127753492, 1.5491104095088786, 1.37658765749825, 1.3200212833377263, 1.2346420108946954, 1.1783898230422958, 1.1772149318892855, 1.1692347159558547, 1.1361404508257504]

Smallest 10 coefficient for logistic regression: [-1.9089776914282195, -1.755714622009212, -1.502867042717414, -1.4997133850831266, -1.3971909247765497, -1.3886643883888463, -1.326854464596621, -1.3225371306974745, -1.3197105974382202, -1.263507260612528]

Testing accuracy Decision Tree classifier = 0.7795918367346939
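
The cell above prints only the coefficient values, not the words they belong to. The following is a minimal sketch of how each weight could be paired with its word. The helper name `top_weighted_words` and the toy weights are hypothetical. In the notebook one would presumably pass `regression.coef_.flatten()` and `vectorizer.get_feature_names_out()`, assuming scikit-learn 1.0 or later is installed.

```python
from typing import List, Sequence, Tuple


def top_weighted_words(
    coef: Sequence[float], feature_names: Sequence[str], k: int = 10
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and k most negative (word, weight) pairs."""
    pairs = sorted(zip(feature_names, coef), key=lambda p: p[1])
    most_negative = pairs[:k]
    most_positive = pairs[::-1][:k]
    return most_positive, most_negative


# Toy weights for illustration only; they are NOT the fitted coefficients.
toy_names = ["korea", "hillary", "the", "breaking", "says"]
toy_coef = [1.2, -1.5, 0.1, -0.8, 0.9]
positive, negative = top_weighted_words(toy_coef, toy_names, k=2)
print("Pushes toward real (label 1):", positive)
print("Pushes toward fake (label 0):", negative)
```

Because real headlines are labeled 1 and fake headlines 0 in this notebook, a large positive weight pushes a prediction toward "real" and a large negative weight pushes it toward "fake".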


* Now classify the data with stop words removed:

In [15]:
# Redo the whole process with stop words removed

real_x, fake_x = read_data(REAL_DATA_PATH, FAKE_DATA_PATH, ENGLISH_STOP_WORDS)

real_lines = np.asarray([" ".join(row) for row in real_x])
fake_lines = np.asarray([" ".join(row) for row in fake_x])
real_y = np.asarray(len(real_lines) * [1])
fake_y = np.asarray(len(fake_lines) * [0])

data_lines = np.append(real_lines, fake_lines, axis=0)
data_label = np.append(real_y, fake_y, axis=0)

random.seed(0)
random.shuffle(data_lines)
random.seed(0)
random.shuffle(data_label)

# Split off 70% for training; the remaining 30% is divided below.
x_train, x_, y_train, y_ = train_test_split(data_lines, data_label, test_size=0.3)
# Split the remaining 30% evenly into test and validation sets (15% each).
x_test, x_val, y_test, y_val = train_test_split(x_, y_, test_size=0.5)

# Vectorize train test
vectorizer = CountVectorizer()
counts_train = vectorizer.fit_transform(x_train)
counts_test = vectorizer.transform(x_test)

classifier = MultinomialNB()
classifier.fit(counts_train, y_train)
predictions = classifier.predict(counts_test)
print('Testing accuracy for Naive Bayes =', sum(predictions == y_test) / len(y_test))

regression = LogisticRegression()
regression.fit(counts_train, y_train)
predictions = regression.predict(counts_test)
print('Testing accuracy Logistic Regression =', sum(predictions == y_test) / len(y_test))

coef = np.asarray(regression.coef_)
coef = coef.flatten()
max_args = (-coef).argsort()[:10]
max_probs = [coef[arg] for arg in max_args]
min_args = (coef).argsort()[:10]
min_probs = [coef[arg] for arg in min_args]

print()
print("Largest 10 coefficient for logistic regression: " + str(max_probs) + "\n")
print("Smallest 10 coefficient for logistic regression: " + str(min_probs))
print()

# Decision tree classifier for nonstop word
dtree = DecisionTreeClassifier()
dtree.fit(counts_train, y_train)
predictions = dtree.predict(counts_test)
print('Testing accuracy Decision Tree classifier =', sum(predictions == y_test) / len(y_test), "\n")

print("Trying different depth from 1 to 100...")
for i in range(10):
    # np.random.randint excludes the upper bound, so depths are sampled from 1 to 99.
    max_depth = np.random.randint(1,100)
    dtree = DecisionTreeClassifier(max_depth=max_depth)
    dtree.fit(counts_train, y_train)
    predictions = dtree.predict(counts_test)
    t = sum(predictions == y_test) / len(y_test)
    print(f'Testing accuracy Decision Tree classifier for depth {max_depth} = {t}')


Testing accuracy for Naive Bayes = 0.8448979591836735
Testing accuracy Logistic Regression = 0.8428571428571429

Largest 10 coefficient for logistic regression: [1.5154451202658934, 1.5133845233359218, 1.4489570956531233, 1.3610718727678532, 1.336908467209301, 1.3188928359405674, 1.282321519151832, 1.2663677690919637, 1.2576864286655547, 1.1212793508954753]

Smallest 10 coefficient for logistic regression: [-1.9387706777634726, -1.6237100196063001, -1.610085723416201, -1.55157132862154, -1.5461239219035725, -1.5319972558210335, -1.3282052941479423, -1.279801907856682, -1.2629452649977808, -1.261497226060493]

Testing accuracy Decision Tree classifier = 0.789795918367347 

Trying different depth from 1 to 100...
Testing accuracy Decision Tree classifier for depth 48 = 0.7714285714285715
Testing accuracy Decision Tree classifier for depth 99 = 0.7959183673469388
Testing accuracy Decision Tree classifier for depth 12 = 0.6755102040816326
Testing accuracy Decision Tree classifier for depth
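
Both cells above call `train_test_split` without a `random_state`, so the reported accuracies can change from run to run. The following is a minimal sketch of one reproducible alternative that uses only the standard library: a single seeded permutation is applied to headlines and labels together before cutting 70/15/15. The helper name `split_70_15_15` and the toy data are hypothetical. Passing `random_state` to `train_test_split` would be another option.

```python
import random
from typing import List, Sequence, Tuple


def split_70_15_15(
    lines: Sequence[str], labels: Sequence[int], seed: int = 0
) -> Tuple[Tuple[List[str], List[int]], ...]:
    """Shuffle lines and labels together, then cut into train/val/test parts."""
    if len(lines) != len(labels):
        raise ValueError("lines and labels must have the same length")
    order = list(range(len(lines)))
    random.Random(seed).shuffle(order)  # one permutation shared by both lists
    # Integer arithmetic avoids floating-point rounding in the cut points.
    n_train = len(order) * 70 // 100
    n_val = len(order) * 15 // 100
    parts = (order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:])
    return tuple(([lines[i] for i in p], [labels[i] for i in p]) for p in parts)


# Toy data for illustration only: 20 numbered headlines with alternating labels.
toy_lines = [f"headline {i}" for i in range(20)]
toy_labels = [i % 2 for i in range(20)]
train, val, test = split_70_15_15(toy_lines, toy_labels, seed=0)
print(len(train[0]), len(val[0]), len(test[0]))
```

Keeping the validation part separate also makes it possible to choose a setting such as `max_depth` on `x_val` and reserve the test set for a final accuracy estimate.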

## Part 3:
### Part 3a:
* According to [`real_words.csv`](./analysis/real_words.csv), sorting the CSV file by occurrence count in ascending and descending order gives the least and most frequent keywords, respectively:
  * The most frequent keywords in real news: 'donald', 'to', 'us', 'trumps', 'in', 'on', 'of', 'for', 'the'.
  * The least frequent keywords in real news: 'ba', 'how', 'climate', 'obama', 'house', 'has', 'first', 'he', 'not', 'what'.

* According to [`fake_words.csv`](./analysis/fake_words.csv), sorting the CSV file by occurrence count in ascending and descending order gives the least and most frequent keywords, respectively:
  * The most frequent keywords in fake news: 'trump', 'the', 'to', 'in', 'donald', 'of', 'for', 'a', 'and', 'on'.
  * The least frequent keywords in fake news: 'why', 'after', 'campaign', 'america', 'voter', 'vote', 'not', 'supporter', 'about', 'says'.

### Part 3b:
* According to [`real_words_non_stop.csv`](./analysis/real_words_non_stop.csv), sorting the CSV file by occurrence count in descending order gives the most frequent keywords:
  * The most frequent keywords in real news: 'donald', 'trumps', 'says', 'trum', 'north', 'election', 'clinton', 'president', 'russia', 'korea'.

* According to [`fake_words_non_stop.csv`](./analysis/fake_words_non_stop.csv), sorting the CSV file by occurrence count in descending order gives the most frequent keywords:
  * The most frequent keywords in fake news: 'trump', 'donald', 'hillary', 'clinton', 'trum', 'new', 'just', 'election', 'obama', 'president'.


### Part 3c:
* Removing stop words can help because they carry little of a headline's content and are weak evidence for whether it is real or fake, so they mostly add noise to the model.
* Keeping stop words can help because certain stop words may make a headline look more professional and therefore signal the credibility of the article.
* In the runs recorded above, Naive Bayes and logistic regression scored slightly higher with stop words kept (about 0.859 and 0.867) than with stop words removed (about 0.845 and 0.843). The splits are not seeded, so a gap of this size may be partly due to chance.

The code to achieve these results is below.

In [None]:
export_most_presence_words(real_lines, fake_lines)
export_most_presence_words_with_stop_words(real_lines, fake_lines)