# Supervised Machine Learning

An attempt to repurpose the basic structure of the decision tree classifier employed in the livestream from Great Learning: 

https://www.youtube.com/watch?v=tdFMIO5lfgA

In [1]:
# We will be using pandas dataframes and numpy matrix manipulation
import pandas as pd
import numpy as np

def fullprint(*args, **kwargs):
    """
    Unsets numpy's array truncation to print full arrays.
    Returns to default numpy truncation before exiting.
    https://stackoverflow.com/questions/1987694/how-to-print-the-full-numpy-array-without-truncation/24542498#24542498
    """
    from pprint import pprint
    import numpy
    opt = numpy.get_printoptions()
    numpy.set_printoptions(threshold=numpy.inf)
    try:
        pprint(*args, **kwargs)
    finally:
        # Restore the previous print options even if pprint raises
        numpy.set_printoptions(**opt)

### Loading Data
Fake and Real news data files downloaded via Kaggle user Clément Bisaillon: 

https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

In [2]:
# Load in our csv files.
fake = pd.read_csv('./Fake.csv')
real = pd.read_csv('./True.csv')

In [3]:
# Let's see what the data sets look like. 
# title, text, subject, date
# True.csv looks the same way btw.
fake.head()

| | title | text | subject | date |
|---|---|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | Donald Trump just couldn t wish all Americans ... | News | December 31, 2017 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | House Intelligence Committee Chairman Devin Nu... | News | December 31, 2017 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | On Friday, it was revealed that former Milwauk... | News | December 30, 2017 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | On Christmas day, Donald Trump announced that ... | News | December 29, 2017 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | Pope Francis used his annual Christmas Day mes... | News | December 25, 2017 |


### Isolating and Cleaning Data
We want to drop NaN entries and slice each DataFrame down to as many data entries as we want.
* There are upwards of 20k entries in each file; 1k is good for testing (the cell below keeps the first 5,000 of each).

**Reminder:** a "data entry" at this point is a title, text, subject, and date

#### Independent data
Then, isolate just the independent variable information from the real and fake datasets.
* This constitutes the article Titles and Texts.

We will need to have all independent variables in a single parsable object. Numpy's `concatenate` will do the trick: with `axis=0` it stacks the rows of the two arrays on top of each other instead of flattening them.

We are left with an array of 2 columns (title, text) with len(independent_fake)+len(independent_real) rows.
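
To make the clean, slice, isolate, and combine steps concrete, here is a minimal sketch on made-up data. The toy DataFrames and every name ending in `_toy` are hypothetical stand-ins for `Fake.csv` and `True.csv`, not the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for Fake.csv and True.csv (same four columns).
cols = ['title', 'text', 'subject', 'date']
fake_toy = pd.DataFrame([
    ['fake title A', 'fake text A', 'News', 'day 1'],
    ['fake title B', None,          'News', 'day 2'],  # has a NaN, so dropna() removes it
    ['fake title C', 'fake text C', 'News', 'day 3'],
    ['fake title D', 'fake text D', 'News', 'day 4'],
], columns=cols)
real_toy = pd.DataFrame([
    ['real title A', 'real text A', 'politicsNews', 'day 1'],
    ['real title B', 'real text B', 'politicsNews', 'day 2'],
], columns=cols)

# Drop rows containing NaN, then keep the first 2 remaining rows of each.
fake_toy = fake_toy.dropna()[0:2]
real_toy = real_toy.dropna()[0:2]

# Keep every column except the last two (subject, date): title and text remain.
ind_fake_toy = fake_toy.iloc[:, :-2].values
ind_real_toy = real_toy.iloc[:, :-2].values

# axis=0 stacks rows; axis=None would flatten everything into one long 1-D array.
X_toy = np.concatenate((ind_fake_toy, ind_real_toy), axis=0)
flat_toy = np.concatenate((ind_fake_toy, ind_real_toy), axis=None)

print(X_toy.shape)     # (4, 2)
print(flat_toy.shape)  # (8,)
```

Note that the slice is taken after `dropna()`, so the second fake row kept here is "fake title C", not "fake title B".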

In [4]:
# Clean and slice data 
fake = fake.dropna()[0:5000]
real = real.dropna()[0:5000]

# Isolate independent data points
independent_fake = fake.iloc[:,:-2].values
independent_real = real.iloc[:,:-2].values

# Combine arrays
X = np.concatenate((independent_fake,independent_real),axis=0)

#### Dependent data

We now need to develop a category array to label entries as real or fake.

The easiest way to do this is to generate numpy arrays of 1s for "fake" and 0s for "real", then concatenate them in the same way (and the same order) as we did the independent variables.

In [5]:
dependent_fake = np.ones(len(independent_fake)) # fake == 1
dependent_real = np.zeros(len(independent_real)) # real == 0

y = np.concatenate((dependent_fake,dependent_real),axis=0)

### Convert data
We need to process our independent data (titles, texts) into numerics using a CountVectorizer object.

How a `CountVectorizer` works:
* When fit to a dataset, the CountVectorizer builds a vocabulary of every unique token (datapoint) that appears in that dataset.
    * In this case, a token is a word and the dataset is the collection of article titles or body texts.
* When the dataset is then transformed, the CountVectorizer outputs an array of occurrence counts of each token per entry.
    * A.k.a. it tells us how many times each word appears in a particular article.

The `.get_feature_names_out()` method (`.get_feature_names()` in scikit-learn versions before 1.0) will show the unique tokens that have been extracted.
    
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [6]:
from sklearn.feature_extraction.text import CountVectorizer

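# Generate the occurrence array for the body text.
cv_body = CountVectorizer()

# .toarray() returns a plain ndarray; .todense() returns an np.matrix,
# which recent scikit-learn versions reject as estimator input.
mat_body = cv_body.fit_transform(X[:,1]).toarray() # body text occurrence array
# print(cv_body.get_feature_names_out())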

In [7]:
# Generate the occurrence array for the titles.
cv_head = CountVectorizer()
# .toarray() gives a plain ndarray rather than the np.matrix from .todense()
mat_head = cv_head.fit_transform(X[:,0]).toarray()

In [8]:
# Stick the occurrence arrays together (title matched with body text)
# https://numpy.org/doc/stable/reference/generated/numpy.hstack.html
X_mat = np.hstack((mat_head, mat_body))
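
With thousands of articles, the dense occurrence arrays can take a lot of memory, since most entries are zero. As an alternative sketch (not what the cells above do), the matrices can stay sparse and be joined with `scipy.sparse.hstack`; both `train_test_split` and `DecisionTreeClassifier` accept sparse input. The toy titles and texts and all `_toy` names are hypothetical.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles and body texts (one pair per article)
titles_toy = ["cat news today", "fish market report"]
texts_toy = ["the cat sat on the mat", "the fish sold out early"]

# fit_transform returns a scipy sparse matrix; no densifying needed
sp_head_toy = CountVectorizer().fit_transform(titles_toy)
sp_body_toy = CountVectorizer().fit_transform(texts_toy)

# Column-wise join, same layout as np.hstack on the dense versions
X_sparse_toy = hstack((sp_head_toy, sp_body_toy)).tocsr()

# Sanity check against the dense equivalent
X_dense_toy = np.hstack((sp_head_toy.toarray(), sp_body_toy.toarray()))
print(X_sparse_toy.shape == X_dense_toy.shape)
```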

### Separate training and testing data
This is a critical step in machine learning. 
We need to separate the independent and dependent data into training and testing sets, so that the model is evaluated on entries it never saw while fitting. This is made very easy for us by sklearn's `train_test_split()` function, which also lets us define how much of our data goes to train vs. test.

**Recall:** X_mat is our organized and processed independent data. y is our organized dependent array.
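
As a quick illustration of what `test_size=0.2` does, here is a sketch on a made-up 10-row array; `X_toy` and `y_toy` are hypothetical stand-ins for `X_mat` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 hypothetical entries with 2 features each, 5 labeled fake (1) and 5 real (0)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([1] * 5 + [0] * 5)

# test_size=0.2 holds out 20% of the rows; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
print(y_tr.shape, y_te.shape)  # (8,) (2,)
```

Rows are shuffled before splitting, and each row of `X_toy` stays paired with its label in `y_toy`.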

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_mat, y, test_size=0.2, random_state=0)

### Machine Learning
Now we can finally make our machine learn something!

We are using a decision tree that will classify a data entry as a member of the categories 'real' or 'fake' by passing it through a series of yes/no tests on its features (here, how many times particular words appear).

I *think* that the 'entropy' criterion has the decision tree choose, at each node, the split that most reduces the entropy (the mixed-ness) of the real/fake labels in the resulting groups, i.e. the split with the highest information gain. So for us I believe that means that we are essentially sorting real and fake news by word usage, making the assertion (assumption) that fake news uses certain words in its titles and texts at different frequencies than real news does.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

https://towardsdatascience.com/entropy-how-decision-trees-make-decisions-2946b9c18c8
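
To get a feel for the entropy criterion, here is a minimal numpy sketch of entropy and information gain for a single candidate split. It is a simplified illustration, not sklearn's actual implementation; the helper functions, labels, and word counts are all hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(labels, mask):
    """Entropy reduction from splitting labels into labels[mask] and labels[~mask]."""
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Hypothetical toy data: 1 = fake, 0 = real, plus how often one made-up word
# appears in each of the six articles.
labels_toy = np.array([1, 1, 1, 0, 0, 0])
word_count_toy = np.array([3, 2, 2, 0, 0, 1])

# A 50/50 mix of labels has the maximum entropy for two classes: 1 bit.
print(entropy(labels_toy))

# Splitting at "count >= 2" separates fake from real perfectly (gain of 1 bit);
# splitting at "count >= 1" leaves one real article mixed in with the fakes.
gain_at_2 = information_gain(labels_toy, word_count_toy >= 2)
gain_at_1 = information_gain(labels_toy, word_count_toy >= 1)
print(gain_at_2, gain_at_1)
```

A tree using this criterion would prefer the "count >= 2" test here, because it has the larger gain.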

In [10]:
from sklearn.tree import DecisionTreeClassifier

# Create decision tree object. 
# Fit the training data to the ML object.
dtc = DecisionTreeClassifier(criterion='entropy')
dtc.fit(X_train, y_train)

# Generate dependent data (real/fake) from test independent data
y_pred = dtc.predict(X_test)

### Testing Accuracy

The `confusion_matrix()` function in sklearn allows us to compare the ACTUAL dependent variables (whether a tested article is real or fake) with the predicted category from our algorithm.

In this case, the array will be 2x2 since there are two actual categories (real & fake) and two possible predicted categories. Rows are the actual labels and columns are the predicted labels, both in sorted label order (0 = real, 1 = fake):

array([[# real labeled real, # real labeled fake], 

[# fake labeled real, # fake labeled fake]])

To determine an algorithm's accuracy, sum the descending diagonal (total correctly categorized datapoints) and divide by the sum of all matrix values (total # categorizations, correct and incorrect).

In [11]:
from sklearn.metrics import confusion_matrix

# Compare our dependent variable prediction with the actual values
confusion_matrix(y_test, y_pred)

array([[1011,    1],
       [   0,  988]])
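
Applying the diagonal-over-total rule to the matrix printed above gives the accuracy directly. The sketch below copies those four numbers by hand, then builds a tiny made-up confusion matrix to show sklearn's row and column order (rows are actual labels, columns are predicted labels). The `_toy` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# The matrix from the output above, typed in by hand
cm = np.array([[1011, 1],
               [0, 988]])

# Correct predictions sit on the descending diagonal
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.9995

# Hypothetical toy labels: 0 = real, 1 = fake
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]
cm_toy = confusion_matrix(y_true_toy, y_pred_toy)
print(cm_toy)
# [[1 1]
#  [1 2]]
```

In the toy matrix, the top-right 1 is a real article labeled fake and the bottom-left 1 is a fake article labeled real.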

## Testing Learned Model on New Data

Let's see if our super smart fake news categorizer can look at new entries!
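
One thing to watch when feeding new entries to the model: they must be converted with the vectorizers that were already fit (`transform`, not `fit_transform`), so the columns line up with what the tree was trained on. Below is a self-contained sketch of that idea on made-up data; every title, text, label, and `_toy`/`new_` name is hypothetical, and the tiny training set is only there to make the example runnable.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data (1 = fake, 0 = real)
titles_toy = ["shocking secret exposed", "senate passes budget bill",
              "shocking claim goes viral", "court rules on appeal"]
texts_toy = ["you will not believe this shocking secret",
             "the senate passed the budget bill on tuesday",
             "a shocking claim went viral overnight",
             "the court issued a ruling on the appeal"]
y_toy = np.array([1, 0, 1, 0])

# Fit one vectorizer per field, as in the cells above
cv_head_toy = CountVectorizer()
cv_body_toy = CountVectorizer()
X_toy = np.hstack((cv_head_toy.fit_transform(titles_toy).toarray(),
                   cv_body_toy.fit_transform(texts_toy).toarray()))

dtc_toy = DecisionTreeClassifier(criterion='entropy', random_state=0)
dtc_toy.fit(X_toy, y_toy)

# New, unseen entries: reuse the fitted vectorizers with transform() only.
# Words that were not in the training vocabulary are silently ignored.
new_titles = ["shocking new secret", "budget bill heads to court"]
new_texts = ["another shocking secret nobody expected",
             "the bill was passed and sent on"]
X_new = np.hstack((cv_head_toy.transform(new_titles).toarray(),
                   cv_body_toy.transform(new_texts).toarray()))

new_pred = dtc_toy.predict(X_new)  # array of 0s (real) and 1s (fake)
print(X_new.shape[1] == X_toy.shape[1])
print(new_pred)
```

Calling `fit_transform` on the new entries instead would build a different vocabulary, and the resulting columns would no longer match the trained tree.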