# Introduction
I will be using the Reuters Corpus Volume 1 (RCV1) dataset from scikit-learn. I am leaning heavily on the documentation at [scikit-learn.org](https://scikit-learn.org), as well as previous lecture notebooks from this course. This is a classification problem: given features extracted from news articles, determine the type of each article.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import keras
from sklearn.tree import DecisionTreeClassifier, plot_tree
import tensorflow as tf
from tensorflow.keras import datasets, layers, models

## Load the dataset
I'm using the [Reuters Corpus Volume 1](https://scikit-learn.org/stable/datasets/real_world.html?highlight=corpus%20volume#id6) dataset from sklearn. It can take a little while to load because it has over 800,000 entries. The target is similar to one-hot encoding, except that each article can belong to between 1 and 17 of the 103 article types, so a single row can contain several ones. The target, therefore, is either easier or more difficult, depending on your point of view. Both the data and target portions of the dataset are stored in sparse matrix format.
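To make the difference from one-hot encoding concrete, here is a small sketch using a made-up indicator matrix of 4 articles and 5 labels (the values are hypothetical, not taken from RCV1). In a one-hot target every row sums to exactly 1; in a multi-label target the row sum is the number of labels on that article.

```python
import numpy as np

# Hypothetical toy target: 4 articles (rows) x 5 article types (columns).
# A 1 means "this article has this label"; the values are made up.
toy_target = np.array([
    [1, 0, 0, 0, 0],   # one label, looks like one-hot
    [1, 1, 0, 0, 1],   # three labels at once
    [0, 0, 1, 1, 0],   # two labels
    [0, 1, 0, 0, 0],   # one label
])

labels_per_article = toy_target.sum(axis=1)
is_one_hot = bool(np.all(labels_per_article == 1))

print(labels_per_article)  # number of labels on each row
print(is_one_hot)          # False: some rows carry several labels
```

The same row-sum idea should carry over to `rcv1.target`, since scipy sparse matrices also support `.sum(axis=1)`.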

In [3]:
from sklearn.datasets import fetch_rcv1
rcv1 = fetch_rcv1()

Downloading https://ndownloader.figshare.com/files/5976069
Downloading https://ndownloader.figshare.com/files/5976066
Downloading https://ndownloader.figshare.com/files/5976063
Downloading https://ndownloader.figshare.com/files/5976060
Downloading https://ndownloader.figshare.com/files/5976057
Downloading https://ndownloader.figshare.com/files/5976048


Both the 'data' and 'target' components of rcv1 are in a sparse matrix format. Because of the limitations of the machine, I can't convert the entire data block to a dense array, so I only take a slice of it. If you want to live dangerously, you can switch the commented lines that define X_raw and y_raw. I'd love to see what happens and how it will impact the classifiers.

The target names are short codes (for example `C11` or `GCAT`) rather than readable category names.

In [9]:
cutStart = 41000
cutEnd = 47000
# .toarray() returns a plain ndarray rather than the np.matrix that .todense() returns
X_raw = rcv1.data[cutStart:cutEnd, 40000:47000].toarray() # Use this line if you are running on a mortal machine
y_raw = rcv1.target[cutStart:cutEnd].toarray() # Use this line if you are running on a mortal machine
#X_raw = rcv1.data.toarray() # Use this line only if you are confident in your amount of RAM
#y_raw = rcv1.target.toarray() # Use this line only if you used the line directly above. Live dangerously.
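A rough back-of-the-envelope calculation shows why densifying the full matrix is not realistic on a typical machine. The full shape used here (804,414 articles by 47,236 features) is taken from the scikit-learn documentation for `fetch_rcv1`, and the 8 bytes per value assumes float64 storage; `dense_bytes` is a hypothetical helper name.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory needed to hold an n_rows x n_cols dense array."""
    return n_rows * n_cols * bytes_per_value

# Full RCV1 shape as listed in the scikit-learn docs (assumption: float64 values).
full = dense_bytes(804_414, 47_236)
# The slice used above: rows 41000:47000, columns 40000:47000.
subset = dense_bytes(47_000 - 41_000, 47_000 - 40_000)

print(f"full dense matrix : {full / 1e9:.0f} GB")
print(f"subset used here  : {subset / 1e6:.0f} MB")
```

Under those assumptions the full dense matrix lands in the hundreds of gigabytes, while the slice fits in a few hundred megabytes.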


Below is the dictionary for converting the target name codes into English names that we can understand. I found it for RCV1 on [GitHub](https://gist.github.com/gavinmh/6253739).

In [8]:
# I found this translation on GITHUB.
code_names = { "CCAT": "CORPORATE/INDUSTRIAL", "C11": "STRATEGY/PLANS", "C12": "LEGAL/JUDICIAL", "C13": "REGULATION/POLICY",
              "C14": "SHARE LISTINGS", "C15": "PERFORMANCE", "C151": "ACCOUNTS/EARNINGS", "C1511": "ANNUAL RESULTS", "C152": "COMMENT/FORECASTS",
              "C16": "INSOLVENCY/LIQUIDITY", "C17": "FUNDING/CAPITAL", "C171": "SHARE CAPITAL", "C172": "BONDS/DEBT ISSUES",
              "C173": "LOANS/CREDITS", "C174": "CREDIT RATINGS", "C18": "OWNERSHIP CHANGES", "C181": "MERGERS/ACQUISITIONS",
              "C182": "ASSET TRANSFERS", "C183": "PRIVATISATIONS", "C21": "PRODUCTION/SERVICES", "C22": "NEW PRODUCTS/SERVICES",
              "C23": "RESEARCH/DEVELOPMENT", "C24": "CAPACITY/FACILITIES", "C31": "MARKETS/MARKETING", "C311": "DOMESTIC MARKETS",
              "C312": "EXTERNAL MARKETS", "C313": "MARKET SHARE", "C32": "ADVERTISING/PROMOTION", "C33": "CONTRACTS/ORDERS",
              "C331": "DEFENCE CONTRACTS", "C34": "MONOPOLIES/COMPETITION", "C41": "MANAGEMENT", "C411": "MANAGEMENT MOVES",
              "C42": "LABOUR", "E11": "ECONOMIC PERFORMANCE", "E12": "MONETARY/ECONOMIC", "E121": "MONEY SUPPLY",
              "E13": "INFLATION/PRICES", "E131": "CONSUMER PRICES", "E132": "WHOLESALE PRICES", "E14": "CONSUMER FINANCE", "E141": "PERSONAL INCOME",
              "E142": "CONSUMER CREDIT", "E143": "RETAIL SALES", "E21": "GOVERNMENT FINANCE", "E211": "EXPENDITURE/REVENUE",
              "E212": "GOVERNMENT BORROWING", "E31": "OUTPUT/CAPACITY", "E311": "INDUSTRIAL PRODUCTION", "E312": "CAPACITY UTILIZATION",
              "E313": "INVENTORIES", "E41": "EMPLOYMENT/LABOUR", "E411": "UNEMPLOYMENT", "E51": "TRADE/RESERVES", "E511": "BALANCE OF PAYMENTS",
              "E512": "MERCHANDISE TRADE", "E513": "RESERVES", "E61": "HOUSING STARTS", "E71": "LEADING INDICATORS", "GCAT": "GOVERNMENT/SOCIAL",
              "G15": "EUROPEAN COMMUNITY", "G151": "EC INTERNAL MARKET", "G152": "EC CORPORATE POLICY", "G153": "EC AGRICULTURE POLICY", 
              "G154": "EC MONETARY/ECONOMIC", "G155": "EC INSTITUTIONS", "G156": "EC ENVIRONMENT ISSUES", "G157": "EC COMPETITION/SUBSIDY",
              "G158": "EC EXTERNAL RELATIONS", "G159": "EC GENERAL", "GCRIM": "CRIME, LAW ENFORCEMENT", "GDEF": "DEFENCE", 
              "GDIP": "INTERNATIONAL RELATIONS", "GDIS": "DISASTERS AND ACCIDENTS", "GENT": "ARTS, CULTURE, ENTERTAINMENT",
              "GENV": "ENVIRONMENT AND NATURAL WORLD", "GFAS": "FASHION", "GHEA": "HEALTH", "GJOB": "LABOUR ISSUES", "GMIL": "MILLENNIUM ISSUES",
              "GOBIT": "OBITUARIES", "GODD": "HUMAN INTEREST", "GPOL": "DOMESTIC POLITICS", "GPRO": "BIOGRAPHIES, PERSONALITIES, PEOPLE",
              "GREL": "RELIGION", "GSCI": "SCIENCE AND TECHNOLOGY", "GSPO": "SPORTS", "GTOUR": "TRAVEL AND TOURISM", "GVIO": "WAR, CIVIL WAR",
              "GVOTE": "ELECTIONS", "GWEA": "WEATHER", "GWELF": "WELFARE, SOCIAL SERVICES", "MCAT": "MARKETS", "M11": "EQUITY MARKETS",
              "M12": "BOND MARKETS", "M13": "MONEY MARKETS", "M131": "INTERBANK MARKETS", "M132": "FOREX MARKETS", "M14": "COMMODITY MARKETS",
              "M141": "SOFT COMMODITIES", "M142": "METALS TRADING", "M143": "ENERGY MARKETS", "ECAT": "ECONOMIC/SOCIAL"}


# Set up Models
I will be using a Decision Tree and a Neural Net model. I'm using an 80/20 split for training and test data. I'm using a variation of the function used in Assignment 04.

In [None]:
# Function to split data, inspired by Assignment 04
data_split_ratio = 0.8

def get_train_test(data, targets, ratio):
  mask = np.random.rand(len(data)) < ratio
  data_train = data[mask]
  data_test = data[~mask]
  target_train = targets[mask]
  target_test = targets[~mask]
  return data_train, target_train, data_test, target_test

In [None]:
X_train, y_train, X_test, y_test = get_train_test(X_raw, y_raw, data_split_ratio)
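Because the mask is drawn at random, the split above is only approximately 80/20 and changes on every run, which may account for some run-to-run variation in the scores. The sketch below is one possible alternative, not the Assignment 04 function: it shuffles row indices with a seeded generator so the split is exact and repeatable. The name `split_exact` and the seed value are hypothetical.

```python
import numpy as np

def split_exact(data, targets, ratio, seed=0):
    """Shuffle row indices once, then cut at an exact ratio."""
    rng = np.random.default_rng(seed)      # seeded, so reruns give the same split
    order = rng.permutation(len(data))
    cut = int(len(data) * ratio)
    train_idx, test_idx = order[:cut], order[cut:]
    # Same return order as get_train_test above.
    return data[train_idx], targets[train_idx], data[test_idx], targets[test_idx]

# Toy stand-ins for X_raw and y_raw (made-up values).
X_demo = np.arange(40).reshape(10, 4)
y_demo = np.arange(10).reshape(10, 1)
Xtr, ytr, Xte, yte = split_exact(X_demo, y_demo, 0.8)
print(Xtr.shape, Xte.shape)  # (8, 4) (2, 4)
```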

### Decision Tree
A single line sets up a decision tree. A maximum depth of 70 is about the best the tree will do on both the training data and the test data. It is probably overfitting the training data, but the test accuracy still creeps up (from the low 0.20s to the high 0.20s) between depths of 30 and 70. The training accuracy caps out at roughly 0.98.

In [None]:
# Setting the decision tree
tree_depth = 70
tree = DecisionTreeClassifier(max_depth=tree_depth)

### Neural Net
I'm only using a single hidden layer because otherwise everything crashes. The final accuracy will be impacted by this limitation.

I played with different optimizers before settling on the stochastic gradient descent (`SGD()`) optimizer. `CategoricalCrossentropy()` seems to be the best loss to report at each epoch, but MSE is present in the commented line as well for comparison. I wanted to make the network three or more hidden layers deep, but I run out of RAM on Colab when I try even two. I'd be interested to see what happens when you can run the full dataset through a multi-layer neural net. My guess is that it stands a good chance of improving the accuracy of the model, but it would definitely bog down the machine.
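One thing worth checking, given that an article can carry several labels at once: a softmax output forces all of the label scores to sum to 1, so it cannot give a high probability to more than one label, whereas a sigmoid scores each label independently. Multi-label setups commonly pair a sigmoid output with binary cross-entropy for that reason. The NumPy sketch below illustrates the difference on made-up scores; it is not the network in this notebook, and whether the change would help on this data is untested.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw scores for 4 labels; the first two are both strongly "on".
logits = np.array([4.0, 4.0, -3.0, -3.0])

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

print(p_softmax.round(3))  # the two "on" labels must share the probability mass
print(p_sigmoid.round(3))  # each "on" label can be close to 1 by itself
```

In Keras terms this would correspond to `activation='sigmoid'` on the last layer and `keras.losses.BinaryCrossentropy()` as the loss.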

In [None]:
full_size = X_raw.shape[1]
activators = y_raw.shape[1]
network_nn =  keras.Sequential(name="Neural_Network")
#network_nn.add(keras.layers.Flatten()) # I'm not using images, so I don't think I need this.
network_nn.add(keras.layers.Dense(full_size, activation='relu'))
network_nn.add(keras.layers.Dense(full_size, activation='relu')) # My hardware can't handle this
#network_nn.add(keras.layers.Dense(full_size, activation='relu')) # My hardware can't handle this
network_nn.add(keras.layers.Dense(activators, activation='softmax'))
#loss_fn = keras.losses.MeanSquaredError(reduction="auto", name="mean_squared_error")
loss_fn = keras.losses.CategoricalCrossentropy()
opt = keras.optimizers.SGD()
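The RAM trouble is easier to see by counting weights. A `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch assumes the 7,000-feature slice and the 103 target columns described earlier, and 4 bytes per float32 weight; `dense_params` is a hypothetical helper.

```python
def dense_params(n_in, n_out):
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

full_size, activators = 7000, 103   # feature slice width, number of labels

hidden = dense_params(full_size, full_size)    # each 7000 -> 7000 hidden layer
output = dense_params(full_size, activators)   # 7000 -> 103 output layer

print(f"{hidden:,} parameters per hidden layer")
print(f"{output:,} parameters in the output layer")
print(f"~{hidden * 4 / 1e6:.0f} MB per hidden layer for float32 weights alone")
```

The per-layer count matches the 49,007,000 reported for the first layer in the model summary printed after training. Training needs more than the weights themselves (gradients and activations at least), so each extra 7000-unit layer adds considerably to memory use.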

## Run the models
The decision tree model is first because it takes less processing, less RAM, and less time. The Neural Net is second because it takes more of all of those things.

### Decision Tree model
Very simple: fit the data to the tree. Call the `.fit()` method and let it run.

In [None]:
# Fitting the data
tree.fit(X_train, y_train);

### Decision Tree Performance
The performance of the tree is accessible through the `.score()` method of `DecisionTreeClassifier`. Handy!



In [None]:
print(f"Training score: {tree.score(X_train,y_train):0.5f}")
print(f"Testing score : {tree.score(X_test,y_test):0.5f}")


Training score: 0.99367
Testing score : 0.14286
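When reading these scores, it helps to know that for multi-label targets scikit-learn's `.score()` reports subset accuracy: a prediction only counts as correct if every label for that article matches exactly. The toy NumPy sketch below (made-up labels, not model output) contrasts that with a per-label accuracy, which gives credit for partially correct rows.

```python
import numpy as np

# Made-up true and predicted label matrices: 4 articles x 5 labels.
y_true = np.array([[1, 0, 0, 1, 0],
                   [0, 1, 0, 0, 0],
                   [1, 1, 0, 0, 1],
                   [0, 0, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 1, 0],   # exact match
                   [0, 1, 1, 0, 0],   # one extra label
                   [1, 0, 0, 0, 1],   # one missing label
                   [0, 0, 1, 0, 0]])  # exact match

subset_accuracy = np.all(y_true == y_pred, axis=1).mean()   # whole row must match
per_label_accuracy = (y_true == y_pred).mean()              # each cell counted separately

print(subset_accuracy)     # 0.5
print(per_label_accuracy)  # 0.9
```

With 103 labels per article, a single wrong label makes the whole row count as a miss, so a low subset accuracy does not necessarily mean the individual label predictions are poor.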


### Neural Network model
There are a lot of limitations going on with the neural net here. The fact that I can't use the entire feature set probably plays a part, although 47k+ features is a huge number of features. One of two things must be happening: either neural nets are not well suited to this type of problem, or I don't yet have the skill to set up a neural net for this particular problem. There are enough confounding variables that I'm not sure which it is.

I found the printout of the loss and accuracy from Module 9 useful, so I used that callback here. I also read the GitHub issue it came from, but I don't think I was having that issue with the Neural Net. I did try the net without it, just to see, and I didn't think the output was as useful.

In [None]:
# Structure based on https://github.com/keras-team/keras/issues/2548
# Taken from module 9
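class EvaluateCallback(keras.callbacks.Callback):
    """Evaluate the model on held-out data at the end of every epoch."""
    def __init__(self, test_data):
        super().__init__()  # let the base Callback set up its own state
        self.test_data = test_data  # (features, targets) tuple

    def on_epoch_end(self, epoch, logs=None):
        logs = logs if logs is not None else {}
        x, y = self.test_data
        loss, acc = self.model.evaluate(x, y, verbose=0)
        # Keep the test metrics alongside the training metrics in the logs
        if 'test loss' not in logs:
            logs['test loss'] = []
            logs['test acc'] = []
        logs['test loss'] += [loss]
        logs['test acc'] += [acc]
        print('Testing loss: {}, acc: {}\n'.format(round(loss, 4), round(acc, 4)))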

The neural network accuracy depends on a lot of factors, but I have yet to see a run reach 0.1. Changing the batch size can help, to a point, but mostly it makes training and evaluating take longer. Running more than 10 epochs doesn't help the accuracy either, but it's easy enough to change the number to see for yourself.

In [None]:
network_nn_epochs = 10
network_nn.compile(loss=loss_fn, optimizer=opt, metrics=['accuracy'])
history = network_nn.fit(X_train, y_train, batch_size=50, epochs=network_nn_epochs, verbose=1,
                         callbacks=[EvaluateCallback((X_test, y_test))])
network_nn.summary()

Epoch 1/15
Testing loss: 0.0314, acc: 0.0078

Epoch 2/15
Testing loss: 0.0314, acc: 0.0078

Epoch 3/15
Testing loss: 0.0314, acc: 0.0078

Epoch 4/15
Testing loss: 0.0314, acc: 0.0078

Epoch 5/15
Testing loss: 0.0314, acc: 0.0078

Epoch 6/15
Testing loss: 0.0314, acc: 0.0069

Epoch 7/15
Testing loss: 0.0314, acc: 0.006

Epoch 8/15
Testing loss: 0.0314, acc: 0.0052

Epoch 9/15
Testing loss: 0.0314, acc: 0.0052

Epoch 10/15
Testing loss: 0.0314, acc: 0.0052

Epoch 11/15
Testing loss: 0.0314, acc: 0.0043

Epoch 12/15
Testing loss: 0.0314, acc: 0.0052

Epoch 13/15
Testing loss: 0.0314, acc: 0.0052

Epoch 14/15
Testing loss: 0.0314, acc: 0.0043

Epoch 15/15
Testing loss: 0.0314, acc: 0.006

Model: "Neural_Network"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_14 (Dense)             (None, 7000)              49007000  
_________________________________________________________________
dense_15 (Dense)  

# Conclusions and Observations
First: don't use a giant dataset without serious computing power. Paring everything down to fit the limited RAM and GPU resources means I don't know much about which features are most important, and the sheer size of the feature list makes it impractical for me to read all of it.

Second: it's not difficult to set up a problem for an algorithm. The APIs are easy to use. Finding an algorithm that will work best for your dataset is a much more difficult problem.

Third: ambition can really make things more difficult than they have to be. I wanted to see what happens when you have a big dataset, and I found out that the problem might be intractable with hardware limitations.

Most importantly: I chose a large dataset and a classification problem (although not an image classification problem). It's similar in many ways to the iris or mushroom classifier problems, but with a lot less clarity. I knew this going in and thought that I could just set up the problem and let time and computing solve it. That was hubris on my part, combined with not yet understanding the exact limitations of the software and the hardware.

The Decision Tree has a much greater accuracy than chance, while the Neural Net is roughly at chance. I think a huge part of that is that I can't set up the Neural Net I want because of processing constraints. The Decision Tree does much better, and I think that's because a decision tree is very good at finding a feature on which to make an A/B choice. It isn't a perfect classifier though, and I think for this dataset it would perform better with access to more of the features, so that it could find more ways to classify. The other issue that I think comes into play is that some articles in the dataset qualify as multiple types of article. It would be like classifying a tiger, a bobcat, and a housecat: they should all be cats, but some of them are also big cats or wild cats.
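The cat analogy can be sketched in code. Many of the RCV1 codes appear to nest by prefix (`C1511` under `C151` under `C15`), with the whole branch sitting under a top-level `*CAT` code. The helper below is a hypothetical illustration built on that reading; both the prefix rule and the first-letter mapping to a top-level category are assumptions that have not been checked against the official hierarchy.

```python
def ancestors(code, known_codes):
    """Return the codes that `code` nests under, assuming prefix-based nesting."""
    if code.endswith("CAT"):
        return []                      # already a top-level category
    found = []
    # Shorter prefixes of the code that are themselves known codes.
    for end in range(len(code) - 1, 0, -1):
        if code[:end] in known_codes:
            found.append(code[:end])
    # Assumption: the first letter names the top-level category.
    top = code[0] + "CAT"
    if top in known_codes:
        found.append(top)
    return found

# A few codes copied from the code_names dictionary earlier in the notebook.
known = {"CCAT", "C15", "C151", "C1511", "GCAT", "GSPO"}

print(ancestors("C1511", known))  # ['C151', 'C15', 'CCAT']
print(ancestors("GSPO", known))   # ['GCAT']
```

Under that assumption, an article tagged `C1511` would also be expected to carry `C151`, `C15`, and `CCAT`, which is the kind of overlap the analogy describes.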

You can see in the segment below a list of all of the article types and that there are several that overlap.

In [11]:
for e in rcv1.target_names:
  print(code_names[e])

STRATEGY/PLANS
LEGAL/JUDICIAL
REGULATION/POLICY
SHARE LISTINGS
PERFORMANCE
ACCOUNTS/EARNINGS
ANNUAL RESULTS
COMMENT/FORECASTS
INSOLVENCY/LIQUIDITY
FUNDING/CAPITAL
SHARE CAPITAL
BONDS/DEBT ISSUES
LOANS/CREDITS
CREDIT RATINGS
OWNERSHIP CHANGES
MERGERS/ACQUISITIONS
ASSET TRANSFERS
PRIVATISATIONS
PRODUCTION/SERVICES
NEW PRODUCTS/SERVICES
RESEARCH/DEVELOPMENT
CAPACITY/FACILITIES
MARKETS/MARKETING
DOMESTIC MARKETS
EXTERNAL MARKETS
MARKET SHARE
ADVERTISING/PROMOTION
CONTRACTS/ORDERS
DEFENCE CONTRACTS
MONOPOLIES/COMPETITION
MANAGEMENT
MANAGEMENT MOVES
LABOUR
ECONOMICS
ECONOMIC PERFORMANCE
MONETARY/ECONOMIC
MONEY SUPPLY
INFLATION/PRICES
CONSUMER PRICES
WHOLESALE PRICES
CONSUMER FINANCE
PERSONAL INCOME
CONSUMER CREDIT
RETAIL SALES
GOVERNMENT FINANCE
EXPENDITURE/REVENUE
GOVERNMENT BORROWING
OUTPUT/CAPACITY
INDUSTRIAL PRODUCTION
CAPACITY UTILIZATION
INVENTORIES
EMPLOYMENT/LABOUR
UNEMPLOYMENT
TRADE/RESERVES
BALANCE OF PAYMENTS
MERCHANDISE TRADE
RESERVES
HOUSING STARTS
LEADING INDICATORS
ECONOMIC/SO