# Schibsted NLP Machine Learning Challenge #

In this task, you’ll be doing text classification on an open data set from Reuters. The task
formulation is a bit vague on purpose, enabling you to make your own choices where needed.
<br>
<br>

### Instructions ###

* Download the [Reuters data set](http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz)
* Your task is to classify the documents to the categories found in TOPICS
* Adhere to the train/test split specified in the XML as LEWISSPLIT
* If you have time, evaluate and report the results
* Please submit a running piece of code, documented as you please (in the code, separate docs, document by tests, whatever makes the code understandable)
* We are interested in seeing your ideas, and your coding style/skills. Don’t worry if the classification results are sub-optimal
* If you have ideas on how to improve the solution further, please write that down as well
* Spend approximately 3 hours on the task
<br>
<br>

### Introduction ###

The Reuters dataset is given as 22 files in SGM format. The `.sgm` extension denotes SGML (Standard Generalized Markup Language) files. SGML is a markup language rather than a programming language: it describes the structure of a document using custom tags, with the permitted tags declared in a DTD file.

Each SGM file contains the body of an article, together with other important information, such as the topic(s) of the article, as well as whether it belongs to the training or testing set.

While researching how to parse the SGM files (they could not be parsed with Python's `lxml` library), we ran across a [project](https://github.com/ankailou/reuters-preprocessing) on GitHub, written by Ankai Lou. It not only parses the SGM files, but also performs a number of preprocessing and feature engineering steps (described below), which prepare our data for modeling. After inspecting the code in detail, we realized that using this library would greatly simplify the approach and solution of the classification problem.

It is our strong belief that Data Scientists should not reinvent the wheel, but build upon the previous work of colleagues, in the interest of efficiency and productivity. After all, one will rarely code a cross-validation scheme from scratch, but rather rely on a pre-existing implementation, such as `scikit-learn`'s `cross_validate` or `KFold` functionality.

One small disadvantage of the existing `reuters-preprocessing` package is that it is written for Python 2. While it is possible to have multiple versions of Python on the same machine, this may not be practical. Thus, in the interest of full reproducibility, we forked the project and amended the code to work in Python 3. We also changed some of the functionality, namely the way the topics are generated, as well as the way TF-IDF values are calculated (instead of the original author's implementation, we opted to use `scikit-learn`'s `TfidfVectorizer`).

The forked and updated version can be found [here](https://github.com/ssantic/reuters-preprocessing), or directly cloned through this notebook:

In [None]:
!git clone https://github.com/ssantic/reuters-preprocessing

The `reuters-preprocessing` library does a number of things:

* For each file, the `BeautifulSoup4` library is used to generate a parse-tree from the SGML using the built-in Python `html.parser` library:
    * For each parse-tree, article blocks - delimited by `<reuters>` - are separated into strings.
    * For each article, the text in `<topics>` and `<places>` delimited by `<d>` is used as class labels; the text in `<title>` and `<body>` is extracted for tokenization.

* For each title/body text block, the `NLTK` and `string` libraries are used for tokenization:     
    * For each field, digits & unicode symbols are replaced by `None` using the `string` library.
    * For each field, punctuation symbols are replaced by `None` using the `string` library.
    * For each field, text blocks are tokenized into lists using `nltk.word_tokenize()`.
    * For the tokens, `stopwords` from `nltk.corpus` are filtered from the lists.
    * For the tokens, non-English words are filtered via `nltk.corpus.wordnet.synsets()`.
    * For the tokens, lemmatization is applied via `nltk.stem.wordnet.WordNetLemmatizer()`.
    * For the tokens, tokens are stemmed via `nltk.stem.porter.PorterStemmer()`.
    * For each stem list, the word stems shorter than 4 characters are filtered from the list.
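
The character-stripping and filtering steps in the list above can be sketched with the standard library alone. This is a simplified illustration, not the library's code: the `clean_tokens` helper and the tiny `STOPWORDS` set are hypothetical, whitespace splitting stands in for `nltk.word_tokenize()`, and the WordNet, lemmatization and stemming steps are left out (so the length filter here acts on whole tokens rather than stems).

```python
import string

# Hypothetical mini stopword list; the pipeline itself uses nltk.corpus.stopwords.
STOPWORDS = {"the", "and", "for", "was", "with", "that", "this", "from"}

# Mapping each digit and punctuation character to None makes str.translate delete it.
DELETE_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def clean_tokens(text, min_length=4):
    """Lowercase, strip digits/punctuation, split on whitespace, then filter tokens."""
    stripped = text.lower().translate(DELETE_TABLE)
    tokens = stripped.split()
    return [t for t in tokens if t not in STOPWORDS and len(t) >= min_length]


print(clean_tokens("The cocoa harvest for 1987 was 4,500 tonnes, up 12% from 1986."))
# ['cocoa', 'harvest', 'tonnes']
```

The third argument of `str.maketrans` is what the "replaced by `None`" wording refers to: every listed character is mapped to `None` and therefore dropped.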

The final output of the text extraction & processing phase is a list of documents:

```
# one dict per article
documents = [{'topics': [], 'places': [], 'words': {'title': [], 'body': []}}]
```
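
As a rough sketch of how one article could be turned into such a record, the snippet below parses a hand-written SGML fragment with `BeautifulSoup` and `html.parser`. The `SAMPLE` text and the `parse_articles` helper are made up for illustration, and the text fields are only split on whitespace here rather than run through the full tokenization described above.

```python
from bs4 import BeautifulSoup

# Hand-written SGML in the shape of a Reuters-21578 article (illustrative, not real data).
SAMPLE = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D><D>coffee</D></TOPICS>
<PLACES><D>brazil</D></PLACES>
<TEXT>
<TITLE>EXAMPLE COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in the growing zone.</BODY>
</TEXT>
</REUTERS>"""


def parse_articles(sgml_text):
    """Return one record per <reuters> block, with whitespace-split text fields."""
    tree = BeautifulSoup(sgml_text.lower(), "html.parser")
    records = []
    for article in tree.find_all("reuters"):
        topics = article.find("topics")
        places = article.find("places")
        title = article.find("title")
        body = article.find("body")
        records.append({
            "topics": [d.get_text() for d in topics.find_all("d")] if topics else [],
            "places": [d.get_text() for d in places.find_all("d")] if places else [],
            "words": {
                "title": title.get_text().split() if title else [],
                "body": body.get_text().split() if body else [],
            },
        })
    return records


documents = parse_articles(SAMPLE)
print(documents[0]["topics"])          # ['cocoa', 'coffee']
print(documents[0]["words"]["title"])  # ['example', 'cocoa', 'review']
```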

From this list, a lexicon is generated for all unique words in titles and body fields:

```
lexicon = {'title': set(), 'body': set()}
```
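
A minimal sketch of how such a lexicon could be collected from the document list is shown below. The `build_lexicon` helper and the two toy documents are hypothetical and only follow the structure given above.

```python
def build_lexicon(documents):
    """Collect the unique title and body tokens across all documents."""
    lexicon = {"title": set(), "body": set()}
    for document in documents:
        lexicon["title"].update(document["words"]["title"])
        lexicon["body"].update(document["words"]["body"])
    return lexicon


# Toy documents in the structure described above (made-up content).
documents = [
    {"topics": ["cocoa"], "places": [],
     "words": {"title": ["cocoa", "review"], "body": ["harvest", "cocoa"]}},
    {"topics": [], "places": ["usa"],
     "words": {"title": ["grain", "review"], "body": ["wheat", "harvest"]}},
]

lexicon = build_lexicon(documents)
print(sorted(lexicon["title"]))  # ['cocoa', 'grain', 'review']
print(sorted(lexicon["body"]))   # ['cocoa', 'harvest', 'wheat']
```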

This concludes the text extraction and processing phase and prepares the file input for feature selection.

<br>
<br>

### Rationale ###

Several portions of the text processing & tokenization phase are selected for specific reasons:

* Digits, unicode characters, and punctuation symbols are removed from the text because digits & meta-characters provide less valuable information to article context than actual words.
* Stopwords are removed from the text because words such as `the` are frequently present yet provide no contextual value. Though TF-IDF weighting would down-weight stopwords anyway, removing them at tokenization, a single linear-time pass, spares the TF-IDF step from computing weights for them - a desirable improvement in performance.
* Non-English words are filtered from the text because the stemmer used is the English Porter stemmer; therefore, stemming non-English words is likely to produce erroneous data and artificially inflate the size of the lexicon - which will increase the runtime.
* Tokens are lemmatized to reduce the dataset by removing tenses before stemming.
* Tokens are stemmed to reduce the dataset further by minimizing the size of the lexicon.
* Stems shorter than 4 characters are filtered because such short stems appear frequently in articles yet contribute little to classification - similar to stop words.

This minification of the tokens & lexicon ensures a minimum number of calculations during the selection phase, while not losing valuable information or context. The same feature reduction process was used for all feature vectors because filtering out low-value words & stems from the data is low-risk/high-reward for data quality and runtime. A unified text processing methodology also reduces the runtime.

The `TfidfVectorizer` class in `scikit-learn`'s `sklearn.feature_extraction.text` module computes TF-IDF weights for a set of documents. We use the built-in `scikit-learn` TF-IDF on the assumption that it is far more sophisticated than any TF-IDF weight generator we could code by hand. The purpose of generating this dataset is to observe the differences between a naive TF-IDF weighting and a far more complex iteration. The weighting and calculations are all left to `scikit-learn`.
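
For orientation, here is a minimal sketch of `TfidfVectorizer` on three made-up, already-processed documents. It uses the default settings; the exact arguments passed in `preprocess.py` are not shown here and may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed documents: token lists joined back into strings.
docs = [
    "cocoa harvest zone",
    "grain wheat harvest",
    "cocoa cocoa export",
]

vectorizer = TfidfVectorizer()          # defaults: smoothed IDF, L2-normalised rows
tfidf = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

print(sorted(vectorizer.vocabulary_))   # the learned terms
print(tfidf.shape)                      # (documents, terms)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.linalg.norm(tfidf.toarray(), axis=1)
print(row_norms)
```

With the default smoothing, the IDF is computed as `ln((1 + n) / (1 + df)) + 1`, so a term that occurs in every document keeps a weight of 1 rather than dropping to zero. Explicit stopword removal therefore still changes the resulting features.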

We can run the preprocessing and feature engineering steps automatically, executing a command directly from the notebook (with the warning that it takes a while):

In [None]:
!python preprocess.py

This will generate a `dataset.csv` file, which we will be working with for the rest of this project.

<br>
<br>

### Imports ###

In [3]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import ast
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, classification_report
import tensorflow as tf
from tensorflow.keras import models, regularizers, layers, optimizers, losses, metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

In [4]:
## !python preprocess.py

In [6]:
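labels = []
for file in os.listdir('data'):
    # Use a context manager so the file is closed even if reading fails
    with open(os.path.join(os.getcwd(), "data", file), 'r') as data:
        text = data.read()
    tree = BeautifulSoup(text.lower(), "html.parser")
    print("Extracting train/test information from file", file)
    # Collect the articles once per file instead of re-searching the tree on every iteration
    articles = tree.find_all("reuters")
    for article in articles:
        # Anything not marked lewissplit="train" is labeled as test
        if 'lewissplit="train"' in str(article):
            labels.append('train')
        else:
            labels.append('test')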

Extracting train/test information from file reut2-000.sgm
Extracting train/test information from file reut2-001.sgm
Extracting train/test information from file reut2-002.sgm
Extracting train/test information from file reut2-003.sgm
Extracting train/test information from file reut2-004.sgm
Extracting train/test information from file reut2-005.sgm
Extracting train/test information from file reut2-006.sgm
Extracting train/test information from file reut2-007.sgm
Extracting train/test information from file reut2-008.sgm
Extracting train/test information from file reut2-009.sgm
Extracting train/test information from file reut2-010.sgm
Extracting train/test information from file reut2-011.sgm
Extracting train/test information from file reut2-012.sgm
Extracting train/test information from file reut2-013.sgm
Extracting train/test information from file reut2-014.sgm
Extracting train/test information from file reut2-015.sgm
Extracting train/test information from file reut2-016.sgm
Extracting tra

In [9]:
df = pd.read_csv('C:/Schibsted-Homework/dataset.csv', sep='\t')

In [10]:
df.head()

Unnamed: 0,id,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,...,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig,class-label:topics,Unnamed: 7836
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.059531,0.0,0.0,['cocoa'],
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"['grain', 'wheat', 'corn', 'barley', 'oat', 's...",


In [11]:
# Replace unlabeled documents with 'unlabeled'
df['class-label:topics'] = df['class-label:topics'].apply(lambda x: "['unlabeled']" if x == '[]' else x)

In [12]:
# Extract only the first topic in cases where there were multiple topics
topics = df['class-label:topics'].apply(lambda x: ast.literal_eval(x)[0]).tolist()
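
The label column stores Python list literals as strings, so the two cells above amount to parsing each string and keeping its first element. The standalone sketch below shows the effect on a few example values; the `first_topic` helper is hypothetical and folds the 'unlabeled' replacement into the same step.

```python
import ast


def first_topic(raw_label):
    """Parse a stringified list and return its first topic, or 'unlabeled' if empty."""
    parsed = ast.literal_eval(raw_label)
    return parsed[0] if parsed else "unlabeled"


raw_labels = ["['cocoa']", "[]", "['grain', 'wheat', 'corn']"]
print([first_topic(label) for label in raw_labels])
# ['cocoa', 'unlabeled', 'grain']
```

Note that this turns a multi-label problem into a single-label one: for the third example, `wheat` and `corn` are discarded.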

In [13]:
# Add the new 'topics' column
df['topics'] = topics

In [14]:
# Delete unnecessary columns
del df['id']
del df['class-label:topics']
del df['Unnamed: 7836']

In [15]:
# Add the labels column for the train/test split
df['labels'] = labels

In [16]:
df['labels'].value_counts()

train    14668
test      6910
Name: labels, dtype: int64

In [17]:
df['topics'].unique()

array(['cocoa', 'unlabeled', 'grain', 'veg-oil', 'earn', 'acq', 'wheat',
       'copper', 'housing', 'money-supply', 'coffee', 'sugar', 'trade',
       'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas', 'cpi',
       'money-fx', 'interest', 'gnp', 'soybean', 'meal-feed', 'alum',
       'tea', 'oilseed', 'gold', 'tin', 'strategic-metal', 'livestock',
       'retail', 'ipi', 'iron-steel', 'rubber', 'propane', 'heat', 'jobs',
       'lei', 'bop', 'zinc', 'orange', 'pet-chem', 'dlr', 'gas', 'silver',
       'wpi', 'fishmeal', 'hog', 'lumber', 'tapioca', 'instal-debt',
       'lead', 'potato', 'l-cattle', 'rice', 'nickel', 'inventories',
       'cpu', 'corn', 'fuel', 'jet', 'income', 'rand', 'platinum',
       'saudriyal', 'nzdlr', 'palm-oil', 'coconut', 'rapeseed', 'stg',
       'groundnut', 'wool', 'austdlr', 'soy-meal', 'plywood', 'barley',
       'cruzado', 'yen', 'f-cattle', 'hk', 'naphtha'], dtype=object)

In [18]:
df_train = df[df['labels'] == 'train']
y_train = df_train['topics']
X_train = df_train.drop(['labels', 'topics'], axis=1)

In [19]:
df_test = df[df['labels'] == 'test']
y_test = df_test['topics']
X_test = df_test.drop(['labels', 'topics'], axis=1)

In [20]:
X_train.head()

Unnamed: 0,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,abidjan,...,ziegler,zimbabw,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.059531,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
X_test.head()

Unnamed: 0,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,abidjan,...,ziegler,zimbabw,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig
12593,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [22]:
# Train the Random Forest classifier and generate predictions
rfc = RandomForestClassifier(random_state=176, n_estimators=1000).fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)

In [39]:
accuracy = accuracy_score(y_test, rfc_preds)
balanced_acc = balanced_accuracy_score(y_test, rfc_preds)
precision = precision_score(y_test, rfc_preds, average='macro')
recall = recall_score(y_test, rfc_preds, average='macro')
f1 = f1_score(y_test, rfc_preds, average='macro')

In [40]:
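# Named metric_values so that the imported tensorflow.keras `metrics` module is not shadowed
metric_values = {'Metrics': [accuracy, balanced_acc, precision, recall, f1]}
metrics_df = pd.DataFrame(metric_values, index=['Accuracy', 'Balanced Accuracy', 'Precision', 'Recall', 'F1 Score'])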
metrics_df

Unnamed: 0,Metrics
Accuracy,0.719103
Balanced Accuracy,0.18414
Precision,0.435045
Recall,0.181509
F1 Score,0.224871
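
Accuracy and balanced accuracy can diverge widely when a few classes dominate, because balanced accuracy is the unweighted mean of per-class recall. The toy example below uses made-up labels and plain Python (not the `scikit-learn` functions) to show how a classifier that only ever predicts the majority class scores well on one metric and poorly on the other.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / support[c] for c in support) / len(support)


# Toy labels (made up): one dominant class and two rare ones.
y_true = ["earn"] * 8 + ["cocoa"] + ["zinc"]
y_pred = ["earn"] * 10  # always predicts the majority class

print(accuracy(y_true, y_pred))           # 0.8
print(balanced_accuracy(y_true, y_pred))  # (1 + 0 + 0) / 3
```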


In [41]:
classification_report(y_test, rfc_preds)

'                 precision    recall  f1-score   support\n\n            acq       0.87      0.74      0.80       791\n           alum       1.00      0.05      0.09        22\n            bop       1.00      0.18      0.31        22\n        carcass       0.00      0.00      0.00         9\n          cocoa       1.00      0.26      0.42        19\n        coconut       0.00      0.00      0.00         1\n         coffee       1.00      0.59      0.74        27\n         copper       1.00      0.20      0.33        25\n           corn       0.00      0.00      0.00         1\n         cotton       1.00      0.09      0.17        11\n            cpi       0.64      0.35      0.45        26\n            cpu       0.00      0.00      0.00         1\n          crude       0.91      0.55      0.69       212\n            dlr       0.00      0.00      0.00        26\n           earn       0.67      0.92      0.78      1106\n       f-cattle       0.00      0.00      0.00         2\n           