# Schibsted NLP Machine Learning Challenge #

In this task, you’ll be doing text classification on an open data set from Reuters. The task
formulation is a bit vague on purpose, enabling you to make your own choices where needed.
<br>
<br>

### Instructions ###

* Download the [Reuters data set](http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz)
* Your task is to classify the documents to the categories found in TOPICS
* Adhere to the train/test split specified in the XML as LEWISSPLIT
* If you have time, evaluate and report the results
* Please submit a running piece of code, documented as you please (in the code, in separate docs, through tests, or whatever makes the code understandable)
* We are interested in seeing your ideas, and your coding style/skills. Don’t worry if the classification results are sub-optimal
* If you have ideas on how to improve the solution further, please write that down as well
* Spend approximately 3 hours on the task
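
To make the two fields named in the instructions concrete, here is a minimal sketch of pulling `LEWISSPLIT` and `TOPICS` out of a single record. The sample record is hand-written in the style of the data set (it is not copied from it), `parse_record` is a hypothetical helper, and plain regular expressions are used only to keep the sketch free of third-party dependencies; the notebook itself relies on BeautifulSoup and a separate `preprocess.py`.

```python
import re

# Hand-written sample in the style of a Reuters-21578 record (illustrative only).
SAMPLE = '''<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="1" NEWID="1">
<TOPICS><D>grain</D><D>wheat</D></TOPICS>
<TEXT><TITLE>SAMPLE TITLE</TITLE><BODY>Sample body text.</BODY></TEXT>
</REUTERS>'''


def parse_record(record):
    """Return (split, topics) for one <REUTERS>...</REUTERS> record (hypothetical helper)."""
    # The split is an attribute on the opening <REUTERS> tag.
    split_match = re.search(r'LEWISSPLIT="([^"]*)"', record)
    split = split_match.group(1).lower() if split_match else None
    # Each topic sits in its own <D> element inside the <TOPICS> element.
    topics_match = re.search(r"<TOPICS>(.*?)</TOPICS>", record, re.S)
    topics = re.findall(r"<D>(.*?)</D>", topics_match.group(1)) if topics_match else []
    return split, topics


split, topics = parse_record(SAMPLE)
print("%s %s" % (split, topics))
```

A record can carry several topics or none at all, which is why the function returns a list rather than a single label.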

In [1]:
import pandas as pd
import os
from bs4 import BeautifulSoup
import ast
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score, classification_report
import tensorflow as tf
from keras import models, regularizers, layers, optimizers, losses, metrics
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils, to_categorical

ImportError: No module named tensorflow

In [None]:
!python preprocess.py

In [5]:
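labels = []
# Sort the file names so the label order is deterministic
for filename in sorted(os.listdir('data')):
    # The context manager closes the file even if reading fails
    with open(os.path.join(os.getcwd(), "data", filename), 'r') as data:
        text = data.read()
    # Lowercase the markup so tag and attribute matching is case-insensitive
    tree = BeautifulSoup(text.lower(), "html.parser")
    print("Extracting train/test information from file " + filename)
    # Search the parse tree once per file instead of once per document
    for doc in tree.find_all("reuters"):
        # Every document not marked lewissplit="train" is labeled 'test'
        if 'lewissplit="train"' in str(doc):
            labels.append('train')
        else:
            labels.append('test')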

Extracting train/test information from file reut2-000.sgm
Extracting train/test information from file reut2-001.sgm
Extracting train/test information from file reut2-002.sgm
Extracting train/test information from file reut2-003.sgm
Extracting train/test information from file reut2-004.sgm
Extracting train/test information from file reut2-005.sgm
Extracting train/test information from file reut2-006.sgm
Extracting train/test information from file reut2-007.sgm
Extracting train/test information from file reut2-008.sgm
Extracting train/test information from file reut2-009.sgm
Extracting train/test information from file reut2-010.sgm
Extracting train/test information from file reut2-011.sgm
Extracting train/test information from file reut2-012.sgm
Extracting train/test information from file reut2-013.sgm
Extracting train/test information from file reut2-014.sgm
Extracting train/test information from file reut2-015.sgm
Extracting train/test information from file reut2-016.sgm
Extracting tra

In [6]:
df = pd.read_csv('C:/reuters-preprocessing/dataset3.csv', sep='\t')

In [7]:
df.head()

Unnamed: 0,id,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,...,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig,class-label:topics,Unnamed: 7836
0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.059531,0.0,0.0,['cocoa'],
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
2,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
3,3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,[],
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"['grain', 'wheat', 'corn', 'barley', 'oat', 's...",


In [8]:
# Replace unlabeled documents with 'unlabeled'
df['class-label:topics'] = df['class-label:topics'].apply(lambda x: "['unlabeled']" if x == '[]' else x)

In [9]:
# Keep only the first topic when a document has several
topics = [ast.literal_eval(raw)[0] for raw in df['class-label:topics']]

In [10]:
# Add the new 'topics' column
df['topics'] = topics
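
The three cells above turn a stringified list of topics into a single label. As a small worked example of that transformation, the sketch below applies the same logic to a few sample strings modeled on the `class-label:topics` values shown by `df.head()`. The helper name `first_topic` is hypothetical, and it folds the `'unlabeled'` replacement and the first-topic extraction into one step for brevity.

```python
import ast


def first_topic(raw):
    """Parse a stringified topic list and return its first entry (hypothetical helper)."""
    topics = ast.literal_eval(raw)
    # An empty list means the document has no topic at all.
    return topics[0] if topics else "unlabeled"


samples = ["['cocoa']", "[]", "['grain', 'wheat', 'corn']"]
first_topics = [first_topic(s) for s in samples]
print(first_topics)
```

Note that the third sample loses `wheat` and `corn`: reducing each document to its first topic turns a multi-label problem into a single-label one.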

In [11]:
# Delete unnecessary columns
del df['id']
del df['class-label:topics']
del df['Unnamed: 7836']

In [12]:
# Add the labels column for the train/test split
df['labels'] = labels

In [13]:
df['labels'].value_counts()

train    14668
test      6910
Name: labels, dtype: int64

In [14]:
df['topics'].unique()

array(['cocoa', 'unlabeled', 'grain', 'veg-oil', 'earn', 'acq', 'wheat',
       'copper', 'housing', 'money-supply', 'coffee', 'sugar', 'trade',
       'reserves', 'ship', 'cotton', 'carcass', 'crude', 'nat-gas', 'cpi',
       'money-fx', 'interest', 'gnp', 'soybean', 'meal-feed', 'alum',
       'tea', 'oilseed', 'gold', 'tin', 'strategic-metal', 'livestock',
       'retail', 'ipi', 'iron-steel', 'rubber', 'propane', 'heat', 'jobs',
       'lei', 'bop', 'zinc', 'orange', 'pet-chem', 'dlr', 'gas', 'silver',
       'wpi', 'fishmeal', 'hog', 'lumber', 'tapioca', 'instal-debt',
       'lead', 'potato', 'l-cattle', 'rice', 'nickel', 'inventories',
       'cpu', 'corn', 'fuel', 'jet', 'income', 'rand', 'platinum',
       'saudriyal', 'nzdlr', 'palm-oil', 'coconut', 'rapeseed', 'stg',
       'groundnut', 'wool', 'austdlr', 'soy-meal', 'plywood', 'barley',
       'cruzado', 'yen', 'f-cattle', 'hk', 'naphtha'], dtype=object)

In [15]:
df_train = df[df['labels'] == 'train']
y_train = df_train['topics']
X_train = df_train.drop(['labels', 'topics'], axis=1)

In [16]:
df_test = df[df['labels'] == 'test']
y_test = df_test['topics']
X_test = df_test.drop(['labels', 'topics'], axis=1)

In [17]:
X_train.head()

Unnamed: 0,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,abidjan,...,ziegler,zimbabw,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.059531,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
X_test.head()

Unnamed: 0,aaron,abandon,abat,abbey,abbrevi,abduct,abel,abet,abid,abidjan,...,ziegler,zimbabw,zimbabwean,zimmer,zinc,zirconium,zloti,zone,zurich,zweig
12593,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12595,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12596,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12597,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
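
Both previews still contain the `Unnamed: 0` column, which holds the row number and is therefore passed to the classifier as if it were a feature. A minimal sketch of one way to avoid that, assuming the column comes from an unnamed index column in the CSV: pass `index_col=0` to `pd.read_csv`. The tiny in-memory CSV below is made up for illustration and only mimics the layout of `dataset3.csv`.

```python
import io

import pandas as pd

# Made-up, tab-separated sample whose first header cell is empty,
# which is what produces an 'Unnamed: 0' column by default.
csv_text = u"\tid\tzinc\ttopics\n0\t0\t0.0\t['cocoa']\n1\t1\t0.5\t[]\n"

# Default behavior: the unnamed first column becomes a regular column.
default = pd.read_csv(io.StringIO(csv_text), sep="\t")

# With index_col=0 the same column becomes the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), sep="\t", index_col=0)

print(list(default.columns))
print(list(indexed.columns))
```

Dropping the column from `X_train` and `X_test` before fitting would have a similar effect; whether it changes the scores reported here has not been checked.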


In [19]:
# Train the Random Forest classifier and generate predictions
rfc = RandomForestClassifier(random_state=176, n_estimators=1000).fit(X_train, y_train)
rfc_preds = rfc.predict(X_test)

In [38]:
accuracy = accuracy_score(y_test, rfc_preds)
balanced_acc = balanced_accuracy_score(y_test, rfc_preds)
precision = precision_score(y_test, rfc_preds, average='macro')
recall = recall_score(y_test, rfc_preds, average='macro')
f1 = f1_score(y_test, rfc_preds, average='macro')
class_report = classification_report(y_test, rfc_preds)

In [26]:
accuracy

0.7191027496382055

In [27]:
balanced_acc

0.18413962904849832

In [28]:
precision

0.4350451135948002

In [29]:
f1

0.22487136927420956

In [35]:
recall

0.18150906291923402
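
The gap between accuracy (about 0.72) and balanced accuracy (about 0.18) is easier to interpret with a toy example. Accuracy counts correct predictions over all documents, so large classes dominate it, while balanced accuracy is the unweighted mean of per-class recall, so every class counts equally. The sketch below recomputes both by hand; the function names and the toy labels are made up for illustration and only mirror what `accuracy_score` and `balanced_accuracy_score` calculate.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))


def per_class_recall(y_true, y_pred):
    """Recall for every class that occurs in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / float(support[label]) for label in support}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy data: one dominant class and two rare ones that are never predicted.
y_true = ["earn"] * 8 + ["cocoa", "tin"]
y_pred = ["earn"] * 10

acc = accuracy(y_true, y_pred)
bal = balanced_accuracy(y_true, y_pred)
print("accuracy=%.3f balanced=%.3f" % (acc, bal))
```

A classifier that always predicts the majority class scores well on accuracy but poorly on balanced accuracy, which is the same pattern as in the results above. The macro recall reported by `recall_score` is slightly lower than the balanced accuracy, most likely because it also averages over classes that appear only in the predictions, where recall counts as zero.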

In [41]:
print(class_report)

                 precision    recall  f1-score   support

            acq       0.87      0.74      0.80       791
           alum       1.00      0.05      0.09        22
            bop       1.00      0.18      0.31        22
        carcass       0.00      0.00      0.00         9
          cocoa       1.00      0.26      0.42        19
        coconut       0.00      0.00      0.00         1
         coffee       1.00      0.59      0.74        27
         copper       1.00      0.20      0.33        25
           corn       0.00      0.00      0.00         1
         cotton       1.00      0.09      0.17        11
            cpi       0.64      0.35      0.45        26
            cpu       0.00      0.00      0.00         1
          crude       0.91      0.55      0.69       212
            dlr       0.00      0.00      0.00        26
           earn       0.67      0.92      0.78      1106
       f-cattle       0.00      0.00      0.00         2