# Germeval 2019 - Task 1 (hierarchical classification)

This code is for Germeval 2019 Task 1. In this task, German blurbs that describe books are provided, and the challenge is to predict the genre classifications of the books from these blurbs.
In subtask a, only the highest level classification (8 classes) is used and in subtask b, the entire hierarchy is used (343 classes total).

Note that this is a multi-label problem (each book can have more than one label) with many classes (there are 8, 93 and 242 labels on the three levels of the hierarchy). Each book can also have several leaves on the last level of the hierarchy, for example a/b/c1 and a/b/c2.
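To make the label structure concrete, here is a minimal sketch (not part of the original notebook) of how one book with two leaf paths expands into a set of labels per level. The two paths and the helper name `labels_per_level` are hypothetical; the real parent-child relations come from `hierarchy.txt`.

```python
# Hypothetical example: one book carrying two paths through the hierarchy.
paths = [
    ["Literatur & Unterhaltung", "Krimi & Thriller", "Psychothriller"],
    ["Literatur & Unterhaltung", "Krimi & Thriller", "Justizthriller"],
]

def labels_per_level(paths):
    """Collect the distinct labels found at each depth, preserving order."""
    levels = {}
    for path in paths:
        for depth, label in enumerate(path):
            bucket = levels.setdefault(depth, [])
            if label not in bucket:
                bucket.append(label)
    return levels

levels = labels_per_level(paths)
for depth in sorted(levels):
    print(depth, levels[depth])
```

The book ends up with one label on levels 1 and 2 but two labels on level 3, which is why each level is treated as a set of independent binary targets further down.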

The approach uses a combination of logistic regression and Naive Bayes. Best results on the dev-set (checked with the provided Python script and the gold file blurbs_dev_participants.txt):
* F1-Score Task A: 0.826775214835
* F1-Score Task B: 0.618365180467

Tokenization Parameters:
* spaCy was used as the tokenizer
* unicode accents were stripped
* casing was kept
* no lemmatization
* no stopword removal

Vectorization Parameters:
* Only words that appeared in at least 4 documents were used
* Words that appeared in more than 40% of documents were ignored
* Inverse document-frequency reweighting
* Sublinear term frequency scaling
* IDF smoothing
* N-gram ranges of (1,1) and (1,2) were used for the two submitted entries

Logistic Regression Parameters:
* Liblinear solver with a maximum number of 1000 iterations, automatic multiclass fitting and balanced class weights
* L2 regularization with C=40.0
* No dual formulation

Final competition results:
* Entry 1: 0.82 / 0.62
* Entry 2: 0.82 / 0.61

Link to the competition: https://www.inf.uni-hamburg.de/en/inst/ab/lt/resources/data/germeval-2019-hmc.html

Note: This notebook is inspired by Jeremy Howard's post about a strong baseline system: https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline

# 0) Imports and setup

In [1]:
# Time the entire notebook
import timeit
start_time = timeit.default_timer()

## Imports

In [2]:
import pandas as pd, numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

import re


In [3]:
import sys, sklearn
print(sys.version)
print(sklearn.__version__)
print(pd.__version__)
print(np.__version__)

3.5.0 |Anaconda custom (64-bit)| (default, Oct 20 2015, 14:39:26) 
[GCC 4.2.1 (Apple Inc. build 5577)]
0.20.0
0.23.4
1.11.3


In [4]:
import spacy
print(spacy.__version__)

2.0.12


In [5]:
!anaconda --version
!ipython --version
!jupyter --version

anaconda Command line client (version 1.1.0)
4.0.0
4.0.6


## Files

In [6]:
hierarchy_file = 'blurbs/hierarchy.txt'

# Use this for testing (testset with goldfile)
train_file = 'blurbs/blurbs_train.txt'
test_file = 'blurbs/blurbs_dev_participants.txt'

# Use this for the submission (bigger training set, correct testset) 
# train_file = 'blurbs/blurbs_train_and_dev.txt'
# test_file = 'blurbs/blurbs_test_nolabel.txt'

In [7]:
# A dictionary of all labels. Taken from utilities.py from the organizers of Germeval 2019
all_labels = {0: [u"Ratgeber", u"Kinderbuch & Jugendbuch", u"Literatur & Unterhaltung", u"Sachbuch", u"Ganzheitliches Bewusstsein", u"Architektur & Garten", u"Glaube & Ethik", u"Künste"],
1: [u"Eltern & Familie", u"Echtes Leben, Realistischer Roman", u"Abenteuer", u"Märchen, Sagen", u"Lyrik, Anthologien, Jahrbücher", u"Frauenunterhaltung", u"Fantasy", u"Kommunikation & Beruf", u"Lebenshilfe & Psychologie", u"Krimi & Thriller", u"Freizeit & Hobby", u"Liebe, Beziehung und Freundschaft", u"Familie", u"Natur, Wissenschaft, Technik", u"Fantasy und Science Fiction", u"Geister- und Gruselgeschichten", u"Schicksalsberichte", u"Romane & Erzählungen", u"Science Fiction", u"Politik & Gesellschaft", u"Ganzheitliche Psychologie", u"Natur, Tiere, Umwelt, Mensch", u"Psychologie", u"Lifestyle", u"Sport", u"Lebensgestaltung", u"Essen & Trinken", u"Gesundheit & Ernährung", u"Kunst, Musik", u"Architektur", u"Biographien & Autobiographien", u"Romance", u"Briefe, Essays, Gespräche", u"Kabarett & Satire", u"Krimis und Thriller", u"Erotik", u"Historische Romane", u"Theologie", u"Beschäftigung, Malen, Rätseln", u"Schulgeschichten", u"Biographien", u"Kunst", u"(Zeit-) Geschichte", u"Ganzheitlich Leben", u"Garten & Landschaftsarchitektur", u"Körper & Seele", u"Energieheilung", u"Abenteuer, Reisen, fremde Kulturen", u"Historische Romane, Zeitgeschichte", u"Klassiker & Lyrik", u"Fotografie", u"Design", u"Beauty & Wellness", u"Kunst & Kultur", u"Mystery", u"Ratgeber Partnerschaft & Sexualität", u"Detektivgeschichten", u"Spiritualität & Religion", u"Sachbuch Philosophie", u"Tiergeschichten", u"Horror", u"Literatur & Unterhaltung Satire", u"Infotainment & erzählendes Sachbuch", u"Fitness & Sport", u"Übernatürliches", u"Psychologie & Spiritualität", u"Handwerk Farbe", u"Weisheiten der Welt", u"Naturheilweisen", u"Lustige Geschichten, Witze", u"Wissen & Nachschlagewerke", u"Sterben, Tod und Trauer", u"Romantasy", u"Wirtschaft & Recht", u"Comic & Cartoon", u"Schullektüre", u"Glaube und Grenzerfahrungen", u"Mode & Lifestyle", u"Mondkräfte", u"Musik", u"Geschichte, Politik", u"Gemeindearbeit", u"Wohnen & Innenarchitektur", u"Esoterische Romane", u"Schicksalsdeutung", 
u"Religionsunterricht", u"Religiöse Literatur", u"Geld & Investment", u"Sportgeschichten", u"Religion, Glaube, Ethik, Philosophie", u"Recht & Steuern", u"Handwerk Holz", u"Regionalia"],
2: [u"Vornamen", u"Heroische Fantasy", u"Joballtag & Karriere", u"Psychothriller", u"Große Gefühle", u"Feiern & Feste", u"Medizin & Forensik", u"Phantastik", u"Ökologie / Umweltschutz", u"Aktuelle Debatten", u"Ganzheitliche Psychologie Lebenshilfe", u"Nordamerikanische Literatur", u"Babys & Kleinkinder", u"Schwangerschaft & Geburt", u"Tod & Trauer", u"Nordische Krimis", u"Gesunde Ernährung", u"Junge Literatur", u"Kreatives", u"Einfamilienhausbau", u"Künstler, Dichter, Denker", u"Themenkochbuch", u"Abenteuer & Action", u"Science Thriller", u"Justizthriller", u"Besser leben", u"Starke Frauen", u"Gesellschaftskritik", u"Psychologie Partnerschaft & Sexualität", u"Krankheit", u"Abenteuer-Fantasy", u"Kirchen- und Theologiegeschichte", u"Biblische Theologie AT", u"Biblische Theologie NT", u"Politik & Gesellschaft Andere Länder & Kulturen", u"Hard Science Fiction", u"All Age Fantasy", u"Trauma", u"Krisen & Ängste", u"Space Opera", u"19./20. Jahrhundert", u"Agenten-/Spionage-Thriller", u"Französische Literatur", u"Selbstcoaching", u"Kopftraining", u"Erzählungen & Kurzgeschichten", u"Gartengestaltung", u"Weltpolitik & Globalisierung", u"Internet", u"Geschenkbuch & Briefkarten", u"Reiseberichte", u"Literatur aus Spanien und Lateinamerika", u"Romantische Komödien", u"Märchen, Legenden und Sagen", u"Humorvolle Unterhaltung", u"Natur, Wissenschaft, Technik Tiere", u"Familiensaga", u"Wellness", u"Romanbiographien", u"Patientenratgeber", u"Politische Theorien", u"Erotik & Sex", u"Rätsel & Spiele", u"Politiker", u"Future-History", u"Gerichtsmedizin / Pathologie", u"Spirituelles Leben", u"Nationalsozialismus", u"Musterbriefe & Rhetorik", u"Einzelthemen der Theologie", u"Dystopie", u"Lyrik", u"Literatur aus Russland und Osteuropa", u"Regionalkrimis", u"Starköche", u"Yoga, Pilates & Stretching", u"Pflanzen & Garten", u"Jenseits & Wiedergeburt", u"Fitnesstraining", u"Problemzonen", u"Italienische Literatur", u"Christlicher Glauben", u"Handwerk Farbe Praxis", u"Handwerk Farbe 
Grundlagenwissen", u"Östliche Weisheit", u"Ernährung", u"Magen & Darm", u"Nahrungsmittelintoleranz", u"Deutschsprachige Literatur", u"Mittelalter", u"Historische Krimis", u"Kindererziehung", u"Körpertherapien", u"High Fantasy", u"Science Fiction Sachbuch", u"Pubertät", u"Länderküche", u"Styling", u"Schönheitspflege", u"Getränke", u"Lady-Thriller", u"Abschied, Trauer, Neubeginn", u"Laufen & Nordic Walking", u"Neue Wirtschaftsmodelle", u"Utopie", u"Afrikanische Literatur", u"Science Fiction Science Fantasy", u"Englische Literatur", u"Steampunk", u"Alternativwelten", u"Geschichte nach '45", u"Spiritualität & Religion Weltreligionen", u"Theologie Religionspädagogik", u"Raucherentwöhnung", u"Funny Fantasy", u"Skandinavische Literatur", u"Film & Musik", u"Westliche Wege", u"Entspannung & Meditation", u"Kindergarten & Pädagogik", u"Schule & Lernen", u"Spiele & Beschäftigung", u"Psychologie Lebenshilfe", u"Persönlichkeitsentwicklung", u"Mystery-Thriller", u"Homöopathie & Bachblüten", u"Liebe & Beziehung", u"Literaturgeschichte / -kritik", u"Ernährung & Kochen", u"Wandern & Bergsteigen", u"Sucht & Abhängigkeit", u"Politthriller", u"Sterbebegleitung & Hospizarbeit", u"50 plus", u"Job & Karriere", u"Konfirmation", u"Gemeindearbeit Religionspädagogik", u"Kasualien und Sakramente", u"Schauspieler, Regisseure", u"Praktische Anleitungen", u"Rücken & Gelenke", u"Unternehmen & Manager", u"Landschaftsgestaltung", u"Krimikomödien", u"Musiker, Sänger", u"Freizeit & Hobby Tiere", u"Gebete und Andachten", u"Glauben mit Kindern", u"Dark Fantasy", u"Lesen & Kochen", u"Kunst & Kunstgeschichte", u"Flirt & Partnersuche", u"Partnerschaft & Sex", u"Kommunikation", u"Wissen der Naturvölker", u"Urban Fantasy", u"Andere Länder", u"21. 
Jahrhundert", u"Engel & Schutzgeister", u"Chakren & Aura", u"Science Fiction Satire", u"Bauherrenratgeber", u"Bautechnik", u"Systematische Theologie", u"Praktische Theologie", u"Kosmologie", u"Literatur aus Fernost", u"Bibeln & Katechismus", u"Humoristische Nachschlagewerke", u"Wohnen", u"Länder, Städte & Regionen", u"Spirituelle Entwicklung", u"Indische Literatur", u"Cyberpunk", u"Wissenschaftler", u"Dying Earth", u"Monographien", u"Gesang- und Liederbücher", u"Innenarchitektur", u"Baumaterialien", u"Antike und neulateinische Literatur", u"Gemeindearbeit mit Kindern & Jugendlichen", u"Wissenschaftsthriller", u"Ökothriller", u"Fantasy Science Fantasy", u"Psychotherapie", u"Farbratgeber", u"Hausmittel", u"Schicksalsberichte Andere Länder & Kulturen", u"Design / Lifestyle", u"Diakonie und Seelsorge", u"Gemeindearbeit Sachbuch", u"Gottesdienst und Predigt", u"Sprache & Sprechen", u"(Zeit-) Geschichte Andere Länder & Kulturen", u"Arbeitstechniken", u"Mantras & Mudras", u"NS-Zeit & Nachkriegszeit", u"Kinderschicksal", u"Altbausanierung / Denkmalpflege", u"Neuere Geschichte", u"Umgangsformen", u"Geschichte und Theorie", u"Familie & Religion", u"Niederländische Literatur", u"Handwerk Farbe Gestaltung", u"Historische Fantasy", u"Alte Geschichte", u"Fantasy-/SF-Thriller", u"Bewerbung", u"Wirtschaftsthriller", u"Bibel in gerechter Sprache", u"Fahrzeuge / Technik", u"Handwerk Holz Gestaltung", u"Handwerk Holz Grundlagenwissen", u"Anthologien", u"Handwerk Holz Praxis", u"Bibeln & Bibelarbeit", u"Theologie Weltreligionen", u"Dialog der Traditionen", u"Magie & Hexerei", u"Tierkrimis", u"Medizinthriller", u"Literatur des Nahen Ostens", u"Kirchenthriller", u"Spielewelten", u"Astrologie & Sternzeichen", u"Stadtplanung", u"Feministische Theologie", u"Entwurfs- und Detailplanung", u"Street Art", u"Trennung", u"Philosophie", u"Tarot", u"Systemische Therapie & Familienaufstellung", u"Bauaufgaben", u"Griechische Literatur", u"Gartendesigner", u"Urgeschichte", u"Reden & Glückwünsche", 
u"Antiquitäten", u"Theater / Ballett"]}

## Constants

In [461]:
TEXT_COLUMN = 'body'

In [8]:
LEMMATIZE = False

## Set seed for reproducible results

In [283]:
import random

seed = 23 # tried 23, 5, 42, 82
np.random.seed(seed)
random.seed(seed)

## Tokenizer

In [10]:
nlp = spacy.load('de')

def tokenize_spacy(corpus, lemma=LEMMATIZE):
  doc = nlp(corpus)
  if lemma:
    return list(str(x.lemma_) for x in doc) # lemma_ to get string instead of hash
  else:
    return list(str(x) for x in doc)

## Helpers

### Data extraction

In [11]:
# Turn a provided XML-file into a string with an added root element
def data_from_xml(xml_file):
  # Read in the file
  with open(xml_file, 'r') as file :
    data = file.read()  
    # Replace "&" with "und" to avoid parsing problems
    data = data.replace("&", "und")
    # Add a root node
    data = '<root>\n' + data + '</root>\n'
    return data

# Get the root element from a data-string
def root_from_data(data, encoding='utf-8'):
  import xml.etree.ElementTree as ET
  # Pass the parser explicitly so that the requested encoding is actually used
  xml_parser = ET.XMLParser(encoding=encoding)
  xml_root = ET.fromstring(data, parser=xml_parser)
  return xml_root
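The two helpers above can be illustrated on a tiny input. This is a minimal sketch with a hypothetical two-record fragment (the element names `book`, `title` and `isbn` are only for illustration) showing why the `&` replacement and the artificial root node are needed before `ElementTree` can parse the data.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-record fragment without a single root element.
fragment = (
    "<book><title>A & B</title><isbn>1</isbn></book>\n"
    "<book><title>C</title><isbn>2</isbn></book>\n"
)

# A bare "&" is not well-formed XML, so it is replaced first (as the notebook does),
# then everything is wrapped in one artificial root node.
wrapped = "<root>\n" + fragment.replace("&", "und") + "</root>\n"
root = ET.fromstring(wrapped)

titles = [book.find("title").text for book in root]
print(titles)
```

A side effect of this replacement is that label names containing `&` also arrive with `und` instead, which appears to be why `labels_to_ids` normalizes both `' & '` and `' und '` to an underscore.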

### Labels

In [12]:
# Turn a list of labels with all sorts of special characters into one that is usable for data frames
# Examples: 
# 'Kinderbuch & Jugendbuch' -> 'kinderbuch_jugendbuch'
# '(Zeit-) Geschichte' -> 'zeit geschichte'
def labels_to_ids(labels):
  ids = []
  for label in labels:
    label = label.replace(' & ', '_')
    label = label.replace(' und ', '_')
    label = label.lower()
    # Stuff for depth-level 2 and 3
    label = label.replace(" / ", "_")
    label = label.replace("/", "_")
    label = label.replace(", ", "_")
    label = label.replace("-", "")
    label = label.replace("(", "")
    label = label.replace(")", "")
    label = label.replace(".", "")
    label = label.replace("'", "")
    ids.append(label)
  return ids

In [13]:
# Get the labels from the previous level
# Examples:
# u"Romane & Erzählungen" -> ['Literatur & Unterhaltung']
# u"Joballtag & Karriere" -> ['Kommunikation & Beruf', 'Ratgeber']

# TODO: refactor
def find_previous_labels(label):
  with open(hierarchy_file, 'r') as file :
    data = file.read().split('\n')
  extra_labels = []
  level_two_label = ''
  for row in data:
    # Note: this assumes, there's always two items per row
    # Also assumes that every highest level item is only found once in the file
    items = re.split(r'\t+', row.rstrip('\t')) # split on tab
    if(( len(items) == 2 ) and items[1] == label):
      extra_labels.append(items[0])
      level_two_label = items[0]
  for row in data:
    # Note: this assumes, there's always two items per row
    # Also assumes that every highest level item is only found once in the file
    items = re.split(r'\t+', row.rstrip('\t')) # split on tab
    if(( len(items) == 2 ) and items[1] == level_two_label):
      extra_labels.append(items[0])
  return extra_labels

In [14]:
# Get the labels from the next level
# Example: u"Ratgeber"
# ['Essen & Trinken', 'Gesundheit & Ernährung', 'Lebenshilfe & Psychologie', 'Eltern & Familie',
#  'Ratgeber Partnerschaft & Sexualität', 'Beauty & Wellness', 'Fitness & Sport', 'Kommunikation & Beruf',
#  'Geld & Investment', 'Recht & Steuern', 'Freizeit & Hobby', 'Wissen & Nachschlagewerke']
def find_next_level(label):
  extra_labels = []
  with open(hierarchy_file, 'r') as file :
    data = file.read().split('\n')
  for row in data:
    # Note: this assumes, there's always two items per row
    # Also assumes that every highest level item is only found once in the file
    items = re.split(r'\t+', row.rstrip('\t')) # split on tab
    if(( len(items) == 2 ) and items[0] == label):
      extra_labels.append(items[1])
  return extra_labels
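Both lookup helpers re-read the hierarchy file on every call. As a possible refactoring (a sketch only, not the notebook's code), the file could be parsed once into a child-to-parent dictionary. The names `build_parent_map` and `ancestors` are hypothetical, and the sketch assumes one `parent<TAB>child` pair per line and a single parent per label.

```python
import io

# Hypothetical excerpt in the assumed "parent<TAB>child" layout.
sample = io.StringIO(
    "Ratgeber\tKommunikation & Beruf\n"
    "Kommunikation & Beruf\tJoballtag & Karriere\n"
    "Literatur & Unterhaltung\tRomane & Erzählungen\n"
)

def build_parent_map(lines):
    """Map each child label to its parent label."""
    parents = {}
    for line in lines:
        items = line.rstrip("\n").split("\t")
        if len(items) == 2:
            parents[items[1]] = items[0]
    return parents

def ancestors(label, parents):
    """Walk upwards from a label and return its ancestors, nearest first."""
    found = []
    while label in parents:
        label = parents[label]
        found.append(label)
    return found

parents = build_parent_map(sample)
print(ancestors("Joballtag & Karriere", parents))
```

With a real file, `build_parent_map(open(hierarchy_file))` would be called once and the resulting dictionary reused for every prediction. Unlike the original, this sketch does not collapse runs of multiple tabs.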

### Data frame construction

In [15]:
# Construct an empty data frame from a provided list of sanitized label_ids
def dataframe_from_labels(label_ids=[]):
  base_columns = ['isbn', 'title', 'body', 'copyright', 'authors', 'published']
  # The testfile has no url and no labels (just the base columns)
  if(label_ids==[]):
    columns = base_columns
  else:
    base_columns.append('url')
    columns = base_columns + label_ids
  return pd.DataFrame(columns = columns)

In [16]:
# Write a 1 for every label that matches the ones passed and a 0 for every other label
def entries_from_labels(matching_labels, label_ids):
  entries = [0 for x in label_ids]
  for item in labels_to_ids(matching_labels):
    entries[label_ids.index(item)] = 1
  return entries

# Build a dataframe from a given root-element. Works for both training and test data.
# d is the depth of the labels to consider: 0 = level 1, 1=level 2, 2=level 3
def dataframe_from_root(root, label_ids=[], d=0):
  if(label_ids==[]):
    is_test = True
  else:
    is_test = False
  # Empty dataframe for the given label_ids
  df = dataframe_from_labels(label_ids)

  for node in root:
    matching_labels = []
    
    isbn = node.find("isbn").text
    title = node.find("title").text
    body = node.find("body").text
    copyright = node.find("copyright").text
    authors = node.find("authors").text
    published = node.find("published").text
    
    # Training-set
    if(is_test == False):
      url = node.find("url").text
    
      categories = node.find("categories").findall("category")
      for c in categories:
        topics = c.findall("topic")
        # Use all top-level matching_labels
        for t in topics:
          if t.attrib.get("d") == str(d):
            matching_labels.append(t.text)
    
      df = df.append(pd.Series([isbn, title, body, copyright, authors, published, url]+entries_from_labels(matching_labels, label_ids),
                               index = df.columns), ignore_index = True)
    
    # Test-set
    else:
      df = df.append(pd.Series([isbn, title, body, copyright, authors, published],
                               index = df.columns), ignore_index = True)
    
  return df

In [17]:
# Create dataframes for training and test from given labels and a level depth
# depth=0 -> level 1 only (suitable for subtask a)
def get_train_test(label_ids, depth=0):
  train_df = dataframe_from_root(train_root, label_ids, d=depth)
  test_df = dataframe_from_root(test_root)
  # Add 0s for all labels for the test_df
  test_df = test_df.reindex(columns=[*test_df.columns.tolist(), *label_ids], fill_value=0)
  return train_df, test_df

### Predictions to expected answer-format

In [18]:
def write_answerfile(answers_taskA, answers_taskB, filename='ORGNAME__MODEL.txt'):
  # Add the required subtask-headers
  final_answers = 'subtask_a\n' + answers_taskA + '\nsubtask_b\n' + answers_taskB

  # The context manager closes the file even if writing fails
  with open(filename, 'w') as out:
    out.write(final_answers)
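For orientation, this minimal sketch builds the answer-file layout in memory: one header line per subtask, followed by one tab-separated line per book. The predictions and the helper `format_rows` are hypothetical.

```python
# Hypothetical predictions: (isbn, labels) pairs.
rows = [
    ("9783641136291", ["Literatur & Unterhaltung"]),
    ("9783328103646", ["Ratgeber", "Sachbuch"]),
]

def format_rows(rows):
    """One line per book: isbn, then all predicted labels, separated by tabs."""
    return "\n".join(isbn + "\t" + "\t".join(labels) for isbn, labels in rows)

body = format_rows(rows)
content = "subtask_a\n" + body + "\nsubtask_b\n" + body
print(content)
```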

# 1) Data loading and labels

In [19]:
# XML->data-string
train_data = data_from_xml(train_file)
test_data = data_from_xml(test_file)

# Root element from data-string
train_root = root_from_data(train_data)
test_root = root_from_data(test_data)

In [20]:
# Level 1 labels
labels = all_labels[0]
label_ids = labels_to_ids(labels)
# labels, label_ids

In [21]:
len(labels) # 8

8

In [22]:
# Level 2 labels
labels_level2 = all_labels[1]
label_ids_level2 = labels_to_ids(labels_level2)
# labels_level2, label_ids_level2

In [23]:
len(labels_level2) # 93

93

In [24]:
# Level 3 labels
labels_level3 = all_labels[2]
label_ids_level3 = labels_to_ids(labels_level3)
# labels_level3, label_ids_level3

In [25]:
len(labels_level3) # 242

242

# 2) Subtask A (level 1)

## Load and sanitize the training and test data

In [26]:
train_df, test_df = get_train_test(label_ids, depth=0)

In [27]:
train_df.head()

Unnamed: 0,isbn,title,body,copyright,authors,published,url,ratgeber,kinderbuch_jugendbuch,literatur_unterhaltung,sachbuch,ganzheitliches bewusstsein,architektur_garten,glaube_ethik,künste
0,9783641136291,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,(c) Verlagsgruppe Random House GmbH,Noah Gordon,2013-12-02,https://www.randomhouse.de/ebook/Die-Klinik/No...,0,0,1,0,0,0,0,0
1,9783641185787,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,(c) Verlagsgruppe Random House GmbH,Raymond Feist,2016-06-20,https://www.randomhouse.de/ebook/Die-Erben-von...,0,0,1,0,0,0,0,0
2,9783328103646,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,(c) Verlagsgruppe Random House GmbH,Susanne Weingarten,2019-01-14,https://www.randomhouse.de/Taschenbuch/Voellig...,1,0,0,0,0,0,0,0
3,9783453357792,Dich erfüllen,An der Seite von Damien fühlt sich Nikki zum e...,(c) Verlagsgruppe Random House GmbH,J. Kenner,2014-04-14,https://www.randomhouse.de/Taschenbuch/Dich-er...,0,0,1,0,0,0,0,0
4,9783844504958,Der Orientzyklus,"Wer Kara Ben Nemsi, Hadschi Halef Omar und Sir...",(c) Verlagsgruppe Random House GmbH,Karl May,2007-08-13,https://www.randomhouse.de/Hoerbuch-Download/D...,0,0,1,0,0,0,0,0


In [28]:
train_df[200:250]

Unnamed: 0,isbn,title,body,copyright,authors,published,url,ratgeber,kinderbuch_jugendbuch,literatur_unterhaltung,sachbuch,ganzheitliches bewusstsein,architektur_garten,glaube_ethik,künste
200,9783570309971,Seelenkuss,Prinzessin Darejan erkennt ihre Schwester nich...,(c) Verlagsgruppe Random House GmbH,Lynn Raven,2015-07-13,https://www.randomhouse.de/Taschenbuch/Seelenk...,0,1,0,0,0,0,0,0
201,9783466372157,Benedikt XVI.,Mit Joseph Ratzinger verbindet sich eine atemb...,(c) Verlagsgruppe Random House GmbH,"Peter Seewald, Diözese Passau Körperschaft des...",2017-10-30,https://www.randomhouse.de/Buch/Benedikt-XVI./...,0,0,0,1,0,0,0,0
202,9783844504392,Karl Valentins sprachliche Wirrungen,Vor Karl Valentin und Liesl Karlstadt ist nich...,(c) Verlagsgruppe Random House GmbH,Karl Valentin,2007-04-13,https://www.randomhouse.de/Hoerbuch-Download/K...,0,0,1,0,0,0,0,0
203,9783641103750,Mein Laufbuch für die ersten 10 Kilometer,Ihr Entschluss steht fest: Sie möchten gern mi...,(c) Verlagsgruppe Random House GmbH,Thomas Wessinghage,2014-04-24,https://www.randomhouse.de/ebook/Mein-Laufbuch...,1,0,0,0,0,0,0,0
204,9783453151659,Der Partner,"Bevor sie die Falle zuschnappen ließen, hatten...",(c) Verlagsgruppe Random House GmbH,John Grisham,1999-09-01,https://www.randomhouse.de/Taschenbuch/Der-Par...,0,0,1,0,0,0,0,0
205,9783641118556,Saat der Angst,Unter mysteriösen Umständen verschwinden drei ...,(c) Verlagsgruppe Random House GmbH,Emily Benedek,2014-03-17,https://www.randomhouse.de/ebook/Saat-der-Angs...,0,0,1,0,0,0,0,0
206,9783641039363,Die Goldmacherin,Mainz 1461: Die junge Aurelia erlernt von ihre...,(c) Verlagsgruppe Random House GmbH,Sybille Conrad,2010-08-13,https://www.randomhouse.de/ebook/Die-Goldmache...,0,0,1,0,0,0,0,0
207,9783641192396,"Denk blau, zähl bis zwei","Bevor die großen Schiffe planoformen konnten, ...",(c) Verlagsgruppe Random House GmbH,Cordwainer Smith,2016-04-28,"https://www.randomhouse.de/ebook/Denk-blau,-za...",0,0,1,0,0,0,0,0
208,9783844504552,"Beweise, daß es böse ist","Nicht einmal die Gondeln tragen Trauer, als di...",(c) Verlagsgruppe Random House GmbH,Donna Leon,2005-08-15,https://www.randomhouse.de/Hoerbuch-Download/B...,0,0,1,0,0,0,0,0
209,9783641145651,Im Bann der Liebe,Atemberaubend schön und in den raffiniertesten...,(c) Verlagsgruppe Random House GmbH,Sylvia Day,2015-01-12,https://www.randomhouse.de/ebook/Im-Bann-der-L...,0,0,1,0,0,0,0,0


In [29]:
train = train_df.copy()
test = test_df.copy()
len(train), len(test)

(14548, 2079)

### Fix empty text in the train and test set

In [30]:
np.where(pd.isnull(test['body']))

(array([], dtype=int64),)

In [31]:
np.where(pd.isnull(train['body']))

(array([  623,   911,  3989,  4381,  4642,  7372,  8094,  8422, 11158,
        12850, 14060]),)

In [32]:
train[623:624]

Unnamed: 0,isbn,title,body,copyright,authors,published,url,ratgeber,kinderbuch_jugendbuch,literatur_unterhaltung,sachbuch,ganzheitliches bewusstsein,architektur_garten,glaube_ethik,künste
623,9783442158485,Der alte Mann und das Netz,,(c) Verlagsgruppe Random House GmbH,Christian Humberg,2015-08-17,https://www.randomhouse.de/Taschenbuch/Der-alt...,0,0,0,1,0,0,0,0


In [33]:
# Fill empty body with just the title (better results than author+title)
train.body = np.where(train.body.isnull(), train.title , train.body)
#Fill empty body with the title+authors
#train.body = np.where(train.body.isnull(), train.title + ' ' + train.authors, train.body)

In [34]:
train[623:624]

Unnamed: 0,isbn,title,body,copyright,authors,published,url,ratgeber,kinderbuch_jugendbuch,literatur_unterhaltung,sachbuch,ganzheitliches bewusstsein,architektur_garten,glaube_ethik,künste
623,9783442158485,Der alte Mann und das Netz,Der alte Mann und das Netz,(c) Verlagsgruppe Random House GmbH,Christian Humberg,2015-08-17,https://www.randomhouse.de/Taschenbuch/Der-alt...,0,0,0,1,0,0,0,0


## Tokenize & Vectorize

In [462]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# analyzer='char': character level; 'char_wb': only in words; 'word' (default): word based
# lowercase defaults to True
# norm = 'l1'/'l2' (defaults to l2)
# smooth_idf: Add "fake document" with all 1s (don't use for char?!)
# use_idf: inverse-document-frequency reweighting
# sublinear_tf: Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
# binary (defaults to False)
# Note: Set use_idf to False and norm to None to get 0/1 outputs.

# TF-IDF (Term Frequency - Inverse Document Frequency) Vectorizer
# Normalize term counts by taking into account how often they appear in a document,
# how long the document is and how common/rare a term is
vec = TfidfVectorizer(analyzer='word', ngram_range=(1,1), tokenizer=tokenize_spacy,
               min_df=4, max_df=0.4, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True, lowercase=False, binary=False)

# Best: ngram 1,1; 4/0.4; T/T/T; lowercase=False; spacy-word tokenization

# C=40.0; dual=False; class_weight='balanced' 
# For multi-label: limit=0.04
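The effect of `sublinear_tf=True` and `smooth_idf=True` can be checked by hand. The sketch below uses the formulas given in the scikit-learn documentation (tf becomes 1 + ln(tf), idf becomes ln((1 + n) / (1 + df)) + 1). The function names and the example counts are hypothetical, and the L2 normalization that the vectorizer applies afterwards is left out.

```python
import math

def sublinear_tf(tf):
    """Sublinear scaling: a raw count tf > 0 becomes 1 + ln(tf)."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

n_docs = 14548  # size of the training set in this notebook
for tf, df in [(1, 4), (10, 4), (10, 5000)]:
    weight = sublinear_tf(tf) * smoothed_idf(n_docs, df)
    print(tf, df, round(weight, 3))
```

A term that occurs ten times in a blurb gets a weight well below ten times that of a single occurrence, and a term found in thousands of documents is weighted far lower than one found in only four.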

In [463]:
trn_term_doc = vec.fit_transform(train[TEXT_COLUMN])
test_term_doc = vec.transform(test[TEXT_COLUMN])

In [464]:
trn_term_doc, test_term_doc

(<14548x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 865403 stored elements in Compressed Sparse Row format>,
 <2079x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 123452 stored elements in Compressed Sparse Row format>)

In [509]:
# SAVE
import scipy.sparse
scipy.sparse.save_npz('trn_term_doc_level1.npz', trn_term_doc)
scipy.sparse.save_npz('test_term_doc_level1.npz', test_term_doc)

In [511]:
# LOAD - SKIP TO HERE
import scipy.sparse
trn_term_doc = scipy.sparse.load_npz('trn_term_doc_level1.npz')
test_term_doc = scipy.sparse.load_npz('test_term_doc_level1.npz')

## Naive Bayes Logistic Regression

### Build the model

In [512]:
def pr(y_i, y, trn_term_doc):
    p = trn_term_doc[y==y_i].sum(0)
    return (p+1) / ((y==y_i).sum()+1)

In [513]:
def get_model(label_values, trn_term_doc):
    y = label_values.astype('int') # convert objects to ints
    
    r = np.log(pr(1,y, trn_term_doc) / pr(0,y, trn_term_doc))
    model = LogisticRegression(C=40.0, dual=False, solver='liblinear', multi_class='auto', max_iter=1000,
                           penalty='l2', class_weight='balanced', verbose=1)
    x_nb = trn_term_doc.multiply(r)
    return model.fit(x_nb, y), r # x_nb=training; y=targets
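To see what `pr` and the log-count ratio `r` compute, here is a small pure-Python sketch on a toy matrix. The values and the helper name `smoothed_feature_ratio` are hypothetical; the arithmetic mirrors the two functions above, without NumPy or sparse matrices.

```python
import math

# Toy term-document matrix: 4 documents x 3 terms (hypothetical values).
X = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
y = [1, 0, 1, 0]  # binary target for one label

def smoothed_feature_ratio(X, y, cls):
    """Per-term sum over documents of class `cls`, plus 1, divided by (class size + 1)."""
    rows = [row for row, label in zip(X, y) if label == cls]
    n_terms = len(X[0])
    sums = [sum(row[j] for row in rows) for j in range(n_terms)]
    return [(s + 1.0) / (len(rows) + 1.0) for s in sums]

p = smoothed_feature_ratio(X, y, 1)
q = smoothed_feature_ratio(X, y, 0)
r = [math.log(pj / qj) for pj, qj in zip(p, q)]

# Scale every document's features by r before fitting the classifier.
X_nb = [[value * rj for value, rj in zip(row, r)] for row in X]
print([round(rj, 3) for rj in r])
```

Terms that are more typical of the positive class get a positive `r`, terms typical of the negative class a negative one, and terms that are equally common in both get zero and so drop out of the scaled features.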

### Predictions from the model

In [514]:
preds = np.zeros((len(test), len(label_ids)))

for index, label_id in enumerate(label_ids):
    label_values = train[label_id].values
    print('Fitting: ', label_id)
    m,r = get_model(label_values, trn_term_doc)
    preds[:,index] = m.predict_proba(test_term_doc.multiply(r))[:,1] # all rows, column 1 = probability of the positive class (label present)

Fitting:  ratgeber
[LibLinear]Fitting:  kinderbuch_jugendbuch
[LibLinear]Fitting:  literatur_unterhaltung
[LibLinear]Fitting:  sachbuch
[LibLinear]Fitting:  ganzheitliches bewusstsein
[LibLinear]Fitting:  architektur_garten
[LibLinear]Fitting:  glaube_ethik
[LibLinear]Fitting:  künste
[LibLinear]

In [515]:
preds[0,:]

array([  4.10145004e-02,   8.96660999e-02,   1.50870231e-02,
         5.08189619e-04,   4.82644210e-03,   1.25780924e-03,
         9.99994847e-01,   3.10462204e-03])

## Write Submission File

### Get answers

In [516]:
# Given preds and corresponding labels, produce an answer-file of format isbn <TAB> label
# The default value for the limit (0.08) was found empirically
# The default max_labels is 2, i.e. up to two different labels are returned
def answers_from_preds_multi(preds, labels, limit=0.08, max_labels=2):
  label_column_strings = labels
  test_isbn_df = test['isbn']
  isbn_list = list(test_isbn_df)
  answers_list = []
  for index, item in enumerate(preds):
    #max_index = np.argmax(item)
    
    # Sort probabilities from highest to lowest, since there were no examples with more than
    # three categories in the provided data, we'll stop there
    sorted_indexes = (-item).argsort()
    index_first = sorted_indexes[0]
    index_second = sorted_indexes[1]
    index_third = sorted_indexes[2]
    
    max_index = index_first
    
    # TODO: refactor
    # Multi-label 
    if(max_labels > 1 and (item[index_first] - item[index_second]) < limit):
      label_first = label_column_strings[index_first]
      label_first = [label_first] + find_previous_labels(label_first)

      label_second = label_column_strings[index_second]
      label_second = [label_second] + find_previous_labels(label_second)
      ls = label_first + label_second
    
      if(max_labels > 2 and (item[index_second] - item[index_third]) < 0.005):
        label_third = label_column_strings[index_third]
        ls.append(label_third)
    else:
      label_first = label_column_strings[max_index]
      label_first = [label_first] + find_previous_labels(label_first)
      ls = label_first
    isbn = isbn_list[index]
    answers_list += [[isbn, ls]]
  return answers_list # this is a list of lists to keep the order, in python 3.7+ a dict can be used
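The core selection rule can be shown in isolation. This simplified sketch (the helper name `pick_labels` and the probabilities are hypothetical) keeps the top label and adds the runner-up only when its probability is within `limit` of the best one. It leaves out the ancestor lookup and the optional third label of the full function.

```python
def pick_labels(probs, labels, limit=0.08):
    """Return the top label, plus the runner-up when its probability is within `limit`."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = [labels[order[0]]]
    if len(order) > 1 and probs[order[0]] - probs[order[1]] < limit:
        chosen.append(labels[order[1]])
    return chosen

labels = ["ratgeber", "sachbuch", "glaube_ethik"]
print(pick_labels([0.04, 0.05, 0.99], labels))  # clear winner
print(pick_labels([0.55, 0.50, 0.10], labels))  # close runner-up
```

The rule compares the gap between the two best scores rather than thresholding each score, so a second label is added only when the model cannot clearly separate the top two candidates.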

In [517]:
answers_list = answers_from_preds_multi(preds, labels)

In [518]:
# TODO: refactor
def answers_list_to_file(answers_list):
  answers = ''

  for item in answers_list:
    isbn = item[0]
    labels_level1 = item[1]
    if(len(item) == 3):
      labels_level2 = item[2]
      answers += isbn + '\t' + '\t'.join(labels_level1) + '\t' + '\t'.join(labels_level2) + '\n'
    elif(len(item) == 4):
      labels_level2 = item[2]
      labels_level3_nested = item[3] # can be a list of lists or a flat list
      # Flatten if it is a list of lists, otherwise use the flat list as-is
      if(labels_level3_nested and isinstance(labels_level3_nested[0], list)):
        labels_level3 = [label for sublist in labels_level3_nested for label in sublist]
      else:
        labels_level3 = labels_level3_nested
      answers += isbn + '\t' + '\t'.join(labels_level1) + '\t' + '\t'.join(labels_level2) + '\t' + '\t'.join(labels_level3) + '\n'
    else:
      answers += isbn + '\t' + '\t'.join(labels_level1) + '\n'
  return answers[:-1] # Remove trailing \n

In [519]:
answers_taskA = answers_list_to_file(answers_list)

In [520]:
answers_taskB = answers_list_to_file(answers_list) # dummy that just takes subtask a results for subtask b

In [521]:
write_answerfile(answers_taskA, answers_taskB)

In [522]:
elapsed_first = timeit.default_timer() - start_time

## Subtask B (level 2)

### Level 2 Setup

In [476]:
train_df_level2, test_df_level2 = get_train_test(label_ids_level2, depth=1)

In [477]:
train_df_level2.head()

Unnamed: 0,isbn,title,body,copyright,authors,published,url,eltern_familie,echtes leben_realistischer roman,abenteuer,...,esoterische romane,schicksalsdeutung,religionsunterricht,religiöse literatur,geld_investment,sportgeschichten,religion_glaube_ethik_philosophie,recht_steuern,handwerk holz,regionalia
0,9783641136291,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,(c) Verlagsgruppe Random House GmbH,Noah Gordon,2013-12-02,https://www.randomhouse.de/ebook/Die-Klinik/No...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9783641185787,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,(c) Verlagsgruppe Random House GmbH,Raymond Feist,2016-06-20,https://www.randomhouse.de/ebook/Die-Erben-von...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9783328103646,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,(c) Verlagsgruppe Random House GmbH,Susanne Weingarten,2019-01-14,https://www.randomhouse.de/Taschenbuch/Voellig...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9783453357792,Dich erfüllen,An der Seite von Damien fühlt sich Nikki zum e...,(c) Verlagsgruppe Random House GmbH,J. Kenner,2014-04-14,https://www.randomhouse.de/Taschenbuch/Dich-er...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9783844504958,Der Orientzyklus,"Wer Kara Ben Nemsi, Hadschi Halef Omar und Sir...",(c) Verlagsgruppe Random House GmbH,Karl May,2007-08-13,https://www.randomhouse.de/Hoerbuch-Download/D...,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [478]:
train2 = train_df_level2.copy()
test2 = test_df_level2.copy()
len(train2), len(test2)

(14548, 2079)

In [479]:
np.where(pd.isnull(test2['body']))

(array([], dtype=int64),)

In [480]:
np.where(pd.isnull(train2['body']))

(array([  623,   911,  3989,  4381,  4642,  7372,  8094,  8422, 11158,
        12850, 14060]),)

In [481]:
# Fill empty body with just the title (better results than author+title)
train2.body = np.where(train2.body.isnull(), train2.title , train2.body)
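
The fallback above can also be written with pandas' own `fillna`, which accepts a Series aligned on the index. A minimal sketch on a toy frame (the two rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for train2; the values are invented for illustration
df = pd.DataFrame({
    'title': ['Die Klinik', 'Der Orientzyklus'],
    'body': ['Ein Blick hinter die Kulissen', None],
})

# Replace missing blurbs with the title of the same row
df['body'] = df['body'].fillna(df['title'])
print(df['body'].tolist())
```

Either form leaves rows with an existing blurb untouched.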

In [482]:
trn_term_doc_level2 = vec.fit_transform(train2[TEXT_COLUMN])
test_term_doc_level2 = vec.transform(test2[TEXT_COLUMN])

In [483]:
trn_term_doc_level2, test_term_doc_level2

(<14548x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 865403 stored elements in Compressed Sparse Row format>,
 <2079x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 123452 stored elements in Compressed Sparse Row format>)

In [523]:
# SAVE
scipy.sparse.save_npz('trn_term_doc_level2.npz', trn_term_doc_level2)
scipy.sparse.save_npz('test_term_doc_level2.npz', test_term_doc_level2)

In [524]:
# LOAD - SKIP TO HERE
trn_term_doc_level2 = scipy.sparse.load_npz('trn_term_doc_level2.npz')
test_term_doc_level2 = scipy.sparse.load_npz('test_term_doc_level2.npz')
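
The SAVE/LOAD cells cache the vectorized matrices so that the slow tokenization step can be skipped on a rerun. A small round-trip sketch, assuming a temporary directory and a made-up file name, shows that `save_npz` and `load_npz` preserve shape, format and values:

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Small CSR matrix standing in for a TF-IDF term-document matrix
matrix = scipy.sparse.csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'term_doc.npz')  # hypothetical file name
    scipy.sparse.save_npz(path, matrix)
    restored = scipy.sparse.load_npz(path)

# No differing entries means the round trip was lossless
same = (matrix != restored).nnz == 0
print(restored.shape, restored.format, same)
```

Only the matrices are cached this way; the fitted `vec` itself is not saved by these cells.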

In [484]:
preds_level2 = np.zeros((len(test), len(label_ids_level2)))

for index, label_id in enumerate(label_ids_level2):
    label_values = train2[label_id].values
    print('Fitting: ', label_id)
    m,r = get_model(label_values, trn_term_doc_level2)
    preds_level2[:,index] = m.predict_proba(test_term_doc_level2.multiply(r))[:,1] # all rows, column 1: probability of the positive class

Fitting:  eltern_familie
[LibLinear]Fitting:  echtes leben_realistischer roman
[LibLinear]Fitting:  abenteuer
[LibLinear]Fitting:  märchen_sagen
[LibLinear]Fitting:  lyrik_anthologien_jahrbücher
[LibLinear]Fitting:  frauenunterhaltung
[LibLinear]Fitting:  fantasy
[LibLinear]Fitting:  kommunikation_beruf
[LibLinear]Fitting:  lebenshilfe_psychologie
[LibLinear]Fitting:  krimi_thriller
[LibLinear]Fitting:  freizeit_hobby
[LibLinear]Fitting:  liebe_beziehung_freundschaft
[LibLinear]Fitting:  familie
[LibLinear]Fitting:  natur_wissenschaft_technik
[LibLinear]Fitting:  fantasy_science fiction
[LibLinear]Fitting:  geister_gruselgeschichten
[LibLinear]Fitting:  schicksalsberichte
[LibLinear]Fitting:  romane_erzählungen
[LibLinear]Fitting:  science fiction
[LibLinear]Fitting:  politik_gesellschaft
[LibLinear]Fitting:  ganzheitliche psychologie
[LibLinear]Fitting:  natur_tiere_umwelt_mensch
[LibLinear]Fitting:  psychologie
[LibLinear]Fitting:  lifestyle
[LibLinear]Fitting:  sport
[LibLinear]Fitt

In [486]:
len(preds_level2), len(preds_level2[0,:]) # 2079 elements, 93 classes

(2079, 93)
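
The `[:,1]` index in the fitting loop picks one column of the `predict_proba` output. A toy sketch, assuming `m` behaves like a scikit-learn classifier fitted on 0/1 labels (the data below are invented), shows why column 1 is the one of interest:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem; the data are invented for illustration
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)

# Columns follow clf.classes_, so column 1 belongs to label 1
print(clf.classes_)
print(proba.shape)
positive = proba[:, 1]
```

Each row of `proba` sums to one, so column 0 carries no extra information in the binary case.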

### Level 2 predictions

In [487]:
# Takes predictions and their labels (same length and order) plus a target list.
# Returns a list of [label, probability] pairs, restricted to labels in target_list.
def get_nextlevel_preds(preds, labels, target_list):
  if(len(preds) != len(labels)):
    raise Exception('The length of the predictions and the corresponding labels should be the same!')
  new_list = []
  for index, item in enumerate(labels):
    if item in target_list:
      new_list.append([labels[index], preds[index]])
  return new_list

In [488]:
# Takes a list of lists, each item has the format [label, probability]
# Returns a tuple (label, probability) for the label with the highest probability,
# or an empty tuple if that probability does not exceed the cutoff
def get_max_from_list(items, cutoff=0.0):
  best_probability = 0
  best_label = ''

  for label, probability in items:
    if probability > best_probability:
      best_probability = probability
      best_label = label
  if best_probability > cutoff:
    return best_label, best_probability
  else:
    return ()

# Takes a list of lists, each item has the format [label, probability]
# Returns a list with the labels of all items whose probability exceeds the cutoff
def get_max_from_list_multi(items, cutoff=1.0):
  return [label for label, probability in items if probability > cutoff]
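
How the helpers above combine can be seen on a single book. The sketch below uses invented scores and an assumed parent/child relation, and its function names (`best_above`, `all_above`) are hypothetical stand-ins rather than the notebook's own helpers:

```python
# Hypothetical scores for one book over four level-2 labels
labels = ['fantasy', 'science fiction', 'psychologie', 'sport']
scores = [0.40, 0.12, 0.30, 0.05]

# Assumed children of the predicted level-1 label
children = ['fantasy', 'science fiction']

# Keep only (label, score) pairs that belong to the parent
candidates = [(l, s) for l, s in zip(labels, scores) if l in children]

def best_above(pairs, cutoff):
    """Return the top (label, score) pair, or None if it does not exceed cutoff."""
    if not pairs:
        return None
    label, score = max(pairs, key=lambda pair: pair[1])
    return (label, score) if score > cutoff else None

def all_above(pairs, cutoff):
    """Return every label whose score exceeds cutoff."""
    return [label for label, score in pairs if score > cutoff]

print(best_above(candidates, cutoff=0.09))
print(all_above(candidates, cutoff=0.10))
print(best_above(candidates, cutoff=0.50))
```

Note that `psychologie` scores higher than `science fiction` here but is never considered, because it is not a child of the assumed parent label.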

In [489]:
# Iterate over all level 1 classifications and get the level 2 label with
# the highest probability, but only if it belongs to the part of the hierarchy below the level 1 label
def new_answers(answers_list, cutoff=0.0):
    new_ansers_list = []
    
    for index, item in enumerate(answers_list):
      new_label_strings = []

      isbn = item[0]
      labels_level1 = item[1] #  can contain one or two labels

      labels_next = find_next_level(labels_level1[0])
      new_labels = get_nextlevel_preds(preds_level2[index],labels_level2,labels_next) 
      # Get the label with the highest probability
      # Format: [label, probability]
      max_label = get_max_from_list(new_labels, cutoff=cutoff)

      if(max_label != ()):
        new_label_strings.append(max_label[0])

      # Add extra level 2 labels if level 1 had two labels
      # TODO: refactor
      if(len(labels_level1) == 2):
        labels_next2 = find_next_level(labels_level1[1])    
        new_labels2 = get_nextlevel_preds(preds_level2[index],labels_level2,labels_next2)
        max_label2 = get_max_from_list(new_labels2, cutoff=cutoff)
        
        if(max_label2 != ()):
          new_label_strings.append(max_label2[0])

      new_entry = [isbn, labels_level1, new_label_strings]
      new_ansers_list.append(new_entry)
    return new_ansers_list
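
One possible direction for the refactoring mentioned in the TODO is to loop over the parent labels instead of handling the first and second label separately. The sketch below is only an illustration: `pick_children` is a hypothetical name, the `children_of` dictionary stands in for `find_next_level`, and the hierarchy fragment and scores are invented.

```python
def pick_children(parents, scores, labels, children_of, cutoff):
    """For each parent, pick its best-scoring child above cutoff."""
    picked = []
    for parent in parents:
        allowed = children_of.get(parent, [])
        pairs = [(l, s) for l, s in zip(labels, scores) if l in allowed]
        if not pairs:
            continue
        label, score = max(pairs, key=lambda pair: pair[1])
        if score > cutoff:
            picked.append(label)
    return picked

# Hypothetical hierarchy fragment and scores for one book
children_of = {
    'literatur_unterhaltung': ['fantasy', 'krimi_thriller'],
    'sachbuch': ['politik_gesellschaft', 'psychologie'],
}
labels = ['fantasy', 'krimi_thriller', 'politik_gesellschaft', 'psychologie']
scores = [0.55, 0.20, 0.04, 0.07]

print(pick_children(['literatur_unterhaltung', 'sachbuch'],
                    scores, labels, children_of, cutoff=0.09))
```

Unlike the function above, this variant would also handle more than two parent labels, which is a change in behavior and not just a cleanup.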

### Write output file

In [490]:
answers_list2 = new_answers(answers_list, cutoff=0.09)

In [491]:
answers_taskB = answers_list_to_file(answers_list2)

In [492]:
write_answerfile(answers_taskA, answers_taskB)

In [493]:
elapsed_second = timeit.default_timer() - start_time

## Subtask B (level 3)

### Level 3 Setup

In [494]:
# Depth level 3
train_df_level3, test_df_level3 = get_train_test(label_ids_level3, depth=2)

In [495]:
train_df_level3.head()

Unnamed: 0,isbn,title,body,copyright,authors,published,url,vornamen,heroische fantasy,joballtag_karriere,...,philosophie,tarot,systemische therapie_familienaufstellung,bauaufgaben,griechische literatur,gartendesigner,urgeschichte,reden_glückwünsche,antiquitäten,theater_ballett
0,9783641136291,Die Klinik,Ein Blick hinter die Kulissen eines Krankenhau...,(c) Verlagsgruppe Random House GmbH,Noah Gordon,2013-12-02,https://www.randomhouse.de/ebook/Die-Klinik/No...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,9783641185787,Die Erben von Midkemia 4,Die Bedrohungen für Midkemia und Kelewan wolle...,(c) Verlagsgruppe Random House GmbH,Raymond Feist,2016-06-20,https://www.randomhouse.de/ebook/Die-Erben-von...,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2,9783328103646,Völlig losgelöst,In der Dreizimmerwohnung stapeln sich Flohmark...,(c) Verlagsgruppe Random House GmbH,Susanne Weingarten,2019-01-14,https://www.randomhouse.de/Taschenbuch/Voellig...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9783453357792,Dich erfüllen,An der Seite von Damien fühlt sich Nikki zum e...,(c) Verlagsgruppe Random House GmbH,J. Kenner,2014-04-14,https://www.randomhouse.de/Taschenbuch/Dich-er...,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,9783844504958,Der Orientzyklus,"Wer Kara Ben Nemsi, Hadschi Halef Omar und Sir...",(c) Verlagsgruppe Random House GmbH,Karl May,2007-08-13,https://www.randomhouse.de/Hoerbuch-Download/D...,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [496]:
train3 = train_df_level3.copy()
test3 = test_df_level3.copy()
len(train3), len(test3)

(14548, 2079)

In [497]:
np.where(pd.isnull(test3['body']))

(array([], dtype=int64),)

In [498]:
np.where(pd.isnull(train3['body']))

(array([  623,   911,  3989,  4381,  4642,  7372,  8094,  8422, 11158,
        12850, 14060]),)

In [499]:
# Fill empty body with just the title (better results than author+title)
train3.body = np.where(train3.body.isnull(), train3.title , train3.body)

In [500]:
trn_term_doc_level3 = vec.fit_transform(train3[TEXT_COLUMN])
test_term_doc_level3 = vec.transform(test3[TEXT_COLUMN])

In [501]:
trn_term_doc_level3, test_term_doc_level3

(<14548x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 865403 stored elements in Compressed Sparse Row format>,
 <2079x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 123452 stored elements in Compressed Sparse Row format>)

In [525]:
# SAVE
scipy.sparse.save_npz('trn_term_doc_level3.npz', trn_term_doc_level3)
scipy.sparse.save_npz('test_term_doc_level3.npz', test_term_doc_level3)

In [526]:
# LOAD - SKIP TO HERE
trn_term_doc_level3 = scipy.sparse.load_npz('trn_term_doc_level3.npz')
test_term_doc_level3 = scipy.sparse.load_npz('test_term_doc_level3.npz')

In [502]:
preds_level3 = np.zeros((len(test), len(label_ids_level3)))

for index, label_id in enumerate(label_ids_level3):
    label_values = train3[label_id].values
    print('Fitting: ', label_id)
    m,r = get_model(label_values, trn_term_doc_level3)
    preds_level3[:,index] = m.predict_proba(test_term_doc_level3.multiply(r))[:,1] # all rows, column 1: probability of the positive class

Fitting:  vornamen
[LibLinear]Fitting:  heroische fantasy
[LibLinear]Fitting:  joballtag_karriere
[LibLinear]Fitting:  psychothriller
[LibLinear]Fitting:  große gefühle
[LibLinear]Fitting:  feiern_feste
[LibLinear]Fitting:  medizin_forensik
[LibLinear]Fitting:  phantastik
[LibLinear]Fitting:  ökologie_umweltschutz
[LibLinear]Fitting:  aktuelle debatten
[LibLinear]Fitting:  ganzheitliche psychologie lebenshilfe
[LibLinear]Fitting:  nordamerikanische literatur
[LibLinear]Fitting:  babys_kleinkinder
[LibLinear]Fitting:  schwangerschaft_geburt
[LibLinear]Fitting:  tod_trauer
[LibLinear]Fitting:  nordische krimis
[LibLinear]Fitting:  gesunde ernährung
[LibLinear]Fitting:  junge literatur
[LibLinear]Fitting:  kreatives
[LibLinear]Fitting:  einfamilienhausbau
[LibLinear]Fitting:  künstler_dichter_denker
[LibLinear]Fitting:  themenkochbuch
[LibLinear]Fitting:  abenteuer_action
[LibLinear]Fitting:  science thriller
[LibLinear]Fitting:  justizthriller
[LibLinear]Fitting:  besser leben
[LibLinear

In [503]:
len(preds_level3), len(preds_level3[0,:]) # 2079 elements, 242 classes

(2079, 242)

### Level 3 Predictions

In [504]:
# NOTE: On level 3 a book can have several leaves under the same level 2 label (e.g. a/b/c1 and a/b/c2)
def new_answers2(answers_list2, cutoff=0.7, multi_label_cutoff=1.0):
    new_ansers_list = []
    
    for index, item in enumerate(answers_list2):
      new_label_strings = []

      isbn = item[0]
      labels_level1 = item[1] #  can contain one or two labels
      labels_level2 = item[2] # can contain one or two labels
    
      # Find next labels if 2nd level is not empty
      if(labels_level2 != []):
        labels_next = find_next_level(labels_level2[0])
        
        new_labels = get_nextlevel_preds(preds_level3[index],labels_level3,labels_next) 
        max_labels = get_max_from_list_multi(new_labels, cutoff=multi_label_cutoff)

        new_label_strings.append(max_labels)
        
        # For the cases where there are two level 2 labels
        if(len(labels_level2) > 1):
          labels_next2 = find_next_level(labels_level2[1])        
          new_labels2 = get_nextlevel_preds(preds_level3[index],labels_level3,labels_next2)
          max_labels2 = get_max_from_list(new_labels2, cutoff=cutoff)
          if(max_labels2 != ()):
            new_label_strings.append([max_labels2[0]])

        new_entry = [isbn, labels_level1, labels_level2, new_label_strings]
        
      else:
        new_entry = [isbn, labels_level1, labels_level2]

      new_ansers_list.append(new_entry)
    return new_ansers_list
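
The multi-leaf selection for the first level 2 label can also be expressed as a boolean mask over one row of scores. This is a vectorized sketch with invented scores and an assumed set of children, not the notebook's implementation:

```python
import numpy as np

# Hypothetical level-3 labels and scores for one book
labels_l3 = np.array(['heroische fantasy', 'phantastik', 'psychothriller'])
scores = np.array([0.62, 0.18, 0.40])

# Assumed children of the book's level-2 label
children = ['heroische fantasy', 'phantastik']

# Keep children whose score exceeds the multi-label cutoff
mask = np.isin(labels_l3, children) & (scores > 0.15)
leaves = labels_l3[mask].tolist()
print(leaves)
```

With a cutoff of 0.15 both children survive, so the book would receive two leaves under the same parent.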

### Write Final Output File

In [505]:
answers_list3 = new_answers2(answers_list2, cutoff=0.7, multi_label_cutoff=0.15)

In [506]:
answers_taskB = answers_list_to_file(answers_list3)

In [507]:
write_answerfile(answers_taskA, answers_taskB)

## Runtime

In [83]:
# Total time the notebook ran (printed in the next cell)
elapsed_final = timeit.default_timer() - start_time

In [84]:
mins1, secs1 = divmod(elapsed_first, 60)
hours1, mins1 = divmod(mins1, 60)

print("Running time level 1: %d:%d:%d.\n" % (hours1, mins1, secs1))

mins2, secs2 = divmod(elapsed_second, 60)
hours2, mins2 = divmod(mins2, 60)

print("Running time level 2: %d:%d:%d.\n" % (hours2, mins2, secs2))

mins3, secs3 = divmod(elapsed_final, 60)
hours3, mins3 = divmod(mins3, 60)

print("Total running time: %d:%d:%d.\n" % (hours3, mins3, secs3))

Running time level 1: 0:8:24.

Running time level 2: 0:21:58.

Total running time: 0:46:55.
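
The `divmod` arithmetic above can be replaced by `datetime.timedelta`, which also zero-pads minutes and seconds. A short sketch (`format_elapsed` is a hypothetical helper name):

```python
from datetime import timedelta

def format_elapsed(seconds):
    """Format a duration in seconds as H:MM:SS."""
    return str(timedelta(seconds=int(seconds)))

print(format_elapsed(504.3))
```

As computed in the cells above, all three values are measured from `start_time`, so each later value includes the earlier ones.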



## ---- Model Search ----

## Imports and data loading

The provided training data from phase one of the competition is split 80/20 into train and test sets, without shuffling.

In [110]:
# Imports
import scipy.sparse  # load the sparse submodule explicitly; it is used below

In [111]:
# Data-loading
X_tmp = train['body'].copy()
y_tmp = train[label_ids].copy()
len(X_tmp), len(y_tmp)

(14548, 14548)

In [112]:
# Setup
test_percentage=0.2

## Vectorize

In [113]:
vec = TfidfVectorizer(analyzer='word', ngram_range=(1,1), tokenizer=tokenize_spacy,
               min_df=4, max_df=0.4, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True, lowercase=False, binary=False)
X_vec = vec.fit_transform(X_tmp)

In [527]:
X_vec[0]

<1x23648 sparse matrix of type '<class 'numpy.float64'>'
	with 41 stored elements in Compressed Sparse Row format>
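
The effect of `min_df` and `max_df` is easier to see on a tiny corpus. The sketch below uses invented sentences, scikit-learn's default tokenizer instead of `tokenize_spacy`, and looser thresholds than the notebook, so it illustrates the mechanism only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; invented for illustration
docs = [
    'der Hund jagt die Katze',
    'der Hund schläft',
    'der Vogel singt',
    'der Vogel schläft',
]

# min_df=2 drops words seen in fewer than 2 documents,
# max_df=0.6 drops words seen in more than 60% of documents
toy_vec = TfidfVectorizer(min_df=2, max_df=0.6, sublinear_tf=True,
                          smooth_idf=True, lowercase=False)
toy_matrix = toy_vec.fit_transform(docs)
vocab = sorted(toy_vec.vocabulary_)
print(vocab, toy_matrix.shape)
```

Here `der` is removed for being too frequent and words such as `Katze` for being too rare, which leaves three features.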

In [114]:
X_train, X_test = train_test_split(X_vec, test_size=test_percentage, shuffle=False)
X_train, X_test

(<11638x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 693282 stored elements in Compressed Sparse Row format>,
 <2910x23648 sparse matrix of type '<class 'numpy.float64'>'
 	with 172121 stored elements in Compressed Sparse Row format>)

In [115]:
y_train_tmp, y_test_tmp = train_test_split(y_tmp, test_size=test_percentage, shuffle=False)

len(y_train_tmp), len(y_test_tmp)

(11638, 2910)
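
Splitting features and labels in two separate calls only keeps the rows aligned because `shuffle=False` cuts both at the same position. A toy sketch with invented arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows; features and labels share the row order
X_toy = np.arange(10).reshape(10, 1)
y_toy = np.arange(10) % 2

# Separate calls, as in the cells above; shuffle=False keeps the order
X_a, X_b = train_test_split(X_toy, test_size=0.2, shuffle=False)
y_a, y_b = train_test_split(y_toy, test_size=0.2, shuffle=False)

print(X_b.ravel(), y_b)
```

With shuffling enabled, both arrays would have to go through a single `train_test_split(X, y, ...)` call (or share a `random_state`) to stay aligned.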

In [116]:
y_train = scipy.sparse.coo_matrix(y_train_tmp.values, dtype=int)
y_train

<11638x8 sparse matrix of type '<class 'numpy.int64'>'
	with 12440 stored elements in COOrdinate format>

In [117]:
y_test = scipy.sparse.coo_matrix(y_test_tmp.values, dtype=int)
y_test

<2910x8 sparse matrix of type '<class 'numpy.int64'>'
	with 3113 stored elements in COOrdinate format>

In [118]:
feature_names = [('text', 'TEXT')]
feature_names

[('text', 'TEXT')]

In [119]:
label_names = list(zip(label_ids, [['0', '1'] for x in label_ids]))
label_names

[('ratgeber', ['0', '1']),
 ('kinderbuch_jugendbuch', ['0', '1']),
 ('literatur_unterhaltung', ['0', '1']),
 ('sachbuch', ['0', '1']),
 ('ganzheitliches bewusstsein', ['0', '1']),
 ('architektur_garten', ['0', '1']),
 ('glaube_ethik', ['0', '1']),
 ('künste', ['0', '1'])]

## Test of different models

In [121]:
import sklearn.metrics as metrics
from sklearn.ensemble import VotingClassifier

In [270]:
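# Wraps the model in a one-vs-rest classifier (one binary classifier per label),
# fits it and returns the micro-averaged F1 score on the test split.
# Note: X_test is not a parameter; it is read from the notebook's global scope.
def prediction_from_model(model, X_train, y_train, y_test):
  classifier = OneVsRestClassifier(model)
  classifier.fit(X_train, y_train)
  prediction = classifier.predict(X_test)
  return metrics.f1_score(y_test, prediction, average='micro')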
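
The helper above takes `X_test` from the enclosing scope instead of its arguments. A self-contained variant that passes it explicitly could look like the following sketch; `evaluate` is a hypothetical name and the toy data are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.multiclass import OneVsRestClassifier

def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit one binary classifier per label and return the micro-averaged F1."""
    classifier = OneVsRestClassifier(model)
    classifier.fit(X_train, y_train)
    prediction = classifier.predict(X_test)
    return f1_score(y_test, prediction, average='micro')

# Toy multi-label data: two features, two labels (invented)
X_tr = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 5, dtype=float)
y_tr = X_tr.astype(int)  # label j is on exactly when feature j is 1
score = evaluate(LogisticRegression(), X_tr, y_tr, X_tr, y_tr)
print(score)
```

Micro-averaging pools true and false positives over all labels before computing F1, so frequent labels weigh more than rare ones.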

### Decision Trees

In [266]:
from sklearn.tree import DecisionTreeClassifier

#### Vanilla

In [298]:
# criterion='gini' (alternative: 'entropy'), splitter='best' (alternative: 'random'),
# max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0,
# class_weight=None, max_features=None (int, float, 'auto', 'sqrt', 'log2')
model_dt_vanilla = DecisionTreeClassifier(random_state=seed)

In [299]:
prediction_from_model(model_dt_vanilla, X_train, y_train, y_test) # 0.60491432266408007

0.60491432266408007

#### Optimized

In [377]:
model_dt = DecisionTreeClassifier(random_state=seed, splitter='random', min_samples_split=15)

In [378]:
prediction_from_model(model_dt, X_train, y_train, y_test) # 0.61245110821382009

0.61245110821382009

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

#### Vanilla

In [302]:
model_rf_vanilla = RandomForestClassifier(random_state=seed)

In [291]:
prediction_from_model(model_rf_vanilla, X_train, y_train, y_test) # 0.61246504194966034



0.61246504194966034

#### Optimized

In [303]:
model_rf = RandomForestClassifier(random_state=seed, n_estimators=200, min_samples_split=5,
                                  bootstrap=False, class_weight='balanced_subsample', n_jobs=-1)

In [304]:
prediction_from_model(model_rf, X_train, y_train, y_test) # 0.6667

0.66666666666666663

### K-Nearest Neighbors

In [156]:
from sklearn.neighbors import KNeighborsClassifier
# https://scikit-learn.org/stable/modules/neighbors.html#classification

#### Vanilla

In [318]:
# n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2
# metric='minkowski', metric_params=None, n_jobs=None, **kwargs
model_knn_vanilla = KNeighborsClassifier()

In [384]:
prediction_from_model(model_knn_vanilla, X_train, y_train, y_test) # 0.71636228102869925

0.71636228102869925

#### Optimized

In [385]:
# With p=1 the Minkowski metric is the Manhattan distance (l1); with p=2 (the default) it is the Euclidean distance (l2)
model_knn = KNeighborsClassifier(weights='distance', n_neighbors=9, n_jobs=-1) # euclidean distance

In [386]:
prediction_from_model(model_knn, X_train, y_train, y_test) # 0.73769585253456216

0.73769585253456216

### Logistic Regression

#### Vanilla

In [379]:
# random_state=None: If None, the random number generator is the RandomState instance used by np.random
model_lr_vanilla = LogisticRegression(random_state=seed)

In [382]:
prediction_from_model(model_lr_vanilla, X_train, y_train, y_test) # 0.69192876089427802



0.69192876089427802

#### Optimized

In [None]:
model_lr = LogisticRegression(C=40.0, dual=False, solver='liblinear', multi_class='auto', max_iter=1000,
                           penalty='l2', class_weight='balanced', verbose=1)

In [383]:
prediction_from_model(model_lr, X_train, y_train, y_test) # 0.78828099708643573

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]

0.78828099708643573

### Multinomial Naive Bayes

In [390]:
from sklearn.naive_bayes import MultinomialNB

#### Vanilla

In [None]:
# alpha=1.0, fit_prior=True, class_prior=None
model_mnb_vanilla = MultinomialNB()

In [391]:
prediction_from_model(model_mnb_vanilla, X_train, y_train, y_test) # 0.62527190033616764

0.62527190033616764

#### Optimized

In [430]:
model_mnb = MultinomialNB(alpha=0.08)

In [431]:
prediction_from_model(model_mnb, X_train, y_train, y_test) # 0.77026346702466875

0.77026346702466875

### Linear SVC

In [None]:
from sklearn.svm import LinearSVC, SVC

#### Vanilla

In [388]:
# model_svm_vanilla = LinearSVC()

# Needed due to lack of predict_proba() for LinearSVC -> very slow
model_svm_vanilla = SVC(random_state=seed, kernel='linear', probability=True)

In [389]:
%%time
prediction_from_model(model_svm_vanilla, X_train, y_train, y_test) # 0.77313276193807945

CPU times: user 15min 32s, sys: 4.31 s, total: 15min 36s
Wall time: 15min 40s


0.77313276193807945

#### Optimized

In [251]:
#model_svm = LinearSVC(C=1.0, class_weight='balanced') # 0.78895527208138228
# Needed due to lack of predict_proba() for LinearSVC -> very slow
# 0.78816793893129766
model_svm = SVC(kernel='linear', probability=True, C=1.0, class_weight='balanced')

In [432]:
%%time
prediction_from_model(model_svm, X_train, y_train, y_test) # 0.78816793893129766

CPU times: user 17min 46s, sys: 3.04 s, total: 17min 49s
Wall time: 17min 51s


0.78816793893129766

### SVC

#### Vanilla

In [441]:
model_svc_vanilla = SVC()

In [442]:
%%time
prediction_from_model(model_svc_vanilla, X_train, y_train, y_test) # 0.51270131163871824

CPU times: user 3min 8s, sys: 961 ms, total: 3min 9s
Wall time: 3min 10s




0.51270131163871824

#### Optimized

In [443]:
model_svc = SVC(C=15900.0, class_weight='balanced', cache_size=500)

In [444]:
%%time
prediction_from_model(model_svc, X_train, y_train, y_test) # 0.78778083077420391

CPU times: user 3min 40s, sys: 996 ms, total: 3min 41s
Wall time: 3min 42s




0.78778083077420391

### Ensemble

In [None]:
ensemble = VotingClassifier(estimators=[('Logistic Regression', model_lr),
                                        ('KNN', model_knn),
                                        ('Naive Bayes', model_mnb)],
                            voting='soft')

In [459]:
#model_dt
#model_rf
#model_mnb
#model_knn
#model_svc
#model_svm
#model_lr
top2_estimators = [('4', model_knn), ('7', model_lr)]

top2b_estimators = [('3', model_mnb), ('7', model_lr)]

top3_estimators = [('3', model_mnb), ('4', model_knn), ('7', model_lr)]

all_estimators = [('1', model_dt),('2', model_rf), ('3', model_mnb), ('4', model_knn), ('7', model_lr)]

ensemble_top2 = VotingClassifier(estimators=top2_estimators,
                            voting='soft')

ensemble_top2b = VotingClassifier(estimators=top2b_estimators,
                            voting='soft')

ensemble_top3 = VotingClassifier(estimators=top3_estimators,
                            voting='soft')

ensemble_all = VotingClassifier(estimators=all_estimators,
                            voting='soft')
# Soft voting is used: with voting='hard' the VotingClassifier does not define predict_proba
# model_lr, model_knn, model_mnb: 0.79495052882975081
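
What `voting='soft'` computes can be checked by hand on a toy problem. The sketch below uses invented data and default hyperparameters instead of the tuned models above; it compares the ensemble's probabilities with the plain average of its members' probabilities:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy non-negative features (MultinomialNB requires them); data invented
X = np.array([[3, 0], [2, 1], [0, 3], [1, 2]] * 5, dtype=float)
y = np.array([0, 0, 1, 1] * 5)

lr = LogisticRegression().fit(X, y)
nb = MultinomialNB().fit(X, y)
vote = VotingClassifier(estimators=[('lr', LogisticRegression()),
                                    ('nb', MultinomialNB())],
                        voting='soft').fit(X, y)

# Soft voting averages the members' class probabilities
manual = (lr.predict_proba(X) + nb.predict_proba(X)) / 2
print(np.allclose(vote.predict_proba(X), manual))
```

Because the average is unweighted by default, a weak member with confident probabilities can pull the ensemble down, which may be one reason the larger ensembles score lower in the runs below.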

In [456]:
%%time
prediction_from_model(ensemble_top2, X_train, y_train, y_test) # 0.80006788866259326

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]CPU times: user 15.7 s, sys: 42.3 s, total: 57.9 s
Wall time: 55.8 s


0.80006788866259326

In [460]:
%%time
prediction_from_model(ensemble_top2b, X_train, y_train, y_test) # 0.79670239076669425

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]CPU times: user 7.84 s, sys: 225 ms, total: 8.06 s
Wall time: 2.06 s


0.79670239076669425

In [457]:
%%time
prediction_from_model(ensemble_top3, X_train, y_train, y_test) # 0.79495052882975081

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]CPU times: user 15.7 s, sys: 41.6 s, total: 57.3 s
Wall time: 54.6 s


0.79495052882975081

In [458]:
%%time
prediction_from_model(ensemble_all, X_train, y_train, y_test) # 0.77105907025515563

[LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear][LibLinear]CPU times: user 2min 48s, sys: 41.4 s, total: 3min 29s
Wall time: 3min 23s


0.77105907025515563