### FastText Classification
FastText [1] is a neural-network-based text classification model designed to be computationally efficient. The code below implements the basic architecture of the model described in [1].

The model is trained with the Adam optimizer rather than plain mini-batch stochastic gradient descent, because Adam was found to improve accuracy significantly on the dataset used here. Adam is a computationally efficient algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments.
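
To make the update rule concrete, here is a minimal pure-Python sketch of a single Adam step for one scalar parameter. The function name `adam_step`, the toy objective, and the default values `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are illustrative assumptions; the notebook itself relies on `torch.optim.Adam` and does not implement the update by hand.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.005, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem (assumption): minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.05)
print(w)  # ends up close to the minimiser 3.0
```

Because the step is scaled by the ratio of the two moment estimates, the very first update has magnitude close to `lr` regardless of how large the gradient is.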

When the training data are sequences of variable length, we cannot simply stack multiple training sequences into one tensor. Instead, it is common to assume a maximal sequence length, so that all sequences in a batch fit into tensors of the same dimensions. Sequences shorter than the maximal length are padded with a special pad word until all sequences in the batch have the same length. A pad word is a special token whose embedding is an all-zero vector, so pad words do not change the output of the model. In this code, the pad word has an ID of 0. Additionally, the number of words in each input sentence (before padding) is provided as an input to the FastText model, so that the summed embeddings can be divided by the true sentence length.
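
The padding scheme described above can be sketched in pure Python as follows. The helper names `pad_batch` and `mean_embedding` and the toy embedding table are hypothetical, and the sketch pads to the longest sequence in the batch rather than to a fixed global maximum; the notebook's own batching code may differ.

```python
PAD_ID = 0  # same pad ID as described in the text

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]  # true lengths before padding
    return padded, lengths

def mean_embedding(padded_seq, length, embeddings):
    """Sum the word vectors and divide by the true (unpadded) length."""
    dim = len(embeddings[0])
    total = [0.0] * dim
    for word_id in padded_seq:
        for k in range(dim):
            total[k] += embeddings[word_id][k]  # the PAD row adds only zeros
    return [value / length for value in total]

# Toy embedding table (assumption): row 0 is the all-zero PAD vector
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
batch = [[1, 2, 3], [2]]

padded, lengths = pad_batch(batch)
means = [mean_embedding(seq, n, embeddings) for seq, n in zip(padded, lengths)]
print(padded)   # [[1, 2, 3], [2, 0, 0]]
print(lengths)  # [3, 1]
print(means)    # [[3.0, 4.0], [3.0, 4.0]]
```

The second sentence keeps the mean `[3.0, 4.0]` of its single real word because the two pad positions contribute zero vectors and the division uses the true length 1, not the padded length 3.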

[1] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of Tricks for Efficient Text Classification. arXiv preprint arXiv:1607.01759, 2016.

In [1]:
# coding: utf-8
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

import collections
import math
import os
import random

import nltk
nltk.download('punkt')
from nltk import word_tokenize
from collections import namedtuple

import sys, getopt

from random import shuffle


num_classes = 3

learning_rate = 0.005
num_epochs = 3
batch_size = 10
embedding_dim = 10

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\stefa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
class FastText(nn.Module):
    """Define the computation graph for the FastText model."""
    
    def __init__(self, vocab_size, num_classes, embedding_dim, learning_rate):
        """Init the model with default parameters/hyperparameters."""
        super(FastText, self).__init__()
        self.num_classes = num_classes
        self.embedding_dim = embedding_dim
        self.learning_rate = learning_rate
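        # cross_entropy applies log-softmax internally, so forward() returns raw logits
        self.loss_func = F.cross_entropy
        # create all the variables (weights) that the model needs here
        self.vocab_size = vocab_size
        self.output = nn.Linear(self.embedding_dim, self.num_classes)
        # padding_idx=0 keeps the embedding of the pad word (ID 0) fixed at the zero vector
        self.embedder = nn.Embedding(self.vocab_size, self.embedding_dim, padding_idx=0)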
        
        self.optimizer = torch.optim.Adam(self.parameters(), lr=learning_rate)
        
    
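    def forward(self, x, sens_lengths):
        """Average the word embeddings of each sentence and map the result to class logits."""
        # x holds word IDs with shape (batch, max_len); padded positions embed to zero vectors
        embedded = self.embedder(x)
        # Divide by the true (unpadded) lengths; sens_lengths must broadcast
        # against (batch, embedding_dim), e.g. have shape (batch, 1)
        x = embedded.sum(dim=1) / sens_lengths
        x = self.output(x)

        return x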

In [3]:
from fasttext import load_question_2_1, train_fast_text

word_to_id, train_data, valid_data, test_data = load_question_2_1('question_2-1_data')
model = FastText(len(word_to_id)+2, num_classes, embedding_dim=embedding_dim, learning_rate=learning_rate)

model_file_path = os.path.join('models', 'fasttext_model_file_q2-1')
train_fast_text(model, train_data, valid_data, test_data, model_file_path, batch_size=batch_size, num_epochs=num_epochs)

number of sequences is 8216. 
PAD word id is 0 .
Unknown word id is 1 .
size of vocabulary is 3666. 
read 1000 sentences from question_2-1_data\sentences_train.txt .
read 500 sentences from question_2-1_data\sentences_dev.txt .
read 500 sentences from question_2-1_data\sentences_test.txt .

Epoch 0 : train loss = 1.1029887515306473 , validation accuracy = 0.40799999237060547 .
Epoch 1 : train loss = 1.0515777665376662 , validation accuracy = 0.421999990940094 .
Epoch 2 : train loss = 0.9764573341608047 , validation accuracy = 0.4440000057220459 .
Accuracy on the test set : 0.4560000002384186.


### Dataset

The dataset provided contains three files: **train.json**, **validation.json**, and **test.json**, which are the training dataset, validation dataset, and the test dataset, respectively. 
See an example below: 
```
{
   "ID": "S1",
   "Label": 3,
   "Sentence":"What country has the best defensive position in the board game Diplomacy ?"
}
```
In the training set and the validation set, the response variable is called `Label`.
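
As a quick illustration of this format, the sketch below parses records of that shape and counts the labels. It assumes each file is a JSON list of such records and that `ID` is a string; the second record is made up for illustration, and the in-memory string `raw` stands in for the contents of a file such as **train.json**.

```python
import json
from collections import Counter

# Hypothetical stand-in for the contents of train.json (second record is invented)
raw = """[
  {"ID": "S1", "Label": 3,
   "Sentence": "What country has the best defensive position in the board game Diplomacy ?"},
  {"ID": "S2", "Label": 1,
   "Sentence": "An invented second question ?"}
]"""

records = json.loads(raw)  # a list of dicts, one per sentence

# Records without a Label key (as in a test set) are simply skipped
label_counts = Counter(r["Label"] for r in records if "Label" in r)
sentences = [r["Sentence"] for r in records]

print(label_counts[3], label_counts[1])  # 1 1
print(len(sentences))                    # 2
```

For a file on disk, `json.load(open_file)` plays the same role as `json.loads(raw)` here.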

In [4]:
import json # You can use this library to read the .json files into Python objects: https://docs.python.org/3/library/json.html
from nltk import word_tokenize # You can use this to tokenize strings, or do your own pre-processing.

In [5]:
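nltk.download("stopwords")
nltk.download("brown")              # tagged corpus used to train the POS taggers
nltk.download("universal_tagset")   # mapping required by tagset="universal"
from nltk.corpus import stopwords, brown
from nltk import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger, pos_tag


# Train a chain of trigram, bigram and unigram POS taggers. Each tagger backs off to the
# next one when it has not seen the context, ending with a default tag of NOUN.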
brown_train = brown.tagged_sents(tagset="universal")
t0 = DefaultTagger("NOUN")
t1 = UnigramTagger(brown_train, backoff=t0)
t2 = BigramTagger(brown_train, backoff=t1)
t3 = TrigramTagger(brown_train, backoff=t2)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\stefa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
from nltk.util import ngrams

stopwords_en = set(stopwords.words('english'))

num_classes = 6
learning_rate = 0.1
embedding_dim = 10

pad_word_id = 0
unknown_word_id = 1

from prepros import preprocessor
from fasttext import Dataset, map_word_to_id, map_token_seq_to_word_id_seq



def custom_preprocessor(text):
    """
    Applies preprocessing to given string of text.
    
    Inputs:
        text (string) : input string to preprocess
        
    Outputs:
        tokens (list) : list of preprocessed tokens
    """
    output = []
    
    tokens = word_tokenize(text)                                 # Split sentence into tokens
    tokens = [i for i in tokens if i not in stopwords_en]        # Drop stopwords (case-sensitive match against a lowercase list)
    
    bigram = ngrams(tokens, 2)                                   # Extract bigrams from tokens
    bigrams = ["".join(i) for i in bigram]
    
    tags = t3.tag(tokens)                                        # POS tag tokens
    tags = ["".join(i) for i in tags]
    
    output.extend(bigrams)
    output.extend(tags)
    output.extend(tokens)

    return output
    
    
def build_vocabulary(training_dict):
    """
    Build a vocabulary from the training data set.
    
    Inputs:
        training_dict (dictionary): data from json training data
    Outputs:
        word_to_id (dictionary) : A dictionary mapping of words to ids
    """
    
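    data = []
    # Note: unlike custom_preprocessor, stopwords are not removed here,
    # so bigrams are formed over the unfiltered token sequence
    for i in training_dict: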
        tokens = word_tokenize(i['Sentence'])          # Split sentence into tokens       
        bigram = ngrams(tokens, 2)                     # Extract bigrams from tokens             
        tags = t3.tag(tokens)                          # POS tag word tokens
        
        bigrams = ["".join(j) for j in bigram]
        tags = ["".join(k) for k in tags]
        data.extend(bigrams)
        data.extend(tags)
        data.extend(tokens)
    print('number of sequences is %s. ' % len(data))
    count = [['$PAD$', pad_word_id], ['$UNK$', unknown_word_id]]
    sorted_counts = collections.Counter(data).most_common()
    count.extend(sorted_counts)
    word_to_id = dict()
    for word, _ in count:
        word_to_id[word] = len(word_to_id)

    print("PAD word id is %s ." % word_to_id['$PAD$'])
    print("Unknown word id is %s ." % word_to_id['$UNK$'])
    print('size of vocabulary is %s. ' % len(word_to_id))
    return word_to_id


def create_dataset(sentence_dict, word_to_id):
    """
    Create dataset given data loaded from JSON file.
    
    Inputs:
        sentence_dict (dictionary): data from JSON file
        word_to_id (dictionary) : dictionary mapping of words to their IDs
        
    Outputs:
        dataset (Dataset) : Dataset object
    """
    sentences = []
    labels = []
    
    for i in sentence_dict:
        word_id_seq = map_token_seq_to_word_id_seq(custom_preprocessor(i['Sentence']), word_to_id)        
        sentences.append(word_id_seq)
        if 'Label' in i:
            labels.append(i['Label'])
            
    if len(labels) == 0:
        dataset = Dataset(sentences)
    else:
        dataset = Dataset(sentences, labels)
        
    return dataset
    
    

def load_question_2_2(data_folder):
    """
    Prepare and load relevant files for training FastText model.
    
    Inputs:
        data_folder (string) : Path of the data folder
    
    Outputs:
        word_to_id (dictionary) : A dictionary mapping of words to ids
        train_dataset, validation_dataset, test_dataset (Dataset) : Dataset objects for the training, validation and test
                                                                    datasets
        test_ids (List) : List of IDs for corresponding sentences in the test dataset
    """
    # Read in json data files
    trainingSentenceFile = os.path.join(data_folder, "train.json")
    validationSentenceFile = os.path.join(data_folder, "validation.json")
    testSentenceFile = os.path.join(data_folder, "test.json")
    
    with open(trainingSentenceFile, 'r') as train_file:
        trainingSentences = json.load(train_file)
    with open(validationSentenceFile, 'r') as validation_file:
        validationSentences = json.load(validation_file)
    with open(testSentenceFile, 'r') as test_file:
        testSentences = json.load(test_file)
    
    # Build word to id mapping
    word_to_id = build_vocabulary(trainingSentences)
    
    # Create Dataset objects and list of test IDs
    train_dataset = create_dataset(trainingSentences, word_to_id)
    validation_dataset = create_dataset(validationSentences, word_to_id)
    test_dataset = create_dataset(testSentences, word_to_id)
    test_ids = [i["ID"] for i in testSentences]
    
    # Return word to id mapping, Dataset objects and test IDs
    return word_to_id, train_dataset, validation_dataset, test_dataset, test_ids
    
    

model_file_path = os.path.join('models', 'fasttext_model_file_q2-2')
word_to_id, train_dataset, valid_dataset, test_dataset, test_ids = load_question_2_2('question_2-2_data')
model = FastText(len(word_to_id)+2, num_classes, embedding_dim=embedding_dim, learning_rate=learning_rate)
train_fast_text(model, train_dataset, valid_dataset, test_dataset, model_file_path, batch_size=100, num_epochs=20)

number of sequences is 221205. 
PAD word id is 0 .
Unknown word id is 1 .
size of vocabulary is 44914. 
Epoch 0 : train loss = 0.6694232958394128 , validation accuracy = 0.9279999732971191 .
Epoch 1 : train loss = 0.0261098971854694 , validation accuracy = 0.9300000071525574 .
Epoch 2 : train loss = 0.0027319554324463213 , validation accuracy = 0.9369999766349792 .
Epoch 3 : train loss = 0.0013911327912717604 , validation accuracy = 0.9430000185966492 .
Epoch 4 : train loss = 0.000524844769569353 , validation accuracy = 0.9369999766349792 .
Epoch 5 : train loss = 0.00031905708247732773 , validation accuracy = 0.9430000185966492 .
Epoch 6 : train loss = 0.0002386649818642606 , validation accuracy = 0.9430000185966492 .
Epoch 7 : train loss = 0.0001895769731876302 , validation accuracy = 0.9409999847412109 .
Epoch 8 : train loss = 0.00015442724277916667 , validation accuracy = 0.9430000185966492 .
Epoch 9 : train loss = 0.00012719354492366766 , validation accuracy = 0.9430000185966492 .


In [8]:
import pandas as pd

def write_results(data_folder):
    """
    Write prediction results to a CSV file in Kaggle's submission format.

    Inputs:
        data_folder (string) : Path of the folder that holds the model's predictions and receives the output file
    """
    predictions = os.path.join(data_folder, "fasttext_model_file_q2-2predictions.csv")
    df = pd.read_csv(predictions, header=None, names=["category"])
    df.insert(0, "id", test_ids, True)
    
    output_file = os.path.join(data_folder, "q2-2predictions.csv")
    df.to_csv(output_file, index=False)

write_results("models")