# Multiclass text classification with fastai and n-grams ( ͡° ͜ʖ ͡°) #

### Motivation ###
While there are many examples of text classification with only two classes, I decided to tackle a multiclass problem. The data I used are Stack Overflow user stories (posts) with tags assigned to them.

In [None]:
from fastai import *
from fastai.text import *
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
from collections import Counter  # counts tokens per document
from functools import partial
import io
import os
import numpy as np
import pandas as pd
import scipy.sparse  # CSR term-document matrix
import matplotlib.pyplot as plt
import sklearn.feature_extraction.text as sklearn_text
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer

A function to clean our texts a bit:

In [None]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # raw string, so the backslashes reach the regex engine untouched
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
        text: a string
        
        return: modified initial string
    """
    text = BeautifulSoup(text, "lxml").text # HTML decoding
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete symbols which are in BAD_SYMBOLS_RE from text
    #text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    return text

In [None]:
df = pd.read_csv('../input/stack-overflow-user-stories/stack-overflow-data.csv')
df = df[pd.notnull(df['tags'])]
df = df[pd.notnull(df['post'])]
df['post'] = df['post'].apply(clean_text)
df.head()

In [None]:
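len(df), df.size  # number of rows, and number of cells (rows * columns)

`df.size` reports almost 80 000, but note that it counts cells (rows × columns); `len(df)` gives the number of rows. Let's do a train/validation split, making sure the classes stay balanced. The *stratify* argument of scikit-learn's *train_test_split* function provides this.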

In [None]:
df_trn, df_val = train_test_split(df, stratify=df['tags'], test_size=0.2, random_state=12)
df_trn['tags'].value_counts()
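
To make the effect of stratification concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements *stratify*; the function `stratified_split` and the toy labels are made up for illustration. The idea is simply to split each class separately so that both parts keep the same class proportions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=12):
    """Return (train_idx, test_idx), splitting every class with the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)                       # shuffle within the class
        n_test = round(len(idxs) * test_size)   # same share taken from every class
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# toy, imbalanced label list (hypothetical data)
labels = ["python"] * 50 + ["java"] * 30 + ["sql"] * 20
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(train_counts, test_counts)
```

Both parts keep the 5:3:2 ratio of the toy labels, which is what we want from the real split as well.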

Now we are ready to create a databunch. Creating one consists of the following two steps:

1. **Tokenization** - takes the words of each post and converts them into a standard form of tokens. Basically, each token represents a word.

2. **Numericalization** - next, we take the complete list of unique tokens; this is called the vocab, and it gets created for us. It contains every token that appears in any of the posts. We then replace every post with a list of numbers, the indices of its tokens in the vocab.

Tokenization followed by numericalization is the standard way in NLP of turning a document into a list of numbers. Fortunately, this can be easily done with fast.ai.
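
The two steps can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the whitespace tokenizer and the helper names (`tokenize`, `build_vocab`, `numericalize`) are made up here, and fastai's real tokenizer does much more (special tokens, rules, frequency cut-offs).

```python
from collections import Counter

UNK = "xxunk"  # name borrowed from fastai's unknown-word token

def tokenize(text):
    # naive assumption: lowercase and split on whitespace
    return text.lower().split()

def build_vocab(token_lists):
    # itos: index -> token, stoi: token -> index
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK] + sorted(counts)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # tokens missing from the vocab map to the index of UNK (0)
    return [stoi.get(tok, 0) for tok in tokens]

docs = ["how to sort a list", "how to parse json"]
token_lists = [tokenize(d) for d in docs]
itos, stoi = build_vocab(token_lists)
ids = [numericalize(toks, stoi) for toks in token_lists]
print(itos)
print(ids)
```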

In [None]:
labels = list(df['tags'].unique())
data = (TextList.from_df(df_trn, cols='post')
                .split_by_rand_pct(0.2)
                .label_from_df(classes=labels)
                .databunch(bs=48))
data.show_batch()

In NLP, a token is the basic unit of processing. Here, the tokens mostly correspond to words or punctuation, as well as several special tokens, corresponding to unknown words, capitalization, etc.

All those tokens starting with "xx" are fastai special tokens. You can see the list of all of them and their meanings in the fastai docs.

Let's look at the string-to-int mapping.

In [None]:
data.vocab.stoi

...

In [None]:
len(data.train_dl.x), len(data.valid_dl.x)

A term-document matrix represents each document as a "bag of words": we don't keep track of the order of the words, just which words occur (and how often). The implementation follows. Since most entries are zero, we store the matrix in the most common sparse format, compressed sparse row (CSR).

In [None]:
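def get_term_doc_matrix(label_list, vocab_len):
    """Build a sparse (documents x vocab) matrix of token counts in CSR format."""
    j_indices = []  # column (token id) of each stored count
    indptr = [0]    # indptr[i]:indptr[i+1] delimits the entries of document i
    values = []     # the counts themselves

    for doc in label_list:
        feature_counter = Counter(doc.data)  # token id -> count within this document
        j_indices.extend(feature_counter.keys())
        values.extend(feature_counter.values())
        indptr.append(len(j_indices))

    return scipy.sparse.csr_matrix((values, j_indices, indptr),
                                   shape=(len(indptr) - 1, vocab_len),
                                   dtype=int)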



val_term_doc = get_term_doc_matrix(data.valid_dl.x, len(data.vocab.itos))
trn_term_doc = get_term_doc_matrix(data.train_dl.x, len(data.vocab.itos))
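
To see what the three CSR arrays look like, here is a small pure-Python sketch on toy data. The helpers `csr_triple` and `row_as_dense` and the three toy documents are hypothetical; the loop mirrors the one in `get_term_doc_matrix`, but nothing here touches scipy.

```python
from collections import Counter

def csr_triple(docs):
    """Return (values, col_indices, indptr) for documents given as lists of token ids."""
    values, col_indices, indptr = [], [], [0]
    for doc in docs:
        counts = Counter(doc)                 # token id -> count in this document
        col_indices.extend(counts.keys())
        values.extend(counts.values())
        indptr.append(len(col_indices))       # where the next row starts
    return values, col_indices, indptr

def row_as_dense(row, values, col_indices, indptr, n_cols):
    """Expand one CSR row back into a dense list of counts."""
    dense = [0] * n_cols
    for k in range(indptr[row], indptr[row + 1]):
        dense[col_indices[k]] = values[k]
    return dense

docs = [[2, 0, 2], [1], [3, 3, 3, 0]]         # three toy documents, vocab of size 4
values, col_indices, indptr = csr_triple(docs)
print(values, col_indices, indptr)
print(row_as_dense(0, values, col_indices, indptr, n_cols=4))
```

Only the non-zero counts are stored, and `indptr` tells us which slice of `values` belongs to which document.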

In [None]:
x = trn_term_doc
y = data.train_dl.y.items
val_y = data.valid_dl.y.items

### Logistic Regression ###
Let's start with a simple logistic regression. The C parameter is already tuned.

In [None]:
m = LogisticRegression(C=0.03, dual=False)
m.fit(x, y)
preds = m.predict(val_term_doc)
(preds==val_y).mean()

### Binarized Logistic Regression ###
How about a binarized version? Here we only care whether a word occurs in a document or not. Its frequency does not matter.
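
As a tiny illustration of what binarizing means (a sketch on made-up counts, written with plain lists rather than the sparse matrix used in this notebook): every positive count becomes 1 and zeros stay 0, which is what taking the sign of a matrix of non-negative counts does.

```python
# toy term-document counts: 3 documents x 4 vocabulary entries (hypothetical data)
counts = [
    [1, 0, 2, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 3],
]

# keep only the information "does the word occur at all?"
binarized = [[1 if c > 0 else 0 for c in row] for row in counts]
print(binarized)
```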

In [None]:
m = LogisticRegression(C=0.03, dual=False)
m.fit(trn_term_doc.sign(), y)
preds = m.predict(val_term_doc.sign())
(preds==val_y).mean()

As we see, the accuracy here is a bit better. How about using a pair or a triple of words as a token? We will check out **n-grams**. Let's fit a regularized logistic regression where the features are unigrams, bigrams and trigrams. We will use *CountVectorizer* from *sklearn.feature_extraction.text*.
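
Before handing the work over to *CountVectorizer*, here is a rough pure-Python sketch of what extracting word n-grams from an already tokenized document means. The helper `word_ngrams` is made up for this illustration and is not scikit-learn's implementation; joining the tokens with a space is an assumption chosen to match how *CountVectorizer* names its word n-gram features.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    """Return all n-grams of the token list for n in [min_n, max_n]."""
    grams = []
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens over the document
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["how", "to", "sort", "lists"]
grams = word_ngrams(tokens)
print(grams)
```

A document of four tokens yields four unigrams, three bigrams and two trigrams, so the feature space grows quickly; this is why the vectorizer in this notebook caps it with `max_features`.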

In [None]:
# the documents are already tokenized, so preprocessing and tokenizing are no-ops
veczr = CountVectorizer(ngram_range=(1,3), preprocessor=noop, tokenizer=noop, max_features=800000)


Now we take the words from the training posts, turn them into n-grams, and add all unigrams, bigrams and trigrams to the vocabulary of our vectorizer.

In [None]:
docs = data.train_dl.x

In [None]:
train_words = [[docs.vocab.itos[o] for o in doc.data] for doc in data.train_dl.x]
train_ngram_doc = veczr.fit_transform(train_words)
train_ngram_doc

In [None]:
veczr.vocabulary_

Let's apply the analogous steps to the validation data.

In [None]:
valid_words = [[docs.vocab.itos[o] for o in doc.data] for doc in data.valid_dl.x]
val_ngram_doc = veczr.transform(valid_words)
val_ngram_doc

Now we can list the full vocabulary of n-grams:

In [None]:
vocab = veczr.get_feature_names()  # on scikit-learn >= 1.0 use veczr.get_feature_names_out()
vocab[100000:100005]

### Binarized Logistic Regression with Ngrams ###

Let's extend the model by adding our bigrams and trigrams to it.

In [None]:
y = data.train_dl.y
valid_labels = data.valid_dl.y.items

In [None]:
# dual=True is only supported by the liblinear solver, so request it explicitly
m = LogisticRegression(C=0.03, dual=True, solver='liblinear')
m.fit(train_ngram_doc.sign(), y.items)
preds = m.predict(val_ngram_doc.sign())
(preds == valid_labels).mean()

We see this is a bit better than the model with only unigrams. Let's tune the parameter C.

In [None]:
a_list = []
i_list = []
for i in range(1, 100):
    # try C = 0.01, 0.02, ..., 0.99 and record the validation accuracy of each
    m = LogisticRegression(C=i/100, dual=True, solver='liblinear')
    m.fit(train_ngram_doc.sign(), y.items)
    preds = m.predict(val_ngram_doc.sign())
    a = (preds == valid_labels).mean()
    a_list.append(a)
    i_list.append(i)
plt.plot(i_list, a_list)

So what is the best parameter C?

In [None]:
best_c = i_list[np.argmax(a_list)]/100
best_c

Now we fit the chosen model.

In [None]:
m = LogisticRegression(C=best_c, dual=True, solver='liblinear')  # dual=True requires liblinear
m.fit(train_ngram_doc.sign(), y.items)
preds = m.predict(val_ngram_doc.sign())
(preds == valid_labels).mean()

And finally plot the results.

In [None]:
%matplotlib inline
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale=2)

LABELS = df['tags'].unique()

confusion_matrix = metrics.confusion_matrix(valid_labels, preds)

plt.figure(figsize=(35, 15))
sns.heatmap(confusion_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d", annot_kws={"size": 20});
plt.title("Confusion matrix", fontsize=20)
plt.ylabel('True label', fontsize=20)
plt.xlabel('Predicted label', fontsize=20)
plt.show()
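
For readers who want to see what the heatmap is counting, here is a minimal pure-Python sketch of a confusion matrix on toy labels. The function `toy_confusion_matrix` and the data are made up; rows are true classes and columns are predicted classes, matching the axis labels of the plot above.

```python
def toy_confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] = number of samples with true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = toy_confusion_matrix(y_true, y_pred, n_classes=3)

# the diagonal holds the correct predictions, so accuracy = trace / total
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print(cm, accuracy)
```

Off-diagonal cells show which tags get confused with each other, which is often more informative than the single accuracy number.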

Thanks for reading! ( ͡ᵔ ͜ʖ ͡ᵔ )