# Question Type Classification

In [1]:
## Importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re, nltk
import gensim
import codecs

from sklearn.metrics import confusion_matrix, accuracy_score, average_precision_score
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer



The training and test datasets can be downloaded from http://cogcomp.cs.illinois.edu/Data/QA/QC/

The data files were renamed to training_dataset.txt and validation_dataset.txt.

In [2]:
## Reading the datasets
# Context managers close the files as soon as the lines have been read
with open('training_dataset.txt', 'r') as f_train:
    train = pd.DataFrame(f_train.readlines(), columns = ['Question'])
with open('validation_dataset.txt', 'r') as f_test:
    test = pd.DataFrame(f_test.readlines(), columns = ['Question'])

## Problem Statement

This problem is essentially a text classification task with hierarchical labels: coarse and fine. Since the question type is given in hierarchical form, we can either use the major class (e.g. DESC, ENTY, HUM etc.) for coarse classification, or use the fine labels (e.g. 'ind', 'place', 'others' etc.) along with the major class labels for a much more detailed, fine-grained question-type classification.

In this notebook, we look at both, with more focus on fine-grained classification.

In [3]:
# Separating text content (question) and target class/label
train['QType'] = train.Question.apply(lambda x: x.split(' ', 1)[0])
train['Question'] = train.Question.apply(lambda x: x.split(' ', 1)[1])
train['QType-Coarse'] = train.QType.apply(lambda x: x.split(':')[0])
train['QType-Fine'] = train.QType.apply(lambda x: x.split(':')[1])
test['QType'] = test.Question.apply(lambda x: x.split(' ', 1)[0])
test['Question'] = test.Question.apply(lambda x: x.split(' ', 1)[1])
test['QType-Coarse'] = test.QType.apply(lambda x: x.split(':')[0])
test['QType-Fine'] = test.QType.apply(lambda x: x.split(':')[1])

## Exploring the dataset (EDA)

In [4]:
train.describe()

Unnamed: 0,Question,QType,QType-Coarse,QType-Fine
count,5452,5452,5452,5452
unique,5381,50,6,47
top,How deep is a fathom ?\n,HUM:ind,ENTY,ind
freq,3,962,1250,962


In [5]:
test.describe()

Unnamed: 0,Question,QType,QType-Coarse,QType-Fine
count,500,500,500,500
unique,500,42,6,39
top,What is the state flower of Michigan ?\n,DESC:def,DESC,def
freq,1,123,138,123


In [6]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same combined frame
pd.concat([train, test]).describe()

Unnamed: 0,Question,QType,QType-Coarse,QType-Fine
count,5952,5952,5952,5952
unique,5871,50,6,47
top,How deep is a fathom ?\n,HUM:ind,ENTY,ind
freq,3,1017,1344,1017


### Findings

1. 50 different question types/target labels are present in train and test combined
2. For coarse classification, 6 question types/labels are present
3. There are some duplicate questions in the training set (5381 unique out of 5452). There are none in the test set
4. 10 questions in the test set are also present in the training set (5381 + 500 = 5881 unique questions separately, but only 5871 combined)
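
The duplicate and overlap counts above can also be checked directly instead of being read off the `describe()` output. A minimal sketch on made-up data follows; `train_toy` and `test_toy` are hypothetical stand-ins for the notebook's `train` and `test` frames.

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `test` frames (made-up rows)
train_toy = pd.DataFrame({"Question": ["How deep is a fathom ?",
                                       "How deep is a fathom ?",
                                       "Who wrote Hamlet ?"]})
test_toy = pd.DataFrame({"Question": ["Who wrote Hamlet ?",
                                      "What is a prism ?"]})

# Rows that repeat an earlier question within the same set
n_dup_train = int(train_toy["Question"].duplicated().sum())
n_dup_test = int(test_toy["Question"].duplicated().sum())

# Distinct test questions that also occur in the training set
n_overlap = int(test_toy["Question"].drop_duplicates()
                .isin(train_toy["Question"]).sum())

print(n_dup_train, n_dup_test, n_overlap)
```

Running the same three expressions on the real frames should reproduce findings 3 and 4, provided the comparison is done on the question text after the label prefix has been split off.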

## Label encoding of target variables

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(pd.Series(train.QType.tolist() + test.QType.tolist()).values)
train['QType'] = le.transform(train.QType.values)
test['QType'] = le.transform(test.QType.values)
le2 = LabelEncoder()
le2.fit(pd.Series(train['QType-Coarse'].tolist() + test['QType-Coarse'].tolist()).values)
train['QType-Coarse'] = le2.transform(train['QType-Coarse'].values)
test['QType-Coarse'] = le2.transform(test['QType-Coarse'].values)

## Approach :

The general solution pipeline would be : 

Text pre-processing ----> Feature extraction ----> Training using an ML/DL algorithm ----> Parameter tuning, with cross-validation ----> Prediction and checking accuracy, precision etc, on test set 

We'll iterate through a lot of methods for pre-processing, feature extraction and ML/DL algos.

Pre-processing : Text cleaning, stopword removal, stemming, lemmatization etc.

Feature extraction/Word embeddings : Count Vectors, TF-IDF, GloVe, Word2Vec

ML/DL algorithms : Linear models (Logistic Regression, Linear SVM), Non-linear models (Naive Bayes), Tree models (XGBoost, LightGBM), DL models (LSTMs/RNNs)

Some points to note :
1. In question-type classification problems, certain words appear mostly in a few
   classes and are rare or absent in the other classes/qtypes. Extracting features
   using TF-IDF should suit this well, as it down-weights words which occur in almost every
   question and emphasizes words which appear in only a few
2. Identification using "wh-" words (Why, What, When etc.) might be useful. They provide some context, but
   cannot be relied on too much
3. n-gram methods and LSTM/RNNs can also be explored to derive information from words appearing
   frequently/sequentially
4. The 50 target classes have a hierarchical structure. An alternative approach can be tried
   later, wherein we first predict the level 1 classes (DESC, HUM, NUM etc.) and then
   run a separate model within each of these classes to predict its sub-classes. This approach
   would be easy to interpret and explain, but risks overfitting, since each sub-class model
   sees only a fraction of the training data
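
The two-stage idea in point 4 can be sketched roughly as below. This is only an illustration under stated assumptions, not the approach used in the rest of the notebook: the questions and labels are made up, `LinearSVC` is picked arbitrarily as the base learner, and `predict_hierarchical` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up questions; labels follow the dataset's COARSE:fine format
questions = [
    "who invented the telephone", "who wrote hamlet",
    "what company makes the ipod", "what band sang yesterday",
    "how many moons does mars have", "how many people live in china",
    "when was the telephone invented", "when did the war end",
]
labels = ["HUM:ind", "HUM:ind", "HUM:gr", "HUM:gr",
          "NUM:count", "NUM:count", "NUM:date", "NUM:date"]

vec = TfidfVectorizer()
X_toy = vec.fit_transform(questions)
coarse = [label.split(":")[0] for label in labels]

# Stage 1: a single classifier over the coarse classes
coarse_clf = LinearSVC().fit(X_toy, coarse)

# Stage 2: one classifier per coarse class, trained only on that class's rows
fine_clfs = {}
for c in set(coarse):
    idx = [i for i, ci in enumerate(coarse) if ci == c]
    fine_labels = [labels[i] for i in idx]
    if len(set(fine_labels)) == 1:
        fine_clfs[c] = fine_labels[0]   # single sub-class: nothing to learn
    else:
        fine_clfs[c] = LinearSVC().fit(X_toy[idx], fine_labels)

def predict_hierarchical(texts):
    """Predict the coarse class first, then the fine class within it."""
    feats = vec.transform(texts)
    out = []
    for i, c in enumerate(coarse_clf.predict(feats)):
        clf = fine_clfs[c]
        out.append(clf if isinstance(clf, str) else str(clf.predict(feats[i])[0]))
    return out

preds_toy = predict_hierarchical(["who painted the mona lisa",
                                  "how many legs does a spider have"])
print(preds_toy)
```

By construction, every fine prediction lies inside the predicted coarse bucket, which is what makes the errors easier to explain. The flip side is that a wrong coarse prediction cannot be recovered in the second stage.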

### Text pre-processing

In [8]:
## Importing required NLTK libraries
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer 
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [9]:
# Preparing a combined corpus of train and test data
all_corpus = pd.Series(train.Question.tolist() + test.Question.tolist()).astype(str)

In [10]:
# Finding words with stops in-between
dot_words = []
for row in all_corpus:
    for word in row.split():
        if '.' in word and len(word)>2:
            dot_words.append(word)


In [11]:
import collections
collections.Counter(dot_words)

Counter({'...': 1,
         '...the': 1,
         '.com': 1,
         '.dbf': 1,
         '.tbk': 1,
         '1.76': 1,
         '2.5': 1,
         '42.3': 1,
         '5.9': 1,
         'A.G.': 1,
         'Answers.com': 5,
         'B.Y.O.B.': 1,
         'Bros.': 2,
         'C.C.': 1,
         'D.A.': 1,
         'D.B.': 1,
         'D.C.': 6,
         'D.H.': 1,
         'Dr.': 4,
         'G.M.T.': 1,
         'H.G.': 1,
         'I.V.': 1,
         'Inc.': 3,
         'J.D.': 1,
         'J.F.K.': 1,
         'J.R.R.': 2,
         'Jan.': 1,
         'Jr.': 4,
         'KnowPost.com': 1,
         'L.A.': 4,
         'L.L.': 1,
         'LL.M.': 1,
         'Mr.': 6,
         'Mrs.': 6,
         'Ms.': 2,
         'Mt.': 1,
         'N.M': 1,
         'No.': 3,
         'No.1': 1,
         'O.J.': 1,
         'P.T.': 2,
         'R.E.M.': 2,
         'Rev.': 1,
         'S.O.S.': 1,
         'Sen.': 1,
         'St.': 17,
         'T.S.': 2,
         'T.V.': 1,
         'U.K.': 

On exploring the corpus, it was found that some frequently occurring words contain punctuation marks. If we remove the punctuation marks during text cleaning, these words would be broken apart and might lose their importance/discriminative ability during classification. Hence, we create a "keep_list" to ensure that such frequently occurring words are not destroyed during the text cleaning process.

In [12]:
common_dot_words = ['U.S.', 'St.', 'Mr.', 'Mrs.', 'D.C.'] # some frequently occurring words with punctuation

In [13]:
def text_clean(corpus, keep_list):
    '''
    Purpose : Function to keep only alphabets, digits and certain words (punctuations, qmarks, tabs etc. removed)
    
    Input : Takes a text corpus, 'corpus' to be cleaned along with a list of words, 'keep_list', which have to be retained
            even after the cleaning process
    
    Output : Returns the cleaned text corpus
    
    '''
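    # Series.append was removed in pandas 2.0, so collect rows in a list and convert once
    cleaned_rows = []
    for row in corpus:
        qs = []
        for word in row.split():
            if word not in keep_list:
                # Replace every non-alphanumeric character with a space, then lowercase
                p1 = re.sub(pattern='[^a-zA-Z0-9]', repl=' ', string=word)
                p1 = p1.lower()
                qs.append(p1)
            else:
                # Words in keep_list are retained exactly as they appear
                qs.append(word)
        cleaned_rows.append(' '.join(qs))
    return pd.Series(cleaned_rows)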

In [15]:
def preprocess(corpus, keep_list, cleaning = True, stemming = False, stem_type = None, lemmatization = True, remove_stopwords = True):
    
    '''
    Purpose : Function to perform all pre-processing tasks (cleaning, stemming, lemmatization, stopwords removal etc.)
    
    Input : 
    'corpus' - Text corpus on which pre-processing tasks will be performed
    'keep_list' - List of words to be retained during cleaning process
    'cleaning', 'stemming', 'lemmatization', 'remove_stopwords' - Boolean variables indicating whether a particular task should 
                                                                  be performed or not
    'stem_type' - Choose between Porter stemmer or Snowball(Porter2) stemmer. Default is "None", which corresponds to Porter
                  Stemmer. 'snowball' corresponds to Snowball Stemmer
    
    Note : Either stemming or lemmatization should be used. There's no benefit of using both of them together
    
    Output : Returns the processed text corpus
    
    '''
    if cleaning == True:
        corpus = text_clean(corpus, keep_list)
    
    if remove_stopwords == True:
        stop = set(stopwords.words('english'))
        corpus = [[x for x in x.split() if x not in stop] for x in corpus]
    else :
        corpus = [[x for x in x.split()] for x in corpus]
    
    if lemmatization == True:
        lem = WordNetLemmatizer()
        corpus = [[lem.lemmatize(x, pos = 'v') for x in x] for x in corpus]
    
    if stemming == True:
        if stem_type == 'snowball':
            stemmer = SnowballStemmer(language = 'english')
            corpus = [[stemmer.stem(x) for x in x] for x in corpus]
        else :
            stemmer = PorterStemmer()
            corpus = [[stemmer.stem(x) for x in x] for x in corpus]
    
    corpus = [' '.join(x) for x in corpus]
        

    return corpus

In [16]:
# Applying the pre-processing function on the combined text corpus
all_corpus = preprocess(all_corpus, keep_list = common_dot_words, remove_stopwords = False)

# Separating back to train and test corpus
trn_corpus = all_corpus[0:train.shape[0]]
test_corpus = all_corpus[train.shape[0]::]

## Feature extraction / Word embeddings

In [17]:
def feats_from_text(all_corpus, trn_corpus, test_corpus, n_dims = 500, model = 'tf-idf'):
    
    '''
    Purpose : Function to extract numeric feature vectors/ word embeddings from the text corpus using one of CountVectors, tf-idf,
              GloVe or Word2Vec
    
    Inputs : 
    'all_corpus' : Combined train + test corpus. The chosen model is fit on this corpus
    'trn_corpus' and 'test_corpus' : Train and test corpora. The fitted word embedding is transformed over these two sets
    'n_dims' : Gives the dimension of the transformed word vector. Default is set as 500 
    'model' : Denotes which word embedding to use. Default is 'tf-idf'. Others are : 'cv', 'glove' and 'word2vec'
    
    Output : Returns two feature sets, 'train_feats' and 'test_feats'
    
    Note : The function returns an empty dataframe with an error message if an unknown model value is entered
    
    '''
    ## TF-IDF
    if model.lower() == 'tf-idf':
        tfidf_vec = TfidfVectorizer(max_features = n_dims)
        tfidf_vec.fit(all_corpus)
        train_feats = tfidf_vec.transform(trn_corpus).toarray()
        test_feats = tfidf_vec.transform(test_corpus).toarray()
        # Converting feature-arrays to dataframes
        train_feats = pd.DataFrame(train_feats)
        train_feats.rename(columns = lambda x: 'w_'+str(x), inplace = True)
        test_feats = pd.DataFrame(test_feats)
        test_feats.rename(columns = lambda x: 'w_'+str(x), inplace = True)
        
    ## Count Vectors
    if model.lower() == 'cv':
        cv = CountVectorizer(max_features = n_dims)
        cv.fit(all_corpus)
        train_feats = cv.transform(trn_corpus).toarray()
        test_feats = cv.transform(test_corpus).toarray()
        # Converting feature-arrays to dataframes
        train_feats = pd.DataFrame(train_feats)
        train_feats.rename(columns = lambda x: 'w_'+str(x), inplace = True)
        test_feats = pd.DataFrame(test_feats)
        test_feats.rename(columns = lambda x: 'w_'+str(x), inplace = True)
        
    '''
    In both word2vec and GloVe, we take the mean of the embeddings of all words in a question. Thus, each question is 
    represented by a 300-D vector, whose values are the average of n 300-D vectors, each vector representing a word
    in the sentence, as looked up in the chosen pre-trained embeddings
    
    '''
    
    ## GloVe
    if model.lower() == 'glove':
        glove_dict = {}
        # Loading GloVe word embeddings
        f = codecs.open('E:\\glove.6B.300d.txt', encoding = 'utf-8')
        for line in f:
            values = line.split(' ')
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            glove_dict[word] = coefs
        f.close()
        
        # Tokenizing train and test corpus
        trn_tokens = [[x for x in x.split(' ')]for x in trn_corpus]
        test_tokens = [[x for x in x.split(' ')]for x in test_corpus]
        
        # Using mean word-->vector mappings stored in GloVe to prepare train and test features
        train_feats = pd.DataFrame(np.array([np.mean([glove_dict[w] for w in words if w in glove_dict]
                    or [np.zeros(300)], axis=0) for words in trn_tokens]))
        
        test_feats = pd.DataFrame(np.array([np.mean([glove_dict[w] for w in words if w in glove_dict]
                    or [np.zeros(300)], axis=0) for words in test_tokens]))

    
    ## Word2Vec
    if model.lower() == 'word2vec':
        # Loading pre-trained word vectors
        word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('E:\\GoogleNews-vectors-negative300.bin', binary = True)
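        # gensim >= 4 exposes the vocabulary and vectors as index_to_key / vectors
        # (older releases used wv.index2word / wv.syn0)
        w2v = dict(zip(word2vec_model.index_to_key, word2vec_model.vectors))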
        
        # Tokenizing train and test corpus
        trn_tokens = [[x for x in x.split(' ')]for x in trn_corpus]
        test_tokens = [[x for x in x.split(' ')]for x in test_corpus]
        
        # Using mean word-->vector mappings stored in word2vec to prepare train and test features
        train_feats = pd.DataFrame(np.array([np.mean([w2v[w] for w in words if w in w2v]
                    or [np.zeros(300)], axis=0) for words in trn_tokens]))
        
        test_feats = pd.DataFrame(np.array([np.mean([w2v[w] for w in words if w in w2v]
                    or [np.zeros(300)], axis=0) for words in test_tokens]))

      
    ## Error handling
    if model.lower() not in ['tf-idf', 'cv', 'glove', 'word2vec']:
        print("Unknown value for parameter 'model' entered. Returning an empty data-frame")
        return pd.DataFrame(), pd.DataFrame()
    
    
    return train_feats, test_feats
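
The mean-pooling step used in the GloVe and Word2Vec branches above can be seen in isolation with a toy embedding table. The sketch below is illustrative only: `toy_vectors` is a made-up 3-D table (the real vectors are 300-D) and `mean_embedding` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

# Hypothetical 3-D embedding table; real GloVe/Word2Vec vectors are 300-D
toy_vectors = {
    "how":  np.array([1.0, 0.0, 0.0]),
    "deep": np.array([0.0, 1.0, 0.0]),
    "is":   np.array([0.0, 0.0, 1.0]),
}
DIM = 3

def mean_embedding(question, vectors, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    known = [vectors[w] for w in question.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

v_known = mean_embedding("how deep is a fathom", toy_vectors, DIM)
v_unknown = mean_embedding("zzz qqq", toy_vectors, DIM)
print(v_known, v_unknown)
```

Out-of-vocabulary words ("a" and "fathom" in this toy table) are simply skipped, so they contribute nothing to the question vector. A question made up entirely of unknown words maps to the zero vector.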

In [18]:
## Feature extraction from text corpus
train_feats, test_feats = feats_from_text(all_corpus, trn_corpus, test_corpus, model = 'tf-idf', n_dims = 300)

In [19]:
## Creating feature-sets (arrays)
X = train_feats.values
X_test = test_feats.values
y = train.QType.values
y_test = test.QType.values
y_coarse = train['QType-Coarse'].values
y_test_coarse = test['QType-Coarse'].values

## ML/DL algorithms for model training

In [20]:
## Importing requisite libraries
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

### I. Linear Models

#### (a) Logistic Regression

In [21]:
## Logistic Regression
model = LogisticRegression(penalty = 'l2', C = 1.0)

In [22]:
# Checking cross-validation accuracy
cross_val_score(model, X, y, cv = 5)



array([ 0.67235189,  0.6533212 ,  0.66483012,  0.68888889,  0.6635514 ])

In [None]:
# Fitting model on entire training set
model.fit(X,y)

In [None]:
# Making predictions on test set
preds = model.predict(X_test)

#### (b) Linear SVM

In [23]:
## Linear SVM
model = SVC(C = 1, kernel = 'linear')

In [None]:
# Checking cross-validation accuracy
cross_val_score(model, X, y, cv = 6)

In [24]:
## Hyper-parameter tuning for SVC
cv_means = []
for i in [0.1, 0.5, 1, 2, 5, 10]:
    model = SVC(C = i, kernel = 'linear')
    cv_score = cross_val_score(model, X, y, cv = 6)
    cv_means.append(cv_score.mean())
    print("CV scores for C = {} are : {}".format(i, cv_score))

print("\n{}".format(cv_means))



CV scores for C = 0.1 are : [ 0.56713212  0.52705628  0.55786026  0.53318584  0.57174888  0.55141243]
CV scores for C = 0.5 are : [ 0.6895811   0.67316017  0.67358079  0.65486726  0.69394619  0.6779661 ]
CV scores for C = 1 are : [ 0.69924812  0.68181818  0.68122271  0.66371681  0.69618834  0.68474576]
CV scores for C = 2 are : [ 0.70784103  0.68290043  0.68886463  0.67588496  0.70627803  0.69039548]
CV scores for C = 5 are : [ 0.69709989  0.67099567  0.67139738  0.67035398  0.6838565   0.68813559]
CV scores for C = 10 are : [ 0.70247046  0.64935065  0.66593886  0.66814159  0.67376682  0.67231638]

[0.55139930067992837, 0.6771836002425663, 0.68448998787014093, 0.69202742595912115, 0.68030650354338762, 0.67199746151551187]


C = 2 gives the best mean CV accuracy (about 69.2%). Hence, the model is re-initialized accordingly.
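
The manual loop over C can also be written with `GridSearchCV`, which is imported at the top of the notebook. The sketch below runs on synthetic data from `make_classification` rather than the notebook's `X` and `y`, and the shuffled `StratifiedKFold` is an assumption (the loop above uses the default, unshuffled 6-fold split).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=20,
                                   n_informative=10, n_classes=3,
                                   random_state=0)

c_grid = [0.1, 0.5, 1, 2, 5, 10]
grid = GridSearchCV(
    estimator=SVC(kernel="linear"),
    param_grid={"C": c_grid},
    cv=StratifiedKFold(n_splits=6, shuffle=True, random_state=0),
    scoring="accuracy",
)
grid.fit(X_toy, y_toy)

# Best C by mean CV accuracy, plus the mean score for every C tried
print(grid.best_params_, grid.best_score_)
print(dict(zip(c_grid, grid.cv_results_["mean_test_score"])))
```

With the default `refit=True`, `grid.best_estimator_` is already refitted on the full data passed to `fit`, so the separate re-initialization step would not be needed.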

In [37]:
# Fitting model on entire training set
model = SVC(C = 2, kernel = 'linear')
model.fit(X,y)

SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [38]:
# Making predictions on test set
preds = model.predict(X_test)

### II. Non-linear models

Gaussian, Multinomial and Bernoulli Naive Bayes algorithms were tried. Multinomial NB was the most successful of the three by a clear margin.

In [27]:
## Multinomial Naive-Bayes
model = MultinomialNB()

In [28]:
# Checking cross-validation accuracy
cross_val_score(model, X, y, cv = 5)



array([ 0.56642729,  0.55505005,  0.56014692,  0.57314815,  0.54205607])

In [None]:
# Fitting model
model.fit(X, y)

In [None]:
# Making predictions
preds = model.predict(X_test)

### III. Tree models

#### (a) LightGBM

In [29]:
model = LGBMClassifier(boosting_type = 'gbdt', num_leaves = 31, max_depth = -1, reg_alpha = 0,
                       reg_lambda = 0, learning_rate = 0.1, n_estimators = 1150, max_bin = 255, 
                       objective = 'multiclass', subsample = 1, subsample_freq = 1)
## Model can be tuned further

In [30]:
# cross-validation framework
fold_num = 1
cv_acc = []
skf = StratifiedKFold(n_splits = 6, shuffle = True, random_state = 102)
for train_idx, val_idx in skf.split(X,y):
    print("Fitting fold %d" %fold_num)
    model.fit(X[train_idx], y[train_idx], eval_metric = 'logloss', early_stopping_rounds = None)
    cv_preds = model.predict(X[val_idx])
    acc = accuracy_score(y[val_idx],cv_preds)
    cv_acc.append(acc)
    print("Accuracy for fold {} = {}\n".format(fold_num,acc))
    fold_num += 1




Fitting fold 1
Accuracy for fold 1 = 0.6143931256713212

Fitting fold 2
Accuracy for fold 2 = 0.6182212581344902

Fitting fold 3
Accuracy for fold 3 = 0.6137855579868708

Fitting fold 4
Accuracy for fold 4 = 0.6172566371681416

Fitting fold 5
Accuracy for fold 5 = 0.6252796420581656

Fitting fold 6
Accuracy for fold 6 = 0.6110484780157835



In [33]:
# Fitting model on entire training set
model.fit(X,y)

LGBMClassifier(boosting_type='gbdt', colsample_bytree=1.0, learning_rate=0.1,
        max_bin=255, max_depth=-1, min_child_samples=10,
        min_child_weight=5, min_split_gain=0.0, n_estimators=1150,
        n_jobs=-1, num_leaves=31, objective='multiclass', random_state=0,
        reg_alpha=0, reg_lambda=0, silent=True, subsample=1,
        subsample_for_bin=50000, subsample_freq=1)

In [34]:
# Making predictions on test set
preds = model.predict(X_test)

#### (b) XGBoost

In [None]:
model = XGBClassifier(max_depth = 12, learning_rate = 0.01, n_estimators = 301, 
                      objective = "multi:softmax", gamma = 0, base_score = 0.5, 
                      reg_lambda = 10, subsample = 0.8, colsample_bytree = 0.8, num_class = 50)
## Model can be tuned further

In [None]:
# cross-validation framework
fold_num = 1
cv_acc = []
skf = StratifiedKFold(n_splits = 4, shuffle = True, random_state = 102)
for train_idx, val_idx in skf.split(X,y):
    print("Fitting fold %d" %fold_num)
    model.fit(X[train_idx], y[train_idx], eval_metric = "merror")
    cv_preds = model.predict(X[val_idx])
    acc = accuracy_score(y[val_idx], cv_preds)
    cv_acc.append(acc)
    print("Accuracy for fold {} = {}\n".format(fold_num,acc))
    fold_num += 1


In [None]:
# Fitting model on entire training set
model.fit(X, y, eval_metric = "merror")

In [None]:
# Making predictions on test set
preds = model.predict(X_test)

## Evaluating different models

In [None]:
## Checking prediction accuracy on test set
np.mean(preds == y_test)

In [None]:
## Calculating precision, recall etc. from confusion matrix
cm_test = confusion_matrix(y_test, preds)
precision = np.zeros(len(cm_test))
recall = np.zeros(len(cm_test))
for i in range(0, len(cm_test)):
    precision[i] = cm_test[i,i]/(sum(cm_test[:,i]) + 10e-7)
    recall[i] =  cm_test[i,i]/(sum(cm_test[i,:]) + 10e-7)

print("Avg. precision and avg. recall are : {} and {} respectively".format(np.mean(precision), np.mean(recall)))
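
The per-class averages computed from the confusion matrix above correspond to macro-averaged precision and recall. As a cross-check, the sketch below compares the manual computation with scikit-learn's `precision_score` and `recall_score` on made-up labels. Passing `zero_division=0` is an assumption chosen to mirror the small constant in the denominator, so that a class which is never predicted scores 0.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels; class 2 is never predicted, so its precision is 0/0
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 0, 1])

cm = confusion_matrix(y_true_toy, y_pred_toy)

# Manual version: diagonal over column sums (precision) and row sums (recall)
manual_precision = np.mean(np.diag(cm) / (cm.sum(axis=0) + 1e-6))
manual_recall = np.mean(np.diag(cm) / (cm.sum(axis=1) + 1e-6))

# Library version: macro average over classes
macro_precision = precision_score(y_true_toy, y_pred_toy,
                                  average="macro", zero_division=0)
macro_recall = recall_score(y_true_toy, y_pred_toy,
                            average="macro", zero_division=0)

print(manual_precision, macro_precision)
print(manual_recall, macro_recall)
```

Note that `average_precision_score`, which is imported at the top of the notebook, measures something different (area under the precision-recall curve) and is not a substitute for the macro average here.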

## Experiments and Results

### 1. To remove or not to remove (stopwords)

Words starting with "wh..." like "what", "why", "when" etc. and some other words like "how" seem highly relevant intuitively for question-type classification problem. 

Removing stopwords removes all these "wh-" words. Not surprisingly, the performance metrics (multi-class accuracy, avg. precision, avg. recall) go down significantly on removal of these words.

E.g. 
* MultinomialNB with 500 CountVector features - 36% test acc. and 48-49% val. acc.
     
* MultinomialNB with 500 CountVector features (no stopword removal) - 61-62% test and val acc.

An alternative is to remove all the stopwords, except the "wh-" words. This was tried. 

In [None]:
wh_words = ['who', 'what', 'when', 'why', 'how', 'which', 'where', 'whom']
stop = set(stopwords.words('english'))
for word in wh_words:
    stop.remove(word)

However, it was found that the evaluation metrics were better when all stopwords were retained.

* Linear SVM with 500 tf-idf features (complete non-removal of stopwords) : 73% val. acc. and 77% test acc.

* Linear SVM with 500 tf-idf features (removal of stopwords, keeping only the wh_words) : 70% val acc. and 73% test acc.

Hence, the final verdict is to NOT remove the stopwords at all.
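
To see why stopword removal hurts, it helps to look at what survives in the vocabulary. The sketch below uses scikit-learn's built-in English stop list as a stand-in, which is an assumption: the notebook uses the NLTK list, and the two lists differ in detail. The example questions are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

questions = ["What is the capital of France ?",
             "Who wrote Hamlet ?",
             "When did the war end ?",
             "How many moons does Mars have ?"]

# Note: scikit-learn's stop list, not the NLTK list used in the notebook
with_removal = CountVectorizer(stop_words="english").fit(questions)
without_removal = CountVectorizer().fit(questions)

wh_words = {"what", "who", "when", "how"}
kept_after_removal = wh_words & set(with_removal.vocabulary_)
kept_without_removal = wh_words & set(without_removal.vocabulary_)

print(sorted(with_removal.vocabulary_))
print(kept_after_removal, kept_without_removal)
```

After stopword removal, none of the question words remain as features, so the classifier has to rely on content words alone.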

### 2. Checking single and 2-letter words

During text cleaning, one- and two-letter words are generally removed. To inspect what kinds of words would be removed from our corpus by this operation, we run the line of code below. This check is performed after removing the stopwords.

In [None]:
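# all_corpus holds one string per question, so split each into words before checking length
two_letter_words = [[w for w in q.split() if len(w) <= 2] for q in all_corpus]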

It was observed that some of these words seemed important and had discriminative power, e.g. numerals like 1, 11, 89 etc. and common words like 'tv', 'cd', 'iq' etc. Numerals were the central subject of quite a few questions. Hence, it was decided not to remove one- or two-letter words.

### 3. Coarse predictions

As discussed earlier, the problem can also be approached by training the model on the 6 major classes instead of all 50 detailed classes. This should yield higher precision/recall/accuracy, but the biggest benefit concerns the kind of errors the model makes. If we can optimize classification on these 6 major classes to a very high extent, then even when the model makes errors on fine classification, the errors would lie in an associated, similar hierarchical bucket.

In [None]:
# Fitting model
model.fit(X, y_coarse)
# Making predictions (assigned to preds so the accuracy below uses the coarse predictions)
preds = model.predict(X_test)
# Checking accuracy
np.mean(preds == y_test_coarse)

Results of coarse classifications :
   1. Linear SVM with 500 TF-IDF features - 84% test acc., 80-81% val acc. + 87% and 83% test precision and recall respectively
   2. Linear SVM with 500 CountVector features - 84% test acc., 79-80% val acc. + 88% and 83% test precision and recall 
      respectively
   3. Multinomial NB with 500 TF-IDF features - 81% test acc., 75% val acc. + 85% and 79% test precision and recall respectively 
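
The claim that errors should "lie in a similar hierarchical bucket" can be measured for a fine-grained model. The sketch below is one possible way to do it, with made-up labels and a hypothetical helper `coarse_hit_rate_on_fine_errors`. It assumes string labels in the `COARSE:fine` format, so the notebook's integer-encoded predictions would first need `le.inverse_transform`.

```python
def coarse_hit_rate_on_fine_errors(y_true, y_pred):
    """Among fine-grained mistakes, the fraction whose coarse class is still right."""
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return None  # no fine-grained errors to analyse
    same_bucket = sum(t.split(":")[0] == p.split(":")[0] for t, p in errors)
    return same_bucket / len(errors)

# Made-up fine-grained labels and predictions
y_true_toy = ["HUM:ind", "HUM:gr", "NUM:date", "NUM:count", "LOC:city"]
y_pred_toy = ["HUM:ind", "HUM:ind", "NUM:count", "LOC:city", "LOC:city"]

rate = coarse_hit_rate_on_fine_errors(y_true_toy, y_pred_toy)
print(rate)
```

In the toy example, three of the five predictions are wrong at the fine level, and two of those three still fall in the correct coarse class.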

### 4. Linear models RULE!

Different kinds of models were tried with Count Vector and TF-IDF word embeddings. It was seen that the linear models (Logistic regression and SVM) performed much better than the others on this text classification task. SVM had the best validation accuracy. Performance metrics on different models are listed below.

1. Linear SVM with 500 CountVector features - 78% test acc., 70% val acc. + 68% and 64% test precision and recall respectively

2. Linear SVM with 500 TF-IDF features - 77% test acc., 73% val acc. + 67% and 63% test precision and recall respectively

3. Logistic Regression with 500 CountVector features - 76% test acc., 70% val acc. + 65% and 56% test precision and recall

4. MultinomialNB with 300 tf-idf features - 61% test acc. and 56% val acc.

5. Untuned LGBM with 500 CountVector features - 64% test acc.

6. Untuned XGBoost with 500 CountVector features - 68% test acc.


This is not surprising: text classification datasets are well suited to algorithms like SVM and matrix factorization methods. With Count Vector/TF-IDF features, the resulting feature-set is sparse and high-dimensional, which suits linear and factorization-based methods.

### 5. War of the Embeddings - TFIDF/CV vs GloVe/Word2Vec

TF-IDF and Count Vector embeddings generate sparse matrices with non-negative values, whereas GloVe and Word2Vec generate a dense matrix with all kinds of real values. As such, both kinds of embeddings are suited for different kinds of algorithms.

Simple, linear models, matrix factorization methods, SVMs etc. favour sparse matrices and hence perform well along with TF-IDF/CountVector embeddings. GloVe and Word2Vec embeddings are more suited to complex models (e.g. GBM models, neural networks etc.). This was also seen when comparing their performance with different algorithms

 * TF-IDF/CV
    1. Linear SVM with Count Vector - 78% test acc.
    2. LGBM with Count Vector - 64% test acc.

 * GloVe/Word2Vec
    1. LGBM with pre-trained Word2Vec - 72% test acc.
    2. Logistic Regression with Word2Vec - 70% test acc.
    3. LGBM with pre-trained GloVe - 74% test acc. (better precision and recall as compared to Word2Vec)
    4. Logistic Regression with GloVe - 71% test acc.
         
 
In general, linear SVM performs well irrespective of the word embeddings. In fact, linear SVM with word2vec embeddings reaches close to 77% test accuracy. SVM models also have higher precision/recall than other algorithms with either kind of embedding.

## Summary

Overall, linear models, especially linear SVM, seem most suited to this problem. Better feature engineering, combined with linear SVM, should improve the models further. This also seems to be backed by the vast literature available on this problem.

## Future Steps

A few more ideas/experiments which can be executed on this dataset to potentially improve score and/or to build diverse models are listed below :

1. Ensembles - Lots of linear, tree, non-linear models were used in this solution. Being a diverse bunch of models, they form good candidates to be combined in an ensemble model (maybe, stacking)

2. Tuning parameters of the Tree models - will bring a small gain in the metrics

3. Using embeddings like word2vec and GloVe in a different way (not mean of words in a sentence) - maybe, the word embeddings can be weighted by TF-IDF, or mean of only few significant words in the sentence can be taken

4. Using ANNs with word2vec/GloVe embeddings, as the resulting dense feature-sets seem suited to deep, non-linear models

5. Exploring LSTMs/RNNs and treating this as a sequence problem. LSTMs/RNNs were not tried here since linear models were performing far better than the other models

6. Using only first few words in each question to extract features - Usually, the first few words contain keywords like "What", "What is", "Name", "How" etc.

7. Implementation of state-of-the-art methods from research papers (e.g. http://bit.ly/2CzrGE1, some of Stanford NLP group's work)

8. The classic way of error analysis - Looking at examples which the model is getting wrong to derive insights as to how to improve precision/recall etc.
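
As an illustration of idea 3, the sketch below replaces the plain mean of word vectors with an IDF-weighted mean, so that rarer words count for more. It is a rough sketch under stated assumptions: the corpus and the 2-D `toy_vectors` table are made up, `idf_weighted_embedding` is a hypothetical helper, and only the IDF part of TF-IDF is used as the weight (term frequency enters implicitly when a word repeats).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["what is a fathom", "what is a prism", "who wrote hamlet"]

# Hypothetical 2-D word vectors standing in for GloVe/Word2Vec
toy_vectors = {
    "what":   np.array([1.0, 0.0]),
    "is":     np.array([1.0, 0.0]),
    "fathom": np.array([0.0, 1.0]),
    "prism":  np.array([0.0, 1.0]),
}
DIM = 2

# Learn one IDF weight per vocabulary word
tfidf = TfidfVectorizer().fit(corpus)
idf = {word: tfidf.idf_[col] for word, col in tfidf.vocabulary_.items()}

def idf_weighted_embedding(question):
    """IDF-weighted average of word vectors; zeros if no word is usable."""
    words = [w for w in question.split() if w in toy_vectors and w in idf]
    if not words:
        return np.zeros(DIM)
    weights = np.array([idf[w] for w in words])
    vecs = np.array([toy_vectors[w] for w in words])
    return weights @ vecs / weights.sum()

weighted = idf_weighted_embedding("what is a fathom")
plain = np.mean([toy_vectors[w] for w in ["what", "is", "fathom"]], axis=0)
print(weighted, plain)
```

Because "fathom" occurs in fewer questions than "what" and "is", its vector gets a larger share in the weighted version than in the plain mean.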