# Quora Insincere Questions Classification
This notebook uses PyTorch to work on the Kaggle competition Quora Insincere Questions Classification: https://www.kaggle.com/c/quora-insincere-questions-classification

Summary:

1. Input contains the question id, the question text and the target class.
The target class is "0" if the question is sincere.
The target class is "1" if the question is insincere (toxic).
The training data contains approximately 1,300,000 questions.
Of those, approximately 80,000 belong to target class "1".

2. Normalizing the Data - All non-word characters and digits are removed from the data.

3. Word vectors are obtained using the gensim library (Word2Vec).
   The corpus contains both train and test questions.
   Vector length = 100 (EMBEDDING_DIM).
   
4. Uses a CNN layer followed by a bidirectional LSTM.
   Optimizer = optim.Adam, lr=0.01, decayed with lr_scheduler.StepLR (gamma=0.1)
   Loss = nn.BCEWithLogitsLoss #binary cross entropy loss
   
   
5. As there is far less data for "target=1" than for "target=0",
   we do custom oversampling, i.e., the target=1 samples are repeated until both classes have the same number of samples. The samples are randomly shuffled before training.
   
6. Confusion Matrix and ROC are used to visualize the prediction accuracy.
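
As a rough illustration of item 6, the four cells of a binary confusion matrix can be counted by hand. This is a minimal pure-Python sketch with made-up labels; the function name `confusion_counts` is hypothetical, and the notebook itself relies on sklearn.metrics.confusion_matrix.

```python
def confusion_counts(y_true, y_pred):
    """Count true negatives, false positives, false negatives and true positives
    for binary labels (0 = sincere, 1 = insincere)."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tp += 1
    return tn, fp, fn, tp

# Made-up labels, purely for illustration.
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
# Rows are true labels, columns are predicted labels.
print([[tn, fp], [fn, tp]])
```

With heavily imbalanced classes, the bottom row (the class-1 samples) is usually more informative than plain accuracy.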

Import Statements - all in one place.
Pandas is used to read the train/test files.
PyTorch is used as the deep learning framework.
scikit-learn is used for metrics.

In [None]:
#Shree Ganeshaaya Namaha
#%load_ext autoreload
#%autoreload 2
import pandas as pd
import numpy as np
import re
from datetime import datetime
import time

#import nltk
#from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
#from nltk.corpus import stopwords
# Improving the stop words list
#nltk.download('stopwords')
from gensim.models import Word2Vec
from sklearn.utils import resample
from sklearn.metrics import roc_curve, confusion_matrix

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.optim import lr_scheduler

from matplotlib import pyplot as plt

import itertools

import pickle

Global Constants - all in one place.

In [None]:
start_time=time.time()
time_limit=5#5 hours of time limit.

train_file="../input/train.csv"
test_file="../input/test.csv"

#stop_words = stopwords.words('english')
#negation_words = ["don't", "won't","doesn't","couldn't","isn't","wasn't","wouldn't","can't","ain't","shouldn't","not"]
negation_words = ["don't", "won't","doesn't","couldn't","isn't","wasn't","wouldn't","can't","ain't","shouldn't"]

# vector size in word2vec model.
# Also used in the embedding matrix.
EMBEDDING_DIM=100

#target classes
class_names=[0,1]

#Batch size used to train/test.
BATCH_SIZE=4000




General Purpose Functions

In [None]:
#print the timestamp along with the log message
def myprint(*argv):
    sstr=datetime.now()
    for arg in argv:
        sstr="{} {}".format(sstr,arg)
    print(sstr)


The following block is not used at present, as POS-tagging and stemming the corpus takes too long.

In [None]:
        
"""        
#lemmatizer = WordNetLemmatizer()
stemmer=PorterStemmer()

## Does the following - 
## 1) replace the org names and person names with a common word.
## In that way the training will not bias based on person/org.
## And also, the text corpus will reduce and model can work on important words only.
## 2) the continuous chunks / couplets will be joined using "_" and will be treated as one word. 
## 3) removes the stop-words, again to reduce the corpus to important words.
## 4) stemming - let us use the base state of the word - again to reduce the corpus to few important words only.


def reduce_to_important_words_only(line):
    p_names,org_names,chunks=continuous_chunks(line)
    for chunk in chunks:
        tmp=re.sub(" ","_",chunk)
        try:
            line=re.sub(chunk, tmp, line)
        except Exception as e:
            print("chunk=",chunk,str(e) )
    for chunk in p_names:
        try:
            line=re.sub(chunk, "PERSON_NAME", line)
        except Exception as e:
            print("chunk=",chunk,str(e) )

    for chunk in org_names:
        try:
            line=re.sub(chunk, "ORG_NAME", line)
        except Exception as e:
            print("chunk=",chunk,str(e) )                

    
    line = line.lower()
    word_list = line.split(' ')
    newword_list = []
    prev_word = ''

    for word in word_list:
        word=word.strip()
        ### get the root form of the word
        if word in stop_words:
            continue

        if word in ["person_name","org_name"]:
            newword_list.append(word)
            continue

        if word in negation_words:
            prev_word = word
            continue

        ### LEMMATIZE
        #word= lemmatizer.lemmatize(word)
        word= stemmer.stem(word)

        if prev_word in negation_words:                
            word = 'not_' + word                        
            newword_list.append(word)                        
            prev_word = ''
        else:
            newword_list.append(word)

    line = ' '.join(newword_list)
    return line
    
def continuous_chunks(line):
    chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(line)))
    continuous_chunk = []
    person_names=[]
    org_names=[]
    for i in chunked:
        ttype=i.__class__.__name__
        if ttype == "Tree":
            if i.label()== "PERSON":
                person_names.append(" ".join([token for token, pos in i.leaves()]))
            elif i.label()== "ORGANIZATION":
                org_names.append(" ".join([token for token, pos in i.leaves()]))
            else:
                if len(i.leaves()) > 1:
                    continuous_chunk.append(" ".join([token for token, pos in i.leaves()]))

        elif ttype == "tuple":
            if i[1] == "PERSON":
                person_names.append(i[0])
            elif i[1] == "ORGANIZATION":                    
                org_names.append(i[0])

             #elif current_chunk:
             #        named_entity = " ".join(current_chunk)
             #        if named_entity not in continuous_chunk:
             #                continuous_chunk.append(named_entity)
             #                current_chunk = []
             #else:
             #        continue
    return person_names, org_names, continuous_chunk 
"""


Keep only alphabetic/text content.
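
The core of the normalization can be tried on a single sentence first. The sketch below is a simplified stand-in, not the notebook's normalize_line: `normalize_sketch` and the shortened `NEGATIONS` list are hypothetical names, and the extra clean-up regexes are left out.

```python
import re

# Shortened, hypothetical list; the notebook defines the full negation_words list.
NEGATIONS = ["don't", "won't", "can't"]

def normalize_sketch(line):
    line = str(line).lower()
    for word in NEGATIONS:
        line = line.replace(word, "not")
    line = re.sub(r"\W", " ", line)   # non-word characters -> space
    line = re.sub(r"\d", " ", line)   # digits -> space
    return line.split()               # split() also collapses repeated whitespace

print(normalize_sketch("Why don't 3 people vote?"))
# ['why', 'not', 'people', 'vote']
```

Replacing the negation words before stripping punctuation matters, because the apostrophe would otherwise be removed and the contraction would no longer match.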

In [None]:
stemmer=PorterStemmer()

def normalize_text(filename):
    myprint("normalize file ", filename)
    df=pd.read_csv(filename)
    arr=[]
    for i,row in df.iterrows():
        qid=row["qid"]
        qn=normalize_line(row["question_text"])
        if "target" in row:
            target=row["target"]
            tple=(qid,qn,target)
        else:
            tple=(qid,qn)
        arr.append(tple)
        #df.set_value(i,"question_text",v)
        if (i%100000==0):
            myprint(i)
    return arr

def normalize_line(line):
    line=str(line).lower()
    for word in negation_words:
        line=re.sub(word,"not",line)
        
    line = re.sub(r'\W', ' ', str(line))
    line = re.sub(r'\d', ' ', line)

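    # drop a stand-alone "br" (presumably left over from HTML line breaks);
    # inside [...] a "$" is a literal character, so a lookahead is used for end-of-line
    line = re.sub(r'\bbr(?=\s|$)', ' ', line)
    # drop single-letter words
    line = re.sub(r'\s+[a-z](?=\s|$)', ' ',line)
    # strip only a leading "b " (anchored with ^ so that words ending in "b" are kept intact)
    line = re.sub(r'^b\s+', '', line)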
    line = re.sub(r'\s+', ' ', line)
    line=line.strip()
    arr=line.split(" ")
    
    tokens=[]
    for w in arr:
        w=w.strip()
        if len(w)>0:
            #w=stemmer.stem(w)
            tokens.append(w)
    return tokens


PRE-PROCESS/NORMALIZE THE QUESTION TEXT TO REMOVE UNNECESSARY CONTENT.
The output of this is - 
train_arr_no_gram is an array of a tuple like this - ('00014894849d00ba98a9', ['my', 'voic', 'rang', 'is', 'my', 'chest', 'voic', 'goe', 'up', 'to', 'includ', 'sampl', 'in', 'my', 'higher', 'chest', 'rang', 'what', 'is', 'my', 'voic', 'type'],0)
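i.e., an array of (question_id, [question text tokens], target). The sample tokens are stemmed ('voic', 'rang'), which suggests they were captured while the stemmer call in normalize_line was still enabled; with it commented out, tokens keep their full form.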

test_arr_no_gram is an array of a tuple like this - ('000032939017120e6e44', ['do', 'you', 'have', 'an', 'adopt', 'dog', 'how', 'would', 'you', 'encourag', 'peopl', 'to', 'adopt', 'and', 'not', 'shop'])
i.e., an array of (question_id, [question text tokens]).

In [None]:
myprint("normalize train-start")
train_arr_no_gram=normalize_text(train_file)
myprint("normalize train-done")

myprint("normalize test-start")
test_arr_no_gram=normalize_text(test_file)
myprint("normalize test-done")


# N-Gram
Let us use 3-grams; this gives somewhat denser data to train on.
A sample entry of train_arr before the n-gram operation is - 
['00002165364db923c7e6', ['how', 'did', 'quebec', 'nationalist', 'see', 'their', 'provinc', 'as', 'nation', 'in', 'the'], 0]

The same entry after the n-gram operation is - 
['00002165364db923c7e6',['how', 'did', 'quebec', 'did', 'quebec', 'nationalist', 'quebec', 'nationalist', 'see', 'nationalist', 'see', 'their', 'see', 'their', 'provinc', 'their', 'provinc', 'as', 'provinc', 'as', 'nation', 'as', 'nation', 'in', 'nation', 'in', 'the'],0]
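
The transformation shown above is a sliding window of 3 tokens whose windows are concatenated into one flat list. A compact sketch of the same idea follows; `sliding_ngrams` is a hypothetical name, not the function used in the next cell.

```python
def sliding_ngrams(tokens, n=3):
    """Concatenate every window of n consecutive tokens into one flat list."""
    out = []
    for start in range(len(tokens) - n + 1):
        out.extend(tokens[start:start + n])
    return out

print(sliding_ngrams(['how', 'did', 'quebec', 'nationalist']))
# ['how', 'did', 'quebec', 'did', 'quebec', 'nationalist']
```

A question with fewer than n tokens yields an empty list, so very short questions end up as all-padding sequences later on.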



In [None]:
#tokens contains an array of words.
def tokenize_n_gram(tokens,n):
    arr_gram=[]
    pointer=0
    cnt=0
    done=False
    lth=len(tokens)
    while done==False:
        if (pointer+n) > lth:
            done=True
            break
        arr_gram.append(tokens[pointer+cnt])
        cnt+=1
        if cnt == n:
            cnt=0
            pointer+=1
    return arr_gram

myprint("start n_gram for train_arr")
train_arr=[]
for (qid,tokens,qtype) in train_arr_no_gram:
    n_gramed=tokenize_n_gram(tokens,3)
    train_arr.append((qid,n_gramed,qtype))
    
myprint("start n_gram for test_arr")    
test_arr=[]
for (qid,tokens) in test_arr_no_gram:
    n_gramed=tokenize_n_gram(tokens,3)
    test_arr.append((qid,n_gramed))
myprint("done n-gram")

In [None]:

#with open("train_arr_norm", 'wb') as handle:
#    pickle.dump(train_arr, handle, protocol=pickle.HIGHEST_PROTOCOL)

#with open("test_arr_norm", 'wb') as handle:
#    pickle.dump(test_arr, handle, protocol=pickle.HIGHEST_PROTOCOL)    

In [None]:
#with open("train_arr_norm", 'rb') as handle:
#    train_arr = pickle.load(handle)
#with open("test_arr_norm", 'rb') as handle:
#    test_arr = pickle.load(handle)


PREPARE THE CORPUS OF WORDS from TRAIN and TEST sets.

In [None]:
lth_train=len(train_arr)
lth_test=len(test_arr)
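# Build the corpus as a plain list of token lists. Wrapping ragged lists in np.array
# fails on recent NumPy versions unless dtype=object is given, and gensim only needs an iterable.
corpus=[qn for (i,qn,t) in train_arr]+[qn for (i,qn) in test_arr]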
lth_total=len(corpus)
print("lth_train=",lth_train,",lth_test=",lth_test,",total=",lth_total)

CREATE THE WORD2VEC.

In [None]:
myprint("start word2vec")
# Note: gensim 4.x renamed the "size" argument to "vector_size".
gw2c = Word2Vec(corpus, size=EMBEDDING_DIM, window=5, min_count=1, workers=4)
myprint("end word2vec")

WE ARE NOT USING THE FOLLOWING FUNCTION AS WE CREATE OUR OWN WORD VECTORS USING GENSIM IN THE ABOVE STEP.

In [None]:
"""   
def load_standard_word2vec(self): 

    #ffile="embeddings\\glove.840B.300d.txt"
    ffile="..\\embeddings\\GoogleNews-vectors-negative300.bin"
    #ffile="embeddings\\paragram_300_sl999.txt"
    #ffile="embeddings\\wiki-news-300d-1M.vec"

    print("loading embedding word2vec file ",ffile)

    self.word2vec = gensim.models.KeyedVectors.load_word2vec_format(ffile, binary=True)  
    self.EMBEDDING_DIM=300
"""

DECIDE ON THE MAX NUMBER OF WORDS TO BE CONSIDERED IN EACH QUESTION.
Steps for this - 
    Get the length vector - contains the length of each question in the corpus.
    Sort the lengths in ascending order.
    If the MAX length is much bigger than the 95th-percentile length (the gap exceeds the 5th-percentile length), use the 95th-percentile length instead,
    i.e., we don't want to waste computation on the longest 5% of the questions.
    But we want to cover, at the least, all the words in 95% of the corpus.
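
The rule can be checked on small made-up length lists before running it on the full corpus. This is a sketch in plain Python under the same index-based percentile convention; `pick_max_length` is a hypothetical name.

```python
def pick_max_length(lengths):
    """Return the max length, or the 95th-percentile length when the longest
    questions are outliers (gap larger than the 5th-percentile length)."""
    lengths = sorted(lengths)
    longest = lengths[-1]
    p95 = lengths[int(len(lengths) * 0.95)]
    p05 = lengths[int(len(lengths) * 0.05)]
    return p95 if (longest - p95) > p05 else longest

# 99 questions of 10 tokens and one outlier of 200 tokens (made-up data).
print(pick_max_length([10] * 99 + [200]))    # 10
# Lengths 1..100 have no real outlier, so the max is kept.
print(pick_max_length(list(range(1, 101))))  # 100
```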

In [None]:
def find_max_sent_lth(corpus):   
    myprint("start find_max_sent_lth")
    #we want to take the max_sent_lth as the length of the 95% data.
    lths=[len(arr) for arr in corpus]
    lths=np.sort(lths)    
    mmax=lths[-1]         
    mmax_95_perc=lths[int(len(lths)*0.95)]
    diff=mmax-mmax_95_perc
    mmax_5_perc=lths[int(len(lths)*0.05)]
    print("max lth=",mmax,",max_95_perc=",mmax_95_perc,",max_5_perc=",mmax_5_perc,", diff=",diff)
    if diff > mmax_5_perc:    
        ret=mmax_95_perc
    else:
        ret=mmax
    
    myprint("end find_max_sent_lth=",ret)
    return ret

MAX_SENTENCE_LENGTH=find_max_sent_lth(corpus)

BUILDING WORD INDEX - gensim word2vec gives us an array (index -> word), but we want a dictionary (word -> index) for faster lookup.

In [None]:
"""
Following code is for manually doing it, however, is not used currently.
def build_word_index(corpus):
    myprint("start build_word_index")
    word2index={} #SOS_token:0, EOS_token:1}
    index2word={}#0:SOS_token,1:EOS_token}
    num_words=0
    for tokens in corpus:
        for token in tokens:
            if token not in word2index:
                word2index[token]=num_words
                index2word[num_words]=token
                num_words+=1
    myprint("end build_word_index, num_words=",num_words)
    return word2index,index2word,num_words

word2index,index2word,num_words=build_word_index(corpus)
"""
# Note: gensim 4.x renamed wv.index2word to wv.index_to_key.
tmparr=gw2c.wv.index2word
word2index={}
index2word={}
for i,w in enumerate(tmparr):
    word2index[w]=i
    index2word[i]=w
num_words=len(word2index)

print("num_words=",num_words)

The sequences_from_tokens function does the following - 
1) Takes the array of tokens/words as input and returns the array of corresponding indexes from word2index.
2) Also ensures that each returned array is of length=MAX_SENTENCE_LENGTH,
    i.e., if the token array is longer, it is truncated; if it is shorter, it is padded with 0.
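
The pad-or-truncate step on its own looks like this. It is a minimal list-based sketch; `pad_or_truncate` is a hypothetical helper, while the notebook does the padding with torch.nn.functional.pad.

```python
def pad_or_truncate(indexes, max_len, pad_value=0):
    """Return a list of exactly max_len items."""
    clipped = indexes[:max_len]
    return clipped + [pad_value] * (max_len - len(clipped))

# Hypothetical word indexes and max_len, for illustration only.
print(pad_or_truncate([5, 9, 2], 5))           # [5, 9, 2, 0, 0]
print(pad_or_truncate([5, 9, 2, 7, 1, 4], 5))  # [5, 9, 2, 7, 1]
```

Note that 0 is also the index of the first vocabulary word in this notebook, so padding positions are looked up as that word's vector.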

In [None]:
def sequences_from_tokens(tokens_arr):
    ret=[]
    for arr in tokens_arr:
        ind_arr=[]
        #truncate this question (not the whole batch) to MAX_SENTENCE_LENGTH tokens
        if len(arr) > MAX_SENTENCE_LENGTH:
            arr=arr[0:MAX_SENTENCE_LENGTH]
        for word in arr:
            ind_arr.append(word2index[word])
        
        ind_arr=F.pad(torch.FloatTensor(ind_arr),(0,MAX_SENTENCE_LENGTH-len(ind_arr)),"constant",0).numpy()[:]
        
        ret.append(ind_arr)
    return ret


BUILD EMBEDDING MATRIX - This will contain all words in the corpus in the vector form.
This MATRIX is used as the first layer in the Neural Network.
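
Conceptually, row i of the matrix is the vector of the word whose index is i. The sketch below shows that mapping with plain lists and a made-up three-word vocabulary; `build_matrix` is a hypothetical name, and it simplifies the cell that follows by leaving unknown words as zero rows.

```python
def build_matrix(word2index, vectors, dim):
    """Row i holds the vector of the word whose index is i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word2index))]
    for word, i in word2index.items():
        if word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

# Tiny made-up vocabulary with 3-dimensional vectors.
word2index = {"what": 0, "is": 1, "quora": 2}
vectors = {"what": [0.1, 0.2, 0.3], "is": [0.4, 0.5, 0.6]}
matrix = build_matrix(word2index, vectors, dim=3)
print(matrix[1])  # [0.4, 0.5, 0.6]
print(matrix[2])  # [0.0, 0.0, 0.0]
```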

In [None]:
def build_embed_matrix(lword2idx,lwv):
    # prepare embedding matrix
    myprint('Filling pre-trained embeddings...')

    #indexes in word2idx start from 0; the extra last row stays all-zero and is not used by any word.
    embedding_matrix = np.zeros((len(lword2idx)+1, EMBEDDING_DIM))

    for word, i in lword2idx.items():

        try:
            embedding_vector = lwv[word]        
        except KeyError:
            #if word is not found in word2vec - 
            embedding_vector=np.zeros(EMBEDDING_DIM)
            for indx in range(0,50):
                embedding_vector[indx]=i
            #also that particular position
            #embedding_vector[i]=i
        embedding_matrix[i]=embedding_vector
    return embedding_matrix

embed_matrix=build_embed_matrix(word2index,gw2c.wv)
print(embed_matrix.shape)


THIS IS OUR NEURAL NETWORK MODEL.
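
The sequence length shrinks as a batch passes through the convolution and the max-pooling step, which determines how many time steps the LSTM sees. The arithmetic can be sketched as follows, assuming stride 1 and no padding for the convolution, and a sentence length of 72 taken from the shape comments in the model code; both helper names are hypothetical.

```python
def conv1d_out_len(seq_len, kernel_size):
    # nn.Conv1d with stride 1 and no padding
    return seq_len - kernel_size + 1

def max_pool1d_out_len(seq_len, kernel_size):
    # F.max_pool1d defaults the stride to the kernel size, without padding
    return (seq_len - kernel_size) // kernel_size + 1

seq_len = 72  # assumed MAX_SENTENCE_LENGTH, for illustration
after_conv = conv1d_out_len(seq_len, 3)
after_pool = max_pool1d_out_len(after_conv, 3)
print(after_conv, after_pool)  # 70 23
```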

In [None]:
class NNModel(nn.Module):    

    def create_embed_layer(self,numwords, EMBEDDING_DIM, embedding_matrix):    
        # load pre-trained word embeddings into an Embedding layer
        # note that we set trainable = False so as to keep the embeddings fixed
    
        #torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, 
        #sparse=False, _weight=None)
        embedding_layer = nn.Embedding(
          numwords+1,
          EMBEDDING_DIM)
    
        embedding_layer.load_state_dict({'weight': torch.FloatTensor(embedding_matrix)})
        ## or do this - embedding_layer.weight.data.copy_(pretrained_embeddings)
    
        #we do not want to update the pretrained embeddings. so set requires_grad to False.
        embedding_layer.weight.requires_grad = False
        return embedding_layer


    
    def __init__(self,numwords,EMBEDDING_DIM,embedding_matrix):
        super(NNModel,self).__init__()
        
        op_classes=2        
        self.embedding=self.create_embed_layer(numwords,EMBEDDING_DIM,embedding_matrix)
        self.conv1_kernel_size=3
        
        self.rnn_hidden_size=15
        self.rnn_num_layers=3
        
        #the kernel size of 3 makes the convolution look at 3 consecutive tokens at a time.
        #Conv1d expects input of shape [batch size, in_channels, sequence length],
        #so the embedding dimension is used as in_channels and forward() transposes
        #the embedded batch to [batch size, emb_dim, max_sentence_length] before calling conv1.
        self.conv1 = nn.Conv1d(EMBEDDING_DIM,#input size
                               self.rnn_hidden_size,#output size
                               self.conv1_kernel_size)
        
        
        ### NOW CREATE A BIDIRECTIONAL LSTM NETWORK, with a small dropout.
        #self.gru = nn.GRU(EMBEDDING_DIM, self.gru_hidden_size, num_layers=self.gru_num_layers, bidirectional=True, 
        #                   dropout=0.1)
        self.lstm = nn.LSTM(self.rnn_hidden_size, self.rnn_hidden_size, num_layers=self.rnn_num_layers, bidirectional=True, 
                           dropout=0.1, batch_first=True)
        
        
        #self.rnn=nn.RNN(EMBEDDING_DIM,self.hidden_nodes)
        #self.rnn=nn.LSTM(EMBEDDING_DIM,self.hidden_nodes,num_layers=n_layers,bidirectional=self.bidirectional,dropout=dropout)
        
        
        #from the bidirectional LSTM we take the final hidden states of the last layer - i.e., its forward and its backward direction.
        #that's why the input to the next layer is rnn_hidden_size*2
        self.fc=nn.Linear(self.rnn_hidden_size*2,op_classes)
        
        self.figcnt=0
        self.class_names=[0,1]
        
        
    #def forward(self, indata,gru_h):   
    def forward(self,indata,h,c):
        #[batch_size x WORD_SEQ_IN_SENTENCE]
        indata=torch.LongTensor(indata)
        
        embed=self.embedding(indata)        
        #[batch size, sentence length, emb dim]
        #packed=torch.nn.utils.rnn.pack_padded_sequence(embed2,input_lengths)
        
        ############# CONVOLUTIONAL LAYER ##################
        #convert batch_size x sentence_lth x emb_dim into batch_size x emb_dim x sentence_length
        conv1_input=embed.transpose(1,2)                        
        conv1_out=self.conv1(conv1_input)
        max_pooled = F.max_pool1d(conv1_out, 3)
        
        #convert batch_size x hidden_size x (modified sent length) to batch_size x (modified sent length) x hidden_size
        rnn_in=max_pooled.transpose(1,2)
        
        ######## RNN #######################

        if h is None:
            h = torch.zeros(self.rnn_num_layers*2, rnn_in.shape[0], self.rnn_hidden_size)
            c = torch.zeros(self.rnn_num_layers*2, rnn_in.shape[0], self.rnn_hidden_size)
            #h = torch.randn(self.gru_num_layers*2, conv1.shape[1], self.gru_hidden_size)
            #c = torch.randn(self.gru_num_layers*2, conv1.shape[1], self.gru_hidden_size)

        #output,gru_h=self.gru(embed,gru_h)        
        rnn_out, (h,c)=self.lstm(rnn_in,(h,c))
        
        #h is of shape (num_layers * num_directions, batch, hidden_size).
        #with a bidirectional LSTM, h[-2] is the final hidden state of the last layer's forward direction
        #and h[-1] is the final hidden state of the last layer's backward direction.
        
        ##concat the data from the last forward layer and the last backward layer.
        #cat=torch.cat((gru_h[-2,:,:],gru_h[-1,:,:]),dim=1)
        cat=torch.cat((h[-2,:,:],h[-1,:,:]),dim=1)
                
        out = self.fc(cat)
        
        
        #print("input=",indata.shape)
        #print("embed=",embed.shape)
        #print("conv1_input=",conv1_input.shape)
        #print("conv1_out=",conv1_out.size())
        #print("conv1 after maxpool=",max_pooled.shape)
        #print("hidden shape=",h.shape)        
        #print("hidden-cat shape=",cat.shape)
        #print("out shape=",out.shape)        

        #expected shapes for BATCH_SIZE=4000, MAX_SENTENCE_LENGTH=72, rnn_hidden_size=15:
        #input= torch.Size([4000, 72])
        #embed= torch.Size([4000, 72, 100])
        #conv1_input= torch.Size([4000, 100, 72])
        #conv1_out= torch.Size([4000, 15, 70])
        #conv1 after maxpool= torch.Size([4000, 15, 23])
        #hidden shape= torch.Size([6, 4000, 15])
        #hidden-cat shape= torch.Size([4000, 30])
        #out shape= torch.Size([4000, 2])

        return out, h, c

    def round_prediction(self,preds_sigmoid):    
        #pick the class with the higher sigmoid score: 0 if the class-0 score > class-1 score, else 1
        #preds=np.array([0 if i<0.7 else 1 for i in preds_sigmoid])
        preds=np.array([0 if i>j else 1 for [i,j] in preds_sigmoid])
        return preds
        
    def binary_accuracy(self,rounded_preds, y):
        """
        Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
        """
        #round predictions to the closest integer
        
        correct = (torch.Tensor(rounded_preds) == torch.Tensor(y)).float() #convert into float for division     
        acc = correct.sum()/len(correct)
        return acc, correct.sum().item()
    

    def plot_metrics(self,y_pred_rounded,y_pred,batch_target):                        

        # Plot non-normalized confusion matrix
        plt.figure(self.figcnt)
        self.figcnt+=1
        self.plot_confusion(batch_target,y_pred_rounded) 

        #ROC needs the raw score of the positive class (column 1), and the positive label here is 1.
        fpr, tpr, thresholds = roc_curve(batch_target, y_pred[:,1], pos_label=1)
        myprint("fpr=",fpr,", tpr=",tpr,",thresholds=",thresholds)
    
    
    def plot_confusion(self,
             batch_target,
             pred_rounded,
             classes=[0,1],
             normalize=False,
             title='Confusion matrix',
             cmap=plt.cm.Blues):

        cm=confusion_matrix(batch_target, pred_rounded)#, labels=class_names)
        print(cm)

        """
        This function prints and plots the confusion matrix.
        Normalization can be applied by setting `normalize=True`.
        """
        if normalize:
            cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
            print("Normalized confusion matrix")
        else:
            print('Confusion matrix, without normalization')
    
        
    
        plt.imshow(cm, interpolation='nearest', cmap=cmap)
        plt.title(title)
        plt.colorbar()
        tick_marks = np.arange(len(classes))
        plt.xticks(tick_marks, classes, rotation=45)
        plt.yticks(tick_marks, classes)
    
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
            plt.text(j, i, format(cm[i, j], fmt),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    
        plt.ylabel('True label')
        plt.xlabel('Predicted label')
        plt.tight_layout()
        plt.show()
        
    def plot_prediction(self,y_pred_sigmoid,y_pred_rounded):
        y=[]
        cnt=1
        for i in y_pred_sigmoid:
            y.append(cnt)
            cnt=cnt+1 
    
        plt.figure(self.figcnt)
        plt.title("sigmoid_{}".format(self.figcnt))
        #plt.scatter(y,y_pred_sigmoid,s=5)
        x=[i for [i,j] in y_pred_sigmoid]
        y=[j for [i,j] in y_pred_sigmoid]
        #print(x)
        #print(y)
        plt.scatter(x,y,s=5)
        plt.show()
        self.figcnt+=1
                        
        #plt.figure(self.figcnt)
        #plt.title("rounded_{}".format(self.figcnt))
        #plt.scatter(y,y_pred_rounded,s=5)
        #plt.show()
        #self.figcnt+=1
           
    

Following is just to get an idea on how many class-0 and class-1 items are present in train data.

In [None]:
class_0=[(qn,cls) for (qid,qn,cls) in train_arr if cls==0]
class_1=[(qn,cls) for (qid,qn,cls) in train_arr if cls==1]
lth_0=len(class_0)
lth_1=len(class_1)

train_qn_cls=[(qn,cls) for (qid,qn,cls) in train_arr]
        
print("Total train data belonging to class_0=",lth_0)
print("Total train data belonging to class_1=",lth_1)


# Recreate Training Set
There are only 80810 class_1 samples, and 1225312 class_0 samples.
Let us upsample, i.e., make the number of class-1 samples equal to the number of class-0 samples.
We could use "from sklearn.utils import resample" for this.
But I saw that it repeats the same sample many times.
So we do it manually here and use np.random.shuffle to randomize the spread of the data.
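
The idea of manual upsampling can be sketched on a toy data set: cycle through the minority class until both classes have the same size, then shuffle. The names below (`oversample`, the made-up question strings, the fixed seed) are hypothetical and only illustrate the approach.

```python
import itertools
import random

def oversample(majority, minority, seed=0):
    """Repeat minority samples in round-robin order until both classes
    are the same size, then shuffle the combined list."""
    missing = len(majority) - len(minority)
    extra = list(itertools.islice(itertools.cycle(minority), missing))
    combined = majority + minority + extra
    random.Random(seed).shuffle(combined)
    return combined

# Made-up (question, class) pairs.
majority = [("q%d" % i, 0) for i in range(10)]
minority = [("t%d" % i, 1) for i in range(3)]
balanced = oversample(majority, minority)
print(sum(1 for _, c in balanced if c == 1))  # 10
```

Cycling in round-robin order repeats every minority sample about equally often, which avoids the uneven repetition mentioned above.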

In [None]:
diff=len(class_0)-len(class_1)
print("diff=",diff)

# Add class-1 samples, cycling through class_1, until both classes have the same count.
# Extending the list and shuffling once is much cheaper than calling list.insert()
# at a random position for every added sample.
extra=list(itertools.islice(itertools.cycle(class_1),diff))
train_qn_cls.extend(extra)
np.random.shuffle(train_qn_cls)
print("done, total=",len(train_qn_cls))


Create the model, optimizer and loss function.
Initialize the hidden and cell states of the LSTM (h, c) to None.
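
The loss used in the cell that follows works on raw scores (logits) rather than on sigmoid outputs. One commonly used numerically stable form of binary cross entropy on logits is sketched below in plain Python; it is an illustration of the formula, not PyTorch's actual implementation, and both function names are hypothetical.

```python
import math

def bce_with_logits(logit, target):
    """Binary cross entropy on a raw score x and a target y in {0, 1}:
    max(x, 0) - x*y + log(1 + exp(-|x|)),
    a rearrangement that avoids overflow for large |x|."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Textbook form: apply the sigmoid first, then take the logs."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

# Both forms agree for moderate scores.
print(abs(bce_with_logits(2.0, 1) - naive_bce(2.0, 1)) < 1e-9)
# The stable form still works where the naive one would hit log(0).
print(bce_with_logits(1000.0, 0))
```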

In [None]:
model=NNModel(num_words,EMBEDDING_DIM,
          embed_matrix)
#optimizer=optim.SGD(model.parameters(),lr=0.01,momentum=0.9)
#optimizer=optim.Adam(model.parameters(),lr=0.001)

lr=0.01
optimizer=optim.Adam(model.parameters(), lr=lr)
#decay LR by a factor of 0.1 on every scheduler.step() call (step_size=1)
scheduler=lr_scheduler.StepLR(optimizer,step_size=1,gamma=0.1)


#loss_fn=torch.nn.MSELoss(reduction="sum")
#loss_fn=nn.CrossEntropyLoss()#we don't use this here as it is better suited for multi class classification.

#BCEWithLogitsLoss combines the sigmoid layer and the BCELoss in one single class but is numerically more stable and hence, should be preferred.
#Note here that you don’t need to pass the input tensor to the sigmoid layer before training with BCEWithLogitsLoss.
loss_fn=nn.BCEWithLogitsLoss()#binary cross entropy loss
gru_h=None
h=None
c=None


The extract_batch_from_chunk() function extracts a batch of questions and targets from the given array.

For the last batch in the array, the number of elements may be smaller than the batch size.
In that case, we append elements from the start of the train set to fill up the batch.
This is because we want to reuse the last calculated LSTM states (h, c), whose dimensions depend on the batch size.
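
The wrap-around behaviour for the last batch can be sketched with a plain list. `take_batch` is a hypothetical helper that takes a batch size instead of an end index, and the integers stand in for question sequences.

```python
def take_batch(items, start, batch_size):
    """Return batch_size items beginning at start, wrapping to the front if needed."""
    end = start + batch_size
    if end <= len(items):
        return items[start:end]
    return items[start:] + items[:end - len(items)]

data = list(range(10))  # stand-in for the sequence list
print(take_batch(data, 0, 4))  # [0, 1, 2, 3]
print(take_batch(data, 8, 4))  # [8, 9, 0, 1]
```

Every batch therefore has the same size, at the cost of showing a few samples twice per epoch.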

In [2]:
def extract_batch_from_chunk(sequences_chunk,targets_chunk,start,end):
    lth=len(sequences_chunk)
    if end > lth:
        #this last batch is lesser than the standard batch size.
        #we add the initial question set to make this batch of standard size.
        batch_input=sequences_chunk[start:lth]
        batch_target=targets_chunk[start:lth]

        #fill the remaining with the questions from first.
        remain_input=sequences_chunk[0:end-lth]
        remain_target=targets_chunk[0:end-lth]

        batch_input=np.concatenate((batch_input,remain_input))
        batch_target=np.concatenate((batch_target,remain_target))

    else:            
        batch_input=sequences_chunk[start:end]
        batch_target=targets_chunk[start:end]

    return batch_input, batch_target

    

This is the core training loop - 
  1) calls the model's forward() function
  2) calculates the loss
  3) propagates the loss backwards
  4) updates the weight values using the optimizer.
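
The same four steps can be written out by hand for a tiny model, which makes it easier to see what the framework automates. The sketch below is a one-feature logistic regression in plain Python with made-up data; it uses plain gradient descent rather than Adam, and `train_step` is a hypothetical name.

```python
import math

def train_step(w, b, xs, ys, lr=0.1):
    """One gradient-descent step for a 1-feature logistic model p = sigmoid(w*x + b)."""
    n = len(xs)
    # 1) forward pass
    ps = [1.0 / (1.0 + math.exp(-(w * x + b))) for x in xs]
    # 2) mean binary cross entropy loss
    loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p) for p, y in zip(ps, ys)) / n
    # 3) backward pass: gradients of the loss with respect to w and b
    grad_w = sum((p - y) * x for p, y, x in zip(ps, ys, xs)) / n
    grad_b = sum(p - y for p, y in zip(ps, ys)) / n
    # 4) parameter update (plain gradient descent, not Adam)
    return w - lr * grad_w, b - lr * grad_b, loss

xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]  # made-up, separable data
w, b = 0.0, 0.0
losses = []
for _ in range(20):
    w, b, loss = train_step(w, b, xs, ys)
    losses.append(loss)
print(losses[0] > losses[-1])  # the loss goes down over the steps
```

In the notebook, steps 3 and 4 correspond to loss.backward() and optimizer.step().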


In [1]:
#def train_loop(batch_input,batch_target,gru_h,cnt):    
def train_loop(batch_input,batch_target,h,c,cnt): 
    
    class_0_batch=[cls for cls in batch_target if cls==0]
    class_1_batch=[cls for cls in batch_target if cls==1]
    myprint("target len cls0=",len(class_0_batch)," cls1=",len(class_1_batch))

    batch_target_2d=[[1,0] if cls==0 else [0,1] for cls in batch_target]
    batch_target_2d=torch.Tensor(batch_target_2d)
        
    #rnn_batch_input=torch.LongTensor(np.transpose(cnn_batch_input))
    # batch_input=torch.LongTensor(batch_input)
            
    optimizer.zero_grad()
    
    #myprint( "rnn start")
    #y_pred_sigmoid,gru_h=model(batch_input,gru_h)
    #y_pred,gru_h=model(batch_input,gru_h)
    
    y_pred,h,c=model(batch_input,h,c)
    #myprint("rnn loss")
    
    #our loss function BCEWithLogitsLoss calculates the sigmoid from the given prediction.
    #So, we just pass the raw predicted values.
    loss=loss_fn(y_pred,batch_target_2d)
    loss_num=loss.item()  
    #myprint("rnn backward")
    loss.backward()
    optimizer.step()

    ## detach the LSTM states (h, c) from the previous history/graph.
    ## Otherwise, it will hog the memory.
    #gru_h.detach_()
    h.detach_()
    c.detach_()
    
    ## Calculate the sigmoid value, i.e., the probability of each of the two classes.
    y_pred_sigmoid= torch.sigmoid(y_pred)

    ## For ex. if class1 probability is more than that of class 0, 
    #then we predict that this sample belongs to class 1.
    y_pred_rounded=model.round_prediction(y_pred_sigmoid)
            
    ## This is just an informational log.
    class_0_pred=[cls for cls in y_pred_rounded if cls==0]
    class_1_pred=[cls for cls in y_pred_rounded if cls==1]
    myprint("predicted len cls0=",len(class_0_pred)," cls1=",len(class_1_pred))
             
    acc,num_correct = model.binary_accuracy(y_pred_rounded, batch_target)
    
    ## Let us plot the sigmoid values and the confusion matrix for some batches in between.
    ## We can't plot for all batches, as the browser would struggle with the increased output logs.
    if cnt % 15 == 0:
        model.plot_prediction( y_pred_sigmoid.detach().numpy(),y_pred_rounded)
        model.plot_metrics(y_pred_rounded, y_pred_sigmoid.detach().numpy(),batch_target)
    #else:
    #    cm=confusion_matrix(batch_target, y_pred_rounded)#, labels=class_names)
    #    print(cm)
    return loss_num, num_correct, h , c



The train_fn() function handles the details such as shuffling the data and creating the batches, and calls the core train_loop.

In [None]:
#print(class_0)


"""if epoch == 0:
lr=0.01
elif epoch == 1:
lr=0.001
else:
lr=0.0001
"""



def train_fn(end_epoch): 

    #set the mode to training.
    model.train()
    h=None
    c=None
    
    #Each epoch should cover all elements of class_0 and of class_1.
    for epoch in range(0,end_epoch):
        if epoch < 2:
            scheduler.step()        
        epoch_loss=0
        #epoch_accuracy=0
        cnt=0
        total=0
        correct=0
        gru_h=None
        #### as the class_1 items are too few when compared to class_0, 
        #### we take the chunk, that contains equal number of class_0 and class_1 item for training
        #### we continue training the chunk, till we cover all items in class_0.
        #start_chunk=0
        #end_chunk=lth_1        
        #done_chunks=False
        #chunk_num=-1
        
        myprint("epoch-",epoch, " start shuffle")
        for i in range (0,10):                    
            np.random.shuffle(train_qn_cls)
        myprint("epoch-",epoch, " end shuffle")
        print(optimizer.defaults)
            
        qn_epoch=[]
        target_epoch=[]
        for (q,t) in train_qn_cls:
            qn_epoch.append(q)
            target_epoch.append(t)

        """
        while done_chunks==False:
            chunk_num+=1
            
            #we have covered ALL CLASS 0 items in this epoch, so break.
            if start_chunk > lth_0:
                done_chunks=True
                break
                
            #this is going to be the last loop.
            if end_chunk > lth_0:
                end_chunk=lth_0
                start_chunk=end_chunk-lth_1
                done_chunks=True
            print("start_chunk=",start_chunk,",end_chunk=",end_chunk,",done_chunks=",done_chunks)
            
            arr0=class_0[start_chunk:end_chunk]
            arr1=class_1
            #print("arr=",arr, " arr1=",arr1)
            lst=np.concatenate((arr0,arr1))
    
            ### LET US SHUFFLE
            for i in range (0,100):
                np.random.shuffle(lst)
            
            lth=len(lst)
            tokens_chunk=[qn for (qn,target) in lst]
            targets_chunk=[target for (qn,target) in lst]
            sequences_chunk=sequences_from_tokens(tokens_chunk)

            data_len=len(sequences_chunk)
            print("datalen=",data_len)        
            
            batch_in_chunk=0
            for i in range (0,data_len,BATCH_SIZE):                         
                end_batch=i+BATCH_SIZE                
                batch_input,batch_target=extract_batch_from_chunk(sequences_chunk,targets_chunk,i,end_batch)
                ...
            start_chunk=start_chunk+lth_1
            end_chunk=start_chunk+lth_1

        """
        data_len=len(qn_epoch)
        for i in range (0,data_len,BATCH_SIZE):
                end_batch=i+BATCH_SIZE
                
                batch_input,batch_target=extract_batch_from_chunk(qn_epoch,target_epoch,i,end_batch)
                batch_input=sequences_from_tokens(batch_input)
                #loss_num,num_correct,gru_h=train_loop(batch_input,batch_target,gru_h,cnt)
                #pass the batch counter (not the sample offset i) so that plotting happens every 15th batch.
                loss_num,num_correct,h,c=train_loop(batch_input,batch_target,h,c,cnt)
                
                
                if cnt%2 == 0:
                    myprint("epoch-",epoch, # chunk=",chunk_num," batch_in_chunk=",batch_in_chunk,
                        " batch_in_epoch-",cnt,                
                      " loss=",loss_num,
                      ", correctly predicted =",num_correct)
    
                    
                #epoch_accuracy += acc.item()
                correct+=num_correct
                epoch_loss+=loss_num
        
                cnt=cnt+1
                #batch_in_chunk+=1
                total+=len(batch_target)
                
                #if more than 5 hours, then break the training loop.
                if ( (time.time() - start_time)/3600 >= time_limit ):
                    myprint("Breaking the train loop as time exceeded limit.")
                    break

        epoch_loss=epoch_loss / cnt
        #epoch_accuracy= epoch_accuracy / cnt
            
        myprint( "DONE. epoch-",epoch," num_batches=", cnt,
          " loss=",epoch_loss, #" accuracy=", epoch_accuracy, 
          " total=",total,
          ", correctly predicted=",correct)
    
    #torch.save(model.state_dict(), "model.pth")
    return h, c
    #return model, optim, self.loss_fn,gru_h


In [None]:
h,c=train_fn(6)

Test Prediction will have these steps - 
1. normalize the test question texts.
2. create sequences array
3. predict using the trained model
4. create the output/result file submission.csv.
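
The last step, writing one "qid,prediction" row per question, can be sketched with the standard csv module. The example writes to an in-memory buffer with made-up ids so that it does not touch the disk; `write_submission` is a hypothetical helper, and the notebook writes the lines with plain string formatting instead.

```python
import csv
import io

def write_submission(handle, qids, predictions):
    """Write the header and one 'qid,prediction' row per question."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["qid", "prediction"])
    for qid, pred in zip(qids, predictions):
        writer.writerow([qid, int(pred)])

# In-memory buffer and made-up ids, for illustration only.
buffer = io.StringIO()
write_submission(buffer, ["id_a", "id_b"], [0, 1])
print(buffer.getvalue())
```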

In [None]:

#print(class_0)
def test_fn(): 
    myprint("TEST STARTED")
    
    ffile=open("submission.csv","w")
    ffile.write("qid,prediction\n")
    
    #ffile_raw=open("test_raw_result.csv","w")
    
    #set the mode to eval.
    model.eval()
    
    data_len=len(test_arr)

    tokens=[qn for (qid,qn) in test_arr]
    qids=[qid for (qid,qn) in test_arr]    
    
    for i in range (0,data_len,BATCH_SIZE):
        end=i+BATCH_SIZE
        
        if end > data_len:
            end=data_len
        batch_input=sequences_from_tokens(tokens[i:end])
                    
        batch_qid=qids[i:end]
                            
        optimizer.zero_grad()

        with torch.set_grad_enabled(False):
            #pred,_=model(batch_input,None)
            #we do not provide the h/c values that we obtained during train,
            #as the last batch will not have the batch_size.
            pred,_,_=model(batch_input,None,None)

            #myprint("batch input=",len(batch_input), "preds=",len(pred))
            pred_sigmoid= torch.sigmoid(pred)
            pred_rounded=model.round_prediction(pred_sigmoid)
            
            cnt=0
            for qid in batch_qid:
                lline="{},{}\n".format(qid,pred_rounded[cnt])            
                ffile.write(lline)
                
                #lline="{},{}\n".format(qid,pred_sigmoid[cnt])
                #ffile_raw.write(lline)
                cnt+=1

    ffile.close()
    #ffile_raw.close()
    print("DONE")   
     

In [None]:
test_fn()