# Preface    

**Text preprocessing and embeddings** - I use some methods that, at the time of writing, are not presented in other notebooks.   
For example: lowercasing only the first letter of a sentence, TweetTokenizer, Unicode normalisation, multi-step embedding concatenation.

**Model** - the base is Hung The Nguyen's PyTorch model - [Pytorch starter](https://www.kaggle.com/hung96ad/pytorch-starter). I adapt it to the concept described by Alexander Burmistrov in [Toxic Comment Classification Challenge 3rd place](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52644).


### Quick summary:
**1. Text preprocessing**:
  
  * replace words/characters based on a dictionary, e.g. don't -> do not, Brexit -> leave EU (*@Dieter, @Theo Viel*),
  * change the first letter of each sentence to lowercase (case matters in pretrained embeddings, so it is better to lowercase only the first letter of a sentence than every word in the text),   
  * remove the apostrophe and the 's ending from words, e.g. AI's -> AI (pretrained GloVe has no vector for "AI's" but does have one for "AI"),
  * replace digits with a mask, e.g. 'Marek gets 3333' -> 'Marek gets ####',
  * use TweetTokenizer() from NLTK for splitting words - I think it's the best tokenizer available for informal text,   
  * normalise Unicode data to remove umlauts, accents etc.,
  * pad and truncate sequences with padding="pre", truncating="post"
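
Two of the steps above, digit masking and Unicode normalisation, can be sketched in isolation. This is a minimal illustration that assumes only the standard library; the helper names `mask_digits` and `to_ascii` are made up for this example and are not the functions used later in the notebook.

```python
import re
import unicodedata

def mask_digits(text: str) -> str:
    # Longest runs first, so a 4-digit number is not split into two 2-digit masks.
    text = re.sub(r"[0-9]{5,}", "#####", text)
    text = re.sub(r"[0-9]{4}", "####", text)
    text = re.sub(r"[0-9]{3}", "###", text)
    text = re.sub(r"[0-9]{2}", "##", text)
    return text

def to_ascii(text: str) -> str:
    # NFKD splits an accented letter into a base letter plus a combining mark;
    # encoding to ASCII with errors="ignore" drops the mark, decode() returns a str.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")

masked = mask_digits("Marek gets 3333")
plain = to_ascii("Zoë visited a café")
print(masked)
print(plain)
```

Note that single digits are left untouched by this masking scheme, because the shortest pattern requires two digits.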
    
**2. Text additional features (apply MinMaxScaler)**:
  * unique words rate,
  * rate of all-caps words,
  * sentence length rate (number of words / max sequence length parameter),     
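
As a rough sketch of how the three features could be computed for one tokenized question (before any min-max scaling), assuming a maximum sequence length of 70; the function name `text_feature_row` and the sample tokens are hypothetical:

```python
def text_feature_row(tokens, max_len=70):
    """Return (unique words rate, all-caps words rate, sentence length rate)."""
    n = len(tokens)
    if n == 0:
        return (0.0, 0.0, 0.0)
    unique_rate = len(set(tokens)) / n
    # The pronoun "I" is upper-case by nature, so it is not counted as all-caps.
    caps_rate = sum(1 for t in tokens if t.isupper() and t != "I") / n
    length_rate = n / max_len
    return (unique_rate, caps_rate, length_rate)

row = text_feature_row(["why", "is", "NASA", "so", "so", "secretive"])
print(row)
```

Each column is then rescaled to the [0, 1] range across the whole dataset, which is what MinMaxScaler does.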
  
**3. Embeddings:**      
*ref. Alexander Burmistrov*
  * concatenated FastText (wiki-news-300d-1M) and GloVe (glove.840B.300d) embeddings (I've prepared a similar function for averaged embeddings too),
  * the GloVe vector is used by itself if there is no FastText vector, but not the other way around,
  * if there is no vector for a word, the word is lowercased and looked up again,
  * words without word vectors are replaced with the word vector for "something",
  * an additional value is appended, set to 1 if the word is written in all capital letters and 0 otherwise.
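
The lookup rules above can be illustrated with toy embeddings. This is only a sketch: the two dictionaries hold made-up 2-d and 1-d vectors as plain lists, and `lookup` is a hypothetical helper, not the function that builds the real embedding matrix.

```python
def lookup(word, glove, fasttext, fallback="something"):
    """Return the toy vector [FastText part | GloVe part | all-caps flag]."""
    caps = [1.0] if word.isupper() and word != "I" else [0.0]
    # Try the word as written, then its lowercase form.
    key = word if word in glove else word.lower()
    if key not in glove:
        # No GloVe vector at all: both halves fall back to "something".
        return fasttext[fallback] + glove[fallback] + caps
    # GloVe vector found: a missing FastText vector is filled with the fallback.
    ft_vec = fasttext[key] if key in fasttext else fasttext[fallback]
    return ft_vec + glove[key] + caps

# Made-up vectors, for illustration only.
glove = {"something": [0.0, 0.0], "nasa": [0.1, 0.2], "the": [0.3, 0.4]}
fasttext = {"something": [9.0], "the": [0.5]}

print(lookup("the", glove, fasttext))
print(lookup("NASA", glove, fasttext))
print(lookup("qwerty", glove, fasttext))
```

The all-caps flag is computed from the original spelling, so "NASA" keeps its flag even though its vector comes from the lowercase entry.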

**4. Architecture -> PyTorch**    
*ref. Alexander Burmistrov, Hung The Nguyen:*

1) Concatenated FastText and GloVe embeddings.    
2) SpatialDropout1D(0.1)   
3) First RNN layer: bidirectional LSTM with 64 hidden units per direction -> LSTM    
4) Attention over the LSTM layer -> LSTM_Atten    
5) Second RNN layer: bidirectional GRU with 128 output features (64 hidden units per direction) -> GRU    
6) Attention over the GRU layer -> GRU_Atten    
7) A concatenation of: [GRU_Atten, LSTM_Atten, average pooling of GRU, max pooling of GRU, Text additional features]    
8) Dense layers: Dense(192, relu, Dropout 0.1) -> Dense(64, relu, Dropout 0.1) -> Dense(1, sigmoid).      
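
As a sanity check on the layer sizes, the width of the vector fed to the first dense layer can be worked out by hand. The sketch below follows the `NeuralNet` code further down in this notebook, where the pooling layers are applied to a (batch, sequence, features) tensor and therefore return one value per time step; the variable names are illustrative.

```python
hidden_size = 64             # hidden units per direction in the LSTM and the GRU
max_len = 70                 # MAX_SEQUENCE_LENGTH
n_text_features = 3          # unique words rate, all-caps rate, length rate

rnn_out = 2 * hidden_size    # a bidirectional layer doubles the feature size

parts = {
    "gru_attention": rnn_out,
    "lstm_attention": rnn_out,
    "avg_pool": max_len,     # pooled over the feature axis: one value per time step
    "max_pool": max_len,
    "text_features": n_text_features,
}
dense_in = sum(parts.values())
print(parts, dense_in)
```

The total matches the input size of the first `nn.Linear` layer in the model definition.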

**Loss:** Binary Cross Entropy    
**Optimizer:** Adam

### References:    
* [@Dieter](https://www.kaggle.com/christofhenkel) -  [How to: Preprocessing when using embeddings](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings)
* [@Theo Viel](https://www.kaggle.com/theoviel) -  [Improve your score with some text preprocessing](https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing)
* [@Alexander Burmistrov](https://www.kaggle.com/mrboor) -  [Toxic Comment Classification Challenge 3rd place](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/52644)    
* [@Hung The Nguyen](https://www.kaggle.com/hung96ad) - [Pytorch starter](https://www.kaggle.com/hung96ad/pytorch-starter)

# Notebook

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load data

In [None]:
import pandas as pd
import numpy as np

def load_data():
    # load the training and test sets (no shuffling happens here)
    train_df = pd.read_csv("../input/train.csv")
    test_df = pd.read_csv("../input/test.csv")
    
    print("Train shape : ",train_df.shape)
    print("Test shape : ",test_df.shape)
    list_sentences_train = list(train_df["question_text"].fillna("NAN_WORD").values)
    list_sentences_test = list(test_df["question_text"].fillna("NAN_WORD").values)
    
    return list_sentences_train, list_sentences_test

In [None]:
list_sentences_train, list_sentences_test = load_data()

### What is the percentage of insincere questions in the training set?

In [None]:
from collections import Counter
tmp = Counter([i for i in pd.read_csv("../input/train.csv")['target'].tolist()])
print("The percentage of insincere questions in train set is {:.2%} .".format(tmp[1]/(tmp[1]+tmp[0])))

train_1_prc = tmp[1]/(tmp[1]+tmp[0])
del(tmp)

### Choose MAX_SEQUENCE_LENGTH

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  

sns.set(rc={'figure.figsize':(12,6)})
sns.distplot([len(i.split()) for i in list_sentences_train+list_sentences_test])
plt.title('Distribution of the length of the question_text - all examples')

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline  

sns.set(rc={'figure.figsize':(12,6)})
sns.distplot(np.array([len(i.split()) for i in list_sentences_train])[np.array([i==1 for i in pd.read_csv("../input/train.csv")['target'].tolist()])])
plt.title('Distribution of the length of the question_text - only insincere questions')

Based on the distributions above, I think a maximum length of 70 words will be good enough.
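
One way to back up such a choice with a number is to measure how many questions fit within the limit. The sketch below uses a made-up list of token counts, not the competition data, and `coverage` is a hypothetical helper:

```python
def coverage(lengths, max_len):
    """Fraction of texts whose token count fits within max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)

# Made-up token counts, for illustration only.
toy_lengths = [5, 9, 12, 14, 20, 33, 48, 70, 85, 134]
share = coverage(toy_lengths, 70)
print("{:.0%} of the toy questions fit in 70 tokens".format(share))
```

On the real data the same function could be fed `len(i.split())` for every question; questions longer than the limit are truncated rather than dropped.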

In [None]:
MAX_SEQUENCE_LENGTH = 70

### Define functions 

In [None]:
def change_word(text: str,dict_:dict) -> str:
    """
    Function that replaces words based on a dictionary.
    
    Parameters
    ----------
    text : str
        Input text
    dict_: dict
        Dictionary of {pattern: replacement} pairs, e.g. {"‘": "'", "´": "'"}
        
    Returns
    -------
    text : processed text 
    
    Examples
    --------
    >>> change_word(text="sample ? text", dict_={"?":"question", "‘":"'"})
    'sample question text'
    """
    for s in dict_.items():
        text = text.replace(s[0],s[1])
    return text

In [None]:
import re
def lower_first_in_sentence(text: str) -> str:
    """
    Function that changes the first letter of each sentence to lowercase. 
    
    Parameters
    ----------
    text : str
        Input text
        
    Returns
    -------
    text : processed text 
    
    Examples
    --------
    >>> lower_first_in_sentence("Matt is smart - claims John. Yes, I think so. Wait.. he is not.")
    'matt is smart - claims John. yes, I think so. wait.. he is not.'
    """
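    # Split on runs of '?', '!' or '.' (plus trailing spaces); the capture group keeps the separators.
    spl = re.compile(r'(\?+ *|!+ *|\.+ *)')
    # Segments shorter than 2 characters or starting with "I" are left unchanged.
    def lower_first(txt): return txt if (len(txt)<2 or txt[0] =="I") else (txt[0].lower() + txt[1:])
    return ''.join([lower_first(i) for i in re.split(spl, text)])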

In [None]:
import re
def strip(word:str ) -> str: 
    """
    Function that removes 's and a solo apostrophe ' from the end of words, e.g. AI's -> AI 
    
    Parameters
    ----------
    word : str
        Input text
        
    Returns
    -------
    text : processed text 
    
    Examples
    --------
    >>> strip("exactly AI's VC's Donald'")
    'exactly AI VC Donald '
    """

    return re.sub("('$ |'$|'s |'s)",' ',word) if len(word)>2 else word

In [None]:
def clean_numbers(x):
    """
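    Function that replaces runs of two or more digits with a '#' mask
    (runs of five or more digits collapse to '#####'); single digits are kept.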
    """
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    return x

In [None]:
def load_embed(file):
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    if file == '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        # len(o)>100 skips the short header line of the .vec file
        with open(file) as f:
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in f if len(o)>100)
    else:
        with open(file, encoding='latin') as f:
            embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)
        
    return embeddings_index

In [None]:
import operator

def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

In [None]:
import numpy as np
from tqdm import tqdm

def concat_embed(first_embed, second_embed, word_index):
    """
    Function that concatenates two embeddings and applies the following rules: 
      * concatenate the second embedding, the first embedding and an all-caps flag,
      * the first vector is used if there is no second vector (the second half is then filled
        with the second embedding's vector for "something"), but not the other way around,
      * if there is no first vector for a word, transform the word to lowercase and look again,
      * words without word vectors are replaced with the word vector for "something",
      * add an additional value set to 1 if the word is written in all capital letters and 0 otherwise.
    
    Parameters
    ----------
    first_embed : dict
        First embedding

    second_embed : dict
        Second embedding
        
    word_index : dict
        Dictionary that maps each word to its integer ID, {word: ID}, e.g. {'the': 1, 'ok': 2, ...} 
        word_index = {t: i+1 for i,t in enumerate(vocab)}
        
    Returns
    -------
    wv_matrix : array of vectors (shape: nb_words x WV_DIM)
    WV_DIM : embedding size
    nb_words : number of rows in the embedding matrix (len(word_index) + 1)
    """

    WV_DIM=first_embed['I'].shape[0]+second_embed['I'].shape[0]+1
    nb_words = len(word_index)+1

    wv_matrix = np.zeros(shape=(nb_words, WV_DIM))
    for word, i in tqdm(word_index.items()):
        cap_flag = word.isupper() and word!='I'
        cap = np.where(cap_flag,np.array([1]),np.array([0]))
        if word in first_embed:
            if word in second_embed:
                #print('1')
                wv_matrix[i] = np.hstack((second_embed[word],first_embed[word],cap))
            else:
                wv_matrix[i] = np.hstack((second_embed['something'],first_embed[word],cap))
        else:
            if word.lower() in first_embed:
                if word.lower() in second_embed:
                    wv_matrix[i] = np.hstack((second_embed[word.lower()],first_embed[word.lower()],cap))
                else:
                    wv_matrix[i] = np.hstack((second_embed['something'],first_embed[word.lower()],cap))
            else:
                wv_matrix[i] = np.hstack((second_embed['something'],first_embed['something'],cap))
            
    return wv_matrix, WV_DIM, nb_words

In [None]:
import numpy as np
from tqdm import tqdm

def avg_embed(first_embed, second_embed, word_index):
    """
    Function that averages two embeddings and applies the following rules: 
      * average the first and second embeddings element-wise and append an all-caps flag,
      * if there is no second vector, the first vector is averaged with the second
        embedding's vector for "something", but not the other way around,
      * if there is no first vector for a word, transform the word to lowercase and look again,
      * words without word vectors are replaced with the word vector for "something",
      * add an additional value set to 1 if the word is written in all capital letters and 0 otherwise.
    
    Parameters
    ----------
    first_embed : dict
        First embedding

    second_embed : dict
        Second embedding
        
    word_index : dict
        Dictionary that maps each word to its integer ID, {word: ID}, e.g. {'the': 1, 'ok': 2, ...} 
        word_index = {t: i+1 for i,t in enumerate(vocab)}
        
    Returns
    -------
    wv_matrix : array of vectors (shape: nb_words x WV_DIM)
    WV_DIM : embedding size
    nb_words : number of rows in the embedding matrix (len(word_index) + 1)
    """
    WV_DIM=np.mean([second_embed['I'],first_embed['I']], axis = 0).shape[0]+1
    nb_words = len(word_index)+1

    wv_matrix = np.zeros(shape=(nb_words, WV_DIM))
    for word, i in tqdm(word_index.items()):
        cap_flag = word.isupper() and word!='I'
        cap = np.where(cap_flag,np.array([1]),np.array([0]))
        if word in first_embed:
            if word in second_embed:
                #print('1')
                wv_matrix[i] = np.hstack((np.mean([second_embed[word],first_embed[word]], axis = 0),cap))
            else:
                wv_matrix[i] = np.hstack((np.mean([second_embed['something'],first_embed[word]], axis = 0),cap))
        else:
            if word.lower() in first_embed:
                if word.lower() in second_embed:
                    wv_matrix[i] = np.hstack((np.mean([second_embed[word.lower()],first_embed[word.lower()]], axis = 0),cap))
                else:
                    wv_matrix[i] = np.hstack((np.mean([second_embed['something'],first_embed[word.lower()]], axis = 0),cap))
            else:
                wv_matrix[i] = np.hstack((np.mean([second_embed['something'],first_embed['something']], axis = 0),cap))        
    return wv_matrix, WV_DIM, nb_words

In [None]:
from multiprocessing import Pool, cpu_count
print("Number of available cpu cores: {}".format(cpu_count()))

def process_in_parallel(function, list_):
    with Pool(cpu_count()) as p:
        tmp = p.map(function, list_)
    return tmp

### Define wrapper functions

In [None]:
import re
from tqdm import tqdm
from collections import Counter, OrderedDict
import operator
import unicodedata as ud
from time import sleep

from nltk.tokenize import TweetTokenizer

def process_questions(list_sentences, dict_, exclude):
    """
    Function that applies text preprocessing
    """
    sleep(0.5)
    print("Lower first in sentence")
    list_sentences = process_in_parallel(lower_first_in_sentence, list_sentences)
    #list_sentences = [lower_first_in_sentence(s) for s in tqdm(list_sentences)]
    
    print("Change words - using dictionary")
    list_sentences = [change_word(s,dict_) for s in list_sentences]
    
    print("Remove 's and solo apostrophe ' from end of the word. ex. AI's -> AI ")
    list_sentences = process_in_parallel(strip, list_sentences)
    #list_sentences = [strip(s) for s in tqdm(list_sentences)]
    
    print("Replace digits with mask ex. 'Marek gets 3333' -> 'Marek gets ####'")
    list_sentences = process_in_parallel(clean_numbers,list_sentences)
    #list_sentences = [clean_numbers(s) for s in tqdm(list_sentences)]
    
    print("Normalise unicode data to remove umlauts, accents etc.")
    #https://gist.github.com/j4mie/557354
    # encode(...) returns bytes, so decode back to str before tokenizing
    list_sentences = [ud.normalize('NFKD', i).encode('ASCII', 'ignore').decode('ASCII') for i in list_sentences]
    
    print("Use TweetTokenizer() from NLTK for splitting words.")
    tokenizer = TweetTokenizer()
    list_sentences = process_in_parallel(tokenizer.tokenize,list_sentences)
    #list_sentences = [tokenizer.tokenize(i) for i in tqdm(list_sentences)]
    
    print("Create vocab")
    dictionary = Counter([item for sublist in list_sentences for item in sublist])
    dictionary = OrderedDict(sorted(dictionary.items(), key=operator.itemgetter(1), reverse = True))

    #exclude = [',','.','"','(',')','[',']',"'",'’',"\\",'{','}','…','..']                      
    print("Exclude some punctuations: {}".format(" ".join(exclude)))
    list_sentences = [list(filter(lambda x: x not in exclude,i)) for i in list_sentences]

    return list_sentences, dictionary

In [None]:
from time import sleep
from tqdm import tqdm
from sklearn.preprocessing import MinMaxScaler

def text_features(list_sentences, MAX_SEQUENCE_LENGTH):
    """
    Function creates matrix with additional text features.
    Column 1: Unique words rate
    Column 2: Rate of all-caps words
    Column 3: Sentence length rate (number of word / max sequence length parameter)
    """
    print("Column 1: Unique words rate\nColumn 2: Rate of all-caps words\nColumn 3: Sentence length rate (number of word / max sequence length parameter)")
    sleep(0.2)
    #"Unique words rate" 
    def uwr(seq): return len(set(seq))/len(seq) if len(seq)>0 else 0
    #"Rate of all-caps words"
    def acw(seq): return len(list(filter(lambda x: x.isupper() and x!='I', seq)))/len(seq) if len(seq)>0 else 0
    #"sentence length rate" 
    def slr(seq): return len(seq)/MAX_SEQUENCE_LENGTH if len(seq)>0 else 0
    
    uwr_f = [uwr(i) for i in tqdm(list_sentences)]
    acw_f = [acw(i) for i in tqdm(list_sentences)]
    slr_f = [slr(i) for i in tqdm(list_sentences)]
    scaler = MinMaxScaler()
    return scaler.fit_transform(np.array([uwr_f,acw_f,slr_f]).transpose())

## Let's get to work    
    
### Prepare sequences and additional features


Define dictionaries that map the words and characters to be replaced

In [None]:
signs = {"'":"'", "‘":"'","´": "'", "°": "","`": "'", '“': '"', '”': '"', '“': '"',
         "₹": "e","€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2","—": "-", 
         "–": "-", "’": "'", "_": "-", "£": "e",'∞': 'infinity', 'θ': 'theta', '÷': '/', 
         'α': 'alpha', '•': '.', 'à': 'a', '−': '-','β': 'beta', '∅': '', '³': '3', 'π': 'pi'}


misspelled = { 'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 
                 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor',
                 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 
                 'youtu ': 'youtube ', 'Quorans':'Quora','Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 
                 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 
                 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 
                 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 
                 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 
                 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', 
                 '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex-boyfriend', 'airhostess': 'air hostess', 
                 "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                 'demonitization': 'demonetization', 'demonetisation': 'demonetization',
                  "’":"'","‘":"'","´":"'","`":"'",'9/11':'terrorism',"Quoran":"Koran",'1/2':'half',
                  'cryptocurrencies':'cryptocurrency',"Brexit":'leave EU', "Blockchain":"blockchain",'..':''}

contractions = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", 
                       "could've": "could have", "couldn't": "could not", "didn't": "did not",  
                       "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", 
                       "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", 
                       "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  
                       "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have",
                       "I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", 
                       "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", 
                       "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", 
                       "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", 
                       "mayn't": "may not", "might've": "might have","mightn't": "might not",
                       "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", 
                       "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                       "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", 
                       "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", 
                       "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                       "she'll've": "she will have", "she's": "she is", "should've": "should have", 
                       "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
                       "so's": "so as", "this's": "this is","that'd": "that would", 
                       "that'd've": "that would have", "that's": "that is", "there'd": "there would", 
                       "there'd've": "there would have", "there's": "there is", "here's": "here is",
                       "they'd": "they would", "they'd've": "they would have", "they'll": "they will", 
                       "they'll've": "they will have", "they're": "they are", "they've": "they have", 
                       "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", 
                       "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                       "weren't": "were not", "what'll": "what will", "what'll've": "what will have", 
                       "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is",
                       "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", 
                       "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", 
                       "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", 
                       "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                       "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                       "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                       "you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
                       "you'll've": "you will have", "you're": "you are", "you've": "you have" ,
                       "Isn't":"is not", "\u200b":"", "It's": "it is","I'm": "I am","don't":"do not"}

dict_= {}
dict_.update(signs)
dict_.update(misspelled)
dict_.update(contractions)

In [None]:
exclude_list = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', '#', '*', '+', '\\', '•',  '~', '@', '£', 
 '·', '_', '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', '½', 'à', '…', 
 '“', '★', '”', '–', '●', 'â', '►', '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', 
 '▒', '：', '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', '‘', '∞', 
 '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', '¹', '≤', '‡', '√', ]

In [None]:
#Apply text preprocessing
sentences, vocab = process_questions(list_sentences_train+list_sentences_test, dict_, exclude_list)

In [None]:
#Prepare matrix with additional features
additional_features = text_features(sentences, MAX_SEQUENCE_LENGTH)

In [None]:
#Prepare embedding

import gc
from time import sleep
import pandas as pd
from keras.preprocessing.sequence import pad_sequences

def prepare_sequences():

    word_index = {t: i+1 for i,t in enumerate(vocab)}
    
    glove = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
    wiki_news = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'
    
    print("Extracting GloVe embedding")
    embed_glove = load_embed(glove)

    print("Extracting FastText embedding")
    embed_fasttext = load_embed(wiki_news)

    print("Glove coverage: ")
    oov_glove = check_coverage(vocab, embed_glove)

    print("FastText coverage: ")
    oov_fasttext = check_coverage(vocab, embed_fasttext)

    print("Create embedding for word vocabulary")
    sleep(0.2)
    wv_matrix, WV_DIM, nb_words = concat_embed(embed_glove, embed_fasttext, word_index)
    
    #del(vocab)
    del(globals()['vocab'])
    gc.collect()
    del(embed_glove,embed_fasttext)
    gc.collect()
    sleep(5)

    print("Create sequences")
    sleep(0.2)
    sequences = [[word_index.get(t, 0) for t in sentence]
                 for sentence in tqdm(sentences[:len(list_sentences_train)])]
    test_sequences = [[word_index.get(t, 0)  for t in sentence] 
                      for sentence in tqdm(sentences[len(list_sentences_train):])]

    print("Assign additional features to train / test list")
    sleep(0.2)    
    additional_features_train = additional_features[:len(list_sentences_train),:]
    additional_features_test = additional_features[len(list_sentences_train):,:]

    del(globals()['list_sentences_train'])
    del(globals()['list_sentences_test'])
    
    print("Pad sequences")
    sleep(0.2)
    train_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH, padding="pre", truncating="post")
    #list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
    train_df = pd.read_csv("../input/train.csv")
    train_y = train_df['target'].values
    print('Shape of data tensor:', train_data.shape)
    print('Shape of label tensor:', train_y.shape)

    test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding="pre",truncating="post")
    print('Shape of test_data tensor:', test_data.shape)

    gc.collect()
    sleep(5)
    return word_index, wv_matrix, WV_DIM, nb_words, train_data, train_y, test_data, additional_features_train, additional_features_test

In [None]:
word_index, wv_matrix, WV_DIM, nb_words, train_data, train_y, test_data, train_data_a, test_data_a = prepare_sequences()

In [None]:
import gc
from time import sleep
gc.collect()
sleep(10)

### Build model

https://www.kaggle.com/hung96ad/pytorch-starter
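
Before the PyTorch code, here is a minimal pure-Python sketch of the attention pooling idea used by the `Attention` layer in this notebook: score each time step, squash with tanh, exponentiate, normalise, and take the weighted sum of the steps. The function name, the toy inputs and the placement of the small epsilon in the denominator are assumptions made for this illustration.

```python
import math

def attention_pool(steps, weight, bias):
    """steps: T feature vectors; weight: one value per feature; bias: one value per step."""
    # One scalar score per time step: tanh(x . w + b)
    scores = [math.tanh(sum(x * w for x, w in zip(step, weight)) + b)
              for step, b in zip(steps, bias)]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores) + 1e-10          # epsilon guards against division by zero
    alphas = [e / total for e in exp_scores]
    # Weighted sum over time steps, computed feature by feature.
    pooled = [sum(a * step[k] for a, step in zip(alphas, steps))
              for k in range(len(weight))]
    return alphas, pooled

alphas, pooled = attention_pool(
    steps=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    weight=[0.5, -0.5],
    bias=[0.0, 0.0, 0.0],
)
print(alphas, pooled)
```

The output has one value per feature regardless of the number of time steps, which is why the attention result can be concatenated with fixed-size features.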

In [None]:
embed_size = WV_DIM # how big is each word vector
#max_features = 95000 # how many unique words to use (i.e num rows in embedding vector)
nb_words = nb_words #number of unique words
maxlen = MAX_SEQUENCE_LENGTH # max number of words in a question to use

batch_size = 2048
train_epochs = 6

SEED = 666

print("embed_size : {embed_size},\nnb_words : {nb_words},\nmaxlen : {maxlen},\nbatch_size : {batch_size}, \
      \ntrain_epochs : {train_epochs},\nSEED : {SEED}".format(
    **{'embed_size':embed_size, 'nb_words':nb_words,'maxlen':maxlen,'batch_size':batch_size,'train_epochs':train_epochs,'SEED':SEED}))

In [None]:
import torch
import torch.nn as nn
import torch.utils.data

In [None]:
import random
import os
import torch

def seed_torch(seed=666):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True

In [None]:
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)
        
        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0
        
        weight = torch.zeros(feature_dim, 1)
        nn.init.xavier_uniform_(weight)
        self.weight = nn.Parameter(weight)
        
        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))
        
    def forward(self, x, mask=None):
        feature_dim = self.feature_dim
        step_dim = self.step_dim

        eij = torch.mm(
            x.contiguous().view(-1, feature_dim), 
            self.weight
        ).view(-1, step_dim)
        
        if self.bias:
            eij = eij + self.b
            
        eij = torch.tanh(eij)
        a = torch.exp(eij)
        
        if mask is not None:
            a = a * mask

        # epsilon belongs inside the denominator to guard against division by zero
        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)

In [None]:
from torch.nn import * 

class NeuralNet(nn.Module):
    def __init__(self):
        super(NeuralNet, self).__init__()
        
        hidden_size = 64
        
        self.embedding = nn.Embedding(nb_words, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(wv_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        
        self.embedding_dropout = nn.Dropout2d(0.1)
        self.lstm = nn.LSTM(embed_size, hidden_size, bidirectional=True, batch_first=True)
        self.gru = nn.GRU(hidden_size*2, hidden_size, bidirectional=True, batch_first=True)
        
        self.lstm_attention = Attention(hidden_size*2, maxlen)
        self.gru_attention = Attention(hidden_size*2, maxlen)
    
        self.AvgPool1d = nn.AdaptiveAvgPool1d(1)
        self.MaxPool1d = nn.AdaptiveMaxPool1d(1)

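        # 2 attention outputs (hidden_size*2 each) + avg and max pool (maxlen each) + 3 text features = 399
        self.linear = nn.Linear(hidden_size*2*2 + maxlen*2 + 3, 192)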
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
        self.linear2 = nn.Linear(192, 64)
        self.out = nn.Linear(64, 1)
        self.out_act = nn.Sigmoid()
    
    def forward(self, x, x_a):
        h_embedding = self.embedding(x)
        # Dropout2d on a (1, batch, seq, embed) tensor treats the batch axis as the channel axis
        h_embedding = torch.squeeze(self.embedding_dropout(torch.unsqueeze(h_embedding, 0)), 0)
        
        h_lstm, _ = self.lstm(h_embedding)
        h_gru, _ = self.gru(h_lstm)
        
        h_lstm_atten = self.lstm_attention(h_lstm)
        h_gru_atten = self.gru_attention(h_gru)
        
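        # h_gru is (batch, seq, features); adaptive pooling reduces the last axis, giving (batch, seq)
        avg_pool = torch.squeeze(self.AvgPool1d(h_gru), -1)
        max_pool = torch.squeeze(self.MaxPool1d(h_gru), -1)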
        #avg_pool = torch.mean(h_gru, 1)
        #max_pool, _ = torch.max(h_gru, 1)
        #print(h_gru_atten.shape)
        #print(avg_pool.shape)
        #print(max_pool.shape)
        #print(x_a.shape)

        conc = torch.cat((h_gru_atten,h_lstm_atten, avg_pool, max_pool, x_a), 1)
        #print(conc.shape)
        conc = self.relu(self.linear(conc))
        conc = self.dropout(conc)
        conc = self.relu(self.linear2(conc))
        conc = self.dropout(conc)       
        out = self.out(conc)
        y = self.out_act(out)
        return y

In [None]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

In [None]:
from sklearn.metrics import f1_score, roc_auc_score

def threshold_search(y_true, y_proba):
    best_threshold = 0
    best_score = 0
    for threshold in tqdm([i/100 for i in range(10,90)]):
        score = f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    search_result = {'threshold': best_threshold, 'f1': best_score}
    return search_result
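
To make the threshold search concrete, here is a dependency-free sketch of the same idea on a handful of made-up labels and probabilities. The helper names `f1` and `best_threshold` and the toy data are hypothetical; the notebook itself relies on `f1_score` from scikit-learn.

```python
def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, y_proba):
    best = (0.0, 0.0)  # (threshold, F1)
    for step in range(10, 90):
        thr = step / 100
        score = f1(y_true, [int(p > thr) for p in y_proba])
        if score > best[1]:  # strict '>' keeps the lowest threshold on ties
            best = (thr, score)
    return best

# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 0, 1, 1, 0]
y_proba = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.80, 0.15]
thr, score = best_threshold(y_true, y_proba)
print(thr, round(score, 3))
```

Because the positive class is rare in this competition, the F1-optimal threshold is generally not 0.5, which is why it is searched for on out-of-fold predictions.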

### Train model and predict

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
splits = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED).split(train_data, train_y))

In [None]:
import warnings
warnings.filterwarnings('always')

import time

train_preds = np.zeros((len(train_data)))
test_preds = np.zeros((len(test_data)))

seed_torch(SEED)

x_test_cuda = torch.tensor(test_data, dtype=torch.long).cuda()
x_test_a_cuda = torch.tensor(test_data_a, dtype=torch.float32).cuda()

test_dataset = torch.utils.data.TensorDataset(x_test_cuda,x_test_a_cuda)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

loss_kFold = []

for i, (train_idx, valid_idx) in enumerate(splits):
    x_train_fold = torch.tensor(train_data[train_idx], dtype=torch.long).cuda()
    x_train_a_fold = torch.tensor(train_data_a[train_idx], dtype=torch.float32).cuda()
    y_train_fold = torch.tensor(train_y[train_idx, np.newaxis], dtype=torch.float32).cuda()
    x_val_fold = torch.tensor(train_data[valid_idx], dtype=torch.long).cuda()
    x_val_a_fold = torch.tensor(train_data_a[valid_idx], dtype=torch.float32).cuda()
    y_val_fold = torch.tensor(train_y[valid_idx, np.newaxis], dtype=torch.float32).cuda()
    
    model = NeuralNet()
    model.cuda()
    
    loss_fn = torch.nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))
    
    train = torch.utils.data.TensorDataset(x_train_fold,x_train_a_fold, y_train_fold)
    valid = torch.utils.data.TensorDataset(x_val_fold, x_val_a_fold,y_val_fold)
    
    train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
    valid_loader = torch.utils.data.DataLoader(valid, batch_size=batch_size, shuffle=False)
    
    print(f'Fold {i + 1}')
    loss_epoch = []
    for epoch in range(train_epochs):
        start_time = time.time()
        
        model.train()
        avg_loss = 0.
        losses = []
        for x_batch,x_a_batch, y_batch in tqdm(train_loader, disable=True):
            
            optimizer.zero_grad()
            # (1) Forward
            y_pred = model(x_batch,x_a_batch)
            # (2) Compute diff
            loss = loss_fn(y_pred, y_batch)
            # (3) Compute gradients
            loss.backward()
            # (4) update weights
            optimizer.step()
            avg_loss += loss.item() / len(train_loader)
            losses.append(loss.item())
        
        loss_epoch.append(losses)
        
        model.eval()
        valid_preds_fold = np.zeros((x_val_fold.size(0)))
        test_preds_fold = np.zeros(len(test_data))
        avg_val_loss = 0.
        # no_grad avoids building the autograd graph during evaluation;
        # batch_idx keeps the fold index `i` from being overwritten
        with torch.no_grad():
            for batch_idx, (x_batch, x_2_batch, y_batch) in enumerate(valid_loader):
                y_pred = model(x_batch,x_2_batch).detach()
                avg_val_loss += loss_fn(y_pred, y_batch).item() / len(valid_loader)
                valid_preds_fold[batch_idx * batch_size:(batch_idx+1) * batch_size] = y_pred.cpu().numpy()[:, 0]
        
        elapsed_time = time.time() - start_time 
        print('Epoch {}/{} \t loss={:.4f} \t val_loss={:.4f} \t time={:.2f}s'.format(
            epoch + 1, train_epochs, avg_loss, avg_val_loss, elapsed_time))
        
    # predict the test set with the model trained on this fold
    with torch.no_grad():
        for batch_idx, (x_batch,x_a_batch) in enumerate(test_loader):
            y_pred = model(x_batch,x_a_batch).detach()
            test_preds_fold[batch_idx * batch_size:(batch_idx+1) * batch_size] = y_pred.cpu().numpy()[:, 0]

    train_preds[valid_idx] = valid_preds_fold
    print(threshold_search(train_y[valid_idx], train_preds[valid_idx]))
    
    test_preds += test_preds_fold / len(splits)  
    
    loss_kFold.append(loss_epoch)

In [None]:
search_result = threshold_search(train_y, train_preds)
search_result

In [None]:
test_df = pd.read_csv("../input/test.csv")
sub = pd.DataFrame({"qid": test_df["qid"].values})
sub['prediction'] = (test_preds > search_result['threshold']).astype(int)
sub.to_csv("submission.csv", index=False)

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  

sns.set(rc={'figure.figsize':(12,6)})
sns.distplot(test_preds)
plt.title('Distribution of the test predictions probability')

In [None]:
from collections import Counter
print("The percentage of insincere questions in train set is {:.2%} .".format(train_1_prc))

In [None]:
test_pred_1_prc = Counter(sub["prediction"])[1]/len(sub["prediction"])
print("The percentage of predicted insincere questions in test set is {:.2%} .".format(test_pred_1_prc))