In [None]:
import time
start_time = time.time()

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.utils import shuffle
from sklearn.svm import SVC
from sklearn import model_selection
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

from tqdm import tqdm
import numpy as np
import pandas as pd
import re

In [None]:
!pip install tensorflow-text==2.0.0 --user

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as textb

In [None]:
# show the full tweet text instead of a truncated column
# (-1 works in older pandas; pandas >= 1.0 expects None instead)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 310)

### Data loading

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
train_copy = train.copy()
length_train = len(train.index)
length_train

### Pipeline of the model

### Text cleaning/preprocessing(+) + Transformer + Support Vector Machine + Majority voting for semantically equivalent but mislabelled tweets + Filtering based on keywords<br>

Many notebooks in the competition show that a Support Vector Machine works quite well for this classification. I started my work from: https://www.kaggle.com/ihelon/starter-nlp-svm-tf-idf<br>

In https://www.kaggle.com/gibrano/disaster-universal-sentences-encoder-svm the Multilingual Universal Sentence Encoder is used for sentence encoding. Here I follow that work and use the Multilingual Universal Sentence Encoder (from tensorflow_hub).

The approach from https://www.kaggle.com/bandits/using-keywords-for-prediction-improvement is applied for the final filtering of the results based on the 'keywords'.

The training data contain many 'semantically equivalent' tweets. For example, some tweets differ only in their URLs. We may hypothesise that the URLs are not important for predicting the 'target'. To find such tweets we clean the 'text' strings by removing the URLs, after which the tweets become equal as strings. In this way the data cleaning doubles as detection of semantically equivalent tweets: records in the train set that are semantically equivalent (but different as raw strings) form equivalence classes, which the cleaning reveals (in this model). Importantly, some classes contain 'mislabelling': both 1 and 0 labels occur in the same class, although all tweets in a class are considered semantically equal and so should be labelled either all 0 or all 1.<br>
For more about 'equal' tweets see: https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert; https://www.kaggle.com/atpspin/same-tweet-two-target-labels <br>
In the raw 'text' of the train set there are 55 records (lines) with 'mislabelling', and they form 18 equivalence classes: https://www.kaggle.com/dmitri9149/transformer-simple-baseline-model-s<br>
That notebook used the model without data cleaning: the raw 'text' strings were supplied to the Transformer.
In this model I do text cleaning/preprocessing, and it reveals 85 equivalence classes covering 309 records in the train set, about 4 % of the training 'text' data. Something has to be done about it.<br>
I calculate the mean of the 'target' in each class with mislabelling and change the 'target' of the corresponding records in the train set according to that mean: to 1 if the mean is greater than or equal to 0.5, and to 0 if it is lower than 0.5.<br>
Similar ideas were used in https://www.kaggle.com/atpspin/same-tweet-two-target-labels <br>
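
A minimal sketch of the idea, using only the standard library. The tweets, URLs and the helper `strip_urls` are invented for illustration, and the cleaning is much simpler than the `clean` function used later in this notebook:

```python
import re
from collections import defaultdict

# Toy (text, target) records; the texts and URLs are invented for illustration.
records = [
    ("Wildfire spreads near the highway http://t.co/aaa", 1),
    ("Wildfire spreads near the highway http://t.co/bbb", 0),
    ("Wildfire spreads near the highway http://t.co/ccc", 1),
    ("What a lovely sunny day", 0),
]

def strip_urls(text):
    # Simplified cleaning: drop every token that starts with 'http', then squeeze spaces.
    return re.sub(r"\s+", " ", re.sub(r"http\S*", " ", text)).strip()

# Group the targets by cleaned text: each key is one equivalence class.
classes = defaultdict(list)
for text, target in records:
    classes[strip_urls(text)].append(target)

# A class shows 'mislabelling' when it holds both 0 and 1.
mixed = {text: targets for text, targets in classes.items() if len(set(targets)) > 1}

# Majority voting as described above: mean >= 0.5 -> 1, otherwise 0.
relabelled = {text: int(sum(t) / len(t) >= 0.5) for text, t in mixed.items()}
print(relabelled)
```

Here the three 'wildfire' strings collapse into one class with labels 1, 0, 1, so the whole class would be relabelled to 1.
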
<br>
I would like to clarify the meaning of the word 'mislabelling' used above. My understanding of the data has evolved, and I now think a term like 'uncertainty' would be better. The notion of 'disaster' is very subjective and not well defined. When a tweet is labelled 1, there may be a (possibly large) probability that it should be labelled 0. A graded labelling (such as 0, 1, 2, ..., 5) might be more natural for an expert team. Under the constraint of 0/1 labelling only, we may expect different experts to assign different labels to 'semantically equivalent' tweets. This uncertainty arises naturally in such data. Experts A and B may easily agree that adding or removing URLs at the end of a tweet does not change its meaning, yet totally disagree about WHAT THE MEANING IS: it may be 'disaster' = 1 for A and 0 for B. I did attempt to 'refactor' the text and change 'mislabelling' to something else, but it heavily complicates the notebook. I hope these comments are enough.
<br>

A quite informative research article, 'Twitter Sentiment Analysis System' (based on similar data): https://arxiv.org/ftp/arxiv/papers/1807/1807.07752.pdf

#### Addition from 03.05.2020

Suppose I need to write an alarm tweet. It is neither the time nor the place for a novel. After a short "targeted broadcast" using #- and @-words I will write the main, short text: what / when / where / with whom something is happening. I will try to convey as much information as possible and to attract as much attention as possible to the situation, and I will do it close to the beginning of the tweet rather than "waiting" for its tail (the size is limited). Closer to the end of the string I may add URLs and other resources for clarification, and maybe some extra words. Tweets differ greatly in length: some contain just the "main" text, others are longer and decorated with a wordy tail. The idea is that if the most valuable information is concentrated at the beginning of a text string, the words in the tails of "long" tweets may add noise and overshadow the words with the greatest predictive power at the beginning. So, while processing the text, let us cut the tails of the longest tweets to some extent and see what happens.

### Examples

All the records are from the original train set, before text cleaning. The examples motivate the regular expressions used for the 'text' cleaning/preprocessing.

In [None]:
index_hot = [4415, 4400, 4399,4403,4397,4396, 4394,4414, 4393,4392,4404,4407,4420,4412,4408,4391,4405]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 17 elements (strings, i.e. records in the original train 'text') in the equivalence class. The strings differ only in their URLs. The target of all elements will be relabelled to 0, because the mean of the target for the class is 0.294118, which is < 0.5. This is the biggest class in the model. Note how 'keyword' and 'location' behave within the class.
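
The arithmetic behind this decision can be checked in a few lines. The individual labels are not listed in the text; the list below is reconstructed from the quoted mean, which for 17 records corresponds to 5 records labelled 1 (5/17 ≈ 0.294118):

```python
# 17 records in the class; a mean of 0.294118 corresponds to 5 records labelled 1.
labels = [1] * 5 + [0] * 12

mean = sum(labels) / len(labels)

# Majority vote used in this notebook: mean >= 0.5 -> 1, otherwise 0.
new_label = 1 if mean >= 0.5 else 0
print(round(mean, 6), new_label)
```
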

In [None]:
index_hot = [6840,6834,6837,6841,6816]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 5 elements in the class. 
The records will be relabelled to 0  (target mean < 0.5)

In [None]:
index_hot = [6828,6831]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 2 elements in the class. 
The class is considered different from the class in the previous example. This may be reasonable, because the tweets in the previous class have many more words, i.e. much more semantic information.
The records will be relabelled to 1 (target mean >= 0.5).

In [None]:
index_hot = [591, 587]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 2 elements in the class. Will be relabelled to 1 ( target mean for the class is 0.5).

In [None]:
index_hot = [601,576,584,608,606,603,592,604]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 8 elements in the class. The classes are revealed by the text cleaning in the model. The strings in the class (before the text cleaning) differ not only in their URLs but also in 'via @usatoday' or 'via @USATODAY' at the end of the strings. The mean target in the class is 0.75 >= 0.5, so the class is relabelled to 1.
The class is different from the class in the previous example. The reason is the '#world' word at the beginning of the sentence. Such 'prefixes', which include #- (and @-) words, are quite common and may carry semantic meaning.

In [None]:
index_hot = [3667,3674,3688,3696]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 4 elements in the class. Relabelling to 0. Strings differ only by ! and ? or by URL at the end of string. The indexes 3667, 3674, 3688, 3696 are not far from each other (indexes of equivalent strings are usually close to each other).

In [None]:
index_hot = [3913,3914,3936,3921,3941,3937,3938]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 7 elements in the class. Relabelling to 0. The ' Full reÛ_ http://t.co/xxkHjySn0p http://t.co/JEVHKNJGBX' is actually a truncation of 'Full read by http://t.co/xxkHjySn0p http://t.co/JEVHKNJGBX'. In this model I remove the 'Full read by' string and all of its possible truncations, such as 'Full re', 'Full reÛ_', 'Full r' and 'Full rea', which also occur in the 'text'.
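
As a sketch, the list of truncations can be generated instead of being written out by hand. The helper `truncation_pattern` is hypothetical (it is not part of this notebook), and unlike the hand-written pattern used below it also includes the prefix that ends in a space ('Full read '), which only changes the amount of whitespace left behind:

```python
import re

def truncation_pattern(phrase, min_len):
    # All prefixes of `phrase` with at least `min_len` characters, longest first,
    # so that the regex engine prefers the longest possible match.
    prefixes = [phrase[:n] for n in range(len(phrase), min_len - 1, -1)]
    return re.compile("|".join(re.escape(p) for p in prefixes))

# Shortest truncation to remove is 'Full r', as in the text above.
full_read = truncation_pattern("Full read by", len("Full r"))

print(repr(full_read.sub(" ", "Some headline Full rea http://t.co/xxkHjySn0p")))
```

Ordering the alternatives from longest to shortest matters: with the shortest first, 'Full r' would match and leave 'ead by' behind.
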

In [None]:
index_hot = [3136, 3133]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 2 elements in the class. Relabelling to 1.

In [None]:
index_hot = [3930,3933,3924,3917]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 4 elements in the class. Relabelling to 0. We can see here the 'Full rea' truncation, not 'Full re' as in the previous example. 

In [None]:
index_hot = [246,270,266,259,253,251,250,271]
train_example = train.copy()
train_example.loc[index_hot,:]

There are 8 elements. Relabelling to 0. The 'via @something' pattern AT THE END OF STRING will be eliminated in the model. It helps to reveal new classes or broaden existing classes. See the FedEx example above.
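
A simplified sketch of this step. The function name and the example strings are invented, and the notebook itself first maps every @-word to USER and only then removes ' via USER' at the end of the string:

```python
import re

def drop_trailing_via(text):
    # Remove a final 'via @name'; a 'via @name' elsewhere in the string is kept.
    return re.sub(r"\s+via\s+@\w+\s*$", "", text)

a = drop_trailing_via("Storm delays parcel deliveries via @usatoday")
b = drop_trailing_via("Storm delays parcel deliveries via @USATODAY")
print(a == b, repr(a))
```

After the removal the two strings are equal, so they fall into the same equivalence class.
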

### Data cleaning/preprocessing

A tweet is very different in structure from a usual English sentence. Quite often there is a core sentence plus a 'prefix' and a 'suffix'. The prefix is rich in #words and @words; the suffix is rich in URLs. The URL 'suffix' may contain many letters and words, but it does not seem to be very informative for the classification.

In [None]:
def new_line(text):
    text = re.sub(r'\t', ' ', text) # replace tabs with a space
    text = re.sub(r'\n', ' ', text) # replace line breaks with a space
    return text

#### About the url function

The examples:<br> 
"Strict liability in the context of an airplane accident: Pilot error is a common<br> component of most aviation cr... http://t.co/6CZ3bOhRd4" <br>
"Experts in France begin examining airplane debris found on Reunion<br> Island: French air accident experts o... http://t.co/YVVPznZmXg #news" <br>
A URL cannot be truncated in a tweet, so if a string is too long it is truncated just before the URL and '...' is added: 'cr...' or 'o...' in the examples. In the cleaning procedure below, if such a truncated 'word' has 3 or fewer letters, it is considered noise and is removed together with the URL that follows it.
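
A small sketch of this rule applied to the two examples above. The helper name `drop_truncated_word` is hypothetical; the regex is an equivalent spelling of the first substitution in the `url` function, followed by a whitespace squeeze:

```python
import re

examples = [
    "Strict liability in the context of an airplane accident: Pilot error is a common "
    "component of most aviation cr... http://t.co/6CZ3bOhRd4",
    "Experts in France begin examining airplane debris found on Reunion "
    "Island: French air accident experts o... http://t.co/YVVPznZmXg #news",
]

def drop_truncated_word(text):
    # A 1-3 character word fragment followed by '...' and a URL is treated as noise.
    text = re.sub(r" \w{1,3}\.{3} http\S*", " ", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = [drop_truncated_word(t) for t in examples]
for c in cleaned:
    print(c)
```

The fragments 'cr...' and 'o...' disappear together with their URLs, while '#news' after the URL is kept.
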

In [None]:
def url(text):
    # Many tweets are truncated just before the URL, e.g. "... French air
    # accident experts o... http://t.co/YVVPznZmXg #news" (see the explanation above):
    # drop the 1-3 letter word fragment together with the URL.
    text = re.sub(r' \w{1,3}\.{3,3} http\S{0,}', ' ', text)
    text = re.sub(r' \w{1,3}Û_ http\S{0,}', ' ', text)
    # Some symbols and words directly before 'http' are removed together with the URL;
    # it is assumed they have no semantic meaning or predictive power in that position.
    text = re.sub(r"mp3 http\S{0,}", r" ", text)
    text = re.sub(r"rar http\S{0,}", r" ", text)
    pattern = re.compile(r'( pin\:\d+ | via )http\S{0,}')
    text = pattern.sub(r' ', text)
    # These phrases carry little meaning in a tweet; removing them (and their
    # truncations) unifies the structure of the strings.
    pattern = re.compile(r'Full read by|Full read b|Full read|Full rea|Full re|Full r')
    text = pattern.sub(r' ', text)
    pattern = re.compile(r'Full story at|Full story a|Full story|Full stor|Full sto|Full st|Full s')
    text = pattern.sub(r' ', text)
    
    return text

In [None]:
def clean(text):    
    text = new_line(text)
    # remove HTML entities left in the text
    text = re.sub(r'(&amp;|&gt;|&lt;)', " ", text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = url(text)

    # @-words are 'translated' to 'USER';
    # https://www.kaggle.com/quentinsarrazin/tweets-preprocessing uses a similar 'translation',
    # and in https://arxiv.org/ftp/arxiv/papers/1807/1807.07752.pdf a similar pattern
    # is 'translated' to 'USER_NAME'
    text = re.sub(r'@\S{0,}', ' USER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    # shrink multiple USER USER USER ... to USER
    text = re.sub(r'\b(USER)( \1\b)+', r'\1', text)

    # repeated letters, as in 'Oooooohhh', are truncated to 2 letters; truncating to
    # 1 letter is not possible, because it may create a false meaning ('good' -> 'god')
    text = re.sub(r'([a-zA-Z])\1{1,}', r'\1\1', text)

    # remove URLs that were not already removed by the url function
    text = re.sub(r"htt\S{0,}", " ", text)

    # remove all characters that are not in [a-zA-Z\d\s]
    text = re.sub(r"[^a-zA-Z\d\s]", " ", text)

    # tokens that start with a digit are 'translated' to 'NUMBER';
    # https://www.kaggle.com/quentinsarrazin/tweets-preprocessing uses a similar 'translation'
    text = re.sub(r'^\d\S{0,}| \d\S{0,}| \d\S{0,}$', ' NUMBER ', text)
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    # shrink multiple NUMBER NUMBER ... to NUMBER
    text = re.sub(r'\b(NUMBER)( \1\b)+', r'\1', text)

    # remove digits that were not replaced by the 'NUMBER translation' above
    text = re.sub(r"[0-9]", " ", text)

    text = text.strip() # remove spaces at the beginning and at the end of the string
    # to reveal more equivalence classes, ' via USER' at the end of the string is removed
    text = re.sub(r' via\s{1,}USER$', ' ', text)
    
    text = re.sub(r"\s+", " ", text) # remove extra spaces
    text = text.strip() # remove spaces at the beginning and at the end of string
    
    return text

In [None]:
train['text'][5450:5550] # train text before cleaning

In [None]:
train.text = train.text.apply(clean)
test.text = test.text.apply(clean)

#### Cut tails of longest tweets. 

In [None]:
max_length_tr = train.text.map(len).max()
max_length_te = test.text.map(len).max()
max_length = max(max_length_tr, max_length_te)

print("At the stage of text processing:")
print(f"...the size of longest text string in train set is  {max_length_tr}")
print(f"...the size of longest text string in test set is  {max_length_te}")


In [None]:
# The new maximum length is new_max = max_len - delta (assumed positive). Strings longer
# than new_max are truncated to new_max characters; shorter strings are returned unchanged.
def cut(max_len, delta, x):
    new_max = max_len - delta
    if len(x) <= new_max:
        return x
    return x[:new_max]
    

delta = 25 
train.text = train.text.map(lambda x: cut(max_length, delta, x))
test.text = test.text.map(lambda x: cut(max_length, delta, x))

new_max_length_tr = train.text.map(len).max()
new_max_length_te = test.text.map(len).max()

print("After we cut tails of the longest tweets:")
print(f"...the size of longest text string in train set is  {new_max_length_tr}")
print(f"...the size of longest text string in test set is  {new_max_length_te}")

#### The 'text' after the cleaning/processing looks like this:

In [None]:
train['text'][5450:5550] # the 'text' after cleaning 

### Equivalence classes with mislabelling. 

In [None]:
# the code in the cell is taken from 
# https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
df_mislabeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_mislabeled = df_mislabeled[df_mislabeled['target'] > 1]['target']
index_misl = df_mislabeled.index.tolist()

lenght = len(index_misl)

print(f"There are {lenght} equivalence classes with mislabelling")

The 85 'mislabelled tweets' (each of them represents a class with at least 2 elements).

In [None]:
index_misl # the list of strings (after cleaning/preprocessing) which represent the 85  classes

The list of all 309 records after the cleaning. From the list you can see the strings which represent each class and how the 'location' and 'keyword' variables behave within classes.

In [None]:
train_nu_target = train[train['text'].isin(index_misl)].sort_values(by = 'text')
#train_nu_target.head(60)
train_nu_target[0:309]

In [None]:
num_records = train_nu_target.shape[0]
length = len(index_misl)
print(f"There are {num_records} records in train set which are split in {length} equivalence classes (with mislabelling)") 

Let us calculate some statistics for each class. The table below shows the target mean and the number of records in the train set for each class. As we can see, an equivalence class has from 2 to 17 elements.
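
The same kind of table can be sketched with pandas named aggregation (available in pandas 0.25 and later). The toy frame and the column names `n_records` and `target_mean` are assumptions made for illustration:

```python
import pandas as pd

# Toy stand-in for the records that belong to classes with mixed labels.
toy = pd.DataFrame({
    "text": ["a", "a", "a", "b", "b"],
    "target": [1, 0, 0, 1, 0],
})

# Named aggregation: one row per class with its size and its target mean.
stats = toy.groupby("text").agg(
    n_records=("target", "size"),
    target_mean=("target", "mean"),
).sort_values("n_records", ascending=False)
print(stats)
```

Counting with `"size"` on the 'target' column does not depend on missing values in 'keyword'.
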

In [None]:
copy = train_nu_target.copy()
classes = copy.groupby('text').agg({'keyword':np.size, 'target':np.mean}).rename(columns={'keyword':'Number of records in train set', 'target':'Target mean'})

classes.sort_values('Number of records in train set', ascending=False).head(100)

### Majority voting

If the target mean of a class is lower than 0.5, I relabel the class to 0, otherwise (including a tie at 0.5) to 1.
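
A vectorised sketch of the same rule on toy data, offered as an alternative to a row-by-row `apply`. It votes within every group of equal strings; groups whose labels already agree have a mean of 0 or 1 and therefore keep their label:

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["a", "a", "a", "b", "b", "c"],
    "target": [1, 0, 0, 1, 0, 1],
})

# Mean target of the class each row belongs to (singletons keep their own label).
class_mean = toy.groupby("text")["target"].transform("mean")

# Majority vote: mean < 0.5 -> 0, otherwise (including the 0.5 tie) -> 1.
toy["target"] = (class_mean >= 0.5).astype(int)
print(toy)
```
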

In [None]:
majority_df = train_nu_target.groupby(['text'])['target'].mean()
len(majority_df.index)

In [None]:
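def relabel(r, majority_index):
    # rows outside the classes with mislabelling keep their original target
    if r['text'] not in majority_index:
        return r['target']
    # majority vote: class mean < 0.5 -> 0, otherwise (ties included) -> 1
    return 0 if majority_df[r['text']] < 0.5 else 1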

In [None]:
train['target'] = train.apply( lambda row: relabel(row, majority_df.index), axis = 1)

In [None]:
new_df = train[train['text'].isin(majority_df.index)].sort_values(['target', 'text'], ascending = [False, True])
new_df.head(310)

The 'target' for mislabelled tweets is recalculated. 
The number of mislabelled tweets is 0 after recalculation. 

In [None]:
# the code in the cell is taken from 
# https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert
df_mislabeled = train.groupby(['text']).nunique().sort_values(by='target', ascending=False)
df_mislabeled = df_mislabeled[df_mislabeled['target'] > 1]['target']
index_misl = df_mislabeled.index.tolist()
#index_dupl[0:50]
len(index_misl)

### Load the Multilingual Encoder module 

In [None]:
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

### Some words about Universal Sentence Encoders and the Transformer

Universal Sentence Encoders encode sentences into fixed-length vectors (the size is 512 in the case of the Multilingual Encoder). The encoders are pretrained on several different tasks: (research article) https://arxiv.org/pdf/1803.11175.pdf. And a use case: https://towardsdatascience.com/use-cases-of-googles-universal-sentence-encoder-in-production-dd5aaab4fc15<br>
Two architectures are used in the encoders: the Transformer and Deep Averaging Networks. The Transformer uses a "self-attention mechanism" that learns contextual relations between words and (depending on the model) even subwords in a sentence. Not only a word but also its position in the sentence is taken into account (as are the positions of the other words). There are different ways to implement the intuitive notion of "contextual relation between words in a sentence" (that is, different ways to construct a "representation space" for the contextual relations between words). If several such "ways" are implemented in a model at the same time, the term "multi-head attention mechanism" is used.<br>
A Transformer has 2 steps. Encoding: read the text and transform it into vectors (the sentence encoder pools them into one vector of fixed length); and decoding: decode that representation (produce a prediction for the task). For example: take a sentence in English, encode it, and translate (decode) it into a sentence in German.<br>
For our model we need only the encoding mechanism: sentences are encoded into vectors and supplied to the Support Vector Machine for classification.<br>
A good and intuitive explanation of the Transformer: http://jalammar.github.io/illustrated-transformer/ ; the original and now quite famous paper "Attention is all you need": (research article) https://arxiv.org/pdf/1706.03762.pdf. More about multi-head attention: (research article) https://arxiv.org/pdf/1810.10183.pdf. How the Transformer is used in BERT:<br> https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270.
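
To make the "self-attention" idea a bit more concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention as described in "Attention is all you need". It is only an illustration: the weights are random, the sizes are toy values, there is no positional encoding, and it is not the implementation inside the Universal Sentence Encoder:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the token vectors into queries, keys and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Similarity of every token with every other token, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of all value vectors.
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings (toy sizes)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1 and says how much every other token contributes to the new representation of one token. Multi-head attention runs several such projections in parallel and concatenates the results.
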

The Multilingual Universal Sentence Encoder: (research article) https://arxiv.org/pdf/1810.12836.pdf; example code: https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3 The Multilingual Encoder uses the very interesting SentencePiece tokenization to build a pretrained vocabulary: (research articles) https://www.aclweb.org/anthology/D18-2012.pdf; https://www.aclweb.org/anthology/P18-1007.pdf.<br>

About text preprocessing, the importance of keeping it coherent with the preprocessing used for pretraining, and the different models of text tokenization:<br>
a very good article: https://mlexplained.com/2019/11/06/a-deep-dive-into-the-wonderful-world-of-preprocessing-in-nlp/<br>

For a deep understanding of the Transformer: http://nlp.seas.harvard.edu/2018/04/03/attention.html

In the book http://d2l.ai/ you can find how to code a Transformer from scratch (the MXNet framework is used).
In https://github.com/dsgiitr/d2l-pytorch the book's code (d2l.ai) is "translated" to PyTorch.

Below the encoding is applied to every sentence in train.text and test.text columns and the resulting vectors are saved to lists.
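
Encoding one sentence per call works but can be slow. A possible alternative is to encode in batches; the sketch below assumes the encoder is a callable that accepts a list of strings and returns one vector per string. `encode_in_batches` and `fake_encoder` are hypothetical names, and `fake_encoder` only stands in for the hub module so that the sketch runs without TensorFlow:

```python
import numpy as np

def encode_in_batches(texts, encoder, batch_size=64):
    # `encoder` is any callable mapping a list of strings to a 2-D array-like
    # of shape (len(batch), dim); here it stands in for the loaded hub module.
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        chunks.append(np.asarray(encoder(batch)))
    return np.vstack(chunks)

def fake_encoder(batch):
    # Hypothetical stand-in: a 3-dimensional "embedding" built from simple
    # string statistics (length, number of spaces, starts with a capital).
    return [[len(s), s.count(" "), float(s[:1].isupper())] for s in batch]

X = encode_in_batches(["Fire in the hills", "hello", "Flood warning"],
                      fake_encoder, batch_size=2)
print(X.shape)
```

The result has one row per input text, in the original order, which is the layout the Support Vector Machine expects.
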

In [None]:
X_train = []
for r in tqdm(train.text.values):
  emb = use(r)
  review_emb = tf.reshape(emb, [-1]).numpy()
  X_train.append(review_emb)

X_train = np.array(X_train)
y_train = train.target.values

X_test = []
for r in tqdm(test.text.values):
  emb = use(r)
  review_emb = tf.reshape(emb, [-1]).numpy()
  X_test.append(review_emb)

X_test = np.array(X_test)

### Training and Evaluating

In [None]:
train_arrays, test_arrays, train_labels, test_labels = train_test_split(X_train,
                                                                        y_train,
                                                                        random_state =42,
                                                                        test_size=0.20)

In [None]:
def svc_param_selection(X, y, nfolds):

#    Cs = [1.35, 1.40, 1.45]
#    gammas = [2.15, 2.20, 2.25, 2.30]    best params: {'C': 1.4, 'gamma': 2.25}
    
    Cs = [1.40]
    gammas = [2.25] 

    param_grid = {'C': Cs, 'gamma' : gammas}
    grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=nfolds, n_jobs=8)
    grid_search.fit(X, y)
    return grid_search

model = svc_param_selection(train_arrays,train_labels, 10)

In [None]:
model.best_params_

#### Accuracy and confusion matrix

In [None]:
pred = model.predict(test_arrays)

In [None]:
cm = confusion_matrix(test_labels,pred)
cm

In [None]:
accuracy = accuracy_score(test_labels,pred)
accuracy

### Support Vector Machine prediction

In [None]:
test_pred = model.predict(X_test)
submission['target'] = test_pred.round().astype(int)
#submission.to_csv('submission.csv', index=False)

### Using keywords for better prediction

Here I follow https://www.kaggle.com/bandits/using-keywords-for-prediction-improvement The idea is that some keywords signal disaster (or ordinary) tweets with very high probability (sometimes = 1). It is possible to add an extra 'keyword' feature to the model, but the simple approach also works: I correct the model's predictions to 'disaster' for the test tweets that carry one of the "disaster" keywords.
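
A toy sketch of this correction. The frames, ids and label counts are invented for illustration; only the thresholds (`count = 2`, `prob_disaster = 0.9`) are the ones used in this notebook:

```python
import pandas as pd

train_toy = pd.DataFrame({
    "keyword": ["wreckage"] * 4 + ["party"] * 4,
    "target":  [1, 1, 1, 1, 0, 1, 0, 0],
})
test_toy = pd.DataFrame({"id": [10, 11, 12], "keyword": ["wreckage", "party", None]})
pred_toy = pd.DataFrame({"id": [10, 11, 12], "target": [0, 0, 0]})

# Per-keyword count and empirical disaster probability in the train set.
stats = train_toy.groupby("keyword")["target"].agg(["size", "mean"])

# Keep keywords seen more than `count` times with probability >= `prob_disaster`.
count, prob_disaster = 2, 0.9
strong = stats[(stats["size"] > count) & (stats["mean"] >= prob_disaster)].index

# Override the model prediction for test rows that carry such a keyword.
ids = test_toy.loc[test_toy["keyword"].isin(strong), "id"]
pred_toy.loc[pred_toy["id"].isin(ids), "target"] = 1
print(pred_toy)
```

Only the row with the 'wreckage' keyword is switched to 1; rows with a weak or missing keyword keep the model's prediction.
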

In [None]:
train_df_copy = train
train_df_copy = train_df_copy.fillna('None')
ag = train_df_copy.groupby('keyword').agg({'text':np.size, 'target':np.mean}).rename(columns={'text':'Count', 'target':'Disaster Probability'})

ag.sort_values('Disaster Probability', ascending=False).head(20)

In [None]:
count = 2
prob_disaster = 0.9
keyword_list_disaster = list(ag[(ag['Count']>count) & (ag['Disaster Probability']>=prob_disaster)].index)
#we print the list of keywords which will be used for prediction correction 
keyword_list_disaster

In [None]:
ids_disaster = test['id'][test.keyword.isin(keyword_list_disaster)].values
submission.loc[submission['id'].isin(ids_disaster), 'target'] = 1

In [None]:
submission.to_csv("submission.csv", index=False)
submission.head(10)

### Discussion and some more examples

In the model I do not convert words to lower case. The examples below show that (most probably) there are mislabelled tweets which the cleaning/preprocessing in this model does not detect. Other procedures may reveal more (or fewer) such tweets.<br> 
In my opinion the problem exists: there are quite a lot of mislabelled samples. An implicit assumption of all recognition methods is that equal train samples cannot have different 'target' values. It is difficult to predict how the methods will behave if the assumption is broken; most probably it will show up as unstable behaviour on this kind of data.<br> 
Here I try to calculate the relabelling by a regular procedure. But for equivalence classes (with mislabelling) of just 2 elements (mean = 0.5) majority voting is very crude. For such classes manual relabelling may be a better solution.
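
As an illustration of how another procedure can reveal more such classes, the sketch below compares a case-sensitive and a case-insensitive grouping on invented records. The helper `mixed_classes` is hypothetical and is not used anywhere in this notebook:

```python
from collections import defaultdict

# Invented (text, target) records: the first two differ only in letter case.
records = [
    ("Storm hits the coast", 1),
    ("storm hits the coast", 0),
    ("Nice weather", 0),
]

def mixed_classes(records, key):
    # Group the targets by key(text) and return the keys that hold both labels.
    groups = defaultdict(set)
    for text, target in records:
        groups[key(text)].add(target)
    return [k for k, targets in groups.items() if len(targets) > 1]

case_sensitive = mixed_classes(records, key=lambda t: t)
case_insensitive = mixed_classes(records, key=str.lower)
print(case_sensitive, case_insensitive)
```

With exact string comparison no class with mixed labels is found; after lower-casing, one appears.
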

In [None]:
index_hot = [2700, 2695, 2713, 2698, 2692, 2686, 2685,2684]
train_example = train_copy.copy()
train_example.loc[index_hot,:]

In [None]:
index_hot = [6842,6821,6824,6828,6831,6843]
train_example = train_copy.copy()
train_example.loc[index_hot,:]

In [None]:
index_hot = [6113,6103,6097,6094,6091,6119,6123]
train_example = train_copy.copy()
train_example.loc[index_hot,:]

In [None]:
index_hot = [3670,3674,3688,3667,3696]
            
train_example = train_copy.copy()
train_example.loc[index_hot,:]


In [None]:
print("--- %s seconds ---" % (time.time() - start_time))