<font size="+3" color="blue"><b>1. Objective</b></font><a id="1"></a>

* **Competition**     : [Tweet Sentiment Extraction](https://www.kaggle.com/c/tweet-sentiment-extraction)
* **Predict**    : The support phrase, i.e. the part of each tweet that reflects its labelled sentiment
* **Evaluation** : [Jaccard Score](https://towardsdatascience.com/overview-of-text-similarity-metrics-3397c4601f50). It is explained with an example right after this list.
* **Entry deadline on Kaggle** : June 9, 2020. Don't be late to join.
* **Stages of this kernel** : Data >> Features(EDA) >> Comparison(EDA) >> Model

**Jaccard Score** measures how closely the predicted words match the actual words: the number of unique words the two strings share, divided by the number of unique words across both strings.

In [1]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

# First pair of sentences
Actual_1 = 'My life is totally awesome'
Predict_1 = 'awesome life'

# Second pair of sentences
Actual_2 = 'We are active kagglers'
Predict_2 = 'We are active kagglers'

print("Jaccard score for first pair of sentences: {}".format(jaccard(Actual_1,Predict_1)))
print("Jaccard score for second pair of sentences: {}".format(jaccard(Actual_2,Predict_2)))

Jaccard score for first pair of sentences: 0.4
Jaccard score for second pair of sentences: 1.0
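
For reference, the function above is the word-level Jaccard index. Writing $A$ and $B$ for the sets of lowercased, whitespace-separated words in the actual and predicted strings, the first example works out as follows (5 unique words in the actual text, 2 in the prediction, 2 shared):

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|},
\qquad
J(A_1, B_1) = \frac{2}{5 + 2 - 2} = 0.4
$$

Because the words are compared as sets, word order and repeated words do not affect the score.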


<font size="+2" color="blue"><b>2. Data</b></font><a id="2"></a>

The data is collected from Twitter and has four columns:

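| Column        |      Description                                              |
|---------------|:-------------------------------------------------------------:|
| textID        |  Unique ID for each tweet                                     |
| text          |  Full text of the tweet                                       |
| selected_text |  Phrase from the tweet that supports its sentiment (train only) |
| sentiment     |  Sentiment label of the tweet (positive, negative or neutral) |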

In [2]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
import string
import matplotlib.pyplot as plt
import matplotlib_venn as venn
import seaborn as sns


from tqdm import tqdm
import spacy
import random
from spacy.util import compounding
from spacy.util import minibatch
from collections import defaultdict
from collections import  Counter


# sklearn 
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

#nltk
from nltk.corpus import stopwords
from nltk.util import ngrams
stop=set(stopwords.words('english'))
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from PIL import Image

#Avoid warning messages
import warnings
warnings.filterwarnings("ignore")

#plotly libraries
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
from plotly.subplots import make_subplots
import cufflinks
cufflinks.go_offline()
cufflinks.set_config_file(world_readable=True, theme='pearl')


import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import *
import tokenizers

from datetime import datetime as dt

In [3]:
train=pd.read_csv('../input/tweet-sentiment-extraction/train.csv')
test=pd.read_csv('../input/tweet-sentiment-extraction/test.csv')
train.sample(6)

Unnamed: 0,textID,text,selected_text,sentiment
4580,7892df6e65,yupp t`s better than people being rude to her x,yupp t`s better than people being rude to her,neutral
26673,80b87139ea,": Your mail server just rejected a simple, pla...",Not Good,negative
23034,cea5677a2b,thnx babe just call me when u finish it....,thnx,positive
18517,ab0b88c596,LOL #yourock,#yourock,positive
25376,295fee573c,"feeling better, still coughing. : / not moving...","feeling better, still coughing. : / not moving...",neutral
14471,96c035584a,Starting some work on final year project. Just...,Starting some work on final year project. Just...,neutral


In [4]:
print("There are {} rows and {} columns in train file".format(train.shape[0],train.shape[1]))
print("There are {} rows and {} columns in test file".format(test.shape[0],test.shape[1]))

There are 27481 rows and 4 columns in train file
There are 3534 rows and 3 columns in test file


In [5]:
print("The test set is {}% of the size of the train set".format(round(test.shape[0]/train.shape[0]*100,2)))

The test set is 12.86% of the size of the train set


In [6]:
# Function for missing value
def miss_val(df):
    total=df.isnull().sum()
    return pd.concat([total],axis=1,keys=['Total'])
print("Missing values for train dataset \n")
print(miss_val(train))
print("---------------------------------------------------------------------")
print("Missing values for test dataset \n")
print(miss_val(test))

Missing values for train dataset 

               Total
textID             0
text               1
selected_text      1
sentiment          0
---------------------------------------------------------------------
Missing values for test dataset 

           Total
textID         0
text           0
sentiment      0


Only one row in the train data has NULL values (in both `text` and `selected_text`), so we remove it.

In [7]:
train=train.dropna()
train.shape

(27480, 4)

<font size="+1" color="chocolate"><b>EDA on Selected text</b></font> <br>

We will do some basic text preprocessing and EDA on our target field, **Selected Text**, to understand how it is distributed in the train data.

* Find URLs
* Punctuation
* Length of tweets
* Average tweet length
* Most common words

#### Why consider URLs?

URLs carry little sentiment on their own, so they are more likely to stay on the neutral side. Let's check how they are spread across the selected text.


In [8]:
# Convert to lower
train['target']=train['selected_text'].str.lower()

In [9]:
# Find URL
def find_link(string): 
    # Raw string so that \( and \) are not treated as invalid string escapes
    url = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', string)
    return "".join(url) 
train['target_url']=train['target'].apply(lambda x: find_link(x))
df=pd.DataFrame(train.loc[train['target_url']!=""]['sentiment'].value_counts()).reset_index()
df.rename(columns={"index": "sentiment", "sentiment": "url_count"})

Unnamed: 0,sentiment,url_count
0,neutral,345
1,positive,3
2,negative,3


#### Can punctuation/symbols play a part in modelling?

Since we are analysing sentiment in tweets, it matters that people often describe their emotions with symbols. For example, a run of consecutive stars **( * )** tends to signal extreme emotion (happy, angry, delighted etc.). Other symbols such as **(# - hashtag)** or **(@ - mention)** are also used very often in tweets.

Let's analyse all of them, along with other punctuation.

In [10]:
# Function to find punctuation
def find_punct(text):
    line = re.findall(r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]*', text)
    string="".join(line)
    return list(string)

In [11]:
# New Features with punctuation and punctuation length
train['target_punct']=train['target'].apply(lambda x:find_punct(x))
train['target_punct_len']=train['target'].apply(lambda x:len(find_punct(x)))

In [12]:
def find_star(text):
    # Count runs of 2 to 5 consecutive stars in the text
    line=re.findall(r'[*]{2,5}',text)
    return len(line)

In [13]:
train['star']=train['target'].apply(lambda x:find_star(x))
train.loc[train['star']!=0]['sentiment'].value_counts().to_frame()

Unnamed: 0,sentiment
negative,317
neutral,248
positive,50


Negative has the highest count, but neutral tweets also account for a large share (248), so star runs alone do not separate the sentiments. Let us look at the selected texts that are a single token containing ( * ).

In [14]:
def find_only_star(text):
    if len(text.split())==1:
        line=re.findall(r'[*]{2,5}',text)
        return len(line)
    else:
        return 0

In [15]:
# Get rows whose selected text is a single token containing a run of *
train['only_star']=train['target'].apply(lambda x:find_only_star(x))
train.loc[train['only_star']==1]['sentiment'].value_counts()

negative    96
Name: sentiment, dtype: int64
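
To make the two star checks concrete, here is a minimal standalone sketch that applies the same regular expression to a few made-up strings. The helper names `count_star_runs` and `is_only_star` and the sample texts are illustrative assumptions, not part of the kernel.

```python
import re

def count_star_runs(text):
    # Count runs of 2 to 5 consecutive asterisks (same pattern as find_star)
    return len(re.findall(r'[*]{2,5}', text))

def is_only_star(text):
    # True when the text is a single whitespace-separated token
    # that contains exactly one such run (the only_star == 1 case)
    return len(text.split()) == 1 and count_star_runs(text) == 1

samples = ["****", "f***", "what the **** is this", "i love this *smile*", "**********"]
for s in samples:
    print(repr(s), count_star_runs(s), is_only_star(s))
```

Two details are worth noticing. A token such as `f***` also counts as "only star", because the check is on the number of tokens rather than on the token being made purely of stars. A run longer than five stars is split into several matches, so a ten-star token gives a count of 2 and is not picked up by the `only_star == 1` filter.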

In [16]:
train['target']= np.where(train['only_star']==1,"abusive",train['target'])

#### Remove URLs & Punctuation

In [17]:
def remove_link(string): 
    # Raw string so that \( and \) are not treated as invalid string escapes
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'," ",string)
    return " ".join(text.split())

In [18]:
def remove_punct(text):
    line = re.sub(r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]+'," ",text)
    return " ".join(line.split())

In [19]:
train['target']=train['target'].apply(lambda x:remove_link(x))
train['target']=train['target'].apply(lambda x:remove_punct(x))
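
As a quick sanity check of the two cleaning steps, the sketch below chains the same two patterns in one hypothetical helper, `clean`, and runs it on a few short strings modelled on tweets shown in this notebook. It is only an illustration of what the cleaning does, not a replacement for the cells above.

```python
import re

# Same patterns as remove_link and remove_punct, written as raw strings
URL_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
PUNCT_PATTERN = r'[!"\$%&\'()*+,\-.\/:;=#@?\[\\\]^_`{|}~]+'

def clean(text):
    # Replace URLs, then punctuation runs, with a space
    text = re.sub(URL_PATTERN, " ", text)
    text = re.sub(PUNCT_PATTERN, " ", text)
    # Collapse the leftover whitespace
    return " ".join(text.split())

for raw in ["http://twitpic.com/4w75p - i like it!!", "lol #yourock", "happy bday!", "!!!"]:
    print(repr(raw), "->", repr(clean(raw)))
```

A selected text made only of punctuation ends up as an empty string, which is one way a cleaned target can have zero words.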

In [20]:
train['target_tweet_length']=train['target'].str.split().map(lambda x: len(x))
train['target_tweet_length'].describe().to_frame()

Unnamed: 0,target_tweet_length
count,27480.0
mean,7.282205
std,7.096309
min,0.0
25%,1.0
50%,5.0
75%,11.0
max,35.0


<font size="+2" color="indigo"><b> Basic Setup</b></font><br><a id="6.1"></a>

In [21]:
# Every input row is padded to 96 tokens, since no tweet needs more than that
MAX_LEN = 96

# Pretrained model of roberta
PATH = '../input/tf-roberta/'
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab_file=PATH+'vocab-roberta-base.json', 
    merges_file=PATH+'merges-roberta-base.txt', 
    lowercase=True,
    add_prefix_space=True
)

# Token IDs that the tokenizer assigns to each sentiment word
sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}

In [22]:
train = pd.read_csv('../input/tweet-sentiment-extraction/train.csv').fillna('')
ct=train.shape[0] #27481

# Initialising training inputs
input_ids=np.ones((ct,MAX_LEN),dtype="int32")          # Array with value 1 of shape(27481,96)
attention_mask=np.zeros((ct,MAX_LEN),dtype="int32")    # Array with value 0 of shape(27481,96)
token_type_ids=np.zeros((ct,MAX_LEN),dtype="int32")    # Array with value 0 of shape(27481,96)
start_tokens=np.zeros((ct,MAX_LEN),dtype="int32")      # Array with value 0 of shape(27481,96)
end_tokens=np.zeros((ct,MAX_LEN),dtype="int32")        # Array with value 0 of shape(27481,96)

In the code below, please read the comments placed between the lines to follow how each variable changes step by step. I have used one sample row from the train data for the explanation.

> text1 = my boss is bullying me <br>
> text2 = bullying me

In [23]:
for k in range(train.shape[0]):
#1 FIND OVERLAP
    text1 = " "+" ".join(train.loc[k,'text'].split())
    text2 = " ".join(train.loc[k,'selected_text'].split())
    
    # idx - character position where the selected text starts inside the tweet
    idx = text1.find(text2)   # we get 12 for the sample row
    
    # mark every character as 0, then set 1 for the characters of the selected text
    chars = np.zeros((len(text1))) 
    chars[idx:idx+len(text2)]=1    # [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.] 
    
    # also mark the space just before the selected text, because the tokenizer attaches it to the following token
    if text1[idx-1]==' ': chars[idx-1] = 1    
    # token ids of the whole tweet
    enc = tokenizer.encode(text1)  #  [127, 3504, 16, 11902, 162]
        
#2. ID_OFFSETS - start and end character index of each token
    offsets = []
    idx=0
    for t in enc.ids:
        w = tokenizer.decode([t])
        offsets.append((idx,idx+len(w)))     #  [(0, 3), (3, 8), (8, 11), (11, 20), (20, 23)]
        idx += len(w) 
    
#3  START-END TOKENS
    toks = []
    for i,(a,b) in enumerate(offsets):
        sm = np.sum(chars[a:b]) # number of selected characters inside each token - [0.0,0.0,0.0,9.0,3.0] - bullying me
        if sm>0: 
            toks.append(i)  # positions of the tokens that overlap the selected text - [3, 4]
        
    s_tok = sentiment_id[train.loc[k,'sentiment']] # token ID of the sentiment word
    
    # Formatting input for the RoBERTa model: <s> tweet tokens </s></s> sentiment token </s>
    input_ids[k,:len(enc.ids)+5] = [0] + enc.ids + [2,2] + [s_tok] + [2]   #[ 0   127  3504    16 11902   162     2     2  2430     2]
    attention_mask[k,:len(enc.ids)+5] = 1                                  # [1 1 1 1 1 1 1 1 1 1]
    
    if len(toks)>0:
        # mark the first and last selected tokens; +1 accounts for the leading <s> token in input_ids
        start_tokens[k,toks[0]+1] = 1
        end_tokens[k,toks[-1]+1] = 1 
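
The labelling logic above can be traced on the sample row without loading the tokenizer. The sketch below is a simplified stand-in: the list `pieces` is a hypothetical split of the text into token strings, chosen to match the offsets quoted in the comments, and is not produced by the real tokenizer.

```python
import numpy as np

text1 = " my boss is bullying me"
text2 = "bullying me"
# Hypothetical token strings standing in for tokenizer.decode([t]) of each token
pieces = [" my", " boss", " is", " bullying", " me"]

# Character mask: 1 for every character of the selected text
idx = text1.find(text2)
chars = np.zeros(len(text1))
chars[idx:idx + len(text2)] = 1
if text1[idx - 1] == " ":
    chars[idx - 1] = 1  # include the space that belongs to the first selected token

# Character offsets (start, end) of each token
offsets = []
pos = 0
for p in pieces:
    offsets.append((pos, pos + len(p)))
    pos += len(p)

# Tokens that contain at least one selected character
toks = [i for i, (a, b) in enumerate(offsets) if chars[a:b].sum() > 0]

# Positions of the 1s in start_tokens / end_tokens (+1 for the leading <s>)
start_pos, end_pos = toks[0] + 1, toks[-1] + 1
print(idx, offsets, toks, start_pos, end_pos)
```

For this row the selected tokens are 3 and 4, so the start label goes to position 4 and the end label to position 5 of the 96-wide target rows.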


In [24]:
test = pd.read_csv('../input/tweet-sentiment-extraction/test.csv').fillna('')

ct_test = test.shape[0]

# Initialize inputs
input_ids_t = np.ones((ct_test,MAX_LEN),dtype='int32')        # array with value 1 for shape (3534, 96)
attention_mask_t = np.zeros((ct_test,MAX_LEN),dtype='int32')  # array with value 0 for shape (3534, 96)
token_type_ids_t = np.zeros((ct_test,MAX_LEN),dtype='int32')  # array with value 0 for shape (3534, 96)

# Fill input ids and attention masks for the test data
for k in range(test.shape[0]):
        
#1. INPUT_IDS
    text1 = " "+" ".join(test.loc[k,'text'].split())
    enc = tokenizer.encode(text1)                
     
    # Encoded value of tokenizer
    s_tok = sentiment_id[test.loc[k,'sentiment']]
    
    #setting up of input ids - same as we did for train
    input_ids_t[k,:len(enc.ids)+5] = [0] + enc.ids + [2,2] + [s_tok] + [2]
    attention_mask_t[k,:len(enc.ids)+5] = 1
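
To show what one formatted row looks like, here is a small sketch that builds a single padded input from the token ids quoted in the training loop comments. The ids for the tweet and the `negative` sentiment are taken from those comments; the variable names `row` and `mask` are illustrative.

```python
import numpy as np

MAX_LEN = 96
enc_ids = [127, 3504, 16, 11902, 162]   # " my boss is bullying me", as quoted above
s_tok = 2430                            # token id used for 'negative' in sentiment_id

# RoBERTa special ids: 0 = <s>, 1 = <pad>, 2 = </s>
row = np.ones(MAX_LEN, dtype="int32")    # start fully padded
mask = np.zeros(MAX_LEN, dtype="int32")  # 0 = ignore this position

n = len(enc_ids) + 5                     # tweet tokens plus 5 special/sentiment tokens
row[:n] = [0] + enc_ids + [2, 2] + [s_tok] + [2]
mask[:n] = 1                             # attend only to the real tokens

print(row[:12])
print(mask[:12])
```

Everything after the first `n` positions stays at the pad id 1 with attention 0, which is why the arrays are initialised with `np.ones` and `np.zeros` respectively.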


In [25]:

def scheduler(epoch):
    return 3e-5 * 0.2**epoch
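
The schedule above multiplies the learning rate by 0.2 after every epoch. A quick way to see the resulting values is to evaluate it for the first few epochs, as in this small sketch:

```python
def scheduler(epoch):
    # Start at 3e-5 and shrink by a factor of 5 each epoch
    return 3e-5 * 0.2 ** epoch

rates = [scheduler(e) for e in range(3)]
for e, lr in enumerate(rates):
    print("epoch {}: lr = {:.2e}".format(e, lr))
```

Note that the `fit` call in this kernel does not pass any callbacks, so as written the function is defined but not applied. One way to use it, stated here as a suggestion rather than what the kernel does, would be to pass `tf.keras.callbacks.LearningRateScheduler(scheduler)` in the `callbacks` argument of `model.fit`.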

In [26]:

def build_model():
    
    # Initialize keras layers
    ids = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    att = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)
    tok = tf.keras.layers.Input((MAX_LEN,), dtype=tf.int32)

    # Fetching pretrained models 
    config = RobertaConfig.from_pretrained(PATH+'config-roberta-base.json')
    bert_model = TFRobertaModel.from_pretrained(PATH+'pretrained-roberta-base.h5',config=config)
    x = bert_model(ids,attention_mask=att,token_type_ids=tok)
    
    # Setting up layers
    x1 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x1 = tf.keras.layers.Conv1D(128, 2,padding='same')(x1)
    x1 = tf.keras.layers.LeakyReLU()(x1)
    x1 = tf.keras.layers.Conv1D(64, 2,padding='same')(x1)
    x1 = tf.keras.layers.Dense(1)(x1)
    x1 = tf.keras.layers.Flatten()(x1)
    x1 = tf.keras.layers.Activation('softmax')(x1)
    
    x2 = tf.keras.layers.Dropout(0.1)(x[0]) 
    x2 = tf.keras.layers.Conv1D(128, 2, padding='same')(x2)
    x2 = tf.keras.layers.LeakyReLU()(x2)
    x2 = tf.keras.layers.Conv1D(64, 2, padding='same')(x2)
    x2 = tf.keras.layers.Dense(1)(x2)
    x2 = tf.keras.layers.Flatten()(x2)
    x2 = tf.keras.layers.Activation('softmax')(x2)

    # Build the model from its inputs and outputs. It will be trained in the next cell
    model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x1,x2])
    
    # Adam optimizer. If you are unaware of it - https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    model.compile(loss='binary_crossentropy', optimizer=optimizer)

    return model
    

In [27]:
model1 = build_model()
model1.fit([input_ids,attention_mask,token_type_ids], [start_tokens, end_tokens], epochs = 2)

Train on 27481 samples
Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f868bc9ccf8>

<font size="+2" color="indigo"><b>Run Model</b></font><br><a id="6.5"></a>

In [28]:
start_time=dt.now()

n_splits=5 # Number of prediction rounds to average

# Initialize the averaged start and end token predictions
preds_start = np.zeros((input_ids_t.shape[0],MAX_LEN))
preds_end = np.zeros((input_ids_t.shape[0],MAX_LEN))

DISPLAY=1
for i in range(n_splits):
    print('#'*40)
    print('### MODEL %i'%(i+1))
    print('#'*40)
    
    K.clear_session()
    # NOTE: the per-fold weights below are not loaded, so every round predicts with model1
    #model = build_model()
    # Pretrained model
    #model.load_weights('../input/model4/v4-roberta-%i.h5'%i)

    print('Predicting Test...')
    preds = model1.predict([input_ids_t,attention_mask_t,token_type_ids_t],verbose=DISPLAY)
    preds_start += preds[0]/n_splits
    preds_end += preds[1]/n_splits
    
end_time=dt.now()
print("   ")
print("   ")
print("Time Taken to run above code :",(end_time-start_time).total_seconds()/60," minutes")

########################################
### MODEL 1
########################################
Predicting Test...
########################################
### MODEL 2
########################################
Predicting Test...
########################################
### MODEL 3
########################################
Predicting Test...
########################################
### MODEL 4
########################################
Predicting Test...
########################################
### MODEL 5
########################################
Predicting Test...
   
   
Time Taken to run above code : 1.3838464833333333  minutes
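
The loop above accumulates each round's predictions divided by `n_splits`, which is a plain average. The sketch below illustrates that averaging with NumPy only; `fake_predict` is a hypothetical stand-in for `model.predict` that returns random rows summing to 1, and the shapes are made small for readability.

```python
import numpy as np

n_splits, n_rows, max_len = 5, 3, 8

def fake_predict(seed):
    # Hypothetical stand-in for model.predict: one probability row per tweet
    r = np.random.default_rng(seed).random((n_rows, max_len))
    return r / r.sum(axis=1, keepdims=True)

# Case 1: a different model per fold, so the average blends five predictions
avg = np.zeros((n_rows, max_len))
for fold in range(n_splits):
    avg += fake_predict(fold) / n_splits

# Case 2: the same model every round, so the average equals one prediction
single = fake_predict(0)
same = np.zeros((n_rows, max_len))
for _ in range(n_splits):
    same += single / n_splits

print(np.allclose(avg.sum(axis=1), 1.0), np.allclose(same, single))
```

Averaging only changes the result when the rounds use different weights. With a single trained model reused in every round, the averaged output matches one `predict` call.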


<font size="+2" color="indigo"><b> Submission</b></font><br><a id="6.6"></a>

In [29]:
all = []

for k in range(input_ids_t.shape[0]):
    # argmax gives the most likely start (a) and end (b) positions
    a = np.argmax(preds_start[k,])
    b = np.argmax(preds_end[k,])
    if a>b: 
        # invalid span (start after end): fall back to the whole tweet
        st = test.loc[k,'text']
    else:
        text1 = " "+" ".join(test.loc[k,'text'].split())
        enc = tokenizer.encode(text1)
        st = tokenizer.decode(enc.ids[a-1:b])
    all.append(st)
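
The decoding step can be tried in isolation. In the sketch below, `decode_span` is a hypothetical helper that mirrors the logic of the loop, and `pieces` is a made-up list of token strings standing in for what the tokenizer would decode; the probability vectors are hand-written.

```python
import numpy as np

MAX_LEN = 96
full_text = "my boss is bullying me"
# Hypothetical token strings for the tweet (not produced by the real tokenizer)
pieces = [" my", " boss", " is", " bullying", " me"]

def decode_span(start_probs, end_probs, pieces, full_text):
    a = int(np.argmax(start_probs))   # most likely start position
    b = int(np.argmax(end_probs))     # most likely end position
    if a > b:
        return full_text              # invalid span: keep the whole tweet
    # Positions are shifted by 1 because of the leading <s> token,
    # so position a maps to token a-1 and the end position b is inclusive
    return "".join(pieces[a - 1:b])

start = np.zeros(MAX_LEN); start[4] = 0.9
end = np.zeros(MAX_LEN); end[5] = 0.8
print(repr(decode_span(start, end, pieces, full_text)))

# Reversed peaks give an invalid span, so the full text is returned
print(repr(decode_span(end, start, pieces, full_text)))
```

One edge case to keep in mind: if the start argmax lands on position 0 (the `<s>` token), the slice begins at index -1, which typically yields an empty or truncated string rather than an error.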

In [30]:
test.head()

Unnamed: 0,textID,text,sentiment
0,f87dea47db,Last session of the day http://twitpic.com/67ezh,neutral
1,96d74cb729,Shanghai is also really exciting (precisely -...,positive
2,eee518ae67,"Recession hit Veronique Branquinho, she has to...",negative
3,01082688c6,happy bday!,positive
4,33987a8ee5,http://twitpic.com/4w75p - I like it!!,positive


In [31]:
test['selected_text'] = all
submission=test[['textID','selected_text']]
submission.to_csv('submission.csv',index=False)
submission.head(5)

Unnamed: 0,textID,selected_text
0,f87dea47db,last session of the day
1,96d74cb729,exciting
2,eee518ae67,shame!
3,01082688c6,happy
4,33987a8ee5,i like
