#### Twitter Sentiment Analysis

The objective of this task is to detect hate speech in tweets.

We are given a training sample of tweets and labels, where label '1' denotes a negative tweet (containing hate speech) and 
label '0' denotes a positive tweet (without hate speech).

Our objective is to predict the labels on the test dataset.

#### Data
Our overall collection of tweets was split in the ratio of 65:35 into training and testing data. Out of the testing data, 30% is public and the rest is private.

#### Evaluation Metric:
The metric used to evaluate the classification model is the F1 score.

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
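
As a quick illustration of the three formulas above, here is a minimal sketch that computes them from raw counts. The function name `f1_from_counts` and the example counts are made up for illustration and do not come from the competition data.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=40)
print(precision, recall, f1)
```

True negatives do not appear in any of the three formulas, so a large majority class cannot inflate the score.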

<b> Loading train and test data </b>

In [None]:
import os
os.chdir(r"C:\Users\pdrva\Desktop\Divya\Kaggle\Analytics Vidya Hackthon Data\Twitter Sentiment Analysis")
import pandas as pd
train=pd.read_csv("train_E6oV3lV.csv")
test=pd.read_csv("test_tweets_anuFYb8.csv")
submission=pd.read_csv("sample_submission_gfvA5FD.csv")

<b> Understanding data </b>

In [None]:
print("train data has {} rows and {} columns".format(train.shape[0],train.shape[1]))
print("test data has {} rows and {} columns".format(test.shape[0],test.shape[1]))

In [3]:
print("\ntrain datatypes-->\n")
print(train.dtypes)
print("\ntest datatypes-->\n")
print(test.dtypes)


train datatypes-->

id        int64
label     int64
tweet    object
dtype: object

test datatypes-->

id        int64
tweet    object
dtype: object


In [4]:
import numpy as np
np.unique(train['label'], return_counts=True)

(array([0, 1], dtype=int64), array([29720,  2242], dtype=int64))

Notice that the classes are imbalanced.
Only about 7 percent of the training records (2,242 of 31,962) contain hate speech, while the rest do not.
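
The sketch below turns the counts printed above into class shares. It also shows why accuracy alone would be misleading here: a model that always predicts label 0 would already score about 93 percent. The variable names are illustrative only.

```python
import numpy as np

# Class counts as printed by np.unique above: label 0, then label 1
counts = np.array([29720, 2242])
shares = counts / counts.sum()
print(shares.round(4))

# Accuracy of a model that always predicts the majority class (label 0)
majority_baseline_accuracy = shares.max()
print(round(float(majority_baseline_accuracy), 4))
```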

In [5]:
train.head(20)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


The data contains @user mentions and #hashtags. The preprocessing steps below remove the mentions and strip the '#' symbol from hashtags while keeping the hashtag text.
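
To make the two kinds of tags concrete, here is a small sketch on a single string, shortened from one of the training rows shown above. The regular expressions are one reasonable choice and the variable names are illustrative.

```python
import re

# Shortened from a training row shown above
sample = "@user @user thanks for #lyft credit"

mentions = re.findall(r"@\w+", sample)                  # handles such as "@user"
hashtags = re.findall(r"#(\w+)", sample)                # hashtag text without "#"
without_mentions = re.sub(r"@\w*", "", sample).strip()  # drop the handles

print(mentions)
print(hashtags)
print(without_mentions)
```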

#### Data cleaning

In [6]:
import re
def remove_handles(text,pattern):
    # Delete every match of the pattern (e.g. "@user") in a single pass
    return re.sub(pattern,'',text)

# Remove twitter handles
train['clean_tweet']=train.tweet.apply(lambda tweet: remove_handles(str(tweet),r"@[\w]*"))

In [7]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,camping tomorrow dannyâ¦
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,welcome here ! i'm it's so #gr8 !


In [8]:
def remove_specialchars(text,pattern):
    text=re.sub(pattern,' ',text)
    return text

# Replace special characters, numbers and punctuation with spaces, keeping only letters, "#" and "'"
train['clean_tweet']=train.clean_tweet.apply(lambda tweet: remove_specialchars(str(tweet),"[^a-zA-Z'#]"))

In [9]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so sel...
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for #lyft credit i can't use cause th...
2,3,0,bihday your majesty,bihday your majesty
3,4,0,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation,factsguide society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...,huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...,camping tomorrow danny
7,8,0,the next school year is the year for exams.ð...,the next school year is the year for exams ...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,we won love the land #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...,welcome here i'm it's so #gr


In [10]:
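# Replace shortcuts with actual words.
# Note: keys containing digits or symbols ('b4', '<3', 'gr8', 'h8') cannot match here,
# because the previous step already replaced those characters with spaces.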
shortcuts = {'u': 'you', 'y': 'why', 'r': 'are', 'doin': 'doing', 'hw': 'how', 'k': 'okay', 'm': 'am', 'b4': 'before',
            'idc': "i do not care", 'ty': 'thankyou', 'wlcm': 'welcome', 'bc': 'because', '<3': 'love', 'xoxo': 'love',
            'ttyl': 'talk to you later', 'gr8': 'great', 'bday': 'birthday', 'awsm': 'awesome', 'gud': 'good', 'h8': 'hate',
            'lv': 'love', 'dm': 'direct message', 'rt': 'retweet', 'wtf': 'hate', 'idgaf': 'hate',
             'irl': 'in real life', 'yolo': 'you only live once', 'ur': 'your'}

def replace_shortcuts(text):
    # text is a list of tokens; swap each known shortcut for its expansion
    return [shortcuts.get(token, token) for token in text]

train.clean_tweet=train.clean_tweet.apply(lambda tweet: replace_shortcuts(tweet.split()))
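
The order of the cleaning steps matters for the shortcut dictionary. The sketch below reuses the same cleaning pattern on a made-up sentence and shows that tokens such as `gr8` or `b4` no longer exist once digits and symbols have been replaced with spaces. The sample sentence and variable names are illustrative.

```python
import re

# Same cleaning pattern as above: anything except letters, "'" and "#" becomes a space
cleaned = re.sub(r"[^a-zA-Z'#]", " ", "it's so #gr8 c u b4 lunch <3")
tokens = cleaned.split()
print(tokens)

# Keys that contain a digit or symbol cannot appear among the cleaned tokens
digit_or_symbol_keys = ["b4", "gr8", "h8", "<3"]
unreachable = [key for key in digit_or_symbol_keys if key not in tokens]
print(unreachable)
```

One possible remedy, not applied in this notebook, would be to expand shortcuts before stripping digits and symbols.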

In [11]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet
0,1,0,@user when a father is dysfunctional and is s...,"[when, a, father, is, dysfunctional, and, is, ..."
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, for, #lyft, credit, i, can't, use, ca..."
2,3,0,bihday your majesty,"[bihday, your, majesty]"
3,4,0,#model i love u take with u all the time in ...,"[#model, i, love, you, take, with, you, all, t..."
4,5,0,factsguide: society now #motivation,"[factsguide, society, now, #motivation]"
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, and, big, talking, before, t..."
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]"
7,8,0,the next school year is the year for exams.ð...,"[the, next, school, year, is, the, year, for, ..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[we, won, love, the, land, #allin, #cavs, #cha..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, here, i'm, it's, so, #gr]"


In [12]:
def hashtag_extract(text):
    tweet=(" ").join(text)
    hashtags=re.findall(r"#(\w+)",tweet)
    return hashtags

# Extract hashtags to a new column
train['hashtags']=train.clean_tweet.apply(lambda tweet: hashtag_extract(tweet))

In [13]:
def remove_hash(text,pattern):
    for i,token in enumerate(text):
        word=re.sub(pattern,'',token)
        text[i]=word
    return text

# Remove # from tweet
train['clean_tweet']=train.clean_tweet.apply(lambda tweet: remove_hash(tweet,'#'))

In [14]:
train.head(20)

Unnamed: 0,id,label,tweet,clean_tweet,hashtags
0,1,0,@user when a father is dysfunctional and is s...,"[when, a, father, is, dysfunctional, and, is, ...",[run]
1,2,0,@user @user thanks for #lyft credit i can't us...,"[thanks, for, lyft, credit, i, can't, use, cau...","[lyft, disapointed, getthanked]"
2,3,0,bihday your majesty,"[bihday, your, majesty]",[]
3,4,0,#model i love u take with u all the time in ...,"[model, i, love, you, take, with, you, all, th...",[model]
4,5,0,factsguide: society now #motivation,"[factsguide, society, now, motivation]",[motivation]
5,6,0,[2/2] huge fan fare and big talking before the...,"[huge, fan, fare, and, big, talking, before, t...",[allshowandnogo]
6,7,0,@user camping tomorrow @user @user @user @use...,"[camping, tomorrow, danny]",[]
7,8,0,the next school year is the year for exams.ð...,"[the, next, school, year, is, the, year, for, ...","[school, exams, hate, imagine, actorslife, rev..."
8,9,0,we won!!! love the land!!! #allin #cavs #champ...,"[we, won, love, the, land, allin, cavs, champi...","[allin, cavs, champions, cleveland, clevelandc..."
9,10,0,@user @user welcome here ! i'm it's so #gr...,"[welcome, here, i'm, it's, so, gr]",[gr]


#### Converting unstructured text to structured numeric format

In [15]:
def tweet_max_length(tweets):
    max_length=0
    for tweet in tweets:
        if max_length < len(tweet):
            max_length = len(tweet)
    return max_length

max_length=tweet_max_length(train.clean_tweet)
print("tweet with max length has a length of:", max_length)

tweet with max length has a length of: 42
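
At this point `clean_tweet` holds lists of tokens, so the length reported above counts words rather than characters. A compact equivalent of the loop, shown here on made-up token lists, could look like this:

```python
# Toy token lists standing in for train.clean_tweet at this point in the notebook
token_lists = [
    ["bihday", "your", "majesty"],
    ["camping", "tomorrow", "danny"],
    ["we", "won"],
]

# Longest tweet measured in tokens; default=0 covers an empty collection
toy_max_length = max((len(tokens) for tokens in token_lists), default=0)
print(toy_max_length)
```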


In [16]:
def join_tokens(text):
    return (' '.join(text))

train.clean_tweet=train.clean_tweet.apply(lambda tweet: join_tokens(tweet))

In [17]:
train.head()

Unnamed: 0,id,label,tweet,clean_tweet,hashtags
0,1,0,@user when a father is dysfunctional and is s...,when a father is dysfunctional and is so selfi...,[run]
1,2,0,@user @user thanks for #lyft credit i can't us...,thanks for lyft credit i can't use cause they ...,"[lyft, disapointed, getthanked]"
2,3,0,bihday your majesty,bihday your majesty,[]
3,4,0,#model i love u take with u all the time in ...,model i love you take with you all the time in...,[model]
4,5,0,factsguide: society now #motivation,factsguide society now motivation,[motivation]


In [18]:
# Load pre-trained 100-dimensional GloVe embeddings from a local file

os.chdir(r"C:\Users\pdrva\Desktop\Divya\INSOFE\Day29\20200202_Batch_73_CSE_7321c_Lab05_RNN_Data\glove_100d")
embeddings_index = {}
# Each line holds a word followed by its 100 vector components
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.
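
The parsing loop above relies on the GloVe text layout: one word per line followed by its vector components, separated by spaces. The sketch below applies the same logic to two made-up lines with three components each, so it runs without the real file. The words and numbers are invented.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout (real vectors have 100 values per word)
fake_file = io.StringIO("the 0.1 -0.2 0.3\nhate -0.5 0.4 0.0\n")

toy_index = {}
for line in fake_file:
    word, *values = line.split()
    toy_index[word] = np.asarray(values, dtype="float32")

print(toy_index["hate"])
```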


In [19]:
MAX_NUM_WORDS = 40000 # Vocabulary size
MAX_SEQUENCE_LENGTH = 42 # Number of time steps (at each time step one word/word vector is given as input)
embedding_size = 100 

In [20]:
# vectorize the text samples into a 2D tensor

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(train.clean_tweet)
sequences = tokenizer.texts_to_sequences(train.clean_tweet)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels=np.asarray(train.label)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Using TensorFlow backend.


Found 38582 unique tokens.
Shape of data tensor: (31962, 42)
Shape of label tensor: (31962,)
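
With its default arguments, `pad_sequences` pads and truncates at the front of each sequence (`padding='pre'`, `truncating='pre'`). The NumPy sketch below mimics that default behaviour on two toy sequences. The helper `pad_pre` is a hypothetical stand-in, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep the last `maxlen` ids, mimicking the Keras defaults."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all zeros
        trimmed = seq[-maxlen:]
        out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(padded)
```

Index 0 is reserved for padding, which is why the embedding matrix built later has `len(word_index) + 1` rows.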


In [21]:
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
VALIDATION_SPLIT=0.2
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

print(x_train.shape)
print(y_train.shape)
print(x_val.shape)
print(y_val.shape)

(25570, 42)
(25570,)
(6392, 42)
(6392,)
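
The split above is a plain random shuffle without a fixed seed, so the share of hate-speech tweets in the validation set can vary from run to run. One alternative, which is not what this notebook does, is a seeded stratified split with scikit-learn. The toy arrays below stand in for `data` and `labels`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins for `data` and `labels`: 1000 rows with 7% positives
toy_data = rng.integers(1, 100, size=(1000, 42))
toy_labels = np.array([1] * 70 + [0] * 930)

# stratify keeps the class ratio the same in both parts; random_state makes it repeatable
x_tr, x_va, y_tr, y_va = train_test_split(
    toy_data, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)
print(y_tr.mean(), y_va.mean())
```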


In [22]:
num_words = min(MAX_NUM_WORDS, (len(word_index)+1))
embedding_matrix = np.zeros((num_words, embedding_size))
for word, i in word_index.items():
    if i >= num_words:
        continue
    embedding_vector = embeddings_index.get(word)
    # words not found in the embedding index keep their all-zero row
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

In [23]:
print(embedding_matrix.shape)

(38583, 100)
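
Words without a GloVe vector keep an all-zero row, so counting non-zero rows gives a rough idea of how much of the vocabulary is covered. The sketch below does this on a tiny made-up matrix. Applying the same lines to the real `embedding_matrix` is left as an assumption, since the notebook does not report coverage.

```python
import numpy as np

# Toy matrix: row 0 is the padding index, row 2 stands for a word missing from GloVe
toy_matrix = np.array([[0.0, 0.0], [0.3, -0.1], [0.0, 0.0], [0.7, 0.2]])

# Rows that are entirely zero received no pre-trained vector
has_vector = np.any(toy_matrix != 0, axis=1)
covered = int(has_vector[1:].sum())            # skip the padding row
coverage = covered / (toy_matrix.shape[0] - 1)
print(covered, coverage)
```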


In [24]:
from keras.layers import Embedding

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = True so the embeddings are fine-tuned during training
embedding_layer = Embedding(num_words,
                            embedding_size,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)

#### Using LSTM for Sentiment Analysis

In [33]:
print('Training model.')
nb_epochs =50

from keras.layers import Dense, Input, Dropout, LSTM, Bidirectional
from keras.models import Model
from keras.callbacks import EarlyStopping,ReduceLROnPlateau, ModelCheckpoint
from keras import regularizers

mc = ModelCheckpoint("weights.{epoch:02d}-{val_acc:.2f}.hdf5", monitor='val_loss',mode='min', save_best_only=True)

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
z = Dropout(0.2)(embedded_sequences)
z = Bidirectional(LSTM(16,kernel_regularizer=regularizers.l2(0.1)))(z)
z = Dropout(0.3)(z)
preds_lstm = Dense(1, activation='sigmoid')(z)


from keras.optimizers import Adam
adam = Adam(lr=0.001)
es = EarlyStopping(monitor='val_loss', patience=10, verbose=1)
# note: rlrop is defined here but is not passed to fit() below, so it has no effect
rlrop = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5)

model_lstm = Model(sequence_input, preds_lstm)
model_lstm.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['acc'])

model_lstm_hist = model_lstm.fit(x_train, y_train,
                  batch_size=128,
                  epochs=nb_epochs,callbacks=[mc,es],
                  validation_data=(x_val, y_val)).history

Training model.
Train on 25570 samples, validate on 6392 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 00017: early stopping


In [34]:
import glob
from keras.models import load_model

# Pick the most recently written checkpoint saved by ModelCheckpoint
path=os.path.join(os.getcwd(),"weights.*.hdf5")
list_of_files = glob.glob(path)
latest_file = max(list_of_files, key=os.path.getctime)

model = load_model(latest_file)

In [35]:
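# Note: predictions use model_lstm (weights from the last epoch), not the checkpoint loaded into `model` above
preds_train=model_lstm.predict(x_train)
preds_val=model_lstm.predict(x_val)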

In [36]:
# np.int was removed in recent NumPy releases; the built-in int is equivalent here
preds_train = (preds_train >= 0.28).astype(int)
preds_val = (preds_val >= 0.28).astype(int)
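
The notebook does not say how the 0.28 cut-off was chosen. One plausible procedure, sketched below on made-up labels and probabilities, is to sweep candidate thresholds on the validation set and keep the one with the highest F1 score. All values here are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy validation labels and predicted probabilities (made up for illustration)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
probs = np.array([0.03, 0.11, 0.22, 0.37, 0.31, 0.16, 0.62, 0.81, 0.47, 0.27])

# Score every candidate threshold and keep the best one
thresholds = np.arange(0.05, 0.80, 0.05)
scores = [f1_score(y_true, (probs >= t).astype(int)) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
print(round(float(best), 2), round(max(scores), 3))
```

The threshold should be tuned on validation data only, so that the test predictions stay an honest estimate.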

In [37]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_train,preds_train))
print(confusion_matrix(y_val,preds_val))

[[23693    63]
 [   12  1802]]
[[5821  143]
 [ 156  272]]


In [38]:
from sklearn.metrics import f1_score, accuracy_score
# scikit-learn expects (y_true, y_pred); both scores are unchanged by the swap
print('Accuracy:', accuracy_score(y_train, preds_train))
print("F1 Score: ", f1_score(y_train, preds_train))

Accuracy: 0.997066875244427
F1 Score:  0.9796140255504213


In [39]:
from sklearn.metrics import f1_score, accuracy_score
# scikit-learn expects (y_true, y_pred); both scores are unchanged by the swap
print('Accuracy:', accuracy_score(y_val, preds_val))
print("F1 Score: ", f1_score(y_val, preds_val))

Accuracy: 0.9532227784730913
F1 Score:  0.6453143534994069
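
Precision and recall can be read directly off the confusion matrices printed above, where rows are true labels and columns are predicted labels. The sketch below recomputes them for both splits. The helper name `precision_recall` is illustrative.

```python
import numpy as np

# Confusion matrices printed above (rows: true label, columns: predicted label)
cm_train = np.array([[23693, 63], [12, 1802]])
cm_val = np.array([[5821, 143], [156, 272]])

def precision_recall(cm):
    """Unpack a 2x2 confusion matrix and return (precision, recall) for label 1."""
    tn, fp, fn, tp = cm.ravel()
    return tp / (tp + fp), tp / (tp + fn)

p_train, r_train = precision_recall(cm_train)
p_val, r_val = precision_recall(cm_val)
print(round(p_train, 3), round(r_train, 3))
print(round(p_val, 3), round(r_val, 3))
```

Precision and recall are both far higher on the training split than on the validation split, which is consistent with the model overfitting the training data.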


#### Preparing test data

In [63]:
os.chdir(r"C:\Users\pdrva\Desktop\Divya\Kaggle\Analytics Vidya Hackthon Data\Twitter Sentiment Analysis")
test=pd.read_csv("test_tweets_anuFYb8.csv")
test.head()

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."


In [64]:
# Remove twitter handles
test['clean_tweet']=test.tweet.apply(lambda tweet: remove_handles(str(tweet),r"@[\w]*"))

In [65]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","3rd #bihday to my amazing, hilarious #nephew..."


In [66]:
# Replace special characters, numbers and punctuation with spaces, keeping only letters, "#" and "'"
test['clean_tweet']=test.clean_tweet.apply(lambda tweet: remove_specialchars(str(tweet),"[^a-zA-Z'#]"))

In [67]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...,#white #supremacists want everyone to see th...
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your #acne #altwaystohe...
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",rd #bihday to my amazing hilarious #nephew...


In [68]:
# Replace shortcuts with actual words
test.clean_tweet=test.clean_tweet.apply(lambda tweet: replace_shortcuts(tweet.split()))

In [69]:
test.head()

Unnamed: 0,id,tweet,clean_tweet
0,31963,#studiolife #aislife #requires #passion #dedic...,"[#studiolife, #aislife, #requires, #passion, #..."
1,31964,@user #white #supremacists want everyone to s...,"[#white, #supremacists, want, everyone, to, se..."
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, ways, to, heal, your, #acne, #altwaysto..."
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u..."
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, #bihday, to, my, amazing, hilarious, #nep..."


In [70]:
# Extract hashtags to a new column
test['hashtags']=test.clean_tweet.apply(lambda tweet: hashtag_extract(tweet))

In [71]:
# Remove # from tweet
test['clean_tweet']=test.clean_tweet.apply(lambda tweet: remove_hash(tweet,'#'))

In [72]:
test.head()

Unnamed: 0,id,tweet,clean_tweet,hashtags
0,31963,#studiolife #aislife #requires #passion #dedic...,"[studiolife, aislife, requires, passion, dedic...","[studiolife, aislife, requires, passion, dedic..."
1,31964,@user #white #supremacists want everyone to s...,"[white, supremacists, want, everyone, to, see,...","[white, supremacists, birds, movie]"
2,31965,safe ways to heal your #acne!! #altwaystohe...,"[safe, ways, to, heal, your, acne, altwaystohe...","[acne, altwaystoheal, healthy, healing]"
3,31966,is the hp and the cursed child book up for res...,"[is, the, hp, and, the, cursed, child, book, u...","[harrypotter, pottermore, favorite]"
4,31967,"3rd #bihday to my amazing, hilarious #nephew...","[rd, bihday, to, my, amazing, hilarious, nephe...","[bihday, nephew]"


In [73]:
test_max_length=tweet_max_length(test.clean_tweet)
print("tweet in test data with max length has a length of:", test_max_length)

tweet in test data with max length has a length of: 31


In [74]:
test.clean_tweet=test.clean_tweet.apply(lambda tweet: join_tokens(tweet))

In [75]:
test.head()

Unnamed: 0,id,tweet,clean_tweet,hashtags
0,31963,#studiolife #aislife #requires #passion #dedic...,studiolife aislife requires passion dedication...,"[studiolife, aislife, requires, passion, dedic..."
1,31964,@user #white #supremacists want everyone to s...,white supremacists want everyone to see the ne...,"[white, supremacists, birds, movie]"
2,31965,safe ways to heal your #acne!! #altwaystohe...,safe ways to heal your acne altwaystoheal heal...,"[acne, altwaystoheal, healthy, healing]"
3,31966,is the hp and the cursed child book up for res...,is the hp and the cursed child book up for res...,"[harrypotter, pottermore, favorite]"
4,31967,"3rd #bihday to my amazing, hilarious #nephew...",rd bihday to my amazing hilarious nephew eli a...,"[bihday, nephew]"


In [76]:
# vectorize the text samples into a 2D tensor
test_sequences = tokenizer.texts_to_sequences(test.clean_tweet)

test_data = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', test_data.shape)

Shape of data tensor: (17197, 42)
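
The tokenizer was fitted on the training tweets only. Because it was created without an `oov_token`, words that appear only in the test tweets are silently dropped by `texts_to_sequences`. The pure-Python sketch below imitates that behaviour with a made-up vocabulary. `to_sequence` is a hypothetical helper, not a Keras function.

```python
# Toy vocabulary standing in for tokenizer.word_index (fitted on training tweets only)
toy_word_index = {"love": 1, "the": 2, "land": 3}

def to_sequence(text, word_index):
    """Map known words to ids and skip unknown ones."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

# "cursed" and "child" are not in the toy vocabulary, so they disappear
seq = to_sequence("love the cursed child", toy_word_index)
print(seq)
```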


#### Making predictions on test data with best model

In [77]:
submission.head()

Unnamed: 0,id,label
0,31963,0
1,31964,0
2,31965,0
3,31966,0
4,31967,0


In [78]:
preds_test=model_lstm.predict(test_data)
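# flatten the (n, 1) prediction array to one label per row; np.int was removed in recent NumPy releases
submission['label'] = (preds_test >= 0.28).astype(int).ravel()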

In [79]:
submission.label.value_counts()

0    15923
1     1274
Name: label, dtype: int64
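
A quick sanity check is to compare the share of predicted positives with the share of positives in the training data. The sketch below uses the counts printed above. Treating a similar share as reassuring is an informal heuristic, not a guarantee of quality.

```python
# Counts printed above for the submission, and training counts from earlier in the notebook
predicted_positive = 1274
predicted_total = 15923 + 1274
train_positive_share = 2242 / (29720 + 2242)

predicted_share = predicted_positive / predicted_total
print(round(predicted_share, 4), round(train_positive_share, 4))
```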

In [80]:
os.chdir(r'C:\Users\pdrva\Desktop\Divya\Kaggle\Analytics Vidya Hackthon Data\Twitter Sentiment Analysis')
submission.to_csv('Submission_lstm.csv', index= False)