In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


train_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/train.csv')
test_df = pd.read_csv('/kaggle/input/tweet-sentiment-extraction/test.csv')
print(train_df.columns,test_df.columns)

## Dataset
The dataset used is from a Kaggle competition: https://www.kaggle.com/c/tweet-sentiment-extraction. I am going to perform only the sentiment analysis part, using a very simple neural network. Since the data is noisy free-form text, it needs a few preprocessing steps before it can be fed into the network. Let's take a look at a few of the text samples to get ideas about how to proceed from here.

In [None]:
train_text = list(train_df.text)
train_sentiment = list(train_df.sentiment)
test_text = list(test_df.text)
test_sentiment = list(test_df.sentiment)


## Data Preprocessing

[' as much as i love to be hopeful, i reckon the chances are minimal =P i`m never gonna get my cake and stuff',
 'I really really like the song Love Story by Taylor Swift',
 'My Sharpie is running DANGERously low on ink',
 'i want to go to music tonight but i lost my voice.',
 'test test from the LG enV2',
 'Uh oh, I am sunburned',
 ' S`ok, trying to plot alternatives as we speak *sigh*',
 'i`ve been sick for the past few days  and thus, my hair looks wierd.  if i didnt have a hat on it would look... http://tinyurl.com/mnf4kw',...]
 
As we can see from the list above, some tweets have URLs embedded in the text. They can be removed using regular expressions as follows.

In [None]:
import re
test_curated_text = []
train_curated_text = []
for text in train_text:
    train_curated_text.append( re.sub(r"http\S+", "", str(text)))
for text in test_text:
    test_curated_text.append(re.sub(r'http\S+',"",str(text)))
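
As a quick check of the pattern used above, this small sketch applies it to one of the sample tweets listed earlier. The helper name `strip_urls` is my own and not part of the notebook. Note that `http\S+` removes the URL but leaves the whitespace before it, so the result below still ends with a space.

```python
import re

def strip_urls(text):
    # Remove everything from "http" up to the next whitespace character.
    return re.sub(r"http\S+", "", str(text))

sample = "if i didnt have a hat on it would look... http://tinyurl.com/mnf4kw"
cleaned = strip_urls(sample)
print(repr(cleaned))
```

The tokenizer used later splits on whitespace, so the leftover space should not affect the encoding.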


Now that we have cleaned the text, we can proceed to the second step: tokenizing the words.
## Tokenizing
Tokenizing can be done manually, but there is an easier approach using TensorFlow's Tokenizer. The Tokenizer assigns a unique number to each word in the list of samples fed to it. In effect, it builds a lookup table of the unique words and their corresponding numbers.
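
To illustrate the idea of such a lookup table, here is a simplified pure-Python sketch. It is not the Keras implementation (the real Tokenizer also strips punctuation, for example), and the function name `build_word_index` and the toy sentences are assumptions of mine. It only shows the principle: more frequent words get smaller numbers, and numbering starts at 1.

```python
from collections import Counter

def build_word_index(texts):
    # Count how often each lowercased, whitespace-separated word occurs.
    counts = Counter(word for text in texts for word in text.lower().split())
    # Rank words by frequency; ties keep their first-seen order (stable sort).
    ranked = sorted(counts, key=lambda w: -counts[w])
    # The most frequent word gets index 1; index 0 is left free for padding.
    return {word: i for i, word in enumerate(ranked, start=1)}

toy_texts = ["i love this song", "i love my cake", "i lost my voice"]
toy_index = build_word_index(toy_texts)
print(toy_index)
```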

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
# Build the vocabulary from the training tweets only.
tokenizer.fit_on_texts(train_curated_text)
# Dictionary mapping each word to its integer index.
word_index = tokenizer.word_index

## Text To Sequence

{'i': 1, 'to': 2, 'the': 3, 'a': 4, 'my': 5, 'it': 6, 'you': 7, 'and': 8, 'is': 9, 'in': 10, 'for': 11, 's': 12, 'of': 13, 't': 14, 'that': 15, 'on': 16, 'me': 17, 'so': 18, 'have': 19, 'but': 20, 'm': 21, 'just': 22, 'day': 23, 'with': 24, 'be': 25, 'at': 26, 'not': 27, 'was': 28, 'all': 29, 'now': 30, 'can': 31, 'good': 32, 'this': 33, 'out': 34, 'up': 35, 'get': 36, 'no': 37, 'are': 38, 'like': 39, 'go': 40, 'your': 41, 'do': 42, 'work': 43, 'today': 44, 'love': 45, 'too': 46, 'going': 47, 'got': 48, 'we': 49, 'lol': 50, 'what': 51, 'happy': 52, 'one': 53, 'from': 54, 'time': 55, 'u': 56, 'know': 57, 'there': 58, 'really': 59, 'back': 60, 'will': 61, 'don': 62, 'about': 63, 'im': 64, 'had': 65, 'its': 66, 'am': 67, 'see': 68, 'some': 69, 'they': 70, 'if': 71, 'night': 72, 'new': 73, 'home': 74, '2': 75, 'want': 76, 'well': 77, 'how': 78, 'think': 79, 'as': 80, 'still': 81, 'when': 82, 'll': 83, 'more': 84, 'oh': 85, 'thanks': 86, 'off': 87, 'much': 88, 'here': 89, 'he': 90, 'great': 91, 'miss': 92, 'an': 93, 'hope': 94, 'has': 95, 'last': 96, 're': 97, 'morning': 98, 'need': 99, 'haha': 100, 'her': 101, 'been': 102, 'fun': 103, 'she': 104,...}

So, there are 25330 words found in the train_curated_text list, each assigned a number from 1 to 25330. The index 0 is not assigned to any word, which leaves it free to be used for padding.
Now, each sentence must be encoded into a list of numbers using the word_index generated by the tokenizer. These encoded sentences must all be of the same length in order to be fed into a neural network, so we pad them. Shorter sentences are padded with zeros at the beginning, while longer ones are truncated at the end.
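
The padding and truncation rule described above can be sketched in plain Python. This is only an illustration of the behaviour for a single sequence; the function name `pad_pre_truncate_post` is hypothetical, and the notebook itself relies on `pad_sequences`.

```python
def pad_pre_truncate_post(seq, maxlen):
    # Keep only the first maxlen tokens (truncation at the end) ...
    kept = seq[:maxlen]
    # ... then add zeros in front until the length is exactly maxlen.
    return [0] * (maxlen - len(kept)) + kept

short = pad_pre_truncate_post([5, 12, 7], 5)
long = pad_pre_truncate_post([1, 2, 3, 4, 5, 6], 5)
print(short)  # zeros added at the beginning
print(long)   # last token dropped
```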

In [None]:
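from tensorflow.keras.preprocessing.sequence import pad_sequences

# Note: this is the longest tweet measured in characters, not tokens. It is a
# safe upper bound on the token count, so no training tweet gets truncated.
max_length = max(len(text) for text in train_curated_text)
sequences = tokenizer.texts_to_sequences(train_curated_text)
# padding defaults to 'pre', so zeros are added at the beginning.
padded = pad_sequences(sequences, maxlen=max_length, truncating='post')

testing_sequences = tokenizer.texts_to_sequences(test_curated_text)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, truncating='post')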
print('sentence:',train_curated_text[2],'\nencoding:',sequences[2],'\npadded encoding:',padded[2])

## Output Preprocessing
Now we have our input preprocessed. Turning to our output, it is still in textual format and must be converted into numbers for computation. This can be done with the help of LabelBinarizer from the sklearn library, which encodes categorical data as a binary matrix with one column per class.
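
As a rough picture of what this encoding looks like for three classes, here is a pure-Python sketch. The function name `one_hot_labels` is my own; it sorts the class names so that the columns follow the same alphabetical order LabelBinarizer uses (negative, neutral, positive).

```python
def one_hot_labels(labels):
    # Sort the distinct class names so the column order is deterministic.
    classes = sorted(set(labels))
    # One row per label, with a 1 in the column of its class and 0 elsewhere.
    rows = [[1 if label == c else 0 for c in classes] for label in labels]
    return classes, rows

classes, encoded = one_hot_labels(["neutral", "negative", "positive", "neutral"])
print(classes)
print(encoded)
```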

In [None]:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(train_sentiment)
training_labels_final = lb.transform(train_sentiment)
testing_labels_final = lb.transform(test_sentiment)
print('Before Binarizing:',train_sentiment[4:7],'\nAfter Binarizing',training_labels_final[4:7])

## Architecture
### Embedding Layer
These padded encodings are now fed into an embedding layer, which learns a vector representation for each word during training, so that words used in similar ways tend to end up with similar vectors. This helps the classifier work with words at the level of their meaning. The embedding dimension used in this example is 13, following the common rule of thumb of taking the fourth root of the vocabulary size (25330). A higher dimension can improve the quality of the embedding, but the extra capacity is wasted when there is not enough data to train it. So, I am using a 13-dimensional word embedding.
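
The arithmetic behind the choice of 13 can be checked in a couple of lines. The variable names below are only for this illustration.

```python
n_words = 25330                     # vocabulary size reported above
fourth_root = n_words ** 0.25       # rule of thumb: fourth root of the vocabulary size
suggested_dim = round(fourth_root)  # round to the nearest whole dimension
print(fourth_root, suggested_dim)
```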
### Dense Layer
After the embedding layer, we have a global_average_pooling1d layer, which converts the 2D output of the embedding layer (one vector per word) into a single 1D vector per tweet by averaging over the word positions. This is done so that it can be fed into the next dense layer, which expects one-dimensional input. This final dense layer has three neurons, one for each of the three possible categories, matching our binarized outputs.
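
To see what global average pooling does to a single sentence, here is a small pure-Python sketch with toy numbers. It averages each embedding dimension across all word positions; the function name and values are illustrative only, and the notebook itself uses the Keras layer.

```python
def global_average_pool(embedded):
    # embedded: one vector per word position (sequence_length x embedding_dim).
    n = len(embedded)
    # zip(*embedded) groups the values of each embedding dimension together.
    return [sum(column) / n for column in zip(*embedded)]

sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 words, 2-dimensional toy embeddings
pooled = global_average_pool(sentence)
print(pooled)  # one value per embedding dimension
```

Because every position is averaged, the zero-padded positions also contribute to the mean in this simple setup.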

In [None]:
vocab_size = len(word_index)
embedding_dim = 13
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAvgPool1D(),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()

## Training
Now it is time to train our model on the processed data.

In [None]:
num_epochs = 20
history = model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

## Visualization:
Let us visualize the training and validation accuracy at each epoch to get a better idea of the training process.

In [None]:
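from matplotlib import pyplot as plt
# Label both curves so they can be told apart in the plot.
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()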

## Inference:

The graph shows that the validation accuracy has flattened by the 13th epoch. Training further can cause our model to overfit. However, the curve already becomes jagged from around epochs 7-8, which again suggests overfitting. Hence, let us fix the number of epochs at 8 and train the model again on both the train and test data.
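
The epoch above was chosen by looking at the plot. A common alternative heuristic, which is not what this notebook does, is to pick the epoch with the highest validation accuracy. The sketch below shows that idea; the accuracy values are made up for illustration and are not results from this model.

```python
def best_epoch(val_accuracy):
    # Index of the highest validation accuracy (first occurrence wins on ties).
    best_index = max(range(len(val_accuracy)), key=lambda i: val_accuracy[i])
    # Epochs are usually counted from 1, so shift the 0-based index.
    return best_index + 1

# Made-up numbers for illustration only; in the notebook the list would be
# history.history['val_accuracy'].
toy_val_accuracy = [0.55, 0.61, 0.66, 0.68, 0.67, 0.68, 0.66]
chosen = best_epoch(toy_val_accuracy)
print(chosen)
```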

In [None]:
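# Combine the train and test data for the final fit.
inputs = np.append(padded, testing_padded, axis=0)
labels = np.append(training_labels_final, testing_labels_final, axis=0)
num_epochs = 8
# Note: fit() continues from the weights learned above; rebuild the model first
# to train from scratch.
history = model.fit(inputs, labels, epochs=num_epochs)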

The above model shows a training accuracy of 72.49%.
## Prediction
Now the trained model is used to make predictions on a few new examples.

In [None]:
example = tokenizer.texts_to_sequences(["i feel nausea",'the model is doing pretty good','But not great'])
example = pad_sequences(example,maxlen=max_length)
pred = model.predict(example)
print(pred[0],pred[1],pred[2])
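
Each prediction is a list of three probabilities, in the alphabetical class order that LabelBinarizer uses. A small sketch of turning such a list back into a label is shown below. The probabilities are made up, not real model output, and `CLASS_NAMES` is hard-coded here as an assumption; in the notebook, `lb.classes_` holds the actual order.

```python
CLASS_NAMES = ["negative", "neutral", "positive"]  # assumed alphabetical order

def decode(probabilities):
    # Pick the position of the largest probability and look up its class name.
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASS_NAMES[best]

toy_pred = [0.62, 0.30, 0.08]  # made-up probabilities for illustration
label = decode(toy_pred)
print(label)
```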

Remember that each list shows the probabilities of negative, neutral and positive, in that order. The model performs decently, but one way to improve it is to use bigrams instead of single words. That way we could capture more of the context. 'not great' and 'is great' cannot both be treated as positive just because they contain the word 'great'. Similarly, 'not good' and 'not bad' are confusing when only one word is considered at a time. Bigrams can help in such cases.

I am still a beginner and hope this helps others like me. I also look forward to learning about my mistakes and areas to improve. Thanks for reaching the end.
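
As a postscript, here is a minimal sketch of what extracting bigrams from a sentence could look like. The function name `bigrams` is my own, and this is plain Python rather than anything used in the model above.

```python
def bigrams(text):
    # Lowercase, split on whitespace, then pair each word with the next one.
    words = text.lower().split()
    return list(zip(words, words[1:]))

negated = bigrams("But not great")
plain = bigrams("this is great")
print(negated)
print(plain)
```

Here ('not', 'great') and ('is', 'great') become different features, even though both contain the word 'great'.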