<p style='text-align: right;'><b>Data Scientist :</b> Ruslan S.</p>
<p style='text-align: right;'><b>Collaborator :</b> Luka Anicin</p>
</div>
<h1 style='text-align: center;'>Sentiment Analysis - Natural Language Processing</h1>
<h3>Steps : Data Wrangling, Exploratory Data Analysis (EDA)</h3>
<p><b>Introduction: </b>Natural language processing (NLP) is a branch of computer science, and more specifically of artificial intelligence (AI), concerned with giving machines the ability to understand text and spoken words in much the same way human beings can.</p>
<p>Sentiment analysis is the classification of people's feelings or opinions, as expressed in text, into categories such as Positive, Negative, or Neutral. It is used in many consumer-facing fields to investigate opinions about a particular product or topic.</p>
<img src='img/2.jpg'>
<br><br><b>DATA STAGES:</b>
<ul>
    <li>Text input</li>
    <li>Tokenization</li>
    <li>Stop Word Filtering</li>
    <li>Negation</li>
    <li>Stemming</li>
    <li>Classification</li>
    <li>Sentiment Class</li>
</ul>

<h3>IMPORT LIBRARIES/PACKAGES...</h3>

In [1]:
# loading
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set_style("darkgrid", {"axes.facecolor": ".9"}) # called after sns.set(), which would otherwise reset the style

import tensorflow as tf

import re

import nltk # It's a library that performs text processing tasks for Natural Language Processing
nltk.download('stopwords') 
from nltk.corpus import stopwords # English words which do not add much meaning to a sentence
from nltk.stem import SnowballStemmer # Stemmers remove morphological affixes from words, leaving only the word stem

from wordcloud import WordCloud # It's a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance

from sklearn.model_selection import train_test_split # Split arrays or matrices into random train and test subsets
from sklearn.preprocessing import LabelEncoder # Encode target labels with value between 0 and n_classes-1

from keras.preprocessing.text import Tokenizer # Text tokenization utility class
from keras.preprocessing.sequence import pad_sequences # It pads or truncates all sequences to one constant length

# modeling part
from tensorflow.keras.layers import Conv1D, Bidirectional, LSTM, Dense, Input, Dropout
from tensorflow.keras.layers import SpatialDropout1D
from tensorflow.keras.callbacks import ModelCheckpoint

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rshul\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


<h3>DATA CLEANING...</h3>

In [None]:
# loading data
path = "../data/raw/raw.csv"
df = pd.read_csv(filepath_or_buffer=path, header=None, encoding='latin')
df.head(3)

In [None]:
# checking data types of the columns
df.info()

In [None]:
# renaming columns
def renaming_df(df, col_names):
    df.columns = col_names
    print(df.head(3))
    return df

col_names = ['target', 'id', 'date', 'query', 'user_name', 'predictor']
df = renaming_df(df, col_names)

<p><b>NOTE: </b>For our task, we need only two columns: target (sentiment level) and predictor (the tweet text). The other columns can be dropped.</p>

In [None]:
# dropping unnecessary columns
def drop_cols(df, col_names):
    df = df.drop(labels=col_names, axis=1)
    print(df.head(3))
    return df

col_names_drop = ['id', 'date', 'query', 'user_name']
df = drop_cols(df, col_names_drop)

In [None]:
# checking unique values for target column
df['target'].unique()

In [None]:
# mapping target value 0 to 'Negative' and target value 4 to 'Positive'
def col_substitution(df, col_name, dic_val):
    df[col_name] = df[col_name].map(dic_val)
    print("Unique values: ", df[col_name].unique())
    print(df.head(5))
    return df

dic_val = {
    0: 'Negative',
    4: 'Positive',
}
df = col_substitution(df, 'target', dic_val)

In [None]:
# checking missing values
print(df['target'].isnull().sum())
print(df['predictor'].isnull().sum())

In [None]:
# checking duplicates
print(df.duplicated().sum())

In [None]:
# handling duplicates
def drop_duplicates_df(df):
    df = df.drop_duplicates(ignore_index=True)
    print(df.duplicated().sum())
    return df

df = drop_duplicates_df(df)

<b>NOTE:</b> Now we need to look at the distribution of our data. It's important that the classes are roughly evenly represented; otherwise the model can become biased toward the majority class.

In [None]:
# analyzing distribution
sns.histplot(data=df, x='target', hue='target').set_title('Histogram of distributed target values')

<b>NOTE:</b> As we can see, the two classes are roughly balanced.
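<p><b>NOTE: </b>The balance can also be checked numerically instead of by eye. The sketch below is a minimal illustration on a hypothetical toy list of labels (it stands in for the target column and is not the real data): it computes the share of each class and the ratio between the rarest and the most common class.</p>

```python
from collections import Counter

# Hypothetical toy labels standing in for df['target']
labels = ["Positive", "Negative", "Positive", "Negative", "Negative", "Positive"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the data set
shares = {label: count / total for label, count in counts.items()}

# Ratio between the rarest and the most common class; 1.0 means perfectly balanced
balance_ratio = min(counts.values()) / max(counts.values())

print(shares)
print(f"balance ratio: {balance_ratio:.2f}")
```

<p>A ratio close to 1.0 suggests balanced classes; a much smaller value would be a reason to consider resampling or class weights.</p>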

In [None]:
# exploring predictor columns
df['predictor'].sample(10)

<b>NOTE:</b> As we can see, the predictor column contains raw text that still needs to be cleaned.

<h3>PRE-PROCESSING...</h3>

<p><b>NOTE:</b> Tweets contain more than plain, meaningful words: they can also include hyperlinks, user mentions, images, or punctuation marks. Our job here is to remove this noise so the model can make more accurate predictions.</p>
<img src="img/1.jpeg" alt='Image cleaning'>

<p>Two common ways to normalize words are stemming and lemmatization. Stemming reduces inflected words to their stem, base, or root form, which is not necessarily a valid dictionary word. Lemmatization groups together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. The cleaning function in this notebook supports stemming only, as an optional step.</p>
<img src="img/3.png" alt="StemmingAndLemmatization">
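<p><b>NOTE: </b>To make the difference concrete, here is a deliberately simplified sketch. The suffix list and the lemma dictionary are hypothetical toy data, and <code>toy_stem</code> is not the Snowball algorithm used in this notebook; it only illustrates that a stemmer chops suffixes by rule, while a lemmatizer looks up a dictionary form.</p>

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Snowball algorithm
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical entries)
LEMMAS = {"studies": "study", "better": "good", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "walking", "better"]:
    print(f"{w}: stem={toy_stem(w)}, lemma={toy_lemmatize(w)}")
```

<p>Note how the stem of "studies" is "studi", which is not a real word, while its lemma is "study".</p>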

<p>Additionally, we will be dealing with user ids and hyperlinks in our string values.</p>
<img src="img/4.jpg" alt="Link">

<p>Finally, we will complete our pre-processing by removing stop words. Stop words are a set of commonly used words in a language; examples in English are “a”, “the”, “is”, and “are”. In Text Mining and Natural Language Processing (NLP), they are removed because they occur so often that they carry very little useful information.</p>
<img src="img/5.png" alt='Stop words'>
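<p><b>NOTE: </b>A minimal sketch of stop-word filtering is shown below. The mini stop-word set and the sample sentence are hypothetical; the notebook itself relies on NLTK's English list.</p>

```python
# Hypothetical mini stop-word set; the notebook uses nltk's English list instead
TOY_STOP_WORDS = {"a", "the", "is", "are", "and"}

def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop the stop words."""
    return [word for word in text.lower().split() if word not in TOY_STOP_WORDS]

filtered = remove_stop_words("The movie is a masterpiece and the actors are great")
print(filtered)  # ['movie', 'masterpiece', 'actors', 'great']
```

<p>One thing to keep in mind for sentiment analysis: NLTK's English list also contains negations such as "not", so removing stop words can discard information that changes the sentiment of a sentence.</p>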

In [None]:
# creating instances for Stopwords and Stemming
stop_words = stopwords.words('english')
stop_words[:5]

In [None]:
snow_stemmer = SnowballStemmer('english')
snow_stemmer

In [None]:
# regular expression pattern: user mentions, hyperlinks, and any run of non-alphanumeric characters
re_patter = r"@\S+|https?:\S+|[^A-Za-z0-9]+"
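<p><b>NOTE: </b>The sketch below shows what a pattern of this kind does to a single tweet. The sample tweet and the variable names are made up for illustration; the pattern covers user mentions, hyperlinks, and any run of non-alphanumeric characters.</p>

```python
import re

# Mentions, links, and runs of non-alphanumeric characters (illustrative pattern)
pattern = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Hypothetical sample tweet
tweet = "@user I LOVE this!!! http://t.co/abc :)"

# Replace every match with a space, then collapse the leftover whitespace
cleaned = " ".join(re.sub(pattern, " ", tweet.lower()).split())
print(cleaned)  # i love this
```

<p>Each match is replaced with a space rather than an empty string so that neighbouring words do not get glued together.</p>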

In [None]:
def cleaning_data(data, isStemming=False):
    """
    Removing all noise (mentions, links, punctuation, stop words).
        data: the text
        isStemming: if True, each word is reduced to its stem
    """
    data = re.sub(re_patter, " ", str(data).lower()).strip()
    data = data.split()
    token_list = []
    for word in data:
        if word not in stop_words:
            if isStemming:
                token_list.append(snow_stemmer.stem(word))
            else:
                token_list.append(word)
    return " ".join(token_list)

In [None]:
# applying pre-processing
df['predictor'] = df['predictor'].apply(cleaning_data)
df.sample(5)

<h3>VISUALIZATION...</h3>

In [None]:
# visualizing the ratio/frequencies of positive and negative words
# link: https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html
def show_word_cloud(class_val):
    mask = df['target'] == class_val
    predictor_values = " ".join(df[mask]['predictor'])

    max_words = 1800
    width = 1300
    height = 500
    
    word_cloud = WordCloud(max_words=max_words,
                          width=width,
                          height=height)
    word_cloud = word_cloud.generate(predictor_values)
    plt.imshow(word_cloud, interpolation='bilinear')
    plt.axis('off')
    plt.show()

In [None]:
# show positives
show_word_cloud('Positive')

In [None]:
# show negatives
show_word_cloud('Negative')

<h3>TRAIN/TEST SPLIT...</h3>

In [None]:
# splitting our data
# link: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)
print(f"Train size: {train_data.shape}\nTest size: {test_data.shape}")

<h3>TOKENIZATION...</h3>

<p><b>NOTE: </b>Tokenization refers to splitting a larger body of text into smaller units such as sentences or words. Here we use the Keras Tokenizer, which builds a vocabulary from the training text and maps every word to an integer index.</p>
<img src="img/6.jpg" alt="tokenization">
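<p><b>NOTE: </b>The following is a simplified pure-Python sketch of the idea behind a word-index tokenizer. The function names and the toy sentences are hypothetical, and real tokenizers also handle filtering and other options. It assumes that words are ranked by frequency, that index 0 is reserved for padding, and that unknown words are skipped.</p>

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word, 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Stable sort: words with equal counts keep their order of first appearance
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: idx for idx, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

train_texts = ["good day", "bad day", "good good movie"]
word_index = build_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0

sequences = to_sequences(["good movie", "awful day"], word_index)
print(word_index)
print(sequences)
```

<p>The vocabulary is built from the training texts only, so a word such as "awful" that never appeared in training simply disappears from the sequence.</p>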

In [None]:
# applying tokenization
# link: https://www.kaggle.com/arunrk7/nlp-beginner-text-classification-using-lstm
token_instance = Tokenizer()
token_instance.fit_on_texts(train_data['predictor'])

In [None]:
wordIndex = token_instance.word_index
vocabSize = len(token_instance.word_index) + 1
vocabSize

In [None]:
temp_num = 10
for key, val in wordIndex.items():
    if temp_num > 0:
        print(key, val)
        temp_num -= 1

In [None]:
# using pad_sequences to make all sequences the same length: shorter ones are padded, longer ones are truncated.
# link: https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences
token_instance.texts_to_sequences

In [None]:
MAX_SEQ_LENGTH = 30

def pad_trunc_funtion(data, col_name, max_length):
    return pad_sequences(token_instance.texts_to_sequences(data[col_name]),
                                     maxlen=max_length)

In [None]:
X_train = pad_trunc_funtion(train_data, 'predictor', MAX_SEQ_LENGTH)
X_test = pad_trunc_funtion(test_data, 'predictor', MAX_SEQ_LENGTH)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
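<p><b>NOTE: </b>To see what padding and truncating do, here is a small pure-Python sketch. The function name <code>pad_or_truncate</code> and the toy sequences are hypothetical; the sketch mirrors the default behaviour of <code>pad_sequences</code>, which pads with zeros at the front and also truncates from the front.</p>

```python
def pad_or_truncate(sequences, maxlen, value=0):
    """Left-pad short sequences with `value` and keep only the last `maxlen` items of long ones."""
    result = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]               # truncate from the front
        padding = [value] * (maxlen - len(seq))  # empty when the sequence is long enough
        result.append(padding + seq)
    return result

toy_sequences = [[1, 2], [1, 2, 3, 4, 5, 6]]
padded = pad_or_truncate(toy_sequences, maxlen=4)
print(padded)  # [[0, 0, 1, 2], [3, 4, 5, 6]]
```

<p>After this step every row has the same length, which is what allows the sequences to be stacked into a single 2D array for the model.</p>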


In [None]:
# checking
X_train[:4]

In [None]:
# creating variable with unique target values (classes)
classes = list(train_data['target'].unique())
classes

<h3>LABEL ENCODING...</h3>

<p><b>NOTE: </b>Label Encoding is a popular encoding technique for handling categorical variables. In this technique, each label is assigned a unique integer based on alphabetical ordering.</p>
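<p><b>NOTE: </b>The idea can be sketched in a few lines of plain Python. The toy labels below are hypothetical; the sketch sorts the distinct labels and uses their position as the integer code, which is the alphabetical ordering described above.</p>

```python
# Hypothetical toy labels
labels = ["Positive", "Negative", "Negative", "Positive"]

# Sorted distinct labels define the integer codes: position in the list = code
classes = sorted(set(labels))
label_to_int = {label: idx for idx, label in enumerate(classes)}

encoded = [label_to_int[label] for label in labels]
decoded = [classes[idx] for idx in encoded]

print(classes)  # ['Negative', 'Positive']
print(encoded)  # [1, 0, 0, 1]
```

<p>With two classes this gives 0 for Negative and 1 for Positive, which fits a single sigmoid output trained with binary cross-entropy.</p>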

In [None]:
# applying encoding and reshaping
encoder = LabelEncoder().fit(list(train_data['target']))

y_train = encoder.transform(list(train_data['target']))
y_test = encoder.transform(list(test_data['target']))

y_train.shape, y_test.shape

In [None]:
list(y_train[:15])

In [None]:
# reshaping 
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)

y_train.shape, y_test.shape

In [None]:
y_train[:15]

<h3>WORD EMBEDDING...</h3>

<p><b>NOTE: </b>Word embedding is a language-modeling technique that maps words to vectors of real numbers, representing words or phrases in a vector space with several dimensions. In predictive embedding models such as Word2Vec's CBOW, the input layer contains the context words and the output layer contains the current word.</p>
<img src="img/7.png" alt="word embedding">
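<p><b>NOTE: </b>What makes these vectors useful is that words with similar meaning tend to end up close together. The sketch below illustrates this with cosine similarity on hypothetical 3-dimensional vectors; the numbers are invented for the example and are not taken from any pre-trained model.</p>

```python
import math

# Hypothetical 3-dimensional vectors; real pre-trained vectors have 50 or more dimensions
vectors = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, 0.2],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_good_great = cosine_similarity(vectors["good"], vectors["great"])
sim_good_awful = cosine_similarity(vectors["good"], vectors["awful"])

print(f"good vs great: {sim_good_great:.3f}")
print(f"good vs awful: {sim_good_awful:.3f}")
```

<p>In this toy setup "good" and "great" point in almost the same direction, while "good" and "awful" point away from each other.</p>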

In [None]:
# we're going to use GloVe: global vectors for word representation
# link: https://nlp.stanford.edu/projects/glove/
WORD_EMBEDDING = "../temp/glove.6B/glove.6B.50d.txt"

In [None]:
def open_read_word_embedding(path_file):
    """Read a GloVe text file into a {word: vector} dictionary."""
    counter = 0
    word_vectors = {}

    with open(path_file, mode='r', encoding="utf8") as we:
        for line in we:
            list_of_values = line.split()
            # handling each line separately so one malformed line does not stop the loading
            try:
                key = list_of_values[0]
                coefs = np.asarray(list_of_values[1:], dtype='float32')
                word_vectors[key] = coefs
            except (IndexError, ValueError):
                counter += 1
                print(line)

    print(f"There is/are {counter} lines with error.")
    return word_vectors

In [None]:
embedding_idx = open_read_word_embedding(WORD_EMBEDDING)
print(len(embedding_idx))

In [None]:
EMBEDDING_DIMENSION = 50
embedding_mtx = np.zeros((vocabSize, EMBEDDING_DIMENSION))
print(embedding_mtx.shape)
embedding_mtx[:1]

In [None]:
counter = 0

for key, value in wordIndex.items():
    if key in embedding_idx:
        vector = embedding_idx[key]
        embedding_mtx[value] = vector
        counter += 1
        
print(f"{counter} out of {len(embedding_mtx)} vectors were placed into matrix")
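<p><b>NOTE: </b>The two steps above (reading the vectors and filling the matrix) can be traced on a tiny example. Everything in the sketch below is hypothetical toy data: three invented GloVe-style lines and a three-word vocabulary. Plain lists are used instead of NumPy arrays to keep it self-contained.</p>

```python
# Hypothetical GloVe-style lines: a word followed by its vector components
glove_text = """good 0.1 0.2 0.3
bad -0.1 -0.2 -0.3
day 0.0 0.5 0.5"""

# Step 1: parse every line into {word: vector}
embeddings = {}
for line in glove_text.splitlines():
    word, *values = line.split()
    embeddings[word] = [float(v) for v in values]

# Step 2: build a matrix whose row i holds the vector of the word with index i
toy_word_index = {"good": 1, "day": 2, "lol": 3}  # "lol" has no pre-trained vector
dim = 3
vocab_size = len(toy_word_index) + 1              # row 0 is reserved for padding

matrix = [[0.0] * dim for _ in range(vocab_size)]
found = 0
for word, idx in toy_word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        matrix[idx] = vector
        found += 1

print(f"{found} out of {len(toy_word_index)} words have a pre-trained vector")
for row in matrix:
    print(row)
```

<p>Words without a pre-trained vector keep an all-zero row, so the model receives no information about them unless the embedding layer is made trainable.</p>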

<h3>MODELING...</h3>

<p><b>NOTE: </b>Embedding layer is one of the available layers in Keras. This is mainly used in Natural Language Processing related applications such as language modeling, but it can also be used with other tasks that involve neural networks. While dealing with NLP problems, we can use pre-trained word embeddings such as GloVe. Alternatively we can also train our own embeddings using Keras embedding layer.</p>
<img src="img/8.png" alt="LSTM sample model diagram">

In [None]:
# creating embedding layer
# link for info about embedding layers: https://medium.com/analytics-vidhya/understanding-embedding-layer-in-keras-bbe3ff1327ce
emb_layer = tf.keras.layers.Embedding(vocabSize, EMBEDDING_DIMENSION, 
                                      weights=[embedding_mtx],
                                      input_length=MAX_SEQ_LENGTH, 
                                      trainable=False)

<p><b>NOTE: </b>We will be using Long short-term memory (LSTM). A link for reference is provided below.</p>
<a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Wikipedia page</a>
<br /><br /><b>Architecture: </b>
<ul>
    <li>Embedding Layer - <a href="https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526">read more...</a></li>
    <li>Conv1DLayer - <a href="https://towardsdatascience.com/understanding-1d-and-3d-convolution-neural-network-keras-9d8f76e29610">read more...</a></li>
    <li>LSTM - <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">read more...</a></li>
    <li>Dense - <a href="https://heartbeat.fritz.ai/classification-with-tensorflow-and-dense-neural-networks-8299327a818a">read more...</a></li>
</ul>

In [None]:
# initializing our model
seq_input = Input(shape=(MAX_SEQ_LENGTH,), dtype='int32')
emb_sequence = emb_layer(seq_input)
x = SpatialDropout1D(0.2)(emb_sequence)
x = Conv1D(64, 5, activation='relu')(x)
x = Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2))(x)
x = Dense(512, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(512, activation='relu')(x)
outputs = Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(seq_input, outputs)
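<p><b>NOTE: </b>It can help to follow the tensor shapes through the first layers by hand. The sketch below is plain arithmetic, not Keras code. It assumes the default Conv1D settings ('valid' padding, stride 1), an LSTM that returns only its final output, and a Bidirectional wrapper that concatenates both directions; the helper name <code>conv1d_output_length</code> is made up for this example.</p>

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Number of output steps of a 1D convolution with 'valid' padding (no padding)."""
    return (input_length - kernel_size) // stride + 1

MAX_SEQ_LENGTH = 30
EMBEDDING_DIMENSION = 50
FILTERS, KERNEL_SIZE, LSTM_UNITS = 64, 5, 64

# Embedding: one 50-dimensional vector per token
after_embedding = (MAX_SEQ_LENGTH, EMBEDDING_DIMENSION)

# Conv1D: the window of 5 tokens fits 26 times into 30 positions
after_conv = (conv1d_output_length(MAX_SEQ_LENGTH, KERNEL_SIZE), FILTERS)

# Bidirectional LSTM: forward and backward outputs are concatenated
after_bilstm = 2 * LSTM_UNITS

# Conv1D weights: one kernel of size (5 x 50) per filter, plus one bias per filter
conv_params = KERNEL_SIZE * EMBEDDING_DIMENSION * FILTERS + FILTERS

print(after_embedding, after_conv, after_bilstm, conv_params)
```

<p>These numbers can be compared with the output of <code>model.summary()</code> as a sanity check.</p>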

<p><b>NOTE: </b>We're using the Adam optimization algorithm, a variant of stochastic gradient descent.</p>
<a href="https://keras.io/api/optimizers/adam/">read more...</a>
<p><b>NOTE: </b>We'll be using 'ReduceLROnPlateau' (<a href="https://keras.io/api/callbacks/reduce_lr_on_plateau/">read more...</a>) to lower the learning rate when the validation loss stops improving. 'ModelCheckpoint' (<a href="https://keras.io/api/callbacks/model_checkpoint/">read more...</a>) is imported as well and can be used to save the model while training.</p>

In [None]:
# compiling our model
LR = 1e-3

model.compile(optimizer=Adam(learning_rate=LR), 
              loss='binary_crossentropy', 
              metrics=['accuracy'])

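# min_lr is the lower bound for the learning rate and has to be below LR to have any effect;
# 1e-5 is an assumed value
ReduceLROnPlateau = ReduceLROnPlateau(factor=0.1,
                                     min_lr=1e-5,
                                     monitor='val_loss',
                                     verbose=1)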
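<p><b>NOTE: </b>The behaviour of a reduce-on-plateau schedule can be traced with a small simulation. This is a simplified sketch, not the Keras implementation: it ignores the <code>min_delta</code> and <code>cooldown</code> options, the validation losses are invented, and a patience of 2 is chosen only to keep the example short (the Keras default is 10).</p>

```python
def simulate_reduce_on_plateau(val_losses, lr, factor=0.1, patience=2, min_lr=1e-5):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            # Improvement: remember it and reset the waiting counter
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Reduce only while the rate is still above the lower bound
                if lr > min_lr:
                    lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses for seven epochs
losses = [0.70, 0.65, 0.66, 0.67, 0.64, 0.66, 0.65]

schedule = simulate_reduce_on_plateau(losses, lr=1e-3)
print(schedule)

# A lower bound above the starting rate means the rate is never reduced
unchanged = simulate_reduce_on_plateau(losses, lr=1e-3, min_lr=0.01)
print(unchanged)
```

<p>The second call shows why the lower bound has to sit below the starting learning rate: otherwise the schedule never changes anything. With a single training epoch there is also no plateau to detect, so the callback only matters for longer runs.</p>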

<h3>TRAINING...</h3>

In [None]:
# checking if we will be using GPU or CPU to train our model
if tf.config.list_physical_devices('GPU'):
    print("Using GPU to train model...")
else:
    print("Using CPU to train model...")

In [None]:
X_test.shape, y_test.shape, X_train.shape, y_train.shape

In [None]:
# training our model
BATCH_SIZE = 1000
EPOCHS = 1

hist = model.fit(X_train, y_train, batch_size=BATCH_SIZE, epochs=EPOCHS,
                validation_data=(X_test, y_test), callbacks=[ReduceLROnPlateau])

In [None]:
X_test.shape

<h3>MODEL METRICS...</h3>