## <u>Table of Contents</u>
*  [1. Reading and Analyzing Dataframe](#1)
*  [2. Train-Test Splitting](#2)
*  [3. Random Oversampling](#3)
*  [4. Text Preprocessing](#4)
*  [5. Label Encoding](#5)
*  [6. Tokenizing Sentences and Fixing Sentence Length](#6)
*  [7. Bi-LSTM Model](#7)
*  [8. Model Evaluation](#8)
*  [9. References](#9)

# Libraries

In [None]:
import pandas as pd
import re
import matplotlib.pyplot as plt
import tensorflow as tf

# Reporting the GPU device name (an empty string means no GPU is visible)
tf.test.gpu_device_name()

In [None]:
import pickle
from tqdm import tqdm
import numpy as np
from sklearn.metrics import f1_score
import sklearn
from sklearn.metrics import precision_recall_fscore_support as score

# 1. Reading and Analyzing Dataframe

In [None]:
train_df = pd.read_csv('../input/text-classification-dataset/telugu/telugu/telugu_train.csv')
dev_df = pd.read_csv('../input/text-classification-dataset/telugu/telugu/telugu_valid.csv')
test_df = pd.read_csv('../input/text-classification-dataset/telugu/telugu/telugu_test.csv')

In [None]:
train_df = train_df.drop(columns=['Unnamed: 0'])
dev_df = dev_df.drop(columns=['Unnamed: 0'])
test_df = test_df.drop(columns=['Unnamed: 0'])

In [None]:
train_df.columns = ['text', 'label']
dev_df.columns = ['text', 'label']
test_df.columns = ['text', 'label']

In [None]:
train_df.head()

In [None]:
dev_df.head()

In [None]:
test_df.head()

In [None]:
# Analysing train dataframe attributes 
print('* Size of dataframe: {}\n'.format(train_df.shape))
print('* Datatype of columns are:\n {}\n'.format(train_df.dtypes))
print('* Count of different categories:\n {}\n'.format(train_df['label'].value_counts()))
print('* Number of NaNs among text are: {}\n'.format(train_df['text'].isnull().sum()))

# Removing NaNs first: casting to str beforehand would turn NaN into the text 'nan' and hide it from dropna
train_df = train_df.dropna(subset=['text']).reset_index(drop=True)
print('NaNs are removed from the dataframe. Number of NaNs can be confirmed to be {}. The size of dataframe has reduced to {}'.format(train_df['text'].isnull().sum(), train_df.shape))

# Converting text to string
train_df['text'] = train_df['text'].astype(str)

# Analysing dev dataframe attributes 
print('* Size of dataframe: {}\n'.format(dev_df.shape))
print('* Datatype of columns are:\n {}\n'.format(dev_df.dtypes))
print('* Count of different categories:\n {}\n'.format(dev_df['label'].value_counts()))
print('* Number of NaNs among text are: {}\n'.format(dev_df['text'].isnull().sum())) 

# Removing NaNs first: casting to str beforehand would turn NaN into the text 'nan' and hide it from dropna
dev_df = dev_df.dropna(subset=['text']).reset_index(drop=True)
print('NaNs are removed from the dataframe. Number of NaNs can be confirmed to be {}. The size of dataframe has reduced to {}'.format(dev_df['text'].isnull().sum(), dev_df.shape))

# Converting text to string
dev_df['text'] = dev_df['text'].astype(str)

# Analysing test dataframe attributes 
print('* Size of dataframe: {}\n'.format(test_df.shape))
print('* Datatype of columns are:\n {}\n'.format(test_df.dtypes))
print('* Count of different categories:\n {}\n'.format(test_df['label'].value_counts()))
print('* Number of NaNs among text are: {}\n'.format(test_df['text'].isnull().sum()))

# Removing NaNs first: casting to str beforehand would turn NaN into the text 'nan' and hide it from dropna
test_df = test_df.dropna(subset=['text']).reset_index(drop=True)
print('NaNs are removed from the dataframe. Number of NaNs can be confirmed to be {}. The size of dataframe has reduced to {}'.format(test_df['text'].isnull().sum(), test_df.shape))

# Converting text to string
test_df['text'] = test_df['text'].astype(str)

In [None]:
# Plotting label value counts
train_df.groupby('label').count().plot(kind='bar')
plt.show()

In [None]:
# Plotting character lengths of train text (len(x) counts characters, not words)
train_word_length = [len(x) for x in train_df['text']]
plt.plot(train_word_length)

In [None]:
# Plotting label value counts
dev_df.groupby('label').count().plot(kind='bar')
plt.show()

In [None]:
# Plotting character lengths of dev text (len(x) counts characters, not words)
dev_word_length = [len(x) for x in dev_df['text']]
plt.plot(dev_word_length)

In [None]:
# Plotting label value counts
test_df.groupby('label').count().plot(kind='bar')
plt.show()

In [None]:
# Plotting character lengths of test text (len(x) counts characters, not words)
test_word_length = [len(x) for x in test_df['text']]
plt.plot(test_word_length)

In [None]:
# # Reading dataframe
# df = pd.read_csv('../input/consume-complaints-dataset-fo-nlp/complaints_processed.csv')
# df.head()

In [None]:
# # Renaming columns 
# df = df.rename(columns={'narrative':'tweet' })

# # Removing SNo column
# df.drop(['Unnamed: 0'], axis=1, inplace=True)
# df.head()

In [None]:
# # Analysing dataframe attributes 
# print('* Size of dataframe: {}\n'.format(df.shape))
# print('* Datatype of columns are:\n {}\n'.format(df.dtypes))
# print('* Count of different product categories:\n {}\n'.format(df['product'].value_counts()))
# print('* Number of NaNs among tweets are: {}\n'.format(df['tweet'].isnull().sum())) 

In [None]:
# # Removing NaNs
# df = df.dropna(subset=['tweet'])
# print('NaNs are removed from the dataframe. Number of NaNs can be confirmed to be {}. The size of dataframe has reduced to {}'.format(df['tweet'].isnull().sum(), df.shape))

In [None]:
# # Plotting character lengths of tweets
# word_length = [len(x) for x in df['tweet']]
# plt.plot(word_length)

In [None]:
# # Converting sentences to string
# df['tweet'] = df['tweet'].astype(str)

In [None]:
# # Types of products
# df['product'].value_counts()

In [None]:
# # Plotting product value counts
# df.groupby('product').count().plot(kind='bar')
# plt.show()

The label distribution is <b>imbalanced</b>: the classes do not have equal numbers of examples. Rebalancing the training data can improve <b>accuracy</b>, particularly on the minority classes.
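
The imbalance can be quantified instead of read off the bar charts. The sketch below turns label counts into proportions and a majority-to-minority ratio using only the standard library. The `labels` list and its class names are made-up illustration data, not the actual dataset.

```python
from collections import Counter

def imbalance_summary(labels):
    """Return per-class proportions and the majority/minority count ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    proportions = {label: n / total for label, n in counts.items()}
    ratio = max(counts.values()) / min(counts.values())
    return proportions, ratio

# Hypothetical labels, for illustration only
labels = ["sports"] * 6 + ["business"] * 3 + ["nation"] * 1
proportions, ratio = imbalance_summary(labels)
print(proportions)
print(ratio)
```

A ratio close to 1 indicates a balanced set; larger values suggest that oversampling or class weights may be worth trying.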

# 2. Train-Test Splitting

In [None]:
# # Importing train test split library 
# from sklearn.model_selection import train_test_split

# # Train-Test Splitting
# train_data, test_data = train_test_split(df, test_size = 0.20)

In [None]:
# # Train and test data dimensions
# train_data.shape, test_data.shape

In [None]:
# # Balance of train data
# train_data.groupby('product').count().plot(kind='bar')
# plt.show()

* The credit_card, debt_collection, mortgages_and_loans and retail_banking classes have <b>very few examples</b>. So, the number of examples in these classes will be increased using <b>random oversampling</b>.
* Oversampling is applied only to the train set, after splitting, so that duplicated rows cannot cause <b>data leakage</b> into the test set.
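
A minimal sketch of random oversampling restricted to the training rows is shown below. It is an illustration under assumptions, not the exact procedure of this notebook: it samples with replacement up to a chosen `target_count`, and the helper name `random_oversample`, the toy rows and the labels `x` and `y` are all hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, target_count, seed=0):
    """Duplicate randomly chosen rows of each under-represented class
    until every class has at least `target_count` rows."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    result = list(rows)
    for members in by_class.values():
        shortfall = target_count - len(members)
        if shortfall > 0:
            # Sampling with replacement, so very small classes can still reach the target
            result.extend(rng.choices(members, k=shortfall))
    return result

# Hypothetical training rows in (text, label) form
train_rows = [("a", "x")] * 5 + [("b", "y")] * 2
balanced = random_oversample(train_rows, label_of=lambda r: r[1], target_count=5)
counts = Counter(label for _, label in balanced)
print(counts)
```

Only the training rows are passed in; validation and test rows keep their original distribution so that evaluation reflects unseen data.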

# 3. Random Oversampling

In [None]:
# # Train set value counts 
# train_data.groupby('product').count()

In [None]:
# # Randomly selecting 7000 indices in classes with low value count
# import numpy as np
# to_add_1 = np.random.choice(train_data[train_data['product']=='credit_card'].index,size = 7000,replace=False)   
# to_add_2 = np.random.choice(train_data[train_data['product']=='debt_collection'].index,size = 7000,replace=False) 
# to_add_3 = np.random.choice(train_data[train_data['product']=='mortgages_and_loans'].index,size = 7000,replace=False)  
# to_add_4 = np.random.choice(train_data[train_data['product']=='retail_banking'].index,size=7000,replace=False)

# # Indices to be added
# to_add = np.concatenate((to_add_1, to_add_2, to_add_3, to_add_4 ))
# len(to_add)

In [None]:
# # Forming a dataframe for randomly selected indices
# df_replicate = train_data[train_data.index.isin(to_add)]
# df_replicate  

In [None]:
# # Concatenating replicated df to original df
# train_data = pd.concat([train_data, df_replicate])
# train_data['product'].value_counts()

Value counts of minority classes have <b>increased</b>. 

# 4. Text Preprocessing

In [None]:
# # Importing NLTK Libraries
# import nltk
# from nltk.corpus import stopwords
# from nltk import *

# nltk.download('stopwords')
# stop = stopwords.words('english')

In [None]:
# # Declaring function for text preprocessing 

# def preprocess_text(main_df):
#     df_1 = main_df.copy()
 
#     df_1['tweet'] = df_1['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in stop)) 
  
#     # remove punctuations and convert to lower case
#     df_1['tweet'] = df_1['tweet'].apply(lambda x: re.sub('[!@#$:).;,?&]', '', x.lower()))
  
#     # remove double spaces
#     df_1['tweet'] = df_1['tweet'].apply(lambda x: re.sub(' +', ' ', x))
    
#     return df_1  

In [None]:
# # Preprocessing training and test data 
# train_data = preprocess_text(train_data)
# test_data = preprocess_text(test_data)

In [None]:
# # Verifying text preprocessing
# train_data['tweet'].head()

# 5. Label Encoding

In [None]:
# Declaring train labels
train_labels = train_df['label']
valid_labels = dev_df['label']
test_labels = test_df['label']

In [None]:
# Converting labels to numerical features
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
valid_labels = le.transform(valid_labels)
test_labels = le.transform(test_labels)

print(le.classes_)
print(np.unique(train_labels, return_counts=True))
print(np.unique(valid_labels, return_counts=True))
print(np.unique(test_labels, return_counts=True))

In [None]:
# Changing labels to categorical (one-hot) features
from tensorflow.keras.utils import to_categorical

# Fixing num_classes keeps all three arrays the same width, even if a split is missing a class
num_classes = len(le.classes_)
train_labels = to_categorical(np.asarray(train_labels), num_classes=num_classes)
valid_labels = to_categorical(np.asarray(valid_labels), num_classes=num_classes)
test_labels = to_categorical(np.asarray(test_labels), num_classes=num_classes)

In [None]:
labels_count = train_df['label'].value_counts()
labels_count = len(labels_count)
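
The two encoding steps above can be sketched in plain Python to show what the arrays contain: labels are first mapped to integers in sorted class order (as `LabelEncoder` does), then each integer becomes a one-hot row. The helper names and the example labels are hypothetical.

```python
def fit_label_index(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: i for i, label in enumerate(classes)}

def one_hot(indices, num_classes):
    """Turn class indices into one-hot rows of width num_classes."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in indices]

# Hypothetical labels, for illustration only
train = ["sports", "business", "sports", "nation"]
classes, index = fit_label_index(train)
encoded = [index[label] for label in train]
vectors = one_hot(encoded, len(classes))

print(classes)
print(encoded)
print(vectors[0])
```

Because the mapping is fitted on the training labels only, the same `index` must be reused for the validation and test labels.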

# 6. Tokenizing Sentences and Fixing Sentence Length

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Defining training parameters
max_sequence_length = 170   
max_words = 30000   

# Tokenizing tweets/sentences wrt num_words
tokenizer = Tokenizer(num_words = max_words)  # Selects most frequent words 
tokenizer.fit_on_texts(train_df.text)      # Develops internal vocab based on training text
train_sequences = tokenizer.texts_to_sequences(train_df.text)  # converts text to sequence

valid_sequences = tokenizer.texts_to_sequences(dev_df.text)
test_sequences = tokenizer.texts_to_sequences(test_df.text)

In [None]:
# Fixing the sequence length 
from tensorflow.keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_sequences, maxlen = max_sequence_length)
valid_data = pad_sequences(valid_sequences, maxlen = max_sequence_length)
test_data = pad_sequences(test_sequences, maxlen = max_sequence_length)
train_data.shape, valid_data.shape, test_data.shape
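
With its default arguments, `pad_sequences` pads on the left with zeros and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The plain-Python sketch below mirrors that behaviour on toy data; `pad_pre` is a hypothetical helper, not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens,
    mirroring padding='pre' and truncating='pre'. Assumes maxlen >= 1."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

# Toy token-id sequences: one shorter and one longer than maxlen
sequences = [[4, 9], [1, 2, 3, 4, 5, 6]]
result = pad_pre(sequences, maxlen=4)
print(result)
```

Index 0 is reserved for padding, which is why the tokenizer's word indices start at 1.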

# 7. Bi-LSTM Model

## 7.1 Declaring Model

In [None]:
# Model Parameters
embedding_dim = 100

In [None]:
# Importing Libraries (only the layers and model utilities used below)

from tensorflow.keras.layers import Bidirectional, Dense, Embedding, GlobalMaxPool1D, LSTM
from tensorflow.keras.models import Sequential, load_model

In [None]:
# Model Definition
model = Sequential()
model.add(Embedding(max_words, 
                   embedding_dim,
                   input_length = max_sequence_length))

# Bidirectional LSTM 
model.add(Bidirectional(LSTM(128, return_sequences=True, dropout=0.4, recurrent_dropout=0)))   

model.add(GlobalMaxPool1D())

model.add(Dense(labels_count,activation='softmax'))  

model.summary()
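
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below applies the standard formulas for an embedding layer, an LSTM with bias (four gates per direction) and a dense layer. The value `num_classes=5` is an assumption for illustration only; the real value is `labels_count`.

```python
def bilstm_param_count(vocab_size, embedding_dim, lstm_units, num_classes):
    """Trainable parameters of Embedding -> BiLSTM -> GlobalMaxPool1D -> Dense."""
    embedding = vocab_size * embedding_dim
    # Each LSTM direction has 4 gates, each with input weights, recurrent weights and a bias
    one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    bilstm = 2 * one_direction
    # Global max pooling has no parameters; the dense layer sees 2 * lstm_units features
    dense = (2 * lstm_units) * num_classes + num_classes
    return {"embedding": embedding, "bilstm": bilstm, "dense": dense,
            "total": embedding + bilstm + dense}

# max_words=30000, embedding_dim=100 and 128 LSTM units as in this notebook;
# num_classes=5 is a hypothetical placeholder
params = bilstm_param_count(30000, 100, 128, num_classes=5)
print(params)
```

The embedding table accounts for most of the parameters, so reducing `max_words` or `embedding_dim` is the quickest way to shrink the model.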

## 7.2 Passing Data Through Network

In [None]:
model.compile(loss = 'categorical_crossentropy', optimizer='Adam', metrics = ['accuracy'])

In [None]:
checkpoint_filepath = '/model'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=False,
    monitor='val_loss',
    mode='auto',
    save_best_only=True)

In [None]:
# # declaring weights of product categories
# class_weight = {0: 4,          
#                 1: 5,    
#                 2: 3,      
#                 3: 3,     
#                 4: 4}      

# training and validating model 
# history = model.fit(train_data, train_labels, batch_size=48, epochs= 20, class_weight = class_weight, validation_data=(test_data, test_labels)) # best 89(now) or 48 or 60 epochs # default epochs = 23 # batch_size changed to 1 (takes 2.30hrs) from 16
history = model.fit(train_data, train_labels, batch_size=48, epochs= 50, validation_data=(valid_data, valid_labels), callbacks=[model_checkpoint_callback])

In [None]:
# model.save('model')

In [None]:
import pickle
# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [None]:
# Prediction on Test Data
predicted_bi_lstm = model.predict(test_data)
predicted_bi_lstm

In [None]:
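# Taking argmax of the raw probabilities; rounding first would map every row whose top probability is below 0.5 to class 0
pred_labels = np.argmax(predicted_bi_lstm, axis=1)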

In [None]:
actual_labels = np.argmax(test_labels, axis=1)
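
When turning softmax outputs into class indices, it matters whether the probabilities are rounded before the argmax. The toy example below uses a made-up probability row in which no class reaches 0.5: rounding zeroes out every entry, so the argmax falls back to index 0.

```python
def argmax(row):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical softmax output for one sample; no entry reaches 0.5
probs = [0.30, 0.25, 0.45]
rounded = [round(p) for p in probs]

from_raw = argmax(probs)
from_rounded = argmax(rounded)
print(from_raw)
print(from_rounded)
```

For single-label classification, the argmax of the raw probabilities is the safer choice.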

In [None]:
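# Collecting the first five test examples with their actual and predicted labels
classification_results = []
for i in range(5):
    predict_label = pred_labels[i]
    actual_label = actual_labels[i]
    text = test_df['text'].iloc[i]  # positional lookup, independent of the dataframe index
    classification_results.append({'actual': actual_label, 'predict': predict_label, 'sentence': text})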

In [None]:
with open('classification_results.pkl', 'wb') as fp:
    pickle.dump(classification_results, fp)

# 8. Model Evaluation

In [None]:
model = load_model("/model")

In [None]:
# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)

## 8.1 Model Performance Attributes

In [None]:
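# Recomputing predictions with the reloaded best checkpoint and comparing class indices
# (rounded probabilities can produce all-zero rows, which distorts the per-class scores)
pred_labels = np.argmax(model.predict(test_data), axis=1)
actual_labels = np.argmax(test_labels, axis=1)
precision, recall, fscore, support = score(actual_labels, pred_labels)

print('precision: {}'.format(precision))
print('recall: {}'.format(recall))
print('fscore: {}'.format(fscore))
print('support: {}'.format(support))
print('################################')
print(sklearn.metrics.classification_report(actual_labels, pred_labels))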

## 8.2 Model Performance with Epochs


In [None]:
def accuracy_plot(history):
    
    fig, ax = plt.subplots(1, 2, figsize=(12,5))
    
    fig.suptitle('Model Performance with Epochs', fontsize = 16)
    # Subplot 1 
    ax[0].plot(history.history['accuracy'])
    ax[0].plot(history.history['val_accuracy'])
    ax[0].set_title('Model Accuracy', fontsize = 14)
    ax[0].set_xlabel('Epochs', fontsize = 12)
    ax[0].set_ylabel('Accuracy', fontsize = 12)
    ax[0].legend(['train', 'validation'], loc='best')
    
    # Subplot 2
    ax[1].plot(history.history['loss'])
    ax[1].plot(history.history['val_loss'])
    ax[1].set_title('Model Loss', fontsize = 14)
    ax[1].set_xlabel('Epochs', fontsize = 12)
    ax[1].set_ylabel('Loss', fontsize = 12)
    ax[1].legend(['train', 'validation'], loc='best')
    
    
accuracy_plot(history)

## 8.3 Confusion Matrix

In [None]:
# Declaring function for plotting confusion matrix
import seaborn as sns
from sklearn.metrics import confusion_matrix

def plot_cm(model, test_data, test_labels):
    
    # Using the encoder's class order so tick labels match the encoded indices in the matrix
    products = le.classes_
        
    # Calculate predictions
    pred = model.predict(test_data)
    
    # Declaring confusion matrix
    cm = confusion_matrix(np.argmax(np.array(test_labels),axis=1), np.argmax(pred, axis=1))
    
    # Heat map labels

    group_counts = ['{0:0.0f}'.format(value) for value in cm.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in cm.flatten()/np.sum(cm)]
    
    labels = [f"{v2}\n{v3}" for v2, v3 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(labels_count,labels_count)

    # Plotting confusion matrix
    plt.figure(figsize=(12,8))
    
    sns.heatmap(cm, cmap=plt.cm.Blues, annot=labels, annot_kws={"size": 15}, fmt = '',
                xticklabels = products,
                yticklabels = products)
    
    plt.xticks(fontsize = 12)
    plt.yticks(fontsize = 12, rotation = 'horizontal')
    plt.title('Confusion Matrix\n', fontsize=19)
    plt.xlabel('Predicted Labels', fontsize=17)
    plt.ylabel('Actual Labels', fontsize=17)
    
plot_cm(model, test_data, test_labels)

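The confusion matrix gives a per-class view of model <b>accuracy</b>: diagonal cells count correct predictions, and off-diagonal cells show which classes are confused with each other.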
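
Per-class precision and recall, as well as overall accuracy, can be read directly from a confusion matrix. The sketch below assumes the scikit-learn convention (rows are actual classes, columns are predicted classes) and uses a made-up 2x2 matrix, not results from this model.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples of actual class i predicted as class j.
    Returns a list of (precision, recall) per class and the overall accuracy."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_positive = cm[k][k]
        actual_total = sum(cm[k])                           # row sum
        predicted_total = sum(cm[i][k] for i in range(n))   # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        metrics.append((precision, recall))
    accuracy = sum(cm[k][k] for k in range(n)) / sum(map(sum, cm))
    return metrics, accuracy

# Hypothetical confusion matrix, for illustration only
cm = [[8, 2],
      [1, 9]]
metrics, accuracy = per_class_metrics(cm)
print(metrics)
print(accuracy)
```

On imbalanced data, the per-class recall values are usually more informative than the single accuracy figure.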

# 9. References

1. NLP Implementation: https://www.kaggle.com/the0electronic0guy/nlp-with-disaster-tweets

2. NLP Book: Kulkarni, Akshay, and Adarsha Shivananda. Natural language processing recipes. Apress, 2019.

3. LSTM: https://www.kaggle.com/kritanjalijain/twitter-sentiment-analysis-lstm

4. Bi-LSTM: https://www.kaggle.com/kritanjalijain/twitter-sentiment-analysis-lstm-2#Bidirectional-LSTM-Using-NN 

5. Bi-LSTM: https://www.kaggle.com/eashish/bidirectional-gru-with-convolution

6. Bi-LSTM: https://www.kaggle.com/victorbnnt/classification-using-lstm-85-accuracy

7. Imbalanced Datasets: https://towardsdatascience.com/yet-another-twitter-sentiment-analysis-part-1-tackling-class-imbalance-4d7a7f717d44

8. Multiclass Classification: https://towardsdatascience.com/machine-learning-multiclass-classification-with-imbalanced-data-set-29f6a177c1a