
**Challenge 12 (NLP with Keras-TF2)**

Use the Yelp review dataset available at https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+S...

Create an NLP text classification model using CNNs and see if you can achieve more than 80% accuracy on the test data.

Use KerasClassifier (to enable k-fold cross-validation, etc.) and RandomizedSearchCV (to find the best combination of hyperparameters) to fine-tune the model.

In [76]:
import string

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow
from tensorflow import keras
from tensorflow.keras import backend, layers, models, preprocessing, regularizers
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import (BatchNormalization, Conv1D, Dense, Dropout,
                                     Embedding, Flatten, GlobalMaxPooling1D)
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.regularizers import l1, l2, l1_l2
from tensorflow.keras.utils import to_categorical
# Note: this wrapper module was removed in newer TensorFlow releases; SciKeras offers a replacement
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.preprocessing import MinMaxScaler

nltk.download('stopwords')
#stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
# Read input from Google drive
df = pd.read_csv('/content/yelp_labelled.txt', names=['sentence', 'label'], sep='\t')

In [3]:
df.head()

                                            sentence  label
0                           Wow... Loved this place.      1
1                                 Crust is not good.      0
2          Not tasty and the texture was just nasty.      0
3  Stopped by during the late May bank holiday of...      1
4  The selection on the menu was great and so wer...      1


In [4]:
# check size of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sentence  1000 non-null   object
 1   label     1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


In [5]:
#check for null values
df.isnull().sum()

sentence    0
label       0
dtype: int64

In [40]:
# Separate sentences and labels
sentence = df['sentence'].values
label = df['label'].values
sentence[0]

'Wow... Loved this place.'

In [41]:
# Train/test split (80/20); no separate validation set is created here
X_train, X_test, y_train, y_test = train_test_split(sentence, label, test_size=0.2, random_state=42)

 * Tokenize and Convert to Sequences


In [42]:
max_words = 1000

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

x_train_tkn = tokenizer.texts_to_sequences(X_train)
x_test_tkn = tokenizer.texts_to_sequences(X_test)
#x_val_tkn = tokenizer.texts_to_sequences(X_val)

print(X_train[1])
print(x_train_tkn[1]) 

An excellent new restaurant by an experienced Frenchman.
[46, 144, 370, 66, 61, 46, 773, 774]
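To make the mapping above concrete, here is a minimal pure-Python sketch of frequency-ranked word indexing. It only approximates what the Keras Tokenizer does (lowercase the text, treat punctuation as separators, give the most frequent word index 1, and emit only indices below num_words). The helper names split_words, build_word_index and to_sequence are hypothetical, and the toy sentences are made up.

```python
from collections import Counter

# Characters treated as separators; intended to mirror the Tokenizer's default filters.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'
_TO_SPACE = str.maketrans({ch: " " for ch in FILTERS})


def split_words(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    return text.lower().translate(_TO_SPACE).split()


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in split_words(text))
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words=None):
    """Map words to indices, dropping unknown words and indices >= num_words."""
    ids = (word_index.get(word) for word in split_words(text))
    return [i for i in ids if i is not None and (num_words is None or i < num_words)]


toy_texts = ["The food was good", "The service was not good", "Not tasty"]
word_index = build_word_index(toy_texts)
print(word_index)
print(to_sequence("The food was tasty", word_index))
print(to_sequence("The food was tasty", word_index, num_words=5))
```

Note that the index built here contains every word seen, while the num_words limit is only applied when converting text to sequences. The same holds for the Keras Tokenizer, which is why len(tokenizer.word_index) can exceed max_words.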


 Pad the sequences to a fixed length

    Note: the longest sentences contain only around 30 words, so set maxlen = 30

In [43]:
maxlen = 30
x_train_pad = preprocessing.sequence.pad_sequences(x_train_tkn, maxlen=maxlen,  padding='post')
x_test_pad = preprocessing.sequence.pad_sequences(x_test_tkn, maxlen=maxlen,  padding='post')
# x_val_pad = preprocessing.sequence.pad_sequences(x_val_tkn, maxlen=maxlen,  padding='post')
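One way to sanity-check the maxlen choice is to measure how many tokenized sequences fit within a given length. The sketch below is plain Python; coverage is a hypothetical helper and the toy sequences are made up, so in the notebook you would pass x_train_tkn instead.

```python
def coverage(sequences, maxlen):
    """Fraction of sequences with at most maxlen tokens (these are padded, never truncated)."""
    if not sequences:
        raise ValueError("sequences must not be empty")
    return sum(len(seq) <= maxlen for seq in sequences) / len(sequences)


# Made-up token-id sequences standing in for x_train_tkn
toy_sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10], [11]]

longest = max(len(seq) for seq in toy_sequences)
print("longest sequence:", longest)
print("coverage at maxlen=3:", coverage(toy_sequences, 3))
print("coverage at maxlen=5:", coverage(toy_sequences, 5))
```

A coverage close to 1.0 at maxlen = 30 on the real training sequences would support the value chosen above.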

See how a sentence is transformed at each step: raw text, token ids, then the padded sequence

In [44]:
print(X_train[100])
print(x_train_tkn[100])
print(x_train_pad[100])

We waited for thirty minutes to be seated (although there were 8 vacant tables and we were the only folks waiting).
[17, 159, 14, 880, 88, 6, 29, 333, 410, 44, 24, 547, 881, 278, 2, 17, 24, 1, 65, 331, 330]
[ 17 159  14 880  88   6  29 333 410  44  24 547 881 278   2  17  24   1
  65 331 330   0   0   0   0   0   0   0   0   0]
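The padded row above can be reproduced by hand. This is a small plain-Python sketch of post-padding for a single sequence. pad_post is a hypothetical helper; when a sequence is longer than maxlen it keeps the last maxlen items, which is meant to mirror the Keras default of truncating from the front.

```python
def pad_post(seq, maxlen, value=0):
    """Append `value` until the sequence has maxlen items; if too long, keep the last maxlen items."""
    trimmed = list(seq[len(seq) - maxlen:]) if len(seq) > maxlen else list(seq)
    return trimmed + [value] * (maxlen - len(trimmed))


# Token ids printed for X_train[100] in the cell above
tokens = [17, 159, 14, 880, 88, 6, 29, 333, 410, 44, 24, 547, 881, 278, 2, 17, 24, 1, 65, 331, 330]

padded = pad_post(tokens, maxlen=30)
print(len(tokens), "tokens ->", len(padded), "after padding")
print(padded)
```

The 21 token ids are followed by 9 zeros, matching the output of pad_sequences with padding='post'.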


In [91]:
# Add 1 because index 0 is reserved (it is used for padding)
vocab_size = len(tokenizer.word_index) + 1
vocab_size

1825

**KerasClassifier takes a function that builds the complete model, so let's write it with every tunable parameter as an argument**

In [124]:
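def cnn_model(vocab_size, embedding_dim, maxlen, num_filters, kernel_size, l1_penalty, layers):
  # Note: the `layers` argument (units in the hidden Dense layer) shadows the imported
  # keras `layers` module inside this function; the name is kept to match param_grid.
  model = Sequential()
  # Map each token id to a dense vector of size embedding_dim
  model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen))
  # Slide num_filters filters of width kernel_size over the sequence, with an L1 weight penalty
  model.add(Conv1D(num_filters, kernel_size, activation='relu', kernel_regularizer=l1(l1_penalty)))
  # Keep the strongest response of each filter across the whole sentence
  model.add(GlobalMaxPooling1D())
  model.add(Dense(layers, activation='relu'))
  # Single sigmoid unit for binary sentiment (0 = negative, 1 = positive)
  model.add(Dense(1, activation='sigmoid'))
  model.compile(optimizer='adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
  return model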
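To get a feel for the size of this architecture, the trainable parameters can be counted by hand. The sketch below uses the usual formulas for Embedding, Conv1D and Dense layers with biases (GlobalMaxPooling1D has no weights). count_params and conv_output_length are hypothetical helpers, not part of Keras, and the example values are simply taken from the ranges tried in this notebook.

```python
def count_params(vocab_size, embedding_dim, num_filters, kernel_size, dense_units):
    """Trainable parameters per layer for Embedding -> Conv1D -> GlobalMaxPooling1D -> Dense -> Dense(1)."""
    counts = {
        "embedding": vocab_size * embedding_dim,                    # one vector per token id
        "conv1d": (kernel_size * embedding_dim + 1) * num_filters,  # filter weights + one bias per filter
        "dense": (num_filters + 1) * dense_units,                   # pooled features + bias
        "output": dense_units + 1,                                  # single sigmoid unit + bias
    }
    counts["total"] = sum(counts.values())
    return counts


def conv_output_length(maxlen, kernel_size):
    """Sequence length after a Conv1D with 'valid' padding and stride 1."""
    return maxlen - kernel_size + 1


counts = count_params(vocab_size=1800, embedding_dim=100, num_filters=64,
                      kernel_size=3, dense_units=32)
print(counts)
print("positions after Conv1D:", conv_output_length(maxlen=30, kernel_size=3))
```

With these values the embedding layer holds 180,000 of the 201,377 parameters, so vocab_size and embedding_dim dominate the model size on a dataset of only 1,000 sentences.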

**Pass in candidate values for each parameter, chosen using your own reasonable judgement**

In [125]:
param_grid = dict(vocab_size = [1000,1500,1800,2000],
                  embedding_dim = [50,100], 
                  maxlen = [30,50,100], 
                  num_filters = [32,64,128], 
                  kernel_size = [3,5,7], 
                  l1_penalty = [0.1, 0.01, 0.001, 0.0001], 
                  layers = [10,32]
)
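It helps to know how large this search space is before sampling from it. The short sketch below counts the combinations in the grid and compares that with the n_iter=50 and cv=4 settings used later in this notebook; only the standard library is needed.

```python
from math import prod

param_grid = dict(vocab_size=[1000, 1500, 1800, 2000],
                  embedding_dim=[50, 100],
                  maxlen=[30, 50, 100],
                  num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  l1_penalty=[0.1, 0.01, 0.001, 0.0001],
                  layers=[10, 32])

# Number of distinct combinations an exhaustive grid search would have to try
n_combinations = prod(len(values) for values in param_grid.values())

n_iter, cv = 50, 4  # settings passed to RandomizedSearchCV further down
print("combinations in the grid:", n_combinations)
print("share sampled by the search: {:.1%}".format(n_iter / n_combinations))
print("model fits:", n_iter * cv)
```

The grid holds 1,728 combinations, so 50 random draws explore roughly 3% of it, at a cost of 200 model fits.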

**Initialize KerasClassifier & RandomizedSearchCV**

Note: epochs and batch_size can also be varied as part of tuning

In [126]:
model1 = KerasClassifier(build_fn=cnn_model, 
                        epochs= 5,
                        batch_size = 16,
                        verbose = False
                        )

Note: Similarly, the number of iterations (n_iter) and the number of cross-validation folds (cv) can also be changed as part of tuning

In [127]:
grid = RandomizedSearchCV(estimator=model1, param_distributions=param_grid, cv=4, verbose=1, n_iter=50)

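**Note:** The warnings emitted by the fit call below most likely come from parameter combinations sampled by RandomizedSearchCV that do not match the preprocessed data (for example, a maxlen of 50 or 100 when the inputs were padded to 30). Training still completes, so they can be ignored here.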
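If the warnings do come from sampled values that disagree with the preprocessing (an assumption, since the warning text is not shown here), one option is to pin those parameters to the values used earlier. A sketch in plain Python, where pinned_grid is a hypothetical name:

```python
padded_len = 30   # maxlen used with pad_sequences above
max_words = 1000  # num_words passed to the Tokenizer above

param_grid = dict(vocab_size=[1000, 1500, 1800, 2000],
                  embedding_dim=[50, 100],
                  maxlen=[30, 50, 100],
                  num_filters=[32, 64, 128],
                  kernel_size=[3, 5, 7],
                  l1_penalty=[0.1, 0.01, 0.001, 0.0001],
                  layers=[10, 32])

# Copy the grid, replacing the two data-dependent entries with single fixed values
pinned_grid = {**param_grid, "maxlen": [padded_len], "vocab_size": [max_words]}
print(pinned_grid)
```

Passing pinned_grid as param_distributions would leave the search free to vary only the parameters that describe the model itself. Pinning vocab_size to max_words relies on the Tokenizer emitting only indices below num_words.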

In [128]:
grid_result = grid.fit(x_train_pad, y_train)

Fitting 4 folds for each of 50 candidates, totalling 200 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.




[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:  5.4min finished




In [136]:
  
print("Accuracy given by the Best model  :", grid_result.best_score_)
print("Hyperparameters of the Best model : ", grid_result.best_params_)

Accuracy given by the Best model  : 0.8262499868869781
Hyperparameters of the Best model :  {'vocab_size': 1800, 'num_filters': 64, 'maxlen': 100, 'layers': 32, 'l1_penalty': 0.0001, 'kernel_size': 3, 'embedding_dim': 100}


Evaluate on Test data

In [137]:
test_acc = grid.score(x_test_pad,y_test)

In [138]:
print("Accuracy on Test set: {}".format(test_acc))

Accuracy on Test set: 0.8399999737739563


* The model reached **84% accuracy on the test data**, above the 80% target and slightly higher than the best cross-validation score (82.6%).
* The model may be further improved by increasing the number of search iterations and tuning additional hyperparameters.
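With only 200 test sentences (20% of the 1000 rows), the 84% figure carries noticeable sampling uncertainty. A rough normal-approximation interval can be sketched as follows; accuracy_interval is a hypothetical helper, and the approximation is only a back-of-the-envelope check.

```python
from math import sqrt


def accuracy_interval(acc, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n independent samples."""
    se = sqrt(acc * (1 - acc) / n)  # standard error of a proportion
    return acc - z * se, acc + z * se


low, high = accuracy_interval(0.84, 200)
print("{:.3f} to {:.3f}".format(low, high))  # 0.789 to 0.891
```

The lower end sits just under 0.80, so the 80% target is met by the point estimate but a different random split could plausibly land on either side of it.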

**Appendix**

**References**:

http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/

Learning task: Fine tune your understanding on loss functions and optimizers using

https://medium.com/data-science-group-iitr/loss-functions-and-optimization-algorithms-demystified-bb92daff331c

----------------------------------------------

https://www.kaggle.com/sanikamal/text-classification-with-python-and-keras

https://machinelearningmastery.com/develop-n-gram-multichannel-convolutional-neural-network-sentiment-analysis/

Uses GloVe embeddings:
https://medium.com/saarthi-ai/sentence-classification-using-convolutional-neural-networks-ddad72c7048c

https://thedatafrog.com/en/articles/sentiment-analysis-convolutional-network/

https://thedatafrog.com/en/articles/text-preprocessing-machine-learning-yelp/


For future enhancements:

Clean up the text (using the code below) and retrain the model.

In [None]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Translation table that deletes every punctuation character
table = str.maketrans('', '', string.punctuation)

In [None]:
def clean_doc(doc):
    # split into tokens by white space
    tokens = doc.split()

    # remove punctuation from each token
    tokens = [w.translate(table) for w in tokens]

    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]

    # remove stop words (case-sensitive match against the lowercase NLTK list)
    tokens = [w for w in tokens if w not in stop_words]

    # filter out short tokens
    tokens = [word for word in tokens if len(word) > 1]

    return tokens

In [None]:
X_train1 = [clean_doc(t) for t in X_train]
X_test1 = [clean_doc(t) for t in X_test]
X_train1[4]

['Just', 'lunch', 'great', 'experience']
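'Just' most likely survives in the output above because the NLTK stop-word list is lowercase while clean_doc compares tokens case-sensitively. A possible variant, sketched below, lowercases first and keeps negation words, which often matter for sentiment. The small STOP_WORDS and NEGATIONS sets and the name clean_doc_lower are assumptions made so the snippet runs without NLTK; in the notebook you would pass set(stopwords.words('english')) instead.

```python
import string

# Stand-in stop-word list (assumption); replace with the NLTK list in the notebook
STOP_WORDS = {"just", "a", "the", "was", "is", "it", "not", "no", "and", "had"}
# Words kept even if they appear in the stop-word list
NEGATIONS = {"not", "no", "never"}
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_doc_lower(doc, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Strip punctuation, lowercase, then drop non-alphabetic, one-letter and stop-word tokens."""
    tokens = [w.translate(_PUNCT_TABLE).lower() for w in doc.split()]
    return [w for w in tokens
            if w.isalpha() and len(w) > 1 and (w in keep or w not in stop_words)]


print(clean_doc_lower("Just had lunch, it was not a great experience."))
```

Here "Just" is removed after lowercasing, while "not" is retained so that "not great" is not reduced to "great".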