<a href="https://colab.research.google.com/github/yashwanth-kokkanti/kerasPractise/blob/main/imdbTextProcessingKeras.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
### This notebook demonstrates text classification with Keras on the IMDB movie-review dataset



In [2]:
import tensorflow as tf
import numpy as np

In [3]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  17.8M      0  0:00:04  0:00:04 --:--:-- 17.9M


In [4]:
!ls aclImdb

imdbEr.txt  imdb.vocab	README	test  train


In [5]:
!rm -r aclImdb/train/unsup

In [7]:
batch_size = 32

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=1337,
)

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=1337,
)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)


print(
    "Number of batches in raw_train_ds: %d"
    % tf.data.experimental.cardinality(raw_train_ds)
)
print(
    "Number of batches in raw_val_ds: %d" % tf.data.experimental.cardinality(raw_val_ds)
)
print(
    "Number of batches in raw_test_ds: %d"
    % tf.data.experimental.cardinality(raw_test_ds)
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782
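
The batch counts above follow from the file counts in the log: each dataset is cut into batches of 32, and the last batch may be partial. A minimal sketch of that arithmetic in plain Python (the helper name `num_batches` is made up for illustration and is not part of TensorFlow):

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches when the final, possibly smaller, batch is kept."""
    return math.ceil(num_examples / batch_size)

batch_size = 32

# File counts as reported by text_dataset_from_directory in the log above.
print(num_batches(20000, batch_size))  # training split
print(num_batches(5000, batch_size))   # validation split
print(num_batches(25000, batch_size))  # test set
```

The validation and test sets do not divide evenly by 32, so their last batch holds fewer than 32 reviews.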


In [9]:
## Let's preview a few examples from the first training batch

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(5):
    print(text_batch.numpy()[i])
    print(label_batch.numpy()[i])

b'I\'ve seen tons of science fiction from the 70s; some horrendously bad, and others thought provoking and truly frightening. Soylent Green fits into the latter category. Yes, at times it\'s a little campy, and yes, the furniture is good for a giggle or two, but some of the film seems awfully prescient. Here we have a film, 9 years before Blade Runner, that dares to imagine the future as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G. Robinson fare far better in this than The Ten Commandments, and Robinson\'s assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of the attitudes are dated (can you imagine a filmmaker getting away with the "women as furniture" concept in our oh-so-politically-correct-90s?), but it\'s rare to find a film from the Me Decade that actually can make you think. This is one I\'d love to see on the big screen, because even in a widescreen presentation, I don\'t think the overall scope of this film would receive its

In [12]:
## Prepare the data: lowercase the text, replace "<br />" tags with spaces, and strip punctuation

from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
import string
import re


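def custom_standardization(input_data):
  # Lowercase everything so that "Good" and "good" share one token.
  lowercase = tf.strings.lower(input_data)
  # The reviews contain literal "<br />" line breaks; replace them with spaces.
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  # Remove every ASCII punctuation character.
  return tf.strings.regex_replace(
      stripped_html, "[%s]" % re.escape(string.punctuation), ""
  )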

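max_features = 20000   # maximum vocabulary size
embedding_dim = 128    # size of each learned word vector
sequence_length = 500  # pad or truncate every review to 500 tokens

vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Build the vocabulary from the training text only (drop the labels first).
text_ds = raw_train_ds.map(lambda x, y: x)

vectorize_layer.adapt(text_ds)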
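
To make the preprocessing concrete, here is a plain-Python stand-in for what the vectorization step does to one review: standardize, split on whitespace, look up integer ids, then pad or truncate to a fixed length. It is only a sketch. The helper names and the three-word vocabulary are hypothetical, and the real ids come from `vectorize_layer.adapt()`. It assumes id 0 is padding and id 1 is the out-of-vocabulary token.

```python
import re
import string

def standardize(text):
    """Plain-Python equivalent of custom_standardization (illustration only)."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

def toy_vectorize(text, vocab, sequence_length):
    """Map tokens to ids; 0 = padding, 1 = out-of-vocabulary (assumed convention)."""
    # Hypothetical vocabulary: known tokens get ids starting at 2.
    token_to_id = {token: i + 2 for i, token in enumerate(vocab)}
    ids = [token_to_id.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]                       # truncate long reviews
    return ids + [0] * (sequence_length - len(ids))   # pad short reviews

review = "Great movie!<br />Loved it."
print(standardize(review))
print(toy_vectorize(review, ["movie", "great", "it"], sequence_length=6))
```

Only the literal `<br />` tag is replaced before punctuation is stripped, so any other HTML tag would lose its angle brackets but keep its letters.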

In [15]:
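## There are two ways to vectorize the text data.

## Option 1: make vectorization part of the model, followed by an Embedding layer.
## (Shown for illustration; the model trained below uses option 2.)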

text_input = tf.keras.Input(shape=(1, ), dtype=tf.string, name='text')

x = vectorize_layer(text_input)
x = tf.keras.layers.Embedding(max_features + 1, embedding_dim)(x)


In [18]:
## Option 2: apply the layer to the text dataset to obtain a dataset of word indices

def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

## Vectorize Data 
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)



# Cache the datasets and prefetch batches asynchronously for best performance on GPU
# (taken as-is from the Keras example).
train_ds = train_ds.cache().prefetch(buffer_size=10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

In [21]:
## Building Model

from tensorflow.keras import layers 

## Integer input for Vocab indices 
inputs = tf.keras.Input(shape=(None, ), dtype='int64')

## Next, add a layer to map those vocab indices into a space of dimensionality `embedding_dim`
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

## Conv1D + global max pooling 
x = layers.Conv1D(128, 7, padding='valid', activation='relu', strides=3)(x)
x = layers.Conv1D(128, 7, padding='valid', activation='relu', strides=3)(x)
x = layers.GlobalMaxPooling1D()(x)

## Add a vanilla hidden layer
x = layers.Dense(128, activation='relu')(x)
x = layers.Dropout(0.5)(x)

## Project onto a single-unit output layer and squash it with a sigmoid
predictions = layers.Dense(1, activation='sigmoid', name='predictions')(x)

model = tf.keras.Model(inputs, predictions)

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
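
The two strided convolutions shrink the sequence quickly before the global max pooling. A small sketch of the length arithmetic, assuming the 500-token inputs produced by the vectorization step (the helper name is made up for illustration):

```python
def conv1d_valid_output_length(input_length, kernel_size, stride):
    """Output length of a 1D convolution with padding='valid'."""
    return (input_length - kernel_size) // stride + 1

length = 500  # sequence_length from the vectorization step
for layer_index in (1, 2):
    # Both Conv1D layers use kernel size 7 and stride 3.
    length = conv1d_valid_output_length(length, kernel_size=7, stride=3)
    print("after Conv1D %d: %d time steps" % (layer_index, length))
```

GlobalMaxPooling1D then takes the maximum over the remaining time steps, leaving one 128-dimensional vector per review.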

In [22]:
epochs = 3

model.fit(train_ds, validation_data=val_ds, epochs=epochs)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<tensorflow.python.keras.callbacks.History at 0x7f24cf1246d8>

In [23]:
## Evaluate the model on test set 

model.evaluate(test_ds)



[0.40894073247909546, 0.8620399832725525]

In [25]:
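## Build an end-to-end model that accepts raw strings

# A string input, one review per example
inputs = tf.keras.Input(shape=(1, ), dtype='string')

# Turn the strings into vocab indices with the adapted vectorization layer
indices = vectorize_layer(inputs)

# Reuse the trained model on those indices
outputs = model(indices)

end_to_end_model = tf.keras.Model(inputs, outputs)

end_to_end_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Evaluate directly on the raw (unvectorized) test reviews
end_to_end_model.evaluate(raw_test_ds)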



[0.4089408218860626, 0.8620399832725525]
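
The model ends in a single sigmoid unit, so each prediction is a score between 0 and 1. A minimal sketch of turning such scores into labels, under two assumptions: a 0.5 decision threshold, and label 0 = `neg`, label 1 = `pos` (class directories taken in alphabetical order). The scores below are made-up values, not output from this notebook.

```python
# Assumed label order: class directories sorted alphabetically.
CLASS_NAMES = ["neg", "pos"]

def to_labels(probabilities, threshold=0.5):
    """Map sigmoid scores to class names using a fixed threshold."""
    return [CLASS_NAMES[int(p >= threshold)] for p in probabilities]

# Hypothetical scores standing in for model predictions.
example_scores = [0.93, 0.08, 0.51]
print(to_labels(example_scores))
```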