<a href="https://colab.research.google.com/github/sushant-97/keras_projects/blob/main/Text_classification_from_scratch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow as tf
import numpy as np

In [2]:
# Loading dataset - IMDB
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  10.2M      0  0:00:07  0:00:07 --:--:-- 15.8M


In [3]:
!ls aclImdb
!ls aclImdb/test
!ls aclImdb/train

imdbEr.txt  imdb.vocab	README	test  train
labeledBow.feat  neg  pos  urls_neg.txt  urls_pos.txt
labeledBow.feat  pos	unsupBow.feat  urls_pos.txt
neg		 unsup	urls_neg.txt   urls_unsup.txt


In [4]:
!cat aclImdb/train/pos/6249_7.txt

Hundstage is an intentionally ugly and unnerving study of life in a particularly dreary suburb of Vienna. It comes from former documentary director Ulrich Seidl who adopts a very documentary-like approach to the material. However, the film veers away from normal types and presents us with characters that are best described as "extremes"  some are extremely lonely; some extremely violent; some extremely weird; some extremely devious; some extremely frustrated and misunderstood; and so on. The film combines several near plot less episodes which intertwine from time to time, each following the characters over a couple of days during a sweltering Viennese summer. Very few viewers will come away from the film feeling entertained  the intention is to point up the many things that are wrong with people, the many ills that plague our society in general. It is a thought-provoking film and its conclusions are pretty damning on the whole.<br /><br />A fussy old widower fantasises about his elde

We are only interested in the pos and neg subfolders, so let's delete the rest:

In [5]:
!rm -r aclImdb/train/unsup

In [6]:
batch_size = 32
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "training",
    seed = 1337,
)

raw_val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size,
    validation_split = 0.2,
    subset = "validation",
    seed = 1337,
)

raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = batch_size
)

print(f"Number of batches in raw_train_ds: {raw_train_ds.cardinality()}")
print(f"Number of batches in raw_val_ds: {raw_val_ds.cardinality()}")
print(f"Number of batches in raw_test_ds: {raw_test_ds.cardinality()}")

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
Number of batches in raw_train_ds: 625
Number of batches in raw_val_ds: 157
Number of batches in raw_test_ds: 782


In [7]:
# Let's preview few samples

for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(5):
    print(text_batch.numpy()[i])
    print(label_batch.numpy()[i])

b'I\'ve seen tons of science fiction from the 70s; some horrendously bad, and others thought provoking and truly frightening. Soylent Green fits into the latter category. Yes, at times it\'s a little campy, and yes, the furniture is good for a giggle or two, but some of the film seems awfully prescient. Here we have a film, 9 years before Blade Runner, that dares to imagine the future as somthing dark, scary, and nihilistic. Both Charlton Heston and Edward G. Robinson fare far better in this than The Ten Commandments, and Robinson\'s assisted-suicide scene is creepily prescient of Kevorkian and his ilk. Some of the attitudes are dated (can you imagine a filmmaker getting away with the "women as furniture" concept in our oh-so-politically-correct-90s?), but it\'s rare to find a film from the Me Decade that actually can make you think. This is one I\'d love to see on the big screen, because even in a widescreen presentation, I don\'t think the overall scope of this film would receive its

In [8]:
# Prepare the Data
# i.e. remove <br /> tags

In [9]:
from tensorflow.keras.layers import TextVectorization
import string
import re

In [10]:
# HTML tags will not be removed from Standardizer and
# need to create custom Standardization function

def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
  return tf.strings.regex_replace(
      stripped_html, f"[{re.escape(string.punctuation)}]", ""
  )

# Model Constants:
max_features = 20000
embedding_dim = 128
sequence_length = 500


# we can now instantiate text vectorization layer
# vectorization_layer - normalize, split, map strings to int

vectorize_layer = TextVectorization(
    standardize = custom_standardization,
    max_tokens = max_features,
    output_mode = 'int',
    output_sequence_length = sequence_length,
)

# Now that the vocab layer has been created, call `adapt` on a text-only
# dataset to create the vocabulary. You don't have to batch, but for very large
# datasets this means you're not keeping spare copies of the dataset in memory.

# Let's make a text-only dataset (no labels):
text_ds = raw_train_ds.map(lambda x, y: x)

#Let's call 'adapt:
vectorize_layer. adapt(text_ds)

In [26]:
# Two options to vectorize data
# op1: make it part of model
# op2: apply it to the text dataset

# op2 will allows us to do asynchronous CPU processing and buffering of data when training on GPU
# we will use op2

In [11]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

# Vectorize the data:
train_ds = raw_train_ds.map(vectorize_text)
val_ds = raw_val_ds.map(vectorize_text)
test_ds = raw_test_ds.map(vectorize_text)

# Do async prefetching / buffering of the data for best performance on GPU.
train_ds = train_ds.cache().prefetch(buffer_size = 10)
val_ds = val_ds.cache().prefetch(buffer_size=10)
test_ds = test_ds.cache().prefetch(buffer_size=10)

**Build Model**

Simple 1D convnet staring with an Embedding layer

In [12]:
from tensorflow.keras import layers

# A integer input for vocab indices
inputs = tf.keras.Input(shape = (None,), dtype = 'int64')

# Next, we add a layer to map those vocab indices into a space of dimensionality: 'embedding_dim'
x = layers.Embedding(max_features, embedding_dim)(inputs)
x = layers.Dropout(0.5)(x)

# Conv1D + Global max pooling
x = layers.Conv1D(128, 7, padding = 'valid', activation = 'relu', strides = 3)(x)
x = layers.Conv1D(128, 7, padding = 'valid', activation = 'relu', strides = 3)(x)
x = layers.GlobalMaxPooling1D()(x)

# Vanila Hidden Layer
x = layers.Dense(128, activation = 'relu')(x)
x = layers.Dropout(0.5)(x)

# We project onto a single unit output layer and squash it with a sigmoid:
predictions = layers.Dense(1, activation = 'sigmoid', name = 'predictions')(x)

model = tf.keras.Model(inputs, predictions)

# Compile the model with binary crossentropy loss and an adam optimizer
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])

In [13]:
epochs = 4

# Fit the model using the train and test datasets
model.fit(train_ds, validation_data = val_ds, epochs = epochs)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f5607ecc8d0>

In [15]:
# Evaluating model on the test set
model.evaluate(test_ds)



[0.14115910232067108, 0.9625200033187866]

**Making an end-to-end model**

In [17]:
# A String input
inputs = tf.keras.Input(shape = (1,), dtype = 'string')

# Turn string into vocab indices
indices = vectorize_layer(inputs)

#Turn vocab indices into predictions
outputs = model(indices)

#end2end model
end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['accuracy']
)

# Test it with 'raw_test_ds' whcih yields raw strings
end_to_end_model.evaluate(raw_test_ds)



[0.1411590874195099, 0.9625200033187866]