## Reference
https://www.tensorflow.org/tutorials/text/word_embeddings

In [None]:
import io
import os
import re
import shutil
import string
import tensorflow as tf

from datetime import datetime
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [None]:
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                    untar=True, cache_dir='.',
                                    cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


['imdb.vocab', 'test', 'imdbEr.txt', 'README', 'train']

In [None]:
dataset_dir

'./aclImdb'

In [None]:
dataset

'./aclImdb_v1.tar.gz'

In [None]:
!ls -lrt

total 82168
drwxr-xr-x 4 7297 1000     4096 Jun 26  2011 aclImdb
drwxr-xr-x 1 root root     4096 Jan 20 17:27 sample_data
-rw-r--r-- 1 root root 84125825 Jan 26 01:52 aclImdb_v1.tar.gz.tar.gz


In [None]:
train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)

['unsupBow.feat',
 'neg',
 'pos',
 'unsup',
 'urls_pos.txt',
 'urls_neg.txt',
 'urls_unsup.txt',
 'labeledBow.feat']

In [None]:
!ls -lrt ./aclImdb/train/pos | head -3

total 51624
-rw-r--r-- 1 7297 1000   975 Apr 12  2011 99_8.txt
-rw-r--r-- 1 7297 1000   638 Apr 12  2011 98_10.txt


In [None]:
!cat ./aclImdb/train/pos/99_8.txt 

A Christmas Together actually came before my time, but I've been raised on John Denver and the songs from this special were always my family's Christmas music. For years we had a crackling cassette made from a record that meant it was Christmas. A few years ago, I was finally able to track down a video of it on Ebay, so after listening to all the music for some 21 years, I got to see John and the Muppets in action for myself. If you ever get the chance, it's a lot of fun--great music, heart-warming and cheesy. It's also interesting to see the 70's versions of the Muppets and compare them to their newer versions today. I believe Denver actually took some heat for doing a show like this--I guess normally performers don't compromise their images by doing sing-a-longs with the Muppets, but I'm glad he did. Even if you can't track down the video, the soundtrack is worth it too. It has some Muppified traditional favorites, but also some original Denver tunes as well.

In [None]:
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)
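
The `unsup` folder is removed because `text_dataset_from_directory` treats each subdirectory as one class and assigns labels in sorted name order, so leaving it in place would add a third class to a binary sentiment task. The sketch below imitates that inference with the standard library only. `infer_classes` is a hypothetical helper written for illustration, not part of TensorFlow, and the temporary directory stands in for `aclImdb/train`.

```python
import os
import shutil
import tempfile


def infer_classes(directory):
    """Map each immediate subdirectory name to a label index, in sorted order."""
    names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(names)}


with tempfile.TemporaryDirectory() as root:
    # Toy stand-in for aclImdb/train: three folders plus one plain file.
    for name in ("neg", "pos", "unsup"):
        os.makedirs(os.path.join(root, name))
    open(os.path.join(root, "urls_pos.txt"), "w").close()  # files are ignored

    before = infer_classes(root)
    shutil.rmtree(os.path.join(root, "unsup"))  # same call as in the notebook
    after = infer_classes(root)

print(before)
print(after)
```

With `unsup` gone, only `neg` (label 0) and `pos` (label 1) remain, which matches the "2 classes" reported when the dataset is loaded.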

In [None]:
!ls -lrt ./aclImdb/train

total 65200
-rw-r--r-- 1 7297 1000  2450000 Apr 12  2011 urls_unsup.txt
drwxr-xr-x 2 7297 1000   348160 Apr 12  2011 pos
drwxr-xr-x 2 7297 1000   356352 Apr 12  2011 neg
-rw-r--r-- 1 7297 1000   612500 Apr 12  2011 urls_pos.txt
-rw-r--r-- 1 7297 1000   612500 Apr 12  2011 urls_neg.txt
-rw-r--r-- 1 7297 1000 21021197 Apr 12  2011 labeledBow.feat
-rw-r--r-- 1 7297 1000 41348699 Apr 12  2011 unsupBow.feat


In [None]:
batch_size = 1024
seed = 123

# Both calls use the same seed so the training and validation subsets do not overlap.
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    train_dir, batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
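
The 20000/5000 counts follow from `validation_split=0.2` applied to 25000 files. The sketch below reproduces the arithmetic and shows why both calls need the same seed: each call shuffles the file list independently, and only an identical shuffle guarantees that the two subsets are complementary. `split_indices` is a hypothetical helper, and Python's `random` module stands in for whatever shuffling Keras uses internally, so the actual file assignment will differ.

```python
import random


def split_indices(num_samples, validation_split, seed):
    """Shuffle indices with a fixed seed, then hold out the tail for validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_val = int(validation_split * num_samples)
    cut = num_samples - num_val
    return indices[:cut], indices[cut:]


# Two independent calls, as in the notebook: one per subset.
train_idx, _ = split_indices(25000, 0.2, seed=123)
_, val_idx = split_indices(25000, 0.2, seed=123)

print(len(train_idx), len(val_idx))
print(set(train_idx).isdisjoint(val_idx))
```

Using a different seed in the second call would shuffle the files differently, so some reviews could land in both subsets and others in neither.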


In [None]:
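import random

# Print five random examples from the first training batch.
for text_batch, label_batch in train_ds.take(1):
  # Sample from the actual batch length, starting at index 0.
  idx = random.sample(range(label_batch.shape[0]), 5)
  for i in idx:
    print(i, label_batch[i].numpy(), text_batch[i].numpy())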

974 0 b'Bloodsuckers has the potential to be a somewhat decent movie, the concept of military types tracking down and battling vampires in space is one with some potential in the cheesier realm of things. Even the idea of the universe being full of various different breeds of vampire, all with different attributes, many of which the characters have yet to find out about, is kind of cool as well. As to how most of the life in the galaxy outside of earth is vampire, I\'m not sure how the makers meant for that to work, given the nature of vampires. Who the hell they are meant to be feeding on if almost everyone is a vampire I don\'t know. As it is the movie comes across a low budget mix of Firefly/Serenity and vampires movies with a dash of Aliens.<br /><br />The action parts of the movie are pretty average and derivative (Particularly of Serenity) but passable- they are reasonably well executed and there is enough gore for a vampire flick, including some of the comical blood-spurting var

In [None]:
type(train_ds)

tensorflow.python.data.ops.dataset_ops.BatchDataset
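
A `BatchDataset` yields whole batches rather than single reviews. As a rough sketch of what that means for the 20000 training files with `batch_size = 1024`, the helper below lists the expected batch sizes. It assumes the final partial batch is kept rather than dropped, and `batch_sizes` is a hypothetical function, not a TensorFlow API.

```python
def batch_sizes(num_examples, batch_size):
    """Sizes of consecutive batches when the final partial batch is kept."""
    full, remainder = divmod(num_examples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(20000, 1024)
print(len(sizes), sizes[0], sizes[-1])
```

Under that assumption the last batch is smaller than `batch_size`, so any code that indexes into a batch should use the batch's actual length instead of the configured size.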