
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
dataset = [
    "The cat sat on the mat",
    "Dad sat on the cat",
    "Cat is mad"
]

text_vectorization = layers.TextVectorization(
    output_mode = "int"
)



In [None]:
text_vectorization.adapt(dataset)
vocabulary = text_vectorization.get_vocabulary()
vocabulary

['', '[UNK]', 'the', 'cat', 'sat', 'on', 'mat', 'mad', 'is', 'dad']

In [None]:
# text_vectorization was already adapted in the previous cell, so the layer
# and its vocabulary are reused here.

test_sentence = "Cat is on the floor"

encoded_sentence = text_vectorization(test_sentence).numpy()

decoded_sentence = " ".join(vocabulary[i] for i in encoded_sentence)
decoded_sentence

'cat is on the [UNK]'
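
The round trip above can be approximated in plain Python to show what the layer is doing: lowercase, strip punctuation, split on whitespace, then look each token up in a frequency-ordered vocabulary where index 0 is reserved for padding and index 1 for unknown tokens. This is only a sketch, not the Keras implementation; the helper names (`standardize`, `build_vocabulary`, `encode`) are hypothetical, and the tie-breaking order among equally frequent tokens is an assumption, so the vocabulary order may differ from the layer's.

```python
import string

def standardize(text):
    # Lowercase and strip ASCII punctuation, approximating the layer's
    # default standardization.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocabulary(sentences):
    # Count every token across the corpus.
    counts = {}
    for sentence in sentences:
        for token in standardize(sentence).split():
            counts[token] = counts.get(token, 0) + 1
    # Most frequent first; ties broken alphabetically (an assumption of this
    # sketch, the Keras layer may order ties differently).
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    # Index 0 is the padding token, index 1 the unknown-token marker.
    return ["", "[UNK]"] + ordered

def encode(sentence, vocabulary):
    # Map each token to its index; out-of-vocabulary tokens map to 1.
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(token, 1) for token in standardize(sentence).split()]

toy_dataset = ["The cat sat on the mat", "Dad sat on the cat", "Cat is mad"]
toy_vocabulary = build_vocabulary(toy_dataset)
toy_encoded = encode("Cat is on the floor", toy_vocabulary)
toy_decoded = " ".join(toy_vocabulary[i] for i in toy_encoded)
print(toy_decoded)
```

As in the cell above, "floor" never appeared in the training sentences, so it decodes to `[UNK]`.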

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  9464k      0  0:00:08  0:00:08 --:--:-- 13.0M


In [None]:
!tar -xf aclImdb_v1.tar.gz

In [None]:
!rm -r /content/aclImdb/train/unsup

In [None]:
!cat /content/aclImdb/train/pos/10000_8.txt

Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without th

In [None]:
import os, pathlib, shutil, random

base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"

# Move 20% of the training files of each class into a validation directory.
# Run this cell only once: rerunning it moves a further 20% of what remains.
for category in ("neg", "pos"):
    os.makedirs(val_dir / category, exist_ok=True)
    files = os.listdir(train_dir / category)
    # Seeded shuffle so the split is repeatable for a given listing order.
    random.Random(1).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[:num_val_samples]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)


In [None]:
files = os.listdir(val_dir / "pos")
len(files)

2500
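
The count of 2500 follows from the split arithmetic: each class starts with 12,500 training reviews, and 20% of 12,500 is 2,500. The sketch below reproduces that logic on a list of names without touching the file system. The `split_filenames` helper and the generated file names are hypothetical, and sorting before the shuffle is an addition of this sketch (the cell above shuffles the `os.listdir` output directly).

```python
import random

def split_filenames(filenames, val_fraction=0.2, seed=1):
    # Sort first so the result does not depend on directory listing order
    # (an addition in this sketch, not part of the original cell).
    shuffled = sorted(filenames)
    random.Random(seed).shuffle(shuffled)
    num_val = int(val_fraction * len(shuffled))
    # The first num_val names become validation, the rest stay in training.
    return shuffled[num_val:], shuffled[:num_val]

# Hypothetical file names standing in for one class directory.
example_names = [f"{i}_8.txt" for i in range(12500)]
train_names, val_names = split_filenames(example_names)
print(len(train_names), len(val_names))
```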

In [None]:
from tensorflow.keras.utils import text_dataset_from_directory

batch_size = 32

train_ds = text_dataset_from_directory(
    "/content/aclImdb/train",
    batch_size = batch_size
)


val_ds = text_dataset_from_directory(
    "/content/aclImdb/val",
    batch_size = batch_size
)

test_ds = text_dataset_from_directory(
    "/content/aclImdb/test",
    batch_size = batch_size
)

Found 20000 files belonging to 2 classes.
Found 5000 files belonging to 2 classes.
Found 25000 files belonging to 2 classes.


In [None]:
for inputs, targets in train_ds:
    print(inputs.shape)
    print(targets.shape)
    print(inputs[0])
    print(targets[0])
    break

(32,)
(32,)
tf.Tensor(b"Having not seen the previous two in the trilogy of Bourne movies, I was a little reluctant to watch The Bourne Ultimatum.<br /><br />However it was a very thrilling experience and I didn't have the problem of not understanding what was happening due to not seeing the first two films. Each part of the story was easy to understand and I fell in love with The Bourne Ultimatum before it had reached the interval! I don't think I have ever watched such an exquisitely made, and gripping film, especially an action film. Since I usually shy away from action and thriller type movies, this was such great news to me. Ultimatum is one of the most enthralling films, it grabs your attention from the first second till the last minute before the credits roll.<br /><br />Matt Damon was simply fantastic as his role as Jason Bourne. I've heard a lot about his great performances in the Bourne 1+2, and now, this fabulous actor has one more to add to his list. I look forward to seeing

In [None]:
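text_vectorization = layers.TextVectorization(
    max_tokens=20000,          # cap the vocabulary at 20,000 tokens
    output_mode="multi_hot"    # one binary vector per review
)

# Build the vocabulary from the raw training text only (labels dropped).
text_only_train_ds = train_ds.map(lambda x, y: x)
text_vectorization.adapt(text_only_train_ds)

# Vectorize every split with the vocabulary learned from the training set.
binary_1gram_train_ds = train_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_val_ds = val_ds.map(lambda x, y: (text_vectorization(x), y))
binary_1gram_test_ds = test_ds.map(lambda x, y: (text_vectorization(x), y))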



In [None]:
for inputs, targets in binary_1gram_train_ds:
    print(inputs.shape)
    print(targets.shape)
    print(inputs[0])
    print(targets[0])
    break

(32, 20000)
(32,)
tf.Tensor([1. 1. 1. ... 0. 0. 0.], shape=(20000,), dtype=float32)
tf.Tensor(1, shape=(), dtype=int32)
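
Each review is now a fixed-length vector of 0s and 1s: position i is 1 if vocabulary token i occurs anywhere in the review. Word order and repeat counts are discarded. A minimal plain-Python sketch of that encoding, using a hypothetical `multi_hot` helper and made-up token ids:

```python
def multi_hot(token_ids, vocabulary_size):
    # Start from an all-zero vector and switch on every index that occurs.
    vector = [0.0] * vocabulary_size
    for token_id in token_ids:
        vector[token_id] = 1.0
    return vector

# Hypothetical token ids for one short document; id 3 appears twice but
# still contributes a single 1.
example_ids = [3, 8, 5, 2, 1, 3]
example_vector = multi_hot(example_ids, vocabulary_size=10)
print(example_vector)
```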


In [None]:
def get_model(max_tokens = 20000, hidden_dim = 16):
    inputs = keras.Input(shape = (max_tokens, ))
    x = layers.Dense(hidden_dim, activation = "relu")(inputs)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation = "sigmoid")(x)
    mdl = keras.Model(inputs, outputs)
    mdl.compile(optimizer = "adam", loss = "binary_crossentropy", metrics = ["accuracy"])
    return mdl
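
With the default arguments this is a small model, and almost all of its weights sit in the first layer. A Dense layer with `units` outputs on an `input_dim`-dimensional input holds `input_dim * units` weights plus `units` biases. The arithmetic can be checked with a short sketch (the `dense_param_count` helper is hypothetical, and Dropout adds no parameters):

```python
def dense_param_count(input_dim, units):
    # One weight per (input, unit) pair plus one bias per unit.
    return input_dim * units + units

hidden_params = dense_param_count(20000, 16)  # Dense(16) on 20,000 inputs
output_params = dense_param_count(16, 1)      # Dense(1) on 16 inputs
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```

Calling `get_model().summary()` should report the same totals.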

In [None]:
mdl = get_model()
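# Keep only the checkpoint with the best validation loss.
# fit() documents `callbacks` as a list, so wrap the single callback in one.
callbacks = [keras.callbacks.ModelCheckpoint("binary_1gram.keras", save_best_only=True)]
mdl.fit(binary_1gram_train_ds, validation_data=binary_1gram_val_ds, epochs=10, callbacks=callbacks)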

mdl = keras.models.load_model("binary_1gram.keras")
mdl.evaluate(binary_1gram_test_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.2886897027492523, 0.8788800239562988]

In [None]:
text_vectorization = layers.TextVectorization(
    max_tokens = 20000,
    output_mode = "multi_hot",
    ngrams = 2
)

text_only_train_ds = train_ds.map(lambda x , y : x)
text_vectorization.adapt(text_only_train_ds)

binary_2gram_train_ds = train_ds.map(lambda x,y : (text_vectorization(x), y))
binary_2gram_val_ds = val_ds.map(lambda x,y : (text_vectorization(x), y))
binary_2gram_test_ds = test_ds.map(lambda x,y : (text_vectorization(x), y))

mdl = get_model()
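# Keep only the checkpoint with the best validation loss.
# fit() documents `callbacks` as a list, so wrap the single callback in one.
callbacks = [keras.callbacks.ModelCheckpoint("binary_2gram.keras", save_best_only=True)]
mdl.fit(binary_2gram_train_ds, validation_data=binary_2gram_val_ds, epochs=10, callbacks=callbacks)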

mdl = keras.models.load_model("binary_2gram.keras")
mdl.evaluate(binary_2gram_test_ds)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


[0.2583639621734619, 0.8949199914932251]
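
Test accuracy rises from about 0.879 with single words to about 0.895 once word pairs are included. With `ngrams = 2` the layer generates n-grams up to length 2, so the vocabulary holds both single tokens and adjacent pairs, which lets the model see local phrases. A rough plain-Python sketch of the terms produced for one sentence (the `unigrams_and_bigrams` helper is hypothetical and skips punctuation stripping):

```python
def unigrams_and_bigrams(text):
    # Lowercase and split on whitespace (punctuation stripping is omitted
    # in this sketch).
    tokens = text.lower().split()
    # Pair each token with its right-hand neighbour to form bigrams.
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

example_terms = unigrams_and_bigrams("the cat sat on the mat")
print(example_terms)
```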