<a href="https://colab.research.google.com/github/schauppi/tensorflow_datasets/blob/schauppi/amazon_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#path to data
test_data = "/content/drive/MyDrive/datasets/test.ft.txt"
train_data = "/content/drive/MyDrive/datasets/train.ft.txt"

In [5]:
#preprocess test data - split labels and sentences
#each line starts with a 10-character label ("__label__1" or "__label__2") followed by a space
with open(test_data, "r") as f:
  test_data = f.readlines()
test_sentences = []
test_labels = []
for line in test_data:
  test_labels.append(line[:10])
  test_sentences.append(line[11:])

In [6]:
#preprocess train data - split labels and sentences
#each line starts with a 10-character label ("__label__1" or "__label__2") followed by a space
with open(train_data, "r") as f:
  train_data = f.readlines()
train_sentences = []
train_labels = []
for line in train_data:
  train_labels.append(line[:10])
  train_sentences.append(line[11:])
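The slicing above relies on every line starting with a 10-character label followed by exactly one space. A minimal sketch of a more defensive split follows; `split_fasttext_line` is a hypothetical helper that is not part of this notebook, and it assumes one label per line.

```python
def split_fasttext_line(line):
    """Split '__label__N some text' into (label, text).

    Hypothetical helper: assumes exactly one label per line, separated
    from the review text by the first space.
    """
    label, _, text = line.partition(" ")
    if not label.startswith("__label__"):
        raise ValueError(f"unexpected line format: {line[:30]!r}")
    return label, text


sample = "__label__2 Great CD: My lovely Pat has one of the GREAT voices of her generation.\n"
label, text = split_fasttext_line(sample)
print(label)
print(text)
```

For well-formed lines this returns the same pieces as `line[:10]` and `line[11:]`, but it fails loudly on a line that does not start with a label.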

In [7]:
for i in range(10):
  print(test_sentences[i])

Great CD: My lovely Pat has one of the GREAT voices of her generation. I have listened to this CD for YEARS and I still LOVE IT. When I'm in a good mood it makes me feel better. A bad mood just evaporates like sugar in the rain. This CD just oozes LIFE. Vocals are jusat STUUNNING and lyrics just kill. One of life's hidden gems. This is a desert isle CD in my book. Why she never made it big is just beyond me. Everytime I play this, no matter black, white, young, old, male, female EVERYBODY says one thing "Who was that singing ?"

One of the best game music soundtracks - for a game I didn't really play: Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums. There is an incredible mix of fun, epic, and emotional songs. Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in

In [8]:
for i in range(10):
  print(train_sentences[i])

Stuning even for the non-gamer: This sound track was beautiful! It paints the senery in your mind so well I would recomend it even to people who hate vid. game music! I have played the game Chrono Cross but out of all of the games I have ever played it has the best music! It backs away from crude keyboarding and takes a fresher step with grate guitars and soulful orchestras. It would impress anyone who cares to listen! ^_^

The best soundtrack ever to anything.: I'm reading a lot of reviews saying that this is the best 'game soundtrack' and I figured that I'd write a review to disagree a bit. This in my opinino is Yasunori Mitsuda's ultimate masterpiece. The music is timeless and I'm been listening to it for years now and its beauty simply refuses to fade.The price tag on this is pretty staggering I must say, but if you are going to buy any cd for this much money, this is the only one that I feel would be worth every penny.

Amazing!: This soundtrack is my favorite music of all time, h

In [9]:
#preprocess the labels
#__label__1 = 0
#__label__2 = 1

for i in range(len(train_labels)):
  if train_labels[i] == "__label__1":
    train_labels[i] = 0
  else:
    train_labels[i] = 1

In [10]:
#preprocess the labels
#__label__1 = 0
#__label__2 = 1
for i in range(len(test_labels)):
  if test_labels[i] == "__label__1":
    test_labels[i] = 0
  else:
    test_labels[i] = 1
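The two loops above map every string other than `__label__1` to 1, so a malformed label would silently count as positive. One possible alternative, sketched here with the hypothetical names `LABEL_TO_ID` and `encode_labels`, is an explicit lookup table that raises on unknown labels.

```python
# Explicit mapping from the fastText label strings to integer classes.
LABEL_TO_ID = {"__label__1": 0, "__label__2": 1}


def encode_labels(labels):
    """Return integer class ids; raises KeyError on an unknown label."""
    return [LABEL_TO_ID[label] for label in labels]


example_labels = ["__label__2", "__label__1", "__label__2"]
encoded = encode_labels(example_labels)
print(encoded)  # [1, 0, 1]
```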

In [11]:
#len of datasets
print(len(train_sentences))
print(len(train_labels))
print(len(test_sentences))
print(len(test_labels))

3600000
3600000
400000
400000


In [12]:
#define hyperparameters
vocab_size = 10000
embedding_dim = 64
max_length = 300
trunc_type = "post"
padding_type = "post"
oov_token = "<OOV>"

In [13]:
#tokenize the train sentences
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(train_sentences)
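To make the roles of `num_words` and `oov_token` concrete, here is a simplified pure-Python sketch of the idea: words are ranked by frequency, index 1 is reserved for the OOV token (0 is left free for padding), and any word whose index is not below `num_words` is replaced by the OOV index. The functions `build_word_index` and `to_sequence` are hypothetical and only split on whitespace; the real `Tokenizer` also filters punctuation.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to indices; unknown or too-rare words become the OOV index."""
    oov_id = word_index[oov_token]
    sequence = []
    for word in text.lower().split():
        idx = word_index.get(word)
        sequence.append(idx if idx is not None and idx < num_words else oov_id)
    return sequence


toy_texts = ["good good good cd", "good cd bad"]
toy_index = build_word_index(toy_texts)
toy_sequence = to_sequence("good bad music", toy_index, num_words=3)
print(toy_index)     # {'<OOV>': 1, 'good': 2, 'cd': 3, 'bad': 4}
print(toy_sequence)  # [2, 1, 1]
```

With `num_words=3` only the most frequent word keeps its own index here, since indices 0 and 1 are taken by padding and the OOV token.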

In [14]:
#pad the train sentences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, maxlen=max_length, truncating=trunc_type, padding=padding_type)

In [15]:
#pad the test sentences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, maxlen=max_length, truncating=trunc_type, padding=padding_type)
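With `truncating="post"` and `padding="post"`, sequences longer than `max_length` lose their tail and shorter ones get zeros appended. The sketch below imitates that behaviour for a single Python list; `pad_post` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the first maxlen tokens, then right-pad with `value`."""
    truncated = sequence[:maxlen]
    return truncated + [value] * (maxlen - len(truncated))


short = pad_post([5, 8, 2], maxlen=5)
long = pad_post([5, 8, 2, 9, 4, 7], maxlen=5)
print(short)  # [5, 8, 2, 0, 0]
print(long)   # [5, 8, 2, 9, 4]
```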

In [16]:
#convert labels to numpy arrays
#pad_sequences already returns numpy arrays, so the padded data needs no conversion (and no extra copy)
train_labels = np.array(train_labels)
test_labels = np.array(test_labels)
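A rough estimate of how much memory the padded arrays occupy helps explain why this notebook is heavy on RAM. The sketch assumes the arrays use 4-byte integers (`int32`, the default dtype of `pad_sequences`); `padded_array_bytes` is a hypothetical helper.

```python
def padded_array_bytes(num_sequences, max_length, bytes_per_token=4):
    """Size in bytes of a dense (num_sequences, max_length) integer array."""
    return num_sequences * max_length * bytes_per_token


train_bytes = padded_array_bytes(3_600_000, 300)
test_bytes = padded_array_bytes(400_000, 300)
print(f"train: {train_bytes / 1e9:.2f} GB")  # train: 4.32 GB
print(f"test:  {test_bytes / 1e9:.2f} GB")   # test:  0.48 GB
```

Any extra copy of `train_padded` costs roughly the same amount again.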

In [85]:
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Conv1D(128, 5, activation="relu", padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(64, 5, activation="relu", padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(32, 5, activation="relu", padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.Conv1D(16, 5, activation="relu", padding="same"),
    keras.layers.BatchNormalization(),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])

In [86]:
model.summary()

Model: "sequential_25"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_26 (Embedding)     (None, 300, 64)           640000    
_________________________________________________________________
conv1d_36 (Conv1D)           (None, 300, 128)          41088     
_________________________________________________________________
batch_normalization_7 (Batch (None, 300, 128)          512       
_________________________________________________________________
conv1d_37 (Conv1D)           (None, 300, 64)           41024     
_________________________________________________________________
batch_normalization_8 (Batch (None, 300, 64)           256       
_________________________________________________________________
conv1d_38 (Conv1D)           (None, 300, 32)           10272     
_________________________________________________________________
batch_normalization_9 (Batch (None, 300, 32)         
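The parameter counts visible in the summary can be reproduced by hand. An embedding holds one vector per vocabulary entry, a `Conv1D` layer holds `kernel_size * in_channels * filters` weights plus one bias per filter, and batch normalization stores four values per channel (gamma, beta, moving mean, moving variance). The helper names below are hypothetical.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim


def conv1d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels


print(embedding_params(10000, 64))  # 640000
print(conv1d_params(5, 64, 128))    # 41088
print(batchnorm_params(128))        # 512
print(conv1d_params(5, 128, 64))    # 41024
print(batchnorm_params(64))         # 256
print(conv1d_params(5, 64, 32))     # 10272
```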

In [87]:
model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer="adam")
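For a single example with label y and predicted probability p, binary cross-entropy is -(y log p + (1 - y) log(1 - p)). The sketch below evaluates it for a few predictions; the `eps` clipping value is an assumption chosen to avoid log(0), and Keras additionally averages the loss over the batch.

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-example binary cross-entropy with clipping to avoid log(0)."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


confident_right = binary_crossentropy(1, 0.9)
unsure = binary_crossentropy(1, 0.5)
confident_wrong = binary_crossentropy(1, 0.1)
print(round(confident_right, 4))  # 0.1054
print(round(unsure, 4))           # 0.6931
print(round(confident_wrong, 4))  # 2.3026
```

The loss grows sharply as the model becomes confident in the wrong class, which is what pushes the sigmoid output toward the correct label during training.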

In [89]:
#history = model.fit(train_padded, train_labels, epochs=5, batch_size=2048, validation_data=(test_padded, test_labels))