# dmarketing.ai

## Deep Learning

## Project: Build a review classifier based on Amazon's reviews dataset

*__dmarketing.ai__* (*Digital Marketing AI*) is a deep learning project focused on building several neural network models for a wide range of uses. <br /><br /> 
This _Jupyter Notebook_ builds, step by step, a Recurrent Neural Network that performs review sentiment classification and decides whether a customer review is *'negative'*, *'neutral'* or *'positive'*. <br /> 
The dataset for the classifier was downloaded from [link](https://registry.opendata.aws/amazon-reviews/#usageexamples) and consists of the *train.csv* and *test.csv* files, which contain the training and testing data respectively.

## Design sequential architecture that takes review as an input and outputs sentiment.

### Step 1: Load & Explore Dataset

In [None]:
import os

DATA_FOLDER_PATH = "./data"
TRAIN_DATA_PATH = os.path.join(DATA_FOLDER_PATH, 'train.csv')
TEST_DATA_PATH = os.path.join(DATA_FOLDER_PATH, 'test.csv')

#### Counting the number of samples available in a CSV file.

The CSV files are too large to load into RAM, so I iterate through them lazily, one line at a time, instead of reading them whole.

In [None]:
def count_samples(csv_file_path):
    '''Counts the samples of data contained in a single csv file.

            Parameters:
            csv_file_path (str): file system path to a csv file with data samples.

            Returns:
            samples_cnt (int): number of samples.
    '''
    samples_cnt = 0

    # Each line of the file is counted as one sample
    with open(csv_file_path, 'r', errors='ignore') as csv_file:
        for _ in csv_file:
            samples_cnt += 1
    return samples_cnt

In [None]:
train_samples = count_samples(TRAIN_DATA_PATH)
test_samples = count_samples(TEST_DATA_PATH)

In [None]:
print("Number of train samples : {}\nNumber of test samples : {}".format(train_samples, test_samples))
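
One caveat with counting lines: the line count equals the sample count only when no field spans several lines. The sketch below uses a made-up two-record sample, not the real dataset, to show how a quoted field with an embedded newline makes the two counts diverge. Whether *train.csv* actually contains such fields is not established in this notebook, so treat it as a check worth running rather than a known problem.

```python
import csv
import io

# Hypothetical sample: two records, the second has a newline inside a quoted field.
sample = '5,"Great","Works well"\n1,"Bad","Broke after\none day"\n'

# Counting physical lines, as count_samples does.
line_count = sum(1 for _ in io.StringIO(sample))

# Counting logical CSV records, as csv.reader sees them.
record_count = sum(1 for _ in csv.reader(io.StringIO(sample, newline='')))

print(line_count, record_count)
```

If the two numbers differ on the real files, any loop bound derived from the line count would overshoot the number of rows that `csv.reader` can deliver.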

#### Constructing a function that allows iterating over a chosen column in a CSV file.
The data inside the CSV file is organized in three columns containing the following features:

- `'rating'` is an integer that represents the rating of the corresponding review.
- `'title'` is a string that represents the title of the corresponding review.
- `'review'` is a string that contains the text of the review.

In [None]:
import csv, string

RATING_IDX = 0
TITLE_IDX = 1
REVIEW_IDX = 2

def flow_from_csv(path=None, col_idx=REVIEW_IDX):
    '''Produces a generator that iterates through one column of a csv file containing data.

            Parameters:
            path (str): file system path to a csv file with data samples.
            col_idx (int): index of the column to read.

            Returns:
            generator: generator that returns the lowercased, punctuation-free text of column col_idx for each row.
    '''
    with open(path, 'r', errors='ignore') as csv_file:
        reader = csv.reader(csv_file)

        # Iterating over the reader stops cleanly at the end of any file,
        # so the loop does not depend on the size of train.csv.
        for row in reader:
            text = row[col_idx].lower()
            text = text.translate(str.maketrans('', '', string.punctuation))
            yield text

#### Creating a Tokenizer class object and fitting it on reviews in the train dataset.

The Tokenizer object will then be used to turn review strings into integer sequences, which are later padded to a given length. <br/>
For a more detailed description, visit the [keras.preprocessing.text.Tokenizer documentation](https://keras.io/preprocessing/text/)

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

rev_max_words = 10000

rev_tokenizer = Tokenizer(num_words=rev_max_words)
review_gen = flow_from_csv(TRAIN_DATA_PATH, REVIEW_IDX)

rev_tokenizer.fit_on_texts(review_gen)
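
To make the fitting step less opaque, here is a simplified pure-Python stand-in for what fitting and sequencing amount to: count words, rank them by frequency starting at index 1, and keep only words whose index is below `num_words`. The helper names `fit_word_index` and `to_sequence` are hypothetical, and the real Tokenizer also handles filtering, casing and out-of-vocabulary tokens, so this is a sketch of the idea, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(texts):
    """Ranks words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Maps words to indices, dropping unknown words and indices >= num_words."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

# Hypothetical miniature corpus
texts = ["good product", "good price", "bad product good"]
word_index = fit_word_index(texts)
sequence = to_sequence("bad product good", word_index, num_words=3)
print(word_index, sequence)
```

With `num_words=3` only the two most frequent words survive, which mirrors how `rev_max_words = 10000` caps the vocabulary the network sees.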

#### Based on the tokenizer, determining the most frequently occurring words.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

word_cnt = rev_tokenizer.word_counts

# Extracts the most frequent words
most_freq_words = 30

sorted_items = list(word_cnt.items())
sorted_items.sort(key=lambda item: item[-1], reverse=True)

most_freq_keys = [k for k, v in sorted_items[:most_freq_words]]
most_freq_values = [v for k, v in sorted_items[:most_freq_words]]

# Draws bar chart of most frequent words
plt.figure(figsize=(10, 10))
plt.title(str(most_freq_words) + " most frequent words")
plt.xlabel("Word")
plt.xticks(rotation=-90)
plt.ylabel("Occurrences")
plt.bar(most_freq_keys, most_freq_values)
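
The sort-and-slice step above can also be expressed with `collections.Counter`, whose `most_common` method returns the top entries directly. The counts below are invented for illustration; in the notebook the input would be `rev_tokenizer.word_counts`.

```python
from collections import Counter

# Hypothetical word counts standing in for rev_tokenizer.word_counts
word_counts = {"the": 5, "good": 3, "bad": 1}

# most_common(n) returns the n highest-count (word, count) pairs, highest first
top_two = Counter(word_counts).most_common(2)
top_words = [word for word, _ in top_two]
top_values = [count for _, count in top_two]
print(top_words, top_values)
```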

### Step 2: Design & Validate a Model Architecture 


#### Creating data pipeline.

Creating a data pipeline that produces a generator returning *tuple(inputs, targets)*, which is used to train the neural network model.

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

classes = ["negative", "neutral", "positive"]
# Maps a rating to the index of the corresponding class in the classes list
rating2class = {'1': 0,
                '2': 0,
                '3': 1,
                '4': 2,
                '5': 2}

NUMBER_OF_CLASSES = len(classes)         # Number of sentiment classes
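
As a quick illustration of how a star rating ends up as a training target, the sketch below maps three ratings to class indices and then to one-hot rows. The `one_hot` helper is a hypothetical pure-Python stand-in for what `to_categorical` produces; it is not the Keras function.

```python
classes = ["negative", "neutral", "positive"]
rating2class = {'1': 0, '2': 0, '3': 1, '4': 2, '5': 2}

def one_hot(class_idx, num_classes):
    """Returns a list with 1 at class_idx and 0 elsewhere."""
    return [1 if i == class_idx else 0 for i in range(num_classes)]

# Ratings are read from the CSV as strings, hence the string keys
ratings = ['1', '3', '5']
targets = [one_hot(rating2class[r], len(classes)) for r in ratings]
labels = [classes[rating2class[r]] for r in ratings]
print(labels, targets)
```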

In [None]:
def skip_rows(gen, num):
    '''Skips rows of a csv file read by a generator.

            Parameters:
            gen (generator): csv file reader generator.
            num (int): number of rows to be skipped.
    '''
    skipped = 0
    while skipped != num:
        next(gen)
        skipped += 1
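
An alternative way to skip rows, shown here only as a sketch, is the `consume` recipe from the `itertools` documentation. The name `skip` is hypothetical. Unlike a loop around `next`, it does not raise `StopIteration` when the iterator has fewer items than requested.

```python
from itertools import islice

def skip(iterator, num):
    """Advances iterator by num items, stopping quietly if it runs out."""
    next(islice(iterator, num, num), None)

rows = iter(range(10))
skip(rows, 3)
first_after_skip = next(rows)

short = iter(range(2))
skip(short, 5)                      # fewer items than requested: no exception
leftover = list(short)
print(first_after_skip, leftover)
```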

In [None]:
def prepare_text(text):
    '''Returns a lowercase copy of text with punctuation removed. The input text remains unmodified.

            Parameters:
            text (str): text based on which new modified string is returned.

            Returns:
            retv (str): returned modified string.
    '''
    retv = text.lower()
    retv = retv.translate(str.maketrans('', '', string.punctuation))
    return retv
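
A small worked example of this cleaning step, written as a self-contained sketch (the function is repeated under the hypothetical name `normalize` so the snippet runs on its own). Note that `string.punctuation` covers ASCII punctuation only, so apostrophes are stripped while typographic quotes pass through.

```python
import string

def normalize(text):
    """Lowercases text and removes ASCII punctuation."""
    return text.lower().translate(str.maketrans('', '', string.punctuation))

cleaned = normalize("Don't buy!!! It's BAD.")
curly = normalize("\u201cGreat\u201d")   # curly quotes are not in string.punctuation
print(cleaned)
print(curly)
```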

In [None]:
def first_pipeline(path, maxlen, batch_size=1, start_idx=0):
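    '''Produces generator that will be used to train neural network.

            Parameters:
            path (str): file system path to a csv file with data samples.
            maxlen (int): length to which every review sequence is padded or truncated.
            batch_size (int): number of samples in each yielded batch.
            start_idx (int): number of leading rows to skip before reading.

            Returns:
            generator: generator that returns tuple(reviews, ratings, [None]); the trailing [None] fills the position Keras reads as sample weights.
    '''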
    csv_file = open(path, 'r', errors='ignore')
    reader = csv.reader(csv_file)

    readed_cnt = start_idx
    skip_rows(reader, start_idx)

    while True:
        reviews = []
        ratings = []

        for _ in range(batch_size):
            row = next(reader)
            # Extracts ratings
            rating = row[RATING_IDX]
            rating_class = rating2class[rating]
            ratings.append(rating_class)

            # Extracts and clears reviews
            review = prepare_text(row[REVIEW_IDX])
            reviews.append(review)

            readed_cnt += 1

        # Tokenizes and pads sequences
        reviews = rev_tokenizer.texts_to_sequences(reviews)
        reviews = pad_sequences(reviews, maxlen=maxlen)
        
        # Converts input to binary class matrix
        ratings = to_categorical(ratings, num_classes=NUMBER_OF_CLASSES, dtype='uint8')
            
        yield reviews, ratings, [None]
            
        # Provides infinite data generation
        if readed_cnt + batch_size >= train_samples - 1:
            csv_file.close()
            csv_file = open(path, 'r', errors='ignore')
            reader = csv.reader(csv_file)
            readed_cnt = start_idx
            # Skips first start_idx rows
            skip_rows(reader, start_idx)
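
The padding step inside the pipeline can be pictured with a pure-Python sketch. It mimics the default behaviour of `pad_sequences`, which pads and truncates at the front of each sequence; the helper name `pad_pre` is hypothetical and the real function returns a NumPy array, not a list.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keeps the last maxlen items and left-pads with value up to maxlen."""
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

short_review = pad_pre([1, 2, 3], maxlen=5)
long_review = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short_review, long_review)
```

Because truncation happens at the front, a review longer than `maxlen` keeps its final words and loses its opening ones.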

#### Creating and training model.

Creating Recurrent Neural Network model and training it.

In [None]:
rev_max_len = 80    # Maximum length of a sequence that can be fed to the neural network

model_v1 = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(rev_max_words, 128, input_length=rev_max_len),
    tf.keras.layers.GRU(64, recurrent_dropout=0.2, dropout=0.2),
    tf.keras.layers.Dense(NUMBER_OF_CLASSES, activation='softmax')
])

model_v1.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

In [None]:
model_v1.summary()

In [None]:
def tensorboard_callback(log_dir):
    '''Returns a list with a keras callback that saves TensorBoard logs of the trained model.

            Parameters:
            log_dir (str): path to a directory in which the logs will be stored.

            Returns:
            (list of tensorflow.keras.callbacks): list containing one callback that can be used directly while training.
    '''
    return [tf.keras.callbacks.TensorBoard(
                                    log_dir=log_dir,
                                    histogram_freq=1,
                                    embeddings_freq=1)]

In [None]:
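data_gen = first_pipeline(TRAIN_DATA_PATH, maxlen=rev_max_len, batch_size=256)
# Validation batches start 2 million rows into the file (** is exponentiation; ^ would be bitwise XOR)
val_gen = first_pipeline(TRAIN_DATA_PATH, maxlen=rev_max_len, batch_size=256, start_idx=2 * 10**6)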

history = model_v1.fit(data_gen, steps_per_epoch=1000,
                       epochs=10, 
                       validation_data=val_gen,
                       validation_steps=500,
                       callbacks=tensorboard_callback('best_sequential_rev_model'))
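
The numbers passed to `fit` translate into row counts as sketched below. The snippet also shows why a two-million-row offset has to be written with `**`: in Python, `^` is bitwise XOR and binds more loosely than `*`.

```python
batch_size = 256
steps_per_epoch = 1000
validation_steps = 500
epochs = 10

rows_per_epoch = batch_size * steps_per_epoch        # rows drawn for training in one epoch
validation_rows = batch_size * validation_steps      # rows drawn for validation in one epoch
total_training_rows = rows_per_epoch * epochs        # rows drawn for training over the whole run

power_offset = 2 * 10**6     # exponentiation: two million
xor_offset = 2 * 10 ^ 6      # parsed as (2 * 10) XOR 6

print(rows_per_epoch, validation_rows, total_training_rows)
print(power_offset, xor_offset)
```

Assuming Keras keeps drawing from the same generator across epochs, the training generator would read past row two million during the eighth epoch with these settings, so the training and validation ranges are not fully disjoint.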

#### Plots of loss function and accuracy parameters with respect to epoch.

In [None]:
def plot_accuracy(history):
    '''Plots training and validation accuracy obtained in training process.

            Parameters:
            history (tensorflow.python.keras.callbacks.History): history of training.

            Returns:
            matplotlib plot of accuracies with respect to epoch.
    '''
    hist_dict = history.history
    train_acc = hist_dict['acc']
    val_acc = hist_dict['val_acc']

    epochs = np.arange(1, len(train_acc) + 1)    # One x-axis value per recorded epoch

    plt.plot(epochs, train_acc, 'bo', label='Train accuracy')
    plt.plot(epochs, val_acc, 'r-', label='Validation accuracy')
    plt.grid()
    plt.legend(loc='best')

In [None]:
def plot_loss(history):
    '''Plots training and validation losses obtained in training process.

            Parameters:
            history (tensorflow.python.keras.callbacks.History): history of training.

            Returns:
            matplotlib plot of losses with respect to epoch.
    '''
    hist_dict = history.history
    train_loss = hist_dict['loss']
    val_loss = hist_dict['val_loss']

    # One x-axis value per recorded epoch
    epochs = np.arange(1, len(train_loss) + 1)

    plt.plot(epochs, train_loss, 'bo', label='Train loss')
    plt.plot(epochs, val_loss, 'r-', label='Validation loss')
    plt.grid()
    plt.legend(loc='best')

In [None]:
plot_accuracy(history)

In [None]:
plot_loss(history)

#### Saving the best Sequential RNN model obtained.

In [None]:
model_v1.save('best_sequential_model.h5')

### Step 3: Test Model on New Reviews

In [None]:
test_gen = first_pipeline(TEST_DATA_PATH, maxlen=rev_max_len, batch_size=int(test_samples/1000))

model_v1.evaluate(test_gen, steps=1000)
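
The batch size above is chosen so that 1000 evaluation steps cover nearly the whole test file. A short arithmetic sketch, using a made-up `test_samples` value because the real count depends on the downloaded file:

```python
test_samples = 650123          # hypothetical value, for illustration only
steps = 1000

batch_size = int(test_samples / steps)   # same expression as in the notebook
covered = batch_size * steps             # rows actually evaluated
skipped = test_samples - covered         # rows left out at the end of the file

print(batch_size, covered, skipped)
```

The leftover is always smaller than `steps`, so at most 999 rows go unevaluated.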

### Step 4: Summary

In [None]:
from tensorflow.keras.utils import plot_model

plot_model(model_v1)

## Design multiple-input architecture that takes review and title as inputs and outputs sentiment.

### Step 1: Preparing data pipeline for new architecture

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

title_max_words = 10000

title_tokenizer = Tokenizer(num_words=title_max_words)
title_gen = flow_from_csv(path=TRAIN_DATA_PATH, col_idx=TITLE_IDX)

title_tokenizer.fit_on_texts(title_gen)

In [None]:
def second_pipeline(path, rev_maxlen, title_maxlen, batch_size=1, start_idx=0):
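    '''Produces generator that will be used to train neural network.

            Parameters:
            path (str): file system path to a csv file with data samples.
            rev_maxlen (int): length to which every review sequence is padded or truncated.
            title_maxlen (int): length to which every title sequence is padded or truncated.
            batch_size (int): number of samples in each yielded batch.
            start_idx (int): number of leading rows to skip before reading.

            Returns:
            generator: generator that returns tuple([reviews, titles], ratings, [None]); the trailing [None] fills the position Keras reads as sample weights.
    '''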
    csv_file = open(path, 'r', errors='ignore')
    reader = csv.reader(csv_file)

    readed_cnt = start_idx
    skip_rows(reader, start_idx)

    while True:
        reviews = []
        titles = []
        ratings = []

        for _ in range(batch_size):
            row = next(reader)
            # Extracts ratings
            rating = row[RATING_IDX]
            rating_class = rating2class[rating]
            ratings.append(rating_class)

            # Extracts and clears reviews
            review = prepare_text(row[REVIEW_IDX])
            reviews.append(review)
            
            # Extracts and clears titles
            title = prepare_text(row[TITLE_IDX])
            titles.append(title)

            readed_cnt += 1

        # Tokenizes and pads sequences of review
        reviews = rev_tokenizer.texts_to_sequences(reviews)
        reviews = pad_sequences(reviews, maxlen=rev_maxlen)
        
        # Tokenizes and pads sequences of titles
        titles = title_tokenizer.texts_to_sequences(titles)
        titles = pad_sequences(titles, maxlen=title_maxlen)
        
        # Converts input to binary class matrix
        ratings = to_categorical(ratings, num_classes=NUMBER_OF_CLASSES, dtype='uint8')
            
        yield [reviews, titles], ratings, [None]
            
        # Provides infinite data generation
        if readed_cnt + batch_size >= train_samples - 1:
            csv_file.close()
            csv_file = open(path, 'r', errors='ignore')
            reader = csv.reader(csv_file)
            readed_cnt = start_idx
            # Skips first start_idx rows
            skip_rows(reader, start_idx)

### Step 2: Design & Validate an RNN multiple-input architecture

In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras import layers

rev_max_len = 100
title_max_len = 20

# Constructing a new RNN architecture that takes two separate inputs (review and title)
review_input_layer = layers.Input(shape=(rev_max_len,), dtype='int32', name='review_input')
embedded_review = layers.Embedding(rev_max_words, 128, input_length=rev_max_len)(review_input_layer)
review_lstm = layers.LSTM(128, recurrent_dropout=0.4, dropout=0.4)(embedded_review)

title_input_layer = layers.Input(shape=(title_max_len,), dtype='int32', name='title_input')
embedded_title = layers.Embedding(title_max_words, 16, input_length=title_max_len)(title_input_layer)
title_lstm = layers.LSTM(16, recurrent_dropout=0.2, dropout=0.2)(embedded_title)

# Concatenating layers
concatenated_layer = layers.concatenate([review_lstm, title_lstm], axis=-1)
output_layer = layers.Dense(NUMBER_OF_CLASSES, activation='softmax')(concatenated_layer)

model_v2 = Model([review_input_layer, title_input_layer], output_layer)

In [None]:
model_v2.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

In [None]:
model_v2.summary()
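
As a cross-check for the summary, the trainable parameter count of this architecture can be worked out by hand. The sketch assumes the standard parameterizations (an embedding holds `vocabulary * dimension` weights, an LSTM holds four gates of `units * (input_dim + units) + units` weights each, and a dense layer holds `(inputs + 1) * outputs`); if the printed summary differs, trust the summary.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    return (inputs + 1) * outputs

review_branch = embedding_params(10000, 128) + lstm_params(128, 128)
title_branch = embedding_params(10000, 16) + lstm_params(16, 16)
output_layer = dense_params(128 + 16, 3)     # concatenated LSTM outputs feed 3 classes

total = review_branch + title_branch + output_layer
print(review_branch, title_branch, output_layer, total)
```

Most of the weights sit in the review embedding, which is why the vocabulary cap has such a strong effect on model size.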

In [None]:
data_gen = second_pipeline(TRAIN_DATA_PATH, rev_max_len, title_max_len, batch_size=256)
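# Validation batches start 2 million rows into the file (** is exponentiation; ^ would be bitwise XOR)
val_gen = second_pipeline(TRAIN_DATA_PATH, rev_max_len, title_max_len, batch_size=256, start_idx=2 * 10**6)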

history = model_v2.fit(data_gen, steps_per_epoch=1000,
                       epochs=10,
                       validation_data=val_gen,
                       validation_steps=500,
                       callbacks=tensorboard_callback('best_multiple_input_model'))

In [None]:
plot_accuracy(history)

In [None]:
plot_loss(history)

In [None]:
model_v2.save('multiply_input_model.h5')

### Step 3: Test the obtained model on new data

In [None]:
test_gen = second_pipeline(TEST_DATA_PATH, rev_max_len, title_max_len, batch_size=int(test_samples/1000))

model_v2.evaluate(test_gen, steps=1000)

### Step 4: Summary

In [None]:
from tensorflow.keras.utils import plot_model

plot_model(model_v2)

## Design sequential architecture that takes title as an input and outputs sentiment.

### Step 1: Preparing data pipeline for new architecture

In [None]:
def third_pipeline(path, maxlen, batch_size=1, start_idx=0):
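    '''Produces generator that will be used to train neural network.

            Parameters:
            path (str): file system path to a csv file with data samples.
            maxlen (int): length to which every title sequence is padded or truncated.
            batch_size (int): number of samples in each yielded batch.
            start_idx (int): number of leading rows to skip before reading.

            Returns:
            generator: generator that returns tuple(titles, ratings, [None]); the trailing [None] fills the position Keras reads as sample weights.
    '''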
    csv_file = open(path, 'r', errors='ignore')
    reader = csv.reader(csv_file)

    readed_cnt = start_idx
    skip_rows(reader, start_idx)

    while True:
        titles = []
        ratings = []

        for _ in range(batch_size):
            row = next(reader)
            # Extracts ratings
            rating = row[RATING_IDX]
            rating_class = rating2class[rating]
            ratings.append(rating_class)

            # Extracts and clears titles
            title = prepare_text(row[TITLE_IDX])
            titles.append(title)

            readed_cnt += 1

        # Tokenizes and pads sequences
        titles = title_tokenizer.texts_to_sequences(titles)
        titles = pad_sequences(titles, maxlen=maxlen)
        
        # Converts input to binary class matrix
        ratings = to_categorical(ratings, num_classes=NUMBER_OF_CLASSES, dtype='uint8')
            
        yield titles, ratings, [None]
            
        # Provides infinite data generation
        if readed_cnt + batch_size >= train_samples - 1:
            csv_file.close()
            csv_file = open(path, 'r', errors='ignore')
            reader = csv.reader(csv_file)
            readed_cnt = start_idx
            # Skips first start_idx rows
            skip_rows(reader, start_idx)

### Step 2: Design & Validate new architecture

In [None]:
model_v3 = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(title_max_words, 128, input_length=title_max_len),
    tf.keras.layers.LSTM(16, recurrent_dropout=0.2, dropout=0.2),
    tf.keras.layers.Dense(3, activation='softmax')
])

model_v3.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

In [None]:
model_v3.summary()

In [None]:
data_gen = third_pipeline(TRAIN_DATA_PATH, title_max_len, batch_size=256)
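# Validation batches start 2 million rows into the file (** is exponentiation; ^ would be bitwise XOR)
val_gen = third_pipeline(TRAIN_DATA_PATH, title_max_len, batch_size=256, start_idx=2 * 10**6)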

history = model_v3.fit(data_gen, steps_per_epoch=1000,
                       epochs=10,
                       validation_data=val_gen,
                       validation_steps=500,
                       callbacks=tensorboard_callback('best_sequential_title_model'))

In [None]:
model_v3.save('best_sequential_title_model.h5')

### Step 3: Test the obtained model on new data

In [None]:
test_gen = third_pipeline(TEST_DATA_PATH, title_max_len, batch_size=int(test_samples/1000))

model_v3.evaluate(test_gen, steps=1000)

### Step 4: Summary

In [None]:
from tensorflow.keras.utils import plot_model

plot_model(model_v3)