# Task for Today  

***

## Children's Book Age Group Prediction  

Given *the names and descriptions of highly-rated children's books*, let's try to predict the **age group** for a given book.  
  
We will use a TensorFlow neural network with a recurrent (GRU) layer to make our predictions.

# Getting Started

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

import re
from nltk.corpus import stopwords
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [None]:
data = pd.read_csv('../input/highly-rated-children-books-and-stories/children_stories.Csv', encoding='latin-1')

In [None]:
data

# Creating Labels

We would like to create labels from the "cats" column. Let us divide the values in the column into two categories: *younger* and *older*.  
  
We can start by sorting the indices of the column's value_counts() to see where we should make the split.

In [None]:
data['cats'].value_counts().sort_index()

It looks like a starting age of **5 and up** is a good cutoff for classifying *older* books.  
  
Let's just sort the unique values as a list so we can see any leading or trailing whitespace in the entries.

In [None]:
sorted(list(data['cats'].unique()))

As we can see, there is one blank age ("Age "), so let us remove any rows with this value.

In [None]:
data = data.drop(data.query("cats == 'Age '").index, axis=0).reset_index(drop=True)

We will create a list of all the values that count as *younger*.

In [None]:
young_ages = [
    'Age 6months+',
    'Age  0-3',
    'Age 0+',
    'Age 0-2',
    'Age 0-3',
    'Age 0-4',
    'Age 0-5',
    'Age 0-6',
    'Age 1+',
    'Age 1-2',
    'Age 1-3',
    'Age 1-4',
    'Age 1-5',
    'Age 1-6',
    'Age 2+',
    'Age 2-4',
    'Age 2-5',
    'Age 2-6',
    'Age 2-7',
    'Age 2-9',
    'Age 3+',
    'Age 3-4',
    'Age 3-5',
    'Age 3-6',
    'Age 3-7',
    'Age 4+',
    'Age 4-11',
    'Age 4-5',
    'Age 4-6',
    'Age 4-7',
    'Age 4-8'
]

We apply a lambda function to change the value for a given book in the "cats" column to 0 if the book is for *younger* children and 1 if the book is for *older* children.

In [None]:
data['cats'] = data['cats'].apply(lambda age: 0 if age in young_ages else 1)
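
As a side note, the hand-written list above appears to follow one rule: a book counts as *younger* when its starting age is below 5. Here is a minimal sketch of that rule using a regular expression. The helper `age_label`, its `cutoff` parameter, and the sample categories `'Age 5+'` and `'Age 8-12'` are hypothetical and only for illustration; this rule is an assumption and may not reproduce the list exactly for every category in the dataset.

```python
import re

def age_label(category, cutoff=5):
    """Return 0 (younger) if the starting age is below `cutoff`, else 1 (older)."""
    match = re.search(r'\d+', category)
    if match is None:
        raise ValueError(f"No age found in {category!r}")
    # Entries given in months (e.g. 'Age 6months+') are always below the cutoff
    if 'month' in category.lower():
        return 0
    return 0 if int(match.group()) < cutoff else 1

# 'Age 5+' and 'Age 8-12' are made-up examples of "older" categories
examples = ['Age 6months+', 'Age  0-3', 'Age 4-11', 'Age 5+', 'Age 8-12']
labels_demo = [age_label(c) for c in examples]
print(labels_demo)
```

The explicit list used in the notebook is easier to audit by eye; a rule like this is mainly useful if new categories show up later.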

Let's now check the value_counts() divided by the total number of examples to see if our split is good.  
  
51/49 is a decent split.

In [None]:
data['cats'].value_counts() / len(data['cats'])

# Processing Text  
  
Now that we have the labels properly assigned, let's prepare to create a dense encoding of each word in the name and the description.

In [None]:
data

We can start by defining a function to remove any digits and stop words from the texts.

In [None]:
# Build the stop-word set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def process_text(text):
    
    # Remove digits
    text = re.sub(r'\d+', ' ', text)
    
    # Split on whitespace
    text = text.split()
    
    # Join on whitespace, but only the words that are not stop words
    text = ' '.join([word for word in text if word not in stop_words])
    
    return text
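
To see what this function does to a single string, here is a small standalone sketch. It uses a tiny hard-coded stop-word set (`SAMPLE_STOP_WORDS`, a hypothetical stand-in for NLTK's English list) so that it runs without downloading any corpus; the function name `process_text_demo` and the sample title are also just for illustration.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list
SAMPLE_STOP_WORDS = {"the", "a", "and", "of", "for"}

def process_text_demo(text, stop_words=SAMPLE_STOP_WORDS):
    # Replace runs of digits with a space
    text = re.sub(r'\d+', ' ', text)
    # Split on whitespace and keep only the words that are not stop words
    words = text.split()
    return ' '.join(word for word in words if word not in stop_words)

cleaned = process_text_demo("The 3 little pigs and the wolf")
print(cleaned)
```

Note that the comparison is case-sensitive: the capitalized "The" survives because the stop-word list is lowercase. Lowercasing each word before the comparison would remove it as well.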

Let's create three variables: **names** and **descriptions**, which will be the processed text columns, and **labels** which will just be a copy of the "cats" column.

In [None]:
names = data['names'].copy().apply(process_text)
descriptions = data['desc'].copy().apply(process_text)

labels = data['cats'].copy()

In [None]:
names

In [None]:
descriptions

In [None]:
labels

Now let us tokenize the texts to give us a word-to-integer mapping for all the words in all the texts.  
  
Keras' Tokenizer will automatically apply filtering for punctuation/special characters and split the strings (into words) on whitespace.  
*Note:* We are fitting the tokenizer on the concatenation of the **names** and **descriptions**, so that all words are accounted for.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(pd.concat([names, descriptions]))

names = tokenizer.texts_to_sequences(names)
descriptions = tokenizer.texts_to_sequences(descriptions)
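
To make the word-to-integer mapping concrete, here is a simplified pure-Python illustration of the idea. It is not Keras' implementation (for example, it skips the punctuation filtering), and the helper names `build_word_index` and `to_sequences` and the toy texts are hypothetical. More frequent words get smaller indices, and index 0 is never assigned to a word.

```python
from collections import Counter

def build_word_index(texts):
    # Count lowercase, whitespace-separated words across all texts
    counts = Counter(word for text in texts for word in text.lower().split())
    # Indices start at 1 (most frequent word first); 0 is left free for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    # Replace each known word with its integer index
    return [
        [word_index[word] for word in text.lower().split() if word in word_index]
        for text in texts
    ]

toy_texts = ["little red hen", "little blue truck", "red truck"]
toy_index = build_word_index(toy_texts)
toy_sequences = to_sequences(toy_texts, toy_index)

print(toy_index)
print(toy_sequences)
```

Building one index from all texts is why the tokenizer above is fitted on the concatenation of the names and descriptions: a word gets the same integer wherever it appears.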

Now our texts look like sequences of integers.

In [None]:
names[0:5]

Let's get the size of the vocabulary (the number of all unique words across all texts).  
  
We can get this from the length of the tokenizer's word_index, plus 1 for index 0 (which is not assigned to any word, but is reserved for padding).

In [None]:
vocab_length = len(tokenizer.word_index) + 1

print("Vocabulary length:", vocab_length)

We should also get the lengths of the longest sequences in **names** and **descriptions**.

In [None]:
max_name_length = max(len(name) for name in names)
max_desc_length = max(len(desc) for desc in descriptions)

print("Max name length:", max_name_length)
print("Max description length:", max_desc_length)

We can now pad the sequences according to their longest sequence (any sequences shorter than the max will have zeros added to the end).

In [None]:
names = pad_sequences(names, maxlen=max_name_length, padding='post')
descriptions = pad_sequences(descriptions, maxlen=max_desc_length, padding='post')
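
Here is a small pure-Python sketch of what post-padding to the longest sequence means. The helper `pad_post` and the toy sequences are hypothetical and only illustrate the idea; the notebook itself relies on Keras' pad_sequences.

```python
def pad_post(sequences, value=0):
    # Find the longest sequence, then append `value` to the shorter ones
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

padded = pad_post([[4, 7], [1, 2, 3, 9], [5]])
print(padded)
```

Because 0 is never used as a word index, the padded positions cannot be confused with real words.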

Now **names** and **descriptions** are proper NumPy arrays that have sequences of uniform length.

In [None]:
print("Shape:", names.shape)
names

In [None]:
print("Shape:", descriptions.shape)
descriptions

# Splitting the Data (Train/Test)  
  
We can split into train and test sets using sklearn's train_test_split() function.  
  
Let's use a train size of 70% and include a random state of 100.

In [None]:
names_train, names_test, descriptions_train, descriptions_test, labels_train, labels_test = train_test_split(names, descriptions, labels, train_size=0.7, random_state=100)
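
Passing several arrays to one call matters because every array must be shuffled and cut in the same way, so that each name still lines up with its description and label. Here is a rough pure-Python sketch of that idea. It is not how sklearn implements the split, and the helper `aligned_split` and the toy data are hypothetical.

```python
import random

def aligned_split(arrays, train_size=0.7, seed=100):
    n = len(arrays[0])
    assert all(len(a) == n for a in arrays), "all arrays need the same length"
    # Shuffle one shared list of row indices, then cut it once
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    # Apply the same train/test indices to every array
    result = []
    for a in arrays:
        result.append([a[i] for i in train_idx])
        result.append([a[i] for i in test_idx])
    return result

toy_names = [f"name{i}" for i in range(10)]
toy_descs = [f"desc{i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

n_tr, n_te, d_tr, d_te, l_tr, l_te = aligned_split([toy_names, toy_descs, toy_labels])
print(len(n_tr), len(n_te))
```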

# Modeling  
  
We are going to feed our feature data in through two inputs (one for the names and one for the descriptions).

First, let's focus on the names.

We can embed each word of the names in a dense vector space using a Keras Embedding layer.  
  
This lets the model learn a representation for each word, rather than relying on manually created ones.  
It also keeps the inputs small: each word becomes a dense vector of 64 values instead of a sparse one-hot vector as long as the vocabulary.  
  
We will then flatten the output from the embedding and prepare to send it to the final output.

In [None]:
name_dim = 64

name_input = tf.keras.Input(shape=(max_name_length,), name="name_input")

name_embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=name_dim,
    input_length=max_name_length,
    name="name_embedding"
)(name_input)

name_flatten = tf.keras.layers.Flatten(name="name_flatten")(name_embedding)
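
Conceptually, an embedding layer is a lookup table with one row per word index, and flattening just lays the looked-up rows end to end. Here is a tiny pure-Python sketch of those two steps. The sizes and the random table are made up for illustration (random values stand in for the weights that Keras would learn during training).

```python
import random

vocab_size, embed_dim, seq_len = 6, 4, 3  # toy sizes, not the notebook's

# One row of `embed_dim` values per word index (row 0 belongs to padding)
rng = random.Random(0)
table = [[rng.uniform(-0.05, 0.05) for _ in range(embed_dim)] for _ in range(vocab_size)]

sequence = [2, 5, 0]  # a padded sequence of word indices

# Embedding: look up one row per index -> seq_len rows of embed_dim values
embedded = [table[index] for index in sequence]

# Flatten: concatenate the rows -> seq_len * embed_dim values
flattened = [value for row in embedded for value in row]

print(len(embedded), len(embedded[0]), len(flattened))
```

With the notebook's settings, the flattened name vector has max_name_length * 64 values.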

Now, let's focus on the descriptions.  
  
For the descriptions, we will also perform an embedding, but we will then feed it through a Gated Recurrent Unit (GRU) in order to capture order-dependent information in the text.  
  
We will set return_sequences=True so that the GRU returns its output at every time step (not just the last one), and then flatten the output.

In [None]:
desc_dim = 64

desc_input = tf.keras.Input(shape=(max_desc_length,), name="desc_input")

desc_embedding = tf.keras.layers.Embedding(
    input_dim=vocab_length,
    output_dim=desc_dim,
    input_length=max_desc_length,
    name="desc_embedding"
)(desc_input)

gru_layer = tf.keras.layers.GRU(
    units=256,
    return_sequences=True,
    name="gru_layer"
)(desc_embedding)

desc_flatten = tf.keras.layers.Flatten(name="desc_flatten")(gru_layer)

We will finalize our model by concatenating the outputs from the two sub-models and creating a final prediction.

In [None]:
concat = tf.keras.layers.concatenate([name_flatten, desc_flatten], name="concatenate")

output = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(concat)
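
It can help to work out how many features reach the final Dense layer. Here is a small arithmetic sketch; the helper `flattened_sizes` is hypothetical, and the sequence lengths 10 and 100 are made-up stand-ins for max_name_length and max_desc_length, which depend on the data.

```python
def flattened_sizes(name_len, desc_len, name_dim=64, gru_units=256):
    # Flattened name embedding: one name_dim vector per position
    name_features = name_len * name_dim
    # GRU with return_sequences=True: one gru_units vector per time step
    desc_features = desc_len * gru_units
    # Concatenation simply adds the two feature counts
    concat_features = name_features + desc_features
    # Dense(1): one weight per input feature plus one bias
    dense_params = concat_features + 1
    return name_features, desc_features, concat_features, dense_params

sizes = flattened_sizes(name_len=10, desc_len=100)
print(sizes)
```

Since every GRU time step contributes 256 features, the description branch supplies most of the inputs to the output layer whenever descriptions are longer than names.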

Let's take a look at what the model looks like.

In [None]:
model = tf.keras.Model(inputs=[name_input, desc_input], outputs=output)

model.summary()
tf.keras.utils.plot_model(model)

# Training  
  
Now we just have to compile and fit our model.

In [None]:
batch_size = 32
epochs = 14

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=[
        'accuracy',
        tf.keras.metrics.AUC(name='auc')
    ]
)

history = model.fit(
    [names_train, descriptions_train],
    labels_train,
    validation_split=0.2,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=[
        tf.keras.callbacks.ReduceLROnPlateau()
    ]
)
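
ReduceLROnPlateau is used here with its default settings, which monitor val_loss and multiply the learning rate by 0.1 after 10 epochs without improvement. Here is a simplified pure-Python sketch of that rule. It ignores options such as cooldown and min_lr, the helper `simulate_reduce_lr` is hypothetical, and the validation losses are made up.

```python
def simulate_reduce_lr(val_losses, lr=0.001, factor=0.1, patience=10, min_delta=1e-4):
    best = float("inf")
    wait = 0
    lrs = []
    for loss in val_losses:
        if loss < best - min_delta:
            # Validation loss improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No improvement: count the epoch and reduce once patience runs out
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lrs.append(lr)  # learning rate in effect after this epoch
    return lrs

# Made-up losses: three improving epochs, then a plateau (14 epochs in total)
toy_val_losses = [0.7, 0.6, 0.5] + [0.5] * 11
lr_schedule = simulate_reduce_lr(toy_val_losses)
print(lr_schedule)
```

With only 14 epochs and a patience of 10, the learning rate can drop at most once, and only near the end of training. A smaller patience would make the callback more active.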

In [None]:
fig = px.line(
    history.history,
    y=['loss', 'val_loss'],
    labels={'index': "epoch", 'value': "loss"},
    title="Loss Over Time"
)

fig.show()

In [None]:
fig = px.line(
    history.history,
    y=['accuracy', 'val_accuracy'],
    labels={'index': "epoch", 'value': "accuracy"},
    title="Accuracy Over Time"
)

fig.show()

# Results

In [None]:
results = model.evaluate([names_test, descriptions_test], labels_test)

print("Accuracy:", results[1])
print(" ROC AUC:", results[2])

# Data Every Day  

This notebook is featured on Data Every Day, a YouTube series where I train models on a new dataset each day.  

***

Check it out!  
https://youtu.be/hhsz2FKGsuQ