# Experimenting with Keras to Identify Authors

Much of what I'm doing in this notebook comes from François Chollet, _Deep Learning with Python_, Second Edition (Manning, 2021).

First, I need to load the data, which I have prepared ahead of time.

In [1]:
# Import Pandas
import pandas as pd

In [2]:
# Import the data
df = pd.read_csv('output/all_names_deduplicated.csv')
df.head()

Unnamed: 0,author,dll_author_id
0,"Acosta, José de, 1540-1600",A5598
1,"Seneca, Lucius Annaeus, approximately 55 B.C.-...",A4920
2,"Gregory, Saint, Bishop of Tours, 538-594",A5257
3,"Keil, Henricus, 1822-1894",A3509
4,"Ruusbroec, Jan van, 1293-1381",A4218


In [3]:
# Investigate the data

# The length of the longest string in the 'author' column
print(f"Longest string: {df['author'].str.len().max()}")

# Get the average length
print(f"Average string length: {df['author'].str.len().mean()}")

# Average number of words per author
df2 = df.copy()
df2['author_words'] = df2['author'].apply(lambda x: len(x.split()))
print(f"Average number of words per author: {df2['author_words'].mean()}")

# The number of author strings
print(f"The total number of authors: {len(df['author'])}")

# Ratio of the number of records to the mean number of words per record
ratio = len(df['author'])/df2['author_words'].mean()
print(f"Ratio of words per author to mean length: {ratio}")

Longest string: 254
Average string length: 29.07211951010219
Average number of words per author: 3.90802714720337
The total number of authors: 25638
Ratio of words per author to mean length: 6560.343373854722


## Building the model

The goal of the model is to apply the correct DLL ID to the name of an author.

I'll need training and validation data for the author names and DLL IDs.

In [23]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import TextVectorization, Embedding, Flatten, Dense

# Label encoding
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['dll_author_id'])

# Splitting the data
X_train, X_val, y_train, y_val = train_test_split(df['author'], y, test_size=0.2, random_state=42)

# Ensure X_train is a 1D array
X_train = X_train.to_numpy().reshape(-1)
X_val = X_val.to_numpy().reshape(-1)
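
To make the label-encoding step concrete, here is a plain-Python sketch of the kind of mapping `LabelEncoder` builds: the distinct IDs are sorted, and each one is assigned its position in that sorted list as an integer class. It does not call scikit-learn. The names `ids`, `classes`, `to_index`, `encoded`, and `decoded` are hypothetical, and the five IDs are the ones shown in the `df.head()` output above.

```python
# Plain-Python illustration of label encoding (not scikit-learn itself).
# The IDs are the five shown in the df.head() output.
ids = ["A5598", "A4920", "A5257", "A3509", "A4218"]

# Sort the distinct IDs; each ID's position becomes its integer class.
classes = sorted(set(ids))
to_index = {dll_id: i for i, dll_id in enumerate(classes)}

# Encode the IDs as integers, then decode them back again.
encoded = [to_index[dll_id] for dll_id in ids]
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
print(decoded == ids)
```

The round trip matters later: the model predicts an integer class, and the same lookup in reverse turns it back into a DLL ID.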



## Preprocessing the text

Since a neural network operates on numbers rather than raw strings, I will also need to convert the text into vectors of integers.

I'll use the `TextVectorization` layer in Keras to preprocess the author strings instead of manually normalizing and vectorizing them.

`max_tokens` is set to 20000 because, according to Chollet (390), “In general, 20,000 is the right vocabulary size for text classification.” 

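The `output_mode` is `int` so that each token is replaced by its integer index in the vocabulary.

The `output_sequence_length` is set to 254, the length in characters of the longest string in the 'author' column, which I found with `df['author'].str.len().max()`. Because the layer counts tokens (words) rather than characters, 254 is a very generous upper bound: the average name is only about four words long, so most of each sequence is padding.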

In [5]:
# Define the TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=20000,
    output_mode='int',
    output_sequence_length=254
)

# Adapt the vectorization layer to the training data
vectorize_layer.adapt(X_train)
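
The difference between character length and token count can be seen in a small plain-Python sketch. It only approximates the layer's default behavior (lowercasing, deleting punctuation, splitting on whitespace) and does not call Keras. The function name `standardize_and_split` is hypothetical, and the real layer may treat non-ASCII punctuation differently.

```python
import string


def standardize_and_split(text):
    """Lowercase, delete ASCII punctuation, then split on whitespace.

    This is an approximation of TextVectorization's default
    standardization and splitting, written for illustration only.
    """
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()


name = "Keil, Henricus, 1822-1894"
tokens = standardize_and_split(name)

print(len(name))    # length in characters
print(tokens)       # the hyphen is deleted, so the dates fuse into one token
print(len(tokens))  # length in tokens
```

Under these assumptions a 25-character name becomes only three tokens, which suggests that a much smaller `output_sequence_length` would be worth trying.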


## Making the model

I'm using a sequence model (built with Keras's `Sequential` class) in accordance with Chollet's "golden constant" for deciding between a bag-of-words model and a sequence model:

>[Y]ou should pay close attention to the ratio between the number of samples in your training data and the mean number of words per sample …. If that ratio is small—less than 1,500—then the bag-of-bigrams model will perform better (and as a bonus, it will be much faster to train and to iterate on too). If that ratio is higher than 1,500, then you should go with a sequence model. In other words, sequence models work best when lots of training data is available and when each sample is relatively short. (421)

The dataset has 25,638 records averaging about 3.91 words each, which works out to a ratio of approximately 6,560. That figure covers the whole dataset rather than just the training split, but it is far enough above 1,500 that a sequence model seems like a reasonable path to follow.
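
As a rough check, the ratio can be recomputed for the training split alone. This sketch uses only the figures printed earlier in the notebook. It assumes that the training subset has the same mean number of words per record as the full dataset, and that the 20% validation split is rounded up to a whole number of records. All variable names are hypothetical.

```python
import math

# Figures taken from the notebook output above.
n_records = 25638
mean_words = 3.90802714720337
threshold = 1500  # Chollet's cutoff between bag-of-bigrams and sequence models

# Ratio for the whole dataset.
ratio_all = n_records / mean_words

# Assumption: a test_size of 0.2 rounds the validation count up,
# and the training subset has the same mean words per record.
n_val = math.ceil(0.2 * n_records)
n_train = n_records - n_val
ratio_train = n_train / mean_words

print(f"All records:   {ratio_all:.0f}")
print(f"Training only: {ratio_train:.0f} ({n_train} records)")
print(f"Above threshold: {ratio_train > threshold}")
```

Even on the training split alone, the ratio stays well above 1,500 under these assumptions.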

In the model itself, the layers are:

- `vectorize_layer`: This turns the text into data that Keras can process. It normalizes and tokenizes the text, then replaces each token with its integer index in the vocabulary.
- `Embedding(input_dim=17719, output_dim=128)`: According to Chollet (398), word embeddings “map human language into a structured geometric space.” In other words, they “pack more information into far fewer dimensions.” He goes on to explain (401) that “The `Embedding` layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors.” The `input_dim` is set to the size of the vocabulary, which I found with `len(vectorize_layer.get_vocabulary())`. The `output_dim` is set to 128, a common starting value. The code below also passes `mask_zero=True`, which marks index 0 (the padding value) so that layers that support masking can ignore it.
- `Flatten()`: This layer collapses each sample's 254 × 128 matrix of embeddings into a single vector of 32,512 values, so the batch becomes a 2D tensor and the `Dense` layer receives one flat vector per name.
- `Dense(128, activation='relu')`: This is the hidden dense layer, with 128 'neurons' for processing the data. 128 is a "Goldilocks" number of neurons, not too many and not too few, so it is a good place to start. The `relu` in `activation` refers to the rectified linear unit, a standard activation function that passes positive values through unchanged and sets negative values to zero.
- `Dense(len(label_encoder.classes_), activation='softmax')`: This adds the output layer, with the number of neurons set to the number of distinct DLL IDs. This accords with Chollet's advice (151): “If you’re trying to classify data points among N classes, your model should end with a Dense layer of size N.” Chollet (151) recommends the `softmax` function for single-label, multi-class problems like this one since “it will output a probability distribution over the N output classes.” In other words, the probabilities will add up to 1.
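
The sizes involved can be worked out by hand. The sketch below is plain arithmetic, not a Keras call. It uses the sequence length, vocabulary size, and layer widths given above. The number of classes is not reported in this notebook, so `n_classes` is a made-up placeholder. All variable names are hypothetical.

```python
# Values stated in the notebook.
seq_len = 254        # output_sequence_length
vocab_size = 17719   # Embedding input_dim
embed_dim = 128      # Embedding output_dim
hidden_units = 128   # first Dense layer

# Placeholder only: the real value is len(label_encoder.classes_).
n_classes = 1000

# Per-sample shapes: (254,) -> (254, 128) -> (32512,) -> (128,) -> (n_classes,)
flattened = seq_len * embed_dim

# Trainable weights: one vector per vocabulary entry, then
# weights plus biases for each Dense layer.
embedding_params = vocab_size * embed_dim
dense_params = flattened * hidden_units + hidden_units
output_params = hidden_units * n_classes + n_classes

print(f"Flattened vector length: {flattened}")
print(f"Embedding parameters:    {embedding_params}")
print(f"Hidden Dense parameters: {dense_params}")
print(f"Output Dense parameters: {output_params} (with placeholder n_classes)")
```

Most of the hidden layer's weights come from the long flattened input, so a shorter `output_sequence_length` would shrink that layer considerably.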

In [6]:
# Define the model
model = Sequential([
    vectorize_layer,
    Embedding(input_dim=17719, output_dim=128, mask_zero=True),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(len(label_encoder.classes_), activation='softmax')
])

## Compiling the model

The `compile()` method configures the model for training with the optimizer, loss function, and metrics it receives.

The optimizer determines how the model's weights are updated in response to the loss. Keras has several optimizers available. I selected `adam` because it is a popular choice for this sort of problem, and because it combines a lot of the best features of two other popular optimizers, `AdaGrad` and `RMSProp`.

The loss parameter measures “how far [the] output is from what you expected” (Chollet 31). I've selected `sparse_categorical_crossentropy`, the variant of categorical crossentropy that accepts integer labels such as those produced by `LabelEncoder`. Chollet (151) says that categorical crossentropy “is almost always the loss function you should use for such problems. It minimizes the distance between the probability distributions output by the model and the true distribution of the targets.”

The metric I'm interested in is, of course, accuracy.

In [7]:
# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
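
To show what the softmax output and this loss mean for a single sample, here is a plain-Python sketch. It is not the Keras implementation, which works on whole batches and clips probabilities for numerical stability. The function names and the three example scores in `logits` are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(probs, true_index):
    """Loss for one sample whose label is an integer class index."""
    return -math.log(probs[true_index])


# Hypothetical raw scores for a three-class problem.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)

# The loss is small when the true class received a high probability
# and large when it received a low one.
loss_right = sparse_categorical_crossentropy(probs, 0)
loss_wrong = sparse_categorical_crossentropy(probs, 2)

print([round(p, 3) for p in probs])
print(f"Loss if the true class is 0: {loss_right:.3f}")
print(f"Loss if the true class is 2: {loss_wrong:.3f}")
```

Because the label is passed as an index rather than a one-hot vector, the integer output of `LabelEncoder` can be used directly.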

In [13]:
# Save the best performing model
# from tensorflow.keras.callbacks import ModelCheckpoint
# callbacks = [ModelCheckpoint("authors",save_best_only=True,save_format='tf')]

# Train the model
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_val, y_val), batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [19]:
# Evaluating the model on the validation set
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

Validation Accuracy: 69.01%


In [25]:
# Example of predicting a new author name
new_author = "Keil, Henricus, 1822-1894"

# Ensure the input is in the correct shape (list of strings)
# This must be a list with a single string, so it becomes a 1D tensor of shape (1,)
vectorized_input = vectorize_layer(tf.constant([new_author]))

print(vectorized_input.shape)  # This should output something like (1, 254)


(1, 254)


In [26]:

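# Predict using the model
# Note: this call fails with the ValueError shown below. The model already
# starts with vectorize_layer, so it expects raw strings, not the
# pre-vectorized tensor of shape (1, 254).
predicted_probabilities = model.predict(vectorized_input)
predicted_label = predicted_probabilities.argmax(axis=-1)  # Get the index of the highest probability

# Map the predicted label back to the original dll_author_id
# (predicted_label is already a 1D array, so it is passed as-is)
predicted_id = label_encoder.inverse_transform(predicted_label)[0]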

print(f"Predicted dll_author_id: {predicted_id}")


ValueError: in user code:

    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/engine/training.py", line 2169, in predict_function  *
        return step_function(self, iterator)
    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/engine/training.py", line 2155, in step_function  **
        outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/engine/training.py", line 2143, in run_step  **
        outputs = model.predict_step(data)
    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/engine/training.py", line 2111, in predict_step
        return self(x, training=False)
    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
        raise e.with_traceback(filtered_tb) from None
    File "/Users/sjhuskey/anaconda3/envs/aiml/lib/python3.11/site-packages/keras/layers/preprocessing/text_vectorization.py", line 573, in _preprocess
        raise ValueError(

    ValueError: Exception encountered when calling layer 'text_vectorization' (type TextVectorization).
    
    When using `TextVectorization` to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, 254) with rank=2
    
    Call arguments received by layer 'text_vectorization' (type TextVectorization):
      • inputs=tf.Tensor(shape=(None, 254), dtype=string)


In [28]:
# Example of predicting a new author name
new_author = "Hall, Joseph, 1574-1656"

# Pass the raw string to the model
predicted_probabilities = model.predict([new_author])  # Pass as a list with one item

# Get the index of the highest probability
predicted_label = predicted_probabilities.argmax(axis=-1)

# Map the predicted label back to the original dll_author_id
# (predicted_label is already a 1D array, so it is passed as-is)
predicted_id = label_encoder.inverse_transform(predicted_label)[0]

print(f"Predicted dll_author_id: {predicted_id}")


Predicted dll_author_id: A3964


