# One-hot encoded book genres

PyBooks wants to catalog and analyze the book genres in its library. Apply one-hot encoding to a list of book genres to make them machine-readable.

* Define the size of the vocabulary and save to vocab_size.
* Create one-hot vectors using the appropriate torch technique and vocab_size.
* Create a dictionary mapping genres to their corresponding one-hot vectors using a dictionary comprehension; the dictionary keys should be the genres.

In [1]:
import torch

genres = ['Fiction','Non-fiction','Biography', 'Children','Mystery']

# Define the size of the vocabulary
vocab_size = len(genres)

# Create one-hot vectors
one_hot_vectors = torch.eye(vocab_size)

# Create a dictionary mapping genres to their one-hot vectors
one_hot_dict = {genre : one_hot_vectors[i] for i, genre in enumerate(genres)}

for genre, vector in one_hot_dict.items():
    print(f'{genre}: {vector.numpy()}')

Fiction: [1. 0. 0. 0. 0.]
Non-fiction: [0. 1. 0. 0. 0.]
Biography: [0. 0. 1. 0. 0.]
Children: [0. 0. 0. 1. 0.]
Mystery: [0. 0. 0. 0. 1.]


The output shows each genre as a binary vector with a single 1 marking that genre's position in the vocabulary and 0 everywhere else. This numeric form can be fed directly to models for tasks such as predicting book popularity or making book recommendations.
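As a cross-check on the `torch.eye` approach above, the same vectors can be sketched with `torch.nn.functional.one_hot`, which builds them from integer class indices. The `genre_to_index` mapping is a hypothetical helper introduced only for this illustration; it is not part of the original exercise.

```python
import torch
import torch.nn.functional as F

genres = ['Fiction', 'Non-fiction', 'Biography', 'Children', 'Mystery']

# Hypothetical helper: map each genre to an integer class index
genre_to_index = {genre: i for i, genre in enumerate(genres)}

# F.one_hot expects an integer (int64) tensor of class indices
indices = torch.tensor([genre_to_index[g] for g in genres])
one_hot = F.one_hot(indices, num_classes=len(genres))

# Same values as torch.eye, but with an integer dtype instead of float
same_as_eye = torch.equal(one_hot.float(), torch.eye(len(genres)))

# Decode a vector back to its genre by locating the single 1
decoded = genres[int(one_hot[2].argmax())]

print(same_as_eye, decoded)
```

Indexing rows of `torch.eye` is convenient when every category appears exactly once, while `F.one_hot` may be the more natural fit when encoding an arbitrary sequence of category indices.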

# Bag-of-words for book titles

PyBooks now has a list of book titles that need to be encoded for further analysis. The data team believes the Bag of Words (BoW) model could be the best approach.

* Import the CountVectorizer class for implementing bag-of-words.
* Initialize an object of the class you imported, then use this object to transform the titles into a matrix representation.
* Display the first five entries of the first encoded title, then display the first five feature names using the get_feature_names_out() method.

In [2]:
# Import from sklearn (bag-of-words needs only scikit-learn, so the unused torchtext import is dropped)
from sklearn.feature_extraction.text import CountVectorizer

titles = ['The Great Gatsby','To Kill a Mockingbird','1984','The Catcher in the Rye',
          'The Hobbit', 'Great Expectations']

# Initialize Bag-of-words with the list of book titles
vectorizer = CountVectorizer()
# The fit_transform method is used to learn the vocabulary and convert the input text data into a BoW representation
bow_encoded_titles = vectorizer.fit_transform(titles)

# Print the first five entries of the first title's BoW vector
print(bow_encoded_titles.toarray()[0, :5])

# The get_feature_names_out method returns the list of feature names (unique words in the vocabulary)
print(vectorizer.get_feature_names_out()[:5])

[0 0 0 1 1]
['1984' 'catcher' 'expectations' 'gatsby' 'great']


The BoW representation is a sparse matrix where each row represents a document (title) and each column represents a unique word in the vocabulary. The value at position (i, j) in the matrix indicates the count of the j-th word in the i-th document.
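The (i, j) lookup described above can be sketched directly on the sparse matrix, without converting it to a dense array. This is a minimal illustration that reuses the titles from the exercise; the variable names `row`, `col`, and `count` are chosen here for the example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

titles = ['The Great Gatsby', 'To Kill a Mockingbird', '1984',
          'The Catcher in the Rye', 'The Hobbit', 'Great Expectations']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(titles)

# vocabulary_ maps each word to its column index j
col = vectorizer.vocabulary_['the']

# The row index i is simply the position of the title in the input list
row = titles.index('The Catcher in the Rye')

# Count of the j-th word in the i-th title ("the" occurs twice here)
count = bow[row, col]

# shape is (documents, vocabulary size); nnz is the number of stored non-zero counts
print(bow.shape, bow.nnz, count)
```

Only the non-zero counts are stored, which is why the sparse format stays compact even when the vocabulary grows much larger than any single title.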

The output shows the word counts for the first title. Reading the first five columns, 'gatsby' and 'great' each occur once in "The Great Gatsby", while '1984', 'catcher', and 'expectations' do not occur. These word-count feature vectors can later be used as input to machine learning algorithms.

In [3]:
# Print the full BoW matrix and the complete vocabulary
print(bow_encoded_titles.toarray())
print(vectorizer.get_feature_names_out())

[[0 0 0 1 1 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 1 0 0 1]
 [1 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 0 0 0 1 0 0 1 2 0]
 [0 0 0 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 1 0 0 0 0 0 0 0]]
['1984' 'catcher' 'expectations' 'gatsby' 'great' 'hobbit' 'in' 'kill'
 'mockingbird' 'rye' 'the' 'to']


* The vocabulary is ['1984', 'catcher', 'expectations', 'gatsby', 'great', 'hobbit', 'in', 'kill', 'mockingbird', 'rye', 'the', 'to']. Every word is lowercased, and "a" from "To Kill a Mockingbird" is absent because CountVectorizer's default tokenizer keeps only tokens of two or more characters.

* The BoW vector for "The Great Gatsby" is [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0]:

  * The value at index 3 (gatsby) is 1 because "gatsby" appears once in the title.
  * The value at index 4 (great) is 1 because "great" appears once in the title.
  * The value at index 10 (the) is 1 because "the" appears once in the title.
  * All other values are 0 because the corresponding words do not appear in the title.
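The walk-through above can be reproduced by hand with the standard library. The sketch below assumes CountVectorizer's default behavior (lowercasing, and a token pattern that keeps runs of two or more word characters); `bow_vector` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import re
from collections import Counter

# Mirrors CountVectorizer's default token pattern: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

vocabulary = ['1984', 'catcher', 'expectations', 'gatsby', 'great', 'hobbit',
              'in', 'kill', 'mockingbird', 'rye', 'the', 'to']

def bow_vector(title):
    """Count how often each vocabulary word occurs in the title."""
    tokens = TOKEN_PATTERN.findall(title.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

gatsby = bow_vector('The Great Gatsby')

# The single-character token "a" does not match the pattern and is dropped
mockingbird_tokens = TOKEN_PATTERN.findall('To Kill a Mockingbird'.lower())

print(gatsby)
print(mockingbird_tokens)
```

Under these assumptions, `gatsby` has non-zero entries only at indices 3, 4, and 10, matching the row shown in the matrix output.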

# Applying TF-IDF to book descriptions

PyBooks has collected several book descriptions and wants to identify important words within them using the TF-IDF encoding technique. By doing this, they hope to gain more insights into the unique attributes of each book to help with their book recommendation system.

* Import the class from sklearn.feature_extraction.text that converts a collection of raw documents to a matrix of TF-IDF features.
* Instantiate an object of this class, then use this object to encode the descriptions into a TF-IDF matrix of vectors.
* Retrieve and display the first five feature names from the vectorizer and encoded vectors from tfidf_encoded_descriptions.

In [5]:
# Importing TF-IDF from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?"
]

# Initialize TF-IDF encoding vectorizer
vectorizer = TfidfVectorizer()
tfidf_encoded_descriptions = vectorizer.fit_transform(descriptions)

# Extract and print the first five features
print(tfidf_encoded_descriptions.toarray()[0, :5])
print(vectorizer.get_feature_names_out()[:5])

[0.         0.46979139 0.58028582 0.38408524 0.        ]
['and' 'document' 'first' 'is' 'one']


In [6]:
# Print the full TF-IDF matrix and the complete vocabulary
print(tfidf_encoded_descriptions.toarray())
print(vectorizer.get_feature_names_out())

[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
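To see where these numbers come from, the first row can be recomputed by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The tokenized documents are written out manually, and `idf` and `tfidf_row` are hypothetical helpers for illustration only.

```python
import math

# The four descriptions, already lowercased and tokenized by hand
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]

# Alphabetically sorted vocabulary, matching the feature order shown above
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency (assumed default formula)."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Term count times IDF for each vocabulary word, then L2-normalized."""
    raw = [doc.count(term) * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

row0 = tfidf_row(docs[0])
print([round(value, 8) for value in row0[:5]])
```

Words that occur in every description ('is', 'the', 'this') get the minimum IDF of 1, so after normalization they score lower than rarer words such as 'first'.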


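By examining the feature names alongside their TF-IDF values, you can see which words distinguish each description. Words that appear in every description, such as 'is', 'the', and 'this', receive lower weights than words confined to one or two descriptions, such as 'second' or 'third'. Those higher-weighted words are the ones most useful for characterizing an individual book in a recommendation system.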