<a href="https://colab.research.google.com/github/rhiosutoyo/Teaching-Deep-Learning-and-Its-Applications/blob/main/7_1_working_with_text_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Text Data
This notebook turns raw movie reviews into numerical vectors in two ways: one-hot encoding and word embeddings.
* Steps 1-5 prepare the data and convert each word to a one-hot vector, a binary vector with a single 1 at the position assigned to that word.
* Steps 6-7 initialize a word embedding layer and convert each word to a dense embedding vector. The vectors are randomly initialized here; in a real model they are learned during training.
* Step 8 prints the original reviews next to their vector representations to show the transformation.

In [1]:
!pip install torch scikit-learn numpy



In [2]:
import torch
import torch.nn as nn
from sklearn.preprocessing import OneHotEncoder
import numpy as np

## 1. Defines a list of sample movie reviews
We start by creating a list of movie reviews. Each review is a string that represents a single user’s opinion about a movie. This list serves as our input data for the process.

In [3]:
# Sample movie reviews
reviews = [
    "I love this movie, it's amazing!",
    "The movie was okay, not great.",
    "I didn't like the movie at all.",
    "Absolutely fantastic! A must-watch.",
    "Not my type of movie, very boring."
]

## 2. Creates a vocabulary set from the reviews
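We build a vocabulary by collecting every unique token in the reviews. Because the text is split on whitespace only, tokens are case-sensitive and keep their punctuation, so “movie” and “movie,” count as two different words. This vocabulary will be used to encode the reviews.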

In [4]:
# Create the vocabulary (sorted so that word indices are the same on every run)
vocab = sorted(set(" ".join(reviews).split()))
vocab_size = len(vocab)
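
As a point of comparison, the sketch below contrasts the whitespace split used above with a simple normalizing tokenizer. `simple_tokenize` is a hypothetical helper, not part of the notebook, and its regular expression is only one of many reasonable choices.

```python
import re

reviews = [
    "I love this movie, it's amazing!",
    "The movie was okay, not great.",
    "I didn't like the movie at all.",
    "Absolutely fantastic! A must-watch.",
    "Not my type of movie, very boring."
]

def simple_tokenize(text):
    """Hypothetical helper: lowercase, then keep runs of letters, digits, apostrophes, and hyphens."""
    return re.findall(r"[a-z0-9'-]+", text.lower())

# Vocabulary from a plain whitespace split (punctuation and case preserved)
raw_vocab = set(" ".join(reviews).split())

# Vocabulary after lowercasing and stripping surrounding punctuation
clean_vocab = {token for review in reviews for token in simple_tokenize(review)}

print(len(raw_vocab))    # 27
print(len(clean_vocab))  # 24
```

The normalized vocabulary is smaller because variants such as “movie” and “movie,” or “Not” and “not” collapse into one entry.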

## 3. Maps each word in the vocabulary to a unique index
We create a mapping from each word in the vocabulary to a unique integer index. This helps in converting words to numerical representations.

In [5]:
# Create a mapping from word to index
word_to_index = {word: idx for idx, word in enumerate(vocab)}
index_to_word = {idx: word for word, idx in word_to_index.items()}
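
The two dictionaries are inverses of each other, so a review can be encoded to indices and decoded back. The following is a minimal sketch of that round trip; it sorts the vocabulary (an assumption made here so the printed indices are reproducible).

```python
reviews = [
    "I love this movie, it's amazing!",
    "The movie was okay, not great.",
    "I didn't like the movie at all.",
    "Absolutely fantastic! A must-watch.",
    "Not my type of movie, very boring."
]

# Sorting is an assumption of this sketch; it fixes the index of every word
vocab = sorted(set(" ".join(reviews).split()))
word_to_index = {word: idx for idx, word in enumerate(vocab)}
index_to_word = {idx: word for word, idx in word_to_index.items()}

# Encode the first review as a list of integer indices
indices = [word_to_index[word] for word in reviews[0].split()]
print(indices)  # [2, 14, 23, 16, 12, 6]

# Decode the indices back into text
decoded = " ".join(index_to_word[idx] for idx in indices)
print(decoded)  # I love this movie, it's amazing!
```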

## 4. Uses the OneHotEncoder from sklearn.preprocessing to one-hot encode the words
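We use OneHotEncoder to encode each word as a one-hot vector: a binary vector with a single “1” and “0” everywhere else. The position of the “1” identifies the word. Note that OneHotEncoder assigns columns by sorting the words it was fitted on, so the column index matches word_to_index only if the vocabulary is enumerated in that same sorted order.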

In [6]:
# One-hot encode the words
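# sparse_output=False returns a dense NumPy array (scikit-learn >= 1.2; older versions use sparse=False)
one_hot_encoder = OneHotEncoder(sparse_output=False, categories='auto')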
one_hot_encoder.fit(np.array(list(vocab)).reshape(-1, 1))
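
To make the idea concrete without any library encoder, one-hot vectors can also be read off the rows of an identity matrix. The sketch below uses a tiny illustrative vocabulary and a hypothetical helper `one_hot`; neither is part of the notebook.

```python
import numpy as np

# Tiny illustrative vocabulary, sorted so the indices are predictable
vocab = sorted(["I", "love", "this", "movie"])  # ['I', 'love', 'movie', 'this']
word_to_index = {word: idx for idx, word in enumerate(vocab)}

# Row i of the identity matrix is the one-hot vector for index i
identity = np.eye(len(vocab))

def one_hot(word):
    """Hypothetical helper: return the one-hot vector for a known word."""
    return identity[word_to_index[word]]

print(one_hot("love"))  # [0. 1. 0. 0.]
```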



## 5. Converts each review into one-hot encoded vectors
For each review, we split the review into words and convert each word into its one-hot encoded vector using the encoder.

In [7]:
# Convert reviews to one-hot encoded vectors
def review_to_one_hot_vectors(review):
    # Split on whitespace, then encode each token as one row of the result
    words = review.split()
    # The encoder expects a 2-D array of shape (n_words, 1)
    return one_hot_encoder.transform(np.array(words).reshape(-1, 1))
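
By default, an encoder fitted this way raises an error when it meets a word that was not in the vocabulary. One possible way to tolerate unseen words, sketched here on a tiny illustrative vocabulary, is scikit-learn's `handle_unknown="ignore"` option, which encodes an unknown word as an all-zero row. Whether that is the right behavior depends on the application.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Tiny illustrative vocabulary (an assumption of this sketch)
vocab = ["I", "love", "this", "movie"]

# handle_unknown="ignore" maps unseen words to an all-zero row instead of raising
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array(vocab).reshape(-1, 1))

# "film" was never seen during fit
words = "I love this film".split()

# The default output is a sparse matrix; .toarray() makes it dense
vectors = encoder.transform(np.array(words).reshape(-1, 1)).toarray()

print(vectors.shape)  # (4, 4): one row per word, one column per vocabulary entry
print(vectors[-1])    # [0. 0. 0. 0.] for the unseen word
```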

## 6. Initializes a word embedding layer using torch.nn.Embedding
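We create a word embedding layer using PyTorch’s nn.Embedding. The layer stores one dense vector (embedding) of a specified dimension for each word in the vocabulary. These vectors start out random; when the layer is trained as part of a neural network, they are updated so that they become useful representations of the words.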

In [8]:
# Word embedding using PyTorch's nn.Embedding
embedding_dim = 10  # Size of the word embedding vectors
embedding = nn.Embedding(vocab_size, embedding_dim)
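
Under the hood, an embedding layer is a trainable lookup table. The sketch below inspects that table; the vocabulary size of 27 matches the sample reviews, while the indices are arbitrary illustrative values.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 27, 10
embedding = nn.Embedding(vocab_size, embedding_dim)

# One row per vocabulary word, one column per embedding dimension
print(embedding.weight.shape)          # torch.Size([27, 10])

# The weights are parameters, so an optimizer can update them during training
print(embedding.weight.requires_grad)  # True

# Calling the layer on indices simply selects the matching rows
idx = torch.tensor([2, 14], dtype=torch.long)
vectors = embedding(idx)
print(torch.equal(vectors, embedding.weight[idx]))  # True
```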

## 7. Converts each review into word embedding vectors
We convert each review into embedding vectors by passing the index of each word through the embedding layer. The embedding layer transforms these indices into dense vectors.

In [9]:
# Convert reviews to word embedding vectors
def review_to_embedding_vectors(review):
    words = review.split()
    word_indices = torch.tensor([word_to_index[word] for word in words], dtype=torch.long)
    embedding_vectors = embedding(word_indices)
    return embedding_vectors
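
The two representations in this notebook are closely related: looking up an index in the embedding layer gives the same result as multiplying the word's one-hot vector by the embedding weight matrix. A minimal sketch of that equivalence, using arbitrary illustrative indices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embedding_dim = 27, 10
embedding = nn.Embedding(vocab_size, embedding_dim)

# Three arbitrary word indices
indices = torch.tensor([2, 14, 23], dtype=torch.long)

# Route 1: build one-hot rows and multiply by the weight matrix
one_hot = F.one_hot(indices, num_classes=vocab_size).float()  # shape (3, 27)
via_matmul = one_hot @ embedding.weight                       # shape (3, 10)

# Route 2: direct lookup through the embedding layer
via_lookup = embedding(indices)                               # shape (3, 10)

print(torch.allclose(via_matmul, via_lookup))  # True
```

The lookup avoids materializing the mostly-zero one-hot matrix, which is why embedding layers take integer indices as input.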

## 8. Prints the reviews and their corresponding one-hot encoded vectors and word embedding vectors
Finally, we print the original reviews along with their one-hot encoded vectors and word embedding vectors to see the transformation from text to numerical representations.

In [10]:
# Convert all reviews to one-hot and embedding vectors
one_hot_review_vectors = [review_to_one_hot_vectors(review) for review in reviews]
embedding_review_vectors = [review_to_embedding_vectors(review) for review in reviews]

# Print the movie reviews and their one-hot encoding vectors
for review, one_hot_vector in zip(reviews, one_hot_review_vectors):
    print(f"Review: {review}")
    print("One-hot Encoding Vectors:")
    print(one_hot_vector)
    print()

# Print the movie reviews and their embedding vectors
for review, embedding_vector in zip(reviews, embedding_review_vectors):
    print(f"Review: {review}")
    print("Embedding Vectors:")
    print(embedding_vector)
    print()

Review: I love this movie, it's amazing!
One-hot Encoding Vectors:
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]]

Review: The movie was okay, not great.
One-hot Encoding Vectors:
[[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0.]
 [0. 0