In [None]:
import numpy as np
import re
import pandas as pd

In [None]:
Here’s a brief explanation of the libraries you've imported:

1. **`numpy` (imported as `np`)**:
   - A fundamental library for numerical computing in Python.
   - It provides support for large multi-dimensional arrays and matrices, along with mathematical functions to operate on these arrays (e.g., element-wise operations).

2. **`re`**:
   - The regular expression (regex) library in Python.
   - It is used to search, match, and manipulate text using patterns, making it useful for tasks like pattern matching and string parsing.

3. **`pandas` (imported as `pd`)**:
   - A powerful library for data manipulation and analysis.
   - It provides data structures like DataFrame and Series to handle structured data, perform data cleaning, and analyze datasets efficiently.

These libraries are commonly used together in data analysis tasks.

In [None]:
sentences = data.split('.')
sentences

In [None]:
The code `sentences = data.split('.')` does the following:

1. **`data.split('.')`**:
   - Splits the string `data` into smaller parts (substrings) wherever there is a period (`.`).
   - This creates a list of sentences, assuming each sentence ends with a period.

2. **`sentences`**:
   - Stores the resulting list of sentence-like segments from `data`.

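After running this code, `sentences` holds each period-delimited segment of `data` as a separate list item. If `data` ends with a period, the last item is an empty string, which the cleaning loop in the next cell skips.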
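
To make the splitting behaviour concrete, here is a minimal sketch on a made-up string. `sample` is a hypothetical stand-in, since the notebook's actual `data` is defined elsewhere. It illustrates that a trailing period produces an empty final item and that periods inside abbreviations are split points too.

```python
# `sample` is a hypothetical stand-in for the notebook's `data`.
sample = "Deep learning is popular. It uses e.g. neural networks."

parts = sample.split('.')

# Every period is a split point, including the two inside "e.g."
for part in parts:
    print(repr(part))
```

For text with many abbreviations or decimal numbers, a dedicated sentence splitter would likely be a better fit than `split('.')`.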

In [None]:
clean_sent=[]
for sentence in sentences:
    if sentence=="":
        continue
    sentence = re.sub('[^A-Za-z0-9]+', ' ', (sentence))
    sentence = re.sub(r'(?:^| )\w (?:$| )', ' ', (sentence)).strip()
    sentence = sentence.lower()
    clean_sent.append(sentence)

clean_sent

In [None]:
Here’s a breakdown of what this code does:

1. **`clean_sent = []`**:
   - Creates an empty list `clean_sent` to store cleaned sentences.

2. **`for sentence in sentences:`**:
   - Iterates through each sentence in the `sentences` list.

3. **`if sentence == "": continue`**:
   - Skips any empty strings in `sentences` to avoid processing blank entries.

4. **Cleaning the sentence**:
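   - **`re.sub('[^A-Za-z0-9]+', ' ', sentence)`**:
     - Replaces each run of characters that are not letters (`A-Za-z`) or digits (`0-9`) with a single space.
   - **`re.sub(r'(?:^| )\w (?:$| )', ' ', sentence).strip()`**:
     - Intended to remove single-character words (like "a" or "I"). As written, the pattern requires the character to be preceded by a space or the start of the string and followed by a space that is itself followed by another space or the end of the string. Because the previous step already collapses separators into single spaces, a lone character in the middle of a sentence (e.g., the "a" in "this is a test") is not matched; only one that sits right before a trailing space at the end of the string is.
     - `strip()` removes any leading or trailing whitespace.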
   - **`sentence.lower()`**:
     - Converts the sentence to lowercase.

5. **`clean_sent.append(sentence)`**:
   - Appends the cleaned, lowercase version of the sentence to the `clean_sent` list.

After the loop, `clean_sent` contains the cleaned, processed version of each sentence from `sentences`.
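
The following sketch wraps the same three steps in a function so their effect can be checked on a small input. The input strings are made up, and `drop_single_chars` is a hypothetical alternative (not part of the notebook) for cases where every one-character word should be removed.

```python
import re

def clean(sentence):
    """Apply the same three steps as the loop above."""
    sentence = re.sub('[^A-Za-z0-9]+', ' ', sentence)
    sentence = re.sub(r'(?:^| )\w (?:$| )', ' ', sentence).strip()
    return sentence.lower()

def drop_single_chars(sentence):
    """Hypothetical alternative: keep only words longer than one character."""
    return ' '.join(word for word in sentence.split() if len(word) > 1)

cleaned = clean(" Deep-learning is a sub_field!")
print(cleaned)                     # the mid-sentence "a" is still present
print(drop_single_chars(cleaned))  # the "a" is gone
print(clean("plan b "))            # a trailing single character is removed
```

Whether single-character words should be dropped at all depends on the corpus; the sketch only shows what each variant does.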

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
The line `from tensorflow.keras.preprocessing.text import Tokenizer` does the following:

- **`tensorflow.keras.preprocessing.text`**:
  - A module within TensorFlow’s Keras API focused on preparing and preprocessing text data for machine learning models.

- **`Tokenizer`**:
  - A class that converts text data into sequences of integer tokens, one per word by default (or one per character with `char_level=True`).
  - It creates a vocabulary based on the input text and assigns each unique word an integer ID.
  - Useful for tasks like text classification, sentiment analysis, or natural language processing.

By importing `Tokenizer`, you can easily preprocess and tokenize text data, transforming it into numerical format suitable for neural networks.

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_sent)
sequences = tokenizer.texts_to_sequences(clean_sent)
print(sequences)

In [None]:
Here’s what each line of this code does:

1. **`tokenizer = Tokenizer()`**:
   - Initializes a `Tokenizer` object, which will be used to tokenize (convert words into integer sequences) the text.

2. **`tokenizer.fit_on_texts(clean_sent)`**:
   - Analyzes the list `clean_sent` and builds a vocabulary based on the unique words found in the sentences.
   - Each word is assigned a unique integer ID based on its frequency, with more common words receiving lower integer IDs.

3. **`sequences = tokenizer.texts_to_sequences(clean_sent)`**:
   - Converts each sentence in `clean_sent` into a sequence of integers.
   - Each integer represents a word from the vocabulary created by the tokenizer.
   - For example, if "the" is the most frequent word, it might be encoded as `1` throughout the sequences.

4. **`print(sequences)`**:
   - Outputs the list `sequences`, where each sentence in `clean_sent` is now represented as a list of integers.

After running this code, `sequences` contains the tokenized integer representations of each sentence from `clean_sent`.
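
As a rough illustration of frequency-ranked IDs, the sketch below reproduces the idea in plain Python. It is a simplified stand-in, not Keras's implementation, and the sentences and the name `demo_sequences` are made up for the example.

```python
from collections import Counter

# Hypothetical, already-cleaned sentences.
texts = ["the cat sat", "the dog sat", "the bird flew"]

counts = Counter(word for text in texts for word in text.split())

# Rank by descending frequency; ties keep first-seen order. IDs start at 1.
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

demo_sequences = [[word_index[word] for word in text.split()] for text in texts]
print(word_index)
print(demo_sequences)
```

Here "the" appears most often and receives ID 1, followed by "sat" with ID 2; ID 0 is never assigned.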

In [None]:
index_to_word = {}
word_to_index = {}

for i, sequence in enumerate(sequences):
#     print(sequence)
    word_in_sentence = clean_sent[i].split()
#     print(word_in_sentence)

    for j, value in enumerate(sequence):
        index_to_word[value] = word_in_sentence[j]
        word_to_index[word_in_sentence[j]] = value

print(index_to_word, "\n")
print(word_to_index)

In [None]:
Here’s what this code does:

1. **`index_to_word = {}` and `word_to_index = {}`**:
   - Creates two dictionaries:
     - `index_to_word`: Maps each integer (token) to its corresponding word.
     - `word_to_index`: Maps each word to its corresponding integer (token).

2. **Loop through sequences**:
   - **`for i, sequence in enumerate(sequences):`**:
     - Iterates over each `sequence` in `sequences`, with `i` as the index of the sentence.

3. **Token-to-Word Mapping**:
   - **`word_in_sentence = clean_sent[i].split()`**:
     - Splits the original sentence in `clean_sent[i]` (a cleaned, lowercase version of the sentence) into words.

4. **Inner loop to populate dictionaries**:
   - **`for j, value in enumerate(sequence):`**:
     - Iterates through each integer `value` in `sequence`, with `j` as the index of the word in the sentence.
   - **`index_to_word[value] = word_in_sentence[j]`**:
     - Adds a key-value pair to `index_to_word`, mapping each integer `value` to the corresponding word in the sentence.
   - **`word_to_index[word_in_sentence[j]] = value`**:
     - Adds a key-value pair to `word_to_index`, mapping each word to its integer `value`.

5. **`print(index_to_word, "\n")` and `print(word_to_index)`**:
   - Outputs the dictionaries:
     - `index_to_word`: Maps tokens to their respective words.
     - `word_to_index`: Maps words to their respective tokens.

After running this code, you’ll have two dictionaries:
- `index_to_word`: Allows you to decode tokenized sentences back into words.
- `word_to_index`: Allows you to convert words into their tokenized (integer) form.
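
Both mappings can also be derived from a single word-to-ID dictionary by inverting it, without walking through the sequences. The sketch below uses a small hypothetical dictionary shaped like `tokenizer.word_index`; in the notebook, `tokenizer.word_index` itself (and the tokenizer's `index_word` attribute) should hold the equivalent information.

```python
# Hypothetical vocabulary, in the same shape as `tokenizer.word_index`.
word_index = {"the": 1, "sat": 2, "cat": 3}

word_to_index = dict(word_index)
index_to_word = {index: word for word, index in word_index.items()}

# Round trip: words -> IDs -> words.
encoded = [word_to_index[word] for word in "the cat sat".split()]
decoded = [index_to_word[index] for index in encoded]
print(encoded)
print(decoded)
```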

In [None]:
vocab_size = len(tokenizer.word_index) + 1
emb_size = 10
context_size = 2

contexts = []
targets = []

for sequence in sequences:
    for i in range(context_size, len(sequence) - context_size):
        target = sequence[i]
        context = [sequence[i - 2], sequence[i - 1], sequence[i + 1], sequence[i + 2]]
#         print(context)
        contexts.append(context)
        targets.append(target)
print(contexts, "\n")
print(targets)

In [None]:
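This code prepares training data for a word embedding model by building a (context, target) pair for every token that has `context_size` neighbours on each side. Sentences shorter than `2*context_size + 1` tokens produce no pairs. Here's a breakdown: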

1. **Define parameters**:
   - **`vocab_size = len(tokenizer.word_index) + 1`**:
     - `vocab_size` is the number of unique words in the vocabulary plus 1, because `Tokenizer` IDs start at 1 and index 0 is reserved (commonly used for padding).
   - **`emb_size = 10`**:
     - `emb_size` is the size of the word embedding vector, which determines how many features represent each word.
   - **`context_size = 2`**:
     - `context_size` defines the window around each target word (in this case, 2 words on each side).

2. **Initialize `contexts` and `targets` lists**:
   - `contexts` will store the words around each target word.
   - `targets` will store each target word.

3. **Loop through each sequence in `sequences`**:
   - For each sequence (sentence represented by integer tokens):
     - **`for i in range(context_size, len(sequence) - context_size):`**:
       - Loops through the sequence, excluding the first and last `context_size` tokens to ensure context words are available.
     - **`target = sequence[i]`**:
       - Sets the `target` word at position `i`.
     - **`context = [sequence[i - 2], sequence[i - 1], sequence[i + 1], sequence[i + 2]]`**:
       - Creates a `context` list containing 4 tokens (2 before and 2 after the target). The offsets are hard-coded, so this line only matches `context_size = 2`.

4. **Append `context` and `target` to their respective lists**:
   - `contexts.append(context)`: Adds the list of context words to `contexts`.
   - `targets.append(target)`: Adds the target word to `targets`.

5. **Print `contexts` and `targets`**:
   - Outputs the lists:
     - `contexts`: Each entry contains the context words around a target word.
     - `targets`: Each entry is the target word corresponding to each context.

This setup matches the Continuous Bag of Words (CBOW) variant of word2vec, where the goal is to predict the target word from its surrounding context. Skip-gram works in the opposite direction, predicting the context words from the target.
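
The pair-building loop can be sketched as a small function that slices by `context_size` instead of using fixed offsets. `make_context_target_pairs` is a hypothetical helper name, and the token sequence is made up for illustration.

```python
def make_context_target_pairs(sequence, context_size):
    """Return (contexts, targets) for one token sequence.

    Hypothetical helper; mirrors the loop above but slices by `context_size`.
    """
    contexts, targets = [], []
    for i in range(context_size, len(sequence) - context_size):
        before = sequence[i - context_size:i]
        after = sequence[i + 1:i + 1 + context_size]
        contexts.append(before + after)
        targets.append(sequence[i])
    return contexts, targets

demo_contexts, demo_targets = make_context_target_pairs(
    [1, 2, 3, 4, 5, 6], context_size=2
)
print(demo_contexts)  # [[1, 2, 4, 5], [2, 3, 5, 6]]
print(demo_targets)   # [3, 4]
```

With `context_size=2` this yields the same pairs as the hard-coded version, and a sequence shorter than five tokens yields none.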

In [None]:
for i in range(5):
    words = []
    target = index_to_word.get(targets[i])
    for j in contexts[i]:
        words.append(index_to_word.get(j))
    print(words," -> ", target)

In [None]:
This code prints sample context-target pairs for the first few items in the `contexts` and `targets` lists, converting integer tokens back to words using the `index_to_word` dictionary.

1. **Loop through the first 5 items**:
   - **`for i in range(5):`**:
     - Iterates through the first five entries in `contexts` and `targets` to print examples.

2. **Initialize `words`**:
   - **`words = []`**:
     - Creates an empty list to store context words as actual words (rather than tokens).

3. **Get the target word**:
   - **`target = index_to_word.get(targets[i])`**:
     - Retrieves the word corresponding to the `target` token at `targets[i]` using `index_to_word` dictionary.

4. **Convert context tokens to words**:
   - **`for j in contexts[i]:`**:
     - Iterates through each token in `contexts[i]`.
   - **`words.append(index_to_word.get(j))`**:
     - Retrieves the word corresponding to each context token `j` and appends it to `words`.

5. **Print context and target**:
   - **`print(words, " -> ", target)`**:
     - Outputs the context words (`words`) and the target word (`target`) in the format `['context_word1', 'context_word2', ...] -> target_word`.

This printout helps verify that context-target pairs are correctly formed. Each line represents a target word and its context, showing the words instead of numerical tokens.
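
One caveat: `range(5)` raises an `IndexError` if fewer than five pairs exist. A slightly more defensive sketch slices both lists instead; the tiny dictionary and lists below are hypothetical miniatures of the notebook's objects.

```python
# Hypothetical miniature versions of the notebook's objects.
index_to_word = {1: "deep", 2: "learning", 3: "is", 4: "part", 5: "of"}
contexts = [[1, 2, 4, 5]]
targets = [3]

lines = []
# Slicing never raises, even when fewer than five pairs exist.
for context, target in zip(contexts[:5], targets[:5]):
    words = [index_to_word.get(token) for token in context]
    lines.append(f"{words} -> {index_to_word.get(target)}")

print("\n".join(lines))
```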

In [None]:
X = np.array(contexts)
Y = np.array(targets)

In [None]:
This code converts the `contexts` and `targets` lists into NumPy arrays:

1. **`X = np.array(contexts)`**:
   - Converts `contexts` (a list of context word tokens) into a NumPy array `X`.
   - This allows for easier handling and efficient processing in machine learning models.

2. **`Y = np.array(targets)`**:
   - Converts `targets` (a list of target word tokens) into a NumPy array `Y`.

After running this, `X` contains the context tokens for each target word in a structured array format, and `Y` contains the target tokens. This format is ready for input into machine learning models, where `X` would typically serve as the input features, and `Y` as the labels.
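
A quick shape check is a cheap way to confirm the arrays are laid out as the model expects. The sketch below uses two made-up pairs in the same layout as `contexts` and `targets`.

```python
import numpy as np

# Hypothetical pairs in the same layout as `contexts` and `targets`.
contexts = [[1, 2, 4, 5], [2, 3, 5, 6]]
targets = [3, 4]

X = np.array(contexts)
Y = np.array(targets)

print(X.shape)  # (2, 4): one row per pair, 2 * context_size columns
print(Y.shape)  # (2,): one integer label per pair
```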

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Lambda

In [None]:
Here’s what each import does:

1. **`import tensorflow as tf`**:
   - Imports the TensorFlow library, a powerful framework for building and training machine learning and deep learning models.

2. **`from tensorflow.keras.models import Sequential`**:
   - Imports the `Sequential` class, which is used to build models layer by layer. It’s ideal for models that have a simple, linear stack of layers.

3. **`from tensorflow.keras.layers import Dense, Embedding, Lambda`**:
   - **`Dense`**:
     - A fully connected layer where each neuron receives input from all neurons in the previous layer.
     - Used for output layers or intermediate layers in neural networks.
   - **`Embedding`**:
     - Converts integer-encoded words (like the context tokens in `X`) into dense vectors of fixed size (`emb_size`), which serve as word embeddings.
     - Useful for mapping word tokens to embeddings in natural language processing.
   - **`Lambda`**:
     - Allows you to wrap custom functions as Keras layers.
     - Often used for simple operations (e.g., reshaping, arithmetic) that don’t require their own layer type.

Together, these imports provide tools for creating and defining neural network models, especially ones used for embedding and NLP tasks.

In [None]:
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=emb_size, input_length=2*context_size),
    Lambda(lambda x: tf.reduce_mean(x, axis=1)),
    Dense(256, activation='relu'),
    Dense(512, activation='relu'),
    Dense(vocab_size, activation='softmax')
])

In [None]:
This code defines a neural network model using TensorFlow's Keras API. Here’s a breakdown of the layers:

1. **`Sequential([ ... ])`**:
   - Specifies that the model will be built in a linear stack of layers (from input to output).

2. **`Embedding(input_dim=vocab_size, output_dim=emb_size, input_length=2*context_size)`**:
   - This layer converts input tokens (e.g., the context words) into dense vectors (embeddings).
   - **`input_dim=vocab_size`**: The size of the vocabulary (number of unique words).
   - **`output_dim=emb_size`**: The size of the embedding vector for each word.
   - **`input_length=2*context_size`**: The length of each input sequence, here 4 tokens (`context_size` = 2 context words on each side of the target word).

3. **`Lambda(lambda x: tf.reduce_mean(x, axis=1))`**:
   - Applies a custom function to the output of the `Embedding` layer.
   - **`tf.reduce_mean(x, axis=1)`** computes the mean of the embeddings along the context dimension (averaging the context word embeddings).

4. **`Dense(256, activation='relu')`**:
   - A fully connected layer with 256 neurons and the ReLU activation function, which introduces non-linearity.

5. **`Dense(512, activation='relu')`**:
   - Another fully connected layer, this time with 512 neurons and ReLU activation.

6. **`Dense(vocab_size, activation='softmax')`**:
   - The output layer, which has as many neurons as the size of the vocabulary (`vocab_size`).
   - **`softmax` activation** is used to convert the output into probabilities, which can be interpreted as the likelihood of each word being the target word.

This is a CBOW-style model: the averaged embeddings of the surrounding words are used to predict the target word, and the `Embedding` weights learned along the way serve as the word vectors.
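
To show what the first two layers compute, the sketch below reproduces the lookup-then-average step in NumPy. It is an illustration with a random matrix standing in for learned weights and with hypothetical small sizes, not the Keras implementation.

```python
import numpy as np

vocab_size, emb_size = 7, 3          # hypothetical small sizes
rng = np.random.default_rng(0)
# Random stand-in for the learned embedding weights.
embedding_matrix = rng.normal(size=(vocab_size, emb_size))

# Two samples, each with 2 * context_size = 4 context token IDs.
batch = np.array([[1, 2, 4, 5],
                  [2, 3, 5, 6]])

looked_up = embedding_matrix[batch]  # shape (2, 4, 3): one vector per token
averaged = looked_up.mean(axis=1)    # shape (2, 3): one vector per sample

print(looked_up.shape, averaged.shape)
```

Because the vectors are averaged, the order of the context words does not affect the result.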

In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


In [None]:
This line of code compiles the neural network model:

1. **`model.compile(...)`**:
   - Prepares the model for training by specifying the loss function, optimizer, and evaluation metrics.

2. **`loss='sparse_categorical_crossentropy'`**:
   - Specifies the loss function used for multi-class classification tasks where each target word is represented by an integer index.
   - **`sparse_categorical_crossentropy`** is suitable for tasks where the target values are integers (as opposed to one-hot encoded labels).

3. **`optimizer='adam'`**:
   - The optimizer used to minimize the loss function during training.
   - **`Adam`** is an efficient and popular optimizer that adjusts learning rates dynamically during training.

4. **`metrics=['accuracy']`**:
   - Specifies that **`accuracy`** will be tracked as a performance metric during training.
   - Accuracy measures how often the predicted word matches the actual target word.

This setup configures the model to learn from data using the Adam optimizer and evaluate its performance based on accuracy.
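
For a single sample, sparse categorical cross-entropy reduces to the negative log of the probability assigned to the true class. The sketch below shows that arithmetic on a made-up probability vector; Keras additionally clips probabilities for numerical stability and averages over the batch, which is omitted here.

```python
import math

def sparse_categorical_crossentropy(probabilities, target_index):
    """Loss for one sample: negative log of the true class's probability."""
    return -math.log(probabilities[target_index])

# Hypothetical softmax output over a 3-word vocabulary.
probs = [0.1, 0.7, 0.2]

loss_confident = sparse_categorical_crossentropy(probs, target_index=1)
loss_unlikely = sparse_categorical_crossentropy(probs, target_index=0)
print(round(loss_confident, 4), round(loss_unlikely, 4))
```

The loss is small when the model puts high probability on the correct word and grows as that probability shrinks.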

In [None]:
history = model.fit(X, Y, epochs=80)


In [None]:
The code:

**`history = model.fit(X, Y, epochs=80)`**

is used to train the model:

1. **`model.fit(...)`**:
   - Trains the model on the provided data (`X` and `Y`), adjusting its weights to minimize the loss function.

2. **`X`**:
   - The input data, in this case, the context words (converted into integer token sequences).

3. **`Y`**:
   - The target data, which are the integer tokens for the target words.

4. **`epochs=80`**:
   - Specifies the number of times the model will iterate over the entire training dataset (i.e., 80 passes through the data).
   - Weights are updated after every batch rather than once per epoch; since no `batch_size` is given, Keras uses its default of 32 samples per batch.

5. **`history`**:
   - Stores the training history, which includes information about the loss and accuracy after each epoch.
   - This can be used later to analyze the model's performance during training (e.g., for plotting or debugging).

This line trains the model for 80 epochs using the provided data.
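
The number of weight updates follows from the dataset size and the batch size. The sketch below works through that arithmetic, assuming the Keras default `batch_size` of 32; the sample count is hypothetical, and the real value would be `len(X)`.

```python
import math

def updates_per_epoch(num_samples, batch_size=32):
    """Number of weight updates in one pass over the data."""
    return math.ceil(num_samples / batch_size)

# Hypothetical dataset size; in the notebook this would be len(X).
steps = updates_per_epoch(100)
total_updates = steps * 80  # 80 epochs, as in the call above

print(steps, total_updates)
```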

In [None]:
import seaborn as sns
sns.lineplot(model.history.history)

In [None]:
This line of code:

**`sns.lineplot(model.history.history)`**

is used to visualize the model's training progress using Seaborn:

1. **`import seaborn as sns`**:
   - Imports the Seaborn library, which is a powerful visualization tool built on top of Matplotlib.
   - It makes it easy to create informative and attractive statistical plots.

2. **`model.history.history`**:
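   - Refers to the `History` object that Keras attaches to the model when `model.fit()` runs. It is the same object that `fit()` returned into `history` in the previous cell, so `history.history` gives the same dictionary.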
   - **`model.history.history`** is a dictionary that contains values like loss, accuracy, and other metrics recorded during training, for each epoch.

3. **`sns.lineplot(...)`**:
   - Creates a line plot using Seaborn.
   - By passing `model.history.history` to `sns.lineplot()`, it plots the training metrics (like loss and accuracy) across epochs, showing how they change over time.

This code helps visualize how the model's performance (e.g., loss and accuracy) evolves during training, making it easier to analyze the model's learning progress.
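
The history dictionary maps each metric name to a list with one value per epoch, so it can also be inspected without plotting. The numbers below are made up purely to show the structure and are not results from this notebook.

```python
# Made-up numbers for illustration only; not results from this notebook.
history_dict = {
    "loss": [2.9, 2.1, 1.4, 0.9],
    "accuracy": [0.10, 0.35, 0.60, 0.80],
}

losses = history_dict["loss"]
# Index of the epoch with the lowest loss (0-based).
best_epoch = min(range(len(losses)), key=losses.__getitem__)

print(f"lowest loss at epoch {best_epoch + 1}")
for name, values in history_dict.items():
    print(name, "final:", values[-1])
```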

In [None]:
from sklearn.decomposition import PCA

embeddings = model.get_weights()[0]

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(embeddings)

In [None]:
This code performs **Principal Component Analysis (PCA)** to reduce the dimensionality of word embeddings for visualization:

1. **`embeddings = model.get_weights()[0]`**:
   - Retrieves the weights of the model, specifically the embeddings learned by the `Embedding` layer (the first set of weights).
   - These embeddings are word vectors (dense representations) in a higher-dimensional space.

2. **`pca = PCA(n_components=2)`**:
   - Initializes a PCA object with the goal of reducing the embeddings to 2 dimensions (`n_components=2`).
   - PCA is a technique that reduces the number of features while preserving as much variance as possible.

3. **`reduced_embeddings = pca.fit_transform(embeddings)`**:
   - Applies PCA to the `embeddings` matrix, which has one row per vocabulary index and `emb_size` (10) columns, reducing it to 2 columns.
   - **`fit_transform()`** fits the PCA model to the data and then transforms the embeddings into the reduced 2D space.

This process allows the word embeddings to be visualized in 2D, making it easier to explore relationships between words (e.g., clustering similar words together).
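
The projection itself can be sketched in NumPy: center the matrix, take its singular value decomposition, and project onto the first two right singular vectors. This is a simplified stand-in for `PCA(n_components=2)` on a random matrix, not scikit-learn's implementation, and the signs of the components may differ from what scikit-learn returns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random stand-in for an embedding matrix of shape (vocab_size, emb_size).
emb = rng.normal(size=(20, 10))

centered = emb - emb.mean(axis=0)
# Rows of `vt` are the principal directions, ordered by explained variance.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:2].T

print(reduced.shape)  # (20, 2): one 2D point per vocabulary index
```

Each row of `reduced` can then be drawn as a point and labelled with the word for that index.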

In [None]:
test_sentences = [
    "known as structured learning",
    "transformers have applied to",
    "where they produced results",
    "cases surpassing expert performance"
]

In [None]:
The variable **`test_sentences`** is a list of four strings. Each string is a four-word fragment, matching the `2*context_size = 4` context tokens the model expects as input. Here's a breakdown:

1. **`test_sentences = [...]`**:
   - This defines a list called `test_sentences` containing 4 example fragments as elements.

The fragments are:
- "known as structured learning"
- "transformers have applied to"
- "where they produced results"
- "cases surpassing expert performance"

Each fragment is treated as a context (two words before and two words after a missing center word) and fed to the trained model, which predicts that center word.

In [None]:
for sent in test_sentences:
    test_words = sent.split(" ")
#     print(test_words)
    x_test =[]
    for i in test_words:
        x_test.append(word_to_index.get(i))
    x_test = np.array([x_test])
#     print(x_test)

    pred = model.predict(x_test)
    pred = np.argmax(pred[0])
    print("pred ", test_words, "\n=", index_to_word.get(pred),"\n\n")



In [None]:
This code is used to predict the target word for each sentence in `test_sentences` using the trained model. Here's a breakdown:

1. **`for sent in test_sentences:`**:
   - Iterates over each sentence in the `test_sentences` list.

2. **`test_words = sent.split(" ")`**:
   - Splits the sentence into individual words (tokens) using spaces as separators.

3. **`x_test = []`**:
   - Initializes an empty list `x_test` to store the tokenized word indices (based on `word_to_index`).

4. **`for i in test_words:`**:
   - Loops through each word in `test_words`.

5. **`x_test.append(word_to_index.get(i))`**:
   - Converts each word to its corresponding index (token) using the `word_to_index` dictionary and appends it to `x_test`. Note that `.get()` returns `None` for a word that is not in the vocabulary, which is not a valid token ID for the `Embedding` layer.

6. **`x_test = np.array([x_test])`**:
   - Wraps the list in an outer list and converts it to a NumPy array of shape `(1, 4)`: a batch containing one sample of four context tokens.

7. **`pred = model.predict(x_test)`**:
   - Uses the trained model to predict the target (center) word given the context (`x_test`).

8. **`pred = np.argmax(pred[0])`**:
   - `model.predict()` returns a probability distribution across all vocabulary words.
   - `np.argmax(pred[0])` retrieves the index of the highest probability word (the predicted target word).

9. **`print("pred ", test_words, "\n=", index_to_word.get(pred),"\n\n")`**:
   - Prints the original words in the sentence (`test_words`), followed by the predicted target word (using `index_to_word.get(pred)` to map the index back to the word).

In summary, this code processes each test sentence, tokenizes it, predicts the target word using the trained model, and prints the results.
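
Since `word_to_index.get()` returns `None` for unseen words, it can help to check the encoded context before calling the model. The sketch below shows one way to do that; `encode_context` is a hypothetical helper and the vocabulary is made up.

```python
def encode_context(words, word_to_index):
    """Return token IDs for `words`, or None if any word is unknown.

    Hypothetical helper, not part of the notebook.
    """
    ids = [word_to_index.get(word) for word in words]
    if any(token is None for token in ids):
        return None
    return ids

# Hypothetical vocabulary.
vocab = {"known": 1, "as": 2, "structured": 3, "learning": 4}

known = encode_context("known as structured learning".split(), vocab)
unknown = encode_context("known as quantum learning".split(), vocab)
print(known)    # [1, 2, 3, 4]
print(unknown)  # None
```

A fragment that comes back as `None` can be skipped or reported instead of being passed to `model.predict`.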