<a href="https://colab.research.google.com/github/vijaydr29/Basic-NLP-Procedure/blob/main/Basic_NLP_Prcocedure.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Agenda**



#### 1. **NLP Overview:**
   - Definition and Importance
   - Key Challenges

#### 2. **How to Process Data:**
   - Text Cleaning
   - Tokenization
   - Stopword Removal
   - Lemmatization
   

#### 3. **How ML Models Work:**
   - Introduction to Machine Learning
   - Text Representation
      - Bag of Words
      - TF-IDF
      - Word Embeddings (Word2Vec)


#### 4. **Word Embeddings (Word2Vec):**
   - Word2Vec Basics
   - Gensim Library for Word2Vec


#### 5. **Seq2Seq Models:**
   - Introduction to Seq2Seq
   - Encoder-Decoder Architecture


#### 6. **Transformer Models with Attention:**
   - Introduction to Transformers
   - Self-Attention Mechanism
   - BERT (Bidirectional Encoder Representations from Transformers)



## **1. NLP Overview: Definition and Importance**



#### **1.1 Definition:**

NLP, or Natural Language Processing, is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. The goal is to enable machines to understand, interpret, and generate human-like text.

In practical terms, NLP involves the development of algorithms and models that can:

- Understand the meaning behind a piece of text.
- Extract relevant information from unstructured data.
- Generate coherent and contextually relevant text.




#### **1.2 Importance of NLP:**

NLP plays a crucial role in various applications, contributing to advancements in technology and improving user experiences. Some key areas where NLP is essential include:

- **Chatbots and Virtual Assistants:** NLP powers conversational agents, allowing users to interact with machines in a natural language.

- **Information Extraction:** NLP helps extract structured information from unstructured text, such as named entities, relationships, and events.

- **Sentiment Analysis:** Businesses use NLP to analyze customer sentiments expressed in reviews, social media, or surveys.

- **Language Translation:** NLP is behind machine translation systems that enable the translation of text from one language to another.

- **Speech Recognition:** NLP algorithms are used to convert spoken language into written text, facilitating voice-activated systems.



#### **1.3 Key Challenges in NLP:**

Despite the advancements, NLP faces several challenges due to the complexity and nuances of human language. Some key challenges include:

- **Ambiguity:** Words and phrases often have multiple meanings depending on context, making it hard for machines to interpret them accurately.

- **Sarcasm and Irony:** Detecting sarcasm, irony, and other forms of figurative language is difficult for NLP systems, as it requires understanding contextual cues.

- **Lack of Context:** Understanding context is essential for accurate language comprehension. NLP models may struggle when faced with ambiguous or incomplete information.

- **Data Limitations:** NLP models heavily rely on large amounts of labeled data for training. Limited or biased datasets can lead to biased and less accurate models.



#### **Real-life Example:**

To illustrate the importance of NLP, consider a social media sentiment analysis application. The goal is to determine the sentiment (positive, negative, or neutral) of user comments about a product. NLP techniques, such as tokenization and sentiment analysis models, can be employed to automatically analyze and categorize user sentiments.


In [None]:
# Import the drive module from google.colab so the notebook can access Google Drive.
from google.colab import drive

In [None]:
# Mount Google Drive at /content/drive so files stored there can be read from this notebook.
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Read the IMDB reviews CSV from Google Drive into a DataFrame named df.
# Replace the path below with the location of your own copy of the file.
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Master Class/IMDB Dataset.csv')

In [None]:
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. `<br /><br />`The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [None]:
# Import the TextBlob class from the textblob library
from textblob import TextBlob

# Assuming you have already mounted Google Drive and loaded the IMDB dataset into a DataFrame df

# Limit the analysis to the first 100 reviews
num_reviews_to_analyze = 100

# Initialize variables for accuracy calculation
correct_predictions = 0

# Iterate through the first 100 reviews in the dataset
for index, row in df.head(num_reviews_to_analyze).iterrows():
    # Extract the review text from the 'review' column
    review_text = row['review']

    # Create a TextBlob object for the current review
    analysis = TextBlob(review_text)

    # Calculate the sentiment polarity of the review
    sentiment_polarity = analysis.sentiment.polarity

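    # Convert polarity to a predicted label. The dataset only contains 'positive' and
    # 'negative' labels, so a 'neutral' prediction (polarity == 0) always counts as wrong.
    if sentiment_polarity > 0:
        predicted_sentiment = 'positive'
    elif sentiment_polarity < 0:
        predicted_sentiment = 'negative'
    else:
        predicted_sentiment = 'neutral'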

    # Compare with the true sentiment label
    true_sentiment = row['sentiment']

    # Check if prediction is correct
    if predicted_sentiment == true_sentiment:
        correct_predictions += 1

    # Print the sentiment polarity and predicted sentiment of the review
    print(f"Review {index + 1}: Polarity={sentiment_polarity}, Predicted Sentiment={predicted_sentiment}, True Sentiment={true_sentiment}")

# Calculate accuracy
accuracy = correct_predictions / num_reviews_to_analyze
print(f"Accuracy: {accuracy * 100:.2f}%")


Review 1: Polarity=0.023433179723502305, Predicted Sentiment=positive, True Sentiment=positive
Review 2: Polarity=0.1097222222222222, Predicted Sentiment=positive, True Sentiment=positive
Review 3: Polarity=0.35400793650793644, Predicted Sentiment=positive, True Sentiment=positive
Review 4: Polarity=-0.0578125, Predicted Sentiment=negative, True Sentiment=negative
Review 5: Polarity=0.2179522497704316, Predicted Sentiment=positive, True Sentiment=positive
Review 6: Polarity=0.15529411764705883, Predicted Sentiment=positive, True Sentiment=positive
Review 7: Polarity=0.2855218855218855, Predicted Sentiment=positive, True Sentiment=positive
Review 8: Polarity=0.08271604938271605, Predicted Sentiment=positive, True Sentiment=negative
Review 9: Polarity=-0.1428628389154705, Predicted Sentiment=negative, True Sentiment=negative
Review 10: Polarity=0.41500000000000004, Predicted Sentiment=positive, True Sentiment=positive
Review 11: Polarity=0.12738095238095237, Predicted Sentiment=positive,


## **2. How to Process Data:**



#### **2.1 Text Cleaning:**

Text cleaning removes elements that carry no useful signal for the task, such as HTML tags (the IMDB reviews above contain `<br />`), special characters, or unwanted symbols.
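
As a minimal sketch of what cleaning can look like, the function below uses only Python's standard library. The name `clean_text` and the exact rules (strip tags, lowercase, keep letters only) are assumptions chosen for illustration; the right rules depend on the task.

```python
import html
import re

def clean_text(text: str) -> str:
    """Lowercase, strip HTML tags, and keep only letters and spaces."""
    text = html.unescape(text)                 # turn entities such as &amp; into characters
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags such as <br />
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # replace digits, punctuation, and symbols with spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

print(clean_text("A wonderful little production. <br /><br />The filming technique is GREAT!"))
# a wonderful little production the filming technique is great
```

Note that this simple rule also splits contractions ("there's" becomes "there s"), so it is a starting point rather than a finished cleaner.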

#### **2.2 Tokenization:**

Tokenization is the process of breaking text into individual words or tokens.


In [None]:
# Import the Natural Language Toolkit (nltk) library
import nltk

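# Download the Punkt tokenizer model (if not already downloaded).
# Newer NLTK releases may ask for 'punkt_tab' instead; download that resource if a LookupError says so.
nltk.download('punkt')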

# Import the word_tokenize function from the nltk.tokenize module
from nltk.tokenize import word_tokenize

# Sample text
sample_text = "This is random text."

# Tokenize the text using word_tokenize
tokens = word_tokenize(sample_text)

# Print the list of tokens
print(tokens)


['This', 'is', 'random', 'text', '.']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Explanation:**
- `word_tokenize(text)`: Using the NLTK library, this function splits the input text into a list of tokens. Punctuation marks such as the final `.` become tokens of their own.

**Real-life Example:**  
Consider a scenario where you want to analyze the frequency of words in a text document. Tokenization helps in breaking down the text into individual words for further analysis.



#### **2.3 Stopword Removal:**

Stopwords are common words that do not carry significant meaning and are often removed to focus on the more meaningful words.
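
A minimal sketch of stopword removal is shown below. The small `STOPWORDS` set and the helper name `remove_stopwords` are assumptions made for illustration; in practice a fuller list is normally used, for example `nltk.corpus.stopwords.words('english')` after `nltk.download('stopwords')`.

```python
# A tiny, hand-picked stopword set used only for illustration (an assumption;
# real projects usually rely on a fuller list such as NLTK's English stopwords).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "on", "and", "this", "over"}

def remove_stopwords(tokens):
    """Return the tokens that are not stopwords, compared case-insensitively."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = ["This", "is", "random", "text", "."]
print(remove_stopwords(tokens))  # ['random', 'text', '.']
```

Stopword lists should be checked against the task: for sentiment analysis, dropping a word like "not" can reverse the meaning of a review.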


#### **2.4 Lemmatization:**

Lemmatization reduces words to their dictionary form (lemma), so that inflected forms such as "foxes" and "fox" are treated as the same word.

In [None]:
# Import the WordNetLemmatizer class from the nltk.stem module
from nltk.stem import WordNetLemmatizer

# Import the word_tokenize function from the nltk.tokenize module
from nltk.tokenize import word_tokenize

# Import the Natural Language Toolkit (nltk) library
import nltk

# Download the WordNet dataset (if not already downloaded)
nltk.download('wordnet')

# Create an instance of the WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Example sentence
sentence = "The quick brown foxes are jumping over the laziest dogs."

# Tokenize the sentence
tokens = word_tokenize(sentence)

# Lemmatize the tokens
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]

# Print the original tokens and the lemmatized tokens
print("Original Tokens:", tokens)
print("Lemmatized Tokens:", lemmatized_tokens)


Original Tokens: ['The', 'quick', 'brown', 'foxes', 'are', 'jumping', 'over', 'the', 'laziest', 'dogs', '.']
Lemmatized Tokens: ['The', 'quick', 'brown', 'fox', 'are', 'jumping', 'over', 'the', 'laziest', 'dog', '.']


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


**Explanation:**
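- `lemmatizer.lemmatize(word)`: Lemmatizes each word using the WordNet lemmatizer. Called without a `pos` argument, it treats every word as a noun, which is why "foxes" becomes "fox" while "are" and "jumping" are left unchanged. Pass `pos='v'` to lemmatize verbs.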

**Real-life Example:**  
In information extraction, lemmatization can help in grouping different forms of a word (e.g., "running" and "ran") under a common lemma.


## **3. How ML Models Work:**



#### **3.1 Introduction to Machine Learning:**

Machine Learning (ML) involves the development of algorithms that enable computers to learn patterns from data and make predictions or decisions without being explicitly programmed. In the context of NLP, machine learning models can be trained on textual data to perform various tasks such as sentiment analysis, classification, or language translation.

#### **3.2 Text Representation:**

Before feeding text data into machine learning models, it needs to be converted into a numerical format. Several methods are used for text representation, each with its own advantages and drawbacks.

#### **3.3 Bag of Words (BoW):**

The Bag of Words model represents a document as an unordered collection of words together with their counts, disregarding grammar and word order. It creates a matrix where each row corresponds to a document, and each column corresponds to a unique word in the entire corpus.


In [None]:
# Assuming you have already mounted Google Drive and loaded the IMDB dataset into a DataFrame df

# Import the CountVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of the CountVectorizer class
vectorizer = CountVectorizer()

# Extract the 'review' column from the IMDB dataset as the corpus
corpus = df['review'].tolist()

# Transform the corpus into a Bag of Words representation
X_bow = vectorizer.fit_transform(corpus)

# Display the feature names (unique words) in the corpus
print("Feature Names (Unique Words):", vectorizer.get_feature_names_out()[:20])  # Displaying the first 20 feature names

# Display the Bag of Words matrix (showing only the first 5 reviews for brevity)
print("Bag of Words Matrix:")
print(X_bow[:5].toarray())


Feature Names (Unique Words): ['00' '000' '00000000000' '0000000000001' '00000001' '00001' '00015'
 '000dm' '000s' '001' '003830' '006' '0069' '007' '0079' '007s' '0080'
 '0083' '009' '0093638']
Bag of Words Matrix:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


**Explanation:**
- `CountVectorizer`: Converts a collection of text documents to a matrix of token counts.
- Each row in the matrix corresponds to a document, and each column corresponds to a unique word in the corpus.
- The values in the matrix represent the frequency of each word in the corresponding document.
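
To make the matrix concrete, the counting can be sketched by hand on a two-document toy corpus. This is an illustrative re-implementation with made-up sentences, not how `CountVectorizer` is implemented internally.

```python
from collections import Counter

docs = ["the movie was good", "the movie was bad bad"]

# Vocabulary: every unique word in the corpus, sorted so the columns have a stable order.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are raw counts.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

print(vocab)   # ['bad', 'good', 'movie', 'the', 'was']
print(matrix)  # [[0, 1, 1, 1, 1], [2, 0, 1, 1, 1]]
```

Unlike this sketch, `CountVectorizer` lowercases the text and, with its default token pattern, ignores single-character tokens.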




#### **3.4 TF-IDF (Term Frequency-Inverse Document Frequency):**

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. A word scores highly when it appears often in a document (term frequency) but in few documents overall (inverse document frequency), so words that are common everywhere, such as "the", are down-weighted.

In [None]:
# Assuming you have already mounted Google Drive and loaded the IMDB dataset into a DataFrame df

# Import the TfidfVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import TfidfVectorizer

# Take a smaller sample of the 'review' column from the IMDB dataset
sample_size = 1000  # You can adjust the sample size based on your needs
corpus_sample = df['review'].head(sample_size).tolist()

# Create an instance of the TfidfVectorizer class
vectorizer_tfidf = TfidfVectorizer()

# Create the TF-IDF representation for the given corpus sample
X_tfidf = vectorizer_tfidf.fit_transform(corpus_sample)

# Display the feature names (unique words) in the corpus
print("Feature Names (Unique Words):", vectorizer_tfidf.get_feature_names_out()[:20])  # Displaying the first 20 feature names

# Display the TF-IDF matrix (showing only the first 5 reviews for brevity)
print("TF-IDF Matrix:")
print(X_tfidf[:5].toarray())


Feature Names (Unique Words): ['00' '000' '007' '00am' '01pm' '08' '10' '100' '1000' '100th' '101' '102'
 '103' '105' '11' '12' '120' '13' '135' '13th']
TF-IDF Matrix:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


**Explanation:**
- `TfidfVectorizer`: Converts a collection of raw documents to a matrix of TF-IDF features.
- The TF-IDF matrix has the same shape as the Bag of Words matrix, but each count is replaced by a weight that is higher for terms that are frequent in that document and rare across the rest of the corpus.
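
The weighting can be illustrated with the textbook formula `tf * log(N / df)` on a toy corpus. This is a sketch under that assumption. By default, `TfidfVectorizer` uses a smoothed variant, `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will differ from the ones below.

```python
import math
from collections import Counter

docs = [
    "the movie was good".split(),
    "the movie was bad".split(),
    "the plot was good".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears at least once.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc):
    """Textbook TF-IDF for a word that occurs somewhere in the corpus."""
    tf = doc.count(word) / len(doc)            # term frequency within this document
    idf = math.log(n_docs / doc_freq[word])    # rarer across the corpus -> larger idf
    return tf * idf

print(tf_idf("the", docs[1]))   # 0.0, because "the" appears in every document
print(tf_idf("bad", docs[1]))   # larger, because "bad" appears in only one document
```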



#### **3.5 Word Embeddings (Word2Vec):**

Word Embeddings represent words as dense vectors in a continuous vector space, capturing semantic relationships between words. Word2Vec is a popular method for generating word embeddings.


![picture](https://drive.google.com/uc?export=view&id=1n0mUZcYsRLo_pInU2xiUG--t5bOISc0Q)



In [None]:
# Assuming you have already mounted Google Drive and loaded the IMDB dataset into a DataFrame df

# Import the Word2Vec class from the gensim.models module
from gensim.models import Word2Vec

# Import the word_tokenize function from the nltk.tokenize module
from nltk.tokenize import word_tokenize

# Tokenize a smaller sample of the 'review' column from the IMDB dataset
sample_size = 1000  # You can adjust the sample size based on your needs
tokenized_corpus_sample = [word_tokenize(review.lower()) for review in df['review'].head(sample_size).tolist()]

# Create the Word2Vec model
word2vec_model = Word2Vec(sentences=tokenized_corpus_sample, vector_size=100, window=5, min_count=1, workers=4)
# Adjust the parameters (e.g., vector_size, window) based on your specific requirements.

# Get the vector representation of a word (e.g., 'document')
vector_representation = word2vec_model.wv['document']

# Print the vector representation
print("Vector representation of 'document':", vector_representation)


Vector representation of 'document': [-0.01799615  0.01300799  0.00495014 -0.00586682 -0.00115296 -0.0326869
  0.00814584  0.06054067 -0.02035532 -0.02982742  0.00301758 -0.04267221
  0.00667032  0.01732409  0.00645217 -0.03111579  0.01347287 -0.03030831
 -0.01232108 -0.03072064  0.00579258  0.01554295  0.02778946  0.00035944
 -0.00729582  0.00198373 -0.03189528  0.00383202 -0.02696364  0.00714664
  0.02200642 -0.01503243 -0.00250842 -0.01869168 -0.02314418  0.01579471
  0.01724296 -0.01859335  0.00939682 -0.02926154  0.01487183 -0.01600094
 -0.02168584 -0.01009515  0.00540448 -0.01167929 -0.0138716   0.00317287
  0.00301045  0.00172426  0.01636955 -0.03009641  0.01105109  0.00368481
 -0.00448172  0.00808757  0.01398096 -0.0123611  -0.02044742  0.00928421
  0.00129415 -0.01355246  0.00084942 -0.00437145 -0.01513183  0.01167085
 -0.01791506  0.01776645 -0.01820335  0.01870342 -0.00245577  0.01904474
  0.02756433  0.00783187  0.02264715  0.00305092 -0.00458813 -0.00446247
 -0.02884027 -0


**Explanation:**
- `Word2Vec`: Learns word embeddings from a large corpus of text.
- Each word is represented as a dense vector, capturing semantic relationships.

## **4. Word Embeddings (Word2Vec):**

#### **4.1 Word2Vec Basics:**

Word2Vec is a popular word embedding technique that represents words as dense vectors in a continuous vector space. It captures semantic relationships between words and is widely used in natural language processing tasks.

Word2Vec has two primary architectures:

1. **Continuous Bag of Words (CBOW):**
   - Predicts the current word given its context words.
   - Faster to train and tends to do slightly better on frequent words.

2. **Skip-gram:**
   - Predicts context words given the current word.
   - Slower to train, but often represents rare words better and is commonly reported to work well even with smaller amounts of training data.
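
The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below generates them for a toy sentence. The helper `context_windows` is hypothetical, and real implementations add details such as subsampling and negative sampling that are left out here.

```python
def context_windows(tokens, window=2):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right

tokens = ["the", "quick", "brown", "fox", "jumps"]

# CBOW: the context words are the input, the center word is the target.
cbow_examples = [(context, center) for center, context in context_windows(tokens)]

# Skip-gram: the center word is the input, each context word is a separate target.
skipgram_examples = [
    (center, ctx) for center, context in context_windows(tokens) for ctx in context
]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:3])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
```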

#### **4.2 Gensim Library for Word2Vec:**

Gensim is an open-source Python library for topic modelling and word embeddings that includes an efficient Word2Vec implementation. It can train Word2Vec models on large datasets by streaming sentences instead of loading the whole corpus into memory.

#### **Real-life Code Example:**

Let's demonstrate how to use Gensim to train a Word2Vec model on a small corpus.

In [None]:
# Assuming you have already mounted Google Drive and loaded the IMDB dataset into a DataFrame df

# Import the Word2Vec class from the gensim.models module
from gensim.models import Word2Vec

# Import the word_tokenize function from the nltk.tokenize module
from nltk.tokenize import word_tokenize

# Import the Natural Language Toolkit (nltk) library
import nltk
# Download the Punkt tokenizer model (if not already downloaded)
nltk.download('punkt')

# Take a smaller sample of the 'review' column from the IMDB dataset
sample_size = 1000  # You can adjust the sample size based on your needs
tokenized_corpus_sample = [word_tokenize(review.lower()) for review in df['review'].head(sample_size).tolist()]

# Create the Word2Vec model (Skip-gram)
word2vec_model = Word2Vec(sentences=tokenized_corpus_sample, vector_size=100, window=5, min_count=1, workers=4, sg=1)
# Adjust the parameters (e.g., vector_size, window) based on your specific requirements.

# Save the Word2Vec model
word2vec_model.save("/content/drive/MyDrive/Master Class/word2vec_model_imdb_sample.model")
# - Saves the trained Word2Vec model to a file named "word2vec_model_imdb_sample.model".

# Load the Word2Vec model
loaded_model = Word2Vec.load("/content/drive/MyDrive/Master Class/word2vec_model_imdb_sample.model")
# - Loads the Word2Vec model from the saved file.

# Get the vector representation of a word (e.g., 'movie')
vector_representation = loaded_model.wv['movie']
# - loaded_model.wv['movie']: Retrieves the vector representation of the word 'movie' from the loaded Word2Vec model.

# Print the vector representation
print("Vector representation of 'movie':", vector_representation)
# - Prints the vector representation of the word 'movie'.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Vector representation of 'movie': [-0.04137838  0.45587063  0.43418795  0.42026514 -0.15631789 -0.2346355
 -0.01740664  0.75538415 -0.39739498 -0.19003744  0.09647486 -0.301931
  0.04421014  0.13624126  0.16692582 -0.22978473  0.20822304 -0.23731
 -0.11544676 -1.037378   -0.06992677  0.20673127  0.51394564 -0.5281713
 -0.05746479  0.25768358 -0.7350139   0.31799152  0.09495896  0.06307528
  0.2668496  -0.04243638  0.2659982  -0.38196835 -0.23148724  0.08468099
  0.6383195  -0.22720338 -0.15904032 -0.700673   -0.13491616 -0.21236315
 -0.24900283  0.12050357 -0.1567658  -0.45392305 -0.14960532  0.1221492
 -0.04226052  0.18164657 -0.14724125 -0.11589345 -0.4547419  -0.12299178
  0.19331095 -0.2721154  -0.0184927  -0.38386193 -0.17120913  0.3001281
  0.24102426  0.41272256 -0.00529413  0.08416962 -0.48258567  0.5757731
  0.08983218  0.3334069  -0.43658233  0.3951002  -0.26522332  0.40785015
  0.02912864 -0.16939105  0.48818204  0.11548709  0.0523996   0.12026362
 -0.22329101 -0.0388239  -0

**Explanation:**
- `Word2Vec`: Initializes a Word2Vec model.
- `sentences`: The tokenized sentences from the corpus.
- `vector_size`: Dimensionality of the word vectors.
- `window`: Maximum distance between the current and predicted word within a sentence.
- `min_count`: Ignores all words with a total frequency lower than this.
- `workers`: Number of CPU cores to use when training the model.
- `sg=1`: Indicates the Skip-gram architecture.

In the example, we tokenize the corpus, train a Word2Vec model using Gensim, save the model, and then load it back to obtain the vector representation of the word 'movie'.

This Word2Vec model can be used to obtain vector representations for words in a given context, enabling semantic similarity and analogy calculations.
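
Semantic similarity between two word vectors is usually measured with cosine similarity. The sketch below computes it by hand on made-up 3-dimensional vectors. The numbers are invented for illustration and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen by hand so that "movie" and "film" point in similar directions.
toy_vectors = {
    "movie": [0.9, 0.1, 0.3],
    "film":  [0.8, 0.2, 0.3],
    "pizza": [0.1, 0.9, 0.0],
}

print(cosine_similarity(toy_vectors["movie"], toy_vectors["film"]))   # close to 1
print(cosine_similarity(toy_vectors["movie"], toy_vectors["pizza"]))  # much smaller
```

With the Gensim model loaded above, the same measure is available as `loaded_model.wv.similarity('movie', 'film')`, and `loaded_model.wv.most_similar('movie', topn=5)` lists the nearest neighbours, provided the words are in the model's vocabulary.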

## **5. Seq2Seq Models:**

#### **5.1 Introduction to Seq2Seq:**

Sequence-to-Sequence (Seq2Seq) models are a type of neural network architecture designed for tasks involving sequences, such as language translation, summarization, and chatbot responses. The basic idea is to use two recurrent neural networks (RNNs) known as an encoder and a decoder to transform input sequences into output sequences.

#### **5.2 Encoder-Decoder Architecture:**

The Seq2Seq model comprises two main components: an encoder and a decoder.

**5.2.1 Encoder:**
   - Takes an input sequence and encodes it into a fixed-size context vector.
   - The context vector contains essential information about the input sequence.
   - Commonly implemented using recurrent neural networks (RNNs) or long short-term memory networks (LSTMs).

**5.2.2  Decoder:**
   - Takes the context vector produced by the encoder and generates the output sequence.
   - Predicts one element at a time, often autoregressively.
   - Also commonly implemented using RNNs or LSTMs.
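
The data flow between the two components can be sketched in plain Python. Everything below is a toy: the weights are random and untrained, the vocabulary is hypothetical, and the encoder and decoder share one set of recurrent weights for brevity (real models use separate, learned parameters). The point is only to show the fixed-size context vector and the autoregressive loop.

```python
import math
import random

random.seed(0)
HIDDEN, EMBED = 4, 3
VOCAB = ["<sos>", "<eos>", "a", "b", "c"]   # hypothetical toy vocabulary

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

# Untrained, random parameters: the structure matters here, not the predictions.
embed = {tok: [random.uniform(-1, 1) for _ in range(EMBED)] for tok in VOCAB}
W_x, W_h = rand_matrix(HIDDEN, EMBED), rand_matrix(HIDDEN, HIDDEN)
W_out = rand_matrix(len(VOCAB), HIDDEN)

def rnn_step(x, h):
    """One recurrent step: combine the current input with the previous hidden state."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_x, x), matvec(W_h, h))]

def encode(tokens):
    """Read the whole input; the final hidden state is the fixed-size context vector."""
    h = [0.0] * HIDDEN
    for tok in tokens:
        h = rnn_step(embed[tok], h)
    return h

def decode(context, max_len=5):
    """Generate one token at a time, feeding each prediction back in as the next input."""
    h, tok, output = context, "<sos>", []
    for _ in range(max_len):
        h = rnn_step(embed[tok], h)
        scores = matvec(W_out, h)
        tok = VOCAB[scores.index(max(scores))]   # greedy choice of the next token
        if tok == "<eos>":
            break
        output.append(tok)
    return output

context = encode(["a", "b", "c"])
print(len(context), decode(context))
```

However long the input is, `encode` returns a vector of the same size, which is the bottleneck that attention mechanisms were later introduced to relieve.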

#### **Real-life Code Example:**

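A full encoder-decoder model needs paired input and output sequences (for example, English and French sentences). To keep the example small, the code below uses Keras to build only the encoder half: an Embedding layer followed by an LSTM reads an IMDB review and compresses it into a fixed-size vector, and a single Dense unit classifies that vector as positive or negative. This is a sequence-to-one model, not a full Seq2Seq translator.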

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the IMDB dataset (reviews are already encoded as lists of word indices)
vocab_size = 10000  # Keep only the 10,000 most frequent words
max_len = 100  # Pad or truncate every review to 100 tokens (pad_sequences keeps the last 100 by default)
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)
x_train = pad_sequences(x_train, maxlen=max_len)
x_test = pad_sequences(x_test, maxlen=max_len)

# Build an LSTM sentiment classifier (the encoder half of a Seq2Seq model plus a classification head)
latent_dim = 4  # Size of both the word embeddings and the LSTM hidden state

model = Sequential()
# input_length is omitted: recent Keras versions infer it and treat the argument as deprecated
model.add(Embedding(input_dim=vocab_size, output_dim=latent_dim))
model.add(LSTM(latent_dim))
model.add(Dense(1, activation='sigmoid'))

# Compile the model using binary crossentropy loss and the Adam optimizer.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model using the IMDB training data.
model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.2)

# Evaluate the model on the IMDB test data.
loss, accuracy = model.evaluate(x_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Test Loss: 0.4184, Test Accuracy: 0.8350


**Explanation:**

This code trains a small LSTM sentiment classifier on IMDB reviews with Keras. It corresponds to the encoder half of a Seq2Seq model with a classification layer on top:

1. **Data Preparation:**
   - `imdb.load_data` returns reviews already encoded as sequences of integer word indices, restricted to the 10,000 most frequent words.
   - `pad_sequences` pads or truncates every review to 100 tokens so they can be batched.

2. **Model:**
   - Sequential model with Embedding, LSTM, and Dense layers.
   - Embedding layer maps each word index to a dense vector.
   - LSTM layer reads the sequence and summarizes it in its final hidden state, playing the role of the encoder's context vector.
   - Dense layer with sigmoid activation outputs the probability that the review is positive.

3. **Model Compilation and Training:**
   - Compilation with Adam optimizer, binary cross-entropy loss, and accuracy metric.
   - Training for 5 epochs, holding out 20% of the training data for validation.

4. **Evaluation:**
   - `model.evaluate` reports loss and accuracy on the test set (about 0.84 accuracy in the run shown above).

This example is educational. A full Seq2Seq translator would add a decoder that starts from the encoder's final state and generates the output sequence one token at a time.

## **6. Transformer Models with Attention**

### **6.1 Introduction to Transformers:**

- Transformers are a type of neural network architecture introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al.).
- Key components: self-attention mechanism, encoder-decoder structure.
- Effective in capturing long-range dependencies in sequences.


In [None]:
# Install the transformers library
!pip install transformers

# Import necessary classes from the transformers library
from transformers import BertTokenizer, BertModel

# Tokenize input text
# Instantiate a BERT tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Input text to be tokenized
text = "This is an introduction to transformers."

# Encode the input text using the BERT tokenizer
# 'return_tensors' parameter specifies that PyTorch tensors should be returned
encoded_tokens = tokenizer.encode(text, return_tensors='pt')

# Load pre-trained BERT model
# Instantiate a BERT model for the 'bert-base-uncased' variant
model = BertModel.from_pretrained('bert-base-uncased')

# Forward pass through the BERT model
# Pass the encoded tokens through the BERT model to obtain the output
output = model(encoded_tokens)

# Print the output
print(output)






BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[-0.3407, -0.0158, -0.1392,  ..., -0.4034,  0.3556,  0.6352],
         [-0.7462, -0.6976, -0.1616,  ..., -0.0087,  0.8424,  0.2508],
         [-0.5540, -0.6966,  0.8854,  ...,  0.3237,  0.1576,  0.9480],
         ...,
         [ 2.2967, -0.1664,  0.3546,  ..., -0.5724, -0.3273,  0.3215],
         [ 0.5750,  0.0995, -0.3975,  ...,  0.2667, -0.2205, -0.4782],
         [ 0.5753,  0.1630, -0.0906,  ...,  0.2518, -0.5264, -0.3900]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-0.9570, -0.5569, -0.9623,  0.9004,  0.7563, -0.2063,  0.9565,  0.4562,
         -0.8887, -1.0000, -0.6778,  0.9670,  0.9836,  0.6399,  0.9413, -0.8410,
         -0.3558, -0.6316,  0.3938, -0.7038,  0.6955,  1.0000,  0.0240,  0.4847,
          0.5733,  0.9959, -0.8466,  0.9564,  0.9773,  0.7756, -0.8173,  0.3507,
         -0.9875, -0.3182, -0.9632, -0.9967,  0.6160, -0.8233, -0.0236, -0.1948,
         -0.9170,  0.4625,  1.00

### **6.2 Self-Attention Mechanism:**
#### Theory:
- Self-attention allows a model to weigh the importance of different words in a sequence relative to each other.
- Attention scores are computed from the dot product between each word's query vector and every word's key vector, then normalized with softmax so they sum to 1.
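
The computation is usually written as `softmax(Q K^T / sqrt(d_k)) V`. As a minimal sketch, the function below applies it to three made-up 2-dimensional token vectors in plain Python, using the simplifying assumption that queries, keys, and values are all the raw input (no learned projections).

```python
import math

def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X (no learned projections)."""
    d_k = len(X[0])
    outputs, all_weights = [], []
    for q in X:
        # Similarity of this token's query with every token's key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X]
        weights = softmax(scores)              # non-negative, sums to 1 across tokens
        # Output = weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, X)) for j in range(d_k)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up token vectors: the first two are similar, the third is different.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence. The first token gives more weight to the similar second token than to the dissimilar third one.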

#### Code Example:

In [None]:
import torch
import torch.nn.functional as F
import pandas as pd

# Load the IMDB dataset (adjust the path to where your copy is stored)
imdb_df = pd.read_csv('/content/drive/MyDrive/Master Class/IMDB Dataset.csv')

# Choose a smaller subset of the dataset for demonstration purposes
subset_size = 5
imdb_subset = imdb_df.head(subset_size)

# Placeholder input: random numbers standing in for embedded reviews, with shape
# (batch=subset_size, sequence length=10, embedding size=10). The review text itself is
# not used here; in a real model each token would be mapped to a learned embedding.
text_tensors = torch.rand((subset_size, 10, 10))

# Use the same tensor as query, key, and value. In a real Transformer, each is produced
# by its own learned linear projection of the input.
query = text_tensors
key = text_tensors
value = text_tensors

# Compute attention weights with scaled dot-product attention:
# softmax(Q K^T / sqrt(d_k)), where d_k is the size of each key vector
d_k = key.size(-1)
attention_scores = F.softmax(torch.bmm(query, key.transpose(1, 2)) / d_k ** 0.5, dim=-1)

# Compute the weighted sum using the attention scores and values
weighted_sum = torch.bmm(attention_scores, value)

# Print the result
print(weighted_sum)


tensor([[[0.5341, 0.5769, 0.6861, 0.7477, 0.5404, 0.3739, 0.5394, 0.5204,
          0.3726, 0.7516],
         [0.4455, 0.5178, 0.6766, 0.7877, 0.5551, 0.4618, 0.5949, 0.5512,
          0.3458, 0.7800],
         [0.6210, 0.6750, 0.6444, 0.7448, 0.5077, 0.3322, 0.5309, 0.5663,
          0.3179, 0.7416],
         [0.5772, 0.5678, 0.6786, 0.6870, 0.4948, 0.3471, 0.5189, 0.4879,
          0.3910, 0.6862],
         [0.5964, 0.6513, 0.6520, 0.7353, 0.5267, 0.3217, 0.5309, 0.5006,
          0.3443, 0.7222],
         [0.5157, 0.5976, 0.6542, 0.7731, 0.4915, 0.4169, 0.5883, 0.5433,
          0.3230, 0.7291],
         [0.6740, 0.7218, 0.5578, 0.7275, 0.5761, 0.2513, 0.4591, 0.4606,
          0.3518, 0.6907],
         [0.6107, 0.6337, 0.6357, 0.7171, 0.4793, 0.3154, 0.5130, 0.4887,
          0.3629, 0.6683],
         [0.6296, 0.6070, 0.6628, 0.6666, 0.4919, 0.2933, 0.4793, 0.4677,
          0.3995, 0.6661],
         [0.6378, 0.6804, 0.5877, 0.7156, 0.4859, 0.2871, 0.4814, 0.4609,
          0.3491,

### **6.3 BERT (Bidirectional Encoder Representations from Transformers):**
#### Theory:
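- BERT is a pre-trained Transformer encoder that builds each token's representation from both its left and right context (hence "bidirectional").
- It is pre-trained with two objectives: masked language modeling (predicting tokens hidden behind `[MASK]`) and next sentence prediction.
- It set state-of-the-art results on many NLP benchmarks when it was released in 2018, and is typically fine-tuned for a specific downstream task.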

#### Code Example:

In [None]:
from transformers import BertTokenizer, BertForMaskedLM

# Tokenize and mask input text
# Instantiate a BERT tokenizer for the 'bert-base-uncased' model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Input text to be tokenized and masked
text = "The cat is [MASK] on the mat."

# Encode the input text using the BERT tokenizer
# 'return_tensors' parameter specifies that PyTorch tensors should be returned
encoded_tokens = tokenizer.encode(text, return_tensors='pt')  # Use encode directly

# Load pre-trained BERT model for masked language modeling
# Instantiate a BERT model for masked language modeling (MLM)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Forward pass through the BERT model for masked language modeling
# 'return_dict=True' returns a dictionary containing various model outputs
output = model(encoded_tokens, return_dict=True)

# Retrieve the logits: one score per vocabulary entry for every token position, not only the masked one
predictions = output.logits

# Print the predicted logits
print(predictions)


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


tensor([[[ -6.6352,  -6.5930,  -6.5885,  ...,  -5.9774,  -5.6839,  -4.0493],
         [-14.5612, -14.3862, -14.4238,  ..., -12.1998, -11.9709, -11.0100],
         [ -8.0677,  -8.1387,  -8.1146,  ...,  -7.2061,  -6.4629,  -6.5417],
         ...,
         [ -7.3555,  -7.8470,  -8.0578,  ...,  -7.2294,  -6.7105,  -5.0182],
         [-10.6764, -10.2837, -10.5259,  ...,  -8.5979,  -9.6679,  -6.8525],
         [-10.5542, -10.5132, -10.5281,  ...,  -9.9582,  -9.2576,  -5.6333]]],
       grad_fn=<ViewBackward0>)
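
The logits tensor has shape (batch, sequence length, vocabulary size), so on its own it does not say which word BERT predicts. To get a word, take the row at the `[MASK]` position and pick the highest-scoring vocabulary entries. The sketch below shows that step with a hypothetical five-word vocabulary and made-up logits, so it runs without downloading the model.

```python
import math

# Hypothetical tokens and a hypothetical 5-word vocabulary (not BERT's real vocabulary).
tokens = ["[CLS]", "the", "cat", "is", "[MASK]", "on", "the", "mat", ".", "[SEP]"]
vocab = ["sitting", "sleeping", "purple", "lying", "the"]

# Made-up logits for the [MASK] position only: one score per vocabulary entry.
mask_logits = [4.0, 3.0, -2.0, 3.5, -1.0]

def softmax(xs):
    m = max(xs)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

mask_index = tokens.index("[MASK]")            # which position to read predictions from
probs = softmax(mask_logits)

# Rank vocabulary entries by probability and keep the top three.
top3 = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)[:3]
print(mask_index)                              # 4
print([word for word, _ in top3])              # ['sitting', 'lying', 'sleeping']
```

With the real model, the equivalent steps would be to locate the position where `encoded_tokens` equals `tokenizer.mask_token_id`, take the `argmax` of `predictions` at that position, and turn the resulting id back into text with `tokenizer.decode`.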
