<a href="https://colab.research.google.com/github/saikiran101/Gen-AI-Text-Preprocessing/blob/main/text_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stemming and Lemmatization

Text preprocessing

**Objective:**

Clean text data so it is uniform, lowercase, and ready for further analysis.

Lowercasing

- Goal: Convert all text to lowercase so that “LOVE” and “love” are not treated as different words.

In [None]:
import pandas as pd

# Sample sentences used throughout the preprocessing steps
reviews = [
    "Subhash Chandra Bose was a prominent Indian nationalist leader who fought for India's independence from British rule.",
    "He was born on January 23, 1897, in Cuttack, Odisha.",
    "Bose was known for his militant approach and formed the Indian National Army (INA) to overthrow British rule.",
    "He sought assistance from Axis powers during World War II to achieve his goals.",
    "His famous slogan Give me blood, and I will give you freedom inspired many Indians to join the freedom struggle.",
    "Bose's 😎 mysterious disappearance in 1945 remains a subject of intrigue and speculation",
]

df_reviews = pd.DataFrame(reviews, columns=["Reviews"])

# Lowercase every review so that case differences do not create distinct words
df_reviews["Reviews_lowercase"] = df_reviews["Reviews"].str.lower()
df_reviews

| | Reviews | Reviews_lowercase |
| --- | --- | --- |
| 0 | Subhash Chandra Bose was a prominent Indian na... | subhash chandra bose was a prominent indian na... |
| 1 | He was born on January 23, 1897, in Cuttack, O... | he was born on january 23, 1897, in cuttack, o... |
| 2 | Bose was known for his militant approach and f... | bose was known for his militant approach and f... |
| 3 | He sought assistance from Axis powers during W... | he sought assistance from axis powers during w... |
| 4 | His famous slogan Give me blood, and I will gi... | his famous slogan give me blood, and i will gi... |
| 5 | Bose's 😎 mysterious disappearance in 1945 rema... | bose's 😎 mysterious disappearance in 1945 rema... |


## Remove Punctuation and Emojis

Goal: Remove unnecessary characters such as punctuation and emojis to simplify analysis.

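In [None]:
import re

# Drop every character that is neither a word character (\w) nor whitespace (\s).
# This removes punctuation and emojis but keeps letters, digits, and underscores.
df_reviews['Reviews_punctuation_emojis'] = df_reviews["Reviews_lowercase"].apply(lambda x: re.sub(r'[^\w\s]', '', x))
df_reviews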

| | Reviews | Reviews_lowercase | Reviews_punctuation_emojis |
| --- | --- | --- | --- |
| 0 | Subhash Chandra Bose was a prominent Indian na... | subhash chandra bose was a prominent indian na... | subhash chandra bose was a prominent indian na... |
| 1 | He was born on January 23, 1897, in Cuttack, O... | he was born on january 23, 1897, in cuttack, o... | he was born on january 23 1897 in cuttack odisha |
| 2 | Bose was known for his militant approach and f... | bose was known for his militant approach and f... | bose was known for his militant approach and f... |
| 3 | He sought assistance from Axis powers during W... | he sought assistance from axis powers during w... | he sought assistance from axis powers during w... |
| 4 | His famous slogan Give me blood, and I will gi... | his famous slogan give me blood, and i will gi... | his famous slogan give me blood and i will giv... |
| 5 | Bose's 😎 mysterious disappearance in 1945 rema... | bose's 😎 mysterious disappearance in 1945 rema... | boses mysterious disappearance in 1945 remain... |


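## Remove Stopwords

Goal: Remove common words such as "he", "was", and "in" that add little meaning on their own.

In [None]:
import nltk
from nltk.corpus import stopwords

# Download the stopword lists; later runs reuse the cached copy
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Keep only the words that are not in the English stopword set
df_reviews['Review_NoStopwords'] = df_reviews['Reviews_punctuation_emojis'].apply(
    lambda x: ' '.join(word for word in x.split() if word not in stop_words)
)
df_reviews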



| | Reviews | Reviews_lowercase | Reviews_punctuation_emojis | Review_NoStopwords |
| --- | --- | --- | --- | --- |
| 0 | Subhash Chandra Bose was a prominent Indian na... | subhash chandra bose was a prominent indian na... | subhash chandra bose was a prominent indian na... | subhash chandra bose prominent indian national... |
| 1 | He was born on January 23, 1897, in Cuttack, O... | he was born on january 23, 1897, in cuttack, o... | he was born on january 23 1897 in cuttack odisha | born january 23 1897 cuttack odisha |
| 2 | Bose was known for his militant approach and f... | bose was known for his militant approach and f... | bose was known for his militant approach and f... | bose known militant approach formed indian nat... |
| 3 | He sought assistance from Axis powers during W... | he sought assistance from axis powers during w... | he sought assistance from axis powers during w... | sought assistance axis powers world war ii ach... |
| 4 | His famous slogan Give me blood, and I will gi... | his famous slogan give me blood, and i will gi... | his famous slogan give me blood and i will giv... | famous slogan give blood give freedom inspired... |
| 5 | Bose's 😎 mysterious disappearance in 1945 rema... | bose's 😎 mysterious disappearance in 1945 rema... | boses mysterious disappearance in 1945 remain... | boses mysterious disappearance 1945 remains su... |


## Tokenization

Break text into smaller parts:

1. Word tokenization
2. Subword tokenization<br>
   ![alt text](image.png)
3. Sentence tokenization

Tokenization is a crucial step in text processing and natural language processing (NLP) for several reasons:

- **Breaking Down Text**: Tokenization breaks down a large body of text into smaller units, such as words or sentences, making it easier to analyze and process.
- **Normalization**: It helps in normalizing the text by converting it into a standard format. For example, splitting contractions ("don't" to "do" and "not") or handling punctuation.
- **Feature Extraction**: In machine learning and NLP, features are often derived from tokens. Tokenization allows for the extraction of meaningful features from the text.
- **Efficiency**: Processing smaller units (tokens) is computationally more efficient than processing entire strings of text.
- **Context Understanding**: Tokenization helps in understanding the context by identifying individual words or phrases, which can then be used for further analysis like part-of-speech tagging, named entity recognition, etc.
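The three granularities listed earlier (word, subword, sentence) can be sketched with the standard library alone. This is a minimal illustration, not how NLTK or any production tokenizer works: the regular expressions are deliberately naive, and `subword_tokenize` and `toy_vocab` are hypothetical names invented for this example.

```python
import re

text = "Tokenization matters. It splits text into units!"

# Sentence tokenization: naive split after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])


def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into pieces found in vocab."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it appears in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical toy vocabulary; real subword vocabularies are learned from data
toy_vocab = {"token", "ization", "iz", "ation"}
subwords = subword_tokenize("tokenization", toy_vocab)

print(sentences)
print(words)
print(subwords)
```

Subword splitting is what lets a model handle a word it has never seen, as long as the pieces are in its vocabulary.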

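In [None]:
import nltk
from nltk.tokenize import word_tokenize

# Download the Punkt tokenizer data that word_tokenize relies on.
# Note: recent NLTK releases may ask for the 'punkt_tab' resource instead.
nltk.download('punkt')

text = "Hello, world! Tokenization is important for text processing."
tokens = word_tokenize(text)  # words and punctuation become separate tokens

print(tokens)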



['Hello', ',', 'world', '!', 'Tokenization', 'is', 'important', 'for', 'text', 'processing', '.']




## Next Step: Embedding

Represent words as numbers so computers can measure how similar they are:

- Turns words into vectors
- Similar words (e.g., "king" and "queen") have similar vectors

![image.png](attachment:image.png)

# Why do we need embeddings?

In the past, methods like one-hot encoding were used to represent words. This approach has limitations:

- **No semantic meaning**: One-hot vectors don't capture relationships between words.
- **High dimensionality**: For large vocabularies, these vectors become very sparse and memory intensive.

Embeddings address these issues by representing words as dense vectors, capturing semantic meaning and relationships between words. Similar words (e.g., "king" and "queen") have similar vectors, making it easier for computers to understand and process text data.
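The contrast between one-hot and dense vectors can be made concrete with cosine similarity. The sketch below is illustrative only: the dense vectors are hand-picked numbers, not output from a trained model, and `cosine` is a small helper written for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab = ["king", "queen", "apple"]

# One-hot: a single 1 at the word's index, 0 everywhere else
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense vectors: hand-picked values for illustration (assumption, not trained)
dense = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.9],
    "apple": [0.1, -0.7],
}

print("one-hot king vs queen:", cosine(one_hot["king"], one_hot["queen"]))
print("dense   king vs queen:", cosine(dense["king"], dense["queen"]))
print("dense   king vs apple:", cosine(dense["king"], dense["apple"]))
```

Any two distinct one-hot vectors have similarity 0, so they cannot express that "king" is closer to "queen" than to "apple". Dense vectors can.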

# How do embeddings work?
Embeddings represent words or phrases as vectors of real numbers in a continuous vector space. This representation captures semantic meaning, so similar words end up with similar vectors. Here's a high-level overview of how embeddings work:
- **Initialization**: Embeddings are typically initialized with random values or with pre-trained vectors (such as Word2Vec, GloVe, or FastText).
- **Training**: During training, the embeddings are adjusted based on the context in which words appear. Common methods include:
  - **Word2Vec**: Uses either the Continuous Bag of Words (CBOW) model, which predicts a word from its context, or the Skip-gram model, which predicts the context from a word.
  - **GloVe**: Uses global word-word co-occurrence statistics to create embeddings.
  - **FastText**: Extends Word2Vec by considering subword information, which helps with rare words and misspellings.
- **Vector Space**: After training, each word is represented as a point in a high-dimensional space. Words with similar meanings are close to each other in this space.
- **Usage**: These embeddings can be used in various NLP tasks, such as text classification, sentiment analysis, and machine translation, by feeding them into machine learning models.
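To make "the context in which words appear" concrete, here is a rough sketch of how Skip-gram style training pairs can be formed from a sentence. It only shows the pairing idea; `skipgram_pairs` is a hypothetical helper, and real implementations such as gensim add further steps (for example subsampling and negative sampling) that are not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for every word within the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["king", "man", "ruler"], window=1)
for target, context in pairs:
    print(target, "->", context)
```

During training, the model is nudged so that each target word's vector becomes good at predicting its context words, which is why words that share contexts drift toward similar vectors.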

# What can we do with embeddings?

Embeddings have a wide range of applications in NLP and machine learning, such as:

1. **Text Classification**: Representing text data for tasks like sentiment analysis and spam detection.
2. **Similarity Measurement**: Comparing vector representations of words or documents for clustering and recommendation systems.
3. **Machine Translation**: Representing words in different languages for translation models.
4. **Question Answering**: Finding relevant answers by representing questions and answers as vectors.
5. **Information Retrieval**: Enhancing search engines by improving query and document representations.
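As one small illustration of the similarity and retrieval use cases, documents can be ranked against a query by averaging word vectors and comparing with cosine similarity. This is a simple baseline sketch under stated assumptions: `toy_vectors` holds hand-picked numbers rather than trained embeddings, and `embed` and `cosine` are helpers written for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(text, vectors):
    """Average the vectors of the known words in text."""
    found = [vectors[w] for w in text.lower().split() if w in vectors]
    dim = len(next(iter(vectors.values())))
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


# Hand-picked vectors for illustration only (assumption, not trained)
toy_vectors = {
    "king": [0.9, 0.8],
    "queen": [0.85, 0.9],
    "ruler": [0.8, 0.7],
    "apple": [0.1, -0.7],
    "mango": [0.2, -0.8],
}

docs = ["king ruler", "apple mango"]
query_vec = embed("queen", toy_vectors)

# Highest cosine similarity first
ranked = sorted(docs, key=lambda d: cosine(query_vec, embed(d, toy_vectors)), reverse=True)
print(ranked)
```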


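In [None]:
from gensim.models import Word2Vec

# Tiny toy corpus: each inner list is one tokenized "sentence"
sentences = [
    ["king", "queen", "man", "woman"],
    ["apple", "banana", "mango", "orange"],
    ["king", "man", "ruler"],
    ["queen", "woman", "ruler"],
]

# vector_size=3 keeps the vectors small enough to read; sg=1 selects Skip-gram;
# min_count=1 keeps every word, even those that appear only once
model = Word2Vec(sentences, vector_size=3, window=2, min_count=1, sg=1)

print("Embedding of the word 'king':", model.wv["king"])
# With a corpus this small, the neighbors are mostly noise rather than real semantics
print("Most similar words to 'king':", model.wv.most_similar("king"))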

## What is the attention mechanism?
Attention assigns each word a weight: a score for how important it is in context.

- In "The cat, which was sleeping, jumped over the wall," words like "cat" and "jumped" get higher scores.
- Attention helps the model "zoom in" on the words that matter most for the task.

# Why do we need the attention mechanism?
Without attention, a model gives equal focus to every word in the sentence, which makes it hard to know which words are important.

- The attention mechanism solves this by mimicking how a human reads 😎

Example:
-   The chef, who was famous for delicious pasta, quickly prepared the dish for the guest.<br>
A human reader focuses only on the main point: "The chef prepared the dish."<br>
Details like "famous for delicious pasta" add some context but aren't essential for understanding the main action.<br>

# How does the attention mechanism work?
### Step 1: Assign Importance to Words

Each word in a sentence gets an importance score based on its meaning and its relation to the other words.<br>
"The cat, which was sleeping, jumped over the wall."

### Step 2: Calculate Attention Scores
- Scores are calculated from the relationships between words: words closely related in meaning or action get higher scores.

Example:
"Cat" and "jumped" get a high score because they are directly related.

### Step 3: Adjust Focus Based on Scores

- The model uses these scores to focus on high-scoring words, making sense of the sentence's main meaning.<br>
Example:
- "Cat," "jumped," and "wall" help the model understand the action.
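The three steps above can be sketched in a few lines. The raw scores here are made-up numbers chosen to match the example sentence (an assumption for illustration; in a real model they are computed from learned vectors). Softmax is the standard way to turn raw scores into weights that sum to 1.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["cat", "sleeping", "jumped", "wall"]

# Steps 1 and 2: hypothetical raw importance scores for each word
raw_scores = [2.0, 0.5, 2.5, 1.5]

# Step 3: normalize into attention weights and focus on the largest ones
weights = softmax(raw_scores)
ranked = sorted(zip(words, weights), key=lambda pair: pair[1], reverse=True)

for word, weight in ranked:
    print(f"{word}: {weight:.2f}")
```

Words with higher raw scores receive a larger share of the total weight, so "jumped" and "cat" dominate while "sleeping" contributes little.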


# Why do we need Transformers?
Imagine you're reading a story one word at a time, and you're not allowed to go back to previous words.<br>
You can only remember the word you're currently reading. Understanding the story would be almost impossible!<br>

### Before Transformers, it was hard for models to:
- Remember details that appeared earlier in a sentence or paragraph.
- Understand complex relationships between words in long sentences.
- Process text quickly.
### Transformers change this by letting models see the entire sentence or text at once, which means they can:
- Pick out the most important words immediately.
- Understand relationships, even if words are far apart in the sentence.
- Process text much faster, because they're not limited to going word by word.

- **Example:** Read a sentence like "The lion, tired after hunting all day, rested under a tree." A Transformer can connect "lion" with "rested" directly, even though several words separate them.


# What is a Transformer?

A Transformer is a type of model that reads the whole sentence at once, understanding<br>
which words are important and how they relate to each other.<br>

- It can see the whole story at a glance and find the key ideas immediately.

# How does it work?
**Step 1:** Representing Words with Embeddings and Positions
- Transformers start by turning each word in the sentence into a set of numbers (an embedding) that represents the meaning of the word.
- They also add a "position" number to each word to keep track of the order of the words in the sentence.<br>

**Example**: In the sentence "The dog chased the ball," each word is converted into a numeric representation:
- "The" -> [0.2, 0.1, 0.5] (example numbers for meaning)
- "dog" -> [0.9, 0.6, 0.1]
- "chased" -> [0.3, 0.8, 0.4]
- "ball" -> [0.7, 0.9, 0.2]
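A minimal sketch of Step 1, reusing the example numbers above. It assumes the sinusoidal position scheme (sine on even indices, cosine on odd indices), which is one common choice; other models learn their position vectors instead. `positional_encoding` is a helper written for this illustration.

```python
import math


def positional_encoding(position, dim):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    encoding = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


# Example embeddings copied from the text (keys lowercased)
embeddings = {
    "the": [0.2, 0.1, 0.5],
    "dog": [0.9, 0.6, 0.1],
    "chased": [0.3, 0.8, 0.4],
    "ball": [0.7, 0.9, 0.2],
}

tokens = "The dog chased the ball".lower().split()

# Add the position vector to each word's embedding, element by element
encoded = [
    [e + p for e, p in zip(embeddings[tok], positional_encoding(pos, 3))]
    for pos, tok in enumerate(tokens)
]

for tok, vec in zip(tokens, encoded):
    print(tok, [round(x, 3) for x in vec])
```

Both occurrences of "the" start from the same embedding, but after the position is added they end up with different vectors, which is how the model can tell word order apart.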

**Step 2:** Using Attention to Find Important Words

The Transformer then uses attention to determine which words in the sentence are the most important for understanding the meaning.

- Attention scores tell the Transformer which words relate most closely to each other.

**Example:** In the sentence:
- "The dog, tired after running, chased the ball," attention can link "dog" with "chased" even though "tired after running" sits between them.
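Step 2 can be sketched as scaled dot-product self-attention, where the scores come from dot products between word vectors. This is a simplified illustration: it uses the example embeddings directly as queries, keys, and values, whereas a real Transformer first passes them through learned projection matrices. `self_attention` is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product attention with queries = keys = values = vectors."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score every word against the current word, scaled by sqrt(dim)
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        all_weights.append(weights)
        # New representation: weighted average of all word vectors
        outputs.append([
            sum(w * value[c] for w, value in zip(weights, vectors))
            for c in range(dim)
        ])
    return outputs, all_weights


words = ["dog", "chased", "ball"]
vectors = [[0.9, 0.6, 0.1], [0.3, 0.8, 0.4], [0.7, 0.9, 0.2]]

outputs, weights = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 3) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence, and each output vector is a blend of all the words weighted by those scores.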

**Step 3:** Layers to Build Understanding Step by Step<br>

Transformers have layers, each one refining the model's understanding. Each layer is like a "filter"
that focuses on different aspects of the sentence.

- In the first layer, the Transformer looks for basic relationships, like connecting "dog" with "chased."
- In the next layer, it might understand that "chased the ball" is an action performed by the "dog."